A not-so-comprehensive list of proofs and other things, because I’m quite bad at probability and always forget.
DISCLAIMER: These are written for Future & Past Ivys, so I cannot guarantee any of this is useful to you. If it is, then that’d be a very nice coincidence :)
Definitions
Expected Value
The long-run average outcome of a random event.
You can think of it like a weighted average, where each outcome is weighted by the chance that it occurs.
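Written out for a discrete random variable $X$, that weighted average is:
\begin{eqnarray}
E[X] &=& \sum_x x \, P(X = x) \nonumber
\end{eqnarray}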
Variance
The expected value of the squared deviation from the mean.
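In symbols, with mean $\mu = E[X]$:
\begin{eqnarray}
\mathrm{Var}(X) &=& E[(X - \mu)^2] = E[X^2] - E[X]^2 \nonumber
\end{eqnarray}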
Standard Deviation
The square root of the variance.
Bernoulli Distribution
Probability distribution if you just flipped a single (possibly biased) coin: you get $1$ with probability $p$ and $0$ with probability $1 - p$.
Binomial Distribution
Probability distribution of how many heads you’d get when flipping $n$ independent coins, each with heads probability $p$.
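In symbols, with the same $p$ and $n$, the Bernoulli and Binomial probability mass functions are:
\begin{eqnarray}
P(X = 1) &=& p, \qquad P(X = 0) = 1 - p \nonumber \\
P(X = k) &=& \binom{n}{k} p^k (1 - p)^{n - k} \nonumber
\end{eqnarray}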
Proofs
Linearity of Expectation
When finding $E[X + Y]$, this is equal to $E[X] + E[Y]$ even if $X$ and $Y$ are not independent.
Note: Proof is taken from Brilliant. You can easily adapt this from discrete variables to continuous by doing integration.
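Sketching the discrete case (following the Brilliant proof mentioned above): expand the sum, then marginalize out the other variable. No independence is needed anywhere.
\begin{eqnarray}
E[X + Y] &=& \sum_x \sum_y (x + y) P(X = x, Y = y) \nonumber \\
&=& \sum_x \sum_y x P(X = x, Y = y) + \sum_x \sum_y y P(X = x, Y = y) \nonumber \\
&=& \sum_x x P(X = x) + \sum_y y P(Y = y) \nonumber \\
&=& E[X] + E[Y] \nonumber
\end{eqnarray}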
Binomial Distribution Mean & Variance
The binomial distribution of $n$ coin flips, each landing heads with probability $p$: $X \sim \mathrm{Binomial}(n, p)$.
TODO: Finish proof here of why the mean is $np$ and the variance is $np(1 - p)$
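One way to finish it: write $X = \sum_{i=1}^n X_i$, a sum of $n$ independent $\mathrm{Bernoulli}(p)$ flips. Linearity of expectation gives the mean, and independence lets the variances add:
\begin{eqnarray}
E[X] &=& \sum_{i=1}^n E[X_i] = np \nonumber \\
\mathrm{Var}(X_i) &=& E[X_i^2] - E[X_i]^2 = p - p^2 = p(1 - p) \nonumber \\
\mathrm{Var}(X) &=& \sum_{i=1}^n \mathrm{Var}(X_i) = np(1 - p) \nonumber
\end{eqnarray}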
Conditional Probability
$P(A \mid B) = \frac{P(A \cap B)}{P(B)}$. In English, the first term is asking: what is the probability we’re in Pie A if we already know we’re in Pie B? For pictures of the “pies” I’m referencing, see the image a little below.1
Working outwards, we know that the correct subset of things we’re looking for is $A \cap B$. The problem is that $P(A \cap B)$ on its own isn’t scaled correctly: it assumes we have no prior information. If we know that we’re already inside Pie B, then the probability of guessing right should be improved. You can think of dividing by $P(B)$ as cutting the space we care about down to just Pie B.
Note: if $A$ and $B$ are independent, then $P(A \mid B) = P(A)$, because knowing you’re in Pie B has no effect on whether or not you’re in Pie A.
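To make this concrete, here’s a quick sanity check on a fair die (a minimal sketch; the events are my own toy example):

```python
from fractions import Fraction

# Sample space: one fair six-sided die.
outcomes = [1, 2, 3, 4, 5, 6]
p = Fraction(1, 6)  # each outcome is equally likely

A = {2, 4, 6}  # "Pie A": the roll is even
B = {4, 5, 6}  # "Pie B": the roll is greater than 3

p_B = sum(p for o in outcomes if o in B)
p_A_and_B = sum(p for o in outcomes if o in A and o in B)

# P(A | B) = P(A ∩ B) / P(B): rescale to the world where B happened.
p_A_given_B = p_A_and_B / p_B
print(p_A_given_B)  # 2/3: of the rolls {4, 5, 6}, two are even
```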
Rules
Probability Product Rule
If you just rearrange our equation from conditional probability, it’s pretty clear to see that $P(A \cap B) = P(A \mid B)\,P(B)$.
Probability Decomposition
If we have something like $P(A, B \mid C)$, then this can be reduced to $P(A \mid B, C)\,P(B \mid C)$. Let’s deduce why this is the case.
For the first expression, I think of it as asking “Given we know that we’re in Pie C, what are the odds that we’re in both Pie A and Pie B”. Now it becomes quite clear why you can rephrase it as a two-step question:
- Given we’re in Pie C, what are the odds we’re in Pie B?
- Given we’re now in Pie B & C, what are the odds we’re in Pie A?
It’s kind of like drilling down into more and more specific questions. Rather than trying to break it down from our first expression, I find it more intuitive to build up probabilities.
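Building up the whole joint distribution this way is exactly the chain rule:
\begin{eqnarray}
P(A, B, C) &=& P(A \mid B, C)\,P(B \mid C)\,P(C) \nonumber
\end{eqnarray}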
Probability Sum Rule
If you want to find the probability of some event occurring, the sum rule says you can “invent” a new dimension and sum over that. One example: if I want to find the probability that someone is a boy, then I can sum over another attribute like age, $P(boy) = \sum_{age} P(boy, age)$.
You can imagine that this works for any other dimension that you can think of, as long as you have the joint probabilities for it.
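Here’s a tiny sketch of that marginalization (the joint probabilities below are made up for illustration):

```python
# Made-up joint distribution P(sex, age_group); entries sum to 1.
joint = {
    ("boy", "child"): 0.20,
    ("boy", "adult"): 0.31,
    ("girl", "child"): 0.22,
    ("girl", "adult"): 0.27,
}

# Sum rule: P(boy) = sum over the invented age dimension of P(boy, age).
p_boy = sum(p for (sex, _age), p in joint.items() if sex == "boy")
print(p_boy)  # 0.51
```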
Deriving Bayes’ Rule
How can we derive Bayes’ Rule from our simpler “axioms” above?
From above, it’s just a definition now that $P(A \cap B) = P(A \mid B)P(B) = P(B \mid A)P(A)$. Let’s now rename things so instead of $A$ and $B$, we’re looking at $E$ - our evidence and $H$ - our hypothesis: $P(E \mid H)P(H) = P(H \mid E)P(E)$. What we care about normally is: given our evidence, what are the chances our hypothesis is true? If you just shuffle around terms, then we’ll get $P(H \mid E) = \frac{P(E \mid H)P(H)}{P(E)}$, which is Bayes’ Rule!
Frequentist vs. Bayesian Statistics
For both Bayesian and Frequentist statistics, we’re trying to figure out parameters $\Theta$ that are able to explain some dataset $D$ we get. Using the vocab above, that’d be $D$ as our evidence and $\Theta$ as our hypothesis. There are two slightly different ways to approach this. If we want to take the straightforward approach, then we just do $\hat{\Theta} = argmax_\Theta P(D \mid \Theta)$. If we assume that the data is modeled with a normal distribution, we can do gradient descent on the likelihood of our data given our parameters and get an answer.2
But what if we wanted to do something slightly different? We could instead do $argmax_\Theta P(\Theta \mid D)$. Given the dataset, what are the most likely parameters? Using Bayes’ Rule, which we derived above, we can turn this into:
\begin{eqnarray}
\hat{\Theta} &=& argmax_\Theta P(\Theta \mid D) \nonumber \\
&=& argmax_\Theta \frac{P(D \mid \Theta)P(\Theta)}{P(D)} \nonumber \\
&=& argmax_\Theta P(D \mid \Theta)P(\Theta) \nonumber
\end{eqnarray}
(We can drop $P(D)$ in the last step since it doesn’t depend on $\Theta$.) This looks almost identical to when we were figuring out $argmax_\Theta P(D \mid \Theta)$, except we have an extra term now: our prior! In this case, the prior is our assumption of what we think the parameter is likely going to be.
At the end of the day, the main difference between the two schools of thought is whether or not we consider the prior. Frequentist is a clean slate without priors, whereas Bayesian comes in with ideas of what the parameter should be in a model and includes that in the calculation.
Sidenote: Why might we prefer Bayesian or Frequentist? If you come in with prior assumptions, this can prevent overfitting to a training set. It also lets you be more efficient with your learning since you come in with some notion of what the answer should be. In the limit, both methods provide correct estimates, but in small sample sizes and everyday cases you might prefer a Bayesian model over a Frequentist one. Of course, if your prior is wrong, then this could lead to poorer modeling.
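As a concrete sketch of the difference, here’s a toy coin-flip example (the flip counts and the Beta prior are assumptions I picked for illustration):

```python
# Estimate a coin's heads probability theta from a few flips.
heads, flips = 7, 10

# Frequentist (maximum likelihood): argmax_theta P(D | theta).
# For a binomial likelihood this is just the observed frequency.
theta_mle = heads / flips  # 0.7

# Bayesian (MAP): argmax_theta P(D | theta) * P(theta).
# Assume a Beta(a, b) prior that believes the coin is roughly fair.
a, b = 5, 5
# The posterior is Beta(heads + a, tails + b); this is its mode.
theta_map = (heads + a - 1) / (flips + a + b - 2)  # ~0.611

print(theta_mle, theta_map)  # the prior pulls the estimate toward 0.5
```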
Some more ramblings on Bayesian Statistics
Note that there is a level of subjectivity in Bayesian statistics. Specifically, how you decide which hypotheses are likely depends on the distributions you choose for your prior and your likelihood.
TODO: Should expound on this a lot more. Happy to chat with people and try to work out the specifics :)
Sources
- Linearity of Expectation | Brilliant Math & Science Wiki
- Information Theory, Inference, and Learning Algorithms
I’m well aware my probability skills are horrendous.↩︎
We’re using argmax since we’re trying to maximize the likelihood of each data point. For a lot of machine learning, you’ll see argmin, and that’s just minimizing the negative likelihood. You’ll also definitely see negative log likelihood (NLL), which is taking the negative log of the likelihood. The log just allows for more numerical stability (no need for very tiny probabilities) and summation instead of multiplication.↩︎