Probability Definitions & Proofs

A not-so-comprehensive list of proofs and other things, because I’m quite bad at probability and always forget.

DISCLAIMER: These are written for Future & Past Ivy’s, so I cannot guarantee any of this is useful to you. If it is, then that’d be a very nice coincidence :)

Definitions

Expected Value

The long-run average outcome of a random event.

$$E[x] = \sum_x x \cdot p(x)$$

You can think of it like a weighted average, where each outcome is weighted by the chance that it occurs.

Variance

The expected value of the squared deviation from the mean.

$$\text{Var}[x] = E[(x - E[x])^2]$$

Standard Deviation

The square root of the variance.

$$\text{SD}[x] = \sqrt{\text{Var}[x]}$$
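As a quick sanity check of these three definitions, here’s a small Python sketch using an assumed example (a fair six-sided die):

```python
import numpy as np

# Assumed example: a fair six-sided die.
outcomes = np.array([1, 2, 3, 4, 5, 6])
probs = np.full(6, 1 / 6)

expected_value = np.sum(outcomes * probs)                     # E[x] = sum_x x * p(x)
variance = np.sum((outcomes - expected_value) ** 2 * probs)   # E[(x - E[x])^2]
std_dev = np.sqrt(variance)                                   # sqrt(Var[x])

print(expected_value, variance, std_dev)  # 3.5, ~2.9167, ~1.7078
```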

Bernoulli Distribution

The probability distribution of a single coin flip: heads with probability $f$ and tails with probability $1 - f$.

Binomial Distribution

Probability distribution of how many heads you’d get when flipping $N$ coins.
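Here’s a small Python sketch of both (the bias $f$ and number of flips $N$ are just assumed values), showing that a binomial sample is simply the sum of $N$ Bernoulli flips:

```python
import numpy as np

rng = np.random.default_rng(0)
f, N = 0.3, 10  # assumed coin bias and number of flips

# One Bernoulli(f) sample: a single biased coin flip (True = heads).
flip = rng.random() < f

# One Binomial(N, f) sample: the number of heads in N such flips.
heads = np.sum(rng.random(N) < f)

# Equivalently, numpy can sample the binomial directly.
heads_direct = rng.binomial(N, f)
print(flip, heads, heads_direct)
```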

Proofs

Linearity of Expectation

When finding $E[X + Y]$, this is equal to $E[X] + E[Y]$ even if $X$ and $Y$ are not independent.

$$\begin{aligned} E[X+Y] &= \sum_x \sum_y[(x+y) \cdot P(X=x, Y=y)] \\ &= \sum_x \sum_y[x \cdot P(X=x, Y=y)] + \sum_x \sum_y[y \cdot P(X=x, Y=y)] \\ &= \sum_x x \underbrace{\sum_y P(X=x, Y=y)}_{P(X=x)} + \sum_y y \underbrace{\sum_x P(X=x, Y=y)}_{P(Y=y)} \\ &= \sum_x x \cdot P(X=x) + \sum_y y \cdot P(Y=y) \\ &= E[X] + E[Y]. \end{aligned}$$

Note: Proof is taken from Brilliant. You can easily adapt this from discrete variables to continuous ones by swapping the sums for integrals.
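To see this concretely, here’s a small Python sketch (the joint table is made-up numbers, with $X$ and $Y$ deliberately dependent) that computes $E[X+Y]$ from the joint distribution and $E[X] + E[Y]$ from the marginals:

```python
import numpy as np

# Assumed example: a joint distribution P(X=x, Y=y) where X and Y are dependent.
# Rows index x in {0, 1}, columns index y in {0, 1}.
joint = np.array([[0.40, 0.10],
                  [0.05, 0.45]])
x_vals = np.array([0, 1])
y_vals = np.array([0, 1])

# E[X + Y] computed directly from the joint distribution.
e_sum = sum((x + y) * joint[i, j]
            for i, x in enumerate(x_vals)
            for j, y in enumerate(y_vals))

# E[X] and E[Y] computed from the marginals (summing the joint over the other variable).
e_x = np.sum(x_vals * joint.sum(axis=1))
e_y = np.sum(y_vals * joint.sum(axis=0))

print(e_sum, e_x + e_y)  # both 1.05, even though X and Y are not independent
```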

Binomial Distribution Mean & Variance

The binomial distribution is
$$P(x \mid f, N) = \binom{N}{x} f^x (1-f)^{N-x}$$

TODO: Finish proof here of why mean is $Nf$ and variance is $Nf(1-f)$
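In the meantime, here’s a rough sketch of where the mean comes from, assuming we can write the number of heads as a sum of $N$ independent Bernoulli indicators $x_i$ (each equal to 1 with probability $f$) and leaning on the linearity proof above:

$$E[x] = E\left[\sum_{i=1}^{N} x_i\right] = \sum_{i=1}^{N} E[x_i] = \sum_{i=1}^{N} \left(1 \cdot f + 0 \cdot (1-f)\right) = Nf$$

The variance should follow the same pattern: each flip has $\text{Var}[x_i] = f(1-f)$, and since the flips are independent the variances add, giving $Nf(1-f)$.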

Conditional Probability

$$P(A \mid B) = \frac{P(A, B)}{P(B)}$$

In English, the left-hand side is asking: what is the probability we’re in Pie A if we already know we’re in Pie B? For pictures of the “pies” I’m referencing, see the image a little below.1

Working outwards, we know that the correct subset of things we’re looking for is $P(A, B)$. The problem is that this isn’t scaled correctly: it assumes we have no prior information. If we know that we’re already inside Pie B, then the probability of guessing $A$ should be improved. You can think of the $\mid B$ as cutting the space we care about down to just Pie B.

Note: $P(A \mid B) \ge P(A, B)$, since we’re dividing $P(A, B)$ by $P(B) \le 1$. For example, if $A$ and $B$ are independent, then knowing you’re in Pie B has no effect on whether or not you’re in Pie A, so $P(A \mid B) = P(A) \ge P(A)P(B) = P(A, B)$.
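Here’s a tiny Python sketch of the definition, using a made-up joint table over two binary events $A$ and $B$:

```python
import numpy as np

# Assumed joint distribution over (A, B): rows are A in {False, True},
# columns are B in {False, True}.
joint = np.array([[0.50, 0.20],
                  [0.10, 0.20]])

p_a_and_b = joint[1, 1]              # P(A, B)
p_b = joint[:, 1].sum()              # P(B), summing over A
p_a_given_b = p_a_and_b / p_b        # P(A | B) = P(A, B) / P(B)

print(p_a_and_b, p_b, p_a_given_b)   # 0.2, 0.4, 0.5 — conditioning rescales by P(B)
```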

Rules

Probability Product Rule

If you just rearrange our equation from conditional probability, it’s pretty clear to see that $P(A, B) = P(A \mid B)P(B)$.

Probability Decomposition

If we have something like $P(A, B \mid C)$, then this can be reduced to $P(A \mid B, C)\,P(B \mid C)$. Let’s deduce why this is the case.

For the left-hand side, I think of it as asking “Given we know that we’re in Pie C, what are the odds that we’re in both Pie A and Pie B?”. Now it becomes quite clear why you can rephrase it as a two-step question:

  1. Given we’re in Pie C, what are the odds we’re in Pie B?
  2. Given we’re now in Pie B & C, what are the odds we’re in Pie A?

It’s kind of like drilling down into more and more specific questions. Rather than trying to break it down from our first function, I find it more intuitive to build up probabilities.
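Here’s a small Python sketch checking this numerically on an arbitrary joint distribution over three binary variables (the joint distribution itself is just randomly generated for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed example: a random joint distribution P(A, B, C), indexed as joint[a, b, c].
joint = rng.random((2, 2, 2))
joint /= joint.sum()

a, b, c = 1, 1, 1
p_c = joint[:, :, c].sum()          # P(C)
p_b_and_c = joint[:, b, c].sum()    # P(B, C)
p_a_b_c = joint[a, b, c]            # P(A, B, C)

lhs = p_a_b_c / p_c                                 # P(A, B | C)
rhs = (p_a_b_c / p_b_and_c) * (p_b_and_c / p_c)     # P(A | B, C) * P(B | C)
print(lhs, rhs, np.isclose(lhs, rhs))               # equal
```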

Probability Sum Rule

$$P(X) = \sum_Y P(X, Y)$$

If you want to find the probability of some event occurring, the sum rule says you can “invent” a new dimension and sum over that. One example: if I want to find the probability that someone is a boy, then I can sum over $P(\text{boy}, \text{happy}) + P(\text{boy}, \text{sad}) + \dots$

You can imagine that this works for any other dimension that you can think of and have the joint probabilities $P(X, Y)$ for.
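A quick sketch of marginalization in Python, reusing the boy/happy example with made-up numbers:

```python
import numpy as np

# Assumed joint distribution P(X, Y): rows index X (boy / girl),
# columns index Y (happy / sad).
joint = np.array([[0.30, 0.20],   # boy
                  [0.35, 0.15]])  # girl

# Sum rule: marginalize out Y by summing the joint over the Y dimension.
p_x = joint.sum(axis=1)
print(p_x)  # [0.5, 0.5] -> P(boy) = 0.5, P(girl) = 0.5
```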

Deriving Bayes’ Rule

$$P(H \mid E) = \frac{P(H)P(E \mid H)}{P(E)}$$

How can we derive Bayes’ Rule from our simpler “axioms” above?

From above, it’s just a definition now that $P(A, B) = P(A)P(B \mid A) = P(B)P(A \mid B)$. Let’s now rename things so that instead of $A$ and $B$, we’re looking at $E$, our evidence, and $H$, our hypothesis: $P(H)P(E \mid H) = P(E)P(H \mid E)$. What we care about normally is: given our evidence, what are the chances our hypothesis is true? If you just shuffle around terms, then we’ll get $P(H \mid E) = \frac{P(H)P(E \mid H)}{P(E)}$, which is Bayes’ Rule!
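Here’s a small numerical sketch of Bayes’ Rule; the scenario and the numbers are made up purely for illustration:

```python
# Assumed illustration: a test for a rare condition with made-up numbers.
p_h = 0.01              # P(H): prior probability the hypothesis is true
p_e_given_h = 0.95      # P(E | H): probability of the evidence if H is true
p_e_given_not_h = 0.05  # P(E | not H)

# P(E) via the sum rule: marginalize over whether H is true or not.
p_e = p_e_given_h * p_h + p_e_given_not_h * (1 - p_h)

# Bayes' Rule.
p_h_given_e = p_h * p_e_given_h / p_e
print(p_h_given_e)  # ~0.16 — the evidence helps, but the low prior still dominates
```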

Frequentist vs. Bayesian Statistics

For both Bayesian and Frequentist statistics, we’re trying to figure out parameters that are able to explain some dataset we get. Using the vocab above, that’d be $P(D \mid \Theta)$. There are two slightly different ways to approach this. If we want to take the straightforward approach, then we just do $\operatorname{argmax}_\Theta P(D \mid \Theta)$. If we assume that the data is modeled with a normal distribution, we can do gradient descent on the likelihood of our data given our parameters and get an answer.2

But what if we wanted to do something slightly different? We could instead do $\operatorname{argmax}_\Theta P(\Theta \mid D)$: given the dataset, what are the most likely parameters? Using Bayes’ Rule, which we derived above, we can turn this into:

$$\begin{aligned} \hat{\Theta} &= \operatorname{argmax}_\Theta P(\Theta \mid D) \\ &= \operatorname{argmax}_\Theta \frac{P(D \mid \Theta)P(\Theta)}{P(D)} \\ &= \operatorname{argmax}_\Theta P(D \mid \Theta)P(\Theta) \end{aligned}$$

We can drop $P(D)$ in the last step because it doesn’t depend on $\Theta$. This looks almost identical to when we were figuring out $\operatorname{argmax}_\Theta P(D \mid \Theta)$, except we now have an extra term: our prior! In this case, the prior is our assumption of what we think the parameter is likely to be.
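To make the difference concrete, here’s a small Python sketch comparing the two estimates for a biased coin. The data (3 heads out of 4 flips) and the Beta(5, 5) prior are assumed purely for illustration:

```python
import numpy as np
from math import comb

# Assumed data: 3 heads out of 4 flips, with an assumed Beta(5, 5) prior
# expressing a belief that the coin is probably close to fair.
heads, n = 3, 4
a, b = 5, 5

thetas = np.linspace(0.001, 0.999, 999)
likelihood = comb(n, heads) * thetas**heads * (1 - thetas)**(n - heads)  # P(D | theta)
prior = thetas**(a - 1) * (1 - thetas)**(b - 1)        # Beta(a, b) prior, unnormalized

theta_mle = thetas[np.argmax(likelihood)]              # frequentist: argmax P(D | theta)
theta_map = thetas[np.argmax(likelihood * prior)]      # Bayesian: argmax P(D | theta) P(theta)

print(theta_mle, theta_map)  # ~0.75 vs ~0.58 — the prior pulls the estimate toward fair
```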

At the end of the day, the main difference between the two schools of thought is whether or not we consider the prior. Frequentist is a clean slate without priors, whereas Bayesian comes in with ideas of what the parameters of a model should be and includes that in the calculation.

Sidenote: Why might we prefer Bayesian or Frequentist? If you come in with prior assumptions, this can prevent overfitting to a training set. It also lets you be more efficient with your learning, since you come in with some notion of what the answer should be. In the limit, these methods both provide correct estimates, but it’s in small sample sizes and everyday cases where you might prefer a Bayesian model over a Frequentist one. Of course, if your prior is wrong, then this could lead to poorer modeling.

Some more ramblings on Bayesian Statistics

Note that there is a level of subjectivity in Bayesian statistics. Specifically, how you decide which hypotheses are likely depends on the distributions you choose for your prior and your likelihood.

TODO: Should expound on this a lot more. Happy to chat with people and try to work out the specifics :)

Sources


  1. I’m well aware my probability skills are horrendous.↩︎

  2. We’re using argmax since we’re trying to maximize the likelihood of each data point. For a lot of machine learning, you’ll see argmin, and that’s just minimizing the negative likelihood. You’ll also definitely see negative log likelihood (NLL), which is taking the negative log of the likelihood. The log just allows for more numerical stability (no need to multiply very tiny probabilities) and summation instead of multiplication.↩︎