Master Probability Theory in One Article

This is the first article in a series on mastering a subject in one document. The purpose of the series is to introduce a whole topic as concisely and clearly as possible, for easy review.
Update record:
2025.01.05 Completed version 1.0 (organized some important concepts and conclusions from the Princeton textbook on probability theory)

Prerequisite Knowledge#

Sum of an arithmetic series: $S_n = a_1+a_2+\dots+a_n=\frac{n(a_1+a_n)}{2}$
Sum of a geometric series (for $r \neq 1$): $S_n = 1+r+r^2+r^3+\dots+r^n=\frac{1-r^{n+1}}{1-r}$

Permutations: $A_n^m = \frac{n!}{(n-m)!}$
Combinations: $C_n^m = \frac{n!}{m!(n-m)!}$
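
As a quick sanity check of the formulas above, here is a minimal Python sketch (the particular numbers are arbitrary, chosen only for illustration):

```python
import math

# Arithmetic series: 1 + 2 + ... + 100 = n*(a1 + an)/2
n, a1, an = 100, 1, 100
assert sum(range(1, 101)) == n * (a1 + an) // 2  # 5050

# Geometric series: 1 + r + r^2 + ... + r^n = (1 - r^(n+1)) / (1 - r), r != 1
r, n = 0.5, 10
lhs = sum(r**k for k in range(n + 1))
rhs = (1 - r**(n + 1)) / (1 - r)
assert abs(lhs - rhs) < 1e-12

# Permutations A_n^m = n!/(n-m)!  and combinations C_n^m = n!/(m!(n-m)!)
n, m = 5, 2
assert math.perm(n, m) == math.factorial(n) // math.factorial(n - m)  # 20
assert math.comb(n, m) == math.factorial(n) // (math.factorial(m) * math.factorial(n - m))  # 10
```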

Antiderivative:

  1. If $F'(x)=f(x)$, then F is called an antiderivative of f, or an (indefinite) integral of f.
  2. Antiderivatives are not unique; different antiderivatives of the same f differ by a constant.

Fundamental Theorem of Calculus: Let f be a piecewise continuous function and F any antiderivative of f. Then $\int_a^b f(x)\,dx=F(b)-F(a)$.
The area under the curve y=f(x) between x=a and x=b is equal to the value of the antiderivative of f at b minus the value of the antiderivative of f at a.
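
A small numerical illustration of the theorem, using $f(x)=x^2$ on [0, 1] with antiderivative $F(x)=x^3/3$ (purely an example choice):

```python
# Midpoint-rule approximation of the integral of f(x) = x^2 on [0, 1],
# compared against F(b) - F(a) with F(x) = x^3 / 3.
f = lambda x: x**2
F = lambda x: x**3 / 3

a, b, steps = 0.0, 1.0, 100_000
dx = (b - a) / steps
riemann = sum(f(a + (i + 0.5) * dx) for i in range(steps)) * dx

print(riemann, F(b) - F(a))  # both approximately 0.3333...
```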

Taylor series: If f is n times differentiable, then the n-th order Taylor polynomial of f at the point a is
$$T_n(x) := \sum_{k=0}^n \frac{f^{(k)}(a)}{k!}(x-a)^k$$
The Taylor series at the origin (a=0) is also known as the Maclaurin series.
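
For a concrete example, every derivative of $e^x$ at 0 equals 1, so its Maclaurin polynomial is $T_n(x)=\sum_{k=0}^n x^k/k!$. A short Python check of how the approximation improves with n (x = 1 is an arbitrary choice):

```python
import math

def maclaurin_exp(x, n):
    """n-th order Maclaurin polynomial of e^x: sum_{k=0}^{n} x^k / k!"""
    return sum(x**k / math.factorial(k) for k in range(n + 1))

x = 1.0
for n in (1, 2, 4, 8):
    print(n, maclaurin_exp(x, n), math.exp(x))
# higher n -> closer to e = 2.71828...
```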

Basic Probability Theorems#

Conditional probability: $Pr(A|B) = \frac{Pr(A \cap B)}{Pr(B)}$
Independence: If events A and B satisfy $Pr(A \cap B) = Pr(A) \cdot Pr(B)$, then A and B are independent.
Commutativity: $Pr(A\cap B) = Pr(B\cap A)$
Total probability formula: If $\{B_1, B_2, ...\}$ forms a partition of the sample space S (into at most countably many parts), then for any $A \subset S$ we have $Pr(A) = \sum_n Pr(A|B_n) \cdot Pr(B_n)$
Bayes' Theorem: Let $\{A_i\}_{i=1}^n$ be a partition of the sample space, with A one of its parts. Then $Pr(A|B) = \frac{Pr(B|A) \cdot Pr(A)}{\sum_{i=1}^n Pr(B|A_i) \cdot Pr(A_i)}$. (Starting from the definition of conditional probability, the numerator is obtained via commutativity plus conditional probability, and the denominator is the total probability formula.)
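
A classic worked example of Bayes' Theorem (the numbers below are made up for illustration): a disease has prevalence 1%, the test detects it with probability 99%, and it false-alarms on healthy people with probability 5%. The partition is {sick, healthy}.

```python
# Pr(sick | +) = Pr(+ | sick) Pr(sick) / [Pr(+ | sick) Pr(sick) + Pr(+ | healthy) Pr(healthy)]
p_sick = 0.01               # prior Pr(sick)
p_pos_given_sick = 0.99     # sensitivity
p_pos_given_healthy = 0.05  # false-positive rate

# Denominator is the total probability formula: Pr(+)
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)

p_sick_given_pos = p_pos_given_sick * p_sick / p_pos
print(p_sick_given_pos)  # ~0.167: a positive test is far from conclusive
```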

Random Variables#

Discrete Random Variables

  1. Probability density function (PDF): $f_X(x)=Prob(\omega \in \Omega: X(\omega)=x)$
  2. Cumulative distribution function (CDF): $F_X(x)=Prob(\omega \in \Omega: X(\omega) \leq x)$

Continuous Random Variables

  1. Let X be a random variable. If there exists a real-valued function $f_X$ such that $f_X$ is piecewise continuous, $f_X(x)\geq0$, $\int_{-\infty}^{+\infty}f_X(t)\,dt=1$, and $Prob(a \leq X \leq b)=\int_a^b f_X(t)\,dt$, then X is a continuous random variable and $f_X$ is the probability density function of X.
  2. Cumulative distribution function (CDF): $F_X(x)=Prob(X \leq x)=\int_{-\infty}^x f_X(t)\,dt$
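
To illustrate the PDF–CDF relation, a minimal sketch using the exponential density $f_X(t)=\frac{1}{\lambda}e^{-t/\lambda}$ (λ = 2 chosen arbitrarily); numerically integrating the PDF should match the closed-form CDF $1-e^{-x/\lambda}$:

```python
import math

lam = 2.0
pdf = lambda t: math.exp(-t / lam) / lam       # f_X(t) for t >= 0
cdf_closed = lambda x: 1 - math.exp(-x / lam)  # F_X(x) = 1 - e^{-x/lambda}

def cdf_numeric(x, steps=100_000):
    """F_X(x) = integral of f_X from 0 (the lower end of the support) to x."""
    dx = x / steps
    return sum(pdf((i + 0.5) * dx) for i in range(steps)) * dx

x = 3.0
print(cdf_numeric(x), cdf_closed(x))  # both approximately 0.7769
```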

Expected Value: Let X be a random variable defined on R, with probability density function $f_X$. The expected value of the function $g(X)$ is

$$E[g(X)] = \begin{cases} \int_{-\infty}^{+\infty}g(x)\,f_X(x)\,dx & \text{if } X \text{ is continuous} \\ \sum_n g(x_n)\, f_X(x_n) & \text{if } X \text{ is discrete} \end{cases}$$

If $g(x)=x^r$, then $E[X^r]$ is called the r-th moment of X, and $E[(X-E[X])^r]$ is called the r-th central moment of X.
(Why care about moments: just as knowing more Taylor coefficients gives a better approximation of a function, knowing more moments tells you more about the shape of the probability density function.)

  1. The mean (average, expected value, denoted $\mu$) of X is the first moment:
    $E[X]=\mu=\int_{-\infty}^{+\infty} x\, f_X(x)\,dx$
  2. The variance of X (denoted $\sigma_X^2$ or $Var(X)$) is the second central moment, i.e., the expected value of $g(X)=(X-\mu_X)^2$:
    $E[(X-E[X])^2]=E[X^2]-E[X]^2=\sigma_X^2=\int_{-\infty}^{+\infty} (x-\mu_X)^2\, f_X(x)\,dx$
  3. The standard deviation is the square root of the variance: $\sigma_X=\sqrt{\sigma_X^2}$
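
A small check of these definitions on a fair six-sided die (a discrete X with $f_X(k)=1/6$ for k = 1..6, chosen only as an example), including the identity $\sigma_X^2 = E[X^2]-E[X]^2$:

```python
# Discrete X: fair die, f_X(k) = 1/6 for k in {1,...,6}
values = range(1, 7)
p = 1 / 6

mean = sum(k * p for k in values)                        # E[X] = 3.5
second_moment = sum(k**2 * p for k in values)            # E[X^2]
var_central = sum((k - mean) ** 2 * p for k in values)   # E[(X - mu)^2]
var_shortcut = second_moment - mean**2                   # E[X^2] - E[X]^2

print(mean, var_central, var_shortcut, var_shortcut ** 0.5)
# 3.5  2.9167  2.9167  sigma ~ 1.708
```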

Let X, Y, Z be continuous random variables.
Joint probability density function: $Prob((X,Y,Z)\in S) = \iiint_S f_{X,Y,Z}(x,y,z)\,dx\,dy\,dz$
Marginal probability density function of X: $f_X(x) = \int_{y=-\infty}^{+\infty} \int_{z=-\infty}^{+\infty} f_{X,Y,Z}(x,y,z)\, dy\,dz$
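
The same marginalization works in the discrete case by summing instead of integrating. A minimal sketch with a made-up two-dimensional joint PMF stored as a NumPy array (rows index the values of X, columns the values of Y):

```python
import numpy as np

# Hypothetical joint PMF f_{X,Y}: rows = values of X, columns = values of Y
joint = np.array([
    [0.10, 0.20, 0.10],
    [0.05, 0.25, 0.30],
])
assert abs(joint.sum() - 1.0) < 1e-12

# Marginal of X: "integrate out" Y by summing over the Y axis
f_X = joint.sum(axis=1)  # [0.40, 0.60]
# Marginal of Y: sum over the X axis
f_Y = joint.sum(axis=0)  # [0.15, 0.45, 0.40]
print(f_X, f_Y)
```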

Properties of Expectation:

  1. The expectation of a sum equals the sum of expectations: $E[a_1g_1(X_1)+a_2g_2(X_2)]=a_1E[g_1(X_1)] + a_2E[g_2(X_2)]$
  2. Let X be a random variable with mean $\mu_X$ and variance $\sigma_X^2$. Then the random variable $Y=aX+b$ has mean $\mu_Y=a\mu_X+b$ and variance $\sigma_Y^2=a^2\sigma_X^2$
  3. Let X be a random variable; then $\sigma_X^2=E[X^2]-E[X]^2$
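
A quick Monte Carlo check of property 2 above, with arbitrary a = 3, b = 2 and X drawn from a normal distribution with $\mu_X = 1$, $\sigma_X^2 = 4$ (all values chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 3.0, 2.0
x = rng.normal(loc=1.0, scale=2.0, size=1_000_000)  # mu_X = 1, sigma_X^2 = 4
y = a * x + b

print(y.mean(), a * x.mean() + b)  # mu_Y     ~ a*mu_X + b = 5
print(y.var(), a**2 * x.var())     # sigma_Y^2 ~ a^2 * sigma_X^2 = 36
```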

Properties of Mean and Variance:

  1. If X and Y are independent random variables, then $E[XY]=E[X]E[Y]$, and $E[(X-\mu_X)(Y-\mu_Y)]=E[X-\mu_X]\,E[Y-\mu_Y]=0$
  2. Mean and variance of a sum of random variables: let $X_1,X_2,...,X_n$ be n random variables with means $\mu_{X_1},\mu_{X_2},...,\mu_{X_n}$ and variances $\sigma^2_{X_1},\sigma^2_{X_2},...,\sigma^2_{X_n}$.
    Let $X=X_1+X_2+...+X_n$; then the mean of X is $\mu_X=\mu_{X_1}+\mu_{X_2}+...+\mu_{X_n}$.
    When the random variables are independent, $\sigma_X^2=\sigma^2_{X_1}+\sigma^2_{X_2}+...+\sigma^2_{X_n}$.
  3. Covariance: $\sigma_{XY}=Cov(X,Y)=E[(X-\mu_X)(Y-\mu_Y)]=E[XY]-\mu_X\mu_Y$.
    The covariance of two independent random variables is 0, but a covariance of 0 does not imply independence (e.g., X distributed symmetrically about 0 and $Y=X^2$; see the sketch after this list).
    If $X=X_1+X_2+...+X_n$, then $Var(X)=\sum_{i=1}^n Var(X_i)+2\sum_{1\leq i < j \leq n} Cov(X_i, X_j)$.
  4. Correlation coefficient (essentially a normalization of covariance): $\rho=\frac{Cov(X,Y)}{\sigma_X \sigma_Y}$.
    Covariance and the correlation coefficient describe the linear relationship between two variables.
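
The counterexample in item 3 is easy to see numerically: with X standard normal (symmetric about 0) and $Y=X^2$, the sample covariance is essentially 0 even though Y is completely determined by X. A sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1_000_000)  # symmetric about 0
y = x**2                        # fully dependent on X

cov_xy = np.cov(x, y)[0, 1]        # sample covariance, ~0
corr_xy = np.corrcoef(x, y)[0, 1]  # sample correlation, ~0
print(cov_xy, corr_xy)             # both close to 0 despite the dependence
```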

Special Distributions#

| Name | Probability Density Function | Mean $\mu$ | Variance $\sigma^2$ | Remarks |
| --- | --- | --- | --- | --- |
| Bernoulli distribution $X \sim Bern(p)$ | $Prob(X=x)=\begin{cases} p & \text{if } x=1 \\ 1-p & \text{if } x=0 \end{cases}$ | $p$ | $p(1-p)$ | |
| Binomial distribution $X \sim Bin(n,p)$ | $Prob(X=k)=\begin{cases} C_n^k p^k(1-p)^{n-k} & \text{if } k \in \{0,1,\dots,n\} \\ 0 & \text{otherwise} \end{cases}$ | $np$ | $np(1-p)$ | Number of heads in n independent coin flips |
| Geometric distribution $X \sim Geom(p)$ | $Prob(X=n)=\begin{cases} p(1-p)^{n-1} & \text{if } n \in \{1,2,3,\dots\} \\ 0 & \text{otherwise} \end{cases}$ | $\frac{1}{p}$ | $\frac{1-p}{p^2}$ | Number of trials until the first success |
| Exponential distribution $X \sim Exp(\lambda)$ | $f_X(x)=\begin{cases} \frac{1}{\lambda}e^{-x/\lambda} & \text{if } x \geq 0 \\ 0 & \text{otherwise} \end{cases}$ | $\lambda$ | $\lambda^2$ | |
| Normal distribution $X \sim N(\mu, \sigma^2)$ | $f_X(x) = \frac{1}{\sqrt{2 \pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2 \sigma^2}}$ | $\mu$ | $\sigma^2$ | |
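
A quick empirical check of the table: draw samples from each distribution with NumPy and compare the sample mean and variance to the formulas. The parameter values are arbitrary; note that NumPy's exponential sampler takes the scale λ, matching the parameterization above, and its geometric sampler counts trials until the first success.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 1_000_000
n, p, lam, mu, sigma = 10, 0.3, 2.0, 1.0, 0.5

samples = {
    "Bernoulli":   (rng.binomial(1, p, N),     p,           p * (1 - p)),
    "Binomial":    (rng.binomial(n, p, N),     n * p,       n * p * (1 - p)),
    "Geometric":   (rng.geometric(p, N),       1 / p,       (1 - p) / p**2),
    "Exponential": (rng.exponential(lam, N),   lam,         lam**2),
    "Normal":      (rng.normal(mu, sigma, N),  mu,          sigma**2),
}
for name, (x, mean_th, var_th) in samples.items():
    print(f"{name:12s} mean {x.mean():.3f} vs {mean_th:.3f}   var {x.var():.3f} vs {var_th:.3f}")
```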

Cumulative Distribution Method for Generating Random Numbers (Inverse Transform Method): Let X be a random variable with probability density function $f_X$ and cumulative distribution function $F_X$. If Y is a random variable uniformly distributed over [0,1], then $F^{-1}_X(Y)$ has the same distribution as X.
(Refer to: Rendering and Sampling (1): Inverse Transform Sampling—Principles and Practical Applications - ZUIcat's Article - Zhihu)
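
A minimal sketch of inverse transform sampling for the exponential distribution: $F_X(x)=1-e^{-x/\lambda}$, so $F_X^{-1}(y)=-\lambda\ln(1-y)$. Drawing Y uniformly on [0,1) and applying $F_X^{-1}$ should reproduce the exponential mean λ and variance λ² (λ = 2 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(7)
lam = 2.0

y = rng.uniform(0.0, 1.0, size=1_000_000)  # Y ~ Uniform[0, 1)
x = -lam * np.log(1.0 - y)                 # X = F_X^{-1}(Y) for Exp(lambda)

print(x.mean(), x.var())  # ~lam and ~lam^2, matching Exp(lambda)
```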

Hypothesis Testing#

Null Hypothesis: Usually the opposite of the conclusion you want to prove. Assume the null hypothesis is correct and try to use data to refute it.
Alternative Hypothesis: The conclusion you want to prove.

Z-test

  1. Let X be a normally distributed random variable with known variance $\sigma^2$, and suppose (the null hypothesis) that its mean is $\mu$.
  2. Let $x_1, x_2, ..., x_n$ be n independent observations drawn from this distribution, and let $\bar x = \frac{x_1+x_2+...+x_n}{n}$ be the sample mean.
  3. The observed z-test statistic is $z=\frac{\bar x - \mu}{\sqrt{\sigma^2/n}}$, which under the null hypothesis follows a normal distribution with mean 0 and variance 1.
  4. From the probability of the z statistic deviating at least this far from 0 (the p-value): if $p < \alpha$ (the significance level), reject the null hypothesis. (p is the probability of observing data at least as extreme as the current sample under the premise that the null hypothesis is true; a sketch follows this list.)
  5. One-tailed vs. two-tailed test: the difference lies in whether you are testing that the parameter is greater than (or less than) a certain value, or that it simply differs from that value.
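
A minimal z-test sketch in plain Python. The data and the hypothesized mean are made up; the two-sided p-value uses the standard normal CDF expressed via `math.erfc`.

```python
import math

# Hypothetical setup: known sigma^2 = 4, null hypothesis mu = 10
sigma2, mu0 = 4.0, 10.0
data = [10.8, 11.2, 9.9, 10.7, 11.5, 10.3, 11.0, 10.9]  # made-up observations

n = len(data)
xbar = sum(data) / n
z = (xbar - mu0) / math.sqrt(sigma2 / n)

# Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z) / math.sqrt(2))

alpha = 0.05
print(z, p_value, "reject H0" if p_value < alpha else "fail to reject H0")
```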

T-test

  1. If nothing is known about the variance, the sample variance must be computed to estimate it:
    $s^2=\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar x)^2$
  2. Compared with the population variance formula, the denominator here is n-1. (With only one sample, the variance cannot be estimated at all.)
  3. T-test statistic: $t=\frac{\bar x - \mu}{\sqrt{s^2/n}} \sim t_{n-1}$, which follows a t-distribution with n-1 degrees of freedom. (Correspondingly, the p-value must be computed from the t-distribution; the more degrees of freedom, the closer it gets to the normal distribution. A sketch follows this list.)
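
The same made-up data can be run through a one-sample t-test; a sketch assuming SciPy is available (`scipy.stats.ttest_1samp` returns the statistic and the two-sided p-value from the $t_{n-1}$ distribution), with the statistic also computed by hand for comparison:

```python
import math
from scipy import stats

mu0 = 10.0
data = [10.8, 11.2, 9.9, 10.7, 11.5, 10.3, 11.0, 10.9]  # same made-up observations

n = len(data)
xbar = sum(data) / n
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)  # sample variance, denominator n-1
t_manual = (xbar - mu0) / math.sqrt(s2 / n)

result = stats.ttest_1samp(data, popmean=mu0)  # two-sided by default
print(t_manual, result.statistic, result.pvalue)
```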