Master Probability Theory in One Article

This is the first article in a series on mastering a subject in one document. The purpose of the series is to introduce a whole topic as concisely and clearly as possible, for easy review.
Update record:
2025.01.05 Completed version 1.0 (organized some important concepts and conclusions from the Princeton textbook on probability theory)

Prerequisite Knowledge#

Sum of an arithmetic series: $S_n = a_1+a_2+\dots+a_n=\frac{n(a_1+a_n)}{2}$
Sum of a geometric series (for $r \neq 1$): $S_n = 1+r+r^2+r^3+\dots+r^n=\frac{1-r^{n+1}}{1-r}$

Permutations: $A_n^m = \frac{n!}{(n-m)!}$
Combinations: $C_n^m = \frac{n!}{m!(n-m)!}$
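
As a quick sanity check of the formulas above, here is a minimal Python sketch (the particular numbers are arbitrary, chosen only for illustration):

```python
import math

# Arithmetic series: 1 + 2 + ... + 100 = n*(a1 + an)/2
n, a1, an = 100, 1, 100
assert sum(range(1, 101)) == n * (a1 + an) // 2  # 5050

# Geometric series: 1 + r + r^2 + ... + r^n = (1 - r^(n+1)) / (1 - r), r != 1
r, n = 0.5, 10
lhs = sum(r**k for k in range(n + 1))
rhs = (1 - r**(n + 1)) / (1 - r)
assert abs(lhs - rhs) < 1e-12

# Permutations A_n^m = n!/(n-m)!  and combinations C_n^m = n!/(m!(n-m)!)
n, m = 5, 2
assert math.perm(n, m) == math.factorial(n) // math.factorial(n - m)  # 20
assert math.comb(n, m) == math.factorial(n) // (math.factorial(m) * math.factorial(n - m))  # 10
```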

Antiderivative:

  1. If $F'(x)=f(x)$, then F is called an antiderivative of f, or an (indefinite) integral of f.
  2. Antiderivatives are not unique; different antiderivatives of the same f differ by a constant.

Fundamental Theorem of Calculus: Let f be a piecewise continuous function and F any antiderivative of f. Then $\int_a^b f(x)\,dx=F(b)-F(a)$.
The area under the curve y=f(x) between x=a and x=b is equal to the value of the antiderivative of f at b minus the value of the antiderivative of f at a.
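
A small numerical illustration of the theorem, using $f(x)=x^2$ on [0, 1] with antiderivative $F(x)=x^3/3$ (purely an example choice):

```python
# Midpoint-rule approximation of the integral of f(x) = x^2 on [0, 1],
# compared against F(b) - F(a) with F(x) = x^3 / 3.
f = lambda x: x**2
F = lambda x: x**3 / 3

a, b, steps = 0.0, 1.0, 100_000
dx = (b - a) / steps
riemann = sum(f(a + (i + 0.5) * dx) for i in range(steps)) * dx

print(riemann, F(b) - F(a))  # both approximately 0.3333...
```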

Taylor series: If f is n times differentiable, then the n-th order Taylor polynomial of f at the point a is
$$T_n(x) := \sum_{k=0}^n \frac{f^{(k)}(a)}{k!}(x-a)^k$$
The Taylor series at the origin (a=0) is also known as the Maclaurin series.
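
For a concrete example, every derivative of $e^x$ at 0 equals 1, so its Maclaurin polynomial is $T_n(x)=\sum_{k=0}^n x^k/k!$. A short Python check of how the approximation improves with n (x = 1 is an arbitrary choice):

```python
import math

def maclaurin_exp(x, n):
    """n-th order Maclaurin polynomial of e^x: sum_{k=0}^{n} x^k / k!"""
    return sum(x**k / math.factorial(k) for k in range(n + 1))

x = 1.0
for n in (1, 2, 4, 8):
    print(n, maclaurin_exp(x, n), math.exp(x))
# higher n -> closer to e = 2.71828...
```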

Basic Probability Theorems#

Conditional probability: $Pr(A|B) = \frac{Pr(A \cap B)}{Pr(B)}$
Independence: If events A and B satisfy $Pr(A \cap B) = Pr(A) \cdot Pr(B)$, then A and B are independent.
Commutativity: $Pr(A\cap B) = Pr(B\cap A)$
Total probability formula: If $\{B_1, B_2, ...\}$ forms a partition of the sample space S (into at most countably many parts), then for any $A \subset S$ we have $Pr(A) = \sum_n Pr(A|B_n) \cdot Pr(B_n)$
Bayes' Theorem: Let $\{A_i\}_{i=1}^n$ be a partition of the sample space, with A one of its parts. Then $Pr(A|B) = \frac{Pr(B|A) \cdot Pr(A)}{\sum_{i=1}^n Pr(B|A_i) \cdot Pr(A_i)}$. (Starting from the definition of conditional probability, the numerator is obtained via commutativity plus conditional probability, and the denominator is the total probability formula.)
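
A classic worked example of Bayes' Theorem (the numbers below are made up for illustration): a disease has prevalence 1%, the test detects it with probability 99%, and it false-alarms on healthy people with probability 5%. The partition is {sick, healthy}.

```python
# Pr(sick | +) = Pr(+ | sick) Pr(sick) / [Pr(+ | sick) Pr(sick) + Pr(+ | healthy) Pr(healthy)]
p_sick = 0.01               # prior Pr(sick)
p_pos_given_sick = 0.99     # sensitivity
p_pos_given_healthy = 0.05  # false-positive rate

# Denominator is the total probability formula: Pr(+)
p_pos = p_pos_given_sick * p_sick + p_pos_given_healthy * (1 - p_sick)

p_sick_given_pos = p_pos_given_sick * p_sick / p_pos
print(p_sick_given_pos)  # ~0.167: a positive test is far from conclusive
```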

Random Variables#

Discrete Random Variables

  1. Probability density function (PDF): $f_X(x)=Prob(\omega \in \Omega: X(\omega)=x)$
  2. Cumulative distribution function (CDF): $F_X(x)=Prob(\omega \in \Omega: X(\omega) \leq x)$

Continuous Random Variables

  1. Let X be a random variable. If there exists a real-valued function $f_X$ such that $f_X$ is piecewise continuous, $f_X(x)\geq0$, $\int_{-\infty}^{+\infty}f_X(t)\,dt=1$, and $Prob(a \leq X \leq b)=\int_a^b f_X(t)\,dt$, then X is a continuous random variable and $f_X$ is the probability density function of X.
  2. Cumulative distribution function (CDF): $F_X(x)=Prob(X \leq x)=\int_{-\infty}^x f_X(t)\,dt$
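
To illustrate the PDF–CDF relation, a minimal sketch using the exponential density $f_X(t)=\frac{1}{\lambda}e^{-t/\lambda}$ (λ = 2 chosen arbitrarily); numerically integrating the PDF should match the closed-form CDF $1-e^{-x/\lambda}$:

```python
import math

lam = 2.0
pdf = lambda t: math.exp(-t / lam) / lam       # f_X(t) for t >= 0
cdf_closed = lambda x: 1 - math.exp(-x / lam)  # F_X(x) = 1 - e^{-x/lambda}

def cdf_numeric(x, steps=100_000):
    """F_X(x) = integral of f_X from 0 (the lower end of the support) to x."""
    dx = x / steps
    return sum(pdf((i + 0.5) * dx) for i in range(steps)) * dx

x = 3.0
print(cdf_numeric(x), cdf_closed(x))  # both approximately 0.7769
```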

Expected Value: Let X be a random variable defined on R, with probability density function $f_X$. The expected value of the function $g(X)$ is

$$E[g(X)] = \begin{cases} \int_{-\infty}^{+\infty}g(x)\,f_X(x)\,dx & \text{if } X \text{ is continuous} \\ \sum_n g(x_n)\, f_X(x_n) & \text{if } X \text{ is discrete} \end{cases}$$

If $g(x)=x^r$, then $E[X^r]$ is called the r-th moment of X, and $E[(X-E[X])^r]$ is called the r-th central moment of X.
(Why care about moments: just as knowing more Taylor coefficients gives a better approximation of a function, knowing more moments tells you more about the shape of the probability density function.)

  1. The mean (average, expected value, denoted $\mu$) of X is the first moment:
    $E[X]=\mu=\int_{-\infty}^{+\infty} x\, f_X(x)\,dx$
  2. The variance of X (denoted $\sigma_X^2$ or $Var(X)$) is the second central moment, i.e., the expected value of $g(X)=(X-\mu_X)^2$:
    $E[(X-E[X])^2]=E[X^2]-E[X]^2=\sigma_X^2=\int_{-\infty}^{+\infty} (x-\mu_X)^2\, f_X(x)\,dx$
  3. The standard deviation is the square root of the variance: $\sigma_X=\sqrt{\sigma_X^2}$
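
A small check of these definitions on a fair six-sided die (a discrete X with $f_X(k)=1/6$ for k = 1..6, chosen only as an example), including the identity $\sigma_X^2 = E[X^2]-E[X]^2$:

```python
# Discrete X: fair die, f_X(k) = 1/6 for k in {1,...,6}
values = range(1, 7)
p = 1 / 6

mean = sum(k * p for k in values)                        # E[X] = 3.5
second_moment = sum(k**2 * p for k in values)            # E[X^2]
var_central = sum((k - mean) ** 2 * p for k in values)   # E[(X - mu)^2]
var_shortcut = second_moment - mean**2                   # E[X^2] - E[X]^2

print(mean, var_central, var_shortcut, var_shortcut ** 0.5)
# 3.5  2.9167  2.9167  sigma ~ 1.708
```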

Let X, Y, Z be continuous random variables.
Joint probability density function: $Prob((X,Y,Z)\in S) = \iiint_S f_{X,Y,Z}(x,y,z)\,dx\,dy\,dz$
Marginal probability density function of X: $f_X(x) = \int_{y=-\infty}^{+\infty} \int_{z=-\infty}^{+\infty} f_{X,Y,Z}(x,y,z)\, dy\,dz$
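
The same marginalization works in the discrete case by summing instead of integrating. A minimal sketch with a made-up two-dimensional joint PMF stored as a NumPy array (rows index the values of X, columns the values of Y):

```python
import numpy as np

# Hypothetical joint PMF f_{X,Y}: rows = values of X, columns = values of Y
joint = np.array([
    [0.10, 0.20, 0.10],
    [0.05, 0.25, 0.30],
])
assert abs(joint.sum() - 1.0) < 1e-12

# Marginal of X: "integrate out" Y by summing over the Y axis
f_X = joint.sum(axis=1)  # [0.40, 0.60]
# Marginal of Y: sum over the X axis
f_Y = joint.sum(axis=0)  # [0.15, 0.45, 0.40]
print(f_X, f_Y)
```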

Properties of Expectation:

  1. The expectation of a sum equals the sum of expectations: $E[a_1g_1(X_1)+a_2g_2(X_2)]=a_1E[g_1(X_1)] + a_2E[g_2(X_2)]$
  2. Let X be a random variable with mean $\mu_X$ and variance $\sigma_X^2$. Then the random variable $Y=aX+b$ has mean $\mu_Y=a\mu_X+b$ and variance $\sigma_Y^2=a^2\sigma_X^2$
  3. Let X be a random variable; then $\sigma_X^2=E[X^2]-E[X]^2$
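
A quick Monte Carlo check of property 2 above, with arbitrary a = 3, b = 2 and X drawn from a normal distribution with $\mu_X = 1$, $\sigma_X^2 = 4$ (all values chosen only for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)
a, b = 3.0, 2.0
x = rng.normal(loc=1.0, scale=2.0, size=1_000_000)  # mu_X = 1, sigma_X^2 = 4
y = a * x + b

print(y.mean(), a * x.mean() + b)  # mu_Y     ~ a*mu_X + b = 5
print(y.var(), a**2 * x.var())     # sigma_Y^2 ~ a^2 * sigma_X^2 = 36
```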

Properties of Mean and Variance:

  1. If X and Y are independent random variables, then $E[XY]=E[X]E[Y]$, and $E[(X-\mu_X)(Y-\mu_Y)]=E[X-\mu_X]\,E[Y-\mu_Y]=0$
  2. Mean and variance of a sum of random variables: let $X_1,X_2,...,X_n$ be n random variables with means $\mu_{X_1},\mu_{X_2},...,\mu_{X_n}$ and variances $\sigma^2_{X_1},\sigma^2_{X_2},...,\sigma^2_{X_n}$.
    Let $X=X_1+X_2+...+X_n$; then the mean of X is $\mu_X=\mu_{X_1}+\mu_{X_2}+...+\mu_{X_n}$.
    When the random variables are independent, $\sigma_X^2=\sigma^2_{X_1}+\sigma^2_{X_2}+...+\sigma^2_{X_n}$.
  3. Covariance: $\sigma_{XY}=Cov(X,Y)=E[(X-\mu_X)(Y-\mu_Y)]=E[XY]-\mu_X\mu_Y$.
    The covariance of two independent random variables is 0, but a covariance of 0 does not imply independence (e.g., X distributed symmetrically about 0 and $Y=X^2$; see the sketch after this list).
    If $X=X_1+X_2+...+X_n$, then $Var(X)=\sum_{i=1}^n Var(X_i)+2\sum_{1\leq i < j \leq n} Cov(X_i, X_j)$.
  4. Correlation coefficient (essentially a normalization of covariance): $\rho=\frac{Cov(X,Y)}{\sigma_X \sigma_Y}$.
    Covariance and the correlation coefficient describe the linear relationship between two variables.
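
The counterexample in item 3 is easy to see numerically: with X standard normal (symmetric about 0) and $Y=X^2$, the sample covariance is essentially 0 even though Y is completely determined by X. A sketch:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=1_000_000)  # symmetric about 0
y = x**2                        # fully dependent on X

cov_xy = np.cov(x, y)[0, 1]        # sample covariance, ~0
corr_xy = np.corrcoef(x, y)[0, 1]  # sample correlation, ~0
print(cov_xy, corr_xy)             # both close to 0 despite the dependence
```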

Special Distributions#

| Name | Probability Density Function | Mean $\mu$ | Variance $\sigma^2$ | Remarks |
| --- | --- | --- | --- | --- |
| Bernoulli distribution $X \sim Bern(p)$ | $Prob(X=x)=\begin{cases} p & \text{if } x=1 \\ 1-p & \text{if } x=0 \end{cases}$ | $p$ | $p(1-p)$ | |
| Binomial distribution $X \sim Bin(n,p)$ | $Prob(X=k)=\begin{cases} C_n^k p^k(1-p)^{n-k} & \text{if } k \in \{0,1,\dots,n\} \\ 0 & \text{otherwise} \end{cases}$ | $np$ | $np(1-p)$ | Number of heads in n independent coin flips |
| Geometric distribution $X \sim Geom(p)$ | $Prob(X=n)=\begin{cases} p(1-p)^{n-1} & \text{if } n \in \{1,2,3,\dots\} \\ 0 & \text{otherwise} \end{cases}$ | $\frac{1}{p}$ | $\frac{1-p}{p^2}$ | Number of trials until the first success |
| Exponential distribution $X \sim Exp(\lambda)$ | $f_X(x)=\begin{cases} \frac{1}{\lambda}e^{-x/\lambda} & \text{if } x \geq 0 \\ 0 & \text{otherwise} \end{cases}$ | $\lambda$ | $\lambda^2$ | |
| Normal distribution $X \sim N(\mu, \sigma^2)$ | $f_X(x) = \frac{1}{\sqrt{2 \pi}\,\sigma} e^{-\frac{(x-\mu)^2}{2 \sigma^2}}$ | $\mu$ | $\sigma^2$ | |
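
A quick empirical check of the table: draw samples from each distribution with NumPy and compare the sample mean and variance to the formulas. The parameter values are arbitrary; note that NumPy's exponential sampler takes the scale λ, matching the parameterization above, and its geometric sampler counts trials until the first success.

```python
import numpy as np

rng = np.random.default_rng(42)
N = 1_000_000
n, p, lam, mu, sigma = 10, 0.3, 2.0, 1.0, 0.5

samples = {
    "Bernoulli":   (rng.binomial(1, p, N),     p,           p * (1 - p)),
    "Binomial":    (rng.binomial(n, p, N),     n * p,       n * p * (1 - p)),
    "Geometric":   (rng.geometric(p, N),       1 / p,       (1 - p) / p**2),
    "Exponential": (rng.exponential(lam, N),   lam,         lam**2),
    "Normal":      (rng.normal(mu, sigma, N),  mu,          sigma**2),
}
for name, (x, mean_th, var_th) in samples.items():
    print(f"{name:12s} mean {x.mean():.3f} vs {mean_th:.3f}   var {x.var():.3f} vs {var_th:.3f}")
```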

Cumulative Distribution Method for Generating Random Numbers (Inverse Transform Method): Let X be a random variable with probability density function $f_X$ and cumulative distribution function $F_X$. If Y is a random variable uniformly distributed over [0,1], then $F^{-1}_X(Y)$ has the same distribution as X.
(Refer to: Rendering and Sampling (1): Inverse Transform Sampling—Principles and Practical Applications - ZUIcat's Article - Zhihu)
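
A minimal sketch of inverse transform sampling for the exponential distribution: $F_X(x)=1-e^{-x/\lambda}$, so $F_X^{-1}(y)=-\lambda\ln(1-y)$. Drawing Y uniformly on [0,1) and applying $F_X^{-1}$ should reproduce the exponential mean λ and variance λ² (λ = 2 is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(7)
lam = 2.0

y = rng.uniform(0.0, 1.0, size=1_000_000)  # Y ~ Uniform[0, 1)
x = -lam * np.log(1.0 - y)                 # X = F_X^{-1}(Y) for Exp(lambda)

print(x.mean(), x.var())  # ~lam and ~lam^2, matching Exp(lambda)
```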

Hypothesis Testing#

Null Hypothesis: Usually the opposite of the conclusion you want to prove. Assume the null hypothesis is correct and try to use data to refute it.
Alternative Hypothesis: The conclusion you want to prove.

Z-test

  1. Let X be a normally distributed random variable with known variance $\sigma^2$, and suppose (the null hypothesis) that its mean is $\mu$.
  2. Let $x_1, x_2, ..., x_n$ be n independent observations drawn from this distribution, and let $\bar x = \frac{x_1+x_2+...+x_n}{n}$ be the sample mean.
  3. The observed z-test statistic is $z=\frac{\bar x - \mu}{\sqrt{\sigma^2/n}}$, which under the null hypothesis follows a normal distribution with mean 0 and variance 1.
  4. From the probability of the z statistic deviating at least this far from 0 (the p-value): if $p < \alpha$ (the significance level), reject the null hypothesis. (p is the probability of observing data at least as extreme as the current sample under the premise that the null hypothesis is true; a sketch follows this list.)
  5. One-tailed vs. two-tailed test: the difference lies in whether you are testing that the parameter is greater than (or less than) a certain value, or that it simply differs from that value.
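
A minimal z-test sketch in plain Python. The data and the hypothesized mean are made up; the two-sided p-value uses the standard normal CDF expressed via `math.erfc`.

```python
import math

# Hypothetical setup: known sigma^2 = 4, null hypothesis mu = 10
sigma2, mu0 = 4.0, 10.0
data = [10.8, 11.2, 9.9, 10.7, 11.5, 10.3, 11.0, 10.9]  # made-up observations

n = len(data)
xbar = sum(data) / n
z = (xbar - mu0) / math.sqrt(sigma2 / n)

# Two-sided p-value: 2 * (1 - Phi(|z|)) = erfc(|z| / sqrt(2))
p_value = math.erfc(abs(z) / math.sqrt(2))

alpha = 0.05
print(z, p_value, "reject H0" if p_value < alpha else "fail to reject H0")
```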

T-test

  1. If nothing is known about the variance, the sample variance must be computed to estimate it:
    $s^2=\frac{1}{n-1}\sum_{i=1}^n(x_i-\bar x)^2$
  2. Compared with the population variance formula, the denominator here is n-1. (With only one sample, the variance cannot be estimated at all.)
  3. T-test statistic: $t=\frac{\bar x - \mu}{\sqrt{s^2/n}} \sim t_{n-1}$, which follows a t-distribution with n-1 degrees of freedom. (Correspondingly, the p-value must be computed from the t-distribution; the more degrees of freedom, the closer it gets to the normal distribution. A sketch follows this list.)
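
The same made-up data can be run through a one-sample t-test; a sketch assuming SciPy is available (`scipy.stats.ttest_1samp` returns the statistic and the two-sided p-value from the $t_{n-1}$ distribution), with the statistic also computed by hand for comparison:

```python
import math
from scipy import stats

mu0 = 10.0
data = [10.8, 11.2, 9.9, 10.7, 11.5, 10.3, 11.0, 10.9]  # same made-up observations

n = len(data)
xbar = sum(data) / n
s2 = sum((x - xbar) ** 2 for x in data) / (n - 1)  # sample variance, denominator n-1
t_manual = (xbar - mu0) / math.sqrt(s2 / n)

result = stats.ttest_1samp(data, popmean=mu0)  # two-sided by default
print(t_manual, result.statistic, result.pvalue)
```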