Probability Theory

\[\newcommand{\ind}[1]{\mathbf{1}_{#1}} \newcommand{\dd}{\mathrm{d}}\]

A measure space such that the measure of the entire space equals $1$ is a probability space. Hereafter it is understood that a probability space $(\Omega,\mathfrak{F},\mathbb{P})$ is fixed once and for all. In the context of probability measures, some mathematical objects are given different names since the subject was developed independently of Lebesgue measure theory. For instance

A measurable set is called an event.
A measurable function $X$ is called a random variable. If $X$ takes values in $\mathbb{R}$, this is called a real random variable or simply a random variable. In general, if $X\colon \Omega \to E$, $X$ is an $E$-valued random variable. It is understood that $E$ is a measurable space.
The pushforward $\mu:= \mathbb{P} \circ X^{-1}$ is called the law of $X$.
The almost everywhere or a.e. notation is replaced by almost sure or a.s. notation.
The probabilistic notation is in use: $\mathbb{P}\left(\left\{ \omega \in \Omega \mid \text{property } p(\omega) \text{ holds}\right\} \right)$ is denoted $\mathbb{P}\left(p\right)$. For instance $\mathbb P(X\ge 5)$ means $\mathbb{P}\left(\left\{ \omega \in \Omega \mid X(\omega) \ge 5 \right\} \right)$. This is due to the fact that the probability space is non-canonical, and one usually is interested in properties that hold regardless of the actual choice of the probability space. The measure-theoretic setting is mostly used to ensure the existence of these spaces.
The integral is called expected value or expectation and is denoted $\mathbb{E}[X]:= \int X(\omega) d\mathbb{P}(\omega)$. For instance, for an $E$-valued random variable $X$ and $f\colon E \to \mathbb{R}$ such that $f(X)\in L^1(\mathbb{P})$ \[ \mathbb{E}[f(X)] = \int_{E} \, f d\mu \] where $\mu$ is the law of $X$.

Discrete Probabilities

Let $(\Omega,\mathfrak{F})$ be a measurable space. A probability $\mu$ on $\Omega$ is called discrete if there exists a countable subset $S\subset \Omega$ such that $\mu(S)=1$. For a discrete measure it is quite intuitive that it is enough to know the probability of each point $x\in S$ to characterize a probability $\mu$.

Indeed the following result can be easily established.

Proposition 1 There is a one-to-one correspondence between

The set of discrete probabilities $\mathcal{P}_{\mathrm{discrete}}(\Omega)$ on $\Omega$.
The maps $\Omega \ni \omega \mapsto \mu_\omega \in [0,1]$, such that $\sum_{\omega \in \Omega} \mu_\omega=1$ (the convergence of the sum implies that the set $\{\omega \mid \mu_\omega>0\}$ is at most countable).

Such a correspondence is given as follows:

For $\mu \in \mathcal{P}_{\mathrm{discrete}}(\Omega)$, set $\mu_\omega:= \mu(\{\omega\})$. Then $\omega\mapsto \mu_\omega$ satisfies the hypothesis as in b.
For a map $\mu_\cdot$ as in b., define a probability on $\Omega$ setting $\mu(A):=\sum_{\omega\in A} \mu_\omega$.

It is clear that for countable spaces every probability is discrete. In this case, one usually calls $(\mu_\omega)_\omega$ a probability. This approach easily extends to atomic probabilities.

Conditional Expectation

For discrete random variables, the meaning of conditional expectations and probability is quite intuitive. If $A$ is an event with $\mathbb{P}(A)\in (0,1)$, and $X$ a real-valued random variable, then \[ \begin{aligned} \mathbb{E}[X|A]= \mathbb{E}[X \ind{A}]/\mathbb{P}(A), \qquad \mathbb{E}[X|A^c]= \mathbb{E}[X \ind{A^c}]/\mathbb{P}(A^c) \end{aligned} \tag{1}\] is an intuitive definition.

There is however an important notion that extends this elementary approach. In Equation 1, we can identify the two values given in the formula with a function on $\Omega$ that takes one constant value on $A$ and another constant value on $A^c$. In measure-theoretic terms, this is a function that is measurable w.r.t. the $\sigma$-algebra $\mathfrak{G}_A:=\{\emptyset,A,A^c,\Omega\}$. As we will see, it is convenient to denote this function as follows \[ \mathbb{E}[X|\mathfrak{G}_A](\omega)= \begin{cases} \mathbb{E}[X \ind{A}]/\mathbb{P}(A) & \text{if $\omega \in A$} \\ \mathbb{E}[X \ind{A^c}]/\mathbb{P}(A^c) & \text{if $\omega \in A^c$} \end{cases} \] Then the random variable $Z:= \mathbb{E}[X|\mathfrak{G}_A]$ has two properties

As remarked, $Z$ is $\mathfrak{G}_A$-measurable (as a function from $\Omega$ to $\mathbb{R}$).
For any random variable $Y$ that is also $\mathfrak{G}_A$-measurable, say $Y(\omega)=\alpha$ on $A$ and $Y(\omega)=\beta$ on $A^c$, it holds \[ \mathbb{E}[Z Y]= \mathbb{E}[X Y] \tag{2}\]

Taking $\alpha=1$ and $\beta=0$ and vice versa, it is not hard to check that $Z$ is the only random variable (up to a.s. equivalence) having such properties.

Exercise 1 Verify that Equation 2 holds and that $Z$ is uniquely defined up to a.s. equivalence.

These aforementioned properties a. and b. are what allow us to define conditional expectation for any sub-$\sigma$-algebra.

Remark. Let $\mathfrak{G}$ be a sub-$\sigma$-algebra of $\mathfrak{F}$ and let $X\in L^{1}(\mathbb{P})$ be a random variable. Then there exists a unique (up to a.s. equivalence) random variable $Z$ such that

$Z$ is $\mathfrak{G}$-measurable.
For all bounded $\mathfrak{G}$-measurable random variables $Y$ it holds \[ \mathbb{E}[Z Y]= \mathbb{E}[X Y] \tag{3}\]

Indeed, consider the (signed) finite measure $\nu$ on $(\Omega, \mathfrak{G})$ given by $\nu(A):= \mathbb{E}[ X \ind{A}]$. Denote $\mathbb{P}_{\mathfrak{G}}$ the restriction of $\mathbb{P}$ to $\mathfrak{G}$.

$\nu$ is absolutely continuous w.r.t. $\mathbb{P}_{\mathfrak{G}}$ (since $\nu(A)=0$ whenever $\mathbb{P}(A)=0$). By the Radon-Nikodym theorem (applied to the positive and negative parts of $X$), there exists a unique function $Z\colon \Omega \to \mathbb{R}$, which is $\mathfrak{G}$-measurable, such that b. holds.

Definition 1 The unique (up to a.s. equivalence) random variable $Z$ defined in the above remark is called the conditional expectation of $X$ given $\mathfrak{G}$ or the expectation of $X$ knowing $\mathfrak{G}$.

Proposition 2 (Properties of the conditional expectation) It holds

If $\mathfrak{H}$ is a sub-$\sigma$-algebra of $\mathfrak{G}$, then a.s. \[ \mathbb{E}[\mathbb{E}[X | \mathfrak{G}] | \mathfrak{H}] = \mathbb{E}[X | \mathfrak{H}] \] In particular taking $\mathfrak{H}$ the trivial $\sigma$-algebra $\mathbb{E} [\mathbb{E}[X | \mathfrak{G}] ] = \mathbb{E} [X]$.
If $X$ is independent of $\mathfrak{G}$, then $\mathbb{E}[X | \mathfrak{G}]= \mathbb{E}[X]$.
If $Y$ is $\mathfrak{G}$-measurable, and $XY \in L^1(\Omega)$, then $\mathbb{E}[X Y | \mathfrak{G}]= \mathbb{E}[X|\mathfrak{G}] Y$ a.s..
Convergence theorems for integrals extend to their conditional version. E.g. if $X_n\uparrow X$ a.s. (monotone convergence), then $\mathbb{E}[X_n | \mathfrak{G}] \uparrow \mathbb{E}[X | \mathfrak{G}]$.

Exercise 2 Prove Proposition 2.

Exercise 3 Assume that $X \in L^2(\Omega,\mathfrak{F},\mathbb{P})$, and let $\mathfrak{G} \subset \mathfrak{F}$ be a sub-$\sigma$-algebra of $\mathfrak{F}$. Prove that $\mathbb{E}[X|\mathfrak{G}]$ is the orthogonal projection (in the Hilbert space $L^2(\Omega,\mathfrak{F},\mathbb{P})$) of $X$ on the (closed) subspace $L^2(\Omega,\mathfrak{G},\mathbb{P})$.

Use this fact to give an immediate interpretation of the property 1 in Proposition 2.

Topologies on the space of Probability Measures

Let $(E,\mathfrak{E})$ be a measurable space. Let’s review some common distances on the space $\mathcal{P}(E)$ of probability measures on $E$

Definition 2 For $\mu,\nu \in \mathcal{P}(E)$ define the total variation distance as \[ \|\mu-\nu\|_{TV}:=\sup_{A} |\mu(A) - \nu (A)| = \frac{1}{2} \sup_{|f|\le 1} \int f \, d\mu - \int f \, d\nu \] where the suprema are taken over measurable events $A \subset E$ and measurable functions $f \colon E \to \mathbb{R}$ with $|f|\le 1$.

Remark. It holds $\|\mu-\nu\|_{TV} \le 1$. Moreover if $E=\mathbb{R}$ and $\mu =\varrho dx$ and $\nu = \varrho' dx$, then $\|\mu-\nu\|_{TV}= \tfrac{1}{2} \|\varrho-\varrho'\|_{L^1}$.

Definition 3 Assume that $E$ is a metric space with distance $d$, and that $\mathfrak{E}$ is the associated Borel $\sigma$-algebra. For $A\subset E$ and $\varepsilon>0$, define $A^\varepsilon:=\{ x\in E \mid d(x,A)< \varepsilon\}$. For $\lambda>0$, the Lévy-Prokhorov distance $d_\lambda \colon \mathcal{P}(E) \times \mathcal{P}(E)\to [0,\lambda]$ is defined as \[ d_\lambda(\mu,\nu):= \inf \left\{ \varepsilon>0 \mid \mu(A)\le \nu(A^\varepsilon)+\frac{\varepsilon}{\lambda}, \nu(A)\le \mu(A^\varepsilon)+\frac{\varepsilon}{\lambda},\,\qquad \text{for all closed $A \subset E$} \right\} \]

Remark. It holds

$d_\lambda(\delta_x,\delta_y)= \min(d(x,y),\lambda)$. In other words if $\lambda$ is larger than the diameter of $E$, $d_\lambda$ is indeed a lift of $d$ to $\mathcal{P}(E)$.
If $d$ and $d'$ generate the same topology on $E$, then $d_\lambda$ and $d^{\prime}_{\lambda'}$ generate the same topology on $\mathcal{P}(E)$.
If $E$ is a Polish space, $\mathcal{P}(E)$ equipped with the Lévy-Prokhorov distance is a Polish space.
If $E$ is a Polish space, a sequence $\mu_n$ converges to $\mu$ in $\mathcal{P}(E)$ (with the Lévy-Prokhorov distance) iff \[ \lim_n \int f d\mu_n = \int f d\mu \] for all $f\in C_{\mathrm{b}}(E)$.
A subset $\mathcal{K}\subset \mathcal{P}(E)$ is precompact in the Lévy-Prokhorov topology iff¹ \[ \inf_{K \text{compact}} \sup_{\mu \in \mathcal{K}} \mu(K^c)=0 \tag{4}\]
We can rephrase the last point as follows: for $(\mu_n)$ a sequence in $\mathcal{P}(E)$, there exists a subsequence $n_k$ and a non-negative Borel measure $\mu$ with $\mu(E)\le 1$ such that $\mu_{n_k}(f)\to \mu(f)$ for every $f\in C_{\mathrm{b}}(E)$. $\mu$ is a probability iff for each $\varepsilon>0$ there exists a compact $K^\varepsilon\subset E$ such that \[ \varliminf_{k} \mu_{n_k}(K^\varepsilon)\ge 1-\varepsilon \]

Entropies

The notion of entropy can be introduced in several different contexts and with slightly different meanings.

Definition 4 (Relative entropy) Let $(E,\mathfrak{E},m)$ be a probability space with reference measure $m$. The relative entropy (in Mathematics) or Kullback–Leibler divergence (in Computer Science) between $\mu$ and $m$ is \[ H(\mu|m):= \sup_{f} \int f d\mu - \log \int e^f dm \] where the supremum is taken over all bounded measurable functions $f\colon E \to \mathbb{R}$.

Proposition 3 It holds:

$H(\mu|m)\ge 0$ and $H(\mu|m)=0$ iff $\mu=m$.
$H(\cdot|\cdot)$ is jointly convex, namely $H(\alpha \mu + (1-\alpha) \mu'| \alpha m +(1-\alpha)m') \le \alpha H(\mu|m) + (1-\alpha) H(\mu'|m')$.
If $E$ is a Polish space $H(\mu|m)$ is jointly lower semicontinuous in the Lévy-Prokhorov metric.
For $h(v):=v \log v$ (or equivalently $h(v)=v\log v -v +1$) \[ \begin{aligned} H(\mu|m)= \begin{cases} \int h(\varrho) dm & \text{if $\mu=\varrho m$ (in the sense of Radon-Nikodym)} \\ +\infty & \text{otherwise} \end{cases} \end{aligned} \tag{5}\]
If $m,m' \in \mathcal{P}(E)$, with $m=\tfrac{1}{Z} e^{-V} m'$ for some measurable $V\in L^1(\mu)$ and $Z>0$, then \[ H(\mu|m)= H(\mu|m')+\int V d\mu + \log Z \tag{6}\]
For each event $A\subset E$ \[ \mu(A) \le \frac{H(\mu|m)+\log 2}{1+\log(1/m(A))} \]
It holds \[ \begin{aligned} & \|\mu-m\|_{TV}^2 \le \tfrac{1}{2} H(\mu|m) \\ & \|\mu-m\|_{TV}^2 \le 1- \exp(-H(\mu|m)) \end{aligned} \tag{7}\]

Proof. We give a sketch of the proofs. While some of the arguments may sound abstract, they become quite elementary on finite or countable spaces.

Take $f$ constant in the definition of $H$ (also follows from the next point).
The function $\mathcal{P}(E)\times \mathcal{P}(E) \ni (\mu,m) \mapsto \mu(f)-\log m(e^f) \in \mathbb{R}$ is convex for each $f$, as the supremum of convex functions is convex.
For the sake of simplicity, let’s consider the case where $E$ is compact. Then since $C(E)$ is dense in $L^1(\mu)$, it is the same to take the supremum over $f\in C(E)$ in the definition of $H$. But then the map $\mathcal{P}(E)\times \mathcal{P}(E) \ni (\mu,m) \mapsto \mu(f)-\log m(e^f) \in \mathbb{R}$ is continuous in the Lévy-Prokhorov metric, and the supremum of continuous functions is lower semicontinuous. If $E$ is locally compact, we may replace $C(E)$ with functions with compact support. In the general case, we may replace $C(E)$ with functions that are uniformly continuous w.r.t. to a fixed totally bounded metric on $E$ (such a metric always exists on Polish spaces).
If there exists a set $A$ such that $\mu(A)>0$ but $m(A)=0$, then for $c>0$ take $f= c \ind{A}$. We get \[ H(\mu|m) \ge c \mu(A) - \log (e^c m(A)+ e^0 m(A^c))= c \mu(A) \] Since this holds for any $c>0$, $H(\mu|m)=+\infty$. If such a set $A$ does not exist, by Radon-Nikodym, we can assume that $\mu = \varrho m$ for some $\varrho\in L^1(m)$. First assume that $\varrho$ is bounded and bounded away from $0$. Then take $f=\log \varrho + g$ for some arbitrary $g$ bounded measurable, to get \[ \begin{aligned} H(\mu|m) & = \sup_g \int \log \varrho d\mu + \int g d\mu - \log \int e^{\log\varrho + g} dm \\ & = \int h(\varrho) dm +\sup_g \int g d\mu - \log \int e^{g} d\mu \end{aligned} \] The last supremum is non-positive by Jensen inequality, so the $\sup_g$ equals $0$ (achieved for $g$ constant). It is then easy to adapt the argument when $\log\varrho$ is unbounded.
Straightforward from the properties of the logarithm and the chain rule.
Take $f= c \ind{A}$ and optimize over $c>0$.
This is the Pinsker inequality, which can be proved by elementary methods but is beyond the scope of this note.

Relation to the classical entropy

Remark. On a finite space $E$, one may define $\mathrm{Ent}(\mu):= \sum_{x\in E} \mu_x \log \mu_x$. Notice that, in the relative entropy notation, this is nothing but \[ \mathrm{Ent}(\mu)= H(\mu|m') - \log(|E|) \] where $m'$ is the uniform probability on $E$, namely $m'_x=1/|E|$ for all $x\in E$. From Equation 6, if $m_x=e^{-V(x)}/Z$ for some $V\colon E \to \mathbb{R} \cup \{+\infty\}$ and $Z=\sum_x e^{-V(x)}$, it follows \[ H(\mu|m)=\mathrm{Ent}(\mu)+\int V d\mu + \log(Z/|E|) \]

In the physical literature, $S(\mu):= -\mathrm{Ent}(\mu)$ is called the entropy, $V(x)$ is interpreted as $V(x)= \beta h(x)$ where $h(x)$ is the energy of the configuration of the state $x\in E$, $\beta=1/(\kappa T)$ where $\kappa$ is a universal constant (the Boltzmann constant), and $T$ is the temperature. $\int h \mu$ is interpreted as the energy of the “state” $\mu$. So that one defines the free energy \[ F(\mu)= \text{energy - $\kappa T$ entropy } = \int h d\mu - \kappa T S(\mu) = \tfrac{1}{\beta} H(\mu|m) - \tfrac{1}{\beta}\log(Z/|E|) \] In particular the statement “$\mu \mapsto H(\mu|m)$ is minimized at $\mu=m$” is equivalently rephrased as “the measure $m=e^{-h/(\kappa T)}/Z$ minimizes the free energy”. Unfortunately the different notations and nomenclature persist to this day, and as a general rule:

$H(\mu|m)$ is used in the Probability literature, regardless of the space.
$D_{KL}(\mu||m)$ is used in the Computer Science literature. This coincides with $H(\mu|m)$.
$F(\mu)$ is used in the Physical literature. This differs by some constants from $H(\mu|m)$, which are not relevant for fixed $\beta$ or $h$ (e.g. if we look at these as functions of $\mu$), but are relevant if we consider $H(\mu|m)$ as a function of both $\mu$ and $m$.

Large Deviations

In this section we review some basic concentration properties of sequences of probability measures. As above, $E$ is a Polish space, equipped with its Borel $\sigma$-algebra.

Definition 5 A function $I\colon E \to (-\infty,\infty]$ is lower semicontinuous (or lsc) if for all $c\in \mathbb{R}$ the set $\{I\le c\}$ is closed.

$I$ is called coercive if $\{I\le c\}$ is either empty or precompact.

In particular, if $I$ is lsc and coercive, then it admits a minimum on $E$².

Proposition 4 Let $(\mu_n)$ be a sequence of probability measures on $E$, and $\mathbf{a}=(a_n)$ a sequence of reals with $\lim_{n\to \infty} a_n=+\infty$. Let $B_{\varepsilon}(x)$ be the ball of radius $\varepsilon$ centered at $x$. Define \[ \begin{aligned} \underline{I}(x)\equiv \underline{I}^{\mathbf{a}}(x):= - \lim_{\varepsilon \to 0} \varlimsup_n \frac{1}{a_n} \log \mu_n(B_\varepsilon(x)) \in [0,\infty] \\ \overline{I}(x) \equiv \overline{I}^{\mathbf{a}}(x):= - \lim_{\varepsilon \to 0} \varliminf_n \frac{1}{a_n} \log \mu_n(B_\varepsilon(x)) \in [0,\infty] \end{aligned} \]

Then:

$\underline{I}$ and $\overline{I}$ are lsc.
$\underline{I}$ is the optimal (i.e. the largest) lsc function, and $\overline{I}$ is the optimal (i.e. smallest) function, such that the following inequalities hold \[ \begin{aligned} \mu_n(K)\le \exp\left(-a_n \inf_{x\in K}\underline{I}(x)+ o(a_n)\right), \qquad \text{for all $K\subset E$} \\ \mu_n(O)\ge \exp\left(-a_n \inf_{x\in O}\underline{I}(x)+ o(a_n)\right), \qquad \text{for all $O\subset E$} \end{aligned} \]
Equivalently, they are the optimal (lsc) functions such that for each $f\in C_b(E)$ \[ \begin{aligned} \mu_n(e^{a_n f}) \le \exp\left(a_n \sup_x (f(x)-\underline{I}(x))+ o(a_n)\right) \\ \mu_n(e^{a_n f}) \ge \exp\left(a_n \sup_x (f(x)-\overline{I}(x))+ o(a_n)\right) \end{aligned} \]

Remark. Let $x\in E$. For each sequence $\nu_n\to \delta_x$ it holds \[ \varliminf_n \frac{1}{a_n} H(\nu_n|\mu_n) \ge \underline{I}^{\mathbf{a}}(x) \] Moreover there exists a sequence $\nu_n\to \delta_x$ such that \[ \varliminf_n \frac{1}{a_n} H(\nu_n|\mu_n) \le \overline{I}^{\mathbf{a}}(x) \] $\underline{I}^{\mathbf{a}}(x)$, $\overline{I}^{\mathbf{a}}(x)$ are the optimal functions for which these two statements hold.

Large Deviations should be compared with the narrow convergence Definition 3, in which convergence is substantially equivalent to $\mu_n(f)\to \mu(f)$ for $f\in C_{\mathrm{b}}(E)$ and similar inequalities hold on open and closed sets. Informally speaking, convergence of probability measures corresponds to the case $a_n=1$ of Proposition 4.

Definition 6 Let $(\mu_n)$, $(a_n)$ be as in Proposition 4. $(\mu_n)$ satisfies a Large Deviations Principle with speed $a_n$ and rate function $I\colon E\to [0,\infty]$ if $\underline{I}=\overline{I}=:I$. The Large Deviations Principle is non-trivial if there exists $x\in E$ such that $I(x)\in (0,\infty)$.

Proposition 5 Let $(\mu_n)$, $(a_n)$ be as in Proposition 4. There exists a subsequence $n_k$ along which a Large Deviations Principle holds.

Inequalities

We list some notable inequalities here. The proofs are elementary, except for the Ahlswede-Daykin inequality on product spaces.

Markov inequalities

Proposition 6 (Markov inequality) Let $X$ be a real random variable, then for $c>0$ \[ \mathbb{P}[X \ge c] \le \mathbb{E}[|X|]/c \]

While this appears to be a crude inequality, we can apply it to any non-decreasing function $\varphi \colon \mathbb{R}\to \mathbb{R}^+$ to get for any $X$ and $c\in \mathbb{R}$ \[ \mathbb{P}[X \ge c] \le \mathbb{P}[\varphi(X) \ge \varphi(c)] \le \mathbb{E}[\varphi(X)]/\varphi(c) \] On the other hand, the latter is trivially an equality for $\varphi(x)=\ind{(\infty,c]}(x)$. This entails the following stronger statement

Proposition 7 (Markov equality) For any real random variable and $c\in \mathbb{R}$ it holds \[ \mathbb{P}[X \ge c] = \inf_{\varphi} \mathbb{E}[\varphi(X)]/\varphi(c) \] where the infimum is taken over all non-decreasing $\varphi\colon \mathbb{R}\to [0,\infty)$ with $\varphi(c)>0$.

In particular

taking $\varphi(x)=|x-\mathbb{E}[X]|$ we get the Chebyshev inequality \[ \mathbb{P}[|X - \mathbb{E}[X]| \ge c] \le \operatorname{Var}[X] c^{-2} \]
taking $\varphi(x)=e^{\lambda x}$ for $\lambda>0$, we get the Chernoff inequality \[ \mathbb{P}[X \ge c] \le \exp(- \psi(c)) \] where $\psi$ is given by the Legendre duality formula $\psi(c):= \sup_{\lambda\ge 0} \lambda c - \log \mathbb{E}[e^{\lambda X}] \in [0,+\infty]$.

Jensen inequalities

Since a convex function is a supremum of affine functions, exchanging the supremum with the expected value we get the famous Jensen inequality, which holds a rather general linear space.

Proposition 8 (Jensen inequality) If $f$ is a convex function, $X$ a random variable such that $f(X)\in L^1(\mathbb{P})$, it holds \[ \mathbb{E}[f(X)|\mathfrak{G}] \ge f\left(\mathbb{E}[X|\mathfrak{G}]\right) \]

Hölder inequalities

Proposition 9 (Hölder inequality) If $p_1,\ldots,p_n,q \in [1,\infty]$ are such that $\sum_i 1/p_i \le 1/q$ then \[ \mathbb{E}[ | X_1 \cdots X_n|^q]^{1/q} \le \mathbb{E}[ |X_1|^{p_1}]^{1/p_1} \cdots \mathbb{E}[| X_n|^{p_n}]^{1/p_n} \]

Kochen-Stone lemma

This is the counterpart of the Borel-Cantelli lemma

Proposition 10 (Kochen-Stone Lemma) Let $(A_n)$ be a sequence of events such that \[ \sum_{n=1}^\infty \mathbb{P}(A_n) = \infty \] Then \[ \mathbb{P}(\limsup_{n \to \infty} A_n) \ge \limsup_{k \to \infty} \frac{\left(\sum_{n=1}^{k} \mathbb{P}(A_n) \right)^2}{\sum_{1 \leq m,n \leq k} \mathbb{P}(A_m \cap A_n)} \tag{8}\]

In particular, if the $(A_n)$ are pairwise independent (or $A_n$ is independent of $A_m$ for all but finitely many $m$), then under Equation 8 $\mathbb{P}(\limsup_{n \to \infty} A_n)=1$.

Correlation inequalities

In this section we introduce two classes of non-trivial correlation inequalities: those in the GKS/Ginibre class, and those in the FKG class.

Ginibre inequality

Definition 7 (Convex cone) A subset $C$ of a real vector space is called a (blunt) convex cone if it is closed under linear combinations with non-negative coefficents: if $u,v\in C$, then $\alpha u+ \beta v \in C$ for $\alpha,\beta \ge 0$.

The smallest convex cone containing a subset $A$ of the vector space is called the convex cone generated by $A$ (this is well-defined as the intersection of all convex cones containing $A$)

Definition 8 Let $A \subset L^1(\mathbb{P})$ be a set of integrable random variables. We say that $A$ satisfies the Ginibre condition if for each $N\ge 1$, $X_1,\ldots,X_N \in A$, and $\epsilon_1,\ldots,\epsilon_N \in \{-1,+1\}$ \[ \int \prod\nolimits_{i=1}^N \left(X_i(\omega)+ \epsilon_i X_i(\omega')\right) \mathbb{P}(d \omega) \mathbb{P}(d\omega') \ge 0 \] In other words, if $(Y_1,\ldots,Y_N)$ is an independent copy of $(X_1,\ldots,X_N)$ then $\mathbb{E}[(X_1\pm Y_1)\cdots(X_N\pm Y_N)]\ge 0$, regardless of the signs $\pm$ in each factor.

Theorem 1 (Ginibre inequality) Let $A \subset L^1(\mathbb{P})$ be a set of integrable random variables satisfying the Ginibre condition. If $X,Y,H$ are random variables in the convex cone generated by $A$, and $e^{-H}, X e^{-H}, Ye^{-H} \in L^1(\mathbb{P})$, then \[ \mathbb{E}\left[X Y e^{-H}\right] \mathbb{E}\left[e^{-H}\right] \ge \mathbb{E}\left[X e^{-H}\right] \mathbb{E}\left[Y e^{-H}\right] \]

General AD and FKG inequalities

To properly define a general version of the AD, Holley and FKG inequalities, we first recall the notion of distributive lattice.

Definition 9 (Measurable distributive lattice) A partial order relation $\preccurlyeq$ defined on a measurable space $(\Omega,\mathfrak{G})$ is measurable if the set $\{(x,y) \mid x \preccurlyeq y\}$ is measurable.

A set $\Omega$ equipped with a partial order relation $\preccurlyeq$ is a distributive lattice if for any elements $x, y, z \in \Omega$:

There exist a unique greatest lower bound (meet) $x \wedge y$ and a unique least upper bound (join) $x \vee y$ (lattice property).
The operations are distributive: $x \wedge (y \vee z) = (x \wedge y) \vee (x \wedge z)$.

A measurable distributive lattice is a triple $(\Omega,\mathfrak{F},\preccurlyeq)$ where $(\Omega,\mathfrak{F})$ is a measurable space and $\preccurlyeq$ is a measurable partial order that defines a distributive lattice structure on $\Omega$.

Definition 10 (Correlation inequalities) Let $\mu$ be a $\sigma$-finite measure on a measurable distributive lattice $(\Omega,\mathfrak{F},\preccurlyeq)$.

We say that $\mu$ satisfies the Ahlswede-Daykin inequality if, for all measurable $f_1,f_2,f_3,f_4 \colon \Omega \to [0,\infty]$ such that for $x,y\in \Omega$ \[ f_1(x\vee y) f_2(x \wedge y) \ge f_3(x) f_4(y) \qquad x,y \in \Omega \] it holds \[ \mu(f_1) \mu(f_2) \ge \mu(f_3) \mu(f_4) \tag{9}\]
We say that $\mu$ satisfies the Holley inequality if, for all measurable functions $h,g_1,g_2 \colon \Omega \to [0,\infty]$ such that $h$ is non-decreasing and \[ g_1(x \vee y) g_2(x \wedge y) \ge g_1(x) g_2(y) \qquad x,y \in \Omega \] it holds \[ \mu(h \,g_1) \mu(g_2) \ge \mu(g_1) \mu(h g_2) \tag{10}\]
If $\mu$ is a probability, we say that $\mu$ satisfies the FKG inequality if for all measurable non-decreasing functions $f,g \colon \Omega \to [0,\infty]$ \[ \mu(f g) \ge \mu(f) \mu(g) \]

Proposition 11 If a $\sigma$-finite measure $\mu$ satisfies the Ahlswede-Daykin inequality, then it satisfies the Holley inequality.

If a probability $\mu$ satisfies the Holley inequality, then it satisfies the FKG inequality.

Proof. For the first statement, take in the Equation 9, $f_1= h g_1$, $f_2=g_2$, $f_3=g_1$, $f_4= h g_2$. It is easy to see that these satisfy the conditions for the Ahlswede-Daykin inequality, since $h(x \vee y)\ge h(y)$. But for such a choice of the four functions, the Ahlswede-Daykin inequality reduces to the Holley’s inequality.

For the second statement, we can assume $f \in L^1(\mu)$ up to a simple approximation argument. Then take in Equation 10 $g_1=f$, $g_2=1$, $h=g$.

A special class of measurable distributive lattices are product lattices. Suppose that, for $t$ in some arbitrary index set $T$, $(\Omega_t,\mathfrak{F}_t,\preccurlyeq_t)$ is a measurable distributive lattice and assume that $\preccurlyeq_t$ is a total order relation. Then the product space $\Omega=\prod_{t\in T} \Omega_t$ is naturally equipped with a partial order relation: $\omega \preccurlyeq \omega'$ iff $\omega_t\preccurlyeq_t \omega_t'$ for all $t\in T$. In this case we say that $\Omega$ is a measurable product distributive lattice.

The main statement of the next theorem is proved in (Batty and Bollmann 1980).

Theorem 2 (General Ahlswede-Daykin theorem) Any product measure on a product distributive lattice satisfies the Ahlswede-Daykin inequality.

In particular, since any finite distributive lattice can be regarded as a sublattice of a (finite) product distributive lattice, we have that the counting measure over a finite distributive lattice satisfies the Ahlswede-Daykin inequality.

References

Batty, CJK, and HW Bollmann. 1980. “Generalised Holley-Preston Inequalities on Measure Spaces and Their Products.” Zeitschrift für Wahrscheinlichkeitstheorie Und Verwandte Gebiete 53 (2): 157–73.

Footnotes

The property Equation 4 is usually called tightness of the family $\mathcal{K}$ of probabilities.↩︎
This easy-to-prove fact, known as Bolzano-Weierstrass theorem, is usually discussed in calculus classes, and the rigor of the proof has played an important role to inspire modern Mathematics.↩︎