Chapter 7.3. Selected Topics: Quantitative Limit Theorems

\[\newcommand{\st}{\, : \:} \newcommand{\ind}[1]{\mathbf{1}_{#1}} \newcommand{\dd}{\mathrm{d}}\]

Suppose we observe a simple random walk on \(\mathbb{Z}\), and we keep track of the number of times the walker crosses a (directed) edge (e.g. adding one card to a pile at each crossing). Over time we will observe a random height profile of the pile of cards emerging at large time scales.

Figure 1

For instance, in the symmetric case, we have seen that for this specific Markov chain, the number of times the edge \((n,n+1)\) is crossed (during an excursion from \(0\)) is a Galton-Watson process (indexed by \(n\)). So, as we follow the chain over a long time, the curve will be distributed as a sum of independent Galton-Watson processes. Moreover, in this example, from the position of the random walk at the final time, we can exactly infer the difference between the number of crossings of the edge \((n,n+1)\) and the number of crossings of \((n+1,n)\) (i.e. the difference between the blue bars and orange bars in the video). For instance, if the final position is \(X_T=5\), then the walker has crossed the edges \((n,n+1)\) and \((n+1,n)\) with \(n<0\) or \(n\ge 5\), the same number of times; while, for \(0 \le n\le 4\), the edge \((n,n+1)\) has been crossed once more than \((n+1,n)\).

A more interesting example is a random walk on \(\mathbb{Z}_N\).

Figure 2

In this case, the difference between the blue bars and the orange bars (referring to the video) will give us the number of times the walker wound around the loop, information that cannot be retrieved solely by knowing the final position.

As we look at Markov chains on large graph embedded in higher dimensional manifolds, things become more interesting.

Figure 3: As we run a Markov chain on a graph embedded in a manifold, the paths of the Markov chain can be interpreted as a discretization of a random curve on the manifold (e.g. we can obtain a continuous curve joining the points visited sequentially by the Markov chain with geodesics). For instance, we can embed \(\mathbb{Z}_N^2\) in a \(2\)-dimensional flat torus. When \(N\) is large, we see a continuous random dynamics emerge. It is well-known (and already visible in the simulations) that the occupation measure (called \(\pi^T\) below), converges to the Lebesgues measure on the torus. But we see fluctuations around this limiting behavior, which we may want to estimate. The same happens for the number of windings \(h^T\) of the chain (see Exercise 1). The random variable \(h^T\) of the Markov chain approximates (as \(N\to \infty\)) the homology class of the corresponding limiting random curve on the manifold.

Preliminaries for Markov Currents

In general, for a Markov chain on a finite or countable state space \(S\), we may want to consider a random observable that counts how many times the chain passes through a given edge.

Notation for probabilities on edges

Let us denote by \(E\) the space of edges in the graph induced by the transition probabilities \((p_{x,y})\), that is \(E:=\{(x,y) \st p_{x,y}>0\}\). As usual, we assume that \(p_{x,y}\) is fixed from the start and denote \(e\in E\) a generic edge. As for probabilities on the vertices, a probability on the space of edges is identified with a collection \((\jmath_e)_{e\in E}\) with \(\jmath_e\ge 0\) and \(\sum_e \jmath_e=1\). The space \(\mathcal{P}(E)\) of probability measures on \(E\) is equipped with its own Lévy distance; if \(|E|\) is finite, this is the usual topology of the simplex \(\{\jmath_e\ge 0, \sum_e \jmath_e=1\}\) regarded as a subset of \(\mathbb{R}^{|E|}\), or more generally (if \(E\) is countable), the Lévy distance is topologically equivalent (once we enumerate the edges \(\{e_0,e_1,\ldots\}\)) to \[ d(\jmath,\jmath')=\sum\nolimits_{k} |\jmath_{e_k}- \jmath_{e_k}'| 2^{-k} \tag{1}\] That is, a sequence \(\jmath^n\) converges to \(\jmath\) iff all the values \(\jmath^n_e\) converge.

For \(\mu \in \mathcal{P}(S)\) and \(f\colon S\to \mathbb{R}\) we usually denote the integral \(\mu(f)=\sum_x \mu_x f(x)\). Similarly for \(\jmath \in \mathcal{P}(E)\) and \(\xi \colon E\to \mathbb{R}\), we denote \(\jmath(\xi):=\sum_{e\in E} \jmath_e \xi(e)\).

If \(\jmath\) is a probability on edges, then we denote \(\mu^\jmath \in \mathcal{P}(S)\) the probability on vertices given by \[ \mu^\jmath_x=\sum_{y \st (x,y)\in E} \jmath_{x,y} \tag{2}\] and \(p^\jmath\) the transition probabilities \[ p^\jmath_{x,y}=\jmath_{x,y}/\mu^\jmath_{x} \tag{3}\]

Currents as probabilities on edges

The example above motivates us to consider the measure on the edges \[ \begin{aligned} & J^T:= \frac{1}{T} \sum\nolimits_{t=0}^{T-1} \delta_{(X_t,X_{t+1})} \\ & J^T_{x,y} \equiv \frac{|\{t \le T-1 \st X_t=x, X_{t+1}=y \}|}{T} \end{aligned} \] In other words \(J^T_e\) counts the number of times the chain passed through the (directed) edge \(e\) up to time \(T\), normalized by the total number of jumps (that is, \(T\)). \(J^T\) is a probability measure on the space of edges \(E\) of \(S\). We can regard \(J^T\) as a map on the Skorohod space of paths \[ J^T \colon D(S)\to \mathcal{P}(E) \] associating to each path a probability on the edges, according to the number of times the path \(\mathbf{X}\) crossed each edge up to time \(T\).

Remark. \(J^T\) is a richer observable than the occupation measure \(\pi^T:=\sum_{t=0}^{T-1} \delta_{X_t}\), which encodes the (normalized) number of times the chain visited each vertex. Indeed \[ \mu^{J^T}_x:=\sum\nolimits_{y} J^T_{x,y}=\pi^T_x \tag{4}\] so we can recover \(\pi^T\) from \(J^T\). Regarding \(J^T\) as a probability on \(S\times S\), this tells us that \(\pi^T\) is just its projection on the first component. For similarities with \(1\)-currents on manifolds, sometimes \(J^T\) is referred to as a current.

Problems and estimates of interest

We already know that \(\pi^T\) converges to the invariant measure \(m\) (for irreducible, positive-recurrent chains), as \(T\to \infty\). Each time the chain visits a vertex \(x\), there is a probability \(p_{x,y}\) to cross the edge \((x,y)\); and as \(T\to \infty\), the (irreducible, positive-recurrent) Markov chain \(\mathbf{X}\) will visit \(x\) a fraction \(m_x\) of the times. Thus we expect that \[ \lim_T J^T_{x,y}=m_x p_{x,y} \tag{5}\] which is indeed compatible with Equation 4.

In this chapter we want to understand a more quantitative convergence result for \(J^T\), which in turn will imply Equation 5, and via Equation 4 similar bounds for \(\pi^T\). This is a highly motivated problem, let us consider an example which may give us an idea1.

Example 1 In the Transmission Control Protocol for internet communication, a server \(x\) will choose the next directly-connected server \(y\) to send a packet of data it received, depending on the final destination of the packet (this is inferred via subnet mask, so a packet indeed bounces across many servers, and does not follow an optimal geodesic path). It is a fair assumption to think that the packets perform a Markov chain with some transition probabilities \(p_{x,y}\). A pair \((x,y)\) of directly-connected servers will have a maximum connection capacity \(C_{x,y}\) defined over a small amount of time, and the transmission will fail if this capacity is exceeded. If we observe the traffic over a period of time, we can estimate \(p_{x,y}\) very precisely, and may want to know how to assign the capacities \(C_{x,y}\) to minimize the probability of failures. That is, we need to know, for an arbitrary assignment \(C_{x,y}\) \[ \mathbb{P}(J^T_{x,y} \le C_{x,y}, \forall x,y \in S) \approx ? \]

Main results: Statements

Definition 1 Recall Equation 2, Equation 3 and define the function \(I\colon \mathcal{P}(E)\to [0,\infty]\) \[ I(\jmath):= \begin{cases} \sum\nolimits_{x,y} \mu^{\jmath}_x p^\jmath_{x,y} \log(p^\jmath_{x,y}/p_{x,y}) & \text{if $\sum_y \jmath_{x,y}=\sum_y \jmath_{y,x}$} \\ +\infty & \text{otherwise} \end{cases} \tag{6}\]

Definition 2 (Donsker-Varadhan functional) Define the function2 \(I_{DV}\colon \mathcal{P}(E)\to [0,\infty]\) \[ I_{DV}(\mu):= \inf \{I(\jmath), \jmath \in \mathcal{P}(E) \st \mu^\jmath =\mu \} \tag{7}\]

In Exercise 3, Exercise 4 and Exercise 5 a more explicit expression and some properties of \(I_{DV}\) are shown.

Remark. Considering the definition of the relative entropy \(H(\cdot|\cdot)\), \(I(\cdot)\) can also be written as \[ I(\jmath):= \begin{cases} \sum\nolimits_x \mu^\jmath_x H(p^\jmath_{x,\cdot}|p_{x,\cdot}) & \text{if $\sum_y \jmath_{x,y}=\sum_y \jmath_{y,x}$} \\ +\infty & \text{otherwise} \end{cases} \tag{8}\]

Our main result is the following

Theorem 1 Consider an irreducible Markov chain on a finite state space. Then for any closed set \(\mathcal{C}\subset \mathcal{P}(E)\) it holds \[ \mathbb{P}(J^T\in \mathcal{C}) \le e^{-T \inf_{\jmath\in \mathcal{C}} I(\jmath) + o(T)} \tag{9}\]

This in turn implies an exponential version of Equation 5 (in probability) and, via Equation 4, a similar result for \(\pi^T\).

Corollary 1 Define the typical current as \(\bar{\jmath}_{x,y}:=m_x p_{x,y}\). Then, under the same conditions as Theorem 1, \(I(\jmath)=0\) iff \(\jmath=\bar{\jmath}\). In particular for each \(\varepsilon>0\) there exists a constant \(c_\varepsilon>0\) such that \[ \mathbb{P}\left(\sum\nolimits_{e \in E} |J^T_e - \bar{\jmath}_e| \ge \varepsilon \right) \le e^{-T (c_\varepsilon + o(1))} \]

Corollary 2 Under the same hypotheses as Theorem 1, for any closed set \(\mathcal{C}\subset \mathcal{P}(S)\) it holds \[ \mathbb{P}(\pi^T\in \mathcal{C}) \le e^{-T \inf_{\mu\in \mathcal{C}} I_{\mathrm{DV}}(\mu) + o(T)} \tag{10}\]

One may prove (beyond the scope of this note) that \(I\) is optimal in the following sense: for each measurable \(\mathcal{A} \subset \mathcal{P}(E)\) \[ e^{-T \inf_{\jmath\in \mathrm{Interior}(\mathcal{A})} I(\jmath) + o(T)} \le \mathbb{P}(J^T\in \mathcal{A}) \le e^{-T \inf_{\jmath\in \mathrm{Closure}(\mathcal{A})} I(\jmath) + o(T)} \] (and a similar statement holds for \(I_{DV}\)).

Remark. The condition \(\sum_y \jmath_{x,y}=\sum_y \jmath_{y,x}\) implies that the measure \(\mu^\jmath\) is invariant for the induced transition probabilities \(p^\jmath\). Thus, Theorem 1 can be interpreted as follows: the probability that the observed current \(J^T\) fluctuates near a given \(\jmath\) decays exponentially as \(\exp(-T I(\jmath))\). The rate function \(I(\jmath)\) represents the relative entropy of the modified dynamics \(p^\jmath\) with respect to the original dynamics \(p\), averaged over the \(p^\jmath\)-invariant measure \(\mu^\jmath\).

Similarly, Corollary 2 quantifies the rarity of observing an empirical measure \(\pi^T\) close to \(\mu\) (where \(\mu\) is not necessarily the invariant measure). The cost is given by \(I_{DV}(\mu)\), which quantitatively expresses the “lack of \(P\)-invariance” of \(\mu\), see e.g. the bound Exercise 4.

Main results: Proofs

This section is dedicated to the proof of Theorem 1, Corollary 1, Corollary 2.

Definition 3 A function \(F\colon \mathcal{P}(E)\to [0,\infty]\) is lower semicontinuous (or lsc for short) if for each \(c\ge 0\) the set \(\{\jmath \in \mathcal{P}(E) \st F(\jmath)\le c\}\) is closed.

The following lemma states that, if we prove an inequality of the type Equation 9, replacing \(I\) by some function \(I_\alpha\) depending on a parameter, we can get the same inequality optimizing over \(\alpha\). This is not just a consequence of taking the \(\inf\) over \(\alpha\) on both sides of the inequality, since we have to exchange the \(\sup_\alpha\) (the \(\sup\) appearing because of the minus sign in the exponential), with the \(\inf_{\jmath}\).

Lemma 1 For \(\mathcal{U}\) an arbitrary index set, and \(\alpha \in \mathcal{U}\), let \(I_\alpha \colon \mathcal{P}(E)\to [0,\infty]\) be a lsc function. Suppose that, for each \(\alpha \in \mathcal{U}\), Equation 9 holds with \(I\) replaced by \(I_\alpha\), that is \[ \varlimsup_T \frac{1}{T} \log \mathbb{P}(J^T \in \mathcal{C}) \le - \inf_{\jmath \in \mathcal{C}} I_\alpha(\jmath), \qquad \text{for all $\mathcal{C}$ closed} \tag{11}\] Then Equation 9 holds with \(I\) replaced by \[ \hat{I}(\jmath):=\sup_{\alpha \in \mathcal{U}} I_\alpha(\jmath) \]

Proof. By general principles of Large Deviations Theory, there exists an optimal lower semicontinuous function \(\overline{I}\) such that Equation 11 holds (with \(\overline{I}\) in place of \(I_\alpha\)) on compact sets. Since we assumed \(E\) finite, \(\mathcal{P}(E)\) is compact, and thus each closed set is compact. Therefore Equation 11 reads as \[ \overline{I} \ge I_\alpha \qquad \text{for all $\alpha \in \mathcal{U}$} \] from which \(\overline{I} \ge \sup_{\alpha\in \mathcal{U}} I_\alpha\).

Lemma 2 For \(\xi \in C(E)\), define \(\phi_\xi \in C(S)\) as \[ \phi_\xi(x):= \log \left(\sum\nolimits_{y} e^{\xi(x,y)} p_{x,y}\right) \] Then \[ \mathbb{E}\left[\exp\left(\sum_{t=0}^{T-1}\xi(X_t,X_{t+1}) - \phi_\xi(X_t) \right) \right]=1 \tag{12}\]

Proof. Define \(q_{x,y}:= e^{\xi(x,y) - \phi_\xi(x)}p_{x,y}\). Then \(q_{x,y}\ge 0\) and \(\sum_y q_{x,y}=1\). Therefore \((q_{x,y})\) are Markov transition probabilities, and we can construct a measure \(\mathbb{Q}\) on \(D(S)\). We have that \[ \mathbb{Q}(X_0=x_0,\ldots,X_T=x_T)= \exp\left(\sum_{t=0}^{T-1}\xi(x_t,x_{t+1}) - \phi_\xi(x_t) \right) \mathbb{P}(X_0=x_0,\ldots,X_T=x_T) \] Thus Equation 12 is just the expected value of the constant function \(\ind{}\) under \(\mathbb{Q}\).

Lemma 3 For \(\xi, \phi_\xi\) as in Lemma 2, and \(\mathcal{A} \subset \mathcal{P}(E)\) \[ \mathbb{P}(J_T\in \mathcal{A}) \le \exp( - \inf_{\jmath\in \mathcal{A}} \jmath(\xi) - \mu^\jmath(\phi_\xi) ) \]

Proof. The exponent in Equation 12 can be written as \(J^T(\xi)-\pi^T(\phi_\xi)\). Recalling Equation 4, and simply just multiplying and dividing by the same factor \[ \begin{aligned} \mathbb{P}(J^T\in \mathcal{A}) & = \mathbb{E}\left[ e^{-T( J^T(\xi) -\mu^{J^T}(\phi_\xi)) } e^{\sum_{t=0}^{T-1}\xi(X_t,X_{t+1}) - \phi_\xi(X_t)} \ind{\mathcal{A}}(J^T)\right] \\ & \le \sup_{\jmath \in \mathcal{A}} e^{- T (\jmath(\xi) - \mu^\jmath(\phi_\xi) )} \mathbb{E}\left[ e^{\sum\nolimits_{t=0}^{T-1}\xi(X_t,X_{t+1}) - \phi_\xi(X_t)} \ind{\mathcal{A}}(J^T)\right] \\ & \le \exp( - T \inf_{\jmath \in \mathcal{A}} ( \jmath(\xi) - \mu^\jmath(\phi_\xi)) )\, \mathbb{E}\left[ e^{\sum\nolimits_{t=0}^{T-1}\xi(X_t,X_{t+1}) - \phi_\xi(X_t)}\right] \end{aligned} \] which coincides with the statement in view of Lemma 2.

For a function \(f\colon S\to \mathbb{R}\), let \(df(x,y):= f(y)-f(x)\). Define \[ \mathcal{P}_\ast(E):= \left\{ \jmath \in \mathcal{P}(E) \st \jmath(df)=0, \text{ for all $f\in C(S)$ } \right\} = \left\{ \jmath \in \mathcal{P}(E) \st \sum\nolimits_{y} \jmath_{x,y} = \sum\nolimits_{z} \jmath_{z,x}, \text{ for all $x \in S$ } \right\} \]

Lemma 4 For any compact set \(\mathcal{K} \subset \mathcal{P}(E)\) such that \(\mathcal{K} \cap \mathcal{P}_\ast(E) =\emptyset\), it holds \[ \mathbb{P}(J^T \in \mathcal{K}) =0, \qquad \text{for all $T$ large enough.} \]

Proof. Given \(T\ge 1\), let \(y_0=X_T,\ldots,y_N=X_0\) be a path in \(D(S)\) connecting \(X_T\) to \(X_0\) in a minimal number of steps (thus at most \(|E|\) steps). Define the current \[ \tilde{J}^T=\frac{1}{T+N}\sum_{t=0}^{T+N-1}\delta_{(X_t,X_{t+1})} \]

Telescoping the sum \[ \begin{aligned} & J^T(df) =\frac{1}{T}\sum_{t=0}^{T-1} f(X_{t+1})-f(X_t) = (f(X_T)-f(X_0))/T \\ & \tilde{J}^T(df)=0 \end{aligned} \] On the other hand, by Equation 1, \(d(J^T,\tilde{J}^T)\le |E|/T\). In particular \(J^T\) is at distance no more than \(|E|/T\) from \(\mathcal{P}_\ast(E)\) with probability \(1\). Since \(\mathcal{K}\) is compact and \(\mathcal{P}_\ast(E)\) is closed (indeed compact as we assume \(S\) finite), if these two sets do not intersect they are at a strictly positive distance. Therefore \(J^T\notin \mathcal{K}\) with probability \(1\), for \(|E|/T\) small enough.

Proof (Theorem 1). Lemma 3 can be interpreted as: Equation 9 holds with \(I\) replaced by \[ I_\xi(\jmath):=\jmath(\xi) - \mu^\jmath(\phi_\xi) \] Lemma 4 can be interpreted as: Equation 9 holds with \(I\) replaced by \[ I_\ast(\jmath):= \begin{cases} 0 & \text{if $\jmath \in \mathcal{P}_\ast(E)$.} \\ +\infty & \text{otherwise.} \end{cases} \]

\(I_\xi\) is continuous (in particular lsc), \(I_\ast\) is lsc on \(\mathcal{P}(E)\). Therefore by Lemma 1 the inequality Equation 9 holds with rate \[ \hat{I}(\jmath)= \begin{cases} \sup_{\xi} I_\xi(\jmath) & \text{if $\jmath \in \mathcal{P}_\ast(E)$.} \\ +\infty & \text{otherwise.} \end{cases} \]

Now take \(\xi(x,y)= \log(\jmath_{x,y}/p_{x,y})\) (the definition of \(\xi(x,y)\) if \(\jmath_{x,y}=0\) is irrelevant). To get \[ \begin{aligned} I_\xi(\jmath) & = \sum\nolimits_{x,y} \jmath_{x,y} \log(\jmath_{x,y}/p_{x,y}) - \sum\nolimits_{x} \mu^\jmath_x \log \left(\sum\nolimits_{z} \jmath_{x,z}\right) \\ & = \sum\nolimits_{x,y} \mu^{\jmath}_x p^\jmath_{x,y} \log(\mu^\jmath_x p^{\jmath}_{x,y} / p_{x,y}) - \sum\nolimits_{x} \mu^\jmath_x \log \mu^\jmath_x \\ & = \sum\nolimits_{x,y} \mu^{\jmath}_x p^\jmath_{x,y} \log(p^\jmath_{x,y}/p_{x,y}) = I(\jmath) \end{aligned} \]

Therefore, for \(\jmath \in \mathcal{P}_\ast(E)\), \(\hat{I}(\xi)=\sup_{\xi} I_\xi(\jmath)\ge I(\jmath)\). Since the inequality holds trivially (\(+\infty=+\infty\)) outside \(\mathcal{P}_\ast(E)\), we have that Equation 9 holds.

Proof (Corollary 1). Given two probabilities \(\mu,\nu \in \mathcal{P}(S)\), the function \(\nu \mapsto \sum_y \nu_y \log \nu_y/\mu_y\) is minimized at \(\nu=\mu\) (where it vanishes). Therefore, in Equation 6, each term in the sum over \(x\) is non-negative.

If \(I(\jmath)=0\), it must hold \(p^{\jmath}_{x,y}=p_{x,y}\) for every \(y\) and every \(x\) such that \(\mu_x^\jmath>0\) (\(p^\jmath\) is not defined if \(\mu_x^\jmath=0\) so we can arbitrarily set it to \(p^{\jmath}_{x,y}=p_{x,y}\) everywhere). But then the condition \(\jmath \in \mathcal{P}_\ast(E)\), implies that \(\mu^\jmath\) is invariant for \(p^\jmath_{x,y}\equiv p_{x,y}\). By irreducibility \(m\) is the only invariant probability for \((p_{x,y})\), thus \(\mu^\jmath=m\). Thus \(\jmath_{x,y}=\mu^\jmath_{x} p^\jmath_{x,y}=m_x p_{x,y}=\bar{\jmath}_{x,y}\).

Now, the set \(\mathcal{K}_\varepsilon := \{ \jmath \st \sum_{e} |\jmath_{e}- \bar{\jmath}_e|\ge \varepsilon\}\) is a compact set at positive distance from \(\bar{\jmath}\). Thus \(c_\varepsilon:=\inf_{\jmath \in \mathcal{K_\varepsilon}} I(\jmath)>0\), since \(I\) is lsc.

Proof. For \(\mathcal{C} \subset \mathcal{P}(S)\) closed, define \(\mathcal{C'}:=\{ \jmath \st \mu^\jmath \in \mathcal{C}\}\) (so just the preimage of \(\mathcal{C}\) w.r.t. the map \(\jmath \mapsto \mu^\jmath\)). \(\mathcal{C}'\) is closed in \(\mathcal{P}(E)\), and thus since \(\pi^T=\mu^{J^T}\) \[ \mathbb{P}(\pi^T\in \mathcal{C}) = \mathbb{P}(J^T \in \mathcal{C}' ) \le e^{-T \inf_{\jmath\in \mathcal{C}'} I(\jmath) + o(T)} = e^{-T \inf_{\mu\in \mathcal{C}} I_{\mathrm{DV}}(\mu) + o(T)} \tag{13}\]

Additional Exercises

Exercise 1 Find an explicit bound (similar to Equation 9) for \(h^T\) the number of windings of a nearest neighbors random walk on \(\mathbb{Z}_N\). Hint: write the number of windings \(h^T\) in terms of \(J^T\).

Exercise 2 Let \(F \colon \mathcal{P}(E) \to \mathbb{R}\) be continuous (and bounded, which is a consequence of \(E\) being finite). Prove that \[ \mathbb{E}\left[e^{T F(J^T)} \right] \le e^{T \sup_{\jmath} (F(\jmath)-I(\jmath)) + o(T)} \]

Exercise 3 Prove the formula \[ I_{DV}(\mu) = \sup_{u\in C(S), u>0} \sum_x \mu_x \log\left(\frac{u(x)}{(Pu)(x)} \right) \tag{14}\]

In principle this can be solved with Lagrange multipliers and extensive but elementary computations. A slightly more abstract application of the technique however allows a quicker approach.

The infimum in the definition Equation 7 may be taken over \(\jmath\in \mathcal{P}_\ast(E)\), since \(I(\jmath)=+\infty\) otherwise. Therefore one can write \(\jmath_{x,y}=\mu_x q_{x,y}\), where the Markov transition probabilities \(Q=(q_{x,y})_{x,y}\) satisfy \(\mu Q=\mu\) (from \(\jmath\in \mathcal{P}_\ast(E)\)).

We can use an arbitrary function \(f\in C(S)\) as Lagrange multiplier to impose the constraint \(\mu Q=\mu\). Indeed, for \(Q=(q_{x,y})_{x,y}\) \[ I_{DV}(\mu)= \inf_{Q} \sup_{f\in C(S)} \sum_{x,y} \mu_x q_{x,y} \log\left(\frac{q_{x,y}}{p_{x,y}} \right) +\mu((I-Q)f) \] where now the infimum is taken over an arbitrary transition probabilities \(Q=(q_{x,y})_{x,y}\), since the supremum over \(f\) equals \(+\infty\) unless \(\mu=\mu Q\) is satisfied.

The key point now is the following. The expression in the r.h.s. of which we are taking the \(\inf_Q \sup_f\) is

  • convex (thus continuous) in \(Q\), for \(Q\) in the convex compact set \(\sum_{y} q_{x,y}=1\), \(q_{x,y}\ge 0\).
  • affine (thus continuous) in \(f\), for \(f\in C(S)\).

We can therefore use Sion’s minmax principle, and exchange the order of the \(\inf\) and \(\sup\) to get \[ I_{DV}(\mu)= \sup_{f\in C(S)} \inf_{Q} \sum_{x,y} \mu_x q_{x,y} \log\left(\frac{q_{x,y}}{p_{x,y}} \right) +\mu((I-Q)f) = \sup_{f\in C(S)} \mu(f) + \inf_{Q} \mu_x q_{x,y} \log\left(\frac{q_{x,y}}{p_{x,y}e^{f(y)}}\right) \] Now the infimum over \(q_{x,y}\) is easily computed (since we do not need to impose the “difficult” condition \(\mu=Q\mu\), just \(\sum_y q_{x,y}=1\)), and we get \[ I_{DV}(\mu)= \sup_{f\in C(S)} \sum_x \mu_x f(x) - \mu_x \log\left(\sum_y p_{x,y} e^{f(y)}\right) \] which gives the wanted expression once we set \(u=e^f\).

This proof ultimately suggests that, if we repeat the proof of Theorem 1 for \(\pi^T\) instead of \(J^T\), it would have been enough to optimize over \(\xi(x,y)=f(y)\) (depending only on \(y\) instead of a generic function of both \(x\) and \(y\)), consistently with the fact that the bound for \(\pi^T\) is a weaker result, and indeed can be derived from the exponential bound for \(J^T\).

Exercise 4 Prove the explicit bounds \[ \mathcal{E}(\sqrt{\varrho},\sqrt{\varrho}) \le I_{DV}(\mu) \le \sum_{x,y} \mu_x \mu_y \log\left(\frac{\mu_y}{p_{x,y}}\right) \tag{15}\] where \(\mathcal{E}\) is the Dirichlet form and \(\varrho_x:=\mu_x/m_x\) is density of \(\mu\) w.r.t. the invariant measure \(m\).

Following the beginning of Exercise 3, we can write \(I_{DV}\) as an infimum over Markov transition probabilities \(Q=(q_{x,y})\) that leave \(\mu\) invariant: \(\mu=\mu Q\). The choice \(q_{x,y}=\mu_y\) leaves \(\mu\) invariant, as easily verified, and yields the stated upper bound for \(I_{DV}\).

As for the lower bound, in Equation 14 take \(u=\sqrt{\varrho}\) (the case where \(\varrho\) vanishes at some point is easily dealt with, since these points are irrelevant and one can set \(u(x)=1\) if \(\varrho(x)=0\)). The expression in Equation 14 for this particular choice of \(u\) is not larger than the supremum over \(u\), thus \[ I_{DV}(\mu) \ge \sum_x \mu_x \log\left(\frac{\sqrt{\varrho}(x)}{(P\sqrt{\varrho})(x)} \right)= - \sum_x m_x \log\left( 1- \frac{((I-P) \sqrt{\varrho})(x)}{\sqrt{\varrho(x)}}\right) \] Apply \(-\log(1-z)\ge z\) for each term of the sum to get \[ I_{DV}(\mu) \ge \sum_x m_x \varrho(x) \frac{((I-P) \sqrt{\varrho})(x)}{\sqrt{\varrho(x)}} \] which simplifies to the wanted inequality.

Exercise 5 Prove that \(I_{DV}(\mu)=0\) iff \(\mu=m\) is the invariant measure (recall that we assumed the chain to be irreducible). Deduce, as in Corollary 1 that for some \(c_\varepsilon>0\) \[ \mathbb{P}\left(\sum\nolimits_{x \in S} |\pi^T_x - m_x| \ge \varepsilon \right) \le e^{-T (c_\varepsilon + o(1))} \] where \(o(1)\) vanishes as \(T\to \infty\).

Since \(m=\mu^{\bar{\jmath}}\) and \(I(\bar{\jmath})=0\) \[ 0\le I_{DV}(\mu)\le I(\bar{\jmath})=0 \]

Conversely, if \(I_{DV}(\mu)=0\), then \(\mathcal{E}(\sqrt{\varrho},\sqrt{\varrho})=0\). Since we assumed the chain to be irreducible, this implies that \(\varrho\) is constant, from point c. of the Proposition on Dirichlet forms with \(f=g\). This is equivalent to \(\mu=m\) since they are both probabilities.

The exponential bound follows the same argument as Corollary 1, since it only uses that \(I(\cdot)\) has a unique root.

Footnotes

  1. Fluctuations of this type of mathematical objects (e.g. \(J^T\)) have been shown to have a rather deep meaning, providing a universal dynamical notion of the concept of entropy in complex systems.↩︎

  2. This is called the Donsker-Varadhan functional, named after a famous series of papers by the NYU mathematicians M.R.Donsker and S.R.S.Varadhan in the 70s. The papers are mostly about continuous-time Markov processes (so can be rather technical), but the discrete-time adaptation is straightforward.↩︎