Chapter 7.4. Selected Topics: Monte Carlo Methods

\[\newcommand{\ind}[1]{\mathbf{1}_{#1}} \newcommand{\dd}{\mathrm{d}}\]

Markov chains Monte Carlo

One of the most relevant applications of Markov chains is sampling ($S$-valued) Random Variables. Given a target measure $m\in \mathcal{P}(S)$, one is interested in sampling a random variable $X$ with distribution $m$, namely $\mathbb{P}(X=x)=m_x$ for all $x\in S$. A possible strategy is to run a Markov chain (ideally irreducible, aperiodic) on $S$, with invariant measure $m$. As one simulates the chain for a long time $T$, the law of $X_T$ will converge exponentially fast to the target measure $m$. Of course it is important, and highly non-trivial, to estimate accurately the speed of convergence of the law of $X_T$. Most importantly, given the target measure $m$, to design some transition probabilities that will guarantee a fast convergence and a low computational cost. The latter property can be formalized mathematically in many ways (e.g. avoiding the computation of some complex quantities, the number of calls of $m_x$ per step, or simply the simulation algorithm being nearest-neighbor w.r.t. some natural structure on the graph where the target measure $m$ is defined), none of them fully satisfying.

Example 1 (Simulated Annealing) Suppose one wants to find the global minimum of a cost function $U: S \to \mathbb{R}$. In many applications (e.g., combinatorial optimization), $S$ is very large and $U$ has many local minima, making local or deterministic search algorithms ineffective.

The idea of simulated annealing is to turn this optimization problem into a sampling problem. For a parameter $\beta > 0$ (interpreted as the inverse temperature), define the Gibbs measure: \[ m^{\beta}_x = \frac{1}{Z^\beta} e^{-\beta U(x)}, \qquad Z^\beta = \sum_{y \in S} e^{-\beta U(y)} \] As $\beta \to \infty$ (temperature goes to zero), the measure $m^{\beta}$ concentrates on the set of global minima of $U$. If we can sample from $m^{\beta}$ for very large $\beta$, we find the minimum with large probability (the probability to miss it is exponentially small in $\beta$).

However, direct sampling is impractical, e.g. due to the constant $Z^\beta$ (computing $Z^\beta$ implies computing all the $U(x)$). Instead, we run a Markov chain designed to have $m^{\beta}$ as its invariant measure. Let us assume that $S$ has a graph structure, with degree bounded by $D$ (that is, no point has more than $D$ neighbors). This is motivated by the fact that, although $|S|$ is large in many cases, it has a (connected, undirected) graph structure with a small degree (see Example 2 below; or consider the case of a large box $S= \{-N,\ldots,N\}^d$ in $\mathbb{Z}^d$, so $|S|=(2N+1)^d$ but $D=2d$). We then take \[ p^\beta_{x,y}= \begin{cases} e^{-\beta (U(y) - U(x))^+}/D & \text{if $x,y$ are neighbors,} \\ 1- \!\!\!\!\!\! \sum\limits_{{z \text{ neighbor of $x$}}}\!\!\!\!\!\! e^{-\beta (U(z) - U(x))^+}/D & \text{if $y=x$,} \\ 0 & \text{otherwise.} \end{cases} \tag{1}\] where $a^+$ denotes the positive part of a real number $a$.

This choice corresponds, at a given step when $X_t=x$, to choose with probability $1/D$ a given candidate neighbor $y$ of $x$. If $U(y)\le U(x)$, then $X_{t+1}=y$. If $U(y)>U(x)$, then with probability $e^{-\beta (U(y)-U(x))}$ we set $X_{t+1}=y$, and otherwise one does not move, $X_{t+1}=x$. The idea is that this procedure, on the one hand favors jumping toward smaller values of $U$, and on the other hand allows (even if with small probability) “escaping valleys” around local minimizers of $U$. The invariance of $m$ for this random dynamics is a consequence of the more general Proposition 1.

In practice, in the so-called simulated annealing algorithm, one simulates this chain with a slowly increasing $\beta$ (so $\beta=\beta_t$) as the time $t$ increases, theoretically allowing the chain to escape local minima and settle in the global minimum. As increasing $\beta$ means decreasing the temperature over time, the name of this influential technique comes from a heat treatment of metals.

Example 2 (Traveling salesman) Consider $N$ cities labeled $1, \dots, N$, and a matrix $(c_{i,j})_{1\le i,j \le N}$ representing the cost of traveling between city $i$ and city $j$. A traveling salesman has to visit all $N$ towns and come back home, and has to decide the optimal order of towns to visit. This means we are looking for a permutation $\sigma \in \mathcal{S}_N$ that minimizes the total cost.

Here, the state space $S = \mathcal{S}_N$ is the set of all permutations of the $N$ cities. The cost function $U \colon S \to \mathbb{R}$ is the total cost of the tour (we understand sums $\pmod N$, so $N+1=1$): \[ U(\sigma) = \sum_{i=1}^{N} c_{\sigma(i), \sigma(i+1)} \] To apply the method in Example 1, we must define a graph structure on $S$. Let us try different strategies.

Two permutations are connected if one can be obtained from the other by swapping two indices. If we denote by $\pi^{i,j}\sigma$ the transposition swapping the entries at position $i$ and $j$, the neighbors of $\sigma$ are the $D=\binom{N}{2}$ permutations $(\pi^{i,j}\sigma)_{i< j}$.
Two permutations are connected if one can be obtained from the other by swapping two consecutive indices. This means that the neighbors of $\sigma$ are all the permutations obtained from $\sigma$ with an adjacent transposition, namely the $D=N$ permutations $(\pi^{i,i+1} \sigma)_{i}$.
If $c_{i,j}=c_{j,i}$ is symmetric, two permutations are connected if one can be obtained from the other by reversing the order of visit of a sub-segment of indices. If we denote by $\epsilon^{i,j}\sigma$ the permutation obtained by reversing the path between index $i$ and $j$, the neighbors of $\sigma$ are the $D=\binom{N}{2}$ permutations $(\epsilon^{i,j}\sigma)_{i< j}$.

The possible steps in cases a.,b.,c.. — Figure 1: The different single-step transitions for the three examples of the Salesman problem.

Now, as in Example 1, we fix a large constant $\beta>0$ (which governs the probability of accepting “uphill” moves), and we simulate a Markov chain with transition probabilities given by Equation 1. A crucial computational advantage of the local dynamics a.,b.,c.., is that we do not need to ever calculate the full cost $U(\sigma)$, but only the difference of $U(\cdot)$ between neighbor permutations. This is easily computed (see Figure 1) as \[ \delta U:= \begin{cases} U(\pi^{i,j}\sigma) - U(\sigma) = (c_{\sigma(i-1),\sigma(j)} + c_{\sigma(j),\sigma(i+1)} + c_{\sigma(j-1),\sigma(i)}+ c_{\sigma(i),\sigma(j+1)}) - (c_{\sigma(i-1),\sigma(i)} + c_{\sigma(i),\sigma(i+1)} + c_{\sigma(j-1),\sigma(j)}+c_{\sigma(j),\sigma(j+1)}) & \text{for case a.} \\ U(\pi^{i,i+1}\sigma) - U(\sigma) = (c_{\sigma(i-1),\sigma(i+1)} + c_{\sigma(i+1),\sigma(i)} + c_{\sigma(i),\sigma(i+2)}) - (c_{\sigma(i-1),\sigma(i)} + c_{\sigma(i),\sigma(i+1)}+c_{\sigma(i+1),\sigma(i+2)}) & \text{for case b.} \\ U(\epsilon^{i,j}\sigma) - U(\sigma) = (c_{\sigma(i-1),\sigma(j)} + c_{\sigma(i),\sigma(j+1)}) - (c_{\sigma(i-1),\sigma(i)} + c_{\sigma(j),\sigma(j+1)}) & \text{for case c.} \end{cases} \tag{2}\] Thus to evaluate Equation 1, we never need to compute $Z^\beta$ or even $U$, just a combination of respectively $8$, $6$, $4$ cost terms (for cases a.,b.,c.) (regardless of $N$) . The algorithm is then

Start with an arbitrary permutation, e.g. the identity or a random one.
Pick two random indices $i,j$ (for the dynamics a.), or one random index $i$ (for dynamics b., as $j=i+1$ in this case).
Compute the value $\delta U$ in Equation 2.
If $\delta U \le 0$, then perform the step (namely swap $i$ and $j$ in case a., swap $i$ and $i+1$ in case b., reverse the path between $i$ and $j$ in case c.). If $\delta U>0$, then perform the step with probability $e^{-\delta U}$, and do not make any change otherwise.
Repeat from point 2..

Figure 2

Of course, a critical question is how many times we should repeat steps 2.-5.. This is a problem of great practical and theoretical interest. The answer critically depends on $N$, $\beta$, $U$, and the acceptance of possible errors. In many situations, one may just take “$T$ large enough” and avoid mathematical complications.

In the examples a.,b.,c. above, it is quite clear what to expect:

The cases a. and c. are relatively similar, each point has $N(N-1)/2$ neighbors, and they usually need less steps to visit the state space and “find” the minimizers of $U$. Usually c. is expected to perform better than a.. b. instead only allows $N$ neighbors, and has a higher chance to get stuck in “local minimizers” of $U$ (namely, to find a path of lower cost, it first needs to swap several adjacent indices where the cost increases, but these transitions are unlikely as $\beta$ gets large). In simulations, we see that that b. is indeed the algorithm for which $U$ decreases the slowest with iterations.
However each single step of b. is much simpler computationally, and if we plot instead $U(X_t)$ not as the function of the number of steps, but the cpu time taken to simulate, the gap between b. and the other algorithm is much reduced.
Ultimately, we see that increasing $\beta$ as the simulation goes on is the best strategy, and certainly for b. to perform, we just need to tune $\beta \equiv \beta_t$ so that it grows, but slowly compared to a. and b..

#| '!! shinylive warning !!': |
#|   shinylive does not work in self-contained HTML documents.
#|   Please set `embed-resources: false` in your metadata.
#| standalone: true
#| components: [viewer, editor]
#| viewerHeight: 1600
#| editorHeight: 100
#| layout: vertical
#| id: tsp_metropolis

import time
import numpy as np
import matplotlib.pyplot as plt
from shiny import App, render, ui, reactive, req
import theme

# ==============================================================================
# 1. CONFIGURATION & DEFINITIONS
# ==============================================================================

CHECK_INTERVAL = 200     # Check time/limit every N steps
MAX_PLOT_POINTS = 2000   # Max points to render per line (downsampling)

COST_FUNCTIONS = {
    "Squared Euclidean (L2^2)": lambda x, y: np.sum((x - y)**2),
    "Euclidean (L2)": lambda x, y: np.sqrt(np.sum((x - y)**2)),
    "Manhattan (L1)": lambda x, y: np.sum(np.abs(x - y)),
}

DEFAULT_N = 30
DEFAULT_SEED = 123
DEFAULT_ITER = 5000
DEFAULT_TIME = 5.0

# ==============================================================================
# 2. CORE LOGIC
# ==============================================================================

def generate_cities(n, seed):
    rng = np.random.default_rng(seed)
    return rng.random((n, 2))

def compute_cost_matrix(cities, metric_func):
    n = len(cities)
    mat = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            if i != j:
                mat[i, j] = metric_func(cities[i], cities[j])
    return mat

def compute_total_cost(sigma, cost_mat):
    n = len(sigma)
    cost = 0.0
    for i in range(n):
        cost += cost_mat[sigma[i], sigma[(i + 1) % n]]
    return cost

def run_metropolis(cities, cost_mat, strategy, beta_range, limit_mode, limit_val):
    n = len(cities)
    rng = np.random.default_rng()
    
    sigma = np.arange(n)
    rng.shuffle(sigma)
    
    current_cost = compute_total_cost(sigma, cost_mat)
    best_sigma = sigma.copy()
    best_cost = current_cost
    
    # Pre-allocate estimate
    est_size = int(limit_val) if limit_mode == 'iter' else 10000
    costs_hist = [current_cost]
    
    beta_start, beta_end = beta_range
    start_time = time.time()
    step = 0
    keep_running = True
    
    while keep_running:
        step += 1
        
        # Annealing progress
        if limit_mode == 'iter':
            progress = min(1.0, step / limit_val)
        else:
            now = time.time()
            progress = min(1.0, (now - start_time) / limit_val)
            
        beta = beta_start + (beta_end - beta_start) * progress
        
        # Proposal
        i = rng.integers(0, n)
        j = 0
        delta_u = 0.0
        
        # Strategy A: Random Swap
        if strategy == 'a':
            j = rng.integers(0, n)
            while i == j: j = rng.integers(0, n)
            
            i_prev, i_next = (i - 1) % n, (i + 1) % n
            j_prev, j_next = (j - 1) % n, (j + 1) % n

            if i_next == j: # Adjacent i, i+1
                 p_im1, p_i, p_j, p_jp1 = sigma[i_prev], sigma[i], sigma[j], sigma[j_next]
                 delta_u = (cost_mat[p_im1, p_j] + cost_mat[p_j, p_i] + cost_mat[p_i, p_jp1]) - \
                           (cost_mat[p_im1, p_i] + cost_mat[p_i, p_j] + cost_mat[p_j, p_jp1])
            elif j_next == i: # Adjacent j, j+1
                 p_jm1, p_j, p_i, p_ip1 = sigma[j_prev], sigma[j], sigma[i], sigma[i_next]
                 delta_u = (cost_mat[p_jm1, p_i] + cost_mat[p_i, p_j] + cost_mat[p_j, p_ip1]) - \
                           (cost_mat[p_jm1, p_j] + cost_mat[p_j, p_i] + cost_mat[p_i, p_ip1])
            else:
                p_im1, p_i, p_ip1 = sigma[i_prev], sigma[i], sigma[i_next]
                p_jm1, p_j, p_jp1 = sigma[j_prev], sigma[j], sigma[j_next]
                term_add = cost_mat[p_im1, p_j] + cost_mat[p_j, p_ip1] + cost_mat[p_jm1, p_i] + cost_mat[p_i, p_jp1]
                term_sub = cost_mat[p_im1, p_i] + cost_mat[p_i, p_ip1] + cost_mat[p_jm1, p_j] + cost_mat[p_j, p_jp1]
                delta_u = term_add - term_sub

        # Strategy B: Adjacent Swap
        elif strategy == 'b':
            j = (i + 1) % n
            i_prev = (i - 1) % n
            j_next = (j + 1) % n
            p_im1, p_i, p_j, p_jn = sigma[i_prev], sigma[i], sigma[j], sigma[j_next]
            delta_u = (cost_mat[p_im1, p_j] + cost_mat[p_j, p_i] + cost_mat[p_i, p_jn]) - \
                      (cost_mat[p_im1, p_i] + cost_mat[p_i, p_j] + cost_mat[p_j, p_jn])

        # Strategy C: 2-Opt (Reverse Segment)
        elif strategy == 'c':
            j = rng.integers(0, n)
            # Avoid neighbors for 2-opt to be meaningful (though math holds)
            while i == j or j == (i - 1) % n or j == (i + 1) % n:
                j = rng.integers(0, n)
            
            i_prev = (i - 1) % n
            j_next = (j + 1) % n
            p_im1, p_i = sigma[i_prev], sigma[i]
            p_j, p_jn  = sigma[j], sigma[j_next]
            delta_u = (cost_mat[p_im1, p_j] + cost_mat[p_i, p_jn]) - \
                      (cost_mat[p_im1, p_i] + cost_mat[p_j, p_jn])

        # Acceptance
        if delta_u <= 0 or rng.random() < np.exp(-beta * delta_u):
            current_cost += delta_u
            if strategy == 'a' or strategy == 'b':
                sigma[i], sigma[j] = sigma[j], sigma[i]
            elif strategy == 'c':
                if j > i:
                    sigma[i:j+1] = sigma[i:j+1][::-1]
                else:
                    segment = np.concatenate((sigma[i:], sigma[:j+1]))
                    rev_segment = segment[::-1]
                    len_first = n - i
                    sigma[i:] = rev_segment[:len_first]
                    sigma[:j+1] = rev_segment[len_first:]
            
            if current_cost < best_cost:
                best_cost = current_cost
                best_sigma = sigma.copy()

        costs_hist.append(current_cost)
        
        # Check Limits
        if step % CHECK_INTERVAL == 0:
            elapsed = time.time() - start_time
            if limit_mode == 'iter' and step >= limit_val:
                keep_running = False
            elif limit_mode == 'time' and elapsed >= limit_val:
                keep_running = False
    
    total_time = time.time() - start_time
    # Create axes
    iter_axis = np.arange(len(costs_hist))
    time_axis = np.linspace(0, total_time, len(costs_hist))
    
    return {
        "costs": np.array(costs_hist),
        "best_sigma": best_sigma,
        "best_cost": best_cost,
        "time_axis": time_axis,
        "iter_axis": iter_axis
    }


# ==============================================================================
# 3. UI DEFINITION
# ==============================================================================

app_ui = theme.layout(
    ui.sidebar(
        ui.h4("Configuration"),
        
        # Compact Row 1: N and Seed
        ui.row(
            ui.column(6, ui.input_numeric("n_cities", "N (Cities)", value=DEFAULT_N, min=4, max=200)),
            ui.column(6, ui.input_numeric("seed", "Seed", value=DEFAULT_SEED))
        ),
        
        ui.input_select("cost_metric", "Cost Function", choices=list(COST_FUNCTIONS.keys())),
        
        ui.hr(),
        ui.h5("Stop Condition"),
        
        # Compact Row 2: Mode and Limit
        ui.row(
            ui.column(6, ui.input_select("stop_mode", "Stop By", choices={"iter": "Iterations", "time": "Time (s)"})),
            ui.column(6, ui.input_numeric("stop_val", "Limit", value=DEFAULT_ITER))
        ),
        
        ui.hr(),
        ui.h5("Annealing Schedules"),
        ui.p("Start → End values for Beta", class_="text-muted small mb-2"),
        
        # Compact Rows for Betas
        ui.row(
            ui.column(6, ui.input_numeric("beta1_s", "Set 1 Start", value=1.0, step=0.5)),
            ui.column(6, ui.input_numeric("beta1_e", "End", value=10.0, step=1.0)),
        ),
        ui.row(
            ui.column(6, ui.input_numeric("beta2_s", "Set 2 Start", value=0.1, step=0.1)),
            ui.column(6, ui.input_numeric("beta2_e", "End", value=50.0, step=5.0)),
        ),
        ui.row(
            ui.column(6, ui.input_numeric("beta3_s", "Set 3 Start", value=0.01, step=0.01)),
            ui.column(6, ui.input_numeric("beta3_e", "End", value=100.0, step=10.0)),
        ),
        
        ui.input_action_button("run", "Run Simulation", class_="btn-primary w-100 mt-3"),
        width=350
    ),
    
    ui.tags.style("""
        .card-header { font-weight: bold; }
        .shiny-plot-output { margin-bottom: 20px; }
    """),
    
    ui.card(
        ui.card_header("1. Best Tours Found (Per Method)"),
        ui.output_plot("plot_tours", height="300px")
    ),
    
    ui.card(
        ui.card_header("2. Convergence: Strategy A (Random Swap)"),
        ui.output_plot("plot_conv_a", height="250px")
    ),
    
    ui.card(
        ui.card_header("3. Convergence: Strategy B (Adjacent Swap)"),
        ui.output_plot("plot_conv_b", height="250px")
    ),
    
    ui.card(
        ui.card_header("4. Convergence: Strategy C (Sub-segment Reverse)"),
        ui.output_plot("plot_conv_c", height="250px")
    ),
    
    title="TSP Metropolis Strategies"
)

# ==============================================================================
# 4. SERVER LOGIC
# ==============================================================================

def server(input, output, session):
    theme.setup_plot_style()
    sim_data = reactive.Value(None)
    
    # Dynamic Defaults for Limit based on Mode
    @reactive.Effect
    @reactive.event(input.stop_mode)
    def _update_defaults():
        mode = input.stop_mode()
        if mode == 'iter':
            ui.update_numeric("stop_val", value=DEFAULT_ITER)
        else:
            ui.update_numeric("stop_val", value=DEFAULT_TIME)
    
    @reactive.Effect
    @reactive.event(input.run)
    def run_simulation_handler():
        n = input.n_cities()
        seed = input.seed()
        metric_name = input.cost_metric()
        metric_func = COST_FUNCTIONS[metric_name]
        
        stop_mode = input.stop_mode()
        stop_val = input.stop_val()
        
        betas = [
            (input.beta1_s(), input.beta1_e()),
            (input.beta2_s(), input.beta2_e()),
            (input.beta3_s(), input.beta3_e())
        ]
        
        cities = generate_cities(n, seed)
        cost_mat = compute_cost_matrix(cities, metric_func)
        
        strategies = ['a', 'b', 'c']
        results_storage = {}
        
        with ui.Progress(min=0, max=9) as p:
            count = 0
            for strat in strategies:
                strat_results = []
                for b_range in betas:
                    p.set(count, message=f"Strat {strat.upper()} ({b_range})...")
                    res = run_metropolis(cities, cost_mat, strat, b_range, stop_mode, stop_val)
                    res['beta_range'] = b_range
                    strat_results.append(res)
                    count += 1
                results_storage[strat] = strat_results
        
        sim_data.set({
            "cities": cities,
            "results": results_storage,
            "metric": metric_name,
            "stop_mode": stop_mode
        })

    def plot_convergence(strat_key, ax):
        data = sim_data.get()
        if not data: return
        
        res_list = data['results'][strat_key]
        mode = data['stop_mode']
        
        for res in res_list:
            y = res['costs']
            x = res['time_axis'] if mode == 'time' else res['iter_axis']
            
            # Downsampling Logic
            if len(y) > MAX_PLOT_POINTS:
                # Calculate stride to get roughly MAX_PLOT_POINTS
                stride = int(np.ceil(len(y) / MAX_PLOT_POINTS))
                y = y[::stride]
                x = x[::stride]
            
            beta_txt = f"β: {res['beta_range'][0]}→{res['beta_range'][1]}"
            ax.plot(x, y, label=beta_txt, linewidth=1.2, alpha=0.8)
            
        ax.set_ylabel("Total Cost U(σ)")
        ax.set_xlabel("Time (s)" if mode == 'time' else "Iterations")
        ax.grid(True, linestyle=':', alpha=0.6)
        ax.legend(fontsize='small', loc='upper right')

    @output
    @render.plot
    def plot_tours():
        data = sim_data.get()
        if not data: return
        
        cities = data['cities']
        results = data['results']
        
        fig, axes = plt.subplots(1, 3, figsize=(10, 3.5), constrained_layout=True)
        titles = {'a': "A: Swap", 'b': "B: Adj Swap", 'c': "C: 2-Opt"}
        
        for idx, strat in enumerate(['a', 'b', 'c']):
            ax = axes[idx]
            res_list = results[strat]
            best_run = min(res_list, key=lambda x: x['best_cost'])
            
            sigma = best_run['best_sigma']
            tour = cities[np.concatenate((sigma, [sigma[0]]))]
            
            ax.plot(tour[:, 0], tour[:, 1], 'o-', markersize=3, linewidth=1, color='#2c3e50')
            ax.plot(tour[0, 0], tour[0, 1], 'o', color='#e74c3c', markersize=5)
            
            ax.set_title(f"{titles[strat]}\nCost: {best_run['best_cost']:.3f}", fontsize=10)
            ax.axis('off')
            ax.set_aspect('equal')
            
        return fig

    @output
    @render.plot
    def plot_conv_a():
        fig, ax = plt.subplots(figsize=(8, 3), constrained_layout=True)
        plot_convergence('a', ax)
        return fig

    @output
    @render.plot
    def plot_conv_b():
        fig, ax = plt.subplots(figsize=(8, 3), constrained_layout=True)
        plot_convergence('b', ax)
        return fig

    @output
    @render.plot
    def plot_conv_c():
        fig, ax = plt.subplots(figsize=(8, 3), constrained_layout=True)
        plot_convergence('c', ax)
        return fig

app = App(app_ui, server)


## file: theme.py
from shiny import ui
import matplotlib.pyplot as plt

def layout(sidebar, *args, **kwargs):
    """
    Standard layout wrapper for all Shinylive apps.
    
    Args:
        sidebar: A ui.sidebar() object.
        *args: Main content elements.
        **kwargs: Additional arguments for ui.page_sidebar (e.g., title, fillable).
    """
    return ui.page_sidebar(
        sidebar,
        # Standard CSS across all apps
        ui.tags.style("""
            :root {
                --bs-body-font-family: "Inter", sans-serif;
                --bs-font-monospace: "Source Code Pro", monospace;
            }
            body { 
                font-family: var(--bs-body-font-family) !important; 
            }
            .bslib-sidebar-layout .sidebar-content { 
                font-size: 14px; 
            }
            /* Ensure plots blend in if needed */
            .shiny-plot-output {
                background-color: transparent;
            }
        """),
        *args,
        **kwargs
    )

def setup_plot_style():
    """
    Configures Matplotlib to match the web theme (Source Sans 3).
    """
    # Attempt to use Inter if available, else fallback to sans-serif
    plt.rcParams.update({
        "font.family": "sans-serif",
        "font.sans-serif": ["Inter", "Source Sans 3", "DejaVu Sans", "sans-serif"],
        "axes.titlesize": 13,
        "axes.labelsize": 11,
        "xtick.labelsize": 10,
        "ytick.labelsize": 10,
        "figure.figsize": (8, 5),
        "axes.grid": True,
        "grid.alpha": 0.3,
        "grid.linestyle": ":",
        # Transparent backgrounds to match web theme
        "figure.facecolor": "none",
        "axes.facecolor": "none",
        "savefig.facecolor": "none"
    })

Ultimately, the technique boils down to the following approach, called MCMC (Markov chain Monte Carlo). Given $m \in \mathcal{P}(S)$, find some “computationally optimal” transition probabilities $(p^t_{x,y})_{t\ge 0, x,y\in S}$ such that $\sum_{y} m_y p^{t}_{y,x} = m_x$, for all $t\in \mathbb{N}, x\in S$. We will focus on the time-homogeneous case, where $p^t_{x,y}\equiv p_{x,y}$ does not depend on $t$.

Let us give some examples, assuming (with no loss of generality) $m_x>0$ for all $x\in S$.

$p_{x,y}=m_y$, the trivial perfect simulation, yielding a Markov chain which is just an i.i.d. sequence. In this case, the law of $X_t$ converges to $m$ in just one step.
$p_{x,y}=\varepsilon \sqrt{m_y/m_x} \ind{x\neq y} + (1- \varepsilon \sum_{z\neq x} \sqrt{m_z/m_x}) \ind{x=y}$, known as the global heat bath. Here $0< \varepsilon \le \inf_{x,y} \sqrt{m_x/m_y}$, so that $p_{x,y}$ is indeed a transition probability with reversible, and thus invariant, measure $m$.
$p_{x,y}=A_{x,y} \sqrt{m_y/m_x}$, generalizing the previous examples. Here $A_{x,y}$ is a sufficiently small symmetric matrix, meaning \[ \begin{aligned} A_{x,y}=A_{y,x} \ge 0, \qquad x,y \in S, \qquad A_{x,x} \sqrt{m_x}=1-\sum_{z\neq x} S_{x,z} \sqrt{m_z}, \qquad x\in S \end{aligned} \tag{3}\] Clearly $m_x p_{x,y}=A_{x,y} \sqrt{m_x m_y}=m_y p_{y,x}$, so that $m$ is reversible, and thus invariant.

In typical applications, $S$ is large and $m_x$ is a somehow complex measure. The first two examples above are highly impractical. In the first case, we just sample $m$ deterministically. The instantaneous convergence however implies a very high computational cost of the single step needed. In the global heat bath, at each step we need to know the ratios $\sqrt{m_y/m_x}$ for all $x,y\in S$, and despite this being typically simpler to compute than $m_x$ (see Example 2), it still requires the access to $|S|^2$ ratios of the invariant measure values. The third general case is more interesting (and will be further generalized below), but so far does not tell us much about a good choice of $A_{x,y}$.

Metropolis-Hastings dynamics

The problem becomes much more interesting both mathematically and practically, if we assume that some further structure is given. In particular, we assume that one is given some original transition probabilities $q_{x,y}$. Beware, we are not assuming that $m$ is invariant for $q_{x,y}$ here, rather $q_{x,y}$ should be thought as deriving from some feature, for instance geometric properties, of the problem of interest. If $S$ has a graph structure for instance, $q_{x,y}$ can be the transition probabilities of the simple random walk on $S$.

Proposition 1 (Metropolis-Hastings dynamics) Let $S$ be finite, and suppose we are given

A target measure $m\in \mathcal{P}(S)$.
A proposal kernel $Q=(q_{x,y})_{x,y}$, that is a Markov transition kernel.
A skew vorticity kernel $R=(r_{x,y})_{x,y}$, such that $r_{x,y}=-r_{y,x}$ and $\sum_y r_{x,y}=0$.

Assume that $r_{x,y} \ge - m_y q_{y,x}$ for all $x,y\in S$ (in particular, since $R$ is skew symmetric, $- m_y q_{y,x}\le r_{x,y}\le m_x q_{x,y}$). Then \[ p_{x,y}:= \begin{cases} \min(q_{x,y},\frac{r_{x,y}+m_y q_{y,x}}{m_x}) & \text{if $y\neq x$} \\ 1- \sum_{z\neq x} \min(q_{x,z},\frac{r_{x,z}+m_z q_{z,x}}{m_x}) & \text{if $y=x$} \end{cases} \tag{4}\] defines valid Markov transition probabilities, referred to as the (generalized) Metropolis-Hastings kernel, which admits $m$ as invariant probability.

Assume moreover that $q_{x,y}$ is irreducible and that $q_{x,y}>0$ whenever $q_{y,x}>0$. Then $p_{x,y}$ is irreducible and thus $m$ is the unique invariant probability.

Proof. The normalization property $\sum_y p_{x,y}=1$ just follows from the definition of $p_{x,x}$. We next check that $p_{x,y}\ge 0$. This is just the condition $r_{x,y} \ge - m_y q_{y,x}$ for the off-diagonal terms. While $p_{x,x}\ge 0$ since we can bound the sum with the negative sign by $\sum_{z\neq x} q_{x,z}=1-q_{x,x}\le 1$.

We next verify that $m$ is invariant. Indeed \[ \begin{aligned} \sum_{y\neq x} m_x p_{x,y} & = \sum_{y\neq x} \min(m_x q_{x,y},r_{x,y}+m_y q_{y,x}) = \sum_{y\neq x} \min(m_x q_{x,y}- r_{x,y},m_y q_{y,x}) + \sum_{y\neq x} r_{x,y} \\ & = \sum_{y\neq x} \min(m_x q_{x,y}+r_{y,x}, m_y q_{y,x}) =\sum_{y\neq x} m_y p_{y,x} \end{aligned} \] where, from the first to the second line, we used $r_{x,y}=-r_{y,x}$ and $\sum_{y\neq x} r_{x,y}=0$ (since $r_{x,x}=0$).

Metropolis-Hastings algorithms perform very well in several situations, however they feature some weak aspects:

If our task is to converge to the invariant measure as fast as possible, having a chance $p_{x,x}>0$ not to move at all, is hardly optimal.
As $\beta$ increases, the presence of local minima of $U$ can increase dramatically the convergence time of the Markov chain.

Exercise 1 Check that, if $r=0$, then $m$ is reversible for the transition probabilities $p_{x,y}$ defined in Equation 4.

Exercise 2 Verify that hte Metropolis-Hastings dynamics include the aforementioned example: the trivial perfect simulation, the global heat bath, and the symmetric dynamics Equation 3. (Show that for a specific choice of $r_{x,y}$, $q_{x,y}$ one gets those dynamics).

Glauber and Kawasaki dynamics [Sketch]

The Glauber dynamics is a specific instance of MCMC often used in statistical physics for spin systems (e.g., the Ising model), corresponding to a “single-site heat bath.” In particular, we consider the case where $S$ is a product space, say $S = \{-1, +1\}^V$, for some finite graph $V$. We interpret $S$ as a spin space: $V$ each point in $V$ has a feature, say the spin, which can be either $+1$, or $-1$ at a given time. Moreover $V$ is itself a graph, namely we have a notion of “neighbors” of a given $v\in V$. The target measure is the Gibbs measure $m(\sigma) \propto e^{-\beta H(\sigma)}$, where the Hamiltonian is a function $H\colon S\to \mathbb{R}$, e.g. $H(\sigma) = - \sum_{(u,v)\in E} J_{u,v} \sigma_u \sigma_v$. At each step, the dynamics proceeds as follows:

Select a vertex $v \in V$ uniformly at random.
Replace the spin $\sigma_v$ with a new value $\sigma'_v \in \{-1, +1\}$ drawn from the conditional distribution of the spin at $v$ given the configuration of its neighbors.

The transition probability to flip a spin at site $v$ (changing $\sigma$ to $\sigma^v$) is given by: \[ p_{\sigma,\sigma^v} = \frac{1}{|V|} \frac{e^{-\beta H(\sigma^v)}}{e^{-\beta H(\sigma)} + e^{-\beta H(\sigma^v)}} = \frac{1}{|V|} \frac{1}{1 + e^{\beta (H(\sigma^v) - H(\sigma))}} \] This dynamics is reversible with respect to the Gibbs measure. A key feature is that the energy difference $H(\sigma^v) - H(\sigma)$ depends only on the neighbors of $v$, making the update computationally very cheap ($O(1)$ if the degree of the graph is bounded).

While Glauber dynamics allows the total “magnetization” (sum of spins) to change, Kawasaki dynamics is designed to sample from the Gibbs measure conditioned on a fixed number of $+1$ spins (canonical ensemble). For instance we can think in this case that $S=\{0,1\}^V$, $0$ representing an empty site, and $1$ a site occupied by a particle. We want a dynamics that preserves the total number of particles. For instance:

Select a pair of vertices $(u,v)$ connected by an edge in the graph.
Propose to swap the values of $\sigma$ at $u$ and $v$.
Accept the swap with a probability determined by the Metropolis rule: \[ p_{\sigma, \sigma^{u,v}} = \min\left(1, e^{-\beta(H(\sigma^{u,v}) - H(\sigma))}\right) \] where $\sigma^{u,v}$ is the configuration $\sigma$ with the values at $u$ and $v$ swapped.

If $\sigma_u = \sigma_v$, the swap changes nothing and the energy is constant. If they differ, the particle moves. Since particles are only moved and never created or destroyed, the total number of $+1$ spins is conserved throughout the evolution of the chain.

Perfect Simulations

Finally we discuss a slightly different situation, where we want to sample the unknown invariant measure of a given Markov chain, so the transition probabilities are given in this case. We assume that the transition probabilities satisfy the following condition \[ Z:= \sum_y \inf_{x\in S} p_{x,y} >0 \tag{5}\] In other words, there is at least one point on which the Markov chain can jump with uniformly positive probability, regardless of the initial point $x$. While this may seem very restrictive, even if $P=(p_{x,y})_{x,y}$ fails to satisfy Equation 5, maybe some power $P^t$ will. And in this case, we can just apply the same technique to the $t$-multistep transition probability. In particular this will work for any aperiodic chain on a finite or countable state space.

Remark. Assuming Equation 5, we can write \[ p_{x,y}= (1-\varepsilon) q_{x,y}+ \varepsilon \nu_y \tag{6}\] for some $\varepsilon>0$, $\nu\in \mathcal{P}(S)$ and Markov transition probabilities $Q=(q_{x,y})_{x,y}$. For instance we can take $\varepsilon=Z$ and \[ q_{x,y}= (p_{x,y}-\varepsilon \nu_y)/(1-\varepsilon) \]

Notice that, if $p_{x,y}$ is irreducible (or more generally considering only the communicating class containing the support of $\nu$), then $P$ induces a positive recurrent Markov chain (since at each step we have probability $\varepsilon \nu_y>0$ to recur to some point $y$ in the support of $\nu$). In particular, the Markov chain admits a unique invariant distribution $m\in \mathcal{P}(S)$.

Proposition 2 (Perfect Simulation Algorithm) Assume an irreducible Markov chain on a finite or countable state space $S$, with transition probabilities $P=(p_{x,y})$, satisfies Equation 5, and let $m$ be its invariant probability. For any decomposition $p_{x,y}= (1-\varepsilon) q_{x,y}+ \varepsilon \nu_y$ as in Equation 6 the following holds.

Let $\tau$ be a geometric random variable parameter $\varepsilon$, that is $\mathbb{P}(\tau=k)=\varepsilon (1-\varepsilon)^k$ for $k\in \mathbb{N}$. Let $\mathbf{Y}:=(Y_t)$ be a Markov chain with transition probabilities $Q=(q_{x,y})$ and initial distribution $\nu$. Then $Y_\tau$ has distribution $m$.

Proof. Let $\mu$ be the law of $Y_\tau$, then for $y\in S$ \[ \begin{aligned} \mu_y = \sum\nolimits_{k=0}^\infty \mathbb{P}(Y_\tau=y| \tau=k) \mathbb{P}(\tau=k)= \sum\nolimits_{k=0}^\infty \varepsilon (1-\varepsilon)^k \mathbb{P}(Y_k=y) \end{aligned} \] Therefore $\mu =\varepsilon \sum_k (1-\varepsilon)^k \nu Q^k$. We need to check $\mu=\mu P$, since we already know that the invariant measure is unique and the only solution to this equation is $\mu=m$.

In operator’s notation $P= \varepsilon R+(1-\varepsilon) Q$, where $R_{x,y}=\nu_y$ for all $y\in S$. Notice that $\lambda R= \nu$ for all $\lambda \in \mathcal{P}(S)$. Therefore \[ \begin{aligned} \mu P & = \varepsilon \mu R + (1-\varepsilon) \mu Q= \varepsilon \nu + (1-\varepsilon) \left(\varepsilon \sum\nolimits_{k=0}^\infty (1-\varepsilon)^k \nu Q^k \right) Q= \varepsilon \left(\nu +\sum\nolimits_{k=0}^\infty (1-\varepsilon)^{k+1} \nu Q^{k+1} \right) \\ & = \varepsilon \left( \nu (1-\varepsilon)^{0} \nu Q^{0}+ \sum\nolimits_{k=1}^\infty (1-\varepsilon)^{k} \nu Q^{k} \right)=\mu \end{aligned} \]

This means that, if $\nu$ and $q_{x,y}$ are computationally accessible, we have a very practical way to sample $m$:

Sample a geometric random variable $\tau$ with success parameter $\varepsilon$ (e.g. count the number of failures till the first success when flipping a coin with $\varepsilon$).
Sample a point $x_0 \in S$ with distribution $\nu$.
Sample $\tau$ steps of a Markov chain starting at $x_0$ with transition probabilities $(q_{x,y})$.
The position of the Markov chain after $\tau$ step samples $m$.

This approach is perfect, since we are not claiming any convergence in the long time asymptotics. It is just an exact sampling, although the number of iterations we need to simulate $(Y_t)$ is random and (with small probability) possibly very long.