1. Introduction
The journey began by framing image generation as a latent variable model problem. In this initial stage, Denoising Diffusion Probabilistic Models (DDPM) leveraged the Evidence Lower Bound (ELBO) to optimize the reversal of a Markovian noise process. The core objective was centered on "$\epsilon$-prediction," where a neural network learned to estimate and subtract noise step-by-step to recover the data distribution.
As the field matured, researchers shifted from discrete steps to continuous-time modeling. By employing Stochastic Differential Equations (SDEs) and ItÎ calculus, the diffusion process was reimagined as a continuous evolution of data particles within a random force field. This transition allowed for a more rigorous mathematical treatment of how data density transforms over time.
A major theoretical leap occurred with the introduction of the Fokker-Planck equation. Instead of tracking individual particles, this perspective focused on the evolution of the entire probability density. It provided the mathematical proof that for every stochastic process (SDE), there exists a corresponding deterministic flow, the Probability Flow ODE, that results in the exact same marginal distributions.
With the foundation of Fokker-Planck, the paradigm shifted from "denoising" to "navigation." The goal became finding the optimal Velocity Field ($v$) to guide data points along a specific trajectory. This marked the birth of velocity-based learning, where the network predicts the direction and speed of data movement rather than just the noise component.
Flow Matching emerged to clean up and unify the diverse landscape of generative objectives (noise, data, score, and velocity). This era introduced Rectified Flows, which "straighten" the curved paths of traditional diffusion. By enforcing straight-line trajectories, Flow Matching minimized transport costs and enabled high-quality sampling in significantly fewer steps.
To further optimize these paths, the Schrödinger Bridge problem was revisited. This addressed the challenge of transitioning between two distributions (A to B) with minimum energy expenditure and optimal entropy within a finite timeframe. This connection to Optimal Transport theory allowed for direct image-to-image translations and more efficient couplings between arbitrary distributions without strictly relying on a Gaussian prior.
We are currently entering the era of Generator Matching (GM), the "Grand Unified Theory" of generative modeling. GM acts as a meta-framework that encompasses Diffusion, Flow, and even discrete Jump processes. By utilizing the infinitesimal generator ($\mathcal{L}$), it defines all forms of data transformation within a single mathematical operator. This facilitates the creation of truly unified multimodal systems where any data type can be modeled as a flow on a geometric manifold.
Below is a summary of the upcoming sections:
- Section 2 - From Vector Field to Score Functions: ...
- Section 3 - Divergence and Continuity Equation: ...
- Section 4 - Advanced Perspectives: ...
- Section 5 - Optimal Transport and Schrödinger Bridges: ...
- Section 6 - Sampling and Control: ...
- Section 7 - Sophisticated Solvers: ...
- Section 8 - Consistency Models: ...
2. From Vector Field to Score Functions
2.1 Multivariable Calculus
2.2 Vector Fields
Definition of Vector Fields
Let $U$ be a subset of $\mathbb{R}^n$. A function $\mathbf{F}: U \to \mathbb{R}^n$ is called a vector field on $U$.
Conservative Fields
Let $U$ be an open set of $\mathbb{R}^n$. A vector field $\mathbf{F}: U \to \mathbb{R}^n$ is called a conservative, or gradient, field on $U$ if there exists a smooth real-valued function $f: U \to \mathbb{R}$ such that $\mathbf{F} = \nabla f$. Such an $f$ is called a potential function for $\mathbf{F}$.
Line Integrals
Let $U$ be an open set of $\mathbb{R}^n$, and let $\mathbf{F}: U \to \mathbb{R}^n$ be a continuous vector field on $U$. If $\alpha: [a, b] \to U$ is a piecewise smooth path, then the line integral, denoted by $\int_{\alpha} \mathbf{F} \cdot d\mathbf{s}$, is defined to be: $$\int_{\alpha} \mathbf{F} \cdot d\mathbf{s} = \int_a^b \mathbf{F}(\alpha(t)) \cdot \alpha'(t) dt$$
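As a quick numerical sanity check of this definition, the sketch below (with an assumed toy field $\mathbf{F}(x, y) = (y, x)$ and the assumed straight path $\alpha(t) = (t, t)$ on $[0, 1]$) discretizes the defining integral; since $\mathbf{F}$ is conservative with potential $f(x, y) = xy$, the result should be $f(1,1) - f(0,0) = 1$:

```python
import numpy as np

def line_integral(F, alpha, alpha_prime, a=0.0, b=1.0, n=10_001):
    """Approximate int_a^b F(alpha(t)) . alpha'(t) dt with the trapezoid rule."""
    t = np.linspace(a, b, n)
    integrand = np.einsum("ij,ij->i", F(alpha(t)), alpha_prime(t))
    return np.trapz(integrand, t)

F = lambda p: np.stack([p[:, 1], p[:, 0]], axis=1)   # F(x, y) = (y, x)
alpha = lambda t: np.stack([t, t], axis=1)           # straight path (t, t)
alpha_prime = lambda t: np.ones((t.size, 2))         # alpha'(t) = (1, 1)

result = line_integral(F, alpha, alpha_prime)        # close to 1.0
```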
2.3 Score Functions
Definition of Score Functions
"Suppose our dataset consists of i.i.d. samples $\{x_i \in \mathbb{R}^d\}_{i=1}^N$ from an unknown data distribution $p_{data}(x)$. We define the score of a probability density $p(x)$ to be $\nabla_x \log p(x)$. The score network $s_\theta: \mathbb{R}^D \to \mathbb{R}^D$ is a neural network parameterized by $\theta$, which will be trained to approximate the score of $p_{data}(x)$. The goal of generative modeling is to use the dataset to learn a model for generating new samples from $p_{data}(x)$. The framework of score-based generative modeling has two ingredients: score matching and Langevin dynamics." (Yang Song & Stefano Ermon, 2019)
The definition of the Score Function as $\nabla_x \log p(x)$ is not an arbitrary choice; it is a mathematically strategic definition that solves the most fundamental problem in generative modeling: the intractability of the partition function. Here is why Yang Song and Stefano Ermon (building on the work of Aapo HyvÀrinen) use this specific formulation.
Bypassing the Partition Function ($Z$). In traditional generative modeling, a probability density function is often expressed as: $$p(x) = \frac{\tilde{p}(x)}{Z}$$ Where $\tilde{p}(x)$ is the unnormalized density (the part the model can calculate) and $Z = \int \tilde{p}(x) dx$ is the partition function (the normalization constant).
In high-dimensional spaces, calculating $Z$ is nearly impossible because it requires integrating over all possible values of $x$. By taking the gradient of the log-density, we get: $$\nabla_x \log p(x) = \nabla_x \log \left( \frac{\tilde{p}(x)}{Z} \right) = \nabla_x \log \tilde{p}(x) - \nabla_x \log Z$$ Since $Z$ is a constant with respect to $x$, its gradient is zero ($\nabla_x \log Z = 0$). Therefore: $$\nabla_x \log p(x) = \nabla_x \log \tilde{p}(x)$$ So, the Score Function allows us to learn the structure of the data distribution without ever needing to calculate the impossible normalization constant $Z$.
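This cancellation can be verified numerically. The following sketch (with an assumed 1-D Gaussian, where $\tilde{p}(x) = e^{-x^2/2}$ and $Z = \sqrt{2\pi}$) computes finite-difference scores of both the unnormalized and the normalized log-density; the two agree, and $Z$ never enters:

```python
import numpy as np

def numerical_score(log_density, x, h=1e-5):
    """Central finite difference of d/dx log_density(x)."""
    return (log_density(x + h) - log_density(x - h)) / (2 * h)

Z = np.sqrt(2 * np.pi)                        # partition function of N(0, 1)
log_p_tilde = lambda x: -0.5 * x**2           # unnormalized log-density
log_p = lambda x: log_p_tilde(x) - np.log(Z)  # normalized log-density

x = np.linspace(-2, 2, 5)
score_unnormalized = numerical_score(log_p_tilde, x)
score_normalized = numerical_score(log_p, x)
# Both equal -x: the constant log Z vanishes under the gradient.
```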
The Score as a Vector Field of Direction. Mathematically, the gradient $\nabla_x$ points in the direction of the steepest increase of a function. By defining the score as $\nabla_x \log p(x)$, we create a vector field that always points toward regions of higher data density. In regions where there is no data, the score points toward the data manifold. At the 'peaks' (modes) of the distribution, the score becomes zero. This makes the score function a perfect 'guide' for moving noise towards data.
Connection to Langevin Dynamics. The second 'ingredient' mentioned in the quoted definition is Langevin Dynamics. This is a stochastic process used for sampling. The update rule is: $$x_{t+1} = x_t + \epsilon \nabla_x \log p(x_t) + \sqrt{2\epsilon} z_t$$ In this physical analogy, the Score Function acts as a force field. It 'pushes' the sample $x$ toward the high-probability regions of $p_{data}(x)$, while the $z_t$ term adds a bit of random 'jitter' to ensure the sampler explores the whole distribution rather than getting stuck in one spot. Without the gradient of the log-density, we wouldn't have a mathematical "force" to drive the sampling process.
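A minimal sketch of this update rule, assuming a toy target $N(0, 1)$ whose score is known in closed form ($\nabla_x \log p(x) = -x$): the ensemble starts far from the mode and relaxes to the target distribution.

```python
import numpy as np

rng = np.random.default_rng(0)
score = lambda x: -x                   # exact score of the target N(0, 1)

eps = 0.05
x = rng.normal(5.0, 1.0, size=5_000)   # start far from the target mode
for _ in range(2_000):
    z = rng.normal(size=x.shape)       # random 'jitter' term z_t
    x = x + eps * score(x) + np.sqrt(2 * eps) * z
# x is now approximately distributed as N(0, 1).
```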
Fisher Information and Score Matching. Historically, in statistics, $\nabla_\theta \log p(x|\theta)$ is known as the Fisher Score. Song & Ermon adapted this into a spatial score ($\nabla_x$). The reason we can train a neural network $s_\theta(x)$ to approximate this without knowing $p_{data}(x)$ is through Score Matching. HyvÀrinen proved that: $$\mathbb{E}_{p_{data}(x)} [\|\nabla_x \log p_{data}(x) - s_\theta(x)\|^2]$$ Can be rewritten (using integration by parts) into a form that only requires the samples $x_i$ and the derivatives of the neural network, making the 'unknown' $p_{data}$ disappear from the loss function.
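As an illustration of HyvĂ€rinen's identity, the sketch below (a toy 1-D setup, not Song & Ermon's full training pipeline) minimizes the implicit score-matching objective $J(\theta) = \mathbb{E}[\tfrac{1}{2} s_\theta(x)^2 + s_\theta'(x)]$, which uses only samples. With an assumed linear model $s_\theta(x) = \theta x$ and data from $N(0, 1)$, the minimizer recovers $\theta \approx -1$, i.e., the true score $-x$:

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.normal(size=100_000)       # i.i.d. samples from p_data = N(0, 1)

def J(theta):
    """Implicit score-matching objective: needs samples only, not p_data."""
    s = theta * x                  # model score s_theta(x) = theta * x
    ds_dx = theta                  # its spatial derivative
    return np.mean(0.5 * s**2 + ds_dx)

thetas = np.linspace(-3, 1, 401)
theta_star = thetas[np.argmin([J(t) for t in thetas])]
# theta_star is close to -1, matching the true score -x.
```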
Change in Probability (Log-Likelihood)
According to the Fundamental Theorem of Line Integrals, if a vector field $\mathbf{F}$ is conservative (i.e., $\mathbf{F} = \nabla f$), then the line integral along a curve $\alpha$ from point $a$ to point $b$ depends only on the values of the potential function $f$ at the endpoints: $$\int_{\alpha} \nabla f \cdot d\mathbf{s} = f(\alpha(b)) - f(\alpha(a))$$ Applying this theorem to the Score Function: $$\int_{\alpha} \nabla \log p(\mathbf{x}) \cdot d\mathbf{s} = \log p(\mathbf{x}_{final}) - \log p(\mathbf{x}_{initial}) = \log \frac{p(\mathbf{x}_{final})}{p(\mathbf{x}_{initial})}$$
The line integral of the Score Function along a trajectory $\alpha$ represents the total cumulative change in the log-probability of the data as it moves along that path. A positive integral value indicates that the trajectory is moving toward 'more realistic' data regions, i.e., areas characterized by higher probability density.
3. Divergence and Continuity Equation
3.1 Gauss's Theorem and Divergence
Definition of Divergence
While Shimamoto emphasizes geometric intuition, the formal mathematical definition of the Divergence of a vector field $\mathbf{F} = \langle F_1, F_2, \dots, F_n \rangle$ in $n$-dimensional space is the sum of the partial derivatives of its components: $$\text{div } \mathbf{F} = \nabla \cdot \mathbf{F} = \frac{\partial F_1}{\partial x_1} + \frac{\partial F_2}{\partial x_2} + \dots + \frac{\partial F_n}{\partial x_n}$$
In the specific case of 3D space ($n=3$), where the vector field is given by $\mathbf{F} = \langle M, N, P \rangle$, the formula simplifies to: $$\text{div } \mathbf{F} = \frac{\partial M}{\partial x} + \frac{\partial N}{\partial y} + \frac{\partial P}{\partial z}$$
Shimamoto describes Divergence as a measure of the 'flux density' or the rate at which a 'fluid' (or probability density) is expanding or contracting at a specific point:
- Source: If $\text{div } \mathbf{F} > 0$ at a point, the field is 'spreading out.' This indicates a source where matter, energy, or probability is being generated or flowing outward.
- Sink: If $\text{div } \mathbf{F} < 0$ at a point, the field is 'compressing.' This indicates a sink where the flow is accumulating or vanishing.
- Incompressible: If $\text{div } \mathbf{F} = 0$, the flow is steady, with no net gain or loss at that point.
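These three cases can be checked numerically with a central-difference divergence. In the sketch below (assumed toy fields), $\mathbf{F}(x, y) = (x, y)$ is a pure source with divergence $2$ everywhere, while the rotation $\mathbf{G}(x, y) = (-y, x)$ is incompressible:

```python
import numpy as np

def divergence(field, point, h=1e-5):
    """Central-difference estimate of div F at a point."""
    p = np.asarray(point, dtype=float)
    div = 0.0
    for i in range(p.size):
        e = np.zeros_like(p)
        e[i] = h
        div += (field(p + e)[i] - field(p - e)[i]) / (2 * h)
    return div

F = lambda p: np.array([p[0], p[1]])    # radial outflow: a source
G = lambda p: np.array([-p[1], p[0]])   # rigid rotation: incompressible

div_F = divergence(F, [0.3, -0.7])      # close to 2.0
div_G = divergence(G, [0.3, -0.7])      # close to 0.0
```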
Gauss's Theorem
Gauss's Theorem, also known as the Divergence Theorem, states that the total flux of a vector field $\mathbf{F}$ through a closed surface $S$ is equal to the integral of the divergence of $\mathbf{F}$ over the volume $V$ enclosed by $S$: $$\iint_S \mathbf{F} \cdot d\mathbf{S} = \iiint_V (\nabla \cdot \mathbf{F}) dV = \iiint_V \text{div } \mathbf{F} dV$$
In the context of Diffusion Models, Gauss's Theorem provides a mathematical framework for understanding how local changes in probability density (as measured by divergence) relate to the overall flow of probability mass across space. It helps us analyze how probability 'flows' from low-density regions (sources) to high-density regions (sinks) during the generative process.
3.2 Continuity Equation and Fokker-Planck Equation
Continuity Equation
The Continuity Equation is a fundamental principle in physics that describes the conservation of some quantity (like mass, energy, or probability) as it flows through space and time. In the context of probability density $p(\mathbf{x}, t)$ evolving under a velocity field $\mathbf{v}(\mathbf{x}, t)$, the Continuity Equation can be expressed as: $$\frac{\partial p}{\partial t} + \nabla \cdot (p \mathbf{v}) = 0$$
This equation states that the rate of change of probability density at any point in space and time is balanced by the net flow of probability into or out of that point. It ensures that probability is conserved during the generative process modeled by diffusion.
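The conservation statement can be verified on an assumed closed-form example: the Gaussian family $p_t = N(0, \sigma_t^2)$ with $\sigma_t = 1 + t$ is transported exactly by the velocity field $v(x, t) = x / (1 + t)$, so the residual $\partial_t p + \partial_x(p v)$ should vanish everywhere:

```python
import numpy as np

def p(x, t):
    """Density of N(0, (1 + t)^2)."""
    s = 1.0 + t
    return np.exp(-x**2 / (2 * s**2)) / (s * np.sqrt(2 * np.pi))

v = lambda x, t: x / (1.0 + t)   # velocity field transporting this family

x = np.linspace(-4, 4, 1001)
t, h = 0.5, 1e-5
dx = x[1] - x[0]

dp_dt = (p(x, t + h) - p(x, t - h)) / (2 * h)   # time derivative
flux = p(x, t) * v(x, t)                         # probability flux p * v
div_flux = np.gradient(flux, dx)                 # spatial divergence

residual = dp_dt + div_flux                      # continuity-equation residual
```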
ItÎ's Stochastic Differential Equation
ItÎ's Stochastic Differential Equation (SDE) is a mathematical framework for modeling systems that evolve with both deterministic and stochastic components. An SDE can be written in the form: $$d\mathbf{X}_t = \mathbf{f}(\mathbf{X}_t, t) dt + \mathbf{G}(\mathbf{X}_t, t) d\mathbf{W}_t$$ where $\mathbf{X}_t$ is the state of the system at time $t$, $\mathbf{f}$ is the drift term representing deterministic dynamics, $\mathbf{G}$ is the diffusion term representing stochastic dynamics, and $d\mathbf{W}_t$ is a Wiener process (or Brownian motion).
In Diffusion Models, the generative process can be modeled as an SDE where the drift term guides samples toward high-density regions of the data distribution, while the diffusion term adds noise to ensure diversity in generated samples.
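A minimal Euler-Maruyama simulation illustrates this interplay, assuming a toy Ornstein-Uhlenbeck SDE $d x = -x\, dt + \sqrt{2}\, dw$ whose stationary distribution is $N(0, 1)$: the drift pulls samples toward the mode while the diffusion term keeps them spread out.

```python
import numpy as np

rng = np.random.default_rng(2)
dt, n_steps = 0.01, 1_000
x = rng.normal(3.0, 0.1, size=10_000)    # start concentrated far from the mode

for _ in range(n_steps):
    drift = -x * dt                                        # f(x, t) dt
    noise = np.sqrt(2 * dt) * rng.normal(size=x.shape)     # G dW
    x = x + drift + noise
# After t = 10 the ensemble has relaxed to the N(0, 1) stationary law.
```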
Fokker-Planck Equation
The Fokker-Planck Equation describes the time evolution of the probability density function of a stochastic process governed by an SDE. For an SDE of the form given above, the corresponding Fokker-Planck Equation is: $$\frac{\partial p}{\partial t} = -\nabla \cdot (\mathbf{f} p) + \frac{1}{2} \nabla^2 : (\mathbf{G} \mathbf{G}^T p)$$ where $\nabla^2 : (\mathbf{G} \mathbf{G}^T p)$ denotes the double contraction $\sum_{i,j} \frac{\partial^2}{\partial x_i \partial x_j} \left[ (\mathbf{G}\mathbf{G}^T)_{ij}\, p \right]$, the second-order term generated by the stochastic component.
In the context of Diffusion Models, the Fokker-Planck Equation provides a theoretical framework for understanding how the probability density evolves over time under the influence of both deterministic and stochastic forces. It helps us analyze how samples transition from noise to structured data as they are guided by the Score Function and influenced by random perturbations.
Sub-Summary 1
Stokes' Theorem
Stokes' Theorem is a powerful mathematical tool that relates the surface integral of a vector field over a surface $S$ to the line integral of the same vector field around the boundary curve $\partial S$ of that surface. Formally, it can be expressed as: $$\oint_{\partial S} \mathbf{F} \cdot d\mathbf{s} = \iint_S (\nabla \times \mathbf{F}) \cdot d\mathbf{S}$$
Given that the Score Function is defined as the gradient of the log-density, $\mathbf{s} = \nabla \log p$, it inherently constitutes a conservative vector field. A fundamental property of such fields is that their curl is identically zero: $$\nabla \times \mathbf{s} = \nabla \times (\nabla \log p) = 0$$ By invoking Stokes' Theorem, it can be demonstrated that the line integral around any closed loop $C$ vanishes, as it equates to the surface integral of a null curl: $$\oint_C \mathbf{s} \cdot d\mathbf{r} = \iint_S (\nabla \times \mathbf{s}) \cdot d\mathbf{S} = 0$$ This mathematical constraint carries profound implications for model convergence. Should the Score Function exhibit any rotational component ($\text{curl} \neq 0$), data particles propelled by the score would risk becoming trapped in limit cycles, thereby failing to converge toward the true data distribution. Consequently, the 'potential function' nature of the Score Function ensures a pure probability flow, guiding the sampling process directly and efficiently toward the high-density regions of the data manifold.
Sub-Summary 2
Probability mass is conserved. The rate at which probability density $p(\mathbf{x},t)$ changes inside a volume $V$ must equal the net flux $\mathbf{j}$ passing through the boundary $\partial V$. $V$ represents an arbitrary volume in an n-dimensional space (in the context of Diffusion, this space corresponds to your data domain).
To relate the surface flow (Flux) to the internal change (Density), we use Gauss's Theorem: $$\iint_{\partial V} \mathbf{j} \cdot d\mathbf{S} = \iiint_V (\nabla \cdot \mathbf{j}) dV$$ This theorem is the 'key' that allows us to translate a global conservation law into a local differential equation.
Since probability is conserved, the increase inside equals the negative of the outward flux: $$\iiint_V \frac{\partial p}{\partial t} \, dV = - \iint_{\partial V} \mathbf{j} \cdot d\mathbf{S}$$ By applying Gauss's Theorem, we derive the Continuity Equation, ensuring that data points are neither created nor destroyed, only moved: $$\iiint_V \frac{\partial p}{\partial t} \, dV = - \iiint_V (\nabla \cdot \mathbf{j}) \, dV$$ Thus, we arrive at the local form of the Continuity Equation: $$\frac{\partial p_t(\mathbf{x})}{\partial t} = -\nabla \cdot \mathbf{j}(\mathbf{x}, t)$$
The flux $\mathbf{j}$ is generated by the motion of 'data particles' following an ItÎ Stochastic Differential Equation (SDE): $$d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)dt + \mathbf{G}(t)d\mathbf{w}$$ where $\mathbf{f}$ is the drift term (deterministic part) and $\mathbf{G}(t)d\mathbf{w}$ is the diffusion term (stochastic part).
The total flux $\mathbf{j}$ in this process consists of two parts, deterministic drift and stochastic diffusion: $$\mathbf{j}(\mathbf{x}, t) = \underbrace{\mathbf{f}(\mathbf{x}, t) p_t(\mathbf{x})}_{\text{Drift flux}} - \underbrace{\frac{1}{2} \nabla \cdot [\mathbf{G}\mathbf{G}^T p_t(\mathbf{x})]}_{\text{Diffusion flux}}$$ To introduce the Score Function $s(\mathbf{x}) = \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$, we use the following identity: $$\nabla \cdot (\mathbf{A} p) = p (\nabla \cdot \mathbf{A}) + \mathbf{A} \nabla p \text{ (Mathematical Identity)} $$ Here, $\mathbf{A}$ is the Diffusion Tensor ($\mathbf{A} = \mathbf{G}\mathbf{G}^T$). It represents the structure and correlation of the noise. Assuming the noise $\mathbf{G}$ is constant with respect to space ($\nabla \cdot \mathbf{A} = 0$), and using the log-derivative trick ($\nabla p = p \nabla \log p$), the flux simplifies to: $$\mathbf{j}(\mathbf{x}, t) = \left[ \mathbf{f}(\mathbf{x}, t) - \frac{1}{2} \mathbf{G}\mathbf{G}^T \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \right] p_t(\mathbf{x})$$ The Final Fokker-Planck Equation. By substituting $\mathbf{j}$ back into the Continuity Equation, we obtain the governing equation for the entire Diffusion process expressed via the Score Function: $$\frac{\partial p_t(\mathbf{x})}{\partial t} = -\nabla \cdot \left( \underbrace{\left[ \mathbf{f}(\mathbf{x}, t) - \frac{1}{2} \mathbf{G}\mathbf{G}^T \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \right]}_{\text{Velocity Field } \mathbf{v}(\mathbf{x}, t)} p_t(\mathbf{x}) \right)$$
This equation is the mathematical bridge that allows us to replace stochastic uncertainty with deterministic flow. It proves that by simply learning the gradient of the log-density (the Score), a neural network can perfectly simulate the complex evolution of data distributions.
4. Advanced Perspectives
4.1 Flow Matching Framework
If we already have a velocity field defined by the Fokker-Planck equation, why do we still need Flow Matching? The Fokker-Planck equation provides a theoretical framework for understanding how probability densities evolve under the influence of both deterministic and stochastic forces. However, it does not prescribe a specific method for training a neural network to learn this evolution. Flow Matching is a practical framework that allows us to directly train a neural network to match the velocity field defined by the Fokker-Planck equation. It provides a way to optimize the parameters of the neural network so that it can accurately capture the dynamics of the data distribution as it evolves over time.
In Sub-Summary 2, we showed that the Fokker-Planck equation can be expressed in terms of the Continuity Equation: $$\frac{\partial p_t(\mathbf{x})}{\partial t} = -\nabla \cdot \left( \mathbf{v}(\mathbf{x}, t) p_t(\mathbf{x}) \right)$$ Where $\mathbf{v}(\mathbf{x}, t) = \mathbf{f}(\mathbf{x}, t) - \frac{1}{2} \mathbf{G}\mathbf{G}^T \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ is the velocity field that governs the flow of probability density.
This equation asserts that although data particles move randomly (SDE), their density evolves deterministically along a flow. This observation opens up the possibility of replacing the stochastic differential equation with an ordinary differential equation (ODE): $$\frac{d\mathbf{x}}{dt} = \mathbf{v}(\mathbf{x}, t)$$
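This SDE-to-ODE equivalence can be checked on an assumed analytically tractable case: for the OU process $dx = -x\,dt + \sqrt{2}\,dw$ started at $N(3, 0.25^2)$, the marginal stays Gaussian with mean $m_t = 3e^{-t}$ and variance $v_t = 0.0625 e^{-2t} + (1 - e^{-2t})$, so the score is known in closed form. Integrating the Probability Flow ODE $\frac{dx}{dt} = f - \frac{1}{2}GG^T \nabla_x \log p_t$ deterministically reproduces the SDE marginal at $t = T$:

```python
import numpy as np

rng = np.random.default_rng(3)

# Closed-form marginal statistics of the OU SDE dx = -x dt + sqrt(2) dw
m = lambda t: 3.0 * np.exp(-t)
var = lambda t: 0.0625 * np.exp(-2 * t) + 1.0 - np.exp(-2 * t)
score = lambda x, t: -(x - m(t)) / var(t)   # exact score of N(m_t, var_t)

dt, T = 1e-3, 1.0
x = rng.normal(3.0, 0.25, size=20_000)
t = 0.0
while t < T - 1e-9:
    vel = -x - score(x, t)   # v = f - (1/2) G G^T s, with f = -x, G G^T = 2
    x = x + vel * dt         # deterministic Euler step: no noise injected
    t += dt
# The deterministic ensemble now matches the SDE marginal N(m(T), var(T)).
```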
Neural Ordinary Differential Equations
Let $\mathbf{z}(t)$ be a continuous random variable with probability $p(\mathbf{z}(t))$ dependent on time. Let $\frac{d\mathbf{z}}{dt} = f(\mathbf{z}(t),t)$ be a differential equation describing a continuous-in-time transformation of $\mathbf{z}(t)$. Assuming that $f$ is uniformly Lipschitz continuous in $\mathbf{z}$ and continuous in $t$, then the change in log probability also follows a differential equation, $$\frac{\partial \log p(\mathbf{z}(t))}{\partial t} = -\text{tr}\left(\frac{\partial f}{\partial \mathbf{z}(t)}\right)$$
All flow-based models must adhere to the principle of probability conservation. As shown in Sub-Summary 2, the continuity equation describes the evolution of the density $p_t$ under the influence of the velocity field $v_t$: $$\frac{\partial p_t(\mathbf{x})}{\partial t} + \nabla \cdot (p_t(\mathbf{x}) \mathbf{v}(\mathbf{x}, t)) = 0$$ Expanding the divergence of the product ($\nabla \cdot (p \mathbf{v}) = (\nabla p) \cdot \mathbf{v} + p (\nabla \cdot \mathbf{v})$), we have: $$\frac{\partial p_t}{\partial t} + \mathbf{v} \cdot \nabla p_t + p_t (\nabla \cdot \mathbf{v}) = 0$$ During the inference process, rather than observing density fluctuations from a fixed vantage point, we track the trajectory of data particles as they evolve according to an Ordinary Differential Equation (ODE): $\frac{d\mathbf{x}}{dt} = \mathbf{v}(\mathbf{x}, t)$. The rate of change in density perceived by a particle as it moves is defined as the Total Derivative: $$\frac{d p_t(\mathbf{x}(t))}{dt} = \underbrace{\frac{\partial p_t}{\partial t}}_{\text{Local change}} + \underbrace{\frac{d\mathbf{x}}{dt} \cdot \nabla p_t}_{\text{Convective change}}$$ By substituting the velocity vector $\frac{d\mathbf{x}}{dt} = \mathbf{v}$ into the equation, it becomes evident that the expression $\left( \frac{\partial p_t}{\partial t} + \mathbf{v} \cdot \nabla p_t \right)$ constitutes the total derivative $\frac{d p_t}{dt}$.
Drawing from the fundamental equations established in the initial step, we can derive the following relationship: $$\frac{d p_t}{dt} = -p_t (\nabla \cdot \mathbf{v})$$ Dividing both sides of the equation by $p_t$ yields:$$\frac{1}{p_t} \frac{d p_t}{dt} = -\nabla \cdot \mathbf{v}$$In accordance with the Chain Rule in calculus (specifically $\frac{d \log u}{dt} = \frac{1}{u} \frac{du}{dt}$), the left-hand side of the equation can be reformulated as: $$\frac{d \log p_t(\mathbf{x}(t))}{dt} = -\nabla \cdot \mathbf{v}(\mathbf{x}, t)$$ Furthermore, the Divergence of the velocity field $\mathbf{v}$ is defined as the sum of the partial derivatives of its components with respect to their corresponding directions: $$\nabla \cdot \mathbf{v} = \frac{\partial v_1}{\partial x_1} + \frac{\partial v_2}{\partial x_2} + \dots + \frac{\partial v_n}{\partial x_n}$$ When examining the Jacobian matrix $\mathbf{J} = \frac{\partial \mathbf{v}}{\partial \mathbf{x}}$ (where each element $J_{ij} = \frac{\partial v_i}{\partial x_j}$), it is observed that the terms comprising the divergence are identical to the diagonal elements of this matrix. In the field of linear algebra, the sum of these diagonal entries is referred to as the Trace ($\text{Tr}$). Consequently: $$\nabla \cdot \mathbf{v} = \text{Tr} \left( \frac{\partial \mathbf{v}}{\partial \mathbf{x}} \right)$$ By consolidating these analytical steps, we arrive at the final identity regarding the log-density change over time: $$\frac{d \log p_t(\mathbf{x})}{dt} = -\text{Tr} \left( \frac{\partial \mathbf{v}_\theta}{\partial \mathbf{x}} \right)$$
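In high dimensions, computing $\text{Tr}(\partial \mathbf{v}/\partial \mathbf{x})$ exactly requires $d$ Jacobian columns, which is why practical implementations often use Hutchinson's stochastic estimator, $\text{Tr}(\mathbf{J}) = \mathbb{E}_{\mathbf{z}}[\mathbf{z}^T \mathbf{J} \mathbf{z}]$ with $\mathbf{z} \sim N(0, \mathbf{I})$, built from Jacobian-vector products. A sketch under an assumed linear velocity field $\mathbf{v}(\mathbf{x}) = \mathbf{A}\mathbf{x}$ (so the exact answer $\text{Tr}(\mathbf{A})$ is available for comparison):

```python
import numpy as np

rng = np.random.default_rng(4)
d = 10
A = rng.normal(size=(d, d))

# For the linear field v(x) = A x, the Jacobian is A,
# so a Jacobian-vector product is simply z -> A z.
jvp = lambda z: A @ z

def hutchinson_trace(jvp, dim, n_samples=20_000):
    """Estimate Tr(J) as the mean of z^T (J z) over Gaussian probes z."""
    z = rng.normal(size=(n_samples, dim))
    Jz = np.array([jvp(zi) for zi in z])    # one JVP per probe vector
    return np.mean(np.einsum("ij,ij->i", z, Jz))

trace_est = hutchinson_trace(jvp, d)
trace_true = np.trace(A)                    # exact value for comparison
```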
Probability Paths
Flow Matching Framework
Conditional Flow Matching
4.2 Four Prediction Types Are Equivalent
$\epsilon$-Prediction - Noise Prediction
$x$-Prediction - Clean Prediction
Score Prediction
$v$-Prediction - Velocity Prediction
Let the stochastic process be governed by the general ItÎ SDE: $$d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)dt + \mathbf{G}d\mathbf{w}$$ where $\mathbf{f}(\mathbf{x}, t)$ is the drift vector and $\mathbf{G}$ is the diffusion matrix. The corresponding Probability Flow ODE that preserves the same marginal probability density $p_t(\mathbf{x})$ is defined as: $$\frac{d\mathbf{x}}{dt} = \mathbf{v}(\mathbf{x}, t) = \mathbf{f}(\mathbf{x}, t) - \frac{1}{2} \mathbf{G}\mathbf{G}^T \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$$ Given the optimal neural network predictions for noise ($\boldsymbol{\epsilon}^*$), clean data ($\mathbf{x}_0^*$), score ($\mathbf{s}^*$), and velocity ($\mathbf{v}^*$), the following generalized equivalences hold:
- Score and Noise Relationship. $$\boldsymbol{\epsilon}^*(\mathbf{x}_t, t) = -\sigma_t \mathbf{s}^*(\mathbf{x}_t, t)$$ (The noise prediction is a scaled version of the negative score function.)
- Score and Clean Data Relationship. $$\mathbf{x}_0^*(\mathbf{x}_t, t) = \frac{1}{\alpha_t} \left( \mathbf{x}_t + \sigma_t^2 \mathbf{s}^*(\mathbf{x}_t, t) \right)$$ (The clean data estimate is obtained by moving $\mathbf{x}_t$ along the score direction and rescaling by $\alpha_t$; this is Tweedie's formula.)
- Velocity and SDE Components (The Bridge). The optimal velocity prediction $\mathbf{v}^*$ is equivalent to the drift of the Probability Flow ODE: $$\mathbf{v}^*(\mathbf{x}_t, t) = \mathbf{f}(\mathbf{x}_t, t) - \frac{1}{2} \mathbf{G}\mathbf{G}^T \mathbf{s}^*(\mathbf{x}_t, t)$$
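The first two equivalences follow algebraically from the common parameterization $\mathbf{x}_t = \alpha_t \mathbf{x}_0 + \sigma_t \boldsymbol{\epsilon}$, and can be verified directly. The sketch below (assumed illustrative values of $\alpha_t$, $\sigma_t$, and a synthetic $(\mathbf{x}_0, \boldsymbol{\epsilon})$ pair, for which the optimal score is $\mathbf{s} = -\boldsymbol{\epsilon}/\sigma_t$) checks that the conversions are mutually consistent:

```python
import numpy as np

def score_to_eps(score, sigma_t):
    """epsilon* = -sigma_t * s*"""
    return -sigma_t * score

def score_to_x0(x_t, score, alpha_t, sigma_t):
    """x0* = (x_t + sigma_t^2 * s*) / alpha_t  (Tweedie's formula)"""
    return (x_t + sigma_t**2 * score) / alpha_t

def eps_to_x0(x_t, eps, alpha_t, sigma_t):
    """x0* = (x_t - sigma_t * eps*) / alpha_t  (invert the forward process)"""
    return (x_t - sigma_t * eps) / alpha_t

rng = np.random.default_rng(5)
alpha_t, sigma_t = 0.8, 0.6
x0 = rng.normal(size=4)
eps = rng.normal(size=4)
x_t = alpha_t * x0 + sigma_t * eps   # forward parameterization

s = -eps / sigma_t                    # optimal score for this (x0, eps) pair
```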
5. Optimal Transport and Schrödinger Bridges
5.1 Schrödinger Bridges
The Schrödinger Bridge is a foundational concept in probability theory, optimal transport, and statistical physics, originally formulated by physicist Erwin Schrödinger in 1931.
Conceptually, it answers a thought experiment: Imagine you observe a cloud of particles undergoing random motion (like Brownian motion) at time $t=0$ with a specific starting distribution. At a later time $t=T$, you observe them again, but they form a completely different distribution, one that is highly unlikely to have occurred naturally by pure random chance. What is the most probable path those particles took to get from the initial distribution to the final one?
The Schrödinger Bridge finds the most likely sequence of intermediate states, bridging the two distributions with minimal unnatural forcing.
Let $\Omega$ be the space of continuous paths over a time interval $[0, T]$ with values in a state space (such as $\mathbb{R}^d$).
- Let $\mathbb{W}$ be a reference measure on this path space. This typically represents the 'prior' or unforced dynamics, such as the path measure of a standard Brownian motion.
- Let $p_0$ and $p_T$ be the prescribed marginal probability distributions at the initial time $t=0$ and the final time $t=T$.
- Let $\mathbb{P}$ be any candidate path measure whose endpoints match $p_0$ and $p_T$.
Before applying the bridge, we must establish the 'reference measure' or the prior unconstrained dynamics. In the continuous-time diffusion framework, the baseline motion of data particles is governed by an ItÎ Stochastic Differential Equation (SDE): $$d\mathbf{x} = \mathbf{f}(\mathbf{x}, t)dt + \mathbf{G}d\mathbf{w}$$ Where $\mathbf{f}(\mathbf{x}, t)$ represents the deterministic drift vector and $\mathbf{G}$ is the diffusion matrix controlling the stochastic noise.
As established by the local form of the Continuity Equation and Gauss's Theorem, probability mass is conserved during this process. The time evolution of the probability density $p_t(\mathbf{x})$ under these reference dynamics is perfectly described by the standard Fokker-Planck Equation: $$\frac{\partial p_t(\mathbf{x})}{\partial t} = -\nabla \cdot \left( \left[ \mathbf{f}(\mathbf{x}, t) - \frac{1}{2} \mathbf{G}\mathbf{G}^T \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \right] p_t(\mathbf{x}) \right)$$ In this unconstrained state, the velocity field guiding the probability flow is naturally $\mathbf{v}(\mathbf{x}, t) = \mathbf{f}(\mathbf{x}, t) - \frac{1}{2} \mathbf{G}\mathbf{G}^T \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$.
The Schrödinger Bridge Problem imposes two strict boundary conditions: an initial distribution $p_0(\mathbf{x})$ and a specific, complex target distribution $p_T(\mathbf{x})$. The goal is to find a new stochastic process that exactly bridges these two distributions in finite time $T$, while minimizing the Kullback-Leibler (KL) divergence with respect to the reference SDE defined above.
To force the particles to reach $p_T$ instead of diffusing endlessly into noise, we must apply an external 'control force' to the drift. According to stochastic control theory and the Hopf-Cole transformation, the optimal controlled SDE that solves the Schrödinger Bridge takes the following form: $$d\mathbf{x} = \left[ \mathbf{f}(\mathbf{x}, t) + \mathbf{G}\mathbf{G}^T \nabla_{\mathbf{x}} \log \Psi(\mathbf{x}, t) \right] dt + \mathbf{G}d\mathbf{w}$$ Here, $\Psi(\mathbf{x}, t)$ is a scalar function known as the forward Schrödinger potential. The term $\mathbf{G}\mathbf{G}^T \nabla_{\mathbf{x}} \log \Psi(\mathbf{x}, t)$ represents the precise, minimal mathematical guidance required to steer the probability flow toward the destination distribution.
The Schrödinger-Fokker-Planck Equation
When we inject this optimal, controlled drift back into the continuity framework, the fundamental velocity field driving the density changes. We substitute the new drift $\mathbf{f}_{SB} = \mathbf{f} + \mathbf{G}\mathbf{G}^T \nabla \log \Psi$ into the flux equation. The updated velocity field for the Schrödinger Bridge becomes: $$\mathbf{v}_{SB}(\mathbf{x}, t) = \mathbf{f}(\mathbf{x}, t) + \mathbf{G}\mathbf{G}^T \nabla_{\mathbf{x}} \log \Psi(\mathbf{x}, t) - \frac{1}{2} \mathbf{G}\mathbf{G}^T \nabla_{\mathbf{x}} \log p_t(\mathbf{x})$$ Consequently, the modified Fokker-Planck Equation governing the precise evolution of the bridged density is: $$\frac{\partial p_t(\mathbf{x})}{\partial t} = -\nabla \cdot \left( \left[ \mathbf{f}(\mathbf{x}, t) + \mathbf{G}\mathbf{G}^T \nabla_{\mathbf{x}} \log \Psi(\mathbf{x}, t) - \frac{1}{2} \mathbf{G}\mathbf{G}^T \nabla_{\mathbf{x}} \log p_t(\mathbf{x}) \right] p_t(\mathbf{x}) \right)$$
System of Coupled Potentials
Unlike standard diffusion models that only require learning a single score function $\nabla_{\mathbf{x}} \log p_t(\mathbf{x})$ pointing toward the data manifold, solving the SB-modified Fokker-Planck equation requires modeling the interaction between the start and end points simultaneously. The probability density at any intermediate time $t$ in a Schrödinger Bridge is mathematically factored into the product of two distinct potentials: $$p_t(\mathbf{x}) = \Psi(\mathbf{x}, t) \hat{\Psi}(\mathbf{x}, t)$$
- $\Psi(\mathbf{x}, t)$ is the forward potential (ensuring the system respects the prior dynamic structure).
- $\hat{\Psi}(\mathbf{x}, t)$ is the backward potential (acting as the gravitational pull of the target distribution $p_T$).
The Substitution
If we substitute these new SB components into the velocity field equation, we get: $$\mathbf{v}_{SB}(\mathbf{x}, t) = \underbrace{\mathbf{f}(\mathbf{x}, t) + \mathbf{G}\mathbf{G}^T \nabla_{\mathbf{x}} \log \Psi(\mathbf{x}, t)}_{\text{New Controlled Drift}} - \frac{1}{2} \mathbf{G}\mathbf{G}^T \nabla_{\mathbf{x}} \log \big( \Psi(\mathbf{x}, t) \hat{\Psi}(\mathbf{x}, t) \big)$$ Because the logarithm of a product is the sum of the logarithms, we can expand the gradient of the log-density: $$\nabla_{\mathbf{x}} \log p_t = \nabla_{\mathbf{x}} \log \Psi + \nabla_{\mathbf{x}} \log \hat{\Psi}$$ Substitute this expansion back into the velocity equation: $$\mathbf{v}_{SB}(\mathbf{x}, t) = \mathbf{f} + \mathbf{G}\mathbf{G}^T \nabla_{\mathbf{x}} \log \Psi - \frac{1}{2} \mathbf{G}\mathbf{G}^T \nabla_{\mathbf{x}} \log \Psi - \frac{1}{2} \mathbf{G}\mathbf{G}^T \nabla_{\mathbf{x}} \log \hat{\Psi}$$ Combine the $\nabla_{\mathbf{x}} \log \Psi$ terms: $$\mathbf{v}_{SB}(\mathbf{x}, t) = \mathbf{f} + \frac{1}{2} \mathbf{G}\mathbf{G}^T \nabla_{\mathbf{x}} \log \Psi - \frac{1}{2} \mathbf{G}\mathbf{G}^T \nabla_{\mathbf{x}} \log \hat{\Psi}$$ Factor out the $\frac{1}{2} \mathbf{G}\mathbf{G}^T$: $$\mathbf{v}_{SB}(\mathbf{x}, t) = \mathbf{f}(\mathbf{x}, t) + \frac{1}{2} \mathbf{G}\mathbf{G}^T \Big( \nabla_{\mathbf{x}} \log \Psi(\mathbf{x}, t) - \nabla_{\mathbf{x}} \log \hat{\Psi}(\mathbf{x}, t) \Big)$$ Using the logarithm quotient rule, we arrive at the final, simplified velocity field: $$\mathbf{v}_{SB}(\mathbf{x}, t) = \mathbf{f}(\mathbf{x}, t) + \frac{1}{2} \mathbf{G}\mathbf{G}^T \nabla_{\mathbf{x}} \log \frac{\Psi(\mathbf{x}, t)}{\hat{\Psi}(\mathbf{x}, t)}$$
Practical Implementation via IPF
Analytically solving this modified Fokker-Planck equation for high-dimensional data (such as images) is intractable. In modern machine learning, this is addressed with the Diffusion Schrödinger Bridge (DSB) framework.
Instead of solving the PDEs directly, DSB approximates the solution with an algorithm called Iterative Proportional Fitting (IPF). IPF trains two separate neural networks, one estimating the forward drift and one estimating the backward drift, and alternates between simulating forward and backward SDE trajectories. Through successive half-iterations (simulating forward to update the backward network, then backward to update the forward network), the modeled velocity field progressively aligns with the theoretical $\mathbf{v}_{SB}(\mathbf{x}, t)$, enabling efficient mapping between arbitrary domains.
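IPF has a classical discrete counterpart that conveys the intuition: alternately rescale a joint coupling so its two marginals match $p_0$ and $p_T$. DSB performs the continuous, neural version of these half-iterations; the toy NumPy sketch below uses a 1-D grid and a Gaussian prior kernel of my own choosing:

```python
import numpy as np

# Toy discrete analogue of Iterative Proportional Fitting (IPF):
# alternately rescale a fixed prior kernel K so the coupling's row
# marginal matches p0 and its column marginal matches pT.
# The grid and kernel bandwidth are illustrative assumptions.

def ipf(K, p0, pT, iters=1000):
    """Alternate half-iterations: fit rows to p0, then columns to pT."""
    v = np.ones_like(pT)
    for _ in range(iters):
        u = p0 / (K @ v)        # "forward" half-iteration
        v = pT / (K.T @ u)      # "backward" half-iteration
    return u[:, None] * K * v[None, :]

grid = np.linspace(-2.0, 2.0, 5)
p0 = np.exp(-(grid + 1) ** 2); p0 /= p0.sum()   # source marginal
pT = np.exp(-(grid - 1) ** 2); pT /= pT.sum()   # target marginal
K = np.exp(-(grid[:, None] - grid[None, :]) ** 2 / 4.0)  # prior kernel

coupling = ipf(K, p0, pT)
print(np.allclose(coupling.sum(axis=1), p0),
      np.allclose(coupling.sum(axis=0), pT))    # both marginals matched
```

After enough alternations both marginal constraints hold simultaneously, which is exactly the convergence behavior DSB relies on in function space.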
5.2 Optimal Transport as a Special Case of Schrödinger Bridges
The Dynamic Formulation of Optimal Transport
To see the connection, we must look at Optimal Transport not through static maps, but through its dynamic formulation, famously known as the Benamou-Brenier formula.
Imagine moving a pile of sand (distribution $p_0$) to a hole (distribution $p_T$) over a time interval $t \in [0,T]$. Standard OT seeks the absolute 'cheapest' flow: minimizing the kinetic energy of the transport without any random scattering of the sand.
Formula for OT (Benamou-Brenier): $$W_2^2(p_0, p_T) = \inf_{\rho, \mathbf{v}} \int_0^T \int_{\mathbb{R}^d} \frac{1}{2} \|\mathbf{v}(\mathbf{x}, t)\|^2 \rho(\mathbf{x}, t) d\mathbf{x} dt$$ Subject to the Continuity Equation (deterministic flow): $$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{v}) = 0$$ With boundary conditions: $\rho(\mathbf{x}, 0) = p_0(\mathbf{x})$ and $\rho(\mathbf{x}, T) = p_T(\mathbf{x})$. In this deterministic model, the equivalent SDE describing particle motion has no noise: $$d\mathbf{x} = \mathbf{v}(\mathbf{x}, t)dt$$
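As a concrete toy illustration of the deterministic dynamic, the snippet below pushes Gaussian samples from mean $-1$ to mean $+1$ by forward-Euler integration of $d\mathbf{x} = \mathbf{v}\,dt$. The constant velocity field and the two endpoint distributions are illustrative choices, not the solution of a general OT problem:

```python
import numpy as np

# Toy deterministic transport: integrate dx = v dt with a constant
# velocity v = +2, carrying a "sand pile" at mean -1 to a "hole" at
# mean +1 over [0, T]. Field and endpoints are illustrative assumptions.

rng = np.random.default_rng(1)
T, n_steps = 1.0, 100
dt = T / n_steps

x = rng.normal(loc=-1.0, scale=0.1, size=2000)   # samples from p0
v = lambda x, t: np.full_like(x, 2.0)            # constant velocity field

for k in range(n_steps):
    x = x + v(x, k * dt) * dt                    # forward Euler step

print(round(float(x.mean()), 1))                 # prints 1.0
```

For a constant field the trajectories are straight lines, so the kinetic-energy cost here coincides with the squared distance between the two means.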
The Schrödinger Bridge (Entropy-Regularized OT)
The Schrödinger Bridge tackles the exact same goal (moving from $p_0$ to $p_T$), but it assumes the particles are naturally undergoing Brownian motion (diffusion). Therefore, the transport plan must balance minimizing the transport effort (kinetic energy) against the natural dispersion (entropy).
Let $\epsilon > 0$ represent the variance of the diffusion process (the 'temperature' or noise level).
Formula for Schrödinger Bridge: $$SB_\epsilon(p_0, p_T) = \inf_{\rho, \mathbf{v}} \int_0^T \int_{\mathbb{R}^d} \frac{1}{2} \|\mathbf{v}(\mathbf{x}, t)\|^2 \rho(\mathbf{x}, t) d\mathbf{x} dt$$ Subject to the Fokker-Planck Equation (stochastic flow): $$\frac{\partial \rho}{\partial t} + \nabla \cdot (\rho \mathbf{v}) = \epsilon \Delta \rho$$ (Note: $\Delta \rho$ is the Laplacian, which is equivalent to $\nabla \cdot \nabla \rho$, representing the diffusion term).
With the same boundary conditions: $\rho(\mathbf{x}, 0) = p_0(\mathbf{x})$ and $\rho(\mathbf{x}, T) = p_T(\mathbf{x})$. In this stochastic model, the equivalent SDE describing particle motion includes a Wiener process (noise) scaled by $\epsilon$: $$d\mathbf{x} = \mathbf{v}(\mathbf{x}, t)dt + \sqrt{2\epsilon} d\mathbf{w}_t$$
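The stochastic counterpart can be simulated with the Euler-Maruyama scheme. Keeping the same constant drift as the deterministic sketch, the $\sqrt{2\epsilon}\,d\mathbf{w}_t$ term adds roughly $2\epsilon T$ of variance by time $T$; the parameter values are illustrative:

```python
import numpy as np

# Euler-Maruyama discretization of dx = v dt + sqrt(2*eps) dw.
# Same constant drift v = +2 as before; the Wiener noise disperses the
# particles, adding about 2*eps*T to the variance. Values are illustrative.

rng = np.random.default_rng(2)
T, n_steps, eps = 1.0, 1000, 0.05
dt = T / n_steps

x = rng.normal(loc=-1.0, scale=0.1, size=5000)   # Var[x_0] = 0.01
for _ in range(n_steps):
    x = x + 2.0 * dt + np.sqrt(2 * eps * dt) * rng.normal(size=x.shape)

# mean -> +1, variance -> 0.01 + 2*eps*T = 0.11 (up to sampling error)
print(round(float(x.mean()), 1), round(float(x.var()), 2))
```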
The Limit Theorem
The structural equivalence between the two formulas is exact, differing only in the constraint governing the density evolution (Continuity vs. Fokker-Planck). The formal relationship is defined by taking the limit of the Schrödinger Bridge as the noise parameter $\epsilon$ goes to $0$: $$\lim_{\epsilon \to 0} SB_\epsilon(p_0, p_T) = W_2^2(p_0, p_T)$$
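This limit can be observed numerically on a tiny discrete example. Below, entropy-regularized OT is solved with Sinkhorn iterations (the discrete workhorse for entropic OT, standing in for the bridge solver) between two 2-point distributions, where the exact $W_2^2$ coupling is the monotone pairing $0 \to 2$, $1 \to 3$ with cost $4$; the point locations are an illustrative choice. As $\epsilon$ shrinks, the entropic transport cost approaches $W_2^2$:

```python
import numpy as np

# Sinkhorn iterations for entropy-regularized OT on two 2-point
# distributions. Exact W2^2 pairs 0->2 and 1->3 (cost 4); as eps
# decreases the entropic cost approaches it. Points are illustrative.

a, b = np.array([0.0, 1.0]), np.array([2.0, 3.0])
p = np.array([0.5, 0.5])                       # uniform weights
C = (a[:, None] - b[None, :]) ** 2             # [[4, 9], [1, 4]]
w2_exact = 4.0                                 # monotone coupling cost

def entropic_cost(eps, iters=2000):
    K = np.exp(-(C - C.min()) / eps)           # shift only rescales u, v
    v = np.ones(2)
    for _ in range(iters):
        u = p / (K @ v)
        v = p / (K.T @ u)
    pi = u[:, None] * K * v[None, :]           # entropic coupling
    return float((pi * C).sum())

gaps = [entropic_cost(eps) - w2_exact for eps in (2.0, 1.0, 0.5, 0.25)]
print([round(g, 4) for g in gaps])             # strictly shrinks toward 0
```

For this example the gap has the closed form $1/(1+e^{1/\epsilon})$, which vanishes as $\epsilon \to 0$, mirroring the limit theorem above.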
6. Sampling and Control
6.1 Classifier-Free Guidance
6.2 Direct Preference Optimization
6.3 Training-Free Guidance
7. Sophisticated Solvers
7.1 DPM-Solver/DPM-Solver++
7.2 Other Sophisticated Solvers
8. Consistency Models
8.1 Consistency Functions
8.2 Training Mechanism
9. Conclusion
In this blog post, we explored the mathematical foundations of Diffusion Models, from vector fields and score functions to the role of divergence and the continuity equation in modeling probability flow. We also delved into advanced perspectives such as flow matching, optimal transport, Schrödinger bridges, guided sampling, and sophisticated solvers. Understanding these underlying principles lets us better appreciate the power and flexibility of Diffusion Models in generative modeling tasks.