Diffusion Schrödinger Bridge with Applications to Score-Based Generative Modeling

Disclaimer: the paper is available on arXiv and was accepted for a spotlight presentation at NeurIPS 2021. The code is available on GitHub.

What is score-based generative modeling?

In generative modeling we are interested in designing algorithms which transform a given distribution $p_{\mathrm{prior}}$ into a given data distribution $p_{\mathrm{data}}$. We have access to the data distribution only through samples. In recent years, generative modeling has become a classical task in machine learning and image processing. A flurry of frameworks aim at solving this problem, such as energy-based models, Generative Adversarial Networks (GANs), normalizing flows, variational auto-encoders and diffusion score-matching techniques. The latter have shown great promise, outperforming GANs in terms of visual quality [1], and can be recast as discretizations of diffusion processes, which also makes them appealing from a mathematical point of view.

Let us recall some of the basic concepts of score-based generative modeling; see also Yang Song's blog post for an introduction, or the original papers [2,3]. Consider an Ornstein-Uhlenbeck diffusion $$ \mathrm{d} \mathbf{X}_t = -\alpha \mathbf{X}_t \mathrm{d} t + \sqrt{2} \mathrm{d} \mathbf{B}_t \ , \qquad \mathbf{X}_0 \sim p_{\mathrm{data}} \ ,$$ with $\alpha > 0$. The process $(\mathbf{X}_t)_{t \in [0,T]}$ is interpreted as a noising process, perturbing the data distribution $p_{\mathrm{data}}$. We denote by $p_{\mathrm{prior}}$ the invariant distribution of the noising process, here $p_{\mathrm{prior}} = \mathcal{N}(0,1/\alpha)$. If $T > 0$ is large enough then the distribution of $\mathbf{X}_T$ is close to $p_{\mathrm{prior}}$ by ergodicity of the Ornstein-Uhlenbeck process. One can therefore (approximately) sample from $p_{\mathrm{data}}$ by first sampling from $p_{\mathrm{prior}}$ and then reversing the dynamics of $(\mathbf{X}_t)_{t \in [0,T]}$. Remarkably, this time-reversal operation leads to another diffusion process with explicit drift and diffusion coefficients [4]: $$ \mathrm{d} \mathbf{Y}_t = \{\alpha \mathbf{Y}_t + 2 \nabla \log p_{T-t}(\mathbf{Y}_{t}) \}\mathrm{d} t + \sqrt{2} \mathrm{d} \mathbf{B}_t \ , \qquad \mathbf{Y}_0 \sim p_{\mathrm{prior}} \ ,$$ where $p_t$ is the density of $\mathbf{X}_t$.

A generative model can then be obtained via an Euler-Maruyama discretization of this reverse-time diffusion, which yields the following Markov chain: $$ Y_{k+1}^\star = Y_k^\star + \gamma_{k+1} \{\alpha Y_k^\star + 2 \nabla \log p_{T- t_k}(Y_k^\star)\} + \sqrt{2 \gamma_{k+1}} Z_{k+1} \ ,$$ where $\{\gamma_{k+1}\}_{k=0}^{N-1}$ is a sequence of stepsizes, $t_k = \sum_{j=0}^{k-1} \gamma_{j+1}$ and $Z_{k+1} \sim \mathcal{N}(0,\mathrm{Id})$ i.i.d. The Markov chain $\{Y_k^\star\}_{k=0}^N$ cannot be computed in practice because we do not have access to the logarithmic gradient (Stein score) $\nabla \log p_{t}$. In score-based generative modeling this score is approximated by a neural network $s_{\theta^\star}$ solving the following variational problem: $$\theta^\star = \mathrm{argmin}_\theta \mathrm{E}[\| \mathbf{Z}/\sigma_t + s_{\theta}(t, \mathbf{X}_t) \|^2] \ , $$ where we recall that solutions of the Ornstein-Uhlenbeck process can be written as $\mathbf{X}_t = m_t \mathbf{X}_0 + \sigma_t \mathbf{Z}$, with $\mathbf{Z} \sim \mathcal{N}(0,\mathrm{Id})$, $m_t = \exp[-\alpha t]$ and $\sigma_t^2 = (1 - \exp[-2 \alpha t])/\alpha$. The variational formulation for $s_{\theta^\star}$ can be obtained using the identity $$ \nabla \log p_t (x) = \mathrm{E}[(m_t \mathbf{X}_0 - \mathbf{X}_t)/\sigma_t^2 \, | \, \mathbf{X}_t=x] = -\mathrm{E}[\mathbf{Z}/\sigma_t \, | \, \mathbf{X}_t=x] \ . $$ Finally, the generative model is given by the Markov chain $$ Y_{k+1} = Y_k + \gamma_{k+1} \{\alpha Y_k + 2 s_{\theta^\star}(T- t_k, Y_k)\} + \sqrt{2 \gamma_{k+1}} Z_{k+1} \ .$$ These ideas are at the core of every score-based generative modeling method.
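To make this pipeline concrete, here is a minimal PyTorch-style sketch of the denoising score-matching loss for the Ornstein-Uhlenbeck noising process and of the Euler-Maruyama sampler for the time-reversed diffusion. The network signature `score_net(t, x)`, the constant stepsize and the values of $\alpha$ and $T$ are illustrative assumptions, not the exact choices made in the paper.

```python
import torch

alpha, T = 1.0, 1.0  # OU coefficient and time horizon (illustrative values)

def ou_perturb(x0, t):
    """Sample X_t | X_0 = x0 for dX_t = -alpha X_t dt + sqrt(2) dB_t."""
    m_t = torch.exp(-alpha * t)                                   # m_t = exp(-alpha t)
    sigma_t = torch.sqrt((1 - torch.exp(-2 * alpha * t)) / alpha)
    z = torch.randn_like(x0)
    return m_t * x0 + sigma_t * z, z, sigma_t

def dsm_loss(score_net, x0):
    """Denoising score-matching objective E || Z / sigma_t + s_theta(t, X_t) ||^2."""
    t = torch.rand(x0.shape[0], 1) * T                            # t ~ U[0, T]
    x_t, z, sigma_t = ou_perturb(x0, t)
    return ((z / sigma_t + score_net(t, x_t)) ** 2).sum(-1).mean()

@torch.no_grad()
def generate(score_net, n_samples, dim, n_steps=30):
    """Euler-Maruyama discretization of the time-reversed diffusion."""
    y = torch.randn(n_samples, dim) / alpha ** 0.5                # Y_0 ~ N(0, 1/alpha Id)
    gamma = T / n_steps                                           # constant stepsize
    for k in range(n_steps):
        t_rev = torch.full((n_samples, 1), T - k * gamma)         # reverse time T - t_k
        y = y + gamma * (alpha * y + 2 * score_net(t_rev, y)) \
            + (2 * gamma) ** 0.5 * torch.randn_like(y)
    return y
```

In practice `score_net` would be trained by minimizing `dsm_loss` over minibatches of data before calling `generate`.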


Image extracted from Yang Song's blog. Generative model for CelebA.

One of the main limitations of score-based generative models is that they require a large number of steps, so that the forward noising dynamics ends close to the distribution $p_{\mathrm{prior}}$, together with small enough stepsizes, so that the neural network approximation of the time-reversed diffusion remains accurate.

In the next section we introduce our main contribution: Diffusion Schrödinger Bridge, a new algorithm which generalizes existing score-based methods and significantly reduces the number of steps needed to define a score-based generative model. This contribution also removes the requirement that $p_{\mathrm{prior}}$ be Gaussian and opens up future applications in high-dimensional optimal transport.

What is a Schrödinger Bridge?

The Schrödinger Bridge (SB) problem is a classical problem appearing in applied mathematics, optimal control and probability; see [5, 6, 7]. In continuous time, it takes the following (dynamic) form. Consider as reference the diffusion $(\mathbf{X}_t)_{t \in [0,T]}$ with path distribution $\mathbb{P}$ describing the process adding noise to the data. We aim to find the path measure $\pi^\star$ which is closest to $\mathbb{P}$ in Kullback-Leibler divergence among those satisfying $\pi^\star_0 = p_{\mathrm{data}}$ and $\pi^\star_T = p_{\mathrm{prior}}$. More precisely, $$ \pi^\star = \mathrm{argmin} \ \{\mathrm{KL}(\pi|\mathbb{P})\ : \ \pi_0 = p_{\mathrm{data}} \ , \ \pi_T = p_{\mathrm{prior}}\} \ . $$ In this work we introduce Diffusion Schrödinger Bridge (DSB), a new algorithm which uses score-matching approaches [2,3] to approximate the Iterative Proportional Fitting (IPF) algorithm, an iterative method for finding the solution of the SB problem. DSB can be seen as a refinement of existing score-based generative modeling methods [2, 3]. In the discrete setting, IPF is also known as the Sinkhorn algorithm; see [8] for a review. Let us first describe IPF. In order to find the solution of the SB problem, IPF operates iteratively by successively solving half-bridge problems. Doing so, we define a sequence of path measures $(\pi^n)_{n \in \mathbb{N}}$ such that $$ \pi^{2n+1} = \mathrm{argmin} \{ \mathrm{KL}(\pi|\pi^{2n}) \ : \ \pi_T = p_{\mathrm{prior}} \} \ , \\ \pi^{2n+2} = \mathrm{argmin} \{ \mathrm{KL}(\pi|\pi^{2n+1}) \ : \ \pi_0 = p_{\mathrm{data}} \} \ , $$ with initial condition $\pi^0 = \mathbb{P}$, the reference dynamics. For large $n$, $\pi^n$ is close to the Schrödinger bridge $\pi^\star$. Hence, a generative model is obtained by sampling from $\pi^{2n+1}$.
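Since IPF reduces to the Sinkhorn algorithm in the discrete setting, a tiny NumPy illustration may help fix ideas: starting from a reference kernel, the columns and rows are alternately rescaled so that the coupling matches the two prescribed marginals, mirroring the alternation between $\pi_T = p_{\mathrm{prior}}$ and $\pi_0 = p_{\mathrm{data}}$ above. The toy kernel and uniform marginals below are chosen purely for illustration.

```python
import numpy as np

def sinkhorn_ipf(K, mu, nu, n_iter=200):
    """IPF / Sinkhorn on a discrete reference kernel K: alternately enforce the
    two marginal constraints; the limit is the KL-closest coupling to K."""
    u, v = np.ones_like(mu), np.ones_like(nu)
    for _ in range(n_iter):
        v = nu / (K.T @ u)      # odd step: enforce the second marginal (p_prior)
        u = mu / (K @ v)        # even step: enforce the first marginal (p_data)
    return u[:, None] * K * v[None, :]

# Toy usage: a Gaussian reference kernel on a 1D grid with uniform marginals.
x = np.linspace(-1.0, 1.0, 50)
K = np.exp(-(x[:, None] - x[None, :]) ** 2 / 0.1)
mu = np.ones(50) / 50
nu = np.ones(50) / 50
coupling = sinkhorn_ipf(K, mu, nu)
```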

We now show how this problem is related to score-based generative modeling. Assume that $\pi^{2n}$ is the path measure associated with the diffusion $$ \mathrm{d} \mathbf{X}_t^n = f_t^n(\mathbf{X}_t^n) \mathrm{d} t + \sqrt{2} \mathrm{d} \mathbf{B}_t \ , \quad \mathbf{X}_0^n \sim p_{\mathrm{data}} \ , $$ then we show that $(\pi^{2n+1})^R$ (where $R$ denotes the time-reversal operation) is associated with the diffusion $$ \mathrm{d} \mathbf{Y}_t^n = b_{T-t}^n(\mathbf{Y}_t^n) \mathrm{d} t + \sqrt{2} \mathrm{d} \mathbf{B}_t \ , \quad \mathbf{Y}_0^n \sim p_{\mathrm{prior}} \ , $$ where $$ b_{t}^n(x) = -f_t^n(x) + 2 \nabla \log p_t^n(x) \ , $$ with $p_t^n$ the density of $\pi_t^{2n}$. Repeating this procedure, we obtain that $\pi^{2n+2}$ is associated with the diffusion $$ \mathrm{d} \mathbf{X}_t^{n+1} = f_t^{n+1}(\mathbf{X}_t^{n+1}) \mathrm{d} t + \sqrt{2} \mathrm{d} \mathbf{B}_t \ , \quad \mathbf{X}_0^{n+1} \sim p_{\mathrm{data}} \ , $$ where $$ f_{t}^{n+1}(x) = -b_t^n(x) + 2 \nabla \log q_t^n(x) \ , $$ with $q_t^n$ the density of $\pi_t^{2n+1}$, and we can iterate this procedure. Of course we cannot sample from these dynamics directly, so we discretize them using an Euler-Maruyama approximation. The logarithmic gradients are then approximated using score-matching techniques (although, for memory reasons, we do not approximate the scores themselves but the drift functions). This discretization together with the variational approximation defines our Diffusion Schrödinger Bridge algorithm.
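Concretely, each half-bridge is simulated with the same Euler-Maruyama routine, only the drift network and the initial distribution change between the forward and backward passes. A minimal sketch of such a trajectory sampler is given below; the drift signature `drift_net(t, x)` and the toy OU drift used in the usage lines are assumptions for illustration.

```python
import torch

def euler_maruyama(drift_net, x0, n_steps, gamma):
    """Discretize dX_t = drift_net(t, X_t) dt + sqrt(2) dB_t and return the whole
    trajectory, which is what the drift-matching regressions are run on."""
    xs = [x0]
    t = torch.zeros(x0.shape[0], 1)
    for _ in range(n_steps):
        x = xs[-1]
        x = x + gamma * drift_net(t, x) + (2 * gamma) ** 0.5 * torch.randn_like(x)
        xs.append(x)
        t = t + gamma
    return torch.stack(xs)   # shape: (n_steps + 1, batch, dim)

# Toy usage: the reference OU drift f_t(x) = -x, started from synthetic "data".
x0 = torch.randn(128, 2) + 3.0
trajectory = euler_maruyama(lambda t, x: -x, x0, n_steps=20, gamma=0.05)
```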

Disclaimer: in this short paragraph I have tried to present our work from the continuous-time point of view. In the paper we follow a similar presentation but from a discrete-time point of view. The two formulations define the same algorithm, but the discrete-time formulation avoids the need for reverse-time diffusion results at the expense of a Gaussian approximation.
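To give a flavour of the discrete-time formulation, here is a hedged sketch of a mean-matching regression under a Gaussian (Euler-Maruyama) approximation of the forward kernel $\mathcal{N}(x + \gamma f_k(x), 2\gamma \mathrm{Id})$: the backward network is trained to predict the mean of the reversed one-step transition, so no continuous-time reversal result is needed. The target below follows from a first-order expansion and is consistent with the relation $b = -f + 2 \nabla \log p$ above; the exact losses used in the paper are stated there and in the released code, and the drifts here are placeholders.

```python
import torch

def backward_mean_target(forward_net, k, x_k, x_kp1, gamma):
    """Target for the backward mean: x_{k+1} + F_k(x_k) - F_k(x_{k+1}),
    where F_k(x) = x + gamma * f_k(x) is the forward one-step mean."""
    F = lambda x: x + gamma * forward_net(k, x)
    return x_kp1 + F(x_k) - F(x_kp1)

def backward_loss(backward_net, forward_net, k, x_k, x_kp1, gamma):
    """Regress x_{k+1} + gamma * b_{k+1}(x_{k+1}) onto the target mean above."""
    pred = x_kp1 + gamma * backward_net(k + 1, x_kp1)
    target = backward_mean_target(forward_net, k, x_k, x_kp1, gamma).detach()
    return ((pred - target) ** 2).sum(-1).mean()

# Toy usage with placeholder drifts (real DSB uses neural networks here).
gamma = 0.05
x_k = torch.randn(64, 2)
x_kp1 = x_k + gamma * (-x_k) + (2 * gamma) ** 0.5 * torch.randn_like(x_k)
loss = backward_loss(lambda k, x: torch.zeros_like(x), lambda k, x: -x, 0, x_k, x_kp1, gamma)
```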

In the next sections we show some of our results on two-dimensional examples, MNIST, CelebA and dataset interpolation.

Two-dimensional examples

For each row, we show the target density on the left and an animated plot of the DSB iterations on the right. Here the prior density $p_{\mathrm{prior}}$ is a zero-mean Gaussian with covariance matrix $\sigma \mathrm{Id}$, where $\sigma$ is the variance computed from the target dataset. We fix the number of steps to $20$.
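For reference, such a data-matched Gaussian prior can be set up in a couple of lines; whether the variance is pooled across coordinates or computed per coordinate is a detail we gloss over here, so treat this as an illustrative assumption.

```python
import torch

def sample_prior(dataset, n_samples):
    """Zero-mean Gaussian prior N(0, sigma * Id), with sigma the variance of the
    two-dimensional target point cloud (pooled over both coordinates)."""
    sigma = dataset.var()
    return sigma.sqrt() * torch.randn(n_samples, dataset.shape[-1])
```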

MNIST

First, we show some samples obtained with our DSB algorithm (using 30 steps in the backward dynamics).


Original dataset (left) and generated samples (right).

Then, we present an animated plot of the DSB iterations (with 10 steps in the backward dynamics). Note how the quality of the samples improves over the DSB iterations.

CelebA exploration

In our work we apply DSB to the generation of CelebA images.

Here we show some latent space exploration. The Gaussian random variables in the generative model are fixed, so the mapping from latent space to image space is deterministic. In this animation we follow an Ornstein-Uhlenbeck process in the latent space and observe its image under this deterministic mapping in image space.

Dataset interpolation

Finally, we present some dataset interpolation experiments in two dimensions,




as well as between MNIST (handwritten digits) and EMNIST (handwritten letters).


References


[1] Prafulla Dhariwal and Alex Nichol. Diffusion models beat GANs on image synthesis. In: arXiv preprint arXiv:2105.05233, 2021.

[2] Yang Song and Stefano Ermon. Generative modeling by estimating gradients of the data distribution. In: Advances in Neural Information Processing Systems, 2019.

[3] Jonathan Ho, Ajay Jain and Pieter Abbeel. Denoising diffusion probabilistic models. In: Advances in Neural Information Processing Systems, 2020.

[4] Hans Föllmer. Random fields and diffusion processes. In: École d'été de Probabilités de Saint-Flour, 1985-1987.

[5] Christian Léonard. A survey of the Schrödinger problem and some of its connections with optimal transport. In: Discrete & Continuous Dynamical Systems - A, 2014.

[6] Yongxin Chen, Tryphon Georgiou and Michele Pavon. Optimal transport in systems and control. In: Annual Review of Control, Robotics, and Autonomous Systems, 2020.

[7] Aapo Hyvärinen and Peter Dayan. Estimation of non-normalized statistical models by score matching. In: Journal of Machine Learning Research, 2005.

[8] Gabriel Peyré and Marco Cuturi. Computational optimal transport. In: Foundations and Trends in Machine Learning, 2019.