Disclaimer: the paper is available on arXiv and was accepted for a spotlight presentation at NeurIPS 2021. The code is available on GitHub.
What is score-based generative modeling?
In generative modeling we are interested in designing algorithms which transform a given distribution $p_{\mathrm{prior}}$ into a given data distribution $p_{\mathrm{data}}$, to which we have access only through samples. In recent years, generative modeling has become a classical task in machine learning and image processing. A flurry of frameworks aim at solving this problem, such as Energy-Based Models, Generative Adversarial Networks (GANs), normalizing flows, Variational Auto-Encoders and diffusion score-matching techniques. The latter have shown great promise, outperforming GANs in terms of visual quality [1], and can be recast as discretizations of diffusion processes, which also makes them appealing from a mathematical point of view.
Let us recall some of the basic concepts of score-based generative modeling; see also Yang Song's blog post for an introduction, or the original papers [2,3]. Consider an (Ornstein-Uhlenbeck) diffusion $$ \mathrm{d} \mathbf{X}_t = -\alpha \mathbf{X}_t \mathrm{d} t + \sqrt{2} \mathrm{d} \mathbf{B}_t \ , \qquad \mathbf{X}_0 \sim p_{\mathrm{data}} \ ,$$ with $\alpha > 0$. The process $(\mathbf{X}_t)_{t \in [0,T]}$ is interpreted as a noising process,
perturbing the data distribution $p_{\mathrm{data}}$. We denote by $p_{\mathrm{prior}}$ the invariant distribution of the noising process; here $p_{\mathrm{prior}} = \mathcal{N}(0,1/\alpha)$. If $T > 0$ is large enough then the distribution of $\mathbf{X}_T$ is close to $p_{\mathrm{prior}}$, by ergodicity of the Ornstein-Uhlenbeck process. Then one can
(approximately) sample from $p_{\mathrm{data}}$ by first sampling from
$p_{\mathrm{prior}}$ and reversing the dynamics of $(\mathbf{X}_t)_{t \in
[0,T]}$. Surprisingly, this time-reversal operation leads to another diffusion process, with explicit drift and diffusion matrix [4]: $$ \mathrm{d} \mathbf{Y}_t = \{\alpha
\mathbf{Y}_t + 2 \nabla \log p_{T-t}(\mathbf{Y}_{t}) \}\mathrm{d} t +
\sqrt{2} \mathrm{d} \mathbf{B}_t \ , \qquad \mathbf{Y}_0 \sim
p_{\mathrm{prior}} \ ,$$ where $p_t$ is the density of $\mathbf{X}_t$. A
generative model can be obtained using an Euler-Maruyama discretization of
the previous diffusion process. We end up with the following Markov chain.
$$ Y_{k+1}^\star = Y_k^\star + \gamma_{k+1} \{\alpha Y_k^\star + 2 \nabla \log p_{T- t_k}(Y_k^\star)\} + \sqrt{2 \gamma_{k+1}} Z_{k+1} \ ,$$ where $\{Z_{k+1}\}_{k=0}^{N-1}$ are i.i.d. standard Gaussian random variables, $\{\gamma_{k+1}\}_{k=0}^{N-1}$ is a sequence of stepsizes and $t_k = \sum_{j=0}^{k-1} \gamma_{j+1}$. The Markov chain $\{Y_k^\star\}_{k=0}^N$
cannot be computed in practice because we do not have access to the
logarithmic gradient (Stein score) $\nabla \log p_{t}$. In score-based generative modeling techniques this score is approximated, via score matching [7], by a neural network $s_{\theta^\star}$ solving the following variational problem $$\theta^\star = \mathrm{argmin}_\theta \mathrm{E}[\| \mathbf{Z}/\sigma_t + s_{\theta}(t, \mathbf{X}_t) \|^2] \ , $$ where the expectation is taken over $t$, $\mathbf{X}_0$ and $\mathbf{Z}$, and we recall that solutions of the Ornstein-Uhlenbeck process can be written as $\mathbf{X}_t = m_t
\mathbf{X}_0 + \sigma_t \mathbf{Z}$, with $\mathbf{Z} \sim
\mathcal{N}(0,1)$, $m_t = \exp[-\alpha t]$ and $\sigma_t^2 = (1 - \exp[-2
\alpha t])/\alpha$. The variational formulation for $s_{\theta^\star}$ can
be obtained using the following formula $$ \nabla \log p_t (x) =
\mathrm{E}[(m_t \mathbf{X}_0 - \mathbf{X}_t)/\sigma_t^2 | \mathbf{X}_t=x] =
-\mathrm{E}[\mathbf{Z}/\sigma_t | \mathbf{X}_t=x] \ . $$ Finally, the generative model is given by the following Markov chain $$ Y_{k+1} = Y_k + \gamma_{k+1} \{\alpha Y_k + 2 s_{\theta^\star}(T- t_k, Y_k)\} + \sqrt{2 \gamma_{k+1}} Z_{k+1} \ .$$ These ideas are at the core of every score-based generative modeling
method.
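To make this concrete, below is a minimal sketch (not the released code) of the two ingredients above: the denoising score-matching loss, which uses the exact Ornstein-Uhlenbeck transition $\mathbf{X}_t = m_t \mathbf{X}_0 + \sigma_t \mathbf{Z}$, and the Euler-Maruyama sampler for the reversed dynamics. The interface `score_net(t, x)`, the constant stepsize and all sizes are illustrative assumptions.

```python
import torch

alpha, T = 1.0, 5.0  # OU mean-reversion rate and time horizon (illustrative)

def ou_stats(t):
    """Mean scale m_t and noise scale sigma_t of the OU transition."""
    m = torch.exp(-alpha * t)
    sigma = torch.sqrt((1 - torch.exp(-2 * alpha * t)) / alpha)
    return m, sigma

def dsm_loss(score_net, x0):
    """Monte Carlo estimate of E[|| Z / sigma_t + s_theta(t, X_t) ||^2]."""
    t = torch.rand(x0.shape[0], 1) * T          # random times in [0, T]
    z = torch.randn_like(x0)
    m, sigma = ou_stats(t)
    xt = m * x0 + sigma * z                     # exact OU transition
    return ((z / sigma + score_net(t, xt)) ** 2).sum(-1).mean()

@torch.no_grad()
def sample(score_net, n, dim, n_steps=1000):
    """Euler-Maruyama discretization of the time-reversed diffusion."""
    gamma = T / n_steps                         # constant stepsize
    y = torch.randn(n, dim) / alpha ** 0.5      # p_prior = N(0, Id / alpha)
    for k in range(n_steps):
        t = torch.full((n, 1), T - k * gamma)
        y = y + gamma * (alpha * y + 2 * score_net(t, y)) \
              + (2 * gamma) ** 0.5 * torch.randn_like(y)
    return y
```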
Image extracted from Yang Song's blog: generative model for CelebA.
One of the main limitations of score-based generative modeling is that it requires a large number of steps: the horizon $T$ must be large enough for the forward dynamics to get close to $p_{\mathrm{prior}}$, and the stepsizes must be small enough for the neural network approximation to hold.
In the next paragraph we introduce our main contribution: Diffusion Schrödinger Bridge, a new algorithm which generalizes existing score-based methods and significantly reduces the number of steps needed to define a score-based generative model. This contribution also removes the requirement that $p_{\mathrm{prior}}$ be Gaussian, and enables future applications in high-dimensional optimal transport.
What is a Schrödinger Bridge?
The Schrödinger Bridge (SB) problem is a classical problem appearing in applied mathematics, optimal control and probability; see [5, 6]. In continuous time, it takes the following (dynamic) form. Consider as reference the diffusion $(\mathbf{X}_t)_{t \in [0,T]}$ with distribution $\mathbb{P}$, describing the process adding noise to the data. We aim to find the path measure $\pi^\star$ which satisfies $\pi^\star_0 = p_{\mathrm{data}}$ and $\pi^\star_T = p_{\mathrm{prior}}$ while minimizing the Kullback-Leibler divergence to $\mathbb{P}$. More precisely, $$ \pi^\star = \mathrm{argmin} \ \{\mathrm{KL}(\pi|\mathbb{P})\ : \ \pi_0 = p_{\mathrm{data}} \ , \ \pi_T = p_{\mathrm{prior}}\} \ . $$ In this work we introduce
Diffusion Schrödinger Bridge (DSB), a new algorithm which uses score-matching approaches [2,3] to approximate the Iterative Proportional Fitting (IPF) algorithm, an iterative method to find the solution of the SB problem. DSB can be seen as a refinement of these existing score-based generative modeling methods. In the discrete setting, IPF is also known as the Sinkhorn algorithm; see [8] for a review.
First, let us describe IPF. In order to find the solution of the SB problem, IPF operates iteratively by successively solving half-bridge problems. Doing so, we define a sequence of distributions $(\pi^n)_{n \in \mathbb{N}}$ such that $$ \pi^{2n+1} = \mathrm{argmin} \{ \mathrm{KL}(\pi|\pi^{2n}) \ : \ \pi_T = p_{\mathrm{prior}} \} \ , \\ \pi^{2n+2} = \mathrm{argmin} \{ \mathrm{KL}(\pi|\pi^{2n+1}) \ : \ \pi_0 = p_{\mathrm{data}} \} \ , $$ with initial condition $\pi^0 = \mathbb{P}$, the reference dynamics. For large $n$, $\pi^n$ is close to the Schrödinger bridge $\pi^\star$. Hence, a generative model is obtained by sampling from $\pi^{2n+1}$.
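In the discrete setting mentioned above, these half-bridge projections are exactly the Sinkhorn scaling iterations [8]. Here is a minimal sketch: the kernel `K` plays the role of the reference dynamics $\mathbb{P}$, and the function name and iteration count are illustrative choices.

```python
import numpy as np

def ipf_sinkhorn(K, p_data, p_prior, n_iter=100):
    """Alternately rescale the coupling so it matches the two marginals."""
    u = np.ones_like(p_data)
    for _ in range(n_iter):
        v = p_prior / (K.T @ u)   # odd step: enforce the pi_T = p_prior marginal
        u = p_data / (K @ v)      # even step: enforce the pi_0 = p_data marginal
    return u[:, None] * K * v[None, :]
```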
We now show how this problem is related to score-based generative modeling.
Assume that $\pi^{2n}$ is the measure associated with the diffusion $$
\mathrm{d} \mathbf{X}_t^n = f_t^n(\mathbf{X}_t^n) \mathrm{d} t + \sqrt{2}
\mathrm{d} \mathbf{B}_t \ , \quad \mathbf{X}_0^n \sim p_{\mathrm{data}} \ ,
$$ then we show that $(\pi^{2n+1})^R$ (where $R$ denotes the time-reversal
operation) is associated with the diffusion $$ \mathrm{d} \mathbf{Y}_t^n =
b_{T-t}^n(\mathbf{Y}_t^n) \mathrm{d} t + \sqrt{2} \mathrm{d} \mathbf{B}_t \
, \quad \mathbf{Y}_0^n \sim p_{\mathrm{prior}} $$ where $$ b_{t}^n(x) =
-f_t^n(x) + 2 \nabla \log p_t^n(x) \ , $$ with $p_t^n$ the density of
$\pi_t^{2n}$. Repeating this procedure we obtain that $\pi^{2n+2}$
is associated with the diffusion $$ \mathrm{d} \mathbf{X}_t^{n+1} =
f_t^{n+1}(\mathbf{X}_t^{n+1}) \mathrm{d} t + \sqrt{2} \mathrm{d}
\mathbf{B}_t \ , \quad \mathbf{X}_0^{n+1} \sim p_{\mathrm{data}} \ , $$
where $$ f_{t}^{n+1}(x) = -b_t^n(x) + 2 \nabla \log q_t^n(x) \ , $$ with
$q_t^n$ the density of $\pi_t^{2n+1}$. We can then iterate this procedure. Of course, we cannot sample from these dynamics directly, so we discretize them using an Euler-Maruyama approximation. The logarithmic gradients are then approximated using score-matching techniques (although, for memory reasons, we do not approximate the scores but the drift functions). This discretization and variational approximation define our Diffusion Schrödinger Bridge algorithm.
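To give an idea of what DSB looks like in practice, here is a heavily simplified, self-contained sketch of the alternating scheme in discrete time, in the mean-matching spirit of the paper: two small networks parameterize the means of the forward and backward Markov chains, and each half-bridge fits one network to reverse the chain generated by the other. The tiny MLP, optimizer, batch size, indexing conventions and the `init_sampler` helper are illustrative assumptions, not the paper's settings.

```python
import torch
import torch.nn as nn

dim, n_steps, gamma = 2, 20, 0.01   # illustrative sizes

def mlp():
    # net([k, x]) -> drift; the step index is fed as an extra input coordinate
    return nn.Sequential(nn.Linear(dim + 1, 64), nn.SiLU(), nn.Linear(64, dim))

def mean(net, k, x):
    # Predicted mean of the next state of the chain driven by `net`
    kk = torch.full((x.shape[0], 1), float(k))
    return x + gamma * net(torch.cat([kk, x], dim=-1))

@torch.no_grad()
def simulate(net, x):
    # x_{k+1} = mean_k(x_k) + sqrt(2 * gamma) * noise
    path = [x]
    for k in range(n_steps):
        x = mean(net, k, x) + (2 * gamma) ** 0.5 * torch.randn_like(x)
        path.append(x)
    return path

def half_bridge(train_net, frozen_net, init_sampler, n_iter=1000):
    # Fit `train_net` so that its chain reverses the chain of `frozen_net`
    opt = torch.optim.Adam(train_net.parameters(), lr=1e-3)
    for _ in range(n_iter):
        path = simulate(frozen_net, init_sampler(512))
        k = torch.randint(n_steps, ()).item()
        xk, xk1 = path[k], path[k + 1]
        with torch.no_grad():
            # Mean-matching regression target for the reversed transition
            target = xk1 + mean(frozen_net, k, xk) - mean(frozen_net, k, xk1)
        loss = ((mean(train_net, n_steps - 1 - k, xk1) - target) ** 2).mean()
        opt.zero_grad(); loss.backward(); opt.step()
```

DSB then alternates these half-bridges: the backward network is first trained against the reference dynamics started from $p_{\mathrm{data}}$, the forward network is then refined against the learned backward chain started from $p_{\mathrm{prior}}$, and so on.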
Some advantages
- Contrary to existing score-based generative modeling methods, which require the reference process to converge to $p_{\mathrm{prior}}$, the convergence of our algorithm is determined by the convergence of IPF.
- The DSB algorithm can be used on top of existing algorithms. Hence, our method can be seen as a refinement of original score-based generative models, and all techniques used to improve the quality and speed of these methods can be implemented for DSB.
- DSB does not require $p_{\mathrm{prior}}$ to be Gaussian. In fact, we only require access to samples from $p_{\mathrm{prior}}$, which can be another dataset. In particular, we are able to perform dataset interpolation. This paves the way for further applications in high-dimensional optimal transport.
Current limitation
- DSB does not achieve state-of-the-art generative modeling results (yet) due to compute limitations, as we use existing architectures to parameterize our score approximations. These architectures are deep and notably unstable (for instance, they require the use of an exponential moving average), which requires careful selection of the parameters of DSB.
Disclaimer: in this short paragraph I tried to give a presentation of our work from the continuous-time point of view. In the paper we follow a similar presentation but with a discrete-time point of view. The two formulations define the same algorithm, but the discrete-time formulation avoids the need for reverse-time diffusions at the expense of a Gaussian approximation.
In the next sections we show some of our results on two-dimensional examples, MNIST, CelebA and dataset interpolation.
Two dimensional examples
For each row, we show the target density on the left and an animated plot of the DSB iterations on the right. Here the prior density $p_{\mathrm{prior}}$ is a Gaussian with zero mean and covariance matrix $\sigma \mathrm{Id}$, where $\sigma$ is the variance computed from the target dataset. We fix the number of steps to $20$.
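As an illustration, the prior-matching step described above could look as follows; the two-moons dataset is a hypothetical stand-in for the targets shown in the figures.

```python
import numpy as np
from sklearn.datasets import make_moons

# Target samples (illustrative stand-in for the densities shown above)
data, _ = make_moons(n_samples=10_000, noise=0.05)

# Prior: centered Gaussian with covariance sigma * Id, sigma matched to the data
sigma = data.var()
prior_samples = np.sqrt(sigma) * np.random.randn(10_000, 2)
```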
MNIST
First, we show some samples from our DSB algorithm (obtained with 30 steps in the backward dynamics).
Original dataset (left) and generated samples (right).
Then, we present an animated plot of the DSB iterations (with 10 steps in the backward dynamics). Note how the quality of the samples improves through the DSB iterations.
CelebA exploration
In our work we apply DSB to the generation of CelebA images. Here we show some latent space exploration. The Gaussian random variables in the generative model are fixed, so the transformation is deterministic. In this animation we follow an Ornstein-Uhlenbeck process in the latent space and observe its transformation by this deterministic mapping in the image space.
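A minimal sketch of how such an animation can be produced, with a hypothetical `generator` standing in for the frozen (deterministic) DSB sampler:

```python
import torch

# Hypothetical placeholder for the frozen generative map; in the experiment
# above it is the DSB backward chain with all Gaussian draws held fixed.
generator = lambda z: z.view(3, 64, 64)

alpha, dt, n_frames = 1.0, 0.05, 200
z = torch.randn(3 * 64 * 64)
frames = []
for _ in range(n_frames):
    # One Euler-Maruyama step of an OU process in the latent space
    z = z - alpha * z * dt + (2 * dt) ** 0.5 * torch.randn_like(z)
    frames.append(generator(z))
```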
Dataset interpolation
Finally, we present some dataset interpolation experiments in two dimensions, as well as between MNIST (handwritten digits) and EMNIST (handwritten letters).
References
[1] Prafulla Dhariwal and Alex Nichol
Diffusion Models Beat GANs on Image Synthesis
In: arXiv preprint arXiv:2105.05233
[2] Yang Song and Stefano Ermon
Generative modeling by estimating gradients of the data distribution
In: Advances in Neural Information Processing Systems 2019
[3] Jonathan Ho, Ajay Jain and Pieter Abbeel
Denoising diffusion probabilistic models
In: Advances in Neural Information Processing Systems 2020
[4] Hans Föllmer
Random fields and diffusion processes
In: École d'été de Probabilités de Saint-Flour 1985-1987
[5] Christian Léonard
A survey of the Schrödinger problem and some of its connections with optimal transport
In: Discrete & Continuous Dynamical Systems-A 2014
[6] Yongxin Chen, Tryphon Georgiou and Michele Pavon
Optimal Transport in Systems and Control
In: Annual Review of Control, Robotics, and Autonomous Systems 2020
[7] Aapo Hyvärinen
Estimation of non-normalized statistical models by score matching
In: Journal of Machine Learning Research 2005
[8] Gabriel Peyré and Marco Cuturi
Computational Optimal Transport
In: Foundations and Trends in Machine Learning 2019