By LI Haoyang 2020.12.21 | 2020.12.30
Normalizing Flow

Content:

- Supplementary
  - Change of Variables Formula
  - Sylvester's determinant identity
  - Banach Fixed Point theorem
- Normalizing Flows: An Introduction and Review of Current Methods - TPAMI 2020
  - Generative Modeling
  - Normalizing Flow
  - Formal Construction of Normalizing Flow
  - Applications
    - Density estimation and sampling
    - Variational Inference
  - Methods
    - Elementwise Flows
    - Linear Flows
    - Planar and Radial Flows (Planar Flows, Sylvester Flows, Radial Flows)
    - Coupling and Autoregressive Flows (Coupling Flows, Autoregressive Flows)
    - Residual Flows
    - Infinitesimal (Continuous) Flows
  - Discussion and Open Problems
    - Inductive biases (Role of the base measure, Form of diffeomorphisms, Loss function)
    - Generalisation to non-Euclidean spaces
- Inspirations
Reference: http://stla.github.io/stlapblog/posts/ChangeOfVariables.html
The change of variables formula computes the pdf of $Y = g(X)$ given the pdf of $X$.

Suppose that the pdf of $X$ is $p_X(x)$, and $g$ is an invertible function with inverse $g^{-1}$.

The probability that $Y$ falls in a small region around $y$ should be equal to the probability that $X$ falls in the corresponding region around $x = g^{-1}(y)$, i.e.

$$p_Y(y)\,|dy| = p_X(x)\,|dx|.$$

Therefore

$$p_Y(y) = p_X\big(g^{-1}(y)\big)\,\left|\det \frac{\partial g^{-1}(y)}{\partial y}\right|.$$
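As a quick sanity check (my own example, not from the reference above): take $X \sim \mathcal{N}(0, 1)$ and $Y = g(X) = e^X$, so $g^{-1}(y) = \ln y$. Then

$$p_Y(y) = p_X(\ln y)\,\left|\frac{d}{dy}\ln y\right| = \frac{1}{y\sqrt{2\pi}}\exp\!\Big(-\frac{(\ln y)^2}{2}\Big), \qquad y > 0,$$

which is exactly the standard log-normal density.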
Reference: https://math.stackexchange.com/questions/17831/sylvesters-determinant-identity
If $A$ and $B$ are matrices of sizes $m \times n$ and $n \times m$ respectively, then

$$\det(I_m + AB) = \det(I_n + BA).$$
A simple proof is as follows:
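(My sketch of the standard block-matrix argument.) Let $M = \begin{pmatrix} I_m & -A \\ B & I_n \end{pmatrix}$ and note that

$$\begin{pmatrix} I_m & A \\ 0 & I_n \end{pmatrix} M = \begin{pmatrix} I_m + AB & 0 \\ B & I_n \end{pmatrix}, \qquad M \begin{pmatrix} I_m & A \\ 0 & I_n \end{pmatrix} = \begin{pmatrix} I_m & 0 \\ B & I_n + BA \end{pmatrix}.$$

Since the block-triangular factor has determinant 1, both products have determinant $\det M$, hence $\det(I_m + AB) = \det(I_n + BA)$.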
Reference: https://mathworld.wolfram.com/BanachFixedPointTheorem.html
Let $f$ be a contraction mapping from a closed subset $S$ of a Banach space $E$ into $S$. Then there exists a unique $z \in S$ such that $f(z) = z$.
A function $f$ is said to be a contraction mapping if there exists a constant $0 \le q < 1$ such that $\|f(x) - f(y)\| \le q\,\|x - y\|$ for all $x, y$, i.e. the Lipschitz constant of the function is less than 1.
Paper: https://ieeexplore.ieee.org/document/9089305
I. Kobyzev, S. Prince and M. Brubaker, "Normalizing Flows: An Introduction and Review of Current Methods," in IEEE Transactions on Pattern Analysis and Machine Intelligence, doi: 10.1109/TPAMI.2020.2992934.
Normalizing Flows are generative models which produce tractable distributions where both sampling and density evaluation can be efficient and exact.
Generative modeling aims to model a probability distribution given examples from that distribution.
There are mainly two approaches:

- Direct analytic approaches, which approximate the observed data with a fixed family of distributions; this is the traditional way of fitting observations with priors.
- Variational approaches and expectation maximization, which introduce latent variables to explain the observed data.
Normalizing Flows (NF) are a family of generative models with tractable distributions where both sampling and density evaluation can be efficient and exact.
A Normalizing Flow is a transformation of a simple probability distribution (e.g., a standard normal) into a more complex distribution by a sequence of invertible and differentiable mappings.
Let $Z \in \mathbb{R}^D$ be a random variable with a known and tractable probability density function $p_Z : \mathbb{R}^D \to \mathbb{R}$.

Let $g$ be an invertible function and $Y = g(Z)$.

Then, using the change of variables formula, one can compute the probability density function of the random variable $Y$:

$$p_Y(y) = p_Z\big(f(y)\big)\,\big|\det Df(y)\big| = p_Z\big(f(y)\big)\,\big|\det Dg\big(f(y)\big)\big|^{-1},$$

in which $f = g^{-1}$ is the inverse of $g$, $Df(y) = \frac{\partial f}{\partial y}$ is the Jacobian of $f$, and $Dg(z) = \frac{\partial g}{\partial z}$ is the Jacobian of $g$.
This new density function $p_Y$ is called the pushforward of the density $p_Z$ by the function $g$, and is denoted by $g_* p_Z$.
Constructing arbitrarily complicated non-linear invertible functions (bijections) can be difficult.
Let $g_1, \dots, g_N$ be a set of $N$ bijective functions, and define $g = g_N \circ g_{N-1} \circ \cdots \circ g_1$ to be their composition.

Then $g$ is also bijective, with inverse

$$f = g^{-1} = g_1^{-1} \circ \cdots \circ g_N^{-1},$$

and the determinant of the Jacobian

$$\det Dg(z) = \prod_{i=1}^{N} \det Dg_i(x_{i-1}),$$

where $Dg_i$ is the Jacobian of $g_i$.

The value of the $i$-th intermediate flow is denoted as $x_i = g_i \circ \cdots \circ g_1(z)$, with $x_0 = z$, hence $x_N = y$.
Thus, a set of nonlinear bijective functions can be composed to construct successively more complicated functions.
Basically, the idea of a normalizing flow is to chain a series of bijective mappings, constructed using neural networks, that map the given samples to a normal distribution; the inverse of the chain then serves as a sampling function.
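A minimal sketch of this composition idea (my own illustration with a toy elementwise affine bijection; the class and function names are made up, not from the survey):

```python
import numpy as np

# A toy elementwise bijection g_i(x) = a_i * x + b_i (a_i != 0), used only to
# illustrate how the log|det Jacobian| terms accumulate under composition.
class AffineBijection:
    def __init__(self, a, b):
        self.a, self.b = a, b

    def forward(self, x):                 # g_i(x)
        return self.a * x + self.b

    def inverse(self, y):                 # g_i^{-1}(y)
        return (y - self.b) / self.a

    def log_abs_det_jacobian(self, x):    # log|det Dg_i(x)| for an elementwise map
        return np.sum(np.log(np.abs(self.a)))

def flow_log_prob(y, flows, base_log_prob):
    """Evaluate log p_Y(y) by running the flow in the normalizing direction."""
    log_det = 0.0
    x = y
    for g in reversed(flows):             # invert g = g_N o ... o g_1
        x = g.inverse(x)                  # x becomes the intermediate x_{i-1}
        log_det += g.log_abs_det_jacobian(x)
    return base_log_prob(x) - log_det     # log p_Z(z) - sum_i log|det Dg_i|

# Usage: two chained affine maps with a standard-normal base density.
flows = [AffineBijection(np.array([2.0, 0.5]), np.array([1.0, -1.0])),
         AffineBijection(np.array([1.5, 3.0]), np.array([0.0, 0.2]))]
std_normal = lambda z: -0.5 * np.sum(z**2) - 0.5 * len(z) * np.log(2 * np.pi)
print(flow_log_prob(np.array([0.3, -0.7]), flows, std_normal))
```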
Definition 1. If $(\mathcal{Z}, \Sigma_{\mathcal{Z}})$ and $(\mathcal{Y}, \Sigma_{\mathcal{Y}})$ are measurable spaces, $g : \mathcal{Z} \to \mathcal{Y}$ is a measurable mapping between them, and $\mu$ is a measure on $\mathcal{Z}$, then one can define a measure on $\mathcal{Y}$, called the pushforward measure and denoted by $g_* \mu$, by the formula

$$g_* \mu(U) = \mu\big(g^{-1}(U)\big), \qquad U \in \Sigma_{\mathcal{Y}}.$$
The data we want to learn can be understood as a sample from a measured "data" space $(\mathcal{Y}, \Sigma_{\mathcal{Y}}, \nu)$.

To learn the data, one can introduce a simpler measured space $(\mathcal{Z}, \Sigma_{\mathcal{Z}}, \mu)$ and find a function $g : \mathcal{Z} \to \mathcal{Y}$ such that $\nu = g_* \mu$.

This function $g$ can be interpreted as a "generator", and $\mathcal{Z}$ as a latent space.

In this survey we will assume that $\mathcal{Z} = \mathcal{Y} = \mathbb{R}^D$, all sigma-algebras are Borel, and all measures are absolutely continuous with respect to the Lebesgue measure (i.e., $\mu = p_Z\,dz$ and $\nu = p_Y\,dy$).
Definition 2. A function $g : \mathbb{R}^D \to \mathbb{R}^D$ is called a diffeomorphism if it is bijective, differentiable, and its inverse is differentiable as well.
The pushforward of an absolutely continuous measure by a diffeomorphism is also absolutely continuous.
This is the theoretical foundation of normalizing flows.
Remark 3. It is common in the normalizing flows literature to simply refer to diffeomorphisms as “bijections” even though this is formally incorrect. In general, it is not necessary that $g$ is everywhere differentiable; rather it is sufficient that it is differentiable only almost everywhere with respect to the Lebesgue measure on $\mathbb{R}^D$. This allows, for instance, piecewise differentiable functions to be used in the construction of $g$.
Assume that there is only a single flow $g(\cdot, \theta)$ parametrized by $\theta$, and that the base measure $p_Z(\cdot \mid \phi)$ is given and parametrized by the vector $\phi$.

Given a set of data $\mathcal{D} = \{y^{(i)}\}_{i=1}^{M}$ observed from some complicated distribution, we can then perform likelihood-based estimation of the parameters $\Theta = (\theta, \phi)$.

The data log-likelihood becomes

$$\log p(\mathcal{D} \mid \Theta) = \sum_{i=1}^{M} \log p_Y\big(y^{(i)} \mid \Theta\big) = \sum_{i=1}^{M} \Big[\log p_Z\big(f(y^{(i)}, \theta) \mid \phi\big) + \log\big|\det Df(y^{(i)}, \theta)\big|\Big],$$

in which $f(\cdot, \theta) = g^{-1}(\cdot, \theta)$ is the flow in the normalizing direction.
Taking the log turns the product of per-sample likelihoods into a sum.
During training, the parameters of the flow () and of the base distribution () are adjusted to maximize the log-likelihood.
Even though a flow must be theoretically invertible, computation of the inverse may be difficult in practice; hence, for density estimation it is common to model a flow in the normalizing direction.
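To make the likelihood-based training above concrete, here is a minimal sketch (my own, using PyTorch and a single learnable elementwise affine flow; it is not code from the survey) that fits the scale and shift of the flow by maximizing the log-likelihood:

```python
import math
import torch

# Toy density estimation: fit a single affine flow in the normalizing
# direction, f(y) = (y - b) * exp(-s), to 1-D data by maximizing the
# log-likelihood. s and b are the flow parameters; the base density is a
# fixed standard normal.
torch.manual_seed(0)
data = 2.0 * torch.randn(1000) + 3.0          # "unknown" data distribution

s = torch.zeros(1, requires_grad=True)        # log-scale
b = torch.zeros(1, requires_grad=True)        # shift
optimizer = torch.optim.Adam([s, b], lr=0.05)

for step in range(500):
    z = (data - b) * torch.exp(-s)            # f(y): data -> base
    # log p_Y(y) = log p_Z(f(y)) + log|det Df(y)|, and log|det Df| = -s here.
    log_prob = -0.5 * z**2 - 0.5 * math.log(2 * math.pi) - s
    loss = -log_prob.mean()                   # minimize negative log-likelihood
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(torch.exp(s).item(), b.item())          # should approach roughly 2.0 and 3.0
```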
Consider a latent variable model $p_\theta(x) = \int p_\theta(x, z)\,dz$, where $x$ is an observed variable and $z$ is the latent variable.

The posterior distribution $p_\theta(z \mid x)$ is usually intractable in practice, therefore the estimation of parameters is done with an approximate posterior $q_\phi(z \mid x)$, i.e. by minimizing the KL divergence $\mathrm{KL}\big(q_\phi(z \mid x)\,\|\,p_\theta(z \mid x)\big)$,

which is equivalent to maximizing the evidence lower bound (ELBO)

$$\mathcal{L}(\theta, \phi) = \mathbb{E}_{q_\phi(z \mid x)}\big[\log p_\theta(x, z) - \log q_\phi(z \mid x)\big].$$

The latter can be optimized using gradient descent, which requires the computation of the gradient of an expectation, $\nabla_\phi\,\mathbb{E}_{q_\phi(z \mid x)}[\,\cdot\,]$.
One can reparametrize $q_\phi(z \mid x)$ with normalizing flows. Assume $z = g(\varepsilon, \phi)$ for a single flow $g$ parametrized by $\phi$, where the base distribution of $\varepsilon$ does not depend on $\phi$; then

$$\mathbb{E}_{q_\phi(z \mid x)}\big[h(z)\big] = \mathbb{E}_{p(\varepsilon)}\big[h\big(g(\varepsilon, \phi)\big)\big],$$

and the gradient of the right-hand side with respect to $\phi$ can be computed by moving the gradient inside the expectation, since the expectation itself no longer depends on $\phi$.

This general approach to computing gradients of an expectation is often called the "reparameterization trick".
What is this?
Normalizing flows should satisfy several conditions to be practical: they must be invertible, sufficiently expressive to model the distribution of interest, and computationally efficient (both in computing the forward and inverse mappings and in computing the determinant of the Jacobian).
A basic form of bijective non-linearity can be constructed given any bijective scalar function.
Let $h : \mathbb{R} \to \mathbb{R}$ be a scalar-valued bijection.

If $x = (x_1, x_2, \dots, x_D)^T$, then

$$g(x) = \big(h(x_1), h(x_2), \dots, h(x_D)\big)^T$$

is also a bijection.
In deep learning terminology, $h$ could be viewed as an "activation function".
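A short sketch of such an elementwise flow (my own illustration, using a leaky ReLU as the scalar bijection; not code from the survey):

```python
import numpy as np

# An elementwise flow built from a scalar bijection: a leaky ReLU, which is
# invertible because its slope is nonzero everywhere.
def leaky_relu_forward(x, alpha=0.2):
    return np.where(x >= 0, x, alpha * x)

def leaky_relu_inverse(y, alpha=0.2):
    return np.where(y >= 0, y, y / alpha)

def leaky_relu_log_abs_det(x, alpha=0.2):
    # The Jacobian is diagonal with entries 1 (x >= 0) or alpha (x < 0),
    # so log|det| is just a sum of elementwise log-slopes.
    return np.sum(np.where(x >= 0, 0.0, np.log(alpha)))

x = np.array([0.5, -1.2, 3.0])
y = leaky_relu_forward(x)
assert np.allclose(leaky_relu_inverse(y), x)
```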
Linear mappings can express correlation between dimensions:

$$g(x) = Ax + b,$$

where $A \in \mathbb{R}^{D \times D}$ and $b \in \mathbb{R}^D$ are parameters. This function is invertible if and only if $A$ is an invertible matrix.

The expressiveness of linear flows is limited (for instance, a Gaussian base distribution stays Gaussian under a linear flow). The determinant of the Jacobian is simply $\det A$, which can be computed in $O(D^3)$, as can the inverse. For computational efficiency, we can restrict the form of $A$.
Diagonal
If $A$ is diagonal with nonzero diagonal entries, then its inverse can be computed in linear time and its determinant is the product of the diagonal entries.
However, a diagonal matrix yields an elementwise transformation, which cannot express correlation between dimensions.
Triangular
The determinant of a triangular matrix is the product of its diagonal entries, which can be computed in $O(D)$.
Inversion is relatively inexpensive, requiring a single pass of back-substitution costing $O(D^2)$ operations.
Permutation and Orthogonal
The expressiveness of triangular transformations is sensitive to the ordering of dimensions. Reordering the dimensions can be done easily using a permutation matrix which has an absolute determinant of 1.
The inverse of an orthogonal matrix is its transpose and its absolute determinant is 1, so both are trivial to compute.
Factorizations
Kingma and Dhariwal [2018] proposed using the LU factorization:

$$g(x) = PLUx + b,$$

where $L$ is lower triangular with ones on the diagonal, $U$ is upper triangular with non-zero diagonal entries, and $P$ is a permutation matrix.

The determinant is the product of the diagonal entries of $U$, which can be computed in $O(D)$.

The inverse of the function $g$ can be computed using two passes of backward substitution in $O(D^2)$.
But the discrete permutation cannot be easily optimized.
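A small sketch of such an LU-parameterized linear flow (my own illustration of the factorization above, not the original Glow code):

```python
import numpy as np
from scipy.linalg import solve_triangular

# LU-parameterized linear flow g(x) = P L U x + b.
rng = np.random.default_rng(0)
D = 4
P = np.eye(D)[rng.permutation(D)]                       # fixed permutation
L = np.tril(rng.normal(size=(D, D)), -1) + np.eye(D)    # unit lower triangular
U = np.triu(rng.normal(size=(D, D)), 1) + np.diag(rng.uniform(0.5, 1.5, D))
b = rng.normal(size=D)

def forward(x):
    return P @ (L @ (U @ x)) + b

def log_abs_det():
    # |det P| = 1, det L = 1, so log|det| is the sum of log|diag(U)|: O(D).
    return np.sum(np.log(np.abs(np.diag(U))))

def inverse(y):
    # Undo the bias and permutation, then two triangular solves (O(D^2)).
    z = P.T @ (y - b)
    z = solve_triangular(L, z, lower=True)
    return solve_triangular(U, z, lower=False)

x = rng.normal(size=D)
assert np.allclose(inverse(forward(x)), x)
```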
Convolution
Convolutions are easy to compute but their inverse and determinant are non-obvious.
Zheng et al. [2018] used 1D convolutions (ConvFlow) and exploited the triangular structure of the resulting transform to efficiently compute the determinant.
Hoogeboom et al. [2019a] have provided a more general solution for modelling d×d convolutions, either by stacking together masked autoregressive convolutions (referred to as Emerging Convolutions) or by exploiting the Fourier domain representation of convolution to efficiently compute inverses and determinants (referred to as Periodic Convolutions).
Rezende and Mohamed [2015] introduced planar and radial flows. They are relatively simple, but their inverses aren’t easily computed.
Planar flows expand and contract the distribution along certain specific directions and take the form

$$g(x) = x + u\,h(w^T x + b),$$

where $u, w \in \mathbb{R}^D$ and $b \in \mathbb{R}$ are parameters and $h$ is a smooth non-linearity.

The Jacobian determinant of this transformation is

$$\det\!\left(\frac{\partial g}{\partial x}\right) = \det\!\big(I + u\,h'(w^T x + b)\,w^T\big) = 1 + h'(w^T x + b)\,u^T w,$$

which can be computed in $O(D)$ time.

The inversion of this flow isn’t possible in closed form and may not exist for certain choices of $h$ and certain parameter settings.

The term $u\,h(w^T x + b)$ can be interpreted as a multilayer perceptron with a bottleneck hidden layer of a single unit [Kingma et al., 2016].
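A brief sketch of a planar flow and its $O(D)$ log-determinant (my own illustration, using $h = \tanh$ and a brute-force check against the full Jacobian):

```python
import numpy as np

# Planar flow g(x) = x + u * tanh(w^T x + b) and its log|det Jacobian|
# via the matrix determinant lemma.
def planar_forward(x, u, w, b):
    return x + u * np.tanh(w @ x + b)

def planar_log_abs_det(x, u, w, b):
    h_prime = 1.0 - np.tanh(w @ x + b) ** 2        # derivative of tanh
    return np.log(np.abs(1.0 + h_prime * (u @ w)))  # O(D) computation

D = 3
rng = np.random.default_rng(1)
x, u, w, b = rng.normal(size=D), rng.normal(size=D), rng.normal(size=D), 0.1

# Sanity check against the full Jacobian determinant, which costs O(D^3).
jac = np.eye(D) + np.outer(u, w) * (1.0 - np.tanh(w @ x + b) ** 2)
assert np.allclose(planar_log_abs_det(x, u, w, b),
                   np.log(np.abs(np.linalg.det(jac))))
```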
Sylvester flows replace the vector parameters of planar flows with matrices:

$$g(x) = x + U\,h(W^T x + b),$$

where $U, W \in \mathbb{R}^{D \times M}$ and $b \in \mathbb{R}^M$ are parameters, $h$ is an elementwise smooth nonlinearity, and $M \le D$ is a hyperparameter to choose, which can be interpreted as the dimension of a hidden layer.

The Jacobian determinant is

$$\det\!\left(\frac{\partial g}{\partial x}\right) = \det\!\Big(I_D + U\,\mathrm{diag}\big(h'(W^T x + b)\big)\,W^T\Big) = \det\!\Big(I_M + \mathrm{diag}\big(h'(W^T x + b)\big)\,W^T U\Big),$$

where the last equality is Sylvester's determinant identity.

This can be computed efficiently if $M$ is small, since the remaining determinant is only $M \times M$.

From a neural network perspective, this is a special kind of residual block.
Radial flows instead modify the distribution around a specific point:

$$g(x) = x + \frac{\beta}{\alpha + r(x)}\,(x - x_0), \qquad r(x) = \|x - x_0\|,$$

where $x_0 \in \mathbb{R}^D$ is the point around which the distribution is distorted, and $\alpha > 0$ and $\beta$ are parameters.
The inverse of radial flows cannot be given in closed form but does exist under suitable constraints on the parameters.
Dinh et al. [2015] introduced a coupling method to enable highly expressive transformations for flows.
Consider a disjoint partition of the input $x \in \mathbb{R}^D$ into two subspaces, $(x^A, x^B) \in \mathbb{R}^d \times \mathbb{R}^{D-d}$, and a bijection $h(\,\cdot\,;\theta) : \mathbb{R}^d \to \mathbb{R}^d$ parametrized by $\theta$.

Then one can define a function $g : \mathbb{R}^D \to \mathbb{R}^D$ by the formula

$$y^A = h\big(x^A; \Theta(x^B)\big), \qquad y^B = x^B,$$

where the parameters $\theta = \Theta(x^B)$ are defined by an arbitrary function $\Theta(\cdot)$ which only uses $x^B$ as input; this function is called the conditioner.

The bijection $h$ is called a coupling function, and the resulting function $g$ is called a coupling flow.

As shown in Figure 3 (a) of the paper, a coupling flow is invertible if and only if $h$ is invertible, and its inverse is

$$x^A = h^{-1}\big(y^A; \Theta(y^B)\big), \qquad x^B = y^B.$$

Since the bijection $h$ is controlled by parameters computed solely from $x^B = y^B$, those parameters can be recovered from $y^B$, which is what makes the transformation completely reversible.

The Jacobian of $g$ is a block triangular matrix whose diagonal blocks are the Jacobian of $h$ (with respect to $x^A$) and the identity matrix, so its determinant reduces to $\det\!\big(\partial h / \partial x^A\big)$.

Most coupling functions are applied to $x^A$ elementwise:

$$h(x^A; \theta) = \big(h_1(x^A_1; \theta_1), \dots, h_d(x^A_d; \theta_d)\big),$$

where each $h_i(\,\cdot\,;\theta_i)$ is a scalar bijection.
The power of a coupling flow resides in the ability of a conditioner to be arbitrarily complex. In practice it is usually modelled as a neural network.
As shown in Figure 3 (b), with a trivial (constant) conditioner, one can construct a multi-scale flow, which gradually introduces dimensions to the distribution in the generative direction.
In the normalizing direction, the dimension is reduced (e.g., halved) after each step, in a way that retains most of the semantic information.
The partition of $x$ into $(x^A, x^B)$ is often done by splitting the dimensions in half [Dinh et al., 2015], potentially after a random permutation.
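A sketch of an affine (RealNVP-style) coupling flow, with a tiny fixed "conditioner" just for illustration (my own code, not from the referenced papers):

```python
import numpy as np

# Affine coupling: y^A = x^A * exp(s) + t, y^B = x^B, where (s, t) are
# produced by a conditioner that sees only x^B.
rng = np.random.default_rng(2)
D, d = 6, 3
W_s = rng.normal(size=(d, D - d)) * 0.1
W_t = rng.normal(size=(d, D - d)) * 0.1

def conditioner(x_b):
    return np.tanh(W_s @ x_b), W_t @ x_b          # (s, t), each in R^d

def coupling_forward(x):
    x_a, x_b = x[:d], x[d:]
    s, t = conditioner(x_b)
    y_a = x_a * np.exp(s) + t
    log_det = np.sum(s)                           # Jacobian block is diag(exp(s))
    return np.concatenate([y_a, x_b]), log_det

def coupling_inverse(y):
    y_a, y_b = y[:d], y[d:]
    s, t = conditioner(y_b)                       # y^B = x^B, so (s, t) are recoverable
    return np.concatenate([(y_a - t) * np.exp(-s), y_b])

x = rng.normal(size=D)
y, _ = coupling_forward(x)
assert np.allclose(coupling_inverse(y), x)
```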
Kingma et al. [2016] used autoregressive models as a form of normalizing flow. These are non-linear generalizations of multiplication by a triangular matrix (Section 3.2.2).
Let $h(\,\cdot\,; \theta) : \mathbb{R} \to \mathbb{R}$ be a scalar bijection parametrized by $\theta$.

An autoregressive model is a function $g : \mathbb{R}^D \to \mathbb{R}^D$, which outputs each entry of $y$ conditioned on the previous entries of the input:

$$y_t = h\big(x_t; \Theta_t(x_{1:t-1})\big),$$

where $x_{1:t-1} = (x_1, \dots, x_{t-1})$.

Each entry of $y$ is conditioned on the previous entries of $x$.

For $t = 2, \dots, D$, we choose arbitrary functions $\Theta_t(\cdot)$ mapping $\mathbb{R}^{t-1}$ to the set of all parameters, and $\Theta_1$ is a constant. The functions $\Theta_t$ are called conditioners.

The Jacobian matrix of the autoregressive transformation is triangular: each output $y_t$ only depends on $x_{1:t}$, and so the determinant is just the product of the diagonal entries:

$$\det Dg = \prod_{t=1}^{D} \frac{\partial y_t}{\partial x_t}.$$
In practice, all the entries of the direct (forward) flow can be computed efficiently in parallel, since every conditioner depends only on the input $x$.
The computation of the inverse, however, is inherently sequential, which makes it difficult to implement efficiently on modern hardware as it cannot be parallelized.
Alternatively, the inverse autoregressive flow (IAF) outputs each entry of $y$ conditioned on the previous entries of $y$ (rather than of $x$), i.e.

$$y_t = h\big(x_t; \Theta_t(y_{1:t-1})\big),$$

which has the same form as an autoregressive flow, but the inverse of an IAF can be computed efficiently in a single pass, as shown in Figure 4.
Typically, papers model flows in the normalizing direction, i.e. from data to the base density, while IAF is a flow in the generative direction.
For fast sampling, IAF is better; for fast density estimation, MAF is better.
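A sketch of an affine autoregressive flow, illustrating the parallel forward pass and the inherently sequential inverse (my own toy code; the "conditioner" is just a fixed strictly lower-triangular linear map, which plays the role of a masked network):

```python
import numpy as np

# Affine autoregressive flow: y_t = x_t * exp(s_t(x_{1:t-1})) + m_t(x_{1:t-1}).
rng = np.random.default_rng(3)
D = 5
M_s = np.tril(rng.normal(size=(D, D)) * 0.1, -1)   # strictly lower triangular
M_m = np.tril(rng.normal(size=(D, D)) * 0.1, -1)   # so entry t only sees x_{1:t-1}

def ar_forward(x):
    # All conditioners depend only on x, so the whole pass is parallelizable.
    s, m = M_s @ x, M_m @ x
    return x * np.exp(s) + m, np.sum(s)            # y and log|det Jacobian|

def ar_inverse(y):
    # The inverse is sequential: x_t needs x_{1:t-1} to be computed first.
    x = np.zeros(D)
    for t in range(D):
        s_t, m_t = M_s[t] @ x, M_m[t] @ x          # only uses x[:t] (rest is 0)
        x[t] = (y[t] - m_t) * np.exp(-s_t)
    return x

x = rng.normal(size=D)
y, _ = ar_forward(x)
assert np.allclose(ar_inverse(y), x)
```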
For several autoregressive flows the universality property has been proven [Huang et al., 2018; Jaini et al., 2019a].
Informally, universality means that the flow can learn any target density to any required precision given sufficient capacity and data.
Residual networks are compositions of functions of the form

$$g(x) = x + F(x).$$

Such a function is called a residual connection, and here the residual block $F(\cdot)$ is a feed-forward neural network of any kind.
The first attempts to build a reversible network architecture based on residual connections were made in RevNets [Gomez et al., 2017] and iRevNets [Jacobsen et al., 2018].
Consider a disjoint partition of the input $x = (x^A, x^B)$ and of the output $y = (y^A, y^B)$, and define a function

$$y^A = x^A + F(x^B), \qquad y^B = x^B + G(y^A),$$

where $F$ and $G$ are residual blocks. This network is invertible (by undoing the two steps in reverse order), but computation of the Jacobian determinant is inefficient.
A different point of view on reversible networks comes from a dynamical systems perspective via the observation that a residual connection is a discretization of a first order ordinary differential equation.
To make a plain residual connection invertible, a sufficient condition was found:

Proposition 7. A residual connection $g(x) = x + F(x)$ is invertible if the Lipschitz constant of the residual block satisfies $\mathrm{Lip}(F) < 1$.
There is no analytically closed form for the inverse, but it can be found numerically using fixed-point iterations (which, by the Banach theorem, converge if we assume Lip(F) < 1).
The specific architecture proposed by Behrmann et al. [2019], called iResNet, uses a convolutional network for the residual block. It constrains the spectral norm of each convolutional layer in this network to be less than one.
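A sketch of inverting such a residual connection by fixed-point iteration (my own toy example, using a contractive linear-plus-tanh block rather than an actual iResNet):

```python
import numpy as np

# Invert g(x) = x + F(x) by fixed-point iteration, which converges when
# Lip(F) < 1 (Banach fixed-point theorem).
rng = np.random.default_rng(4)
D = 4
W = rng.normal(size=(D, D))
W *= 0.5 / np.linalg.norm(W, 2)          # rescale so the spectral norm is 0.5

def F(x):
    return np.tanh(W @ x)                # tanh is 1-Lipschitz, so Lip(F) <= ||W||_2 < 1

def invert_residual(y, n_iters=100):
    # Solve x = y - F(x) by iterating x_{k+1} = y - F(x_k).
    x = y.copy()
    for _ in range(n_iters):
        x = y - F(x)
    return x

x = rng.normal(size=D)
y = x + F(x)
assert np.allclose(invert_residual(y), x, atol=1e-8)
```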
A residual connection can be viewed as a discretization of a first-order ordinary differential equation (ODE)

$$\frac{dx}{dt} = F\big(x(t), \theta(t)\big),$$

where $F$ is a function which determines the dynamics (the evolution function), $\Theta$ is a set of parameters, and $\theta : \mathbb{R} \to \Theta$ is a parametrization.

The discretization of this equation (Euler's method) is

$$x_{n+1} = x_n + \varepsilon\,F(x_n, \theta_n),$$

which is equivalent to a residual connection with residual block $\varepsilon\,F(\,\cdot\,, \theta_n)$.
This similarity is marvelous.
The rest of this part is a little out of my understanding... for now.
In other words: nothing more than the prior (base measure), the structure (form of the diffeomorphism), and the loss function.
The base measure of a normalizing flow is generally assumed to be a simple distribution (e.g., uniform or Gaussian).
Theoretically the base measure shouldn't matter: any distribution for which a CDF can be computed can be simulated by applying the inverse CDF to draws from the uniform distribution.
However, in practice, if structure is provided in the base measure, the resulting transformations may become easier to learn.
The majority of the flows explored are triangular flows (either coupling or autoregressive architectures).
A natural question to ask is: are there other ways to model diffeomorphisms which are efficient for computation? What inductive bias does the architecture impose?
The majority of the existing flows are trained by minimizing the KL-divergence between the source and target distributions (or, equivalently, by log-likelihood maximization).
However, other losses could be used which would put normalizing flows in a broader context of optimal transport theory.
This part is out of my mathematical level.... for now.
The core idea of a flow is to model a distribution without information loss by ensuring that the model is invertible.
Most flows are designed to be invertible from the start, but a general neural network can also be converted into an invertible one. The following two kinds of flows are particularly interesting to me:
Coupling flows
They maintain invertibility by keeping part of the input unchanged, so that the conditioner's parameters can be recovered at inversion time.
Invertible Residual Network
They ensure invertibility as long as the residual block has a Lipschitz constant smaller than 1.