It’s dynamical

Variational Optimization, and How It Simplifies Manifold Optimization (Part I: the continuous side of the story)

2023-06-01T00:00:00+00:00

TL;DR

Gradient Descent (GD) is one of the most popular optimization algorithms for machine learning, and momentum is often used to accelerate its convergence. In this blog, we will start with a variational formulation of momentum GD, explore its rich connection to mechanics, and demonstrate how it allows natural generalizations of momentum GD to optimizing functions defined on manifolds.

Two specific classes of manifolds will be discussed, namely Lie groups [Tao & Ohsawa, 2020] and the Stiefel manifold [Kong, Wang & Tao, 2023]. Optimization on these manifolds is more than mathematically interesting: it has numerous applications in machine learning practice. For example, it can be used to improve the performance of Transformer models and to approximate Wasserstein distances in high dimensions.

Code for both the general optimizers and the specific applications can be found here.

Gradient Descent with Momentum: A Variational Perspective

Let us first consider a smooth optimization problem in Euclidean (i.e., flat) space,

\[\min_{x\in\mathbb{R}^d} f(x).\]

Gradient Descent

Arguably the most common optimizer in machine learning, gradient descent, uses the iteration

\[x_{k+1}=x_k-h\nabla f(x_k) \tag{1}\]

where $h$ is called the learning rate or step size. It can be understood as a forward Euler discretization of the gradient flow ODE in continuous time:

\[\dot{x}=-\nabla f(x)\tag{2}\]
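As a minimal sketch of Eq. (1), the snippet below runs gradient descent on a small convex quadratic. The objective, the matrix `A`, the step size, and the helper name `gradient_descent` are illustrative assumptions rather than anything taken from the papers cited in this post.

```python
import numpy as np

# Illustrative objective (an assumption): f(x) = 0.5 * x^T A x, minimized at x = 0.
A = np.diag([1.0, 10.0])

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

def gradient_descent(x0, h, num_steps):
    """Iterate Eq. (1): x_{k+1} = x_k - h * grad f(x_k)."""
    x = np.array(x0, dtype=float)
    for _ in range(num_steps):
        x = x - h * grad_f(x)  # one forward Euler step of the gradient flow, Eq. (2)
    return x

# The largest eigenvalue of A is 10, so h = 0.05 satisfies h <= 1/L.
x_final = gradient_descent([1.0, 1.0], h=0.05, num_steps=500)
f_final = f(x_final)
print(f_final)
```

Each loop iteration is exactly one forward Euler step of size $h$ applied to Eq. (2), which is why shrinking $h$ makes the iterates track the gradient flow more closely.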

One renowned way to accelerate the convergence of gradient descent is to introduce ‘momentum’. We will explain why ‘momentum’ in machine learning really corresponds to momentum in physics (mechanics).

Gradient Descent with Momentum: How to View It as a Discretization

Consider for instance Nesterov’s Accelerated Gradient for convex functions (NAG-C, ‘C’ means it has acceleration for convex $f$), which is a popular momentum GD optimizer. The algorithm is

\[\begin{cases} x_k&=y_{k-1}-s\nabla f(y_{k-1})\\ y_k &= x_k+\frac{k-1}{k+2}(x_k-x_{k-1}) \end{cases} \tag{3}\]

starting from $x_0$ with initial condition $y_0=x_0$. [Su, Boyd and Candes, 2014] provided an insightful perspective by viewing it as a discretization of an ODE. To see this, introduce the change of variables $p_k=\frac{x_k-x_{k-1}}{\sqrt{s}}$ and the step size $h=\sqrt{s}$, and rewrite the algorithm in the following form

\[\begin{cases} x_{k+1}=x_k+hp_{k+1}\\ p_{k+1}=p_k-\frac{3}{k+2}p_k-h\nabla f(y_k) \end{cases}\]

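where the 1st equation is the definition of $p_{k+1}$, and the 2nd equation follows by substituting the 2nd equation of Eq. (3) into the 1st equation of Eq. (3) (with $k$ replaced by $k+1$) and dividing by $h$. Note that $\|y_k-x_k\|=\mathcal{O}(h)$, so $\nabla f(y_k)$ differs from $\nabla f(x_k)$ by only $\mathcal{O}(h)$ when $\nabla f$ is Lipschitz, and that with $t=hk=\sqrt{s}k$ we have $\frac{3}{k+2}=\frac{3h}{t+2h}\approx h\cdot\frac{3}{t}$. Therefore, this update is a discretization of the following ODE (which is the $h\to 0$ limit of the iterative scheme)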

\[\begin{cases} \dot{x}&= p \\ \dot{p} &= - \gamma(t) p -\nabla f(x) \end{cases} \tag{4}\]

where $\gamma=\frac{3}{t}$.
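The rewriting of Eq. (3) in position-momentum form can be checked numerically. The sketch below runs both forms on an illustrative quadratic and compares the iterates; the objective, the function names, and the convention $p_0=0$ (i.e., $x_{-1}=x_0$) are assumptions made for this check.

```python
import numpy as np

A = np.diag([1.0, 10.0])  # illustrative quadratic f(x) = 0.5 * x^T A x (an assumption)

def grad_f(x):
    return A @ x

def nag_c(x0, s, num_steps):
    """Eq. (3); returns the iterates x_0, ..., x_K."""
    x_prev = np.array(x0, dtype=float)
    y = x_prev.copy()                        # y_0 = x_0
    xs = [x_prev.copy()]
    for k in range(1, num_steps + 1):
        x = y - s * grad_f(y)                # x_k = y_{k-1} - s * grad f(y_{k-1})
        y = x + (k - 1) / (k + 2) * (x - x_prev)
        xs.append(x.copy())
        x_prev = x
    return np.array(xs)

def nag_c_momentum_form(x0, s, num_steps):
    """Position-momentum form with h = sqrt(s), assuming p_0 = 0."""
    h = np.sqrt(s)
    x = np.array(x0, dtype=float)
    p = np.zeros_like(x)
    xs = [x.copy()]
    for k in range(num_steps):
        y = x + (k - 1) / (k + 2) * h * p    # y_k written through x_k and p_k
        p = p - 3.0 / (k + 2) * p - h * grad_f(y)
        x = x + h * p
        xs.append(x.copy())
    return np.array(xs)

xs_a = nag_c([1.0, 1.0], s=0.05, num_steps=50)
xs_b = nag_c_momentum_form([1.0, 1.0], s=0.05, num_steps=50)
max_gap = np.max(np.abs(xs_a - xs_b))
print(max_gap)
```

The two trajectories agree up to floating-point rounding, which is what the algebraic substitution predicts.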

This ODE exactly corresponds to Newton’s second law, which says the rate of change of momentum (in time) is given by the net force, which in this case sums a frictional force $-\gamma p$ and a conservative force $-\nabla f$. $\gamma$ here thus serves as the friction coefficient, and it introduces energy dissipation, which leads $x(t)$ to converge to a local min of $f$ as $t\rightarrow\infty$.

Quantification of Momentum-Induced Acceleration

It is a common saying that momentum ‘accelerates gradient descent’. Let’s take a quick look at what this means quantitatively:

Since we are minimizing the function $f$, we quantify convergence by the ‘optimization error’: the difference between the current function value and the minimum value, i.e., $f(x_k)-f(x^\ast)$ in the discrete case and $f(x_t)-f(x^\ast)$ in the continuous case.

Assuming $f$ is convex and $L$-smooth ($L$-smooth means $\|\nabla f(x)-\nabla f(y)\|\le L\|x-y\|$ for all $x,y$), we have the following convergence rates:

	without momentum	with momentum
continuous case $f(x_t)-f(x^*)=$	Eq. 2 $\mathcal{O}\left(\frac{1}{t}\right)$	Eq. 4 ($\gamma(t)=\frac{3}{t}$) $\mathcal{O}\left(\frac{1}{t^2}\right)$
discrete case $f(x_k)-f(x^*)=$	Eq. 1 ($h\le 1/L$) $\mathcal{O}\left(\frac{1}{k}\right)$	Eq. 3 ($s\le 1/L$) $\mathcal{O}\left(\frac{1}{k^2}\right)$

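This means momentum improves the nonasymptotic error bound from $\mathcal{O}(1/k)$ to $\mathcal{O}(1/k^2)$ (both are sublinear rates; they should not be confused with ‘linear’ or ‘quadratic’ convergence in the usual numerical-analysis sense).

In fact, the $\mathcal{O}(1/k^2)$ rate that momentum GD gives is optimal when, roughly speaking, we only have access to the gradient of the function [Nesterov, 1983].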
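To make the comparison in the table tangible, here is a small experiment that runs Eq. (1) and Eq. (3) for the same number of steps on an ill-conditioned convex quadratic. The test problem, the step counts, and the function names are assumptions chosen for illustration; a single run illustrates the rates but does not prove them.

```python
import numpy as np

# Illustrative ill-conditioned convex quadratic (an assumption):
# f(x) = 0.5 * x^T A x, with minimum value f(x*) = 0 at x* = 0.
A = np.diag([1e-3, 1e-2, 1e-1, 1.0])
L = 1.0  # largest eigenvalue of A, i.e. the smoothness constant

def f(x):
    return 0.5 * x @ A @ x

def grad_f(x):
    return A @ x

def run_gd(x0, h, num_steps):
    x = x0.copy()
    for _ in range(num_steps):
        x = x - h * grad_f(x)                      # Eq. (1)
    return f(x)

def run_nag_c(x0, s, num_steps):
    x_prev, y = x0.copy(), x0.copy()
    for k in range(1, num_steps + 1):
        x = y - s * grad_f(y)                      # Eq. (3), first line
        y = x + (k - 1) / (k + 2) * (x - x_prev)   # Eq. (3), second line
        x_prev = x
    return f(x_prev)

x0 = np.ones(4)
err_gd = run_gd(x0, h=1.0 / L, num_steps=200)
err_nag = run_nag_c(x0, s=1.0 / L, num_steps=200)
print(err_gd, err_nag)
```

On this problem the slow direction (eigenvalue $10^{-3}$) dominates the GD error, while the momentum iterate has already traveled much further along it.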

Many other celebrated GD with momentum algorithms can be viewed as discretizations of this ODE (Eq. 4). For example, when we choose $\gamma$ to be constant, one discretization gives NAG-SC (‘SC’ stands for strongly convex) and another gives heavy ball [Polyak, 1964].

So, in summary, one of the many reasons we like momentum GD is it has the optimal convergence rate among gradient-based optimizers, while GD without momentum can have a slower convergence speed.

Generalizing Gradient Descent to Manifolds: Why Is It Nontrivial?

So far, we have an optimization algorithm in Euclidean space. Let’s try to generalize it to optimize a function defined on a manifold. That is, we want manifold gradient descent. But what is the gradient?

Let’s take $\mathsf{SO}(2)$, the manifold of $2\times 2$ special orthogonal matrices, as an example, and consider a toy objective function

\[f:\mathsf{SO}(2)\rightarrow \mathbb{R}, \text{ given by } \begin{pmatrix} a&b\\ c&d\\ \end{pmatrix} \mapsto ad-bc\]

If we collect all element-wise derivatives, we get

\[\begin{pmatrix} \frac{\partial f}{\partial a}&\frac{\partial f}{\partial b}\\ \frac{\partial f}{\partial c}&\frac{\partial f}{\partial d}\\ \end{pmatrix}=\begin{pmatrix} d&-c\\ -b&a\\ \end{pmatrix}\]

However, $f$ is actually the determinant, and all special orthogonal matrices have determinant 1, which means the gradient of $f$ should actually be 0 everywhere. Why the contradiction? That’s because $\mathsf{SO}(2)$ is a 1-dim manifold, $a,b,c,d$ are not independent from each other, and one can’t simply collect all partial-derivatives.
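The contradiction can be seen numerically. The sketch below (with an arbitrarily chosen angle and variable names that are purely illustrative) builds a point of $\mathsf{SO}(2)$, collects the element-wise partial derivatives of $f=ad-bc$, and evaluates the directional derivative of $f$ along a direction tangent to the manifold.

```python
import numpy as np

theta = 0.7  # arbitrary angle (illustrative)
g = np.array([[np.cos(theta), -np.sin(theta)],
              [np.sin(theta),  np.cos(theta)]])  # a point on SO(2)

a, b = g[0]
c, d = g[1]
G = np.array([[d, -c],
              [-b, a]])  # element-wise partial derivatives of f = ad - bc

# Tangent vectors at g have the form g @ xi with xi skew-symmetric.
xi = np.array([[0.0, -1.0],
               [1.0,  0.0]])
eta = g @ xi

naive_gradient_norm = np.linalg.norm(G)   # nonzero
directional = np.sum(G * eta)             # derivative of f along the tangent direction
print(naive_gradient_norm, directional)
```

The matrix of partial derivatives is nonzero, yet its pairing with the tangent direction vanishes: the nonzero part of the naive gradient points off the manifold.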

There are many other difficulties, but to keep the exposition simple, we will not overload the reader with technicalities. Instead, let’s stay on the main line and see how the variational principle, which is a more fundamental view than Newtonian mechanics, can help manifold optimization.

Variational Principle and Lagrangian Mechanics

For preparation, let’s first start with Eq.4 without friction ($\gamma=0$), where the ODE becomes

\[\begin{cases} \dot{x}&= p \\ \dot{p} &= -\nabla f(x) \end{cases} \tag{5}\]

This is Newtonian mechanics. $x$ is a function of time $t$, which gives a trajectory in position space. Thinking of mechanics in terms of trajectories, the Italian-French mathematician and astronomer Joseph-Louis Lagrange proposed a deeper view of mechanics in 1788, as follows.

Let’s consider all possible trajectories, i.e., mappings $x: [0,T] \rightarrow \mathbb{R}^d$, and associate with each of them something called a Lagrangian, which takes a vector-valued function of time and returns a scalar-valued function of time (for advanced readers, it is a dual of energy). If we choose the Lagrangian to be

\[L(x,\dot{x}, t)=\frac{1}{2}\|\dot{x}(t)\|^2-f(x(t)),\]

and consider an “action” functional $\mathcal{S}$ defined as

\[\mathcal{S}[x]:=\int_0^T L(x, \dot{x}, t)\,dt\]

Then the critical point of $\mathcal{S}$, i.e. $x$ such that $\delta \mathcal{S} / \delta x=0$, satisfies the Newtonian dynamics (Eq. 5).

For the sake of length, we’ll not detail the precise meaning of $\delta \mathcal{S}$. Roughly speaking, $\delta \mathcal{S} / \delta x=0$, known as the Euler-Lagrange equation, means $\lim_{\epsilon\rightarrow 0} \frac{1}{\epsilon}(\mathcal{S}[x+\epsilon \eta]-\mathcal{S}[x])=0$ for any curve $\eta$ fixed at end points. Variational calculus gives a streamlined way of computing the Euler-Lagrange equation for any $L$. ‘Stationary-Action Principle’ for example could be a good supplementary reading.

A Dissipative Instance of the Variational Principle: from Mechanics to Optimization

The classical Lagrangian perspective does not change the fact that an isolated mechanical system (i.e. Newtonian dynamics given by Eq. 5) is conservative. Without friction, the total energy, namely the sum of kinetic energy $\frac{1}{2}|\dot{x}|^2$ and potential energy $f(x)$, is a constant. What typically happens is an oscillatory behavior, where the kinetic energy and the potential energy will keep on exchanging values with each other, and $f$ will not be minimized.

To trace the root of this oscillatory behavior, which is undesirable for optimization, let’s talk about Noether’s theorem. Often regarded as the mother of modern mechanics, the German mathematician Emmy Noether proved that each continuous symmetry gives a conservation law. In our case, $L$ is invariant under time translation of $x$, i.e., you get the same action if you shift time via $x(\cdot) \mapsto x(\cdot+C)$, and this gives energy conservation. We can make the total energy no longer constant by breaking the time-translation symmetry: a seminal paper [Wibisono, Wilson & Jordan, 2016] introduced an artificial time dependence, a simplified version of which multiplies the original Lagrangian by an extra given factor $r(t)$, i.e.

\[L(x, \dot{x}, t) := r(t)\left(\frac{1}{2}\|\dot{x}(t)\|^2 - f(x(t))\right)\]

If we write down the corresponding Euler-Lagrange equation, we obtain the ODE with the extra friction term (Eq. 4). $\gamma(t)$ in Eq. 4 is given by $r'(t)/r(t)$, and if we choose $r(t)$ to be positive and monotonically increasing, $\gamma$ will be a positive function that plays the role of the friction coefficient. Popular choices of $\gamma$ are a constant for strongly convex functions and $\frac{3}{t}$ for convex functions.
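As a short worked derivation (a sketch of the computation the paragraph above refers to, treating $x$ as a Euclidean variable), the Euler-Lagrange equation of the time-dependent Lagrangian reads

\[\frac{d}{dt}\frac{\partial L}{\partial \dot{x}}-\frac{\partial L}{\partial x}=\frac{d}{dt}\big(r(t)\dot{x}\big)+r(t)\nabla f(x)=r(t)\ddot{x}+r'(t)\dot{x}+r(t)\nabla f(x)=0.\]

Dividing by $r(t)>0$ and writing $p=\dot{x}$ gives $\dot{p}=-\frac{r'(t)}{r(t)}p-\nabla f(x)$, which is Eq. 4 with $\gamma(t)=r'(t)/r(t)$. For instance, $r(t)=t^3$ yields $\gamma(t)=3/t$, and $r(t)=e^{\gamma_0 t}$ yields the constant $\gamma\equiv\gamma_0$ (the symbol $\gamma_0$ is introduced here only for illustration).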

A simple Lyapunov argument can help us prove that the system converges to a local minimum when time goes to infinity.
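One standard candidate for such a Lyapunov function (an assumption here; the post does not specify which one it has in mind) is the total energy $E(x,p)=\frac{1}{2}\|p\|^2+f(x)$. Along solutions of Eq. 4,

\[\frac{d}{dt}E=\langle p,\dot{p}\rangle+\langle\nabla f(x),\dot{x}\rangle=\langle p,-\gamma(t)p-\nabla f(x)\rangle+\langle\nabla f(x),p\rangle=-\gamma(t)\|p\|^2\le 0,\]

so the energy is non-increasing whenever $\gamma(t)\ge 0$, and strictly decreasing while $p\neq 0$ and $\gamma(t)>0$. Turning this into a full convergence proof needs additional standard arguments (e.g., LaSalle-type reasoning for constant $\gamma$), which are omitted here.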

As mentioned earlier, one time discretization of this ODE leads to a popular gradient descent algorithm with momentum (Eq. 3), and in fact, popular approaches like heavy ball, Nesterov’s Accelerated Gradient method for convex functions (NAG-C), and Nesterov’s Accelerated Gradient method for strongly convex functions (NAG-SC) can all be obtained from different discretization schemes and choices of $\gamma(t)$.

Variational Formulation of Optimization Makes Its Generalization to Manifold Easy … in Theory

We mentioned that generalizing momentum GD (Eq.3) or its continuous time limit (ODE Eq.4) is nontrivial (but leading experts in manifold optimization have made a lot of progress!) The deeper layer of variational formulation, however, provides a big hammer. Geometers like to say it is “intrinsic”, meaning it doesn’t even care what kind of coordinate system you use to parametrize the space that $x$ lives in, be it Euclidean or a Riemannian manifold. Let’s see how that works and what obstacles will be on the way of getting a good algorithm.

Let’s first list the main steps of the variational optimization methodology:

1. Define a dissipative Lagrangian and a corresponding variational problem
2. Solve the variational problem to get an ODE, which does the optimization in continuous time
3. Design a numerical discretization of the ODE, so that we get an algorithm that optimizes in discrete time

Step 1 seems easy. For Euclidean optimization, we chose $L = r(t)\left(\frac{1}{2}\|\dot{x}(t)\|^2 - f(x(t))\right)$, i.e., {time discount} * ({kinetic energy} - {potential energy}). If $x(\cdot)$ is instead a trajectory on a Riemannian manifold $\mathcal{M}$, $\dot{x}$ will be in the tangent space, and we can use the Riemannian metric to generalize the kinetic energy. The Lagrangian simply becomes

\[L = r(t)\left(\frac{1}{2}\|\dot{x}(t)\|_\mathcal{M}^2 - f(x(t))\right)\]

and the variational principle is again

\[\delta \int_0^T L(x,\dot{x},t) dt = 0 .\]

But the difficulty is swept under the rug. What does $\delta$ mean? Because $x(t)\in\mathcal{M}$ and $\dot{x}(t)\in T_{x(t)}\mathcal{M}$, the variation is actually with respect to all infinitesimal changes of $x$ that keep it inside a curved function space. This means Step 2 is actually nontrivial.

There are advanced tools from geometric mechanics that solve Step 2. From a pure math point of view, that is actually rather elegant. However, the resulting ODE will not appear very explicit to a practitioner. What is even worse, Step 3 (time discretization) is still needed so that an algorithm can be constructed, but so far (as of May 2023) we are not aware of any discretization that leads to an explicit algorithm. Instead the iterations are always implicit, meaning one has to solve at least one system of nonlinear equations per GD iteration. This slows down the computation and makes the optimizer scale poorly to the high dimensional problems often faced in machine learning.

But there are ways to get around these difficulties. We don’t have to get everything by brute force.

Tactically Solving the Variational Problem, by Leveraging Specific Manifold Structures

Here are two examples, where specific structures of the manifold class can be used to solve the variational problem, leading to beautiful ODEs that do the optimization (in continuous time).

When the Manifold is a Lie Group: the Technique of Left Trivialization

General Discussion

A Lie group is a manifold that also has a group structure, meaning you have a rule that computes the “product” of any two points on the manifold, which will be another point on the manifold. This “multiplication” operation enriches the geometric structure of the manifold. Previously, we mentioned that a challenge of variational optimization on a manifold is that it requires taking a “functional derivative” with respect to variations in a curved function space; a smart use of the group structure, known as left trivialization, can alleviate this difficulty.

We’ll explain what left trivialization is and how it helps manifold optimization. We will try to remain intuitive, but details and rigor can be found in [Tao & Ohsawa, 2020].

Let’s begin by considering the same old question: what is momentum? An expert would distinguish velocity and momentum and state that velocity lives in the tangent space $T_{x(t)}\mathcal{M}$, while momentum lives in the cotangent space $T_{x(t)}^*\mathcal{M}$. In Euclidean space this just means that $v=\dot{x}$ is the velocity and $p=M\dot{x}$ is the momentum, where $M$ is the mass (or more precisely, an inertia matrix, which gives an isomorphism between $T_{x(t)}\mathcal{M}$ and its dual). Let’s not worry about these and just consider constant mass, which allows us to mean velocity when saying “momentum”.

Then momentum lives in $T_{x(t)}\mathcal{M}$. The problem is, this is a space that is changing when $x$ moves on $\mathcal{M}$. This not only complicates our variational problem, but in fact is a well known issue for other approaches and one reason (among several) why manifold optimization with momentum is hard (see e.g., [Kong, Wang & Tao, 2023] for a review of smart ideas in the literature that address this issue).

However, when we have a Lie group, $\dot{x}$ is in $T_{x(t)}\mathcal{M}$, but $x^{-1}\dot{x}$ is actually in $T_{e}\mathcal{M}$ where $e$ is the identity element of the group. Note $T_{e}\mathcal{M}$ is a fixed linear space, not moving with $x$, and it is a fantastic thing known as the Lie algebra. The idea of left trivialization is, let’s consider a new version of “momentum” to be $x^{-1}\dot{x}$, and then we don’t have to worry about the nonlinear space the original momentum lives in, as now things are just like the Euclidean case.

Concrete Demo via Special Orthogonal Group SO(n)

That was how [Tao & Ohsawa, 2020] approached the variational optimization problem for general Lie groups. However, to remain concrete, this blog will only focus on an important case of $\mathcal{M}=\mathsf{SO}(n)$. $\mathsf{SO}(n)$ is called the special orthogonal group, defined as the set of all the orthogonal matrices whose determinant is 1, i.e.,

\[\mathsf{SO}(n):=\{X\in \mathbb{R}^{n\times n}: X^\top X=I, \text{det}(X)=1\}\]

For a trajectory on this manifold, the velocity has to live in the moving tangent space. To see what that entails, taking the time derivative of $X^T X=I$ gives $\dot{X}^T X+X^T\dot{X}=0$. This means the tangent space at $X$ is $\{\eta\in \mathbb{R}^{n\times n}: \eta^T X + X^T\eta=0\}$. This still looks a bit complicated, but if we left trivialize the velocity by letting $\xi=X^{-1}\eta=X^T\eta$, then $\xi$ simply satisfies

\[\xi^T+\xi=0,\]

meaning it is a skew-symmetric matrix. The space in which this left trivialized velocity lives is a fixed tangent space, $T_e \mathsf{SO}(n)$, known as the Lie algebra $\mathfrak{so}(n)$. Note it no longer depends on $X$!
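A quick numerical illustration of left trivialization follows. It is a sketch with assumed names and a randomly generated point of $\mathsf{SO}(n)$ (obtained from a QR factorization, with one column flipped if needed so that the determinant is $+1$).

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4

# A random orthogonal matrix from QR; flip one column if needed so det = +1.
X, _ = np.linalg.qr(rng.standard_normal((n, n)))
if np.linalg.det(X) < 0:
    X[:, 0] = -X[:, 0]

# A random skew-symmetric xi and the corresponding tangent vector eta = X xi.
B = rng.standard_normal((n, n))
xi = B - B.T
eta = X @ xi

# eta satisfies the X-dependent tangent-space condition ...
tangent_residual = np.linalg.norm(eta.T @ X + X.T @ eta)
# ... while its left-trivialized version satisfies an X-independent one.
xi_recovered = X.T @ eta
skew_residual = np.linalg.norm(xi_recovered + xi_recovered.T)
print(tangent_residual, skew_residual)
```

Both residuals vanish up to rounding: the condition on $\eta$ involves $X$, whereas the condition on $\xi$ is the same linear constraint at every point.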

Now let’s rename $X$ to be $g$, simply to remind ourselves that the trajectory $g(t)$ lives on a Lie group. In the $\mathsf{SO}(n)$ case, we found that the “position” variable $g$ and the new “velocity”/“momentum” needed to satisfy two constraints

\[g^\top g=I, \quad\xi^\top+\xi=0.\]

Remarkably, they are independent of each other, making the variational problem (i.e. finding the critical point of the action functional, with respect to trajectory variations that maintain the constraints) easier to solve. More precisely, we can again define a Lagrangian as

\[L:=r(t)\left(\frac{1}{2}\langle \xi, \xi\rangle-f(g)\right),\]

where $\langle \xi_1, \xi_2\rangle:=\text{tr}(\xi_1^\top M \xi_2)$ is an inner product defined using standard matrix operations and $M$ is any constant positive definite matrix. We will use $M=I$ from now on for a simple demonstration.

Using tools from geometric mechanics (details are technical and thus omitted, but the treatment is actually intrinsic), one can show that the variational principle $\delta \int L dt = 0$ is equivalent to the following ODEs

\[\begin{cases} \dot{g}=g\xi\\ \dot{\xi}=-\gamma (t)\xi-\left(\frac{\partial f}{\partial g}^\top g-g^\top \frac{\partial f}{\partial g}\right) \end{cases} \tag{7}\]

Note here $\frac{\partial f}{\partial g}$ is simply an $n\times n$ matrix that collects all element-wise Euclidean partial derivatives, i.e., the object we previously said was not the correct gradient. The dynamics automatically corrects everything for the manifold, and one can just forget about complications due to curved geometry and pretend that $g$ and $\xi$ are matrices in Euclidean space. The ODE will internally keep the geometry right, meaning that $g(t)^\top g(t)=I, \quad\xi(t)^\top+\xi(t)=0$ for all $t>0$ as long as the initial condition is on the manifold, i.e. $g(0)^\top g(0)=I, \quad\xi(0)^\top+\xi(0)=0$. And $\lim_{t\to\infty}g(t)$ will be a local minimizer of $f$.
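The structure-preservation claim can be checked directly from Eq. 7; the following is a sketch of that check, with the shorthand $F:=\frac{\partial f}{\partial g}^\top g-g^\top\frac{\partial f}{\partial g}$ introduced here only for convenience. Since $F^\top=-F$,

\[\frac{d}{dt}\left(\xi^\top+\xi\right)=-\gamma(t)\left(\xi^\top+\xi\right)-\left(F^\top+F\right)=-\gamma(t)\left(\xi^\top+\xi\right),\]

so $\xi^\top+\xi=0$ persists if it holds initially. Likewise,

\[\frac{d}{dt}\left(g^\top g\right)=\dot{g}^\top g+g^\top\dot{g}=\xi^\top g^\top g+g^\top g\,\xi,\]

which vanishes at $g^\top g=I$ when $\xi$ is skew-symmetric; hence $g^\top g\equiv I$ solves this linear ODE for $g^\top g$ and, by uniqueness, is the solution selected by an initial condition on the manifold.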

When the Manifold is the Stiefel Manifold: the Technique of Function Lagrange Multiplier

The Problem to Solve

The previous section discussed how to optimize $f(g)$ with respect to orthogonal matrices $g$. Orthogonal matrices are square matrices. A practically important generalization would be: how to optimize $f(X)$, when matrix $X$ satisfies orthonormal constraints but is not necessarily square? Note this is a highly nonconvex constraint, and traditional convex optimization tools don’t apply.

The geometric space for this problem is the Stiefel manifold. A Stiefel manifold $\mathsf{St}(n,m)$ is the set of $n\times m$ matrices ($n\ge m$, i.e. tall) with each column orthogonal to all other columns and normalized, i.e.,

\[\mathsf{St}(n,m):=\{X\in \mathbb{R}^{n\times m}: X^\top X=I_m\}\]

When $n=m$, $\mathsf{St}(n,n)$ is almost the same as $\mathsf{SO}(n)$ (but with a negative branch too). However, in general, $n\ge m$, and we no longer have a group structure, and we need a different way to solve the variational problem. Here is how:

Solving the Variational Problem via an Alternative Variational Formulation

Our original variational problem (for optimization) is to find the critical point of the action functional

\[\mathcal{S}[X]=\int r(t)\left(\frac{1}{2}\|\dot{X}(t)\|^2 - f(X(t))\right) dt\]

with respect to all trajectories that satisfy

\[X(t)^\top X(t)=I, \quad \dot{X}(t)^\top X(t)+X(t)^\top \dot{X}(t)=0, \quad \forall t,\]

similar to the $\mathsf{SO}(n)$ case (but note the order of multiplication matters, as $X$ is $n\times m$ and the constraints are $m\times m$). Taking a variational derivative in this curved function space is challenging, and we no longer have left trivialization to help.

So we use a different approach, a constrained variational principle: introduce a function Lagrange multiplier $\Lambda(t)$ to enforce the constraint at all $t$, let

\[\hat{L}(X, \dot{X}, \Lambda, t)=r(t)\Big[\frac{1}{2}\| \dot{X}(t)\|^2-f(X(t))\Big]-\frac{1}{2}\text{tr}\left(\Lambda(t)^\top(X(t)^\top X(t)-I)\right)\]

and consider $\delta \int_0^T \hat{L} dt = 0$ in the flat, unconstrained function space

\[\{ X(t), \Lambda(t) | 0\leq t\leq T, \Lambda(t)\in\mathbb{R}^{m\times m}, X(t)\in\mathbb{R}^{n\times m} \}.\]

This variational problem is easier to solve, but its resulting Euler-Lagrange equation will be a Differential-Algebraic Equation system that contains an ODE that describes how $X$ changes in time based on $X$ and $\Lambda$, and a requirement that $\Lambda$ is such that $X^\top X=I$ is maintained. This is still difficult to handle.

Fortunately, it is possible to eliminate $\Lambda$ using techniques borrowed from an astrophysics paper [Chen, Li & Tao, 2021]. Technicalities aside, this leads to the following ODE that does optimization

\[\begin{cases} \dot{X}=&Q\\ \dot{Q}=&-\gamma Q-XQ^\top Q -\frac{\partial f}{\partial X}+\frac{1}{2}XX^\top\frac{\partial f}{\partial X}+\frac{1}{2}X\frac{\partial f}{\partial X}^\top X \end{cases}\]

Here the “position” variable $X$ and the “momentum” variable $Q$ are again simply $n\times m$ matrices, and $\frac{\partial f}{\partial X}$ is the element-wise derivative (an $n\times m$ matrix too). As before, although everything is based on matrices in Euclidean space and the user need not worry about the manifold or constraints, the dynamics internally keeps everything on the manifold while optimizing $f$.
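That the ODE above keeps $X$ on the Stiefel manifold and $Q$ in its tangent space can be sanity-checked numerically: at a random point satisfying both constraints, the time derivatives of $X^\top X$ and of $X^\top Q+Q^\top X$ should vanish. The sketch below does this with a random stand-in `G` for $\frac{\partial f}{\partial X}$; the dimensions, the friction value, and all variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 3
gamma = 0.5  # illustrative friction value

X, _ = np.linalg.qr(rng.standard_normal((n, m)))   # X^T X = I_m
Z = rng.standard_normal((n, m))
Q = Z - X @ (X.T @ Z + Z.T @ X) / 2                # projection onto the tangent space at X
G = rng.standard_normal((n, m))                    # stand-in for df/dX

# Right-hand side of the momentum equation.
Q_dot = (-gamma * Q - X @ Q.T @ Q - G
         + 0.5 * X @ X.T @ G + 0.5 * X @ G.T @ X)

# d/dt (X^T X) = Q^T X + X^T Q
position_residual = np.linalg.norm(Q.T @ X + X.T @ Q)
# d/dt (X^T Q + Q^T X) = 2 Q^T Q + X^T Q_dot + Q_dot^T X
velocity_residual = np.linalg.norm(2 * Q.T @ Q + X.T @ Q_dot + Q_dot.T @ X)
print(position_residual, velocity_residual)
```

Both residuals are zero up to rounding for any choice of `G`, which reflects that the constraint-preserving property does not depend on the objective.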

What’s Next?

Up to this point, we have exploited variational optimization, and obtained explicit Euclidean ODEs, which optimize the objective functions on manifolds. However, these are not optimizers yet as they are just dynamics in continuous time. To obtain optimization algorithms, we need to numerically discretize the time so that iterative solvers can be obtained.

Doing this will be fun, because the ODEs are constructed such that their solutions stay on the curved manifolds, whereas a naive discretization will lead to numerical solutions that drift off the manifold. We need a better design. Please see Part II of this blog.

📝 How to Cite Me?

Please cite the following 2 publications

@inproceedings{tao2020variational,
  title={Variational optimization on {L}ie groups, with examples of leading (generalized) eigenvalue problems},
  author={Molei Tao and Tomoki Ohsawa},
  booktitle={International Conference on Artificial Intelligence and Statistics (AISTATS)},
  year={2020}
}

@inproceedings{kong2023momentum,
  title={Momentum {S}tiefel Optimizer, with Applications to Suitably-Orthogonal Attention, and Optimal Transport},
  author={Lingkai Kong and Yuqing Wang and Molei Tao},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2023}
}

If you’d also like to cite this blog, please add a 3rd citation as follows

@misc{tao2023blog1,
  title = {Variational Optimization, and How It Works for Manifolds},
  author={Lingkai Kong and Molei Tao},
  howpublished = {\url{https://itsdynamical.github.io/article/2023/06/01/variational-optimization-1.html}},
  note = {From blog <It's dynamical>}
}

Thank you!

Variational Optimization, and How It Simplifies Manifold Optimization (Part II: the discrete side of the story)

2023-06-01T00:00:00+00:00

Part I of the blog described how to obtain ODEs that are written in Euclidean space but optimize functions on manifolds. To turn them into actual optimization algorithms, we need to discretize the time of the ODEs, so that evolution in discrete time corresponds to iterations of an optimizer.

This is nontrivial, because a naive discretization would destroy the property that the exact solution remains on the manifold (see Fig.1 left panel). Of course, one can always add an extra step that artificially pulls things back to the manifold (e.g., a projection), but this operation can be computationally costly. In addition, it partially cancels the progress made and thus possibly slows down the convergence of the optimization too (see Fig.1 right panel). We’d like to be computationally efficient and avoid such cancellations as much as possible (to experts, this means, for example, using exponential maps or projections as little as possible, computational complexity that depends more on $m$ and less on $n$ (we could have $n\gg m$), smaller constants in error bounds, etc.).

Fig.1 - What will happen if we artificially pull the point back?

In this Part II of the blog, we will construct such optimizers. Then we will showcase some interesting applications, such as a general way to improve the performance of Transformer models, and approximating Wasserstein distances in high dimensions. Code for the generic optimizers, as well as these applications, can be found here.

Reminder of the Optimization ODE, and Further Preparation

As a continuation of Part I, we will focus on optimization on the Stiefel manifold. The specific case of the Lie group $\mathsf{SO}(n)$ will be a special instance of the Stiefel manifold $\mathsf{St}(n,m)$ when $n=m$. The optimization dynamics, as obtained from variational optimization in Part I, is

\[\begin{cases} \dot{X}=&Q\\ \dot{Q}=&-\gamma Q-XQ^\top Q -\frac{\partial f}{\partial X}+\frac{1}{2}XX^\top\frac{\partial f}{\partial X}+\frac{1}{2}X\frac{\partial f}{\partial X}^\top X \end{cases}\]

where position $X\in \mathsf{St}(n,m)$ and momentum/velocity $Q\in T_X \mathsf{St}(n,m)$.

Rich geometric information can be obtained here. In the aforementioned Lie group case, we used left trivialization to represent the velocity variable $\dot{g}$, i.e., using $\xi$ that lives in the Lie algebra, via $\dot{g}=g\xi$. Now our position is $X$ (replacing $g$) and velocity is $Q$ (replacing $\dot{g}$), but we don’t have a group structure. If we pretend to do the same thing, namely set $Q=XY$ and use $Y$ as a new representation of velocity, we will be in big trouble: $Q$ is $n\times m$ and $Y$ then has to be $m\times m$, but $n > m$, so we would lose information about the velocity! Instead, we decompose the tangent space $T_X\mathsf{St}$ into $X$ and $X^\perp$ components by $Q=XY+V$, where $XY$ is in the span of $X$, and $V$ is an “orthogonal” remainder. Given $X^\top X=I$ and $X^\top Q+Q^\top X=0$, one can show that this transformation turns the velocity constraint $X^\top Q+Q^\top X=0$ into $Y^\top+Y=0$ and $X^\top V=0$, the latter giving the precise meaning of $V$ being orthogonal to $X$.
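The decomposition $Q=XY+V$ is easy to compute: multiplying by $X^\top$ and using $X^\top X=I$, $X^\top V=0$ gives $Y=X^\top Q$, and $V$ is what remains. The sketch below verifies the two resulting constraints on a random tangent vector; the dimensions and variable names are assumptions.

```python
import numpy as np

rng = np.random.default_rng(2)
n, m = 6, 3

X, _ = np.linalg.qr(rng.standard_normal((n, m)))   # point on St(n, m)
Z = rng.standard_normal((n, m))
Q = Z - X @ (X.T @ Z + Z.T @ X) / 2                # tangent vector at X

Y = X.T @ Q          # m x m component along X
V = Q - X @ Y        # n x m remainder, orthogonal to X

skew_residual = np.linalg.norm(Y + Y.T)            # Y^T + Y = 0
orth_residual = np.linalg.norm(X.T @ V)            # X^T V = 0
recon_residual = np.linalg.norm(X @ Y + V - Q)     # Q = XY + V
print(skew_residual, orth_residual, recon_residual)
```

Note that $Y$ has only $m\times m$ entries while $V$ carries the remaining part of the velocity that a representation by $Y$ alone would lose.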

The above is the static geometric picture, but there is more. Remember $X, Q$ are actually functions of time $X(t), Q(t)$. If one does this decomposition for each $t$, what dynamics will the resulting $Y(t), V(t)$ obey? It turns out that they are given by some elegant ODEs

\[\begin{align} &\dot{X}=XY+V\tag{8a}\\ &\dot{Y}=-\gamma Y-\frac{1-b}{2}\Big(X^\top \frac{\partial f}{\partial X}-\frac{\partial f}{\partial X}^\top X\Big)\tag{8b}\\ &\dot{V}=-\gamma V+\frac{3a-2}{2}VY-XV^\top V-\left(I-XX^\top\right)\frac{\partial f}{\partial X}\tag{8c} \end{align}\]

and as long as the initial condition satisfies $X(0)^T X(0)=I, Y(0)^T+Y(0)=0$ and $X(0)^\top V(0)=0$, the solution automatically maintains the new structural constraints $X(t)^T X(t)=I, Y(t)^T+Y(t)=0$ and $X(t)^\top V(t)=0$, for all $t>0$, and of course, $Q(t):=X(t)Y(t)+V(t)$ will exactly satisfy its constraint and remain in $T_{X(t)} \mathsf{St}$ too.

Nontrivial Discretization for Computationally Efficient Structure Preservation

(Geometric) structure preservation means values of relevant variables stay on their respective manifolds. For our case, namely momentum-accelerated manifold optimization, it corresponds to satisfying 2 constraints:

the position variable stays on the manifold

the momentum variable stays on the tangent space of the manifold (based at the position variable)

In the continuous case, the manifold structure is preserved, because the variational problem is solved with respect to variations of curves on the manifold. Solving such a problem is nontrivial, but already accomplished (see Part I).

When it comes to discretizing ODEs on the manifold, on the other hand, things become even more difficult. One has to design a delicate numerical discretization, because otherwise the manifold structure may fail to be preserved, even though the ODE in continuous time is structure preserving. This is often the case for off-the-shelf numerical schemes, such as Euler methods or Runge-Kutta methods (note both forward and backward Euler methods are special cases of Runge-Kutta).

Nevertheless, it is possible to discretize Eq.8 in a computationally cheap and accurate way, obtaining iterates that exactly satisfy both constraints at all steps. The construction is a bit convoluted because we’d like to maximize the computational efficiency, so we will just give some flavor of the tricks used.
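The loss of structure under an off-the-shelf scheme is easy to observe on the simplest manifold ODE, $\dot{g}=g\xi$ with a fixed skew-symmetric $\xi$. The sketch below compares forward Euler with the exact flow $g\mapsto g\exp(h\xi)$; the particular $\xi$, the step size, and the helper `expm_skew` (a NumPy-only matrix exponential for skew-symmetric matrices, via the Hermitian matrix $i\xi$) are illustrative assumptions.

```python
import numpy as np

def expm_skew(xi):
    """exp(xi) for real skew-symmetric xi, using that 1j * xi is Hermitian."""
    eigvals, U = np.linalg.eigh(1j * xi)
    return (U @ np.diag(np.exp(-1j * eigvals)) @ U.conj().T).real

h, num_steps = 0.1, 100
xi = np.array([[0.0, -1.0,  0.5],
               [1.0,  0.0, -0.3],
               [-0.5, 0.3,  0.0]])   # fixed skew-symmetric velocity

g_euler = np.eye(3)
g_exp = np.eye(3)
for _ in range(num_steps):
    g_euler = g_euler + h * g_euler @ xi   # forward Euler step of g' = g xi
    g_exp = g_exp @ expm_skew(h * xi)      # exact flow of g' = g xi for constant xi

drift_euler = np.linalg.norm(g_euler.T @ g_euler - np.eye(3))
drift_exp = np.linalg.norm(g_exp.T @ g_exp - np.eye(3))
print(drift_euler, drift_exp)
```

Each Euler step multiplies $g^\top g$ by a factor $I-h^2\xi^2$ whose eigenvalues are at least one, so the deviation from orthogonality accumulates, whereas the exponential update stays orthogonal up to rounding.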

To fix ideas, let’s first start with a simpler example, namely when the manifold is the Lie group $\mathsf{SO}(n)$. This is a special case of the Stiefel problem, in which we let $n=m$ and then $V$ becomes 0, Eq.8a and 8b become just Eq.7 in Part I, and Eq. 8c disappears. We copy Eq.7 here for convenience:

\[\begin{cases} \dot{g}=g\xi\\ \dot{\xi}=-\gamma (t)\xi-\left(\frac{\partial f}{\partial g}^\top g-g^\top \frac{\partial f}{\partial g}\right) \end{cases} \tag{duplicate of Eq.7}\]

To numerically simulate this ODE, we adopt a vector field splitting approach: we strategically decompose its right hand side, known as the vector field, into the sum of 3 vector fields, and consider their respective evolution dynamics:

\[\begin{cases} \dot{g}=g\xi\\ \dot{\xi}=0 \end{cases} \quad \begin{cases} \dot{g}=0\\ \dot{\xi}=-\gamma (t)\xi \end{cases} \quad \begin{cases} \dot{g}=0\\ \dot{\xi}=-\left(\frac{\partial f}{\partial g}^\top g-g^\top \frac{\partial f}{\partial g}\right) \end{cases} \tag{9}\]

The specific splitting is such that these split ODE systems have nice properties, and one of the most important properties is

Each of the 3 ODE systems is structure preserving

They are also easy to integrate: all of them admit closed form solutions. Evolving them alternately gives us a numerical integrator of Eq.(7) (Algo. 2 in [Tao & Ohsawa, 2020]), and the Lie-Trotter operator splitting theorem ensures that this integrator approximates the exact solution.

Because each evolution exactly preserves the manifold structures, so does their composition (i.e., alternately evolving them). Having simple closed form solutions also ensures a low computational cost (experts may question the cost of the exponential map, which is needed for solving $\dot{g}=g\xi$, but even this can be avoided in a more advanced discretization; see 3 paragraphs below).
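To give a feel for how such a splitting scheme looks in code, here is a minimal sketch on $\mathsf{SO}(3)$ with constant $\gamma$. It is not a reproduction of Algo. 2 in [Tao & Ohsawa, 2020]: the objective, the step size, the ordering of the three sub-flows, and all function names are assumptions. In particular, the force is taken here as $-(g^\top G-G^\top g)$ with $G=\frac{\partial f}{\partial g}$, the skew-symmetric direction along which $f$ decreases under the trace inner product; sign and scaling conventions for this term should be checked against the paper before reuse.

```python
import numpy as np

def expm_skew(xi):
    """exp(xi) for real skew-symmetric xi, using that 1j * xi is Hermitian."""
    eigvals, U = np.linalg.eigh(1j * xi)
    return (U @ np.diag(np.exp(-1j * eigvals)) @ U.conj().T).real

# Illustrative objective (an assumption): f(g) = 0.5 * ||g - R||_F^2,
# minimized over SO(3) at g = R, a rotation by 1 radian about the z-axis.
angle = 1.0
R = np.array([[np.cos(angle), -np.sin(angle), 0.0],
              [np.sin(angle),  np.cos(angle), 0.0],
              [0.0, 0.0, 1.0]])

def f(g):
    return 0.5 * np.sum((g - R) ** 2)

def euclidean_grad(g):
    return g - R                       # element-wise partial derivatives of f

def splitting_step(g, xi, h, gamma):
    g = g @ expm_skew(h * xi)          # sub-flow 1: g' = g xi with xi frozen
    xi = np.exp(-gamma * h) * xi       # sub-flow 2: xi' = -gamma xi
    M = g.T @ euclidean_grad(g)
    xi = xi - h * (M - M.T)            # sub-flow 3: force kick with g frozen
    return g, xi

g, xi = np.eye(3), np.zeros((3, 3))
for _ in range(300):
    g, xi = splitting_step(g, xi, h=0.1, gamma=1.0)

f_final = f(g)
orth_residual = np.linalg.norm(g.T @ g - np.eye(3))
skew_residual = np.linalg.norm(xi + xi.T)
print(f_final, orth_residual, skew_residual)
```

Every sub-step maps an orthogonal $g$ to an orthogonal $g$ and a skew-symmetric $\xi$ to a skew-symmetric $\xi$, so the iterates satisfy both constraints up to rounding without any projection.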

Now let’s describe the full-blown Stiefel version (where $n>m$). The optimization ODE to discretize is Eq.8, and we decompose its right hand side as the sum of 3 carefully chosen vector fields, and their respective evolution dynamics are:

\[\begin{cases} &\dot{X}=XY\\ &\dot{Y}=-\gamma Y\\ &\quad -\frac{1-b}{2}\left(X^\top\frac{\partial f}{\partial X}-\frac{\partial f}{\partial X}^\top X\right)\\ &\dot{V}=0 \end{cases} \begin{cases} &\dot{X}=0\\ &\dot{Y}=0\\ &\dot{V}=-\gamma V+\frac{3a-2}{2}VY\\ &\quad -(I-XX^\top)\frac{\partial f}{\partial X} \end{cases} \begin{cases} \dot{X}=&V\\ \dot{Y}=&0\\ \dot{V}=&-XV^\top V \end{cases} \tag{10}\]

Again, one can check that each of these 3 ODE systems is structure preserving. Moreover, the first system in Eq. 10 is similar to the $SO(n)$ case (Eq.7) that we just discussed. Even though it does not admit a closed form solution, we can use the same numerical discretization as for Eq.7 (given by Eq.9) for it. The second system is a linear ODE and its explicit solution can be cheaply computed. The third system is nonlinear, but a specially designed numerical discretization that preserves the manifold structure can be constructed; it is too complicated to be presented here, but interested experts are referred to [Kong, Wang & Tao, 2023] for details.

Alternating these integrators for the 3 ODEs in Eq.10 gives us a numerical optimizer that exactly preserves the Stiefel manifold and its tangent structure.

Even better, the computational cost of this structure preserving optimizer can be further reduced. Here are some technical details for experts: costly matrix exponentiation operations are needed for computing the exact solutions of linear ODEs such as system 1 in Eq.10, but they can actually be avoided. If we use a cheaper forward Euler integrator to approximate the evolution of system 1, structure preservation will be destroyed by this step. However, it is a small miracle that, if we first evolve system 2, followed by forward Euler for system 1, and finally system 3, then the deviation from the manifold created by forward Euler will be corrected! This carefully chosen ordering of composition makes the overall iteration still structure preserving, while significantly lowering the computational complexity.

So, in the end, we obtain a computationally efficient optimization algorithm that stays exactly on the manifold forever, and faithfully captures the nice convergence property of the continuous-in-time optimization dynamics.

Some Applications of Optimization on the Stiefel manifold

It is now time to see a subset of useful applications. Let us begin with a simple problem, which is nevertheless at the heart of data sciences —

Leading EigenValue (LEV) problem

Given an $n\times n$ matrix $A$, the task is to compute its largest $m$ eigenvalues. Simply computing all the $n$ eigenvalues and sorting them can be too expensive and wasteful in the case $m\ll n$, and modern data sets often correspond to huge $n$ (e.g., $\geq 10^6$) such that any method with $\mathcal{O}(n^3)$ computational complexity or storage (needed by traditional eigenvalue methods) is unaffordable.

Instead, we can convert the task to an optimization problem

\[\max_{U\in \mathsf{St}(n,m)} \text{tr}(U^\top A U)\]

where the columns of $U$ form an orthonormal basis of an $m$-dimensional subspace of the $n$-dimensional space. By searching for the optimal $U$, we look for the best subspace to project $A$ to, such that $A$ restricted to that subspace has the maximal sum of eigenvalues. The maximizer $U$ will then give a small $m\times m$ matrix $U^\top A U$, whose eigenvalues correspond to $A$’s $m$ leading eigenvalues.

One may think this problem is too easy as the objective function is quadratic, but in fact this optimization problem is not even convex, because there is a nonlinear equality constraint $U^\top U=I_{m\times m}$. Nevertheless, please read on if you’d like to see more complicated objective functions.
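A tiny worked instance may help make the objective and the constraint concrete. The sketch below assumes a symmetric $A$ (here a hypothetical $3\times 3$ diagonal matrix with eigenvalues 3, 2, 1) and $m=2$; the helper names `objective` and `gram` are made up for illustration.

```python
import math

def objective(A, U):
    """tr(U^T A U) for an n x n matrix A and an n x m matrix U (lists of rows)."""
    n, m = len(U), len(U[0])
    AU = [[sum(A[i][k] * U[k][j] for k in range(n)) for j in range(m)]
          for i in range(n)]
    return sum(U[i][j] * AU[i][j] for i in range(n) for j in range(m))

def gram(U):
    """U^T U, which is the identity exactly when U is on the Stiefel manifold."""
    n, m = len(U), len(U[0])
    return [[sum(U[k][i] * U[k][j] for k in range(n)) for j in range(m)]
            for i in range(m)]

A = [[3.0, 0.0, 0.0], [0.0, 2.0, 0.0], [0.0, 0.0, 1.0]]  # eigenvalues 3, 2, 1

U_top = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]   # spans the top-2 eigenvectors
U_low = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0]]   # spans the bottom-2 eigenvectors
c, s = math.cos(0.7), math.sin(0.7)
U_rot = [[c, -s], [s, c], [0.0, 0.0]]          # same subspace as U_top, other basis

val_top, val_low, val_rot = (objective(A, U) for U in (U_top, U_low, U_rot))
print(val_top, val_low, val_rot)

# Non-convexity of the feasible set: U_top and -U_top are both feasible,
# but their midpoint is the zero matrix, whose Gram matrix is not the identity.
midpoint = [[0.5 * (u + (-u)) for u in row] for row in U_top]
print(gram(U_top), gram(midpoint))
```

The objective equals $3+2=5$ on the leading subspace regardless of which orthonormal basis represents it, and only $2+1=3$ on the trailing one. The midpoint check shows why the feasible set is not convex even though the objective is quadratic.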

Projection Robust Wasserstein (PRW) Distance

Wasserstein distance is a very important notion in machine learning, as it quantifies a distance between two probability distributions. If these distributions are for high-dimension random variables, however, the computation of Wasserstein distance is very challenging; for example, 1) one needs a lot of sample points of the distributions (i.e. data), 2) the computation of the distance can be very expensive.

One way to alleviate these issues is to use Projection Robust Wasserstein Distance (e.g., [Paty & Cuturi, 2019], [Lin et al. 2020]). Let’s first review Wasserstein distance: given 2 probability measures $\mu,\nu$ on $\mathbb{R}^n$, we denote the set of all couplings as $\Pi(\mu,\nu)$. The Wasserstein distance between $\mu$ and $\nu$ can be defined as

\[W_2(\mu,\nu) := \min_{\pi \in \Pi(\mu,\nu)} \left( \int \|x-y\|^2 \,d\pi(x,y) \right)^{1/2}\]

Imagine $\mu$ and $\nu$ describe the shapes of two sand piles. Wasserstein distance basically tries to find the least-effort plan of moving sand so that the $\mu$ pile becomes the $\nu$ pile. To deal with the case where the dimension of $x$ and $y$ is high, the beautiful idea of PRW for bypassing the curse of dimensionality is to project these 2 distributions to a lower dimensional subspace, compute the distance in this lower dimensional space instead, and finally use an outer loop to find the best subspace to project to. Mathematically, it is given by a bi-level optimization problem

\[P_m(\mu,\nu) := \max_{U\in \mathsf{St}(n,m)} \min_{\pi \in \Pi(\mu,\nu)} \left( \int \|U^\top x - U^\top y\|^2 \,d\pi(x,y) \right)^{1/2}\]

The maximization is in order to keep as much information as possible. This approach not only makes the problem computationally more manageable, but also less data-hungry, when $m\ll n$. Moreover, since the dimensions that are relatively less important are omitted after projection, data noise is also reduced and only the essential component is left, which increases the robustness compared to the vanilla $W_2$ distance.

This is again a Stiefel optimization problem, and it is important to stay exactly on the manifold. Enforcing the manifold structure only approximately, as is commonly done via regularization, yields only approximate orthogonality, which would destroy the subspace structure.
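To make the bi-level structure concrete, here is a minimal sketch for the simplest case $n=2$, $m=1$, where $\mathsf{St}(2,1)$ is the unit circle. It is not the optimizer discussed in this blog: the outer maximization is a plain grid search over angles, the inner problem uses the standard fact that $W_2$ between two equal-size, equal-weight 1D samples is attained by matching sorted points, and the data and function names are hypothetical.

```python
import math

def w2_1d(xs, ys):
    """W_2 between two equal-size, equal-weight 1D samples (sorted matching)."""
    xs, ys = sorted(xs), sorted(ys)
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(xs, ys)) / len(xs))

def prw_m1(X, Y, n_angles=360):
    """Grid search over unit vectors u in R^2 of the projected W_2 distance."""
    best_val, best_u = -1.0, None
    for k in range(n_angles):
        theta = math.pi * k / n_angles      # u and -u give the same value
        u = (math.cos(theta), math.sin(theta))
        px = [u[0] * p[0] + u[1] * p[1] for p in X]   # U^T x for each sample
        py = [u[0] * q[0] + u[1] * q[1] for q in Y]   # U^T y for each sample
        val = w2_1d(px, py)                 # inner minimization over couplings
        if val > best_val:
            best_val, best_u = val, u
    return best_val, best_u

# Two point clouds that differ only by a shift of 3 along the first axis.
X = [(0.0, 0.0), (1.0, 2.0), (2.0, -1.0), (3.0, 0.5)]
Y = [(p[0] + 3.0, p[1]) for p in X]

best_val, best_u = prw_m1(X, Y)
print("PRW estimate:", best_val, "direction:", best_u)
```

Because every candidate `u` is a unit vector by construction, the search never leaves the manifold. For this shifted data the projected distance is $3|\cos\theta|$, so the search recovers the shift direction.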

Subspace Pursue: finding the best subspace to approximate a high dim. optimization problem

What do the aforementioned Leading EigenValue problem and Projection Robust Wasserstein Distance example have in common? They are both based on the idea of approximating a high dimensional problem by looking for an optimal low dimensional projection, and then solving the problem in that low dimensional subspace. In fact, we can make this strategy general, and this results in what we call Subspace Pursue. Both LEV and PRW are instances of Subspace Pursue. Here is a precise, although not the most general, formulation of Subspace Pursue:

Consider a dataset $\lbrace x_i \rbrace_{i=1}^k$ and a function $f$, which abstractly denotes the outcome of some algorithm applied to this dataset. Suppose this algorithm can work with various datasets of different dimensions, meaning both $f(\lbrace x_i\rbrace_{i=1}^k)$ with $x_i$ in $\mathbb{R}^n$ and $f(\lbrace y_i\rbrace_{i=1}^k)$ with $y_i$ in $\mathbb{R}^m$ are well-defined. If $f(\lbrace x_i\rbrace_{i=1}^k)$ is computationally too expensive to evaluate in dimension $n$, but not in dimension $m \ll n$, then we can consider instead the optimization problem

\[\max_{U\in \mathsf{St}(n,m)} f(\lbrace U^\top x_i\rbrace_{i=1}^k).\]

This is again a Stiefel optimization problem that can be pleasantly solved by the optimizers described in this blog. It views a Stiefel matrix $U$ as a projection from the $n$-dim space to an $m$-dim subspace, spanned by its (orthonormal) columns. The maximization is again to make sure that as much information as possible is captured by the low dimensional approximation.

Orthogonality Boosts the Performance of Transformer Models

Transformer [Vaswani et al.] is an extremely powerful deep learning architecture. It was first invented for NLP, but then also applied to Computer Vision (e.g., Vision Transformer (ViT) [Dosovitskiy et al]). One amazing thing about Transformer is that its attention layer is able to characterize long-distance interactions between elements in the sequence, where ‘elements’ mean ‘words’ in NLP tasks and ‘patches’ in CV tasks.

Can non-Euclidean optimization make the self-attention mechanism even better? The main intuition is, many of the trainable parameters in attention layers aim at capturing correlations between elements, via training. If we require these correlations to be orthogonal to each other, information extracted by the attention mechanism can be less redundant and more accurate.

To try this idea out, one simply replaces the Euclidean optimization in training by Stiefel optimization, and it really works well in all tested cases. For example, for vanilla ViT trained from scratch for CIFAR 100, one only needs to modify 2 lines of code to enforce orthogonality, and then test error goes down from 33.1% to 30.2%.

Thank you for reading!

If you have any comment or question, please don’t hesitate to let us know!

📝 How to Cite Me?

Please cite the following 2 publications

@inproceedings{tao2020variational,
  title={Variational optimization on {L}ie groups, with examples of leading (generalized) eigenvalue problems},
  author={Molei Tao and Tomoki Ohsawa},
  booktitle={International Conference on Artificial Intelligence and Statistics (AISTATS)},
  year={2020}
}

@inproceedings{kong2023momentum,
  title={Momentum {S}tiefel Optimizer, with Applications to Suitably-Orthogonal Attention, and Optimal Transport},
  author={Lingkai Kong and Yuqing Wang and Molei Tao},
  booktitle={International Conference on Learning Representations (ICLR)},
  year={2023}
}

If you’d also like to cite this blog, please add a 3rd citation as follows

@misc{tao2023blog1,
  title = {Variational Optimization, and How It Works for Manifolds},
  author={Lingkai Kong and Molei Tao},
  howpublished = {\url{https://itsdynamical.github.io/article/2023/06/01/variational-optimization-1.html}},
  note = {From blog <It's dynamical>}
}

References

Molei Tao, and Tomoki Ohsawa. “Variational optimization on Lie groups, with examples of leading (generalized) eigenvalue problems.” International Conference on Artificial Intelligence and Statistics (2020).
Lingkai Kong, Yuqing Wang, and Molei Tao. “Momentum Stiefel Optimizer, with Applications to Suitably-Orthogonal Attention, and Optimal Transport.” International Conference on Learning Representations (2023).
Renyi Chen, Gongjie Li, and Molei Tao. “GRIT: A package for structure-preserving simulations of gravitationally interacting rigid bodies.” The Astrophysical Journal (2021).
François-Pierre Paty, and Marco Cuturi. “Subspace robust Wasserstein distances.” International Conference on Machine Learning (2019).
Tianyi Lin, Chenyou Fan, Nhat Ho, Marco Cuturi, and Michael Jordan. “Projection robust Wasserstein distance and Riemannian optimization.” Advances in Neural Information Processing Systems (2020).
Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. “Attention is all you need.” Advances in Neural Information Processing Systems (2017).
Alexey Dosovitskiy, Lucas Beyer, Alexander Kolesnikov, Dirk Weissenborn, Xiaohua Zhai, Thomas Unterthiner, Mostafa Dehghani, Matthias Minderer, Georg Heigold, Sylvain Gelly, Jakob Uszkoreit, Neil Houlsby. “An image is worth 16x16 words: Transformers for image recognition at scale.” International Conference on Learning Representations (2021).
Andre Wibisono, Ashia C. Wilson, and Michael I. Jordan. “A variational perspective on accelerated methods in optimization.” Proceedings of the National Academy of Sciences (2016).
Weijie Su, Stephen Boyd, and Emmanuel Candes. “A differential equation for modeling Nesterov’s accelerated gradient method: theory and insights.” Advances in neural information processing systems 27 (2014).
Boris T. Polyak. “Some methods of speeding up the convergence of iteration methods.” Ussr computational mathematics and mathematical physics 4.5 (1964): 1-17.

Fun facts about large learning rates: how large is large, and what tricks they can do?

2022-07-25T18:36:38+00:00

Background

Gradient Descent

Machine learning models are often trained by 1st-order optimizers, which are methods that use 1st derivative information to iteratively search for the minimum of an objective function. Gradient Descent (GD) is a central one of these optimizers, and even though many results can extend beyond GD, we will focus on GD in this expository blog. GD uses iteration \[ x_{k+1}=x_k - h \nabla f(x_k) \] where $h$ is the learning rate (LR for short; numerical people also call it stepsize).

The nice classical regime of learning rates

Some functions enjoy a nice property called smoothness (in CS/optimization terminology), and understanding the convergence of GD is the easiest for smooth $f$. Specifically, $f$ is $L$-smooth if $\nabla f$ is globally Lipschitz with coefficient $L$. This is a strong assumption (e.g., $f(x)=x^2$ is 2-smooth, but $f(x)=x^4$ is not smooth in $\mathbb{R}$), and it sometimes can be relaxed (e.g., one can make it local, or use prior knowledge of a bounded domain) but let’s first see what it can do.

Thm. If $f$ is $L$-smooth, $\min f$ exists, and $h<\frac{2}{L}$, GD converges to a stationary point.

Proof. Let $\min f=f^*$. By $L$-smoothness and $x_{k+1}-x_k=-h\nabla f(x_k)$, \[ f(x_{k+1})\le f(x_k)+\langle\nabla f(x_k),x_{k+1}-x_k\rangle+\frac{L}{2}\|x_{k+1}-x_k\|^2 =f(x_k)-h\left(1-\frac{L}{2}h\right)\|\nabla f(x_k)\|^2. \]

Summing over $k=0,\dots,N-1$ and telescoping gives \[ \sum_{k=0}^{N-1}\|\nabla f(x_k)\|^2\le \frac{1}{h(1-\frac{L}{2}h)}(f(x_0)-f(x_N)) \le \frac{1}{h(1-\frac{L}{2}h)}(f(x_0)-f^*). \] The right hand side does not depend on $N$, so the series converges. Therefore $\lim_{k\to\infty}\|\nabla f(x_k)\|^2=0$, i.e., GD converges to a stationary point.
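The two inequalities in the proof can be checked numerically. The sketch below uses a hypothetical test function $f(x)=\sqrt{1+x^2}$, chosen because $f''(x)=(1+x^2)^{-3/2}\le 1$ so that $L=1$ and $f^*=1$; the step size $h=1.5$ and the starting point are arbitrary choices within $h<2/L$.

```python
import math

def f(x):
    return math.sqrt(1.0 + x * x)

def grad(x):
    return x / math.sqrt(1.0 + x * x)

L = 1.0                           # f'' <= 1 everywhere for this f
h = 1.5                           # any h < 2 / L is covered by the theorem
c = h * (1.0 - 0.5 * L * h)       # the constant h * (1 - L h / 2)

x0 = 3.0
x = x0
sq_grad_sum = 0.0
descent_ok = True
for _ in range(200):
    g = grad(x)
    x_next = x - h * g
    # Per-step descent inequality (small tolerance for rounding).
    if f(x_next) > f(x) - c * g * g + 1e-12:
        descent_ok = False
    sq_grad_sum += g * g
    x = x_next

bound = (f(x0) - 1.0) / c         # (f(x_0) - f^*) / (h (1 - L h / 2))
print("descent inequality held:", descent_ok)
print("sum of squared gradients:", sq_grad_sum, "<= bound:", bound)
print("final gradient:", grad(x))
```

As $h$ approaches $2/L$ the constant `c` tends to zero, so the bound on the summed squared gradients degrades, which is one way to see why $2/L$ is the edge of this classical regime.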

Therefore, $h<2/L$ is a nice regime. One fun thing to note though, is that this regime can be further divided. As the following animations show (for a simple $f=x^2/2$), $h<1/L$ leads to continuous motion of $x_k$, but a bigger $h$ gives a discontinuous, oscillatory behavior, however only in $x_k$ but not in $f(x_k)$!
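For readers without access to the animations, the sub-regimes are easy to reproduce for $f=x^2/2$ (so $L=1$), where GD reduces to $x_{k+1}=(1-h)x_k$. The sketch below uses arbitrary representative step sizes and also includes one value $h>2/L$ for contrast.

```python
def gd_quadratic(x0, h, n_steps):
    """GD iterates for f(x) = x^2 / 2, whose gradient is x."""
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] - h * xs[-1])
    return xs

def f(x):
    return 0.5 * x * x

small = gd_quadratic(1.0, 0.5, 8)    # h < 1/L: monotone approach to 0
medium = gd_quadratic(1.0, 1.5, 8)   # 1/L < h < 2/L: x_k flips sign each step
large = gd_quadratic(1.0, 2.5, 8)    # h > 2/L: |x_k| grows

for name, xs in (("h=0.5", small), ("h=1.5", medium), ("h=2.5", large)):
    print(name, [round(x, 4) for x in xs[:5]], [round(f(x), 4) for x in xs[:5]])
```

With $h=1.5$ the iterates alternate in sign while $f(x_k)$ still decreases at every step, which is the "oscillation in $x_k$ but not in $f(x_k)$" behavior.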

Nice part of the nice regime, and gradient flow

A computational mathematician would immediately recognize gradient descent as a forward Euler discretization of the gradient flow ODE \[ \dot{x}=-\nabla f(x) \] Obviously in this continuous time / infinitesimal $h$ limit, $x$ is changing continuously. It will be fun to connect to classical numerical analysis and explore where the $h<1/L$ condition mentioned above comes from! It makes gradient descent close to gradient flow. We shall call $h$ such that gradient descent approximates gradient flow small LR.

However, the larger $h$ becomes, the more GD deviates from gradient flow. Obviously, the oscillatory behavior at $1/L < h < 2/L$ deviates significantly from gradient flow.

ODE beyond gradient flow?

Backward error analysis, a.k.a. modified equation

Gradient flow only captures the behavior of GD when $h$ is sufficiently small. However, ODEs can still help gain insight into GD when $h$ is larger. How is this possible, given that the gradient flow ODE is the $h\to 0$ limit of GD? The answer is, just like Taylor expansion, one can add $\mathcal{O}(h), \mathcal{O}(h^2), \cdots$ terms to the ODE to correct for the finite $h$ effect.

How to precisely do this has been well developed in numerical analysis under the name “backward error analysis”, and the corrected ODE is called “modified equation”. The seminal idea dates back to at least [Wilkinson (1960)] and the subject is beautifully reviewed, for example, in [Hairer, Lubich and Wanner (2006)].

Now let’s apply this general tool to the specific case of GD. Doing it to 1st-order in $h$ gives \[ \dot{x}=-\nabla f_1(x), \text{ where } f_1(x):=f(x)+\frac{h}{4}\|\nabla f(x)\|^2_2\] and it approximates GD in the sense that $x(kh)\approx x_k$, just as gradient flow does (gradient flow is the 0th-order approximation, with $f_0:=f$). In this case, the 1st-order modified equation is still a gradient flow, and $f_1$ is thus called a (1st-order) modified potential.
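For readers who want to see where $f_1$ comes from, here is the formal matching argument, sketched under the assumption that $f$ is smooth enough for the Taylor expansions below; the symbol $g_1$ for the unknown correction is introduced only for this sketch. Look for an ODE $\dot{x}=-\nabla f(x)+h\,g_1(x)$ whose solution started at $x_k$ agrees with one GD step up to $\mathcal{O}(h^3)$. Since $\ddot{x}=-\nabla^2 f(x)\dot{x}+\mathcal{O}(h)=\nabla^2 f(x)\nabla f(x)+\mathcal{O}(h)$,

\[ x(h)=x_k+h\dot{x}+\frac{h^2}{2}\ddot{x}+\mathcal{O}(h^3)=x_k-h\nabla f(x_k)+h^2\left(g_1(x_k)+\frac{1}{2}\nabla^2 f(x_k)\nabla f(x_k)\right)+\mathcal{O}(h^3). \]

Matching with $x_{k+1}=x_k-h\nabla f(x_k)$ forces the $h^2$ term to vanish, i.e.

\[ g_1(x)=-\frac{1}{2}\nabla^2 f(x)\nabla f(x)=-\nabla\left(\frac{1}{4}\|\nabla f(x)\|^2\right), \]

so that $\dot{x}=-\nabla\left(f+\frac{h}{4}\|\nabla f\|^2\right)(x)=-\nabla f_1(x)$.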

Discussions

The 1st-order modified potential $f_1$ is useful. For example, it can characterize some implicit bias of GD as a modification of the landscape, and as of 2022 the community continues to find its exciting machine learning implications. However, the methodology is not new.

In fact, obtaining the expression of $f_1$ is relatively easy, simply by matching the GD iterate with the Taylor expansion of $x(h)$ in $h$. This “derivation” however is only formal, and it might even give a false impression that this works for any $h$. However, a power series (in $h$) has a radius of convergence, and its generalization to an ODE will have a similar issue of convergence. A beautiful paper, [Li, Tai and E (2019)], went beyond a formal series matching. Instead, it rigorously characterized the accuracy of the 1st-order modified equation when $h$ is small enough. Its setup is also more general (SGD), which includes GD as a special case. The modified equation for exactly the GD case was also explicitly provided in the literature. For example, [Kong and Tao (2020)] quantitatively discussed when/how/why $f_1$ is insufficient, and of course $f_1$’s expression was provided, even though that was not the point of their paper.

Indeed, $\dot{x}=-\nabla f_1(x)$ can approximate GD for $h$ values larger than those for $\dot{x}=-\nabla f(x)$, but if $h$ becomes too large, it will lose all its approximation power.
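This can be checked by hand for $f=x^2/2$, where everything is explicit: GD gives $x_k=(1-h)^k x_0$, gradient flow gives $x(t)=e^{-t}x_0$, and since $f_1(x)=x^2/2+\frac{h}{4}x^2$ the modified flow gives $x(t)=e^{-(1+h/2)t}x_0$. The sketch below compares the three at $t=kh$ for a few arbitrarily chosen step sizes.

```python
import math

def gd_iterate(x0, h, k):
    """k steps of GD on f(x) = x^2 / 2."""
    x = x0
    for _ in range(k):
        x = x - h * x
    return x

def gradient_flow(x0, t):
    return math.exp(-t) * x0

def modified_flow(x0, h, t):
    # grad f_1(x) = (1 + h / 2) * x for f_1 = x^2 / 2 + (h / 4) * x^2
    return math.exp(-(1.0 + 0.5 * h) * t) * x0

errors = {}
k = 3
for h in (0.1, 0.5, 1.5):
    t = k * h
    xk = gd_iterate(1.0, h, k)
    err_gf = abs(gradient_flow(1.0, t) - xk)
    err_mf = abs(modified_flow(1.0, h, t) - xk)
    errors[h] = (xk, err_gf, err_mf)
    print(f"h={h}: GD={xk:.5f}  |GF-GD|={err_gf:.5f}  |MF-GD|={err_mf:.5f}")
```

For $h=0.1$ and $h=0.5$ the modified flow is visibly closer to GD than gradient flow is. For $h=1.5$ the GD iterate after 3 steps is negative while both flows stay positive, so neither ODE captures the sign flips and both errors exceed the size of the iterate itself.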

Will higher-order modified equation help with large $h$? In most cases, no. In fact if $h$ is too large, higher-order modified equation (i.e. more correction terms) may even lead to worse approximation.

These facts can be exemplified by the following plots, where the LR is respectively 1) small so that gradient flow is a good approximation; 2) still $<1/L$, but medium so that gradient flow is insufficient but modified equation is a good approximation; 3) larger, $\in (1/L, 2/L)$, and modified equation no longer works well; 4) truly large, $>2/L$, and for this simple objective function ($f=x^2/2$) GD blows up.

A remark: although some of the literature uses modified equations to study “large” LRs, for consistency, throughout this blog we will call such LRs medium instead, because there are larger LRs for which the modified equation completely breaks down, and yet GD may still work and produce nontrivial and very interesting behaviors, which oftentimes are beneficial to deep learning.

Truly large learning rates

Now let’s exemplify some of the aforementioned behaviors of truly large LR.

Welcome to the zoo

Nonconvergence. The first instinct one might have is, if LR is too large, GD will just blow up (i.e. iterates grow unboundedly). This indeed could happen, such as whenever $h>2/L$ for quadratic objectives, as illustrated below (again for a simple $f=x^2/2$):

However, for more complex $f$’s, GD does not necessarily blow up under large LR.

Nor does GD always converge to a point, when convergent! We can view the gradient descent iteration as a discrete-in-time dynamical system. It is known that dynamical systems can have attractors of various sorts, some most common ones being fixed point, periodic orbit, and strange attractor (yes, we’re speaking about chaos). And they all can show up in GD dynamics!

Periodic orbits. Figures below are examples of periodic motions produced by GD for a miniature matrix factorization problem, with $f(x,y)=(1-xy)^2/2$ (note: seemingly simple, but in fact not $L$-smooth for any $L$, nor convex). Various values of period are possible, but in none of these figures is the convergence to any minimizer.

GD with h=1.9 converges to three orbits with period 2, 3 and 4 respectively (depending on initial condition). Blue lines are the orbits; red line is a reference line of all minima xy = 1.

These figures are from [Wang et al (2022)], although they are not the main point of that article. More will be discussed in the section after next.

Strange attractor. Let’s now take a look at a case of convergence to a chaotic attractor, a famous example of which is the Lorenz butterfly.

These figures are from [Kong and Tao (2020)]. Again, more will be explained, right below:

An alternative mechanism for escapes from local min

The energy landscape in the above animation has a lot of (spurious) local minima. Impressively (but also rather intuitively), GD iterations can jump out of these local minima, as animated.

This is actually due to a large LR! In fact, if the LR were instead small, GD would just converge to a local minimum close to its initialization, just like gradient flow.

People recently started to appreciate this effect of large learning rate, and empirical results gradually appeared in the literature. A major reason is that the deep learning community, for example, is very much interested in escapes from local minima, which often correspond to improved training accuracy. The most popular mechanism for local min escape is via noise, which usually originates from stochastic gradients. Large LR, however, is a completely complementary escape mechanism, as it requires no stochasticity. A quantitative and rigorous analysis, however, was already provided in [Kong and Tao (2020)], and its main idea will now be sketched:

Given the objective function $f$, let’s decompose it as $f(\textbf{x})=f_0(\textbf{x})+f_1(\textbf{x})$, where $f_0$ corresponds to macroscopic behavior and $f_1$ encodes microscopic details, such as illustrated by the following 1D example:

When LR becomes large enough, GD is no longer able to resolve the details of $f_1$. This is just like driving very fast: small pebbles and bumps on the road can no longer be felt. What is “large enough”? Denote the microscopic scale (when compared to the macroscopic scale) by $\epsilon$ and suppose $f_1$ is $L$-smooth, then $L=\Omega(1/\epsilon)$ due to the 2nd-order spatial derivative, which means a traditional small LR is $h=o(1/L)=o(\epsilon)$. If $h=o(1)$ instead, independent of $\epsilon$, then $h\gg 1/L$, and it is large enough.
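The scaling argument can be tried on a hypothetical 1D instance: $f_0(x)=x^2/2$ and $f_1(x)=\epsilon\sin(x/\epsilon)$, so that $f_1''(x)=-\frac{1}{\epsilon}\sin(x/\epsilon)$ and indeed $L\sim 1/\epsilon$. The values of $\epsilon$, the two step sizes and the initial condition below are arbitrary choices for this sketch, not taken from [Kong and Tao (2020)].

```python
import math

EPS = 0.01

def grad_f(x):
    # f(x) = f0(x) + f1(x) with f0 = x^2 / 2 and f1 = EPS * sin(x / EPS)
    return x + math.cos(x / EPS)

def gd(x0, h, n_steps):
    xs = [x0]
    for _ in range(n_steps):
        xs.append(xs[-1] - h * grad_f(xs[-1]))
    return xs

x0 = 0.8
small = gd(x0, 0.1 * EPS, 5000)   # h = o(eps): resolves every microscopic bump
large = gd(x0, 0.5, 5000)         # h independent of eps: h >> 1/L

bump_width = 2.0 * math.pi * EPS  # spacing of the microscopic oscillations
print("small LR moved by:", abs(small[-1] - x0), "(bump width:", bump_width, ")")
print("large LR first step moved by:", abs(large[1] - x0))
print("large LR range:", min(large), max(large))
```

With the small LR, GD gets trapped within one microscopic bump of its initialization, just like gradient flow. With the large LR, a single step already jumps over many bumps, and the iterates keep moving around instead of settling in a spurious local minimum.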

Using this setup, [Kong and Tao (2020)] rigorously proved that the microscopic part of the gradient, $-h\nabla f_1(\textbf{x})$, effectively acts like a noisy forcing to the deterministic GD dynamics. In fact, using tools from dynamical systems, probability, and functional analysis, they proved that the GD iterates actually converge to a chaotic attractor, and hence to a statistical distribution, which is very close to \[ Z^{-1} \exp(-f_0(\textbf{x})/T)d\textbf{x} \] under reasonable assumptions. Those familiar with statistical mechanics can immediately recognize this approximate distribution as the famous Gibbs distribution. Rather notably, this distribution, being the limit of GD, only depends on the large scale part of the objective, $f_0$, and small details encoded by $f_1$ are not even seen anymore, as a consequence of a large LR ($h\gg 1/L$)!

This quantitative result also suggests that smaller values of $f_0$ (and $f$ too) will have (significantly) higher probability, which roughly means they will be visited more often after many GD iterations. This is usually desirable (smaller training loss, likely beneficial for test accuracy as well), and thus a benefit of large LR.

In these senses, large LR GD behaves very much like SGD, even if there is no randomization or minibatch. It thus provides an alternative means of escape from minima.

A lot more details can be found in [Kong and Tao (2020)], but here is a final excerpt: do objective functions in real problems admit such multiscale structures? Both theoretical and empirical discussions were given in their paper. For example, if training data is multiscale, it is theoretically possible that the loss of a neural network, for regressing the data, inherits a multiscale structure! Here is what GD iterations give for an exemplary weight parameter during the training of a small feedforward network:

You see it doesn’t converge to any single value, but rather to “blobs”, whose union is actually the projection (i.e., the marginal support) of a chaotic strange attractor / probability distribution.

TL;DR [Kong & Tao (2020)] show that GD can, with the help of large LR, escape local minima and exponentially prefer small objective values, because in this case GD is actually sampling a statistical distribution, instead of merely optimizing the objective. This is due to chaotic dynamics created by large LR.

An implicit bias of large LR: preference of flatter minima

Now let’s go back to the simple case, namely the convergence of GD to a single point. Even this simple case is not simple at all, but much fun. We will just discuss one example, namely a rigorous proof of preference of flatter minima, enabled by large LR. But first of all, why do we care?

Popular Conjecture 1: Flatter minima generalize better. There are many works arguing for this belief, while interesting results showing the contrary also exist. We do not intend to say anything new here, but only hope it is meaningful to find flatter local minima.

Popular Conjecture 2: Large LR helps find flatter minima. This conjecture is much more recent, but due to its importance, attention is rapidly building up in the literature. Not only did empirical results emerge, but also semianalytical work, such as those based on the $\mathcal{O}(h)$ correction term in the 1st-order modified potential $f(x)+\frac{h}{4}\|\nabla f(x)\|^2_2$ (see above section “Backward error analysis, a.k.a. modified equation”).

[Wang et al (2022)] went beyond medium LRs for which modified potential works, and studied large LRs, for which the preference of flatter minimum is even more pronounced. They provided a rigorous proof of Popular Conjecture 2, for a subclass of problems.

More precisely, consider a matrix factorization task formulated as an optimization problem \[ \min_{X,Y}~ \frac{1}{2} \|A-XY^\top\|^2,\quad \text{where}\ A\in\mathbb{R}^{n\times n},\ X,Y\in\mathbb{R}^{n\times d}, \] in which one tries to find a rank $d$ approximation of the $n\times n$ matrix $A$. This is a rather general and important data science problem per se, but for those interested in deep learning, it is also almost the same as the training of a 2-layer linear neural network.

The landscape of $f(X,Y):=\frac{1}{2} \|A-XY^\top\|^2$ is intriguing. For example, 1) minimizers are not isolated or countable; in fact, if $(X,Y)$ is a minimizer, so is $(cX,Y/c)$ for any constant scalar $c\ne 0$. The authors call this property homogeneity, and it persists even if certain nonlinear activations are applied, for example for $\frac{1}{2} \|A-X\sigma(Y^\top)\|^2$ where $\sigma$ is ReLU or leaky-ReLU. 2) each local minimizer is a global minimizer.

However, amongst the minimizers some are flat and some are sharp. In fact, one could compute eigenvalues of the Hessian of $f$ and evaluate at a minimizer. Then it will be seen that $\|X\|\approx\|Y\|$ means the landscape is flat locally around the minimizer, and having unbalanced norms implies sharp local geometry instead.
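The link between balance and flatness can be verified in the scalar toy case $f(x,y)=(1-xy)^2/2$ (i.e., $n=d=1$, $A=1$), whose minimizers are $(c,1/c)$. A direct computation gives the Hessian $\begin{pmatrix} y^2 & 2xy-1\\ 2xy-1 & x^2\end{pmatrix}$, so at a minimizer its eigenvalues are $0$ and $c^2+1/c^2$. The sketch below, with made-up helper names, evaluates this numerically.

```python
import math

def hessian(x, y):
    """Entries (f_xx, f_xy, f_yy) of the Hessian of f(x, y) = (1 - x*y)^2 / 2."""
    return y * y, 2.0 * x * y - 1.0, x * x

def top_eigenvalue(a, b, d):
    """Largest eigenvalue of the symmetric matrix [[a, b], [b, d]]."""
    return 0.5 * (a + d) + math.sqrt(0.25 * (a - d) ** 2 + b * b)

def sharpness_at_minimizer(c):
    """Top Hessian eigenvalue at the minimizer (c, 1/c)."""
    return top_eigenvalue(*hessian(c, 1.0 / c))

for c in (1.0, 2.0, 4.0):
    print(f"c={c}: sharpness={sharpness_at_minimizer(c):.4f}, "
          f"c^2 + 1/c^2={c * c + 1.0 / (c * c):.4f}")
```

The balanced minimizer $c=1$ is the flattest one, with sharpness 2, and the sharpness grows as $|x|$ and $|y|$ become more unbalanced.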

Besides possible deep learning implications (flat means better generalization (?)), having flatter geometry also benefits both the analysis and the numerical performance of GD (otherwise a smaller LR is needed). That’s why the literature oftentimes adds explicit regularizers to $f$ to promote balancedness.

[Wang et al (2022)] showed that, if the LR is large enough, GD will start to pick minimizers with balanced norms. Moreover, the larger LR is, the more balance GD can create. No regularizer is needed. Here is a visualization based on a simplest example; see how GD can travel afar to gain balance (and hence flatness) via large LR, even when the initial condition is already close to an unbalanced minimizer and small LR simply takes you there.
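The balancing effect can be tried on the scalar toy problem $f(x,y)=(1-xy)^2/2$. The sketch below is not the experiment of [Wang et al (2022)]: the initial condition $(4, 0.3)$ and both step sizes are arbitrary choices, with the large one picked inside $(2/L,4/L)$ where $L$ is the spectral radius of the Hessian at initialization. For this toy case, a linear-stability argument says that a limit point $(x,y)$ with $xy=1$ can only attract GD if $h(x^2+y^2)\le 2$, which gives $(x-y)^2\le 2/h-2$.

```python
import math

def gd(x, y, h, n_steps):
    """GD on f(x, y) = (1 - x*y)^2 / 2."""
    for _ in range(n_steps):
        r = 1.0 - x * y
        x, y = x + h * r * y, y + h * r * x
    return x, y

def spectral_radius_of_hessian(x, y):
    a, b, d = y * y, 2.0 * x * y - 1.0, x * x
    mid, rad = 0.5 * (a + d), math.sqrt(0.25 * (a - d) ** 2 + b * b)
    return max(abs(mid + rad), abs(mid - rad))

x0, y0 = 4.0, 0.3                      # close to the unbalanced minimizer (4, 0.25)
L = spectral_radius_of_hessian(x0, y0)
h_small, h_large = 0.01, 0.2           # h_large is meant to lie in (2/L, 4/L)

xs, ys = gd(x0, y0, h_small, 2000)
xl, yl = gd(x0, y0, h_large, 2000)

print("L at initialization:", L, " 2/L:", 2.0 / L, " 4/L:", 4.0 / L)
print("small LR limit:", (xs, ys), " gap |x - y|:", abs(xs - ys))
print("large LR limit:", (xl, yl), " gap |x - y|:", abs(xl - yl))
print("stability bound on the gap for h_large:", math.sqrt(2.0 / h_large - 2.0))
```

The small-LR run settles at the nearby unbalanced minimizer, while the large-LR run first oscillates away from it and then converges to a minimizer with a much smaller gap between $|x|$ and $|y|$. No regularizer appears anywhere in the code.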

(Side remark: in a way, [Wang et al (2022)] can also be thought of as related to a phenomenon known as Edge of Stability, which was empirically discovered recently [Cohen et al (2021)] and is starting to attract attention.)

More precisely, what [Wang et al (2022)] proved includes: 1) if convergent to a point, GD with large enough $h$ has an implicit regularization effect of balancing, and the limiting point $(X_\infty,Y_\infty)$ of GD iterates will satisfy a bound on $\|X_\infty-Y_\infty\|$ that decreases with $h$. 2) GD with large $h$ converges. The first sounds like a mouthful but is important, and the second sounds easy; however, the truth is, 1) can actually be obtained using existing dynamical systems tools in an intuitive way, but 2) is much more nontrivial to obtain. This is all because of large LR. Advanced readers might find the proof fun and appreciate the in-depth scrutiny of GD dynamics, but details won’t be loaded here.

However, there is (again) an important point pertinent to this blog — how large is large? Very roughly speaking, the regime where balancing is pronounced is $h\in (2/L,4/L)$, which defies traditional tools briefly reviewed in the beginning of this blog. Note also that bigger $h$ can make GD not convergent.

Expert readers may ask, hold on, what is $L$? In fact, the objective function is quartic (i.e. a 4th-order polynomial) and its gradient is not globally Lipschitz. The merits of the dynamical analysis lie in not only the fact that $h>2/L$, but also that it requires only local Lipschitzness. The $L$ in $(2/L,4/L)$ is simply the spectral radius of Hess$f$ at initialization.

TL;DR [Wang et al (2022)] show that GD has, when the LR is truly large, an implicit regularization effect of converging to flatter local min. Larger LR makes the implicit bias stronger.

Thank you for reading

That’s it for this blog, which is just the tip of an iceberg. For the sake of length and the diversity of readers, a lot of rigor and details, as well as related work, have been omitted. But questions and comments are always welcome. We hope you liked it, and please feel free to cite!

Acknowledgement

We thank Tuo Zhao for encouraging us to write a blog.

References

[1]: J.H.Wilkinson. Error analysis of floating-point computation. Numer. Math. 1960

[2]: Ernst Hairer, Christian Lubich, and Gerhard Wanner. Geometric Numerical Integration. Springer 2006

[3]: Qianxiao Li, Cheng Tai, and Weinan E. Stochastic modified equations and dynamics of stochastic gradient algorithms i: Mathematical foundations. JMLR 2019

[4]: Lingkai Kong and Molei Tao. Stochasticity of deterministic gradient descent: Large learning rate for multiscale objective function. NeurIPS 2020

[5]: Yuqing Wang, Minshuo Chen, Tuo Zhao, and Molei Tao. Large Learning Rate Tames Homogeneity: Convergence and Balancing Effect. ICLR 2022

[6]: Jeremy Cohen, Simran Kaur, Yuanzhi Li, J. Zico Kolter, and Ameet Talwalkar. Gradient descent on neural networks typically occurs at the edge of stability. ICLR 2021

Authors’ homepages:

Molei Tao, Yuqing Wang