Stochastic optimization in deep learning

Large-deviation analysis of SGD invariant measures and global convergence times.

All recent successes of deep learning share the same fundamental backbone: stochastic first-order optimization methods. Nonetheless, a principled understanding of how these methods behave when training modern deep models remains an open question, since they minimize objectives that are both complex and highly non-convex.

I developed a novel theoretical framework to tackle these challenges, focusing on stochastic gradient descent (SGD) and its long-run behavior on non-convex problems:

  1. Long-run distribution of SGD. We provide the first complete characterization of SGD’s invariant measures on non-convex problems. The long-run distribution takes the form of a Boltzmann–Gibbs law: the probability that SGD lies near a set of critical points \(\mathcal{K}\) is of the form
\[\exp\left(-\frac{V(\mathcal{K})}{\gamma}\right)\]

where \(V(\mathcal{K})\) is an energy level determined by the objective and the noise statistics, and \(\gamma\) is the step-size. SGD concentrates exponentially around the minimum-energy state and visits other critical regions with probabilities that are exponentially small in their energy levels (ICML 2024, poster); a toy numerical sketch of this concentration follows the list.

  2. Global convergence time. We quantify how long SGD takes to reach global minima, showing that the expected convergence time scales as
\[\exp\left(\frac{E(x)}{\gamma}\right)\]

where \(E(x)\) is a geometric quantity, depending on the initialization \(x\), that captures both the difficulty of the loss landscape and the noise statistics. This demonstrates that SGD’s practical success stems from favorable loss geometry, particularly near initialization (ICML 2025, poster); a second sketch below illustrates this scaling.
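
Below is a minimal numerical sketch of the first result, under illustrative assumptions: SGD with additive Gaussian gradient noise on a hypothetical one-dimensional tilted double-well objective. The objective, the noise model, and the step-sizes are all illustrative choices, not taken from the paper; the point is only that the iterates spend most of their time near the deeper well and visit the shallower one with a frequency that shrinks roughly exponentially as the step-size decreases.

```python
# Minimal numerical sketch (illustrative assumptions, not the paper's
# experiments): SGD with additive Gaussian gradient noise on a hypothetical
# 1D tilted double-well objective
#     f(x) = (x^2 - 1)^2 / 4 + 0.1 * x,
# whose global minimum lies near x = -1 and whose spurious local minimum
# lies near x = +1. We record how often the iterates sit in each well.
import numpy as np

rng = np.random.default_rng(0)

def grad_f(x):
    # Exact derivative of the objective above.
    return x * (x**2 - 1) + 0.1

sigma = 1.0           # scale of the gradient noise (assumption)
n_steps = 500_000     # iterations per step-size

for gamma in (0.20, 0.15, 0.10):
    x = -1.0          # start at the global minimum
    near_spurious = 0
    for _ in range(n_steps):
        x -= gamma * (grad_f(x) + sigma * rng.standard_normal())
        if x > 0:
            near_spurious += 1
    frac = near_spurious / n_steps
    print(f"gamma = {gamma:.2f}  fraction of time near the spurious minimum = {frac:.4f}")
# The fraction decays roughly like exp(-Delta V / gamma) as gamma shrinks:
# the iterates concentrate exponentially around the minimum-energy state.
```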

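The second result can be probed in the same toy setting: start SGD at the spurious local minimum and record the first iteration at which it enters a neighborhood of the global minimum, for several step-sizes. The tolerance and step-size grid below are again illustrative choices; the point is only that the mean hitting time grows roughly like \(\exp(E/\gamma)\) as \(\gamma\) decreases.

```python
# Companion sketch for the second result (same illustrative double-well as
# above, not the paper's experiments): measure the first hitting time of a
# neighborhood of the global minimum when SGD starts at the spurious one.
import numpy as np

rng = np.random.default_rng(1)

def grad_f(x):
    # Derivative of f(x) = (x^2 - 1)^2 / 4 + 0.1 * x, as in the previous sketch.
    return x * (x**2 - 1) + 0.1

def hitting_time(gamma, sigma=1.0, x0=1.0, target=-1.0, tol=0.2,
                 max_steps=2_000_000):
    """Iterations until the SGD trajectory first satisfies |x - target| < tol."""
    x = x0
    for t in range(1, max_steps + 1):
        x -= gamma * (grad_f(x) + sigma * rng.standard_normal())
        if abs(x - target) < tol:
            return t
    return max_steps  # did not hit within the budget

for gamma in (0.10, 0.08, 0.06, 0.05):
    runs = [hitting_time(gamma) for _ in range(10)]
    print(f"gamma = {gamma:.2f}  mean hitting time: {np.mean(runs):.0f} iterations")
# Plotting log(mean hitting time) against 1/gamma gives an approximately
# straight line, i.e. the Arrhenius-type scaling exp(E / gamma) above.
```
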
These results have been presented at the Thoth seminar (slides), the LPSM statistics seminar (slides), Université Côte d’Azur (slides), Morgan Stanley’s ML Research seminar (slides), and Inria’s Argo team (slides).

Publications

  1. What is the long-run distribution of stochastic gradient descent? A large deviations analysis
    Waïss Azizian, Franck Iutzeler, Jerome Malick, and 1 more author
    In ICML, 2024
  2. The global convergence time of stochastic gradient descent in non-convex landscapes: sharp estimates via large deviations
    Waïss Azizian, Franck Iutzeler, Jerome Malick, and 1 more author
    In ICML, 2025