Stochastic gradient Langevin dynamics

SGLD can be applied to the optimization of non-convex objective functions, shown here to be a sum of Gaussians.

Stochastic gradient Langevin dynamics (SGLD) is an optimization and sampling technique that combines characteristics of stochastic gradient descent, a Robbins–Monro optimization algorithm, and Langevin dynamics, a mathematical extension of molecular dynamics models. Like stochastic gradient descent (SGD), SGLD is an iterative algorithm that uses minibatching to form a stochastic gradient estimator of a differentiable objective function.^[1] Unlike traditional SGD, SGLD can also be used as a sampling method for Bayesian learning. It may be viewed as Langevin dynamics applied to a posterior distribution, with the key difference that the likelihood gradient terms are minibatched, as in SGD. Like Langevin dynamics, SGLD produces samples from a posterior distribution over parameters given the available data. First described by Welling and Teh in 2011, the method has applications in many contexts that require optimization, most notably in machine learning.

Formal definition

Given a parameter vector $\theta$, its prior distribution $p(\theta )$, and a set of data points $X=\{x_{i}\}_{i=1}^{N}$, Langevin dynamics samples from the posterior distribution $p(\theta \mid X)\propto p(\theta )\prod _{i=1}^{N}p(x_{i}\mid \theta )$ by updating the chain:

\Delta \theta _{t}={\frac {\varepsilon _{t}}{2}}\left(\nabla \log p(\theta _{t})+\sum _{i=1}^{N}\nabla \log p(x_{i}\mid \theta _{t})\right)+\eta _{t}

Stochastic gradient Langevin dynamics uses a modified update procedure with minibatched likelihood terms:

\Delta \theta _{t}={\frac {\varepsilon _{t}}{2}}\left(\nabla \log p(\theta _{t})+{\frac {N}{n}}\sum _{i=1}^{n}\nabla \log p(x_{t_{i}}\mid \theta _{t})\right)+\eta _{t}

where $n<N$ is a positive integer (the minibatch size), $x_{t_{1}},\dots ,x_{t_{n}}$ are the data points in the minibatch drawn at iteration $t$, $\eta _{t}\sim {\mathcal {N}}(0,\varepsilon _{t})$ is Gaussian noise, $p(x\mid \theta )$ is the likelihood of the data given the parameter vector $\theta$, and the step sizes $\varepsilon _{t}$ satisfy the following conditions:

\sum _{t=1}^{\infty }\varepsilon _{t}=\infty ,\qquad \sum _{t=1}^{\infty }\varepsilon _{t}^{2}<\infty
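
As an illustration of the two step-size conditions, one schedule that satisfies both is the polynomial decay used by Welling and Teh.^[1] The constants $a$, $b$ and $\gamma$ below are not defined elsewhere in this article and are introduced here only for the example:

```latex
\varepsilon_t = a\,(b+t)^{-\gamma}, \qquad a>0,\quad b\ge 0,\quad \gamma\in(0.5,\,1]

\sum_{t=1}^{\infty}(b+t)^{-\gamma}=\infty \quad (\text{since } \gamma\le 1),
\qquad
\sum_{t=1}^{\infty}(b+t)^{-2\gamma}<\infty \quad (\text{since } 2\gamma>1)
```

The first sum diverges because the exponent is at most 1, and the second converges because the exponent $2\gamma$ exceeds 1.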

During early iterations of the algorithm, each parameter update mimics stochastic gradient descent. As the algorithm approaches a local mode of the posterior (equivalently, a local minimum of the negative log posterior), the gradient shrinks toward zero and the chain produces samples around the maximum a posteriori mode, allowing for posterior inference. This process generates approximate samples from the posterior by balancing the variance of the injected Gaussian noise against that of the stochastic gradient estimate.^{[citation needed]}
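
The minibatched update can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not a reference implementation: the model (unit-variance Gaussian likelihood with unknown mean and a zero-mean Gaussian prior), the function name `sgld_gaussian_mean`, and the schedule constants `a`, `b`, `gamma` are all chosen here for the example.

```python
import math
import random

def sgld_gaussian_mean(data, n_batch, n_steps, prior_var=100.0,
                       a=0.01, b=1.0, gamma=0.55, seed=0):
    """SGLD for the mean theta of x_i ~ N(theta, 1) with prior theta ~ N(0, prior_var)."""
    rng = random.Random(seed)
    N = len(data)
    theta = 0.0
    samples = []
    for t in range(1, n_steps + 1):
        eps = a * (b + t) ** (-gamma)           # decaying step size (assumed schedule)
        batch = rng.sample(data, n_batch)       # minibatch drawn without replacement
        grad_prior = -theta / prior_var         # gradient of log N(theta; 0, prior_var)
        # Minibatch likelihood gradient, rescaled by N/n to estimate the full sum.
        grad_lik = (N / n_batch) * sum(x - theta for x in batch)
        noise = rng.gauss(0.0, math.sqrt(eps))  # eta_t ~ N(0, eps): variance eps
        theta += 0.5 * eps * (grad_prior + grad_lik) + noise
        samples.append(theta)
    return samples

# Synthetic data for the example (assumption: true mean 2.0).
data_rng = random.Random(1)
data = [data_rng.gauss(2.0, 1.0) for _ in range(100)]

samples = sgld_gaussian_mean(data, n_batch=10, n_steps=5000)
kept = samples[1000:]                           # discard early, SGD-like iterations
post_mean_estimate = sum(kept) / len(kept)

# For this conjugate model the exact posterior mean is available for comparison.
exact_post_mean = sum(data) / (len(data) + 1.0 / 100.0)
```

For this conjugate model the exact posterior is Gaussian, so `post_mean_estimate` can be compared directly with `exact_post_mean`; for models without a closed-form posterior no such check is available.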

Application

SGLD is applicable in any optimization context in which it is desirable to quickly obtain posterior samples rather than a single maximum a posteriori mode. The method retains the computational efficiency of stochastic gradient descent relative to traditional gradient descent, while providing additional information about the landscape around the critical point of the objective function. In practice, SGLD can be applied to the training of Bayesian neural networks in deep learning, a task in which the method provides a distribution over model parameters. By introducing information about the variance of these parameters, SGLD characterizes the generalizability of these models at certain points in training.^[2] Additionally, obtaining samples from a posterior distribution permits uncertainty quantification by means of confidence intervals, a feature which is not possible using traditional stochastic gradient descent.^{[citation needed]}
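
To illustrate how posterior samples yield an interval estimate, the sketch below computes an equal-tailed interval from a list of samples (in Bayesian usage this is often called a credible interval). The helper `sample_interval` is hypothetical, and the samples are a stand-in drawn from a standard normal distribution rather than the output of an actual SGLD run.

```python
import math
import random

def sample_interval(samples, level=0.95):
    """Equal-tailed interval covering roughly `level` of the samples."""
    ordered = sorted(samples)
    tail = (1.0 - level) / 2.0
    last = len(ordered) - 1
    lo = ordered[int(math.floor(tail * last))]
    hi = ordered[int(math.ceil((1.0 - tail) * last))]
    return lo, hi

# Stand-in for posterior samples of one parameter (assumption: in practice
# these would be SGLD iterates collected after a burn-in period).
rng = random.Random(0)
stand_in_samples = [rng.gauss(0.0, 1.0) for _ in range(20000)]

lower, upper = sample_interval(stand_in_samples, level=0.95)
```

A single point estimate from SGD provides no analogous spread to summarize, which is the contrast drawn above.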

Variants and associated algorithms

If gradient computations are exact, SGLD reduces to the Langevin Monte Carlo algorithm,^[3] first introduced in the literature of lattice field theory. This algorithm is also a special case of Hamiltonian Monte Carlo in which the proposal consists of a single leapfrog step rather than a series of steps.^[4] Since SGLD can be formulated as a modification of both stochastic gradient descent and MCMC methods, it lies at the intersection of optimization and sampling algorithms: it retains SGD's ability to converge quickly to regions of low cost while providing samples that facilitate posterior inference.^{[citation needed]}

If the constraints on the step sizes $\varepsilon _{t}$ are relaxed so that they do not approach zero asymptotically, the Metropolis–Hastings (MH) rejection rate of SGLD no longer tends to zero, and an MH rejection step becomes necessary.^[1] The resulting algorithm, known as the Metropolis-adjusted Langevin algorithm,^[5] rejects a proposal $\theta ^{t+1}$ when:

{\frac {p(\mathbf {\theta } ^{t}\mid \mathbf {\theta } ^{t+1})p^{*}\left(\mathbf {\theta } ^{t+1}\right)}{p\left(\mathbf {\theta } ^{t+1}\mid \mathbf {\theta } ^{t}\right)p^{*}(\mathbf {\theta } ^{t})}}<u,\ u\sim {\mathcal {U}}[0,1]

where $p(\theta ^{t+1}\mid \theta ^{t})$ is the proposal density, a normal distribution centered one gradient step from $\theta ^{t}$, and $p^{*}(\theta )$ is the target distribution.^{[citation needed]}
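
A Metropolis-adjusted Langevin step can be sketched as follows, using the standard Metropolis–Hastings acceptance ratio (target density times reverse proposal density, divided by target density times forward proposal density) evaluated in log space. The target (a one-dimensional standard normal), the fixed step size `eps`, and all function names are assumptions made for this example.

```python
import math
import random

def log_target(theta):
    # Log density of the example target N(0, 1), up to an additive constant.
    return -0.5 * theta * theta

def grad_log_target(theta):
    return -theta

def log_proposal(to, frm, eps):
    # Log density (up to a constant) of the Langevin proposal frm -> to:
    # a normal with variance eps centered one gradient step from frm.
    mean = frm + 0.5 * eps * grad_log_target(frm)
    return -((to - mean) ** 2) / (2.0 * eps)

def mala(n_steps, eps=0.5, theta0=3.0, seed=0):
    rng = random.Random(seed)
    theta = theta0
    samples, accepted = [], 0
    for _ in range(n_steps):
        proposal = (theta + 0.5 * eps * grad_log_target(theta)
                    + rng.gauss(0.0, math.sqrt(eps)))
        # Log of the Metropolis-Hastings acceptance ratio.
        log_ratio = (log_target(proposal) + log_proposal(theta, proposal, eps)
                     - log_target(theta) - log_proposal(proposal, theta, eps))
        # Accept with probability min(1, exp(log_ratio)); otherwise keep theta.
        if log_ratio >= 0.0 or rng.random() < math.exp(log_ratio):
            theta = proposal
            accepted += 1
        samples.append(theta)
    return samples, accepted / n_steps

chain, accept_rate = mala(20000)
kept = chain[1000:]  # discard burn-in from the deliberately poor start theta0=3
mean = sum(kept) / len(kept)
var = sum((s - mean) ** 2 for s in kept) / len(kept)
```

Because the step size is held fixed rather than decayed, the accept/reject test is what keeps the chain targeting the intended distribution.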

Mixing rates and algorithmic convergence

Recent work has established upper bounds on mixing times for both the traditional Langevin algorithm and the Metropolis-adjusted Langevin algorithm.^[5] Published by Ma et al. in 2018, these bounds quantify how quickly the algorithms converge to the true posterior distribution, with the mixing time defined formally as:

\tau (\varepsilon ;p^{0})=\min \left\{k\mid \left\|p^{k}-p^{*}\right\|_{\mathrm {TV} }\leq \varepsilon \right\}

where $\varepsilon \in (0,1)$ is an arbitrary error tolerance, $p^{0}$ is some initial distribution, $p^{k}$ is the distribution of the chain after $k$ iterations, $p^{*}$ is the posterior distribution, and $\|\cdot \|_{\mathrm {TV} }$ is the total variation norm. Under some regularity conditions on an $L$-Lipschitz smooth objective function $U$ that is $m$-strongly convex outside of a region of radius $R$, with condition number $\kappa ={\frac {L}{m}}$, the mixing rate bounds are:

\tau _{ULA}(\varepsilon ,p^{0})\leq {\mathcal {O}}\left(e^{32LR^{2}}\kappa ^{2}{\frac {d}{\varepsilon ^{2}}}\ln \left({\frac {d}{\varepsilon ^{2}}}\right)\right)

\tau _{MALA}(\varepsilon ,p^{0})\leq {\mathcal {O}}\left(e^{16LR^{2}}\kappa ^{3/2}d^{1/2}\left(d\ln \kappa +\ln \left({\frac {1}{\varepsilon }}\right)\right)^{3/2}\right)

where $\tau _{ULA}$ and $\tau _{MALA}$ refer to the mixing rates of the Unadjusted Langevin Algorithm and the Metropolis Adjusted Langevin Algorithm respectively. These bounds are important because they show that the computational complexity is polynomial in the dimension $d$ provided that $LR^{2}$ is ${\mathcal {O}}(\log d)$.
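
The definition of the mixing time can be made concrete on a toy example. The sketch below uses a two-state Markov chain as a stand-in (it is not a Langevin algorithm, and the transition matrix and helper names are assumptions for illustration) and finds the smallest $k$ at which the total variation distance to the stationary distribution falls below the tolerance.

```python
def total_variation(p, q):
    # Total variation distance between two discrete distributions.
    return 0.5 * sum(abs(pi - qi) for pi, qi in zip(p, q))

def step(dist, P):
    # One step of the chain: row vector `dist` times transition matrix `P`.
    n = len(dist)
    return [sum(dist[i] * P[i][j] for i in range(n)) for j in range(n)]

def mixing_time(p0, P, target, eps, max_steps=10_000):
    """Smallest k with TV(p^k, target) <= eps, where p^k = p0 P^k."""
    dist = list(p0)
    for k in range(max_steps + 1):
        if total_variation(dist, target) <= eps:
            return k
        dist = step(dist, P)
    raise RuntimeError("tolerance not reached within max_steps")

# Two-state chain with stationary distribution (2/3, 1/3).
P = [[0.9, 0.1],
     [0.2, 0.8]]
stationary = [2.0 / 3.0, 1.0 / 3.0]

tau = mixing_time([1.0, 0.0], P, stationary, eps=0.01)
print(tau)  # 10
```

Here the distance shrinks by a factor of 0.7 per step, starting from 1/3, so the tolerance of 0.01 is first met at $k=10$. The bounds above play the same role for the Langevin algorithms, where the distributions are continuous and the distance cannot be tabulated exactly.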

References

[1] Welling, Max; Teh, Yee Whye (2011). "Bayesian Learning via Stochastic Gradient Langevin Dynamics" (PDF). Proceedings of the 28th International Conference on Machine Learning: 681–688.

[2] Chaudhari, Pratik; Choromanska, Anna; Soatto, Stefano; LeCun, Yann; Baldassi, Carlo; Borgs, Christian; Chayes, Jennifer; Sagun, Levent; Zecchina, Riccardo (2017). "Entropy-SGD: Biasing gradient descent into wide valleys". arXiv:1611.01838 [cs.LG].

[3] Kennedy, A. D. (1990). "The theory of hybrid stochastic algorithms". Probabilistic Methods in Quantum Field Theory and Quantum Gravity. Plenum Press. pp. 209–223. ISBN 0-306-43602-7.

[4] Neal, R. (2011). "MCMC Using Hamiltonian Dynamics". Handbook of Markov Chain Monte Carlo. CRC Press. ISBN 978-1-4200-7941-8.

[5] Ma, Y. A.; Chen, Y.; Jin, C.; Flammarion, N.; Jordan, M. I. (2018). "Sampling can be faster than optimization". Proceedings of the National Academy of Sciences. 116 (42): 20881–20885. arXiv:1811.08413. doi:10.1073/pnas.1820003116. PMC 6800351. PMID 31570618.

Retrieved from "https://en.wikipedia.org/w/index.php?title=Stochastic_gradient_Langevin_dynamics&oldid=1211309274". This page was last edited on 1 March 2024, at 22:11 (UTC). Text is available under the Creative Commons Attribution-ShareAlike License 4.0; additional terms may apply.