Evidence lower bound






From Wikipedia, the free encyclopedia
 


In variational Bayesian methods, the evidence lower bound (often abbreviated ELBO, also sometimes called the variational lower bound[1] or negative variational free energy) is a useful lower bound on the log-likelihood of some observed data.

The ELBO is useful because it provides a guarantee on the worst case for the log-likelihood of some distribution (e.g. $p_\theta(X)$) which models a set of data. The actual log-likelihood may be higher (indicating an even better fit to the distribution) because the ELBO includes a Kullback–Leibler divergence (KL divergence) term, which decreases the ELBO when an internal part of the model is inaccurate, even if the model fits the data well overall. Thus improving the ELBO score indicates improving the likelihood of the model $p_\theta(X)$, the fit of the internal component, or both, and the ELBO score makes a good loss function, e.g., for training a deep neural network to improve both the model overall and the internal component. (The internal component is $q_\phi(\cdot \mid x)$, defined in detail later in this article.)

Definition

Let $X$ and $Z$ be random variables, jointly distributed with distribution $p_\theta$. For example, $p_\theta(X)$ is the marginal distribution of $X$, and $p_\theta(Z \mid X)$ is the conditional distribution of $Z$ given $X$. Then, for a sample $x \sim p_\theta$ and any distribution $q_\phi$, the ELBO is defined as
$$L(\phi, \theta; x) := \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right].$$

The ELBO can equivalently be written as[2]
$$L(\phi, \theta; x) = \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln p_\theta(x, z)\right] + H\big[q_\phi(z \mid x)\big]$$
$$= \ln p_\theta(x) - D_{\mathrm{KL}}\big(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x)\big).$$

In the first line, $H[q_\phi(z \mid x)]$ is the entropy of $q_\phi$, which relates the ELBO to the Helmholtz free energy.[3] In the second line, $\ln p_\theta(x)$ is called the evidence for $x$, and $D_{\mathrm{KL}}(q_\phi(z \mid x) \,\|\, p_\theta(z \mid x))$ is the Kullback–Leibler divergence between $q_\phi$ and $p_\theta$. Since the Kullback–Leibler divergence is non-negative, the ELBO forms a lower bound on the evidence (ELBO inequality):
$$\ln p_\theta(x) \ge \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right].$$
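As a concrete numerical illustration, the following Python sketch estimates the ELBO by Monte Carlo and checks the ELBO inequality on a toy model; the prior $p(z) = \mathcal{N}(0, 1)$, likelihood $p_\theta(x \mid z) = \mathcal{N}(z, 1)$, the observation, and the variational parameters are arbitrary illustrative choices, not taken from the references.

```python
# A minimal sketch of the ELBO inequality on a toy Gaussian model, assuming:
#   prior      p(z)   = N(0, 1)
#   likelihood p(x|z) = N(z, 1)
# so the evidence is p(x) = N(x; 0, 2) and the exact posterior is
# p(z|x) = N(x/2, 1/2).  Any Gaussian q(z|x) gives ELBO <= ln p(x).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x = 1.3                      # an observed data point (illustrative)
mu_q, sigma_q = 0.9, 0.8     # an arbitrary (suboptimal) variational posterior

# Monte Carlo estimate of ELBO = E_{z~q}[ln p(x,z) - ln q(z|x)]
z = rng.normal(mu_q, sigma_q, size=100_000)
log_joint = norm.logpdf(z, 0, 1) + norm.logpdf(x, z, 1)   # ln p(z) + ln p(x|z)
log_q = norm.logpdf(z, mu_q, sigma_q)
elbo = np.mean(log_joint - log_q)

log_evidence = norm.logpdf(x, 0, np.sqrt(2))              # ln p(x), exact

# Gap is the KL divergence between q(z|x) and the exact posterior N(x/2, 1/2)
mu_p, var_p = x / 2, 0.5
kl = np.log(np.sqrt(var_p) / sigma_q) + (sigma_q**2 + (mu_q - mu_p)**2) / (2 * var_p) - 0.5

print(f"ELBO          ~ {elbo:.4f}")
print(f"ln p(x)       = {log_evidence:.4f}")
print(f"ELBO + KL gap ~ {elbo + kl:.4f}  (should match ln p(x))")
```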

Motivation

Variational Bayesian inference

Suppose we have an observable random variable $X$, and we want to find its true distribution $p^*$. This would allow us to generate data by sampling and to estimate probabilities of future events. In general, it is impossible to find $p^*$ exactly, forcing us to search for a good approximation.

That is, we define a sufficiently large parametric family $\{p_\theta\}_{\theta \in \Theta}$ of distributions, then solve $\min_\theta L(p_\theta, p^*)$ for some loss function $L$. One possible way to solve this is by considering a small variation from $p_\theta$ to $p_{\theta + \delta\theta}$ and solving $L(p_\theta, p^*) - L(p_{\theta + \delta\theta}, p^*) = 0$. This is a problem in the calculus of variations, which is why it is called the variational method.

Since there are not many explicitly parametrized distribution families (all the classical distribution families, such as the normal distribution, the Gumbel distribution, etc., are far too simplistic to model the true distribution), we consider implicitly parametrized probability distributions. First, define a simple distribution $p(z)$ over a latent random variable $Z$; usually a normal or uniform distribution suffices. Second, define a family of complicated functions $f_\theta$ (such as deep neural networks) parametrized by $\theta$. Finally, define a way to convert any $f_\theta(z)$ into a simple distribution over the observable random variable $X$; for example, if $f_\theta(z) = (f_1(z), f_2(z))$ has two outputs, the corresponding distribution over $X$ could be the normal distribution $\mathcal{N}(f_1(z), e^{f_2(z)})$.

This defines a family of joint distributions $p_\theta$ over $(X, Z)$. It is very easy to sample $(x, z) \sim p_\theta$: simply sample $z \sim p$, compute $f_\theta(z)$, and finally sample $x \sim p_\theta(\cdot \mid z)$ using $f_\theta(z)$. A sampling sketch is shown below.
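The following Python sketch illustrates this sampling procedure for a hypothetical implicit parametrization: a tiny two-layer network stands in for $f_\theta$, and its output is used as the mean of a Gaussian over $X$. All dimensions, weights, and noise scales are arbitrary placeholders, not values from the references.

```python
# A minimal sketch of sampling from an implicitly parametrized joint
# p_theta(x, z): sample z from a simple prior, push it through a parametric
# function f_theta, and use the output as the parameters of a simple
# distribution over x.  The network weights below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
d_z, d_h, d_x = 2, 16, 3                       # latent, hidden, observable dims

# "theta": a tiny two-layer network mapping z to the mean of a Gaussian over x
W1, b1 = rng.normal(size=(d_h, d_z)), np.zeros(d_h)
W2, b2 = rng.normal(size=(d_x, d_h)), np.zeros(d_x)

def f_theta(z):
    h = np.tanh(W1 @ z + b1)
    return W2 @ h + b2                         # mean of p_theta(x | z)

def sample_joint():
    z = rng.normal(size=d_z)                   # z ~ p(z) = N(0, I)
    x = rng.normal(f_theta(z), 0.1)            # x ~ p_theta(x | z) = N(f_theta(z), 0.1^2 I)
    return x, z

x, z = sample_joint()
print("sampled x:", x, "from latent z:", z)
```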

In other words, we have a generative model for both the observable and the latent. Now, we consider a distribution $p_\theta$ good if it is a close approximation of $p^*$:
$$p_\theta(X) \approx p^*(X),$$

since the distribution on the right side is over $X$ only, the distribution on the left side must marginalize the latent variable $Z$ away.
In general, it is impossible to perform the integral $p_\theta(x) = \int p_\theta(x \mid z)\, p(z) \, dz$, forcing us to perform another approximation.

Since $p_\theta(x) = \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(z \mid x)}$ (Bayes' rule), it suffices to find a good approximation of $p_\theta(z \mid x)$. So define another distribution family $q_\phi(z \mid x)$ and use it to approximate $p_\theta(z \mid x)$. This is a discriminative model for the latent.

The entire situation is summarized in the following table:

                  $X$: observable                               $Z$: latent
  marginal        $p_\theta(x) \approx p^*(x)$ (approximable)   $p(z)$ (easy)
  conditional     $p_\theta(x \mid z)$ (easy)                   $p_\theta(z \mid x) \approx q_\phi(z \mid x)$ (approximable)
  joint           $p_\theta(x, z)$ (easy)

In Bayesian language, $X$ is the observed evidence and $Z$ is the latent/unobserved variable. The distribution $p$ over $Z$ is the prior distribution over $Z$, $p_\theta(x \mid z)$ is the likelihood function, and $p_\theta(z \mid x)$ is the posterior distribution over $Z$.

Given an observation $x$, we can infer what $z$ likely gave rise to $x$ by computing $p_\theta(z \mid x)$. The usual Bayesian method is to estimate the integral $p_\theta(x) = \int p_\theta(x \mid z)\, p(z) \, dz$, then compute $p_\theta(z \mid x) = \frac{p_\theta(x \mid z)\, p(z)}{p_\theta(x)}$ by Bayes' rule. This is expensive to perform in general, but if we can simply find a good approximation $q_\phi(z \mid x) \approx p_\theta(z \mid x)$ for most $x, z$, then we can infer $z$ from $x$ cheaply. Thus, the search for a good $q_\phi$ is also called amortized inference.
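For a latent variable with only a few discrete values, this Bayesian computation can be done exactly, as in the following sketch; the prior, component means, and observation are arbitrary illustrative numbers. It is the intractability of the corresponding sum or integral for rich latent spaces that motivates the amortized approximation $q_\phi(z \mid x)$.

```python
# A minimal sketch of exact Bayesian inference for the latent when z is a
# small discrete variable, so the marginalization p(x) = sum_z p(x|z) p(z)
# is cheap.  For continuous or high-dimensional z this sum becomes an
# intractable integral.  All probability tables here are hypothetical.
import numpy as np
from scipy.stats import norm

prior = np.array([0.5, 0.3, 0.2])      # p(z) over z in {0, 1, 2}
means = np.array([-2.0, 0.0, 3.0])     # p(x|z) = N(means[z], 1)

x = 0.4                                # an observation
likelihood = norm.pdf(x, means, 1.0)   # p(x|z) for each z
evidence = np.sum(likelihood * prior)  # p(x) = sum_z p(x|z) p(z)
posterior = likelihood * prior / evidence   # Bayes' rule: p(z|x)

print("p(x)   =", evidence)
print("p(z|x) =", posterior)           # sums to 1
```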

All in all, we have arrived at a problem of variational Bayesian inference.

Deriving the ELBO

A basic result in variational inference is that minimizing the Kullback–Leibler divergence (KL-divergence) is equivalent to maximizing the log-likelihood:
$$\mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)] = -H(p^*) - D_{\mathrm{KL}}\big(p^*(x) \,\|\, p_\theta(x)\big),$$

where $H(p^*) = -\mathbb{E}_{x \sim p^*}[\ln p^*(x)]$ is the entropy of the true distribution. So if we can maximize $\mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)]$, we can minimize $D_{\mathrm{KL}}\big(p^*(x) \,\|\, p_\theta(x)\big)$, and consequently find an accurate approximation $p_\theta \approx p^*$.

To maximize $\mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)]$, we simply sample many $x_i \sim p^*(x)$, i.e. use importance sampling:
$$N \max_\theta \mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)] \approx \max_\theta \sum_{i=1}^N \ln p_\theta(x_i),$$

where $N$ is the number of samples drawn from the true distribution. This approximation can be seen as overfitting.[note 1]

In order to maximize $\sum_{i=1}^N \ln p_\theta(x_i)$, it is necessary to find $\ln p_\theta(x)$:
$$\ln p_\theta(x) = \ln \int p_\theta(x \mid z)\, p(z) \, dz.$$

This usually has no closed form and must be estimated. The usual way to estimate integrals is Monte Carlo integration with importance sampling:
$$p_\theta(x) = \int p_\theta(x \mid z)\, p(z) \, dz = \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right],$$
where $q_\phi(z \mid x)$ is a sampling distribution over $z$ that we use to perform the Monte Carlo integration.

So we see that if we sample $z \sim q_\phi(\cdot \mid x)$, then $\frac{p_\theta(x, z)}{q_\phi(z \mid x)}$ is an unbiased estimator of $p_\theta(x)$. Unfortunately, this does not give us an unbiased estimator of $\ln p_\theta(x)$, because $\ln$ is nonlinear. Indeed, by Jensen's inequality,
$$\ln p_\theta(x) = \ln \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] \ge \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right].$$

In fact, all the obvious estimators of $\ln p_\theta(x)$ are biased downwards, because no matter how many samples $z_i \sim q_\phi(\cdot \mid x)$ we take, we have by Jensen's inequality:
$$\mathbb{E}_{z_i \sim q_\phi(\cdot \mid x)}\left[\ln\left(\frac{1}{N} \sum_{i=1}^N \frac{p_\theta(x, z_i)}{q_\phi(z_i \mid x)}\right)\right] \le \ln \mathbb{E}_{z_i \sim q_\phi(\cdot \mid x)}\left[\frac{1}{N} \sum_{i=1}^N \frac{p_\theta(x, z_i)}{q_\phi(z_i \mid x)}\right] = \ln p_\theta(x).$$
Subtracting the right side, we see that the problem comes down to a biased estimator of zero:
$$\mathbb{E}_{z_i \sim q_\phi(\cdot \mid x)}\left[\ln\left(\frac{1}{N} \sum_{i=1}^N \frac{p_\theta(x, z_i)}{p_\theta(x)\, q_\phi(z_i \mid x)}\right)\right] \le 0.$$
At this point, we could branch off towards the development of an importance-weighted autoencoder[note 2], but we will instead continue with the simplest case, $N = 1$:
$$\ln p_\theta(x) \ge \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right].$$
The tightness of the inequality has a closed form:
$$\ln p_\theta(x) - \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = D_{\mathrm{KL}}\big(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x)\big) \ge 0.$$
We have thus obtained the ELBO function:
$$L(\phi, \theta; x) := \ln p_\theta(x) - D_{\mathrm{KL}}\big(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x)\big).$$
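The following sketch illustrates these bounds numerically on the same toy Gaussian model as above (an arbitrary illustrative choice): the log of a $K$-sample importance-sampling average is biased below $\ln p_\theta(x)$, it equals the ELBO in expectation for $K = 1$, and the bound tightens as $K$ grows.

```python
# A minimal sketch, assuming the same toy Gaussian model as above
# (p(z)=N(0,1), p(x|z)=N(z,1), q(z|x)=N(0.9, 0.8^2)): the importance weight
# p(x,z)/q(z|x) is an unbiased estimator of p(x), but the log of a K-sample
# average is biased downwards by Jensen's inequality.  K = 1 recovers the
# ELBO; larger K tightens the bound.
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x, mu_q, sigma_q = 1.3, 0.9, 0.8
log_evidence = norm.logpdf(x, 0, np.sqrt(2))

def log_weight(z):
    # ln p(x, z) - ln q(z|x)
    return norm.logpdf(z, 0, 1) + norm.logpdf(x, z, 1) - norm.logpdf(z, mu_q, sigma_q)

for K in (1, 5, 50):
    z = rng.normal(mu_q, sigma_q, size=(20_000, K))
    # log of the K-sample importance-sampling average, then averaged over runs
    bound = np.mean(np.log(np.mean(np.exp(log_weight(z)), axis=1)))
    print(f"K={K:3d}: E[ln (1/K) sum_i w_i] ~ {bound:.4f}   (ln p(x) = {log_evidence:.4f})")
```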

Maximizing the ELBO

For a fixed $x$, the optimization $\max_{\phi, \theta} L(\phi, \theta; x)$ simultaneously attempts to maximize $\ln p_\theta(x)$ and minimize $D_{\mathrm{KL}}\big(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x)\big)$. If the parametrizations for $p_\theta$ and $q_\phi$ are flexible enough, we would obtain some $\hat\phi, \hat\theta$ such that we simultaneously have
$$\ln p_{\hat\theta}(x) \approx \max_\theta \ln p_\theta(x); \qquad q_{\hat\phi}(\cdot \mid x) \approx p_{\hat\theta}(\cdot \mid x).$$

Since
$$\mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)] = -H(p^*) - D_{\mathrm{KL}}\big(p^*(x) \,\|\, p_\theta(x)\big),$$
we have
$$\ln p_{\hat\theta}(x) \approx \max_\theta \left(-H(p^*) - D_{\mathrm{KL}}\big(p^*(x) \,\|\, p_\theta(x)\big)\right),$$
and so
$$\hat\theta \approx \arg\min_\theta D_{\mathrm{KL}}\big(p^*(x) \,\|\, p_\theta(x)\big).$$
In other words, maximizing the ELBO simultaneously allows us to obtain an accurate generative model $p_{\hat\theta} \approx p^*$ and an accurate discriminative model $q_{\hat\phi}(\cdot \mid x) \approx p_{\hat\theta}(\cdot \mid x)$.[5]
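As an illustration, the following sketch maximizes the ELBO by gradient ascent over the variational parameters of $q_\phi(z \mid x) = \mathcal{N}(\mu, s^2)$ for the same toy Gaussian model (all numbers are illustrative). Since everything is Gaussian, the ELBO and its gradients are available in closed form, and the optimum recovers the exact posterior $\mathcal{N}(x/2, 1/2)$.

```python
# A minimal sketch: maximizing the ELBO by gradient ascent over the
# variational parameters (mu, s) of q(z|x) = N(mu, s^2), for the toy model
# p(z)=N(0,1), p(x|z)=N(z,1).  Here the ELBO is available in closed form;
# the optimum should recover the exact posterior N(x/2, 1/2).
import numpy as np

x = 1.3                   # illustrative observation
mu, s = 0.0, 1.0          # initial variational parameters
lr = 0.05

def elbo(mu, s):
    # E_q[ln p(z)] + E_q[ln p(x|z)] + H(q), all in closed form for Gaussians
    e_log_prior = -0.5 * np.log(2 * np.pi) - 0.5 * (mu**2 + s**2)
    e_log_lik   = -0.5 * np.log(2 * np.pi) - 0.5 * ((x - mu)**2 + s**2)
    entropy     = 0.5 * np.log(2 * np.pi * np.e * s**2)
    return e_log_prior + e_log_lik + entropy

for step in range(500):
    grad_mu = x - 2 * mu          # d ELBO / d mu
    grad_s  = -2 * s + 1 / s      # d ELBO / d s
    mu += lr * grad_mu
    s  += lr * grad_s

print(f"optimum: mu ~ {mu:.3f} (exact {x/2}), s^2 ~ {s**2:.3f} (exact 0.5)")
print(f"ELBO at optimum ~ {elbo(mu, s):.4f}")
```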

Main forms

The ELBO has many possible expressions, each with a different emphasis.

$$\mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}\right] = \int q_\phi(z \mid x) \ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)} \, dz.$$
This form shows that if we sample $z \sim q_\phi(\cdot \mid x)$, then $\ln \frac{p_\theta(x, z)}{q_\phi(z \mid x)}$ is an unbiased estimator of the ELBO.

$$\ln p_\theta(x) - D_{\mathrm{KL}}\big(q_\phi(\cdot \mid x) \,\|\, p_\theta(\cdot \mid x)\big).$$
This form shows that the ELBO is a lower bound on the evidence $\ln p_\theta(x)$, and that maximizing the ELBO with respect to $\phi$ is equivalent to minimizing the KL-divergence from $p_\theta(\cdot \mid x)$ to $q_\phi(\cdot \mid x)$.

$$\mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln p_\theta(x \mid z)\right] - D_{\mathrm{KL}}\big(q_\phi(\cdot \mid x) \,\|\, p\big).$$
This form shows that maximizing the ELBO simultaneously attempts to keep $q_\phi(\cdot \mid x)$ close to $p$ and to concentrate $q_\phi(\cdot \mid x)$ on those $z$ that maximize $\ln p_\theta(x \mid z)$. That is, the approximate posterior $q_\phi(\cdot \mid x)$ balances between staying close to the prior $p$ and moving towards the maximum likelihood $\arg\max_z \ln p_\theta(x \mid z)$.

$$H\big(q_\phi(\cdot \mid x)\big) + \mathbb{E}_{z \sim q_\phi(\cdot \mid x)}\left[\ln p_\theta(z \mid x)\right] + \ln p_\theta(x).$$
This form shows that maximizing the ELBO simultaneously attempts to keep the entropy of $q_\phi(\cdot \mid x)$ high and to concentrate $q_\phi(\cdot \mid x)$ on those $z$ that maximize $\ln p_\theta(z \mid x)$. That is, the approximate posterior $q_\phi(\cdot \mid x)$ balances between being a uniform distribution and moving towards the maximum a posteriori $\arg\max_z \ln p_\theta(z \mid x)$.
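The following sketch checks numerically, on the same toy Gaussian model (an arbitrary illustrative choice, where the exact posterior and evidence are known), that these expressions agree.

```python
# A minimal sketch checking that the different expressions for the ELBO agree,
# assuming the toy model p(z)=N(0,1), p(x|z)=N(z,1), q(z|x)=N(0.9, 0.8^2),
# for which the posterior is N(x/2, 1/2) and the evidence is N(x; 0, 2).
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)
x, mu_q, s_q = 1.3, 0.9, 0.8
z = rng.normal(mu_q, s_q, size=200_000)

log_q     = norm.logpdf(z, mu_q, s_q)
log_prior = norm.logpdf(z, 0, 1)
log_lik   = norm.logpdf(x, z, 1)

def kl_gauss(m1, s1, m2, s2):   # KL(N(m1,s1^2) || N(m2,s2^2))
    return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

form1 = np.mean(log_prior + log_lik - log_q)                       # E_q[ln p(x,z)/q(z|x)]
form2 = norm.logpdf(x, 0, np.sqrt(2)) - kl_gauss(mu_q, s_q, x/2, np.sqrt(0.5))
form3 = np.mean(log_lik) - kl_gauss(mu_q, s_q, 0.0, 1.0)           # E_q[ln p(x|z)] - KL(q||p)
form4 = (0.5 * np.log(2 * np.pi * np.e * s_q**2)                   # H(q)
         + np.mean(norm.logpdf(z, x/2, np.sqrt(0.5)))              # + E_q[ln p(z|x)]
         + norm.logpdf(x, 0, np.sqrt(2)))                          # + ln p(x)

print(form1, form2, form3, form4)   # all approximately equal
```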

Data-processing inequality

Suppose we take $N$ independent samples from $p^*(x)$ and collect them in the dataset $D = \{x_1, \ldots, x_N\}$; then we have the empirical distribution $q_D(x) = \frac{1}{N} \sum_{i=1}^N \delta_{x_i}$.


Fitting $p_\theta(x)$ to $q_D(x)$ can be done, as usual, by maximizing the log-likelihood $\ln p_\theta(D)$:
$$D_{\mathrm{KL}}\big(q_D(x) \,\|\, p_\theta(x)\big) = -\frac{1}{N} \sum_{i=1}^N \ln p_\theta(x_i) - H(q_D) = -\frac{1}{N} \ln p_\theta(D) - H(q_D).$$
Now, by the ELBO inequality, we can bound $\ln p_\theta(D)$, and thus
$$D_{\mathrm{KL}}\big(q_D(x) \,\|\, p_\theta(x)\big) \le -\frac{1}{N} \sum_{i=1}^N \mathbb{E}_{z_i \sim q_\phi(\cdot \mid x_i)}\left[\ln \frac{p_\theta(x_i, z_i)}{q_\phi(z_i \mid x_i)}\right] - H(q_D).$$
The right-hand side simplifies to a KL-divergence, and so we get
$$D_{\mathrm{KL}}\big(q_D(x) \,\|\, p_\theta(x)\big) \le D_{\mathrm{KL}}\big(q_{D, \phi}(x, z) \,\|\, p_\theta(x, z)\big), \qquad \text{where } q_{D, \phi}(x, z) := q_D(x)\, q_\phi(z \mid x).$$
This result can be interpreted as a special case of the data processing inequality.

In this interpretation, maximizing the ELBO over the dataset is minimizing $D_{\mathrm{KL}}\big(q_{D, \phi}(x, z) \,\|\, p_\theta(x, z)\big)$, which upper-bounds the real quantity of interest $D_{\mathrm{KL}}\big(q_D(x) \,\|\, p_\theta(x)\big)$ via the data-processing inequality. That is, we append a latent space to the observable space, paying the price of a weaker inequality for the sake of more computationally efficient minimization of the KL-divergence.[6]
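The following sketch checks this inequality on a small discrete model, where both KL divergences can be computed exactly; all probability tables are arbitrary illustrative choices.

```python
# A minimal sketch: on a small discrete model, check that
# KL(q_D(x) || p(x)) <= KL(q_D(x) q_phi(z|x) || p(x, z)),
# the data-processing bound discussed above.  All tables are hypothetical.
import numpy as np

p_z = np.array([0.6, 0.4])                             # p(z), z in {0,1}
p_x_given_z = np.array([[0.7, 0.2, 0.1],               # p(x|z=0), x in {0,1,2}
                        [0.1, 0.3, 0.6]])              # p(x|z=1)
p_xz = p_z[:, None] * p_x_given_z                      # p(x,z), shape (z, x)
p_x = p_xz.sum(axis=0)                                 # p(x)

q_D = np.array([0.5, 0.25, 0.25])                      # empirical distribution of the data
q_z_given_x = np.array([[0.8, 0.5, 0.3],               # q_phi(z=0|x)
                        [0.2, 0.5, 0.7]])              # q_phi(z=1|x)
q_xz = q_D[None, :] * q_z_given_x                      # q_D(x) q_phi(z|x)

kl_marginal = np.sum(q_D * np.log(q_D / p_x))
kl_joint = np.sum(q_xz * np.log(q_xz / p_xz))

print(f"KL(q_D(x) || p(x))      = {kl_marginal:.4f}")
print(f"KL(q_D q_phi || p(x,z)) = {kl_joint:.4f}  (always >= the marginal KL)")
```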

References

  1. Kingma, Diederik P.; Welling, Max (2014-05-01). "Auto-Encoding Variational Bayes". arXiv:1312.6114 [stat.ML].
  2. Goodfellow, Ian; Bengio, Yoshua; Courville, Aaron (2016). "Chapter 19". Deep Learning. Adaptive Computation and Machine Learning. Cambridge, Mass.: The MIT Press. ISBN 978-0-262-03561-3.
  3. Hinton, Geoffrey E.; Zemel, Richard (1993). "Autoencoders, Minimum Description Length and Helmholtz Free Energy". Advances in Neural Information Processing Systems. 6. Morgan-Kaufmann.
  4. Burda, Yuri; Grosse, Roger; Salakhutdinov, Ruslan (2015-09-01). "Importance Weighted Autoencoders". arXiv:1509.00519 [stat.ML].
  5. Neal, Radford M.; Hinton, Geoffrey E. (1998). "A View of the EM Algorithm that Justifies Incremental, Sparse, and Other Variants". Learning in Graphical Models. Dordrecht: Springer Netherlands. pp. 355–368. doi:10.1007/978-94-011-5014-9_12. ISBN 978-94-010-6104-9. S2CID 17947141.
  6. Kingma, Diederik P.; Welling, Max (2019-11-27). "An Introduction to Variational Autoencoders". Foundations and Trends in Machine Learning. 12 (4). Section 2.7. arXiv:1906.02691. doi:10.1561/2200000056. ISSN 1935-8237. S2CID 174802445.
Notes

  1. In fact, by Jensen's inequality, $\mathbb{E}_{x_i \sim p^*(x)}\left[\max_\theta \sum_i \ln p_\theta(x_i)\right] \ge \max_\theta \mathbb{E}_{x_i \sim p^*(x)}\left[\sum_i \ln p_\theta(x_i)\right] = N \max_\theta \mathbb{E}_{x \sim p^*(x)}[\ln p_\theta(x)]$: the estimator is biased upwards. This can be seen as overfitting: for some finite set of sampled data $\{x_i\}$, there is usually some $\theta$ that fits them better than the entire distribution $p^*$.
  2. By the delta method, we have
$$\mathbb{E}_{z_i \sim q_\phi(\cdot \mid x)}\left[\ln\left(\frac{1}{N} \sum_{i=1}^N \frac{p_\theta(x, z_i)}{q_\phi(z_i \mid x)}\right)\right] \approx \ln p_\theta(x) - \frac{1}{2N}\operatorname{Var}_{z \sim q_\phi(\cdot \mid x)}\left[\frac{p_\theta(x, z)}{p_\theta(x)\, q_\phi(z \mid x)}\right].$$
  If we continue with this, we would obtain the importance-weighted autoencoder.[4]