Improved formatting
|
→Training an autoencoder: Fix a mangled sentence
|
||
(17 intermediate revisions by 15 users not shown) | |||
Line 3: | Line 3: | ||
{{Use dmy dates|date=March 2020|cs1-dates=y}} |
{{Use dmy dates|date=March 2020|cs1-dates=y}} |
||
{{Machine learning|Artificial neural network}} |
{{Machine learning|Artificial neural network}} |
||
An '''autoencoder''' is a type of [[artificial neural network]] used to learn [[Feature learning|efficient codings]] of unlabeled data ([[unsupervised learning]]).<ref name=":12">{{cite journal|doi=10.1002/aic.690370209|title=Nonlinear principal component analysis using autoassociative neural networks|journal=AIChE Journal|volume=37|issue=2|pages=233–243|date=1991|last1=Kramer|first1=Mark A.|url= https://www.researchgate.net/profile/Abir_Alobaid/post/To_learn_a_probability_density_function_by_using_neural_network_can_we_first_estimate_density_using_nonparametric_methods_then_train_the_network/attachment/59d6450279197b80779a031e/AS:451263696510979@1484601057779/download/NL+PCA+by+using+ANN.pdf}}</ref><ref name=":13">{{Cite journal |last=Kramer |first=M. A. |date=1992-04-01 |title=Autoassociative neural networks |url=https://dx.doi.org/10.1016/0098-1354%2892%2980051-A |journal=Computers & Chemical Engineering |series=Neutral network applications in chemical engineering |language=en |volume=16 |issue=4 |pages=313–328 |doi=10.1016/0098-1354(92)80051-A |issn=0098-1354}}</ref> An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an [[Feature learning|efficient representation]] (encoding) for a set of data, typically for [[dimensionality reduction]]. |
An '''autoencoder''' is a type of [[artificial neural network]] used to learn [[Feature learning|efficient codings]] of unlabeled data ([[unsupervised learning]]).<ref name=":12">{{cite journal|doi=10.1002/aic.690370209|title=Nonlinear principal component analysis using autoassociative neural networks|journal=AIChE Journal|volume=37|issue=2|pages=233–243|date=1991|last1=Kramer|first1=Mark A.|bibcode=1991AIChE..37..233K |url= https://www.researchgate.net/profile/Abir_Alobaid/post/To_learn_a_probability_density_function_by_using_neural_network_can_we_first_estimate_density_using_nonparametric_methods_then_train_the_network/attachment/59d6450279197b80779a031e/AS:451263696510979@1484601057779/download/NL+PCA+by+using+ANN.pdf}}</ref><ref name=":13">{{Cite journal |last=Kramer |first=M. A. |date=1992-04-01 |title=Autoassociative neural networks |url=https://dx.doi.org/10.1016/0098-1354%2892%2980051-A |journal=Computers & Chemical Engineering |series=Neutral network applications in chemical engineering |language=en |volume=16 |issue=4 |pages=313–328 |doi=10.1016/0098-1354(92)80051-A |issn=0098-1354}}</ref> An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an [[Feature learning|efficient representation]] (encoding) for a set of data, typically for [[dimensionality reduction]]. |
||
Variants exist, aiming to force the learned representations to assume useful properties.<ref name=":0" /> Examples are regularized autoencoders (''Sparse'', ''Denoising'' and ''Contractive''), which are effective in learning representations for subsequent [[Statistical classification|classification]] tasks,<ref name=":4" /> and ''Variational'' autoencoders, with applications as [[generative model]]s.<ref name=":11">{{cite journal |arxiv=1906.02691|doi=10.1561/2200000056|bibcode=2019arXiv190602691K|title=An Introduction to Variational Autoencoders|date=2019|last1=Welling|first1=Max|last2=Kingma|first2=Diederik P.|journal=Foundations and Trends in Machine Learning|volume=12|issue=4|pages=307–392|s2cid=174802445}}</ref> Autoencoders are applied to many problems, including [[face recognition|facial recognition]],<ref>Hinton GE, Krizhevsky A, Wang SD. [http://www.cs.toronto.edu/~fritz/absps/transauto6.pdf Transforming auto-encoders.] In International Conference on Artificial Neural Networks 2011 Jun 14 (pp. 44-51). 
Springer, Berlin, Heidelberg.</ref> feature detection,<ref name=":2">{{Cite book|last=Géron|first=Aurélien|title=Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow|publisher=O’Reilly Media, Inc.|year=2019|location=Canada|pages=739–740}}</ref> anomaly detection and acquiring the meaning of words.<ref>{{cite journal|doi=10.1016/j.neucom.2008.04.030|title=Modeling word perception using the Elman network|journal=Neurocomputing|volume=71|issue=16–18|pages=3150|date=2008|last1=Liou|first1=Cheng-Yuan|last2=Huang|first2=Jau-Chi|last3=Yang|first3=Wen-Chie|url=http://ntur.lib.ntu.edu.tw//handle/246246/155195 }}</ref><ref>{{cite journal|doi=10.1016/j.neucom.2013.09.055|title=Autoencoder for words|journal=Neurocomputing|volume=139|pages=84–96|date=2014|last1=Liou|first1=Cheng-Yuan|last2=Cheng|first2=Wei-Chen|last3=Liou|first3=Jiun-Wei|last4=Liou|first4=Daw-Ran}}</ref> Autoencoders are also generative models which can randomly generate new data that is similar to the input data (training data).<ref name=":2" /> |
Variants exist, aiming to force the learned representations to assume useful properties.<ref name=":0" /> Examples are regularized autoencoders (''Sparse'', ''Denoising'' and ''Contractive''), which are effective in learning representations for subsequent [[Statistical classification|classification]] tasks,<ref name=":4" /> and [[Variational_autoencoder|''Variational'' autoencoders]], with applications as [[generative model]]s.<ref name=":11">{{cite journal |arxiv=1906.02691|doi=10.1561/2200000056|bibcode=2019arXiv190602691K|title=An Introduction to Variational Autoencoders|date=2019|last1=Welling|first1=Max|last2=Kingma|first2=Diederik P.|journal=Foundations and Trends in Machine Learning|volume=12|issue=4|pages=307–392|s2cid=174802445}}</ref> Autoencoders are applied to many problems, including [[face recognition|facial recognition]],<ref>Hinton GE, Krizhevsky A, Wang SD. [http://www.cs.toronto.edu/~fritz/absps/transauto6.pdf Transforming auto-encoders.] In International Conference on Artificial Neural Networks 2011 Jun 14 (pp. 44-51). 
Springer, Berlin, Heidelberg.</ref> feature detection,<ref name=":2">{{Cite book|last=Géron|first=Aurélien|title=Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow|publisher=O’Reilly Media, Inc.|year=2019|location=Canada|pages=739–740}}</ref> anomaly detection and acquiring the meaning of words.<ref>{{cite journal|doi=10.1016/j.neucom.2008.04.030|title=Modeling word perception using the Elman network|journal=Neurocomputing|volume=71|issue=16–18|pages=3150|date=2008|last1=Liou|first1=Cheng-Yuan|last2=Huang|first2=Jau-Chi|last3=Yang|first3=Wen-Chie|url=http://ntur.lib.ntu.edu.tw//handle/246246/155195 }}</ref><ref>{{cite journal|doi=10.1016/j.neucom.2013.09.055|title=Autoencoder for words|journal=Neurocomputing|volume=139|pages=84–96|date=2014|last1=Liou|first1=Cheng-Yuan|last2=Cheng|first2=Wei-Chen|last3=Liou|first3=Jiun-Wei|last4=Liou|first4=Daw-Ran}}</ref> Autoencoders are also generative models which can randomly generate new data that is similar to the input data (training data).<ref name=":2" /> |
||
{{Toclimit|3}} |
{{Toclimit|3}} |
||
Line 27: | Line 27: | ||
In most situations, the reference distribution is just the [[Empirical measure|empirical distribution]] given by a dataset <math>\{x_1, ..., x_N\} \subset \mathcal X</math>, so that<math display="block">\mu_{ref} = \frac{1}{N}\sum_{i=1}^N \delta_{x_i}</math> |
In most situations, the reference distribution is just the [[Empirical measure|empirical distribution]] given by a dataset <math>\{x_1, ..., x_N\} \subset \mathcal X</math>, so that<math display="block">\mu_{ref} = \frac{1}{N}\sum_{i=1}^N \delta_{x_i}</math> |
||
where |
where <math>\delta_{x_i}</math> is the [[Dirac measure]], the quality function is just L2 loss: <math>d(x, x') = \|x - x'\|_2^2</math>, and <math>\|\cdot\|_2</math> is the Euclidean norm. Then the problem of searching for the optimal autoencoder is just a [[Least squares|least-squares]] optimization:<math display="block">\min_{\theta, \phi} L(\theta, \phi), \text{where } L(\theta, \phi) = \frac{1}{N}\sum_{i=1}^N \|x_i - D_\theta(E_\phi(x_i))\|_2^2</math> |
||
=== Interpretation === |
=== Interpretation === |
||
[[File:Autoencoder_schema.png|thumb|Schema of a basic |
[[File:Autoencoder_schema.png|thumb|Schema of a basic autoencoder]] |
||
An autoencoder has two main parts: an encoder that maps the message to a code, and a decoder that reconstructs the message from the code. An optimal autoencoder would perform as close to perfect reconstruction as possible, with "close to perfect" defined by the reconstruction quality function <math>d</math>. |
An autoencoder has two main parts: an encoder that maps the message to a code, and a decoder that reconstructs the message from the code. An optimal autoencoder would perform as close to perfect reconstruction as possible, with "close to perfect" defined by the reconstruction quality function <math>d</math>. |
||
Line 70: | Line 70: | ||
or the L1 loss, as <math>s(\rho, \hat\rho) = |\rho- \hat\rho|</math>, or the L2 loss, as <math>s(\rho, \hat\rho) = |\rho- \hat\rho|^2</math>. |
or the L1 loss, as <math>s(\rho, \hat\rho) = |\rho- \hat\rho|</math>, or the L2 loss, as <math>s(\rho, \hat\rho) = |\rho- \hat\rho|^2</math>. |
||
Alternatively, the sparsity regularization loss may be defined without reference to any "desired sparsity", but simply force as much sparsity as possible. In this case, one can sparsity regularization loss as <math display="block">L_{sparsity}(\theta, \phi) = \mathbb \mathbb E_{x\sim\mu_X}\left[ |
Alternatively, the sparsity regularization loss may be defined without reference to any "desired sparsity", but simply force as much sparsity as possible. In this case, one can define the sparsity regularization loss as <math display="block">L_{sparsity}(\theta, \phi) = \mathbb \mathbb E_{x\sim\mu_X}\left[ |
||
\sum_{k\in 1:K} w_k \|h_k\| |
\sum_{k\in 1:K} w_k \|h_k\| |
||
\right]</math>where <math>h_k</math> is the activation vector in the <math>k</math>-th layer of the autoencoder. The norm <math>\|\cdot\|</math> is usually the L1 norm (giving the L1 sparse autoencoder) or the L2 norm (giving the L2 sparse autoencoder). |
\right]</math>where <math>h_k</math> is the activation vector in the <math>k</math>-th layer of the autoencoder. The norm <math>\|\cdot\|</math> is usually the L1 norm (giving the L1 sparse autoencoder) or the L2 norm (giving the L2 sparse autoencoder). |
||
Line 76: | Line 76: | ||
====Denoising autoencoder (DAE)==== |
====Denoising autoencoder (DAE)==== |
||
Denoising autoencoders (DAE) try to achieve a ''good'' representation by changing the ''reconstruction criterion''.<ref name=":0" /><ref name=":4" /> |
Denoising autoencoders (DAE) try to achieve a ''good'' representation by changing the ''reconstruction criterion''.<ref name=":0" /><ref name=":4" /> |
||
A DAE is |
A DAE, originally called a "robust autoassociative network",<ref name=":13"/> is trained by intentionally corrupting the inputs of a standard autoencoder during training. A noise process is defined by a probability distribution <math>\mu_T</math> over functions <math>T:\mathcal X \to \mathcal X</math>. That is, the function <math>T</math> takes a message <math>x\in \mathcal X</math>, and corrupts it to a noisy version <math>T(x)</math>. The function <math>T</math> is selected randomly, with a probability distribution <math>\mu_T</math>. |
||
Given a task <math>(\mu_{ref}, d)</math>, the problem of training a DAE is the optimization problem:<math display="block">\min_{\theta, \phi}L(\theta, \phi) = \mathbb \mathbb E_{x\sim \mu_X, T\sim\mu_T}[d(x, (D_\theta\circ E_\phi \circ T)(x))]</math>That is, the optimal DAE should take any noisy message and attempt to recover the original message without noise, thus the name "denoising"''.'' |
Given a task <math>(\mu_{ref}, d)</math>, the problem of training a DAE is the optimization problem:<math display="block">\min_{\theta, \phi}L(\theta, \phi) = \mathbb \mathbb E_{x\sim \mu_X, T\sim\mu_T}[d(x, (D_\theta\circ E_\phi \circ T)(x))]</math>That is, the optimal DAE should take any noisy message and attempt to recover the original message without noise, thus the name "denoising"''.'' |
||
Line 101: | Line 101: | ||
==== Minimal description length autoencoder ==== |
==== Minimal description length autoencoder ==== |
||
<ref>{{Cite journal |last1=Hinton |first1=Geoffrey E |last2=Zemel |first2=Richard |date=1993 |title=Autoencoders, Minimum Description Length and Helmholtz Free Energy |url=https://proceedings.neurips.cc/paper/1993/hash/9e3cfc48eccf81a0d57663e129aef3cb-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Morgan-Kaufmann |volume=6}}</ref> |
<ref>{{Cite journal |last1=Hinton |first1=Geoffrey E |last2=Zemel |first2=Richard |date=1993 |title=Autoencoders, Minimum Description Length and Helmholtz Free Energy |url=https://proceedings.neurips.cc/paper/1993/hash/9e3cfc48eccf81a0d57663e129aef3cb-Abstract.html |journal=Advances in Neural Information Processing Systems |publisher=Morgan-Kaufmann |volume=6}}</ref> |
||
{{Empty section|date=March 2024}} |
|||
=== Concrete autoencoder === |
=== Concrete autoencoder === |
||
Line 114: | Line 115: | ||
==Advantages of depth== |
==Advantages of depth== |
||
[[File:Autoencoder_structure.png|350x350px|Schematic structure of an autoencoder with 3 fully connected hidden layers. The code (z, or h for reference in the text) is the most internal layer.|thumb]] |
[[File:Autoencoder_structure.png|350x350px|Schematic structure of an autoencoder with 3 fully connected hidden layers. The code (z, or h for reference in the text) is the most internal layer.|thumb]] |
||
Autoencoders are often trained with a single |
Autoencoders are often trained with a single-layer encoder and a single-layer decoder, but using many-layered (deep) encoders and decoders offers many advantages.<ref name=":0" /> |
||
* Depth can exponentially reduce the computational cost of representing some functions. |
* Depth can exponentially reduce the computational cost of representing some functions. |
||
* Depth can exponentially decrease the amount of training data needed to learn some functions. |
* Depth can exponentially decrease the amount of training data needed to learn some functions. |
||
* Experimentally, deep autoencoders yield better compression compared to shallow or linear autoencoders.<ref name=":7" /> |
* Experimentally, deep autoencoders yield better compression compared to shallow or linear autoencoders.<ref name=":7" /> |
||
=== Training === |
=== Training === |
||
[[Geoffrey Hinton]] developed the [[deep belief network]] technique for training many-layered deep autoencoders. His method involves treating each |
[[Geoffrey Hinton]] developed the [[deep belief network]] technique for training many-layered deep autoencoders. His method involves treating each neighboring set of two layers as a [[restricted Boltzmann machine]] so that pretraining approximates a good solution, then using backpropagation to fine-tune the results.<ref name=":7">{{cite journal|last1=Hinton|first1=G. E.|last2=Salakhutdinov|first2=R.R.|title=Reducing the Dimensionality of Data with Neural Networks|journal=Science|date=28 July 2006|volume=313|issue=5786|pages=504–507|doi=10.1126/science.1127647|pmid=16873662|bibcode=2006Sci...313..504H|s2cid=1658773}}</ref> |
||
Researchers have debated whether joint training (i.e. training the whole architecture together with a single global reconstruction objective to optimize) would be better for deep auto-encoders.<ref name=":9">{{cite arXiv |eprint=1405.1380|last1=Zhou|first1=Yingbo|last2=Arpit|first2=Devansh|last3=Nwogu|first3=Ifeoma|last4=Govindaraju|first4=Venu|title=Is Joint Training Better for Deep Auto-Encoders?|class=stat.ML|date=2014}}</ref> A 2015 study showed that joint training learns better data models along with more representative features for classification as compared to the layerwise method.<ref name=":9" /> However, their experiments showed that the success of joint training depends heavily on the regularization strategies adopted.<ref name=":9" /><ref>R. Salakhutdinov and G. E. Hinton, “Deep |
Researchers have debated whether joint training (i.e. training the whole architecture together with a single global reconstruction objective to optimize) would be better for deep auto-encoders.<ref name=":9">{{cite arXiv |eprint=1405.1380|last1=Zhou|first1=Yingbo|last2=Arpit|first2=Devansh|last3=Nwogu|first3=Ifeoma|last4=Govindaraju|first4=Venu|title=Is Joint Training Better for Deep Auto-Encoders?|class=stat.ML|date=2014}}</ref> A 2015 study showed that joint training learns better data models along with more representative features for classification as compared to the layerwise method.<ref name=":9" /> However, their experiments showed that the success of joint training depends heavily on the regularization strategies adopted.<ref name=":9" /><ref>R. Salakhutdinov and G. E. Hinton, “Deep Boltzmann machines,” in AISTATS, 2009, pp. 448–455.</ref> |
||
== Applications == |
== Applications == |
||
Line 129: | Line 130: | ||
=== Dimensionality reduction === |
=== Dimensionality reduction === |
||
[[File:PCA vs Linear Autoencoder.png|thumb|Plot of the first two Principal Components (left) and a two-dimension hidden layer of a Linear Autoencoder (Right) applied to the [[Fashion MNIST |
[[File:PCA vs Linear Autoencoder.png|thumb|Plot of the first two Principal Components (left) and a two-dimension hidden layer of a Linear Autoencoder (Right) applied to the [[Fashion MNIST]] dataset.<ref name=":10">{{Cite web|url=https://github.com/zalandoresearch/fashion-mnist|title=Fashion MNIST|website=[[GitHub]]|date=2019-07-12}}</ref> The two models being both linear learn to span the same subspace. The projection of the data points is indeed identical, apart from rotation of the subspace. While PCA selects a specific orientation up to reflections in the general case, the cost function of a simple autoencoder is invariant to rotations of the latent space.]][[Dimensionality reduction]] was one of the first [[deep learning]] applications.<ref name=":0" /> |
||
For Hinton's 2006 study,<ref name=":7" /> he pretrained a multi-layer autoencoder with a stack of [[Restricted Boltzmann machine|RBMs]] and then used their weights to initialize a deep autoencoder with gradually smaller hidden layers until hitting a bottleneck of 30 neurons. The resulting 30 dimensions of the code yielded a smaller reconstruction error compared to the first 30 components of a principal component analysis (PCA), and learned a representation that was qualitatively easier to interpret, clearly separating data clusters.<ref name=":0" /><ref name=":7" /> |
For Hinton's 2006 study,<ref name=":7" /> he pretrained a multi-layer autoencoder with a stack of [[Restricted Boltzmann machine|RBMs]] and then used their weights to initialize a deep autoencoder with gradually smaller hidden layers until hitting a bottleneck of 30 neurons. The resulting 30 dimensions of the code yielded a smaller reconstruction error compared to the first 30 components of a principal component analysis (PCA), and learned a representation that was qualitatively easier to interpret, clearly separating data clusters.<ref name=":0" /><ref name=":7" /> |
||
Line 174: | Line 175: | ||
Another useful application of autoencoders in image preprocessing is [[image denoising]].<ref>Cho, K. (2013, February). Simple sparsification improves sparse denoising autoencoders in denoising highly corrupted images. In ''International Conference on Machine Learning'' (pp. 432-440).</ref><ref>{{cite arXiv |eprint=1301.3468|last1=Cho|first1=Kyunghyun|title=Boltzmann Machines and Denoising Autoencoders for Image Denoising|class=stat.ML|date=2013}}</ref><ref>{{Cite journal|doi = 10.1137/040616024|title = A Review of Image Denoising Algorithms, with a New One |url=https://hal.archives-ouvertes.fr/hal-00271141 |year = 2005|last1 = Buades|first1 = A.|last2 = Coll|first2 = B.|last3 = Morel|first3 = J. M.|journal = Multiscale Modeling & Simulation|volume = 4|issue = 2|pages = 490–530|s2cid = 218466166 }}</ref> |
Another useful application of autoencoders in image preprocessing is [[image denoising]].<ref>Cho, K. (2013, February). Simple sparsification improves sparse denoising autoencoders in denoising highly corrupted images. In ''International Conference on Machine Learning'' (pp. 432-440).</ref><ref>{{cite arXiv |eprint=1301.3468|last1=Cho|first1=Kyunghyun|title=Boltzmann Machines and Denoising Autoencoders for Image Denoising|class=stat.ML|date=2013}}</ref><ref>{{Cite journal|doi = 10.1137/040616024|title = A Review of Image Denoising Algorithms, with a New One |url=https://hal.archives-ouvertes.fr/hal-00271141 |year = 2005|last1 = Buades|first1 = A.|last2 = Coll|first2 = B.|last3 = Morel|first3 = J. M.|journal = Multiscale Modeling & Simulation|volume = 4|issue = 2|pages = 490–530|s2cid = 218466166 }}</ref> |
||
Autoencoders found use in more demanding contexts such as [[medical imaging]] where they have been used for [[image denoising]]<ref>{{Cite book|last=Gondara|first=Lovedeep|title=2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW) |chapter=Medical Image Denoising Using Convolutional Denoising Autoencoders |date=December 2016|location=Barcelona, Spain|publisher=IEEE|pages=241–246|doi=10.1109/ICDMW.2016.0041|isbn=9781509059102|arxiv=1608.04667|bibcode=2016arXiv160804667G|s2cid=14354973}}</ref> as well as [[super-resolution]].<ref>{{Cite journal|last1=Zeng|first1=Kun|last2=Yu|first2=Jun|last3=Wang|first3=Ruxin|last4=Li|first4=Cuihua|last5=Tao|first5=Dacheng|s2cid=20787612|date=January 2017|title=Coupled Deep Autoencoder for Single Image Super-Resolution|journal=IEEE Transactions on Cybernetics|volume=47|issue=1|pages=27–37|doi=10.1109/TCYB.2015.2501373|pmid=26625442|issn=2168-2267}}</ref><ref>{{cite book |last1=Tzu-Hsi |first1=Song |last2=Sanchez |first2=Victor |last3=Hesham |first3=EIDaly |last4=Nasir M. 
|first4=Rajpoot |title=2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017) |chapter=Hybrid deep autoencoder with Curvature Gaussian for detection of various types of cells in bone marrow trephine biopsy images |date=2017 |pages=1040–1043 |doi=10.1109/ISBI.2017.7950694 |isbn=978-1-5090-1172-8 |s2cid=7433130 }}</ref> In image-assisted diagnosis, experiments have applied autoencoders for [[breast cancer]] detection<ref>{{cite journal |last1=Xu |first1=Jun |last2=Xiang |first2=Lei |last3=Liu |first3=Qingshan |last4=Gilmore |first4=Hannah |last5=Wu |first5=Jianzhong |last6=Tang |first6=Jinghai |last7=Madabhushi |first7=Anant |title=Stacked Sparse Autoencoder (SSAE) for Nuclei Detection on Breast Cancer Histopathology Images |journal=IEEE Transactions on Medical Imaging |date=January 2016 |volume=35 |issue=1 |pages=119–130 |doi=10.1109/TMI.2015.2458702 |pmid=26208307 |pmc=4729702 }}</ref> and for modelling the relation between the cognitive decline of [[Alzheimer's disease]] and the latent features of an autoencoder trained with [[MRI]].<ref>{{cite journal |last1=Martinez-Murcia |first1=Francisco J. |last2=Ortiz |first2=Andres |last3=Gorriz |first3=Juan M. |last4=Ramirez |first4=Javier |last5=Castillo-Barnes |first5=Diego |s2cid=195187846 |title=Studying the Manifold Structure of Alzheimer's Disease: A Deep Learning Approach Using Convolutional Autoencoders |journal=IEEE Journal of Biomedical and Health Informatics |volume=24 |issue=1 |pages=17–26 |doi=10.1109/JBHI.2019.2914970 |pmid=31217131 |date=2020 |doi-access=free }}</ref> |
Autoencoders found use in more demanding contexts such as [[medical imaging]] where they have been used for [[image denoising]]<ref>{{Cite book|last=Gondara|first=Lovedeep|title=2016 IEEE 16th International Conference on Data Mining Workshops (ICDMW) |chapter=Medical Image Denoising Using Convolutional Denoising Autoencoders |date=December 2016|location=Barcelona, Spain|publisher=IEEE|pages=241–246|doi=10.1109/ICDMW.2016.0041|isbn=9781509059102|arxiv=1608.04667|bibcode=2016arXiv160804667G|s2cid=14354973}}</ref> as well as [[super-resolution]].<ref>{{Cite journal|last1=Zeng|first1=Kun|last2=Yu|first2=Jun|last3=Wang|first3=Ruxin|last4=Li|first4=Cuihua|last5=Tao|first5=Dacheng|s2cid=20787612|date=January 2017|title=Coupled Deep Autoencoder for Single Image Super-Resolution|journal=IEEE Transactions on Cybernetics|volume=47|issue=1|pages=27–37|doi=10.1109/TCYB.2015.2501373|pmid=26625442|issn=2168-2267}}</ref><ref>{{cite book |last1=Tzu-Hsi |first1=Song |last2=Sanchez |first2=Victor |last3=Hesham |first3=EIDaly |last4=Nasir M. 
|first4=Rajpoot |title=2017 IEEE 14th International Symposium on Biomedical Imaging (ISBI 2017) |chapter=Hybrid deep autoencoder with Curvature Gaussian for detection of various types of cells in bone marrow trephine biopsy images |date=2017 |pages=1040–1043 |doi=10.1109/ISBI.2017.7950694 |isbn=978-1-5090-1172-8 |s2cid=7433130 }}</ref> In image-assisted diagnosis, experiments have applied autoencoders for [[breast cancer]] detection<ref>{{cite journal |last1=Xu |first1=Jun |last2=Xiang |first2=Lei |last3=Liu |first3=Qingshan |last4=Gilmore |first4=Hannah |last5=Wu |first5=Jianzhong |last6=Tang |first6=Jinghai |last7=Madabhushi |first7=Anant |title=Stacked Sparse Autoencoder (SSAE) for Nuclei Detection on Breast Cancer Histopathology Images |journal=IEEE Transactions on Medical Imaging |date=January 2016 |volume=35 |issue=1 |pages=119–130 |doi=10.1109/TMI.2015.2458702 |pmid=26208307 |pmc=4729702 }}</ref> and for modelling the relation between the cognitive decline of [[Alzheimer's disease]] and the latent features of an autoencoder trained with [[MRI]].<ref>{{cite journal |last1=Martinez-Murcia |first1=Francisco J. |last2=Ortiz |first2=Andres |last3=Gorriz |first3=Juan M. |last4=Ramirez |first4=Javier |last5=Castillo-Barnes |first5=Diego |s2cid=195187846 |title=Studying the Manifold Structure of Alzheimer's Disease: A Deep Learning Approach Using Convolutional Autoencoders |journal=IEEE Journal of Biomedical and Health Informatics |volume=24 |issue=1 |pages=17–26 |doi=10.1109/JBHI.2019.2914970 |pmid=31217131 |date=2020 |doi-access=free |hdl=10630/28806 |hdl-access=free }}</ref> |
||
=== Drug discovery === |
=== Drug discovery === |
An autoencoder is a type of artificial neural network used to learn efficient codings of unlabeled data (unsupervised learning).[1][2] An autoencoder learns two functions: an encoding function that transforms the input data, and a decoding function that recreates the input data from the encoded representation. The autoencoder learns an efficient representation (encoding) for a set of data, typically for dimensionality reduction.
Variants exist, aiming to force the learned representations to assume useful properties.[3] Examples are regularized autoencoders (Sparse, Denoising and Contractive), which are effective in learning representations for subsequent classification tasks,[4] and Variational autoencoders, with applications as generative models.[5] Autoencoders are applied to many problems, including facial recognition,[6] feature detection,[7] anomaly detection and acquiring the meaning of words.[8][9] Autoencoders are also generative models which can randomly generate new data that is similar to the input data (training data).[7]
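An autoencoder is defined by the following components:
Two sets: the space of decoded messages <math>\mathcal X</math>; the space of encoded messages <math>\mathcal Z</math>. Almost always, both <math>\mathcal X</math> and <math>\mathcal Z</math> are Euclidean spaces, that is, <math>\mathcal X = \mathbb{R}^m, \mathcal Z = \mathbb{R}^n</math> for some <math>m, n</math>.
Two parametrized families of functions: the encoder family <math>E_\phi:\mathcal X \to \mathcal Z</math>, parametrized by <math>\phi</math>; the decoder family <math>D_\theta:\mathcal Z \to \mathcal X</math>, parametrized by <math>\theta</math>.
For any <math>x\in \mathcal X</math>, we usually write <math>z = E_\phi(x)</math>, and refer to it as the code, the latent variable, latent representation, latent vector, etc. Conversely, for any <math>z\in \mathcal Z</math>, we usually write <math>x' = D_\theta(z)</math>, and refer to it as the (decoded) message.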
Usually, both the encoder and the decoder are defined as multilayer perceptrons. For example, a one-layer-MLP encoder <math>E_\phi</math> is:<math display="block">E_\phi(x) = \sigma(Wx + b)</math>
where <math>\sigma</math> is an element-wise activation function such as a sigmoid function or a rectified linear unit, <math>W</math> is a matrix called "weight", and <math>b</math> is a vector called "bias".
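The one-layer encoder described above can be sketched in a few lines. This is a minimal illustration using only the Python standard library; the function names `sigmoid` and `encode`, and the particular weight and bias values, are hypothetical and chosen only for the example.

```python
import math


def sigmoid(t):
    """Logistic sigmoid applied to a single number."""
    return 1.0 / (1.0 + math.exp(-t))


def encode(x, W, b, activation=sigmoid):
    """One-layer encoder: apply the activation element-wise to W x + b."""
    return [
        activation(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
        for row, b_i in zip(W, b)
    ]


# Hypothetical weights mapping a 3-dimensional message to a 2-dimensional code.
W = [[0.5, -0.5, 0.0],
     [0.0, 1.0, -1.0]]
b = [0.0, 0.0]
z = encode([1.0, 1.0, 1.0], W, b)
```

Here both rows of `W` sum to zero against the all-ones input, so each entry of the code is the sigmoid of zero. A decoder of the same form would map the code back to a 3-dimensional message with its own weight matrix and bias.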
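An autoencoder, by itself, is simply a tuple of two functions. To judge its quality, we need a task. A task is defined by a reference probability distribution <math>\mu_{ref}</math> over <math>\mathcal X</math>, and a "reconstruction quality" function <math>d:\mathcal X \times \mathcal X \to [0, \infty]</math>, such that <math>d(x, x')</math> measures how much <math>x'</math> differs from <math>x</math>.
With those, we can define the loss function for the autoencoder as<math display="block">L(\theta, \phi) := \mathbb E_{x\sim \mu_{ref}}[d(x, D_\theta(E_\phi(x)))]</math>The optimal autoencoder for the given task <math>(\mu_{ref}, d)</math> is then <math>\arg\min_{\theta, \phi} L(\theta, \phi)</math>. The search for the optimal autoencoder can be accomplished by any mathematical optimization technique, but usually by gradient descent. This search process is referred to as "training the autoencoder".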
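In most situations, the reference distribution is just the empirical distribution given by a dataset <math>\{x_1, ..., x_N\} \subset \mathcal X</math>, so that<math display="block">\mu_{ref} = \frac{1}{N}\sum_{i=1}^N \delta_{x_i}</math>
where <math>\delta_{x_i}</math> is the Dirac measure, the quality function is just L2 loss: <math>d(x, x') = \|x - x'\|_2^2</math>, and <math>\|\cdot\|_2</math> is the Euclidean norm. Then the problem of searching for the optimal autoencoder is just a least-squares optimization:<math display="block">\min_{\theta, \phi} L(\theta, \phi), \text{where } L(\theta, \phi) = \frac{1}{N}\sum_{i=1}^N \|x_i - D_\theta(E_\phi(x_i))\|_2^2</math>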
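As an illustration of training by gradient descent on the least-squares objective, the following sketch fits a deliberately tiny linear autoencoder. Everything specific here is an assumption made for the example: the scalar code, the weight vectors `w` (encoder) and `v` (decoder), the toy dataset lying on a line, the learning rate, and the number of steps.

```python
def reconstruct(x, w, v):
    """Linear autoencoder: scalar code z = w . x, decoded message z * v."""
    z = sum(wi * xi for wi, xi in zip(w, x))
    return z, [z * vi for vi in v]


def loss(data, w, v):
    """Mean squared reconstruction error over the dataset."""
    total = 0.0
    for x in data:
        _, x_hat = reconstruct(x, w, v)
        total += sum((xi - hi) ** 2 for xi, hi in zip(x, x_hat))
    return total / len(data)


def gradient_step(data, w, v, lr):
    """One full-batch gradient descent step on the loss above."""
    grad_w = [0.0] * len(w)
    grad_v = [0.0] * len(v)
    for x in data:
        z, x_hat = reconstruct(x, w, v)
        residual = [xi - hi for xi, hi in zip(x, x_hat)]
        r_dot_v = sum(ri * vi for ri, vi in zip(residual, v))
        for j in range(len(w)):
            grad_w[j] += -2.0 * r_dot_v * x[j] / len(data)
            grad_v[j] += -2.0 * residual[j] * z / len(data)
    w = [wj - lr * g for wj, g in zip(w, grad_w)]
    v = [vj - lr * g for vj, g in zip(v, grad_v)]
    return w, v


# Toy dataset: 2-dimensional messages that all lie on the line x2 = 2 * x1,
# so a 1-dimensional code is enough to reconstruct them.
data = [[1.0, 2.0], [2.0, 4.0], [-1.0, -2.0], [0.5, 1.0]]
w, v = [0.1, 0.1], [0.1, 0.1]
initial_loss = loss(data, w, v)
for _ in range(2000):
    w, v = gradient_step(data, w, v, lr=0.01)
final_loss = loss(data, w, v)
```

Because the data are one-dimensional in disguise, the reconstruction error can be driven close to zero. Practical autoencoders use nonlinear multilayer encoders and decoders and rely on automatic differentiation rather than hand-written gradients.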
An autoencoder has two main parts: an encoder that maps the message to a code, and a decoder that reconstructs the message from the code. An optimal autoencoder would perform as close to perfect reconstruction as possible, with "close to perfect" defined by the reconstruction quality function <math>d</math>.
The simplest way to perform the copying task perfectly would be to duplicate the signal. To suppress this behavior, the code space <math>\mathcal Z</math> usually has fewer dimensions than the message space <math>\mathcal X</math>.
Such an autoencoder is called undercomplete. It can be interpreted as compressing the message, or reducing its dimensionality.[1][10]
At the limit of an ideal undercomplete autoencoder, every possible code <math>z</math> in the code space is used to encode a message <math>x</math> that really appears in the distribution <math>\mu_{ref}</math>, and the decoder is also perfect: <math>D_\theta(E_\phi(x)) = x</math>. This ideal autoencoder can then be used to generate messages indistinguishable from real messages, by feeding its decoder arbitrary code <math>z</math> and obtaining <math>D_\theta(z)</math>, which is a message that really appears in the distribution <math>\mu_{ref}</math>.
If the code space <math>\mathcal Z</math> has dimension larger than (overcomplete), or equal to, the message space <math>\mathcal X</math>, or the hidden units are given enough capacity, an autoencoder can learn the identity function and become useless. However, experimental results found that overcomplete autoencoders might still learn useful features.[11]
In the ideal setting, the code dimension and the model capacity could be set on the basis of the complexity of the data distribution to be modeled. A standard way to do so is to add modifications to the basic autoencoder, to be detailed below.[3]
The autoencoder was first proposed as a nonlinear generalization of principal components analysis (PCA) by Kramer.[1] The autoencoder has also been called the autoassociator,[12] or Diabolo network.[13][11] Its first applications date to the early 1990s.[3][14][15] Their most traditional application was dimensionality reduction or feature learning, but the concept became widely used for learning generative models of data.[16][17] Some of the most powerful AIs in the 2010s involved autoencoders stacked inside deep neural networks.[18]
Various techniques exist to prevent autoencoders from learning the identity function and to improve their ability to capture important information and learn richer representations.
Inspired by the sparse coding hypothesis in neuroscience, sparse autoencoders are variants of autoencoders, such that the codes <math>E_\phi(x)</math> for messages tend to be sparse codes, that is, <math>E_\phi(x)</math> is close to zero in most entries. Sparse autoencoders may include more (rather than fewer) hidden units than inputs, but only a small number of the hidden units are allowed to be active at the same time.[18] Encouraging sparsity improves performance on classification tasks.[19]
There are two main ways to enforce sparsity. One way is to simply clamp all but the highest-k activations of the latent code to zero. This is the k-sparse autoencoder.[20]
The k-sparse autoencoder inserts the following "k-sparse function" in the latent layer of a standard autoencoder:
$$f_k(x_1, \dots, x_n) = (x_1 b_1, \dots, x_n b_n)$$
where $b_i = 1$ if $|x_i|$ ranks in the top k, and 0 otherwise.
Backpropagating through $f_k$ is simple: set the gradient to 0 for the $b_i = 0$ entries, and keep the gradient for the $b_i = 1$ entries. This is essentially a generalized ReLU function.[20]
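The forward and backward passes of the k-sparse function can be sketched in a few lines of NumPy. This is only an illustration: the function names `k_sparse` and `k_sparse_backward` are hypothetical, entries are ranked by absolute value (an assumption of this sketch), and ties are broken arbitrarily by the sort.

```python
import numpy as np

def k_sparse(x, k):
    """Keep the k entries of x with the largest absolute value; zero the rest.

    Returns the sparsified vector and the 0/1 mask b that was applied.
    """
    x = np.asarray(x, dtype=float)
    top = np.argsort(-np.abs(x))[:k]   # indices of the k largest |x_i|
    b = np.zeros_like(x)
    b[top] = 1.0
    return x * b, b

def k_sparse_backward(grad_out, b):
    """Pass the upstream gradient only through the entries that were kept."""
    return np.asarray(grad_out, dtype=float) * b

x = np.array([0.5, -2.0, 0.1, 1.5, -0.3])
y, b = k_sparse(x, k=2)                       # only -2.0 and 1.5 survive
g = k_sparse_backward(np.ones_like(x), b)     # gradient is zeroed elsewhere
```

The same mask is reused in the backward pass, which is what makes the operation behave like a ReLU whose "active" set is chosen by rank rather than by sign.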
The other way is a relaxed version of the k-sparse autoencoder. Instead of forcing sparsity, we add a sparsity regularization loss, then optimize for
$$\min_{\theta,\phi} L(\theta,\phi) + \lambda L_{\text{sparsity}}(\theta,\phi)$$
where $\lambda > 0$ measures how much sparsity we want to enforce.[21]
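Let the autoencoder architecture have $K$ layers. To define a sparsity regularization loss, we need a "desired" sparsity $\hat\rho_k$ for each layer, a weight $w_k$ for how much to enforce each sparsity, and a function $s: [0,1] \times [0,1] \to [0,\infty]$ to measure how much two sparsities differ.
For each input $x$, let the actual sparsity of activation in each layer $k$ be
$$\rho_k(x) = \frac{1}{n} \sum_{i=1}^{n} a_{k,i}(x)$$
where $n$ is the number of neurons in the layer and $a_{k,i}(x)$ is the activation in the $i$-th neuron of the $k$-th layer upon input $x$.
The sparsity loss upon input $x$ for one layer is $s(\hat\rho_k, \rho_k(x))$, and the sparsity regularization loss for the entire autoencoder is the expected weighted sum of sparsity losses:
$$L_{\text{sparsity}}(\theta,\phi) = \mathbb{E}_{x\sim\mu_{ref}}\left[\sum_{k=1}^{K} w_k\, s(\hat\rho_k, \rho_k(x))\right]$$
Typically, the function $s$ is either the Kullback-Leibler (KL) divergence, as[19][21][22][23]
$$s(\hat\rho, \rho) = KL(\hat\rho \,\|\, \rho) = \hat\rho \log\frac{\hat\rho}{\rho} + (1-\hat\rho)\log\frac{1-\hat\rho}{1-\rho}$$
or the L1 loss, as $s(\hat\rho, \rho) = |\hat\rho - \rho|$, or the L2 loss, as $s(\hat\rho, \rho) = |\hat\rho - \rho|^2$.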
Alternatively, the sparsity regularization loss may be defined without reference to any "desired sparsity", but simply force as much sparsity as possible. In this case, one can define the sparsity regularization loss as
$$L_{\text{sparsity}}(\theta,\phi) = \mathbb{E}_{x\sim\mu_{ref}}\left[\sum_{k=1}^{K} w_k \|h_k\|\right]$$
where $h_k$ is the activation vector in the $k$-th layer of the autoencoder. The norm $\|\cdot\|$ is usually the L1 norm (giving the L1 sparse autoencoder) or the L2 norm (giving the L2 sparse autoencoder).
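As a rough illustration of the KL-based sparsity regularization loss, the following NumPy sketch averages the weighted per-layer penalties over a batch of inputs. The helper names (`kl_sparsity`, `sparsity_loss`), the clipping constant `eps`, and the assumption that activations lie in [0, 1] (for example, sigmoid units) are choices of this sketch, not part of any particular library.

```python
import numpy as np

def kl_sparsity(rho_hat, rho, eps=1e-8):
    """KL divergence between Bernoulli(rho_hat) and Bernoulli(rho)."""
    rho_hat = np.clip(rho_hat, eps, 1 - eps)   # avoid log(0)
    rho = np.clip(rho, eps, 1 - eps)
    return (rho_hat * np.log(rho_hat / rho)
            + (1 - rho_hat) * np.log((1 - rho_hat) / (1 - rho)))

def sparsity_loss(activations, rho_hats, weights, s=kl_sparsity):
    """Batch estimate of the expected weighted sum of per-layer sparsity losses.

    activations: one array per layer, shaped (batch, n_k), values in [0, 1].
    rho_hats:    desired sparsity per layer.
    weights:     penalty weight per layer.
    """
    total = 0.0
    for a_k, rho_hat_k, w_k in zip(activations, rho_hats, weights):
        rho_k = a_k.mean(axis=1)                    # actual sparsity per input
        total += w_k * s(rho_hat_k, rho_k).mean()   # average over the batch
    return total

# One hidden layer with 10 units, batch of 4 inputs, desired sparsity 0.05.
dense = [np.full((4, 10), 0.5)]    # units half-active: heavily penalized
sparse = [np.full((4, 10), 0.05)]  # units at the target level: no penalty
loss_dense = sparsity_loss(dense, rho_hats=[0.05], weights=[1.0])
loss_sparse = sparsity_loss(sparse, rho_hats=[0.05], weights=[1.0])
```

Swapping `s` for an absolute or squared difference gives the L1 or L2 variants of the same penalty.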
Denoising autoencoders (DAE) try to achieve a good representation by changing the reconstruction criterion.[3][4]
A DAE, originally called a "robust autoassociative network",[2] is trained by intentionally corrupting the inputs of a standard autoencoder during training. A noise process is defined by a probability distribution $\mu_T$ over functions $T: \mathcal{X} \to \mathcal{X}$. That is, the function $T$ takes a message $x \in \mathcal{X}$ and corrupts it to a noisy version $T(x)$. The function $T$ is selected randomly, with a probability distribution $\mu_T$.
Given a task $(\mu_{ref}, d)$, the problem of training a DAE is the optimization problem:
$$\min_{\theta,\phi} L(\theta,\phi) = \mathbb{E}_{x\sim\mu_{ref},\, T\sim\mu_T}\left[d\big(x, (D_\theta \circ E_\phi \circ T)(x)\big)\right]$$
That is, the optimal DAE should take any noisy message and attempt to recover the original message without noise, thus the name "denoising".
Usually, the noise process $T$ is applied only during training and testing, not during downstream use.
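The use of DAE depends on two assumptions: that there exist representations of the messages that are relatively stable and robust to the type of noise likely to be encountered, and that these representations capture structures in the input distribution that are useful for the purpose at hand.
Example noise processes include additive isotropic Gaussian noise, masking noise (a randomly chosen fraction of the input is set to 0), and salt-and-pepper noise (a randomly chosen fraction of the input is set to its minimum or maximum value).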
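Three commonly used corruption processes (additive Gaussian noise, masking noise, and salt-and-pepper noise) can be sketched as follows. The function names, default noise levels, and the assumption that inputs are scaled to [0, 1] are illustrative choices; each entry is corrupted independently with probability `frac`, so the corrupted fraction is `frac` only on average.

```python
import numpy as np

def gaussian_noise(x, rng, sigma=0.1):
    """Additive isotropic Gaussian noise with standard deviation sigma."""
    return x + rng.normal(0.0, sigma, size=x.shape)

def masking_noise(x, rng, frac=0.3):
    """Set each entry to 0 independently with probability frac."""
    drop = rng.random(x.shape) < frac
    return np.where(drop, 0.0, x)

def salt_and_pepper_noise(x, rng, frac=0.3, lo=0.0, hi=1.0):
    """With probability frac, replace an entry by lo or hi (chosen at random)."""
    corrupt = rng.random(x.shape) < frac
    to_hi = rng.random(x.shape) < 0.5
    return np.where(corrupt, np.where(to_hi, hi, lo), x)

rng = np.random.default_rng(0)
x = rng.random((4, 8))             # clean messages with entries in [0, 1)
x_noisy = masking_noise(x, rng)    # one random draw of the corruption applied to x
# A DAE would be trained on pairs (x_noisy, x): noisy input, clean target.
```

The key point is that the reconstruction target stays the clean `x`, while only the encoder input is corrupted.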
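A contractive autoencoder adds the contractive regularization loss to the standard autoencoder loss:
$$\min_{\theta,\phi} L(\theta,\phi) + \lambda L_{\text{contractive}}(\theta,\phi)$$
where $\lambda > 0$ measures how much contractive-ness we want to enforce. The contractive regularization loss itself is defined as the expected squared Frobenius norm of the Jacobian matrix of the encoder activations with respect to the input:
$$L_{\text{contractive}}(\theta,\phi) = \mathbb{E}_{x\sim\mu_{ref}} \|\nabla_x E_\phi(x)\|_F^2$$
To understand what $L_{\text{contractive}}$ measures, note the fact that, to first order,
$$\|E_\phi(x + \delta x) - E_\phi(x)\|_2 \le \|\nabla_x E_\phi(x)\|_F \, \|\delta x\|_2$$
for any message $x \in \mathcal{X}$ and small variation $\delta x$ in it. Thus, if $\|\nabla_x E_\phi(x)\|_F^2$ is small, it means that a small neighborhood of the message maps to a small neighborhood of its code. This is a desired property, as it means small variation in the message leads to small, perhaps even zero, variation in its code, like how two pictures may look the same even if they are not exactly the same.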
The DAE can be understood as an infinitesimal limit of CAE: in the limit of small Gaussian input noise, DAEs make the reconstruction function resist small but finite-sized input perturbations, while CAEs make the extracted features resist infinitesimal input perturbations.
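For a concrete feel of the contractive penalty, consider a single-layer sigmoid encoder $h = \sigma(Wx + b)$, whose Jacobian is $\mathrm{diag}(h(1-h))\,W$. The NumPy sketch below computes the squared Frobenius norm in closed form and compares it with a finite-difference Jacobian. The single-layer sigmoid encoder, the function names, and the random weights are assumptions made for illustration only.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def encoder(x, W, b):
    """Single-layer sigmoid encoder (an assumed, minimal architecture)."""
    return sigmoid(W @ x + b)

def contractive_penalty(x, W, b):
    """Squared Frobenius norm of the encoder Jacobian at x, in closed form.

    The Jacobian is diag(h * (1 - h)) @ W, so its squared Frobenius norm is
    sum_j (h_j (1 - h_j))^2 * sum_i W_ji^2.
    """
    h = encoder(x, W, b)
    return np.sum((h * (1 - h)) ** 2 * np.sum(W ** 2, axis=1))

def numerical_jacobian(x, W, b, eps=1e-6):
    """Central finite-difference Jacobian, used only as a sanity check."""
    J = np.zeros((W.shape[0], x.size))
    for i in range(x.size):
        dx = np.zeros_like(x)
        dx[i] = eps
        J[:, i] = (encoder(x + dx, W, b) - encoder(x - dx, W, b)) / (2 * eps)
    return J

rng = np.random.default_rng(0)
W, b, x = rng.normal(size=(3, 4)), rng.normal(size=3), rng.normal(size=4)
penalty = contractive_penalty(x, W, b)
penalty_check = np.sum(numerical_jacobian(x, W, b) ** 2)
```

In training, this penalty would be averaged over the inputs and added to the reconstruction loss with weight $\lambda$.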
The concrete autoencoder is designed for discrete feature selection.[25] A concrete autoencoder forces the latent space to consist only of a user-specified number of features. The concrete autoencoder uses a continuous relaxation of the categorical distribution to allow gradients to pass through the feature selector layer, which makes it possible to use standard backpropagation to learn an optimal subset of input features that minimizes reconstruction loss.
Variational autoencoders (VAEs) belong to the family of variational Bayesian methods. Despite the architectural similarities with basic autoencoders, VAEs are architectures with different goals and a completely different mathematical formulation. The latent space is in this case composed of a mixture of distributions instead of a fixed vector.
Given an input dataset $x$ characterized by an unknown probability function $P(x)$ and a multivariate latent encoding vector $z$, the objective is to model the data as a distribution $p_\theta(x)$, with $\theta$ defined as the set of the network parameters so that $p_\theta(x) = \int_z p_\theta(x, z)\,dz$.
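The marginalization over the latent variable, $p_\theta(x) = \int_z p_\theta(x, z)\,dz$, can be illustrated with a toy model in which the integral is known in closed form. The sketch below assumes a standard normal prior on a scalar $z$ and a Gaussian "decoder" $x \mid z \sim \mathcal{N}(z, 1)$, so that $p(x)$ is exactly $\mathcal{N}(0, 2)$. These modelling choices are assumptions made purely for illustration and are not the architecture of any particular VAE.

```python
import numpy as np

def normal_pdf(x, mean, var):
    return np.exp(-(x - mean) ** 2 / (2 * var)) / np.sqrt(2 * np.pi * var)

def marginal_likelihood_mc(x, rng, n_samples=200_000):
    """Monte Carlo estimate of p(x) = E_{z ~ p(z)}[p(x | z)].

    Toy model: p(z) = N(0, 1) and p(x | z) = N(z, 1).
    """
    z = rng.normal(size=n_samples)          # samples from the prior
    return normal_pdf(x, mean=z, var=1.0).mean()

rng = np.random.default_rng(0)
x = 0.5
estimate = marginal_likelihood_mc(x, rng)
exact = normal_pdf(x, mean=0.0, var=2.0)    # closed form for this toy model
```

For realistic decoders the integral has no closed form and naive sampling from the prior is inefficient, which is the motivation for the variational approximation used in VAEs.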
Autoencoders are often trained with a single-layer encoder and a single-layer decoder, but using many-layered (deep) encoders and decoders offers many advantages.[3]
Geoffrey Hinton developed the deep belief network technique for training many-layered deep autoencoders. His method involves treating each neighboring set of two layers as a restricted Boltzmann machine so that pretraining approximates a good solution, then using backpropagation to fine-tune the results.[10]
Researchers have debated whether joint training (i.e. training the whole architecture together with a single global reconstruction objective to optimize) would be better for deep autoencoders.[26] A 2015 study showed that joint training learns better data models along with more representative features for classification as compared to the layerwise method.[26] However, the experiments in that study showed that the success of joint training depends heavily on the regularization strategies adopted.[26][27]
The two main applications of autoencoders are dimensionality reduction and information retrieval,[3] but modern variations have been applied to other tasks.
Dimensionality reduction was one of the first deep learning applications.[3]
In his 2006 study,[10] Hinton pretrained a multi-layer autoencoder with a stack of RBMs and then used their weights to initialize a deep autoencoder with gradually smaller hidden layers until hitting a bottleneck of 30 neurons. The resulting 30 dimensions of the code yielded a smaller reconstruction error compared to the first 30 components of a principal component analysis (PCA), and learned a representation that was qualitatively easier to interpret, clearly separating data clusters.[3][10]
Representing data in a lower-dimensional space can improve performance on tasks such as classification.[3] Indeed, the hallmark of dimensionality reduction is to place semantically related examples near each other.[29]
If linear activations are used, or only a single sigmoid hidden layer, then the optimal solution to an autoencoder is strongly related to principal component analysis (PCA).[30][31] The weights of an autoencoder with a single hidden layer of size $p$ (where $p$ is less than the size of the input) span the same vector subspace as the one spanned by the first $p$ principal components, and the output of the autoencoder is an orthogonal projection onto this subspace. The autoencoder weights are not equal to the principal components, and are generally not orthogonal, yet the principal components may be recovered from them using the singular value decomposition.[32]
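The relationship between a linear autoencoder and PCA can be checked numerically. Rather than training a network, the sketch below constructs one optimal linear encoder/decoder pair by mixing the top $p$ principal directions with an arbitrary invertible matrix `A` (a hypothetical choice), then verifies that the reconstruction is still the orthogonal projection onto the principal subspace and that the SVD of the decoder weights recovers that subspace.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)) @ rng.normal(size=(5, 5))  # correlated data
X = X - X.mean(axis=0)                                   # center the data
p = 2

# Top-p principal directions from the SVD of the centered data.
_, _, Vt = np.linalg.svd(X, full_matrices=False)
V_p = Vt[:p].T                                # shape (5, p)

# A linear autoencoder whose weights are NOT the principal components:
A = np.array([[2.0, 1.0], [0.5, 3.0]])        # arbitrary invertible mixing
W_enc = A @ V_p.T                             # encoder weights, shape (p, 5)
W_dec = V_p @ np.linalg.inv(A)                # decoder weights, shape (5, p)

reconstruction = X @ W_enc.T @ W_dec.T        # autoencoder output
projection = X @ V_p @ V_p.T                  # orthogonal projection (PCA)

# The SVD of the decoder weights recovers an orthonormal basis of the subspace.
U_dec, _, _ = np.linalg.svd(W_dec, full_matrices=False)
same_subspace = np.allclose(U_dec @ U_dec.T, V_p @ V_p.T)
```

Here `W_dec` differs from `V_p` and its columns are not orthonormal, yet the autoencoder output coincides with the PCA projection.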
However, the potential of autoencoders resides in their non-linearity, allowing the model to learn more powerful generalizations compared to PCA, and to reconstruct the input with significantly lower information loss.[10]
Information retrieval benefits particularly from dimensionality reduction in that search can become more efficient in certain kinds of low dimensional spaces. Autoencoders were indeed applied to semantic hashing, proposed by Salakhutdinov and Hinton in 2007.[29] By training the algorithm to produce a low-dimensional binary code, all database entries could be stored in a hash table mapping binary code vectors to entries. This table would then support information retrieval by returning all entries with the same binary code as the query, or slightly less similar entries by flipping some bits from the query encoding.
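The hash-table lookup described above can be sketched with a plain dictionary. The binary codes are assumed to come from an already trained encoder (not shown), and the helper names `build_table` and `lookup` are hypothetical. Nearby entries are found by flipping up to `radius` bits of the query code.

```python
from collections import defaultdict
from itertools import combinations

def build_table(codes):
    """Map each binary code (tuple of 0/1) to the indices of entries with that code."""
    table = defaultdict(list)
    for idx, code in enumerate(codes):
        table[tuple(code)].append(idx)
    return table

def lookup(table, query, radius=1):
    """Return entries whose code is within the given Hamming radius of the query."""
    query = tuple(query)
    hits = list(table.get(query, []))            # exact matches first
    for r in range(1, radius + 1):
        for positions in combinations(range(len(query)), r):
            flipped = list(query)
            for i in positions:
                flipped[i] ^= 1                  # flip the selected bits
            hits.extend(table.get(tuple(flipped), []))
    return hits

# Codes as they might be produced by an encoder with a 3-bit binary bottleneck.
codes = [(0, 0, 1), (0, 0, 1), (1, 0, 1), (1, 1, 0)]
table = build_table(codes)
exact = lookup(table, (0, 0, 1), radius=0)
nearby = lookup(table, (0, 0, 1), radius=1)
```

The number of probed buckets grows combinatorially with the radius, so this approach is practical only for short codes and small radii.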
The encoder-decoder architecture, often used in natural language processing and neural networks, can be applied in the field of search engine optimization (SEO) in various ways.
In essence, the encoder-decoder architecture or autoencoders can be leveraged in SEO to optimize web page content, improve its indexing, and enhance its appeal to both search engines and users.
Another application for autoencoders is anomaly detection.[2][33][34][35][36][37] By learning to replicate the most salient features in the training data under some of the constraints described previously, the model is encouraged to learn to precisely reproduce the most frequently observed characteristics. When facing anomalies, the model's reconstruction performance should worsen. In most cases, only normal instances are used to train the autoencoder; in others, the frequency of anomalies is small compared to the observation set so that its contribution to the learned representation could be ignored. After training, the autoencoder will accurately reconstruct "normal" data, while failing to do so with unfamiliar anomalous data.[35] Reconstruction error (the error between the original data and its low dimensional reconstruction) is used as an anomaly score to detect anomalies.[35]
Recent literature has however shown that certain autoencoding models can, counterintuitively, be very good at reconstructing anomalous examples and consequently not able to reliably perform anomaly detection.[38][39]
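Scoring by reconstruction error can be sketched with a minimal stand-in model. To keep the example short and deterministic, a one-dimensional linear "autoencoder" fitted by SVD is used in place of a trained neural network, and the 99th-percentile threshold is an arbitrary illustrative choice. Both are assumptions of this sketch.

```python
import numpy as np

rng = np.random.default_rng(0)

# "Normal" training data: points close to a line in 2-D.
t = rng.normal(size=(500, 1))
normal = t @ np.array([[1.0, 2.0]]) + 0.05 * rng.normal(size=(500, 2))

# Stand-in autoencoder: project onto the top principal direction.
mean = normal.mean(axis=0)
_, _, Vt = np.linalg.svd(normal - mean, full_matrices=False)
V = Vt[:1].T                                   # shape (2, 1)

def reconstruct(x):
    """Encode to 1-D and decode back to 2-D."""
    return (x - mean) @ V @ V.T + mean

def anomaly_score(x):
    """Squared reconstruction error, one score per row of x."""
    return np.sum((x - reconstruct(x)) ** 2, axis=-1)

threshold = np.quantile(anomaly_score(normal), 0.99)

typical = np.array([[1.0, 2.0]])               # lies along the training line
anomaly = np.array([[2.0, -1.0]])              # far from the training line
is_anomaly = anomaly_score(anomaly) > threshold
is_typical_flagged = anomaly_score(typical) > threshold
```

As the caveat above notes, a more expressive model may also reconstruct anomalous inputs well, in which case a threshold of this kind would not separate them.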
The characteristics of autoencoders are useful in image processing.
One example can be found in lossy image compression, where autoencoders outperformed other approaches and proved competitive against JPEG 2000.[40][41]
Another useful application of autoencoders in image preprocessing is image denoising.[42][43][44]
Autoencoders found use in more demanding contexts such as medical imaging where they have been used for image denoising[45] as well as super-resolution.[46][47] In image-assisted diagnosis, experiments have applied autoencoders for breast cancer detection[48] and for modelling the relation between the cognitive decline of Alzheimer's disease and the latent features of an autoencoder trained with MRI.[49]
In 2019 molecules generated with variational autoencoders were validated experimentally in mice.[50][51]
Recently, a stacked autoencoder framework produced promising results in predicting popularity of social media posts,[52] which is helpful for online advertising strategies.
Autoencoders have been applied to machine translation, which is usually referred to as neural machine translation (NMT).[53][54] Unlike in traditional autoencoders, the output does not match the input: it is in another language. In NMT, texts are treated as sequences to be encoded into the learning procedure, while on the decoder side sequences in the target language(s) are generated. Language-specific autoencoders incorporate further linguistic features into the learning procedure, such as Chinese decomposition features.[55] Machine translation is now rarely done with autoencoders, due to the availability of more effective transformer networks.