
Occam learning






From Wikipedia, the free encyclopedia
 


In computational learning theory, Occam learning is a model of algorithmic learning where the objective of the learner is to output a succinct representation of received training data. This is closely related to probably approximately correct (PAC) learning, where the learner is evaluated on its predictive power over a test set.

Occam learnability implies PAC learnability, and for a wide variety of concept classes, the converse is also true: PAC learnability implies Occam learnability.

Introduction


Occam learning is named after Occam's razor, the principle that, all other things being equal, a shorter explanation for observed data should be favored over a lengthier one. The theory of Occam learning is a formal and mathematical justification for this principle. It was first shown by Blumer et al.[1] that Occam learning implies PAC learning, which is the standard model of learning in computational learning theory. In other words, parsimony (of the output hypothesis) implies predictive power.

Definition of Occam learning


The succinctness of a concept $c$ in concept class $\mathcal{C}$ can be expressed by the length $size(c)$ of the shortest bit string that can represent $c$ in $\mathcal{C}$. Occam learning connects the succinctness of a learning algorithm's output to its predictive power on unseen data.

Let $\mathcal{C}$ and $\mathcal{H}$ be concept classes containing target concepts and hypotheses respectively. Then, for constants $\alpha \geq 0$ and $0 \leq \beta < 1$, a learning algorithm $L$ is an $(\alpha, \beta)$-Occam algorithm for $\mathcal{C}$ using $\mathcal{H}$ iff, given a set $S = \{x_1, \dots, x_m\}$ of $m$ samples labeled according to a concept $c \in \mathcal{C}$, $L$ outputs a hypothesis $h \in \mathcal{H}$ such that

1. $h$ is consistent with $c$ on $S$ (that is, $h(x) = c(x)$ for all $x \in S$), and
2. $size(h) \leq (n \cdot size(c))^{\alpha} \cdot m^{\beta}$,

where $n$ is the maximum length of any sample $x \in S$. An Occam algorithm is called efficient if it runs in time polynomial in $n$, $m$, and $size(c)$. We say a concept class $\mathcal{C}$ is Occam learnable with respect to a hypothesis class $\mathcal{H}$ if there exists an efficient Occam algorithm for $\mathcal{C}$ using $\mathcal{H}$.
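
For a concrete example (an illustration, not part of the formal definition), the classical elimination algorithm for monotone conjunctions is an efficient $(1, 0)$-Occam algorithm: the conjunction it outputs mentions at most $n$ variables, so $size(h) \leq n \leq n \cdot size(c)$ independently of $m$. A minimal Python sketch, with illustrative function names:

    # Sketch of the classical elimination algorithm for monotone conjunctions.
    # The output conjunction mentions at most n variables, so its size is
    # bounded by n * size(c) * m**0, making this a (1, 0)-Occam algorithm.
    # (Names here are illustrative, not from any library.)

    def learn_monotone_conjunction(samples):
        """samples: list of (x, label) pairs, x a tuple of n bits in {0, 1},
        label in {0, 1}. Returns the indices of the variables kept in the
        hypothesis conjunction."""
        n = len(samples[0][0])
        kept = set(range(n))  # start from x_1 AND x_2 AND ... AND x_n
        for x, label in samples:
            if label == 1:
                # a positive example eliminates every variable it sets to 0
                kept -= {i for i in kept if x[i] == 0}
        return kept

    def predict(kept, x):
        # the hypothesis h(x) is the conjunction of the kept variables
        return int(all(x[i] == 1 for i in kept))

On data labeled by a monotone conjunction, the returned hypothesis is consistent with the samples, and its representation never exceeds $n$ variables, satisfying the size bound with $\alpha = 1$ and $\beta = 0$.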

The relation between Occam and PAC learning


Occam learnability implies PAC learnability, as the following theorem of Blumer et al.[2] shows:

Theorem (Occam learning implies PAC learning)


Let $L$ be an efficient $(\alpha, \beta)$-Occam algorithm for $\mathcal{C}$ using $\mathcal{H}$. Then there exists a constant $a > 0$ such that for any $0 < \epsilon, \delta < 1$ and any distribution $\mathcal{D}$, given $m \geq a \left( \frac{1}{\epsilon} \log \frac{1}{\delta} + \left( \frac{(n \cdot size(c))^{\alpha}}{\epsilon} \right)^{\frac{1}{1-\beta}} \right)$ samples drawn from $\mathcal{D}$ and labelled according to a concept $c \in \mathcal{C}$ of length $n$ bits each, the algorithm $L$ will output a hypothesis $h \in \mathcal{H}$ such that $error(h) \leq \epsilon$ with probability at least $1 - \delta$.

Here, $error(h) = \Pr_{x \sim \mathcal{D}}[h(x) \neq c(x)]$ is the error of $h$ with respect to the concept $c$ and the distribution $\mathcal{D}$. This implies that the algorithm $L$ is also a PAC learner for the concept class $\mathcal{C}$ using hypothesis class $\mathcal{H}$.
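
As a numerical illustration, the sample bound above can be evaluated directly. The constant $a$ is not specified by the theorem, so the sketch below (an illustrative helper, not from any library) assumes $a = 1$ purely to show how the bound scales:

    import math

    # Illustrative evaluation of the Occam sample bound
    #   m >= a * ( (1/eps) * log(1/delta)
    #              + ((n * size_c)**alpha / eps)**(1 / (1 - beta)) ).
    # The constant a is unspecified by the theorem; a = 1 is an assumption
    # made here only to show how the bound scales with the parameters.

    def occam_sample_bound(eps, delta, n, size_c, alpha, beta, a=1.0):
        term1 = (1.0 / eps) * math.log(1.0 / delta)
        term2 = ((n * size_c) ** alpha / eps) ** (1.0 / (1.0 - beta))
        return math.ceil(a * (term1 + term2))

    # Example: 100-bit samples, size(c) = 20, a (1, 1/2)-Occam algorithm.
    print(occam_sample_bound(eps=0.1, delta=0.05, n=100, size_c=20,
                             alpha=1.0, beta=0.5))

A slightly more general formulation is as follows: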

Theorem (Occam learning implies PAC learning, cardinality version)


Let $0 < \epsilon, \delta < 1$. Let $L$ be an algorithm such that, given $m$ samples drawn from a fixed but unknown distribution $\mathcal{D}$ and labeled according to a concept $c \in \mathcal{C}$ of length $n$ bits each, $L$ outputs a hypothesis $h \in \mathcal{H}_{n,m}$ that is consistent with the labeled samples, where $\mathcal{H}_{n,m}$ denotes the class of hypotheses that $L$ may output on $m$ samples of length $n$. Then, there exists a constant $b$ such that if $\log |\mathcal{H}_{n,m}| \leq b \epsilon m - \log \frac{1}{\delta}$, then $L$ is guaranteed to output a hypothesis $h \in \mathcal{H}_{n,m}$ such that $error(h) \leq \epsilon$ with probability at least $1 - \delta$.
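
Rearranged for $m$ (and taking the unspecified constant $b$ as $1$ for illustration), the condition says that a consistent learner needs on the order of $\frac{1}{\epsilon} \left( \ln |\mathcal{H}_{n,m}| + \ln \frac{1}{\delta} \right)$ samples, which the following sketch computes:

    import math

    # Cardinality bound rearranged for m, with the unspecified constant b
    # taken as 1 for illustration: a learner that outputs a hypothesis
    # consistent with the samples needs roughly
    #   m >= (1/eps) * (ln|H| + ln(1/delta))
    # samples, where |H| is the size of its hypothesis class.

    def consistent_learner_bound(eps, delta, ln_H, b=1.0):
        return math.ceil((ln_H + math.log(1.0 / delta)) / (b * eps))

    # Example: a hypothesis class of 2**30 hypotheses.
    print(consistent_learner_bound(eps=0.1, delta=0.05, ln_H=30 * math.log(2)))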

While the above theorems show that Occam learning is sufficient for PAC learning, they say nothing about necessity. Board and Pitt showed that, for a wide variety of concept classes, Occam learning is in fact necessary for PAC learning.[3] They proved that for any concept class that is polynomially closed under exception lists, PAC learnability implies the existence of an Occam algorithm for that concept class. Concept classes that are polynomially closed under exception lists include Boolean formulas, circuits, deterministic finite automata, decision lists, decision trees, and other geometrically defined concept classes.

A concept class $\mathcal{C}$ is polynomially closed under exception lists if there exists a polynomial-time algorithm $A$ such that, when given the representation of a concept $c \in \mathcal{C}$ and a finite list $E$ of exceptions, $A$ outputs a representation of a concept $c' \in \mathcal{C}$ such that the concepts $c$ and $c'$ agree except on the set $E$.
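
As a minimal sketch of the closure property (an illustrative pairing construction, not Board and Pitt's, whose result concerns the specific representations of each concept class), one can represent $c'$ as the pair $(c, E)$ and flip $c$'s output on the exception points; the representation then grows only by the length of $E$:

    # Minimal sketch of closure under exception lists: represent c' as the
    # pair (c, E) and flip c's output on the finitely many exception points.
    # The representation of c' grows only by the size of the list E.
    # (This pairing construction is illustrative; Board and Pitt's result
    # concerns specific concept classes with their own representations.)

    def with_exceptions(c, exceptions):
        """c: a predicate on inputs; exceptions: a finite set of inputs.
        Returns c', which agrees with c everywhere except on `exceptions`."""
        E = frozenset(exceptions)
        return lambda x: (not c(x)) if x in E else c(x)

    # Example: c accepts even integers; c' additionally flips 2 and 3.
    c = lambda x: x % 2 == 0
    c_prime = with_exceptions(c, {2, 3})
    print(c(2), c_prime(2))   # True False
    print(c(3), c_prime(3))   # False True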

Proof that Occam learning implies PAC learning


We first prove the cardinality version. Call a hypothesis $h \in \mathcal{H}_{n,m}$ bad if $error(h) \geq \epsilon$, where again $error(h)$ is with respect to the true concept $c$ and the underlying distribution $\mathcal{D}$. The probability that a set of samples $S$ is consistent with a fixed bad hypothesis $h$ is at most $(1-\epsilon)^m$, by the independence of the samples. By the union bound, the probability that there exists a bad hypothesis in $\mathcal{H}_{n,m}$ consistent with $S$ is at most $|\mathcal{H}_{n,m}|(1-\epsilon)^m$, which is less than $\delta$ if $\log |\mathcal{H}_{n,m}| \leq O(\epsilon m) - \log \frac{1}{\delta}$ (using $(1-\epsilon)^m \leq e^{-\epsilon m}$). This concludes the proof of the second theorem above.
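
The key step is that a single bad hypothesis survives all $m$ independent samples with probability at most $(1-\epsilon)^m$; a quick simulation (an illustration, not part of the proof) matches this rate:

    import random

    # Illustrative check of the proof's key step: a hypothesis with error
    # exactly eps agrees with one random sample with probability 1 - eps,
    # so it is consistent with all m independent samples with probability
    # (1 - eps)**m.

    eps, m, trials = 0.1, 50, 100_000
    consistent = 0
    for _ in range(trials):
        # each of the m samples independently "catches" the bad hypothesis
        # with probability eps; consistency means it is never caught
        if all(random.random() > eps for _ in range(m)):
            consistent += 1

    print(consistent / trials)   # empirical estimate
    print((1 - eps) ** m)        # theoretical value, about 0.0052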

Using the second theorem, we can prove the first theorem. Since we have an $(\alpha, \beta)$-Occam algorithm, any hypothesis output by $L$ can be represented by at most $(n \cdot size(c))^{\alpha} m^{\beta}$ bits, and thus $\log |\mathcal{H}_{n,m}| \leq (n \cdot size(c))^{\alpha} m^{\beta}$. This is less than $O(\epsilon m) - \log \frac{1}{\delta}$ if we set $m \geq a \left( \frac{1}{\epsilon} \log \frac{1}{\delta} + \left( \frac{(n \cdot size(c))^{\alpha}}{\epsilon} \right)^{\frac{1}{1-\beta}} \right)$ for some constant $a > 0$. Thus, by the cardinality version of the theorem, $L$ will output a hypothesis $h$ such that $error(h) \leq \epsilon$ with probability at least $1 - \delta$. This concludes the proof of the first theorem above.

Improving sample complexity for common problems


Though Occam and PAC learnability are equivalent, the Occam framework can be used to produce tighter bounds on the sample complexity of classical problems including conjunctions,[2] conjunctions with few relevant variables,[4] and decision lists.[5]
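
For instance (an illustrative computation with all constants taken as $1$), for conjunctions over $n$ Boolean variables the hypothesis class has size at most $3^n$, since each variable appears positively, negatively, or not at all. A learner that outputs conjunctions of at most $k$ literals instead works with a class of size roughly $(2n)^k$, so the cardinality bound shrinks from $O(n/\epsilon)$ to $O((k \log n)/\epsilon)$ samples:

    import math

    # Illustrative comparison of cardinality-based sample bounds (constants
    # taken as 1). For conjunctions over n variables, |H| <= 3^n. If only
    # k variables are relevant and the learner outputs a conjunction of at
    # most k literals, the hypothesis class shrinks to roughly (2n)^k,
    # i.e. log|H| = O(k log n), a much smaller bound when k << n.

    def bound(eps, delta, ln_H):
        return math.ceil((ln_H + math.log(1 / delta)) / eps)

    n, k, eps, delta = 1000, 5, 0.1, 0.05
    print(bound(eps, delta, n * math.log(3)))      # all conjunctions: about 11000
    print(bound(eps, delta, k * math.log(2 * n)))  # k-literal conjunctions: about 410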

Extensions


Occam algorithms have also been shown to be successful for PAC learning in the presence of errors,[6][7] probabilistic concepts,[8] function learning,[9] and Markovian non-independent examples.[10]

References

1. Blumer, A., Ehrenfeucht, A., Haussler, D., & Warmuth, M. K. (1987). Occam's razor. Information Processing Letters, 24(6), 377–380.
2. Kearns, M. J., & Vazirani, U. V. (1994). An Introduction to Computational Learning Theory, chapter 2. MIT Press.
3. Board, R., & Pitt, L. (1990). On the necessity of Occam algorithms. In Proceedings of the Twenty-Second Annual ACM Symposium on Theory of Computing (pp. 54–63). ACM.
4. Haussler, D. (1988). Quantifying inductive bias: AI learning algorithms and Valiant's learning framework. Artificial Intelligence, 36(2), 177–221.
5. Rivest, R. L. (1987). Learning decision lists. Machine Learning, 2(3), 229–246.
6. Angluin, D., & Laird, P. (1988). Learning from noisy examples. Machine Learning, 2(4), 343–370.
7. Kearns, M., & Li, M. (1993). Learning in the presence of malicious errors. SIAM Journal on Computing, 22(4), 807–837.
8. Kearns, M. J., & Schapire, R. E. (1990). Efficient distribution-free learning of probabilistic concepts. In Proceedings of the 31st Annual Symposium on Foundations of Computer Science (pp. 382–391). IEEE.
9. Natarajan, B. K. (1993). Occam's razor for functions. In Proceedings of the Sixth Annual Conference on Computational Learning Theory (pp. 370–376). ACM.
10. Aldous, D., & Vazirani, U. (1990). A Markovian extension of Valiant's learning model. In Proceedings of the 31st Annual Symposium on Foundations of Computer Science (pp. 392–396). IEEE.