
Empirical risk minimization






From Wikipedia, the free encyclopedia
 


Empirical risk minimization is a principle in statistical learning theory which defines a family of learning algorithms based on evaluating performance over a known and fixed dataset. The core idea is based on an application of the law of large numbers; more specifically, we cannot know exactly how well a predictive algorithm will work in practice (i.e. the true "risk") because we do not know the true distribution of the data, but we can instead estimate and optimize the performance of the algorithm on a known set of training data. The performance over the known set of training data is referred to as the "empirical risk".

Background

The following situation is a general setting of many supervised learning problems. There are two spaces of objects $X$ and $Y$, and we would like to learn a function $h \colon X \to Y$ (often called a hypothesis) which outputs an object $y \in Y$, given $x \in X$. To do so, there is a training set of $n$ examples $(x_1, y_1), \ldots, (x_n, y_n)$, where $x_i \in X$ is an input and $y_i \in Y$ is the corresponding response that is desired from $h(x_i)$.

To put it more formally, assume that there is a joint probability distribution $P(x, y)$ over $X$ and $Y$, and that the training set consists of $n$ instances $(x_1, y_1), \ldots, (x_n, y_n)$ drawn i.i.d. from $P(x, y)$. The assumption of a joint probability distribution allows for the modelling of uncertainty in predictions (e.g. from noise in data) because $y$ is not a deterministic function of $x$, but rather a random variable with conditional distribution $P(y \mid x)$ for a fixed $x$.

It is also assumed that there is a non-negative real-valued loss function $L(\hat{y}, y)$ which measures how different the prediction $\hat{y}$ of a hypothesis is from the true outcome $y$. For classification tasks, these loss functions can be scoring rules. The risk associated with hypothesis $h(x)$ is then defined as the expectation of the loss function:

$$R(h) = \mathbf{E}[L(h(x), y)] = \int L(h(x), y) \, dP(x, y).$$

A loss function commonly used in theory is the 0–1 loss function: $L(\hat{y}, y) = \mathbf{1}[\hat{y} \ne y]$, the indicator that the prediction is wrong.

The ultimate goal of a learning algorithm is to find a hypothesis $h^*$ among a fixed class of functions $\mathcal{H}$ for which the risk $R(h)$ is minimal:

$$h^* = \underset{h \in \mathcal{H}}{\arg\min}\, R(h).$$

For classification problems, the Bayes classifier is defined to be the classifier minimizing the risk defined with the 0–1 loss function.

Empirical risk minimization

In general, the risk $R(h)$ cannot be computed because the distribution $P(x, y)$ is unknown to the learning algorithm (this situation is referred to as agnostic learning[citation needed]). However, given a sample of i.i.d. training data points, we can compute an estimate, called the empirical risk, by averaging the loss function over the training set; more formally, computing the expectation with respect to the empirical measure:

$$R_{\text{emp}}(h) = \frac{1}{n} \sum_{i=1}^{n} L(h(x_i), y_i).$$

The empirical risk minimization principle[1] states that the learning algorithm should choose a hypothesis $\hat{h}$ which minimizes the empirical risk over the hypothesis class $\mathcal{H}$:

$$\hat{h} = \underset{h \in \mathcal{H}}{\arg\min}\, R_{\text{emp}}(h).$$

Thus, the learning algorithm defined by the empirical risk minimization principle consists in solving the above optimization problem.
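As a concrete sketch, the principle can be carried out exactly when the hypothesis class is finite: compute the empirical risk of every hypothesis and keep a minimizer. The toy data set and the class of threshold classifiers below are illustrative assumptions, not part of the formal definition:

```python
def zero_one_loss(y_pred, y_true):
    """0-1 loss: 1 if the prediction differs from the truth, else 0."""
    return 0 if y_pred == y_true else 1

def empirical_risk(h, data):
    """Average loss of hypothesis h over the training set."""
    return sum(zero_one_loss(h(x), y) for x, y in data) / len(data)

def erm(hypotheses, data):
    """Return a hypothesis minimizing the empirical risk."""
    return min(hypotheses, key=lambda h: empirical_risk(h, data))

# Toy 1-D training set: the label is 1 exactly when x >= 2.5.
data = [(0.5, 0), (1.0, 0), (2.0, 0), (3.0, 1), (3.5, 1), (4.0, 1)]

# Finite hypothesis class: threshold classifiers h_t(x) = 1 if x >= t.
thresholds = [0.0, 1.5, 2.5, 3.8]
hypotheses = [lambda x, t=t: int(x >= t) for t in thresholds]

best = erm(hypotheses, data)
print(empirical_risk(best, data))  # → 0.0 (the threshold 2.5 fits exactly)
```

With an infinite class $\mathcal{H}$ the same principle applies, but the minimization becomes a genuine optimization problem rather than an enumeration.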

Properties

Guarantees for the performance of empirical risk minimization depend strongly on the function class selected as well as the distributional assumptions made.[2] In general, distribution-free methods are too coarse, and do not lead to practical bounds. However, they are still useful in deriving asymptotic properties of learning algorithms, such as consistency. In particular, distribution-free bounds on the performance of empirical risk minimization given a fixed function class can be derived using bounds on the VC complexity of the function class.

For simplicity, consider the case of binary classification. It is possible to bound the probability that the selected classifier $\hat{h}$ is much worse than the best possible classifier $h^*$. Consider the risk $R(h)$ defined over the hypothesis class $\mathcal{H}$ with growth function $\mathcal{S}(\mathcal{H}, n)$, given a dataset of size $n$. Then, for every $\epsilon > 0$:[3]

$$\mathbb{P}\left( R(\hat{h}) - R(h^*) > \epsilon \right) \le 8\, \mathcal{S}(\mathcal{H}, n)\, e^{-n\epsilon^2/32}.$$

Similar results hold for regression tasks.[2] These results are often based on uniform laws of large numbers, which control the deviation of the empirical risk from the true risk, uniformly over the hypothesis class.[3]

Impossibility results

It is also possible to show lower bounds on algorithm performance if no distributional assumptions are made.[4] This is sometimes referred to as the No free lunch theorem. Even though a specific learning algorithm may provide the asymptotically optimal performance for any distribution, the finite-sample performance is always poor for at least one data distribution. This means that no classifier can provide guarantees on the error for a given sample size that hold across all distributions.[3]

Specifically, let $\epsilon > 0$, and consider a sample size $n$ and a classification rule $g_n$. There exists a distribution of $(X, Y)$ with risk $L^* = 0$ (meaning that perfect prediction is possible) such that:[3]

$$\mathbb{E} L_n \ge \frac{1}{2} - \epsilon,$$

where $L_n$ denotes the risk of the rule learned from the sample of size $n$.

It is further possible to show that the convergence rate of a learning algorithm is poor for some distributions. Specifically, given a sequence $(a_n)$ of decreasing positive numbers converging to zero, it is possible to find a distribution such that:

$$\mathbb{E} L_n \ge a_n$$

for all $n$. This result shows that universally good classification rules do not exist, in the sense that any rule must be low quality for at least one distribution.[3]

Computational complexity

Empirical risk minimization for a classification problem with a 0–1 loss function is known to be an NP-hard problem even for a relatively simple class of functions such as linear classifiers.[5] Nevertheless, it can be solved efficiently when the minimal empirical risk is zero, i.e., when the data are linearly separable.[citation needed]
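The linearly separable case can be illustrated with the classical perceptron algorithm, which reaches a zero-empirical-risk linear classifier after finitely many updates on separable data. A minimal sketch; the two-dimensional toy data set and the epoch cap are illustrative assumptions:

```python
def perceptron(data, epochs=100):
    """Train w, b so that sign(w.x + b) matches the +/-1 labels."""
    w, b = [0.0, 0.0], 0.0
    for _ in range(epochs):
        updated = False
        for (x1, x2), y in data:
            if y * (w[0] * x1 + w[1] * x2 + b) <= 0:  # misclassified point
                w[0] += y * x1                         # move the boundary
                w[1] += y * x2                         # toward correctness
                b += y
                updated = True
        if not updated:  # a full pass with no mistakes: zero empirical risk
            break
    return w, b

# Linearly separable toy data with labels in {-1, +1}.
data = [((0.0, 0.0), -1), ((1.0, 0.5), -1), ((3.0, 3.0), 1), ((4.0, 2.5), 1)]
w, b = perceptron(data)
errors = sum(y * (w[0] * x[0] + w[1] * x[1] + b) <= 0 for x, y in data)
print(errors)  # → 0
```

If the data were not separable, the loop above would simply exhaust its epoch budget without converging, which reflects the hardness of the general 0–1 case.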

In practice, machine learning algorithms cope with this issue either by employing a convex approximation to the 0–1 loss function (like the hinge loss for SVMs), which is easier to optimize, or by imposing assumptions on the distribution (thus ceasing to be agnostic learning algorithms, to which the above result applies).
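A minimal sketch of the convex-surrogate approach, using the hinge loss and subgradient descent on a one-dimensional linear model; the toy data, learning rate, and step count are illustrative assumptions:

```python
def hinge_loss(margin):
    """Convex upper bound on the 0-1 loss at margin y * (w*x + b)."""
    return max(0.0, 1.0 - margin)

def sgd_hinge(data, lr=0.1, steps=200):
    """Subgradient descent on the average hinge loss of a 1-D linear model."""
    w, b = 0.0, 0.0
    for _ in range(steps):
        gw, gb = 0.0, 0.0
        for x, y in data:
            if y * (w * x + b) < 1.0:  # inside the margin: subgradient -y*x, -y
                gw -= y * x
                gb -= y
        w -= lr * gw / len(data)
        b -= lr * gb / len(data)
    return w, b

# Labels in {-1, +1}; the positive class has larger x.
data = [(-2.0, -1), (-1.0, -1), (1.0, 1), (2.0, 1)]
w, b = sgd_hinge(data)
errors = sum(y * (w * x + b) <= 0 for x, y in data)
print(errors)  # → 0
```

Because the hinge loss upper-bounds the 0–1 loss, driving the convex surrogate risk down also controls the number of training misclassifications.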

In the case of convexification, Zhang's lemma bounds the excess risk of the original problem in terms of the excess risk of the convexified problem.[6] Minimizing the latter using convex optimization therefore also allows one to control the former.

Tilted empirical risk minimization

Tilted empirical risk minimization is a machine learning technique used to modify standard loss functions, such as the squared error, by introducing a tilt parameter. This parameter dynamically adjusts the weight of data points during training, allowing the algorithm to focus on specific regions or characteristics of the data distribution. Tilted empirical risk minimization is particularly useful in scenarios with imbalanced data or when there is a need to emphasize errors in certain parts of the prediction space.
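A minimal sketch of one common formulation of the tilted objective, $\tilde{R}_t(h) = \tfrac{1}{t} \log\bigl(\tfrac{1}{n} \sum_i e^{t\, L(h(x_i), y_i)}\bigr)$: positive $t$ up-weights high-loss points, negative $t$ down-weights them (e.g. outliers), and $t \to 0$ recovers the ordinary empirical risk. The loss values below are invented for illustration:

```python
import math

def tilted_risk(losses, t):
    """Tilted empirical risk: (1/t) * log(mean(exp(t * loss))), t != 0."""
    n = len(losses)
    return math.log(sum(math.exp(t * l) for l in losses) / n) / t

losses = [0.1, 0.2, 0.1, 5.0]           # per-example losses; one outlier
avg = sum(losses) / len(losses)          # ordinary empirical risk
print(tilted_risk(losses, +1.0) > avg)   # → True: tilt toward the outlier
print(tilted_risk(losses, -1.0) < avg)   # → True: tilt away from it
```

By Jensen's inequality the tilted risk is above the plain average for $t > 0$ and below it for $t < 0$, which is exactly the re-weighting behaviour described above.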

References

1. Györfi, László; Kohler, Michael; Krzyzak, Adam; Walk, Harro (2010). A Distribution-Free Theory of Nonparametric Regression (softcover reprint of the original 1st ed.). New York: Springer. ISBN 978-1-4419-2998-3.
2. Devroye, L.; Gyorfi, L.; Lugosi, G. A Probabilistic Theory of Pattern Recognition. Discrete Applied Mathematics 73, 192–194 (1997).
3. Devroye, Luc; Györfi, László; Lugosi, Gábor (1996). A Probabilistic Theory of Pattern Recognition. Stochastic Modelling and Applied Probability 31. doi:10.1007/978-1-4612-0711-5. ISBN 978-1-4612-6877-2. ISSN 0172-4568.
4. Feldman, V.; Guruswami, V.; Raghavendra, P.; Wu, Yi (2009). Agnostic Learning of Monomials by Halfspaces is Hard. (See the paper and references therein.)
5. "Mathematics of Machine Learning Lecture 9 Notes". MIT OpenCourseWare. Retrieved 2023-10-28.


    Retrieved from "https://en.wikipedia.org/w/index.php?title=Empirical_risk_minimization&oldid=1228458708"

    This page was last edited on 11 June 2024, at 10:12 (UTC).

    Text is available under the Creative Commons Attribution-ShareAlike License 4.0; additional terms may apply.


