Language model






From Wikipedia, the free encyclopedia
 


A language model is a probabilistic model of a natural language.[1] In 1980, the first significant statistical language model was proposed, and during the decade IBM performed "Shannon-style" experiments, in which potential sources for language modeling improvement were identified by observing and analyzing the performance of human subjects in predicting or correcting text.[2]

Language models are useful for a variety of tasks, including speech recognition[3] (helping prevent predictions of low-probability (e.g. nonsense) sequences), machine translation,[4] natural language generation (generating more human-like text), optical character recognition, handwriting recognition,[5] grammar induction,[6] and information retrieval.[7][8]

Large language models, currently their most advanced form, combine larger datasets (frequently using words scraped from the public internet), feedforward neural networks, and transformers. They have superseded recurrent neural network-based models, which had previously superseded pure statistical models such as the word n-gram language model.

Pure statistical models

Models based on word n-grams

A word n-gram language model is a purely statistical model of language. It has been superseded by recurrent neural network–based models, which have been superseded by large language models.[9] It is based on the assumption that the probability of the next word in a sequence depends only on a fixed-size window of previous words. If only one previous word is considered, it is called a bigram model; if two words, a trigram model; if n − 1 words, an n-gram model.[10] Special tokens, ⟨s⟩ and ⟨/s⟩, are introduced to denote the start and end of a sentence.
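The fixed-window assumption can be written out as an approximation to the chain rule of probability. The notation is assumed here for illustration and is not taken from the text above: w_i denotes the i-th word, m the sequence length, and positions before the start of the sentence are taken to be filled with the start token.

$$P(w_1,\ldots,w_m) = \prod_{i=1}^{m} P(w_i \mid w_1,\ldots,w_{i-1}) \approx \prod_{i=1}^{m} P(w_i \mid w_{i-(n-1)},\ldots,w_{i-1})$$

For a bigram model (n = 2), each factor on the right reduces to P(w_i | w_{i-1}).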

To prevent a zero probability being assigned to unseen words, each word's probability is set slightly lower than its relative frequency in a corpus. To calculate it, various methods are used, from simple "add-one" smoothing (assign a count of 1 to unseen n-grams, as an uninformative prior) to more sophisticated models, such as Good–Turing discounting or back-off models.
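As a rough illustration of add-one smoothing, the sketch below trains a bigram model on a tiny made-up corpus. It assumes the common formulation in which 1 is added to every bigram count (seen or unseen) and the vocabulary size is added to the denominator. The function names, the `<s>`/`</s>` token spellings, and the corpus are hypothetical.

```python
from collections import Counter

def train_bigram(sentences):
    """Count bigrams and their left-hand contexts in whitespace-tokenized text."""
    context_counts = Counter()
    bigram_counts = Counter()
    vocab = set()
    for sentence in sentences:
        tokens = ["<s>"] + sentence.split() + ["</s>"]
        vocab.update(tokens)
        # Every token except the last one acts as a context for the next word.
        context_counts.update(tokens[:-1])
        bigram_counts.update(zip(tokens, tokens[1:]))
    return context_counts, bigram_counts, vocab

def add_one_prob(word, prev, context_counts, bigram_counts, vocab):
    """P(word | prev) with add-one smoothing over the whole vocabulary."""
    return (bigram_counts[(prev, word)] + 1) / (context_counts[prev] + len(vocab))

# Hypothetical two-sentence corpus.
corpus = ["the rain in Spain", "the plain"]
contexts, bigrams, vocab = train_bigram(corpus)

p_seen = add_one_prob("rain", "the", contexts, bigrams, vocab)
p_unseen = add_one_prob("Spain", "the", contexts, bigrams, vocab)
```

Here the unseen bigram "the Spain" still receives a small non-zero probability, while the probabilities of all possible next words after "the" continue to sum to 1.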

Exponential

Maximum entropy language models encode the relationship between a word and the n-gram history using feature functions. The equation is

$$P(w_m \mid w_1,\ldots,w_{m-1}) = \frac{1}{Z(w_1,\ldots,w_{m-1})} \exp\left(a^{T} f(w_1,\ldots,w_m)\right)$$

where $Z(w_1,\ldots,w_{m-1})$ is the partition function, $a$ is the parameter vector, and $f(w_1,\ldots,w_m)$ is the feature function. In the simplest case, the feature function is just an indicator of the presence of a certain n-gram. It is helpful to use a prior on $a$ or some form of regularization.
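A minimal sketch of this exponential form follows, assuming indicator features over bigrams: the score of a word is the sum of the weights of the active features (the dot product of the parameter vector with the feature vector), and the partition function normalizes the exponentiated scores over the vocabulary. The vocabulary, the weights, and the function names are invented for illustration; in practice the weights would be learned from data.

```python
import math

# Hypothetical vocabulary of candidate next words.
VOCAB = ["rain", "Spain", "plain"]

# Hypothetical weights, one per indicator feature (previous word, next word).
WEIGHTS = {
    ("the", "rain"): 1.2,
    ("the", "plain"): 0.7,
    ("in", "Spain"): 2.0,
}

def score(history, word):
    """Dot product of parameters and features; only one bigram feature can fire."""
    return WEIGHTS.get((history[-1], word), 0.0)

def maxent_prob(history, word):
    """exp(score) divided by the partition function for this history."""
    partition = sum(math.exp(score(history, w)) for w in VOCAB)
    return math.exp(score(history, word)) / partition

p_spain = maxent_prob(["rain", "in"], "Spain")
```

Because the partition function depends on the history, it has to be recomputed for each context, which is one reason such models can be costly with large vocabularies.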

The log-bilinear model is another example of an exponential language model.

Skip-gram model

The skip-gram language model is an attempt at overcoming the data sparsity problem that the preceding model (i.e. the word n-gram language model) faced. Words represented in an embedding vector are no longer necessarily consecutive, but can leave gaps that are skipped over.[11]

Formally, a k-skip-n-gram is a length-n subsequence where the components occur at distance at most k from each other.

For example, in the input text:

the rain in Spain falls mainly on the plain

the set of 1-skip-2-grams includes all the bigrams (2-grams), and in addition the subsequences

the in, rain Spain, in falls, Spain mainly, falls on, mainly the, and on plain.
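The example above can be reproduced with a short sketch. It assumes one reading of the definition, namely that neighbouring components of a subsequence may have at most k skipped words between them. For n = 2 this coincides with other common readings. The function name `skip_grams` is hypothetical.

```python
from itertools import combinations

def skip_grams(tokens, n, k):
    """Return all length-n subsequences whose neighbouring components
    have at most k skipped tokens between them."""
    result = []
    for positions in combinations(range(len(tokens)), n):
        gaps = (later - earlier - 1 for earlier, later in zip(positions, positions[1:]))
        if all(gap <= k for gap in gaps):
            result.append(tuple(tokens[i] for i in positions))
    return result

tokens = "the rain in Spain falls mainly on the plain".split()
one_skip_bigrams = skip_grams(tokens, n=2, k=1)
```

For this nine-word input the result contains 15 pairs: the 8 ordinary bigrams plus the 7 skipped pairs listed above. Enumerating all index combinations is simple but slow for long inputs, so it is meant only as an illustration.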

In the skip-gram model, semantic relations between words are represented by linear combinations, capturing a form of compositionality. For example, in some such models, if v is the function that maps a word w to its n-dimensional vector representation, then

$$v(\text{king}) - v(\text{male}) + v(\text{female}) \approx v(\text{queen})$$

where ≈ is made precise by stipulating that its right-hand side must be the nearest neighbor of the value of the left-hand side.[12][13]
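The nearest-neighbor reading of ≈ can be sketched as follows. The three-dimensional vectors are invented for illustration and are not taken from any trained model. Euclidean distance and the exclusion of the query words from the candidates are assumptions of this sketch.

```python
import math

# Hypothetical toy embeddings: (royalty, maleness, femaleness).
VECTORS = {
    "king":   [1.0, 1.0, 0.0],
    "queen":  [1.0, 0.0, 1.0],
    "male":   [0.0, 1.0, 0.0],
    "female": [0.0, 0.0, 1.0],
    "apple":  [0.0, 0.0, 0.0],
}

def analogy(a, b, c, vectors=VECTORS):
    """Return the word whose vector is nearest to v(a) - v(b) + v(c)."""
    target = [x - y + z for x, y, z in zip(vectors[a], vectors[b], vectors[c])]
    # Exclude the query words so the answer is not trivially one of the inputs.
    candidates = [w for w in vectors if w not in (a, b, c)]
    return min(candidates, key=lambda w: math.dist(target, vectors[w]))

answer = analogy("king", "male", "female")
```

With these toy vectors the combination lands exactly on the vector for "queen". Real embeddings only approximate such relations, which is why the nearest neighbor is used rather than exact equality.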

Neural models

Recurrent neural network

Continuous representations or embeddings of words are produced in recurrent neural network-based language models (known also as continuous space language models).[14] Such continuous space embeddings help to alleviate the curse of dimensionality, which is the consequence of the number of possible sequences of words increasing exponentially with the size of the vocabulary, which in turn causes a data sparsity problem. Neural networks avoid this problem by representing words as non-linear combinations of weights in a neural net.[15]

Large language models

A large language model (LLM) is a computational model notable for its ability to achieve general-purpose language generation and other natural language processing tasks such as classification. Based on language models, LLMs acquire these abilities by learning statistical relationships from vast amounts of text during a computationally intensive self-supervised and semi-supervised training process.[16] LLMs can be used for text generation, a form of generative AI, by taking an input text and repeatedly predicting the next token or word.[17]
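The loop of repeatedly predicting the next token can be sketched independently of any particular model. In the sketch below, a small hand-written lookup table stands in for the model, and the most probable token is always chosen (greedy decoding). Both are simplifying assumptions, and all names (`generate`, `toy_next_token_probs`, `TOY_TABLE`) are hypothetical. A real LLM would condition on the whole input rather than only the last token.

```python
# Hypothetical stand-in for a trained model: next-token probabilities
# keyed by the most recent token only.
TOY_TABLE = {
    "the": {"rain": 0.6, "plain": 0.4},
    "rain": {"falls": 0.9, "</s>": 0.1},
    "falls": {"</s>": 1.0},
}

def toy_next_token_probs(tokens):
    """Return a distribution over the next token given the tokens so far."""
    return TOY_TABLE.get(tokens[-1], {"</s>": 1.0})

def generate(next_token_probs, prompt, max_new_tokens, end_token="</s>"):
    """Greedy generation: append the most probable next token until the
    end token is predicted or the length limit is reached."""
    tokens = list(prompt)
    for _ in range(max_new_tokens):
        probs = next_token_probs(tokens)
        best = max(probs, key=probs.get)
        if best == end_token:
            break
        tokens.append(best)
    return tokens

output = generate(toy_next_token_probs, ["the"], max_new_tokens=5)
```

Sampling from the distribution instead of taking the maximum is a common alternative that yields more varied text.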

LLMs are artificial neural networks that utilize the transformer architecture, invented in 2017. The largest and most capable LLMs, as of June 2024, are built with a decoder-only transformer-based architecture, which enables efficient processing and generation of large-scale text data.

Historically, up to 2020, fine-tuning was the primary method used to adapt a model for specific tasks. However, larger models such as GPT-3 have demonstrated the ability to achieve similar results through prompt engineering, which involves crafting specific input prompts to guide the model's responses.[18] These models acquire knowledge about syntax, semantics, and ontologies[19] inherent in human language corpora, but they also inherit inaccuracies and biases present in the data they are trained on.[20]

Some notable LLMs are OpenAI's GPT series of models (e.g., GPT-3.5 and GPT-4, used in ChatGPT and Microsoft Copilot), Google's Gemini (used in the chatbot of the same name), Meta's LLaMA family of models, Anthropic's Claude models, and Mistral AI's models.

Although language models sometimes match human performance, it is not clear whether they are plausible cognitive models. At least for recurrent neural networks, it has been shown that they sometimes learn patterns that humans do not, but fail to learn patterns that humans typically do.[21]

Evaluation and benchmarks

The quality of language models is mostly evaluated by comparison with human-created sample benchmarks derived from typical language-oriented tasks. Other, less established, quality tests examine the intrinsic character of a language model or compare two such models. Since language models are typically intended to be dynamic and to learn from the data they see, some proposed models investigate the rate of learning, e.g., through inspection of learning curves.[22]

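Various data sets have been developed for use in evaluating language processing systems.[23] These include the Corpus of Linguistic Acceptability (CoLA), the GLUE benchmark, the Microsoft Research Paraphrase Corpus, the Quora Question Answer Dataset, Recognizing Textual Entailment, the Stanford Question Answering Dataset, the Stanford Sentiment Treebank, and Massive Multitask Language Understanding.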

See also

  • Deep linguistic processing
  • Factored language model
  • Generative pre-trained transformer
  • Katz's back-off model
  • Language technology
  • Statistical model
  • Ethics of artificial intelligence
  • Semantic similarity network
References

    1. Jurafsky, Dan; Martin, James H. (2021). "N-gram Language Models". Speech and Language Processing (3rd ed.). Archived from the original on 22 May 2022. Retrieved 24 May 2022.
    2. Rosenfeld, Ronald (2000). "Two decades of statistical language modeling: Where do we go from here?". Proceedings of the IEEE. 88 (8): 1270–1278. doi:10.1109/5.880083. S2CID 10959945.
    3. Kuhn, Roland; De Mori, Renato (1990). "A cache-based natural language model for speech recognition". IEEE Transactions on Pattern Analysis and Machine Intelligence. 12 (6): 570–583.
    4. Andreas, Jacob; Vlachos, Andreas; Clark, Stephen (2013). "Semantic parsing as machine translation". Proceedings of the 51st Annual Meeting of the Association for Computational Linguistics (Volume 2: Short Papers). Archived 15 August 2020 at the Wayback Machine.
    5. Pham, Vu; et al. (2014). "Dropout improves recurrent neural networks for handwriting recognition". 14th International Conference on Frontiers in Handwriting Recognition. IEEE. Archived 11 November 2020 at the Wayback Machine.
    6. Htut, Phu Mon; Cho, Kyunghyun; Bowman, Samuel R. (2018). "Grammar induction with neural language models: An unusual replication". arXiv:1808.10000. Archived 14 August 2022 at the Wayback Machine.
    7. Ponte, Jay M.; Croft, W. Bruce (1998). A language modeling approach to information retrieval. Proceedings of the 21st ACM SIGIR Conference. Melbourne, Australia: ACM. pp. 275–281. doi:10.1145/290941.291008.
    8. Hiemstra, Djoerd (1998). A linguistically motivated probabilistic model of information retrieval. Proceedings of the 2nd European conference on Research and Advanced Technology for Digital Libraries. LNCS, Springer. pp. 569–584. doi:10.1007/3-540-49653-X_34.
    9. Bengio, Yoshua; Ducharme, Réjean; Vincent, Pascal; Janvin, Christian (1 March 2003). "A neural probabilistic language model". The Journal of Machine Learning Research. 3: 1137–1155 – via ACM Digital Library.
    10. Jurafsky, Dan; Martin, James H. (7 January 2023). "N-gram Language Models". Speech and Language Processing (PDF) (3rd edition draft ed.). Retrieved 24 May 2022.
    11. Guthrie, David; et al. (2006). "A Closer Look at Skip-gram Modelling" (PDF). Archived from the original (PDF) on 17 May 2017. Retrieved 27 April 2014.
    12. Mikolov, Tomas; Chen, Kai; Corrado, Greg; Dean, Jeffrey (2013). "Efficient estimation of word representations in vector space". arXiv:1301.3781 [cs.CL].
    13. Mikolov, Tomas; Sutskever, Ilya; Chen, Kai; Corrado, Greg S.; Dean, Jeff (2013). Distributed Representations of Words and Phrases and their Compositionality (PDF). Advances in Neural Information Processing Systems. pp. 3111–3119. Archived (PDF) from the original on 29 October 2020. Retrieved 22 June 2015.
    14. Karpathy, Andrej. "The Unreasonable Effectiveness of Recurrent Neural Networks". Archived from the original on 1 November 2020. Retrieved 27 January 2019.
    15. Bengio, Yoshua (2008). "Neural net language models". Scholarpedia. Vol. 3. p. 3881. Bibcode:2008SchpJ...3.3881B. doi:10.4249/scholarpedia.3881 (inactive 12 June 2024). Archived from the original on 26 October 2020. Retrieved 28 August 2015.
    16. "Better Language Models and Their Implications". OpenAI. 14 February 2019. Archived from the original on 19 December 2020. Retrieved 25 August 2019.
    17. Bowman, Samuel R. (2023). "Eight Things to Know about Large Language Models". arXiv:2304.00612 [cs.CL].
    18. Brown, Tom B.; Mann, Benjamin; Ryder, Nick; Subbiah, Melanie; Kaplan, Jared; Dhariwal, Prafulla; Neelakantan, Arvind; Shyam, Pranav; Sastry, Girish; Askell, Amanda; Agarwal, Sandhini; Herbert-Voss, Ariel; Krueger, Gretchen; Henighan, Tom; Child, Rewon; Ramesh, Aditya; Ziegler, Daniel M.; Wu, Jeffrey; Winter, Clemens; Hesse, Christopher; Chen, Mark; Sigler, Eric; Litwin, Mateusz; Gray, Scott; Chess, Benjamin; Clark, Jack; Berner, Christopher; McCandlish, Sam; Radford, Alec; Sutskever, Ilya; Amodei, Dario (December 2020). Larochelle, H.; Ranzato, M.; Hadsell, R.; Balcan, M.F.; Lin, H. (eds.). "Language Models are Few-Shot Learners" (PDF). Advances in Neural Information Processing Systems. 33. Curran Associates, Inc.: 1877–1901.
    19. Fathallah, Nadeen; Das, Arunav; De Giorgis, Stefano; Poltronieri, Andrea; Haase, Peter; Kovriguina, Liubov (26 May 2024). NeOn-GPT: A Large Language Model-Powered Pipeline for Ontology Learning (PDF). Extended Semantic Web Conference 2024. Hersonissos, Greece.
    20. Manning, Christopher D. (2022). "Human Language Understanding & Reasoning". Daedalus. 151 (2): 127–138. doi:10.1162/daed_a_01905. S2CID 248377870.
    21. Hornstein, Norbert; Lasnik, Howard; Patel-Grosz, Pritty; Yang, Charles (9 January 2018). Syntactic Structures after 60 Years: The Impact of the Chomskyan Revolution in Linguistics. Walter de Gruyter GmbH & Co KG. ISBN 978-1-5015-0692-5. Archived from the original on 16 April 2023. Retrieved 11 December 2021.
    22. Karlgren, Jussi; Schutze, Hinrich (2015), "Evaluating Learning Language Representations", International Conference of the Cross-Language Evaluation Forum, Lecture Notes in Computer Science, Springer International Publishing, pp. 254–260, doi:10.1007/978-3-319-64206-2_8, ISBN 9783319642055
    23. Devlin, Jacob; Chang, Ming-Wei; Lee, Kenton; Toutanova, Kristina (10 October 2018). "BERT: Pre-training of Deep Bidirectional Transformers for Language Understanding". arXiv:1810.04805 [cs.CL].
    24. "The Corpus of Linguistic Acceptability (CoLA)". nyu-mll.github.io. Archived from the original on 7 December 2020. Retrieved 25 February 2019.
    25. "GLUE Benchmark". gluebenchmark.com. Archived from the original on 4 November 2020. Retrieved 25 February 2019.
    26. "Microsoft Research Paraphrase Corpus". Microsoft Download Center. Archived from the original on 25 October 2020. Retrieved 25 February 2019.
    27. Aghaebrahimian, Ahmad (2017), "Quora Question Answer Dataset", Text, Speech, and Dialogue, Lecture Notes in Computer Science, vol. 10415, Springer International Publishing, pp. 66–73, doi:10.1007/978-3-319-64206-2_8, ISBN 9783319642055
    28. Sammons, Mark; Vydiswaran, V.G. Vinod; Roth, Dan. "Recognizing Textual Entailment" (PDF). Archived from the original (PDF) on 9 August 2017. Retrieved 24 February 2019.
    29. "The Stanford Question Answering Dataset". rajpurkar.github.io. Archived from the original on 30 October 2020. Retrieved 25 February 2019.
    30. "Recursive Deep Models for Semantic Compositionality Over a Sentiment Treebank". nlp.stanford.edu. Archived from the original on 27 October 2020. Retrieved 25 February 2019.
    31. Hendrycks, Dan (14 March 2023), Measuring Massive Multitask Language Understanding, archived from the original on 15 March 2023, retrieved 15 March 2023
Further reading

  • J M Ponte; W B Croft (1998). "A Language Modeling Approach to Information Retrieval". Research and Development in Information Retrieval. pp. 275–281. CiteSeerX 10.1.1.117.4237.
  • F Song; W B Croft (1999). "A General Language Model for Information Retrieval". Research and Development in Information Retrieval. pp. 279–280. CiteSeerX 10.1.1.21.6467.
  • Chen, Stanley; Joshua Goodman (1998). An Empirical Study of Smoothing Techniques for Language Modeling (Technical report). Harvard University. CiteSeerX 10.1.1.131.5458.

  • Retrieved from "https://en.wikipedia.org/w/index.php?title=Language_model&oldid=1228657551"

    This page was last edited on 12 June 2024, at 12:56 (UTC).

    Text is available under the Creative Commons Attribution-ShareAlike License 4.0; additional terms may apply. By using this site, you agree to the Terms of Use and Privacy Policy. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.


