













Lemmatization







Lemmatization (or less commonly lemmatisation) in linguistics is the process of grouping together the inflected forms of a word so they can be analysed as a single item, identified by the word's lemma, or dictionary form.[1]

In computational linguistics, lemmatization is the algorithmic process of determining the lemma of a word based on its intended meaning. Unlike stemming, lemmatization depends on correctly identifying the intended part of speech and meaning of a word in a sentence, as well as within the larger context surrounding that sentence, such as neighbouring sentences or even an entire document. As a result, developing efficient lemmatization algorithms is an open area of research.[2][3][4]

Description


In many languages, words appear in several inflected forms. For example, in English, the verb 'to walk' may appear as 'walk', 'walked', 'walks' or 'walking'. The base form, 'walk', that one might look up in a dictionary, is called the lemma for the word. The association of the base form with a part of speech is often called a lexeme of the word.

Lemmatization is closely related to stemming. The difference is that a stemmer operates on a single word without knowledge of the context, and therefore cannot discriminate between words which have different meanings depending on part of speech. However, stemmers are typically easier to implement and run faster, and the reduced accuracy may not matter for some applications. In fact, when used within information retrieval systems, stemming improves query recall (the true positive rate) compared to lemmatization. Nonetheless, stemming reduces precision (the proportion of positively-labeled instances that are actually positive) for such systems.[5]
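The recall and precision trade-off can be made concrete with a small calculation. The sketch below is illustrative only: the document IDs and the two result sets are hypothetical values chosen to show the direction of the effect, not measurements from any real retrieval system.

```python
def precision_recall(retrieved: set, relevant: set) -> tuple:
    """Return (precision, recall) for one query's result set."""
    true_positives = len(retrieved & relevant)
    precision = true_positives / len(retrieved) if retrieved else 0.0
    recall = true_positives / len(relevant) if relevant else 0.0
    return precision, recall


# Hypothetical ground truth: documents 1-4 are relevant to the query.
relevant_docs = {1, 2, 3, 4}

# Hypothetical result sets (assumed for illustration only).
lemma_results = {1, 2}           # narrower matching: fewer hits, all relevant
stem_results = {1, 2, 3, 5, 6}   # broader matching: more hits, some irrelevant

lemma_scores = precision_recall(lemma_results, relevant_docs)
stem_scores = precision_recall(stem_results, relevant_docs)

print("lemmatized index (precision, recall):", lemma_scores)
print("stemmed index    (precision, recall):", stem_scores)
```

In this made-up case the stemmed index retrieves more of the relevant documents (higher recall) at the cost of also returning irrelevant ones (lower precision).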

For instance:

  1. The word "better" has "good" as its lemma. This link is missed by stemming, as it requires a dictionary look-up.
  2. The word "walk" is the base form for the word "walking", and hence this is matched in both stemming and lemmatization.
  3. The word "meeting" can be either the base form of a noun or a form of a verb ("to meet") depending on the context; e.g., "in our last meeting" or "We are meeting again tomorrow". Unlike stemming, lemmatization attempts to select the correct lemma depending on the context.
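The three cases above can be sketched with a toy stemmer and a toy lemmatizer. Both are deliberately minimal and hypothetical: `toy_stem` strips a few English suffixes using spelling alone, and `toy_lemmatize` looks up a hand-written table keyed by word and part-of-speech tag. The table, the tag names and the function names are assumptions made for this illustration, and the part of speech is supplied by the caller rather than inferred from context.

```python
def toy_stem(word: str) -> str:
    """Strip a common suffix using spelling only; context is never consulted."""
    word = word.lower()
    for suffix in ("ing", "ed", "s"):
        # Keep at least three characters of stem to avoid over-stripping.
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word


# Hypothetical lemma table keyed by (word, part of speech).
LEMMAS = {
    ("better", "ADJ"): "good",
    ("walking", "VERB"): "walk",
    ("meeting", "NOUN"): "meeting",
    ("meeting", "VERB"): "meet",
}


def toy_lemmatize(word: str, pos: str) -> str:
    """Look up the lemma for a word given its part of speech."""
    word = word.lower()
    return LEMMAS.get((word, pos), word)


# 1. "better": the stemmer leaves it unchanged, the lemmatizer finds "good".
print(toy_stem("better"), toy_lemmatize("better", "ADJ"))
# 2. "walking": both approaches agree on "walk".
print(toy_stem("walking"), toy_lemmatize("walking", "VERB"))
# 3. "meeting": the stemmer always gives "meet"; the lemma depends on the tag.
print(toy_stem("meeting"), toy_lemmatize("meeting", "NOUN"), toy_lemmatize("meeting", "VERB"))
```

A real lemmatizer would have to determine the part of speech itself, which is where the dependence on sentence context comes from.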

Document indexing software like Lucene[6] can store the stemmed base form of a word without knowledge of its meaning, considering only the grammar rules of word formation. The stemmed word itself might not be a valid word: 'lazy', for example, is stemmed by many stemmers to 'lazi'. This is because the purpose of stemming is not to produce the appropriate lemma, which is a more challenging task that requires knowledge of context. The main purpose of stemming is to map different forms of a word to a single form.[7] As a rule-based algorithm that depends only upon the spelling of a word, it sacrifices accuracy to ensure that, for example, when 'laziness' is stemmed to 'lazi', it has the same stem as 'lazy'.
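How a spelling-only rule set can map 'lazy' and 'laziness' to the same non-word can be sketched with two rules. The function below is a hypothetical simplification loosely modelled on Porter-style suffix rules; it is not the Porter algorithm, nor the rule set used by Lucene or any other particular stemmer.

```python
VOWELS = set("aeiou")


def two_rule_stem(word: str) -> str:
    """Apply two illustrative spelling rules; not a complete stemmer."""
    word = word.lower()
    # Rule 1 (assumed): drop the noun-forming suffix "-ness".
    if word.endswith("ness") and len(word) > 6:
        word = word[:-4]
    # Rule 2 (assumed): rewrite a final "y" as "i" when the rest has a vowel.
    if word.endswith("y") and any(ch in VOWELS for ch in word[:-1]):
        word = word[:-1] + "i"
    return word


print(two_rule_stem("lazy"))      # lazi
print(two_rule_stem("laziness"))  # lazi
```

Both forms end up under the same index key even though 'lazi' is not an English word, which is all a stemmer needs to achieve.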

Algorithms


The simplest way to perform lemmatization is dictionary lookup. This works well for straightforward inflected forms, but a rule-based system will be needed for other cases, such as in languages with long compound words. Such rules can be either hand-crafted or learned automatically from an annotated corpus.
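A minimal sketch of combining the two approaches is shown below: an exception dictionary is consulted first, and hand-crafted suffix rules are used as a fallback. The dictionary entries, the rules and the function name are hypothetical and cover only a handful of English forms; they are not taken from any published lemmatizer, and compound splitting is not attempted.

```python
# Hypothetical exception dictionary for irregular forms.
EXCEPTIONS = {"went": "go", "mice": "mouse", "better": "good"}

# Hypothetical hand-crafted rules: (suffix to remove, replacement).
SUFFIX_RULES = [("ies", "y"), ("ing", ""), ("ed", ""), ("s", "")]


def lookup_then_rules(word: str) -> str:
    """Return a lemma by dictionary lookup, falling back to suffix rules."""
    word = word.lower()
    if word in EXCEPTIONS:
        return EXCEPTIONS[word]
    for suffix, replacement in SUFFIX_RULES:
        # Require a stem of at least three characters before rewriting.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word


for form in ["Went", "mice", "studies", "walked", "walks", "walk"]:
    print(form, "->", lookup_then_rules(form))
```

The order matters in this sketch: checking the dictionary first keeps irregular forms such as 'went' from being passed to rules that cannot handle them.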

Use in biomedicine


Morphological analysis of published biomedical literature can yield useful results. Morphological processing of biomedical text can be made more effective by a lemmatization program specialized for biomedicine, which may improve the accuracy of practical information extraction tasks.[8]


References

  1. Collins English Dictionary, entry for "lemmatize".
  2. "WebBANC: Building Semantically-Rich Annotated Corpora from Web User Annotations of Minority Languages".
  3. Müller, Thomas; Cotterell, Ryan; Fraser, Alexander; Schütze, Hinrich (2015). Joint Lemmatization and Morphological Tagging with LEMMING (PDF). 2015 Conference on Empirical Methods in Natural Language Processing. Lisbon: Association for Computational Linguistics. pp. 2268–2274. doi:10.18653/v1/D15-1272.
  4. Bergmanis, Toms; Goldwater, Sharon. "Context Sensitive Neural Lemmatization with Lematus" (PDF).
  5. Manning, Christopher D.; Raghavan, Prabhakar; Schütze, Hinrich. "Introduction to Information Retrieval". Cambridge University Press.
  6. "Lucene Snowball". Apache project.
  7. Martin Porter. "Porter Stemmer".
  8. Liu, H.; Christiansen, T.; Baumgartner, W. A.; Verspoor, K. (2012). "BioLemmatizer: A lemmatization tool for morphological processing of biomedical text". Journal of Biomedical Semantics. 3: 3. doi:10.1186/2041-1480-3-3. PMC 3359276. PMID 22464129.
    This page was last edited on 3 December 2023, at 11:38 (UTC).

    Text is available under the Creative Commons Attribution-ShareAlike License 4.0; additional terms may apply. By using this site, you agree to the Terms of Use and Privacy Policy. Wikipedia® is a registered trademark of the Wikimedia Foundation, Inc., a non-profit organization.


