Spark NLP

From Wikipedia, the free encyclopedia

Spark NLP
Original author(s)	John Snow Labs
Initial release	October 2017^[1]

Stable release	5.2.3 / January 2024

Repository	github.com/JohnSnowLabs/spark-nlp
Written in	Python, Scala
Operating system	Linux, Windows, macOS
Type	Natural language processing
License	Apache License 2.0
Website	sparknlp.org

Spark NLP is an open-source text processing library for advanced natural language processing for the Python, Java and Scala programming languages.^[2]^[3]^[4] The library is built on top of Apache Spark and its Spark ML library.^[5]

Its purpose is to provide an API for natural language processing pipelines that implement recent academic research results as production-grade, scalable, and trainable software. The library offers pre-trained neural network models, pipelines, and embeddings, as well as support for training custom models.^[5]

Features

The library is designed around the concept of a pipeline, which is an ordered set of text annotators.^[6] Out-of-the-box annotators include a tokenizer, normalizer, stemmer, lemmatizer, regular expression matcher, TextMatcher, chunker, DateMatcher, SentenceDetector, DeepSentenceDetector, POS tagger, ViveknSentimentDetector, sentiment analysis, named entity recognition, conditional random field annotator, deep learning annotator, spell checking and correction, dependency parser, typed dependency parser, document classification, and language detection.^[7]
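The idea of a pipeline as an ordered set of annotators can be sketched in plain Python. This is only a toy illustration and does not use the Spark NLP API: the `Annotator` and `Pipeline` classes, the column names, and the three helper functions are hypothetical, and the stemmer is deliberately naive. Each stage reads one named column and writes another, so a stage can only run after the stage that produces its input.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# A document is modeled as a dict of named annotation columns (an assumption
# made for this sketch, not the library's internal representation).
Document = Dict[str, List[str]]


@dataclass
class Annotator:
    """Hypothetical annotator: reads one column and writes another."""
    name: str
    input_col: str
    output_col: str
    fn: Callable[[List[str]], List[str]]

    def annotate(self, doc: Document) -> Document:
        if self.input_col not in doc:
            raise KeyError(f"{self.name}: missing input column '{self.input_col}'")
        out = dict(doc)  # copy so that earlier columns are preserved
        out[self.output_col] = self.fn(doc[self.input_col])
        return out


@dataclass
class Pipeline:
    """Hypothetical pipeline: applies its stages strictly in order."""
    stages: List[Annotator] = field(default_factory=list)

    def run(self, text: str) -> Document:
        doc: Document = {"document": [text]}
        for stage in self.stages:
            doc = stage.annotate(doc)
        return doc


def tokenize(parts: List[str]) -> List[str]:
    # Whitespace tokenization only.
    return [tok for part in parts for tok in part.split()]


def normalize(tokens: List[str]) -> List[str]:
    # Lowercase and drop non-alphanumeric characters; discard empty results.
    cleaned = ("".join(ch for ch in tok if ch.isalnum()).lower() for tok in tokens)
    return [tok for tok in cleaned if tok]


def crude_stem(tokens: List[str]) -> List[str]:
    # Naive suffix stripping, for illustration only.
    def strip(tok: str) -> str:
        for suffix in ("ing", "s"):
            if tok.endswith(suffix) and len(tok) > len(suffix) + 2:
                return tok[: -len(suffix)]
        return tok
    return [strip(tok) for tok in tokens]


pipeline = Pipeline([
    Annotator("tokenizer", "document", "token", tokenize),
    Annotator("normalizer", "token", "normalized", normalize),
    Annotator("stemmer", "normalized", "stem", crude_stem),
])

result = pipeline.run("Spark NLP pipelines chain annotators, in order!")
print(result["stem"])
```

Because every stage declares its input column, placing the stemmer before the normalizer in this sketch fails with a `KeyError`, which is one way to see why the set of annotators has to be ordered.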

The Models Hub is a platform for sharing open-source as well as licensed pre-trained models and pipelines. It includes pre-trained pipelines for tokenization, lemmatization, part-of-speech tagging, and named entity recognition in more than thirteen languages; word embeddings including GloVe, ELMo, BERT, ALBERT, XLNet, Small BERT, and ELECTRA; and sentence embeddings including Universal Sentence Encoder (USE)^[8] and Language-agnostic BERT Sentence Embedding (LaBSE).^[9] It also includes resources and pre-trained models for more than two hundred languages. The Spark NLP code base includes tokenizers for East Asian languages such as Chinese, Japanese, and Korean; support for right-to-left languages such as Urdu, Farsi, Arabic, and Hebrew; pre-trained multilingual word and sentence embeddings such as LaUSE; and a translation annotator.

Usage in healthcare

Spark NLP for Healthcare is a commercial extension of Spark NLP for clinical and biomedical text mining.^[10] It provides healthcare-specific annotators, pipelines, models, and embeddings for clinical entity recognition, clinical entity linking, entity normalization, assertion status detection, de-identification, relation extraction, and spell checking and correction.

The library offers access to several clinical and biomedical transformer models and embeddings: JSL-BERT-Clinical, BioBERT, ClinicalBERT,^[11] GloVe-Med, and GloVe-ICD-O. It also includes over 50 pre-trained healthcare models that recognize entity types such as clinical, drug, risk factor, anatomy, demographic, and sensitive-data entities.
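De-identification, one of the tasks listed above, can be thought of as replacing detected sensitive spans with placeholders. The following is a minimal sketch under stated assumptions, not the Spark NLP for Healthcare implementation: the `Entity` type and `mask_entities` function are hypothetical, and the character offsets are assumed to come from an upstream entity recognition step.

```python
from typing import List, NamedTuple


class Entity(NamedTuple):
    """Hypothetical entity span produced by an upstream recognizer."""
    start: int  # inclusive character offset
    end: int    # exclusive character offset
    label: str


def mask_entities(text: str, entities: List[Entity]) -> str:
    """Replace each entity span with a <LABEL> placeholder."""
    pieces = []
    cursor = 0
    for ent in sorted(entities, key=lambda e: e.start):
        if ent.start < cursor:
            raise ValueError("overlapping entity spans are not supported")
        pieces.append(text[cursor:ent.start])  # keep text before the entity
        pieces.append(f"<{ent.label}>")        # substitute the placeholder
        cursor = ent.end
    pieces.append(text[cursor:])               # keep the remaining tail
    return "".join(pieces)


note = "John Smith was admitted on 2020-03-14."
entities = [Entity(0, 10, "NAME"), Entity(27, 37, "DATE")]
masked = mask_entities(note, entities)
print(masked)
```

Working from sorted, non-overlapping offsets keeps the masking step independent of how the entities were detected, so the same function could sit behind any recognizer that reports character spans.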

Spark OCR

Spark OCR is another commercial extension of Spark NLP for optical character recognition (OCR) from images, scanned PDF documents, and DICOM files.^[7] It is a software library built on top of Apache Spark. It provides several image pre-processing features for improving text recognition results, such as adaptive thresholding and denoising, skew detection and correction, adaptive scaling, layout analysis and region detection, image cropping, and removal of background objects.
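To illustrate why adaptive thresholding helps text recognition, the sketch below binarizes a small grayscale image by comparing each pixel with the mean of its local neighborhood. It is a generic mean-based variant written in plain Python, not Spark OCR's implementation; the function names, the `window` and `offset` parameters, and the sample image are assumptions made for the example.

```python
from typing import List

Image = List[List[int]]  # grayscale values, 0 (black) to 255 (white)


def adaptive_threshold(img: Image, window: int = 3, offset: int = 25) -> Image:
    """Mark a pixel white if it exceeds its local mean minus `offset`."""
    if window < 1 or window % 2 == 0:
        raise ValueError("window must be a positive odd number")
    height, width = len(img), len(img[0])
    radius = window // 2
    out: Image = []
    for y in range(height):
        row = []
        for x in range(width):
            # Clip the neighborhood at the image borders.
            y0, y1 = max(0, y - radius), min(height, y + radius + 1)
            x0, x1 = max(0, x - radius), min(width, x + radius + 1)
            neighborhood = [img[j][i] for j in range(y0, y1) for i in range(x0, x1)]
            local_mean = sum(neighborhood) / len(neighborhood)
            row.append(255 if img[y][x] > local_mean - offset else 0)
        out.append(row)
    return out


def global_threshold(img: Image, t: int) -> Image:
    """Baseline for comparison: one cutoff for the whole image."""
    return [[255 if px > t else 0 for px in row] for row in img]


# Unevenly lit page: brightness falls from left to right, with one dark
# "ink" pixel in the bright area (130) and one in the dim area (40).
page = [
    [220, 200, 180, 160, 140, 120],
    [220, 130, 180, 160,  40, 120],
    [220, 200, 180, 160, 140, 120],
]

binary = adaptive_threshold(page, window=3, offset=25)
for row in binary:
    print(row)
```

In this sample the ink pixel on the bright side (130) is lighter than the background on the dim side (120), so no single global cutoff separates ink from background, while the local comparison isolates both ink pixels.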

Due to the tight coupling between Spark OCR and Spark NLP, users can combine NLP and OCR pipelines for tasks such as extracting text from images, extracting data from tables, recognizing and highlighting named entities in PDF documents, or masking sensitive text in order to de-identify images.^[12]

Spark OCR supports several output formats: PDF, image, or DICOM files with annotated or masked entities; digital text for downstream processing in Spark NLP or other libraries; and structured data formats (JSON and CSV), delivered as files or Spark data frames.

Users can also distribute the OCR jobs across multiple nodes in a Spark cluster.

License and availability

Spark NLP is licensed under the Apache 2.0 license. The source code is publicly available on GitHub, along with documentation and a tutorial. Prebuilt versions of Spark NLP are available on PyPI and the Anaconda repository for Python development, on Maven Central for Java and Scala development, and on Spark Packages for Spark development.

Award

In March 2019, Spark NLP received the Open Source Award for its contributions to natural language processing in Python, Java, and Scala.^[13]

References

[1] Talby, David (19 October 2017). "Introducing the Natural Language Processing Library for Apache Spark". databricks.com. Databricks. Retrieved 29 March 2019.
[2] Ellafi, Saif Addin (28 February 2018). "Comparing production-grade NLP libraries: Running Spark-NLP and spaCy pipelines". O'Reilly Media. Retrieved 29 March 2019.
[3] Ellafi, Saif Addin (28 February 2018). "Comparing production-grade NLP libraries: Accuracy, performance, and scalability". O'Reilly Media. Retrieved 29 March 2019.
[4] Ewbank, Kay. "Spark Gets NLP Library". www.i-programmer.info.
[5] Thomas, Alex (July 2020). Natural Language Processing with Spark NLP: Learning to Understand Text at Scale (First ed.). United States of America: O'Reilly Media. ISBN 978-1492047766.
[6] Talby, David (19 October 2017). "Introducing the Natural Language Processing Library for Apache Spark - The Databricks Blog". Databricks. Retrieved 27 August 2019.
[7] Jha, Bineet Kumar; G, Sivasankari G.; R, Venugopal K. (2 May 2021). "Sentiment Analysis for E-Commerce Products Using Natural Language Processing". Annals of the Romanian Society for Cell Biology: 166–175 – via www.annalsofrscb.ro.
[8] Cer, Daniel; Yang, Yinfei; Kong, Sheng-yi; Hua, Nan; Limtiaco, Nicole; John, Rhomni St; Constant, Noah; Guajardo-Cespedes, Mario; Yuan, Steve; Tar, Chris; Sung, Yun-Hsuan; Strope, Brian; Kurzweil, Ray (12 April 2018). "Universal Sentence Encoder". arXiv:1803.11175 [cs.CL].
[9] Feng, Fangxiaoyu; Yang, Yinfei; Cer, Daniel; Arivazhagan, Naveen; Wang, Wei (3 July 2020). "Language-agnostic BERT Sentence Embedding". arXiv:2007.01852 [cs.CL].
[10] Team, Editorial (4 September 2018). "The Use of NLP to Extract Unstructured Medical Data From Text". insideBIGDATA. Retrieved 27 August 2019.
[11] Alsentzer, Emily; Murphy, John; Boag, William; Weng, Wei-Hung; Jindi, Di; Naumann, Tristan; McDermott, Matthew (June 2019). "Publicly Available Clinical BERT Embeddings". Proceedings of the 2nd Clinical Natural Language Processing Workshop. Association for Computational Linguistics: 72–78. arXiv:1904.03323. doi:10.18653/v1/W19-1909. S2CID 102352093.
[12] "A Unified CV, OCR & NLP Model Pipeline for Document Understanding at DocuSign". NLP Summit. Retrieved 18 September 2020.
[13] "Civis Analytics, Okera, Sigma Computing and Spark NLP Named Winners of Strata Data Awards".

Sources

Thomas, Alex (21 July 2020). Natural Language Processing with Spark NLP: Learning to Understand Text at Scale. O'Reilly Media. ISBN 978-1492047766.
Quinto, Butch (2020). Next-Generation Machine Learning with Spark. Berkeley, California: Apress. doi:10.1007/978-1-4842-5669-5. ISBN 978-1-4842-5668-8. S2CID 211234215.

External links

Spark NLP

Retrieved from "https://en.wikipedia.org/w/index.php?title=Spark_NLP&oldid=1231873429". This page was last edited on 30 June 2024, at 18:56 (UTC). Text is available under the Creative Commons Attribution-ShareAlike License 4.0; additional terms may apply.