Natural Language Toolkit

Typical NLTK pipeline for information extraction. Source: Bird et al. 2019, ch. 7, fig. 7.1.

Natural Language Toolkit (NLTK) is a Python package to perform natural language processing (NLP). It was created mainly as a tool for learning NLP via a hands-on approach. It was not designed to be used in production.

The growth of unstructured data via social media, online reviews, blogs, and voice-based human-computer interaction is among the reasons why NLP became important in the late 2010s. NLTK is a useful toolkit for many of these NLP applications.

NLTK is composed of sub-packages and modules. A typical processing pipeline will call modules in sequence. Python data structures are passed from one module to another. Beyond the algorithms, NLTK gives quick access to many text corpora and datasets.
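As a minimal sketch of such a pipeline, the example below chains three NLTK components and notes the Python data structure handed from one stage to the next. The `TOY_LEXICON` dictionary and the `extract_noun_phrases` helper are hypothetical names introduced for illustration. The lexicon stands in for a trained tagger so that the sketch runs without downloading any model.

```python
from nltk.chunk import RegexpParser
from nltk.tag import DefaultTagger, UnigramTagger
from nltk.tokenize import wordpunct_tokenize

# Hypothetical mini-lexicon standing in for a trained POS tagger (assumption).
TOY_LEXICON = {
    "the": "DT", "a": "DT",
    "quick": "JJ", "lazy": "JJ",
    "jumps": "VBZ", "over": "IN", ".": ".",
}


def extract_noun_phrases(sentence):
    # Stage 1: str -> list of token strings (lowercased to match the toy lexicon).
    tokens = wordpunct_tokenize(sentence.lower())
    # Stage 2: list of tokens -> list of (token, tag) tuples.
    # Words missing from the lexicon fall back to the default tag "NN".
    tagger = UnigramTagger(model=TOY_LEXICON, backoff=DefaultTagger("NN"))
    tagged = tagger.tag(tokens)
    # Stage 3: tagged tokens -> nltk.Tree containing NP chunks.
    chunker = RegexpParser("NP: {<DT>?<JJ>*<NN>+}")
    tree = chunker.parse(tagged)
    # Collect the words of every NP subtree.
    return [
        " ".join(word for word, _tag in subtree.leaves())
        for subtree in tree.subtrees()
        if subtree.label() == "NP"
    ]


print(extract_noun_phrases("The quick fox jumps over the lazy dog."))
```

In real use, the lookup tagger would typically be replaced by a trained tagger such as the one behind `nltk.pos_tag`, which needs its model downloaded first.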

Discussion

  • Which are the fundamental NLP tasks that can be performed using NLTK?
    Pipeline for text classification. Source: Navlani 2018, fig. 1.

    NLTK can be used in a wide range of NLP applications. As a basic illustration, consider analyzing a paragraph using NLTK. The paragraph can be pre-processed using sentence segmentation, word tokenization, and removal of stopwords, punctuation and special symbols. After pre-processing, the text can be analyzed sentence by sentence using part-of-speech (POS) tagging to extract nouns and adjectives. Subsequent tasks can include named entity recognition (NER), coreference resolution, constituency parsing and dependency parsing. The goal is to find insights and context about the corpus.

    Further downstream tasks, more pertaining to application areas, could be emotion detection, sentiment analysis or text summarization. Tasks such as text classification and topic modeling typically require large amounts of text for better results.
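    The pre-processing steps described above can be sketched as follows. This is an illustration under stated assumptions, not a recommended recipe: `TOY_STOPWORDS` and `preprocess` are hypothetical names, the stopword set is a tiny hand-written stand-in for the lists in `nltk.corpus.stopwords`, and the sentence splitter is an untrained `PunktSentenceTokenizer` rather than a pre-trained model, so that nothing has to be downloaded.

```python
from nltk.tokenize import wordpunct_tokenize
from nltk.tokenize.punkt import PunktSentenceTokenizer

# Tiny illustrative stopword set (assumption); the stopwords corpus in
# nltk.corpus offers much fuller lists once it has been downloaded.
TOY_STOPWORDS = {"a", "an", "the", "is", "it", "for", "of", "and", "to"}


def preprocess(text, stopwords=TOY_STOPWORDS):
    # Sentence segmentation with an untrained Punkt tokenizer (default parameters).
    sentences = PunktSentenceTokenizer().tokenize(text)
    cleaned = []
    for sentence in sentences:
        # Word tokenization, then lowercasing.
        tokens = [token.lower() for token in wordpunct_tokenize(sentence)]
        # Drop punctuation, symbols, numbers and stopwords.
        cleaned.append(
            [token for token in tokens if token.isalpha() and token not in stopwords]
        )
    return cleaned


print(preprocess("NLTK is a toolkit. It is used for teaching!"))
```

    The result is one list of cleaned tokens per sentence, which can then be passed to a POS tagger or other downstream steps.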

  • Which are the modules available in NLTK?
    NLTK modules with functionalities. Source: Bird et al. 2019, ch. 0, table VIII.1.

    NLTK's architecture is modular. Functionality is organized into sub-packages and modules. NLTK is valued for the simplicity, consistency and extensibility of its modules and functions. The accompanying table lists the main modules and the functionality each one provides.

    A complete module index is available as part of NLTK documentation.

  • How is NLTK package split into sub-packages and modules?
    Illustrating the organization of 'text' sub-package and its modules. Source: Howard 2016, fig. 3.

    NLTK is divided into sub-packages and modules, each covering a different kind of text analysis. The figure depicts the text sub-package and the modules within it as an example. Each module fulfils a specific function.
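    One way to explore this layout on an installed copy of NLTK is to walk the package with the standard library's `pkgutil`. The helper names `list_subpackages` and `list_modules` below are hypothetical, and the exact names returned depend on the NLTK version installed.

```python
import importlib
import pkgutil

import nltk


def list_subpackages():
    # Names of the sub-packages found directly under the nltk package.
    return sorted(
        info.name for info in pkgutil.iter_modules(nltk.__path__) if info.ispkg
    )


def list_modules(subpackage_name):
    # Names of the modules inside one sub-package, e.g. "stem" or "tokenize".
    package = importlib.import_module(f"nltk.{subpackage_name}")
    return sorted(info.name for info in pkgutil.iter_modules(package.__path__))


print(list_subpackages())
print(list_modules("stem"))
```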

  • Which are the natural languages supported in NLTK?

    The languages supported by NLTK depend on the task being implemented. For stemming, we have RSLPStemmer (Portuguese), ISRIStemmer (Arabic), and SnowballStemmer (Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish).

    For sentence tokenization, PunktSentenceTokenizer is capable of multilingual processing.

    Stopwords are also available in multiple languages. After importing stopwords, we can obtain a list of languages by running print(stopwords.fileids()).

    Although most taggers in NLTK support only English, the nltk.tag.stanford module allows us to use StanfordPOSTagger, which has multilingual support. This is only an interface to the Stanford tagger, which must be installed separately on the machine.
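    As a rough sketch of how language support can be checked in code, the example below lists the languages known to `SnowballStemmer` and stems a few words. The helpers `stem_words` and `stopword_languages` are hypothetical names. The second helper returns an empty list when the stopwords corpus has not been downloaded, which is an assumption made so the sketch runs on a fresh installation.

```python
from nltk.stem import SnowballStemmer


def stem_words(words, language):
    # SnowballStemmer.languages holds the language names the stemmer accepts.
    if language not in SnowballStemmer.languages:
        raise ValueError(f"No Snowball stemmer for language: {language}")
    stemmer = SnowballStemmer(language)
    return [stemmer.stem(word) for word in words]


def stopword_languages():
    # The stopwords corpus must be downloaded first, e.g. nltk.download("stopwords").
    # Without it, NLTK raises LookupError; this sketch then returns an empty list.
    try:
        from nltk.corpus import stopwords
        return list(stopwords.fileids())
    except LookupError:
        return []


print(sorted(SnowballStemmer.languages))
print(stem_words(["running", "cats"], "english"))
print(stem_words(["corriendo", "gatos"], "spanish"))
print(stopword_languages())
```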

  • What datasets are available in NLTK for practice?
    NLTK downloader is a handy interface to manage packages and datasets. Source: Shetty 2018.

    The NLTK corpus collection is a natural home for all kinds of NLP datasets that can be used for practice or combined for building models. For example, to import the Inaugural Address corpus, the statement to execute is from nltk.corpus import inaugural.

    Out of dozens of corpora, some popular ones are Brown, Name Genders, Penn Treebank, and Inaugural Address. NLTK makes it easy to read a corpus via the package nltk.corpus. This package has a reader object for each corpus.
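    To illustrate what a reader object offers without downloading anything, the sketch below builds a one-file corpus in a temporary directory and reads it with `PlaintextCorpusReader`. The function name `read_toy_corpus` and the file `hello.txt` are hypothetical. Bundled corpora such as `inaugural` expose similar methods (`fileids()`, `words()`) once fetched with the NLTK downloader.

```python
import os
import tempfile

from nltk.corpus.reader import PlaintextCorpusReader


def read_toy_corpus():
    # Build a throwaway corpus directory containing a single text file.
    with tempfile.TemporaryDirectory() as root:
        path = os.path.join(root, "hello.txt")
        with open(path, "w", encoding="utf8") as handle:
            handle.write("NLTK makes corpora easy to read.")
        # The reader treats every file matching the pattern as a corpus file.
        reader = PlaintextCorpusReader(root, r".*\.txt")
        # Materialize the results before the temporary directory is removed.
        return reader.fileids(), list(reader.words("hello.txt"))


fileids, words = read_toy_corpus()
print(fileids)
print(words)
```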

  • What are the disadvantages or limitations of NLTK?

    It's been mentioned that NLTK is "a complicated solution with a harsh learning curve and a maze of internal limitations".

    For sentence tokenization, NLTK doesn't apply semantic analysis. Unlike Gensim, NLTK lacks neural network models or word embeddings.

    NLTK is slow, whereas spaCy is said to be the fastest alternative. In fact, since NLTK was created for educational purposes, optimized runtime performance was never a goal. However, it's possible to speed up execution using Python's multiprocessing module.

    Matthew Honnibal, the creator of spaCy, noted that NLTK has lots of modules but very few (tokenization, stemming, visualization) are actually useful. NLTK often provides wrappers to external libraries, and this leads to slow execution. The POS tagger was terrible until Honnibal's averaged perceptron tagger was merged into NLTK in September 2015.

    In general, NLP is evolving so fast that maintainers need to curate often and throw away old things.
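    The multiprocessing approach mentioned above might look like the following sketch, which spreads stemming over worker processes. The function names `stem_document` and `stem_corpus_parallel` are hypothetical, and the choice of two processes is arbitrary. Whether this actually helps depends on corpus size, since starting processes and moving data between them has its own cost.

```python
from multiprocessing import Pool

from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize

# Module-level stemmer so that each worker process has its own instance.
_STEMMER = PorterStemmer()


def stem_document(text):
    # Tokenize one document and stem its alphabetic tokens.
    return [_STEMMER.stem(token) for token in wordpunct_tokenize(text) if token.isalpha()]


def stem_corpus_parallel(documents, processes=2):
    # Each document is handled independently, so the work maps cleanly onto a pool.
    with Pool(processes=processes) as pool:
        return pool.map(stem_document, documents)


if __name__ == "__main__":
    docs = ["running dogs", "cats jumped"]
    print(stem_corpus_parallel(docs))
```

    The `if __name__ == "__main__":` guard keeps worker processes from re-running the pool setup on platforms that start workers by importing the main module.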

  • For beginners, what are some useful resources to learn NLTK?

    The official website includes documentation, Wiki, and index of all modules. There are Google Groups for users and developers.

    For basic usage of NLTK, you can read a tutorial by Bill Chambers. This also shows some text classification examples using Scikit-learn. Another basic tutorial from Harry Howard includes examples from Pattern library as well.

    Often specific processing is implemented in external libraries. Benjamin Bengfort shows in a blog post how to call CoreNLP from inside NLTK for syntactic parsing.

    There's a handy cheat sheet by murenei. Another one from 2017 is published at Northwestern University.

    A list of recommended NLTK books appears on BookAuthority. You can start by reading Natural Language Processing with Python (Bird et al. 2009). Those who wish to learn via videos can look up a playlist of 21 videos from sentdex.

Milestones

Jul 2001
NLTK's chart parsing tool is a useful visualization. Source: Loper and Bird 2002, fig. 1.

The first downloadable version of NLTK appears on SourceForge. Created at the University of Pennsylvania, the aim is to have a set of open source software, tutorials and problem sets to aid the teaching of computational linguistics. Before NLTK, a project might require students to learn multiple programming languages and toolkits. Lack of visualizations also made it difficult to have class demonstrations. NLTK is meant to solve these problems.

Jul 2005

NLTK-Lite 0.1 is released. Steven Bird, one of the creators of NLTK, explains that NLTK 1.4 introduced a dictionary-based architecture for storing tokens, which created overhead for programmers. With NLTK-Lite, programmers can use simpler data structures. For better performance, iterators are used instead of lists. Taggers use backoff by default. Method names are shorter. Regular releases follow until NLTK-Lite 0.9 in October 2007. NLTK-Lite eventually becomes NLTK.

Apr 2008

Two NLTK projects are accepted for Google Summer of Code: dependency parsing and natural language generation. The dependency parser becomes part of NLTK version 0.9.6 (December 2008).

Jun 2009

The book Natural Language Processing with Python by Bird et al. is published by O'Reilly Media. Since October 2013, the authors have released revised online versions of the book updated for Python 3 and NLTK 3.

Apr 2011

Version 2.0.1rc1 becomes the first release available via GitHub, although releases also continue on SourceForge until July 2014.

Jul 2013

Over a five-year period from January 2008 to July 2013, NLTK gets more than half a million downloads. This excludes downloads via GitHub.

Sep 2014

NLTK 3.0.0 is released, making this the first stable release supporting Python 3. The first alpha release supporting Python 3, version 3.0a0, can be traced to January 2013.

References

  1. ActiveWizards. 2018. "Comparison of Top 6 Python NLP Libraries." Blog, ActiveWizards, July. Accessed 2019-10-22.
  2. Bengfort, Benjamin. 2019. "Syntax Parsing with CoreNLP and NLTK." District Data Labs. Accessed 2019-10-27.
  3. Bird, Steven. 2005. "NLTK-Lite: Efficient Scripting for Natural Language Processing." Proceedings of the 4th International Conference on Natural Language Processing, pp. 11-18. Accessed 2019-10-27.
  4. Bird, Steven. 2019. "Whatever happened to NLTK-Lite?" Issue #2438, NLTK on GitHub, October 27. Accessed 2019-10-27.
  5. Bird, Steven, Ewan Klein, and Edward Loper. 2019. "Natural Language Processing with Python." Accessed 2019-10-27.
  6. BookAuthority. 2019. "10 Best Python NLTK Books of All Time." Accessed 2019-10-27.
  7. Chambers, Bill. 2015. "Python NLP - NLTK and scikit-learn." January 14. Accessed 2019-10-27.
  8. Chapagain, Mukesh. 2018. "Python NLTK: Stop Words [Natural Language Processing (NLP)]." Blog, February 19. Accessed 2019-10-27.
  9. Fedak, Vladimir. 2018. "5 Heroic Tools for Natural Language Processing." Towards Data Science, on Medium, January 30. Accessed 2019-10-27.
  10. Geitgey, Adam. 2018. "Natural Language Processing is Fun." Towards Data Science, on Medium, July 18. Accessed 2019-10-27.
  11. Honnibal, Matthew. 2015. "Dead Code Should Be Buried." Explosion Blog, September 04. Accessed 2019-10-27.
  12. Howard, Harry. 2016. "Basic natural language processing." Chapter 9 in: LING 3820, 6820: Natural Language Processing, Tulane University. Accessed 2019-10-27.
  13. Konrad, Marcus. 2017. "Speeding up NLTK with parallel processing." WZB Data Science Blog, June 19. Accessed 2019-10-27.
  14. Liyanapathirana, Lahiru. 2019. "NLP Chronicles: spaCy, the NLP Library Built for Production." Heartbeat, on Medium, June 05. Accessed 2019-10-27.
  15. Loper, Edward and Steven Bird. 2002. "NLTK: The Natural Language Toolkit." arXiv, v1, May 17. Accessed 2019-10-27.
  16. NLTK. 2019. "NLTK 3.4.5 documentation." August 20. Accessed 2019-10-27.
  17. NLTK. 2019b. "NLTK News." NLTK 3.4.5 documentation, August. Accessed 2019-10-27.
  18. NLTK. 2019c. "Corpus Readers." NLTK. Accessed 2019-10-27.
  19. NLTK API. 2019. "nltk.tag package." NLTK 3.4.5 documentation. Accessed 2019-10-27.
  20. NLTK API. 2019b. "nltk.tokenize package." NLTK 3.4.5 documentation. Accessed 2019-10-27.
  21. NLTK GitHub. 2019. "nltk/nltk." October 16. Accessed 2019-10-27.
  22. REF. 2014. "The Natural Language Toolkit (NLTK)." Impact case study (REF3b), Research Excellence Framework. Accessed 2019-10-27.
  23. Sarkar, Dipanjan. 2018. "A Practitioner's Guide to Natural Language Processing (Part I) — Processing & Understanding Text." Towards Data Science, on Medium, June 20. Accessed 2019-10-22.
  24. Shetty, Badreesh. 2018. "Natural Language Processing(NLP) for Machine Learning." Towards Data Science, on Medium, November 25. Accessed 2019-10-27.
  25. SourceForge. 2001. "Natural Language Toolkit." SourceForge, July 09. Updated 2014-07-22. Accessed 2019-10-27.
  26. sentdex. 2015. "Natural Language Processing With Python and NLTK." sentdex, on YouTube, May 01. Accessed 2019-10-27.
  27. text-processing.com. 2019. "Stemming and Lemmatization with Python NLTK." Accessed 2019-10-27.

Further Reading

  1. Howard, Harry. 2016. "Basic natural language processing." Chapter 9 in: LING 3820, 6820: Natural Language Processing, Tulane University. Accessed 2019-10-27.
  2. Loper, Edward and Steven Bird. 2002. "NLTK: The Natural Language Toolkit." arXiv, v1, May 17. Accessed 2019-10-27.
  3. Madnani, Nitin. 2007. "Getting started on natural language processing with Python." XRDS: Crossroads, The ACM Magazine for Students, vol. 13, no. 4, June. Accessed 2019-10-27.
  4. Bird, Steven, Ewan Klein, and Edward Loper. 2019. "Natural Language Processing with Python." Accessed 2019-10-27.

Cite As

Devopedia. 2019. "Natural Language Toolkit." Version 12, October 28. Accessed 2024-06-25. https://devopedia.org/natural-language-toolkit
Contributed by
2 authors


Last updated on
2019-10-28 02:19:44