Natural Language Toolkit
Natural Language Toolkit (NLTK) is a Python package to perform natural language processing (NLP). It was created mainly as a tool for learning NLP via a hands-on approach. It was not designed to be used in production.
The growth of unstructured data from social media, online reviews, blogs, and voice-based human-computer interaction is one reason why NLP became important in the late 2010s. NLTK is a useful toolkit for many of these NLP applications.
NLTK is composed of sub-packages and modules. A typical processing pipeline will call modules in sequence. Python data structures are passed from one module to another. Beyond the algorithms, NLTK gives quick access to many text corpora and datasets.
Discussion
-
Which are the fundamental NLP tasks that can be performed using NLTK? NLTK can be used in a wide range of NLP applications. For a basic understanding, consider analyzing a paragraph with NLTK. The paragraph can be pre-processed using sentence segmentation, word tokenization, and removal of stopwords, punctuation and special symbols. After pre-processing, the text can be analyzed sentence by sentence using part-of-speech (POS) tagging to extract nouns and adjectives. Subsequent tasks can include named entity recognition (NER), coreference resolution, constituency parsing and dependency parsing. The goal is to find insights and context about the corpus.
Further downstream tasks, more specific to application areas, could be emotion detection, sentiment analysis or text summarization. Tasks such as text classification and topic modeling typically require large amounts of text for better results.
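The pre-processing steps described above can be sketched as follows. This is a minimal illustration, not a recommended pipeline: it assumes an untrained PunktSentenceTokenizer (so no model download is needed), and the STOPWORDS set and the sample paragraph are made up for the example. The fuller stopword lists in nltk.corpus.stopwords and the POS tagger require data fetched with nltk.download() and are left out here.

```python
from nltk import FreqDist
from nltk.stem import PorterStemmer
from nltk.tokenize import PunktSentenceTokenizer, wordpunct_tokenize

# Tiny illustrative stopword list (an assumption for this sketch);
# nltk.corpus.stopwords offers fuller lists once downloaded.
STOPWORDS = {"a", "an", "and", "for", "is", "it", "the", "to", "was"}


def preprocess(paragraph):
    """Return one list of cleaned, lowercased tokens per sentence."""
    # An untrained Punkt tokenizer falls back to its default rules.
    sentences = PunktSentenceTokenizer().tokenize(paragraph)
    cleaned = []
    for sentence in sentences:
        tokens = wordpunct_tokenize(sentence.lower())
        # Drop punctuation, numbers and stopwords.
        cleaned.append([t for t in tokens if t.isalpha() and t not in STOPWORDS])
    return cleaned


text = "NLTK is a Python package. It was built for teaching and research."
sentences = preprocess(text)

# Flatten the sentences, stem each token and count the stems.
stemmer = PorterStemmer()
stems = [stemmer.stem(token) for sentence in sentences for token in sentence]
freq = FreqDist(stems)

print(sentences)
print(freq.most_common(3))
```

Each stage hands a plain Python list to the next one, so any step can be swapped out, for example replacing the stopword set or the stemmer, without touching the rest.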
-
Which are the modules available in NLTK? NLTK's architecture is modular. Functionality is organized into sub-packages and modules. NLTK is valued for the simplicity, consistency and extensibility of its modules and functions. This is better explained in the tabular list of modules.
A complete module index is available as part of NLTK documentation.
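As one illustration of how a single module is used, the sketch below takes RegexpParser from the nltk.chunk sub-package and extracts noun phrases from a sentence. The sentence is tagged by hand, which is an assumption made so that no tagger model has to be downloaded, and the one-rule grammar is only an example.

```python
from nltk.chunk import RegexpParser

# Hand-tagged tokens (an assumption for this sketch); in practice these
# would come from a POS tagger such as nltk.pos_tag.
tagged = [
    ("The", "DT"), ("quick", "JJ"), ("fox", "NN"), ("jumps", "VBZ"),
    ("over", "IN"), ("the", "DT"), ("lazy", "JJ"), ("dog", "NN"),
]

# A noun phrase: optional determiner, any adjectives, one or more nouns.
parser = RegexpParser("NP: {<DT>?<JJ>*<NN.*>+}")
tree = parser.parse(tagged)

# Collect the words under every NP subtree.
noun_phrases = [
    " ".join(word for word, tag in subtree.leaves())
    for subtree in tree.subtrees(filter=lambda t: t.label() == "NP")
]
print(noun_phrases)
```

The input is a list of (word, tag) tuples and the output is an nltk.tree.Tree, which shows how ordinary Python data structures move between modules.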
-
How is the NLTK package split into sub-packages and modules? NLTK is divided into sub-packages and modules, each supporting a different method of text analysis. The figure depicts an example of the text sub-package and the modules within it. Each module fulfils a specific function.
-
Which are the natural languages supported in NLTK? The languages supported by NLTK depend on the task being implemented. For stemming, we have RSLPStemmer (Portuguese), ISRIStemmer (Arabic), and SnowballStemmer (Danish, Dutch, English, Finnish, French, German, Hungarian, Italian, Norwegian, Portuguese, Romanian, Russian, Spanish, Swedish).
For sentence tokenization, PunktSentenceTokenizer is capable of multilingual processing.
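A short sketch of multilingual stemming with SnowballStemmer follows. The sample words are arbitrary choices for illustration. The set of languages reported by SnowballStemmer.languages depends on the installed NLTK version and may be longer than the list above.

```python
from nltk.stem import SnowballStemmer

# Languages known to the installed version of SnowballStemmer.
print(SnowballStemmer.languages)

# Arbitrary sample words, one per language (assumptions for this sketch).
samples = {"english": "running", "german": "laufen", "spanish": "corriendo"}

stems = {}
for language, word in samples.items():
    stemmer = SnowballStemmer(language)
    stems[language] = stemmer.stem(word)
print(stems)

# An unsupported language name raises ValueError.
try:
    SnowballStemmer("klingon")
    supported = True
except ValueError:
    supported = False
print(supported)
```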
Stopwords are also available in multiple languages. After importing stopwords, we can obtain a list of languages by running print(stopwords.fileids()).
Although most taggers in NLTK support only English, the nltk.tag.stanford module allows us to use StanfordPOSTagger, which has multilingual support. This is only an interface to the Stanford tagger, which must be running on the machine.
-
What datasets are available in NLTK for practice? The NLTK corpus collection is a natural home for all kinds of NLP datasets that can be used for practice or combined to build models. For example, to import the Inaugural Address corpus, the statement to execute is from nltk.corpus import inaugural.
Out of dozens of corpora, some popular ones are Brown, Name Genders, Penn Treebank, and Inaugural Address. NLTK makes it easy to read a corpus via the package nltk.corpus. This package has a reader object for each corpus.
-
What are the disadvantages or limitations of NLTK? It's been mentioned that NLTK is "a complicated solution with a harsh learning curve and a maze of internal limitations".
For sentence tokenization, NLTK doesn't apply semantic analysis. Unlike Gensim, NLTK lacks neural network models or word embeddings.
NLTK is slow, whereas spaCy is said to be the fastest alternative. In fact, since NLTK was created for educational purposes, optimized runtime performance was never a goal. However, it's possible to speed up execution using Python's multiprocessing module.
Matthew Honnibal, the creator of spaCy, noted that NLTK has lots of modules but very few (tokenization, stemming, visualization) are actually useful. Often NLTK has wrappers to external libraries and this leads to slow execution. The POS tagger was terrible, until Honnibal's averaged perceptron tagger was merged into NLTK in September 2015.
In general, NLP is evolving so fast that maintainers need to curate often and throw away old things.
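The speed-up with Python's multiprocessing module mentioned above can be sketched as follows. This is only an illustration under stated assumptions: the function names stem_document and stem_corpus, the worker count and the sample documents are invented for the example, and real gains depend on corpus size and on the cost of starting worker processes.

```python
from multiprocessing import Pool

from nltk.stem import PorterStemmer
from nltk.tokenize import wordpunct_tokenize


def stem_document(text):
    """Tokenize one document and stem each alphabetic token."""
    stemmer = PorterStemmer()
    tokens = wordpunct_tokenize(text.lower())
    return [stemmer.stem(token) for token in tokens if token.isalpha()]


def stem_corpus(documents, workers=2):
    """Stem documents in parallel, one document per task."""
    with Pool(processes=workers) as pool:
        return pool.map(stem_document, documents)


if __name__ == "__main__":
    # Hypothetical sample documents.
    docs = ["Cats running in the garden.", "Dogs barking loudly."]
    print(stem_corpus(docs))
```

Documents are independent of each other, so they can be split across processes without shared state. For a handful of short texts the process start-up cost will likely outweigh any benefit.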
-
For beginners, what are some useful resources to learn NLTK? The official website includes documentation, a wiki, and an index of all modules. There are Google Groups for users and developers.
For basic usage of NLTK, you can read a tutorial by Bill Chambers, which also shows some text classification examples using scikit-learn. Another basic tutorial from Harry Howard includes examples from the Pattern library as well.
Often specific processing is implemented in external libraries. Benjamin Bengfort shows in a blog post how to call CoreNLP from inside NLTK for syntactic parsing.
There's a handy cheat sheet by murenei. Another one from 2017 is published at Northwestern University.
A list of recommended NLTK books appears on BookAuthority. You can start by reading Natural Language Processing with Python (Bird et al. 2009). Those who wish to learn via videos can look up a playlist of 21 videos from sentdex.
Milestones
2001
The first downloadable version of NLTK appears on SourceForge. Created at the University of Pennsylvania, the aim is to have a set of open source software, tutorials and problem sets to aid the teaching of computational linguistics. Before NLTK, a project might require students to learn multiple programming languages and toolkits. Lack of visualizations also made it difficult to have class demonstrations. NLTK is meant to solve these problems.
2005
NLTK-Lite 0.1 is released. Steven Bird, one of the creators of NLTK, explains that NLTK 1.4 introduced an architecture based on Python dictionaries for storing tokens, which created overhead for programmers. With NLTK-Lite, programmers can use simpler data structures. For better performance, iterators are used instead of lists. Taggers use backoff by default. Method names are shorter. Regular releases follow until NLTK-Lite 0.9 in October 2007. NLTK-Lite eventually becomes NLTK.
2008
2009
2011
2013
References
- ActiveWizards. 2018. "Comparison of Top 6 Python NLP Libraries." Blog, ActiveWizards, July. Accessed 2019-10-22.
- Bengfort, Benjamin. 2019. "Syntax Parsing with CoreNLP and NLTK." District Data Labs. Accessed 2019-10-27.
- Bird, Steven. 2005. "NLTK-Lite: Efficient Scripting for Natural Language Processing." Proceedings of the 4th International Conference on Natural Language Processing, pp. 11-18. Accessed 2019-10-27.
- Bird, Steven. 2019. "Whatever happened to NLTK-Lite?" Issue #2438, NLTK on GitHub, October 27. Accessed 2019-10-27.
- Bird, Steven, Ewan Klein, and Edward Loper. 2019. "Natural Language Processing with Python." Accessed 2019-10-27.
- BookAuthority. 2019. "10 Best Python NLTK Books of All Time." Accessed 2019-10-27.
- Chambers, Bill. 2015. "Python NLP - NLTK and scikit-learn." January 14. Accessed 2019-10-27.
- Chapagain, Mukesh. 2018. "Python NLTK: Stop Words [Natural Language Processing (NLP)]." Blog, February 19. Accessed 2019-10-27.
- Fedak, Vladimir. 2018. "5 Heroic Tools for Natural Language Processing." Towards Data Science, on Medium, January 30. Accessed 2019-10-27.
- Geitgey, Adam. 2018. "Natural Language Processing is Fun." Towards Data Science, on Medium, July 18. Accessed 2019-10-27.
- Honnibal, Matthew. 2015. "Dead Code Should Be Buried." Explosion Blog, September 04. Accessed 2019-10-27.
- Howard, Harry. 2016. "Basic natural language processing." Chapter 9 in: LING 3820, 6820: Natural Language Processing, Tulane University. Accessed 2019-10-27.
- Konrad, Marcus. 2017. "Speeding up NLTK with parallel processing." WZB Data Science Blog, June 19. Accessed 2019-10-27.
- Liyanapathirana, Lahiru. 2019. "NLP Chronicles: spaCy, the NLP Library Built for Production." Heartbeat, on Medium, June 05. Accessed 2019-10-27.
- Loper, Edward and Steven Bird. 2002. "NLTK: The Natural Language Toolkit." arXiv, v1, May 17. Accessed 2019-10-27.
- NLTK. 2019. "NLTK 3.4.5 documentation." August 20. Accessed 2019-10-27.
- NLTK. 2019b. "NLTK News." NLTK 3.4.5 documentation, August. Accessed 2019-10-27.
- NLTK. 2019c. "Corpus Readers." NLTK. Accessed 2019-10-27.
- NLTK API. 2019. "nltk.tag package." NLTK 3.4.5 documentation. Accessed 2019-10-27.
- NLTK API. 2019b. "nltk.tokenize package." NLTK 3.4.5 documentation. Accessed 2019-10-27.
- NLTK GitHub. 2019. "nltk/nltk." October 16. Accessed 2019-10-27.
- REF. 2014. "The Natural Language Toolkit (NLTK)." Impact case study (REF3b), Research Excellence Framework. Accessed 2019-10-27.
- Sarkar, Dipanjan. 2018. "A Practitioner's Guide to Natural Language Processing (Part I) — Processing & Understanding Text." Towards Data Science, on Medium, June 20. Accessed 2019-10-22.
- Shetty, Badreesh. 2018. "Natural Language Processing(NLP) for Machine Learning." Towards Data Science, on Medium, November 25. Accessed 2019-10-27.
- SourceForge. 2001. "Natural Language Toolkit." SourceForge, July 09. Updated 2014-07-22. Accessed 2019-10-27.
- sentdex. 2015. "Natural Language Processing With Python and NLTK." sentdex, on YouTube, May 01. Accessed 2019-10-27.
- text-processing.com. 2019. "Stemming and Lemmatization with Python NLTK." Accessed 2019-10-27.
Further Reading
- Howard, Harry. 2016. "Basic natural language processing." Chapter 9 in: LING 3820, 6820: Natural Language Processing, Tulane University. Accessed 2019-10-27.
- Loper, Edward and Steven Bird. 2002. "NLTK: The Natural Language Toolkit." arXiv, v1, May 17. Accessed 2019-10-27.
- Madnani, Nitin. 2007. "Getting started on natural language processing with Python." XRDS: Crossroads, The ACM Magazine for Students, vol. 13, no. 4, June. Accessed 2019-10-27.
- Bird, Steven, Ewan Klein, and Edward Loper. 2019. "Natural Language Processing with Python." Accessed 2019-10-27.
See Also
- Natural Language Processing
- Text Corpus for NLP
- spaCy
- Gensim
- TextBlob
- Apache OpenNLP