Part-of-Speech Tagging
- Summary
- Discussion
- Could you give an overview of POS tagging?
- What's the relevance of POS tagging for NLP?
- What are the sources of information for performing POS tagging?
- Which are the main types of POS taggers?
- In machine learning terminology, is POS tagging supervised or unsupervised?
- In HMM, given a word sequence, how do we determine the equivalent POS tag sequence?
- What are some practical techniques for POS tagging?
- How have researchers adapted standard POS tagging methods for custom applications?
- Could you describe some tools for doing POS tagging?
- Milestones
- Sample Code
- References
- Further Reading

In natural language processing, each word in a sentence is tagged with its part of speech. These tags then become useful for higher-level applications. Common parts of speech in English are noun, verb, adjective, adverb, etc.
The main problem with POS tagging is ambiguity. In English, many common words have multiple meanings and therefore multiple POS. The job of a POS tagger is to resolve this ambiguity accurately based on the context of use. For example, the word "shot" can be a noun or a verb. When used as a verb, it could be in past tense or past participle.
POS taggers started with a linguistic approach but later migrated towards a statistical approach. State-of-the-art models achieve accuracy better than 97%. POS tagging research done on English text corpora has been adapted to many other languages.
Discussion
Could you give an overview of POS tagging?
[Figure: Architecture diagram of POS tagging. Source: Devopedia 2019.]
A POS tagger takes in a phrase or sentence and assigns the most probable part-of-speech tag to each word. In practice, input is often pre-processed. One common pre-processing task is to tokenize the input so that the tagger sees a sequence of words and punctuation. Other tasks such as stop word removal, punctuation removal and lemmatization may be done before tagging.
The set of predefined tags is called the tagset. This is essential information that the tagger must be given. Example tags are NNS for a plural noun, VBD for a past tense verb, or JJ for an adjective. A tagset can also include punctuation tags.
Rather than design our own tagset, the common practice is to use well-known tagsets: 87-tag Brown tagset, 45-tag Penn Treebank tagset, 61-tag C5 tagset, or 146-tag C7 tagset. In the architecture diagram, we have shown the 45-tag Penn Treebank tagset. Sketch Engine is a place to download tagsets.
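As a minimal sketch of this pipeline, the snippet below tokenizes a sentence with NLTK and tags it with NLTK's default tagger, which emits Penn Treebank tags. It assumes the 'punkt' and 'averaged_perceptron_tagger' resources have been downloaded; the example sentence is our own.

```python
import nltk

# One-time downloads: tokenizer model and the default perceptron tagger.
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

sentence = "The hunter shot the bird, but his second shot missed."

# Pre-processing: tokenize into words and punctuation.
tokens = nltk.word_tokenize(sentence)

# Tag each token with a Penn Treebank tag (NN, VBD, JJ, ...).
print(nltk.pos_tag(tokens))
# 'shot' should appear once as a past-tense verb (VBD) and once as a noun (NN).
```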
What's the relevance of POS tagging for NLP?
POS tagging is a basic task in NLP. It's an essential pre-processing task before doing syntactic parsing or semantic analysis. It benefits many NLP applications including information retrieval, information extraction, text-to-speech systems, corpus linguistics, named entity recognition, question answering, word sense disambiguation, and more.
If a POS tagger gives poor accuracy, this has an adverse effect on other tasks that follow. This is commonly called downstream error propagation. To improve accuracy, some researchers have proposed combining POS tagging with other processing. For example, joint POS tagging and dependency parsing is an approach to improve accuracy compared to independent modelling.
What are the sources of information for performing POS tagging?
Sometimes a word on its own gives useful clues. For example, 'the' is a determiner. The prefix 'un-' suggests an adjective, as in 'unfathomable'. The suffix '-ly' suggests an adverb, as in 'importantly'. Capitalization can suggest a proper noun, as in 'Meridian'. Word shape is also useful: '35-year', for example, is an adjective.
A word can be tagged based on the neighbouring words and the possible tags that those words can have. Word probabilities also play a part in selecting the right tag to resolve ambiguity. For example, 'man' is rarely used as a verb and mostly used as a noun.
In a statistical approach, we can count tag frequencies of words in a tagged corpus and then assign the most probable tag. This is called unigram tagging. A much better approach is bigram tagging, which counts the tag frequency given a particular preceding tag. Thus, a tag is seen to depend on the previous tag. We can generalize this to n-gram tagging. In fact, it's common to model the whole sequence of words and estimate the sequence of tags. This is done with a Hidden Markov Model (HMM).
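The following sketch, using NLTK's taggers trained on its Penn Treebank sample, contrasts unigram and bigram tagging. Exact accuracy figures will vary, and in older NLTK releases the accuracy() method is called evaluate().

```python
import nltk
from nltk.corpus import treebank

nltk.download('treebank')

# Split the tagged corpus into training and test sentences.
tagged_sents = treebank.tagged_sents()
split = int(len(tagged_sents) * 0.9)
train_sents, test_sents = tagged_sents[:split], tagged_sents[split:]

# Unigram tagger: the most frequent tag for each word, ignoring context.
unigram = nltk.UnigramTagger(train_sents)

# Bigram tagger: conditions on the previous tag and the current word,
# backing off to the unigram tagger for unseen contexts.
bigram = nltk.BigramTagger(train_sents, backoff=unigram)

print("unigram accuracy:", unigram.accuracy(test_sents))
print("bigram accuracy:", bigram.accuracy(test_sents))
print(bigram.tag(["The", "old", "man", "the", "boat", "."]))
```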
Which are the main types of POS taggers?
We note the following types of POS taggers (a short code illustration follows the list):
- Rule-Based: A dictionary is constructed with possible tags for each word. Rules guide the tagger to disambiguate. Rules are either hand-crafted, learned or both. An example rule might say, "If an ambiguous/unknown word X is preceded by a determiner and followed by a noun, tag it as an adjective."
- Statistical: A text corpus is used to derive useful probabilities. Given a sequence of words, the most probable sequence of tags is selected. These are also called stochastic or probabilistic taggers. Among the common models are n-gram model, Hidden Markov Model (HMM) and Maximum Entropy Model (MEM).
- Memory-Based: A set of cases is stored in memory, each case containing a word, its context and a suitable tag. A new sentence is tagged based on the best match from cases stored in memory. It's a combination of the rule-based and stochastic methods.
- Transformation-Based: Rules are automatically induced from data. Thus, it's a combination of rule-based and stochastic methods. Tagging is done using broad rules and then improved or transformed by applying narrower rules.
- Neural Net: RNN and Bidirectional LSTM are two examples of neural network architectures for POS tagging.
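As a rough illustration of the rule-based and statistical styles, and of combining them, here is a hypothetical NLTK sketch: a few hand-written regular-expression rules serve as a rule-based tagger, and a unigram tagger trained on the Penn Treebank sample backs off to it for unknown words.

```python
import nltk
from nltk.corpus import treebank

nltk.download('treebank')
train_sents = treebank.tagged_sents()

# Rule-based: hand-crafted patterns over word forms.
rules = [
    (r'.*ing$', 'VBG'),                 # gerunds
    (r'.*ly$', 'RB'),                   # adverbs
    (r'.*ed$', 'VBD'),                  # past-tense verbs
    (r'^-?[0-9]+(\.[0-9]+)?$', 'CD'),   # numbers
    (r'.*', 'NN'),                      # default: noun
]
rule_tagger = nltk.RegexpTagger(rules)

# Statistical: unigram frequencies from a tagged corpus,
# backing off to the rule-based tagger for unknown words.
combined = nltk.UnigramTagger(train_sents, backoff=rule_tagger)

print(rule_tagger.tag(["Quickly", "running", "costs", "30", "dollars"]))
print(combined.tag(["Quickly", "running", "costs", "30", "dollars"]))
```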
In machine learning terminology, is POS tagging supervised or unsupervised?
POS taggers can be either supervised or unsupervised. Supervised taggers rely on a tagged corpus to create a dictionary, rules or tag sequence probabilities. They perform best when trained and applied on the same genre of text. Unsupervised taggers induce word groupings from an untagged corpus. This saves the effort of pre-tagging a corpus, but the induced word clusters are often coarse.
A combination of both approaches is also common. For example, rules are automatically induced from an untagged corpus. The output from this is corrected by humans and resubmitted to the tagger. The tagger looks at the corrections and adjusts the rules. Many iterations of this process may be necessary.
In 2016, it was noted that a completely unsupervised approach is not yet mature. Instead, weakly supervised approaches are adopted: aligning text, using translation probabilities (for machine translation) or transferring knowledge from resource-rich languages. Even a small tagged corpus can be generalized to give better results.
In HMM, given a word sequence, how do we determine the equivalent POS tag sequence?
We call this the decoding problem. We can observe the word sequence but the sequence of tags is hidden. We're required to find the most probable tag sequence given the word sequence. In other words, we wish to maximize \(P(t^{n}|w^{n})\) for an n-word sequence.
An important insight is that parts of speech (and not words) give language its structure. Thus, using Bayes' Rule, we recast the problem to the following form: \(P(t^{n}|w^{n})=P(w^{n}|t^{n})\,P(t^{n})/P(w^{n})\). \(P(w^{n}|t^{n})\) is called the likelihood and \(P(t^{n})\) is called the prior probability.
Since we're maximizing over all tag sequences, the denominator can be ignored. We also make two assumptions: each word depends only on its own tag, and each tag depends only on its previous tag. We therefore need to maximize \(\prod_{i=1}^{n}P(w_i|t_i)\,P(t_i|t_{i-1})\). In HMM terminology, \(P(w_i|t_i)\) are the emission probabilities and \(P(t_i|t_{i-1})\) are the transition probabilities.
These probabilities are estimated from the tagged text corpus. The standard solution is to apply the Viterbi algorithm, a form of dynamic programming that keeps only the best-scoring path to each tag at every position, and finally selects the most probable complete path.
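A bare-bones Viterbi decoder is sketched below over a made-up three-tag HMM; the tagset, transition and emission probabilities are purely illustrative, not estimated from any corpus.

```python
import math

# Toy tagset and hand-picked probabilities (illustrative only).
tags = ['DT', 'NN', 'VB']
trans = {  # P(tag | previous tag), with '<s>' as the start state
    ('<s>', 'DT'): 0.6, ('<s>', 'NN'): 0.3, ('<s>', 'VB'): 0.1,
    ('DT', 'NN'): 0.9, ('DT', 'VB'): 0.05, ('DT', 'DT'): 0.05,
    ('NN', 'VB'): 0.5, ('NN', 'NN'): 0.3, ('NN', 'DT'): 0.2,
    ('VB', 'DT'): 0.6, ('VB', 'NN'): 0.3, ('VB', 'VB'): 0.1,
}
emit = {   # P(word | tag)
    ('DT', 'the'): 0.7, ('NN', 'dog'): 0.4, ('VB', 'dog'): 0.01,
    ('NN', 'barks'): 0.02, ('VB', 'barks'): 0.3,
}

def viterbi(words):
    # best[i][t] = (log probability of the best path ending in tag t at
    # position i, backpointer to the previous tag on that path)
    best = [{}]
    for t in tags:
        p = trans.get(('<s>', t), 1e-12) * emit.get((t, words[0]), 1e-12)
        best[0][t] = (math.log(p), None)
    for i in range(1, len(words)):
        best.append({})
        for t in tags:
            e = emit.get((t, words[i]), 1e-12)
            score, prev = max(
                (best[i-1][pt][0] + math.log(trans.get((pt, t), 1e-12) * e), pt)
                for pt in tags)
            best[i][t] = (score, prev)
    # Trace back the most probable tag sequence from the best final tag.
    path = [max(tags, key=lambda t: best[-1][t][0])]
    for i in range(len(words) - 1, 0, -1):
        path.append(best[i][path[-1]][1])
    return list(reversed(path))

print(viterbi(['the', 'dog', 'barks']))   # expected: ['DT', 'NN', 'VB']
```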
What are some practical techniques for POS tagging?
In any supervised statistical approach, it's recommended to divide your corpus into a training set, a development set (for tuning parameters) and a test set. An alternative is to use the entire corpus for training but do cross-validation. Moreover, if the corpus is too general, the probabilities may not suit a particular domain; if it's too narrow, the model may not generalize across domains. To analyse where your model is failing, you can use a confusion matrix or contingency table.
When unknown words are seen, one approach is to look at the word's suffix and estimate the probability of each tag given that suffix. Another approach is to assign a set of default tags and calculate the probabilities. Or we could look at the word's internal structure, such as assigning NNS to words ending in 's'.
To deal with sparse data (zero probabilities), there are smoothing techniques. A naïve technique is to add a small count, say 1, to all frequency counts. The Good-Turing method along with Katz's backoff is a better technique. Linear interpolation is another technique.
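To make the smoothing idea concrete, here is a hypothetical sketch of add-one (Laplace) smoothing and simple linear interpolation for tag-bigram probabilities; the tiny tag stream and the lambda weight are made up, and in practice the weight would be tuned on the development set.

```python
from collections import Counter

# Toy tag sequence standing in for counts from a tagged corpus.
tag_stream = ['DT', 'NN', 'VB', 'DT', 'NN', 'NN', 'VB', 'DT', 'JJ', 'NN']
tagset = sorted(set(tag_stream))

unigrams = Counter(tag_stream)
bigrams = Counter(zip(tag_stream, tag_stream[1:]))
total = len(tag_stream)

def p_bigram_mle(prev, tag):
    # Maximum likelihood estimate: zero for unseen bigrams (sparse data problem).
    return bigrams[(prev, tag)] / unigrams[prev] if unigrams[prev] else 0.0

def p_bigram_laplace(prev, tag):
    # Add-one smoothing: every possible bigram gets a pseudo-count of 1.
    return (bigrams[(prev, tag)] + 1) / (unigrams[prev] + len(tagset))

def p_bigram_interpolated(prev, tag, lam=0.7):
    # Linear interpolation of bigram and unigram estimates.
    return lam * p_bigram_mle(prev, tag) + (1 - lam) * unigrams[tag] / total

print(p_bigram_mle('JJ', 'VB'))           # 0.0 -- unseen bigram
print(p_bigram_laplace('JJ', 'VB'))       # non-zero after add-one smoothing
print(p_bigram_interpolated('JJ', 'VB'))  # non-zero via the unigram term
```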
How have researchers adapted standard POS tagging methods for custom applications?
Learner English is English written by learners of the language. Such text often contains spelling and orthographic errors. The use of neural networks, Bidirectional LSTM in particular, is found to give better accuracy than standard POS taggers. Word embeddings, character embeddings and native language vectors are used.
Historical English also presents tagging challenges due to differences in spelling, usage and vocabulary. A combination of spelling normalization and a domain adaptation method such as feature embedding gives better results. Other approaches to historical text include neural networks, conditional random fields and self-learning techniques.
Techniques have also been invented to tag Twitter data, which is often sparse and noisy. In mathematics, POS taggers have been adapted to handle formulae and extract key phrases from mathematical publications. For clinical text, a tagged corpus of that genre is used. However, it was found that it's better to share annotations across corpora than to simply share a pretrained model.
Could you describe some tools for doing POS tagging?
In Python, the nltk.tag package implements many types of taggers. Pattern is a web mining module that includes the ability to do POS tagging; unfortunately, it lacks Python 3 support. It's also available in R as pattern.nlp. TextBlob is inspired by both NLTK and Pattern. spaCy is another useful package.
Implemented in TensorFlow, SyntaxNet is based on neural networks. Parsey McParseface, its pre-trained model for English, gives good accuracy.
Parts-of-speech.info is an online tool for trying out POS tagging for any text input. Useful open source tools are Apache OpenNLP, Orange and UDPipe.
Samet Çetin shows how to implement your own custom tagger using a logistic regression model. There's also a commercial tool from Bitext.
Datacube at the Vienna University of Economics and Business is a place to download text corpora and taggers (OpenNLP or Stanford) implemented in R. The Stanford tagger is said to be slow. TreeTagger is limited to non-commercial use. Another R package is RDRPOSTagger. A Java implementation of a log-linear tagger from Stanford is available.
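As a quick illustration of one of these tools, the snippet below tags a sentence with spaCy; it assumes spaCy is installed and the small English model has been fetched with `python -m spacy download en_core_web_sm`.

```python
import spacy

# Load the small English pipeline (tokenizer, tagger, parser, NER).
nlp = spacy.load("en_core_web_sm")

doc = nlp("The quick brown fox jumps over the lazy dog.")
for token in doc:
    # token.pos_ is the coarse universal POS tag;
    # token.tag_ is the fine-grained Penn Treebank-style tag.
    print(f"{token.text:10} {token.pos_:6} {token.tag_}")
```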
Milestones
W. Nelson Francis and Henry Kučera at the Department of Linguistics, Brown University, publish a computer-readable general corpus to aid linguistic research on modern English. The corpus has 1 million words (500 samples of about 2000 words each). Revised editions appear later in 1971 and 1979. Called the Brown Corpus, it inspires many other text corpora and is available online.

In one of the earliest departures from the rule-based method to the statistical method, Bahl and Mercer apply HMM to the problem of POS tagging and use the Viterbi algorithm for decoding. In 1992, a research team at Xerox led by Doug Cutting applies HMM in two ways: it applies the Baum-Welch algorithm to obtain the maximum likelihood estimate of the model parameters; it then applies the Viterbi algorithm to decode a sequence of tags given a sequence of words.
The CLAWS algorithm uses co-locational probabilities, that is, the likelihood of co-occurrence of ordered pairs of tags. These probabilities are estimated from the tagged Brown Corpus. Steven J. DeRose improves on this work in 1988 with the VOLSUNGA algorithm, which is more efficient and achieves 96% accuracy. CLAWS and VOLSUNGA are n-gram taggers. By the late 1980s, statistical approaches become popular. Rather than build complex and brittle hand-coded rules, statistical models learn these rules from text corpora.

Started in 1989 at the University of Pennsylvania, the Penn Treebank is released in 1992. It's an annotated text corpus of 4.5 million words of American English. The corpus is POS tagged. Over half of it is also annotated with syntactic structure. Treebank II is released in 1995. The original release had 48 tags. Treebank II merges some punctuation tags and results in a 45-tag tagset. Apart from POS tags, the corpus includes chunk tags, relation tags and anchor tags.
At a time when stochastic taggers are performing better than rule-based taggers, Eric Brill proposes a rule-based tagger that performs as well as stochastic taggers. It works by first assigning each word its most likely tag as estimated from a corpus, without any contextual information. It then improves on this by applying patching rules, which are also learned from the corpus. One test shows a 5.1% error rate with only 71 patches. This method is later named Transformation-Based Learning (TBL).
Helmut Schmid uses a Multi-Layer Perceptron (MLP) network, called Net-Tagger, to solve the POS tagging problem. He notes that neural networks were used previously for speech recognition. Though Nakamura et al. (1990) used a 4-layer feed-forward network for tag prediction, Net-Tagger is about tag disambiguation rather than prediction. Accuracy of 96.22% is achieved with a 2-layer model.
Christopher Manning, NLP researcher at Stanford University, comments that POS tagging has reached 97.3% token accuracy and 56% sentence accuracy. Further gains in accuracy might be possible with improved descriptive linguistics. He also argues that accuracy of 97% claimed to be achieved by humans might be an overestimate. Thus, automatic taggers are already surpassing humans.
Sample Code
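A minimal, self-contained sample using TextBlob, one of the Python packages mentioned in the tools discussion; it assumes TextBlob is installed and its corpora have been fetched (for example via `python -m textblob.download_corpora`), and the example sentence is our own.

```python
from textblob import TextBlob

# TextBlob builds on NLTK and Pattern and exposes tagging via the .tags property.
text = "He shot the ball after his first shot was blocked."

for word, tag in TextBlob(text).tags:
    print(f"{word:10} {tag}")
# 'shot' should be tagged once as a verb (VBD) and once as a noun (NN),
# showing how context resolves POS ambiguity.
```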
References
- Artzi, Yoav. 2017. "Sequence Prediction and Part-of-speech Tagging." CS5740: Natural Language Processing, Cornell University. Accessed 2019-09-06.
- Brill, Eric. 1992. "A simple rule-based part of speech tagger." Proceedings of the third conference on applied natural language processing, ACL, Trento, Italy, March 31-April 03, pp. 152-155. Accessed 2019-09-06.
- CLiPS. 2018a. "Pattern." CLiPS Research Center, June 22. Accessed 2019-08-31.
- CLiPS. 2018b. "Penn Treebank II tag set." CLiPS Research Center, June 22. Accessed 2019-09-07.
- Cutting, Doug, Julian Kupiec, Jan Pedersen, and Penelope Sibun. 1992. "A practical part-of-speech tagger." Proceedings of the third conference on applied natural language processing, ACL, Trento, Italy, March 31-April 03, pp. 133-140. Accessed 2019-09-06.
- Daelemans, Walter, Jakub Zavrel, Peter Berck, and Steven Gillis. 1996. "MBT: A Memory-Based Part of Speech Tagger-Generator." Fourth Workshop on Very Large Corpora, ACL Anthology. Accessed 2019-09-06.
- DeRose, Steven J. 1988. "Grammatical category disambiguation by statistical optimization." J. Computational Linguistics, vol. 14, no. 1, Pages 31-39, MIT Press Cambridge. Accessed 2019-09-06.
- Derczynski, Leon, Alan Ritter, Sam Clark, and Kalina Bontcheva. 2013. "Twitter Part-of-Speech Tagging for All: Overcoming Sparse and Noisy Data." Proceedings of Recent Advances in Natural Language Processing, Hissar, Bulgaria, September 7-13, pp. 198–206. Accessed 2019-09-06.
- Fan, Jung-wei, Rashmi Prasad, Rommel M. Yabut, Richard M. Loomis, Daniel S. Zisook, John E. Mattison, and Yang Huang. 2011. "Part-of-speech tagging for clinical text: wall or bridge between institutions?" AMIA Annu Symp Proc. 2011, pp. 382–391. Accessed 2019-09-06.
- Ferraro, Jeffrey P, Hal Daumé, III, Scott L DuVall, Wendy W Chapman, Henk Harkema, and Peter J Haug. 2013. "Improving performance of natural language processing part-of-speech tagging on clinical narratives through domain adaptation." Journal of the American Medical Informatics Association, vol. 20, no. 5, September, pp. 931–939. Accessed 2019-09-06.
- Geitgey, Adam. 2018. "Natural Language Processing is Fun!" Medium, July 18. Accessed 2019-09-06.
- Jones, M. Tim. 2017. "Speaking out loud: An introduction to natural language processing." IBM Developer, June 13. Accessed 2019-09-06.
- Jurafsky, Daniel and James H. Martin. 2009. "Part-of-Speech Tagging." Chapter 5 in Speech and Language Processing, Second Edition, Prentice-Hall, Inc. Accessed 2019-09-07.
- Kauhanen, Henri. 2011. "The Standard Corpus of Present-Day Edited American English (the Brown Corpus)." VARIENG, University of Helsinki, March 20. Accessed 2019-09-06.
- Lee, Seungjae Ryan. 2019. "AIND: 20. Hidden Markov Models." endtoendAI. Accessed 2019-09-06.
- Manning, Christopher D. 2011. "Part-of-speech tagging from 97% to 100%: is it time for some linguistics?" Proceedings of the 12th international conference on Computational linguistics and intelligent text processing - Volume Part I, pp. 171-189, Tokyo, Japan, Springer-Verlag Berlin, February 20-26. Accessed 2019-08-31.
- Marcus, Mitch. 2011. "A Brief History of the Penn Treebank." Center for Language and Speech Processing, Johns Hopkins University, February 15. Accessed 2019-09-06.
- Marcus, Mitchell P., Mary Ann Marcinkiewicz, and Beatrice Santorini. 1993. "Building a Large Annotated Corpus of English: The Penn Treebank." Journal Computational Linguistics - Special issue on using large corpora: II, vol. 19, no. 2, pp. 313-330, MIT Press Cambridge, MA. Accessed 2019-09-06.
- Màrquez, Lluís, Lluís Padró, and Horacio Rodríguez. 2000. "A Machine Learning Approach to POS Tagging." Machine Learning, vol. 39, no. 1, pp. 59-91, Kluwer Academic Publishers, April. Accessed 2019-08-31.
- Nagata, Ryo, Tomoya Mizumoto, Yuta Kikuchi, Yoshifumi Kawasaki, and Kotaro Funakoshi. 2018. "A POS Tagging Model Designed for Learner English." Proceedings of the 2018 EMNLP Workshop W-NUT: The 4th Workshop on Noisy User-generated Text, Association for Computational Linguistics, pp. 39–48, Brussels, Belgium, November 01. Accessed 2019-09-06.
- Naseem, Tahira, Benjamin Snyder, Jacob Eisenstein, and Regina Barzilay. 2009. "Multilingual Part-of-Speech Tagging: Two Unsupervised Approaches." Journal of Artificial Intelligence Research, vol. 36, pp. 1-45, AI Access Foundation. Accessed 2019-08-31.
- Neves, Mariana. 2015. "Part-of-speech tagging and named-entity recognition." Natural Language Processing, Hasso Plattner Institute, May 11. Accessed 2019-09-06.
- Nguyen, Dat Quoc and Karin Verspoor. 2018. "An improved neural network model for joint POS tagging and dependency parsing." Proceedings of the CoNLL 2018 Shared Task: Multilingual Parsing from Raw Text to Universal Dependencies, pp. 81–91, Brussels, Belgium, Association for Computational Linguistics, October 31 – November 1. Accessed 2019-08-31.
- Petrov, Slav. 2016. "Announcing SyntaxNet: The World’s Most Accurate Parser Goes Open Source." Google AI Blog, May 12. Accessed 2019-09-06.
- R-bloggers. 2017. "Natural Language Processing on 40 languages with the Ripple Down Rules-based Part-Of-Speech Tagger." R-bloggers, March 30. Accessed 2019-08-31.
- Ratnaparkhi, Adwait. 1996. "A Maximum Entropy Model for Part-Of-Speech Tagging." Proceedings of the Conference on Empirical Methods in Natural Language Processing, Association for Computational Linguistics, pp. 133-142, May. Accessed 2019-09-06.
- Ruder, Sebastian. 2019. "Part-of-speech tagging." NLP-progress, via GitHub, August 30. Accessed 2019-08-31.
- Schmid, Helmut. 1994. "Part-of-speech tagging with neural networks." Proceedings of the 15th conference on Computational linguistics, Kyoto, Japan, ACL, vol. 1, pp. 172-176, August 5-9. Accessed 2019-09-08.
- Schulz, Sarah, and Jonas Kuhn. 2016. "Learning from Within? Comparing PoS Tagging Approaches for Historical Text." Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC 2016), European Language Resources Association (ELRA), pp. 4316-4322, May. Accessed 2019-09-06.
- Schöneberg, Ulf, and Wolfram Sperber. 2014. "POS Tagging and its Applications for Mathematics." arXiv, June 11. Accessed 2019-08-31.
- Shanmugamani, Rajalingappaa, and Rajesh Arumugam. 2018. "Applications of POS tagging." Hands-On Natural Language Processing with Python, Packt Publishing, July. Accessed 2019-08-31.
- TextBlob. 2018. "TextBlob: Simplified Text Processing." Release v0.15.2, November 21. Accessed 2019-09-06.
- Titov, Ivan. 2015. "Lecture 4: Smoothing, Part-of-Speech Tagging." Natural Language Models and Interfaces, Universiteit van Amsterdam. Accessed 2019-09-08.
- Webber, Bonnie. 2007. "Part of Speech Tagging." Informatics 2A: Lecture 13, University of Edinburgh, October 16. Accessed 2019-09-06.
- Yang, Yi and Jacob Eisenstein. 2016. "Part-of-Speech Tagging for Historical English." Proceedings of NAACL-HLT 2016, San Diego, California, Association for Computational Linguistic, pp. 1318–1328, June 12-17. Accessed 2019-09-06.
- van Guilder, Linda. 1995. "Automated Part of Speech Tagging: A Brief Overview." LING361, Georgetown University. Accessed 2019-09-06.
- Çetin, Samet. 2018. "Part-Of-Speech (POS) Tagging." Medium, July 28. Accessed 2019-09-06.
Further Reading
- Malhotra, Sachin and Divya Godayal. 2018. "An introduction to part-of-speech tagging and the Hidden Markov Model." freeCodeCamp, June 8. Accessed 2019-09-08.
- Artzi, Yoav. 2017. "Sequence Prediction and Part-of-speech Tagging." CS5740: Natural Language Processing, Cornell University. Accessed 2019-09-06.
- Jurafsky, Daniel and James H. Martin. 2009. "Part-of-Speech Tagging." Chapter 5 in Speech and Language Processing, Second Edition, Prentice-Hall, Inc. Accessed 2019-09-07.
- Bird, Steven, Ewan Klein and Edward Loper. 2019. "Categorizing and Tagging Words." Chapter 5 in Natural Language Processing with Python, Version 3.0. Accessed 2019-09-06.
- Santorini, Beatrice. 1990. "Part-of-Speech Tagging Guidelines for the Penn Treebank Project." 3rd Revision, MS-CIS-90-47, LINC LAB 178, University of Pennsylvania, July. Accessed 2019-09-08.
- Honnibal, Matthew. 2013. "A Good Part-of-Speech Tagger in about 200 Lines of Python." Blog, Explosion, September 18. Accessed 2019-09-06.
See Also
- Multilingual POS Tagging
- Hidden Markov Model
- Dependency Parsing
- Named Entity Recognition
- Word Sense Disambiguation
- Text Corpus for NLP