• Example showing POS ambiguity. Source: Màrquez et al. 2000, table 1.
• Cutting et al. use HMM for both learning and decoding. Source: Cutting et al. 1992, fig. 1.
• Main tags (with examples) in the Penn TreeBank tagset. Source: Artzi 2017, slide 7.
• Architecture diagram of POS tagging. Source: Devopedia 2019.
• Most words in English are unambiguous but many common words are ambiguous. Source: Jurafsky and Martin 2009, fig. 5.10.
• Most probable tag sequence is Noun-Modal-Verb-Noun. Source: Lee 2019.
• Use of BLSTM for tagging learner English. Source: Nagata et al. 2018, fig. 1 & 2.

# Part-of-Speech Tagging

Created by arvindpdmn on 2019-08-31. Last updated by arvindpdmn on 2019-09-08.

## Summary

In the processing of natural languages, each word in a sentence is tagged with its part of speech. These tags then become useful for higher-level applications. Common parts of speech in English are noun, verb, adjective, adverb, etc.

The main problem with POS tagging is ambiguity. In English, many common words have multiple meanings and therefore multiple POS. The job of a POS tagger is to resolve this ambiguity accurately based on the context of use. For example, the word "shot" can be a noun or a verb. When used as a verb, it could be in past tense or past participle.

POS taggers started with a linguistic approach but later migrated towards a statistical approach. State-of-the-art models achieve accuracy better than 97%. POS tagging research done with English text corpora has been adapted to many other languages.

## Milestones

1963

Klein and Simmons describe a rule-based method with a focus on initial categorical tagging rather than part-of-speech disambiguation. They identify 30 categories and achieve 90% accuracy, which may be because fewer categories imply less ambiguity.

1964

W. Nelson Francis and Henry Kučera at the Department of Linguistics, Brown University, publish a computer-readable general corpus to aid linguistic research on modern English. The corpus has 1 million words (500 samples of about 2000 words each). Revised editions appear later in 1971 and 1979. Called Brown Corpus, it inspires many other text corpora. Brown Corpus is available online.

1971

Greene and Rubin develop the TAGGIT system to tag the Brown Corpus from a set of 86 tags. It uses rules for tagging and obtains 77% accuracy. Human experts then do post-editing.

1976

In one of the earliest departures from rule-based methods to statistical methods, Bahl and Mercer apply HMM to the problem of POS tagging and use the Viterbi algorithm for decoding. In 1992, a research team at Xerox led by Doug Cutting applies HMM in two ways: they apply the Baum-Welch algorithm to obtain the maximum likelihood estimate of the model parameters; next they apply the Viterbi algorithm to decode a sequence of tags given a sequence of words.

1983

The CLAWS algorithm uses co-locational probabilities, that is, the likelihood of co-occurrence of ordered pairs of tags. These probabilities are estimated from the tagged Brown Corpus. Steven J. DeRose improves on this work in 1988 with the VOLSUNGA algorithm, which is more efficient and achieves 96% accuracy. CLAWS and VOLSUNGA are n-gram taggers. By the late 1980s, statistical approaches become popular. Rather than build complex and brittle hand-coded rules, statistical models learn these rules from text corpora.

1992

Started in 1989 at the University of Pennsylvania, the Penn Treebank is released in 1992. It's an annotated text corpus of 4.5 million words of American English. The corpus is POS tagged. Over half of it is also annotated with syntactic structure. Treebank II is released in 1995. The original release had 48 tags. Treebank II merges some punctuation tags and results in a 45-tag tagset. Apart from POS tags, the corpus includes chunk tags, relation tags and anchor tags.

1992

At a time when stochastic taggers are performing better than rule-based taggers, Eric Brill proposes a rule-based tagger that performs as well as stochastic taggers. It works by assigning each word its most likely tag as estimated from a corpus, without any contextual information. It then improves on this estimate by applying patching rules, which are also learned from the corpus. One test shows a 5.1% error rate with only 71 patches. This method is later named Transformation-Based Learning (TBL).

1994

Helmut Schmid uses a Multi-Layer Perceptron (MLP) network, called Net-Tagger, to solve the POS tagging problem. He notes that neural networks were previously used for speech recognition. Though Nakamura et al. (1990) used a 4-layer feed-forward network for tag prediction, Net-Tagger is about tag disambiguation rather than prediction. An accuracy of 96.22% is achieved with a 2-layer model.

1996

The Maximum Entropy Model was previously used for problems such as language modelling and machine translation. A. Ratnaparkhi applies the model to POS tagging and achieves state-of-the-art word accuracy of 96.6%, and 85.6% on unknown words. The OpenNLP POS tagger uses such a model.

2011

Christopher Manning, NLP researcher at Stanford University, comments that POS tagging has reached 97.3% token accuracy and 56% sentence accuracy. Further gains in accuracy might be possible with improved descriptive linguistics. He also argues that accuracy of 97% claimed to be achieved by humans might be an overestimate. Thus, automatic taggers are already surpassing humans.

2018

Since 2015, many neural-network-based POS taggers have achieved better than 97% accuracy. In 2018, Meta-BiLSTM achieves an accuracy of 97.96%.

## Discussion

• Could you give an overview of POS tagging?

A POS tagger takes in a phrase or sentence and assigns the most probable part-of-speech tag to each word. In practice, input is often pre-processed. One common pre-processing task is to tokenize the input so that the tagger sees a sequence of words and punctuation. Other tasks such as stop word removal, punctuation removal and lemmatization may be done before tagging.

The set of predefined tags is called the tagset. This is essential information that the tagger must be given. Example tags are NNS for a plural noun, VBD for a past tense verb, or JJ for an adjective. A tagset can also include punctuation tags.

Rather than design our own tagset, the common practice is to use well-known tagsets: 87-tag Brown tagset, 45-tag Penn Treebank tagset, 61-tag C5 tagset, or 146-tag C7 tagset. In the architecture diagram, we have shown the 45-tag Penn Treebank tagset. Sketch Engine is a place to download tagsets.

• What's the relevance of POS tagging for NLP?

POS tagging is a basic task in NLP. It's an essential pre-processing task before doing syntactic parsing or semantic analysis. It benefits many NLP applications including information retrieval, information extraction, text-to-speech systems, corpus linguistics, named entity recognition, question answering, word sense disambiguation, and more.

If a POS tagger gives poor accuracy, this has an adverse effect on other tasks that follow. This is commonly called downstream error propagation. To improve accuracy, some researchers have proposed combining POS tagging with other processing. For example, joint POS tagging and dependency parsing is an approach to improve accuracy compared to independent modelling.

• What are the sources of information for performing POS tagging?

Sometimes a word on its own can give useful clues. For example, 'the' is a determiner. The prefix 'un-' suggests an adjective, as in 'unfathomable'. The suffix '-ly' suggests an adverb, as in 'importantly'. Capitalization can suggest a proper noun, such as 'Meridian'. Word shapes are also useful: '35-year' is an adjective.
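These word-internal clues can be turned into a small tag guesser. The sketch below is a set of illustrative heuristics in Python, not a real tagger; the specific rules and their ordering are assumptions for demonstration only:

```python
import re

def guess_tag(word):
    """Guess a Penn Treebank tag from word-internal clues alone.
    These heuristics are illustrative, not exhaustive."""
    if word.lower() == 'the':
        return 'DT'    # known determiner
    if re.fullmatch(r'\d+-\w+', word):
        return 'JJ'    # word shape like '35-year' suggests adjective
    if word[0].isupper():
        return 'NNP'   # capitalization suggests proper noun
    if word.endswith('ly'):
        return 'RB'    # '-ly' suffix suggests adverb
    if word.startswith('un') and len(word) > 4:
        return 'JJ'    # 'un-' prefix suggests adjective
    return 'NN'        # fall back to common noun

print(guess_tag('importantly'))  # RB
print(guess_tag('Meridian'))     # NNP
print(guess_tag('35-year'))      # JJ
```

A real tagger would combine such clues with contextual evidence rather than rely on them alone.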

A word can be tagged based on the neighbouring words and the possible tags that those words can have. Word probabilities also play a part in selecting the right tag to resolve ambiguity. For example, 'man' is rarely used as a verb and mostly used as a noun.

In a statistical approach, we can count tag frequencies of words in a tagged corpus and then assign the most probable tag. This is called unigram tagging. A much better approach is bigram tagging, which counts the frequency of a tag given a particular preceding tag. Thus, a tag is seen to depend on the previous tag. We can generalize this to n-gram tagging. In fact, it's common to model a whole sequence of words and estimate the corresponding sequence of tags. This is done with the Hidden Markov Model (HMM).
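The unigram and bigram counting above can be sketched in plain Python. The tiny tagged corpus here is made up for illustration:

```python
from collections import Counter, defaultdict

# Tiny hand-tagged corpus (hypothetical sentences) as (word, tag) pairs.
corpus = [
    [('the', 'DT'), ('old', 'JJ'), ('man', 'NN'), ('runs', 'VBZ')],
    [('the', 'DT'), ('man', 'NN'), ('saw', 'VBD'), ('the', 'DT'), ('race', 'NN')],
    [('soldiers', 'NNS'), ('man', 'VB'), ('the', 'DT'), ('boats', 'NNS')],
]

# Unigram tagging: most frequent tag per word, ignoring context.
unigram = defaultdict(Counter)
for sent in corpus:
    for word, tag in sent:
        unigram[word][tag] += 1

def unigram_tag(word):
    return unigram[word].most_common(1)[0][0]

# Bigram tagging: most frequent tag of a word given the previous tag.
bigram = defaultdict(Counter)
for sent in corpus:
    prev = '<s>'  # sentence-start marker
    for word, tag in sent:
        bigram[(prev, word)][tag] += 1
        prev = tag

print(unigram_tag('man'))                           # 'NN' (2 of 3 occurrences)
print(bigram[('NNS', 'man')].most_common(1)[0][0])  # 'VB': after a plural noun, 'man' is a verb
```

Unigram tagging always labels 'man' as a noun; the bigram counts recover its verb use after a plural noun.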

• Which are the main types of POS taggers?

We note the following types of POS taggers:

• Rule-Based: A dictionary is constructed with possible tags for each word. Rules guide the tagger to disambiguate. Rules are either hand-crafted, learned or both. An example rule might say, "If an ambiguous/unknown word X is preceded by a determiner and followed by a noun, tag it as an adjective."
• Statistical: A text corpus is used to derive useful probabilities. Given a sequence of words, the most probable sequence of tags is selected. These are also called stochastic or probabilistic taggers. Among the common models are n-gram model, Hidden Markov Model (HMM) and Maximum Entropy Model (MEM).
• Memory-Based: A set of cases is stored in memory, each case containing a word, its context and suitable tag. A new sentence is tagged based on the best match from cases stored in memory. It's a combination of rule-based and stochastic methods.
• Transformation-Based: Rules are automatically induced from data. Thus, it's a combination of rule-based and stochastic methods. Tagging is done using broad rules and then improved or transformed by applying narrower rules.
• Neural Net: RNN and Bidirectional LSTM are two examples of neural network architectures for POS tagging.

• In machine learning terminology, is POS tagging supervised or unsupervised?

POS taggers can be either supervised or unsupervised. Supervised taggers rely on a tagged corpus to create a dictionary, rules or tag sequence probabilities. They perform best when trained and applied on the same genre of text. Unsupervised taggers induce word groupings. This saves the effort of pre-tagging a corpus but word clusters are often coarse.

A combination of both approaches is also common. For example, rules are automatically induced from an untagged corpus. The output from this is corrected by humans and resubmitted to the tagger. The tagger looks at the corrections and adjusts the rules. Many iterations of this process may be necessary.

In 2016, it was noted that a completely unsupervised approach is not yet mature. Instead, weakly supervised approaches are adopted by aligning text, using translation probabilities (for machine translation) or transferring knowledge from resource-rich languages. Even a small amount of tagged corpus can be generalized to give better results.

• In HMM, given a word sequence, how do we determine the equivalent POS tag sequence?

We call this the decoding problem. We can observe the word sequence but the sequence of tags is hidden. We're required to find out the most probable tag sequence given the word sequence. In other words, we wish to maximize $$P(t^{n}|w^{n})$$ for an n-word sequence.

An important insight is that parts of speech (and not words) give language its structure. Thus, using Bayes' Rule, we recast the problem to the following form, $$P(t^{n}|w^{n})=P(w^{n}|t^{n})\,P(t^{n})/P(w^{n})$$. $$P(w^{n}|t^{n})$$ is called likelihood. $$P(t^{n})$$ is called prior probability.

Since we're maximizing over all tag sequences, the denominator can be ignored. We also make two assumptions: each word depends only on its own tag, and each tag depends only on its previous tag. We therefore need to maximize $$\prod_{i=1}^{n}P(w_i|t_i)\,P(t_i|t_{i-1})$$. In HMM, the terms are called emission probabilities and transition probabilities.

These probabilities are estimated from the tagged text corpus. The standard solution is to apply Viterbi algorithm, which is a form of dynamic programming. In the example figure, we see two non-zero paths and we select the more probable one.
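As a sketch, the Viterbi decoding described above can be implemented for a toy HMM. All transition and emission probabilities below are made-up numbers for the classic 'will race' ambiguity, not estimates from a real corpus:

```python
def viterbi(words, states, trans, emit):
    """Most probable tag path under the HMM assumptions: each word
    depends only on its tag, each tag only on the previous tag."""
    # v maps each tag to (best probability of a path ending in it, that path)
    v = {t: (trans.get(('<s>', t), 0.0) * emit.get((t, words[0]), 0.0), [t])
         for t in states}
    for w in words[1:]:
        v = {t: max(((p * trans.get((prev, t), 0.0) * emit.get((t, w), 0.0),
                      path + [t])
                     for prev, (p, path) in v.items()),
                    key=lambda x: x[0])
             for t in states}
    prob, path = max(v.values(), key=lambda x: x[0])
    return path, prob

states = ['MD', 'NN', 'VB']
trans = {('<s>', 'MD'): 0.3, ('<s>', 'NN'): 0.5, ('<s>', 'VB'): 0.2,
         ('MD', 'VB'): 0.8, ('MD', 'NN'): 0.2, ('NN', 'VB'): 0.3}
emit = {('MD', 'will'): 0.3, ('NN', 'will'): 0.01,
        ('VB', 'race'): 0.02, ('NN', 'race'): 0.01}

path, prob = viterbi(['will', 'race'], states, trans, emit)
print(path)  # ['MD', 'VB']: 'will' as modal, 'race' as verb
```

Even though 'will' and 'race' are individually ambiguous, the transition probabilities make the modal-verb reading win. A production implementation would work in log space to avoid underflow on long sentences.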

• What are some practical techniques for POS tagging?

In any supervised statistical approach, it's recommended to divide your corpus into a training set, a development set (for tuning parameters) and a testing set. An alternative is to use the entire corpus for training but do cross-validation. Moreover, if the corpus is too general the probabilities may not suit a particular domain; if it's too narrow, it may not generalize well across domains. To analyse where your model is failing, you can use a confusion matrix or contingency table.
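A confusion matrix for such error analysis takes only a few lines; the gold and predicted tag sequences below are hypothetical:

```python
from collections import Counter

# Hypothetical gold-standard vs. predicted tags for the same six tokens.
gold = ['DT', 'NN', 'VB', 'JJ', 'NN', 'VB']
pred = ['DT', 'NN', 'NN', 'JJ', 'NN', 'VB']

# Count each (gold, predicted) pair; off-diagonal cells are errors.
confusion = Counter(zip(gold, pred))
print(confusion[('VB', 'NN')])  # 1: one verb was mistagged as a noun

accuracy = sum(c for (g, p), c in confusion.items() if g == p) / len(gold)
print(accuracy)  # 5 of 6 tokens correct
```

Inspecting the largest off-diagonal counts quickly reveals which tag confusions (here, VB mistagged as NN) dominate the errors.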

When unknown words are seen, one approach is to assign a suffix and calculate the probability that the suffixed word with a particular tag occurs in a sequence. Another approach is to assign a set of default tags and calculate the probabilities. Or we could look at the word's internal structure, such as assigning NNS for words ending in 's'.

To deal with sparse data (probabilities are zero), there are smoothing techniques. A naïve technique is to add a small frequency count, say 1, to all counts. The Good-Turing method along with Katz's backoff is a better technique. Linear interpolation is another technique.
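As a minimal sketch of the naïve add-one technique, assuming made-up bigram counts and a four-tag tagset:

```python
from collections import Counter

# Observed tag-bigram counts (hypothetical); ('JJ', 'VB') was never seen.
counts = Counter({('DT', 'NN'): 8, ('DT', 'JJ'): 2, ('JJ', 'NN'): 4})
tagset = ['DT', 'JJ', 'NN', 'VB']

def mle(prev, tag):
    """Maximum likelihood estimate of P(tag | prev): zero for unseen pairs."""
    total = sum(c for (p, _), c in counts.items() if p == prev)
    return counts[(prev, tag)] / total if total else 0.0

def add_one(prev, tag):
    """Add-one (Laplace) smoothing: pretend every bigram was seen once more."""
    total = sum(c for (p, _), c in counts.items() if p == prev)
    return (counts[(prev, tag)] + 1) / (total + len(tagset))

print(mle('JJ', 'VB'))      # 0.0: a zero probability kills whole tag sequences
print(add_one('JJ', 'VB'))  # 0.125 = (0 + 1) / (4 + 4)
```

Add-one smoothing is crude because it shifts too much probability mass to unseen events, which is why Good-Turing with Katz's backoff or linear interpolation usually works better.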

• How have researchers adapted standard POS tagging methods for custom applications?

Learner English is English as a foreign language. Such text often contains spelling and orthographic errors. The use of neural networks, Bidirectional LSTM in particular, is found to give better accuracy than standard POS taggers. Word embeddings, character embeddings and native language vectors are used.

Historical English also presents tagging challenges due to differences in spelling, usage and vocabulary. A combination of spelling normalization and a domain adaptation method such as feature embedding gives better results. Other approaches to historical text include neural nets, conditional random fields and self-learning techniques.

Techniques have been invented to tag Twitter data that's often sparse and noisy. In mathematics, POS taggers have been adapted to handle formulae and extract key phrases in mathematical publications. For clinical text, a tagged corpus for that genre is used. However, it was found that it's better to share annotations across corpora than simply share a pretrained model.

• Could you describe some tools for doing POS tagging?

In Python, the nltk.tag package implements many types of taggers. pattern is a web mining module that includes the ability to do POS tagging. Unfortunately, it lacks Python 3 support. It's also available in R as pattern.nlp. TextBlob is inspired by both NLTK and Pattern. spaCy is another useful package.

Implemented in TensorFlow, SyntaxNet is based on neural networks. Parsey McParseface is a parser for English and gives good accuracy.

Parts-of-speech.info is an online tool for trying out POS tagging for any text input. Useful open source tools are Apache OpenNLP, Orange and UDPipe.

Samet Çetin shows how to implement your own custom tagger using a logistic regression model. There's also a commercial tool from Bitext.

Datacube at the Vienna University of Economics and Business is a place to download text corpora, and taggers (OpenNLP or Stanford) implemented in R. Stanford tagger is said to be slow. Treetagger is limited to non-commercial use. Another R package is RDRPOS tagger. A Java implementation of a log-linear tagger from Stanford is available.

## Sample Code

# Source: https://textblob.readthedocs.io/en/dev/quickstart.html#part-of-speech-tagging
# Accessed: 2019-09-06

>>> from textblob import TextBlob
>>> wiki = TextBlob("Python is a high-level, general-purpose programming language.")
>>> wiki.tags
[('Python', 'NNP'), ('is', 'VBZ'), ('a', 'DT'), ('high-level', 'JJ'),
('general-purpose', 'JJ'), ('programming', 'NN'), ('language', 'NN')]

# Source: https://www.nltk.org/api/nltk.tag.html
# Accessed: 2019-09-06

>>> from nltk.tag import DefaultTagger
>>> default_tagger = DefaultTagger('NN')
>>> list(default_tagger.tag('This is a test'.split()))
[('This', 'NN'), ('is', 'NN'), ('a', 'NN'), ('test', 'NN')]

# Source: https://www.clips.uantwerpen.be/pattern
# Accessed: 2019-09-06

>>> from pattern.en import parse
>>> s = 'The mobile web is more important than mobile apps.'
>>> parse(s, relations=True, lemmata=True)
u'The/DT/B-NP/O/NP-SBJ-1 mobile/JJ/I-NP/O/NP-SBJ-1 web/NN/I-NP/O/NP-SBJ-1
than/IN/B-PP/B-PNP/O mobile/JJ/B-NP/I-PNP/O apps/NN/I-NP/I-PNP/O ././O/O/O'

# Source: https://spacy.io/usage/linguistic-features#pos-tagging
# Accessed: 2019-09-06

import spacy

nlp = spacy.load('en_core_web_sm')

doc = nlp(u'Apple is looking at buying U.K. startup for $1 billion')

for token in doc:
    print(token.text, token.lemma_, token.pos_, token.tag_, token.dep_,
          token.shape_, token.is_alpha, token.is_stop)


## Cite As

Devopedia. 2019. "Part-of-Speech Tagging." Version 3, September 8. Accessed 2020-09-19. https://devopedia.org/part-of-speech-tagging