Text Normalization

Text normalization reduces variations in word forms to a common form when the variations mean the same thing. For example, US and U.S.A become USA; Product, product and products become product; naïve becomes naive; $400 becomes 400 dollars; +7 (800) 123 1231 becomes 0078001231231; 25 June 2015 and 25/6/15 become 2015-06-25; and so on.

Before text data is used in training NLP models, it's pre-processed to a suitable form. Text normalization is often an essential step in text pre-processing. Text normalization simplifies the modelling process and can improve the model's performance.

There's no fixed set of tasks that are part of text normalization. Tasks depend on application requirements. Text normalization started with text-to-speech systems and later became important for processing social media text.

Discussion

What are the typical tasks within text normalization?
We can identify the following tasks for normalizing text:
- Tokenization: Text is normally broken up into tokens. A token is usually a single word but there are exceptions, such as New York.
- Lemmatization: Reduce surface forms to their root form. For example, sang, sung and sings have a common root 'sing'.
- Stemming: Strip suffixes. For example, trouble, troubled and troubles are stemmed to 'troubl'. This is a simpler and faster alternative to lemmatization.
- Sentence Segmentation: Break up text into sentences using characters ., !, or ?.
- Phonetic Normalization: Words spelled differently could sound the same. Likewise, variations in pronunciation would need to be normalized to the same token.
- Spelling Correction: In some applications such as IR, it's useful to correct spelling errors. For example, 'infromation' is normalized to 'information'.
- Non-Standard Words: This includes phone numbers, dates, currencies, addresses, acronyms, etc.
- Others: Normalization may involve accents (naïve, naive), UK/US spelling (catalogue, catalog), and capital letters (Product, product).
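A few of the simpler tasks above can be sketched with the Python standard library alone. The function names below are illustrative and not taken from any particular library, and the acronym rule is a simplifying assumption. Note that case folding yields 'usa' rather than 'USA'; which normal form to use is an application choice.

```python
import re
import unicodedata


def strip_accents(text: str) -> str:
    """Remove combining marks after canonical decomposition (naïve -> naive)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def normalize_token(token: str) -> str:
    """Case-fold, strip accents and drop the periods of dotted acronyms."""
    token = strip_accents(token).casefold()
    # Assumption: two or more "letter + period" groups form an acronym.
    if re.fullmatch(r"(?:[a-z]\.){2,}", token):
        token = token.replace(".", "")
    return token


def split_sentences(text: str) -> list[str]:
    """Naive sentence segmentation on ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]
```

With this sketch, normalize_token("U.S.A.") and normalize_token("Product") give 'usa' and 'product'. The same acronym rule also turns 'C.A.T.' into 'cat', which shows how an overly eager rule can merge unrelated terms.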
What are some NLP applications that benefit from text normalization?
Information Retrieval (IR) is a typical example. If the search query is 'U.S.A.', we may want to return results for 'U.S.A.' and 'USA'. One way to do this is via query expansion in which both forms are searched. A more efficient approach is to normalize to 'USA', store all documents with this normalized form and search only for 'USA'. Wrong normalization can produce irrelevant results, such as 'C.A.T.' normalized to 'cat'.
Conversational AI involves both Automatic Speech Recognition (ASR) and Text-to-Speech (TTS) synthesis. For example, when a user says "five p m", ASR should interpret this as "5:00PM". This is called inverse text normalization. In the reverse direction, a text input "6:30PM" should be spoken as "six thirty p m". This is text normalization. Another example is 'Dr.', which could be interpreted as 'Drive' or 'Doctor'. Hence, context of usage is important to determine the correct normal form.
Machine translation, opinion mining, spell checking, sentiment analysis, dependency parsing, and named entity recognition are further examples of NLP tasks or applications that can benefit from text normalization.
What are some general approaches to text normalization?
Text normalization has a few different approaches:
- Substitution Lists: Also called wordlist mapping, lookups or memorization. Uses a precompiled list. Doesn't generalize to variants not in the list.
- Rule-based Methods: Manually crafted rules encode regularities in variants.
- Distance-based Methods: Edit distance measures such as Levenshtein distance are used to determine if two word forms are similar.
- Spelling Correction: Hidden Markov Models analyse word morphology and determine the correct spelling. However, corrections are done word by word without any context.
- Automatic Speech Recognition (ASR): Based on the insight that microtext (social media text, SMS messages) is closer to sound forms rather than proper spelling. Decodes word sequences within a weighted phonetic framework.
- Machine Translation (MT): Microtext is treated as a foreign language that needs to be translated. This approach captures context. Character-level Statistical Machine Translation (CSMT) maps character sequences rather than words. It's an example of the noisy channel model: a translation model followed by a language model.
- Neural Models: Use neural networks such as an encoder-decoder model with LSTM.
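As an illustration of the distance-based approach, here is a minimal sketch of Levenshtein distance using dynamic programming, plus a helper that picks the nearest word from a vocabulary. The helper name closest and the max_distance threshold are assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of insertions, deletions and substitutions from a to b."""
    prev = list(range(len(b) + 1))  # distances from the empty prefix of a
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def closest(word: str, vocabulary: list[str], max_distance: int = 2):
    """Return the nearest vocabulary word, or None if nothing is close enough."""
    best = min(vocabulary, key=lambda v: levenshtein(word, v))
    return best if levenshtein(word, best) <= max_distance else None
```

Plain Levenshtein distance counts a transposition as two edits, so 'infromation' is at distance 2 from 'information'. Like the word-by-word spelling correction described above, this helper ignores context.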
What are some approaches to text tokenization?
In English, whitespace is used to separate words. Hence, whitespace is often used to identify tokens. Some punctuation characters could also indicate word boundaries. In social media text, :) and #nlproc would be considered as tokens.
Contractions are often normalized to expanded forms. For example, what're → what are, I'm → I am, isn't → is not. This sort of normalization results in two tokens from a single word. Conversely, New York is an example of two words considered as a single token.
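The ideas so far can be sketched as a small regular-expression tokenizer. The contraction table and the emoticon pattern are deliberately tiny and hypothetical. Expanded contractions come out in lower case, and multi-word tokens such as New York are not handled.

```python
import re

# Hypothetical lookup table; real systems use much larger lists.
CONTRACTIONS = {
    "what're": "what are",
    "i'm": "i am",
    "isn't": "is not",
}

# Order matters: emoticons and hashtags are tried before plain words.
TOKEN_PATTERN = re.compile(
    r"""
    [:;]-?[)(DP]          # a few emoticons such as :) ;D :P
    | \#\w+               # hashtags such as #nlproc
    | \w+(?:'\w+)?        # words, optionally with an apostrophe part
    | [^\w\s]             # any other single punctuation character
    """,
    re.VERBOSE,
)


def tokenize(text: str) -> list[str]:
    """Split text into tokens and expand the contractions listed above."""
    tokens = []
    for raw in TOKEN_PATTERN.findall(text):
        expanded = CONTRACTIONS.get(raw.lower())
        tokens.extend(expanded.split() if expanded else [raw])
    return tokens
```

For instance, tokenize("Wait, what?") separates the comma and question mark into their own tokens, while ':)' and '#nlproc' each stay whole.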
Tokenization of some words is far from unambiguous. Hyphens present a challenge. Should state-of-the-art become 'state of the art'? Should Hewlett-Packard become 'Hewlett Packard'? Should lower-case become 'lowercase' or 'lower case'? Some acronyms are also challenging. How should we tokenize m.p.h. and PhD?
In Japanese and Chinese, there are no spaces to separate words. A greedy algorithm that attempts to find the longest dictionary word is often used. In French, should L'ensemble be tokenized as L, L' or Le? In German, noun compounds are not segmented and their processing is deferred to the application.
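The greedy longest-match idea can be sketched as follows. The dictionaries are toy examples, and falling back to a single character when nothing matches is an assumption of this sketch.

```python
def max_match(text: str, dictionary: set[str]) -> list[str]:
    """Greedy left-to-right segmentation that prefers the longest dictionary word."""
    tokens = []
    max_len = max(map(len, dictionary), default=1)
    i = 0
    while i < len(text):
        # Try the longest candidate first; fall back to one character.
        for j in range(min(len(text), i + max_len), i, -1):
            if text[i:j] in dictionary or j == i + 1:
                tokens.append(text[i:j])
                i = j
                break
    return tokens


chinese = max_match("他特别喜欢北京烤鸭", {"他", "特别", "喜欢", "北京", "烤鸭", "北京烤鸭"})
english = max_match("thetabledownthere",
                    {"the", "theta", "table", "bled", "down", "own", "there"})
```

Here chinese is segmented into 他, 特别, 喜欢 and 北京烤鸭. The unspaced English string comes out as theta, bled, own and there rather than 'the table down there', which hints at why greedy matching tends to suit languages with short words better.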
Among the well-known tokenization approaches are Byte-Pair Encoding (BPE), WordPiece and SentencePiece.
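To give a feel for BPE, here is a minimal sketch of how its merge rules can be learned from a word-frequency table. It is simplified: there is no end-of-word marker, and the toy corpus is made up for illustration. WordPiece and SentencePiece differ in detail and are not shown.

```python
from collections import Counter


def learn_bpe(words: dict[str, int], num_merges: int) -> list[tuple[str, str]]:
    """Repeatedly merge the most frequent adjacent symbol pair."""
    # Every word starts out as a sequence of single characters.
    vocab = {tuple(word): freq for word, freq in words.items()}
    merges = []
    for _ in range(num_merges):
        pairs = Counter()
        for symbols, freq in vocab.items():
            for pair in zip(symbols, symbols[1:]):
                pairs[pair] += freq
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        new_vocab = {}
        for symbols, freq in vocab.items():
            merged, i = [], 0
            while i < len(symbols):
                if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                    merged.append(symbols[i] + symbols[i + 1])
                    i += 2
                else:
                    merged.append(symbols[i])
                    i += 1
            key = tuple(merged)
            new_vocab[key] = new_vocab.get(key, 0) + freq
        vocab = new_vocab
    return merges


corpus = {"hug": 10, "pug": 5, "pun": 12, "bun": 4, "hugs": 5}
rules = learn_bpe(corpus, 3)
```

On this toy corpus the first learned merges are 'u'+'g', then 'u'+'n', then 'h'+'ug'. Frequent character sequences thus become single subword units while rare words remain decomposable.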
What are non-standard words that need to be normalized?
A taxonomy of NSWs useful for hand tagging and modelling. Source: Sproat et al. 2001, table 1.
Non-Standard Words (NSWs) include numbers, abbreviations, dates, currency amounts and acronyms. Mixed-case words (WinNT, SunOS), Roman numerals, URLs, and email addresses are further categories of NSWs.
NSWs often occur in text apart from ordinary words and names. The challenge with NSWs is that they're not dictionary words and their interpretation tends to be ambiguous. Therefore, we need to normalize them. This basically means replacing them with ordinary words.
Take for example 'Pvt', which is interpreted as 'Private'. An ambiguous example is 'IV'. It could be read as four, fourth or intravenous, depending on the context. The number 1750 could refer to a year, a building number or a cardinal number. These differences are important for a TTS system that needs to determine the correct pronunciation. Should Amazon Alexa read '2/3' as 'two thirds' or 'February Third'?
Rather than employ ad hoc techniques to handle NSWs, formal modelling has been shown to give better results. Techniques could include n-gram models, decision trees, and weighted finite-state transducers.
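To make the ambiguity of a token such as 1750 concrete, the sketch below expands a number differently depending on its class. It assumes the class ('year' or 'cardinal') has already been decided by some upstream model. It covers only 0 to 9999 for cardinals and years that read naturally as two pairs of digits. All names are illustrative.

```python
ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def two_digits(n: int) -> str:
    """Spell out 0..99."""
    if n < 20:
        return ONES[n]
    tens, ones = divmod(n, 10)
    return TENS[tens] + ("" if ones == 0 else " " + ONES[ones])


def cardinal(n: int) -> str:
    """Spell out 0..9999 as a cardinal number."""
    parts = []
    thousands, rest = divmod(n, 1000)
    hundreds, rest = divmod(rest, 100)
    if thousands:
        parts.append(ONES[thousands] + " thousand")
    if hundreds:
        parts.append(ONES[hundreds] + " hundred")
    if rest or not parts:
        parts.append(two_digits(rest))
    return " ".join(parts)


def year(n: int) -> str:
    """Read a four-digit year as two pairs, e.g. 1750 -> seventeen fifty."""
    high, low = divmod(n, 100)
    if low == 0:
        return two_digits(high) + " hundred"
    if low < 10:
        return two_digits(high) + " oh " + ONES[low]
    return two_digits(high) + " " + two_digits(low)


def expand_number(token: str, semiotic_class: str) -> str:
    """Expand a digit string given its class, which is assumed to be known."""
    n = int(token)
    return year(n) if semiotic_class == "year" else cardinal(n)
```

The hard part in practice is choosing the class from context, which is where the models mentioned above come in. The verbalization itself is comparatively mechanical.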
What are the challenges in normalizing social media text?
Possible edits to normalize social media text. Source: Baldwin and Li 2015, fig. 1.
Social media text often doesn't conform to rules of spelling, grammar or punctuation. Among its challenges are:
- Abbreviations: nite (night), gr8 (great), sayin (saying), lol (laugh out loud), iirc (if I remember correctly), hard2tell (hard to tell)
- Misspelling: wouls (would), rediculous (ridiculous)
- Omitted Punctuation: im (I'm), dont (don't)
- Slang: that was well mint (that was well good)
- Wordplay: that was soooooo great (that was so great)
- Disguised Vulgarities: sh1t, f**k
- Emoticons: :) for smiling face, <3 for heart
- Informal Transliteration: This concerns only multilingual text. Variations in transliteration occur due to long vowels, borrowed words, accents/dialects, double consonants, etc.
Experiments have shown that normalizing these gives better performance in machine translation and spell checking. However, challenges remain. Emoticons :P and ;D are treated as spelling errors. Abbreviations 'b' for 'be' and 'c' for 'see' are not caught by spell checkers and later affect machine translation. When "I'm" is written as "im", it's misinterpreted as an abbreviation for instant messaging.
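A simple baseline for some of the categories listed above combines a substitution list with a rule that squeezes elongated words. The lexicon below reuses examples from the list and is otherwise hypothetical. The vocabulary used to validate squeezed forms is an assumption. Ambiguous cases such as 'im' are resolved blindly here, which is the kind of error the paragraph above warns about.

```python
import re

# Toy substitution list built from the examples above.
LEXICON = {
    "nite": "night",
    "gr8": "great",
    "sayin": "saying",
    "im": "I'm",
    "dont": "don't",
    "wouls": "would",
    "rediculous": "ridiculous",
}


def squeeze(word: str, vocabulary: set[str]) -> str:
    """Shorten runs of 3+ repeated characters to 2, then 1, until a known word appears."""
    for keep in (2, 1):
        candidate = re.sub(r"(.)\1{2,}", lambda m: m.group(1) * keep, word)
        if candidate in vocabulary:
            return candidate
    return word


def normalize_microtext(text: str, vocabulary: set[str]) -> str:
    """Lower-case, apply the substitution list, then squeeze elongated words."""
    out = []
    for token in text.lower().split():
        if token in LEXICON:
            out.append(LEXICON[token])
        else:
            out.append(squeeze(token, vocabulary))
    return " ".join(out)
```

As noted for substitution lists in general, this does not generalize to variants missing from the list. Unknown tokens pass through unchanged.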
What does it mean to normalize Unicode strings?
Examples of Unicode normalization forms. Source: Whistler 2020, fig. 6.
Consider the angstrom symbol Å, which may require normalization. Its Unicode code point is U+212B. It can be decomposed into the letter A followed by a combining ring above (U+030A).
Unicode characters can contain diacritical marks, ligatures, or half-width katakana characters. Unicode has defined four normalization forms:
- Normalization Form D (NFD): Canonical Decomposition
- Normalization Form C (NFC): Canonical Decomposition, followed by Canonical Composition
- Normalization Form KD (NFKD): Compatibility Decomposition
- Normalization Form KC (NFKC): Compatibility Decomposition, followed by Canonical Composition
Canonical equivalence means that equivalent characters or sequences of characters represent the same abstract character. They display and behave the same way.
Compatibility equivalence is a weaker type of equivalence. In this case, the visual appearance and behaviour may differ though they represent the same abstract character. For example, character ℌ becomes H and ¼ becomes 1/4. This difference may be acceptable in some applications. In some cases, applications may account for these differences with additional styling.
Consider 'schön'. Its normal forms are 'scho\u0308n' (NFD & NFKD) and 'schön' (NFC & NFKC). Moreover, NFC and NFKC differ only in the decomposition phase.
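These forms can be checked with the unicodedata module from the Python standard library. The helper name forms is just for this example. Escape sequences are used so that the exact code points are unambiguous.

```python
import unicodedata


def forms(text: str) -> dict[str, str]:
    """Return all four Unicode normalization forms of text."""
    return {form: unicodedata.normalize(form, text)
            for form in ("NFD", "NFC", "NFKD", "NFKC")}


schoen = forms("sch\u00f6n")    # 'schön' with a precomposed ö
angstrom = forms("\u212b")      # ANGSTROM SIGN
fraktur_h = forms("\u210c")     # BLACK-LETTER CAPITAL H
quarter = forms("\u00bc")       # VULGAR FRACTION ONE QUARTER
```

The canonical forms leave ℌ and ¼ unchanged while the compatibility forms replace them. The compatibility form of ¼ uses FRACTION SLASH (U+2044) rather than the ASCII slash.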
What are some neural network approaches to text normalization?
Text normalization with encoder-decoder model using GRUs and attention mechanism. Source: Zhang et al. 2019, fig. 6.
Since 2016, Recurrent Neural Networks (RNNs) have been used for text normalization. In particular, a few layers of BiLSTM have been used to map character sequences to word tokens.
For a long time CSMT was the state of the art in text normalization. Neural models generally need much larger training datasets. To overcome this limitation, Lusetti et al. (2018) trained a character-level encoder-decoder model plus a word-level language model. Beam search is used during decoding.
Zhang et al. (2019) used transformers with good results but these are prone to unrecoverable errors. They got better results by modifying the encoder-decoder model to capture context more effectively. Their multi-task architecture jointly trains the tagger and the normalizer.
Memory-augmented networks have also been applied. A hybrid word-character attention-based encoder-decoder model has been used, with the character-based component trained on adversarial examples. A pointer-generator network with a transformer encoder and an auto-regressive decoder has also been used, with the pointer module replacing OOV output tokens.
For many NLP tasks in Chinese, word tokenization is not required. However, Convolutional Neural Networks (CNNs) have been used.
Could you mention some useful developer tools for text normalization?
In Python, many NLP software libraries support text normalization, particularly tokenization, stemming and lemmatization. Some of these include NLTK, Hunspell, Gensim, spaCy, TextBlob and Pattern. More tools are listed in an online spreadsheet.
Penn Treebank tokenization standard is applied to treebanks released by the Linguistic Data Consortium (LDC). This standard keeps hyphenated words together, breaks up contractions (doesn't → does and n't), and separates out all punctuation ($10 → $ and 10).
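The three behaviours just described can be approximated with a few regular expressions. This is only a rough sketch, not the LDC tokenizer. The real standard has many more rules, for example for quotes, abbreviations and periods. The function name is made up for this example.

```python
import re


def ptb_like_tokenize(text: str) -> list[str]:
    """Approximate a few Penn Treebank conventions with regular expressions."""
    # Split negative contractions: doesn't -> does n't
    text = re.sub(r"(\w)(n't)\b", r"\1 \2", text)
    # Split other common clitics: it's -> it 's, we'll -> we 'll
    text = re.sub(r"(\w)('(?:s|re|ve|ll|d|m))\b", r"\1 \2", text)
    # Separate selected punctuation and currency symbols; hyphens stay attached.
    text = re.sub(r"([$,;:!?()\"])", r" \1 ", text)
    # Separate a sentence-final period only.
    text = re.sub(r"\.(\s*)$", r" .\1", text)
    return text.split()
```

For example, "It doesn't cost $10, state-of-the-art!" yields the tokens It, does, n't, cost, $, 10, a comma, state-of-the-art and an exclamation mark.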
For Unicode normalization, the International Components for Unicode page links to many useful resources including open source software. There's also an online demo at Unicode.org and a Unicode normalization FAQ.
In R, utf8_normalize from utf8 package does Unicode normalization. For other text analysis, R packages tidytext, tm, SnowballC and topicmodels are useful.
Wolfram supports many levels of text normalization: character-level, word-level, sentence-level, morphological and linguistic.

Milestones

1987

An early example of text normalization in the context of Text-to-Speech (TTS) is in a system named MITalk. Normalization is achieved using hard-coded rules in either Fortran or C.

1996

In the Bell Labs multilingual TTS system, a Weighted Finite State Transducer (WFST) is used for text normalization. Instead of doing this as a pre-processing step, normalization is done along with other linguistic tasks. To consider context, language model transducers are used. The method identifies many possible interpretations and selects the best path using the Viterbi algorithm. As late as 2014, this approach continues to be used in practice, such as in Google's Kestrel system.

2001

Sproat et al. give a taxonomy of NSWs. They also treat text normalization as a language modelling problem. For TTS application, they present both supervised and unsupervised machine learning approaches, with the latter being a better choice for new domains.

2005

With the growth of social media, there's a need to normalize such text. From about the mid-2000s, this drives interest in text normalization for social media text.

Jul 2006

Aw et al. propose the metaphor of Machine Translation (MT) for normalizing SMS messages. The idea is to "translate" SMS language to English by adapting a phrase-based statistical MT model. For alignment during training, they use the EM algorithm and Viterbi search. They show an improved BLEU score. They also show that downstream English to Chinese translations improve.

Oct 2007

Choudhury et al. apply a Hidden Markov Model (HMM) to the problem of normalizing SMS messages. Non-standard tokens are the emission states. They also adopt the spell checking metaphor and process text at character level rather than word level.

Nov 2011

Pennell and Liu introduce a character-level MT method. Examples of character-level mappings are 'a'→'er', '@'→'at', and '8'→'ate'. This is only the first phase where possible expansions are identified. In the second phase, a language model is used to choose the correct expansion in context.

Dec 2012

Alignment of 'ystrdy' and 'yesterday' using (a) character-level MT and (b) character-block level MT. Source: Li and Liu 2012, fig. 1.

Li and Liu propose an algorithm in which input is blocks of characters segmented by phonetic similarity. They use two-step MT, translating non-standard words to phones, then phones to words. They use spell checking for simple corrections. In the example, character-level MT misaligns the second 'e' but character-block level MT gets it right.

2013

Previous work often treated text normalization as replacing out-of-vocabulary or non-standard words with dictionary words. Researchers realize that text normalization can't be a "one-size-fits-all" approach. The downstream NLP task or application matters. Zhang et al. normalize with a view to improving the performance of dependency parsing rather than simply evaluating on word error rate and BLEU score. Wang and Ng normalize social media text for better machine translation. Along with word replacement, they recover missing words and correct punctuation.

2015

Baldwin and Li normalize social media text. They evaluate the effect of normalization on three downstream applications: dependency parsing, NER and TTS. They also study the effect of each normalization edit on each of these applications. For example, only word replacements are critical for NER. For parsing, word replacements, token addition and removal edits are important. For TTS, it's critical to remove non-standard tokens while word addition is important but less so.

Oct 2016

Sproat and Jaitly present neural models for text normalization. In particular, they use a few layers of BiLSTM. In one architecture, they train a BiLSTM channel model to map characters to word tokens, followed by another LSTM for language modelling. In another architecture, they use a 4-layer attention-based BiLSTM sequence-to-sequence model. This performs better than the first one. An FST-based filter improves results further.

Aug 2017

Van Esch and Sproat present a revised taxonomy of NSWs. They note that an earlier taxonomy from 2001 is inadequate due to many new categories that have come about due to social media. They present as many as 12 tables of various semiotic classes with useful examples for each. Some of these are word-like tokens, basic numbers, identifiers, dates, times, percentages, measures, geographic entities, and formulae.

Jun 2019

Historical variations of the word 'their'. Source: Bollmann 2019, fig. 1.

Historical texts need to be normalized. Bollmann evaluates and analyses the performance of three systems that do this: Norma (rule-based, distance-based, supervised), cSMTiser (CSMT with additional language modelling data), and Neural Machine Translation (NMT). He considers texts from many languages, some dating back to the 14th century. cSMTiser outperforms NMT in most cases. Norma could be used if there's limited training data.

Jun 2019

It's important to normalize NSWs correctly in spoken dialogue systems such as Amazon Alexa. Mansfield et al. approach this as a machine translation problem using sequence-to-sequence modelling. For better context, they use an attention mechanism on subword units rather than words. Subwords reduce input size and handle OOV words better. BPE is used to create a subword inventory and SentencePiece to find its optimal size. They improve performance further by using linguistic features: POS, position, capitalization, and edit labels.

Cite As

Devopedia. 2020. "Text Normalization." Version 2, December 21. Accessed 2023-11-12. https://devopedia.org/text-normalization

Contributed by
1 author

Last updated on
2020-12-21 13:42:38
