# Grammar and Spell Checker

A well-written article with correct grammar, punctuation, and spelling, along with a tone and style matched to the needs of the intended readers or community, is always important. Software tools offer algorithm-based solutions for grammar and spell checking and correction.

Classical rule-based approaches employ a dictionary of words along with a set of rules. Recent neural network-based approaches learn from millions of published articles and offer suggestions for appropriate word choices and ways to phrase parts of sentences, adjusting the tone, style, and semantics of the sentence. They can alter suggestions based on the publication domain of the article, such as academic, news, etc.

Grammar and spelling correction are tasks that belong to a more general NLP process called lexical disambiguation.

## Discussion

• What is a software grammar and spell checker? What are its general tasks and uses?

A grammar and spell checker is a software tool that checks a written text for grammatical mistakes, appropriate punctuation, misspellings, and issues related to sentence structure. More recently, neural network-based tools also evaluate tone, style, and semantics to ensure that the writing is flawless.

Often such tools offer a visual indication by highlighting or underlining spelling and grammar errors in different colors (often red for spelling and blue for grammar). Upon hovering or clicking on the highlighted parts, they offer appropriately ranked suggestions to correct those errors. Certain tools offer a suggestive corrected version by displaying correction as strikeout in an appropriate color.

Such tools are used to improve writing, produce engaging content, and for assessment and training purposes. Several tools also offer style correction to adapt the article for specific domains like academic publications, marketing, and advertising, legal, news reporting, etc.

However, no tool today is a perfect substitute for an expert human evaluator.

• What are some important terms relevant to a grammar and spell checker?

The following NLP terms and approaches are relevant to grammar and spell checkers:

• Part-of-Speech (PoS) tagging marks words as noun, verb, adverb, etc. based on definition and context.
• Named Entity Recognition (NER) is labeling a sequence of text into predefined categories such as name, location, etc. Labels help determine the context of words around them.
• Confusion Set is a set of probable words that can appear in a certain context, e.g. set of articles before a noun.
• N-Gram is a sub-sequence of n words or tokens. For example, "The sun is bright" has these 2-grams: {"the sun", "sun is", "is bright"}.
• Parallel Corpus is a collection of text placed alongside its translation, e.g. text with errors and its corresponding corrected version(s).
• Language Model (LM) determines the probability distribution over a sequence of words. It tells how likely a particular sequence of words is.
• Machine Translation (MT) is a software approach to translate one sequence of text into another. In grammar checking, this refers to translating erroneous text into correct text.
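The n-gram definition above can be sketched in a few lines of Python (the helper name `ngrams` is illustrative):

```python
def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token sequence."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

tokens = "the sun is bright".split()
print(ngrams(tokens, 2))  # [('the', 'sun'), ('sun', 'is'), ('is', 'bright')]
```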

• What are the various types of grammar and spelling errors?

We describe the following types:

• Sentence Structure: Parts of speech are organized incorrectly. For example, "she began to singing" shows a misplaced 'to' or '-ing'. A dependent clause without the main clause, a run-on sentence due to a missing conjunction, or a missing subject are other structural errors.
• Syntax Error: Violation of rules of grammar. These can be in relation to subject-verb agreement, wrong/missing article or preposition, verb tense or verb form error, or a noun number error.
• Punctuation Error: Punctuation marks like comma, semi-colon, period, exclamation, question mark, etc. are missing, unnecessary, or wrongly placed.
• Spelling Error: Word is not known in the dictionary.
• Semantic Error: Grammar rules are followed but the sentence doesn't make sense, often due to a wrong choice of words. "I am going to the library to buy a book" is an example where 'bookstore' should replace 'library'. Rule-based approaches typically can't handle semantic errors. They require statistical or machine learning approaches, which can also flag other types of errors. Often a combination of approaches leads to a good solution.
• What are classical methods for implementing grammar and spell checkers?

Classical methods of spelling correction match words against a given dictionary. Critics consider this approach unreliable: it can't detect incorrect use of correctly spelled words, and it flags correct words that aren't in the dictionary, such as technical terms, acronyms, etc.

Grammar checkers use hand-coded grammar rules on PoS-tagged text to identify correct or incorrect sentences. For instance, the rule I + Verb (3rd person, singular form) captures an incorrect verb form usage, as in the phrase "I has a dog." These methods provide detailed explanations of flagged errors, making them helpful for learning. However, rule maintenance is tedious, and rules are devoid of context.
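Such a rule can be sketched as a pattern over PoS-tagged tokens. In this toy example the tags are hand-assigned for illustration; a real checker would obtain them from a PoS tagger:

```python
# A minimal rule-based grammar check over PoS-tagged (word, tag) pairs.
RULES = [
    # 'I' followed by a 3rd-person singular verb (VBZ) is an agreement error.
    (lambda w1, w2: w1[0].lower() == "i" and w2[1] == "VBZ",
     "Subject-verb agreement: 'I' takes the base verb form"),
]

def check(tagged):
    """Return (word1, word2, message) for every rule match in the sentence."""
    errors = []
    for a, b in zip(tagged, tagged[1:]):
        for rule, msg in RULES:
            if rule(a, b):
                errors.append((a[0], b[0], msg))
    return errors

tagged = [("I", "PRP"), ("has", "VBZ"), ("a", "DT"), ("dog", "NN")]
print(check(tagged))  # flags the pair ('I', 'has')
```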

Statistical approaches validate parts of a sentence (n-grams) against their presence in a corpus. These approaches can flag words used out of context. However, it's challenging to provide detailed explanations, and their effectiveness is limited by the choice of corpora.

The noisy channel model is one such statistical approach. An LM based on trigrams and bigrams gives better results than one based on unigrams alone. Where rare words are wrongly corrected, a blacklist of words or a probability threshold can help.
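A toy sketch of the idea: candidate corrections are ranked by the language model given the preceding word. The tiny corpus and the uniform error model (which makes the channel term cancel out) are illustrative assumptions, not a real system:

```python
# Noisy channel sketch: pick the candidate maximizing P(c | context).
from collections import Counter

corpus = "the sun is bright . the sun is hot . the son of a king .".split()
unigrams = Counter(corpus)
bigrams = Counter(zip(corpus, corpus[1:]))

def p_bigram(prev, word):
    # Add-one smoothed bigram probability P(word | prev).
    return (bigrams[(prev, word)] + 1) / (unigrams[prev] + len(unigrams))

def best_candidate(prev, candidates):
    # Uniform error model: all candidates equally likely given the typo,
    # so the bigram LM alone decides.
    return max(candidates, key=lambda c: p_bigram(prev, c))

# Observed "the sin": choose between 'sun' and 'son' from context.
print(best_candidate("the", ["sun", "son"]))  # 'sun'
```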

• What are Machine Learning-based methods for implementing grammar and spell checkers?

ML-based approaches are either Classification (discriminative) or Machine Translation (generative).

Classification approaches work with well-defined errors. Each error type (article, preposition, etc.) requires training a separate multi-class classifier. For example, a preposition error classifier takes n-grams associated with prepositions in a sentence and outputs a score for every candidate preposition in the confusion set. Contextual corrections also consider features like PoS and NER. A model can be a linear classifier like a Support Vector Machine (SVM), an n-gram LM-based or Naïve Bayes classifier, or even a DNN-based classifier.
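A toy Naïve Bayes classifier over a preposition confusion set might look like the following. The training pairs and single-word context features are illustrative, not drawn from any real corpus:

```python
# Naive Bayes sketch: pick a preposition from a confusion set given
# context features (here, just the preceding word).
from collections import Counter, defaultdict

train = [
    (("interested",), "in"), (("good",), "at"), (("depends",), "on"),
    (("interested",), "in"), (("arrive",), "at"), (("based",), "on"),
]
CONFUSION_SET = ["in", "on", "at"]

prior = Counter(label for _, label in train)
feat_counts = defaultdict(Counter)
for feats, label in train:
    for f in feats:
        feat_counts[label][f] += 1

def predict(feats):
    def score(label):
        s = prior[label] / len(train)
        for f in feats:
            # Add-one smoothing over the label's seen feature values.
            s *= (feat_counts[label][f] + 1) / (prior[label] + len(feat_counts[label]))
        return s
    return max(CONFUSION_SET, key=score)

print(predict(("interested",)))  # 'in' with this toy data
```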

Machine Translation approaches can be Statistical Machine Translation (SMT) or Neural Machine Translation (NMT). Both use parallel corpora to train a sequence-to-sequence model, where text with errors translates to corrected text. NMT uses an encoder-decoder architecture, where an encoder determines a latent vector for a sentence based upon the input word embeddings. The decoder then generates target tokens from the latent vector and relevant surrounding input and output tokens (attention). These benefit from transfer learning and advances in transformer-based architectures. Editor models reduce training time by outputting edits to input tokens from a reduced confusion set instead of generating target tokens.

• How can I train an NMT model for grammar and spell checking?

In general, NMT requires training an encoder-decoder model using cross-entropy as the loss function, comparing the maximum likelihood output to the gold-standard correct output. Training a good model requires large parallel corpora and compute capacity. Transformers are attention-based deep seq2seq architectures. Pre-trained language models generated by transformer architectures like BERT provide contextual embeddings to find the most likely token given the surrounding tokens, making them useful for flagging contextual errors in an n-gram.

Transfer learning, that is, fine-tuning the weights of a transformer using a parallel corpus of incorrect-to-correct examples, makes it suitable for GEC use. Pre-processing or pre-training with synthetic data improves performance and accuracy. A further enhancement is to use separate heads for different types of errors.

Editor models are better as they output edit sequences instead of corrected versions. Training and testing of editor models require the generation of edit sequences from source-target parallel texts.
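Generating an edit sequence from a source-target parallel pair can be sketched with Python's standard `difflib`; a real editor model would use a richer edit vocabulary, but the alignment idea is the same:

```python
# Derive an edit sequence from a source-target parallel sentence pair,
# the representation editor models are trained to predict.
from difflib import SequenceMatcher

def edits(source, target):
    src, tgt = source.split(), target.split()
    ops = []
    for tag, i1, i2, j1, j2 in SequenceMatcher(None, src, tgt).get_opcodes():
        if tag != "equal":  # keep only insert/delete/replace operations
            ops.append((tag, src[i1:i2], tgt[j1:j2]))
    return ops

print(edits("she begin to singing", "she began to sing"))
# [('replace', ['begin'], ['began']), ('replace', ['singing'], ['sing'])]
```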

• What datasets are available for training and evaluation of grammar and spell check models?

MT or classification models need datasets with annotated errors. NMT requires a large amount of data.

Lang-8, the largest available parallel corpus, has 100,051 English entries. The Corpus of Linguistic Acceptability (CoLA) is a dataset of sentences labeled as either grammatically correct or incorrect. It can be used, for example, to fine-tune a pre-trained model. GitHub Typo Corpus is harvested from GitHub and contains errors and their corrections.

Benchmarking data in Standard Generalized Markup Language (SGML) format is available. Sebastian Ruder offers a detailed list of available benchmarking test datasets along with the various models (publications and source code).

Noise models use transducers to produce erroneous sentences from correct ones with a specified probability. They induce various error types to generate a larger dataset from a smaller one: replacing a word with one from its confusion set; misplacing or removing punctuation; inducing spelling, tense, noun number, or verb form mistakes; etc. Round-trip MT, such as English-German-English translation, can also generate parallel corpora. Wikipedia edit sequences offer millions of consecutive snapshots to serve as source-target pairs. However, only a tiny fraction of those edits are language related.
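A minimal noise model might look like the following sketch. The single error type (article substitution within its confusion set) and the error rate are illustrative assumptions:

```python
# Noise model sketch: induce synthetic errors into correct text to grow
# a parallel corpus of (noisy source, clean target) pairs.
import random

ARTICLES = ["a", "an", "the"]

def add_noise(sentence, p=0.5, seed=None):
    """With probability p, replace each article with another from its confusion set."""
    rng = random.Random(seed)
    noisy = []
    for word in sentence.split():
        if word in ARTICLES and rng.random() < p:
            word = rng.choice([a for a in ARTICLES if a != word])
        noisy.append(word)
    return " ".join(noisy)

correct = "the sun is bright"
print((add_noise(correct, p=1.0, seed=0), correct))  # (source, target) pair
```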

• How do I annotate or evaluate the performance of grammar and spell checkers?

The ERRor ANnotation Toolkit (ERRANT) enables suggestions with explanations. It automatically annotates parallel English sentences with error type information, thereby standardizing parallel datasets and facilitating detailed error type evaluation.

Training and evaluation require comparing the output to the target gold standard and giving a numerical measure of effectiveness or loss. Editor models have an advantage as the sequence length of input and output is the same. Unequal sequences need alignment with the insertion of empty tokens.

The Max-Match ($$M^2$$) scorer determines the smallest edit sequence, out of the multiple possible ways to arrive at the gold standard, using the notion of Levenshtein distance. The evaluation happens by computing precision, recall, and F1 measure between the set of system edits and the set of gold edits for all sentences, after aligning the sequences to the same length.
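Once the edits are extracted, the comparison step reduces to set arithmetic. A sketch, with edits represented simply as (before, after) pairs for illustration:

```python
# Precision/recall/F1 between a set of system edits and gold edits,
# as in M^2-style evaluation (edit extraction and alignment not shown).
def prf(system, gold):
    tp = len(system & gold)                      # edits both agree on
    p = tp / len(system) if system else 1.0      # precision
    r = tp / len(gold) if gold else 1.0          # recall
    f1 = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f1

gold = {("begin", "began"), ("singing", "sing")}
system = {("begin", "began"), ("to", "of")}      # one correct, one spurious edit
print(prf(system, gold))  # (0.5, 0.5, 0.5)
```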

Dynamic programming can also align multiple sequences to the gold standard when there is more than one possible correct outcome.

• Could you mention some tools or libraries that implement grammar and spell checking?

GNU Aspell is a standard utility used in GNU OS and other UNIX-like OS. Hunspell is a spell checker that's part of popular software such as LibreOffice, OpenOffice.org, Mozilla Firefox 3 & Thunderbird, Google Chrome, and more. Hunspell itself is based on MySpell. Hunspell can use one or more dictionaries, stemming, morphological analysis, and Unicode text.

Python packages for spell checking include pyspellchecker, textblob and autocorrect.

A search for "grammar spell" on GitHub brings up useful dictionaries or code implemented in various languages. There's a converter from British to American English. Spellcheckr is a JavaScript implementation for web frontends.

Deep learning models include Textly-DRF-API and GECwBERT.

Many online services or offline software also exist: WhiteSmoke from 2002, LanguageTool from 2005, Grammarly from 2009, Ginger from 2011, Reverso from 2013, and Trinka from 2020. Trinka focuses on an academic style of writing. Grammarly focuses on suggestions in terms of writing style, clarity, engagement, delivery, etc.

## Milestones

1960

Blair implements a simple spelling corrector using heuristics and a dictionary of correct words. Incorrect spellings are associated with the corrected ones via abbreviations that indicate similarity between the two. Blair notes that this is in some sense a form of pattern recognition. In one experiment, the program successfully corrects 89 of 117 misspelled words. In general, research interest in spell checking and correction begins in the 1960s.

1971

R. E. Gorin writes Ispell in PDP-10 assembly. Ispell becomes the main spell-checking program for UNIX. Ispell is also credited with introducing the generalized affix description system. Much later, Geoff Kuenning implements a C++ version with support for many European languages. This is called International Ispell. GNU Aspell, MySpell and Hunspell are other software inspired by Ispell.

1980

In the 1980s, GEC systems are syntax-based systems, such as EPISTLE. They determine the syntactic structure of each sentence and the grammatical functions fulfilled by various phrases. They detect several classes of grammatical errors, such as disagreement in number between the subject and the verb.

1990

This decade focuses on simple linear classifiers to flag incorrect choice of articles or statistical methods to identify and flag use of commonly confused words. Confusion can be due to identical sounding words, typos etc.

2000

Rule-based methods evolve in the 2000s. Rules are generated from parse trees, designed heuristically, or based on linguistic knowledge or statistical analysis of erroneous texts. These methods don't generalize to new types of errors: new rules need to be constantly added.

2005

The mid-2000s sees methods to record and create aligned corpora of pre- and post-editing ESL (English as a Second Language) writing samples. SMTs offer improvement in identifying and correcting writing errors. GEC sees the use of semantic and syntactic features including PoS tags and NER information for determining the applicable correction. Support Vector Machines (SVMs), n-gram LM-based and Naïve Bayes classifiers are used to predict the potential correction.

2010

DNN-based classifier approaches are proposed in the 2000s and early 2010s. However, a specific set of error types has to be defined; typically only well-defined errors can be addressed with these approaches. SMT models learn mappings from source text to target text using a noisy channel model. SMT-based GEC models use parallel corpora of erroneous text and a grammatically correct version of the same text in the same language. Open-source SMT engines are available online and include Moses, Joshua, and cdec.

2016

Neural Machine Translation (NMT) shows better prospects by capturing some learner errors missed by SMT models. This is because NMT can encode structural patterns from training data and is more likely to capture an unseen error.

2018

With the advent of attention-based transformer architecture in 2017, its application to GEC gives promising results.

2019

Methods to improve the training data by text augmentation of various types, including cyclic machine translation, emerge. These improve the performance of GEC tools significantly and enable better flagging of style or context-based errors or suggestions. Predicting edits instead of tokens allows the model to pick the output from a smaller confusion set. Thus, editor models lead to faster training and inference of GEC models.

## Sample Code

```python
# Source: https://norvig.com/spell-correct.html
# Accessed 2021-04-25

# This is Peter Norvig's implementation from 2007.
# It relies on big.txt, a file of about a million words.

import re
from collections import Counter

def words(text): return re.findall(r'\w+', text.lower())

WORDS = Counter(words(open('big.txt').read()))

def P(word, N=sum(WORDS.values())):
    "Probability of word."
    return WORDS[word] / N

def correction(word):
    "Most probable spelling correction for word."
    return max(candidates(word), key=P)

def candidates(word):
    "Generate possible spelling corrections for word."
    return (known([word]) or known(edits1(word)) or known(edits2(word)) or [word])

def known(words):
    "The subset of words that appear in the dictionary of WORDS."
    return set(w for w in words if w in WORDS)

def edits1(word):
    "All edits that are one edit away from word."
    letters    = 'abcdefghijklmnopqrstuvwxyz'
    splits     = [(word[:i], word[i:])    for i in range(len(word) + 1)]
    deletes    = [L + R[1:]               for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R)>1]
    replaces   = [L + c + R[1:]           for L, R in splits if R for c in letters]
    inserts    = [L + c + R               for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

def edits2(word):
    "All edits that are two edits away from word."
    return (e2 for e1 in edits1(word) for e2 in edits1(e1))

# Usage:
correction('speling')    # 'spelling' (single deletion)
correction('korrectud')  # 'corrected' (double replacements)
```

## Cite As

Devopedia. 2022. "Grammar and Spell Checker." Version 15, August 23. Accessed 2022-09-22. https://devopedia.org/grammar-and-spell-checker