# Text Corpus for NLP

In the domain of natural language processing (NLP), and statistical NLP in particular, models and algorithms must be trained on large amounts of data. For this purpose, researchers have assembled many text corpora. A common corpus is also useful for benchmarking models.

Typically, each text corpus is a collection of text sources. There are dozens of such corpora for a variety of NLP tasks. This article ignores speech corpora and considers only those in text form.

While English has many corpora, other natural languages too have their own corpora, though not as extensive as those for English. Using modern techniques, it's possible to apply NLP on low-resource languages, that is, languages with limited text corpora.

## Discussion

• What are the traits of a good text corpus or wordlist?

It's said that a prototypical corpus must be machine-readable in Unicode. It must be a representative sample of the language in current use, balanced, and collected in natural settings.

A good corpus or wordlist must have the following traits:

• Depth: A wordlist, for instance, should include the top 60K words and not just the top 3K words.
• Recent: A corpus based on outdated texts will not suit today's tasks.
• Metadata: Metadata should indicate the sources, assumptions, limitations and what's included in the corpus.
• Genre: Unless the corpus has been collected for specific tasks, it should include different genres such as newspapers, magazines, blogs, academic journals, etc.
• Size: A corpus of half a million words or more ensures that low frequency words are also adequately represented.
• Clean: A wordlist giving word forms of the same word can be messy to process. A better corpus would include only the lemma and part of speech.

• What are the different types of text corpora for NLP?

A plain text corpus is suitable for unsupervised training. Machine learning models learn from the data in an unsupervised manner. However, a corpus that has the raw text plus annotations can be used for supervised training. It takes considerable effort to create an annotated corpus but it may produce better results.

A corpus can be assembled from a variety of sources and genres. Such a corpus can be used for general NLP tasks. On the other hand, a corpus might be from a single source, domain or genre. Such a corpus can be used only for a specific purpose.

• What are the types of annotations that we can have on a text corpus?

Part-of-speech is one of the most common annotations because of its use in many downstream NLP tasks. Annotating with lemmas (base forms), syntactic parse trees (phrase-structure or dependency representations) and semantic information (word sense disambiguation) are also common. For discourse or text summarization tasks, annotations aid coreference resolution.

For instance, the British Component of the International Corpus of English (ICE-GB), with 1 million words, is POS tagged and syntactically parsed. Another parsed corpus is the Penn Treebank. While WordNet and FrameNet are not corpora, they contain useful semantic information.

Audio/video recordings are transcribed and annotated as well. Annotations are phonetic (sounds), prosodic (intonation, stress and rhythm), or interactional. Video transcripts may annotate for sign language and gesture.

Annotations could be inline/embedded with the text. When they appear on separate lines, it's called multi-tiered annotation. If they're in separate files, and linked to the text via hypertext, it's called standalone annotation.
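The distinction between inline and standalone annotation can be made concrete with a small sketch. The snippet below converts inline POS annotation (the familiar `word/TAG` convention used in tagged corpora such as the tagged Brown Corpus) into standoff `(token, tag)` pairs; the sample sentence and tags here are illustrative, not drawn from a real corpus.

```python
# Convert inline POS annotation (word/TAG tokens) into
# standoff (token, tag) tuples. Illustrative example only.

def inline_to_standoff(text):
    """Split 'word/TAG' tokens into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # rpartition handles words that themselves contain '/'
        word, _, tag = token.rpartition('/')
        pairs.append((word, tag))
    return pairs

annotated = "The/DT cat/NN sat/VBD ./."
print(inline_to_standoff(annotated))
# [('The', 'DT'), ('cat', 'NN'), ('sat', 'VBD'), ('.', '.')]
```

Standoff pairs like these are easy to serialize to a separate file, which is the basis of standalone annotation.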

• What are some NLP task-specific training corpora?

• POS Tagging: Penn Treebank's WSJ section is tagged with a 45-tag tagset. The Ritter dataset is suited to social media content.
• Named Entity Recognition: CoNLL 2003 NER task is newswire content from Reuters RCV1 corpus. It considers four entity types. WNUT 2017 Emerging Entities task and OntoNotes 5.0 are other datasets.
• Constituency Parsing: Penn Treebank's WSJ section has a dataset for this purpose.
• Semantic role labelling: OntoNotes v5.0 is useful due to syntactic and semantic annotations.
• Sentiment Analysis: IMDb has released 50K movie reviews. Others are Amazon Customer Reviews of 130 million reviews, 6.7 million business reviews from Yelp, and Sentiment140 of 160K tweets.
• Text Classification/Clustering: Reuters-21578 is a collection of news documents from 1987 indexed by categories. 20 Newsgroups is another dataset of about 20K documents from 20 newsgroups.

• Could you list some NLP text corpora by genre?

The formal genre typically draws from books and academic journals. Examples are Project Gutenberg EBooks, Google Books Ngrams, and arXiv Bulk Data Access. There are many text corpora from newswire. Examples are 20 Newsgroups and Reuters-21578.

For informal genre, we can include web data and emails. Corpora for these include Common Crawl, Blogger Corpus, Wikipedia Links Data, Enron Emails, and UCI's Spambase. Corpora derived from reviews include Yelp Reviews, Amazon Customer Reviews, and IMDb Movie Reviews. Even more informal are SMS and tweets, for which we have Sentiment140, Twitter US Airline Sentiment, and SMS Spam Collection.

Spoken language is often different from written language. 2000 HUB5 English is a dataset that's a transcription of 40 telephone conversations. Signed language can also be annotated and transcribed to create a corpus.

Since languages evolve, models meant for analyzing old texts need to be trained on corpora from the same period. Examples include DOE Corpus (600s-1150s), and COHA (1810s-2000s).

Another special case is of learners who are likely to express ideas differently. The Open Cambridge Learner Corpus contains 10K student responses of 2.9 million words.

It's also common to have domain-specific corpora. For example, BioCreative and GENIA are for biology.

• What are some generic training corpora for NLP?

Some of the well-known corpora are Brown Corpus, British National Corpus (BNC), Lancaster-Oslo/Bergen Corpus (LOB), International Corpus of English (ICE), Corpus of Contemporary American English (COCA), Google Books Ngram Corpus, Penn Treebank-3, English Gigaword Fifth Edition, and OntoNotes Release 5.0.

Wikipedia was not made for training NLP models but it can be used. We would need to strip the markup. The Gensim Python package has the gensim.corpora.wikicorpus.WikiCorpus class to process Wikipedia dumps.
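As a rough illustration of what stripping markup involves, here's a minimal sketch using regular expressions. It handles only a few common wiki constructs; real tools such as Gensim's WikiCorpus deal with many more cases (templates, tables, references), so treat this as a simplified assumption-laden example.

```python
import re

# Simplified wiki-markup stripping; real preprocessing pipelines
# handle far more cases than these three substitutions.

def strip_wiki_markup(text):
    # [[target|label]] or [[target]]  ->  label (or target)
    text = re.sub(r"\[\[(?:[^|\]]*\|)?([^\]]+)\]\]", r"\1", text)
    # ''italic'' and '''bold''' quotes  ->  plain text
    text = re.sub(r"'{2,}", "", text)
    # == Heading ==  ->  Heading
    text = re.sub(r"==+\s*([^=]+?)\s*==+", r"\1", text)
    return text

sample = "'''NLP''' uses [[text corpus|corpora]] for training."
print(strip_wiki_markup(sample))  # NLP uses corpora for training.
```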

Generic corpora are usually suited for language modelling, which is useful for other downstream tasks such as machine translation and speech recognition. Researchers have suggested using Project Gutenberg EBooks; Penn Treebank of about a million words pre-processed by Mikolov et al. in 2011; WikiText-2 of more than 2 million words; and WikiText-103. Google's one-billion word corpus provides a useful benchmark.

• Derived from text corpus, which datasets are useful for NLP tasks?

Wordlists such as list of names or stopwords are useful for NLP work. Phrases in English (PIE) is another resource to explore distribution of words and phrases. It's based on the BNC corpus.
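A stopword list is typically applied as a simple filter over tokens. The sketch below uses a tiny illustrative subset of stopwords; in practice one would load a published list (e.g. NLTK's English stopwords).

```python
# Filter tokens against a stopword list. The set below is a tiny
# illustrative subset, not a complete published stopword list.

STOPWORDS = {"the", "a", "an", "of", "in", "is", "to", "and"}

def remove_stopwords(tokens):
    """Drop tokens whose lowercase form is in the stopword list."""
    return [t for t in tokens if t.lower() not in STOPWORDS]

tokens = "The corpus is a sample of the language".split()
print(remove_stopwords(tokens))  # ['corpus', 'sample', 'language']
```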

Tagsets are essential for POS tagging, chunking, dependency parsing or constituency parsing. DKPro Core Tagset Reference is an excellent resource. University of Lancaster maintains a multilingual semantic tagset.

Treebanks go beyond just POS-tagging a corpus. A treebank is an annotated corpus in which grammatical structure is typically represented as a tree structure. Examples are Penn Treebank and CHRISTINE Corpus. Treebanks are useful for evaluating syntactic parsers or as resources for ML models to optimize linguistic analyzers.
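Treebank annotations are commonly stored as bracketed strings, e.g. `(S (NP (DT The) (NN cat)) (VP (VBD sat)))` in Penn Treebank style. A minimal recursive parser for that format might look like this; it returns `(label, children)` tuples with plain strings as leaves, and is a sketch rather than a full Treebank reader.

```python
import re

# Parse a Penn Treebank-style bracketed tree into nested
# (label, children) tuples. Leaves are plain strings.

def parse_tree(s):
    tokens = re.findall(r"\(|\)|[^\s()]+", s)
    pos = 0

    def parse():
        nonlocal pos
        assert tokens[pos] == "("
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(parse())
            else:
                children.append(tokens[pos])
                pos += 1
        pos += 1  # consume ")"
        return (label, children)

    return parse()

tree = parse_tree("(S (NP (DT The) (NN cat)) (VP (VBD sat)))")
print(tree[0])  # S
```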

Word embeddings are real-valued vector representations of words. These have improved many NLP tasks including language modelling and semantic analysis. While it's possible to learn embeddings from a large corpus, it's easier to start with downloadable embeddings. Two sources for downloads are Polyglot and Nordic Language Processing Laboratory (NLPL).
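Once embeddings are downloaded, semantic similarity is usually measured as the cosine of the angle between vectors. The toy 3-dimensional vectors below are made up for illustration; real pretrained embeddings have tens to hundreds of dimensions.

```python
import math

# Toy 3-dimensional embeddings, invented for illustration.
# Real downloadable embeddings have many more dimensions.
embeddings = {
    "king":  [0.9, 0.8, 0.1],
    "queen": [0.85, 0.75, 0.2],
    "apple": [0.1, 0.2, 0.9],
}

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

print(cosine(embeddings["king"], embeddings["queen"]))  # close to 1
print(cosine(embeddings["king"], embeddings["apple"]))  # much lower
```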

Perhaps by 2020, we'll be able to download pretrained language models and apply them to a variety of NLP tasks.

• Which are some corpora for non-English languages?

For machine translation, it's common to have parallel corpus, that is, aligned text in multiple languages. We mention a few examples:

• Aligned Hansards of the 36th Parliament of Canada containing 1.3 million pairs of aligned text segments in English and French
• Europarl parallel corpus from 1996-2011 of 21 European languages from parliament proceedings
• WMT 2014 EN-DE and WMT 2014 EN-FR
• A corpus using Wikipedia across 20 languages, 36 bitexts, about 610 million tokens and 26 million sentence fragments
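Parallel corpora such as Europarl are typically distributed as two line-aligned plain-text files, one per language, where line N of each file forms a translation pair. A minimal sketch of pairing them up (the sample sentences are Europarl's well-known opening line; anything else here is illustrative):

```python
# Pair up line-aligned source/target sentences, skipping pairs
# where either side is empty. A sketch of parallel-corpus loading.

def read_parallel(src_lines, tgt_lines):
    pairs = []
    for src, tgt in zip(src_lines, tgt_lines):
        src, tgt = src.strip(), tgt.strip()
        if src and tgt:
            pairs.append((src, tgt))
    return pairs

en = ["Resumption of the session", ""]
fr = ["Reprise de la session", ""]
print(read_parallel(en, fr))
# [('Resumption of the session', 'Reprise de la session')]
```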

An excellent source is OPUS, the open parallel corpus. Lionbridge published a list of parallel corpora in 2019. Martin Weisser maintains a list that links to many non-English corpora.

• Are there curated lists of datasets for NLP work?

A simple web search will yield plenty of relevant results. Some include download links to the sources. We mention a few that stand out:

## Milestones

1964

W. Nelson Francis and Henry Kučera at the Department of Linguistics, Brown University, publish a computer-readable general corpus to aid linguistic research on modern English. The corpus has 1 million words (500 samples of about 2000 words each). Revised editions appear later in 1971 and 1979. Called Brown Corpus, it inspires many other text corpora. The corpus with annotations is included in Treebank-3 (1999).

1992

Linguistic Data Consortium (LDC) is formed to serve as a repository for NLP resources, including corpora. It's hosted at the University of Pennsylvania.

1994

A 100-million-word corpus of British English called BNC (British National Corpus) is assembled between 1991 and 1994. It's balanced across genres. A follow-up task called BNC2014 is started in 2014, which can help in understanding how language evolves. Spoken BNC2014 is released in September 2017. Written BNC2014 is expected to come out in 2019.

1999

Penn Treebank-3 is released. It's based upon the original Treebank (1992) and its revised Treebank II (1995). This work started in 1989 at the University of Pennsylvania. Treebank-3 includes tagged/parsed Brown Corpus, 1 million words of 1989 WSJ material annotated in Treebank II style, tagged sample of ATIS-3, and tagged/parsed Switchboard Corpus. Apart from POS tags, the corpus includes chunk tags, relation tags and anchor tags. The BLLIP 1987-89 WSJ Corpus Release 1 has 30 million words and supplements the WSJ section of Treebank-3.

2008

Collected for the years 1990-2007, the Corpus of Contemporary American English (COCA) is released with 365 million words. By December 2017, it has 560 million words, adding 20 million each year. There's a good balance of spoken, fiction, popular magazines, newspapers, and academic texts. It's been noted that COCA contains many common words that are missing in the American National Corpus (ANC), a corpus of 22 million words.

Jun
2011

English Gigaword Fifth Edition is released by LDC. It comes from seven English newswire services. It has 4 billion words and takes up 26 gigabytes uncompressed. The first edition appeared in 2003. In November 2012, researchers at Johns Hopkins University add syntactic and discourse structure annotations to this corpus after parsing more than 183 million sentences.

Jul
2012

From digitized books, Google releases version 2 of Google Books Ngrams. Version 1 came out in July 2009. Only n-grams that appear more than 40 times are included. The corpus includes 1-grams to 5-grams. It includes many non-English languages as well. To experiment on small sets of phrases, researchers can try out the online Google Books Ngram Viewer.
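The n-gram extraction described above can be sketched in a few lines: slide a window of size 1 to 5 over tokenized text and keep only n-grams above a frequency threshold (Google uses 40; the tiny sample below uses 2 so something survives).

```python
from collections import Counter

# Count 1-grams to 5-grams and keep those at or above min_count.
# Thresholds and sample text are illustrative.

def count_ngrams(tokens, max_n=5, min_count=2):
    counts = Counter()
    for n in range(1, max_n + 1):
        for i in range(len(tokens) - n + 1):
            counts[tuple(tokens[i:i + n])] += 1
    return {ng: c for ng, c in counts.items() if c >= min_count}

tokens = "to be or not to be".split()
print(count_ngrams(tokens))
# {('to',): 2, ('be',): 2, ('to', 'be'): 2}
```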

Aug
2012

As a corpus for informal genre, English Web Treebank (EWT) is released by LDC. This includes content from weblogs, reviews, question-answers, newsgroups, and email. It has about 250K word-level tokens across 16K sentences. It's annotated for POS and syntactic structure. This includes Enron Corporation emails from 1999-2002. In 2014, Silveira et al. provide annotation of syntactic dependencies for this corpus that can be used to train dependency parsers.

Sep
2019

Common Crawl publishes 240 TiB of uncompressed data from 2.55 billion web pages. Of these, 1 billion URLs were not present in previous crawls. Common Crawl started in 2008. In 2013, they moved from ARC to Web ARChive (WARC) file format. WAT files contain the metadata. WET files contain the plaintext of the WARC files.

## Cite As

Devopedia. 2020. "Text Corpus for NLP." Version 5, December 20. Accessed 2022-09-22. https://devopedia.org/text-corpus-for-nlp