Comparison between the corpora: Brown, Susanne and Penn Treebank

Why these corpora still matter

These are foundational English corpora, useful for learning annotation schemes and dataset documentation. They are not representative of every current domain or population. Before training a model, review license, genre, period, label mapping, train/test splits and possible demographic or lexical bias.

The objective of this project is to gather information about the Brown, Susanne and Penn Treebank corpora and to present it as follows:

Description of the corpora
Concise comparison, in table form, of the following aspects:
- Type of tagging: lexical tagging (POS tagging), syntactic, etc.
- Corpus size
- Label set size
- Themes included
- Origin of the texts: newspapers, speech transcriptions, etc.

2. Description of the corpora

In this section the following corpora will be described:

Brown
Susanne
Penn Treebank

2.1 Brown Corpus

In 1967 Henry Kučera and W. Nelson Francis published Computational Analysis of Present-Day American English, a study of the first corpus of American English. The corpus was compiled at Brown University and is now known as the Brown Corpus.

To prepare this work, the authors carried out several computational analyses, producing a rich and varied study that combines elements of linguistics, psychology, statistics and sociology. The corpus itself consists of about one million words drawn from a wide variety of sources written in the American English of the time, in the following categories:

PRESS: Reportage (44 texts)
PRESS: Editorial (27 texts)
PRESS: Reviews (17 texts)
RELIGION (17 texts)
SKILLS AND HOBBIES (36 texts)
POPULAR LORE (48 texts)
BELLES LETTRES (75 texts)
MISCELLANEOUS: US Government and House Organs (30 texts)
LEARNED (science, mathematics...) (80 texts)
FICTION: General (29 texts)
FICTION: Mystery and Detective (24 texts)
FICTION: Science (6 texts)
FICTION: Adventure and Western (29 texts)
FICTION: Romance and Love Story (29 texts)
HUMOR (9 texts)
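
As a quick consistency check, the text counts in the list above can be added up and grouped. The following is a minimal sketch in Python; the dictionary keys are abbreviated identifiers chosen for this example only, not official category codes.

```python
# Number of texts per Brown Corpus category, copied from the list above.
# The keys are shortened identifiers invented for this sketch.
brown_texts = {
    "press_1": 44,          # first press category
    "press_editorial": 27,
    "press_3": 17,          # third press category
    "religion": 17,
    "skills_hobbies": 36,
    "popular_lore": 48,
    "belles_lettres": 75,
    "miscellaneous": 30,
    "learned": 80,
    "fiction_general": 29,
    "fiction_mystery": 24,
    "fiction_science": 6,
    "fiction_adventure": 29,
    "fiction_romance": 29,
    "humor": 9,
}


def group_total(counts, prefix):
    """Sum the counts of every category whose key starts with `prefix`."""
    return sum(n for key, n in counts.items() if key.startswith(prefix))


total = sum(brown_texts.values())
press = group_total(brown_texts, "press")
fiction = group_total(brown_texts, "fiction")

print(f"total texts: {total}")
print(f"press share: {press / total:.1%}")
print(f"fiction share: {fiction / total:.1%}")
```

The total matches the 500 samples reported for the corpus, which suggests the per-category figures are internally consistent.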

Once published, this resource was widely used in computational linguistics, and for many years it was one of the most cited resources in the field of linguistics. The Brown Corpus has also served as the basis for later corpora, such as the Lancaster-Oslo-Bergen Corpus and SUSANNE.

Finally, the Brown Corpus is annotated at the lexical level with part-of-speech (POS) tags, and its tagset has 82 different labels.
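
To illustrate what POS tagging looks like in practice, here is a minimal Python sketch that reads a line in the "word/tag" notation commonly used to distribute the tagged Brown Corpus. The sample sentence is invented for this example, and the function name parse_tagged is hypothetical; only the general word/tag layout is assumed.

```python
from collections import Counter


def parse_tagged(line):
    """Split a 'word/tag word/tag ...' line into (word, TAG) pairs."""
    pairs = []
    for token in line.split():
        # Split on the LAST slash so that words containing '/' survive.
        word, sep, tag = token.rpartition("/")
        if not sep or not word or not tag:
            raise ValueError(f"not a word/tag token: {token!r}")
        pairs.append((word, tag.upper()))
    return pairs


# Invented sample sentence using Brown-style tags (assumption for illustration).
sample = "The/at jury/nn said/vbd it/pps was/bedz fair/jj ./."

pairs = parse_tagged(sample)
tag_counts = Counter(tag for _, tag in pairs)

for word, tag in pairs:
    print(f"{word}\t{tag}")
```

Counting tags with Counter, as above, is the usual first step towards the frequency statistics that make a tagged corpus useful for probabilistic analysis.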

2.2 Susanne

Susanne stands for Surface and Underlying Structural Analyses of Naturalistic English. It was created with the sponsorship of the Economic and Social Research Council (United Kingdom), as part of the process of developing a comprehensive NLP-oriented taxonomy and annotation scheme for the (logical and surface) grammar of English.

In its initial version, it consists of approximately 130,000 words from 64 documents taken from the Brown Corpus. The files average approximately 86 kilobytes in size, and each contains more than 2,000 words. Like the Brown Corpus, Susanne carries lexical (POS) tags, drawn from a set of 353 wordtags, and its annotation scheme additionally records grammatical structure. The documents cover the following themes:

Press reportage.
Belles lettres, biographies and memoirs.
Learned (scientific or technical) writing.
Adventure and Western fiction.

The SUSANNE analytical scheme was developed on the basis of samples of American English. It was initially oriented towards written language only, and the corpus in fact contains exclusively written samples. Several extensions have since been built on it, such as the CHRISTINE Corpus, presented in 1999, which analyses a balanced cross-section of the English spoken throughout the United Kingdom in the preceding decade.

Finally, although Susanne uses only a subset of the Brown Corpus documents, it notably improves probabilistic analysis compared with the Brown Corpus.

2.3 Penn Treebank

The Penn Treebank is a corpus created by the University of Pennsylvania, comprising more than 4.5 million words of American English. It uses the following types of tagging:

Lexical. Each word receives a grammatical (part-of-speech) label that identifies its category and helps to interpret it.
Syntactic. The corpus represents syntactic information through a tree structure (hence "treebank").

To make this easier to understand, the following bracketed tree, adapted from Wikipedia, shows the analysis of the Spanish sentence "Victor ama a Maria" ("Victor loves Maria"):

 
   (S (NP (NNP Victor))
     (VP (VBZ ama)
       (PP (TO a)
         (NP (NNP Maria))))
     (. .))
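
The bracketed notation above can be read mechanically. The following is a minimal sketch of a recursive parser in plain Python; the function names parse_tree and tagged_leaves are hypothetical, and the sketch assumes well-formed input. It writes the verb tag as VBZ, the Penn Treebank tag for a third-person singular present-tense verb.

```python
import re

TREE = "(S (NP (NNP Victor)) (VP (VBZ ama) (PP (TO a) (NP (NNP Maria)))) (. .))"


def parse_tree(text):
    """Parse a bracketed tree into nested (label, children) tuples."""
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read_node():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        label = tokens[pos]
        pos += 1
        children = []
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                children.append(read_node())   # nested constituent
            else:
                children.append(tokens[pos])   # a word (leaf)
                pos += 1
        pos += 1                               # consume ')'
        return (label, children)

    tree = read_node()
    if pos != len(tokens):
        raise ValueError("unexpected text after the tree")
    return tree


def tagged_leaves(tree):
    """Return the (word, POS tag) pairs in left-to-right order."""
    label, children = tree
    pairs = []
    for child in children:
        if isinstance(child, str):
            pairs.append((child, label))       # lexical level
        else:
            pairs.extend(tagged_leaves(child)) # syntactic level
    return pairs


tree = parse_tree(TREE)
print(tagged_leaves(tree))
```

The leaves with their immediate labels correspond to the lexical tagging, while the nesting of S, NP, VP and PP corresponds to the syntactic tagging.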

At the lexical level, this corpus has 36 part-of-speech tags plus 12 tags for punctuation and symbols. At the syntactic level, it has 14 labels plus 4 for null elements.

Its samples come from several different collections, specifically the following:

Dept. of Energy abstracts
Dow Jones Newswire stories
Dept. of Agriculture bulletins
Library of America texts
MUC-3 messages
IBM Manual sentences
WBUR radio transcripts
ATIS sentences
Brown Corpus, retagged

3. Comparison of the corpora

In this section the following aspects of the three corpora will be compared:

Type of tagging: lexical tagging (POS tagging), syntactic, etc.
Corpus size
Label set size
Themes included
Origin of the texts: newspapers, speech transcriptions, etc.

3.1 Type of tagging: lexical tagging (POS tagging), syntactic, etc.

Corpus	Labeling
Brown	Lexical (POS)
Susanne	Lexical (POS) and syntactic
Penn Treebank	Lexical (POS) and syntactic

3.2 Corpus size

Corpus	Corpus size
Brown	500 samples of 2,000 or more words (1,014,312 words in total)
Susanne	64 samples of at least 2,000 words (approximately 130,000 words in total)
Penn Treebank	4,885,798 words in total
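
The figures in this table can be put in proportion with a little arithmetic. The following Python sketch uses only the numbers given above; the Susanne total is the approximate figure of 130,000 words, so the derived values are approximate as well.

```python
# Word totals taken from the table above (the Susanne figure is approximate).
words = {
    "Brown": 1_014_312,
    "Susanne": 130_000,
    "Penn Treebank": 4_885_798,
}
# Sample counts taken from the table above.
samples = {"Brown": 500, "Susanne": 64}

# Susanne relative to Brown, by words and by samples.
susanne_word_share = words["Susanne"] / words["Brown"]
susanne_sample_share = samples["Susanne"] / samples["Brown"]

# Penn Treebank relative to Brown.
ptb_vs_brown = words["Penn Treebank"] / words["Brown"]

# Average sample length in words.
avg_sample = {name: words[name] / samples[name] for name in samples}

print(f"Susanne / Brown (words):   {susanne_word_share:.1%}")
print(f"Susanne / Brown (samples): {susanne_sample_share:.1%}")
print(f"Penn Treebank / Brown:     {ptb_vs_brown:.2f}x")
for name, avg in avg_sample.items():
    print(f"average {name} sample: {avg:.0f} words")
```

Both shares for Susanne come out at roughly 13%, which is consistent with it being a subset of Brown samples of similar length.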

3.3 Label set size

Corpus	Label set size
Brown	82 tags of six kinds: (1) major parts of speech: noun (common and proper), verb, adjective...; (2) function words: determiners, prepositions, conjunctions...; (3) important individual words: not, existential there, infinitival to, forms of the verbs do, be and have; (4) punctuation marks of syntactic importance; (5) inflectional morphemes; (6) two tags (FW and NC) for foreign and cited words.
Susanne	353 wordtags
Penn Treebank	Lexical labeling: 36 tags plus 12 for punctuation and symbols. Syntactic labeling: 14 tags plus 4 for null elements.

3.4 Themes included

Corpus	Themes included
Brown	PRESS: Reportage (44 texts); PRESS: Editorial (27 texts); PRESS: Reviews (17 texts); RELIGION (17 texts); SKILLS AND HOBBIES (36 texts); POPULAR LORE (48 texts); BELLES LETTRES (75 texts); MISCELLANEOUS: US Government and House Organs (30 texts); LEARNED (science, mathematics...) (80 texts); FICTION: General (29 texts); FICTION: Mystery and Detective (24 texts); FICTION: Science (6 texts); FICTION: Adventure and Western (29 texts); FICTION: Romance and Love Story (29 texts); HUMOR (9 texts)
Susanne	Press reportage; belles lettres, biographies and memoirs; learned (scientific or technical) writing; adventure and Western fiction.
Penn Treebank	The themes of its source collections (see section 3.5). Because these include a retagged version of the Brown Corpus, all the Brown categories listed above are also covered.


3.5 Origin of the texts: newspapers, speech transcriptions, etc.

Corpus	Origin of the texts
Brown	Published written texts in American English: press, religious texts, skills and hobbies, popular lore, belles lettres, government and institutional documents, learned writing, fiction and humor
Susanne	64 of the 500 samples in the Brown Corpus
Penn Treebank	Dept. of Energy abstracts; Dow Jones Newswire stories; Dept. of Agriculture bulletins; Library of America texts; MUC-3 messages; IBM Manual sentences; WBUR radio transcripts; ATIS sentences; Brown Corpus (retagged)

4. Susanne vs. Brown

Based on the information collected, Susanne has a more precise, more granular and more easily interpreted label set than the Brown Corpus. As a result, probabilistic analysis improves significantly compared with the Brown Corpus, yielding better results in parsing, interpretation and text analysis with automatic parsing techniques.