Comparison between the corpora: Brown, Susanne and Penn Treebank

The objective of this project is to gather information about three corpora (Brown, Susanne and Penn Treebank) and to present the following:
  • Description of the corpora
  • Concise comparison (table or similar) of different aspects that you consider relevant:
    • Type of tagging: lexical tagging (POS tagging), syntactic, etc.
    • Corpus size
    • Label set size
    • Themes included
    • Origin of the texts: newspapers, speech transcriptions, etc.

2. Description of the corpora

In this section the following corpora will be described:
  • Brown
  • Susanne
  • Penn Treebank

2.1 Brown Corpus

In 1967 Henry Kučera and W. Nelson Francis published Computational Analysis of Present-Day American English, a study based on the first computer-readable corpus of American English. It was published by Brown University, and the corpus is now known as the Brown Corpus.

To prepare this corpus, the authors carried out several computational analyses, thanks to which they managed to compile a rich and varied body of documents that combines elements of linguistics, psychology, statistics and sociology. The result was the Brown Corpus, a collection of about one million words drawn from a wide variety of sources written in contemporary American English and belonging to the following categories:
  • PRESS: Reportage (44 texts)
  • PRESS: Editorial (27 texts)
  • PRESS: Reviews (17 texts)
  • RELIGION (17 texts)
  • SKILLS AND HOBBIES (36 texts)
  • POPULAR LORE (48 texts)
  • BELLES LETTRES, BIOGRAPHY, MEMOIRS (75 texts)
  • MISCELLANEOUS: US Government and house organs (30 texts)
  • LEARNED (science, mathematics...) (80 texts)
  • FICTION: General (29 texts)
  • FICTION: Mystery and detective fiction (24 texts)
  • FICTION: Science (6 texts)
  • FICTION: Adventure and Western (29 texts)
  • FICTION: Romance and love stories (29 texts)
  • HUMOR (9 texts)
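
As a quick consistency check on the category list above, the following minimal sketch adds up the per-category text counts and groups them by top-level category. The counts are copied from the list; the dictionary name `BROWN_CATEGORIES`, the helper `texts_by_group` and the shortened category labels are illustrative choices, not part of any corpus distribution.

```python
# Number of texts per Brown Corpus category, copied from the list above.
# Labels are shortened for readability (an assumption of this sketch).
BROWN_CATEGORIES = {
    "Press: Reportage": 44,
    "Press: Editorial": 27,
    "Press: Reviews": 17,
    "Religion": 17,
    "Skills and Hobbies": 36,
    "Popular Lore": 48,
    "Belles Lettres": 75,
    "Miscellaneous": 30,
    "Learned": 80,
    "Fiction: General": 29,
    "Fiction: Mystery and Detective": 24,
    "Fiction: Science": 6,
    "Fiction: Adventure and Western": 29,
    "Fiction: Romance and Love Story": 29,
    "Humor": 9,
}


def texts_by_group(categories):
    """Sum text counts per top-level group (the part before the colon)."""
    totals = {}
    for name, count in categories.items():
        group = name.split(":")[0]
        totals[group] = totals.get(group, 0) + count
    return totals


if __name__ == "__main__":
    print("Total texts:", sum(BROWN_CATEGORIES.values()))
    for group, count in texts_by_group(BROWN_CATEGORIES).items():
        print(f"{group}: {count}")
```

The counts add up to 500 texts, which matches the 500 samples of roughly 2,000 words each that make up the one-million-word corpus.
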
Once published, this resource was widely used in computational linguistics, and for many years it was one of the most cited resources in the field of linguistics. Moreover, the Brown Corpus served as the basis for later corpora, such as the Lancaster-Oslo-Bergen Corpus and SUSANNE.

Finally, the Brown Corpus is a POS-tagged corpus (part-of-speech tagging, that is, lexical analysis), and it has 82 different tags.

2.2 Susanne

Susanne stands for Surface and Underlying Structural Analyses of Naturalistic English. It was created with the sponsorship of the Economic and Social Research Council (United Kingdom) as part of the development of a comprehensive NLP-oriented taxonomy and annotation scheme for the (logical and surface) grammar of English.

In its initial version, it is made up of approximately 130,000 words from 64 documents extracted from the Brown Corpus. The files average approximately 86 kilobytes in size, and each contains more than 2,000 words. Like the Brown Corpus, Susanne is a POS-tagged (lexical) corpus. It uses 353 tags, and its texts cover the following themes:
  • Press reports
  • Belles lettres, biographies and memoirs
  • Scientific or technical articles
  • Adventure and Western fiction
The SUSANNE analytical scheme was developed on the basis of samples of American English. It was initially oriented towards written language only, and in fact the corpus contains written samples exclusively. Several extensions have been built on this corpus, such as the CHRISTINE Corpus, presented in 1999, which analyses a balanced sample of English spoken throughout the United Kingdom in the preceding decade.

Finally, although Susanne uses only a subset of the documents in the Brown Corpus, it notably improves probabilistic analysis compared with the Brown Corpus.

2.3 Penn Treebank

Penn Treebank is a corpus created by the University of Pennsylvania, containing more than 4.5 million words of American English. This corpus uses the following types of tagging:
  • Lexical. Each word receives a grammatical (part-of-speech) tag that allows the word to be recognized and interpreted.
  • Syntactic. Syntactic information is represented through a tree structure (hence the name Treebank).
The following bracketed tree, taken from Wikipedia, illustrates the notation:
 
   (S (NP (NNP Victor))
     (VP (VBZ ama)
       (PP (TO a)
         (NP (NNP Maria))))
     (. .))
   
This corpus has 36 part-of-speech tags plus 12 tags for punctuation and symbols, and 14 syntactic labels plus 4 labels for null elements.
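
To show how the bracketed notation encodes both annotation levels at once, here is a minimal sketch of a reader for such trees, written with the Python standard library only. The names `Node`, `parse` and `tagged_words` are hypothetical helpers for this report, not part of the Penn Treebank tools. The verb tag is written VBZ, the Penn Treebank tag for a third-person singular present verb, and malformed input is only partly checked.

```python
import re
from dataclasses import dataclass, field


@dataclass
class Node:
    """A tree node: a label plus children (sub-nodes or word strings)."""
    label: str
    children: list = field(default_factory=list)


def parse(text):
    """Parse one bracketed tree such as '(S (NP (NNP Victor)) ...)'."""
    # Tokens are opening brackets, closing brackets, or bare symbols.
    tokens = re.findall(r"\(|\)|[^\s()]+", text)
    pos = 0

    def read():
        nonlocal pos
        if tokens[pos] != "(":
            raise ValueError("expected '('")
        pos += 1
        node = Node(tokens[pos])  # first symbol after '(' is the label
        pos += 1
        while tokens[pos] != ")":
            if tokens[pos] == "(":
                node.children.append(read())       # nested constituent
            else:
                node.children.append(tokens[pos])  # a word (leaf)
                pos += 1
        pos += 1  # consume ')'
        return node

    tree = read()
    if pos != len(tokens):
        raise ValueError("unexpected text after the tree")
    return tree


def tagged_words(node):
    """Return (word, POS tag) pairs, i.e. the lexical layer of the tree."""
    if len(node.children) == 1 and isinstance(node.children[0], str):
        return [(node.children[0], node.label)]
    pairs = []
    for child in node.children:
        if isinstance(child, Node):
            pairs.extend(tagged_words(child))
    return pairs


EXAMPLE = "(S (NP (NNP Victor)) (VP (VBZ ama) (PP (TO a) (NP (NNP Maria)))) (. .))"

if __name__ == "__main__":
    tree = parse(EXAMPLE)
    print(tree.label)          # syntactic layer: the root constituent
    print(tagged_words(tree))  # lexical layer: word/tag pairs
```

The labels on inner nodes (S, NP, VP, PP) belong to the syntactic layer, while the labels directly above each word (NNP, VBZ, TO) belong to the lexical layer.
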

The samples that make up the corpus come from several different sources, specifically the following:
  • Dept. of Energy abstracts
  • Dow Jones Newswire stories
  • Dept. of Agriculture bulletins
  • Library of America texts
  • MUC-3 messages
  • IBM Manual sentences
  • WBUR radio transcripts
  • ATIS sentences
  • Brown Corpus, retagged

3. Comparison of the Corpora

In this section the following aspects of the three corpora will be compared:
  • Type of tagging: lexical tagging (POS tagging), syntactic, etc.
  • Corpus size
  • Label set size
  • Themes included
  • Origin of the texts: newspapers, speech transcriptions, etc.

3.1 Type of tagging: lexical tagging (POS tagging), syntactic, etc.

Corpus          Type of tagging
Brown           Lexical (POS)
Susanne         Lexical (POS)
Penn Treebank   Lexical (POS) and syntactic

3.2 Corpus size

Corpus          Corpus size
Brown           500 samples of 2,000 or more words (1,014,312 words in total)
Susanne         64 samples of at least 2,000 words (approximately 130,000 words in total)
Penn Treebank   4,885,798 words in total
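
To put these figures in proportion, the short sketch below expresses each corpus size relative to the Brown Corpus. The word counts are the ones given in this section; the names `CORPUS_WORDS` and `relative_size` are illustrative only.

```python
# Total word counts as reported in this section (Susanne is approximate).
CORPUS_WORDS = {
    "Brown": 1_014_312,
    "Susanne": 130_000,
    "Penn Treebank": 4_885_798,
}


def relative_size(corpus, reference="Brown"):
    """Size of `corpus` as a multiple of the size of `reference`."""
    return CORPUS_WORDS[corpus] / CORPUS_WORDS[reference]


if __name__ == "__main__":
    for name in CORPUS_WORDS:
        print(f"{name}: {relative_size(name):.3f} x Brown")
    # Susanne takes 64 of Brown's 500 samples, so the sample share
    # should be close to the word share computed above.
    print(f"Susanne share of Brown samples: {64 / 500:.3f}")
```

Susanne's share of Brown's words (about 0.128) agrees with its share of Brown's samples (64 of 500), while Penn Treebank is nearly five times the size of Brown.
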

3.3 Label set size

Corpus          Label set size
Brown           82 tags of six kinds:
  • Major form classes (parts of speech): noun (common and proper), verb, adjective...
  • Function words: determiners, prepositions, conjunctions...
  • Certain important individual words: not, existential there, infinitival to, and the forms of the verbs do, be and have
  • Punctuation marks of syntactic significance
  • Inflectional morphemes
  • Two special tags (FW and NC) for foreign words and cited words
Susanne         353 wordtags
Penn Treebank   36 lexical tags plus 12 for punctuation and symbols; 14 syntactic labels plus 4 for null elements

3.4 Topics included

Corpus Themes included
Brown
  • PRESS: Reportage (44 texts)
  • PRESS: Editorial (27 texts)
  • PRESS: Reviews (17 texts)
  • RELIGION (17 texts)
  • SKILLS AND HOBBIES (36 texts)
  • POPULAR LORE (48 texts)
  • BELLES LETTRES, BIOGRAPHY, MEMOIRS (75 texts)
  • MISCELLANEOUS: US Government and house organs (30 texts)
  • LEARNED (science, mathematics...) (80 texts)
  • FICTION: General (29 texts)
  • FICTION: Mystery and detective fiction (24 texts)
  • FICTION: Science (6 texts)
  • FICTION: Adventure and Western (29 texts)
  • FICTION: Romance and love stories (29 texts)
  • HUMOR (9 texts)
Susanne
  • Press reports
  • Belles lettres, biographies and memoirs
  • Scientific or technical articles
  • Adventure and Western fiction
Penn Treebank
  • PRESS: Reportage (44 texts)
  • PRESS: Editorial (27 texts)
  • PRESS: Reviews (17 texts)
  • RELIGION (17 texts)
  • SKILLS AND HOBBIES (36 texts)
  • POPULAR LORE (48 texts)
  • BELLES LETTRES, BIOGRAPHY, MEMOIRS (75 texts)
  • MISCELLANEOUS: US Government and house organs (30 texts)
  • LEARNED (science, mathematics...) (80 texts)
  • FICTION: General (29 texts)
  • FICTION: Mystery and detective fiction (24 texts)
  • FICTION: Science (6 texts)
  • FICTION: Adventure and Western (29 texts)
  • FICTION: Romance and love stories (29 texts)
  • HUMOR (9 texts)

3.5 Origin of the texts: newspapers, speech transcriptions, etc.

Corpus          Origin of the texts
Brown           A rich and varied body of documents compiled by the authors through several computational analyses, combining elements of linguistics, psychology, statistics and sociology
Susanne         64 of the 500 samples in the Brown Corpus
Penn Treebank   Dept. of Energy abstracts, Dow Jones Newswire stories, Dept. of Agriculture bulletins, Library of America texts, MUC-3 messages, IBM Manual sentences, WBUR radio transcripts, ATIS sentences, Brown Corpus (retagged)

4. Susanne vs. Brown

Based on the information collected, Susanne has a more precise, more granular and more easily interpreted label set than the Brown Corpus. As a result, probabilistic analysis improves significantly compared with the Brown Corpus, giving better results in parsing, interpretation and text analysis with automatic parsing techniques.