The objective of this project is to gather information about the Brown, Susanne and Penn Treebank corpora and to present it in the sections below.
2. Description of the corpora
In this section the following corpora are described:
- Brown
- Susanne
- Penn Treebank
2.1 Brown Corpus
In 1967 Henry Kučera and W. Nelson Francis published Computational Analysis of Present-Day American English, a study based on the first corpus of American English. The work was published by Brown University, and the corpus is now known as the Brown Corpus.
To prepare this corpus, the authors carried out several computational analyses, which allowed them to compile a rich and varied body of documents combining elements of linguistics, psychology, statistics and sociology. The result was the Brown Corpus, a collection of one million words drawn from a wide variety of sources written in contemporary American English and belonging to the following categories:
- PRESS: Reportage (44 texts)
- PRESS: Editorial (27 texts)
- PRESS: Reviews (17 texts)
- RELIGION (17 texts)
- SKILLS AND HOBBIES (36 texts)
- POPULAR LORE (48 texts)
- BELLES-LETTRES (biographies, memoirs...) (75 texts)
- MISCELLANEOUS: US Government and house organs (30 texts)
- LEARNED (science, mathematics...) (80 texts)
- FICTION: General (29 texts)
- FICTION: Mystery and detective fiction (24 texts)
- FICTION: Science (6 texts)
- FICTION: Adventure and Western (29 texts)
- FICTION: Romance and love stories (29 texts)
- HUMOR (9 texts)
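The text counts in the list above can be checked with a short tally. The following is a minimal sketch in Python: the single-letter keys are the category codes conventionally used for the Brown Corpus (an assumption here, since the list above does not give them), and the one-million-word total is the approximate figure quoted earlier.

```python
# Texts per category, in the order listed above. The single-letter keys are
# the conventional Brown Corpus category codes (an assumption, not taken
# from the list itself).
TEXTS_PER_CATEGORY = {
    "A": 44, "B": 27, "C": 17,                    # press
    "D": 17,                                      # religion
    "E": 36, "F": 48, "G": 75,                    # hobbies, lore, letters
    "H": 30,                                      # miscellaneous
    "J": 80,                                      # science, mathematics...
    "K": 29, "L": 24, "M": 6, "N": 29, "P": 29,   # fiction
    "R": 9,                                       # humor
}
TOTAL_WORDS = 1_000_000  # approximate corpus size quoted above

total_texts = sum(TEXTS_PER_CATEGORY.values())
words_per_text = TOTAL_WORDS / total_texts
fiction_texts = sum(TEXTS_PER_CATEGORY[code] for code in "KLMNP")

print(total_texts)      # 500
print(words_per_text)   # 2000.0
print(fiction_texts)    # 117
```

Under these figures the corpus contains 500 texts of roughly 2,000 words each, of which 117 are fiction.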
Once published, this resource was widely used in computational linguistics, and for many years it was one of the most cited resources in the field. Moreover, the Brown Corpus served as the basis for later corpora, such as the Lancaster-Oslo-Bergen Corpus and SUSANNE.
Finally, the Brown Corpus is a POS-tagged corpus (part-of-speech tagging, i.e. lexical analysis) and it has 82 different tags.
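In a POS-tagged corpus every token is paired with a tag. The sketch below assumes the plain-text "word/TAG" format in which the tagged Brown Corpus is commonly distributed; the sample line is a short illustrative fragment rather than a quotation from the corpus, and the helper name parse_tagged is hypothetical.

```python
from collections import Counter

def parse_tagged(line):
    """Split whitespace-separated 'word/TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in line.split():
        # Split at the last slash only, so that words containing '/' survive.
        word, _, tag = token.rpartition("/")
        pairs.append((word, tag))
    return pairs

# Illustrative fragment (assumed, not quoted from the corpus).
SAMPLE = "The/AT jury/NN said/VBD ./."

pairs = parse_tagged(SAMPLE)
tag_counts = Counter(tag for _, tag in pairs)

print(pairs)
print(tag_counts["NN"])
```

Counting tags in this way is the usual starting point for the frequency statistics that a tagged corpus makes possible.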
2.2 Susanne
Susanne stands for Surface and Underlying Structural Analyses of Naturalistic English. It was created with the sponsorship of the Economic and Social Research Council (United Kingdom) as part of the development of a comprehensive NLP-oriented taxonomy and annotation scheme for the (logical and surface) grammar of English.
In its initial version, it is made up of approximately 130,000 words taken from 64 documents extracted from the Brown Corpus. The average file size is approximately 86 kilobytes, and each file contains more than 2,000 words. Like the Brown Corpus, Susanne is a POS-tagged corpus (lexical analysis), with a tagset of 353 tags. Its documents belong to the following genres:
- Press reports
- Belles-lettres, biographies and memoirs
- Scientific or technical articles
- Adventure and Western fiction
The SUSANNE analytic scheme was developed on the basis of samples of American English. It was initially oriented towards written language only, and the corpus in fact contains samples of written language exclusively. Several extensions have been built on this corpus, such as the CHRISTINE Corpus, presented in 1999, which analyses a balanced sample of the English spoken throughout the United Kingdom over the preceding decade.
Finally, although Susanne uses only a subset of the documents in the Brown Corpus, it notably improves on the Brown Corpus for probabilistic analysis.
2.3 Penn Treebank
The Penn Treebank is a corpus created by the University of Pennsylvania, comprising more than 4.5 million words of American English. It uses the following types of annotation:
- Lexical. Each word carries a grammatical (part-of-speech) tag that allows it to be recognized and interpreted.
- Syntactic. Syntactic information is represented by means of a tree structure (hence "treebank").
To make this easier to understand, here is an example tree taken from Wikipedia for the Spanish sentence "Victor ama a Maria" ("Victor loves Maria"):
(S (NP (NNP Victor))
   (VP (VBZ ama)
       (PP (TO a)
           (NP (NNP Maria))))
   (. .))
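The bracketed notation can be read mechanically: each "(label ...)" group is a node whose children are either further groups or a single word. The following is a minimal sketch of such a reader in Python, not the tooling distributed with the Penn Treebank; the function names are hypothetical, and the verb is tagged VBZ, the standard Penn Treebank tag for a third-person singular present-tense verb.

```python
def tokenize(text):
    """Split a bracketed string into '(', ')' and symbol tokens."""
    return text.replace("(", " ( ").replace(")", " ) ").split()

def parse(tokens, pos=0):
    """Parse one '(label child ...)' node starting at tokens[pos].

    Returns the node as a nested list [label, child, ...] together with
    the position of the first token after the node.
    """
    if tokens[pos] != "(":
        raise ValueError(f"expected '(' at token {pos}")
    node = [tokens[pos + 1]]          # the node label, e.g. 'NP'
    pos += 2
    while tokens[pos] != ")":
        if tokens[pos] == "(":
            child, pos = parse(tokens, pos)
            node.append(child)
        else:
            node.append(tokens[pos])  # a leaf word
            pos += 1
    return node, pos + 1

def tagged_words(node):
    """Collect (word, tag) pairs from the leaves, left to right."""
    label, children = node[0], node[1:]
    if len(children) == 1 and isinstance(children[0], str):
        return [(children[0], label)]
    pairs = []
    for child in children:
        if isinstance(child, list):
            pairs.extend(tagged_words(child))
    return pairs

EXAMPLE = "(S (NP (NNP Victor)) (VP (VBZ ama) (PP (TO a) (NP (NNP Maria)))) (. .))"

tree, _ = parse(tokenize(EXAMPLE))
print(tree[0])             # S
print(tagged_words(tree))
```

Reading the leaves from left to right recovers the lexical annotation (one tag per word), while the nesting of the lists carries the syntactic annotation.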
This corpus has 36 part-of-speech tags plus 12 tags for punctuation and symbols, as well as 14 syntactic tags and 4 null elements.
The samples that make up the corpus come from several different sources, specifically the following:
- Dept. of Energy abstracts
- Dow Jones Newswire stories
- Dept. of Agriculture bulletins
- Library of America texts
- MUC-3 messages
- IBM Manual sentences
- WBUR radio transcripts
- ATIS sentences
- Brown Corpus, retagged