Example of manual labeling.
1. Introduction
The web page https://nlp.stanford.edu/links/statnlp.html lists numerous statistical taggers in its "Part of Speech Taggers" section. They are based on different models (HMMs, Support Vector Machines, etc.), use different training corpora, and target different languages. The task is to compare the behavior of at least two of them: we study and describe them, use them to tag a short text, and then compare the results, namely the tags used by each tagger and its tagging accuracy.
2. Description of selected taggers
The taggers chosen are:
- TreeTagger
- Stanford Log-linear Part-Of-Speech Tagger
- tTAG
2.1 TreeTagger
TreeTagger is a tool for part-of-speech tagging, sentence analysis and lemma extraction. It was developed by Helmut Schmid in the TC project at the Institute for Computational Linguistics of the University of Stuttgart, with the aim of being used for part-of-speech tagging and lemmatization. To run it, you need the model for the selected language (a "parameter" file with the extension .par), which can be obtained from the TreeTagger website. The website offers parameter files for analyzing texts in English, French, German, Italian, Spanish, Russian, Bulgarian, Dutch, Estonian, Finnish, Galician, Latin, Mongolian, Polish, Slovak and Swahili. For a language that has no model, the tool lets the user create one: it is necessary to manually tag a sample text and then run a training program (provided with TreeTagger) to build the model.
2.1.1 Installation
In my case, I installed the application on a personal computer running Windows 7, using the installer downloaded from https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/
To install it I performed the following steps:
- First, install a Perl interpreter, which can be downloaded from http://www.activestate.com/activeperl/
- Extract the files from the downloaded zip archive into the C:\ directory of the computer.
- Once unzipped, download the parameter files for the languages you need from the website. The file names must end in -utf8.par and the files must be stored in the TreeTagger/lib subdirectory.
- Add the path C:\TreeTagger\bin to the PATH environment variable.
- Open a Windows terminal and execute the following commands:
set PATH=C:\TreeTagger\bin;%PATH%
cd c:\TreeTagger
- Finally, run the program with the following command:
tag-<language> <filename>
3. Stanford Log-linear Part-Of-Speech Tagger
A part-of-speech tagger (POS tagger) is a tool that classifies the words of a text written in a given language. It assigns each word a category according to its function, such as noun, verb or adjective, and it can also use more fine-grained POS tags like "noun-plural". The Stanford tagger is implemented in Java and uses the log-linear taggers described in Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger.
To run this program you need:
- Java 1.8 or higher.
- Between 60 and 200 MB of memory to run the program.
- At least 1 GB of memory to train a tagger model.
4. tTAG
tTAG was implemented by Infogistics, an international company based in Edinburgh and founded by experts in text mining and document search. Its main product, tTAG, is a text tagger that can handle both ASCII-encoded text and XML-marked text.
tTAG incorporates a tokenizer (tNORM) that segments the text into words and sentences. Users who need their own tokenizer can replace tNORM with one of their own.
For morphology, tTAG uses a lexicon that can easily be extended with new words. When the software encounters an unknown word, it runs a prediction component that guesses a tag for it, and this component can be retrained by the user for new sublanguages. Thanks to this capability, tTAG reaches 96-98% accuracy on words found in the lexicon and 88-92% accuracy on unknown words.
tTAG can use either the pre-trained resources supplied with the application, which are based on the Penn Treebank tag set, or resources you develop yourself with your own tag set. When running tTAG, you simply specify which resource file to use for tagging.
When processing text marked up in SGML/XML, tTAG can tag the entire text or only some sections of the document (for example, paragraphs but not headings or subheadings). It can also output the tagged text as XML:
<SENTENCE>
<W TAG="PPS">He</W>
<W TAG="VBZ">books</W>
<W TAG="NNS">tickets</W>
</SENTENCE>
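As an illustration of how XML output of this shape could be consumed, here is a minimal Python sketch that reads the sample above with the standard library and returns (word, tag) pairs. The function name `read_tagged_words` is hypothetical, and the sketch assumes the output is well-formed XML with one `<W TAG="...">` element per word, as in the sample.

```python
import xml.etree.ElementTree as ET

# Sample copied from the tTAG XML output shown above.
TTAG_XML = """<SENTENCE>
<W TAG="PPS">He</W>
<W TAG="VBZ">books</W>
<W TAG="NNS">tickets</W>
</SENTENCE>"""


def read_tagged_words(xml_text):
    """Return a (word, tag) pair for every <W> element in the XML text."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("TAG")) for w in root.iter("W")]


for word, tag in read_tagged_words(TTAG_XML):
    print(word, tag)
```

Keeping the result as a plain list of pairs makes it easy to compare with the output of the other taggers later on.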
The tTAG syntactic analyzer uses the information already present in the document together with context-sensitive grammars to detect the boundaries of syntactic groups. The analyzer preserves all the information previously added to the text and creates structural elements that enclose the words of each group:
<NG>Este hombre</NG>
<VG>canta</VG>
5. Test text used.
For this task I chose the following English text from the book The Little Prince by Antoine de Saint-Exupéry:
Once when I was six years old I saw a magnificent picture in a book, called True Stories from Nature, about the primeval forest. It was a picture of a boa constrictor in the act of swallowing an animal. Here is a copy of the drawing.
Since all three taggers are able to tag texts in different languages, I also looked up the same passage in the Spanish edition in order to analyze it. The text used is the following:
When I was six years old I saw a book about the virgin forest called "Stories lived", a magnificent sheet. It represented a boa snake that swallowed a beast.
6. Labeling results with each selected tagger.
6.1 TreeTagger
Table 1 lists the tags that TreeTagger uses for English and Table 2 the tags it uses for Spanish. Table 3 shows the result for the English text and Table 4 the result for the Spanish text.
| POS Tag | Description | Example |
|---|---|---|
| CC | coordinating conjunction | and, but, or, & |
| CD | cardinal number | 1, three |
| DT | determiner | the |
| EX | existential there | there is |
| FW | foreign word | d'oeuvre |
| IN | preposition/subord. conj. | in, of, like, after, whether |
| IN/that | complementizer | that |
| JJ | adjective | green |
| JJR | adjective, comparative | greener |
| JJS | adjective, superlative | greenest |
| LS | list marker | (1), |
| MD | modal | could, will |
| NN | noun, singular or mass | table |
| NNS | plural noun | tables |
| NP | proper noun, singular | John |
| NPS | proper noun, plural | Vikings |
| PDT | predeterminer | both the boys |
| POS | possessive ending | friend's |
| PP | personal pronoun | I, he, it |
| PP\$ | possessive pronoun | my, his |
| RB | adverb | however, usually, here, not |
| RBR | adverb, comparative | better |
| RBS | adverb, superlative | best |
| RP | particle | give up |
| SENT | end punctuation | ?, !, . |
| SYM | symbol | @, +, *, ^, \|, = |
| TO | to | to go, to him |
| UH | interjection | uhhuhhuhh |
| VB | verb be, base form | be |
| VBD | verb be, past | was, were |
| VBG | verb be, gerund/participle | being |
| VBN | verb be, past participle | been |
| VBZ | verb be, pres, 3rd p. sing | is |
| VBP | verb be, pres non-3rd p. | am, are |
| VD | verb do, base form | do |
| VDD | verb do, past | did |
| VDG | verb do, gerund/participle | doing |
| VDN | verb do, past participle | done |
| VDZ | verb do, pres, 3rd per. sing | does |
| VDP | verb do, pres, non-3rd per. | do |
| VH | verb have, base form | have |
| VHD | verb have, past | had |
| VHG | verb have, gerund/participle | having |
| VHN | verb have, past participle | had |
| VHZ | verb have, pres 3rd per. sing | has |
| VHP | verb have, pres non-3rd per. | have |
| VV | verb, base form | take |
| VVD | verb, past tense | took |
| VVG | verb, gerund/participle | taking |
| VVN | verb, past participle | taken |
| VVP | verb, present, non-3rd p. | take |
| VVZ | verb, present 3rd p. sing. | takes |
| WDT | wh-determiner | which |
| WP | wh-pronoun | who, what |
| WP\$ | possessive wh-pronoun | whose |
| WRB | wh-adverb | where, when |
| : | general joiner | ;, -, -- |
| \$ | currency symbol | \$, £ |
Table 1. Tags used by TreeTagger in English
| TreeTagger Tag | Description |
|---|---|
| ABR | abbreviation |
| ADJ | adjective |
| ADV | adverb |
| DET:ART | article |
| DET:POS | possessive pronoun (ma, ta, …) |
| INT | interjection |
| KON | conjunction |
| NAM | proper name |
| NOM | noun |
| NUM | numeral |
| PRO | pronoun |
| PRO:DEM | demonstrative pronoun |
| PRO:IND | indefinite pronoun |
| PRO:PER | personal pronoun |
| PRO:POS | possessive pronoun (mien, tien, …) |
| PRO:REL | relative pronoun |
| PRP | preposition |
| PRP:det | preposition plus article (au, du, aux, des) |
| PUN | punctuation |
| PUN:cit | punctuation citation |
| SENT | sentence tag |
| SYM | symbol |
| VER:cond | verb conditional |
| VER:futu | verb future |
| VER:impe | verb imperative |
| VER:impf | verb imperfect |
| VER:infi | verb infinitive |
| VER:pper | verb past participle |
| VER:ppre | verb present participle |
| VER:pres | verb present |
| VER:simp | verb simple past |
| VER:subi | verb subjunctive imperfect |
| VER:subp | verb subjunctive present |
Table 2. Tags used by TreeTagger in Spanish
| Original Text | Label | Lemma |
|---|---|---|
| Once | RB | once |
| when | WRB | when |
| I | PP | I |
| was | VBD | be |
| six | CD | six |
| years | NNS | year |
| old | JJ | old |
| I | PP | I |
| saw | VVD | see |
| a | DT | a |
| magnificent | JJ | magnificent |
| picture | NN | picture |
| in | IN | in |
| a | DT | a |
| book | NN | book |
| , | , | , |
| called | VVD | call |
| True | NP | True |
| Stories | NP | Stories |
| from | IN | from |
| Nature | NP | Nature |
| , | , | , |
| about | IN | about |
| the | DT | the |
| primeval | JJ | primeval |
| forest | NN | forest |
| . | SENT | . |
| It | PP | it |
| was | VBD | be |
| a | DT | a |
| picture | NN | picture |
| of | IN | of |
| a | DT | a |
| boa | NN | boa |
| constrictor | NN | constrictor |
| in | IN | in |
| the | DT | the |
| act | NN | act |
| of | IN | of |
| swallowing | VVG | swallow |
| an | DT | an |
| animal | NN | animal |
| . | SENT | . |
| Here | RB | here |
| is | VBZ | be |
| a | DT | a |
| copy | NN | copy |
| of | IN | of |
| the | DT | the |
| drawing | NN | drawing |
| . | SENT | . |
Table 3. Tags assigned by TreeTagger to the English text
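To work with a result like the one in the table programmatically, the raw TreeTagger output has to be parsed first. The following Python sketch assumes the usual layout of one token per line with word, tag and lemma separated by tabs; the names `TaggedToken` and `parse_treetagger` are hypothetical, and the sample rows are taken from the table above.

```python
from collections import namedtuple

# One row of tagger output: surface form, POS tag and lemma.
TaggedToken = namedtuple("TaggedToken", ["word", "tag", "lemma"])

# Assumed raw output format: word<TAB>tag<TAB>lemma, one token per line.
SAMPLE_OUTPUT = "was\tVBD\tbe\nsix\tCD\tsix\nyears\tNNS\tyear\nsaw\tVVD\tsee\n"


def parse_treetagger(output):
    """Turn tab-separated tagger output into a list of TaggedToken rows."""
    tokens = []
    for line in output.splitlines():
        fields = line.split("\t")
        if len(fields) != 3:
            continue  # skip blank or malformed lines
        tokens.append(TaggedToken(*fields))
    return tokens


tokens = parse_treetagger(SAMPLE_OUTPUT)
print([t.tag for t in tokens])
```

Lines that do not have exactly three fields are skipped rather than raising an error, which is a design choice for this sketch and not something the tool prescribes.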
| Original Text | Label | Lemma |
|---|---|---|
| When | CSUBX | when |
| i | PPX | i |
| had | VLfin | have |
| six | CARD | six |
| years | NC | year |
| I saw | CARD | I saw |
| in | PREP | in |
| a | ART | a |
| book | NC | book |
| about | PREP | about |
| the | ART | the |
| jungle | NC | jungle |
| virgin | NC | virgin |
| that | CQUE | that |
| if | SE | if |
| titled | VLfin | holder |
| " | QT | " |
| Stories | NP | Stories |
| lived | VLadj | live |
| " | QT | " |
| , | CM | , |
| one | ART | a |
| magnificent | ADJ | magnificent |
| sheet | NC | sheet |
| . | FS | . |
| represented | VLfin | represent |
| one | ART | a |
| snake | NC | snake |
| boa | NC | boa |
| that | CQUE | that |
| if | SE | if |
| swallowed | VLfin | swallowed |
| to | PREP | to |
| one | ART | a |
| beast | NC | beast |
| . | FS | . |
Table 4. Tags assigned by TreeTagger to the Spanish text
6.2 Stanford Log-linear Part-Of-Speech Tagger
In the first part of this section I used the Stanford Log-linear Part-Of-Speech Tagger to tag the English text with the following command:
java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -textFile sample-input.txt > sample-tagged.txt
Once the command has been executed, Figure 1 helps to interpret its output: it shows all the labels that the program can generate for a text in English.
Once_RB when_WRB I_PRP was_VBD six_CD years_NNS old_JJ I_PRP saw_VBD a_DT magnificent_JJ picture_NN in_IN a_DT book_NN ,_, called_VBN True_NNP Stories_NNP from_IN Nature_NNP ,_, about_IN the_DT primeval_JJ forest_NN ._. It_PRP was_VBD a_DT picture_NN of_IN a_DT boa_NN constrictor_NN in_IN the_DT act_NN of_IN swallowing_VBG an_DT animal_NN ._. Here_RB is_VBZ a_DT copy_NN of_IN the_DT drawing_NN ._.
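The word_TAG format of this output is easy to split into pairs for later comparison. The Python sketch below is one possible way to do it, assuming tokens are separated by whitespace and that the last underscore in each token separates the word from its tag; the function name `parse_word_tag` is hypothetical.

```python
def parse_word_tag(text, sep="_"):
    """Split whitespace-separated word_TAG tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the LAST separator so that words containing "_" survive.
        word, found, tag = token.rpartition(sep)
        if not found:
            continue  # token without a tag, ignore it
        pairs.append((word, tag))
    return pairs


sample = "Once_RB when_WRB I_PRP was_VBD six_CD years_NNS old_JJ ._."
for word, tag in parse_word_tag(sample):
    print(word, tag)
```

Punctuation tokens such as `._.` are handled by the same rule, since both the word and the tag are the punctuation mark itself.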
After the test with the English text, I ran the same test with the Spanish text. To better understand the tags used by this software, I recommend visiting https://nlp.stanford.edu/software/spanish-faq.shtml. The result obtained is the following:
When_cs i_pp000000 had_vmii000 six_dn0000 years_nc0p000 vi_vmis000 en_sp000 un_di0000 libro_nc0s000 sobre_sp000 la_da0000 jungle_nc0s000 virgin_nc0s000 que_cs se_p0000000 titulaba_vmii000 "_fe Stories_np00000 vivid_aq0000 "_fe ,_fc una_di0000 magnificent_aq0000 lamina_nc0s000 ._fp Representaba_vmii000 una_di0000 snake_nc0s000 boa_np00000 que_pr000000 se_p0000000 swallowed_vmii000 a_sp000 una_di0000 beast_nc0s000 ._fp
6.3 tTAG
First of all, I recommend consulting Table 5, which lists all the Penn Treebank tags and their meanings. With that table in mind, the result for the English text is as follows:
Once_RB when_WRB ([ I_PRP ])
<: was_VBD>([ six_CD years_NNS ]) old_JJ ([ I_PRP ])
<: saw_VBD>([ a_DT magnificent_JJ picture_NN ]) in_IN ([ a_DT book_NN ]),_,
<: called_VBD>([ True_NNP Stories_NNP ]) from_IN ([ Nature_NNP ]),_, about_IN ([ the_DT primeval_JJ forest_NN
])._.([ It_PRP ])
<: was_VBD>([ a_DT picture_NN ]) of_IN ([ a_DT boa_NN constrictor_NN ]) in_IN ([ the_DT act_NN ]) of_IN
swallowing_VBG ([ an_DT animal_NN ])._.Here_RB
<: is_VBZ>([ a_DT copy_NN ]) of_IN ([ the_DT drawing_NN ])._.
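The bracketed output above mixes tags with grouping markers. The following Python sketch shows one way to pull out the tagged tokens and the bracketed groups with regular expressions. It assumes that `([ ... ])` delimits noun groups and `<: ... >` verb groups, which is how the output appears to be organized but is not confirmed by the tool's documentation; the names `TOKEN`, `NOUN_GROUP`, `tagged_tokens` and `noun_groups` are hypothetical.

```python
import re

# A token is word_TAG; the tag is upper-case letters (optionally ending in $)
# or a single punctuation mark. Bracket characters are never part of a word.
TOKEN = re.compile(r"([^\s_\[\]()<>:]+)_([A-Z]+\$?|[.,])")
# Assumption: "([ ... ])" encloses one noun group.
NOUN_GROUP = re.compile(r"\(\[(.*?)\]\)", re.DOTALL)

# Two lines copied from the tTAG output shown above.
SAMPLE = """<: was_VBD>([ a_DT picture_NN ]) of_IN ([ a_DT boa_NN constrictor_NN ]) in_IN ([ the_DT act_NN ]) of_IN
swallowing_VBG ([ an_DT animal_NN ])._.Here_RB"""


def tagged_tokens(text):
    """Return every (word, tag) pair, ignoring the grouping markers."""
    return TOKEN.findall(text)


def noun_groups(text):
    """Return the words of each bracketed group as a list of lists."""
    return [[word for word, _ in TOKEN.findall(group)]
            for group in NOUN_GROUP.findall(text)]


print(tagged_tokens(SAMPLE))
print(noun_groups(SAMPLE))
```

Because the pattern matches tokens rather than splitting on whitespace, it also copes with places where punctuation and the next word are written without a space, as in `._.Here_RB`.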
For the text in Spanish, we have the following result:
([ When_NNP ]) I_FW was_FW six_FW years_FW saw_FW in_FW a_FW book_FW about_FW the_FW jungle_FW virgin_FW that_FW was_FW titled_FW "_'' ([ Stories_NNP lived_NNS ])"_" ,_,a_FW magnificent_FW sheet_FW ._.([ Represented_NNP ]) a_FW snake_FW ([ boa_NN ]) that_FW swallowed_FW a_FW a_FW fierce_FW ._.
| Number | Tag | Description |
|---|---|---|
| 1. | CC | Coordinating conjunction |
| 2. | CD | Cardinal number |
| 3. | DT | Determiner |
| 4. | EX | Existential there |
| 5. | FW | Foreign word |
| 6. | IN | Preposition or subordinating conjunction |
| 7. | JJ | Adjective |
| 8. | JJR | Adjective, comparative |
| 9. | JJS | Adjective, superlative |
| 10. | LS | List item marker |
| 11. | MD | Modal |
| 12. | NN | Noun, singular or mass |
| 13. | NNS | Noun, plural |
| 14. | NNP | Proper noun, singular |
| 15. | NNPS | Proper noun, plural |
| 16. | PDT | Predeterminer |
| 17. | POS | Possessive ending |
| 18. | PRP | Personal pronoun |
| 19. | PRP\$ | Possessive pronoun |
| 20. | RB | Adverb |
| 21. | RBR | Adverb, comparative |
| 22. | RBS | Adverb, superlative |
| 23. | RP | Particle |
| 24. | SYM | Symbol |
| 25. | TO | to |
| 26. | UH | Interjection |
| 27. | VB | Verb, base form |
| 28. | VBD | Verb, past tense |
| 29. | VBG | Verb, gerund or present participle |
| 30. | VBN | Verb, past participle |
| 31. | VBP | Verb, non-3rd person singular present |
| 32. | VBZ | Verb, 3rd person singular present |
| 33. | WDT | Wh-determiner |
| 34. | WP | Wh-pronoun |
| 35. | WP\$ | Possessive wh-pronoun |
| 36. | WRB | Wh-adverb |
Table 5. Penn Treebank tag set
7. Observations on the comparison of the results.
To compare the results for the English text, the first thing to analyze is the tag sets used. In my experiment all three systems used tag sets based on the Penn Treebank, which makes the task easier. However, they do not all use the same names for the tags. From the results and the documentation provided I observed the following differences between TreeTagger on the one hand and the Stanford Log-linear Part-Of-Speech Tagger and tTAG on the other:
- In TreeTagger personal pronouns are tagged as PP, while the Stanford Log-linear Part-Of-Speech Tagger and tTAG tag them as PRP. Example:
| Word | TreeTagger | Stanford Log-linear Part-Of-Speech Tagger and tTAG |
|---|---|---|
| I | PP | PRP |
| It | PP | PRP |
- In TreeTagger past-tense verbs are tagged as VVD, while the Stanford Log-linear Part-Of-Speech Tagger and tTAG tag them as VBD (for called, the Stanford tagger chose the past participle tag VBN instead). Example:
| Word | TreeTagger | Stanford Log-linear Part-Of-Speech Tagger and tTAG |
|---|---|---|
| saw | VVD | VBD |
| called | VVD | VBN (Stanford), VBD (tTAG) |
- In TreeTagger proper nouns are tagged as NP, while the Stanford Log-linear Part-Of-Speech Tagger and tTAG tag them as NNP. Example:
| Word | TreeTagger | Stanford Log-linear Part-Of-Speech Tagger and tTAG |
|---|---|---|
| True | NP | NNP |
| Nature | NP | NNP |
| Stories | NP | NNP |
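Because the taggers name the same categories differently, a fair comparison needs the tags mapped onto one naming scheme first. The Python sketch below illustrates the idea with a small, hypothetical mapping from TreeTagger's English tag names to Penn Treebank names (covering only the differences listed above plus the VV/VH/VD verb tags) and a simple agreement ratio. Agreement between two taggers is not the same as accuracy, which would require a manually tagged reference.

```python
# Hypothetical, partial mapping from TreeTagger names to Penn Treebank names.
TREETAGGER_TO_PENN = {
    "PP": "PRP", "PP$": "PRP$", "NP": "NNP", "NPS": "NNPS", "SENT": ".",
}


def to_penn(tag):
    """Map a TreeTagger English tag to its assumed Penn Treebank name."""
    if tag in TREETAGGER_TO_PENN:
        return TREETAGGER_TO_PENN[tag]
    # VV*, VH* and VD* (lexical verbs, have, do) collapse onto Penn VB*.
    if len(tag) >= 2 and tag[0] == "V" and tag[1] in "VHD":
        return "VB" + tag[2:]
    return tag


def agreement(tags_a, tags_b):
    """Fraction of positions where two aligned tag sequences coincide."""
    if len(tags_a) != len(tags_b):
        raise ValueError("tag sequences must be aligned token by token")
    if not tags_a:
        return 0.0
    return sum(a == b for a, b in zip(tags_a, tags_b)) / len(tags_a)


# Tags for "I saw a magnificent picture", taken from the results above.
treetagger = ["PP", "VVD", "DT", "JJ", "NN"]
stanford = ["PRP", "VBD", "DT", "JJ", "NN"]

raw = agreement(treetagger, stanford)
normalized = agreement([to_penn(t) for t in treetagger], stanford)
print(raw, normalized)
```

On this fragment the two taggers differ only in tag names, so the normalized agreement is higher than the raw one.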
For the Spanish text, the tTAG output contains fragments such as:
I_FW had_FW six_FW
Here the tagger assigns i, had and six the same tag, FW, although i is a pronoun, had is a verb and six is a numeral. This makes me think that the web version of tTAG is not prepared for texts in other languages, and that the FW tag works as a marker for words that are unknown to the model. The Stanford Log-linear Part-Of-Speech Tagger, for its part, uses the tags of its own model, which is much more detailed than the Penn Treebank tag set and gives the user much more information. The tags can be consulted at https://nlp.stanford.edu/software/spanish-faq.shtml. Finally, TreeTagger is run in the same way and produces the same kind of output with the Spanish model as with the English one. Comparing the results of TreeTagger and the Stanford Log-linear Part-Of-Speech Tagger, the second tagger gives more information than the first about the lexical function of each word. Therefore, thanks to the model it uses, I consider it a better option than the first.