Example of manual labeling.

1. Introduction

The web page https://nlp.stanford.edu/links/statnlp.html lists numerous statistical taggers in its "Part of Speech Taggers" section. They are based on different models (HMMs, support vector machines, etc.), use different training corpora, support different languages, and so on.

In this task we must compare the behavior of at least two of them. We will study them, describe them and use them to tag a small text.

Then we will compare the results: the tags used by each tagger and the tagging accuracy.

2. Description of selected taggers

The chosen taggers are:
  • TreeTagger
  • Stanford Log-linear Part-Of-Speech Tagger
  • tTAG

2.1 TreeTagger

TreeTagger is a tool for annotating text with part-of-speech and lemma information.

It was developed by Helmut Schmid in the TC project at the Institute for Computational Linguistics of the University of Stuttgart, for part-of-speech tagging and lemmatization. To run it, you need the model for the selected language (a "parameter" file with the extension .par), which can be obtained from the TreeTagger website. The website provides parameter files for analyzing texts in English, French, German, Italian, Spanish, Russian, Bulgarian, Dutch, Estonian, Finnish, Galician, Latin, Mongolian, Polish, Slovak and Swahili. For a language without a model, the tool lets the user create a new one: a sample text has to be tagged manually and then passed to a training program (provided with TreeTagger) that creates the model.

2.1.1 Installation

In my case, I installed the application on a personal computer running Windows 7, using the installer downloaded from the following URL: https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

To install it, I performed the following steps:
  • First, install the Perl interpreter, which can be downloaded from http://www.activestate.com/activeperl/
  • Next, extract the files from the downloaded zip archive into the C:\ directory of the computer.
  • Once the archive is unzipped, download from the website the parameter files for the languages you need. The files must be named <language>-utf8.par and must be stored in the TreeTagger\lib subdirectory.
  • Add the following path to the PATH environment variable:

    C:\TreeTagger\bin
  • Then open a Windows terminal and execute the following commands:

    set PATH=C:\TreeTagger\bin;%PATH%
    cd C:\TreeTagger

  • Finally, run the program with the following command:

    tag-<language> <filename>


3. Stanford Log-linear Part-Of-Speech Tagger

A part-of-speech tagger (POS tagger) is a tool that classifies the words of a written text in a specific language. It assigns each word a category according to its function (noun, verb, adjective, etc.) and can also use more fine-grained POS tags such as "noun-plural". The Stanford tagger is implemented in Java and uses the log-linear part-of-speech taggers described in the paper "Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger".

To run this program you need:
  • Java 1.8 or higher.
  • Between 60 and 200 MB of memory to run a trained tagger.
  • At least 1 GB of memory to train a new tagger model.
The basic download contains two trained tagger models for English, which use the Penn Treebank tag set.

The software is distributed under the GNU General Public License (v2 or later).

4. tTAG

tTAG was implemented by Infogistics, an international company based in Edinburgh and founded by experts in text mining and document search.

Its main product, tTAG, is a text tagger that can handle both ASCII-encoded text and text marked up in XML. tTAG incorporates a tokenizer (tNORM) that segments the text into words and sentences. If users need their own tokenizer, the application allows tNORM to be replaced with a custom one.

tTAG uses a lexicon that can easily be extended with new words. When it encounters an unknown word, it runs a guesser that predicts the tag of that word. This component can be retrained by the user for new sublanguages. Thanks to this capability, tTAG achieves an accuracy of 96-98% on words found in the lexicon and 88-92% on unknown words.
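
To see what these two figures imply for a whole text, the overall accuracy can be estimated as a weighted average of the accuracy on known and on unknown words. The following Python sketch does this; the 5% share of unknown words is a hypothetical value chosen only for illustration and does not come from the tTAG documentation.

```python
def overall_accuracy(known_acc, unknown_acc, unknown_rate):
    """Weighted average of the accuracy on known and unknown words."""
    if not 0.0 <= unknown_rate <= 1.0:
        raise ValueError("unknown_rate must be between 0 and 1")
    return (1.0 - unknown_rate) * known_acc + unknown_rate * unknown_acc


# Accuracy ranges quoted above; the 5% unknown-word rate is an assumption.
low = overall_accuracy(0.96, 0.88, unknown_rate=0.05)
high = overall_accuracy(0.98, 0.92, unknown_rate=0.05)
print(f"Estimated overall accuracy: {low:.1%} to {high:.1%}")
```

The larger the share of unknown words in a text, the closer the overall figure moves to the 88-92% range.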

tTAG lets you use either the pre-trained resources supplied with the application, which are based on the Penn Treebank tag set, or resources of your own built with a custom tag set. When running tTAG, you simply specify which resource file to use for tagging.

When processing text marked up in SGML/XML, tTAG can tag the entire text or only some sections of the document (for example, only paragraphs, and not headings or subheadings). It is also possible to ask tTAG to output the tagged text as XML:
 
    <SENTENCE>
      <W TAG="PPS">He</W>
      <W TAG="VBZ">books</W>
      <W TAG="NNS">tickets</W>
    </SENTENCE>
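
XML output of this kind is easy to process afterwards. A minimal Python sketch, assuming the output has exactly the SENTENCE/W/TAG structure of the example above (the function name is only illustrative):

```python
import xml.etree.ElementTree as ET

# Sample copied from the tTAG XML output shown above.
TAGGED = """
<SENTENCE>
  <W TAG="PPS">He</W>
  <W TAG="VBZ">books</W>
  <W TAG="NNS">tickets</W>
</SENTENCE>
"""


def read_tagged_sentence(xml_text):
    """Return the (word, tag) pairs of one <SENTENCE> element."""
    sentence = ET.fromstring(xml_text.strip())
    return [(w.text, w.get("TAG")) for w in sentence.iter("W")]


pairs = read_tagged_sentence(TAGGED)
print(pairs)
```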
   
The tTAG syntactic analyzer uses the information already present in the document, together with context-sensitive grammars, to detect the boundaries of syntactic groups. It preserves all the markup previously added to the text and creates structural elements that enclose the words of each group:
 
    <NG>Este hombre</NG>
    <VG>canta</VG>
   

5. Test text used.

To carry out this task I have chosen the following English text, taken from the book The Little Prince by Antoine de Saint-Exupéry:

Once when I was six years old I saw a magnificent picture in a book, called True Stories from Nature, about the primeval forest. It was a picture of a boa constrictor in the act of swallowing an animal. Here is a copy of the drawing.

Since the three taggers can tag texts in different languages, I have also looked up the same passage in the Spanish version of the book in order to analyze it. The text used is the following:

When I was six years old I saw a book about the virgin forest called "Stories lived", a magnificent sheet. It represented a boa snake that swallowed a beast.

6. Tagging results with each selected tagger.

6.1 TreeTagger

Table 1 lists the tags that TreeTagger uses for English and Table 2 the tags it uses for Spanish. Table 3 shows the result for the English text and Table 4 the result for the Spanish text.

POS Tag   Description                      Example
CC        coordinating conjunction         and, but, or, &
CD        cardinal number                  1, three
DT        determiner                       the
EX        existential there                there is
FW        foreign word                     d'oeuvre
IN        preposition/subord. conj.        in, of, like, after, whether
IN/that   complementizer                   that
JJ        adjective                        green
JJR       adjective, comparative           greener
JJS       adjective, superlative           greenest
LS        list marker                      (1),
MD        modal                            could, will
NN        noun, singular or mass           table
NNS       noun, plural                     tables
NP        proper noun, singular            John
NPS       proper noun, plural              Vikings
PDT       predeterminer                    both the boys
POS       possessive ending                friend's
PP        personal pronoun                 I, he, it
PP$       possessive pronoun               my, his
RB        adverb                           however, usually, here, not
RBR       adverb, comparative              better
RBS       adverb, superlative              best
RP        particle                         give up
SENT      end punctuation                  ?, !, .
SYM       symbol                           @, +, *, ^, |, =
TO        to                               to go, to him
UH        interjection                     uhhuhhuhh
VB        verb be, base form               be
VBD       verb be, past                    was|were
VBG       verb be, gerund/participle       being
VBN       verb be, past participle         been
VBZ       verb be, pres, 3rd p. sing       is
VBP       verb be, pres non-3rd p.         am|are
VD        verb do, base form               do
VDD       verb do, past                    did
VDG       verb do, gerund/participle       doing
VDN       verb do, past participle         done
VDZ       verb do, pres, 3rd per. sing     does
VDP       verb do, pres, non-3rd per.      do
VH        verb have, base form             have
VHD       verb have, past                  had
VHG       verb have, gerund/participle     having
VHN       verb have, past participle       had
VHZ       verb have, pres 3rd per. sing    has
VHP       verb have, pres non-3rd per.     have
VV        verb, base form                  take
VVD       verb, past tense                 took
VVG       verb, gerund/participle          taking
VVN       verb, past participle            taken
VVP       verb, present, non-3rd p.        take
VVZ       verb, present 3rd p. sing.       takes
WDT       wh-determiner                    which
WP        wh-pronoun                       who, what
WP$       possessive wh-pronoun            whose
WRB       wh-adverb                        where, when
:         general joiner                   ;, -, --
$         currency symbol                  $, £

Table 1. Tags used by TreeTagger in English

Tag        Description
ABR        abbreviation
ADJ        adjective
ADV        adverb
DET:ART    article
DET:POS    possessive pronoun (ma, ta, …)
INT        interjection
KON        conjunction
NAM        proper name
NOM        noun
NUM        numeral
PRO        pronoun
PRO:DEM    demonstrative pronoun
PRO:IND    indefinite pronoun
PRO:PER    personal pronoun
PRO:POS    possessive pronoun (mien, tien, …)
PRO:REL    relative pronoun
PRP        preposition
PRP:det    preposition plus article (au, du, aux, des)
PUN        punctuation
PUN:cit    punctuation citation
SENT       sentence tag
SYM        symbol
VER:cond   verb, conditional
VER:futu   verb, future
VER:impe   verb, imperative
VER:impf   verb, imperfect
VER:infi   verb, infinitive
VER:pper   verb, past participle
VER:ppre   verb, present participle
VER:pres   verb, present
VER:simp   verb, simple past
VER:subi   verb, subjunctive imperfect
VER:subp   verb, subjunctive present

Table 2. Tags used by TreeTagger in Spanish

Original text   Tag    Lemma
Once            RB     once
when            WRB    when
I               PP     I
was             VBD    be
six             CD     six
years           NNS    year
old             JJ     old
I               PP     I
saw             VVD    see
a               DT     a
magnificent     JJ     magnificent
picture         NN     picture
in              IN     in
a               DT     a
book            NN     book
,               ,      ,
called          VVD    call
True            NP     True
Stories         NP     Stories
from            IN     from
Nature          NP     Nature
,               ,      ,
about           IN     about
the             DT     the
primeval        JJ     primeval
forest          NN     forest
.               SENT   .
It              PP     it
was             VBD    be
a               DT     a
picture         NN     picture
of              IN     of
a               DT     a
boa             NN     boa
constrictor     NN     constrictor
in              IN     in
the             DT     the
act             NN     act
of              IN     of
swallowing      VVG    swallow
an              DT     an
animal          NN     animal
.               SENT   .
Here            RB     here
is              VBZ    be
a               DT     a
copy            NN     copy
of              IN     of
the             DT     the
drawing         NN     drawing
.               SENT   .

Table 3. TreeTagger result for the English text
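
TreeTagger's tagging scripts normally print one token per line, with the word, the tag and the lemma separated by tabs, which is the layout reproduced in the table above. A minimal Python sketch for reading such output, assuming that three-column tab-separated layout (the sample string is copied from the result above and is not a new run of the tagger):

```python
from typing import List, NamedTuple


class Token(NamedTuple):
    word: str
    tag: str
    lemma: str


def parse_treetagger(output: str) -> List[Token]:
    """Parse word<TAB>tag<TAB>lemma lines into Token tuples."""
    tokens = []
    for line in output.splitlines():
        parts = line.split("\t")
        if len(parts) != 3:
            continue  # skip empty lines and anything that is not a triple
        tokens.append(Token(*parts))
    return tokens


# Assumed layout; three rows taken from the English result.
SAMPLE = "I\tPP\tI\nsaw\tVVD\tsee\na\tDT\ta\n"
tokens = parse_treetagger(SAMPLE)
for token in tokens:
    print(token.word, token.tag, token.lemma)
```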

Original text   Tag     Lemma
When            CSUBX   when
i               PPX     i
had             VLfin   have
six             CARD    six
years           NC      year
I saw           CARD    I saw
in              PREP    in
a               ART     a
book            NC      book
about           PREP    about
the             ART     the
jungle          NC      jungle
virgin          NC      virgin
that            CQUE    that
if              SE      if
titled          VLfin   holder
"               QT      "
Stories         NP      Stories
lived           VLadj   live
"               QT      "
,               CM      ,
one             ART     a
magnificent     ADJ     magnificent
sheet           NC      sheet
.               FS      .
represented     VLfin   represent
one             ART     a
snake           NC      snake
boa             NC      boa
that            CQUE    that
if              SE      if
swallowed       VLfin   swallowed
to              PREP    to
one             ART     a
beast           NC      beast
.               FS      .

Table 4. TreeTagger result for the Spanish text

6.2 Stanford Log-linear Part-Of-Speech Tagger

First, I used the Stanford Log-linear Part-Of-Speech Tagger to tag the English text with the following command:

java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -textFile sample-input.txt > sample-tagged.txt
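
When the tagger has to be run on several files, it can help to build this command programmatically. The Python sketch below only assembles the argument list from the options that appear in the command above; the function name and its parameters are illustrative, and the list could then be handed to subprocess.run with the output redirected to a file.

```python
import shlex


def build_tagger_command(jar, model, text_file, memory="300m"):
    """Assemble the java command line used above as a list of arguments."""
    return [
        "java", f"-mx{memory}",
        "-classpath", jar,
        "edu.stanford.nlp.tagger.maxent.MaxentTagger",
        "-model", model,
        "-textFile", text_file,
    ]


cmd = build_tagger_command(
    "stanford-postagger.jar",
    "models/wsj-0-18-bidirectional-distsim.tagger",
    "sample-input.txt",
)
print(shlex.join(cmd))  # shlex.join requires Python 3.8 or later
```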

To better understand the output, Figure 1 shows all the tags that the program can generate for an English text.

Figure 1. POS tags used by the Stanford Log-linear Part-Of-Speech Tagger

The result obtained for the English text is the following:

Once_RB when_WRB I_PRP was_VBD six_CD years_NNS old_JJ I_PRP saw_VBD a_DT magnificent_JJ picture_NN in_IN a_DT book_NN ,_, called_VBN True_NNP Stories_NNP from_IN Nature_NNP ,_, about_IN the_DT primeval_JJ forest_NN ._. It_PRP was_VBD a_DT picture_NN of_IN a_DT boa_NN constrictor_NN in_IN the_DT act_NN of_IN swallowing_VBG an_DT animal_NN ._. Here_RB is_VBZ a_DT copy_NN of_IN the_DT drawing_NN ._.
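
The word_TAG format of this output can be split back into pairs for later comparison. A small Python sketch, assuming tokens are separated by whitespace and that the last underscore in each token separates the word from its tag (the sample is a shortened copy of the result above):

```python
def parse_tagged(text, sep="_"):
    """Split 'word_TAG' tokens into (word, tag) pairs."""
    pairs = []
    for token in text.split():
        word, found, tag = token.rpartition(sep)
        if not found:
            continue  # token without a tag separator
        pairs.append((word, tag))
    return pairs


sample = "Once_RB when_WRB I_PRP was_VBD six_CD years_NNS old_JJ ._."
pairs = parse_tagged(sample)
print(pairs)
```

Splitting on the last underscore keeps punctuation tokens such as ._. and ,_, intact.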

After the test with the English text, I ran the same test with the Spanish text. The tags used by the Spanish model are described at https://nlp.stanford.edu/software/spanish-faq.shtml. The result obtained is the following:

When_cs i_pp000000 had_vmii000 six_dn0000 years_nc0p000 vi_vmis000 en_sp000 un_di0000 libro_nc0s000 sobre_sp000 la_da0000 jungle_nc0s000 virgin_nc0s000 que_cs se_p0000000 titulaba_vmii000 "_fe Stories_np00000 vivid_aq0000 "_fe ,_fc una_di0000 magnificent_aq0000 lamina_nc0s000 ._fp Representaba_vmii000 una_di0000 snake_nc0s000 boa_np00000 que_pr000000 se_p0000000 swallowed_vmii000 a_sp000 una_di0000 beast_nc0s000 ._fp
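
In this output the first letter of each tag appears to encode the broad word class (for example nc0s000 for the noun libro, vmii000 for the verb titulaba, sp000 for the preposition en). The Python sketch below reduces a tag to that broad class. The mapping is only my reading of the tagged output above, not an official table, and should be checked against the Stanford Spanish FAQ before being relied on.

```python
# Assumed mapping, inferred from the tagged output above (not an official table).
COARSE = {
    "a": "adjective",
    "c": "conjunction",
    "d": "determiner",
    "f": "punctuation",
    "n": "noun",
    "p": "pronoun",
    "s": "preposition",
    "v": "verb",
}


def coarse_category(tag):
    """Map a detailed Spanish tag to a broad class via its first letter."""
    return COARSE.get(tag[:1].lower(), "other")


for tag in ("nc0s000", "vmii000", "sp000", "di0000", "fp"):
    print(tag, "->", coarse_category(tag))
```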

6.3 tTAG

Table 5 lists all the Penn Treebank tags and their meanings.

The result provided for the English text is as follows:
 
    Once_RB when_WRB ([ I_PRP ])
      <: was_VBD>([ six_CD years_NNS ]) old_JJ ([ I_PRP ])
        <: saw_VBD>([ a_DT magnificent_JJ picture_NN ]) in_IN ([ a_DT book_NN ]),_,
          <: called_VBD>([ True_NNP Stories_NNP ]) from_IN ([ Nature_NNP ]),_, about_IN ([ the_DT primeval_JJ forest_NN
            ])._.([ It_PRP ])
            <: was_VBD>([ a_DT picture_NN ]) of_IN ([ a_DT boa_NN constrictor_NN ]) in_IN ([ the_DT act_NN ]) of_IN
              swallowing_VBG ([ an_DT animal_NN ])._.Here_RB
              <: is_VBZ>([ a_DT copy_NN ]) of_IN ([ the_DT drawing_NN ])._.
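
The bracketed groups in this output can be pulled out with two regular expressions. The Python sketch below assumes that "([ ... ])" encloses a noun group and "<: ... >" a verb group; this is my interpretation by analogy with the NG and VG elements shown earlier, not something stated in the tTAG documentation.

```python
import re

# Assumed meaning of the delimiters: ([ ... ]) noun group, <: ... > verb group.
NOUN_GROUP = re.compile(r"\(\[\s*(.*?)\s*\]\)")
VERB_GROUP = re.compile(r"<:\s*(.*?)\s*>")


def extract_groups(chunked):
    """Return (noun_groups, verb_groups) as lists of word_TAG token lists."""
    nouns = [m.split() for m in NOUN_GROUP.findall(chunked)]
    verbs = [m.split() for m in VERB_GROUP.findall(chunked)]
    return nouns, verbs


# First clause of the result above.
sample = "Once_RB when_WRB ([ I_PRP ]) <: was_VBD>([ six_CD years_NNS ]) old_JJ"
nouns, verbs = extract_groups(sample)
print("noun groups:", nouns)
print("verb groups:", verbs)
```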
   
For the text in Spanish, we have the following result:

([ When_NNP ]) I_FW was_FW six_FW years_FW saw_FW in_FW a_FW book_FW about_FW the_FW jungle_FW virgin_FW that_FW was_FW titled_FW "_'' ([ Stories_NNP lived_NNS ])"_" ,_,a_FW magnificent_FW sheet_FW ._.([ Represented_NNP ]) a_FW snake_FW ([ boa_NN ]) that_FW swallowed_FW a_FW a_FW fierce_FW ._.

Number   Tag    Description
1.       CC     Coordinating conjunction
2.       CD     Cardinal number
3.       DT     Determiner
4.       EX     Existential there
5.       FW     Foreign word
6.       IN     Preposition or subordinating conjunction
7.       JJ     Adjective
8.       JJR    Adjective, comparative
9.       JJS    Adjective, superlative
10.      LS     List item marker
11.      MD     Modal
12.      NN     Noun, singular or mass
13.      NNS    Noun, plural
14.      NNP    Proper noun, singular
15.      NNPS   Proper noun, plural
16.      PDT    Predeterminer
17.      POS    Possessive ending
18.      PRP    Personal pronoun
19.      PRP$   Possessive pronoun
20.      RB     Adverb
21.      RBR    Adverb, comparative
22.      RBS    Adverb, superlative
23.      RP     Particle
24.      SYM    Symbol
25.      TO     to
26.      UH     Interjection
27.      VB     Verb, base form
28.      VBD    Verb, past tense
29.      VBG    Verb, gerund or present participle
30.      VBN    Verb, past participle
31.      VBP    Verb, non-3rd person singular present
32.      VBZ    Verb, 3rd person singular present
33.      WDT    Wh-determiner
34.      WP     Wh-pronoun
35.      WP$    Possessive wh-pronoun
36.      WRB    Wh-adverb

Table 5. Penn Treebank tags

7. Observations on the comparison of the results.

To compare the results for the English text, the first thing to analyze is the tag set used. In my experiment, all three systems used tag sets based on the Penn Treebank, which makes the task easier.

However, not all of them use the same names for the tags. Analyzing the results and the documentation, I have observed the following differences between TreeTagger on the one hand and the Stanford Log-linear Part-Of-Speech Tagger and tTAG on the other:
  • TreeTagger tags personal pronouns as PP, while the Stanford Log-linear Part-Of-Speech Tagger and tTAG tag them as PRP. Example:
    Word      TreeTagger   Stanford and tTAG
    I         PP           PRP
    It        PP           PRP
  • TreeTagger tags verbs in the past tense as VVD, while the Stanford Log-linear Part-Of-Speech Tagger and tTAG tag them as VBD. Example:
    Word      TreeTagger   Stanford and tTAG
    saw       VVD          VBD
    called    VVD          VBN (Stanford), VBD (tTAG)
  • TreeTagger tags proper nouns as NP, while the Stanford Log-linear Part-Of-Speech Tagger and tTAG tag them as NNP. Example:
    Word      TreeTagger   Stanford and tTAG
    True      NP           NNP
    Nature    NP           NNP
    Stories   NP           NNP
Although they do not appear in the test text I chose, the same thing happens with the forms of the verbs do and have: TreeTagger gives them their own tags (VD, VDD, VH, VHD, etc.; see Table 1), while the Penn Treebank tag set uses the general verb tags (VB, VBD, etc.) for all verbs.
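
Because of these naming differences, the tags have to be brought to a common form before the taggers can be compared token by token. The Python sketch below maps TreeTagger tags to their Penn Treebank counterparts and computes the share of tokens on which two taggers agree. The mapping is my own reading of Tables 1 and 5, and the function names are illustrative.

```python
# Assumed mapping, derived from comparing Table 1 with Table 5.
TREETAGGER_TO_PENN = {
    "PP": "PRP", "PP$": "PRP$", "NP": "NNP", "NPS": "NNPS", "SENT": ".",
}


def to_penn(tag):
    """Translate a TreeTagger English tag into a Penn Treebank tag."""
    if tag in TREETAGGER_TO_PENN:
        return TREETAGGER_TO_PENN[tag]
    # TreeTagger uses VV* (lexical verbs), VH* (have) and VD* (do);
    # the Penn Treebank set uses VB* for all of them.
    if tag[:2] in ("VV", "VH", "VD"):
        return "VB" + tag[2:]
    return tag


def agreement(tags_a, tags_b):
    """Fraction of positions at which two tag sequences coincide."""
    if len(tags_a) != len(tags_b):
        raise ValueError("both taggers must label the same tokens")
    if not tags_a:
        return 0.0
    return sum(a == b for a, b in zip(tags_a, tags_b)) / len(tags_a)


# Tags for "I saw a magnificent picture", taken from the results above.
treetagger = ["PP", "VVD", "DT", "JJ", "NN"]
stanford = ["PRP", "VBD", "DT", "JJ", "NN"]
raw = agreement(treetagger, stanford)
normalized = agreement([to_penn(t) for t in treetagger], stanford)
print(f"raw: {raw:.2f}, after mapping: {normalized:.2f}")
```

On this fragment the two taggers differ on two of five tokens before the mapping and on none after it, which shows that the differences are in the names only.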

Of the three systems, the one that provides the most information is tTAG, since it can detect the boundaries of syntactic groups and has a web interface that makes it very easy to use. In my opinion, the second most useful is TreeTagger, which also provides the lemma of each analyzed word and, although it does not have a web interface, is also quite usable.

When we switch to the Spanish text, the order of preference changes. To begin with, the result provided by tTAG is defective, as can be seen in the following fragment of the result presented in the previous section:

I_FW had_FW six_FW

In this fragment the tagger assigns I, had and six the same tag, FW, although I is a pronoun, had a verb and six a numeral. This makes me think that the web version of tTAG is not prepared for texts in other languages and that the FW (foreign word) tag is used to mark words that are unknown to the model.
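
How far the tagger falls back on FW can be quantified by counting the share of tokens that receive that tag. A short Python sketch, assuming the word_TAG format of the tTAG output (the sample is the beginning of the Spanish result shown in the previous section):

```python
from collections import Counter


def tag_share(tagged_text, tag):
    """Fraction of word_TAG tokens in the text that carry the given tag."""
    tags = [tok.rpartition("_")[2] for tok in tagged_text.split() if "_" in tok]
    if not tags:
        return 0.0
    return Counter(tags)[tag] / len(tags)


# Bracket tokens such as "([" contain no underscore and are ignored.
sample = "([ When_NNP ]) I_FW was_FW six_FW years_FW"
share = tag_share(sample, "FW")
print(f"FW share: {share:.0%}")
```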

For its part, the Stanford Log-linear Part-Of-Speech Tagger uses for Spanish a tag set of its own that is much more detailed than the Penn Treebank one and gives the user much more information. It can be consulted at https://nlp.stanford.edu/software/spanish-faq.shtml.

Finally, TreeTagger is run in the same way and produces the same kind of output with the Spanish model as with the English one.

Comparing the results of TreeTagger and the Stanford Log-linear Part-Of-Speech Tagger, the second tagger gives more information than the first about the lexical function of each word. Therefore, thanks to the model used, I consider it a better option than the first.