Example of manual labeling.

Classical taggers and current evaluation

TreeTagger and the Stanford POS Tagger remain useful historical baselines. For a current comparison, define one tagset, preserve tokenization, use a gold test set not seen during configuration, report per-tag errors and compare with a maintained NLP pipeline. Manual annotation should include written guidelines and agreement between annotators.
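The evaluation steps named above can be sketched as follows. This is a minimal illustration, not a standard tool: the function names `evaluate` and `cohen_kappa` and the sample data are assumptions introduced here. It expects the gold and predicted sequences to share the same tokenization, reports accuracy and per-tag confusions, and computes Cohen's kappa as one common measure of agreement between two annotators.

```python
from collections import Counter


def evaluate(gold, predicted):
    """Return (accuracy, Counter of (gold_tag, predicted_tag) errors).

    Both arguments are lists of (word, tag) pairs with identical tokenization.
    """
    if len(gold) != len(predicted):
        raise ValueError("tokenization differs: sequences must align one-to-one")
    errors = Counter()
    correct = 0
    for (g_word, g_tag), (p_word, p_tag) in zip(gold, predicted):
        if g_word != p_word:
            raise ValueError(f"token mismatch: {g_word!r} vs {p_word!r}")
        if g_tag == p_tag:
            correct += 1
        else:
            errors[(g_tag, p_tag)] += 1
    return correct / len(gold), errors


def cohen_kappa(tags_a, tags_b):
    """Cohen's kappa for two annotators' tag sequences over the same tokens."""
    n = len(tags_a)
    observed = sum(a == b for a, b in zip(tags_a, tags_b)) / n
    freq_a, freq_b = Counter(tags_a), Counter(tags_b)
    # Chance agreement: probability that both pick the same tag independently.
    expected = sum(freq_a[t] * freq_b[t] for t in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical tag throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical gold annotation and tagger output for a four-token sentence.
gold = [("I", "PRP"), ("saw", "VBD"), ("a", "DT"), ("book", "NN")]
predicted = [("I", "PRP"), ("saw", "VBD"), ("a", "DT"), ("book", "VB")]
accuracy, errors = evaluate(gold, predicted)
```

Listing `errors.most_common()` shows which gold tags are most often confused with which predicted tags, which is usually more informative than the single accuracy figure.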

Introduction

The "Part of Speech Taggers" section of the web page https://nlp.stanford.edu/links/statnlp.html lists numerous statistical taggers. They are based on different models (HMMs, Support Vector Machines, etc.), are trained on different corpora and cover different languages.

In this task we must compare the behavior of at least two of them. We will study and describe them, and then use them to label a small text.

Then we will compare the results: the labels used by each tagger and the labeling accuracy.

2. Description of selected taggers

The taggers chosen have been:

TreeTagger
Stanford Log-linear Part-Of-Speech Tagger
tTAG

2.1 TreeTagger

TreeTagger is a text tagging, sentence analysis and lemma extraction tool.

It was developed by Helmut Schmid in the TC project at the Institute for Computational Linguistics of the University of Stuttgart, with the aim of being used for part-of-speech tagging and lemmatization. To run it, you need the model for the selected language (a file known as "parameters", with the extension .par), which can be obtained from the TreeTagger website itself. The website offers parameter files for English, French, German, Italian, Spanish, Russian, Bulgarian, Dutch, Estonian, Finnish, Galician, Latin, Mongolian, Polish, Slovak and Swahili. For a language without a model, the tool lets the user create a new one: it is necessary to manually tag a sample text and then run a training program (provided with TreeTagger) to create the model.

2.1.1 Installation

In my case, I installed the application on a personal computer with Windows 7, using the installer downloaded at the following url https://www.cis.uni-muenchen.de/~schmid/tools/TreeTagger/

To install it I performed the following steps:

First, install the Perl interpreter, which can be downloaded from http://www.activestate.com/activeperl/
Next, extract the files from the downloaded zip archive into the C:\ directory of the computer.
Once unzipped, download from the website the parameter files for the languages needed. The files must be named <language>-utf8.par and must be stored in the TreeTagger/lib subdirectory.
Add the following path to the PATH environment variable:
C:\TreeTagger\bin
Then open a Windows terminal and execute the following commands:
set PATH=C:\TreeTagger\bin;%PATH%
cd c:\TreeTagger
Finally, run the program with the following command:
```
tag-<language> <filename>
```

3. Stanford Log-linear Part-Of-Speech Tagger

A Part-Of-Speech Tagger (POS Tagger) is a tool that reads text in a given language and assigns a part of speech to each word, labeling it as a noun, verb, adjective, etc. It can also use more fine-grained POS tags such as "noun-plural". The Stanford tagger is implemented in Java and uses the log-linear taggers described in "Enriching the Knowledge Sources Used in a Maximum Entropy Part-of-Speech Tagger" (Toutanova and Manning, 2000).

Running this program requires:

Java 1.8 or higher.
Between 60 and 200 MB of memory to run a trained tagger.
At least 1 GB of memory to train a new tagger model.

The basic download contains two trained tagger models for English, which use the Penn Treebank tag set.

The software is distributed under the GNU General Public License (v2 or later).

4. tTAG

tTAG was implemented by Infogistics, an international company based in Edinburgh and created by experts in the field of text mining and document search.

Its main product, tTAG, is a text tagger that can handle both ASCII-encoded text and XML-marked text. tTAG incorporates a tokenizer (tNORM) responsible for segmenting the text into words and sentences. Users who need their own tokenizer can replace tNORM with one of their own.

For morphology, tTAG uses a lexicon that can easily be extended with new words. When the software encounters an unknown word, it runs a prediction component that guesses a tag for it. This component can be retrained by the user for new sublanguages. Thanks to this capability, tTAG achieves 96-98% accuracy on words found in the lexicon and 88-92% accuracy on unknown words.
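To see what the two accuracy ranges quoted above imply for a whole text, they can be combined as a weighted average over known and unknown words. The sketch below is only an illustration: the function name `overall_accuracy` and the 5% unknown-word rate are assumptions introduced here, not figures published for tTAG.

```python
def overall_accuracy(known_acc, unknown_acc, unknown_rate):
    """Token-level accuracy as a weighted mix of known- and unknown-word accuracy."""
    if not 0.0 <= unknown_rate <= 1.0:
        raise ValueError("unknown_rate must be between 0 and 1")
    return (1.0 - unknown_rate) * known_acc + unknown_rate * unknown_acc


# Hypothetical share of unknown words in a text (an assumption, not a tTAG figure).
UNKNOWN_RATE = 0.05

low = overall_accuracy(0.96, 0.88, UNKNOWN_RATE)   # lower end of both quoted ranges
high = overall_accuracy(0.98, 0.92, UNKNOWN_RATE)  # upper end of both quoted ranges
print(f"expected overall accuracy: {low:.1%} to {high:.1%}")
```

The more unknown words a text contains (for example, a text in a new domain), the closer the overall figure moves toward the lower unknown-word accuracy.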

tTAG can use either the pre-trained resources supplied with the application, which are based on the Penn Treebank tag set, or resources you develop yourself with your own set of tags. When running tTAG, you simply specify which resource file you want to use for the labeling.

When processing text marked up in SGML/XML, tTAG can tag the entire text or only some sections of the XML document (for example, only paragraphs, and not headings or subheadings). It is also possible to ask tTAG to output the tagged text as XML:

 
     <SENTENCE>
      <W TAG="PPS">He</W>
      <W TAG="VBZ">books</W>
      <W TAG="NNS">tickets</W>
    </SENTENCE>
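
XML output of this shape is easy to post-process. The following minimal sketch reads it with Python's standard `xml.etree.ElementTree` module. The element and attribute names (`SENTENCE`, `W`, `TAG`) are taken from the example above; the helper name `read_tagged_sentence` is an assumption introduced here, and real tTAG output may wrap sentences in further elements.

```python
import xml.etree.ElementTree as ET

# Sample copied from the tTAG example above.
xml_output = """<SENTENCE>
  <W TAG="PPS">He</W>
  <W TAG="VBZ">books</W>
  <W TAG="NNS">tickets</W>
</SENTENCE>"""


def read_tagged_sentence(xml_text):
    """Return a list of (word, tag) pairs from one <SENTENCE> element."""
    root = ET.fromstring(xml_text)
    return [(w.text, w.get("TAG")) for w in root.iter("W")]


pairs = read_tagged_sentence(xml_output)
```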

The tTAG syntactic analyzer uses the tagging information already present in the document together with context-sensitive grammars to detect the boundaries of syntactic groups. The analyzer keeps all the information previously added to the text and creates structural elements that enclose the words of each group:

 
    <NG>Este hombre</NG>
    <VG>canta</VG>

5. Test text used.

To carry out this task I chose the following English text from the book The Little Prince by Antoine de Saint-Exupéry:

Once when I was six years old I saw a magnificent picture in a book, called True Stories from Nature, about the primeval forest. It was a picture of a boa constrictor in the act of swallowing an animal. Here is a copy of the drawing.

Since the three taggers can label texts in different languages, I also looked for the same passage in the Spanish version of the book in order to analyze it. The text used is the following (shown here in English translation):

When I was six years old I saw a book about the virgin forest called "Stories lived", a magnificent sheet. It represented a boa snake that swallowed a beast.

6. Labeling result with each selected labeler.

6.1 TreeTagger

Table 1 lists the tags that TreeTagger uses for English and Table 2 the labels it uses for Spanish. Table 3 shows the result for the English text and Table 4 the result for the Spanish text.

POS Tag	Description	Example
CC	coordinating conjunction	and, but, or, &
CD	cardinal number	1, three
DT	determiner	the
EX	existential there	there is
FW	foreign word	d'oeuvre
IN	preposition/subord. conj.	in, of, like, after, whether
IN/that	complementizer	that
JJ	adjective	green
JJR	adjective, comparative	greener
JJS	adjective, superlative	greenest
LS	list marker	(1),
MD	modal	could, will
NN	noun, singular or mass	table
NNS	noun, plural	tables
NP	proper noun, singular	John
NPS	proper noun, plural	Vikings
PDT	predeterminer	both the boys
POS	possessive ending	friend's
PP	personal pronoun	I, he, it
PP$	possessive pronoun	my, his
RB	adverb	however, usually, here, not
RBR	adverb, comparative	better
RBS	adverb, superlative	best
RP	particle	give up
SENT	end punctuation	?, !, .
SYM	symbol	@, +, *, ^, |, =
TO	to	to go, to him
UH	interjection	uhhuhhuhh
VB	verb be, base form	be
VBD	verb be, past	was|were
VBG	verb be, gerund/participle	being
VBN	verb be, past participle	been
VBZ	verb be, pres, 3rd p. sing	is
VBP	verb be, pres non-3rd p.	am|are
VD	verb do, base form	do
VDD	verb do, past	did
VDG	verb do, gerund/participle	doing
VDN	verb do, past participle	done
VDZ	verb do, pres, 3rd per. sing	does
VDP	verb do, pres, non-3rd per.	do
VH	verb have, base form	have
VHD	verb have, past	had
VHG	verb have, gerund/participle	having
VHN	verb have, past participle	had
VHZ	verb have, pres 3rd per. sing	has
VHP	verb have, pres non-3rd per.	have
VV	verb, base form	take
VVD	verb, past tense	took
VVG	verb, gerund/participle	taking
VVN	verb, past participle	taken
VVP	verb, present, non-3rd p.	take
VVZ	verb, present 3rd p. sing.	takes
WDT	wh-determiner	which
WP	wh-pronoun	who, what
WP$	possessive wh-pronoun	whose
WRB	wh-adverb	where, when
:	general joiner	;, -, --
$	currency symbol	$, £

Table 1. Tags used by TreeTagger in English

POS Tag	Description
ABR	abbreviation
ADJ	adjective
ADV	adverb
DET:ART	article
DET:POS	possessive pronoun (ma, ta, …)
INT	interjection
KON	conjunction
NAM	proper name
NOM	noun
NUM	numeral
PRO	pronoun
PRO:DEM	demonstrative pronoun
PRO:IND	indefinite pronoun
PRO:PER	personal pronoun
PRO:POS	possessive pronoun (mien, tien, …)
PRO:REL	relative pronoun
PRP	preposition
PRP:det	preposition plus article (au, du, aux, des)
PUN	punctuation
PUN:cit	punctuation citation
SENT	sentence tag
SYM	symbol
VER:cond	verb, conditional
VER:futu	verb, future
VER:impe	verb, imperative
VER:impf	verb, imperfect
VER:infi	verb, infinitive
VER:pper	verb, past participle
VER:ppre	verb, present participle
VER:pres	verb, present
VER:simp	verb, simple past
VER:subi	verb, subjunctive imperfect
VER:subp	verb, subjunctive present

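Table 2. Tags used by TreeTagger in Spanish. Note that the examples listed (ma, ta, mien, tien, au, du) are French, and the Spanish output in Table 4 uses a different tag set (CSUBX, VLfin, NC, etc.), so this list appears to correspond to TreeTagger's French tag set.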

Original Text	Label	Root
Once	RB	once
when	WRB	when
I	PP	I
was	VBD	be
six	CD	six
years	NNS	year
old	JJ	old
I	PP	I
saw	VVD	see
a	DT	a
magnificent	JJ	magnificent
picture	NN	picture
in	IN	in
a	DT	a
book	NN	book
,	,	,
called	VVD	call
True	NP	True
Stories	NP	Stories
from	IN	from
Nature	NP	Nature
,	,	,
about	IN	about
the	DT	the
primeval	JJ	primeval
forest	NN	forest
.	SENT	.
It	PP	it
was	VBD	be
a	DT	a
picture	NN	picture
of	IN	of
a	DT	a
boa	NN	boa
constrictor	NN	constrictor
in	IN	in
the	DT	the
act	NN	act
of	IN	of
swallowing	VVG	swallow
an	DT	an
animal	NN	animal
.	SENT	.
Here	RB	here
is	VBZ	be
a	DT	a
copy	NN	copy
of	IN	of
the	DT	the
drawing	NN	drawing
.	SENT	.

Table 3. Tags assigned by TreeTagger to the English text
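
TreeTagger output in this three-column form (word, tag, root) can be loaded for later comparison with a few lines of Python. This is a minimal sketch: it assumes the columns are separated by tabs, one token per line, and the function name `parse_treetagger` and the sample string are introduced here for illustration.

```python
def parse_treetagger(output):
    """Parse tab-separated 'word<TAB>tag<TAB>lemma' lines into tuples."""
    rows = []
    for line in output.splitlines():
        if not line.strip():
            continue  # skip blank lines
        parts = line.split("\t")
        if len(parts) != 3:
            raise ValueError(f"expected 3 tab-separated fields, got: {line!r}")
        rows.append(tuple(parts))
    return rows


# Small sample in the same layout as the table above.
sample = "I\tPP\tI\nsaw\tVVD\tsee\na\tDT\ta\nbook\tNN\tbook\n"
rows = parse_treetagger(sample)
tags = [tag for _, tag, _ in rows]
```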

Original Text	Label	Root
When	CSUBX	when
i	PPX	i
had	VLfin	have
six	CARD	six
years	NC	year
I saw	CARD	I saw
in	PREP	in
a	ART	a
book	NC	book
about	PREP	about
the	ART	the
jungle	NC	jungle
virgin	NC	virgin
that	CQUE	that
if	SE	if
titled	VLfin	holder
"	QT	"
Stories	NP	Stories
lived	VLadj	live
"	QT	"
,	CM	,
one	ART	a
magnificent	ADJ	magnificent
sheet	NC	sheet
.	FS	.
represented	VLfin	represent
one	ART	a
snake	NC	snake
boa	NC	boa
that	CQUE	that
if	SE	if
swallowed	VLfin	swallowed
to	PREP	to
one	ART	a
beast	NC	beast
.	FS	.

Table 4. Tags assigned by TreeTagger to the Spanish text

6.2 Stanford Log-linear Part-Of-Speech Tagger

In the first part of this section I used the Stanford Log-linear Part-Of-Speech Tagger to tag the English text with the following command:

java -mx300m -classpath stanford-postagger.jar edu.stanford.nlp.tagger.maxent.MaxentTagger -model models/wsj-0-18-bidirectional-distsim.tagger -textFile sample-input.txt > sample-tagged.txt

After running it, and to better understand its output, I used Figure 1, which shows all the labels that the program can generate for an English text.

Figure 1. POS tags used by the Stanford Log-linear Part-Of-Speech Tagger

The result obtained for the English text has been the following:

Once_RB when_WRB I_PRP was_VBD six_CD years_NNS old_JJ I_PRP saw_VBD a_DT magnificent_JJ picture_NN in_IN a_DT book_NN ,_, called_VBN True_NNP Stories_NNP from_IN Nature_NNP ,_, about_IN the_DT primeval_JJ forest_NN ._. It_PRP was_VBD a_DT picture_NN of_IN a_DT boa_NN constrictor_NN in_IN the_DT act_NN of_IN swallowing_VBG an_DT animal_NN ._. Here_RB is_VBZ a_DT copy_NN of_IN the_DT drawing_NN ._.
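
Output in this word_TAG format can be split back into (word, tag) pairs for comparison with the other taggers. The sketch below assumes tokens are separated by whitespace and that the tag follows the last underscore of each token; the function name `parse_underscore_tagged` is introduced here for illustration.

```python
def parse_underscore_tagged(text):
    """Split 'word_TAG word_TAG ...' output into a list of (word, tag) pairs."""
    pairs = []
    for token in text.split():
        # Split on the last underscore so words containing '_' stay intact.
        word, sep, tag = token.rpartition("_")
        if not sep:
            raise ValueError(f"token without a tag: {token!r}")
        pairs.append((word, tag))
    return pairs


# Fragment copied from the Stanford output above.
fragment = "Here_RB is_VBZ a_DT copy_NN of_IN the_DT drawing_NN ._."
pairs = parse_underscore_tagged(fragment)
```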

After the test with the English text, I ran the same test with the Spanish text. The tags used by the Spanish model are described at https://nlp.stanford.edu/software/spanish-faq.shtml. The result obtained is the following:

When_cs i_pp000000 had_vmii000 six_dn0000 years_nc0p000 vi_vmis000 en_sp000 un_di0000 libro_nc0s000 sobre_sp000 la_da0000 jungle_nc0s000 virgin_nc0s000 que_cs se_p0000000 titulaba_vmii000 "_fe Stories_np00000 vivid_aq0000 "_fe ,_fc una_di0000 magnificent_aq0000 lamina_nc0s000 ._fp Representaba_vmii000 una_di0000 snake_nc0s000 boa_np00000 que_pr000000 se_p0000000 swallowed_vmii000 a_sp000 una_di0000 beast_nc0s000 ._fp

6.3 tTAG

First of all, I recommend consulting Table 5, which lists all the Penn Treebank tags and their meanings.

With the table in mind, the result for the English text is as follows:

 
    Once_RB when_WRB ([ I_PRP ]) <: was_VBD> ([ six_CD years_NNS ]) old_JJ ([ I_PRP ]) <: saw_VBD> ([ a_DT magnificent_JJ picture_NN ]) in_IN ([ a_DT book_NN ]) ,_, <: called_VBD> ([ True_NNP Stories_NNP ]) from_IN ([ Nature_NNP ]) ,_, about_IN ([ the_DT primeval_JJ forest_NN ]) ._.
    ([ It_PRP ]) <: was_VBD> ([ a_DT picture_NN ]) of_IN ([ a_DT boa_NN constrictor_NN ]) in_IN ([ the_DT act_NN ]) of_IN swallowing_VBG ([ an_DT animal_NN ]) ._.
    Here_RB <: is_VBZ> ([ a_DT copy_NN ]) of_IN ([ the_DT drawing_NN ]) ._.

For the text in Spanish, we have the following result:

([ When_NNP ]) I_FW was_FW six_FW years_FW saw_FW in_FW a_FW book_FW about_FW the_FW jungle_FW virgin_FW that_FW was_FW titled_FW "_'' ([ Stories_NNP lived_NNS ])"_" ,_,a_FW magnificent_FW sheet_FW ._.([ Represented_NNP ]) a_FW snake_FW ([ boa_NN ]) that_FW swallowed_FW a_FW a_FW fierce_FW ._.

Number	Tag	Description
1.	CC	Coordinating conjunction
2.	CD	Cardinal number
3.	DT	Determiner
4.	EX	Existential there
5.	FW	Foreign word
6.	IN	Preposition or subordinating conjunction
7.	JJ	Adjective
8.	JJR	Adjective, comparative
9.	JJS	Adjective, superlative
10.	LS	List item marker
11.	MD	Modal
12.	NN	Noun, singular or mass
13.	NNS	Noun, plural
14.	NNP	Proper noun, singular
15.	NNPS	Proper noun, plural
16.	PDT	Predeterminer
17.	POS	Possessive ending
18.	PRP	Personal pronoun
19.	PRP$	Possessive pronoun
20.	RB	Adverb
21.	RBR	Adverb, comparative
22.	RBS	Adverb, superlative
23.	RP	Particle
24.	SYM	Symbol
25.	TO	to
26.	UH	Interjection
27.	VB	Verb, base form
28.	VBD	Verb, past tense
29.	VBG	Verb, gerund or present participle
30.	VBN	Verb, past participle
31.	VBP	Verb, non-3rd person singular present
32.	VBZ	Verb, 3rd person singular present
33.	WDT	Wh-determiner
34.	WP	Wh-pronoun
35.	WP$	Possessive wh-pronoun
36.	WRB	Wh-adverb

Table 5. Penn Treebank tags

7. Observations on the comparison of the results.

To compare the results for the English text, the first thing to analyze is the tag sets used. In my experiment, all three systems use tag sets based on the Penn Treebank, which makes the task easier.

However, they do not all use the same names for the labels. Analyzing the results and the documentation provided, I observed the following differences between TreeTagger on the one hand and the Stanford Log-linear Part-Of-Speech Tagger and tTAG on the other:

In TreeTagger personal pronouns are tagged as PP, while the Stanford Log-linear Part-Of-Speech Tagger and tTAG tag them as PRP. Example:
Word	TreeTagger	Stanford and tTAG
I	PP	PRP
It	PP	PRP
In TreeTagger past-tense verbs are tagged as VVD, while the Stanford Log-linear Part-Of-Speech Tagger and tTAG tag them as VBD. Example:
Word	TreeTagger	Stanford and tTAG
saw	VVD	VBD
called	VVD	VBD (tTAG), VBN (Stanford)
For called, only tTAG output VBD; the Stanford tagger labeled it VBN (past participle), so here the taggers disagree on the analysis and not just on the tag name.
In TreeTagger proper nouns are tagged as NP, while the Stanford Log-linear Part-Of-Speech Tagger and tTAG tag them as NNP. Example:
Word	TreeTagger	Stanford and tTAG
True	NP	NNP
Nature	NP	NNP
Stories	NP	NNP

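A related difference concerns the verb to be: as Table 1 shows, TreeTagger reserves the VB* tags (VBD, VBZ, etc.) for forms of to be and uses VV* for other verbs, whereas the Penn Treebank tag set used by the other two taggers applies VB* to all verbs.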
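
Before counting agreements between the taggers, these naming differences can be removed by mapping TreeTagger's English tags onto Penn Treebank tags. The sketch below is my own reading of Tables 1 and 5, not an official conversion: the names `EXACT`, `treetagger_to_penn` and `agreement` are introduced here, and the mapping of SENT to "." and of IN/that to IN are assumptions.

```python
import re

# One-to-one renamings taken from the comparison above (plus assumed ones).
EXACT = {
    "PP": "PRP",
    "PP$": "PRP$",
    "NP": "NNP",
    "NPS": "NNPS",
    "SENT": ".",      # assumption: Penn uses "." for sentence-final punctuation
    "IN/that": "IN",  # assumption: the complementizer collapses into IN
}


def treetagger_to_penn(tag):
    """Map a TreeTagger English tag to the closest Penn Treebank tag."""
    if tag in EXACT:
        return EXACT[tag]
    # VV*, VH* and VD* all collapse into VB*: VV -> VB, VVD -> VBD, VHZ -> VBZ.
    match = re.fullmatch(r"V[VHD]([DGNPZ]?)", tag)
    if match:
        return "VB" + match.group(1)
    return tag  # tags shared by both sets (DT, NN, VBD, ...) stay unchanged


def agreement(treetagger_tags, penn_tags):
    """Share of positions where both taggers agree after normalization."""
    if len(treetagger_tags) != len(penn_tags):
        raise ValueError("tag sequences must have the same length")
    same = sum(treetagger_to_penn(t) == p for t, p in zip(treetagger_tags, penn_tags))
    return same / len(penn_tags)


# "I saw a book ." as tagged by TreeTagger and by a Penn-style tagger.
score = agreement(["PP", "VVD", "DT", "NN", "SENT"], ["PRP", "VBD", "DT", "NN", "."])
```

With such a mapping, any remaining disagreement reflects a different analysis of the word and not a different tag name.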

Of the three systems, the one that provides the most information is tTAG, since it can detect the boundaries of syntactic groups and has a web-based environment that makes it very easy to use. In my opinion, the second most useful is TreeTagger, which provides the root of each analyzed word and, although it is not a web environment, is also quite usable.

When we change the base text and analyze the labeling of the Spanish text, the order of preference changes. To begin with, the result provided by tTAG is defective, as the following fragment of the result presented in the previous section shows:

I_FW had_FW six_FW

In this result the tagger assigns the same tag, FW, to I, had and six, although I is a pronoun, had a verb and six a numeral. This makes me think that the web version of tTAG is not prepared for texts in other languages and that it uses the FW tag to indicate that a word is unknown to the model.

For its part, the Stanford Log-linear Part-Of-Speech Tagger uses for Spanish a tag set that is much more detailed than the Penn Treebank one, providing the user with much more information. The tag set can be consulted at https://nlp.stanford.edu/software/spanish-faq.shtml.

Finally, TreeTagger is run in the same way and produces the same output format (word, label and root) with the Spanish model as with the English one.

Comparing the results of TreeTagger and the Stanford Log-linear Part-Of-Speech Tagger, the second tagger provides more information about the lexical function of each word. Therefore, thanks to the model used, I consider it a better option than the first.