1. Corpus creation
1.1 What is a corpus?
In computational linguistics, a corpus is any collection of texts that meets certain criteria and serves as a representation of the language in use. That is, a corpus must be composed of texts produced in real situations, and their inclusion must be guided by a series of explicit linguistic criteria to ensure that it can be used as a representative sample of a language.
Although all corpus scholars agree that these are fundamental aspects of its creation and definition, these aspects remain controversial and have sometimes given rise to different positions.
1.2 Possible uses and usefulness of a corpus
A corpus allows us to carry out lexical labeling of the components it stores. Through analysis of the corpus we can, for example, create a dictionary, semi-automatically construct the lexicon of a domain, and automatically acquire lexical-semantic information.
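As a minimal illustration of this kind of use, the following Python sketch builds a simple frequency lexicon from a toy corpus; the documents and the naive tokenizer are illustrative assumptions, not part of any particular corpus project.

```python
from collections import Counter
import re

# Hypothetical corpus: in practice these would be real documents.
corpus = [
    "The bank approved the loan after reviewing the application.",
    "The loan application was rejected by the bank.",
]

def tokenize(text):
    # Very naive tokenizer: lowercase alphabetic words only.
    return re.findall(r"[a-z]+", text.lower())

# Count how often each word form occurs across the whole corpus.
lexicon = Counter()
for document in corpus:
    lexicon.update(tokenize(document))

# The most frequent forms are candidates for a domain lexicon entry.
for word, freq in lexicon.most_common(5):
    print(word, freq)
```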
1.3 Creation of a corpus from the web
To create a corpus we will first have to meet some preconditions such as:
- Electronic format of documents: scanning or keyboarding can be used to transform texts on paper or other media into the desired electronic format.
- Compliance with the copyright of the documents used, either by paying the respective licenses or by using freely available alternatives.
- Register: for example, we have to consider whether the language is formal, informal, literary, etc.
- Plain text.
- A large amount of text.
At first it was common to find many corpora built from press news, since these resources are easy to obtain and lend themselves to many processing applications.
This has been changing with the evolution of the Internet. The written press has now largely been replaced by Web documents because of their diversity of registers, diversity of languages and easy access. The Web has become such a rich source of information that corpora formed from Web documents are called opportunistic corpora, as opposed to planned corpora (which only incorporate previously selected documents). Opportunistic corpora are normally characterized by a lack of control, documentation or representativeness and, as a consequence, they are usually used to complement other reference corpora or to create exploratory corpora for specialty languages.
Another very important aspect of an opportunistic corpus is the method used to obtain Web documents. To compile this kind of structure we first use a crawler such as Googlebot or Yahoo's to collect documents. The search can be carried out in two ways: manually, with a high time cost but selecting each of the documents included, or automatically, where a search algorithm runs until a certain number of documents meeting certain conditions has been found. In both methods, the basic idea is to use one or more seed words that allow some documents to be retrieved. Those documents, once explored, provide new terms that are used, alone or in combination with the previous ones, to obtain further documents. This process is repeated until a corpus of the desired size has been generated (a minimal sketch of this loop follows).
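The following Python sketch illustrates, under stated assumptions, the seed-word bootstrapping loop just described. The `search_engine` function is a hypothetical placeholder for a real crawler or search API, and the term-selection heuristic is deliberately simple.

```python
from collections import Counter
import re

def search_engine(query, limit=10):
    """Hypothetical placeholder: return (url, text) pairs for documents
    matching `query`. Plug in a real search or crawl backend here."""
    raise NotImplementedError("no real backend in this sketch")

def build_corpus(seed_words, target_size=100, terms_per_round=3):
    corpus = {}                      # url -> document text
    queries = [" ".join(seed_words)]
    seen_queries = set()
    while queries and len(corpus) < target_size:
        query = queries.pop(0)
        if query in seen_queries:
            continue
        seen_queries.add(query)
        for url, text in search_engine(query):
            corpus.setdefault(url, text)
        # Extract new candidate terms from the documents collected so far
        # and combine them with the seeds to form the next round of queries.
        counts = Counter(w for doc in corpus.values()
                         for w in re.findall(r"[a-z]{4,}", doc.lower()))
        new_terms = [w for w, _ in counts.most_common(terms_per_round)]
        queries.extend(f"{seed} {term}"
                       for seed in seed_words for term in new_terms)
    return corpus
```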
1.4 Examples of some corpora and their purpose
The most important corpora are:
- Brown Corpus: Created by Kucera and Francis, the Brown Corpus was a carefully compiled selection of documents. The language of the documents was English, specifically the American dialect variant, and it stored around a million words extracted from a wide variety of sources. To create this resource, the authors performed a variety of computational analyses, thanks to which they compiled a rich work that combined elements of linguistics, psychology, statistics and sociology. This corpus has been very relevant in the history of text mining, since it has been widely used in computational linguistics and, for many years, it was one of the most cited resources. It has also served as the basis for later corpora such as the Lancaster-Oslo-Bergen Corpus or SUSANNE. Regarding its internal structure, it is a POS-tagged corpus, that is, it assigns (or tags) to each word in a text its grammatical category (which in this corpus can correspond to 82 labels). A minimal tagged-corpus example follows this list.
- SUSANNE Corpus: SUSANNE is short for Surface and Underlying Structural ANalysis of Natural English. As mentioned for the previous corpus, SUSANNE derives from the Brown Corpus, since it uses 64 of the 500 samples used by the Brown Corpus. In an initial phase it consisted of about 130,000 words but used different documents. The first difference was the use of the two main dialects of English, British and American. Like the Brown Corpus, its base structure is of the POS-tagging type, but unlike it, SUSANNE is made up of 353 labels, and the themes used in its documents are basically Press, Belles Lettres, Learned, and Fiction: Adventure and Western. Thanks to these differences, SUSANNE provided an improvement in probabilistic analysis and, consequently, an improvement over its predecessor Brown in the statistics that linguists could obtain. Finally, the Royal Signals and Radar Establishment implemented an evolution using samples of spontaneous spoken English, which resulted in a new corpus called CHRISTINE. It was released in August 1999 and is one of the oral corpora available for analyzing spoken language.
- Penn Treebank: The Penn Treebank is a corpus of more than 4.5 million words of American English built during the first three-year phase of the Penn Treebank Project (1989-1992). This corpus has two types of tagging: lexical and syntactic. The set of samples from which it is composed comes from several different corpora, specifically the following: Dept. of Energy abstracts, Dow Jones Newswire stories, Dept. of Agriculture bulletins, Library of America texts, MUC-3 messages, IBM Manual sentences, WBUR radio transcripts, ATIS sentences, and the Brown Corpus (retagged).
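As a minimal illustration of what a POS-tagged corpus looks like in practice, the following sketch reads the Brown Corpus through the NLTK library, assuming NLTK is installed and the Brown data has been downloaded with `nltk.download("brown")`.

```python
import nltk
from nltk.corpus import brown

# Each word in the Brown Corpus is stored together with its grammatical tag.
for word, tag in brown.tagged_words()[:10]:
    print(f"{word}/{tag}")

# Tag frequencies give a quick view of the tagset actually used.
tag_counts = nltk.FreqDist(tag for _, tag in brown.tagged_words())
print(tag_counts.most_common(5))
```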
2. Textual Information Extraction (Automatic Information Extraction)
2.1 Definition and objectives
Information extraction is defined as the task of identifying descriptions of events in natural language texts and subsequently extracting the information related to those events [Patwardhan S. & Riloff E., 2006]. In other words, an information extraction system finds and links relevant information while ignoring extraneous and irrelevant information [Cowie J. & Lehnert W., 1996].
The rise of the Internet and the increase in the number of stored documents have led to increased research into information extraction methods. The purpose has been to find new methods to populate databases from this large number of documents, thus also improving the tools for acquiring useful knowledge for emerging technologies such as text mining. With this evolution, the older information retrieval (IR) paradigm is no longer sufficient: IR systems simply answer queries with a list of potentially relevant documents, whereas in information extraction (IE) the relevant content of those documents has to be located and extracted from the text. In the IE methodology, the content considered relevant is decided a priori, which means that there is a clear dependency on the domain. Naturally, when dealing with new domains, new specific knowledge will be needed and will have to be acquired by the system.
Below is an example of how an information extraction system would work. The following text is part of a document that belongs to the management succession domain, extracted from a free text [Turmo J. et al., 2006].
A.C. Nielsen Co. told George Garrick, 40, president of London Information Resources that works in the European information services operation, will become president of Nielsen Marketing Research, a unit of Dun & Bradstreet corporation. He will succeed John I. Costello, who resigned in March.
The output of an information extraction system is a set of records per input document. The table below shows the record extracted from the text fragment shown in this section. It is worth mentioning that each record is made up of fields. These fields are established in the first stages of the extraction system and are grouped into what is called an extraction template. It is important to note that each field represents information that is relevant according to the domain and that will be useful for the analysis of the set of input textual documents.
| MANAGERS INFORMATION | VALUE |
|---|---|
| INCOMING PERSON | George Garrick |
| OUTGOING PERSON | John I. Costello |
| POSITION | President |
| ORGANIZATION | Nielsen Marketing Research |
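A minimal sketch of how such an extraction template might be represented in code is shown below; the field names simply mirror the table above, and the dataclass itself is an illustrative assumption rather than part of any specific IE system.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class ManagementSuccessionRecord:
    # One slot per field of the extraction template; None means "not filled".
    incoming_person: Optional[str] = None
    outgoing_person: Optional[str] = None
    position: Optional[str] = None
    organization: Optional[str] = None

# The record extracted from the example text fragment.
record = ManagementSuccessionRecord(
    incoming_person="George Garrick",
    outgoing_person="John I. Costello",
    position="President",
    organization="Nielsen Marketing Research",
)
print(record)
```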
Various information extraction methods have been built to date. These methods are characterized by using two types of approaches: supervised and unsupervised.
2.2 Architecture of an IE system
In general, a cascade of modules performs the following functions, to a greater or lesser extent:
- Document preprocessing. This step can be carried out by a variety of modules, such as: text zoners (convert a text into a set of text zones), segmenters (also called splitters, in charge of segmenting zones into appropriate units, usually sentences), filters (select the relevant segments), tokenizers (obtain lexical units), lexical analyzers (perform morphological analysis, classification and NE recognition), disambiguators (POS taggers, semantic taggers, etc.), and stemmers and lemmatizers, among others. Especially interesting for IE are the NE (Named Entity) recognition modules. These modules are characterized by their speed, thanks to the use of finite-state transducers (a finite-state automaton with two tapes, one for input and one for output) and dictionary lookup with algorithms that optimize the computations. The results obtained when applying these methods depend on the sources of information used to populate the dictionary. For example, in his experiments on NE recognition, Grishman used the following sources, which gave him good results: a small gazetteer containing the names of all countries and major cities; a business dictionary; a government agency dictionary; a dictionary of common names; and a dictionary of specific terms.
- Syntactic analysis and semantic interpretation. This step produces the interpretations of sentences that will later be linked to one another. The following points are taken into account in this process:
- A full parse involves a large amount of memory and a relatively unbounded search space; as a consequence, it has a high computational and memory cost.
- A complete parse is not a robust process, because the global syntax tree is not always reached. To make up for this, the analysis tries to cover the longest substring of the sentence.
- A complete analysis may produce ambiguous results. In those cases more than one syntactic interpretation is obtained and, if both analyses provide valid results, the more correct interpretation has to be chosen.
- The broad-coverage grammars necessary for complete analysis are difficult to fine-tune. When dealing with new domains, new syntactic constructions could occur in specialized texts and not be recognized.
- An analysis restricted to a fixed vocabulary cannot handle out-of-vocabulary situations.
- Once the constituents have been analyzed, the systems resolve domain-specific dependencies between them, generally using the semantic restrictions imposed by the extraction scenario. Two different approaches are usually used to resolve such dependencies:
- Pattern matching. This approach (the most used among extraction systems) is based on syntactic simplification, which reduces the semantic processing to matching the specific patterns of the scenario, also called extraction patterns or IE rules (a minimal sketch appears after this list). These patterns are used to identify dependencies between document elements. In fact, IE rules are sets of ambiguity-resolution decisions to be applied during a complete analysis. There are two types of extraction patterns. On the one hand, there are IE rules that encode the syntactic-semantic expectations of the different extraction tasks; these rules allow properties of entities, and relationships between such entities, to be identified through the use of syntactic-semantic information about nouns and modifiers. On the other hand, there are IE rules that use predicate-argument sets (object, subject, modifiers) and allow events between entities to be identified. The representation of these IE rules differs greatly between IE systems.
- Grammatical relations. Generally, the pattern-matching strategy requires a proliferation of task-specific IE rules, with explicit variants for each verb form and for different lexical heads. Instead of using IE rules, a more flexible syntactic model consists of defining a set of grammatical relationships between entities: general relationships (subject, object and modifier), some specialized modifier relationships (temporal and location), and relationships for arguments mediated by prepositional phrases, among others. In a way similar to dependency grammars, a graph is constructed following general interpretation rules for grammatical relationships. The previously detected fragments are the nodes of the graph, and the relationships between them are its labeled edges.
- Discourse analysis. IE systems generally proceed by representing the information extracted from a sentence as partially filled templates or as logical forms. The main goal of the discourse analysis process is the resolution of semantic aspects, such as the presence in the text of coreference, anaphora, etc. Systems that work with partial templates make use of some merging procedure for that task, while working with logical forms allows IE systems to use traditional semantic interpretation processes.
- Generation of output templates. Template generation attempts to map the extracted information to the desired output format. However, some inferences can occur in this phase due to domain-specific constraints on the output structure, as in the following cases:
- Output slots that take values from a predefined set.
- Output slots that are forced to be instantiated.
- Classes of extracted information that generate a set of different output templates.
- Output slots that have to be normalized: for example, dates, or products that must be normalized to a code from a standard list.
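The following Python sketch illustrates, in a deliberately simplified way, the pattern-matching approach and the template filling mentioned above: a single hand-written regular expression stands in for an IE rule of the management succession domain. Real IE rules encode much richer syntactic-semantic constraints; the sentence and the pattern are illustrative assumptions.

```python
import re

# A toy "IE rule": who succeeds whom in a management succession event.
PATTERN = re.compile(
    r"(?P<incoming>[A-Z][\w. ]+?) will succeed (?P<outgoing>[A-Z][\w. ]+?),"
)

def extract(sentence):
    # Empty output template whose slots the rule tries to fill.
    template = {"INCOMING PERSON": None, "OUTGOING PERSON": None}
    match = PATTERN.search(sentence)
    if match:
        template["INCOMING PERSON"] = match.group("incoming").strip()
        template["OUTGOING PERSON"] = match.group("outgoing").strip()
    return template

# Simplified version of the example sentence (coreference already resolved).
print(extract("George Garrick will succeed John I. Costello, who resigned in March."))
```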
3. Terminology Extraction (Automatic Terminology Extraction)
3.1 Definition and objectives
Terminology extraction is the process by which candidate units to constitute terms are selected from a text or set of texts. Put this way, it may seem that we want to build a terminological glossary from a text or from a terminology database, but this is not the case. The two processes must be clearly differentiated: in automatic terminology extraction, we try to discover the most relevant terms without knowing them in advance, whereas in the other case we look for which terms from a terminology database are present in a certain text, so the possible terms are known a priori.
3.2 Methodology
As stated in the article "Terminology Retrieval: towards a synergy between thesaurus and free text searching", building a thesaurus requires collecting a set of salient terms. This is a task that combines two approaches:
- Deductive approach: analyzes existing vocabularies, thesauri and indexes to design a new thesaurus with the desired scope, structure and level of specificity.
- Inductive approach: Analyzes real-world vocabularies in document repositories to identify terms and update terminologies.
Once these approaches are known, we can analyze the methodology, which is developed in the following steps:
- Term extraction via morphological analysis. We distinguish between single-word terms (monolexical terms) and multi-word terms, extracted with different techniques.
- Weighting of the terms with statistical information, measuring the relevance of each term in the domain.
- Selection of the terms. The relevance of each term is obtained, and terms falling below the selected thresholds are removed from the list.
These steps require a previous one in which the relevant corpus is identified, automatically collected and prepared for the terminology retrieval task. A minimal sketch of the weighting and selection steps follows.
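The following Python code scores candidate terms by comparing their relative frequency in a domain corpus against a general reference corpus and keeps only those above a threshold; the corpora, the smoothing constant and the threshold are illustrative assumptions.

```python
from collections import Counter
import re

def frequencies(texts):
    # Relative frequency of each word form in a collection of texts.
    counts = Counter(w for t in texts for w in re.findall(r"[a-z]+", t.lower()))
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}

def candidate_terms(domain_texts, reference_texts, threshold=2.0):
    domain = frequencies(domain_texts)
    reference = frequencies(reference_texts)
    scored = {}
    for term, rel_freq in domain.items():
        # Relevance = how much more frequent the term is in the domain
        # than in general language (a simple frequency-ratio weighting).
        scored[term] = rel_freq / reference.get(term, 1e-6)
    # Selection step: keep only terms above the chosen threshold.
    return [t for t, score in scored.items() if score >= threshold]
```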
3.3 Extraction of terminology from the web
Terminology extraction from the web uses NLP techniques (NLP is a field of computer science, artificial intelligence and linguistics that studies the interactions between computers and human language) to automatically perform the following tasks:
- Terminology extraction and indexing of a multilingual text collection. The document collection is automatically processed to obtain a large list of terminological phrases using syntactic patterns. The selection of phrases is based on document frequency and on phrase inclusion.
- Interactive natural language query processing and retrieval, performed with the following procedures:
- Lemmatized search words are expanded with semantically related words in the query language and all target languages, using the EuroWordNet lexical database and some bilingual dictionaries.
- A high number of phrases containing the expanded words is retrieved. To handle the noise introduced by semantically related words (such as synonyms), term retrieval and ranking methods based on information in the phrases are used, which allow most of the inappropriate word combinations to be ruled out, in both the source language and the target language.
- The documents are also ranked according to the frequency and coverage of the relevant phrases they contain.
- Navigation by phrases, considering morphosyntactic, semantic and cross-lingual variations of the query. Two rankings are presented:
- A ranking of phrases that are relevant to the user's query.
- A ranking of documents that are relevant to the query.
On the other hand, the phrases in the different languages are organized into a hierarchy according to:
- Number of expanded terms contained in the phrase.
- The phrase's weight as a terminological expression. This weight reduces to the frequency within the document collection if there is no multidisciplinary corpus to compare against.
- Inclusion of phrases. If a subphrase is contained within another phrase that is more frequent in the collection, the longer phrase is attached under it, trying to simulate a hierarchy of topics that allows us to navigate the information more easily.
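The following sketch illustrates, under simple assumptions, the two ranking criteria just mentioned: phrases are ordered by document frequency, and longer phrases are attached under a more frequent subphrase to simulate a small topic hierarchy. The phrases and documents are placeholders.

```python
def document_frequency(phrase, documents):
    # Number of documents in which the (lowercased) phrase occurs.
    return sum(1 for doc in documents if phrase in doc.lower())

def rank_phrases(phrases, documents):
    # Rank phrases by how many documents contain them.
    ranked = sorted(phrases,
                    key=lambda p: document_frequency(p, documents),
                    reverse=True)
    hierarchy = {p: [] for p in ranked}
    for phrase in ranked:
        for other in ranked:
            # Attach a longer phrase under the most frequent subphrase it contains.
            if other != phrase and other in phrase:
                hierarchy[other].append(phrase)
                break
    return ranked, hierarchy
```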
3.4 Problems associated with natural language
The main problems that arise in natural language are the following:
- Loss of coverage due to non-exhaustive syntactic patterns and incorrect part-of-speech labeling.
- Loss of coverage due to incorrect stemming of phrase components in the text.
- Loss of coverage due to incorrect expansion, lemmatization or translation of the words in the query, which causes incorrect discards in the selection of phrases and in the ranking of terms.
- Mismatches caused by accents and capitalization.
4. Similarity, classification, clustering
4.1 Definition of each one. Similarities and differences.
Classification
Classification is a data analysis technique used to assign data to categories and predict trends. Typical applications include risk analysis for loans and growth prediction. Some techniques for data classification include Bayesian classification, k-nearest neighbors and genetic algorithms, among others.
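As a minimal sketch of one of the classification techniques mentioned above (k-nearest neighbors), the following code uses scikit-learn, assuming it is installed; the tiny loan-risk data set is an illustrative assumption, not real data.

```python
from sklearn.neighbors import KNeighborsClassifier

# Features: [income (k$), outstanding debt (k$)]; labels: loan risk.
X_train = [[55, 5], [30, 20], [80, 2], [25, 30], [60, 10], [20, 25]]
y_train = ["low", "high", "low", "high", "low", "high"]

# Classify new applicants by the majority label of their 3 nearest neighbors.
model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)
print(model.predict([[50, 8], [22, 28]]))
```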
Clustering
Clustering techniques are unsupervised classification techniques that group patterns (observations, data items or feature vectors) into groups or clusters. These techniques have been used in various disciplines and applied in different contexts, which reflects their great usefulness in experimental data analysis.
Clustering vs classification
First of all, it is important to note the differences between clustering and classification. In the first case, there is no information about the organization of the elements under analysis, and the objective is to find that organization. In the second, there is information about the elements that make up the analysis set, and what we want to determine is which factors intervene in the definition of the elements and which values of the elements determine them.
Looking at clustering a little more closely, we can say that two items or variables belonging to the same group must be more similar to each other than items in different groups. Grouping techniques start from this idea. These techniques clearly depend on the type of data being analyzed, on the similarity measures being used, and on the kind of problem being solved.
More precisely, the goal is to gather a set of objects into classes such that the degree of natural association of each individual is high with members of its own class and low with members of other classes. The essence of cluster analysis then focuses on how to assign meaning to the terms natural groups and natural association, where natural usually refers to homogeneous and well-separated structures.
4.2 Purpose of each one
Clustering
The basic idea of model-based clustering is the approximation of the data density by a mixture model, usually a mixture of Gaussians (a weighted sum of k Gaussians), estimating from the data the parameters of the component densities, the mixing fractions and the number of components. The number of distinct groups in the data is the number of components in the mixture, and the observations are assigned to groups using Bayes' rule.
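A minimal sketch of model-based clustering with a mixture of Gaussians is shown below, assuming NumPy and scikit-learn are installed; the synthetic two-dimensional data and the candidate numbers of components are illustrative assumptions.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Two synthetic groups drawn from different Gaussians.
data = np.vstack([
    rng.normal(loc=[0, 0], scale=0.5, size=(50, 2)),
    rng.normal(loc=[3, 3], scale=0.5, size=(50, 2)),
])

# Fit mixtures with different numbers of components and keep the one with
# the lowest BIC; observations are then assigned to components (Bayes' rule).
models = [GaussianMixture(n_components=k, random_state=0).fit(data) for k in (1, 2, 3)]
best = min(models, key=lambda m: m.bic(data))
print("chosen number of groups:", best.n_components)
print("cluster assignments:", best.predict(data)[:10])
```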
Classification
The purpose of classification is to assign a set of data to predefined categories based on a model created from pre-classified training data (supervised learning). In more general terms, both clustering and classification fall under the area of knowledge discovery in databases, or data mining.
4.3 Uses and applications
Clustering
Within the area of Web usage mining we can find various studies related mainly to grouping by content, this being one of the main areas where clustering is used on the Web. For example, some search engines that use this technique to perform grouping or clustering by content are Vivisimo, Grokker, Clusty and iBoogie.
Thus, there are different systems concerned with knowing the characteristics of the user, related mainly to the content the user visits or the topics involved in their navigation. For this reason a need arises: the need to group user pages in order to know which are the most representative pages and to identify groups of users with certain characteristics, preferences and/or interests in their navigation. This allows us to carry out demographic studies and also to obtain different profiles that represent sets of user characteristics. By making these groupings we can, in some way, provide better information to the user during their navigation.
Classification
The classification method is based on a naive Bayesian classifier (a probabilistic classifier based on Bayes' theorem and some additional simplifying assumptions). Some notable works concerned with improving web search describe methods within a hierarchical clustering approach.
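As a minimal sketch of a naive Bayesian text classifier of the kind mentioned above, the following code uses scikit-learn, assuming it is installed; the labelled page snippets and the two categories are illustrative assumptions.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy labelled snippets of web-page text.
pages = [
    "latest football scores and match reports",
    "stock market closes higher after earnings",
    "championship final ends in penalty shootout",
    "central bank raises interest rates again",
]
labels = ["sports", "finance", "sports", "finance"]

# Bag-of-words features feeding a multinomial naive Bayes classifier.
classifier = make_pipeline(CountVectorizer(), MultinomialNB())
classifier.fit(pages, labels)
print(classifier.predict(["interest rates and market earnings report"]))
```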
Nahm and Mooney describe a methodology in which information extraction and data mining can be combined to improve each other: information extraction provides the data mining process with access to text documents (text mining), and in turn data mining provides the information extraction process with learned rules that improve its performance.
5. Related research areas
Content mining and text mining are directly related to research areas concerned with the following approaches:
- Information Retrieval and Information Extraction
IR and web mining have different objectives. Web mining is used by large companies in the web world, which make use of this type of systems for search engines (Google and AltaVista), hierarchical directories (Yahoo) and other types of agents and collaborative filtering systems.
- From the point of view of Databases
The main objective of web content mining from the database point of view is to represent data through labeled graphs. It is also related to the following areas:
- Web Structure Mining
- Web Usage Mining
- Association rules.
- Sequence patterns.
- Clustering.
The main categories of Web text mining are text categorization, text clustering, association analysis and trend prediction.
6. International conferences
Some of the international conferences that address the topic of web usage mining are the following:
- International Conference on Database Theory (ICDT)
- International Conference on Very Large Data Bases (VLDB), IBM Almadén Research Center
- International World Wide Web Conference
- Conference on Artificial Intelligence (AAAI'98)
- International Conference on Machine Learning (ICML)
- International Conference on Distributed Computing Systems
- European Conference on Machine Learning (ECML-98)
- International Conference on Knowledge Discovery and Data Mining
- International Computer Software and Applications Conference on Prolonging Software Life