Information retrieval on the web: boolean, probabilistic and vector models

Classical information retrieval models explain the foundations of search, but the Web adds scale, spam, links, freshness and user intent.

Boolean

Precise matching with logical operators, useful but rigid for exploratory search.

Probabilistic

Ranks documents by estimated probability of relevance to the query.

Vector

Represents queries and documents as weighted term vectors and compares similarity.

Web limits

Links, duplicates, spam and changing pages require extra signals beyond text.

How to retrieve a Web document

In this project we produce a report on the limitations that the models and techniques developed in traditional Information Retrieval show when they are applied to searching the Web.

1. Introduction

This task summarizes the main characteristics of the classic Information Retrieval (IR) models, which are:

  • Boolean model.

    It is based on Boolean logic and classical set theory: both the documents to be searched and the user's query are conceived as sets of terms. Retrieval depends on whether or not the documents contain the terms of the query.
  • Probabilistic model.

    The probabilistic retrieval model is based on probabilistic matching: given a document and a query, it is possible to calculate the probability that the document is relevant to the query.
  • Vector model.

    The vector retrieval model, or vector space model, proposes a framework in which partial matching is possible by assigning non-binary weights to the index terms of queries and documents. These term weights are used to compute the degree of similarity between each document stored in the system and the user's query.

2. Boolean Model

The Boolean model is based on set theory, so retrieval depends on whether or not the documents contain the terms of the query.

A document is represented as a set of terms, so a term is either present in or absent from a given document, with no possibility of establishing different degrees of membership. Queries are expressed as Boolean expressions whose operators correspond to operations on sets:

  • AND: intersection of sets.
  • OR: union of sets.
  • NOT: complement of a set.

The result is an unordered set containing all the documents that satisfy the Boolean expression of the query.
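
The correspondence between Boolean operators and set operations can be illustrated with a minimal Python sketch. The toy collection `docs` and the helper `postings` are hypothetical names introduced only for this example; a real system would read postings from an index rather than scan every document.

```python
# Toy collection (assumed for illustration): each document is a set of terms,
# so a term is either present or absent, with no weights.
docs = {
    "d1": {"web", "search", "ranking"},
    "d2": {"boolean", "search", "model"},
    "d3": {"vector", "model", "ranking"},
}

def postings(term):
    """Return the set of ids of the documents that contain `term`."""
    return {doc_id for doc_id, terms in docs.items() if term in terms}

all_docs = set(docs)

# search AND model: intersection of the two sets
and_result = postings("search") & postings("model")

# search OR vector: union of the two sets
or_result = postings("search") | postings("vector")

# model AND NOT boolean: intersection with the complement
not_result = postings("model") & (all_docs - postings("boolean"))

print(and_result, or_result, not_result)
```

Every result is a plain set, so the documents it contains come back in no particular order and none of them is preferred over the others.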

2.1 Advantages

  • Simplicity
  • Easy to implement

2.2 Disadvantages

  • It is not possible to sort the results obtained
  • The number of times that a word appears in a document is not taken into account
  • Does not differentiate between the AND and OR operators and the natural language words 'and' and 'or'

3. Probabilistic Model

The probabilistic model was formulated by Stephen Robertson and Karen Spärck Jones in 1977. To define this model, we first have to accept that the IR process is inherently imprecise.

This model works as follows. A user submits a query to the system in search of certain information, and the model estimates the probability that each document accessible to it is relevant to that query. If we denote the query by $C$ and any document by $d$, we can write this probability as $P\left(\frac{C}{d}\right)$.

The model attempts to obtain a set of relevant documents (called $R$) that maximizes the probability of relevance. A document is considered relevant if its probability of being relevant, $P_{\mathrm{rel}}\left(\frac{C}{d}\right)$, is greater than its probability of not being relevant, $P_{\mathrm{not\,rel}}\left(\frac{C}{d}\right)$.

The probabilistic model is based on a feedback process. It begins with a first set of relevant documents, which is gradually recalculated from the information the user provides about which of those documents they consider relevant and which not relevant.
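
One feedback step can be sketched as follows. The text above does not say how the probabilities are estimated, so this sketch assumes the Robertson/Spärck Jones relevance weight with 0.5 smoothing, computed from binary term presence only. The collection `docs` and the function names `rsj_weight` and `rank` are hypothetical and serve only as illustration.

```python
import math

def rsj_weight(N, n, R, r):
    """Relevance weight of one term, with 0.5 smoothing (assumed estimator).

    N: documents in the collection
    n: documents that contain the term
    R: documents judged relevant by the user
    r: judged-relevant documents that contain the term
    """
    return math.log(((r + 0.5) * (N - n - R + r + 0.5)) /
                    ((n - r + 0.5) * (R - r + 0.5)))

def rank(docs, query, relevant):
    """Rank documents by the summed weights of the query terms they contain."""
    N, R = len(docs), len(relevant)
    weights = {}
    for term in query:
        n = sum(term in terms for terms in docs.values())
        r = sum(term in docs[doc_id] for doc_id in relevant)
        weights[term] = rsj_weight(N, n, R, r)
    scores = {doc_id: sum(weights[t] for t in query if t in terms)
              for doc_id, terms in docs.items()}
    return sorted(scores, key=scores.get, reverse=True)

# Toy collection (assumed): presence or absence of terms only.
docs = {
    "d1": {"web", "search", "ranking"},
    "d2": {"boolean", "search", "model"},
    "d3": {"vector", "model", "ranking"},
    "d4": {"web", "spam", "links"},
    "d5": {"search", "engine", "links"},
}

# The user has judged d1 relevant; the ranking is recomputed from that judgment.
ranking = rank(docs, ["search", "ranking"], relevant={"d1"})
print(ranking)
```

Each new relevance judgment changes `R` and `r`, and therefore the term weights, which is how the ranking is gradually recalculated. Because only term presence is used, a document that repeats a query term many times scores the same as one that contains it once.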

3.1 Advantages

  • Provides an ordering of documents based on their probability of relevance

3.2 Disadvantages

  • The need to start the model from a first estimate of the set of relevant documents
  • The number of times each term appears in a document is not taken into account when estimating its probability of relevance.
  • The results are not much better than those obtained in the Boolean model.

4. Vector Model

The vector model sorts the retrieved documents in decreasing order of their degree of similarity to the query. It also takes into consideration documents that match the query only partially, so the ranked answer set is much more precise (in the sense that it better matches the user's information need) than the set retrieved by the Boolean model. The quality of this ranking is difficult to improve upon.

In the vector model, documents are represented as vectors of terms, and so are queries. The model retrieves the relevant documents based on the similarity between each document vector and the query vector in a multidimensional space with one dimension per term.

The most widely adopted representation is known as a bag of words: a collection of n indexed documents and m terms is represented by an n x m document-term matrix. The n row vectors represent the n documents, and the value assigned to each component reflects the importance, or weighted frequency, of the term, phrase or concept $t_i$ in the semantic representation of document $d_j$: $d_j = (w_{1j}, w_{2j}, \ldots, w_{mj})$, where m is the cardinality of the dictionary (the list of unique terms that appear in the set of documents) and $0 \le w_{ij} \le 1$ represents the contribution of term $t_i$ to the semantic representation of document $d_j$.
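
A small Python sketch makes the document-term matrix and the similarity ranking concrete. The weighting used here (term frequency divided by the largest term frequency in the text, which keeps every weight between 0 and 1) and cosine similarity are assumptions chosen for illustration, since the text does not fix a particular scheme. The names `docs`, `to_vector`, `cosine` and `search` are hypothetical.

```python
import math
from collections import Counter

# Toy collection (assumed for illustration).
docs = {
    "d1": "web search ranking web",
    "d2": "boolean search model",
    "d3": "vector model ranking",
}

# Dictionary: the m unique terms of the collection, in a fixed order.
vocab = sorted({term for text in docs.values() for term in text.split()})

def to_vector(text):
    """Map a text to an m-dimensional vector with weights in [0, 1].

    Assumed weighting: term frequency divided by the largest term
    frequency in the text. Terms outside the dictionary are ignored.
    """
    counts = Counter(text.split())
    top = max(counts.values())
    return [counts[term] / top for term in vocab]

def cosine(u, v):
    """Cosine of the angle between two vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

# n x m document-term matrix: one row (vector) per document.
matrix = {doc_id: to_vector(text) for doc_id, text in docs.items()}

def search(query):
    """Return (doc_id, similarity) pairs with non-zero similarity, best first."""
    q = to_vector(query)
    scored = [(doc_id, cosine(q, vec)) for doc_id, vec in matrix.items()]
    return sorted((p for p in scored if p[1] > 0), key=lambda p: p[1], reverse=True)

results = search("web ranking")
print(results)
```

For the query "web ranking", d3 is still retrieved even though it does not contain "web", which is the partial matching the model allows; d1 ranks above it because it contains both terms.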

4.1 Advantages

  • More reliable
  • It allows partial matches, since a document can be considered relevant even if it does not include all the terms of the query.
  • The ordering of the results is based on several factors: the frequency of the terms, the importance of the terms, and no priority given to longer documents.
  • Allows efficient deployment for large document collections.

4.2 Disadvantages

  • Its implementation. It requires the values of all the components of each vector, but these are not directly available in an inverted file architecture. In practice, normalized values and the vector product algorithm should be used.
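
One way to read this last point is sketched below, under stated assumptions: each weight is divided by its document's vector length before being stored in the inverted file, so the vector product can be accumulated one query term at a time without ever rebuilding the full vectors. The data in `doc_weights` and the names `index` and `score` are hypothetical.

```python
import math
from collections import defaultdict

# Hypothetical weighted documents: only non-zero components are stored.
doc_weights = {
    "d1": {"web": 1.0, "search": 0.5, "ranking": 0.5},
    "d2": {"boolean": 1.0, "search": 1.0, "model": 1.0},
    "d3": {"vector": 1.0, "model": 1.0, "ranking": 1.0},
}

# Inverted file: term -> postings list of (doc_id, normalized weight).
# Normalizing at indexing time means no full vector is needed at query time.
index = defaultdict(list)
for doc_id, weights in doc_weights.items():
    norm = math.sqrt(sum(w * w for w in weights.values()))
    for term, w in weights.items():
        index[term].append((doc_id, w / norm))

def score(query_weights):
    """Accumulate the normalized vector product one query term at a time."""
    q_norm = math.sqrt(sum(w * w for w in query_weights.values()))
    acc = defaultdict(float)
    for term, qw in query_weights.items():
        # Only the postings of the query terms are visited.
        for doc_id, dw in index.get(term, []):
            acc[doc_id] += (qw / q_norm) * dw
    return sorted(acc.items(), key=lambda p: p[1], reverse=True)

results = score({"web": 1.0, "ranking": 1.0})
print(results)
```

Documents that share no term with the query never enter the accumulator, so the cost depends on the length of the postings lists visited rather than on the size of the dictionary.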