How web document search works: indexing, anchors and ranking signals

Searching web documents is harder than searching a closed collection: pages change, links matter, spam exists and the same content can appear in many formats.

Web context

Distributed, dynamic and noisy documents require crawling, normalization and deduplication.

Indexing

Terms, metadata and document structure are transformed into searchable representations.

Anchor text

Links describe pages from the outside and can improve discovery and ranking.

Signals

Entropy, language models and relevance features help order results beyond keyword matching.

Why this matters for SEO

Search engines do not only read the words inside a page. They also use crawlability, links, metadata, duplication control and document structure to decide whether a page deserves to be indexed and how it should be understood. That is why technical SEO, internal linking and clear headings are practical applications of information retrieval.

How to find a web document

In this project we describe web document search, analyzing the different types of information involved, the indexing process and the interfaces used.

1. Characteristics of the Web.

To analyze search in the context of the Web, it is necessary to start with the characteristics of this environment. The Internet, and more precisely the set of web documents it contains, is a distributed, dynamic and constantly expanding information source that can provide a large amount of information. These characteristics contrast with traditional retrieval methods, which were designed to index static, directly accessible documents, and they represented a true revolution in the field.

As a direct consequence, the architecture of search engines itself was changed in an attempt to resolve the following problems arising from this revolution:

  • Indexing a very large number of websites. The exponential growth of the Internet makes it very difficult to build an index of all this content. Alongside this, a new problem known as spam has arisen: web documents that the search returns even though their information is of no interest to the user.
  • Continuous expansion. Every day, users create more pages on the Internet. The most widespread solution is for website authors to submit their sites manually to the different search engines using submission tools. However, many people are unaware of these tools.
  • Crawler speed. It is very important to use efficient crawlers and hardware suited to running the software that indexes the Web.
  • Duplication of web content. Downloading the content of every page, analyzing it and checking whether it is duplicated makes the process less efficient, since time must be spent deleting duplicates, updating URLs and modifying the index.
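The duplicate check described in the last point can be illustrated with a minimal sketch. It assumes that two pages count as duplicates when their text is identical after lowercasing and collapsing whitespace; the function names and example URLs are hypothetical, and real crawlers typically use more tolerant near-duplicate techniques.

```python
import hashlib
import re


def content_fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing runs of whitespace."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def group_duplicates(pages: dict) -> dict:
    """Group URLs by fingerprint; URLs in the same group share the same content."""
    groups = {}
    for url, text in pages.items():
        groups.setdefault(content_fingerprint(text), []).append(url)
    return groups


# Hypothetical crawl result: the first two pages differ only in case and spacing.
pages = {
    "http://example.com/a": "Hello   World",
    "http://example.com/b": "hello world",
    "http://example.com/c": "Something else",
}
# Keep the first URL seen in each group as the canonical copy.
canonical = [urls[0] for urls in group_duplicates(pages).values()]
```

Only the canonical URLs would then be passed on to the indexer, so the index does not need to be cleaned afterwards.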

2. Types of information to consider when searching on the web

Textual Content

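The most common way to express and communicate information about a topic on the Internet is through text. The text can be encoded in several ways:

  • EBCDIC and ASCII: 8 bits per character (ASCII itself defines 7-bit codes, normally stored in 8-bit bytes).
  • Unicode: originally designed around 16-bit codes and needed for non-Latin scripts such as East Asian languages; today it is usually stored as UTF-8 or UTF-16.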

As there is no single format, information retrieval systems must retrieve documents in every possible format, and it is not always possible to interpret the encoding they use. This is a serious problem with no easy solution, since options such as converting all documents into a single format have too high a computational cost given the variety of fonts, layouts and text types that exist. The most widespread solution, and the one that has obtained the best results, is to apply a series of filters that avoid these conversions and so increase efficiency. The formats that we can find in documents include:

  • RTF: document exchange format
  • PDF: format for displaying documents
  • MIME: format for e-mail
  • RAR: compressed files

Once we are clear about the type of text we are going to search, it is useful to measure the amount of information that a document provides. The metrics used are normally related to the distribution of symbols in the document. To this end, information theory uses entropy as a magnitude that measures the information of a data source. The intuition is that unlikely events carry more information than expected ones: a concert held on a weekend tells us little, because that is the usual date and attendance is likely to be high, whereas a concert held on a Wednesday is unexpected and suggests something else, such as a nearby holiday or a music festival.

Applying this measure of entropy to information sources of different kinds has several benefits but, as the example shows, its main virtue is to avoid expressing redundant information in our searches. The formal definition of entropy is:

$$H = -\sum_{i=1}^{m} p_{i} \log_{2}(p_{i})$$

where $H$ is the entropy, $p_{i}$ is the probability of the $i$-th code appearing and $m$ is the total number of codes. The base-2 logarithm is normally used so that the entropy is expressed in bits. For example, suppose a message has three states, $M_{1}, M_{2}, M_{3}$, where the probability of $M_{1}$ is 50%, that of $M_{2}$ is 25% and that of $M_{3}$ is 25%. Using $-p_{i}\log_{2}(p_{i}) = p_{i}\log_{2}(1/p_{i})$, the entropy of the message is:

$$H(M) = \frac{1}{2}\log_{2}(2) + \frac{1}{4}\log_{2}(4) + \frac{1}{4}\log_{2}(4) = 1.5$$
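The three-state example can be checked numerically. This is a minimal sketch using only the standard library; the function name `entropy` is an illustrative choice, and zero-probability states are skipped by convention since they contribute nothing to the sum.

```python
import math


def entropy(probabilities):
    """Shannon entropy in bits: the negated sum of p * log2(p) over all states."""
    return -sum(p * math.log2(p) for p in probabilities if p > 0)


# The three-state message from the text: 50 %, 25 %, 25 %.
h = entropy([0.5, 0.25, 0.25])
print(h)  # 1.5
```

A uniform distribution over the same three states would give a higher value (about 1.585 bits), which matches the intuition that less predictable sources carry more information per symbol.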

Natural language modeling

If we analyze natural language, we can verify that it is composed of symbols that either separate words or form part of them. These symbols do not follow a uniform distribution: a simple approach models them with a binomial model, but in practice each symbol depends on the previous ones, so natural language is better treated as a Markov model of order k.

From these characteristics, Zipf's Law was formulated in the 1940s. It states that the frequency of occurrence of a word in a language follows the distribution $P_{n} \simeq \frac{1}{n^{a}}$, where $P_{n}$ is the frequency of the word at rank $n$ (when words are ordered from highest to lowest frequency) and $a$ is close to 1. This means that the second most frequent word occurs approximately 1/2 as often as the first, the third 1/3 as often, and so on.
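A short sketch can make the rank and frequency relationship concrete. It assumes whitespace tokenization and an exponent of exactly 1 by default; both function names are hypothetical, and the sample counts are invented for illustration.

```python
from collections import Counter


def rank_frequencies(text):
    """Return (rank, word, count) tuples, most frequent word first."""
    counts = Counter(text.lower().split())
    return [
        (rank, word, count)
        for rank, (word, count) in enumerate(counts.most_common(), start=1)
    ]


def zipf_prediction(top_count, rank, a=1.0):
    """Count predicted at a given rank if the top-ranked word occurs top_count times."""
    return top_count / rank ** a


# If the most frequent word occurred 600 times, the law predicts
# about 300 occurrences at rank 2 and about 200 at rank 3.
predicted = [zipf_prediction(600, rank) for rank in (1, 2, 3)]
```

Comparing the output of `rank_frequencies` on a real corpus with `zipf_prediction` is one simple way to see how closely a collection follows the law.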

From these first approximations, new empirical laws emerged that fit real text more closely. Heaps' Law, one of the most widely used, relates the size of a text (number of words) to the growth of its vocabulary (number of unique words) through the formula $V = K N^{\beta}$, where:

  • $V$: the size of the vocabulary (number of unique words)
  • $N$: the size of the document (number of words)
  • $K$: a constant that depends on the text, typically between 10 and 100
  • $\beta$: also a constant that depends on the text

The easiest way to understand the formula is to work through an example with values in the following ranges:

  • $10 \leq K \leq 20$
  • $0.5 \leq \beta \leq 0.6$

Taking $K=20$ and $\beta=0.5$ (the value consistent with the figures below), we have:

N         V
100000    6325
250000    10000
400000    12649
800000    17889
1000000   20000

The example table shows that while the size of the corpus grew 10 times, the vocabulary barely exceeded 3 times its initial size. That is, as documents are added to a collection, new vocabulary terms keep being discovered, but at an ever slower rate. Heaps' Law lets us estimate the size of the vocabulary in order to, for example, assess the scalability of the data structures that store the indexes supporting the information retrieval system (IRS). This is especially useful if the index is held in an in-memory hash table.
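The vocabulary estimates can be reproduced with a few lines of code. This sketch assumes $K=20$ and $\beta=0.5$, which are the parameters that yield the vocabulary sizes listed in the table, and the corpus sizes below are chosen so that those V values are reproduced; the function name is hypothetical.

```python
def heaps_vocabulary(n_words, k=20.0, beta=0.5):
    """Estimated number of unique words in a text of n_words words (V = K * N**beta)."""
    return k * n_words ** beta


# Corpus sizes assumed for illustration; rounded to whole words.
sizes = (100_000, 250_000, 400_000, 800_000, 1_000_000)
estimates = {n: round(heaps_vocabulary(n)) for n in sizes}
for n, v in estimates.items():
    print(n, v)
```

An estimate like this can be used to pre-size an in-memory hash table for the index before the collection is fully processed.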

Link Anchor Information

According to the paper Effective Site Finding using Link Anchor Information, a hyperlink is a relationship between two documents or two parts of the same document. Every link involves:

  • Source document: It is the one that contains the link
  • Destination document: It is the one to which the link refers

Methods that use hyperlink-based ranking mainly use three assumptions:

  • Recommendation: By linking to a target, the source document can be interpreted as recommending it. Under this assumption, the more pages that link to a document, the greater its relevance. This quality judgment is very widely used; the counting method can vary (for example, a simple count of links, or a weight propagated from page to page), but some such measure is needed to rank the documents.
  • Topic locality: Pages connected by links are more likely to be on the same topic. We can use this assumption to give more relevance in a query to pages that are linked to relevant pages.
  • Anchor description: The anchor text of a link describes its target. Used correctly, this makes the target easier to analyze, since the target can be indexed by the description in the link, which gives information about its content.

Depending on the approach taken, each system uses one, two or all three assumptions, according to what users need from document search.
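The anchor description assumption can be sketched as a small index that describes each target page by the words other pages use when linking to it. The data layout, function names and URLs below are hypothetical, and counting matching anchors is only one possible scoring choice.

```python
from collections import defaultdict


def build_anchor_index(links):
    """links: iterable of (source_url, target_url, anchor_text).

    Returns term -> {target_url: number of incoming anchors containing the term}.
    """
    index = defaultdict(lambda: defaultdict(int))
    for _source, target, anchor in links:
        # Count each term once per anchor, whatever its case.
        for term in set(anchor.lower().split()):
            index[term][target] += 1
    return index


def search_anchor(index, term):
    """Targets ordered by how many incoming anchors mention the term."""
    hits = index.get(term.lower(), {})
    return sorted(hits, key=lambda url: (-hits[url], url))


# Invented link data: two pages describe site-a as a "library".
links = [
    ("http://example.org/1", "http://site-a.example/", "city library"),
    ("http://example.org/2", "http://site-a.example/", "Library catalogue"),
    ("http://example.org/3", "http://site-b.example/", "library"),
]
anchor_index = build_anchor_index(links)
results = search_anchor(anchor_index, "library")
```

Note that the target pages are ranked here without reading their own content at all, which is what distinguishes this signal from ordinary text indexing.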

Link structure between pages

As mentioned previously, search engines interpret hyperlinks as recommendations made by the author of a web page about a source of information. That is why a web page with a greater number of links pointing to it ranks higher than one with few links. This type of metric associates a value (called a weight) with each link and applies an iterative propagation algorithm that yields a relevance value for each page.
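One well-known instance of such iterative propagation is a PageRank-style computation. The sketch below is a simplified version under stated assumptions: a damping factor of 0.85, a fixed number of iterations, and an even redistribution of weight from pages without outgoing links. None of these choices comes from the text, and the function name is hypothetical.

```python
def propagate_rank(links, damping=0.85, iterations=50):
    """links: dict mapping each page to the list of pages it links to."""
    pages = set(links) | {t for targets in links.values() for t in targets}
    n = len(pages)
    rank = {page: 1.0 / n for page in pages}
    for _ in range(iterations):
        # Every page starts each round with a small base weight.
        new_rank = {page: (1.0 - damping) / n for page in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page splits its damped weight evenly among its links.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # Page without outgoing links: spread its weight over all pages.
                for target in pages:
                    new_rank[target] += damping * rank[page] / n
        rank = new_rank
    return rank


# Invented graph: "c" is linked from two pages, "b" from none.
graph = {"a": ["c"], "b": ["c"], "c": ["a"]}
rank = propagate_rank(graph)
```

In this small graph, page "c" ends with the highest value because two pages link to it, and "a" benefits from being the only page that "c" recommends.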

Another very interesting piece of work in this area is the research by Nick Craswell and David Hawking comparing the effectiveness of relevance ranking methods. They compared search over anchor text with search over page content and concluded that, when the task is to find a specific site, methods using anchor text obtain better results than methods using the content of the page.

It is also true that over the years this type of ranking has been gamed, and some truly curious cases have arisen. A few years ago, many sites linked to the website of the Spanish General Society of Authors and Publishers (SGAE) using "thieves" as the anchor text; because so many links to the site used that term, a Google search for "thieves" returned the SGAE page in first position.

Others.

Another method relies on the topic locality assumption, which states that websites mostly exchange links with pages on the same topic, so the links on a website will supposedly deal with that topic. Using this method, a page that is reachable through a link from pages considered relevant can obtain a higher position in the ranking.

3. Process of indexing information on the web

This procedure transforms documents so that information can be retrieved from them through queries. If the process is carried out correctly, an indexed document should function as a representation of its semantic content.

Generally, the objective of indexing is to obtain a list of meaningful terms (concepts) with associated information such as their frequency in the document, their frequency in the collection or their co-occurrence. A term can be several things: a word (reduced to its root form by a stemming algorithm), a phrase, a proper name or even special expressions such as dates and places.

Terms are usually recognized with shallow, string-based techniques. These methods treat a document as a set of concatenated strings, without taking the properties of natural language into account, and therefore face the following drawbacks:

  • Stemming algorithms do not extract the base of a word through morphological analysis, so they fail to identify term variations in languages with a more complex morphology than English.
  • Lexical ambiguity is ignored, which makes it impossible to distinguish the different meanings of the same word.
  • Synonymous words are not related to each other correctly.

Indexing is a slow and expensive process that is only executed in full when the document retrieval system is created. For this reason, researchers have given greater importance to update methods, which are more efficient when documents are modified. Indexing consists of the following steps:

  • Go through all the documents to be indexed. This can be a finite, known set, such as the HTML pages in a folder, or an unknown one, such as all the pages on the Internet. In the latter case, search engines use robots or crawlers: small programs that traverse the structure of the Web in search of new pages. They collect pages by following the links to other pages, in an endless cycle.
  • Process the document. It is decomposed until the list of words it contains is obtained. This can be very simple, as with plain text files, where words are separated by spaces or other characters, or very complex, as with a PDF document that must be decoded, separating out the formatting and images and extracting only the text.
  • Create the inverted index (an index data structure that maps content, such as words or numbers, to its locations in a database file, a document or a set of documents). We store the list of words from the previous step, recording the documents in which each one was found. Some words appear in only one document (searching for such a word gives a single result), while more common words appear in many documents.
  • Optionally, store the document. Some search engines store a copy of the document in their own database. In this way, any document can be consulted even if the original is no longer available.
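The processing and inverted index steps above can be sketched in a few lines. This is a minimal illustration that assumes the documents are already available as plain text and that a term is any run of lowercase letters or digits; the function names and sample documents are hypothetical.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(documents):
    """documents: dict doc_id -> text. Returns term -> sorted list of doc_ids."""
    postings = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            postings[term].add(doc_id)
    return {term: sorted(ids) for term, ids in postings.items()}


def search(index, query):
    """AND query: ids of the documents that contain every query term."""
    term_sets = [set(index.get(term, ())) for term in tokenize(query)]
    return sorted(set.intersection(*term_sets)) if term_sets else []


documents = {
    "d1": "Web search engines index web documents.",
    "d2": "Crawlers follow links between documents.",
    "d3": "Anchor text describes links.",
}
index = build_inverted_index(documents)
```

With this structure, answering a query only requires looking up the posting lists of its terms and intersecting them, rather than scanning every document.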

3.1 Lemmatization

Stemming is a method of reducing a word to its root. These algorithms are closely tied to information retrieval systems because they increase the number of documents that a query can find, improving recall (the fraction of relevant documents that are retrieved). For example, a query about "libraries" also finds documents that contain only "librarian" if the stemmer reduces both words to the same root (for example "librar").

The most common stemming algorithm is Porter's algorithm, although other methods, based for example on lexicographic analysis, are also widely used (KSTEM, corpus-based stemming, linguistic methods...). Many of these algorithms are implemented in Snowball, a small programming language designed for string handling that makes stemming algorithms easy to implement. The Snowball pages contain stemmers for 12 languages (including Spanish, Galician, Valencian and Basque). All the explanations, however, are given in English.
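To illustrate the idea of reducing words to a common root, here is a deliberately naive suffix-stripping sketch. It is not Porter's algorithm and not a Snowball stemmer: the suffix list and the minimum stem length are hypothetical choices made only for this example.

```python
# Illustrative suffix list, checked in order; not taken from any real stemmer.
SUFFIXES = ("ies", "ian", "ing", "es", "ed", "s")


def naive_stem(word, min_stem=3):
    """Strip the first matching suffix, keeping at least min_stem characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            return word[: -len(suffix)]
    return word


# Both forms collapse to the same index term under this toy rule set.
stems = {word: naive_stem(word) for word in ("libraries", "librarian", "indexing")}
```

A query term and a document term match when their stems are equal, which is how stemming raises recall; real stemmers apply far more careful rules to avoid merging unrelated words.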

3.2 Keyword extraction

The terms produced by stemming are used as index terms for the documents. Each term is assigned a weight according to its frequency in each document and in the collection. There are two types of frequency:

  • Term frequency tf(t): number of occurrences of the word t in the document. The more often a term is repeated, the more strongly the document is related to that term.
  • Document frequency f(t): number of documents indexed by t, that is, the total number of documents that contain the term. The fewer documents a term appears in, the more discriminating it is.

To evaluate the weight of each term, the following formula is used: $tf(t) \cdot \log\left(\frac{N}{f(t)}\right)$, where $N$ represents the total number of documents in the collection.
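The weighting formula can be sketched directly. The example assumes that documents have already been reduced to lists of terms and uses the natural logarithm, since the text does not fix a base; the function name and sample data are hypothetical.

```python
import math
from collections import Counter


def tf_idf(documents):
    """documents: dict doc_id -> list of terms. Returns doc_id -> {term: weight}."""
    n_docs = len(documents)
    # f(t): number of documents that contain each term.
    doc_freq = Counter()
    for terms in documents.values():
        doc_freq.update(set(terms))
    weights = {}
    for doc_id, terms in documents.items():
        tf = Counter(terms)  # tf(t): occurrences of each term in this document
        weights[doc_id] = {
            term: tf[term] * math.log(n_docs / doc_freq[term]) for term in tf
        }
    return weights


documents = {
    "d1": ["web", "search", "web"],
    "d2": ["web", "index"],
}
weights = tf_idf(documents)
```

Here "web" receives weight 0 in both documents because it appears in every document and therefore does not discriminate between them, while "search" and "index" each receive a weight of log 2.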

4. Interfaces, browsing and search visualization.

Searching for information is rarely a precise process, because users often do not know how to formulate a query with the necessary terms. User interfaces are therefore needed to help make up for this lack of knowledge. To build an efficient interface, we must comply with the following design principles:

  • Feedback: This principle consists of analyzing the reactions or responses of the user to the interface, which the interface takes into account in order to adapt. It informs important design choices, such as which operations the system should perform automatically and which should be started and controlled by users.
  • Reduced memory load. To reduce this load we can use two methods:
    • Providing mechanisms to keep a record of the decisions made during the search process.
    • Providing navigable information that is relevant to the current state of the information access process.
  • Visualization techniques, among which we highlight:
    • The display of icons and changes in coloring.
    • Showing links between different documents.
    • The ability to zoom.

Apart from these principles, which we should always try to follow, there are alternative interfaces for expert and novice users, which trade simplicity against power and offer more or less advanced features depending on the user they are aimed at.

Another important decision that must be made in a design is the amount of information that we want to show the user through the information access system.

To conclude this section, the book Human-Computer Interaction by Alan J. Dix, Janet E. Finlay, Gregory D. Abowd and Russell Beale (Prentice Hall) offers a set of evaluation criteria, not only for the Web but for any interactive system, made up of 10 elements (414):

  • Visibility of system status. The system must keep the user informed of what is happening.
  • Correspondence between the system and the real world.
  • User control and freedom. Support undo and redo.
  • Consistency and standards.
  • Error prevention.
  • Recognition rather than recall.
  • Flexibility and efficiency of use.
  • Aesthetic and minimalist design.
  • Help users recognize, diagnose and recover from their errors. The errors should be expressed in simple language.
  • Help and documentation. Although it is better if the system can be used without documentation, it may be necessary to provide help and documentation. This information should be easy to search, focused on the user's task, list concrete steps and not be too large.

5. Metasearch.

A metasearch engine is a search engine that sends a search request to multiple other search engines or databases, and returns either a merged list of results or a list of links that give easy access to the individual results of each engine.

Metasearch engines allow users to enter search criteria only once and access multiple search engines simultaneously. They do not usually have their own database, since they simply use the results of other search engines, generally unifying them with their own algorithms to order them by relevance (and, in general, eliminating identical results).

These tools usually return results from web pages, but some specialize in searching discussion forums, newsgroups, weblogs, images, freely available documents on the Web, and so on.

A metasearch engine works in the following steps:

  • The user submits a request to the metasearch engine.
  • The metasearch engine formats the request according to the interface of each search engine and passes it on to them.
  • Each search engine carries out the search over websites on the Internet using its usual means.
  • The search engines return the information obtained to the metasearch engine, which analyzes the data.
  • The metasearch engine organizes the information according to its own criteria and shows it to the user.
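The final merging step can be sketched as follows. The text only says that each metasearch engine applies its own criteria, so the scoring used here (summing the reciprocal of each result's position across engines) is an assumption chosen for illustration; the engine names, URLs and function name are hypothetical.

```python
from collections import defaultdict


def merge_results(result_lists):
    """result_lists: dict engine_name -> ranked list of URLs.

    Returns each URL once, ordered by the sum of 1/position over all engines.
    """
    scores = defaultdict(float)
    for urls in result_lists.values():
        for position, url in enumerate(urls, start=1):
            scores[url] += 1.0 / position
    # Higher score first; ties are broken alphabetically for a stable order.
    return sorted(scores, key=lambda url: (-scores[url], url))


# Invented responses from two engines for the same query.
responses = {
    "engine_one": ["http://x.example", "http://y.example", "http://z.example"],
    "engine_two": ["http://y.example", "http://x.example", "http://w.example"],
}
merged = merge_results(responses)
```

Because scores are accumulated per URL, a result returned by several engines appears only once in the merged list and tends to rise above results found by a single engine.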

Finally, some of the metasearch engines that currently exist are the following:

  • Startpage (formerly IxQuick) – uses Google results, but without tracking the user
  • DuckDuckGo – does not track searches, includes “bangs” shortcuts (!w for Wikipedia, etc.) and trusted sources
  • Dogpile – combines results from Google, Bing, Yahoo and others, eliminating duplicates
  • SearXNG – free and federated metasearch, integrates with more than 70 search engines and preserves privacy
  • Mojeek – independent search engine with its own web index, without user tracking or profiling
  • Presearch – decentralized blockchain-based metasearch, with community node and Web3 approach

These engines offer a combination of privacy, independence, and access to results from multiple sources without relying solely on Google.

6. Web agents.

A web agent is software designed to help users organize the information they need from the network according to their interests. To achieve this, web agents rely on machine learning and intelligent agent techniques. There are two main methods for helping the user find the desired information, whether through user assistants or recommendation systems:

  • Content-based: Agents that use this method take the following approach:
    • To classify text, the system looks for objects similar to those the user already has, comparing them entirely by their content. A major problem for agents that use this method is detecting content that has already been viewed and discarding it.
    • These agents use different text-learning methods.
    • The content-based method is very popular in systems that work with textual data, for example web documents or news.
  • There are different agents that use this method, such as:
    • WebWatcher: The user performs a search and, when they access an article, the agent examines the page and its links, marking the most relevant ones and encouraging the user to visit them.
    • Lira: Learns while the user browses the Internet. The user's searches are stored, the agent selects the best pages, and the user gives feedback by rating the links they see.
    • Musag: Builds a dictionary of the most relevant words in the user's searches and looks for related pages based on how often these words are repeated in the articles the user has read.
    • Letizia: Having observed the search being carried out, the system offers the user new links and shows information about the suggested links in a new tab.
  • Collaborative: This method is based on the following points:
    • There is a set of users who use the system.
    • It is based on social learning.
    • It does not analyze the content; it analyzes the ratings that users give each link, according to their interests.
    • This approach is mainly used for non-textual documents, such as videos, images and movies.
  • Some examples of agents that use this method are the following:
    • Firefly and Ringo: Used for music recommendation, taking into account the type of music each user likes and their votes.
    • Siteseer: Web page recommendation system based on individual users' bookmarks.
    • Phoaks: System that automatically recognizes and distributes recommendations of web resources mined from Usenet news messages.
    • GroupLens: Has a second database that stores the ratings users have given to messages and the correlations between users based on those ratings.
    • Referral Web: Interactive system for reconstructing, visualizing and searching social networks on the Internet. It is similar to ContactFinder.
    • Fab system: Combines the two methods, content-based and collaborative. It uses the content-based method to create a user profile, and each user also rates the different links.
    • WebCobra: Also combines the two methods, like the previous agent.
    • Lifestyle Finder: Generates a user profile and, depending on each person's interests, creates groups of users with the same interests.
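The collaborative method can be illustrated with a small sketch that never looks at page content, only at user ratings. It assumes cosine similarity between users and a similarity-weighted sum of ratings as the score; these choices, the function names and the sample ratings are illustrative and do not describe how any of the systems listed above actually work.

```python
import math


def cosine_similarity(a, b):
    """a, b: dict item -> rating. Cosine of the two users' rating vectors."""
    dot = sum(a[item] * b[item] for item in a if item in b)
    norm_a = math.sqrt(sum(r * r for r in a.values()))
    norm_b = math.sqrt(sum(r * r for r in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def recommend(ratings, user, top_n=3):
    """Suggest items the user has not rated, weighted by similar users' ratings."""
    scores = {}
    for other, other_ratings in ratings.items():
        if other == user:
            continue
        similarity = cosine_similarity(ratings[user], other_ratings)
        for item, rating in other_ratings.items():
            if item not in ratings[user]:
                scores[item] = scores.get(item, 0.0) + similarity * rating
    return sorted(scores, key=lambda item: (-scores[item], item))[:top_n]


# Invented ratings of pages p1..p4 by three users.
ratings = {
    "ana": {"p1": 5, "p2": 4},
    "ben": {"p1": 5, "p2": 4, "p3": 5},
    "cy": {"p2": 1, "p4": 5},
}
suggestions = recommend(ratings, "ana")
```

In this data, "ben" rates pages almost exactly as "ana" does, so the page only "ben" has seen is suggested first. A content-based agent would instead compare the text of the pages themselves.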

7. Research areas related to web search.

Web search is related to a number of other fields, but mainly to these two:

  • Information retrieval.
  • Natural language processing.

In addition to these two, based on the study carried out in this work, and taking into account the different phases needed for good web search, it is also related to the following research areas:

  • Textual content: Zipf's and Heaps' laws, among others, which describe how new vocabulary terms are discovered as documents are added to a collection.
  • Information in links and related methods: recommendation, topic locality and anchor description.
  • Web indexing and what it entails: stemming and keyword extraction.
  • Machine learning: applications that help users find what interests them online in a faster and fully automated way.

8. International conferences where web search is addressed.

Some of the international conferences that address web search and indexing methods are the following:

  • International World Wide Web Conference (IW3C2).
  • International Journal of Computer Networks & Communications (IJCNC)
  • International Conference on Internet and Web Engineering
  • Interlink Web Design Conference
  • International Conference on Web Intelligence, Mining and Semantics
  • International Conference on Web-based Learning (ICWL 2010)
  • International Conference on Machine Learning (ICML97)
  • International Conference on Autonomous Agents (Agents ’98)
  • International Conference on Web Information Systems and Technologies