Crawler · Search engines · Freshness

Web crawler: how search engine spiders crawl, prioritize and refresh pages

A crawler discovers URLs, decides what to fetch next and keeps the index fresh without overloading sites. The hard part is not downloading pages, but doing it at scale and with priorities.

Discovery

Seeds, links, sitemaps and queues feed the crawl frontier.

Priority

Importance, freshness, change rate and coverage decide what is fetched first.

Politeness

Rate limits, robots rules and host queues protect external servers.

Quality

Canonicalization, duplicate detection and spam control keep the index useful.

Current SEO context for crawlers

Modern crawling is not only about following links. Search engines also interpret robots.txt, XML sitemaps, canonical signals, redirects, mobile rendering, JavaScript output and duplicate clusters before deciding what to index. If a site depends on JavaScript, the important content and links should still be available after rendering, and critical URLs should be discoverable through real links plus a clean sitemap.

Crawler (Seekers). The internet spider

This project tries to explain the concept of Crawler, the problems it solves and what impediments exist. taking as reference the articles The Anatomy of a Large-Scale Hypertextual Web Search Engine (link), Efficient Crawling Through URL Ordering (Link), Mercator: A Scalable, Extensible Web Crawler (Link) and High-Performance Web Crawling (Link).

In this link to YouTube Do you have an explanation in video about the project.

Index

1 Introduction: What is a crawler
2 Problems that a crawler tries to solve
3 Problems encountered by a crawler
4 Crawling stages (Google, WebBase, Mercator, WebEater)
5 Research areas related to Crawling

1. Introduction: What is a crawler

A crawler (tracker) is software intended for the recovery of Web Documents, normally used by search engines.

This term acquires special relevance when the WWW (World Wide Web) increases considerably in size and, therefore, they use these software to answer the following problems to solve:

High number of Web Documents
Document Dynamism
Continued expansion

With the idea of avoiding these problems, search engines have optimized their Trackers to provide information more correct and appropriate. To achieve this, each engine has carried out its own studies and applied algorithms diverse with different results.

On another point, the constant increase in pages and their dynamism require use faster traces that return the information as compressed as possible and using data structures. efficient data.

The operation of a tracker is normally based on the following processes:

Get the initial set of URLs (Uniform Resource Announcer, Uniform Resource Locator)
URL collection. The crawler downloads this set of associated web document (called a "seed") and Search within this for other URLs.
Each new URL found is added to the list of URLs that the web spider must visit

Some examples of Tracked are:

Matthew Gray's Wanderer : NCSA Mosaix Page Tracker (Graphical Page Navigator) website for Windows) created in the spring of 1993.
google: is a web search engine owned by Alphabet Inc. Its main Features are:

Use of an efficient index.
Use of optimized structures.
High degree of scalability

URLServer: provides the URL seeds to be extracted by the software.
StoreServer: receives the URLs, performs compression calculations and stores them in a repository. For analyze.
Indexer: It is the most important part of the process. It is responsible for executing a series of algorithms to read StoreServe information, decompress it and analyze it. Documents are converted to sets of words (called hits), these, in turn, are subdivided into other sets known as barrels, and by Lastly, these barrels form a partially ordered forward index structure. Apart from that, it also generates from each document an anchor file, that is, a document where all the links are stored that exist on the web and what text they link to.
URLresolver: It is responsible for analyzing all the information stored in the anchor files. once After applying this process, a relevance estimate can be obtained by applying the PageRank method.
Sprter is in charge of arranging the barrels.

The Internet Archive: System that distributes queries between various machines. Each tracking process It can receive up to 64 requests and cannot send a request to two different machines.
Mercator Crawler: This search engine is scalable, extensible and modular built in JAVA. with regarding the rest of the search engines analyzed, its benefits are:

Efficiency
Speed
High probability of success
Modularity

Web Fountain : Distributed and modular tracker developed in the C++ language. has a very philosophy similar to the previous tracker but, this time, the physical structure is different. They follow a pattern Master-Slave where a main server is responsible for coordinating a series of machines. Its algorithms Indexing is based on solving equations with nonlinear programming.
PolyBot: Crawler that fuses the efficiency of C++ with the ease of modification of Python. This system is divided by three types of servers:

Crawling Manager
URL downloaders.
DNS Resolvers.

WebRace: It is a module for the Crawler eRace implemented in Java. Its operation is a little special, since receives requests from users to download web pages, the crawler acting as an intelligent server. That is, we do not always start with the same seed but rather it changes depending on the users' queries. Once analyzed, the software can generate a structure known as a "subscription", which can only be It will update when any modification occurs on the page.
Location: Distributed, fault-tolerant tracker built in Java. The difference with respect to the rest of analyzed examples is that, this tracker is composed of a number of identical agents that can perform the same functions. To process a query, this crawler processes the request and applies a hash function to it. The The result obtained tells you which agent should process the query. The benefits of using a function hashing is to prevent two agents from processing an identical request. Furthermore, if any of the agents is arrested or has a problem the rest of them have the ability to redistribute their workload.

2. Problems that a crawler tries to solve.

The purpose of a crawler is to search for URLs that have the most accurate information about a query made by a user. Therefore, the main problem that We find ourselves selecting the most relevant pages and ordering them based on the information they contain. For any ordering process we will have to start by knowing what metric we are going to use to measure relevance. of the page and, in the case of trackers, there is no single method, but different ones have been developed solutions like:

Similarity to query Q : P being a website and Q a query, they are converted into vectors and compare their similarity in the words that contain each one.
Backlink Count: Analyze the number of links that exist to the website P. The greater the number of links, better position.
PageRank: Algorithm that calculates the ranking of a website using different parameters like:

The number of links to the website
Number of outgoing q links there are
Information contained on the website

Forward Link Count: Number of links emanating from P
Location Metric: Measures the distance to the origin of the query. For someone who found in Chile, You will be interested in websites that are in America before they are in Europe, for example.

Once we have known what metrics we are going to apply to the tracked data, we will have to analyze different measurement methods. tracking. The problem in this section is not the chosen method but the large number of URLs that we will have to analyze daily and efficiently. The solutions to this problem is to use approximate methods that are close to a value. Among the most current methods we highlight the following three:

Crawl & Stop: C crawler starts with a page $P_{0}$ . Once processed, it moves on to the next page until it reaches the last stored page. At this point, a tracker perfect that you have K pages stored you should visit the pages from the first $P_{0}$ to the $P_{k}$ to redistribute them in the following way $R_{1}$ to $R_{k}$ where $R_{1}$ It would be the page with the most important information, $R_{2}$ the next, and so on.
Crawl & Stop with Threshold: As in the previous method, we visit a number of K pages. Without However, now we will have an important point G that we will use to apply an algorithm as follows: If we consider I the algorithm to measure relevance and P the page analyzed then $I(P)^{3} \geq G$ is considered hot (relevant). We will call all relevant (hot) pages H (pages that contain information relevant to the search).
Applying all this information and considering that the tracker does not have unlimited time to perform the task, we can define that $P_{ST}(C)$ is the percentage of H pages that have been found when the crawler stops. Yes K < H, entonces el rastreador ideal tendrá un rendimiento de (K*100)/H . En cambio, si K $\leq ser\geq$ H, then it will be ideal and will have one hundred percent performance.
Limited Buffer Crawl: In this model we have a new consideration, the ability to storage for process is limited. We assume that the crawler can only save B pages in its buffer. Therefore, when the tracker fills up we will have to provide a solution among which will be:

Delete a page randomly from the buffer (faster but can greatly lower the quality of the results)
Delete the most relevant pages.

3. Problems that a crawler encounters

Crawlers have different needs to perform crawling of web pages. Solving these problems is a real dilemma since between a series of problems of different types, such as the storage capacity they have, the bandwidth (both yours and the web tracked) related to more hardware aspects we find other software problems such as:

URL aliases: URLs can be transformed to have pseudo names that make them more user-friendly. This technique It can also cause serious problems in our browser since there may be two or more different aliases that point to the same page. There are four cases that we must analyze:

Alias in the domain or server name: It is very common that we have a server with an IP, in which We have different Web pages hosted with different aliases each of them. Each of these websites will use the same IP and, therefore, the server and the tracker will have to communicate efficiently so as not to process the same IP several times.
Port number omission: The normal communication protocols between a client and the web server are 80 for an HTTP connection and 21 for an FTP. Depending on our tracker and what type of connections it is going to perform we will have to configure it to make queries through the appropriate port.
Alternative routes on the same server: There are cases where two different routes access the same space Web. For those cases our crawler should only process the web once, ignoring future requests coming from other different URLs.
Replicas on different servers: The most demanded Web documents will surely be replicated on different servers. In this case, like the previous one, we must check that we do not analyze twice the same document.

Session IDs in URLs: In many cases, it is necessary to create sessions to be able to access dynamic pages. This causes different sessions to access the same resources and therefore our crawler imitate these users and analyze repeated documents. The solution to this problem is to implement solutions to detect repeated files such as Mercator fingerprint.
Crawler Traps: A URL or set of URLs that cause the crawler to crawl indefinitely. some crawler traps are divided as:

Unintentional: A symbolic link in a file system that creates a loop.
Volunteer Spammers who display advertising on a cyclical basis.

4. Stages of crawling

To answer a user's query, it is necessary to go through different stages with their respective functions. The order and methods will not always be the same, depending on the Implementation and hardware used the tracker may have minor modifications to improve its efficiency.

4.1 Google

If we focus on the largest trackers like Google we are faced with software that They interact with a huge number of pages and have opted for a distributed system. Its operation is next; They have a single URLServer that serves URLs to a number of crawlers. Apart from each of them, maintains its own cache to avoid doing a DNS lookup before crawling each document. A crawler can receive up to 300 open connections at the same time, that is, Google can receive approximately 100 pages every second using 4 trackers, that is, an average of 600 Kbits/s. Regarding the connections, they do not have to be all in the same state. There will always be connections connecting to a server, sending a request, receiving a response and resolving the alias. All these features make the tracker a very complex component, with an asynchronous I/O (Input/Output) system to manage events and a number of queues to move pages from a state to another.

4.2 WebBase

Another different type of crawler is WebBase. This software is based on running many processes at the same time to perform a trace. Each process receives a list of URLs to be downloaded and They simply return the full HTML content or an error if they fail. To run this system, open hundreds of connections at the same time, at a speed of about 25 pages per second for each process. There is also adapted for crawls on servers that have a fairly slow speed and allow little information to be downloaded quickly. In those cases, first, all the URLs that are going to be crawled are divided into 500 queues based on a hash of the server where it is hosted. This ensures that all URLs that belong to the same server are are in the same queue and, therefore, a delayed server will only decrease efficiency in one of the the connections. Apart from this method, the system allows the webmaster the ability to restrict the access of this tracker.

4.3 Mercator

Another example of a tracker is Mercator. This software has multiple threads work when a trace occurs. Each thread is repeated as many times as necessary to do the job.

The first step of a Mercator crawler is to obtain an absolute URL from the URLFrontier. To perform this task First of all you have to identify the network protocol that should be used to download it. In a At first the software only includes the default configuration for the HTTP, FTP and Gopher protocols (Internet service consisting of access to information through menus) and, if we have the need to trace new protocols are introduced in a configuration file and are dynamically loaded upon startup with tracking
After step 1 and having selected the correct protocol for the download, the next step is to execute the execution module to download the document from the internet in a RewindInputStream thread (RIS, I/O method that given a stream input allows to obtain and store the content referenced by the input to be read subsequently as many times as necessary).
Once the document has been written to the RIS, the working thread executes the content-seen test to determine if this document has been previously viewed.
If so, the document is not processed again and the thread deletes the URL from the URLFrontlier. All documents downloaded, have an associated MIME TYPE (Multipurpose Internet Mail Extensions, are a series of conventions or specifications aimed at the exchange over the Internet of all types of files (text, audio, video, etc.) transparently for the user). This process of associating a MIME to the document is based on the file Mercator configuration and can be modified to associate one or another type of MIME structure.
After downloading the file and associating the MIME, the module that must process it is selected. An example It would be the Link Extractor and the Tag counter that are used for text/html type documents.
Each link is converted to an Absolute URL and passed through a URLFilter to determine if it should be downloaded.
If the URL is accepted, it goes through the content-viewed test module, which checks whether the URL has been previously downloaded.

4.4 WebEater

A WebEater crawler measures the relevance of each page to a topic or search that the user performs. Of course, the entire page that contains the searched word in the title is considered relevant and is added to a KeyQueue. Positive results that are not in the title are stored in ColaURL. Of course, it It will first analyze those added to the KeyQueues.

5. What other areas of research are related to Crawling

The area of research most closely linked to Crawling is search engines, although it is also very related to Data Mining and Information Recovery.

However, based on the bibliography consulted about crawlers, taking into account the different tasks that must be performed, we can also relate secondarily to:

DNS name servers: The improvements we provide in these technologies would directly affect the Crawlers as they are an important step in their tasks.
Web indexing. An increase in the efficiency of the indices can lead to a decrease in the processing times. execution of the various queries.
WEB Document Link Extractors