Web search engines: crawling, indexing, ranking and Google Scholar

A search engine is a pipeline: discover pages, crawl them, store them, index their content and rank answers for each query. Google Scholar adds a bibliographic layer around citations and academic relevance.

Crawling

Robots discover and revisit pages while balancing coverage, freshness and cost.

Indexing

Documents are transformed into term, metadata and link structures for fast retrieval.

Ranking

Relevance combines text matching, authority, freshness, context and user intent.

Scholar

Academic search emphasizes papers, citations, authors, venues and publication dates.

Web Search Engines

1. Interaction with Google Scholar

The main place to perform bibliographic searches on the Internet is scholar.google.com. To familiarize yourself with this service, we suggest that you look for the five articles with the most impact (that is, the most cited ones) on the topic of Web search. In addition to providing the resulting list, you must document your search criteria and how you interacted with the system.

Solution: First, we configure the search engine according to the criteria in the statement, as shown in Figure 1:
  • In red, the text used as the query. In our case, Web Search.
  • In yellow, the type of document we are looking for. In our case, articles.
  • In green, the time interval of the articles. In our case, between 2009 and 2015.
  • In blue, the order used to sort the results. In our case, most referenced first.
  • In black, the search scope. In our case, the entire Web.

Figure 1: Search performed using https://scholar.google.es

The search returned the following articles:

Summary: An information retrieval system deals with the representation, storage, organization and access of information items. That is, it is a system capable of storing, retrieving and maintaining information.

But what does the concept of information represent? In this context, information can be anything suitable for retrieval, such as text (including numbers and dates), images, audio, video and other multimedia objects. The main type of retrievable object has always been text, because it is easier to process than multimedia objects, but systems capable of retrieving other types of objects on the Internet have recently been emerging. In fact, Google has included an image retrieval system in its search engine.

This aspect will be discussed in more detail in the articles.

Summary: Information architecture, built on the classical principles of traditional information science, emerged in the late 1990s. It is the discipline in charge of structuring, organizing and labeling the elements that make up information environments, in order to facilitate the search and retrieval of the information they contain and thus improve their usefulness to users. Among the main systems or structures that make up the architectural anatomy of a website, the organization, labeling, navigation, search and controlled vocabulary systems stand out. In practice, the development of a website's architectural anatomy focuses on the needs of its typical users.

Summary: Proposal for identifying significant characteristics of Web 2.0, Web 3.0 and Semantic Web. Consideration of the possible impact on cyberjournalism and especially on current and new Web search systems.

Summary: This article offers a quick overview of why search engines are needed for knowledge management and of the basic ideas behind how they work. It then presents the CSORA method (Classify, Search, Organize, Relate, Adapt), a set of technical and methodological procedures that improve the effectiveness of search engines by making it possible to find relevant information in environments with large volumes of information or specialized information. Finally, it shows how the method is integrated into search engines; the conclusions cover the advantages of its use, verified through practical experience, and future lines of research.

Summary: A labor-intensive part of software testing is the generation of test cases. The cost of this task can be reduced by using techniques that allow it to be automated. This work presents an approach based on the Scatter Search metaheuristic technique for the automatic generation of test cases for business processes specified in BPEL. Transition coverage is used as the adequacy criterion.

2. Searching the Web

Statement 1: As an introduction to Web search engines, we suggest reading the article by Arvind Arasu, Junghoo Cho, Hector García-Molina, Andreas Paepcke and Sriram Raghavan entitled Searching the Web, which you can find here. Once you have read it, you must select from scholar.google.com three recent articles (preferably from 2009) that cite it and that have the greatest possible impact, and use them to discuss three aspects in which current research goes well beyond what is proposed in the original article.

After reading the article Searching the Web, I selected the following three articles, which cite it in their references:

Summary: This article analyzes the problems that search engines currently have in indexing, analyzing and searching for new information in the Deep Web and Hidden Web. These widely used terms relate to three different types of Internet that currently exist:
  • Global Internet: The free and open information network that is theoretically accessible through the interconnection of computers. It is accessed through browsers, chat, messaging or file-exchange protocols (FTP, P2P).
  • Invisible Internet: All the content that is available on the Internet but is only accessible through pages generated dynamically after querying a database. This makes it inaccessible to the usual information retrieval processes carried out by search engines, directories and search agents. We can still reach it through our usual navigation tools, email, etc.; the only condition is to know the exact access address (URL or FTP).
  • Dark Internet: Servers or hosts that are totally inaccessible from our computer. According to a study by the company Arbor Networks, this situation affects 5% of global Internet content. The main cause (78% of cases) is restricted areas for national security and military purposes. The remaining 22% is due to other reasons: incorrect configuration of routers, firewalls and protection services, inactive servers and, finally, "hijacking" of servers for illegal use.
In the article, the authors propose new methods for implementing more efficient crawlers that can search content located on the last two types of Internet. To do this, they build on algorithms used in Web mining and genetic classification.

Summary: The World Wide Web is a great source of information, full of hyperlinks to hypertext content. Search engines use web crawlers to collect these documents from the Web for storage and indexing. However, many of these documents contain dynamic information that changes daily, weekly, monthly or yearly, so the copy stored on the search engine side has to be updated to make the most recent information available to the user. An incremental crawler revisits the Web repeatedly, after a specific time interval, to update its collection. This work proposes a new mechanism for regulating the revisit frequency and a novel architecture for incremental crawlers.
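To make the idea of regulating the revisit frequency concrete, here is a minimal sketch. It is not the mechanism proposed in the cited article, whose details are not reproduced here; it assumes a simple and common heuristic in which the interval is halved when a page has changed since the last visit and doubled when it has not. The names PageState and next_interval and the default bounds are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class PageState:
    """Per-page bookkeeping kept by a hypothetical incremental crawler."""
    url: str
    interval_hours: float        # current gap between two visits
    last_fingerprint: str = ""   # hash of the content seen on the last visit


def next_interval(state: PageState, fingerprint: str,
                  min_hours: float = 1.0, max_hours: float = 720.0) -> float:
    """Halve the interval if the page changed, double it otherwise.

    The result is clamped to [min_hours, max_hours] (assumed bounds).
    """
    changed = fingerprint != state.last_fingerprint
    factor = 0.5 if changed else 2.0
    state.interval_hours = min(max_hours,
                               max(min_hours, state.interval_hours * factor))
    state.last_fingerprint = fingerprint
    return state.interval_hours
```

With this rule, a page that changes on every visit is soon revisited at the minimum interval, while a static page drifts toward the maximum, which saves bandwidth for the pages that need it.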

Summary: The hardest part of building a web crawler is downloading content as fast as the available bandwidth allows while processing the downloaded information without stalling the program. The scalable crawling system implemented in the article, called WebTracker, was designed to meet this challenge. According to the article, distributing the software reduces the resources used while maintaining the same throughput. WebTracker has a central crawler server that manages all nodes. Each node runs a crawler, which operates a downloader and manages and analyzes the downloaded content, while the central server keeps the operations of the distributed system synchronized.
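One question any distributed crawler of this kind has to answer is which node should fetch a given URL. The sketch below is an assumption for illustration, not WebTracker's actual scheme: it hashes the host name so that all pages of one site go to the same node, which makes per-site politeness limits easy to enforce. The function name assign_node is hypothetical.

```python
import hashlib
from urllib.parse import urlsplit


def assign_node(url: str, num_nodes: int) -> int:
    """Map a URL to a crawler node so that all pages of one host share a node."""
    host = urlsplit(url).hostname or ""   # hostname is already lowercased
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    # Use the first 8 bytes of the hash as an integer and reduce it to a node id.
    return int.from_bytes(digest[:8], "big") % num_nodes
```

A central server could call assign_node for every newly discovered link and forward the URL to the corresponding node. A plain modulo reshuffles most hosts when num_nodes changes; consistent hashing would be the usual remedy if nodes come and go.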

2.1 Summary and conclusion of the article Searching the Web

In the article Searching the Web, the authors discuss web crawling, page storage and indexing, along with the techniques used to design and implement each of these components.

Introduction

The authors describe the challenges encountered when building a good search engine and list a series of useful techniques. They start from classical IR (Information Retrieval) techniques, which were designed to retrieve information from small collections, such as newspaper articles and library book catalogs.

However, these IR techniques do not carry over directly to the Web: its volume has grown so much that applying them unchanged would require too many resources. The article therefore presents additional techniques, such as index structures that scale to web-sized crawls and methods for filtering out websites with irrelevant content. It also considers whether a page is referenced by other pages using specific terms, which suggests that the page is important for those terms.

They then explain how crawlers work: small applications in charge of filling a repository of web pages, detecting changes in those pages and adding the new pages they find, guided by crawling algorithms. When the crawler completes a cycle, it knows which pages it should crawl again and which it should not. Resource optimization is achieved thanks to indexing.

Web page crawling

The second section of the article summarizes the functions that the crawler must perform when collecting information from websites. First, a crawling module retrieves the pages for later analysis by the indexing module. Starting from the set of URLs it receives, the crawler assigns each one a priority that determines the order in which the pages are processed.

The article also discusses page selection at length, covering different importance metrics based on popularity, interest, location, etc. Alongside these come the crawling models, specifically two: Crawl & Stop and Crawl & Stop with Threshold. Page refresh, that is, the "freshness" of the pages held by the search engine, and refresh strategies are further topics covered in this section.
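The prioritized set of pending URLs described above is often called a frontier. As a minimal sketch, assuming the priority is just a number where higher means "fetch sooner", it can be built on Python's heapq. The class name Frontier and its methods are hypothetical, and how the priority is computed is left open.

```python
import heapq
import itertools


class Frontier:
    """Priority queue of URLs waiting to be crawled (highest priority first)."""

    def __init__(self):
        self._heap = []
        self._seen = set()                 # URLs ever added, to avoid duplicates
        self._counter = itertools.count()  # tie-breaker: keeps insertion order

    def add(self, url: str, priority: float) -> bool:
        """Queue a URL; return False if it was already known."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def pop(self) -> str:
        """Return the pending URL with the highest priority."""
        return heapq.heappop(self._heap)[2]

    def __len__(self) -> int:
        return len(self._heap)
```

The crawling loop would repeatedly pop a URL, download the page, hand it to the indexing module and add the links it contains back into the frontier with their own priorities.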
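The two crawling models can be read as ways of scoring a crawler after the fact. The sketch below follows my understanding of them and should be checked against the article: in Crawl & Stop the crawler stops after K pages and is scored by the fraction of visited pages that are at least as important as the K-th most important page overall; in Crawl & Stop with Threshold a page counts as "hot" when its importance reaches a target value, and the score is the fraction of hot pages that were visited. The function names and the importance dictionary are hypothetical.

```python
def crawl_and_stop(importance: dict, visited: list) -> float:
    """Fraction of the K visited pages that belong to the K most important ones."""
    k = len(visited)
    if k == 0:
        return 0.0
    # Importance of the K-th best page: what an ideal crawler would reach.
    cutoff = sorted(importance.values(), reverse=True)[k - 1]
    hits = sum(1 for page in visited if importance[page] >= cutoff)
    return hits / k


def crawl_and_stop_with_threshold(importance: dict, visited: list,
                                  target: float) -> float:
    """Fraction of the hot pages (importance >= target) that were visited."""
    hot = {page for page, score in importance.items() if score >= target}
    if not hot:
        return 0.0
    return len(hot & set(visited)) / len(hot)
```

For example, with importance values a=0.9, b=0.7, c=0.4, d=0.1 and a crawler that visited a and c, the first measure gives 0.5 (only a is among the two best pages) and the second, with a target of 0.4, gives 2/3 (a and c visited out of the hot pages a, b and c).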

Storage: Challenges and repositories.

The third point of the article deals with the scalability that must exist in storage repositories for large collections of web pages.

This entails a series of challenges, since the repository manages a large collection of data objects (in this case web pages), much like a file system or a database. The issues that arise include scalability, dual access modes, large bulk updates and obsolete pages.

Indexing: Index structures, challenges, partitioning, text indexing systems.

An indexer module creates a variety of indexes: link (structure) indexes, text indexes and utility indexes. HTML tags such as h1, h2 or b make it easier for the indexer to tell which information is most important on a page.

The link indexes represent the Web as a large graph whose nodes are pages and whose edges are hyperlinks, a structure that helps the search engine relate pages to one another.
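A text index of this kind is usually an inverted index: a map from each term to the documents that contain it. The following sketch, built only on the Python standard library, gives extra weight to terms found inside h1, h2 or b tags. The weights in TAG_WEIGHTS and the names WeightedTextParser and build_index are assumptions for illustration and are not taken from the article.

```python
import re
from collections import defaultdict
from html.parser import HTMLParser

# Hypothetical weights: terms inside these tags count more than plain text.
TAG_WEIGHTS = {"h1": 3.0, "h2": 2.0, "b": 1.5}


class WeightedTextParser(HTMLParser):
    """Collects (term, weight) pairs, boosting terms inside emphasised tags."""

    def __init__(self):
        super().__init__()
        self._open = []   # weighted tags currently open (treated as a multiset)
        self.terms = []

    def handle_starttag(self, tag, attrs):
        if tag in TAG_WEIGHTS:
            self._open.append(tag)

    def handle_endtag(self, tag):
        if tag in self._open:
            self._open.remove(tag)

    def handle_data(self, data):
        # Text outside any weighted tag gets the default weight 1.0.
        weight = max((TAG_WEIGHTS[t] for t in self._open), default=1.0)
        for term in re.findall(r"[a-z0-9]+", data.lower()):
            self.terms.append((term, weight))


def build_index(docs: dict) -> dict:
    """Return an inverted index: term -> {doc_id: accumulated weight}."""
    index = defaultdict(lambda: defaultdict(float))
    for doc_id, html in docs.items():
        parser = WeightedTextParser()
        parser.feed(html)
        parser.close()
        for term, weight in parser.terms:
            index[term][doc_id] += weight
    return index
```

A query for a term then reduces to one dictionary lookup, and the accumulated weights give a first, very rough ordering of the matching documents.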

Ranking and link analysis: PageRank, HITS algorithms and other techniques.

Because the Web is so large, evaluating which pages are most important is complex. Different algorithms were therefore created that evaluate the importance of a web page based on different parameters.

On the one hand there is PageRank, which the article presents in two variants: Simple PageRank and Practical PageRank. A page's PageRank depends on the importance of the pages that link to it: the more important a page is, the more PageRank it passes on to the pages it links to. The other algorithm, HITS, identifies, for a given query, a set of hub pages and a set of authority pages.
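Both algorithms can be sketched as simple iterations over a link graph. The code below is an illustrative sketch, not the exact formulation of the article: the graph is assumed to be a dictionary mapping each page to the list of pages it links to, the damping factor 0.85 and the fixed number of iterations are conventional choices, and pages without outgoing links spread their rank evenly over all pages. For HITS, the graph is assumed to be the small query-specific subgraph that the algorithm works on.

```python
import math


def _all_pages(links: dict) -> set:
    """Every page that appears as a source or as a link target."""
    return set(links) | {q for targets in links.values() for q in targets}


def pagerank(links: dict, damping: float = 0.85, iterations: int = 50) -> dict:
    """Power iteration: each page shares its rank among the pages it links to."""
    pages = _all_pages(links)
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Rank held by pages without outgoing links is spread over all pages.
        dangling = sum(rank[p] for p in pages if not links.get(p))
        new_rank = {p: (1.0 - damping) / n + damping * dangling / n
                    for p in pages}
        for p, targets in links.items():
            for q in targets:
                new_rank[q] += damping * rank[p] / len(targets)
        rank = new_rank
    return rank


def hits(links: dict, iterations: int = 50):
    """Return (hub, authority) scores, normalised to unit length each round."""
    pages = _all_pages(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # A good authority is linked to by good hubs.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # A good hub links to good authorities.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        for scores in (auth, hub):
            norm = math.sqrt(sum(v * v for v in scores.values())) or 1.0
            for p in scores:
                scores[p] /= norm
    return hub, auth
```

On the toy graph {"a": ["c"], "b": ["c"], "c": ["a"]}, page c receives links from both a and b and so ends up with the highest PageRank and the highest authority score, while a and b are the best hubs.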

Final conclusions compared to the original article Searching the Web.

After finding three more recent references on web search that also cite this article, I observed the following:
  • Speed, reduction of resources and storage capacity: Both the 2001 article we had to read and the three more recent articles I selected from scholar.google.com give great importance to crawling speed, to minimizing the resources used while crawling (consuming little bandwidth, CPU and RAM, distributing processes, etc.) and to storing as much information as possible in the smallest possible space. The articles describe a series of crawling improvements aimed at optimizing speed, resources and, above all, storage.
  • Problems in crawling: In the search processes for web content, it is valid as a response to a user.
  • PageRank or the importance of websites: Some years ago, an algorithm was devised to calculate the importance of a website based, above all, on the importance of the websites that link to it. This has changed over the years, and PageRank is no longer the value that a search engine weighs most when calculating a site's importance on the Internet, since practices emerged that encouraged link spamming, leading search engines to block such websites or penalize their position in Google. Other factors now carry more weight, such as code quality (whether it meets CSS, W3C, XHTML, etc. standards) and whether the site redirects correctly with a well-configured .htaccess file.