2. Searching the Web
Statement 1: As an introduction to Web search engines, we suggest reading the article Searching the Web by Arvind Arasu, Junghoo Cho, Hector García-Molina, Andreas Paepcke and Sriram Raghavan, which you can find here. Once you have read it, select from scholar.google.com three recent articles (preferably from 2009) that cite it and that have the greatest possible impact, and use them to discuss three aspects in which current research has moved well beyond what the original article proposes.
After reading the article Searching the Web, I selected the following three articles, which cite it in their references:
Summary: This article analyzes the problems that search engines currently face when indexing, analyzing and searching for new information in the Deep Web and the Hidden Web. These terms, widely used lately, relate to three different types of Internet that currently exist:
- Global Internet: The free and open information network that is, in theory, accessible to any interconnected computer. It is accessed through browsers, chat, messaging or file-exchange protocols (FTP, P2P).
- Invisible Internet: All content that is available on the Internet but is only reachable through pages generated dynamically in response to a database query. This makes it inaccessible to the usual information retrieval processes carried out by search engines, directories and search agents. We can still reach it with our usual tools (browser, email, etc.), provided that we know the exact access address (URL or FTP).
- Dark Internet: Servers or hosts that are completely inaccessible from our computer. According to a study by the company Arbor Networks, this affects 5% of global Internet content. The main cause (78% of cases) is restricted areas for national security and military purposes. The remaining 22% is due to other causes: incorrectly configured routers, firewalls and protection services, inactive servers and, finally, the "hijacking" of servers for illegal use.
In the article, the authors propose new methods for implementing more efficient crawlers that can search content located in the last two types of Internet. To do so, they rely on algorithms used in Web mining and genetic classification.
Summary: The World Wide Web is a vast source of information, full of hyperlinks to hypertext content. Search engines use Web crawlers to collect these documents for storage and indexing. However, many of these documents contain dynamic information that changes daily, weekly, monthly or yearly, so the copy stored on the search engine side has to be updated in order to offer the user the most recent information. An incremental crawler revisits the Web repeatedly, at specific time intervals, to update its collection. This work proposes a new mechanism for regulating the revisit frequency and a novel architecture for incremental crawlers.
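The summary does not describe the article's actual mechanism, so the following is only a minimal sketch of one simple way an incremental crawler could regulate its revisit frequency: shorten the interval when a page has changed and lengthen it when it has not. The names PageRecord and update_revisit_interval, the halving and doubling rule, and the interval limits are all assumptions made for illustration.

```python
import hashlib
from dataclasses import dataclass

MIN_INTERVAL_H = 1.0        # assumed lower bound: one hour
MAX_INTERVAL_H = 24.0 * 30  # assumed upper bound: about one month


@dataclass
class PageRecord:
    url: str
    checksum: str = ""        # digest of the last downloaded content
    interval_h: float = 24.0  # current revisit interval, in hours


def update_revisit_interval(record: PageRecord, content: bytes) -> bool:
    """Update the record after a visit; return True if the page changed.

    The very first visit counts as a change, because no checksum is stored yet.
    """
    new_checksum = hashlib.sha256(content).hexdigest()
    changed = new_checksum != record.checksum
    if changed:
        # The page changed since the last visit: come back sooner.
        record.interval_h = max(MIN_INTERVAL_H, record.interval_h / 2)
    else:
        # The page was stable: back off to save bandwidth.
        record.interval_h = min(MAX_INTERVAL_H, record.interval_h * 2)
    record.checksum = new_checksum
    return changed
```

With this rule, pages that change often converge towards short intervals, while static pages are visited less and less frequently, which is the general trade-off the summary describes.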
Summary: The most difficult part of a Web crawler is downloading content at the fastest rate the available bandwidth allows while, at the same time, processing the downloaded information without stalling the program. The scalable Web crawling system implemented in the article, called WebTracker, has been designed to meet this challenge. According to the article, distributing the software reduces the resources used while maintaining the same rate of results. WebTracker has a central crawler server that manages all the nodes. Each node runs a crawler, which launches a downloader and manages and analyzes the downloaded content, while the crawler server keeps the operations of the distributed system synchronized.
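The summary does not say how the central server divides work among the nodes. As an illustration only, a common approach in distributed crawlers is to hash the host name of each URL so that every page of a site goes to the same node. The sketch below assumes that approach; the function names node_for_url and assign_urls are hypothetical and are not taken from WebTracker.

```python
import hashlib
from collections import defaultdict
from urllib.parse import urlsplit


def node_for_url(url: str, num_nodes: int) -> int:
    """Map a URL to a node index by hashing its host name."""
    host = urlsplit(url).hostname or ""
    digest = hashlib.sha256(host.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_nodes


def assign_urls(urls, num_nodes):
    """Group URLs into per-node work lists, as a central server might do."""
    work = defaultdict(list)
    for url in urls:
        work[node_for_url(url, num_nodes)].append(url)
    return dict(work)
```

Keeping one host on one node makes it easier for that node to limit its request rate to the site, at the cost of an uneven load when a few hosts are very large.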
2.1 Summary and conclusion of the article Searching the Web
In the article Searching the Web, the authors discuss Web crawling, page storage and indexing, along with the techniques used to design and implement each of these components.
Introduction
The authors describe the challenges encountered when building good search engines and list a series of useful techniques. Specifically, they detail IR (Information Retrieval) techniques, which are used to retrieve information from small collections, such as newspaper articles and library book catalogs.
However, these IR techniques are not directly valid for the Web, whose volume has grown considerably, because applying them would require too many resources. The article therefore presents newer techniques, such as indexing schemes that scale to Web-sized crawls and techniques for filtering out websites with irrelevant content. It also considers whether a page is referred to by other pages using specific terms, which would indicate that the page is important for those terms.
They then explain how crawlers work: small applications in charge of building and maintaining a repository of Web pages, detecting changes in them, adding the new pages they find, and so on, by means of complex crawling algorithms. When the crawler completes a cycle, it knows which pages it should crawl again and which it should not. Resource optimization is achieved thanks to indexing.
Web page crawling
In the second section of the article, the authors summarize the functions that the crawler must perform when collecting information from websites. First, a crawler module retrieves the pages so that the indexing module can analyze them later. Each of the URLs that the crawler receives is assigned a priority, which determines the order in which they are processed.
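The prioritized set of pending URLs can be sketched as a priority queue. This is a minimal illustration, not the structure used in the article: the class name Frontier and the idea of passing the priority as a plain number are assumptions, and how the priority is computed is left open.

```python
import heapq
import itertools


class Frontier:
    """Queue of pending URLs that always returns the highest priority first."""

    def __init__(self):
        self._heap = []
        self._seen = set()
        self._counter = itertools.count()  # tie-breaker: insertion order

    def add(self, url, priority):
        """Queue a URL once; return False if it was already known."""
        if url in self._seen:
            return False
        self._seen.add(url)
        # heapq is a min-heap, so the priority is negated.
        heapq.heappush(self._heap, (-priority, next(self._counter), url))
        return True

    def pop(self):
        """Remove and return the pending URL with the highest priority."""
        _, _, url = heapq.heappop(self._heap)
        return url

    def __len__(self):
        return len(self._heap)
```

The set of seen URLs prevents the same page from being queued twice, which matters because the same link usually appears on many pages.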
Page selection is also discussed at length, including several models of page importance: interest-driven, popularity-driven and location-driven. Along with these come the crawler models, of which the article covers two: crawl and stop, and crawl and stop with threshold. Page refresh, that is, the "freshness" of the pages held by the search engine, and the refresh strategies are other topics covered at this point.
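As I understand these two crawler models, both measure how many important ("hot") pages a crawler has collected when it stops after K pages, and freshness measures how many stored copies still match the live page. The sketch below is an illustration under those assumptions; the function names, the dictionary-based inputs and the handling of the case with no hot pages are my own choices.

```python
def crawl_and_stop(visited, importance, k):
    """Fraction of the first k visited pages that are as important
    as the k-th most important page overall (the 'hot' pages)."""
    cutoff = sorted(importance.values(), reverse=True)[k - 1]
    hot_visited = sum(1 for page in visited[:k] if importance[page] >= cutoff)
    return hot_visited / k


def crawl_and_stop_with_threshold(visited, importance, k, target):
    """Fraction of all pages with importance >= target that were
    visited within the first k pages."""
    hot_pages = {page for page, score in importance.items() if score >= target}
    if not hot_pages:
        return 1.0  # assumption: nothing to find counts as complete
    found = sum(1 for page in visited[:k] if page in hot_pages)
    return found / len(hot_pages)


def freshness(local_copies, live_pages):
    """Fraction of stored pages whose local copy equals the live page."""
    up_to_date = sum(
        1 for url, body in local_copies.items() if live_pages.get(url) == body
    )
    return up_to_date / len(local_copies)
```

Here visited is the list of pages in the order the crawler fetched them (without repetitions), and importance maps each page to its score under whichever importance model is chosen.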
Storage: Challenges and repositories.
The third point of the article deals with the scalability that must exist in storage repositories for large collections of web pages.
This entails a series of challenges, since the repository manages a large collection of data objects (in this case, Web pages), much as a file system or a database does. The issues that arise include scalability, dual access modes, large bulk updates and obsolete pages.
Indexing: Index structures, challenges, partitioning, text indexing systems.
An indexer module creates a variety of indexes. Its analysis produces link (structure) indexes, text indexes and utility indexes. HTML tags such as h1, h2 or b help the analyzers determine which information is the most important on a page.
With these indexes, the Web is modeled as a large graph whose nodes are pages and whose edges are links, which makes it easier to analyze how pages relate to one another.
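A text index of this kind is usually built as an inverted index, which maps each term to the documents that contain it. The following minimal sketch shows the idea on a few in-memory documents. The function names and the simple lowercase tokenizer are assumptions, and tag-based weighting (h1, h2, b) is left out for brevity.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Split text into lowercase alphanumeric terms."""
    return re.findall(r"[a-z0-9]+", text.lower())


def build_inverted_index(docs):
    """Map each term to the set of IDs of the documents that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index


def search(index, query):
    """Return the sorted IDs of the documents that contain every query term."""
    terms = tokenize(query)
    if not terms:
        return []
    result = set(index.get(terms[0], set()))
    for term in terms[1:]:
        result &= index.get(term, set())
    return sorted(result)
```

A query is answered by intersecting the document sets of its terms, so the cost depends on the length of those sets rather than on the size of the whole collection.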
Ranking and link analysis: PageRank, HITS algorithms and other techniques.
Because the Internet is so large, evaluating which website is more important is a complex task. Different algorithms were therefore created to evaluate the importance of a Web page based on different parameters.
On the one hand there is PageRank, which has two variations: Simple PageRank and Practical PageRank. The PageRank of a page depends on the importance of the pages that link to it: the more important a page is, the more PageRank it passes on to the pages it links to. The other algorithm, HITS, identifies, for a given query, a set of hub pages and a set of authority pages.
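Both algorithms can be sketched as simple iterations over a link graph. The code below is an illustration under stated assumptions rather than the article's exact formulation: links maps each page to the list of pages it links to, d is an assumed random-jump probability, dangling pages spread their rank evenly (a common convention), and HITS is run directly on the given graph instead of on a query-specific subgraph.

```python
import math


def all_pages(links):
    """Every page that appears as a source or as a target of a link."""
    return sorted(set(links) | {q for targets in links.values() for q in targets})


def pagerank(links, d=0.15, iterations=50):
    """Iterative PageRank with random-jump probability d."""
    pages = all_pages(links)
    m = len(pages)
    rank = {p: 1.0 / m for p in pages}
    for _ in range(iterations):
        new_rank = {p: d / m for p in pages}
        for p in pages:
            targets = links.get(p, [])
            if targets:
                # A page divides its rank equally among its out-links.
                share = (1 - d) * rank[p] / len(targets)
                for q in targets:
                    new_rank[q] += share
            else:
                # Dangling page: spread its rank over all pages.
                for q in pages:
                    new_rank[q] += (1 - d) * rank[p] / m
        rank = new_rank
    return rank


def hits(links, iterations=50):
    """Return (hub, authority) scores, normalized to unit length."""
    pages = all_pages(links)
    hub = {p: 1.0 for p in pages}
    auth = {p: 1.0 for p in pages}
    for _ in range(iterations):
        # Authority: sum of the hub scores of the pages linking to it.
        auth = {p: 0.0 for p in pages}
        for p, targets in links.items():
            for q in targets:
                auth[q] += hub[p]
        # Hub: sum of the authority scores of the pages it links to.
        hub = {p: sum(auth[q] for q in links.get(p, [])) for p in pages}
        a_norm = math.sqrt(sum(v * v for v in auth.values())) or 1.0
        h_norm = math.sqrt(sum(v * v for v in hub.values())) or 1.0
        auth = {p: v / a_norm for p, v in auth.items()}
        hub = {p: v / h_norm for p, v in hub.items()}
    return hub, auth
```

PageRank yields a single query-independent score per page, whereas HITS yields two scores per page, so a page can be a good hub without being a good authority.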
Final conclusions compared to the first article Searching the Web.
After looking for three more recent references on Web search that also cite this article, I have observed the following:
- Speed, reduction of resources and storage capacity: Both the article we had to read, from 2001, and the three more recent articles I selected from scholar.google.com give great importance to crawling speed, to minimizing the resources used during crawls (consuming little bandwidth, CPU and RAM, distributing processes, etc.) and to storing the greatest amount of information in the smallest possible space. The articles describe a series of algorithms for improving crawling that optimize speed, resources and, above all, storage.
- Problems in crawling: In the search processes for web content, it is valid as a response to a user.
- PageRank or the importance of websites: Some years ago, an algorithm was devised to calculate the importance of a website, based above all on the importance of the websites that linked to it. This has changed over the years, and PageRank is no longer the value that a search engine weighs most heavily when calculating a site's importance on the Internet: practices emerged that encouraged spamming, which led search engines to block this type of website and to penalize its position in Google. Other factors now carry more weight, such as the cleanliness of the code (whether it meets standards such as CSS, W3C, XHTML, etc.) and whether the site redirects correctly with a good .htaccess file.