Web scale · Deep web · Freshness

Web dynamics, diversity and scale: why search engines are hard

The Web is not a stable library. It grows, changes, duplicates itself, disappears, expands across languages and hides large parts behind forms or dynamic systems.

Growth

Search systems must handle new pages, changed pages and removed pages continuously.

Diversity

Languages, domains, formats and quality levels make uniform indexing impossible.

Deep web

Many resources are not directly reachable by following static links.

Metadata

Structured descriptions help, but they are incomplete, inconsistent and unevenly adopted.

Dynamics, diversity and size of the web. A problem for search engines

Web dynamics source notes

This project summarizes the conclusions drawn from reading the following articles:
  • Internet in Chile (SUBTEL)
  • Trends in the Evolution of the Public Web (1998 - 2002)
  • Graph Structure in the Web
  • Web Dynamics

If you need help, the following explanatory video is available: https://www.youtube.com/watch?v=-ibNde2KkM0.

2 Definition and objectives of the study of web dynamics

Every day humanity produces an ever-increasing amount of information: approximately 2.5 quintillion bytes (2.5 × 10^18 bytes). Furthermore, 90% of the information available to humanity has been produced in the last two years.

How have we increased our capacity to generate data so much? Where does all this information come from?

Virtually everywhere. In today's technological world, information can be obtained from anything, whether it is a network of sensors, the last photo we uploaded to Instagram or the last message we wrote on Facebook. How can we keep track of that information? Through web dynamics, that is, the techniques that let us understand how the use of the web evolves, its topology and its content, and which models and techniques will scale with the current growth rate (Figure 1).

A real need has emerged to understand and manage the dynamics of the web, and to develop new techniques that keep the web tractable.

Figure 1. Estimate and forecast of the evolution of the amount of global information circulating on the network. [blogs.elpais.com 2014]

3 Features of the Web

Looking at recent history, in 1995 practically nobody in the world had access to the Internet, but over the last 10 years the situation has turned around thanks to the explosive growth of the network and an exponential increase in users with new needs. According to data from December 2010, 30% of the world population, approximately 2,000 million people, had access to the Internet, and projections indicate that this figure will rise sharply within a few years due to the emergence of the mobile Internet and smartphones (it is currently estimated that there are 1,038 million smartphones with Internet access in the world). As a result, the amount of information published, viewed or uploaded by users of the most popular applications reaches impressive figures. Furthermore, the amounts of information we handle today are expected to look tiny in the coming years, as Figure 1 shows. The first part of the figure is a circle comparing the amount of information created in 2005 with the amount expected to be created in 2020. The second part shows the evolution of the global amount of information alongside the storage capacity available worldwide.

Although the Web remains a work in progress, current research is analyzing the trends that characterize its evolution. Despite its relative youth, the Web has been the subject of many predictions about the direction of its future development and its role as a means of communicating and obtaining information in digital form. Given the persistent uncertainty surrounding the maturation of the Web, it is useful to examine some of the main current trends, both to mark the current state of the Web's evolution and to inform new predictions about its future.

The article Trends in the Evolution of the Public Web examines three key trends in the development of the public Web:
  • Size and growth
  • Internationalization
  • Use of metadata

3.1 Size and growth

The number of websites now stands at over 1 billion and is increasing at an accelerating rate, according to data published in real time by the site Internet Live Stats.

However, we cannot claim that the value of the information stored in them grows as fast. Shapiro and Varian estimated that the static HTML text on the Web was equivalent to about 1.5 million books. They compared this figure with the number of volumes in the University of California at Berkeley library (8 million) and, considering that only a fraction of the Web's information can be considered “useful,” concluded that “the Web is not that impressive as a source of information.”

However, Shapiro and Varian's evaluation does not take into account other, more dynamic kinds of content (such as video) that can be excellent sources of information. The Web includes digital resources of many varieties beyond plain text, often combined and recombined into complex multi-object information resources that their estimate does not recognize.

Furthermore, many Web analysts now recognize the distinction between the “surface Web” and the “deep Web.” Although this terminology carries different shades of meaning in different contexts, we can define the terms as follows:
  • Surface Web: the portion of the Web that is accessible to traditional crawlers, which discover Web content by following links from page to page.
  • Deep Web: information that is inaccessible to link-based Web crawlers, such as:
    • dynamically generated pages
    • pages created in response to a particular interaction between the site and the user
Although no reliable estimate of the size of the deep Web is available, it is believed to be large and growing.

The conclusion we can draw from the studies on the evolution of the Web is that, although its size equals or even exceeds the collections of the largest library, the amount of useful information it holds is probably much smaller. These same studies also state that the Web is growing exponentially, so this difference is shrinking.
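
The surface/deep distinction defined earlier in this section can be illustrated with a minimal sketch. It assumes a hypothetical toy site held in memory (the page names and the `STATIC_LINKS` dictionary are invented for illustration) and a crawler that only follows static links, so pages produced by a search form or hidden behind a login are never discovered.

```python
from collections import deque

# Hypothetical toy site: each page maps to the pages it links to statically.
STATIC_LINKS = {
    "/": ["/about", "/catalog"],
    "/about": ["/"],
    "/catalog": ["/catalog/page2"],
    "/catalog/page2": [],
    # Produced only by submitting a search form: no static link points here.
    "/search?q=zipf": ["/catalog"],
    # Behind a login form.
    "/account/orders": [],
}


def crawl(start, links):
    """Breadth-first traversal that follows static links only."""
    seen = {start}
    queue = deque([start])
    while queue:
        page = queue.popleft()
        for target in links.get(page, []):
            if target not in seen:
                seen.add(target)
                queue.append(target)
    return seen


surface = crawl("/", STATIC_LINKS)
deep = set(STATIC_LINKS) - surface
print("surface web:", sorted(surface))
print("deep web:", sorted(deep))
```

In this sketch the "deep" pages exist on the site but stay invisible to the crawler, which is the sense in which the deep Web is hard for a search engine to measure.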

3.2 International distribution and language.

As its name suggests, the World Wide Web is a global information resource, in the sense that anyone, regardless of country or language, is free to make information available in this space. Ideally, then, the content of the Web should reflect the international community at large, coming from sources around the world and expressed in a wide range of languages.

The reality is a little different. At present, the vast majority of sites are written in English and belong to entities, people or organizations located in the US. This trend is changing as Internet use expands in countries such as China or India. There are currently more connected users in China than in the US or Europe, even though only 47% of its population is connected (https://es.wikipedia.org/wiki/Annex:Countries_by_number_of_Internet_users).

3.3 Use of metadata

When we store large amounts of information in one place, it has to be organized and indexed to facilitate search and retrieval. The Web, however, was not originally designed with any structure that would allow this kind of organization. Search is therefore done with “brute force” methods, such as the Google search engine, which employs relatively sophisticated algorithms that rank search results based on linking patterns and popularity.
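
To give an idea of what ranking "based on linking patterns" can mean, here is a minimal sketch of a simplified PageRank-style power iteration. It is an illustration only: the text does not describe the actual algorithm any search engine uses, and the function name `link_rank`, the damping value and the four-page toy web are assumptions made for the example.

```python
def link_rank(links, damping=0.85, iterations=50):
    """Score pages with a simplified PageRank-style power iteration."""
    pages = sorted(set(links) | {t for targets in links.values() for t in targets})
    n = len(pages)
    rank = {p: 1.0 / n for p in pages}
    for _ in range(iterations):
        # Every page keeps a small baseline score.
        new_rank = {p: (1.0 - damping) / n for p in pages}
        for page in pages:
            targets = links.get(page, [])
            if targets:
                # A page shares its score equally among the pages it links to.
                share = damping * rank[page] / len(targets)
                for target in targets:
                    new_rank[target] += share
            else:
                # A page with no outgoing links spreads its score evenly.
                share = damping * rank[page] / n
                for p in pages:
                    new_rank[p] += share
        rank = new_rank
    return rank


# Hypothetical four-page web: three pages link to "hub".
toy_links = {
    "a": ["hub"],
    "b": ["hub"],
    "c": ["hub"],
    "hub": ["a"],
}
scores = link_rank(toy_links)
best = max(scores, key=scores.get)
print(best, {page: round(score, 3) for page, score in scores.items()})
```

Pages that receive many links, or links from well-linked pages, end up with higher scores without anyone having catalogued them by hand.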

Librarians, who face these problems regularly, establish order through the careful preparation and maintenance of bibliographic data, that is, descriptive information about the resources in their collections. More generally, this descriptive information is called metadata, or “data about data”. Following this philosophy, a movement has emerged to promote the use of metadata on the Web, notably through the Dublin Core Metadata Initiative (https://es.wikipedia.org/wiki/Dublin_Core).

Metadata for Web resources is typically implemented with the META tag, which creators can use to embed whatever information they consider relevant to describe the resource. The META tag has two main attributes:
  • NAME, which identifies a particular metadata element (keywords, author, etc.).
  • CONTENT, which provides a value for the metadata element identified in the NAME attribute.
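
As a minimal sketch of how the NAME/CONTENT pairs of META tags could be detected in a page, the following example uses Python's standard `html.parser` module. The class name `MetaCollector` and the sample page are hypothetical, and the `DC.title` entry assumes the common convention of prefixing Dublin Core element names with `DC.` in HTML.

```python
from html.parser import HTMLParser


class MetaCollector(HTMLParser):
    """Collect (name, content) pairs from the META tags of an HTML page."""

    def __init__(self):
        super().__init__()
        self.metadata = []

    def handle_starttag(self, tag, attrs):
        # html.parser passes tag and attribute names in lowercase.
        if tag != "meta":
            return
        attributes = dict(attrs)
        name = attributes.get("name")
        content = attributes.get("content")
        if name is not None and content is not None:
            self.metadata.append((name, content))


# Hypothetical page; all values are invented for illustration.
SAMPLE_PAGE = """
<html><head>
<title>Web dynamics notes</title>
<META NAME="keywords" CONTENT="web dynamics, deep web, search engines">
<META NAME="author" CONTENT="Example Author">
<meta name="DC.title" content="Web dynamics notes">
<meta charset="utf-8">
</head><body><p>Body text.</p></body></html>
"""

collector = MetaCollector()
collector.feed(SAMPLE_PAGE)
for name, content in collector.metadata:
    print(f"{name}: {content}")
```

Counting the collected pairs per page gives the kind of "number of metadata elements" figure that survey-style analyses rely on; the `charset` tag is skipped because it has no NAME attribute.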

Using data from the five web surveys, it was possible to examine trends in the use of metadata on the public Web over the last five years. The objective of the analysis was simply to detect the presence of any form of metadata, implemented with the META tag, on public websites. Regarding the reasons for the observed increase, several factors may be involved, but the main one is the arrival of more sophisticated HTML editors, in which some META tags are created and populated automatically as part of the document template.

A second interesting feature of metadata use on the Web is that it does not appear to be becoming more detailed. Assuming that one META tag is equivalent to one metadata element, that is, one piece of descriptive information about the Web resource, Web pages that include metadata contain on average about two or three elements.

One of the discouraging aspects of metadata usage trends on the public Web over the last five years is the apparent reluctance of content creators to adopt formal metadata schemes to describe their documents. For example, Dublin Core metadata appeared on only 0.5 percent of public website home pages in 1998; that figure rose almost imperceptibly to 0.7 percent in 2002. The vast majority of metadata provided on the public Web is ad hoc in its creation, not structured by any formal metadata scheme.

4 Zipf's law, “power laws” on the web

Zipf's law, named after George Kingsley Zipf (1902-1950), professor of linguistics at Harvard University, is a mathematical curiosity that explains some of the difficulties that arise in digital libraries.

Suppose we rank the words that appear in the library, assigning the number one to the most frequent word, two to the second most frequent, and so on. For example, in the Miguel de Cervantes library, the 10 most frequent words (given here in English translation) and their frequencies of appearance f(n) are the following:

n     word    f(n)
1     of      5952871
2     that    4294496
3     and     3887331
4     the     3473934
5     in      2521954
6     the     2463429
7     to      2348470
8     the     1689770
9     if      1305932
10    no      1261456

Zipf's law states that the number of occurrences of a word is inversely proportional to its rank:

\[ f(n) = \frac{C}{n} \]

where C is a constant that is fixed experimentally. For example, for the Miguel de Cervantes library C was set to 10 million, giving the following approximations:

n        word       f(n)       C/n
10       no         1261456    1000000
100      day        93619      100000
1000     sorrow     9837       10000
10000    frankly    841        1000

Despite being an approximate result, one of the virtues of Zipf's law is that it explains why it is difficult to build good dictionaries, for two reasons:
  • Whatever the size of the library, adding new documents adds some new words.
  • Simple mathematical reasoning tells us that the library contains on the order of C different words (the rank n at which f(n) falls to 1) and that the number of words with frequency f, that is, those ranked between C/(f+1) and C/f, is approximately:

\[ \frac{C}{f} - \frac{C}{f+1} = \frac{C}{f(f+1)} \]

Therefore, if we construct a dictionary that contains all the words in the library (that is, with C entries), about half of the words in the dictionary (C/2) appear only once in the library: this is the result of setting f = 1 in the previous formula.
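
The reasoning in this section can be checked numerically with a short sketch. It assumes the usual form of Zipf's law, f(n) = C/n, from which the number of words with frequency f follows as C/(f(f+1)); the function names are hypothetical, and only the table rows for ranks 10, 100 and 1000 are reused as observed values.

```python
from collections import Counter


def rank_frequency(words):
    """Return (rank, word, count) rows, most frequent word first."""
    counts = Counter(words)
    ordered = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [(rank, word, count) for rank, (word, count) in enumerate(ordered, start=1)]


def zipf_estimate(c, rank):
    """Frequency predicted by Zipf's law, f(n) = C / n."""
    return c / rank


def words_with_frequency(c, f):
    """Approximate number of distinct words occurring f times: C / (f (f + 1))."""
    return c / (f * (f + 1))


C = 10_000_000  # constant quoted in the text for the Miguel de Cervantes library
observed = {10: 1261456, 100: 93619, 1000: 9837}  # rank -> f(n), from the table

for rank, count in observed.items():
    print(f"rank {rank}: observed {count}, Zipf estimate {zipf_estimate(C, rank):.0f}")

# With f = 1 the formula gives C / 2: about half of the dictionary entries
# would be words that appear only once.
print(words_with_frequency(C, 1))
```

`rank_frequency` can be applied to any tokenized text, for example `rank_frequency("a b a c a b".split())`, to build the same kind of rank table for a small corpus.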

5 Size and growth trend of the Web

The size of the Web makes it an extremely important, diversified and continuously expanding source of information, although not all the information it contains can be considered useful. In Figure 2 we can see how growth is continuous, advancing as the technologies become more accessible, until 2003, when it stagnates, possibly because those who wanted to establish a presence on the Web had already done so.

Figure 2. Estimate of the size of the global web. [http://fundacionorange.es/]

6 Public web and hidden web

6.1 Definition of Terms

The term public web is used for the Web space that a general-purpose search engine can reach, that is, everything from static HTML to images, video, audio, PDF files, compressed files or executables.

The hidden web, by contrast, comprises all the data that exists on the web but lies beyond the reach of traditional search engines, whether general or specialized. The types of data it contains include:
  • Information held in enormous numerical or textual databases, which would be very costly in resources for search engines to store in their own databases.
  • Data generated dynamically in real time (pages built with technologies such as Flash, ASP, etc.).
  • Dynamic databases.
  • Databases protected by passwords or authentication.
  • Offline pages.
  • Archival material.
  • Interactive tools such as dictionaries or calculators.

6.2 Dimensions

According to the study How Much Information? 2003, carried out by Peter Lyman and Hal R. Varian of the School of Information Management and Systems at the University of California, Berkeley, the navigable or visible Web holds 147 terabytes of information, while the invisible Web holds 91,850 terabytes.

7 Languages on the web

Ideally, Web content should reflect the entire international community, originating from sources around the world and expressed in a wide range of languages. In reality, the main contributors to the growth of the Internet are the United States, Germany, China, South Korea and Japan, and the vast majority of sites are in English.

Although the supremacy of English on the Internet is overwhelming, this is a medium that, almost by definition, must also be multilingual, and it is very common to find buttons or markers that let us choose the language in which to read a text. Almost all search engines offer the option of translating the page being viewed into the language of our choice, and fortunately there are powerful free translators available online (https://translate.google.es/?hl=ca).

What are the most used languages on the Internet? The figures here are based on data published by W3Techs (https://w3techs.com/technologies/overview/content_language), a platform dedicated to reporting daily on Internet use around the world, with data and statistics by country and region. According to data collected worldwide, 36% of Internet users come from Asia (418 million users), 28% from Europe (322 million users), 20% from North America (233 million users) and 9% from Latin America (110 million users). The remaining 7% is distributed among Oceania, the Middle East and Africa.

The languages in the list account for 80% of Internet users; the remaining 20% (around 200 million) use other languages not mentioned.

Regarding Spanish, the growth of the Hispanic market on the Web is notable. It is estimated that only 25% of Spanish speakers have access to the Internet, and this number is expected to increase.

These data are extremely interesting for webmasters (web page creators), since they indicate to whom and in which language content should be directed to reach a greater number of people. The data was published at the end of March 2007. The 10 most used languages on the Internet are listed below, with the number of Internet users who speak each language.
  • English 329 million users.
  • Chinese 159 million users.
  • Spanish 89 million users.
  • Japanese 86 million users.
  • German 59 million users.
  • French 56 million users.
  • Portuguese 40 million users.
  • Korean 34 million users.
  • Italian 31 million users.
  • Arabic 28 million users.

8 Domains on the web

Domain names are a human-readable translation of IP addresses, which are convenient only for computers.

For example, www.youtube.com is a domain name with IP 216.58.211.206. As you can see, domain names are words separated by periods, rather than the numbers used in IP addresses. These words can give us an idea of the computer we are referring to. If you know a little more about domain names, just by looking at https://www.hacienda.gob.es/es-ES/Paginas/Home.aspx you can conclude that it is a site that belongs to the government of Spain, from the label .gob (used by Spanish government bodies) and the ending .es for Spain. Some of the existing domains are:
  • com: Companies
  • edu: Educational institutions, mostly universities
  • org: Non-governmental organizations
  • gov: Government entities
  • mil: Military installations
  • info: Organizations that offer information
  • tv: Television networks
In the rest of the countries, the last word indicates the country:
  • es: Spain
  • fr: France
  • uk: United Kingdom
  • it: Italy
  • jp: Japan
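
The way a domain name can be read from right to left can be sketched in a few lines. The example below uses `urllib.parse.urlsplit` from the Python standard library to isolate the host name; the two dictionaries simply restate the endings listed in this section, and the function name `describe_domain` is hypothetical. No network lookup is performed, so the IP address is not resolved.

```python
from urllib.parse import urlsplit

# Meanings restated from the lists in this section; illustrative, not exhaustive.
GENERIC_ENDINGS = {
    "com": "Companies",
    "edu": "Educational institutions",
    "org": "Non-governmental organizations",
    "gov": "Government entities",
    "mil": "Military installations",
    "info": "Organizations that offer information",
    "tv": "Television networks",
}
COUNTRY_ENDINGS = {
    "es": "Spain",
    "fr": "France",
    "uk": "United Kingdom",
    "it": "Italy",
    "jp": "Japan",
}


def describe_domain(url):
    """Return (host name labels, meaning of the last label) for a full URL."""
    hostname = urlsplit(url).hostname or ""
    labels = hostname.lower().split(".")
    ending = labels[-1]
    meaning = GENERIC_ENDINGS.get(ending) or COUNTRY_ENDINGS.get(ending, "unknown")
    return labels, meaning


for url in ("https://www.youtube.com",
            "https://www.hacienda.gob.es/es-ES/Paginas/Home.aspx"):
    labels, meaning = describe_domain(url)
    print(labels, "->", meaning)
```

Note that `urlsplit` only finds the host name when the URL includes a scheme such as `https://`, which is why the first example is written with one.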

9 Studies on the Spanish web

The main conclusions that emerge from the study of the Spanish web are that:
  • A large number of sites do not use the country's top-level domain, .es, preferring .com or .org (github.com, brand.com, as.com, etc.).
  • Diversity of information: there is a large amount of information generated by universities and research bodies, newspapers, leisure pages, etc.
  • The statistical properties of the sample are very similar to those of other samples, indicating that it can be used for studies whose results can be at least partially extrapolated to the global network.
  • 1% of the web pages are links to files that are not HTML. Although this seems a small number, it amounts to about 200,000 documents. Plain text and PDF are the most common formats.
  • The proportion of sites that consist of only one page, without any links, is close to 30%.

10 Related research areas

Web dynamics is directly related to the following areas of research, which we have already covered throughout the course in the different assignments:
  • Data mining: Data mining (DM) consists of the non-trivial extraction of information that resides implicitly in the data. This information was previously unknown and may be useful for some process. In other words, data mining prepares, probes and explores the data to extract the information hidden in it. The name covers a whole set of techniques aimed at extracting actionable knowledge that is implicit in databases. It is strongly linked to the supervision of industrial processes, since it is very useful for exploiting the data stored in databases. The foundations of data mining lie in artificial intelligence and statistical analysis. The models extracted with data mining techniques are used to address prediction, classification and segmentation problems.
  • Graph theory: In mathematics and computer science, graph theory studies the properties of graphs. A graph is a non-empty set of objects called vertices (or nodes) together with a selection of pairs of vertices, called edges, which may or may not be directed. Typically, a graph is drawn as a series of points (the vertices) connected by lines (the edges).
  • Information retrieval: Information Search and Retrieval is the science concerned with searching for information in:
    • electronic documents
    • digital documentary collections
    • relational databases
    Information retrieval is an interdisciplinary field. It spans so many disciplines that it usually produces partial knowledge from one perspective or another. Some of the disciplines involved are cognitive psychology, information architecture, information design, artificial intelligence, linguistics, semiotics, computer science, library science, archival science and documentation. To achieve its retrieval objective it relies on information systems and, being multidisciplinary in nature, it involves librarians, who determine search criteria and the relevance and pertinence of terms, working together with computing.
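
The graph definition given in the list above (a non-empty set of vertices plus a selection of pairs of vertices, directed or not) can be sketched as an adjacency-list structure. The function names and the three-page example are hypothetical; treating pages as vertices and hyperlinks as directed edges is the usual way the Web is modelled as a graph.

```python
def build_graph(vertices, edges, directed=True):
    """Build an adjacency-list graph from vertices and pairs of vertices."""
    if not vertices:
        raise ValueError("a graph needs a non-empty set of vertices")
    adjacency = {vertex: set() for vertex in vertices}
    for source, target in edges:
        adjacency[source].add(target)
        if not directed:
            # An undirected edge can be traversed in both directions.
            adjacency[target].add(source)
    return adjacency


def in_degrees(adjacency):
    """Count, for each vertex, how many edges point to it."""
    counts = {vertex: 0 for vertex in adjacency}
    for targets in adjacency.values():
        for target in targets:
            counts[target] += 1
    return counts


# Hypothetical pages as vertices, hyperlinks as directed edges.
pages = ["home", "news", "contact"]
hyperlinks = [("home", "news"), ("home", "contact"), ("news", "home")]
web_graph = build_graph(pages, hyperlinks, directed=True)
print(web_graph)
print(in_degrees(web_graph))
```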

11 International conferences

Some of the international conferences and journals that address Web search and indexing methods are the following:
  • International World Wide Web Conference (IW3C2)
  • International Journal of Computer Networks & Communications (IJCNC)
  • International Conference on Internet and Web Engineering
  • Interlink Web Design Conference
  • International Conference on Web Intelligence, Mining and Semantics
  • International Conference on Web-based Learning (ICWL 2010)
  • International Conference on Machine Learning (ICML97)