Web graph · PageRank · HITS

World Wide Web structure mining: PageRank, HITS, hubs and authorities

Structure mining treats the Web as a graph. Links are not only navigation: they reveal authority, communities, recommendations and ranking signals.

Graph

Pages become nodes and hyperlinks become directed edges.

Authority

Good pages receive meaningful links from other relevant pages.

Hubs

Hub pages point to several useful authorities within a topic.

Algorithms

PageRank, HITS and co-citation turn link structure into measurable signals.

World Wide Web and Web Mining

This work summarizes the conclusions obtained after reading the articles. Soumen Chakrabarti “Mining the Link Structure of the World Wide Web” , Ravi Kumar “The Web as a Graph” and Broder “Graph Structure in the web”.

1. Definition and objectives of web structure mining.

The World Wide Web is a large and continually expanding corpus of hypertext. This means that it is made up of a set of web pages without any type of unified structure and also with variability in style and content. This level of complexity makes it impossible to apply techniques from a database manager and information retrievers. to which we were accustomed in traditional processes.

To try to solve it, algorithms have been developed that exploit the hyperlink structure of the WWW. and this is how we try to solve the discovery of information and categorization, the construction of resource lists quality and analysis of linked online communities. And it is in this last aspect where developed the most significant advances. The hyperlink structure implies an underlying social structure to create pages and links to other websites of other authors. The objective is to develop techniques that take advantage of what we observe about intrinsic social organization on the web as we design algorithms that mine information from hyperlinks.

Web structure mining would then be defined as the process of using graph theory to analyze the nodes and connection structure of a website. That is, we will represent the websites as Nodes of a graph and connections as vertices. To generate this type of structure we will have to perform the following procedure:

Extract patterns from hyperlinks on the web. A hyperlink is a structural component that connects the web page to a different location.
Mine the structure of the document. It consists of using the tree structure to analyze the XML or the HTML within the web page.
Apply a metric to the information obtained.

2. Definition, modeling and use of the notions of:

Authority (authoritative page), prestige

Actually in the search process we do not want to locate a set of relevant pages but we only want the highest quality relevant pages. To do this, we will have We must limit the searches to a size sensible for a human observer by applying methods that identify the pages with more “Authority”.

One way to obtain this metric is through links. As we have analyzed, the creation of a link by The author of a web page represents an implicit type of “approval” to the page being pointed to. Collecting the set of such “approvals”, a deeper understanding of both the relevance and the quality of the contents of the website pointed out by all of them.

Now this system is not perfect, there is also difficulty in using the information from the hyperlinks. Despite Because many links represent the type of endorsement described above, others are created for reasons that They have nothing to do with the granting of authority. For example, some links exist only for purposes navigational (“Click to go above”) or as paid advertisements. Another problem is that the authorities do not They link directly to another since in this way they avoid losing users.

To solve these problems, we first have to detect when a relatively anonymous page (does not have references) is relevant. Normally in these cases we always find pages that link to a collection of prominent sites on a common topic and, therefore, will be like authority concentrators (Hubs). They may appear in a variety of ways, from lists of professional resources, to lists of recommended links on pages individual websites. Hubs typically have few links pointing to them as they grant authority over a specific topic. In this way, they have a role that is dual to that of authorities: a good Authority is one which is targeted by many good hubs. . This mutually reinforcing relationship between hubs and authorities will serve as a central theme for the exploration of link-based methods in search, automated compilation of high-quality web resources, and the discovery of thematically web communities cohesive.

Centrality

In the web structure, the central node is that structure, site or document central website that generally provides most of the connections to other members, authorities, etc. For To analyze it, there are several ways depending on the characteristics of the network:

Eigenvector of centrality of Bonacci
It is used only for hyperlink analysis in symmetrically connected networks. In addition, the frequencies of the link connections between websites should not be binary.
Freeman's degree of neutrality
This analysis method is used for directional link networks. Measures the number of direct connections of a website with others. To do this, use:
- Degree of incoming neutrality
- Outgoing neutrality degree
- Closeness
- Betweenness
- The centrality of Negopy.

Co-date

Co-citation analysis is used to map the thematic relationship of sets of authors, newspapers or articles. These relationships are normally expressed in graphs where:

A node is a web page
Each undirected edge has a weight

Co-citation is used to measure the similarity of two documents. To generate the analysis we have to choose the set to observe, for example, documents k, i, j and d. Once chosen we realize that documents i and j are cited by document k. If document i and document j are cited by another paper, it means that i and j are related or similar.

Let L be the citation matrix. Each cell of the matrix is defined as: Lij = 1 if paper i cites paper j, and L = 0 if otherwise. Co-citation (denoted by C j is a similar measure defined as the number of papers that co-cite i and j, and is computed with:

C_{i,j}=\sum_{k=0}^{n}L_{ki}L_{kj}

where n is the total number of pages.

C_{ii}

is naturally the number of papers that cite i. A square matrix C can be formed with C ij, and is called citation matrix. Co-citation is symmetric, Cij = Cji, and is commonly used as measures of similarity of two clustering papers to a group of papers of similar topics together.

3. Ranking of web pages based on links:

3.1 PageRank

PageRank has emerged as the dominant link analysis model for search engines. search, particularly due to the evaluation of independent queries of web pages, to avoid the spamming.

It is based on the democratic nature of the web using its link structure as an indicator of quality individually for each page. PageRank interprets hyperlinks from page x to page y as a vote, from the page to; to page and. However, PageRank looks for a legal number of votes, or links, that a page receives. Also analyzes the pages that cast the vote. The votes cast by the pages that are of important weight help the other pages to have a better score. This is the idea of prestige ranking on social networks.

3.2 PageRank algorithm

PageRank is a static ranking of web pages in the sense that a PageRank value It is computed for each offline page and does not depend on search queries. For this algorithm you can give as a "prestige" value to each page. The main concepts in web contexts are the following:

In-Links of page i: These are hyperlinks that point to a page i of other pages. Usually, hyperlinks from the same site are not considered.
Out-Links of page i: These are hyperlinks that point to other pages from page i. Usually, hyperlinks from the same site are not considered.

From another perspective, applying the concept of prestige there is the following derivation of the PageRank algorithm:

A hyperlink from one page pointing to another page is a transport of authority to the destination page. So, The inbound links that page i receives give more prestige than page i had before.
Pages that point to a page i also have their prestige scores in power. A page with a high prestige pointing to page i is more important than a page with low prestige pointing to i. In In other words, a page is important if it is pointed to by other pages that are important.

Consistent with these statements, the importance of pages i is determined by summarizing PageRank scores. of all pages pointing to i. A page that points to many other pages, distributes the prestige it has on all the pages it points to.

To formulate the ideas explained, we treat the Web as a graph direct G = (V,E) where V is a set of vertices or nodes, for example, V is the set of all pages and E is the set of direct links (hyperlinks). Knowing that the number of total pages on the Web is

n ( n = /F/ )

. The PageRank of the page

i ( P(i) )

is defined by:

P(i)=\sum_{(i,j) \in E}\frac{P(j)}{O_{j}}

where

O_{j}

is the number of outgoing links from page j. Mathematically, we use a system of n linear equations with n unknowns. We can represent this system using a matrix of equations knowing that P is a column of n-dimensional vectors of PageRank value. For example

P= (P(1),P(2),...P(n))^{r}

Given A as the adjacency matrix of our graph with:

A_{ij} = \{ \begin{matrix} \frac{1} {O_{i}} & SI(i,j) \in E \\ 0 & resto de casos \end{matrix} \}

We can write the system of n equations with

P= A^{T}P

and define the system itself. To solve it we can apply an iterative algorithm thinking that 1 is the largest value and the PageRank is the vector P.

Now, does it meet all the conditions to apply it to a graph with web information? No. To introduce these conditions we go into the Markov chain.

In This string, each web page or node in the graph is remembered as a state. A hyperlink is a transition that carries from one state to another with a probability. This models random web browsing, browsing the web like a transition state. Each transition probability is

\frac{1} {O_{j}}

(assuming that the surfer when clicking on the hyperlinks on the page i uniformly randomly). Given a state transition probability matrix A, a square matrix as follows:

A=\begin{bmatrix} A_{11} & A_{12} & .& .& .& . & A_{1n} \\ A_{21} & A_{22} & .& .& .& .& A_{2n} \\ . & . & .& .& .& .& .\\ . & . & .& .& .& .& .\\ A_{n1} & A_{n2} & .& .& .& . & A_{nm} \end{bmatrix}

Aij represents the transition probability of a user in state i to state j.

Given a vector of initial probability distribution

P_{o} = (P_{o}(1),P_{o}(2), ...,P_{o}(n))^{T}

and a transition probability matrix A, we will have:

\sum_{i=0}^{n}P_{o}(i)=1

\sum_{ji=1}^{n} A_{ij}=1

For all those web pages that have exit links.

Once the equalities are defined, if the matrix A that we are analyzing meets the conditions of the equation we will say that A is a stochastic Markov chain matrix. Another very important detail to keep in mind is that in the graph, nodes that do not have any links in the system can never be calculated. If there are, we can make two procedures:

Delete the nodes that prevent the calculation. These pages will not affect the process of computing Pagerank of another page directly. Outbound links from other pages pointing to those pages will also be erased. After PageRank has been computed, these pages and the hyperlinks pointing to them, can be added to the network.
Add a full set of outbound links to each page. For example we add to all nodes a link to pages X, Z and W and, thus, the transition probability of going from one page i to each page is 1/3 assuming uniform probability.

3.3 PageRank as an iterative algorithm

To implement this concept, we must first be clear about two concepts

A direct graph G=(V,E) is directly related if and only if, for each pair of nodes , there is a path from u to v.
A state i is periodic with k > 1 if k is the smallest number of all paths from the stadium prior to i until i having the length that is k. If the state is not periodic, it is aperiodic.

Thus, the PageRank value of web pages can be created using an iteration method, which produces the main eigenvector with the eigenvalue of 1. The algorithm is simple. We start with a PageRank assignment value and the The iteration ends when the PageRank value stops varying. The algorithm would be the following:

4.HITS

HITS stands for Hypertext Induced Topic Search, a static classification algorithm that It depends on the search query. When the user makes a query, the list of successful websites expands the list of relevant pages returned by a search engine and then produces two rankings based on:

Authority Rank
Hub ranking
A hub is a page with many links. The page serves as an organizer of the information on a particular topic and points to many pages that are a good source on the topic.

The key idea of HITS is that a good center points to many good authorities and one good authority It is pointed out by many websites that are important. Therefore, authorities and centers have a relationship of mutual reinforcement.

5. Analysis of communities on the web

The Internet provides shelter to a large number of communities with an interest common manifesting itself as a set of web pages. Although many communities are explicitly defined (news groups, resource collections on portals, etc.) the vast majority are implicit. Identify these communities would help understand the social and intellectual evolution of the web and provide detailed information to a group of people with certain interests in their sights. But of course, due to the continuous evolving flow of Internet, it is difficult to find and analyze these communities using only manual effort.

A possible solution to the problem is to treat the web as a huge directed graph. It is based on the fact that the Thematically cohesive web communities contain at their core a dense pattern of links from Hubs to Authorities.

To put this in more graph theory-oriented language: The notion of a bipartite graph is used. directed (Graph whose nodes can be divided into two sets A and B such that each link in the graph is directs from nodeA to node B. Since the communities being searched contain directed bipartite graphs with a large edge density, many of them are expected to contain smaller subgraphs that are in fact complete: Each node in A has a link to every node in B. Using "pruning" algorithms (A variant of Enhanced Backtracking substantially) (pruning) you can enumerate all the subgraphs of the Web on a Standard PC in about three days of execution.

6 Other applications of structure mining

The main objective of mining structure is to extract previously unknown relationships between Web pages. This data mining structure allows the use of a company to link information from its own website to allow information from navigation and group in sitemaps. For their part, users obtain the possibility of accessing the desired information through keyword association and content mining.

On the WWW, the use of structure mining allows the determination of the similar structure of web pages of clustering through identification of the underlying structure. This information can be used to project the similarities of web content and use it to maintain or improve the information on a site to allow access to web spiders.

Structure mining can also be applied to the business world. It can be very useful to determine the connection between two or more business websites and analyze how we can differentiate them to make it more attractive This information is useful for mapping companies that compete through third-party links, Your The purpose will be to analyze the data so that users first access the links that interest the company and give more benefits- Another aspect of structure mining in this field is the improvement of the navigation of these pages through their relationships and link hierarchy of the websites, quite interesting to encourage the user access our website.

Improving Web page navigation on business sites indirectly improves the search engine. search, This stronger connection allows the generation of traffic to a business site to provide results that are more productive. This navigation improvement attracts robots to the right locations provide the requested information, proving to be most beneficial in clicks to a particular site.

As a conclusion, web mining and the use of structure mining can provide results strategic for the marketing of a website for the production of the sale. The traffic most directed to web pages of a particular site increases the level of visits returning to the site and remembering search engines about the information or product provided by the company. This also allows marketing strategies provide highlights that are more productive through page navigation with links to the page main site itself. To really leverage your website as a web mining business tool of structure becomes a necessity.

7 Related research areas

How well we have been Seeing throughout this course, Structure Mining is within the global term of Structure Mining. Web. which is made up of three types of mining:

Web Content Mining (WCM, Web Content Mining).
Web Structure Mining (WSM. Web Structure Mining).
Web Usage Mining (WUM. Web Usage Mining).

In summary, WSM extracts the structure of hyperlinks, that is, how the documents are structured with respect to the others (interdocumentary structure as opposed to the intradocumentary structure of WCM). The structure is represents as a graph of links on a website or between websites. WSM reveals more information than information contained in documents: for example, links pointing to a document may indicate the popularity or importance of a document. while outgoing links indicate the richness or variety of the themes it contains. This leads us to a hierarchical organization by themes that can be inferred directly from the linking patterns. It is even possible not to specify the documents by keywords, but by exemplary documents.

8 International Conferences

Some of the international conferences that address the topic of web mining and that specifically address the topic of structure web mining are the following:

International World Wide Web Conference (IW3C2).
Asia-Pacific Web Conference (APWeb 2008)
International Conference on Internet and Web Engineering
International Workshop on Web Mining for E-commerce and E-services (WMEE 2008) I
International Conference on Web Intelligence, Mining and Semantics
International Conference on Web-based Learning (ICWL 2010)
Web Information Systems Engineering (WISE 2008)
Call for book chapter: Web Mining Applications in E-commerce and E-services (April 2008)
International Conference on Web Information Systems and Mining (WISM2010)