
Web mining and web usage mining: logs, patterns and personalization

Web mining applies data-mining techniques to pages, links and user behavior. In practice, the most useful part is often web usage mining: turning logs and clickstream data into patterns that improve navigation, recommendations, marketing decisions and adaptive websites.

Content mining

Extracts topics, entities and signals from text, images and page metadata.

Structure mining

Analyzes links as a graph to detect hubs, authorities, communities and page importance.

Usage mining

Studies logs, sessions and click paths to discover how users actually move through a site.

Personalization

Uses discovered patterns to adapt menus, recommendations, content and conversion paths.

Practical pipeline

Start by cleaning logs and identifying users, sessions and page views. Then discover patterns with statistics, association rules, clustering, classification or sequential models. Finally, validate whether those patterns improve a real objective such as findability, retention, conversion or support cost.
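As an illustration of the first step, here is a minimal Python sketch that parses one log line in the Apache combined format into typed fields. The sample line is invented and the regular expression is an assumption for illustration, not a complete parser.

```python
import re
from datetime import datetime

# Invented sample line in the Apache "combined" log format.
LOG_LINE = ('203.0.113.7 - - [10/Oct/2023:13:55:36 +0000] '
            '"GET /products/42 HTTP/1.1" 200 2326 "-" "Mozilla/5.0"')

LOG_PATTERN = re.compile(
    r'(?P<ip>\S+) \S+ \S+ \[(?P<ts>[^\]]+)\] "(?P<method>\S+) (?P<url>\S+) [^"]*" '
    r'(?P<status>\d{3}) \S+ "(?P<referrer>[^"]*)" "(?P<agent>[^"]*)"'
)

def parse_line(line):
    """Turn one raw log line into a dict of typed fields, or None if malformed."""
    m = LOG_PATTERN.match(line)
    if not m:
        return None
    rec = m.groupdict()
    rec["ts"] = datetime.strptime(rec["ts"], "%d/%b/%Y:%H:%M:%S %z")
    rec["status"] = int(rec["status"])
    return rec

print(parse_line(LOG_LINE))
```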

Web usage mining

This project summarizes the conclusions obtained after reading the source articles.

1. Definition and objectives of web usage mining

Web usage mining is the process of applying data mining techniques to discover usage patterns from Web data.

The speed of the Internet makes economic transactions practical, and it has become a key driver in the growth of electronic commerce. These developments make it possible to buy from a store without a salesperson to assist you, or a physical storefront that has to be open.

These sites need to learn, day by day, about the customers and users who browse them. Only in this way can they direct their efforts appropriately toward improving marketing services and personalizing the site. Discovering patterns of activity and behavior related to Web browsing requires data mining algorithms capable of discovering sequential access patterns in log files.

Web Mining evolved from these new needs and was divided into three subgroups:
  • Usage Mining
  • Structure Mining
  • Content Mining
From this division, usage mining was defined as the automatic discovery of user access patterns from Web servers. Organizations collect large amounts of data in their daily operations, generated automatically by their web servers and gathered in server access logs.

There are many web analytics tools with mechanisms for collecting reports on user activity on servers, along with various data filters. However, these tools are designed to monitor traffic on the servers, and in many cases they perform no analysis of the relationships among the data or the files accessed on the server; it is therefore uncommon to see tools that efficiently exploit the information usage mining can provide.

2. Processing stages

2.1 Preprocessing

Preprocessing consists of converting the usage, content and structure information available in the various data sources into the data abstractions needed for pattern discovery. It is divided into three stages.

Usage preprocessing

Usage preprocessing is arguably the most complex task in the Web usage mining process because the available data is often incomplete. Unless a client-side tracking mechanism is used, only the IP address and the user agent are available to identify users and server sessions.

Among the problems encountered are:
  • Single IP address / Multiple server sessions

    Generally, Internet service providers (ISPs) have a group of proxy servers through which their users access the Web. A single proxy server may have many users accessing the same website during the same period.
  • Multiple IP addresses / Single server session:

    Some ISPs or privacy tools randomly assign each user request an IP address from a pool of addresses. In this case a single server session can involve multiple IP addresses.
  • Multiple IP addresses / Single user:

    A user who accesses the web from different machines may have a different IP address for each session. Visit tracking then counts that unique user as more than one visitor.
  • Multiple Agents/Single User:

    Likewise, a user who uses more than one browser, even on the same machine, will appear as multiple users.
Assuming that each user has been identified (through cookies, logins, or agent and IP analysis), the click-stream (the trail of clicks a user leaves on a website) for each user must be divided into sessions. The hardest part of this task is determining when a user has left the site. To solve this problem we can use the server logs (which record what content is being served at any given time) or state variables for each active session (which store the information needed to determine what content the user has viewed).
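To make the sessionization step concrete, here is a minimal Python sketch that splits one identified user's requests into sessions using an inactivity timeout. The 30-minute threshold is a common heuristic, not something prescribed by the article.

```python
from datetime import timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # common heuristic; the value is an assumption

def sessionize(requests):
    """Split one user's time-ordered requests into sessions.

    `requests` is a list of (timestamp, url) pairs for a single identified user;
    a gap longer than SESSION_TIMEOUT starts a new session.
    """
    sessions, current = [], []
    last_ts = None
    for ts, url in sorted(requests):
        if last_ts is not None and ts - last_ts > SESSION_TIMEOUT:
            sessions.append(current)
            current = []
        current.append(url)
        last_ts = ts
    if current:
        sessions.append(current)
    return sessions
```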

Content preprocessing

Content preprocessing consists of converting text, images and other multimedia files into structures that are useful for the Web usage mining process. Classification or clustering methods are used to organize this information and to filter the input or output of the pattern discovery algorithms. For example, the results of a classification algorithm can be used to restrict discovered patterns to page views about a certain topic or class of products. Additionally, page views can be classified or clustered based not only on topic but also on intended use, since a page may convey information, collect user information, enable navigation, or some combination of these uses.

Of course, before analyzing all this information it must be converted into a quantifiable format such as a vector space. A very common case illustrates the process: text files can be broken down into word vectors to which the necessary calculations are applied, and in the case of graphics we can keep a textual description in order to evaluate their content.
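As a tiny sketch of this conversion, the snippet below turns a text into a term-frequency word vector using plain Python; the tokenization rule is an assumption for illustration.

```python
from collections import Counter
import re

def to_word_vector(text):
    """Represent a document as a sparse term-frequency vector (a dict)."""
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(tokens)

doc = "Discounted winter boots and winter coats"
print(to_word_vector(doc))  # Counter({'winter': 2, 'discounted': 1, ...})
```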

Web pages themselves pose the biggest problem. On the one hand, static pages can be preprocessed easily by HTML parsing at low cost. The biggest drawback lies in dynamic content servers, where personalization techniques are applied and/or databases are queried to build the different views of a page, so in a given session only a certain fraction of the information is ever accessed. In these cases the vector space has to hold more information, since the content of each page view must be reassembled from an HTTP request by a crawler, or from a combination of templates, scripts and database accesses. If only part of the server's content is preprocessed, the output of any clustering or classification algorithm may be partial.

2.2 Structure preprocessing

The structure of a site is given by the hypertext links between its pages. The structure can be obtained and preprocessed in the same way as the content of a site. Again, dynamic content (and dynamically generated links) poses more problems than static pages, because a different site structure may have to be created for each user session.

2.3 Pattern discovery

Pattern discovery draws on methods and algorithms developed in fields such as statistics, data mining, machine learning and pattern recognition, applied here to Web mining. Of course, methods developed in other fields must be adapted before they apply correctly in our field of study. For example, in association rule discovery, the notion of a transaction in market-basket analysis takes no account of the order in which items are selected; in web usage mining, however, a server session is an ordered sequence of page requests by a user.

2.4 Statistical analysis

This is the most common method for extracting knowledge about the visitors to a website. By analyzing the session files, one can compute different kinds of descriptive statistics (frequency, mean, median...) on variables such as pages visited, visit duration and navigation path. These analyses can also cover low-level errors, detecting unauthorized entry points or invalid URIs. Beyond that, this kind of knowledge can be potentially useful for improving system performance, increasing system security, facilitating site modification tasks and supporting marketing decisions.
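A minimal sketch of such descriptive statistics over hypothetical per-session records (entry page, pages viewed, duration in seconds); the record layout and values are made up for illustration.

```python
from statistics import mean, median
from collections import Counter

# Hypothetical per-session records: (entry page, pages viewed, duration in seconds).
sessions = [("/", 5, 340.0), ("/blog", 2, 45.5), ("/", 8, 612.0)]

durations = [d for _, _, d in sessions]
print("mean duration:", mean(durations))
print("median duration:", median(durations))
print("entry-page frequency:", Counter(entry for entry, _, _ in sessions))
```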

2.5 Association rules

Association rule generation can be used to relate pages that are most often referenced together in a single server session. Applying these rules in the context of Web usage mining, we can discover which pages are accessed together with a support value greater than a specified threshold.

Their benefits are well known in business and marketing applications, but the discovered rules can also help Web designers restructure their site. Association rules can likewise serve as heuristics for prefetching documents, reducing the latency perceived by the user when loading a page from a remote site.
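As a sketch of the idea, the following computes the support of page pairs that co-occur in a session, and the confidence of the corresponding rule. The threshold value and the pair-only form (rather than a full Apriori pass over larger itemsets) are simplifying assumptions.

```python
from collections import Counter
from itertools import combinations

def pair_rules(sessions, min_support=0.01):
    """Support and confidence for page pairs co-occurring in the same session."""
    n = len(sessions)
    page_count, pair_count = Counter(), Counter()
    for session in sessions:
        pages = set(session)            # order is ignored, as in market-basket analysis
        page_count.update(pages)
        pair_count.update(combinations(sorted(pages), 2))
    rules = []
    for (a, b), c in pair_count.items():
        support = c / n
        if support >= min_support:
            rules.append((a, b, support, c / page_count[a]))  # confidence of a -> b
    return rules
```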

2.6 Clustering

Clustering is a technique that groups together a set of objects that have similar characteristics. In the Web Usage domain, there are two types of interesting clusters:
  • Usage clusters

    Their purpose is to group users who exhibit similar browsing patterns, which is useful for market studies.
  • Page clusters

    Their purpose is to group pages with similar content, which is useful for web search engines.
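A minimal usage-clustering sketch using scikit-learn's k-means over made-up per-user page-visit counts; the page list, the counts and the choice of two clusters are illustrative assumptions.

```python
import numpy as np
from sklearn.cluster import KMeans

pages = ["/home", "/products", "/blog", "/support"]
# Each row: one user's visit counts over the pages above (invented data).
X = np.array([
    [9, 7, 0, 1],
    [8, 6, 1, 0],
    [1, 0, 9, 8],
    [0, 1, 8, 9],
])

model = KMeans(n_clusters=2, n_init=10, random_state=0).fit(X)
print(model.labels_)  # e.g. [0 0 1 1]: two groups with similar browsing patterns
```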

Classification

Classification is the task of mapping a data object into one of several predefined classes. Applied to the Web, we are interested in deriving user classes that let us distinguish between objects and analyze behaviors. Classification can be done with supervised inductive learning algorithms such as decision trees.
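As a hedged sketch, the following trains a scikit-learn decision tree on made-up session features (pages viewed, duration in seconds, whether the visit came from a search engine) to separate two user classes; the features, labels and depth limit are illustrative assumptions.

```python
from sklearn.tree import DecisionTreeClassifier

# Invented training data: [pages viewed, duration (s), came_from_search].
X = [[12, 600, 1], [2, 40, 0], [9, 500, 1], [1, 25, 0]]
y = ["buyer", "browser", "buyer", "browser"]

clf = DecisionTreeClassifier(max_depth=3).fit(X, y)
print(clf.predict([[10, 480, 1]]))  # -> ['buyer']
```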

Sequential patterns

Sequential pattern discovery techniques attempt to find inter-session similarities, creating sets of users who visit the same sequence of items in time order. Using this approach, Web marketplaces can display ads targeted at specific user groups. Other temporal analyses that can be run on sequential patterns include trend analysis, change-point detection and similarity analysis.
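A simplified sketch of this idea: counting contiguous click subsequences of a fixed length across sessions. Full sequential pattern mining also allows gaps between items, which this sketch ignores; the length and count thresholds are assumptions.

```python
from collections import Counter

def frequent_paths(sessions, length=3, min_count=2):
    """Count contiguous click subsequences of a given length across sessions."""
    counts = Counter()
    for session in sessions:
        for i in range(len(session) - length + 1):
            counts[tuple(session[i:i + length])] += 1
    return {path: c for path, c in counts.items() if c >= min_count}
```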

Dependency modeling

Dependency modeling is another pattern discovery task in Web mining. The objective is to develop a model capable of representing the significant dependencies among the variables of a Web domain. For example, one may want to model the different states a user passes through while purchasing in an online store, analyzing each of the steps taken to complete a purchase. Many learning techniques can be used to model user navigation, such as Hidden Markov Models (statistical models that assume the system being modeled is a Markov process with unknown parameters) and Bayesian belief networks (probabilistic graphical models that represent a set of random variables and their conditional dependencies through a directed acyclic graph, DAG), both well known in the field. Modeling web usage patterns not only provides a theoretical framework for analyzing user behavior and predicting future web trends; it also allows strategies to be developed to increase sales of the products offered by the website or to improve navigation for users' convenience.
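As a much simpler stand-in for the models named above, the following sketch fits a first-order Markov chain (not a full HMM or Bayesian network) that estimates P(next page | current page) from click paths; the example paths are invented.

```python
from collections import Counter, defaultdict

def transition_model(sessions):
    """First-order Markov model: P(next page | current page) from click paths."""
    counts = defaultdict(Counter)
    for session in sessions:
        for cur, nxt in zip(session, session[1:]):
            counts[cur][nxt] += 1
    return {
        page: {nxt: c / sum(nexts.values()) for nxt, c in nexts.items()}
        for page, nexts in counts.items()
    }

model = transition_model([["/cart", "/checkout", "/paid"], ["/cart", "/home"]])
print(model["/cart"])  # {'/checkout': 0.5, '/home': 0.5}
```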

2.7 Pattern analysis

Pattern analysis is the last step in the web usage mining process, and its purpose is to filter out uninteresting rules or patterns from the set obtained in the pattern discovery phase. The analysis methodology is usually driven by the application for which the web mining is performed. The most common form of pattern analysis is a query mechanism over the knowledge base, such as SQL. Another method is to load the usage data into a data cube in order to run OLAP operations (On-Line Analytical Processing, a Business Intelligence approach whose objective is to speed up the querying of large amounts of data; it uses multidimensional structures, or OLAP cubes, that contain data summarized from large databases or transactional OLTP systems, and it is used in sales and marketing reports, management reporting, data mining and similar areas).
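As a rough stand-in for loading usage data into a data cube, the sketch below uses a pandas pivot table to summarize hypothetical visit records along two dimensions (country and site section); the column names and numbers are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "country": ["ES", "ES", "US", "US"],
    "section": ["/blog", "/shop", "/blog", "/shop"],
    "visits":  [120, 340, 90, 410],
})

# Roll visits up by country and section, OLAP-style.
cube = pd.pivot_table(df, values="visits", index="country",
                      columns="section", aggfunc="sum")
print(cube)
```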

2.8 Some existing tools

There are multiple tools for analyzing access log files, extracting statistics, producing reports (in HTML and text) and even drawing graphs. Among the well-known ones we can mention Webtrends, Getstats, Analog and Microsoft Intersé Market Focus, among others.

Among access statistics generators we can mention 3DstatS, M5 Analyzer, Web Log Explorer and eWebLog Analyzer. Finally, we should mention Web Mining Log Sessionizator XPert, a processing and analysis tool that generates rules for understanding the behavior of a website's visitors.

3. Learning techniques applied to usage mining

Learning techniques are applied in our field to suggest an appropriate link, given a query on a topic and a Web page. In other words, we learn from a knowledge base the following objective function: $LinkQuality \colon Page \times Interest \times Link \rightarrow [0,1)$. The link quality value is interpreted as the probability that a user selects a link, given the page they are currently on and the interest generated by the topic under discussion.

There are mainly three learning approaches:
  • Use previous visits as a source of information to augment the internal representation of each selected hyperlink.
  • Rely on reinforcement learning: the idea is to find paths through the Web that maximize the amount of relevant information found along the trajectory.
  • Combine both previous approaches.

3.1 Learning from Previous Guides

Learning is achieved by annotating each hyperlink with the interest shown by users who clicked on it during previous visits. Thus, every time a user follows a hyperlink, the hyperlink's description is augmented with the keywords that the user typed at the beginning of their visit. To suggest hyperlinks during a visit, those that best match the user's current information search are chosen.

The metric used to measure the similarity between a user's interest state and a hyperlink description is based on a technique from information retrieval. The connection is analyzed using feature vectors with many dimensions, each of which represents a particular word of the English language. Each component of the vector is computed using tf-idf (term frequency - inverse document frequency), a numerical measure that expresses how relevant a word is to a document in a collection, often used as a weighting factor in information retrieval and text mining. These vectors are then compared with the cosine measure, and the resulting value is interpreted as the similarity.

Of course, each hyperlink has to be analyzed with this procedure to choose the appropriate words to describe it. The link quality value for each hyperlink is estimated as the mean similarity over the k (normally 5) highest-ranked matches among all the options. The maximum number of hyperlinks suggested per page is three.
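A minimal sketch of this ranking with scikit-learn: tf-idf vectors over hyperlink descriptions, cosine similarity against the user's interest, and the three best links suggested, as the text prescribes. The descriptions and the query are made up for illustration.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

link_descriptions = [
    "smart home automation sensors",
    "holiday travel deals flights",
    "machine learning tutorials smart agents",
    "sports news football scores",
]
user_interest = ["smart agents and machine learning"]

vectorizer = TfidfVectorizer()
link_vecs = vectorizer.fit_transform(link_descriptions)
interest_vec = vectorizer.transform(user_interest)

scores = cosine_similarity(interest_vec, link_vecs)[0]
top3 = scores.argsort()[::-1][:3]   # at most three links suggested per page
print([(link_descriptions[i], round(scores[i], 2)) for i in top3])
```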

3.2 Learning from Hypertext Structures

Here we describe a second learning method, one that favors hyperlinks related to the most common topics, using an approach based on reinforcement learning.

Reinforcement learning

Reinforcement learning allows Web agents to learn control strategies that select optimal actions in given configurations. For example, consider an agent navigating from state to state through actions performed by a user. In each state $s$ the agent receives a reward $R(s)$. The value of an action can be expressed in terms of an evaluation function $Q(s,a)$, defined for all possible state-action pairs. If the agent can evaluate this function, it can apply it in each state and obtain highly optimized search results. More precisely, we can write the following equation: $Q(s_{t},a)=\sum_{i=0}^{\infty}\gamma^{i}R(s_{t+1+i})$, where $s_{t}$ is the state the agent is in at time $t$, and $\gamma$ is the discount factor, $0 \leq \gamma < 1$, which determines how rewards received in the future are discounted. Under certain conditions, the function $Q$ can be approximated iteratively by repeatedly updating the estimate of $Q(s,a)$, as shown in the following formula: $Q_{n+1}(s,a)=R(s')+\gamma \max_{a' \in s'} Q_{n}(s',a')$, where $s'$ is the state resulting from performing action $a$ in state $s$.

Figure 1: Example of learning from hypertext structures

In Figure 1, the boxes represent possible states of the agent. The edges represent actions that take the agent from one state to another, and they carry the weights obtained from the function Q(s,a). A reward R = 1 is assigned to the target document of the query.

Reinforcement Learning and Hypertext

Imagine a web agent searching for pages about the word "smart". For this agent, states correspond to web pages and actions correspond to hyperlinks. In this case, we define the value R(s, "smart") for a particular page s as the result of applying the tf-idf measure to s for the word "smart". The agent then learns a function Q(s, a, "smart") to choose the best link on each page and thus guide the user.
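As a sketch of the iterative approximation above, here is a tabular Q-value update in Python, with pages as states and links as actions; the learning rate ALPHA is an assumption (the update in the text replaces the estimate directly, without a rate), and the reward would come from a tf-idf score as described.

```python
from collections import defaultdict

GAMMA = 0.9   # discount factor, 0 <= gamma < 1, as in the formula above
ALPHA = 0.5   # learning rate; an assumption, the text's update uses no explicit rate

Q = defaultdict(float)  # Q[(page, link)] -> estimated link value

def q_update(page, link, reward, next_page, next_links):
    """Move Q(s, a) toward R(s') + gamma * max over a' in s' of Q(s', a')."""
    best_next = max((Q[(next_page, l)] for l in next_links), default=0.0)
    target = reward + GAMMA * best_next
    Q[(page, link)] += ALPHA * (target - Q[(page, link)])
```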

4. Adaptive websites

4.1 Definition and objectives

An adaptive website is one that adapts its content to the type of user who accesses it, with the aim of providing the information they are looking for. Furthermore, the website must be able to improve its organization and presentation automatically by learning from users' access patterns. To meet this definition, our site will have to be able to:
  • Allow a user from the same website to perform different searches.
  • Avoid links to incorrect or outdated information.
  • Allow all possible access methods and be adapted to new needs.
To create our adaptive website we can use two different approaches.

4.2 Approaches

Sites can be made adaptive in two main ways.
  • The site can focus on personalization

    Modifying web pages in real time to meet the needs of individual users.
  • The site can focus on optimization

    Configuring the site to make navigation easier for everyone.

Personalization

Personalization means adjusting the site's presentation to each individual user. Typically it is implemented by letting the user set manual preferences. Another option is route prediction, an agent-based improvement that consists of placing the links most interesting to the user in the most visible places.

Optimization

Optimization attempts to improve the site as a whole. Instead of making changes for each user, the site learns from all users to become easier to navigate. We can view a particular web design as one point in the space of possible designs, and take as a quality metric the amount of effort a user must make to find what they want on our page. In this way we can choose which point in that space best serves our purpose.

Meta-Information

A website's ability to adapt can be hampered by how little knowledge of its content and structure the HTML code provides. To solve this problem, websites were given the ability to adapt through meta-information: information about their content, structure and organization.

One way to provide meta-information is to represent the site's contents in a formal framework with semantic precision like a database or a semantic network.

5. Related research areas

Web Mining is generally divided into three areas: Content Web Mining, Structure Web Mining and Usage Web Mining (which is what we have discussed in this work).

Web Content Mining is an automatic process that goes beyond keyword extraction: the data is analyzed to generate information from the documents found on the Web, whether articles, audiovisual material, HTML documents or others. The Web at large can reveal more information than is contained in the documents themselves; for example, the links pointing to a document indicate its popularity, while the links leaving a document indicate the richness, or perhaps the variety, of the topics it covers. This can be compared to bibliographic citation analysis.

Alongside it is Structure Web Mining, which consists of studying the structure of the links. And finally Usage Web Mining, an automatic process for discovering patterns of access to and use of Web services, focused on how users behave when they interact with the Web.

6. International conferences

  • International Conference on Database Theory (ICDT)
  • International Conference on Very Large Data Bases (VLDB)
  • International World Wide Web Conference
  • National Conference on Artificial Intelligence (AAAI-98)
  • International Conference on Machine Learning (ICML)
  • International Conference on Distributed Computing Systems
  • European Conference on Machine Learning (ECML-98)
  • International Conference on Knowledge Discovery and Data Mining
  • International Computer Software and Applications Conference (COMPSAC)