2. Processing stages
2.1 Preprocessing
Preprocessing converts the usage, content, and structure information obtained from the available data sources into the data abstractions required for pattern discovery. It is divided into three stages.
Usage preprocessing
Usage preprocessing is arguably the most complex task in the Web Usage Mining process because the available data is often incomplete. Unless a client-side tracking mechanism is used, only the IP address and the user agent are available to identify users and server sessions.
Among the problems encountered are:
- Single IP address / multiple server sessions: Internet Service Providers (ISPs) typically route their users' traffic through a pool of proxy servers, so a single proxy server can have many users accessing a website in the same period.
- Multiple IP addresses / single server session: some ISPs and privacy tools assign each user request an IP address from a pool, so a single server session can span multiple IP addresses.
- Multiple IP addresses / single user: a user accessing the web from different machines may have a different IP address for each session, so visit tracking counts that unique user as more than one visitor.
- Multiple agents / single user: likewise, a user browsing with more than one browser, even on the same machine, may appear as multiple users.
Assuming that each user has been identified (through cookies, logins, or agent and IP analysis), the click-stream (the trail of clicks that a user leaves on a website) of each user must be divided into sessions. The main difficulty in this task is determining when a user has left the site. To address it we can use the server logs (which record what content is being served at any given time) or keep a state variable for each active session (which stores the information needed to determine what content the user has viewed).
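In the absence of explicit session boundaries, a common heuristic is an inactivity timeout: a gap between two clicks longer than some threshold (30 minutes is a typical choice) is taken to mean the user left the site. The following sketch, with hypothetical log entries of the form (user_id, timestamp, url), illustrates the idea:

```python
from datetime import datetime, timedelta

SESSION_TIMEOUT = timedelta(minutes=30)  # assumed inactivity threshold

def sessionize(clicks, timeout=SESSION_TIMEOUT):
    """Split each user's click-stream into sessions using an inactivity timeout."""
    clicks = sorted(clicks, key=lambda c: (c[0], c[1]))  # order by user, then time
    sessions, current = [], []
    for user, ts, url in clicks:
        # Start a new session when the user changes or the gap exceeds the timeout.
        if current and (user != current[-1][0] or ts - current[-1][1] > timeout):
            sessions.append(current)
            current = []
        current.append((user, ts, url))
    if current:
        sessions.append(current)
    return sessions

log = [
    ("u1", datetime(2024, 1, 1, 10, 0), "/home"),
    ("u1", datetime(2024, 1, 1, 10, 5), "/products"),
    ("u1", datetime(2024, 1, 1, 11, 30), "/home"),   # gap > 30 min: new session
    ("u2", datetime(2024, 1, 1, 10, 2), "/home"),
]
print(len(sessionize(log)))  # 3
```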
Content preprocessing
Content preprocessing consists of converting text, images, and other multimedia files into structures that are useful for the Web Usage Mining process. Classification or clustering methods are used to organize the information and to filter the input or output of the pattern discovery algorithms. For example, the results of a classification algorithm can be used to restrict the discovered patterns to page views about a certain topic or class of products. Additionally, we can classify or cluster page views not only by topic but also by intended use, since a page view may serve to convey information, collect user information, enable navigation, or some combination of these uses.
Of course, before analyzing all this information it must be converted into a quantifiable format such as a vector space. Consider a very common case to understand this process better: text files can be broken down into word vectors on which the necessary calculations are applied, and in the case of graphics we can keep a textual description with which to evaluate their content.
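As a minimal sketch of this conversion (with a hypothetical vocabulary), a document can be represented by its term counts over a fixed vocabulary, and two documents compared by cosine similarity:

```python
from collections import Counter

def word_vector(text, vocabulary):
    """Represent a document as term counts over a fixed vocabulary."""
    counts = Counter(text.lower().split())
    return [counts.get(term, 0) for term in vocabulary]

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors; 1.0 means identical direction."""
    dot = sum(x * y for x, y in zip(a, b))
    na = sum(x * x for x in a) ** 0.5
    nb = sum(y * y for y in b) ** 0.5
    return dot / (na * nb) if na and nb else 0.0

vocab = ["price", "camera", "review", "shipping"]  # assumed vocabulary
v1 = word_vector("camera review and camera price", vocab)
v2 = word_vector("camera shipping price", vocab)
print(v1)                                   # [1, 2, 1, 0]
print(round(cosine_similarity(v1, v2), 3))  # 0.707
```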
Web pages themselves pose the biggest problem. On the one hand, static pages can be preprocessed easily by parsing their HTML at modest cost. The main difficulty lies in dynamic content servers, where personalization techniques are applied and/or databases are queried to build the different views of a page, so in any one session only a certain fraction of the information is accessible. In these cases the vector space must store more information, since the content of each page view must be assembled from an HTTP request issued by a crawler, or from a combination of templates, scripts, and database accesses. If only part of the server's content is accessible for preprocessing, the output of any clustering or classification algorithm may be incomplete.
2.2 Structure Preprocessing
The structure of a site is defined by the hypertext links between its pages. It can be obtained and preprocessed in the same way as the site's content. Again, dynamically generated content (and links) poses more problems than static pages, since a different site structure may have to be built for each user session.
2.3 Pattern inference (pattern discovery)
Pattern discovery draws on methods and algorithms developed in many fields such as statistics, data mining, machine learning, and pattern recognition, applied here to Web mining. Of course, methods developed in other fields must be adapted to apply correctly to our field of study. For example, in association rule discovery, the notion of a transaction in market-basket analysis takes no account of the order in which items are selected, whereas in Web Usage Mining a server session is an ordered sequence of page requests by a user.
2.4 Statistical analysis
This is the most common method of extracting knowledge about the visitors to a website. By analyzing the session files, one can compute various descriptive statistics (frequency, mean, median, ...) on variables such as pages visited, visit duration, and navigation path. These analyses can also cover low-level errors, detecting unauthorized entry points or finding invalid URIs. This kind of knowledge can be potentially useful for improving system performance, increasing system security, facilitating site modification tasks, and supporting marketing decisions.
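Such descriptive statistics follow directly from the sessionized data; a small sketch over hypothetical per-session records (pages viewed and duration in seconds) might look like this:

```python
import statistics

# Hypothetical per-session records produced by usage preprocessing.
sessions = [
    {"pages": 5, "duration": 320},
    {"pages": 2, "duration": 45},
    {"pages": 8, "duration": 610},
    {"pages": 3, "duration": 120},
]

durations = [s["duration"] for s in sessions]
pages = [s["pages"] for s in sessions]

print("mean duration:", statistics.mean(durations))      # 273.75
print("median duration:", statistics.median(durations))  # 220.0
print("mean pages/session:", statistics.mean(pages))     # 4.5
```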
2.5 Association rules
Association rule generation can be used to relate the pages that are most often referenced together within a single server session. Applied in the Web Usage Mining context, these rules reveal the sets of pages that are accessed together with a support value exceeding a specified threshold.
Their benefits are well known in business and marketing applications, but such rules can also help Web designers restructure a Web site. Association rules can likewise serve as a heuristic for prefetching documents, reducing the latency a user perceives when loading a page from a remote site.
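The support computation at the heart of this step can be sketched as follows: count how often each pair of pages co-occurs in a session and keep the pairs whose support (fraction of sessions containing both) exceeds the threshold. The sessions and threshold below are hypothetical:

```python
from itertools import combinations
from collections import Counter

def frequent_pairs(sessions, min_support):
    """Keep page pairs whose co-occurrence support meets the threshold."""
    n = len(sessions)
    counts = Counter()
    for pages in sessions:
        # Deduplicate and sort so each unordered pair is counted once per session.
        for pair in combinations(sorted(set(pages)), 2):
            counts[pair] += 1
    return {pair: c / n for pair, c in counts.items() if c / n >= min_support}

sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products"],
    ["/home", "/about"],
    ["/products", "/cart"],
]
rules = frequent_pairs(sessions, min_support=0.5)
print(rules)  # {('/cart', '/products'): 0.5, ('/home', '/products'): 0.5}
```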
2.6 Clustering
Clustering is a technique that groups together a set of objects with similar characteristics. In the Web Usage domain, two types of clusters are of interest:
- Usage clusters: their purpose is to group users who exhibit similar browsing patterns. They are useful for market studies.
- Page clusters: their purpose is to group pages that exhibit similar content. They are useful for web search engines.
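As an illustration of usage clustering, each user can be represented by a vector of visit counts per page and grouped with k-means. This is a minimal dependency-free sketch with hypothetical data and a fixed (naive) choice of initial centroids:

```python
def kmeans(vectors, k, init, iterations=10):
    """Minimal k-means: group vectors into k clusters around initial centroids."""
    centroids = [list(c) for c in init]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for v in vectors:
            # Assign each vector to the nearest centroid (squared Euclidean distance).
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(v, centroids[c])),
            )
            clusters[nearest].append(v)
        # Recompute centroids as cluster means; keep old centroid if cluster is empty.
        centroids = [
            [sum(col) / len(cl) for col in zip(*cl)] if cl else centroids[i]
            for i, cl in enumerate(clusters)
        ]
    return clusters

# Hypothetical visit counts per user for the pages [/home, /products, /support].
usage = [[5, 9, 0], [4, 8, 1], [6, 10, 0], [1, 0, 7], [0, 1, 8]]
clusters = kmeans(usage, k=2, init=[usage[0], usage[3]])
print([len(c) for c in clusters])  # [3, 2]
```

The result separates product-oriented visitors from support-oriented ones.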
Classification
Classification is the task of mapping an object, described by its features, into one of a set of predefined classes. Applied to the Web, we are interested in deriving user classes that let us distinguish between different objects and analyze behaviors. Classification can be performed with supervised inductive learning algorithms such as decision tree classifiers.
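The text mentions decision trees; as a simpler, dependency-free stand-in that still illustrates supervised classification of user sessions, here is a nearest-class-mean sketch. The features and class labels are hypothetical:

```python
def class_means(training):
    """Compute the mean feature vector of each class from labeled examples."""
    sums, counts = {}, {}
    for vec, label in training:
        s = sums.setdefault(label, [0.0] * len(vec))
        for i, x in enumerate(vec):
            s[i] += x
        counts[label] = counts.get(label, 0) + 1
    return {lbl: [x / counts[lbl] for x in s] for lbl, s in sums.items()}

def classify(vec, means):
    """Assign vec to the class whose mean is nearest (squared Euclidean distance)."""
    return min(means, key=lambda lbl: sum((a - b) ** 2 for a, b in zip(vec, means[lbl])))

# Hypothetical features per session: [pages viewed, minutes on site].
training = [
    ([12, 30], "buyer"), ([10, 25], "buyer"),
    ([2, 1], "browser"), ([3, 2], "browser"),
]
means = class_means(training)
print(classify([11, 28], means))  # buyer
print(classify([2, 3], means))    # browser
```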
Sequential patterns
Sequential pattern discovery techniques attempt to find inter-session similarities, identifying sets of users who visit the same set of objects in the same time order. Using this approach, Web marketplaces can display advertisements targeted at specific user groups. Other temporal analyses that can be run on sequential patterns include trend analysis, change-point detection, and similarity analysis.
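Unlike the unordered pairs used for association rules, here order matters. A minimal sketch (with hypothetical sessions) counts contiguous ordered page subsequences and keeps the frequent ones:

```python
from collections import Counter

def frequent_sequences(sessions, length=2, min_count=2):
    """Count ordered, contiguous page subsequences and keep the frequent ones."""
    counts = Counter()
    for pages in sessions:
        for i in range(len(pages) - length + 1):
            counts[tuple(pages[i:i + length])] += 1
    return {seq: c for seq, c in counts.items() if c >= min_count}

sessions = [
    ["/home", "/products", "/cart", "/checkout"],
    ["/home", "/products", "/cart"],
    ["/home", "/about"],
]
print(frequent_sequences(sessions))
# {('/home', '/products'): 2, ('/products', '/cart'): 2}
```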
Modeling dependency
Dependency modeling is another pattern discovery task in Web Mining. The objective is to develop a model capable of representing the significant dependencies among the variables of a Web domain. For example, one may be interested in building a model of the different states a user passes through while purchasing in an online store, analyzing each of the steps taken to complete a purchase. Many learning techniques can be used to model user navigation; two well known in the field are Hidden Markov Models (statistical models in which the system being modeled is assumed to be a Markov process with unknown parameters) and Bayesian belief networks (probabilistic graphical models that represent a set of random variables and their conditional dependencies through a directed acyclic graph, or DAG). Modeling web usage patterns not only provides a theoretical framework for analyzing user behavior and predicting future web trends, but also supports strategies to increase sales of the products a website offers or to improve navigation for the convenience of its users.
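The simplest member of this family is a first-order Markov model of page navigation, in which the probability of the next page depends only on the current one. Transition probabilities can be estimated from sessionized data (hypothetical here) by counting consecutive page pairs:

```python
from collections import defaultdict

def transition_probabilities(sessions):
    """Estimate first-order Markov transition probabilities between pages."""
    counts = defaultdict(lambda: defaultdict(int))
    for pages in sessions:
        for src, dst in zip(pages, pages[1:]):
            counts[src][dst] += 1
    probs = {}
    for src, dsts in counts.items():
        total = sum(dsts.values())
        probs[src] = {dst: c / total for dst, c in dsts.items()}
    return probs

sessions = [
    ["/home", "/products", "/cart"],
    ["/home", "/products", "/home"],
    ["/home", "/about"],
]
p = transition_probabilities(sessions)
print({dst: round(pr, 2) for dst, pr in p["/home"].items()})
# {'/products': 0.67, '/about': 0.33}
```

A full Hidden Markov Model adds unobserved states on top of such transitions, but the estimation idea is the same.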
2.7 Pattern analysis
Pattern analysis is the last step in the Web Usage Mining process. Its purpose is to filter out uninteresting rules or patterns from the set obtained in the pattern discovery phase. The analysis methodology is usually driven by the application for which the web mining is performed. The most common form of pattern analysis is a query mechanism over the knowledge base, such as SQL. Another method is to load the usage data into a data cube in order to run OLAP (On-Line Analytical Processing) operations. OLAP is an approach used in the field of Business Intelligence whose objective is to speed up queries over large amounts of data; to do this, it uses multidimensional structures (OLAP cubes) containing data summarized from large databases or transactional (OLTP) systems, and it is used in sales and marketing reports, management reporting, data mining, and similar areas.
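The SQL-based variant can be sketched with an in-memory SQLite database: discovered rules are stored in a table (the rules and thresholds below are hypothetical), and pattern analysis reduces to queries filtering by support and confidence:

```python
import sqlite3

# Hypothetical table of association rules produced by pattern discovery.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE rules (antecedent TEXT, consequent TEXT, support REAL, confidence REAL)"
)
conn.executemany(
    "INSERT INTO rules VALUES (?, ?, ?, ?)",
    [
        ("/home", "/products", 0.50, 0.80),
        ("/products", "/cart", 0.35, 0.60),
        ("/home", "/about", 0.10, 0.20),
    ],
)
# Keep only the rules that clear both thresholds.
interesting = conn.execute(
    "SELECT antecedent, consequent FROM rules "
    "WHERE support >= 0.3 AND confidence >= 0.5 ORDER BY support DESC"
).fetchall()
print(interesting)  # [('/home', '/products'), ('/products', '/cart')]
```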
2.8 Some existing tools
There are many tools for analyzing access log files, extracting statistics, producing reports (in HTML and plain text), and even generating graphs. Among them we can mention some well-known ones such as Webtrends, Getstats, Analog, and Microsoft Intersé Market Focus.
Among access statistics generators we can mention 3DstatS, M5 Analyzer, Web Log Explorer, and eWebLog Analyzer. Finally, we should mention Web Mining Log Sessionizator XPert, a preprocessing and analysis tool that allows rules to be generated to understand the behavior of a website's visitors.