Web scraping · Data quality · NoSQL

Web scraping and data quality: responsible extraction and validation

Web scraping is useful when data is public but not available through an API. The valuable part is not only extracting pages: it is doing it responsibly, validating quality and storing the result in a structure that can be analyzed.

Before scraping

Check APIs, sitemaps, robots rules, terms of use and update frequency.

During scraping

Use delays, retries, caching and limits so the target service is not degraded.

After scraping

Validate accuracy, completeness, consistency, timeliness, uniqueness and validity (format, type and range).

Storage

Use document, graph or wide-column databases when the extracted data is varied or connected.

Data quality and Web Scraping.

Why is web scraping necessary?

The internet holds content that is worth retrieving, storing and analyzing in order to discover new information, make it easier to understand or give it a new interpretation. Some websites do offer structured access and downloads through APIs or web services, but not all the information available online can be obtained that way.

When the owner of a website provides no such tool, web scraping is an alternative way to extract the information.

Web scraping consists of building an agent that automatically downloads, analyzes and organizes data coming from the internet. With this technique we can write a script that performs a series of repetitive tasks and stores the information of interest in a structured way. This is much more efficient than doing the work manually: it speeds up the process and avoids copy-and-paste errors.
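As a minimal sketch of the "analyze and organize" step, assuming the page has already been downloaded, the following uses Python's standard html.parser module to turn raw HTML into structured records. The sample HTML, the HeadlineParser class and the decision to collect links inside h2 elements are illustrative assumptions, not part of any particular site.

```python
from html.parser import HTMLParser


class HeadlineParser(HTMLParser):
    """Collects the text and href of every <a> found inside an <h2>."""

    def __init__(self):
        super().__init__()
        self.in_h2 = False
        self.current_href = None
        self.records = []

    def handle_starttag(self, tag, attrs):
        if tag == "h2":
            self.in_h2 = True
        elif tag == "a" and self.in_h2:
            self.current_href = dict(attrs).get("href")

    def handle_endtag(self, tag):
        if tag == "h2":
            self.in_h2 = False
        elif tag == "a":
            self.current_href = None

    def handle_data(self, data):
        text = data.strip()
        # Only keep text that belongs to a link inside a headline.
        if self.current_href and text:
            self.records.append({"title": text, "url": self.current_href})


def extract_headlines(html):
    """Return a list of {"title", "url"} dicts extracted from the HTML."""
    parser = HeadlineParser()
    parser.feed(html)
    return parser.records


# Hypothetical page content, standing in for a downloaded document.
SAMPLE_HTML = """
<html><body>
  <h2><a href="/news/1">First headline</a></h2>
  <p><a href="/about">About us</a></p>
  <h2><a href="/news/2">Second headline</a></h2>
</body></html>
"""

print(extract_headlines(SAMPLE_HTML))
```

The output is a list of dictionaries, one per headline, which can be written to a file or a database without any manual copying.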

Why is it useful to look at the website map?

The sitemap helps us locate updated content without having to crawl every page of the site. Its main purpose is to let search engines, such as Google or Bing, crawl the website more easily.

Search engines usually index any properly designed small or medium website correctly, but crawling larger sites is more complex, especially when they are updated frequently. A scraper faces the same problem, which is why the sitemap is a useful starting point.

Likewise, the sitemap can allow users to navigate more comfortably through the site.
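The idea of locating updated content from the sitemap can be sketched as follows, assuming the site publishes a standard XML sitemap with optional lastmod entries. The function name urls_modified_since, the sample URLs and the last-crawl date are hypothetical; only the standard library is used.

```python
import xml.etree.ElementTree as ET
from datetime import date

# Namespace defined by the sitemaps.org protocol.
SITEMAP_NS = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}


def urls_modified_since(sitemap_xml, last_crawl):
    """Return the URLs whose lastmod date is later than last_crawl.

    Entries without a lastmod are included, because there is no way
    to tell from the sitemap whether they changed.
    """
    root = ET.fromstring(sitemap_xml)
    fresh = []
    for entry in root.findall("sm:url", SITEMAP_NS):
        loc = entry.findtext("sm:loc", default="", namespaces=SITEMAP_NS).strip()
        lastmod = entry.findtext("sm:lastmod", default="", namespaces=SITEMAP_NS).strip()
        if not loc:
            continue
        # lastmod may be a date or a full timestamp; the first 10
        # characters are the YYYY-MM-DD part in both cases.
        if not lastmod or date.fromisoformat(lastmod[:10]) > last_crawl:
            fresh.append(loc)
    return fresh


# Hypothetical sitemap content, standing in for a downloaded sitemap.xml.
SAMPLE_SITEMAP = """<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url><loc>https://example.com/a</loc><lastmod>2024-01-10</lastmod></url>
  <url><loc>https://example.com/b</loc><lastmod>2024-03-05T08:30:00+00:00</lastmod></url>
  <url><loc>https://example.com/c</loc></url>
</urlset>"""

print(urls_modified_since(SAMPLE_SITEMAP, date(2024, 2, 1)))
# ['https://example.com/b', 'https://example.com/c']
```

Only the pages returned by this function need to be requested again, which reduces the load on the target server.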

How to avoid saturating the server with web requests

To avoid saturating a server with requests, an exponentially increasing delay (exponential backoff) can be introduced between consecutive attempts when a request fails. Instead of resending the same request almost instantly, the scraper waits longer after each failure, which gives the web server the opportunity to recover.

Another widely used method consists of measuring how long each request takes to complete and then adding a delay proportional to that response time. In this way, if the site starts to slow down and requests take longer to receive a response, the waiting time between requests adjusts automatically.
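A minimal sketch of exponential backoff, under some assumptions: the fetch argument is any callable that downloads a URL and raises an exception on failure, and the parameter names (max_retries, base_delay, max_delay, jitter) are illustrative choices. A real scraper would probably catch only specific errors, such as timeouts or HTTP 429 and 503 responses.

```python
import random
import time


def fetch_with_backoff(fetch, url, max_retries=5, base_delay=1.0,
                       max_delay=60.0, jitter=0.1, sleep=time.sleep):
    """Call fetch(url), retrying with exponentially growing delays.

    After the n-th failure (n starting at 0) the wait is
    base_delay * 2**n seconds, capped at max_delay, plus a small
    random jitter so that parallel scrapers do not retry in lockstep.
    """
    for attempt in range(max_retries + 1):
        try:
            return fetch(url)
        except Exception:
            # Assumption: every exception is treated as retryable.
            if attempt == max_retries:
                raise
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay * jitter))


# Usage with a hypothetical fetcher that fails twice before succeeding.
calls = {"n": 0}


def flaky_fetch(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionError("temporary failure")
    return f"<html>content of {url}</html>"


waits = []  # record the delays instead of actually sleeping
page = fetch_with_backoff(flaky_fetch, "https://example.com/page",
                          jitter=0.0, sleep=waits.append)
print(waits)  # [1.0, 2.0]
```

Passing sleep as a parameter keeps the example fast to run and easy to test; in production the default time.sleep would be used.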
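The latency-proportional delay can be sketched like this. The AdaptiveThrottle class, the polite_fetch helper and all parameter values are assumptions made for illustration; the smoothing factor simply averages recent response times so that a single slow request does not dominate.

```python
import time


class AdaptiveThrottle:
    """Chooses a delay proportional to the recent average response time."""

    def __init__(self, factor=2.0, min_delay=0.5, max_delay=30.0, smoothing=0.3):
        self.factor = factor        # delay = factor * average latency
        self.min_delay = min_delay  # never wait less than this
        self.max_delay = max_delay  # never wait more than this
        self.smoothing = smoothing  # weight given to the newest sample
        self.avg_latency = None

    def record(self, latency):
        """Update the moving average with a newly measured latency."""
        if self.avg_latency is None:
            self.avg_latency = latency
        else:
            self.avg_latency = (self.smoothing * latency
                                + (1 - self.smoothing) * self.avg_latency)

    def delay(self):
        """Return how many seconds to wait before the next request."""
        if self.avg_latency is None:
            return self.min_delay
        wanted = self.factor * self.avg_latency
        return max(self.min_delay, min(self.max_delay, wanted))


def polite_fetch(fetch, url, throttle, sleep=time.sleep, clock=time.monotonic):
    """Wait according to the throttle, fetch the URL and record its latency."""
    sleep(throttle.delay())
    start = clock()
    response = fetch(url)
    throttle.record(clock() - start)
    return response


# Simulated response times: the site slows down, so the delay grows.
throttle = AdaptiveThrottle(smoothing=0.5)
for latency in (1.0, 3.0, 100.0):
    throttle.record(latency)
    print(latency, "->", throttle.delay())
# 1.0 -> 2.0
# 3.0 -> 4.0
# 100.0 -> 30.0
```

The cap given by max_delay keeps the scraper from stalling indefinitely when a single response is extremely slow.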

Different factors that influence data quality.

  • Accuracy: the degree to which the data correctly describes the real-world object or event.

    Example: a real case reported by the newspaper El País, NASA's loss of the Mars Climate Orbiter. The spacecraft was lost on arrival at Mars in 1999 because of a confusion between unit systems: one piece of ground software reported thruster data in US customary units (pound-force seconds), while the navigation software expected metric units (newton-seconds). The data passed the system validation phase because the values stayed within range, but they were not accurate.
  • Completeness: the proportion of data actually stored compared with what a 100% complete data set would contain.

    Example: percentage of customers with the NIF/CIF field filled in within the database of an electric company.
  • Consistency: it is the absence of differences when comparing two or more representations of one thing with its definition.

    Example: the classification of legal entities or natural persons in a policy stored in an ERP. If the policy stores a CIF, it is a company, and if the policy stores a DNI or NIE, it is a natural person.
  • Timeliness: the degree to which data represents reality from a required point in time.

    Example: the activation or cancellation date of a contract is a required point in time.
  • Uniqueness: it ensures that nothing is recorded more than once.

    Example: no two policies share the same identification number and the same customer.
  • Validity: it is defined as conformity with the predefined syntax, that is, format, type and range.

    Example: a country selector. The user of an application must choose their country from a predefined list, so the stored value is always one of the allowed values.
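Some of these dimensions can be measured directly on scraped records. The following is a small sketch for completeness, uniqueness and validity only, assuming records are dictionaries; the field names (policy_id, customer, tax_id, country, start_date), the allowed country codes and the date format are hypothetical and loosely follow the examples above.

```python
import re
from collections import Counter

# Hypothetical validity rules: a closed list of country codes and ISO dates.
ALLOWED_COUNTRIES = {"ES", "FR", "PT"}
DATE_PATTERN = re.compile(r"^\d{4}-\d{2}-\d{2}$")


def completeness(records, field):
    """Fraction of records in which the field is present and non-empty."""
    if not records:
        return 0.0
    filled = sum(1 for r in records if r.get(field) not in (None, ""))
    return filled / len(records)


def duplicate_keys(records, key_fields):
    """Key tuples that appear more than once (uniqueness violations)."""
    counts = Counter(tuple(r.get(f) for f in key_fields) for r in records)
    return [key for key, n in counts.items() if n > 1]


def invalid_records(records):
    """policy_id of every record that breaks a format or range rule."""
    bad = []
    for r in records:
        country_ok = r.get("country") in ALLOWED_COUNTRIES
        date_ok = bool(DATE_PATTERN.match(r.get("start_date") or ""))
        if not (country_ok and date_ok):
            bad.append(r.get("policy_id"))
    return bad


records = [
    {"policy_id": "P-1", "customer": "C-10", "tax_id": "B12345678",
     "country": "ES", "start_date": "2023-01-15"},
    {"policy_id": "P-2", "customer": "C-11", "tax_id": "",
     "country": "FR", "start_date": "2023-02-01"},
    {"policy_id": "P-2", "customer": "C-11", "tax_id": "12345678Z",
     "country": "FR", "start_date": "2023-02-01"},
    {"policy_id": "P-3", "customer": "C-12", "tax_id": "X1234567L",
     "country": "Spain", "start_date": "15/03/2023"},
]

print(completeness(records, "tax_id"))                    # 0.75
print(duplicate_keys(records, ["policy_id", "customer"]))  # [('P-2', 'C-11')]
print(invalid_records(records))                           # ['P-3']
```

Accuracy, consistency and timeliness usually need a reference source or business rules, so they are harder to check with generic code like this.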

What types of databases do not require knowing in advance the data to be stored? Three examples of these databases.

Non-relational (NoSQL) databases: they are typically chosen when the information to be stored is too varied or too complex to fit comfortably in fixed tables. Unlike relational databases, they do not require the structure of the data to be defined in advance, because they are more flexible and can store records whose fields differ from one record to the next. Since there is no single fixed schema that every record must follow, data is often stored in JSON-like formats.
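To illustrate what storing data without a predefined structure means, here is a minimal in-memory sketch that does not use any real database. The collection is a plain Python list, and the find helper and the sample items are hypothetical; a document database offers the same idea with persistence and indexing.

```python
import json

# Scraped items with different shapes: no common set of columns.
collection = [
    {"type": "article", "url": "https://example.com/news/1",
     "title": "First headline", "tags": ["energy", "markets"]},
    {"type": "product", "url": "https://example.com/shop/42",
     "name": "Desk lamp", "price": {"amount": 19.9, "currency": "EUR"}},
    {"type": "article", "url": "https://example.com/news/2",
     "title": "Second headline"},
]


def find(docs, **criteria):
    """Return the documents whose fields match every given criterion.

    A document that lacks a field simply does not match; no schema
    change is needed to query a field that only some documents have.
    """
    return [d for d in docs if all(d.get(k) == v for k, v in criteria.items())]


def field_names(docs):
    """Map each document type to the set of fields seen for that type."""
    seen = {}
    for d in docs:
        seen.setdefault(d.get("type"), set()).update(d.keys())
    return seen


print(len(find(collection, type="article")))  # 2
print(sorted(field_names(collection)["product"]))
# ['name', 'price', 'type', 'url']

# Each document serializes to JSON independently of the others.
print(json.dumps(collection[1], sort_keys=True))
```

Adding a new kind of scraped item only requires appending a dictionary with whatever fields it has.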

Three examples are:
  • MongoDB is a document-oriented NoSQL database system whose source code is publicly available (under the Server Side Public License since 2018). Instead of storing data in tables, as relational databases do, MongoDB stores BSON data structures, a binary format similar to JSON, with a dynamic schema, which makes data integration in certain applications easier and faster.

    It has been released for Windows, GNU/Linux, OS X (now macOS) and Solaris.
  • Neo4j is a graph-oriented database implemented in Java. Its developers describe it as an embedded, disk-based, fully transactional persistence engine that stores structured data in graphs instead of tables. It is distributed under a dual model: an open-source Community Edition under the GNU General Public License version 3 and an Enterprise Edition under a commercial license (older Enterprise releases were offered under the Affero General Public License version 3).
  • Apache Cassandra is an open-source, distributed NoSQL database written in Java and based on a wide-column (partitioned row) storage model. It is designed to handle large volumes of data spread across many machines, and its main goals are linear scalability and high availability. Its architecture is a set of equal nodes that communicate through a peer-to-peer protocol, so there is no single point of failure. It is developed by the Apache Software Foundation.

    The Cassandra data model consists of rows organized into tables and distributed across partitions. The primary key of each table has a first component, the partition key, which determines the partition a row belongs to. Within a partition, rows are ordered by the remaining columns of the key, called clustering columns. The other columns can be indexed separately from the primary key.

    Tables can be created, deleted and altered at runtime without blocking updates and queries.
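The partition key and clustering column idea described for Cassandra can be imitated in a few lines of plain Python. This is only a conceptual sketch, not Cassandra's API: the PartitionedTable class and the site and fetched_at columns are hypothetical names chosen for a table of scraped pages.

```python
from collections import defaultdict


class PartitionedTable:
    """Toy model of a wide-column table: partition key + clustering column."""

    def __init__(self, partition_key, clustering_key):
        self.partition_key = partition_key
        self.clustering_key = clustering_key
        # partition value -> {clustering value -> row}
        self.partitions = defaultdict(dict)

    def insert(self, row):
        """Store a row; the same primary key overwrites the old row."""
        p = row[self.partition_key]
        c = row[self.clustering_key]
        self.partitions[p][c] = row

    def read_partition(self, partition_value):
        """Return the rows of one partition, ordered by clustering column."""
        part = self.partitions.get(partition_value, {})
        return [part[c] for c in sorted(part)]


pages = PartitionedTable(partition_key="site", clustering_key="fetched_at")
pages.insert({"site": "example.com", "fetched_at": "2024-03-02", "status": 200})
pages.insert({"site": "example.com", "fetched_at": "2024-03-01", "status": 503})
pages.insert({"site": "example.org", "fetched_at": "2024-03-01", "status": 200})
# Same primary key as the first row: the row is replaced, not duplicated.
pages.insert({"site": "example.com", "fetched_at": "2024-03-02", "status": 404})

for row in pages.read_partition("example.com"):
    print(row["fetched_at"], row["status"])
# 2024-03-01 503
# 2024-03-02 404
```

Reading all the fetches of one site touches a single partition and returns them already ordered by date, which is the access pattern this kind of model is designed for.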