NoSQL · Data models · SQL comparison

NoSQL basics: models, SQL comparison and when to use each one

NoSQL is not a single database type and it is not a generic replacement for SQL. It is a family of data models designed for cases where the relational model is not the most comfortable fit: flexible documents, very large distributed datasets, fast key lookups or highly connected relationships.

Quick answer

Use a relational database when the domain depends on strong consistency, joins, tabular reporting and a stable schema. Consider NoSQL when the access pattern, scale or data shape fits one of its models better: key-value, document, column-family or graph.

The four NoSQL models at a glance

Key-value

Stores a value behind a key. Useful for sessions, caches, counters and very fast direct lookups.

Document

Stores JSON-like documents. Useful when the application reads and writes complete aggregates.

Column-family

Optimized for large distributed writes and queries designed around known access patterns.

Graph

Designed for relationships. Useful for recommendations, networks, fraud paths and dependency analysis.

SQL vs NoSQL: the practical decision

QuestionUsually SQLUsually NoSQL
Data shapeStable tables and well-defined relations.Flexible structures, aggregates or relationship-heavy graphs.
QueriesMany joins, reporting and ad hoc analysis.Known access patterns designed into the model.
ConsistencyStrong transactional guarantees are central.Availability, partitioning or scale may be prioritized.

Common mistake

The mistake is not choosing SQL or NoSQL. The mistake is choosing a database before knowing how the application will read, write and evolve its data. In NoSQL, modeling starts from the queries and the aggregate boundaries, not only from the entities.

1. Comparison between SQL and NoSQL Databases and Their Relationship with Big Data Analytics

Read the article "Comparison between SQL and NoSQL Databases and Their Relationship with Big Data Analytics'' [1] I have responded the following questions:
  • What do the authors think about what would be a suitable storage solution for a blog?

    They believe that NoSql databases can be 85% faster than those with blog management systems SQL. This is because NoSQL systems give more importance to accessibility than to consistency of the information and can therefore better manage binary documents such as videos and images.
  • Is it interesting to use NoSQL databases in the development of mobile applications for data management? data?

    In some cases yes. According to the text, Upsert update operations have been shown to be simpler and faster in NoSQL databases than in SQL ones. It is true that the use of cloud computing and NoSQL improves the performance of mobile platforms, especially in the data layer but we will also need more space storage and searches on "large" NoSQL databases take longer than on RDBMSs.
  • How are links between nodes represented in graph-oriented databases?

    Graphically with the edges and computationally with an adjacency method without index, that is, it is created with each node a direct point that points towards the different neighboring nodes.
  • What are the main reasons for moving from using a relational persistence system to a NoSQL system?

    The main reason you would need to migrate to NoSQL databases is the need to store a large amount of data. Other important reasons are the scalability of these systems, the performance they offer with large volumes of data and the versatility to work with data of various types.
  • What can you say comparing NoSQL databases with relational databases regarding maintenance, integrity and scalability?

    Relational databases have better data integrity due to the defined database structure. information and easy maintenance. Now, it is not scalable, on the contrary, it has many limitations techniques in this sense.

    NoSQL databases are the complete opposite, as they do not have a defined structure in the data and manage Large volumes of information maintenance and integrity are more difficult to manage. On the other hand, the Systems are easily scalable.

2. NoSQL Distilled

From reading the book NoSQL Distilled (Download) I have answered the following statements:
  • A common characteristic of all NoSQL databases is that they are oriented to run on a cluster.

    False.

    However, not all NoSQL databases are strongly oriented towards running on clusters. Graph databases are one style of NoSQL databases that use a distribution model similar to relational databases but offer a different data model that makes it better at handling data with complex relationships.

    Chapter 1.5. The Emergence of NoSQL [2]

    The cited text clearly states that not all NoSQL databases are strongly oriented towards execution in clusters. Graph-oriented databases are a variety of NoSQL that use a similar distribution model. to relational databases, but they offer a different model for the purpose of managing data with relationships complex. Neo4j is an example of NoSQL databases not oriented to run on a cluster but oriented to simplify data interpretation.

  • It is said that relational databases do not present the problem of impedance (impedance mismatch) because the way to represent information in the relational model and in the data structures used in memory is identical and no translation process is required between both models.

    False.

    Relational databases provide many advantages, but they are by no means perfect.[...]For application developers, the biggest frustration has been what’s commonly called the impedance mismatch: the difference between the relational model and the in-memory data structures.The relational data model organizes data into a structure of tables and rows, or more properly, relations and tuples. In the relational model, a tuple is a set of name-value pairs and a relation is a set of tuples. (The relational definition of a tuple is slightly different from that in mathematics andmany programming languages with a tuple data type, where a tuple is a sequence of values.) Alloperations in SQL consume and return relations, which leads to the mathematically elegant relational algebra. This foundation on relations provides a certain elegance and simplicity, but it also introduces limitations. In particular, the values in a relational tuple have to be simple—they cannot contain any structure, such as a nested record or a list. This limitation isn't true for in-memory data structures, which can take on much richer structures than relations. As a result, if you want to use a richer inmemory data structure, you have to translate it to a relational representation to store it on disk.

    Chapter 1.2. Impedance Mismatch [2]

    As already stated, the statement is false. Relational databases are not perfect, there are limitations such as the impedance problem. In the commented databases a relational tuple must be simple: cannot contain any structure, such as a nested record or a list. This limitation is not true for in-memory data structures, which can assume much richer structures than relationships. As a result, if If you want to use a richer in-memory data structure, you must translate it to a relational representation to store it on disk.

    That is, relational databases present the problem of impedance (impedance mismatch).

  • If you clearly know how the data will be queried, it is better to use a model that ignore aggregations, that is, a graph or relational model.

    False.

    Relational databases have no concept of aggregate within their data model, so we call them aggregate-ignorant. in the NoSQL world, graph databases are also aggregate-ignorant. Being aggregate-ignorant is not a bad thing. It's often difficult to draw aggregate boundaries well, particularly if the same data is used in many different contexts. An order makes a good aggregate when a customer is making and reviewing orders, and when the retailer is processing orders. However, if a retailer wants to analyze its product sales over the last few months, then an order aggregate it becomes a trouble. To get to product sales history, you'll have to dig into every aggregate in the database. So an aggregate structure may help with some data interactions but be an obstacle for others. An aggregate-ignorant model allows you to easily look at the data in different ways, so it is a better choice when you don't have a primary structure for manipulating your data

    Chapter 2.1.2.Consequences of Aggregate Orientation. [2]

    As the cited text indicates, relational and graph-oriented databases do not have a concept of aggregation within your data model. This feature allows them to easily view data from different ways, for which is a better option when you don't have a main structure to manipulate your data.

    That said, by clearly knowing the structure in which the data is going to be queried, the best option would be a added model.

  • The use of aggregates makes it easier for databases to operate on clusters since the concept of aggregate it becomes a natural unit for replication and distribution.

    True.

    Dealing in aggregates makes it much easier for these databases to handle operations on a cluster, since the aggregate makes a natural unit for replication and sharding. Aggregates are also often easier for application programmers to work with, since they often manipulate data through aggregate structures.

    Chapter 2.1. Aggregates. [2]

    As the text says, the aggregates make it much easier to operate in a cluster, since this structure It constitutes a natural unit of replication and fragmentation.

  • Because relational databases have a fixed data schema, the problem of sparse tables - those that have many null columns or that have columns that do not have meaning).

    False.

    As well as handling changes, a schemaless store also makes it easier to deal with nonuniform data: data where each record has a different set of fields. A schema puts all rows of a table into a straightjacket, which becomes awkward if you have different kinds of data in different rows. You either end up with lots of columns that are usually null (a sparse table), or you end up with meaningless columns like custom column 4. Schemalessness avoids this, allowing each record to contain just what it needs—no more, no less.

    Chapter 3.3. Schemaless Databases [2]

    Referring to relational databases, the study document warns us that these systems must fill all rows in a table, which is awkward if you have different types of data in different rows. Either you end up with many columns that are usually null (sparse table) or you end up with columns without sense. Schemalessness avoids this, allowing each record to contain just what it needs, no more, no more. less.

    The conclusion is clear, the statement is false.

  • Graph-based databases make traversing relationships during query operations not so expensive, since the cost of the navigation calculation mainly impacts during the insertion of the data.

    True.

    Once you have built up a graph of nodes and edges, a graph database allows you to query that network with query operations designed with this kind of graph in mind. This is where the important differences between graph and relational databases come in. Although relational databases can implement relationships using foreign keys, the joins required to navigate around can get quite expensive — which means performance is often poor for highly connected data models. Graph databases make traversal along the relationships very cheap. A large part of this is because graph databases shift most of the work of navigating relationships from query time to insert time. This naturally pays off for situations where querying performance is more important than insert speed.

    Chapter 3.2. Graph Databases. [2]

    Once a graph with its nodes and edges has been created, we can consult its structure using algorithms based on graphs. These systems can quickly navigate the relationships between the different nodes because They are very optimized for it. Its internal structure penalizes the insertion of new elements in favor of allowing faster routes over the network, that is, query performance takes precedence over insertion.

    Therefore, the statement is correct.

3. Example of Aggregates.

A supermarket's home delivery service needs to manage the information about the orders it receives. The Supermarket managers show interest in resolving the following queries:

  • Know the purchases made and the customers who made them, for each month and type of product. In particular, They want to obtain, for each month and product category, the customers (ID, name and surname) who purchased products from that category, as well as, for each customer, the list of products they purchased from that category. For each product, we want to obtain its identifier and the quantity purchased.

    1 Representation 1

    An aggregate is proposed that represents the information by month. Each aggregate will be composed of a month field, year and the different categories. For its part, categories will be composed of catname (a text or number that will allow the category to be identified, since, without being specified in the statement, I have considered it useful to add this field) and the addition of customers.

    Continuing with the analysis of the structure, the clients will be composed of the fields ID, name and surnames and the different products purchased. The product aggregate is composed of an id and the quantity of products purchased.
  • Know for each delivery area, and depending on their delivery date (year, month and day), the orders served. For Each order is interested in obtaining its identifier, the data of the client who placed the order (ID, name and surnames) and the list of purchased products. For each product you want to obtain its Identifier and the quantity purchased.

    2 Representation 2

    An aggregate is proposed that represents the information by zones. Each zone will be composed of a zone name field as a descriptive element and the different delivery dates. Each date aggregate will be composed of a field day, month, year and the different orders. For its part, the orders will be composed of id, the customer aggregate and product aggregates.

    Continuing with the analysis of the structure, the clients will be composed of the fields ID, name and surnames.For each product we will obtain an id and the number of items delivered.

4. IEEE/ACM Transactions on Computational Biology and Bioinformatics

  • Data persistence issue that has been resolved.

    The chosen application case has been a research published in IEEE/ACM Transactions on Computational Biology and Bioinformatics the year 2012 with the title A fast ranking algorithm for predicting gene functions in biomolecular networks [3]

    I really found it an interesting example due to the importance given to the organization of the data and how to benefit from it. It is a scientific article that describes a new technique for classification of biomolecules from the graph arrangement of its components. The algorithms proposed for this purpose, called ranking algorithms, use graphs loaded entirely in main memory (using data structures such as matrices or adjacency lists) so that they can later catalog the genes not identified, according to the degree of class membership of the genes with which it is related. These methods provide us with a probability value regarding the membership relationship that exists between a certain gene and a class.

    In the article, they have proposed a new approach where, instead of using these arrays loaded into databases traditional, propose the use of kernels with graphs and thus be able to exploit the direct relationships and indirect interactions between genes, taking into account the general topology of the network. Its classification method is fast and scalable, since no model learning is required, but only a calculation of scores, approx. linear to the number of genes.

    To do this, the text has compared its approach with several state-of-the-art genetic classification methods. [4] , through the integration of multiple sources of biomolecular data in the context of a genetic classification problem in the model organism of the yeast.
  • Reasons why the use of a relational database is not recommended.

    According to the text, research carried out using high-performance biomolecular technologies highlighted that most biological functions are based on complex relationships between numerous components biomolecules such as proteins, DNA, RNA and many other small molecules.

    Therefore, the prediction of genetic characteristics in different species or the gene-disease association, the number of genes that must be taken into consideration, and consequently the size of the graph with which we work, grows exponentially. For this reason, classical ranking algorithms become computationally very expensive or even inapplicable and, therefore, new algorithms have to be developed that allow you climb better information by keeping it in secondary storage or distributing the workload between several machines.

    Furthermore, if we analyze the current languages or the computers on which we can apply the algorithms, only They can handle an array of limited size. Another characteristic of these matrices is that, by their nature, they are They are dispersed and it is difficult for an element to have arcs with all the other elements of the graph. For what Therefore, this leads to the matrix memorizing a large number of values equal to 0 that are not useful for the final calculation.

    Finally, relational databases are very difficult to manage and require an advanced level to apply complex algorithms. On the other hand, databases like Neo4J simplify things a lot and allow technicians initiated in computing to be able to perform tasks more easily.
  • Reasons why the database that has been used as solution.

    The first and most important justification is the ease of management and consultation that a database offers us. of graph-oriented data. If we look for different representations of the problem explained in the document, we We find that it is easier to understand visually a graph than a table, therefore, it will be easier implement new algorithms or techniques on this system

    The second reason is performance. If we remember, a graph-oriented database has a cost of very high insertion but, once the graph has been generated, the cost per query is low.

    Finally, space, storing an adjacency matrix in a graph greatly reduces the amount of memory necessary.

    With all this, it has been possible to obtain a better result in the classification of biomolecules than the algorithms prior to publication. The essays suggest that the strategies proposed in the text could improve the current genetic classification techniques, reduce their computational costs and improve their performance and efficiencies.

Bibliography

  • 1.Titulo
    Comparison between SQL and NoSQL databases and their relationship with big data analytics
    Autor
    Ali, Wajid and Shafique, Muhammad Usman and Majeed, Muhammad Arslan and Raza, Ali
    Publicacion
    Asian Journal of Research in Computer Science
    Url
  • 2.Titulo
    NoSQL distilled: a brief guide to the emerging world of polyglot persistence
    Autor
    Sadalage, Pramod J and Fowler, Martin
    Publicacion
    Url
  • 3.Titulo
    A fast ranking algorithm for predicting gene functions in biomolecular networks
    Autor
    Re, Matteo and Mesiti, Marco and Valentini, Giorgio
    Publicacion
    IEEE/ACM Transactions on Computational Biology and Bioinformatics
    Url