Evaluation of the performance of different NoSQL databases using a ranking function for gene identification.

Historical benchmark: how to interpret it today

The measurements in this report compare specific versions, hardware, schemas and client code. They do not prove that one database is universally faster. Titan is a legacy reference whose project lineage continued through JanusGraph, while Neo4j, MongoDB and Cassandra have changed substantially since these tests.

A current benchmark must pin server and driver versions, warm-up and measure repeatedly, verify equivalent indexes and durability settings, monitor CPU, memory, disk and network, and publish the dataset and query mix. The database should be selected from access patterns, consistency and operations requirements before comparing latency.

In recent years web applications such as Facebook or YouTube have revolutionized the world of computing due to to the large amounts of information that they manage efficiently and quickly. This fact has led to The companies in charge of managing these Databases have developed various tools to respond to new needs, such as Nosql Databases. The main features of these systems is their operation without large resource needs and with the ability to manage a very high amount of information (the more information stored there, the better its performance and the more efficient will be its management).

With the aim of making a comparison between the best-known systems currently on the market (Neo4j, Titan Cassandra, MongoDB, Cassandra), we have decided to apply a Ranking function used to analyze genes and See which systems offer us the best benefits. To do this, we have created different gene databases with different amounts of connections, classes and genes and, in this way, you can see which system performs the calculations faster in different situations.

Thanks to this, we have come to the conclusion that currently for systems with few resources and not distributed, We will obtain better results with those NoSql databases managed by Neo4J, a system aimed at graphics and with a native language that is very easy to use.

In this report, the activities carried out for the achievement of the project and what tools have been used.

Keywords

Databases, Free Software, NoSql, MongoDB, Cassandra, Titan, Java.

How to read this long-form NoSQL comparison

This page is intentionally extensive because it documents a complete academic project: context, graph model, implementation strategy and performance results. The most useful reading path is to start with the database comparison, then review the graph-ranking workload and finally interpret the benchmark results according to the shape of the data.

If you need a shorter conceptual entry point, read it together with basic NoSQL models, MongoDB updated and Neo4j graph queries.

1. Introduction

The first chapter documents the context, the student's motivation, and the proposed objectives. at the beginning of the project, as well as the work environment and the structure of the entire memory.

1.1. Context

In today's society we have millions of examples where we handle large amounts of information, from web applications such as Facebook, Twitter, Instagram or banks with millions and millions of daily transactions. In all these examples, the rapid handling of information and the search for data in their Servers pose a problem because of their large volume and, as a consequence, any operation becomes very complex. To better understand the magnitude of this problem, there are several estimates that state, for example, that there will be an amount of information stored and processed by 2020 of 35,000 Exabytes (1 Exabyte is 1024 Petabytes, and one Petabyte 1024 Terabytes), that the annual Internet traffic is probably close to 8 Exabytes or the content (what we write, tweet, our photos, etc.) can be approximately 500 Exabytes (mmdrigal.com 2014).

As we can read, it represents a real challenge for computing to the point of creating a new field for treat these problems. Researchers use a new term, “Big Data” to refer to situations when faced with databases with gigantic amounts of information and have to solve problems in three very different fields: volume, speed and variety.

Volume: companies manage an increasing amount of data of all types, easily accumulating Terabytes, even petabytes, of information.
Speed: sometimes 2 minutes is too late. In processes in which time counts, such as For example, detecting fraud, large volumes of data must be used in real time to maximize the value.
Variety: Big data includes any type of data, structured and unstructured, such as text, sensor data, audio, video, click sequences or log files, among others. At By analyzing this data together you will find new information.

Related to Big Data, one of the areas where most advances are being applied is in biology and, more exactly, in the prediction of characteristics in genes and proteins. This field is essential for understand molecules and, with this, better understand human physiology, the reasons that cause a disease or the differences that exist between different species. These new discoveries have allowed the creation, identification and validate new drugs, develop new diagnostic systems or find new transgenic foods that They have improved the quality of human life and longevity.

To achieve these advances, various methods have been used to predict the characteristics and functions genetic characteristics that genes have and the relationships between them through graphs. This is because the Gene networks are essentially graphs where the genes are the vertices, the relationships between them represent the arcs and each gene has a certain degree of membership in a particular class.

1.2. Motivation

The algorithms proposed for this purpose, called ranking algorithms, use graphs loaded entirely into main memory (using data structures such as arrays or adjacency lists) so that they can later catalog the unidentified genes, according to the degree of belonging to a class of genes with which it is related. These methods provide us with a value of probability regarding the membership relationship that exists between a certain gene and a class. Among all the various classification techniques found in the bibliography, in this project, the Score has been used Function based on kernels [Re et al, 2010] that is based on identifying the characteristics of a gene using as basis of information relating to neighboring genes, that is, the vertices that can be reached through the arcs that protrude from the starting vertex.

In the new contexts that arise to apply these methods, such as feature prediction genetics in different species or the gene-disease association, the number of genes that should be taken into account consideration, and consequently the size of the graph with which we work, grows exponentially. For this reason, classical ranking algorithms become computationally very expensive or even inapplicable and therefore Therefore, we have to develop new algorithms that allow you to better scale the information while keeping it in a secondary storage or distributing the workload across multiple machines. Furthermore, if we analyze the Current languages or computers on which we can apply the algorithms can only handle a matrix of limited in size and will rarely exceed 50,000 elements, i.e. at most 2,500,000,000 values to maintain in memory. Another characteristic of these matrices is that due to their nature they are dispersed and it is difficult that an element has arcs with all the other elements of the graph. Therefore, this leads to the matrix memorize a large number of values equal to 0 that are not useful for the final calculation.

All these new needs have led to an increase in research related to the field of management of massive information, that is, more and more attempts were made to solve the problem of analyzing the “Big Data”. In response to this problem, the NoSQL movement was born in 2009, a group of organizations that propose Innovative systems for managing data sets much larger than those commonly managed by through relational systems. They also propose effective techniques for working with large structures of graphs, and distribute the workload between several machines in a simple and intuitive way.

The basis of some NoSQL systems is the possibility of using the MapReduce paradigm [Ghemawat et al, 2004], implemented in the Hadoop library. This technique consists of dividing the problem into a Map function and a another from Reduce, so the Map function converts the lines read from the input file into pairs key-value and the internal logic process this pair, gather all the key values and offer the result to the Reduce function that processes the whole.

Among the systems that are part of the NoSQL movement we also include those that manage modeled structures with a graph, such as Neo4J or Titan Cassandra, that is, easily represented with nodes and associations between them nodes. This representation is useful in many modern application cases (e.g. Facebook) since both Data management systems abandon property relationships in favor of a graph-based structure. Other systems that are part of NoSQL systems are, for example, MongoDB where we store the database structures. data in JSON-type documents with a dynamic schema (MongoDB calls that format BSON), making the integration of data in certain applications easier and faster; o Cassandra, a distributed NoSQL database, based on a column storage model and written in Java that has as its main characteristic a great ability to manage large volumes of data in a distributed manner.

1.3. Goals

Returning to the main problem, in this final year project we are going to carry out a search for information about "Big Data", apply the knowledge acquired to develop techniques for solve the problem of predicting genetic characteristics when the size of the graphs they manage is of high dimensions. Following the spirit of NoSQL databases we are going to use machines with normal returns (personal computers that we can find in stores and, therefore, a quantity relatively small amount of main memory) to verify to what extent it is possible to analyze gene networks in order to predict the genetic characteristics in them. The ultimate goal is to develop new approaches which will be scalable in the reference applied context. Also, we want to empirically evaluate the complexity of the approaches developed in order to allow the application of classification algorithms to large dimensional graphs.

Therefore, to obtain these results it is necessary to rethink the algorithms developed for a representation of the graph matrix in algorithms that work directly on the graph. Furthermore, it is necessary develop algorithms that work locally on a node, rather than end-to-end. This allows to obtain techniques that are more scalable and the ability to parallelize the calculation process.

In order to achieve the objectives set in the thesis we have decided to implement the different algorithms with the following NoSQL systems: Neo4J, Titan Cassandra, Cassandra and MongoDB. Each will provide The methods have a different philosophy since their specifications are quite diverse among them. The first of them, Neo4J is a database management system totally oriented to the graph model, Titan Cassandra allows the management of graphs but using a column representation oriented to this purpose, Cassandra represents the graphs through column management and finally MongoDB will use document management structured, which can be used to represent graphs. All these systems allow management in a single machine or on a network.

From the approach provided by the methods of prediction of genetic characteristics and functions based on kernel-based functions [King et al, 2012], which is especially fast in main memory, We have carried out different implementations in the NoSQL systems previously explained. After that we have built different benchmarks to evaluate and compare the performance of implementations on a single machine.

To verify the accuracy of the results provided in the tests we have compared the values obtained in our implementations with the results obtained in an R implementation provided by the Professor Marco Mesiti.

Finally, all the applications that we are going to implement will be characterized by the following requirements:

Performance:
Stability:
Portability:

1.4. Description of the general project environment

The project has been developed in two environments different. The first 4 months in a space set up in the University's Database Laboratory degli Studi di Milano as part of an Erasmus exchange and the rest of the course has been alternating between home of the student and some teaching laboratories of the Universidade Positivo de Curitiba (Brazil). In order to check that the project was being developed correctly, a series of weekly interviews were established in which presented the work done to the tutor, in addition to raising possible doubts or discussing some decisions of implementation. When the Erasmus scholarship ended and the exchange with Curitiba began, the weekly meetings were They were done via Skype or email.

The computer on which the entire project was developed was an Asus UX31 with an Intel Core I 5 processor. (2.2 GHz) with 4Gb of RAM. The operating system used was Windows 7 and the development tools They are free software. Later in the chapter [5] More details will be given about the development environment.

1.5. Document structure

This document reflects the project development process. We begin the project exposing the conclusions that we have made in the initial bibliographic research on NoSQL databases. In the first section, presented as NoSQL Research, the conclusions are presented. obtained about the problem of Big Data for database management. It is divided into different subtasks main ones and describes the four database systems used: Neo4J, MongoDB, Titan Cassandra, Cassandra.

The second section presents the ranking methods analyzed from the literature, which algorithms they use. each one and defines in detail the kernel-based Score function applied in the project. Next, it is exposed the concept of cross-validation to evaluate the quality of the ranking.

In the third section, with all the initial research completed, the project planning is explained, defining tasks, their duration and estimated completion dates.

In the fourth section we are going to present the tools used to carry out this project and why of this election.

The fifth section contains how we have built the different database managers and how we have applied the Average Score in each of them. We will begin by explaining in general the Average Score algorithm and cross-validation, and then specify a little more the implementation differences between Neo4J, Titan Cassandra, MongoDB and Cassandra.

The fifth chapter explains two very important subtasks. The first we will explain how we have verified the results obtained by the Average Score functions for each of the systems and in the second we will explain how much time has been necessary to perform the calculations for different circumstances.

Finally, the final conclusions of the work will be presented in the last chapter.

2. Project planning

In this section we are going to comment on the different tasks that have been carried out to carry out the project, as well as a Gantt chart where you can see both the temporal duration of each of them, as well as the different origins that exist between the tasks.

2.1. Definition of tasks

Below is They describe what tasks have been necessary to achieve the project. The project can be divided into five major blocks: definition of requirements, learning about Nosql databases, learning about genetic prediction functions, project implementation and results obtained in different examples.

Definition of system requirements

Investigation of initial requirements
Definition of functional requirements

Research on BigData and the NoSql movement

Big Data and the NoSql movement
Get familiar with Neo4j
Getting to know Titan Cassandra
Get familiar with MongoDb
Get to know Cassandra

Genetic prediction functions

The methods used in reading
Kernel-based ranking function
Cross validation techniques

Analysis and development of the different databases used in the project

Analyze and Design the Score function based on Kernels
Database implementation in Neo4j
Database implementation in Titan Cassandra
Database implementation in MongoDb
Database implementation in Cassandra
Check Results
Results

Conclusions
Project documentation

In the figure [1] you can see the graphic representation of the decomposition of tasks in the form of WBS (Work Breakdown Structure) or WBS (Work Breakdown Structure). How the project is developed by a single person, you cannot save time by calculating the dependencies between tasks and executing them in parallel, so it will be done sequentially from left to right as seen in the diagram, except initial resource research tasks and basic documentation tasks to be performed in parallel.

2.2. Gantt chart

The following Gantt chart (figure [2] ) allows view planned durations, start and end dates, and precedence structure at the start of the project development. The work schedule contemplated in the diagram is: Monday to Friday from 9:00 a.m. to 1:00 p.m. and 3:00 p.m. to 16:30.

Although this planning was well defined, the actual development of the project has been diverted due to not having some factors that have delayed it. On the one hand, factors such as exam periods, holidays, etc. are factors typical that slightly delay the project. Apart from these problems, the student had two experiences of exchange. One in Milan for four months and another in Brazil for 7 months, which led to some delays due to legislative procedures, apartment searches, theft of suitcases, etc. that prevented planning from being carried out planned.

3.Research on Big Data and the NoSQL movement

Every day humanity produces an increasing amount of information,

(2.5 quintillones de bytes que son 25x 10^{18}

byte . Suffice it to say that 90% of the information created in the last decades has been produced these last two years. Where do you get all this information from? Virtually everywhere; In today's technological world, You can get information from anything. Whether we are talking about a sensor network, our latest photo uploaded to Instagram or the last message written on Facebook. In this context, the concept of Big Data arises, That is, the term that researchers use to represent the set of information that a organization wants to analyze, but due to its size, analyzing it is very complex.

The chapter is organized as follows. In the section [3.1] It introduces a little to the history and concept of Big Data. We explain how the problem evolved and how the NoSQL (Not Only SQL) movement proposed a solution, developing new alternatives to the relational DBMS that existed until now. Among these new database management systems of data, we have explained the basics of: Neo4J in the section [3.2] , Titan Cassandra in section [3.3] , MongoDB in section [3.4] and Cassandra in the section [3.5] . All these systems are characterized by allowing information to be managed organized in graphs, therefore, they allow applying methods for predicting genetic characteristics to meet the objectives of point 1.3.

3.1. Big Data and the NoSQL movement

If we analyze the recent history of humanity, in 1995 the The number of people in the world who had access to the Internet was practically non-existent, but in recent years 10 years the situation has taken a turn due to an explosion of the network and an exponential increase in users with new needs. According to data in December 2010, 30% of the world population, approximately 2,000 million people, has access to the Internet and, projections indicate that this tremendous rise will skyrocket in very few years due to the emergence of the mobile Internet and Smartphones (currently an estimated number of 1,038 million Smartphones with Internet access in the world). Thus, the amount of information published, viewed or uploaded by Internet users in today's most popular applications reaches figures simply impressive. Furthermore, it is expected that in the coming years the current amounts of information we are using will be ridiculous as we can see in Figure [3.1] . In the first part of the graph you can see a circle where it compares the amount of information created in 2005 with the information that is expected to be created in 2020. The second graph shows us shows an evolution of the global amount of information and also the existing storage capacity available worldwide. From 2010 to 2012 it is an estimate and from 2012 to 2020 it is a forecast.

Gráfico que nos muestra una estimación y pronóstico en la evolución de la
cantidad de información global que circula por la red. — 3.1 Graph that shows us an estimate and forecast of the evolution of the amount of global information that circulates through the network.

To meet and respond to these new information management needs, first of all you have to know what can be the main sources and that is a problem. Actually in this current technological world there is a great demand for very varied information. We only need to see the great examples par excellence to give us account of the magnitude of the problem:

Youtube: Every minute of the day, about 24 hours of video are uploaded to Youtube.
Twitter: 65 million Tweets are shared by Internet users on this microblogging network; that is, about 45,000 Tweets per minute.
Facebook: The most popular social network of all and in the entire world has already exceeded 500 million users registered users, who on average produce about 90 units of content each month, which gives a total of 45 billion monthly publications.
Wikipedia: The popular online encyclopedia that laid the foundations of Web 2.0 by presenting an outline collaborative and open publication has not been immune to this exponential growth, since if in 2001 it had 19,700 pages published, currently this figure is estimated at more than 3.5 million.

And if we analyze the economic section, in the figure [3.2] . we can understand how this problem is really creating new research and development needs. The examples cited above obtain great benefits thanks to its innovation in these aspects to be able to attract more users without worsening quality.

Gráfica que nos muestra las previsiones sobre la magnitud del mercado del Big
Data. — 3.2 Graph that shows us the forecasts for the magnitude of the Big Market Date.

In conclusion, we can define the term Big Data to refer to all those sets of information that we need to analyze for some purpose, but because of its large size and variety, it is very complex and difficult to perform any operation on them. As a consequence, these sets will meet these three needs:

Volume. The volume of data stored in company warehouses has gone from occupying megabytes and gigabytes to petabytes, that is, the physical size of this information is often so large that it is difficult to handle most databases. We can take as an example the New York Stock Exchange, which generates a terabyte of data per day or Twitter, which generates 8 terabytes per day (or 80 MB per second).
Speed: Information must be processed quickly, often in real time. For example, eBay confronts fraud through PayPal by analyzing five million transactions in real time per day.
Variety: The variety of data has exploded, going from being stored and structured data, stored in relational databases, to be completely unstructured and contain, for example, audio, video, XML, text, etc.

3.1.1 RDBMS VS NoSQL

Nowadays, any industrial process requires the processing of large amounts of data and to that end paradigms and algorithms oriented to this type of data began to flourish. processing. Big Data analysis generally includes all the data that a company has in its possession and offers several benefits such as leading to new decision-making processes, new marketing ideas or new productive developments. In essence, it allows the company to keep pace with the times and decide its future. Some business processes are capable of producing more than 1 TB of data each day, so they must find a way to parallelize execution and improve query times.

First, in response to these new needs, they used relational data management systems ( RDBMS) -- Relational Database Management System) which, with DB2 and Oracle implementations, its simple and intuitive way to represent information in table form. The success of the relational system is was mainly due to the ability to handle operations transactions that comply with the properties ACID:

Atomicity: it is the property that ensures that an operation can never be left halfway in the event of a failure of the system. Then we can consider that an operation is atomic when it is impossible for another part of an system find intermediate steps. For example, in the case of a bank transaction or both the deposit and deduction or no action is taken.
Consistency: It is the property that ensures that you only start what you can finish. As a consequence, Only those operations will be executed that will not break the integrity rules and guidelines of the database.
Isolation: it is the property that ensures that one operation cannot affect others. This allows the carrying out two transactions on the same information are independent and do not generate any type of error. This property defines how and when the changes produced by an operation become visible to the other concurrent operations.
Durability: It is the property that ensures that once the operation is carried out, it will persist and cannot be undo even if the system fails and that in this way the data survives in some way.

Recently, problems have arisen due to the fact that RDBMSs are designed to process a number of much smaller information than what is being produced today. In addition to that, they are not oriented to distributed systems and do not support the concept of horizontal scalability well.

Horizontal scalability is very important and is used to refer to the ability to split data and the computational load between several servers without having to share RAM or hard drives. This concept differs from the concept of vertical scalability in that a distributed system uses more CPUs than they share resources such as RAM. At this time horizontal scalability is more relevant because it can be managed without the need to have a high-performance computer.

With these problems and needs, in 2009 the NoSQL (Not Only SQL) movement emerged, which includes all data management systems that have the following characteristics:

They do not follow a relational model
They do not use SQL as the main query language
They have the capacity for horizontal scalability
efficiently uses indexes and RAM
has the ability to dynamically add new records and attributes
they do not follow the ACID property.

We have to start by first thinking that a database management system that satisfies the ACID property has a problem in distributed systems because when we try to get certain information from one of its To these nodes the CAP theorem applies, which states that a maximum of two of the following axioms can be satisfied at the same time:

Strong ( c)onsistency: All clients should see the same version of the data.
High ( TO)vailability: a client must always have at least one copy of the data available, even in the presence of failures.
( Q)artition-tolerance: System characteristics should not be degraded, even if the Computation is split across multiple servers.

In the case of relational systems, the [C] Strong properties always follow ( c)onsistency and [A] High ( TO)vailability while NoSQL distributed systems have tried to improve availability promoting property [P] ( Q)artition-tolerance and sometimes to the [A] High ( TO)vailability in reprimand of the [C] Strong ( c)onsistency.

This change has produced the birth of BASE systems (Basically Available, Soft-state, Eventually consistent), that is, a new database management system that will have the following characteristics:

Eventual Consistency: Rigid consistency mechanisms such as those present in the bases are not implemented relational data, where the confirmation of a change implies a communication of the same to all nodes let them replicate it. They have a distributed structure where data is generally distributed through distributed hash table mechanisms.
Horizontal scalability: The system will be able to increase system performance simply by adding more nodes, without the need in many cases to perform any other operation other than indicating to the system which ones are the available nodes. Many NoSQL systems allow you to use Map-Reduce type queries, which can run on all the nodes at once (each operating on a portion of the data) and then gather the results before returning them.
Fault tolerance and redundancy.
They do not generate bottlenecks: the underlying problem with SQL systems is that they must transcribe each sentence to be executed and each complex sentence requires, in addition to a further level of execution concrete to be able to be carried out, which is why it constitutes a common, unique and conflictive entry point in performance basis.
Only what is necessary: they are simple systems that do not have a complex query system or the capacity declarative to perform an exorbitant internal number of operations in a single line.
Dynamic structure: The data does not have a fixed attribute definition, that is: Each record can contain information in a different form each time, thus being able to store only the attributes that are of interest in each of them, facilitating data polymorphism under the same collection of information. It is also can store complex data structures in a single document, such as storing information about a blog post (title, body text, author, etc.) along with comments and tags poured onto it, all in a single record. Doing so increases clarity (by having all the data related in the same block of information) and performance (you do not have to do a JOIN to obtain the related data, since these are found directly in the same document).

NoSQL systems, however, are not the solution to all problems. First of all, there is still no index of reference, benchmark, where you can apply a test to different relational database managers and NoSQL managers and then evaluate their performance.

Furthermore, NoSQL databases have a very diverse data structure, which is both a strength and a weakness. In fact, there is no standard for exchanging data between two NoSQL databases, except difference from the SQL standard for relational databases. This latest problem that affects databases NoSQL also affects every new technology. RDBMS have been a standard for over 30 years, they have been formed experts in the field, and above all business data is stored in table form. There is a cost substantial associated with the data contained in the new format and instruct staff to use the new data management systems.

3.1.2 Types of NoSQL databases

The types of NoSQL databases that we can find are:

Key-Value: key-value is the most typical form, like a HashMap where each element is identified by a single key, which allows the recovery of information very quickly. Normally the value is store as an object called blob (Binary Large Objects) which are elements used in databases to store large data that changes dynamically. In this way the content type is not important for the database, only the key and the value associated with it. They are very efficient for readings and writes, plus they can easily scale partitioned values according to their key; by For example, those whose password is between 1 and 1000 go to one SERVER, those from 1001 to 2000 go to another, etc. inside These databases can be found in Google's BigTable, Amazon's SimpleDB, Cassandra, Hadoop, Riak, Voldemort and MemcacheDB among others.
Document Based: These store information as a document (generally with a structure simple as JSON or XML) and with a unique key. It is similar to Key-value databases, but with the difference that the value is a file that can be understood. If the server understands the data, it can do operations with them. In fact, several of the implementations of this type of database allow very specific queries. advanced over the data, and even establish relationships between them, although they still do not allow joins. We can find MongoDB and CouchDB among the most important of this type.
Graph Oriented: There are other databases that store information as graphs where the relationships between the nodes are the most important. They are very useful for representing information from social networks. In fact, Relationships can also have attributes and you can make direct queries to relationships, instead of to the nodes. Furthermore, since they are stored in this way, it is much more efficient to navigate between relationships than in a relational model. Obviously, these types of databases are only usable if the information in question can easily be represented as a network. We find Neo4J among others.
Column Oriented: They store values in columns instead of rows. With this change we gain a lot speed in readings, since if you need to consult a small number of columns, it is very fast to do so but it is not efficient for performing writes. For this reason, this type of solution is used in applications with a low rate of writes but many reads. For example, Cassandra.

3.2. Neo4J

Neo4J is an open source software that allows you to use graph-oriented database and the following configuration is constituted:

Nodes: They are the entities that we are going to represent.
Relations: They represent associations between entities and always have a start node and another end node. destiny.
Properties: They represent the attributes that can be stored in both nodes and relationships.

This structure is well suited to contain information about how the entity relates to its environment, adding new information by adding new nodes and subdividing to distribute it on different servers. This structure actually makes it relatively easy to expand the graph with new nodes or eliminate those existing. Apart from that, graph database management systems use a transaction-based model, That is, they only store a series of modifications if all the defined integrity conditions are met.

The Neo4J system follows the NoSQL movement described in section 1.1 and, as such, is not based on the relational, uses a native system to perform queries in the database. Below will be presented a more detailed view of the system, we will analyze some differences between the Neo4J system and the databases relational.

3.2.1 Neo4J in detail

The language used by Neo4J is based on Java and maintains the graph on disk inside an optimized structure. The graph in question will be:

Addressed: Each arc has a direction.
Fully labeled: Each node has a name that identifies it and each arc has a class.
Subject to attribute: it is possible to memorize name-value pairs on arcs and nodes.

The Neo4J kernel is extremely fast and includes all the features of a commercial database recognized, such as:

Recovery: recovers information after it has been lost due to an error.
2-phase-commit: protocol for distributed applications that allows maintaining the ACID property in the transactions.

Neo4J has chosen the graph structure basically because it allows the storage of various data sources. information. Furthermore, the task of adding or removing nodes based on the needs of the application is very simple. that uses Neo4J, with minimal performance impact. A single instance can contain billions of nodes and if there are problems related to hardware limitation, the database can be distributed across multiple servers to solve the problem. Queries have been optimized for the graph structure and are extremely fast with a fixed execution time independent of the size of the graph (unlike RDBMS that both the size and the number of joins penalize the execution time quite a bit). Other information Important is that the system defaults to READ COMMITTED as the isolation level, that is, we can only obtain data where a commit has existed first and thus avoid the problem of dirty reads. Yes for example we need to read a line that is already blocked by another transition, wait for the transaction commit or abort. Finally we have to mention that, like a large part of current databases, Neo4J can also work with transactions.

The next point to analyze is the internal structure of Neo4J. To do this we have to define a series of terminology that helps us understand and analyze this type of database:

Node: contains an entity of our problem and must be associated with a graph. Inside it we can They will define both properties and key-value pairs and will be contained within the same node. Apart from that we can use these properties to access the node that interests us, that is, we can use in queries to find the desired nodes. Additionally, we can access the nodes using two methods. different: indices and traverse that I will explain below.
Relationship: represents the relationship between the nodes. Each relation has a type that can be used as a search method for when performing an index search. On the other hand, the traverse method does not We could apply here because it serves to consult the nodes. To obtain the nodes through the indexes We have to indicate if we want to obtain the start or end node of the relationship.
Index: used to obtain a set of nodes or relationships through database queries with parameters associated with the object or objects that we want to obtain.
Traverse: Used to query the database of a given node. You can analyze the relationships incoming and outgoing through an algorithm (to catch only interested relationships) and define even the number of hops that the algorithm has to complete at most.

To better understand this structure we have included a graph called figure [3.3] where we can see all these concepts applied in a graph.

Gráfico donde podemos observar un grafo con la representación interna de la
estructura interna del Neo4J. Se observa como la base de datos almacena el grafo y los índice que han estado
definidos para realizar las consultas más rápidamente. El grafo contiene nodos, relaciones y todas las
propiedades que poseen estos. Para consultar las bases de datos aparte de los índice encontramos las traverse
que nos proporciona más diversidad para generar el algoritmo de consulta y en algunos casos resultan mejor que
los índices para bases de datos de grandes dimensiones. — 3.3 Graph where we can observe a graph with the internal representation of the internal structure of Neo4J. It is observed how the database stores the graph and the indexes that have been defined to perform queries more quickly. The graph contains nodes, relationships, and all properties that these possess. To consult the databases apart from the indexes we find the traverse which provides us with more diversity to generate the query algorithm and in some cases they are better than indexes for large databases.

3.2.2 Neo4J database management example

To better understand how Neo4J works, we have also built and explained a simple program that allows us to learn everything we need to know before writing more complex algorithms such as Ranking functions for gene analysis.

Example

Since our algorithm is going to work with gene graphs, we have considered that the example ideal to show the basic operation of Neo4J could be a simple algorithm that works with a tree genealogical with the following statement: Víctor has adopted a daughter.

The first step will be to define all the variables that we want to use during our algorithm and initialize the Neo4J database. To do this we create a HashMap variable <p ,String> with all the properties that we believe are appropriate for our application, which in my case I consider to be they are:

 
       HashMap properties ;
       properties.put("online_backup_enabled","false");
       properties.put("keep_logical_logs","false");
       properties.put("db.mapped_memory","300M");
       properties.put("relationshipstore.db.mapped_memory","300M");
       properties.put("propertystore.db.mapped_memory","600M");
       properties.put("propertystore.db.strings.mapped_memory","500M");
       properties.put("propertystore.db.arrays.mapped_memory","0M");

Later we have to start the database. For this we will use the method EmbeddedGraphDatabase() which needs as the first parameter the path where we want to store the information and how optional second parameter the properties we want to give to the database. Immediately after create the system, since the Neo4J documentation [neo4j.com 2014] recommends using the registerShutdownHook method which has been implemented to properly shut down the database when the java virtual machine terminates. So as a result we get:

 
       graphDb = new EmbeddedGraphDatabase(DB_PATH,properties);
       registerShutdownHook( graphDb );

The next step in Neo4J is to define the relationships that we are going to use throughout our program, indicating their name and what will be a type of relationship:

 
       enum RelTypes implements RelationshipType { HIJA &#125;

With the relationship created, the The next step is to create the father and daughter nodes, with information about them, for example their name. For We will simply have to carry out the following steps:

First create the nodes:

 
       Node padre;
       padre = graphDb.createNode();
       Node hija;
       hija = graphDb.createNode();

Define its properties:

 
       padre.setProperty( "nombre", "Victor " );
       hija.setProperty( "nombre", "Andrea" );

Once this is done, now we have to establish the relationship defined as “DAUGHTER” between the parent node and the daughter node:

 
       relacion = padre.createRelationshipTo( hija, RelTypes.HIJA );

In Neo4J it is also allowed add properties to relationships, such as saying that the daughter is adopted:

 
       relacion.setProperty( "tipo", "adoptada" );

As a result we have obtained the figure [3.4] , that is, a graph in which both Víctor and Andrea have represented with two nodes and their family relationship with an arc named “DAUGHTER” that unites them. To store the information that it is adopted, we have simply added a property to the arc that indicates this characteristic. another option that we could use is to create more relationships of the natural daughter and adopted daughter type that will allow us get the same result without giving properties. In our example we see it is very important to quickly access the nodes that identify Victor's daughters and for them the first option is more efficient since with a single consultation we can find all his daughters.

Representación del grafo utilizado en el ejemplo. — 3.4 Representation of the graph used in the example.

Once the entire structure is created we can show how to perform some simple queries. As we have said previously, it has a very powerful and simple native language that, for example, allows us to access from a node to another that has a relationship using the getRelationship() method. This method receives as parameters the name that identifies the relationship sought and the direction of the arc to provide as a result the node or set of nodes that meet that condition. It is very important to provide the address because all of them have definition a start node and an end node, that is, there is always an address. A small detail in me implementation has been the use of getSingleRelationship since I have only foreseen that there is only one relationship and to gain more efficiency and not handle iterators it is better to use this variant.

The result of the exposed procedure is the following code:

 
       Relationship r;
       padre.getSingleRelationship(RelTypes.HIJA,Direction.OUTGOING );
       print("TIENE UNA HIJA DE NOMBRE:")print(r.getEndNode().getProperty("nombre"));

Another very interesting point of Neo4J is access of the nodes through the indices. To do this, our database allows us to define a structure that is called Index. Once it has been created, we only have to indicate what its purpose will be, that is, if it will be used to access nodes or relationships and an identifying name to access it.

 
       public static Index <Node> padreindice;
       padreindice=graphDb.index().forNodes("nombre");

With this structure defined, we have to add the nodes that we want to index and the key to access them, such as:

 
       padreindice.add(padre,"nombre","Victor ");

Once this procedure is finished, we can use the get() function provided by the Index interface to obtain the node we want quickly and efficient.

 
       Node recuperar=padreindice.get("nombre", "Victor ").getSingle();

The last type of query What we have left to know is the traverse method. As we had said, it is used to obtain the nodes that we need faster than with indexes. To prepare this type of queries we first have to apply the search algorithm (in our example we have applied breadth first, which consists of analyzing the nodes that are closest to the starting node), the depth of the search, the type of relationship and which node we are interested in (the beginning or the end). As a result, it provides us with a series of vertices that we will have to scroll through to get your information

 
       Traverser t = recuperar.traverse(Order.BREADTH_FIRST,StopEvaluator.DEPTH_ONE,
       ReturnableEvaluator.ALL_BUT_START_NODE,RelTypes.HIJA, Direction.OUTGOING);

Finally we have to explain how you use Neo4J transactions. We can see that most of the code exists these two lines of code.

 
       Transaction tx = graphDb.beginTx();tx.success();tx.finish();

Its purpose is to allow the user perform operations on the database without any type of control. At the moment in which we wish store the changes, we will call the success() method and verify that the modifications do not break any rules of system integrity.

3.3. titan cassandra

Titan Cassandra is an open source software that allows you to use database distributed graph-oriented and the following configuration is constituted:

Nodes: They are the entities that we are going to represent.
Relations: They represent associations between entities and always have a start node and another end node. destiny.

The main novelty that this system brings to the management of graph-oriented databases is its optimization for the storage and query of graphs distributed on different servers. Its graph-type structure apart from contain information about how the entity relates to its environment intuitively, it is very easy to add new information and subdivide it to distribute it on different servers. It has the ability to integrate technologies such as Apache Cassandra, Apache HBase or Oracle BerkeleyDB that allow you to easily scale the system cluster and It also natively has the Blueprints and Tinkerpop tools, which help manipulate the information it has the graph. The only drawback that we can find with this feature is that to distribute the information on various servers and to be able to manipulate them efficiently, the nodes have to have few relationships between them since otherwise the temporal costs of operations increase too much in relation to them.

In another sense, like Neo4J, using graphs makes it relatively easy to expand their size with new nodes or delete existing ones. Apart from that, to ensure data integrity stored also offers the possibility of using transactions, that is, it will only store a series of modifications if they meet all the defined integrity conditions.

Our database manager follows the NoSQL movement described in section 1.1 and, as such, is not based in the relational model and uses a native system to perform queries in the database. Below is will present a more detailed view of the system and we will analyze some differences between the Titan Cassandra system and relational databases.

3.3.1 Titan Cassandra in detail.

Titan Cassandra keeps the graph on disk inside a structure oriented to be distributed. The graph in question will be:

Directed or undirected: there are two possible implementations, construct a graph where each arc has a direction or build another one in which all the arches have both directions.
Fully labeled: each node has a name that identifies it and each arc needs a name to be created.
Subject to the attribute: it is possible to memorize name-value pairs about arcs and nodes.

That said, Titan Cassandra offers us a series of advantages when designing our database such as:

Horizontal scalability. Titan Cassandra has implemented various mechanisms to work with large graphs through the ability to divide the analyzed graph in proportion to the number of clusters that our system has. This philosophy has also been applied to transactions, allowing a simultaneous number of these operations very elevated and even allowing itself to be divided into different servers.

Consultations. As in all NoSQL database managers, Titan uses Cassandra to perform its queries and manage its structure in a native, effective and intuitive language. Apart from that, the system can make queries by ranges of elements, location or key-value searches and thus locate both vertices and edges.

Supports the Blueprints System. This property is very interesting because it is an open source interface that offers two advantages:
- Any algorithm written in Blueprints works efficiently and without any adaptation problems on Titan. This also includes all open source software from Tinkerpop, a working group which is responsible for developing new systems for the management and interpretation of graphs.
- Any query written for Titan can be migrated to other graph-oriented database systems which accepts this interface, so there is no dependency on a provider.

You can transform a graph into JSON. This is very important since there are many applications or even database managers like MongoDB that accept this format.

It uses a vertex-centered method. Offers the user the possibility of using a special method for the purpose of sorting and indexing the edges incidents (and therefore adjacent vertices) of a vertex according to the labels and properties of the edges. Using this method well when we perform a query on the vertices will provide us with a advantage and linear scans of incident edges (O(n)) can be avoided and graph traversals more fast supervene (O(1) or O(log n)).

Follow the NoSQL philosophy but try to optimize the space used. Provides optimized disk representation to enable efficient use of storage and speed. access.

It is open source under the Apache 2 license.

Having commented on all its virtues, we have to talk a little about its limitations. There are two aspects that by itself structure of the system we cannot avoid or improve. The first is size, a Titan Cassandra database can only store

2^{60}

arcs and half of vertices. The second is the access time to an arc is never constant. Depends on the quantity of incident arcs at the vertex and therefore the cost is linear. That's why Titan Cassandra is Recommended for graphs with few connections.

The next point to analyze is the internal structure of Titan Cassandra. To do this we have to define a series of terminology that helps us understand and analyze this type of database:

Node: It contains an entity of our problem and must be associated with a graph. Inside it we can define key-value pairs that will be contained within the same node. Apart from that we can use these properties to access the node that interests us, that is, we can use them in queries to find the desired nodes. It should be said that Titan Cassandra gives by default an id to each node that is created, always has a numerical value and can be used to obtain the nodes we need.
Relationship: It represents the relationship between the nodes. Each relationship has a name that can be used as a search method for when performing an index search. Apart from that Titan Cassandra also grants an id automatically.
Index: The system allows us to create indexes to access both vertices and arcs. For them, There are two different methods to differentiate them, one is based on labels (labels) that allow storing indexes to arcs that interest us. The other is based on keys that provide us with the different methods to quickly access vertices.

3.3.2 Titan Cassandra database management example

Once each structure has been explained, the The next step would be to understand a little how you can build an algorithm in Java based on Titan Cassandra to understand all the quirks of your native language before writing more complex algorithms like algorithms for Ranking functions.

Example

Since our algorithm is going to work with gene graphs, we have considered the example used in section 4.2.2 as it is ideal for showing the basic operation of Titan Cassandra. to remember the statement we are going to implement a simple algorithm that works with a family tree to represent the following statement: Victor has adopted a daughter named Andrea.

The first step of our application will be to initialize the system in order to store the graph. For this, First we will apply the erase( ) method to ensure that there is no other database with the same name as generate conflicts. We have implemented this method because Titan Cassandra does not have any method to delete databases. The algorithm only searches if the folder that we will use to store the database exists. of data and deletes all its content. If we did not use this method, we would have to perform this task manually every time we want to create a new test database

Once we are sure that there will be no conflicts, we will have to configure the system. For this, We will have to indicate its characteristics such as, for example, where the information will be stored, if it will be a system local or distributed or what type of indexes we are going to use by applying algorithms such as the following:

 
       BaseConfiguration config;
       Configuration storage = config.subset(STORAGE_NAMESPACE);
       configuring local backendstorage.setProperty(STORAGE_BACKEND_KEY,"local");
       storage.setProperty(STORAGE_DIRECTORY_KEY, directory);
       configuring elastic searchindexstorage.subset(INDEX_NAMESPACE).subset("search");
       config.setProperty(INDEX_BACKEND_KEY, "lucene");
       config.setProperty("local-mode",true);
       config.setProperty("client-only", false);
       config.setProperty(STORAGE_DIRECTORY_KEY, directory + File.separator + "es");
       graph =TitanFactory.open(config);

Once the structure is defined, our graph will fulfill a interface called KeyIndexableGraph that has methods to create indexes and use it for queries on vertices or arches. The methods implemented for this purpose are makeKey intended for vertices and arcs and makeLabel only. for arches. When we apply the sortkey() method to indexes, we are applying the Vertex-Centric philosophy explained previously in subsection 3.2.1. The consequences of this action on the database is a redistribution of the arcs and vertices affected in a special way and more efficient for certain occasions. The only drawback is that to carry out this task we have to invest a lot of time.

 
     .unique().make();tx.makeLabel("hija").manyToMany().sortKey().unidirected().make();

With all With this information generated, the next step is to create the father and daughter nodes, with information about them, for example their name. To do this we will simply have to carry out the following steps:

First create the nodes:

 
       Vertex padre = tx.addVertex(null);
       Vertex hija = tx.addVertex(null);

Define its properties:

 
        padre.setProperty("nombre", "Victor");
        hija.setProperty("nombre", "Andrea");

Once this is done, now we have to establish the direct relationship from the parent node to the daughter node as follows:

 
     padre.addEdge("hija", andrea).setProperty("tipo", "Adoptada");

On Titan Cassandra too It is allowed to add properties to relationships, such as saying that the daughter is adopted simply by using the setProperty() method.

As a result we have obtained the figure [3.5] , that is, a graph in which both Víctor and Andrea have represented with two nodes and their family relationship with an arc named “DAUGHTER” that unites them. To store the information that it is adopted, we have simply added a property to the arc that indicates this characteristic. another option that we could use is to create more relationships of the natural daughter and adopted daughter type that will allow us get the same result without giving properties. In our example we see it is very important to quickly access the nodes that identify Victor's daughters and for them the first option is more efficient since with a single consultation we can find all his daughters.

Representación del grafo generado en el ejemplo. — 3.5 Representation of the graph generated in the example.

Once the entire structure is created we can show how to perform some simple queries. As we have said previously, it has a very powerful and simple native language that, for example, allows us to access from a node to another that has a relationship using getEdges(). This method receives as parameters the direction of the arc (in, out or both) and the name that identifies the relationship sought to provide as a result the node or set of nodes that meet that condition. It is very important to provide the address because all of them have definition a start node and an end node, that is, there is always an address.

The result of the exposed procedure is the following code:

 
     Edge aux=victor.getEdges(Direction.BOTH,"hija");
     Vertex resultado=aux.iterator().next().getVertex(Direction.IN);
     print(resultado.getProperty("nombre"));

Finally, as we had mentioned, Titan Cassandra uses a transaction system that guarantees the integrity of the stored data using the following code:

 
     TitanTransaction tx = graph.newTransaction();tx.commit();

Therefore, all Sentences that are executed between these lines will only be stored if they do not violate any integrity rule.

3.4. MongoDB

MongoDB is an open source software that manages document-oriented databases. In a normal MongoDB system we will always find these three elements:

Document: It is defined as the records or set of data managed by the system. Each of these units is composed of “key-value” pairs where the key is the name of the field and the value is its content, separated by the use of “:” As a value, numbers, strings or binary data can be used as images or any other “key-value” pair.
Collection: grouping of documents, which could be said to be the equivalent of tables in a relational database. It does not have any limitation regarding the quantity
Index: The MongoDB system is made to store large amounts of information and, therefore, needs indexes to locate the documents we need and search for the desired values without having to go through all the documents.

A database that manages documents gives us the freedom to define the information we want as we wish. We store and, therefore, we may store graphs, tables, images or any type of information without any impediment or restriction. Furthermore, in MongoDB we find that each document in a collection can have different fields and by not having any defined scheme we can add, delete, modify or rename fields values at any time.

Using documents also means having very good horizontal scalability since it is relatively easy to expand its size with new documents or delete existing ones and therefore very easy distribute the database on different servers. Aside from that, like many database management systems made to be distributed, cannot guarantee the integrity of the data stored MongoDB and as a consequence does not use the transactions.

Our database manager follows the NoSQL movement described in section 1.1 and, as such, is not based In the relational model, a native system is used to perform queries in the database. Below is will present a more detailed view of the system.

3.4.1 MongoDB in detail

MongoDB is an open-source, cross-platform database system. It is written in C++, which gives it a certain proximity to the machine's hardware resources and provides speed to the time to execute your tasks. In addition to being free license software, it works on operating systems Windows, Linux, OS X and Solaris.

Its most important features to highlight are:

Document-oriented storage

As we have seen in the introduction, MongoDB is oriented to The use of JSON type documents with a dynamic schema called BSON, or Binary JSON, allows searches fast data. To get an idea, BSON explicitly saves the lengths of the fields, the indexes of the arrays, and other useful information for data scanning. This is why, in some cases, the same A document in BSON takes up a little more space than it would if it were stored directly in JSON format. This drawback always exists in NoSQL systems, since it really has a philosophy based on the cheap storage, and as a consequence defends that it is better to use more memory if this introduces a considerable increase in the speed of locating information within a document. An example of this document is the figure [3.6]

Ejemplo de documento Bson de MongoDB — 3.6 MongoDB Bson Document Example

We can see in the previous example how MongoDB uses an attribute called “_id” (or primary key). Your The purpose is to identify each of the different documents and be able to access them quickly. We like Users can give the value we want to the _id field, although it is advisable to use the MongoDB one because This is quite important for the correct functioning of the system and always has to have a unique value, that is That is, we can never find two _ids of the same value. MongoDB uses a value similar to a UUID in hexadecimal. Despite seeming to be a completely random value, they use a seed based on the MAC of the machine's network interface (and other details of the machine) to prevent two different machines from being able to generate the same value for a document key. Apart from that, the first bytes correspond to a mark of time, so the keys are naturally sorted in time order. We can see the diagram of this structure in the figure [3.7] .

Esquema ObjectID MongoDB. El campo está compuesto por 12 bytes. Los cuatro
primeros bytes son un timestamp con los segundos desde el epoch de Unix; los tres siguientes bytes representan
el identificador único de la máquina; los dos siguientes el identificador del proceso; y para finalizar los
últimos tres bytes, son un campo incremental. — 3.7 MongoDB ObjectID Scheme. The field is made up of 12 bytes. The four first bytes are a timestamp with seconds from the Unix epoch; the next three bytes represent the unique identifier of the machine; the next two the process identifier; and to finish the last three bytes are an incremental field.

Another interesting feature is the size of a document. In MongoDB we can store up to 16 Megabytes and if we had to store much more, we would have to choose to use another scheme or system.

Full index support

Indexes in MongoDB have a tree structure called B-Tree. That is, that the data is stored in a tree whose nodes can have a multiple number of children, but keeping all They balanced. This increases the speed when searching and also when returning results. ordered. In fact MongoDB is capable of traversing indexes in both directions, so with a single index, we can achieve both ascending and descending sorting.

An important detail that can help us improve our performance efficiency is that the indexes Let's use them with fields that have great cardinality. When we create an index and fill the tree with the information that we want to index, it introduces the elements in certain positions of the tree depending of the content found in the field. Therefore, the more diversity the indexed information has, the more efficient will be the index.

In another sense, the system offers us different types of indexes. First, MongoDB indexes all documents with the key _id (explained above), which allows us to quickly access them and avoid having to go through the entire collection to find them. Then it also offers us geospatial indexes, that is, allows indexing of information based on location. Lastly, it also allows indexing fields of a document, to access them without having to go through the entire document.

Horizontal scalability

In traditional RDBMS systems, to improve database performance data a more powerful machine is acquired (vertical scaling) while scaling works better in MongoDB horizontal (increase number of machines)

Consultations

Other strong points to highlight of MongoDB are its speed and its rich but simple query system for the contents of the database. Following the NoSQL philosophy explained above, has managed to achieve an optimal level of performance and functionality, incorporating types of system queries relational, but without sacrificing performance.

Map and Reduce

MongoDB allows us to use Map and Reduce functions to select attributes that interest us in the data, add it, unify it and simplify it as we wish. This is common in many NoSQL systems, and in some cases it is even the only possible way to query data.

Absence of transactions

It is really not known whether this is a virtue or a deficiency. what MongoDB not having any system to check that the data entered corrupts the database makes the implemented algorithms are faster but the database manager will not block the “commits” with information wrong

3.4.2 MongoDB database management

Once each of its main virtues and deficiencies, the next step would be to understand a little how you can build an algorithm in Java based on MongoDB and thus know its native language before writing the more complex algorithms.

Example

Since our algorithm is going to work with gene graphs, we have considered the example used in section 4.4.2 since it is ideal to show the basic operation of MongoDB. To remember the statement we are going to implement a simple algorithm that works with a family tree to represent the following statement: Victor has adopted a daughter named Andrea.

The first step will be to define all the variables that we want to use during our algorithm and initialize the MongoDB database. To do this we will first have to know where the server that is located is located. has the MongoDB database stored and connect our program:

 
     MongoClient conn = new MongoClient( "localhost" , 27017 );

Once connected we have to create or search for the database that we are going to use. In the case of the example, first I check that there is nothing in the database and if there is, I delete it and then the structure is created. First the database with the name “Tree” using the getDB() method. This method searches the server if there is a database with that name, if the finds it and returns it, otherwise it creates a new one with that name. Then we create the Family collection within the database “Tree” data that stores all the Documents necessary for its implementation. For this the method getCollection("Family") applied to the db variable will link us to the Family collection if it exists or create it if contrary.

 
     DB db;DBCollection coleccionfamilia;
     for (String s : conn.getDatabaseNames()) {
      conn.dropDatabase(s);
     }
     db = conn.getDB("Árbol"); //creamos la base de datos
     coleccionfamilia=db.getCollection("Familia");

The last step to finish the construction of our system is create an index to quickly access our document. Indexes in MongoDB are created by applying to the desired collection the createIndex() method and passing it as a parameter the key to access the indexed field that We want together with the value 1 or -1 to indicate whether we want to order them ascending or descending. respectively.

 
     this.coleccionfamilia.createIndex(new BasicDBObject("name", 1));

Once we have created all The structure we want to use the next step would be to store the information within the collection. For We have to remind ourselves that MongoDB works with Documents in JSON format. To do this, Java gives us There are some libraries available that work with this format that will allow us to create and check the structure before enter it into the database. To this end, we have implemented a JSONObject called Victor that inside has stored as properties the name of the person it represents, that is, Victor and the daughters he represents. possesses. This last field has been implemented as a JSONArray, that is, it will be made up of an Array of LinkedHashMap where each of the elements entered must have the name of the girls and the type of relationship she maintains with Víctor (natural or adopted, for example).

 
     JSONObject victor=new JSONObject();
     JSONArray hijas=new
     JSONArray();
     this.victor.put("Nombre","Victor");
     LinkedHashMap andrea=new LinkedHashMap();andrea.put("nombre","Andrea");
     andrea.put("tipo", "adoptada");
     this.victor.put("hijas",andrea);
     familia.add(victor);

As a result of this work we have obtained the figure [3.8] . shown below.

Documento JSON creado en java que correspondería a la información
proporcionada en el enunciado — 3.8 JSON document created in java that would correspond to the information provided in the statement

Once we have the JSON document prepared, the next step is to enter it into the MongoDB database. To do this we have to convert figure 7 into a BSON file using the function com.MongoDB.util.JSON.parse() which It will receive the JSON document as a parameter and will return a BSON document. After that store the document within the collection that corresponds to it and would already be part of its structure.

 
     Object classaux = com.MongoDB.util.JSON.parse(x.toString());
     DBObject dbObjaux = (DBObject) classaux;
     this.coleccionfamilia.insert(dbObjaux);

Once the process is finished we can see the result real in figure 8, that is, a document that represents the person Victor and the number of daughters he has and the relationship that unites them. The important thing about this figure is that in this document we have really stored easy and intuitive the same information as in the graph in the figure [3.9] .

Representación del grafo utilizado en el ejemplo y su correspondente en
Bson — 3.9 Representation of the graph used in the example and its corresponding in Bson

With all the information entered the next step is to show how to perform some simple queries such as search to Victor's daughters. As we have said before, it has a very powerful and efficient native language that allows us It allows access to the entire document based on the value of a key.

To achieve this goal, MongoDB provides us with a method called find() that parses the documents in search for the desired information. If it finds it, it returns it as a result in the form of a JSON object and with the method get() we access the fields that we want to know.

 
     BasicDBObject query=new BasicDBObject("Nombre","Victor");
     System.out.print("Victor TIENE UNA HIJA DE NOMBRE:");
     DBCursor f =coleccionfamilia.find(query);while ( f.hasNext()) {
      print((f.next().get("hijas")).get("nombre"));
     &#125;

3.5. cassandra

Cassandra is an open source software that manages column-oriented databases. In a normal Cassandra system we will always find these elements:

Column: It is the most basic element of the database and is similar to the concept of a field in databases. relational. It has a structure very similar to an associative array since each Column consists of the following three fields:

Name: array of bytes used to access the stored value.
Worth: array of bytes that can contain any type of information.
Timestamp: int64 variable where the last time the Column was modified is stored.
There are no restrictions on what the fields can contain. We can enter both in the name as in the value the fields that we want, since as in all NoSQL systems there is no restrictions on stored information.

SuperColumn: It is an aggregation of n-columns that can be referenced by a name. It is quite a resource useful, since it allows you to have a collection of values associated with a value.
As before, there is no restriction on what attributes can contain.
Row: It is an aggregation of columns or super columns that are referenced with a name contained within a String. That name is the key that uniquely identifies a record.
At this point it is important not to confuse a row with a super column. The row is just a keyword, it is not a structure, it lacks attributes such as super columns.
Column Family: It is an aggregation of rows referenced by a String. It is a similar idea to a Table in the RBDMS model.
KeySpace: It is an aggregation of column families that can be referenced by a name. It is not a structure with attributes, it is just another container accessed by a String. If we look for the equivalent with the RBDMS models would be a similar idea to the schemas (set of tables) in the relational model.
Cluster: It is the highest level element that can be referenced by a name. It is more physical in nature. than the previous ones, more related to hardware, since it groups the nodes (machines) on which it runs Cassandra. It can contain one or more keyspaces.

As we have seen, Cassandra manages everything as associative arrays. Cassandra maintains collections ordered arrays of objects accessed by a name (key), which in Cassandra terminology is called a row (key). The model is actually a multi-dimensional hash, where the columns are located at the lowest level, the super columns are a hash of columns, rows are a hash of columns or super columns, and column families are a rows hash. Having the chain of names (keys) you can access all the elements that make it up.

All these concepts become clearer if we look at the figure [3.10] which represents the entire structure of Cassandra and its hierarchy.

3.10 Graphic representation of the typical structure of the organizational system of cassandra

Another important point of a database that menaje columns gives us freedom to define how we we want the information we store. Although it costs a little more to understand than, for example, with documents or graphs we have to think that in this database each row of a keyspace references Columns with very diverse.

Using columns also means having very good horizontal scalability since it is relatively easy expand its size with new columns or delete existing ones and, therefore, it is very easy to distribute the database on different servers. Apart from that, Cassandra has managed to offer users a system of transactions that help maintain the integrity of stored data.

3.5.1 Cassandra in detail

Cassandra is an open source database, developed in Java and multiplatform. Its main feature is the merger of Dynamo, from Amazon, with BigTable, from Google, both being closed source implementations.

If we look for a little history about this system, we will realize that it was developed in response to the problems of Big Data described in the introduction. Facebook had a need to improve the performance of the search engine, specifically those related to communication between users (“Inbox Search”) since this functionality involves managing a large volume of data, with a very high growth perspective (we have to think that the social media boom occurred after the implementation of Cassandra). To solve this problem, this system was designed where its main characteristics were: highly scalable queries, horizontal and with a profitable economic cost.

After investing Cassandra, in 2008 the Cassandra license was released by Facebook, becoming be open source, and is currently maintained by Apache. This characteristic is what makes Cassandra a really interesting NoSQL database, as it combines the best of Dynamo (eventual consistency) with the best of BigTable (column families) for free.

We can observe the consequences of this story in the following characteristics that the system:

Fault tolerance: Data is automatically replicated to multiple nodes. Losing a node does not cause deregister from the cluster.
Flexible scheme: We have already mentioned previously that columns, super columns and column families can store the values we want.
Symmetric: All nodes in the cluster are identical, which avoids problems related to bottlenecks.
Horizontally scalable: As we had already mentioned, it is normal for NoSQL systems to increase their efficiency as more servers are introduced into the system
Big data support: The ability to scale to hundreds of gigabytes of data.

As the last point of this system we have to talk about Cassandra Architecture and how it works a little special. Before, we have commented that the system always considers which node may fail and, therefore, They thought that the best network structure they could use would be Peer To Peer (computer network in which everyone or some aspects work without fixed clients or servers, but rather a series of nodes that behave as equals each other). Data is partitioned across all Cluster nodes and allows it to be replicated for ensure fault tolerance.

Apart from that, the nodes communicate with each other through a protocol called Gossip, which is dedicated to exchange information between servers constantly. Logs are made of the commits made in each node when they use write operations, therefore the durability of the data is practically assured, as long as the system is well planned.

3.5.2 Cassandra database management

Once each of its main virtues and deficiencies, the next step would be to understand a little how you can build an algorithm in Java based on Cassandra and thus know your native language before writing the more complex algorithms.

Example

Since our algorithm is going to work with gene graphs, we have considered the example used in section 4.4.2 as it is ideal for showing basic operation. To remember the statement We are going to implement a simple algorithm that works with a family tree to represent the following statement: Victor has adopted a daughter named Andrea.

The first step will be to define all the variables that we want to use during our algorithm and initialize the Cassandra database. To do this we will first have to know where the server is located and connect our program using the connect() method:

 
     connect("127.0.0.1");session = cluster.connect();

Once the connection is made, the session variable will allow us to make modifications or create structures within the system. We can see how in For example, we have used it to perform any task such as defining the structures of our database. data for:

Delete if a Keyspace called Family exists and avoid possible conflicts.
```
 
       DROP KEYSPACE IF EXISTS Familia;
        
```
Create, if it does not exist, a Keyspace called Family. Cassandra allows us to give features to our Keyspace, in order to further optimize the efficiency of the system. In our case we have indicated that it belongs to the SimpleStrategy class (recommended for databases with a single server) and made up of one node.

 
       CREATE KEYSPACE IF NOT EXISTS Familia WITH replication = { 'class':
       'SimpleStrategy', 'replication_factor' : 1 };

Create the family column called “perez” (the name would be to symbolize the family a little) where We will store your family tree information. The columns will have the name and Afterwards, the information of the different daughters will be stored.
```
 
         CREATE TABLE Familia.perez (nombre text PRIMARY KEY, hijas map<text,text>);
     
```

Once we have created the structure, the next step is to enter the information into the keyspace. How Cassandra is a database management system that works with Columns and makes extensive use of key-value pairs. we can find certain similarities with JSON. So we have thought about reusing the java code that we created in MongoDB to generate the information, format it and store it within the system. We are not going to explain the code that generates the JSON because we can see its explanation in section 4.4.2 about MongoDB only remember the structure of the information is shown in the figure [3.8] .

Once we have the information we need, we only have to create the queries in the language Cassandra native (CQL). To do this we have to convert the figure [3.8] in CQL format following this structure:

 
     INSERT INTO Familia.perez (nombre , hijas) VALUES (“Victor”,  { "nombre": "Andrea","tipo": "adoptada"});

As a result we have obtained the [3.11] , that is, a structure in columns that represents the person Victor, the number of daughters he has and the relationship that unites them. We can observe that one of the main features of Cassandra since we have stored it in an easy, intuitive way and without any restrictions information that the graph provided us.

Representación del grafo utilizado en el ejemplo y su correspondiente en
Cassandra. — 3.11 Representation of the graph used in the example and its corresponding in Cassandra.

Once the entire structure is created we can show how to perform some simple queries. To carry out this task we have to learn a little about its native language, since CQL, although it is very similar to SQL, has certain notable differences.

4. Genetic prediction functions

Ranking functions are those functions that process a set of data, predict the probability of that its elements are associated with each other and indicate the degree of belonging to a certain characteristic, functions or field, that is, a cataloging of data based on already cataloged data. These functions are used in any technological or scientific field, since they create new data from previously information obtained. In this project, ranking functions will be applied to the context of genes, called GFP (Gene Function Prediction).

The networks used in GFP always have a large amount of data and therefore we will always have issues related to scalability and performance. If we also add to this that the amount of information grows day after day, there is an urgent need to study and improve the procedures used in studies.

As a result of these events, in a few years the use of existing computational methods to identifying characteristics and models have led to the cataloging of millions of genes and the creation of genetic interaction networks that They promise to improve knowledge in this field. To achieve this, a computational approach is used that leads to the definition of a gene and a set of characteristics that characterize it. Then the techniques of machine learning to identify new information from what we know [Pavlidis et al, 2001].

The emergence of gene interaction networks has made it possible to analyze data in a different context than that of a single dimension. One of these networks is PPI (protein-protein interaction) and contains associations between the human genes and those of other species [Aebersold et al, 2003]. The data it contains is summarized in the form of graphs or documents and section 4.1 will show its structure.

The objective in the analysis methods is to be able to calculate the characteristics associated with each gene in the different types of databases listed in section 2. We have to analyze whether it is possible to use the databases of data for this purpose and what method is most efficient for these systems.

As we have mentioned, there are several approaches to solving these problems and we will start by analyzing whether A genetic network can be represented in graphs, documents and columns. We will present these conclusions in the section 4.1. In section 4.2 we will talk a little about the methods of reading data on graphs. In section 4.3 We will analyze the most interesting function for our systems, the kernel-based function. In the last section We will talk about the concept of cross validation, using a kernel-based function.

4.1. Data representation

We will use three structures to represent gene networks and their interrelationships, which will be:

Graph for Neo4J and Titan Cassandra
Document for MongoDb
Columns for Cassandra

Then in the following sections we will comment on the representations of each of them to verify that It is possible to apply the algorithms applied in networks to these structures.

4.1.1 Graph (Neo4J and Titan Cassandra)

This representation will use a graph defined as the pair G = (V, E) where V represents the vertices, which correspond to the genes, and E the arcs that connect them, that is, the relationships between the pairs of genes. The latter may be associated with a weight. The graph structure can be represented in a matrix of adjacency that has the genes as a row and column header and the value in which corresponds to the importance relative of the relationship that connects them.

This adjacency matrix, which will be identified with W, is clearly conveyed that is, the number of zeros that contained therein far exceeds the number of relations with different weights from scratch. This is due to the fact of that each gene is put into correspondence with all the other genes and by their nature many genes are not connected by relations, then, a zero appears in the matrix. For this reason, we try to define the algorithms that do use of this feature. An example of matrix W is found in the figure [4.1] , which also represents an example of a format called adjacency list.

Ejemplo de representación en grafo de un red de genes — 4.1 Example of graph representation of a gene network

As we can see in figure [4.1] , is composed of three graphs that show that they show how we could obtain an adjacency list through the information stored in a graph which takes up less space. For that we are going to explain a little each of the three parts of the figure:

[a)] example of a graph that represents a network of genes. Each node contains a gene, each arc contains a weight which corresponds to the similarity between genes.
[b)] example of matrix W, corresponding to graph a)
[c)] formatting the W matrix into an adjacency list takes up less space.

4.1.2 Bson Documents (MongoDB)

The second option to represent the data is through Documents. More exactly JSON style documents with a dynamic schema called BSON, or Binary JSON, already explained in the section 2.4. These documents will be made up of key-value fields (or I put key-value) that will contain the information from each vertex of our network. Each network vertex will be represented with a document that will contain a field called _id, another with the name of the gene and finally an array where we will have stored all the names of the genes. genes that it is linked to and its similarity value with respect to them.

As an example, we have made a conversion of the graph in the figure [4.2] to document and we have obtained figure 15 as a result, which is the following:

Representación en documentos Bson del grafo de la figura — 4.2 Representation in Bson documents of the graph in the figure [4.1] a)

Like the graph structure, with documents we can also represent an adjacency matrix with the same characteristics.

4.1.3 Columns (Cassandra)

The last of the three representations is in columns. This format is a bit special since it uses keys to access the information. To represent our network in a context like this, we will have to define several fields to place them. First a name to the set of all networks, for example, how our systems are going to represent genes, we can call it “Genes” and it would represent the Keyspace. The next step would be to define its context a little more, such as what purpose our network would have, example genesHumanSpecies and we would have the ColumnFamily defined. And lastly would come each of the columns, corresponding to the vertices of our network. This last structure is really where our vertex because on the one hand it has stored the key-value field with the name of the gene and on the other an array formed by key-value variables with the vertices that have a relationship and the similarity value.

In the figure [4.3] We can observe this structure graphically to make this format better understood.

Representación gráfica de una base de datos orientada a columnas a partir de la
información del grafo de la figura — 4.3 Graphical representation of a column-oriented database based on the graph information in the figure [4.3] a)

Like the graph structure, with the columns we can also represent an adjacency matrix with the same characteristics as those set out in section 2.1. Actually in a column-oriented system what we are storing rows of this matrix.

4.2. The methods used in reading

Described below briefly some of the main methods used to predict genetic function. Each method follows one of these approaches:

Direct method: identifies the characteristics of a gene from the connections in the network.
Assisted method: identifies genes similar to those analyzed, thus the analyzed genes that lack cataloging inherit the properties of those who are similar.

As we have seen in section 2.1, working with graphs, documents or columns provides the same results. By For this reason, to explain these methods we have decided to use graph-shaped representations that are easier to understand.

4.2.1 Direct method

To understand the direct methods we are going to use the figure [4.4] shown below.

Representa el funcionamiento de un método directo — 4.4 Represents the operation of a direct method

This figure is composed of three different diagrams to show us how a direct method is applied. in each one From the diagrams we find:

[1] Shown is a network in which the functions of certain genes are known (colored in colors different) and others (blank) that are not known.
[2] The arcs are shown with direction to mark the influence of a gene with respect to another gene
[3] Finally, it shows us in diagram 3 how the white genes change color after finishing the process. It can also be seen in the image that there are cases where, for example, we have a gene that may have an associated two or more characteristics.

The conclusion obtained after analyzing the direct methods is that all the genes located close to each other They tend to have the same associated characteristics. Regarding the methods considered direct, we have found:

Neighborhood counting: This method [Schwikowski et al, 2000] predicts gene functions based on the characteristics already known from its immediate neighbors. The problem we have found is its simplicity and, Apart from that, it does not take into account the overall topology of the network, only a very small part.
Graph: If we think that most networks are essentially a graph, it is normal to think that a method that is Taking advantage of this feature will be more beneficial to us and, if on top of that, the method takes into account the entire topology network (unlike the previous method) we get a more efficient algorithm.
Searching for algorithms in this context, we have found the graph method [Nabieva et al, 2005] where takes into account both the global topology and the concept of local neighborhood for each genetic function that we want to analyze. The basic concept is to treat each gene that already has an associated function as the source node of a flow chart. After the simulation, each gene lacking a genetic function is associated with the probability that its function is equal to that of the output gene. This process is carried out for all the genes that They already have an associated function.
Probabilistic methods: They are based on the concept of MRF (Markov Random Field) and were initially proposed in the article [Deng et al, 2003]. With this method, the function associated with a gene is semi-independent of that of its neighbors. To know if a gene has a genetic function associated with it we will have to use a logarithmic formula which takes into account the number of times said function appears in the network and the number of genes associated with the function. itself and that have relationships with the starting gene.

In order to compare and evaluate the quality of direct methods, the following formulas have been introduced:

where

n_{i}

is the number of known genetic functions for gene i,

m_{i}

is the number of genetic functions predicted for the gene for which we do not know the true function and

k_{i}

They are the elements common to the two previous definitions. Both formulas are evaluated with a technique leave-one-out, that is, hiding the function associated with a gene and trying to obtain it from the other genes. By executing this algorithm we have obtained that the probabilistic method obtains better results, and therefore it is the more sophisticated model.

4.2.2 Assisted Method

As in the previous section, to understand the methods assisted I will explain from the figure [4.6] shown below.

Representa el funcionamiento de un método asistido — 4.6 Represents the operation of an assisted method

This figure is made up of three different diagrams to show us how an assisted method is applied. In each one of the diagrams we find:

[1] Shown is a network in which the functions of certain genes are known (colored in colors different) and others (blank) that are not known.
[2] A division of the network into sets based on density is shown.
[3] Finally, graph 3 shows us how the white genes change color after finishing the process, taking the color of the one that predominates in the set. You can also see in the image there are cases where, for example, we have a gene, it may have two or more characteristics associated with it.

As we can see in the figure [4.6] , with assisted methods instead of predict the function of a particular gene (Figure [4.4] ) We try to identify a group of genes and then assign the same function to all of them.

Below we are going to comment a little on the main most used methods that belong to the methods assisted.

Generic methods

In these methods the choice of the group is based on the interaction between genes and the decomposition of the network into its topology. One of these methods is MCODE [Bader et al, 2003] which is consists of two main stages:

Nodes are weighted according to a coefficient calculated from the density of neighboring nodes and, therefore, Therefore, they will have a higher weight than nodes with many connections.
In the second step, the graph is processed by dividing it into sections whose nodes have a large number among them. of connections

Method with hierarchical clustering

This method uses a hierarchical clustering concept to distribute the data into their sets. The advantage of using this approach is being able to choose the measure of similarity between gene pairs. These topics are discussed in [Rives et al, 2003].

Graph partitioning methods

There are several methods that follow this approach for graphs. In the article [Spirin et al, 2003] by For example, two quite different categories are presented: those based on SPC (Superparamagnetic clustering) [ Blatt et al, 1996 ] and based on the Monte Carlo method. In both categories they require as input the size of the cluster , but the difference between the two is that SPC works better in densely connected structures, that is That is, if we take this information for the graphs, they would be those with a large number of arcs.

Later, based on these methods, other new algorithms were introduced such as RNSC (Restricted Neighborhood Search Clustering) [King et al, 2004] and the MCl [Krogan et al, 2006]:

RNC is an algorithm that, given a network, divides its elements into different sets according to a cost function. Its procedure begins with the creation of a random cluster and subsequently processes all its elements and the reallocates based on maximization of the cost function. To finish the algorithm filters the different groupings according to their size and density.
In another sense we have The MCL. This algorithm creates an adjacency matrix based on the input network and identifies highly connected areas by creating a set with the elements.)

Conclusion assisted methods

Among the assisted methods that have been listed, MCL is the most robust and with better performance regardless of the chosen graph, whether it has real or simulated data. The tests that performed RNSC that was very intolerant with respect to the input parameters. MCODE and SPC appear lower if the We compare it with the MCL algorithm, although there are some aspects to take into account such as:

MCODE works better if the data is random, it provides a lower number of false positives.
SPC generates complex structures to promote better sensitivity but at the cost of worsening the specificity.

4.3. Kernel-based ranking function

Among all the algorithms analyzed, the most efficient for To carry out our study on graphs, a kernel-based ranking method included in the direct methods. Presents always the following formula:

S:V\longrightarrow R^{X}

Its task is to associate a real number with a gene that identifies the degree of membership in a genetic class. The higher this number is, the greater the probability that the gene belongs to that genetic class. The criterion of the distance using a scoring function refers to a Hilbert space, that is, a concept mathematical defined as an inner product space that is complete with respect to the defined vector norm by the inner product [Wikipedia Hilbert 2014].

Sim(i,V_{c})=-K(x_{i},x_{i})+\frac{2}{\vert V_{c} \vert } \sum _{j \in V_{c}}K(x_{i},x_{j})

If there are no arcs in the graph where the exit and entry nodes are identical (that is, all the

K(x_{i},x_{i})

are equal to 0 for each

i \in H

) the formula can be reduced by eliminating the first term from the algorithm.

To facilitate the understanding of the operations, we have presented an example in the figure [4.7] .

Representación en una red genética en una matriz W — 4.7 Representation in a genetic network in a matrix W

In this example, we will apply the test on gene X. Class 1 will be the class studied and Vc will contain the genes of the test that belong to class 1, that is, Y,V,Q.

Sim(X,Clace 1)=\frac{2}{\vert V_{c} \vert } \sum _{j \in V_{c}}K(x_{i},x_{j})=\frac{2}{3}(0.5+0.33)=0.55

4.4. Cross validation techniques

Cross-validation techniques are used to evaluate the performance of an algorithm based on machine learning. Therefore, we can use this technique to evaluate the efficiency of kernel-based ranking functions. Machine learning deals with algorithms based on the observation of existing data for the synthesis of new knowledge. The learning We can obtain it through information provided by examples, data structures or sensors, with the purpose to analyze and evaluate the relationships between the observed variables. The totality of available data is called dataset and is divided into two parts: a training set and a test set. The training set represents the known data and is consists of a set of elements already cataloged, while the test set is a set of elements lacking cataloging. The algorithm must use the knowledge of the training set to classify with it the elements of the test set. In some cases, however, we can have a training set with few elements or the analysis carried out on the training set has lasted too long. These situations are known as overfitting and although the elements are part of the training set are correct, obtaining new information is difficult. To avoid this drawback, cross-validation techniques have been developed applied in programs discussed in Chapter 6. Cross-validation techniques require a data set of size considerable and provide a criterion of membership, in this way each element can be divided into two sets different, one that has all the positive elements and another with the negative ones. If we talk about genes, the Membership criterion could be, for example, the association of the gene to a genetic class. At this point, all dataset is divided into N sets of similar sizes called folds. The algorithm is performed in cycles of N times changing the fold as an example. Each fold used in a cycle will be considered the test set and has the set of positives and negatives. Once this procedure has finished it is time to perform a estimation of the entire dataset and we can make the necessary estimates. An example of this technique is the figure [4.8] , where we can see how we have divided the genes into two sets, those positive and those negative to a certain class genetics. Subsequently, all genes have been divided equally into 5 folds.

Representación gráfica de la técnica de validación cruzada — 4.8 Graphical representation of the cross-validation technique

5.Tools and technologies used in the project

In this section we are going to comment on the technologies used to carry out the project. We will start in the first section choosing the programming language we will use. In the second section we are going to comment that tools we have used to develop the project.

5.1. Choice of programming language

have to Carrying out programming from scratch gives us the advantage of being able to choose the programming language to be used. use. In this case, 3 languages were initially proposed:

C++
Python
Java

Due to the performance issue, Python was completely ruled out since speed is needed to compare performance. and efficiency, and Python lacks these features.

The main advantage of C++ is its great execution efficiency, greater than Java. Java for its part too It is quite efficient, but Java offers much more comfortable programming for the programmer. Furthermore also It has many facilities for creating multiplatform environments.

Taking into account these characteristics of each language, the decision was made. What finally decided the balance was the student's willingness to want to learn a new language that he had hardly used during his training. It is for this reason that Java was finally used.

5.2. Development tools

For To develop the project the following tools have been used:

Eclipse: used as a Java programming environment.
Dropbox: as a tool to save backup copies of the project and at the same time have control of versions of each new feature.
TexMaker: used to make the project report.

It should be noted that these development tools have been used because the student has learned to use them during his higher studies in the different subjects he has taken.

5.2.1 Eclipse

Eclipse is an IDE very useful when it comes to Java programming, since it provides many functionalities so that the programming is much more comfortable. Below are the features that have been most useful and that They have undoubtedly reduced the time spent on the project:

Text editor with syntax highlighting.
Compilation in real time.
Smart autocomplete.
Templates to create pieces of code, such as new classes, constructors, loops, etc.
Automatic reporting of errors and build warnings and suggestions to resolve them.
Navigation between classes and methods.
Automatic generation of javadoc, jars and tgzs.

5.2.2 Dropbox

Dropbox is a cross-platform cloud file hosting service, operated by the Dropbox company. The service allows users to store and sync files online and between computers and share files and folders with others. There are free and paid versions, each of which with varied options

5.2.3 Texmaker

Texmaker is a free editor distributed under the GPL license for write text documents, cross-platform, that integrates many tools necessary to develop documents with LaTeX, in a single application Texmaker includes Unicode support, spell checking, autocompletion, code folding and a built-in pdf viewer with synctex support and view mode continue.

6. Desarrollo del Software

Como ya se ha comentado en el capítulo 1, el objetivo de este proyecto es implementar la función ranking en diferentes sistemas de gestión de bases de datos para comparar su eficiencia y rapidez en el manejo de grandes cantidades de información utilizando un ordenador sin grandes recursos hardware.

En la etapa de diseño se intentará crear un diseño flexible y reutilizable de tal forma que permita extender y modificar la aplicación fácilmente. Esta parte se considera fundamental, ya que es probable que en el futuro se disponga de más operaciones para agregar a la aplicación que surja de este proyecto.

De esta forma hemos dividido este apartado en varios subapartados como serán: 6.1 Donde analizaremos el análisis y el diseño del proyecto, 6.2 Implementación de la función score basada en kernels para cada una de las tecnologías expuestas, 6.2 Resultados, 6.3 Conclusiones

6.1. Análisis y diseño de la función Score basada en kernels

Durante el inicio de esta etapa se ha prestado mucha atención a la distribución de los requisitos del proyecto en diferentes funcionalidades evitando complejidades innecesarias y crear clases o procedimientos inútiles, siempre intentando crear sistemas más fáciles de entender.

6.1.1 Diagrama de casos de uso

Se ha comenzado definiendo los posibles casos de uso que puede realizar el usuario. De esta forma se ha obtenido la información sobre la funcionalidad que requiere el sistema.

En la figura [6.1] se muestra el diagrama de casos de uso.

Como actor del sistema se puede definir al investigador que podrá realizar algún tipo de test con alguna aplicación. En este caso, ese usuario conocerá la aplicación y proporcionando los correctos ficheros de entrada ejecutará el programa y obtendrá los resultados. La aplicación generará las diferentes bases de datos y proporciona al usuario dos objetos, uno un documento con los tiempos de cada ejecución y otro con los resultados de aplicar la función Ranking.

A continuación se detallan los diferentes casos de uso expuestos en el diagrama.

6.2. Implementación de la función Score basada en kernels

En este apartado del proyecto vamos a definir la implementación de la función Average Score. Para ello vamos a presentar los diferentes documentos que dispondremos como entrada para generar nuestras bases de datos en la sección 6.2.1 y el resultado final proporcionado en la sección 6.2.2. También expondremos en la sección 6.2.3 cómo hemos implementado los cálculos expuestos en el apartado 3 para obtener las probabilidades y cada uno de los detalles que hemos tenido que enfrentarnos en cada uno de los diferentes sistemas que hemos utilizado: Neo4J, MongoDB, Cassandra , Titan Cassandra.

6.2.1 Formato de datos de entrada

Para generar la base de datos donde aplicaremos los cálculos utilizaremos los tres archivos de texto expuestos a continuación

File gene.txt

Contiene una columna con todos los genes con todos los nombres de los genes que vamos introducir en la base de datos. Podemos ver un ejemplo de fichero en la siguiente figura [6.2]

6.2.1.2 File geneclasse.txt

Contiene la matriz de adyacencia entre los genes y sus respectivas clases. Cuando en la matriz aparece un valor de 1 corresponde a que el gen de la fila pertenece a la clase de la columna y cuando aparece 0 viceversa. Esta matriz siempre cumplirá la siguiente característica: una clase es definida para un gen y este último puede pertenecer a varias clases.

Como se muestra a continuación, en el archivo la primera fila contiene una lista con los identificadores para las clases. Después, las siguientes filas (separadas con un salto de línea) empiezan por el identificador del gen, y una listas de valores con 0 o 1 que nos informan si pertenece (1) o no pertenece (0) un gen a esa clase determinada. Podemos ver un ejemplo de fichero en la siguiente figura [6.3]

File gene-classe.txt

Contiene el valor de similitud entre un par de genes. Las dos primeras columnas obtendremos el identificador de los genes y en la tercera columna su valor de similitud, que siempre será mayor de 0 y menor o igual a 1. En la figura [6.4] se muestra un fichero de ejemplo:

6.2.2 Explicación general del Algoritmo

La finalidad del código que vamos a implementar es generar bases de datos que representen redes de genes en 4 sistemas diferentes, utilizando para ello los tres archivos de entrada que se describen en la introducción de este capítulo.

Con este fin los diferentes sistemas nos proporcionan una serie de posibles configuraciones, para optimizar al máximo los recursos de la máquina que disponemos y mejorar la eficiencia en servidores locales sin grandes recursos. Una vez clara estas configuraciones en cada situación, todos nuestros algoritmos siempre empezaran creando bases de datos vacías listas para introducir información y realizar todas las consultas necesarias.

El segundo objetivo que tendremos que implementa, son los métodos que se encargará de introducir los vértices que representan los genes, las clases y los arcos que los relacionan entre sí. Para hacer esta tarea, podemos aplicar diferentes filosofías para aumentar más la robustez, la eficiencia o rapidez dependiendo de las necesidades del proyecto. En nuestro caso, nosotros hemos optado por utilizar una versión más robusta ya que si analizamos todo el algoritmo podemos comprobar que los dos últimos métodos requieren la búsqueda de vértices que ya han sido instanciados en los métodos anteriores, es decir, que si analizamos un poco más el código, sería más interesante realizar solamente la llamada de uno o dos métodos y reducir el número de búsquedas en la base de datos y, por lo tanto, disminuir los tiempos necesarios para crear la base de datos. Pero, realmente es preferible realizarlo de este modo porque al estar tratando con bases de datos tan grandes y con tantos elementos existe una probabilidad muy alta de que se produzca un problema en mitad de la ejecución, un fallo en la implementación o un error en la transición, y, en todos esos casos nos resultaría mucho más fácil solucionar el problema si resulta fácil entender la base de datos y en qué estado se encuentra.

Los principales pasos para crear la base de datos en cada uno de los sistemas tienen muchas partes comunes que vamos a explicar en los siguientes subapartados

Creación de los genes

Este método se encarga de leer el fichero gene.txt (analizado en el apartado 6.2.1) para poder obtener la lista de genes con la que vamos a interactuar durante el programa. Nuestro algoritmo leerá el fichero de texto línea por línea para ir obteniendo y almacenado cada uno de los nombres de los genes que vamos a utilizar.

Creación de las classes

Este método empieza leyendo la primera fila del archivo geneclasse.txt y aplica un split en ella, utilizando como elemento separador el espacio. El contenido obtenido corresponderá a todas las clases de genes posibles que posee nuestra base de datos y será almacenada esta información en un vector. El siguiente paso es recorrer el vector y con cada uno de sus elementos almacenados crear las estructuras que representen las clases.

Creación de las conexiones entre genes y clases

Este método empieza igual que el anterior, leemos la primera fila del archivo geneclasse.txt y aplica un split en ella. El contenido será almacenado en un vector que después será utilizado en otra parte del algoritmo.

Posteriormente, el método continuará leyendo el resto de información del documento, donde en cada una de las filas realizará el mismo procedimiento:

Realiza un split en la fila para obtener toda la información separada y accesible.
Lee el primer elemento de la fila (el nombre de los genes) para poder realizar una búsqueda del gen que corresponde.
Lee los siguientes elementos de la fila que se corresponden con la información de la tabla explicada en el apartado anterior y, por lo tanto, si la celda correspondiente tiene un valor 1, insertamos en la base de datos una relación que indique esta pertenencia del gen a la clase, y si el valor es 0, lo contrario.
Realizaremos este proceso hasta acabar de leer todo el documento.

Creación de las conexiones entre genes

Este método se encarga de leer el fichero genegene.txt para obtener el valor de similitud entre dos genes, es decir, identificar genes en la base de datos con similares características. El archivo en cuestión está formado por una serie de filas donde, como hemos dicho en el apartado anterior, y mediante un split podremos obtener la información que nos interese y almacenarla en un vector. Los dos primeras partes tendrán almacenado los nombres de los genes que podremos utilizar posteriormente para realizar consultas en la base de datos. Y en la tercera parte del vector tendremos almacenado el valor en forma de entero que representa la similitud entre ellos.

Este algoritmo tiene una característica especial. Durante la implementación surgieron problemas de fallos en la creación al tratar con cantidades muy grande de genes y por lo tanto cada vez que se alcanza las 10.000 relaciones analizadas se almacena la información nueva dentro de la base de datos y en caso de parada es necesario que realizar todas las transacciones desde el principio.

6.2.3 Validación cruzada

El primer problema que hay que abordar antes de analizar la función de Score para una determinada clase es subdividir los genes en 5 folds. Para realizar los cálculos de una forma más sencilla y optimizada hemos decidido dividir cada uno de los folds en dos subestructuras que almacenen de forma separada los genes positivos y negativos. Todas estas estructuras las hemos creado e implementado en una clase especial llamada Fold donde su constructor será el encargado de realizar todo el proceso. Este método recibe como parámetro una referencia a una clase genética e implementará el pseudocódigo expuesto en la figura [6.5]

Nuestro algoritmo empezará implementado e inicializando las siguientes estructuras necesarias para la función score:

n_fold : cantidad de folds que vamos a utilizar en nuestro algoritmo. Normalmente utilizaremos un valor 5 que es el valor estándar utilizado en la bibliografía consultada.
positiveGenFold e negativeGenFold : Contendrán almacenados tantos ArrayList como número de fold utilicemos en nuestro algoritmo y su finalidad será ser rellenados con los genes positivos o los negativos respecto a una clase analizada.

Estás variables serán almacenadas de forma pública dentro de la clase para que cualquier método pueda acceder a sus valores.

Una vez hecho este procedimiento, el siguiente paso es obtener todos los genes que son positivos para una determinada clase y los que no realizando consultas a la base de datos que estemos utilizando. El mayor problema que nos encontramos en este proyecto es que cada sistema de gestión de base de datos funciona diferente y utiliza lenguaje propio. Debido a este problema hemos decidido implementar un método llamado findGenesClassType() para que sea compartido en todos los sistemas que vamos a crear y contenga el algoritmo necesario para realizar esta consulta. Esta implementación la comentaremos en los siguientes apartados, aunque todas ellas serán creadas con la misma filosofía de optimizar los recursos que disponemos para proporcionar consultas muy rápidas en la base de datos.

Una vez que los vectores se han rellenado se han llenado con los genes, tenemos que dividir toda esta información entre los folds que estamos utilizando, es decir 5. Con este fin hemos utilizado un algoritmo que, si existe una cantidad suficiente de genes, intenta subdividir la información de forma equitativa, intentando mantener en cada uno de los fold el mismo número de genes o un nivel muy próximo. Un ejemplo de ejecución es figura [6.6] donde hemos aplicado este método y hemos obtenido los siguientes folds:

La cantidad mínima de genes positivos que tenemos que utilizar es siempre mayor o igual a la cantidad de folds que vamos a utilizar. Es decir en el ejemplo mostrado en la figura [6.6] tendríamos que tener por lo menos 5 o más genes positivos para realizar los cálculos. Posteriormente simplemente recorremos el vector de positivos e introduciremos la información en los folds intentando que en todos ellos exista la misma cantidad de genes o por lo menos una cantidad muy próxima.

6.2.4 Función Score

Una vez que hemos terminado de crear los fold tenemos que proceder con el cálculo de la función de Score, representado en pseudocódigo en la siguiente figura [6.7]

Como podemos observar, nuestro algoritmo se encargará de recorrer todas las clases, con la finalidad de calcular la probabilidad de que los genes pertenezcan a cada una de ellas. Para ello, lo primero que vamos a utilizar es la clase Fold definida en el apartado anterior y así poder crear para cada clase sus 5 “folds” correspondientes. Posteriormente, utilizando estas estructuras calcularemos cuál es la probabilidad de que un gen pertenezca a una clase genética. La forma más fácil que hemos encontrado para realizar estos cálculos es utilizando un bucle que simule el sumatorio y recorra los folds realizando las siguientes tareas:

Obtener todos los genes que pertenecen a la clase investigada, excepto aquellos que ya están dentro del fold analizado. Podemos entender más fácil este proceso en la siguiente figura [6.8]
6.8 Representación de todos los genes que pertenecen a la clase investigada, excepto aquellos que ya están dentro del fold analizado
Recorremos el fold con el objetivo de aplicar el método average score a cada gen que hay almacenado. Como parámetros, este algoritmo necesita el gen que vamos a estudiar junto con la estructura GenesPositive para la clase a la cual puede pertenecer. Para mejorar el rendimiento del programa, hemos implementado el array de GenesPositivo como un hashset ya que las búsquedas de elementos en esta estructura tiene un coste constante y resulta mucho más beneficioso que un array.
Respecto al método AvgScore cabe comentar que se encarga de simular los cálculos matemáticos expuestos en la sección 4.2. es decir simulara la función de ranking basada en kernels. El procedimiento empieza obteniendo el valor correspondiente al $D_{ii}$ del gen analizado, es decir, el resultado de dividir 1 entre la raíz cuadrada del valor total de la suma de la similitud que tiene con todos los genes conectados. El siguiente paso es, para todos aquellos genes con los cuales está relacionado y pertenezcan a la clase analizada (está contenido en él has dado por parámetro), realizar el siguiente producto entre:
- el valor de $d_{ii}$ del gen analizado, es decir, $d_{ii}=\frac{1}{\sqrt{\sum _{i}w_{ij}}}.$
- el valor $d_{jj}$ del gen que está relacionado (se obtiene de la misma forma que el $D_{ii}$ pero lo haces respecto al gen j, que sería en este caso aquel con el que se encuentra conectado).
- el peso de la relación que existe entre ellos.
Todos valores obtenidos serán sumados y el resultado de esta suma nos indica la probabilidad de que este gen pertenezca a una determinada clase.

Una vez realizado este procedimiento para todos los genes, almacenaremos los resultados en un fichero Excel a través del método stampaMatrice que recibe tres parámetros que son:

Un hastable rellenado con los pares de gen y probabilidad de pertenecer a las clases
una lista con las clases que hay en la base de dato
el nombre final que queremos otorgar al fichero.

6.3. Implementación en Neo4J

La primera parte del proyecto se basa en la implementación del algoritmo para Neo4J. Para realizar este experimento hemos utilizado la versión 1.8.2 obtenida del sitio oficial de Neo4J. Esta versión aun no permite la distribución de la información en diferentes servidores pero el equipo desarrollador trabaja en una futura versión que permita esta mejora sustancial.En el siguiente apartado explicaremos cómo hemos utilizado los diferentes mecanismos que nos proporciona Neo4J para resolver los algoritmos expuestos anteriormente. Los subapartados que nos vamos a encontrar son los siguientes:

Diagrama de clases
Representación de la estructura de la base de datos
Utilización del método validación cruzada.
Implementación de la función Score.
Diagrama de actividades.

6.3.1 Diagrama de Clases

6.9 Diagrama de clase de la aplicación de Neo4J implementada

6.3.2 Creación de la estructura

Para crear una base de datos de Neo4J tenemos que diseñar su configuración a partir de la máquina donde realizamos el trabajo. Siguiendo la filosofía NoSQL hemos hemos optado por ejemplo evitar saturar el sistema ocupando mucha memoria, y evitando crear un backup o logs de estados que ocupen mucha memoria.

Con la construcción finalizada, tendremos que implementar los métodos que se encargará de introducir los nodos que representan los genes, las clases y los arcos que los relacionan entre sí. Para hacer esta tarea hemos implementado los algoritmos expuestos en el apartado 6.2.2 que utilizara diferentes implementaciones para ver que opción proporciona una respuesta más rápida y eficiente.

Como vimos en apartados anteriores, Neo4J es un sistema que gestiona nodos y relaciones. Cada nodo va a representar un gen de nuestra red, cuyo nombre figura en la propiedad “name” dentro del nodo. Cada vez que Neo4J crea un nuevo nodo le asocia una id automáticamente que comienza con 1 y se incrementa para cada nodo añadido. Esta id se convierte en índice automáticamente pero no ocurre lo mismo con la propiedad nombre. Si queremos aumentar la eficiencia en la búsqueda y obtención de nodos la forma más correcta de hacerlo es creando un índice (llamado gene) respecto al nombre, ya que utilizando este parámetro podremos crear consultas a la base de datos mucho más eficientes respecto al tiempo. Cabe señalar que tanto los nombres como los id que crea el Neo4J son únicos y por lo tanto a cualquier consulta que realicemos utilizando este parámetro obtendremos como respuesta un único nodo. Respecto a los nodos que van a representar las clases va a tener una estructura muy similar, la única diferencia con respecto a la creación de los nodos anteriores es que ahora el nombre de la clase se utilizará para crear el índice llamado “class”.

Por último, las relaciones van a representar las conexiones que existen entre los genes y las clases de la red. En el caso que la conexión se produzca entre genes lo vamos a representar creando una relación “SIM” y cuando se produzca entre genes y clases la etiquetamos como “class”. Aparte de eso, en las relaciones de tipo “sim” almacenaremos el valor de la similitud entre los dos genes, creando una propiedad llamada “SIM”.

Otros aspectos a comentar son las mejoras de rendimiento y robustez que hemos introducido en la implementación de este apartado. Para buscar los nodos que necesitamos en una relación utilizaremos los índices que hemos implementado ya que proporciona una disminución en los tiempos de búsqueda. Respecto a la robustez hemos introducido en todos los métodos que hemos trabajado las transacciones. Las transacciones es una herramienta que nos proporciona Neo4J para que solamente se guarden las modificaciones en el caso en que no se produzca o exista ningún error.

El resultado de la implementación de este apartado se muestra en la siguiente figura [6.10]

Representación estructura Neo4j — 6.10 Representación estructura Neo4J

La siguiente implementación dentro del problema analizado utiliza las propiedades de los Nodos para almacenar las conexiones entre los genes. En ella hemos decidido crear nodos que tengan las siguientes propiedades almacenadas en ella:

$D_{ii}$ : contendrá el valor explicado en la sección 6.2.2.
type: indicará que el Vertex es un gen. Esta propiedad la utilizaremos en la implementación sin índices para buscar en la base de datos todos los genes que hay almacenados.
Las relaciones entre los genes las almacenamos de dos formas diferentes para ver el rendimiento del sistema utilizando variables HashMap o ArrayList. Por ello crearemos:

Relacionesmap: contendrá una variable Hashmap con las relaciones que posee en el nodo. Cada elemento tendrá almacenado como clave el nodo con el que existe la relación y el peso de la relación.
Relaciones: Propiedad que almacena un ArrayList con el nombre de todas las relaciones.
Relpesos: Propiedad que almacena un ArrayList con el peso de todas las relaciones.

Respecto a los Nodos que representan las clases genéticas podemos aplicar la misma filosofía y aplicar los siguientes Como resultado los vértices ahora tendrá la siguiente forma:

type: indicará que el Class es un gene. Esta propiedad la utilizaremos en la implementación sin índices para buscar en la base de datos todos los genes que hay almacenados.
Positive: tendrá almacenado todos los genes que pertenece a este Nodo.
Negative: tendrá almacenado todos los genes que no pertenecen a este Nodo.

Esta nueva construcción tendrá la estructura que se muestra en la siguiente figura [6.11] :

6.11 Representación de los nodos con la implementación sin conexiones

Para generar esta estructura hemos utilizado los siguientes métodos de la clase Neo4Jstore, los cuales hemos considerado como más importantes:

Anyadirpropiedadesclases(): Se encarga de añadir a los nodos que representan las clases genéticas las propiedades type, positive, negative.
Anyadirpropiedadesclases(): Se encarga de añadir a los nodos que representan los genes las propiedades type, relaciones, relpesos, relacionesmap.
Neo4Jstore(String, int , int, String): Constructor de la clase. El primer parámetro que recibimos puede contener la siguiente información

“crea”: Construye una nueva base de datos.
“aggiorna”: Actualiza la base de datos
“aggiornadati”: Actualiza datos que están dentro de la base de datos

aggiorna(): Método para actualizar la base de datos.
aggiornadati():Actualiza datos que están dentro de la base de datos

6.3.3 Validación cruzada

Las modificaciones que permiten adaptar este algoritmo (explicado en la sección 6.2) a las bases de datos de Neo4J son pocos, como por ejemplo que los arrays positiveGenFold e negativeGenFold van a gestionar Nodos.

La única diferencia sustancial la vamos a encontrar en el método findNodesClassType() ya que es posible desarrollarla con diferentes filosofías. Los dos métodos de la clase Fold resultantes han sido:

findNodesClassType():Esta función se encargará de realizar rápidamente una consulta en la base de datos para buscar todos los arcos del tipo “class” que salgan del nodo proporcionado como argumento. Como podemos observar Neo4J es muy intuitivo, al ser una base de datos NOSQL orientada a grafos, nuestra consulta simplemente tendrá qué buscar todos los arcos salientes del tipo “class” que salgan del nodo de la clase analizada. Para realizar la consulta podremos utilizar parámetros como por ejemplo que el tipo de búsqueda utilizaremos en el grafo, que relaciones vamos a buscar, a qué profundidad o cuántos nodos vamos a analizar. El resultado será devuelto mediante una variable del tipo iterable y por lo tanto con un simple for podemos recorrerla y almacenar en un array el contenido de la búsqueda, es decir todos los genes positivos respecto la clase.

findNodesClassTypelocal():Esta función utiliza la implementación más local de la información. Como no existen las relaciones, para obtener los genes positivos a la clase solamente tenemos que buscar la propiedad positive que contiene esta información. Una vez obtenida es almacenada en una variable y devuelta para realizar los cálculos necesarios.

La implementación de la validación cruzada se realiza en la clase Fold .

6.3.4 Score

En la clase Scoring las modificaciones también están relacionadas con el sistema de consultas de Neo4J.Primero de todo hemos implementado 3 métodos especiales para conseguir el correcto funcionamiento de la función Score que son:

setD(): este método recorre todos los genes de la base de datos para calcular el valor de $D_{ii}$ que posee el gen analizado. Una vez que ha sido calculado lo almacena en el Nodo como $D_{ii}$ . Para realizar esta tarea simplemente recorre todas las relaciones tipo “Sim” que posee el nodo analizado con el método getRelationships(). Este método de Neo4J va asociado a los vértices, recibe como parámetro el tipo de relación que estamos buscando y nos ofrece como resultado un iterador que recorre todas aquellas que están conectadas a nuestro nodo. El siguiente paso es ir cogiendo el valor de similitud almacenado anteriormente y sumar todos los valores para obtener el $D_{ii}$ .
AvgScore(): Con la función getRelationships() podremos recorrer todas las conexiones del gen analizado para obtener los valores de similitud y $D_{ii}$ de los vertex que necesitamos.
AvgScoreVersiónLocal() Este método utiliza las propiedades que hemos almacenado en el nodo y que contiene toda la información de sus relaciones. Podemos obtener el nombre de las conexiones a través de la propiedad “relaciones”, el valor $D_{ii}$ de la propiedad llamada igual, el valor de similitud de sus conexiones a través de la propiedad “relpesos” y el valor Djj del Nodo con el método getGeneByName() dando como parametro el gen que nos interesa.
AvgScoremap(): Misma metodología que el anterior pero ahora la información de las relaciones y el valor de la similitud la obtendremos de la misma estructura, un hashmap almacenado en la propiedad relacionesmap.
rankingAvg() Función encargada de realizar los procedimientos para llevar los cálculos de ranking

6.3.5 Modelo de ejecución/navegación

Una vez descritos los posibles casos de uso que puede ejecutar el usuario junto con sus flujos de eventos vamos a describir la navegación que se puede realizar en la aplicación.figura [6.12]

6.4. Implementación en Titan Cassandra

A continuación vamos a comentar la implementación del AvgScore para Titan utilizando Cassandra como base. Para realizar este experimento hemos utilizado la versión 0.4.0 de Titan para la gestión de grafos y 1.2.2 de Cassandra para almacenar la información, ambas obtenidas del sitio oficial de Titan.

En los siguientes apartados explicaremos cómo hemos utilizado los diferentes mecanismos que nos proporciona Titan para resolver los algoritmos expuestos anteriormente. Los subapartados que nos vamos a encontrar son los siguientes:

Diagrama de clases
Representación de la estructura de la base de datos
Utilización del método validación cruzada.
Implementación de la función Score.
Diagrama de actividades.

6.4.1 Diagrama de clases

13 Diagrama de clase para la aplicación de Titan Cassandra

6.4.2 Creación de la estructura

Para crear una base de datos de Titan tenemos que diseñar su configuración a partir de las características del sistema que vamos a utilizar, es decir, un sistema no distribuido, en local y con índices de estructura Lucene que aprovechan estas características. Antes de continuar con la creación de los genes y sus conexiones tenemos que definir todos los índices que utilizaremos para aumentar la eficiencia de nuestro algoritmo, ya que el sistema necesita primero crear los índices para después rellenarlos con los vértices que contienen la información de forma automática.

Con la estructura finalizada, tendremos que implementar los métodos que se encargará de introducir los genes, las clases genéticas y los arcos que los relacionan entre sí. Para hacer esta tarea hemos implementado los algoritmos expuestos en la sección 6.2.2 pero adoptando el código para este sistema y sus necesidades. En este sentido tenemos que recordar el sección 3.3 del proyecto donde hemos presentado las característica del sistema Titan y hemos explicado como no soporta eficientemente grafos con una cantidad muy elevada de conexiones. Sabiendo este hecho y para intentar obtener la mejor implementación posible de nuestros algoritmos hemos creado dos estructuras muy diferentes, una primera que crea las conexiones básicas para realizar consultas mediante arcos y otra en la que introduce esta información en las propiedades de los genes almacenados en la base de datos. Nuestra intención es comparar en los resultados estas dos implementaciones para ver cuál obtiene los mejores resultados.

Versión con Vertex y Edge

Para hacer este experimento lo primero que debemos recordar es que Titan es un sistema que gestiona objetos llamados Vertex y Edges. Estas clases se corresponden a los vértices y los arcos de un grafo y las vamos a utilizar para representar los genes y sus conexiones respectivamente. Para ellos crearemos genes a través de los métodos de la clase Vertex con las siguientes propiedades:

nameGen: almacena el nombre del gen.
type: indicará que el Vertex es un vertex

Respecto a las clases genéticas, vamos a realizar el mismo proceso. Las representaremos mediante Vertex con las siguientes propiedades:

nameClass: nombre de las clases.
type: indicará que el Vertex es una clase.

Por último, vamos a representar las conexiones que existen entre los genes y las clases de la red. En esta implementación vamos a utilizar la clase Edge que sirve para conectar Vertex. Cuando la conexión se produzca entre genes lo vamos a representar creando un edge etiquetado como “sim” y cuando se produzca entre genes y clases la etiquetamos como “class”. Aparte de eso, en las relaciones de tipo “sim” almacenaremos el valor de la similitud entre los dos genes, creando una propiedad llamada “weight”.

Por último, respecto al índice hemos implementado tres tipos de índices. El primero a la propiedad nameGen que afecta a todos los Vertex del tipo gen. El segundo a todos los Vertex con la propiedad nameClass que afecta a todos los genes. Y por último a la propiedad type, para intentar obtener todos los Vertex tipo gene o tipo class más rápidamente

El resultado de esta implementación se muestra en la siguiente figura [6.14]

6.14 Estructura Titan Cassandra con Vertex y Edges

Versión con Vertex

La otra implementación posible es almacenar toda la información necesaria para los cálculos en los vértices. Aparte de las propiedades comentadas en la implementación anterior, al no existir la posibilidad de utilizar arcos para encontrar sus conexiones, tendremos que crear más propiedades en el Vertex que representa el gen para almacenar esta nueva información. Con esa idea hemos añadido las siguientes propiedades:

$D_{ii}$ : contendrá la división de 1 entre la raíz cuadrada del valor de la suma de todas las relaciones.
relaciones: array con todos los genes que tiene una conexión de similitud.
pesosrelaciones: array con el valor numérico de la similitud. Existe una correlación entre este array y la propiedad relaciones.
pesosrelacionesdjj: array con el valor numérico de Djj. Existe una correlación entre este array y la propiedad relaciones.

Como podemos observar lo que realizamos es almacenar los genes con los que existe una relación en la propiedad “relaciones” y los valores de similitud en la propiedad “pesosrelaciones” eliminando la necesidad de crear un Edge para relacionarlos.

Respecto a los Vertex que representan las clases genéticas podemos hacer una cosa similar. Utilizaremos la propiedad “positive” y “negative” para almacenar todos los genes que pertenecen a las clase y los que no respectivamente.

Durante la implementación de este apartado, nos hemos encontrado con un problema de tamaño. Los arrays “positive” y “negative” para bases de datos muy grandes ocupan demasiado tamaño y el sistema no los acepta. Como solución a este problema hemos creado tantas propiedades como necesitábamos a partir del tamaño del array. Es decir dividimos el array correspondiente entre 250 y creamos tantas propiedades negative+índice como necesitemos. El problema de una implementación de este estilo es que otro programador que posteriormente quiera realizar alguna modificación será muy difícil de entender.

El resultado de la implementación de este apartado es la siguiente figura. figura [6.15]

6.15 Estructura utilizando únicamente la clase Vertex

Para generar esta estructura hemos utilizado los siguientes métodos de la clase DataBaseCassandraTitan:

create(String,String):El primer parámetro que recibimos contiene el nombre del directorio donde crear nuestra base de datos y el segundo donde almacenar los tiempos de nuestro programa. Dentro de este método está todos los procesos para la creación de nuestra estructura.
setD(): este método recorre todos los genes de la base de datos para calcular el valor de $D_{ii}$ que posee el gen analizado. Una vez que ha sido calculado lo almacena en el Vertex como una propiedad llamada “ $D_{ii}$ ”. Para realizar esta tarea simplemente obtenemos el array con todos los valores de similitud llamado “pesosrelaciones” y realizamos los cálculos matemáticos pertinentes.

6.4.3 Validación cruzada

Las modificaciones que permiten adaptar este algoritmo (explicado en el apartado 6.2.2) a las bases de datos de Titan empiezan por modificar en la clase Fold los arrays positiveGenFold e negativeGenFold para gestionar Vertex, las clases que van a representar nuestros genes.

La siguiente diferencia la vamos a encontrar en el método findGenesClassType(). Esta función nos proporciona como resultado todos los Vertex positivos y negativos a una clase y para este sistema de gestión de bases de datos hemos implementado dos métodos diferentes para realizar esta tarea: findGenesClassTypeEdge y findGenesClassType

findGenesClassTypeEdge

Esta función se encargará de realizar una consulta en la base de datos para buscar todos los arcos del tipo “class” que salgan del Vertex proporcionado como argumento. La consulta simplemente tendrá que buscar todos los arcos salientes del tipo “class” que salgan del nodo de la clase analizada utilizando el getEdges().El resultado será devuelto mediante una variable del tipo iterable y por lo tanto con un simple for podemos recorrerla y almacenar en un array el contenido de la búsqueda, es decir todos los genes positivos respecto la clase.

Con esa información podremos obtener la lista de los genes negativos y, por lo tanto, todos los genes que no estén en esta lista, serán negativos respecto a la clase y tendremos también el array con los vértices negativos.

findGenesClassType

Esta función utiliza la implementación más local de la información. Como no existen las relaciones, para obtener los genes positivos a la clase solamente tenemos que buscar la propiedad positive que contiene esta información. Una vez obtenida es almacenada en una variable y devuelta para realizar los cálculos necesarios.

El mayor problema que nos vamos a encontrar para crear esta función es que Titan cassandra no puede gestionar arrays de mucho tamaño. La solución por la que hemos optado es dividir los arrays en diferentes propiedades y por lo tanto tendremos que leer más de una propiedad para obtener los genes positivos y negativos a la clase.

Otros métodos

Aparte de estos dos métodos la implementación de la validación cruzada se realizará en la clase Fold.

6.4.4 Score

En la clase Scoring las modificaciones también están relacionadas con el sistema de consultas de Titan. Hemos implementado métodos especiales para conseguir el correcto funcionamiento de la función Score con la intención de adaptar todas las consultas al sistema Titan utilizado.

Los métodos utilizados más importantes en esta clase son:

AvgScore(): Esta función realiza los cálculos matemáticos a partir de las propiedades que hemos almacenado en los Vertex. A partir de las etiquetas “ $D_{ii}$ ” , ”pesosrelaciones” y “pesosrelacionesdjj” podremos obtener el valor de $D_{ii}$ , el peso de similitud y los valores Djj respectivamente.
AvgScore(): Esta función utiliza la clase Edge para obtener la información necesaria para realizar el Average Score. Mediante la función gedEdge podremos recorrer todas las conexiones del gen analizado para obtener los valores de similitud y $D_{ii}$ de los vertex que necesitamos o, por otra parte, AvgScoreVertex().

6.4.5 Modelo de ejecución/navegación

Una vez descritos los posibles casos de uso que puede ejecutar el usuario junto con sus flujos de eventos vamos a describir la navegación que se puede realizar en la aplicación.

6.5. MongoDb

En los siguientes apartados explicaremos cómo hemos utilizado los diferentes mecanismos que nos proporciona MongoDb para resolver los algoritmos expuestos anteriormente. Los subapartados que nos vamos a encontrar son los siguientes:

Diagrama de clases
Representación de la estructura de la base de datos
Utilización del método validación cruzada.
Implementación de la función Score.
Diagrama de actividades.

6.5.1 Diagrama de clase

6.17 Diagrama de clases para la aplicación de MongoDb

6.5.2 Representación de la estructura

El siguiente punto del proyecto es la implementación del algoritmo para MongoDb. Para realizar este experimento hemos utilizado la versión 2.11.3 de MongoDb obtenida del sitio oficial.

En el próximo apartado explicaremos cómo hemos utilizado los diferentes mecanismos que nos proporciona este sistema para resolver los algoritmos expuestos anteriormente. Empezaremos primero con la creación de la base de datos y como representa la información almacenada para después implementar la función del Score, utilizando siempre como método de verificación el algoritmo validación cruzada.

Estructura

Como habíamos expuesto en el apartado 3.3, MongoDb tiene una estructura orientada a documentos del tipo Json y, por lo tanto, tendremos que empezar creando los documentos que posteriormente vamos a introducir en la base de datos.

Los documentos que representen a los genes los almacenaremos en la colección genes y tendrán siempre la siguiente propiedades clave valor:

_id: Campo hexadecimal utilizado como identificador del documento. Utilizaremos las id proporcionadas por el sistemas, explicadas en el apartado 2.
name: Nombre del gen analizado.
clases: Array donde almacenamos las clases en las que pertenece el gen representado.
$D_{ii}$ : valor numérico que representa este concepto matemático.
rel $D_{ii}$ : array donde almacenamos el valor $D_{ii}$ de los genes con los que tiene relación.
inforel:array con los valores de similitud de cada relación
rel: array con el valor de la similitud que existe con los genes que está relacionado
relname: array con los nombres de los genes que posee relación.

Esta colección también almacenará un último documento diferente al resto. Su finalidad será servirnos como guía para obtener todos los genes y como propiedades tendrá un campo llamado “info” donde almacenaremos el valor “tutto” y otro llamado “tutto” donde almacenaremos un array con todos los genes que están almacenados en la base de datos.

La estructura de los genes va a quedar de la siguiente forma:

Los documentos que representen las clases genéticas los vamos almacenar en la colección collclassveloce. Estos documentos tendrán las siguientes propiedades:

_id: Campo hexadecimal utilizado como identificador del documento. Utilizaremos las id proporcionadas por el sistemas, explicadas en el apartado de Mongodb 3.2.
Nombre: Nombre de la clase analizada.
Gene: Array con los nombre de los genes que pertenecen a la clase
Genenoclass”: Array con los nombre de los genes que no pertenecen a la clase

Al igual que en el caso anterior, hemos decidido crear un documento especial donde almacenaremos el nombre de todas las clases del sistema. Este documento tendrá dos propiedades:

"tuttoclass" : que contendrá "infclass" almacenado.
infclass”: Array con los nombre de las clases almacenadas en el sistema.

Como resultado hemos obtenido la siguiente figura:

Por último hemos optado por crear una una nueva colección llamada correlation para almacenar en ella todas la información de las relaciones. Esta implementación la hemos utilizado para comprobar si gestionando documentos con propiedades simples es más rápido que gestionar los documentos con arrays.

La estructura de estos documentos está formada por las propiedades:

id: Campo hexadecimal utilizado como identificador del documento. Utilizaremos las id proporcionadas por el sistemas.
Nombre: Nombre de la conexión genética analizada. Estará formada por el nombre de los dos genes entre los que hay una relación, primero el saliente y después el entrante. Hemos optado por crear las relaciones en ambos sentidos, es decir si existe una conexión entre X e Y hemos creado dos documentos diferentes, uno llamado XY y otro YX.
Weight: Valor numérico que contiene el valor de similitud de la conexión genética
$D_{ii}$ : Valor numérico que contiene que correspondería a este campo matemático
Djj: Valor numérico que contiene que correspondería a este campo matemático

Para crear toda esta estructura necesitamos dos clases diferentes. La primera de ellas una clase llamada JSON donde hemos construido todos los métodos necesarios para construir objetos Json.

Para realizar la transformación de objetos json a Bson y realizar las consultas a la base de datos utilizamos la clase DataBaseMongoDB.

6.5.3 Validación cruzada

Las modificaciones que permiten adaptar este algoritmo (explicado en el apartado 6.2.2) a las bases de datos de MongoDb empiezan modificando los arrays positiveGenFold e negativeGenFold para gestionar String.

El método findNodesClassType() ya no realizará búsquedas en vértices de grafos . Esta función se encargará de realizar consultas en la base de datos para buscar los documentos que poseen propiedades otorgados por parámetros. Con el lenguaje nativo de MongoDb, simplemente buscaremos y obtendremos de la colección de clases el documento que representa a la clase analizada. De este documento le indicaremos que queremos obtener obtener las propiedades gene y genenoclass para devolver la lista de los genes positivos y negativos a esa clase.

Aparte de estos dos métodos la implementación de la validación cruzada se realizará en la clase Fold.

6.5.4 Score

En la clase Scoring las modificaciones están relacionadas con el sistema de consultas de MongoDB y sus diferentes implementaciones. Primero de todo hemos implementado 3 métodos especiales para conseguir el correcto funcionamiento de la función Score que son:

AvgScore(): Para utilizar este método en la implementación las modificaciones que deberemos realizar estarán relacionadas con la búsqueda de documentos y extracción de sus propiedades. El primer paso es localizar en la colección de documentos de genes el documento que representa nuestro gen. Para ello simplemente tendremos que comprobar que el campo “name” de nuestro documento coincide con el nombre de Gen. Una vez encontrado tendremos que extraer los campos “ $D_{ii}$ ”, “rel” y “info $D_{ii}$ ” y a partir de ellos extraer toda la información que nos proporcionan y realizar las operaciones matemáticas necesarias.
AvgScoreVersionRelaciones() Este método utiliza la colección relaciones para obtener la información de los genes conectados entre sí. Para obtenerla tenemos que buscar la relación y extraer la información, de un tamaño menos que el AvgScore()o el AvgScoreHashMap();
AvgScoreVersionHashMap():En esta implementación pasamos de utilizar estructuras del tipo arrayList para almacenar la información de las relaciones para utilizar HashMap. De este modo las consultas devuelven menos información y puede ser gestionada más eficientemente. Para llevar a cabo esta idea utilizaremos la información almacenada en la propiedad "rel".

6.5.5 Modelo de ejecución/navegación

Una vez descritos los posibles casos de uso que puede ejecutar el usuario junto con sus flujos de eventos vamos a describir la navegación que se puede realizar en la aplicación.

21 Modelo de ejecución/navegación de las bases de datos MongoDb

6.6. Implementación en Cassandra

El siguiente punto del proyecto es la implementación del algoritmo para Cassandra. Para realizar este experimento hemos utilizado la versión 2.0.6 obtenida del sitio oficial.

En los siguientes apartados explicaremos cómo hemos utilizado los diferentes mecanismos proporcionados por Cassandra para resolver la función Score basada en Kernels. Los subapartados que nos vamos a encontrar son los siguientes:

Diagrama de clases
Representación de la estructura de la base de datos
Utilización del método validación cruzada.
Implementación de la función Score.
Diagrama de actividades.

6.6.1 Diagrama de clase

6.22 Diagrama de clases para la aplicación de Cassandra

6.6.2 Representación de la estructura

Como habíamos expuesto en la sección 2.5, Cassandra tiene una estructura orientada a columnas y, por lo tanto, empezaremos definiendo como son estas estructura que posteriormente vamos a introducir en la base de datos. Las dividiremos en tres apartados según representen a genes, clases o relaciones.

Las column que representen a los genes los almacenaremos en la ColumnFamily llamada Genes y tendrán siempre los siguientes campos clave valor:

name: Nombre del gen analizado.Esta propiedad es la encargada de identificar la columna.
clases: Array donde almacenamos las clases en las que pertenece el gen representado.
$D_{ii}$ : valor númerico que representa este concepto matemático.
rel $D_{ii}$ : array donde almacenamos el valor matemático $D_{ii}$ de los genes con los que tiene relación.
inforel:array con los valores de similitud de cada relación
rel: array con el valor de la similitud que existe con los genes que está relacionado
relname: array con los nombres de los genes que posee relación.

La estructura de las columnas va a quedar de la siguiente forma:

Las columnas que representan las clases genéticas las vamos almacenar en la Column Family collclassveloce. Estos documentos tendrán las siguientes propiedades:

Nombre: Nombre de la clase analizada.
Gene: Array con los nombre de los genes que pertenecen a la clase
Genenoclass: Array con los nombre de los genes que no pertenecen a la clase

Como resultado hemos obtenido la siguiente figura:

Por último hemos optado por crear una una nueva ColumnFamily (llamada relaciones) para almacenar en ella todas la información sobre las conexiones entre genes. Esta implementación la hemos creado con la intención de comprobar si gestionando columnas con propiedades simples es más rápido que gestionando las columnas con arrays o conjuntos de grandes dimensiones.

La estructura de estos documentos está formada por las propiedades:

Nombre: Nombre de la conexión genética analizada. Estará formada por el nombre de los dos genes entre los que hay una relación, primero el saliente y después el entrante. Hemos optado por crear las relaciones en ambos sentidos, es decir si existe una conexión entre X e Y hemos creado dos documentos diferentes, uno llamado XY y otro YX. Esta propiedad es la encargada de identificar la columna.
Weight: Valor numérico que contiene el valor de similitud de la conexión genética
$D_{ii}$ : Valor numérico que contiene que correspondería a este campo matemático
Djj: Valor numérico que contiene que correspondería a este campo matemático

Para crear toda esta estructura necesitamos dos clases diferentes. Primero aprovechando la clase llamada JSON de MongoDb donde hemos construido todos los métodos necesarios para construir objetos Json.

Para realizar la transformación de objetos json a Bson y realizar las consultas a la base de datos utilizamos la clase ProgramCassandra.

6.6.3 Validación cruzada

Las modificaciones que permiten adaptar este algoritmo (explicado en el apartado 6.2.3) a las bases de datos de Cassandra son muy parecidos a los de MongoDb, como por ejemplo que los arrays positiveGenFold e negativeGenFold van a gestionar String.

El método findNodesClassType() Esta función se encargará de realizar consultas en la base de datos la columna con el nombre dado por parámetros. Con el CQL, simplemente buscaremos y obtendremos la columna que representa a la clase analizada. De esta columna le indicaremos que queremos obtener las filas que corresponden a gene y genenoclass para devolver el conjunto de los genes positivos y negativos a esa clase.

Aparte de estos dos métodos la implementación de la validación cruzada se realizará en la clase Fold.

6.6.4 Score

En la clase Scoring las modificaciones están relacionadas con el sistema de consultas de Cassandra y sus diferentes implementaciones . Primero de todo hemos implementado 3 métodos especiales para conseguir el correcto funcionamiento de la función Score que son:

AvgScore():El primer paso es localizar la columna que representa nuestro gen. Para ello simplemente tendremos que buscar la columna identificada con el nombre del gen. Una vez encontrado tendremos que extraer los campos “ $D_{ii}$ ”, “rel” y “info $D_{ii}$ ” y a partir de ellos extraer toda la información que nos proporcionan y realizar las operaciones matemáticas necesarias.
AvgScoreVersionRelaciones() A partir de las columnas de la Column Family relaciones para obtener la información de los genes conectados entre sí. Para obtenerla tenemos que buscar la relación y extraer la información, de un tamaño menor que el AvgScore()o el AvgScoreHashMap();
AvgScoreVersionHashMap():En esta implementación pasamos de utilizar estructuras del tipo arrayList para almacenar la información de las relaciones para utilizar HashMap. De este modo las consultas devuelven menos información y puede ser gestionada más eficientemente. Para llevar a cabo esta idea utilizaremos la información almacenada en la propiedad "rel".

6.6.5 Modelo de ejecución/navegación

Una vez descritos los posibles casos de uso que puede ejecutar el usuario junto con sus flujos de eventos vamos a describir la navegación que se puede realizar en la aplicación.

6.26 Modelo de ejecución/navegación de la aplicación en bases de datos Cassandra

6.7. Verificación de resultados

Para verificar todos los resultados obtenidos en las implementaciones realizadas, el profesor Marco Messeti nos ha proporcionado los resultados obtenidos por un algoritmo escrito en R para une serie de ejemplos. La siguiente tabla nos muestra los resultados para el sample1 una base de datos con 10 genes y dos clases.

6.27 Resultados de la función Score basado en Kernels

A partir del ejemplo, hemos ejecutado nuestros sistemas con las mismas pruebas para comparar los resultados y cómo podemos observar en todas las implementaciones hemos obtenido resultados idénticos:

Resultados de la función Score basado en Kernels para todos los sistemas
NoSQL — 6.28 Resultados de la función Score basado en Kernels para todos los sistemas NoSQL

6.8. Resultados

Una vez verificados todos los resultados en todos los sistemas y en todas las posibles implementaciones, hemos empezado con el siguiente punto de nuestro proyecto, comparar todos los programas y sus implementaciones para analizar cuales poseen rendimientos mejores o una mayor eficiencia en determinadas circunstancias.

Como hemos comentado ya, el mayor interés de nuestro proyecto es el análisis del rendimiento para la gestión de diferentes bases de datos con cantidades de información muy elevadas y, por lo tanto, no nos va importar el tiempo de creación de nuestro sistema o el tiempo que vamos a invertir devolviendo los resultados. El mayor interés lo vamos a encontrar en el tiempo que necesita una base de datos para responder las consultas que le enviemos. Con esta idea, hemos analizado todas las clases de cada programa y hemos observado que todas las consultas al sistema son realizadas por dos métodos: rankingAvg() y Fold(). Por lo tanto, hemos añadido a estos métodos para medir el tiempo y analizar el coste temporal de realizar estas consultas como muestra en la siguiente figura [6.29] :

6.29 Pseudocódigo función Ranking con las muestras del tiempo

Una vez hemos tenido claro donde mediremos el rendimiento del programa, vamos a exponer un poco qué ejemplos de bases de datos hemos utilizado para obtener las muestras. Nuestro interés ha sido dividir los tipos en tres grupos para después poder analizar los resultados de una forma más sencilla y eficaz. Los grupos que hemos creado han sido los siguientes:

Muestras de control: Son 6 ejemplos de bases de datos que han sido proporcionadas por el profesor Marco Mesiti. No siguen ningún patrón de tamaño y sirven para realizar comprobaciones de tamaño.
6.30 Muestras de verificación
Muestras con gran cantidada de genes y clases: Son 5 bases de datos genéticas con una gran cantidad de genes y muy pocas conexiones entre ellos y las clases. (cambiar el nombre de las muestras por el correcto)
6.31 Muestras con gran cantidad de Vertex

Muestras con gran cantidad de conexiones: Son 5 bases de datos genéticas con una gran cantidad de conexiones entre genes y una cantidad menor de vértices.
6.32 Muestras con gran cantidad de conexiones

6.8.1 Resultados Neo4J.

cprovechando la estructura que hemos generado con Neo4J , hemos realizado 5 ejecuciones diferentes en nuestro sistema aprovechando las diferentes posibilidades que nos ofrecían y explicaremos a continuación.

Implementación 1
Utilizamos los métodos findNodesClassTypelocal() y AvgScore() fusionamos dos estructuras diferentes. Utilizaremos por una parte la propiedad positive de los nodos que representan las clases para generar los folds mientras que para calcular el Average Score utilizaremos las clases relationships.

Implementación 2
Utilizamos las métodos findNodesClassType() y AvgScore() . Estos métodos utilizan la clase Relationships y Node para realizar las consultas.

Implementación 3
Utilizamos los métodos findNodesClassTypelocal() y AvgScoreVersiónLocal(). El primer método utiliza la propiedad positive de los nodos que representan las clases para generar los folds mientras que para calcular el Average Score utilizaremos la propiedad relacionesmap que tiene almacenados los vertex con los que está relacionado y el valor de similitud en forma de HashMap.

Implementación 4
Utilizamos las métodos findNodesClassType() y AvgScoreVersiónmap() fusionamos otra vez dos estructuras diferentes. Utilizaremos por una parte la clase relationships para generar los folds mientras que para calcular el Average Score utilizaremos la propiedad relacionesmap.

Implementación 5
Esta última implementación es un poco especial. Hemos utilizado la implementación findNodesClassTypelocal() ya explicada en los casos anteriores y además ahora hemos introducido dos propiedades “relname” y “relpesos” donde hay almacenados arrays con los nombre de los genes que están conectados y los valores de similitud.

Cambios en la implementación no previstos.

El primer problema que nos hemos encontrado es que las propiedades en Neo4J sólo pueden almacenar información en String. Es decir que para todas los cosas donde hemos implementado una versión basada en estas estructuras, si queríamos almacenar algún dato en formato Array o HashMap teníamos que transformarlo en String y después hacer una conversión cuando realizamos la consulta mediante métodos como split o Replece de la clase String.

Hemos desestimado después de las primeras pruebas con bases de datos de grandes dimensiones la opción de utilizar una propiedad “negative” en los nodos que generan las clases. Este hecho se ha producido después de que hayamos observado que la cantidad de memoria utilizada era demasiado grande y ralentizaba bastante cuando generaba los folds. Hemos sustituido esta implementación por otra que simplemente recorre todos los Nodos a través de un índice y comprueba si están contenidos en la propiedad positive.

6.8.2 Resultados MongoDb.

Aprovechando la estructura que hemos generado con MongoDb hemos realizado 3 ejecuciones diferentes en nuestro sistema aprovechando las diferentes posibilidades que nos ofrecían y explicaremos a continuación.

Implementación 1
Utilizamos el método getsimandrel() con la intención de realizar una única consulta nos devuelva una gran cantidad de información y no tengamos la necesidad de realizar más
Implementación 2
Utilizamos las métodos getHASHrel() y getelement() para obtener la información necesaria más específicamente. Realizamos más consultas pero las transacciones de información son más pequeñas
Implementación 3
Utilizamos la colección relaciones. Realizamos consultas mucho más rápidas pero manejamos menos información. El proceso de construcción de esta implementación y la cantidad de memoria que necesita el constructor es mucho más elevado.

Cambios en la implementación no previstos.

Al momento de crear la base de datos, mongoDb no ha soportado la cantidad de insets enviados por el cliente. Para intentar no saturar el sistema hemos introducido un sleep cada 5.000 inserts o updates, consiguiendo nuestro objetivo de evitar saturar el sistema.

6.8.3 Cassandra

Aprovechando la estructura que hemos generado con Cassandra explicado , hemos realizado 3 ejecuciones diferentes en nuestro sistema aprovechando las diferentes posibilidades que nos ofrecían y explicaremos a continuación.

Implementación 1
Utilizamos el método getsimandrel() con la intención de realizar una única consulta nos devuelva una gran cantidad de información y no tengamos la necesidad de realizar más
Implementación 2
Utilizamos las métodos getHASHrel() y getelement() para obtener la información necesaria más específicamente. Realizamos más consultas pero las transacciones de información son más pequeñas
Implementación 3
Utilizamos la colección relaciones. Realizamos consultas mucho más rápidas pero manejamos menos información. El proceso de construcción de esta implementación y la cantidad de memoria que necesita el constructor es mucho más elevado.

Respecto a los resultados hemos tenido un problema bastante grave. Hemos lanzado los ejemplos de base de datos pequeños como son el sample2 y el sample3 y no ha habido ningún problema,exceptuando el coste temporal que ha sido algo elevado (29 segundos y un minuto respectivamente para la primera implementación) . Pero cuando hemos lanzado el sample3 y la muestra7 hemos tenido serios problemas. El tiempo que ha necesitado para calcular y aplicar la función Ranking en el ejemplo 7 ha sido de 27 horas para la primera implementación. Después de este hecho, hemos cambiado de base de datos para calcular de nuevo esta función y en el sample3 cuando llevaba 48 lo hemos detenido.

Las conclusiones que hemos obtenido respecto a este sistema han sido muy negativas. Primero durante la creación de la base de datos hemos sobresaturado el servidor por qué enviamos más información por segundo para insertar que el propio servidor podía soportar. La solución ha sido introducir un sleep en el código y así dar tiempo a la base de datos para almacenar la información. Después para introducir la información con los comandos CQL era muy especial y teníamos que formatear todos los arrays y hasmap para que los aceptara a través de funciones string con alto coste computacional. Por último, hemos introducido cambios a la desesperada para intentar mejorar el rendimiento como desplazar el servidor a un ordenador personal un poco más potente, pero los tiempos continuaban siendo exagerados (utilizamos el sample3 y a las 10 horas de ejecución detuvimos la ejecución).

6.8.4 Resultados Titan Cassandra

Aprovechando la estructura que hemos generado con Neo4J, hemos realizado 4 ejecuciones diferentes en nuestro sistema aprovechando las diferentes posibilidades que nos ofrecían las estructuras creadas. Las implementaciones utilizadas son:

Implementación 1
Utilizamos el AvgScore() para aprovechar la información almacenada en las propiedades de los vértices y lo fusionamos con el método findGenesClassTypeEdge que utiliza la clase Edge para encontrar las clases de los elementos.
Implementación 2
Utilizamos las métodos findGenesClassType() y AvgScore(), ambos basados en el uso de propiedades.
Implementación 3
Utilizamos los métodos findGenesClassTypeEdge() y AvgScoreVersiónVertex(). Ambos basados en utilizar las clases Vertex y las clases Edge para obtener información y realizar cálculos.
Implementación 4
Utilizamos los métodos findGenesClassType() y AvgScoreVersiónVertex(). El método para crear los folds se basa en las propiedades almacenadas en las clases mientras que el Average Score utiliza la clase Vertex y Edge para realizar los cálculos.

Cambios en la implementación no previstos

Hemos desestimado después de las primeras pruebas con bases de datos de grandes dimensiones la opción de utilizar una propiedad “negative” en los Vertex que generen las clases. Este hecho se ha producido después de que hayamos observado que la cantidad de memoria utilizada era demasiado grande y ralentizaba bastante cuando generaba los folds. Hemos sustituido esta implementación por otra que simplemente recorre todos los Vertex a través de un índice y comprueba si están contenidos en la propiedad positive.

7. Conclusiones

7.1. Neo4J

Como podemos observar entre los diferentes diseños de implementación realizados en Neo4J, su mejor rendimiento se ha producido en la implementación más orientada a grafos (en comparación a otras más orientadas a las propiedades claves-valor). Estos resultados los ha repetido en todas las pruebas realizada con diferentes tamaños y cantidades de vértices .Las observaciones posteriores han reafirmado la eficiencia más elevada de una implementación utilizando el método getRelationships para aquellas bases de datos que tengan más relaciones. Esta causa es debida a una indexación optimizada para todas las conexiones entre nodos. Dado un nodo cualquiera, podemos obtener de forma muy veloz todos los nodos relacionados a través de sus conexiones.

Respecto a las versión más locales, cabe decir que Neo4J solo permite almacenar datos del tipo String o tipo Double en sus propiedades. Este hecho ralentiza mucho las versiones que utilicen propiedades de forma abundante (3,4,5), ya que, aunque el tiempo necesario para consultar una propiedad en nodo Nodo no es muy elevado, el hecho de no poder almacenar pares clave-valor con valores tipo lista o hasmap obliga al analista a leer y tratar esta información para realizar cálculos con ellas. Esto significa que debes a partir de una fila de texto generar la estructura que vas a utilizar utilizando funciones como split o Replece con alto coste temporal.

Para ver mejor las diferencia entre las implementaciones observa la figura [7.36] :

7.36 Resultados de la función Score basado en Kernels para el sistema Neo4J en todas las implementaciones

7.2 Titan Cassandra

Como podemos observar entre los diferentes diseños de implementación realizados en Titan Cassandra, su mejor rendimiento se ha producido en la implementación más local, es decir, aquella que utiliza las propiedades para obtener la información en las consultas. Estos resultados los ha repetido en todas las pruebas realizadas con diferentes tamaños y cantidades de vértices. En la documentación de Titan hallada en la web ya advertía sobre la ineficacia del sistema para grafos con muchas conexiones. Como no especificaba una cantidad exacta de conexiones a partir de los cuales el rendimiento empiece a escasear, hemos intentando probar con grafos con pocas conexiones en nuestros ejemplos pero el rendimiento obtenido ha dejado mucho que desear.

A pesar de ser un sistema muy orientado a la distribución en diferentes computadores, hemos podido observar como la integración entre Cassandra y Titan para crear un gestor híbrido de bases de datos orientada a grafos ha sido aproximadamente 5 veces más lento en su versión más óptima que la utilizada por un gestor de grafos puro como Neo4J.

También otro aspecto a comentar, aunque nuestras pruebas no miden este valor, son los tiempos de creación de una base de datos Titan Cassandra. Para crear una nueva base de datos como el Sample3 y la muestra11 ha llegado a sobrepasar las 24 horas, mientras que en Neo4J por ejemplo el caso más costoso ha sido creado en 50 minutos como máximo.

Todos los resultados obtenidos se pueden compararse en la siguiente figura [7.37] :

Resultados de la función Score basado en Kernels para el sistema Titan
Cassandra en todas las implementaciones — 7.37 Resultados de la función Score basado en Kernels para el sistema Titan Cassandra en todas las implementaciones

7.3. MongoDb

Como podemos observar entre los diferentes diseños de implementación realizados en MongoDb, su mejor rendimiento se ha producido en la implementación tanto utilizando los documentos de la colección relaciones como a través de la colección genes y sus propiedades.

La implementación 2 ha utilizado métodos de la librería String para elaborar el algoritmo y, usar sus métodos métodos, genera una pérdida muy grande de prestaciones. Aparte de eso, en la documentación también advierten de una correlación entre la memoria ram disponible del ordenador con la velocidad de acceso a los documentos. Esto quiere decir que si invertimos mas ram en el computador funcionaria más rápidamente la implementación 3 que la 1, es decir, encontraríamos antes un documento en la colección relaciones y podríamos extraer la información en una consulta muy pequeñas que si lo comparamos en la implementación 1 con menos documentos donde buscar pero consultas más grandes y que necesitan más tiempo.

Un pequeño detalle, es que la implementación 3, cuando ha tenido menos carga de documentos, es decir, en las muestras 11,10 y 9 ha sido más rápido que la implementación 1 con mucha más cantidad de vértices.Dicho todo esto los resultados los hemos expuesto en la siguiente figura.

Resultados de la función Score basado en Kernels para el sistema MongoDb en
todas las implementaciones — 7.38 Resultados de la función Score basado en Kernels para el sistema MongoDb en todas las implementaciones

7.4 Cassandra

Respecto a Cassandra la verdad no ha cumplido las expectativas. Teóricamente una base de datos NoSQL está diseñada para ser soportada por ordenadores sin grandes recursos hardware pero un ordenador personal no ha conseguido soportar este sistema en condiciones.

Según las investigaciones realizadas, cassandra (a parte de necesitar una cantidad de recursos muy elevada) cuando se implementa en local actúa igual como si estuviera siendo un sistema distribuido, por lo tanto si para hacer un insert o realizar una consulta realiza todas las comprobaciones correspondientes a los sistemas distribuidos en un ordenador con recursos muy limitados generan tiempos de lecturas y tiempos de escritura demasiados elevados.

Por lo tanto, este sistema queda totalmente desestimado.

7.5. Resultados en común

Al comparar todos los sistemas en los cuales hemos obtenidos resultados analizables, observamos que los gestores de bases de datos orientados a grafos son los sistemas que han proporcionado mejores prestaciones para realizar los cálculos más rápidamente. Por ejemplo con Neo4J hemos obtenido unos resultados entre 6 y 7 veces más veloces que en MongoDb.

La razón de estos hechos, puede ser debida a la utilización de un ordenador personal con una memoria ram no muy elevada. Buscando información acerca de este aspecto en la documentación de MongoDb hemos comprobado comentarios de los usuarios están exponiendo problemas de lentitud en las consultas para servidores con una memoria ram no muy elevada. Exponen que para utilizar la indexación de los documentos y encontrar la información necesaria es necesario una cantidad de ram elevada para funcionar a un nivel veloz. Aparte de esto, el sistema está hecho más para bases de datos distribuidas, es decir, situaciones con escalabilidad horizontal donde se puede compartir recursos entre las diferentes máquinas.

Realmente ha ocurrido un caso similar con Cassandra, pero mucho más extremo. Este sistema de gestión de base de datos está diseñado para utilizar y dividir la información en diversos nodos y, de este modo, poder aprovechar los recursos de cada una de las máquinas. Cuando hemos intentado implementar una versión local de esta base de datos, no hemos podido obtener información de forma muy optimizada y eficiente porque realizaba los mismos pasos analizados en la figura 2.11 para sistemas distribuidos utilizando una única máquina, es decir, realiza mucho trabajo innecesario. Este hecho ha impedido realizar consultas de forma masiva a la base de datos ya que los tiempos de espera eran demasiado elevados.

Por último, comparar Neo4J y Titan Cassandra también nos resulta muy útil. Titan es un gestor de grafo muy deficiente. En la documentación básica que hemos analizado en el punto 2.5 ya hemos comentado el hecho que Titan no puede gestionar eficientemente los grafos con una cantidad muy elevada de aristas. La solución que hemos intentado implementar fue utilizar la clase Vertex fusionada con el motor de base de datos de Cassandra, una versión antigua adaptada para Titan que permite almacenar las propiedades de los nodos de forma rápida. Se considero que si implementabamos una versión más local, sin utilizar edges y almacenado la información de las aristas en los propoios Nodos podriamos llegar a obtener unos resultados del estilo de Neo4J. Pero no fue así. El rendimiento obtenido es de unas 5 o 6 veces más lento que Neo4J en los casos más grandes.

Respecto a Neo4J cabe mencionar que de entre todas las versiones disponibles en el site, hemos buscado aquella que nos ofrecía más rendimiento en modo local ya que por ejemplo a partir de la versión 2.0 empieza a estar más diseñada para versiones distribuidas y por lo tanto no sería tan eficiente para nuestros experimentos. Otro punto importante a comentar como conclusión es la impotencias de no poder almacenar solamente doubles y strings en los Nodos. Si nuestro experimento necesitará almacenar otro tipo de información, deberíamos transformar la estructura a string y después realizar las conversiones que deseamos, con un coste computacional muy elevado.

Por último presentamos una comparación de resultados, para ver claramente las diferencias de tiempo para cada implementación. Recordamos que en la sección 6.3 de este capítulo pueden encontrar el tamaño de las bases de datos utilizadas.

Resultados de la función Score basado en Kernels los sistemas Neo4J, Titan
Cassandra y MongoDb — 7.39 Resultados de la función Score basado en Kernels los sistemas Neo4J, Titan Cassandra y MongoDb

7.6 Conclusión final

Menos Cassandra, el resto de sistemas han conseguido realizar de una forma más o menos rápida, aplicar la función ranking a grandes bases de datos.

Las conclusiones, para una base de datos local, siempre será más fácil de gestionar y de realizar las consultas en una base de datos orientadas a grafos. En los diferentes diagramas mostrados en el proyecto siempre hemos seguido como referencia estas estructuras para crear después los documentos o las columnas de otros sistemas, ya que es mucho más fácil de entender un grafo que no un documento JSON o una column.

Aparte del detalle del rendimiento, los lenguajes nativos utilizados en todos los sistema han sido muy fáciles de entender, pero siempre resultaba más fácil realizar los cálculos o buscar comprobaciones en Neo4J que en Cassandra por ejemplo. Además actualmente en el mercado podemos encontrar muchas aplicaciones que ayudan a gestionar estos sistemas orientados a grafos.

Respecto a la comparación entre los sistema NoSQL y otro MYSQL, aunque las bases de datos MYSQL tienen mejores prestaciones en versiones locales, son muy difíciles de gestionar y requieren un nivel avanzado para aplicar algoritmos complejos. En cambio las bases de datos Neo4J simplifican mucho las cosas y permiten gente iniciada en la informática poder realizar tareas más fácilmente.

7.7. Trabajo Futuro

En este último apartado de la memoria, veremos qué extensiones se podrían realizar al proyecto desarrollado. Una de las mejoras que se debería realizar es la la distribución de las bases de datos en varios servidores y compara entonces sus rendimientos

Otras mejoras que se podrían añadir sería comparar el rendimiento de Neo4J con las bases de datos SQL.