Apache Cassandra and CQL: data modeling and query examples
Cassandra is designed for distributed writes, high availability and queries known in advance. The most important lesson is that the table is modeled from the access pattern, not from a normalized entity diagram.
Partition key
Defines where data lives in the cluster and which queries are efficient.
Clustering columns
Define order inside a partition and support range-like reads.
Query-first design
Duplicate data when needed so each important query has a table that serves it.
Careful filtering
ALLOW FILTERING is useful for learning, but often a warning sign in production.
Introduction to Cassandra. CQL query examples
The aim of this article is to manage information about courses and students of the master's degree in Hispanic philology in a Cassandra database:
Courses
IdC
Title
Credits
Professor
101
Romance philology
7
Sara Martinez
102
Modern Greek
7
Daniel Perez
103
Modern literature
10
Amalia Sierra
104
Hispanic-American literature
5
Paloma Sanchez
105
Phonetics and morphology
5
Miguel Hernandez
106
Literary theories
3
Juan Garcia
107
General linguistics
3
Samuel Lopez
108
Modern Spanish
10
Isabel Sanz
Student
IdE
Student name
201
Maria Mestre
202
Rodrigo Calvo
203
Oriol Menezes
204
Carlo Berruzo
205
Sofia Canyadell
206
Marina Perez
207
Arianna Ruiz
208
Naiara Zapico
Grades
IdE
IdC
Exam grade
Exam date
201
101
B
2019-02-11
201
102
B
2018-06-30
202
106
C
2018-06-30
203
103
A
2019-02-11
203
108
A
2018-06-30
203
105
B
2020-01-04
204
101
B
2020-01-04
205
103
C
2020-01-04
1. Retrieve the information of the courses, that is, course title, credits, professor name and student name, attended by a specific student and ordered by course title.
IdE
IdC
Title
Credits
Professor
Student name
201
101
Romance philology
7
Sara Martinez
Maria Mestre
201
102
Modern Greek
7
Daniel Perez
Maria Mestre
202
106
Literary theories
3
Juan Garcia
Rodrigo Calvo
203
103
Modern literature
10
Amalia Sierra
Oriol Menezes
203
105
Phonetics and morphology
5
Miguel Hernandez
Oriol Menezes
203
108
Modern Spanish
10
Isabel Sanz
Oriol Menezes
204
101
Romance philology
7
Sara Martinez
Carlo Berruzo
205
103
Hispanic-American literature
5
Paloma Sanchez
Sofia Canyadell
205
104
Hispanic-American literature
5
Paloma Sanchez
Sofia Canyadell
205
108
Modern Spanish
10
Isabel Sanz
Sofia Canyadell
206
106
Literary theories
3
Juan Garcia
Marina Perez
206
107
General linguistics
3
Samuel Lopez
Marina Perez
207
107
General linguistics
3
Samuel Lopez
Arianna Ruiz
208
104
Hispanic-American literature
5
Paloma Sanchez
Naiara Zapico
As can be seen, to avoid problems with duplicated names I added the fields IdE and IdC to the query, even though they were not requested in the statement. With those fields I can correctly distinguish two students with the same name or two courses with the same title. Due to the requirements of the following sections, I decided to use the primary key `idE_estudiante`, `titulo_curso` and `idC_curso`. The main key, `idE_estudiante`, allows me to group the data by that value and therefore determine the course with the maximum number of credits in which each student enrolled. The next key used was `titulo_curso`, and in this way I also satisfy the ordering requirement stated in the exercise.
1 Result of block ejercicio_1
2. Retrieve the information of the courses in which there were students who obtained a specific grade. The statement asks for course title, credits, professor name, student identifier and grade, ordered by the number of course credits.
Credits
Exam grade
IdE
Title
Professor
7
B
201
Romance philology
Sara Martinez
7
B
201
Modern Greek
Daniel Perez
3
C
202
Literary theories
Juan Garcia
10
A
203
Modern literature
Amalia Sierra
5
B
203
Phonetics and morphology
Miguel Hernandez
10
A
203
Modern Spanish
Isabel Sanz
7
B
204
Romance philology
Sara Martinez
10
C
205
Modern literature
Paloma Sanchez
5
B
205
Hispanic-American literature
Paloma Sanchez
10
A
205
Modern Spanish
Isabel Sanz
3
B
206
Literary theories
Juan Garcia
3
B
206
General linguistics
Samuel Lopez
3
A
207
General linguistics
Samuel Lopez
5
C
208
Hispanic-American literature
Paloma Sanchez
In this point we repeat the previous process. To avoid data duplication issues and still include all the information, it is necessary to use `idE_estudiante` and `titulo_curso` as the primary key of the collection. In this case, I do not consider it important to distinguish courses by identifier because what really matters is the course title and its credits. The order of the keys will be: `nota_exam`, `creditos_curso`, `idE_estudiante` and `titulo_curso`. The main key, `nota_exam`, allows me to solve the first requirement of the statement appropriately. In that sense, the next key used was `creditos_curso`, which solves the ordering requirement indicated in the statement. The result can be evaluated in the following image: 2 Result of block ejercicio_b
3. Starting from the two column families created in sections A) and B), generate CQL queries and show the results obtained in each one:
Retrieve all courses with students who received a 'C', ordered descending by number of credits.
The query is very simple for the `ejercicio_b` collection. We use its primary key to obtain the desired information and then sort the data as required by the statement. 3 Cassandra result for section C, part 1
Obtain the maximum number of credits obtained by each student from a subject.
At this point we use the `ejercicio_a` collection. Since it is ordered by the primary key `idE_estudiante`, it is possible to perform a `group by` on that value and obtain the desired query. 4 Cassandra answer for section C, part 2 I add the student's name, thereby satisfying the requirements of the statement. 5 Cassandra answer for section C, part 2 with the student's name added
Retrieve the information for the course "Modern Literature" without using the ALLOW FILTERING clause. Knowing that `titulo_curso` is not part of the family keys of this collection and that it is part of a `WHERE` condition, it is necessary to define an index on that field. In that way we can avoid using the `ALLOW FILTERING` command mentioned in the statement.
6 Cassandra answer for section C, part 3
Change the professor name Daniel Perez to Xesus Cortinhas.
At this point we need to make modifications in both collections. To do so, we perform the following procedure in table `ejercicio_a`:
Search for the affected rows.
Obtain the primary keys that identify the displayed records: `idc_curso in (102) and ide_estudiante in (201) and titulo_curso='Griego moderno'`.
Verify that the modification was successful.
7 Cassandra answer for exercise 4a. Table ejercicio_a
To modify collection `ejercicio_b`, we follow this procedure:
Search for the affected rows.
Obtain the primary keys that identify the displayed records: `titulo_curso='Griego moderno' and creditos_curso=7 and ide_estudiante=201 and nota_exam='B'`.
Verify that the modification was successful.
8 Cassandra answer for exercise 4b. Table ejercicio_b
Delete the information related to the student "Rodrigo Calvo".
As in the previous point, we have to make changes in both collections. For that reason, we perform the following procedure in collection `ejercicio_a`:
Search for the affected rows.
Obtain the primary keys that identify the displayed records: `idc_curso=106 and ide_estudiante=202 and titulo_curso='Teorias literarias'`, and execute the delete.
Verify that the deletion was successful.
9 Cassandra answer for exercise 5a. Table ejercicio_a
We now carry out the following procedure in collection `ejercicio_b`:
Search for the affected rows.
Obtain the primary keys that identify the displayed records: `nota_exam='C' and creditos_curso=3 and ide_estudiante=202 and titulo_curso='Teorias literarias'`, and execute the delete.
Verify that the deletion was successful.
10 Cassandra answer for exercise 5b. Table ejercicio_b