NoSQL · Cassandra · CQL

Apache Cassandra and CQL: data modeling and query examples

Cassandra is designed for distributed writes, high availability and queries known in advance. The most important lesson is that the table is modeled from the access pattern, not from a normalized entity diagram.

Partition key

Defines where data lives in the cluster and which queries are efficient.

Clustering columns

Define order inside a partition and support range-like reads.

Query-first design

Duplicate data when needed so each important query has a table that serves it.

Careful filtering

ALLOW FILTERING is useful for learning, but often a warning sign in production.

Introduction to Cassandra. CQL query examples

The aim of this article is to manage information about courses and students of the master's degree in Hispanic philology in a Cassandra database:

Courses

IdCTitleCreditsProfessor
101Romance philology7Sara Martinez
102Modern Greek7Daniel Perez
103Modern literature10Amalia Sierra
104Hispanic-American literature5Paloma Sanchez
105Phonetics and morphology5Miguel Hernandez
106Literary theories3Juan Garcia
107General linguistics3Samuel Lopez
108Modern Spanish10Isabel Sanz

Student

IdEStudent name
201Maria Mestre
202Rodrigo Calvo
203Oriol Menezes
204Carlo Berruzo
205Sofia Canyadell
206Marina Perez
207Arianna Ruiz
208Naiara Zapico

Grades

IdEIdCExam gradeExam date
201101B2019-02-11
201102B2018-06-30
202106C2018-06-30
203103A2019-02-11
203108A2018-06-30
203105B2020-01-04
204101B2020-01-04
205103C2020-01-04

1. Retrieve the information of the courses, that is, course title, credits, professor name and student name, attended by a specific student and ordered by course title.

IdEIdCTitleCreditsProfessorStudent name
201 101 Romance philology 7 Sara Martinez Maria Mestre
201 102 Modern Greek 7 Daniel Perez Maria Mestre
202 106 Literary theories 3 Juan Garcia Rodrigo Calvo
203 103 Modern literature 10 Amalia Sierra Oriol Menezes
203 105 Phonetics and morphology 5 Miguel Hernandez Oriol Menezes
203 108 Modern Spanish 10 Isabel Sanz Oriol Menezes
204 101 Romance philology 7 Sara Martinez Carlo Berruzo
205 103 Hispanic-American literature 5 Paloma Sanchez Sofia Canyadell
205 104 Hispanic-American literature 5 Paloma Sanchez Sofia Canyadell
205 108 Modern Spanish 10 Isabel Sanz Sofia Canyadell
206 106 Literary theories 3 Juan Garcia Marina Perez
206 107 General linguistics 3 Samuel Lopez Marina Perez
207 107 General linguistics 3 Samuel Lopez Arianna Ruiz
208 104 Hispanic-American literature 5 Paloma Sanchez Naiara Zapico

As can be seen, to avoid problems with duplicated names I added the fields IdE and IdC to the query, even though they were not requested in the statement. With those fields I can correctly distinguish two students with the same name or two courses with the same title.

Due to the requirements of the following sections, I decided to use the primary key `idE_estudiante`, `titulo_curso` and `idC_curso`. The main key, `idE_estudiante`, allows me to group the data by that value and therefore determine the course with the maximum number of credits in which each student enrolled.

The next key used was `titulo_curso`, and in this way I also satisfy the ordering requirement stated in the exercise.

1 Result of block ejercicio_1

2. Retrieve the information of the courses in which there were students who obtained a specific grade. The statement asks for course title, credits, professor name, student identifier and grade, ordered by the number of course credits.

CreditsExam gradeIdETitleProfessor
7 B 201 Romance philology Sara Martinez
7 B 201 Modern Greek Daniel Perez
3 C 202 Literary theories Juan Garcia
10 A 203 Modern literature Amalia Sierra
5 B 203 Phonetics and morphology Miguel Hernandez
10 A 203 Modern Spanish Isabel Sanz
7 B 204 Romance philology Sara Martinez
10 C 205 Modern literature Paloma Sanchez
5 B 205 Hispanic-American literature Paloma Sanchez
10 A 205 Modern Spanish Isabel Sanz
3 B 206 Literary theories Juan Garcia
3 B 206 General linguistics Samuel Lopez
3 A 207 General linguistics Samuel Lopez
5 C 208 Hispanic-American literature Paloma Sanchez

In this point we repeat the previous process. To avoid data duplication issues and still include all the information, it is necessary to use `idE_estudiante` and `titulo_curso` as the primary key of the collection. In this case, I do not consider it important to distinguish courses by identifier because what really matters is the course title and its credits.

The order of the keys will be: `nota_exam`, `creditos_curso`, `idE_estudiante` and `titulo_curso`. The main key, `nota_exam`, allows me to solve the first requirement of the statement appropriately.

In that sense, the next key used was `creditos_curso`, which solves the ordering requirement indicated in the statement.

The result can be evaluated in the following image:

2 Result of block ejercicio_b

3. Starting from the two column families created in sections A) and B), generate CQL queries and show the results obtained in each one:

  • Retrieve all courses with students who received a 'C', ordered descending by number of credits.
    The query is very simple for the `ejercicio_b` collection. We use its primary key to obtain the desired information and then sort the data as required by the statement.

    3 Cassandra result for section C, part 1

  • Obtain the maximum number of credits obtained by each student from a subject.

    At this point we use the `ejercicio_a` collection. Since it is ordered by the primary key `idE_estudiante`, it is possible to perform a `group by` on that value and obtain the desired query.

    4 Cassandra answer for section C, part 2

    I add the student's name, thereby satisfying the requirements of the statement.

    5 Cassandra answer for section C, part 2 with the student's name added

  • Retrieve the information for the course "Modern Literature" without using the ALLOW FILTERING clause.

    Knowing that `titulo_curso` is not part of the family keys of this collection and that it is part of a `WHERE` condition, it is necessary to define an index on that field. In that way we can avoid using the `ALLOW FILTERING` command mentioned in the statement.

    6 Cassandra answer for section C, part 3

  • Change the professor name Daniel Perez to Xesus Cortinhas.

    At this point we need to make modifications in both collections. To do so, we perform the following procedure in table `ejercicio_a`:
    • Search for the affected rows.
    • Obtain the primary keys that identify the displayed records: `idc_curso in (102) and ide_estudiante in (201) and titulo_curso='Griego moderno'`.
    • Verify that the modification was successful.

    7 Cassandra answer for exercise 4a. Table ejercicio_a

    To modify collection `ejercicio_b`, we follow this procedure:
    • Search for the affected rows.
    • Obtain the primary keys that identify the displayed records: `titulo_curso='Griego moderno' and creditos_curso=7 and ide_estudiante=201 and nota_exam='B'`.
    • Verify that the modification was successful.

    8 Cassandra answer for exercise 4b. Table ejercicio_b

  • Delete the information related to the student "Rodrigo Calvo".
    As in the previous point, we have to make changes in both collections. For that reason, we perform the following procedure in collection `ejercicio_a`:
    • Search for the affected rows.
    • Obtain the primary keys that identify the displayed records: `idc_curso=106 and ide_estudiante=202 and titulo_curso='Teorias literarias'`, and execute the delete.
    • Verify that the deletion was successful.

    9 Cassandra answer for exercise 5a. Table ejercicio_a

    We now carry out the following procedure in collection `ejercicio_b`:
    • Search for the affected rows.
    • Obtain the primary keys that identify the displayed records: `nota_exam='C' and creditos_curso=3 and ide_estudiante=202 and titulo_curso='Teorias literarias'`, and execute the delete.
    • Verify that the deletion was successful.

    10 Cassandra answer for exercise 5b. Table ejercicio_b