SQL queries

The SQL SELECT statement is a tool for constructing queries which extract data from a database. We will show various ways in which this statement can be used. For demonstration purposes we will be working with a small SQLite database which can be downloaded here:

To execute SQL code in Jupyter Notebook we will use the ipython-sql notebook extension. Assuming that this extension is already installed (see SQL Basics for installation instructions), the code below will load it:

[1]:
%load_ext sql

Next, we connect to the gradebook.sqlite database:

[2]:
%%sql
sqlite:///gradebook_data.sqlite

This database consists of a single table gradebook, with records (names, exam scores etc.) of students enrolled in a college course. The whole table looks as follows:

[3]:
%%sql
SELECT *
FROM gradebook
 * sqlite:///gradebook_data.sqlite
Done.
[3]:
student_id first_name last_name major exam1 exam2 exam3
2712 Starr Bachand computer science None 27 85
3548 Marcia Hinkel biology 79 10 41
3877 Chu Flax mathematics 72 None 29
5108 Shawnna Coate computer science 79 96 61
5775 Rosina Hovey biology 96 13 99
5787 Tenisha Rippel computer science 27 54 79
6245 Adelina Brummitt mathematics None 67 99
7138 Fran Mount biology 71 5 80
7256 Rosalinda Trippe computer science None None 82
8911 Lottie Champine computer science None None 62
9165 Ludivina Hoskin mathematics 27 55 None
9346 Stanton Mcgaugh computer science 6 60 40
9667 Jonnie Hillock mathematics 56 24 72

SELECT … FROM …

The code above shows the simplest application of the SELECT statement: it can be used to retrieve the entire content of a database table. The complete syntax of this statement, however, can be much more complex:

SELECT [DISTINCT] column_list
FROM table_list
[WHERE condition]
[GROUP BY column [HAVING conditon]]
[ORDER BY column  [ASC | DESC]]
[LIMIT count OFFSET offset]

All parts in [square brackets] are optional. The only mandatory parts are SELECT followed a list of columns we want to retrieve, and FROM followed by names of tables where these columns come from. In the example below we use just these two keywords to extract a few columns from the gradebook table:

[5]:
%%sql
SELECT first_name, last_name, exam3
FROM gradebook;
 * sqlite:///gradebook_data.sqlite
Done.
[5]:
first_name last_name exam3
Rosina Hovey 99
Stanton Mcgaugh 40
Jonnie Hillock 72
Chu Flax 29
Rosalinda Trippe 82
Fran Mount 80
Shawnna Coate 61
Marcia Hinkel 41
Starr Bachand 85
Lottie Champine 62
Adelina Brummitt 99
Tenisha Rippel 79
Ludivina Hoskin None

In some cases it may be convenient to give columns aliases instead of using their original names. This can be accomplished as follows:

[6]:
%%sql
SELECT first_name AS FName, last_name AS LName, exam3 AS Last_Exam
FROM gradebook;
 * sqlite:///gradebook_data.sqlite
Done.
[6]:
FName LName Last_Exam
Rosina Hovey 99
Stanton Mcgaugh 40
Jonnie Hillock 72
Chu Flax 29
Rosalinda Trippe 82
Fran Mount 80
Shawnna Coate 61
Marcia Hinkel 41
Starr Bachand 85
Lottie Champine 62
Adelina Brummitt 99
Tenisha Rippel 79
Ludivina Hoskin None

We can use the SELECT DISTINCT ... syntax to retrieve all distinct values in given columns. As an example, we can use it to get a list of majors listed in the gradebook table:

[7]:
%%sql
SELECT DISTINCT major
FROM gradebook;
 * sqlite:///gradebook_data.sqlite
Done.
[7]:
major
biology
computer science
mathematics

Instead of extracting data as it appears in a table, we can modify it in various ways. For example, SQLite provides the || operator which concatenates strings. We can use it to combine columns with first and last names into a single column:

[8]:
%%sql
SELECT last_name || ", " || first_name, exam1, exam2, exam3
FROM gradebook;
 * sqlite:///gradebook_data.sqlite
Done.
[8]:
last_name || ", " || first_name exam1 exam2 exam3
Hovey, Rosina 96 13 99
Mcgaugh, Stanton 6 60 40
Hillock, Jonnie 56 24 72
Flax, Chu 72 None 29
Trippe, Rosalinda None None 82
Mount, Fran 71 5 80
Coate, Shawnna 79 96 61
Hinkel, Marcia 79 10 41
Bachand, Starr None 27 85
Champine, Lottie None None 62
Brummitt, Adelina None 67 99
Rippel, Tenisha 27 54 79
Hoskin, Ludivina 27 55 None

The default name of a column created using some operation is the code specifying the operation. Aliases let us replace it with a more meaningful name:

[39]:
%%sql
SELECT last_name || ", " || first_name AS full_name, exam1, exam2, exam3
FROM gradebook;
 * sqlite:///gradebook_data.sqlite
Done.
[39]:
full_name exam1 exam2 exam3
Hovey, Rosina 96 13 99
Mcgaugh, Stanton 6 60 40
Hillock, Jonnie 56 24 72
Flax, Chu 72 None 29
Trippe, Rosalinda None None 82
Mount, Fran 71 5 80
Coate, Shawnna 79 96 61
Hinkel, Marcia 79 10 41
Bachand, Starr None 27 85
Champine, Lottie None None 62
Brummitt, Adelina None 67 99
Rippel, Tenisha 27 54 79
Hoskin, Ludivina 27 55 None

NULL values and COALESCE

Lets say that we want to calculate for each student the sum of scores from the three exams. We can try to do it as follows:

[41]:
%%sql
SELECT first_name, last_name, exam1 + exam2 +  exam3 AS exam_total
FROM gradebook;
 * sqlite:///gradebook_data.sqlite
Done.
[41]:
first_name last_name exam_total
Rosina Hovey 208
Stanton Mcgaugh 106
Jonnie Hillock 152
Chu Flax None
Rosalinda Trippe None
Fran Mount 156
Shawnna Coate 236
Marcia Hinkel 130
Starr Bachand None
Lottie Champine None
Adelina Brummitt None
Tenisha Rippel 160
Ludivina Hoskin None

In some cases this works as expected, but several rows show None in the exam_total column. The Nonetable entries correspond to the SQL NULL value which signifies missing data. In the original gradebook table several exam scores have the NULL value, which may indicate e.g. that some students have not taken an exam. Any number added toNULL results in NULL (since it is not possible to determine what the missing data stands for), which results in NULL values in the column exam_total.

In our example, it may be sensible to treat NULL exam scores as zeros for the purpose of computing the total score. This can be accomplished using the COALESCE function. This function takes a sequence of values as arguments and returns the first value which is not NULL (or NULL if there is no such value). For example, COALESCE(exam1, 0) will return either the value of the column exam1 or 0, depending on whether exam1 is or is not NULL:

[11]:
%%sql
SELECT first_name, last_name, COALESCE(exam1, 0) AS exam1_modified
FROM gradebook;
 * sqlite:///gradebook_data.sqlite
Done.
[11]:
first_name last_name exam1_modified
Rosina Hovey 96
Stanton Mcgaugh 6
Jonnie Hillock 56
Chu Flax 72
Rosalinda Trippe 0
Fran Mount 71
Shawnna Coate 79
Marcia Hinkel 79
Starr Bachand 0
Lottie Champine 0
Adelina Brummitt 0
Tenisha Rippel 27
Ludivina Hoskin 27

Using COALESCE we can improve the query which computes the total score from the three exams:

[12]:
%%sql
SELECT first_name, last_name, COALESCE(exam1, 0) +  COALESCE(exam2, 0) + COALESCE(exam3, 0) AS exam_total
FROM gradebook;
 * sqlite:///gradebook_data.sqlite
Done.
[12]:
first_name last_name exam_total
Rosina Hovey 208
Stanton Mcgaugh 106
Jonnie Hillock 152
Chu Flax 101
Rosalinda Trippe 82
Fran Mount 156
Shawnna Coate 236
Marcia Hinkel 130
Starr Bachand 112
Lottie Champine 62
Adelina Brummitt 166
Tenisha Rippel 160
Ludivina Hoskin 82

The CASE expression

Lets say that we want to assign letter grades based on exam 3 scores: ‘Great’ for scores above 80 points, ‘Pass’ for scores between 80 and 50 points, and ‘Fail’ for all other scores. This can be accomplished using the CASE expression:

[13]:
%%sql
SELECT first_name, last_name, exam3,
CASE
    WHEN exam3 > 80 THEN 'Great'
    WHEN exam3 > 50 THEN 'Pass'
    ELSE "Fail"
END AS grade
FROM gradebook;
 * sqlite:///gradebook_data.sqlite
Done.
[13]:
first_name last_name exam3 grade
Rosina Hovey 99 Great
Stanton Mcgaugh 40 Fail
Jonnie Hillock 72 Pass
Chu Flax 29 Fail
Rosalinda Trippe 82 Great
Fran Mount 80 Pass
Shawnna Coate 61 Pass
Marcia Hinkel 41 Fail
Starr Bachand 85 Great
Lottie Champine 62 Pass
Adelina Brummitt 99 Great
Tenisha Rippel 79 Pass
Ludivina Hoskin None Fail

WHERE

We can use the WHERE clause of the SELECT statement to specify which rows of a table should be retrieved. For example, below we use it to select students who received at least 70 points on exam 1:

[14]:
%%sql
SELECT *
FROM gradebook
WHERE exam1 > 70;
 * sqlite:///gradebook_data.sqlite
Done.
[14]:
student_id first_name last_name major exam1 exam2 exam3
5775 Rosina Hovey biology 96 13 99
3877 Chu Flax mathematics 72 None 29
7138 Fran Mount biology 71 5 80
5108 Shawnna Coate computer science 79 96 61
3548 Marcia Hinkel biology 79 10 41

Logical conditions can be combined using AND, OR and NOT operators. Below we select students with scores above 50 points on each of the three exams:

[15]:
%%sql
SELECT *
FROM gradebook
WHERE exam1 > 50 AND exam2 > 50 AND exam3 > 50;
 * sqlite:///gradebook_data.sqlite
Done.
[15]:
student_id first_name last_name major exam1 exam2 exam3
5108 Shawnna Coate computer science 79 96 61

Comparing NULL values

In some cases we may be interested in retrieving records which have (or do not have) the NULL value in some column. The conditions column = NULL and column != NULL will not work, since it is not possible to determine if something has the same value as the value of unknown data. Instead, we can use the conditions column IS NULL and column IS NOT NULL.

As an example, below we select records of students with NULL value in the exam1 column:

[16]:
%%sql
SELECT *
FROM gradebook
WHERE exam1 IS NULL;
 * sqlite:///gradebook_data.sqlite
Done.
[16]:
student_id first_name last_name major exam1 exam2 exam3
7256 Rosalinda Trippe computer science None None 82
2712 Starr Bachand computer science None 27 85
8911 Lottie Champine computer science None None 62
6245 Adelina Brummitt mathematics None 67 99

LIKE operator

The LIKE operator can be used to perform simple pattern matches in strings. Below we use it to select records of all students whose last names start with an “H”:

[17]:
%%sql
SELECT *
FROM gradebook
WHERE last_name LIKE 'H%';
 * sqlite:///gradebook_data.sqlite
Done.
[17]:
student_id first_name last_name major exam1 exam2 exam3
5775 Rosina Hovey biology 96 13 99
9667 Jonnie Hillock mathematics 56 24 72
3548 Marcia Hinkel biology 79 10 41
9165 Ludivina Hoskin mathematics 27 55 None

In patterns used by LIKE the percentage symbol % represents zero of more characters, and the underscore _ stands for a single character. By default pattern matches are case insensitive, so ‘H%’ will match both ‘Hello’ and ‘hello’.

ORDER

The ORDER clause specifies in which order records retrieved by a query should be returned. We can use it, for example, to arrange students records according to the exam 3 scores:

[42]:
%%sql
SELECT first_name, last_name, exam3
FROM gradebook
ORDER BY exam3;
 * sqlite:///gradebook_data.sqlite
Done.
[42]:
first_name last_name exam3
Ludivina Hoskin None
Chu Flax 29
Stanton Mcgaugh 40
Marcia Hinkel 41
Shawnna Coate 61
Lottie Champine 62
Jonnie Hillock 72
Tenisha Rippel 79
Fran Mount 80
Rosalinda Trippe 82
Starr Bachand 85
Rosina Hovey 99
Adelina Brummitt 99

By default, records are returned in the ascending order. NULL is regarded as the smallest possible value, so it appears at the top of the list. By adding the DESC keyword we can arrange records in the descending order:

[19]:
%%sql
SELECT first_name, last_name, exam3
FROM gradebook
ORDER BY exam3 DESC;
 * sqlite:///gradebook_data.sqlite
Done.
[19]:
first_name last_name exam3
Rosina Hovey 99
Adelina Brummitt 99
Starr Bachand 85
Rosalinda Trippe 82
Fran Mount 80
Tenisha Rippel 79
Jonnie Hillock 72
Lottie Champine 62
Shawnna Coate 61
Marcia Hinkel 41
Stanton Mcgaugh 40
Chu Flax 29
Ludivina Hoskin None

LIMIT

The LIMIT clause constrains the number of records returned by the SELECT statement. For example, we can use it to get records of 5 students with the highest exam 3 scores:

[20]:
%%sql
SELECT first_name, last_name, exam3
FROM gradebook
ORDER BY exam3 DESC
LIMIT 5
 * sqlite:///gradebook_data.sqlite
Done.
[20]:
first_name last_name exam3
Rosina Hovey 99
Adelina Brummitt 99
Starr Bachand 85
Rosalinda Trippe 82
Fran Mount 80

LIMIT can be used with an additional OFFSET keyword. For example, LIMIT 3 OFFSET 2 will skip the first 2 records and return the following 3 records:

[21]:
%%sql
SELECT first_name, last_name, exam3
FROM gradebook
ORDER BY exam3 DESC
LIMIT 3 OFFSET 2
 * sqlite:///gradebook_data.sqlite
Done.
[21]:
first_name last_name exam3
Starr Bachand 85
Rosalinda Trippe 82
Fran Mount 80

Aggregate functions

Aggregate functions summarize values in a column. SQLite provides a few aggregate functions:

  • AVG(column) returns the average of numerical values in a column.

  • MIN(column) and MAX(my_column) return the minimal and maximal value of a column, respectively.

  • SUM(column) returns the sum of numerical values in a column.

  • COUNT(column) returns the number of non-NULL values in a column. COUNT(*) returns the number of rows, regardless of their values.

Note. COUNT(column) will count all non-NULL values, even if some values repeat. In order to count only distinct values we can use COUNT (DISTINCT column).

For example, below we count the number of non-NULL values in columns with exam scores.

[22]:
%%sql
SELECT COUNT(exam1), COUNT(exam2), COUNT(exam3)
FROM gradebook;
 * sqlite:///gradebook_data.sqlite
Done.
[22]:
COUNT(exam1) COUNT(exam2) COUNT(exam3)
9 10 12

Some values of the columns with exam scores repeat, the above counts include these repetition. We can obtain the number of unique scores in each column as follows:

[4]:
%%sql
SELECT COUNT(DISTINCT exam1), COUNT(DISTINCT exam2), COUNT(DISTINCT exam3)
FROM gradebook;
 * sqlite:///gradebook_data.sqlite
Done.
[4]:
COUNT(DISTINCT exam1) COUNT(DISTINCT exam2) COUNT(DISTINCT exam3)
7 10 11

Next, we compute the maximal score recorded for each exam:

[23]:
%%sql
SELECT MAX(exam1), MAX(exam2), MAX(exam3)
FROM gradebook;
 * sqlite:///gradebook_data.sqlite
Done.
[23]:
MAX(exam1) MAX(exam2) MAX(exam3)
96 96 99

GROUP BY

The GROUP BY expression clause of the SELECT statement adds more flexibility to the aggregate functions. It divides records into groups in such a way, that the specified expression has the same value for all records in a group. Aggregate functions can be then used to summarize columns in each group separately.

As an example, below we group records based on the value of the major column. Then we calculate the average exam scores for each group:

[28]:
%%sql
SELECT major, AVG(exam1), AVG(exam2), AVG(exam3)
FROM gradebook
GROUP BY major;
 * sqlite:///gradebook_data.sqlite
Done.
[28]:
major AVG(exam1) AVG(exam2) AVG(exam3)
biology 82.0 9.333333333333334 73.33333333333333
computer science 37.333333333333336 59.25 68.16666666666667
mathematics 51.666666666666664 48.666666666666664 66.66666666666667

The table above is somewhat difficult to read, because it contains too many decimal digits. We can improve it using the ROUND(value, num_digits) function, which rounds numerical values to a given number of digits:

[31]:
%%sql
SELECT major,
       ROUND(AVG(exam1), 2) AS exam1_avg,
       ROUND(AVG(exam2), 2) AS exam1_avg,
       ROUND(AVG(exam3), 2) AS exam3_avg
FROM gradebook
GROUP BY major;
 * sqlite:///gradebook_data.sqlite
Done.
[31]:
major exam1_avg exam1_avg_1 exam3_avg
biology 82.0 9.33 73.33
computer science 37.33 59.25 68.17
mathematics 51.67 48.67 66.67

SQLite provides an additional aggregate function GROUP_CONCAT(column, separator), which concatenates all non-NULL values in a column, separating them by the specified separator string. We can use it to list exam 1 scores obtained by students in each major:

[32]:
%%sql
SELECT major, GROUP_CONCAT(exam1, ", ") AS exam1_scores
FROM gradebook
GROUP BY major;
 * sqlite:///gradebook_data.sqlite
Done.
[32]:
major exam1_scores
biology 96, 71, 79
computer science 6, 79, 27
mathematics 56, 72, 27

HAVING

GROUP BY can be used with an additional HAVING clause. In such a case, only groups that satisfy the condition given after HAVING will be returned. Below we use it to check which majors averaged more than 50 points on exam 1.

[38]:
%%sql
SELECT major, ROUND(AVG(exam1),2) AS exam1_avg
FROM gradebook
GROUP BY major
HAVING exam1_avg > 50
 * sqlite:///gradebook_data.sqlite
Done.
[38]:
major exam1_avg
biology 82.0
mathematics 51.67