Getting Started with SQL
An introduction to queries and (inner) joins
--
Why would you?
From “data is the new oil” (or not) to “90% of the world data was generated over the last 2 years”, it is fair to say data has been getting some traction recently.
But what is “data” exactly? The Cambridge Dictionary defines data as:
information, especially facts or numbers, collected to be examined and considered and used to help decision-making, or information in an electronic form that can be stored and used by a computer.
So data is “especially facts and numbers”, but not only. Effectively, data comes in many formats: audio files, video/photo files, tabular data, pdf scans, emails, list of invoices, etc. Each of these formats contain a specific type of information and, with it a particular value, therefore it has to be handled in an appropriate way to maximise this value. Welcome to the world of data engineering!
It is a very complex field of computer science and, to keep things simple, I will only mention 2 principal families in data handling: DataBase Management Systems and Relational DataBase Management Systems. It is a functional classification, as “management systems” is just another fancy name for software, but as I am going to emphasise the relational aspect only, there is no need to bring confusion. (More detail is available in the footnotes if you are interested.)
Structured Query Language
In terms of relational databases, there is a clear winner: the Structured Query Language. Used across the board, a “flavour” of SQL will most likely interact with the database that stores your data, no matter its type. Because SQL is king in the RDBMS ecosystem, it also happens to be one of the most in-demand skills, according to the Indeed tech job report at least.
The Setup
There are many ways to get started with SQL. Depending on which aspect you want to work on: you can install DBeaver, which handles many types of database protocols out of the box, or you can specifically install PostgreSQL/SQLite. You can even download a docker container image. These are good options if you are interested in the data engineering side of data handling.
In this article, I am going to focus on data management using Python. It can be as simple as import sqlite3
(it is a built-in library) or installing the right driver (e.g. one of these for a Postgre database). Again, it all depends on which format your data is.
I would highly recommend using DBeaver to ease the learning process. You will be able to generate a sample database on the first start, which is very handy when you want to focus on learning SQL basics (it is the database I’ll use in this article). One of the best features of DBeaver is the direct access to a visual representation of the database structure (once connected to the database, double click on ‘Tables’ and have a look at the ‘ER Diagram’ tab).
The power of SQL sits in its proximity to the English language. Looking at this query: SELECT Name FROM Album LIMIT 100;
, you instinctively know it will return the first 100 'Name' from the 'Album' table.
Just as with any other language, the first obstacles to “speaking” SQL are the vocabulary, and the structure:
Keywords
Although keywords will mostly be the same across the different “flavours” of SQL, I would advise referring to the manual of the system you are learning. MySQL — keywords and reserved words, for example.
Structure
Here again, it will be very similar to any other “flavours”, but there will be some specificities so head to the manual: MySQL — SELECT statement, for example. One thing to be aware of is the order in which the query is processed: SQL queries don’t start with SELECT — Julia Evans.
The Code
As said previously, I want to keep things simple and focus on the Data Manipulation Language, as opposed to the Data Definition Language, specific to creating the data structure. On top of keywords and structure (they will become more natural with practice), there are a few concepts that need to be understood:
- Avoiding repetition. Databases avoid repeating the information by design. Instead information is stored in different tables, and these tables have connections. It is a more efficient way to store data as
integers
are smaller thanstrings
, amongst other things. - Primary key. The primary key is unique and will be used to identify every row in the table.
- Foreign key. Not a “key” per se, more of a relationship. It is the common ground between 2 tables, allowing a cross-reference between those tables.
- Joins. Because data repetition is kept to a minimum, it is necessary to understand how to connect those tables together.
- Connection and Cursor. As we are using Python, we need to create a
Connection
object and aCursor
object.
It is pretty straight forward to connect to a database with the sqlite3
library: Connection > Cursor > Query. fetchall()
is requesting all the results, fetchone()
would be another useful method for database exploration. The only trick, really, is to remember to name the columns using cur.description
.
Notes:
*
is the wildcard character, it means everything available.SQL keywords are capitalised by convention, but most interpreters won’t hold you accountable if you don’t capitalise them. Same goes for the semicolon at the end of the query.
Helping Decision-making
We want to know what our catalogue contains. What songs are we managing? Let’s start by extracting the detailed discography of all the metal artists in the database. Unfortunately the information is not formatted in such a way, but the database schema from DBeaver tells us that:
- The genre is in the
Genre
table, linked toTrack
via the foreign keyGenreId
. - The artist is in the
Artist
table. There is no direct link to theTrack
table, but it connects to theAlbum
table via the foreign keyArtistId
. - The album is in the
Album
table, which links toTrack
via the foreign keyAlbumId
. - The track is in the Track table.
One way to resolve this inconvenience is to use JOIN
!
In this QUERY
:
Artist
joins the tableAlbum
withJOIN Artist a2 ON a.ArtistId = a2.ArtistId
. (a2
anda
are aliases and are not necessary, but they simplify the code.JOIN Artist ON Album.ArtistId = Artist.ArtistId
is also a valid query.)Album
connects to the tableTrack
throughJOIN Album a ON t.AlbumId = a.AlbumId
.Genre
links to the tableTrack
viaJOIN Genre g ON t.GenreId = g.GenreId
.Track
is our starting point:FROM Track t
.- Our restriction is on the genre:
WHERE Genre = 'Metal'
. Genre is understood by the interpreter, as it is defined in theSELECT
statement as the alias tog.Name
. 'Metal' is a string and is case sensitive. - We want to see the artist, album name, track name and genre. We use aliases to render the data in a more interpretable way (Artist, Track, and Genre are locally called ‘Name’):
SELECT a2.Name AS 'Artist', a.Title AS 'Album', t.Name AS 'Track', g.Name AS 'Genre'
ORDER BY Artist ASC, Album DESC, t.TrackId DESC
will order the data by artist name in alphabetical order, the album in reverse alphabetical order and, finally, the track name in alphabetical order.
Another cool thing to do in SQL is to compute calculations: Would you like to know what the most profitable song in the catalogue is?
Once again, we need to chain join as tracks and invoices are not directly linked:
And the most profitable band is:
Conclusion
It is true that SQL can get complex (subqueries, CASE, calculations, etc.) but keep practising, you will be able to extract exactly what you want from the data in no time.
Learning SQL can be overwhelming sometimes, but the power of SQL is in its structure. Once you start thinking in SQL, nothing can stop you!
Extra Links (fundamental concepts)
DBMS vs RDBMS — GeeksforGeeks
SQL vs NoSQL — IBM
SQL vs NoSQL — Guru99
PEP 249 — Python Database API Specification v2.0