Getting Started with SQL
Why would you?
From “data is the new oil” (or not) to “90% of the world data was generated over the last 2 years”, it is fair to say data has been getting some traction recently.
But what is “data” exactly? The Cambridge Dictionary defines data as:
information, especially facts or numbers, collected to be examined and considered and used to help decision-making, or information in an electronic form that can be stored and used by a computer.
So data is “especially facts and numbers”, but not only. Effectively, data comes in many formats: audio files, video/photo files, tabular data, pdf scans, emails, list of invoices, etc. Each of these formats contain a specific type of information and, with it a particular value, therefore it has to be handled in an appropriate way to maximise this value. Welcome to the world of data engineering!
It is a very complex field of computer science and, to keep things simple, I will only mention 2 principal families in data handling: DataBase Management Systems and Relational DataBase Management Systems. It is a functional classification, as “management systems” is just another fancy name for software, but as I am going to emphasise the relational aspect only, there is no need to bring confusion. (More detail is available in the footnotes if you are interested.)
Structured Query Language
In terms of relational databases, there is a clear winner: the Structured Query Language. Used across the board, a “flavour” of SQL will most likely interact with the database that stores your data, no matter its type. Because SQL is king in the RDBMS ecosystem, it also happens to be one of the most in-demand skills, according to the Indeed tech job report at least.
There are many ways to get started with SQL. Depending on which aspect you want to work on: you can install DBeaver, which handles many types of database protocols out of the box, or you can specifically install PostgreSQL/SQLite. You can even download a docker container image. These are good options if you are interested in the data engineering side of data handling.
In this article, I am going to focus on data management using Python. It can be as simple as
import sqlite3 (it is a built-in library) or installing the right driver (e.g. one of these for a Postgre database). Again, it all depends on which format your data is.
I would highly recommend using DBeaver to ease the learning process. You will be able to generate a sample database on the first start, which is very handy when you want to focus on learning SQL basics (it is the database I’ll use in this article). One of the best features of DBeaver is the direct access to a visual representation of the database structure (once connected to the database, double click on ‘Tables’ and have a look at the ‘ER Diagram’ tab).
The power of SQL sits in its proximity to the English language. Looking at this query:
SELECT Name FROM Album LIMIT 100;, you instinctively know it will return the first 100 'Name' from the 'Album' table.
Just as with any other language, the first obstacles to “speaking” SQL are the vocabulary, and the structure:
Although keywords will mostly be the same across the different “flavours” of SQL, I would advise referring to the manual of the system you are learning. MySQL — keywords and reserved words, for example.
Here again, it will be very similar to any other “flavours”, but there will be some specificities so head to the manual: MySQL — SELECT statement, for example. One thing to be aware of is the order in which the query is processed: SQL queries don’t start with SELECT — Julia Evans.
As said previously, I want to keep things simple and focus on the Data Manipulation Language, as opposed to the Data Definition Language, specific to creating the data structure. On top of keywords and structure (they will become more natural with practice), there are a few concepts that need to be understood:
- Avoiding repetition. Databases avoid repeating the information by design. Instead information is stored in different tables, and these tables have connections. It is a more efficient way to store data as
integersare smaller than
strings, amongst other things.
- Primary key. The primary key is unique and will be used to identify every row in the table.
- Foreign key. Not a “key” per se, more of a relationship. It is the common ground between 2 tables, allowing a cross-reference between those tables.
- Joins. Because data repetition is kept to a minimum, it is necessary to understand how to connect those tables together.
- Connection and Cursor. As we are using Python, we need to create a
Connectionobject and a
It is pretty straight forward to connect to a database with the
sqlite3 library: Connection > Cursor > Query.
fetchall() is requesting all the results,
fetchone() would be another useful method for database exploration. The only trick, really, is to remember to name the columns using
*is the wildcard character, it means everything available.
SQL keywords are capitalised by convention, but most interpreters won’t hold you accountable if you don’t capitalise them. Same goes for the semicolon at the end of the query.
We want to know what our catalogue contains. What songs are we managing? Let’s start by extracting the detailed discography of all the metal artists in the database. Unfortunately the information is not formatted in such a way, but the database schema from DBeaver tells us that:
- The genre is in the
Genretable, linked to
Trackvia the foreign key
- The artist is in the
Artisttable. There is no direct link to the
Tracktable, but it connects to the
Albumtable via the foreign key
- The album is in the
Albumtable, which links to
Trackvia the foreign key
- The track is in the Track table.
One way to resolve this inconvenience is to use
Artistjoins the table
JOIN Artist a2 ON a.ArtistId = a2.ArtistId. (
aare aliases and are not necessary, but they simplify the code.
JOIN Artist ON Album.ArtistId = Artist.ArtistIdis also a valid query.)
Albumconnects to the table
JOIN Album a ON t.AlbumId = a.AlbumId.
Genrelinks to the table
JOIN Genre g ON t.GenreId = g.GenreId.
Trackis our starting point:
FROM Track t.
- Our restriction is on the genre:
WHERE Genre = 'Metal'. Genre is understood by the interpreter, as it is defined in the
SELECTstatement as the alias to
g.Name. 'Metal' is a string and is case sensitive.
- We want to see the artist, album name, track name and genre. We use aliases to render the data in a more interpretable way (Artist, Track, and Genre are locally called ‘Name’):
SELECT a2.Name AS 'Artist', a.Title AS 'Album', t.Name AS 'Track', g.Name AS 'Genre'
ORDER BY Artist ASC, Album DESC, t.TrackId DESCwill order the data by artist name in alphabetical order, the album in reverse alphabetical order and, finally, the track name in alphabetical order.
Another cool thing to do in SQL is to compute calculations: Would you like to know what the most profitable song in the catalogue is?
Once again, we need to chain join as tracks and invoices are not directly linked:
And the most profitable band is:
Learning SQL can be overwhelming sometimes, but the power of SQL is in its structure. Once you start thinking in SQL, nothing can stop you!