Data Science is what you do while you wait for your queries to finish.

You can quote me on that if you want, although I think at this point someone must have said it already.

As a data scientist (or analyst, or whatever title is fancy when you are reading this) before you do any plots, any exploratory analysis, any preliminary statistics, any prototype models... you need some data. With all probability, that data will be dispersed among a bunch of different DBs in your organization, and the way to extract it will be some dialect of SQL or other. So before starting R or Python, you would probably be accessing those systems (for instance, I particularly like DBeaver as my unified DB endpoint), checking which tables are available, what data columns have been stored and thinking about how to convert all that into some features you can start working with.

As I have progressed in this trade, SQL queries have evolved from something inescapable I have to do to that thing I really enjoy and that, when it's actually well structured an planned, will save me tons of time down the road. How did I end up here?

I began learning SQL, like a lot of people before me, because back in the early 2000's I became interested in web technologies and the PHP + MySQL combination (plus some other things) was nearly ubiquitous. So I learned my SELECT, my WHERE... and actually not much else, as I then moved on. Years later, armed with that extremely rudimentary knowledge, I built a bibliography manager for the research lab where I did my PhD that for a long time was published here. That version is now lost, but thanks to the Internet Archive I can tell you that it looked like this. The entire thing was a single table (I didn't know JOIN!), so a lot of the logic was done in PHP. For instance, any bibliography entry could belong to several research categories. The way that was done was with a single column of text, separated by commas, that was parsed after the query. Not pretty. Years later, the person that was in charge of all IT systems at the time told me they needed to do some modifications, and the only thing I could tell him was that I was sorry. They of course ended up replacing it.

Then, years later, and not too long ago, I started doing more complicated stuff. This Kaggle competition... had proper JOINs! I was still doing a lot of stuff outside of the query, this time using R, but still, it's an improvement. In retrospect, it's the same thing I did with another Kaggle competition in which I also dumped all the data into a SQLite database and queried from there. A lot of stuff happens outside of the main queries that could possible had been done inside.

And then I started my career as a Data Scientist. I cannot give too many details, but the first project I was assigned to was very SQL-intensive. All the necessary data were spread across a bunch of different tables that sometimes could be joined directly, others needed some previous aggregation. We needed to dump a couple of temporary tables at the beginning in order to not have correlated subqueries (another thing I didn't know, which I learned after the people in charge of the DBs came to my desk with lit torches.) I also learned about data partitions (more people with torches), use indexes when available (same), split huge queries into smaller portions (yep), etc.

After learning to get correct results, I started looking into ways of making them faster. SQL optimization is a mad, crazy, savage world. I still don't entirely understand certain things, but I remember this is one of the first articles I read about this. Then I started noticing differences between different SQL implementations: t-SQL, HIVE, MySQL, Vertica. Even within the same system, a query is not written in the same way twice.

After two years of learning, I did another Kaggle competition. I can tell the difference. This time I was doing much more stuff directly using the query.

Do I write the best queries in the world now? Of course I don't. But I can do these three things now:

  1. Don't fear SQL.
  2. Be a good citizen to the DB people.
  3. Get correct results in a reasonable amount of time. It won't probably be the fastest query, but it doesn't matter. It won't be the slowest one, either.

As with a lot of other stuff, knowledge comes slowly and one of the best ways of acquiring it is reading from people that already know it. Felipe Hoffa writes very good examples with Google BigQuery. Read them even if you don't understand them completely, but you will get the knack eventually. See this article about analyzing Reddit users, for instance. Here is a list of some free SQL books and, of course, the good ol' W3 Schools. Also, try to find some data analysis competition on Kaggle in which the features are not precomputed for you. This is where most of the fun is.

SQL is data Tetris. It's fun. It's relaxing. It's good for your mind.

I started writing this article started because a few weeks ago I found myself coding a Vertica query with three subordinate clauses, one of them including a FIRST_VALUE() OVER() structure that I didn't even know existed a couple of months ago.