Know your database – a complex query

I have seen many people learn SQL without first understanding how databases work. Yet there are many crucial topics that can make writing queries more fun, if it is not fun enough already. Optimization, in terms of memory use and execution time, is one of them; I would even say it is the most enjoyable part. Below I will list some topics and issues you may encounter, and things to look at while optimizing queries.

The Basics

First of all, understand JOINs: the main difference between LEFT and INNER joins, but also learn to use the RIGHT join. People often forget about the RIGHT join, although it can bring a lot of clarity to the code. How come? If you keep the list of tables in subsequent queries in the same order, you can alter the way the tables are joined simply by changing LEFT to RIGHT, as in the sketch below. Clarity is important, as it makes the code easier to maintain. For the same reason I personally try to keep all keywords in UPPER CASE and all user-generated parts of the syntax in lower case – it makes the code much easier to read and follow. Another thing that greatly improves readability is using aliases for tables and putting them before every single column in a query. But most important of all: whatever way you write your queries, be consistent!
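As a quick sketch (using the books and authors tables that appear later in this post): both queries below keep the tables in FROM in exactly the same order, and only the join keyword decides which side must have a match.

SELECT 
     b.book_name
   , a.author_name 
FROM 
   books AS b 
   LEFT JOIN authors AS a 
      ON (a.id = b.author_id) -- all books, author attached where known
;
SELECT 
     b.book_name
   , a.author_name 
FROM 
   books AS b 
   RIGHT JOIN authors AS a 
      ON (a.id = b.author_id) -- all authors, books attached where any exist
;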

When you get what joins are about and how to write them, learn how to make them more efficient. See what an Execution Plan is and how to read it. Dig a bit into the different physical join types your database query planner/optimizer may use – in MS SQL Server and PostgreSQL those would be HASH, MERGE, and (nested) LOOP joins – and what makes each of them special. Knowing what to expect from the planner, learn how to help it choose the fastest option. Get to know how your database gathers statistics about tables and how they influence the query planner – probably the most important ones are the number of rows and value frequencies. Given that, you may go one step further and try to optimize your joins with indexes. If you join two tables by several columns, it may be worth creating an index on the tuple of those columns (two separate indexes will not work in this case). Also remember that indexing both tables (on the left and the right side of the join) might help a bit. Speaking of indexes, some databases offer many different types, like B-TREE, HASH, GiST, GIN, or BITMAP. Sometimes the keys in the index can be sorted ascending or descending. If an indexed column will have no duplicates in the data, make the index UNIQUE. To make it more complex, you may even have partial indexes, which cover only the part of the table matching some condition, or something called a CLUSTERED or NON-CLUSTERED index. A few of those variants are sketched below.
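A sketch of a few of those options in PostgreSQL syntax (index names are made up, and not every database supports every variant):

CREATE INDEX ix_rel_book_ngram 
   ON rel_book_ngrams (book_id, ngram_id); -- index on a tuple for a two-column join
CREATE UNIQUE INDEX ux_ngrams_id 
   ON ngrams (id);                         -- no duplicates in the data, so UNIQUE
CREATE INDEX ix_ngrams_small 
   ON ngrams (id) WHERE ngram_size = 2;    -- partial index, covers only part of the table
CREATE INDEX ix_ngrams_words 
   ON ngrams USING HASH (ngram_words);     -- a different index type, here HASH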

In the sections below I will not cover all the topics mentioned above. Instead, I will try to show you how to approach writing a more complex query: dividing the work into simpler, smaller steps and then building the final solution.

Querying

Knowing the basic tools, and having an idea of how the database makes them work, we can look at some strategies for writing queries. Assume we have several tables that we need to join together, and then we need to summarize the results. Let us say we have a database for the standard library problem, with tables books, authors, publishers, dim_category, dim_country, ngrams, and rel_book_ngrams. As you may suspect, the dimension tables, i.e. those beginning with dim_, are rather small, while the lists of books, authors, and n-grams are large. The list of n-grams will be especially big (an n-gram is simply a contiguous sequence of n words from a text). All the tables have the typical fields you would expect them to have; ngrams has id, ngram_words, and ngram_size, and rel_book_ngrams contains book_id, ngram_id, and ngram_count, which is the number of times the n-gram with ngram_id appeared in the book with book_id.
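To make the later queries concrete, a minimal sketch of the two less typical tables could look as follows (the exact types and constraints are my assumption, not part of the original problem):

CREATE TABLE ngrams (
     id          INT PRIMARY KEY
   , ngram_words VARCHAR(255) -- the words of the n-gram, space separated
   , ngram_size  INT          -- n, i.e. the number of words
);
CREATE TABLE rel_book_ngrams (
     book_id     INT REFERENCES books (id)
   , ngram_id    INT REFERENCES ngrams (id)
   , ngram_count INT          -- occurrences of the n-gram in the book
   , PRIMARY KEY (book_id, ngram_id)
);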

So what is the problem we have to solve?

Problem: Return a list of mathematics books written by German authors and published by UK publishers between 1979 and 1989, ordered (biggest first) by the ratio of the number of different 2-grams to the number of pages.

Sounds complicated, especially since we pretty much have to join all the existing tables while the list we have to return is pretty small. The problem is completely made up and I have no data for it, but we can analyze it step by step – or rather, let's list all the queries we would need to get each piece of data separately.

    list of books in mathematics published between 1979 and 1989
SELECT 
     b.id AS book_id
   , b.book_name
   , b.num_pages
   , b.publisher_id
   , b.author_id 
FROM 
   books AS b 
   INNER JOIN dim_category AS ct 
      ON (b.category_id = ct.id) 
WHERE 
       ct.description = 'mathematics' 
   AND '1979-01-01' <= b.publish_date 
   AND b.publish_date < '1990-01-01'
;
    list of the UK publishers
SELECT 
     p.id AS publisher_id
   , p.publisher_name 
FROM 
   publishers AS p 
   INNER JOIN dim_country AS co 
      ON (p.country_id = co.id) 
WHERE 
   co.country_name = 'united kingdom'
;
    list of German authors
SELECT 
     a.id AS author_id
   , a.author_name 
FROM 
   authors AS a 
   INNER JOIN dim_country AS co 
      ON (a.country_id = co.id) 
WHERE 
   co.country_name = 'germany'
;
    different 2-grams
SELECT 
   n.id AS ngram_id 
FROM 
   ngrams AS n 
WHERE 
   n.ngram_size = 2
;

We may say that we have all the pieces that need to be put together, so let's try to join them. We may assume that all tables are indexed on the id field, which is also unique. Personally, for ngrams I would suggest creating a clustered index on the tuple (ngram_size, id) – since a clustered index stores the whole rows, ngram_words is covered as well. This way, whenever you want to look through n-grams of a certain size, it will be much faster. Also, creating an index on country_name in dim_country will make your life easier.
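In SQL Server syntax that suggestion might look like this (index names made up; PostgreSQL has no persistent clustered indexes, there a plain index plus the CLUSTER command is the closest equivalent):

CREATE UNIQUE CLUSTERED INDEX ix_ngrams_size_id 
   ON ngrams (ngram_size, id); -- the clustered index carries ngram_words along
CREATE INDEX ix_dim_country_name 
   ON dim_country (country_name);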

Let's first get the list of all the books we are interested in: mathematics books by German authors, published between 1979 and 1989 by UK publishers. We do not need the publisher's name, nor its id, in any further step, so we use the publishers table only in the JOINs. Basically we need only some data about the books, plus the authors' names. At the very end we put all the results into a temporary table, tmp_filtered_books.

SELECT 
  -- author
    a.author_name 
  -- books
  , b.id AS book_id
  , b.book_name
  , b.num_pages
INTO
      tmp_filtered_books
FROM 
      books AS b 
      INNER JOIN publishers AS p 
         ON (p.id = b.publisher_id)
      INNER JOIN authors AS a 
         ON (a.id = b.author_id)
      INNER JOIN dim_category AS ct 
         ON (ct.id = b.category_id) 
      INNER JOIN dim_country AS author_co 
         ON (author_co.id = a.country_id) 
      INNER JOIN dim_country AS publisher_co 
         ON (publisher_co.id = p.country_id)
WHERE 
       author_co.country_name    = 'germany'
   AND publisher_co.country_name = 'united kingdom'
   AND ct.description = 'mathematics' 
   AND ('1979-01-01' <= b.publish_date 
   AND b.publish_date < '1990-01-01')
;

Note that we had to change the aliases for dim_country, as you cannot use the same alias for two tables in one query. Yes, it is the same table both times, but we refer to it in two different roles, so each reference needs its own alias. To make them more descriptive I added the name of the dimension each is joined to: author and publisher. If we ever needed to SELECT some data from those dimensions, author_co.country_name would be much more readable than just co.country_name.

One may try to argue that we should first slim down the dimension tables before making the join, but if we have an index on, for example, the tuple (country_name, id) for dim_country, it should be used even though id and country_name appear in different parts of the query – in the JOIN and WHERE clauses respectively.
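For illustration, the "slim first" version of, say, the German-authors lookup could use a CTE; with such an index both versions should end up with very similar plans:

WITH german AS (
   SELECT co.id 
   FROM dim_country AS co 
   WHERE co.country_name = 'germany'
)
SELECT 
     a.id AS author_id
   , a.author_name 
FROM 
   authors AS a 
   INNER JOIN german AS g 
      ON (g.id = a.country_id)
;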

NOTE: Dimension tables should have an index on the id field, preferably a clustered one – that greatly speeds up the joins. Creating a second, covering index on description that includes id might also be beneficial, in case you have to filter on the description. Such a covering index may help even for small tables, as there is no need to read any data outside of the index.
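As a sketch, for dim_category that could be (SQL Server syntax; INCLUDE also works in PostgreSQL 11+, and the index names are made up):

CREATE UNIQUE CLUSTERED INDEX ix_dim_category_id 
   ON dim_category (id);
CREATE INDEX ix_dim_category_description 
   ON dim_category (description) INCLUDE (id); -- covering index for filters on description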

Now we are left with the hardest part: joining ngrams with what we have above and summarizing. So what should we do first – summarize and then join, or join and then summarize? Why is it so important? Because after any manipulation on a table you lose the indexes you had on the initial tables. If you try to make too many joins in one go, the result may not fit into RAM, and then the intermediate tables will need to be materialized (written to disk), which slows things down even more. But if you plan everything beforehand, and even split the work into several queries, you can use the indexes efficiently – and sometimes create new indexes between the queries to make the joins and sorting faster.
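The pattern of indexing between queries is simply this (a sketch; as the next paragraph argues, in our particular case this index is not needed, but it often pays off when the intermediate result will be filtered or joined repeatedly):

CREATE INDEX ix_tmp_filtered_books_book_id 
   ON tmp_filtered_books (book_id);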

To get the results we need to join the result above to rel_book_ngrams and then to ngrams. As we have already filtered our data, and we are not going to filter the output of the query above any further, we do not need an index there. Why? An index on a base table helps during a JOIN when you also filter that table: the filter goes through the index and returns only the ids that need to be joined, which makes the query run faster. Without a filter you need to join all the rows of the base table anyway, so a full table scan is unavoidable.

But we have a problem here: the table structure has been badly designed. The table rel_book_ngrams does not contain the size of the n-grams we join on; that information is only available in the ngrams table. Let's see what we can do about it. The ngrams table can have millions upon millions of rows; the index on ngram_size would probably limit the search to several dozen million rows. Unfortunately, this table does not contain a field indicating the language of the n-gram either, which is another design flaw.

SELECT 
    b.author_name 
  , b.book_name
  , COUNT(*) * 1.0 / b.num_pages AS ratio -- number of different 2-grams per page
FROM 
  ngrams AS n 
  INNER JOIN rel_book_ngrams AS rbn  
     ON (rbn.ngram_id = n.id)
  INNER JOIN tmp_filtered_books AS b 
     ON (b.book_id = rbn.book_id)
WHERE 
  n.ngram_size = 2
GROUP BY
    b.author_name
  , b.book_name
  , b.num_pages
ORDER BY
  ratio DESC -- depending on the SQL implementation (database)
             -- you may have to put the whole formula here instead
;

Note that each row of rel_book_ngrams represents one distinct n-gram of a book, so COUNT(*) gives the number of different 2-grams, while multiplying by 1.0 avoids integer division; the DESC puts the biggest ratios first, as the problem asked. The optimizer should first use the index on rel_book_ngrams to get the list of all ngram_ids for our filtered books, and then join that result to ngrams using the (ngram_size, id) index suggested earlier. And that would be it – we have our list ordered by the ratio.

So what could have been done better?

One of course needs to investigate the execution plan, which I do not have here as the problem is completely artificial. Nevertheless, there is something that could clearly be improved: adding ngram_size to rel_book_ngrams. That would speed the query up a lot, as the database would not need to check so many values against the ngrams table.
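A sketch of that fix in SQL Server syntax (the new column name is my assumption): after back-filling the column, the final query would no longer need to touch ngrams at all, as the WHERE clause could run directly against rel_book_ngrams.

ALTER TABLE rel_book_ngrams ADD ngram_size INT;

UPDATE rbn 
SET rbn.ngram_size = n.ngram_size 
FROM rel_book_ngrams AS rbn 
   INNER JOIN ngrams AS n 
      ON (n.id = rbn.ngram_id);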
