Index Your Database, Like a Boss


During the holidays I had the opportunity to read the wonderful book, SQL Performance Explained and now I want to share some of my new found knowledge.

The book contains a lot more information such as how joins, datatypes, ranges, etc, affect performance. In this post I will write about indexing, what an index is, what should be indexed, when to use simple indexes and when to use composite.

I will be using a Postgres database since that is what I know best, but most of it can be applied to any database although the details may differ.

Why index?

There are at least two reasons for using an index. One is for making sure that invalid data is not entered, in that case the index, unique, is used as a constraint. The other reason for using indexes is what I'm writing about here, performance.

By tuning my database, I will not only improve the performance of single queries, I will also improve how many queries it will be able to handle at the same time. A well-tuned database uses less resources and does more with less.

What is an index?

An index is a B-Tree, a data structure that keeps data sorted and allows searches, sequential access, insertions and deletions in logarithmic time.

B-Tree from Wikipedia

The index is the reason why a database is fast. If I have a table with 1 million rows and I do a sequential search, I have to search through all the rows to know if something matches. If on the other hand I have an index on it I only have to search through log2(1 million) =~ 20 rows instead. Quite an improvement! And this improvement is for log2.

Databases expose this concept to a maximum extent and put as
many entries as possible into each node—often hundreds.
That means that every new index level supports a hundred times
more entries. –SQL Performance Explained

This means a database has to search through log100(1 million) =~ 3 rows.

What should I index?

So, the million-dollar question, actually the e-book is only EUR 9.95 :), is: "What should I index?". The answer to this is, as always, it depends, (Damn consultants never willing to stand for anything!) :). But, it does not depend that much. I'm going to go out on a limb and write:

In a typical web application I will have a 100 times more queries
than inserts, so I index everything I put in my where clauses.

So, with this premise, it is as easy as analyzing my queries finding all the where clauses and adding an index to each an every one of them.


Let's say that we have a users table with four columns, the database usually adds an automatic index to the primary key.

A users table with four columns

  • id (primary key)
  • firstname
  • lastname
  • male_female

When I select a user by id, the primary key, this is how Postgres will execute it.

Postgres will use the users_pkey index and it will cost 9.37. (Cost is measured in units of disk page fetches).

Simple Indexes

If I instead try to find a user by firstname

Aiya! Postgres will do a full table scan and it will take cost 22805. That is

more than 2000 times slower!

To fix this I need to put an index on the table.

Zippedidoodaa! Postgres is performant again, by using my new index.

Alright, that was easy. So what happens if we want to find users by both firstname and lastname?

The Filter above means that Postgres will scan the result for matching lastnames. We can do better. Let's add an index on lastname to see what happens.

Nothing happened! Postgres cannot take advantage of this index. Let's remove it again. My Mama always said "Don't leave indexes around unless you can prove that they help you." :).

Composite Indexes

To solve the problem with AND clauses I need to create a composite index.

WTF? Postgres is ignoring our composite index, in preference to our firstname index. Why? Because it believes that it is faster! Is it? Let's find out. The timing command turns on timing in Postgres.

Now I drop the firstname index to make sure that Postgres will use my index.

The query using my composite index is about 1ms faster, so in this case Postgres was wrong, but for other data distributions my guess may be wrong. Another reason to always measure.

If I now perform the firstname query again, Postgres can use my composite index, but if I try to search for lastname it cannot. The order of the fields in the composite index matters.

Alright, one last query before I'm done. What happens with an OR query?

OR queries are like separate queries and they need separate indexes. I'll add an extra index to lastname and we should be good to go.

It worked! Postgres was able to use both our composite index and the new lastname index.


So to sum it up, to get a performant and scalable database:

  • Use indexes that covers all fields that are ANDed in where clauses.
  • Use separate indexes for fields that are in OR clauses.
  • Create reusable composite indexes.
    • When searching for (firstname, lastname) and firstname, add ONE composite index on (firstname, lastname).
    • When searching for (firstname, lastname) and lastname, add ONE composite index on (lastname, firstname).
    • When searching for (firstname, lastname), firstname, and lastname, add TWO indexes, one composite index on (firstname, lastname) (or reversed) and one simple index on lastname (or firstname)

I highly recommend the book. It is good!

Leave a Reply

Close Menu