A “composite index”, also known as a “concatenated” or “combined” index, is an index on multiple columns of a table. Many people wonder what is more beneficial: separate indexes or a composite index? Whenever we do training, consulting or support, this question is high on the agenda, so I decided to shed some light on it.

Which indexes should I create? Combined indexes or separate indexes?

To discuss the topic on a more practical level, I created a table consisting of three columns. Then I loaded 1 million rows and added a composite index covering all three columns:

test=# CREATE TABLE t_data (a int, b int, c int);
CREATE TABLE
test=# INSERT INTO t_data 
		SELECT random()*100000, 
			random()*100000, 
			random()*100000 
		FROM generate_series(1, 1000000);
INSERT 0 1000000
test=# CREATE INDEX idx_data ON t_data(a, b, c);
CREATE INDEX

The layout of the table is therefore as follows:

test=# \d t_data
               Table "public.t_data"
 Column |  Type   | Collation | Nullable | Default 
--------+---------+-----------+----------+---------
 a      | integer |           |          | 
 b      | integer |           |          | 
 c      | integer |           |          | 
Indexes:
    "idx_data" btree (a, b, c)

Now let’s run ANALYZE to make sure that optimizer statistics are in place. Usually, autovacuum will kick in and create statistics for your table, but to be absolutely sure, running ANALYZE does not hurt in this case.

test=#  ANALYZE t_data;
ANALYZE

PostgreSQL will rearrange filters for you

The first important thing to observe is that PostgreSQL will try to arrange the filters in your query for you. The following query will filter on all indexed columns:

test=# explain SELECT * 
       FROM   t_data 
       WHERE  c = 10 
              AND b = 20 
              AND a = 10;
                   QUERY PLAN                                  
---------------------------------------------------
 Index Only Scan using idx_data on t_data  
       (cost=0.42..4.45 rows=1 width=12)
   Index Cond: ((a = 10) AND (b = 20) AND (c = 10))
(2 rows)

As you can see, we filtered on c, b and a, but the optimizer reordered those conditions to a, b, c so that they match the column order of the index we created. There are some important things to learn here:

  • The order of the conditions in your WHERE clause makes no difference
  • PostgreSQL will find the right indexes automatically
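To double-check the first point, you can shuffle the conditions once more and compare the plans yourself. A minimal sketch (I have omitted the output; the index condition should again come out as a, b, c regardless of how the WHERE clause is written):

-- Hedged sketch: the same three conditions in yet another order.
-- EXPLAIN should again show "Index Cond: ((a = 10) AND (b = 20) AND (c = 10))",
-- matching the column order of idx_data rather than the order written here.
EXPLAIN SELECT *
FROM   t_data
WHERE  b = 20
       AND c = 10
       AND a = 10;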

Also, keep in mind that PostgreSQL knows that equality is transitive and can infer conditions from that:

test=# explain SELECT * 
       FROM    t_data 
       WHERE   c = 10 
               AND b = a 
               AND a = c;
                     QUERY PLAN                                  
---------------------------------------------------
 Index Only Scan using idx_data on t_data  
       (cost=0.42..4.45 rows=1 width=12)
   Index Cond: ((a = 10) AND (b = 10) AND (c = 10))
(2 rows)

What you can see here is that PostgreSQL automatically figured out that a, b and c must all be equal to 10.

Using parts of an index

However, if you have a composite index, it isn’t necessary to filter on all of its columns. It is also fine to filter on just the first column, or on the first and second columns. Here is an example:

test=# explain SELECT * 
       FROM    t_data 
       WHERE   a = 10;
                 QUERY PLAN                                  
------------------------------------------
 Index Only Scan using idx_data on t_data  
       (cost=0.42..4.62 rows=11 width=12)
   Index Cond: (a = 10)
(2 rows)

As you can see, PostgreSQL can still use the same index. An index is a sorted list, which in this case happens to be ordered by three columns. In multi-column indexes, this ordering is a so-called “lexicographical ordering”: the rows are initially sorted by the first index column; rows with the same first column are sorted by the second column, and so on. It is perfectly fine to make use of just the leading columns. Talking about sorted lists:

test=# explain SELECT *
       FROM    t_data
       WHERE   a = 10
       ORDER BY b, c;
                QUERY PLAN
-------------------------------------------
 Index Only Scan using idx_data on t_data
       (cost=0.42..4.62 rows=11 width=12)
   Index Cond: (a = 10)
(2 rows)

An index can also provide you with sorted data. In this case, we filter on “a” and order by the remaining two columns, “b” and “c”.
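The same applies to the first two columns: a query that filters on both a and b can use idx_data as well. Here is a minimal sketch to try yourself; EXPLAIN should again show an index scan, this time with both columns in the index condition (the exact costs depend on your data):

-- Hedged sketch: filtering on the two leading columns of idx_data.
-- EXPLAIN should show something like "Index Cond: ((a = 10) AND (b = 20))".
EXPLAIN SELECT *
FROM   t_data
WHERE  a = 10
       AND b = 20;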

When composite indexes don’t work

When DOESN’T a composite index help to speed up a query? Here is an example:

test=# explain SELECT * 
       FROM    t_data 
       WHERE   b = 10;
                     QUERY PLAN                                
---------------------------------------------------
 Gather  (cost=1000.00..11615.43 rows=11 width=12)
   Workers Planned: 2
   ->  Parallel Seq Scan on t_data  
         (cost=0.00..10614.33 rows=5 width=12)
         Filter: (b = 10)
(4 rows)

In this case, we filter on the second column. Remember: a b-tree index is nothing more than a sorted list, and in our case it is sorted by a, b, c. Therefore, the sort order of the index is of no use when searching for a given value of b. However, in some rare cases you will still see an index scan, and some people take that as proof “that it does indeed work”. But you will see an index scan for a different reason: PostgreSQL is able to read an index in its entirety – not to search it efficiently, but to reduce the amount of I/O compared to reading the table. If a table has many columns, it might be faster to read the index than to digest the whole table.

Let’s simulate this type of index scan:

test=# SET seq_page_cost TO 10000;
SET

By telling PostgreSQL that sequential I/O is more expensive than it really is, you will see that the optimizer decides on an index scan:

test=# explain analyze 
       SELECT * 
       FROM   t_data 
       WHERE  b = 10;
                  QUERY PLAN                                                          
--------------------------------------------------
 Index Only Scan using idx_data on t_data  
       (cost=0.42..22892.53 rows=11 width=12) 
       (actual time=7.626..63.087 rows=8 loops=1)
   Index Cond: (b = 10)
   Heap Fetches: 0
 Planning Time: 0.121 ms
 Execution Time: 63.122 ms
(5 rows)

However, the execution time is 63 milliseconds, which is A LOT more than an equivalent lookup on the first column of the index would take.

Note that “Index Cond: (b = 10)” means something different here than in the previous examples: before, the condition was used to descend the index and locate the matching entries directly; here, the entire index is scanned and the condition merely filters out non-matching entries. It’s not ideal that the two cases look the same in the EXPLAIN output.
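One way to tell the two cases apart is to look at how much work the scan actually does. EXPLAIN (ANALYZE, BUFFERS) reports how many blocks were touched: a real lookup on the leading column should only touch a handful of buffers, while the scan filtering on b has to read a large part of the index. A hedged sketch for comparison (your buffer counts and timings will differ):

-- Hedged sketch: compare the buffer counts of both variants.
-- The lookup on the leading column "a" should report only a few shared
-- buffers; the scan filtering on "b" reads most of the index.
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t_data WHERE a = 10;
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM t_data WHERE b = 10;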

Understanding bitmap index scans in PostgreSQL

PostgreSQL is able to use more than one index at the same time. This is especially important if you are using OR as shown in the next example:

test=# SET seq_page_cost TO default;
SET
test=# explain SELECT * 
       FROM    t_data 
       WHERE   a = 4 
               OR a = 23232;
                     QUERY PLAN                                  
----------------------------------------------------
 Bitmap Heap Scan on t_data  
        (cost=9.03..93.15 rows=22 width=12)
   Recheck Cond: ((a = 4) OR (a = 23232))
   ->  BitmapOr  (cost=9.03..9.03 rows=22 width=0)
         ->  Bitmap Index Scan on idx_data  
               (cost=0.00..4.51 rows=11 width=0)
               Index Cond: (a = 4)
         ->  Bitmap Index Scan on idx_data  
               (cost=0.00..4.51 rows=11 width=0)
               Index Cond: (a = 23232)
(7 rows)

What PostgreSQL does here is decide on a bitmap scan. The idea is to first consult the indexes to compile a bitmap of rows / blocks, which then have to be fetched from the table (= heap). This is usually a lot better than a sequential scan. In my example, the same index is even used twice. However, in real life you are more likely to see bitmap scans combining a variety of indexes.

Using a subset of indexes in a single SQL statement

In many cases, using too many indexes in a single query can even be counterproductive. The planner makes sure that enough indexes are chosen to narrow down the result, but it won’t use more of them than necessary. Let’s take a look at the following example:

test=# DROP INDEX idx_data;
DROP INDEX
test=# CREATE INDEX idx_a ON t_data (a);
CREATE INDEX
test=# CREATE INDEX idx_b ON t_data (b);
CREATE INDEX
test=# CREATE INDEX idx_c ON t_data (c);
CREATE INDEX
test=# \d t_data 
               Table "public.t_data"
 Column |  Type   | Collation | Nullable | Default 
--------+---------+-----------+----------+---------
 a      | integer |           |          | 
 b      | integer |           |          | 
 c      | integer |           |          | 
Indexes:
    "idx_a" btree (a)
    "idx_b" btree (b)
    "idx_c" btree (c)

The following query filters on all columns again. This time, I didn’t use a single combined index. Instead, I decided to use separate indexes, which is usually less efficient. Here is what happens:

test=# explain SELECT * 
       FROM    t_data 
       WHERE   a = 10 
               AND b = 20 
               AND c = 30;
                     QUERY PLAN                                 
----------------------------------------------------
 Bitmap Heap Scan on t_data  
        (cost=9.27..13.28 rows=1 width=12)
   Recheck Cond: ((c = 30) AND (b = 20))
   Filter: (a = 10)
   ->  BitmapAnd  (cost=9.27..9.27 rows=1 width=0)
         ->  Bitmap Index Scan on idx_c  
               (cost=0.00..4.51 rows=11 width=0)
               Index Cond: (c = 30)
         ->  Bitmap Index Scan on idx_b  
               (cost=0.00..4.51 rows=11 width=0)
               Index Cond: (b = 20)
(8 rows)

Let’s take a closer look at the plan. The query filters on 3 columns, BUT PostgreSQL only decided to use two indexes. Why is that the case? We have imported 1 million rows, and each column contains roughly 100,000 distinct values, which means that every value occurs about 10 times. Two bitmap index scans already narrow the result down to a handful of rows, so consulting a third index would cost more than simply applying the remaining condition as a filter on those rows. That is exactly what the plan does (Filter: (a = 10)).
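With three separate indexes in place, we can also revisit the earlier remark about bitmap scans combining different indexes: an OR spanning two columns can now be answered by merging idx_a and idx_b with a BitmapOr. A hedged sketch to try out (the exact plan depends on your statistics, but you should see two bitmap index scans feeding a BitmapOr):

-- Hedged sketch: an OR spanning two different columns can be answered by
-- combining idx_a and idx_b with a BitmapOr before visiting the heap.
EXPLAIN SELECT *
FROM   t_data
WHERE  a = 4
       OR b = 23232;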

Optimizing min / max in SQL queries

Indexes are not only about filtering. They can also help you find the lowest and the highest value in a column, as shown by the following SQL statement:

test=# explain SELECT min(a), max(b) FROM t_data;
                               QUERY PLAN                                                     
-----------------------------------------------------------------------
 Result  (cost=0.91..0.92 rows=1 width=8)
   InitPlan 1 (returns $0)
     ->  Limit  (cost=0.42..0.45 rows=1 width=4)
           ->  Index Only Scan using idx_a on t_data  
               (cost=0.42..28496.42 rows=1000000 width=4)
                 Index Cond: (a IS NOT NULL)
   InitPlan 2 (returns $1)
     ->  Limit  (cost=0.42..0.45 rows=1 width=4)
           ->  Index Only Scan Backward using idx_b on t_data t_data_1  
                 (cost=0.42..28496.42 rows=1000000 width=4)
                 Index Cond: (b IS NOT NULL)
(9 rows)
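Note that this min / max shortcut relies on a b-tree index matching the aggregated expression. If no suitable index exists, for instance for an expression such as a + b, PostgreSQL has to fall back to scanning the data. A hedged sketch to try out (you should see a plain aggregate over a sequential or parallel scan instead of the InitPlan trick above):

-- Hedged sketch: no index covers the expression a + b, so the min/max
-- optimization via an index scan is not available here.
EXPLAIN SELECT min(a + b) FROM t_data;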

As you can see, PostgreSQL is pretty sophisticated. If you are looking for good performance, it certainly makes sense to see how PostgreSQL handles indexes and to review your code in order to speed things up. If you want to learn more about performance, consider checking out Laurenz Albe’s post about speeding up count(*). Also, if you are not sure why your database is slow, check out my post on PostgreSQL database performance, which explains how to find slow queries.

 

