In SQL, the GROUP BY clause groups records into summary rows and turns a large amount of data into a much smaller result set. GROUP BY returns one record per group. While most people know how to use GROUP BY, not many know how to squeeze the last couple of percentage points out of a query. There is a small optimization which can quite reliably speed things up by a couple of percent. If you want to speed up GROUP BY clauses, this post is for you.

Creating a test data set in PostgreSQL

To prepare ourselves for the aggregation we first have to generate some data:

test=# CREATE TABLE t_agg (x int, y int, z numeric);
CREATE TABLE
test=# INSERT INTO t_agg SELECT id % 2, id % 10000, random() 
	FROM 	generate_series(1, 10000000) AS id;
INSERT 0 10000000

The interesting part is that the first column has only 2 distinct values, while the second column contains 10,000 different values. That is going to be of great importance for our optimization efforts.
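
If you would like to double-check those cardinalities, a quick count of the distinct values does the trick. The expected numbers follow directly from the modulo expressions used above; the query is only a sanity check and not part of the benchmark:

test=# SELECT count(DISTINCT x) AS distinct_x,
	count(DISTINCT y) AS distinct_y
	FROM 	t_agg;

It should report 2 and 10,000 respectively.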

Let us VACUUM the table to set hint bits and to build optimizer statistics. To make those execution plans more readable I also decided to turn off parallel queries:

test=# VACUUM ANALYZE ;
VACUUM
test=# SET max_parallel_workers_per_gather TO 0;
SET

Running an aggregation

Now that the data is in place, the first tests can be started:

test=# explain analyze SELECT x, y, avg(z) FROM t_agg GROUP BY 1, 2;
                                               QUERY PLAN                                                         
--------------------------------------------------------------------------------------------
 HashAggregate  (cost=238697.01..238946.71 rows=19976 width=40) 
		(actual time=3334.320..3339.929 rows=10000 loops=1)
   Group Key: x, y
   ->  Seq Scan on t_agg  (cost=0.00..163696.15 rows=10000115 width=19) 
		(actual time=0.058..636.763 rows=10000000 loops=1)
 Planning Time: 0.399 ms
 Execution Time: 3340.483 ms
(5 rows)

PostgreSQL will read the entire table sequentially and perform a hash aggregate. As you can see, most of the time is burned by the hash aggregate (3.3 seconds minus 636 milliseconds). The result set contains 10,000 rows. However, we can do better. Keep in mind that the first column does not contain as many distinct values as the second column. That has implications for the hash aggregate. Let us try to play around with the GROUP BY clause.

Changing aggregation order can improve performance

Let us run the same query again. But this time we won’t use “GROUP BY x, y” but instead use “GROUP BY y, x”. The result of the statement will be exactly the same as before (= 10,000 groups). However, the slightly modified query will be faster:

test=# explain analyze SELECT x, y, avg(z) FROM t_agg GROUP BY 2, 1;
                                                        QUERY PLAN                                                         
-----------------------------------------------------------------------------------------------------
 HashAggregate  (cost=238697.01..238946.71 rows=19976 width=40) 
		(actual time=2911.989..2917.276 rows=10000 loops=1)
   Group Key: y, x
   ->  Seq Scan on t_agg  (cost=0.00..163696.15 rows=10000115 width=19) 
		(actual time=0.052..580.747 rows=10000000 loops=1)
 Planning Time: 0.144 ms
 Execution Time: 2917.706 ms
(5 rows)

Wow, the query has improved considerably. We saved around 400 ms, which is a really big deal. The beauty is that we did not have to rearrange the data, change the table structure, adjust memory parameters, or make any other changes to the server. All we did was change the order in which PostgreSQL aggregates the data.
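
If you want to convince yourself that both variants really return the same 10,000 groups, counting them is a cheap sanity check (just an illustration, not part of the benchmark):

test=# SELECT count(*) FROM (SELECT x, y FROM t_agg GROUP BY y, x) AS sub;

It returns 10,000 no matter whether the subquery groups by x, y or by y, x; only the speed of the aggregation differs.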

Which conclusions can developers draw from this example? If you are grouping by several columns, put the columns with the most distinct values first and the ones with fewer distinct values later. In many cases this will make the hash aggregate run more efficiently. Also make sure that work_mem is high enough so that PostgreSQL actually chooses a hash aggregate in the first place. Using a hash is usually faster than letting PostgreSQL fall back to a sorted “group aggregate”.
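
As a rough sketch of that last point, you can raise work_mem for the current session and look at the plan again. The 256MB below is purely an illustrative value; size it according to your data and the memory you have available:

test=# SET work_mem TO '256MB';
SET
test=# EXPLAIN SELECT x, y, avg(z) FROM t_agg GROUP BY 2, 1;

If work_mem is large enough for the estimated hash table, the plan shows a HashAggregate on top of the sequential scan. If it is too small, PostgreSQL falls back to sorting the data and using a GroupAggregate, which is usually slower for a query like this.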

It is very likely that future versions of PostgreSQL (maybe starting with PostgreSQL 12?) will perform this kind of reordering automatically. A patch has already been proposed by Teodor Sigaev, and I am quite confident that this optimization will make it into PostgreSQL 12. In the meantime, it is easy to make the change by hand and enjoy a nice, basically free speedup.

If you want to learn more about GROUP BY, aggregations and work_mem in general, consider checking out my blog post about this topic. If you want to dig deeper into aggregation and follow Teodor Sigaev’s patch, check out the PostgreSQL mailing list. On behalf of the entire team I wish everybody “happy performance tuning”.

If you want to learn more about performance tuning, advanced SQL and so on, consider checking out one of our posts about window functions and analytics.

 

