EXPLAIN EXTENDED

How to create fast database queries

PostgreSQL: optimizing DISTINCT

with 8 comments

In PostgreSQL (as of 8.3, at least), performance of DISTINCT clause in SELECT list is quite poor.

Probably because DISTINCT code in PostgreSQL is very, very old, it always acts in same dumb way: sorts the resultset and filters out the duplicate records.

GROUP BY that can be used for the same purpose is more smart, as it employs more efficient HashAggregate, but its performance is still poor for large dataset.

All major RDBMS's, including MySQL, are able to jump over index keys to select DISTINCT values from an indexed table. This is extremely fast if there are lots of records in a table but not so many DISTINCT values.

This behavior can be emulated in PostgreSQL too.

Let's create a sample table:

CREATE TABLE t_distinct (
id INT NOT NULL PRIMARY KEY,
grouper INT NOT NULL,
value VARCHAR(50) NOT NULL
);

CREATE INDEX ix_distinct_grouper ON t_distinct (grouper);

BEGIN TRANSACTION;

SELECT  SETSEED(0.20090503);

INSERT
INTO    t_distinct (id, grouper, value)
SELECT  s, (s - 1) % 10 + 1, 'Value ' || s
FROM    generate_series(1, 1000000) s;

COMMIT;

ANALYZE t_distinct;

This table has 1,000,000 rows and 10 DISTINCT grouper's. There is an index on grouper.

The built-in DISTINCT clause:

SELECT  DISTINCT grouper
FROM    t_distinct

grouper
1
2
3
4
5
6
7
8
9
10
10 rows fetched in 0.0003s (1.3437s)
Unique  (cost=0.00..56583.08 rows=10 width=4)
  ->  Index Scan using ix_distinct_grouper on t_distinct  (cost=0.00..54082.85 rows=1000090 width=4)

works for 1.34 seconds, while GROUP BY:

SELECT  grouper
FROM    t_distinct
GROUP BY
grouper

grouper
6
9
3
5
7
1
2
10
4
8
10 rows fetched in 0.0002s (0.4325s)
HashAggregate  (cost=18871.13..18871.22 rows=10 width=4)
  ->  Seq Scan on t_distinct  (cost=0.00..16370.90 rows=1000090 width=4)

is little more efficient, but still takes 432 milliseconds.

To emulate the INDEX GROUP BY used by MySQL, we just need to jump over the index keys, issuing

SELECT  MIN(grouper)
FROM    t_distinct
WHERE   grouper > current_grouper

PostgreSQL is smart enough to use the indexes for such a query.

Unfortunately, PostgreSQL doesn't offer a simple way to use session variables (you need to edit configuration files to access them which can be a pain on a database you have no control over).

Even PL/PgSQL doesn't get installed by default on some distributions and cannot be relied upon.

Fortunately, there is a workaround using recursive set-returning functions.

PostgreSQL allows recursion for SQL functions, and the recursion stack can be used to record the current_grouper used in the query.

The function looks like this:

CREATE OR REPLACE FUNCTION "20090503_distinct".skipscan_distinct(INT) RETURNS SETOF "20090503_distinct".t_distinct.grouper%TYPE AS
$$
SELECT  MIN(grouper)
FROM    "20090503_distinct".t_distinct
WHERE   grouper = $1
HAVING  MIN(grouper) IS NOT NULL
UNION ALL
SELECT  *
FROM    "20090503_distinct".skipscan_distinct (
(
SELECT  MIN(grouper)
FROM    "20090503_distinct".t_distinct
WHERE   grouper > $1
HAVING  MIN(grouper) IS NOT NULL
)
)
WHERE   $1 IS NOT NULL
$$
LANGUAGE SQL;

This function takes a current_grouper as a parameter, and selects the current_grouper UNION'ed with the function's result for the next grouper.

Let's see the performance:

SELECT  *
FROM    "20090503_distinct".skipscan_distinct(-1)

skipscan_distinct
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
15 rows fetched in 0.0003s (0.0381s)
Function Scan on skipscan_distinct  (cost=0.00..260.00 rows=1000 width=4)

38 ms, much faster than GROUP BY.

However, there are side effects:

  • This function uses recursion, and a stack overflow can occur if there are too many DISTINCT values
  • This function is efficient only if there is an index on the DISTINCT field

But for the cases like described above, this function is a real boon.

Written by Quassnoi

May 3rd, 2009 at 11:00 pm

Posted in PostgreSQL

8 Responses to 'PostgreSQL: optimizing DISTINCT'

Subscribe to comments with RSS

  1. Thank you for describing the current state of DISTINCT in PostgreSQL so clearly. Do you have any faith in this getting resolved?

    Christoffer S.

    12 Nov 12 at 02:11

  2. I don’t know. It’s not me who develops PostgreSQL and they don’t report to me.

    In Oracle and SQL Server, by the way, it does not work out of box also, you need some special tricks to make loose index scans work, though it’s still possible.

    Quassnoi

    12 Nov 12 at 02:20

  3. I tried this on my machine (postgres 9.2) and result was 256 ms for the last query (which utilizes the function)

    “Function Scan on skipscan_distinct (cost=0.25..10.25 rows=1000 width=4) (actual time=256.244..256.245 rows=10 loops=1)”

    Milovan

    8 Mar 13 at 15:30

  4. +notifier

    Milovan

    8 Mar 13 at 15:31

  5. This type of query performs better on my machine:

    explain analyse
    WITH RECURSIVE t(n) AS (
    SELECT MIN(grouper) FROM t_distinct
    UNION ALL
    SELECT (SELECT t_distinct.grouper FROM t_distinct WHERE t_distinct.grouper > n LIMIT 1)
    FROM t WHERE n IS NOT NULL
    )
    SELECT n FROM t;

    Milovan

    8 Mar 13 at 15:33

  6. I forgot to add the order clause. Now this works extremely fast:

    explain analyse
    WITH RECURSIVE t(n) AS (
    SELECT MIN(grouper) FROM t_distinct
    UNION ALL
    SELECT (SELECT t_distinct.grouper FROM t_distinct WHERE t_distinct.grouper > n ORDER BY t_distinct.grouper LIMIT 1)
    FROM t WHERE n IS NOT NULL
    )
    SELECT n FROM t;

    Milovan

    8 Mar 13 at 16:51

  7. Milovan

    8 Mar 13 at 17:12

  8. Thanks.
    I tried with Postgres 9.4. PG 9.4 is using the same plan with group_by and distinct.
    the function variation is much slower at first query
    Same as group_by at second query.

    select * from skipscan_distinct(-1);
    (10 rows)
    Time: 473.639 ms

    select * from skipscan_distinct(-1);
    skipscan_distinct
    ——————-
    1
    2
    3
    4
    5
    6
    7
    8
    9
    10
    (10 rows)
    Time: 240.236 ms

    Van

    13 Sep 16 at 07:03

Leave a Reply