CYBERTEC Logo

How the PostgreSQL query optimizer works

03.2021 / Category: / Tags: | | |

Just like any advanced relational database, PostgreSQL uses a cost-based query optimizer that tries to turn your SQL queries into something efficient that executes in as little time as possible. For many people, the workings of the optimizer itself remain a mystery, so we have decided to give users some insight into what is really going on behind the scenes.

Table of Contents

So let’s take a tour through the PostgreSQL optimizer and get an overview of some of the most important techniques the optimizer uses to speed up queries. Note that the techniques listed here are in no way complete. There is a lot more going on, but it makes sense to take a look at the most basic things in order to gain a good understanding of the process.

PostgreSQL constant folding

Constant folding is one of the easier processes to understand. Nonetheless, it's extremely important.

Let’s see what happens during the constant folding process:

Here we add a filter to the query: x = 7 + 1. What the system does is to “fold” the constant and instead do “x = 8”. Why is that important? In case “x” is indexed (assuming it is a table), we can easily look up 8 in the index.

Note the following:

PostgreSQL does not transform the expression to “x = 8” in this case. That’s why you should try to make sure that the filter is on the right side, and not on the column you might want to index.

PostgreSQL query optimizer: function inlining

One more important technique is the idea of function inlining. The goal is to reduce function calls as much as possible and thus speed up the query.

Let’s create a function to calculate a logarithm:

2^10 = 1024. This looks right.

Now, let’s see what happens in a real query that uses function inlining:

Look in the WHERE clause. The ld function has been replaced with the underlying log function. Note that this is only possible in the case of SQL functions. PL/pgSQL and other stored procedure languages are black boxes to the optimizer, so whether these things are possible or not depends on the type of language used.

Here is an example using PL/pgSQL:

In this case, inlining is not possible. While the code is basically the same, the programming language does make a major difference.

Function stability: VOLATILE vs. STABLE vs. IMMUTABLE

Something that is often overlooked is the concept of function stability. When creating a function, it makes a difference if a function is created as VOLATILE (default), STABLE, or as IMMUTABLE. It can even make a major difference - especially if you are using indexes. Let’s create some sample data and sort these differences out:

VOLATILE means that a function is not guaranteed to return the same result within the same transaction given the same input parameters. In other words, the PostgreSQL optimizer cannot see the function as a constant, and has to execute it for every row.

Here's an example where the optimizer has to execute the function for every row:

We have generated a list of 64 million entries containing 1 row per minute since January 1900, which produces 64 million entries.

Let’s run the query using a VOLATILE function:

In this case, the query needs a whopping 2.6 seconds and eats up a ton of resources. The reason is that clock_timestamp() is VOLATILE.

Now, what if we try to do the same thing using a STABLE function?

The query is many thousands of times faster, because now PostgreSQL can turn it into a constant and thus use the index. If you want to learn more about function stability in PostgreSQL, here is more information.

Equality constraints

The next optimization on our list is the concept of equality constraints. What PostgreSQL tries to do here is to derive implicit knowledge about the query.

First, let’s create some sample data for our equality constraints:

What we have here are 1000 rows. We’ll run a simple query:

Again, the magic is in the execution plan. You can see that PostgreSQL has figured that x and y are 4.

That opens the door for important optimizations:

Without this optimization, it would be absolutely impossible to use the index we have just created. In order to optimize the query, PostgreSQL will automatically figure out that we can use an index here.

Indexes are important!

View inlining and subselect flattening

When talking about the PostgreSQL optimizer and query optimization there is no way to ignore views and subselect handling.

Let’s create a view and query it:

When we look at the execution plan, the view is nowhere to be seen.

The reason is that PostgreSQL has inlined the view as a subselect and flattened it out. How does that work?

This query is turned into …

Then the subselect is flattened out which leaves us with …

There is a parameter in the PostgreSQL query optimizer, from_collapse_limit, which controls this behavior:

The meaning of this parameter is that only up to 8 subselects in the FROM clause will be flattened out. If there are more than 8 subselects, they will be executed without being flattened out. In most real-world use cases, this is not a problem. It can only become an issue if the SQL statements used are very complex. More information about joins and join_collapse_limit can be found in our blog.

Keep in mind that inlining is not always possible. Developers are aware of that.

Optimizing joins in PostgreSQL

Joins are used in most queries and are therefore of incredible importance to good performance. We’ll now focus on some of the techniques relevant to joins in general.

Optimizing join orders

The next important thing on our list is the way the PostgreSQL optimizer handles join orders. In a PostgreSQL database, joins are not necessarily done in the order proposed by the end user - quite the opposite: The query optimizer tries to figure out as many join options as possible.

Let’s create some sample data and figure out how the optimizer works when it comes to join orders:

In the next steps, a couple of indexes are created:

We now have three tables. We’ll query them and see what happens:

Note that the query joins “c and a” and then “a and b”. However, let’s look at the plan more closely. PostgreSQL starts with index scans on a and b. The result is then joined with c. Three indexes are used. This happens because of the equality constraints we discussed before. To find out about forcing the join order, see this blog.

Implicit vs. explicit joins in PostgreSQL

Many people keep asking about explicit versus implicit joins. Basically, both variants are the same.

Let’s check out two queries, one with an explicit and one with an implicit join:

Both queries are identical and the planner will treat them the same way for most commonly seen queries. Mind that the explicit joins work with and without parenthesis.

However, there is one parameter that is of great importance here, join_collapse_limit:

The join_collapse_limit parameter controls how many explicit joins are planned implicitly. In other words, an implicit join is just like an explicit join, but only up to a certain number of joins controlled by this parameter. See this blog for more information. It is also possible to use join_collapse_limit to force the join order, as explained in this blog.

For the sake of simplicity, we can assume that it makes no difference for 95% of all queries and for most customers.

Determine the join strategy

PostgreSQL offers various join strategies. These strategies include hash joins, merge joins, nested loops, and a lot more. We have already shared some of this information in previous posts. More on PostgreSQL join strategies can be found here.

Optimizing outer joins (LEFT JOIN, etc.)

Optimizing outer joins (LEFT JOIN, RIGHT JOIN, etc.) is an important topic. Usually, the planner has fewer options here than in the case of inner joins. The following optimizations are possible:

where Pac is a predicate referencing A and C, etc (in this case, clearly
Pac cannot reference B, or the transformation is nonsensical).

While this theoretical explanation is correct, most people will have no clue what it means in real life.

Therefore I have compiled a real-world example showing how PostgreSQL actually reorders a real join:

What we see here is that the PostgreSQL optimizer decides on joining x with y and then with z. In other words, the PostgreSQL optimizer has simply followed the join order as used in the SQL statement.

But what happens if we decide to tweak the parameters a bit?

This is the same query, but with slightly altered parameters.

The difference is that PostgreSQL again starts with x but then joins z first, before adding y.

Note that this optimization happens automatically. One reason why the optimizer can make this decision is because of the existence of optimizer support functions which were added to PostgreSQL a while ago. The reason why the reordering works is that support functions offer the planner a chance to figure out how many rows are returned from which part. If you use tables instead of set returning functions, support functions are irrelevant. PostgreSQL v16 has added support for "anti-joins" in RIGHT and OUTER queries.

Automatic join pruning in PostgreSQL

Not every join in a query is actually executed by PostgreSQL. The optimizer knows the concept of join pruning and is able to get rid of pointless joins quite efficiently. The main question is: When is that possible, and how can we figure out what’s going on?

The next listing shows how some suitable sample data can be created:

In this case, we need to make sure that both sides actually have primary keys, or some kind of unique constraint:

To show how the PostgreSQL query planner handles join pruning, we’ll take a look at two different SQL statements:

In this case, the join has to be executed. As you can see, PostgreSQL has decided on a hash join.

The next example contains only a small variation of the query:

Join pruning can happen if we DO NOT read data from the right side, and if the right side is unique. If the right side is not unique, the join might actually increase the number of rows returned; so pruning is only possible in case the right side is unique.

Let’s try out join pruning:

While it is certainly a good thing to have join pruning in the PostgreSQL optimizer, you have to be aware of the fact that the planner is basically fixing something which should not exist anyway. Write queries efficiently in the first place; don’t add pointless joins.

EXISTS and anti-joins

There is also something common you will see everywhere in the code within SQL EXISTS. Here’s an example:

This might not look like a big deal, but consider the alternatives: What PostgreSQL does here is to create a “hash anti-join”. This is way more efficient than some sort of nested loop. In short: The nested loop is replaced with a join which can yield significant performance gains.

Making use of sorted output

Every database relies heavily on sorting, which is necessary to handle many different types of queries and optimize various workloads. One of the key optimizations in this area is that PostgreSQL can use indexes to optimize ORDER BY in a very clever way. See more about how PostgreSQL can optimize subqueries in the blog Subqueries and Performance in PostgreSQL.

Let’s see how optimizing ORDER BY looks:

The listing created a list of 1 million entries that’s been stored on disk in random order. Subsequently, VACUUM was called to ensure that all PostgreSQL hint bit-related issues were sorted out before the test is executed. If you want to know what hint bits are and how they operate, check out our post about hint bits in PostgreSQL.

Let's run explain analyze:

PostgreSQL needs more than one core to process the query in 86 milliseconds, which is a lot.

Now, what happens if we create an index?

After adding an index, the query executed in a fraction of a millisecond. However, what is most important here is that we do NOT see a sort-step. PostgreSQL knows that the index returns data in sorted order (sorted by id) and thus there is no need to sort the data all over again. PostgreSQL consults the index and can simply take the data as it is and feed it to the client until enough rows have been found. In this special case, even an index-only scan is possible, because we are only looking for columns which actually exist in the index.

Reducing the amount of data to be sorted is vital to performance, and thus important to the user experience.

min and max can be used to reduce the amount of data to be sorted:

The minimal value is the first value in a sorted list that is not NULL. The max value is the last value in a sequence of sorted values that is not NULL. PostgreSQL can take this into consideration and replace the standard way of processing aggregates with a subplan that simply consults the index.

In a way, calling “min” is the same as …

Fortunately, PostgreSQL does this for you and optimizes the query perfectly.

Partition pruning and constraint exclusion at work

Partitioning is one of the favorite features of many PostgreSQL users. It offers you the ability to reduce table sizes and break up data into smaller chunks, which are easier to handle and in many cases (not all of them) faster to query. However, keep in mind that SOME queries might be faster - increased planning time can, however, backfire - this is not a rare corner case. We have seen partitioning decrease speed countless times in cases when partitioning wasn’t suitable.

Logically, the PostgreSQL optimizer has to take care of partitioning in a clever way and make sure that only the partitions are touched which might actually contain some of the data.

The following is an example of how the optimizer manages partitioning issues:

In this case, a table with two partitions has been created.

What is the planner going to do if we are only looking for positive values?

The PostgreSQL optimizer correctly figured out that the data cannot be in one of the partitions and removed it from the execution plan.

If we want to query all values below 1000, it's not possible, and all partitions are correctly queried:

Conclusion…

The optimizations you have seen on this page are only the beginning of everything that PostgreSQL can do for you. It’s a good start on getting an impression of what is going on behind the scenes. Usually, database optimizers are some sort of black box and people rarely know which optimizations really happen. The goal of this page is to shed some light on the mathematical transformations going on.

 


In order to receive regular updates on important changes in PostgreSQL, subscribe to our newsletter, or follow us on Twitter, Facebook, or LinkedIn.

0 0 votes
Article Rating
Subscribe
Notify of
guest
1 Comment
Oldest
Newest Most Voted
Inline Feedbacks
View all comments
Marc Rechté
Marc Rechté
3 years ago

Many thanks. Just noticed an indentation problem on line 16, 3rd listing of
"Optimizing join orders"§

CYBERTEC Logo white
CYBERTEC PostgreSQL International GmbH
Römerstraße 19
2752 Wöllersdorf
Austria

+43 (0) 2622 93022-0
office@cybertec.at

Get the newest PostgreSQL Info & Tools


    This site is protected by reCAPTCHA and the Google Privacy Policy & Terms of Service apply.

    ©
    2024
    CYBERTEC PostgreSQL International GmbH
    phone-handsetmagnifiercrosscross-circle
    1
    0
    Would love your thoughts, please comment.x
    ()
    x
    linkedin facebook pinterest youtube rss twitter instagram facebook-blank rss-blank linkedin-blank pinterest youtube twitter instagram