
Query Processing

Overview

Query processing consists of the following stages:

  1. Parser: The parser generates a parse tree from an SQL statement in plain text.
  2. Analyzer: The analyzer (also spelled analyser) carries out a semantic analysis of a parse tree and generates a query tree.
  3. Rewriter: The rewriter transforms a query tree using the rules stored in the rule system if such rules exist.
  4. Planner: The planner generates the plan tree that can most effectively be executed from the query tree.
  5. Executor: The executor executes the query by accessing the tables and indexes in the order specified by the plan tree.

Parser

The parser takes an SQL statement in plain text and generates a parse tree that can be read by the subsequent subsystems.

The parser only checks the syntax of the input when generating a parse tree; it returns an error only if the query contains a syntax error. The parser does not check the semantics of the input query. For example, even if the query contains a table name that does not exist, the parser does not return an error. Semantic checks are done by the analyzer.

Analyzer

The analyzer runs a semantic analysis of the parse tree generated by the parser and generates a query tree.

Rewriter

The rewriter transforms a query tree according to the rules stored in the rule system (the pg_rewrite catalog, viewable through the pg_rules system view), if such rules exist.

More information about the rule system can be found in the PostgreSQL documentation.

Planner and Executor

The planner receives a query tree from the rewriter and generates a (query) plan tree that can be processed by the executor most effectively.

A plan tree is composed of elements called plan nodes, and it is connected to the plantree list of the PlannedStmt structure. Each plan node holds the information the executor requires for processing, and in the case of a single-table query the executor processes the plan tree from the leaf nodes up to the root.
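For illustration, here is roughly what a plan tree for a single-table query with a sort looks like in EXPLAIN output (the table and costs are hypothetical; your output will differ). The executor first runs the Seq Scan leaf node and then feeds its output to the Sort node above it:

EXPLAIN SELECT * FROM tbl WHERE id < 300 ORDER BY data;
                          QUERY PLAN
---------------------------------------------------------------
 Sort  (cost=182.34..183.09 rows=300 width=8)
   Sort Key: data
   ->  Seq Scan on tbl  (cost=0.00..145.00 rows=300 width=8)
         Filter: (id < 300)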

Cost Estimation

PostgreSQL's query optimization is based on cost.

In PostgreSQL, there are three kinds of costs: start-up, run and total. The total cost is the sum of start-up and run costs; thus, only the start-up and run costs are independently estimated.

  • The start-up cost is the cost expended before the first tuple is fetched. For example, the start-up cost of the index scan node is the cost to read index pages to access the first tuple in the target table. For a sequential scan, the startup cost will generally be close to zero, as it can start fetching rows straight away. For a sort operation, it will be higher because a large proportion of the work needs to be done before rows can start being returned.
  • The run cost is the cost to fetch all tuples.
  • The total cost is the sum of the start-up and run costs.

The costs are in an arbitrary unit. A common misunderstanding is that they are in milliseconds or some other unit of time, but that’s not the case.

The cost units are anchored (by default) to a single sequential page read costing 1.0 units (seq_page_cost). Each row processed adds 0.01 (cpu_tuple_cost), and each non-sequential page read adds 4.0 (random_page_cost).
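As a worked example (assuming a hypothetical table tbl with 45 pages and 10,000 rows, and the default cost parameters), a sequential scan is estimated at 45 pages * 1.0 (seq_page_cost) + 10,000 rows * 0.01 (cpu_tuple_cost) = 145 units:

EXPLAIN SELECT * FROM tbl;
                       QUERY PLAN
---------------------------------------------------------
 Seq Scan on tbl  (cost=0.00..145.00 rows=10000 width=8)

The two numbers in cost=0.00..145.00 are the start-up and total costs described above.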

How the costs are calculated

In order to calculate these costs, the Postgres query planner uses both constants (cpu_tuple_cost = 0.01, cpu_operator_cost = 0.0025, random_page_cost = 4.0, seq_page_cost = 1.0) and metadata about the contents of the database. The metadata is often referred to as statistics.

These statistics include a number of very useful things, like roughly the number of rows each table has, and what the most common values in each column are.
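You can inspect these statistics yourself through the system catalogs; a sketch, assuming a table named tbl with a text column name:

-- approximate row and page counts maintained in pg_class
SELECT relname, reltuples, relpages FROM pg_class WHERE relname = 'tbl';

-- per-column statistics gathered by ANALYZE, exposed through the pg_stats view
SELECT most_common_vals, most_common_freqs, n_distinct
FROM pg_stats
WHERE tablename = 'tbl' AND attname = 'name';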

Creating the Plan Tree

The planner in PostgreSQL performs three steps, as shown below:

  1. Preprocessing (normalizing boolean expressions, flattening AND/OR expressions, and so on)
  2. Getting the cheapest access path by estimating the costs of all possible access paths (demonstrated below)
  3. Creating the plan tree from the cheapest path
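You can observe the outcome of step 2 with EXPLAIN: discouraging one access path makes the planner fall back to the next-cheapest one. A sketch, assuming a table tbl with an index on id:

EXPLAIN SELECT * FROM tbl WHERE id < 300;
-- the planner estimated the costs of a sequential scan path and an index
-- scan path, and built the plan tree from whichever was cheaper

SET enable_seqscan = off;  -- heavily penalize the sequential scan path
EXPLAIN SELECT * FROM tbl WHERE id < 300;
-- the plan now uses the index path, the cheapest remaining candidate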

Scan Nodes

Sequential Scan

This is the simplest way of fetching data from a table: it scans through every page of data sequentially. Like most other scans, it can apply a filter while reading, but it must read each row first and then discard the ones that do not match. A sequential scan has no way to zero in on just the data you want: it always reads everything in the table.
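In EXPLAIN output a sequential scan looks like this (hypothetical table and costs); note the Filter line, which is applied to each row after it has been read:

EXPLAIN SELECT * FROM tbl WHERE data = 'abc';
                       QUERY PLAN
---------------------------------------------------------
 Seq Scan on tbl  (cost=0.00..170.00 rows=1 width=12)
   Filter: (data = 'abc'::text)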

Index Scan

An index scan uses an index to find either a specific row, or all rows matching a predicate. An index scan must first look up each row in the index, and then check the actual table data for that index entry. The table data must be checked to ensure that the row it found is actually visible to the current transaction, and also to fetch any columns included in the query that are not present in the index. Because of this, an index scan actually has higher per-row overhead than a sequential scan: its real advantage is that it allows you to read only some of the rows in a table. If your query predicate is not very selective (that is, if few rows are filtered out), a sequential scan may still be more efficient than an index scan.
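A sketch of the corresponding EXPLAIN output, assuming an index tbl_idx on the id column (costs will vary); the Index Cond line shows the predicate that is evaluated against the index itself:

EXPLAIN SELECT * FROM tbl WHERE id = 42;
                            QUERY PLAN
-------------------------------------------------------------------
 Index Scan using tbl_idx on tbl  (cost=0.29..8.30 rows=1 width=12)
   Index Cond: (id = 42)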

Index Only Scan

This is very similar to an Index Scan, but the data comes directly from the index and the visibility check is handled specially, so it can avoid looking at the table data entirely.

For example:

Table: "public.tbl"

 Column |  Type   | Modifiers
--------+---------+-----------
 id     | integer |
 name   | text    |
 data   | text    |

Indexes: "tbl_idx" btree (id, name)

SELECT id, name FROM tbl WHERE id BETWEEN 18 AND 19;

This query reads data from two columns of the table, id and name, and the index tbl_idx is composed of these columns. Thus, when using an index scan, it seems at first glance that accessing the table pages is not required, because the index tuples contain the necessary data.

In fact, however, PostgreSQL must in principle check the visibility of the tuples, and index tuples do not carry any transaction information such as the t_xmin and t_xmax of heap tuples.

To avoid unnecessary heap accesses, PostgreSQL uses the visibility map of the target table. If all tuples stored in a page are visible, PostgreSQL returns the key of the index tuple without accessing the table page it points to; otherwise, PostgreSQL reads the table tuple that the index tuple points to and checks its visibility, which is the ordinary process.
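The plan for the example query above then looks roughly like this (costs will vary):

EXPLAIN SELECT id, name FROM tbl WHERE id BETWEEN 18 AND 19;
                               QUERY PLAN
------------------------------------------------------------------------
 Index Only Scan using tbl_idx on tbl  (cost=0.29..8.33 rows=2 width=36)
   Index Cond: ((id >= 18) AND (id <= 19))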

Bitmap Index Scan and Bitmap Heap Scan

A bitmap index scan is a middle ground between a sequential scan and an index scan. Like an index scan, it scans an index to determine exactly what data it needs to fetch, but like a sequential scan, it takes advantage of data being easier to read in bulk.

The bitmap index scan actually operates in tandem with a Bitmap Heap Scan: it does not fetch the data itself. Instead of producing the rows directly, the bitmap index scan constructs a bitmap of potential row locations. It feeds this data to a parent Bitmap Heap Scan, which can decode the bitmap to fetch the underlying data, grabbing data page by page.

A bitmap heap scan takes a row location bitmap generated by a Bitmap Index Scan and looks up the relevant data. A Bitmap Heap Scan is the most common parent node of a Bitmap Index Scan, but a plan may also combine several different Bitmap Index Scans with BitmapAnd or BitmapOr nodes before actually fetching the underlying data. This allows Postgres to use two different indexes at once to execute a query.
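For example, a predicate that combines two ranges with OR can produce a plan like the following (costs omitted): two Bitmap Index Scans are merged by a BitmapOr node before the Bitmap Heap Scan fetches the pages:

EXPLAIN SELECT * FROM tbl WHERE id < 100 OR id > 900;

 Bitmap Heap Scan on tbl
   Recheck Cond: ((id < 100) OR (id > 900))
   ->  BitmapOr
         ->  Bitmap Index Scan on tbl_idx
               Index Cond: (id < 100)
         ->  Bitmap Index Scan on tbl_idx
               Index Cond: (id > 900)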

What does "Bitmap Heap Scan" phase do?

A plain Index Scan fetches one tuple-pointer at a time from the index, and immediately visits that tuple in the table. A bitmap scan fetches all the tuple-pointers from the index in one go, sorts them using an in-memory “bitmap” data structure, and then visits the table tuples in physical tuple-location order.

What is "Recheck condition" and why is it needed?

If the bitmap gets too large, it is converted to a "lossy" style, in which it only remembers which pages contain matching tuples instead of remembering each tuple individually. When that happens, the table-visiting phase has to examine each tuple on the page and recheck the scan condition to see which tuples to return.
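You can see this with EXPLAIN ANALYZE: under a small work_mem the bitmap becomes lossy, and the output reports lossy heap blocks and the rows removed by the recheck. A sketch, where big_tbl is a hypothetical large table:

SET work_mem = '64kB';  -- force the bitmap to exceed memory and turn lossy
EXPLAIN ANALYZE SELECT * FROM big_tbl WHERE id < 500000;
--  Bitmap Heap Scan on big_tbl ...
--    Recheck Cond: (id < 500000)
--    Rows Removed by Index Recheck: ...
--    Heap Blocks: exact=... lossy=...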

Join Operations

PostgreSQL supports three join operations: nested loop join, merge join and hash join.

Nested Loop Join

This is the simplest and most general join strategy of all. PostgreSQL scans the outer relation sequentially, and for each result row it scans the inner relation for matching rows.

Indexes that can help with nested loop joins

Since we scan the outer relation sequentially, no index on the outer relation will help. But an index on the join key of the inner relation can speed up a nested loop join considerably.

Use cases for the nested loop join strategy

Nested loop joins are particularly efficient if the outer relation is small, because then the inner loop won’t be executed too often. If the outer relation is large, nested loop joins are usually very inefficient, even if they are supported by an index on the inner relation. Nested loop joins are also used as the only option if the join condition does not use the = operator.
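A typical plan shape, assuming a small table small_tbl and a large table big_tbl with an index on the join key (all names hypothetical); the inner Index Scan is repeated once per outer row:

EXPLAIN SELECT * FROM small_tbl s JOIN big_tbl b ON b.small_id = s.id;

 Nested Loop
   ->  Seq Scan on small_tbl s
   ->  Index Scan using big_tbl_small_id_idx on big_tbl b
         Index Cond: (small_id = s.id)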

Merge Join

In a merge join, PostgreSQL picks all join conditions with the = operator. It then sorts both tables by the join keys (which means that the data types must be sortable). Then it iterates through both sorted lists and finds matching entries.

Indexes that help with a merge join

An index on the sort keys can speed up sorting, so an index on the join keys on both relations can speed up a merge join.

Use cases for the merge join strategy

The optimizer usually chooses a merge join if the involved relations are both too big for a hash that fits into work_mem. So this is the best strategy for joining really large tables. Merge join is only feasible if there is at least one join condition with the = operator.
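A typical merge join plan looks like this (hypothetical tables t1 and t2); if an index on the join key already delivers sorted output, the corresponding Sort node disappears:

EXPLAIN SELECT * FROM t1 JOIN t2 ON t1.id = t2.id;

 Merge Join
   Merge Cond: (t1.id = t2.id)
   ->  Sort
         Sort Key: t1.id
         ->  Seq Scan on t1
   ->  Sort
         Sort Key: t2.id
         ->  Seq Scan on t2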

Hash Join

First, PostgreSQL scans the inner relation sequentially and builds a hash table, where the hash key consists of all join keys that use the = operator. Then it scans the outer relation sequentially and probes the hash for each row found to find matching join keys.

Indexes that can help with hash joins

None. The hash table is built from scratch at execution time, so an index on the join keys cannot speed up the hash join itself.

Use cases for the hash join strategy

Hash joins are best if none of the involved relations are small, but the hash table for the smaller table fits in work_mem. This is because otherwise PostgreSQL would build the hash in several batches and store them in temporary disk files, which hurts performance.

Looking up values in a hash table only works if the operator in the join condition is =, so you need at least one join condition with that operator.
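A typical hash join plan (hypothetical tables): the smaller relation is hashed under the Hash node, and the larger one is scanned and probed against it:

EXPLAIN SELECT * FROM big_tbl b JOIN small_tbl s ON b.small_id = s.id;

 Hash Join
   Hash Cond: (b.small_id = s.id)
   ->  Seq Scan on big_tbl b
   ->  Hash
         ->  Seq Scan on small_tbl s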

How to make PostgreSQL choose the correct join strategy

You can disable different join strategies temporarily with the SET command, which changes a parameter in your current database session:

SET enable_hashjoin = off;
SET enable_mergejoin = off;
SET enable_nestloop = off;
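Note that these parameters discourage a strategy by making it look prohibitively expensive rather than forbidding it outright, so the planner may still pick it if no alternative exists. To restore the defaults in the current session:

RESET enable_hashjoin;
RESET enable_mergejoin;
RESET enable_nestloop;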

seq_page_cost and random_page_cost

The default values of seq_page_cost and random_page_cost are 1.0 and 4.0, respectively. This means that PostgreSQL assumes random access is four times slower than sequential access; in other words, the defaults are based on the characteristics of HDDs.

Nowadays, however, the default value of random_page_cost is often too large because SSDs are widely used. If the default is kept despite running on an SSD, the planner may select inefficient plans. Therefore, when using an SSD, it is better to lower random_page_cost, for example to 1.0 (though the ideal value depends on the hardware).
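For example, you can lower the value for the current session to experiment, or persist it cluster-wide (1.0 is the value suggested above; the ideal setting depends on your hardware):

SET random_page_cost = 1.0;               -- current session only

ALTER SYSTEM SET random_page_cost = 1.0;  -- persisted to postgresql.auto.conf
SELECT pg_reload_conf();                  -- apply without a restart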

A blog post reported the problems that can arise from keeping the default value of random_page_cost.