While it's true the most problems can be solved by the appropriate use of the index, there are cases where you will just waste resources doing so. For casual developer it might seems like PostgreSQL decided to do its own thing, but when you look behind the scenes it all makes perfect sense. Here's a quick run down of the some reasons why planner might pass on index and rather do things without it.
The case of missing condition
In the evolution of the software developer journey into the realms of databases, the partial indexes are the next best thing. Why to create the full index where you can easily restrict it to the sub-set of the records? The trick is ensure every query aligns with the condition given during the index creation.
Think of an order management system, where most active orders are going to those in 'open' status. Most of the day-to-day operational tasks will resolve around those entries, a partial index like
CREATE INDEX open_orders ON orders(order_id) WHERE state = 'open';
provides a perfect match. It will allow to iterate only on a relatively small subset of the data, hence saving the disc space. As long as the application developers are aware of the partial condition, and how to use it.
The column order matters
The second common case is just a step away from the partial index. The compound indexes, or multi-column indexes, are versatile and powerful - that is when used correctly. The crucial element is the order in which the columns are indexed. The order impacts whatever it gets used or not at all.
The key is to match the left-most columns of the index in your queries. Once these are aligned, you can optionally skip or include additional columns in the filter, but the initial columns must be used to leverage the index effectively.
While it might be easy not to miss the condition of the country, when filtering for cities
CREATE INDEX contact_location ON contact_details (country_id, city);
consider the scenario which might not be so obvious.
CREATE INDEX brand_products_per_category ON products (category_id, brand_id);
In this case the index will only get used if either both category_id
and brand_id
are used, or at least category_id
- but not for brand_id
alone. In latter case PostgreSQL planner might (unless another index is available) to opt in for scan of the entire table.
In this case alternative strategy would be to switch the column order (should the application logic support it).
Low Selectivity
If we use the example of the partial index from the first section, we can easily demonstrate another case when index just might get ignored forever. It's the case for the columns with low selectivity, where a predominant value overshadows others, making an index less useful.
Take example of the ordering system. While having index for open
orders is helpful, in most scenarios majority of the records will end up in delivered
state. Hence
CREATE INDEX delivered_orders ON orders(order_id) WHERE state = 'delivered';
might seem beneficial, but in reality PostgreSQL will most likely chose to skip it and instead to scan the table. The reason? Should you consider the regular business operations, you might find vast majority (let's say 95%) or orders shipped and considered finished. With such a high percentage of the records orders, the index will do little to narrow the search for the records. Planner hence will perform the sequential scan directly. Why? It's all to do with the statistics that are available on the table.
table_name | orders
column_name | state
n_distinct | 4
most_common_vals | {delivered,cancelled,pending,open}
most_common_freqs | {0.95023334,0.030166665,0.013,0.0096}
Those are fictional statistics matching the order distribution described above. From there you can see that delivered
orders dominate the dataset, compromising approximately out of 95.02% of the records. The other statuses make up just for a small fraction.
The above data sample comes from the query similar to
SELECT
tablename AS table_name,
attname AS column_name,
n_distinct,
most_common_vals,
most_common_freqs
FROM
pg_stats
WHERE
tablename = 'orders'
AND schemaname = 'public';
Outdated Statistics
As demonstrated low selectivity to addressed by statistics, instead of index, it's not the only case where stale data can lead to inefficient query planning. Keeping database statistics current is crucial for PostgreSQL to make informed decisions about whether to use an index or opt for a sequential scan.
PostgreSQL heavily relies on statistics to estimate costs and make the right decisions across different query execution plans. The statistics cover various data points, like total number of rows in table(s), distinct values, their distribution, histogram bounds, correlation between physical row ordering and column values, and much more.
The trouble starts when statistics get out-of-date. It's similar as navigating using old maps. It just might you send you going in circles. How might the statistics in PostgreSQL get dated? Most cases involve high volume of data modifications or bulk operations, but will be affected also by schema changes. For all those operations it's always good idea to ANALYZE
the table to keep the DB up-to-date.
The prevent the statistics outdated, PostgreSQL alone either manual ANALYZE
or periodically tries to trigger it as part of autovacuum daemon. The key factor is whatever it can run fast and frequently enough to keep up with the changes. You can adjust the autovaccuum settings globally or per-table.
Keeping statistics up-to-date is crucial for maintaining good query performance, as it ensures that the query planner has accurate information to base its decisions on.
Summary
As demonstrated creating index might not be always the best course of action. Understanding those edge-cases is crucial not only for DBAs but also for developers. By keeping those considerations you can better design the schema and indexing strategy.
Indexes are still the next best thing in database world, but they require thoughtful implementation and maintenance. The goal is not only to create indexes, but to create the right indexes based on accurate, up-to-date data and aligned with your specific query patterns.