Playing with Postgres Histogram Stats
In my previous post about Postgres random_page_cost, we experimented with an orders table containing 10,000,000 rows and a status column with 4 distinct values. When filtering by each status, the query plans chosen by Postgres were as follows:
| Status | Count | % | Query Plan |
|---|---|---|---|
| AWAITING_DELIVERY | 2,547,518 | 25.4% | Bitmap Index Scan |
| AWAITING_PAYMENT | 426,870 | 4.2% | Bitmap Index Scan |
| CANCELLED | 22,552 | 0.2% | Bitmap Index Scan |
| COMPLETED | 7,003,060 | 70.0% | Seq Scan |
We showed how these decisions made sense based on the balance between random and sequential I/O. But we left out an important part of the story: how does Postgres know how many rows match each status value in the first place? It clearly decided to use a Seq Scan for status = 'COMPLETED' because it knew this would yield a huge number of rows. But how does it know that? Let’s take a look at how Postgres tracks and uses data distribution statistics behind the scenes.
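Before diving in, one way to watch these estimates in action: `EXPLAIN` (without `ANALYZE`) prints the plan together with the planner’s guessed row count, computed purely from statistics. A minimal sketch against the same table:

```sql
-- The "rows=..." figure in the plan is the planner's estimate,
-- derived from statistics alone, before any table data is read.
EXPLAIN SELECT * FROM orders WHERE status = 'COMPLETED';
```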
Let’s imagine one of the columns in our orders table is city, and the distribution of its values looks like this: 90% of the orders are concentrated in just 10 cities, while the remaining 10% are spread thinly across 990 other cities. This kind of skewed distribution is quite common in real-world datasets. Here are the top 10 cities with the most orders and the bottom 10 with the fewest:
| Top 10 | Orders | Bottom 10 | Orders |
|---|---|---|---|
| City 1 | 901,153 | City 400 | 911 |
| City 3 | 900,939 | City 627 | 916 |
| City 2 | 900,677 | City 891 | 923 |
| City 9 | 899,886 | City 803 | 930 |
| City 6 | 899,853 | City 126 | 931 |
| City 10 | 899,825 | City 208 | 931 |
| City 5 | 899,447 | City 163 | 932 |
| City 8 | 899,410 | City 742 | 934 |
| City 7 | 899,218 | City 729 | 935 |
| City 4 | 899,170 | City 31 | 935 |
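As a side note, here’s a minimal sketch of how a distribution like this could be generated. This is my own assumption about the setup, not the script actually used for this post:

```sql
-- Hypothetical data load: ~90% of rows go to cities 1-10,
-- the remaining ~10% are spread across cities 11-1000.
-- Assumes the other columns of orders are defaulted or nullable.
INSERT INTO orders (city)
SELECT CASE
         WHEN random() < 0.9 THEN 'City ' || ceil(random() * 10)::int
         ELSE 'City ' || (10 + ceil(random() * 990)::int)
       END
FROM generate_series(1, 10000000);
```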
Let’s now apply what we’ve learned so far to reason about what Postgres might do when filtering by different cities:
- **Querying one of the top 10 cities:** Each of these cities accounts for roughly 9% of all rows. With the default `random_page_cost` of 4, that’s likely not enough to justify a `Seq Scan`, so Postgres will probably use the index. But for teaching purposes, let’s intentionally increase `random_page_cost` to 15 to induce a `Seq Scan` on these cities (see the sketch after this list). You’ll see why this matters in a moment.
- **Querying one of the bottom 10 cities:** This case is much clearer: each of these cities represents a tiny fraction of the dataset, so Postgres will almost certainly favor using the index, as the cost of scanning the whole table would far outweigh the few random reads required.
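Here’s the kind of session-level tweak I mean, sketched against the same `orders` table:

```sql
-- Raise random_page_cost for this session only, then inspect the plan.
SET random_page_cost = 15;
EXPLAIN SELECT * FROM orders WHERE city = 'City 1';
```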
Everything seems obvious, right? But if there’s one thing this industry has taught me over the years, it’s this: never blindly trust our reasoning. So let’s actually run the queries for each of these cities and check the query plans. Here’s what we got:
| Top 10 | Orders | Query Plan | Bottom 10 | Orders | Query Plan |
|---|---|---|---|---|---|
| City 1 | 901,153 | Bitmap Index Scan | City 400 | 911 | Bitmap Index Scan |
| City 3 | 900,939 | Bitmap Index Scan | City 627 | 916 | Bitmap Index Scan |
| City 2 | 900,677 | Bitmap Index Scan | City 891 | 923 | Bitmap Index Scan |
| City 9 | 899,886 | Seq Scan | City 803 | 930 | Bitmap Index Scan |
| City 6 | 899,853 | Bitmap Index Scan | City 126 | 931 | Bitmap Index Scan |
| City 10 | 899,825 | Seq Scan | City 208 | 931 | Bitmap Index Scan |
| City 5 | 899,447 | Seq Scan | City 163 | 932 | Bitmap Index Scan |
| City 8 | 899,410 | Bitmap Index Scan | City 742 | 934 | Bitmap Index Scan |
| City 7 | 899,218 | Seq Scan | City 729 | 935 | Bitmap Index Scan |
| City 4 | 899,170 | Seq Scan | City 31 | 935 | Bitmap Index Scan |
OK. Things seem very reasonable for the bottom 10 cities, since Postgres uses a Bitmap Index Scan for all of them. But for the top 10 cities… something feels off: we got a Seq Scan for only 5 of them, while the other 5 still use a Bitmap Index Scan. Given that the top 10 cities have basically the same number of rows, shouldn’t we expect the same plan for all of them?
If it were exactly the top 5 biggest cities using `Seq Scan`, one possible theory could be that we’re right at the tipping point where Postgres decides to switch between plans. Unlikely, but at least plausible. The thing is… that’s not even the case: the top 3 cities are using `Bitmap Index Scan`. So clearly, something else is going on.
Before you start yelling at Postgres, I have a confession to make. I secretly changed a configuration behind the scenes. It was for teaching purposes, so I hope you’ll forgive me. Here’s what I did:
```sql
ALTER TABLE orders ALTER COLUMN city SET STATISTICS 5;
```
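One detail worth mentioning: a new statistics target only takes effect the next time statistics are gathered, so a change like this needs to be followed by an `ANALYZE` (or you wait for autovacuum to do it):

```sql
-- Recompute the stats for orders under the new per-column target.
ANALYZE orders;
```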
And now, to help explain what happened after this change, let me show you the result of a very specific query:
```sql
SELECT
  most_common_vals,
  most_common_freqs
FROM
  pg_stats
WHERE
  tablename = 'orders'
  AND attname = 'city';
```

```
most_common_vals:  {"City 7", "City 9", "City 4", "City 5", "City 10"}
most_common_freqs: {0.09239999949932098, 0.09066666662693024, 0.09023333340883255, 0.09019999951124191, 0.09003333002328873}
```
It’s definitely not a coincidence that City 7, City 9, City 4, City 5, and City 10 are exactly the ones where Postgres chose a Seq Scan. Here is the secret: Postgres tracks the most common values (MCVs) for each column, and I changed this column’s statistics target (which caps the size of that MCV list) from its default value of 100 down to 5. As a result, Postgres “knew” those 5 cities were frequent and had accurate frequency data for them, enough to justify a Seq Scan. But for the other 5 top cities, it didn’t have any specific stats, so it had to assume a uniform distribution across the remaining 995 cities, which made it drastically underestimate their actual frequency and stick with the Bitmap Index Scan.
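We can even put a number on that underestimate. For an equality filter on a value not in the MCV list, Postgres roughly assumes the frequency left over after the MCVs is split evenly among all remaining distinct values (an approximation of the planner’s actual selectivity logic). With the stats above, where the five frequencies sum to about 0.4535 and there are 1,000 distinct cities, 5 of them in the MCV list, the back-of-the-envelope estimate is:

```sql
-- (1 - sum of MCV freqs) / (distinct values outside the MCV list) * row count
SELECT (1 - 0.4535) / (1000 - 5) * 10000000 AS estimated_rows;  -- ≈ 5,492
```

Roughly 5,500 estimated rows for a city that actually has around 900,000. No wonder the planner sees no reason to abandon the Bitmap Index Scan.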
Why didn’t Postgres capture exactly the top 5 cities? Because its statistics aren’t calculated using the full dataset—just a representative sample. That means small deviations are expected.
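You can inspect more of these sample-derived estimates in `pg_stats`; two neighboring columns worth a look are `n_distinct` (the estimated number of distinct values, reported as a negative fraction of the row count when it’s expected to scale with the table) and `histogram_bounds` (bucket boundaries covering the values outside the MCV list):

```sql
SELECT n_distinct, histogram_bounds
FROM pg_stats
WHERE tablename = 'orders'
  AND attname = 'city';
```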
The key takeaway I hope you walk away with is this: query plans can vary depending on the actual values in your columns—but Postgres only knows what its statistics tell it, and those statistics have limits. A good exercise to wrap up this post—or even a fun interview question for a database role—might go like this:
“Assuming the default Postgres configuration, which city is most likely to suffer from a suboptimal execution plan?”
The best answer? The city ranked 101st in number of orders. Why? Because it’s the first city that doesn’t make it into the top 100 most common values, which means Postgres is blind to its actual frequency. It might have a lot of rows—but Postgres doesn’t know that—so it underestimates its importance and possibly picks a suboptimal plan.
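And if you want to hunt down that culprit in a dataset of your own, a quick sketch:

```sql
-- The 101st most frequent city: the first value to fall outside
-- a default-sized (100-entry) MCV list.
SELECT city, count(*) AS orders
FROM orders
GROUP BY city
ORDER BY count(*) DESC
OFFSET 100
LIMIT 1;
```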