Playing with Postgres Histogram Stats

In my previous post about Postgres random_page_cost, we experimented with an orders table containing 10,000,000 rows and a status column with 4 distinct values. When filtering by each status, the query plans chosen by Postgres were as follows:

Status             Count       %       Query Plan
AWAITING_DELIVERY  2,547,518   25.4%   Bitmap Index Scan
AWAITING_PAYMENT   426,870     4.2%    Bitmap Index Scan
CANCELLED          22,552      0.2%    Bitmap Index Scan
COMPLETED          7,003,060   70.0%   Seq Scan

We showed how these decisions made sense based on the balance between random and sequential I/O. But we left out an important part of the story: how does Postgres know how many rows match each status value in the first place? It clearly decided to use a Seq Scan for status = 'COMPLETED' because it knew this would yield a huge number of rows. But how does it know that? Let’s take a look at how Postgres tracks and uses data distribution statistics behind the scenes.
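By the way, you can watch these estimates being made: every node that EXPLAIN prints carries a rows= figure, which is the planner’s guess computed from statistics before a single row is read. Something as simple as the following (the exact costs will of course vary with your setup):

EXPLAIN SELECT * FROM orders WHERE status = 'COMPLETED';

should report an estimated row count in the neighborhood of 7,000,000 on the Seq Scan node, and it’s that estimate, not the table itself, that tips the decision.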

Let’s imagine one of the columns in our orders table is city, and the distribution of its values looks like this: 90% of the orders are concentrated in just 10 cities, while the remaining 10% are spread thinly across 990 other cities. This kind of skewed distribution is quite common in real-world datasets. Here are the top 10 cities with the most orders and the bottom 10 with the fewest:

Top 10    Orders    Bottom 10   Orders
City 1    901,153   City 400    911
City 3    900,939   City 627    916
City 2    900,677   City 891    923
City 9    899,886   City 803    930
City 6    899,853   City 126    931
City 10   899,825   City 208    931
City 5    899,447   City 163    932
City 8    899,410   City 742    934
City 7    899,218   City 729    935
City 4    899,170   City 31     935
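A skew like this is easy to manufacture if you want to reproduce the experiment. The snippet below is just an illustrative sketch, not the exact setup from the previous post (it assumes an orders table whose other columns have defaults): roughly 90% of rows land in cities 1–10 and the rest are spread across cities 11–1000.

-- Illustrative only: ~90% of rows go to 10 "hot" cities, ~10% to 990 "cold" ones
INSERT INTO orders (city)
SELECT CASE
           WHEN random() < 0.9 THEN 'City ' || (1 + floor(random() * 10))::int
           ELSE 'City ' || (11 + floor(random() * 990))::int
       END
FROM generate_series(1, 10000000);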

Let’s now apply what we’ve learned so far to reason about what Postgres might do when filtering by different cities:

  • Querying one of the top 10 cities: Each of these cities accounts for roughly 9% of all rows. With the default random_page_cost of 4, that’s likely not enough to justify a Seq Scan, so Postgres will probably use the index. But for teaching purposes, let’s intentionally increase random_page_cost to 15 to induce a Seq Scan on these cities (the exact command is shown right after this list). You’ll see why this matters in a moment.

  • Querying one of the bottom 10 cities: This case is much clearer—each of these cities represents a tiny fraction of the dataset, so Postgres will almost certainly favor using the index, as the cost of scanning the whole table would far outweigh the few random reads required.
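In case you want to follow along, the change itself is just a session-level setting, so it only affects the current connection and is easy to undo afterwards:

SET random_page_cost = 15;
-- and later, to go back to the default:
RESET random_page_cost;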

Everything seems obvious, right? But if there’s one thing this industry has taught me over the years, it’s this: never blindly trust your own reasoning. So let’s actually run the queries for each of these cities and check the query plans. Here’s what we got:

Top 10    Orders    Query Plan          Bottom 10   Orders   Query Plan
City 1    901,153   Bitmap Index Scan   City 400    911      Bitmap Index Scan
City 3    900,939   Bitmap Index Scan   City 627    916      Bitmap Index Scan
City 2    900,677   Bitmap Index Scan   City 891    923      Bitmap Index Scan
City 9    899,886   Seq Scan            City 803    930      Bitmap Index Scan
City 6    899,853   Bitmap Index Scan   City 126    931      Bitmap Index Scan
City 10   899,825   Seq Scan            City 208    931      Bitmap Index Scan
City 5    899,447   Seq Scan            City 163    932      Bitmap Index Scan
City 8    899,410   Bitmap Index Scan   City 742    934      Bitmap Index Scan
City 7    899,218   Seq Scan            City 729    935      Bitmap Index Scan
City 4    899,170   Seq Scan            City 31     935      Bitmap Index Scan

OK. Things seem very reasonable for the bottom 10 cities, since Postgres uses a Bitmap Index Scan for all of them. But for the top 10 cities… something feels off: we got a Seq Scan for only 5 of them, while the other 5 still use a Bitmap Index Scan. Given that the top 10 cities have basically the same number of rows, shouldn’t we expect the same plan for all of them?

If it were exactly the top 5 biggest cities using a Seq Scan, one possible theory could be that we’re right at the tipping point where Postgres decides to switch between plans. Unlikely, but at least plausible. The thing is… that’s not even the case: the top 3 cities are using a Bitmap Index Scan. So clearly, something else is going on.

Before you start yelling at Postgres, I have a confession to make. I secretly changed a configuration behind the scenes. It was for teaching purposes, so I hope you’ll forgive me. Here’s what I did:

ALTER TABLE orders ALTER COLUMN city SET STATISTICS 5;
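One detail worth calling out: this command only changes the target; the statistics themselves are not rebuilt until the table is analyzed again, so a manual ANALYZE (or a visit from autovacuum) is needed for the new setting to take effect:

ANALYZE orders;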

And now, to help explain what happened after this change, let me show you the result of a very specific query:

SELECT
    most_common_vals,
    most_common_freqs
FROM
    pg_stats
WHERE
    tablename = 'orders'
    AND attname = 'city';

most_common_vals: { "City 7", "City 9", "City 4", "City 5", "City 10"}	
most_common_freqs: { 0.09239999949932098, 0.09066666662693024, 0.09023333340883255, 0.09019999951124191, 0.09003333002328873 }

It’s definitely not a coincidence that City 7, City 9, City 4, City 5, and City 10 are exactly the ones where Postgres chose a Seq Scan. Here is the secret: Postgres tracks the most common values (MCVs) for each column, and the statement above lowered the statistics target for city, which caps how many MCVs are stored, from its default of 100 down to 5. As a result, Postgres “knew” those 5 cities were frequent and had accurate frequency data for them, enough to justify a Seq Scan. But for the other 5 top cities, it didn’t have any specific stats, so it had to assume a uniform distribution across the remaining 995 cities, which made it underestimate their actual frequency and stick with the Bitmap Index Scan.
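To put rough numbers on that underestimate (assuming Postgres’s distinct-value estimate for city landed near the real 1,000): the five stored frequencies add up to roughly 45% of the table, so the planner spreads the remaining ~55% evenly across the other 995 cities. That works out to about 0.55 / 995 ≈ 0.00055 per city, i.e. an expectation of roughly 5,500 matching rows out of 10,000,000 for any city outside the MCV list, even though the non-MCV top cities really have close to 900,000 rows each. An estimated 5,500 matches is cheap to fetch through an index, which is exactly why those cities kept the Bitmap Index Scan.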

Why didn’t Postgres capture exactly the top 5 cities? Because its statistics aren’t calculated using the full dataset—just a representative sample. That means small deviations are expected.
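To give a sense of scale: the sample ANALYZE takes is roughly 300 rows per unit of the largest statistics target on the table, so a few tens of thousands of rows here. In a sample that size the ten top cities all show up nearly the same number of times, and which five of them happen to rank highest, and therefore win the five MCV slots, comes down to sampling noise.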

The key takeaway I hope you walk away with is this: query plans can vary depending on the actual values in your columns—but Postgres only knows what its statistics tell it, and those statistics have limits. A good exercise to wrap up this post—or even a fun interview question for a database role—might go like this:

“Assuming the default Postgres configuration, which city is most likely to suffer from a suboptimal execution plan?”

The best answer? The city ranked 101st in number of orders. Why? Because it’s the first city that doesn’t make it into the top 100 most common values, which means Postgres is blind to its actual frequency. It might have a lot of rows—but Postgres doesn’t know that—so it underestimates its importance and possibly picks a suboptimal plan.
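And if you ever catch a real column in that situation, the fix follows directly from the knob we abused earlier: raise that column’s statistics target above the default so the value can make it into the MCV list, then re-analyze. The target of 500 below is just an arbitrary illustration:

ALTER TABLE orders ALTER COLUMN city SET STATISTICS 500;
ANALYZE orders;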
