Quick summary
Summarize this blog with AI
Introduction
Median and percentile questions show up as soon as people move beyond averages. A team wants the median delivery time instead of the mean because outliers are distorting the story. Someone needs the 90th percentile latency for an SLA. Another person wants quartiles by customer segment and suddenly the query that looked obvious starts returning a value they did not expect.
The confusion usually comes from two places. First, median is not always the same thing as “pick the middle row” because even-sized sets require a rule. Second, SQL exposes more than one percentile function, and they do different jobs. If you do not know whether you need an interpolated value or an actual observed value from the dataset, it is easy to choose the wrong one and quietly publish the wrong metric.
The practical fix is to separate the business question from the function syntax. Decide whether you need a continuous percentile, a discrete percentile, or a manual workaround for a database that does not implement ordered-set functions directly. Once that choice is explicit, the query shape gets much simpler.
If You Only Remember Four Rules
PERCENTILE_CONTcan interpolate between adjacent values. That is usually what people mean by the mathematical median for an even-sized set.PERCENTILE_DISCreturns an actual value from the ordered data. It does not invent a midpoint that never occurred.- The 50th percentile is only “the median” after you decide which rule you mean for even-sized samples.
- Percentile bugs are usually grain bugs. Get the grouping level right before you calculate the percentile.
What Continuous and Discrete Percentiles Really Mean
PostgreSQL documents percentile_cont as the continuous percentile and notes that it interpolates between adjacent input items when needed. It documents percentile_disc as the discrete percentile that returns the first actual value whose position reaches the requested fraction.
That difference matters most when the row count is even or when the gaps between values are large. If the ordered values are 10, 20, 30, 40, a continuous median is 25. A discrete median is one of the observed values, depending on the engine’s rule. Neither answer is universally “wrong.” They answer slightly different questions.
If your stakeholder wants the midpoint of the middle two values, use the continuous form. If they need a real value that actually occurred in the data, use the discrete form.
A Straightforward Median in PostgreSQL
In PostgreSQL, median and percentile calculations use ordered-set aggregate syntax. That means you aggregate over the rows in the current group and specify the ordering inside WITHIN GROUP.
SELECT
percentile_cont(0.5) WITHIN GROUP (ORDER BY response_time_ms) AS median_response_time,
percentile_cont(0.9) WITHIN GROUP (ORDER BY response_time_ms) AS p90_response_time,
percentile_disc(0.5) WITHIN GROUP (ORDER BY response_time_ms) AS median_observed_response_time
FROM api_requests;
This version is clean because the reporting grain is the whole filtered table. There is no extra join or window logic getting in the way.
Percentiles Per Group Without Losing the Grain
Percentiles become more interesting when you need one result per team, country, plan tier, or product. In PostgreSQL, you can group first and apply the ordered-set aggregate inside each group.
SELECT
plan_tier,
percentile_cont(0.5) WITHIN GROUP (ORDER BY monthly_spend) AS median_monthly_spend,
percentile_cont(ARRAY[0.25, 0.5, 0.75]) WITHIN GROUP (ORDER BY monthly_spend) AS quartiles
FROM customer_spend
GROUP BY plan_tier;
This works because each group defines its own ordered set. The mistake people make is calculating percentiles after a join that duplicated the entity they meant to measure. If one customer now appears three times because of a downstream events join, your percentile is no longer at customer grain. Fix the grain before you ask for the percentile.
Why Window Percentile Syntax Confuses People
Not every database exposes percentiles the same way. Some engines offer percentile functions as analytic expressions with OVER (PARTITION BY ...). In that style, the percentile is repeated on every row in the partition, and you then need one more step to collapse back to one row per group.
The business logic is still valid, but the query shape changes. If your engine repeats the percentile on every row, compute it in a subquery or CTE and then reduce to the reporting grain deliberately instead of mixing the percentile expression into a grouped query and hoping the planner will interpret your intent.
WITH scored AS (
SELECT
department,
salary,
PERCENTILE_CONT(0.5) WITHIN GROUP (ORDER BY salary)
OVER (PARTITION BY department) AS median_salary
FROM payroll
)
SELECT
department,
MAX(median_salary) AS median_salary
FROM scored
GROUP BY department;
Multiple Percentiles in One Pass
If you need quartiles or a p50, p90, and p99 set for the same metric, compute them together instead of scattering separate queries through the codebase. PostgreSQL supports an array form for continuous and discrete percentiles, which makes it easier to keep percentile definitions consistent.
That consistency matters operationally. If one dashboard uses a continuous median and another uses a discrete median, teams will waste time arguing about a metric difference that is really just a function-choice difference.
What to Do If Your Database Lacks a Built-In Median Function
Some SQL engines do not support these ordered-set functions directly, or they support only one of the variants. In that case, the fallback is a row-number pattern: order the rows inside each group, count the group size, and then select the middle row or average the two middle rows. It is more verbose, but the logic is still manageable.
The important part is to encode the business rule explicitly. Decide whether an even-sized group should average the two middle values or choose one observed value. Do not let that decision stay implicit.
Common Percentile Mistakes
- Calculating percentiles after a join that multiplied the rows you meant to measure.
- Using a discrete percentile when the stakeholder expects interpolation, or using a continuous percentile when they need an actual observed value.
- Comparing percentile outputs across tools without checking whether both tools use the same percentile definition.
- Forgetting that percentiles answer distribution questions. If the metric needs a weighted business definition, percentile functions alone may not be enough.
Final Takeaway
Median and percentile queries are not hard because the functions are exotic. They are hard because a small change in definition produces a different but plausible answer. Start by defining the metric precisely: interpolated versus observed, whole table versus per-group, raw rows versus a pre-aggregated business grain. Then choose the SQL function that matches that definition instead of treating all percentile syntax as interchangeable.
FAQ
Is the median always the same as PERCENTILE_CONT(0.5)?
That is the most common mathematical interpretation, especially for even-sized samples, because it can interpolate between the two middle values. But some teams want the median to be an actual observed value from the data, which is a better fit for PERCENTILE_DISC(0.5).
When should I prefer PERCENTILE_DISC?
Use it when the result must come from the dataset itself, such as rating scales, ranked categories, or workflows where a midpoint between values would not make business sense.
Why does my percentile change after a join?
Because the join may have changed what one row means. If you want a customer percentile, do not compute it on a table where each customer appears once per order line or once per event unless that is the grain you actually intend.