Anscombe's quartet, or why your averages lie to you

Data · Visualization · Decision-making

Why bother making a chart at all? After all, a well-ordered table of statistics (a mean, a standard deviation, a few other figures) ought to be enough to form a view of a dataset. A picture is worth a thousand words, or so they say. But could it be worth more than that? The answer comes from the English statistician Francis Anscombe, whose short 1973 paper, "Graphs in Statistical Analysis", went on to become famous.

Four datasets, identical statistics

Anscombe builds four sets of eleven points each. He chooses them so that they share the same summary statistics to within rounding: the same mean in x, the same mean in y, the same variances, the same correlation, and therefore the same regression line (y = 3 + 0.5 x). On paper, in any report, the four sets would look interchangeable.
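The "therefore" can be made explicit. For a least-squares line, the slope and intercept are fully determined by the two means, the two variances and the correlation. The symbols below are standard textbook notation, assumed here for illustration rather than taken from Anscombe's paper:

```latex
\hat{y} = a + b\,x, \qquad
b = r\,\frac{s_y}{s_x}, \qquad
a = \bar{y} - b\,\bar{x}
```

Plugging in the rounded figures the article quotes for all four sets (r ≈ 0.82, variances 11.0 and 4.1, means 9.0 and 7.5) gives b ≈ 0.82 × √(4.1 / 11.0) ≈ 0.50 and a ≈ 7.5 − 0.50 × 9.0 = 3.0. Any dataset with these five numbers gets the same line, whatever its shape.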

For all four sets: mean x = 9.0 · mean y = 7.5 · variance x = 11.0 · variance y = 4.1 · correlation = 0.82 · line: y = 3 + 0.5 x
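These figures are easy to check for yourself. Here is a minimal sketch in plain Python, using the values of the quartet as published; the names `QUARTET` and `summarize` are illustrative choices, not part of any library.

```python
from statistics import mean, variance

# Anscombe's published values; sets I to III share the same x column.
X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
QUARTET = {
    "I": (X123, [8.04, 6.95, 7.58, 8.81, 8.33, 9.96, 7.24, 4.26, 10.84, 4.82, 5.68]),
    "II": (X123, [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]),
    "III": (X123, [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]),
    "IV": ([8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8],
           [6.58, 5.76, 7.71, 8.84, 8.47, 7.04, 5.25, 12.50, 5.56, 7.91, 6.89]),
}

def summarize(x, y):
    """Means, sample variances, Pearson correlation and least-squares line."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return {
        "mean_x": mx,
        "mean_y": my,
        "var_x": variance(x),  # sample variance (divides by n - 1)
        "var_y": variance(y),
        "corr": sxy / (sxx * syy) ** 0.5,
        "slope": slope,
        "intercept": my - slope * mx,
    }

for name, (x, y) in QUARTET.items():
    stats = summarize(x, y)
    print(name, {key: round(value, 2) for key, value in stats.items()})
```

The four printed rows should be indistinguishable at the precision a report would normally show, which is the whole point: the differences only appear in the third decimal place or later.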

Stare at those numbers as long as you like: nothing in them tells you what the data actually looks like. So let's draw it.

The four datasets of Anscombe's quartet. Same statistics, same regression line (in green), four unrelated realities.

And everything changes. The first set is what the statistics led us to imagine: an ordinary cloud around a line. The second is not linear at all: the points trace a smooth curve, a parabola, and the regression line misses the point entirely. The third is an almost perfectly straight line, except for a single outlier that is enough to tilt the whole regression. The fourth is more brutal still: every point but one has the same x, and it is that single isolated point on the right that single-handedly creates the illusion of a relationship.
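Each of these three surprises can also be confirmed numerically, once the picture has told you where to look. The sketch below uses Anscombe's published values for sets II, III and IV; the helper `fit_line` and the three checks are illustrative choices, not a standard diagnostic procedure.

```python
from statistics import mean

X123 = [10, 8, 13, 9, 11, 14, 6, 4, 12, 7, 5]
Y2 = [9.14, 8.14, 8.74, 8.77, 9.26, 8.10, 6.13, 3.10, 9.13, 7.26, 4.74]
Y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]
X4 = [8, 8, 8, 8, 8, 8, 8, 19, 8, 8, 8]

def fit_line(x, y):
    """Least-squares slope, intercept and Pearson correlation."""
    mx, my = mean(x), mean(y)
    sxx = sum((a - mx) ** 2 for a in x)
    syy = sum((b - my) ** 2 for b in y)
    sxy = sum((a - mx) * (b - my) for a, b in zip(x, y))
    slope = sxy / sxx
    return slope, my - slope * mx, sxy / (sxx * syy) ** 0.5

# Set II: x runs from 4 to 14 in unit steps, so a roughly constant
# second difference in y is the signature of a parabola.
y_sorted = [y for _, y in sorted(zip(X123, Y2))]
first = [b - a for a, b in zip(y_sorted, y_sorted[1:])]
second = [round(b - a, 2) for a, b in zip(first, first[1:])]
print("II  second differences:", second)

# Set III: drop the point with the largest residual, then refit.
slope, intercept, corr = fit_line(X123, Y3)
worst = max(range(len(Y3)),
            key=lambda i: abs(Y3[i] - (intercept + slope * X123[i])))
x_clean = [v for i, v in enumerate(X123) if i != worst]
y_clean = [v for i, v in enumerate(Y3) if i != worst]
slope_clean, intercept_clean, corr_clean = fit_line(x_clean, y_clean)
print(f"III with outlier:    slope={slope:.3f} corr={corr:.3f}")
print(f"III without outlier: slope={slope_clean:.3f} corr={corr_clean:.3f}")

# Set IV: without the isolated point, x takes a single value, so no
# slope or correlation can be computed at all.
x_rest = [v for v in X4 if v != max(X4)]
print("IV  distinct x without the isolated point:", sorted(set(x_rest)))
```

Note the order of operations: the checks are chosen because the chart suggested them. Without the picture, nothing in the summary table would have told you which of the three to run.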

The summary erases what matters

Here is the heart of it: four radically different stories, completely invisible in the summary numbers. Had you been handed the table of statistics alone, you would have treated all four situations the same way. Yet they call for entirely different decisions. Set II tells you to change your model. Set III tells you to go and understand that one outlier: is it a data-entry error, or a special case rich in lessons? Set IV tells you your beautiful correlation rests on a single observation, and that it could collapse with the next one.

A statistical summary, by construction, throws away information to keep a compact version of it. That is useful, but it is never neutral: it decides for you what deserves to be seen.

An average is an opinion about the data, not the data.

What this changes for the decision-maker

Here, in pictorial form, is the idea that runs through all of my work: the number is only a consequence, and the decision lies in what sits beneath it. A KPI on a dashboard is one more statistical summary. It may very well hide a Set II, a Set III or a Set IV — a trend that is bending, a customer who single-handedly carries all the growth, an average that reconciles two opposing populations.

The remedy costs almost nothing. Before deciding on an aggregate number, look at the distribution it summarizes: a scatter plot, a histogram, the series over time. Thirty seconds to check that the number is not lying by omission. It is exactly the discipline we apply when we evaluate an investment case or rethink a project portfolio: never let a single indicator stand in for reality.
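The thirty-second check does not even require a plotting library. As a rough sketch, here is a text histogram in plain Python; the function `text_histogram` and its parameters are hypothetical names chosen for this example, and the y values of Anscombe's Set III stand in for whatever sits beneath your KPI.

```python
from statistics import mean

def text_histogram(values, bins=5, width=30):
    """Count values in equal-width bins and draw one text bar per bin."""
    lo, hi = min(values), max(values)
    step = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal
    counts = [0] * bins
    for v in values:
        # Clamp so that the maximum lands in the last bin.
        idx = min(int((v - lo) / step), bins - 1)
        counts[idx] += 1
    peak = max(counts)
    lines = []
    for i, c in enumerate(counts):
        bar = "#" * round(width * c / peak)
        lines.append(f"{lo + i * step:6.2f} to {lo + (i + 1) * step:6.2f} | {bar} {c}")
    return counts, lines

# Stand-in data: the y values of Anscombe's Set III.
Y3 = [7.46, 6.77, 12.74, 7.11, 7.81, 8.84, 6.08, 5.39, 8.15, 6.42, 5.73]

counts, lines = text_histogram(Y3)
print(f"mean = {mean(Y3):.2f}")
print("\n".join(lines))
```

The mean alone reads 7.50. The histogram shows most values bunched at the low end, an empty bin, and one value sitting alone at the top: exactly the kind of detail worth a question before the number goes into a decision.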

Anscombe needed four drawings where a thousand words would not have done. That is the irony of it: the best demonstration of the usefulness of charts is itself a chart. Next time someone hands you an average, don't just ask "how much?" Ask "what does the distribution look like?"
