Statistical significance and substantive significance are different
Sometimes you can't get around reading the study
News stories about scientific studies often report that "X causes Y" or "A is associated with B" or "P is linked to Q." When people discuss these stories, they often debate whether this result is true. Could X *really* cause Y? Or is it Y that causes X? Is A still associated with B once you control for C? Is P only linked to Q within a small segment of the population?
It makes sense to doubt whether these relationships really exist, or whether they take the form described. Causation is hard to establish, spurious correlations are commonplace, and p-hacking is rampant. But more often than not, debates about whether such-and-such a relationship exists are beside the point. Social scientists often use the idea of "statistical significance" in order to answer the question of whether a relationship exists. When a relationship is found to be statistically significant, it is relatively unlikely to have occurred by chance. But the magnitude of such a relationship is something completely different. A correlation can be very small, but still statistically significant.
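As a minimal illustration of that last point (a hypothetical simulation, not data from any study discussed here): with a large enough sample, even a correlation of roughly 0.01 comes out as highly statistically significant.

```python
import numpy as np
from scipy.stats import pearsonr

# Hypothetical data: a million observations with a true correlation of roughly 0.01.
rng = np.random.default_rng(0)
n = 1_000_000
x = rng.normal(size=n)
y = 0.01 * x + rng.normal(size=n)

r, p = pearsonr(x, y)
print(f"r = {r:.3f}, p = {p:.1g}")  # r is tiny, yet p is far below 0.05
```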
It's useful for scientists and researchers to uncover small relationships because, from the standpoint of scientific progress, more knowledge is always good. Every study showing a relationship between two interesting variables, no matter how small, is still a building block for future research. Things are different, though, when scientific findings are used outside of a research setting: there, we often care more about substantive significance. When scientific findings make their way into the popular media, or into public policy, that's usually because they're expected to be action-relevant in some way. That's where a lack of attention to substantive (rather than statistical) significance can start to cause problems.
A typical story in the New York Times advertises a study published in the Annals of Internal Medicine. In the Times' interpretation, coffee is "linked with" lower mortality risk. The study does not explicitly claim to have found, in a causal sense, that coffee lengthens life, but this is obviously the main noteworthy implication of the finding. No one cares if it happens to be the case that people who live longer also, for unrelated reasons, tend to drink more coffee; the only reason to care about this study is that it suggests you can save a trip to the cardiologist by going to Starbucks instead.
Very importantly, correlation is not causation. There are all kinds of ways that two variables can be related, and most observational studies, at their most fundamental, can only describe such relationships, not explain why they exist. To make statements about causation, you typically need randomized controlled trials, or at least robust quasi-experimental designs. But, just this once, let's pretend that correlation is causation. Let's suppose that this coffee study—and some others like it that we'll look at in a moment—are actually indicative of a causal relationship. How big of an effect are we talking about here?
One of the topline results of the study, reported as a hazard ratio of 0.77, suggests that, relative to nondrinkers, people who drank more than 4.5 cups of coffee per day at the beginning of the study had, on average, about a 23% lower risk of death at the time of follow-up (a median of 7 years later). A hazard ratio is, roughly, the ratio of the risk of some event occurring in a treatment condition to the risk of it occurring in a control condition. If your risk of getting an illness is 1% over some time period, and taking a medication reduces it to 0.5%, then the hazard ratio over that time period is 0.5 (0.5/1).
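Here's that arithmetic as a minimal sketch; the function name and numbers are purely illustrative, taken from the hypothetical medication example and the coffee result above.

```python
def hazard_ratio(risk_treatment, risk_control):
    """Ratio of the event risk under treatment to the event risk under control."""
    return risk_treatment / risk_control

# Hypothetical medication example from above: a 1% risk of illness drops to 0.5%.
print(hazard_ratio(0.005, 0.01))  # 0.5 -- the treated group has half the risk

# Reading the coffee result the other way around: a hazard ratio of 0.77
# corresponds to a (1 - 0.77) = 23% lower relative risk of death.
print(f"{1 - 0.77:.0%}")  # 23%
```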
Now how does the hazard ratio reported in the coffee study compare to hazard ratios reported in other studies that try to relate behavior to the risk of death?
- Walking a lot is associated with around a 50-70% reduction in mortality.
- Frequent, vigorous physical activity is associated with a roughly 45% reduction in mortality.
- Normal-weight people have roughly half the risk of death of the severely obese.
Assuming — again, with no justification — that these relationships are causal in nature, it looks like there are things you can do to reduce your mortality risk by two to three times as much as drinking tons of coffee, which is a big deal but not a huge deal. Of course, coffee has one big advantage over these other methods, which is that it's much easier to do. And on the face of it, adding coffee to the mix, if you don't drink it already, doesn't seem crazy.
The wrinkle lies in what these models *don't* tell you. Hazard ratios, correlations, and regression coefficients are all measures of effect size. Effect size is a measure of how much, on average, a change in Variable A (e.g. coffee consumption) is associated with a change in Variable B (e.g. risk of death). Hazard ratios, in particular, compare the rate of occurrence of some event (e.g. death) in a treatment group (e.g. coffee drinkers) versus a control group (e.g. nondrinkers).
But another way of looking at the same kind of data is via the risk difference. Instead of taking the ratio of two risks, we look at the simple difference between them. We can use this to get an idea of how big an effect coffee can have on health.
In order to do this calculation, we need to know what percentage of people in the non-coffee-drinking group died over the course of the coffee study, and the paper helpfully tells us: about 1.9%. Remembering the result from the coffee study that drinking more than 4.5 cups of unsweetened coffee per day “reduces” risk of death by 23%, we can calculate that moving to the lots-of-coffee regime from the no-coffee regime is associated with a risk reduction of about 0.4 percentage points (1.9% × 0.23 ≈ 0.44).
In the walking study, by contrast, the death rate among those who didn't walk much was about 7% and the risk reduction was about half, so walking often is worth about a 3.5 percentage point reduction in the risk of death, about nine times bigger than the risk reduction from drinking coffee. In the study on vigorous exercise, the death rate among those who did no vigorous exercise was about 10.5%, and the hazard ratio was, again, about 0.5, suggesting about a 5 percentage point risk reduction, about thirteen times bigger than the risk reduction from coffee.
Compare this to the importance of obesity: in the population of the obesity study linked above, about 50% of the severely obese participants died, and normal-weight people were half as likely to die. That means that moving from obesity to normal weight — or, better yet, avoiding severe obesity altogether — is associated with a risk reduction of about 25 percentage points, about sixty times bigger than the risk reduction from coffee.
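To see where those multiples come from, here's a rough back-of-the-envelope sketch using the approximate baseline death rates and hazard ratios quoted above. The exact multiples come out a bit lower (roughly 8x, 12x, and 57x) because the round numbers in the text come from first rounding the coffee figure to 0.4 percentage points.

```python
def risk_difference(baseline_risk, hazard_ratio):
    """Absolute risk reduction: baseline risk minus the (approximate) treated-group risk."""
    return baseline_risk * (1 - hazard_ratio)

# Approximate figures quoted above: (death rate in the comparison group, hazard ratio).
studies = {
    "coffee, >4.5 cups/day": (0.019, 0.77),
    "walking a lot": (0.07, 0.50),
    "vigorous exercise": (0.105, 0.50),
    "normal weight vs. severe obesity": (0.50, 0.50),
}

coffee_rd = risk_difference(*studies["coffee, >4.5 cups/day"])
for name, (baseline, hr) in studies.items():
    rd = risk_difference(baseline, hr)
    print(f"{name}: {rd:.1%} absolute risk reduction (~{rd / coffee_rd:.0f}x coffee)")
```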
What do we really care about here? The hazard ratio tells us (roughly) the relative risk of death between treated and untreated groups. The risk difference tells us the absolute difference in the risk of death between them. Most people don't care about how likely they are to die relative to other people; they care about their absolute risk of death! That's why, in this case, I'd argue that the risk difference is a better measure of substantive significance.
This matters for personal decision-making; if keeping your weight down is sixty times more effective at prolonging life than drinking tons of coffee, then you are probably better off expending your energy doing the former. But the distinction between statistical and substantive significance matters a lot more in public policy.
Last month, New York passed a bill intended to reduce class sizes by lowering the maximum number of students allowed in a class. Advocates for smaller class sizes argue on the basis of evidence that smaller classes improve student performance. Yet, as Freddie deBoer points out, evidence for the positive effects of smaller class size is limited, estimates of its effects vary widely, and the research that launched the small-class-size craze, the Tennessee STAR study, claimed substantively very modest results.
The data from STAR suggested that students enrolled in smaller classes were slightly more likely to attend college, but, crucially, that there was no long-term effect of small class size on adult earnings. By contrast, teacher quality (as measured by education level and years of experience) was robustly associated with both earnings and college attendance, with much larger effects.
As the Times pointed out, reducing class sizes, which is extremely expensive, also often entails hiring lower-quality teachers. So New York's insensitivity to the importance of effect size results in a potential double whammy, in which the state (1) ignores an intervention with big effects (teacher quality) in favor of one with small effects (class size) and, as a result, (2) acts to move the variable that doesn't matter, at the cost of the one that does. For public policy, at least, (effect) size matters.