In my last post, “Being Statistically Significant Is Nothing Like Being Pregnant,” I explained why I took the following result seriously, even though with a P value of 0.06, it isn’t “statistically significant.”
This is a graph from the LA Veterans Administration Hospital Study showing that the incidence of cancer was higher in the “experimental” group consuming vegetable oils than in the “control” group consuming mostly traditional fats. I had stated in that post that if P=0.06, we can be 94 percent confident that the result was not due to chance, which I offered as “a more intuitively graspable way” of phrasing the implication than the technical definition, which I also explained. A friendly commenter thought I was misinterpreting the P value. More specifically, he thought I was saying that there is a 94% probability that the result is not due to chance, which would be wrong. Being 94 percent confident and claiming a 94% probability are, however, two different things. In this post, I’ll try to explain the difference using a few pictures.
The following discussion is drawn from the principles outlined in Statistical Methods in Medical Research by Armitage, Berry, and Matthews, using my own examples. I’ll emphasize again that statistics is way outside the realm of my expertise, so be sure to check the comments to see if any of the readers of this blog who are much more knowledgeable in this area than I am have taken me to task (e.g., here).
The prevailing approach to probability in technical circles is the frequentist approach. This approach defines probability as a proportion of randomly distributed possibilities. For example, the probability of reaching into the jar below and pulling out a blue ball is fifty percent:
What we mean in this case is that half the possibilities are red and half of them are blue. If we were to keep taking balls out of the jar, we would pick out a blue ball half the time.
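The long-run reading of “half the time” can be checked with a quick simulation. This is only a sketch: the jar size of fifty blue and fifty red balls, the function name `blue_fraction`, and the choice to put each ball back after drawing it are my own assumptions for illustration.

```python
import random

def blue_fraction(n_draws, n_blue=50, n_red=50, seed=0):
    """Draw n_draws balls with replacement and return the fraction that were blue.

    The jar contents and the with-replacement scheme are illustrative assumptions.
    """
    rng = random.Random(seed)
    jar = ["blue"] * n_blue + ["red"] * n_red
    blue_draws = sum(rng.choice(jar) == "blue" for _ in range(n_draws))
    return blue_draws / n_draws

# The observed fraction wanders for small samples and settles near 0.5 as draws accumulate.
for n in (10, 100, 10_000, 100_000):
    print(n, blue_fraction(n))
```

The fraction from ten draws can easily be well off one half, while the fraction from a hundred thousand draws should sit very close to it. That settling-down is what the frequentist definition refers to.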
Now, let’s suppose an imp took one of the balls while we weren’t looking.
The imp tells us the ball that he’s hiding in his left hand is blue. Then the little troublemaker asks us, “What’s the probability I’m telling the truth?”
From a frequentist perspective, the question doesn’t make any sense. The imp is either lying or telling the truth. Whether a particular fact is true or false cannot be a probability, because the correct answer does not exist as a random distribution of possibilities. There is only one ball in his hand (unless he’s lying about that too), so there is only one correct answer to the question, not a random distribution of correct answers. If there were a random distribution of correct answers, it would look something like this:
But of course we know there isn’t, because a jar that size couldn’t possibly fit in the imp’s hand:
Clearly, the imp can only have either one ball or no balls in his hand. Thus, from a frequentist perspective, there is no such thing as “the probability he is telling the truth.”
Something is deeply unsatisfying about this perspective. If we can’t say anything about the likelihood that something is true, how are we supposed to make decisions? How are we to decide, for example, whether we should dress our salad with olive oil or corn oil, unless we have some way to express our confidence that one or the other is the better choice?
An alternative approach to probability would be to quantify our degree of belief in something, where zero represents total disbelief and one represents total certainty. Such an approach to probability prevailed in the late eighteenth century and for most of the nineteenth, under the influence of Thomas Bayes (1701-1761) and P.S. Laplace (1749-1827). At that time it was referred to as “inverse probability,” in contrast to the “direct probability” characterized by the frequentist interpretation. The influence of this “Bayesian” view of probability waned in the twentieth century in large part due to the frequentist influence of R.A. Fisher (1890-1962), but it nevertheless continues to carry some weight even in the present.
A “Bayesian” approach would attempt to take into account the totality of any relevant knowledge we’ve heretofore collected. We already know that fifty percent of the balls in this jar are blue, so our previous knowledge about other jars of balls isn’t particularly important. But we have no idea if the imp is telling the truth. Can we say anything from previous experience about how often imps lie? We would want to have some idea not only of the typical propensity of an imp to lie, but also the variation. Suppose, for example, there exists a population of imps with varying proclivities for fibbing.
If we could establish, for example, that imps lie on average eighty percent of the time, and that 95 percent of imps fall somewhere in the range of lying seventy percent of the time and lying ninety percent of the time, this would go a long way toward helping us quantify our degree of belief in the imp’s claim that he’s holding a blue ball in his left hand.
There could be innumerable other pieces of information that are relevant: perhaps, for example, imps are so sick of being red all the time that they have an intrinsic drive to incorporate blueness in whatever way possible, and would be much more likely to choose a blue ball rather than to choose any ball at random.
On the other hand, suppose we don’t know anything at all about the propensity of imps to lie or tell the truth, or to pick red or blue balls. In such a scenario, we would have to fall back solely on our knowledge that fifty percent of the balls in the jar are blue. All we would be able to say would be that we were fifty percent confident the imp was telling the truth. Lo and behold! This just happens to be the exact same probability of taking a blue ball according to the frequentist perspective! Thus we find that under many conditions where we have no prior knowledge, the “Bayesian” and “frequentist” perspectives become mathematically equivalent: the likelihood of picking a blue ball at random becomes mathematically equivalent to our degree of belief that a blue ball has been picked (but see the dispute in the comments).
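The imp example can be written out with Bayes’ rule. The sketch below assumes the imp picked a ball at random from a half-blue jar and that he tells the truth with some probability `p_truth`, whatever the color. Both the function name and the idea of encoding “no knowledge” as `p_truth = 0.5` are my own assumptions, not something taken from the textbook.

```python
def prob_ball_is_blue(prior_blue, p_truth):
    """Degree of belief that the ball is blue, given that the imp says "blue".

    Assumes the imp reports the true color with probability p_truth and
    the wrong color with probability 1 - p_truth (a hypothetical model).
    """
    says_blue_and_is_blue = p_truth * prior_blue
    says_blue_and_is_red = (1 - p_truth) * (1 - prior_blue)
    return says_blue_and_is_blue / (says_blue_and_is_blue + says_blue_and_is_red)

# Imps that lie eighty percent of the time: belief in "blue" drops to about 0.2.
print(prob_ball_is_blue(prior_blue=0.5, p_truth=0.2))

# No knowledge about imps, encoded here as a coin flip: belief stays at 0.5.
print(prob_ball_is_blue(prior_blue=0.5, p_truth=0.5))
```

With nothing known about imps, the degree of belief comes out equal to the proportion of blue balls in the jar, which is the coincidence described above.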
With that in mind, let’s return to the topic of whether we can turn a P value into “confidence” that something is or is not due to chance.
To make this easier to follow, I’m going to invent some data from a hypothetical clinical trial and try to distill the basic principles from some of the minutiae that would often complicate the mathematical calculations.*
Let’s suppose a randomized, controlled, clinical trial showed that the incidence of cancer was forty percent among subjects consuming vegetable oil and twenty percent among subjects consuming animal fat. Since vegetable oil is now the norm, let’s express this as the effect of switching back to traditional animal fat. Expressed this way, that’s a decrease of twenty percentage points in absolute risk from consuming animal fat and a two-fold decrease in relative risk (from forty percent down to twenty percent). Clearly, this result is different from the “null hypothesis,” which is that switching from vegetable oil to animal fat has no effect:
By itself this doesn’t tell us much, however, because there are innumerable things besides vegetable oil that affect cancer risk. Even though we randomly assigned subjects to one or the other group, we can’t possibly assume that the myriad factors affecting cancer risk were distributed perfectly evenly between the two groups. Even if the null hypothesis is true and there is no effect at all, the likelihood of obtaining a difference of exactly zero is incredibly small because there are so many other factors that could vary to some degree between the two groups.
For convenience, we call all of this additional complexity “chance.” It isn’t truly “random,” if there even is such a thing as randomness, but it’s random with respect to animal fat — in other words, it is neither caused by nor causes the consumption of animal fat — so it’s convenient to lump it all under the umbrella of “chance.” We therefore turn to statistics to help us quantify our confidence that this difference in cancer incidence between the two groups arose as a result of the different treatment (vegetable oil versus animal fat) rather than as a result of “chance.”
Our first task is to compute a P value that attempts to answer the question, “if switching from vegetable oil to animal fat has no effect on the risk of cancer, how likely would we be to obtain a difference as large or larger than the one we have observed?”
Following a frequentist perspective, the question can only be answered by taking a trip from the Land of Observed Data into the Land of Imaginary Data. Let’s imagine that the null hypothesis is true and there is no effect of animal fat on cancer risk. Suppose we conducted lots of similar trials testing the hypothesis over and over again. We test it a hundred times, or even a thousand. Each time, the difference between the two groups should be a little different, but if we pool all the trials together, the overall average should be very close to zero. Results very close to zero should also be the most common, even though they would only occur in a minority of cases. By estimating the variability and assuming a certain sample size, we should be able to draw a sampling distribution that displays the hypothetical results of all of these imaginary trials, and it should look something like this:
In this graph, we have the results of many imaginary studies plotted according to their expected frequency, assuming the null hypothesis is true. The blue line represents our expected difference of zero. While we would expect only a few observations to fall precisely along the blue line, we would nevertheless expect a full 95 percent of the observations to fall within the shaded green area. Conversely, five percent of the observations would fall outside the bounds of the shaded green area, in the two red-shaded tails of the curve, with 2.5 percent falling into each tail. The results of our own imaginary trial showed a twenty percentage point decrease in risk, which falls exactly along the left-most border between the shaded red and green areas. We would say the P value is equal to 0.05 because there would be a five percent chance of obtaining a difference of this magnitude or greater, in either direction, if the null hypothesis were true.
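The arithmetic behind that P value can be sketched in a few lines. I am assuming a normal sampling distribution centered on zero and reading the 10.2 from the footnote as the spread of that distribution in percentage points. The helper names `normal_cdf` and `two_sided_p` are mine.

```python
import math

def normal_cdf(z):
    """Cumulative probability of the standard normal distribution at z."""
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

def two_sided_p(observed, spread, null_value=0.0):
    """Chance of a difference at least this far from null_value, in either direction."""
    z = abs(observed - null_value) / spread
    return 2.0 * (1.0 - normal_cdf(z))

SPREAD = 10.2     # assumed spread of the sampling distribution (see the footnote)
OBSERVED = -20.0  # percentage point change in cancer incidence in the invented trial

# Borders between the green and red areas: zero plus or minus 1.96 spreads.
print("green area runs from", -1.96 * SPREAD, "to", 1.96 * SPREAD)
print("two-sided P value:", two_sided_p(OBSERVED, SPREAD))
```

The borders land almost exactly on plus and minus twenty, which is why the invented result sits right on the edge and the P value comes out at very nearly 0.05.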
Although it would not usually be expressed this way in a publication, we have essentially established a 95% confidence interval around zero, which is the “average” result we would expect from many trials if the null hypothesis were true. When performing such a statistical test, the goal is usually to report the P value rather than the actual breadth of the interval around zero. It helps to realize the similarity, however, because we could reach exactly the same conclusion by establishing a 95% confidence interval around the result from our own trial, and it is very common to report confidence intervals established in such a way. Here is what one would look like for our study:
Here, the mathematics are much the same, but the interpretation is somewhat different. If we conducted this trial, say, a thousand times, each trial would generate a different 95% confidence interval. Yet we could expect 95 percent of those studies to generate a confidence interval that includes the true effect of switching from vegetable oil to animal fat. We should notice a striking agreement between the two approaches: whereas above we had seen that a twenty percentage point reduction in the incidence of cancer (-20) rested at the very limit of the 95% confidence interval around zero, zero now lies at the very limit of the 95% confidence interval around -20. Thus, we could use either approach to determine that P is equal to 0.05 and that there would be a five percent likelihood of obtaining a difference this large or larger if the null hypothesis were true.
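Both claims in that paragraph can be checked with a short sketch: the interval around -20 just barely reaches zero, and intervals built this way capture the true effect about 95 percent of the time. I am again treating 10.2 as the spread of the sampling distribution. The “true effect” fed to the simulation is a made-up value, since in a real trial we would never know it.

```python
import random

Z_95 = 1.96
SPREAD = 10.2  # assumed spread of the sampling distribution (see the footnote)

def confidence_interval(estimate, spread=SPREAD, z=Z_95):
    """95% confidence interval around a single trial's estimate."""
    return (estimate - z * spread, estimate + z * spread)

def coverage(true_effect, n_trials=20_000, spread=SPREAD, seed=1):
    """Fraction of simulated trials whose interval contains true_effect."""
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        estimate = rng.gauss(true_effect, spread)  # one imaginary trial result
        low, high = confidence_interval(estimate, spread)
        if low <= true_effect <= high:
            hits += 1
    return hits / n_trials

print("interval around -20:", confidence_interval(-20.0))
print("share of intervals containing the true effect:", coverage(true_effect=-20.0))
```

The upper limit of the interval comes out a hair below zero, mirroring the way -20 sat on the edge of the interval around zero, and the coverage figure should land close to 0.95 whatever true effect is plugged in.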
Informally, and not without potential pitfalls, it seems natural to follow the logic something like this:
- The likelihood of obtaining a result of this magnitude or greater if there is no effect of animal fat is five percent.
- Conversely, the 95% confidence interval lies on the very edge of excluding zero. The likelihood that the confidence interval includes the “true” value is 95 percent.
- On an informal and intuitive level, it seems right to use this likelihood as a measure of our confidence that the “true” value is different from zero.
- If we can say that we are just under 95 percent confident that the true value is different from zero, it follows that we could say we are just under 95 percent confident that animal fat reduced the risk of cancer in this study (but see the dispute in the comments).
If we used a Bayesian approach to quantify our degree of belief, we could construct a Bayesian confidence interval, which could also be called a credibility interval or a Bayesian probability interval. As it so happens, under the simplest conditions and without any prior knowledge to factor into the calculation, the “Bayesian” and “frequentist” intervals are mathematically identical (but see dispute). The difference, of course, is the interpretation: the Bayesian interval is a formal way of quantifying the degree of belief that something is true.
Now, on one hand I think it is quite clear that “belief” is intrinsically less amenable to quantification than a proportion of possibilities. Thus, there is something “softer” about the science of quantifying “confidence” than the science of quantifying distributions of possible outcomes. On the other hand, it is useful to have some way of weighting our confidence that one or another thing is true in order to make decisions.
The problem with this approach, however, is that in most cases there are multiple studies to take into account. Say one study showed that animal fat reduced the risk of cancer with P<0.05. If there were several studies that showed the opposite result, this should greatly shake our confidence that the results of the first study were not due to chance. Our confidence should be much less than 95%. Yet the P value of that particular study wouldn’t change, because the probability of obtaining that result if the null hypothesis were true would still be less than five percent. Thus, reading a degree of belief into a P value tends to be rather shaky business unless there is little other evidence to go on.
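One way to see this is with a simple Bayesian update in which both the prior belief and the trial result are treated as normal distributions. This is my own illustration rather than a method from the textbook, and the “skeptical” prior standing in for several contrary studies (centered on a ten point increase in risk, with a spread of five) is invented purely for the example.

```python
import math

Z_95 = 1.96

def posterior(prior_mean, prior_sd, estimate, spread):
    """Combine a normal prior with a normal trial result, weighting each by its precision."""
    prior_precision = 1.0 / prior_sd ** 2
    data_precision = 1.0 / spread ** 2
    post_var = 1.0 / (prior_precision + data_precision)
    post_mean = post_var * (prior_mean * prior_precision + estimate * data_precision)
    return post_mean, math.sqrt(post_var)

def interval_95(mean, sd):
    """Central 95% interval of a normal distribution."""
    return (mean - Z_95 * sd, mean + Z_95 * sd)

ESTIMATE, SPREAD = -20.0, 10.2  # the invented trial from earlier in the post

# Practically no prior knowledge: the interval matches the frequentist one.
vague = interval_95(*posterior(0.0, 1e6, ESTIMATE, SPREAD))

# Hypothetical contrary evidence: the interval now comfortably includes zero.
skeptical = interval_95(*posterior(10.0, 5.0, ESTIMATE, SPREAD))

print("vague prior:    ", vague)
print("skeptical prior:", skeptical)
```

The trial’s own P value is identical in both cases, since it depends only on -20 and 10.2. What changes is the degree of belief once the other evidence is allowed into the calculation.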
As I have argued in previous posts, the LA Veterans Administration Hospital Study was unique for several reasons, not the least of which is that it was the only study where the mean age of the subjects was over 60. I think it is fairly reasonable to say, as a “soft” inference from the P value, that we can be just under 94 percent confident that the increase in risk among those consuming vegetable oil was a result of the treatment (eating in the dining hall where foods were made with vegetable oil) rather than chance. Expressing our confidence in this way is important to translating the finding into something useful for making decisions. It would, of course, be a more difficult task to formally quantify our degree of belief in the general statement that “vegetable oil increases the risk of cancer.” That would require looking at the totality of the data rather than a single study.
Of course we should never lose sight of the fact that the technical definition of a P value is the likelihood of obtaining a result this large or larger if there is truly no effect. We should likewise never lose sight of the fact that belief is subjective and requires debate about the relative value and proper interpretation of different forms of evidence.
I’m 95% confident that at least a few folks will find this helpful, but if not, please let me know in the comments so I can improve the next post. And if you made it this far, thanks for reading!
* I have treated this example as if I knew the standard deviation of the population to be 10.2, whereas in a real trial we would be very unlikely to know the standard deviation of the population and we would have to estimate it based on the standard deviation found in our own trial.
Read more about the author, Chris Masterjohn, PhD, here.
Armitage P, Berry G, Matthews JNS. Statistical Methods in Medical Research: Fourth Edition. Malden, MA: Blackwell Science Ltd. 2002.