In my last post, “Being Statistically Significant Is Nothing Like Being Pregnant,” I explained why I took the following result seriously, even though with a P value of 0.06, it isn’t “statistically significant.”
This is a graph from the LA Veterans Administration Hospital Study showing that the incidence of cancer was higher in the “experimental” group consuming vegetable oils than in the “control” group consuming mostly traditional fats. I had stated in that post that if P=0.06, we can be 94 percent confident that the result was not due to chance, which I offered as “a more intuitively graspable way” of phrasing the implication than the technical definition, which I also explained. A friendly commenter thought I was misinterpreting the P value. More specifically, he thought I was saying that there is a 94% probability that the result is not due to chance, a statement that would indeed be wrong. Being 94 percent confident, however, is not the same thing as claiming a 94% probability. In this post, I’ll try to explain the difference using a few pictures.
The following discussion is drawn from the principles outlined in Statistical Methods in Medical Research by Armitage, Berry, and Matthews, using my own examples. I’ll emphasize again that statistics is way outside the realm of my expertise, so be sure to check the comments to see if any of the readers of this blog who are much more knowledgeable in this area than I am have taken me to task (e.g., here).
The prevailing approach to probability in technical circles is the frequentist approach. This approach defines probability as a proportion of randomly distributed possibilities. For example, the probability of reaching into the jar below and pulling out a blue ball is fifty percent:
What we mean in this case is that half the possibilities are red and half of them are blue. If we were to keep taking balls out of the jar, we would pick out a blue ball half the time.
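If it helps to see the frequentist definition in action, here is a minimal simulation sketch. The jar size and number of draws are arbitrary choices of mine, and the draws are made with replacement so the proportions stay constant; the point is simply that the long-run frequency settles near one half.

```python
import random

# A hypothetical jar: half the balls are blue, half are red.
jar = ["blue"] * 10 + ["red"] * 10

# Draw from the jar many times (with replacement) and count how often a blue
# ball comes up. The long-run proportion of blue draws is what a frequentist
# means by "the probability of drawing a blue ball."
draws = [random.choice(jar) for _ in range(100_000)]
print(draws.count("blue") / len(draws))  # approximately 0.5
```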
Now, let’s suppose an imp took one of the balls while we weren’t looking.
The imp tells us the ball that he’s hiding in his left hand is blue. Then the little troublemaker asks us, “What’s the probability I’m telling the truth?”
From a frequentist perspective, the question doesn’t make any sense. The imp is either lying or telling the truth. Whether a particular fact is true or false cannot be a probability, because the correct answer does not exist as a random distribution of possibilities. There is only one ball in his hand (unless he’s lying about that too), so there is only one correct answer to the question, not a random distribution of correct answers. If there were a random distribution of correct answers, it would look something like this:
But of course we know there isn’t, because a jar that size couldn’t possibly fit in the imp’s hand:
Clearly, the imp can only have either one ball or no balls in his hand. Thus, from a frequentist perspective, there is no such thing as “the probability he is telling the truth.”
Something is deeply unsatisfying about this perspective. If we can’t say anything about the likelihood that something is true, how are we supposed to make decisions? How are we to decide, for example, whether we should dress our salad with olive oil or corn oil, unless we have some way to express our confidence that one or the other is the better choice?
An alternative approach to probability would be to quantify our degree of belief in something, where zero represents total disbelief and one represents total certainty. Such an approach to probability prevailed in the late eighteenth century and for most of the nineteenth, under the influence of Thomas Bayes (1701-1761) and P.S. Laplace (1749-1827). At that time it was referred to as “inverse probability,” in contrast to the “direct probability” characterized by the frequentist interpretation. The influence of this “Bayesian” view of probability waned in the twentieth century in large part due to the frequentist influence of R.A. Fisher (1890-1962), but it nevertheless continues to carry some weight even in the present.
A “Bayesian” approach would attempt to take into account the totality of any relevant knowledge we’ve heretofore collected. We already know that fifty percent of the balls in this jar are blue, so our previous knowledge about other jars of balls isn’t particularly important. But we have no idea if the imp is telling the truth. Can we say anything from previous experience about how often imps lie? We would want to have some idea not only of the typical propensity of an imp to lie, but also the variation. Suppose, for example, there exists a population of imps with varying proclivities for fibbing.
If we could establish, for example, that imps lie on average eighty percent of the time, and that 95 percent of imps fall somewhere in the range of lying seventy percent of the time and lying ninety percent of the time, this would go a long way toward helping us quantify our degree of belief in the imp’s claim that he’s holding a blue ball in his left hand.
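For readers who like to see that kind of prior made concrete, one conventional way to encode it is with a beta distribution. The sketch below is purely illustrative: the Beta(48, 12) shape is my own choice, picked only because it happens to have a mean near 0.8 with roughly 95 percent of its mass between about 0.7 and 0.9.

```python
from scipy.stats import beta

# A hypothetical prior for "how often imps lie": mean of 0.8, with roughly
# 95% of the probability mass between about 0.70 and 0.90.
prior = beta(48, 12)

print(round(prior.mean(), 2))                       # 0.8
print([round(x, 2) for x in prior.interval(0.95)])  # roughly [0.70, 0.90]
```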
There could be innumerable other pieces of information that are relevant: perhaps, for example, imps are so sick of being red all the time that they have an intrinsic drive to incorporate blueness in whatever way possible, and would be much more likely to choose a blue ball rather than to choose any ball at random.
On the other hand, suppose we don’t know anything at all about the propensity of imps to lie or tell the truth, or to pick red or blue balls. In such a scenario, we would have to fall back solely on our knowledge that fifty percent of the balls in the jar are blue. All we would be able to say would be that we were fifty percent confident the imp was telling the truth. Lo and behold! This just happens to be the exact same probability of taking a blue ball according to the frequentist perspective! Thus we find that under many conditions where we have no prior knowledge, the “Bayesian” and “frequentist” perspectives become mathematically equivalent: the likelihood of picking a blue ball at random becomes mathematically equivalent to our degree of belief that a blue ball has been picked (but see the dispute in the comments).
With that in mind, let’s return to the topic of whether we can turn a P value into “confidence” that something is or is not due to chance.
To make this easier to follow, I’m going to invent some data from a hypothetical clinical trial and try to distill the basic principles from some of the minutiae that would often complicate the mathematical calculations.*
Let’s suppose a randomized, controlled clinical trial showed that the incidence of cancer was forty percent among subjects consuming vegetable oil and twenty percent among subjects consuming animal fat. Since vegetable oil is now the norm, let’s express this as the effect of switching back to traditional animal fat. Expressed this way, that’s a twenty percent decrease in absolute risk from consuming animal fat and a two-fold decrease in relative risk. Clearly, this result is different from the “null hypothesis,” which is that switching from vegetable oil to animal fat has no effect:
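In case it helps to see the arithmetic spelled out, here is the absolute and relative risk calculation for these invented numbers:

```python
# Invented incidences from the hypothetical trial described above.
risk_vegetable_oil = 0.40
risk_animal_fat = 0.20

# Effect of switching back to animal fat in this made-up example.
absolute_risk_reduction = risk_vegetable_oil - risk_animal_fat  # 0.2, i.e. twenty percentage points
relative_risk = risk_animal_fat / risk_vegetable_oil            # 0.5, i.e. a two-fold decrease

print(absolute_risk_reduction, relative_risk)  # 0.2 0.5
```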
By itself this doesn’t tell us much, however, because there are innumerable things besides vegetable oil that affect cancer risk. Even though we randomly assigned subjects to one or the other group, we can’t possibly assume that the myriad factors affecting cancer risk were distributed perfectly evenly between the two groups. Even if the null hypothesis is true and there is no effect at all, the likelihood of obtaining a difference of exactly zero is incredibly small because there are so many other factors that could vary to some degree between the two groups.
For convenience, we call all of this additional complexity “chance.” It isn’t truly “random,” if there even is such a thing as randomness, but it’s random with respect to animal fat — in other words, it is neither caused by nor causes the consumption of animal fat — so it’s convenient to lump it all under the umbrella of “chance.” We therefore turn to statistics to help us quantify our confidence that this difference in cancer incidence between the two groups arose as a result of the different treatment (vegetable oil versus animal fat) rather than as a result of “chance.”
Our first task is to compute a P value that attempts to answer the question, “if switching from vegetable oil to animal fat has no effect on the risk of cancer, how likely would we be to obtain a difference as large or larger than the one we have observed?”
Following a frequentist perspective, the question can only be answered by taking a trip from the Land of Observed Data into the Land of Imaginary Data. Let’s imagine that the null hypothesis is true and there is no effect of animal fat on cancer risk. Suppose we conducted lots of similar trials testing the hypothesis over and over again. We test it a hundred times, or even a thousand. Each time, the difference between the two groups should be a little different, but if we pool all the trials together, the overall average should be very close to zero. Results very close to zero should also be the most common, even though they would only occur in a minority of cases. By estimating the variability and assuming a certain sample size, we should be able to draw a sampling distribution that displays the hypothetical results of all of these imaginary trials, and it should look something like this:
In this graph, we have the results of many imaginary studies plotted according to their expected frequency, assuming the null hypothesis is true. The blue line represents our expected difference of zero. While we would expect only a few observations to fall precisely along the blue line, we would nevertheless expect a full 95 percent of the observations to fall within the shaded green area. Conversely, five percent of the observations would fall outside the bounds of the shaded green area, in the two red-shaded tails of the curve, with 2.5 percent falling into each tail. The results of our own imaginary trial showed a twenty percent decrease in risk, which falls exactly along the left-most border between the shaded red and green areas. We would say the P value is equal to 0.05 because there would be a five percent chance of obtaining a difference of this magnitude or greater if the null hypothesis were true.
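As a rough sketch of the calculation behind that picture, suppose (loosely following the note at the end of this post) that the sampling distribution of the difference under the null hypothesis is normal with a standard deviation of 10.2 percentage points; that figure is an assumption chosen so the numbers work out. An observed difference of -20 then gives a two-tailed P value of about 0.05:

```python
from scipy.stats import norm

se_difference = 10.2        # assumed standard deviation of the sampling distribution (see note)
observed_difference = -20   # percentage-point difference in the hypothetical trial

# Two-tailed P value: the probability, if the null hypothesis were true, of a
# difference at least this far from zero in either direction.
z = observed_difference / se_difference
p_value = 2 * norm.cdf(-abs(z))

print(round(z, 2), round(p_value, 3))  # -1.96 0.05
```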
We have essentially established a 95% confidence interval around zero, which is the “average” result we would expect from many trials if the null hypothesis were true. It would not usually be expressed this way in a publication, because when performing such a statistical test our goal is usually to report the P value rather than the actual breadth of the interval around zero. It helps to recognize the similarity, however, because we could reach exactly the same conclusion by establishing a 95% confidence interval around the result from our own trial, and it is very common to report confidence intervals established in such a way. Here is what one would look like for our study:
Here, the mathematics are much the same, but the interpretation is somewhat different. If we conducted this trial, say, a thousand times, each trial would generate a different 95% confidence interval. Yet we could expect 95 percent of those studies to generate a confidence interval that includes the true effect of switching from vegetable oil to animal fat. We should notice a striking agreement between the two approaches: whereas above we had seen that a twenty percent reduction in the incidence of cancer (-20) rested at the very limit of the 95% confidence interval around zero, zero now lies at the very limit of the 95% confidence interval around -20. Thus, we could use either approach to determine that P is equal to 0.05 and that there would be a five percent likelihood of obtaining a difference this large if the null hypothesis were true.
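Here is the same duality in a few lines, again under my assumption of a 10.2-point standard error: the 95% interval around zero just reaches -20, and the 95% interval around -20 just reaches zero.

```python
from scipy.stats import norm

se_difference = 10.2        # assumed standard error of the difference (see note)
observed_difference = -20

z_crit = norm.ppf(0.975)    # about 1.96 for a two-sided 95% interval

interval_around_zero = (-z_crit * se_difference, z_crit * se_difference)
interval_around_result = (observed_difference - z_crit * se_difference,
                          observed_difference + z_crit * se_difference)

print([round(x, 1) for x in interval_around_zero])    # about [-20, 20]
print([round(x, 1) for x in interval_around_result])  # about [-40, 0]
```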
Informally, and not without potential pitfalls, it seems natural for the logic to run something like this:
- The likelihood of obtaining a result of this magnitude or greater if there is no effect of animal fat is five percent.
- Conversely, the 95% confidence interval lies on the very edge of excluding zero. The likelihood that the confidence interval includes the “true” value is 95 percent.
- On an informal and intuitive level, it seems right to use this likelihood as a measure of our confidence that the “true” value is different from zero.
- If we can say that we are just under 95 percent confident that the true value is different from zero, it follows that we could say we are just under 95 percent confident that animal fat reduced the risk of cancer in this study (but see the dispute in the comments).
If we used a Bayesian approach to quantify our degree of belief, we could construct a Bayesian confidence interval, which could also be called a credibility interval or a Bayesian probability interval. As it so happens, under the simplest conditions and without any prior knowledge to factor into the calculation, the “Bayesian” and “frequentist” intervals are mathematically identical (but see dispute). The difference, of course, is the interpretation: the Bayesian interval is a formal way of quantifying the degree of belief that something is true.
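As a numerical sketch of that correspondence, under the same simplifying assumptions as above (a normal likelihood with a known 10.2-point standard error and a flat prior, a choice that is itself debated in the comments below), the Bayesian interval comes out the same as the frequentist confidence interval:

```python
from scipy.stats import norm

se = 10.2        # assumed standard error of the difference (see note)
observed = -20   # observed difference from the hypothetical trial

# With a flat (non-informative) prior and a normal likelihood with known
# standard error, the posterior for the true difference is itself normal,
# centered on the observed result.
posterior = norm(loc=observed, scale=se)

print([round(x, 1) for x in posterior.interval(0.95)])  # about [-40, 0], matching the frequentist CI
```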
Now, on the one hand, I think it is quite clear that “belief” is intrinsically less amenable to quantification than a proportion of possibilities. Thus, there is something “softer” about the science of quantifying “confidence” than the science of quantifying distributions of possible outcomes. On the other hand, it is useful to have some way of weighing our confidence that one or another thing is true in order to make decisions.
The problem with this approach, however, is that in most cases there are multiple studies to take into account. Say one study showed that animal fat reduced the risk of cancer with P<0.05. If there were several studies that showed the opposite result, this should greatly shake our confidence that the results of the first study were not due to chance. Our confidence should be much less than 95%. Yet the P value of that particular study wouldn’t change, because the probability of obtaining that result if the null hypothesis were true would still be less than five percent. Thus, reading a degree of belief into a P value tends to be rather shaky business unless there is little other evidence to go on.
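As a toy illustration of that point, here is a simple fixed-effect (inverse-variance) pooling of two hypothetical results, with numbers I made up: the first study’s own P value stays at about 0.05, but the pooled evidence is far less convincing.

```python
from scipy.stats import norm

# Two hypothetical studies of the same question: the first finds a -20 point
# difference (P of about 0.05), the second finds essentially no difference.
# The effect sizes and standard errors are invented for illustration.
estimates = [-20.0, 0.0]
standard_errors = [10.2, 10.2]

# Fixed-effect (inverse-variance) pooling of the two results.
weights = [1 / se ** 2 for se in standard_errors]
pooled = sum(w * e for w, e in zip(weights, estimates)) / sum(weights)
pooled_se = (1 / sum(weights)) ** 0.5

z = pooled / pooled_se
p_value = 2 * norm.cdf(-abs(z))

print(round(pooled, 1), round(pooled_se, 1), round(p_value, 2))  # -10.0 7.2 0.17
```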
As I have argued in previous posts, the LA Veterans Administration Hospital Study was unique for several reasons, not least because it was the only study in which the mean age of the subjects was over 60. I think it is fairly reasonable to say, as a “soft” inference from the P value, that we can be just under 94 percent confident that the increase in risk among those consuming vegetable oil was a result of the treatment (eating in the dining hall where foods were made with vegetable oil) rather than chance. Expressing our confidence in this way is important for translating the finding into something useful for making decisions. It would, of course, be a more difficult task to formally quantify our degree of belief in the general statement that “vegetable oil increases the risk of cancer.” That would require looking at the totality of the data rather than a single study.
Of course we should never lose sight of the fact that the technical definition of a P value is the likelihood of obtaining a result this large or larger if there is truly no effect. We should likewise never lose sight of the fact that belief is subjective and requires debate about the relative value and proper interpretation of different forms of evidence.
I’m 95% confident that at least a few folks will find this helpful, but if not, please let me know in the comments so I can improve the next post. And if you made it this far, thanks for reading!
Notes
* I have treated this example as if I knew the standard deviation of the population to be 10.2, whereas in a real trial we would be very unlikely to know the standard deviation of the population and we would have to estimate it based on the standard deviation found in our own trial.
Read more about the author, Chris Masterjohn, PhD, here.
References
Armitage P, Berry G, Matthews JNS. Statistical Methods in Medical Research: Fourth Edition. Malden, MA: Blackwell Science Ltd. 2002.
Tim Ozenne says
Generally correct. But for 95% interval, you want 2.5% in each tail (two-tailed test). Whether a two-tailed test is best is a different question.
Chris Masterjohn says
Hi Tim,
I agree, but what I tried to say was that there would be five percent between the two tails. I’ll try to edit it in a way that makes that more clear. If you have any suggestions, please let me know.
Chris
Chris Masterjohn says
In the initial description of the first sampling distribution, I added, “2.5 percent falling into each tail.” I hope that helps clarify.
Chris
Norman Yarvin says
Uh, true Bayesians do not say that without prior knowledge you fall back to the frequentist position. True Bayesians say that without prior knowledge you can’t say a damn thing. When your religion is Bayes’ formula, and you don’t know one of the inputs to Bayes’ formula, you don’t know the output. The inputs do not default to 0.50, or any other number; they default to NaN.
This, of course, leaves open the question of where one might ever start getting knowledge in the first place, after being born without it. But then that’s a tough question in practice, too. Bayesian extremism would say nobody can ever know anything, but back off a hair on that extremism, and it’s quite a good way to explain all the clueless people in the world.
In any case, Chris, when you’re in a hole like this, you should stop digging. The mathematically defensible statement here is that if the trial results were just due to pure randomness, there’d only be a 6% chance of them being as extreme as they were. That’s just the truth; it’s not dependent on Bayesianism, frequentism, or any other -ism. In the language of Bayes’ theorem, that gives P(A|B); you were trying to make it the number for P(B|A), and the one can’t be gotten from the other unless you know P(A) and P(B). So please just admit that you confused P(A|B) with P(B|A), and make the excuse that many others have also done so and continue to do so. Youth and inexperience are also good excuses. Saying that you called it a “confidence” rather than a probability is not a good excuse, unless you want to be labeled a confidence man.
As a practical matter, making the mathematically defensible statement is impressive enough. You don’t have to confuse P(A|B) with P(B|A), when most of your readers will just do it for you anyway. And the few who don’t confuse it themselves will appreciate you getting it correct.
This is not an idle distinction. The significance one should place on a trial depends not just on the p value but also on a lot of other things, such as the number of people who have tried to get such a result. When 20 people try an experiment, odds are that one of them will get a result significant to p<0.05 just out of pure chance. Okay, maybe the field is so conscientious that the other 19 people will get published even though their results are boring, but maybe it isn't. Saying that you have to know prior probabilities is a way of saying that context matters, especially with results whose statistical significance is marginal.
Chris Masterjohn says
Hi Norman,
Thanks for your thoughts! According to the textbook I cited, the “extreme member” of the family of conjugate priors can be chosen to “represent ignorance,” which is called the “non-informative prior or the vague prior” (p. 167). They cite Lindley, 1965, for the mathematical correspondence between traditional (frequentist) methods and Bayesian methods when non-informative priors are chosen and appropriate changes of wording are made. They state on p. 174 that with a non-informative prior the Bayesian confidence interval corresponds mathematically to the traditional confidence interval. I don’t know whether this represents adherence to the true religion of Bayesianism, but apparently it reflects a standard component of the Bayesian statistical repertoire if it made it into this textbook, no? Like I said, this isn’t my field, so all I can do is try to represent what I read. In any case, how would you suggest making a quantitative estimate of confidence in something? Obviously this is more wishy-washy than quantifying true probability, and I’m not very comfortable at present calling it “probability,” but I’m not sure there is a better way to satisfy what is more or less intuitively demanded: some way of describing how strongly we believe something, and some way of rooting it in evidence rather than pure guesswork. What are your ideas?
Chris
Norman Yarvin says
Sometimes textbooks are a record of things that have been proven rigorously. Other times they are a list of dodges that people have succeeded in getting away with. Note the wording of that textbook: “can be chosen”. Other things can be chosen too. It’s a choice, not something that the math dictates. In general, choosing a prior probability is a great way of smuggling your biases into the results of a study while still sounding scientific and objective. So of course lots of people have done it and gotten away with it. Talk of “conjugate priors” is an especially nice touch: in math, “conjugate” often denotes something quite formidable, as in the conjugate gradient method. So using that word is a good way of scaring people into not questioning your choice. But this use of “conjugate” is a completely different meaning of the word, which has nothing to do with those formidable uses; the only reason to prefer a “conjugate prior” over a non-conjugate one is that it makes the math a bit more convenient. The convenience of mathematicians was never a truly good reason to do anything, and now that we have computers to help us do the math, it matters less than ever.
In any case, the right way to deal with biases is not to smuggle them in, but to drag them out into the open and argue about them too. That’s nothing you aren’t already doing, of course: your discussions about polyunsaturated fatty acids already contain plenty of other reasons to suspect them of causing harm. And, for that matter, you’re not shy about relating certain personal experiences that may have biased you. The trick is just to realize that these things are what the math is about, too: those are the “prior probabilities” that you’re bringing to the discussion. Someone who just leaps into the discussion and looks only at the one study may have a very different set of prior probabilities: he may think that polyunsaturated oils taste good and thus can’t be doing harm, or at most have maybe a 5% chance of doing so. This study, if he believes it, should shock him back to something more like a 50% or 25% chance.
Chris Masterjohn says
Hi Norman,
I agree with you that prior evidence shouldn’t be dragged into a single study. I think it is useful, however, to take quantitative approaches to meta-analysis. They shouldn’t displace subjective, qualitative debate, but they are still very useful, especially when they analyze the data in multiple ways that incorporate different subjective biases or assumptions. But here, my purpose was not to justify the use of Bayesian priors. My point was that if the Bayesian priors are chosen as non-informative, the quantification of degree of belief corresponds mathematically to the frequentist approach. Thus, this comes closest to translating the technical statistical inference about probability into some inference about confidence that something is true. I added a paragraph towards the end to explain this. My point isn’t that this should be the primary reading of the P value. The technical reading should. However, there is some usefulness to coming up with a quantitative way of weighing confidence, when it is done without excluding subjective debate, is there not?
Chris
ProudDaddy says
Your note was the most interesting to me. What is our confidence in this estimation?
Further, doesn’t saying we are x% confident or whatever still reflect only the MEAN results? Let us say we are studying disease X which unknown to us is actually two different diseases with the same symptoms. Let’s further assume that the pathologies of these diseases are totally different. We make an intervention and report the mean. Does it have any meaning at all?
To expand on what I am trying to get at, let us assume that our intervention always cures those with one of the diseases and worsens those with the other. Assuming an equal occurrence, our mean result would be “no effect”. And if we repeated it often enough, we could become 99.9% confident of same, and totally miss the fact that we have a cure for one of the two diseases.
I am certainly not a statistician, but I still think that the true outliers (not measurement error) might be able to tell us more about the worth of our study than a bunch of powerful statistics.
Chris Masterjohn says
Hi ProudDaddy,
Great point! Indeed, we can’t put our faith blindly in the outcomes of scientific studies precisely for this reason: we are only scratching the surface of the complexity. So, what we need to try to do is dig deeper to find the true nature of the complexity. On the other hand, statistics can be valuable in that pursuit as well.
Chris
Norman Yarvin says
(This is a continuation of my above conversation with Chris Masterjohn, which has gotten too deeply nested, so is being displayed with only a single word per line.)
There’s certainly a place for meta-analysis done in a quantitative way; but the usual sort of meta-analysis operates by combining the results from several sources, rather than by trying to use some theory of interpretation to extract more meaning from a single study. Also, meta-analysis yields yet another P value, which poses the same problems of interpretation as the P values from the source studies do.
Not that those are really problems, if you simply take the rigorous way, and say that the P value is the probability that the result (or another result even more extreme) would occur just from chance.
The problem comes when you want to translate that to a probability that the result did occur just from chance.
To take a case where the two are completely different, consider the case of someone trying, via experiment, to disprove a law of mathematics — something we absolutely know is true, because we have proven it rigorously. He does an experiment which has some randomness in it — say, something using the Monte Carlo method (a well-established random algorithm) — and comes up with the result that the law of mathematics being tested is untrue to P=0.02. It’s correct to say that there’s only a 2% probability that his result (or another result even more extreme) would occur just from chance. Nevertheless, it’s incorrect to say that there’s a 2% probability that it did occur just from chance, because that probability is really 100%; the law of mathematics he tested is actually true, so his result must have occurred just from chance.
Of course, in useful cases, the prior probabilities are not as completely dominant as here. But they still stand in the way of going from “would” to “did”.
Chris Masterjohn says
Hi Norman,
Thanks for clarifying!
I think we both agree that the P value cannot be used to say the probability that something *did* occur, in the sense of a proportion of possibilities. I think I made that clear in the first half of the post. So we are not debating that point, correct?
In your example, I completely agree that it would be absurd to view the experiment in isolation and use the P value to indicate confidence in the non-randomness of the result. In this example, however, the scenario is the opposite of the one I am suggesting in this post: the prior evidence overwhelms the study result. A Bayesian analysis, if I understand it even somewhat correctly, would conclude the probability of chance was approaching 100%.
In the example here, this is the only study of its kind. I have argued rather extensively in past posts that it needs to be given special treatment because of its design and primarily because of the age of its subjects. If there were even one or two other comparable studies that did not show this result, or showed the opposite, I don’t think it would be fair to say we are “94% confident” that the difference resulted from the treatment.
What I’m realizing now, though, is perhaps I gave the impression that this should be a general interpretation of the P value, when in fact usually there are multiple relevant studies (and of course in this case, depending on how one phrases the research question, there could be other relevant studies). I think I missed this point early in our discussion because you had reacted so strongly to the Bayesian inference. Are you against any attempt to quantify degree of belief? Or are you just specifically against inferring it into the P value because there is almost always other evidence to take into account?
Chris
Norman Yarvin says
Oh, and by the way, I’m not trying to argue that you shouldn’t import other knowledge into your interpretation of this study. On the contrary, I think you should. All I’m arguing is that whatever numbers you import to fill the role of a prior probability should be your numbers, not some stupid “default” numbers from some textbook. They should be based on your knowledge of biochemistry and of other relevant human studies. Of course if you’re not comfortable reducing that knowledge to a number, that’s fine too, but in that case you don’t have the wherewithal to make the jump from “would” to “did” (that is, from P(A|B) to P(B|A)).
Chris Masterjohn says
Hi Norman,
I added this paragraph:
“The problem with this approach, however, is that in most cases there are multiple studies to take into account. Say one study showed that animal fat reduced the risk of cancer with P<0.05. If there were several studies that showed the opposite result, this should greatly shake our confidence that the results of the first study were not due to chance. Our confidence should be much less than 95%. Yet the P value of that particular study wouldn’t change, because the probability of obtaining that result if the null hypothesis were true would still be less than five percent. Thus, reading a degree of belief into a P value tends to be rather shaky business unless there is little other evidence to go on."
I realize this changes the conclusions somewhat, but I think it makes them more correct. If we arrive at some consensus in the comments, I'll probably edit this post accordingly and then re-post to notify everyone of the correction. I realize you and I might not come to total agreement, but do you think this improves the conclusion?
Chris Masterjohn says
Hi Norman,
It seems one other thing that might be missing from my conclusion is that our confidence in the truth of one or another thing should be proportional to how much it has been researched. In other words, it shouldn’t be near 100% because there is only one relevant study. If there is only one relevant study, it should be pretty low until we have the chance to elaborate the research. What do you think?
Chris
Chris Wilson says
Hi Chris and Norman,
Good discussion you have going. I would simply add that one thing being missed here is that Bayesian methods aren’t just an alternative way of doing hypothesis testing.
As in a meta-analysis, you need to be careful about the source, nature, and mathematical form of data that are used in selecting a prior for Bayesian analysis.
For simple hypothesis testing, most “Bayesians” agree that an uninformative prior (“flat distribution”) is the most appropriate choice. The results will be basically identical to an analogous frequentist test.
In my opinion, Bayesian methods come into their own when you consider that they allow you to treat the parameters of the system you’re studying as random variables, rather than as fixed “true” values.
One exciting application of Bayesian methods is in adaptive management of complex systems. Say you have a well-defined set of competing hypotheses (management options), and some decent background data. You then collect some good data “testing” these hypotheses, and can then discriminate amongst them using your prior data and Bayes’ theorem. In effect, you’ve generated a set of posterior distributions which allow you to predict the effect of taking one action as opposed to another (from your list of hypotheses).
You then take an action that seems best, and then collect more data. Your previous experimental data becomes the prior, and you generate a new round of posteriors. This cycle can be iterative and adaptive.
Frequentist methods constrain you to estimating parameters and confidence intervals about the “true state” of the system, whereas what you’re more interested in as a manager is how your management actions are pushing the system, affecting the form and variability of the parameters of interest.
Again, this is just one example of where the Bayesian framework is more powerful. There are philosophical trade-offs between frequentist and Bayesian approaches.
I agree that in this case, in evaluating this veggie oil study which is kind of in a category of its own, a frequentist statistical test makes the most sense.
Chris Masterjohn’s other knowledge of biochemistry, nutrition, and so on, provides an interesting context for interpreting it, but shouldn’t be used to futz with the p-value or the confidence interval…I don’t think most folks using Bayesian methods would suggest otherwise…
Chris
Chris Masterjohn says
Hi Chris,
Thanks for chiming in!
What do you think of this issue?
From a Bayesian perspective, how does having only one relevant study affect quantification of confidence in something? High because of a non-informative prior, or low because there is only one study which itself should produce uncertainty?
Thanks!
Chris
Chris Wilson says
Hi Chris,
The short answer is: more studies give more confidence, in the sense you mean, for Bayesians just as for frequentists. After all, you’re only justified in using an informative prior if you have a good knowledge base and data to work from. So the choice of a non-informative versus informative prior depends on a lot of factors. It definitely affects the quantification of confidence! But I suggest it’s worth distinguishing between “confidence” in an interpretive sense (an epistemic question), versus “confidence” as calculated in a Bayesian credible set or in frequentist “confidence intervals”.
After further thought, I would approach these data differently. I would use a Bayesian statistical model that has both informative and uninformative priors (be warned that I’m way out of my depth here, my only familiarity with Bayesian inference is in ecological studies):
I would build a generalized linear model with three predictor variables: veggie-oil/animal-fat (categorical), age (categorical) and smoking rates (continuous). The response variable is cancer rate. You have two sets of models (one for each treatment).
Rate = B0 + B1(veggie-oil OR animal-fat) + B2(smoking) + B3(age) + interaction terms
You could use frequentist model selection and parameter estimation at this point. However, using a Bayesian approach you would specify priors for each parameter.
Since it seems there should be robust data for the effects of smoking and age on cancer risk, it makes sense to incorporate them as Bayesian priors (rather than treating this study as a de novo universe for cancer risk attributable to smoking and age). Might the NIH have such data tables?
I would choose a non-informative Bayesian prior for the animal-fat/veggie-oil parameter. This is where the absence of previous studies comes into the picture.
Anyways, I would evaluate these data with Bayesian model selection and generate credible sets for the parameters.
Rather than simple hypothesis testing we’ll: 1) select the “best” models for cancer risk for each treatment group, and 2) build credible sets for the parameters of interest to explore their biological meaning.
P-values would never enter into this analysis. After all, we already have a p-value (p=0.06), and it’s arbitrary to say that’s not significant but p=0.05 is! I think this statistical model would be far more interesting.
Chris
P.S. I may take a gander at these data, if I can track down the original LA Veterans study. Unfortunately, I still suck at actually programming statistical models 😉
Chris Wilson says
Edit: there is no stratification by treatment group; that was a silly idea. There’s one family of models with treatment (veggie-oil/animal-fat) as a two-valued categorical variable.
Chris Masterjohn says
Hi Chris,
Thanks so much for your suggestions! I think Norman is correct that we are unlikely to get original smoking data from this study, but there is a publication with mean data stratified by several categorical smoking rates. I think it would be important to try to fit in vitamin E versus smoking interactions.
I’m a little bit confused about your stance on interpretation of the P value. You wrote on your blog, “there’s little justification for a Bayesian informative prior to re-run an ANOVA style hypothesis test with.” Isn’t the issue the choice of a non-informative prior? My understanding is the non-informative prior would yield a similar result to the frequentist test. Are you saying there’s little justification for a Bayesian hypothesis test at all, because the prior would be non-informative if we chose one? In other words, are you opposed to doing a simple one-study Bayesian hypothesis test with a non-informative prior?
I’m also confused on another point: you said that for both Bayesian and frequentist approaches, confidence should increase with increasing repetition. Is there a brief way you could explain how that works for a Bayesian approach? The main point of confusion for me is that if a non-informative prior is chosen and this leads to an interpretation of “95% confidence” on a P value of 0.05, there seems to be little room for increasing confidence with increasing repetition of the result.
Also, what is your view of the two objections to my original conclusion that I raised in my comment to Norman? Basically:
1) With increasing repetition of the same result, say, P=0.05, using the same design, it would seem that confidence should increase and asymptotically approach a maximum of 95%. Yet if we interpret a single study of P=0.05 as yielding 95% confidence, there is no room for confidence to increase.
2) With conflicting results, confidence in the true value of the parameter should change but the probability of one or another result given the null hypothesis should not. In other words, two studies are identical, the first yields P=0.05 and the second yields P=0.5. The second study should lead to much lower than 95% confidence in the results of the first study, yet the first study should remain P=0.05 because the probability of obtaining that result with the null hypothesis is the same.
What do you think?
Thanks!
Chris
Chris Wilson says
Also, I just uploaded a post on my blog about this discussion, for anyone that’s interested (similar dorks). I outline my proposal for a statistical model a little more. I’m interested if this could actually go anywhere or if there’s something I’m overlooking…
http://regenerative-ecology.blogspot.com/2012/07/bayesian-modeling-la-veterans-pufa-study.html
Chris
Jim Stone says
Hi Chris.
What you say is good to a point. However the biggest mistake people make with statistics is that they don’t do Bayesian updating based on the circumstances by which correlations are discovered.
For instance, let’s say the researchers actually tested for 100 correlations and 5 of them had a p value of .05 or less. And those are the ones they reported. We would have reason to reduce our confidence that the correlations are not due to chance in this case, because we would EXPECT 5 of the correlations to be significant just by chance.
Similarly, we have to update based on publishing bias. Let’s say for every published study there are 5 unpublished because they didn’t find significant p values. This should affect our confidence in the correlation in the reported study.
We really should have much higher standards for our p values. I’m really not excited at all unless I see something less than 0.001 for the most part.
Even then I usually want to see a followup study with a larger sample size.
Remember, the p values for our best discoveries are usually something like 0.00000001 or better. Think about the link between lung cancer and cigarette smoking, or the idea that antibiotics actually cure infections, or that smallpox vaccine prevents smallpox.
Chris Wilson says
Hi Jim,
I agree somewhat with what you say. Your example of scanning through correlations is well-taken; I agree that “data-mining” is a problematic pursuit. That’s different than calculating a p-value based on a controlled study– with good experimental design and clear hypothesis testing.
Also, recall that probability (as a mathematical value) does not change based on past results for independent trials- i.e. just because I flipped ten heads in a row, doesn’t make it more or less likely that I’ll flip one this time. Likewise, just because researchers may have scanned a hundred correlations to find one with p=0.05, does not make it any more or less likely to be the result of chance.
Your criterion for p-values smacks of an extreme bias toward chemistry, physics, and engineering, where super-controlled conditions are possible and generally linear dynamics prevail. In the real world of complex phenomena, variability rules, both measurement and process-based. You’d have to throw out a lot of biology, ecology, climate science, and much else besides… ;)
The point for discussion here is the arbitrariness of saying that p=0.05 would be significant, but p=0.06 isn’t. Particularly if we retroactively have reason to believe that PUFA’s could contribute to cancer risk (based on biochemical/nutritional reasoning), the question was: how can we approach these data? Should we let our other knowledge affect their interpretation? How does the uniqueness of this study help or hinder our inference?
The frequentist, ANOVA-based point of view is that nothing else that we know should impinge on this result, it is what it is, and we should leave the hypothesis test as it is (in the dust-bin). My suggestion is two-fold: 1) move to statistical modeling rather than simple hypothesis testing, and 2) use Bayesian methods to incorporate high-quality data for what we know about smoking and age contributions to cancer rates.
Obviously, the ideal thing to do is another similar study, only run it for longer and balance the smoking between treatments, but that’s not going to happen soon. In the meantime, I’m proposing that we could get more information out of these data (although I confess the details I envision may be in error, I’m mainly speculating).
Chris Masterjohn says
Hi Chris,
Thanks so much for sharing all your thoughts. I would caution, however, against accusing the researchers of throwing the hypothesis in the dust bin. I think this study was actually very influential in having the establishment quietly back off recommending corn oil.
Chris
Chris Masterjohn says
Hi Jim,
Thanks for writing! I agree with your points about adjustment for multiple comparisons, publication bias, and reporting bias. However, I agree with Chris that your expectations for p values are extraordinarily unrealistic to apply to anything biological or ecological in nature.
Chris
Norman Yarvin says
With multiple studies, a Bayesian approach is to make the posterior probability after interpreting each study the prior probability in interpreting the next. After enough studies, the prior you started with is drowned by the data; pretty much any one you choose, as long as it has a moderate level of doubt in it, will give about the same result. That is the ideal situation, and it makes arguments over what prior to start with moot.
But with just one study, the prior has a lot of impact. In this case you (Chris Masterjohn) personally seem to be comfortable, or at least not too uncomfortable, with the “uninformative” prior probability of 0.50. But that’s not really a neutral thing to assume. To demonstrate this, suppose that the study had been, instead of about what people ate, about whether the seats they sat in were made of wood or made of plastic, with the people who sat in wooden seats getting higher cancer rates. In that case you wouldn’t be nearly so comfortable, I’d think, with that 0.50 prior, because it would be quite strange if wooden seats caused cancer. (Well, unless they were real torture devices…) What makes you comfortable with the idea that polyunsaturated fatty acids might cause cancer is, I believe, that they are eaten, that when eaten they don’t just get destroyed immediately but get distributed throughout the body, and that they are quite prone to oxidation. (Of course that’s only one aspect of their biochemistry; you may have other things in mind too.)
It’s not that I’m against any attempt to quantify degree of belief. It’s that I think belief just inherently has to be a personal thing, and that readers will inevitably bring a different perspective. A reader may start less suspicious of polyunsaturated fats than you are, or more so. Assuming that “uninformative” prior is forcing your own views on them, which — well, there are a lot of worse people’s views to have forced on you, but it’s still something of an imposition. And yes, as you’ve noticed, if you tried to make it a general principle to use that prior, it wouldn’t be self-consistent: in the presence of multiple studies whose results differ, you can’t apply that same argument to each of them individually, because it’s not consistent to say both that there’s a 94% chance of PUFAs causing cancer and that there’s (say) a 97% chance (presuming some future study was done that yielded p=0.03). Multiple studies that agree with each other should improve the confidence, not yield self-contradiction. The fact that at the moment there are no other studies doesn’t change this: an argument that would have to change if there were other studies is no good in the first place. Other studies should change the final probability with which you think that PUFAs cause cancer, but should not cause you to have to retract any arguments.
There are other details, too, standing in the way of the leap from “there is a 94% probability that this result would not happen from pure chance” to “we are 94 percent confident that animal fat reduced the risk of cancer”. There are plenty of confounders that are not pure chance, and not animal or vegetable fat, yet could have been operating here. The cooks in one cafeteria, for instance, could have been more in the habit of burning food to the point of producing carcinogens. Also, there could have been mistakes made in the sort of ordinary clerical details of which there are thousands in a study like this, and those mistakes might not have been evenly distributed.
As for Chris Wilson’s idea to redo the P value, including in the math things like age and smoking, that sounds good, but it also sounds like one would need age and smoking status for each study participant, which might be hard to get, especially for a study done so long ago. Summary statistics perhaps could be used instead of data on each participant, but would be of much less value.
Chris Masterjohn says
Hi Norman,
Thanks so much for sharing your thoughts. I would note that in the conclusion of my article I was expressing confidence that the effect was a result of the treatment, which I defined as eating in the dining hall that served vegetable oils, not as consuming the vegetable oil. Thus, I was expressing this not as 94% confidence that “vegetable oils cause cancer,” but 94% confidence that the treatment caused the effect, with all of the potential confounding implied by that treatment.
To extend this to the experiment with wooden seats, if I were to follow my original line of argument, I would express this as confidence that sitting in the particular seats caused cancer, not as confidence that “wooden seats” in general cause cancer. I think that would require numerous other experiments in different contexts that attempt to tease out potential confounders in much the same way as the dining hall issue introduces.
At the moment, I’m trying to decide whether to publish a new post retracting my original conclusion. I agree with you that belief is subjective, but I’m also open to the idea of attempting to *loosely* quantify it, acknowledging that this rests on subjective assumptions that can be chosen differently by different people. I’m basically struggling with this:
1) Should the mere fact that there is one relevant study factor negatively into an estimation of confidence? In other words, if we had five studies in a row, conducted in precisely the same way, that all generated P values of 0.05, then shouldn’t our confidence *increase* substantially with each study, progressively approaching a maximum of 95%?
If a non-informative prior leads to the Bayesian inference for one study equaling the frequentist test, this would seem to deny this phenomenon of increasing confidence with increasing replication.
2) While I am sympathetic to your point that multiple studies with low P values should not lead to self-contradiction, I’m even more concerned with a dramatic discrepancy: say the first study has P=0.05 and the second study, identical in design to the first, has P=0.5. This second study should cause the confidence in the result of the first study to be far lower than 95%, yet it would not make any sense to change the P value because the probability of that result occurring despite the null hypothesis is the same.
Do you think these two objections to my original conclusions are correct?
Thanks,
Chris
Chris Wilson says
Hi Chris,
First, I think I need to track down the original study to see how the authors set up their statistics. Another caveat is that statistical modeling and Bayesian statistics in particular are an active area of research and innovation, and I can’t hope to represent the field as well as I’d like. With that in mind:
1) My point about hypothesis testing is that using Bayesian methods with an uninformative prior (although in practice it can be tricky to define one) should give extremely similar results. Thus, I mainly question the utility of doing so. There are more interesting things to do with data, i.e. maximum-likelihood or Bayesian model selection and parameter estimation. Or, as I suggested, specifying informative priors for the parameters we can justify (like smoking and age).
2) A Bayesian approach to repeat testing is to “update” the prior with the posterior of a previous trial. But see below.
3) In your example of increasing repetition of the same result, say some effect size with p=0.05, it seems to me you have two choices. The first is to pool the data (which I think is justified given your conditions) and run the frequentist test again. With a larger sample, your statistical power improves, increasing the calculated F-ratio and thus decreasing the calculated p-value. A Bayesian approach with an updated prior would, I think, yield no net change (again, assuming unrealistically that the data were identical). I’ll have to think about this some more. I suppose you could pool and use the original prior.
4) With conflicting results, I think the problem is this. In each trial, with a frequentist test, the probability of obtaining those results (or more extreme) from a null hypothesis remains the same (the calculated p-value). Now, if you pool together and retest, your new p-value is going to be different than either. Assuming you believe the methodology is sound, pooling makes the most sense to me.
But I think your wider question is about confidence in an epistemic sense, not the mechanics of calculating p-values and intervals. Here’s my take: Popper’s vision of science- at least popularly understood- is about falsifying hypotheses (analogous to the frequentist rejection of null hypotheses). You can never directly confirm an hypothesis, only fail to falsify it, and the accumulation of failures indirectly supports it.
I find this unsatisfactory and unrealistic. I think that prior beliefs are always at work: in the questions we pose, the methods we use to answer them, and the interpretation of the data.
Pragmatically, I personally weigh hypotheses against theories, critical reasoning, qualitative and quantitative models, and various kinds of data. I think we have to take p-values and intervals in this wider context. Our confidence should come from the whole.
Chris
Chris Masterjohn says
Hi Chris,
Thank you so much for your comments!
I haven’t read any original Popper myself, but I agree with you to an extent that there is something unsatisfactory about simply trying to falsify a hypothesis. I do agree, however, that we can’t directly confirm a hypothesis. What I think we have to do is, with each experimental result, try to determine how many alternative hypotheses it satisfies in addition to our own, and then try to generate new designs that will rule out those alternatives (or rule them in; “distinguish between” might be a less biased way of phrasing it). In my view, the more this is done, the more we begin to asymptotically approach confirmation of the hypothesis. It should often be difficult to tell how close we are though!
I see your point about the lack of utility in doing a simple hypothesis test with Bayesian methods. My interest in it is less the actual execution and utility, and more the point of whether its interpretation as “degree of belief” sheds any light on interpretation of the frequentist P values. I understand your point 2. I have comments on points 3 and 4:
3) I should have said that our confidence should asymptotically approach 100% rather than 95% with repeat studies yielding P=0.05, but somehow the obvious fact that pooling the data increases N eluded my exhausted brain while I was writing. Nevertheless, beginning with 95% hardly allows much room for confidence to increase. If confidence only increases from 95% to near 100%, this would indicate that the marginal utility of repeating the study even once is so small as to hardly justify spending money on it. But is that realistic? Why should neutrality be represented by a non-informative prior? Shouldn’t total neutrality, in the absence of any biases about plausibility, be represented by 50% confidence? It seems to me that our “degree of belief” should move from neutrality (50%) to certainty (100%) at a slow enough rate that a second repetition of a single study should have a major impact.
(I realize that equivalent studies yielding identical effect size and P value is unrealistic, but I’m just positing that to simplify the concepts and math.)
4) I agree with you about the effect of pooling the results given certain assumptions. My point, though, is that the P value of the individual study shouldn’t change, but our confidence in the study’s results should reflect the pooled results rather than the individual P value. Thus, my point that a non-informative Bayesian prior yields “degree of belief” results identical to the frequentist test has, at the absolute most, applicability only for single, unique studies.
Because of the points I raised in “3,” however, I question the use of a non-informative prior to collapse the Bayesian inference into one identical in numerical value with the frequentist inference, and thus I call into question my general conclusion in this post. That is, shouldn’t we, in the absence of plausibility-related biases, move steadily from neutrality (50%) to near certainty (~100%) with increasing repetitions of the same convincing results, where the marginal confidence-increasing utility of each repetition is quite large when repeating the study for the first time and is quite low after there are many repetitions? Beginning at 95% confidence with the first study seems to utterly preclude this principle from operating.
You have said that we should distinguish between epistemic confidence and Bayesian confidence. I’m interested in the former. My main interest in Bayesian methods here is that they purport, as I understand it, to quantify epistemic confidence. As I’ve said, I don’t think that it is particularly amenable to quantifying, and is subjective, but at the same time, I support efforts to loosely quantify it, with a recognition that they rely on subjective assumptions, because being able to attach an evidence-based weight to our confidence can be very useful for making decisions.
Thanks so much for your input!
Chris
Norman Yarvin says
If you have five studies which agree with each other, each with p=0.05, a meta-analysis would give a much more significant p value than p=0.05. This is easy to see: the probability that pure chance would yield the results of the first study is one in twenty, but the probability that pure chance would yield that result five times in a row is 1/20 to the fifth power. That is, one in 3.2 million, or p=0.0000003. (That math may not be quite right, but it’s something like that.) So there is no “maximum of 95%” that gets approached with multiple studies; it can be more than that. As the number of patients increases, the effect emerges clearly from the noise, assuming that the size of the effect is constant (at, say, 20% additional deaths). That is because the sum of the effect grows linearly with N, while the noise only grows as the square root of N (N being the number of patients).
Likewise, a study with p=0.50 doesn’t add much additional doubt to a study with p=0.05, if you stick the two into a meta-analysis. (The size of the effect will decrease, but its level of statistical significance won’t change all that much.) Still, to answer your question, yes, that does make a hash of the idea that one can take any single study and interpret its p value as the probability that the null hypothesis is false.
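To put rough numbers on this, here is a small Python sketch using Fisher’s method, one standard way of pooling independent p-values. It isn’t exactly the calculation above (multiplying the p-values directly overstates the combined significance), the five identical p=0.05 studies are purely hypothetical, and a real meta-analysis would normally pool effect sizes rather than just p-values.

```python
# Hypothetical illustration: combining five independent p = 0.05 results with
# Fisher's method, whose statistic -2 * sum(ln p_i) follows a chi-squared
# distribution with 2k degrees of freedom under the null hypothesis.
from math import log
from scipy.stats import chi2

p_values = [0.05] * 5                                # five invented studies
statistic = -2 * sum(log(p) for p in p_values)
combined_p = chi2.sf(statistic, df=2 * len(p_values))

print(f"Naive product:   {0.05 ** 5:.1e}")           # ~3.1e-07 (too optimistic)
print(f"Fisher combined: {combined_p:.1e}")          # ~8.7e-04, still far below 0.05
```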
You do need to retract, one way or another. Even if you weren’t trying to be a member of the scientific community, it’d be the right thing to do. Since you are, it’s de rigueur; scientific competition can be brutal. People who just follow the herd often get away with sloppiness, but people who say controversial stuff get severe scrutiny. Under such scrutiny, even having had to retract something is embarrassing, but it’s still a lot better than not having retracted it. And you do want to be able to say controversial stuff.
Even in a retraction, there’s nothing wrong with expressing a personal opinion that in this case, all things considered, the p value is about the same as the probability the null hypothesis is false. That is, if that does happen to be your opinion. People are allowed to have opinions; it’s only the attempt to pretend that an opinion is really some kind of objective truth that is frowned upon.
Chris Masterjohn says
Hi Norman,
Thanks for your comments! I think the ethics of retraction are pretty straightforward. The reason I am asking additional questions is to clarify what is correct here.
Somehow the obvious fact that N increases for pooled analyses eluded me when I wrote my last message. I apologize for that. In any case, going from 95% confidence to nearly 100% doesn’t leave much room for improvement. I guess what I’m getting at is that choosing a non-informative prior for a one-study simple Bayesian hypothesis test seems deeply problematic, because neutrality in the absence of any plausibility-based biases, for something unstudied, should be represented by 50% confidence, it seems to me. And it should take more than one study to budge that up to 95%, or else doing a second study has very little effect on “confidence” or “degree of belief” compared to the first.
Chris
Chris Wilson says
Hi Chris,
you wrote:
“I guess what I’m getting at is that choosing a non-informative prior for a one-study simple Bayesian hypothesis test seems deeply problematic, because neutrality in the absence of any plausibility-based biases, for something unstudied, should be represented by 50% confidence, it seems to me. And it should take more than one study to budge that up to 95%, or else doing a second study has very little effect on “confidence” or “degree of belief” compared to the first.”
I see what you’re trying to get at…I think! Yes, the question of what counts as a neutral prior is sometimes a vexed one. It depends on your “hypothesis space”, which depends on your statistical model.
If your parameters are continuous, then your priors for them cannot be discrete (i.e., P(H)=0.50); they have to be continuous probability densities. If you have a discrete parameter, then you partition the possibilities into hypotheses Hi and assign each a P(Hi) value.
In the discrete case, if you decide there are two and only two mutually exclusive hypotheses that completely partition the universe your data are sampled from, then you have to decide, based on knowledge, prior data, or intuition, whether to weight them equally (i.e., P=0.50 each) or not.
Having collected data, you then apply Bayes’ rule to calculate the posterior probabilities for the parameters (whether discrete or continuous). If the data are strong, it’s not unreasonable to go from a prior of 0.50 to a posterior of 0.95. In the continuous case, the data could considerably sharpen and shift a normal prior distribution from its initially diffuse, “flat” shape.
So, yes, if you have one study with really good, strong data, it will give you a correspondingly strong signal in the posteriors, which challenges the information value of follow-up studies.
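As a rough numerical illustration of the discrete case just described, here is a minimal Python sketch. The likelihood values are invented; they simply encode how probable the observed data would be under each hypothesis, with “strong” data heavily favoring H1.

```python
# Hypothetical sketch of a discrete two-hypothesis Bayes update.  The
# likelihood values are invented for illustration; "strong" data here means
# the observations are far more probable under H1 than under H0.

priors = {"H1 (effect)": 0.50, "H0 (no effect)": 0.50}       # neutral prior
likelihoods = {"H1 (effect)": 0.40, "H0 (no effect)": 0.02}  # invented strong data

unnormalized = {h: priors[h] * likelihoods[h] for h in priors}
evidence = sum(unnormalized.values())
posteriors = {h: unnormalized[h] / evidence for h in priors}

for hypothesis, p in posteriors.items():
    print(f"P({hypothesis} | data) = {p:.3f}")

# With these made-up numbers the 0.50 prior on H1 moves to roughly 0.95.
```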
Chris
Norman Yarvin says
Neutrality is tolerably represented by a 50% probability. But ignorance is not represented well by a 50% probability, nor by any other number for the probability. It’s a sort of pre-numerical primordial chaos that makes people want to turn to some god to sort things out.
Now, a lot of working statisticians are in the situation of having to make a decision: they have to choose some number, for political reasons. (By which I don’t just mean national politics; office politics could equally well demand it.) We aren’t in that situation; we can just admit ignorance.
A way to expound this would be to take the perspective that readers might start with any of several levels of suspicion of vegetable oils, and to tell them, for each, what their suspicion should move to, if they believe this study. As in: if you started with a 50% suspicion, it should move to 94%; or if you started with a 5% suspicion, it should move to 50%; or if you started with a 90% suspicion, this should move you to 99%. (Numbers made up, but they’re something like that.)
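A back-of-the-envelope Python version of that table might look like the following. The Bayes factor of 16 is an arbitrary value chosen only because it roughly reproduces the made-up numbers above; it is not derived from the study itself.

```python
# Illustrative only: for several starting levels of suspicion, where should a
# single study move you, if it carries a fixed Bayes factor?  The factor of 16
# is arbitrary, picked to roughly match the made-up numbers in the comment.

bayes_factor = 16.0

for prior in (0.05, 0.50, 0.90):
    prior_odds = prior / (1 - prior)
    posterior_odds = prior_odds * bayes_factor
    posterior = posterior_odds / (1 + posterior_odds)
    print(f"prior suspicion {prior:.0%} -> posterior roughly {posterior:.0%}")
```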
Anyway, I’m repeating myself somewhat, and I think we’ve covered all the bases by now.
Ned Kock says
Probabilistic assumptions aside, it seems to me that the analysis of models with mediating effects is well aligned with the spirit of Bayes’s theorem. One can even extend that to the analysis of models with multiple interaction and total effects:
http://youtu.be/D9m4K_fv2vI
Bayesian statistics is often discussed in the context of comparison-of-means studies, which can create confusion. This is analogous to discussions of Simpson’s paradox in the same type of context. In a structural model, of which a comparison-of-means model is a special case, Simpson’s paradox is much more easily described:
http://bit.ly/Km3G9p
Perhaps we should use structural models as a basis for the discussion of frequentist and Bayesian inferences. It may make things a lot easier to understand.
Richard David Feinman says
I wanted to ask something here, and, first, I declare that I do not know much about statistics. My question is about the statement “With multiple studies, a Bayesian approach is to make the posterior probability after interpreting each study the prior probability in interpreting the next.” Isn’t this exactly what you don’t do in a meta-analysis? It seems like your belief in the outcome of an experiment should incorporate the prior belief about the proposition to be tested. In other words, if your test of the effect of some dietary fat on disease is positive, shouldn’t it have to be very positive to overcome the six prior studies that found no effect? In a meta-analysis, you seem to assume that the prior evidence doesn’t count. So, for example, I see the Jakobsen and Mozaffarian studies on saturated fat as trying to add up zeroes to come up with a real number, as I described on my blog: http://wp.me/p16vK0-8t. In fact, I wonder how much a meta-analysis is good for beyond getting together studies with small n’s. So maybe it’s a matter of choice, but if a study is good, then confirming it is different from simply incorporating the data into your own. No?
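To illustrate the quoted statement with made-up numbers, here is a brief Python sketch of chaining updates so that each posterior becomes the next prior. The Bayes factors are invented (six roughly null studies followed by one strongly positive one), and independence between studies is assumed; the point is only that, chained this way, the earlier null results do weigh against the later positive one.

```python
# Invented illustration of "posterior becomes the next prior".  Bayes factors
# below stand in for six roughly null studies and one strongly positive one;
# independence between studies is assumed.

def posterior_from_odds(odds):
    """Convert odds in favor of an effect into a probability."""
    return odds / (1 + odds)

null_studies = [0.8, 1.1, 0.9, 1.0, 0.95, 0.9]   # evidence close to neutral
positive_study = 4.0                             # strong evidence for an effect

odds = 1.0                                       # neutral 50% starting prior
for bf in null_studies:
    odds *= bf                                   # each posterior feeds the next update
print(f"After six null studies:    {posterior_from_odds(odds):.2f}")

odds *= positive_study
print(f"After the positive study:  {posterior_from_odds(odds):.2f}")

print(f"Positive study on its own: {posterior_from_odds(positive_study):.2f}")
# Chaining the updates is the same arithmetic as pooling all the evidence at
# once, so the earlier null results do pull the final answer down.
```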
Rachel Ramey says
I just have to say that, while I find all of the details in this discussion fascinating, as someone who is NOT a statistician, but rather “just” a plain old citizen wanting to interpret scientific studies for practical purposes, I’ve found both of these articles immensely helpful. It really doesn’t matter to me, for my purposes, whether they ultimately define Bayesian or frequentist in technically precise ways – it matters that I can see and understand the difference between applying sheer numbers to an interpretation of results, and applying numbers PLUS PREVIOUSLY-KNOWN FACTS.
Which is all to say a) please continue 🙂 and b) at the same time, the articles have served their originally intended purpose, in my estimation, regardless of whatever conclusion y’all arrive at.
abdur rauf says
For Figure 4 of the paper in the link below, it is said that the Spearman correlation coefficient exceeds the 95% confidence level between about 20:00 UT and 06:00 UT. If possible, could you kindly tell me what is meant by “the Spearman correlation coefficient exceeds the 95% confidence level”? From the figure, you can see that for this time period the Spearman correlation coefficient plot clearly exceeds the 95% confidence level.
https://www.researchgate.net/publication/29624675_Are_variations_in_PMSE_intensity_affected_by_energetic_particle_precipitation
thanks