*Scientists reported today the results of a statistical study that proves that women can never be president of the United States. Based on a sample of 43 persons who have been president, the scientists observed that none have been female. “This result is highly statistically significant,” a scientist said in an interview. “Our p-value is < .001.”*

There has been a lot of focus over the past year on *p*-values and their role in hypothesis testing. One psychology journal has banned their use in its publications. When reporting on this issue, even *Nature* gave an erroneous interpretation and subsequently issued a correction.

In an attempt to clarify things, the American Statistical Association issued a statement on *p*-values, which is excellent reading for anyone interested in pursuing the topic further.

What is the fuss about? It has to do with how “hypothesis tests” are conducted and interpreted.

The process starts with a researcher having a hypothesis: cigarette smoking causes lung cancer, women face gender discrimination in the salaries they earn, the Cubs are destined to win the World Series, etc.

The researcher then assumes the *opposite* of her hypothesis: cigarette smoking does *not* cause lung cancer, women *are* treated fairly in the workplace, the Cubs will *never* win the World Series, etc. This is called the “null hypothesis.” The original hypothesis is called the “alternative hypothesis.”

Next, the researcher collects data and performs a statistical test to see how consistent these data are with the null hypothesis. If the data and the null hypothesis are not consistent, the researcher “rejects the null hypothesis.” If they are consistent, the researcher “does not reject the null hypothesis.”

A *p*-value is a measure of the degree to which the observed data are consistent with the null hypothesis. To put it another way, “If the null hypothesis were true, how likely would we be to obtain the result we did, or a result more extreme?” The smaller the *p*-value, the more unlikely this outcome would be.
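To make the definition concrete, here is a toy illustration (my example, not from the statement above): suppose the null hypothesis is that a coin is fair, and we observe 9 heads in 10 flips. The one-sided *p*-value is the probability, under the null, of a result at least that extreme.

```python
from math import comb

# Null hypothesis: the coin is fair (p = 0.5).
# Observation: 9 heads in 10 flips.
# One-sided p-value: probability under the null of a result
# at least this extreme, i.e. 9 or 10 heads.
n, p = 10, 0.5
p_value = sum(comb(n, k) * p**k * (1 - p)**(n - k) for k in range(9, n + 1))
print(round(p_value, 4))  # 11/1024, about 0.0107
```

A small number like 0.0107 says only that this outcome would be unlikely *if the coin were fair*; it does not, by itself, say what the truth is.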

Here’s where the problem arises: typically, the alternative hypothesis is what the researcher believes is true. The researcher really hopes that the null hypothesis can be rejected. If the results indicate that it can be, the researcher may feel that her understanding of reality has been validated.

What the researcher is forgetting, or does not understand, is that “reject the null hypothesis” does not mean “the alternative hypothesis is true.”

Back to the “study” that “proved” that women can’t be president of the U.S.

Let’s say your claim is, “Women can never be president of the United States.” You set up your null hypothesis as, “Women *can* be president of the United States.” You then observe that, out of the 43 individuals who have been president, none have been female. You conclude that “the observed data are inconsistent with the null hypothesis.” In other words, if women can be president, it would be highly unlikely to have 43 male presidents.

Can you then claim, “We have proved our hypothesis; women can never be president of the United States”?

No, you can’t.

Perhaps the observed data can be explained in other ways. For example, women were excluded from political life until the 19th Amendment to the U.S. Constitution was ratified in 1920 and have faced discrimination since then.

Or maybe you weren’t using the right statistical test. Perhaps you were assuming that, under your null hypothesis, women and men had an equal chance of getting elected, like flipping a coin. If this were your assumption, then, yes, getting 43 males would be highly unlikely.

But suppose instead that the null hypothesis was, “Women can be president, although the chance is very small.” Let’s say the chance is 1%. Then in fact there is about a 65% chance of observing 43 males and 0 females, since 0.99^43 ≈ 0.65 (by the binomial distribution).
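The two nulls can be compared in a few lines. This is just a sketch of the article’s arithmetic, treating each of the 43 presidencies as an independent election with a fixed chance of a woman winning (an assumption made purely for illustration):

```python
# Probability of observing 43 male presidents in a row under two
# different null hypotheses about a woman's chance of winning any
# given election.
p_equal = 0.5   # null 1: women and men equally likely to win
p_small = 0.01  # null 2: women can win, but only 1% of the time

prob_equal = (1 - p_equal) ** 43  # chance of 43 males under the coin-flip null
prob_small = (1 - p_small) ** 43  # chance of 43 males under the 1% null

print(prob_equal)  # about 1.1e-13: "highly unlikely"
print(prob_small)  # about 0.65: quite likely
```

The same data that look damning against one null are entirely unremarkable against the other, which is the article’s point: the conclusion hinges on how the null was set up.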

It’s pretty obvious in hindsight that you had some poorly conceived hypotheses.

There is always the danger that we want so much to “prove” our hypothesis that we’re unaware of the biases we build into the null hypothesis. As a result, we might make it too easy to reject the null.

There are a number of reasons for observing a small *p*-value. One of them is that, in fact, the null hypothesis is not the correct explanation of reality. But others can be:

- As above, the hypotheses are poorly constructed.
- Although the results are unlikely, every once in a while unlikely results show up, and this just happens to be one of those times.
- The number of observations we have is so large that we’re almost guaranteed to have a “significant” result.
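The last point can be sketched with hypothetical numbers (mine, not the article’s): flip a coin a million times, observe a heads proportion of 0.501, and a standard one-sample proportion test already calls the deviation “significant,” even though a one-part-in-a-thousand tilt is negligible for most practical purposes.

```python
from math import sqrt, erfc

# Hypothetical illustration: test H0 "the coin is fair" with a huge sample.
n = 1_000_000   # number of flips
p0 = 0.5        # proportion of heads under the null
p_hat = 0.501   # observed proportion: a tiny deviation

# Normal-approximation z statistic for a one-sample proportion test.
z = (p_hat - p0) / sqrt(p0 * (1 - p0) / n)
p_value = erfc(z / sqrt(2))  # two-sided p-value

print(round(z, 2))        # 2.0
print(round(p_value, 3))  # about 0.046: below the usual 0.05 cutoff
```

With enough observations, even trivial departures from the null produce small *p*-values, which is why statistical significance is not the same thing as practical importance.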

As the ASA statement makes clear, *p*-values are just one piece of evidence in the overall story. Many other factors should be considered before reaching final conclusions.