So much of statistics is based on the concept of a random sample. We learn this concept early in our training. A typical definition is given on page 139 of Statistics (5th edition) by McClave and Dietrich: “If n elements are selected from a population in such a way that every set of n elements has an equal probability of being selected, the n elements are said to be a random sample.” We then learn about methods to produce random samples: draw names out of a hat, use a random number generator, flip a coin, etc.
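The “random number generator” version of this procedure can be sketched in a few lines. This is a minimal illustration using Python’s standard library and a hypothetical roster of names; `random.sample` selects n elements so that every possible set of n elements is equally likely, which is exactly the textbook definition above.

```python
import random

# A small hypothetical "population" of students.
population = ["Ana", "Ben", "Cara", "Dev", "Eli", "Fay", "Gus", "Hana"]

# random.sample draws n distinct elements such that every
# size-n subset of the population is equally probable.
n = 3
sample = random.sample(population, n)
print(sample)
```

Running this repeatedly produces different subsets, each with equal probability, just like drawing names out of a hat.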
Later, we are told that our methods for conducting hypothesis tests are based on random samples. For example, McClave and Dietrich write on page 347 that a one-sample t-test is based on the assumption that “a random sample is selected from a population with a relative frequency distribution that is approximately normal.”
I dutifully learned the material, passed my exams, and merrily applied the methods I had learned in my working life.
When I began working at Minneapolis Community and Technical College (MCTC), however, I began to feel uneasy about using the methods I had learned. Why? Because we had data on every student who had ever attended MCTC. If we wanted to determine whether there were differences between two groups of students (e.g., males vs. females, or first-generation college attendees vs. those whose parents had attended college) we didn’t need to generate a random sample. We just extracted the data from the database. There was no sampling variability.
I found myself thinking, “If we have information on the entire population of students, and we find that males have a grade-point average (GPA) of 2.911 while females have a GPA of 2.912, does it make sense to test whether the population means are different? We know they are different, since we have complete knowledge. So why conduct a test to determine whether the difference is statistically significant?”
I had similar qualms while working at The Improve Group on a project that involved data on all of the students in a particular grade in a school district.
When I went back to graduate school to obtain my master’s degree in Statistics from the University of Minnesota, I had an opportunity to ask my professors about this. The first one I asked was a post-doc who was the instructor for my second-semester theory course. She seemed stumped by the question and said she’d think about it and get back to me. She never did.
I next asked a professor whose course I had taken the previous semester. He gave me an answer that, in retrospect, makes sense, but I did not see it at the time.
I felt that it would be a wasted opportunity if I completed the master’s program without getting a satisfactory answer to my question: “Does it make sense to conduct hypothesis tests when we have population-level data?” I asked Adam Rothman, who was my professor for STAT 5701, Statistical Computing (and who supervised my Plan B project).
In Statistical Computing, Adam repeatedly used the phrase “realization of a random variable.” The way Adam explained it to me, a “random sample of size n” could be thought of as n independent realizations of a random variable. Of course, this was all in his lecture notes from the first day of class (September 5, 2014):
- A random variable is a numerical measurement of the outcome of an experiment which has yet to be performed
- A realization of a random variable is the value the random variable took on after the experiment was performed
but of course I didn’t get it at first. (A little slow on the uptake, I was.)
In this view, a student’s GPA is a random variable, whose value depends on the outcome of the “experiment” (i.e., the student taking classes and getting grades). Once the experiment is performed, each individual student’s GPA is a realization of that random variable. The collection of all of these realizations of the random variable is, in fact, a random sample. True, there is no sampling variability, but there is variability nonetheless (not everyone has the same GPA). Moreover, if we were to repeat the experiment again (say, with different students) we would get a different set of values. But as long as the realizations are drawn from the same distribution, it doesn’t matter that these values are different.
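The “repeat the experiment” idea can be made concrete with a small simulation. This is a sketch, not real data: it assumes (purely for illustration) that GPAs are draws from a normal distribution clamped to the 0.0–4.0 scale, and it shows that two runs of the same experiment yield different realizations from the same underlying distribution.

```python
import random

random.seed(1)  # for reproducibility of this sketch

def run_experiment(n_students, mean=2.9, sd=0.5):
    """Simulate one cohort: n independent realizations of a
    hypothetical GPA distribution, clamped to the 0.0-4.0 scale."""
    return [min(4.0, max(0.0, random.gauss(mean, sd)))
            for _ in range(n_students)]

cohort_a = run_experiment(5)
cohort_b = run_experiment(5)  # same distribution, different realizations

print(cohort_a)
print(cohort_b)
```

Before `run_experiment` is called, each student’s GPA is a random variable; after the call, the list holds its realizations. The two cohorts differ in their particular values, but not in the distribution that generated them.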
In my master’s program we were using DeGroot and Schervish’s Probability and Statistics (4th edition) in our theory course. Once again, the definition of “random sample” is given early in the text (p. 158), and by the time we got to talking about hypothesis testing based on random samples I had just assumed that they were using a definition for “random sample” similar to McClave and Dietrich’s. But it turns out that they weren’t (p. 158):
“Consider a probability distribution on the real line that can be represented by either a p.f. [probability function] or a p.d.f. [probability density function] f. It is said that n random variables X_1, …, X_n form a random sample from this distribution if these random variables are independent and the marginal p.f. or p.d.f. of each of them is f.”
In other words, there is no notion of “sampling from a population”: no drawing cards from a deck or names from a hat. But I hadn’t caught on to this the first time around.
So now I feel comfortable performing hypothesis tests even with complete data on a population.
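Under this view, a two-group comparison on complete population data is still a legitimate t-test: the two groups’ values are treated as independent realizations from two distributions, and the test asks whether those distributions have the same mean. The sketch below uses only the standard library and invented, simulated “population” data; the Welch t statistic is computed by hand rather than with a statistics package.

```python
import math
import random
from statistics import mean, variance

random.seed(42)

# Hypothetical complete "population" data: a GPA for every student
# in each of two groups (simulated, for illustration only).
group1 = [random.gauss(2.91, 0.4) for _ in range(200)]
group2 = [random.gauss(2.91, 0.4) for _ in range(180)]

def welch_t(x, y):
    """Welch's two-sample t statistic (no equal-variance assumption).
    variance() uses the n-1 divisor, i.e. the sample variance."""
    se = math.sqrt(variance(x) / len(x) + variance(y) / len(y))
    return (mean(x) - mean(y)) / se

t = welch_t(group1, group2)
print(round(t, 3))
```

Even though we “know” the two group means differ in the observed data, the test addresses a different question: whether the underlying distributions that generated those realizations differ.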