### The importance of tracking identities when doing a pre-test / post-test design

I was recently doing a project for a client of mine which sought to judge the effectiveness of a training program. Participants in the program were asked to complete a questionnaire at the beginning of the program and to complete the same questionnaire after the program was completed.

This is a straightforward pre-test / post-test design. If the training program was effective, there should be a noticeable change in participants’ responses to the questionnaire items before the training and after it.

The questionnaires were conducted in Survey Monkey by the client’s predecessor. The client asked me to help analyze the results. I asked the usual questions, and then asked whether it was possible to compare each individual’s responses on the pre-test and the post-test. The answer was, “No,” which limited the types of analyses that could be done.

To see how a statistician could have helped improve the outcome, read on.

Let’s say that the program has two participants, you and me. At the beginning of the training (which involves the history of anchovy fishermen), we’re asked whether we like anchovies on pizza. You say “yes” and I say “no.” At the end of the training, we are once again asked whether we like anchovies on pizza.

Scenario 1: We don’t change our minds at all.

Scenario 2: We change our minds completely: you say “no” and I say “yes.”

In either scenario, we have 1 Yes and 1 No, both pre-test and post-test.

If we did not keep track of our individual responses, all we can say is that we got the same results pre-test and post-test. But obviously, scenarios 1 and 2 are quite different outcomes.
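To make the two-person example concrete, here is a minimal sketch in Python (not part of the original Stata analysis). The record layout and the function names are hypothetical, chosen only for illustration.

```python
from collections import Counter

# Hypothetical two-person data: (participant, pre-test answer, post-test answer).
scenario_1 = [("you", "yes", "yes"), ("me", "no", "no")]  # nobody changes
scenario_2 = [("you", "yes", "no"), ("me", "no", "yes")]  # both change

def aggregate_counts(records):
    """What survives when IDs are discarded: answer tallies for each wave."""
    pre = Counter(pre for _, pre, _ in records)
    post = Counter(post for _, _, post in records)
    return pre, post

def changed_minds(records):
    """Only computable with IDs: how many people gave a different answer."""
    return sum(1 for _, pre, post in records if pre != post)

# The tallies cannot tell the scenarios apart...
print(aggregate_counts(scenario_1) == aggregate_counts(scenario_2))  # True
# ...but the paired records can.
print(changed_minds(scenario_1), changed_minds(scenario_2))  # 0 2
```

The tallies are identical in both scenarios, while the count of people who switched goes from 0 to 2. That count is exactly the information lost when IDs are not recorded.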

If a statistician had been consulted at the beginning of the process (that is, when the pre-test / post-test design was being developed), he or she could have advised the client about the need to keep track of individual participants’ IDs, and we would have been able to distinguish between the two scenarios.

So, even though you can use Survey Monkey, you can still benefit from the expertise of a statistician.

#### OK, time to geek out on the stats.

I have contrived an example using Stata where there are 100 participants. On the pre-test, 55 answered “No” to a question and 45 said “Yes.” On the post-test, 45 said “No” and 55 said “Yes.”

Let’s suppose that individual IDs were not tracked, so all we can do is compare the 55/45 pre-test split to the 45/55 post-test split. The cross-tabulation looks like this:

               |       response
      Pre_Post |        No        Yes |     Total
    -----------+----------------------+----------
           Pre |        55         45 |       100
          Post |        45         55 |       100
    -----------+----------------------+----------
         Total |       100        100 |       200

Note that we have 200 observations, 100 pre-test and 100 post-test, even though we had only 100 participants in the program.

With a 2 x 2 table and enough observations, the standard approach is to do a Chi-squared test. We generate the table above and obtain the results of the test using Stata’s “tabulate” command, abbreviated here as “tab”:

    . tab Pre_Post response, chi2

    (table above)

              Pearson chi2(1) =   2.0000   Pr = 0.157

With a p-value of 0.157, at the 5% significance level we conclude that there is insufficient evidence to reject the null hypothesis of “no change pre- and post-test.”
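For readers who want to see where the 2.0000 and 0.157 come from, the following is a sketch in Python of the Pearson chi-squared calculation for a 2 x 2 table, without a continuity correction. It is a cross-check written for this post, not Stata’s own implementation, and the function name is an assumption of mine.

```python
import math

def pearson_chi2_2x2(table):
    """Pearson chi-squared statistic and p-value for a 2 x 2 table (1 df)."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    n = sum(row_totals)
    chi2 = 0.0
    for i in range(2):
        for j in range(2):
            # Expected count under independence of wave and response.
            expected = row_totals[i] * col_totals[j] / n
            chi2 += (table[i][j] - expected) ** 2 / expected
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# Rows: Pre, Post. Columns: No, Yes.
unpaired = [[55, 45], [45, 55]]
chi2, p = pearson_chi2_2x2(unpaired)
print(f"chi2 = {chi2:.4f}, p = {p:.3f}")  # chi2 = 2.0000, p = 0.157
```

Every expected count is 50 here, so each cell contributes 25/50 = 0.5 to the statistic.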

That is the situation my client was in.

Now let’s suppose that the client’s predecessor had talked to me, and I convinced him or her of the importance of keeping track of individuals’ responses. In this case, after the data are collected, we can use McNemar’s test, which is invoked in Stata for 2 x 2 tables using the “symmetry” command.

    . symmetry pre_1 post_1

    -------------------------------
              |       post_1
        pre_1 |  No    Yes    Total
    ----------+--------------------
           No |  30     25       55
          Yes |  15     30       45
              |
        Total |  45     55      100
    -------------------------------

                                             chi2     df   Prob>chi2
    ------------------------------------------------------------------------
    Symmetry (asymptotic)                  |    2.50     1     0.1138
    Marginal homogeneity (Stuart-Maxwell)  |    2.50     1     0.1138
    ------------------------------------------------------------------------

Here’s how to read it: First, note that now we have only 100 observations, because we have 100 *pairs* of observations (that is, 100 individuals, and their responses on the pre-test and post-test). In the rightmost, Total column, we see that on the pre-test 55 responded “No” and 45 “Yes.”

Looking at the bottom row we see that 45 said “No” and 55 “Yes” on the post-test.

How do we read the values in the individual cells? Well, of the 55 people who said “No” on the pre-test, 30 said “No” and 25 said “Yes” on the post-test. In other words, 25 people changed their minds. Similarly, of the 45 who said “Yes” on the pre-test, 15 changed their minds on the post-test, while 30 did not.

Finally, we see the results of the McNemar test. There are 2 versions of the p-value given, but they are both 0.1138, so at the 5% significance level we conclude that there is insufficient evidence to reject the null hypothesis of no change.
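The McNemar statistic uses only the two cells where people changed their answers: it is the squared difference of those two counts divided by their sum. Here is a sketch in Python of that asymptotic version (no continuity correction); the function and argument names are my own, not Stata’s.

```python
import math

def mcnemar(no_to_yes, yes_to_no):
    """Asymptotic McNemar chi-squared (1 df) from the two off-diagonal counts.

    Undefined when nobody changed their answer (both counts zero).
    """
    chi2 = (no_to_yes - yes_to_no) ** 2 / (no_to_yes + yes_to_no)
    # For 1 degree of freedom, the upper-tail probability is erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value

# 25 people moved from "No" to "Yes"; 15 moved from "Yes" to "No".
chi2, p = mcnemar(no_to_yes=25, yes_to_no=15)
print(f"chi2 = {chi2:.2f}, p = {p:.4f}")  # chi2 = 2.50, p = 0.1138
```

The 60 people who gave the same answer both times do not enter the calculation at all.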

Here’s a different scenario.

    . symmetry pre_2 post_2

    -------------------------------
              |       post_2
        pre_2 |  No    Yes    Total
    ----------+--------------------
           No |  45     10       55
          Yes |   0     45       45
              |
        Total |  45     55      100
    -------------------------------

                                             chi2     df   Prob>chi2
    ------------------------------------------------------------------------
    Symmetry (asymptotic)                  |   10.00     1     0.0016
    Marginal homogeneity (Stuart-Maxwell)  |   10.00     1     0.0016
    ------------------------------------------------------------------------

We have the same number of “Yes” and “No” responses as in the previous example, but here the p-value is 0.0016, leading us to reject the null hypothesis at the 5% significance level.

So we see that we can come to different conclusions regarding the effectiveness of the training when we track individuals’ responses.

#### More geeking out

In the first example, 60 respondents (30 No/No + 30 Yes/Yes) did not change their minds, but in the second, 90 did not. Why do we reject the null hypothesis in the second case, when fewer people changed their minds?

To answer this, look at the “off-diagonal” numbers. In the first example, we have 25 “No/Yes” responses and 15 “Yes/No.” In the second example, we have 10 “No/Yes” and 0 “Yes/No.”

In the first example, whether someone changes her response looks like it could be somewhat random, that is, the off-diagonal elements could just be “noise” as a result of people not being very consistent in how they respond.

In the second example, there is a clear direction to the change: The only movement was from “No” in the pre-test to “Yes” in the post-test.

As the name of the Stata “symmetry” command implies, what the test is really doing is checking how symmetric the table is. Are the off-diagonal elements the same? In the first case, they’re not exactly the same, but they are close. In the second case the off-diagonal elements are clearly not the same, so we get a significant p-value.
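The symmetry idea can be checked side by side for the two paired tables. The sketch below, in Python, confirms that the row and column totals are identical and that only the off-diagonal cells drive the statistic; the helper names are hypothetical.

```python
# Rows: pre-test No, Yes. Columns: post-test No, Yes.
table_1 = [[30, 25], [15, 30]]
table_2 = [[45, 10], [0, 45]]

def margins(table):
    """Row totals (pre-test) and column totals (post-test)."""
    rows = [sum(r) for r in table]
    cols = [sum(c) for c in zip(*table)]
    return rows, cols

def asymmetry(table):
    """McNemar chi-squared: squared off-diagonal difference over their sum."""
    no_to_yes, yes_to_no = table[0][1], table[1][0]
    return (no_to_yes - yes_to_no) ** 2 / (no_to_yes + yes_to_no)

print(margins(table_1) == margins(table_2))    # True
print(asymmetry(table_1), asymmetry(table_2))  # 2.5 10.0
```

Because the totals match, the net shift toward “Yes” is 10 people in both tables. In the first it is 25 switchers against 15 going the other way (40 in all); in the second it is 10 switchers against none, which is why the same net shift yields a much larger statistic.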