The GRIM Test: An Easy Way to Check Your Data Is Not Faulty

This post originally appeared in The Measurement Advisor.

Social science’s Replicability Crisis demonstrates that numerous problems can affect the ability of research to reach reliable conclusions. Brown and Heathers’ GRIM test is a new and elegant method to check research for certain data problems.

You use public relations research every day. You read about it, commission it, report on it, and rely on it to make decisions. But is that research trustworthy?

You, your staff, and everyone at your organization are well-trained and conscientious at reporting on and carrying out research. Of course you are. But, as we shall see below, even the very best peer-reviewed social science research includes a surprising number of errors and questionable findings. Odds are that the research that you use is problematic as well.

The good news is that there is an easy way to check a broad category of research data for problems. It’s a very clever, very simple technique called the GRIM test, and you’ll learn how to use it below. But first, here’s why you need it.

What If Published Research Is Wrong? The Replicability Crisis

You’ve probably heard of the Replicability Crisis. It’s the social science calamity that began a couple of years ago, sparked when a team of researchers attempted to replicate peer-reviewed studies published in the top psychological journals. More than half the time, their results differed from those of the original studies. Even among the studies that did replicate, the findings were significantly less impressive than what was originally published.

Ever since, the world of social science has been trying to tidy up its house, attempting to determine how and why Science Might Be Broken?! Is it because of fraudulent data? A faulty peer-review process? Flawed statistical methods? Or maybe a biased research environment? The answer turns out to be yes, to all. (Read more about it here.)

Why is this important for communications measurement? Well, if the research that passes the peer-review process of the gold-standard journals has problems, then typical communications measurement and public relations research probably has problems of some sort as well. And note that it’s not just one research review that has found flaws. For instance, in the initial testing phase for the GRIM test (see “Shocking Bonus!” below), about half of the published, peer-reviewed articles that were checked included at least one average that was mathematically impossible. Faulty research is much more common than you’d think.

One result of this brouhaha that is beneficial to all social science researchers is a renewed interest in ensuring the rigor of statistical and experimental methods. It turns out, for instance, that the sacred p<.05 level of statistical significance is often not the magic test we thought it was. There is a move afoot to change this to p<.005. Another result is increased vigilance for errors in research data. Enter Nick Brown, James Heathers, and their GRIM test.

The GRIM Test

Last year, Brown and Heathers discovered a new, simple, and elegant mathematical technique to check the validity of averages from experiments that collect integer data. It’s called the GRIM test: Granularity-Related Inconsistency of Means. Given a mean and the sample size from which that mean was calculated, the GRIM test will tell you if that mean is mathematically possible.

What’s amazing about this test is that:

It’s based on a simple and previously undiscovered mathematical principle, which is a rare thing nowadays; and
It can quickly and easily be applied to published research to verify the validity of the reported data averages.

Here’s a description from the Brown and Heathers (2016) abstract:

“We present a simple mathematical technique that we call GRIM (Granularity-Related Inconsistency of Means) for verifying the summary statistics of published research reports in psychology. This technique evaluates whether the reported means of integer data such as Likert-type scales are consistent with the given sample size and number of items…”

The Elegance of Averages

Many types of research, including that used in public relations, involve averaging whole number data. Whole number (that is, integer) data are variables like age (typically reported to the nearest year), or the number of tweets per day, or the answers to all those questions that start out “On a scale of 1 to 7…”

Brown and Heathers’ insight was that when you take the average of whole number data, the average can only be certain specific values. That is, because the data is not continuous, the possible averages of that data aren’t continuous, either.
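The arithmetic behind this insight fits in a line or two. The symbols are notation introduced here for illustration, not taken from Brown and Heathers: n is the sample size, x_i are the whole-number responses, T is their total, and d is the number of decimal places in the reported average.

```latex
\[
\bar{x} \;=\; \frac{1}{n}\sum_{i=1}^{n} x_i \;=\; \frac{T}{n},
\qquad T \in \mathbb{Z}
\]
\[
\text{Neighboring possible averages are exactly } \tfrac{1}{n} \text{ apart, so gaps are visible at }
d \text{ decimal places only when } \tfrac{1}{n} > 10^{-d}, \text{ that is, } n < 10^{d}.
\]
```

With averages reported to two decimal places, that means the gaps show up for samples of fewer than 100.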

Here’s a way to understand it. Suppose, as an extreme example, you had a sample of three subjects, whose ages range somewhere between 36 and 40 years of age. There are only so many different ways that 3 people can have ages in a 5-year range. So when you calculate the average of their ages, there are only so many possible values that the average can be.
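The three-subject example is small enough to enumerate by brute force. The following is a minimal sketch in Python, assuming ages are recorded as whole years from 36 through 40; the variable names are illustrative only.

```python
from fractions import Fraction
from itertools import combinations_with_replacement

ages = range(36, 41)   # whole-year ages 36 through 40, inclusive
n = 3                  # number of subjects in the sample

# Every way three people can have ages in this range (order does not matter),
# reduced to the set of distinct averages. Fractions keep the arithmetic exact.
possible_means = sorted(
    {Fraction(sum(group), n) for group in combinations_with_replacement(ages, n)}
)

print(len(possible_means))                            # 13
print([round(float(m), 2) for m in possible_means])   # each average to two decimals
```

Only 13 averages are possible, each a multiple of one third. A reported average of 36.50 for this sample, for example, could not have come from three whole-number ages.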

The Simple Test

The point is that, when the data is in integers, and the sample size is relatively small, there are a limited number of values that the average of that data can have. Therefore, a simple test of the validity of research data is to test the averages reported to determine if they are in the set of possible averages. (You can use Jordan Anaya’s online calculator, here.)
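For readers who prefer to see the check spelled out, here is a minimal sketch of the idea in Python. It is an illustration under stated assumptions, not Brown and Heathers’ or Jordan Anaya’s implementation: the function name grim_consistent is hypothetical, the reported average is passed as a string so its decimal places are preserved, each subject is assumed to contribute a single whole number, and either rounding direction at an exact half is accepted.

```python
from fractions import Fraction
from math import ceil, floor

def grim_consistent(reported_mean: str, n: int) -> bool:
    """Return True if some whole-number total over n subjects rounds to reported_mean."""
    mean = Fraction(reported_mean)                    # exact value, e.g. "5.19" -> 519/100
    decimals = len(reported_mean.partition(".")[2])   # decimal places as reported
    half_unit = Fraction(1, 2 * 10**decimals)         # half of the last reported digit

    # A whole-number total T is compatible when T / n falls within half a unit
    # of the reported average (boundaries included, so either rounding rule passes).
    lowest_total = ceil((mean - half_unit) * n)
    highest_total = floor((mean + half_unit) * n)
    return lowest_total <= highest_total

# Hypothetical reported values, for illustration only.
print(grim_consistent("36.33", 3))   # True: a total of 109 gives 36.33
print(grim_consistent("36.50", 3))   # False: no whole-number total gives 36.50
print(grim_consistent("5.19", 28))   # False: 145/28 is 5.18 and 146/28 is 5.21
```

Passing the average as a string is a deliberate choice: "36.50" and "36.5" carry different precision, and the check depends on how many decimal places were reported. For averages built from several whole-number items per subject, the sketch would need the sample size multiplied by the number of items, which this version does not handle.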

If the averages reported in the research could not have resulted from the reported sample size, then you know something is wrong. (Read more about the GRIM test in James Heathers’ posts here or here. He has also developed a conceptually similar test, called SPRITE, that uses means and standard deviations.)

Given a mean and the sample size from which that mean was calculated, the GRIM test will tell you if an average is possible, but it can’t confirm that it is correct, or if the data that went into it is real. The authors take great pains to indicate that this is not a test of fraud. There are several other ways in which impossible averages can occur, including various recording and computing mistakes, and typographical errors.

Shocking Bonus: More Faulty Research Discovered!

To test out their new technique, Brown and Heathers applied the GRIM test to a sample of recent articles in leading psychology journals. Out of 71 articles tested, about half contained at least one reported mean inconsistent with the reported sample sizes, and more than 20% contained multiple such inconsistencies.

So, again, a surprising number of papers published in big-time, gold-standard, peer-reviewed journals are faulty. Does that affect your confidence in the quality of the research that you use?

So, Is This a Big Deal?

Well, yes and no… mostly yes. As far as the GRIM test goes, journal editors now have a quick and easy way to test one aspect of submitted research. There is no doubt that the test can and does reduce errors in published work. Will it reduce fraud? Probably not much, because, now that the cat is out of the bag, would-be fraudsters know what they need to do to doctor their data.

What is really a big deal is the surprising amount of faulty research that gets published. It should make you think twice about ensuring the quality of the research you use.

The good news here is how social science is responding to the Replicability Crisis. It’s making an effort to take care of its problems. You’d be wise to check the methods and analysis that go into the research that you use, as well.

Bill Paarlberg is the Editor of The Measurement Advisor. He has been editing and writing about measurement for over 20 years.