Joe Hilgard writes:
Some years ago, you blogged about a research article by Hasan and colleagues (2013).
I had tried to direct your attention to the narrowness of the error bars, which I found suspicious. What I was really trying to say was that the effect size was much, much too big — by day 3, it is 3.5 standard deviations, or an R^2 of 78%.
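As a sanity check on that claim, the standard equal-group-size approximation converts a standardized mean difference d into variance explained via r = d / sqrt(d² + 4). Plugging in d = 3.5 gives roughly 75%, in the same ballpark as the 78% quoted (the exact figure depends on group sizes and which conversion formula is used):

```python
import math

def d_to_r_squared(d: float) -> float:
    """Approximate variance explained (r^2) for a standardized mean
    difference d, assuming two equal-sized groups: r = d / sqrt(d^2 + 4)."""
    r = d / math.sqrt(d ** 2 + 4)
    return r ** 2

# d = 3.5 implies the treatment explains about three quarters of the
# variance in the outcome, an almost unheard-of figure in social psychology.
print(round(d_to_r_squared(3.5), 2))  # ~0.75
```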
I’ve finally managed to publish an article pointing out how implausibly massive that effect size is. I find that a much stronger manipulation yields a large effect size that is still only half as large as the original authors’ effect. It seems that there’s some kind of serious error in the original research.
Hilgard’s paper is called “Maximal positive controls: A method for estimating the largest plausible effect size.” Also, one of the authors of the earlier paper is Brad “Voodoo” Bushman. I wouldn’t trust anything Brad Bushman writes. In that earlier paper, Bushman and his colleagues characterized a three-day study as “long term.” So, beyond all the problems with statistics and data errors, we’re talking about people who don’t even accurately describe their own research claims.
Getting back to Hilgard’s general point: I find it frustrating when researchers don’t think about their effect sizes. This came up in some of our posts years ago on the claim about ovulation and voting (that women’s vote preferences change by 20 percentage points during different times of the month) and ovulation and clothing (that women were three times more likely to wear certain colors during certain times of the month). These are (a) ridiculously large effect sizes, (b) even more so given that everything is measured with error, which should cause attenuation of effect size estimates, and (c) very obviously explainable as the result of random chance. But the researchers in question would just not let go. On the plus side, all this discussion motivated a better understanding of experimental design, so future researchers can be helped even if some practitioners of old methods refuse to change their ways.
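Point (b) above is worth a quick illustration. Here is a toy simulation (all numbers hypothetical) showing that when both variables are measured with error, the estimated correlation shrinks toward zero, which makes the enormous published effects even harder to believe:

```python
import random

random.seed(0)
n = 100_000
true_slope = 0.5  # hypothetical true effect of X on Y

def corr(xs, ys):
    """Pearson correlation, computed from scratch."""
    mx = sum(xs) / len(xs)
    my = sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs)
    vy = sum((y - my) ** 2 for y in ys)
    return cov / (vx * vy) ** 0.5

x = [random.gauss(0, 1) for _ in range(n)]
y = [true_slope * xi + random.gauss(0, 1) for xi in x]

# Now observe both variables with added measurement noise:
x_obs = [xi + random.gauss(0, 1) for xi in x]
y_obs = [yi + random.gauss(0, 1) for yi in y]

print(round(corr(x, y), 2))        # correlation under perfect measurement
print(round(corr(x_obs, y_obs), 2))  # attenuated: noise pulls it toward zero
```

So noisy measurement should push estimated effects down, not up; seeing an effect of 3.5 standard deviations survive ordinary measurement error is one more reason for suspicion.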
P.S. Hilgard adds:
I think that the post underplays what is potentially most scandalous here. It’s one thing to publish just-significant results that capitalize on sampling error—that might get someone as far as d=0.8, p=.040. This seems to be a different beast, though—how does this study find an effect nearly twice the size of “men are taller than women”? How did they find similarly massive effects in their 2012 JESP paper and 2014 conference presentation? If these results are accurate, they are some of the strongest results in all of social psychology. I think there is something wrong.
Yeah, I’m reminded of the painful email exchange I had with a psychology professor who was angry that I criticized an ovulation-and-clothing study. I kept asking him whether he really believed that women were three times more likely to wear red or pink during certain times of the month. He kept not answering. I think he recognized at some level that the effect size was ridiculous, but he didn’t care, because the purpose of the study was not to estimate an effect size but rather to demonstrate the existence of an effect. And, from that perspective, a huge effect size estimate was good because it represented super-strong evidence! It seems reasonable—even though it isn’t!—to think that p=0.03 with a huge effect size estimate is better than p=0.03 with a small effect size estimate. Actually, it’s not. The huge effect size estimate is just a signal that the estimate is really noisy, and we’re in kangaroo territory.
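The "huge estimate is a signal of noise" point can be seen in a short simulation of what Gelman and Carlin call type M (magnitude) error: when the true effect is small and the study is noisy, any estimate that clears the significance bar must be a large overestimate. The sample sizes and true effect below are made up for illustration:

```python
import math
import random
import statistics

random.seed(1)
true_d = 0.1        # hypothetical small true effect
n_per_group = 20    # small, noisy study
sims = 2000
significant_estimates = []

for _ in range(sims):
    a = [random.gauss(0, 1) for _ in range(n_per_group)]
    b = [random.gauss(true_d, 1) for _ in range(n_per_group)]
    diff = statistics.mean(b) - statistics.mean(a)
    se = math.sqrt(statistics.variance(a) / n_per_group
                   + statistics.variance(b) / n_per_group)
    if abs(diff / se) > 1.96:  # "statistically significant"
        significant_estimates.append(abs(diff))

# Conditioning on significance, the average estimate is several times the
# true effect of 0.1: significance filtering manufactures big effect sizes.
print(round(statistics.mean(significant_estimates), 2))
```

This is why p = 0.03 with an implausibly huge estimate is worse news, not better: in a noisy study, the only way to reach significance is to be way off.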