Last week, I wrote a post about sleep training and stress, in which I argued that everything we know about stress suggests that sleep training is not harmful.
In response, some people objected that sleep-trained babies continue to experience elevated cortisol and significant distress even after they have stopped crying. In their view, sleep training teaches babies that crying does not help. They haven’t learned to self-soothe or to fall asleep on their own; they’ve simply given up.
What a heartbreaking thought. And one that surely strikes fear in the hearts of many parents.
So it’s important to realize that this claim comes from a single small and deeply flawed study of 25 babies, led by Wendy Middlemiss, a researcher at the University of North Texas’s College of Education.
Here is a typical example of how her study is described in the popular press:
“The researchers found high levels of cortisol, a stress hormone, in both the mothers and the babies during the times the babies were crying. After several days, the babies learned to go to sleep without crying. Researchers found that during these quiet nights, the mothers no longer had high cortisol levels but the babies’ cortisol levels remained high. They had merely learned to remain quiet while distressed.
The researchers noted that this was the first time the mothers and babies had not been in sync emotionally. The mothers no longer had high stress levels, not realizing that their babies were still just as upset.”
I can see how many parents would read something like this and swear off sleep training.
Here’s the truth, though: Nothing in her study supports these claims.
To see why, let’s start by briefly reviewing her study’s design. Middlemiss studied 25 mother-infant pairs who were enrolled in a sleep training program at a local hospital. The babies ranged in age from 4 to 10 months.
Mothers spent the day at the hospital with their infants, helped prepare them for sleep at naptime and bedtime, and then retreated to a hallway outside the room where they could hear them, but their infants could no longer see them. The nurses put the infants down in their cribs and let them cry, without soothing, until they fell asleep. This process was repeated for 4 days.
On the 1st and 3rd nights of this program, Middlemiss measured the babies’ and mothers’ levels of cortisol, a hormonal marker of stress. She tested their cortisol levels once right before bedtime and then again 20 minutes after the babies had fallen asleep.
So what’s wrong with her study? Well, a lot.
(1) The study lacked a control group. Without a group of control babies who were put to sleep by nurses in the hospital, but who did not experience sleep training, we cannot say whether sleep training affected infants’ cortisol levels, or whether something else about the program, like being put to sleep in an unfamiliar room in a hospital or being put to sleep by a stranger, affected infants’ cortisol levels.
(2) She does not analyze her data correctly. Let’s take her findings one at a time.
Claim #1: Babies’ cortisol levels remained “high” before and after falling asleep on the first and third nights of sleep training, while mothers’ levels dropped after their babies had fallen asleep on the third night.
Problem #1. Middlemiss does not report a baseline cortisol level for the babies. We do not actually know whether infants’ cortisol levels were “high” or “low” or “normal”. She calls them high. But we have no way to verify that; all we can see is that they stayed constant throughout the study.
When Melinda Wenner Moyer, a reporter for Slate, asked why she calls their cortisol levels high, Middlemiss responded that she had also assessed the babies’ cortisol levels at home, and that those levels were lower than at the hospital. But she never reports these baseline levels in her paper.
To anyone who has published a scientific paper, this is a baffling response. If you are drawing conclusions based on data you collected, you report that data. That’s the way it works.
Problem #2. She uses the wrong statistical analyses to compare before and after cortisol levels. To me, this is the most egregious problem with her research.
In stats-speak, Middlemiss tested whether the group means before and after sleep training were significantly different from one another. This is the wrong analysis. She should have used a repeated measures analysis, which compares each individual with himself or herself. Simply comparing group means is incorrect. It is also considerably less powerful (it fails to take into account that you already know something about the individuals the second time around) and thus more likely to lead to a false conclusion of no change.
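To make the power difference concrete, here is a minimal sketch comparing the two approaches on the same numbers. The readings are hypothetical and chosen only for illustration (they are not data from the study), and the function names are my own. Every individual goes up a little, but individuals differ a lot from one another.

```python
import math
import statistics

# Hypothetical readings for 8 individuals (NOT data from the study).
before = [10, 20, 30, 40, 50, 60, 70, 80]
changes = [2, 3, 1, 4, 2, 3, 1, 4]  # every individual increases slightly
after = [b + c for b, c in zip(before, changes)]


def unpaired_t(x, y):
    """t statistic that treats x and y as two unrelated groups (Welch form)."""
    se = math.sqrt(statistics.variance(x) / len(x) + statistics.variance(y) / len(y))
    return (statistics.mean(y) - statistics.mean(x)) / se


def paired_t(x, y):
    """Repeated measures (paired) t statistic, computed on within-person changes."""
    diffs = [b - a for a, b in zip(x, y)]
    se = statistics.stdev(diffs) / math.sqrt(len(diffs))
    return statistics.mean(diffs) / se


t_unpaired = unpaired_t(before, after)
t_paired = paired_t(before, after)
print(f"unpaired t = {t_unpaired:.2f}, paired t = {t_paired:.2f}")
```

In this made-up example the group comparison drowns the change in between-person variation, while the paired statistic, which looks only at each person's own change, is far larger.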
I’ll illustrate the problem by analogy. Imagine you have a group of 25 students who enroll in an SAT prep class. You compare their test scores before the class has begun with their test scores after the class ends. The mean test score among students does not increase. Does this imply that the class was worthless?
Well, maybe, and maybe not. What you really want to know is not whether the group mean is higher, but whether individual students tended to improve. And those are two different questions. If, for instance, 90% of the students improved by, say, 50 points, while 10% dropped by 450 points, the mean would remain the same, despite the vast majority of students improving.
Problem #3. Middlemiss is missing a ton of data, and, you guessed it, she does not handle that issue correctly either.
Look back at how she describes her analyses. Do you see how the number of mothers and infants in each group changes from before to after sleep training? Consider the mothers’ cortisol data on the third night. The pre-sleep group includes cortisol samples from 17 of the 25 mothers. The post-sleep group includes cortisol samples from 12 of the 25 mothers.
This raises some questions: Are these 12 mothers a subset of the first 17? Or do these 12 include mothers not included in the before group? Middlemiss never tells us.
Why is missing data a problem? Let me again use an analogy. Imagine I have a bag of 25 apples. First I pull 17 apples out, weigh them, and put them back. Next I pull out a second set of 12 apples and weigh those. The second set of 12 apples weighs less, on average, than the first 17. Can I conclude that the apples in the bag have lost weight?
Of course not.
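The apple analogy is easy to simulate. The sketch below uses a hypothetical bag with made-up weights that never change, then repeatedly draws 17 apples and, separately, 12 apples, counting how often the second draw happens to be lighter.

```python
import random
import statistics

rng = random.Random(0)  # fixed seed so the sketch is repeatable

# A hypothetical bag of 25 apples with fixed weights in grams (made-up values).
bag = [150 + 2 * i for i in range(25)]
original = list(bag)  # copy kept to confirm that no apple ever changes weight

trials = 1000
second_lighter = 0
for _ in range(trials):
    first_draw = rng.sample(bag, 17)   # weigh 17 apples, then put them back
    second_draw = rng.sample(bag, 12)  # weigh a separate draw of 12 apples
    if statistics.mean(second_draw) < statistics.mean(first_draw):
        second_lighter += 1

print(f"second draw lighter in {second_lighter} of {trials} trials")
```

The second draw comes out lighter in a substantial share of trials even though the bag is identical throughout, which is exactly why a difference between two differently composed subsets says nothing about change.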
Now, missing a sample or two is not a huge deal, even in a relatively small study. Large studies can often handle significant amounts of missing data, provided data loss occurs more or less at random. But Middlemiss is missing samples from over half of the mothers on the third night!
At the very least, if you wanted to test whether the women’s cortisol had dropped after sleep training, you ought to compare the same mothers before and after their infants fell asleep. But she does not do that.
So, was the babies’ cortisol high? I don’t know. Did the mothers’ cortisol drop? I don’t know. It’s impossible to tell from what she reports.
(Note that this should never be the case in a scientific publication. The whole point of a scientific publication is to make what you did clear enough that someone else could replicate your study and analyses.)
Claim #2: Mothers’ and babies’ cortisol levels were no longer correlated after the third night of sleep training.
This is what she reports:
Problem #4. Here again Middlemiss uses the wrong statistical test. She claims that the mothers’ and babies’ cortisol levels no longer correlated after the third night of sleep training, because the second correlation, r(10)=.422, is not significantly different from zero. This is the wrong test.
She should have tested whether the post-sleep correlation of r=.422 is significantly different from the pre-sleep correlation of r=.582. A significant difference between the correlations is, after all, what she claims to have found.
And by the way, the two correlations are not significantly different from one another, suggesting no change in how much mothers were affected by their babies’ stress levels.
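One common way to compare two correlations is a Fisher z-test; the sketch below shows the calculation, though it is not necessarily the exact test I would recommend for this design. It treats the two samples as independent, which is only an approximation when the same pairs may appear in both. The post-sleep n of 12 reads r(10) as 10 degrees of freedom; the pre-sleep n of 17 is a placeholder assumption, since the sample size behind that correlation is not given here.

```python
import math
from statistics import NormalDist


def compare_correlations(r1, n1, r2, n2):
    """Fisher z-test for the difference between two Pearson correlations.

    Assumes the two samples are independent (an approximation in this setting).
    Returns the z statistic and a two-sided p-value.
    """
    z1, z2 = math.atanh(r1), math.atanh(r2)      # Fisher transformation
    se = math.sqrt(1 / (n1 - 3) + 1 / (n2 - 3))  # standard error of z1 - z2
    z = (z1 - z2) / se
    p = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p


# r values come from the post; n1 = 17 is an ASSUMED pre-sleep sample size,
# n2 = 12 follows from reading r(10) as 10 degrees of freedom.
z, p = compare_correlations(0.582, 17, 0.422, 12)
print(f"z = {z:.2f}, two-sided p = {p:.2f}")
```

Under these assumptions the z statistic falls well short of the usual 1.96 cutoff, and the conclusion does not change if the pre-sleep n is set to 25 instead.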
Problem #5. Even if Middlemiss had performed the correct test, we would still have a major problem, because, yet again, she’s lost over half of her sample! We have no way of knowing whether the pre-sleep mother-infant pairs are the same, largely the same, or completely different from the post-sleep mother-infant pairs. She never tells us.
Problem #6. Her entire argument boils down to roughly a dozen mother-infant pairs. That’s too small a number to tell much of anything. To see why, note that if she had found a correlation of .422 in her entire sample of 25 mother-infant pairs, it would have been significantly different from zero, meaning her whole argument rests not just on the wrong analysis but on a lack of statistical power.
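The sample-size point can be checked with the standard t statistic for a correlation, t = r·sqrt(n − 2) / sqrt(1 − r²). The sketch below applies it to r = .422 at several sample sizes. The critical values are the usual two-sided 5% cutoffs from a t table; n = 12 corresponds to reading r(10) as 10 degrees of freedom, and n = 10 is included in case the 10 is read as the number of pairs.

```python
import math

# Two-sided 5% critical values of Student's t from standard tables, keyed by df.
T_CRITICAL = {8: 2.306, 10: 2.228, 23: 2.069}


def t_for_correlation(r, n):
    """t statistic for testing whether a Pearson correlation differs from zero."""
    return r * math.sqrt(n - 2) / math.sqrt(1 - r * r)


results = {}
for n in (10, 12, 25):
    t = t_for_correlation(0.422, n)
    results[n] = t > T_CRITICAL[n - 2]  # True if significant at the 5% level
    print(f"n = {n}: t = {t:.2f}, significant at 5%: {results[n]}")
```

The same correlation fails the test at 10 or 12 pairs and passes it at 25, so the "no longer correlated" result reflects the shrunken sample rather than anything about the correlation itself.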
What did Middlemiss and colleagues conclude from these non-findings?
That infants continue to experience physiological distress, as measured by cortisol, despite being able to soothe themselves to sleep, and that sleep training leads to an “asynchrony” in mother and infant cortisol levels.
What do I conclude?
That this study should never have been published, at least not in its current form. Middlemiss and colleagues had no control group. They performed the wrong statistical analyses. They had huge amounts of missing data, which they did not account for at all. Their key findings relied on less than half of the original sample of 25.
These are not minor, nitpicky problems. These are major, glaring problems that make interpretation of her findings impossible.
Warning women against sleep training on the basis of this study is absurd.