Are Randomized Controlled Trials Really The “Gold Standard” For Evidence?

Back from vacation and still metabolizing the caloric tsunami that swept through our house over the holidays.  Now that the fruitcake and eggnog are safely out of sight, it’s time to get our heads back into deep thoughts about research methods.

Today we turn our attention to randomized controlled trials, or RCTs.  An RCT is an experimental design in which subjects are randomly assigned to one of two or more conditions (such as a treatment group or a control group); a treatment is applied; and then results between the groups are compared.  RCTs are regarded as the “gold standard” of evidence.  But is this justified, particularly in the social sciences?  Are RCTs really the be-all and end-all of research-based evidence?

The Power of RCTs

RCTs are very powerful because they come closer than any other research design to actually determining causality.  And this, after all, is our ultimate goal in research on program effectiveness.  We don’t just want to know that something happened; we want to know why, so we have a basis to act in the future.  By comparing the outcomes of a treatment against its counterfactual (such as a control group or an alternative treatment), an RCT has the best potential to identify causal relationships and rule out spurious associations.
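To make that logic concrete, here is a minimal simulation sketch (in Python, with entirely made-up numbers that are not drawn from any real study) of what an RCT estimates.  Because a coin flip, rather than family background or teacher choice, decides who gets the treatment, the two groups are comparable on average, and the simple difference in mean outcomes recovers the effect.

```python
# Hypothetical illustration: simulate a randomized trial with made-up numbers
# to show why random assignment identifies the treatment effect.
import random

random.seed(42)
N = 2000                      # hypothetical number of children in the study
TRUE_EFFECT = 5.0             # assumed true gain on a readiness score

# Each child has an unobserved baseline ability; randomization balances it
# across groups on average, so it cannot confound the comparison.
baselines = [random.gauss(100, 15) for _ in range(N)]

treated, control = [], []
for baseline in baselines:
    if random.random() < 0.5:                 # coin-flip assignment
        treated.append(baseline + TRUE_EFFECT + random.gauss(0, 5))
    else:
        control.append(baseline + random.gauss(0, 5))

estimate = sum(treated) / len(treated) - sum(control) / len(control)
print(f"Estimated effect: {estimate:.2f} (true effect: {TRUE_EFFECT})")
```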

For this reason, the U.S. Department of Education, among other groups, has placed increased importance on RCTs in recent years.  For example, the What Works Clearinghouse, which evaluates the evidence behind educational interventions, has a strict hierarchy of evidence levels, in which only “well-designed and well-implemented” RCTs can be designated as “meeting evidence standards”.  The result is that relatively few programs are designated as “effective” by the What Works Clearinghouse.  For example, only eight early childhood programs are designated as having “positive effects”, and another nine as having “potentially positive effects”.

Limits of RCTs

Is the emphasis on RCTs well-founded?  In an ideal situation, all programs of interest would have a research base of well-conducted RCTs, and the answer would be ‘yes’.  When available, well-conducted RCTs are generally the best means of determining program effectiveness.

But in the “real world”, things are different.  For one, the proportion of currently operating or proposed programs that have RCT data is likely to be quite low, particularly in the social sciences.

In addition, a host of “real-world” factors can impact the ability of RCTs to determine causality, such as publication bias and the repeated testing of the same intervention across many sites.  For example, imagine a new preschool curriculum that, its creators claim, increases kindergarten readiness.  Further imagine that we know the proposed curriculum, in reality, has no actual effect on kindergarten readiness.  Now imagine that the curriculum is implemented in 100 different school districts and 100 researchers conduct 100 RCTs, one in each district.

Using standard confidence levels (which allow a 5% chance of mistaking no effect for an actual treatment effect), about five of the studies are likely to turn up a statistically significant effect, even though none exists.  Publication bias amplifies the problem if those five studies are more likely to get published than the 95 studies that found no effect.
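A quick back-of-envelope simulation, again with hypothetical numbers of my own choosing, shows how the 100-district thought experiment plays out: at a 0.05 significance threshold, a handful of trials of a do-nothing curriculum will look positive by chance, and if only those trials get published, the published record is badly misleading.

```python
# Rough sketch of the 100-district thought experiment: the curriculum truly
# does nothing, yet roughly 5% of trials "find" an effect at the 0.05 level.
import random
from statistics import mean, stdev

random.seed(1)

def null_trial(n_per_group=200):
    """One RCT of a curriculum with zero true effect; returns True if the
    two-sample comparison looks 'significant' (|t| > roughly 1.96)."""
    t_group = [random.gauss(0, 1) for _ in range(n_per_group)]
    c_group = [random.gauss(0, 1) for _ in range(n_per_group)]
    se = ((stdev(t_group) ** 2 + stdev(c_group) ** 2) / n_per_group) ** 0.5
    t_stat = (mean(t_group) - mean(c_group)) / se
    return abs(t_stat) > 1.96

results = [null_trial() for _ in range(100)]
false_positives = sum(results)
print(f"'Significant' findings out of 100 null trials: {false_positives}")

# Publication bias: if only the 'significant' trials get published, the
# published record reports an effect every time, despite none existing.
published = [r for r in results if r]
print(f"Published studies: {len(published)}, all reporting an effect")
```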

This is a stylized example, but it illuminates the real-world risk of overinterpreting the results of RCTs.  RCTs also face a host of other threats to validity, such as the possibility that members of the “control” group inadvertently received the treatment too, which would have the effect of underestimating the true extent of the causal relationship.
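Here is a similar hypothetical sketch of that contamination problem: if some share of the “control” children receive the treatment anyway, the measured gap between groups shrinks toward zero, so the RCT understates the true effect.  The 30% contamination rate below is an assumption for illustration only.

```python
# Hypothetical illustration of control-group contamination biasing an RCT
# estimate toward zero.
import random

random.seed(7)
TRUE_EFFECT = 5.0
CONTAMINATION = 0.30          # assumed share of controls who get treated anyway

treated = [random.gauss(100, 15) + TRUE_EFFECT for _ in range(1000)]
control = [
    random.gauss(100, 15) + (TRUE_EFFECT if random.random() < CONTAMINATION else 0)
    for _ in range(1000)
]

estimate = sum(treated) / len(treated) - sum(control) / len(control)
print(f"Estimated effect: {estimate:.2f} vs. true effect {TRUE_EFFECT}")
# With 30% contamination, the estimate lands near 70% of the true effect.
```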

These “real world” issues may overwhelm the virtues of the “ideal” landscape in which a host of well-conducted RCTs are available for us to examine.  In practice, RCTs are not always possible to conduct, and they can be expensive.  (To this end, several organizations, including the Coalition for Evidence-Based Policy, are calling for “low-cost” RCTs that can be completed for under $100,000 – a bargain compared to the $3 to $5 million tab that a full-scale educational RCT can run.)

In early childhood, relatively few RCTs have been conducted, for a range of logistical, political, and practical reasons.  Even the most well-respected – such as the recent Head Start Impact Study, which randomly assigned 5,000 3- and 4-year-olds to Head Start or a non-Head Start control group – generated as many questions as it answered.  So what do we do when RCT evidence is not available?  We still need a way to make determinations about the best programs to pursue, and the best way to spend our money.

The best way to proceed, in my view, is to take into account all of the research available and weigh it according to a variety of factors, including the research methodology used, attrition, and other study characteristics.  If the best research we have on an existing program is a tracking study with no control group, we ought to examine that research and draw what we can from it.  If a good RCT comes along that studies the same program, then that research can take precedence – but until that point we would be mistaken to ignore the body of evidence already assembled, even if it is of lower quality than we’d prefer.