An Intermediate Guide to Reading Research

In my last article I wrote a beginner's guide to reading education research. Much to my surprise, it ended up being my second most popular article ever, and I received multiple requests to write an intermediate guide. In this article, I will attempt to do just that. However, as I assume readers will have already read my first article, I will address some of the nuances of the sub-topics rather than explain, step by step, how people should interpret the literature. That being said, I will not be writing an advanced guide, as I do not feel qualified to do so. I have spent the past several years researching and talking about evidence-based education through this blog, my podcast, and a book I wrote on the topic. However, at the end of the day, I am not a professor and I do not hold a PhD. I am just a nerdy teacher. I do think, though, that there is a serious misunderstanding of how to read science in general, and this series has been my attempt to add a small amount of clarity for the average person.


The Inflation Problem:

In education research we tend to see exaggerated effect sizes. On average, education studies produce an effect size of .40. By comparison, in exercise science and nutrition research, most effect sizes are below .20. As the average effect size of a placebo intervention is .20, research in many fields is considered pertinent the moment it crosses that .20 barrier and proves itself better than a placebo. However, in education the vast majority of studies show an effect much higher than .20.


There are likely many factors inflating the average effect size in education research. One such factor, as Dylan Wiliam pointed out to me, is the “file drawer problem”: the noted phenomenon of researchers not bothering to publish studies with statistically non-significant results. Indeed, this is why some of the more reputable researchers pre-register their studies, so that their undertakings are on the record before they begin. However, to the best of my knowledge, most researchers do not pre-register their studies.
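
To make the file drawer problem concrete, here is a minimal simulation sketch. Everything in it is an assumption of mine for illustration and not drawn from any real data set: a true effect of .20, 30 students per group, 2,000 simulated studies, and a rough cut-off of t > 1.96 standing in for "significant enough to publish".

```python
import random
import statistics


def simulate_file_drawer(true_effect=0.20, n_studies=2000, n_per_group=30,
                         publish_threshold=1.96, seed=0):
    """Simulate many two-group studies and 'publish' only the significant ones.

    All parameter values are hypothetical defaults chosen for illustration.
    """
    rng = random.Random(seed)
    all_effects, published_effects = [], []
    for _ in range(n_studies):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        # Pooled SD for two groups of equal size
        pooled_sd = ((statistics.variance(control)
                      + statistics.variance(treated)) / 2) ** 0.5
        d = (statistics.mean(treated) - statistics.mean(control)) / pooled_sd
        # With equal group sizes, the t statistic is d * sqrt(n / 2)
        t = d * (n_per_group / 2) ** 0.5
        all_effects.append(d)
        if t > publish_threshold:
            # Only positive, "significant" studies leave the file drawer
            published_effects.append(d)
    return (statistics.mean(all_effects),
            statistics.mean(published_effects),
            len(published_effects))


mean_all, mean_published, n_published = simulate_file_drawer()
print(f"Mean effect size, all studies:       {mean_all:.2f}")
print(f"Mean effect size, published studies: {mean_published:.2f}")
print(f"Studies published: {n_published} of 2000")
```

Under these assumptions, the average of the published studies lands well above the average of all studies, even though every simulated study tested the same intervention.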


Another problem is likely tied to the quality of education studies in general. As education has mostly been viewed as an art and not a science, there is something of a deficit in the quality of education studies. This seems to be an especially pronounced problem in older papers. Many education papers have no control group, small sample sizes, and excessively long durations. In general, these features tend to wildly exaggerate effect sizes.


Additionally, there exists what I would call the structure factor. In general, we see that interventions that are more structured have greater effect sizes than interventions that are not. For example, Direct Instruction outperforms Inquiry Based Learning, Inquiry Based Learning outperforms Problem Based Learning, and Problem Based Learning outperforms Discovery Based Learning. That being said, most studies that even have a control group assign no specific teaching interventions or strategies to it. So what we end up with is a structured teaching group vs an unstructured teaching group and, lo and behold, the structured group almost always outperforms the unstructured group.


For all of these reasons, I think education researchers should adopt the mindset that the effect size of an education placebo should be considered as .40, not .20. That being said, I think there might be a time and a place for implementing interventions with smaller effect sizes. Ultimately, the reason I got into this research was the realization that there is an opportunity cost to education interventions. Everything you do in your classroom takes time, both in its learning curve and in its implementation. That is why it is important to use high yield strategies. However, the time costs of different strategies are not all equal. I would rather suggest a strategy with a very low time cost and a small to moderate impact than a strategy with an extremely high time cost and a moderate to high impact. Ultimately, though, I think the best strategies are ones that are both easy to implement and high yield. We might refer to this paradigm as the impact to time ratio.
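
As a rough illustration of the impact to time ratio, here is a small sketch. The formula (effect size divided by hours of teacher time) is my own assumption about how the ratio might be computed, and the two strategies and their numbers are entirely hypothetical.

```python
def impact_to_time_ratio(effect_size, hours_required):
    """Effect size gained per hour of teacher time (an assumed definition)."""
    if hours_required <= 0:
        raise ValueError("hours_required must be positive")
    return effect_size / hours_required


# Hypothetical strategies: (effect size, hours to learn and implement)
strategies = {
    "low time cost, moderate impact": (0.30, 2.0),
    "high time cost, high impact": (0.60, 20.0),
}

for name, (effect_size, hours) in strategies.items():
    ratio = impact_to_time_ratio(effect_size, hours)
    print(f"{name}: {ratio:.3f} effect size per hour")
```

With these made-up numbers, the cheaper strategy yields more impact per hour, even though its raw effect size is half as large.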


The Quality Problem:

As you undoubtedly realize at this point, not all studies are created equal. However, in meta-analysis, we place an equal weight on studies of different quality levels. Unfortunately, the higher quality or more structured a research paper is, the lower the effect size tends to be. This might be because we are removing some of the placebo impacts from the intervention. As pointed out earlier, control group studies tend to have lower effect sizes than studies without control groups. That being said, there are many different control group designs, all aimed at reducing some of the randomness of intervention results.


The gold standard of experimental designs is the randomized controlled trial. This means people are randomly assigned to the control group and the experimental group. This is meant to stop researchers from doing unscrupulous things like putting all the strongest students in the experimental group. However, an even better design (in my opinion) that is sometimes used involves forming groups based on pre-test scores, so that both the control group and the experimental group have the same mean pre-test score.
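
One way groups could be balanced on pre-test scores is matched-pair assignment: rank students by pre-test score, pair up neighbours, and randomly split each pair between the two groups. The sketch below is my own illustration of that idea, not a description of how any particular study did it. The function name and the student scores are hypothetical.

```python
import random
import statistics


def matched_pair_assignment(pretest_scores, seed=0):
    """Split students into two groups with similar mean pre-test scores.

    pretest_scores: dict mapping student name -> pre-test score.
    Returns (control, experimental) as lists of (name, score) tuples.
    """
    rng = random.Random(seed)
    ranked = sorted(pretest_scores.items(), key=lambda item: item[1])
    control, experimental = [], []
    # Walk through the ranked list two students at a time
    for i in range(0, len(ranked) - 1, 2):
        pair = [ranked[i], ranked[i + 1]]
        rng.shuffle(pair)  # random assignment within each matched pair
        control.append(pair[0])
        experimental.append(pair[1])
    if len(ranked) % 2 == 1:
        # An unpaired student is assigned to a group at random
        (control if rng.random() < 0.5 else experimental).append(ranked[-1])
    return control, experimental


# Hypothetical pre-test scores (percent correct)
scores = {"A": 40, "B": 42, "C": 55, "D": 57, "E": 70, "F": 73, "G": 88, "H": 90}
control, experimental = matched_pair_assignment(scores)
print("Control mean:     ", statistics.mean(s for _, s in control))
print("Experimental mean:", statistics.mean(s for _, s in experimental))
```

Because each pair contains two students with similar scores, the group means can only differ by the small gaps within pairs, while the assignment within each pair is still random.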


As pointed out earlier, more structure almost always beats less structure. This is why some researchers, rather than leaving the control group with no structure, assign the teachers in the control group a specific alternative intervention and give both groups equal training. For example, rather than comparing a phonics group with an unstructured group, they compare a phonics group with a balanced literacy group. This type of approach is likely fairer, especially if neither group knows whether it is the control group or the experimental group. However, studies with this design tend to have very low effect sizes. Ultimately, the more fairly conducted and the more structured a study design is, the lower the results tend to be.


For these reasons, some scholars would argue against meta-analyses that include less rigorous study designs. And in some cases they might be right. Would you rather look at one very well conducted study or four very poorly conducted studies? Unfortunately, there are several reasons that make this reductionist approach less useful. Firstly, many education topics do not have any high quality studies behind them, so if we base our hypotheses only on areas where there are high quality studies, we force ourselves to take no stance on perhaps most of the literature. However, this is not reflective of the scientific process. A more reflective position would be to recognize that evidence is always fluid and never perfect, and to be cognisant that we can only speak in degrees of possibility, not absolutes. That being said, when a high yield is found in multiple high quality studies and within a meta-analysis, we can be reasonably certain that the strategy is high yield. Whereas, when we have multiple poorly done studies with a high yield, a more reflective statement might be “the strategy appears evidence-based, according to the evidence we have now, but there needs to be more high quality research.”


Another issue with ignoring the lower quality research is that it forces us to ignore the majority of older research. Very few studies from the 80s and 90s use randomized controlled designs or statistically corrected test groups, and if we ignore this research, we end up having to throw out large amounts of our body of research. This might one day be advisable, but within the field of education we have not yet built up enough of a base of high quality research for this to be feasible. Lastly, our understanding of effect sizes in education research largely comes from low quality studies. As most of the research is low quality, our sense of the normal range for effect sizes in education research, which is the context for any comparison we make, is itself built on that research.


The Sponsorship Problem:

Within the research, we often see studies conducted by specific parties get specific results, i.e., researchers critical of a specific strategy tend to get less positive results than researchers who are promoting the same strategy. Of course, this is why we try to use rigorous study designs, to correct for this bias. However, this does not always work. For example, I recently conducted my own meta-analysis on the topic of LLI. Within this meta-analysis I came across a series of experiments done by an institute in favour of LLI. These papers, despite seemingly being the best-conducted papers on the topic, consistently showed far superior results to all other studies on the topic. To make matters worse, despite the fact that the institute's experiments were the only rigorously conducted experiments, I had some reliability concerns, as I noted several strange statistical anomalies within their papers.


The Sample Problem: 

Large sample sizes on average tend to produce more normalized results than smaller sample sizes. As a small sample can distort an SD calculation, it can make the data look either more or less random than it actually is. For example, let's say we have a sample of 6 and all students get a result within 5% of each other. This will create an extremely low SD and an extremely high effect size. Now let's say that within a proper sample size, we would see most students have a range of results within 10%, with outliers ranging up to 40% in either direction. If we have another study with a sample size of 6 and we get two large outliers, then our SD will suddenly become extremely high and the ES will be extremely low. For these reasons, it can sometimes be better to borrow a hypothetical SD from a similarly designed study with a large sample size when calculating the ES of a study with too small a sample. Of course, in general, we probably should not place a high weight on studies that have sample sizes below 20.
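
The sketch below simulates this instability. It is an illustration under assumptions I have chosen myself: a true effect of .40, normally distributed scores with a true SD of 1.0, and a "borrowed" SD that happens to equal that true SD (the best case for borrowing). None of the numbers come from real studies.

```python
import random
import statistics


def effect_size_spread(n_per_group, true_effect=0.40, borrowed_sd=1.0,
                       n_studies=2000, seed=0):
    """Return how widely estimated effect sizes scatter across simulated studies.

    Returns (spread using each study's own pooled SD,
             spread using a borrowed SD), both as standard deviations.
    """
    rng = random.Random(seed)
    own_sd_effects, borrowed_sd_effects = [], []
    for _ in range(n_studies):
        control = [rng.gauss(0.0, 1.0) for _ in range(n_per_group)]
        treated = [rng.gauss(true_effect, 1.0) for _ in range(n_per_group)]
        mean_diff = statistics.mean(treated) - statistics.mean(control)
        # Pooled SD for two groups of equal size
        pooled_sd = ((statistics.variance(control)
                      + statistics.variance(treated)) / 2) ** 0.5
        own_sd_effects.append(mean_diff / pooled_sd)
        borrowed_sd_effects.append(mean_diff / borrowed_sd)
    return statistics.stdev(own_sd_effects), statistics.stdev(borrowed_sd_effects)


for n in (6, 100):
    own, borrowed = effect_size_spread(n)
    print(f"n = {n:>3} per group: spread with own SD = {own:.2f}, "
          f"with borrowed SD = {borrowed:.2f}")
```

In this simulation, studies with 6 students per group scatter far more widely around the true effect than studies with 100 per group, and borrowing a stable SD removes part (but not all) of that extra scatter.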


Size is not the only consideration we have to make when examining samples, as different demographics tend to have different results. Overall, we see younger students make progress far more rapidly than older students, in part because their curriculum is more elementary. We also see that the same education intervention can have drastically different results with students in different grades. For example, phonics interventions tend to have by far the largest results between pre-k and grade 2, whereas problem based learning tends to have the best results with students in grade 12 or older. For these reasons, it is likely inappropriate to include studies in a meta-analysis that are not from what should be the targeted demographic. Lastly, on the topic of samples, we see that students in disadvantaged demographics, i.e., impoverished neighbourhoods, tend to have lower reported results than students from wealthy neighbourhoods.


Types of Effect Size Calculations: 

While Cohen’s d is likely the most commonly used effect size within education research, it is not the only one used. Hedges’ g is also commonly used within education research and is meant to normalise the results for smaller sample sizes. Like Cohen’s d, Hedges’ g divides the difference between the group means by the pooled SD, but it then applies a correction factor that slightly shrinks the result when samples are small. When the control group has a substantially different deviation from the experimental group, Glass’s Delta is recommended instead, which uses only the SD of the control group. Pearson’s r is used when examining the relationship between two variables to determine correlation. For example, you would use a Pearson calculation if you wanted to examine the correlation between parent income and student results. While all of these calculations are different, they are meant to be used under specific circumstances and to normalize results within a standard interpretation. Some authors criticize meta-analyses that include studies with different types of effect size calculations; however, as all of these calculations are meant to be interpreted the same way, I cannot say I agree with the criticism. Sometimes, instead of using an effect size calculation, authors will report a t-value or a p-value. These tests are used to judge statistical significance while accounting for the degree of variability. They are essentially trying to measure the degree to which the study results might be random noise.
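
For readers who like to see the arithmetic, here is a minimal sketch of these calculations using their standard textbook formulas. The function names and the test scores are my own hypothetical examples, and the Hedges' g correction shown is the common approximation rather than the exact version.

```python
import statistics


def pooled_sd(group_a, group_b):
    """Pooled standard deviation of two groups (weights by degrees of freedom)."""
    na, nb = len(group_a), len(group_b)
    weighted_var = ((na - 1) * statistics.variance(group_a)
                    + (nb - 1) * statistics.variance(group_b)) / (na + nb - 2)
    return weighted_var ** 0.5


def cohens_d(experimental, control):
    """Mean difference divided by the pooled SD."""
    diff = statistics.mean(experimental) - statistics.mean(control)
    return diff / pooled_sd(experimental, control)


def hedges_g(experimental, control):
    """Cohen's d multiplied by an approximate small-sample correction."""
    df = len(experimental) + len(control) - 2
    correction = 1 - 3 / (4 * df - 1)
    return cohens_d(experimental, control) * correction


def glass_delta(experimental, control):
    """Mean difference divided by the control group's SD only."""
    diff = statistics.mean(experimental) - statistics.mean(control)
    return diff / statistics.stdev(control)


def pearson_r(x, y):
    """Pearson correlation between two equally long lists of numbers."""
    mean_x, mean_y = statistics.mean(x), statistics.mean(y)
    covariance = sum((xi - mean_x) * (yi - mean_y) for xi, yi in zip(x, y))
    spread_x = sum((xi - mean_x) ** 2 for xi in x)
    spread_y = sum((yi - mean_y) ** 2 for yi in y)
    return covariance / (spread_x * spread_y) ** 0.5


# Hypothetical post-test scores (percent correct)
control_scores = [60, 62, 65, 70, 73, 75]
experimental_scores = [66, 70, 71, 76, 80, 81]

print(f"Cohen's d:     {cohens_d(experimental_scores, control_scores):.2f}")
print(f"Hedges' g:     {hedges_g(experimental_scores, control_scores):.2f}")
print(f"Glass's Delta: {glass_delta(experimental_scores, control_scores):.2f}")
```

With only six students per group, Hedges' g comes out noticeably smaller than Cohen's d. With large samples the correction factor approaches 1 and the two become nearly identical.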


The Comparison Problem:

So this, of course, all raises the question: how do we compare low quality and high quality research if they generate different effect sizes? The answer is: with humility. While the state of the literature is far from perfect, we have to work with the research we have. Yes, high quality studies included in a meta-analysis will on average drag down an effect size, and yes, low quality studies will on average bring up an effect size. But we should only ever examine the research in degrees of probability, not absolutes. Moreover, it is not as if all well controlled studies have effect sizes below .40 and all poorly controlled studies have effect sizes above .70. Indeed, I have come across multiple well done studies with effect sizes above 1 and multiple poorly done studies with effect sizes below .20. Ultimately, we just need to understand that all of this influences the possible result of a meta-analysis and should therefore temper our confidence.


Ideally, sample size corrects for all errors. Take phonics, for example. Phonics is one of the most well studied topics in the literature, with over 1000 studies conducted. Within individual studies I have seen results below .20 and above 1.0; however, within meta-analyses I have seen a much narrower range of results. The lowest meta-analysis effect size I can think of for phonics was .40 and the highest was around .80; however, the majority of meta-analyses on the topic found an effect size within the relatively small range of .40-.70. The largest meta-analysis on the topic was done by John Hattie and found an effect size of .60. When the vast majority of meta-analyses on the topic consistently find phonics to have a moderately large effect size, I feel confident in saying that phonics has a moderately positive effect.


Some would argue that the degree of variability within the research suggests that we need to disregard meta-analysis and focus on parsing out the best constructed studies in each topic; however, I disagree with this approach for several reasons. Firstly, even within well constructed studies, we still see large variability. The human condition is complex and determining the effect of a human intervention is challenging. Secondly, it discounts the majority of the research. But lastly, and most importantly in my opinion, it de-democratizes research.


Without meta-analysis we largely have to rely on the ability of benevolent and brilliant scholars to interpret the literature for everyone else, the “sage on the stage” so to speak. However, the problem with this approach is that it requires individual teachers to find trustworthy scholars to interpret the evidence for them. This has largely been the most popular method for understanding the literature. However, it is usually not the most informed scholars who rise in popularity, but rather the ones who are best at marketing. It is this practice and belief system that has allowed pseudo-scientific practices, such as teaching to learning styles, to become popular within our field to begin with.


When we use meta-analysis, we empower teachers to quickly and easily interpret the efficacy of different teaching interventions within the literature. If I am being perfectly honest, I think this is the true reason meta-analysis is sometimes criticized within the field of education. Meta-analysis has the ability to show that pedagogies people have spent their lives promoting and researching are ineffective. Furthermore, it diminishes the importance of all those scholars who have aspired to be the “sage on the stage”, as it allows people to interpret the literature for themselves without spending a lifetime reading every published study.