Why Meta-Analysis is a Necessity in Education
Meta-analyses are systematic summaries of research that use quantitative methods to find the mean effect size (standardized mean difference) for interventions. Critics of meta-analysis point out that such analyses can conflate the results of low- and high-quality studies, make improper comparisons, and produce statistical noise. All of these criticisms are valid for low-quality meta-analyses; however, high-quality meta-analyses correct for these problems. Critics often suggest that selecting high-quality RCTs is a more valid methodology. However, education RCTs do not show consistent findings, even when study factors are well controlled. Education is a social science, and variability is inevitable. Scholars who try to select the best RCTs are likely to select the RCTs that confirm their biases. High-quality meta-analyses offer a more transparent and rigorous model for determining best practice in education.
Have you ever heard someone say something to the effect of, “One week scientists say eggs are good for you, the next week they say they’re bad. I don’t know what to think”? This is a common conception of science, and it stems from poor media reporting on new research. The media tends to report on each new landmark study as if it stands in a vacuum, the sole edict of what science proves. This is problematic because it assumes the newest study is always the most correct, rather than looking at what the majority of research shows. In the past, researchers would complete systematic literature reviews to discover the scientific consensus on a topic. With this approach, a researcher reads all of the studies on a topic and then writes about their findings. This can be problematic because it tends to be purely qualitative: the researcher gets to present their interpretation without being beholden to any kind of quantitative data.
A meta-analysis is similar to a literature review, except the authors also find the average statistical result for studies on a topic. Typically, meta-analysis results are reported as effect sizes: standardized mean differences that allow multiple studies to be compared with each other. I tend to think of a meta-analysis as a literature review with receipts. Looking at research through meta-analysis is the most systematic way of examining research. The author must review all studies and then systematically synthesize the quantitative results. Ideally, this removes as much bias as possible and provides an interpretation of the most representative result on a topic. Meta-analysis also serves a fundamental scientific principle: replication. A scientific finding is only truly valid if it can be consistently replicated. By using meta-analysis, we can be sure whether or not a finding has been well replicated. This is especially important in education research, because results tend to be more variable, and experiments are often carried out by those selling pedagogical products.
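The standardized mean difference mentioned above (commonly called Cohen's d) is simple to compute: the difference between the treatment and control group means, divided by their pooled standard deviation. A minimal sketch, using invented scores rather than data from any real study:

```python
import math
import statistics

# Hypothetical post-test scores for a treatment and a control group.
treatment = [78, 82, 85, 90, 74, 88]
control = [70, 75, 80, 72, 68, 77]

def cohens_d(t, c):
    """Standardized mean difference: (M_t - M_c) / pooled SD."""
    nt, nc = len(t), len(c)
    pooled_var = ((nt - 1) * statistics.variance(t) +
                  (nc - 1) * statistics.variance(c)) / (nt + nc - 2)
    return (statistics.mean(t) - statistics.mean(c)) / math.sqrt(pooled_var)

d = cohens_d(treatment, control)
print(f"d = {d:.2f}")
```

Because d is expressed in standard-deviation units rather than raw test points, effects measured on different assessments can be averaged together, which is what makes meta-analysis possible.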
Over the last two decades meta-analyses have been crucial in helping to determine best practice in literacy instruction. Most famously, the National Reading Panel conducted multiple meta-analyses, including one that compared systematic phonics and whole language instruction. Their research showed systematic phonics has a mean effect size of .44, which is why most reading researchers today recommend systematic phonics instruction as part of a comprehensive literacy program.
There are scholars who object to meta-analysis and they usually cite three main arguments:
Meta-analysis ignores study quality
Meta-analysis makes apples to oranges comparisons
Meta-analysis tends to show random statistical results, not meaningful results
Let’s dive into the main criticisms of meta-analysis:
1. Quality: There are typically 4 main types of studies included in a meta-analysis:
- Case studies: studies without control groups or done retrospectively
- Correlation studies: studies that look at the correlation between two data sets
- Quasi-experimental studies: studies that have a non-randomized treatment and control group
- Randomized Controlled Trial (RCT): studies that have a randomized treatment and control group
Typically, an RCT is seen as higher quality than a quasi-experimental study, and a quasi-experimental study is seen as higher quality than a case study. Sample size, duration, fidelity tracking, attrition, and measurement also affect the quality of a study. Higher-quality studies tend to show lower results on average. For example, a large-sample, long-duration RCT with standardized measurements is likely to be far more accurate than a small-sample, short-duration case study that uses researcher-designed assessments.
Meta-analyses that do a poor job of controlling for quality will typically include studies of varying quality, such as case studies and RCTs, and report a single mean effect size. A well-done meta-analysis will either exclude low-quality studies or show the difference in results for high- versus low-quality studies. Take a look at this results section from a fantastic meta-analysis by Fitton et al.
The authors have clearly controlled for how studies of different quality show different results. Interestingly, the highest-quality studies showed a similar effect size (.31) to the overall mean for the study (.28), suggesting that in this case quality did not have a significant impact on results.
2. Apples to Oranges
An apples-to-oranges comparison is one between things too dissimilar for the comparison to be meaningful. An example would be trying to find the mean effect of comprehension instruction while lumping together multiple different types of comprehension instruction as if they were the same thing. For example, vocabulary instruction and strategy instruction are both used to teach comprehension, but they are very different approaches. That said, good meta-analyses control for this by separating the results out as moderator variables, as can be seen in the below meta-analysis by Filderman et al.
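In practice, a moderator analysis simply means computing a separate pooled effect for each subgroup rather than one overall mean. A toy sketch, with made-up study names and effect sizes (not Filderman et al.'s actual data):

```python
from collections import defaultdict
from statistics import mean

# Hypothetical (study, moderator category, effect size) records.
studies = [
    ("Study A", "vocabulary", 0.35),
    ("Study B", "vocabulary", 0.45),
    ("Study C", "strategy", 0.60),
    ("Study D", "strategy", 0.70),
]

# Group effect sizes by the moderator variable (intervention type).
by_type = defaultdict(list)
for name, moderator, d in studies:
    by_type[moderator].append(d)

# Report a separate mean effect per subgroup instead of one overall mean.
subgroup_means = {moderator: mean(ds) for moderator, ds in by_type.items()}
for moderator, m in subgroup_means.items():
    print(f"{moderator}: k={len(by_type[moderator])}, mean d={m:.2f}")
```

Real meta-analyses weight each study rather than taking a simple mean, but the principle is the same: dissimilar interventions are summarized separately, so apples are compared with apples.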
3. Statistical Noise
One less common criticism I see of meta-analysis is that the authors are capturing random averages, not meaningful trends. Let’s construct a hypothetical example. Say we have 10 studies showing the following effects: .10, .20, .30, .40, .50, .60, .70, .80, .90, 1.0. The mean effect size is .55, which looks substantial. However, there is clearly no discernible trend within those studies, so by taking a mean we have actually made the data less meaningful, not more. Of course, there are multiple tools for addressing this issue. Most typically, meta-analyses report confidence intervals, which show the likely range of the true mean effect, and/or p-values, which display the likelihood that a statistic is random, alongside their mean effect sizes, so that readers can discern whether the mean effect found was meaningful or random noise. Indeed, both of the well-done meta-analyses shown in the graphics above included confidence intervals and p-values alongside their effect sizes.
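The hypothetical above can be checked directly. A minimal sketch that computes the simple (unweighted) mean of those ten effects and a 95% confidence interval around it; the wide interval is what signals the spread in the underlying studies:

```python
import math
import statistics

# The ten hypothetical effect sizes from the example above.
effects = [0.10, 0.20, 0.30, 0.40, 0.50, 0.60, 0.70, 0.80, 0.90, 1.00]

mean = statistics.mean(effects)
se = statistics.stdev(effects) / math.sqrt(len(effects))  # standard error
t_crit = 2.262  # critical t-value for a 95% CI with df = 9
ci = (mean - t_crit * se, mean + t_crit * se)

print(f"mean d = {mean:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```

The interval spans roughly .33 to .77, telling a reader that the .55 mean is a very imprecise summary of these studies, which is exactly the information a bare mean effect size hides.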
So Are these Criticisms Valid?
All three of these criticisms are valid. However, they only really apply to poorly done meta-analyses. Meta-analysis is a relatively new technique for reviewing research, and it has evolved quite a bit over the last 20 years. Meta-analyses done in the late 1990s often combined many poor-quality studies to produce one mean effect size. While more modern meta-analyses tend to be much more sophisticated, there is a lack of methodological consistency for meta-analysis within the field of education. For example, the authors of this article are currently researching ESL education. We found 12 meta-analyses on ESL education research, dating back to 2009. Of these 12, 6 included studies without control groups and did not use moderator analysis to compare the impact of studies with and without control groups. The 6 meta-analyses that did not control for quality were not rigorous and therefore cannot be used as definitive proof of a scientific consensus.
Those who criticize meta-analysis often claim we should rely on high-quality RCTs instead. I think this is a problematic solution for two reasons. First, it means we have to trust researchers to tell us which RCTs are the best done, which means trusting individuals to act without bias. However, researchers tend to rise in popularity not necessarily on their merit or qualifications, but because they are good at marketing. Relying on methods like this often results in unscientific findings being popularized as scientific. For example, many scholars have cited Balanced Literacy as the gold standard of reading instruction based on a handful of RCTs reviewed by the WWC. This suggestion was made in comparison to the findings of the NRP meta-analysis, which recommended systematic phonics instruction based on dozens of studies.
Second, this methodology is based on the belief that well-done RCTs show precise outcomes and therefore do not need replication. Within the field of education, this is demonstrably false. Let’s look at some of the findings from our 2022 meta-analysis on language programs. We identified 20 RCTs that looked at structured literacy phonics programs. The mean effect size was .48, with a 95% confidence interval of [.31, .66], suggesting the true mean effect plausibly lies anywhere from .31 to .66. This is a wide range: .66 is a moderate-to-high effect size, while .31 is low. The lowest study showed an effect size of -.11 (Vaden-Kiernan 2008), and the highest showed 1.16 (Farokhbakht, unlisted date). Obviously, neither of these effect sizes is particularly representative of the typical effect of a phonics intervention. However, a scholar with an agenda could point to either study to make a case for or against structured literacy.
That said, the Vaden-Kiernan study is of far higher quality than the Farokhbakht study. If we examine only the highest-quality RCTs, meaning in this case longitudinal RCTs with standardized assessments, we get 3 studies: Vaden-Kiernan 2008, Torgesen 2007, and Bratsch 2020. These studies showed a mean effect size of .22, with a 95% confidence interval of [-.50, .95], suggesting a high degree of variability. The lowest study showed a mean effect size of -.11 and the highest .43 (Bratsch 2020). Again, a biased academic could pick any one of those three studies and argue either for or against phonics/structured literacy.
All of these studies could also be argued to be apples-to-oranges comparisons, as each looked at different demographics, programs, and styles of approach. One looked at a scripted DI approach (Vaden-Kiernan 2008), one at an Orton-Gillingham approach (Torgesen 2007), and one at a speech-to-print approach (Bratsch 2020).
However, even if we only look at RCT studies on the same program, we see very different results. For example, let’s look at Read 180. In 2022, Hansford and Mcglynn identified 12 RCTs on Read 180, with a mean effect size of .11 and a 95% confidence interval of [.04, .19]. Here the confidence interval suggests a very narrow range. However, the highest study (Interactive Inc 2002) showed a mean effect size of .41 and the lowest (Fitzgerald 2008) showed a mean effect size of 0 (for longitudinal outcomes). If we remove the lowest-quality studies and only include those that used standardized measurements, were longitudinal, and controlled for fidelity, we get 4 studies: Interactive Inc 2002, Fitzgerald 2008, Meisch 2011, and Sprague 2012. Together these studies show a mean effect size of .16, but the confidence interval is much wider than when all 12 RCTs are included: [-.12, .40]. Moreover, both the Fitzgerald 2008 and Interactive Inc 2002 studies fell within the highest-quality category, so the range of effect sizes was still 0 to .41. If we look at both quasi-experimental and RCT studies, 13 out of 19 mean effect sizes were between 0 and .29, with a 95% confidence interval of [0, .19]. While looking at all the studies together suggested a very consistent trend of a low effect, looking at only the highest-quality studies made the effect appear more random and made finding a meaningful trend more difficult.
That said, the Read 180 studies covered multiple grades and used multiple designs. Reading Recovery might be a better example. Within our 2022 meta-analysis of language programs, we identified 11 RCT studies on Reading Recovery, all of which looked at the same grade. Moreover, all but 2 of them used the same basic design: comparing 20 weeks of 1-on-1 intensive reading instruction to a no-treatment control group. These 11 studies showed a mean effect size of .38, with a 95% confidence interval of [-.99, 1.24] (outliers included). All of these studies are RCTs on the same grade and same program, and all but 2 compared treatment to no treatment. And yet, a large range of effect sizes was found. The largest impact was found in Iverson 1999, with a mean effect size of 2.59, and the lowest in Schmitt 2004, with a mean effect size of -.50. Again, any scholar with an agenda could pick either of these RCTs and make completely opposite arguments. Even if we take the two highest-quality studies, which are in our opinion Holliman 2013 and the Center for Research in Education and Social Policy (CRESP) 2022, we still get opposite results. Both were large-scale longitudinal RCTs. Holliman 2013 showed a mean effect size of .48, and the CRESP 2022 study showed a mean effect size of -.19.
Inconsistent findings among RCTs create that “are eggs good or bad” sentiment. Scholars on either side of the reading wars are likely to point to the flaws in either study as a defense of their perspective. Indeed, pro-Reading Recovery scholars frequently point to the Holliman study as evidence that Reading Recovery works, and pro-structured-literacy advocates, including Emily Hanford, frequently point to the 2022 CRESP study. Both studies have weaknesses: the Holliman study had poor fidelity controls in the control group, and the CRESP study had high attrition rates. Both compared intensive 1-on-1 reading instruction to no additional instruction, which is not an ideal study design. That said, both studies are of higher-than-average quality compared to the other studies in our 2022 meta-analysis.
Implications for Practice:
Whether trying to measure the efficacy of a principle or of a program, it is incredibly difficult to find a consistent effect across multiple RCTs. This difficulty stems from the fact that education is not a hard science; it is a social science, and there is a large degree of variability in research results. Teacher quality, student motivation, demographics, study design, and study quality all impact the effect size, and controlling for all of these variables consistently is nearly impossible. It therefore does not make rational sense to expect individual RCT studies to produce results that do not vary.
Even if high-quality RCTs did show consistent results, trying to isolate the highest-quality RCTs is very difficult and requires people to make unbiased judgments. People are likely to be more critical of studies that do not confirm their biases and less critical of the ones that do. Only by reviewing all of the relevant studies on a topic can we truly avoid cherry-picking results to support our biases. This does not mean viewing all studies uncritically; instead, a good meta-analysis uses objective criteria to identify how effect sizes vary according to study quality. Using methodologies like moderator variable analysis, regression analysis, and multilevel modeling with meta-analysis is a far more transparent process than simply trying to select the most valid study.
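For readers curious what "systematically synthesizing" effect sizes looks like, here is a minimal fixed-effect pooling sketch, the core computation underlying a meta-analytic mean: each study's effect is weighted by the inverse of its variance, so more precise (typically larger) studies count for more. All numbers below are invented for illustration:

```python
import math

# Hypothetical (effect size d, variance of d) pairs from five studies.
studies = [(0.48, 0.02), (0.31, 0.05), (0.66, 0.04), (0.22, 0.01), (0.43, 0.03)]

# Inverse-variance weights: precise studies get larger weights.
weights = [1 / v for _, v in studies]

# Weighted mean effect across the five studies.
pooled = sum(w * d for (d, _), w in zip(studies, weights)) / sum(weights)

# Standard error and 95% confidence interval of the pooled effect.
se = math.sqrt(1 / sum(weights))
ci = (pooled - 1.96 * se, pooled + 1.96 * se)

print(f"pooled d = {pooled:.2f}, 95% CI = [{ci[0]:.2f}, {ci[1]:.2f}]")
```

A fixed-effect model like this assumes all studies estimate one true effect; in education research, where variability is the rule, random-effects models (which add a between-study variance term) are the more defensible choice, but the weighting logic is the same.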
Written by Nathaniel Hansford
Contributed to by Rachel Schechter.
Last edited: 2/12/2023
- Adesope OO, Lavin T, Thompson T, Ungerleider C. Pedagogical strategies for teaching literacy to ESL immigrant students: a meta-analysis. Br J Educ Psychol. 2011 Dec;81(Pt 4):629-53. doi: 10.1111/j.2044-8279.2010.02015.x. Epub 2011 Jan 6. PMID: 22050311.
- Clougherty, Leah, "Emergent bilinguals and academic language acquisition through the use of sentence frames" (2019). Cal Poly Humboldt theses and projects. 247.
- Filderman, M. J., Austin, C. R., Boucher, A. N., O’Donnell, K., & Swanson, E. A. (2022). A Meta-Analysis of the Effects of Reading Comprehension Interventions on the Reading Comprehension Outcomes of Struggling Readers in Third Through 12th Grades. Exceptional Children, 88(2), 163–184. https://doi.org/10.1177/00144029211050860
- Fitton, L., McIlraith, A. L., & Wood, C. L. (2018). Shared Book Reading Interventions With English Learners: A Meta-Analysis. Review of Educational Research, 88(5), 712–751. https://doi.org/10.3102/0034654318790909
- Jia. (2021). Toward a set of design principles for decoding training: A systematic review of studies of English as a foreign/second language listening education. Educational Research Review., 33, N.PAG.
- Li, R. (2022). Effects of blended language learning on EFL learners’ language performance: An activity theory approach. Journal of Computer Assisted Learning, 38(5), 1273–1285. https://doi.org/10.1111/jcal.12697
- Lv, X., Ren, W., & Xie, Y. (2021). The Effects of Online Feedback on ESL/EFL Writing: A Meta-Analysis. Asia-Pacific Education Researcher (Springer Science & Business Media B.V.), 30(6), 643–653. https://doi-org.ezproxy.lakeheadu.ca/10.1007/s40299-021-00594-6
- Roessingh, H. (2004). Effective High School ESL Programs: A Synthesis and Meta-analysis. Canadian Modern Language Review, 60(5), 611–636. https://doi-org.ezproxy.lakeheadu.ca/10.3138/cmlr.60.5.611
- Rui Li. (2022). Effects of Mobile-Assisted Language Learning on EFL/ESL Reading Comprehension. Journal of Educational Technology & Society, 25(3), 15–29.
- Schenck, Andrew. (2020). Using meta-analysis of technique and timing to optimize corrective feedback for specific grammatical features. Asian-Pacific Journal of Second and Foreign Language Education. 5. 1-20. 10.1186/s40862-020-00097-9.
- Thompson, C. (2020). Video-game based instruction for vocabulary acquisition with English language learners: A Bayesian meta-analysis. Educational Research Review., 30, N.PAG.
- Unkyoung Maeng. (2014). The Effectiveness of Reading Strategy Instruction: A Meta-Analysis. English Teaching, 69(3), 105–127. https://doi-org.ezproxy.lakeheadu.ca/10.15858/engtea.69.3.201409.105
- What Characterizes Comprehensible and Native‐like Pronunciation Among English‐as‐a‐Second‐Language Speakers? Meta‐Analyses of Phonological, Rater, and Instructional Factors. (2021). TESOL Quarterly, 55(3), 866–900. https://doi-org.ezproxy.lakeheadu.ca/10.1002/tesq.3027
- Hansford, N. (2022). A Meta-Analysis and Literature Review of Language Programs. Teaching by Science. Retrieved from <https://www.teachingbyscience.com/a-meta-analysis-of-language-programs>.
- Hansford, N. (2022). Read 180. Teaching by Science. Retrieved from <https://www.pedagogynongrata.com/read-180>.