How Big is Big: A Beginner's Guide to Understanding Effect Sizes
In this article, I attempt to provide teachers with a basic framework for understanding effect sizes. The interpretation of effect sizes is case-specific, so frameworks like this cannot be applied universally. That said, I think it is helpful for teachers to have a basic understanding of the topic, to avoid the pitfalls of over-reliance on education influencers and celebrities.
Why We Need Effect Sizes:
I often see people claim a pedagogy works because a peer-reviewed study showed a positive or statistically significant finding for its use. However, this is a particularly bad metric because, as Ropovik et al. (2021) point out, education research has a serious problem with publication bias. While 83% of independently funded studies show statistically insignificant results (Kraft, 2018), most peer-reviewed education studies show significantly positive results. In my opinion, this is why it is important to compare pedagogies using meta-analyses (systematic reviews of studies) with effect sizes.
In theory, an effect size is a standardized mean difference. It is supposed to allow readers to compare the findings of one study with those of another and judge the "magnitude of effect". To put it more simply, using meta-analysis effect sizes in theory allows us to compare how much different teaching methods improve learning. By comparing the improvements produced by different teaching methods, we can correct for the problem that virtually all education research shows positive findings due to publication bias.
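To make the idea of a standardized mean difference concrete, here is a minimal sketch of Cohen's d, the most common effect-size formula: the difference between the group means divided by the pooled standard deviation. The test scores below are entirely hypothetical and are only there to show the arithmetic.

```python
import math

def cohens_d(treatment, control):
    """Standardized mean difference (Cohen's d) between two groups."""
    n1, n2 = len(treatment), len(control)
    mean1 = sum(treatment) / n1
    mean2 = sum(control) / n2
    # Pooled standard deviation, weighting each group's sample variance
    var1 = sum((x - mean1) ** 2 for x in treatment) / (n1 - 1)
    var2 = sum((x - mean2) ** 2 for x in control) / (n2 - 1)
    pooled_sd = math.sqrt(((n1 - 1) * var1 + (n2 - 1) * var2) / (n1 + n2 - 2))
    return (mean1 - mean2) / pooled_sd

# Hypothetical test scores: the treatment group averages higher than the control
treatment = [78, 82, 85, 74, 90, 81]
control = [70, 75, 80, 72, 77, 74]
print(round(cohens_d(treatment, control), 2))  # prints 1.5
```

Because the difference is expressed in standard-deviation units rather than raw score points, two studies that used completely different tests can, in principle, be compared on the same scale.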
The Problem with Effect Sizes:
When I first read about meta-analysis research in education, I found the concept inspiring. It made me believe that we could easily rank teaching methods and programs with one simple metric. Unfortunately, this was a bit naive. While I still believe that using meta-analyses and effect sizes to compare approaches is the best method for evaluating them, I must admit that doing this well is terribly complex.
In 2019, I interviewed Dylan Wiliam on the challenges of meta-analysis. I must admit, he shattered my naive worldview at the time and made me realize just how terribly complex evaluating pedagogies is. Moreover, I think much of what I have done with Pedagogy Non Grata since then has been in response to that moment. Whenever I analyze programs and pedagogies, I continually think about the limitations of meta-analysis as pointed out by Dylan Wiliam, and how best to correct for them.
I think the biggest challenge with evaluating pedagogies using meta-analysis effect sizes is that different study designs tend to produce different ranges of results. In general, more rigorous studies tend to show lower effect sizes. Similarly, the closer the instruction is between a treatment group and a control group, the lower the effect size tends to be. For example, I have recently been asked to help out with a study on handwriting instruction, comparing the benefits of teaching two different handwriting fonts. In this scenario, I would fully expect to see a very small effect size, because the difference between the treatment group's and the control group's instruction is so slight. Conversely, there are many studies that compare systematic phonics instruction with word memorization instruction. In that scenario, I would expect to see a much larger effect size, because the difference in instruction is quite meaningful.
In my personal experience, both researchers and teachers tend to have their own biases in interpreting effect sizes. Teachers often make the same mistake that I did when first learning about effect sizes and assume bigger is always better, without taking the time to understand the various nuances involved. And I don't blame them. Teaching is a difficult job; not everyone is going to want to spend hours trying to learn how to better interpret experimental statistics. Researchers, on the other hand (in my experience), often want to assume very small effect sizes are more meaningful than they are. Researchers often want to explore their own hypotheses, and I think it's difficult to admit when a hypothesis doesn't pan out. I often see researchers claim that an effect size was very small but should still be seen as significant because the study was especially rigorous. I have even seen researchers claim that negative effect sizes (when a treatment group does worse than a control group) supported the use of the treatment pedagogy.
Fortunately, we do have some research on how effect sizes range according to different study designs. In this article, I review this research and attempt to give teachers a general guide, with reference points, for interpreting effect sizes.
When Jacob Cohen invented the effect size, he proposed the following benchmarks:
Below .20 = negligible
Between .20 and .39 = small
Between .40 and .79 = moderate
Between .80 and 1.19 = large
Above 1.20 = very large
The Basic Interpretation:
On average, this model largely works. As both Ropovik et al. (2021) and Hattie (2008) point out, the average education study shows an effect size of roughly .40. The average education study falling precisely at the benchmark for moderate evidence does seem to justify the common interpretation. However, as many have pointed out, this model's reliability falls apart when we look at more specific study designs. That mean effect size of .40 comes from a wide variety of studies: the less rigorous studies pull the mean up, and the more rigorous studies pull it down.
One of the most common types of education studies measures learning from beginning to end without using a control group. This is often referred to as a case study or a single-group design study. The advantage of this design is that it is incredibly easy to do; in my experience, most quantitative education studies use it. The problem with this design is that we assume students will learn over time anyway, so any learning that happens cannot be attributed to the treatment teaching method. For this reason, many meta-analysis authors exclude all case studies. While I generally agree with excluding case studies, doing so creates some challenges. Most quantitative studies in education are case studies, so by excluding them we exclude most education research. Case studies are also by far the cheapest studies to conduct, which means that excluding them can unintentionally bias research towards bigger, wealthier companies.
In 2014, Luke Plonsky and Frederick L. Oswald analyzed 346 primary studies and 91 meta-analyses to find the difference in effect sizes between studies that use control groups and those that do not. They proposed the following guidelines:
Below .40 = negligible
Between .40 and .70 = weak
Between .70 and 1 = average
Above 1 = strong
While Plonsky and Oswald (2014) give us a tool for interpreting case studies, it is important to remember that these studies are still only correlational, and even large effect sizes may not indicate that a teaching method works. Moreover, their guidelines do not help us interpret more rigorous studies. Indeed, many claim that large RCTs show substantially lower effect sizes. In my experience, randomization does not significantly impact study results. However, sample size (Serdar, 2021) and the assessments used do (Silverman, 2020).
Standardized assessments likely lower effect sizes because they are harder to teach to, whereas custom-made assessments are often more closely aligned to the instruction. Researchers refer to this as "distal" vs. "proximal" assessments. While there is likely an appropriate time to use both types of assessments in research, it is easier to inflate results with custom/proximal assessments.
Sample size likely impacts effect sizes because less scrupulous authors can subdivide samples into multiple smaller samples and publish only the positive results. For example, if a company is conducting a study in a district, it can instead run a study on each school and publish only the studies with the best results. This allows them to "fish" for better findings. While there are tools to spot this behavior, like funnel plots, they are not always used (Lee, 2012). However, such unscrupulous practices are less likely in much larger studies, as the practice becomes more difficult and more expensive with larger samples. This would potentially explain why large-scale studies show lower effect sizes on average (Serdar, 2021).
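To illustrate why small subdivided samples make "fishing" so easy, the simulation below sketches a hypothetical district study in which the true effect is exactly zero, splits it into twenty small school-level samples, and reports only the best one. All of the numbers here are invented for illustration; nothing is drawn from a real study.

```python
import random
import statistics

random.seed(42)  # fixed seed so the simulation is reproducible

def cohens_d(treatment, control):
    """Cohen's d, using the average of the two sample variances (equal group sizes)."""
    pooled_sd = ((statistics.variance(treatment) + statistics.variance(control)) / 2) ** 0.5
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd

# Simulate a district study with NO true effect: both groups come from
# the same score distribution
district_treatment = [random.gauss(0, 1) for _ in range(700)]
district_control = [random.gauss(0, 1) for _ in range(700)]

# The honest, whole-district effect size should land near zero
print(f"district d = {cohens_d(district_treatment, district_control):.2f}")

# "Fishing": split the district into 20 small school-level samples
# and report only the most flattering result
school_ds = []
for i in range(20):
    t = district_treatment[i * 35:(i + 1) * 35]
    c = district_control[i * 35:(i + 1) * 35]
    school_ds.append(cohens_d(t, c))
print(f"best school d = {max(school_ds):.2f}")
```

Even with no real effect, the best of twenty small samples will typically show a respectable-looking positive effect size purely by chance, which is exactly the distortion that funnel plots are designed to expose.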
In 2023, Elizabeth Reenstra, Pamela Aitchison, Dr. Rachel Schechter, and I conducted a review of large-scale language program studies reviewed by Evidence for ESSA that had received a strong rating. To qualify for this rating, studies had to be randomized, have a minimum sample size of 700, use standardized assessments, and have those assessments conducted by unbiased third parties.
In total, 63 studies met these inclusion criteria, and we found a mean effect size of .17. I used a quartile analysis to identify a possible guideline for interpretation. Effect sizes in the first quartile fell between .10 and .14; in the second quartile, between .15 and .23; in the third quartile, between .24 and .45; effect sizes above .45 fell in the final quartile. Based on these results, I would suggest using those quartile cut points as reference points when interpreting most RCT effect sizes based on standardized assessments.
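A quartile analysis like the one described above is straightforward to reproduce: sort the collected effect sizes and find the three cut points that split them into four equal groups. The effect-size list below is hypothetical, standing in for the real 63 study results, which are not reprinted here.

```python
import statistics

# Hypothetical effect sizes from a set of rigorous RCTs (illustrative only,
# NOT the actual 63 results from the Evidence for ESSA review)
effect_sizes = [0.05, 0.10, 0.12, 0.14, 0.15, 0.18,
                0.23, 0.24, 0.30, 0.45, 0.52, 0.70]

# statistics.quantiles with n=4 returns the three cut points that
# divide the data into quartiles
q1, q2, q3 = statistics.quantiles(effect_sizes, n=4)
print(f"mean effect size = {statistics.mean(effect_sizes):.2f}")
print(f"quartile cut points: {q1:.2f}, {q2:.2f}, {q3:.2f}")
```

Studies falling below the first cut point would read as weak relative to their peers, those between the first and third cut points as typical, and those above the third cut point as unusually strong for this study design.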
Of course, as Dylan Wiliam (2019) pointed out to me, effect sizes are also age-dependent. On average, studies on younger students show higher effect sizes. This may happen because their curriculum is easier, and students can therefore make progress more quickly. In 2018, Matthew Kraft explored this factor by analyzing 481 rigorous RCTs on students in upper elementary and secondary school. To be included in the analysis, studies had to be randomized, use standardized assessments, provide equivalent instruction for both the treatment and control group, and focus on older students. Kraft (2018) proposes the following benchmarks for interpreting the effect sizes of rigorous RCTs on older students:
Below .05 = small
Between .05 and .20 = medium
Above .20 = large
Truthfully, I find Kraft's (2018) findings very difficult to accept, because they assume all positive results are significant without controlling for publication bias. That said, I do think they help to highlight that very small effect sizes can be significant for very rigorous studies on older students.
Summary Points to Remember:
- On average, education studies show an effect size of approximately .40. That said, the more rigorous a study is and the older the student population is, the lower the effect size we should expect.
- Because of publication bias, most education studies show positive results. It is therefore necessary to compare the results of studies to judge which pedagogies are truly effective.
- Studies without control groups can provide interesting preliminary data; however, their findings are only correlational.
- The larger the differences are between a treatment group's and a control group's instruction, the larger the effect size we should expect.
- Studies with proximal assessments and small sample sizes are more prone to bias. This does not mean such studies are less valid, just that they sometimes show inflated results.
Written by Nathaniel Hansford
Last Edited 2024/02/05
Wiliam, D. (2019). Interview with Dylan Wiliam. Pedagogy Non Grata. https://podcasters.spotify.com/pod/show/pedagogynongrata/episodes/Interview-with-Dylan-Wiliam-continued-The-Limitations-of-Meta-Studies---Episode-34-egjk5a/a-a2m49v9
Kraft, M. (2018). Interpreting Effect Sizes of Education Interventions. Brown University.
Hansford, N., Reenstra, E., Aitchison, P., & Schechter, R. (2023). What is the best language program? Teaching by Science. https://www.teachingbyscience.com/what-is-the-best-language-program
Ropovik, I., Adamkovic, M., & Greger, D. (2021). Neglect of publication bias compromises meta-analyses of educational research. PloS one, 16(6), e0252415. https://doi.org/10.1371/journal.pone.0252415
Serdar, C. C., Cihan, M., Yücel, D., & Serdar, M. A. (2021). Sample size, power and effect size revisited: simplified and practical approaches in pre-clinical, clinical and laboratory studies. Biochemia medica, 31(1), 010502. https://doi.org/10.11613/BM.2021.010502
Silverman, R., Johnson, E., Keane, K., & Khanna, S. (2020). Beyond Decoding: A Meta-Analysis of the Effects of Language Comprehension Interventions on K–5 Students' Language and Literacy Outcomes. Reading Research Quarterly, 55. https://doi.org/10.1002/rrq.346
Plonsky, L., & Oswald, F. L. (2014). How Big Is "Big"? Interpreting Effect Sizes in L2 Research. Language Learning, 64, 878-912. https://doi.org/10.1111/lang.12079
Lee, W., & Hotopf, M. (2012). Core Psychiatry (Third Edition). Science Direct.