A Beginner's Guide to Reading Research

Evidence-based instruction is commonly used as a label for teaching pedagogies; however, it rarely means what people think it means. For example, both Balanced Literacy advocates and Structured Literacy advocates describe their approaches as evidence-based, yet the types of evidence they draw on are vastly different. When we claim that a teaching strategy is evidence-based, what we mean is that there is research evidence supporting the efficacy of that strategy. But not all types of research evidence are equal. In this article, I will attempt to give the reader a basic understanding of how to assess the efficacy of a teaching strategy and the strength of the evidence behind it.


There are three main types of education research papers. The first is qualitative. Qualitative research tends to be observational and interpretive: researchers usually observe teachers using a specific teaching strategy and then record their observations and hypotheses about what they saw. Qualitative research can be a great place to start, because it can give us hints as to which strategies might be worth exploring further. It can also be useful for explaining why one strategy works better than another, or how a strategy might best be used. That being said, qualitative studies should never be used as definitive proof of efficacy or the lack thereof. Ultimately, a qualitative study is really just a very well thought out anecdote.


The second main type of research is quantitative. Quantitative research usually seeks to create an experiment and measure the results using statistical analysis, most commonly effect sizes. There are many effect size calculations used in the literature, but the most common one is referred to as Cohen's d. Cohen's d is calculated by dividing the mean difference between the intervention and comparison groups by the pooled standard deviation (a measure of how spread out the results are). Effect sizes are meant to be interpreted by their magnitude. An effect size below .20 usually signifies that the result was statistically negligible; .20 is often used as the threshold because it is roughly the average effect size found for a placebo intervention. Within education research, we find that the average study reports an effect size of around .40. This is higher than the averages found in many other fields of study; however, there are some common practices in education research that can inflate effect sizes. That being said, based on my personal experience, anything in the range of .40-.69 should likely be described as moderate or average. Effect sizes between .70 and .99 should likely be considered high, meaning that there is strong evidence that the intervention works. And effect sizes above 1.0 should be considered very strong, meaning there is very strong evidence that the intervention works. It is important to remember that in science we speak in degrees of probability, not absolutes: the higher the effect size we see in the research, the more willing we should be to believe in the efficacy of that strategy, but we should never be truly certain of anything. Of course, the above guidelines are my personal recommendations for education research. Below you can see the interpretation guidelines recommended by Jacob Cohen, the inventor of the formula.

[Figure: Guide to Interpreting Effect Sizes]
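To make the calculation concrete, here is a minimal sketch of Cohen's d, using hypothetical reading scores rather than data from any real study:

```python
import math

def cohens_d(mean_a, mean_b, sd_a, sd_b, n_a, n_b):
    """Cohen's d: the mean difference divided by the pooled standard deviation."""
    pooled_sd = math.sqrt(
        ((n_a - 1) * sd_a**2 + (n_b - 1) * sd_b**2) / (n_a + n_b - 2)
    )
    return (mean_a - mean_b) / pooled_sd

# Hypothetical example: the intervention group averaged 78 (SD 10) and the
# control group averaged 74 (SD 10), with 30 students in each group.
print(round(cohens_d(78, 74, 10, 10, 30, 30), 2))  # 0.4 -- a moderate effect
```

Notice that the same 4-point gain would look twice as large (d = 0.8) if the scores were only half as spread out, which is why effect sizes always have to be read relative to the variability of the measure.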

That all being said, not all experimental studies are created equal. Some can be poorly designed. For example, last year I came across a study in which, in the experimental group, a teacher read a book to a student and then had the student read the book to themselves, while in the control group the teacher simply had the student read the book to themselves. The study showed the experimental group outperformed the control group on comprehension, and the authors concluded in their discussion that this proved the efficacy of “ear-reading”. Of course, this is a terribly designed study for two reasons. Firstly, the students in the experimental group got to read the story twice, while the control group read it only once. Secondly, of course struggling readers understood the text better when a teacher read it to them first; that does not prove the efficacy of “ear-reading” as an instructional strategy.


When we look at quantitative papers, we usually want to see a rigorously designed experiment, a sufficient sample size, and ideally a randomized control group. That being said, many education studies do not use a control group at all. They simply administer a pre-test and a post-test for the intervention and measure the effect size of the difference. The problem with this study design is that we are not really testing the efficacy of the idea compared to regular instruction; we would assume that any time spent on instruction should cause students to learn. When we conduct an experiment, we should be testing whether a teaching method works better than regular instruction. The time frame also really matters: the longer the experiment runs, the longer the students have to learn the curriculum, and the larger the results should be. When we see studies with no control group, or very long time horizons, or worse, both, we should expect larger effect sizes. So when you see a study that has no control group, is carried out over an excessively long time span (such as a year or longer), and still reports a small effect size, you can be reasonably sure that the evidence from that study is extremely weak.
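The pre/post problem can be sketched with hypothetical numbers (and a deliberately simplified subtraction of standardized gains, as a rough stand-in for a proper between-groups comparison): ordinary instruction produces a gain too, so a pre/post-only effect size mixes the two together.

```python
def pre_post_d(pre_mean, post_mean, sd):
    """Standardized gain: (post - pre) divided by the standard deviation."""
    return (post_mean - pre_mean) / sd

# Hypothetical year-long study, scores out of 100, SD of 10 throughout.
intervention_d = pre_post_d(60, 72, 10)  # 1.2 -- looks very strong on its own
control_d = pre_post_d(60, 68, 10)       # 0.8 -- but regular teaching also "worked"

# The honest question is the gain relative to regular instruction.
relative_d = round(intervention_d - control_d, 2)
print(intervention_d, control_d, relative_d)  # 1.2 0.8 0.4
```

Without the control row, the study would have reported 1.2 and looked spectacular; with it, the intervention's advantage over business-as-usual teaching is merely moderate.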


Other things that should make us leery when reading research are very small sample sizes, researcher bias, and a lack of randomization. When a study uses a small sample, this drastically affects the range of possible results and can end up creating distorted effect sizes at both ends of the spectrum. Moreover, we typically see researchers who are very invested in an idea publish studies with higher results than researchers testing other people's hypotheses. This is likely not intentional, but rather a result of the invested researcher doing everything they can to make sure the intervention group is successful. While this is not necessarily wrong, we want to make sure that results are reproducible by the average teacher. Lastly, while a study with a control group is almost always going to be better than a study without one, we ideally want the experimental group and control group to be randomly assigned. This is less important than some of the other points mentioned; however, it can still matter. For example, we would not want the control group to be our weakest students and the experimental group our strongest students, as that would obviously bias the results.
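The small-sample problem is easy to demonstrate with a quick simulation (a sketch with made-up data, not a claim about any real study): draw both groups from the same distribution, so the true effect is zero, and watch how wildly a naive effect size swings when the sample is small.

```python
import random
import statistics

random.seed(1)

def simulated_d(n):
    """Naive effect size when the TRUE effect is zero:
    both groups are drawn from the same normal distribution."""
    treatment = [random.gauss(0, 1) for _ in range(n)]
    control = [random.gauss(0, 1) for _ in range(n)]
    pooled_sd = statistics.pstdev(treatment + control)
    return (statistics.mean(treatment) - statistics.mean(control)) / pooled_sd

for n in (10, 500):
    ds = [simulated_d(n) for _ in range(1000)]
    worst = max(abs(d) for d in ds)
    print(f"n={n}: worst spurious effect size in 1000 simulations = {worst:.2f}")
```

With 10 students per group, pure chance routinely produces "effects" past Cohen's large threshold; with 500 per group, the spurious values stay close to zero. This is why a huge effect size from a tiny study should lower, not raise, our confidence.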


Even if we have one really well done study, we do not typically place a high value on individual studies, because we usually see a range of results in the research. This is often the part of science that the general public gets most wrong, not just in education, but in science in general. For example, I recently did a secondary meta-analysis on morphology and found one study with an effect size of .29 and another with an effect size of 1.24. Obviously, both effect sizes cannot best represent the effect of morphological instruction, so we need a method for determining what is referred to as the scientific consensus. This is where our third main type of research comes into play. Meta-analysis looks at all the studies in an area of research and uses statistical analysis to find the mean result.


Ideally, a meta-analysis weights studies according to design and sample size, so that we don't give equal weight to a study with a sample size of 10 and a study with a sample size of 500. However, this is not always possible, and not all meta-analyses do this. When researchers cannot weight a meta-analysis, they take a simple average of the reported effect sizes, ideally after removing any outliers. Meta-analysis is by far the best way to determine the efficacy of a teaching intervention. However, not all meta-analyses are created equal. For example, I came across a meta-analysis on individualized instruction with an effect size of 2.35. This is an extremely large effect size; however, it was based on only 4 studies. Phonics, on the other hand, usually shows a result of around .45, depending on the meta-analysis, and some of those meta-analyses have over 100 studies behind them. This makes me more confident in the research behind phonics than in the research behind individualized instruction, although I do think both are evidence-based strategies.
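To illustrate why weighting matters, here is a small sketch with hypothetical studies (real meta-analyses typically use inverse-variance weights rather than raw sample sizes, but sample size makes the point simply) comparing a plain average of effect sizes with a weighted one:

```python
def simple_mean_effect(studies):
    """Unweighted average of effect sizes."""
    return sum(d for d, _ in studies) / len(studies)

def weighted_mean_effect(studies):
    """Average of effect sizes weighted by sample size, so big studies count more."""
    total_n = sum(n for _, n in studies)
    return sum(d * n for d, n in studies) / total_n

# Hypothetical studies: (effect size, sample size)
studies = [(0.29, 500), (1.24, 10), (0.45, 200)]

print(round(simple_mean_effect(studies), 2))    # 0.66 -- the tiny study drags it up
print(round(weighted_mean_effect(studies), 2))  # 0.35 -- the big studies dominate
```

One 10-student outlier nearly doubles the unweighted estimate, which is exactly the distortion that weighting (or trimming outliers) is meant to prevent.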


Tangentially, one final type of research I will cover is secondary meta-analysis. Secondary meta-analysis is a strategy popularized in education by John Hattie, and is something I often do myself on my website. Secondary meta-analyses are meta-analyses of multiple other meta-analyses (how meta is that?). This approach is sometimes criticized for being too broad, as it can end up comparing research that is hard to compare, i.e. different student populations, sample sizes, effect size calculations, and types of research. However, I am personally a big fan of this type of research, as it allows people to digest large amounts of education research quickly and to identify which teaching strategies have strong evidence behind them and which do not. As an example, I will share an infographic from my 2021 secondary meta-analysis on commonly used teaching strategies.

Written by
Nathaniel Hansford

Last Edited: 2021-12-19
