What Separates Tutoring Programs That Work From Ones That Almost Work
A three-year independent study reveals the infrastructure conditions behind a 46-point literacy gain — and what most programs are still missing.
A new technical report from the Center for Educational Measurement and Evaluation at UNC Charlotte offers something genuinely rare in education research: a natural experiment in what happens when a tutoring program gets its operational infrastructure right.
The findings are worth sitting with. Not because they validate any particular product or approach, but because they reframe a question that nonprofit leaders and school administrators spend considerable resources getting wrong.
The question isn't whether tutoring works. It's why the same tutoring program, serving the same population, with the same underlying pedagogy, can produce dramatically different outcomes in different years.
This study has an answer.
The Program, The Problem, and The Numbers
The Augustine Literacy Project-Charlotte (ALP) has delivered one-to-one Orton-Gillingham literacy tutoring in Charlotte-Mecklenburg Schools since 2005: volunteer tutors, 45-minute sessions twice per week, focused on early elementary students in under-resourced communities. The model is well-regarded. The mission is clear.
Using propensity score matching across three academic years (2021-22 through 2023-24), CEME researchers compared literacy outcomes for ALP students against carefully matched peers who did not receive tutoring. The methodology is rigorous: students were matched on baseline literacy scores, demographics, disability status, and school — and a doubly robust estimation approach was used to account for remaining differences.
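The report doesn't publish its estimation code, but the general shape of a doubly robust analysis is easy to sketch. The snippet below is illustrative only: the data file, the column names (tutored, dibels_eoy, and so on), and the use of scikit-learn are assumptions made for the example, not the CEME team's actual pipeline.

```python
# Illustrative sketch of a doubly robust (AIPW) estimate of a tutoring effect.
# The data file and column names are hypothetical; this is not the CEME pipeline.
import pandas as pd
from sklearn.linear_model import LogisticRegression, LinearRegression

df = pd.read_csv("matched_students.csv")        # hypothetical analysis file
covariates = ["baseline_dibels", "grade", "iep_status", "school_id"]
X = pd.get_dummies(df[covariates], columns=["school_id"], drop_first=True)
T = df["tutored"].to_numpy()                    # 1 = ALP tutoring, 0 = matched comparison
Y = df["dibels_eoy"].to_numpy()                 # end-of-year DIBELS score

# 1) Propensity model: probability of receiving tutoring given covariates.
e = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]
e = e.clip(0.01, 0.99)                          # trim extreme scores for stability

# 2) Outcome models fit separately for tutored and comparison students.
mu1 = LinearRegression().fit(X[T == 1], Y[T == 1]).predict(X)
mu0 = LinearRegression().fit(X[T == 0], Y[T == 0]).predict(X)

# 3) AIPW combines both models; the estimate remains consistent if either
#    the propensity model or the outcome model is correctly specified.
ate = (mu1 - mu0
       + T * (Y - mu1) / e
       - (1 - T) * (Y - mu0) / (1 - e)).mean()
print(f"Estimated treatment effect: {ate:.2f} DIBELS points")
```

Matching restricts the comparison to similar students up front; the doubly robust step then guards the estimate against whatever imbalance the matching leaves behind.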
The results tell a before-and-after story that most program evaluations never get to tell.
In 2021-22 and 2022-23, ALP produced positive but statistically non-significant effects on end-of-year DIBELS literacy scores. Encouraging, but not conclusive.
In 2023-24, everything changed.
Both first and second graders showed statistically significant gains of approximately 6 points on end-of-year DIBELS scores over matched peers. First-grade ALP students entered the year with just 7% performing at or above grade level; by year-end, 53.1% had reached grade-level benchmarks, a gain of 46 percentage points that nearly doubled the 23.7-point improvement seen in the comparison group. Second graders showed a 33.4-point gain against the comparison group's 20.7.
The tutoring model itself didn't change. What changed was everything surrounding it.
Lesson 1: You Cannot Evaluate a Model You Cannot Deliver Consistently
One of the clearest takeaways from this study is that the quality of an instructional approach cannot be measured independently of how consistently it reaches students.
Prior to 2023-24, ALP tutors were trained extensively in early literacy instruction and then largely left to operationalize that training on their own. Phase one of the evaluation — published in Literacy Research and Instruction — surfaced what that variability looked like from inside: newer tutors wanted more support, and implementation consistency was uneven across settings.
The 2023-24 transformation addressed this through the adoption of scripted lessons as a universal instructional framework. Critically, this wasn't just a curriculum update. Seventy-two existing tutors were retrained before the school year began, achieving near-total alignment across active tutors.
The result: what happened in a tutoring session became predictable regardless of which volunteer was in the room.
For nonprofit leaders, this is uncomfortable but important. Volunteer-dependent models carry structural variance that good intentions alone cannot resolve. This study suggests that standardization isn't a compromise of the relational, individualized character of tutoring — it appears to have been a prerequisite for that character to consistently emerge at scale.
Lesson 2: Annual Test Scores Are Not a Feedback Loop
A second major shift in 2023-24 was the introduction of session-level reflection forms completed by tutors after every lesson — the first time ALP had a consistent, granular feedback mechanism.
This matters because program improvement requires actionable signals, and annual outcome data provides almost none. By the time end-of-year assessments reveal a problem, twelve months have passed. Session-level data compresses that feedback cycle and enables the kind of real-time course correction the study describes through ALP's adoption of a Plan-Do-Check-Adjust improvement process.
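What that compressed feedback cycle can look like in practice is easy to sketch. The reflection fields and the weekly roll-up below are hypothetical illustrations of a session-level feedback loop, not ALP's actual form or process.

```python
# Minimal sketch of a session-level feedback loop: every lesson produces a record,
# and records roll up into a weekly signal a coordinator can act on within days.
# All field names are hypothetical.
from dataclasses import dataclass
from datetime import date
from collections import defaultdict

@dataclass
class SessionReflection:
    tutor_id: str
    student_id: str
    session_date: date
    lesson_completed: bool          # did the tutor finish the scripted lesson?
    minutes: int                    # actual session length
    notes: str = ""                 # e.g., "student absent", "needs review"

def weekly_completion_rates(reflections: list) -> dict:
    """The 'Check' step of Plan-Do-Check-Adjust: per-tutor lesson-completion rates."""
    total, done = defaultdict(int), defaultdict(int)
    for r in reflections:
        total[r.tutor_id] += 1
        done[r.tutor_id] += int(r.lesson_completed)
    return {tutor: done[tutor] / total[tutor] for tutor in total}

reflections = [
    SessionReflection("T01", "S14", date(2024, 2, 5), True, 45),
    SessionReflection("T01", "S14", date(2024, 2, 8), False, 30, "student absent"),
]
print(weekly_completion_rates(reflections))     # {'T01': 0.5}
```

The "Adjust" step then becomes a conversation with a tutor in week three, not a surprise in June.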
The absence of this infrastructure in prior years didn't mean nothing was being learned. It meant whatever was being learned couldn't be acted on quickly enough to influence outcomes within the same school year.
For administrators evaluating tutoring partners, the presence or absence of session-level data infrastructure is worth treating as a quality indicator — not just an administrative nicety.
Lesson 3: Integration Removes the Friction That Kills Fidelity
In January 2024, ALP launched remote tutoring through an integrated platform that combined scripted lesson delivery, remote instruction, and session data capture in a single environment. ALP's own evaluation found no measurable performance difference between in-person and remote sessions, a finding that matters for equity and that also validates the standardized approach itself.
The deeper operational point is what integration accomplishes that disconnected tools cannot.
When lesson delivery, progress monitoring, and session feedback exist in separate systems, the friction of reconciling them means it rarely happens consistently. Tutors face more administrative burden. Supervisors have less visibility. Variability creeps back in. When these functions are unified in a single environment, data collection becomes structural rather than discretionary, and the platform itself becomes an accountability layer that doesn't depend on any individual tutor's discipline or organization.
This is the mechanism by which the 2023-24 reforms allowed tutors to focus more fully on students — not by reducing rigor, but by systematizing everything that technology could handle so that human attention could go where it actually matters.
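To make "structural rather than discretionary" concrete, consider a purely hypothetical sketch. None of the names below refer to any real platform's API; the point is only where the data record comes from.

```python
# Hypothetical contrast between discretionary and structural data capture.
LESSON_STEPS = ["review", "new_sound", "blending", "dictation", "connected_text"]

def deliver_step(step: str) -> None:
    """Stand-in for presenting one scripted lesson step to the student."""
    print(f"delivering: {step}")

# Disconnected tools: teaching and logging live in separate systems, so the
# session record exists only if the tutor remembers to fill out a form later.
def run_session_disconnected() -> None:
    for step in LESSON_STEPS:
        deliver_step(step)
    # logging happens elsewhere, afterward, if at all

# Integrated platform: advancing through the scripted lesson *is* the data entry.
def run_session_integrated(tutor_id: str, student_id: str, log: list) -> None:
    record = {"tutor": tutor_id, "student": student_id, "steps_completed": 0}
    for step in LESSON_STEPS:
        deliver_step(step)
        record["steps_completed"] += 1   # captured as a side effect of teaching
    log.append(record)                   # nothing separate for the tutor to forget

session_log = []
run_session_integrated("T01", "S14", session_log)
print(session_log)
```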
Lesson 4: Infrastructure Compounds. Results Don't Appear Immediately.
It would be a misreading of this study to conclude that deploying a delivery platform in year three caused a sudden literacy breakthrough.
The 2023-24 results reflect two years of sequential investment: scripted curriculum developed and piloted in spring 2023, standardized training adopted in August 2023, 72 tutors retrained by October 2023, and integrated remote delivery launched in January 2024. The study's best year was built on two prior years of groundwork that didn't show up as statistically significant gains at the time.
This timeline carries its own lesson. Program leaders and funders who evaluate interventions on one-year cycles may systematically underinvest in infrastructure improvements because the returns are deferred. The ALP data suggests that consistent positive-but-non-significant effects in years one and two were not evidence of a weak program — they were evidence of a program approaching the conditions it needed to fully work.
What This Study Actually Proves — and What It Doesn't
It would overreach to conclude that any specific technology causes literacy gains. The 2023-24 results reflect a convergence — standardized curriculum, systematic retraining, session-level data infrastructure, and integrated delivery — not a single variable.
What the study demonstrates is that these conditions, taken together, correlate with meaningfully better outcomes for the students who need the most support. For a population where 54-77% began the year at the highest risk level for reading difficulty, a gain that nearly doubles the comparison group's improvement is not marginal.
The cost data reinforces the point: ALP's cost-per-child dropped 37% between 2021-22 and 2023-24, while the number of students served grew by more than 60%. Better outcomes at lower cost for more students is not what program improvement usually looks like. It's what program improvement looks like when the infrastructure investment is right.
The lesson for education leaders isn't about which tools to adopt. It's about recognizing that instructional quality and operational infrastructure are not separate problems. In tutoring programs especially — where delivery depends on volunteers, fidelity is hard to monitor, and students have the least margin for an inconsistent experience — the systems surrounding instruction are part of the intervention. This study makes that case with unusual clarity.
Further Reading
[Full CEME Technical Report — Program Evaluation for Augustine Literacy Project-Charlotte, December 2025]
Herrera & Lambert (2024). Helping children achieve literacy proficiency: A case study. Literacy Research and Instruction, 64(4), 387–410.