Economics gives you theory first, then sends you looking for data. Education often gives you data first—dozens of items on classroom quality instruments, child assessments, parent surveys—and asks you to figure out what it means. This is the domain of structural equation modeling and factor analysis: powerful tools, and dangerous ones if you let them run without discipline.
Measurement instruments in early childhood education are high-dimensional by nature. The Classroom Assessment Scoring System (CLASS) has three domains, ten dimensions, and dozens of behavioral indicators. The Environment Rating Scales (ERS) family—ECERS-R, ITERS-R, FCCERS-R—contains hundreds of items on everything from room arrangement to the quality of teacher–child interactions. Child assessments like the PPVT (vocabulary) and Woodcock-Johnson (decoding) generate multiple subscores. The question that runs through all of this work is deceptively simple: how do you combine these items into scores that are both reliable and valid?
Historically, the answer in the field was the same as the “magic numbers” approach I encountered at Amazon: someone in charge decided. A group of items would be labeled a domain, averaged into a score, and those scores averaged into an overall rating. Whether this aggregation reflected the actual structure of the data—whether the items within a domain really measured one thing, whether different domains contributed equally to child outcomes—was rarely tested. This is where I came in.
My first project at Cultivate Learning—then called CQEL, the Childcare Quality and Early Learning Center for Research and Professional Development at the University of Washington—was the statewide validation of Early Achievers, Washington’s Quality Rating and Improvement System (QRIS). Every state had been incentivized by federal Race to the Top grants to create and validate such a system. The states displayed a compliance mindset in this effort: they did what was required to the letter of their contracts, no more. The question of whether their rating systems actually predicted child outcomes—whether a “Level 4” program was meaningfully better than a “Level 2”—was secondary to the question of whether the grant requirements had been met.
The Department of Early Learning contracted CQEL to conduct the validation. I did all of the statistical analysis. The challenge was substantial: 947 children across 156 classrooms, with massive missingness in the demographic variables (over 60% of households had incomplete data on income, education, or subsidy status), small cell sizes when the data was split by classroom type (13 infant-only classrooms, 25 blended family child care homes), and the fundamental problem that the study was observational—children were not randomly assigned to programs, so selection bias was a constant threat.
The first problem was imputation. With so many missing demographic variables, I needed a strategy that was robust to overfitting while preserving the correlations between variables. I framed this as a prediction problem and compared four approaches: gradient boosting machines (which handle missing data natively and resist overfitting through regularization), random forests, ordinary least squares, and multivariate imputation by chained equations (MICE). I evaluated each using 10-fold cross-validation on out-of-sample predictive accuracy—not in-sample fit, which would have been misleading given the small dataset. Gradient boosting won for income and education; random forest won for subsidy status.
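As a sketch of that comparison (on synthetic stand-in data, since the Early Achievers variables are not reproduced here), the cross-validated model race might look like the following; MICE would be run separately, since it is an iterative chained-equations procedure rather than a single supervised model:

```python
# Sketch of the imputation model comparison: predict one demographic
# variable from the others on observed rows, scored by 10-fold CV.
# Data and variable names here are synthetic placeholders.
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

# Stand-in for the complete-case rows of one variable (e.g. household
# income) regressed on the remaining demographics.
X, y = make_regression(n_samples=600, n_features=8, noise=10.0, random_state=0)

candidates = {
    "gbm": GradientBoostingRegressor(random_state=0),
    "random_forest": RandomForestRegressor(random_state=0),
    "ols": LinearRegression(),
}

# Out-of-sample RMSE via 10-fold cross-validation -- not in-sample fit,
# which would flatter the flexible models on a small dataset.
scores = {
    name: -cross_val_score(
        model, X, y, cv=10, scoring="neg_root_mean_squared_error"
    ).mean()
    for name, model in candidates.items()
}
best = min(scores, key=scores.get)
print(best, round(scores[best], 2))
```

The key design choice is that every candidate competes on the same held-out folds, so the winner can differ by variable, as it did (gradient boosting for income and education, random forest for subsidy status).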
This was a deliberate choice to bring machine learning into a field that had never seen it. The Early Achievers items were a classic high-dimensional problem: many correlated predictors, small sample sizes, the constant risk of overfitting. The field’s standard approach—confirmatory factor analysis followed by multilevel regression—was well-suited to testing pre-specified structures but poorly equipped to discover whether those structures were the right ones. I ran both: the classic factor analyses that the contract required, and the machine learning approaches—gradient boosting, random forests, LASSO, regression trees—that could reveal patterns the factor model couldn’t see.
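One of those discovery-oriented tools, LASSO for variable selection, can be sketched on a hypothetical item matrix; the sample sizes, item counts, and coefficients below are illustrative, not taken from the study:

```python
# Illustrative LASSO variable selection: many correlated quality items,
# one child outcome, a sample size small enough that overfitting is a
# real risk. All numbers here are made up for the sketch.
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(42)
n_classrooms, n_items = 150, 40          # roughly the scale described above
items = rng.normal(size=(n_classrooms, n_items))
# Suppose only a handful of items truly relate to the outcome.
outcome = items[:, :3] @ np.array([0.5, 0.3, 0.2]) \
    + rng.normal(scale=1.0, size=n_classrooms)

X = StandardScaler().fit_transform(items)
lasso = LassoCV(cv=10, random_state=0).fit(X, outcome)

selected = np.flatnonzero(lasso.coef_)   # items the penalty did not zero out
print(f"{len(selected)} of {n_items} items retained:", selected)
```

Unlike a factor model, nothing here presumes which items belong together; the penalty simply zeroes out items that do not earn their keep predictively, which is exactly the kind of structure a pre-specified CFA cannot discover.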
The results were sobering. The relationship between classroom quality measures and child outcomes was weak, explaining less than 10% of the variance, and it was concentrated at the low end of the quality scale—suggesting diminishing returns to quality improvements beyond a basic threshold. I reported this honestly, including a power analysis showing that meaningful detection of these effects would require 200–300 classrooms, not the 100–150 available. The 530-page technical report I produced contained every analysis, every sensitivity check, every robustness test—propensity score matching, spline regressions, quadratic specifications, HLM with multiple imputation—so that nothing was hidden. The published validation report condensed this into a shorter document that went to the state.
Structural equation modeling is not something I learned in economics. Economics comes with theory—utility maximization, market equilibrium, rational choice—and then tests that theory against data. The relationship between theory and measurement is relatively disciplined because the theory constrains what you can claim. In education, the situation is often reversed: you have a large number of observed variables and no strong theory about how they should be related. SEM and factor analysis are designed for exactly this situation—they let you posit latent constructs that explain patterns in observed data and test whether those constructs fit.
The danger is real. Without strong theory, factor analysis becomes a license to find whatever structure you want. Rotate the factors differently, drop an item or two, and a different story emerges. This is the replicability problem in miniature: the analyst has too many degrees of freedom, and the data is not constraining enough to rule out bad models. I approach SEM the same way I approach any statistical method—with explicit attention to what assumptions are required, what sensitivity analyses can probe those assumptions, and what the method cannot tell you.
In the Early Achievers validation, I used confirmatory factor analysis to test whether the state’s predetermined rating structure—its grouping of items into domains and standards—was supported by the data. I tested measurement invariance across program types and age groups. And I compared the state’s aggregation scheme against alternative structures derived from the data itself, including proportion scoring methods for the ERS family that proved more predictive of child outcomes than the traditional scoring approach.
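The proportion-scoring idea is simple enough to show in a toy example. ERS items are scored on a 1–7 scale; the threshold of 5 below (the conventional "good" anchor) is an illustrative assumption, not a claim about the exact cut point used in the validation:

```python
# Toy contrast between mean scoring and proportion scoring for
# ERS-style items on a 1-7 scale. The threshold of 5 is illustrative.
import numpy as np

def mean_score(items):
    """Traditional aggregation: the average of the item scores."""
    return float(np.mean(items))

def proportion_score(items, threshold=5):
    """Proportion scoring: the share of items at or above the threshold."""
    return float(np.mean(np.asarray(items) >= threshold))

classroom = [7, 7, 6, 2, 2, 3, 5]   # uneven quality across items
print(mean_score(classroom))        # ~4.57: looks middling
print(proportion_score(classroom))  # 4 of 7 items at "good" or better
```

The two scores can rank classrooms differently: averaging lets a few very strong items mask weak ones, while the proportion score asks how much of the environment clears a meaningful bar—one plausible reason it tracked child outcomes better.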
- Confirmatory factor analysis — Testing latent structure of CLASS, ERS, and QRIS rating instruments; measurement invariance across groups
- Structural equation modeling — Path models with latent mediators (PCLITACT in HSIS), multi-group SEM for subgroup comparisons
- Multilevel modeling — HLM for children nested within classrooms within sites; random intercepts and slopes; cluster-robust inference
- Dimensionality reduction — Gradient boosting, random forests, LASSO for variable selection and imputation in high-dimensional education data
- Threshold & nonlinearity analysis — Quadratic specifications, spline regressions, propensity score matching to detect nonlinear quality–outcome relationships
- Power analysis — Minimum detectable effect size calculations for clustered designs; sample size recommendations for future validation studies
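The clustered-design MDES calculation in the last bullet can be sketched with the standard two-level formula. The version below uses a normal-approximation multiplier and placeholder values for the intraclass correlation and design parameters—these are not the report's actual inputs:

```python
# Minimum detectable effect size (MDES) for children nested in classrooms,
# using the standard two-level variance formula with a normal-approximation
# multiplier. ICC, cluster size, and alpha/power below are placeholders.
from statistics import NormalDist

def mdes_clustered(J, n, icc, alpha=0.05, power=0.80, p_treat=0.5):
    """MDES in standard-deviation units for J clusters of n children each."""
    z = NormalDist()
    multiplier = z.inv_cdf(1 - alpha / 2) + z.inv_cdf(power)
    # Between-cluster and within-cluster variance contributions.
    var = (icc / (p_treat * (1 - p_treat) * J)
           + (1 - icc) / (p_treat * (1 - p_treat) * J * n))
    return multiplier * var ** 0.5

# More classrooms shrink the detectable effect, which is why the report
# recommended 200-300 classrooms rather than the 100-150 available.
for J in (150, 250):
    print(J, round(mdes_clustered(J=J, n=6, icc=0.15), 3))
```

Because the intraclass correlation term is divided by the number of clusters rather than the number of children, adding classrooms buys far more power than adding children per classroom—the crux of the sample-size recommendation.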