Methods

Causal Inference

Randomized experiments, causal mediation, A/B testing at industrial scale, and the identification assumptions that make any of it credible.


Estimating a causal effect is easy. Estimating the right one—and knowing what assumptions are required for it to be the right one—is the entire discipline. My work spans randomized experiments, stepped wedge designs, quasi-experimental methods, and causal mediation, always with an emphasis on identification: what exactly are we claiming, and what would have to be true for that claim to hold?

Causal mediation: the Head Start Impact Study

The Head Start Impact Study (HSIS) is one of the rare large-scale randomized controlled trials in early childhood education. Random assignment of children to Head Start or a control condition provides clean identification of the program’s total effect. But knowing that Head Start works is not the same as knowing how it works. Understanding mechanisms requires moving beyond simple treatment–control comparisons toward a more nuanced causal framework.

This project grew out of my colleague Soojin Oh Park’s Harvard dissertation, which pioneered the idea of applying the Average Causal Mediation Effects (ACME) framework to the HSIS data to examine whether parent–child literacy activities mediated Head Start’s effects on children’s vocabulary and decoding skills—and whether these mechanisms differed for dual language learners. The dissertation was ambitious, comparing both multilevel structural equation modeling (MSEM) and ACME approaches, but the breadth came at the cost of depth. When Soojin and I began collaborating on the publication, I helped reshape the paper: we dropped the MSEM comparison entirely to focus exclusively on the ACME framework, which allowed us to go much deeper on the causal identification and the statistical rigor that peer reviewers would demand.

The methodological contributions I brought to the published version were substantial. For the first research question—whether Head Start affects parenting and child outcomes—I added treatment-on-the-treated (TOT) estimation via two-stage least squares, using random assignment as an instrument for actual program participation. This gave us the local average treatment effect (LATE) alongside the intent-to-treat (ITT) estimate, a distinction that matters when compliance is imperfect (86.6% of treatment-group children attended Head Start, while 14.4% of control children crossed over). I also added standardized effect sizes (Cohen’s d) throughout, which the dissertation had omitted.
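
The logic is compact enough to sketch. Below is a minimal Python illustration, on simulated data with roughly the compliance rates above, of how random assignment instruments participation: the ITT comes from a simple regression on assignment, and 2SLS recovers the LATE. The effect size, sample size, and the use of the linearmodels package are illustrative, not details of the published analysis.

```python
# Minimal ITT vs. TOT/LATE sketch on simulated data; not the HSIS analysis.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
from linearmodels.iv import IV2SLS

rng = np.random.default_rng(0)
n = 4000
z = rng.integers(0, 2, n)                                   # random assignment
# imperfect compliance: ~86.6% take-up under assignment, ~14.4% crossover
d = np.where(z == 1, rng.random(n) < 0.866, rng.random(n) < 0.144).astype(int)
y = 0.25 * d + rng.normal(size=n)                           # true effect of participation: 0.25
df = pd.DataFrame({"y": y, "z": z, "d": d})

itt = smf.ols("y ~ z", df).fit()                            # intent-to-treat
tot = IV2SLS.from_formula("y ~ 1 + [d ~ z]", df).fit()      # 2SLS: z instruments d

print(f"ITT: {itt.params['z']:.3f}  TOT/LATE: {tot.params['d']:.3f}")
# With one instrument and no covariates, 2SLS equals the Wald ratio: the
# ITT effect scaled up by the compliance gap (0.866 - 0.144).
```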

For the comparison of mediation effects between DLLs and non-DLLs, I introduced bootstrap confidence intervals for the difference in ACME across subgroups—a formal statistical test that the dissertation had not performed. Without this test, you can observe that the mediated effects look different between groups, but you cannot claim the difference is statistically meaningful. I also expanded the missing data strategy, implementing multiple imputation by chained equations (MICE) with a battery of robustness checks: inverse probability weighting to adjust for attrition, comparisons of missingness patterns across DLL status, and sensitivity analyses for the sequential ignorability assumption.
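
The subgroup bootstrap is easy to convey in stylized form. The sketch below uses the textbook simplification in which, under sequential ignorability with linear mediator and outcome models and no treatment-mediator interaction, the ACME reduces to a product of coefficients; the published analysis used the full Imai-Keele-Tingley estimator, so the column names and models here are placeholders for the idea, not the implementation.

```python
# Stylized bootstrap test for a subgroup difference in mediated effects.
# Columns y (outcome), m (mediator), t (treatment), dll (subgroup) are
# placeholders; acme() uses the linear product-of-coefficients shortcut.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def acme(df):
    a = smf.ols("m ~ t", df).fit().params["t"]       # treatment -> mediator
    b = smf.ols("y ~ t + m", df).fit().params["m"]   # mediator -> outcome
    return a * b

def acme_gap_ci(df, group="dll", reps=2000, seed=1):
    rng = np.random.default_rng(seed)
    g1, g0 = df[df[group] == 1], df[df[group] == 0]
    gaps = np.empty(reps)
    for r in range(reps):
        s1 = g1.sample(len(g1), replace=True, random_state=int(rng.integers(2**31)))
        s0 = g0.sample(len(g0), replace=True, random_state=int(rng.integers(2**31)))
        gaps[r] = acme(s1) - acme(s0)
    return acme(g1) - acme(g0), np.percentile(gaps, [2.5, 97.5])
# If the 95% percentile interval excludes zero, the DLL / non-DLL gap in
# mediated effects is statistically distinguishable.
```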

Finally, we added a fourth research question that decomposed the mediator into theoretically distinct components—code-focused literacy practices (alphabet, spelling, rhyming) and meaning-focused practices (reading, storytelling)—and traced their differential contributions to decoding and vocabulary outcomes separately for DLLs and non-DLLs. This decomposition revealed that the pathways through which parenting supports literacy are not uniform across linguistic groups—a finding with direct implications for how Head Start programs tailor family engagement. The paper was accepted at Early Childhood Research Quarterly in 2026.

Experimentation at Amazon

At Amazon, I encountered causal inference at industrial scale. The Buy Box team’s new econometric model—a Manski-style logit discrete choice framework I helped build—was not simply deployed. It was tested against the legacy system in a controlled experiment: an A/B test in which a fraction of live customer traffic was routed to the new model while the rest continued on the old hand-tuned coefficients.

The scale changes everything about how you think about experiments. A “small” test might involve millions of customer sessions per day. Statistical significance is not the problem—everything is significant at that sample size. The challenge is detecting meaningful differences in business metrics while accounting for the thousand confounds that come from a live marketplace: seasonal effects, promotional events, inventory fluctuations, competitor behavior.
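
A back-of-the-envelope calculation shows why. For a two-proportion test, the minimum detectable effect shrinks with the square root of traffic; the 5% baseline rate and the session counts in the sketch below are illustrative, not Amazon figures.

```python
# Why "significance" is cheap at scale: the minimum detectable effect (MDE)
# of a two-proportion z-test collapses as sessions accumulate.
from math import sqrt
from scipy.stats import norm

def mde(p, n_per_arm, alpha=0.05, power=0.80):
    """Approximate absolute MDE for a two-sided two-proportion z-test."""
    z = norm.ppf(1 - alpha / 2) + norm.ppf(power)
    return z * sqrt(2 * p * (1 - p) / n_per_arm)

for n in (10_000, 1_000_000, 100_000_000):
    print(f"{n:>11,} sessions per arm -> MDE = {mde(0.05, n):.5%}")
# At 100M sessions per arm the detectable lift is a few thousandths of a
# percentage point -- far below anything a business would act on.
```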

I monitored the experiment and chaired the weekly results meeting where we presented findings to senior leadership. The decisions were consequential: expand the new model to more traffic, modify the parameters, or kill the experiment and go back to the legacy system. Every recommendation had to be grounded in the data, and every anomaly had to be explained—not just flagged.

Amazon’s experimentation infrastructure was itself a lesson in causal thinking. Every customer session is tagged with its treatment assignment. Every downstream action—click, add-to-cart, purchase, return—is linked back to that assignment through trigger recording. The trigger fires at the moment the customer is exposed to the treatment, and everything that follows is attributed to that exposure. Understanding this pipeline—its assumptions, its failure modes, where the stable unit treatment value assumption might break down in a marketplace with network effects—was as important as understanding the econometrics.
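
A toy version of that trigger logic, with invented event types and fields rather than anything from Amazon's actual pipeline, looks like this:

```python
# Trigger-based attribution sketch: a session's downstream events count
# toward an arm only from the moment its trigger fires.
from dataclasses import dataclass
from typing import Optional

@dataclass
class Event:
    session_id: str
    timestamp: float
    kind: str                  # "trigger", "click", "add_to_cart", "purchase", ...
    arm: Optional[str] = None  # assignment, recorded on the trigger event

def attribute(events):
    """Return (arm, event) pairs for all post-exposure events."""
    first_trigger, arm_of = {}, {}
    out = []
    for e in sorted(events, key=lambda ev: ev.timestamp):
        if e.kind == "trigger" and e.session_id not in first_trigger:
            first_trigger[e.session_id] = e.timestamp
            arm_of[e.session_id] = e.arm
        elif e.session_id in first_trigger and e.timestamp >= first_trigger[e.session_id]:
            out.append((arm_of[e.session_id], e))
    return out  # pre-exposure activity is excluded by construction
```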

Field experiments on Mechanical Turk

My dissertation work at the University of Washington used Amazon Mechanical Turk as a field laboratory for labor economics. Mechanical Turk is a real labor market—workers accept tasks, exert effort, and earn money—but with a critical advantage for causal inference: the researcher controls the wage, the task, and the information environment. This makes it possible to run randomized experiments on questions that are nearly impossible to study causally in traditional labor markets.

I used this setting to test competing theories of efficiency wages. The shirking model predicts that higher wages increase effort because workers have more to lose if fired. The sorting model predicts that higher wages attract more productive workers. In a standard labor market, these mechanisms are confounded—you cannot observe the counterfactual of the same worker at a different wage, or the same wage offered to a different applicant pool. On Mechanical Turk, you can randomize both.
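
A stylized version of the two-stage randomization makes the identification strategy concrete. The wage levels and batch structure below are invented for illustration, not the designs of the actual papers:

```python
# Separating the two channels: the posted wage (seen by the applicant pool,
# identifying sorting) varies across recruitment batches, while a surprise
# wage change among workers who already accepted identifies the pure
# incentive (shirking) margin.
import random

POSTED_WAGES = [0.25, 0.50, 1.00]  # dollars per task, shown in the listing

def randomize_posted_wages(n_batches, seed=42):
    rng = random.Random(seed)
    return [rng.choice(POSTED_WAGES) for _ in range(n_batches)]

def randomize_realized_wages(worker_ids, posted, seed=42):
    # surprise raise for a random half: same applicant pool, different wage
    rng = random.Random(seed)
    raised = set(rng.sample(list(worker_ids), k=len(worker_ids) // 2))
    return {w: posted * 2 if w in raised else posted for w in worker_ids}
# Effort differences within a batch isolate incentives; quality differences
# across batches isolate selection.
```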

The first paper tested the shirking and sorting predictions directly by randomly assigning different wages to identical tasks and measuring effort. The second, with Claus Pörtner and Michael Toomim, tested compensating wage differentials—the theory that unpleasant jobs must pay more—by randomizing workers into tasks of varying difficulty at varying wages. The third estimated labor supply elasticities in this low-friction market. Together, these papers demonstrated that online labor markets could serve as rigorous field settings for testing core economic theory, with the kind of clean identification that observational labor data rarely permits.

Stepped wedge design: the ELO study

The Expanded Learning Opportunity (ELO) study at Cultivate Learning was a different kind of experiment—one where the unit of randomization was time, not individuals. The project had two goals: develop and validate a new quality measurement tool for after-school programs (the Quality Seal), and test whether coaching interventions could improve program quality. The design challenge was that you cannot randomly deny coaching to programs that have agreed to participate in a quality improvement effort. The stepped wedge design solves this: every program eventually receives treatment, but the timing of treatment onset is randomized, creating the variation needed for causal identification.

I led the quantitative analysis. The study enrolled two cohorts of after-school programs across Washington state, with three treatment arms: in-person coaching, online coaching, and a hybrid of both. Programs were blocked on site type and pre-experimental quality scores (measured by the Program Quality Assessment), then randomized within blocks to treatment timing using an incomplete block design. This meant that at any given observation point, some programs had already begun receiving coaching while others had not—providing both within-program and between-program variation in treatment exposure.
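
In sketch form, with placeholder field names and wave counts rather than the study's actual configuration, the blocked assignment of treatment timing looks like this:

```python
# Blocked randomization to coaching-onset waves: programs are grouped by
# site type and a baseline-quality stratum, then spread across waves
# within each block.
import random
from collections import defaultdict

def assign_onset_waves(programs, n_waves=3, seed=7):
    """programs: iterable of dicts with 'id', 'site_type', 'pqa_tercile'."""
    rng = random.Random(seed)
    blocks = defaultdict(list)
    for p in programs:
        blocks[(p["site_type"], p["pqa_tercile"])].append(p["id"])
    waves = {}
    for members in blocks.values():
        rng.shuffle(members)
        for i, pid in enumerate(members):
            waves[pid] = 1 + i % n_waves  # wave at which coaching begins
    return waves
```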

The analytical framework was hierarchical linear modeling with crossed random effects—observations cross-classified by program, rater, and site type rather than nested in a single hierarchy—to account for the multiple sources of clustering in the data. The primary outcomes were the PQA and the Expectations and Demands of Children in Care (ECDC), measured at multiple time points before and after coaching onset. The stepped wedge structure meant that control observations came not from a separate group of programs but from the same programs before their treatment began, which controls for time-invariant program characteristics by design.
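
For concreteness, here is a hypothetical specification of that model in Python's statsmodels, which fits crossed (non-nested) random effects by treating the data as a single group and declaring each classification as a variance component. The study's actual estimation may have used different software, and all column names are placeholders:

```python
# Crossed-random-effects model for a stepped wedge outcome. Columns pqa,
# post_onset, wave, program, rater, site_type are illustrative.
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

def fit_stepped_wedge(data: pd.DataFrame):
    model = smf.mixedlm(
        "pqa ~ post_onset + C(wave)",    # onset indicator plus period fixed effects
        data,
        groups=np.ones(len(data)),       # single group: effects are purely crossed
        re_formula="0",                  # no group-level random intercept
        vc_formula={
            "program": "0 + C(program)",
            "rater": "0 + C(rater)",
            "site_type": "0 + C(site_type)",
        },
    )
    return model.fit()
# post_onset flips from 0 to 1 once a program's coaching begins, so
# identification leans on within-program change, netting out time-invariant
# program characteristics.
```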

The results were striking. All three coaching modalities produced statistically significant improvements in program quality, but online coaching had the strongest positive effect on PQA scores—a counterintuitive finding given the field’s bias toward in-person interaction. The cost-benefit analysis made the case even clearer: online coaching was three to five times more cost-effective than in-person delivery, primarily because it eliminated travel time and allowed coaches to serve more programs. This was 2019, months before COVID-19 would make remote delivery not just efficient but necessary.

  • Causal mediation analysis — Average Causal Mediation Effects (ACME) via Imai, Keele & Tingley (2010), sensitivity analysis for sequential ignorability
  • Randomized controlled trials — Intent-to-treat estimation, compliance adjustment, subgroup heterogeneity analysis
  • A/B testing at scale — Trigger-based attribution, multi-metric monitoring, marketplace experimentation with network effects
  • Stepped wedge designs — Randomized treatment timing with incomplete block randomization, HLM with crossed random effects
  • Field experiments — Randomized wage and task assignment on Mechanical Turk for testing labor market theory
  • Quasi-experimental methods — Propensity score matching, difference-in-differences, instrumental variables

Examples

  • Causal mediation: Head Start & DLL Literacy Pathways. ACME framework, 2SLS for TOT, bootstrap tests of subgroup differences, code- vs. meaning-focused decomposition. Park & Hassairi, ECRQ 2026.
  • Industrial experimentation: Amazon Buy Box A/B Testing. Live marketplace experiment comparing the econometric model against the legacy system. Millions of sessions per day. Weekly executive briefings. Amazon, 2015.
  • Stepped wedge trial: ELO Coaching Intervention. Stepped wedge randomized design with three coaching arms. Incomplete block randomization, HLM with crossed random effects, cost-benefit analysis. Cultivate Learning, 2019.
  • Field experiment: Efficiency Wages on Mechanical Turk. Randomized wage assignment to test shirking vs. sorting predictions. Clean identification of effort and selection effects. Hassairi, 2016.
  • Field experiment: Compensating Wage Differentials. Randomized task difficulty and wages to test hedonic wage theory. Pörtner, Hassairi & Toomim, 2016.