In graduate school, the economics PhD curriculum gave me a strong foundation in econometrics and theory. But I was restless. I wanted to understand methods that my program did not teach—so I went looking for them across the university.
Over the course of my PhD at the University of Washington, I took classes far outside the standard economics sequence: Bayesian statistics, regression splines, spatial regression analysis, numerical analysis, hierarchical modeling, experimental design, and computer programming. Each of these would prove useful—some immediately, some years later. The Bayesian statistics course, in particular, introduced me to Latent Dirichlet Allocation, a generative probabilistic model for discovering hidden thematic structure in large collections of text. At the time, it was just another algorithm. Within a few years, it would become the basis of a published paper.
The story begins with the Bill & Melinda Gates Foundation’s Prenatal-to-Three (PPI) initiative at the University of Washington. I was scrounging for every possible dataset that could inform the foundation’s strategy on early childhood education policy. One of the sources I stumbled on was the National Conference of State Legislatures (NCSL), which maintained a tracking database of early care and education bills across all 50 states. The full text of each piece of legislation was available online—machine-readable, structured, and covering years of legislative activity.
I wrote a web scraping program to download the entire database: 9,272 records spanning 2008 to 2018. After cleaning—removing bills that died without deliberation, deduplicating records that appeared at multiple stages of the legislative process, and filtering to bills with full text available—the analytic sample included 3,203 unique bills, of which 2,396 had reached a definitive outcome (passed or failed).
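The cleaning steps can be sketched with pandas on toy data. The column names and values below are illustrative stand-ins, not the actual schema of the NCSL database:

```python
import pandas as pd

# Toy stand-in for the scraped NCSL records; column names are hypothetical.
records = pd.DataFrame({
    "bill_id":     ["WA-101", "WA-101", "OR-202", "CA-303", "CA-404"],
    "stage":       ["introduced", "passed_committee", "introduced", "introduced", "introduced"],
    "deliberated": [True, True, False, True, True],
    "full_text":   ["...", "...", "...", None, "..."],
})

cleaned = (
    records[records["deliberated"]]                  # drop bills that died without deliberation
    .dropna(subset=["full_text"])                    # keep only bills with full text available
    .sort_values("stage")
    .drop_duplicates(subset="bill_id", keep="last")  # one row per bill across legislative stages
)
print(len(cleaned))  # 2 unique bills survive in this toy example
```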
The question was simple: what predicts whether an early childhood education bill passes into law? The traditional approach in political science was to look at characteristics of the bill’s sponsor—seniority, party affiliation, committee membership. But nobody had looked at what the bills actually said. This is where the Bayesian statistics course paid off. I applied Latent Dirichlet Allocation to the full text of the legislation to discover the latent topic structure. LDA treats each document as a mixture of topics, and each topic as a distribution over words. It does not know what the topics are in advance—it discovers them from the patterns of co-occurrence in the text.
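The mechanics of LDA can be illustrated with scikit-learn on a tiny invented corpus. The documents and the two-topic setting below are purely for demonstration (the published model used six topics on the full legislative text):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.decomposition import LatentDirichletAllocation

# Tiny invented corpus standing in for the legislative full text.
docs = [
    "prekindergarten funding appropriation revenue tax",
    "child care licensing subsidy provider",
    "revenue tax appropriation fiscal budget",
    "prekindergarten child care quality standards",
    "health services screening child welfare",
    "budget fiscal expenditure appropriation",
]

X = CountVectorizer().fit_transform(docs)

# Each document becomes a mixture over k topics; each topic is a
# distribution over the vocabulary. Neither is specified in advance.
lda = LatentDirichletAllocation(n_components=2, random_state=0).fit(X)
doc_topics = lda.transform(X)   # document-topic proportions; each row sums to 1
topic_words = lda.components_   # topic-word weights

print(doc_topics.shape)  # (6, 2)
```

The document-topic proportions are what feed into downstream models: each bill gets a vector saying how much of it is "about" each discovered topic.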
Model selection favored a six-topic solution, from which two meta-priorities emerged: “ECE finance” (comprising revenues, expenditures, and fiscal governance) and “ECE services” (comprising PreK, child care, and Health and Human Services). We validated these topics against expert knowledge and the existing literature on early childhood policy. The topics were not just statistically coherent—they mapped directly onto the real fault lines of legislative debate.
I then used Hierarchical Generalized Linear Models (HGLM) to predict bill passage from the topic proportions, controlling for sponsor characteristics and the nested structure of the data (bills within legislators within states within years). The key finding was that bills focused on HHS, fiscal governance, or expenditures were more likely to pass, while bills focused on PreK, child care, or revenues were less likely—and that the sponsor’s legislative effectiveness moderated this relationship. Highly effective legislators could pass bills regardless of topic; less effective ones were at the mercy of content.
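The moderation pattern can be sketched with simulated data. This is a simplified pooled logistic regression standing in for the HGLM (it ignores the legislator/state/year nesting), and every coefficient below is invented for illustration, not an estimate from the paper:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
n = 500

# Simulated stand-ins: topic proportions for each bill plus a sponsor
# effectiveness score. All data-generating coefficients are invented.
topic_hhs = rng.uniform(0, 1, n)     # share of bill devoted to an HHS-like topic
topic_prek = 1 - topic_hhs           # share devoted to a PreK-like topic
effectiveness = rng.normal(0, 1, n)  # sponsor's legislative effectiveness

# Passage depends on topic content, effectiveness, and their interaction —
# the moderation pattern described above.
logit = -0.5 + 1.2 * topic_hhs + 0.8 * effectiveness + 0.9 * effectiveness * topic_prek
passed = rng.binomial(1, 1 / (1 + np.exp(-logit)))

X = np.column_stack([topic_hhs, effectiveness, effectiveness * topic_prek])
model = LogisticRegression().fit(X, passed)
print(model.coef_.round(2))
```

The actual HGLM additionally places random intercepts at the legislator, state, and year levels, so that bills sharing a sponsor or a statehouse are not treated as independent draws.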
Soojin Oh Park brought the subject-matter expertise—the deep knowledge of early childhood policy that allowed us to interpret and validate the topics the algorithm discovered. I brought the data pipeline, the NLP, and the statistical modeling. The paper was published in PLOS ONE in 2021 and, to our knowledge, represented the first application of machine learning and NLP methods to the study of early childhood education legislation.
A very different application of these methods came in a marketing data science context: analyzing the effectiveness of marketing expenditures across channels (Google, Facebook, affiliates) and recommending optimal budget allocation. The dataset covered 3,051 weekly observations across 26 geographic markets over two years, with revenue as the outcome and channel-specific spending and impressions as predictors.
The methodological challenge was that marketing effects are not static. Spending on Facebook this week may affect revenue next week. Google spend may crowd out or complement affiliate impressions. The lag structures vary by channel, and the geographic heterogeneity means that what works in one market may not work in another. A simple regression would miss all of this.
I built a progression of increasingly sophisticated models: pooled OLS as a baseline, fixed effects to control for unobserved geographic heterogeneity, hierarchical linear models to allow effectiveness to vary by market, then Granger causality tests and vector autoregressions (VAR) to uncover the temporal dynamics and cross-channel dependencies. The final specification was a dynamic panel model estimated via Generalized Method of Moments (Arellano–Bond), which captured revenue persistence, lagged marketing effects, and unobserved heterogeneity simultaneously while addressing the endogeneity that arises when past revenue influences current spending.
The 138-page analysis showed that Facebook was the most effective channel (highest elasticity, strongest lag effects) despite being allocated the least budget—a clear misallocation. Google showed counterintuitive negative contemporaneous effects but strong positive lag effects, suggesting that its value materializes over time. The dynamic GMM outperformed the random coefficients model in prediction accuracy across all 26 geographies, with 52 times lower average prediction error. The budget reallocation recommendations followed directly from the equimarginal principle: shift spending toward the channel with the highest marginal return per dollar until the elasticities equalize.
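The equimarginal logic has a clean closed form under a stylized diminishing-returns response curve. The betas below are invented placeholders, not the elasticities estimated in the analysis:

```python
import numpy as np

# Stylized response: revenue_c = beta_c * log(1 + spend_c), so the marginal
# return per dollar is beta_c / (1 + spend_c). Betas are illustrative only.
betas = np.array([3.0, 5.0, 2.0])  # e.g. Google, Facebook, affiliates
budget = 100.0

# Equimarginal condition: beta_c / (1 + s_c) = lam for every channel.
# Imposing sum(s_c) = budget gives lam = sum(betas) / (budget + n_channels).
lam = betas.sum() / (budget + len(betas))
spend = betas / lam - 1

marginal = betas / (1 + spend)  # identical across channels by construction
print(spend.round(2))           # the highest-beta channel gets the most budget
```

At the optimum no dollar can be moved between channels to raise revenue, which is exactly the stopping condition for the reallocation recommendations.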
A third strand of ML work emerged in the Early Achievers project, where the challenge was not text or time series but missing data and high dimensionality. With 947 children nested in 156 classrooms and massive missingness across dozens of classroom quality and child outcome measures, I turned to machine learning imputation: gradient boosting machines, random forests, and LASSO, benchmarked against traditional multiple imputation by chained equations (MICE). Ten-fold cross-validation showed where each method outperformed—and where the ensemble approach yielded the most trustworthy out-of-sample predictions. The details are on the Measurement & Modeling page.
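The benchmarking idea can be sketched with scikit-learn's chained-equations imputer, swapping the default Bayesian-ridge learner (a MICE-style baseline) for a random forest. The simulated data and the 20% missingness rate are illustrative, not the Early Achievers data:

```python
import numpy as np
from sklearn.experimental import enable_iterative_imputer  # noqa: F401
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor

rng = np.random.default_rng(2)
n = 300

# Simulated correlated measures standing in for classroom-quality scores.
latent = rng.normal(0, 1, n)
X = np.column_stack([latent + rng.normal(0, 0.3, n) for _ in range(4)])

# Mask 20% of entries at random, remembering the true values.
mask = rng.uniform(size=X.shape) < 0.2
X_missing = X.copy()
X_missing[mask] = np.nan

# MICE-style chained equations (BayesianRidge default) vs a random-forest learner.
mice = IterativeImputer(random_state=0).fit_transform(X_missing)
rf = IterativeImputer(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    random_state=0,
).fit_transform(X_missing)

for name, imputed in [("mice", mice), ("rf", rf)]:
    rmse = np.sqrt(np.mean((imputed[mask] - X[mask]) ** 2))
    print(name, round(rmse, 3))
```

Because the masked true values are retained, each method's error on the held-out cells is directly measurable, which is the same logic as the ten-fold cross-validation benchmark described above.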
- Topic modeling — Latent Dirichlet Allocation (LDA), document-topic and topic-word distributions, model selection via coherence and fit
- Natural language processing — Web scraping, text preprocessing, legislative text analysis at scale
- Time series econometrics — Vector autoregression (VAR), Granger causality, impulse response functions, variance decomposition
- Dynamic panel methods — Arellano–Bond GMM estimation, dynamic multipliers, long-run effect computation
- Ensemble methods — Gradient boosting, random forests, LASSO for imputation and variable selection
- Bayesian methods — Generative probabilistic models, posterior inference, model comparison