At Amazon, the data was not a sample. It was the population—every click, every purchase, every seller offer, every second of every day. The challenge was never whether you had enough data. It was whether your methods could survive contact with it.
In 2014 I joined Amazon’s Customer Behavior Analytics team as a Research Economist, working under Patrick Bajari—Amazon’s chief economist and one of the most cited econometricians in the world—alongside Harry Paarsch, Konstantin Golyaev, and Gregory Duncan. The project was the buy box: the algorithm that decides which seller’s offer gets the “Add to Cart” button when multiple sellers compete on the same product.
The buy box is arguably the single most consequential allocation mechanism on the internet. Every day, it routes billions of dollars in commerce. The legacy system used hand-tuned coefficients—“magic numbers”—set by business analysts who adjusted them based on intuition and A/B test results. The economics team’s mandate was to replace these with econometrically estimated parameters from a structural model of consumer choice.
The model was a Manski-style logit discrete choice framework. Customers arriving at a product page face a set of seller offers that vary by price, shipping speed, seller reputation, and fulfillment channel. The model estimates the probability that a customer chooses each offer—or chooses not to buy at all—as a function of these attributes. The “outside option”—the probability of walking away—is what makes it a Manski model rather than a standard conditional logit.
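The core probability calculation can be sketched as follows. This is an illustrative toy in Python, not the production code: the attribute columns and coefficient values are hypothetical, and the outside option is handled the standard way, by normalizing its utility to zero so the no-purchase probability falls out of the same logit denominator.

```python
import numpy as np

def choice_probabilities(X, beta):
    """Logit choice probabilities over seller offers, with an
    outside (no-purchase) option whose utility is normalized to 0.

    X    : (J, K) matrix of offer attributes (e.g. price, shipping days, rating)
    beta : (K,) coefficient vector
    Returns a (J+1,) vector; the last entry is the outside option.
    """
    v = X @ beta                    # deterministic utility of each offer
    m = max(v.max(), 0.0)           # subtract the max for numerical stability
    ev = np.exp(v - m)
    ev0 = np.exp(0.0 - m)           # outside option: utility normalized to 0
    denom = ev0 + ev.sum()
    return np.append(ev / denom, ev0 / denom)

# Three hypothetical offers: (log price, shipping days, seller rating)
X = np.array([[3.0, 2.0, 4.8],
              [2.9, 5.0, 4.2],
              [3.1, 1.0, 4.9]])
beta = np.array([-1.5, -0.3, 0.8])  # illustrative coefficient values
p = choice_probabilities(X, beta)   # probabilities sum to 1 across offers + walk-away
```

The outside option is what lets the model say something about demand, not just market share among offers: raising every price shifts probability mass toward walking away rather than merely reshuffling it among sellers.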
My first assignment was a feasibility test. Bajari wanted to know if R could handle the estimation at Amazon’s scale. The answer, after weeks of testing, was no. The data was too large, the iterations too slow, and R’s memory management couldn’t cope with the matrix operations required. We moved the estimation pipeline to Stata, which handled it. This was one of those early lessons in the difference between academic-scale computation and production-scale computation—a lesson that shaped everything I did afterward.
Raw coefficient estimates from a logit model at this scale are noisy. A coefficient might imply that customers prefer slower shipping or higher prices—artifacts of collinearity, sparse cells, or endogeneity in the raw data. In an academic paper, you might note these anomalies and move on. In a production system that routes billions of dollars, you cannot.
The solution was a post-estimation regularization step: a quadratic programming problem that takes the raw coefficients and finds the nearest set of parameters that satisfy economic constraints—monotonicity (customers prefer lower prices, faster shipping, better sellers) and box constraints (no coefficient can be absurdly large or small). The objective function minimizes the squared distance from the original estimates subject to these constraints, solved using FICO’s Xpress optimization engine via its Mosel modeling language.
The experience of translating economic intuition into formal mathematical constraints—and watching a commercial solver find a feasible solution in seconds over parameter spaces that would take hours to search manually—was formative. It was my first time working at the intersection of econometrics and operations research, and the regularized parameters went into production as part of the buy box algorithm that served hundreds of millions of customers.
The new econometric model didn’t just get deployed. It was tested against the legacy system in a controlled experiment—an A/B test where a fraction of live traffic was routed to the new model while the rest continued on the old magic numbers. My role was to monitor the experiment: tracking conversion rates, revenue per session, defect rates, and customer satisfaction metrics across treatment and control.
The scale of Amazon experimentation is unlike anything in academic research. A “small” test might involve millions of customer sessions per day. Statistical significance is not the problem—everything is significant. The challenge is detecting meaningful differences in business metrics while accounting for the thousand confounds that come from a live marketplace: seasonal effects, promotional events, inventory fluctuations, competitor behavior. I chaired the weekly results meeting where we presented findings to an L6 executive—Amazon’s equivalent of a senior director—and made recommendations about whether to expand, modify, or kill the experiment.
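The "everything is significant" point is easy to make concrete. With hypothetical numbers—a 10% baseline conversion rate, a 0.05-percentage-point lift, and tens of millions of sessions per arm—a standard two-proportion z-test returns an overwhelming test statistic for an effect that may or may not matter to the business:

```python
import math

def two_proportion_z(p1, p2, n1, n2):
    """z statistic for a difference in conversion rates between two arms."""
    pooled = (p1 * n1 + p2 * n2) / (n1 + n2)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n1 + 1 / n2))
    return (p2 - p1) / se

# Illustrative numbers: a 0.5% relative lift, 20 million sessions per arm
z = two_proportion_z(0.1000, 0.1005, 20_000_000, 20_000_000)
# z comes out above 5 -- "significant" by any threshold, which is exactly
# why the decision hinges on effect size and confounds, not the p-value
```

At this scale the confidence interval is so tight that the real analytical work is deciding whether the point estimate survives the seasonal, promotional, and inventory confounds the text lists.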
The experimentation infrastructure itself was a revelation. Every customer session was tagged with its treatment assignment. Every downstream action—click, add-to-cart, purchase, return—was linked back to that assignment through what Amazon calls “trigger recording.” The trigger fires at the moment the customer is exposed to the treatment, and everything that happens afterward is attributed to that exposure. The analytical pipeline processes these billions of trigger-outcome pairs into the summary statistics that land on the executive’s desk. Understanding this pipeline—its assumptions, its failure modes, its blind spots—was as important as understanding the econometrics.
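The attribution logic of trigger recording can be sketched in a few lines. This is a hypothetical minimal model, not Amazon's pipeline: a session's downstream events count toward a treatment arm only if the trigger fired for that session before the event occurred, and sessions that never triggered are excluded from the experiment entirely.

```python
from collections import defaultdict

# Hypothetical data: session_id -> (trigger_time, treatment_arm)
triggers = {
    "s1": (100, "T"),
    "s2": (105, "C"),
}
# (session_id, event_time, event_type)
outcomes = [
    ("s1", 120, "purchase"),
    ("s2", 101, "purchase"),   # before s2's trigger fired: not attributed
    ("s2", 110, "click"),
    ("s3", 130, "purchase"),   # session never triggered: excluded entirely
]

def attribute(triggers, outcomes):
    """Count outcomes per (arm, event_type), attributing only post-exposure
    actions of triggered sessions."""
    counts = defaultdict(int)
    for sid, t, event in outcomes:
        if sid in triggers:
            t_trig, arm = triggers[sid]
            if t >= t_trig:            # only actions after exposure count
                counts[(arm, event)] += 1
    return dict(counts)

counts = attribute(triggers, outcomes)
# → {('T', 'purchase'): 1, ('C', 'click'): 1}
```

The two excluded rows illustrate the failure modes worth understanding: an outcome that precedes its trigger is dropped, and untriggered traffic never enters the denominator, so a bug in trigger placement silently biases every downstream metric.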
The skills I developed at Amazon—working with data that doesn’t fit in memory, building automated pipelines, thinking about computational efficiency—carried directly into my subsequent work with large federal datasets. The Census Bureau’s American Community Survey Public Use Microdata Sample (ACS PUMS) contains millions of person-level records. The Common Core of Data (CCD) covers every public school in the United States. The ACF CCDF administrative data tracks childcare subsidy enrollment across all states over time.
For the state factsheets I produced as part of the PPI project, I built automated analysis pipelines in R and Stata that pulled ACS PUMS microdata, applied design weights, computed population estimates by income group, ethnicity, and metropolitan status, merged these with CCDF enrollment counts and NIEER pre-K data, and produced formatted output ready for visualization. The pipeline ran for each focus state—Washington, Oregon, Tennessee—with state-specific parameters but a common analytical framework.
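The weighted-tabulation step at the heart of that pipeline is simple in structure. The actual work was done in R and Stata; the sketch below, in Python with invented records, shows the principle: each ACS PUMS person record carries a design weight, and a population estimate for a subgroup is the sum of weights over its records.

```python
from collections import defaultdict

# Hypothetical person-level records: (income_group, metro_status, design_weight)
records = [
    ("low", "metro",    85),
    ("low", "nonmetro", 120),
    ("mid", "metro",    95),
    ("low", "metro",    110),
]

def weighted_totals(records):
    """Population estimates by subgroup: sum of design weights per cell."""
    totals = defaultdict(float)
    for group, metro, weight in records:
        totals[(group, metro)] += weight
    return dict(totals)

est = weighted_totals(records)
# est[("low", "metro")] is 195: two sample records standing in for an
# estimated 195 people in that cell
```

Everything else in the pipeline—merging in CCDF enrollment counts, state-specific parameters, formatted output—wraps around this one operation, which is why getting the weights right matters more than any downstream step.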
This is the unglamorous heart of empirical research: building the infrastructure that turns raw data into reliable answers. It is not the econometrics or the identification strategy that takes the most time. It is the data engineering—the cleaning, the merging, the validation, the pipeline that must run correctly every time because everything downstream depends on it.
- Mosel / FICO Xpress — Quadratic programming for coefficient regularization with monotonicity and box constraints
- Stata — Large-scale logit estimation, complex survey analysis with design weights, automated tabulation pipelines
- R — Data visualization (ggplot2), ACS PUMS microdata processing, statistical computing
- SQL / AWS — Amazon data warehouse queries, EC2 compute instances for estimation jobs, S3 data storage
- Amazon experimentation platform — A/B testing infrastructure, trigger recording, treatment assignment, metric computation at scale