Job Market Paper
The Need for Equivalence Testing in Economics. MetaArXiv, 2025.
[ Abstract | Draft | Online Appendix | Stata Command | R Package | Shiny App | 30-Minute Presentation | Institute for Replication Discussion Paper (Older Version) | Interview: Economisch Statistische Berichten (in Dutch) ]
Equivalence testing can provide statistically significant evidence that economic relationships are practically negligible. I demonstrate its necessity in a large-scale reanalysis of estimates defending 135 null claims made in 81 recent articles from top economics journals. 36-63% of estimates defending the average null claim fail lenient equivalence tests. In a prediction platform survey, researchers accurately predict that equivalence testing failure rates will significantly exceed levels that they deem acceptable. Obtaining equivalence testing failure rates that these researchers deem acceptable requires arguing that nearly 75% of published estimates in economics are practically equal to zero. These results imply that Type II error rates are unacceptably high throughout economics, and that many null findings in economics reflect low power rather than truly negligible relationships. I provide economists with guidelines and commands in Stata and R for conducting credible equivalence testing and practical significance testing in future research.
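For readers unfamiliar with the mechanics, here is a minimal sketch of the two one-sided tests (TOST) logic behind equivalence testing, written in base R. The estimate, standard error, and equivalence bounds are hypothetical placeholders rather than values from the paper; in practice, the Stata command and R package linked above should be used instead.

```r
# Minimal TOST equivalence test from a point estimate and standard error.
# All numbers below are hypothetical placeholders, not values from the paper.
tost <- function(estimate, se, lower, upper, df = Inf) {
  t_lower <- (estimate - lower) / se   # test H0: effect <= lower bound
  t_upper <- (estimate - upper) / se   # test H0: effect >= upper bound
  p_lower <- pt(t_lower, df, lower.tail = FALSE)
  p_upper <- pt(t_upper, df, lower.tail = TRUE)
  # Equivalence is declared only if both one-sided tests reject.
  c(p_lower = p_lower, p_upper = p_upper, p_tost = max(p_lower, p_upper))
}

# Example: estimate of 0.02 with SE 0.05, equivalence bounds of +/- 0.2.
tost(estimate = 0.02, se = 0.05, lower = -0.2, upper = 0.2)
```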
Published and Forthcoming Articles
Is There a Foreign Language Effect on Workplace Bribery Susceptibility? Evidence from a Randomized Controlled Vignette Experiment (with Paul Stroet, Arjen van Witteloostuijn, and Kristina S. Weißmüller). Journal of Business Ethics 197, 73-97, 2025.
[ Abstract | Article (Open Access) | Draft | Code ]
Theory and evidence from the behavioral science literature suggest that the widespread and rising use of lingua francas in the workplace may affect the ethical decision-making of individuals who must use foreign languages at work. We test the impact of foreign language usage on individual susceptibility to bribery in workplace settings using a vignette-based randomized controlled trial in a Dutch student sample. Results suggest that there is not even a small foreign language effect on workplace bribery susceptibility. We combine traditional null hypothesis significance testing with equivalence testing methods, novel to the business ethics literature, that can provide statistically significant evidence of bounded or null relationships between variables. These tests suggest that the foreign language effect on workplace bribery susceptibility is bounded below even small effect sizes. Post hoc analyses suggest fruitful routes for further experimental research into bribery.
US States That Mandated COVID-19 Vaccination See Higher, Not Lower, Take-Up of COVID-19 Boosters and Flu Vaccines. Proceedings of the National Academy of Sciences 121(41), e2403758121, 2024.
[ Abstract | Article (Open Access) | Draft | Data & Code, Published Replication | Reply | Response to Reply | Data & Code, Response to Reply | Twitter/X Thread ]
Rains & Richards (2024, Proceedings of the National Academy of Sciences) find that, compared to US states that instituted bans on COVID-19 vaccination requirements, states that imposed COVID-19 vaccination mandates exhibit lower adult and child uptake of flu vaccines and lower uptake of COVID-19 boosters. These differences are generally interpreted causally. However, further inspection reveals that these results are driven by the inclusion of a single bad control variable. When that variable is removed, the data instead show that states that mandated COVID-19 vaccination experience higher COVID-19 booster and flu vaccine uptake than states that banned COVID-19 vaccination requirements.
Working Papers
Comparing Human-Only, AI-Assisted, and AI-Led Teams on Assessing Research Reproducibility in Quantitative Social Science (with Abel Brodeur et al.). Institute for Replication Discussion Paper Series No. 195, 2025. Revise & resubmit, Nature.
[ Abstract | Draft | Replication Data, Code, and Pre-Analysis Plan ]
This study evaluates the effectiveness of varying levels of human and artificial intelligence (AI) integration in reproducibility assessments of quantitative social science research. We computationally reproduced quantitative results from published articles in the social sciences with 288 researchers, randomly assigned to 103 teams across three groups — human-only teams, AI-assisted teams and teams whose task was to minimally guide an AI to conduct reproducibility checks (the “AI-led” approach). Findings reveal that when working independently, human teams matched the reproducibility success rates of teams using AI assistance, while both groups substantially outperformed AI-led approaches (with human teams achieving 57 percentage points higher success rates than AI-led teams, p < 0.001). Human teams were particularly effective at identifying serious problems in the analysis: they found significantly more major errors compared to both AI-assisted teams (0.7 more errors per team, p = 0.017) and AI-led teams (1.1 more errors per team, p < 0.001). AI-assisted teams demonstrated an advantage over more automated approaches, detecting 0.4 more major errors per team than AI-led teams (p = 0.029), though still significantly fewer than human-only teams. Finally, both human and AI-assisted teams significantly outperformed AI-led approaches in both proposing (25 percentage points difference, p = 0.017) and implementing (33 percentage points difference, p = 0.005) comprehensive robustness checks. These results underscore both the strengths and limitations of AI assistance in research reproduction and suggest that despite impressive advancements in AI capability, key aspects of the research publication process still require substantial human involvement.
Identifying the Impact of Hypothetical Stakes on Experimental Outcomes and Treatment Effects. MetaArXiv, 2025. Revise & resubmit, Experimental Economics.
[ Abstract | Code & Data Retrieval Instructions | Draft | Tinbergen Institute Discussion Paper | Slides ]
Recent studies showing that some outcome variables do not statistically significantly differ between real-stakes and hypothetical-stakes conditions have raised methodological challenges to experimental economics' disciplinary norm that experimental choices should be incentivized with real stakes. I show that the hypothetical bias measures estimated in these studies do not econometrically identify the hypothetical biases that matter in most modern experiments. Specifically, traditional hypothetical bias measures are fully informative in 'elicitation experiments' where the researcher is uninterested in treatment effects (TEs). However, in 'intervention experiments' where TEs are of interest, traditional hypothetical bias measures are uninformative; real stakes matter if and only if TEs differ between stakes conditions. I demonstrate that traditional hypothetical bias measures are often misleading estimates of hypothetical bias for intervention experiments, both econometrically and through re-analyses of three recent hypothetical bias experiments. The fact that a given experimental outcome does not statistically significantly differ on average between stakes conditions does not imply that all TEs on that outcome are unaffected by hypothetical stakes. Therefore, the recent hypothetical bias literature does not justify abandoning real stakes in most modern experiments. Maintaining norms that favor completely or probabilistically providing real stakes for experimental choices is useful for ensuring externally valid TEs in experimental economics.
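To illustrate the identification argument in notation of my own (not the paper's), the quantity estimated by traditional hypothetical bias studies differs from the quantity that matters for intervention experiments:

```latex
% Illustrative notation of my own, not the paper's: Y is an experimental outcome,
% T a treatment indicator, and S the stakes condition (real or hypothetical).
% Traditional hypothetical bias compares outcome levels across stakes conditions:
\[
B_{\mathrm{outcome}} = E[Y \mid S = \mathrm{hyp}] - E[Y \mid S = \mathrm{real}],
\]
% whereas the bias relevant for intervention experiments compares treatment effects:
\[
B_{\mathrm{TE}} = \tau_{\mathrm{hyp}} - \tau_{\mathrm{real}},
\qquad
\tau_{S} = E[Y \mid T = 1, S] - E[Y \mid T = 0, S].
\]
% B_outcome can equal zero while B_TE does not, so a statistically insignificant
% outcome-level difference does not establish that treatment effects are
% unaffected by hypothetical stakes.
```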
Three-Sided Testing to Establish Practical Significance: A Tutorial (with Peder Isager). PsyArXiv and Tinbergen Institute Discussion Paper Series No. 2024-077/III, 2024. Revise & resubmit, Advances in Methods and Practices in Psychological Science.
[ Abstract | Draft | Stata Command | R Package | Shiny App | Slides | Twitter/X Thread ]
Researchers may want to know whether an observed statistical relationship is meaningfully negative, meaningfully positive, or small enough to be considered practically equivalent to zero. Such a question cannot be addressed with standard null hypothesis significance testing, nor with standard equivalence testing. Three-sided testing (TST) is a procedure that addresses such questions by simultaneously testing whether an estimated relationship is significantly below, within, or above predetermined smallest effect sizes of interest. TST is a natural extension of the standard two one-sided tests (TOST) procedure for equivalence testing, and it offers a more comprehensive decision framework than TOST with no penalty to error rates or statistical power. In this paper, we give a non-technical introduction to TST, provide commands for conducting TST in R, Jamovi, and Stata, and provide a Shiny app for easy implementation. Whenever a meaningful smallest effect size of interest can be specified, TST should be combined with null hypothesis significance testing as the default frequentist testing procedure.
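A minimal sketch of the TST logic in base R, under the simplifying assumption of a t-distributed (or normal) estimator; the numbers are hypothetical placeholders, and the R, Jamovi, Stata, and Shiny implementations linked above should be preferred in practice.

```r
# Minimal three-sided test (TST) from a point estimate and standard error.
# delta is the smallest effect size of interest; all numbers are placeholders.
tst <- function(estimate, se, delta, df = Inf) {
  p_negative    <- pt((estimate + delta) / se, df, lower.tail = TRUE)    # H1: effect < -delta
  p_positive    <- pt((estimate - delta) / se, df, lower.tail = FALSE)   # H1: effect > +delta
  p_equivalence <- max(pt((estimate + delta) / se, df, lower.tail = FALSE),
                       pt((estimate - delta) / se, df, lower.tail = TRUE))  # TOST
  c(p_negative = p_negative, p_equivalence = p_equivalence, p_positive = p_positive)
}

# Example: estimate 0.25, SE 0.10, smallest effect size of interest 0.30.
tst(estimate = 0.25, se = 0.10, delta = 0.30)
```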
A Comment on “Improving Women’s Mental Health During a Pandemic” (with Abel Brodeur, Lenka Fiala, Essi Kujansuu, David Valenta, Ole Rogeberg, & Gunther Bensch). Open Science Framework, 2025. Under invited submission, American Economic Journal: Applied Economics.
[ Abstract | Draft | IZA Discussion Paper | Institute for Replication Discussion Paper | Data & Code | Author Statement 1 | Author Statement 2 | Media: The Australian ]
Vlassopoulos et al. (2024, American Economic Journal: Applied Economics) find that a sample of Bangladeshi women who received two hours of telephone counseling over three months saw significant reductions in stress and depression ten months later. We document three anomalies. First, estimates are almost entirely driven by reverse-scored survey items, which are handled inconsistently both in the code and in the field. Second, participants in this experiment are reused from multiple prior experiments conducted by the paper's authors, and estimates are extremely sensitive to the experiment from which participants originate. Finally, inconsistencies and irregularities in raw survey files raise doubts about the data.
Imputations, Inverse Hyperbolic Sines, and Impossible Values. 2024. Under invited submission, Nature Human Behaviour.
[ Abstract | Data & Code | Draft ]
Wolfowicz et al. (2023, Nature Human Behaviour) find that more arrests and convictions for terrorism offenses decrease terrorism, more charges increase terrorism, and longer sentences do not deter terrorism in 28 European Union member states between 2006 and 2021. I assess the computational reproducibility of their study and find many data irregularities. The article's primary dependent variable - purportedly an inverse hyperbolic sine transformation of terrorist attack rates - takes on 292 different values when attack rates equal zero, and negatively correlates with attack rates. Many variables exhibit impossible values or undisclosed imputations, often masking a lack of reporting in the article's main data sources. I estimate that the authors have access to 57% fewer observations than claimed. Reproduction attempts produce estimates at least 77.7% smaller than the published estimates. Models reflecting the true degree of missing data produce estimates that are not statistically significantly different from zero for any independent variable of interest.
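For reference, the inverse hyperbolic sine is a deterministic transformation, asinh(x) = ln(x + sqrt(x^2 + 1)), so every observation with an attack rate of zero should map to exactly one transformed value, zero. A short base R check (mine, not from the paper's code):

```r
# The inverse hyperbolic sine is deterministic: asinh(x) = log(x + sqrt(x^2 + 1)),
# so a zero attack rate can map to exactly one transformed value, namely zero.
asinh(0)                                              # 0
all.equal(asinh(0:5), log(0:5 + sqrt((0:5)^2 + 1)))   # TRUE
```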
Manipulation Tests in Regression Discontinuity Design: The Need for Equivalence Testing. MetaArXiv, 2025.
[ Abstract | Draft | R Package | Stata Command | Python Package (created by Leo Stimpfle) | Institute for Replication Discussion Paper (Older Version) | Slides | Twitter/X Thread ]
Researchers applying regression discontinuity design (RDD) often test for endogenous running variable (RV) manipulation around treatment cutoffs, but misinterpret statistically insignificant RV manipulation as evidence of negligible RV manipulation. I introduce novel procedures that can provide statistically significant evidence that RV manipulation around a cutoff is bounded beneath practically negligible levels. The procedures augment classic RV density tests with an equivalence testing framework, along with bootstrap methods for (cluster-)robust inference. I apply these procedures to replication data from 36 RDD publications, conducting 45 equivalence-based RV manipulation tests. Over 44% of RV density discontinuities at the cutoff cannot be significantly bounded beneath a 50% upward jump. Obtaining equivalence testing failure rates beneath 5% requires arguing that a 350% upward RV density jump at the cutoff is practically equal to zero. My results imply that meaningful RV manipulation around treatment cutoffs cannot be ruled out in many published RDD papers, and that standard tests frequently misclassify the practical significance of RV manipulation. I provide research guidelines to help researchers conduct more credible equivalence-based manipulation testing in future RDD research. The lddtest estimation routine is available in R, Stata, and Python.
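As a rough illustration of the underlying idea (not the lddtest implementation itself), an equivalence-based manipulation test can be built from an estimated log difference in running-variable density at the cutoff and its standard error, as produced by a McCrary-style density test; the numbers below are hypothetical placeholders:

```r
# Equivalence-based manipulation test sketch. theta_hat and se are a hypothetical
# estimated log density jump at the cutoff and its standard error; this is an
# illustration of the logic, not the paper's lddtest routine.
theta_hat <- 0.08          # estimated log RV density jump at the cutoff
se        <- 0.12          # its standard error
epsilon   <- log(1.5)      # tolerate at most a 50% upward density jump

p_upper <- pnorm((theta_hat - epsilon) / se)                       # H0: jump >= +epsilon vs H1: jump < +epsilon
p_lower <- pnorm((theta_hat + epsilon) / se, lower.tail = FALSE)   # H0: jump <= -epsilon vs H1: jump > -epsilon
p_equiv <- max(p_upper, p_lower)   # reject both to bound |jump| beneath epsilon
p_equiv
```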
Revisiting the Impacts of Anti-Discrimination Employment Protections on American Businesses. 2024. Under submission, Management Science.
[ Abstract | Code & Data Retrieval Instructions | Draft ]
Greene & Shenoy (2022, Management Science) - henceforth GS22 - find that the staggered adoption of U.S. state-level protections against racial discrimination in employment decreased both the profitability and leverage of affected businesses. However, these results arise from two-way fixed effects (TWFE) difference-in-differences models. Such models are now known to return inaccurate estimates of average treatment effects on the treated (ATTs) when treatment assignment is staggered, as some firm-year ATTs can enter the TWFE estimator with negative weight. I find that 21-36% of firm-year ATTs in GS22's sample enter the TWFE estimator with negative weight. I then replicate GS22's results using recently developed difference-in-differences estimators that return valid ATT estimates under staggered adoption. None of these new ATT estimates are statistically significantly different from zero.
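One widely used estimator of this kind is the Callaway & Sant'Anna approach; below is a minimal sketch using the R did package on a simulated firm-year panel with placeholder variable names. It is illustrative only, not necessarily the estimator or specification used in the paper.

```r
# Heterogeneity-robust staggered difference-in-differences sketch using the
# Callaway & Sant'Anna 'did' package. The simulated panel and variable names
# are placeholders, not the paper's data or model.
library(did)

set.seed(1)
panel <- expand.grid(firm_id = 1:200, year = 2000:2010)
panel$first_treated_year <- ifelse(panel$firm_id <= 100,
                                   sample(c(2004, 2007), 200, replace = TRUE)[panel$firm_id],
                                   0)  # 0 = never treated, per the package's convention
panel$profitability <- rnorm(nrow(panel)) -
  0.2 * (panel$first_treated_year > 0 & panel$year >= panel$first_treated_year)

att <- att_gt(yname = "profitability", tname = "year", idname = "firm_id",
              gname = "first_treated_year", data = panel)
aggte(att, type = "simple")  # overall ATT aggregated across groups and periods
```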
The Problems with Poor Proxies: Does Innovation Mitigate Agricultural Damage from Climate Change? Institute for Replication Discussion Paper Series No. 158, 2024. Under submission.
[ Abstract | Draft | Data & Code | Authors’ Response | Twitter/X Thread ]
Moscona & Sastry (2023, Quarterly Journal of Economics) - henceforth MS23 - find that cropland values are significantly less damaged by extreme heat exposure (EHE) when crops are more exposed to technological innovation. Re-analyzing MS23's replication data, I document extensive evidence that this finding is not robust, and that the mitigatory effects of innovation on climate change damage are negligibly small. MS23's 'innovation exposure' variable does not measure innovation, instead proxying innovation using a measure of crops' national heat exposure. This proxy moderates EHE impacts for reasons unrelated to innovation. I show that the proxy is practically identical to local EHE, meaning that MS23's models examining interaction effects between their proxy and local EHE effectively interact local EHE with itself. I demonstrate that MS23's findings on 'innovation exposure' simply reflect nonlinear impacts of local EHE on agricultural land value, and uncover robustness issues for other key findings. I then construct direct measures of innovation exposure from MS23's crop variety and patenting data. Replacing MS23's proxy with these direct innovation measures decreases MS23's moderating effect estimates by at least 99.2% in standardized units; none of these new estimates are statistically significantly different from zero. Similar results arise from an instrumental variables strategy that instruments my direct innovation measures with MS23's heat proxy. These results cast doubt on the general capacity for market innovations to mitigate agricultural damage from climate change.
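To state the mechanism in schematic notation of my own (not MS23's specification):

```latex
% Schematic notation of my own, not MS23's model. If the innovation-exposure
% proxy P is practically identical to local extreme heat exposure EHE, then the
% moderating interaction collapses to a quadratic in local heat exposure:
\[
Y = \alpha + \beta\,\mathrm{EHE} + \gamma\,(\mathrm{EHE} \times P) + \varepsilon
\;\approx\;
\alpha + \beta\,\mathrm{EHE} + \gamma\,\mathrm{EHE}^{2} + \varepsilon ,
\]
% so the interaction coefficient picks up nonlinear impacts of local heat
% exposure rather than any mitigating effect of innovation.
```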
A Comment on “Resisting Social Pressure in the Household Using Mobile Money: Experimental Evidence on Microenterprise Investment in Uganda” (with Lenka Fiala, Essi Kujansuu, & David Valenta). 2024.
[ Abstract | Draft ]
In a pre-registered experiment, Riley (2024, American Economic Review) finds that disbursing microcredit loans onto mobile money accounts yields significantly more profit and capital for women's businesses than disbursing loans in cash, because this technique lets women resist family pressure to share their loans. We uncover two credibility issues. First, we find evidence suggesting that most of the experiment's participants are not assigned to treatment using the pre-registered stratified randomization protocol described in the paper. Second, the reported variables and empirical methods contradict commitments in the paper's pre-registration; these contradictions are unacknowledged and meaningfully affect the paper's main findings.