Job Market Paper
The Need for Equivalence Testing in Economics. MetaArXiv, 2025. Under submission.
[ Abstract | Draft | Online Appendix | Stata Command | R Package | Shiny App | 30-Minute Presentation | Institute for Replication Discussion Paper (Older Version) | Interview: Economisch Statistische Berichten (in Dutch) ]
Equivalence testing can provide statistically significant evidence that economic relationships are practically negligible. I demonstrate its necessity in a large-scale reanalysis of estimates defending 135 null claims made in 81 recent articles from top economics journals. 36-63% of estimates defending the average null claim fail lenient equivalence tests. In a prediction platform survey, researchers accurately predict that equivalence testing failure rates will significantly exceed levels which they deem acceptable. Obtaining equivalence testing failure rates that these researchers deem acceptable requires arguing that nearly 75% of published estimates in economics are practically equal to zero. These results imply that Type II error rates are unacceptably high throughout economics, and that many null findings in economics reflect low power rather than truly negligible relationships. I provide economists with guidelines and commands in Stata and R for conducting credible equivalence testing and practical significance testing in future research.
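For readers unfamiliar with the mechanics, the two one-sided tests (TOST) procedure underlying equivalence testing can be sketched in a few lines. This is an illustrative Python/SciPy sketch for a one-sample mean, not the paper's Stata command or R package; the bounds of ±0.2 are a placeholder smallest effect size of interest.

```python
import numpy as np
from scipy import stats

def tost(x, low, high):
    """Two one-sided tests (TOST): a small p-value gives statistically
    significant evidence that the population mean lies inside (low, high)."""
    p_lower = stats.ttest_1samp(x, low, alternative='greater').pvalue  # H0: mean <= low
    p_upper = stats.ttest_1samp(x, high, alternative='less').pvalue    # H0: mean >= high
    return max(p_lower, p_upper)  # equivalence requires both tests to reject

rng = np.random.default_rng(0)
x = rng.normal(0.0, 1.0, 2000)   # simulated data whose true mean is zero
p_equiv = tost(x, -0.2, 0.2)     # bounds = smallest effect size of interest
```

Unlike a nonsignificant t-test, a small TOST p-value is affirmative evidence that the effect is bounded inside (-0.2, 0.2), which is exactly the distinction the reanalysis turns on.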
Current and Forthcoming Publications
AI-Assisted Teams Outperform AI-Led Teams but Not Human-Only Teams in Assessing Research Reproducibility in Quantitative Social Science (with Abel Brodeur et al.). Institute for Replication Discussion Paper Series No. 195, 2026. In press, Proceedings of the National Academy of Sciences.

[ Abstract | Draft | Replication Data, Code, and Pre-Analysis Plan ]
Large Language Models (LLMs) such as ChatGPT are transforming how scientists conduct and validate research, offering promise as tools to improve scientific reproducibility. However, computational reproducibility and error detection remain expensive and labor-intensive. We experimentally test how collaboration between researchers and LLM assistants influences the reproduction of quantitative social science findings across different levels of AI autonomy. We randomly assigned 288 researchers to 103 teams working under three conditions: human-only, AI-assisted (using ChatGPT as a collaborative tool), or AI-led (ChatGPT operating with minimal human oversight). Teams reproduced published results from leading social science journals, detected coding errors, and proposed robustness checks. Human-only and AI-assisted teams achieved comparable reproduction rates (94% vs. 91%) and performed similarly on most outcomes, except human-only teams identified significantly more major coding errors. Both substantially outperformed AI-led teams, which achieved only a 37% reproduction rate, detected fewer errors across all categories, proposed weaker robustness checks, and required more time. This autonomous approach, however, likely represents only a lower bound of AI capabilities. Despite rapid model advances, expert human judgment currently remains indispensable for reliable empirical verification. While AI assistance did not degrade most outcomes, it provided no measurable advantages and was associated with reduced detection of major errors. However, the 37% autonomous reproduction rate indicates that AI could provide value in settings where scale or cost constraints preclude human review of papers, even though general-purpose LLMs offer no immediate advantages for human-supervised verification. (Previously circulated under the title "Comparing Human-Only, AI-Assisted, and AI-Led Teams on Assessing Research Reproducibility in Quantitative Social Science.")
Identifying the Impact of Hypothetical Stakes on Experimental Outcomes and Treatment Effects. Forthcoming, Experimental Economics, 2026.
[ Abstract | Article (Open Access) | Code & Data Retrieval Instructions | MetaArXiv Preprint | Tinbergen Institute Discussion Paper | Slides ]
Recent studies showing that some outcome variables do not statistically significantly differ between real-stakes and hypothetical-stakes conditions have raised methodological challenges to experimental economics' disciplinary norm that experimental choices should be incentivized with real stakes. I show that the hypothetical bias measures estimated in these studies do not econometrically identify the hypothetical biases that matter in most modern experiments. Specifically, traditional hypothetical bias measures are fully informative in 'elicitation experiments' where the researcher is uninterested in treatment effects (TEs). However, in 'intervention experiments' where TEs are of interest, traditional hypothetical bias measures are uninformative; real stakes matter if and only if TEs differ between stakes conditions. I demonstrate that traditional hypothetical bias measures are often misleading estimates of hypothetical bias for intervention experiments, both econometrically and through re-analyses of three recent hypothetical bias experiments. The fact that a given experimental outcome does not statistically significantly differ on average between stakes conditions does not imply that all TEs on that outcome are unaffected by hypothetical stakes. Therefore, the recent hypothetical bias literature does not justify abandoning real stakes in most modern experiments. Maintaining norms that favor completely or probabilistically providing real stakes for experimental choices is useful for ensuring externally valid TEs in experimental economics.
Three-Sided Testing to Establish Practical Significance: A Tutorial (with Peder Isager). Advances in Methods and Practices in Psychological Science 9(1), 2026.
[ Abstract | Article (Open Access) | PsyArXiv Preprint | Tinbergen Institute Discussion Paper | Stata Command | R Package | Shiny App | Teaching Slides | Twitter/X Thread ]
Researchers may want to know whether an observed statistical relationship is meaningfully negative, meaningfully positive, or small enough to be considered practically equivalent to zero. Such a question cannot be addressed with standard null hypothesis significance testing, nor with standard equivalence testing. Three-sided testing (TST) addresses such questions by simultaneously testing whether an estimated relationship is significantly below, within, or above predetermined smallest effect sizes of interest. TST is a natural extension of the standard two one-sided tests (TOST) procedure for equivalence testing, and offers a more comprehensive decision framework than TOST with no penalty to error rates or statistical power. In this paper, we give a non-technical introduction to TST, provide commands for conducting TST in R, Jamovi, and Stata, and provide a Shiny app for easy implementation. Whenever a meaningful smallest effect size of interest can be specified, TST should be combined with null hypothesis significance testing as the default frequentist testing procedure.
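The decision logic of TST can be sketched by combining three one-sided tests. The snippet below is a minimal one-sample illustration in Python/SciPy, not the paper's R, Jamovi, or Stata commands; delta stands in for the smallest effect size of interest.

```python
import numpy as np
from scipy import stats

def three_sided_test(x, delta, alpha=0.05):
    """Classify a sample mean as meaningfully negative, meaningfully positive,
    practically equivalent to zero, or inconclusive."""
    p_neg = stats.ttest_1samp(x, -delta, alternative='less').pvalue      # mean < -delta?
    p_pos = stats.ttest_1samp(x, delta, alternative='greater').pvalue    # mean > +delta?
    p_eqv = max(stats.ttest_1samp(x, -delta, alternative='greater').pvalue,
                stats.ttest_1samp(x, delta, alternative='less').pvalue)  # TOST: inside bounds?
    if p_neg < alpha:
        return 'meaningfully negative'
    if p_pos < alpha:
        return 'meaningfully positive'
    if p_eqv < alpha:
        return 'practically equivalent to zero'
    return 'inconclusive'

rng = np.random.default_rng(1)
positive_case = three_sided_test(rng.normal(0.5, 1.0, 500), delta=0.2)
null_case = three_sided_test(rng.normal(0.0, 1.0, 2000), delta=0.2)
```

Because the three alternative hypotheses cover disjoint regions of the parameter space, at most one directional conclusion can be reached for a given sample, which is consistent with the paper's claim of no penalty to error rates.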
Imputations, Inverse Hyperbolic Sines and Impossible Values. Nature Human Behaviour 10, 239-242, 2026.
[ Abstract | Article | Open Access Version | Data & Code ]
Wolfowicz et al. (2023, Nature Human Behaviour) find that more arrests and convictions for terrorism offenses decrease terrorism, more charges increase terrorism, and longer sentences do not deter terrorism in 28 European Union member states from 2006 to 2021. I assess the computational reproducibility of their study and find many data irregularities. The article's primary dependent variable - purportedly an inverse hyperbolic sine transformation of terrorist attack rates - takes on 292 different values when attack rates equal zero, and negatively correlates with attack rates. Many variables exhibit impossible values or undisclosed imputations, often masking a lack of reporting in the article's main data sources. I estimate that the authors have access to 57% fewer observations than claimed. Reproduction attempts produce estimates at least 77.7% smaller than the published estimates. Models reflecting the true degree of missing data produce estimates that are not statistically significantly different from zero for any independent variable of interest.
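The core data irregularity is easy to appreciate: a genuine inverse hyperbolic sine transformation maps zero to exactly zero and is strictly increasing, so it can take only one value when attack rates equal zero and must correlate positively with the raw rates. A minimal sanity check on hypothetical data (not the article's dataset):

```python
import numpy as np

rates = np.array([0.0, 0.0, 0.5, 1.0, 3.0, 10.0])  # hypothetical attack rates
ihs = np.arcsinh(rates)                            # asinh(z) = ln(z + sqrt(z**2 + 1))

zero_values = np.unique(ihs[rates == 0.0])         # a single value (0.0), never 292
correlation = np.corrcoef(rates, ihs)[0, 1]        # strictly positive by monotonicity
```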
Revisiting the Cognitive Advantages of Professional Soccer Players (with Abel Brodeur and Niklas Jakobsson). Proceedings of the National Academy of Sciences 123(8), e2515523123, 2026.
[ Abstract | Article | Author’s Reply | Data & Code ]
Bonetti et al. (2025, Proceedings of the National Academy of Sciences) find that professional soccer players in Brazil and Sweden exhibit detectable cognitive advantages compared to a sample of Brazilian control participants, and that machine learning models trained on cognitive and personality characteristics can distinguish the professional players from the control participants with 97% accuracy. Analyzing the study's replication data, we find that some of the study's statistical analyses are mischaracterized, and we document potential issues in the sampling of control participants. In light of the latter, we focus on quality differences between professional Swedish players previously analyzed by some of the study's authors. We find that the paper's machine learning models distinguish high-quality Swedish professional players from lower-quality players with just 53% average accuracy, near the no-information rate.
A Comment on “Delivering Remote Learning Using a Low-Tech Solution: Evidence from a Randomized Controlled Trial in Bangladesh” (with Lenka Fiala, Essi Kujansuu, Derek Mikola, David Valenta, Juan P. Aparicio, Michael Wiebe, Matthew D. Webb, and Abel Brodeur). Accepted, Journal of Political Economy: Microeconomics, 2025.
[ Abstract ]
Wang et al. (2024) report that Bangladeshi students randomly given access to lessons on a phone server saw significant learning gains during COVID-19 school closures. We identify three sets of anomalies. First, this experiment shares participants with another experiment conducted simultaneously in the same region, but test scores for the same children systematically differ between the two experiments. Second, test scores for treated participants exhibit enormous jumps immediately after students take their first handful of lessons. Third, numerous documentation inconsistencies cast doubt on the study's data reliability. These anomalies raise serious concerns about the credibility of the reported results.
Is There a Foreign Language Effect on Workplace Bribery Susceptibility? Evidence from a Randomized Controlled Vignette Experiment (with Paul Stroet, Arjen van Witteloostuijn, and Kristina S. Weißmüller). Journal of Business Ethics 197, 73-97, 2025.
[ Abstract | Article (Open Access) | Draft | Code ]
Theory and evidence from the behavioral science literature suggest that the widespread and rising use of lingua francas in the workplace may impact the ethical decision-making of individuals who must use foreign languages at work. We test the impact of foreign language usage on individual susceptibility to bribery in workplace settings using a vignette-based randomized controlled trial in a Dutch student sample. Results suggest that there is not even a small foreign language effect on workplace bribery susceptibility. We combine traditional null hypothesis significance testing with equivalence testing methods novel to the business ethics literature that can provide statistically significant evidence of bounded or null relationships between variables. These tests suggest that the foreign language effect on workplace bribery susceptibility is bounded below even small effect sizes. Post hoc analyses provide evidence suggesting fruitful further routes of experimental research into bribery.
US States That Mandated COVID-19 Vaccination See Higher, Not Lower, Take-Up of COVID-19 Boosters and Flu Vaccines. Proceedings of the National Academy of Sciences 121(41), e2403758121, 2024.
[ Abstract | Article (Open Access) | Data & Code, Published Replication | Reply | Response to Reply | Data & Code, Response to Reply | Twitter/X Thread ]
Rains & Richards (2024, Proceedings of the National Academy of Sciences) find that compared to US states that instituted bans on COVID-19 vaccination requirements, states that imposed COVID-19 vaccination mandates exhibit lower adult and child uptake of flu vaccines, and lower uptake of COVID-19 boosters. These differences are generally interpreted causally. However, further inspection reveals that these results are driven by the inclusion of a single bad control variable. Once this control is removed, the data instead show that states that mandated COVID-19 vaccination experience higher COVID-19 booster and flu vaccine take-up than states that banned COVID-19 vaccination requirements.
Invited Submissions and Resubmissions
Reframing Eating: Drivers, Influences, and Transitions (ReDIET) for a Sustainable Food System (with Meike Morren, Guido van Koningsbruggen, Angela Johnson, & Kristiaan Kok). 2026. Invited submission, Current Opinion in Environmental Sustainability.
[ Abstract | Draft ]
To accelerate sustainability transitions in food systems and to meet climate change mitigation targets, fundamental changes in both food production and consumption are essential. While substantial research has examined either systemic or individual interventions to drive change, these approaches often fall short in isolation. System-level interventions may prove ineffective or even counterproductive if they disregard behavioral insights into sustainable food choice, whereas individual-level interventions alone are unlikely to achieve large-scale food system transitions. We propose a new framework that elucidates the interactions between system- and individual-level drivers of sustainable food choice from a multi-stakeholder perspective. Drawing on insights from both the natural and social sciences, we identify key push and pull factors influencing sustainable food behavior at both the system and individual levels. Our framework provides action-oriented strategies for integrating behavioral science with food system transitions, enabling decision-makers to leverage the mutual reinforcement of systemic and individual interventions toward more sustainable food production and consumption.
A Comment on “Improving Women’s Mental Health During a Pandemic” (with Abel Brodeur, Lenka Fiala, Essi Kujansuu, David Valenta, Ole Røgeberg, & Gunther Bensch). Open Science Framework, 2025. Revise & resubmit, American Economic Journal: Applied Economics.
[ Abstract | Draft | IZA Discussion Paper | Institute for Replication Discussion Paper | Data & Code | Author Statement 1 | Author Statement 2 | Media: The Australian ]
Vlassopoulos et al. (2024, American Economic Journal: Applied Economics) find that after providing two hours of telephone counseling over three months, a sample of Bangladeshi women saw significant reductions in stress and depression after ten months. We find three anomalies. First, estimates are almost entirely driven by reverse-scored survey items, which are handled inconsistently both in the code and in the field. Second, participants in this experiment are reused from multiple prior experiments conducted by the paper's authors, and estimates are extremely sensitive to the experiment from which participants originate. Finally, inconsistencies and irregularities in raw survey files raise doubts about the data.
Revisiting the Impacts of Anti-Discrimination Employment Protections on American Businesses. 2024. Revise & resubmit, Management Science.
[ Abstract | Code & Data Retrieval Instructions | Draft ]
Greene & Shenoy (2022, Management Science) - henceforth GS22 - find that the staggered adoption of U.S. state-level protections against racial discrimination in employment decreased both the profitability and leverage of affected businesses. However, these results arise from two-way fixed effects (TWFE) difference-in-differences models. Such models are now known to return inaccurate estimates of average treatment effects on the treated (ATTs) when treatment assignment is staggered, as some firm-year ATTs can enter the TWFE estimator with negative weight. I find that 21-36% of firm-year ATTs in GS22's sample enter the TWFE estimator with negative weight. I then replicate GS22's results using recently developed difference-in-differences estimators that return valid ATT estimates under staggered adoption. None of these new ATT estimates are statistically significantly different from zero.
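The TWFE pathology at issue can be reproduced in a toy example (illustrative numbers, not GS22's data): with staggered adoption and treatment effects that grow over time, negative weights can drive the TWFE estimate to zero even though every firm-year ATT is positive.

```python
import numpy as np

# 2 firms, 3 years. Firm 1 is treated from year 2 (ATTs of 1, then 3);
# firm 2 is treated from year 3 (ATT of 1). Untreated outcomes are all zero.
y       = np.array([0.0, 1.0, 3.0, 0.0, 0.0, 1.0])  # outcomes, firm-major order
firm2   = np.array([0, 0, 0, 1, 1, 1])               # firm fixed effect
year2   = np.array([0, 1, 0, 0, 1, 0])               # year fixed effects
year3   = np.array([0, 0, 1, 0, 0, 1])
treated = np.array([0, 1, 1, 0, 0, 1])               # staggered treatment dummy

X = np.column_stack([np.ones(6), firm2, year2, year3, treated])
coefs, *_ = np.linalg.lstsq(X, y, rcond=None)
tau_twfe = coefs[-1]   # TWFE estimate: exactly 0, despite ATTs of 1, 3, and 1
```

Here the early adopter's large later effect enters the comparison for the late adopter with negative weight, canceling the positive ATTs in the pooled estimate.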
Working Papers
Which Businesses Answer Surveys? Evidence from Dutch Administrative Data. SocArXiv, 2026. Under submission.
[ Abstract | Draft | Code and Data Access Instructions | AsCollected Registration ]
I leverage a unique administrative register covering the universe of establishments in the Netherlands to examine how characteristics differ between establishments that do and do not respond to business surveys. Only 19% of Dutch establishments responded to regional business surveys in 2022. Responsive establishments employ fewer people, and exhibit higher part-time employment rates, than unresponsive establishments. Sectoral response rates vary by up to 32 percentage points. Enterprises registered to residential addresses comprise over 70% of establishments, yet exhibit response rates 18 percentage points lower than the average office-registered establishment. However, controlling for contact probability reveals that the majority of sectoral and occupational variation in response rates can be traced back to differences in contact probability rather than responsiveness. These findings highlight challenges in the representativeness and generalizability of business survey data, as well as opportunities to improve the design of business surveys.
Non-Robustness in Log-Like Specifications (with Joop Adema, Lenka Fiala, Essi Kujansuu, and David Valenta). MetaArXiv, 2026.
[ Abstract | Draft | Institute for Replication Discussion Paper | Teaching Slides ]
Recent literature shows that when regression models are estimated on variables transformed with 'log-like' functions such as the inverse hyperbolic sine or ln(Z + 1) transformations, one can obtain (semi-)elasticity estimates of any magnitude by linearly re-scaling the input variable(s) before transformation. We systematically re-analyze the replication data of 46 papers whose main conclusions are defended by log-like specifications. Our replication findings motivate new theoretical and simulation results showing that in log-like specifications, unit scale can be used to overfit data, creating an uncontrolled multiple hypothesis testing problem that frequently yields spuriously significant results. In particular, 38% of the estimates we re-analyze sit in a 'sweet spot', where both upward and downward re-scalings of variables' units before transformation shrink test statistics. Consequently, published estimates in this literature are statistically significant over 40% more frequently than in the general economics literature. We find that modest changes to model specification yield different statistical significance conclusions for 14-37% of estimates defending papers' main claims. We also show that for 99.8% of estimates, variables transformed with log-like functions do not meet data requirements for log-like specifications from a methodological recommendation cited by all papers in our replication sample. We synthesize and harmonize methodological guidelines and advocate for more robust alternative specifications, including normalized estimands, Poisson regression, and quantile regression.
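The unit-scale sensitivity the paper documents is easy to reproduce on simulated data. The sketch below (an illustration, not the paper's replication sample) shows the OLS slope of asinh(c * Z) on a covariate changing by an order of magnitude as the unit constant c varies, something a pure log transformation would not do.

```python
import numpy as np

rng = np.random.default_rng(2)
x = rng.normal(size=2000)
# Outcome with a 30% mass at zero -- the case log-like transforms are meant to handle
z = np.where(rng.random(2000) < 0.3, 0.0,
             np.exp(0.5 * x + rng.normal(size=2000)))

def slope_after_ihs(c):
    """OLS slope of asinh(c * z) on x; c only rescales the units of z."""
    return np.polyfit(x, np.arcsinh(c * z), 1)[0]

slopes = {c: slope_after_ihs(c) for c in (0.01, 1.0, 100.0)}
```

Because asinh(c * z) behaves like c * z near zero but like ln(z) + ln(2c) for large arguments, rescaling moves observations between the linear and logarithmic regimes and shifts the estimated semi-elasticity.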
Manipulation Tests in Regression Discontinuity Design: The Need for Equivalence Testing. MetaArXiv, 2025.
[ Abstract | Draft | R Package | Stata Command | Python Package (created by Leo Stimpfle) | Institute for Replication Discussion Paper (Older Version) | Slides | Twitter/X Thread ]
Researchers applying regression discontinuity design (RDD) often test for endogenous running variable (RV) manipulation around treatment cutoffs, but misinterpret statistically insignificant RV manipulation as evidence of negligible RV manipulation. I introduce novel procedures that can provide statistically significant evidence that RV manipulation around a cutoff is bounded beneath practically negligible levels. The procedures augment classic RV density tests with an equivalence testing framework, along with bootstrap methods for (cluster-)robust inference. I apply these procedures to replication data from 36 RDD publications, conducting 45 equivalence-based RV manipulation tests. Over 44% of RV density discontinuities at the cutoff cannot be significantly bounded beneath a 50% upward jump. Obtaining equivalence testing failure rates beneath 5% requires arguing that a 350% upward RV density jump at the cutoff is practically equal to zero. My results imply that meaningful RV manipulation around treatment cutoffs cannot be ruled out in many published RDD papers, and that standard tests frequently misclassify the practical significance of RV manipulation. I provide research guidelines to help researchers conduct more credible equivalence-based manipulation testing in future RDD research. The lddtest estimation routine is available in R, Stata, and Python.
A Many-Designs Study Crowdsourcing 516 Aggregation Algorithms to Increase the Wisdom of the Crowd (with Christian König-Kersting et al.). 2025.
[ Abstract | Project Website ]
Accurate predictions are central to decision-making in domains such as politics, climate, and economics. Although aggregating independent individual judgments can improve accuracy, a phenomenon known as the "Wisdom of the Crowd," the optimal method for combining predictions remains underexplored. In this preregistered many-designs study, 129 research teams submitted 516 aggregation algorithms to predict real-world outcomes across four domains (Economics, Politics, Climate, and Sports) over six months. Drawing on 640 individual forecasts per month, we evaluated algorithm performance with respect to both accuracy and variability. We also examined potential predictors of these outcomes, including features of the algorithms themselves (e.g., reliance on domain knowledge or confidence) and characteristics of the researchers who designed them. Results showed that algorithm accuracy is remarkably consistent over time within certain domains, that researchers' expectations about which features would predict algorithm accuracy were often misplaced, and that researchers were overly confident about the success of their own and others' algorithms. No features of algorithms or researchers consistently explained performance differences across domains. By pitting hundreds of independently developed prediction aggregation methods against each other under identical conditions, this study establishes a rigorous empirical benchmark for prediction aggregation, advancing the science of forecasting and our understanding of collective intelligence.
The Problems with Poor Proxies: Does Innovation Mitigate Agricultural Damage from Climate Change? Institute for Replication Discussion Paper Series No. 158, 2024.
[ Abstract | Draft | Data & Code | Authors’ Response | Twitter/X Thread ]
Moscona & Sastry (2023, Quarterly Journal of Economics) - henceforth MS23 - find that cropland values are significantly less damaged by extreme heat exposure (EHE) when crops are more exposed to technological innovation. Re-analyzing MS23's replication data, I document extensive evidence that this finding is not robust, and that the mitigatory effects of innovation on climate change damage are negligibly small. MS23's 'innovation exposure' variable does not measure innovation, instead proxying innovation using a measure of crops' national heat exposure. This proxy moderates EHE impacts for reasons unrelated to innovation. I show that the proxy is practically identical to local EHE, meaning that MS23's models examining interaction effects between their proxy and local EHE effectively interact local EHE with itself. I demonstrate that MS23's findings on 'innovation exposure' simply reflect nonlinear impacts of local EHE on agricultural land value, and uncover robustness issues for other key findings. I then construct direct measures of innovation exposure from MS23's crop variety and patenting data. Replacing MS23's proxy with these direct innovation measures decreases MS23's moderating effect estimates by at least 99.2% in standardized units; none of these new estimates are statistically significantly different from zero. Similar results arise from an instrumental variables strategy that instruments my direct innovation measures with MS23's heat proxy. These results cast doubt on the general capacity for market innovations to mitigate agricultural damage from climate change.
A Comment on “Resisting Social Pressure in the Household Using Mobile Money: Experimental Evidence on Microenterprise Investment in Uganda” (with Lenka Fiala, Essi Kujansuu, & David Valenta). 2024.
[ Abstract | Draft ]
In a pre-registered experiment, Riley (2024, American Economic Review) finds that disbursing microcredit loans onto mobile money accounts yields significantly more profit and capital for women's businesses than providing loans in cash, as this disbursement technique permits women to resist family pressure to share loans. We uncover two credibility issues. First, we find evidence suggesting that most of the experiment's participants are not assigned to treatment using the pre-registered stratified randomization protocol described in the paper. Second, the reported variables and empirical methods contradict commitments in the paper's pre-registration; these contradictions are unacknowledged and meaningfully impact the paper's main findings.