LLM Agent Evaluation Interpretability Coding

AI Coding Agents in Social Science: Methodologically Diverse, Empirically Consistent, Interpretively Vulnerable

Meysam Alizadeh, Fabrizio Gilardi, Mohsen Mosleh, Enkelejda Kasneci

Published Jun 11, 2026


Editorial review: 6.8

Relevance: 0.469

Freshness: 0.000

Why It Matters

What makes this one worth your time

Knowing where AI coding agents are reliable in social science (research design and estimation) and where they are not (interpreting results) can guide their deployment, and it points to the interpretation step as the place where safeguards are most needed.

AI coding agents match human methodological diversity but are vulnerable to interpretive biases.

Summary

The paper evaluates two LLM-based coding agents, Claude Code and Codex, on a social science analysis task, comparing their methodological diversity and effect estimates against a many-analysts human baseline. It finds that the agents match or exceed human methodological diversity and produce estimates broadly aligned with the human consensus, but are vulnerable at the verdict layer: an explicit confirmatory prompt flips Claude Code's verdicts from 10% to 90% support while leaving its coefficient distribution essentially unchanged.

Key contributions

Comparison of AI agents' methodological diversity and empirical consistency with human analysts.
Identification of the interpretive vulnerability of AI agents at the verdict layer.

Notable insights

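Codex matches human methodological diversity, and Claude Code produces nearly three times as many specifications; no agent model exactly matches any human model.
The bias arises in interpretation rather than estimation: a confirmatory prompt changes the verdicts through rule omission while the coefficient estimates stay essentially unchanged.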
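One way to picture the design-layer comparison is to treat each model specification as a bundle of methodological choices, count the distinct bundles per group, and check for exact matches across groups. The sketch below is illustrative only: the `Spec` fields, the helper names, and all example values are hypothetical and are not taken from the paper or its data.

```python
from typing import FrozenSet, Iterable, NamedTuple, Set


class Spec(NamedTuple):
    """A hypothetical model specification: one bundle of design choices."""
    estimator: str
    controls: FrozenSet[str]
    sample: str


def distinct_specs(specs: Iterable[Spec]) -> Set[Spec]:
    """Return the set of unique specifications (duplicates collapse)."""
    return set(specs)


def exact_overlap(a: Iterable[Spec], b: Iterable[Spec]) -> Set[Spec]:
    """Return specifications that appear, choice for choice, in both groups."""
    return set(a) & set(b)


# Made-up example values, for illustration only.
HUMAN_SPECS = [
    Spec("ols", frozenset({"age", "income"}), "all_countries"),
    Spec("ols", frozenset({"age", "income"}), "all_countries"),  # duplicate
    Spec("multilevel", frozenset({"age"}), "all_countries"),
]
AGENT_SPECS = [
    Spec("ols", frozenset({"age"}), "all_countries"),
    Spec("multilevel", frozenset({"age", "income"}), "subset"),
    Spec("fixed_effects", frozenset({"income"}), "subset"),
]

if __name__ == "__main__":
    print(len(distinct_specs(HUMAN_SPECS)))   # distinct human specifications
    print(len(distinct_specs(AGENT_SPECS)))   # distinct agent specifications
    print(exact_overlap(HUMAN_SPECS, AGENT_SPECS))  # shared exact matches
```

Under this simplified reading, "diversity" is the number of distinct specifications and "no exact match" means the overlap set is empty. The paper may measure diversity differently.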

Possible limitations

Not stated in the abstract

Abstract

arXiv:2606.11456v1

The deployment of LLM-based agents in scientific analysis raises opposing concerns: that agents may reduce methodological diversity, or that they may amplify the analytic flexibility through which researchers reach motivated conclusions. We argue these worries target two empirically separable layers: a design layer of methodological choices, and a verdict layer in which a decision rule maps estimates to a substantive claim. We test both by running 20 independent executions of Claude Code and Codex on a prominent immigration and social-policy against a many-analysts human baseline. At the design layer, Codex matches human methodological diversity and Claude Code produces nearly three times as many specifications; both agents' effect estimates remain broadly aligned with the human consensus, and no agent model exactly matches any human model. A prompt-induced anti-immigration researcher prior reorganizes each agent's methodological decisions but, unlike for biased human analysts in the same data, does not shift aggregate estimates or final verdicts; nor do agents reroute along the methodological axes humans use to bias their estimates. At the verdict layer, an explicit confirmatory prompt flips Claude Code's verdicts from 10% to 90% support while leaving its coefficient distribution essentially unchanged, operating through rule omission rather than rule softening. AI agents can rival or exceed human methodological diversity at the design layer while remaining vulnerable at the verdict layer. In our setting, the locus of AI bias is not estimation but interpretation.
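The separation between estimates and verdicts can be sketched as follows: a verdict is the output of a decision rule applied to an estimate, so dropping a clause from the rule can change the share of supportive verdicts without touching any coefficient. This is a minimal sketch under stated assumptions. The two rules (`strict_verdict`, `lenient_verdict`), the significance threshold, and every number in `RUNS` are hypothetical, and the toy values were chosen by construction to give shares of 0.1 and 0.9; they are not the paper's data or its actual decision rules.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Estimate:
    coef: float  # point estimate of the effect
    se: float    # standard error of the estimate


def strict_verdict(est: Estimate, z_crit: float = 1.96) -> bool:
    """Support only if the effect is negative AND statistically significant."""
    return est.coef < 0 and abs(est.coef / est.se) >= z_crit


def lenient_verdict(est: Estimate) -> bool:
    """Same sign check, with the significance clause omitted entirely."""
    return est.coef < 0


def support_share(estimates: Sequence[Estimate],
                  rule: Callable[[Estimate], bool]) -> float:
    """Fraction of runs whose verdict supports the hypothesis under `rule`."""
    return sum(rule(e) for e in estimates) / len(estimates)


# Synthetic estimates for ten hypothetical runs (illustration only).
RUNS = [
    Estimate(-0.050, 0.020),  # negative and significant
    Estimate(-0.010, 0.020),
    Estimate(-0.015, 0.030),
    Estimate(-0.020, 0.025),
    Estimate(-0.005, 0.020),
    Estimate(-0.012, 0.018),
    Estimate(-0.008, 0.022),
    Estimate(-0.018, 0.030),
    Estimate(-0.003, 0.015),
    Estimate(0.004, 0.020),   # positive
]

if __name__ == "__main__":
    # The coefficient list is identical in both calls; only the rule differs.
    print(support_share(RUNS, strict_verdict))
    print(support_share(RUNS, lenient_verdict))
```

In this toy setup the omitted clause is a significance check, which is one plausible reading of "rule omission rather than rule softening". Softening would instead correspond to keeping the clause but relaxing `z_crit`.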