Agentic Trading: When LLM Agents Meet Financial Markets
Yihan Xia, Panpan You, Taotao Wang, Fang Liu, Han Qi, Xiaoxiao Wu, Shengli Zhang
Why It Matters
What makes this one worth your time
Understanding the current limitations in LLM-based trading systems can guide future research towards more robust and reproducible financial AI applications.
The paper audits LLM-based trading agents, revealing gaps in reproducibility and evaluation protocols.
Summary
The paper surveys the use of Large Language Models (LLMs) as agents in trading systems, coding 77 included studies into an evidence map and auditing their reproducibility. It finds that standardized evaluation protocols and reproducible artifacts are largely missing, and it adopts Architecture-Capability-Adaptation as a working analytical lens rather than a validated taxonomy.
Key contributions
- Audit-oriented evidence map (evidence ledger) of 77 studies on LLM-based trading agents.
- Reproducibility audit and reporting checklist that expose gaps in evaluation protocols across current research.
Notable insights
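- The field lacks standardized evaluation protocols, which hinders comparability across studies: in the primary subset, only 2/19 studies report extractable time-consistent split protocols and 1/19 reports an explicit transaction-cost model.
- Reproducibility remains a significant challenge: 15/19 primary-subset studies are coded as R0, and no study reaches R3 reproducibility.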
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2605.19337v1
A growing body of work explores how Large Language Models (LLMs) can be embedded in trading systems as agents that perceive market information, retrieve context, reason about decisions, emit tradable actions, and adapt under market feedback. This paper reframes LLM-based trading agents as expert-system decision pipelines and presents an audit-oriented evidence map of 77 included studies in a protocol-coded snapshot screened through 2026-03-09. A primary empirical subset (n=19) satisfies the minimum boundary of Action Output plus Closed-Loop Evaluation; the remaining 58 included studies are retained as background and design context. The central empirical finding is protocol incomparability: within the primary subset, only 2/19 studies report extractable time-consistent split protocols, 1/19 reports an explicit transaction-cost model, 1/19 documents universe or survivorship handling, 11/19 report execution timing or semantics, 15/19 are coded as R0, and no study reaches R3 reproducibility. We therefore use Architecture-Capability-Adaptation as a working analytical lens rather than a validated taxonomy, and we foreground the evidence ledger, reproducibility audit, and reporting checklist as the main contributions. The resulting survey shows that architectural experimentation is expanding rapidly, while comparable evaluation protocols, execution semantics, and reproducible artifacts remain the field's immediate bottlenecks.
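To make two of the audited protocol items concrete, here is a minimal Python sketch of what a time-consistent split and an explicit transaction-cost model could look like in an evaluation script. It is not taken from the paper: the function names `chronological_split` and `net_returns`, the proportional cost on position changes, and the toy data are all assumptions chosen for illustration.

```python
from datetime import date


def chronological_split(rows, cutoff):
    """Split (date, value) rows so that every training row precedes the cutoff.

    Hypothetical helper: a time-consistent split keeps all test observations
    strictly after the training period, so no future data leaks into training.
    """
    train = [r for r in rows if r[0] < cutoff]
    test = [r for r in rows if r[0] >= cutoff]
    return train, test


def net_returns(gross_returns, positions, cost_per_turnover):
    """Apply an assumed proportional cost to each change in position.

    positions[i] is the exposure held over period i (e.g. 1.0 = fully long);
    the cost charged is cost_per_turnover * |change in position|.
    """
    net = []
    prev = 0.0  # assume the strategy starts flat
    for r, pos in zip(gross_returns, positions):
        turnover = abs(pos - prev)
        net.append(pos * r - cost_per_turnover * turnover)
        prev = pos
    return net


# Toy data: ten daily observations in January 2024 (illustrative only).
rows = [(date(2024, 1, d), 0.01 * d) for d in range(1, 11)]
train, test = chronological_split(rows, date(2024, 1, 8))

# Three periods: enter long, hold, then exit to flat.
net = net_returns([0.02, -0.01, 0.03], [1.0, 1.0, 0.0], cost_per_turnover=0.001)
```

Reporting the cutoff date and the cost parameter alongside results is the kind of detail the audit looks for; real studies would likely need richer cost models (spreads, slippage, fees) than this single proportional term.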