Beyond Agent Architecture: Execution Assumptions and Reproducibility in LLM-Based Trading Systems
Junyi Yao, Zihao Zheng
Why It Matters
What makes this one worth your time
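Reported results for LLM-based trading systems are hard to compare because studies differ in data provenance, temporal split discipline, execution timing, turnover treatment, and transaction-cost modeling. Without those details, a reader cannot judge whether a trading result is economically interpretable or reproducible.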
The paper argues that the next useful step for LLM trading research is not only better agent design but also clearer reporting standards for execution realism, reproducibility, and evaluation comparability.
Summary
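The paper presents a targeted topical review and reproducibility audit of LLM-based trading systems, focusing on execution realism and evaluation comparability. A coded evidence matrix covering 30 trade-relevant primary studies assesses point-in-time controls, split transparency, held-out evaluation, cost and turnover treatment, execution semantics, universe definition, and artifact release. The audit finds that architecture is generally reported more clearly than the evaluation assumptions behind the results, which motivates the call for clearer reporting standards.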
Key contributions
- A targeted topical review of execution realism in LLM-based trading research.
- A reproducibility audit using a coded evidence matrix covering 30 primary studies.
- A 10-equity worked example, included only as a methodological scaffold, illustrating how explicit friction and timing choices can materially compress active-strategy results.
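The effect of execution assumptions described in the last contribution can be sketched in a few lines. This is a minimal illustration, not the paper's worked example: the function name `net_returns`, the proportional `cost_rate`, the `lag` convention, and the toy series are all assumptions made here, and the signals are deliberately constructed to match the sign of the same-bar return so that the look-ahead effect is visible.

```python
def net_returns(signals, asset_returns, cost_rate=0.0, lag=0):
    """Per-bar strategy returns under explicit timing and cost assumptions.

    signals[t]:       target position (-1, 0, or +1) decided at bar t
    asset_returns[t]: asset return realised over bar t
    cost_rate:        proportional cost per unit of turnover (assumed model)
    lag:              bars between decision and fill; lag=0 lets the signal
                      earn the return of the bar it was computed on
    """
    kept = max(0, len(signals) - lag)
    # Shift positions forward by `lag` bars; the strategy is flat before the first fill.
    positions = [0.0] * lag + list(signals[:kept])
    out = []
    prev = 0.0
    for pos, ret in zip(positions, asset_returns):
        turnover = abs(pos - prev)          # size of the position change
        out.append(pos * ret - cost_rate * turnover)
        prev = pos
    return out


# Toy data, invented for illustration only.
asset_returns = [0.01, -0.02, 0.015, -0.01, 0.02]
signals = [1, -1, 1, -1, 1]

for lag in (0, 1):
    for cost_rate in (0.0, 0.001):
        total = sum(net_returns(signals, asset_returns, cost_rate, lag))
        print(f"lag={lag} cost_rate={cost_rate}: total return {total:.4f}")
```

On this toy series the same signals produce a positive total with same-bar fills and a negative total once a one-bar lag is applied, and the cost term lowers both. Returns are summed rather than compounded to keep the comparison easy to read.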
Notable insights
- A coded evidence matrix makes execution assumptions comparable across studies by scoring each study on the same criteria.
- Across the audited sample, architecture reporting is generally clearer than the evaluation assumptions needed to judge whether a trading result is economically interpretable or reproducible.
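A coded evidence matrix of the kind mentioned above can be represented as a small study-by-criterion table. The sketch below is an assumption about how such a matrix might be stored and summarised: the criterion names follow the abstract, but the three-level coding scheme (`Y`, `P`, `N`), the study names, and every code value are invented placeholders, not the paper's data.

```python
CRITERIA = [
    "point_in_time_controls",
    "split_transparency",
    "held_out_evaluation",
    "cost_and_turnover",
    "execution_semantics",
    "universe_definition",
    "artifact_release",
]

# Hypothetical coding scheme: "Y" reported, "P" partially reported, "N" not reported.
# Study names and codes are placeholders for illustration only.
rows = {
    "study_A": "YYYNPYN",
    "study_B": "NPYNNYY",
    "study_C": "YNNYPYN",
}

# One dict per study, mapping each criterion to its code.
matrix = {name: dict(zip(CRITERIA, codes)) for name, codes in rows.items()}


def coverage(matrix, criteria, reported=("Y",)):
    """Fraction of studies whose code for each criterion counts as reported."""
    n = len(matrix)
    return {
        c: sum(row.get(c, "N") in reported for row in matrix.values()) / n
        for c in criteria
    }


strict = coverage(matrix, CRITERIA)                          # only full reporting counts
lenient = coverage(matrix, CRITERIA, reported=("Y", "P"))    # partial reporting counts too

for criterion in CRITERIA:
    print(f"{criterion:24s} strict={strict[criterion]:.2f} lenient={lenient[criterion]:.2f}")
```

Keeping the strict and lenient tallies separate shows how much of the apparent coverage depends on partial reporting.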
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2606.08285v1
Large language models (LLMs) and agentic systems are increasingly proposed for financial trading, yet their reported performance remains difficult to compare because studies vary in data provenance, temporal split discipline, execution timing, turnover treatment, and transaction-cost modeling. This article presents a targeted topical review and reproducibility audit of execution realism in LLM-based trading research. A coded evidence matrix covering 30 trade-relevant primary studies is used to assess point-in-time controls, split transparency, held-out evaluation, cost and turnover treatment, execution semantics, universe definition, and artifact release. Across the audited sample, architecture reporting is generally clearer than the evaluation assumptions needed to judge whether a trading result is economically interpretable or reproducible. A 10-equity worked example is included only as a methodological scaffold to illustrate how explicit friction and timing choices can materially compress active-strategy results. The main conclusion is that the next useful step for LLM trading research is not only better agent design, but also clearer reporting standards for execution realism, reproducibility, and evaluation comparability.
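Two of the audit criteria named in the abstract, temporal split discipline and point-in-time controls, can be made concrete with a short sketch. The helper names `chronological_split` and `point_in_time_view`, the record layout (availability date, text), and the sample records are assumptions introduced here for illustration; the abstract does not specify how the audited studies implement these controls.

```python
from datetime import date


def chronological_split(records, cutoff):
    """Split (available_on, payload) records into train (< cutoff) and held-out test (>= cutoff)."""
    train = [r for r in records if r[0] < cutoff]
    test = [r for r in records if r[0] >= cutoff]
    return train, test


def point_in_time_view(records, as_of):
    """Keep only records that were already available on the decision date."""
    return [r for r in records if r[0] <= as_of]


# Invented sample records: (date the item became available, text).
records = [
    (date(2023, 1, 10), "earnings note"),
    (date(2023, 6, 5), "guidance update"),
    (date(2024, 2, 1), "restated figures"),
]

train, test = chronological_split(records, cutoff=date(2024, 1, 1))

# Split discipline: every training record must predate every test record.
assert max(r[0] for r in train) < min(r[0] for r in test)

# Point-in-time control: a decision on 2023-03-01 may only see the first record.
visible = point_in_time_view(records, as_of=date(2023, 3, 1))
print(len(train), len(test), [text for _, text in visible])
```

Filtering on the date an item became available, rather than the period it describes, is what keeps later restatements out of earlier decisions.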