Evaluating Prompting and Execution-Based Methods for Deterministic Computation in LLMs

Hongkun Yu

Published May 8, 2026. Featured #8 in the daily list, May 8, 2026.


Daily score: 68.2

Editorial review: 7.5

Relevance: 0.450

Freshness: 0.722

Why It Matters

What makes this one worth your time

Understanding the capabilities and limitations of LLMs in deterministic tasks is crucial for researchers and engineers aiming to integrate these models into reliable computational applications.

This study reveals the limitations of prompting strategies in LLMs for exact computation tasks.

Summary

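The paper evaluates four prompting strategies for Large Language Models (LLMs) on deterministic computation tasks: Chain-of-Thought, Least-to-Most decomposition, Program-of-Thought, and Self-Consistency. It introduces a synthetic dataset for controlled testing and finds that standard prompting reaches only moderate accuracy on sequence-based tasks, whereas Program-of-Thought reaches perfect accuracy by delegating computation to an external interpreter.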

Key contributions

Systematic evaluation of multiple prompting strategies for deterministic computation in LLMs.
Introduction of a synthetic dataset for testing LLMs on exact computation tasks.
Training of a small domain-specific model (CodeT5-small) that generates executable programs and achieves perfect accuracy on held-out synthetic test data across all tasks.

Notable insights

Program-of-Thought (PoT) achieves perfect accuracy by generating executable code, highlighting the potential of combining LLMs with external interpreters.
Self-Consistency improves robustness but at a significant computational cost, suggesting a trade-off between accuracy and efficiency.
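The Program-of-Thought idea noted above can be sketched roughly as follows. This is a minimal illustration, not the paper's implementation: no model is called, `GENERATED_PROGRAM` is a hypothetical stand-in for model output, the `solve` entry-point name is an assumption, and "count the 1s in a bit string" is only one plausible reading of the binary counting task.

```python
# Hypothetical program text standing in for what a model might emit under a
# Program-of-Thought prompt. No real model is called in this sketch.
GENERATED_PROGRAM = '''
def solve(bits):
    # Assumed task reading: count the "1" characters in a bit string.
    return bits.count("1")
'''


def run_generated_program(source, argument):
    """Execute generated source and call its solve() entry point.

    The entry-point name `solve` is an assumption of this sketch.
    """
    namespace = {}
    # exec() is unsandboxed here; real model output should be run in isolation.
    exec(source, namespace)
    return namespace["solve"](argument)


# The interpreter, not the model, produces the final answer.
answer = run_generated_program(GENERATED_PROGRAM, "1011001")
print(answer)
```

The point of the design is that the model only has to write a correct program; the exact result comes from the interpreter, which is why errors in step-by-step textual reasoning do not enter the final answer.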

Possible limitations

The abstract does not address the generalizability of results to real-world tasks or the potential for error in more complex scenarios.

Abstract

arXiv:2605.03227v2

Large Language Models (LLMs) have demonstrated strong capabilities in natural language understanding and reasoning. However, their ability to perform exact, deterministic computation remains unclear. In this work, we systematically evaluate multiple prompting strategies, including Chain-of-Thought (CoT), Least-to-Most decomposition, Program-of-Thought (PoT), and Self-Consistency (SC), on tasks requiring precise and error-free outputs, including binary counting, longest substring detection, and arithmetic evaluation. To support this study, we introduce a synthetic dataset with diverse natural language instructions, enabling controlled evaluation of exact computation across multiple task types. Our results show that standard prompting methods achieve only moderate accuracy on sequence-based tasks. CoT provides limited improvement, while Least-to-Most suffers from error accumulation. In contrast, PoT achieves perfect accuracy by generating executable code and delegating computation to an external interpreter. Self-Consistency improves robustness through majority voting, but incurs substantial computational overhead. We further train a small domain-specific model (CodeT5-small) to generate executable programs, which achieves perfect accuracy on held-out synthetic test data across all tasks with minimal training cost. Overall, our findings suggest that LLMs may simulate reasoning patterns rather than reliably perform exact symbolic computation. For deterministic tasks, combining LLMs with external tools or using specialized models provides a more reliable and efficient solution.
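The majority-voting step behind Self-Consistency can be sketched as below. This is a minimal illustration under stated assumptions: the sampled answers are hypothetical placeholders rather than model output, and the `majority_vote` helper and its agreement ratio are introduced here for illustration, not taken from the paper.

```python
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer and the fraction of samples that agree."""
    counts = Counter(answers)
    winner, votes = counts.most_common(1)[0]
    return winner, votes / len(answers)


# Hypothetical final answers from five sampled reasoning paths for one prompt.
sampled_answers = ["12", "12", "11", "12", "13"]

winner, agreement = majority_vote(sampled_answers)
print(winner, agreement)
```

Each entry in `sampled_answers` would correspond to one full model call, so voting over k samples costs roughly k times a single query, which is the overhead the abstract refers to.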