LLMs for Text-Based Exploration and Navigation Under Partial Observability

Stephan Sandfuchs, Maximilian Melchert, Jörg Frochte

Why It Matters

Understanding how LLMs can be effectively utilized in navigation and exploration tasks under partial observability has significant implications for applications in logistics, search-and-rescue, and robotics.

Contributions

Introduction of a reproducible benchmark for evaluating LLMs in exploration and navigation tasks under partial observability.

Insights

Reasoning-tuned models reliably complete navigation across all layouts, but their paths remain longer than the oracle's shortest paths.
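To make the comparison against oracle paths concrete, the following is a minimal sketch of how a path-length ratio could be computed. It assumes the oracle is a breadth-first search over the fully known grid; the layout, the function names `oracle_path_length` and `path_length_ratio`, and the ratio definition are illustrative assumptions, not the paper's implementation.

```python
from collections import deque

# Hypothetical layout: '#' is a wall, '.' is free space, 'G' marks the goal.
LAYOUT = [
    "#######",
    "#.....#",
    "#.##..#",
    "#..#.G#",
    "#######",
]


def oracle_path_length(grid, start, goal):
    """Shortest number of steps from start to goal via BFS, or None if unreachable."""
    queue = deque([(start, 0)])
    seen = {start}
    while queue:
        (r, c), dist = queue.popleft()
        if (r, c) == goal:
            return dist
        # Same four moves the agent has: UP, RIGHT, DOWN, LEFT.
        for dr, dc in ((-1, 0), (0, 1), (1, 0), (0, -1)):
            nr, nc = r + dr, c + dc
            inside = 0 <= nr < len(grid) and 0 <= nc < len(grid[nr])
            if inside and grid[nr][nc] != "#" and (nr, nc) not in seen:
                seen.add((nr, nc))
                queue.append(((nr, nc), dist + 1))
    return None


def path_length_ratio(agent_steps, oracle_steps):
    """Agent path length divided by oracle path length (1.0 means optimal)."""
    if oracle_steps <= 0:
        raise ValueError("oracle path must contain at least one step")
    return agent_steps / oracle_steps


oracle_steps = oracle_path_length(LAYOUT, (1, 1), (3, 5))
ratio = path_length_ratio(9, oracle_steps)  # an agent that needed 9 steps
```

Under this definition, a ratio of 1.0 means the agent matched the shortest path, and larger values indicate detours or loops.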

Limitations

The reliance on fixed ASCII gridworlds may limit the generalizability of the findings to more complex real-world scenarios.

Abstract

arXiv:2604.09604v1

Exploration and goal-directed navigation in unknown layouts are central to inspection, logistics, and search-and-rescue. We ask whether large language models (LLMs) can function as text-only controllers under partial observability, without code execution, tools, or program synthesis. We introduce a reproducible benchmark with oracle localisation in fixed ASCII gridworlds: each step reveals only a local 5×5 window around the agent, and the model must select one of UP/RIGHT/DOWN/LEFT. Nine contemporary LLMs, spanning open and proprietary, dense and Mixture-of-Experts, and instruction-tuned and reasoning-tuned models, are evaluated on two tasks across three layouts of increasing difficulty: Exploration (maximising revealed cells) and Navigation (reaching the goal along the shortest path). Results are assessed with quantitative metrics, including success rate and efficiency measures such as normalised coverage and path length relative to the oracle, alongside a qualitative analysis. Reasoning-tuned models reliably complete navigation across all layouts, yet remain less efficient than oracle paths. Few-shot demonstrations in the prompt chiefly help these reasoning-tuned models by reducing invalid moves and shortening paths, while classic dense instruction models remain inconsistent. We observe characteristic action priors (UP/RIGHT) that can induce looping under partial observability. Overall, training regimen and test-time deliberation predict control ability better than raw parameter count. These findings suggest lightweight hybridisation with classical online planners as a practical route to deployable partial-map systems.
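The observation and action interface described in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions: the layout, the '@' agent marker, rendering out-of-bounds cells as walls, and treating a move into a wall as an invalid no-op are all hypothetical choices, since the abstract does not specify them.

```python
# Hypothetical layout: '#' is a wall, '.' is free space, 'G' marks the goal.
LAYOUT = [
    "#######",
    "#.....#",
    "#.##..#",
    "#..#.G#",
    "#######",
]

# The four actions named in the abstract, as (row, column) offsets.
MOVES = {"UP": (-1, 0), "RIGHT": (0, 1), "DOWN": (1, 0), "LEFT": (0, -1)}


def local_window(grid, pos, radius=2):
    """Return the (2*radius+1)-square text view centred on pos.

    Assumption: cells outside the grid are drawn as walls and the agent as '@'.
    """
    r0, c0 = pos
    rows = []
    for r in range(r0 - radius, r0 + radius + 1):
        row = []
        for c in range(c0 - radius, c0 + radius + 1):
            if (r, c) == pos:
                row.append("@")
            elif 0 <= r < len(grid) and 0 <= c < len(grid[r]):
                row.append(grid[r][c])
            else:
                row.append("#")
        rows.append("".join(row))
    return rows


def step(grid, pos, action):
    """Apply one action and return (new_pos, valid).

    Assumption: unknown actions and moves into walls leave the agent in place.
    """
    if action not in MOVES:
        return pos, False
    dr, dc = MOVES[action]
    r, c = pos[0] + dr, pos[1] + dc
    inside = 0 <= r < len(grid) and 0 <= c < len(grid[r])
    if not inside or grid[r][c] == "#":
        return pos, False
    return (r, c), True


view = local_window(LAYOUT, (1, 1))       # 5 rows of 5 characters
new_pos, valid = step(LAYOUT, (1, 1), "RIGHT")
```

In a text-only setup like the one evaluated, the rows of `view` would be joined with newlines and placed in the prompt, and the model's reply would be parsed into one of the four action strings before calling `step`.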
