The cognitive companion: a lightweight parallel monitoring architecture for detecting and recovering from reasoning degradation in LLM agents
Rafflesia Khan, Nafiul Islam Khan
Why It Matters
What makes this one worth your time
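LLM agents on multi-step tasks can loop, drift, or get stuck, and the abstract reports degradation rates of up to 30% on hard tasks. The paper studies whether a lightweight parallel monitor can detect and recover from these failures, as an alternative to hard step limits and per-step LLM-as-judge monitoring.
It introduces the Cognitive Companion, an architecture aimed at mitigating reasoning degradation in LLM agents.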
Summary
The paper presents the Cognitive Companion, a parallel monitoring architecture for detecting and recovering from reasoning degradation in large language model agents. It has two implementations, an LLM-based Companion and a Probe-based Companion, and is evaluated in a three-batch feasibility study whose results the authors frame as encouraging rather than definitive.
Key contributions
- Development of the Cognitive Companion architecture with two distinct implementations.
- Empirical evidence that the LLM-based Companion reduced repetition on loop-prone tasks by 52-62%, at approximately 11% overhead.
- Exploration of task-type sensitivity in the performance of monitoring companions.
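To make the idea of "reducing repetition" concrete, here is a minimal sketch of one way a monitor could score looping in an agent's recent steps. The abstract does not say how repetition is measured, so the n-gram metric, the function names, and the threshold and window values below are all illustrative assumptions rather than the authors' method.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of consecutive n-token tuples in `tokens`."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def repetition_rate(steps, n=3):
    """Fraction of n-grams across all steps that duplicate an earlier n-gram.

    Hypothetical proxy metric (not taken from the paper): 0.0 means no
    n-gram repeats, values near 1.0 mean the agent mostly re-emits text
    it has already produced.
    """
    counts = Counter()
    for step in steps:
        counts.update(ngrams(step.lower().split(), n))
    total = sum(counts.values())
    if total == 0:
        return 0.0
    # Every occurrence beyond the first counts as a repeat.
    repeated = sum(c - 1 for c in counts.values())
    return repeated / total


def should_intervene(steps, threshold=0.3, window=4):
    """Flag a possible loop when the most recent steps are highly repetitive.

    `threshold` and `window` are assumed values chosen for illustration.
    """
    return repetition_rate(steps[-window:]) >= threshold


looping = ["search the docs for the error code"] * 4
varied = [
    "open the config file",
    "parse the yaml header",
    "write results to disk",
    "close all file handles",
]
print(repetition_rate(looping), should_intervene(looping))
print(repetition_rate(varied), should_intervene(varied))
```

A whitespace tokenizer keeps the sketch dependency-free; a real monitor would more plausibly operate on model tokens or on structured actions.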
Notable insights
- The effectiveness of the Companion is task-type dependent, suggesting that monitoring strategies may need to be tailored for specific task characteristics.
- The Probe-based Companion shows promise for efficient monitoring: it ran at zero measured inference overhead, and its strongest probe reached a cross-validated AUROC of 0.840 on a small proxy-labeled dataset.
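The task-type dependence noted above suggests gating the companion by task type, which the abstract calls selective companion activation. The following is a rough sketch of such a gate; the class name, the task-type labels, and the `monitor` callback are hypothetical, since the abstract only distinguishes loop-prone and open-ended tasks from more structured ones.

```python
from dataclasses import dataclass, field


@dataclass
class CompanionGate:
    """Decides whether a monitoring companion runs for a given task.

    The default labels are assumptions made for illustration only.
    """
    enabled_types: set = field(
        default_factory=lambda: {"loop_prone", "open_ended"}
    )

    def is_active(self, task_type: str) -> bool:
        return task_type in self.enabled_types


def run_step(gate, task_type, step_output, monitor):
    """Call `monitor` on the step only when the gate allows it.

    Returns the monitor's result, or None when the companion is skipped.
    """
    if gate.is_active(task_type):
        return monitor(step_output)
    return None


gate = CompanionGate()
seen = []
run_step(gate, "structured", "step A", seen.append)   # companion skipped
run_step(gate, "open_ended", "step B", seen.append)   # companion runs
print(seen)
```

How a task gets its type label is left open here; that classification step is itself a design problem the sketch does not address.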
Possible limitations
- The findings are based on a feasibility study and may not generalize to all LLMs or tasks.
- The abstract does not address potential scalability issues beyond the tested model sizes.
Abstract
arXiv:2604.13759v1
Large language model (LLM) agents on multi-step tasks suffer reasoning degradation (looping, drift, and stuck states) at rates up to 30% on hard tasks. Current solutions include hard step limits (abrupt) or LLM-as-judge monitoring (10-15% overhead per step). This paper introduces the Cognitive Companion, a parallel monitoring architecture with two implementations: an LLM-based Companion and a novel zero-overhead Probe-based Companion. We report a three-batch feasibility study centered on Gemma 4 E4B, with an additional exploratory small-model analysis on Qwen 2.5 1.5B and Llama 3.2 1B. In our experiments, the LLM-based Companion reduced repetition on loop-prone tasks by 52-62% with approximately 11% overhead. The Probe-based Companion, trained on hidden states from layer 28, showed a mean effect size of +0.471 at zero measured inference overhead; its strongest probe result achieved cross-validated AUROC 0.840 on a small proxy-labeled dataset. A key empirical finding is that companion benefit appears task-type dependent: companions are most helpful on loop-prone and open-ended tasks, while effects are neutral or negative on more structured tasks. Our small-model experiments also suggest a possible scale boundary: companions did not improve the measured quality proxy on 1B-1.5B models, even when interventions fired. Overall, the paper should be read as a feasibility study rather than a definitive validation. The results provide encouraging evidence that sub-token monitoring may be useful, identify task-type sensitivity as a practical design constraint, and motivate selective companion activation as a promising direction for future work.
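For readers unfamiliar with probe-style monitoring, the sketch below shows how a probe over hidden states can be scored and how AUROC is computed from its outputs. It assumes a logistic linear probe, which the abstract does not confirm, and it uses tiny made-up vectors in place of real layer-28 activations. Training and cross-validation are omitted, and all names and numbers are illustrative.

```python
import math


def probe_score(hidden_state, weights, bias=0.0):
    """Logistic probe: sigmoid of a linear readout of one hidden-state vector."""
    z = sum(w * h for w, h in zip(weights, hidden_state)) + bias
    return 1.0 / (1.0 + math.exp(-z))


def auroc(scores, labels):
    """Probability that a random positive outscores a random negative.

    Ties count as half a win. Labels are 1 (degraded) or 0 (healthy).
    """
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUROC needs at least one positive and one negative")
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy stand-ins for hidden states; real ones would have thousands of dimensions.
states = [[2.0, 0.1], [1.5, -0.2], [-1.0, 0.3], [-0.5, 0.0], [0.2, 0.1]]
labels = [1, 1, 0, 0, 0]
weights = [1.0, 0.0]  # assumed probe weights, not learned from data

scores = [probe_score(h, weights) for h in states]
print(auroc(scores, labels))
```

Because a probe like this reads activations the model already computes, its cost is one dot product per step, which is consistent with the abstract's report of zero measured inference overhead.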