Dive into Ambiguity: A*-Inspired Multi-Agents Commonsense Obfuscation Attack on LLM Prompts
Boxuan Wang, Zhuoyun Li, Xiaowei Huang, Yi Dong
Why It Matters
What makes this one worth your time
As LLMs are increasingly deployed in critical applications, understanding and mitigating their vulnerabilities to adversarial attacks is essential for ensuring reliability and safety.
The paper proposes an A*-inspired framework for generating adversarial prompts that, according to the abstract, achieves higher attack success rates with fewer attempts than exhaustive exploration.
Summary
The paper introduces an A*-inspired framework for generating adversarial prompts that induce commonsense hallucinations in large language models while preserving semantic intent. Across diverse LLMs, the authors report higher attack success rates than exhaustive exploration with fewer attempts.
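To make the "A*-inspired" idea concrete, here is a minimal sketch of a generic A*-style best-first search on a harmless toy problem (building a target string one character at a time). It is not the paper's algorithm: the function `a_star_style_search`, the toy `expand`, `heuristic`, and `is_goal` callbacks, and the unit step costs are all assumptions made for illustration. The point is only that ordering candidates by f = g + h can reach a goal after far fewer expansions than enumerating every candidate.

```python
import heapq
from typing import Callable, Iterable, Optional

def a_star_style_search(
    start: str,
    expand: Callable[[str], Iterable[tuple[str, float]]],
    heuristic: Callable[[str], float],
    is_goal: Callable[[str], bool],
    max_expansions: int = 1000,
) -> Optional[tuple[str, float, int]]:
    """Best-first search ordered by f = g + h.

    Returns (goal_state, path_cost, expansions) or None if no goal is found.
    """
    frontier = [(heuristic(start), 0.0, start)]  # entries are (f, g, state)
    best_cost = {start: 0.0}
    expansions = 0
    while frontier and expansions < max_expansions:
        _, cost, state = heapq.heappop(frontier)
        if cost > best_cost.get(state, float("inf")):
            continue  # stale entry: a cheaper path to this state was found
        if is_goal(state):
            return state, cost, expansions
        expansions += 1
        for nxt, step_cost in expand(state):
            new_cost = cost + step_cost
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt))
    return None

# Toy problem (hypothetical): reach TARGET by appending "a" or "b".
TARGET = "abba"

def expand(state: str):
    if len(state) < len(TARGET):
        yield state + "a", 1.0
        yield state + "b", 1.0

def heuristic(state: str) -> float:
    # Exact remaining cost for prefixes of TARGET, infinite for dead ends.
    if TARGET.startswith(state):
        return float(len(TARGET) - len(state))
    return float("inf")

result = a_star_style_search("", expand, heuristic, lambda s: s == TARGET)
print(result)  # ('abba', 4.0, 4)
```

In this toy setting the search expands 4 states, whereas exhaustive enumeration would have to consider up to 16 full-length strings. How the paper defines its cost and heuristic terms is not stated in the abstract.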
Key contributions
- Development of the A*-inspired Factual Error Induction Framework for prompt obfuscation.
- Implementation of a Hierarchical Rewrite Strategy guided by a dynamic semantic dispersion coefficient.
- Theoretical proof that prompt rewriting follows a contractive recurrence, leading to semantic collapse as $\gamma$ decreases.
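For readers unfamiliar with the term, the simplest textbook form of a contractive recurrence is shown below. The symbols are illustrative assumptions, not the paper's notation: $s_t$ stands for some scalar measure of semantic content retained after $t$ rewrites, and $c$ is a contraction factor. The abstract does not give the actual recurrence or say how the factor depends on $\gamma$.

$$
s_{t+1} \le c \, s_t, \quad 0 \le c < 1
\quad \Longrightarrow \quad
s_t \le c^{t} s_0 \to 0 \ \text{ as } t \to \infty .
$$

For example, with $c = 0.5$ and $s_0 = 1$, three rewrites already give $s_3 \le 0.5^3 = 0.125$, which is the sense in which repeated contraction drives the quantity toward collapse.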
Notable insights
- The use of a dynamic semantic dispersion coefficient $\gamma$ to balance conservative and aggressive edits is a clever approach to prompt obfuscation.
- The introduction of Agentic Mechanism Labeling for interpretability in adversarial mechanisms is a noteworthy contribution.
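The "conservative early, aggressive later" idea can be pictured as a warm-up curve, roughly the opposite of a simulated annealing cooling schedule. The sketch below is only an illustration under stated assumptions: the function name `aggressiveness`, the exponential shape, and the parameters `low`, `high`, and `rate` are all hypothetical. The abstract gives neither the functional form of the schedule nor how this quantity maps onto $\gamma$ itself.

```python
import math

def aggressiveness(step: int, total_steps: int,
                   low: float = 0.1, high: float = 0.9,
                   rate: float = 4.0) -> float:
    """Hypothetical warm-up schedule: starts at `low`, rises to `high`.

    This is an assumed shape for illustration, not the paper's schedule.
    """
    if total_steps <= 1:
        return high
    progress = step / (total_steps - 1)  # 0.0 at the first step, 1.0 at the last
    # Exponential warm-up, normalized so progress=0 maps to 0 and progress=1 to 1.
    warm = (1.0 - math.exp(-rate * progress)) / (1.0 - math.exp(-rate))
    return low + (high - low) * warm

# Six rewrite rounds: small values first (conservative), larger values later.
schedule = [aggressiveness(t, 6) for t in range(6)]
for t, value in enumerate(schedule):
    print(f"round {t}: {value:.3f}")
```

A rewrite loop could read this value each round to decide how far an edit is allowed to drift from the original wording. A linear or stepwise ramp would serve the same illustrative purpose.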
Possible limitations
- Not stated in the abstract.
Abstract
arXiv:2606.01441v1
Large language models (LLMs) excel in reasoning and knowledge-intensive tasks but remain vulnerable to prompt-level adversarial attacks that preserve intent while triggering commonsense hallucinations. This vulnerability is urgent, as LLMs are rapidly integrated into safety-critical domains where factual reliability is non-negotiable. Existing attack methods either lack efficiency or fail to capture the adaptive strategies of real-world adversaries. We propose an A*-inspired Factual Error Induction Framework, a framework for generating semantically aligned yet obfuscated prompts. At its core is a Hierarchical Rewrite Strategy guided by a dynamic semantic dispersion coefficient $\gamma$ that balances conservative edits early with aggressive obfuscations later, following a reverse simulated annealing schedule. To enhance interpretability, we further introduce Agentic Mechanism Labeling, which discovers and refines adversarial mechanisms, offering interpretable reverse optimization. Theoretically, we prove that prompt rewriting follows a contractive recurrence, leading to semantic collapse as $\gamma$ decreases. Empirically, across diverse LLMs, our method achieves higher attack success rates than exhaustive exploration while requiring fewer attempts, demonstrating both efficiency and effectiveness.