Personal feed
For You
Ranked from cached scores, your tag preferences, and your paper feedback.
prefers agent, prefers llm, prefers reasoning
#1 Collaborative Multi-Agent Scripts Generation for Enhancing Imperfect-Information Reasoning in Murder Mystery Games
This paper addresses the challenge of improving VLMs' reasoning abilities in complex, adversarial environments, which is crucial for advancing AI's understanding and interaction in real-world scenarios with incomplete information.
agent, benchmark, evaluation, multimodal
Final 0.772 | Boost 0.100 | LLM 8.500 | Emb 0.488

#2 The Blind Spot of Agent Safety: How Benign User Instructions Expose Critical Vulnerabilities in Computer-Use Agents
As CUAs become more prevalent in executing complex tasks, understanding their vulnerabilities in benign contexts is crucial for developing robust safety mechanisms and preventing unintended harm.
agent, benchmark, evaluation, security
Final 0.749 | Boost 0.080 | LLM 8.500 | Emb 0.478

#3 Instructing LLMs to Negotiate using Reinforcement Learning with Verifiable Rewards
By effectively training LLMs to negotiate using verifiable rewards, this research advances the potential for autonomous agents in complex economic interactions, which could have wide-ranging applications in commerce and automated decision-making.
agent, alignment, llm
Final 0.747 | Boost 0.080 | LLM 8.500 | Emb 0.469

#4 Agentic Application in Power Grid Static Analysis: Automatic Code Generation and Error Correction
The integration of LLMs in technical domains like power grid analysis can significantly streamline workflows, reduce human error, and improve the accessibility of complex systems.
agent, evaluation, llm, other
Final 0.746 | Boost 0.080 | LLM 8.500 | Emb 0.468

#5 From Query to Counsel: Structured Reasoning with a Multi-Agent Framework and Dataset for Legal Consultation
The development of JurisCQAD and the JurisMA framework addresses critical challenges in legal AI, such as data scarcity and complex reasoning, paving the way for more effective legal consultation tools.
agent, data, llm, reasoning
Final 0.745 | Boost 0.080 | LLM 8.500 | Emb 0.464

#6 SWE-AGILE: A Software Agent Framework for Efficiently Managing Dynamic Reasoning Context
This paper addresses critical challenges in autonomous software engineering by enhancing reasoning capabilities, which could lead to more effective and efficient AI systems in complex environments.
agent, benchmark, reasoning
Final 0.745 | Boost 0.080 | LLM 8.500 | Emb 0.462

#7 Agentic Driving Coach: Robustness and Determinism of Agentic AI-Powered Human-in-the-Loop Cyber-Physical Systems
As AI systems increasingly interact with humans and physical environments, ensuring their reliability and predictability is crucial for safety and effectiveness, particularly in applications like autonomous driving.
agent, other, robotics
Final 0.744 | Boost 0.080 | LLM 8.500 | Emb 0.462

#8 LLMs for Text-Based Exploration and Navigation Under Partial Observability
Understanding how LLMs can be effectively utilized in navigation and exploration tasks under partial observability has significant implications for applications in logistics, search-and-rescue, and robotics.
agent, benchmark, llm, reasoning
Final 0.744 | Boost 0.080 | LLM 8.500 | Emb 0.461

#9 The Amazing Agent Race: Strong Tool Users, Weak Navigators
This research highlights the limitations of current tool-use benchmarks by demonstrating that agents struggle more with navigation than tool execution, prompting a reevaluation of how we assess AI capabilities.
agent, benchmark, evaluation, llm
Final 0.742 | Boost 0.080 | LLM 8.500 | Emb 0.454

#10 Agent^2 RL-Bench: Can LLM Agents Engineer Agentic RL Post-Training?
This research addresses a critical gap in evaluating the interactive capabilities of LLM agents in reinforcement learning, which is essential for advancing model alignment and specialization.
agent, benchmark, evaluation, llm
Final 0.741 | Boost 0.080 | LLM 8.500 | Emb 0.451