Benchmarking the Safety of Large Language Models for Robotic Health Attendant Control
Mahiro Nakao, Kazuhiro Takemoto
Why It Matters
What makes this one worth your time
Understanding the safety of LLMs in healthcare applications is crucial for ensuring patient safety and effective robotic assistance.
This study highlights critical safety concerns in deploying LLMs for robotic health attendants.
Summary
The paper evaluates the safety of 72 large language models (LLMs) in controlling robotic health attendants using a dataset of harmful instructions, revealing significant violation rates and the influence of model characteristics on safety performance.
Key contributions
- Introduction of a dataset of harmful instructions for evaluating LLMs in healthcare contexts.
- Empirical evaluation of 72 LLMs with detailed analysis of violation rates across different behavior categories.
- Insights into the relationship between model size, release date, and safety performance.
Notable insights
- Proprietary models demonstrated significantly better safety performance compared to open-weight models.
- The study identifies specific behavior categories where LLMs struggle to refuse harmful instructions, indicating areas for targeted improvement.
Possible limitations
- The abstract does not address potential variations in real-world scenarios compared to the simulation environment.
- Medical domain fine-tuning's lack of significant safety benefit may not apply universally across all contexts.
Abstract
arXiv:2604.26577v1 Announce Type: new Abstract: Large language models (LLMs) are increasingly considered for deployment as the control component of robotic health attendants, yet their safety in this context remains poorly characterized. We introduce a dataset of 270 harmful instructions spanning nine prohibited behavior categories grounded in the American Medical Association Principles of Medical Ethics, and use it to evaluate 72 LLMs in a simulation environment based on the Robotic Health Attendant framework. The mean violation rate across all models was 54.4\%, with more than half exceeding 50\%, and violation rates varied substantially across behavior categories, with superficially plausible instructions such as device manipulation and emergency delay proving harder to refuse than overtly destructive ones. Model size and release date were the primary determinants of safety performance among open-weight models, and proprietary models were substantially safer than open-weight counterparts (median 23.7\% versus 72.8\%). Medical domain fine-tuning conferred no significant overall safety benefit, and a prompt-based defense strategy produced only a modest reduction in violation rates among the least safe models, leaving absolute violation rates at levels that would preclude safe clinical deployment. These findings demonstrate that safety evaluation must be treated as a first-class criterion in the development and deployment of LLMs for robotic health attendants.