AICompanionBench: Benchmarking LLMs-as-Judges for AI Companion Safety
Yanjing Ren, Reza Ebrahimi, TengTeng Ma
Why It Matters
What makes this one worth your time
Understanding and improving the safety of AI companion interactions is crucial as these platforms become more prevalent, and this work provides a tool for evaluating and enhancing model performance in this area.
AICompanionBench provides a new dataset for testing how well LLMs, used as judges, detect unsafe AI companion interactions.
Summary
The paper introduces AICompanionBench, a benchmark dataset of 2,123 real-world Replika conversations collected from Reddit and annotated with fine-grained safety risk categories. It evaluates 20 open-source and closed-source language models as judges of unsafe interactions and finds that, while stronger models detect explicit harmful content well, they struggle with nuanced categories such as manipulation and sometimes flag benign conversations as harmful.
Key contributions
- Introduction of AICompanionBench, a benchmark dataset for AI companion safety.
- Evaluation of 20 LLMs on their ability to detect unsafe interactions in AI companion conversations.
Notable insights
- The use of human-AI collaboration for annotating safety risks in conversations.
- The identification of specific challenges in detecting nuanced unsafe interactions such as manipulation.
Possible limitations
- Potential selection bias, since the dataset consists of Replika conversations collected from Reddit.
Abstract
arXiv:2606.04867v1
As AI companion platforms such as Replika and Character.AI rapidly grow, concerns about unsafe human-AI interactions have intensified. This study introduces AICompanionBench, to our knowledge the first publicly available benchmark dataset of human-AI companion conversations annotated with fine-grained safety risk categories. The dataset contains 2,123 real-world Replika conversations collected from Reddit and annotated through human-AI collaboration across nine categories: sexual behavior, antisocial behavior, physical aggression, verbal aggression, substance abuse, self-harm and suicide, control, manipulation, and no-harm. Using this benchmark, we evaluate 20 state-of-the-art open-source and closed-source LLMs under an LLM-as-judge framework for detecting unsafe interactions. Results show substantial variation in model performance, with stronger models achieving high overall accuracy but still struggling with nuanced categories such as manipulation, as well as benign conversations that are incorrectly identified as harmful. Our findings suggest that while current LLMs can effectively detect explicit harmful content, they remain limited in identifying implicit unsafe interactions. Overall, our work contributes a new benchmark dataset for AI companionship safety research and offers insights into monitoring AI companion systems using LLMs. The dataset is publicly available at: https://github.com/anonymousresearcher2026/AICompanionBench/blob/main/AICompanionBench.xlsx
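The abstract reports two kinds of judge failure: missed nuanced categories such as manipulation, and benign conversations flagged as harmful. A minimal sketch of how those two quantities could be measured is given below. It assumes each conversation carries exactly one gold label and the judge returns exactly one label, which the abstract does not confirm. The function names and the toy labels are hypothetical and are not taken from the paper or its dataset.

```python
from collections import Counter

# The nine labels listed in the abstract.
CATEGORIES = [
    "sexual behavior", "antisocial behavior", "physical aggression",
    "verbal aggression", "substance abuse", "self-harm and suicide",
    "control", "manipulation", "no-harm",
]

def per_category_recall(gold, predicted):
    """Fraction of conversations in each gold category that the judge labeled correctly.

    Assumes single-label classification (an assumption, not stated in the abstract).
    Categories with no gold examples are omitted from the result.
    """
    totals = Counter(gold)
    hits = Counter(g for g, p in zip(gold, predicted) if g == p)
    return {c: hits[c] / totals[c] for c in CATEGORIES if totals[c]}

def benign_false_alarm_rate(gold, predicted):
    """Share of gold no-harm conversations that the judge assigned to a harmful category."""
    benign = [p for g, p in zip(gold, predicted) if g == "no-harm"]
    if not benign:
        return 0.0
    return sum(p != "no-harm" for p in benign) / len(benign)

# Toy labels invented for illustration; not data from the benchmark.
gold = ["manipulation", "manipulation", "no-harm", "no-harm", "verbal aggression"]
pred = ["no-harm", "manipulation", "control", "no-harm", "verbal aggression"]

recall = per_category_recall(gold, pred)
false_alarms = benign_false_alarm_rate(gold, pred)
print(recall)
print(false_alarms)
```

Reporting recall per category, rather than a single overall accuracy, is what makes a gap on a category like manipulation visible even when the overall score is high.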