Evaluating LLMs' Effectiveness on Real-World Consumer Device Repair Questions
Atm Mizanur Rahman (University of Illinois Urbana-Champaign), Md Arid Hasan (University of Toronto), Syed Ishtiaque Ahmed (University of Toronto), Sharifa Sultana (University of Illinois Urbana-Champaign)
Why It Matters
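Device repair is a high-stakes domain: incorrect advice can cause device damage, battery hazards, or permanent data loss. Knowing how LLMs perform there matters for building safer AI applications and for user trust.
This study benchmarks LLMs on real-world consumer device repair tasks and finds them unreliable for high-risk repairs without rigorous evaluation and explicit safety safeguards.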
Summary
The paper introduces a benchmark of 991 real-world consumer device repair questions and evaluates six state-of-the-art LLMs, in English and Bangla, on the correctness, completeness, practicality, and safety of their repair advice, highlighting their limitations in safety-critical scenarios.
Key contributions
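- A benchmark of 991 real-world repair questions from Reddit covering phone repair, computer repair, and data recovery, each paired with a technician-written reference solution.
- An evaluation of six LLMs on four repair-specific criteria (correctness, completeness, practicality, and safety), providing insights into their practical utility.
- A cross-lingual analysis in English and Bangla, using Bangla translations of the benchmark.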
Notable insights
- All evaluated models make substantial errors in board-level diagnosis, repair prioritization, and safe recovery procedures, which are critical for real-world use. Phone repair is the most difficult and safety-sensitive domain.
- Bangla responses consistently perform worse than English responses across domains and models, which highlights the challenges of cross-lingual LLM applications.
Possible limitations
- The abstract does not address the potential biases in the dataset or the representativeness of the repair questions.
Abstract
arXiv:2606.03331v1
Consumer device repair is an important but underexplored testbed for large language models (LLMs). Repair tasks require reasoning over incomplete problem descriptions, hardware-specific diagnostics, actionable troubleshooting, and safety-critical decisions, where incorrect advice can cause device damage, battery hazards, or permanent data loss. We introduce a benchmark of 991 real-world repair questions from Reddit spanning phone repair, computer repair, and data recovery, each paired with technician-written reference solutions, and provide Bangla translations to evaluate cross-lingual performance. We evaluate six state-of-the-art LLMs in English and Bangla using four repair-specific criteria: correctness, completeness, practicality, and safety. Our results show that while LLMs can provide useful repair assistance, they remain unreliable for high-risk real-world repair tasks without rigorous evaluation and explicit safety safeguards. Phone repair is the most difficult and safety-sensitive domain, and all models make substantial errors in board-level diagnosis, repair prioritization, and safe recovery procedures. Across domains and models, Bangla responses consistently perform worse than English responses. Among the evaluated models, GPT-5.4 performs best overall.
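The abstract names four criteria and an English versus Bangla comparison but does not say how scores are recorded or aggregated. The following is a minimal sketch of one way such ratings could be summarized, not the authors' method. The `Rating` record, the numeric score scale, the language codes `"en"` and `"bn"`, and the helper names are all assumptions made for illustration; the values in the usage note are made-up placeholders, not results from the paper.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The four repair-specific criteria named in the abstract.
CRITERIA = ("correctness", "completeness", "practicality", "safety")


@dataclass
class Rating:
    """One judged response (hypothetical record layout)."""
    model: str
    language: str  # assumed codes: "en" for English, "bn" for Bangla
    domain: str    # e.g. "phone", "computer", "data_recovery"
    scores: dict   # criterion -> numeric score; the scale is an assumption


def mean_by_language(ratings):
    """Return {(model, language): {criterion: mean score}}."""
    buckets = defaultdict(lambda: defaultdict(list))
    for r in ratings:
        for criterion in CRITERIA:
            buckets[(r.model, r.language)][criterion].append(r.scores[criterion])
    return {
        key: {criterion: mean(values[criterion]) for criterion in CRITERIA}
        for key, values in buckets.items()
    }


def language_gap(summary, model, criterion):
    """English mean minus Bangla mean for one model and one criterion."""
    return summary[(model, "en")][criterion] - summary[(model, "bn")][criterion]
```

A usage example with placeholder scores: build a few `Rating` objects for a hypothetical `"model-a"`, call `mean_by_language(ratings)`, then `language_gap(summary, "model-a", "safety")`. A positive gap would mean the English responses scored higher on that criterion under this scheme.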