Evaluating LLMs' Effectiveness on Real-World Consumer Device Repair Questions

Atm Mizanur Rahman (University of Illinois Urbana-Champaign), Md Arid Hasan (University of Toronto), Syed Ishtiaque Ahmed (University of Toronto), Sharifa Sultana (University of Illinois Urbana-Champaign)

Published Jun 3, 2026 | Featured #6 | In the daily list Jun 4, 2026


Daily score: 70.0

Editorial review: 7.5

Relevance: 0.459

Freshness: 0.722

Why It Matters

What makes this one worth your time

Understanding LLMs' performance in high-stakes domains like device repair is crucial for developing safer AI applications and improving user trust.

This study benchmarks LLMs on real-world consumer device repair tasks, revealing significant reliability issues.

Summary

The paper introduces a benchmark of 991 real-world consumer device repair questions and evaluates six state-of-the-art LLMs on their effectiveness in providing repair assistance, highlighting their limitations in safety-critical scenarios.

Key contributions

Introduction of a benchmark dataset for consumer device repair questions.
Evaluation of LLMs using repair-specific criteria, providing insights into their practical utility.
Analysis of cross-lingual performance with Bangla translations.

Notable insights

The study identifies specific areas where LLMs struggle, such as board-level diagnosis and safety procedures, which are critical for real-world applications.
The performance disparity between English and Bangla responses highlights the challenges in cross-lingual LLM applications.

Possible limitations

The abstract does not address the potential biases in the dataset or the representativeness of the repair questions.

Abstract

arXiv:2606.03331v1

Consumer device repair is an important but underexplored testbed for large language models (LLMs). Repair tasks require reasoning over incomplete problem descriptions, hardware-specific diagnostics, actionable troubleshooting, and safety-critical decisions, where incorrect advice can cause device damage, battery hazards, or permanent data loss. We introduce a benchmark of 991 real-world repair questions from Reddit spanning phone repair, computer repair, and data recovery, each paired with technician-written reference solutions, and provide Bangla translations to evaluate cross-lingual performance. We evaluate six state-of-the-art LLMs in English and Bangla using four repair-specific criteria: correctness, completeness, practicality, and safety. Our results show that while LLMs can provide useful repair assistance, they remain unreliable for high-risk real-world repair tasks without rigorous evaluation and explicit safety safeguards. Phone repair is the most difficult and safety-sensitive domain, and all models make substantial errors in board-level diagnosis, repair prioritization, and safe recovery procedures. Across domains and models, Bangla responses consistently perform worse than English responses. Among the evaluated models, GPT-5.4 performs best overall.
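
The evaluation setup described in the abstract (several models, two languages, four criteria) could be summarized with a simple per-criterion average. The following is a minimal sketch, not the authors' code: the `Rating` record, the function names, the numeric rating scale, and the idea of averaging are all assumptions introduced for illustration, since the abstract does not state how scores are assigned or aggregated.

```python
from collections import defaultdict
from dataclasses import dataclass
from statistics import mean

# The four repair-specific criteria named in the abstract.
CRITERIA = ("correctness", "completeness", "practicality", "safety")


@dataclass(frozen=True)
class Rating:
    """One judged response (hypothetical record layout)."""
    model: str       # placeholder model identifier
    language: str    # "en" for English, "bn" for Bangla
    domain: str      # e.g. "phone", "computer", "data_recovery"
    scores: dict     # criterion -> numeric score; the scale is an assumption


def mean_scores(ratings):
    """Average each criterion per (model, language) pair."""
    buckets = defaultdict(lambda: defaultdict(list))
    for r in ratings:
        for criterion in CRITERIA:
            buckets[(r.model, r.language)][criterion].append(r.scores[criterion])
    return {
        key: {c: mean(values[c]) for c in CRITERIA}
        for key, values in buckets.items()
    }


def language_gap(summary, model, first="en", second="bn"):
    """Per-criterion difference: mean(first) minus mean(second) for one model."""
    return {
        c: summary[(model, first)][c] - summary[(model, second)][c]
        for c in CRITERIA
    }
```

A positive value from `language_gap` for a criterion would mean the English responses scored higher than the Bangla ones on that criterion, which is the direction of the disparity the abstract reports. Any numbers fed into these functions would have to come from the paper's own ratings; none are reproduced here.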