Explain Like I'm 5 or Whatever I Choose: Evaluating the Interactive Potential of Language Model Responses
Indu Panigrahi, Tal August
Why It Matters
What makes this one worth your time
As LLMs are integrated into interfaces beyond a single, static chat window, they need to tailor the complexity of a response reliably, and evaluations need to test for that.
The paper evaluates LLMs' ability to vary response complexity and finds it inconsistent: even the best-performing model shifts complexity in the correct direction only 46% of the time.
Summary
The paper proposes a new evaluation framework for large language models (LLMs) that tests whether a model can generate multiple responses to one query that differ in language complexity. The framework is based on a formative study with 16 participants. The authors evaluate GPT-5.1, GPT-5 mini, Claude Sonnet 4.5 + Thinking, and DeepSeek-V3.1 by generating 5 responses at different levels of language complexity for 98 scientific queries. The models do vary complexity across responses, but inconsistently: the best-performing model (Claude Sonnet 4.5) shifts reliable complexity measures in the correct direction only 46% of the time.
Key contributions
- Proposes a new evaluation framework for assessing language complexity variation in LLM responses.
- Grounds the framework in a formative study with 16 participants.
- Evaluates GPT-5.1, GPT-5 mini, Claude Sonnet 4.5 + Thinking, and DeepSeek-V3.1 on 98 scientific queries, generating 5 responses per query at different levels of language complexity.
Notable insights
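- The framework is inspired by direct manipulation interfaces from the human-centered design literature and varies responses along one interpretable axis of language: language complexity.
- Models do vary complexity across responses, but most changes are inconsistent. The findings hold with increased sample size and alternative complexity levels.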
Possible limitations
- Not stated in the abstract
Abstract
arXiv:2606.06788v1
Evaluations of large language models (LLMs) in scientific information seeking tasks have become increasingly use-centric, such as conducting live or multi-turn evaluations with real users. These evaluations still assume a single, static chat interface, but as models are integrated into new interfaces, evaluations must shift to incorporate interface-specific criteria. We propose a new evaluation framework based on a formative study with 16 participants that tests models' ability to generate multiple responses to one query that differ along an interpretable axis of language (language complexity), inspired by direct manipulation interfaces from human-centered design literature. We evaluate GPT-5.1, GPT-5 mini, Claude Sonnet 4.5 + Thinking, and DeepSeek-V3.1 by generating 5 responses at different levels of language complexity for 98 scientific queries. While models vary complexity across responses, most changes remain inconsistent, with the best performing model (Claude Sonnet 4.5) only shifting reliable complexity measures in the correct direction 46% of the time. Our findings hold with increased sample size and alternative complexity levels.
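To make the "correct direction" idea concrete, here is a minimal Python sketch of how such a check could be scored. It is not the paper's method. The abstract names neither its complexity measures nor its scoring rule, so the measure (`mean_sentence_length`), the function `correct_shift_rate`, and the choice to compare adjacent complexity levels are all assumptions made for illustration.

```python
import re
from typing import Callable, Sequence


def mean_sentence_length(text: str) -> float:
    """Average words per sentence. A crude stand-in measure (assumption),
    not one of the measures used in the paper."""
    sentences = [s for s in re.split(r"[.!?]+", text) if s.strip()]
    if not sentences:
        return 0.0
    return sum(len(s.split()) for s in sentences) / len(sentences)


def correct_shift_rate(
    responses: Sequence[str],
    measure: Callable[[str], float] = mean_sentence_length,
) -> float:
    """Fraction of adjacent level pairs where the measure strictly increases.

    `responses` is assumed to be ordered from the simplest requested level
    to the most complex one (hypothetical setup).
    """
    scores = [measure(r) for r in responses]
    pairs = list(zip(scores, scores[1:]))
    if not pairs:
        raise ValueError("need at least two responses to compare")
    return sum(later > earlier for earlier, later in pairs) / len(pairs)
```

With 5 responses per query, this gives 4 adjacent comparisons per query. Averaging the rate over all queries would yield a single percentage per model, similar in spirit to the 46% figure reported in the abstract.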