GPF-LiveNews: A Streaming Evaluation Protocol for Group-Conditioned Framing in Large Language Models
Mohd Ariful Haque, Fahad Rahman, Kishor Datta Gupta, Roy George
Why It Matters
Language models may frame the same event differently depending on the audience named in the prompt, so auditing that framing is part of assessing fairness and bias in deployed systems.
GPF-LIVENEWS provides a streaming protocol for auditing this framing on newly emerging news, a setting that static bias benchmarks do not cover.
Summary
The paper introduces GPF-LIVENEWS, a streaming evaluation protocol and benchmark snapshot for auditing group-conditioned framing in open-ended LLM outputs. It expands fresh BBC/Reuters news anchors across 42 identity labels and seven prompt families, then scores the resulting response bundles with semantic-sensitivity and sentiment-disparity signals.
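The expansion step can be illustrated with a minimal sketch, assuming a full cross product of labels and prompt families. The labels, the template wording, the `summary` family, and the `expand_anchor` helper are hypothetical placeholders; only the Policy/Action family name and the 42 x 7 design come from the abstract.

```python
from itertools import product

# Hypothetical placeholders: the paper uses 42 identity labels.
IDENTITY_LABELS = ["group_a", "group_b", "group_c"]

# Hypothetical templates: the paper uses seven prompt families.
# Only "Policy/Action" is named in the abstract; the wording here is invented.
PROMPT_FAMILIES = {
    "summary": "Summarize this story for a reader who identifies as {label}: {headline}",
    "policy_action": "For a reader who identifies as {label}, what actions should follow from this story? {headline}",
}


def expand_anchor(headline, labels=IDENTITY_LABELS, families=PROMPT_FAMILIES):
    """Instantiate one prompt per (identity label, prompt family) pair."""
    prompts = []
    for label, (family, template) in product(labels, families.items()):
        prompts.append(
            {
                "family": family,
                "label": label,
                "prompt": template.format(label=label, headline=headline),
            }
        )
    return prompts


prompts = expand_anchor("Example headline")
print(len(prompts))  # 3 labels x 2 families = 6
```

Under the same full-cross-product assumption, the paper's 42 labels and seven families would give 42 x 7 = 294 prompts per news anchor.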
Key contributions
- Introduction of a streaming evaluation protocol for group-conditioned framing.
- Development of a benchmark using fresh news content and identity labels.
- Provision of a comprehensive artifact including metadata, templates, and scripts for reproducibility.
Notable insights
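- Fresh news anchors let the audit track how models frame newly emerging events, rather than relying on a fixed test set.
- Semantic-sensitivity and sentiment-disparity signals are used to audit model outputs.
- In a pilot over 12 monitoring runs and 23 hosted models, Policy/Action prompts produce the strongest semantic movement, while sentiment variation is flatter across dimensions and prompt families.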
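The abstract names the two signal types but does not define them. As a rough illustration only, the sketch below assumes a semantic signal computed as the mean pairwise cosine distance between response embeddings in one bundle, and a sentiment signal computed as the gap between the highest and lowest sentiment score across identity labels. Both definitions and the function names `semantic_spread` and `sentiment_gap` are assumptions, not the paper's metrics.

```python
import math
from itertools import combinations


def cosine_distance(u, v):
    """Return 1 minus the cosine similarity of two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return 1.0 - dot / (norm_u * norm_v)


def semantic_spread(embeddings):
    """Assumed semantic signal: mean pairwise cosine distance in a bundle.

    Expects at least two embeddings, one per identity-conditioned response.
    """
    pairs = list(combinations(embeddings, 2))
    return sum(cosine_distance(u, v) for u, v in pairs) / len(pairs)


def sentiment_gap(scores):
    """Assumed sentiment signal: max minus min sentiment across labels."""
    return max(scores.values()) - min(scores.values())


# Toy inputs; real embeddings and sentiment scores would come from external models.
bundle = [[1.0, 0.0], [0.0, 1.0], [1.0, 0.0]]
print(semantic_spread(bundle))
print(sentiment_gap({"group_a": 0.2, "group_b": -0.1}))
```

Consistent with the abstract's framing, values like these would serve as flags for human review rather than as fairness rankings.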
Possible limitations
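- The reported results come from a pilot over 12 monitoring runs and 23 hosted models.
- The authors interpret all scores as observed-window audit signals for human review, not as permanent fairness rankings or direct proof of harmful bias.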
Abstract
arXiv:2605.28848v1 (cross-list)
Deployed language models are evaluated in a non-stationary environment: model versions, retrieval layers, safety systems, and real-world inputs all change over time. Static bias benchmarks remain useful, but they do not show how models frame newly emerging events for different prompted audiences. We introduce GPF-LIVENEWS, a streaming evaluation protocol and benchmark snapshot for auditing group-conditioned framing in open-ended LLM outputs. The protocol expands fresh BBC/Reuters news anchors across 42 identity labels and seven prompt families, then evaluates response bundles using semantic-sensitivity and sentiment-disparity signals. In a pilot over 12 monitoring runs and 23 hosted models, Policy/Action prompts produce the strongest semantic movement, while sentiment variation is flatter across dimensions and prompt families. The released artifact includes article metadata, prompt templates, instantiated prompts, model-output metadata, score tables, documentation, and reproduction scripts. We interpret all scores as observed-window audit signals for human review, not as permanent fairness rankings or direct proof of harmful bias.