TrustLDM: Benchmarking Trustworthiness in Language Diffusion Models

Yichuan Mo, Yukun Jiang, Yanbo Shi, Mingjie Li, Michael Backes, Yang Zhang, Yisen Wang

Published Jun 2, 2026. Featured #2 in the daily list, Jun 3, 2026.

Daily score: 73.4

Editorial review: 7.5

Relevance: 0.464

Freshness: 0.722

Why It Matters

What makes this one worth your time

Understanding trustworthiness in Language Diffusion Models (LDMs) is crucial for developing safer AI systems, especially as these models gain prominence in language processing tasks.

TrustLDM benchmarks LDM trustworthiness across safety, privacy, and fairness, and shows that alignment degrades when malicious post contexts are attached to masked responses.

Summary

The paper introduces TrustLDM, a benchmark for evaluating the trustworthiness of Language Diffusion Models (LDMs) in terms of safety, privacy, and fairness, revealing vulnerabilities in their alignment behavior when exposed to malicious contexts.

Key contributions

Introduction of the TrustLDM benchmark for assessing trustworthiness in LDMs.
Empirical analysis revealing vulnerabilities in LDMs under malicious contexts.
Development of TrustLDM-Auto, an automatic evaluation framework for identifying weak configurations.

Notable insights

Longer contexts do not necessarily degrade LDM trustworthiness more than shorter ones.
Decoding order and generation length both affect evaluation outcomes.

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2606.00023v1

The rapid development of Language Diffusion Models (LDMs) challenges the dominant position of auto-regressive competitors in language processing. However, their flexible, any-order decoding strategies not only enable fast decoding speed but also potentially bring new trustworthiness challenges. To better understand the risks behind their pipelines, we introduce a comprehensive trustworthiness benchmark tailored to LDMs (TrustLDM), evaluating safety, privacy, and fairness across different LDM architectures with multiple categories of static post contexts. Our empirical results show that although LDMs generally exhibit strong trustworthiness with only the user prompts, their alignment behavior degrades noticeably when the malicious post contexts are attached to the masked responses. We further observe that longer contexts do not necessarily induce stronger effects, and both decoding order and generation length affect the evaluation outcomes. Finally, we propose TrustLDM-Auto, an automatic evaluation framework that leverages LDM decoding flexibility to systematically identify vulnerable configurations, revealing substantial trustworthiness weaknesses across all evaluated models and dimensions. Our work may potentially help the community build more trustworthy LDMs. Our code is available at https://github.com/PKU-ML/TrustLDM.
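
To make the evaluation setup in the abstract more concrete, the sketch below shows one way the input layout and the evaluation grid could look. It is an illustration only, not the authors' implementation: the `[MASK]` token, the whitespace tokenization, the function names `build_masked_input` and `enumerate_configs`, and the decoding-order labels are all assumptions introduced here, and the strings are neutral placeholders.

```python
from itertools import product

# Hypothetical mask token; real LDMs define their own special token.
MASK = "[MASK]"


def build_masked_input(prompt_tokens, gen_length, post_context_tokens=()):
    """Lay out the prompt, the masked response slots, and an optional post context.

    The post context sits after the masked response, which is only possible
    because the model is not restricted to left-to-right decoding.
    """
    return list(prompt_tokens) + [MASK] * gen_length + list(post_context_tokens)


def enumerate_configs(gen_lengths, decoding_orders):
    """List every (generation length, decoding order) pair to evaluate."""
    return [
        {"gen_length": g, "decoding_order": o}
        for g, o in product(gen_lengths, decoding_orders)
    ]


# Placeholder tokens; a real benchmark would use its own prompts and contexts.
prompt = "user : placeholder question".split()
post = "placeholder trailing text".split()

seq = build_masked_input(prompt, gen_length=4, post_context_tokens=post)

# The order labels are assumed names, not identifiers from the paper.
configs = enumerate_configs([4, 8], ["left_to_right", "random"])
```

Sweeping a grid like `configs` is one simple way to check the abstract's observation that generation length and decoding order both affect outcomes; how TrustLDM-Auto actually searches for vulnerable configurations is not described in the abstract.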