Impacts of Histories and Models on LLM Grading: A Study in Advanced Software Engineering Courses

Qilin Zhou, Zhuo Wang, Yue Li, W. K. Chan

Published Jun 9, 2026. Featured #5 in the daily list, Jun 10, 2026.


Daily score: 71.4

Editorial review: 7.5

Relevance: 0.466

Freshness: 0.722

Why It Matters

What makes this one worth your time

Understanding LLM grading reliability is crucial for educators considering automation in assessment, as it directly impacts educational fairness and workload.

The study finds significant grading inconsistencies between LLMs, and shows that accumulated interaction history pulls their scores away from those of human experts.

Summary

The paper investigates how reliably large language models (LLMs) grade graduate-level research reading reports. Using a case study of 180 student submissions from a graduate advanced software engineering course, it measures grading consistency and alignment with human scores.

Key contributions

Evaluation of Grok and GPT LLMs in a real educational context.
Identification of intra-model and inter-model grading inconsistencies.
Proposal of a human-aligned LLM-assisted grading workflow.

Notable insights

Continuous interaction history can lead to systematic drift in grading standards of LLMs.
Simple ensemble approaches do not enhance alignment with human grading.
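
One rough way to check for the drift described in the first insight is to track the signed error (LLM score minus human score) in the order the submissions were graded and fit a straight line to it. The sketch below is an illustration only: the function name `drift_slope`, the choice of a least-squares slope, and the toy scores are assumptions, not the measure used in the paper.

```python
from statistics import mean


def drift_slope(llm_scores, human_scores):
    """Least-squares slope of signed error (LLM minus human) against grading order.

    Hypothetical helper: a slope near zero suggests a stable grading standard,
    while a clearly positive or negative slope suggests the standard is shifting
    as the conversation history grows.
    """
    errors = [llm - human for llm, human in zip(llm_scores, human_scores)]
    n = len(errors)
    if n < 2:
        raise ValueError("need at least two graded submissions")
    positions = range(n)  # 0, 1, 2, ... in grading order
    pos_mean = mean(positions)
    err_mean = mean(errors)
    numerator = sum((p - pos_mean) * (e - err_mean) for p, e in zip(positions, errors))
    denominator = sum((p - pos_mean) ** 2 for p in positions)
    return numerator / denominator


# Toy data (not from the paper): the LLM grows one point more lenient per submission.
human = [80, 80, 80, 80]
llm = [80, 81, 82, 83]
slope = drift_slope(llm, human)
```

With these toy numbers the slope is 1.0 score points per submission. Comparing the slope from a single long session against the slope from fresh sessions per submission would be one way to separate history effects from ordinary noise.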

Possible limitations

Not stated in the abstract.

Abstract

arXiv:2606.08400v1

Graduate-level research reading report assessment creates a substantial labor burden for educators. While large language models (LLMs) hold great potential for automating academic grading, their reliability for this specialized task remains understudied, particularly regarding grading consistency, the lack of which represents a primary obstacle to educational fairness. This paper proposes a human-aligned LLM-assisted grading workflow and presents a case study based on 180 student submissions from a graduate advanced software engineering course. We evaluate two mainstream LLMs, Grok and GPT, in terms of grading consistency and alignment with human scores. We find LLMs exhibit distinct levels of intra-model consistency and significant inter-model grading inconsistencies, while simple ensemble approaches cannot improve alignment with human evaluation. Critically, continuous interaction history drives systematic drift in models' grading standards away from human expert scores. Our findings demonstrate LLMs' potential in reducing grading workload for educators in graduate education, while highlighting that indiscriminate LLM grading may introduce systemic unfairness, suggesting that specific operational practices are required to mitigate such disparities.
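
The quantities named in the abstract (intra-model consistency, alignment with human scores, and a simple ensemble) can be made concrete with a small sketch. The abstract does not say which statistics the authors use, so everything here is an assumption: the function names are hypothetical, spread is measured as a per-submission standard deviation, alignment as mean absolute error, and the ensemble as a plain average of two models. The scores are toy values, not results from the paper.

```python
from statistics import mean, pstdev


def intra_model_spread(runs):
    """Average per-submission standard deviation across repeated runs of one model.

    runs: list of runs, each a list of scores in the same submission order.
    Lower values mean the model grades the same report more consistently.
    """
    return mean(pstdev(scores) for scores in zip(*runs))


def mean_abs_error(scores, human):
    """Mean absolute difference between model scores and human scores."""
    return mean(abs(s - h) for s, h in zip(scores, human))


def average_ensemble(scores_a, scores_b):
    """Simple ensemble: average two models' scores per submission."""
    return [(a + b) / 2 for a, b in zip(scores_a, scores_b)]


# Toy data (assumed): both models overshoot the human grader.
human = [70, 80, 90]
model_a = [75, 85, 95]
model_b = [73, 83, 93]

spread = intra_model_spread([[70, 80], [72, 80]])
mae_a = mean_abs_error(model_a, human)
mae_b = mean_abs_error(model_b, human)
mae_ensemble = mean_abs_error(average_ensemble(model_a, model_b), human)
```

In this toy case the ensemble error (4) lands between the two models' errors (5 and 3), because averaging cannot cancel errors that point in the same direction. That is one plausible mechanism by which a simple ensemble fails to improve alignment, offered as an illustration and not as the paper's explanation.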