A4 Refereed article in a conference publication
Exploring the use of Artificial Intelligence in rubric production and detection of reviewer’s bias
Authors: Rytilahti, Juuso; Kaila, Erkki; Lokkila, Erno
Editors: Carmo, Mafalda
Conference name: International Conference on Education and New Developments
Publication year: 2025
Journal: Education and New Developments
Book title: Education and New Developments 2025 : Volume II
First page: 3
Last page: 7
ISBN: 978-989-35728-8-7
ISSN: 2184-044X
eISSN: 2184-1489
DOI: https://doi.org/10.36315/2025v2end001
Publication's open availability at the time of reporting: Open Access
Publication channel's open availability: Open Access publication channel
Web address: https://doi.org/10.36315/2025v2end001
This study explores the use of artificial intelligence in education from a unique angle, providing insight into
whether a large language model (LLM) can be used to create a rubric or detect a scoring bias in teachers'
evaluations. For this, we examine a specific, publicly available model by OpenAI, called GPT-4o. The study
was conducted using student submissions for assignments in an introductory programming course. As the
first approach, we provided the selected LLM with a subset of submissions, with feedback, from differently
graded answers to produce a rubric. The second approach was to provide the LLM with a reference answer
to produce the justification for grading from "nothing". Finally, we tried combining these two approaches.
We also tested producing the rubric with the exercise description. After the AI model output grading
justifications, we compared the justifications produced by the LLM to those created by teaching personnel. We
also tested whether a bias in reviews could be detected. Our results indicate that there is considerable variance
in the AI-generated rubrics, and in general they are stricter than the human-generated ones. Moreover, we found
that utilizing the exercise description often produced better results. Regarding bias detection, the model
was mostly able to detect bias if only one answer was examined at a time. When multiple answers
were introduced at once, the model was largely unable to detect incorrect reviews. The insights gained
from this study can guide tool development and make teachers more aware of the possibilities and
limitations of utilizing LLMs in assessment.
Funding information in the publication:
This work has been supported by FAST, the Finnish Software Engineering Doctoral Research Network, funded by the Ministry of Education and Culture, Finland.