A4 Refereed article in a conference publication
Exploring the use of Artificial Intelligence in rubric production and detection of reviewer’s bias
Authors: Rytilahti, Juuso; Kaila, Erkki; Lokkila, Erno
Editors: Carmo, Mafalda
Conference name: International Conference on Education and New Developments
Publication year: 2025
Journal: Education and New Developments
Book title: Education and New Developments 2025 : Volume II
First page: 3
Last page: 7
ISBN: 978-989-35728-8-7
ISSN: 2184-044X
eISSN: 2184-1489
DOI: https://doi.org/10.36315/2025v2end001
Publication's open availability at the time of reporting: Open Access
Publication channel's open availability: Open Access publication channel
Web address: https://doi.org/10.36315/2025v2end001
This study explores the use of artificial intelligence in education from a unique angle, providing insight into
whether a large language model (LLM) can be used to create a rubric or detect a scoring bias in teachers'
evaluations. For this, we examine a specific, publicly available model by OpenAI, called GPT-4o. The study
was conducted using student submissions for assignments in an introductory programming course. As the
first approach, we provided the selected LLM with a subset of submissions, with feedback, from differently
graded answers to produce a rubric. The second approach was to provide the LLM with a reference answer
to produce the justification for grading from "nothing". Finally, we tried combining these two approaches.
We also tested producing the rubric with the exercise description. After the AI model output grading
justifications, we compared the justifications produced by the LLM to those created by teaching personnel. We
also tested whether a bias in reviews could be detected. Our results indicate that there is considerable variance
in the AI-generated rubrics, and in general they are stricter than the human-generated ones. Moreover, we found
that utilizing the exercise description often produced better results. Regarding bias detection, the model
was mostly able to detect bias if only one answer was examined at a time. When multiple answers
were introduced at once, the model was largely unable to detect incorrect reviews. The insights gained
from this study can guide tool development and make teachers more aware of the possibilities and
limitations of utilizing LLMs in assessment.
Funding information in the publication:
This work has been supported by FAST, the Finnish Software Engineering Doctoral Research Network, funded by the Ministry of Education and Culture, Finland.