A4 Peer-reviewed article in conference proceedings

Exploring the use of Artificial Intelligence in rubric production and detection of reviewer’s bias




Authors: Rytilahti, Juuso; Kaila, Erkki; Lokkila, Erno

Editor: Carmo, Mafalda

Established conference name: International Conference on Education and New Developments

Publication year: 2025

Journal: Education and New Developments

Title of the collection: Education and New Developments 2025 : Volume II

First page: 3

Last page: 7

ISBN: 978-989-35728-8-7

ISSN: 2184-044X

eISSN: 2184-1489

DOI: https://doi.org/10.36315/2025v2end001

Openness of the publication at the time of recording: Openly available

Openness of the publication channel: Fully open publication channel

URL: https://doi.org/10.36315/2025v2end001


Abstract

This study explores the use of artificial intelligence in education from a unique angle, providing insight into
whether a large language model (LLM) can be used to create a rubric or to detect a scoring bias in teachers'
evaluations. For this, we examine a specific, publicly available model by OpenAI, called GPT-4o. The study
was conducted using student submissions for assignments in an introductory programming course. As the
first approach, we provided the selected LLM with a subset of submissions with feedback, drawn from differently
graded answers, to produce a rubric. The second approach was to provide the LLM with a reference answer
to produce the grading justification from "nothing". Finally, we tried combining these two approaches.
We also tested producing the rubric with the exercise description. After the AI model output grading
justifications, we compared the justifications produced by the LLM to those created by teaching personnel. We
also tested whether a bias in reviews could be detected. Our results indicate that there is considerable variance in the
AI-generated rubrics, and in general they are stricter than the human-generated ones. Moreover, we found
that utilizing the exercise description often produced better results. Regarding bias detection, the model
was mostly able to detect bias if only one answer was examined at a time. When multiple answers
were introduced at once, the model was largely unable to detect incorrect reviews. The insights gained
from this study can guide tool development and make teachers more aware of the possibilities and
limitations of utilizing LLMs in assessment.
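The first approach described in the abstract (rubric generation from differently graded, feedback-annotated submissions, optionally combined with the exercise description) amounts to assembling a single prompt for the model. The sketch below illustrates that step; the function name, prompt wording, and example data are illustrative assumptions, not taken from the paper:

```python
# Illustrative sketch (not the authors' actual code): assemble a
# rubric-generation prompt from an exercise description and a set of
# differently graded example submissions with feedback.

def build_rubric_prompt(exercise_description, graded_examples):
    """Combine the exercise description with graded example answers
    into one prompt asking the model to produce a grading rubric."""
    parts = [
        "You are grading submissions for an introductory programming course.",
        "Exercise description:",
        exercise_description,
        "Example submissions with their grades and feedback:",
    ]
    for grade, answer, feedback in graded_examples:
        parts.append(f"Grade: {grade}\nAnswer:\n{answer}\nFeedback: {feedback}")
    parts.append("Based on the above, produce a grading rubric with point criteria.")
    return "\n\n".join(parts)

# Hypothetical example data for an introductory programming assignment:
prompt = build_rubric_prompt(
    "Write a function that returns the sum of a list of integers.",
    [
        (2, "def total(xs): return sum(xs)", "Correct and idiomatic."),
        (0, "def total(xs): return 0", "Does not compute the sum."),
    ],
)

# The resulting prompt could then be sent to GPT-4o, e.g. via the OpenAI
# Python client:
#   client.chat.completions.create(model="gpt-4o",
#       messages=[{"role": "user", "content": prompt}])
```

Leaving the exercise description as an explicit, optional input mirrors the paper's comparison of rubric generation with and without it.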


Funding information in the publication
This work has been supported by FAST, the Finnish Software Engineering Doctoral Research Network, funded by the Ministry of Education and Culture, Finland.
