A4 Peer-reviewed article in conference proceedings
Exploring the use of Artificial Intelligence in rubric production and detection of reviewer’s bias
Authors: Rytilahti, Juuso; Kaila, Erkki; Lokkila, Erno
Editor: Carmo, Mafalda
Established name of the conference: International Conference on Education and New Developments
Publication year: 2025
Journal: Education and New Developments
Title of the edited volume: Education and New Developments 2025 : Volume II
Start page: 3
End page: 7
ISBN: 978-989-35728-8-7
ISSN: 2184-044X
eISSN: 2184-1489
DOI: https://doi.org/10.36315/2025v2end001
Open access status at the time of registration: Openly available
Open access status of the publication channel: Fully open publication channel
Web address: https://doi.org/10.36315/2025v2end001
This study explores the use of artificial intelligence in education from a unique angle, providing insight into
whether a large language model (LLM) can be used to create a rubric or detect a scoring bias in teachers'
evaluation. For this, we examine a specific, publicly available model by OpenAI, called GPT-4o. The study
was conducted using student submissions for assignments in an introductory programming course. As the
first approach, we provided the selected LLM with a subset of submissions with feedback from differently
graded answers to produce a rubric. The second approach was to provide the LLM with a reference answer
to produce the justification for grading from "nothing". Finally, we tried combining these two approaches.
We also tested producing the rubric with the exercise description. After the AI model had produced grading
justifications, we compared the justifications produced by the LLM to those created by teaching personnel.
We also tested whether bias in reviews could be detected. Our results indicate that there is considerable
variance in the AI-generated rubrics, and that in general they are stricter than the human-generated ones. Moreover, we found
that utilizing the exercise description often produced better results. Regarding bias detection, the model
was mostly able to detect bias when only one answer was examined at a time. When multiple answers
were introduced at once, it largely failed to detect incorrect reviews. The insights gained
from this study can guide tool development and make teachers more aware of the possibilities and
limitations of utilizing LLMs in assessment.
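The second approach described above (giving the model only a reference answer and asking it to justify a grade "from nothing") can be sketched roughly as follows. This is a minimal illustration, not the authors' actual code: the prompt wording, helper names, and scoring scale are assumptions, and the call uses the standard OpenAI chat completions API with the GPT-4o model named in the abstract. Note that the sketch sends one submission per request, mirroring the finding that batching multiple answers hurt bias detection.

```python
# Illustrative sketch (assumptions, not the paper's implementation):
# grade a single student submission against a reference answer with GPT-4o.

def build_grading_prompt(reference_answer: str, submission: str) -> str:
    """Compose a grading prompt from a reference answer and one submission."""
    return (
        "You are grading an introductory programming assignment.\n\n"
        "Reference answer:\n"
        f"{reference_answer}\n\n"
        "Student submission:\n"
        f"{submission}\n\n"
        "Produce a grading justification: list strengths, list deficiencies, "
        "and propose a score from 0 to 5."
    )

def grade_submission(reference_answer: str, submission: str) -> str:
    """Send ONE submission at a time; the paper found that introducing
    multiple answers at once degraded the detection of incorrect reviews."""
    from openai import OpenAI  # requires the `openai` package and OPENAI_API_KEY
    client = OpenAI()
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{
            "role": "user",
            "content": build_grading_prompt(reference_answer, submission),
        }],
    )
    return response.choices[0].message.content

if __name__ == "__main__":
    print(build_grading_prompt(
        "def add(a, b):\n    return a + b",
        "def add(a, b):\n    return b + a",
    ))
```

A comparable sketch of the first approach would instead place several differently graded example submissions, with their feedback, in the prompt and ask the model to produce a rubric.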
Funding information in the publication:
This work has been supported by FAST, the Finnish Software Engineering Doctoral Research Network, funded by the Ministry of Education and Culture, Finland.