A4 Peer-reviewed article in conference proceedings

Exploring the use of Artificial Intelligence in rubric production and detection of reviewer’s bias




Authors: Rytilahti, Juuso; Kaila, Erkki; Lokkila, Erno

Editor: Carmo, Mafalda

Established conference name: International Conference on Education and New Developments

Publication year: 2025

Journal: Education and New Developments

Title of the collection: Education and New Developments 2025 : Volume II

First page: 3

Last page: 7

ISBN: 978-989-35728-8-7

ISSN: 2184-044X

eISSN: 2184-1489

DOI: https://doi.org/10.36315/2025v2end001

Openness of the publication at the time of recording: Openly available

Openness of the publication channel: Fully open publication channel

URL: https://doi.org/10.36315/2025v2end001


Abstract

This study explores the use of artificial intelligence in education from a unique angle, providing insight into
whether a large language model (LLM) can be used to create a rubric or to detect a scoring bias in teachers'
evaluations. For this, we examine a specific, publicly available model by OpenAI, called GPT-4o. The study
was conducted using student submissions for assignments in an introductory programming course. As the
first approach, we provided the selected LLM with a subset of submissions with feedback, drawn from differently
graded answers, to produce a rubric. The second approach was to provide the LLM with a reference answer
to produce the grading justification from "nothing". Finally, we tried combining these two approaches.
We also tested producing the rubric with the exercise description. After the AI model output grading
justifications, we compared the justifications produced by the LLM to those created by teaching personnel. We
also tested whether a bias in reviews could be detected. Our results indicate that there is considerable variance in the
AI-generated rubrics, and in general they are stricter than the human-generated ones. Moreover, we found
that utilizing the exercise description often produced better results. Regarding bias detection, the model
was mostly able to detect bias if only one answer was examined at a time. When multiple answers
were introduced at once, the model was largely unable to detect incorrect reviews. The insights gained
from this study can guide tool development and make teachers more aware of the possibilities and
limitations of utilizing LLMs in assessment.
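The first approach described in the abstract (rubric generation from differently graded, feedback-annotated submissions, optionally combined with the exercise description) amounts to assembling a single prompt for the model. The sketch below illustrates that step; the function name, prompt wording, and example data are illustrative assumptions, not taken from the paper:

```python
# Illustrative sketch (not the authors' actual code): assemble a
# rubric-generation prompt from an exercise description and a set of
# differently graded example submissions with feedback.

def build_rubric_prompt(exercise_description, graded_examples):
    """Combine the exercise description with graded example answers
    into one prompt asking the model to produce a grading rubric."""
    parts = [
        "You are grading submissions for an introductory programming course.",
        "Exercise description:",
        exercise_description,
        "Example submissions with their grades and feedback:",
    ]
    for grade, answer, feedback in graded_examples:
        parts.append(f"Grade: {grade}\nAnswer:\n{answer}\nFeedback: {feedback}")
    parts.append("Based on the above, produce a grading rubric with point criteria.")
    return "\n\n".join(parts)

# Hypothetical example data for an introductory programming assignment:
prompt = build_rubric_prompt(
    "Write a function that returns the sum of a list of integers.",
    [
        (2, "def total(xs): return sum(xs)", "Correct and idiomatic."),
        (0, "def total(xs): return 0", "Does not compute the sum."),
    ],
)

# The resulting prompt could then be sent to GPT-4o, e.g. via the OpenAI
# Python client:
#   client.chat.completions.create(model="gpt-4o",
#       messages=[{"role": "user", "content": prompt}])
```

Leaving the exercise description as an explicit, optional input mirrors the paper's comparison of rubric generation with and without it.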


Funding information in the publication
This work has been supported by FAST, the Finnish Software Engineering Doctoral Research Network, funded by the Ministry of Education and Culture, Finland.
