Artificial Intelligence assessing content-focused short answers - UTU Research Information System

A4 Peer-reviewed article in conference proceedings

Artificial Intelligence assessing content-focused short answers

Authors: Rytilahti, Juuso; Kaila, Erkki; Lokkila, Erno

Editor: Carmo, Mafalda

Established conference name: International Conference on Education and New Developments

Publication year: 2025

Journal: Education and New Developments

Compilation title: Education and New Developments 2025 : Volume II

First page: 61

Last page: 65

ISBN: 978-989-35728-8-7

ISSN: 2184-044X

eISSN: 2184-1489

DOI: https://doi.org/10.36315/2025v2end013

Open access status of the publication at the time of recording: Openly available

Open access status of the publication channel: Fully open publication channel

Web address: https://doi.org/10.36315/2025v2end013

Abstract

The capabilities of Artificial Intelligence (AI), and specifically of large language models (LLMs), have
changed the way teachers work. Using LLMs or other AI-assisted tools to help review student submissions
has quickly become common practice. Many questions remain open about the effectiveness, performance,
and reliability of these AI-driven automatic assessment tools. In this study, we examine the capability of
an LLM to assess textual answers. The data set consisted of 31 computer science-related questions and
2981 answers written in English, with detailed feedback and the correct answers. The LLM used was
GPT-4o from OpenAI. First, the LLM was tested on a single question from the data set, producing scores
for all of its answers (N=82) under multiple variations of settings. The best-performing approach was
then used to process the full data set. On the full set, the model produced exactly the correct evaluation
in 41.3% of the cases. With an accepted error margin of ±20%, the correctness was 74.7%. Of the fully
correct answers in the set (N=1802), the model correctly identified 51.4%. The results can guide future
research on AI-driven automatic assessment and help teachers improve the performance of LLMs in
educational use.
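
The three figures reported in the abstract (exact agreement, agreement within an error margin, and the share of fully correct answers that the model identified) can be illustrated with a minimal sketch. It is not the authors' code. It assumes that scores are normalized to the range 0 to 1 and that the ±20% margin means an absolute difference of at most 0.2 on that scale. The names `Graded`, `exact_rate`, `within_margin_rate`, `full_credit_recall`, and `TOY` are hypothetical, and the toy values are invented for illustration only; they are not data from the study.

```python
from dataclasses import dataclass

EPS = 1e-9  # tolerance for floating-point comparisons


@dataclass
class Graded:
    human: float  # reference score, assumed normalized to [0, 1]
    model: float  # LLM-assigned score, assumed normalized to [0, 1]


def _require_items(items):
    if not items:
        raise ValueError("at least one graded answer is required")


def exact_rate(items):
    """Share of answers where the model score equals the reference score."""
    _require_items(items)
    return sum(abs(g.human - g.model) <= EPS for g in items) / len(items)


def within_margin_rate(items, margin=0.2):
    """Share of answers where the model score is within `margin` of the reference."""
    _require_items(items)
    return sum(abs(g.human - g.model) <= margin + EPS for g in items) / len(items)


def full_credit_recall(items):
    """Among answers with a full reference score, share the model also scored as full."""
    full = [g for g in items if abs(g.human - 1.0) <= EPS]
    _require_items(full)
    return sum(abs(g.model - 1.0) <= EPS for g in full) / len(full)


# Invented toy data, NOT results from the study.
TOY = [
    Graded(1.0, 1.0),
    Graded(1.0, 0.8),
    Graded(0.5, 0.5),
    Graded(0.0, 0.5),
    Graded(1.0, 0.5),
]

if __name__ == "__main__":
    print(f"exact: {exact_rate(TOY):.3f}")
    print(f"within margin: {within_margin_rate(TOY):.3f}")
    print(f"full-credit recall: {full_credit_recall(TOY):.3f}")
```

Under these assumptions, the margin-based rate can never be lower than the exact rate, which matches the ordering of the two percentages in the abstract. The third function uses a different denominator: it counts only the answers whose reference score is full.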

Funding information in the publication:
This work has been supported by FAST, the Finnish Software Engineering Doctoral Research Network, funded by the Ministry of Education and Culture, Finland.