Large Language Model Performance in Automatic Assessment on an Introductory Programming Course - UTU Research Information System

A4 Peer-reviewed article in conference proceedings

Large Language Model Performance in Automatic Assessment on an Introductory Programming Course

Authors: Rytilahti, Juuso; Kaila, Erkki; Ingman, Valtteri

Editors: Tatti, Nikolaj; Kasurinen, Jussi; Päivärinta, Tero

Established conference name: Annual Doctoral Symposium of Computer Science

Publication year: 2026

Journal: CEUR Workshop Proceedings

Book title: Proceedings of the Annual Doctoral Symposium of Computer Science 2025 (TKTP 2025), Helsinki, Finland, June, 2025

Article number: paper02

Volume: 4181

eISSN: 1613-0073

Open access status at the time of recording: Openly available

Open access status of the publication channel: Fully open publication channel

Web address: https://ceur-ws.org/Vol-4181/paper02.pdf

Self-archived copy's web address: https://research.utu.fi/converis/portal/detail/Publication/508258622

Self-archived copy's license: CC BY

Self-archived copy's version: Publisher's version

Abstract

Large language models (LLMs) are a potential solution to the significant assessment load in large courses with numerous assignments. However, the quality of the automated assessment may not match the evaluation by teachers or other experts. In this paper, we examine the automated assessment of programming-related tasks in a large-scale introductory programming course. The study is structured into two parts: first, we examine how reliably LLMs can assess the tasks compared to course teachers when provided with the same rubric. Second, we investigate whether a simple autonomous agent pipeline, mimicking a review board, can improve the assessment outcome. The study was conducted on a university-level introductory Python course with more than 500 students. We chose a total of four programming-related assignments from the four final weeks of the course. First, we provided the selected LLMs with the student answers accompanied by an evaluation rubric and a simple prompt, and recorded the resulting scores and feedback comments. Second, we built a pipeline of autonomous agents, each taking a different role on a review board, and used the student submissions as input for the pipeline, again recording the scores and comments. In the article, we discuss the feasibility and the performance of the given approaches. We also provide a detailed comparison of the results of the two approaches with the teacher-assessed results, and discuss the differences and the likely reasons for them. Finally, we outline the potential for future work.

Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.

paper02.pdf

Funding information in the publication:
This work has been supported by FAST, the Finnish Software Engineering Doctoral Research Network, funded by the Ministry of Education and Culture, Finland.