A4 Peer-reviewed article in conference proceedings
Automatic Short Answer Grading for Finnish with ChatGPT
Authors: Chang Li-Hsin, Ginter Filip
Editors: Wooldridge Michael, Dy Jennifer, Natarajan Sriraam
Established conference name: AAAI Conference on Artificial Intelligence
Publisher: Association for the Advancement of Artificial Intelligence
Place of publication: Washington, DC
Year of publication: 2024
Journal: Proceedings of the AAAI Conference on Artificial Intelligence
Title of the proceedings volume: Proceedings of the 38th AAAI Conference on Artificial Intelligence
Journal name in the database: Proceedings of the AAAI Conference on Artificial Intelligence
Series name: Proceedings of the AAAI Conference on Artificial Intelligence
Number in series: 21
Volume: 38
First page: 23173
Last page: 23181
ISBN: 978-1-57735-887-9
ISSN: 2159-5399
eISSN: 2374-3468
DOI: https://doi.org/10.1609/aaai.v38i21.30363
URL: https://doi.org/10.1609/aaai.v38i21.30363
Automatic short answer grading (ASAG) seeks to reduce the burden on teachers by using computational methods to evaluate student-constructed text responses. Large language models (LLMs) have recently gained prominence across diverse applications, and educational contexts are no exception. The sudden rise of ChatGPT has raised expectations that LLMs can handle numerous tasks, including ASAG. This paper examines that expectation by evaluating two LLM-based chatbots, ChatGPT built on GPT-3.5 and on GPT-4, on scoring short answers to questions under zero-shot and one-shot settings. Our data consist of 2000 student answers in Finnish from ten undergraduate courses. The assessment takes multiple perspectives into account, encompassing those of grading-system developers, teachers, and students. On our dataset, GPT-4 achieves a good quadratic weighted kappa (QWK) score (0.6 or above) in 44% of one-shot settings, clearly outperforming GPT-3.5 at 21%. We observe a negative association between student answer length and model performance, and find that a smaller standard deviation among a set of predictions correlates with lower performance. We conclude that while GPT-4 shows signs of being a capable grader, additional research is essential before it can be considered for deployment as a reliable autograder.
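The headline metric in the abstract, quadratic weighted kappa (QWK), measures agreement between model-assigned and teacher-assigned grades while penalizing larger disagreements quadratically. The following is a minimal sketch of the metric itself, not the paper's evaluation code; the grade scale (0–3) and the example grade vectors are invented for illustration:

```python
import numpy as np

def quadratic_weighted_kappa(rater_a, rater_b, min_grade, max_grade):
    """QWK between two integer grade vectors on [min_grade, max_grade]."""
    n = max_grade - min_grade + 1
    # Observed agreement matrix: counts of (grade_a, grade_b) pairs
    O = np.zeros((n, n))
    for a, b in zip(rater_a, rater_b):
        O[a - min_grade, b - min_grade] += 1
    # Expected matrix under chance agreement, from the marginal histograms
    E = np.outer(O.sum(axis=1), O.sum(axis=0)) / O.sum()
    # Quadratic disagreement weights: w_ij = (i - j)^2 / (n - 1)^2
    i, j = np.indices((n, n))
    W = (i - j) ** 2 / (n - 1) ** 2
    return 1.0 - (W * O).sum() / (W * E).sum()

# Hypothetical grades on a 0-3 scale (not data from the paper)
teacher = [0, 1, 2, 2, 3, 3]
model = [0, 1, 2, 1, 3, 2]
print(round(quadratic_weighted_kappa(teacher, model, 0, 3), 3))  # → 0.846
```

A QWK of 1 means perfect agreement and 0 means chance-level agreement, which is why the authors treat 0.6 or above as indicating a usable grader.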