Evaluating Students’ Open-ended Written Responses with LLMs: Using the RAG Framework for GPT-3.5, GPT-4, Claude-3, and Mistral-Large - UTU Research Information System

A1 Refereed original article in a scientific journal

Authors: Jauhiainen, Jussi; Garagorry Guerra, Agustín

Publisher: Shimur Publications

Publication year: 2024

Journal: Advances in Artificial Intelligence and Machine Learning

Journal name in database: Advances in Artificial Intelligence and Machine Learning

Volume: 4

Issue: 4

First page: 3097

Last page: 3113

eISSN: 2582-9793

DOI: https://doi.org/10.54364/AAIML.2024.44177

Open access status at the time of recording: Openly available

Open access status of the publication channel: Fully open access publication channel

Web address: https://doi.org/10.54364/aaiml.2024.44177

Self-archived copy address: https://research.utu.fi/converis/portal/detail/Publication/477835432

Self-archived copy license: CC BY

Version of the self-archived publication: Publisher's version

Abstract

Evaluating students' open-ended written examination responses is an essential yet time-intensive task for educators, requiring a high degree of effort, consistency, and precision. Recent developments in Large Language Models (LLMs) present a promising opportunity to balance the need for thorough evaluation with efficient use of educators' time. We explore four LLMs (GPT-3.5, GPT-4, Claude-3, and Mistral-Large) in assessing university students' open-ended responses to questions about reference material they have studied. Each model was instructed to evaluate 54 responses repeatedly under two conditions: 10 times (10-shot) with a temperature setting of 0.0 and 10 times with a temperature of 0.5, for an expected total of 1,080 evaluations per model and 4,320 evaluations across all models. The RAG (Retrieval Augmented Generation) framework was used to have the LLMs process the evaluation. Notable variations existed among the studied LLMs in consistency and grading outcomes. There is a need to comprehend the strengths and weaknesses of using LLMs for educational assessments.
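
The evaluation counts stated in the abstract can be checked with a short sketch. It only enumerates the design described there (4 models, 2 temperature settings, 54 responses, 10 repeats per condition). The variable and function names are hypothetical, chosen for illustration, and are not taken from the article; no model is actually called.

```python
from itertools import product

# Design parameters as stated in the abstract.
MODELS = ["GPT-3.5", "GPT-4", "Claude-3", "Mistral-Large"]
TEMPERATURES = [0.0, 0.5]
N_RESPONSES = 54   # student responses evaluated by each model
N_REPEATS = 10     # repeated evaluations per response and temperature


def evaluation_grid():
    """Return an iterator with one (model, temperature, response, repeat) tuple
    per planned evaluation. Hypothetical helper, for illustration only."""
    return product(MODELS, TEMPERATURES, range(N_RESPONSES), range(N_REPEATS))


grid = list(evaluation_grid())
total = len(grid)                  # evaluations across all models
per_model = total // len(MODELS)   # evaluations for a single model

print(per_model, total)  # 1080 4320
```

The abstract describes these totals as expected counts, so the number of evaluations actually obtained from each model may differ.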

Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.

245944177.pdf