A1 Peer-reviewed original article in a scientific journal

Educational Evaluation with Large Language Models (LLMs): ChatGPT-4 in Recalling and Evaluating Students’ Written Responses




Authors: Jauhiainen, Jussi S.; Garagorry Guerra, Agustín

Publisher: Informing Science Institute

Publication year: 2025

Journal: Journal of Information Technology Education: Innovations in Practice

Journal name in the database: Journal of Information Technology Education: Innovations in Practice

Article number: 2

Volume: 24

ISSN: 2165-3151

eISSN: 2165-316X

DOI: https://doi.org/10.28945/5433

Web address: https://doi.org/10.28945/5433

Self-archived copy's address: https://research.utu.fi/converis/portal/detail/Publication/491502817


Abstract

Aim/Purpose

This article investigates the process of identifying and correcting hallucinations in ChatGPT-4’s recall of student-written responses, as well as its evaluation of these responses and provision of feedback. Effective prompting is examined to enhance the pre-evaluation, evaluation, and post-evaluation stages.

Background

Advanced Large Language Models (LLMs), such as ChatGPT-4, have gained significant traction in educational contexts. However, as of early 2025, systematic empirical studies on their application for evaluating students’ essays and open-ended written exam responses remain limited. It is important to consider the pre-evaluation, evaluation, and post-evaluation stages when using LLMs.

Methodology

In this study, ChatGPT-4 recalled 54 open-ended responses submitted by university students 10 times each, amounting to almost 50,000 words in total, and assessed and provided feedback on each response.
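
A minimal sketch of how recall accuracy could be checked programmatically is shown below; the use of Python's difflib, the word-level comparison, and the similarity threshold are illustrative assumptions, not the authors' actual procedure.

```python
# Illustrative sketch (not the authors' code): compare a response recalled by
# the model against the original student text to flag possible hallucinations.
from difflib import SequenceMatcher

def recall_accuracy(original: str, recalled: str) -> float:
    """Return a 0-1 word-level similarity ratio between original and recalled text."""
    return SequenceMatcher(None, original.split(), recalled.split()).ratio()

def flag_suspect_recall(original: str, recalled: str, threshold: float = 0.95) -> bool:
    """Flag a recall as suspect if similarity falls below an assumed threshold."""
    return recall_accuracy(original, recalled) < threshold

# Example usage with placeholder texts
original = "Photosynthesis converts light energy into chemical energy in plants."
recalled = "Photosynthesis converts light energy into chemical energy in algae."
print(recall_accuracy(original, recalled))      # similarity ratio below 1.0
print(flag_suspect_recall(original, recalled))  # True with a 0.95 threshold
```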

Contribution

The findings emphasize the critical importance of the pre-evaluation, evaluation, and post-evaluation stages, and, in particular, of prompting and recall when utilizing LLMs for educational assessments.

Findings

Using systematic prompting techniques, such as Chain of Thought (CoT), ChatGPT-4 can be effectively prepared to accurately recall, evaluate, and provide meaningful, individualized feedback on students’ written responses, following specific instructional guidelines.
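
A minimal sketch of a Chain-of-Thought style evaluation request sent through the OpenAI Python client follows; the model identifier, rubric wording, and prompt structure are illustrative assumptions rather than the prompts used in the study.

```python
# Illustrative sketch (not the study's actual prompts): a Chain-of-Thought style
# evaluation request asking the model to recall, reason step by step, then score.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

RUBRIC = "Score 0-5 for accuracy, 0-5 for argumentation, 0-5 for use of sources."  # placeholder rubric

def evaluate_response(question: str, student_response: str) -> str:
    """Ask the model to restate, reason step by step, and then score a response."""
    prompt = (
        f"Exam question: {question}\n\n"
        f"Student response: {student_response}\n\n"
        "First, restate the student's response verbatim to confirm accurate recall.\n"
        "Then think step by step: compare the response against the rubric below,\n"
        "noting strengths and weaknesses before giving scores and individual feedback.\n\n"
        f"Rubric: {RUBRIC}"
    )
    completion = client.chat.completions.create(
        model="gpt-4",  # placeholder model identifier
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```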

Recommendations for Practitioners

Proper implementation of the pre-evaluation, evaluation, and post-evaluation stages, together with testing of recall accuracy, is important when using ChatGPT-4 to evaluate students’ open-ended responses and provide feedback.

Recommendations for Researchers

Recall accuracy needs to be tested, and the prompting process carefully documented, when using and researching LLMs such as ChatGPT-4 for educational evaluations.

Impact on Society

As LLMs continue to evolve, they are expected to become valuable tools for assessing student essays and open-ended responses, offering potential time and resource savings for educators and educational institutions.

Future Research

Future research should explore the use of various LLMs across different academic fields and topics to better understand their potential and limitations in educational evaluation.


Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.




