Explainable Publication Year Prediction of Eighteenth Century Texts with the BERT Model - UTU Research Information System

A4 Peer-reviewed article in conference proceedings

Explainable Publication Year Prediction of Eighteenth Century Texts with the BERT Model

Authors: Rastas Iiro, Ryan Yann, Tiihonen Iiro, Qaraei Mohammedreza, Repo Liina, Babbar Rohit, Mäkelä Eetu, Tolonen Mikko, Ginter Filip

Editors: Tahmasebi Nina, Montariol Syrielle, Kutuzov Andrey, Hengchen Simon, Dubossarsky Haim, Borin Lars

Established conference name: Workshop on Computational Approaches to Historical Language Change

Publication year: 2022

Title of the collected work: Proceedings of the 3rd Workshop on Computational Approaches to Historical Language Change

Journal name in the database: PROCEEDINGS OF THE 3RD INTERNATIONAL WORKSHOP ON COMPUTATIONAL APPROACHES TO HISTORICAL LANGUAGE CHANGE 2022 (LCHANGE 2022)

First page: 68

Last page: 77

Number of pages: 10

ISBN: 978-1-955917-42-1

Open access status at the time of recording: Openly available

Open access status of the publication channel: Fully open publication channel

Web address: https://aclanthology.org/2022.lchange-1.7.pdf

Address of the self-archived copy: https://research.utu.fi/converis/portal/detail/Publication/176709131

License of the self-archived copy: CC BY

Version of the self-archived publication: Publisher's version

Abstract

In this paper, we describe a BERT model trained on the Eighteenth Century Collections Online (ECCO) dataset of digitized documents. The ECCO dataset poses unique modelling challenges due to the presence of Optical Character Recognition (OCR) artifacts. We establish the performance of the BERT model on a publication year prediction task against linear baseline models and human judgement, finding the BERT model to be superior to both and able to date the works, on average, with less than 7 years absolute error. We also explore how language change over time affects the model by analyzing the features the model uses for publication year predictions as given by the Integrated Gradients model explanation method.
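
As a rough illustration of the Integrated Gradients idea named in the abstract, the Python sketch below attributes the output of a toy function to its inputs by averaging gradients along a straight path from a baseline to the input. The function `toy_year_model`, its coefficients, the all-zero baseline, and the finite-difference gradient are assumptions made purely for illustration; the paper applies the method to a BERT model, which is not reproduced here.

```python
from typing import Callable, List, Sequence

Model = Callable[[Sequence[float]], float]


def numerical_gradient(f: Model, x: Sequence[float], eps: float = 1e-6) -> List[float]:
    """Central-difference estimate of the gradient of f at x."""
    grad = []
    for i in range(len(x)):
        up = list(x)
        down = list(x)
        up[i] += eps
        down[i] -= eps
        grad.append((f(up) - f(down)) / (2 * eps))
    return grad


def integrated_gradients(
    f: Model, x: Sequence[float], baseline: Sequence[float], steps: int = 200
) -> List[float]:
    """Approximate Integrated Gradients with a midpoint Riemann sum."""
    avg_grad = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps  # midpoint of the k-th sub-interval of [0, 1]
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        for i, g in enumerate(numerical_gradient(f, point)):
            avg_grad[i] += g / steps
    # Scale the averaged gradient by the input's distance from the baseline.
    return [(xi - b) * g for xi, b, g in zip(x, baseline, avg_grad)]


def toy_year_model(features: Sequence[float]) -> float:
    """Hypothetical stand-in for a year predictor (not the BERT model)."""
    a, b, c = features
    return 1750.0 + 20.0 * a - 10.0 * b + 5.0 * a * c


x = [1.0, 0.5, 2.0]
baseline = [0.0, 0.0, 0.0]
attributions = integrated_gradients(toy_year_model, x, baseline)
```

A useful sanity check for any such implementation is the completeness property: the attributions should sum, up to numerical error, to the difference between the model output at the input and at the baseline.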

Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.

2022_Rastas_Explainable_Publ_ACL.pdf