A4 Peer-reviewed article in conference proceedings

FinerWeb-10BT: Refining Web Data with LLM-Based Line-Level Filtering




Authors: Henriksson, Erik; Tarkka, Otto; Ginter, Filip

Editors: Johansson, Richard; Stymne, Sara

Established name of the conference: Nordic Conference on Computational Linguistics and Baltic Conference on Human Language Technologies

Publication year: 2025

Journal: NEALT Proceedings Series

Title of the collection: Proceedings of the Joint 25th Nordic Conference on Computational Linguistics and 11th Baltic Conference on Human Language Technologies (NoDaLiDa/Baltic-HLT 2025)

Volume: 57

First page: 258

Last page: 268

ISBN: 978-9908-53-109-0

ISSN: 1736-8197

eISSN: 1736-6305

Open access status at the time of registration: Openly available

Open access of the publication channel: Fully open publication channel

Web address: https://aclanthology.org/2025.nodalida-1.27/

Self-archived copy's address: https://research.utu.fi/converis/portal/detail/Publication/506553763


Abstract

Data quality is crucial for training Large Language Models (LLMs). Traditional heuristic filters often miss low-quality text or mistakenly remove valuable content. In this paper, we introduce an LLM-based line-level filtering method to enhance training data quality. We use GPT-4o mini to label a 20,000-document sample from FineWeb at the line level, allowing the model to create descriptive labels for low-quality lines. These labels are grouped into nine main categories, and we train a DeBERTa-v3 classifier to scale the filtering to a 10B-token subset of FineWeb. To test the impact of our filtering, we train GPT-2 models on both the original and the filtered datasets. The results show that models trained on the filtered data achieve higher accuracy on the HellaSwag benchmark and reach their performance targets faster, even with up to 25% less data. This demonstrates that LLM-based line-level filtering can significantly improve data quality and training efficiency for LLMs. We release our quality-annotated dataset, FinerWeb-10BT, and the codebase to support further work in this area.
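To illustrate the idea described in the abstract, the following is a minimal sketch of line-level filtering: each line of a web document is assigned a quality label, and only lines labeled "Clean" are kept. The `classify_line` function here is a hypothetical toy heuristic standing in for the paper's trained DeBERTa-v3 classifier; the label names and thresholds are assumptions for illustration only.

```python
def classify_line(line: str) -> str:
    """Toy stand-in for a trained line-quality classifier.

    A real system would call a fine-tuned model (e.g. DeBERTa-v3);
    this heuristic only flags obvious boilerplate for demonstration.
    """
    stripped = line.strip()
    if not stripped:
        return "Empty"
    if stripped.lower().startswith(("click here", "subscribe", "accept cookies")):
        return "Boilerplate"
    return "Clean"


def filter_document(text: str, classify=classify_line) -> str:
    """Keep only the lines that the classifier labels as 'Clean'."""
    kept = [ln for ln in text.splitlines() if classify(ln) == "Clean"]
    return "\n".join(kept)


doc = "A useful sentence.\nClick here to subscribe!\nAnother good line."
print(filter_document(doc))
```

At corpus scale, the same per-line decision is simply applied to every document, which is what makes a fast distilled classifier preferable to querying an LLM directly.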


Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.




Funding information in the publication
This project has received funding from the European Union’s Horizon Europe research and innovation programme under Grant agreement No 101070350 and from UK Research and Innovation (UKRI) under the UK government’s Horizon Europe funding guarantee [grant number 10052546]. This work was supported by the Research Council of Finland.

