A1 Refereed original research article in a scientific journal

Active learning of molecular data for task-specific objectives

Authors: Ghosh, Kunal; Todorović, Milica; Vehtari, Aki; Rinke, Patrick

Publisher: AIP Publishing

Publication year: 2025

Journal: Journal of Chemical Physics

Article number: 014103

Volume: 162

Issue: 1

ISSN: 0021-9606

eISSN: 1089-7690

DOI: https://doi.org/10.1063/5.0229834

Web address: https://doi.org/10.1063/5.0229834

Self-archived copy’s web address: https://research.utu.fi/converis/portal/detail/Publication/477959192

Abstract

Active learning (AL) has shown promise to be a particularly data-efficient machine learning approach. Yet, its performance depends on the application, and it is not clear when AL practitioners can expect computational savings. Here, we carry out a systematic AL performance assessment for three diverse molecular datasets and two common scientific tasks: compiling compact, informative datasets and targeted molecular searches. We implemented AL with Gaussian processes (GP) and used the many-body tensor as molecular representation. For the first task, we tested different data acquisition strategies, batch sizes, and GP noise settings. AL was insensitive to the acquisition batch size, and we observed the best AL performance for the acquisition strategy that combines uncertainty reduction with clustering to promote diversity. However, for optimal GP noise settings, AL did not outperform the randomized selection of data points. Conversely, for targeted searches, AL outperformed random sampling and achieved data savings of up to 64%. Our analysis provides insight into this task-specific performance difference in terms of target distributions and data collection strategies. We established that the performance of AL depends on the relative distribution of the target molecules in comparison to the total dataset distribution, with the largest computational savings achieved when their overlap is minimal.
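
As a rough illustration of the kind of loop the abstract describes, the following is a minimal sketch of batch active learning with a Gaussian process, where each round acquires the pool points with the largest predictive variance. It is not the authors' implementation: the 1-D synthetic data, the RBF kernel, the fixed length scale and noise values, and the function names (`rbf_kernel`, `gp_predict`, `active_learning`) are all assumptions made for illustration. It does not use the many-body tensor representation or the clustering-based diversity step mentioned above.

```python
import numpy as np

def rbf_kernel(a, b, length_scale=0.5):
    """Squared-exponential kernel between two sets of row vectors."""
    sq_dist = (np.sum(a**2, axis=1)[:, None]
               + np.sum(b**2, axis=1)[None, :]
               - 2.0 * a @ b.T)
    return np.exp(-0.5 * np.maximum(sq_dist, 0.0) / length_scale**2)

def gp_predict(x_train, y_train, x_query, noise=1e-2):
    """GP posterior mean and variance at x_query (zero prior mean)."""
    k_tt = rbf_kernel(x_train, x_train) + noise * np.eye(len(x_train))
    k_qt = rbf_kernel(x_query, x_train)
    chol = np.linalg.cholesky(k_tt)
    alpha = np.linalg.solve(chol.T, np.linalg.solve(chol, y_train))
    mean = k_qt @ alpha
    v = np.linalg.solve(chol, k_qt.T)
    # The RBF kernel has unit prior variance; clip tiny negative values.
    var = np.maximum(1.0 - np.sum(v**2, axis=0), 0.0)
    return mean, var

def active_learning(x_pool, y_pool, n_init=5, batch_size=5, n_rounds=5, seed=0):
    """Grow a labeled set by repeatedly acquiring the most uncertain points."""
    rng = np.random.default_rng(seed)
    labeled = rng.choice(len(x_pool), size=n_init, replace=False).tolist()
    for _ in range(n_rounds):
        unlabeled = np.setdiff1d(np.arange(len(x_pool)), labeled)
        _, var = gp_predict(x_pool[labeled], y_pool[labeled], x_pool[unlabeled])
        # Acquire the batch with the highest predictive variance.
        picks = unlabeled[np.argsort(var)[-batch_size:]]
        labeled.extend(picks.tolist())
    return labeled

# Hypothetical stand-in for a molecular dataset: a noiseless 1-D function.
data_rng = np.random.default_rng(1)
x_pool = data_rng.uniform(-3.0, 3.0, size=(200, 1))
y_pool = np.sin(2.0 * x_pool[:, 0])

labeled = active_learning(x_pool, y_pool)
mean, _ = gp_predict(x_pool[labeled], y_pool[labeled], x_pool)
rmse = float(np.sqrt(np.mean((mean - y_pool) ** 2)))
```

Replacing the variance-based pick with a uniform random draw from `unlabeled` gives the random-selection baseline against which the abstract compares AL.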

Downloadable publication

This is an electronic reprint of the original article.
This reprint may differ from the original in pagination and typographic detail. Please cite the original version.

Funding information in the publication:
This study received financial support from the Academy of Finland through its flagship program, the Finnish Center for Artificial Intelligence, and the Centers of Excellence Program (CoE VILMA, Grant No. 346377). Computing resources from the Aalto Science-IT project and the CSC – IT Center for Science, Finland, are gratefully acknowledged. In addition, K.G. thanks the Finnish Cultural Foundation (Grant No. 00210309) for funding the research.