SEDA: Similarity-Enhanced Data Augmentation for Imbalanced Learning - UTU Research Information System

A4 Peer-reviewed article in conference proceedings

SEDA: Similarity-Enhanced Data Augmentation for Imbalanced Learning

Authors: Sheikh, Javad; Farahnakian, Farshad; Farahnakian, Fahimeh; Zelioli, Luca; Heikkonen, Jukka

Editors: Antonacopoulos, Apostolos; Chaudhuri, Subhasis; Chellappa, Rama; Liu, Cheng-Lin; Bhattacharya, Saumik; Pal, Umapada

Established conference name: International Conference on Pattern Recognition

Publisher: Springer Nature Switzerland

Publication year: 2024

Journal: Lecture Notes in Computer Science

Title of the compilation: Pattern Recognition: 27th International Conference, ICPR 2024, Kolkata, India, December 1–5, 2024, Proceedings, Part XXVI

Journal name in the database: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Volume: 15326

First page: 32

Last page: 47

ISBN: 978-3-031-78394-4

eISBN: 978-3-031-78395-1

ISSN: 0302-9743

eISSN: 1611-3349

DOI: https://doi.org/10.1007/978-3-031-78395-1_3

Open access status of the publication at the time of recording: Not openly available

Open access status of the publication channel: Not an open access publication channel

Web address: https://doi.org/10.1007/978-3-031-78395-1_3

Abstract

Imbalanced datasets can significantly affect the performance of Machine Learning (ML) models, as they tend to overfit to the majority class and struggle to generalize well for minority classes. To mitigate these issues, we introduce an augmentation technique called Similarity-Enhanced Data Augmentation (SEDA) for handling imbalanced datasets. SEDA integrates feature and distance similarities to augment the minority samples. By incorporating feature importance, SEDA ensures that the most influential features are prioritized, leading to more meaningful synthetic samples. We evaluated the impact of SEDA on the performance of four ML models, including Multi-Layer Perceptron (MLP), Random Forest (RF), Decision Tree (DT), and Logistic Regression (LR). SEDA’s effectiveness is compared against random and SMOTE oversampling methods. Experimental results are collected on geophysical data from Lapland, Finland. The dataset exhibits a significant class imbalance, comprising 15 known samples in contrast to 2.92×10⁵ unknown samples. Experiments show that adding high-quality synthetic samples can help the model to generalize better to unseen data, addressing the overfitting issue commonly seen in imbalanced datasets. A part of the implemented methodology of this work is integrated in QGIS as a new toolkit which is called EIS Toolkit (https://github.com/GispoCoding/eis_toolkit) for mineral prospectivity mapping.
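
The abstract does not spell out the SEDA algorithm, so the following is only a minimal illustrative sketch of the general ideas it names, not the authors' method. It contrasts random oversampling (duplicating minority samples) with a SMOTE-style interpolation in which neighbours are picked by a feature-weighted distance. All function names, the toy data, and the importance weights are hypothetical; in practice the weights might come from a fitted model's feature importances.

```python
import math
import random


def weighted_distance(a, b, weights):
    """Euclidean distance with per-feature weights (assumed importance scores)."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def random_oversample(minority, n_new, rng):
    """Random oversampling baseline: copy existing minority samples."""
    return [list(rng.choice(minority)) for _ in range(n_new)]


def weighted_interpolation_oversample(minority, weights, n_new, k, rng):
    """SMOTE-style interpolation; neighbours are ranked by weighted distance."""
    if len(minority) < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = max(1, min(k, len(minority) - 1))
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by weighted distance to the base.
        others = [m for j, m in enumerate(minority) if j != i]
        others.sort(key=lambda m: weighted_distance(base, m, weights))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


rng = random.Random(0)
# Toy minority class with two features (made-up values, not the paper's data).
minority = [[0.0, 1.0], [0.2, 5.0], [1.0, 1.2], [1.1, 4.8]]
weights = [0.9, 0.1]  # hypothetical importances: feature 0 matters most

duplicates = random_oversample(minority, 3, rng)
synthetic = weighted_interpolation_oversample(minority, weights, 6, 1, rng)
print(len(duplicates), len(synthetic))
```

With a weight vector such as `[1.0, 0.0]`, the nearest neighbour of `[0.0, 1.0]` is `[0.2, 5.0]`, while `[0.0, 1.0]` as weights makes it `[1.0, 1.2]`. This shows how importance weighting changes which samples are treated as similar and therefore where synthetic points are placed.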

Funding information in the publication:
The compilation of the presented work is supported by funds from the Horizon Europe research and innovation program under Grant Agreement number 101057357, EIS - Exploration Information System. For further information, check the website: EIS.