A4 Peer-reviewed article in conference proceedings
SEDA: Similarity-Enhanced Data Augmentation for Imbalanced Learning
Authors: Sheikh, Javad; Farahnakian, Farshad; Farahnakian, Fahimeh; Zelioli, Luca; Heikkonen, Jukka
Editors: Antonacopoulos, Apostolos; Chaudhuri, Subhasis; Chellappa, Rama; Liu, Cheng-Lin; Bhattacharya, Saumik; Pal, Umapada
Established conference name: International Conference on Pattern Recognition
Publisher: Springer Nature Switzerland
Publication year: 2024
Journal: Lecture Notes in Computer Science
Book title: Pattern Recognition: 27th International Conference, ICPR 2024, Kolkata, India, December 1–5, 2024, Proceedings, Part XXVI
Journal name in database: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)
Volume: 15326
First page: 32
Last page: 47
ISBN: 978-3-031-78394-4
eISBN: 978-3-031-78395-1
ISSN: 0302-9743
eISSN: 1611-3349
DOI: https://doi.org/10.1007/978-3-031-78395-1_3
Web address: https://doi.org/10.1007/978-3-031-78395-1_3
Imbalanced datasets can significantly degrade the performance of Machine Learning (ML) models, which tend to overfit to the majority class and generalize poorly to minority classes. To mitigate these issues, we introduce Similarity-Enhanced Data Augmentation (SEDA), an augmentation technique for imbalanced datasets. SEDA integrates feature and distance similarities to augment the minority samples. By incorporating feature importance, SEDA prioritizes the most influential features, leading to more meaningful synthetic samples. We evaluated the impact of SEDA on the performance of four ML models: Multi-Layer Perceptron (MLP), Random Forest (RF), Decision Tree (DT), and Logistic Regression (LR). SEDA’s effectiveness is compared against random and SMOTE oversampling methods. Experimental results are collected on geophysical data from Lapland, Finland. The dataset exhibits a significant class imbalance, comprising 15 known samples in contrast to 2.92×10⁵ unknown samples. Experiments show that adding high-quality synthetic samples can help the models generalize better to unseen data, addressing the overfitting commonly seen with imbalanced datasets. Part of the methodology implemented in this work is integrated into QGIS as a new toolkit, EIS Toolkit (https://github.com/GispoCoding/eis_toolkit), for mineral prospectivity mapping.
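For readers unfamiliar with the two baselines named in the abstract, the following is a minimal sketch of random oversampling and SMOTE-style interpolation in plain Python. It is not the SEDA algorithm, whose details the abstract does not give. The function names, the toy data, and the optional per-feature weights in the distance (a loose, hypothetical stand-in for "feature importance") are all assumptions made for illustration.

```python
import math
import random


def weighted_distance(a, b, weights):
    """Euclidean distance with per-feature weights (hypothetical importances)."""
    return math.sqrt(sum(w * (x - y) ** 2 for x, y, w in zip(a, b, weights)))


def random_oversample(minority, n_new, rng):
    """Baseline 1: duplicate randomly chosen minority samples."""
    return [list(rng.choice(minority)) for _ in range(n_new)]


def smote_like_oversample(minority, n_new, k, rng, weights=None):
    """Baseline 2: interpolate between a minority sample and one of its
    k nearest minority neighbours (SMOTE-style)."""
    if weights is None:
        weights = [1.0] * len(minority[0])
    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by weighted distance to the base.
        others = [s for j, s in enumerate(minority) if j != i]
        others.sort(key=lambda s: weighted_distance(base, s, weights))
        neighbour = rng.choice(others[:k])
        t = rng.random()  # interpolation factor in [0, 1)
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy data only: 15 random 4-feature points, not the Lapland dataset.
rng = random.Random(0)
minority = [[rng.gauss(0.0, 1.0) for _ in range(4)] for _ in range(15)]
importances = [0.4, 0.3, 0.2, 0.1]  # made-up weights, for illustration

duplicates = random_oversample(minority, 30, rng)
synthetic = smote_like_oversample(minority, 30, k=5, rng=rng, weights=importances)
```

Random oversampling only repeats existing points, whereas the interpolation step produces new points on the line segments between minority neighbours. Weighting the distance changes which neighbours are considered close, which is one plausible way feature importance could influence the synthetic samples.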
Funding information in the publication:
The compilation of the presented work is supported by funds from the Horizon Europe research and innovation program under Grant Agreement number 101057357, EIS - Exploration Information System. For further information, check the website: EIS.