SEDA: Similarity-Enhanced Data Augmentation for Imbalanced Learning

Authors: Sheikh, Javad; Farahnakian, Farshad; Farahnakian, Fahimeh; Zelioli, Luca; Heikkonen, Jukka

Editors: Antonacopoulos, Apostolos; Chaudhuri, Subhasis; Chellappa, Rama; Liu, Cheng-Lin; Bhattacharya, Saumik; Pal, Umapada

Conference: International Conference on Pattern Recognition

Publisher: Springer Nature Switzerland

Year: 2024

Lecture Notes in Computer Science

Book title: Pattern Recognition: 27th International Conference, ICPR 2024, Kolkata, India, December 1–5, 2024, Proceedings, Part XXVI

Series: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Volume: 15326

First page: 32

Last page: 47

ISBN: 978-3-031-78394-4

eISBN: 978-3-031-78395-1

ISSN: 0302-9743

eISSN: 1611-3349

DOI: https://doi.org/10.1007/978-3-031-78395-1_3

Imbalanced datasets can significantly degrade the performance of Machine Learning (ML) models, which tend to overfit to the majority class and generalize poorly to minority classes. To mitigate these issues, we introduce Similarity-Enhanced Data Augmentation (SEDA), an augmentation technique for imbalanced datasets. SEDA integrates feature and distance similarities to augment the minority samples. By incorporating feature importance, SEDA prioritizes the most influential features, leading to more meaningful synthetic samples. We evaluated the impact of SEDA on the performance of four ML models: Multi-Layer Perceptron (MLP), Random Forest (RF), Decision Tree (DT), and Logistic Regression (LR). SEDA's effectiveness is compared against random oversampling and SMOTE. Experimental results are collected on geophysical data from Lapland, Finland. The dataset exhibits a significant class imbalance, comprising 15 known samples against 2.92×10^5 unknown samples. Experiments show that adding high-quality synthetic samples can help a model generalize better to unseen data, addressing the overfitting commonly seen with imbalanced datasets. Part of the methodology implemented in this work is integrated into QGIS as a new toolkit, the EIS Toolkit (https://github.com/GispoCoding/eis_toolkit), for mineral prospectivity mapping.
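
For readers unfamiliar with the two baselines named in the abstract, the following is a minimal sketch of random oversampling and a SMOTE-style interpolation, using only the Python standard library. It is not the SEDA method and is not taken from the paper or the EIS Toolkit; the function names `random_oversample` and `smote_like`, the neighbour count `k`, and the toy data are assumptions made for illustration.

```python
import random


def random_oversample(minority, n_new, rng):
    """Random oversampling: duplicate randomly chosen minority samples."""
    return [list(rng.choice(minority)) for _ in range(n_new)]


def smote_like(minority, n_new, k, rng):
    """SMOTE-style oversampling: interpolate between a minority sample
    and one of its k nearest minority neighbours (illustrative only)."""

    def sq_dist(a, b):
        return sum((x - y) ** 2 for x, y in zip(a, b))

    synthetic = []
    for _ in range(n_new):
        i = rng.randrange(len(minority))
        base = minority[i]
        # Rank the other minority samples by distance to the base sample.
        others = [m for j, m in enumerate(minority) if j != i]
        neighbours = sorted(others, key=lambda m: sq_dist(base, m))[:k]
        neighbour = rng.choice(neighbours)
        t = rng.random()  # interpolation weight in [0, 1)
        synthetic.append([b + t * (n - b) for b, n in zip(base, neighbour)])
    return synthetic


# Toy minority class (hypothetical data): four corners of the unit square.
rng = random.Random(0)
minority = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
duplicates = random_oversample(minority, 6, rng)
synthetic = smote_like(minority, 6, k=2, rng=rng)
```

Random oversampling only repeats existing points, whereas the interpolation produces new points on the segments between neighbouring minority samples. According to the abstract, SEDA additionally weighs feature similarity and feature importance, which this sketch does not attempt to reproduce.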

Funding:
The compilation of the presented work is supported by funds from the Horizon Europe research and innovation program under Grant Agreement number 101057357, EIS - Exploration Information System. For further information, check the website: EIS.