A4 Peer-reviewed article in conference proceedings

SEDA: Similarity-Enhanced Data Augmentation for Imbalanced Learning




Authors: Sheikh, Javad; Farahnakian, Farshad; Farahnakian, Fahimeh; Zelioli, Luca; Heikkonen, Jukka

Editors: Antonacopoulos, Apostolos; Chaudhuri, Subhasis; Chellappa, Rama; Liu, Cheng-Lin; Bhattacharya, Saumik; Pal, Umapada

Established conference name: International Conference on Pattern Recognition

Publisher: Springer Nature Switzerland

Publication year: 2024

Journal: Lecture Notes in Computer Science

Title of the compilation: Pattern Recognition: 27th International Conference, ICPR 2024, Kolkata, India, December 1–5, 2024, Proceedings, Part XXVI

Journal name in the source database: Lecture Notes in Computer Science (including subseries Lecture Notes in Artificial Intelligence and Lecture Notes in Bioinformatics)

Volume: 15326

First page: 32

Last page: 47

ISBN: 978-3-031-78394-4

eISBN: 978-3-031-78395-1

ISSN: 0302-9743

eISSN: 1611-3349

DOI: https://doi.org/10.1007/978-3-031-78395-1_3

Web address: https://doi.org/10.1007/978-3-031-78395-1_3


Abstract
Imbalanced datasets can significantly degrade the performance of Machine Learning (ML) models, which tend to overfit to the majority class and struggle to generalize to minority classes. To mitigate these issues, we introduce an augmentation technique for imbalanced datasets called Similarity-Enhanced Data Augmentation (SEDA). SEDA integrates feature and distance similarities to augment the minority samples. By incorporating feature importance, SEDA ensures that the most influential features are prioritized, leading to more meaningful synthetic samples. We evaluated the impact of SEDA on the performance of four ML models: Multi-Layer Perceptron (MLP), Random Forest (RF), Decision Tree (DT), and Logistic Regression (LR). SEDA's effectiveness is compared against random and SMOTE oversampling methods. Experimental results are collected on geophysical data from Lapland, Finland. The dataset exhibits a significant class imbalance, comprising 15 known samples in contrast to 2.92 × 10^5 unknown samples. Experiments show that adding high-quality synthetic samples can help a model generalize better to unseen data, addressing the overfitting issue commonly seen in imbalanced datasets. Part of the methodology implemented in this work is integrated into QGIS as a new toolkit called EIS Toolkit (https://github.com/GispoCoding/eis_toolkit) for mineral prospectivity mapping.
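The abstract does not specify the SEDA algorithm itself, so the following Python sketch only illustrates the two baselines it is compared against: random oversampling and a SMOTE-style interpolation between nearest minority neighbours. The optional `weights` argument, which scales the distance by a per-feature importance, is an assumption added for illustration and should not be read as the authors' method. The function names and the synthetic input data are hypothetical.

```python
import numpy as np


def random_oversample(X_min, n_new, rng):
    """Random oversampling: duplicate minority rows chosen with replacement."""
    X_min = np.asarray(X_min, dtype=float)
    idx = rng.integers(0, len(X_min), size=n_new)
    return X_min[idx]


def smote_like_oversample(X_min, n_new, k=3, weights=None, rng=None):
    """SMOTE-style interpolation between a minority sample and one of its
    k nearest minority neighbours.

    `weights` is a hypothetical per-feature importance vector that scales
    the distance; it is an assumption of this sketch, not the SEDA definition.
    Requires at least two minority samples.
    """
    rng = np.random.default_rng() if rng is None else rng
    X_min = np.asarray(X_min, dtype=float)
    n, d = X_min.shape
    w = np.ones(d) if weights is None else np.asarray(weights, dtype=float)

    # Pairwise importance-weighted Euclidean distances within the minority class.
    diff = X_min[:, None, :] - X_min[None, :, :]
    dist = np.sqrt((w * diff ** 2).sum(axis=2))
    np.fill_diagonal(dist, np.inf)  # a sample is not its own neighbour

    k = min(k, n - 1)
    neighbours = np.argsort(dist, axis=1)[:, :k]

    # Pick a base sample and one of its k neighbours for every new point.
    base = rng.integers(0, n, size=n_new)
    pick = neighbours[base, rng.integers(0, k, size=n_new)]
    gap = rng.random((n_new, 1))  # interpolation factor in [0, 1)
    return X_min[base] + gap * (X_min[pick] - X_min[base])


# Usage with made-up data: 15 minority samples (the count given in the
# abstract) and 4 synthetic features.
rng = np.random.default_rng(0)
X_min = rng.normal(size=(15, 4))
duplicates = random_oversample(X_min, n_new=100, rng=rng)
synthetic = smote_like_oversample(
    X_min, n_new=100, k=3, weights=[2.0, 1.0, 1.0, 0.5], rng=rng
)
```

Random oversampling only repeats existing rows, whereas the interpolation produces new points on the line segments between neighbouring minority samples. Weighting the distance changes which samples count as neighbours, and this is one plausible place for feature importance to enter.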


Funding information in the publication
The compilation of the presented work is supported by funds from the Horizon Europe research and innovation program under Grant Agreement number 101057357, EIS - Exploration Information System. For further information, check the website: EIS.


Last updated on 2025-01-27 at 19:06