Exploring data augmentation and active learning benefits in imbalanced datasets

Moles, Luis; Andres, Alain; Echegaray, Goretti; Boto Sánchez, Fernando

dc.contributor.author	Moles, Luis
dc.contributor.author	Andres, Alain
dc.contributor.author	Echegaray, Goretti
dc.contributor.author	Boto Sánchez, Fernando
dc.date.accessioned	2024-11-12T13:15:53Z
dc.date.available	2024-11-12T13:15:53Z
dc.date.issued	2024-06
dc.date.updated	2024-11-12T13:15:53Z
dc.description.abstract	Despite the increasing availability of vast amounts of data, the challenge of acquiring labeled data persists. This issue is particularly serious in supervised learning scenarios, where labeled data are essential for model training. In addition, the rapid growth in data required by cutting-edge technologies such as deep learning makes the task of labeling large datasets impractical. Active learning methods offer a powerful solution by iteratively selecting the most informative unlabeled instances, thereby reducing the amount of labeled data required. However, active learning faces some limitations with imbalanced datasets, where majority class over-representation can bias sample selection. To address this, combining active learning with data augmentation techniques emerges as a promising strategy. Nonetheless, the best way to combine these techniques is not yet clear. Our research addresses this question by analyzing the effectiveness of combining both active learning and data augmentation techniques under different scenarios. Moreover, we focus on improving the generalization capabilities for minority classes, which tend to be overshadowed by the improvement seen in majority classes. For this purpose, we generate synthetic data using multiple data augmentation methods and evaluate the results considering two active learning strategies across three imbalanced datasets. Our study shows that data augmentation enhances prediction accuracy for minority classes, with approaches based on CTGANs obtaining improvements of nearly 50% in some cases. Moreover, we show that combining data augmentation techniques with active learning can reduce the amount of real data required.	en
dc.description.sponsorship	This work was financed by the Basque Government through their Elkartek program (SONETO project, ref. KK-2023/00038)	en
dc.identifier.citation	Moles, L., Andres, A., Echegaray, G., & Boto, F. (2024). Exploring Data Augmentation and Active Learning Benefits in Imbalanced Datasets. Mathematics, 12(12). https://doi.org/10.3390/MATH12121898
dc.identifier.doi	10.3390/MATH12121898
dc.identifier.issn	2227-7390
dc.identifier.uri	http://hdl.handle.net/20.500.14454/1782
dc.language.iso	eng
dc.publisher	Multidisciplinary Digital Publishing Institute (MDPI)
dc.rights	© 2024 by the authors
dc.subject.other	Active learning
dc.subject.other	CTGAN
dc.subject.other	Data augmentation
dc.subject.other	Entropy sampling
dc.subject.other	Machine learning
dc.title	Exploring data augmentation and active learning benefits in imbalanced datasets	en
dc.type	journal article
dcterms.accessRights	open access
oaire.citation.issue	12
oaire.citation.title	Mathematics
oaire.citation.volume	12
oaire.licenseCondition	https://creativecommons.org/licenses/by/4.0/
oaire.version	VoR

Files in this item

Original bundle

Name: moles_exploring_2024.pdf
Size: 12.39 MB
Format: Adobe Portable Document Format

Collections

Articles