A1 Refereed original article in a scientific journal
Self-organization and missing values in SOM and GTM
Authors: T. Vatanen, M. Osmala, T. Raiko, K. Lagus, M. Sysi-Aho, M. Orešič, T. Honkela, H. Lähdesmäki
Publisher: ELSEVIER SCIENCE BV
Publication year: 2015
Journal: Neurocomputing
Journal name in the database: NEUROCOMPUTING
Journal acronym: NEUROCOMPUTING
Volume: 147
First page: 60
Last page: 70
Number of pages: 11
ISSN: 0925-2312
DOI: https://doi.org/10.1016/j.neucom.2014.02.061
Abstract
In this paper, we study fundamental properties of the Self-Organizing Map (SOM) and the Generative Topographic Mapping (GTM), the ramifications of how the algorithms are initialized, and the behavior of the algorithms in the presence of missing data. We show that the commonly used principal component analysis (PCA) initialization of the GTM does not guarantee good learning results with high-dimensional data. Initializing the GTM with the SOM is shown to yield improvements in self-organization with three high-dimensional data sets: the commonly used MNIST and ISOLET data sets and the epigenomic ENCODE data set. We also propose a revision of how the batch SOM algorithm handles missing data, called the Imputation SOM, and show that the new algorithm is more robust in the presence of missing data. We benchmark the performance of the topographic mappings in the missing value imputation task and conclude that there are better methods for this particular task. Finally, we announce a revised version of the SOM Toolbox for Matlab with added GTM functionality. (C) 2014 Elsevier B.V. All rights reserved.