G5 Doctoral dissertation (article-based)
Analysis of High-dimensional and Left-censored Data with Applications in Lipidomics and Genomics
Authors: Pesonen Maiju
Publisher: University of Turku
Place of publication: Turku
Year of publication: 2016
ISBN: 978-951-29-6642-4
eISBN: 978-951-29-6643-1
Web address: http://urn.fi/URN:ISBN:978-951-29-6643-1
Self-archived copy address: https://www.doria.fi/handle/10024/125768
Recently, new kinds of high-throughput measurement techniques have appeared, enabling biological research to focus on the fundamental building blocks of living organisms, such as genes, proteins, and lipids. Alongside this new type of data, referred to as omics data, modern data analysis techniques have emerged. Much of this research focuses on finding biomarkers for detecting abnormalities in a person's health status and on learning unobservable network structures that represent functional associations in biological regulatory systems. Omics data have specific characteristics, such as left-censored observations caused by the limitations of measurement instruments, missing data, non-normal observations, and very high dimensionality, and the interest often lies in the connections among the large number of variables.
This thesis has two major aims. The first is to provide efficient methodology for dealing with various types of missing or censored omics data, which can be used for visualisation and biomarker discovery based on, for example, regularised regression techniques. A maximum likelihood based covariance estimation method for data with censored values is developed, and the algorithms are described in detail. The second major aim is to develop novel approaches for detecting interactions that display functional associations in large-scale observations. For more complicated data connections, a technique based on partial least squares regression is investigated. The technique is applied to network construction as well as to differential network analyses, both on multiply imputed censored data and on next-generation sequencing count data.
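To illustrate what maximum likelihood estimation with left-censored values involves, the following is a minimal sketch of the log-likelihood in the simplest possible setting: a single normally distributed variable with a known detection limit. It is not the thesis's covariance estimation algorithm, which is multivariate and is not described in this abstract. The function names and the univariate normal assumption are illustrative choices made here. An observed value contributes its density; a value below the detection limit contributes the probability of falling at or below that limit.

```python
import math


def norm_logpdf(x, mu, sigma):
    """Log density of a normal distribution N(mu, sigma^2) at x."""
    z = (x - mu) / sigma
    return -0.5 * z * z - math.log(sigma) - 0.5 * math.log(2.0 * math.pi)


def norm_logcdf(x, mu, sigma):
    """Log of P(X <= x) for X ~ N(mu, sigma^2)."""
    z = (x - mu) / sigma
    return math.log(0.5 * math.erfc(-z / math.sqrt(2.0)))


def censored_loglik(values, censored, mu, sigma):
    """Log-likelihood of a sample with left-censored entries.

    values[i] is the measured value, or the detection limit when
    censored[i] is True (hypothetical encoding chosen for this sketch).
    """
    total = 0.0
    for x, is_censored in zip(values, censored):
        if is_censored:
            # Only known: the true value lies at or below the limit x.
            total += norm_logcdf(x, mu, sigma)
        else:
            # Fully observed value: ordinary normal density term.
            total += norm_logpdf(x, mu, sigma)
    return total


# Example: two observed values and one value below a detection limit of 0.5.
example = censored_loglik([1.2, 0.9, 0.5], [False, False, True], mu=1.0, sigma=0.5)
```

Maximising this function over `mu` and `sigma` with any numerical optimiser gives the censored-data estimates for the univariate case; a multivariate version would replace the density and distribution-function terms with their multivariate normal counterparts.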