KANN: estimation of genetic ancestry profiles by nearest neighbor regression

Authors: Riikonen, Juha; Kerminen, Sini; Havulinna, Aki; Pirinen, Matti

Publisher: Oxford University Press (OUP)

Publication year: 2026

Journal: Nucleic Acids Research

Article number: gkag209

Volume: 54

Issue: 5

ISSN: 0305-1048

eISSN: 1362-4962

DOI: https://doi.org/10.1093/nar/gkag209

Publication record: https://research.utu.fi/converis/portal/detail/Publication/516225679

State-of-the-art methods for inferring individual-level genetic ancestry are based on statistical models for haplotype data. Unfortunately, these methods are computationally demanding, making them impractical for biobank-scale analyses. In this paper, we describe KANN, an efficient k-nearest neighbor regression method for individual-level ancestry estimation with respect to predefined source populations using only principal components of genetic structure. Unlike existing tools, which can only use reference samples with a discrete source population assignment, KANN enables the use of reference samples with continuous ancestry profiles across multiple source populations. We observe that KANN’s ancestry estimates agree well with those of the haplotype-based method SOURCEFIND when estimating ancestry profiles across up to 10 Finnish source populations on a dataset of 18 125 Finnish samples from THL Biobank. In the 1000 Genomes Project data containing globally diverse genetic backgrounds, KANN produces results highly similar to those of the ADMIXTURE software. Based on our results, KANN is a promising tool for ancestry estimation in large-scale genomic studies.
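The general idea of k-nearest neighbor regression on principal components can be illustrated with a minimal sketch. This is not the KANN implementation: the function name `knn_ancestry`, the unweighted averaging, the Euclidean distance, the choice of `k`, and the toy data are all assumptions made for illustration, since the abstract does not specify these details. The sketch only shows how reference samples with continuous ancestry profiles (rows that sum to 1 across source populations) can inform the estimate for a query sample.

```python
import math

def knn_ancestry(ref_pcs, ref_profiles, query_pc, k=2):
    """Estimate an ancestry profile as the mean profile of the k nearest
    reference samples in principal component space (illustrative only)."""
    # Rank reference samples by Euclidean distance to the query in PC space.
    order = sorted(range(len(ref_pcs)),
                   key=lambda i: math.dist(ref_pcs[i], query_pc))
    nearest = order[:k]
    n_sources = len(ref_profiles[0])
    # Unweighted mean of the neighbors' profiles, one value per source population.
    est = [sum(ref_profiles[i][s] for i in nearest) / len(nearest)
           for s in range(n_sources)]
    # Renormalize so the proportions sum to 1 despite rounding error.
    total = sum(est)
    return [p / total for p in est]

# Hypothetical toy data: four reference samples, two PCs, two source populations.
ref_pcs = [[0.0, 0.0], [0.1, 0.0], [1.0, 1.0], [1.1, 1.0]]
ref_profiles = [[1.0, 0.0], [0.8, 0.2], [0.0, 1.0], [0.2, 0.8]]

estimate_a = knn_ancestry(ref_pcs, ref_profiles, [0.05, 0.0], k=2)
estimate_b = knn_ancestry(ref_pcs, ref_profiles, [1.05, 1.0], k=2)
print(estimate_a, estimate_b)
```

In this toy example, each query sample lies between two reference samples and receives the average of their profiles. A discrete-assignment method would instead require every reference row to be a one-hot vector such as `[1.0, 0.0]`.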

Funding: This work was supported by the Sigrid Jusélius Foundation [8047 to M.P.] and the Research Council of Finland [338507, 352795, and 336285 to M.P.]. Funding to pay the Open Access publication charges for this article was provided by the Helsinki University Library.