A1 Refereed original research article in a scientific journal

Pool-seq driven proteogenomic database for Group G Streptococcus

Authors: Weldatsadik RG, Datta N, Kolmeder C, Vuopio J, Kere J, Wilkman SV, Flatt JW, Vuento R, Haapasalo KJ, Keskitalo S, Varjosalo M, Jokiranta TS

Publication year: 2019

Journal: Journal of Proteomics

Journal name in source: Journal of proteomics

Journal acronym: J Proteomics

Volume: 201

First page: 84

Last page: 92

Number of pages: 9

ISSN: 1874-3919

eISSN: 1876-7737

DOI: https://doi.org/10.1016/j.jprot.2019.04.015

Abstract

Proteogenomic databases use genomic and transcriptomic information to improve the identification of peptides and proteins from mass spectrometry analyses. One application of such databases is the discovery of variants/mutations. In this study, we created a proteogenomic database containing sequences with variants derived from pooled sequencing experiments (137 Group G Streptococcus strains sequenced in 3 pools) and used tandem mass spectrometry (MS/MS) to analyse eight protein samples from randomly selected strains sequenced in the pools. Using the proteogenomic variant database, we identified 385 variant peptides from the eight samples. None of these could be identified from the conventional single-genome database used, while 71.2% and 93.5% of them were identified from the databases that contained 4 complete genomes and 26 assemblies, respectively. The proteogenomic variant databases exhibited the same properties as the conventional databases in terms of the Andromeda score distributions and the posterior error probability (PEP) values of the identified peptides.

SIGNIFICANCE: For bacterial populations with substantial intra-species diversity, such as Group G Streptococcus (GGS), sequencing large numbers of strains simultaneously and generating proteogenomic databases from them improves the discovery of peptides in mass spectrometric analyses. Therefore, generating proteogenomic variant protein databases from pooled sequencing experiments can be a cost-effective way to complement conventional databases and discover subtle strain-wise differences.
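The idea of a variant database described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' pipeline, which the abstract does not describe at implementation level. The protein sequence, the substitutions, and the function names (`tryptic_peptides`, `apply_substitution`, `variant_peptides`) are all hypothetical. The sketch also assumes a simplified trypsin rule (cleave after K or R unless followed by P, no missed cleavages). It applies single amino-acid substitutions to a reference protein and keeps only those peptides that a reference-only database would not contain.

```python
import re


def tryptic_peptides(protein, min_len=6):
    """In-silico digest: cleave after K or R unless the next residue is P.

    Simplified assumption: no missed cleavages; peptides shorter than
    min_len are discarded.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return {p for p in pieces if len(p) >= min_len}


def apply_substitution(protein, pos, ref, alt):
    """Return the protein with residue `ref` at 1-based `pos` replaced by `alt`."""
    if protein[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {protein[pos - 1]}")
    return protein[: pos - 1] + alt + protein[pos:]


def variant_peptides(reference, substitutions):
    """Collect peptides that appear only when a substitution is applied."""
    reference_peptides = tryptic_peptides(reference)
    found = set()
    for pos, ref, alt in substitutions:
        variant = apply_substitution(reference, pos, ref, alt)
        found |= tryptic_peptides(variant) - reference_peptides
    return found


# Hypothetical toy protein and variant calls (not data from the study).
REFERENCE = "MSTLEAVKGDNQTYLRAPLLEK"
SUBSTITUTIONS = [(12, "Q", "H"), (19, "L", "V")]

if __name__ == "__main__":
    for peptide in sorted(variant_peptides(REFERENCE, SUBSTITUTIONS)):
        print(peptide)
```

Peptides returned by `variant_peptides` are the ones a search against the reference proteome alone cannot match, which is why adding them to the search database can expose strain-specific differences.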