A1 Refereed original research article in a scientific journal
Pool-seq driven proteogenomic database for Group G Streptococcus
Authors: Weldatsadik RG, Datta N, Kolmeder C, Vuopio J, Kere J, Wilkman SV, Flatt JW, Vuento R, Haapasalo KJ, Keskitalo S, Varjosalo M, Jokiranta TS
Publication year: 2019
Journal: Journal of Proteomics
Journal name in source: Journal of proteomics
Journal acronym: J Proteomics
Volume: 201
First page: 84
Last page: 92
Number of pages: 9
ISSN: 1874-3919
eISSN: 1876-7737
DOI: https://doi.org/10.1016/j.jprot.2019.04.015
Abstract
Proteogenomic databases use genomic and transcriptomic information to improve the identification of peptides and proteins from mass spectrometry analyses. One application of such databases is the discovery of variants/mutations. In this study, we created a proteogenomic database containing sequences with variants derived from pooled sequencing experiments (137 Group G Streptococcus strains sequenced in 3 pools) and used tandem mass spectrometry (MS/MS) to analyse eight protein samples from randomly selected strains sequenced in the pools. Using the proteogenomic variant database, we identified 385 variant peptides from the eight samples. None of them could be identified from the single-genome conventional database used, while 71.2% and 93.5% of them were identified from the databases that contained 4 complete genomes and 26 assemblies, respectively. The proteogenomic variant databases exhibited the same properties as the conventional databases in terms of the Andromeda score distributions and the posterior error probability (PEP) values of the identified peptides. SIGNIFICANCE: For bacterial populations with substantial intra-species diversity, such as Group G Streptococcus (GGS), simultaneously sequencing large numbers of strains and generating proteogenomic databases from them improves the discovery of peptides in mass spectrometric analyses. Generating proteogenomic variant protein databases from pooled sequencing experiments can therefore be a cost-effective way to complement conventional databases and discover subtle strain-wise differences.
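The idea of a variant peptide that a single-genome database cannot match can be illustrated with a minimal sketch. This is not the authors' pipeline: the reference sequence, the variant, the function names, and the simplified trypsin rule (cleave after K or R unless followed by P, no missed cleavages) are all assumptions made up for illustration.

```python
import re


def tryptic_peptides(protein, min_len=6):
    """Digest a protein in silico with a simplified trypsin rule.

    Cleaves after K or R unless the next residue is P; no missed
    cleavages. Peptides shorter than min_len are discarded.
    """
    pieces = re.split(r"(?<=[KR])(?!P)", protein)
    return {p for p in pieces if len(p) >= min_len}


def apply_substitution(protein, pos, ref, alt):
    """Return the protein with the residue at 1-based pos changed from ref to alt."""
    if protein[pos - 1] != ref:
        raise ValueError(f"expected {ref} at position {pos}, found {protein[pos - 1]}")
    return protein[:pos - 1] + alt + protein[pos:]


def variant_only_peptides(reference, variants):
    """Map each (pos, ref, alt) variant to the peptides it adds beyond the reference digest."""
    reference_peptides = tryptic_peptides(reference)
    result = {}
    for pos, ref, alt in variants:
        variant_protein = apply_substitution(reference, pos, ref, alt)
        result[(pos, ref, alt)] = tryptic_peptides(variant_protein) - reference_peptides
    return result


# Hypothetical 30-residue reference protein and one hypothetical variant (T17I).
REFERENCE = "MSTAGKVLEDNPRAQWTLLGKDEIFSAGHR"
VARIANTS = [(17, "T", "I")]

if __name__ == "__main__":
    for variant, peptides in variant_only_peptides(REFERENCE, VARIANTS).items():
        print(variant, sorted(peptides))
```

In this toy case the substitution yields one peptide that is absent from the reference digest, so a search against the reference alone could not match its spectrum. A variant database adds such sequences as extra entries.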