Tools for the Validation of Genomes and Transcriptomes with Proteomics Data
Chi Nam Ignatius Pang (1), Carlos Aya (2), Aidan Tay (1), Nandan P. Deshpande (1), Nadeem O. Kaakoush (1), Hazel Mitchell (1), Natalie A. Twine (1), Moustapha Kassem (3), Marc R. Wilkins (1)
1. Systems Biology Initiative, School of Biotechnology and Biomolecular Sciences, University of New South Wales, Sydney, Australia
2. Intersect Australia Limited, Sydney, Australia
3. Center for Experimental Bioinformatics, Department of Biochemistry and Molecular Biology, University of Southern Denmark, Odense, Denmark
Aims
Despite the large amount of genomics and proteomics data currently available, there remains a lack of tools to integrate data from these two fields. This project aims to provide a ‘nexus’ for integrating genomics and transcriptomics data generated from next-generation sequencing with proteomics data generated from protein mass spectrometry. We are developing a set of tools which allow users to:
• Co-visualise genomics, transcriptomics, and proteomics data using the Integrative Genomics Viewer (IGV) [1].
• Validate the existence of genes and mRNAs using peptides identified from mass spectrometry experiments.
• Validate alternatively spliced mRNA isoforms by searching for peptides that span exon-exon junctions.
Analysis of Novel Bacterial Proteomes: under development
• Virtual protein generator: A tool which generates Mascot sequence databases based on genes predicted by tools such as Glimmer [3]. Novel open reading frames are accounted for by creating a database of ‘virtual proteins’, in which the genome is sliced into overlapping, fixed-size regions and translated in all six frames [4].
• Virtual protein merger: This tool takes a list of peptides that match ‘virtual proteins’ and recalculates the position of the open reading frames by searching for flanking start and stop codons.
Figure 3. The virtual protein generator and virtual protein merger. The bacterial genome is sliced into overlapping, fixed-size regions and translated in all six frames to create a database of ‘virtual proteins’. Peptides that match ‘virtual proteins’ are merged together into putative open reading frames based on flanking start and stop codons.
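The ‘virtual protein’ idea can be illustrated with a minimal Python sketch. It is not the project's implementation: the function names, the window and step sizes, and the restriction of the merging step to the forward strand with ATG as the only start codon are all assumptions made for illustration.

```python
from itertools import product

# Standard genetic code, codons enumerated in T, C, A, G order.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): a for c, a in zip(product(BASES, repeat=3), AMINO)}
STOPS = {"TAA", "TAG", "TGA"}


def reverse_complement(dna):
    return dna.translate(str.maketrans("ACGT", "TGCA"))[::-1]


def translate(dna):
    # Unknown codons (e.g. containing N) become 'X'; a trailing partial codon is dropped.
    return "".join(CODON_TABLE.get(dna[i:i + 3], "X") for i in range(0, len(dna) - 2, 3))


def virtual_proteins(genome, window=30, step=15):
    """Yield (start, strand, frame, protein) for overlapping fixed-size regions.

    Hypothetical parameters: 'window' and 'step' are in nucleotides, and
    'frame' is counted from the start of each region, not of the genome.
    Simplification: a tail shorter than one step may not be covered.
    """
    genome = genome.upper()
    for start in range(0, max(len(genome) - window, 0) + 1, step):
        region = genome[start:start + window]
        for strand, seq in (("+", region), ("-", reverse_complement(region))):
            for frame in range(3):
                yield start, strand, frame, translate(seq[frame:])


def extend_to_orf(genome, start, end, start_codons=("ATG",)):
    """Extend an in-frame peptide-covered region [start, end) to a putative ORF.

    Forward strand only (an assumption of this sketch). Walks upstream codon
    by codon to the most upstream start codon before an in-frame stop, and
    downstream to the first in-frame stop codon. Returns (orf_start, orf_end),
    0-based half-open; either value is None if no such codon is found.
    """
    genome = genome.upper()
    orf_start, pos = None, start
    while pos >= 0:
        codon = genome[pos:pos + 3]
        if codon in STOPS:
            break
        if codon in start_codons:
            orf_start = pos
        pos -= 3
    orf_end, pos = None, end
    while pos + 3 <= len(genome):
        if genome[pos:pos + 3] in STOPS:
            orf_end = pos + 3
            break
        pos += 3
    return orf_start, orf_end


if __name__ == "__main__":
    toy = "TAAATGGCTGCTAAATTTTGA"
    for record in virtual_proteins(toy, window=12, step=6):
        print(record)
    print(extend_to_orf(toy, 9, 15))  # -> (3, 21)
```

In a real workflow each yielded protein would be written as a FASTA entry whose header encodes the region start, strand and frame, so that Mascot matches can be mapped back to genomic coordinates before merging.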
Analytical Pipeline
The pipeline consists of several tools and requires several input files. It is summarised as a diagram in Figure 1.
Applications – Proof of Concept
• The Results Analyzer was used to verify proteins coded in the Campylobacter concisus and Saccharomyces cerevisiae genomes. Proteins were verified on the basis of two or more peptide ‘hits’ with Mascot scores exceeding an identity threshold.
• Campylobacter concisus (emergent gut pathogen): 66% (1320/2002) of proteins in UniProt [2] were verified with peptides identified from mass spectrometry experiments.
• Saccharomyces cerevisiae (baker’s yeast): 14% (895/6621) of the proteins in UniProt, as well as 9% (29/313) of all splice junctions in the yeast proteome, were verified with peptide evidence.
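The verification criterion used above (two or more peptide hits scoring above the identity threshold) can be expressed as a short filter. This is a sketch under stated assumptions, not the Results Analyzer itself: the input tuple layout is hypothetical, and ‘two or more hits’ is interpreted here as two or more distinct peptide sequences.

```python
from collections import defaultdict


def verified_proteins(hits, min_peptides=2):
    """Return the set of proteins supported by enough confident peptides.

    'hits' is a hypothetical iterable of
    (protein_id, peptide_sequence, mascot_score, identity_threshold) tuples.
    A peptide counts only if its score exceeds its identity threshold, and
    repeated identifications of the same sequence are counted once.
    """
    confident = defaultdict(set)
    for protein, peptide, score, identity_threshold in hits:
        if score > identity_threshold:
            confident[protein].add(peptide)
    return {p for p, seqs in confident.items() if len(seqs) >= min_peptides}


def percent_verified(verified, total):
    """Percentage rounded to the nearest integer, as reported in the text."""
    return round(100 * verified / total)


if __name__ == "__main__":
    example_hits = [
        ("P1", "AAAK", 55.0, 40.0),
        ("P1", "LLLR", 47.0, 40.0),
        ("P2", "GGGK", 60.0, 40.0),   # only one confident peptide
        ("P2", "TTTR", 20.0, 40.0),   # below the identity threshold
    ]
    print(verified_proteins(example_hits))
    print(percent_verified(1320, 2002), percent_verified(895, 6621), percent_verified(29, 313))
```

The last line reproduces the reported percentages (66, 14 and 9) from the counts given above.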
Figure 1. The analytical pipeline allows genomics and transcriptomics data generated from next-generation sequencing platforms to be used in custom sequence databases for Mascot searches. This allows the verification of novel genes or novel alternatively spliced mRNA isoforms using proteomics data.
Downloads
The software is available via the GitHub code repository:
https://github.com/IntersectAustralia/ap11_samifier
Project Blog
http://intersectaustralia.github.com/ap11/
Contact
Prof. Marc Wilkins - [email protected]
Integration and Visualisation of Genomics and Proteomics Data
• Samifier: A tool which converts results from protein tandem mass spectrometry into SAM format. This enables co-visualization of genomics, transcriptomics, and proteomics data using the Integrative Genomics Viewer (IGV), which displays SAM files.
• Results analyzer: This tool reports the number and types of peptides and proteins, and their corresponding Mascot scores, based on customizable filters. Peptides that span exon-exon junctions are also highlighted, which can be used to validate alternatively spliced isoforms of proteins.
Figure 2. The Integrative Genomics Viewer was used to visualize experimental peptides for the yeast 40S ribosomal protein S7-B (YNL096C). A peptide that spans an exon-exon junction is highlighted in the red box. This analysis has also been done on a genome / proteome scale (see Applications). Figure labels: genomic location; peptide at exon-exon junction; peptide matches from Mascot; gene architecture (exons and introns).
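To make the peptide-to-SAM conversion concrete, the following Python sketch maps a peptide from protein coordinates to genomic coordinates and writes one SAM alignment line. It is an illustration only and does not reproduce Samifier's code: the function names are hypothetical, exons are assumed to be coding-only, forward-strand, 0-based half-open intervals, and reverse-strand genes (such as YNL096C) would need additional handling. A peptide that crosses an exon-exon junction yields more than one genomic block, which SAM expresses with the `N` (skipped region) CIGAR operator.

```python
def peptide_blocks(exons, aa_start, aa_len):
    """Map a peptide to genomic blocks.

    exons: list of (start, end) coding intervals in transcription order.
    aa_start: 0-based index of the peptide's first residue in the protein.
    Returns a list of (start, end) genomic intervals, one per exon touched.
    """
    nt_start, nt_end = 3 * aa_start, 3 * (aa_start + aa_len)
    blocks, offset = [], 0  # offset = coding nucleotides in earlier exons
    for ex_start, ex_end in exons:
        ex_len = ex_end - ex_start
        lo, hi = max(nt_start, offset), min(nt_end, offset + ex_len)
        if lo < hi:
            blocks.append((ex_start + lo - offset, ex_start + hi - offset))
        offset += ex_len
    return blocks


def cigar(blocks):
    """Build a CIGAR string: M for aligned bases, N for the intron between blocks."""
    parts = []
    for i, (start, end) in enumerate(blocks):
        if i > 0:
            parts.append(f"{start - blocks[i - 1][1]}N")
        parts.append(f"{end - start}M")
    return "".join(parts)


def spans_junction(blocks):
    return len(blocks) > 1


def sam_line(name, chrom, genome, blocks):
    """Format one SAM record (11 mandatory tab-separated fields).

    FLAG 0 = mapped, forward strand; POS is 1-based; MAPQ 255 = unavailable;
    RNEXT/PNEXT/TLEN and QUAL are left unset ('*', 0, 0, '*').
    """
    seq = "".join(genome[s:e] for s, e in blocks)
    fields = [name, "0", chrom, str(blocks[0][0] + 1), "255",
              cigar(blocks), "*", "0", "0", seq, "*"]
    return "\t".join(fields)


if __name__ == "__main__":
    # Toy gene: two coding exons separated by an 18-nt intron.
    toy_genome = "A" * 16 + "CCCGGG" + "T" * 18 + "ACGTAC" + "A" * 15
    toy_exons = [(10, 22), (40, 61)]
    blocks = peptide_blocks(toy_exons, aa_start=2, aa_len=4)
    print(blocks, spans_junction(blocks))
    print(sam_line("pep1", "chrI", toy_genome, blocks))
```

A complete SAM file also needs a header section (for example `@SQ` lines naming each reference sequence and its length) before IGV can load it; that part is omitted here.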
Acknowledgements
This project is supported by the Australian National Data Service (ANDS). ANDS is supported by the Australian Government through the National Collaborative Research Infrastructure Strategy (NCRIS) Program and the Education Investment Fund (EIF) Super Science Initiative. The software is developed in conjunction with Intersect Australia Limited, a not-for-profit eResearch company. We thank the Australian Proteomics Computational Facility (APCF) for providing access to the Mascot server and Simon Michnowicz for technical support. We also thank Dr. Gene Hart-Smith for access to the Wilkins Lab yeast proteomics data.
References
1. Robinson, J. T.; Thorvaldsdottir, H.; Winckler, W.; Guttman, M.; Lander, E. S.; Getz, G.; Mesirov, J. P., Integrative genomics viewer. Nat Biotechnol 2011, 29, (1), 24-6.
2. Deshpande, N. P.; Kaakoush, N. O.; Mitchell, H.; Janitz, K.; Raftery, M. J.; Li, S. S.; Wilkins, M. R., Sequencing and validation of the genome of a Campylobacter concisus reveals intra-species diversity. PLoS One 2011, 6, (7), e22170.
3. Delcher, A. L.; Bratke, K. A.; Powers, E. C.; Salzberg, S. L., Identifying bacterial genes and endosymbiont DNA with Glimmer. Bioinformatics 2007, 23, (6), 673-9.
4. Arthur, J. W.; Wilkins, M. R., Using proteomics to mine genome sequences. Journal of Proteome Research 2004, 3, (3), 393-402.