Author template for journal articles


Regulatory Domains, Gene Function and small RNAs in the annotation of Azotobacter vinelandii; Conserved Domains

Luisa P. Mesquita ([email protected])
Universidade Católica Portuguesa

Abstract

This work describes the methodology used and the results obtained during my six-month period at the Virginia Bioinformatics Institute, Virginia Tech, USA. The work is four-fold and studies important features of several prokaryotic genome sequences of interest. The four projects were: searching for regulatory domains in some Gamma Proteobacteria genome sequences; reannotating Azotobacter vinelandii genes of unknown function; searching for non-coding RNA elements in some genomes of interest; and searching for conserved domains in protein families. All results will be used in future phases of the corresponding genome projects.

Keywords: Regulatory Domains, Azotobacter vinelandii, RNAs, Conserved Domains

1 Introduction

This text describes the methodology and results of four different research studies that I carried out during my grant period at the Virginia Bioinformatics Institute, Virginia Tech. All four studies are related to genome sequence analysis, with special attention to prokaryotic genomes of interest. The four topics studied are: searching for regulatory domains in some Gamma Proteobacteria genome sequences; reannotating Azotobacter vinelandii genes of unknown function; searching for non-coding RNA elements in some genomes of interest; and searching for conserved domains in protein families. A manuscript on this last topic is being prepared for submission to a conference. The text follows this order.

2 Regulatory Domains in Gamma Proteobacteria genome sequences

The main goal of this work was to find regulatory domains in the entire genomes of some Gamma Proteobacteria. The lab group I joined is interested in the study of Azotobacter vinelandii, a species of the genus Azotobacter, which is classified in the family Pseudomonadaceae. The Gamma Proteobacteria used in this work were, besides Azotobacter, all the Pseudomonas genomes available at the NCBI (http://www.ncbi.nlm.nih.gov/), because the two genera are closely related.

The Proteobacteria are a major group (phylum) of bacteria. They include a wide variety of pathogens, such as Escherichia, Salmonella, Vibrio, Helicobacter, and many other notable genera. Others are free-living, and include many of the bacteria responsible for nitrogen fixation. The group is defined primarily in terms of ribosomal RNA (rRNA) sequences, and is named for the Greek god Proteus (also the name of a bacterial genus within the Proteobacteria), who could change his shape, because of the great diversity of forms found in this group. The Proteobacteria are divided into five sections, referred to by the Greek letters alpha through epsilon, again based on rRNA sequences. These are often treated as classes. The Gammaproteobacteria comprise several medically and scientifically important groups of bacteria, such as the Enterobacteriaceae, Vibrionaceae and Pseudomonadaceae. A large number of important pathogens belong to this class, e.g. Salmonella (enteritis and typhoid fever), Yersinia pestis (plague), Vibrio (cholera), Pseudomonas aeruginosa (lung infections in hospitalized or cystic fibrosis patients), and Escherichia coli O157:H7 (food poisoning) [1].

To find the regulatory domains we used Pfam (http://www.sanger.ac.uk/Software/Pfam/), a database of protein and domain families. This database contains curated multiple sequence alignments for each family, as well as profile hidden Markov models (profile HMMs) [2] used to find these domains in new sequences. A profile HMM can be read as a table with the amino acids in the first row and the positions of the domain in the first column. The entries are numbers representing the probabilities that parameterize the HMM; they are stored as integers that are related to the probabilities via a log-odds calculation. Each position therefore has a score for each amino acid, and to evaluate how well an entire domain matches a specific gene we compute a total score from that profile. An example of a profile is shown in Figure 1.

Figure 1 - Example of a profile HMM (the domain is 7TMR-DISM_7TM)
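The log-odds idea behind the profile table can be illustrated with a minimal Python sketch. It is not how HMMER computes scores: it ignores insert and delete states and transition probabilities, assumes a uniform background over the 20 amino acids, and the three-position `TOY_PROFILE` is invented purely for illustration.

```python
import math

# Assumption: uniform background frequency over the 20 amino acids.
BACKGROUND = 1.0 / 20.0

# Invented toy profile: one dict of emission probabilities per position.
# Residues not listed share the remaining probability mass equally.
TOY_PROFILE = [
    {"M": 0.80},
    {"K": 0.50, "R": 0.30},
    {"L": 0.60, "I": 0.20},
]


def emission(column, residue):
    """Probability of `residue` at one profile position."""
    if residue in column:
        return column[residue]
    remaining = 1.0 - sum(column.values())
    return remaining / (20 - len(column))


def log_odds_score(profile, segment):
    """Sum of per-position log2(p_profile / p_background) scores, in bits."""
    if len(segment) != len(profile):
        raise ValueError("segment and profile must have the same length")
    return sum(
        math.log2(emission(column, residue) / BACKGROUND)
        for column, residue in zip(profile, segment)
    )


if __name__ == "__main__":
    for segment in ("MKL", "MRI", "AAA"):
        print(segment, round(log_odds_score(TOY_PROFILE, segment), 2))
```

A segment that agrees with the profile at every position gets a positive score, while a segment made of residues the profile rarely emits gets a negative one, which is the behaviour the table of integer scores encodes.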

To choose the regulatory domains from the Pfam database we selected all those that had the word "regulatory" in their description. In total 129 domains were selected, 1.4% of the entire Pfam database. HMMER (http://hmmer.janelia.org/) is a freely distributable implementation of profile HMM software for protein sequence analysis, which compares all the domains in the database against the sequence of interest, as shown in Figure 2. Figure 3 shows the input we used: the FASTA files of Azotobacter and of all the Pseudomonas genomes, and a database containing only the selected regulatory domains.

Figure 2 - Use of the HMMER software

Figure 3 - Use of the HMMER software for the selected domains and sequences
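The keyword-based selection of domains described above can be sketched as follows. This is only an illustration: the text does not say which Pfam description field was searched or how the search was run, so the `(name, description)` pairs and the case-insensitive substring test are assumptions.

```python
def select_by_keyword(families, keyword="regulatory"):
    """Return the names of families whose description contains `keyword`."""
    keyword = keyword.lower()
    return [name for name, description in families if keyword in description.lower()]


if __name__ == "__main__":
    # Small illustrative sample; a real run would read the whole Pfam family list.
    families = [
        ("AflR", "Aflatoxin regulatory protein"),
        ("BofA", "SigmaK-factor processing regulatory protein BofA"),
        ("ABC_tran", "ABC transporter"),
    ]
    print(select_by_keyword(families))
```

Matching on a short description alone would miss families whose regulatory role is only mentioned in longer annotation text, so the field that is searched matters for the final count.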

As output we obtained a file containing the results for all genes. An example of the output for one gene from Pseudomonas stutzeri A1501 is shown in Figure 4. After the header, which contains the query sequence (gene), its accession number (empty in this case) and its description, come the scores for sequence family classification, ordered from the highest to the lowest, and then the description of each domain location, ordered from the first to the last position in the sequence. In this last table the first column holds the name of the domain and the second the number of the domain. The examples shown do not contain more than one copy of any domain, but when they do the entry reads, for example, 1/2, meaning the first domain of a total of 2. Next comes the location of the domain in the gene, for example from position 344 to position 457; seq-f means “sequence from” and seq-t means “sequence to”. After these two fields is a shorthand annotation for whether the alignment is “global” with respect to the sequence or not. A dot (.) means the alignment does not go all the way to the end; a bracket ([ or ]) means it does. Thus, .. means that the alignment is local within the sequence; [. means that the alignment starts at the beginning of the sequence but does not go all the way to its end; .] means the alignment starts somewhere internally and goes all the way to the end; and [] means the alignment includes the entire sequence. Analogously, the fields marked “hmm-f” and “hmm-t” indicate the start and end points with respect to the consensus coordinates of the model, and the following field is a shorthand for whether the alignment is global or not with respect to the model. The last two fields are the score and the E-value.

Figure 4 - Example of the output for a gene from Pseudomonas stutzeri A1501
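The field layout of the domain table described above can be made concrete with a small parser. This is a sketch under assumptions: it expects exactly the ten whitespace-separated fields listed in the text, and the example row is invented (only the 344 to 457 coordinates come from the text); real output files may need extra handling.

```python
from dataclasses import dataclass


@dataclass
class DomainHit:
    model: str
    index: int       # copy number of this domain, e.g. 1 in "1/2"
    total: int       # total copies of this domain, e.g. 2 in "1/2"
    seq_from: int
    seq_to: int
    seq_flags: str   # "..", "[.", ".]" or "[]" relative to the sequence
    hmm_from: int
    hmm_to: int
    hmm_flags: str   # same shorthand, relative to the model
    score: float
    evalue: float


def parse_domain_row(line):
    """Parse one row: model, n/m, seq-f, seq-t, flags, hmm-f, hmm-t, flags, score, E-value."""
    fields = line.split()
    if len(fields) != 10:
        raise ValueError(f"expected 10 fields, got {len(fields)}")
    index, total = fields[1].split("/")
    return DomainHit(
        fields[0], int(index), int(total),
        int(fields[2]), int(fields[3]), fields[4],
        int(fields[5]), int(fields[6]), fields[7],
        float(fields[8]), float(fields[9]),
    )


def describe_flags(flags):
    """Spell out the two-character global/local shorthand."""
    start = "starts at the first position" if flags[0] == "[" else "starts internally"
    end = "reaches the last position" if flags[1] == "]" else "stops before the end"
    return f"{start}, {end}"


if __name__ == "__main__":
    # Invented row for illustration; scores and model coordinates are not real results.
    row = "AraC_binding  1/1  344  457 ..  1  120 []  87.5  2.1e-23"
    hit = parse_domain_row(row)
    print(hit.model, hit.seq_from, hit.seq_to, describe_flags(hit.seq_flags))
```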

As we can see, some domains fall in the same region of the gene, so we have to choose one of them. To avoid choosing the wrong one, we kept the domain with the best score among those that overlap. The domains found in this specific gene are shown in Figure 5; the ones that are not in the stripe were discarded.

Figure 5 – Example of an overlap for the output described
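One way to implement the "keep the best-scoring domain among overlapping ones" rule is a greedy pass over the hits in decreasing score order. The sketch below assumes that any shared position counts as an overlap and that hits are `(name, start, end, score)` tuples; the text does not state the exact overlap criterion used, and the hit names and values are invented.

```python
def overlaps(a, b):
    """True if two (name, start, end, score) hits share at least one position."""
    return a[1] <= b[2] and b[1] <= a[2]


def keep_best_scoring(hits):
    """Greedily keep the highest-scoring hits that do not overlap a kept hit."""
    kept = []
    for hit in sorted(hits, key=lambda h: h[3], reverse=True):
        if not any(overlaps(hit, other) for other in kept):
            kept.append(hit)
    # Report the surviving hits in sequence order.
    return sorted(kept, key=lambda h: h[1])


if __name__ == "__main__":
    # Invented hits for one gene: dom_A and dom_B overlap, dom_C stands alone.
    hits = [
        ("dom_A", 10, 120, 55.0),
        ("dom_B", 100, 200, 80.0),
        ("dom_C", 300, 400, 20.0),
    ]
    for name, start, end, score in keep_best_scoring(hits):
        print(name, start, end, score)
```

In this example dom_A is discarded because it overlaps the higher-scoring dom_B, while dom_C is kept even though its score is low, since nothing competes with it for that region.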

As a result we obtained a large table containing the total number of each domain in the genomes of interest. A small part of this table is shown in Figure 6.

Name | Description | Azotobacter | Pseudomonas stutzeri A1501 | Pseudomonas aeruginosa PA7 | Pseudomonas aeruginosa PAO1
7TMR-DISMED2 | 7TMR-DISM extracellular 2 | 4 | 4 | 4 | 4
7TMR-DISM_7TM | 7TM diverse intracellular signalling | 3 | 4 | 4 | 4
AAA_2 | ATPase family associated with various cellular activities (AAA) | 7 | 6 | 10 | 10
AAA_5 | ATPase family associated with various cellular activities (AAA) | 10 | 13 | 12 | 12
ANTAR | ANTAR domain | 1 | 2 | 2 | 2
Adenylate_cycl | Adenylate cyclase, class-I | 0 | 1 | 1 | 1
AflR | Aflatoxin regulatory protein | 0 | 0 | 0 | 0
AntA | AntA/AntB antirepressor | 1 | 0 | 0 | 0
AraC_E_bind | Bacterial transcription activator, effector binding domain | 1 | 0 | 1 | 1
AraC_N | AraC-type transcriptional regulator N-terminus | 1 | 1 | 2 | 3
AraC_binding | AraC-like ligand binding domain | 6 | 6 | 22 | 19
AreA_N | Nitrogen regulatory protein AreA N terminus | 0 | 0 | 0 | 0
AsnC_trans_reg | AsnC family | 3 | 3 | 9 | 8
Autoind_synth | Autoinducer synthetase | 0 | 0 | 2 | 2
B56 | Protein phosphatase 2A regulatory B subunit (B56 family) | 0 | 0 | 0 | 0
Bac_DNA_binding | Bacterial DNA-binding protein | 6 | 4 | 5 | 5
BofA | SigmaK-factor processing regulatory protein BofA | 0 | 0 | 0 | 0
CBM_21 | Putative phosphatase regulatory subunit | 0 | 0 | 0 | 0

Figure 6 - Number of genes with regulatory domains in Azotobacter and in all the Pseudomonas genomes. This table is an overview of the HMMER output, using only the regulatory domains of these bacteria. Overlapping domains with good but higher E-values are not counted.

To conclude, the table shows that Azotobacter is very similar to the Pseudomonas genomes for some of the domains and different for others. These differences could be a clue to the function of some genes of unknown function in the annotation of Azotobacter vinelandii.

3 Reannotation of Azotobacter genes with unknown function

As mentioned before, the lab group I joined is interested in the study of Azotobacter vinelandii. Besides the characteristics that Azotobacter shares with other Gamma Proteobacteria, it can be described as an aerobic, soil-dwelling organism with a wide variety of metabolic capabilities, including the ability to fix atmospheric nitrogen by converting it into ammonia. Knowing the function of all its genes could be an important clue to understanding this process. When I joined the group, a large part of the genes were annotated with unknown function (as shown in Figure 7), and my job was to reduce the number of those genes.

Categories shown in Figure 7: Undefined; Hypothetical; Transport and binding proteins; Energy metabolism; Cellular processes; Regulatory functions; Other categories; Central intermediary metabolism; Cell envelope; Translation; Biosynthesis of cofactors, prosthetic groups, and carriers; DNA metabolism; Transcription; Amino acid biosynthesis; Fatty acid and phospholipid metabolism.

Figure 7 - Chart showing the percentage of Azotobacter vinelandii genes assigned to each functional category

To reduce this number we first separated the genes of unknown function according to their product description. There were two categories: those with the words “unknown” or “hypothetical” in the description, and those without these words. Of the initial 1710 genes, 643 did not have these words in their product and could therefore be characterized directly from that product. For the other 1067 genes nothing could be done with the product, so we used a different technique: parsing the BLAST output. The original annotation of this bacterium was based on this BLAST output, considering only the most significant hit for each gene. Sometimes, however, the best hit had no known function either, so the Azotobacter gene was annotated with unknown function. My job was to find, in the BLAST output for these genes, a good and significant hit with a well-defined function, and to use that function to characterize the Azotobacter gene. In this way, of the 1067 genes with the words “unknown” or “hypothetical” in the description of their product, 357 had significant hits with a defined function in BLAST. This process is shown in Figure 8.


Figure 8 - Flowchart describing the process of characterizing some genes from Azotobacter vinelandii
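The two-step decision shown in the flowchart can be sketched in Python. This is an illustration only: the function names, the best-first ordering of the BLAST hits and the `max_evalue` cutoff are assumptions, since the text does not state what threshold defined a "significant" hit.

```python
UNINFORMATIVE_WORDS = ("unknown", "hypothetical")


def is_informative(description):
    """True if the description contains neither 'unknown' nor 'hypothetical'."""
    text = description.lower()
    return not any(word in text for word in UNINFORMATIVE_WORDS)


def annotate(product, blast_hits, max_evalue=1e-5):
    """Return a function description for a gene, or None if none is found.

    `blast_hits` is a list of (description, evalue) pairs, assumed sorted
    from best to worst hit. `max_evalue` is a hypothetical cutoff.
    """
    # Step 1: use the gene's own product description when it is informative.
    if is_informative(product):
        return product
    # Step 2: otherwise take the first significant hit with a defined function.
    for description, evalue in blast_hits:
        if evalue <= max_evalue and is_informative(description):
            return description
    return None


if __name__ == "__main__":
    hits = [
        ("conserved hypothetical protein", 1e-50),
        ("ABC transporter permease", 1e-30),
    ]
    print(annotate("hypothetical protein", hits))
```

With the counts reported in the text, step 1 resolves 643 genes and step 2 a further 357, which leaves 1710 - 643 - 357 = 710 genes without a known function.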

In conclusion, with this method only 710 of the initial 1710 genes remain with unknown function; 1000 genes (643 + 357) can now be annotated with a specific function.

4 Searching for non-coding RNAs

The goal of this work was to obtain the exact location of all non-coding RNAs in some Alpha and Gamma Proteobacteria. The locations were obtained with the INFERNAL ("INFERence of RNA ALignment") software. We were interested in Azotobacter vinelandii (as an example of the Gamma Proteobacteria) and in Agrobacterium radiobacter K84 and Agrobacterium vitis S4 (as examples of the Alpha Proteobacteria).

Non-coding RNAs

An RNA gene is a gene whose functional product is an RNA rather than a protein. It corresponds to a contiguous sub-sequence of the genome which is an un-spliced version of its functional transcript. A non-coding RNA (ncRNA) is any RNA molecule that does not code for a protein; many ncRNAs instead exercise control over the RNAs that do. They are thus also called RNA genes or even small RNAs (sRNAs). Because non-coding RNAs can control the transcription and translation of protein-coding RNAs, many authors hypothesize that they represent a newly discovered level of control over the workings of the genome. In addition, they may provide further clues to understanding the 98% of the human genome that does not direct the production of proteins. Figure 9 shows several kinds of non-coding RNAs.

Figure 9. The RNA family [3]


The number of members in the RNA family has grown rapidly over recent years. In addition to the coding messenger RNAs (mRNAs) and the transcriptional RNAs (ribosomal RNAs (rRNAs) and transfer RNAs (tRNAs)), another subfamily called “small RNAs” has been discovered. The small RNA subfamily, in which each member has a particular function, contains small interfering RNAs (siRNAs), microRNAs (miRNAs, including small temporal RNAs (stRNAs)), small nucleolar RNAs (snoRNAs) and small nuclear RNAs (snRNAs) [4]. The functional form of single-stranded RNA molecules, just like that of proteins, frequently requires a specific tertiary structure. The scaffold for this structure is provided by secondary structural elements, which are formed by hydrogen bonds within the molecule. This leads to several recognizable "domains" of secondary structure such as hairpin loops, bulges and internal loops, as shown in Figure 10 [5].

Figure 10 - Secondary structure for the Glycine RNA [6]

INFERNAL software

INFERNAL (http://infernal.janelia.org/) is a software package that allows us to make consensus RNA secondary structure profiles and to use them to search nucleic acid sequence databases for homologous RNAs, or to create new structure-based multiple sequence alignments. To make a profile we need a multiple sequence alignment of an RNA sequence family, and the alignment must be annotated with a consensus RNA secondary structure. The program cmbuild takes an annotated multiple alignment as input and outputs a profile. We can then use that profile to search a sequence database for homologs using the program cmsearch. This process is described in Figure 11.

Figure 11 - INFERNAL workflow. We first build the covariance model file for each RNA and then compare it to our genomes

Covariance Model

INFERNAL builds a model of consensus RNA secondary structure using a formalism called a covariance model (CM) [7], which is a type of profile stochastic context-free grammar (profile SCFG). To build this model we must have an input alignment file in Stockholm format for each RNA of interest. This file can be downloaded from the Rfam webpage, and it must have a consensus secondary structure annotation line (#=GC SS_cons). The command line syntax used to build the CM file is:

> cmbuild [cmfile] [alifile]

where [alifile] is the name of the input alignment file (from Rfam) and [cmfile] is the name of the output CM file. No options were used in this work. After making the CM files for all the RNAs we want to search for in the sequence, we are able to run the search. The command used to search is:

> cmsearch [cmfile] [seqfile]

where [cmfile] is the RNA file in CM format and [seqfile] is the file with the sequence of interest in FASTA format. RNA homology search with CMs is slow. To accelerate it, query-dependent banding (QDB) [8] is turned on by default; it can be turned off with the --noqdb option. QDB precalculates regions of the dynamic programming matrix that have negligible probability based on the query CM's transition probabilities, and these regions of the matrix are ignored to make searches faster. Another option for accelerating cmsearch is HMM filtering. The idea is to first search the database with an HMM, using HMM search algorithms that are much faster than CM search algorithms. High-scoring hits to the HMM are then searched again using the expensive CM methods. A CM Plan 9 HMM is built from the CM in cmfile and is used to filter the database in seqfile. Only hits to the HMM with E-values less than or equal to 500 are then searched with the CM. This E-value threshold of 500 can be changed using the --hmmE option; in this work we set it to 50. The command line used to search each RNA against the sequence was:

> cmsearch --noqdb --hmmfilter --hmmE 50 [cmfile] [seqfile]
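When many RNA families have to be processed, the two commands above can be generated from a list of family names. The Python sketch below only builds and prints the command lines, it does not run them; the file naming scheme (`<family>.sto`, `<family>.cm`) and the FASTA file name are hypothetical, and the options are the ones quoted in the text for the INFERNAL version used here, which may differ in other releases.

```python
def cmbuild_command(cmfile, alifile):
    """Command line that builds a CM file from a Stockholm alignment."""
    return ["cmbuild", cmfile, alifile]


def cmsearch_command(cmfile, seqfile, hmm_evalue=50):
    """Command line for the search used in the text (no QDB, HMM filter)."""
    return [
        "cmsearch", "--noqdb", "--hmmfilter",
        "--hmmE", str(hmm_evalue),
        cmfile, seqfile,
    ]


def plan(families, seqfile):
    """Build and search commands for every family, in execution order."""
    commands = []
    for family in families:
        # Hypothetical naming scheme: <family>.sto in, <family>.cm out.
        commands.append(cmbuild_command(f"{family}.cm", f"{family}.sto"))
        commands.append(cmsearch_command(f"{family}.cm", seqfile))
    return commands


if __name__ == "__main__":
    for command in plan(["5_8S_rRNA"], "s4_c1.fasta"):
        print(" ".join(command))
```

Keeping the commands as argument lists makes it straightforward to pass them later to a process runner such as `subprocess.run` without shell quoting problems.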

RNAs

Rfam (http://www.sanger.ac.uk/Software/Rfam/) is a large collection of multiple sequence alignments and covariance models covering many common non-coding RNA families. In conjunction with the INFERNAL software suite, Rfam can be used to annotate sequences (including complete genomes) for homologues to known non-coding RNAs [9]. We used the precalculated lists of putative RNAs provided by Rfam to find the correct list of RNAs for each bacterium. To choose the RNAs for the two Agrobacterium genomes we combined all the RNAs from all the Alpha Proteobacteria in the database. For Azotobacter we used the RNAs detected in Pseudomonas. Table 1 lists the RNAs used in the INFERNAL searches on the Alpha and Gamma Proteobacteria.

Table 1. RNAs used in the INFERNAL searches on the Alpha and Gamma Proteobacteria


Output

We now show an example of the output that we obtained from INFERNAL.

CM 1: 5_8S_rRNA.1
CM lambda and K undefined -- no statistics
Using CM score cutoff of 0.00
CP9 statistics calculated with simulation of 1000 samples of length 364
Random seed: 1200515608
No partition points
N = 7452750
Using CP9 E cutoff of 50.00
>s4_c1

The first line gives the name of the CM, and the lines that follow it give information on E-value statistics. The results section comes next, where the name of each target sequence in the target database is given on a line starting with '>' (we use only one: s4_c1). The hits to the top (Watson) strand are shown next (in this example there is one hit, from position 64497 to 64647, with a score of 23.30 bits). E-value statistics can give an estimate of the significance of bit scores, and larger bit scores are better. As a rough guide, scores greater than the log (base two) of the target database size are significant. Here, given a 3726375 nt target, scores over about 21.8 bits are significant. The alignment is shown in a BLAST-like format, augmented by secondary structure annotation. The top line shows the predicted secondary structure of the target sequence. The format is more informative than the simple least-common-denominator format we used in the input alignment file and is designed to make the secondary structure easier to view. Base pairs in simple stem loops are annotated with <> characters. Base pairs enclosing multifurcations (multiple stem loops) are annotated with (). In more complicated structures, [ ] and { } annotations are used to reflect deeper nestings of multifurcations. For single-stranded residues, '_' characters mark hairpin loops, '-' characters mark interior loops and bulges, ',' characters mark single-stranded residues in multifurcation loops, and ':' characters mark single-stranded residues external to any secondary structure. Insertions relative to this consensus are annotated by a '.' character [10].

Results

Table 2 compares the numbers of RNAs found with those of strain C58 that were already listed on the Rfam website. All the replicons from S4 and K84 are examples from the Alpha Proteobacteria, and Azotobacter is an example of a Gamma Proteobacterium.
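The rough significance guide quoted in the description of the cmsearch output (a score is worth a closer look when it exceeds the base-two logarithm of the target size) can be checked with a few lines of Python. The function names are illustrative, and this rule of thumb is not a substitute for proper E-value statistics.

```python
import math


def rough_bit_threshold(target_length_nt):
    """Base-two logarithm of the target size, in bits."""
    return math.log2(target_length_nt)


def is_rough_significant(score_bits, target_length_nt):
    """Apply the rule of thumb: score greater than log2(target size)."""
    return score_bits > rough_bit_threshold(target_length_nt)


if __name__ == "__main__":
    target = 3726375  # target length in nt, as given in the text
    print(round(rough_bit_threshold(target), 2))
    print(is_rough_significant(23.30, target))
```

For the 3726375 nt target the threshold is just under 21.83 bits, so the 23.30-bit hit in the example passes the rule of thumb by about 1.5 bits.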


Table 2. Results from INFERNAL for all the bacteria of interest. Note: cells painted yellow indicate that the specific RNA was not tested for that genome.

In conclusion, INFERNAL is a good tool for finding non-coding RNAs, and it allowed us to determine the number of RNAs in each replicon as well as their locations.

5 Searching for conserved domains in protein families

The goal of this work was to find motifs in the COG families that are very well conserved. The Clusters of Orthologous Groups of proteins (COGs) [11] (http://www.ncbi.nlm.nih.gov/COG/) were delineated by comparing protein sequences encoded in complete genomes, representing major phylogenetic lineages. Each COG consists of individual proteins or groups of paralogs from at least 3 lineages and thus corresponds to an ancient conserved domain. To select the COGs with the highest conservation we developed a Perl program that selects the pieces of each COG that are completely conserved in all the members of the family, and from them we made a list with a regular expression for each one. To confirm that the regular expressions were truly conserved domains, and not conserved only within the COG families, we used the families from other resources: TIGRFAMs (http://cmr.tigr.org/tigr-scripts/CMR/CmrHomePage.cgi), HAMAP families (http://www.expasy.ch/sprot/hamap/) and FIGFAMs. The regions found were confirmed as conserved by these resources and are shown in Figure 12. CoSMoS (Conserved Sequence Motif Search) (http://www.biology.lsa.umich.edu/cosmos/index.php) is a database of alignments of all Escherichia coli proteins with their homologues in 2780 different species found in the NCBI RefSeq database. The CoSMoS motif search looks for sequence motifs in all proteins encoded in the E. coli genome and displays information on their evolutionary conservation [12]. We therefore submitted our regular expressions there and obtained the percentage of conservation shown in Figure 12.

Figure 12 - Regular expressions searched for in the COG families and their corresponding members in HAMAP, TIGRFAM and FIGFAM. The second column gives the percentage of conservation from the CoSMoS website, and the last one gives the Pfam domain corresponding to that region.
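The step from fully conserved alignment positions to a regular expression can be illustrated as follows. This is a Python stand-in, not the Perl program used in this work, whose details are not given here; the way variable columns ('.') and gapped columns ('.?') are encoded is an assumption, and the three aligned sequences are invented.

```python
import re


def conserved_regex(aligned):
    """Build a regex spanning the first to the last fully conserved column.

    Fully conserved columns become literal residues, variable columns
    become '.', and columns where some sequences have a gap become '.?'.
    """
    if len({len(seq) for seq in aligned}) != 1:
        raise ValueError("aligned sequences must have equal length")
    tokens = []
    for column in zip(*aligned):
        residues = set(column)
        if residues == {"-"}:
            continue  # column of gaps only: nothing to match
        if len(residues) == 1:
            tokens.append(column[0])
        elif "-" in residues:
            tokens.append(".?")
        else:
            tokens.append(".")
    # Trim non-conserved columns from both ends of the pattern.
    while tokens and tokens[0].startswith("."):
        tokens.pop(0)
    while tokens and tokens[-1].startswith("."):
        tokens.pop()
    return "".join(tokens)


if __name__ == "__main__":
    # Invented alignment for illustration.
    aligned = ["MKGSGKT-LA", "MRGSGKTALA", "MKGTGKT-LA"]
    pattern = conserved_regex(aligned)
    print(pattern)
    for seq in aligned:
        print(seq, bool(re.search(pattern, seq.replace("-", ""))))
```

A stricter variant would report only uninterrupted runs of conserved columns above a minimum length, which yields shorter but more specific motifs.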

In conclusion, the regions identified as conserved in our analysis of the COG families are also conserved in the other resources available, so we can call the corresponding domains highly conserved domains.

6 Concluding remarks

This text described the methods used and the results obtained in four studies of different features of prokaryotic genomic sequences: searching for regulatory motifs, reannotating genes without known function, searching for non-coding RNA elements and searching for conserved motifs in protein families. Because its methodology is also promising, this last topic will be the subject of a manuscript that is being prepared for a conference. It is worth mentioning that all the results are being used, and will be used, in future phases of the genome projects conducted by Dr. João Setubal's lab at VBI.

Acknowledgments

We thank Professor João Setubal, Professor Nalvo Almeida and all the members of the lab group. We are also grateful to the Virginia Bioinformatics Institute, Fundação Luso-Americana para o Desenvolvimento and Universidade Católica Portuguesa.

References

[1] Wikipedia [http://en.wikipedia.org/wiki/Proteobacteria] (12 December 2007).
[2] Bateman A, Coin L, Durbin R, Finn RD, et al (2008) The Pfam Protein Families Database. Nucleic Acids Res. 32:D138-41.
[3] Buckingham S (2003) The Major World of microRNAs. Horizon Symposia.
[4] Stricklin SL, Griffiths-Jones S, Eddy SR (2005) C. elegans Noncoding RNA Genes. WormBook.
[5] Wikipedia [http://en.wikipedia.org/wiki/Rna] (2 February 2008).
[6] Wikipedia [http://en.wikipedia.org/wiki/Image:RF00504.jpg] (2 February 2008).
[7] Eddy SR, Durbin R (1994) RNA Sequence Analysis Using Covariance Models. Nucleic Acids Res. 22:2079-2088.
[8] Nawrocki EP, Eddy SR (2007) Query-Dependent Banding (QDB) for Faster RNA Similarity Searches. PLoS Comput Biol. 3:e56.
[9] Griffiths-Jones S, Moxon S, Marshall M, Khanna A, Eddy SR, Bateman A (2005) Rfam: annotating non-coding RNAs in complete genomes. Nucleic Acids Res. 33:D121-4.
[10] Griffiths-Jones S, et al (2003) Rfam: an RNA family database. Nucleic Acids Res. 31(1):439-41.
[11] Tatusov RL, Koonin EV, Lipman DJ (1997) A genomic perspective on protein families. Science. 278(5338):631-7.
[12] Liu XI, Korde N, Jakob U, Leichert LI (2006) CoSMoS: Conserved Sequence Motif Search in the proteome. BMC Bioinformatics. 7(1):37.
