D36–D39 Nucleic Acids Research, 2007, Vol. 35, Database issue doi:10.1093/nar/gkl778

Published online 1 November 2006

InSatDb: a microsatellite database of fully sequenced insect genomes

Sunil Archak, Eshwar Meduri, P. Sravana Kumar and J. Nagaraju*

Laboratory of Molecular Genetics, Centre for DNA Fingerprinting and Diagnostics, ECIL Road, Nacharam, Hyderabad 500 076, India

Received August 11, 2006; Revised September 27, 2006; Accepted September 29, 2006

ABSTRACT

InSatDb presents an interactive interface to query information on the microsatellite characteristics per se of five fully sequenced insect genomes (fruitfly, honeybee, malarial mosquito, red-flour beetle and silkworm). InSatDb allows users to obtain microsatellites annotated with size (in base pairs and repeat units); genomic location (exon, intron, upstream or transposon); nature (perfect or imperfect); and sequence composition (repeat motif and GC%). One can access microsatellite cluster (compound repeats) information and a list of microsatellites with conserved flanking sequences (microsatellite family or paralogs). InSatDb is complete with information on the insects, web links for further details, the methodology and a tutorial. A separate ‘Analysis’ section illustrates the comparative genomic analysis that can be carried out using the output. InSatDb is available at www.cdfd.org.in/insatdb.

INTRODUCTION

Microsatellites are simple sequence repeats (SSRs) that exhibit complex patterns in their frequency of occurrence, genomic distribution, mutability, function and evolution. Apart from being the source of informative genetic markers, microsatellites per se have attracted a lot of attention with respect to their origin, distribution, expansion, mutation and disintegration (1–7). Questions are also asked about the functional role of microsatellites in particular and the biological significance of microsatellites in general (4,8–12). Genetic studies and whole genome sequence analysis have established non-random distribution, variability and high mutability as characteristics of microsatellites. Evidence is accruing that supports the role of microsatellites in gene regulation, transcription and protein function (13). The existence of qualitative and quantitative differences between microsatellites of different genomes, and their role in adaptive evolution, have also been theorized (2,8). However, such studies require information on type (mono to hexa), motif (GC%), abundance (motif preferences), frequency, distribution (linkage group-wise and chromosomal position), location (exon, intron, regulatory element and transposon), nature (perfect, imperfect and compound) and copy number (existence of paralogs) of microsatellites, not only on a whole genome basis but also as a comparative analysis of multiple genomes that are related by phylogeny (for instance, fully sequenced primate genomes or fungal genomes or insect genomes), to draw functional conclusions.

Insects exhibit the greatest genetic diversity on earth, which has long puzzled mankind. Biologists have relied on insects to unravel many fundamental tenets of biology. Whole genome sequences of insects have lived up to this reputation for diversity and have revealed immense variability in the size and organization of their genomes. Among others, there are five fully sequenced insect genomes: Drosophila melanogaster (as a model organism it provides maximum annotated data), Anopheles gambiae (another Dipteran, but economically highly important as a malarial vector), Tribolium castaneum (Coleoptera, a relatively early insect order), Apis mellifera (Hymenoptera, a relatively recent insect order) and Bombyx mori (economically important as the silk-producing member of Lepidoptera, members of which are crop pests; also significant as a model for insect development).

Researchers attempting to understand the biology and evolution of microsatellites are often faced with the following questions:

(i) Do microsatellites occur everywhere in the genome?
(ii) Does the length of microsatellites have any relationship with their frequency?
(iii) Does the flanking sequence composition influence the origin of microsatellites?
(iv) Does the microsatellite size affect the microsatellite disintegration rate?
(v) Does the GC content of the motif affect the length, repeat units or mutation rate of microsatellites?
(vi) Do genomes possess hotspots and islands of microsatellites?
(vii) Is there any favoured association of microsatellites in the compound repeats?
(viii) Do microsatellites occur as families of common flanking sequences in the genomes (paralogs)?

Unlike many other microsatellite databases, which cater only to the use of microsatellites as markers, InSatDb allows users to address the above-mentioned questions by accessing the qualitative and quantitative genome-level microsatellite profile of a single insect, or by carrying out comparative genomic analysis using all five genomes.

*To whom correspondence should be addressed. Tel: +91 40 27171427; Fax: +91 40 27155610; Email: [email protected]

The authors wish it to be known that, in their opinion, the second and third authors should be regarded as joint Second Authors.

© 2006 The Author(s). This is an Open Access article distributed under the terms of the Creative Commons Attribution Non-Commercial License (http://creativecommons.org/licenses/by-nc/2.0/uk/) which permits unrestricted non-commercial use, distribution, and reproduction in any medium, provided the original work is properly cited.

METHODS

Drosophila melanogaster, A. mellifera, A. gambiae and T. castaneum sequences were downloaded from GenBank (ftp://ftp.ncbi.nlm.nih.gov/genomes) and Bombyx mori sequences were downloaded from http://silkworm.genomics.org.cn. Repeats were extracted employing Tandem Repeats Finder version 4 (14). To ensure that the extracted repeat sequences were real microsatellites, those with fewer than five repeat units and shorter than 15 bp in length were excluded. Tandem Repeats Finder does not employ a minimal alignment score for detecting microsatellites; rather, it uses a probabilistic model of random repeat sequences specified by per cent identity and frequency of insertions and deletions. This includes calculation of the average per cent identity between the copies (pM) and the average percentage of insertions and deletions (pI). The algorithm has a pair of matching probability and indel probability values (pM = 0.80, pI = 0.10) as default to cover the most divergent copies at every locus. We used two sets of alignment parameters (match, mismatch, gap), (+2, −3, −5) and (+2, −5, −7), to score the matches. All the microsatellites with a minimum alignment score of 30 are reported in the database, which means that both perfect and imperfect microsatellites are listed.

Figure 1. InSatDb organization and implementation.

The genome sequences were also analysed using RepeatMasker (A.F.A. Smit, R. Hubley and P. Green, unpublished data; http://www.repeatmasker.org) to obtain indices marking the occurrence of simple repeats, tandem repeats, segmental duplications and interspersed repeats including SINEs, DNA transposons, retrotransposons, LINEs, etc. Further, sequences were analysed for the delineation of exons and introns using GENSCAN (15). Flanking sequences of microsatellites were aligned to catalogue paralogous microsatellites that share a common origin and are hence considered to belong to the same microsatellite family. Two or more contiguous microsatellites separated by <70 bp of intervening non-repeat sequence were categorized separately as compound repeats.
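The thresholds described in this section can be sketched in a few lines of Python. This is a minimal illustration rather than the pipeline actually used to build InSatDb: the `Repeat` record, its field names and the helper functions are hypothetical. The sketch assumes that the mismatch and gap weights act as penalties, and it takes the stricter reading of the exclusion rule, in which a repeat must have at least five units and span at least 15 bp to be kept.

```python
from dataclasses import dataclass

MIN_UNITS = 5       # repeats with fewer units are excluded
MIN_LENGTH_BP = 15  # repeats shorter than this are excluded
MIN_SCORE = 30      # minimum alignment score reported in the database
MAX_GAP_BP = 70     # intervening non-repeat sequence must be shorter than this


@dataclass
class Repeat:
    """Hypothetical record for one repeat reported by a repeat finder."""
    seq_id: str   # identifier of the genomic sequence
    start: int    # 1-based, inclusive
    end: int      # 1-based, inclusive
    motif: str
    units: float  # number of repeat units (copy number)
    score: int    # alignment score

    @property
    def length(self) -> int:
        return self.end - self.start + 1


def alignment_score(matches, mismatches, indels, weights=(2, -3, -5)):
    """Score of a given alignment under (match, mismatch, gap) weights.

    The mismatch and gap weights are assumed to be penalties (negative).
    """
    match_w, mismatch_w, gap_w = weights
    return matches * match_w + mismatches * mismatch_w + indels * gap_w


def is_microsatellite(r: Repeat) -> bool:
    """Apply the unit, length and score thresholds (stricter reading)."""
    return (r.units >= MIN_UNITS
            and r.length >= MIN_LENGTH_BP
            and r.score >= MIN_SCORE)


def group_compound(repeats):
    """Cluster neighbouring repeats on the same sequence separated by < 70 bp."""
    clusters, current = [], []
    for r in sorted(repeats, key=lambda r: (r.seq_id, r.start)):
        same_seq = bool(current) and r.seq_id == current[-1].seq_id
        if same_seq and r.start - current[-1].end - 1 < MAX_GAP_BP:
            current.append(r)
        else:
            if len(current) > 1:
                clusters.append(current)
            current = [r]
    if len(current) > 1:
        clusters.append(current)
    return clusters


# Made-up example records, for illustration only.
repeats = [
    Repeat("chr1", 100, 129, "CA", 15, 60),
    Repeat("chr1", 150, 170, "GAT", 7, 42),
    Repeat("chr1", 1000, 1008, "A", 9, 18),
    Repeat("chr2", 10, 40, "AT", 15.5, 62),
]
kept = [r for r in repeats if is_microsatellite(r)]
clusters = group_compound(kept)
```

With a match weight of +2, a perfect 15 bp tract scores exactly 30, so the score threshold and the 15 bp length threshold coincide for perfect repeats; imperfect repeats need additional matching bases to offset their penalties.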

DATABASE ORGANIZATION

InSatDb is developed as a multi-tier relational database (Figure 1). It stores microsatellites from each of the five insect genomes separately, together with complete annotations of these microsatellites. The database also provides basic information on each of the five insects and important links to obtain further knowledge, and contains a tutorial page and a glossary page.

Microsatellite data can be accessed in two formats. End users with adequate computational capabilities can batch download the full complement of microsatellites (insect-wise), microsatellite sequences, compound microsatellites and the full list of microsatellite loci existing as families. These data are made available as csv files, which are compatible with spreadsheet programmes such as MS Excel. Alternatively, details of the microsatellites with highly specific characteristics may be queried using a multi-option query sheet (Figure 2). The options include insect (one at a time); location (intron, exon, intron/exon boundary, upstream, intergenic, repeat elements; singly or in combination); repeat type (motif size, mono- to hexa-nucleotide) or actual repeat motif (by entering up to five repeat motifs); GC% (fixed value or range); repeat size in either base pairs or number of units (fixed value or range); and repeat kind (perfect or imperfect). Once the insect and location options are selected, the remaining fields default to ‘ALL’.

Figure 2. Screen shots of (A) InSatDb homepage, (B) multi-option query sheet, (C) output table and (D) analysis page.

The output is primarily a list of microsatellites annotated for all options of the query sheet, and the output table is generated as a hierarchical pre-sorted list. Each microsatellite is given a unique ID that also carries the genomic sequence ID and corresponding indices. If the number of microsatellites selected by the query exceeds 500, the output is split into sets of 500 microsatellites. In addition, a csv file containing the total output is made available for downloading. If the query options do not select any microsatellite, a message indicating zero output is displayed and a back button is provided to refine the options. The table is a ‘one-stop’ output and gives complete information on microsatellites. The SSR motif and 100 bp each of the left and right flanking sequences are given for each microsatellite entry, which allows users to carry out sequence analysis of the microsatellite vis-à-vis the locus. In addition, users can select individual microsatellites to convert them into locus-specific markers. This is facilitated by automatic uploading of the repeat and flanking sequences of the selected microsatellite into the Primer3 query form (16).
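The behaviour of the query sheet and the paging of its output can be mimicked offline on downloaded records. The sketch below is only an illustration under stated assumptions: the dictionary keys (`location`, `motif`, `units`, `perfect`), the function names and the example records are hypothetical and do not reflect the actual InSatDb schema or csv column names. An option left as `None` plays the role of ‘ALL’.

```python
PAGE_SIZE = 500  # the output is split into sets of 500 microsatellites


def gc_percent(motif: str) -> float:
    """GC% of a repeat motif; for example, two of the three bases of CAG are G or C."""
    motif = motif.upper()
    if not motif:
        raise ValueError("empty motif")
    return 100.0 * sum(base in "GC" for base in motif) / len(motif)


def matches(record, location=None, motif_size=None, gc_range=None,
            units_range=None, perfect=None):
    """True if the record satisfies every option that is not left at 'ALL' (None)."""
    if location is not None and record["location"] not in location:
        return False
    if motif_size is not None and len(record["motif"]) != motif_size:
        return False
    if gc_range is not None:
        low, high = gc_range
        if not low <= gc_percent(record["motif"]) <= high:
            return False
    if units_range is not None:
        low, high = units_range
        if not low <= record["units"] <= high:
            return False
    if perfect is not None and record["perfect"] != perfect:
        return False
    return True


def paginate(records, page_size=PAGE_SIZE):
    """Split a list of records into consecutive pages of at most page_size items."""
    return [records[i:i + page_size] for i in range(0, len(records), page_size)]


# Made-up records, for illustration only.
records = [
    {"id": "demo1", "location": "exon", "motif": "CAG", "units": 8, "perfect": True},
    {"id": "demo2", "location": "intron", "motif": "AT", "units": 12, "perfect": False},
    {"id": "demo3", "location": "exon", "motif": "AAT", "units": 6, "perfect": True},
]
hits = [r for r in records
        if matches(r, location={"exon"}, motif_size=3, gc_range=(50, 100))]
pages = paginate(list(range(1201)))
```

Here the query for exonic trinucleotide repeats with a motif GC% of 50 or more selects only the CAG record, and a result set of 1201 entries is returned as three pages.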

DATA ANALYSIS

Insect genomes vary greatly in SSR composition, diversity and distribution. Our analysis showed that the microsatellite content of the five fully sequenced insect genomes is independent of both genome size and GC content (Table 1). The database includes a dedicated section (Analysis) that describes the types of analysis that can be carried out using the data obtained from InSatDb. Some of the quick observations and inferences from a comparative genomic analysis are given in this section.

Table 1. Microsatellite content of insect genomes

Insect                   Chr (n)  Genome size (Mb)  GC%    Number of repeats  Microsatellite content (% genome)  Microsatellites per Mb genome
Bombyx mori              28       397.71            37.33  111006             0.72                               280
Drosophila melanogaster  4        118.36            42.45  63637              1.56                               538
Anopheles gambiae        3        287.79            40.51  150936             1.58                               525
Apis mellifera           16       228.45            32.28  236480             3.41                               1035
Tribolium castaneum      10       198.06            25.53  24246              0.41                               122

A preponderance of di- and tri-nucleotide repeats is observed in Drosophila and Anopheles, whereas tri- and tetranucleotide repeats are abundant in Bombyx and Tribolium. On the whole, shorter microsatellites are abundant in the five insect genomes; as the length of the microsatellite increases, their number decreases logarithmically, as typified by Bombyx and Drosophila microsatellites (>90% of the microsatellites are <50 bp); on the other hand, Anopheles and Tribolium have longer microsatellites at a relatively high frequency. Shorter microsatellites not only predominate in the five insect genomes, but also seem to possess a higher number of imperfect repeat units. On the other hand, microsatellites spanning >100 bp consisted of perfect, rather than imperfect, repeats. Imperfect repeat units originate because of substitutions and indels. Interruptions, if at all, occur mainly in the middle region of the repeat sequence, and the ends seem to be selected against decomposition.

On the whole, most of the microsatellites occur within a 20% GC bracket. There is no linear correlation between GC content and the average number of repeat units. The average length of the microsatellite across the GC range is 37 ± 9 bp, and between 0 and 5% GC content microsatellites tend to be longer than 60 bp. Compound microsatellites account for nearly 3.2% in the insect genomes analysed; owing to its high density of microsatellites, Apis has a higher number of compound loci (6.12%). Anopheles and Apis genomes have as many as 50 and 60% of the total microsatellites in coding regions, respectively. The Bombyx genome has only 10% of the microsatellites in regions spanning exons, introns and their boundaries. More than 70% of the microsatellites present in exons are trinucleotide repeats, except in Apis, where 50% tri- and 25% dinucleotide repeats are present in exonic regions. Microsatellites in insects are AT rich (on average 23.4% GC); however, they exist within regions that are not always AT rich.
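As a worked check on Table 1, the density column can be recomputed from the repeat counts and genome sizes, and the relationship of microsatellite content to genome size and GC content can be examined with a simple correlation coefficient. The sketch below is our illustration, not the authors' analysis: the repeat counts are entered as we read them from Table 1 (for example, 111006 for Bombyx mori), the function names are hypothetical, and with only five genomes a Pearson coefficient is at best a rough indication.

```python
from math import sqrt

# name: (genome size in Mb, GC%, number of repeats,
#        microsatellite content as % of genome, reported microsatellites per Mb)
TABLE_1 = {
    "Bombyx mori": (397.71, 37.33, 111006, 0.72, 280),
    "Drosophila melanogaster": (118.36, 42.45, 63637, 1.56, 538),
    "Anopheles gambiae": (287.79, 40.51, 150936, 1.58, 525),
    "Apis mellifera": (228.45, 32.28, 236480, 3.41, 1035),
    "Tribolium castaneum": (198.06, 25.53, 24246, 0.41, 122),
}


def density_per_mb(n_repeats: int, genome_mb: float) -> float:
    """Number of microsatellites per Mb of genome."""
    return n_repeats / genome_mb


def pearson_r(xs, ys) -> float:
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


densities = {name: density_per_mb(row[2], row[0]) for name, row in TABLE_1.items()}

sizes = [row[0] for row in TABLE_1.values()]
gc_values = [row[1] for row in TABLE_1.values()]
contents = [row[3] for row in TABLE_1.values()]

r_size = pearson_r(sizes, contents)   # content versus genome size
r_gc = pearson_r(gc_values, contents)  # content versus genomic GC%

for name, value in densities.items():
    print(f"{name}: {value:.1f} per Mb (reported {TABLE_1[name][4]})")
print(f"r(content, genome size) = {r_size:.2f}")
print(f"r(content, GC%) = {r_gc:.2f}")
```

The recomputed densities agree with the reported column to within one microsatellite per Mb, which suggests that the reported values were rounded from the same counts.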

DATABASE ACCESS AND FUTURE PERSPECTIVES

InSatDb is freely available through www.cdfd.org.in/insatdb. Incorporation of microsatellite data from additional insects and query facilities for better comparative genomic analysis, such as gene-based microsatellite extraction and conservation analysis, are planned. Additionally, based on users’ feedback, supplementary features will be added to make InSatDb a single-window system for insect genome analyses using microsatellite tools.

ACKNOWLEDGEMENTS

J.N. acknowledges the financial assistance from the Department of Biotechnology, Government of India. The Open Access publication charges for this article were waived by Oxford University Press.

Conflict of interest statement. None declared.

REFERENCES

1. Dieringer,D. and Schlotterer,C. (2003) Two distinct modes of microsatellite mutation processes: evidence from the complete genomic sequences of nine species. Genome Res., 13, 2242–2251.
2. Karaoglu,H., Lee,C.M. and Meyer,W. (2005) Survey of simple sequence repeats in completed fungal genomes. Mol. Biol. Evol., 22, 639–649.
3. Kruglyak,S., Durrett,R.T., Schug,M.D. and Aquadro,C.F. (1998) Equilibrium distributions of microsatellite repeat length resulting from a balance between slippage events and point mutations. Proc. Natl Acad. Sci. USA, 95, 10774–10778.
4. Li,Y.C., Korol,A.B., Fahima,T. and Nevo,E. (2004) Microsatellites within genes: structure, function, and evolution. Mol. Biol. Evol., 21, 991–1007.
5. Metzgar,D., Liu,L., Hansen,C., Dybvig,K. and Wills,C. (2002) Domain-level differences in microsatellite distribution and content result from different relative rates of insertion and deletion mutations. Genome Res., 12, 408–413.
6. Nadir,E., Margalit,H., Gallily,T. and Ben-Sasson,S.A. (1996) Microsatellite spreading in the human genome: evolutionary mechanisms and structural implications. Proc. Natl Acad. Sci. USA, 93, 6470–6475.
7. Wilder,J. and Hollocher,H. (2001) Mobile elements and the genesis of microsatellites in dipterans. Mol. Biol. Evol., 18, 384–392.
8. Kashi,Y. and King,D.G. (2006) Simple sequence repeats as advantageous mutators in evolution. Trends Genet., 22, 253–259.
9. Boeva,V., Regnier,M., Papatsenko,D. and Makeev,V. (2006) Short fuzzy tandem repeats in genomic sequences, identification, and possible role in regulation of gene expression. Bioinformatics, 22, 676–684.
10. Katti,M.V., Ranjekar,P.K. and Gupta,V.S. (2001) Differential distribution of simple sequence repeats in eukaryotic genome sequences. Mol. Biol. Evol., 18, 1161–1167.
11. Morgante,M., Hanafey,M. and Powell,W. (2002) Microsatellites are preferentially associated with nonrepetitive DNA in plant genomes. Nature Genet., 30, 194–200.
12. Toth,G., Gaspari,Z. and Jurka,J. (2000) Microsatellites in different eukaryotic genomes: survey and analysis. Genome Res., 10, 967–981.
13. Borstnik,B. and Pumpernik,D. (2002) Tandem repeats in protein coding regions of primate genes. Genome Res., 12, 909–915.
14. Benson,G. (1999) Tandem repeats finder: a program to analyze DNA sequences. Nucleic Acids Res., 27, 573–580.
15. Burge,C. and Karlin,S. (1997) Prediction of complete gene structures in human genomic DNA. J. Mol. Biol., 268, 78–94.
16. Rozen,S. and Skaletsky,H. (2000) Primer3 on the WWW for general users and for biologist programmers. Methods Mol. Biol., 132, 365–386.
