Predicting CDSs from RNAseq data, with TransDecoder
• STEP 1: De novo assembly of a transcriptome (40 million paired-end Illumina reads of 100 bp, not oriented) with Trinity (Acyrthosiphon svalbardicum, an aphid species). Only contigs > 200 bp are retained.
• STEP 2: Prediction of CDSs with TransDecoder, aided by searches for PfamA motifs and blastp hits against the proteins of a related genome (pea aphid, A. pisum). Run with option -m 50 (peptides of at least 50 residues).
=> Aims: evaluate the CDS prediction (how many CDSs per transcript, as a function of transcript sequence length) by writing a program that rapidly calculates a histogram showing how many transcripts have 0, 1, 2, ... or more predicted CDSs, for different transcript length bins, with a filter to consider only CDSs of a minimum sequence length in bp.
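The minimum-length filter mentioned in the aims can be illustrated with a minimal sketch. It assumes plain FASTA input and treats "minimum" as "greater than or equal to"; the function names `fasta_lengths` and `filter_min_length` and the toy records are hypothetical, not taken from the original program.

```python
import io

def fasta_lengths(handle):
    """Return {record_id: sequence_length} for an open FASTA handle."""
    lengths = {}
    current = None
    for line in handle:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            # Record ID = first whitespace-delimited token of the header line.
            parts = line[1:].split()
            current = parts[0] if parts else ""
            lengths[current] = 0
        elif current is not None:
            # Sequence may be wrapped over several lines; sum them up.
            lengths[current] += len(line)
    return lengths

def filter_min_length(lengths, min_len):
    """Keep records of at least min_len bp (assumption: inclusive bound)."""
    return {rid: n for rid, n in lengths.items() if n >= min_len}

# Toy records (hypothetical IDs and sequences, not real data).
demo = io.StringIO(">cds_a some description\nATGAAA\nTTTTAA\n>cds_b\nATGTAA\n")
demo_lengths = fasta_lengths(demo)
kept = filter_min_length(demo_lengths, 10)
print(demo_lengths)
print(kept)
```

Linking each CDS record back to its parent transcript depends on the header format written by the TransDecoder version in use, so that step is deliberately left out of this sketch.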

The program in short
STEPS:
1. Calculates the dimensions of the histogram by determining the maximum transcript sequence length and the maximum number of CDSs for any transcript
2. Initializes the histogram
3. Fills the histogram (counts how many transcripts fall in each category)
4. Displays the results in a two-dimensional table
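The four steps can be sketched roughly as follows. This is an illustration under stated assumptions, not the original program: it assumes transcript lengths and per-transcript CDS counts are already available as dictionaries, and the names `build_histogram` and `format_table` are hypothetical.

```python
def build_histogram(transcript_lengths, cds_per_transcript, bin_size=50):
    """Count transcripts per (number of CDSs, length bin).

    transcript_lengths: {transcript_id: length in bp}, assumed non-empty.
    cds_per_transcript: {transcript_id: number of retained CDSs};
                        transcripts missing from it count as 0 CDSs.
    """
    # Step 1: dimensions of the histogram.
    max_len = max(transcript_lengths.values())
    max_cds = max(cds_per_transcript.get(t, 0) for t in transcript_lengths)
    n_bins = max_len // bin_size + 1
    # Step 2: initialize one row per CDS count, one column per length bin.
    hist = [[0] * n_bins for _ in range(max_cds + 1)]
    # Step 3: fill.
    for tid, length in transcript_lengths.items():
        hist[cds_per_transcript.get(tid, 0)][length // bin_size] += 1
    return hist

def format_table(hist, bin_size=50):
    """Step 4: render the histogram as a tab-separated two-dimensional table."""
    n_bins = len(hist[0])
    lows = [b * bin_size for b in range(n_bins)]
    lines = [
        "low\t" + "\t".join(str(low) for low in lows),
        "high\t" + "\t".join(str(low + bin_size - 1) for low in lows),
    ]
    for n_cds, row in enumerate(hist):
        lines.append(f"{n_cds}\t" + "\t".join(str(c) for c in row))
    return "\n".join(lines)

# Toy input (hypothetical IDs and values).
lengths = {"t1": 210, "t2": 230, "t3": 120, "t4": 260}
cds_counts = {"t1": 1, "t4": 2}
hist = build_histogram(lengths, cds_counts, bin_size=100)
table = format_table(hist, bin_size=100)
print(table)
```

A list of lists keeps the sketch dependency-free; for very large assemblies an array library could be substituted without changing the logic.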

OPTIONS USED:
-t    transcript sequence file (FASTA file used for the TransDecoder prediction)
-c    CDS sequence file (FASTA file produced by TransDecoder.Predict)
-min  minimum size of CDS (e.g. can be set to 150 bp, 300 bp...; default = 0)
-bin  bin size (of the transcript sequence length; default = 50 bp)
-o    output file name
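For illustration, these options could be declared with Python's standard `argparse` module as below. The original program's language and option handling are not given in the slides, so this is only a sketch: the `dest` names, the choice to make `-t`, `-c` and `-o` mandatory, and the example file names are assumptions.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Histogram of CDSs per transcript by transcript length bin"
    )
    parser.add_argument("-t", dest="transcripts", required=True,
                        help="transcript sequence file (FASTA)")
    parser.add_argument("-c", dest="cds", required=True,
                        help="CDS sequence file (FASTA)")
    # Defaults follow the slide: minimum CDS size 0, bin size 50 bp.
    parser.add_argument("-min", dest="min_cds", type=int, default=0,
                        help="minimum CDS size in bp")
    parser.add_argument("-bin", dest="bin_size", type=int, default=50,
                        help="transcript length bin size in bp")
    parser.add_argument("-o", dest="output", required=True,
                        help="output file name")
    return parser

# Example invocation with hypothetical file names.
args = build_parser().parse_args([
    "-t", "transcripts.fasta",
    "-c", "transcripts.fasta.transdecoder.cds",
    "-min", "450",
    "-o", "histogram.txt",
])
print(args.min_cds, args.bin_size)
```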

Output text file:
transcript file= svalbard_assembly_CPU8hypermem.Trinity.fasta
cds file= svalbard_assembly_CPU8hypermem.Trinity.fasta.transdecoder.cds
minimum size to consider cds= 450
bin size= 50
maximum transcript size= 11585
maximum number of cds for any transcript= 5

Rows: number of CDSs per transcript. Columns: transcript sequence size range (low to high, in bp).

low      0    50   100   150   200   250   300   350   400   450   500   550
high    49    99   149   199   249   299   349   399   449   499   549   599
0        0     0     0     0  8781  5512  3570  2620  1966  1140   817   675
1        0     0     0     0     0     0     0     0     0   342   261   276
2        0     0     0     0     0     0     0     0     0    18    23    20
3        0     0     0     0     0     0     0     0     0     2     0     1
4        0     0     0     0     0     0     0     0     0     0     0     0
5        0     0     0     0     0     0     0     0     0     0     0     0

E.g., the number of transcripts between 550 and 599 bp long with exactly 1 predicted CDS (> 450 bp) is 276.
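The low and high rows of the output suggest zero-based bins of fixed width. A small sketch of that arithmetic, assuming bins start at 0 and the bin index is obtained by integer division (the helper name `bin_bounds` is hypothetical):

```python
def bin_bounds(length, bin_size=50):
    """Return the (low, high) bounds of the bin containing a transcript length."""
    low = (length // bin_size) * bin_size
    return low, low + bin_size - 1

# A 576 bp transcript falls in the 550 to 599 bp column.
example = bin_bounds(576)

# With the reported maximum transcript size (11585 bp) and a bin size of 50,
# the full table would need this many length bins under the same assumption.
n_bins = 11585 // 50 + 1
last_bin = bin_bounds(11585)
print(example, n_bins, last_bin)
```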

[Figure: Number of CDSs per transcript, for transcripts of different sequence length (cumulated percentages), with -min 150 (counting only CDSs > 150 bp). Axes: transcript sequence size range; #CDSs per transcript.]

[Figure: Number of CDSs per transcript, for transcripts of different sequence length, with -min 300. Axes: transcript sequence size range; #CDSs per transcript.]

[Figure: Number of CDSs per transcript, for transcripts of different sequence length, with -min 450. Axes: transcript sequence size range; #CDSs per transcript.]
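As a rough illustration of how percentages for one length bin might be derived from the count table, the sketch below uses the 550 to 599 bp column of the -min 450 output (675, 276, 20, 1, 0, 0 transcripts with 0 to 5 CDSs). It assumes "cumulated" means a running total over increasing CDS counts; the slides do not state the exact definition.

```python
from itertools import accumulate

# Counts of transcripts with 0, 1, 2, 3, 4, 5 CDSs in the 550 to 599 bp bin
# (taken from the -min 450 output table).
counts = [675, 276, 20, 1, 0, 0]

total = sum(counts)
percentages = [100.0 * c / total for c in counts]
# Running totals (assumption: this is what "cumulated percentages" refers to).
cumulative = list(accumulate(percentages))

for n_cds, (pct, cum) in enumerate(zip(percentages, cumulative)):
    print(f"{n_cds} CDS: {pct:.1f}% (cumulated {cum:.1f}%)")
```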
