Predicting CDSs from RNAseq data with TransDecoder
• STEP 1: De novo assembly of a transcriptome of Acyrthosiphon svalbardicum, an aphid species, with Trinity (40 million paired-end Illumina reads of 100 bp, not oriented). Only contigs > 200 bp are retained.
• STEP 2: Prediction of CDSs with TransDecoder, aided by searches for PfamA motifs and blastp hits to the proteins of a related genome (pea aphid, A. pisum). Run with option -m 50 (peptides of at least 50 residues).
=> Aims: evaluating the CDS prediction (how many CDSs per transcript, as a function of transcript sequence length). Writing a program that rapidly computes a histogram showing how many transcripts have 0, 1, 2, … or more predicted CDSs, for different transcript length bins, with a filter to consider only CDSs of a minimum sequence length in bp.
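The counting and filtering described in the aims can be sketched as follows. This is a minimal illustration, not the program itself: the function names are hypothetical, the inclusive length threshold is an assumption, and so is the CDS naming scheme (a transcript ID followed by ".p<N>"), which depends on the TransDecoder version and should be adapted to the headers actually present in the CDS file.

```python
import re
from collections import Counter

def fasta_lengths(lines):
    """Return {sequence_id: length_in_bp} from an iterable of FASTA lines
    (an open file handle works as well as a list of strings)."""
    lengths = {}
    current = None
    for line in lines:
        line = line.strip()
        if line.startswith(">"):
            current = line[1:].split()[0]  # ID = first word of the header
            lengths[current] = 0
        elif current is not None:
            lengths[current] += len(line)
    return lengths

def transcript_of(cds_id):
    # ASSUMPTION: CDS IDs look like "<transcript_id>.p<N>".
    # Replace this function if the CDS file uses another convention.
    return re.sub(r"\.p\d+$", "", cds_id)

def count_cds_per_transcript(cds_lengths, min_cds_len=0, to_transcript=transcript_of):
    """Count, for each transcript, the CDSs that pass the length filter.
    ASSUMPTION: the threshold is inclusive (length >= min_cds_len)."""
    counts = Counter()
    for cds_id, length in cds_lengths.items():
        if length >= min_cds_len:
            counts[to_transcript(cds_id)] += 1
    return counts
```

Transcripts without any retained CDS are simply absent from the returned counter, so they count as 0 when the histogram is filled.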
The program in short
STEPS:
1. Calculates the dimensions of the histogram by determining the maximum transcript sequence length and the maximum number of CDSs for any transcript
2. Initializes the histogram
3. Fills the histogram (counts how many transcripts fall in each category)
4. Displays the results in a two-dimensional table
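The four steps above can be sketched as below. This is an illustrative outline under stated assumptions, not the original program: the names build_histogram and format_table are hypothetical, the inputs are assumed to be two dictionaries (transcript ID to length, transcript ID to number of retained CDSs), and the tab-separated layout only imitates the output shown on the slide.

```python
def build_histogram(transcript_lengths, cds_counts, bin_size=50):
    """Rows = number of CDSs per transcript, columns = transcript length bins."""
    # Step 1: dimensions, from the longest transcript and the largest CDS count
    max_len = max(transcript_lengths.values())
    max_cds = max(cds_counts.get(t, 0) for t in transcript_lengths)
    n_bins = max_len // bin_size + 1
    # Step 2: initialize every cell to zero
    hist = [[0] * n_bins for _ in range(max_cds + 1)]
    # Step 3: fill, one increment per transcript
    for tid, length in transcript_lengths.items():
        hist[cds_counts.get(tid, 0)][length // bin_size] += 1
    return hist

def format_table(hist, bin_size=50):
    """Step 4: two-dimensional table, one line per CDS count."""
    header = ["low"] + [str(b * bin_size) for b in range(len(hist[0]))]
    lines = ["\t".join(header)]
    for n_cds, row in enumerate(hist):
        lines.append("\t".join([str(n_cds)] + [str(c) for c in row]))
    return "\n".join(lines)

# Toy data (invented for the example, not taken from the assembly)
lengths = {"a": 120, "b": 575, "c": 560, "d": 230}
counts = {"b": 1, "c": 1, "d": 2}
hist = build_histogram(lengths, counts, bin_size=50)
table = format_table(hist, bin_size=50)
```

A single pass over the transcripts is enough once the dimensions are known, which is what keeps the calculation fast.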
OPTIONS USED:
-t transcript sequence file (FASTA file used for the TransDecoder prediction)
-c CDS sequence file (FASTA file produced by TransDecoder.Predict)
-min minimum size of CDS to consider (e.g. 150 bp, 300 bp…; default = 0)
-bin bin size for the transcript sequence length (default = 50 bp)
-o output file name
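As an illustration of how such options could be declared, here is a short sketch with Python's argparse. The option names and defaults follow the list above; the destination names (transcripts, cds, min_cds, bin_size, output) and the choice of which options are mandatory are assumptions made for the example.

```python
import argparse

def build_parser():
    parser = argparse.ArgumentParser(
        description="Histogram of the number of CDSs per transcript, by transcript length bin"
    )
    parser.add_argument("-t", dest="transcripts", required=True,
                        help="transcript sequence file (FASTA)")
    parser.add_argument("-c", dest="cds", required=True,
                        help="CDS sequence file (FASTA)")
    parser.add_argument("-min", dest="min_cds", type=int, default=0,
                        help="minimum CDS size in bp (default: 0)")
    parser.add_argument("-bin", dest="bin_size", type=int, default=50,
                        help="bin size in bp for transcript length (default: 50)")
    # ASSUMPTION: the output file is treated as optional here
    parser.add_argument("-o", dest="output", default=None,
                        help="output file name")
    return parser

# Example: parse an explicit argument list instead of sys.argv
args = build_parser().parse_args(
    ["-t", "transcripts.fasta", "-c", "transcripts.cds", "-min", "450", "-o", "out.txt"]
)
```

The file names in the example call are placeholders.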
Output text file:
transcript file= svalbard_assembly_CPU8hypermem.Trinity.fasta
cds file= svalbard_assembly_CPU8hypermem.Trinity.fasta.transdecoder.cds
minimum size to consider cds= 450
bin size= 50
maximum transcript size= 11585
maximum number of cds for any transcript= 5
low 0 50 100 150 200 250 300 350 400 450 500 550
E.g., the number of transcripts of size between 550 and 599 bp with exactly 1 predicted CDS (> 450 bp) is 276.
5 0 0 0 0 0 0 0 0 0 0 0 0
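To make the reading of the columns explicit: with a bin size of 50, the column labelled 550 appears to cover transcripts of 550 to 599 bp, as in the example above. The small helper below (a hypothetical name, given only as a sketch) computes the range a given transcript length falls into, assuming bins are labelled by their lower bound.

```python
def bin_range(length, bin_size=50):
    """Return (low, high) bounds in bp of the bin containing `length`.
    ASSUMPTION: bins start at 0 and are labelled by their lower bound."""
    low = (length // bin_size) * bin_size
    return low, low + bin_size - 1

# A 575 bp transcript is counted in the column labelled 550
low, high = bin_range(575, bin_size=50)
```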
[Figure: Number of CDSs per transcript for different transcript sequence lengths (cumulated percentages), with -min 150 (counting only CDSs > 150 bp). Axes: transcript sequence size range; number of CDSs per transcript.]
[Figure: Number of CDSs per transcript for different transcript sequence lengths, with -min 300. Axes: transcript sequence size range; number of CDSs per transcript.]
[Figure: Number of CDSs per transcript for different transcript sequence lengths, with -min 450. Axes: transcript sequence size range; number of CDSs per transcript.]