RPKM_saturation.py manual with usage examples

Usage

RPKM_saturation.py [options] -r REFGENE_BED -i INPUT_BAM -o OUTPUT_PREFIX

Manual

The precision of any sample statitics (RPKM) is affected by sample size (sequencing depth); “resampling” or “jackknifing” is a method to estimate the precision of sample statistics by using subsets of available data. This module will resample a series of subsets from total RNA reads and then calculate RPKM value using each subset. By doing this we are able to check if the current sequencing depth was saturated or not (or if the RPKM values were stable or not) in terms of genes’ expression estimation. If sequencing depth was saturated, the estimated RPKM value will be stationary or reproducible. By default, this module will calculate 20 RPKM values (using 5%, 10%, ... , 95%,100% of total reads) for each transcripts.

Required arguments

-i INPUT_FILE, --input-file=INPUT_FILE: Alignment file in BAM or SAM format.
-o OUTPUT_PREFIX, --out-prefix=OUTPUT_PREFIX: Prefix of output files(s).
-r REFGENE_BED, --refgene=REFGENE_BED: Reference gene model in bed fomat. Precompiled gene models for human, mouse, fly, and zebrafish are available on the author's website.

Options

-d STRAND_RULE, --strand=STRAND_RULE: How read(s) were stranded during sequencing.
For example: --strand='1++,1--,2+-,2-+' means that this is a pair-end, strand-specific RNA-seq, and the strand rule is:
- read1 mapped to ‘+’ => parental gene on ‘+’;
- read1 mapped to ‘-‘ => parental gene on ‘-‘;
- read2 mapped to ‘+’ => parental gene on ‘-‘;
- read2 mapped to ‘-‘ => parental gene on ‘+’.
If you are not sure about the strand rule, run infer_experiment.py (default=none, Not a strand specific RNA-seq data)
-l PERCENTILE_LOW_BOUND, --percentile-floor=PERCENTILE_LOW_BOUND: Sampling starts from this percentile. A integer between 0 and 100. default=5
-u PERCENTILE_UP_BOUND, --percentile-ceiling=PERCENTILE_UP_BOUND: Sampling ends at this percentile. A integer between 0 and 100. default=100
-s PERCENTILE_STEP, --percentile-step=PERCENTILE_STEP: Sampling frequency. Smaller value means more sampling times. A integer between 0 and 100. default=5
-c RPKM_CUTOFF, --rpkm-cutoff=RPKM_CUTOFF: Transcripts with RPKM smaller than this number will be ignored in visualization plot. default=0.01
-q MAP_QUAL, --mapq=MAP_QUAL: Minimum mapping quality (phred scaled) for an alignment to be called “uniquely mapped”. default=30
--version: show program’s version number and exit
-h, --help: show this help message and exit

Output files

OUTPUT_PREFIX.eRPKM.xls: RPKM values for each transcript
OUTPUT_PREFIX.rawCount.xls: Raw count for each transcrip
OUTPUT_PREFIX.saturation.r: R script to generate plot
OUTPUT_PREFIX.saturation.pdf: In the output figure, Y axis is “Percent Relative Error” or “Percent Error” which is used to measures how the RPKM estimated from subset of reads (i.e. RPKMobs) deviates from real expression level (i.e. RPKMreal). However, in practice one cannot know the RPKMreal. As a proxy, we use the RPKM estimated from total reads to approximate RPKMreal.
All transcripts were sorted in ascending order according to expression level (RPKM). Then they are divided into 4 groups:
- Q1 (0-25%): Transcripts with expression level ranked below 25 percentile.
- Q2 (25-50%): Transcripts with expression level ranked between 25 percentile and 50 percentile.
- Q3 (50-75%): Transcripts with expression level ranked between 50 percentile and 75 percentile.
- Q4 (75-100%): Transcripts with expression level ranked above 75 percentile.

Example

Estimate sequencing saturation on a forward-designed RNA-seq library (Pairend_StrandSpecific_51mer_Human_hg19.bam).

RPKM_saturation.py -r hg19.refseq.bed12 -d '1++,1--,2+-,2-+' -i Pairend_StrandSpecific_51mer_Human_hg19.bam -o output

RPKM_saturation.py

Category