Saturday, November 30, 2013

Setting up High throughput RNA-Seq analysis pipeline

This email provides general discussion of the workflow and tools for analysis of RNA-Seq data. This are not the only tools but things that work.

A general requirement is install of recent version of Java, R and Bioconductor packages.

1) Fastq file(s).

FASTQ format stores sequences and Phred quality scores in a single file.
http://en.wikipedia.org/wiki/FASTQ_format

(As an example) Assuming a Illumina Hi-Seq run of 8 lanes, RNA-Seq data (reads) for up to 32 samples can be generated in 1 run. This results in upto 4 samples in a single lane. After de-multiplexing (if any) based on sequence barcode, one normally receives one or more FASTQ file for a given sample (single experiment).

If the paired-end (PE) sequencing was run the file will have R1 and R2 in the name.

One sample can have more than 1 fastq file and it can be merged with simple PERL script to concatnet them. R1 and R2 files should be merged separately.

2) Alignment to the reference genome assembly. (SAM and BAM files)

The first step in the analysis is to align the sequence reads in the fastq files to the genome.
There are three repositories to download the genome assemblies the UCSC, NCBI and ENSEMBL.

UCSC : http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/ chromFaMasked.tar.gz provides masked chromosome fasta files. Similarly, ENSEMBL fasta files can be downloaded from the ENSEMBL website (www.ensembl.org)

The UCSC assmebly is called "hg19" and ENSEMBL assembly is called "GRCh37". There is NO difference between them in terms of nucleotide sequences. (annotations differ)

While many tools available for doing the alignment Bowtie (http://bowtie-bio.sourceforge.net/index.shtml), Tophat (http://tophat.cbcb.umd.edu/) and BWA (http://bio-bwa.sourceforge.net/) are most frequently used. You may not need to download genome sequences from UCSC or ENSEMBL website. The websites of the aligners provides assemblies and index files
specifically designed to run with the aligners.

The output of the alignment is SAM files and its binary version is called BAM. see http://samtools.sourceforge.net/

3) QC for the sequencing reads and RNA-Seq specific QC

FASTQC allows for quality control of raw sequencing reads. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

RNA-SeQC allows for quality control specific of RNA-seq experiments, providing quality metric for transcriptome analysis.
http://www.broadinstitute.org/cancer/cga/rna-seqc

There are other packages available but these two are currently the best.

RNA-SeQC allows marking and removal of duplicate reads. The BAM files from which the duplicate reads are removed is the input of the further analysis.


4) Transcript assembly and Counting of reads for transcriptome.

This is a RNA-Seq analysis specific step. This step is a basic requirement for down stream analysis like differential expression analysis. requires genomic annotation information that is available
from GFF format files. The two places for genomic annotation are UCSC and ENSEMBL. I prefer ENSEMBL annotations but it is nice to use the ENCODE annotations currently used by TCGA.

The softwares are Cufflinks (http://cufflinks.cbcb.umd.edu/), HTSeq
(http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html) and
easyRNASeq (http://www.bioconductor.org/packages/release/bioc/html/easyRNASeq.html)

Various quantitation metrices like RPKM, FPFM (for paired end data), RSEM and raw counts are proposed for RNA-seq data. Current opinion is that RSEM (http://deweylab.biostat.wisc.edu/rsem/) is useful for within sample comparison and raw counts are suitable for differential expression analysis.

5) Normalization, Filtering and Differential Expression

The most commonly used packages are DESeq , DESeq2 and edgeR. There are others but I used mostly DESeq and edgeR for they allow incorporation of experiment design for fitting linear models 
and analysis of variance (ANOVA).

Filtering is important to increase power of the differential expression analysis. See discussion on Independent filtering by the "genefilter" and HTSFilter Bioconductor packages.

6) Visualization

The CummerBund package (http://compbio.mit.edu/cummeRbund/) associated with cufflinks allows for visualization of transcript data. RSEM (http://deweylab.biostat.wisc.edu/rsem/) is useful for within sample comparison and visualization.

7) Functional enrichment and pathway analysis.

No clear winner here. However I use GSVA, GSEA, DAVID and other packages.