Friday, December 20, 2013

Turning R string in to a variable

http://cran.r-project.org/doc/FAQ/R-FAQ.html#How-can-I-turn-a-string-into-a-variable_003f

"get" and "assign" allows "paste" to generate variables. 

Infinitely Abusable !

Saturday, November 30, 2013

Setting up High throughput RNA-Seq analysis pipeline

This email provides general discussion of the workflow and tools for analysis of RNA-Seq data. This are not the only tools but things that work.

A general requirement is install of recent version of Java, R and Bioconductor packages.

1) Fastq file(s).

FASTQ format stores sequences and Phred quality scores in a single file.
http://en.wikipedia.org/wiki/FASTQ_format

(As an example) Assuming a Illumina Hi-Seq run of 8 lanes, RNA-Seq data (reads) for up to 32 samples can be generated in 1 run. This results in upto 4 samples in a single lane. After de-multiplexing (if any) based on sequence barcode, one normally receives one or more FASTQ file for a given sample (single experiment).

If the paired-end (PE) sequencing was run the file will have R1 and R2 in the name.

One sample can have more than 1 fastq file and it can be merged with simple PERL script to concatnet them. R1 and R2 files should be merged separately.

2) Alignment to the reference genome assembly. (SAM and BAM files)

The first step in the analysis is to align the sequence reads in the fastq files to the genome.
There are three repositories to download the genome assemblies the UCSC, NCBI and ENSEMBL.

UCSC : http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/ chromFaMasked.tar.gz provides masked chromosome fasta files. Similarly, ENSEMBL fasta files can be downloaded from the ENSEMBL website (www.ensembl.org)

The UCSC assmebly is called "hg19" and ENSEMBL assembly is called "GRCh37". There is NO difference between them in terms of nucleotide sequences. (annotations differ)

While many tools available for doing the alignment Bowtie (http://bowtie-bio.sourceforge.net/index.shtml), Tophat (http://tophat.cbcb.umd.edu/) and BWA (http://bio-bwa.sourceforge.net/) are most frequently used. You may not need to download genome sequences from UCSC or ENSEMBL website. The websites of the aligners provides assemblies and index files
specifically designed to run with the aligners.

The output of the alignment is SAM files and its binary version is called BAM. see http://samtools.sourceforge.net/

3) QC for the sequencing reads and RNA-Seq specific QC

FASTQC allows for quality control of raw sequencing reads. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

RNA-SeQC allows for quality control specific of RNA-seq experiments, providing quality metric for transcriptome analysis.
http://www.broadinstitute.org/cancer/cga/rna-seqc

There are other packages available but these two are currently the best.

RNA-SeQC allows marking and removal of duplicate reads. The BAM files from which the duplicate reads are removed is the input of the further analysis.


4) Transcript assembly and Counting of reads for transcriptome.

This is a RNA-Seq analysis specific step. This step is a basic requirement for down stream analysis like differential expression analysis. requires genomic annotation information that is available
from GFF format files. The two places for genomic annotation are UCSC and ENSEMBL. I prefer ENSEMBL annotations but it is nice to use the ENCODE annotations currently used by TCGA.

The softwares are Cufflinks (http://cufflinks.cbcb.umd.edu/), HTSeq
(http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html) and
easyRNASeq (http://www.bioconductor.org/packages/release/bioc/html/easyRNASeq.html)

Various quantitation metrices like RPKM, FPFM (for paired end data), RSEM and raw counts are proposed for RNA-seq data. Current opinion is that RSEM (http://deweylab.biostat.wisc.edu/rsem/) is useful for within sample comparison and raw counts are suitable for differential expression analysis.

5) Normalization, Filtering and Differential Expression

The most commonly used packages are DESeq , DESeq2 and edgeR. There are others but I used mostly DESeq and edgeR for they allow incorporation of experiment design for fitting linear models 
and analysis of variance (ANOVA).

Filtering is important to increase power of the differential expression analysis. See discussion on Independent filtering by the "genefilter" and HTSFilter Bioconductor packages.

6) Visualization

The CummerBund package (http://compbio.mit.edu/cummeRbund/) associated with cufflinks allows for visualization of transcript data. RSEM (http://deweylab.biostat.wisc.edu/rsem/) is useful for within sample comparison and visualization.

7) Functional enrichment and pathway analysis.

No clear winner here. However I use GSVA, GSEA, DAVID and other packages.

Thursday, August 08, 2013

RPKM with easyRNASeq

Seems there is a bug in the RPKM function in easyRNASeq. Trying to generate RPKM values from counts results in the matrix with "NA".

The solution from seqanswers works :
http://seqanswers.com/forums/showthread.php?t=29909

It works for the other version of the package too.

Saturday, June 08, 2013

Nice worked out example on analyzing array data using bioconductor.

http://bioinformatics.knowledgeblog.org/2011/06/20/analysing-microarray-data-in-bioconductor/

Monday, April 15, 2013

R/Bioconductor packages for VennDiagram

gplots::venn
venneuler
limma::vennDiagram
vennDiagram

the paper by Hanbo Chen and Paul Boutros give good comparison of features of multiple tools. 
http://www.biomedcentral.com/1471-2105/12/35

Wednesday, January 02, 2013

Experimental Design for ANOVA

A good introduction is here

graphpad website for ANOVA introduction

The term repeated-measures refers to an experiment that collects multiple measurements from each subject. The analysis of repeated measures data is identical to the analysis of randomized block experiments that use paired or matched subjects. Prism can calculate repeated-measures two-way ANOVA when either one of the factors are repeated or matched (mixed model) or when both factors are.
One data table can correspond to four experimental designs
Prism uses a unique way to enter data. You use rows and columns to designate the different groups (levels) of each factor. Each data set (column) represents a different level of one factor, and each row represents a different level of the other factor. You need to decide which factor is defined by rows, and which by columns. Your choice will not affect the ANOVA results, but the choice is important as it affects the appearance of graphs.
The table above shows example data testing the effects of three doses of drugs in control and treated animals.
These data could have come from four distinct experimental designs.
Not repeated measures
The experiment was done with six animals. Each animal was given one of two treatments at one of three doses. The measurement was then made in duplicate. The value at row 1, column A, Y1 (23) came from the same animal as the value at row 1, column A, Y2 (24). Since the matching is within a treatment group, it is a replicate, not a repeated measure. Analyze these data with ordinary two-way ANOVA, not repeated-measures ANOVA.
Matched values are spread across a rows
The experiment was done with six animals, two for each dose. The control values were measured first in all six animals. Then you applied a treatment to all the animals and made the measurement again. In the table above, the value at row 1, column A, Y1 (23) came from the same animal as the value at row 1, column B, Y1 (28). The matching is by row.
Matched values are stacked into a subcolumn
The experiment was done with four animals. First each animal was exposed to a treatment (or placebo). After measuring the baseline data (dose=zero), you inject the first dose and make the measurement again. Then inject the second dose and measure again. The values in the first Y1 column (23, 34, and 43) were repeated measurements from the same animal. The other three subcolumns came from three other animals. The matching was by column.
Repeated measures in both factors
The experiment was done with two animals. First you measured the baseline (control, zero dose). Then you injected dose 1 and made the next measurement, then dose 2 and measured again. Then you gave the animal the experimental treatment, waited an appropriate period of time, and made the three measurements again. Finally, you repeated the experiment with another animal (Y2). So a single animal provided data from both Y1 subcolumns (23, 34, 43 and 28, 41, 56).
When do you specify which design applies to this experiment?
The example above shows that one grouped data set can represent four different experimental designs. You do not distinguish these designs when creating the data table. The data table doesn't "know" wether or not the data are repeated measures. You should take into account experimental design when choosing how to graph the data. And you must take it into account when performing two-way ANOVA. On the first tab of the two-way ANOVA dialog, you'll designate the experimental design.
Lingo: "Repeated measures" vs. "randomized block" experiments
The term repeated measures is appropriate when you made repeated measurements from each subject.
Some experiments involve matching but not repeated measurements. The term randomized-block describes these kinds of experiments. For example, imagine that the three rows were three different cell lines. All the Y1 data came from one experiment, and all the Y2 data came from another experiment performed a month later. The value at row 1, column A, Y1 (23) and the value at row 1, column B, Y1 (28) came from the same experiment (same cell passage, same reagents). The matching is by row.
Randomized block data are analyzed identically to repeated-measures data. Prism always uses the term repeated measures, so you should choose repeated measures analyses when your experiment follows a randomized block design.