Wednesday, February 15, 2017

gene expression profile outlier analysis


1) Packages that makes parametric assumption - mostly for microarray gene expression data.

MOST, ZOPET, OPPAR, mCOPA, PPST, OS, ORT.

2) non- parametric

coGPS. From the first glance look very useful.

#1 - non parametric
#2 - allows integration of multiple data types like expression, methylation and copy number.
#3 - permutation based p-value (based on D. Ghosh's work)
#4 -  Both gene and patience wise outlier detection.

Wednesday, November 11, 2015

Installing MACS2 without hiccups

Installing MACS2 by Tao Liu

I was interested in identifying differential peak from ChIP-Seq data without replicates.
So, as a happy user of MACS, I wanted to use MACS2.

Tips:

1) Install numpy
2) Install Cython (this would remove Prob.c error)
3) In "setup_w_cython.py" script - change -Ofast to -O3 to make sure GCC is not offended.

It should compile smooth.

Here is how to run MACS2
https://github.com/taoliu/MACS/wiki/Call-differential-binding-events

Will update on result quality.

 

Friday, December 20, 2013

Turning R string in to a variable

http://cran.r-project.org/doc/FAQ/R-FAQ.html#How-can-I-turn-a-string-into-a-variable_003f

"get" and "assign" allows "paste" to generate variables. 

Infinitely Abusable !

Saturday, November 30, 2013

Setting up High throughput RNA-Seq analysis pipeline

This email provides general discussion of the workflow and tools for analysis of RNA-Seq data. This are not the only tools but things that work.

A general requirement is install of recent version of Java, R and Bioconductor packages.

1) Fastq file(s).

FASTQ format stores sequences and Phred quality scores in a single file.
http://en.wikipedia.org/wiki/FASTQ_format

(As an example) Assuming a Illumina Hi-Seq run of 8 lanes, RNA-Seq data (reads) for up to 32 samples can be generated in 1 run. This results in upto 4 samples in a single lane. After de-multiplexing (if any) based on sequence barcode, one normally receives one or more FASTQ file for a given sample (single experiment).

If the paired-end (PE) sequencing was run the file will have R1 and R2 in the name.

One sample can have more than 1 fastq file and it can be merged with simple PERL script to concatnet them. R1 and R2 files should be merged separately.

2) Alignment to the reference genome assembly. (SAM and BAM files)

The first step in the analysis is to align the sequence reads in the fastq files to the genome.
There are three repositories to download the genome assemblies the UCSC, NCBI and ENSEMBL.

UCSC : http://hgdownload.soe.ucsc.edu/goldenPath/hg19/bigZips/ chromFaMasked.tar.gz provides masked chromosome fasta files. Similarly, ENSEMBL fasta files can be downloaded from the ENSEMBL website (www.ensembl.org)

The UCSC assmebly is called "hg19" and ENSEMBL assembly is called "GRCh37". There is NO difference between them in terms of nucleotide sequences. (annotations differ)

While many tools available for doing the alignment Bowtie (http://bowtie-bio.sourceforge.net/index.shtml), Tophat (http://tophat.cbcb.umd.edu/) and BWA (http://bio-bwa.sourceforge.net/) are most frequently used. You may not need to download genome sequences from UCSC or ENSEMBL website. The websites of the aligners provides assemblies and index files
specifically designed to run with the aligners.

The output of the alignment is SAM files and its binary version is called BAM. see http://samtools.sourceforge.net/

3) QC for the sequencing reads and RNA-Seq specific QC

FASTQC allows for quality control of raw sequencing reads. http://www.bioinformatics.babraham.ac.uk/projects/fastqc/

RNA-SeQC allows for quality control specific of RNA-seq experiments, providing quality metric for transcriptome analysis.
http://www.broadinstitute.org/cancer/cga/rna-seqc

There are other packages available but these two are currently the best.

RNA-SeQC allows marking and removal of duplicate reads. The BAM files from which the duplicate reads are removed is the input of the further analysis.


4) Transcript assembly and Counting of reads for transcriptome.

This is a RNA-Seq analysis specific step. This step is a basic requirement for down stream analysis like differential expression analysis. requires genomic annotation information that is available
from GFF format files. The two places for genomic annotation are UCSC and ENSEMBL. I prefer ENSEMBL annotations but it is nice to use the ENCODE annotations currently used by TCGA.

The softwares are Cufflinks (http://cufflinks.cbcb.umd.edu/), HTSeq
(http://www-huber.embl.de/users/anders/HTSeq/doc/overview.html) and
easyRNASeq (http://www.bioconductor.org/packages/release/bioc/html/easyRNASeq.html)

Various quantitation metrices like RPKM, FPFM (for paired end data), RSEM and raw counts are proposed for RNA-seq data. Current opinion is that RSEM (http://deweylab.biostat.wisc.edu/rsem/) is useful for within sample comparison and raw counts are suitable for differential expression analysis.

5) Normalization, Filtering and Differential Expression

The most commonly used packages are DESeq , DESeq2 and edgeR. There are others but I used mostly DESeq and edgeR for they allow incorporation of experiment design for fitting linear models 
and analysis of variance (ANOVA).

Filtering is important to increase power of the differential expression analysis. See discussion on Independent filtering by the "genefilter" and HTSFilter Bioconductor packages.

6) Visualization

The CummerBund package (http://compbio.mit.edu/cummeRbund/) associated with cufflinks allows for visualization of transcript data. RSEM (http://deweylab.biostat.wisc.edu/rsem/) is useful for within sample comparison and visualization.

7) Functional enrichment and pathway analysis.

No clear winner here. However I use GSVA, GSEA, DAVID and other packages.

Thursday, August 08, 2013

RPKM with easyRNASeq

Seems there is a bug in the RPKM function in easyRNASeq. Trying to generate RPKM values from counts results in the matrix with "NA".

The solution from seqanswers works :
http://seqanswers.com/forums/showthread.php?t=29909

It works for the other version of the package too.

Saturday, June 08, 2013

Nice worked out example on analyzing array data using bioconductor.

http://bioinformatics.knowledgeblog.org/2011/06/20/analysing-microarray-data-in-bioconductor/

Monday, April 15, 2013

R/Bioconductor packages for VennDiagram

gplots::venn
venneuler
limma::vennDiagram
vennDiagram

the paper by Hanbo Chen and Paul Boutros give good comparison of features of multiple tools. 
http://www.biomedcentral.com/1471-2105/12/35