Friday, December 04, 2009

Quality control of Affymetrix arrays

I am playing around with different Bioconductor packages for QC on Affymetrix exon arrays.

Here is a nice introduction to quality assessment and processing.

http://www.bioconductor.org/workshops/2009/GenentechNov2009/Module2/module2-affy-preprocess.pdf


Here are some files one should have:

• CEL files: contain one observation per spot.
• CDF files: map from spot locations to probe sets, and ultimately to the identity of the gene being probed.
• Bioconductor "annotation" packages: map from probe sets to genes and other annotations.
• Tab-delimited, database, or other files: provide phenotypic information.



Some packages are:

• arrayQualityMetrics
• simpleaffy
• yaqcaffy
• The estrogen package vignette also has some QC. In R, open it with openVignette("estrogen").

Before we start the normalization process, here are some quality metrics that Affymetrix advises checking:

1) Average background: should be similar for all chips.
2) Scale factor: should be within 3-fold of one another across chips.
3) Number of genes called present: should be similar for similar samples. It may be different for different tissue types.
4) 3' to 5' ratio of GAPDH and beta-actin: should be close to 1. Up to 3 is fine; 1.25 is the cutoff that "simpleaffy" recommends.
5) Values for spike-in transcripts: present in at least 70% of arrays.

Please see
http://bioconductor.org/packages/2.5/bioc/vignettes/simpleaffy/inst/doc/QCandSimpleaffy.pdf
for more info.
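
As a rough illustration of how the thresholds above could be applied, here is a minimal Python sketch. It is not the simpleaffy API: the chip names, field names, and numbers are all invented, and two choices are my own assumptions, namely comparing each scale factor to the smallest one, and applying the 1.25 cutoff to GAPDH and the 3 cutoff to beta-actin.

```python
# Hypothetical per-chip QC values; field names and numbers are invented for illustration.
chips = {
    "chip_A": {"scale_factor": 1.0, "gapdh_3to5": 1.05, "actin_3to5": 1.4},
    "chip_B": {"scale_factor": 1.8, "gapdh_3to5": 1.10, "actin_3to5": 2.1},
    "chip_C": {"scale_factor": 4.2, "gapdh_3to5": 1.60, "actin_3to5": 3.5},
}


def qc_flags(chips, max_sf_fold=3.0, max_gapdh=1.25, max_actin=3.0):
    """Return {chip name: [reasons]} for chips that exceed a threshold."""
    # Assumption: "within 3-fold" is checked against the smallest scale factor.
    smallest_sf = min(c["scale_factor"] for c in chips.values())
    flags = {}
    for name, c in chips.items():
        reasons = []
        if c["scale_factor"] / smallest_sf > max_sf_fold:
            reasons.append("scale factor")
        if c["gapdh_3to5"] > max_gapdh:
            reasons.append("GAPDH 3'/5' ratio")
        if c["actin_3to5"] > max_actin:
            reasons.append("beta-actin 3'/5' ratio")
        if reasons:
            flags[name] = reasons
    return flags


flagged = qc_flags(chips)
for name, reasons in flagged.items():
    print(name, "->", ", ".join(reasons))
```

In practice these numbers would come from a QC package rather than being typed in by hand; the sketch only shows the comparison logic.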

======

We are interested in two different aspects: per-slide aspects and between-slide aspects.

Per-slide aspects are the intensity dependence of ratios and spatial effects on the array. These can be assessed by looking at MA plots and a false-color image of the chip.

Between-slide aspects are homogeneity, outlier samples, and biological meaning. These can be assessed with boxplots, density plots, heatmaps, and PCA.

Other plots include variance-mean dependency, GC content, and probe mapping studies. Affymetrix-only plots include NUSE, RLE, RNA degradation, QC stats, and PM/MM. Finally, one should be able to identify outliers.

The image function allows us to look at the spatial distribution of the intensities on a chip.

Another way to visualize what is going on on a chip is to look at the histogram of its intensity distribution. Because of the large dynamic range (O(10^4)), it is useful to look at the log-transformed values.
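
To make the point about the log transform concrete, here is a small Python sketch that bins log2 intensities and prints a text histogram. The intensities are made up, and the function name and bin width are arbitrary choices for illustration, not anything taken from a Bioconductor package.

```python
import math
from collections import Counter

# Made-up raw probe intensities spanning roughly four orders of magnitude.
raw = [12, 35, 60, 110, 250, 480, 900, 2100, 8000, 45000]
BIN_WIDTH = 2.0  # width of each bin on the log2 scale (arbitrary choice)


def log2_histogram(values, bin_width=BIN_WIDTH):
    """Count log2 intensities per bin; keys are the lower edges of the bins."""
    counts = Counter()
    for v in values:
        if v <= 0:
            continue  # the logarithm is undefined for non-positive values
        lower_edge = math.floor(math.log2(v) / bin_width) * bin_width
        counts[lower_edge] += 1
    return dict(sorted(counts.items()))


hist = log2_histogram(raw)
for edge, n in hist.items():
    print(f"[{edge:4.1f}, {edge + BIN_WIDTH:4.1f}) {'#' * n}")
```

On the raw scale nearly all of these values would fall into the first bin of an equal-width histogram; on the log2 scale they spread across the bins.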

To compare the intensity distributions across several chips, we can look at boxplots of both the raw intensities and the normalized probe set values.
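
A boxplot is essentially a drawing of a five-number summary per chip, so the comparison can be sketched without any plotting at all. The Python below is only an illustration: the chip names and intensities are invented, and the "inclusive" quantile method is one of several reasonable conventions.

```python
import math
import statistics

# Invented raw intensities for two chips; chip_B is uniformly brighter.
chips = {
    "chip_A": [2, 4, 8, 16, 32],
    "chip_B": [8, 16, 32, 64, 128],
}


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) of the log2-transformed values."""
    logs = sorted(math.log2(v) for v in values)
    q1, median, q3 = statistics.quantiles(logs, n=4, method="inclusive")
    return (logs[0], q1, median, q3, logs[-1])


summaries = {name: five_number_summary(vals) for name, vals in chips.items()}
for name, summary in summaries.items():
    print(name, summary)
```

Here the two summaries have the same spread but different centers, which is the kind of between-chip shift a boxplot makes visible.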

The scatterplot is a visualization that is useful for assessing the variation (or reproducibility, depending on how you look at it) between chips. We can look at all probes, the perfect match probes only, the mismatch probes only, and of course also at the normalized, probe-set-summarized data.

Differences between arrays in the shape or center of the distribution often highlight the need for normalization.

The MA plot is a rotated version of a scatter plot. The
rotation helps to detect patterns as deviations from horizontal,
rather than diagonal.

• Instead of plotting two vectors $Y_{2,j}$ versus $Y_{1,j}$, we plot $M_j = Y_{2,j} - Y_{1,j}$ versus $A_j = (Y_{2,j} + Y_{1,j})/2$.
• If $Y_1$ and $Y_2$ are logarithmic expression values, then
  - $M_j$ represents the log fold change for gene j
  - $A_j$ represents the average log intensity for gene j.
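
The M and A definitions above translate directly into a few lines of code. This is a minimal Python sketch with invented intensities; the function name `ma_values` is hypothetical and this is not how any particular Bioconductor function is implemented.

```python
import math


def ma_values(y1, y2):
    """Given two equal-length vectors of log expression values, return (M, A)."""
    if len(y1) != len(y2):
        raise ValueError("both chips must have the same number of values")
    m = [b - a for a, b in zip(y1, y2)]        # M_j = Y_2j - Y_1j
    a = [(a + b) / 2 for a, b in zip(y1, y2)]  # A_j = (Y_2j + Y_1j) / 2
    return m, a


# Invented raw intensities for three genes on two chips, moved to the log2 scale.
chip1 = [math.log2(v) for v in [64, 256, 1024]]
chip2 = [math.log2(v) for v in [128, 256, 512]]

m, a = ma_values(chip1, chip2)
for m_j, a_j in zip(m, a):
    print(f"A = {a_j:.1f}, M = {m_j:+.1f}")
```

With log2 values, M = +1 means the gene is twice as bright on the second chip and M = -1 means half as bright. Plotting M against A should give a cloud centered on the horizontal line M = 0 when the two chips agree.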
