Friday, December 11, 2009

Test about a single proportion for categorical data

Ho: Pi = Pio vs. HA: Pi NA Pio

Wald Test: Z = Pi-hat - Pio/ sqrt (Pi-hat (1 - Pi-hat)/n)

Score Test : Z = Pi-hat - Pio/ sqrt (Pio (1 - Pio)/n)

Score test is slightly more powerful in distinguising evidence against Ho.


R-code for score test : prop.test (27, 922, p=0.02, correct =FALSE)

Here is p is the population probability. and 27 is the observed instance of some phenomena in population. 27/922 will give Pi-hat - the binomial estimate in sample.
Its a Chi-square test on 1 degree of freedom.

Here the p-value is 0.0440 - where the Ho is rejected and 27/922 = 0.029 is NA to 0.02.

Obtaining Confidence Intervals
===============================

The Wald Confidence Interval for single proportion can be obtained by inverting the Wald test statistic but it falls into problem if n is small or Pi is small.

A Score CI can be used in this case.

The R-code is

Wald

library(Hmisc)

binconf(27,922,method="asymptotic")

PointEst Lower Upper
0.02928416 0.01840125 0.04016708


Score (Wilson):

binconf(27,922,method="wilson")

PointEst Lower Upper
0.02928416 0.02020271 0.04227177

prop.test also give the score CI.

========================

In case observed events are small in number and normality assumption is not valid.
One can use binomial exact test to calculate the p-value and CI.

binom.test(3, 58, p = 0.02,
+ alternative = "two.sided",
+ conf.level = 0.95)
Exact binomial test
data: 3 and 58
number of successes = 3, number of trials = 58,
p-value = 0.1101
alternative hypothesis: true probability of success
is not equal to 0.02
95 percent confidence interval:
0.01079648 0.14380463
sample estimates:
probability of success
0.05172414

Tuesday, December 08, 2009

R-packages unix to windows

Use

http://win-builder.r-project.org/upload.aspx

Make sure MAINTAINER has your email address. The site should send instructions on downloading the windows version of the package.

Monday, December 07, 2009

Bioconductor post on using arrayQualityMatrics with exon arrays

Looks like both simpleaffy and arrayQualityMetrics have problem with QCing Affy Exon 1.0 ST arrays.

Though following post does suggest a way to put custom CDF.


Hi Gard,

Sorry for the delay answering. I do not have much experience using
arrayQualityMetrics for Exon arrays, so I have talked with Crispin Miller
(simpleaffy package) about it and according to him "most of the Affymetrix
QC metrics for the 3' IVT arrays aren't directly applicable to the exon
arrays. They rely on MAS 5 and paired MM spots (neither of which are
applicable for exon arrays) and also make assumptions on 3'/5' ratios that
don't apply because the exon array chemistry is different."
I have now modified the package and version 2.4.3 of arrayQualityMetrics
should not perform the QC statistics from simpleaffy when "exon" is in the
cdfname.

Best wishes,
Audrey

> Hi.
>
> I am trying to get the arrayQualityMetrics package to run on a set of
CEL files from the Human Exon array from Affymetrix
>
> My problems begin when I want to run the arrayQualityMetrics function
and it gives the following error message :
>
> running R 2.9.2 and bioconductor version 2.4
>
> >library(affy)
> >library(simpleaffy)
> >ibrary(arrayQualityMetrics)
> >ecesbatch<-read.affybatch("H1.CEL", "H2.CEL", "H3.CEL", "H4.CEL",
> "H5.CEL", "H6.CEL", "H7.CEL", "H8.CEL", "H9.CEL", "H10.CEL",
> "H11.CEL", "H12.CEL", "H13.CEL", "H14.CEL", "H15.CEL", "H16.CEL")
>
> ## attach cdf to expr set
> ecesbatch cdfName <- "exon.pmcdf" ## this is a cdf file from the XMAP
website
>
> #Check the name is correct for the cdf file (unneccessary)
> > cdfname <- cleancdfname(cdfName(ecesbatch))
> > cdfname
> [1] "exon.pmcdf"
>
> >arrayQualityMetrics(expressionset = ecesbatch,outdir =
> "output",force = TRUE,do.logtransform = TRUE)
>
> This cmd runs for a very long time and generates a bunch of .pdfs and
.pngs and an empty QCReport.html file.
> And R says there is an error sonce the arrayQualityMetrics package does
not know the QCparameters of this chip.
>
>
> I have found an instruction from C. Miller (one of the persons behind
simpleaffy) about how use the three functions provided by the
> simpleaffy package, or to make the needed .qcdf file:
> I need alpha values (that is okay) and I need control and spike
> probeIDs.
>
> I am using the Human Exon array 1.0 from affymetrix, and I do not know
what to fill in in the .qcdf file,
> anyone who knows how to get by this problem?
> Trying to get the probenames to set the values I ran into another problem..
>
> > prbs <- ls(cdfname)
> Error in as.environment(pos) :
> no item called "exon.pmcdf" on the search list
> >
>
> crashes like this shown here.
>
>
> Please if anyone knows or has an idea, basicly what I need is
> the .qcdf file for the HUman Exon array from Affy.
> Best regards
> Gard
>
> #################################
> Gard Thomassen
> Ph.D student CMBN, Rikshospitalet, Oslo
> Bioinformatician, Radiumhospitalet, Oslo
> Norway
> Email : gardt@...
> Office: + 47 22781736
> Phone +47 93674926

Friday, December 04, 2009

Quality control of Affymetrix arrays

I am playing around with different Bioconductor packages for QC on affy exon arrays.

Here is a nice introduction to quality assessment and processing.

http://www.bioconductor.org/workshops/2009/GenentechNov2009/Module2/module2-affy-preprocess.pdf


Here are some files one should have:

• CEL : contain one observation per spot
• CDF : map from spot locations to probeset and ultimately to the identity of the
gene being probed
• Bioconductor annotation" packages map from probe sets to gene and other
annotations.
• Tab-delimited, database, or other les provide phenotypic information.



Some packages are

arrayQualityMetrics
SimpleAffy
yaqcaffy
estrogen package vignette also has some QC. [>openVignette("estrogen")]

Before we start normnalization process here are some quality matric that Affymetrix advises

1)Average background : should be similar for all chips
2) Scale Factor: should be within 3 fold
3) # of genes called present : For similar samples - number should be similar. May
be different for different tissue types.
4) 3' to 5' ratio of GAPDH and beta-actin : should be close to one up to 3 is fine.
1.25 is what "simpleaffy" recommands.
5)Value for spike in transcripts: present in atleast 70% of arrays
please see
http://bioconductor.org/packages/2.5/bioc/vignettes/simpleaffy/inst/doc/QCandSimpleaffy.pdf

for more info.

======

We are interested in looking at two different aspects : Per slide aspects and Between slide aspects. Per slide aspects are - intensity dependence of ratios and spatial effects on the array. This can be done by looking at MA plots and a false image of chip. Between slide aspects are Homogeneity, outlier samples and biological meanings. This can be done by Boxplots, density plots, Heatmap and PCA. Other plots include Variance-mean dependency, GC content and probe mapping studies. Other Affy only plots include NUSE, RLE, RNA degradation, QC stats, PM/MM. Finally, one should be able to identify outliers.

The image function allows us to look at the spatial distribution of the intensities on a chip.

Another way to visualize what is going on on a chip is to look at the histogram of
its intensity distribution. Because of the large dynamical range (O(104)), it is useful to look at the log-transformed values

To compare the intensity distribution across several chips, we can look at the boxplots, both of the raw intensities and the normalized probe set values

The scatterplot is a visualization that is useful for assessing the variation (or
reproducibility, depending on how you look at it) between chips. We can look at all probes, the perfect match probes only, the mismatch probes only, and of course also at the normalized, probe-set-summarized data

Diff erences between arrays in the shape or center of the distribution often highlight the need for normalization.

The MA plot is a rotated version of a scatter plot. The
rotation helps to detect patterns as deviations from horizontal,
rather than diagonal.

• Instead of ploting two vectors Y2;j versus Y1;j , we plot
Mj = Y2;j - Y1;j versus Aj = (Y2;j + Y1;j)=2.
• if Y1 and Y2 are logarithmic expression values, then
{ Mj represents fold change for gene j
{ Aj represents average log intensity for gene j.