At par with me: 2009

Friday, December 11, 2009

Test about a single proportion for categorical data

Ho: Pi = Pio vs. Ha: Pi ≠ Pio

Wald Test: Z = (Pi-hat - Pio) / sqrt(Pi-hat (1 - Pi-hat)/n)

Score Test: Z = (Pi-hat - Pio) / sqrt(Pio (1 - Pio)/n)

The score test, which uses the null value Pio in the standard error, is slightly more powerful in distinguishing evidence against Ho.
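As a rough sketch, both statistics can be computed by hand in R for the 27-out-of-922 example used in this post. The variable names are my own. The squared score statistic should agree with the chi-square statistic that prop.test reports when the continuity correction is turned off.

```r
x  <- 27      # observed events
n  <- 922     # sample size
p0 <- 0.02    # hypothesized proportion under Ho

p_hat <- x / n

# Wald: the standard error uses the sample estimate
z_wald  <- (p_hat - p0) / sqrt(p_hat * (1 - p_hat) / n)
# Score: the standard error uses the null value
z_score <- (p_hat - p0) / sqrt(p0 * (1 - p0) / n)

# Two-sided p-values from the standard normal distribution
p_wald  <- 2 * pnorm(-abs(z_wald))
p_score <- 2 * pnorm(-abs(z_score))

c(wald = z_wald, score = z_score)
c(wald = p_wald, score = p_score)
```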

R-code for score test : prop.test (27, 922, p=0.02, correct =FALSE)

Here p is the hypothesized population proportion (Pio), and 27 is the number of times the event was observed among 922 trials. 27/922 gives Pi-hat, the sample estimate of the binomial proportion.
It's a chi-square test on 1 degree of freedom.

Here the p-value is 0.0440, so Ho is rejected at the 0.05 level: 27/922 = 0.029 is not equal to 0.02.

Obtaining Confidence Intervals
===============================

The Wald confidence interval for a single proportion can be obtained by inverting the Wald test statistic, but it performs poorly if n is small or Pi is small (close to 0).

A Score CI can be used in this case.

The R-code is

Wald

library(Hmisc)

binconf(27,922,method="asymptotic")

PointEst Lower Upper
0.02928416 0.01840125 0.04016708

Score (Wilson):

binconf(27,922,method="wilson")

PointEst Lower Upper
0.02928416 0.02020271 0.04227177

prop.test also gives the score CI (use correct=FALSE to get it without the continuity correction).
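For reference, here is a small sketch of both intervals computed by hand, assuming a 95% level and the same 27/922 data. The Wilson limits come from solving |Pi-hat - Pi| = z * sqrt(Pi (1 - Pi)/n) for Pi. The variable names are mine.

```r
x <- 27; n <- 922
p_hat <- x / n
z <- qnorm(0.975)   # 95% two-sided critical value

# Wald: p_hat +/- z * sqrt(p_hat (1 - p_hat) / n)
wald <- p_hat + c(-1, 1) * z * sqrt(p_hat * (1 - p_hat) / n)

# Wilson (score): closed-form solution of the score equation
centre <- (p_hat + z^2 / (2 * n)) / (1 + z^2 / n)
half   <- z * sqrt(p_hat * (1 - p_hat) / n + z^2 / (4 * n^2)) / (1 + z^2 / n)
wilson <- centre + c(-1, 1) * half

rbind(wald, wilson)
```

The Wilson interval is not centred on Pi-hat; it is pulled toward 0.5, which is why it behaves better for small proportions.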

========================

When the observed events are few and the normality assumption is not valid, one can use the exact binomial test to calculate the p-value and CI.

binom.test(3, 58, p = 0.02,
+ alternative = "two.sided",
+ conf.level = 0.95)
Exact binomial test
data: 3 and 58
number of successes = 3, number of trials = 58,
p-value = 0.1101
alternative hypothesis: true probability of success
is not equal to 0.02
95 percent confidence interval:
0.01079648 0.14380463
sample estimates:
probability of success
0.05172414
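To see where these numbers come from, here is a hedged sketch that rebuilds the interval and the p-value by hand. It assumes the interval is the Clopper-Pearson one (beta quantiles) and that the two-sided p-value sums the probabilities of all outcomes no more likely than the observed one. The small tolerance factor is my own guard against floating-point ties.

```r
x <- 3; n <- 58; alpha <- 0.05; p0 <- 0.02

# Clopper-Pearson limits via beta quantiles
lower <- qbeta(alpha / 2, x, n - x + 1)
upper <- qbeta(1 - alpha / 2, x + 1, n - x)
c(lower, upper)

# Exact two-sided p-value: add up every outcome that is
# no more likely under Ho than the observed count
d <- dbinom(0:n, n, p0)
p_exact <- sum(d[d <= dbinom(x, n, p0) * (1 + 1e-7)])
p_exact
```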

Tuesday, December 08, 2009

R-packages unix to windows

Use

http://win-builder.r-project.org/upload.aspx

Make sure the Maintainer field in the package's DESCRIPTION file has your email address. The site should send instructions on downloading the Windows version of the package to that address.

Monday, December 07, 2009

Bioconductor post on using arrayQualityMetrics with exon arrays

Looks like both simpleaffy and arrayQualityMetrics have problems with QCing Affy Exon 1.0 ST arrays.

The following post does suggest a way to use a custom CDF, though.

Hi Gard,

Sorry for the delay answering. I do not have much experience using
arrayQualityMetrics for Exon arrays, so I have talked with Crispin Miller
(simpleaffy package) about it and according to him "most of the Affymetrix
QC metrics for the 3' IVT arrays aren't directly applicable to the exon
arrays. They rely on MAS 5 and paired MM spots (neither of which are
applicable for exon arrays) and also make assumptions on 3'/5' ratios that
don't apply because the exon array chemistry is different."
I have now modified the package and version 2.4.3 of arrayQualityMetrics
should not perform the QC statistics from simpleaffy when "exon" is in the
cdfname.

Best wishes,
Audrey

> Hi.
>
> I am trying to get the arrayQualityMetrics package to run on a set of
CEL files from the Human Exon array from Affymetrix
>
> My problems begin when I want to run the arrayQualityMetrics function
and it gives the following error message :
>
> running R 2.9.2 and bioconductor version 2.4
>
> >library(affy)
> >library(simpleaffy)
> >library(arrayQualityMetrics)
> >ecesbatch<-read.affybatch("H1.CEL", "H2.CEL", "H3.CEL", "H4.CEL",
> "H5.CEL", "H6.CEL", "H7.CEL", "H8.CEL", "H9.CEL", "H10.CEL",
> "H11.CEL", "H12.CEL", "H13.CEL", "H14.CEL", "H15.CEL", "H16.CEL")
>
> ## attach cdf to expr set
> ecesbatch@cdfName <- "exon.pmcdf" ## this is a cdf file from the XMAP website
>
> #Check the name is correct for the cdf file (unneccessary)
> > cdfname <- cleancdfname(cdfName(ecesbatch))
> > cdfname
> [1] "exon.pmcdf"
>
> >arrayQualityMetrics(expressionset = ecesbatch,outdir =
> "output",force = TRUE,do.logtransform = TRUE)
>
> This cmd runs for a very long time and generates a bunch of .pdfs and
.pngs and an empty QCReport.html file.
> And R says there is an error since the arrayQualityMetrics package does not know the QC parameters of this chip.
>
>
> I have found an instruction from C. Miller (one of the persons behind
simpleaffy) about how use the three functions provided by the
> simpleaffy package, or to make the needed .qcdf file:
> I need alpha values (that is okay) and I need control and spike
> probeIDs.
>
> I am using the Human Exon array 1.0 from affymetrix, and I do not know
what to fill in in the .qcdf file,
> anyone who knows how to get by this problem?
> Trying to get the probenames to set the values I ran into another problem..
>
> > prbs <- ls(cdfname)
> Error in as.environment(pos) :
> no item called "exon.pmcdf" on the search list
> >
>
> crashes like this shown here.
>
>
> Please if anyone knows or has an idea, basicly what I need is
> the .qcdf file for the HUman Exon array from Affy.
> Best regards
> Gard
>
> #################################
> Gard Thomassen
> Ph.D student CMBN, Rikshospitalet, Oslo
> Bioinformatician, Radiumhospitalet, Oslo
> Norway
> Email : gardt@...
> Office: + 47 22781736
> Phone +47 93674926

Friday, December 04, 2009

Quality control of Affymetrix arrays

I am playing around with different Bioconductor packages for QC on affy exon arrays.

Here is a nice introduction to quality assessment and processing.

http://www.bioconductor.org/workshops/2009/GenentechNov2009/Module2/module2-affy-preprocess.pdf

Here are some files one should have:

• CEL: contains one observation per spot.
• CDF: maps from spot locations to probe sets and ultimately to the identity of the gene being probed.
• Bioconductor "annotation" packages: map from probe sets to genes and other annotations.
• Tab-delimited, database, or other files: provide phenotypic information.

Some packages are

arrayQualityMetrics
simpleaffy
yaqcaffy
The estrogen package vignette also has some QC. [>openVignette("estrogen")]

Before we start the normalization process, here are some quality metrics that Affymetrix advises checking:

1) Average background: should be similar for all chips.
2) Scale factor: should be within 3-fold of one another across chips.
3) Number of genes called present: should be similar for similar samples; may differ between tissue types.
4) 3' to 5' ratio of GAPDH and beta-actin: should be close to one; up to 3 is fine. 1.25 is what "simpleaffy" recommends.
5) Values for spike-in transcripts: should be called present in at least 70% of arrays.

Please see
http://bioconductor.org/packages/2.5/bioc/vignettes/simpleaffy/inst/doc/QCandSimpleaffy.pdf
for more info.

======

We are interested in two different aspects: per-slide aspects and between-slide aspects. Per-slide aspects are the intensity dependence of ratios and spatial effects on the array; these can be examined with MA plots and a false-color image of the chip. Between-slide aspects are homogeneity, outlier samples and biological meaning; these can be examined with boxplots, density plots, heatmaps and PCA. Other plots include variance-mean dependency, GC content and probe mapping studies. Affy-only plots include NUSE, RLE, RNA degradation, QC stats and PM/MM. Finally, one should be able to identify outliers.

The image function allows us to look at the spatial distribution of the intensities on a chip.

Another way to visualize what is going on on a chip is to look at the histogram of its intensity distribution. Because of the large dynamic range (O(10^4)), it is useful to look at the log-transformed values.

To compare the intensity distribution across several chips, we can look at the boxplots, both of the raw intensities and the normalized probe set values.

The scatterplot is a visualization that is useful for assessing the variation (or reproducibility, depending on how you look at it) between chips. We can look at all probes, the perfect match probes only, the mismatch probes only, and of course also at the normalized, probe-set-summarized data.

Differences between arrays in the shape or center of the distribution often highlight the need for normalization.
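A minimal base-R sketch of the histogram and boxplot views, using simulated intensities rather than real CEL data. The matrix raw, the chip names and the deliberately shifted fourth chip are all made up for illustration.

```r
set.seed(1)
# Simulated raw intensities for 4 chips (log-normal, NOT real CEL data);
# chip 4 is deliberately brighter to mimic an array that needs normalization
raw <- sapply(c(7, 7, 7, 8), function(m) 2^rnorm(5000, mean = m, sd = 1.5))
colnames(raw) <- paste0("chip", 1:4)

log_raw <- log2(raw)   # log transform tames the large dynamic range

op <- par(mfrow = c(1, 2))
hist(log_raw[, 1], breaks = 50, main = "chip1", xlab = "log2 intensity")
boxplot(as.data.frame(log_raw), ylab = "log2 intensity",
        main = "Raw intensities per chip")
par(op)

# Per-chip medians make the shifted chip easy to spot
apply(log_raw, 2, median)
```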

The MA plot is a rotated version of a scatter plot. The rotation helps to detect patterns as deviations from horizontal, rather than diagonal.

• Instead of plotting two vectors Y2,j versus Y1,j, we plot Mj = Y2,j - Y1,j versus Aj = (Y2,j + Y1,j)/2.
• If Y1 and Y2 are logarithmic expression values, then
  - Mj represents the log fold change for gene j
  - Aj represents the average log intensity for gene j.
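The M and A definitions above can be sketched in a few lines of base R. The two vectors y1 and y2 are simulated log2 expression values, not real array data.

```r
set.seed(2)
# Simulated log2 expression for two arrays (hypothetical data)
y1 <- rnorm(2000, mean = 8, sd = 2)
y2 <- y1 + rnorm(2000, mean = 0, sd = 0.3)

M <- y2 - y1          # log fold change per gene
A <- (y2 + y1) / 2    # average log intensity per gene

plot(A, M, pch = ".", main = "MA plot")
abline(h = 0, col = "red")   # no-change reference line
```

Nothing is lost in the rotation: y1 = A - M/2 and y2 = A + M/2.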

Thursday, November 19, 2009

Percents - GMAT Math Study Guide

This is taken from http://www.platinumgmat.com/gmat_study_guide/percents.

Percent Change vs. Percent Of

While most students find percentages to be an easier topic than one such as combinatorics, some individuals initially trip on the difference between a percent change and a percent of a number. Practically, this is the difference between saying "the price jumped 50%" and "the current price is 150% of the old price." Both of these phrases refer to the same amount, but are stated differently.

Percent Change

Percents are commonly used to measure or report the change in an amount. For example, a news reporter might say, "stocks rose 1.5% today" or a demographer might write, "minority representation in the population fell 3.5% during the past decade." The formula for calculating percent changes is:
Percent Change [as a percent] = ((End Value - Start Value)/Start Value) * 100

This formula can also be expressed in decimal form. In other words, the following formula calculates the percent change between two numbers and represents this change in decimal form.
Percent Change [as a decimal] = (End Value - Start Value)/Start Value

The following examples illustrate the use of this formula.
A company recently saw its stock fall from $10 to $9 as a result of a lawsuit award. What percent did the stock drop?
End Value = 9
Start Value = 10
Percent Change [as a percent] = ((9 - 10)/10) * 100 = -.1 * 100 = -10%

Another Example:
As a result of an increase in the required minimum wage and an increase in the price of raw materials, a manufacturer raised the price of its product from $50 to $60. By what percent did the manufacturer raise the price of its product?
End Value = 60
Start Value = 50
Percent Change [as a percent] = ((60 - 50)/50) * 100 = .2 * 100 = 20%
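The two worked examples above can be checked with a one-line R helper. The function name pct_change is my own.

```r
# Percent change from a start value to an end value, as a percent
pct_change <- function(start, end) (end - start) / start * 100

pct_change(10, 9)    # stock falling from $10 to $9
pct_change(50, 60)   # price rising from $50 to $60
```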

It is possible to calculate the percent change of a percent. Consider the following example:
Since the local government increased funding of high school education 10 years ago, the percent of students accepted at accredited four year universities jumped from 75% to 85%. By what percent did the percent of students accepted at four year universities increase over the 10 year period?
End Value = 85% = .85
Start Value = 75% = .75
Using Percents: Percent Change [as a percent] = ((85% - 75%)/75%) * 100 = .133 * 100 = 13.3%
Using Decimals: Percent Change [as a decimal] = (.85 - .75)/.75 = .133

A Common Mistake in Working With Percent Decreases

Some students confuse a percent decrease of a certain percentage with finding the percent of a certain amount. The following example elucidates this confusion:
A foreign stock market index stood at 5,000 last year. However, since that time, its value fell 45%. What is the current value of the stock index?

Common Mistake: IndexToday = 5000(.45)
This calculation yields 45% of last year's index value. However, the question pertains to a 45% fall. Since the index's value fell 45%, its current value is 100% - 45% = 55% of last year's index value.
Correct Calculation: IndexToday = 5000(1-.45) = 5000(.55) = 2750
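A quick R check of the two calculations, using variable names of my own choosing:

```r
index_last_year <- 5000
fall <- 0.45

mistake <- index_last_year * fall        # 45% OF the old value
correct <- index_last_year * (1 - fall)  # value after a 45% FALL

c(mistake = mistake, correct = correct)
```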

Percent of

Another common use of percents is as a measure of another number. For example, a stock analyst might say, "MicroMake's stock is trading at 130% of MacroMake's stock price." Similarly, a political historian might say, "President George W. Bush's approval rating in late November 2004 was about 50%, which is about 55% of his approval rating in late September 2001." In these instances, percents are being used not to describe change, but to compare amounts or quantities.

When working with percents that are used to compare different quantities, it is often best to translate each percent into decimals and set up equations or ratios. Consider the following examples:
What is 50% of 40?
Translate 50% into decimal format: 50% = .5
Translate the question into an equation: .5(40) = ?
.5(40) = 20

The following is a slightly more difficult example:
20 is what percent of 80?
Let X = the percent as a decimal
Translate the question into an equation: X(80) = 20
X = 20/80 = 1/4 = .25
Translate X into a percent: .25(100) = 25%

Percents can also be used to compare the size of percents. Consider the example with President George W. Bush's approval rating mentioned above.
President George W. Bush's approval rating in late November 2004 was about 50%, which is about 55% of his approval rating in late September 2001. What was President Bush's approval rating in late September 2001?
Let A = President Bush's approval rating in late September 2001
Condense Question Down to Simplify: 50% is 55% of A
Translate Into Equation: .50 = .55A
A [as a decimal] = .50/.55 ≈ .9
A [as a percent] ≈ .9(100) = 90%

Recursive Percents

If a number rises by 30% and then falls by 35%, by what percent did it change from beginning to end? The topic of recursive (or successive) percents addresses this question. Consider an example:
From 2004 through 2007, the Dow Jones Industrials Average rose about 30%. However, during 2008, the Dow fell about 35%. About what percent did the Dow Jones change from 2004 through 2008?

Let Dow at the beginning of 2004 = X
Dow at the end of 2007 = X(1 + 30%) = X(1.3)
Dow at the end of 2008 = [X(1.3)](1 - .35) = X(.845)

Percent Change = ((End - Start)/Start)*100
Percent Change = ((X(.845) - X)/X)*100 = -15.5%
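Successive changes multiply, so the overall change can be sketched in R as a product of growth factors. The helper overall_change is hypothetical and takes each change as a decimal.

```r
# Combine successive percent changes (given as decimals) into one overall change
overall_change <- function(changes) (prod(1 + changes) - 1) * 100

overall_change(c(0.30, -0.35))   # rise of 30%, then fall of 35%
```

Note that the result is not simply 30 - 35 = -5: the 35% fall applies to the already increased value.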

Strategy: Picking Numbers (Especially 100)

Many students find it easier to solve problems involving percents by picking numbers instead of using theoretical variables. The previous question can be solved this way:
From 2004 through 2007, the Dow Jones Industrials Average rose about 30%. However, during 2008, the Dow fell about 35%. About what percent did the Dow Jones change from 2004 through 2008?

Let Dow at the beginning of 2004 = 100 [pick the number 100 instead of using a variable]
Dow at the end of 2007 = 100(1.3)
Dow at the end of 2008 = 100(1.3)(1 - .35) = 84.5

The choice of 100 as a value for the Dow at the beginning of 2004 makes calculating the percent change from 2004 through 2008 much easier, as the next step should indicate.

Percent Change = ((End - Start)/Start)*100
Percent Change = ((84.5 - 100)/100)*100 = -15.5%

Interest Rate Problems

One rather common and important application of percents is the topic of interest rates and money. An important formula that relates interest, principal, and time follows:
Simple Interest Formula
I = PRT

I = Interest Payment
P = Principal
R = Interest Rate
T = Time Period

If a homeowner signs a 10-year, $100,000 loan at 5% interest, how much will his interest payment be the first year (assuming he pays interest once annually)?

T = 1 since the question asks for the interest, I, in the first year (i.e., a one year time period--not the entire 10 year time period)
P = $100,000
R = 5% = 0.05

I = $100,000(.05)(1) = $5000

While the above formula helps solve many problems, there are other problems that require another formula. The following formula is fundamental to the relationship between interest, time, present value, and future value:
FV = PV(1 + r)^t
FV = Future Value = The amount of money to be received or owed at a future date t time periods from now
PV = Present Value = The amount of money to be received or owed at present (i.e., now)
r = Interest Rate = The interest rate on the money, expressed as a decimal
t = Time = The amount of time to pass between PV and FV

Note: The time period, t, and interest rate, r, must be expressed in the same terms. For example, one cannot use an annual interest rate and express time in terms of months. If you are using a value of t that expresses time in months, you must use a monthly interest rate. For more on this topic, see the compound interest section.

The following is an example of a common introductory interest rate problem.
If Sam invests $100,000 today and earns 5% a year, how much money would Sam have in 2 years?

PV = $100,000
r = 5% = 0.05
t = 2

FV = $100,000(1 + .05)^2 = $110,250
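Both interest formulas are easy to sketch as R functions. The function names are mine. The last call illustrates the matching-units note and assumes the monthly rate is simply the annual rate divided by 12, which is a convention, not something stated in the guide.

```r
simple_interest <- function(P, R, time) P * R * time
future_value    <- function(PV, r, t) PV * (1 + r)^t

simple_interest(100000, 0.05, 1)   # first-year interest on the loan
future_value(100000, 0.05, 2)      # Sam's money after 2 years

# Matching units: 24 monthly periods need a monthly rate
# (0.05 / 12 is an assumed nominal monthly rate)
future_value(100000, 0.05 / 12, 24)
```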

Tuesday, October 06, 2009

Advances in Genetics is a journal from Elsevier; good for reviews.

Friday, August 07, 2009

simpler way to count the number of elements in a vector that are not NA

sum( !is.na( yourvector ) )
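A tiny example with a made-up vector. The same idea extends to data frame columns through colSums.

```r
yourvector <- c(4, NA, 7, NA, 1)
n_ok <- sum(!is.na(yourvector))   # TRUE counts as 1, so this counts non-NA entries
n_ok

# Per-column version for a data frame (hypothetical data)
df <- data.frame(a = c(1, NA, 3), b = c(NA, NA, 2))
colSums(!is.na(df))
```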

Friday, July 17, 2009

Quantile normalizing the data

If you have a "matrix" of data (or A-values for two-color arrays) and for some reason affy or limma is not a practical alternative, you can use the package called "caret" and the function normalize2Reference:

library(caret)
x <- normalize2Reference(data.value)
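If no package is at hand, the idea behind quantile normalization can be sketched in base R. This is my own minimal version, not the caret implementation: it assumes a numeric matrix with no missing values, breaks ties by order of appearance, and uses the mean of the sorted columns as the reference distribution.

```r
quantile_normalize <- function(m) {
  m <- as.matrix(m)
  ranks  <- apply(m, 2, rank, ties.method = "first")
  target <- rowMeans(apply(m, 2, sort))   # mean of the k-th smallest values
  out <- apply(ranks, 2, function(r) target[r])
  dimnames(out) <- dimnames(m)
  out
}

set.seed(3)
# Made-up data: three columns with very different distributions
data.value <- cbind(a = rnorm(100, 5), b = rnorm(100, 7, 2), c = rexp(100))
x <- quantile_normalize(data.value)
apply(x, 2, quantile)   # every column now has the same quantiles
```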

Tuesday, June 23, 2009

Reading in Row and Column names in read.table

If the first row of the file has exactly one fewer entry than the remaining lines, read.table will by itself recognize the first row as the column names and the first column as the row names.

You can check with row.names and colnames.

e.g.

adultControl <- read.table ("control_adult_cerebellum.txt", sep="\t")
row.names(adultControl)
colnames(adultControl)
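A self-contained way to try this out without a file is the text argument of read.table. The gene and sample names below are invented.

```r
# Header row has 3 fields, data rows have 4: one fewer in the first row
txt <- "s1\ts2\ts3\ngeneA\t1.2\t3.4\t5.6\ngeneB\t2.1\t4.3\t6.5"
demo <- read.table(text = txt, sep = "\t")

row.names(demo)   # taken from the first column
colnames(demo)    # taken from the first row
```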

Monday, February 16, 2009

Using Plotrix to create nice depletion/enrichment images

library(plotrix)  # provides color.scale, color2D.matplot and color.legend

setwd ("/Users/shah/insulator_data/insulator_combination_numbers/clustering_visulaization")

# Read Table

insulatorFile <- read.table ("simulated_1fdr_no_250_pvalue_for_figure.txt", sep="\t")
presMatrix <- t(as.matrix (insulatorFile [,c (6,7,8,9,10,11)]))
#color2D.matplot (presMatrix, redrange=c(0.9,0), greenrange=c(0.9,0), bluerange=c(0.9,0), xlab = "presence", ylab ="insulator combination", axes=FALSE)

pv1 <- insulatorFile[,3]
changeVal <-function (x) { if (x <= 1e-16) { x <- 1e-16} else { x <- x}}

pv3 <- lapply (pv1, changeVal)
pv4 <- -1*log10(as.numeric(pv3))

direction <- insulatorFile[,4]

pv5 <- t(as.matrix(pv4*direction))

# Generate the colors

cellcol<-matrix(rep("#000000",63),nrow=63)
cellcol[pv5<0]<-color.scale(pv5[pv5<0], c(1,0),c(1,0),c(0,0))
cellcol[pv5>0]<-color.scale(pv5[pv5>0], c(0,0),c(0,0),c(0,1))

# Generate Legend (this one is yellow - black - blue)

legval<-seq(min(pv5),max(pv5),length.out=32)
legcol<-rep("#000000",32)

legcol[legval<0]<-color.scale(legval[legval<0], c(1,0),c(1,0), c(0,0))
legcol[legval>0]<-color.scale(legval[legval>0], c(0,0),c(0,0),c(0,1))

#color2D.matplot(pv5,cellcolors=cellcol,border=NA, axes=FALSE)
#color.legend(0,0,6,-4,round(c(min(pv5),0,max(pv5)),1),rect.col=legcol)
color2D.matplot(t(as.matrix(legcol)),cellcolors=legcol,border=NA, axes=FALSE)
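As a side note, the p-value clamping done with changeVal and lapply in the script can probably be written in vectorized form with pmax. A sketch with invented p-values and directions (+1 for enrichment, -1 for depletion):

```r
# Hypothetical p-values and directions (+1 enrichment, -1 depletion)
pv1 <- c(0, 1e-20, 1e-5, 0.03, 1)
direction <- c(1, -1, -1, 1, 1)

pv4 <- -log10(pmax(pv1, 1e-16))        # clamp at 1e-16, then transform
pv5 <- t(as.matrix(pv4 * direction))   # signed scores as a 1-row matrix
pv5
```

Clamping keeps -log10 finite when a p-value underflows to zero.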

Sunday, January 04, 2009

Adding new graph to existing plot

You can use plot for the first plot and points (or lines) for the subsequent ones.

points will add new points to the existing plot, reusing the axes, labels, etc.

e.g.

plot (x1, type="l", col="red")
points (x2, type="l")
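Because the axes are reused, a later series can fall outside the plotting region. One way around this, sketched here with simulated data, is to set ylim from both series in the first call.

```r
set.seed(4)
x1 <- cumsum(rnorm(50))
x2 <- cumsum(rnorm(50)) + 3

# Fix the y-range up front so the second series is not clipped
yl <- range(x1, x2)
plot(x1, type = "l", col = "red", ylim = yl)
lines(x2, col = "blue")   # same effect as points(x2, type = "l")
legend("topleft", legend = c("x1", "x2"), col = c("red", "blue"), lty = 1)
```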
