Tuesday, September 07, 2010

Generating random numbers in R

Original link : http://blog.revolutionanalytics.com/2009/02/how-to-choose-a-random-number-in-r.html


Generate a random number between 5.0 and 7.5

If you want to generate a decimal number where any value (including fractional values) between the stated minimum and maximum is equally likely, use the runif function. This function generates values from the Uniform distribution. Here's how to generate one random number between 5.0 and 7.5:

> x1 <- runif(1, 5.0, 7.5)
> x1
[1] 6.715697

Of course, when you run this, you'll get a different number, but it will definitely be between 5.0 and 7.5. You won't get the values 5.0 or 7.5 exactly, either.

If you want to generate multiple random values, don't use a loop. You can generate several values at once by specifying the number of values you want as the first argument to runif. Here's how to generate 10 values between 5.0 and 7.5:

> x2 <- runif(10, 5.0, 7.5)
> x2
[1] 6.339188 5.311788 7.099009 5.746380 6.720383 7.433535 7.159988
[8] 5.047628 7.011670 7.030854

Generate a random integer between 1 and 10

This looks like the same exercise as the last one, but now we only want whole numbers, not fractional values. For that, we use the sample function:

> x3 <- sample(1:10, 1)
> x3
[1] 4

The first argument is a vector of valid numbers to generate (here, the numbers 1 to 10), and the second argument indicates one number should be returned. If we want to generate more than one random number, we have to add an additional argument to indicate that repeats are allowed:

> x4 <- sample(1:10, 5, replace=T)
> x4
[1] 6 9 7 6 5

Note the number 6 appears twice in the 5 numbers generated. (Here's a fun exercise: what is the probability of running this command and having no repeats in the 5 numbers generated?)

Select 6 random numbers between 1 and 40, without replacement

If you wanted to simulate the lotto game common to many countries, where you randomly select 6 balls from 40 (each labelled with a number from 1 to 40), you'd again use the sample function, but this time without replacement:

> x5 <- sample(1:40, 6, replace=F)
> x5
[1] 10 21 29 12 7 31

You'll get a different 6 numbers when you run this, but they'll all be between 1 and 40 (inclusive), and no number will repeat. Also, you don't actually need to include the replace=F option -- sampling without replacement is the default -- but it doesn't hurt to include it for clarity.

Select 10 items from a list of 50

You can use this same idea to generate a random subset of any vector, even one that doesn't contain numbers. For example, to select 10 distinct states of the US at random:

> sample(state.name, 10)
[1] "Virginia" "Oklahoma" "Maryland" "Michigan"
[5] "Alaska" "South Dakota" "Minnesota" "Idaho"
[9] "Indiana" "Connecticut"

You can't sample more values than you have without allowing replacements:

> sample(state.name, 52)
Error in sample(state.name, 52) :
cannot take a sample larger than the population when 'replace = FALSE'

... but sampling exactly the number you do have is a great way to randomize the order of a vector. Here are the 50 states of the US, in random order:

> sample(state.name, 50)
[1] "California" "Iowa" "Hawaii"
[4] "Montana" "South Dakota" "North Dakota"
[7] "Louisiana" "Maine" "Maryland"
[10] "New Hampshire" "Rhode Island" "Texas"
[13] "Florida" "North Carolina" "Minnesota"
[16] "Arkansas" "Pennsylvania" "Colorado"
[19] "Idaho" "Connecticut" "Utah"
[22] "South Carolina" "Illinois" "Ohio"
[25] "New Jersey" "Indiana" "Wisconsin"
[28] "Mississippi" "Michigan" "Wyoming"
[31] "West Virginia" "Alaska" "Georgia"
[34] "Vermont" "Virginia" "Oklahoma"
[37] "Washington" "New Mexico" "New York"
[40] "Delaware" "Nevada" "Alabama"
[43] "Kentucky" "Missouri" "Oregon"
[46] "Tennessee" "Arizona" "Massachusetts"
[49] "Kansas" "Nebraska"

You could also have just used sample(state.name) for the same result -- sampling as many values as provided is the default.

Further reading

For more information about how R generates random numbers, check out the following help pages:

> ?runif
> ?sample
> ?.Random.seed

The last of these provides technical detail on the random number generator R uses, and how you can set the random seed to recreate strings of random numbers.

Tuesday, July 20, 2010

Running Geneset enrichment analysis on commandline

java -cp /stor1/shah/Ruben_peptide/gsea_analysis/gsea2-2.06.jar
-Xmx2000m xtools.gsea.Gsea
-res foo_expression_values.gct
-cls foo_expression_values.cls#Normal_versus_Treatment
-gmx msigdb.v2.5.symbols.gmt -chip HG_U133_Plus_2.chip
-collapse true -mode Max_probe -norm meandiv -nperm 1000
-permute phenotype -rnd_type no_balance -scoring_scheme weighted
-rpt_label my_analysis -metric Signal2Noise -sort real -order descending -include_only_symbols true -make_sets true -median false -num 100
-plot_top_x 20 -rnd_seed timestamp -save_rnd_lists false -set_max 500 -set_min 15 -zip_report false -out /stor1/shah/Ruben_peptide/gsea_analysis -gui false

commandline for GSEA.

Monday, July 19, 2010

Significance of overlapping gene lists

Wen Fury and Wentian Li
http://www.nslij-genetics.org/wli/pub/ieee-embs06.pdf

To identify significance of overlap for two differentially expressed gene sets n1 and n2 (e.g. d1-n1 and d2-n1) use either hypergeometric or Fisher's exact test p-value.

Given integers n, n1, n2, m (max(n1,n2) <= n and m <= min (n1,n2)), the hypergeometric distribution is defined as

P(m) = [C(n1, m) * C (n - n1, n2 -m)]/ C (n, n2)

where C(n,m) is the number of possibilities of choosing m objects out of n objects : C (n,m) = n!/[m! (n -m)!]

It is usually more interesting to calculate the sum of P(m) for m's equal or larger than the observed value (i.e. p-value) :


p-value = Sigma [k= m to min (n1,n2)] p(k)
= Sigma [k = 0 to min (n1,n2)] p(k) - Sigma [k = 0 to m - 1] p(k)

For calculating it in R use :

if m = 0, p-value = 1

phyper (m, n1, n - n1, n2):
p-value = phyper(min(n1,n2), n1, n-n1, n2) - phyper(m-1, n1, n-n1, n2) if m > 0

One can also use Fisher's exact test on the following 2-by-2 table:

col1 col2 total
row1 m n1-m n1
row2 n2-m n-n1-n2+m n - n1
total n2 n-n2 n

They produce identical results.

Thursday, July 08, 2010

From a logical matrix to numerical matrix

Que : From a matrix of TRUE/FALSE get a matrix of 0 and 1
Ans : Multiply the logical matrix by self

Monday, April 26, 2010

Removing all NA rows and columns

Removing all NA rows and/or columns

fsFit[-which(apply(fsFit,1,function(x)all(is.na(x)))),-which(apply(fsFit,2,function(x)all(is.na(x))))]

Wednesday, April 21, 2010

extracting a percentage of data by random by groups

1) Randomly choose 10% of data from each "age" group.

> x <- data.frame(group=sample(1:4,100,TRUE), age=runif(100,4,80))
> tapply(x$age, x$group, function(z) mean(z[sample(seq_along(z), length(z) / 10)]))


2) To split my dataset randomly into 2 parts: a prediction set (with 2/3 of my data) and a validation set (with 1/3 of my data).

> x <- 1:100 # test data
> y <- split(x, sample(1:2, length(x), replace=TRUE, prob=c(1,2)))

3) I would like to randomly divide this data frame in half. how to select those rows that were not selected and assign them to randomsample2

selected<-rep(0,39622)
selected[sample(1:39622,39622/2)]<-1
data$selected<-selected
rm(selected)
or
data$selected<-rbinom(39622,1,.5)

extracting a percentage of data by random by groups

Motivating example:

If I have a dataframe with one of the variables called "age" for
example, and I want to extract a random 10% of the observations from
each "age" group of the entire data frame.

> set.seed(23) # on Windows
> dat <- data.frame(age = factor(sample(1:4, 200, rep = T)), y = runif(200))
> head(dat) # ages are in random order

age y
1 3 0.64275524
2 1 0.56125314
3 2 0.82418228
4 3 0.97050933
5 4 0.02827508
6 2 0.72291636

> with(dat, table(age)) # how many in each age group
age
1 2 3 4
37 55 44 64

> ind <- lapply(split(1:nrow(dat), dat$age),
function(x) sample(x, round(length(x)/10))) # the trick

> ind
$`1`
[1] 135 2 188 133

$`2`
[1] 124 33 140 162 25 13

$`3`
[1] 115 79 27 44

$`4`
[1] 58 129 84 198 72 109

> sample_dat <- dat[sort(unlist(ind)), ] # with indices, select data

> sample_dat
age y
2 1 0.5612531
13 2 0.7339141
25 2 0.9548750
27 3 0.7419931
33 2 0.6965722
44 3 0.5363812
58 4 0.5464051
72 4 0.2785669
79 3 0.6453164
84 4 0.1203811
109 4 0.9154706
115 3 0.2118767
124 2 0.3056171
129 4 0.7635097
133 1 0.6474702
135 1 0.2466226
140 2 0.6292326
162 2 0.5338671
188 1 0.9882631
198 4 0.1983350
>

Sunday, April 18, 2010

Extract rows from data frame based on row names from anotherdata frame

Found in google searches .. can be useful

#Create data and data frames
x=rnorm(5,0,1)
y=rnorm(5,0,1)
z=rnorm(5,0,1)
d1=data.frame(x,y)
d2=data.frame(y,z)

#which variable name in d2 is a variable name in d1?
names(d2[names(d2)%in%names(d1)]) # it's y

#give me the columns of d2 that have variable names
#that are also variable names in d1

d2[names(d2)==names(d2[names(d2)%in%names(d1)])]

#check
d2$y

# continuing with example:

rownames(d1)<- letters[1:5]
rownames(d2)<- letters[3:7]

# and then for rownames of d1 that are also in rownames of d2:
# for the full rows ...

d1[row.names(d1) %in% row.names(d2),]

# or for just the names:

rownames(d1)[row.names(d1) %in% row.names(d2)]

Friday, March 05, 2010

Merge two data frames and order according to a column...

find_annotations_for_non_overlaping_probes <- function
(probeList, annotations, expressedList1) {

annotationIdx <- match(probeList, annotations[,1])
pvalueIdx1 <- match (probeList, expressedList1[,4])

id1 <- as.data.frame(expressedList1[pvalueIdx1,])
id3 <- as.data.frame(annotations[annotationIdx,])

rownames (id1) <- probeList ;
colnames(id1) <- c("rawp", "adjp", "index", "probeId")
rownames (id3) <- probeList

resultOut <- data.frame(merge(id1, id3, by = "row.names"), row.names = 1)
resultOut.sorted <- resultOut[do.call(order, resultOut[1]),]
}

Wednesday, March 03, 2010

Writing R-plots into different devices

pdf("foo.pdf")

plot(x)
dev.off()

Other possibilities are jpeg(), tiff(), postscript() etc.

Setting up aroma.affymetrix for analysis

Setting up aroma.affymetrix for analysis two directories : rawData and annotationData
from where aroma is launched

rawData/cancer_name/HuGene-1_0-st-v1/*.CEL
annotationData/chipTypes/HuGene-1_0-st-v1/*.cdf
annotationData/chipTypes/HuGene-1_0-st-v1/NetAffx/*.csv

Wednesday, January 13, 2010

RLE AND NUSE plots

Affy QC plots for exon arrays ...

One method of deciding whether or not an array is problematics from a quality standpoint is NUSE. The goal of NUSE is to identify any arrays which have elevated standard errors relative to other arrays in the dataset. This is done by standardizing the SE across arrays to have median 1 for each probeset. Our graphical tool consists of boxplots of these quantities for each array. A discordant boxplot indicates it is of poorer quality relative to the rest of the dataset. Instead of visually examining these quantities suitable numerical summaries such as the median and IQR NUSE could be used.

Another tool for making a decision about whether an array should be removed from subsequent analysis because of poor quality is RLE. These are the log-scale expression values relative to the median expression value computed on a probeset by probeset basis. a significantly different boxplot indicates problem.

aroma.affymetrix and other lots of packages allows to plot RLE and NUSE.

The NUSE is generally considered more sensitive than the RLE.