Original link : http://blog.revolutionanalytics.com/2009/02/how-to-choose-a-random-number-in-r.html
Generate a random number between 5.0 and 7.5
If you want to generate a decimal number where any value (including fractional values) between the stated minimum and maximum is equally likely, use the runif function. This function generates values from the Uniform distribution. Here's how to generate one random number between 5.0 and 7.5:
> x1 <- runif(1, 5.0, 7.5)
> x1
[1] 6.715697
Of course, when you run this, you'll get a different number, but it will definitely be between 5.0 and 7.5. You won't get the values 5.0 or 7.5 exactly, either.
If you want to generate multiple random values, don't use a loop. You can generate several values at once by specifying the number of values you want as the first argument to runif. Here's how to generate 10 values between 5.0 and 7.5:
> x2 <- runif(10, 5.0, 7.5)
> x2
[1] 6.339188 5.311788 7.099009 5.746380 6.720383 7.433535 7.159988
[8] 5.047628 7.011670 7.030854
Generate a random integer between 1 and 10
This looks like the same exercise as the last one, but now we only want whole numbers, not fractional values. For that, we use the sample function:
> x3 <- sample(1:10, 1)
> x3
[1] 4
The first argument is a vector of valid numbers to generate (here, the numbers 1 to 10), and the second argument indicates one number should be returned. If we want to generate more than one random number, we have to add an additional argument to indicate that repeats are allowed:
> x4 <- sample(1:10, 5, replace=T)
> x4
[1] 6 9 7 6 5
Note the number 6 appears twice in the 5 numbers generated. (Here's a fun exercise: what is the probability of running this command and having no repeats in the 5 numbers generated?)
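(If you want to check your answer to that exercise, here's one way to compute it, both exactly and by simulation; the 100,000 simulation runs are an arbitrary choice:)
> prod(10:6) / 10^5      # exact: 10 * 9 * 8 * 7 * 6 out of 10^5 ordered draws
[1] 0.3024
> mean(replicate(100000, !any(duplicated(sample(1:10, 5, replace=T)))))   # should come out close to 0.3024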
Select 6 random numbers between 1 and 40, without replacement
If you wanted to simulate the lotto game common to many countries, where you randomly select 6 balls from 40 (each labelled with a number from 1 to 40), you'd again use the sample function, but this time without replacement:
> x5 <- sample(1:40, 6, replace=F)
> x5
[1] 10 21 29 12 7 31
You'll get a different 6 numbers when you run this, but they'll all be between 1 and 40 (inclusive), and no number will repeat. Also, you don't actually need to include the replace=F option -- sampling without replacement is the default -- but it doesn't hurt to include it for clarity.
Select 10 items from a list of 50
You can use this same idea to generate a random subset of any vector, even one that doesn't contain numbers. For example, to select 10 distinct states of the US at random:
> sample(state.name, 10)
[1] "Virginia" "Oklahoma" "Maryland" "Michigan"
[5] "Alaska" "South Dakota" "Minnesota" "Idaho"
[9] "Indiana" "Connecticut"
You can't sample more values than you have without allowing replacements:
> sample(state.name, 52)
Error in sample(state.name, 52) :
cannot take a sample larger than the population when 'replace = FALSE'
... but sampling exactly the number you do have is a great way to randomize the order of a vector. Here are the 50 states of the US, in random order:
> sample(state.name, 50)
[1] "California" "Iowa" "Hawaii"
[4] "Montana" "South Dakota" "North Dakota"
[7] "Louisiana" "Maine" "Maryland"
[10] "New Hampshire" "Rhode Island" "Texas"
[13] "Florida" "North Carolina" "Minnesota"
[16] "Arkansas" "Pennsylvania" "Colorado"
[19] "Idaho" "Connecticut" "Utah"
[22] "South Carolina" "Illinois" "Ohio"
[25] "New Jersey" "Indiana" "Wisconsin"
[28] "Mississippi" "Michigan" "Wyoming"
[31] "West Virginia" "Alaska" "Georgia"
[34] "Vermont" "Virginia" "Oklahoma"
[37] "Washington" "New Mexico" "New York"
[40] "Delaware" "Nevada" "Alabama"
[43] "Kentucky" "Missouri" "Oregon"
[46] "Tennessee" "Arizona" "Massachusetts"
[49] "Kansas" "Nebraska"
You could also have just used sample(state.name) for the same result -- sampling as many values as provided is the default.
Further reading
For more information about how R generates random numbers, check out the following help pages:
> ?runif
> ?sample
> ?.Random.seed
The last of these provides technical detail on the random number generators R uses, and on how you can set the random seed to recreate the same sequence of random numbers.
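For example, here's a small sketch of how setting the seed makes a sequence reproducible (the seed 42 is an arbitrary choice):
> set.seed(42)
> a <- runif(3, 5.0, 7.5)
> set.seed(42)            # same seed again...
> b <- runif(3, 5.0, 7.5)
> identical(a, b)         # ...gives exactly the same three numbers
[1] TRUE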
Tuesday, July 20, 2010
Running Gene Set Enrichment Analysis (GSEA) from the command line
java -cp /stor1/shah/Ruben_peptide/gsea_analysis/gsea2-2.06.jar \
  -Xmx2000m xtools.gsea.Gsea \
  -res foo_expression_values.gct \
  -cls foo_expression_values.cls#Normal_versus_Treatment \
  -gmx msigdb.v2.5.symbols.gmt -chip HG_U133_Plus_2.chip \
  -collapse true -mode Max_probe -norm meandiv -nperm 1000 \
  -permute phenotype -rnd_type no_balance -scoring_scheme weighted \
  -rpt_label my_analysis -metric Signal2Noise -sort real -order descending \
  -include_only_symbols true -make_sets true -median false -num 100 \
  -plot_top_x 20 -rnd_seed timestamp -save_rnd_lists false \
  -set_max 500 -set_min 15 -zip_report false \
  -out /stor1/shah/Ruben_peptide/gsea_analysis -gui false
Command line for GSEA.
Monday, July 19, 2010
Significance of overlapping gene lists
Wen Fury and Wentian Li
http://www.nslij-genetics.org/wli/pub/ieee-embs06.pdf
To assess the significance of the overlap between two differentially expressed gene lists of sizes n1 and n2 (e.g. d1-n1 and d2-n1), use either the hypergeometric or Fisher's exact test p-value.
Given integers n, n1, n2 and m (with max(n1, n2) <= n and m <= min(n1, n2)), the hypergeometric probability of observing an overlap of exactly m is
P(m) = [C(n1, m) * C(n - n1, n2 - m)] / C(n, n2)
where C(n, m) = n! / [m! (n - m)!] is the number of ways of choosing m objects out of n.
It is usually more interesting to calculate the probability of an overlap at least as large as the observed value, i.e. the p-value:
p-value = Sum[k = m to min(n1, n2)] P(k)
        = Sum[k = 0 to min(n1, n2)] P(k) - Sum[k = 0 to m - 1] P(k)
To calculate it in R, use phyper(q, n1, n - n1, n2), which gives the cumulative probability of an overlap of at most q:
if m = 0:  p-value = 1
if m > 0:  p-value = phyper(min(n1,n2), n1, n-n1, n2) - phyper(m-1, n1, n-n1, n2)
           (equivalently 1 - phyper(m-1, n1, n-n1, n2), since the first term equals 1)
One can also use Fisher's exact test on the following 2-by-2 table:
            col1        col2               total
row1        m           n1 - m             n1
row2        n2 - m      n - n1 - n2 + m    n - n1
total       n2          n - n2             n
The hypergeometric calculation and the one-sided Fisher's exact test produce identical results.
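For a concrete calculation in R, here is a minimal sketch with made-up numbers (the values of n, n1, n2 and m below are hypothetical):
n  <- 20000; n1 <- 300; n2 <- 500; m <- 25                     # background size, list sizes, observed overlap
p.hyper <- phyper(m - 1, n1, n - n1, n2, lower.tail = FALSE)   # P(overlap >= m)
tab <- matrix(c(m, n1 - m, n2 - m, n - n1 - n2 + m), nrow = 2, byrow = TRUE)
p.fisher <- fisher.test(tab, alternative = "greater")$p.value  # one-sided Fisher's exact test
c(p.hyper, p.fisher)                                           # the two p-values agree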
Thursday, July 08, 2010
From a logical matrix to numerical matrix
Que: Given a matrix of TRUE/FALSE values, how do you get a matrix of 0s and 1s?
Ans: Multiply the logical matrix by itself (element-wise).
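A quick sketch with a toy matrix (multiplying by 1 gives the same result):
m <- matrix(c(TRUE, FALSE, FALSE, TRUE), nrow = 2)   # a small logical matrix
m * m    # element-wise multiplication turns it into a numeric 0/1 matrix
m * 1    # multiplying by 1 (or adding 0: m + 0) does the same thing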
Monday, April 26, 2010
Removing all NA rows and columns
# Drop rows and columns that are entirely NA; logical indexing avoids the pitfall
# of -which(...) selecting nothing when no row (or column) is all NA:
fsFit[!apply(fsFit, 1, function(x) all(is.na(x))), !apply(fsFit, 2, function(x) all(is.na(x)))]
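A toy example of the idiom (fsFit here is a made-up matrix with one all-NA row and one all-NA column):
fsFit <- matrix(1:9, nrow = 3)
fsFit[2, ] <- NA    # make row 2 entirely NA
fsFit[, 3] <- NA    # make column 3 entirely NA
fsFit[!apply(fsFit, 1, function(x) all(is.na(x))), !apply(fsFit, 2, function(x) all(is.na(x)))]   # a 2 x 2 matrix remains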
Wednesday, April 21, 2010
Extracting a random percentage of data by group
1) Randomly choose 10% of the data from each group (here the grouping variable is called "group", and the mean age of the sampled rows is computed for each group):
> x <- data.frame(group=sample(1:4,100,TRUE), age=runif(100,4,80))
> tapply(x$age, x$group, function(z) mean(z[sample(seq_along(z), length(z) / 10)]))
2) Split a dataset randomly into two parts: a prediction set (with 2/3 of the data) and a validation set (with 1/3 of the data):
> x <- 1:100 # test data
> y <- split(x, sample(1:2, length(x), replace=TRUE, prob=c(1,2)))
3) Randomly divide a data frame in half, flagging the selected rows so that the rows that were not selected can be assigned to a second sample (randomsample2):
selected <- rep(0, nrow(data))                            # one 0/1 flag per row (the original post hard-coded 39622 rows)
selected[sample(1:nrow(data), floor(nrow(data)/2))] <- 1  # mark a random half
data$selected <- selected
rm(selected)
or, for an approximate 50/50 split:
data$selected <- rbinom(nrow(data), 1, .5)
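With the flag in place, the two halves can be pulled out like this (a small sketch; data and selected come from the snippet above):
randomsample1 <- data[data$selected == 1, ]   # the randomly selected half
randomsample2 <- data[data$selected == 0, ]   # the rows that were not selected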
Motivating example: given a data frame with a variable called "age", extract a random 10% of the observations from each "age" group.
> set.seed(23) # on Windows
> dat <- data.frame(age = factor(sample(1:4, 200, rep = T)), y = runif(200))
> head(dat) # ages are in random order
age y
1 3 0.64275524
2 1 0.56125314
3 2 0.82418228
4 3 0.97050933
5 4 0.02827508
6 2 0.72291636
> with(dat, table(age)) # how many in each age group
age
1 2 3 4
37 55 44 64
> ind <- lapply(split(1:nrow(dat), dat$age),
+               function(x) sample(x, round(length(x)/10))) # the trick
> ind
$`1`
[1] 135 2 188 133
$`2`
[1] 124 33 140 162 25 13
$`3`
[1] 115 79 27 44
$`4`
[1] 58 129 84 198 72 109
> sample_dat <- dat[sort(unlist(ind)), ] # with indices, select data
> sample_dat
age y
2 1 0.5612531
13 2 0.7339141
25 2 0.9548750
27 3 0.7419931
33 2 0.6965722
44 3 0.5363812
58 4 0.5464051
72 4 0.2785669
79 3 0.6453164
84 4 0.1203811
109 4 0.9154706
115 3 0.2118767
124 2 0.3056171
129 4 0.7635097
133 1 0.6474702
135 1 0.2466226
140 2 0.6292326
162 2 0.5338671
188 1 0.9882631
198 4 0.1983350
>
Sunday, April 18, 2010
Extract rows from a data frame based on row names from another data frame
Found in Google searches; it can be useful.
#Create data and data frames
x=rnorm(5,0,1)
y=rnorm(5,0,1)
z=rnorm(5,0,1)
d1=data.frame(x,y)
d2=data.frame(y,z)
#which variable name in d2 is a variable name in d1?
names(d2[names(d2)%in%names(d1)]) # it's y
#give me the columns of d2 that have variable names
#that are also variable names in d1
d2[names(d2)==names(d2[names(d2)%in%names(d1)])]
#check
d2$y
# continuing with example:
rownames(d1)<- letters[1:5]
rownames(d2)<- letters[3:7]
# and then for rownames of d1 that are also in rownames of d2:
# for the full rows ...
d1[row.names(d1) %in% row.names(d2),]
# or for just the names:
rownames(d1)[row.names(d1) %in% row.names(d2)]
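As an aside, the column-selection step above can be written more compactly; this shorter form is equivalent:
d2[names(d2) %in% names(d1)]   # columns of d2 whose names also occur in d1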
Friday, March 05, 2010
Merge two data frames and order according to a column...
find_annotations_for_non_overlaping_probes <- function(probeList, annotations, expressedList1) {
  # look up each probe in the annotation table and in the p-value table
  annotationIdx <- match(probeList, annotations[, 1])
  pvalueIdx1 <- match(probeList, expressedList1[, 4])
  id1 <- as.data.frame(expressedList1[pvalueIdx1, ])
  id3 <- as.data.frame(annotations[annotationIdx, ])
  rownames(id1) <- probeList
  colnames(id1) <- c("rawp", "adjp", "index", "probeId")
  rownames(id3) <- probeList
  # merge on row names (the probe IDs), then sort by the first column (raw p-value)
  resultOut <- data.frame(merge(id1, id3, by = "row.names"), row.names = 1)
  resultOut.sorted <- resultOut[do.call(order, resultOut[1]), ]
  resultOut.sorted   # return the sorted, annotated table
}
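For reference, a minimal self-contained sketch of the merge-by-row-names-then-order idiom the function relies on (the data frames a and b and their contents are made up):
a <- data.frame(rawp = c(0.03, 0.50, 0.01), row.names = c("p1", "p2", "p3"))
b <- data.frame(symbol = c("TP53", "EGFR", "MYC"), row.names = c("p1", "p2", "p3"))
merged <- data.frame(merge(a, b, by = "row.names"), row.names = 1)   # merge on row names
merged[order(merged$rawp), ]   # order the merged table by the rawp column, smallest first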
Wednesday, March 03, 2010
Writing R-plots into different devices
pdf("foo.pdf")
plot(x)
dev.off()
Other possibilities are jpeg(), tiff(), postscript() etc.
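Along the same lines, a sketch for writing to a bitmap device (the file name and pixel dimensions are arbitrary, and x is assumed to exist already):
png("foo.png", width = 800, height = 600)
plot(x)
dev.off()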
Setting up aroma.affymetrix for analysis
Setting up aroma.affymetrix for analysis requires two directories, rawData and annotationData, under the directory from which aroma is launched:
rawData/cancer_name/HuGene-1_0-st-v1/*.CEL
annotationData/chipTypes/HuGene-1_0-st-v1/*.cdf
annotationData/chipTypes/HuGene-1_0-st-v1/NetAffx/*.csv
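A sketch of creating that layout from R (cancer_name is a placeholder for your own dataset name):
dir.create("rawData/cancer_name/HuGene-1_0-st-v1", recursive = TRUE)
dir.create("annotationData/chipTypes/HuGene-1_0-st-v1/NetAffx", recursive = TRUE)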
Labels: Affy_analysis, aroma.affymetrix, R-project
Wednesday, January 13, 2010
RLE AND NUSE plots
Affy QC plots for exon arrays ...
One method of deciding whether an array is problematic from a quality standpoint is NUSE (Normalized Unscaled Standard Error). The goal of NUSE is to identify arrays that have elevated standard errors relative to the other arrays in the dataset. This is done by standardizing the SE across arrays so that each probeset has a median of 1. The graphical tool is a boxplot of these quantities for each array; a discordant boxplot indicates an array of poorer quality relative to the rest of the dataset. Instead of visually examining these quantities, suitable numerical summaries of NUSE, such as the median and IQR, can be used.
Another tool for deciding whether an array should be removed from subsequent analysis because of poor quality is RLE (Relative Log Expression). These are the log-scale expression values relative to the median expression value, computed probeset by probeset; a markedly different boxplot indicates a problem.
aroma.affymetrix and many other packages can plot RLE and NUSE.
NUSE is generally considered more sensitive than RLE.
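As one concrete example, the affyPLM Bioconductor package produces both plots; a minimal sketch, assuming an AffyBatch object called abatch is already loaded:
library(affyPLM)
pset <- fitPLM(abatch)   # fit probe-level models to the raw data
NUSE(pset)               # boxplots of normalized unscaled standard errors
RLE(pset)                # boxplots of relative log expression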
Labels: Affy_analysis, alternative splicing, normalization, R-project