Wednesday, April 21, 2010

Extracting a random percentage of data by group

Motivating example:

Suppose I have a data frame with a variable called "age", and I want
to extract a random 10% of the observations from each "age" group of
the data frame.

> set.seed(23) # on Windows
> dat <- data.frame(age = factor(sample(1:4, 200, replace = TRUE)), y = runif(200))
> head(dat) # ages are in random order

  age          y
1   3 0.64275524
2   1 0.56125314
3   2 0.82418228
4   3 0.97050933
5   4 0.02827508
6   2 0.72291636

> with(dat, table(age)) # how many in each age group
age
 1  2  3  4
37 55 44 64

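> # the trick: split the row indices by age group, then sample
> # about 10% of the indices within each group
> ind <- lapply(split(1:nrow(dat), dat$age),
+               function(x) sample(x, round(length(x)/10)))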

> ind
$`1`
[1] 135   2 188 133

$`2`
[1] 124  33 140 162  25  13

$`3`
[1] 115  79  27  44

$`4`
[1]  58 129  84 198  72 109
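
One caveat with the split/sample step above: if a group ever contains a single row, `sample(x, n)` treats that lone index as the upper bound of `1:x` rather than as the only candidate. A minimal sketch of a wrapper that avoids this by sampling positions with `sample.int()` follows. The function name `sample_frac_by_group` and its arguments are hypothetical, introduced here for illustration and not part of the original post.

```r
# Hypothetical helper: keep roughly `frac` of the rows within each level
# of the grouping column `group` (given as a column name).
sample_frac_by_group <- function(data, group, frac = 0.1) {
  # row indices of `data`, split by group
  idx <- split(seq_len(nrow(data)), data[[group]])
  picked <- lapply(idx, function(x) {
    n <- round(length(x) * frac)
    # sample positions within x, not the values of x, so that a
    # single-row group is not expanded to 1:x
    x[sample.int(length(x), n)]
  })
  # return the selected rows in their original order
  data[sort(unlist(picked, use.names = FALSE)), , drop = FALSE]
}

set.seed(23)  # the exact draws depend on the R version
dat <- data.frame(age = factor(sample(1:4, 200, replace = TRUE)), y = runif(200))
sample_dat <- sample_frac_by_group(dat, "age", frac = 0.1)
```

For groups with more than one row this should select rows the same way as the `lapply()` call above; it only behaves differently in the single-row case.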

> sample_dat <- dat[sort(unlist(ind)), ] # select the sampled rows, in original row order

> sample_dat
    age         y
2     1 0.5612531
13    2 0.7339141
25    2 0.9548750
27    3 0.7419931
33    2 0.6965722
44    3 0.5363812
58    4 0.5464051
72    4 0.2785669
79    3 0.6453164
84    4 0.1203811
109   4 0.9154706
115   3 0.2118767
124   2 0.3056171
129   4 0.7635097
133   1 0.6474702
135   1 0.2466226
140   2 0.6292326
162   2 0.5338671
188   1 0.9882631
198   4 0.1983350
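
As a quick check on the result, the sketch below tabulates the group sizes before and after sampling. It rebuilds `dat` and `sample_dat` as in the post; `check` is a name introduced here for illustration. Because the per-group sample size is rounded, the realized fraction is only approximately 10%: with the group counts shown above (37, 55, 44, 64), samples of 4, 6, 4 and 6 rows work out to roughly 10.8%, 10.9%, 9.1% and 9.4%.

```r
set.seed(23)  # the exact draws depend on the R version
dat <- data.frame(age = factor(sample(1:4, 200, replace = TRUE)), y = runif(200))
ind <- lapply(split(seq_len(nrow(dat)), dat$age),
              function(x) sample(x, round(length(x) / 10)))
sample_dat <- dat[sort(unlist(ind)), ]

# per-group sizes in the full data and in the sample,
# plus the fraction of each group that was actually kept
check <- data.frame(
  age      = levels(dat$age),
  n_total  = as.vector(table(dat$age)),
  n_sample = as.vector(table(sample_dat$age))
)
check$fraction <- check$n_sample / check$n_total
print(check)
```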