At par with me: Significance of overlapping gene lists

Monday, July 19, 2010

Wen Fury and Wentian Li
http://www.nslij-genetics.org/wli/pub/ieee-embs06.pdf

To assess the significance of the overlap between two differentially expressed gene lists of sizes n1 and n2 (e.g. d1-n1 and d2-n1), use either the hypergeometric p-value or the Fisher's exact test p-value.

Given integers n, n1, n2, m (max(n1,n2) <= n and m <= min(n1,n2)), where n is the total number of genes and m is the number of genes shared by the two lists, the hypergeometric distribution is defined as

P(m) = [C(n1, m) * C(n - n1, n2 - m)] / C(n, n2)

where C(n, m) is the number of ways of choosing m objects out of n objects: C(n, m) = n! / [m! (n - m)!]
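
As a minimal sketch of the formula above, P(m) can be evaluated directly with R's `choose()` and compared with the built-in `dhyper()`. The helper name `overlap_prob` and the numbers are hypothetical, chosen only for illustration; they are not taken from the paper.

```r
# Hypothetical helper: point probability P(m) written exactly as the formula above.
overlap_prob <- function(m, n, n1, n2) {
  choose(n1, m) * choose(n - n1, n2 - m) / choose(n, n2)
}

# Illustrative numbers only (assumed, not from the paper):
# n genes in total, lists of size n1 and n2, m genes in common.
n <- 1000; n1 <- 100; n2 <- 50; m <- 10

p_formula <- overlap_prob(m, n, n1, n2)

# dhyper(x, m, n, k): x = overlap, m = n1 genes in list 1,
# n = n - n1 genes not in list 1, k = n2 genes drawn for list 2.
p_builtin <- dhyper(m, n1, n - n1, n2)

# The two values should agree up to floating-point tolerance.
print(all.equal(p_formula, p_builtin))
```

For large n the binomial coefficients can overflow double precision, so `dhyper()` is likely the safer choice in practice.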

It is usually more interesting to calculate the sum of P(k) over all overlaps k equal to or larger than the observed value m (i.e. the p-value):

p-value = Sigma [k = m to min(n1,n2)] P(k)
        = Sigma [k = 0 to min(n1,n2)] P(k) - Sigma [k = 0 to m - 1] P(k)

To calculate it in R, use phyper(q, n1, n - n1, n2), which returns the cumulative probability Sigma [k = 0 to q] P(k):

p-value = 1 if m = 0
p-value = phyper(min(n1,n2), n1, n-n1, n2) - phyper(m-1, n1, n-n1, n2) if m > 0
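
The two cases above can be wrapped in a small function. This is a sketch, and the name `overlap_pvalue` is hypothetical. Because phyper(min(n1,n2), n1, n-n1, n2) sums the whole distribution and therefore equals 1, the sketch uses the `lower.tail = FALSE` argument of `phyper()` to get the upper tail directly, which avoids subtracting two numbers close to 1 when the p-value is very small.

```r
# Hypothetical helper: P(overlap >= m) under the hypergeometric null.
overlap_pvalue <- function(m, n, n1, n2) {
  if (m == 0) {
    return(1)  # an overlap of at least 0 is certain
  }
  # P(X >= m) = P(X > m - 1), i.e. the upper tail starting above m - 1
  phyper(m - 1, n1, n - n1, n2, lower.tail = FALSE)
}

# Illustrative numbers only (assumed, not from the paper)
p_tail <- overlap_pvalue(10, n = 1000, n1 = 100, n2 = 50)

# Same quantity written as in the text: full sum minus lower part
p_text <- phyper(min(100, 50), 100, 900, 50) - phyper(10 - 1, 100, 900, 50)

print(all.equal(p_tail, p_text))
```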

One can also use Fisher's exact test on the following 2-by-2 table:

         col1    col2         total
row1     m       n1-m         n1
row2     n2-m    n-n1-n2+m    n-n1
total    n2      n-n2         n

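The one-sided Fisher's exact test (alternative "greater") on this table gives the same p-value as the hypergeometric sum. The default two-sided test generally does not.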
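
A short check of this equivalence, assuming the same kind of illustrative numbers (not from the paper), builds the 2-by-2 table and compares `fisher.test()` with `phyper()`:

```r
# Illustrative numbers only (assumed, not from the paper)
n <- 1000; n1 <- 100; n2 <- 50; m <- 10

# matrix() fills column by column, so this reproduces the table:
# row1 = (m, n1 - m), row2 = (n2 - m, n - n1 - n2 + m)
tab <- matrix(c(m, n2 - m, n1 - m, n - n1 - n2 + m), nrow = 2)

# One-sided test: is the overlap larger than expected by chance?
fisher_p <- fisher.test(tab, alternative = "greater")$p.value

# Hypergeometric upper tail P(X >= m)
hyper_p <- phyper(m - 1, n1, n - n1, n2, lower.tail = FALSE)

# The two p-values should agree up to floating-point tolerance.
print(all.equal(fisher_p, hyper_p))
```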

1 comment:

MichaelAngelo said...: Hi!
I saw your post on calculating the significance of overlapping gene sets. I have a similar problem: I have two sets of genomic regions bound by two different transcription factors. TF1 binds 12000 regions and TF2 binds 13000 regions, of which 4000 are bound by both. Is there a way to calculate a p-value or z-value for this overlap? Thank you for your time. Kindly reply to my email ID abhisheksingnl@gmail.com.
9:01 AM
