Thursday, December 04, 2008
Sunday, November 16, 2008
running MACS for Drosophila
This is how to run MACS for ChIP-seq data analysis.
effective genome size is for Drosophila melanogaster and pvalue is chosen.
macs -t 100014_modENCODE_CTCF_IP_s_1_filtered_eland_result.txt -c 100014_modENCODE_CTCF_Input_s_2_filtered_eland_result.txt --format=ELAND --gsize=120000000 --tsize=36 --pvalue=1e-3 --bw=250 --mfold=7 --name=CTCF_mf7_100014_14 --wig --diag
effective genome size is for Drosophila melanogaster and pvalue is chosen.
macs -t 100014_modENCODE_CTCF_IP_s_1_filtered_eland_result.txt -c 100014_modENCODE_CTCF_Input_s_2_filtered_eland_result.txt --format=ELAND --gsize=120000000 --tsize=36 --pvalue=1e-3 --bw=250 --mfold=7 --name=CTCF_mf7_100014_14 --wig --diag
Wednesday, November 05, 2008
Sunday, September 28, 2008
changing newline character from Mac to *nix
#!/usr/bin/perl
while (<>) {
$_ =~ s/\015/\012/gs;
print $_;
}
while (<>) {
$_ =~ s/\015/\012/gs;
print $_;
}
Thursday, September 18, 2008
New Pakistan Govt and a new prophecy
I predicted on May 22 2007 that Pervez Mussharraf presidency in Pakistan has only few days to count. It did happen 9 months later. A lot of water has passed under the bridge since then. This included assassination of Benzair Bhutto, removel of Mussharaf as an army chief as well as president and election of Zardari as a new president with PPP in the helm of Pakistani government.
The Mr. 10% would certainly be supporting US policies in the south east asia. But he has tough task ahead with an economic meltdown in US as well as pakistani economy, resurgence of Taliban and slow but steady march of Pakistan to be the failed state.
With the elections due in US, if Barack Obama becomes president, Pakistan would have a tough time stoping the terrorists to enter Indian valley of Kashmir. Under McCain administration, one can assume the same failed Bush policies to continue.
Future of Pakistan under Zardari.
Option 1: Economy metls down completely and further Islamization of Pakistan and assassination of the president/prime minister.
Option 2: I don't see it ! I keep watching after the new US president swrons in.
The Mr. 10% would certainly be supporting US policies in the south east asia. But he has tough task ahead with an economic meltdown in US as well as pakistani economy, resurgence of Taliban and slow but steady march of Pakistan to be the failed state.
With the elections due in US, if Barack Obama becomes president, Pakistan would have a tough time stoping the terrorists to enter Indian valley of Kashmir. Under McCain administration, one can assume the same failed Bush policies to continue.
Future of Pakistan under Zardari.
Option 1: Economy metls down completely and further Islamization of Pakistan and assassination of the president/prime minister.
Option 2: I don't see it ! I keep watching after the new US president swrons in.
Monday, August 18, 2008
xMAN commandline
Here is the xMAN how to from Tao Liu
Text formatted bpmap may not be processed by MAT directly. You need to convert your text bpmap to a so-called 'plainseq' format. Please type 'xMAN' without any argument to see the usage for detail.
And here is the commandline used to generate bpmap for fruitfly:
xMAN -f bpmap -i Dm_tiling2_MR_v01.bpmap -s Dm_tiling2_MR_v01.stderr -o Dm_tiling2_MR_v01_dm3 --nro=Dm_tiling2_MR_v01_dm3_all.NR --maxSeqCopy=10 --ProbeResolution=35 dm3/all/*.fa
The input file is a binary bpmap downloaded from affy website. And I put all the genome sequence files in dm3/all/ directory. Please remember to add '-nro' argument. After the run, the file 'Dm_tiling2_MR_v01_dm3_all.NR.bpmap' can be used by MAT.
Text formatted bpmap may not be processed by MAT directly. You need to convert your text bpmap to a so-called 'plainseq' format. Please type 'xMAN' without any argument to see the usage for detail.
And here is the commandline used to generate bpmap for fruitfly:
xMAN -f bpmap -i Dm_tiling2_MR_v01.bpmap -s Dm_tiling2_MR_v01.stderr -o Dm_tiling2_MR_v01_dm3 --nro=Dm_tiling2_MR_v01_dm3_all.NR --maxSeqCopy=10 --ProbeResolution=35 dm3/all/*.fa
The input file is a binary bpmap downloaded from affy website. And I put all the genome sequence files in dm3/all/ directory. Please remember to add '-nro' argument. After the run, the file 'Dm_tiling2_MR_v01_dm3_all.NR.bpmap' can be used by MAT.
Friday, June 20, 2008
Making MAT RepeatLib
Download the repeat_mask.txt, simple_repeat.txt and segment_dups.txt from UCSC
and run
python Rep.py -m repeat_mask.txt simple_repeat.txt segment_dups.txt
The Rep.py is available at the MAT lib subdirectory of the MAT install.
For the genomes like Drosophila melanogaster, for which UCSC doesn't provide Segment Duplication file. The repeat library can be generated by
1) remove all the instances of usage of "segdup" from Rep.py file
2) remove the original Rep.py and Rep.pyc files form MAT compile.
3) Recompile MAT.
RepeatLib and the modified Rep.py files are also available on request from me.
parantu dot shah at gmail dot com
and run
python Rep.py -m repeat_mask.txt simple_repeat.txt segment_dups.txt
The Rep.py is available at the MAT lib subdirectory of the MAT install.
For the genomes like Drosophila melanogaster, for which UCSC doesn't provide Segment Duplication file. The repeat library can be generated by
1) remove all the instances of usage of "segdup" from Rep.py file
2) remove the original Rep.py and Rep.pyc files form MAT compile.
3) Recompile MAT.
RepeatLib and the modified Rep.py files are also available on request from me.
parantu dot shah at gmail dot com
Friday, June 06, 2008
BLAT PSL coordinates
.psl files
A .psl file describes a series of alignments in a dense easily parsed text format. It begins with a five line header which describes each field. Following this is one line for each alignment with a tab between each field. The fields are describe below in a format suitable for many relational databases.
matches int unsigned , # Number of bases that match that aren't repeats
misMatches int unsigned , # Number of bases that don't match
repMatches int unsigned , # Number of bases that match but are part of repeats
nCount int unsigned , # Number of 'N' bases
qNumInsert int unsigned , # Number of inserts in query
qBaseInsert int unsigned , # Number of bases inserted in query
tNumInsert int unsigned , # Number of inserts in target
tBaseInsert int unsigned , # Number of bases inserted in target
strand char(2) , # + or - for query strand, optionally followed by + or – for target strand
qName varchar(255) , # Query sequence name
qSize int unsigned , # Query sequence size
qStart int unsigned , # Alignment start position in query
qEnd int unsigned , # Alignment end position in query
tName varchar(255) , # Target sequence name
tSize int unsigned , # Target sequence size
tStart int unsigned , # Alignment start position in target
tEnd int unsigned , # Alignment end position in target
blockCount int unsigned , # Number of blocks in alignment. A block contains no gaps.
blockSizes longblob , # Size of each block in a comma separated list
qStarts longblob , # Start of each block in query in a comma separated list
tStarts longblob , # Start of each block in target in a comma separated list
In general the coordinates in psl files are “zero based half open.” The first base in a sequence is numbered zero rather than one. When representing a range the end coordinate is not included in the range. Thus the first 100 bases of a sequence are represented as 0-100, and the second 100 bases are represented as 100-200. There is a another little unusual feature in the .psl format. It has to do with how coordinates are handled on the negative strand. In the qStart/qEnd fields the coordinates are where it matches from the point of view of the forward strand (even when the match is on the reverse strand). However on the qStarts[] list, the coordinates are reversed.
Here's an example of a 30-mer that has 2 blocks that align on the minus strand and 2 blocks on the plus strand (this sort of stuff happens in real life in response to assembly errors sometimes).
0 1 2 3 tens position in query
0123456789012345678901234567890 ones position in query
++++ +++++ plus strand alignment on query
-------- ---------- minus strand alignment on query
Plus strand:
qStart 12 qEnd 31 blockSizes 4,5 qStarts 12,26
Minus strand:
qStart 4 qEnd 26 blockSizes 10,8 qStarts 5,19
Essentially the minus strand blockSizes and qStarts are what you would get if you reverse complemented the query.However the qStart and qEnd are non-reversed. To get from one to the other:
qStart = qSize - revQEnd
qEnd = qSize - revQStart
A .psl file describes a series of alignments in a dense easily parsed text format. It begins with a five line header which describes each field. Following this is one line for each alignment with a tab between each field. The fields are describe below in a format suitable for many relational databases.
matches int unsigned , # Number of bases that match that aren't repeats
misMatches int unsigned , # Number of bases that don't match
repMatches int unsigned , # Number of bases that match but are part of repeats
nCount int unsigned , # Number of 'N' bases
qNumInsert int unsigned , # Number of inserts in query
qBaseInsert int unsigned , # Number of bases inserted in query
tNumInsert int unsigned , # Number of inserts in target
tBaseInsert int unsigned , # Number of bases inserted in target
strand char(2) , # + or - for query strand, optionally followed by + or – for target strand
qName varchar(255) , # Query sequence name
qSize int unsigned , # Query sequence size
qStart int unsigned , # Alignment start position in query
qEnd int unsigned , # Alignment end position in query
tName varchar(255) , # Target sequence name
tSize int unsigned , # Target sequence size
tStart int unsigned , # Alignment start position in target
tEnd int unsigned , # Alignment end position in target
blockCount int unsigned , # Number of blocks in alignment. A block contains no gaps.
blockSizes longblob , # Size of each block in a comma separated list
qStarts longblob , # Start of each block in query in a comma separated list
tStarts longblob , # Start of each block in target in a comma separated list
In general the coordinates in psl files are “zero based half open.” The first base in a sequence is numbered zero rather than one. When representing a range the end coordinate is not included in the range. Thus the first 100 bases of a sequence are represented as 0-100, and the second 100 bases are represented as 100-200. There is a another little unusual feature in the .psl format. It has to do with how coordinates are handled on the negative strand. In the qStart/qEnd fields the coordinates are where it matches from the point of view of the forward strand (even when the match is on the reverse strand). However on the qStarts[] list, the coordinates are reversed.
Here's an example of a 30-mer that has 2 blocks that align on the minus strand and 2 blocks on the plus strand (this sort of stuff happens in real life in response to assembly errors sometimes).
0 1 2 3 tens position in query
0123456789012345678901234567890 ones position in query
++++ +++++ plus strand alignment on query
-------- ---------- minus strand alignment on query
Plus strand:
qStart 12 qEnd 31 blockSizes 4,5 qStarts 12,26
Minus strand:
qStart 4 qEnd 26 blockSizes 10,8 qStarts 5,19
Essentially the minus strand blockSizes and qStarts are what you would get if you reverse complemented the query.However the qStart and qEnd are non-reversed. To get from one to the other:
qStart = qSize - revQEnd
qEnd = qSize - revQStart
Wednesday, May 28, 2008
Running BLAT with maximum sensitivity
nohup /home/shah/local_progs/blat_dir/blat -fine -repMatch=1000000 -minIdentity=100 /home/shah/drosophila_annotations/dmel3_apr_2006/dmel3_apr_2006.fasta agilent_dm2_probe_sequences_5_215453.fasta agilent_dm2_probe_sequences_5_215453.psl &
explanation :
-fine
-repMatch=1000000 - number when tiles are considered overused
-minIdentity=100 - pick identical matches
e.g. for mapping probes to the reference genomes.
explanation :
-fine
-repMatch=1000000 - number when tiles are considered overused
-minIdentity=100 - pick identical matches
e.g. for mapping probes to the reference genomes.
Saturday, May 24, 2008
Installing R package tar.gz
use
R CMD INSTALL pkg.tar.gz
or within R session
install.packages("pkg.tar.gz", type="source", repos=NULL)
R CMD INSTALL pkg.tar.gz
or within R session
install.packages("pkg.tar.gz", type="source", repos=NULL)
Thursday, May 15, 2008
Running xMAN for Affy probe mapping
According to Drosophila Tiling 2.0R array design and library file description Affy tiling2.0R BPMAP file may contain probes that map to genomes multiple times, which may be derived from repeated gene families and other sequences that were not repeat masked.
Existence of probes that map to genomes multiple times lead to inflated p-values during the subsequent analysis and lead to false-positives. Also, remapping the probes to a newer genome assembly is desirable sometimes
Thus, it is important to remap the BPMAP to the genomes.
xMAN is available from Shirley Liu's lab.
Command line while running xMAN may look like
First try:
xMAN -f bpmap -i Dm_tiling2_MR_v01.bpmap -o Dm_tiling2_MR_v01_xman_nr.bpmap --nro=Dm_tiling2_MR_v01_xman_nro.bpmap --maxSeqCopy=10 chr2L.fa chr3L.fa chr2RHet.fa chr3RHet.fa chrUextra.fa chr4.fa chrX.fa chr2LHet.fa chr3LHet.fa chrM.fa chrXHet.fa chr2R.fa chr3R.fa chrU.fa chrYHet.fa
Existence of probes that map to genomes multiple times lead to inflated p-values during the subsequent analysis and lead to false-positives. Also, remapping the probes to a newer genome assembly is desirable sometimes
Thus, it is important to remap the BPMAP to the genomes.
xMAN is available from Shirley Liu's lab.
Command line while running xMAN may look like
First try:
xMAN -f bpmap -i Dm_tiling2_MR_v01.bpmap -o Dm_tiling2_MR_v01_xman_nr.bpmap --nro=Dm_tiling2_MR_v01_xman_nro.bpmap --maxSeqCopy=10 chr2L.fa chr3L.fa chr2RHet.fa chr3RHet.fa chrUextra.fa chr4.fa chrX.fa chr2LHet.fa chr3LHet.fa chrM.fa chrXHet.fa chr2R.fa chr3R.fa chrU.fa chrYHet.fa
Thursday, May 01, 2008
Generating a Affymatrix "tpmap" file for TiMAT
R Code that uses the Bioconductor module Affxparser.
Note: The code has a bug and in normal circumstances returns an error
> writeTpmap(filename="Dm_tiling2_MR_v01.tpmap", bpmaplist=bpmapr)
> Error in as.vector(y) : argument "y" is missing, with no default
The "y" can be fixed as follows
> y <- readBpmap("Dm_tiling2_MR_v01.bpmap")
> writeTpmap(filename="Dm_tiling2_MR_v01.tpmap", bpmaplist='y', verbose = as.integer(0))
This seems to work but doesn't print anything into the tpmap file due to some "sequence skipping".
I will try the Affymatrix windows based program next.
Note: The code has a bug and in normal circumstances returns an error
> writeTpmap(filename="Dm_tiling2_MR_v01.tpmap", bpmaplist=bpmapr)
> Error in as.vector(y) : argument "y" is missing, with no default
The "y" can be fixed as follows
> y <- readBpmap("Dm_tiling2_MR_v01.bpmap")
> writeTpmap(filename="Dm_tiling2_MR_v01.tpmap", bpmaplist='y', verbose = as.integer(0))
This seems to work but doesn't print anything into the tpmap file due to some "sequence skipping".
I will try the Affymatrix windows based program next.
Tuesday, February 12, 2008
me vs. reggaeton
Doing a PhD in Germany was a wonderful experience. Those 4.5 yrs taught me German and (mostly reading the literature) improved my English .. My proficiency over Latin based languages didn't improve !! Zero in French and in Spanish .. despite boasting a PhD from Autonoma De Madrid !!!
Now US is making sure that I learn some Spanish ... nope no spanish girl friends!! My wife will kill me if I do it. Its mostly the reggaeton ... from Don Omar, Daddy Yankee, Wishin Y Yandel , and others. I like the beats even though they are far away from Rammstein :D ... If you ask "Who is Don Omar" , then give a You tube search !!! Highly recommanded.
Now US is making sure that I learn some Spanish ... nope no spanish girl friends!! My wife will kill me if I do it. Its mostly the reggaeton ... from Don Omar, Daddy Yankee, Wishin Y Yandel , and others. I like the beats even though they are far away from Rammstein :D ... If you ask "Who is Don Omar" , then give a You tube search !!! Highly recommanded.
Monday, February 11, 2008
Protein 3D structure analysis
I was trying to find today programs for analyzing protein 3D structures .. A rare art
new generation of Bioinformaticians may never learnt. Especially, I wanted to calculate secondary structure, solvent accessibility, hydrogen bonds, torsion angles and residue packings from protein 3D structures.
Any ways, the old programs viz. Comparer, JOY suite (people believe its no fun at all running it) at the University of Cambridge and WHAT IF are still under copyrights that you still have to FAX them to get a license. (aka long wait .....)
I finally downloaded DSSP (CMBI version), STRIDE (from EMBL/EBI) and PSA (sali lab). Also, I will try VADAR suite of algorithms from Canada and PROMOTIF (I never used it before though) to get most of the required information.
Let's see where these guys take me !! Anybody who want to write Bioperl modules for them :D !!!
new generation of Bioinformaticians may never learnt. Especially, I wanted to calculate secondary structure, solvent accessibility, hydrogen bonds, torsion angles and residue packings from protein 3D structures.
Any ways, the old programs viz. Comparer, JOY suite (people believe its no fun at all running it) at the University of Cambridge and WHAT IF are still under copyrights that you still have to FAX them to get a license. (aka long wait .....)
I finally downloaded DSSP (CMBI version), STRIDE (from EMBL/EBI) and PSA (sali lab). Also, I will try VADAR suite of algorithms from Canada and PROMOTIF (I never used it before though) to get most of the required information.
Let's see where these guys take me !! Anybody who want to write Bioperl modules for them :D !!!
Subscribe to:
Posts (Atom)