Download the repeat_mask.txt, simple_repeat.txt and segment_dups.txt from UCSC
and run
python Rep.py -m repeat_mask.txt simple_repeat.txt segment_dups.txt
The Rep.py is available at the MAT lib subdirectory of the MAT install.
For the genomes like Drosophila melanogaster, for which UCSC doesn't provide Segment Duplication file. The repeat library can be generated by
1) remove all the instances of usage of "segdup" from Rep.py file
2) remove the original Rep.py and Rep.pyc files form MAT compile.
3) Recompile MAT.
RepeatLib and the modified Rep.py files are also available on request from me.
parantu dot shah at gmail dot com
Friday, June 20, 2008
Friday, June 06, 2008
BLAT PSL coordinates
.psl files
A .psl file describes a series of alignments in a dense easily parsed text format. It begins with a five line header which describes each field. Following this is one line for each alignment with a tab between each field. The fields are describe below in a format suitable for many relational databases.
matches int unsigned , # Number of bases that match that aren't repeats
misMatches int unsigned , # Number of bases that don't match
repMatches int unsigned , # Number of bases that match but are part of repeats
nCount int unsigned , # Number of 'N' bases
qNumInsert int unsigned , # Number of inserts in query
qBaseInsert int unsigned , # Number of bases inserted in query
tNumInsert int unsigned , # Number of inserts in target
tBaseInsert int unsigned , # Number of bases inserted in target
strand char(2) , # + or - for query strand, optionally followed by + or – for target strand
qName varchar(255) , # Query sequence name
qSize int unsigned , # Query sequence size
qStart int unsigned , # Alignment start position in query
qEnd int unsigned , # Alignment end position in query
tName varchar(255) , # Target sequence name
tSize int unsigned , # Target sequence size
tStart int unsigned , # Alignment start position in target
tEnd int unsigned , # Alignment end position in target
blockCount int unsigned , # Number of blocks in alignment. A block contains no gaps.
blockSizes longblob , # Size of each block in a comma separated list
qStarts longblob , # Start of each block in query in a comma separated list
tStarts longblob , # Start of each block in target in a comma separated list
In general the coordinates in psl files are “zero based half open.” The first base in a sequence is numbered zero rather than one. When representing a range the end coordinate is not included in the range. Thus the first 100 bases of a sequence are represented as 0-100, and the second 100 bases are represented as 100-200. There is a another little unusual feature in the .psl format. It has to do with how coordinates are handled on the negative strand. In the qStart/qEnd fields the coordinates are where it matches from the point of view of the forward strand (even when the match is on the reverse strand). However on the qStarts[] list, the coordinates are reversed.
Here's an example of a 30-mer that has 2 blocks that align on the minus strand and 2 blocks on the plus strand (this sort of stuff happens in real life in response to assembly errors sometimes).
0 1 2 3 tens position in query
0123456789012345678901234567890 ones position in query
++++ +++++ plus strand alignment on query
-------- ---------- minus strand alignment on query
Plus strand:
qStart 12 qEnd 31 blockSizes 4,5 qStarts 12,26
Minus strand:
qStart 4 qEnd 26 blockSizes 10,8 qStarts 5,19
Essentially the minus strand blockSizes and qStarts are what you would get if you reverse complemented the query.However the qStart and qEnd are non-reversed. To get from one to the other:
qStart = qSize - revQEnd
qEnd = qSize - revQStart
A .psl file describes a series of alignments in a dense easily parsed text format. It begins with a five line header which describes each field. Following this is one line for each alignment with a tab between each field. The fields are describe below in a format suitable for many relational databases.
matches int unsigned , # Number of bases that match that aren't repeats
misMatches int unsigned , # Number of bases that don't match
repMatches int unsigned , # Number of bases that match but are part of repeats
nCount int unsigned , # Number of 'N' bases
qNumInsert int unsigned , # Number of inserts in query
qBaseInsert int unsigned , # Number of bases inserted in query
tNumInsert int unsigned , # Number of inserts in target
tBaseInsert int unsigned , # Number of bases inserted in target
strand char(2) , # + or - for query strand, optionally followed by + or – for target strand
qName varchar(255) , # Query sequence name
qSize int unsigned , # Query sequence size
qStart int unsigned , # Alignment start position in query
qEnd int unsigned , # Alignment end position in query
tName varchar(255) , # Target sequence name
tSize int unsigned , # Target sequence size
tStart int unsigned , # Alignment start position in target
tEnd int unsigned , # Alignment end position in target
blockCount int unsigned , # Number of blocks in alignment. A block contains no gaps.
blockSizes longblob , # Size of each block in a comma separated list
qStarts longblob , # Start of each block in query in a comma separated list
tStarts longblob , # Start of each block in target in a comma separated list
In general the coordinates in psl files are “zero based half open.” The first base in a sequence is numbered zero rather than one. When representing a range the end coordinate is not included in the range. Thus the first 100 bases of a sequence are represented as 0-100, and the second 100 bases are represented as 100-200. There is a another little unusual feature in the .psl format. It has to do with how coordinates are handled on the negative strand. In the qStart/qEnd fields the coordinates are where it matches from the point of view of the forward strand (even when the match is on the reverse strand). However on the qStarts[] list, the coordinates are reversed.
Here's an example of a 30-mer that has 2 blocks that align on the minus strand and 2 blocks on the plus strand (this sort of stuff happens in real life in response to assembly errors sometimes).
0 1 2 3 tens position in query
0123456789012345678901234567890 ones position in query
++++ +++++ plus strand alignment on query
-------- ---------- minus strand alignment on query
Plus strand:
qStart 12 qEnd 31 blockSizes 4,5 qStarts 12,26
Minus strand:
qStart 4 qEnd 26 blockSizes 10,8 qStarts 5,19
Essentially the minus strand blockSizes and qStarts are what you would get if you reverse complemented the query.However the qStart and qEnd are non-reversed. To get from one to the other:
qStart = qSize - revQEnd
qEnd = qSize - revQStart
Subscribe to:
Posts (Atom)