Friday, June 06, 2008

BLAT PSL coordinates

.psl files

A .psl file describes a series of alignments in a dense easily parsed text format. It begins with a five line header which describes each field. Following this is one line for each alignment with a tab between each field. The fields are describe below in a format suitable for many relational databases.

matches int unsigned , # Number of bases that match that aren't repeats

misMatches int unsigned , # Number of bases that don't match

repMatches int unsigned , # Number of bases that match but are part of repeats

nCount int unsigned , # Number of 'N' bases

qNumInsert int unsigned , # Number of inserts in query

qBaseInsert int unsigned , # Number of bases inserted in query

tNumInsert int unsigned , # Number of inserts in target

tBaseInsert int unsigned , # Number of bases inserted in target

strand char(2) , # + or - for query strand, optionally followed by + or – for target strand

qName varchar(255) , # Query sequence name

qSize int unsigned , # Query sequence size

qStart int unsigned , # Alignment start position in query

qEnd int unsigned , # Alignment end position in query

tName varchar(255) , # Target sequence name

tSize int unsigned , # Target sequence size

tStart int unsigned , # Alignment start position in target

tEnd int unsigned , # Alignment end position in target

blockCount int unsigned , # Number of blocks in alignment. A block contains no gaps.

blockSizes longblob , # Size of each block in a comma separated list

qStarts longblob , # Start of each block in query in a comma separated list

tStarts longblob , # Start of each block in target in a comma separated list

In general the coordinates in psl files are “zero based half open.” The first base in a sequence is numbered zero rather than one. When representing a range the end coordinate is not included in the range. Thus the first 100 bases of a sequence are represented as 0-100, and the second 100 bases are represented as 100-200. There is a another little unusual feature in the .psl format. It has to do with how coordinates are handled on the negative strand. In the qStart/qEnd fields the coordinates are where it matches from the point of view of the forward strand (even when the match is on the reverse strand). However on the qStarts[] list, the coordinates are reversed.



Here's an example of a 30-mer that has 2 blocks that align on the minus strand and 2 blocks on the plus strand (this sort of stuff happens in real life in response to assembly errors sometimes).

0 1 2 3 tens position in query
0123456789012345678901234567890 ones position in query
++++ +++++ plus strand alignment on query
-------- ---------- minus strand alignment on query

Plus strand:
qStart 12 qEnd 31 blockSizes 4,5 qStarts 12,26
Minus strand:
qStart 4 qEnd 26 blockSizes 10,8 qStarts 5,19

Essentially the minus strand blockSizes and qStarts are what you would get if you reverse complemented the query.However the qStart and qEnd are non-reversed. To get from one to the other:
qStart = qSize - revQEnd
qEnd = qSize - revQStart

No comments: