# Ryan's Blog

## Novoalign Alignment Scores

Posted in research by ryanlayer on May 31, 2010

### Base Qualities and Alignment Scores

Novoalign aligns reads against a reference genome using qualities and ambiguous nucleotide codes.
The initial alignment process finds alignment locations in the indexed sequence that are possible
sources of the read sequence. The alignment locations are scored using the Needleman-Wunsch
algorithm with affine gap penalties and with position-specific scoring derived from the read base
qualities and any ambiguous codes in the reference sequence. User-defined affine gap penalties are
used for scoring insertions and deletions.
Because Novoalign uses Needleman-Wunsch alignments with affine gap penalties, the gap opening penalty
should be set to $-10\log_{10}(P_{gap}) - G_{extend}$, where $P_{gap}$ is the probability of an insertion or deletion
mutation relative to the reference genome and $G_{extend}$ is the gap extension penalty. Likewise, the gap extension
penalty can be set to $-10\log_{10}(P_{gap2} / P_{gap1})$, where $P_{gap1}$ is the probability of a single-base indel
and $P_{gap2}$ is the probability of a two-base indel. The default gap penalties were
derived from the frequency of short indels in human genome resequencing projects.
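
As a rough illustration of the two gap formulas, here is a minimal Python sketch. The function names and the indel probabilities are made up for the example (they are not Novoalign's defaults), and it assumes $P_{gap}$ is the same as the single-base indel probability $P_{gap1}$.

```python
import math


def gap_extend_penalty(p_gap1: float, p_gap2: float) -> float:
    """Gap extension penalty: -10 * log10(P_gap2 / P_gap1)."""
    return -10.0 * math.log10(p_gap2 / p_gap1)


def gap_open_penalty(p_gap: float, g_extend: float) -> float:
    """Gap opening penalty: -10 * log10(P_gap) - G_extend."""
    return -10.0 * math.log10(p_gap) - g_extend


# Hypothetical indel probabilities, chosen only for illustration.
p_gap1 = 1e-4    # assumed probability of a single-base indel
p_gap2 = 2.5e-5  # assumed probability of a two-base indel

g_extend = gap_extend_penalty(p_gap1, p_gap2)
# Assumption: P_gap is taken to be the single-base indel probability.
g_open = gap_open_penalty(p_gap1, g_extend)

# Under affine scoring a one-base gap costs open + extend,
# which recovers -10 * log10(P_gap).
one_base_gap_cost = g_open + g_extend

print(f"gap open = {g_open:.2f}, gap extend = {g_extend:.2f}")
print(f"one-base gap cost = {one_base_gap_cost:.2f}")
```

Subtracting $G_{extend}$ from the opening penalty is what makes the total cost of a one-base gap equal to the Phred-scaled probability of a single indel.
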
Base quality values are used to calculate base penalties for the Needleman-Wunsch algorithm. The
base qualities are converted to base probabilities and then to score penalties.

### PRB Quality to Score Conversion

The prb file has a quality score $Q(b,i)$ for each base, $b$, at each position, $i$, in the read. The quality
value is converted to a probability, $Pr(b,i)$, and then to a penalty, $P(b,i)$.

$Pr(b,i) = \frac{10^{\frac{Q(b,i)}{10}}}{1 + 10^{\frac{Q(b,i)}{10}}}$

$P(b,i) = -10\log_{10}(Pr(b,i))$
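
The two conversions above can be sketched in a few lines of Python. The function names are hypothetical, and the sample row (one quality per base at a single read position) is invented for illustration rather than taken from a real prb file.

```python
import math


def prb_to_probability(q: float) -> float:
    """Pr(b,i) = 10^(Q/10) / (1 + 10^(Q/10))."""
    odds = 10.0 ** (q / 10.0)
    return odds / (1.0 + odds)


def probability_to_penalty(pr: float) -> float:
    """P(b,i) = -10 * log10(Pr(b,i))."""
    return -10.0 * math.log10(pr)


def prb_to_penalty(q: float) -> float:
    """Convert a prb quality straight to a score penalty."""
    return probability_to_penalty(prb_to_probability(q))


# Made-up qualities for one read position: a confident call of 'A'.
prb_row = {"A": 40, "C": -40, "G": -40, "T": -40}
penalties = {base: prb_to_penalty(q) for base, q in prb_row.items()}

for base, penalty in penalties.items():
    print(f"{base}: penalty = {penalty:.4f}")
```

With these values the called base gets a penalty close to 0 and every other base a penalty close to 40, so aligning this position to anything but `A` is expensive. A quality of 0 corresponds to a probability of 0.5.
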

### Alignment Score and Threshold

The alignment score is $-10\log_{10}(P(R|A_i))$ where $P(R | A_i)$ is the probability of the read sequence
given the alignment location $i$.

A threshold of 75 would allow for alignment of reads with two mismatches at high quality base
positions plus one or two mismatches at low quality positions or to ambiguous characters in the
reference sequence.
If a threshold is not specified, Novoalign calculates a threshold for each read such that an
alignment to a non-repetitive sequence will have an alignment quality of at least 20. That is, the
iterative process of finding an alignment terminates before finding a low-quality chance
alignment. Alignments to repetitive sequences may still have qualities below 20.
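
To make the threshold of 75 concrete, here is a small Python sketch. It rests on simplifying assumptions that are mine, not Novoalign's documented behaviour: the alignment is ungapped, positions are independent (so the score is a sum of per-position penalties), matching positions cost roughly nothing, and a mismatch costs about 30 at a high-quality position and about 10 at a low-quality one.

```python
import math

THRESHOLD = 75.0

# Assumed, illustrative mismatch penalties (not Novoalign constants).
HIGH_QUALITY_MISMATCH = 30.0
LOW_QUALITY_MISMATCH = 10.0


def alignment_score(penalties):
    """Sum of per-position penalties, assuming independent positions."""
    return sum(penalties)


def score_from_probabilities(probabilities):
    """-10 * log10 of the product of per-position probabilities."""
    return -10.0 * math.log10(math.prod(probabilities))


two_high_one_low = alignment_score(
    [HIGH_QUALITY_MISMATCH, HIGH_QUALITY_MISMATCH, LOW_QUALITY_MISMATCH]
)
three_high = alignment_score([HIGH_QUALITY_MISMATCH] * 3)

# The same score computed from probabilities, to show the two views agree.
probabilities = [10.0 ** (-p / 10.0) for p in (30.0, 30.0, 10.0)]
same_score = score_from_probabilities(probabilities)

print(two_high_one_low <= THRESHOLD)  # within the threshold
print(three_high <= THRESHOLD)        # exceeds the threshold
```

Under these assumptions, two high-quality mismatches plus one low-quality mismatch score 70 and pass, while three high-quality mismatches score 90 and do not.
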

### Posterior Alignment Probabilities and Quality Scores

The posterior alignment probability calculation includes all the alignments found; the probability
that the read came from a repeat masked region or from any regions coded in the reference genome
as N’s; and an allowance for a chance hit above the threshold based on the mutual information
content of the read and the genome.
A posterior alignment probability, $P(A_i| R, G)$, is calculated as:

$P(A_i| R, G) = \frac{P(R|A_i, G)}{P(R|N, G) + \sum_i P(R|A_i, G)}$

where $P(R|N,G)$ is the probability of finding the read by chance in any masked reference sequence
or any region of the reference sequence coded as $N$'s, and where $\sum_i$ is the sum over all the
alignments found plus a factor for chance alignments calculated using the usable read and genome
lengths.
The $P(R|N,G)$ term allows for the fact that a fragment could have been sourced from portions of the
genome that are not represented in the reference sequence. For instance, in human genome build 36
approximately 7% of the sequence is represented by large blocks of $N$'s.
A quality score is calculated as $-10\log_{10}(1 - P(A_i | R, G))$, where $P(A_i|R, G)$ is the probability of the
alignment given the read and the genome.
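
The posterior and quality calculations can be sketched as follows. This is a minimal Python illustration, not Novoalign's implementation: the function names are hypothetical, each $P(R|A_i, G)$ is recovered from an alignment score as $10^{-score/10}$, and the `p_other` parameter is an assumed stand-in that lumps together $P(R|N,G)$ and the chance-alignment factor.

```python
import math


def posterior_probabilities(alignment_scores, p_other):
    """Posterior for each alignment found.

    p_other is a hypothetical stand-in for P(R|N,G) plus the allowance
    for chance alignments.
    """
    likelihoods = [10.0 ** (-score / 10.0) for score in alignment_scores]
    denominator = p_other + sum(likelihoods)
    return [likelihood / denominator for likelihood in likelihoods]


def alignment_quality(posterior):
    """Quality score: -10 * log10(1 - P(A_i | R, G))."""
    error = 1.0 - posterior
    if error <= 0.0:
        return float("inf")  # guard against log10(0); an assumption of this sketch
    return -10.0 * math.log10(error)


P_OTHER = 1e-6  # made-up value for illustration

# One good alignment (score 30) and one much worse alignment (score 60).
unique_posteriors = posterior_probabilities([30.0, 60.0], P_OTHER)
unique_quality = alignment_quality(unique_posteriors[0])  # about 27

# Two equally good alignments, as in a repetitive region.
repeat_posteriors = posterior_probabilities([30.0, 30.0], P_OTHER)
repeat_quality = alignment_quality(repeat_posteriors[0])  # about 3

print(f"unique: {unique_quality:.1f}, repeat: {repeat_quality:.1f}")
```

The second case shows why alignments to repetitive sequences end up with low qualities: two equally likely locations split the posterior roughly in half, so neither can score much above 3.
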