Base Qualities and Alignment Scores
Novoalign aligns reads against a reference genome using qualities and ambiguous nucleotide codes.
The initial alignment process finds alignment locations in the indexed sequence that are possible
sources of the read sequence. The alignment locations are scored using the NeedlemanWunsch
algorithm with affine gap penalties and with position specific scoring derived from the read base
qualities and any ambiguous codes in the reference sequence. User defined affine gap penalties are
used for scoring insert/deletes.
Novoalign uses NeedlemanWunsch alignments with affine gap penalties, the gap opening penalty
should be set to where is the probability of an insertion deletion
mutation vs the reference genome and is the gap extension penalty. Likewise the gap extend
penalty can be set to where is the probability of a single base indel
and is the probability of a 2 base insert/delete mutation. The default gap penalties were
derived from the frequency of short insert/deletes in human genome resequencing projects.
Base quality values are used to calculate base penalties for the Needleman Wunsch algorithm. The
base qualities are converted to base probabilities and then to score penalties.
PRB Quality to Score Conversion
The prb file has quality score for each base, , at each position, , in the read. The quality
value is converted to a probability, and then to a penalty .
Alignment Score and Threshold
The alignment score is where is the probability of the read sequence
given the alignment location .
A threshold of 75 would allow for alignment of reads with two mismatches at high quality base
positions plus one or two mismatches at low quality positions or to ambiguous characters in the
If a threshold is not specified then Novoalign will calculate a threshold for each read such that an
alignment to a nonrepetitive sequence will have an alignment quality of at least 20. I.e. The
iterative process of finding an alignment will terminate before finding a low quality chance
alignment. Alignments to repetitive sequences may still have qualities less than 20.
Posterior Alignment Probabilities and Quality Scores
The posterior alignment probability calculation includes all the alignments found; the probability
that the read came from a repeat masked region or from any regions coded in the reference genome
as N’s; and an allowance for a chance hit above the threshold based on the mutual information
content of the read and the genome.
A posterior alignment probability, is calculated as:
where is the probability of finding the read by chance in any masked reference sequence
or any region of the reference sequence coded as ‘s, and where is the sum over all the
alignments found plus a factor for chance alignments calculated using the usable read and genome
The term allows for the fact that a fragment could have been sourced from portions of the
genome that are not represented in the reference sequence. For instance in Human genome build 36
there is approximately 7% of sequence represented by large blocks of ‘s.
A quality score is calculated as , where is the probability of the
alignment given the read and the genome.