Ryan's Blog

Novoalign Alignment Scores

Posted in research by ryanlayer on May 31, 2010

Base Qualities and Alignment Scores

Novoalign aligns reads against a reference genome using base qualities and ambiguous nucleotide codes.
The initial alignment process finds locations in the indexed sequence that are possible
sources of the read sequence. These locations are scored using the Needleman-Wunsch
algorithm with affine gap penalties and with position-specific scoring derived from the read base
qualities and any ambiguous codes in the reference sequence. User-defined affine gap penalties are
used for scoring insertions and deletions.

Because Novoalign uses Needleman-Wunsch alignments with affine gap penalties, the gap opening penalty
should be set to -10\log_{10}(P_{gap}) - G_{extend}, where P_{gap} is the probability of an insertion or deletion
mutation relative to the reference genome and G_{extend} is the gap extension penalty. Likewise, the gap extension
penalty can be set to -10\log_{10}(P_{gap2} / P_{gap1}), where P_{gap1} is the probability of a single-base indel
and P_{gap2} is the probability of a 2-base indel. The default gap penalties were
derived from the frequency of short indels in human genome resequencing projects.
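
As a rough illustration of the two gap formulas, here is a minimal Python sketch. The function name and the example probabilities are my own assumptions, not Novoalign defaults, and I am assuming that P_{gap} in the opening penalty is the single-base indel probability P_{gap1}.

```python
import math

def gap_penalties(p_gap1, p_gap2):
    """Return (gap_open, gap_extend) from indel probabilities.

    p_gap1: probability of a single-base indel (illustrative input)
    p_gap2: probability of a 2-base indel (illustrative input)
    Assumption: P_gap in the opening-penalty formula equals p_gap1.
    """
    gap_extend = -10 * math.log10(p_gap2 / p_gap1)
    gap_open = -10 * math.log10(p_gap1) - gap_extend
    return gap_open, gap_extend

# Made-up probabilities, chosen only to give round numbers.
gap_open, gap_extend = gap_penalties(1e-4, 1e-5)
print(round(gap_open, 3), round(gap_extend, 3))
```

If a gap of length k is charged the opening penalty plus k extensions, these settings make a 1-base gap cost -10\log_{10}(P_{gap1}) and a 2-base gap cost -10\log_{10}(P_{gap2}), which is presumably the point of subtracting G_{extend} from the opening penalty.
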
Base quality values are used to calculate base penalties for the Needleman-Wunsch algorithm. The
base qualities are converted to base probabilities and then to score penalties.

PRB Quality to Score Conversion

The prb file holds a quality score Q(b,i) for each base b at each position i in the read. The quality
value is converted to a probability, Pr(b,i), and then to a penalty, P(b,i).

Pr(b,i) = \frac{10^{\frac{Q(b,i)}{10}}}{1 + 10^{\frac{Q(b,i)}{10}}}

P(b,i) = -10\log_{10}(Pr(b,i))
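
The two conversions can be sketched in Python as follows. The function names are mine and are not part of Novoalign; the arithmetic is just the two formulas above.

```python
import math

def quality_to_probability(q):
    """Pr(b,i): convert a prb quality score to a probability."""
    odds = 10 ** (q / 10)
    return odds / (1 + odds)

def probability_to_penalty(pr):
    """P(b,i): convert a probability to a score penalty."""
    return -10 * math.log10(pr)

def quality_to_penalty(q):
    """Chain the two conversions: quality -> probability -> penalty."""
    return probability_to_penalty(quality_to_probability(q))

# Illustrative qualities: a confident call, a coin flip, an unlikely base.
for q in (30, 0, -30):
    print(q, round(quality_to_probability(q), 6), round(quality_to_penalty(q), 3))
```

A strongly positive quality gives a penalty near 0, a quality of 0 gives a probability of 0.5 and a penalty of about 3, and a strongly negative quality gives a penalty close to the magnitude of the quality.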

Alignment Score and Threshold

The alignment score is -10\log_{10}(P(R|A_i)) where P(R | A_i) is the probability of the read sequence
given the alignment location i.
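
To make the score concrete, here is a small sketch for an ungapped alignment. It assumes read positions are independent, so that P(R | A_i) is a product of per-base probabilities and the score is the sum of the per-base penalties. The helper names and the quality values of 30 and -30 are illustrative assumptions, not Novoalign internals.

```python
import math

def base_penalty(q):
    """Penalty P(b,i) for a base with prb quality q."""
    odds = 10 ** (q / 10)
    return -10 * math.log10(odds / (1 + odds))

def ungapped_alignment_score(reference, read_qualities):
    """Sum the penalties of the reference bases along an ungapped alignment.

    reference: reference bases at the candidate location, e.g. "ACG"
    read_qualities: one dict per read position mapping base -> Q(b,i)
    """
    if len(reference) != len(read_qualities):
        raise ValueError("reference and read must have the same length")
    return sum(base_penalty(quals[base])
               for base, quals in zip(reference, read_qualities))

def make_position(called, q_called=30, q_other=-30):
    """Hypothetical prb row: high quality for the called base, low for the rest."""
    return {b: (q_called if b == called else q_other) for b in "ACGT"}

read = [make_position(b) for b in "ACG"]
perfect = ungapped_alignment_score("ACG", read)
one_mismatch = ungapped_alignment_score("ACT", read)
print(round(perfect, 3), round(one_mismatch, 3))
```

With these made-up qualities a perfect match scores close to 0 and each high-quality mismatch adds roughly 30 to the score.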

A threshold of 75 would allow for alignment of reads with two mismatches at high quality base
positions plus one or two mismatches at low quality positions or to ambiguous characters in the
reference sequence.
If a threshold is not specified, Novoalign calculates a threshold for each read such that an
alignment to a non-repetitive sequence will have an alignment quality of at least 20. That is, the
iterative process of finding an alignment terminates before it finds a low-quality chance
alignment. Alignments to repetitive sequences may still have qualities below 20.

Posterior Alignment Probabilities and Quality Scores

The posterior alignment probability calculation includes all the alignments found; the probability
that the read came from a repeat masked region or from any regions coded in the reference genome
as N’s; and an allowance for a chance hit above the threshold based on the mutual information
content of the read and the genome.
A posterior alignment probability, P(A_i | R, G), is calculated as:

P(A_i | R, G) = \frac{P(R | A_i, G)}{P(R | N, G) + \sum_i P(R | A_i, G)}

where P(R | N, G) is the probability of finding the read by chance in any masked reference sequence
or any region of the reference sequence coded as N's, and where \sum_i is the sum over all the
alignments found plus a factor for chance alignments calculated using the usable read and genome
lengths.
The P(R | N, G) term allows for the fact that a fragment could have been sourced from portions of the
genome that are not represented in the reference sequence. For instance, in human genome build 36
approximately 7% of the sequence is represented by large blocks of N's.
A quality score is calculated as -10\log_{10}(1 - P(A_i | R, G)), where P(A_i|R, G) is the probability of the
alignment given the read and the genome.
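
The posterior and quality calculations can be sketched as below. How Novoalign actually computes the P(R | N, G) term and the chance-alignment factor is not described here, so both are plain inputs (p_unrepresented and p_chance, names of my own choosing), and the likelihoods are recovered from alignment scores by inverting the score definition.

```python
import math

def posterior_probabilities(alignment_scores, p_unrepresented=0.0, p_chance=0.0):
    """Posterior P(A_i | R, G) for each alignment found.

    alignment_scores: scores -10*log10(P(R | A_i, G)) of the alignments found
    p_unrepresented: stand-in for P(R | N, G) (assumed to be supplied)
    p_chance: stand-in for the chance-alignment factor (assumed to be supplied)
    """
    likelihoods = [10 ** (-score / 10) for score in alignment_scores]
    denominator = p_unrepresented + p_chance + sum(likelihoods)
    return [lik / denominator for lik in likelihoods]

def alignment_quality(posterior):
    """Quality score -10*log10(1 - P); infinite if the posterior is exactly 1."""
    if posterior >= 1.0:
        return math.inf
    return -10 * math.log10(1.0 - posterior)

# Two candidate locations: a perfect hit and one scoring 30 worse.
posteriors = posterior_probabilities([0.0, 30.0])
qualities = [alignment_quality(p) for p in posteriors]
print([round(p, 6) for p in posteriors], [round(q, 3) for q in qualities])
```

In this toy case the second location is 1000 times less likely than the first, so the best alignment gets a quality of about 30. Adding more near-equal alternatives, as in a repetitive region, pulls the quality down quickly.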


Finding Structural Variations with Pair-End Sequences and a Sliding Window

Posted in research by ryanlayer on October 19, 2009

A Greedy Algorithm for Aligning DNA Sequences

Posted in research by ryanlayer on October 8, 2009

JOURNAL OF COMPUTATIONAL BIOLOGY Volume 7, Numbers 1/2, 2000

Paper

Abstract

For aligning DNA sequences that differ only by sequencing errors, or by equivalent errors
from other sources, a greedy algorithm can be much faster than traditional dynamic programming
approaches and yet produce an alignment that is guaranteed to be theoretically
optimal. We introduce a new greedy alignment algorithm with particularly good performance
and show that it computes the same alignment as does a certain dynamic programming algorithm,
while executing over 10 times faster on appropriate data. An implementation of
this algorithm is currently used in a program that assembles the UniGene database at the
National Center for Biotechnology Information.


PARPST: a PARallel algorithm to find peptide sequence tags

Posted in research by ryanlayer on October 8, 2009
BMC Bioinformatics. 2008; 9(Suppl 4): S11.
Paper
Abstract
Background
Protein identification is one of the most challenging problems in proteomics. Tandem mass spectrometry provides an important tool to handle the protein identification problem.
Results
We developed a work-efficient parallel algorithm for the peptide sequence tag problem. The algorithm runs on the concurrent-read, exclusive-write PRAM in O(n) time using log n processors, where n is the number of mass peaks in the spectrum. The algorithm is able to find all the sequence tags having score greater than a parameter or all the sequence tags of maximum length. Our tests on 1507 spectra in the Open Proteomics Database show that our algorithm is efficient and effective, since it achieves comparable results to other methods.
Conclusions
The proposed algorithm can be used to speed up the database searching or to identify post-translational modifications, comparing the homology of the sequence tags found with the sequences in the biological database.

Yeast genome analysis identifies chromosomal translocation, gene conversion events and several sites of Ty element insertion.

Posted in research by ryanlayer on October 8, 2009

Nucleic Acids Res. 2009 Aug 26.

Paper

Abstract

Paired end mapping of chromosomal fragments has been used in human cells to identify numerous structural variations in chromosomes of individuals and of cancer cell lines; however, the molecular, biological and bioinformatics methods for this technology are still in development. Here, we present a parallel bioinformatics approach to analyze chromosomal paired-end tag (ChromPET) sequence data and demonstrate its application in identifying gene rearrangements in the model organism Saccharomyces cerevisiae. We detected several expected events, including a chromosomal rearrangement of the nonessential arm of chromosome V induced by selective pressure, rearrangements introduced during strain construction and gene conversion at the MAT locus. In addition, we discovered several unannotated Ty element insertions that are present in the reference yeast strain, but not in the reference genome sequence, suggesting a few revisions are necessary in the latter. These data demonstrate that application of the chromPET technique to a genetically tractable organism like yeast provides an easy screen for studying the mechanisms of chromosomal rearrangements during the propagation of a species.

CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment

Posted in research by ryanlayer on October 8, 2009

BMC Bioinformatics 2008, 9(Suppl 2):S10

Paper

Abstract

Background

Searching for similarities in protein and DNA databases has become a routine procedure in Molecular Biology. The Smith-Waterman algorithm has been available for more than 25 years. It is based on a dynamic programming approach that explores all the possible alignments between two sequences; as a result it returns the optimal local alignment. Unfortunately, the computational cost is very high, requiring a number of operations proportional to the product of the length of two sequences. Furthermore, the exponential growth of protein and DNA databases makes the Smith-Waterman algorithm unrealistic for searching similarities in large sets of sequences. For these reasons heuristic approaches such as those implemented in FASTA and BLAST tend to be preferred, allowing faster execution times at the cost of reduced sensitivity. The main motivation of our work is to exploit the huge computational power of commonly available graphic cards, to develop high performance solutions for sequence alignment.

Results

In this paper we present what we believe is the fastest solution of the exact Smith-Waterman algorithm running on commodity hardware. It is implemented in the recently released CUDA programming environment by NVidia. CUDA allows direct access to the hardware primitives of the last-generation Graphics Processing Units (GPU) G80. Speeds of more than 3.5 GCUPS (Giga Cell Updates Per Second) are achieved on a workstation running two GeForce 8800 GTX. Exhaustive tests have been done to compare our implementation to SSEARCH and BLAST, running on a 3 GHz Intel Pentium IV processor. Our solution was also compared to a recently published GPU implementation and to a Single Instruction Multiple Data (SIMD) solution. These tests show that our implementation performs from 2 to 30 times faster than any other previous attempt available on commodity hardware.

Conclusions

The results show that graphic cards are now sufficiently advanced to be used as efficient hardware accelerators for sequence alignment. Their performance is better than any alternative available on commodity hardware platforms. The solution presented in this paper allows large scale alignments to be performed at low cost, using the exact Smith-Waterman algorithm instead of the largely adopted heuristic approaches.


Sequence Alignment with GPU: Performance and Design Challenges

Posted in research by ryanlayer on October 8, 2009

Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on

Publication Date: 23-29 May 2009

Slides

Paper

Abstract

In bioinformatics, alignments are commonly performed in genome and protein sequence analysis for gene identification and evolutionary similarities. There are several approaches for such analysis, each varying in accuracy and computational complexity. Smith-Waterman (SW) is by far the best algorithm for its accuracy in similarity scoring. However, execution time of this algorithm on general purpose processor based systems makes it impractical for use by life scientists. In this paper we take Smith-Waterman as a case study to explore the architectural features of Graphics Processing Units (GPUs) and evaluate the challenges the hardware architecture poses, as well as the software modifications needed to map the program architecture on to the GPU. We achieve a 23x speedup against the serial version of the SW algorithm. We further study the effect of memory organization and the instruction set architecture on GPU performance. For that purpose we analyze another implementation on an Intel Quad Core processor that makes use of Intel’s SIMD based SSE2 architecture. We show that if reading blocks of 16 words at a time instead of 4 is allowed, and if 64 KB of shared memory as opposed to 16 KB is available to the programmer, GPU performance enhances significantly making it comparable to the SIMD based implementation. We quantify these observations to illustrate the need for studies on extending the instruction set and memory organization for the GPU.

NovoCraft Short Read Alignment Package Documentation

Posted in research by ryanlayer on October 5, 2009