Nature Reviews Genetics, Vol. 7, February 2006
The first wave of information from the analysis of the human genome revealed SNPs to be the main source of genetic and phenotypic human variation. However, the advent of genome-scanning technologies has now uncovered an unexpectedly large extent of what we term ‘structural variation’ in the human genome. This comprises microscopic and, more commonly, submicroscopic variants, which include deletions, duplications and large-scale copy-number variants — collectively termed copy-number variants or copy-number polymorphisms — as well as insertions, inversions and translocations. Rapidly accumulating evidence indicates that structural variants can comprise millions of nucleotides of heterogeneity within every genome, and are likely to make an important contribution to human diversity and disease susceptibility.
SNPs: A single-nucleotide polymorphism (SNP, pronounced "snip") is a DNA sequence variation occurring when a single nucleotide — A, T, C, or G — in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual). For example, two sequenced DNA fragments from different individuals, AAGCCTA and AAGCTTA, contain a difference in a single nucleotide. In this case we say that there are two alleles: C and T. Almost all common SNPs have only two alleles.
Within a population, SNPs can be assigned a minor allele frequency — the lowest allele frequency at a locus that is observed in a particular population. This is simply the lesser of the two allele frequencies for single-nucleotide polymorphisms. There are variations between human populations, so a SNP allele that is common in one geographical or ethnic group may be much rarer in another. SOURCE: http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism
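The minor-allele-frequency definition above is just "the lesser of the two allele frequencies"; a minimal sketch in Python, with made-up allele counts for illustration:

```python
# Compute the minor allele frequency (MAF) at a biallelic SNP locus.
# The genotype data below is hypothetical, chosen only to illustrate
# the C/T example from the text.
from collections import Counter

def minor_allele_frequency(alleles):
    """Return the frequency of the less common allele at a biallelic locus."""
    counts = Counter(alleles)
    assert len(counts) == 2, "expects exactly two alleles (a biallelic locus)"
    return min(counts.values()) / len(alleles)

# 10 diploid samples -> 20 observed alleles at the C/T SNP: 15 C, 5 T
observed = list("CCCCCCCCCCCCCCCTTTTT")
print(minor_allele_frequency(observed))  # 0.25
```

The same SNP could yield a different MAF in another population, since, as noted above, allele frequencies vary between geographical and ethnic groups.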
In IEEE/ACM Transactions on Computational Biology and Bioinformatics, Vol. 6, No. 3, July-September 2009
Some genetic diseases in human beings are dominated by short sequences repeated consecutively, called tandem repeats. Once a region containing tandem repeats is found, it is of great interest to study the history of creating the repeats. The computational problem of reconstructing the duplication history of tandem repeats has been studied extensively in the literature. Almost all previous studies focused on the simplest case, where the size of each duplication block is 1. Only recently we succeeded in giving the first polynomial-time approximation algorithm with a guaranteed ratio for a more general case where the size of each duplication block is at most 2; the algorithm achieves a ratio of 6 and runs in O(n^11) time. In this paper, we present two new polynomial-time approximation algorithms for this more general case. One of them achieves a ratio of 5 and runs in O(n^9) time, while the other achieves a ratio of 2.5 + ε for any constant ε > 0 but runs slower.
Tandem repeats: Occur in DNA when a pattern of two or more nucleotides is repeated and the repetitions are directly adjacent to each other. An example would be: A-T-T-C-G-A-T-T-C-G-A-T-T-C-G in which the sequence A-T-T-C-G is repeated three times. Tandem repeat describes a pattern that helps determine an individual’s inherited traits. Tandem repeats can be very useful in determining parentage. Short tandem repeats are used for certain genealogical DNA tests. SOURCE: http://en.wikipedia.org/wiki/Tandem_repeat
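The A-T-T-C-G example above can be found mechanically; a naive sketch in Python (real tools such as Tandem Repeats Finder use far more sophisticated, inexact-match methods — this brute-force scan is for illustration only):

```python
def find_tandem_repeats(seq, min_unit=2, min_copies=2):
    """Naive O(n^3) scan for exact tandem repeats: for every start position
    and candidate unit length, count how many directly adjacent copies of
    the unit follow."""
    found = []
    n = len(seq)
    for unit_len in range(min_unit, n // min_copies + 1):
        for start in range(n - unit_len * min_copies + 1):
            unit = seq[start:start + unit_len]
            copies = 1
            pos = start + unit_len
            while seq[pos:pos + unit_len] == unit:
                copies += 1
                pos += unit_len
            if copies >= min_copies:
                found.append((start, unit, copies))
    return found

# The example from the text: ATTCG repeated three times
hits = find_tandem_repeats("ATTCGATTCGATTCG")
print((0, "ATTCG", 3) in hits)  # True
```

Note that the scan also reports sub-repeats (e.g. the last two copies starting at position 5); a post-processing step would normally keep only maximal repeats.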
In Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2009
The availability of easily programmable manycore CPUs and GPUs has motivated investigations into how to best exploit their tremendous computational power for scientific computing. Here we demonstrate how a systems biology application—detection and tracking of white blood cells in video microscopy—can be accelerated by 200x using a CUDA-capable GPU. Because the algorithms and implementation challenges are common to a wide range of applications, we discuss general techniques that allow programmers to make efficient use of a manycore GPU.
JOURNAL OF COMPUTATIONAL BIOLOGY Volume 7, Numbers 1/2, 2000
For aligning DNA sequences that differ only by sequencing errors, or by equivalent errors from other sources, a greedy algorithm can be much faster than traditional dynamic programming approaches and yet produce an alignment that is guaranteed to be theoretically optimal. We introduce a new greedy alignment algorithm with particularly good performance and show that it computes the same alignment as does a certain dynamic programming algorithm, while executing over 10 times faster on appropriate data. An implementation of this algorithm is currently used in a program that assembles the UniGene database at the National Center for Biotechnology Information.
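The abstract does not spell out the algorithm itself. As a rough illustration of the greedy idea it builds on — extend matches along diagonals for free, and pay an edit cost only when forced — here is a sketch of the classic furthest-reaching-point scheme (Myers' O(ND) difference algorithm, which counts insertions and deletions only; the paper's actual algorithm is a refinement of this, not the code below):

```python
def edit_script_length_greedy(a, b):
    """Greedy alignment of two similar strings: v[k] is the furthest row
    reached on diagonal k (row - col == k) with d edits.  Matching
    characters are consumed for free ("snakes"), so the running time is
    O((n+m)*d) -- very fast when the sequences differ only by a few
    sequencing errors, unlike the full O(n*m) DP table."""
    n, m = len(a), len(b)
    v = {1: 0}
    for d in range(n + m + 1):
        for k in range(-d, d + 1, 2):
            # Enter diagonal k from the better of the two d-1 endpoints
            if k == -d or (k != d and v[k - 1] < v[k + 1]):
                x = v[k + 1]          # step down  (insertion in b)
            else:
                x = v[k - 1] + 1      # step right (deletion from a)
            y = x - k
            # Greedy part: slide along the diagonal while characters match
            while x < n and y < m and a[x] == b[y]:
                x += 1
                y += 1
            v[k] = x
            if x >= n and y >= m:
                return d              # minimal number of indels
    return n + m
```

For two sequences differing by a single sequencing error, the loop terminates after exploring only a thin band of diagonals, which is the intuition behind the speedup claimed above.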
BMC Bioinformatics 2008, 9(Suppl 2):S10
Searching for similarities in protein and DNA databases has become a routine procedure in molecular biology. The Smith-Waterman algorithm has been available for more than 25 years. It is based on a dynamic programming approach that explores all the possible alignments between two sequences; as a result it returns the optimal local alignment. Unfortunately, the computational cost is very high, requiring a number of operations proportional to the product of the lengths of the two sequences. Furthermore, the exponential growth of protein and DNA databases makes the Smith-Waterman algorithm unrealistic for searching for similarities in large sets of sequences. For these reasons heuristic approaches such as those implemented in FASTA and BLAST tend to be preferred, allowing faster execution times at the cost of reduced sensitivity. The main motivation of our work is to exploit the huge computational power of commonly available graphics cards to develop high-performance solutions for sequence alignment.
In this paper we present what we believe is the fastest solution of the exact Smith-Waterman algorithm running on commodity hardware. It is implemented in the recently released CUDA programming environment by NVidia. CUDA allows direct access to the hardware primitives of the last-generation Graphics Processing Units (GPU) G80. Speeds of more than 3.5 GCUPS (Giga Cell Updates Per Second) are achieved on a workstation running two GeForce 8800 GTX. Exhaustive tests have been done to compare our implementation to SSEARCH and BLAST, running on a 3 GHz Intel Pentium IV processor. Our solution was also compared to a recently published GPU implementation and to a Single Instruction Multiple Data (SIMD) solution. These tests show that our implementation performs from 2 to 30 times faster than any other previous attempt available on commodity hardware.
The results show that graphic cards are now sufficiently advanced to be used as efficient hardware accelerators for sequence alignment. Their performance is better than any alternative available on commodity hardware platforms. The solution presented in this paper allows large scale alignments to be performed at low cost, using the exact Smith-Waterman algorithm instead of the largely adopted heuristic approaches.
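The cell updates that GCUPS counts come from the Smith-Waterman recurrence itself. A textbook serial sketch in Python (score-only, linear gap penalty; GPU versions parallelize the anti-diagonals of this same O(n*m) table):

```python
def smith_waterman(a, b, match=2, mismatch=-1, gap=-1):
    """Textbook Smith-Waterman local alignment score.  Fills an
    (n+1) x (m+1) dynamic-programming matrix H, clamping every cell at
    zero (which is what makes the alignment local rather than global),
    and returns the best score found anywhere in the matrix."""
    n, m = len(a), len(b)
    H = [[0] * (m + 1) for _ in range(n + 1)]
    best = 0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            diag = H[i - 1][j - 1] + (match if a[i - 1] == b[j - 1] else mismatch)
            H[i][j] = max(0, diag,          # start a fresh alignment at 0
                          H[i - 1][j] + gap,  # gap in b
                          H[i][j - 1] + gap)  # gap in a
            best = max(best, H[i][j])
    return best

print(smith_waterman("ACGT", "ACGT"))  # 8 (four matches at +2 each)
print(smith_waterman("AAAA", "TTTT"))  # 0 (no positive-scoring local match)
```

Each iteration of the inner loop is one "cell update"; at 3.5 GCUPS, the GPU implementation described above performs 3.5 billion of these per second, which is the figure the paper compares against SSEARCH and BLAST.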
IEEE International Symposium on Parallel & Distributed Processing (IPDPS), 2009
Publication Date: 23-29 May 2009
In bioinformatics, alignments are commonly performed in genome and protein sequence analysis for gene identification and the study of evolutionary similarities. There are several approaches for such analysis, each varying in accuracy and computational complexity. Smith-Waterman (SW) is by far the most accurate algorithm for similarity scoring. However, the execution time of this algorithm on general-purpose processor-based systems makes it impractical for use by life scientists. In this paper we take Smith-Waterman as a case study to explore the architectural features of Graphics Processing Units (GPUs) and evaluate both the challenges the hardware architecture poses and the software modifications needed to map the program architecture onto the GPU. We achieve a 23x speedup against the serial version of the SW algorithm. We further study the effect of memory organization and the instruction set architecture on GPU performance. For that purpose we analyze another implementation on an Intel Quad Core processor that makes use of Intel's SIMD-based SSE2 architecture. We show that if reading blocks of 16 words at a time instead of 4 is allowed, and if 64 KB of shared memory (as opposed to 16 KB) is available to the programmer, GPU performance improves significantly, making it comparable to the SIMD-based implementation. We quantify these observations to illustrate the need for studies on extending the instruction set and memory organization of the GPU.