Ryan's Blog

Structural variation in the human genome

Posted in research by ryanlayer on October 19, 2009

Nature Review, Genetics, Vol. 7, February 2006

paper

Abstract:
The first wave of information from the analysis of the human genome revealed SNPs to be the main source of genetic and phenotypic human variation. However, the advent of genome-scanning technologies has now uncovered an unexpectedly large extent of what we term ‘structural variation’ in the human genome. This comprises microscopic and, more commonly, submicroscopic variants, which include deletions, duplications and large-scale copy-number variants — collectively termed copy-number variants or copy-number polymorphisms — as well as insertions, inversions and translocations. Rapidly accumulating evidence indicates that structural variants can comprise millions of nucleotides of heterogeneity within every genome, and are likely to make an important contribution to human diversity and disease susceptibility.

Glossary:

SNPs:  A single-nucleotide polymorphism (SNP, pronounced snip) is a DNA sequence variation occurring when a single nucleotideA, T, C, or G — in the genome (or other shared sequence) differs between members of a species (or between paired chromosomes in an individual). For example, two sequenced DNA fragments from different individuals, AAGCCTA to AAGCTTA, contain a difference in a single nucleotide. In this case we say that there are two alleles : C and T. Almost all common SNPs have only two alleles.

Within a population, SNPs can be assigned a minor allele frequency — the lowest allele frequency at a locus that is observed in a particular population. This is simply the lesser of the two allele frequencies for single-nucleotide polymorphisms. There are variations between human populations, so a SNP allele that is common in one geographical or ethnic group may be much rarer in another.  SOURCE:  http://en.wikipedia.org/wiki/Single-nucleotide_polymorphism

Advertisements

Improved Approximation Algorithms for Reconstructing the History of Tandem Repeats

Posted in research by ryanlayer on October 19, 2009

In IEEE/ACM Transactions on Computational Bilogy and Bioinformatics, Vol. 6, No. 3, Suly-September 2009

Paper

Abstract:

Some genetic diseases in human beings are dominated by short sequences repeated consec-
utively called tandem repeats
. Once a region containing tandem repeats is found, it is of great
interest to study the history of creating the repeats. The computational problem of reconstruct-
ing the duplication history of tandem repeats has been studied extensively in the literature.
Almost all previous studies focused on the simplest case where the size of each duplication block
is 1. Only recently we succeeded in giving the first polynomial-time approximation algorithm
with a guaranteed ratio for a more general case where the size of each duplication block is at
most 2; the algorithm achieves a ratio of 6 and runs in O(n11) time. In this paper, we present
two new polynomial-time approximation algorithms for this more general case. One of them
achieves a ratio of 5 and runs in O(n9) time while the other achieves a ratio of 2:5 + epsilon for any
constant epsilon> 0 but runs slower.

Glossary:

Tandem repeats: Occur in DNA when a pattern of two or more nucleotides is repeated and the repetitions are directly adjacent to each other.  An example would be: A-T-T-C-G-A-T-T-C-G-A-T-T-C-G in which the sequence A-T-T-C-G is repeated three times.  Tandem repeat describes a pattern that helps determine an individual’s inherited traits.  Tandem repeats can be very useful in determining parentage. Short tandem repeats are used for certain genealogical DNA tests.  SOURCE:  http://en.wikipedia.org/wiki/Tandem_repeat

Tagged with: ,

Finding Structural Variations with Pair-End Sequences and a Sliding Window

Posted in research by ryanlayer on October 19, 2009

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB)

Posted in research by ryanlayer on October 15, 2009

IEEE/ACM Transactions on Computational Biology and Bioinformatics (TCBB) is a quarterly journal that publishes archival research results related to the algorithmic, mathematical, statistical, and computational methods that are central in bioinformatics and computational biology.

Tagged with: , , ,

Workshop on Theory and Many-Cores (T&MC)

Posted in research by ryanlayer on October 9, 2009
Tagged with:

Parallel Computing Course

Posted in research by ryanlayer on October 9, 2009

While at Yale, Arvind Krishnamurthy taught  a parallel computing course that covered PRAM both in lecture (1, 2) and in handouts.

Tagged with: ,

Accelerating Leukocyte Tracking using CUDA: A Case Study in Leveraging Manycore Coprocessors

Posted in research by ryanlayer on October 8, 2009

In Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2009

paper

Abstract
The availability of easily programmable manycore CPUs and GPUs has motivated investigations into how to best exploit their tremendous computational power for scientific computing. Here we demonstrate how a systems biology application—detection and tracking of white blood cells in video microscopy—can be accelerated by 200x using a CUDA-capable GPU. Because the algorithms and implementation challenges are common to a wide range of applications, we discuss general techniques that allow programmers to make efficient use of a manycore GPU.

In Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2009
Tagged with: , ,

A Greedy Algorithm for Aligning DNA Sequences

Posted in research by ryanlayer on October 8, 2009

JOURNAL OF COMPUTATIONAL BIOLOGY Volume 7, Numbers 1/2, 2000

Paper

Abstract

For aligning DNA sequences that differ only by sequencing errors, or by equivalent errors
from other sources, a greedy algorithm can be much faster than traditional dynamic programming
approaches and yet produce an alignment that is guaranteed to be theoretically
optimal.We introduce a new greedy alignment algorithm with particularly good performance
and show that it computes the same alignment as does a certain dynamic programming algorithm,
while executing over 10 times faster on appropriate data. An implementation of
this algorithm is currently used in a program that assembles the UniGene database at the
National Center for Biotechnology Information.

Tagged with: , , ,

PARPST: a PARallel algorithm to find peptide sequence tags

Posted in research by ryanlayer on October 8, 2009
BMC Bioinformatics. 2008; 9(Suppl 4): S11.
Paper
Abstract
Background
Protein identification is one of the most challenging problems in proteomics. Tandem mass spectrometry provides an important tool to handle the protein identification problem.
Results
We developed a work-efficient parallel algorithm for the peptide sequence tag problem. The algorithm runs on the concurrent-read, exclusive-write PRAM in O(n) time using log n processors, where n is the number of mass peaks in the spectrum. The algorithm is able to find all the sequence tags having score greater than a parameter or all the sequence tags of maximum length. Our tests on 1507 spectra in the Open Proteomics Database shown that our algorithm is efficient and effective since achieves comparable results to other methods.
Conclusions
The proposed algorithm can be used to speed up the database searching or to identify post-translational modifications, comparing the homology of the sequence tags found with the sequences in the biological database.
Tagged with: , ,

Yeast genome analysis identifies chromosomal translocation, gene conversion events and several sites of Ty element insertion.

Posted in research by ryanlayer on October 8, 2009

Nucleic Acids Res. 2009 Aug 26.

Paper

Abstract

Paired end mapping of chromosomal fragments has been used in human cells to identify numerous structural variations in chromosomes of individuals and of cancer cell lines; however, the molecular, biological and bioinformatics methods for this technology are still in development. Here, we present a parallel bioinformatics approach to analyze chromosomal paired-end tag (ChromPET) sequence data and demonstrate its application in identifying gene rearrangements in the model organism Saccharomyces cerevisiae. We detected several expected events, including a chromosomal rearrangement of the nonessential arm of chromosome V induced by selective pressure, rearrangements introduced during strain construction and gene conversion at the MAT locus. In addition, we discovered several unannotated Ty element insertions that are present in the reference yeast strain, but not in the reference genome sequence, suggesting a few revisions are necessary in the latter. These data demonstrate that application of the chromPET technique to a genetically tractable organism like yeast provides an easy screen for studying the mechanisms of chromosomal rearrangements during the propagation of a species.