Ryan's Blog

Getting Sizes Right for CUDA

Posted in programming, research by ryanlayer on January 12, 2010

Block Elements:  this depends on the size of the block’s local memory (LOCAL_MEM) and the size of each element in the data structure (sizeof(int)).  A Tesla-generation card provides 16 KB per block, so LOCAL_MEM = 16384, and block_elements = 16384/4 = 4096.

Block Size (number of threads per block):  each thread is responsible for one window.  The number of threads per block depends on the block elements, the window size, and the slide size: (block_elements - window_length)/slide_length = (4096 - 200)/50 = 77 (integer division).

Usable Threads:  each thread is responsible for loading the first slide_length of its range from global memory to local memory.  This means that some threads at the end of the block will not have all of their data loaded.  There will be window_length/slide_length unusable threads (200/50 = 4), leaving block_size - window_length/slide_length usable threads (77 - 4 = 73).

NOTE:  blockDim.x and window_length/slide_length refer to numbers of windows (how many windows in a block, and how many unusable windows/threads per block).  To convert to a position we multiply the number of windows by slide_length.  For example, window 5 starts at position 5*slide_length.

Each block starts at position blockIdx.x*(blockDim.x - window_length/slide_length)*slide_length

blockIdx.x is the block ID.  If blocks did not need to overlap, we would just multiply it by blockDim.x*slide_length (blockDim.x counts windows, and we need a position, so we multiply by slide_length).  Since blocks overlap, each block after the first must start a few positions back.  The number of unusable windows at the end of each block is window_length/slide_length, and the next block must cover them.  Block sizes are fixed, so a block that is moved back to cover unusable windows leaves some unprocessed windows that the following block must cover in addition to its predecessor's unusable windows.  Block 1 must be moved back window_length/slide_length windows to cover the unusable windows in block 0; block 2 must be moved back 2*(window_length/slide_length) to cover both the unprocessed space and the unusable windows; block 3 must be moved back 3*(window_length/slide_length); and so forth.  In general, a block is moved back blockIdx.x*(window_length/slide_length) windows, and therefore each block starts at position blockIdx.x*(blockDim.x - window_length/slide_length)*slide_length.

Each thread, which corresponds to a window, starts at an offset from its block's start, and that offset is based on the slide size: threadIdx.x*slide_length + blockIdx.x*(blockDim.x - window_length/slide_length)*slide_length

Number of Blocks:  chrom_size/block_elements would be correct if there were no overlapping, but blocks must overlap to account for the unusable threads, so the chromosome is instead divided by each block's usable span: ceil(chrom_size / ((blockDim.x - window_length/slide_length)*slide_length)).


Accelerating Leukocyte Tracking using CUDA: A Case Study in Leveraging Manycore Coprocessors

Posted in research by ryanlayer on October 8, 2009

In Proceedings of the 23rd IEEE International Parallel and Distributed Processing Symposium (IPDPS), May 2009


The availability of easily programmable manycore CPUs and GPUs has motivated investigations into how to best exploit their tremendous computational power for scientific computing. Here we demonstrate how a systems biology application—detection and tracking of white blood cells in video microscopy—can be accelerated by 200x using a CUDA-capable GPU. Because the algorithms and implementation challenges are common to a wide range of applications, we discuss general techniques that allow programmers to make efficient use of a manycore GPU.


CUDA compatible GPU cards as efficient hardware accelerators for Smith-Waterman sequence alignment

Posted in research by ryanlayer on October 8, 2009

BMC Bioinformatics 2008, 9(Suppl 2):S10

Searching for similarities in protein and DNA databases has become a routine procedure in Molecular Biology. The Smith-Waterman algorithm has been available for more than 25 years. It is based on a dynamic programming approach that explores all the possible alignments between two sequences; as a result it returns the optimal local alignment. Unfortunately, the computational cost is very high, requiring a number of operations proportional to the product of the length of two sequences. Furthermore, the exponential growth of protein and DNA databases makes the Smith-Waterman algorithm unrealistic for searching similarities in large sets of sequences. For these reasons heuristic approaches such as those implemented in FASTA and BLAST tend to be preferred, allowing faster execution times at the cost of reduced sensitivity. The main motivation of our work is to exploit the huge computational power of commonly available graphic cards, to develop high performance solutions for sequence alignment.


In this paper we present what we believe is the fastest solution of the exact Smith-Waterman algorithm running on commodity hardware. It is implemented in the recently released CUDA programming environment by NVidia. CUDA allows direct access to the hardware primitives of the last-generation Graphics Processing Units (GPU) G80. Speeds of more than 3.5 GCUPS (Giga Cell Updates Per Second) are achieved on a workstation running two GeForce 8800 GTX. Exhaustive tests have been done to compare our implementation to SSEARCH and BLAST, running on a 3 GHz Intel Pentium IV processor. Our solution was also compared to a recently published GPU implementation and to a Single Instruction Multiple Data (SIMD) solution. These tests show that our implementation performs from 2 to 30 times faster than any other previous attempt available on commodity hardware.


The results show that graphic cards are now sufficiently advanced to be used as efficient hardware accelerators for sequence alignment. Their performance is better than any alternative available on commodity hardware platforms. The solution presented in this paper allows large scale alignments to be performed at low cost, using the exact Smith-Waterman algorithm instead of the largely adopted heuristic approaches.


Simple CUDA Program

Posted in research by ryanlayer on October 8, 2009

Getting this code to work may require some environment variable changes:

  • export LD_LIBRARY_PATH=/usr/local/cuda/lib/:$LD_LIBRARY_PATH
  • export PATH=/usr/local/cuda/bin/:$PATH

#include <stdio.h>
#include <stdlib.h>
#include <cuda.h>
#include <sys/time.h>
#include <math.h>

__global__ void vecMult_d(int *A, int *B, int N) {
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < N) { B[i] = A[i]*2; }
}

void vecMult_h(int *A, int *B, int N) {
  for (int i = 0; i < N; i++) { B[i] = A[i]*2; }
}

int main() {
int *a_h, *b_h; // pointers to host memory; a.k.a. CPU
int *a_d, *b_d; // pointers to device memory; a.k.a. GPU
//int blocksize=512, grid_size, n=32000;
int blocksize=512, n=1000000;
struct timeval t1_start,t1_end,t2_start,t2_end;
double time_d, time_h;

// allocate arrays on host
a_h = (int *)malloc(sizeof(int)*n);
b_h = (int *)malloc(sizeof(int)*n);

// allocate arrays on device
cudaMalloc((void **)&a_d,n*sizeof(int));
cudaMalloc((void **)&b_d,n*sizeof(int));
dim3 dimBlock(blocksize);
dim3 dimGrid(ceil(float(n)/float(dimBlock.x)));
for(int j=0;j<n;j++) a_h[j]=j;

// GPU
gettimeofday(&t1_start, NULL);
cudaMemcpy(a_d, a_h, n*sizeof(int), cudaMemcpyHostToDevice);
vecMult_d<<<dimGrid, dimBlock>>>(a_d, b_d, n);
cudaThreadSynchronize();
cudaMemcpy(b_h, b_d, n*sizeof(int), cudaMemcpyDeviceToHost);
gettimeofday(&t1_end, NULL);

// CPU
gettimeofday(&t2_start, NULL);
vecMult_h(a_h, b_h, n);
gettimeofday(&t2_end, NULL);

time_d = (t1_end.tv_sec - t1_start.tv_sec)*1000000 + t1_end.tv_usec - t1_start.tv_usec;
time_h = (t2_end.tv_sec - t2_start.tv_sec)*1000000 + t2_end.tv_usec - t2_start.tv_usec;
printf("%d %lf %lf\n", n, time_d, time_h);

free(a_h);
free(b_h);
cudaFree(a_d);
cudaFree(b_d);

return 0;
}

SOURCE: https://visualization.hpc.mil/wiki/Simple_CUDA_Program


Sequence Alignment with GPU: Performance and Design Challenges

Posted in research by ryanlayer on October 8, 2009

Parallel & Distributed Processing, 2009. IPDPS 2009. IEEE International Symposium on

Publication Date: 23-29 May 2009

In bioinformatics, alignments are commonly performed in genome and protein sequence analysis for gene identification and evolutionary similarities. There are several approaches for such analysis, each varying in accuracy and computational complexity. Smith-Waterman (SW) is by far the best algorithm for its accuracy in similarity scoring. However, execution time of this algorithm on general purpose processor based systems makes it impractical for use by life scientists. In this paper we take Smith-Waterman as a case study to explore the architectural features of Graphics Processing Units (GPUs) and evaluate the challenges the hardware architecture poses, as well as the software modifications needed to map the program architecture on to the GPU. We achieve a 23x speedup against the serial version of the SW algorithm. We further study the effect of memory organization and the instruction set architecture on GPU performance. For that purpose we analyze another implementation on an Intel Quad Core processor that makes use of Intel’s SIMD based SSE2 architecture. We show that if reading blocks of 16 words at a time instead of 4 is allowed, and if 64 KB of shared memory as opposed to 16 KB is available to the programmer, GPU performance enhances significantly making it comparable to the SIMD based implementation. We quantify these observations to illustrate the need for studies on extending the instruction set and memory organization for the GPU.

Programming Massively Parallel Processors Course

Posted in research by ryanlayer on October 4, 2009

Wen-Mei W. Hwu of the University of Illinois at Urbana-Champaign is teaching Programming Massively Parallel Processors. A draft of the course textbook, along with slides and lecture audio, is available online.