Ryan's Blog

Getting Sizes Right for CUDA

Posted in programming, research by ryanlayer on January 12, 2010

Block Elements:  this depends on the size of the block’s local memory (LOCAL_MEM) and the size of each element in the data structure (sizeof(int)).  For the Tesla card there is 16K of space so LOCAL_MEM=16384.  16384/4 = 4096

Block Size (number of threads per block):  Each thread is responsible for one window.  The number of threads per block depends on the block elements, the window size, and the slide size.  (block_elements – window_length) / slide_length. (4096 – 200)/50 = 77

Usable Threads:  Each thread is responsible for loading the first slide_length of its range from global memory to local memory.  This means that some threads at the end of the block will not have all of their data load.  There will be window_length/slide_length unusable threads, and block_size – (window_length/slide_length) usable threads.

NOTE:  blockDim.x and window_length/slide refer to number of windows (how many windows in a block, and how many unusable windows/threads per block).  To convert to position we often must multiply the number of windows by the slide_length.  For example, window 5 will start at position 5*slide_length.

Each block starts at position blockIdx.x*(blockDim.x – window_length/slide_length)*slide_length

blockIdx.x is the block ID, if blocks did not need to overlap, then we would just multiply this by blockDim.x*slide_length (blockDim.x refers to the number of windows, and we need position, so we multiply by slide_length).  Since things overlap, we need each block (after the first one) to start a few positions back.  The number of unusable windows at the end of each block is equal to window_length/slide_length.  The next block needs to cover these windows.  Block sizes are fixed, so blocks that are moved back to cover unused windows will leave some amount of unprocessed windows that must be covered by the next block (in addition to the unusable windows).  Block 1 needs to be moved back window_length/slide_length to cover the unusable windows in block 0; block 2 needs to be moved back 2*(window_length/slide_length) to cover both the unprocessed space and the unusable windows; block 3 needs to be moved back 3*(window_length/slide_length); and so forth.  The amount a block must be moved back is blockIdx.x*window_length/slide_length, and therefore each block starts at  blockIdx.x(blockDim.x – window_length/slide_length)slide_length.

Each thread, which corresponds to a window, starts at an offset from where the block starts, that offset is based on the slide size: threadIdx.x*slide_length + blockIdx.x(blockDim.x – window_length/slide_length)slide_length

Number of Blocks:  block_elements/chrom_size would be correct if there was no overlapping, but blocks must overlap to account for the unusable threads

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: