Ryan's Blog

C: Accelerating Data Processing with SSE and AVX intrinsics

Posted in programming by ryanlayer on October 17, 2014

Many programs contain loops that execute the same set of instructions on every element in an array. If the result from each of the elements is completely independent, then you could imagine speeding things up by processing all of the elements in parallel. This scenario, where multiple processors execute the same instructions, is refereed to as “data parallel” (v.s. task parallel) and there is a class of parallel architectures called Single Instruction Multiple Data (SIMD) that are optimized for this type of processing. While there are a number of SIMD architectures to choose from (NVIDA CUDA being a very powerful one), most modern CPUs support SIMD operations through a set of vector registers and instructions that allow operations to performed on a vector of elements.

The objective of this post is to demonstrate how to speed up data parallel portions of your C code through use of these vector registers/instructions. The capabilities of these vectors have improved over time, and just like CPUs are given new names, each generation of vector registers/instructions has a different name/version. I will cover the most widely used set of instructions called SSE (Streaming SIMD Extensions) and the very new (Q3’13) Advanced Vector Extensions (AVX2). The example I use has two functions, one that has almost no speedup with and another that is up to 4X faster. I believe that these examples nicely illuminate the basic steps that every programmer will need to follow when converting (or deciding whether to convert) an existing sequential loop to a parallel loop.

BASICS:

– Single Instruction Multiple Data (SIMD) is a type of parallelism (also called data parallel) where the same set of operations is performed on different data elements concurrently. The classic example being the element-wise summation of two arrays A and B into a resulting array R, where each R[i] = A[i] + B[i] is performed in parallel.

– Intel and AMD provide extensions to the basic x86 instruction set that can take advantage of data parallelism by operating on vectors. There are many different groups of instructions: MMX, SSE, AVX, etc.

– How “wide” these vectors are (how much data they can hold) and which operations can be performed on them depends on the CPU’s microarchitecture. The first generation of SSE instructions have been supported since the early Pentium processors, and the newest AVX2 instructions are supported by the Haswell microarchitecture which started shipping at the end of 2013.

– SSE registers are 128 bits wide, and AVX2 registers are 256 bit wide. If we are processing 32 bit ints, then SSE can process 4 ints at a time and AVX2 can process 8.

– In order to check what instructions your processor supports:

OS X :

$ sysctl hw | grep sse
hw.optional.sse: 1
hw.optional.sse2: 1
hw.optional.sse3: 1
hw.optional.supplementalsse3: 1
hw.optional.sse4_1: 1
hw.optional.sse4_2: 1
$ sysctl hw | grep avx
hw.optional.avx1_0: 1
hw.optional.avx2_0: 1

linux:

$ cat /proc/cpuinfo | grep flags
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid

As you can see my OS X machine is new and fancy and supports AVX2, while my linux box only has AVX (not 2) support.

– The programming model is fairly simple. One loads data into the vector registers, operates on them, moves data back to standard variables, then continues processing as normal.

– gcc, clang, etc provide a set of functions called “intrinsics” that you can use instead of writing straight assembly code. These functions are built in, and so there is no need to link in an external library. However, you need to tell the compiler which instructions you are using through flags like “-mavx” and “-msse4”.

– The best reference I have found is Intel’s: https://software.intel.com/sites/landingpage/IntrinsicsGuide/. Check the radio buttons that correspond to the instructions that are supported on your target hardware and you can’t go wrong.

CODE:

In the following sections, I present code using these intrinsics in the context of an algorithm for finding intersections between intervals. All the code and example data can be found at https://github.com/ryanlayer/avx_ex

To run the code simply:

$ git clone  https://github.com/ryanlayer/avx_ex
$ cd avx_ex
$ make
$ ./bed_intersect_sse_avx data/MB.bed data/MT.bed

This program counts the number of bases that are common between two BED files (http://genome.ucsc.edu/FAQ/FAQformat.html#format1). First, each BED file is mapped to a bit map where bit i = 1 only if the bed file has an interval that covers genomic position i. The bit map is represented by an array of unsigned 32-bit ints where each int corresponds to 32 genomic positions. Next, the two bed files are intersected by ANDing the two associated bit maps. Last, the summation of the resulting bits gives the total number that are set to 1. This total reflects the number of base pairs in common between the two BED interval sets.

Assuming that we have the following two bit maps (unsigned 32-bit arrays) that represent the two bed files (see code for details on this):

    uint32_t *bed_1, *bed_2, *bed_R;

    uint32_t bed_1_N = bed_to_bits(bed_1_file_name, &bed_1, GENOME_SIZE);
    uint32_t bed_2_N = bed_to_bits(bed_2_file_name, &bed_2, GENOME_SIZE);

We can find the intersection between bed_1 and bed_2 by taking the bit-wise AND of each element:

uint32_t bed_R_N = intersect_beds(bed_1, bed_2, &bed_R, GENOME_SIZE);

uint32_t intersect_beds(uint32_t *A,
                        uint32_t *B,
                        uint32_t **R,
                        uint32_t G_size)
{
    uint32_t ints_per_genome = (G_size + 32 - 1)/32;
    *R = (uint32_t *) calloc(ints_per_genome, sizeof(uint32_t));
    uint32_t i;
    for (i = 0; i < ints_per_genome; ++i)
        (*R)[i] = A[i] & B[i];
    return ints_per_genome;
}

This function is a classic SIMD scenario since all of the ANDs are independent, and we can use the SSE registers/instruction to perform some of these operations in parallel. SSE registers are 128 bits wide and can therefore hold 4 32-bit ints, which means that we can compute 4 values in parallel (R[i], R[i+1], R[i+2], and R[i+3]). The following function does exactly that:

uint32_t sse_intersect_beds(uint32_t *A,
                            uint32_t *B,
                            uint32_t **R,
                            uint32_t G_size)
{
    uint32_t ints_per_genome = (G_size + 32 - 1)/32;

    int r = posix_memalign((void **)R,
                            32,
                            ints_per_genome*sizeof(unsigned int));

    __m128i *A_sse = (__m128i *)A;
    __m128i *B_sse = (__m128i *)B;
    __m128i *R_sse = (__m128i *)*R;

    __m128 a, b;

    uint32_t i;

    for (i = 0; i < (ints_per_genome/4); ++i) {
        a = _mm_load_si128(&A_sse[i]);
        b = _mm_load_si128(&B_sse[i]);
        R_sse[i] = _mm_and_ps(a,b);
    }

    return ints_per_genome;
}

As with many parallel architectures, input data must be moved from the CPU memory space to some specialized location, and once the operations are complete the result must be moved back. In the case of SEE (and AVX), this location is a set of vector registers. While these register have names like xmm0 and you can use assembly to ship data around, compilers have made things much easier by providing specific types that associate variables with these registers and functions that handle the movement of data. Just remember that variables with types like “int” and “float” are in the CPU’s memory space, and variables with types like “__m128i” are in the vector space.

Lets step through the code:

Line 6: Since we are using a single bit for each genomic coordinate and there are 32 bits in an unsigned int, we will need the genome size / 32 (rounded up) ints to represent the full genome.

Line 8: When allocating space in this context, it is best to make sure that the allocated block is “memory aligned” (the start address is a multiple of, in this case, 32). Many instructions that move data between CPU and vector space require alignment, and those that don’t typically perform better when things are aligned

Lines 12-14: A and B are input arrays and R is an output array, and all three are in the CPU memory space. To facilitate the transfer of data between the CPU registers and the vector registers we need to create the associated variables in the vector space. This is done by casting the CPU-space variables to vector-space variables. Since you can used the 128 bits of the SSE2 register in any way you want, the __m128i specifics that in this case the values are integers.

Line 16: Create two new variables in vector-space that will be used in the for loop later.

Line 20: SSE2 registers are 128 bits wide, which means they can hold 4 32-bit variables. So in vector space A_sse[0] corresponds to A[0], A[1], A[2], and A[3], and B_sse[10] corresponds to B[40], B[41], B[42], B[43]. Considering this mapping, the for loop then scans from 0 to ints_per_genome/4.

Lines 21 and 22: Load the 4 32-bit ints from A (1 128-bit value from A_sse) and 4 32-bit ints from B (1 128-bit value from B_sse), to the vector registers corresponding to variables a and b.

Line 23: AND the vector registers corresponding to variables a and b and store the result back in the R. This computes the following opperations in a single instruction (i.e. in parallel): R[i*4] = A[i*4] + B[i*4], R[i*4+1] = A[i*4+1] + B[i*4+1], R[i*4+2] = A[i*4+2] + B[i*4+2], R[i*4+3] = A[i*4+3] + B[i*4+3]. Notice that we do not explicitly move the data from the vector space back to the CPU space. Those instructions are still executed, its just that the compiler took care of generating the code for us.

The code using AVX2 is nearly identical to SSE2, except for a few changes to account for the wider registers:

uint32_t avx_intersect_beds(uint32_t *A,
                            uint32_t *B,
                            uint32_t **R,
                            uint32_t G_size)
{
    uint32_t ints_per_genome = (G_size + 32 - 1)/32;

    int r = posix_memalign((void **)R,
                            32,
                            ints_per_genome*sizeof(unsigned int));

    __m256i *A_avx = (__m256i *)A;
    __m256i *B_avx = (__m256i *)B;
    __m256i *R_avx = (__m256i *)*R;

    __m256 a, b;

    uint32_t i;

    for (i = 0; i < (ints_per_genome/8); ++i) {
        a = _mm256_load_si256(&A_avx[i]);
        b = _mm256_load_si256(&B_avx[i]);
        R_avx[i] = _mm256_and_ps(a,b);
    }

    return ints_per_genome;
}

The amount of parallelism here is directly related to the size of the registers. If we use the AVX2 instruction set, we would expect to get twice the speedup since those registers are 256-bits wide. However, when we run the test:

$ ./bed_intersect_sse_avx data/MB.bed data/MT.bed

We get the following timings in microseconds

intersect_beds:		452664
count_bits:		9027021
sse_intersect_beds:	281873
sse_count_bits:		4449191
avx_intersect_beds:	221272
avx_count_bits:		2304001

As you can see, sse_intersect_beds is only 1.6X faster than the sequential intersect_beds, and avx_intersect_beds is only 2X faster. The issue is that we are performing 3 load operations (A to a, B to b, and the result back to R) to only 1 vector instruction. There is no way of getting around this, there is just not enough work to be done in parallel.

From those results you can also see that the counting operation is 2X faster with SSE and 4X faster with AVX2 (code below). While this is not the perfect 4X/8X speed that you might expect (but will almost certainly never get because of overhead), it is moving in the right direction because that function has a higher proportion of instructions to loads.

The count function is also much more complex, and is a good example of how parallelizing a function often requires rethinking the algorithm in the context of a particular architecture. The result of the intersect function is the bit map R, where R is any array of ints and each bit in those ints corresponds to a genomic region. If both bed files had intervals in that region, then the bit is 1, otherwise it is zero.

To find the sum in the sequential case, we simply set the variable sum to zero, loop of the ints in R, loop over each bit in each int, and then add 1 to the sum if that bit is zero.

uint32_t count_bits(uint32_t *R,
                    uint32_t R_size)
{
    uint32_t i,j,sum = 0;

    for (i = 0; i < R_size; ++i)
        for (j = 0; j < 32; ++j)
            sum += (R[i] >> j) & 1;

    return sum;
}

Since the vector instructions allow us to operate on 4 values at once in SSE2 and 8 in AVX2, we need to think about how best to apply that functionality while at the same time minimizing the number of loads. While there are many different solutions, I chose to load 4 (or 8) elements from R to the vector-space, then store the sum of each element into an independent summation register. In this strategy there is 1 load to 32 ands, 32 adds, and 32 shifts. Only after all elements in R have been considered do we move the summation values from vector-space back to the CPU-space.

Here is the SSE2 function:

uint32_t sse_count_bits(uint32_t *R,
                        uint32_t R_size)
{
    __m128i *R_sse = (__m128i *)R;

    __attribute__((aligned(32))) int masks[4] =  { 1, 1, 1, 1 };
    __m128i *masks_sse = (__m128i *)masks;
    __m128i m = _mm_load_si128(masks_sse);

    __m128i sums;
    sums = _mm_xor_si128 (sums, sums);
    __m128i cur, and;


    uint32_t i,j;
    for (i = 0; i < R_size/4; ++i) {
        cur = _mm_load_si128(&R_sse[i]);
        for (j = 0; j < 32; ++j) {
            and = _mm_and_si128 (cur, m);
            sums = _mm_add_epi32(sums, and);
            cur = _mm_srai_epi32(cur, 1);
        }
    }

    __attribute__((aligned(32))) int r[4];
    __m128i *r_sse = (__m128i *)r;
    _mm_store_si128(r_sse, sums);

    return r[0] + r[1] + r[2] + r[3];
}

Line 4: Just like before, we need to create a mapping between the CPU-space and vector-space variables. In this case our input array is R, and we cast that to a vector-space variable R_sse.

Lines 6-7: In the sequential code we anded the shifted bitmap value with 1 to isolate the last bit, if that bit was 1 we added 1 to the sum. This same process is also done here, but since we are operating on 4 values at once, we need a vector of 4 1’s. First we allocate a memory aligned array with 4 1’s, then move it over to a vector register.

Lines 10-12: Declare a new vector-space variable, and use the xor operation to zero out the bits.

Line 16: Once again SSE2 is 4 ints wide, so we need to scan from 0 to R_size/4.

Line 14: Load 4 of the bitmap values from R.

Line 18: Loop over each bit in those 4 bit maps.

Lines 19-20: Here we mask out the last bit, add those bits the running sum, then shift the values loaded from R.

Lines 25-27: Move the summation values from vector-space to CPU space.

Line 29: Since we maintained 4 different counts, we need to sum those together for a final count.

Once again the AVX2 code is very similar.

uint32_t avx_count_bits(uint32_t *R,
                        uint32_t R_size)
{
    __m256i *R_avx = (__m256i *)R;

    __attribute__((aligned(32))) int masks[8] =  { 1, 1, 1, 1, 1, 1, 1, 1 };
    __m256i *masks_avx = (__m256i *)masks;
    __m256i m = _mm256_load_si256(masks_avx);

    __m256i sums = _mm256_setzero_si256();
    __m256i cur, and;


    uint32_t i,j;
    for (i = 0; i < R_size/8; ++i) {
        cur = _mm256_load_si256(&R_avx[i]);
        for (j = 0; j < 32; ++j) {
            and = _mm256_and_si256 (cur, m);
            sums = _mm256_add_epi32(sums, and);
            cur = _mm256_srai_epi32(cur, 1);
        }
    }

    __attribute__((aligned(32))) int r[8];
    __m256i *r_avx = (__m256i *)r;
    _mm256_store_si256(r_avx, sums);

    return r[0] + r[1] + r[2] + r[3] + r[4] + r[5] + r[6] + r[7];
}

As you can see, there are many more vector instructions per load in this operation, which results is better overall performance. As vectors get wider, and there are already plans for 512 in future instruction sets, we can adapt these functions and expect a predictable performance improvement.

SUMMARY:

As with any parallel architecture, the extent to which vector processing can improve performance depends on how much of the code is parallelizable, and how much overhead is associated with shipping data around. In this example, the intersection function had a 3-to-1 proportion of memory loads to actual work. Not surprisingly, the resulting speedups were low. Things were much better in the count function, which struck a much more favorable balance with 1 load to 32 ands, adds, and shifts. While these two functions use only a fraction of the total number of vector instructions that are currently available (for a complete list see https://software.intel.com/sites/landingpage/IntrinsicsGuide/), a surprising number of analysis applications follow same basic structure as the intersect and count functions. Hopefully your application looks more like count and less like intersect.

Advertisements
Tagged with: , , , ,

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s

%d bloggers like this: