Ryan's Blog

Get CPU Info via Command Line

Posted in Uncategorized by ryanlayer on July 20, 2016

sysctl -n machdep.cpu.brand_string

via osxdaily

Getting samtools/bcftools to work with S3

Posted in Uncategorized by ryanlayer on April 27, 2016

Get htslib and configure it to enable libcurl

git clone https://github.com/samtools/htslib.git
cd htslib
./configure --enable-libcurl
cd ..

Get samtools, and modify the Makefile to use the relevant libraries

git clone https://github.com/samtools/samtools.git
cd samtools
vi Makefile

Change LIBS = to LIBS = -lcurl -lcrypto -lssl

Then compile.
cd ..

Get bcftools, and make the same changes to the Makefile
git clone https://github.com/samtools/bcftools.git
cd bcftools
vi Makefile

Now test it out.

samtools view s3://1000genomes/data/NA21124/alignment/NA21124.alt_bwamem_GRCh38DH.20150718.GIH.low_coverage.cram

Running commands in parallel with xargs

Posted in programming by ryanlayer on February 18, 2016

Because GNU Parallel is too complex for me, I use xargs.  The  -P N option will farm off each command to a pool of N threads, and since each command is likely to be some chain of pipes and file writes I also use the sh -c 'foo | bar > baz' option.  You will want to modify the -d " " option to contain whatever delimiter you are using.  Here I have spaces, but if your input is coming from a file you may need "\t" or "\n".

For example, if you have a file f.bed:

#chr start end
10   100   200
2    200   300
1    300   400
X    400   500
Y    500   600
15   600   700
7    100   200
1    200   300
3    300   400
5    400   500

And you want to split the file out by chromosome, sort by start, keep the header, and use 10 threads. Then you could:

echo -n $(seq 1 22) X Y \
| xargs -d ' ' -I{} -P 10 \
sh -c '(head -n1 f.bed; grep -w "^{}" f.bed | sort -n -k 2) > {}.f.bed'

This sends a list of chromosomes (echo -n $(seq 1 22) X Y xargs) to xargs. That list is then split by space (-d ' '). Each element in the split list is used to create a new command where the “{}” values are replaced by the element value. xargs will then manage the execution of these commands, in this case running 10 of these commands at a time (-P 10).

Tagged with: ,

Genotype Query Tools (GQT) index of 1000 Genomes phase 3

Posted in Uncategorized by ryanlayer on May 13, 2015

Genotype Query Tools (GQT) is command line software for indexing and querying large-scale genotype data sets like those produced by 1000 Genomes, the UK100K, and forthcoming datasets involving millions of genomes. GQT represents genotypes as compressed bitmap indices, which reduces the computational burden of variant queries based on sample genotypes, phenotypes, and relationships by orders of magnitude over standard “variant-centric” indexing strategies. This index can significantly expand the capabilities of population-scale analyses by providing interactive-speed queries to data sets with millions of individuals

We have made the GQT index for the 1000 Genomes Project phase 3 variants available through the 1000 Genomes FTP site:


For the code, basic installation instructions, and demos please refer to our github site (https://github.com/ryanlayer/gqt) and YouTube channel (http://bit.ly/gqt_videos). A detailed description and comparisons to other tools can be found in our bioRxiv preprint at http://biorxiv.org/content/early/2015/04/20/018259

Once you have GQT installed, you can begin exploring the 84 million variants across 2504 individuals that make up the phase 3 release by download the following files

  1. The compressed genotype index: ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.gqt
  2. The variant order: ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.vid
  3. The compressed variant metadata: ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.bim
  4. The sample metadata database: integrated_call_samples.20130502.ALL.ped.db

The sample metadata database is an SQLite database containing the sample attributes from the phase 3 PED file (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples.20130502.ALL.ped).Queries can be composed of any combination of the following fields:

BCF_Sample     TEXT
Family_ID      TEXT
Individual_ID  TEXT
Paternal_ID    TEXT
Maternal_ID    TEXT
Gender         TEXT
Phenotype      TEXT
Population     TEXT
Relationship   TEXT
Siblings       TEXT
Second_Order   TEXT
Third_Order    TEXT
Children       TEXT
Other_Comments TEXT


The BCF_ID and BCF_Sample fields correspond to the column number and sample name in the VCF file. The other fields correspond to the lines in the PED file where BCF_Sample matches Individual_ID. For valid values please refer to the source files.

Here is an example query:

Get a VCF file that contains the variants that are common in Europeans and rare in Africans.

gqt query \
  -i ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.gqt \
  -d integrated_call_samples.20130502.ALL.ped.db \
  -p "Population in ('CEU', 'EUR', 'TSI', 'FIN', 'GBR', 'IBS')" \
  -g "maf()>0.1" \
  -p "Population in ('YRI','LWK','GWD','MSL','ESN','ASW','ACB')" \
  -g "maf()<0.01" \
  > EUR_v_AFR.vcf

You can also use the count option (-c) to get just the number of matching variants. Considering the size of the VCF produced by the previous command, this operation is typically much faster.  In our tests, the full write operation took 48 sec and the count took 17 sec.

gqt query \
  -i ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.gqt \
  -d integrated_call_samples.20130502.ALL.ped.db \
  -p "Population in ('CEU', 'EUR', 'TSI', 'FIN', 'GBR', 'IBS')" \
  -g "maf()>0.1" \
  -p "Population in ('YRI','LWK','GWD','MSL','ESN','ASW','ACB')" \
  -g "maf()<0.01" \

GIT: Doing things the lazy way

Posted in Uncategorized by ryanlayer on February 5, 2015

Add all modified files:

git add `git ls-files --modified`

Add new .c files:

git add `git ls-files --other | grep .c`

Tagged with:

Matplotlib: using scientific notation in tick values

Posted in Uncategorized by ryanlayer on January 8, 2015

formatter = matplotlib.ticker.ScalarFormatter()

Tagged with: ,

C: Accelerating Data Processing with SSE and AVX intrinsics

Posted in programming by ryanlayer on October 17, 2014

Many programs contain loops that execute the same set of instructions on every element in an array. If the result from each of the elements is completely independent, then you could imagine speeding things up by processing all of the elements in parallel. This scenario, where multiple processors execute the same instructions, is refereed to as “data parallel” (v.s. task parallel) and there is a class of parallel architectures called Single Instruction Multiple Data (SIMD) that are optimized for this type of processing. While there are a number of SIMD architectures to choose from (NVIDA CUDA being a very powerful one), most modern CPUs support SIMD operations through a set of vector registers and instructions that allow operations to performed on a vector of elements.

The objective of this post is to demonstrate how to speed up data parallel portions of your C code through use of these vector registers/instructions. The capabilities of these vectors have improved over time, and just like CPUs are given new names, each generation of vector registers/instructions has a different name/version. I will cover the most widely used set of instructions called SSE (Streaming SIMD Extensions) and the very new (Q3’13) Advanced Vector Extensions (AVX2). The example I use has two functions, one that has almost no speedup with and another that is up to 4X faster. I believe that these examples nicely illuminate the basic steps that every programmer will need to follow when converting (or deciding whether to convert) an existing sequential loop to a parallel loop.


– Single Instruction Multiple Data (SIMD) is a type of parallelism (also called data parallel) where the same set of operations is performed on different data elements concurrently. The classic example being the element-wise summation of two arrays A and B into a resulting array R, where each R[i] = A[i] + B[i] is performed in parallel.

– Intel and AMD provide extensions to the basic x86 instruction set that can take advantage of data parallelism by operating on vectors. There are many different groups of instructions: MMX, SSE, AVX, etc.

– How “wide” these vectors are (how much data they can hold) and which operations can be performed on them depends on the CPU’s microarchitecture. The first generation of SSE instructions have been supported since the early Pentium processors, and the newest AVX2 instructions are supported by the Haswell microarchitecture which started shipping at the end of 2013.

– SSE registers are 128 bits wide, and AVX2 registers are 256 bit wide. If we are processing 32 bit ints, then SSE can process 4 ints at a time and AVX2 can process 8.

– In order to check what instructions your processor supports:

OS X :

$ sysctl hw | grep sse
hw.optional.sse: 1
hw.optional.sse2: 1
hw.optional.sse3: 1
hw.optional.supplementalsse3: 1
hw.optional.sse4_1: 1
hw.optional.sse4_2: 1
$ sysctl hw | grep avx
hw.optional.avx1_0: 1
hw.optional.avx2_0: 1


$ cat /proc/cpuinfo | grep flags
flags		: fpu vme de pse tsc msr pae mce cx8 apic sep mtrr pge mca cmov pat pse36 clflush dts acpi mmx fxsr sse sse2 ss ht tm pbe syscall nx pdpe1gb rdtscp lm constant_tsc arch_perfmon pebs bts rep_good nopl xtopology nonstop_tsc aperfmperf pni pclmulqdq dtes64 monitor ds_cpl vmx smx est tm2 ssse3 cx16 xtpr pdcm pcid dca sse4_1 sse4_2 x2apic popcnt tsc_deadline_timer aes xsave avx lahf_lm ida arat xsaveopt pln pts dtherm tpr_shadow vnmi flexpriority ept vpid

As you can see my OS X machine is new and fancy and supports AVX2, while my linux box only has AVX (not 2) support.

– The programming model is fairly simple. One loads data into the vector registers, operates on them, moves data back to standard variables, then continues processing as normal.

– gcc, clang, etc provide a set of functions called “intrinsics” that you can use instead of writing straight assembly code. These functions are built in, and so there is no need to link in an external library. However, you need to tell the compiler which instructions you are using through flags like “-mavx” and “-msse4”.

– The best reference I have found is Intel’s: https://software.intel.com/sites/landingpage/IntrinsicsGuide/. Check the radio buttons that correspond to the instructions that are supported on your target hardware and you can’t go wrong.


In the following sections, I present code using these intrinsics in the context of an algorithm for finding intersections between intervals. All the code and example data can be found at https://github.com/ryanlayer/avx_ex

To run the code simply:

$ git clone  https://github.com/ryanlayer/avx_ex
$ cd avx_ex
$ make
$ ./bed_intersect_sse_avx data/MB.bed data/MT.bed

This program counts the number of bases that are common between two BED files (http://genome.ucsc.edu/FAQ/FAQformat.html#format1). First, each BED file is mapped to a bit map where bit i = 1 only if the bed file has an interval that covers genomic position i. The bit map is represented by an array of unsigned 32-bit ints where each int corresponds to 32 genomic positions. Next, the two bed files are intersected by ANDing the two associated bit maps. Last, the summation of the resulting bits gives the total number that are set to 1. This total reflects the number of base pairs in common between the two BED interval sets.

Assuming that we have the following two bit maps (unsigned 32-bit arrays) that represent the two bed files (see code for details on this):

    uint32_t *bed_1, *bed_2, *bed_R;

    uint32_t bed_1_N = bed_to_bits(bed_1_file_name, &amp;amp;bed_1, GENOME_SIZE);
    uint32_t bed_2_N = bed_to_bits(bed_2_file_name, &amp;amp;bed_2, GENOME_SIZE);

We can find the intersection between bed_1 and bed_2 by taking the bit-wise AND of each element:

uint32_t bed_R_N = intersect_beds(bed_1, bed_2, &amp;amp;bed_R, GENOME_SIZE);

uint32_t intersect_beds(uint32_t *A,
                        uint32_t *B,
                        uint32_t **R,
                        uint32_t G_size)
    uint32_t ints_per_genome = (G_size + 32 - 1)/32;
    *R = (uint32_t *) calloc(ints_per_genome, sizeof(uint32_t));
    uint32_t i;
    for (i = 0; i &amp;lt; ints_per_genome; ++i)
        (*R)[i] = A[i] &amp;amp; B[i];
    return ints_per_genome;

This function is a classic SIMD scenario since all of the ANDs are independent, and we can use the SSE registers/instruction to perform some of these operations in parallel. SSE registers are 128 bits wide and can therefore hold 4 32-bit ints, which means that we can compute 4 values in parallel (R[i], R[i+1], R[i+2], and R[i+3]). The following function does exactly that:

uint32_t sse_intersect_beds(uint32_t *A,
                            uint32_t *B,
                            uint32_t **R,
                            uint32_t G_size)
    uint32_t ints_per_genome = (G_size + 32 - 1)/32;

    int r = posix_memalign((void **)R,
                            ints_per_genome*sizeof(unsigned int));

    __m128i *A_sse = (__m128i *)A;
    __m128i *B_sse = (__m128i *)B;
    __m128i *R_sse = (__m128i *)*R;

    __m128 a, b;

    uint32_t i;

    for (i = 0; i &amp;lt; (ints_per_genome/4); ++i) {
        a = _mm_load_si128(&amp;amp;A_sse[i]);
        b = _mm_load_si128(&amp;amp;B_sse[i]);
        R_sse[i] = _mm_and_ps(a,b);

    return ints_per_genome;

As with many parallel architectures, input data must be moved from the CPU memory space to some specialized location, and once the operations are complete the result must be moved back. In the case of SEE (and AVX), this location is a set of vector registers. While these register have names like xmm0 and you can use assembly to ship data around, compilers have made things much easier by providing specific types that associate variables with these registers and functions that handle the movement of data. Just remember that variables with types like “int” and “float” are in the CPU’s memory space, and variables with types like “__m128i” are in the vector space.

Lets step through the code:

Line 6: Since we are using a single bit for each genomic coordinate and there are 32 bits in an unsigned int, we will need the genome size / 32 (rounded up) ints to represent the full genome.

Line 8: When allocating space in this context, it is best to make sure that the allocated block is “memory aligned” (the start address is a multiple of, in this case, 32). Many instructions that move data between CPU and vector space require alignment, and those that don’t typically perform better when things are aligned

Lines 12-14: A and B are input arrays and R is an output array, and all three are in the CPU memory space. To facilitate the transfer of data between the CPU registers and the vector registers we need to create the associated variables in the vector space. This is done by casting the CPU-space variables to vector-space variables. Since you can used the 128 bits of the SSE2 register in any way you want, the __m128i specifics that in this case the values are integers.

Line 16: Create two new variables in vector-space that will be used in the for loop later.

Line 20: SSE2 registers are 128 bits wide, which means they can hold 4 32-bit variables. So in vector space A_sse[0] corresponds to A[0], A[1], A[2], and A[3], and B_sse[10] corresponds to B[40], B[41], B[42], B[43]. Considering this mapping, the for loop then scans from 0 to ints_per_genome/4.

Lines 21 and 22: Load the 4 32-bit ints from A (1 128-bit value from A_sse) and 4 32-bit ints from B (1 128-bit value from B_sse), to the vector registers corresponding to variables a and b.

Line 23: AND the vector registers corresponding to variables a and b and store the result back in the R. This computes the following opperations in a single instruction (i.e. in parallel): R[i*4] = A[i*4] + B[i*4], R[i*4+1] = A[i*4+1] + B[i*4+1], R[i*4+2] = A[i*4+2] + B[i*4+2], R[i*4+3] = A[i*4+3] + B[i*4+3]. Notice that we do not explicitly move the data from the vector space back to the CPU space. Those instructions are still executed, its just that the compiler took care of generating the code for us.

The code using AVX2 is nearly identical to SSE2, except for a few changes to account for the wider registers:

uint32_t avx_intersect_beds(uint32_t *A,
                            uint32_t *B,
                            uint32_t **R,
                            uint32_t G_size)
    uint32_t ints_per_genome = (G_size + 32 - 1)/32;

    int r = posix_memalign((void **)R,
                            ints_per_genome*sizeof(unsigned int));

    __m256i *A_avx = (__m256i *)A;
    __m256i *B_avx = (__m256i *)B;
    __m256i *R_avx = (__m256i *)*R;

    __m256 a, b;

    uint32_t i;

    for (i = 0; i &amp;lt; (ints_per_genome/8); ++i) {
        a = _mm256_load_si256(&amp;amp;A_avx[i]);
        b = _mm256_load_si256(&amp;amp;B_avx[i]);
        R_avx[i] = _mm256_and_ps(a,b);

    return ints_per_genome;

The amount of parallelism here is directly related to the size of the registers. If we use the AVX2 instruction set, we would expect to get twice the speedup since those registers are 256-bits wide. However, when we run the test:

$ ./bed_intersect_sse_avx data/MB.bed data/MT.bed

We get the following timings in microseconds

intersect_beds:		452664
count_bits:		9027021
sse_intersect_beds:	281873
sse_count_bits:		4449191
avx_intersect_beds:	221272
avx_count_bits:		2304001

As you can see, sse_intersect_beds is only 1.6X faster than the sequential intersect_beds, and avx_intersect_beds is only 2X faster. The issue is that we are performing 3 load operations (A to a, B to b, and the result back to R) to only 1 vector instruction. There is no way of getting around this, there is just not enough work to be done in parallel.

From those results you can also see that the counting operation is 2X faster with SSE and 4X faster with AVX2 (code below). While this is not the perfect 4X/8X speed that you might expect (but will almost certainly never get because of overhead), it is moving in the right direction because that function has a higher proportion of instructions to loads.

The count function is also much more complex, and is a good example of how parallelizing a function often requires rethinking the algorithm in the context of a particular architecture. The result of the intersect function is the bit map R, where R is any array of ints and each bit in those ints corresponds to a genomic region. If both bed files had intervals in that region, then the bit is 1, otherwise it is zero.

To find the sum in the sequential case, we simply set the variable sum to zero, loop of the ints in R, loop over each bit in each int, and then add 1 to the sum if that bit is zero.

uint32_t count_bits(uint32_t *R,
                    uint32_t R_size)
    uint32_t i,j,sum = 0;

    for (i = 0; i &amp;lt; R_size; ++i)
        for (j = 0; j &amp;lt; 32; ++j)
            sum += (R[i] &amp;gt;&amp;gt; j) &amp;amp; 1;

    return sum;

Since the vector instructions allow us to operate on 4 values at once in SSE2 and 8 in AVX2, we need to think about how best to apply that functionality while at the same time minimizing the number of loads. While there are many different solutions, I chose to load 4 (or 8) elements from R to the vector-space, then store the sum of each element into an independent summation register. In this strategy there is 1 load to 32 ands, 32 adds, and 32 shifts. Only after all elements in R have been considered do we move the summation values from vector-space back to the CPU-space.

Here is the SSE2 function:

uint32_t sse_count_bits(uint32_t *R,
                        uint32_t R_size)
    __m128i *R_sse = (__m128i *)R;

    __attribute__((aligned(32))) int masks[4] =  { 1, 1, 1, 1 };
    __m128i *masks_sse = (__m128i *)masks;
    __m128i m = _mm_load_si128(masks_sse);

    __m128i sums;
    sums = _mm_xor_si128 (sums, sums);
    __m128i cur, and;

    uint32_t i,j;
    for (i = 0; i &amp;lt; R_size/4; ++i) {
        cur = _mm_load_si128(&amp;amp;R_sse[i]);
        for (j = 0; j &amp;lt; 32; ++j) {
            and = _mm_and_si128 (cur, m);
            sums = _mm_add_epi32(sums, and);
            cur = _mm_srai_epi32(cur, 1);

    __attribute__((aligned(32))) int r[4];
    __m128i *r_sse = (__m128i *)r;
    _mm_store_si128(r_sse, sums);

    return r[0] + r[1] + r[2] + r[3];

Line 4: Just like before, we need to create a mapping between the CPU-space and vector-space variables. In this case our input array is R, and we cast that to a vector-space variable R_sse.

Lines 6-7: In the sequential code we anded the shifted bitmap value with 1 to isolate the last bit, if that bit was 1 we added 1 to the sum. This same process is also done here, but since we are operating on 4 values at once, we need a vector of 4 1’s. First we allocate a memory aligned array with 4 1’s, then move it over to a vector register.

Lines 10-12: Declare a new vector-space variable, and use the xor operation to zero out the bits.

Line 16: Once again SSE2 is 4 ints wide, so we need to scan from 0 to R_size/4.

Line 14: Load 4 of the bitmap values from R.

Line 18: Loop over each bit in those 4 bit maps.

Lines 19-20: Here we mask out the last bit, add those bits the running sum, then shift the values loaded from R.

Lines 25-27: Move the summation values from vector-space to CPU space.

Line 29: Since we maintained 4 different counts, we need to sum those together for a final count.

Once again the AVX2 code is very similar.

uint32_t avx_count_bits(uint32_t *R,
                        uint32_t R_size)
    __m256i *R_avx = (__m256i *)R;

    __attribute__((aligned(32))) int masks[8] =  { 1, 1, 1, 1, 1, 1, 1, 1 };
    __m256i *masks_avx = (__m256i *)masks;
    __m256i m = _mm256_load_si256(masks_avx);

    __m256i sums = _mm256_setzero_si256();
    __m256i cur, and;

    uint32_t i,j;
    for (i = 0; i &amp;lt; R_size/8; ++i) {
        cur = _mm256_load_si256(&amp;amp;R_avx[i]);
        for (j = 0; j &amp;lt; 32; ++j) {
            and = _mm256_and_si256 (cur, m);
            sums = _mm256_add_epi32(sums, and);
            cur = _mm256_srai_epi32(cur, 1);

    __attribute__((aligned(32))) int r[8];
    __m256i *r_avx = (__m256i *)r;
    _mm256_store_si256(r_avx, sums);

    return r[0] + r[1] + r[2] + r[3] + r[4] + r[5] + r[6] + r[7];

As you can see, there are many more vector instructions per load in this operation, which results is better overall performance. As vectors get wider, and there are already plans for 512 in future instruction sets, we can adapt these functions and expect a predictable performance improvement.


As with any parallel architecture, the extent to which vector processing can improve performance depends on how much of the code is parallelizable, and how much overhead is associated with shipping data around. In this example, the intersection function had a 3-to-1 proportion of memory loads to actual work. Not surprisingly, the resulting speedups were low. Things were much better in the count function, which struck a much more favorable balance with 1 load to 32 ands, adds, and shifts. While these two functions use only a fraction of the total number of vector instructions that are currently available (for a complete list see https://software.intel.com/sites/landingpage/IntrinsicsGuide/), a surprising number of analysis applications follow same basic structure as the intersect and count functions. Hopefully your application looks more like count and less like intersect.

Tagged with: , , , ,

C: allocate and assign values to an array of arrays

Posted in programming by ryanlayer on July 28, 2014
void f(int ***A, int r, int c)
  int i,j;
  if (*A == NULL) {
    *A = (int **) malloc (r*sizeof(int *));
    for (i = 0; i < r; ++i) {
      (*A)[i] = (int *) malloc (c*sizeof(int));

  for (i = 0; i < r; ++i) {
    for (j = 0; j < c; ++j) {
      (*A)[i][j] += i*10+j;


int main()
  int i,j,r = 5, c = 3;

  int **A = NULL;

  f(&A, r, c);
  f(&A, r, c);

  for (i = 0; i < r; ++i) {
    for (j = 0; j < c; ++j) {
      printf("%d\n", A[i][j]);
Tagged with:

awk: Find the sum of each line

Posted in programming by ryanlayer on June 23, 2014

cat data.txt | awk '{for(i=1;i<=NF;i++) t+=$i; print t; t=0}'

Found here.

Tagged with:

BASH: Compare the dates of two files

Posted in programming by ryanlayer on May 27, 2014

if [ "$file1" -ot "$file2" ]
  cp -f "$file2" "$file1"

via stackoverflow

Tagged with: