Ryan's Blog

Get CPU Info via Command Line

Posted in Uncategorized by ryanlayer on July 20, 2016

sysctl -n machdep.cpu.brand_string

via osxdaily


Getting samtools/bcftools to work with S3

Posted in Uncategorized by ryanlayer on April 27, 2016

Get htslib and configure it to enable libcurl

git clone https://github.com/samtools/htslib.git
cd htslib
./configure --enable-libcurl
cd ..

Get samtools, and modify the Makefile to use the relevant libraries

git clone https://github.com/samtools/samtools.git
cd samtools
vi Makefile

Change LIBS = to LIBS = -lcurl -lcrypto -lssl

Then compile.
cd ..

Get bcftools, and make the same changes to the Makefile
git clone https://github.com/samtools/bcftools.git
cd bcftools
vi Makefile

Now test it out.

samtools view s3://1000genomes/data/NA21124/alignment/NA21124.alt_bwamem_GRCh38DH.20150718.GIH.low_coverage.cram

Genotype Query Tools (GQT) index of 1000 Genomes phase 3

Posted in Uncategorized by ryanlayer on May 13, 2015

Genotype Query Tools (GQT) is command line software for indexing and querying large-scale genotype data sets like those produced by 1000 Genomes, the UK100K, and forthcoming datasets involving millions of genomes. GQT represents genotypes as compressed bitmap indices, which reduces the computational burden of variant queries based on sample genotypes, phenotypes, and relationships by orders of magnitude over standard “variant-centric” indexing strategies. This index can significantly expand the capabilities of population-scale analyses by providing interactive-speed queries to data sets with millions of individuals

We have made the GQT index for the 1000 Genomes Project phase 3 variants available through the 1000 Genomes FTP site:


For the code, basic installation instructions, and demos please refer to our github site (https://github.com/ryanlayer/gqt) and YouTube channel (http://bit.ly/gqt_videos). A detailed description and comparisons to other tools can be found in our bioRxiv preprint at http://biorxiv.org/content/early/2015/04/20/018259

Once you have GQT installed, you can begin exploring the 84 million variants across 2504 individuals that make up the phase 3 release by download the following files

  1. The compressed genotype index: ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.gqt
  2. The variant order: ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.vid
  3. The compressed variant metadata: ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.bim
  4. The sample metadata database: integrated_call_samples.20130502.ALL.ped.db

The sample metadata database is an SQLite database containing the sample attributes from the phase 3 PED file (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples.20130502.ALL.ped).Queries can be composed of any combination of the following fields:

BCF_Sample     TEXT
Family_ID      TEXT
Individual_ID  TEXT
Paternal_ID    TEXT
Maternal_ID    TEXT
Gender         TEXT
Phenotype      TEXT
Population     TEXT
Relationship   TEXT
Siblings       TEXT
Second_Order   TEXT
Third_Order    TEXT
Children       TEXT
Other_Comments TEXT


The BCF_ID and BCF_Sample fields correspond to the column number and sample name in the VCF file. The other fields correspond to the lines in the PED file where BCF_Sample matches Individual_ID. For valid values please refer to the source files.

Here is an example query:

Get a VCF file that contains the variants that are common in Europeans and rare in Africans.

gqt query \
  -i ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.gqt \
  -d integrated_call_samples.20130502.ALL.ped.db \
  -p "Population in ('CEU', 'EUR', 'TSI', 'FIN', 'GBR', 'IBS')" \
  -g "maf()>0.1" \
  -p "Population in ('YRI','LWK','GWD','MSL','ESN','ASW','ACB')" \
  -g "maf()<0.01" \
  > EUR_v_AFR.vcf

You can also use the count option (-c) to get just the number of matching variants. Considering the size of the VCF produced by the previous command, this operation is typically much faster.  In our tests, the full write operation took 48 sec and the count took 17 sec.

gqt query \
  -i ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.gqt \
  -d integrated_call_samples.20130502.ALL.ped.db \
  -p "Population in ('CEU', 'EUR', 'TSI', 'FIN', 'GBR', 'IBS')" \
  -g "maf()>0.1" \
  -p "Population in ('YRI','LWK','GWD','MSL','ESN','ASW','ACB')" \
  -g "maf()<0.01" \

GIT: Doing things the lazy way

Posted in Uncategorized by ryanlayer on February 5, 2015

Add all modified files:

git add `git ls-files --modified`

Add new .c files:

git add `git ls-files --other | grep .c`

Tagged with:

Matplotlib: using scientific notation in tick values

Posted in Uncategorized by ryanlayer on January 8, 2015

formatter = matplotlib.ticker.ScalarFormatter()

Tagged with: ,

Matplotlib: Set Font

Posted in Uncategorized by ryanlayer on April 22, 2014

from matplotlib import rcParams
rcParams['font.family'] = 'Arial'

Tagged with: ,

Latex: add date/time to footer

Posted in Uncategorized by ryanlayer on December 31, 2013
\usepackage{datetime} %access to \currenttime
\fancyhead{} %clear out default
\fancyfoot[R]{\today\ \currenttime} %set date/time on right
\fancyfoot[C]{\thepage} % page in center
\renewcommand{\headrulewidth}{0pt} % remove awful top bar

More at here and MORE here.

Tagged with: , ,

Matplotlib: plot border, removing and setting width

Posted in Uncategorized by ryanlayer on December 17, 2013
Tagged with: ,

Matplotlib: axis labels, rotating and aligning

Posted in Uncategorized by ryanlayer on December 17, 2013

x = np.arange(n)
y = np.sin(np.linspace(-3,3,n))
xlabels = ['Ticklabel %i' % i for i in range(n)]

fig, axs = plt.subplots(1,3, figsize=(12,3))

ha = ['right', 'center', 'left']

for n, ax in enumerate(axs):
    ax.plot(x,y, 'o-')
    ax.set_xticklabels(xlabels, rotation=40, ha=ha[n])



Tagged with: , ,

Venter Genome Simulation

Posted in Uncategorized by ryanlayer on December 19, 2012

$ wc -ml Simulated_Venter_Chr17.fa
1575757 80363571 Simulated_Venter_Chr17.fa

There are 1575757 lines and 80363571 characters (including new line chars).  Minus new lines, there are 78787814 characters.  The first line is “>Simulated_Venter_Chr17”, which is 24 characters, so the final genome size is 78787790.

To find the number of pairs we want to simulate we use the following formula: