Ryan's Blog

Genotype Query Tools (GQT) index of 1000 Genomes phase 3

Posted in Uncategorized by ryanlayer on May 13, 2015

Genotype Query Tools (GQT) is command-line software for indexing and querying large-scale genotype data sets like those produced by 1000 Genomes, the UK100K, and forthcoming datasets involving millions of genomes. GQT represents genotypes as compressed bitmap indices, which reduces the computational burden of variant queries based on sample genotypes, phenotypes, and relationships by orders of magnitude over standard “variant-centric” indexing strategies. This index can significantly expand the capabilities of population-scale analyses by providing interactive-speed queries of data sets with millions of individuals.

We have made the GQT index for the 1000 Genomes Project phase 3 variants available through the 1000 Genomes FTP site:


For the code, basic installation instructions, and demos, please refer to our GitHub site (https://github.com/ryanlayer/gqt) and YouTube channel (http://bit.ly/gqt_videos). A detailed description and comparisons to other tools can be found in our bioRxiv preprint at http://biorxiv.org/content/early/2015/04/20/018259

Once you have GQT installed, you can begin exploring the 84 million variants across 2,504 individuals that make up the phase 3 release by downloading the following files:

  1. The compressed genotype index: ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.gqt
  2. The variant order: ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.vid
  3. The compressed variant metadata: ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.bim
  4. The sample metadata database: integrated_call_samples.20130502.ALL.ped.db

The sample metadata database is an SQLite database containing the sample attributes from the phase 3 PED file (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples.20130502.ALL.ped). Queries can be composed of any combination of the following fields:

BCF_Sample     TEXT
Family_ID      TEXT
Individual_ID  TEXT
Paternal_ID    TEXT
Maternal_ID    TEXT
Gender         TEXT
Phenotype      TEXT
Population     TEXT
Relationship   TEXT
Siblings       TEXT
Second_Order   TEXT
Third_Order    TEXT
Children       TEXT
Other_Comments TEXT


The BCF_ID and BCF_Sample fields correspond to the column number and sample name in the VCF file. The other fields correspond to the lines in the PED file where BCF_Sample matches Individual_ID. For valid values please refer to the source files.
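
Since the valid values live in the source files, it can be handy to look inside the metadata database directly. Here is a minimal Python sketch that assumes only the standard sqlite3 module. This post does not name the table inside the database, so the sketch looks it up first. The helper names list_tables and distinct_values are illustrative and are not part of GQT.

```python
import sqlite3
from contextlib import closing
from pathlib import Path


def _open_read_only(db_path):
    # Open read-only so a mistyped path raises an error instead of
    # silently creating an empty database file.
    uri = Path(db_path).resolve().as_uri() + "?mode=ro"
    return sqlite3.connect(uri, uri=True)


def _quote(identifier):
    # Identifiers cannot be bound as SQL parameters, so quote them by hand.
    return '"' + identifier.replace('"', '""') + '"'


def list_tables(db_path):
    """Return the names of the tables stored in the SQLite file."""
    with closing(_open_read_only(db_path)) as conn:
        rows = conn.execute(
            "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
        ).fetchall()
    return [name for (name,) in rows]


def distinct_values(db_path, table, column):
    """Return the distinct values found in one column of one table."""
    sql = f"SELECT DISTINCT {_quote(column)} FROM {_quote(table)} ORDER BY 1"
    with closing(_open_read_only(db_path)) as conn:
        return [value for (value,) in conn.execute(sql)]
```

Once integrated_call_samples.20130502.ALL.ped.db is downloaded, calling list_tables on it and then distinct_values with one of the returned table names and a field such as Population should show which codes are available for a query.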

Here is an example query:

Get a VCF file that contains the variants that are common in Europeans and rare in Africans.

gqt query \
  -i ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.gqt \
  -d integrated_call_samples.20130502.ALL.ped.db \
  -p "Population in ('CEU', 'EUR', 'TSI', 'FIN', 'GBR', 'IBS')" \
  -g "maf()>0.1" \
  -p "Population in ('YRI','LWK','GWD','MSL','ESN','ASW','ACB')" \
  -g "maf()<0.01" \
  > EUR_v_AFR.vcf

You can also use the count option (-c) to get just the number of matching variants. Because it avoids writing the large VCF produced by the previous command, counting is typically much faster. In our tests, the full write operation took 48 seconds and the count took 17 seconds.

gqt query \
  -i ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.gqt \
  -d integrated_call_samples.20130502.ALL.ped.db \
  -p "Population in ('CEU', 'EUR', 'TSI', 'FIN', 'GBR', 'IBS')" \
  -g "maf()>0.1" \
  -p "Population in ('YRI','LWK','GWD','MSL','ESN','ASW','ACB')" \
  -g "maf()<0.01" \
  -c
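
To make the filter in the query above concrete, here is a toy Python sketch of the logic as described: a variant is kept only if it is common in the first group of samples and rare in the second. It is a brute-force illustration, not how GQT evaluates queries, since GQT works on compressed bitmap indices. Two things are assumptions here: genotypes are encoded as diploid alternate-allele counts (0, 1, or 2), and minor allele frequency is taken as the smaller of the alternate and reference allele frequencies, which may differ in detail from GQT's maf(). All names and the toy data are made up for illustration.

```python
def maf(genotypes):
    """Minor allele frequency from diploid alt-allele counts (0, 1 or 2).

    Assumption: MAF is min(alt frequency, 1 - alt frequency).
    """
    alt_freq = sum(genotypes) / (2 * len(genotypes))
    return min(alt_freq, 1 - alt_freq)


def common_here_rare_there(variants, group_a, group_b, min_a=0.1, max_b=0.01):
    """Return IDs of variants with MAF > min_a in group_a and MAF < max_b in group_b.

    variants maps a variant ID to a dict of sample name -> alt-allele count.
    """
    hits = []
    for variant_id, calls in variants.items():
        maf_a = maf([calls[sample] for sample in group_a])
        maf_b = maf([calls[sample] for sample in group_b])
        if maf_a > min_a and maf_b < max_b:
            hits.append(variant_id)
    return hits


# Hypothetical toy data: two "European" and two "African" samples.
toy_variants = {
    "v1": {"E1": 1, "E2": 0, "A1": 0, "A2": 0},  # common in E, absent in A
    "v2": {"E1": 0, "E2": 0, "A1": 1, "A2": 2},  # absent in E
    "v3": {"E1": 2, "E2": 1, "A1": 1, "A2": 0},  # common in both groups
}

matches = common_here_rare_there(toy_variants, ["E1", "E2"], ["A1", "A2"])
print(matches)       # the matching variants, analogous to writing the VCF
print(len(matches))  # just the number of matches, analogous to -c
```

With this toy data only v1 passes both conditions, so the first print shows ['v1'] and the second shows 1.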
