Genotype Query Tools (GQT) index of 1000 Genomes phase 3
Genotype Query Tools (GQT) is command line software for indexing and querying large-scale genotype data sets like those produced by 1000 Genomes, the UK100K, and forthcoming datasets involving millions of genomes. GQT represents genotypes as compressed bitmap indices, which reduces the computational burden of variant queries based on sample genotypes, phenotypes, and relationships by orders of magnitude over standard “variant-centric” indexing strategies. This index can significantly expand the capabilities of population-scale analyses by providing interactive-speed queries to data sets with millions of individuals
We have made the GQT index for the 1000 Genomes Project phase 3 variants available through the 1000 Genomes FTP site:
For the code, basic installation instructions, and demos please refer to our github site (https://github.com/ryanlayer/gqt) and YouTube channel (http://bit.ly/gqt_videos). A detailed description and comparisons to other tools can be found in our bioRxiv preprint at http://biorxiv.org/content/early/2015/04/20/018259
Once you have GQT installed, you can begin exploring the 84 million variants across 2504 individuals that make up the phase 3 release by download the following files
- The compressed genotype index: ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.gqt
- The variant order: ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.vid
- The compressed variant metadata: ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.bim
- The sample metadata database: integrated_call_samples.20130502.ALL.ped.db
The sample metadata database is an SQLite database containing the sample attributes from the phase 3 PED file (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples.20130502.ALL.ped).Queries can be composed of any combination of the following fields:
BCF_ID INTEGER BCF_Sample TEXT Family_ID TEXT Individual_ID TEXT Paternal_ID TEXT Maternal_ID TEXT Gender TEXT Phenotype TEXT Population TEXT Relationship TEXT Siblings TEXT Second_Order TEXT Third_Order TEXT Children TEXT Other_Comments TEXT
The BCF_ID and BCF_Sample fields correspond to the column number and sample name in the VCF file. The other fields correspond to the lines in the PED file where BCF_Sample matches Individual_ID. For valid values please refer to the source files.
Here is an example query:
Get a VCF file that contains the variants that are common in Europeans and rare in Africans.
gqt query \ -i ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.gqt \ -d integrated_call_samples.20130502.ALL.ped.db \ -p "Population in ('CEU', 'EUR', 'TSI', 'FIN', 'GBR', 'IBS')" \ -g "maf()>0.1" \ -p "Population in ('YRI','LWK','GWD','MSL','ESN','ASW','ACB')" \ -g "maf()<0.01" \ > EUR_v_AFR.vcf
You can also use the count option (-c) to get just the number of matching variants. Considering the size of the VCF produced by the previous command, this operation is typically much faster. In our tests, the full write operation took 48 sec and the count took 17 sec.
gqt query \ -i ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.gqt \ -d integrated_call_samples.20130502.ALL.ped.db \ -p "Population in ('CEU', 'EUR', 'TSI', 'FIN', 'GBR', 'IBS')" \ -g "maf()>0.1" \ -p "Population in ('YRI','LWK','GWD','MSL','ESN','ASW','ACB')" \ -g "maf()<0.01" \ -c 155323