Get htslib and configure it to enable libcurl
git clone https://github.com/samtools/htslib.git
Get samtools, and modify the Makefile to use the relevant libraries
git clone https://github.com/samtools/samtools.git
LIBS = to
LIBS = -lcurl -lcrypto -lssl
Get bcftools, and make the same changes to the Makefile
git clone https://github.com/samtools/bcftools.git
Now test it out.
samtools view s3://1000genomes/data/NA21124/alignment/NA21124.alt_bwamem_GRCh38DH.20150718.GIH.low_coverage.cram
Genotype Query Tools (GQT) is command line software for indexing and querying large-scale genotype data sets like those produced by 1000 Genomes, the UK100K, and forthcoming datasets involving millions of genomes. GQT represents genotypes as compressed bitmap indices, which reduces the computational burden of variant queries based on sample genotypes, phenotypes, and relationships by orders of magnitude over standard “variant-centric” indexing strategies. This index can significantly expand the capabilities of population-scale analyses by providing interactive-speed queries to data sets with millions of individuals
We have made the GQT index for the 1000 Genomes Project phase 3 variants available through the 1000 Genomes FTP site:
For the code, basic installation instructions, and demos please refer to our github site (https://github.com/ryanlayer/gqt) and YouTube channel (http://bit.ly/gqt_videos). A detailed description and comparisons to other tools can be found in our bioRxiv preprint at http://biorxiv.org/content/early/2015/04/20/018259
Once you have GQT installed, you can begin exploring the 84 million variants across 2504 individuals that make up the phase 3 release by download the following files
- The compressed genotype index: ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.gqt
- The variant order: ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.vid
- The compressed variant metadata: ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.bim
- The sample metadata database: integrated_call_samples.20130502.ALL.ped.db
The sample metadata database is an SQLite database containing the sample attributes from the phase 3 PED file (ftp://ftp.1000genomes.ebi.ac.uk/vol1/ftp/release/20130502/integrated_call_samples.20130502.ALL.ped).Queries can be composed of any combination of the following fields:
BCF_ID INTEGER BCF_Sample TEXT Family_ID TEXT Individual_ID TEXT Paternal_ID TEXT Maternal_ID TEXT Gender TEXT Phenotype TEXT Population TEXT Relationship TEXT Siblings TEXT Second_Order TEXT Third_Order TEXT Children TEXT Other_Comments TEXT
The BCF_ID and BCF_Sample fields correspond to the column number and sample name in the VCF file. The other fields correspond to the lines in the PED file where BCF_Sample matches Individual_ID. For valid values please refer to the source files.
Here is an example query:
Get a VCF file that contains the variants that are common in Europeans and rare in Africans.
gqt query \ -i ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.gqt \ -d integrated_call_samples.20130502.ALL.ped.db \ -p "Population in ('CEU', 'EUR', 'TSI', 'FIN', 'GBR', 'IBS')" \ -g "maf()>0.1" \ -p "Population in ('YRI','LWK','GWD','MSL','ESN','ASW','ACB')" \ -g "maf()<0.01" \ > EUR_v_AFR.vcf
You can also use the count option (-c) to get just the number of matching variants. Considering the size of the VCF produced by the previous command, this operation is typically much faster. In our tests, the full write operation took 48 sec and the count took 17 sec.
gqt query \ -i ALL.phase3_shapeit2_mvncall_integrated_v5a.20130502.genotypes.vcf.gz.gqt \ -d integrated_call_samples.20130502.ALL.ped.db \ -p "Population in ('CEU', 'EUR', 'TSI', 'FIN', 'GBR', 'IBS')" \ -g "maf()>0.1" \ -p "Population in ('YRI','LWK','GWD','MSL','ESN','ASW','ACB')" \ -g "maf()<0.01" \ -c 155323
Add all modified files:
git add `git ls-files --modified`
Add new .c files:
git add `git ls-files --other | grep .c`
formatter = matplotlib.ticker.ScalarFormatter()
from matplotlib import rcParams
rcParams['font.family'] = 'Arial'
n=5 x = np.arange(n) y = np.sin(np.linspace(-3,3,n)) xlabels = ['Ticklabel %i' % i for i in range(n)] fig, axs = plt.subplots(1,3, figsize=(12,3)) ha = ['right', 'center', 'left'] for n, ax in enumerate(axs): ax.plot(x,y, 'o-') ax.set_title(ha[n]) ax.set_xticks(x) ax.set_xticklabels(xlabels, rotation=40, ha=ha[n])
$ wc -ml Simulated_Venter_Chr17.fa
1575757 80363571 Simulated_Venter_Chr17.fa
There are 1575757 lines and 80363571 characters (including new line chars). Minus new lines, there are 78787814 characters. The first line is “>Simulated_Venter_Chr17”, which is 24 characters, so the final genome size is 78787790.
To find the number of pairs we want to simulate we use the following formula:
NUM_PAIRS=`echo “($SIZE*$COVERAGE)/($READ_LENGTH*2)” | bc`