Recipes for GWAS data conversion/extraction
Useful tools
- Download PLINK 1.9 from https://www.cog-genomics.org/plink2.
- Install bcftools from https://samtools.github.io/bcftools/. On Mac OS X it is easier to install UNIX tools via homebrew: once homebrew is installed, just run brew install bcftools.
- Download the SNP annotation for hg19 (used in the PrediXcan example below):
wget ftp://ftp.ncbi.nih.gov/snp/organisms/human_9606_b151_GRCh37p13/VCF/common_all_20180423.vcf.gz
Example: Downloading UK biobank genotype data using virtual box
Since UK Biobank genotypes can only be downloaded with a Linux-compiled binary, I had to spin up a VirtualBox machine. It is generally a good idea to have a shared directory between the host (Mac OS) and the box. Here is how I did it, based on a Stack Overflow search.
- Create a virtual box with the ICH9 chipset and the bridged network adapter.
- Install an extension pack (Oracle_VM_Virtual Box etc.).
- Run “Insert Guest Additions CD images…” under the “Devices” tab.
- Restart the box.
- Create a shared folder via the GUI.
- Mount the folder:
mount -t vboxsf mountfolder /home/devInHost/mountfolder
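Before starting a long download, it may be worth checking that the mounted folder actually accepts writes from inside the box. The helper below is only a sketch: can_write is a hypothetical name, not part of VirtualBox or ukbgene, and it simply tries to create and remove a small probe file.

```shell
# Hypothetical helper (not part of any tool mentioned here):
# report whether the current user can create files in a directory.
can_write() {
    dir=$1
    probe="$dir/.write_probe.$$"      # temporary file name, unique per shell
    if touch "$probe" 2>/dev/null; then
        rm -f "$probe"
        echo "writable: $dir"
    else
        echo "NOT writable: $dir" >&2
        return 1
    fi
}

# Usage, with the path from the mount command:
#   can_write /home/devInHost/mountfolder
```

If the check fails for a regular user but succeeds under root, the folder is mounted but owned by root.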
Once the shared folder is mounted, I can use ukbgene to download the imputed genotypes. To make sure that I can write to the shared folder, I just run the command with root privileges. GNU parallel takes the chromosome list after the ::: separator:
parallel -j4 ./ukbgene imp -c{1} ::: {1..22}
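If GNU parallel is not installed in the box, the same per-chromosome commands can be generated with a plain loop. This is a minimal sketch that only prints the commands (a dry run); ukbgene_commands is a hypothetical name, and the ukbgene arguments are copied from the command above without any additions.

```shell
# Print one ukbgene command per autosome (1..22) without executing anything.
ukbgene_commands() {
    for chrom in $(seq 1 22); do
        echo "./ukbgene imp -c${chrom}"
    done
}

# Dry run: show what would be executed.
ukbgene_commands
```

Piping the output into sh (ukbgene_commands | sh) would run the downloads one at a time, which is slower than four parallel jobs but needs no extra tools.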
Enjoy!
Example: Extracting dosage information from a VCF file to use a PrediXcan model
- Download the weights Brain_Amygdala.tar.gz from http://predictdb.org/.
- Copy *.py and *.R from the PrediXcan github repository.
1. We need to construct annotations
cat common_all_20180423.vcf.gz | gzip -d | bgzip -c > snps.vcf.gz && tabix snps.vcf.gz
2. Annotate rsID to match with the DB file.
We will store the updated VCF file in a separate directory:
mkdir -p data/vcf/
We can then create the annotated VCF, using -O z so that the output is actually compressed, as its .gz name suggests:
bcftools annotate -c CHROM,FROM,TO,ID -a snps.vcf.gz -O z -o data/vcf/chr21.vcf.gz chr21.dose.vcf.gz
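To see whether the annotation did anything, one can count how many records now carry an rsID in the ID column. The sketch below uses only gzip and awk; count_rsids is a hypothetical helper name, and it assumes the file is gzip- or bgzip-compressed and that annotated IDs start with "rs".

```shell
# Hypothetical helper: count VCF records whose ID column starts with "rs".
count_rsids() {
    gzip -dc "$1" | awk -F'\t' '
        /^#/ { next }                 # skip meta and header lines
        { total++ }                   # every remaining line is one variant record
        $3 ~ /^rs/ { annotated++ }    # column 3 is the VCF ID field
        END { printf "%d of %d records have an rsID\n", annotated, total }
    '
}

# Usage:
#   count_rsids data/vcf/chr21.vcf.gz
```

A count near zero usually points to a mismatch between the two files, for example different chromosome naming.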
3. Create dosage file as PrediXcan requires
mkdir -p data/dosage/
We can extract dosage information with the format -f "[\t%DS]\n", prepending 6 header columns (chromosome, rsID, position, reference allele, alternative allele, and minor allele frequency) with the format -f "%CHROM\t%ID\t%POS\t%REF\t%ALT\t%MAF". To make sure that our prediction is reliable, we may filter out SNPs with MAF less than 5% by adding -e "MAF[0]<0.05".
bcftools query -e "MAF[0]<0.05" -f "%CHROM\t%ID\t%POS\t%REF\t%ALT\t%MAF[\t%DS]\n" data/vcf/chr21.vcf.gz | gzip > data/dosage/chr21.dosage.gz
Additionally we can list samples in the same directory:
bcftools query -l data/vcf/chr21.vcf.gz | awk -F'\t' '{ print $1 FS $1 }' > data/dosage/chr21.samples
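Since each dosage line should hold 6 header columns plus one dosage per sample, a quick consistency check between the two files can catch problems before running PrediXcan. The following is a sketch under that assumption; check_dosage is a hypothetical helper, not part of bcftools or PrediXcan.

```shell
# Hypothetical helper: verify that every dosage line has
# 6 header columns + one column per line of the samples file.
check_dosage() {
    dosage=$1
    samples=$2
    n_samples=$(awk 'END { print NR }' "$samples")
    gzip -dc "$dosage" | awk -F'\t' -v expected="$((n_samples + 6))" '
        NF != expected { bad++ }      # line with the wrong number of columns
        END {
            printf "%d SNPs, %d malformed lines (expected %d columns)\n", NR, bad, expected
            exit (bad > 0)            # non-zero exit status if anything is off
        }
    '
}

# Usage:
#   check_dosage data/dosage/chr21.dosage.gz data/dosage/chr21.samples
```

The exit status makes the check usable in a script, for example check_dosage ... && python PrediXcan.py ....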
4. Run PrediXcan.
python PrediXcan.py --predict --dosages data/dosage/ --dosages_prefix chr21 --output_prefix temp --samples chr21.samples --weights Brain_Amygdala/gtex_v7_Brain_Amygdala_imputed_europeans_tw_0.5_signif.db
Conclusion
- bcftools is just amazing.
- I found the way PrediXcan processes files quite inefficient.