This is inspired by this post by the GATK folks about finding the optimal parameters for hard filtering genomic variant calls.
As a brief introduction, I have pooled population genomic sequencing from 100 individual copepods (little aquatic invertebrates). I’ve aligned those data to our reference genome, and after following GATK best practices for removing duplicates, etc., I have called SNPs/variants with GATK HaplotypeCaller.
The next step, if it were possible, would be to run GATK variant quality score recalibration. Unfortunately I don’t have a “truthed” set of variants since we are working on a non-model organism, which means I need to filter the GATK variant calls with hard-filtering parameters. GATK has generic recommendations for hard filtering, but they are very clear that those recommendations may not work in every situation.
So what I want to do is to explore properties of the variant data to see if I can tune parameters that will improve our variant filtration. Somewhat easier said than done. The first step is to gather the information I want from the VCF file.
The way I’m doing this is not optimal for large-scale analysis. In that case I’d want to pull out all the columns that I want to examine at once, and replace any missing values with “NA”s. But since I don’t want to code that and I’m not aware of a tool that will do it for me, I’m going to analyze this piece by piece. I’ll start by looking at the distribution of depth of coverage across variants.
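For what it’s worth, a one-pass version is probably not too bad with awk. Something like this sketch (untested, and it assumes every INFO entry is a KEY=VALUE pair; the output file name is made up) would pull several fields at once, with “NA” filled in where a field is missing:
egrep -v "^#" VIE_GATK.vcf | cut -f 8 | awk -F';' '
BEGIN { OFS = "\t"; print "DP", "QD", "FS", "SOR", "MQ" }
{
  split("", v)   # clear the key=value map for this variant
  for (i = 1; i <= NF; i++) { split($i, kv, "="); v[kv[1]] = kv[2] }
  print ("DP" in v ? v["DP"] : "NA"), ("QD" in v ? v["QD"] : "NA"),
        ("FS" in v ? v["FS"] : "NA"), ("SOR" in v ? v["SOR"] : "NA"),
        ("MQ" in v ? v["MQ"] : "NA")
}' > info_table.txt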
The following commands will be useful for all of the information that I want. First I want to remove the VCF header:
egrep -v "^#" file.vcf > file_no_header.vcf
Next I need to extract the “INFO” column, which happens to be column 8:
cut -f 8 file_no_header.vcf > file_cut.vcf
And then extract only the information I want. In this case I’m using sed to match lines and return only the “DP=##” information:
cat file_cut.vcf | sed 's/^.*;DP=\([0-9]*\);.*$/\1/' > depth.txt
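One caveat: any line whose INFO field doesn’t contain DP will pass through sed unchanged. Adding -n plus the p flag prints only the lines that actually matched, so those cases get quietly dropped instead:
sed -n 's/^.*;DP=\([0-9]*\);.*$/\1/p' file_cut.vcf > depth.txt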
Or if I do it all together:
egrep -v "^#" VIE_GATK.vcf | \
cut -f 8 | \
sed 's/^.*;DP=\([0-9]*\);.*$/\1/' > depth.txt
That’ll give me a long column that I can import into R for visualization:
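(A quick command-line check of the range is also handy before plotting; this assumes depth.txt holds one number per line:)
sort -n depth.txt | \
awk '{ a[NR] = $1 } END { print "min:", a[1], "median:", a[int((NR + 1) / 2)], "max:", a[NR] }'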
You can see that this plot has a VERY long tail extending to almost 8000X coverage, which is presumably something like a repetitive element that reads are piling up on. So I can zoom in on the y-axis a little bit:
Awesome. It looks like most variants are covered around 25-30X. Based on this and some of our other exploratory analysis, I’m going to exclude any variants that are covered over 200X.
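A quick way to see how many variants that cutoff would throw out, using the depth.txt file from above:
awk '$1 > 200' depth.txt | wc -l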
This is an important parameter for filtering. It is the variant confidence QUAL score divided by the depth of coverage. So by setting a QD cutoff and filtering out anything less than QD=2.0, for example, we can exclude both low-quality and high-coverage variants: a variant with QUAL 50 at 25X has the same QD of 2.0 as one with QUAL 400 at 200X. Honestly this probably makes the exclusion by high depth somewhat redundant, but that’s OK.
In this case I want the QD parameter, so I’ll get that by doing this (I had to change the sed regex because QD is a float, not an integer):
egrep -v "^#" VIE_GATK.vcf | \
cut -f 8 | \
sed 's/^.*;QD=\([0-9]*\.[0-9]*\);.*$/\1/' > QD.txt
When compared to the example plot on the GATK page, this distribution looks left-shifted, which is to say that there are a LOT of variants piling up around QD=2.0, their recommended cutoff. So what does that mean? Are these mostly artifacts or are they quality variants? I would guess that the pool-seq approach and the sometimes poor mapping shift this distribution left.
This is something I’m not sure about. Does the major peak of this distribution represent poor-quality variants? I need to explore this but I’m not sure how. Maybe we could do some “ground-truthing” by visually scanning the alignments, finding clear examples of variants, and then checking their QD score? Maybe by eliminating some of these low QD variants we can explore only high-quality data.
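As a first step toward that ground-truthing, something like this could pull out the coordinates of the low-QD variants so they can be inspected in a viewer like IGV (a hypothetical helper; the QD < 2.0 threshold and the output file name are just for illustration):
egrep -v "^#" VIE_GATK.vcf | awk -F'\t' '
{
  if (match($8, /(^|;)QD=[0-9.]+/)) {
    qd = substr($8, RSTART, RLENGTH)    # e.g. ";QD=1.85"
    sub(/.*QD=/, "", qd)                # strip down to just the number
    if (qd + 0 < 2.0) print $1, $2, qd  # CHROM, POS, QD
  }
}' > low_QD_sites.txt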
Next up is FS, the phred-scaled p-value for strand bias at the site (from a Fisher’s exact test):
egrep -v "^#" VIE_GATK.vcf | \
cut -f 8 | \
sed 's/^.*;FS=\([0-9]*\.[0-9]*\);.*$/\1/' > FS.txt
This is interesting: there are a bunch of FS values that are equal to zero. In theory that is a good thing… that means that there is a low probability of strand bias. But it seems weird that they are exactly zero… is this an artifact, or just due to the fact that this is a pooled experiment so there is a low probability of strand bias?
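Quickly counting how many sites have FS exactly equal to zero:
awk '$1 == 0' FS.txt | wc -l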
What if I scale the x-axis by log10 like GATK recommends? Clearly the zero values will disappear since log(0) is undefined. But at least we can get a look at the other values:
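(If you want the transformed values themselves rather than doing it in R, awk can generate them; its log() is the natural log, hence the division, and the zero values are simply skipped:)
awk '$1 > 0 { print log($1)/log(10) }' FS.txt > FS_log10.txt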
I’m inclined to think that the zero values are “real” and that there is a very low probability of strand bias at many positions since this is a pooled experiment. Since there are so many pooled individuals it seems unlikely that we would ever see much of a strand bias. But we should look into it.
The GATK hard-filtering recommendation is to remove variants with FS > 60. That seems very reasonable. Maybe even 50.
On to SOR, the strand odds ratio, which is another way of estimating strand bias:
egrep -v "^#" VIE_GATK.vcf | \
cut -f 8 | \
sed 's/^.*;SOR=\([0-9]*\.[0-9]*\);.*$/\1/' > SOR.txt
This looks pretty normal based on the GATK document. I think we could leave the filtering parameter at their recommendation and exclude variants with SOR > 3.
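And a quick count of how many variants that cutoff would actually exclude:
awk '$1 > 3' SOR.txt | wc -l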
Next is MQ, the root mean square mapping quality of the reads covering each site:
egrep -v "^#" VIE_GATK.vcf | \
cut -f 8 | \
sed 's/^.*;MQ=\([0-9]*\.[0-9]*\);.*$/\1/' > MQ.txt
Overall we have lower mapping quality than the GATK folks. But that’s probably to be expected given the divergence from the reference, etc. What’s kind of strange is the large excess of high MQ (>60) that we see compared to the GATK example data. Don’t know what that means.
Their recommendation is to filter anything with MQ < 40. That’s probably a good value.
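Again, counting how many sites would fail that cutoff:
awk '$1 < 40' MQ.txt | wc -l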
Now for MQRankSum, which compares the mapping qualities of the reads supporting the reference allele against those supporting the alternate allele:
egrep -v "^#" VIE_GATK.vcf | \
cut -f 8 | \
sed 's/^.*;MQRankSum=\(\-\{0,1\}[0-9]\{1,\}\.[0-9]*\);.*$/\1/' > MQRankSum.txt
This is a nice distribution centered on zero. They recommend hard filtering anything less than -12.5. We could probably even go -10.
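An easy way to compare those two candidate cutoffs is to count how many sites fall below each:
awk '$1 < -12.5' MQRankSum.txt | wc -l
awk '$1 < -10' MQRankSum.txt | wc -l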
And finally ReadPosRankSum, which tests whether the alternate allele shows up at different positions within reads than the reference allele does:
egrep -v "^#" VIE_GATK.vcf | \
cut -f 8 | \
sed 's/^.*;ReadPosRankSum=\(\-\{0,1\}[0-9]\{1,\}\.[0-9]*\);.*$/\1/' > ReadPosRankSum.txt
Note that I had to change the regex for sed so that I could get both negative and positive numbers. That would have been easier if I had used extended regex to begin with (with sed -r), but this will work too.
They recommend removing any variant with ReadPosRankSum < -8. It looks like a few variants below that threshold do exist in our data, so that recommendation seems appropriate.
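Putting it all together, here is a sketch of how these cutoffs might eventually be applied with GATK’s VariantFiltration walker (GATK 3-style syntax; the reference path, filter name, and the exact thresholds are placeholders still to be tuned):
java -jar GenomeAnalysisTK.jar \
    -T VariantFiltration \
    -R reference.fasta \
    -V VIE_GATK.vcf \
    --filterExpression "DP > 200 || QD < 2.0 || FS > 60.0 || SOR > 3.0 || MQ < 40.0 || MQRankSum < -12.5 || ReadPosRankSum < -8.0" \
    --filterName "copepod_hard_filter" \
    -o VIE_GATK_filtered.vcf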