The first step in this exercise is exactly the same as the exercise last week: get a table of observed genotype counts at each site (row) in the VCF file.
Remember the key steps in this process are:
Figure out how many total SNPs you have. This is the same as the number of rows in the VCF file. You will use this number to help set up you empty results table.
Set up an empty results table before you start your loop. This table will have the same number of rows as your input VCF file, and 3 columns; one for each genotype (AA, Aa, aa). You don't need a column for the missing data today (just make sure you don't accidentally count missing genotypes as something else!)
Once you've got your table of observed genotype counts, the next step is to calculate the allele frequencies, p and q, as well as the sample size, N, for each site.
Remember that to add a new column to an existing table, I can use the command cbind(...)
. Below is how to add the empty column for N to the my.results table:
> N=rep(0, nrow(my.results))
> my.results=cbind(my.results, N)
> head(my.results)
AA Aa aa N
1 85 6 0 0
2 84 7 0 0
3 85 6 0 0
4 74 8 0 0
5 61 0 30 0
6 65 6 0 0
You can do the same to create the columns for p and q.
To calculate the values for each column, you need to use the equations that we know from Hardy-Weinberg:
You can do these calculations inside of another for()
loop, OR you can use R's ability to do vectorized arithmetic, and do the calculations all at once. Below is how to calculate N without a loop:
> my.results$N = my.results$AA + my.results$Aa + my.results$aa
> head(my.results)
AA Aa aa N
1 85 6 0 91
2 84 7 0 91
3 85 6 0 91
4 74 8 0 82
5 61 0 30 91
6 65 6 0 71
Notice that sum(...)
will NOT do what we want here:
> sum(my.results$AA, my.results$Aa, my.results$aa)
[1] 86561
Fill in the columns for p and q in your results table.
Now that you've got the observed allele frequencies, you can again use the equations we know from Hardy-Weinberg to calculate the expected counts for each genotype:
So, you want to add 3 new columns to your results table, one for each expected count, and you can do this in the same way that we did in Part B.
Optional: Since we usually think of genotype counts as something that should be an integer value (instead of a decimal), you can force R to round to the nearest integer with the as.integer(...)
command:
> 0.25 * 333
[1] 83.25
> as.integer(0.25 * 333)
[1] 83
With our table of observed and expected counts, we can see how much certain sites may deviate from Hardy-Weinberg expectations, but to assign a significance to this, we want to use a statistical test such as Chi-square or Fisher's Exact Test.
We want to add 1 new column to our results table; this one will hold the p-values for our test at each site. Add a new column called "Pvalue" to your results table.
We need to do this test for each site separately, so you will need a for()
loop that goes through every line of your results table.
Inside of this loop, to run R's fisher.test()
command, you need to set up a matrix that looks like this:
a | b |
c | d |
a=my.results$AA[i]
b=as.integer(my.results$Aa[i]/2)
c=my.results$Aa[i] - b
d=my.results$aa[i]
Then, to create the matrix, run Fisher's test and extract the p-value, use the following code inside of your loop:
x=matrix(c(a,b,c,d), nrow=2, ncol=2, byrow=TRUE)
test=fisher.test(x)
pval=test$p.value
After getting a p-value for every site, we'll want to go through and see which loci have significant deviations, and then try and figure out why they might deviate.
Use the which(...)
or subset(...)
commands to find the loci that deviate significantly (p < 0.05) from Hardy-Weinberg expections. How many loci are there? What proportion of the total data set deviates from HWE?
We saw in class that maping and variant calling errors during the data processing pipeline could cause loci to deviate from HWE.
One thing we know is that some types of errors can cause observed Heterozygosity to be too low, while different types of errors might cause it to be too high. To help distinguish these two types of sites, let's first create a new column called Diff that contains
>my.results$Diff=my.results$Aa-my.results$expAa
>head(my.results)
AA Aa aa N p q expAA expAa expaa Pvalue Diff
1 85 6 0 91 0.9670330 0.03296703 85 5 0 1.000000e+00 1
2 84 7 0 91 0.9615385 0.03846154 84 6 0 1.000000e+00 1
3 85 6 0 91 0.9670330 0.03296703 85 5 0 1.000000e+00 1
4 74 8 0 82 0.9512195 0.04878049 74 7 0 1.000000e+00 1
5 61 0 30 91 0.6703297 0.32967033 40 40 9 9.958354e-25 -40
In next-gen data, a common reason that the observed heterozygosity might be too low is because the read depth coverage was too low to observe both alleles in an individual.
Use the code below to add a column to your results table that has the average individual read depth for each site:
my.results$rdDepth=rep(0,nrow(my.results))
for (i in 1:nrow(my.data)) {
d=get.field(as.vector(my.data[i,10:ncol(my.data)], mode="character"), my.data[i,9], "DP")
d[d=="."]=NA
my.results$rdDepth[i]=mean(as.numeric(d), na.rm=TRUE)
}
Now, try using subset()
or which()
to extract the sites where the observed Heterozygosity is significantly lower than the expected Heterozygosity. What can you say about the mean read depth at these sites? How does it compare to the overall mean read depth of the other sites?
Hint: You can take a subset of a table using more than one condition with the &
symbol, as in the example code below:
dataSubset = subset(dataAll, ((columnX < 0) & (columnY >20)))
When heterozygosity is too high, it can be a bit trickier to sort out what is going on, but one thing we know might cause this is if we have a paralogous locus that is causing mis-mapping. When this happens, we should see a sites with excess heterozygosity all clustered near each other.
We could plot this, but at the genome-wide scale, sometimes it is hard to see what's going on in a single region. So, sometimes I find the best thing to do is:
Use subset()
or which()
to extract the sites from my.data where the observed heterozygosity is significantly larger than expected. You can use a condition on one table to subset another table, like this:
subset(my.data) = my.data[which(my.results meet some condition), ]
Print the results of this table to your screen (there shouldn't be more than 40 rows). You can use print()
or simply type the variable name you gave the subset and hit enter.
Simply look at the chromosome and position columns (columns 1 and 2), to see if it looks like sites are right next to each other. I also find it helpful to look at the row numbers R assigns.
Which regions do you think might have paralogous loci?
R code solution for HWE Exercises