Hardy-Weinberg Equilibrium and Expected Frequencies

Calculate the Expected Genotype Counts at Each SNP

Part A: Get a Table of Observed Genotype Counts

The first step in this exercise is exactly the same as the exercise last week: get a table of observed genotype counts at each site (row) in the VCF file.

Remember the key steps in this process are:

Figure out how many total SNPs you have. This is the same as the number of rows in the VCF file. You will use this number to help set up you empty results table.
Set up an empty results table before you start your loop. This table will have the same number of rows as your input VCF file, and 3 columns; one for each genotype (AA, Aa, aa). You don't need a column for the missing data today (just make sure you don't accidentally count missing genotypes as something else!)

Part B: Calculate p, q, and N

Once you've got your table of observed genotype counts, the next step is to calculate the allele frequencies, p and q, as well as the sample size, N, for each site.

Remember that to add a new column to an existing table, I can use the command cbind(...). Below is how to add the empty column for N to the my.results table:

> N=rep(0, nrow(my.results))
> my.results=cbind(my.results, N)
> head(my.results)
   AA Aa aa N
1 85  6  0 0
2 84  7  0 0
3 85  6  0 0
4 74  8  0 0
5 61  0 30 0
6 65  6  0 0

You can do the same to create the columns for p and q.

To calculate the values for each column, you need to use the equations that we know from Hardy-Weinberg:

$p = \frac{2 (A A) + A a}{2 N}$

$q = \frac{2 (a a) + A a}{2 N} = 1 - p$

You can do these calculations inside of another for() loop, OR you can use R's ability to do vectorized arithmetic, and do the calculations all at once. Below is how to calculate N without a loop:

> my.results$N = my.results$AA + my.results$Aa + my.results$aa
> head(my.results)
AA Aa aa  N
1 85  6  0 91
2 84  7  0 91
3 85  6  0 91
4 74  8  0 82
5 61  0 30 91
6 65  6  0 71

Notice that sum(...) will NOT do what we want here:

> sum(my.results$AA, my.results$Aa, my.results$aa)[1] 86561

Fill in the columns for p and q in your results table.

Part C: Use the allele frequencies to calculate Expected Genotype Counts

Now that you've got the observed allele frequencies, you can again use the equations we know from Hardy-Weinberg to calculate the expected counts for each genotype:

$E x p (A A) = N \cdot p^{2}$

$E x p (A a) = (2 p q) \cdot N$

$E x p (a a) = q^{2} \cdot N$

So, you want to add 3 new columns to your results table, one for each expected count, and you can do this in the same way that we did in Part B.

Optional: Since we usually think of genotype counts as something that should be an integer value (instead of a decimal), you can force R to round to the nearest integer with the as.integer(...) command:

> 0.25 * 333
[1] 83.25
> as.integer(0.25 * 333)
[1] 83

2: Use Fisher's Exact Test to see which sites differ significantly from HWE

With our table of observed and expected counts, we can see how much certain sites may deviate from Hardy-Weinberg expectations, but to assign a significance to this, we want to use a statistical test such as Chi-square or Fisher's Exact Test.

Part A: Create the empty column to hold results

We want to add 1 new column to our results table; this one will hold the p-values for our test at each site. Add a new column called "Pvalue" to your results table.

Part B: Run the test inside of a loop

We need to do this test for each site separately, so you will need a for() loop that goes through every line of your results table.

Inside of this loop, to run R's fisher.test() command, you need to set up a matrix that looks like this:

a	b
c	d

a=my.results$AA[i]
b=as.integer(my.results$Aa[i]/2)
c=my.results$Aa[i] - b
d=my.results$aa[i]

Then, to create the matrix, run Fisher's test and extract the p-value, use the following code inside of your loop:

x=matrix(c(a,b,c,d), nrow=2, ncol=2, byrow=TRUE)
test=fisher.test(x)
pval=test$p.value

3: Check Loci with Significant Deviation from HWE

After getting a p-value for every site, we'll want to go through and see which loci have significant deviations, and then try and figure out why they might deviate.

Part A: How many loci deviate from HWE?

Use the which(...) or subset(...) commands to find the loci that deviate significantly (p < 0.05) from Hardy-Weinberg expections. How many loci are there? What proportion of the total data set deviates from HWE?

Part B: What might cause these loci to deviate?

We saw in class that maping and variant calling errors during the data processing pipeline could cause loci to deviate from HWE.

Difference between Observed and Expected Heterozygosity

One thing we know is that some types of errors can cause observed Heterozygosity to be too low, while different types of errors might cause it to be too high. To help distinguish these two types of sites, let's first create a new column called Diff that contains $H e t_{o b s} - H e t_{e x p}$

>my.results$Diff=my.results$Aa-my.results$expAa
>head(my.results)
AA Aa aa  N         p          q expAA expAa expaa       Pvalue Diff
1 85  6  0 91 0.9670330 0.03296703    85     5     0 1.000000e+00    1
2 84  7  0 91 0.9615385 0.03846154    84     6     0 1.000000e+00    1
3 85  6  0 91 0.9670330 0.03296703    85     5     0 1.000000e+00    1
4 74  8  0 82 0.9512195 0.04878049    74     7     0 1.000000e+00    1
5 61  0 30 91 0.6703297 0.32967033    40    40     9 9.958354e-25  -40

Check samples where observed Het. is too low

In next-gen data, a common reason that the observed heterozygosity might be too low is because the read depth coverage was too low to observe both alleles in an individual.

Use the code below to add a column to your results table that has the average individual read depth for each site:

my.results$rdDepth=rep(0,nrow(my.results))
for (i in 1:nrow(my.data)) {
d=get.field(as.vector(my.data[i,10:ncol(my.data)], mode="character"), my.data[i,9], "DP")
d[d=="."]=NA
my.results$rdDepth[i]=mean(as.numeric(d), na.rm=TRUE)
}

Now, try using subset() or which() to extract the sites where the observed Heterozygosity is significantly lower than the expected Heterozygosity. What can you say about the mean read depth at these sites? How does it compare to the overall mean read depth of the other sites?

Hint: You can take a subset of a table using more than one condition with the & symbol, as in the example code below:

dataSubset = subset(dataAll, ((columnX < 0) & (columnY >20)))

Check samples where observed Het. is too high

When heterozygosity is too high, it can be a bit trickier to sort out what is going on, but one thing we know might cause this is if we have a paralogous locus that is causing mis-mapping. When this happens, we should see a sites with excess heterozygosity all clustered near each other.

We could plot this, but at the genome-wide scale, sometimes it is hard to see what's going on in a single region. So, sometimes I find the best thing to do is:

Use subset() or which() to extract the sites from my.data where the observed heterozygosity is significantly larger than expected. You can use a condition on one table to subset another table, like this:

subset(my.data) = my.data[which(my.results meet some condition), ]
Print the results of this table to your screen (there shouldn't be more than 40 rows). You can use print() or simply type the variable name you gave the subset and hit enter.
Simply look at the chromosome and position columns (columns 1 and 2), to see if it looks like sites are right next to each other. I also find it helpful to look at the row numbers R assigns.

Which regions do you think might have paralogous loci?

R code solution for HWE Exercises