Plotting the Allele Frequency Spectrum and Calculating Waterson's Theta

Getting Started

Remember to either load my CompGen_Fxns17.R file or copy and paste the read.vcf function into your code.

Then read in the sample VCF file for this week:7_sampleData_AFS.vcf

Once you have read this file in, check that it has the correct dimensions:

dim(my.data)
[1] 1946   33

1. Plot the Derived Allele Frequency Spectrum

The first step in this exercise is to calculate and plot the derived frequency spectrum for the observed data. This file has been created such that the REF allele is always the ancestral allele, and the ALT allele is the derived allele.

Step 1: Get the Derived Allele Count at every SNP site

Remember to first set up an empty data structure to hold your results. We only need to save one calculation for each site, so your empty data structure can be as simple as a vector with length=number of sites, or a table with additional information if you prefer.

Once you've got your empty results vector/table set up, use a loop to count the total number of derived (ALT) alleles at each SNP position.

Here is what my first few values look like:

head(derived.counts)
[1]  5  3 25 12  7  5

Step 2: Plot a Histogram of the Counts

Now that we've got the derived counts, we can use R's built-in histogram function (which we have used before during the FST Exercises) to plot the frequency of each count.

The general structure of the histogram command will look like this:

hist(x=YOUR COUNTS, breaks=A VECTOR WITH ALL POSSIBLE COUNT VALUES)

The value for "YOUR COUNTS" is what you just calculated in the previous step.

To get the other value for the break points vector, remember that the range of possible counts will go from 1, all the way up to the total number of chromosomes in the sample (2N). To generate a vector with these values, you can use the seq() function:

seq(from = START OF RANGE, to = END OF RANGE, by=1)

We are using by=1 because we want every integer value between these two numbers.

When you have your breakpoint vector, use it in your histogram function to check the plot of the derived spectrum.

If it looks reasonable to you, go ahead and save the histogram data as a new variable (this will make it easy to extract the counts in each bin later):

histinfo = hist(x=YOUR COUNTS, breaks=BREAKPOINT VECTOR)

2. Calculate Waterson's Theta

Next, we want to calculate Waterson's Theta (LaTeX: \theta_w), which is an estimator of the population mutation rate.

The formula is:

LaTeX: \theta_w=\frac{S_n}{\sum_{i=1}^{2N-1}\left(\frac{1}{i}\right)}

where Sn = the number of segregating sites,

N = the number of individuals,

2N = the number of chromosomes.

Here is my calculated value:

LaTeX: \theta_w = 438.4894

If your answer matches up to mine, then make sure to save your value as the variable theta.w, and then go on to the next step.

3. Calculate and Plot the EXPECTED derived frequency spectrum

For the last part of the exercise, we will now use our estimate of LaTeX: \theta_w to determine what our frequency spectrum should look like under complete neutrality. To get the expected number of sites in each frequency bin, use the equation:

Exp(xi) = theta.w * 1/i

where i is each value from 1 to 2N-1.

Plot the expected neutral spectrum alongside the observed spectrum

Lastly, we will plot our observed and expected frequency distributions side by side, to see if our data looks neutral or not.

First, get your data into a matrix. In the code below, I'm extracting the observed counts from histinfo, and my calculations for the expected counts are in exp.counts:

compare.counts=matrix(c(histinfo$counts, exp.counts), ncol=2, byrow=FALSE)
colnames(compare.counts)=c("Obs", "Exp")

head(compare.counts)
 Obs       Exp
[1,] 178 438.48938
[2,] 179 219.24469
[3,] 133 146.16313
[4,] 148 109.62234
[5,] 114  87.69788
[6,] 109  73.08156

Now, we can use R's barplot() function:

barplot(t(compare.counts), beside=T, col=c("lightblue", "black"))
legend("topright", c("Obs", "Exp"), pch=c(15,15), col=c("lightblue", "black"))

How well does the observed data line up to the neutral distribution? In what ways does it differ? Do you have any hypotheses about what might cause this type of difference?

R code solution AFS Exercises