Remember to either load my CompGen_Fxns17.R
file or copy and paste the read.vcf
function into
your code.
Then read in the sample VCF file for this week:7_sampleData_AFS.vcf
Once you have read this file in, check that it has the correct dimensions:
dim(my.data)
[1] 1946 33
The first step in this exercise is to calculate and plot the derived frequency spectrum for the observed data. This file has been created such that the REF allele is always the ancestral allele, and the ALT allele is the derived allele.
Remember to first set up an empty data structure to hold your results. We only need to save one calculation for each site, so your empty data structure can be as simple as a vector with length=number of sites, or a table with additional information if you prefer.
Once you've got your empty results vector/table set up, use a loop to count the total number of derived (ALT) alleles at each SNP position.
Here is what my first few values look like:
head(derived.counts)
[1] 5 3 25 12 7 5
Now that we've got the derived counts, we can use R's built-in histogram function (which we have used before during the FST Exercises) to plot the frequency of each count.
The general structure of the histogram command will look like this:
hist(x=YOUR COUNTS, breaks=A VECTOR WITH ALL POSSIBLE COUNT VALUES)
The value for "YOUR COUNTS" is what you just calculated in the previous step.
To get the other value for the break points vector, remember that the range of possible counts will go from 1, all the way up to the total number of chromosomes in the sample (2N). To generate a vector with these values, you can use the seq()
function:
seq(from = START OF RANGE, to = END OF RANGE, by=1)
We are using by=1
because we want every integer value between these two numbers.
When you have your breakpoint vector, use it in your histogram function to check the plot of the derived spectrum.
If it looks reasonable to you, go ahead and save the histogram data as a new variable (this will make it easy to extract the counts in each bin later):
histinfo = hist(x=YOUR COUNTS, breaks=BREAKPOINT VECTOR)
Next, we want to calculate Waterson's Theta (), which is an estimator of the population mutation rate.
The formula is:
where Sn = the number of segregating sites,
N = the number of individuals,
2N = the number of chromosomes.
Here is my calculated value:
= 438.4894
If your answer matches up to mine, then make sure to save your value as the variable theta.w
, and then go on to the next step.
For the last part of the exercise, we will now use our estimate of to determine what our frequency spectrum should look like under complete neutrality. To get the expected number of sites in each frequency bin, use the equation:
Exp(xi) = theta.w * 1/i
where i is each value from 1 to 2N-1.
Lastly, we will plot our observed and expected frequency distributions side by side, to see if our data looks neutral or not.
First, get your data into a matrix. In the code below, I'm extracting the observed counts from histinfo
, and my calculations for the expected counts are in exp.counts
:
compare.counts=matrix(c(histinfo$counts, exp.counts), ncol=2, byrow=FALSE)
colnames(compare.counts)=c("Obs", "Exp")
head(compare.counts)
Obs Exp
[1,] 178 438.48938
[2,] 179 219.24469
[3,] 133 146.16313
[4,] 148 109.62234
[5,] 114 87.69788
[6,] 109 73.08156
Now, we can use R's barplot()
function:
barplot(t(compare.counts), beside=T, col=c("lightblue", "black"))
legend("topright", c("Obs", "Exp"), pch=c(15,15), col=c("lightblue", "black"))
How well does the observed data line up to the neutral distribution? In what ways does it differ? Do you have any hypotheses about what might cause this type of difference?
R code solution AFS Exercises