Remember to either load my CompGen_Fxns17.R
file or copy and paste the read.vcf
function into
your code.
For today's exercises, we will be scanning a region of the Heliconius butterfly genome that is potentially experiencing selection for color pattern differences.
Our goal is to plot Tajima's D in 10 kb sliding windows (with 2kb of overlap), along with its components: and . We'll also add a new statistic to the plot, called D_{xy} (the average nucleotide divergence), which is very similar to , except that it only compares pairs of sequences from different populations (instead of all pairs).
First, I go through the usual steps of setting my working
directory, loading my functions and reading in the VCF file
for this week:heliconius.vcf. Also, I've recently learned that you can globally set the option stringsAsFactors=FALSE
for an entire R session, so I'm going to do that as well:
library(devtools)
source_gist("c954c757db02eb6c4ccfae1aa090658c", filename="compGen17_fxns.R", quiet=TRUE)
options(stringsAsFactors=FALSE)
my.data=read.vcf("heliconius.vcf", header=TRUE)
Next, we want to follow the same procedure we used for the Tajima's D exercises to find the Start and End positions of all of our windows, and then set up our results table.
We are going to allow 2000 bp of overlap between our windows this time (to make sure we don't miss anything).
Find all of the Window START
positions. I recommend using the seq(from, to,
by)
function, like we did to find the sliding windows
before. Finding the from
and to
values will be exactly the same, but your by
value is going to change a bit (because of the overlaps and the new window size).
Find all of the Window END positions. Then put the Start and the End positions into a data frame like this:
my.results = data.frame(Start, End)
All of the VCF files we have worked with up until now have only had data for the positions where there was at least one mutation. This is the standard format, but it doesn't let you distinguish between sites where there is no mutation vs. sites where there is no data.
For sliding window tests, this can make a big difference, so for this week's exercise we have a more extensive file, with 1 ROW for EVERY non-missing base pair in the genome, whether or not it has a SNP.
Add a new column to your Results table call "NumBP." Now loop through all of your Window positions, get the subset of data in each window, and determine how many total base pairs are in each window.
Now that you know how much data you actually have in each window, we don't really have much need for the non-SNP sites anymore, so let's get rid of them (to save some space and memory). To compare the format of a line with a SNP and a line without a SNP, type:
my.data[c(1,1775),1:13]
Think about what is different about these two lines that you could take advantage of to select only the lines with/without SNPs, and then use that pattern to select only the rows from my.data with SNPs.
Once you remove the lines without SNPs from my.data
, add another new column to you Results called "NumSNP." Loop through your Windows again, and this time count the number of SNPs in each window.
Now, we want to get these 3 calculations for each of our Windows, so set up 3 new columns in your Results.
my.results$Tajima=0
my.results$Theta=0
my.results$Pi=0
Optional: you might want to add an extra column to hold your variance calculation, if it helps with your final Tajima's D calculation.
These are going to be the exact same calculations that we did during the Tajima's D exercise, so you should be able to copy and paste code from either your solutions or mine!
As a reminder, here are the equations for each statistic:
and
I've also added some new functions to the CompGen17_fxns.R file, which you may feel free to use:
waterson.theta(w, perBP=TRUE)
pi.diversity(w, perBP=TRUE)
In both functions w
refers to any subset of a vcf data frame (so "w" stands for "window" here). By default, they will calculate a per base pair rate of each statistic, BUT, they assume that there is no missing data. Since we might have missing data, set perBP=FALSE
when you run the function, and then use the counts of NumBP
that we calculated above to get your own per base pair rate.
Or, you can just use the exact same code you had 2 weeks ago!
We're doing one (slightly) new statistic this week, called D_{xy}. This is sometimes called , because it is the same as the calculation except that it only calculates the average number of differences between any pair of sequences taken from two different groups, rather than calculating the pairwise difference among all sequences.
Get 2 separate data frames: 1 for each subspecies in my.data
. The two groups in this file are H. mel. aglaope and H. mel. amaryllis. The column names start with either "Ag" or "Am" to denote each subspecies.
colnames(my.data)[10:ncol(my.data)]
[1] "Ag1" "Ag2" "Ag3" "Ag4" "Am1" "Am2" "Am3" "Am4"
Use the same strategy with grep
, and then the -
notation inside of the brackets that we used last week for the polar bears and during the F_{ST} exercises.
Set up an empty column for "Dxy" in your Results table. I've written a function in CompGen17_fxns.R to help you with the actual calculation:
dxy(vcf1,vcf2, perBP=FALSE)
vcf1
and vcf2
are the 2 data frames for your 2 groups (or in this case, they will be the subsets of those data frames in each window.)
Write a loop to go through your Windows one more time, and use the function to calculate D_{xy} for each window. Make sure to divide your result by NumBP
to get the per base pair rate for the final result.
When you want to plot multiple statistics together, you have a lot of options for how best to display them so that the result is actually informative. Today, we'll make a somewhat complex plot, to give you some ideas about various options.
The first consideration for us with these statistics is the fact that while are all rates per base pair, Tajima's D is scaled completely differently, so we actually want to use 2 different axes to get it onto the same plot. I first set up an empty plot, and allow myself some extra room on the margins to draw in another axis later:
par(mar=c(5,5,2,5))
plot(my.results$Start, my.results$Theta, type="n",
xlab="Position",
ylab=expression(paste("Population Mutation Rate (", theta[W], " & ", pi, ")")), ylim=c(-0.054,0.07))
You'll probably notice that my y-axis limits here seem very specific, and you'll wonder how I figured that out. A lot of it is trial and error (unfortunately), but I also use the quantile
function sometimes to help know where my data points will be for a certain statistic:
quantile(my.results$Dxy, na.rm=TRUE)
0% 25% 50% 75% 100%
0.0004807692 0.0058791244 0.0104398961 0.0202904595 0.0541867677
Instead of plotting as a line, I'm going to make it a filled in shape:
polygon(c(my.results$Start[1], my.results$Start, my.results$Start[nrow(my.results)]),
c(0, my.results$Theta, 0), col=rgb(0.545,0,0, alpha=0.99), border=NA)
To see how compares to , I'm going to overlay it on top of the first shape, but make it semi-transparent with the alpha
parameter. Where they overlap (i.e., they are the same), I should see the blended color, but where they are different, I'll see the distinct color. This helps me to see where they are really different (i.e. there are skews in the frequency spectrum!):
polygon(c(my.results$Start[1], my.results$Start, my.results$Start[nrow(my.results)]),
c(0, my.results$Pi, 0), col=rgb(0.933,0.705,0.133, alpha=0.6), border=NA)
I probably don't want to add any more solid shapes to this plot, so I'll make D_{xy} a simple black line:
lines(my.results$Start, my.results$Dxy, col="black")
Here comes the part where we need to make a new axis. I start by calling par(new=T)
, which tells R to re-set the plotting parameters (i.e. don't use the current scale). Then, I'll add the dotted line for Tajima's D, and finally I'll force R to put the new y-axis on the right instead of the left:
par(new=T)
plot(my.results$Start, my.results$Tajima, type="l",
col="black", ylim=c(-1.5,2), lty=3, axes=F, xlab=NA, ylab=NA)
axis(side=4)
mtext(side=4, line=3, "Tajima's D")
The last thing I do is add a legend to the upper right corner:
lgd <- legend("topright", legend=c(expression(theta[W]), expression(Pi), expression(D[XY]), "Tajima's D"),
lty=c(NA,NA,1,3),
col=c(NA,NA,"black","black"), bty="n")
legend(lgd$rect$left, lgd$rect$top, legend=c(NA,NA), pch=c(22,22),
fill=c(rgb(0.545,0,0,alpha=0.99),
rgb(0.933,0.705,0.133,alpha=0.6)), border=NA, bty="n",
col=NA)
Here's the final plot:
For the last part of the exercise, I want you to pick the statistic of your choice, to try out on this data set in order to test for further evidence of either balancing selection, directional selection, or some other phenomenon.
Statistics that we've covered in class exercises:
Statistics I've mentioned, and have now added functions/instructions for:
For tests 1,2, and 3 above, I have found an existing R package called "sfsr" on github and modified it to make it compatible with VCF files. This package can calculate the Fu and Li tests, HKA, and I've also added a Fay and Wu H function to it. Instructions to download and use the package are on my GitHub page here.
Here is a Nice Table of when to think about/use different Methods:
R code solution Sliding Window Exercises