Sliding Window Scan with Multiple Stats

Before Starting

Remember to either load my CompGen_Fxns17.R file or copy and paste the read.vcf function into your code.

Sliding Windows with Multiple Statistics

For today's exercises, we will be scanning a region of the Heliconius butterfly genome that is potentially experiencing selection for color pattern differences.

Our goal is to plot Tajima's D in 10 kb sliding windows (with 2kb of overlap), along with its components: LaTeX: \theta_W and LaTeX: \pi . We'll also add a new statistic to the plot, called Dxy (the average nucleotide divergence), which is very similar to LaTeX: \pi, except that it only compares pairs of sequences from different populations (instead of all pairs).

Getting Started

First, I go through the usual steps of setting my working directory, loading my functions and reading in the VCF file for this week:heliconius.vcf. Also, I've recently learned that you can globally set the option stringsAsFactors=FALSE for an entire R session, so I'm going to do that as well:

library(devtools)
source_gist("c954c757db02eb6c4ccfae1aa090658c", filename="compGen17_fxns.R", quiet=TRUE)
options(stringsAsFactors=FALSE)
my.data=read.vcf("heliconius.vcf", header=TRUE)

Set up the Results Table, Define the Window Positions

Next, we want to follow the same procedure we used for the Tajima's D exercises to find the Start and End positions of all of our windows, and then set up our results table.

We are going to allow 2000 bp of overlap between our windows this time (to make sure we don't miss anything).

  1. Find all of the Window START positions. I recommend using the seq(from, to, by) function, like we did to find the sliding windows before. Finding the from and to values will be exactly the same, but your by value is going to change a bit (because of the overlaps and the new window size).

  2. Find all of the Window END positions. Then put the Start and the End positions into a data frame like this:

    my.results = data.frame(Start, End)
    

Calculate the Number of Sequenced Base Pairs AND the number of SNPs in each Window

All of the VCF files we have worked with up until now have only had data for the positions where there was at least one mutation. This is the standard format, but it doesn't let you distinguish between sites where there is no mutation vs. sites where there is no data.

For sliding window tests, this can make a big difference, so for this week's exercise we have a more extensive file, with 1 ROW for EVERY non-missing base pair in the genome, whether or not it has a SNP.

  1. Add a new column to your Results table call "NumBP." Now loop through all of your Window positions, get the subset of data in each window, and determine how many total base pairs are in each window.

  2. Now that you know how much data you actually have in each window, we don't really have much need for the non-SNP sites anymore, so let's get rid of them (to save some space and memory). To compare the format of a line with a SNP and a line without a SNP, type:

    my.data[c(1,1775),1:13]
    

    Think about what is different about these two lines that you could take advantage of to select only the lines with/without SNPs, and then use that pattern to select only the rows from my.data with SNPs.

  3. Once you remove the lines without SNPs from my.data, add another new column to you Results called "NumSNP." Loop through your Windows again, and this time count the number of SNPs in each window.

Calculate Theta-W, Pi, and Tajima's D

Now, we want to get these 3 calculations for each of our Windows, so set up 3 new columns in your Results.

my.results$Tajima=0
my.results$Theta=0
my.results$Pi=0

Optional: you might want to add an extra column to hold your variance calculation, if it helps with your final Tajima's D calculation.

These are going to be the exact same calculations that we did during the Tajima's D exercise, so you should be able to copy and paste code from either your solutions or mine!

As a reminder, here are the equations for each statistic:

LaTeX: \theta_W=\frac{S_n}{\sum^{2N-1}_{i=1}\left(\frac{1}{i}\right)}\:,\:S_n=\:the\:number\:of\:SNPs

LaTeX: \pi\:=\:\sum^{Sn}_{i=1}\frac{2j_i\left(c_i-j_i\right)}{c_i\left(c_i-1\right)}

 LaTeX: where\:c_i=num.\:chrom.\:at\:the\:i^{th}\:SNP\:\left(i.e.\:2N\right) and LaTeX: j_i=num.\:derived\:alleles\:at\:the\:i^{th}\:SNP

I've also added some new functions to the CompGen17_fxns.R file, which you may feel free to use:

waterson.theta(w, perBP=TRUE)
pi.diversity(w, perBP=TRUE)

In both functions w refers to any subset of a vcf data frame (so "w" stands for "window" here). By default, they will calculate a per base pair rate of each statistic, BUT, they assume that there is no missing data. Since we might have missing data, set perBP=FALSE when you run the function, and then use the counts of NumBP that we calculated above to get your own per base pair rate.

Or, you can just use the exact same code you had 2 weeks ago!

Calculate Nucleotide Divergence (Dxy)

We're doing one (slightly) new statistic this week, called Dxy. This is sometimes called LaTeX: \pi_{XY}, because it is the same as the LaTeX: \pi calculation except that it only calculates the average number of differences between any pair of sequences taken from two different groups, rather than calculating the pairwise difference among all sequences.

  1. Get 2 separate data frames: 1 for each subspecies in my.data. The two groups in this file are H. mel. aglaope and H. mel. amaryllis. The column names start with either "Ag" or "Am" to denote each subspecies.

    colnames(my.data)[10:ncol(my.data)]
    [1] "Ag1" "Ag2" "Ag3" "Ag4" "Am1" "Am2" "Am3" "Am4"
    

    Use the same strategy with grep, and then the - notation inside of the brackets that we used last week for the polar bears and during the FST exercises.

  2. Set up an empty column for "Dxy" in your Results table. I've written a function in CompGen17_fxns.R to help you with the actual calculation:

    dxy(vcf1,vcf2, perBP=FALSE)
    

    vcf1 and vcf2 are the 2 data frames for your 2 groups (or in this case, they will be the subsets of those data frames in each window.)

    Write a loop to go through your Windows one more time, and use the function to calculate Dxy for each window. Make sure to divide your result by NumBP to get the per base pair rate for the final result.

Time to Make a Plot!

When you want to plot multiple statistics together, you have a lot of options for how best to display them so that the result is actually informative. Today, we'll make a somewhat complex plot, to give you some ideas about various options.

1. Create an empty plot, with extra margins

The first consideration for us with these statistics is the fact that while LaTeX: \theta_W\:,\:\pi\:,\:and\:D_{XY} are all rates per base pair, Tajima's D is scaled completely differently, so we actually want to use 2 different axes to get it onto the same plot. I first set up an empty plot, and allow myself some extra room on the margins to draw in another axis later:

 par(mar=c(5,5,2,5))
 plot(my.results$Start, my.results$Theta, type="n",
     xlab="Position",
     ylab=expression(paste("Population Mutation Rate (", theta[W], " & ", pi, ")")), ylim=c(-0.054,0.07))

You'll probably notice that my y-axis limits here seem very specific, and you'll wonder how I figured that out. A lot of it is trial and error (unfortunately), but I also use the quantile function sometimes to help know where my data points will be for a certain statistic:

quantile(my.results$Dxy, na.rm=TRUE)
        0%          25%          50%          75%         100% 
0.0004807692 0.0058791244 0.0104398961 0.0202904595 0.0541867677 

2. Plot LaTeX: \theta_W as a filled in shape

Instead of plotting LaTeX: \theta_W as a line, I'm going to make it a filled in shape:

polygon(c(my.results$Start[1], my.results$Start, my.results$Start[nrow(my.results)]),
    c(0, my.results$Theta, 0), col=rgb(0.545,0,0, alpha=0.99), border=NA)

3. Plot LaTeX: \pi as another, semi-transparent shape

To see how LaTeX: \pi compares to LaTeX: \theta_W, I'm going to overlay it on top of the first shape, but make it semi-transparent with the alpha parameter. Where they overlap (i.e., they are the same), I should see the blended color, but where they are different, I'll see the distinct color. This helps me to see where they are really different (i.e. there are skews in the frequency spectrum!):

polygon(c(my.results$Start[1], my.results$Start, my.results$Start[nrow(my.results)]),
    c(0, my.results$Pi, 0), col=rgb(0.933,0.705,0.133, alpha=0.6), border=NA)

4. Add a line for Dxy

I probably don't want to add any more solid shapes to this plot, so I'll make Dxy a simple black line:

lines(my.results$Start, my.results$Dxy, col="black")

5. Add Tajima's D

Here comes the part where we need to make a new axis. I start by calling par(new=T), which tells R to re-set the plotting parameters (i.e. don't use the current scale). Then, I'll add the dotted line for Tajima's D, and finally I'll force R to put the new y-axis on the right instead of the left:

par(new=T)
plot(my.results$Start, my.results$Tajima, type="l",
       col="black", ylim=c(-1.5,2), lty=3, axes=F, xlab=NA, ylab=NA)
axis(side=4)
mtext(side=4, line=3, "Tajima's D")

6. Add a legend

The last thing I do is add a legend to the upper right corner:

lgd <- legend("topright", legend=c(expression(theta[W]), expression(Pi), expression(D[XY]), "Tajima's D"),
          lty=c(NA,NA,1,3),
          col=c(NA,NA,"black","black"), bty="n")
legend(lgd$rect$left, lgd$rect$top, legend=c(NA,NA), pch=c(22,22),
          fill=c(rgb(0.545,0,0,alpha=0.99),
          rgb(0.933,0.705,0.133,alpha=0.6)), border=NA, bty="n",
          col=NA)

Here's the final plot: 

 dxy_plot.png

Choose Your Own Adventure

For the last part of the exercise, I want you to pick the statistic of your choice, to try out on this data set in order to test for further evidence of either balancing selection, directional selection, or some other phenomenon.

Statistics that we've covered in class exercises: 

  1. Expected Heterozygosity and Observed Heterozygosity
  2. FST
  3. Linkage Disequilibrium (not recommended-since we don't have phased data here)
  4. Allele Frequency Spectrums (the histograms)
  5. Waterson's Theta, Pi, and Tajima's D (what we just did)
  6. McDonald-Kreitman Test (not recommended here - not enough divergence, BUT you could look at VariantAnnotation - the gff3 and fa files are here)

Statistics I've mentioned, and have now added functions/instructions for:

  1. Fu and Li's D and F statistics
    • Very similar to Tajima's D, but modified to compare derived site frequencies instead of just minor allele frequencies (i.e. unfolded spectrum vs. folded).  This gives them more power to distinguish positive selection from negative selection.
  2. Fay and Wu's H
    • Another variant of Tajima's D that uses derived sites; this test looks for skews towards very high frequency derived variants (excellent method for finding selective sweeps).
  3. HKA
    • Discussed during the M-K Test week; this test compares the rate of divergence between species to the rate of polymorphism within species, with the expectation that under neutrality the ratio of these rates will be the same at any one locus, even if it differs across loci because of variation in mutation rate.
  4. KaKs Ratio KaKs R script
    • Also discussed during M-K Test week; this test examines the rate of non-synonymous polymorphism  versus synonymous polymorphism within a single gene.  Under neutrality, the ratio is expected to be 1; values >1 indicate positive selection, values <1 indicate negative/purifying selection.

For tests 1,2, and 3 above, I have found an existing R package called "sfsr" on github and modified it to make it compatible with VCF files.  This package can calculate the Fu and Li tests, HKA, and I've also added a Fay and Wu H function to it.  Instructions to download and use the package are on my GitHub page here.

Here is a Nice Table of when to think about/use different Methods:

table_of_tests.jpg

R code solution Sliding Window Exercises