Sliding Window Scan of Tajima's D

Before Starting

Remember to either load my CompGen_Fxns17.R file or copy and paste the read.vcf function into your code.

Read in the sample data file for this week: sampleData_Tajima.vcf.

Check that the dimensions of your table match up to mine:

dim(my.data)
[1] 1098  364

I. Divide Sample Data into Windows of 600 bp

To create the sample data for this week's exercise, I extracted a 30kb section from sorghum chromosome 1, that extends from roughly position 68,020,000 to position 68,050,000:

min(my.data$POS)
[1] 68020018
max(my.data$POS)
[1] 68049882

To do a sliding window scan of Tajima's D, we need to first figure out how to divide the data from this VCF file into windows of 600bp each, and then count the total number of SNPs in each window.

I suggest contstructing a results table with 3 columns: Window Start Position, Window End Position, and Number of SNPs:

Start End SNPs
1 600 ###
601 1200 ###
1201 1800 ###

The number of rows in this table will be equal to the total number of 600bp windows in this region, which we also need to figure out.

Here are the first few rows of my results table, after this step:

head(my.results)
   Start      End     SNPs
1 68020018 68020618   61
2 68020618 68021218   46
3 68021218 68021818   25
4 68021818 68022418   24
5 68022418 68023018   33
6 68023018 68023618   17

II. Calculate Waterson's Theta for Each Window

To calculate LaTeX: \theta_W, we can recall from last week that the two values we need from the data are:

  • Sn : The number of segregating sites (i.e. the number of SNPs)
  • 2N: The number of chromosomes (i.e. 2 times the number of individuals)

We've already calculated the number of SNPs in the previous step. Since this file has no missing data, the value for 2N will be the same for each window, but I'm going to go ahead and make a whole column of this value and add it to my results table, to make my LaTeX: \theta_W calculation easier to do with R's vectorized math:

my.results$Nchr = YOUR CALCULATION FOR 2N
head(my.results)
 Start      End      SNPs Nchr
1 68020018 68020618   61  710
2 68020618 68021218   46  710
3 68021218 68021818   25  710
4 68021818 68022418   24  710
5 68022418 68023018   33  710
6 68023018 68023618   17  710

Now that I've the values for Sn and 2N conveniently in my table, I can use the equation for LaTeX: \theta_W:

LaTeX: \theta_W\:=\:\frac{S_n}{\sum^{2N-1}_{i=1}\left(\frac{1}{i}\right)}

Here is what my first few results look like after I do my calculation:

head(my.results)
 Start      End      SNPs Nchr   ThetaW
1 68020018 68020618   61  710 8.541293
2 68020618 68021218   46  710 6.440975
3 68021218 68021818   25  710 3.500530
4 68021818 68022418   24  710 3.360509
5 68022418 68023018   33  710 4.620699
6 68023018 68023618   17  710 2.380360

III. Calculate Nucleotide Diversity for Each Window

The other estimator of the population mutation rate is nucleotide diversity (LaTeX: \Pi), which is the average number of differences between sequences in a region. Go ahead and add a column for this to your results table:

my.results$Pi = rep(0, nrow(my.results))

To calculate this value, we are going to use a formula from Begun et al. (2007):

LaTeX: \Pi\:=\:\sum\left(\frac{2j\left(c-j\right)}{c\left(c-1\right)}\right)

j = the derived allele count at a single SNP position

c = the number of chromosomes (same as 2N) at a single SNP position

the LaTeX: \sum is summing over ALL SNP positions in a region

To help you with this calculation, I have added a function that will calculate the derived allele count (j) at a single row of a VCF file to the CompGen_fxns.R file, so if you loaded those functions, you will automatically have derivedCount() available to you.

If you want to paste the source code for the function directly into your own code instead, here it is:

derivedCount <- function(row) {
row=as.vector(row, mode="character")
x=count.genotypes(get.field(row[10:length(row)], row[9], "GT"))
dc=(2*x["aa"])+x["Aa"]
return(unname(dc))
}

Remember that within EACH 600bp window, you have to get the count of the derived sites at EACH SNP position within that window. To do this, you can use a loop:

my.data.window = SUBSET OF my.data THAT FALLS WITHIN THE WINDOW
J = rep(0, nrow(my.data.window))
for (i in 1:nrow(my.data.window)) {
    J[i] = derivedCount(row = my.data.window[i,])
}

OR, you can skip the loop and use the apply function, like this:

my.data.window = SUBSET OF my.data THAT FALLS WITHIN THE WINDOW
J = apply(my.data.window, 1, derivedCount)

The above argument takes:

  1. The data object to apply a function to (in this case our data subset from the window)

  2. The dimension of the data to loop through (1=go row by row, 2 = column by column)

  3. The function to be applied (in this case the derivedCount function)

    head(J)
    1   2   3   4   5   6 
    2   8   7 122  84 122
    

Once you've got the J vector, you should be able to do the rest of the Pi calculation pretty easily with R's vectorized math. For example, to get (c-j)

nc=my.results$Nchr[1]
nc
[1] 710
head(nc-J)
1   2   3   4   5   6 
708 702 703 588 626 588

Here is what the first few lines of my results table look like after my LaTeX: \prod calculation:

head(my.results)
 Start      End       SNPs Nchr   ThetaW     Pi
1 68020018 68020618   61  710 8.541293 7.120912
2 68020618 68021218   46  710 6.440975 3.952895
3 68021218 68021818   25  710 3.500530 2.525795
4 68021818 68022418   24  710 3.360509 2.114297
5 68022418 68023018   33  710 4.620699 2.141604
6 68023018 68023618   17  710 2.380360 1.177584

IV. Calculate and Plot Tajima's D

The formula for Tajima's D is:

LaTeX: D\:=\:\frac{\Pi\:-\theta_W}{\sqrt[]{var\left(d\right)}}

We've just finished calculating LaTeX: \Piand LaTeX: \theta_W, so now the last ingredient is to get the square root of the variance(d). The formula for the variance relies on n (in this case, the same as 2N, or the number of chromosomes), and S (the same as Sn, or the number of SNPs).

The formula itself is somewhat long, so below I've provided you with a function that will do this calculation (but feel free to do your own calculation if you want!):

variance.d <- function(n,S) {
a1=sum(1/(seq(from=1, to=(n-1), by=1)))
a2=sum(1/((seq(from=1, to=(n-1), by=1))**2))
b1=(n+1)/(3*(n-1))
b2=(2*((n**2)+n+3))/((9*n)*(n-1))
c1=b1 - (1/a1)
c2=b2-((n+2)/(a1*n)) + (a2/(a1**2))
e1=c1/a1
e2=c2/((a1**2)+a2)
var=(e1*S) + (e2*S*(S-1))
return(var)
}

Set up a new column in your results table called varD, and then calculate the variance for every window.

head(my.results)
 Start      End       SNPs Nchr   ThetaW    Pi     varD
1 68020018 68020618   61  710 8.541293 7.120912 9.629897
2 68020618 68021218   46  710 6.440975 3.952895 5.759211
3 68021218 68021818   25  710 3.500530 2.525795 1.986663
4 68021818 68022418   24  710 3.360509 2.114297 1.854930
5 68022418 68023018   33  710 4.620699 2.141604 3.197334
6 68023018 68023618   17  710 2.380360 1.177584 1.054751

Now, you can run standard math functions on the columns to create a new column for TajimaD. (Hint: R has a function called sqrt() for taking the square root of a number).

head(my.results)
 Start      End     SNPs Nchr   ThetaW       Pi     varD    TajimaD
1 68020018 68020618   61  710 8.541293 7.120912 9.629897 -0.4577136
2 68020618 68021218   46  710 6.440975 3.952895 5.759211 -1.0367707
3 68021218 68021818   25  710 3.500530 2.525795 1.986663 -0.6915510
4 68021818 68022418   24  710 3.360509 2.114297 1.854930 -0.9150145
5 68022418 68023018   33  710 4.620699 2.141604 3.197334 -1.3864341

Finally, we can plot the results, and see if we can determine any genes that might be under selection:

plot(my.results$Start, my.results$TajimaD, pch=20,xlab="Position",ylab="Tajima's D", type="l", ylim=c(-2.2,0))

Since a good rule of thumb is that anything under -2 or over +2 might be significant, here is how to draw a semi-transparent box that will highlight the graph area under -2:

rect(68010000, -3, 68060000, -2, col=rgb(1,0,0,0.4), border=NA)

Do you see any points that might be significant? What kind of selection do you think they are experiencing?

Lastly, use the code below to draw some boxes at the top of the graph outlining the transcripts that map back to this region:

rect(68023734,-0.4,68024960,-0.3, col="goldenrod")
rect(68028443,-0.4,68029634,-0.3, col="goldenrod")
rect(68034103,-0.4,68036074,-0.3, col="goldenrod")

Do you think the 1st, 2nd, or 3rd gene is the best candidate for selection?

R code solution Tajima's D Exercises