Remember to either load my CompGen_Fxns17.R file or copy and paste the read.vcf function into your code.
Read in the sample data file for this week: sampleData_Tajima.vcf.
Check that the dimensions of your table match up to mine:
dim(my.data)
[1] 1098 364
To create the sample data for this week's exercise, I extracted a 30kb section from sorghum chromosome 1 that extends from roughly position 68,020,000 to position 68,050,000:
min(my.data$POS)
[1] 68020018
max(my.data$POS)
[1] 68049882
To do a sliding window scan of Tajima's D, we need to first figure out how to divide the data from this VCF file into windows of 600bp each, and then count the total number of SNPs in each window.
I suggest constructing a results table with 3 columns: Window Start Position, Window End Position, and Number of SNPs:
Start | End | SNPs |
1 | 600 | ### |
601 | 1200 | ### |
1201 | 1800 | ### |
The number of rows in this table will be equal to the total number of 600bp windows in this region, which we also need to figure out.
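One way to sketch this setup is to build the window boundaries directly from the minimum and maximum positions. The literal positions below are the min(my.data$POS) and max(my.data$POS) values shown above, so the snippet runs on its own:

```r
# Build an empty results table of 600bp windows (a sketch)
first.pos <- 68020018   # min(my.data$POS)
last.pos  <- 68049882   # max(my.data$POS)
n.win <- ceiling((last.pos - first.pos) / 600)    # total number of 600bp windows
my.results <- data.frame(Start = first.pos + 600 * (0:(n.win - 1)))
my.results$End  <- my.results$Start + 600
my.results$SNPs <- rep(0, n.win)
head(my.results)
```

Note that each window's End equals the next window's Start, matching the table layout above.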
Here are the first few rows of my results table, after this step:
head(my.results)
Start End SNPs
1 68020018 68020618 61
2 68020618 68021218 46
3 68021218 68021818 25
4 68021818 68022418 24
5 68022418 68023018 33
6 68023018 68023618 17
To calculate Watterson's theta (θW), we can recall from last week that the two values we need from the data are Sn, the number of SNPs in the region, and 2N, the number of chromosomes sampled.
We've already calculated the number of SNPs in the previous step. Since this file has no missing data, the value for 2N will be the same for each window, but I'm going to go ahead and make a whole column of this value and add it to my results table, to make my calculation easier to do with R's vectorized math:
my.results$Nchr = YOUR CALCULATION FOR 2N
head(my.results)
Start End SNPs Nchr
1 68020018 68020618 61 710
2 68020618 68021218 46 710
3 68021218 68021818 25 710
4 68021818 68022418 24 710
5 68022418 68023018 33 710
6 68023018 68023618 17 710
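If you need a nudge on the 2N calculation: a VCF has 9 fixed columns before the sample columns, and each diploid sample carries two chromosomes. With the dimensions shown earlier (1098 x 364), that works out as:

```r
# 2N from the table dimensions: 364 total columns minus the 9 fixed VCF columns
n.samples <- 364 - 9     # number of sample columns
Nchr <- 2 * n.samples    # two chromosomes per diploid individual
Nchr                     # 710
```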
Now that I have the values for Sn and 2N conveniently in my table, I can use the equation for θW:
θW = Sn / (1 + 1/2 + 1/3 + ... + 1/(2N-1))
Here is what my first few results look like after I do my calculation:
head(my.results)
Start End SNPs Nchr ThetaW
1 68020018 68020618 61 710 8.541293
2 68020618 68021218 46 710 6.440975
3 68021218 68021818 25 710 3.500530
4 68021818 68022418 24 710 3.360509
5 68022418 68023018 33 710 4.620699
6 68023018 68023618 17 710 2.380360
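As a check on the arithmetic, here is the first-window calculation done standalone (61 SNPs and 2N = 710, both taken from the table above):

```r
# Watterson's theta for the first window: Sn divided by the harmonic sum
a1 <- sum(1 / (1:(710 - 1)))   # 1 + 1/2 + ... + 1/709
round(61 / a1, 6)              # 8.541293, matching the ThetaW column
# In the full table this is vectorized: my.results$SNPs / a1
```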
The other estimator of the population mutation rate is nucleotide diversity (π), which is the average number of pairwise differences between sequences in a region. Go ahead and add a column for this to your results table:
my.results$Pi = rep(0, nrow(my.results))
To calculate this value, we are going to use a formula from Begun et al. (2007):
π = Σ [ 2j(c-j) / (c(c-1)) ]
where:
j = the derived allele count at a single SNP position
c = the number of chromosomes (same as 2N) at a single SNP position
and the Σ is summing over ALL SNP positions in a region
To help you with this calculation, I have added a function that will calculate the derived allele count (j) at a single row of a VCF file to the CompGen_fxns.R file, so if you loaded those functions, you will automatically have derivedCount() available to you.
If you want to paste the source code for the function directly into your own code instead, here it is:
derivedCount <- function(row) {
  # Coerce the VCF row to a plain character vector
  row = as.vector(row, mode="character")
  # Pull the GT entry out of every sample column (10 onward), using the
  # FORMAT string in column 9, then tally the genotypes
  x = count.genotypes(get.field(row[10:length(row)], row[9], "GT"))
  # Each aa homozygote carries 2 copies of the derived allele; each het carries 1
  dc = (2*x["aa"]) + x["Aa"]
  return(unname(dc))
}
Remember that within EACH 600bp window, you have to get the derived allele count at EACH SNP position within that window. To do this, you can use a loop:
my.data.window = SUBSET OF my.data THAT FALLS WITHIN THE WINDOW
J = rep(0, nrow(my.data.window))
for (i in 1:nrow(my.data.window)) {
J[i] = derivedCount(row = my.data.window[i,])
}
OR, you can skip the loop and use the apply function, like this:
my.data.window = SUBSET OF my.data THAT FALLS WITHIN THE WINDOW
J = apply(my.data.window, 1, derivedCount)
The apply() call above takes three arguments:
The data object to apply a function to (in this case our data subset from the window)
The dimension of the data to loop through (1=go row by row, 2 = column by column)
The function to be applied (in this case the derivedCount function)
head(J)
1 2 3 4 5 6
2 8 7 122 84 122
Once you've got the J vector, you should be able to do the rest of the Pi calculation pretty easily with R's vectorized math. For example, to get (c-j):
nc=my.results$Nchr[1]
nc
[1] 710
head(nc-J)
1 2 3 4 5 6
708 702 703 588 626 588
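Putting the pieces together for one window, the Begun et al. sum looks like this. The six J values are the ones from head(J) above, so this is only a partial sum over the first six SNPs, not the full window-1 value of 7.120912:

```r
# Partial pi for window 1, using just the six derived counts shown above
nc <- 710
J  <- c(2, 8, 7, 122, 84, 122)
pi.partial <- sum((2 * J * (nc - J)) / (nc * (nc - 1)))
pi.partial
```

A quick sanity check on the formula: with c = 2 chromosomes and j = 1, there is exactly one pairwise difference, and the formula gives 2*1*1/(2*1) = 1.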
Here is what the first few lines of my results table look like after my calculation:
head(my.results)
Start End SNPs Nchr ThetaW Pi
1 68020018 68020618 61 710 8.541293 7.120912
2 68020618 68021218 46 710 6.440975 3.952895
3 68021218 68021818 25 710 3.500530 2.525795
4 68021818 68022418 24 710 3.360509 2.114297
5 68022418 68023018 33 710 4.620699 2.141604
6 68023018 68023618 17 710 2.380360 1.177584
The formula for Tajima's D is:
D = (π - θW) / sqrt(variance(d))
We've just finished calculating θW and π, so now the last ingredient is to get the square root of the variance(d). The formula for the variance relies on n (in this case, the same as 2N, or the number of chromosomes), and S (the same as Sn, or the number of SNPs).
The formula itself is somewhat long, so below I've provided you with a function that will do this calculation (but feel free to do your own calculation if you want!):
variance.d <- function(n,S) {
  # Component terms of the variance of d from Tajima (1989);
  # n = number of chromosomes (2N), S = number of SNPs
  a1=sum(1/(seq(from=1, to=(n-1), by=1)))        # harmonic sum 1 + 1/2 + ... + 1/(n-1)
  a2=sum(1/((seq(from=1, to=(n-1), by=1))**2))   # sum of squared reciprocals
  b1=(n+1)/(3*(n-1))
  b2=(2*((n**2)+n+3))/((9*n)*(n-1))
  c1=b1 - (1/a1)
  c2=b2-((n+2)/(a1*n)) + (a2/(a1**2))
  e1=c1/a1
  e2=c2/((a1**2)+a2)
  var=(e1*S) + (e2*S*(S-1))
  return(var)
}
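Note that variance.d works on a single n and S at a time (the seq() calls inside it expect scalars), so fill the column by calling it once per window, for example with mapply. The sketch below repeats the function body so the snippet runs on its own:

```r
# Apply variance.d row by row; passing whole columns at once would silently
# use only the first element of n inside seq()
variance.d <- function(n,S) {
  a1=sum(1/(seq(from=1, to=(n-1), by=1)))
  a2=sum(1/((seq(from=1, to=(n-1), by=1))**2))
  b1=(n+1)/(3*(n-1))
  b2=(2*((n**2)+n+3))/((9*n)*(n-1))
  c1=b1 - (1/a1)
  c2=b2-((n+2)/(a1*n)) + (a2/(a1**2))
  e1=c1/a1
  e2=c2/((a1**2)+a2)
  return((e1*S) + (e2*S*(S-1)))
}
snps <- c(61, 46, 25)                      # SNP counts for the first few windows
varD <- mapply(variance.d, n = 710, S = snps)
round(varD, 6)
```

These values should match the varD column for the corresponding windows.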
Set up a new column in your results table called varD, and then calculate the variance for every window.
head(my.results)
Start End SNPs Nchr ThetaW Pi varD
1 68020018 68020618 61 710 8.541293 7.120912 9.629897
2 68020618 68021218 46 710 6.440975 3.952895 5.759211
3 68021218 68021818 25 710 3.500530 2.525795 1.986663
4 68021818 68022418 24 710 3.360509 2.114297 1.854930
5 68022418 68023018 33 710 4.620699 2.141604 3.197334
6 68023018 68023618 17 710 2.380360 1.177584 1.054751
Now, you can run standard math functions on the columns to create a new column for TajimaD. (Hint: R has a function called sqrt() for taking the square root of a number).
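As a spot check on the full vectorized calculation, plugging window 1's values from the table into D = (π - θW) / sqrt(variance(d)) gives:

```r
# Window 1 spot check, using the ThetaW, Pi, and varD values from the table
theta.w <- 8.541293
pi.w    <- 7.120912
var.d   <- 9.629897
(pi.w - theta.w) / sqrt(var.d)   # about -0.45771, matching the TajimaD column
```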
head(my.results)
Start End SNPs Nchr ThetaW Pi varD TajimaD
1 68020018 68020618 61 710 8.541293 7.120912 9.629897 -0.4577136
2 68020618 68021218 46 710 6.440975 3.952895 5.759211 -1.0367707
3 68021218 68021818 25 710 3.500530 2.525795 1.986663 -0.6915510
4 68021818 68022418 24 710 3.360509 2.114297 1.854930 -0.9150145
5 68022418 68023018 33 710 4.620699 2.141604 3.197334 -1.3864341
Finally, we can plot the results, and see if we can determine any genes that might be under selection:
plot(my.results$Start, my.results$TajimaD, pch=20,xlab="Position",ylab="Tajima's D", type="l", ylim=c(-2.2,0))
Since a good rule of thumb is that anything under -2 or over +2 might be significant, here is how to draw a semi-transparent box that will highlight the graph area under -2:
rect(68010000, -3, 68060000, -2, col=rgb(1,0,0,0.4), border=NA)
Do you see any points that might be significant? What kind of selection do you think they are experiencing?
Lastly, use the code below to draw some boxes at the top of the graph outlining the transcripts that map back to this region:
rect(68023734,-0.4,68024960,-0.3, col="goldenrod")
rect(68028443,-0.4,68029634,-0.3, col="goldenrod")
rect(68034103,-0.4,68036074,-0.3, col="goldenrod")
Do you think the 1st, 2nd, or 3rd gene is the best candidate for selection?
R code solution: Tajima's D Exercises