Using our code from last week that calculates distance, D, and r2, this week we will create a loop that performs these calculations on all PAIRS of sites in a data file with 155 SNPs.
Remember to either load my CompGen_Fxns17.R file or copy and paste the read.vcf
function into your code.
Then, load the sample data file for this week:6_sampleData_LD.vc
You should have a total of 155 rows and 42 columns if your data is read in correctly.
As usual, we need to create a new data table to store results BEFORE starting our loop, but this time, the number of results that we are going to generate will be more than the number of rows we start out with.
Because we will test every pair of SNPs, we will actually have ("x choose 2") r2 calculations, where x is the number of SNPs. Lucky for us, R has a built-in combinatorics function that we can use to quickly figure out exactly how many rows will need:
num.results = ncol(combn(x=nrow(my.data), m=2))
Now, you can set up an empty table with this number of rows, and 2 columns:
This is going to require a nested loop structure, which we have not seen yet. The outer for
loop needs to start at 1 (the first SNP), and then go all the way to the next to the last SNP in the file:
for (i in 1:(nrow(my.data)-1)) {
The inner loop will always need to start at the next SNP after the ith SNP, and it will end at the last SNP in the file:
for (j in (i+1):nrow(my.data)) {
We are also going to need to set up a "counter" variable to help us keep track of which row in the results we are on, since this is no longer going to match up to the row of the input data.
Here is the general structure for what your loop should look like (in the example code I'm assuming I've got a results table called my.results
with two columns called Distance
and Rsquared
):
counter = 1
for (i in 1:(nrow(my.data)-1)) {
for (j in (i+1):nrow(my.data)) {
YOUR CODE TO CALCULATE DISTANCE BETWEEN THE ith AND jth SNP
YOUR CODE TO CALCULATE LD FOR THE ith AND the jth SNP
my.results$Distance[counter] = YOUR DISTANCE CALCULATION
my.results$Rsquared[counter] = YOUR R-SQUARED CALCULATION
counter = counter + 1
}
}
The last line of the code in the loop is advancing the counter, so that we add our results to the next row on the next iteration.
Note: When you run your loop, it may take several seconds or even a few minutes to complete-so don't worry if your code doesn't return the command prompt (">" symbol) right away.
Here are the first few results that I get, so you can check your answers:
head(my.results)
Distance Rsquared
1 446 0.126181723
2 681 0.620710220
3 2624 0.669148059
4 3068 0.004136029
5 3548 0.001504415
6 3625 0.678104959
Once you have results from the full data set, plot r2 as a function of distance in order to observe the decay of LD.
To do this, you can modify the plotting function below by filling in your own x
and y
variable names to create this graph.
Optional: If you'd like to avoid having to type out both the table and column names for every variable (i.e. my.results$Distance, etc.), you can use the attach
function to just refer to the column names of a table directly (I'm providing this tip in response to someone who asked about a way to do this in class, and I thought others might be interested as well):
head(my.results)
Distance Rsquared
1 446 0.126181723
2 681 0.620710220
3 2624 0.669148059
4 3068 0.004136029
5 3548 0.001504415
6 3625 0.678104959
attach(my.results)
head(Distance)
[1] 446 681 2624 3068 3548 3625
Plotting Function:
plot(x=Your distance values, y=Your r2 values, xlab="Distance between SNPs (bp)", ylab=expression(R^2), pch=20, col=rgb(0,0,0,alpha=0.2), main="Decay of Linkage Disequilibrium")
Within this command, I use the rgb()
function with the alpha
option to create semi-transparent points. This helps us to see where all of the data is.
When you create your plot, you’ll notice that even though you can see a trend, the raw data plot is pretty messy, and does not look like results that are typically published. To help with this, use the code below to create and plot a moving average of LD in increments of 500bp distances.
Find all possible 500bp intervals using the seq()
function:
bins=seq(1,max(YOUR DISTANCE RESULTS), by=500)
The seq()
function generates a series of numbers between a given minimum and maximum value, in intervals of a specified size. The function above uses a starting value of 1, a maximum value of whatever our calculated distance is, and it will report every number between those values in intervals of 500 base pairs:
head(bins)
[1] 1 501 1001 1501 2001 2501
Calculate the average r2 value in each bin.
Start by creating a table to hold our results. This table will have 2 columns: the first column will be the bin values, and the second will be the average r2 value for each bin.
my.means=rep(0, length(bins))
LD.averages=data.frame(bins, my.means)
Now, loop through the list of intervals, find the subset of data that corresponds to each interval, and get the mean for that subset of data:
for (i in 1:length(bins) {
data.interval=subset(my.results, (my.results$Distance >= bins[i] & my.results$Distance < (bins[i]+500)))
LD.averages$my.means[i]=mean(data.interval$Rsquared)
}
If you look at the first line inside of the above loop, you'll see that we're using the subset
function to get all of the Distance values that fall between the start of the ith bin and that value plus 500 (the interval size).
head(LD.averages)
bins my.means
1 1 0.4687347
2 501 0.3973289
3 1001 0.3873935
4 1501 0.2717473
5 2001 0.2754991
6 2501 0.2776452
Add points to the plot to show the interval average values (I will use 2 commands: one to add the points, and then a second to connect the points with a line):
points(x=LD.averages$bins, y=LD.averages$my.means, col="red", pch=20)
lines(x=LD.averages$bins, y=LD.averages$my.means, col="red", lwd=2)
One value that people are often interested in is how far apart SNPs have to be before LD becomes “negligible,” which conventionally means the point at which r2 < 0.1. Find this distance value in our data set. You can use the plot to verify if your answer makes sense.
Hint: Use the mean values calculated in the previous step for this, as the raw data will be too noisy.
Hint 2: The best way to do this is find the LAST distance value where r2 > 0.1, because what we really want to know is when does r2 stay below this threshold. Remember that R has min()
and max()
functions, as well as subset()
and which()
.
The value that you get here will be the average LD block size for the sample. What is this value?
Another value that is often reported is the LD “half-life,” which is the point at which the observed r2 between sites decays to less than half the maximum r2 value. Find this distance value in our data set, and again use the plot to verify if your answer makes sense.
Hint: This should also be based on the mean values rather than the raw data.
Finally, in order to estimate the population recombination rate (rho ()), we need to fit a nonlinear model based on Hill and Weir’s (1988) equation, given below:
where C = *distance and n=sample size
To get an estimate of based on this equation and our observed data, we will use R’s built in non-linear least squares function (nls). This function takes a formula describing the relationship between variables (in this case given by the above equation), a starting value for the unknown parameter (in this case ), and iteratively tries to find the value for the unknown parameter that gives the best fit to the data.
Find the value of n based on the number of genotypes (this should be 33):
n=length(genotypes)
Define the Hill and Weir equation in terms of the variable names for YOUR DATA. In this instance, all of my LD results are in a data frame called my.results, with columns named Distance and Rsquared.. You do NOT need to refer to the table name here (just use your column names!). We will give the function our table name in the next step.
hill.weir.eq=(Rsquared~(((10+(rho*Distance))/((2+(rho*Distance))*(11+(rho*Distance))))*(1+(((3+(rho*Distance))*(12+(12*(rho*Distance))+((rho*Distance)**2)))/(n*(2+(rho*Distance))*(11+(rho*Distance)))))))
Set a starting value for rho (must be between 0 and 1). Here I arbitrarily select 0.1. Then fit the model with the nls() function:
rho.start=0.1
m=nls(formula=hill.weir.eq, data=my.results, start=list(rho=rho.start))
results.m=summary(m)
results.m
Formula: Rsquared ~ (((10 + (rho * Distance))/((2 + (rho * Distance)) *
(11 + (rho * Distance)))) * (1 + (((3 + (rho * Distance)) *
(12 + (12 * (rho * Distance)) + ((rho * Distance)^2)))/(n *
(2 + (rho * Distance)) * (11 + (rho * Distance))))))
Parameters:
Estimate Std. Error t value Pr(>|t|)
rho 0.0007489 0.0000151 49.61 <2e-16 ***
Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.1594 on 11934 degrees of freedom
Number of iterations to convergence: 8
Achieved convergence tolerance: 4.066e-06
Test the fit of the model by looking at the correlation between the actual values and the predicted values:
cor(my.results$Rsquared, predict(m))
What is your rho estimate?
Extract the estimate of rho from the model results, and calculate the expected rho values using the same Hill and Weir equation, but now with your new estimate of rho plugged in:
rho.estimate=results.m$parameters[1]
Distance=sort(my.results$Distance)
exp.rsquared=(((10+(rho.estimate*Distance))/((2+(rho.estimate*Distance))*(11+(rho.estimate*Distance))))*(1+(((3+(rho.estimate*Distance))*(12+(12*(rho.estimate*Distance))+((rho.estimate*Distance)**2)))/(n*(2+(rho.estimate*Distance))*(11+(rho.estimate*Distance))))))
Plot the fitted curve by adding a line of the expected results to the graph:
lines(Distance, exp.rsquared, col="purple", lwd=2)
legend(32000,0.99, c("Means", "Expected R2"), lty=c(1,1), col=c("red", "purple"))
How well does your fitted curve match up to your moving average curve?
R code solution for LD II Exercises