Calculate the Linkage Disequilibrium statistic R2

Objective: The goal for today's exercise is to write and save the code that you will use again next week to estimate linkage disequilibrium (LD) in a larger data file.

Before Starting:

Remember to either load my CompGen_Fxns17.R file or copy and paste the read.vcf function into your code.

Then, load the sample data file for this week: sampleData_LDmini.vcf

You should have a total a 5 rows and 42 columns after reading in the data. You can check with the dim() function:

dim(my.data)
[1]  5 42

Extract Row 1 and Row 2 of my.data into 2 separate variables, and use these to test your calculations to start with (rather than worrying about the loop yet).

row1=as.vector(my.data[1,], mode="character")
row2=as.vector(my.data[2,], mode="character")

Remember that in a VCF file, each row is a position, so row1 is "position 1" and row2 is "position 2".

1. Calculate the physical distance between the 2 positions

The distance between positions isn't part of the actual LD calculation, but next week we will want to plot LD as a function of distance (in base pairs), so you should have this as part of your code now.

Remember: Since we have made our vectors into "character" vectors above, you will need to use the as.numeric() function to be able to transform values back into numbers that can be used in math equations.

2. Calculate the coefficient of Linkage Disequilibrium (D)

LaTeX: D\:=\:pAB\:-\:pA\cdot pB

pA is the frequency of the "0" allele at position 1 (in other words, this is the same as p for row 1).

pB is the frequency of the "0" allele at position 2.

Below are my calculated values for pA and pB, so you can check your own calculations:

pA
[1] 0.6363636

pB
[1] 0.09090909 

pAB is the frequency of the "AB" haplotype, or in other words the percentage of times you see the "0" allele on the same side of the "|" when you look across BOTH positions).

Finding pAB will be the most challenging part of the exercises today, so don't worry if you feel you spend most of your time on this part!

Here are my final results for pAB and D, so you can check your calculations:

pAB
[1] 0.09090909

D
[1] 0.03305785

Reminder of Some Useful R commands:

To help you with the calculation of pAB, here is a quick reminder of some R functions we have seen before, that you might find useful today.

  • These are just meant as hints; do NOT feel obligated to use these in your own solutions if you don't want to!

gsub

You can use gsub() to replace patterns. This could include replacing a character string with a number:

a = c("0|0", "0|1", "1|1", "1|0")
b = c("1|1", "1|0", "1|0", "0|1")

gsub("0\\|0", 1, a)
[1] "1"   "0|1" "1|1" "1|0"
gsub("0\\|1", 2, b) [1] "1|1" "1|0" "1|0" "2"

Or, you can get rid of a pattern or character by replacing with "":

gsub("\\|", "", a)
[1] "00" "01" "11" "10"
gsub("\\|", "", b) [1] "11" "10" "10" "01"

paste0

You can use paste0 to paste all of the items in one vector together into a single word:

c = gsub("\\|", "", a)
c
[1] "00" "01" "11" "10"

paste0(c, collapse="") [1] "00011110"

Or, you could paste together EACH item in a set of 2 vectors:

c
[1] "00" "01" "11" "10"
d = gsub("\\|", "", b) d [1] "11" "10" "10" "01"
paste0(c,d) [1] "0011" "0110" "1110" "1001"

Notice that I can also add a separator character when I paste things together:

paste(c,d, sep="|")
[1] "00|11" "01|10" "11|10" "10|01"

strsplit

In addition to pasting things together, we can also split them apart, either by a separating character or by the "" character (if we want to split everything into a separate item):

f=paste(c,d, sep="|")
f
[1] "00|11" "01|10" "11|10" "10|01"
unlist(strsplit(f, split="\\|")) [1] "00" "11" "01" "10" "11" "10" "10" "01" unlist(strsplit(f, split="")) [1] "0" "0" "|" "1" "1" "0" "1" "|" "1" "0" "1" "1" "|" "1" "0" "1" "0" "|" "0" "1"

transpose

If you would rather view your sites as 2 columns side-by-side, rather than 2 rows stacked together, you can use t():

 x
  [,1]  [,2]  [,3]  [,4] 
a "0|0" "0|1" "1|1" "1|0"
b "1|1" "1|0" "1|0" "0|1"

y=t(x) y a b [1,] "0|0" "1|1" [2,] "0|1" "1|0" [3,] "1|1" "1|0" [4,] "1|0" "0|1"

if/else loops

If you want to test a certain condition, you can use an if loop, or an if/else combination:

g=unlist(strsplit(f, split=""))
g
 [1] "0" "0" "|" "1" "1" "0" "1" "|" "1" 
"0" "1" "1" "|" "1" "0" "1" "0" "|" "0" "1" if (g[1]=="0") {
print("The first allele is 0")} [1] "The first allele is 0" if (g[1]=="0") { print("The first allele is 0") } else { print("The first allele is NOT 0")
}

You can also test multiple conditions:

if ((g[1]=="0") & (g[2]=="0")) {
      print("The first 2 alleles are 0")
} else {
      print("One or both of the first 2 \
alleles is not 0") } [1] "The first 2 alleles are 0"

In addition to the & symbol for "AND" you can also use the | symbol for "OR" and the ! symbol for "NOT".

3. Calculate r2.

Once you have the value for D along with the values for pA and pB, you can easily calculate r2 with the equation below:

LaTeX: r^2=\frac{D^2}{pA\cdot\left(1-pA\right)\cdot pB\cdot\left(1-pB\right)}

Here is my result, to help you check your math:

rsq
[1] 0.05714286

4. Perform the calculations on the remaining Site pairs.

This step is mostly so that you can double check your math and make sure your code works on all of the site pairs. You can either do this step by manually extracting the other rows (the way we did with rows 1 and 2 above) and copying and pasting the code for the calculations, OR if you feel comfortable, you can try writing the loop that will do this for you.

Do NOT feel obligated to try and reproduce the exact results table structure I have below; you can simply get your calculations, write them down, and check them by eye if you prefer.

We will discuss how to set up the results table and the loop next week.  Of course, if you want to give it a try this week, then go ahead!

Below are my results for all site pairs:

   First Site Second Site Distance (bp)            D         Rsq
1           1           2            14  0.033057851 0.057142857
2           1           3            34  0.038567493 0.067796610
3           1           4           126 -0.016528926 0.027210884
4           1           5          1552  0.004132231 0.002511161
5           2           3            20  0.081267218 0.842857143
6           2           4           112  0.004132231 0.004761905
7           2           5          1538 -0.012396694 0.063281250
8           3           4            92  0.004820937 0.005649718
9           3           5          1518 -0.011937557 0.051150121
10          4           5          1426 -0.001377410 0.001488095
R code solution for LD I Exercises