Objective: The goal for today's exercise is to write and save the code that you will use again next week to estimate linkage disequilibrium (LD) in a larger data file.
Remember to either load my CompGen_Fxns17.R file or copy and paste the
read.vcf function into your code.
Then, load the sample data file for this week: sampleData_LDmini.vcf
You should have a total a 5 rows and 42 columns after reading in the data. You can check with the
dim(my.data)  5 42
Extract Row 1 and Row 2 of
my.data into 2 separate variables, and use these to test your calculations to start with (rather than worrying about the loop yet).
row1=as.vector(my.data[1,], mode="character") row2=as.vector(my.data[2,], mode="character")
Remember that in a VCF file, each row is a position, so row1 is "position 1" and row2 is "position 2".
The distance between positions isn't part of the actual LD calculation, but next week we will want to plot LD as a function of distance (in base pairs), so you should have this as part of your code now.
Remember: Since we have made our vectors into "character" vectors above, you will need to use the
as.numeric() function to be able to transform values back into numbers that can be used in math equations.
pA is the frequency of the "0" allele at position 1 (in other words, this is the same as p for row 1).
pB is the frequency of the "0" allele at position 2.
Below are my calculated values for pA and pB, so you can check your own calculations:
pA  0.6363636 pB  0.09090909
pAB is the frequency of the "AB" haplotype, or in other words the percentage of times you see the "0" allele on the same side of the "|" when you look across BOTH positions).
Finding pAB will be the most challenging part of the exercises today, so don't worry if you feel you spend most of your time on this part!
Here are my final results for pAB and D, so you can check your calculations:
pAB  0.09090909 D  0.03305785
To help you with the calculation of pAB, here is a quick reminder of some R functions we have seen before, that you might find useful today.
Once you have the value for D along with the values for pA and pB, you can easily calculate r2 with the equation below:
Here is my result, to help you check your math:
rsq  0.05714286
This step is mostly so that you can double check your math and make sure your code works on all of the site pairs. You can either do this step by manually extracting the other rows (the way we did with rows 1 and 2 above) and copying and pasting the code for the calculations, OR if you feel comfortable, you can try writing the loop that will do this for you.
Do NOT feel obligated to try and reproduce the exact results table structure I have below; you can simply get your calculations, write them down, and check them by eye if you prefer.
We will discuss how to set up the results table and the loop next week. Of course, if you want to give it a try this week, then go ahead!
Below are my results for all site pairs:
R code solution for LD I Exercises
First Site Second Site Distance (bp) D Rsq 1 1 2 14 0.033057851 0.057142857 2 1 3 34 0.038567493 0.067796610 3 1 4 126 -0.016528926 0.027210884 4 1 5 1552 0.004132231 0.002511161 5 2 3 20 0.081267218 0.842857143 6 2 4 112 0.004132231 0.004761905 7 2 5 1538 -0.012396694 0.063281250 8 3 4 92 0.004820937 0.005649718 9 3 5 1518 -0.011937557 0.051150121 10 4 5 1426 -0.001377410 0.001488095