Objective: The goal for today's exercise is to write and save the code that you will use again next week to estimate linkage disequilibrium (LD) in a larger data file.
Remember to either load my CompGen_Fxns17.R file or copy and paste the read.vcf
function into your code.
Then, load the sample data file for this week: sampleData_LDmini.vcf
You should have a total a 5 rows and 42 columns after reading in the data. You can check with the dim()
function:
dim(my.data)
[1] 5 42
Extract Row 1 and Row 2 of my.data
into 2 separate variables, and use these to test your calculations to start with (rather than worrying about the loop yet).
row1=as.vector(my.data[1,], mode="character")
row2=as.vector(my.data[2,], mode="character")
Remember that in a VCF file, each row is a position, so row1 is "position 1" and row2 is "position 2".
The distance between positions isn't part of the actual LD calculation, but next week we will want to plot LD as a function of distance (in base pairs), so you should have this as part of your code now.
Remember: Since we have made our vectors into "character" vectors above, you will need to use the as.numeric()
function to be able to transform values back into numbers that can be used in math equations.
pA is the frequency of the "0" allele at position 1 (in other words, this is the same as p for row 1).
pB is the frequency of the "0" allele at position 2.
Below are my calculated values for pA and pB, so you can check your own calculations:
pA
[1] 0.6363636
pB
[1] 0.09090909
pAB is the frequency of the "AB" haplotype, or in other words the percentage of times you see the "0" allele on the same side of the "|" when you look across BOTH positions).
Finding pAB will be the most challenging part of the exercises today, so don't worry if you feel you spend most of your time on this part!
Here are my final results for pAB and D, so you can check your calculations:
pAB
[1] 0.09090909
D
[1] 0.03305785
To help you with the calculation of pAB, here is a quick reminder of some R functions we have seen before, that you might find useful today.
You can use gsub()
to replace patterns. This could include replacing a character string with a number:
a = c("0|0", "0|1", "1|1", "1|0")
b = c("1|1", "1|0", "1|0", "0|1")
gsub("0\\|0", 1, a)
[1] "1" "0|1" "1|1" "1|0"
gsub("0\\|1", 2, b)
[1] "1|1" "1|0" "1|0" "2"
Or, you can get rid of a pattern or character by replacing with "":
gsub("\\|", "", a)
[1] "00" "01" "11" "10"
gsub("\\|", "", b)
[1] "11" "10" "10" "01"
You can use paste0
to paste all of the items in one vector together into a single word:
c = gsub("\\|", "", a)
c
[1] "00" "01" "11" "10"
paste0(c, collapse="")
[1] "00011110"
Or, you could paste together EACH item in a set of 2 vectors:
c
[1] "00" "01" "11" "10"
d = gsub("\\|", "", b)
d
[1] "11" "10" "10" "01"
paste0(c,d)
[1] "0011" "0110" "1110" "1001"
Notice that I can also add a separator character when I paste things together:
paste(c,d, sep="|")
[1] "00|11" "01|10" "11|10" "10|01"
In addition to pasting things together, we can also split them apart, either by a separating character or by the "" character (if we want to split everything into a separate item):
f=paste(c,d, sep="|")
f
[1] "00|11" "01|10" "11|10" "10|01"
unlist(strsplit(f, split="\\|"))
[1] "00" "11" "01" "10" "11" "10" "10" "01"
unlist(strsplit(f, split=""))
[1] "0" "0" "|" "1" "1" "0" "1" "|" "1" "0" "1" "1" "|" "1" "0" "1" "0" "|" "0" "1"
If you would rather view your sites as 2 columns side-by-side, rather than 2 rows stacked together, you can use t()
:
x
[,1] [,2] [,3] [,4]
a "0|0" "0|1" "1|1" "1|0"
b "1|1" "1|0" "1|0" "0|1"
y=t(x)
y
a b
[1,] "0|0" "1|1"
[2,] "0|1" "1|0"
[3,] "1|1" "1|0"
[4,] "1|0" "0|1"
If you want to test a certain condition, you can use an if
loop, or an if/else
combination:
g=unlist(strsplit(f, split=""))
g
[1] "0" "0" "|" "1" "1" "0" "1" "|" "1"
"0" "1" "1" "|" "1" "0" "1" "0" "|" "0" "1"
if (g[1]=="0") {
print("The first allele is 0")}
[1] "The first allele is 0"
if (g[1]=="0") {
print("The first allele is 0")
} else {
print("The first allele is NOT 0")
}
You can also test multiple conditions:
if ((g[1]=="0") & (g[2]=="0")) {
print("The first 2 alleles are 0")
} else {
print("One or both of the first 2 \
alleles is not 0")
}
[1] "The first 2 alleles are 0"
In addition to the &
symbol for "AND" you can also use the |
symbol for "OR" and the !
symbol for "NOT".
Once you have the value for D along with the values for pA and pB, you can easily calculate r2 with the equation below:
Here is my result, to help you check your math:
rsq
[1] 0.05714286
This step is mostly so that you can double check your math and make sure your code works on all of the site pairs. You can either do this step by manually extracting the other rows (the way we did with rows 1 and 2 above) and copying and pasting the code for the calculations, OR if you feel comfortable, you can try writing the loop that will do this for you.
Do NOT feel obligated to try and reproduce the exact results table structure I have below; you can simply get your calculations, write them down, and check them by eye if you prefer.
We will discuss how to set up the results table and the loop next week. Of course, if you want to give it a try this week, then go ahead!
Below are my results for all site pairs:
First Site Second Site Distance (bp) D Rsq
1 1 2 14 0.033057851 0.057142857
2 1 3 34 0.038567493 0.067796610
3 1 4 126 -0.016528926 0.027210884
4 1 5 1552 0.004132231 0.002511161
5 2 3 20 0.081267218 0.842857143
6 2 4 112 0.004132231 0.004761905
7 2 5 1538 -0.012396694 0.063281250
8 3 4 92 0.004820937 0.005649718
9 3 5 1518 -0.011937557 0.051150121
10 4 5 1426 -0.001377410 0.001488095
R code solution for LD I Exercises