**Objective:** The goal for today's exercise is to write and save the code that you will use again next week to estimate linkage disequilibrium (LD) in a larger data file.

Remember to either load my CompGen_Fxns17.R file or copy and paste the `read.vcf`

function into your code.

Then, load the sample data file for this week: sampleData_LDmini.vcf

You should have a total a 5 rows and 42 columns after reading in the data. You can check with the `dim()`

function:

```
dim(my.data)
[1] 5 42
```

Extract Row 1 and Row 2 of `my.data`

into 2 separate variables, and use these to test your calculations to start with (rather than worrying about the loop yet).

```
row1=as.vector(my.data[1,], mode="character")
row2=as.vector(my.data[2,], mode="character")
```

Remember that in a VCF file, each row is a position, so row1 is "position 1" and row2 is "position 2".

The distance between positions isn't part of the actual LD calculation, but next week we will want to plot LD as a function of distance (in base pairs), so you should have this as part of your code now.

**Remember:** Since we have made our vectors into "character" vectors above, you will need to use the `as.numeric()`

function to be able to transform values back into numbers that can be used in math equations.

*pA* is the frequency of the "0" allele at position 1 (in other words, this is the same as *p* for row 1).

*pB* is the frequency of the "0" allele at position 2.

Below are my calculated values for *pA* and *pB*, so you can check your own calculations:

```
pA
[1] 0.6363636
pB
[1] 0.09090909
```

*pAB* is the frequency of the "AB" haplotype, or in other words the percentage of times you see the "0" allele on the same side of the "|" when you look across BOTH positions).

Finding *pAB* will be the most challenging part of the exercises today, so don't worry if you feel you spend most of your time on this part!

Here are my final results for *pAB* and *D*, so you can check your calculations:

```
pAB
[1] 0.09090909
D
[1] 0.03305785
```

To help you with the calculation of *pAB*, here is a quick reminder of some R functions we have seen before, that you might find useful today.

*These are just meant as hints; do NOT feel obligated to use these in your own solutions if you don't want to!*

You can use `gsub()`

to replace patterns. This could include replacing a character string with a number:

```
a = c("0|0", "0|1", "1|1", "1|0")
b = c("1|1", "1|0", "1|0", "0|1")
gsub("0\\|0", 1, a)
[1] "1" "0|1" "1|1" "1|0"
```

gsub("0\\|1", 2, b)
[1] "1|1" "1|0" "1|0" "2"

Or, you can get rid of a pattern or character by replacing with "":

```
gsub("\\|", "", a)
[1] "00" "01" "11" "10"
```

gsub("\\|", "", b)
[1] "11" "10" "10" "01"

You can use `paste0`

to paste all of the items in one vector together into a single word:

```
c = gsub("\\|", "", a)
c
[1] "00" "01" "11" "10"
```

paste0(c, collapse="")
[1] "00011110"

Or, you could paste together EACH item in a set of 2 vectors:

```
c
[1] "00" "01" "11" "10"
```

d = gsub("\\|", "", b)
d
[1] "11" "10" "10" "01"

paste0(c,d)
[1] "0011" "0110" "1110" "1001"

Notice that I can also add a separator character when I paste things together:

```
paste(c,d, sep="|")
[1] "00|11" "01|10" "11|10" "10|01"
```

In addition to pasting things together, we can also split them apart, either by a separating character or by the "" character (if we want to split everything into a separate item):

```
f=paste(c,d, sep="|")
f
[1] "00|11" "01|10" "11|10" "10|01"
```

unlist(strsplit(f, split="\\|"))
[1] "00" "11" "01" "10" "11" "10" "10" "01"
unlist(strsplit(f, split=""))
[1] "0" "0" "|" "1" "1" "0" "1" "|" "1" "0" "1" "1" "|" "1" "0" "1" "0" "|" "0" "1"

If you would rather view your sites as 2 columns side-by-side, rather than 2 rows stacked together, you can use `t()`

:

```
x
[,1] [,2] [,3] [,4]
a "0|0" "0|1" "1|1" "1|0"
b "1|1" "1|0" "1|0" "0|1"
```

y=t(x)
y
a b
[1,] "0|0" "1|1"
[2,] "0|1" "1|0"
[3,] "1|1" "1|0"
[4,] "1|0" "0|1"

If you want to test a certain condition, you can use an `if`

loop, or an `if/else`

combination:

```
g=unlist(strsplit(f, split=""))
g
[1] "0" "0" "|" "1" "1" "0" "1" "|" "1"
```

"0" "1" "1" "|" "1" "0" "1" "0" "|" "0" "1"
if (g[1]=="0") {

print("The first allele is 0")}
[1] "The first allele is 0"
if (g[1]=="0") {
print("The first allele is 0")
} else {
print("The first allele is NOT 0")

}

You can also test multiple conditions:

```
if ((g[1]=="0") & (g[2]=="0")) {
print("The first 2 alleles are 0")
} else {
print("One or both of the first 2 \
```

alleles is not 0")
}
[1] "The first 2 alleles are 0"

In addition to the `&`

symbol for "AND" you can also use the `|`

symbol for "OR" and the `!`

symbol for "NOT".

Once you have the value for *D* along with the values for *pA* and *pB*, you can easily calculate *r ^{2}* with the equation below:

Here is my result, to help you check your math:

```
rsq
[1] 0.05714286
```

This step is mostly so that you can double check your math and make sure your code works on all of the site pairs. You can either do this step by manually extracting the other rows (the way we did with rows 1 and 2 above) and copying and pasting the code for the calculations, OR if you feel comfortable, you can try writing the loop that will do this for you.

Do NOT feel obligated to try and reproduce the exact results table structure I have below; you can simply get your calculations, write them down, and check them by eye if you prefer.

We will discuss how to set up the results table and the loop next week. Of course, if you want to give it a try this week, then go ahead!

Below are my results for all site pairs:

```
First Site Second Site Distance (bp) D Rsq
1 1 2 14 0.033057851 0.057142857
2 1 3 34 0.038567493 0.067796610
3 1 4 126 -0.016528926 0.027210884
4 1 5 1552 0.004132231 0.002511161
5 2 3 20 0.081267218 0.842857143
6 2 4 112 0.004132231 0.004761905
7 2 5 1538 -0.012396694 0.063281250
8 3 4 92 0.004820937 0.005649718
9 3 5 1518 -0.011937557 0.051150121
10 4 5 1426 -0.001377410 0.001488095
```

R code solution for LD I Exercises