Remember to either load my CompGen_Fxns17.R
file or copy and paste the `read.vcf`

function into
your code.

Then read in the sample VCF file for this week:7_sampleData_AFS.vcf

Once you have read this file in, check that it has the correct dimensions:

```
dim(my.data)
[1] 1946 33
```

The first step in this exercise is to calculate and plot the derived frequency spectrum for the observed data. This file has been created such that the REF allele is always the **ancestral** allele, and the ALT allele is the **derived** allele.

Remember to first set up an empty data structure to hold your results. We only need to save one calculation for each site, so your empty data structure can be as simple as a vector with *length=number of sites*, or a table with additional information if you prefer.

Once you've got your empty results vector/table set up, use a loop to count the total number of derived (ALT) alleles at each SNP position.

Here is what my first few values look like:

```
head(derived.counts)
[1] 5 3 25 12 7 5
```

Now that we've got the derived counts, we can use R's built-in histogram function (which we have used before during the FST Exercises) to plot the frequency of each count.

The general structure of the histogram command will look like this:

```
hist(x=YOUR COUNTS, breaks=A VECTOR WITH ALL POSSIBLE COUNT VALUES)
```

The value for "YOUR COUNTS" is what you just calculated in the previous step.

To get the other value for the break points vector, remember that the range of possible counts will go from 1, all the way up to the total number of chromosomes in the sample (2N). To generate a vector with these values, you can use the `seq()`

function:

```
seq(from = START OF RANGE, to = END OF RANGE, by=1)
```

We are using `by=1`

because we want every integer value between these two numbers.

When you have your breakpoint vector, use it in your histogram function to check the plot of the derived spectrum.

If it looks reasonable to you, go ahead and save the histogram data as a new variable (this will make it easy to extract the counts in each bin later):

```
histinfo = hist(x=YOUR COUNTS, breaks=BREAKPOINT VECTOR)
```

Next, we want to calculate Waterson's Theta (), which is an estimator of the population mutation rate.

The formula is:

where S_{n} = the number of segregating sites,

N = the number of individuals,

2N = the number of chromosomes.

Here is my calculated value:

= 438.4894

If your answer matches up to mine, then make sure to save your value as the variable `theta.w`

, and then go on to the next step.

For the last part of the exercise, we will now use our estimate of to determine what our frequency spectrum should look like under complete neutrality. To get the expected number of sites in each frequency bin, use the equation:

Exp(x_{i}) = theta.w * 1/i

where *i* is each value from 1 to 2N-1.

Lastly, we will plot our observed and expected frequency distributions side by side, to see if our data looks neutral or not.

First, get your data into a matrix. In the code below, I'm extracting the observed counts from `histinfo`

, and my calculations for the expected counts are in `exp.counts`

:

```
compare.counts=matrix(c(histinfo$counts, exp.counts), ncol=2, byrow=FALSE)
colnames(compare.counts)=c("Obs", "Exp")
head(compare.counts)
Obs Exp
[1,] 178 438.48938
[2,] 179 219.24469
[3,] 133 146.16313
[4,] 148 109.62234
[5,] 114 87.69788
[6,] 109 73.08156
```

Now, we can use R's `barplot()`

function:

```
barplot(t(compare.counts), beside=T, col=c("lightblue", "black"))
legend("topright", c("Obs", "Exp"), pch=c(15,15), col=c("lightblue", "black"))
```

How well does the observed data line up to the neutral distribution? In what ways does it differ? Do you have any hypotheses about what might cause this type of difference?

R code solution AFS Exercises