Remember to either load my CompGen_Fxns17.R file or copy and paste the `read.vcf`

function into your code.

Then, load the sample data file for this week: sampleData_4_Fst.vcf

This table should have:

- The same number of rows as
`my.data`

*At least*9 columns:

**p**: The frequency of allele (A) in the total population

**n**: The number of non-missing individuals in the total population.

**Hexp**: The expected heterozygosity for the total population (frequency, not counts!!!).

**p1**: The frequency of allele (A) in sub-population 1.

**n1**: The number of (non-missing) individuals in sub-pop 1

**Hexp1** The expected heterozygosity in sub-pop 1

**p2**: The frequency of allele (A) is sub-population 2.

**n2**: The number of non-missing individuals in sub-pop 2.

**Hexp2**: The expected heterozygosity for sub-pop 2.

*Note: Feel free to include additional columns in your table, such as columns for the observed genotype counts, if you want them there to help with the calculations.*

This calculation is exactly the same as the calculation we did for the Hardy-Weinberg exercises last week, so you should be able to re-use your code and fill in the appropriate values of your new results table. (Although note again that this time you do not need the expected **count** of Aa individuals, just their expected **frequency** (i.e. just 2pq))

The first step in this part of the exercise is to figure out which columns of `my.data`

correspond to each sub-population.

In this file, the samples from the first sub-population (*Geospiza fuliginosa*) all have names that start with "FU_".

The samples from the second sub-population (*Geospiza magnirostris*) all have names that start with "MA_".

You can extract the column names from `my.data`

like this:

```
colnames(my.data)
```

You can use `grep()`

on the column names to get the index of the columns that match a particular pattern.

Then, you can use the square bracket indexing notation to pull out only those columns from the `my.data`

table:

```
my.data[ , (your grep results)]
```

Once you're able to get only the parts of the table that correspond to one sub-population, you can use the exact same methods for calculating H_{exp} that you used for the total population.

The general form for Wright's Fst equation is:

, i.e. the expected heterozygosity for the total population (so we already have this value calculated).

H_{s} is a weighted average of the expected heterozygosities for all sub-populations. It is calculated like this:

, where *k = **the num. of sub-pops, **n** = the total population size, **n_{i} = the size of the i^{th} sub-population, and H_{exp}_{i} = the expected heterozygosity of the i^{th} sub-population.*

For our data set today, where we have 2 sub-populations, this evaluates to:

Once you have calculated Hs, it should be very straightforward to calculate Fst.

```
hist(my.results$Fst, xlim=c(0,1), breaks=50,xlab=expression(F[ST]), main="Histogram of FST", col="lightblue")
```

The arguments that I'm using in this function are:

- The first argument is always the data that you want to plot, in this case
`my.results$Fst`

. **xlim**sets the x-axis range to plot; if we didn't specify this, R would just pick the range automatically based on the data.**breaks**tells R what size bins to use for the histogram; again, if we didn't specify this parameter, R would pick it for us automatically.**xlab**Creates a label for the x-axis. I'm using the additional function`expression()`

here to generate a label with a sub-script**main**creates the title for the plot**col**sets the plot color. Here is a showing all of the built-in color options in R. You can get even more with extra packages.

```
plot(my.data$POS, my.results$Fst, xlab="Chromosomal Position (bp)", ylab=expression(F[ST]), ylim=c(0,1), pch=20, main="Spatial Distribution of Fst", col="hotpink3")
```

Here, the plot arguments are:

- The first value should always be the
*x*values; in this case the positions - The second argument should always be the
*y*values, which here is Fst. **xlab**does the same thing here as it did in`hist()`

and**ylab**makes a label for the y-axis.**ylim**sets the limits for the y-axis.**pch**specifies which plotting symbol to use. I specify #20 for closed circles.**main**again gives the plot a title, and**col**again specifies the color.

What does looking at the histogram tell you about Fst? What does looking at the genomewide plot tell you?

If we hadn't known before starting our analysis that we had 2 sub-groups in the data, then we might have wanted to start with a more general analysis to detect population structure.

Principal Components Analysis (PCA) is one method of doing this. You can find a very basic overview of PCA if you're not familiar with it.

R can only perform PCA on numerical values, so we need to convert our genotypes to numbers that R can handle.

First, set up an empty **matrix** with the same number of rows as `my.data`

and a column for every sample:

```
numericGen = matrix(NA, nrow=nrow(my.data), ncol=(ncol(my.data) - 9))
colnames(numericGen) = colnames(my.data)[10:ncol(my.data)]
```

Now, you need to write a `for()`

loop that goes through `my.data`

and replaces the genotype strings with numbers, then saves the new numeric strings in the matrix.

This is a lot like Part 3 of the VCF Exercises where we converted the VCF genotypes to the diploid nucleotides. The only difference is now we convert them to the following numbers:

"0/0" = 1

"0/1" = 2

"1/1" = 3

"./." = 0

**Note**: Normally, you would *not* want to replace missing data with 0. For the purposes of this exercise, we are doing it as a crude but quick way to avoid the problems R has with missing data values in PCA. In a real analysis, you would want to use some sort of imputation method specific to genetic data to deal with these values.

For PCA in R, we need to transpose our matrix so that the samples become the rows, and the SNPs become the columns. This is easily done with the `t()`

function:

```
numGen.trans = t(numericGen)
```

Now, run PCA with the `prcomp`

function:

```
pca.results = prcomp(numGen.trans, scale=FALSE)
```

There a few things we can look at in `pca.results`

. First, if you type `summary(pca.results`

you will get a table of information about the principal components, including how much variance is explained by each.

If you type `plot(pca.results)`

you will see the variance explained by each PC.

Finally, the plot that we are most interested in is a plot of the first two PCs, which will show us if our data is clustered at all.

To get our plot points to be colored by species, first run this:

```
plot.col = gsub("^FU_.*", "darkturquoise", gsub("^MA_.*", "darkviolet", rownames(numGen.trans)))
```

Now, create the plot:

```
plot(pca.results$x, col=plot.col, pch=20, main="PCA of Finch Data")
legend(18, -1, c("Fuliginosa", "Magnirostris"), col=c("darkturquoise", "darkviolet"), pch=c(20,20), cex=0.75)
```

What can you tell about population structure from the plot? How does the information in the 3 plots we generated today differ?

R code solution for Fst Exercises