Population Differentiation with Fst

Before Starting:

Remember to either load my CompGen_Fxns17.R file or copy and paste the read.vcf function into your code.

Then, load the sample data file for this week: sampleData_4_Fst.vcf

1. Get the EXPECTED HET values for each population

Part A: Set up the Empty Results Table

This table should have:

  • The same number of rows as my.data
  • At least 9 columns:

p: The frequency of allele (A) in the total population

n: The number of non-missing individuals in the total population.

Hexp: The expected heterozygosity for the total population (frequency, not counts!!!).

p1: The frequency of allele (A) in sub-population 1.

n1: The number of (non-missing) individuals in sub-pop 1

Hexp1 The expected heterozygosity in sub-pop 1

p2: The frequency of allele (A) is sub-population 2.

n2: The number of non-missing individuals in sub-pop 2.

Hexp2: The expected heterozygosity for sub-pop 2.

Note: Feel free to include additional columns in your table, such as columns for the observed genotype counts, if you want them there to help with the calculations.

Part B: Calculate Hetexp for the Total population

This calculation is exactly the same as the calculation we did for the Hardy-Weinberg exercises last week, so you should be able to re-use your code and fill in the appropriate values of your new results table. (Although note again that this time you do not need the expected count of Aa individuals, just their expected frequency (i.e. just 2pq))

Part C: Calculate Hexp for the 2 Sub-Populations

The first step in this part of the exercise is to figure out which columns of my.data correspond to each sub-population.

In this file, the samples from the first sub-population (Geospiza fuliginosa) all have names that start with "FU_".

The samples from the second sub-population (Geospiza magnirostris) all have names that start with "MA_".

You can extract the column names from my.data like this:


You can use grep() on the column names to get the index of the columns that match a particular pattern.

Then, you can use the square bracket indexing notation to pull out only those columns from the my.data table:

my.data[ , (your grep results)]

Once you're able to get only the parts of the table that correspond to one sub-population, you can use the exact same methods for calculating Hexp that you used for the total population.

2. Use Expected Het. values to Calculate Fst

The general form for Wright's Fst equation is: 

LaTeX: \frac{H_T-H_S}{H_T} 

LaTeX: H_T=H_{exp_T} , i.e. the expected heterozygosity for the total population (so we already have this value calculated).

Hs is a weighted average of the expected heterozygosities for all sub-populations. It is calculated like this:

LaTeX: H_S=\sum^k_{i=1}\frac{n_i}{n}H_{exp_i}, where k = the num. of sub-pops, n = the total population size, ni = the size of the ith sub-population, and Hexpi = the expected heterozygosity of the ith sub-population.

For our data set today, where we have 2 sub-populations, this evaluates to:

LaTeX: H_S=\left[\frac{n_1}{n}\cdot H_{exp_1}\right]+\left[\frac{n_2}{n}\cdot H_{exp_2}\right]

Once you have calculated Hs, it should be very straightforward to calculate Fst.

3. Plot the Results

First, let's look at the frequency distribution of all of the Fst values, using R's built-in histogram function:
hist(my.results$Fst, xlim=c(0,1), breaks=50,xlab=expression(F[ST]), main="Histogram of FST", col="lightblue")

The arguments that I'm using in this function are:

  1. The first argument is always the data that you want to plot, in this case my.results$Fst.
  2. xlim sets the x-axis range to plot; if we didn't specify this, R would just pick the range automatically based on the data.
  3. breaks tells R what size bins to use for the histogram; again, if we didn't specify this parameter, R would pick it for us automatically.
  4. xlab Creates a label for the x-axis. I'm using the additional function expression() here to generate a label with a sub-script
  5. main creates the title for the plot
  6. col sets the plot color. Here is a handy pdf showing all of the built-in color options in R. You can get even more with extra packages.
Next, to look at how the Fst values are distributed spatial along the genome, we can simply make a scatterplot where Fst is a function of position.
plot(my.data$POS, my.results$Fst, xlab="Chromosomal Position (bp)", ylab=expression(F[ST]), ylim=c(0,1), pch=20, main="Spatial Distribution of Fst", col="hotpink3")

Here, the plot arguments are:

  1. The first value should always be the x values; in this case the positions
  2. The second argument should always be the y values, which here is Fst.
  3. xlab does the same thing here as it did in hist() and ylab makes a label for the y-axis.
  4. ylim sets the limits for the y-axis.
  5. pch specifies which plotting symbol to use. I specify #20 for closed circles.
  6. main again gives the plot a title, and col again specifies the color.

What does looking at the histogram tell you about Fst? What does looking at the genomewide plot tell you?

EXTRA: PCA to look at Population Structure

If we hadn't known before starting our analysis that we had 2 sub-groups in the data, then we might have wanted to start with a more general analysis to detect population structure.

Principal Components Analysis (PCA) is one method of doing this. You can find a very basic overview of PCA here if you're not familiar with it.

Part A: Convert Genotypes to a Matrix of Numeric Values

R can only perform PCA on numerical values, so we need to convert our genotypes to numbers that R can handle.

First, set up an empty matrix with the same number of rows as my.data and a column for every sample:

numericGen = matrix(NA, nrow=nrow(my.data), ncol=(ncol(my.data) - 9))
colnames(numericGen) = colnames(my.data)[10:ncol(my.data)]

Now, you need to write a for() loop that goes through  my.data and replaces the genotype strings with numbers, then saves the new numeric strings in the matrix.

This is a lot like Part 3 of the VCF Exercises where we converted the VCF genotypes to the diploid nucleotides. The only difference is now we convert them to the following numbers:

"0/0" = 1

"0/1" = 2

"1/1" = 3

"./." = 0

Note: Normally, you would not want to replace missing data with 0. For the purposes of this exercise, we are doing it as a crude but quick way to avoid the problems R has with missing data values in PCA. In a real analysis, you would want to use some sort of imputation method specific to genetic data to deal with these values.

Part B: Run PCA and plot the results

For PCA in R, we need to transpose our matrix so that the samples become the rows, and the SNPs become the columns. This is easily done with the t() function:

numGen.trans = t(numericGen)

Now, run PCA with the prcomp function:

pca.results = prcomp(numGen.trans, scale=FALSE)

There a few things we can look at in pca.results. First, if you type summary(pca.results you will get a table of information about the principal components, including how much variance is explained by each.

If you type plot(pca.results) you will see the variance explained by each PC.

Finally, the plot that we are most interested in is a plot of the first two PCs, which will show us if our data is clustered at all.

To get our plot points to be colored by species, first run this:

plot.col = gsub("^FU_.*", "darkturquoise", gsub("^MA_.*", "darkviolet", rownames(numGen.trans)))

Now, create the plot:

plot(pca.results$x, col=plot.col, pch=20, main="PCA of Finch Data")
legend(18, -1, c("Fuliginosa", "Magnirostris"), col=c("darkturquoise", "darkviolet"), pch=c(20,20), cex=0.75)

What can you tell about population structure from the plot? How does the information in the 3 plots we generated today differ?

R code solution for Fst Exercises