Remember to either load my CompGen_Fxns17.R file or copy and paste the read.vcf
function into your code.
Then, load the sample data file for this week: sampleData_4_Fst.vcf
This table should have:
my.data
p: The frequency of allele (A) in the total population
n: The number of non-missing individuals in the total population.
Hexp: The expected heterozygosity for the total population (frequency, not counts!!!).
p1: The frequency of allele (A) in sub-population 1.
n1: The number of (non-missing) individuals in sub-pop 1
Hexp1 The expected heterozygosity in sub-pop 1
p2: The frequency of allele (A) is sub-population 2.
n2: The number of non-missing individuals in sub-pop 2.
Hexp2: The expected heterozygosity for sub-pop 2.
Note: Feel free to include additional columns in your table, such as columns for the observed genotype counts, if you want them there to help with the calculations.
This calculation is exactly the same as the calculation we did for the Hardy-Weinberg exercises last week, so you should be able to re-use your code and fill in the appropriate values of your new results table. (Although note again that this time you do not need the expected count of Aa individuals, just their expected frequency (i.e. just 2pq))
The first step in this part of the exercise is to figure out which columns of my.data
correspond to each sub-population.
In this file, the samples from the first sub-population (Geospiza fuliginosa) all have names that start with "FU_".
The samples from the second sub-population (Geospiza magnirostris) all have names that start with "MA_".
You can extract the column names from my.data
like this:
colnames(my.data)
You can use grep()
on the column names to get the index of the columns that match a particular pattern.
Then, you can use the square bracket indexing notation to pull out only those columns from the my.data
table:
my.data[ , (your grep results)]
Once you're able to get only the parts of the table that correspond to one sub-population, you can use the exact same methods for calculating Hexp that you used for the total population.
The general form for Wright's Fst equation is:
, i.e. the expected heterozygosity for the total population (so we already have this value calculated).
Hs is a weighted average of the expected heterozygosities for all sub-populations. It is calculated like this:
, where k = the num. of sub-pops, n = the total population size, ni = the size of the ith sub-population, and Hexpi = the expected heterozygosity of the ith sub-population.
For our data set today, where we have 2 sub-populations, this evaluates to:
Once you have calculated Hs, it should be very straightforward to calculate Fst.
hist(my.results$Fst, xlim=c(0,1), breaks=50,xlab=expression(F[ST]), main="Histogram of FST", col="lightblue")
The arguments that I'm using in this function are:
my.results$Fst
.expression()
here to generate a label with a sub-scriptplot(my.data$POS, my.results$Fst, xlab="Chromosomal Position (bp)", ylab=expression(F[ST]), ylim=c(0,1), pch=20, main="Spatial Distribution of Fst", col="hotpink3")
Here, the plot arguments are:
hist()
and ylab makes a label for the y-axis.What does looking at the histogram tell you about Fst? What does looking at the genomewide plot tell you?
If we hadn't known before starting our analysis that we had 2 sub-groups in the data, then we might have wanted to start with a more general analysis to detect population structure.
Principal Components Analysis (PCA) is one method of doing this. You can find a very basic overview of PCA if you're not familiar with it.
R can only perform PCA on numerical values, so we need to convert our genotypes to numbers that R can handle.
First, set up an empty matrix with the same number of rows as my.data
and a column for every sample:
numericGen = matrix(NA, nrow=nrow(my.data), ncol=(ncol(my.data) - 9))
colnames(numericGen) = colnames(my.data)[10:ncol(my.data)]
Now, you need to write a for()
loop that goes through my.data
and replaces the genotype strings with numbers, then saves the new numeric strings in the matrix.
This is a lot like Part 3 of the VCF Exercises where we converted the VCF genotypes to the diploid nucleotides. The only difference is now we convert them to the following numbers:
"0/0" = 1
"0/1" = 2
"1/1" = 3
"./." = 0
Note: Normally, you would not want to replace missing data with 0. For the purposes of this exercise, we are doing it as a crude but quick way to avoid the problems R has with missing data values in PCA. In a real analysis, you would want to use some sort of imputation method specific to genetic data to deal with these values.
For PCA in R, we need to transpose our matrix so that the samples become the rows, and the SNPs become the columns. This is easily done with the t()
function:
numGen.trans = t(numericGen)
Now, run PCA with the prcomp
function:
pca.results = prcomp(numGen.trans, scale=FALSE)
There a few things we can look at in pca.results
. First, if you type summary(pca.results
you will get a table of information about the principal components, including how much variance is explained by each.
If you type plot(pca.results)
you will see the variance explained by each PC.
Finally, the plot that we are most interested in is a plot of the first two PCs, which will show us if our data is clustered at all.
To get our plot points to be colored by species, first run this:
plot.col = gsub("^FU_.*", "darkturquoise", gsub("^MA_.*", "darkviolet", rownames(numGen.trans)))
Now, create the plot:
plot(pca.results$x, col=plot.col, pch=20, main="PCA of Finch Data")
legend(18, -1, c("Fuliginosa", "Magnirostris"), col=c("darkturquoise", "darkviolet"), pch=c(20,20), cex=0.75)
What can you tell about population structure from the plot? How does the information in the 3 plots we generated today differ?
R code solution for Fst Exercises