Remember to either load my CompGen_Fxns17.R
file or copy and paste the read.vcf
function into
your code.
For today's exercise, we are going to perform the McDonald-Kreitman test, which looks at the proportion of amino acid changes occurring between species, and compares this to the proportion occurring within a species. In neutrally evolving genes (or at least genes neutral with respect to species differences), these proportions will be the same, but in genes under selection, they will be different.
Because we need to known which of our SNPs are nonsynonymous (cause amino acid changes) and which are synonymous, we are going to use an R package called "VariantAnnotation" to help us figure this out.
There are multiple data files needed for this week, and they can all be found in the zipped file: 9_sampleData_MK.tar.gz. Download and unzip this file into the folder where you normally keep your class files.
Once the folder is unzipped, you should see the following files inside of it:
UrsMar_subset.fa
UrsMar_subset.fa.fai
UrsMar_subset.gff3
bears.vcf
In total, the unzipped files will take up about 85 Mb of space on your computer. If you absolutely do not have space for this (even for today), let me know, I will figure out a way for you to get a copy of the final annotated vcf file only.
There are 3 packages that we are going to need today, and they all come from the Bioconductor repository. To install these packages, in R, type:
source("https://bioconductor.org/biocLite.R")
biocLite("VariantAnnotation")
biocLite("GenomicFeatures")
biocLite("Rsamtools")
Some of these packages have a number of dependencies, so don't worry if you see a lot of output to the screen while they install. If they ask you something about installing from source, you can simply type in the letter "a" and hit enter; this will get them to recompile everything they need.
Once all of the packages are installed, we can use them to load our data and annotate our SNPs.
I start (as always) by setting my working directory to where I keep files for this class:
setwd("~/Desktop/Teaching/CompGenWS-GEN8900/")
The first task is to open a connection to the Reference Genome .fasta file (this is what is taking up the most space):
library(Rsamtools)
fa = open(FaFile("9_sampleData_MK/UrsMar_subset.fa"))
Next, we read in the .gff3 file, which contains the locations of all of the genes in the reference:
library(GenomicFeatures)
txdb=makeTxDbFromGFF(file="9_sampleData_MK/UrsMar_subset.gff3", format="gff3")
Now, I'm going to use this R package's readVcf()
function (note that this is different from the function we wrote during the first week of class!!):
library(VariantAnnotation)
vcf.object=readVcf(file="9_sampleData_MK/bears.vcf")
With those 3 pieces of information read in, we can use the predictCoding()
function to determine which SNPs are nonsynonymous and synonymous:
effects = predictCoding(vcf.object, txdb, fa)
If we print the information in effects
to the screen, we should see (among other things), a column in the table called "CONSEQUENCE" where each SNP has been classified.
While this was all pretty quick and easy, we are not quite done yet: we want to edit this output so that it can fit into our more familiar "my.data" VCF format data frame before we try and work with it.
There are a couple of things to consider here:
predictCoding
only considers SNPs that fall within exons; so there is not an entry in the effects table for every entry in our starting VCF file.vcf.object
we created when we used their readVcf function and the effects
object are complex data structures (not the usual 2 dimensional data frames we're used to).I'm first going to deal with issue #1. My strategy is to get the unique row names (these were created by the functions) from both the effects
object and vcf.object
, then use R's match()
function to see how they match up:
id.from.effects = names(ranges(effects))
id.from.vcf = rownames(info(vcf.object))
m = match(id.from.vcf, id.from.effects)
The match()
function will take every item in the first vector that you give it (here we give it the row names from the vcf.object), and find the corresponding index of the match in the second vector. If there is no match, it will return NA.
Now, I'm going to use the matching index I got in m
, and assign the values from the effects $CONSEQUENCE
column to their corresponding rows in my original vcf.object. I'm going to put these values into the INFO column of the VCF:
info(vcf.object)$CSQ = effects$CONSEQUENCE[m]
Don't worry if you get a warning message ("info fields with no header") here.
I'm going to take a similar approach, and put the GENEID from effects
into the ID column of the vcf:
names(vcf.object) = effects$GENEID[m]
After doing this, I've got all of the details that I need for the MK-test stored in my vcf.object, but I still don't have a nice data frame that's easy to work with.
To get this, we're going to load in the functions we've created in this class, and then use our own read.vcf function:
library(devtools)
source_gist("c954c757db02eb6c4ccfae1aa090658c", filename="compGen17_fxns.R", quiet=TRUE)
my.data=read.vcf(writeVcf(vcf.object, "9_sampleData_MK/bears_wAnn.vcf"), header=TRUE, stringsAsFactors=FALSE)
After running the above command, we simultaneously write out a new VCF file (with all of our annotations saved) to our working directory, and we read that file back in with read.vcf
, to get the familiar my.data table:
dim(my.data)
[1] 2105 19
If you try head(my.data)
, you'll see that there are a bunch of NAs in the first few rows of the ID column, because most SNPs were not in coding regions. We don't care about these SNPs for the MK test, so lets filter them out like this:
my.data = my.data[!(is.na(my.data$ID)),]
dim(my.data)
[1] 141 19
The !
symbol stands for "NOT" in R, so I'm saying get all of the rows where the value of my.data$ID is NOT NA.
If you try head(my.data)
again, you should now see a gene name in the ID column, and at the very end of the (long) line in the INFO colum, you'll see a value for CSQ. Let's clean up the INFO column a bit, so that we only have the CSQ values to deal with:
my.data$INFO = gsub(".*;CSQ=", "", my.data$INFO)
head(my.data)
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT Uame_JC012 Uarc_EP040
16 NW_007907101.1 6052945 LYST G A 621.71 . synonymous GT:AD:DP:GQ:PL 0/0:17,0:17:51:0,51,576 0/1:7,15:22:99:359,0,249
17 NW_007907101.1 6052951 LYST A G 144.31 . synonymous GT:AD:DP:GQ:PL 0/0:17,0:17:51:0,51,576 0/1:15,8:23:99:178,0,627
34 NW_007907101.1 6054393 LYST A G 487.29 . synonymous GT:AD:DP:GQ:PL 1/1:0,17:17:51:534,51,0 0/0:5,0:5:15:0,15,179
337 NW_007907101.1 6075809 LYST T C 46.93 . nonsynonymous GT:AD:DP:GQ:PL 0/0:1,0:1:3:0,3,43 0/0:4,0:4:12:0,12,99
372 NW_007907101.1 6077520 LYST G A 205.31 . synonymous GT:AD:DP:GQ:PL 0/0:8,0:8:24:0,24,247 0/0:13,0:13:36:0,36,540
373 NW_007907101.1 6077526 LYST A G 1326.86 . synonymous GT:AD:DP:GQ:PL 1/1:0,8:8:24:209,24,0 1/1:0,14:14:42:525,42,0
To see the different gene IDs in this data set, try the unique()
function:
unique(my.data$ID)
You can run this function on any vector, and it will get the total number of different possible values that could be in that vector. Try it on the INFO column as well, to see the possible annotation categories.
Now that we've got the data table into a much more manageable format, and we've gone through and picked out the nonsynonymous and synonymous sites, the next key ingredient in the MK test is to identify Fixed and Polymorphic sites:
Fixed: Sites where individuals within each species all have the same allele, but individuals from different species have different alleles.
Polymorphic: Sites where more than 1 allele is found within either one (or both) of the species.
In today's data, we have 10 individuals: 7 of them are polar bears (Ursus maritimus), 2 of them are grizzly bears (Ursus arctos horribilis), and 1 is an American black bear (Ursus americanus). Since we are interested in selection on genes that may be important specifically in polar bear evolution, we will treat the grizzlies and the black bear as a single out-group. See the schematic below:
Select the ID and INFO columns from my.data
, and put them into a new data frame of their own.
head(new)
ID INFO
16 LYST synonymous
17 LYST synonymous
34 LYST synonymous
337 LYST nonsynonymous
372 LYST synonymous
373 LYST synonymous
Add at least 1 new column to this table, to hold the "Fixed" or "Polymorphic" classification for each SNP site (and feel free to create extra columns that can help you do this classification!).
Hint: Remeber from the FST exercises that we can find column names corresponding to a particular pattern (such as a species name) with grep, and extract those columns with the [] notation:
data.w.pattern = my.data[, grep("some pattern", colnames(my.data))]
There are several different ways you can determine if SNPs are fixed or polymorphic. I'll leave it up to you to figure out your own strategy, but to help you get started, here are some things to remember:
unique
function, which could possibly be of some use if you wanted to try and different strategyHere are the first few results of my table:
head(new)
ID INFO SiteClass
16 LYST synonymous Polymorphic
17 LYST synonymous Polymorphic
34 LYST synonymous Polymorphic
337 LYST nonsynonymous Polymorphic
372 LYST synonymous Polymorphic
373 LYST synonymous Fixed
Note: You should not feel obligated to use the words "Fixed" and "Polymorphic" to assign your SNP classes; if you want to use numeric classifications, that is fine. If you do want to use words, here is a hint to how I did that:
x = A numeric vector from a calculation I used to help me classify sites
new$SiteClass = x # I start by just assigning this numeric vector to my new table
new$SiteClass[which(x == 1)] = "Fixed" # Then I change values where x=1 to be called Fixed
new$SiteClass[which(x != 1)] = "Polymorphic" # x values not equal to 1 get called "Polymorphic"
After we've classified each of our SNPs as either Synonymous or Nonsynonymous, and as either Fixed or Polymorphic, it's time to get the count of SNPs in each of the following categories (DO THIS SEPARATELY FOR EACH GENE!!):
Here is what my empty results table looks like to start with:
my.results
Genes Fn Fs Pn Ps
1 LYST 0 0 0 0
2 APOB 0 0 0 0
For getting the counts of each category, there are several functions we've used in past exercises that will come in handy (such as subset
or which
). There is also another function, that we haven't talked about yet, called intersect()
, which you may be interested in:
x = seq(1,10)
y = seq(5,15)
intersect(x,y)
[1] 5 6 7 8 9 10
Here are my final counts so you can check your calculations:
my.results
Genes Fn Fs Pn Ps
1 LYST 14 5 19 27
2 APOB 17 4 26 29
For each gene, calculate the Nonsynonymous/Synonymous Ratio Between Species (=Fn/Fs), and Within Species (=Pn/Ps):
my.results
Genes Fn Fs Pn Ps BetweenN_S WithinN_S
1 LYST 14 5 19 27 2.80 0.7037037
2 APOB 17 4 26 29 4.25 0.8965517
Let's also calculate the Neutrality Index (NI) for each gene.
my.results
Genes Fn Fs Pn Ps BetweenN_S WithinN_S NI
1 LYST 14 5 19 27 2.80 0.7037037 0.2513228
2 APOB 17 4 26 29 4.25 0.8965517 0.2109533
The Neutrality Index is useful because we know that values < 1 can be indicative of positive selection, while values >1 may be caused by negative or balancing selection.
To get a p-value associate with each of our genes (to see if they significantly differ from neutral expectations), we'll use the Fisher's Exact Test, which we used all the way back during Hardy-Weinberg week.
To do Fisher's Test for each gene, you need your data into a matrix like this:
Fn | FS |
Pn | PS |
Then you can run the command:
fisher.test(YOUR MATRIX)
To extract the p-value, we do:
fisher.test(YOUR MATRIX)$p.value
Even though the key information comes from checking the p-values and the Neutrality Index, it can also be helpful to plot some of the values to highlight especially significant genes.
Below, I'll create a bar plot with the N/S ratios for Between and Within Species in each gene, and then I'll add some indicators of the p-values as well (using a very non-sophisticated technique):
Step 1: Set up a matrix (required by barplot) of the values I want to plot:
for.plot=matrix(c(my.results$BetweenN_S, my.results$WithinN_S), ncol=2)
colnames(for.plot)=c("Between", "Within")
rownames(for.plot)=my.results$Genes
for.plot
Between Within
LYST 2.80 0.7037037
APOB 4.25 0.8965517
Step 2: Plot the Between and Within Proportions for each Gene
barplot(t(for.plot), beside=T,
ylab="Ratio of Nonsynonymous to Synonymous Sites",
xlab="Gene", col=c("darkgreen", "dodgerblue"),
ylim=c(0,6))
Step 3: Add a Legend:
legend("topleft", c("Between Species (F)", "Within Species (P)"),
pch=15, col=c("darkgreen", "dodgerblue"), bty="n")
Step 4: Add "Brackets" over the bars to show exactly which difference the p-values are signifying (i.e. the diff. in ratios within each gene, not between genes):
lines(c(1.5,2.5), c(3,3))
lines(c(1.5,1.5), c(3,2.9))
lines(c(2.5,2.5), c(3,2.9))
lines(c(4.5,5.5), c(4.5,4.5))
lines(c(4.5,4.5), c(4.5,4.4))
lines(c(5.5,5.5), c(4.5,4.4))
Step 5: Write p-values on the graph:
temp=rep("p =", 2)
p.labels=paste(temp, round(my.results$pValues, 2))
text(c(2,5), c(3.1,4.6), p.labels)
Both of these genes appear to be under positive selection in polar bears. To see more about what these genes may actually be doing, you can look them up in NCBI:
LYST https://www.ncbi.nlm.nih.gov/gene/1130
apoB https://www.ncbi.nlm.nih.gov/gene/338
R code solution for MK Test Exercises