McDonald-Kreitman Test for Selection

Before Starting

Remember to either load my CompGen_Fxns17.R file or copy and paste the read.vcf function into your code.

Testing for Positive Selection in Polar Bears

For today's exercise, we are going to perform the McDonald-Kreitman test, which looks at the proportion of amino acid changes occurring between species, and compares this to the proportion occurring within a species. In neutrally evolving genes (or at least genes neutral with respect to species differences), these proportions will be the same, but in genes under selection, they will be different.

Part One: Annotating Coding and Non-Coding Variants

Because we need to known which of our SNPs are nonsynonymous (cause amino acid changes) and which are synonymous, we are going to use an R package called "VariantAnnotation" to help us figure this out.

Download the Sample Data

There are multiple data files needed for this week, and they can all be found in the zipped file: 9_sampleData_MK.tar.gz. Download and unzip this file into the folder where you normally keep your class files.

Once the folder is unzipped, you should see the following files inside of it:

UrsMar_subset.fa
UrsMar_subset.fa.fai
UrsMar_subset.gff3
bears.vcf

In total, the unzipped files will take up about 85 Mb of space on your computer.  If you absolutely do not have space for this (even for today), let me know, I will figure out a way for you to get a copy of the final annotated vcf file only.

Install the Necessary R Packages

There are 3 packages that we are going to need today, and they all come from the Bioconductor repository. To install these packages, in R, type:

source("https://bioconductor.org/biocLite.R")
biocLite("VariantAnnotation")
biocLite("GenomicFeatures")
biocLite("Rsamtools")

Some of these packages have a number of dependencies, so don't worry if you see a lot of output to the screen while they install. If they ask you something about installing from source, you can simply type in the letter "a" and hit enter; this will get them to recompile everything they need.

Running the VariantAnnotation Pipeline

Once all of the packages are installed, we can use them to load our data and annotate our SNPs.

I start (as always) by setting my working directory to where I keep files for this class:

setwd("~/Desktop/Teaching/CompGenWS-GEN8900/")

The first task is to open a connection to the Reference Genome .fasta file (this is what is taking up the most space):

library(Rsamtools)
fa = open(FaFile("9_sampleData_MK/UrsMar_subset.fa"))

Next, we read in the .gff3 file, which contains the locations of all of the genes in the reference:

library(GenomicFeatures)
txdb=makeTxDbFromGFF(file="9_sampleData_MK/UrsMar_subset.gff3", format="gff3")

Now, I'm going to use this R package's readVcf() function (note that this is different from the function we wrote during the first week of class!!):

library(VariantAnnotation)
vcf.object=readVcf(file="9_sampleData_MK/bears.vcf")

With those 3 pieces of information read in, we can use the predictCoding() function to determine which SNPs are nonsynonymous and synonymous:

effects = predictCoding(vcf.object, txdb, fa)

If we print the information in effects to the screen, we should see (among other things), a column in the table called "CONSEQUENCE" where each SNP has been classified.

While this was all pretty quick and easy, we are not quite done yet: we want to edit this output so that it can fit into our more familiar "my.data" VCF format data frame before we try and work with it.

There are a couple of things to consider here:

  1. predictCoding only considers SNPs that fall within exons; so there is not an entry in the effects table for every entry in our starting VCF file.
  2. Both the vcf.object we created when we used their readVcf function and the effects object are complex data structures (not the usual 2 dimensional data frames we're used to).

I'm first going to deal with issue #1. My strategy is to get the unique row names (these were created by the functions) from both the effects object and vcf.object, then use R's match() function to see how they match up:

id.from.effects = names(ranges(effects))
id.from.vcf = rownames(info(vcf.object))
m = match(id.from.vcf, id.from.effects)

The match() function will take every item in the first vector that you give it (here we give it the row names from the vcf.object), and find the corresponding index of the match in the second vector. If there is no match, it will return NA.

Now, I'm going to use the matching index I got in m, and assign the values from the effects $CONSEQUENCE column to their corresponding rows in my original vcf.object. I'm going to put these values into the INFO column of the VCF:

info(vcf.object)$CSQ = effects$CONSEQUENCE[m]

Don't worry if you get a warning message ("info fields with no header") here.

I'm going to take a similar approach, and put the GENEID from effects into the ID column of the vcf:

names(vcf.object) = effects$GENEID[m]

After doing this, I've got all of the details that I need for the MK-test stored in my vcf.object, but I still don't have a nice data frame that's easy to work with.

To get this, we're going to load in the functions we've created in this class, and then use our own read.vcf function:

library(devtools)
source_gist("c954c757db02eb6c4ccfae1aa090658c", filename="compGen17_fxns.R", quiet=TRUE)
my.data=read.vcf(writeVcf(vcf.object, "9_sampleData_MK/bears_wAnn.vcf"), header=TRUE, stringsAsFactors=FALSE)

After running the above command, we simultaneously write out a new VCF file (with all of our annotations saved) to our working directory, and we read that file back in with read.vcf, to get the familiar my.data table:

dim(my.data)
[1] 2105   19

If you try head(my.data), you'll see that there are a bunch of NAs in the first few rows of the ID column, because most SNPs were not in coding regions. We don't care about these SNPs for the MK test, so lets filter them out like this:

my.data = my.data[!(is.na(my.data$ID)),]
dim(my.data)
[1] 141  19

The ! symbol stands for "NOT" in R, so I'm saying get all of the rows where the value of my.data$ID is NOT NA.

If you try head(my.data) again, you should now see a gene name in the ID column, and at the very end of the (long) line in the INFO colum, you'll see a value for CSQ. Let's clean up the INFO column a bit, so that we only have the CSQ values to deal with:

my.data$INFO = gsub(".*;CSQ=", "", my.data$INFO)
head(my.data)
CHROM                       POS         ID      REF ALT    QUAL FILTER          INFO         FORMAT              Uame_JC012               Uarc_EP040
16  NW_007907101.1 6052945 LYST   G   A  621.71      .    synonymous GT:AD:DP:GQ:PL 0/0:17,0:17:51:0,51,576 0/1:7,15:22:99:359,0,249
17  NW_007907101.1 6052951 LYST   A   G  144.31      .    synonymous GT:AD:DP:GQ:PL 0/0:17,0:17:51:0,51,576 0/1:15,8:23:99:178,0,627
34  NW_007907101.1 6054393 LYST   A   G  487.29      .    synonymous GT:AD:DP:GQ:PL 1/1:0,17:17:51:534,51,0    0/0:5,0:5:15:0,15,179
337 NW_007907101.1 6075809 LYST   T   C   46.93      . nonsynonymous GT:AD:DP:GQ:PL      0/0:1,0:1:3:0,3,43     0/0:4,0:4:12:0,12,99
372 NW_007907101.1 6077520 LYST   G   A  205.31      .    synonymous GT:AD:DP:GQ:PL   0/0:8,0:8:24:0,24,247  0/0:13,0:13:36:0,36,540
373 NW_007907101.1 6077526 LYST   A   G 1326.86      .    synonymous GT:AD:DP:GQ:PL   1/1:0,8:8:24:209,24,0  1/1:0,14:14:42:525,42,0

To see the different gene IDs in this data set, try the unique() function:

unique(my.data$ID)

You can run this function on any vector, and it will get the total number of different possible values that could be in that vector. Try it on the INFO column as well, to see the possible annotation categories.

Part Two: Identify Fixed and Polymorphic Sites

Now that we've got the data table into a much more manageable format, and we've gone through and picked out the nonsynonymous and synonymous sites, the next key ingredient in the MK test is to identify Fixed and Polymorphic sites:

  • Fixed: Sites where individuals within each species all have the same allele, but individuals from different species have different alleles.

  • Polymorphic: Sites where more than 1 allele is found within either one (or both) of the species.

In today's data, we have 10 individuals: 7 of them are polar bears (Ursus maritimus), 2 of them are grizzly bears (Ursus arctos horribilis), and 1 is an American black bear (Ursus americanus). Since we are interested in selection on genes that may be important specifically in polar bear evolution, we will treat the grizzlies and the black bear as a single out-group. See the schematic below:

bear_tree.png

Select the ID and INFO columns from my.data, and put them into a new data frame of their own.

head(new)
  ID          INFO
16  LYST    synonymous
17  LYST    synonymous
34  LYST    synonymous
337 LYST nonsynonymous
372 LYST    synonymous
373 LYST    synonymous

Add at least 1 new column to this table, to hold the "Fixed" or "Polymorphic" classification for each SNP site (and feel free to create extra columns that can help you do this classification!).

Hint: Remeber from the FST exercises that we can find column names corresponding to a particular pattern (such as a species name) with grep, and extract those columns with the [] notation:

data.w.pattern = my.data[, grep("some pattern", colnames(my.data))]

There are several different ways you can determine if SNPs are fixed or polymorphic. I'll leave it up to you to figure out your own strategy, but to help you get started, here are some things to remember:

  • We have functions and a lot of previous code that calculates allele frequencies or genotype counts, and either of those values would be useful here,
  • The reference genome I aligned to was from another Polar Bear
  • Earlier in this exercise, we also saw the unique function, which could possibly be of some use if you wanted to try and different strategy
  • During the first week of LD exercises, we were trying to find haplotype patterns, we saw some ways we could use R's arithmetic capabilities to our advantage, which is also something to consider here.

Here are the first few results of my table:

head(new)
  ID             INFO            SiteClass
16  LYST    synonymous           Polymorphic
17  LYST    synonymous           Polymorphic
34  LYST    synonymous           Polymorphic
337 LYST nonsynonymous           Polymorphic
372 LYST    synonymous           Polymorphic
373 LYST    synonymous           Fixed

Note: You should not feel obligated to use the words "Fixed" and "Polymorphic" to assign your SNP classes; if you want to use numeric classifications, that is fine. If you do want to use words, here is a hint to how I did that:

x = A numeric vector from a calculation I used to help me classify sites
new$SiteClass = x # I start by just assigning this numeric vector to my new table
new$SiteClass[which(x == 1)] = "Fixed" # Then I change values where x=1 to be called Fixed
new$SiteClass[which(x != 1)] = "Polymorphic" # x values not equal to 1 get called "Polymorphic"

Part Three: Perform the MK Test

After we've classified each of our SNPs as either Synonymous or Nonsynonymous, and as either Fixed or Polymorphic, it's time to get the count of SNPs in each of the following categories (DO THIS SEPARATELY FOR EACH GENE!!):

  • Fn = The number of Fixed Nonsynonymous sites
  • Fs = The number of Fixed Synonymous sites
  • Pn = The number of Polymorphic Nonsynonymous sites
  • Ps = The number of Polymorphic Synonymous sites

Here is what my empty results table looks like to start with:

my.results
    Genes Fn Fs Pn Ps
1  LYST  0  0  0  0
2  APOB  0  0  0  0

For getting the counts of each category, there are several functions we've used in past exercises that will come in handy (such as subset or which). There is also another function, that we haven't talked about yet, called intersect(), which you may be interested in:

x = seq(1,10)
y = seq(5,15)
intersect(x,y)
[1]  5  6  7  8  9 10

Here are my final counts so you can check your calculations:

my.results
    Genes Fn Fs Pn Ps
1  LYST 14  5 19 27
2  APOB 17  4 26 29

For each gene, calculate the Nonsynonymous/Synonymous Ratio Between Species (=Fn/Fs), and Within Species (=Pn/Ps):

my.results
    Genes Fn Fs  Pn  Ps   BetweenN_S         WithinN_S
1  LYST   14  5  19  27            2.80         0.7037037
2  APOB  17  4  26  29            4.25         0.8965517

Let's also calculate the Neutrality Index (NI) for each gene.

LaTeX: NI\:=\:\frac{\left(\frac{P_N}{P_S}\right)}{\left(\frac{F_N}{F_S}\right)}

my.results
    Genes    Fn   Fs  Pn  Ps      BetweenN_S      WithinN_S        NI
1  LYST      14    5  19  27      2.80             0.7037037   0.2513228
2  APOB     17    4  26  29       4.25             0.8965517   0.2109533

The Neutrality Index is useful because we know that values < 1 can be indicative of positive selection, while values >1 may be caused by negative or balancing selection.

Test for Significance with Fisher's Exact Test

To get a p-value associate with each of our genes (to see if they significantly differ from neutral expectations), we'll use the Fisher's Exact Test, which we used all the way back during Hardy-Weinberg week.

To do Fisher's Test for each gene, you need your data into a matrix like this:

Fn FS
Pn PS

Then you can run the command:

fisher.test(YOUR MATRIX)

To extract the p-value, we do:

fisher.test(YOUR MATRIX)$p.value

Plot the Results

Even though the key information comes from checking the p-values and the Neutrality Index, it can also be helpful to plot some of the values to highlight especially significant genes.

Below, I'll create a bar plot with the N/S ratios for Between and Within Species in each gene, and then I'll add some indicators of the p-values as well (using a very non-sophisticated technique):

Step 1: Set up a matrix (required by barplot) of the values I want to plot:

for.plot=matrix(c(my.results$BetweenN_S, my.results$WithinN_S), ncol=2)
colnames(for.plot)=c("Between", "Within")
rownames(for.plot)=my.results$Genes

for.plot
     Between    Within
LYST    2.80 0.7037037
APOB    4.25 0.8965517

Step 2: Plot the Between and Within Proportions for each Gene

barplot(t(for.plot), beside=T,
    ylab="Ratio of Nonsynonymous to Synonymous Sites",
    xlab="Gene", col=c("darkgreen", "dodgerblue"),
    ylim=c(0,6))

Step 3: Add a Legend:

legend("topleft", c("Between Species (F)", "Within Species (P)"),
    pch=15, col=c("darkgreen", "dodgerblue"), bty="n")

Step 4: Add "Brackets" over the bars to show exactly which difference the p-values are signifying (i.e. the diff. in ratios within each gene, not between genes):

lines(c(1.5,2.5), c(3,3))
lines(c(1.5,1.5), c(3,2.9))
lines(c(2.5,2.5), c(3,2.9))

lines(c(4.5,5.5), c(4.5,4.5))
lines(c(4.5,4.5), c(4.5,4.4))
lines(c(5.5,5.5), c(4.5,4.4))

Step 5: Write p-values on the graph:

temp=rep("p =", 2)
p.labels=paste(temp, round(my.results$pValues, 2))
text(c(2,5), c(3.1,4.6), p.labels)

mk_plot.png

Both of these genes appear to be under positive selection in polar bears. To see more about what these genes may actually be doing, you can look them up in NCBI:

LYST https://www.ncbi.nlm.nih.gov/gene/1130

apoB https://www.ncbi.nlm.nih.gov/gene/338

 

R code solution for MK Test Exercises