Working with SNP Data in VCF Format

Reading in a VCF file

Before we can do anything with our data file, we have to get it into R. Since R has several built-in functions for reading different types of files, this isn't too hard, but there are a couple of aspects about VCF files that will require a special function.

To start with, I like to set my working environment to be within the folder where I keep the input data and other files for the course. This saves me the trouble of having to constantly type out file paths to find what I need:

> setwd("~/Desktop/Teaching/CompGenWS-GEN8900/")

The ~ symbol works as a place holder for /Users/lizcooper (my home directory). This symbol should work on any Mac or Linux machine, but on Windows you may need to type the full path to your home directory.

Once your working directory is set, let's first try to read in the file just using R's built-in read.table function, since we know now that VCF files are just tab-delimited text files:

>"SampleData/sampleData_wk2.vcf", header=TRUE, stringsAsFactors=FALSE)

Here, we give the command the name of the file we want to read in, we tell it that our file does have a HEADER row, and we say that we don't want strings to be treated as factors. This didn't return an error, but let's look at our file and see if it was read correctly:

> head(, n=5)
X1 X32306701 S1_32306701 G A X34 PASS . GT.AD.DP.GQ.PL X1.  X0.
1  1  34699267 S1_34699267 A G  20 PASS . GT:AD:DP:GQ:PL  1/1:0,17:17:12:15,12,0             ./.:.:.:.:.
2  1  80468344 S1_80468344 G A  26 PASS . GT:AD:DP:GQ:PL    0/0:18,4:22:7:7,20,0 0/1:11,10:21:11:11,12,0
3  2  15392497 S2_15392497 T C  34 PASS . GT:AD:DP:GQ:PL  0/0:22,2:24:19:19,27,0  0/0:23,4:27:10:10,10,0
4  2  15528730 S2_15528730 G C  29 PASS . GT:AD:DP:GQ:PL    0/0:15,2:17:1:23,1,0  0/0:13,1:14:18:39,18,0
5  3  41136420 S3_41136420 A G  27 PASS . GT:AD:DP:GQ:PL     1/1:2,24:26:1:1,9,0  0/1:12,7:19:10:10,12,0

head(...) is a command that lets you look at just the first few lines (in this case 5) of a file or data object (this is another command that is the same as in bash scripting). We can see that the data mostly look okay, but something weird is going on with our first row: instead of reading in the real column headers (that start with #CHROM), R has skipped that line because it sees # as the "comment" character, and it has mistakenly made the first row of the data into a header.

To see if we can get around this, we can try specifying that ## is the comment character instead of #:

>"SampleData/sampleData_wk2.vcf", header=TRUE, stringsAsFactors=FALSE, comment.char="##")
Error in read.table("SampleData/sampleData_wk2.vcf", header = TRUE, stringsAsFactors = FALSE,  : 
invalid 'comment.char' argument

Now we actually get an ERROR message saying that we tried an invalid comment character. This is because it turns out R can only take values of length=1 for this argument. So, it's time to write our own function to get around this issue and get the VCF read in correctly.

For our read.vcf function, we need to do the following things:

  1. Define a special search term that can match lines starting with "##" and followed by anything.
  2. Read in ALL of the lines from a file without skipping any, and then manually remove lines that match our "##" search term.
  3. Remove the single "#" from the header line so that R won't try to skip it later.
  4. Send our cleaned up lines back into the read.table function and let R proceed as usual.

Here is the new function:

read.vcf = function(file, special.char="##", ...) {, ".*")
    clean.lines=gsub(, "",  all.lines)
    clean.lines=gsub("#CHROM", "CHROM", clean.lines)
    read.table(..., text=paste(clean.lines, collapse="\n"))

The paste function that appears in two of the lines is doing exactly what it sounds like: it is "pasting" together any variables that we list into one string. In the first call, we use paste0 to paste the ## special character term and the .* pattern into a single regular expression: "##.*". This expression will find and match all lines starting with "##" followed by anything; we use gsub in the function to find and replace these lines with nothing (""), thereby deleting them. The second instance of paste is using the newline character "\n" as the separator for all of our "clean" lines; this is essentially putting the lines back into a text file, and then this is sent to read.table

Now, let's use our new funciton to read in the VCF file one final time:

>"SampleData/sampleData_wk2.vcf", header=TRUE, stringsAsFactors=FALSE)
> head(, n=5)
CHROM      POS          ID REF ALT QUAL FILTER INFO         FORMAT                 X16096                  X18238
1     1 32306701 S1_32306701   G   A   34   PASS    . GT:AD:DP:GQ:PL 1/1:4,14:18:14:22,14,0   0/1:9,3:12:13:23,13,0
2     1 34699267 S1_34699267   A   G   20   PASS    . GT:AD:DP:GQ:PL 1/1:0,17:17:12:15,12,0             ./.:.:.:.:.
3     1 80468344 S1_80468344   G   A   26   PASS    . GT:AD:DP:GQ:PL   0/0:18,4:22:7:7,20,0 0/1:11,10:21:11:11,12,0
4     2 15392497 S2_15392497   T   C   34   PASS    . GT:AD:DP:GQ:PL 0/0:22,2:24:19:19,27,0  0/0:23,4:27:10:10,10,0
5     2 15528730 S2_15528730   G   C   29   PASS    . GT:AD:DP:GQ:PL   0/0:15,2:17:1:23,1,0  0/0:13,1:14:18:39,18,0


Grep and Gsub Expressions with VCF files

Below is a table to help you find the best pattern matching expressions to use in your code today:

Pattern Description
"0/1" Match any occurence of the pattern "0/1" in any word within a vector
"^0/1" Match only occurences of the pattern "0/1" at the beginning of a word
"^(0|1)" Match any occurence of "0" OR "1" at the beginning of a word
"0/1.*" Match any occurence of of "0/1" followed by ANYTHING
"\\./\\." Matches the "./." pattern of missing data (by escaping the "." character

Note one slight difference in the behavior of grep and gsub: If I use grep("0/1",[1,]) on my input vcf file, it will behave the same as grep("0/1.*",[1,]) because they are both finding strings with any instance of the pattern "0/1" and returning the index of whenever that pattern occurred:

> grep("0/1",[1,])
[1] 11 14
> grep("0/1.*",[1,])
[1] 11 14

But if I do the same thing with gsub the results will differ; in the first case gsub will ONLY replace the "0/1" pattern, in the second case it replaces the whole string where that pattern was found:

> gsub("0/1", "A",[1,])
[1] "1"                      "32306701"               "S1_32306701"            "G"                     
[5] "A"                      "34"                     "PASS"                   "."                     
[9] "GT:AD:DP:GQ:PL"         "1/1:4,14:18:14:22,14,0" "A:9,3:12:13:23,13,0"    "0/0:12,1:13:1:1,6,0"   
[13] "0/0:21,1:22:1:1,20,0"   "A:19,5:24:1:18,1,0"     "0/0:16,3:19:10:10,15,0"

> gsub("0/1.*", "A",[1,])
[1] "1"                      "32306701"               "S1_32306701"            "G"                     
[5] "A"                      "34"                     "PASS"                   "."                     
[9] "GT:AD:DP:GQ:PL"         "1/1:4,14:18:14:22,14,0" "A"                      "0/0:12,1:13:1:1,6,0"   
[13] "0/0:21,1:22:1:1,20,0"   "A"                      "0/0:16,3:19:10:10,15,0"

Additional (Optional) R commands for today's exercises:

You should be able to complete all of the exercises for today using the indexing, grep, gsub, and length commands that we learned about last week. But, below are a few other commands that you might find helpful, if you prefer a different approach to the problem:

Splitting a Single String/Word into a Vector with strsplit()

Since the format for each sample string is fairly convoluted in a VCF file (e.g. "1/1:4,14:18:14:22,14,0"), you might be interested in separating out the different pieces of information in this string. You can do this with a command called strsplit, which will separate any word based on a given "delimiter":

> test
[1] "1/1:4,14:18:14:22,14,0"
> strsplit(test, split=":")
[1] "1/1"     "4,14"    "18"      "14"      "22,14,0"

strsplit returns a list object; in the above example the list is of length 1, and the first item in the list is a vector of length 5. To get the first list item, I would use the `[[...]]` notation like this:

> x=strsplit(test, split=":")
> x
[1] "1/1"     "4,14"    "18"      "14"      "22,14,0"

> x[[1]]
[1] "1/1"     "4,14"    "18"      "14"      "22,14,0"

To get the first vector item from the first list item, I would index it like this:

> x[[1]][1]
[1] "1/1"

You can run strsplit on a vector instead of a single string (in this case I'll use the first 5 genotypes from column 10), and it will return a list like this:

> x=strsplit([1:5,10], split=":") 
> x
[1] "1/1" "4,14" "18" "14" "22,14,0"
[1] "1/1"     "0,17"    "17"      "12"      "15,12,0"

[1] "0/0"    "18,4"   "22"     "7"      "7,20,0"

[1] "0/0"     "22,2"    "24"      "19"      "19,27,0"

[1] "0/0"    "15,2"   "17"     "1"      "23,1,0"

So, to get the genotypes, I would want the first vector item from each element in this list, which means I would need to write a loop like this:

> for (i in 1:length(x)) {
+ print(x[[i]][1])
+ }
[1] "1/1"
[1] "1/1"
[1] "0/0"
[1] "0/0"
[1] "0/0"

OR, there is also an advanced R trick using an apply function that will actually do the same things as the loop, except all in one line. Don't worry if you don't understand or don't want to use the apply functions at this point, I am just showing you so that you know:

> sapply(x, `[[`, 1)
[1] "1/1" "1/1" "0/0" "0/0" "0/0"

One final thing: if you do decide to use strsplit, you no longer need to use grep to match the "0/1" or other genotype patterns; you can now match them exactly like this:

> y = sapply(x, `[[`, 1)
> y
[1] "1/1" "1/1" "0/0" "0/0" "0/0"
> which(y == "0/0")
[1] 3 4 5

Getting Counts with table()

Since the exercises today are about counting the different genotypes, you may be interested in a function called table which can get counts automatically:

> table(y)
0/0 1/1 
3   2

The results from table are a named vector, which can be indexed the same way as a normal vector, OR can be indexed by name as well:

> z = table(y)
> z[1]
> z["0/0"]

By default, table will only return counts for variables that it finds, and it will NOT return 0 counts. To force it to count all of the possible genotypes, you can do this:

> table(factor(y, levels=c("0/0", "0/1", "1/1", "./.")))

0/0 0/1 1/1 ./. 
3   0   2   0 


1.) Get the COUNTS for each Genotype (AA, Aa, aa, and NN) at each position.

AA = Homozygous Reference, Aa = Heterozygous, aa = Homozygous Alternate, NN = Missing

The goal is to create a table that looks like this:

AA Aa aa NN
## ## ## ##
## ## ## ##
... ... ... ...

where you will have 1 row for each position in the input VCF file. To get you started, here is some Pseudocode for how I would break down the problem:

create an empty table to hold the results

for (i in 1:nrow( {

       get all of the Sample Strings from the row (i.e. without the CHROM,POS, ID, columns etc.)
       find the instances of each genotype code in the samples
       count the instances and save the results in the table

Hint: you may want to start by extracting just the first row of the file to test out your commands before trying to run the whole loop:

> test.row = as.vector([1,], mode="character")

2.) Get the Percent Heterozygosity for EACH INDIVIDUAL in the file.

The data for this exercise actually come from a cross of 2 inbred plants (like Mendel's experiments), where 2 individuals are the inbred parents (and should be completely homozygous at all sites), 1 individual comes from the F1 generation (and should be completely heterozygous), and the remaining 3 individuals come from the F2 generation (partially heterozygous).

To figure out which individual is which, calculate the overall heterozygosity across all sites for each sample, and see if you can tell what generation each belongs to.

3.) Convert to a new file format.

Often, to run a particular analysis in an outside program, you will need to convert from the VCF format to whatever format that program wants. In this exercise, I want you to convert this file to a table of the actual diploid nucleotide calls for each sample, instead of the binary 0/1 calls given in VCF.

For example, in row 1, where the REF alleles is "G," and the ALT allele is "A," any samples with "0/0" should be changed to "GG," "0/1" should become "GA," and "1/1" should be "AA." You can change missing data to "NN."

So, your final table will look like this:

Sample1 Sample2 Sample3 Sample4 Sample5 Sample6
.... ... ... ... ... ...

with one row per position.

Hint: to create the diploid genotype strings from the REF and ALT columns, you can use paste0:

> homREF=paste0($REF,$REF)
> homREF[1:5]
[1] "GG" "AA" "GG" "TT" "GG"
R code solution for VCF Exercises