Before we can do anything with our data file, we have to get it into R. Since R has several built-in functions for reading different types of files, this isn't too hard, but there are a couple of aspects about VCF files that will require a special function.
To start with, I like to set my working environment to be within the folder where I keep the input data and other files for the course. This saves me the trouble of having to constantly type out file paths to find what I need:
> setwd("~/Desktop/Teaching/CompGenWS-GEN8900/")
The ~
symbol works as a place holder for /Users/lizcooper
(my home directory). This symbol should work on any Mac or Linux machine, but on Windows you may need to type the full path to your home directory.
Once your working directory is set, let's first try to read in the file just using R's built-in read.table
function, since we know now that VCF files are just tab-delimited text files:
> my.data=read.table("SampleData/sampleData_wk2.vcf", header=TRUE, stringsAsFactors=FALSE)
Here, we give the command the name of the file we want to read in, we tell it that our file does have a HEADER row, and we say that we don't want strings to be treated as factors. This didn't return an error, but let's look at our file and see if it was read correctly:
> head(my.data, n=5)
X1 X32306701 S1_32306701 G A X34 PASS . GT.AD.DP.GQ.PL X1.1.4.14.18.14.22.14.0 X0.1.9.3.12.13.23.13.0
1 1 34699267 S1_34699267 A G 20 PASS . GT:AD:DP:GQ:PL 1/1:0,17:17:12:15,12,0 ./.:.:.:.:.
2 1 80468344 S1_80468344 G A 26 PASS . GT:AD:DP:GQ:PL 0/0:18,4:22:7:7,20,0 0/1:11,10:21:11:11,12,0
3 2 15392497 S2_15392497 T C 34 PASS . GT:AD:DP:GQ:PL 0/0:22,2:24:19:19,27,0 0/0:23,4:27:10:10,10,0
4 2 15528730 S2_15528730 G C 29 PASS . GT:AD:DP:GQ:PL 0/0:15,2:17:1:23,1,0 0/0:13,1:14:18:39,18,0
5 3 41136420 S3_41136420 A G 27 PASS . GT:AD:DP:GQ:PL 1/1:2,24:26:1:1,9,0 0/1:12,7:19:10:10,12,0
head(...)
is a command that lets you look at just the first few lines (in this case 5) of a file or data object (this is another command that is the same as in bash scripting). We can see that the data mostly look okay, but something weird is going on with our first row: instead of reading in the real column headers (that start with #CHROM), R has skipped that line because it sees #
as the "comment" character, and it has mistakenly made the first row of the data into a header.
To see if we can get around this, we can try specifying that ##
is the comment character instead of #
:
> my.data=read.table("SampleData/sampleData_wk2.vcf", header=TRUE, stringsAsFactors=FALSE, comment.char="##")
Error in read.table("SampleData/sampleData_wk2.vcf", header = TRUE, stringsAsFactors = FALSE, :
invalid 'comment.char' argument
Now we actually get an ERROR message saying that we tried an invalid comment character. This is because it turns out R can only take values of length=1 for this argument. So, it's time to write our own function to get around this issue and get the VCF read in correctly.
For our read.vcf
function, we need to do the following things:
read.table
function and let R proceed as usual.Here is the new function:
read.vcf = function(file, special.char="##", ...) {
my.search.term=paste0(special.char, ".*")
all.lines=readLines(file)
clean.lines=gsub(my.search.term, "", all.lines)
clean.lines=gsub("#CHROM", "CHROM", clean.lines)
read.table(..., text=paste(clean.lines, collapse="\n"))
}
The paste
function that appears in two of the lines is doing exactly what it sounds like: it is "pasting" together any variables that we list into one string. In the first call, we use paste0
to paste the ##
special character term and the .*
pattern into a single regular expression: "##.*". This expression will find and match all lines starting with "##" followed by anything; we use gsub
in the function to find and replace these lines with nothing (""), thereby deleting them. The second instance of paste
is using the newline character "\n" as the separator for all of our "clean" lines; this is essentially putting the lines back into a text file, and then this is sent to read.table
Now, let's use our new funciton to read in the VCF file one final time:
> my.data=read.vcf("SampleData/sampleData_wk2.vcf", header=TRUE, stringsAsFactors=FALSE)
> head(my.data, n=5)
CHROM POS ID REF ALT QUAL FILTER INFO FORMAT X16096 X18238
1 1 32306701 S1_32306701 G A 34 PASS . GT:AD:DP:GQ:PL 1/1:4,14:18:14:22,14,0 0/1:9,3:12:13:23,13,0
2 1 34699267 S1_34699267 A G 20 PASS . GT:AD:DP:GQ:PL 1/1:0,17:17:12:15,12,0 ./.:.:.:.:.
3 1 80468344 S1_80468344 G A 26 PASS . GT:AD:DP:GQ:PL 0/0:18,4:22:7:7,20,0 0/1:11,10:21:11:11,12,0
4 2 15392497 S2_15392497 T C 34 PASS . GT:AD:DP:GQ:PL 0/0:22,2:24:19:19,27,0 0/0:23,4:27:10:10,10,0
5 2 15528730 S2_15528730 G C 29 PASS . GT:AD:DP:GQ:PL 0/0:15,2:17:1:23,1,0 0/0:13,1:14:18:39,18,0
Success!!
Below is a table to help you find the best pattern matching expressions to use in your code today:
Pattern | Description |
"0/1" | Match any occurence of the pattern "0/1" in any word within a vector |
"^0/1" | Match only occurences of the pattern "0/1" at the beginning of a word |
"^(0|1)" | Match any occurence of "0" OR "1" at the beginning of a word |
"0/1.*" | Match any occurence of of "0/1" followed by ANYTHING |
"\\./\\." | Matches the "./." pattern of missing data (by escaping the "." character |
Note one slight difference in the behavior of grep
and gsub
: If I use grep("0/1", my.data[1,])
on my input vcf file, it will behave the same as grep("0/1.*", my.data[1,])
because they are both finding strings with any instance of the pattern "0/1" and returning the index of whenever that pattern occurred:
> grep("0/1", my.data[1,])
[1] 11 14
> grep("0/1.*", my.data[1,])
[1] 11 14
But if I do the same thing with gsub
the results will differ; in the first case gsub
will ONLY replace the "0/1" pattern, in the second case it replaces the whole string where that pattern was found:
> gsub("0/1", "A", my.data[1,])
[1] "1" "32306701" "S1_32306701" "G"
[5] "A" "34" "PASS" "."
[9] "GT:AD:DP:GQ:PL" "1/1:4,14:18:14:22,14,0" "A:9,3:12:13:23,13,0" "0/0:12,1:13:1:1,6,0"
[13] "0/0:21,1:22:1:1,20,0" "A:19,5:24:1:18,1,0" "0/0:16,3:19:10:10,15,0"
> gsub("0/1.*", "A", my.data[1,])
[1] "1" "32306701" "S1_32306701" "G"
[5] "A" "34" "PASS" "."
[9] "GT:AD:DP:GQ:PL" "1/1:4,14:18:14:22,14,0" "A" "0/0:12,1:13:1:1,6,0"
[13] "0/0:21,1:22:1:1,20,0" "A" "0/0:16,3:19:10:10,15,0"
You should be able to complete all of the exercises for today using the indexing, grep
, gsub
, and length
commands that we learned about last week. But, below are a few other commands that you might find helpful, if you prefer a different approach to the problem:
Since the format for each sample string is fairly convoluted in a VCF file (e.g. "1/1:4,14:18:14:22,14,0"), you might be interested in separating out the different pieces of information in this string. You can do this with a command called strsplit
, which will separate any word based on a given "delimiter":
> test=my.data[1,10]
> test
[1] "1/1:4,14:18:14:22,14,0"
> strsplit(test, split=":")
[[1]]
[1] "1/1" "4,14" "18" "14" "22,14,0"
strsplit
returns a list object; in the above example the list is of length 1, and the first item in the list is a vector of length 5. To get the first list item, I would use the `[[...]]` notation like this:
> x=strsplit(test, split=":")
> x
[[1]]
[1] "1/1" "4,14" "18" "14" "22,14,0"
> x[[1]]
[1] "1/1" "4,14" "18" "14" "22,14,0"
To get the first vector item from the first list item, I would index it like this:
> x[[1]][1]
[1] "1/1"
You can run strsplit
on a vector instead of a single string (in this case I'll use the first 5 genotypes from column 10), and it will return a list like this:
> x=strsplit(my.data[1:5,10], split=":")
> x
[[1]]
[1] "1/1" "4,14" "18" "14" "22,14,0"
[[2]]
[1] "1/1" "0,17" "17" "12" "15,12,0"
[[3]]
[1] "0/0" "18,4" "22" "7" "7,20,0"
[[4]]
[1] "0/0" "22,2" "24" "19" "19,27,0"
[[5]]
[1] "0/0" "15,2" "17" "1" "23,1,0"
So, to get the genotypes, I would want the first vector item from each element in this list, which means I would need to write a loop like this:
> for (i in 1:length(x)) {
+ print(x[[i]][1])
+ }
[1] "1/1"
[1] "1/1"
[1] "0/0"
[1] "0/0"
[1] "0/0"
OR, there is also an advanced R trick using an apply
function that will actually do the same things as the loop, except all in one line. Don't worry if you don't understand or don't want to use the apply
functions at this point, I am just showing you so that you know:
> sapply(x, `[[`, 1)
[1] "1/1" "1/1" "0/0" "0/0" "0/0"
One final thing: if you do decide to use strsplit
, you no longer need to use grep
to match the "0/1" or other genotype patterns; you can now match them exactly like this:
> y = sapply(x, `[[`, 1)
> y
[1] "1/1" "1/1" "0/0" "0/0" "0/0"
> which(y == "0/0")
[1] 3 4 5
Since the exercises today are about counting the different genotypes, you may be interested in a function called table
which can get counts automatically:
> table(y)
y
0/0 1/1
3 2
The results from table
are a named vector, which can be indexed the same way as a normal vector, OR can be indexed by name as well:
> z = table(y)
> z[1]
0/0
3
> z["0/0"]
0/0
3
By default, table
will only return counts for variables that it finds, and it will NOT return 0 counts. To force it to count all of the possible genotypes, you can do this:
> table(factor(y, levels=c("0/0", "0/1", "1/1", "./.")))
0/0 0/1 1/1 ./.
3 0 2 0
AA = Homozygous Reference, Aa = Heterozygous, aa = Homozygous Alternate, NN = Missing
The goal is to create a table that looks like this:
AA | Aa | aa | NN |
## | ## | ## | ## |
## | ## | ## | ## |
... | ... | ... | ... |
where you will have 1 row for each position in the input VCF file. To get you started, here is some Pseudocode for how I would break down the problem:
create an empty table to hold the results
for (i in 1:nrow(my.data)) {
get all of the Sample Strings from the row (i.e. without the CHROM,POS, ID, columns etc.)
find the instances of each genotype code in the samples
count the instances and save the results in the table
}
Hint: you may want to start by extracting just the first row of the file to test out your commands before trying to run the whole loop:
> test.row = as.vector(my.data[1,], mode="character")
The data for this exercise actually come from a cross of 2 inbred plants (like Mendel's experiments), where 2 individuals are the inbred parents (and should be completely homozygous at all sites), 1 individual comes from the F1 generation (and should be completely heterozygous), and the remaining 3 individuals come from the F2 generation (partially heterozygous).
To figure out which individual is which, calculate the overall heterozygosity across all sites for each sample, and see if you can tell what generation each belongs to.
Often, to run a particular analysis in an outside program, you will need to convert from the VCF format to whatever format that program wants. In this exercise, I want you to convert this file to a table of the actual diploid nucleotide calls for each sample, instead of the binary 0/1 calls given in VCF.
For example, in row 1, where the REF alleles is "G," and the ALT allele is "A," any samples with "0/0" should be changed to "GG," "0/1" should become "GA," and "1/1" should be "AA." You can change missing data to "NN."
So, your final table will look like this:
Sample1 | Sample2 | Sample3 | Sample4 | Sample5 | Sample6 |
AA | GA | GG | GG | GA | GG |
.... | ... | ... | ... | ... | ... |
with one row per position.
Hint: to create the diploid genotype strings from the REF and ALT columns, you can use paste0
:
> homREF=paste0(my.data$REF, my.data$REF)
> homREF[1:5]
[1] "GG" "AA" "GG" "TT" "GG"
R code solution for VCF Exercises