This R script is written to do the following:
Setting Explicit Paths for Package Installation and Loading
.libPaths(c('~/compGenWS_112017/Rlibs', .libPaths()))
install.packages("devtools",repos='http://cran.us.r-project.org',
lib='~/compGenWS_112017/Rlibs', dependencies=TRUE)
Because we have limited permissions when logged onto Palmetto, we need to be explicit about where we tell R to install and look for packages (if we use any outside packages in our script).
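One practical note (my addition, based on how install.packages() behaves, not something stated in the workshop materials): R will not create the target library directory for you, so it should exist before the job runs. It only needs to be created once, from the shell:

```shell
# install.packages() will fail with a "'lib' is not writable" style error if
# this directory doesn't exist yet (path taken from the R script above)
mkdir -p ~/compGenWS_112017/Rlibs
```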
Make the input and output file names, as well as the 2 SNP positions to look at, COMMAND LINE ARGUMENTS instead of hard-coding them within the R script:
args = commandArgs(trailingOnly=TRUE)
input = args[1]
output = args[2]
snp1 = as.numeric(args[3])
snp2 = as.numeric(args[4])
my.data=read.vcf(input, header=TRUE)
snp.of.interest = my.data[which(my.data$POS==snp1),]
control.snp = my.data[which(my.data$POS==snp2),]
This step isn't required for your script to run on the cluster, but it will make it much easier to run the same script repeatedly on different input files or different SNPs.
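One small safeguard worth considering (my suggestion, not part of the original script): check that all 4 arguments were actually supplied before using them, so a forgotten argument fails with a clear usage message instead of a confusing downstream error. A sketch, simulating the vector that commandArgs(trailingOnly=TRUE) would return on the cluster:

```r
# Simulated result of commandArgs(trailingOnly=TRUE); on the cluster, these
# values come from the Rscript command line instead
args <- c("SampleData/heli_phased.vcf", "heli_rsq_results", "700962", "1531867")

# Stop with a usage message if any of the 4 expected arguments is missing
if (length(args) < 4) {
  stop("Usage: Rscript 11_forCluster.R <input.vcf> <output_prefix> <snp1> <snp2>")
}
snp1 <- as.numeric(args[3])
snp2 <- as.numeric(args[4])
```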
Use the write.table command to save our results table into a .txt file we can access later if we want to:
write.table(smoothed.results, paste0(c(output, "txt"), collapse="."), quote=FALSE,
row.names=FALSE, col.names=TRUE, sep="\t")
This also is not required for running on the cluster, but we need it if we want to access our results after running the script!
Open a PDF file for plotting:
pdf(paste0(c(output, "pdf"), collapse="."))
dev.off()
The 2 commands above flank the usual plotting commands in our script; the first line opens up a new PDF file (which will be named "output.pdf"), and the second line closes the connection to that file. You don't have to use PDF format on the cluster, but you do have to open up some kind of graphics file if you run a plot command in your script!
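For example, a minimal self-contained version of that pattern looks like this (the placeholder plot and output path are mine, not the workshop's actual r-squared plot):

```r
# Open the device first; every plotting call after this draws into the file
out <- file.path(tempdir(), "example.pdf")  # placeholder path for illustration
pdf(out)
plot(1:10, (1:10)^2, type = "l", xlab = "Position", ylab = "r-squared")
dev.off()  # close the device -- the file is not complete until this runs
```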
In a cluster environment like Palmetto, we can use a "job" script that is essentially a list of instructions telling the cluster what we want to do. The first few lines of this script have some information particular to the cluster:
#PBS -N rLD
#PBS -l select=1:ncpus=1:mem=8gb,walltime=00:20:00
#PBS -j oe
These lines tell Palmetto things like what to name our job, and what sort of resources we need to run it.
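Line by line, those directives mean the following (the annotations are mine; see Palmetto's own documentation for the full list of options):

```
#PBS -N rLD                                          # job name (also used in the log file name, e.g. rLD.o<jobid>)
#PBS -l select=1:ncpus=1:mem=8gb,walltime=00:20:00   # 1 node, 1 core, 8 GB of memory, 20-minute time limit
#PBS -j oe                                           # join stderr (e) into the stdout (o) file, giving one log file
```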
The next line loads the R software onto whatever node we end up assigned to:
module load R/3.3.2
Then, we will cd into the directory with our files:
cd $HOME/compGenWS_112017
And the last line actually runs the R script:
Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results 700962 1531867
The general format for command line R will ALWAYS be: Rscript --vanilla YOUR_SCRIPT.R followed by all of your command line arguments. Notice I have 4 arguments here, each separated by a space.
Log into your Palmetto account, either through the Moba-Xterm app, or directly from the Terminal (on a Mac):
ssh ecoope4@user.palmetto.clemson.edu
Once logged in, clone the sample files and scripts for this week into your home directory:
git clone https://github.com/eacooper400/compGenWS_112017.git
Next, we will cd into the folder we just cloned, find the .pbs job script, and submit it:
cd compGenWS_112017/
cd Scripts/
qsub run_rscript_1.pbs
To check the progress of the script, you can use qstat -u followed by your username:
qstat -u ecoope4
In the next-to-last column, you should see an "R" if your script is "Running," or a "Q" if your script is still "Queued." If nothing shows up, then the job has finished. This script will take about 5 full minutes to run the first time (because of installing devtools and all of its dependencies).
When it is finished, you can see the output files here:
cd ..
ls -lh
Try the less command to look at the ".txt" file. To look at the PDF results, you need to first download them (with sftp, or whatever is recommended for Windows).
One of the reasons I set up the R script the way that I did with the command line arguments was to make it easy to repeatedly use it on different sets of SNP positions.
First, I'm going to edit the 11_forCluster.R script so that it does not waste time re-installing devtools, because we've already got that installed now.
Let's say that there were 2 other "control" positions, 42,071 and 337,985, that I wanted to compare my potentially selected SNP (at 700,962) to. I can easily edit my run_rscript_1.pbs script to do this:
cd ~/compGenWS_112017/Scripts
nano run_rscript_1.pbs
While in "nano" (a text editor for the command line), let's copy and paste the line starting with "Rscript" 2 more times:
Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results 700962 1531867
Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results 700962 1531867
Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results 700962 1531867
Next, I edit my very last argument (which is the 2nd SNP position) to be different each time:
Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results 700962 1531867
Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results 700962 42071
Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results 700962 337985
Finally, I'll make sure to edit my output name/prefix, so that I don't accidentally overwrite the results from one run with the results from another:
Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results 700962 1531867
Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results2 700962 42071
Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results3 700962 337985
To save this file, type Control-X, then when prompted to save, type the letter "y." Let's give it a new file name, run_rscript_2.pbs, and hit ENTER. Type "y" again when asked if you want to save under a new name. Now submit this new job:
qsub run_rscript_2.pbs
When this job finishes, you should now have 6 output files: 2 for each run of the script.
There are a couple of other ways we could make this even more efficient. Let's open the run_rscript_2.pbs file with nano again, and add an & symbol after each Rscript line except the last, plus a wait command at the end. The & lets the next line start running before the previous one finishes, so we can use multiple cores (without having to change how we write our R script at all). Notice that the #PBS resource line below now also requests 4 cores and more memory, instead of 1 core:
cd Scripts/
nano run_rscript_2.pbs
#PBS -N rLD
#PBS -l select=1:ncpus=4:mem=31gb,walltime=00:20:00
#PBS -j oe
module load R/3.3.2
cd $HOME/compGenWS_112017
Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results 700962 1531867 &
Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results2 700962 42071 &
Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results3 700962 337985
wait
As before, use Control-X to save, but give the edited file a new name, run_rscript_3.pbs, and then submit it:
qsub run_rscript_3.pbs
The above job should take less than a minute to complete (vs. the previous job, which took about 1.5 minutes).
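The & and wait behavior is plain shell job control, not anything PBS-specific; here is a toy sketch (my own, with sleep standing in for the R runs) of what it does:

```shell
# "&" puts a command in the background so the next line starts immediately;
# "wait" blocks until every background job has finished
(sleep 1; echo "run 1 done") &
(sleep 1; echo "run 2 done") &
echo "run 3 done"          # the last command runs in the foreground
wait                       # don't exit until runs 1 and 2 also finish
echo "all runs complete"
```

Because the two sleeps run concurrently, the whole sketch takes about one second instead of two; the same logic is what speeds up the three Rscript runs.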
Yet another option is to make a separate job script for each of the runs, and then we could submit them all simultaneously.
I'll start by copying run_rscript_1.pbs 3 times:
cp run_rscript_1.pbs run_rSim1.pbs
cp run_rscript_1.pbs run_rSim2.pbs
cp run_rscript_1.pbs run_rSim3.pbs
The first script is fine as is, so let's modify each of the other 2 to test the positions that we want, and to name the output files how we want:
nano run_rSim2.pbs
Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results2 700962 42071
nano run_rSim3.pbs
Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results3 700962 337985
Here's a small loop to run qsub on all of them:
for file in ./run_rSim*
do
qsub $file
done
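If you want to double-check which files the glob will match before actually submitting anything, a dry run with echo in front of qsub (my suggestion, using a scratch directory with dummy files) is a safe way to preview the loop:

```shell
# Dry run in a scratch directory: print the qsub commands without running them
demo=$(mktemp -d)
cd "$demo"
touch run_rSim1.pbs run_rSim2.pbs run_rSim3.pbs
for file in ./run_rSim*
do
    echo qsub "$file"
done
```

Once the printed commands look right, delete the echo and run the loop for real.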
You are actually allowed to submit up to 500 jobs at once on Palmetto! Since our R scripts only use 1 core each, using the & method is probably the most convenient - it lets you keep everything organized in 1 script, and also speeds things up.
But if you really have a lot of mini jobs, OR if you've got an R script that can use multiple cores, then writing and submitting a separate pbs script for each of them can be the way to go to get things done very efficiently.