An Example of Running R on the Cluster

R script to run on Palmetto

This R script is written to do the following:

  • read in a phased version of the Heliconius butterfly data from last week,
  • calculate linkage disequilibrium (r2) between a SNP that appears to be under selection and ALL other SNPs in the data,
  • calculate linkage disequilibrium between a SNP that does NOT appear to be under selection and ALL other SNPs in the data,
  • plot the mean r2 in 10kb bins for both the selected SNP (red) and the control SNP (blue), to see if there is a larger linkage block around the selected SNP.

Setting Explicit Paths for Package Installation and Loading

.libPaths(c('~/compGenWS_112017/Rlibs', .libPaths()))
install.packages("devtools",repos='http://cran.us.r-project.org',
                  lib='~/compGenWS_112017/Rlibs', dependencies=TRUE)

Because we have limited permissions when we are logged onto Palmetto, we need to be very clear about where we tell R to install and look for packages (if we use any outside packages in our script).
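Because the install.packages call above runs every time the script does, one option (a sketch, assuming the same Rlibs path as above) is to guard it so the install only happens when devtools is missing:

# Point R at our personal library first
.libPaths(c('~/compGenWS_112017/Rlibs', .libPaths()))

# Install devtools only if it is not already in the library
if (!requireNamespace("devtools", quietly=TRUE)) {
    install.packages("devtools", repos='http://cran.us.r-project.org',
                     lib='~/compGenWS_112017/Rlibs', dependencies=TRUE)
}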

  • Let the input and output file names, as well as the 2 SNP positions to look at, be COMMAND LINE ARGUMENTS instead of hard-coded within the R script:

    args = commandArgs(trailingOnly=TRUE)
    input = args[1]
    output = args[2]
    snp1 = as.numeric(args[3])
    snp2 = as.numeric(args[4])
    
    my.data=read.vcf(input, header=TRUE)
    snp.of.interest = my.data[which(my.data$POS==snp1),]
    control.snp = my.data[which(my.data$POS==snp2),]
    

    This step isn't required for your script to run on the cluster, but it will make it much easier to run the same script repeatedly on different input files or different SNPs.
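    One nice side effect is that the script can check it actually got the arguments it expects. A small sketch (the usage message here is just illustrative):

    args = commandArgs(trailingOnly=TRUE)
    # Stop early with a helpful message if any argument is missing
    if (length(args) != 4) {
        stop("Usage: Rscript 11_forCluster.R <input.vcf> <output_prefix> <snp1_pos> <snp2_pos>")
    }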

  • Use the write.table command to save our results table into a .txt file we can access later if we want to:

    write.table(smoothed.results, paste0(c(output, "txt"), collapse="."), quote=FALSE,
                row.names=FALSE, col.names=TRUE, sep="\t")
    

    This also is not required for running on the cluster, but we need it if we want to access our results after running the script!
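    Later, the same table can be pulled back into R with read.table; a quick sketch, assuming the output prefix was heli_rsq_results:

    # Re-load the tab-separated results table saved by write.table
    smoothed.results = read.table("heli_rsq_results.txt", header=TRUE, sep="\t")
    head(smoothed.results)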

  • Open a PDF file for plotting:

    pdf(paste0(c(output, "pdf"), collapse="."))
    dev.off()
    

    The 2 commands above flank the usual plotting commands in our script: the first line opens a new PDF file (which will be named "output.pdf"), and the second line closes the connection to that file. You don't have to use PDF format on the cluster, but if you run a plot command in your script, you do have to open some kind of graphics file, because there is no screen for R to plot to!
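    Put together, the plotting section might look something like this sketch (the bin and r2 objects here are stand-ins for whatever the script actually computes):

    # Open the PDF file; every plot command until dev.off() draws into it
    pdf(paste0(c(output, "pdf"), collapse="."))
    plot(bin.midpoints, selected.r2, type="l", col="red",
         xlab="Position (bp)", ylab="Mean r2 in 10kb bins")
    lines(bin.midpoints, control.r2, col="blue")
    # Close the file so the plot is actually written to disk
    dev.off()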

  • Look at the PBS Job script that runs the R script

    In a cluster environment like Palmetto, we can use a "job" script that is essentially a list of instructions telling the cluster what we want to do. The first few lines of this script have some information particular to the cluster:

    #PBS -N rLD
    #PBS -l select=1:ncpus=1:mem=8gb,walltime=00:20:00
    #PBS -j oe
    

    These lines tell Palmetto things like what to name our job, and what sort of resources we need to run it.

    The next line loads the R software onto whatever node we end up assigned to:

    module load R/3.3.2
    

    Then, we will cd into the directory with our files:

    cd $HOME/compGenWS_112017
    

    And the last line actually runs the R script:

    Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results 700962 1531867
    

    The general format for command line R will ALWAYS be: Rscript --vanilla YOUR_SCRIPT.R followed by all of your command line arguments. Notice I have 4 arguments here, each separated by a space:

    1. the sample data file name
    2. a prefix for my output files
    3. a position for the first snp I want to look at
    4. a position for the second snp I want to look at

    Run the Script on the Cluster

    Log into your Palmetto account, either through the MobaXterm app (on Windows), or directly from the Terminal (on a Mac), replacing ecoope4 with your own username:

    ssh ecoope4@user.palmetto.clemson.edu
    

    Once logged in, clone the sample files and scripts for this week into your home directory:

    git clone https://github.com/eacooper400/compGenWS_112017.git
    

    Next, we will cd into the folder we just cloned, find the .pbs job script, and submit it:

    cd compGenWS_112017/
    cd Scripts/
    qsub run_rscript_1.pbs
    

    To check the progress of the script, you can use qstat -u followed by your username:

    qstat -u ecoope4
    

    In the next-to-last column, you should see an "R" if your script is running, or a "Q" if it is still queued. If nothing shows up, then the job has finished. This script will take about 5 full minutes to run the first time (because of installing devtools and all of its dependencies).

    When it is finished, you can see the output files here:

    cd ..
    ls -lh
    

    Try the less command to look at the ".txt" file. To look at the PDF results, you need to first download them (with scp or sftp from a Mac terminal, or with the file transfer panel in MobaXterm on Windows).

    Edit the PBS Job Script to Run on a New Set of Positions

    One of the reasons I set up the R script the way that I did with the command line arguments was to make it easy to repeatedly use it on different sets of SNP positions.

    First, I'm going to edit the 11_forCluster.R script (for example, by commenting out the install.packages line) so that it does not waste time re-installing devtools, since we've already got that installed now.

    Let's say that there were 2 other "control" positions, 42,071 and 337,985, that I wanted to compare my potentially selected SNP (at 700,962) to. I can easily edit my run_rscript_1.pbs script to do this:

    cd ~/compGenWS_112017/Scripts
    nano run_rscript_1.pbs
    

    While in "nano" (a text editor for the command line), let's copy and paste the line starting with "Rscript" 2 more times:

    Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results 700962 1531867
    Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results 700962 1531867
    Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results 700962 1531867
    

    Next, I edit my very last argument (which is the 2nd SNP position) to be different each time:

    Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results 700962 1531867
    Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results 700962 42071
    Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results 700962 337985
    

    Finally, I'll make sure to edit my output name/prefix, so that I don't accidentally overwrite the results from one run with the results from another:

    Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results 700962 1531867
    Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results2 700962 42071 
    Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results3 700962 337985
    

    To save this file, type Control-X, then type "y" when prompted to save. Give it a new file name, run_rscript_2.pbs, and hit ENTER; type "y" again when asked if you want to save under a different name. Now submit this new job:

    qsub run_rscript_2.pbs
    

    When this job finishes, you should now have 6 output files: 2 for each run of the script.

    Run the Script on Multiple Cores or Multiple Nodes

    There are a couple of other ways we could make this even more efficient. Let's open the run_rscript_2.pbs file with nano again, and add an & symbol to the end of each Rscript line. This sends each command to the background, so the next line can start running before the previous one finishes, and we can use multiple cores (without having to change how we write our R script at all). We also add a wait command at the end, so that the job doesn't exit before all of the background runs have finished, and we request more cores and memory in the #PBS header:

    cd Scripts/
    nano run_rscript_2.pbs
    
    #PBS -N rLD
    #PBS -l select=1:ncpus=4:mem=31gb,walltime=00:20:00
    #PBS -j oe
    
    module load R/3.3.2
    
    cd $HOME/compGenWS_112017
    
    Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results 700962 1531867 &
    Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results2 700962 42071 &
    Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results3 700962 337985 &
    
    wait
    
    Save the edited file under a new name, run_rscript_3.pbs (just as we did before), then submit it:

    qsub run_rscript_3.pbs
    

    The above job should take less than a minute to complete (vs. the previous job, which took about 1.5 minutes).

    Yet another option is to make a separate job script for each of the runs, and then we could submit them all simultaneously.

    I'll start by copying run_rscript_1.pbs 3 times:

    cp run_rscript_1.pbs run_rSim1.pbs
    cp run_rscript_1.pbs run_rSim2.pbs
    cp run_rscript_1.pbs run_rSim3.pbs
    

    The first script is fine as-is, so let's modify each of the other two to test the positions that we want, and to name the output files accordingly:

    nano run_rSim2.pbs
    
    Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results2 700962 42071
    
    nano run_rSim3.pbs
    
    Rscript --vanilla Scripts/11_forCluster.R SampleData/heli_phased.vcf heli_rsq_results3 700962 337985
    

    Here's a small loop to run qsub on all of them:

    for file in ./run_rSim*
    do
        qsub $file
    done
    

    You are actually allowed to submit up to 500 jobs at once on Palmetto! Since our R scripts only use 1 core each, using the & method is probably the most convenient: it lets you keep everything organized in 1 script, and it also speeds things up.

    But if you really have a lot of mini-jobs, OR if you've got an R script that can itself use multiple cores, then writing and submitting a separate .pbs script for each of them can be the more efficient way to go.