High-throughput sequencing technology has made it possible to obtain large-scale genetic data sets for almost any organism, creating a need for computational tools and skill sets to process these data. While the bioinformatics workflows for processing raw data into SNPs are typically well delineated, the path for analyzing and interpreting the resulting SNP data set can be less clear. In this workshop, students learn about classical population genetics statistics that test the neutral theory of evolution, and then get hands-on experience writing their own R code to perform each analysis on a realistic sample SNP data set. Emphasis is placed on programming fundamentals and algorithm design, skills that extend beyond the specific calculations learned in class. At the end of the semester, each student completes an independent project that consists of running an analysis on their own, often using their own data, and presenting their findings to the class.

- Syllabus
- R-tutorial
- Overview of course structure
- Introduction to R syntax, data objects, indexing, and loops.
- Download Slides

- VCF Format Exercises
- Download Exercise Solution
- Basics of next-generation sequencing; standard raw-data pipelines to call SNPs and produce VCF files.
- Details of the VCF format, and key considerations when reading this format into R and manipulating it.
- Download Slides
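
As an illustration of the considerations mentioned above when reading VCF data into R, here is a minimal sketch using base R only. The toy VCF lines, the sample names (`ind1` to `ind3`), and the recoding scheme are assumptions made for this example. It also assumes biallelic, unphased genotypes.

```r
# Toy VCF content (hypothetical). With a real file you might instead use:
#   vcf_lines <- readLines("my_file.vcf")
vcf_lines <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind1\tind2\tind3",
  "chr1\t101\t.\tA\tG\t50\tPASS\t.\tGT\t0/0\t0/1\t1/1",
  "chr1\t250\t.\tC\tT\t60\tPASS\t.\tGT\t0/1\t./.\t0/0"
)

# Drop the "##" meta lines but keep the "#CHROM" column header line.
body <- vcf_lines[!grepl("^##", vcf_lines)]

# comment.char = "" stops R from treating "#CHROM" as a comment;
# colClasses = "character" stops alleles such as "T" being read as TRUE.
vcf <- read.table(text = body, sep = "\t", header = TRUE,
                  comment.char = "", quote = "",
                  check.names = FALSE, colClasses = "character")
vcf$POS <- as.integer(vcf$POS)

# Genotype columns start at column 10 in a standard VCF.
gt <- as.matrix(vcf[, 10:ncol(vcf)])
gt[] <- sub(":.*", "", gt)  # keep only the GT field if others follow

# Recode genotypes as counts of the alternate allele; "./." stays NA.
alt_count <- matrix(NA_integer_, nrow(gt), ncol(gt), dimnames = dimnames(gt))
alt_count[gt == "0/0"] <- 0L
alt_count[gt %in% c("0/1", "1/0")] <- 1L
alt_count[gt == "1/1"] <- 2L

print(alt_count)
```

Reading every column as character and converting `POS` afterwards is a deliberately cautious choice: it avoids silent type conversion of allele columns.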

- Hardy-Weinberg Exercises
- Download Exercise Solution
- Review principle of HWE; calculating observed and expected frequencies; assumptions.
- Using Fisher's Exact test to find statistically significant deviation.
- Download Slides
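
The Hardy-Weinberg calculations listed above can be sketched in a few lines of R. The genotype counts are hypothetical. Comparing observed and rounded expected counts in a 2x3 table with `fisher.test()` is an assumption about how the test might be applied here; dedicated exact tests for HWE use a different formulation.

```r
# Hypothetical observed genotype counts at one biallelic SNP.
obs <- c(AA = 30, Aa = 40, aa = 30)
n <- sum(obs)

# Allele frequencies estimated from the genotype counts.
p <- unname((2 * obs["AA"] + obs["Aa"]) / (2 * n))
q <- 1 - p

# Expected genotype counts under Hardy-Weinberg equilibrium.
expected <- c(AA = p^2, Aa = 2 * p * q, aa = q^2) * n

# Assumed testing scheme: observed vs. rounded expected counts in a 2x3 table.
tab <- rbind(observed = obs, expected = round(expected))
hwe_p <- fisher.test(tab)$p.value

print(tab)
print(hwe_p)
```

With these counts the allele frequency is 0.5, so the expected counts are 25, 50 and 25; the observed data show a deficit of heterozygotes relative to that expectation.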

- Wright's Fst Exercises
- Download Exercise Solution
- Discussion of population structure, what causes it, and how F_ST is used to measure it.
- Equations and assumptions for F_ST; effects of population size on genetic drift; relationship between F_ST and migration.
- Download Slides
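
To make the heterozygosity-based definition concrete, the sketch below computes Wright's fixation index for a single SNP in two populations as (H_T - H_S) / H_T. It assumes equal population sizes and known allele frequencies. The function name `fst_two_pops` and the frequencies are hypothetical. The final line uses the island-model approximation F_ST ≈ 1 / (4Nm + 1), which holds only under that model's assumptions.

```r
# F_ST for one biallelic SNP in two equally sized populations.
# p1, p2: frequency of the same allele in population 1 and 2.
fst_two_pops <- function(p1, p2) {
  # Mean expected heterozygosity within subpopulations.
  h_s <- mean(c(2 * p1 * (1 - p1), 2 * p2 * (1 - p2)))
  # Expected heterozygosity of the pooled (total) population.
  p_bar <- mean(c(p1, p2))
  h_t <- 2 * p_bar * (1 - p_bar)
  if (h_t == 0) return(NA_real_)  # monomorphic site: F_ST undefined
  (h_t - h_s) / h_t
}

fst <- fst_two_pops(0.2, 0.8)
print(fst)

# Island-model approximation: number of migrants per generation (Nm).
nm <- (1 / fst - 1) / 4
print(nm)
```

Identical allele frequencies in both populations give a value of zero, and more divergent frequencies push the value toward one.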

- Linkage Disequilibrium I Exercises
- Download Exercise Solution
- Definition of Linkage Disequilibrium and underlying causes.
- Estimators of LD, and using haplotypes versus inferred haplotypes.
- Download Slides
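
For the haplotype-based estimators mentioned above, a minimal sketch is given here. It assumes phased haplotypes coded as 0/1 at two biallelic sites. The data and the function name `ld_stats` are hypothetical. It computes D = p_AB - p_A * p_B and r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)).

```r
# Hypothetical phased haplotypes: one entry per chromosome, 1 = allele of interest.
site_a <- c(1, 1, 1, 0, 0, 0, 1, 0)
site_b <- c(1, 1, 0, 0, 0, 0, 1, 1)

ld_stats <- function(a, b) {
  p_a <- mean(a)                   # allele frequency at site A
  p_b <- mean(b)                   # allele frequency at site B
  p_ab <- mean(a == 1 & b == 1)    # frequency of the A-B haplotype
  d <- p_ab - p_a * p_b            # coefficient of linkage disequilibrium
  r2 <- d^2 / (p_a * (1 - p_a) * p_b * (1 - p_b))
  c(D = d, r2 = r2)
}

ld <- ld_stats(site_a, site_b)
print(ld)

# For 0/1-coded haplotypes, r^2 equals the squared Pearson correlation.
print(cor(site_a, site_b)^2)
```

When only unphased genotypes are available, the haplotype frequency `p_ab` cannot be counted directly and has to be inferred, which this sketch does not attempt.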

- Linkage Disequilibrium II Exercises
- Download Exercise Solution
- The interpretation and meaning of LD decay.
- Estimating the recombination rate (rho) based on the mathematical relationship between LD and recombination.
- Download Slides

- AFS Exercises
- Download Exercise Solution
- A brief introduction to coalescent theory.
- Neutral (coalescent) theory expectations of allele frequency distributions.
- Selective and demographic forces causing deviations from neutral expectations.
- Population mutation rates and Watterson's theta.
- Download Slides
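
The allele frequency spectrum and Watterson's estimator can be sketched as follows. The example assumes a sample of n chromosomes with no missing data and a known derived allele at each site. The counts and function names are hypothetical. Watterson's theta is S / a_n with a_n = sum of 1/i for i = 1 to n - 1, and the neutral expectation for the number of sites with derived count i is theta / i.

```r
n <- 5  # number of sampled chromosomes (hypothetical)

# Hypothetical derived-allele counts, one per segregating site.
derived_counts <- c(1, 1, 2, 1, 3, 1, 2, 4, 1, 1)

# Observed (unfolded) allele frequency spectrum: sites per derived count 1..n-1.
afs <- table(factor(derived_counts, levels = seq_len(n - 1)))

watterson_theta <- function(S, n) {
  a_n <- sum(1 / seq_len(n - 1))
  S / a_n
}

# Neutral expectation of the spectrum for a given theta.
expected_sfs <- function(theta, n) theta / seq_len(n - 1)

S <- length(derived_counts)
theta_w <- watterson_theta(S, n)

print(afs)
print(theta_w)
print(expected_sfs(theta_w, n))
```

Comparing `afs` with `expected_sfs(theta_w, n)` shows where the observed spectrum has more or fewer sites than the neutral expectation in a given frequency class.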

- Tajima's D Exercises
- Download Exercise Solution
- Testing for selection using the difference in estimates of theta with Tajima's D.
- Different types of selection predicted by negative vs. positive D.
- Download Slides
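
A sketch of the statistic itself, following the standard formula from Tajima (1989), is given below. It assumes no missing data and takes as input the derived (or minor) allele count at each segregating site. The function name and the example inputs are hypothetical.

```r
# derived_counts: allele count (between 1 and n - 1) at each segregating site.
# n: number of sampled chromosomes.
tajimas_d <- function(derived_counts, n) {
  S <- length(derived_counts)
  i <- seq_len(n - 1)
  a1 <- sum(1 / i)
  a2 <- sum(1 / i^2)

  # Nucleotide diversity: mean pairwise differences, summed over sites.
  pi_hat <- sum(2 * derived_counts * (n - derived_counts) / (n * (n - 1)))
  # Watterson's estimator of theta.
  theta_w <- S / a1

  # Constants for the variance of (pi_hat - theta_w).
  b1 <- (n + 1) / (3 * (n - 1))
  b2 <- 2 * (n^2 + n + 3) / (9 * n * (n - 1))
  c1 <- b1 - 1 / a1
  c2 <- b2 - (n + 2) / (a1 * n) + a2 / a1^2
  e1 <- c1 / a1
  e2 <- c2 / (a1^2 + a2)

  (pi_hat - theta_w) / sqrt(e1 * S + e2 * S * (S - 1))
}

# Excess of rare variants (all singletons) gives a negative D.
d_rare <- tajimas_d(rep(1, 10), n = 10)
# Excess of intermediate-frequency variants gives a positive D.
d_mid <- tajimas_d(rep(5, 10), n = 10)

print(c(rare = d_rare, intermediate = d_mid))
```

The two toy inputs are extreme cases chosen only to show the sign of D; real data sets contain a mixture of frequency classes.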

- McDonald-Kreitman Exercises
- Download Exercise Solution
- Review what positive natural selection is, and the theory underlying the MK test to find sites under selection.
- Discussion of gene annotation: what this means, and what kind of programs give you this information.
- Download Slides
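
The MK test compares nonsynonymous and synonymous changes that are fixed between species with those that are polymorphic within a species. The sketch below uses hypothetical counts and `fisher.test()` on the resulting 2x2 table. The neutrality index and alpha shown are common summaries and may not be the ones used in class.

```r
# Hypothetical counts for one gene (matrix() fills column by column).
mk <- matrix(c(20, 10,    # fixed:       nonsynonymous (Dn), synonymous (Ds)
               5, 20),    # polymorphic: nonsynonymous (Pn), synonymous (Ps)
             nrow = 2,
             dimnames = list(class = c("nonsynonymous", "synonymous"),
                             type  = c("fixed", "polymorphic")))

dn <- mk["nonsynonymous", "fixed"]
ds <- mk["synonymous", "fixed"]
pn <- mk["nonsynonymous", "polymorphic"]
ps <- mk["synonymous", "polymorphic"]

# Test whether the nonsynonymous:synonymous ratio differs between classes.
mk_p <- fisher.test(mk)$p.value

# Neutrality index: values below 1 suggest an excess of nonsynonymous divergence.
ni <- (pn / ps) / (dn / ds)
# Estimated proportion of nonsynonymous substitutions driven by positive selection.
alpha <- 1 - ni

print(mk)
print(c(p_value = mk_p, NI = ni, alpha = alpha))
```

Classifying each site as synonymous or nonsynonymous depends on the gene annotation, which is why annotation is discussed alongside this test.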

- Sliding Window Exercises
- Download Exercise Solution
- Comparing multiple statistics in a sliding window test
- Dealing with potentially missing data
- Download Slides
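
One way to sketch a sliding-window scan that tolerates missing values is shown here. The function name, window size, step, and data are hypothetical. Windows are half-open intervals [start, start + window), and sites with `NA` are skipped but reported through the `n_sites` column.

```r
# pos:  SNP positions (base pairs); stat: per-SNP statistic, may contain NA.
sliding_window_mean <- function(pos, stat, window, step) {
  starts <- seq(0, max(pos), by = step)
  out <- lapply(starts, function(s) {
    in_win <- pos >= s & pos < s + window
    vals <- stat[in_win]
    n_used <- sum(!is.na(vals))  # sites that actually contribute
    data.frame(
      start = s,
      end = s + window,
      n_sites = n_used,
      # Return NA rather than NaN when a window has no usable sites.
      mean_stat = if (n_used > 0) mean(vals, na.rm = TRUE) else NA_real_
    )
  })
  do.call(rbind, out)
}

pos  <- c(100, 250, 400, 900, 1200, 1800)
stat <- c(0.1, NA, 0.3, 0.5, NA, 0.7)

res <- sliding_window_mean(pos, stat, window = 1000, step = 500)
print(res)
```

To compare several statistics, the same function can be called once per statistic and the results joined on the `start` column. Keeping `n_sites` makes it easy to flag windows supported by very few sites.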

- Running R on the Cluster
- Download R Script
- Brief overview of the Linux command-line environment (navigating directories; creating, deleting, splitting, concatenating, and moving files).
- Cluster basics: logging in, transferring files, using interactive nodes and submitting job scripts.
- Running R in the cluster environment: installing packages and setting up scripts to use command-line arguments.
- Download Slides
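
As a sketch of a script that reads command-line arguments, base R's `commandArgs(trailingOnly = TRUE)` returns the arguments given after the script name. The script name `analysis.R`, the argument layout, and the default window size are hypothetical choices for this example.

```r
# Turn the raw argument vector into a named list of options.
parse_args <- function(args) {
  if (length(args) < 1) {
    stop("usage: Rscript analysis.R <input_file> [window_size]")
  }
  list(
    input = args[1],
    # Arguments arrive as character strings, so convert numbers explicitly.
    window = if (length(args) >= 2) as.numeric(args[2]) else 1000
  )
}

args <- commandArgs(trailingOnly = TRUE)

# Only act when arguments were supplied, so the file can also be source()'d.
if (length(args) > 0) {
  opts <- parse_args(args)
  cat("Input file:", opts$input, "\n")
  cat("Window size:", opts$window, "\n")
}
```

Such a script could be run from a job script or an interactive node as `Rscript analysis.R input.vcf 500`; how jobs are submitted depends on the scheduler used by the cluster.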