Gen8900-compGen

Course Summary and Objectives:

High-throughput sequencing technology has made it possible to obtain large-scale genetic data sets for almost any organism, creating a need for computational tools and skill sets to process these data. While the bioinformatics workflows for processing raw data into SNPs are typically well delineated, the path for analyzing and interpreting the resulting SNP data set can be less clear. In this workshop, students learn about classical population genetics statistics that test the neutral theory of evolution, and then get hands-on experience writing their own R code to perform each analysis on a realistic sample SNP data set. Emphasis is placed on programming fundamentals and algorithm design: skills that extend beyond the specific calculations learned in class. At the end of the semester, each student completes an independent project that consists of running an analysis on their own, often using their own data, and presenting their findings to the class.

Syllabus
R-tutorial
Overview of course structure
Introduction to R syntax, data objects, indexing, and loops.
Download Slides
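
As a small illustration of the topics listed for this session, the sketch below shows a vector, a list, a matrix, 1-based indexing, and a for loop. All object names and values are made up for the example and are not taken from the course materials.

```r
# A numeric vector and a named list: two basic R data objects
depths <- c(12, 30, 7, 45)
sample_info <- list(id = "ind01", population = "north")

# Indexing is 1-based; a logical vector can also be used as an index
first_depth <- depths[1]
high_depths <- depths[depths > 10]

# A matrix is indexed as [row, column]; an empty slot means "all"
geno <- matrix(c(0, 1, 2, 1, 0, 0), nrow = 2, byrow = TRUE)
second_row <- geno[2, ]

# A for loop that accumulates a running total
total <- 0
for (d in depths) {
  total <- total + d
}

# List elements are accessed by name with $ or [[ ]]
pop <- sample_info$population
```

In practice, vectorized functions such as sum(depths) usually replace loops like this one; the loop is shown only to illustrate the syntax.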

VCF Format Exercises
Download Exercise Solution
Basics of next-gen sequencing; standard raw-data pipelines to call SNPs and produce VCF files.
Details of the VCF format, and key considerations when reading this format into R and manipulating it.
Download Slides
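
The sketch below shows one possible way to read a small VCF into base R and convert genotype strings to alternate-allele counts. The toy VCF lines, the sample names ind1 and ind2, and the helper gt_to_count are hypothetical. Two considerations are illustrated under these assumptions: the "#CHROM" header line must not be treated as a comment, and columns are read as character so that an allele such as "T" is not interpreted as a logical value.

```r
# A toy VCF held in memory (hypothetical data, tab-separated)
vcf_text <- c(
  "##fileformat=VCFv4.2",
  "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\tFORMAT\tind1\tind2",
  "chr1\t101\t.\tA\tG\t50\tPASS\t.\tGT\t0/1\t1/1",
  "chr1\t250\t.\tC\tT\t60\tPASS\t.\tGT\t0/0\t./."
)

# Drop the "##" meta lines but keep the "#CHROM" column header line
body <- vcf_text[!grepl("^##", vcf_text)]
body[1] <- sub("^#", "", body[1])

# Read everything as character, then convert POS explicitly
vcf <- read.table(text = body, header = TRUE, sep = "\t",
                  colClasses = "character", comment.char = "", quote = "")
vcf$POS <- as.integer(vcf$POS)

# Convert one GT string (e.g. "0/1" or "1|1") to an alternate-allele count;
# genotypes with a missing allele (".") become NA
gt_to_count <- function(gt) {
  alleles <- strsplit(gt, "[/|]")[[1]]
  if (any(alleles == ".")) return(NA_integer_)
  sum(as.integer(alleles))
}

# Apply the conversion to each sample column
alt_counts <- sapply(vcf[, c("ind1", "ind2")], function(col) {
  vapply(col, gt_to_count, integer(1), USE.NAMES = FALSE)
})
```

This sketch assumes the FORMAT field contains only GT. Real VCF files usually carry additional per-sample fields separated by ":", which would need to be split off first.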

Hardy-Weinberg Exercises
Download Exercise Solution
Review principle of HWE; calculating observed and expected frequencies; assumptions.
Using Fisher's exact test to detect statistically significant deviations from HWE.
Download Slides
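
A minimal sketch of the calculation, using hypothetical genotype counts. Comparing observed counts with rounded expected counts in fisher.test is an illustrative assumption made here; it is not necessarily how the course exercise applies the test, and dedicated exact tests for HWE also exist.

```r
# Hypothetical observed genotype counts at one biallelic SNP
obs <- c(AA = 30, Aa = 40, aa = 30)
n <- sum(obs)

# Allele frequencies estimated from the genotype counts
p <- (2 * obs[["AA"]] + obs[["Aa"]]) / (2 * n)
q <- 1 - p

# Expected genotype counts under Hardy-Weinberg equilibrium
expected <- c(AA = p^2, Aa = 2 * p * q, aa = q^2) * n

# Assumption: compare observed and (rounded) expected counts in a 2 x 3 table
tab <- rbind(observed = obs, expected = round(expected))
result <- fisher.test(tab)
p_value <- result$p.value
```

With these counts the allele frequency is 0.5, so the expected counts are 25, 50, and 25; the observed data show fewer heterozygotes than expected.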

Wright's Fst Exercises
Download Exercise Solution
Discussion of population structure, what causes it, and how F_ST is used to measure it.
Equations and assumptions for F_ST; effects of population size on genetic drift; relationship between F_ST and migration.
Download Slides
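
The sketch below computes F_ST at one biallelic SNP from subpopulation allele frequencies using the heterozygosity form F_ST = (H_T - H_S) / H_T, assuming equally sized subpopulations. The function name fst_from_freqs and the example frequencies are hypothetical. The last line applies the island-model approximation F_ST ≈ 1 / (4Nm + 1), which holds only under that model's assumptions.

```r
# F_ST from allele frequencies in equally sized subpopulations
fst_from_freqs <- function(p) {
  h_s <- mean(2 * p * (1 - p))      # mean within-subpopulation heterozygosity
  p_bar <- mean(p)                  # allele frequency in the pooled population
  h_t <- 2 * p_bar * (1 - p_bar)    # expected heterozygosity of the total
  (h_t - h_s) / h_t
}

# Two subpopulations with strongly diverged allele frequencies
fst <- fst_from_freqs(c(0.2, 0.8))  # 0.36

# Island-model approximation: number of migrants per generation (Nm)
nm <- (1 / fst - 1) / 4
```

If the allele is fixed or absent in every subpopulation, H_T is zero and the function returns NaN, so monomorphic sites should be filtered out beforehand.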

Linkage Disequilibrium I Exercises
Download Exercise Solution
Definition of Linkage Disequilibrium and underlying causes.
Estimators of LD, and using haplotypes versus inferred haplotypes.
Download Slides
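
When phased haplotypes are available, two common LD estimators can be computed directly: D = p_AB - p_A p_B and r^2 = D^2 / (p_A (1 - p_A) p_B (1 - p_B)). The sketch below uses eight hypothetical haplotypes coded 0/1 at two loci.

```r
# Hypothetical phased haplotypes: 1 = allele A (or B), 0 = allele a (or b)
locus1 <- c(1, 1, 0, 0, 1, 1, 0, 0)
locus2 <- c(1, 1, 0, 0, 0, 1, 0, 1)

p_a <- mean(locus1 == 1)                  # frequency of allele A
p_b <- mean(locus2 == 1)                  # frequency of allele B
p_ab <- mean(locus1 == 1 & locus2 == 1)   # frequency of the AB haplotype

# Coefficient of linkage disequilibrium and its squared correlation form
d <- p_ab - p_a * p_b
r2 <- d^2 / (p_a * (1 - p_a) * p_b * (1 - p_b))

# For 0/1 coded haplotypes, r^2 equals the squared Pearson correlation
r2_check <- cor(locus1, locus2)^2
```

With unphased genotype data the AB haplotype frequency cannot be counted directly and has to be inferred, which is why haplotype-based and genotype-based estimates can differ.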

Linkage Disequilibrium II Exercises
Download Exercise Solution
The interpretation and meaning of LD decay.
Estimating the recombination rate (rho) based on the mathematical relationship between LD and recombination.
Download Slides
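
One way to sketch the estimation step is to fit an expected decay curve to pairwise r^2 values. The example below assumes the commonly used approximation E[r^2] ≈ 1 / (1 + rho * d), where d is the distance in base pairs and rho is the population recombination rate per base pair. This formula is an assumption chosen for illustration and may differ from the one used in the course. The data are generated from the curve itself, with no noise, so the fit simply recovers the known value.

```r
# Hypothetical pairwise distances (bp) and noise-free r^2 values
dist_bp <- seq(100, 10000, by = 100)
true_rho <- 0.001
r2_obs <- 1 / (1 + true_rho * dist_bp)

# Sum of squared errors between observed and predicted r^2,
# parameterized on a log10 scale so the search covers several orders of magnitude
sse <- function(log10_rho) {
  predicted <- 1 / (1 + 10^log10_rho * dist_bp)
  sum((r2_obs - predicted)^2)
}

# One-dimensional minimization over rho between 1e-6 and 1
fit <- optimize(sse, interval = c(-6, 0))
rho_hat <- 10^fit$minimum
```

With real data the r^2 values are noisy and depend on sample size, so a finite-sample correction and some form of binning or weighting would normally be considered.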

AFS Exercises
Download Exercise Solution
A brief introduction to coalescent theory.
Neutral (coalescent) theory expectations of allele frequency distributions.
Selective and demographic forces causing deviations from neutrality.
Population mutation rates and Watterson's theta.
Download Slides
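
A short sketch of these quantities, using hypothetical derived-allele counts from a sample of n = 10 chromosomes. Watterson's estimator is theta_W = S / a_n with a_n = sum_{i=1}^{n-1} 1/i, and the standard neutral expectation for the unfolded spectrum is E[xi_i] = theta / i.

```r
# Hypothetical derived-allele counts at 10 segregating sites, n = 10 chromosomes
n <- 10
derived_counts <- c(1, 1, 1, 2, 2, 3, 5, 1, 4, 9)

# Observed unfolded allele frequency spectrum: sites with 1, 2, ..., n-1 copies
sfs <- tabulate(derived_counts, nbins = n - 1)

# Watterson's theta from the number of segregating sites
s <- length(derived_counts)
a_n <- sum(1 / (1:(n - 1)))
theta_w <- s / a_n

# Expected spectrum under the standard neutral model, given theta_w
expected_sfs <- theta_w / (1:(n - 1))
```

Comparing sfs with expected_sfs, for example in a bar plot, shows whether rare or intermediate-frequency alleles are over-represented relative to the neutral expectation.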

Tajima's D Exercises
Download Exercise Solution
Testing for selection using the difference in estimates of theta with Tajima's D.
Different types of selection predicted by negative vs. positive D.
Download Slides
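
The statistic can be sketched as a single function that takes allele counts at segregating sites. The function name tajimas_d and its inputs are hypothetical; the constants follow the standard definition of Tajima's D, and the function assumes every site passed in is segregating (count strictly between 0 and n).

```r
# Tajima's D from allele counts at segregating sites in a sample of n chromosomes
tajimas_d <- function(derived_counts, n) {
  s <- length(derived_counts)

  # Nucleotide diversity: average number of pairwise differences
  pi_hat <- sum(2 * derived_counts * (n - derived_counts)) / (n * (n - 1))

  # Constants that depend only on the sample size
  i <- 1:(n - 1)
  a1 <- sum(1 / i)
  a2 <- sum(1 / i^2)
  b1 <- (n + 1) / (3 * (n - 1))
  b2 <- 2 * (n^2 + n + 3) / (9 * n * (n - 1))
  c1 <- b1 - 1 / a1
  c2 <- b2 - (n + 2) / (a1 * n) + a2 / a1^2
  e1 <- c1 / a1
  e2 <- c2 / (a1^2 + a2)

  # Difference between the two theta estimates, scaled by its standard deviation
  (pi_hat - s / a1) / sqrt(e1 * s + e2 * s * (s - 1))
}

# An excess of rare alleles gives a negative D
d_rare <- tajimas_d(rep(1, 10), n = 10)

# An excess of intermediate-frequency alleles gives a positive D
d_intermediate <- tajimas_d(rep(5, 10), n = 10)
```

The two calls show the sign behaviour described above: singletons pull pi below Watterson's theta, while intermediate-frequency alleles push it above.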

McDonald-Kreitman Exercises
Download Exercise Solution
Review of what positive natural selection is and the theory underlying the McDonald-Kreitman (MK) test for detecting selection.
Discussion of gene annotation: what this means, and what kind of programs give you this information.
Download Slides
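
The MK test compares nonsynonymous and synonymous changes that are fixed between species with those that are polymorphic within a species. The sketch below uses illustrative counts and applies fisher.test to the 2 x 2 table; the choice of Fisher's exact test and the derived summaries (neutrality index and alpha) are assumptions about how the comparison might be carried out in R.

```r
# Illustrative counts: rows are site classes, columns are fixed vs. polymorphic
mk <- matrix(c(7, 17, 2, 42), nrow = 2,
             dimnames = list(class = c("nonsynonymous", "synonymous"),
                             type = c("fixed", "polymorphic")))

# Test whether the nonsynonymous:synonymous ratio differs between columns
mk_test <- fisher.test(mk)

# Neutrality index: (Pn / Ps) / (Dn / Ds); values below 1 suggest an excess
# of fixed nonsynonymous differences
ni <- (mk["nonsynonymous", "polymorphic"] / mk["synonymous", "polymorphic"]) /
      (mk["nonsynonymous", "fixed"] / mk["synonymous", "fixed"])

# Estimated proportion of nonsynonymous substitutions driven by positive selection
alpha <- 1 - ni
```

Filling in such a table for real data requires knowing which SNPs are synonymous and which are nonsynonymous, which is where gene annotation comes in.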

Sliding Window Exercises
Download Exercise Solution
Comparing multiple statistics in a sliding window test.
Dealing with potentially missing data.
Download Slides
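
A minimal sliding-window sketch with missing values. The positions, per-SNP statistic, window size, and step are all hypothetical. Each window records how many non-missing sites it contains, so that windows supported by few data points can be flagged or dropped later.

```r
# Hypothetical SNP positions (bp) and a per-SNP statistic with missing values
pos <- c(100, 450, 900, 1500, 2100, 2600, 3300, 3900)
stat <- c(0.10, NA, 0.30, 0.25, NA, 0.40, 0.05, 0.20)

window_size <- 2000
step <- 1000

# Overlapping windows [start, end) along the chromosome
starts <- seq(0, max(pos), by = step)
windows <- data.frame(start = starts, end = starts + window_size,
                      n_sites = NA_integer_, mean_stat = NA_real_)

for (i in seq_len(nrow(windows))) {
  in_win <- pos >= windows$start[i] & pos < windows$end[i]
  vals <- stat[in_win]

  # Count usable sites and average over them, ignoring missing values
  windows$n_sites[i] <- sum(!is.na(vals))
  windows$mean_stat[i] <- if (all(is.na(vals))) NA_real_ else mean(vals, na.rm = TRUE)
}
```

The same loop can hold several statistics by adding one column per statistic, which makes it straightforward to compare them window by window.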

Running R on the Cluster
Download R Script
Brief overview of the Linux command line environment (navigating directories; creating, deleting, splitting, concatenating, and moving files).
Cluster basics: logging in, transferring files, using interactive nodes and submitting job scripts.
Running R in the cluster environment: installing packages and setting up scripts to use command line arguments.
Download Slides
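
A sketch of a script that reads command line arguments with commandArgs. The script name analysis.R, the two arguments, and the helper parse_args are hypothetical; scheduler-specific job script details are not shown because they depend on the cluster.

```r
# Validate and convert the arguments passed after the script name
parse_args <- function(args) {
  if (length(args) != 2) {
    stop("usage: Rscript analysis.R <input_file> <window_size>")
  }
  window_size <- suppressWarnings(as.integer(args[2]))
  if (is.na(window_size) || window_size <= 0) {
    stop("window_size must be a positive integer")
  }
  list(input_file = args[1], window_size = window_size)
}

# trailingOnly = TRUE returns only the user-supplied arguments
args <- commandArgs(trailingOnly = TRUE)

# Only parse when arguments were actually supplied (e.g. via Rscript)
if (length(args) > 0) {
  opts <- parse_args(args)
  cat("Input file:", opts$input_file, "\n")
  cat("Window size:", opts$window_size, "\n")
}
```

Under these assumptions the script would be run as `Rscript analysis.R input.vcf 5000`, either on an interactive node or from within a job script.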

Useful External Links:

R-project website

RStudio

Pirate's Guide to R