In our first year, we get to rotate with three, sometimes four, labs to find a good fit. We then work on our PhD theses in the lab we decide upon for the next four years or more.
My first rotation, which just ended, was in a plant genomics and evolution lab. I opted for a computational project, which happily turned out to be a mini-crash course on coding and genomics workflows. A neat little thing I learnt to do was looking at sequencing coverage depth across the genome, which can be used as a preliminary screen for duplications and deletions, among other uses. Since it so cool, I am making a tutorial on how to do it.
Next generation sequencing protocols generally involve randomly breaking up the DNA into small fragments, sequencing the fragments (which are described as reads), and assembling overlapping reads into longer and longer sequences until you get the complete chromosome, genome etc. Sequencing can be done at high and low coverage, which refers to the average number of times each nucleotide in the genome is sequenced. For 20X coverage (which is fairly high depth), each nucleotide would be represented in 20 overlapping reads on average. If there is a duplication, you would suddenly have twice as many reads for a region. By aligning the reads to a reference genome, we can see differences in coverage across the genome. Other factors, such as GC content and repeat regions that are unrepresented in the reference genome can also affect coverage, so it is good to be careful when interpreting depth of sequence coverage across the genome.
NGS read data from most studies are deposited in NCBI’s SRA (Sequence Read Archive) database while reference genomes are available in their Nucleotide database. I thought it would be nice to kind of reconstruct part of a figure from a very cool experimental evolution study, where they showed that duplication of a gene called K3L helped a Vaccinia virus mutant adapt to counter a host innate immune protein over ten passages. Read the followup study as well for a nuanced interpretation of their results. Here is the figure from the study:
The first row is the genome of the parental Vaccinia virus where they deleted a gene called E3L (second triangle), while the three rows below are replicates of the virus evolved over ten passages. The peaks (first triangle) combined with the dotted lines which flank a gene called K3L provide evidence that the gene has undergone duplication(s) in each of the three replicates. Now, onto the actual tutorial!
We will be using Unix tools for aligning the reads and generating the coverage across the viral genome, and R to plot the graphs. If you are using Windows as I am, download Ubuntu here. Once installed, right click on Ubuntu and select “Run as administrator”. You should see a black window where you can type in commands. Download R Studio here.
Let’s first install the Unix tools (the parts starting with # are comments and do not need to be included):
apt-get install sra-toolkit #provides a tool to import the reads as fastq files
apt-get install bwa #provides tools to align the reads to the reference genome
apt-get install bedtools #provides a tool to make a table with the intervals of the genome we want to use for our plot
apt-get install samtools #provides tools for formatting the aligned files and creating a table for the number of reads per interval of the genome
Make a directory called poxgenomes:
Make this your working directory:
In the “Experimental Procedures” of the Elde at al. study, they mention that the sequence files are available in the SRA database under the accession SRP013416. Go to the SRA database webpage, and search for SRP013416. Note the SRR accession numbers after clicking on the parental and the three passage 10 replicates. Keep in mind that for any command, you can look at the documentation to understand it better using ‐‐help (the double dash here and in the two next boxes show up as two underscores if you’re on a smartphone). Try it:
Import them as separate fastq files for forward and reverse reads using fastq-dump from SRA tools (‐‐gzip compresses the files):
fastq-dump ‐‐origfmt -I ‐‐split-files ‐‐gzip SRR503478 #replicate 1
fastq-dump ‐‐origfmt -I ‐‐split-files ‐‐gzip SRR503479 #replicate 2
fastq-dump ‐‐origfmt -I ‐‐split-files ‐‐gzip SRR503480 #replicate 3
This will take ten to thirty minutes for each.
Elde et al. mentions using the Copenhagen genome from Goebel et al. as the reference genome. Search for the keywords goebel and copenhagen here to get the accession version number. Download the genome using these commands from Entrez direct and save it as copenhagen.fasta:
Make index files; these are required for alignment in the next step, don’t worry about these too much beyond that:
bwa index -a bwtsw copenhagen.fasta
Align the reads to the reference genome and save as sam files; this may take fifteen to thirty minutes for each, so you can find the coverage tables here if you want to quickly move on to plotting the data:
#all in one line
#repeat for replicates 1-3
The parts after RG are just following naming conventions. To run the actual command, notice the positions of the fasta and fastq files.
Make a bed file with the complete range we want to cover (as in the paper, we will take the region between the terminal repeats whose locations can be found here):
In the text file that opens up, type in (the gaps between are tabs):
Then press Ctrl-X and Y (not Ctrl-Y) to save and exit.
Make 500 bp windows within that range:
Process the sam files into the appropriate format for coverage analysis and create index files:
samtools sort -o SRR503481_sorted.bam SRR503481.bam
samtools index SRR503481_sorted.bam
#repeat for replicates 1-3
Create tables with coverage within each window and save as .txt files:
#repeat for replicates 1-3
Now the time has finally come to talk about the weird directory that Ubuntu creates on your Windows computer. Check your current directory:
The home they refer to is deep, deep, deep inside your C drive. Try going in this direction:
C: > Users > yourname > AppData > Local > Packages > something like CanonicalGroupLimited.UbuntuonWindows_79rhkp1fndgsc > LocalState > rootfs > home
Typically, you don’t want to manually mess with files in this directory, but we are only going to copy SRR503481cov.txt and the three other cov.txt files you have produced into Documents or some other folder that is easier to use as your working directory for R.
Open up R Studio and type in the commands excluding comments (indicated by #):
?setwd #this is how you look for documentation for functions on R
SRR503481 <- read.delim(“SRR503481cov.txt”, header=FALSE) #reading in a tab delimited file
SRR503478 <- read.delim(“SRR503478cov.txt”, header=FALSE)
SRR503479 <- read.delim(“SRR503479cov.txt”, header=FALSE)
SRR503480 <- read.delim(“SRR503480cov.txt”, header=FALSE)
head(SRR503481) #take a look at the first few columns
colnames(SRR503481) <- c(“Fasta_header”, “Start”, “End”, “Depth”)#adding column names
colnames(SRR503478) <- c(“Fasta_header”, “Start”, “End”, “Depth”)
colnames(SRR503479) <- c(“Fasta_header”, “Start”, “End”, “Depth”)
colnames(SRR503480) <- c(“Fasta_header”, “Start”, “End”, “Depth”)
head(SRR503481) #take a look again and see that the column names have been added
library(ggplot2) #loads the packages
Parental <- SRR503481 %>% mutate(Sample = “Parental”, RelativeDepth = (Depth/median(Depth))) #creates new variables providing sample ID and read depths normalized by the median depth
Replicate_1 <- SRR503478 %>% mutate(Sample = “Replicate_1”, RelativeDepth = (Depth/median(Depth)))
Replicate_2 <- SRR503479 %>% mutate(Sample = “Replicate_2”, RelativeDepth = (Depth/median(Depth)))
Replicate_3 <- SRR503480 %>% mutate(Sample = “Replicate_3”, RelativeDepth = (Depth/median(Depth)))
vacciniadata <- rbind(Parental,Replicate_1,Replicate_2,Replicate_3) #combines the data frames
facet_grid(rows = vars(Sample)) #plotting while separating and coloring the plots by the Sample variable
You can download the above R code in an R script, along with the bedcov text files from here. You can run the script after setting your directory and placing the cov.txt files in the directory.
Ta-dah! This is what you have produced:
The dotted lines here represent 1500 bp before and after the start sites for K3L and E3L (which you can find here with a simple Ctrl-F). The graph looks a bit different because the authors of the study evidently used a different way of normalizing the depths, but you can spot the duplications and deletions all the same.
Hope this was fun! One last note: This kind of analysis (at least up until the plotting on R) is pretty much meant to be done on a server with the processing power and memory of a supercomputer. This is still somewhat manageable because it’s a viral genome, so it’s relatively small. I would not recommend trying this with an animal or plant genome because those are huge, and will hurt your computer.