07 Nov 2016
So you have mapped your reads to the reference genome, but what comes next?
How can you tell how many reads were aligned, where they were aligned, and actually see what your mapping algorithm did?
This post will show you how you can analyze the results of your genome mapper by using samtools and IGV.
SAM File Format
Most modern genome aligners will output the aligned reads in the SAM file format.
If you have a lot of time on your hands you can read through this SAM file and see where the reads are mapped to and read the information for each read (which probably adds up to millions of lines).
A’int nobody got time for that.
Instead, we are going to have
samtools do the work for us.
- Download samtools.
- Go to the directory that you downloaded it and unzip the file
$ cd ~/Downloads
$ tar jxvf samtools-1.3.1.tar.bz2
- Go to the upzipped directory and prepare for install
$ cd samtools-1.3.1
- Note: You may get a library dependency error if you don’t have the development files for the ncurses library installed. If you do get this error, install either
- If the
./configure command executed without errors, install samtools
$ sudo make install
- Note: If you do not have super-user privileges then you can run
$ make install --prefix=<directory of your choice>
- Test if the install worked by running
If the installation was successful then you should see a usage prompt for
samtools, otherwise your installation failed (or you aren’t using the correct path to the
BAM File Format
The BAM file format is the binary cousin to the SAM file format, it is the same thing, just in binary form (this means that you can’t directly view a BAM file in a text editor like
vim like you could a SAM tile).
Converting SAM to BAM
We will now use
samtools to convert our SAM file to a BAM file.
The SAM file that we are going to use is found in the
examples directory of the
samtools-1.3.1 directory, the file is
samtools is broken up into subprograms that perform different functions.
In order to convert the SAM file we use
samtools view, like so:
$ samtools view -b -S examples/toy.sam -o examples/toy.bam.
Next, we want to sort the reads in the BAM file.
This is a necessary step because many visualization programs cannot handle unsorted BAM files.
To sort the BAM file, run:
$ samtools sort examples/toy.bam -o examples/toy.sorted.bam.
We then have to index the sorted BAM file, which is also a necessary step for visualization tools.
$ samtools index examples/toy.sorted.bam.
Now we can finally view the aligned reads in our terminal!
$ samtools tview examples/toy.sorted.bam examples/toy.fa and you will suddenly see four reads aligned to an extremely small genome.
You should see something along these lines:
1 11 21 31 41 51
........ .... ......K.K......K. ..........
........AGAG....***... ,,,,, ,,,,,,,,,
The numbers at the top signify the index of the genome, and the first line of characters represents the reference genome itself.
The third line is the (consensus sequence)[https://en.wikipedia.org/wiki/Consensus_sequence] (the astute student will notice that in the consensus sequence there are three
K’s, look here to discover what
Each line under the consensus sequence is a read.
You may be wondering why most of the reads are made of
., well that is because they match the reference genome.
There are many different settings that you can play with in
samtools tview, to view all of the settings type
? and a help menu will come appear.
Integrative Genome Viewer (IGV)
If you want more flexibility and a more robust way of viewing your aligned reads you can use IGV.
It has a GUI, which makes things nice sometimes.
Download the appropriate files according to the system that you are running.
IGV is written in Java (using Java 7, not sure if Java 8/9 will work, but it should), so you need to make sure that you have Java installed on your computer.
Once you have Java installed and IGV downloaded, go ahead and unzip the downloaded file, if needed.
Then if you are in an Unix-like OS (Mac or Linux) you can run
$ ./IGV_2.3.88/igv.sh to open up the IGV GUI.
Loading the Reference Genome
Once the program is open, click the
Genomes button at the top of the window, then select
Load Genome from File....
Select the file
Loading the Reads
After the reference genome is loaded we can load in the reads by clicking the
File button at the top of the window, then select
Load from File....
Select the file
Seeing the Reads
You probably can’t see any changes in the view, that’s ok.
In the second row click on the dropdown arrow that says
All and select either
ref2 and then you will be able to see what we saw in
samtools tview, except it is way easier to figure out what everything means!
25 Oct 2016
What is a Suffix Tree?
A suffix tree is a data sructure that contains each suffix of a string, a suffix is defined as the end part of a string.
For example, the suffixes for the string banana are:
Notice that the entire string is also considered a suffix.
We can also include the empty string as a suffix, so in order to include the empty string we need to append a character that isn’t in the alphabet to the string.
We will assume that the character
$ is not in the alphabet.
Our string is now banana$, and the suffixes are:
How to Construct a Suffix Tree
The easiest way to understand how to construct a suffix tree is to first construct a suffix trie, then collapse nodes to convert the trie to a tree.
Here is the graphical representation of the suffix trie:
Now we collapse the nodes that only have one child, and the resulting suffix tree is this:
How to Use a Suffix Tree
Observe that each suffix is represented in this structure.
This means that we can efficiently search for a suffix (or substring) by traversing the suffix tree.
For example if we were to search for the substring
ana in the suffix tree for
banana$, this would be the traversal path:
0 -> 2 -> 3
Note: We can stop at any node because we are searching for a substring rather than a suffix, if we were searching for a suffix, the search string would have been
Suffix Tree Construction
How would you create a naive algorithm to construct a suffix tree?
You can easily create a simple algorithm to construct a suffix tree in O(n2), but Esko Ukkonen discovered a way to construct a suffix tree in O(n) (linear time).
19 Oct 2016
What is Genome Assembly?
Genome assembly is often compared to putting a puzzle together.
In this analogy, the pieces of the puzzle are individual reads that we get from a DNA sequencer.
The ultimate goal of putting a puzzle together is to find out where each and every piece fit exactly, this way you can see the completed picture.
While this is also true of genome assembly, we need to be realistic.
The first step in genome assembly is to generate contigs, or in the words of our puzzle analogy, to combine pieces that we know go together.
What is a Contig?
A contig is short for contiguous sequence.
It is a sequence that is longer than the reads, and shorter than the genome (technically the whole assembled genome could be considered a contig in an organism with a single chromosome, but for our purposes it is considered to be shorter than the entire genome).
If you have a puzzle that is a picture of a beach and a ocean, you would combine all of the tan colored pieces and then all of the blue colored pieces.
You may even combine the pieces that make up the colorful beach umbrellas littering the beach.
When you do this you are generating contigs!
You are making small chunks of the whole because they are easy to identify, and it is the same with genome assembly.
How Do We Discover Contigs?
Well, we know that the pieces of our puzzle are reads.
The regular puzzle piece has four sides with different shapes that define exactly where it should go.
Our reads also have “sides” that tell us the proper place it should be in the contig.
These “sides” are revealed when we break the read up into kmers and inserting them into a de Bruijn graph.
We then can construct the contigs by continuing the non-branching nodes (a node that has exactly one incoming edge and one outgoing edge).
Imagine that we are assembling an incredibly small genome, with way too few reads, which are the following:
The reads are of length 5.
The de Bruijn graph using the kmer length of 5 (the entire read) for this set of reads would look like this (the non-branching node is in red):
As you can see there is only one non-branching node in the graph.
We can only generate the following contigs:
I hope it is clear that this is not a very good assembly.
Why isn’t this assembly good?
Well for the most part all of our contigs are the same length of our reads, except for one, which is only one more base pair than the read length.
This assembly is equivalent to only connecting one puzzle piece.
Let’s see how we can improve this assembly.
The length of kmer that you use to construct your de Bruijn graph will greatly influence your assembly, for better or for worse.
What will happen if we decrease the kmer length from 5 to 3?
Here is the de Bruijn graph of kmer length 3:
Here is the de Bruijn graph of kmer size 3:
What is up with all of the edges?
Each edge represents an occurance of that kmer, for example the kmer
ACT has 4 edges to
CTG because there are 4 occurances of
Let’s clean this graph up by giving weights to the edges.
The number on each edge represents how many times that edge is repeated.
Here is the de Bruijn graph of kmer size 3 with weighted edges (the non-branching nodes are in red):
Let’s see if a kmer length of 3 is any better; we definitely have more branhcing nodes, but will that lead to longer contigs?
Essentially we have made our puzzle pieces smaller, so just because we can put together more puzzle pieces doesn’t mean that we are building more of the puzzle.
Here are the contigs generated by the graph of kmer length 3:
Well, compared to the assembly using kmers of length 5 we have the same number of contigs, with a shorter average, and the longest contig is only the length of the reads.
Is this assembly better or worse than the last one?
Short answer, yes.
You have to define “better” when classifying assemblies.
You may want to have the longest average contig length, or simply the longest contig.
You may have some other metric like N50 in which you determine which assembly is “better.”
One way of accounting for errors in your algorithm is to remove edges below a certain weight.
When an edge has a higher value, it occurs more often.
If an edge occurs more often, then it is more likely to be a valid contig.
What is Next?
You may be asking yourself, “I can see how we generate contigs, and how that can be useful, but we still don’t have one sequence that represents the genome. We haven’t fully completed the puzzle. How do we do that?”
If you would like to fully complete the puzzle, to put all of these contigs together, you would have to construct scaffolds.
A scaffold is a representation of how contigs relate to each other as well as accounting for gaps in the sequence.
Depending on how much data you have, you may not be able to create one continuous sequence.
NOTE: Scaffolding is not required for the CS 418/BIO 365 Genome Assembler project.
Generating contigs is good enough!