cole lyman CS, Bioinformatics & Life

Genome Assembly

What is Genome Assembly?

Genome assembly is often compared to putting a puzzle together. In this analogy, the pieces of the puzzle are individual reads that we get from a DNA sequencer. The ultimate goal of putting a puzzle together is to find out where each and every piece fit exactly, this way you can see the completed picture. While this is also true of genome assembly, we need to be realistic. The first step in genome assembly is to generate contigs, or in the words of our puzzle analogy, to combine pieces that we know go together.

What is a Contig?

A contig is short for contiguous sequence. It is a sequence that is longer than the reads, and shorter than the genome (technically the whole assembled genome could be considered a contig in an organism with a single chromosome, but for our purposes it is considered to be shorter than the entire genome). If you have a puzzle that is a picture of a beach and a ocean, you would combine all of the tan colored pieces and then all of the blue colored pieces. You may even combine the pieces that make up the colorful beach umbrellas littering the beach. When you do this you are generating contigs! You are making small chunks of the whole because they are easy to identify, and it is the same with genome assembly.

How Do We Discover Contigs?

Well, we know that the pieces of our puzzle are reads. The regular puzzle piece has four sides with different shapes that define exactly where it should go. Our reads also have “sides” that tell us the proper place it should be in the contig. These “sides” are revealed when we break the read up into kmers and inserting them into a de Bruijn graph. We then can construct the contigs by continuing the non-branching nodes (a node that has exactly one incoming edge and one outgoing edge).

Imagine that we are assembling an incredibly small genome, with way too few reads, which are the following:

ACTGT
TCTGT
CTGTT
CTGTA
GCATA
TGTTA
GTTAC

The reads are of length 5. The de Bruijn graph using the kmer length of 5 (the entire read) for this set of reads would look like this (the non-branching node is in red): de Bruijn Graph

As you can see there is only one non-branching node in the graph. We can only generate the following contigs:

ACTGT
TCTGT
CTGTTA
CTGTA
GCATA
TGTTA
GTTAC

I hope it is clear that this is not a very good assembly. Why isn’t this assembly good? Well for the most part all of our contigs are the same length of our reads, except for one, which is only one more base pair than the read length. This assembly is equivalent to only connecting one puzzle piece. Let’s see how we can improve this assembly.

Kmer Length

The length of kmer that you use to construct your de Bruijn graph will greatly influence your assembly, for better or for worse. What will happen if we decrease the kmer length from 5 to 3?

Here is the de Bruijn graph of kmer length 3:

Here is the de Bruijn graph of kmer size 3: de Bruijn Graph

What is up with all of the edges? Each edge represents an occurance of that kmer, for example the kmer ACT has 4 edges to CTG because there are 4 occurances of CTG. Let’s clean this graph up by giving weights to the edges. The number on each edge represents how many times that edge is repeated.

Here is the de Bruijn graph of kmer size 3 with weighted edges (the non-branching nodes are in red): de Bruijn Graph

Let’s see if a kmer length of 3 is any better; we definitely have more branhcing nodes, but will that lead to longer contigs? Essentially we have made our puzzle pieces smaller, so just because we can put together more puzzle pieces doesn’t mean that we are building more of the puzzle. Here are the contigs generated by the graph of kmer length 3:

ACT
TCT
CTG
TGTTA
TGTA
GCATA
TAC

Well, compared to the assembly using kmers of length 5 we have the same number of contigs, with a shorter average, and the longest contig is only the length of the reads. Is this assembly better or worse than the last one? Short answer, yes. You have to define “better” when classifying assemblies. You may want to have the longest average contig length, or simply the longest contig. You may have some other metric like N50 in which you determine which assembly is “better.”

Filtering Errors

One way of accounting for errors in your algorithm is to remove edges below a certain weight. When an edge has a higher value, it occurs more often. If an edge occurs more often, then it is more likely to be a valid contig.

What is Next?

You may be asking yourself, “I can see how we generate contigs, and how that can be useful, but we still don’t have one sequence that represents the genome. We haven’t fully completed the puzzle. How do we do that?” If you would like to fully complete the puzzle, to put all of these contigs together, you would have to construct scaffolds. A scaffold is a representation of how contigs relate to each other as well as accounting for gaps in the sequence. Depending on how much data you have, you may not be able to create one continuous sequence.

NOTE: Scaffolding is not required for the CS 418/BIO 365 Genome Assembler project. Generating contigs is good enough!

How to Assemble using Velvet

Genome Assembly using Velvet

What the heck is Genome Assembly?

Genome assembly is the process of constructing long contiguous sequences from shorter sequences. Think of this problem at a genomic scale. Same approach, just a lot more data.

What the heck is Velvet?

Velvet is a genome assembler that uses a de Bruijn graph to generate contigs. If you are interested in reading the paper describing how Velvet works, feel free to read Velvet: Algorithms for de novo short read assembly using de Bruijn graphs.

Installing Velvet

I say that the most important part of using software is figuring out how to install it. Sometimes it can be harder than you think.

Here is how you install Velvet:

  1. Download the source
    • Optional: Check out the sweet Velvet website complete with web 2.0 design. Al Gore would be proud.
  2. Go to the directory in which you downloaded the file $ cd ~/Downloads and unzip the file $ tar zxvf velvet_1.2.10.tgz
  3. Go to the just unzipped directory $ cd velvet_1.2.10 and compile Velvet by issuing the command $ make
    • Error warning!! If you get an error that says something along the lines of fatal error: zlib.h: NO such file or directory then try installing the package zlib1g-dev then running $ make again
  4. If you didn’t get any errors, it looks like you have installed Velvet! If you got errors, Google the error and figure out how to fix it.

Running Velvet

To execute the Velvet program make sure that you are in the velvet_1.2.10 directory and then type $ ./velveth and it should return a short help message. If it didn’t, check to see if you are in the correct directory by issuing the command $ pwd.

Assembling the Zika virus genome

Prepping the files for assembly

We have some reads from the Zika virus, fresh from Florida. We want to assemble the Zika virus genome to help find a cure. Download the reads zika.read1.fastq and zika.read2.fastq, then run this command $ ./velveth zika_genome 20 -fastq -shortPaired ~/Downloads/zika.read1.fastq ~/Downloads/zika.read2.fastq. This command is a sort of preprocessing command that constructs your dataset so that it can assemble it. Here are what the parameters mean:

  • ./velveth- the program that we use
  • zika_genome- this is the output directory of all the files
  • 20- this is the hash (in other words, kmer) size that we use, you will want to play around with this
  • -fastq- this is the type of input files that we have
  • -shortPaired- this is the type of input reads that we have
  • ~/Downloads/zika.read1.fastq- this is the first file of reads
  • ~/Downloads/zika.read2.fastq- this is the second file of reads

Note: You can have an unlimited number of input files.

Assembling the reads

We are now going to use the program ./velvetg to actually construct the de Bruijn graph and assemble the reads. Issue the command $ ./velvetg zika_genome/ -cov_cutoff 4 -min_contig_lgth 100, and now you have assembled your first genome! Here are what the parameters mean:

  • ./velvetg- the program that we use
  • -cov_cutoff 4- this removes the nodes that have a coverage less than 4
  • -min_contig_lgth 100- this gives us all of the contigs that are greater than 100 bases

Viewing the generated contigs

Now we can see the contigs we generated by this command, $ cd ./zika_genome and $ less contigs.fa. Feel free to explore around in this directory for other cool stuff about the contigs!

Happy assembling!

Above and beyond…

You can compare your generated contigs with the NCBI Reference Sequence for the Zika virus to see how well (or how poorly) your genome assembly actually is!

Python Tutorial CS 418/BIO 365

Python Tutorial

CS 418/BIO 365

30 August 2016

# This is a comment
""" This is a 
    multi-
    line
    comment"""
''' This is also a 
    multi-
    line
    comment'''
# How do I declare variables?

# String
my_string = 'Hello, CS 418/BIO 365!!'
my_other_string = "Hello, friends."
# Numbers
my_number = 418
my_other_number = 365.578
# List
my_list = ['this', 'is', 'fun', 56]
# or
my_other_list = []
# or
my_other_list2 = list()

my_other_list.append('bioinformatics')
# Dictionary (Map)
my_dict = {'key': 'value', 'other_key': 'other_value'}
# or
my_other_dict = {}
# or
my_other_dict2 = dict()

my_other_dict['my_key'] = 'my_value'
# How do I make loops?

# NOTE: Whitespace matters!!

# for loop
for i in my_list:
    print(i)
    
# NOTE: "print(i)" is used in Python 3, "print i" is used in Python 2 
this
is
fun
56
# for loop over range
for i in range(0, 5):
    my_other_list.append(i)
print(my_other_list)
['bioinformatics', 0, 1, 2, 3, 4]
# while loop
while len(my_list) > 0:
    del my_list[-1]
    print(my_list)
['this', 'is', 'fun']
['this', 'is']
['this']
[]
# break 
for val in my_other_list:
    if val is 2:
        break
    else:
        print(val)
bioinformatics
0
1
# continue
for val in my_other_list:
    if val is not 2:
        continue
    print(val)
2
# How do I perform file I/O?

# file input
import sys # this imports the library to get command line arguments
with open(sys.argv[1]) as fh: # sys.argv[1] is the first command line argument, sys.argv[2] is the second ... and so on.
    my_first_line = next(fh) # Python 2 & 3
    my_next_line = fh.next() # Python 2 
    
with open(sys.argv[1]) as other_fh:
    # iterate over all lines in file
    for line in other_fh:
        print(line)
ACGTTGCATGTCGCATGATGCATGAGAGCT

4
# file output
with open('./my_file.txt', 'w') as writable: ''' NOTE: the second 
    argument ('w') makes this file writable '''
    writable.write('This is my test file\n')
    writable.write(my_first_line)
# How do I manipulate strings?

# string slice
my_string = 'banana'
print(my_string[3])
print(my_string[0:5])
print(my_string[1:5])
print(my_string[-1])
print(my_string[0:-2])
a
banan
anan
a
bana

Rosalind Tutorial

You can register for the course at: http://rosalind.info/classes/enroll/e464d00a10/

Use your NetID as your username!!

The first problem is found here: http://rosalind.info/problems/ba1b/?class=322

# What do we need to do in pseudo code?

''' -Read in the sequence and the 
    length of the kmer (a kmer is essentially a substring)
    -Break up the sequence into kmers
    -Count each kmer
    -Find the kmer(s) with the highest counts
    -Print out the highest count kmers'''
# Read in the sequence and the length of the kmer from a file
import sys
with open(sys.argv[1]) as file:
    seq = next(file).strip() # REMEMBER: in Python 2 use file.next()
    kmer_len = int(next(file).strip())
# Break up the sequence into kmers
    for i in range(0, len(seq)):
        kmer = seq[ i : i + kmer_len]
# Count each kmer
    counts = {}
    for i in range(0, len(seq)):
        kmer = seq[ i : i + kmer_len]
        if kmer not in counts:
            counts[kmer] = 1
        else:
            counts[kmer] += 1
# Find the kmer(s) with the highest counts
    max_count = 0
    max_kmers = []
    for kmer, count in counts.items(): # NOTE: use counts.iteritems() in Python 2
        if count > max_count:
            max_count = count
            max_kmers.clear()
            max_kmers.append(kmer)
        elif count == max_count:
            max_kmers.append(kmer)
# Print out the kmers with the highest counts
    print(" ".join(max_kmers))