cole lyman CS, Bioinformatics & Life

How to Assemble using Velvet

Genome Assembly using Velvet

What the heck is Genome Assembly?

Genome assembly is the process of constructing long contiguous sequences from shorter sequences. Think of this problem at a genomic scale. Same approach, just a lot more data.

What the heck is Velvet?

Velvet is a genome assembler that uses a de Bruijn graph to generate contigs. If you are interested in reading the paper describing how Velvet works, feel free to read Velvet: Algorithms for de novo short read assembly using de Bruijn graphs.

Installing Velvet

I say that the most important part of using software is figuring out how to install it. Sometimes it can be harder than you think.

Here is how you install Velvet:

  1. Download the source
    • Optional: Check out the sweet Velvet website complete with web 2.0 design. Al Gore would be proud.
  2. Go to the directory in which you downloaded the file $ cd ~/Downloads and unzip the file $ tar zxvf velvet_1.2.10.tgz
  3. Go to the just unzipped directory $ cd velvet_1.2.10 and compile Velvet by issuing the command $ make
    • Error warning!! If you get an error that says something along the lines of fatal error: zlib.h: NO such file or directory then try installing the package zlib1g-dev then running $ make again
  4. If you didn’t get any errors, it looks like you have installed Velvet! If you got errors, Google the error and figure out how to fix it.

Running Velvet

To execute the Velvet program make sure that you are in the velvet_1.2.10 directory and then type $ ./velveth and it should return a short help message. If it didn’t, check to see if you are in the correct directory by issuing the command $ pwd.

Assembling the Zika virus genome

Prepping the files for assembly

We have some reads from the Zika virus, fresh from Florida. We want to assemble the Zika virus genome to help find a cure. Download the reads zika.read1.fastq and zika.read2.fastq, then run this command $ ./velveth zika_genome 20 -fastq -shortPaired ~/Downloads/zika.read1.fastq ~/Downloads/zika.read2.fastq. This command is a sort of preprocessing command that constructs your dataset so that it can assemble it. Here are what the parameters mean:

  • ./velveth- the program that we use
  • zika_genome- this is the output directory of all the files
  • 20- this is the hash (in other words, kmer) size that we use, you will want to play around with this
  • -fastq- this is the type of input files that we have
  • -shortPaired- this is the type of input reads that we have
  • ~/Downloads/zika.read1.fastq- this is the first file of reads
  • ~/Downloads/zika.read2.fastq- this is the second file of reads

Note: You can have an unlimited number of input files.

Assembling the reads

We are now going to use the program ./velvetg to actually construct the de Bruijn graph and assemble the reads. Issue the command $ ./velvetg zika_genome/ -cov_cutoff 4 -min_contig_lgth 100, and now you have assembled your first genome! Here are what the parameters mean:

  • ./velvetg- the program that we use
  • -cov_cutoff 4- this removes the nodes that have a coverage less than 4
  • -min_contig_lgth 100- this gives us all of the contigs that are greater than 100 bases

Viewing the generated contigs

Now we can see the contigs we generated by this command, $ cd ./zika_genome and $ less contigs.fa. Feel free to explore around in this directory for other cool stuff about the contigs!

Happy assembling!

Above and beyond…

You can compare your generated contigs with the NCBI Reference Sequence for the Zika virus to see how well (or how poorly) your genome assembly actually is!

Python Tutorial CS 418/BIO 365

Python Tutorial

CS 418/BIO 365

30 August 2016

# This is a comment
""" This is a 
    multi-
    line
    comment"""
''' This is also a 
    multi-
    line
    comment'''
# How do I declare variables?

# String
my_string = 'Hello, CS 418/BIO 365!!'
my_other_string = "Hello, friends."
# Numbers
my_number = 418
my_other_number = 365.578
# List
my_list = ['this', 'is', 'fun', 56]
# or
my_other_list = []
# or
my_other_list2 = list()

my_other_list.append('bioinformatics')
# Dictionary (Map)
my_dict = {'key': 'value', 'other_key': 'other_value'}
# or
my_other_dict = {}
# or
my_other_dict2 = dict()

my_other_dict['my_key'] = 'my_value'
# How do I make loops?

# NOTE: Whitespace matters!!

# for loop
for i in my_list:
    print(i)
    
# NOTE: "print(i)" is used in Python 3, "print i" is used in Python 2 
this
is
fun
56
# for loop over range
for i in range(0, 5):
    my_other_list.append(i)
print(my_other_list)
['bioinformatics', 0, 1, 2, 3, 4]
# while loop
while len(my_list) > 0:
    del my_list[-1]
    print(my_list)
['this', 'is', 'fun']
['this', 'is']
['this']
[]
# break 
for val in my_other_list:
    if val is 2:
        break
    else:
        print(val)
bioinformatics
0
1
# continue
for val in my_other_list:
    if val is not 2:
        continue
    print(val)
2
# How do I perform file I/O?

# file input
import sys # this imports the library to get command line arguments
with open(sys.argv[1]) as fh: # sys.argv[1] is the first command line argument, sys.argv[2] is the second ... and so on.
    my_first_line = next(fh) # Python 2 & 3
    my_next_line = fh.next() # Python 2 
    
with open(sys.argv[1]) as other_fh:
    # iterate over all lines in file
    for line in other_fh:
        print(line)
ACGTTGCATGTCGCATGATGCATGAGAGCT

4
# file output
with open('./my_file.txt', 'w') as writable: ''' NOTE: the second 
    argument ('w') makes this file writable '''
    writable.write('This is my test file\n')
    writable.write(my_first_line)
# How do I manipulate strings?

# string slice
my_string = 'banana'
print(my_string[3])
print(my_string[0:5])
print(my_string[1:5])
print(my_string[-1])
print(my_string[0:-2])
a
banan
anan
a
bana

Rosalind Tutorial

You can register for the course at: http://rosalind.info/classes/enroll/e464d00a10/

Use your NetID as your username!!

The first problem is found here: http://rosalind.info/problems/ba1b/?class=322

# What do we need to do in pseudo code?

''' -Read in the sequence and the 
    length of the kmer (a kmer is essentially a substring)
    -Break up the sequence into kmers
    -Count each kmer
    -Find the kmer(s) with the highest counts
    -Print out the highest count kmers'''
# Read in the sequence and the length of the kmer from a file
import sys
with open(sys.argv[1]) as file:
    seq = next(file).strip() # REMEMBER: in Python 2 use file.next()
    kmer_len = int(next(file).strip())
# Break up the sequence into kmers
    for i in range(0, len(seq)):
        kmer = seq[ i : i + kmer_len]
# Count each kmer
    counts = {}
    for i in range(0, len(seq)):
        kmer = seq[ i : i + kmer_len]
        if kmer not in counts:
            counts[kmer] = 1
        else:
            counts[kmer] += 1
# Find the kmer(s) with the highest counts
    max_count = 0
    max_kmers = []
    for kmer, count in counts.items(): # NOTE: use counts.iteritems() in Python 2
        if count > max_count:
            max_count = count
            max_kmers.clear()
            max_kmers.append(kmer)
        elif count == max_count:
            max_kmers.append(kmer)
# Print out the kmers with the highest counts
    print(" ".join(max_kmers))

You Don't Know How Bad You Are Until You Try to Be Good

One interesting phenomenon that I have noticed in myself is that I perceive that I am better than I actually am in reality. Call it pride or high expectations; either way, it seems that I never realize where I was until I begin to improve. When we don’t try to improve ourselves, we will have nothing to compare ourselves to in the future. If we never surpass our current state, then we will never realize that we were ever lacking in our old state.

I have also noticed that the moment that we begin on the path of improvement it seems as if that path is impossibly difficult, so much more difficult than it was before. I believe that this is due to the fact that change is difficult, and as stated previously, I believe that my abilities are higher than they are in reality.

Why try to improve?

Some may ask what the point is in striving to do your best? Is your current state not good enough? I can only speak for myself (obviously), so why do I strive to do my best? I see achieving your highest potential as the greatest challenge this life has to offer. Whether it be in sports, music, academics, or business, it is human nature to embrace difficulties and overcome them. I see overcoming weaknesses and flaws in my character as a challenge that is to be overcome, that I enjoy immensely.

Why is improving so difficult?

You don’t know how hard improving is going to be until you attempt it. If our current state of being isn’t easy or comfortable, then we wouldn’t that way in the first place. It is much easier to stay the way you are than to fundamentally change your character; however, change for the better is always worth it, no matter how difficult it was. Doing what we have always done is definitely easier, but making the change that we want can bring the satisfaction back to life that we may be lacking.

-Cole