cole lyman CS, Bioinformatics & Life

How to Assemble using Velvet

Genome Assembly using Velvet

What the heck is Genome Assembly?

Genome assembly is the process of constructing long contiguous sequences (contigs) from shorter sequences (reads) by finding where they overlap. At a genomic scale the approach is the same, there is just a lot more data.

What the heck is Velvet?

Velvet is a genome assembler that uses a de Bruijn graph to generate contigs. If you want the details of how it works, read the paper "Velvet: Algorithms for de novo short read assembly using de Bruijn graphs" (Zerbino and Birney, 2008).
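To get a feel for what a de Bruijn graph is built from, here is a minimal sketch that lists every k-mer (substring of length k) in a single read. The kmers function and the example sequence are made up for illustration; this is not how Velvet itself is implemented, only the basic idea of chopping reads into overlapping k-mers.

```shell
# Print every k-mer of a sequence, one per line.
# $1 = the sequence, $2 = k (the k-mer size)
kmers() {
    awk -v seq="$1" -v k="$2" 'BEGIN {
        # Slide a window of width k along the sequence.
        for (i = 1; i + k - 1 <= length(seq); i++)
            print substr(seq, i, k)
    }'
}

# Prints ACG, CGT, GTA, TAC (one per line).
kmers ACGTAC 3
```

Each k-mer overlaps the next one by k-1 bases, and those overlaps are what an assembler follows to stitch reads back together.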

Installing Velvet

I say that the most important part of using software is figuring out how to install it. Sometimes it can be harder than you think.

Here is how you install Velvet:

  1. Download the source
    • Optional: Check out the sweet Velvet website complete with web 2.0 design. Al Gore would be proud.
  2. Go to the directory where you downloaded the file, $ cd ~/Downloads, and extract the archive, $ tar zxvf velvet_1.2.10.tgz
  3. Go to the newly extracted directory, $ cd velvet_1.2.10, and compile Velvet by issuing the command $ make
    • Error warning!! If you get an error that says something along the lines of fatal error: zlib.h: No such file or directory, then install the package zlib1g-dev (on Debian or Ubuntu, $ sudo apt-get install zlib1g-dev) and run $ make again
  4. If you didn’t get any errors, it looks like you have installed Velvet! If you got errors, Google the error and figure out how to fix it.
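If you would rather do the unpacking and compiling in one go, the steps above can be sketched as a small shell function. The name install_velvet is made up, and the function assumes the archive unpacks into a directory named after the tarball (velvet_1.2.10 for velvet_1.2.10.tgz).

```shell
# Hypothetical helper: unpack and compile Velvet from a downloaded tarball.
# Note that it leaves your shell inside the extracted directory.
install_velvet() {
    tarball=$1
    if [ ! -f "$tarball" ]; then
        echo "Tarball not found: $tarball" >&2
        return 1
    fi
    # Unpack next to the tarball, then build inside the extracted directory.
    cd "$(dirname "$tarball")" || return 1
    tar zxvf "$(basename "$tarball")" || return 1
    # Assumption: the extracted directory is the tarball name minus .tgz
    cd "$(basename "$tarball" .tgz)" || return 1
    make
}
```

You would call it as $ install_velvet ~/Downloads/velvet_1.2.10.tgz. If make complains about zlib.h, install zlib1g-dev as described above and call it again.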

Running Velvet

To execute the Velvet program make sure that you are in the velvet_1.2.10 directory and then type $ ./velveth and it should return a short help message. If it didn’t, check to see if you are in the correct directory by issuing the command $ pwd.

Assembling the Zika virus genome

Prepping the files for assembly

We have some reads from the Zika virus, fresh from Florida. We want to assemble the Zika virus genome to help find a cure. Download the reads zika.read1.fastq and zika.read2.fastq, then run this command $ ./velveth zika_genome 20 -fastq -shortPaired -separate ~/Downloads/zika.read1.fastq ~/Downloads/zika.read2.fastq. This is a preprocessing step: velveth breaks the reads into k-mers and builds the hash table that velvetg later uses to assemble them. Here is what the parameters mean:

  • ./velveth - the program that we use
  • zika_genome - the output directory for all the files
  • 20 - the hash (in other words, k-mer) size that we use; you will want to play around with this. Velvet only works with odd k-mer sizes, so an even value such as 20 is lowered to 19
  • -fastq - the format of the input files
  • -shortPaired - the type of input reads (short paired-end reads)
  • -separate - tells Velvet that the two reads of each pair are in two separate files; without it, each file is expected to contain interleaved pairs
  • ~/Downloads/zika.read1.fastq - the first file of reads
  • ~/Downloads/zika.read2.fastq - the second file of reads

Note: You can have an unlimited number of input files.
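Before handing paired reads to velveth, it can be worth checking that both files hold the same number of reads, since every read in the first file needs a partner in the second. Here is a minimal sketch, assuming plain (uncompressed) FASTQ files with the usual four lines per read. The function names count_reads and check_pairs are made up for this example.

```shell
# Count the reads in a FASTQ file, assuming 4 lines per record.
count_reads() {
    awk 'END { print int(NR / 4) }' "$1"
}

# Print the read count of both files; succeed only if the counts match.
check_pairs() {
    n1=$(count_reads "$1")
    n2=$(count_reads "$2")
    echo "$1: $n1 reads"
    echo "$2: $n2 reads"
    [ "$n1" -eq "$n2" ]
}
```

For the Zika reads that would be $ check_pairs ~/Downloads/zika.read1.fastq ~/Downloads/zika.read2.fastq. If the counts differ, something probably went wrong with the download.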

Assembling the reads

We are now going to use the program ./velvetg to actually construct the de Bruijn graph and assemble the reads. Issue the command $ ./velvetg zika_genome/ -cov_cutoff 4 -min_contig_lgth 100, and now you have assembled your first genome! Here is what the parameters mean:

  • ./velvetg - the program that we use
  • zika_genome/ - the directory that velveth created, which is also where the output is written
  • -cov_cutoff 4 - this removes the nodes that have a coverage less than 4
  • -min_contig_lgth 100 - this keeps only the contigs that are at least 100 bases long
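Since the k-mer size is the parameter you will most want to play around with, here is a rough sketch of a loop that runs velveth and velvetg for several values and reports how many contigs each one produces. The function name sweep_k and the zika_k* output directory names are made up. The Velvet flags are the ones used in this post, plus -separate, which tells velveth that the paired reads sit in two separate files. Stick to odd k-mer sizes, and keep in mind that a default Velvet build is limited to k-mer sizes of 31 or less.

```shell
# Hypothetical helper: assemble the same reads with several k-mer sizes.
# Usage: sweep_k READ1 READ2 K [K ...]   (run from the velvet_1.2.10 directory)
sweep_k() {
    read1=$1
    read2=$2
    shift 2
    for k in "$@"; do
        out="zika_k$k"   # one output directory per k-mer size
        ./velveth "$out" "$k" -fastq -shortPaired -separate "$read1" "$read2" || return 1
        ./velvetg "$out" -cov_cutoff 4 -min_contig_lgth 100 || return 1
        # Count the contigs by counting FASTA header lines.
        echo "k=$k: $(grep -c '^>' "$out/contigs.fa") contigs"
    done
}
```

For example, $ sweep_k ~/Downloads/zika.read1.fastq ~/Downloads/zika.read2.fastq 19 21 23 25. Fewer contigs is not automatically better, but it is a quick first signal of how the k-mer size changes the assembly.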

Viewing the generated contigs

Now we can look at the contigs we generated with these commands: $ cd ./zika_genome and $ less contigs.fa. Feel free to explore around in this directory for other cool stuff about the contigs!
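Scrolling through contigs.fa only tells you so much, so here is a small sketch that summarizes a FASTA file: the number of contigs, the total number of bases, the longest contig, and the N50 (the length of the contig at which, going from longest to shortest, you have covered at least half of the total bases). The function names contig_lengths and contig_stats are made up, and lengths are counted from the sequence lines themselves rather than read from the headers.

```shell
# Print the length of every sequence in a FASTA file, one per line.
contig_lengths() {
    awk '/^>/ { if (seen) print len; seen = 1; len = 0; next }  # header starts a new contig
         { len += length($0) }                                  # sequences may span several lines
         END { if (seen) print len }' "$1"
}

# Summarize a FASTA file: contig count, total bases, longest contig, N50.
contig_stats() {
    contig_lengths "$1" | sort -rn | awk '
        { len[NR] = $1; total += $1 }
        END {
            if (NR == 0) { print "no contigs found"; exit }
            half = total / 2
            # Walk from the longest contig down until half the bases are covered.
            for (i = 1; i <= NR; i++) {
                running += len[i]
                if (running >= half) { n50 = len[i]; break }
            }
            printf "contigs: %d\ntotal bases: %d\nlongest: %d\nN50: %d\n", NR, total, len[1], n50
        }'
}
```

From inside the zika_genome directory you would run $ contig_stats contigs.fa.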

Happy assembling!

Above and beyond…

You can compare your generated contigs with the NCBI Reference Sequence for the Zika virus to see how good (or how poor) your genome assembly actually is!