How to Assemble using Velvet
Oct 5 2016
Genome Assembly using Velvet
What the heck is Genome Assembly?
Genome assembly is the process of constructing long contiguous sequences (contigs) from shorter sequences (reads). Now think of this problem at a genomic scale: the approach is the same, there is just a lot more data.
What the heck is Velvet?
Velvet is a genome assembler that uses a de Bruijn graph to generate contigs. If you are interested in reading the paper describing how Velvet works, feel free to read Velvet: Algorithms for de novo short read assembly using de Bruijn graphs.
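To give a feel for the idea, here is a toy Python sketch of a de Bruijn graph. It is not how Velvet is implemented (Velvet also handles sequencing errors, both DNA strands, and much more); the function names `build_de_bruijn` and `walk_unbranched` and the three tiny reads are made up for illustration. Each k-mer in a read becomes an edge from its first k-1 letters to its last k-1 letters, and a contig is read off by following a path that never branches.

```python
from collections import defaultdict


def build_de_bruijn(reads, k):
    """Map each (k-1)-mer to the set of (k-1)-mers that follow it in some k-mer."""
    graph = defaultdict(set)
    for read in reads:
        for i in range(len(read) - k + 1):
            kmer = read[i:i + k]
            graph[kmer[:-1]].add(kmer[1:])
    return graph


def walk_unbranched(graph, start):
    """Extend a contig from `start` while the current node has exactly one successor."""
    contig = start
    node = start
    seen = {start}
    while len(graph.get(node, ())) == 1:
        (node,) = graph[node]
        if node in seen:  # stop on a cycle instead of looping forever
            break
        seen.add(node)
        contig += node[-1]  # each step adds one new base
    return contig


# Three hypothetical overlapping reads (not real Zika data).
reads = ["ATGGCGT", "GGCGTGC", "GTGCAAT"]
graph = build_de_bruijn(reads, k=4)
contig = walk_unbranched(graph, "ATG")
print(contig)
```

In this toy case the three 7-base reads merge into one 12-base contig because every node has a single successor. Real data has branches (repeats, errors), which is where an assembler like Velvet does the hard work.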
Installing Velvet
I say that the most important part of using software is figuring out how to install it. Sometimes it can be harder than you think.
Here is how you install Velvet:
- Download the source
- Optional: Check out the sweet Velvet website complete with web 2.0 design. Al Gore would be proud.
- Go to the directory in which you downloaded the file
$ cd ~/Downloads
and unzip the file
$ tar zxvf velvet_1.2.10.tgz
- Go to the just unzipped directory
$ cd velvet_1.2.10
and compile Velvet by issuing the command
$ make
- Error warning!! If you get an error that says something along the lines of
fatal error: zlib.h: No such file or directory
then try installing the package zlib1g-dev (on Debian or Ubuntu, for example, with $ sudo apt-get install zlib1g-dev) and then run
$ make
again.
- If you didn’t get any errors, it looks like you have installed Velvet! If you got errors, Google the error and figure out how to fix it.
Running Velvet
To execute the Velvet program, make sure that you are in the velvet_1.2.10 directory and then type
$ ./velveth
and it should return a short help message. If it didn’t, check to see if you are in the correct directory by issuing the command
$ pwd
Assembling the Zika virus genome
Prepping the files for assembly
We have some reads from the Zika virus, fresh from Florida.
We want to assemble the Zika virus genome to help find a cure.
Download the reads zika.read1.fastq and zika.read2.fastq, then run this command
$ ./velveth zika_genome 20 -fastq -shortPaired ~/Downloads/zika.read1.fastq ~/Downloads/zika.read2.fastq
This command is a preprocessing step: it constructs the dataset that Velvet will assemble in the next step.
Here is what the parameters mean:
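- ./velveth: the program that we use
- zika_genome: the output directory for all the files
- 20: the hash (in other words, k-mer) size that we use; you will want to play around with this. Velvet's help text says the hash length should be odd and that an even value is decremented, so 20 is treated as 19
- -fastq: the format of the input files
- -shortPaired: the type of input reads that we have (short paired reads). By default velveth expects paired reads interleaved in a single file; its help text lists a -separate option for mates stored in two files, as ours are
- ~/Downloads/zika.read1.fastq: the first file of reads
- ~/Downloads/zika.read2.fastq: the second file of reads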
Note: You can have an unlimited number of input files.
Assembling the reads
We are now going to use the program ./velvetg to actually construct the de Bruijn graph and assemble the reads.
Issue the command
$ ./velvetg zika_genome/ -cov_cutoff 4 -min_contig_lgth 100
and now you have assembled your first genome!
Here is what the parameters mean:
- ./velvetg: the program that we use
- -cov_cutoff 4: this removes the nodes that have a coverage less than 4
- -min_contig_lgth 100: this keeps only the contigs that are longer than 100 bases
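To make the two thresholds concrete, here is a small Python sketch that applies them to a hypothetical list of nodes. The `Node` class, the `filter_nodes` function, and the numbers are invented for illustration, and the comparisons simply follow the wording above; this is not Velvet's internal logic, which works on the graph itself.

```python
from dataclasses import dataclass


@dataclass
class Node:
    name: str
    length: int      # length in bases
    coverage: float  # average coverage of the node


def filter_nodes(nodes, cov_cutoff, min_contig_lgth):
    """Drop low-coverage nodes first, then drop the ones that are too short."""
    well_covered = [n for n in nodes if n.coverage >= cov_cutoff]
    # "longer than" follows the tutorial's wording; the exact boundary is an assumption.
    return [n for n in well_covered if n.length > min_contig_lgth]


# Hypothetical nodes, not taken from a real assembly.
nodes = [
    Node("a", 850, 12.3),  # kept
    Node("b", 400, 2.1),   # coverage below 4, removed
    Node("c", 60, 9.8),    # shorter than 100 bases, removed
    Node("d", 150, 4.0),   # kept
]
kept = filter_nodes(nodes, cov_cutoff=4, min_contig_lgth=100)
print([n.name for n in kept])
```

Raising either threshold gives you fewer, more trustworthy contigs; lowering them keeps more sequence but lets more noise through.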
Viewing the generated contigs
Now we can see the contigs we generated with these commands:
$ cd ./zika_genome
$ less contigs.fa
Feel free to explore around in this directory for other cool stuff about the contigs!
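If you would rather get a quick summary than scroll through the file, a short Python sketch like the following can count the contigs and report their lengths. The helper names `read_fasta` and `n50` are my own, and the example records are made up; the only assumption about the file is that it is plain FASTA (a `>` header line followed by sequence lines).

```python
def read_fasta(lines):
    """Parse FASTA text from an iterable of lines into {header: sequence}."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:]
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {header: "".join(parts) for header, parts in records.items()}


def n50(lengths):
    """Length of the contig at which half of the total assembled bases is reached."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


# Made-up records standing in for the contents of contigs.fa.
example = """>contig_1
ACGTACGTAC
>contig_2
ACGTA
CGT
>contig_3
ACGTAC
""".splitlines()

contigs = read_fasta(example)
lengths = [len(seq) for seq in contigs.values()]
print(len(contigs), "contigs,", sum(lengths), "bases, N50 =", n50(lengths))
```

To run it on your own assembly, pass an open file instead of the example lines, for instance `with open("contigs.fa") as handle: contigs = read_fasta(handle)`.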
Happy assembling!
Above and beyond…
You can compare your generated contigs with the NCBI Reference Sequence for the Zika virus to see how good (or how poor) your genome assembly actually is!
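One very rough way to put a number on that comparison is to ask what fraction of the reference's k-mers also show up in your contigs. The Python sketch below does that on made-up sequences; `reference_kmer_recall` is a hypothetical helper, and it ignores the reverse strand and says nothing about contig order or misassemblies, so treat it as a sanity check and not as a proper evaluation.

```python
def kmers(seq, k):
    """Return the set of all k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def reference_kmer_recall(contigs, reference, k):
    """Fraction of the reference's distinct k-mers that appear in any contig."""
    ref_kmers = kmers(reference, k)
    if not ref_kmers:
        return 0.0
    contig_kmers = set()
    for contig in contigs:
        contig_kmers |= kmers(contig, k)
    return len(ref_kmers & contig_kmers) / len(ref_kmers)


# Made-up sequences: two contigs that cover the reference but leave a gap in the middle.
reference = "ATGGCGTGCAATCCGA"
contigs = ["ATGGCGTG", "CAATCCGA"]
score = reference_kmer_recall(contigs, reference, k=4)
print(f"{score:.2f} of reference 4-mers recovered")
```

Here 10 of the 13 reference 4-mers are recovered; the three missing ones are exactly those that span the junction between the two contigs.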