Nov 10 2018
The title of Donald Knuth’s famed series The Art of Computer Programming is
often misunderstood, as he describes in his talk Computer Programming as an
Art, which he delivered when he received the 1974 Turing Award. I love this
talk! I think that it offers a unique perspective on programming that can be
very beneficial.
Knuth starts his talk with an explanation of the difference between an art and a
science through an etymological analysis of the words. This seems to explain
why fields like medicine have historically been considered arts even though our
modern perspective would consider them sciences.
I resonate with Knuth’s observation:
My feeling is that when we prepare a program, it can be like composing poetry or
music; as Andrei Ershov has said, programming can give us both intellectual and
emotional satisfaction, because it is a real achievement to master complexity
and to establish a system of consistent rules.
I believe that code can certainly be beautiful and the process of reducing a
problem down into a set of explicit rules can be extremely satisfying (although
at times, very frustrating). Many people consider programs to merely be a
computational recipe that executes commands at the will of the programmer.
However, I resonate much more with the idea of programming as “procedural
epistemology” (from Structure and Interpretation of Computer Programs), meaning
that we construct programs to discover truth and use programs as a tool to
justify our knowledge.
I do have a hard time reconciling this idea of programming as “procedural
epistemology,” essentially idealistic programming for discovering truth, with
pragmatic programming, programming to solve a problem. However, I think Knuth
addresses this in his talk quite nicely when he states that “we shouldn’t feel
guilty about programs that are just for fun.”
I enjoy how Knuth addresses the inverse relationship between the amount of
resources available and the enjoyment of programming: the more constrained the
resources under which we work, the more we tend to enjoy the programs we
write. I like the idea of
working under contrived constraints as an exercise to increase our programming
abilities. I have never done this, but I sure would like to try it sometime!
In case you haven’t read the talk, I would highly recommend it. I wonder if
there is a video recording of him delivering it, I’m sure that it would be a
pleasure to watch Knuth himself deliver it.
Sep 19 2018
The Hamming distance of strings \(a\) and \(b\) is defined as the number of
character mismatches between \(a\) and \(b\). The Hamming distance can be
calculated in a fairly concise single line of Python.
The Code
def hamming(a, b):
    return len([i for i in filter(lambda x: x[0] != x[1], zip(a, b))])
The Explanation
If you aren’t familiar with many of the shortcuts that Python provides, this
line may seem quite cryptic. To best explain how this function works, I will
expand it to the equivalent multi-line version, and then go over each part of
the multi-line version.
def hamming(a, b):
    zipped_strings = zip(a, b)
    mismatched_chars = []
    # equivalent of the [...]
    for i in zipped_strings:
        # equivalent of the filter(lambda x: x[0] != x[1], ...)
        if i[0] != i[1]:
            mismatched_chars.append(i)
    return len(mismatched_chars)
Zip
zip
is a useful built-in Python function that takes two iterables as its arguments
and returns an iterator (a list in Python 2) of tuples, where the first element
of each tuple comes from the first iterable and the second element comes from
the second. For example:
a = [1, 2, 3]
b = ['a', 'b', 'c']
print(list(zip(a, b)))
[(1, 'a'), (2, 'b'), (3, 'c')]
In this case, the elements of a
are paired up with the elements of b
.
List comprehension
List comprehensions in Python are a handy way to shorten a for loop. They
follow the form [... for i in iterable], where the ... is replaced by an
expression applied to each element of iterable (a list, generator, etc.).
An example of a list comprehension is:
evens = [i * 2 for i in range(10) if i % 2 == 0]
print(evens)
This code takes the even numbers in the range 0 to 9 and multiplies each by 2,
producing [0, 4, 8, 12, 16]. Notice how the conditional is placed at the end
of the expression.
Lambda functions
Lambda functions may look scary (mostly because it can be hard to recognize
where the parameters come from), but just think of them as functions that
don’t have a name. Lambda functions are usually short functions that perform
something straightforward. In the case of the one-liner, the lambda function
lambda x: x[0] != x[1]
takes one argument x (which is a tuple) and checks whether the two elements of
the tuple (x[0] and x[1]) are not equal, returning a boolean value.
Filter
The final piece of our one-line puzzle is filter. In functional programming
there are common patterns for performing certain operations. Three ubiquitous
functions in any language resembling a functional language are map, reduce,
and filter. They are in themselves very simple, yet extremely powerful (if you
want to read more about how to use them in Python, I would recommend Python
Tips).
filter
takes two arguments, first a function that takes an element of an
iterable as input and returns a boolean, and second an iterable. In our one line
case, we have
filter(lambda x: x[0] != x[1], zip(a, b))
As described earlier, the lambda is just a function without a name, and
zip(a, b) pairs up the characters of the two strings. Our filter expression
keeps only the pairs where x[0] != x[1] is True, giving us the pairs of
characters that don’t match. When we take the length of this list we get, by
definition, the Hamming distance.
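As an aside (my own variation, not part of the original one-liner), the same distance can be computed even more directly with sum and a generator expression, since True counts as 1:

```python
def hamming(a, b):
    # sum() counts the True values (True == 1) produced by the generator
    return sum(c1 != c2 for c1, c2 in zip(a, b))

print(hamming("karolin", "kathrin"))  # 3
```

Like the original, this relies on zip stopping at the end of the shorter string.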
Sep 5 2018
Introduction
In my predictive modelling class, we were given a homework assignment to apply
the Euler method to a simple equation and then write a program to compute the
approximation. I will show what I did to apply the Euler method, and then how
I programmed it using Haskell.
Our Equation
For our task we were given the simple equation to run the Euler method on:
\begin{equation}
\frac{dv}{dt} = -g
\end{equation}
where \(g = 9.8\, m/s^2\) (this should look familiar to any physicist). This
equation models the falling of an object (on Earth).
Applying the Euler Method
To apply the Euler method, one must first have a base case. In this instance,
assume that the initial velocity of the object is 0, thus when \(t = 0\), \(v = 0\);
in other words \(v(0) = 0\).
Now, from my understanding, the Euler method is basically a recursive way to
approximate the function. Let \(h\) be a given step size and \(v_n\) be the
velocity at step \(n\), i.e. at time \(t = nh\). It follows by Euler’s method
that:
\begin{equation}
v_{n+1} = v_n + h \frac{dv}{dt} = v_n - hg
\end{equation}
In English, this equation says that the velocity at the next time step is
equal to the velocity at the current time step (\(v_n\)) plus the step size
(\(h\)) times the derivative (\(\frac{dv}{dt} = -g\)). How do you compute the
velocity at the current time step (\(v_n\))? Recursively: \(v_n = v(t - h)\),
the velocity one step earlier.
Translating Math into Haskell
Due to Haskell’s nature, it isn’t hard to translate mathematical equations into
Haskell functions (if you have ever seen a Haskell function, you know what I
mean). Because of the recursive nature, the reasoning behind the Haskell
function is a little bit different than the mathematical intuition presented
earlier, but it is the same principle.
First, let’s define \(g\) in Haskell:
g :: Float
g = 9.8
This is pretty straightforward, as defined earlier \(g = 9.8\) (the acceleration
of gravity on Earth). If you are unfamiliar with Haskell, the first line
describes the type of \(g\) and the second line sets the value of \(g\). Now let’s
move onto the meat of the algorithm!
Forming the function step by step, the first step is to establish the types that
our function will be dealing with:
v :: Float -> Float -> Float
Which means that our function v
will take two parameters (that are of type
Float
), and return a value that is a Float
.
Next, let’s create our function modelled after the formulae we presented
earlier:
v :: Float -> Float -> Float
v t h = v_n + v_t
  where v_n = v (t - h) h
        v_t = h * (-g)
This should look very similar to the equation above. However, the astute
observer will notice that this function is not complete: there is no base case
to break the recursion, so this function will run forever!
At the beginning we assumed that our base case was that the initial velocity was
0, thus adding this to our function we get:
v :: Float -> Float -> Float
v t h | t > 0     = v_n + v_t
      | otherwise = 0.0
  where v_n = v (t - h) h
        v_t = h * (-g)
In Haskell the | symbols are called guards, which essentially translate into a
conditional statement: the first expression on the left-hand side of the = that
evaluates to true returns what is on the right-hand side of the =.
For example, if we test our function with \(t = 1\) and \(h = 1\), then the
first time the function is called the first guard evaluates to true (because
\(1 > 0\)), so the function is called again (inside v_n), but now \(t = 0\),
so the second guard is reached. Ultimately the function returns \(-9.8\).
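For comparison, here is a rough iterative sketch of the same recurrence in Python (my own translation, not part of the assignment); each pass through the loop performs one Euler step:

```python
g = 9.8  # acceleration due to gravity (m/s^2), as in the Haskell version

def v(t, h):
    """Euler's method for dv/dt = -g with v(0) = 0, written iteratively."""
    velocity = 0.0
    for _ in range(int(round(t / h))):
        velocity += h * (-g)  # v_{n+1} = v_n + h * (dv/dt)
    return velocity

print(v(1, 1))  # -9.8
```

The recursion in the Haskell version and the loop here compute the same sum of steps.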
Conclusion
This was my first encounter using the Euler method to do numerical analysis and
model an equation. I hope that you found this interesting and that you enjoy the
elegance of Haskell as much as I do!
Aug 18 2018
Yesterday I graduated with my Bachelor’s degree from Brigham Young University
and attended the Convocation services for my college (College of Physical and
Mathematical Sciences). Our dean, Dr. Shane Reese, gave a very inspirational
talk, part of which I would like to share with you.
Dr. Reese’s talk centered on two pieces of advice to the graduates: working
hard and being meek. He also provided tools of measurement so that we can see
how well we are following his advice.
His first piece of advice was to work really hard. He shared that there is no
substitute for working hard, and that one can measure each day whether one has
given an honest day’s work.
The second piece of advice is what I would like to primarily focus on, and that
advice was to be meek. As a tool to measure one’s meekness, he shared a
conversation that he had overheard where a faculty member was discussing the
president of our university, Dr. Kevin J. Worthen. This faculty member described
Dr. Worthen as having an incredibly high intelligence to arrogance ratio,
illustrated below.
\(\frac{intelligence}{arrogance}\)
Imagine that one’s intelligence and arrogance could be quantified; then the
intelligence to arrogance ratio would be your quantified intelligence divided
by your quantified arrogance. For example, if one is incredibly dumb but also
quite full of themselves, then the ratio would be extremely low. On the other
hand, if one is reasonably intelligent while still humble, then the ratio is
quite high. As a third example, if one is intelligent, but also pompous, then
the ratio will be around 1.
In my experience, the individuals with a high ratio are the most pleasant to
be around, because they offer interesting insights and advice without being
condescending. They are the ones that show you (rather than tell you) how
intelligent they are.
I found this advice to be incredibly insightful and useful to all. I hope to
increase my own intelligence to arrogance ratio as I pursue higher education and
gain more life experience.
Jul 11 2018
I have really wanted to write this post for a long time, but only now have I
gotten around to it. For more than a year now, my research in the
Computational Sciences Lab (CSL) at Brigham Young University (BYU) has focused
on various applications of the Colored de Bruijn Graph (CdBG). It all started
when we explored some novel phylogenetic reconstruction methods in the CS 601R
class during Winter semester 2017. We (or at least, I) kept being drawn back
to CdBGs and their potential for phylogeny reconstruction. Here are some of
the things that I have learned along the way!
As with most scientific endeavors, this project certainly stands on the
shoulders of giants. Some of these giants include the following papers and their
respective authors. I think that they have done amazing work and I admire their
methods.
Motivation
We want to use the CdBG to reconstruct phylogenetic trees because it is very
efficient computationally. The CdBG can be constructed in \(O(n)\) time and
space, and it can utilize whole-genome sequences, addressing a shortcoming of
many traditional phylogenetic tree reconstruction algorithms.
Furthermore, we also figured that the CdBG contains more information than many
of the kmer-counting methods; if those can perform so well, then the CdBG
should be able to perform even better, because it not only stores the kmers
(as nodes in the graph) but also the context in which those kmers occur (as
edges, where \(k - 1\) basepairs overlap on either end of the kmer).
Our Contribution
kleuren
In order to test our hypothesis, we did what every self-respecting computer
scientist would do: we wrote a program to find out if it worked. We call our
program kleuren, which is Dutch for “colors” (referring to the colors in the
CdBG).
kleuren works by finding bubble regions of the CdBG. A bubble is defined as a
subgraph of the CdBG consisting of a start node and an end node that are
present in \(n\) or more colors, with multiple paths connecting the start node
to the end node; here \(n\) is a given parameter no greater than the total
number of colors in the CdBG, and a path is simply a traversal from one node
to another.
After the bubbles are found, they are aligned through Multiple Sequence
Alignment (MSA) via MAFFT and then each MSA block is concatenated to form a
supermatrix. The supermatrix is then fed into a Maximum-Likelihood program
(IQ-TREE) to reconstruct the phylogenetic tree of the taxa.
How Bubbles are Found
kleuren uses fairly simple and straightforward algorithms to find the bubbles,
broken up into two steps: Finding the End Node and Extending the Paths.
Finding the End Node
kleuren
iterates over the super-set (the union of all kmers from all
taxa) as potential start nodes (in a dBG the nodes are kmers, thus
\(node == kmer\)). Given a kmer, it is queried in the CdBG and the number
of taxa (or colors, thus \(taxon == color\)) is calculated to determine if
the number of colors for that kmer, \(c\), is greater than or equal to \(n\),
where \(n\) is a parameter provided by the user.
If \(c \geq n\) then the node is a valid start node and a breadth-first
search is performed starting from this node until another node is
found where the number of colors that it contains is greater than or
equal to \(n\), which then becomes the end node.
Extending the Paths
After an end node is discovered, the sequence of each path between the
start and end nodes must be calculated. In order to discover a path in
a dBG, one must collapse edges by appending the last nucleotide of the next
node to the previous node’s sequence. For example, if an edge is ACT -> CTG,
then the collapsed sequence will turn out to be ACTG.
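As a sketch (a hypothetical helper of my own, not kleuren’s actual code), collapsing a path of kmer nodes into a sequence can look like this:

```python
def collapse_path(kmers):
    """Collapse a path of kmer nodes by appending the last nucleotide
    of each successive node to the running sequence."""
    sequence = kmers[0]
    for kmer in kmers[1:]:
        sequence += kmer[-1]  # consecutive nodes overlap by k - 1 basepairs
    return sequence

print(collapse_path(["ACT", "CTG"]))  # ACTG, as in the example above
```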
This is implemented as at most \(c\) depth-first searches, where \(c\) is
the number of colors. The number of depth-first searches decreases as
the number of paths with shared colors increases.
Further Reading
If you are interested in the details of our algorithm and would like to see some
results, please check out our paper Whole Genome Phylogenetic Tree
Reconstruction Using Colored de Bruijn Graphs (preprint). We are currently
working on extending kleuren
to improve its efficiency.
Nov 7 2016
So you have mapped your reads to the reference genome, but what comes next?
How can you tell how many reads were aligned, where they were aligned, and actually see what your mapping algorithm did?
This post will show you how you can analyze the results of your genome mapper by using samtools and IGV.
Most modern genome aligners will output the aligned reads in the SAM file format.
If you have a lot of time on your hands you can read through this SAM file and see where the reads are mapped and read the information for each read (which probably adds up to millions of lines).
Ain’t nobody got time for that.
Instead, we are going to have samtools
do the work for us.
Installation
- Download samtools.
- Go to the directory in which you downloaded it and unzip the file
$ cd ~/Downloads
$ tar jxvf samtools-1.3.1.tar.bz2
- Go to the unzipped directory and prepare for install
$ cd samtools-1.3.1
$ ./configure
- Note: You may get a library dependency error if you don’t have the development files for the ncurses library installed. If you do get this error, install either
libncurses5-dev
or ncurses-devel
.
- If the
./configure
command executed without errors, install samtools
$ sudo make install
- Note: If you do not have super-user privileges then you can run
$ make prefix=<directory of your choice> install
- Test if the install worked by running
samtools
If the installation was successful then you should see a usage prompt for samtools
, otherwise your installation failed (or you aren’t using the correct path to the samtools
binary).
The BAM file format is the binary cousin of the SAM file format: it is the same thing, just in binary form (this means that you can’t directly view a BAM file in a text editor like vim
like you could a SAM file).
Converting SAM to BAM
We will now use samtools
to convert our SAM file to a BAM file.
The SAM file that we are going to use is found in the examples
directory of the samtools-1.3.1
directory, the file is toy.sam
.
samtools
is broken up into subprograms that perform different functions.
In order to convert the SAM file we use samtools view
, like so: $ samtools view -b -S examples/toy.sam -o examples/toy.bam
.
Sorting BAM
Next, we want to sort the reads in the BAM file.
This is a necessary step because many visualization programs cannot handle unsorted BAM files.
To sort the BAM file, run: $ samtools sort examples/toy.bam -o examples/toy.sorted.bam
.
Indexing BAM
We then have to index the sorted BAM file, which is also a necessary step for visualization tools.
Run $ samtools index examples/toy.sorted.bam
.
Now we can finally view the aligned reads in our terminal!
Run $ samtools tview examples/toy.sorted.bam examples/toy.fa
and you will suddenly see four reads aligned to an extremely small genome.
You should see something along these lines:
1 11 21 31 41 51
AGCATGTTAGATAA****GATA**GCTGTGCTAGTAGGCAG*TCAGCGCCATNNNNNNNN
........ .... ......K.K......K. ..........
........AGAG....***... ,,,,, ,,,,,,,,,
......GG**....AA
..C...**** ...**...>>>>>>>>>>>>>>T.....```
The numbers at the top signify the index of the genome, and the first line of characters represents the reference genome itself.
The third line is the consensus sequence (see the Wikipedia article on consensus sequence, https://en.wikipedia.org/wiki/Consensus_sequence, to discover what K
represents).
Each line under the consensus sequence is a read.
You may be wondering why most of the reads are made up of .
characters; that is because they match the reference genome.
There are many different settings that you can play with in samtools tview
, to view all of the settings type ?
and a help menu will come appear.
Integrative Genome Viewer (IGV)
If you want more flexibility and a more robust way of viewing your aligned reads you can use IGV.
It has a GUI, which makes things nice sometimes.
Installation
Download the appropriate files according to the system that you are running.
IGV is written in Java (using Java 7; not sure if Java 8/9 will work, but it should), so you need to make sure that you have Java installed on your computer.
Once you have Java installed and IGV downloaded, go ahead and unzip the downloaded file, if needed.
Then if you are on a Unix-like OS (Mac or Linux) you can run $ ./IGV_2.3.88/igv.sh
to open up the IGV GUI.
Loading the Reference Genome
Once the program is open, click the Genomes
button at the top of the window, then select Load Genome from File...
.
Select the file samtools-1.3.1/examples/toy.fa
.
Loading the Reads
After the reference genome is loaded we can load in the reads by clicking the File
button at the top of the window, then select Load from File...
.
Select the file samtools-1.3.1/examples/toy.sorted.bam
.
Seeing the Reads
You probably can’t see any changes in the view, that’s ok.
In the second row click on the dropdown arrow that says All
and select either ref
or ref2
and then you will be able to see what we saw in samtools tview
, except it is way easier to figure out what everything means!
Oct 25 2016
Suffix Trees
What is a Suffix Tree?
A suffix tree is a data structure that contains each suffix of a string; a suffix is defined as the trailing part of a string.
For example, the suffixes for the string banana are:
- a
- na
- ana
- nana
- anana
- banana
Notice that the entire string is also considered a suffix.
We can also include the empty string as a suffix; in order to do so, we append a character that isn’t in the alphabet to the string.
We will assume that the character $
is not in the alphabet.
Our string is now banana$, and the suffixes are:
- $
- a$
- na$
- ana$
- nana$
- anana$
- banana$
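Generating this list programmatically is straightforward; for instance, a quick Python illustration (my own, listing the suffixes shortest first):

```python
def suffixes(s):
    # s[i:] is the suffix starting at index i; iterate from shortest to longest
    return [s[i:] for i in range(len(s) - 1, -1, -1)]

for suffix in suffixes("banana$"):
    print(suffix)
```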
How to Construct a Suffix Tree
The easiest way to understand how to construct a suffix tree is to first construct a suffix trie, then collapse nodes to convert the trie to a tree.
Here is the graphical representation of the suffix trie:
Now we collapse the nodes that only have one child, and the resulting suffix tree is this:
How to Use a Suffix Tree
Observe that each suffix is represented in this structure.
This means that we can efficiently search for a suffix (or substring) by traversing the suffix tree.
For example if we were to search for the substring ana
in the suffix tree for banana$
, this would be the traversal path:
0 -> 2 -> 3
Note: We can stop at any node because we are searching for a substring rather than a suffix; if we were searching for a suffix, the search string would have been ana$
.
Suffix Tree Construction
How would you create a naive algorithm to construct a suffix tree?
You can easily create a simple algorithm to construct a suffix tree in \(O(n^2)\), but Esko Ukkonen discovered a way to construct a suffix tree in \(O(n)\) (linear time).
Oct 19 2016
What is Genome Assembly?
Genome assembly is often compared to putting a puzzle together.
In this analogy, the pieces of the puzzle are individual reads that we get from a DNA sequencer.
The ultimate goal of putting a puzzle together is to find out where each and every piece fits exactly; this way you can see the completed picture.
While this is also true of genome assembly, we need to be realistic.
The first step in genome assembly is to generate contigs, or in the words of our puzzle analogy, to combine pieces that we know go together.
What is a Contig?
A contig is short for contiguous sequence.
It is a sequence that is longer than the reads, and shorter than the genome (technically the whole assembled genome could be considered a contig in an organism with a single chromosome, but for our purposes it is considered to be shorter than the entire genome).
If you have a puzzle that is a picture of a beach and an ocean, you would combine all of the tan-colored pieces and then all of the blue-colored pieces.
You may even combine the pieces that make up the colorful beach umbrellas littering the beach.
When you do this you are generating contigs!
You are making small chunks of the whole because they are easy to identify, and it is the same with genome assembly.
How Do We Discover Contigs?
Well, we know that the pieces of our puzzle are reads.
The regular puzzle piece has four sides with different shapes that define exactly where it should go.
Our reads also have “sides” that tell us the proper place they should occupy in the contig.
These “sides” are revealed when we break the read up into kmers and insert them into a de Bruijn graph.
We can then construct the contigs by joining the non-branching nodes (a node that has exactly one incoming edge and one outgoing edge).
Imagine that we are assembling an incredibly small genome, with way too few reads, which are the following:
ACTGT
TCTGT
CTGTT
CTGTA
GCATA
TGTTA
GTTAC
The reads are of length 5.
The de Bruijn graph using the kmer length of 5 (the entire read) for this set of reads would look like this (the non-branching node is in red):
As you can see there is only one non-branching node in the graph.
We can only generate the following contigs:
ACTGT
TCTGT
CTGTTA
CTGTA
GCATA
TGTTA
GTTAC
I hope it is clear that this is not a very good assembly.
Why isn’t this assembly good?
Well for the most part all of our contigs are the same length of our reads, except for one, which is only one more base pair than the read length.
This assembly is equivalent to only connecting one puzzle piece.
Let’s see how we can improve this assembly.
Kmer Length
The length of kmer that you use to construct your de Bruijn graph will greatly influence your assembly, for better or for worse.
What will happen if we decrease the kmer length from 5 to 3?
Here is the de Bruijn graph of kmer length 3:
What is up with all of the edges?
Each edge represents an occurrence of that kmer; for example, the kmer ACT
has 4 edges to CTG
because there are 4 occurrences of CTG
.
Let’s clean this graph up by giving weights to the edges.
The number on each edge represents how many times that edge is repeated.
Here is the de Bruijn graph of kmer size 3 with weighted edges (the non-branching nodes are in red):
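To make the edge weights concrete, here is a small sketch of my own (not any assembler’s actual code) that extracts kmers of length 3 from the reads above and counts how many times each edge occurs:

```python
from collections import Counter

reads = ["ACTGT", "TCTGT", "CTGTT", "CTGTA", "GCATA", "TGTTA", "GTTAC"]
k = 3

edge_weights = Counter()
for read in reads:
    # slide a window of length k over the read to get its kmers
    kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
    # consecutive kmers overlap by k - 1 bases, forming an edge
    for left, right in zip(kmers, kmers[1:]):
        edge_weights[(left, right)] += 1

print(edge_weights[("CTG", "TGT")])  # 4
```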
Let’s see if a kmer length of 3 is any better; we definitely have more branching nodes, but will that lead to longer contigs?
Essentially we have made our puzzle pieces smaller, so just because we can put together more puzzle pieces doesn’t mean that we are building more of the puzzle.
Here are the contigs generated by the graph of kmer length 3:
ACT
TCT
CTG
TGTTA
TGTA
GCATA
TAC
Well, compared to the assembly using kmers of length 5, we have the same number of contigs with a shorter average length, and the longest contig is only the length of the reads.
Is this assembly better or worse than the last one?
Short answer: yes.
You have to define “better” when comparing assemblies.
You may want to have the longest average contig length, or simply the longest contig.
You may have some other metric like N50 in which you determine which assembly is “better.”
Filtering Errors
One way of accounting for errors in your algorithm is to remove edges below a certain weight.
When an edge has a higher value, it occurs more often.
If an edge occurs more often, then it is more likely to be a valid contig.
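As a sketch (my own illustration, with arbitrary example weights), given a mapping from edges to weights, dropping low-weight edges is a one-line filter:

```python
edge_weights = {("ACT", "CTG"): 1, ("CTG", "TGT"): 4, ("TGT", "GTT"): 2}

min_weight = 2  # arbitrary threshold; tune for your data
filtered = {edge: w for edge, w in edge_weights.items() if w >= min_weight}

print(sorted(filtered))  # [('CTG', 'TGT'), ('TGT', 'GTT')]
```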
What is Next?
You may be asking yourself, “I can see how we generate contigs, and how that can be useful, but we still don’t have one sequence that represents the genome. We haven’t fully completed the puzzle. How do we do that?”
If you would like to fully complete the puzzle, to put all of these contigs together, you would have to construct scaffolds.
A scaffold is a representation of how contigs relate to each other as well as accounting for gaps in the sequence.
Depending on how much data you have, you may not be able to create one continuous sequence.
NOTE: Scaffolding is not required for the CS 418/BIO 365 Genome Assembler project.
Generating contigs is good enough!
Oct 5 2016
Genome Assembly using Velvet
What the heck is Genome Assembly?
Genome assembly is the process of constructing long contiguous sequences from shorter sequences. Think of putting together a puzzle, but at a genomic scale: same approach, just a lot more data.
What the heck is Velvet?
Velvet is a genome assembler that uses a de Bruijn graph to generate contigs.
If you are interested in reading the paper describing how Velvet works, feel free to read Velvet: Algorithms for de novo short read assembly using de Bruijn graphs.
Installing Velvet
I’d say that the most important part of using software is figuring out how to install it.
Sometimes it can be harder than you think.
Here is how you install Velvet:
- Download the source
- Optional: Check out the sweet Velvet website complete with web 2.0 design. Al Gore would be proud.
- Go to the directory in which you downloaded the file
$ cd ~/Downloads
and unzip the file $ tar zxvf velvet_1.2.10.tgz
- Go to the just unzipped directory
$ cd velvet_1.2.10
and compile Velvet by issuing the command $ make
- Error warning!! If you get an error that says something along the lines of
fatal error: zlib.h: No such file or directory
then try installing the package zlib1g-dev
then running $ make
again
- If you didn’t get any errors, it looks like you have installed Velvet! If you got errors, Google the error and figure out how to fix it.
Running Velvet
To execute the Velvet program make sure that you are in the velvet_1.2.10
directory and then type $ ./velveth
and it should return a short help message. If it didn’t, check to see if you are in the correct directory by issuing the command $ pwd
.
Assembling the Zika virus genome
Prepping the files for assembly
We have some reads from the Zika virus, fresh from Florida.
We want to assemble the Zika virus genome to help find a cure.
Download the reads zika.read1.fastq and zika.read2.fastq, then run this command $ ./velveth zika_genome 20 -fastq -shortPaired ~/Downloads/zika.read1.fastq ~/Downloads/zika.read2.fastq
.
This command is a sort of preprocessing command that constructs your dataset so that it can assemble it.
Here are what the parameters mean:
./velveth
- the program that we use
zika_genome
- this is the output directory of all the files
20
- this is the hash (in other words, kmer) size that we use, you will want to play around with this
-fastq
- this is the type of input files that we have
-shortPaired
- this is the type of input reads that we have
~/Downloads/zika.read1.fastq
- this is the first file of reads
~/Downloads/zika.read2.fastq
- this is the second file of reads
Note: You can have an unlimited number of input files.
Assembling the reads
We are now going to use the program ./velvetg
to actually construct the de Bruijn graph and assemble the reads.
Issue the command $ ./velvetg zika_genome/ -cov_cutoff 4 -min_contig_lgth 100
, and now you have assembled your first genome!
Here are what the parameters mean:
./velvetg
- the program that we use
-cov_cutoff 4
- this removes the nodes that have a coverage less than 4
-min_contig_lgth 100
- this gives us all of the contigs that are greater than 100 bases
Viewing the generated contigs
Now we can see the contigs we generated by this command, $ cd ./zika_genome
and $ less contigs.fa
.
Feel free to explore around in this directory for other cool stuff about the contigs!
Happy assembling!
Above and beyond…
You can compare your generated contigs with the NCBI Reference Sequence for the Zika virus to see how well (or how poorly) your genome assembly actually is!
Aug 30 2016
Python Tutorial
CS 418/BIO 365
30 August 2016
# This is a comment
""" This is a
multi-
line
comment"""
''' This is also a
multi-
line
comment'''
# How do I declare variables?
# String
my_string = 'Hello, CS 418/BIO 365!!'
my_other_string = "Hello, friends."
# Numbers
my_number = 418
my_other_number = 365.578
# List
my_list = ['this', 'is', 'fun', 56]
# or
my_other_list = []
# or
my_other_list2 = list()
my_other_list.append('bioinformatics')
# Dictionary (Map)
my_dict = {'key': 'value', 'other_key': 'other_value'}
# or
my_other_dict = {}
# or
my_other_dict2 = dict()
my_other_dict['my_key'] = 'my_value'
# How do I make loops?
# NOTE: Whitespace matters!!
# for loop
for i in my_list:
    print(i)
# NOTE: "print(i)" is used in Python 3, "print i" is used in Python 2
this
is
fun
56
# for loop over range
for i in range(0, 5):
    my_other_list.append(i)
print(my_other_list)
['bioinformatics', 0, 1, 2, 3, 4]
# while loop
while len(my_list) > 0:
    del my_list[-1]
    print(my_list)
['this', 'is', 'fun']
['this', 'is']
['this']
[]
# break
for val in my_other_list:
    if val == 2:  # use ==, not "is", to compare values
        break
    else:
        print(val)
bioinformatics
0
1
# continue
for val in my_other_list:
    if val != 2:  # use !=, not "is not", to compare values
        continue
    print(val)
2
# How do I perform file I/O?
# file input
import sys # this imports the library to get command line arguments
with open(sys.argv[1]) as fh:  # sys.argv[1] is the first command line argument, sys.argv[2] is the second ... and so on.
    my_first_line = next(fh)  # Python 2 & 3
    my_next_line = fh.next()  # Python 2 only
with open(sys.argv[1]) as other_fh:
    # iterate over all lines in file
    for line in other_fh:
        print(line)
ACGTTGCATGTCGCATGATGCATGAGAGCT
4
# file output
with open('./my_file.txt', 'w') as writable:
    # NOTE: the second argument ('w') opens the file for writing
    writable.write('This is my test file\n')
    writable.write(my_first_line)
# How do I manipulate strings?
# string slice
my_string = 'banana'
print(my_string[3])
print(my_string[0:5])
print(my_string[1:5])
print(my_string[-1])
print(my_string[0:-2])
a
banan
anan
a
bana
Rosalind Tutorial
Use your NetID as your username!!
# What do we need to do in pseudo code?
''' -Read in the sequence and the
length of the kmer (a kmer is essentially a substring)
-Break up the sequence into kmers
-Count each kmer
-Find the kmer(s) with the highest counts
-Print out the highest count kmers'''
# Read in the sequence and the length of the kmer from a file
import sys
with open(sys.argv[1]) as file:
    seq = next(file).strip()  # REMEMBER: in Python 2 use file.next()
    kmer_len = int(next(file).strip())
# Break up the sequence into kmers
for i in range(0, len(seq) - kmer_len + 1):  # stop so every kmer is full length
    kmer = seq[i : i + kmer_len]
# Count each kmer
counts = {}
for i in range(0, len(seq) - kmer_len + 1):
    kmer = seq[i : i + kmer_len]
    if kmer not in counts:
        counts[kmer] = 1
    else:
        counts[kmer] += 1
# Find the kmer(s) with the highest counts
max_count = 0
max_kmers = []
for kmer, count in counts.items():  # NOTE: use counts.iteritems() in Python 2
    if count > max_count:
        max_count = count
        max_kmers = [kmer]  # start a new list (list.clear() is Python 3 only)
    elif count == max_count:
        max_kmers.append(kmer)
# Print out the kmers with the highest counts
print(" ".join(max_kmers))