Cole Lyman: Bioinformatics, Emacs, Programming, and Life

The Art of Computer Programming

The title of Donald Knuth’s famed series The Art of Computer Programming is often misunderstood, as he describes in his talk Computer Programming as an Art, delivered when he received the 1974 Turing Award. I love this talk! I think that it offers a unique perspective on programming that can be very beneficial.

Knuth starts his talk by explaining the difference between an art and a science through an etymological analysis of the two words. This helps explain why fields such as medicine have historically been considered arts, even though our modern perspective would consider them sciences.

Knuth’s observation resonates with me:

My feeling is that when we prepare a program, it can be like composing poetry or music; as Andrei Ershov has said, programming can give us both intellectual and emotional satisfaction, because it is a real achievement to master complexity and to establish a system of consistent rules.

I believe that code can certainly be beautiful, and the process of reducing a problem to a set of explicit rules can be extremely satisfying (although at times, very frustrating). Many people consider programs to merely be computational recipes that execute commands at the will of the programmer. However, I resonate much more with the idea of programming as “procedural epistemology” (from Structure and Interpretation of Computer Programs), meaning that we construct programs to discover truth and use programs as a tool to justify our knowledge.

I do have a hard time reconciling this idea of programming as “procedural epistemology,” essentially idealistic programming for discovering truth, with pragmatic programming, programming to solve a problem. However, I think Knuth addresses this in his talk quite nicely when he states that “we shouldn’t feel guilty about programs that are just for fun.”

I enjoy how Knuth addresses the inverse relationship between the amount of resources available and the enjoyment of programming: the more constrained the resources we work under, the more we tend to enjoy the programs we write. I like the idea of working under contrived constraints as an exercise to increase our programming abilities. I have never done this, but I sure would like to try it sometime!

In case you haven’t read the talk, I highly recommend it. I wonder if there is a video recording of him delivering it; I’m sure it would be a pleasure to watch Knuth deliver it himself.

Hamming Distance One-liner in Python

The Hamming distance of strings \(a\) and \(b\) is defined as the number of positions at which the characters of \(a\) and \(b\) differ. The Hamming distance can be calculated in a fairly concise single line of Python.

The Code

def hamming(a, b):
    return len([i for i in filter(lambda x: x[0] != x[1], zip(a, b))])
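A quick sanity check (the example strings here are my own):

print(hamming("karolin", "kathrin"))  # 3 mismatches: r/t, o/h, l/r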

The Explanation

If you aren’t familiar with the shortcuts that Python provides, this line may seem quite cryptic. To explain how this function works, I will expand it into the equivalent multi-line version and then go over each part of it.

def hamming(a, b):
    zipped_strings = zip(a, b)

    mismatched_chars = []

    # equivalent of the [...]
    for i in zipped_strings:
        # equivalent of the filter(lambda x: x[0] != x[1], ...)
        if i[0] != i[1]:
            mismatched_chars.append(i)

    return len(mismatched_chars)

Zip

zip is a useful built-in Python function that takes two (or more) iterables as its arguments and pairs their elements into tuples: the first element of each tuple comes from the first iterable and the second element comes from the second iterable. (In Python 3, zip returns a lazy iterator, which is why the example below wraps it in list.) For example:

a = [1, 2, 3]
b = ['a', 'b', 'c']

print(list(zip(a, b)))
[(1, 'a'), (2, 'b'), (3, 'c')]

In this case, the elements of a are paired up with the elements of b.

List comprehension

List comprehensions in Python are a very handy way to shorten a for loop. They follow the form [... for i in iterable], where the ... is replaced by the code that is run on each element of iterable (a list, generator, etc.).

An example of a list comprehension is:

evens = [i * 2 for i in range(10) if i % 2 == 0]

print(evens)

This code takes the even numbers from 0 to 9 (range(10) stops at 9) and multiplies each of them by 2, producing [0, 4, 8, 12, 16]. Notice how the conditional is placed at the end of the expression.

Lambda functions

Lambda functions may look scary (mostly because it can be hard to recognize where the parameters come from), but just think of them as functions without a name. Lambda functions are usually short functions that do something straightforward. In the case of the one-liner, the lambda function

lambda x: x[0] != x[1]

takes one argument x (a tuple) and checks whether the two elements of the tuple (x[0] and x[1]) differ, returning a boolean value.

Filter

The final piece of our one-line puzzle is filter. Functional programming has well-worn patterns for performing certain operations, and three ubiquitous functions in any language resembling a functional one are map, reduce, and filter. They are very simple in themselves, yet extremely powerful (if you want to read more about how to use them in Python, I would recommend Python Tips).

filter takes two arguments, first a function that takes an element of an iterable as input and returns a boolean, and second an iterable. In our one line case, we have

filter(lambda x: x[0] != x[1], zip(a, b))

As described earlier, the lambda is just a function without a name, and zip pairs up the characters of the two strings. Our filter expression keeps only the tuples for which x[0] != x[1] is True, giving us the pairs of characters that don’t match. Taking the length of that collection gives, by definition, the Hamming distance.
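As an aside, if you don’t mind trading the filter for a generator expression, the same computation can be written even more compactly (an equivalent variant, not the one this post set out to explain):

def hamming_sum(a, b):
    # True counts as 1 and False as 0, so summing the comparisons
    # counts the mismatched positions directly
    return sum(x != y for x, y in zip(a, b))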

Modelling Gravity using the Euler Method in Haskell

Introduction

In my predictive modelling class, we were given a homework assignment to apply the Euler method to a simple equation and then write a program to compute it. I will show how I applied the Euler method, and then how I programmed it in Haskell.

Our Equation

For our task we were given the simple equation to run the Euler method on:

\begin{equation} \frac{dv}{dt} = -g \end{equation}

where \(g = 9.8\ m/s^2\) (this should look familiar to any physicist). This equation models the velocity of a falling object (on Earth).

Applying the Euler Method

To apply the Euler method, one must first have a base case. In this instance, assume that the initial velocity of the object is 0, thus when \(t = 0\), \(v = 0\); in other words \(v(0) = 0\).

Now, from my understanding, the Euler method is basically a recursive way to approximate the function. Let \(h\) be a given step size, \(v(t)\) be the velocity at a given time \(t\), and \(v_n\) be the velocity at a given step \(n\). It follows by Euler’s method that:

\begin{equation} v_{n+1} = v_n + h\frac{dv}{dt} = v_n - hg \end{equation}

In English, this equation says that the velocity at the next time step is equal to the velocity at the previous time step (\(v_n\)) plus the step size (\(h\)) times the derivative (\(\frac{dv}{dt} = -g\)). How do you compute the velocity at the previous time step? It is the same function evaluated one step back in time: \(v_n = v(t - h)\).

Translating Math into Haskell

Due to Haskell’s nature, it isn’t hard to translate mathematical equations into Haskell functions (if you have ever seen a Haskell function, you know what I mean). Because of the recursion, the reasoning behind the Haskell function is a little different from the mathematical intuition presented earlier, but it is the same principle.

First, let’s define \(g\) in Haskell:

g :: Float
g = 9.8

This is pretty straightforward: as defined earlier, \(g = 9.8\) (the acceleration due to gravity on Earth). If you are unfamiliar with Haskell, the first line declares the type of g and the second line sets its value. Now let’s move on to the meat of the algorithm!

Forming the function step by step, the first step is to establish the types that our function will be dealing with:

v :: Float -> Float -> Float

This means that our function v takes two parameters (both of type Float) and returns a value that is a Float.

Next, let’s create our function modelled after the formula we presented earlier:

v :: Float -> Float -> Float
v t h = v_n + v_t
    where v_n = v (t - h) h
          v_t = h * (-g)

This should look very similar to the equation above. However, the astute observer will notice that this function is not complete: there is no base case to break the recursion, so this function will run forever!

At the beginning we assumed that the initial velocity was 0; adding this base case to our function we get:

v :: Float -> Float -> Float
v t h | t > 0 = v_n + v_t
      | otherwise = 0.0
    where v_n = v (t - h) h
          v_t = h * (-g)

In Haskell the | symbols are called guards, which essentially translate into a conditional statement: the first guard whose expression on the left-hand side of the = evaluates to True returns what is on the right-hand side of the =.

For example, if we test our function with \(t = 1\) and \(h = 1\), then the first time the function is called the first guard evaluates to True (because \(t = 1\) and \(1 > 0\)), so the function is called again (inside v_n); this time \(t = 0\), so the second guard is reached and returns 0.0. Ultimately the function returns \(-9.8\) (negative because the object accelerates downward).
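If you would like to check the arithmetic without a Haskell toolchain, here is a minimal Python sketch of the same recursion (my own translation of the function above):

G = 9.8  # acceleration due to gravity, m/s^2

def v(t, h):
    # Euler approximation of dv/dt = -g with base case v(0) = 0
    if t <= 0:
        return 0.0
    return v(t - h, h) + h * (-G)

print(v(1, 1))  # -9.8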

Conclusion

This was my first encounter using the Euler method to do numerical analysis and model an equation. I hope that you found this interesting and that you enjoy the elegance of Haskell as much as I do!

Intelligence to Arrogance Ratio

Yesterday I graduated with my Bachelor’s degree from Brigham Young University and attended the convocation services for my college (the College of Physical and Mathematical Sciences). Our dean, Dr. Shane Reese, gave a very inspirational talk, part of which I would like to share with you.

Dr. Reese’s talk centered on two pieces of advice to the graduates: working hard and being meek. He also provided tools of measurement so that we can see how well we are following his advice.

His first piece of advice was to work really hard. He shared that there is no substitute for working hard, and that one can measure each day whether one has given an honest day’s work.

The second piece of advice is what I would like to primarily focus on, and that advice was to be meek. As a tool to measure one’s meekness, he shared a conversation that he had overheard where a faculty member was discussing the president of our university, Dr. Kevin J. Worthen. This faculty member described Dr. Worthen as having an incredibly high intelligence to arrogance ratio, illustrated below.

\(\frac{\text{intelligence}}{\text{arrogance}}\)

Imagine that one’s intelligence and arrogance can be quantified; the intelligence to arrogance ratio would then be your quantified intelligence divided by your quantified arrogance. For example, if one is incredibly dumb but also quite full of oneself, then the ratio is extremely low. On the other hand, if one is reasonably intelligent while still humble, then the ratio is quite high. As a third example, if one is intelligent but also pompous, then the ratio will be around 1.

In my experience, the individuals with a high ratio are the most pleasant to be around, because they offer interesting insights and advice without being condescending. They are the ones that show you (rather than tell you) how intelligent they are.

I found this advice to be incredibly insightful and useful to all. I hope to increase my own intelligence to arrogance ratio as I pursue higher education and gain more life experience.

Colored de Bruijn Graphs

I have wanted to write this post for a long time, but am only now getting around to it. For more than a year, my research in the Computational Sciences Lab (CSL) at Brigham Young University (BYU) has centered on various applications of the Colored de Bruijn Graph (CdBG). It all started when we explored some novel phylogenetic reconstruction methods in the CS 601R class during Winter semester 2017. We (or at least, I) kept being drawn back to CdBGs and their potential for phylogeny reconstruction. Here are some of the things that I have learned along the way!

As with most scientific endeavors, this project certainly stands on the shoulders of giants: the authors of the papers our work builds on have done amazing work, and I admire their methods.

Motivation

We want to use the CdBG to reconstruct phylogenetic trees because it is very efficient computationally: the CdBG can be constructed in \(O(n)\) time and space, and it can utilize whole-genome sequences, something many traditional phylogenetic tree reconstruction algorithms cannot do.

Furthermore, we figured that the CdBG contains more information than many of the kmer counting methods; if those can perform so well, then the CdBG should perform even better, because it not only stores the kmers (as nodes in the graph), but also the context in which those kmers occur (as edges where \(k - 1\) basepairs overlap on either end of the kmer).

Our Contribution

kleuren

In order to test our hypothesis, we did what every self-respecting computer scientist would do: we wrote a program to figure out if it worked. We call our program kleuren, which is Dutch for “colors” (referring to the colors in the CdBG).

kleuren works by finding bubble regions of the CdBG. A bubble is defined as a subgraph of the CdBG consisting of a start node and an end node that are present in \(n\) or more colors, with multiple paths connecting the start node to the end node; here \(n\) is a given parameter no greater than the total number of colors in the CdBG, and a path is simply a traversal from one node to another.

After the bubbles are found, they are aligned through Multiple Sequence Alignment (MSA) via MAFFT and then each MSA block is concatenated to form a supermatrix. The supermatrix is then fed into a Maximum-Likelihood program (IQ-TREE) to reconstruct the phylogenetic tree of the taxa.

How Bubbles are Found

kleuren uses a fairly simple and straightforward algorithm to find the bubbles, broken up into two steps: finding the end node and extending the paths.

Finding the End Node

kleuren iterates over the super-set (the union of all kmers from all taxa) as potential start nodes (in a dBG the nodes are kmers, thus \(node == kmer\)). Each kmer is queried in the CdBG, and the number of taxa (or colors, thus \(taxon == color\)) it appears in is counted to determine whether the number of colors for that kmer, \(c\), is greater than or equal to \(n\), where \(n\) is a parameter provided by the user.

If \(c \geq n\) then the node is a valid start node and a breadth-first search is performed starting from this node until another node is found where the number of colors that it contains is greater than or equal to \(n\), which then becomes the end node.
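A minimal Python sketch of this search might look like the following. To be clear, this is an illustration rather than kleuren’s actual implementation, and the cdbg layout (a dict mapping each kmer to its set of colors and its successor kmers) is an assumption made just for the example:

from collections import deque

def find_bubble_end(cdbg, start, n):
    # breadth-first search from the start node until we reach another
    # node whose number of colors is >= n; that node is the end node
    _, successors = cdbg[start]
    queue = deque(successors)
    seen = {start}
    while queue:
        kmer = queue.popleft()
        if kmer in seen:
            continue
        seen.add(kmer)
        colors, successors = cdbg[kmer]
        if len(colors) >= n:
            return kmer
        queue.extend(successors)
    return None  # no valid end node reachable

cdbg = {
    "ACT": ({0, 1, 2}, ["CTG"]),
    "CTG": ({0, 1}, ["TGA"]),
    "TGA": ({0, 1, 2}, []),
}
print(find_bubble_end(cdbg, "ACT", 3))  # TGA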

Extending the Paths

After an end node is discovered, the sequence of each path between the start and end nodes must be calculated. To spell out a path in a dBG, one collapses its edges by appending the last nucleotide of each next node to the previous node’s sequence. For example, if node ACT has an edge to node CTG, then the collapsed sequence is ACTG.
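In Python, collapsing a path of kmers might look like this (a sketch of the idea, not kleuren’s code):

def collapse_path(kmers):
    # consecutive kmers overlap by k - 1 bases, so each node past the
    # first contributes only its final nucleotide
    seq = kmers[0]
    for kmer in kmers[1:]:
        seq += kmer[-1]
    return seq

print(collapse_path(["ACT", "CTG", "TGA"]))  # ACTGA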

This is implemented as at most \(c\) depth-first searches, where \(c\) is the number of colors. The number of depth-first searches decreases as the number of paths with shared colors increases.

Further Reading

If you are interested in the details of our algorithm and would like to see some results, please check out our paper Whole Genome Phylogenetic Tree Reconstruction Using Colored de Bruijn Graphs (preprint). We are currently working on extending kleuren to improve its efficiency.

Genome Mapping Post Processing

So you have mapped your reads to the reference genome, but what comes next? How can you tell how many reads were aligned and where, and actually see what your mapping algorithm did? This post will show you how to analyze the results of your genome mapper using samtools and IGV.

samtools

SAM File Format

Most modern genome aligners output the aligned reads in the SAM file format. If you have a lot of time on your hands you can read through the SAM file to see where the reads mapped and inspect the information for each read (which probably adds up to millions of lines). Ain’t nobody got time for that. Instead, we are going to have samtools do the work for us.

Installation

  1. Download samtools.
  2. Go to the directory that you downloaded it and unzip the file
    • $ cd ~/Downloads
    • $ tar jxvf samtools-1.3.1.tar.bz2
  3. Go to the unzipped directory and prepare for install
    • $ cd samtools-1.3.1
    • $ ./configure
    • Note: You may get a library dependency error if you don’t have the development files for the ncurses library installed. If you do get this error, install either libncurses5-dev or ncurses-devel.
  4. If the ./configure command executed without errors, install samtools
    • $ sudo make install
    • Note: If you do not have super-user privileges then you can run $ make prefix=<directory of your choice> install
  5. Test if the install worked by running samtools
    • $ samtools

If the installation was successful then you should see a usage prompt for samtools, otherwise your installation failed (or you aren’t using the correct path to the samtools binary).

BAM File Format

The BAM file format is the binary cousin of the SAM file format: it holds the same information, just in binary form (this means that you can’t directly view a BAM file in a text editor like vim as you could a SAM file).

Converting SAM to BAM

We will now use samtools to convert our SAM file to a BAM file. The SAM file that we are going to use is found in the examples directory of the samtools-1.3.1 directory, the file is toy.sam. samtools is broken up into subprograms that perform different functions. In order to convert the SAM file we use samtools view, like so: $ samtools view -b -S examples/toy.sam -o examples/toy.bam.

Sorting BAM

Next, we want to sort the reads in the BAM file. This is a necessary step because many visualization programs cannot handle unsorted BAM files. To sort the BAM file, run: $ samtools sort examples/toy.bam -o examples/toy.sorted.bam.

Indexing BAM

We then have to index the sorted BAM file, which is also a necessary step for visualization tools. Run $ samtools index examples/toy.sorted.bam.

Viewing BAM Using samtools tview

Now we can finally view the aligned reads in our terminal! Run $ samtools tview examples/toy.sorted.bam examples/toy.fa and you will suddenly see four reads aligned to an extremely small genome. You should see something along these lines:

1         11              21        31         41        51
AGCATGTTAGATAA****GATA**GCTGTGCTAGTAGGCAG*TCAGCGCCATNNNNNNNN
  ........    ....  ......K.K......K. ..........
  ........AGAG....***...      ,,,,,    ,,,,,,,,,
    ......GG**....AA
  ..C...**** ...**...>>>>>>>>>>>>>>T.....

The numbers at the top signify positions in the genome, and the first line of characters represents the reference genome itself. The third line is the consensus sequence (see https://en.wikipedia.org/wiki/Consensus_sequence to discover what K represents). Each line under the consensus sequence is a read.

You may be wondering why most of the reads are made of .; that is because they match the reference genome. There are many different settings that you can play with in samtools tview; to view all of them, type ? and a help menu will appear.

Integrative Genome Viewer (IGV)

If you want more flexibility and a more robust way of viewing your aligned reads you can use IGV. It has a GUI, which makes things nice sometimes.

Installation

Download the appropriate files for the system that you are running. IGV is written in Java (it targets Java 7; I’m not sure if Java 8 will work, but it should), so you need to make sure that you have Java installed on your computer. Once you have Java installed and IGV downloaded, go ahead and unzip the downloaded file, if needed. Then, if you are on a Unix-like OS (Mac or Linux), you can run $ ./IGV_2.3.88/igv.sh to open up the IGV GUI.

Loading the Reference Genome

Once the program is open, click the Genomes button at the top of the window, then select Load Genome from File.... Select the file samtools-1.3.1/examples/toy.fa.

Loading the Reads

After the reference genome is loaded we can load in the reads by clicking the File button at the top of the window, then select Load from File.... Select the file samtools-1.3.1/examples/toy.sorted.bam.

Seeing the Reads

You probably can’t see any changes in the view yet; that’s OK. In the second row, click on the dropdown arrow that says All and select either ref or ref2. You will then be able to see what we saw in samtools tview, except it is way easier to figure out what everything means!

Suffix Trees

What is a Suffix Tree?

A suffix tree is a data structure that contains each suffix of a string, where a suffix is defined as the end part of a string. For example, the suffixes of the string banana are:

  • a
  • na
  • ana
  • nana
  • anana
  • banana

Notice that the entire string is also considered a suffix. We can also include the empty string as a suffix; to do so, we append a character that isn’t in the alphabet to the string. We will assume that the character $ is not in the alphabet. Our string is now banana$, and the suffixes are:

  • $
  • a$
  • na$
  • ana$
  • nana$
  • anana$
  • banana$
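Generating this list in Python is a quick exercise (my own illustration of the definition):

def suffixes(s):
    # shortest suffix first, to match the list above
    s += "$"
    return [s[i:] for i in range(len(s) - 1, -1, -1)]

print(suffixes("banana"))
# ['$', 'a$', 'na$', 'ana$', 'nana$', 'anana$', 'banana$']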

How to Construct a Suffix Tree

The easiest way to understand how to construct a suffix tree is to first construct a suffix trie, then collapse nodes to convert the trie into a tree. Here is the graphical representation of the suffix trie:

[figure: suffix trie]

Now we collapse the nodes that only have one child, and the resulting suffix tree is this:

[figure: suffix tree]

How to Use a Suffix Tree

Observe that each suffix is represented in this structure. This means that we can efficiently search for a suffix (or substring) by traversing the suffix tree. For example, if we were to search for the substring ana in the suffix tree for banana$, this would be the traversal path (node numbers refer to the figure above):

0 -> 2 -> 3

Note: We can stop at any node because we are searching for a substring rather than a suffix; if we were searching for a suffix, the search string would have been ana$.

Suffix Tree Construction

How would you design a naive algorithm to construct a suffix tree? You can easily create a simple algorithm that constructs a suffix tree in \(O(n^2)\) time, but Esko Ukkonen discovered a way to construct a suffix tree in \(O(n)\) (linear time).
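For reference, here is a minimal sketch of the naive quadratic approach, building the trie form (before collapsing nodes); this is my own illustration, not Ukkonen’s algorithm:

def build_suffix_trie(s):
    # insert every suffix, character by character: O(n^2) time
    s += "$"  # terminator, assumed not to be in the alphabet
    root = {}
    for i in range(len(s)):
        node = root
        for ch in s[i:]:
            node = node.setdefault(ch, {})
    return root

def contains(trie, pattern):
    # follow the pattern from the root; reaching the end means the
    # pattern occurs somewhere in the original string
    node = trie
    for ch in pattern:
        if ch not in node:
            return False
        node = node[ch]
    return True

trie = build_suffix_trie("banana")
print(contains(trie, "ana"))  # True
print(contains(trie, "nab"))  # False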

Genome Assembly

What is Genome Assembly?

Genome assembly is often compared to putting a puzzle together. In this analogy, the pieces of the puzzle are the individual reads that we get from a DNA sequencer. The ultimate goal of putting a puzzle together is to find out where each and every piece fits exactly, so that you can see the completed picture. While this is also true of genome assembly, we need to be realistic. The first step in genome assembly is to generate contigs, or in the words of our puzzle analogy, to combine the pieces that we know go together.

What is a Contig?

A contig is short for contiguous sequence. It is a sequence that is longer than the reads and shorter than the genome (technically the whole assembled genome could be considered a contig in an organism with a single chromosome, but for our purposes a contig is shorter than the entire genome). If you have a puzzle that is a picture of a beach and an ocean, you would combine all of the tan-colored pieces and then all of the blue-colored pieces. You may even combine the pieces that make up the colorful beach umbrellas littering the beach. When you do this you are generating contigs! You are making small chunks of the whole because they are easy to identify, and it is the same with genome assembly.

How Do We Discover Contigs?

Well, we know that the pieces of our puzzle are reads. A regular puzzle piece has four sides with different shapes that define exactly where it should go. Our reads also have “sides” that tell us the proper place each one should take in a contig. These “sides” are revealed when we break the reads up into kmers and insert them into a de Bruijn graph. We can then construct the contigs by following the non-branching nodes (a node that has exactly one incoming edge and one outgoing edge).

Imagine that we are assembling an incredibly small genome, with way too few reads, which are the following:

ACTGT
TCTGT
CTGTT
CTGTA
GCATA
TGTTA
GTTAC

The reads are of length 5. The de Bruijn graph using a kmer length of 5 (the entire read) for this set of reads would look like this (the non-branching node is in red):

[figure: de Bruijn graph, kmer length 5]

As you can see there is only one non-branching node in the graph. We can only generate the following contigs:

ACTGT
TCTGT
CTGTTA
CTGTA
GCATA
TGTTA
GTTAC

I hope it is clear that this is not a very good assembly. Why isn’t it good? For the most part, all of our contigs are the same length as our reads, except for one, which is only one base pair longer than the read length. This assembly is equivalent to fitting together just one pair of puzzle pieces. Let’s see how we can improve it.

Kmer Length

The length of kmer that you use to construct your de Bruijn graph will greatly influence your assembly, for better or for worse. What will happen if we decrease the kmer length from 5 to 3?

Here is the de Bruijn graph of kmer size 3:

[figure: de Bruijn graph, kmer length 3]

What is up with all of the edges? Each edge represents an occurrence of a kmer; for example, the node ACT has 4 edges to CTG because there are 4 occurrences of CTG. Let’s clean this graph up by giving weights to the edges, where the number on each edge represents how many times that edge is repeated.

Here is the de Bruijn graph of kmer size 3 with weighted edges (the non-branching nodes are in red):

[figure: de Bruijn graph, kmer length 3, weighted edges]
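To make the construction concrete, here is a small Python sketch that builds a weighted graph from the reads above. It is my own illustration (assemblers differ in their conventions); it counts consecutive kmer pairs within each read:

from collections import defaultdict

def build_debruijn(reads, k):
    # nodes are kmers; an edge joins consecutive kmers in a read
    # (they overlap by k - 1 bases), weighted by how often it occurs
    edges = defaultdict(int)
    for read in reads:
        kmers = [read[i:i + k] for i in range(len(read) - k + 1)]
        for a, b in zip(kmers, kmers[1:]):
            edges[(a, b)] += 1
    return edges

reads = ["ACTGT", "TCTGT", "CTGTT", "CTGTA", "GCATA", "TGTTA", "GTTAC"]
for (a, b), weight in sorted(build_debruijn(reads, 3).items()):
    print(f"{a} -> {b} (weight {weight})")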

Let’s see if a kmer length of 3 is any better; we definitely have more branching nodes, but will that lead to longer contigs? Essentially we have made our puzzle pieces smaller, so just because we can put together more puzzle pieces doesn’t mean that we are building more of the puzzle. Here are the contigs generated by the graph of kmer length 3:

ACT
TCT
CTG
TGTTA
TGTA
GCATA
TAC

Well, compared to the assembly using kmers of length 5, we have the same number of contigs with a shorter average length, and the longest contig is only as long as the reads. Is this assembly better or worse than the last one? Short answer: yes. You have to define “better” when comparing assemblies. You may want the longest average contig length, or simply the longest single contig. You may have some other metric, like N50, by which you determine which assembly is “better.”
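For the curious, N50 is the length \(L\) such that contigs of length at least \(L\) cover half or more of the total assembled bases. A quick sketch of how you might compute it:

def n50(contigs):
    # walk the contig lengths from longest to shortest until we have
    # covered at least half of the total assembled bases
    lengths = sorted((len(c) for c in contigs), reverse=True)
    half = sum(lengths) / 2
    covered = 0
    for length in lengths:
        covered += length
        if covered >= half:
            return length

print(n50(["CTGTTA", "ACTGT", "TCTGT", "CTGTA", "GCATA", "TGTTA", "GTTAC"]))  # 5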

Filtering Errors

One way of accounting for errors is to remove edges below a certain weight. An edge with a higher weight occurs more often, and an edge that occurs more often is more likely to represent real sequence rather than a sequencing error.
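Building on the graph sketch above, such a filter is a one-liner (again, just an illustration):

def filter_edges(edges, min_weight=2):
    # keep only edges observed at least min_weight times
    return {edge: w for edge, w in edges.items() if w >= min_weight}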

What is Next?

You may be asking yourself, “I can see how we generate contigs, and how that can be useful, but we still don’t have one sequence that represents the genome. We haven’t fully completed the puzzle. How do we do that?” To fully complete the puzzle, to put all of these contigs together, you would have to construct scaffolds. A scaffold is a representation of how contigs relate to each other that also accounts for gaps in the sequence. Depending on how much data you have, you may not be able to create one continuous sequence.

NOTE: Scaffolding is not required for the CS 418/BIO 365 Genome Assembler project. Generating contigs is good enough!

How to Assemble using Velvet

Genome Assembly using Velvet

What the heck is Genome Assembly?

Genome assembly is the process of constructing long contiguous sequences from shorter sequences. Think of putting a puzzle together, then scale the problem up to a genome: same approach, just a lot more data.

What the heck is Velvet?

Velvet is a genome assembler that uses a de Bruijn graph to generate contigs. If you are interested in reading the paper describing how Velvet works, feel free to read Velvet: Algorithms for de novo short read assembly using de Bruijn graphs.

Installing Velvet

I’d say that the most important part of using software is figuring out how to install it. Sometimes that can be harder than you think.

Here is how you install Velvet:

  1. Download the source
    • Optional: Check out the sweet Velvet website complete with web 2.0 design. Al Gore would be proud.
  2. Go to the directory in which you downloaded the file $ cd ~/Downloads and unzip the file $ tar zxvf velvet_1.2.10.tgz
  3. Go to the just unzipped directory $ cd velvet_1.2.10 and compile Velvet by issuing the command $ make
    • Error warning!! If you get an error that says something along the lines of fatal error: zlib.h: No such file or directory, then try installing the package zlib1g-dev and running $ make again
  4. If you didn’t get any errors, it looks like you have installed Velvet! If you got errors, Google the error and figure out how to fix it.

Running Velvet

To execute the Velvet program, make sure that you are in the velvet_1.2.10 directory and type $ ./velveth; it should print a short help message. If it doesn’t, check that you are in the correct directory by issuing the command $ pwd.

Assembling the Zika virus genome

Prepping the files for assembly

We have some reads from the Zika virus, fresh from Florida. We want to assemble the Zika virus genome to help find a cure. Download the reads zika.read1.fastq and zika.read2.fastq, then run the command $ ./velveth zika_genome 20 -fastq -shortPaired ~/Downloads/zika.read1.fastq ~/Downloads/zika.read2.fastq. This is a sort of preprocessing command that prepares your dataset so that Velvet can assemble it. Here is what each parameter means:

  • ./velveth: the program that we use
  • zika_genome: the output directory for all the files
  • 20: the hash (in other words, kmer) size that we use; you will want to play around with this
  • -fastq: the type of input files that we have
  • -shortPaired: the type of input reads that we have
  • ~/Downloads/zika.read1.fastq: the first file of reads
  • ~/Downloads/zika.read2.fastq: the second file of reads

Note: You can have an unlimited number of input files.

Assembling the reads

We are now going to use the program ./velvetg to actually construct the de Bruijn graph and assemble the reads. Issue the command $ ./velvetg zika_genome/ -cov_cutoff 4 -min_contig_lgth 100, and now you have assembled your first genome! Here is what each parameter means:

  • ./velvetg: the program that we use
  • -cov_cutoff 4: removes the nodes that have a coverage less than 4
  • -min_contig_lgth 100: reports only the contigs that are longer than 100 bases

Viewing the generated contigs

Now we can view the contigs we generated: $ cd zika_genome and then $ less contigs.fa. Feel free to explore around in this directory for other cool stuff about the contigs!

Happy assembling!

Above and beyond…

You can compare your generated contigs with the NCBI Reference Sequence for the Zika virus to see how good (or how poor) your genome assembly actually is!

Python Tutorial CS 418/BIO 365

Python Tutorial

CS 418/BIO 365

30 August 2016

# This is a comment
""" This is a 
    multi-
    line
    comment"""
''' This is also a 
    multi-
    line
    comment'''
# How do I declare variables?

# String
my_string = 'Hello, CS 418/BIO 365!!'
my_other_string = "Hello, friends."
# Numbers
my_number = 418
my_other_number = 365.578
# List
my_list = ['this', 'is', 'fun', 56]
# or
my_other_list = []
# or
my_other_list2 = list()

my_other_list.append('bioinformatics')
# Dictionary (Map)
my_dict = {'key': 'value', 'other_key': 'other_value'}
# or
my_other_dict = {}
# or
my_other_dict2 = dict()

my_other_dict['my_key'] = 'my_value'
# How do I make loops?

# NOTE: Whitespace matters!!

# for loop
for i in my_list:
    print(i)
    
# NOTE: "print(i)" is used in Python 3, "print i" is used in Python 2 
this
is
fun
56
# for loop over range
for i in range(0, 5):
    my_other_list.append(i)
print(my_other_list)
['bioinformatics', 0, 1, 2, 3, 4]
# while loop
while len(my_list) > 0:
    del my_list[-1]
    print(my_list)
['this', 'is', 'fun']
['this', 'is']
['this']
[]
# break 
for val in my_other_list:
    if val == 2: # use ==, not "is", to compare values
        break
    else:
        print(val)
bioinformatics
0
1
# continue
for val in my_other_list:
    if val != 2:
        continue
    print(val)
2
# How do I perform file I/O?

# file input
import sys # this imports the library to get command line arguments
with open(sys.argv[1]) as fh: # sys.argv[1] is the first command line argument, sys.argv[2] is the second ... and so on.
    my_first_line = next(fh) # Python 2 & 3
    my_next_line = fh.next() # Python 2 
    
with open(sys.argv[1]) as other_fh:
    # iterate over all lines in file
    for line in other_fh:
        print(line)
ACGTTGCATGTCGCATGATGCATGAGAGCT

4
# file output
# NOTE: the second argument ('w') makes the file writable
with open('./my_file.txt', 'w') as writable:
    writable.write('This is my test file\n')
    writable.write(my_first_line)
# How do I manipulate strings?

# string slice
my_string = 'banana'
print(my_string[3])
print(my_string[0:5])
print(my_string[1:5])
print(my_string[-1])
print(my_string[0:-2])
a
banan
anan
a
bana

Rosalind Tutorial

You can register for the course at: http://rosalind.info/classes/enroll/e464d00a10/

Use your NetID as your username!!

The first problem is found here: http://rosalind.info/problems/ba1b/?class=322

# What do we need to do in pseudo code?

''' -Read in the sequence and the 
    length of the kmer (a kmer is essentially a substring)
    -Break up the sequence into kmers
    -Count each kmer
    -Find the kmer(s) with the highest counts
    -Print out the highest count kmers'''
# Read in the sequence and the length of the kmer from a file
import sys
with open(sys.argv[1]) as file:
    seq = next(file).strip() # REMEMBER: in Python 2 use file.next()
    kmer_len = int(next(file).strip())
# Break up the sequence into kmers
    for i in range(0, len(seq) - kmer_len + 1): # stop before short trailing kmers
        kmer = seq[i : i + kmer_len]
# Count each kmer
    counts = {}
    for i in range(0, len(seq) - kmer_len + 1):
        kmer = seq[i : i + kmer_len]
        if kmer not in counts:
            counts[kmer] = 1
        else:
            counts[kmer] += 1
# Find the kmer(s) with the highest counts
    max_count = 0
    max_kmers = []
    for kmer, count in counts.items(): # NOTE: use counts.iteritems() in Python 2
        if count > max_count:
            max_count = count
            max_kmers.clear() # NOTE: in Python 2 use del max_kmers[:]
            max_kmers.append(kmer)
        elif count == max_count:
            max_kmers.append(kmer)
# Print out the kmers with the highest counts
    print(" ".join(max_kmers))