Colored de Bruijn Graphs

Cole Lyman

11 July 2018

I have really wanted to write this post for a long time, but seem to only now get around to it. For more than a year now my research in the Computational Sciences Lab (CSL) at Brigham Young University (BYU) we have been researching various applications of the Colored de Bruijn Graph (CdBG). It all started when we explored some novel phylogenetic reconstruction methods in the CS 601R class during Winter semester 2017. We (or at least, I) kept being drawn back to CdBG’s and their potential for phylogeny reconstruction. Here are some of the things that I have learned along the way!

Related Work {#related-work}

As with most scientific endeavors, this project certainly stands on the shoulders of giants. Some of these giants include the following papers and their respective authors. I think that they have done amazing work and I admire their methods.

Motivation {#motivation}

We want to use the CdBG to reconstruct phylogenetic trees because it is very efficient computationally. The CdBG can be constructed in O(n)O(n) time and space and it can utilize whole genome sequences, which is a shortcoming of many of the traditional phylogenetic tree reconstruction algorithms.

Furthermore, we also figured that the CdBG contains more information than many of the kmer counting methods, and if they can perform so well then the CdBG will only be able to perform better because it not only stored the kmers (as nodes in the graph), but it also stores the context in which those kmers occur (as edges where k1k - 1 basepairs overlap on either end of the kmer).

Our Contribution {#our-contribution}

kleuren {#kleuren}

In order to prove our hypothesis, we did what every self-respecting Computer Scientist would do, we wrote a program to figure out if it worked. We call our program kleuren, which is Dutch for “colors” (referring to the colors in the CdBG).

kleuren works by finding bubble regions of the CdBG. A bubble is defined as a subgraph of the CdBG that consists of a start and end node that are present in nn or more colors, and there are multiple paths connecting the start node to the end node; where nn is a given parameter and is no greater than the total number of colors in the CdBG and a path is simply a traversal from one node to another.

After the bubbles are found, they are aligned through Multiple Sequence Alignment (MSA) via MAFFT and then each MSA block is concatenated to form a supermatrix. The supermatrix is then fed into a Maximum-Likelihood program (IQ-TREE) to reconstruct the phylogenetic tree of the taxa.

How Bubbles are Found {#how-bubbles-are-found}

kleuren uses fairly simple and straightforward algorithms to find the bubbles, which is broken up into two steps: Finding the End Node and Extending the Paths.

Finding the End Node {#finding-the-end-node}

kleuren iterates over the super-set (the union of all kmers from all taxa) as potential start nodes (in a dBG the nodes are kmers, thus node==kmernode == kmer). Given a kmer, it is queried in the CdBG and the number of taxa (or colors, thus taxon==colortaxon == color) is calculated to determine if the number of colors for that kmer, cc, is greater than or equal to nn, where nn is a parameter provided by the user.

If cnc \geq n then the node is a valid start node and a breadth-first search is performed starting from this node until another node is found where the number of colors that it contains is greater than or equal to nn, which then becomes the end node.

Extending the Paths {#extending-the-paths}

After an end node is discovered, the sequence of each path between the start and end nodes must be calculated. In order to discover a path in a dBG, one must collapse edges by appending the last nucleotide of the next node to the previous node’s sequence. For example, if a node is ACT -> CTG, then the collapsed sequence will turn out to be ACTG.

This is implemented as at most cc depth-first searches, where cc is the number of colors. The number of depth-first searches decreases as the number of paths with shared colors increases.

Further Reading {#further-reading}

If you are interested in the details of our algorithm and would like to see some results, please check out our paper Whole Genome Phylogenetic Tree Reconstruction Using Colored de Bruijn Graphs (preprint). We are currently working on extending kleuren to improve its efficiency.

Join my Friend's Newsletter

Join next to no one to be very occasionally updated on my life by email.

I won't send you spam. Unsubscribe at any time (I won't know about it).