# Colored de Bruijn Graphs

Jul 11 2018I have really wanted to write this post for a long time, but seem to only now get around to it. For more than a year now my research in the Computational Sciences Lab (CSL) at Brigham Young University (BYU) we have been researching various applications of the Colored de Bruijn Graph (CdBG). It all started when we explored some novel phylogenetic reconstruction methods in the CS 601R class during Winter semester 2017. We (or at least, I) kept being drawn back to CdBG’s and their potential for phylogeny reconstruction. Here are some of the things that I have learned along the way!

## Related Work

As with most scientific endeavors, this project certainly stands on the shoulders of giants. Some of these giants include the following papers and their respective authors. I think that they have done amazing work and I admire their methods.

- De novo assembly and genotyping of variants using colored de Bruijn graphs :: Zamin Iqbal et al.’s original paper introducing the CdBG’s application to Bioinformatics. Even though their implementation isn’t very efficient, it established the usefulness of the data structure in the Bioinformatic community.
- TwoPaCo: an efficient algorithm to build the compacted de Bruijn graph from many complete genomes :: Ilia Minkin et al.’s work on discovering bubbles within the CdBG structure, which influenced our thinking and guided our work.
- An assembly and alignment-free method of phylogeny reconstruction from next-generation sequencing data :: Huan Fan et al.’s application of a fantastic distance based phylogenetic tree reconstruction algorithm that I have found to be very accurate (and talk about fast). I love the simplicity of their model of evolution (based on the Jaccard Index), I find that it is very elegant.

## Motivation

We want to use the CdBG to reconstruct phylogenetic trees because it is very efficient computationally. The CdBG can be constructed in \(O(n)\) time and space and it can utilize whole genome sequences, which is a shortcoming of many of the traditional phylogenetic tree reconstruction algorithms.

Furthermore, we also figured that the CdBG contains more information than many of the kmer counting methods, and if they can perform so well then the CdBG will only be able to perform better because it not only stored the kmers (as nodes in the graph), but it also stores the context in which those kmers occur (as edges where \(k - 1\) basepairs overlap on either end of the kmer).

## Our Contribution

`kleuren`

In order to prove our hypothesis, we did what every self-respecting Computer
Scientist would do, we wrote a program to figure out if it worked. We call our
program `kleuren`

, which is Dutch for “colors” (referring to the colors in the
CdBG).

`kleuren`

works by finding *bubble* regions of the CdBG. A bubble is defined as
a subgraph of the CdBG that consists of a start and end node that are present in
\(n\) or more colors, and there are multiple paths connecting the start node to
the end node; where \(n\) is a given parameter and is no greater than the total
number of colors in the CdBG and a path is simply a traversal from one node to
another.

After the bubbles are found, they are aligned through Multiple Sequence Alignment (MSA) via MAFFT and then each MSA block is concatenated to form a supermatrix. The supermatrix is then fed into a Maximum-Likelihood program (IQ-TREE) to reconstruct the phylogenetic tree of the taxa.

### How Bubbles are Found

`kleuren`

uses fairly simple and straightforward algorithms to find
the bubbles, which is broken up into two steps: Finding the End Node
and Extending the Paths.

#### Finding the End Node

`kleuren`

iterates over the super-set (the union of all kmers from all
taxa) as potential start nodes (in a dBG the nodes are kmers, thus
\(node == kmer\)). Given a kmer, it is queried in the CdBG and the number
of taxa (or colors, thus \(taxon == color\)) is calculated to determine if
the number of colors for that kmer, \(c\), is greater than or equal to \(n\),
where \(n\) is a parameter provided by the user.

If \(c \geq n\) then the node is a valid start node and a breadth-first search is performed starting from this node until another node is found where the number of colors that it contains is greater than or equal to \(n\), which then becomes the end node.

#### Extending the Paths

After an end node is discovered, the sequence of each path between the
start and end nodes must be calculated. In order to discover a path in
a dBG, one must *collapse* edges by appending the last nucleotide of
the next node to the previous node’s sequence. For example, if a node
is `ACT -> CTG`

, then the collapsed sequence will turn out to be
`ACTG`

.

This is implemented as at most \(c\) depth-first searches, where \(c\) is the number of colors. The number of depth-first searches decreases as the number of paths with shared colors increases.

## Further Reading

If you are interested in the details of our algorithm and would like to see some
results, please check out our paper Whole Genome Phylogenetic Tree
Reconstruction Using Colored de Bruijn Graphs (preprint). We are currently
working on extending `kleuren`

to improve its efficiency.