Episode 29: Mashtree software deep dive
In this microbinfie podcast episode, Lee Katz discusses the development and implementation of MASHtree, a bioinformatics tool that uses MASH and the MinHash algorithm to rapidly compare microbial genomes and build approximate phylogenetic trees.
MASHtree reduces each genome to a small sketch of its k-mers (for example, 15 MB of raw data becomes an 8 KB sketch file), which greatly speeds up genome comparisons.
The core of the algorithm hashes each k-mer to an integer, sorts the integers, and keeps only the first (smallest) 1,000. The result is a much smaller dataset that is well suited to rapid computation.
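The sketching step described above can be illustrated with a minimal Python sketch. This is not MASH's code: the function names (`kmers`, `hash_kmer`, `sketch`) are hypothetical, `hashlib.blake2b` stands in for the MurmurHash3 function that MASH uses, and canonical (strand-independent) k-mers are left out for brevity.

```python
import hashlib

def kmers(seq, k):
    """Yield every overlapping k-mer in the sequence."""
    for i in range(len(seq) - k + 1):
        yield seq[i:i + k]

def hash_kmer(kmer):
    """Map a k-mer to a 64-bit integer (assumed stand-in for MurmurHash3)."""
    digest = hashlib.blake2b(kmer.encode("ascii"), digest_size=8).digest()
    return int.from_bytes(digest, "big")

def sketch(seq, k=21, sketch_size=1000):
    """Return the sketch_size smallest distinct k-mer hashes, sorted ascending."""
    hashes = {hash_kmer(kmer) for kmer in kmers(seq.upper(), k)}
    return sorted(hashes)[:sketch_size]

# Toy usage with a made-up sequence and deliberately small parameters.
toy_genome = "ACGTTGCATGCAAGCTTAGC" * 50
toy_sketch = sketch(toy_genome, k=5, sketch_size=10)
print(len(toy_sketch), toy_sketch[:3])
```

If each retained hash is stored as an 8-byte integer (an assumption), 1,000 of them come to roughly 8 KB, which is in line with the sketch size quoted above.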
The MASH distances generated by the tool correlate well with Average Nucleotide Identity (ANI), giving a good approximation of phylogenetic relationships without requiring fully assembled genomes.
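As a rough illustration of how a distance can be derived from two sketches, the following Python sketch estimates the Jaccard index from the smallest hashes of the merged sketches and converts it with the Mash distance formula, D = -(1/k) ln(2j / (1 + j)), which approximates 1 - ANI. The function names are hypothetical and the inputs are plain lists of integer hashes; this is a simplified sketch, not the tool's implementation.

```python
import math

def jaccard_estimate(sketch_a, sketch_b, sketch_size):
    """Estimate the Jaccard index from the smallest hashes of the merged sketches."""
    a, b = set(sketch_a), set(sketch_b)
    union_bottom = sorted(a | b)[:sketch_size]
    if not union_bottom:
        return 0.0
    shared = sum(1 for h in union_bottom if h in a and h in b)
    return shared / len(union_bottom)

def mash_distance(jaccard, k):
    """Convert a Jaccard estimate into a Mash distance for k-mer size k."""
    if jaccard <= 0:
        return 1.0  # no shared hashes: treat as maximally distant
    if jaccard >= 1:
        return 0.0  # identical sketches
    return -math.log(2 * jaccard / (1 + jaccard)) / k

# Toy usage with made-up hash values.
j = jaccard_estimate([1, 2, 3, 4], [3, 4, 5, 6], sketch_size=4)
print(j, mash_distance(j, k=21))
```

Because only the sketches are compared, the cost of each pairwise distance depends on the sketch size rather than on the genome size.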
MASHtree builds the tree with a neighbor-joining algorithm. The result is an approximation rather than a full phylogeny: it clusters genomes by proximity and similarity without inferring evolutionary ancestry.
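To make the tree-building step concrete, here is a compact, textbook-style neighbor-joining sketch in Python. It is illustrative only and is not MASHtree's own code; the function name `neighbor_joining` and the toy distance matrix are assumptions made for the example. It takes taxon labels and a symmetric distance matrix and returns an unrooted tree in Newick format.

```python
def neighbor_joining(labels, matrix):
    """Return an unrooted Newick tree built from a symmetric distance matrix."""
    if len(labels) < 3:
        raise ValueError("neighbor joining needs at least three taxa")
    # Active nodes map an id to its Newick fragment; d holds pairwise distances.
    nodes = {i: label for i, label in enumerate(labels)}
    d = {(i, j): matrix[i][j] for i in nodes for j in nodes if i != j}
    next_id = len(labels)

    while len(nodes) > 3:
        n = len(nodes)
        # Total distance from each active node to all other active nodes.
        total = {i: sum(d[i, j] for j in nodes if j != i) for i in nodes}
        # Join the pair that minimises the neighbor-joining Q criterion.
        pairs = [(i, j) for i in nodes for j in nodes if i < j]
        a, b = min(pairs, key=lambda p: (n - 2) * d[p] - total[p[0]] - total[p[1]])
        # Branch lengths from the new internal node to a and b.
        len_a = 0.5 * d[a, b] + (total[a] - total[b]) / (2 * (n - 2))
        len_b = d[a, b] - len_a
        # Distances from the new internal node to every remaining node.
        new = next_id
        next_id += 1
        for k in nodes:
            if k not in (a, b):
                d[new, k] = d[k, new] = 0.5 * (d[a, k] + d[b, k] - d[a, b])
        nodes[new] = f"({nodes[a]}:{len_a:g},{nodes[b]}:{len_b:g})"
        del nodes[a], nodes[b]

    # Three nodes remain: connect them through one central node.
    i, j, k = nodes
    len_i = 0.5 * (d[i, j] + d[i, k] - d[j, k])
    len_j = 0.5 * (d[i, j] + d[j, k] - d[i, k])
    len_k = 0.5 * (d[i, k] + d[j, k] - d[i, j])
    return f"({nodes[i]}:{len_i:g},{nodes[j]}:{len_j:g},{nodes[k]}:{len_k:g});"

# Toy usage: five hypothetical genomes with made-up pairwise distances.
toy_labels = ["a", "b", "c", "d", "e"]
toy_matrix = [
    [0, 5, 9, 9, 8],
    [5, 0, 10, 10, 9],
    [9, 10, 0, 8, 7],
    [9, 10, 8, 0, 3],
    [8, 9, 7, 3, 0],
]
print(neighbor_joining(toy_labels, toy_matrix))
```

The output is a clustering of the inputs by distance. As the episode stresses, a tree built this way from approximate distances should be read as a dendrogram rather than as an inferred evolutionary history.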
Tools and Methodologies:
- MASH (MinHash): MinHash was originally developed for the AltaVista search engine to identify duplicate web pages quickly. MASH repurposes it for bioinformatics to compare genomic data efficiently.
- k-mers and Sketching: Used to reduce the size of the genomic data for faster processing. Each k-mer is converted into an integer, and a subset of those integers forms a "sketch" that represents the whole genome.
- Neighbor-Joining Algorithm: Selected over UPGMA because it performed better at approximating the desired phylogenetic trees.
- Software Development: MASHtree is written in Perl, chosen for its practicality and multithreading capabilities at the time of initial development. It is packaged for easy installation via CPAN and includes comprehensive unit testing.
Challenges and Limitations:
- The main limitation is that MASHtree does not produce true phylogenetic trees, only approximate dendrograms or clustering trees. The tool cannot infer ancestral states or detailed evolutionary paths.
- Scalability: Earlier software, such as SaffronTree, did not scale well because computational demand grew steeply as more genomes were added.
- Error profiles differ between sequencing technologies (e.g., MinION), so data from each technology need separate testing to ensure compatibility and accuracy.
Future Directions and Improvements:
- There are aspirations to generalize the MASHtree pipeline so that it accepts various distance metrics as input, broadening its applicability across different types of genomic comparisons.
- The community has incorporated MASHtree into broader pipelines, notably in Australia, where it is part of rapid outbreak data assessment.