Hello, welcome to a new thing we're doing called software deep dives, where we
interview the author of a bioinformatics software package. Today we're having a
chat about some of the software we've created ourselves because behind all of
our software are quirky details that never make it to the final paper. Today Lee
is in the hot seat with MASHtree. So Lee, what problem does MASHtree try to
solve? MASHtree is one of the most rapid methods for creating a tree from whole
genome sequences. So before MASHtree, the most rapid method might have been
7-gene MLST or some other low-resolution method. But the issue is that we have,
you know, 50-megabyte or 200-megabyte FASTQ files coming off of the sequencer,
and we just didn't have a great way of making a rapid tree to compare them
quickly before, you know, spending an hour on each genome to assemble it and
characterize it with wgMLST or any other method. This is possibly the fastest method to
compare genomes right off the shelf. So I made something similar to that as well
back in 2017 called SaffronTree, published in JOSS by the way, but it's got two
citations and it's the opposite of fast. It's fast if you have two or three
genomes, but it scales exponentially. Oh, wow. I'm so sorry. I wish that I saw
that before I started MASHtree actually. No, no, no, no. I mean, obviously
you've done it much better, you know, if it actually scales properly. What was
its method? So the method literally was just to get all the k-mers in each FASTQ
file or FASTA file and then just do an intersection and build a tree from that.
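
The exact k-mer intersection approach described above can be sketched roughly as
follows. This is a minimal illustration, not the code of either tool: the
function names are hypothetical, reverse complements are ignored, and every
k-mer is held in memory, which is one reason this kind of approach gets slow on
real read sets.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of one sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def sample_kmers(seqs, k):
    """Union of the k-mer sets of every read or contig in one sample."""
    result = set()
    for seq in seqs:
        result |= kmers(seq, k)
    return result


def jaccard_distance(a, b):
    """1 - |A intersect B| / |A union B| for two k-mer sets."""
    union = a | b
    if not union:
        return 0.0  # two empty samples are treated as identical
    return 1.0 - len(a & b) / len(union)


# Hypothetical toy samples; real input would be parsed from FASTQ or FASTA.
sample_1 = sample_kmers(["ACGTACGTGA", "TTGACCA"], k=4)
sample_2 = sample_kmers(["ACGTACGTGA", "TTGACGA"], k=4)
distance = jaccard_distance(sample_1, sample_2)
```

Every pair of samples needs a full set intersection here, so both the memory and
the run time grow with the total number of distinct k-mers.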
That's exactly what I would have done too. I really do wish I saw your software
before I started MASHtree. But it scales very poorly. So that is obviously a
slight downside compared to your software. Just to clarify, Lee, both tools
work directly with the raw sequencing data, right? Correct. So yeah, so
you're saving a lot of time by avoiding a lot of genome assembly and
processing that way. Where does the MASH in the name come from? Okay. So MASH
comes from Adam Phillippy's group, and they made software called MASH, which
stands for MinHash, the name of the algorithm. And originally, well, they
borrowed this algorithm from an older technology, MinHash, from the 90s. I think
it was first used in the search engine AltaVista. Wow, that's going back.
Yeah, if you remember going way back. And I believe they used it so that they
wouldn't have duplicate web pages in their index, so they could be a lot faster
in their indexing. And I mean, the goal is just to look at a new page and to see
whether or not it's the same. That makes sense. Like, yeah, if you have
the same web page over and over, you don't want that to be every single search
result. Yeah. So it's incredibly useful. And we wanted the speed also in
bioinformatics. And I'm grateful that Adam's group brought it to bioinformatics.
It turns each k-mer into an integer and there's some bloom filter stuff in
between that I'm not going to go into. That's out of scope. But it sorts the
integers that come out of those k-mers numerically, and it keeps the first
thousand integers. And those first thousand integers are way smaller than the
original FASTQ file or the genome itself. You know, a five megabyte genome, or
a 15 megabyte raw FASTQ file, might turn into just an eight kilobyte sketch
file. So it's a huge speed up, and it reduces the footprint a lot. And so
MASHtree takes those sketches, compares them using the MASH dist algorithm,
creates MASH distances, and from those distances it creates a neighbor joining
tree.
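
The sketching and distance steps described above could look roughly like this.
It is a simplified sketch under stated assumptions, not MASH's implementation:
`blake2b` stands in for MASH's own hash function, reverse complements and the
Bloom filter step are left out, and all function names are hypothetical. The
distance formula D = -(1/k) ln(2j / (1 + j)) is the one given in the MASH paper,
where j is the estimated Jaccard index and k the k-mer length.

```python
import hashlib
import math


def kmer_hashes(seq, k):
    """Hash every k-mer of seq to a 64-bit integer (blake2b is a stand-in)."""
    hashes = set()
    for i in range(len(seq) - k + 1):
        digest = hashlib.blake2b(seq[i:i + k].encode(), digest_size=8).digest()
        hashes.add(int.from_bytes(digest, "big"))
    return hashes


def sketch(seq, k=21, size=1000):
    """Bottom sketch: the `size` numerically smallest distinct hashes, sorted."""
    return sorted(kmer_hashes(seq, k))[:size]


def jaccard_estimate(sketch_a, sketch_b, size):
    """Estimate the Jaccard index from two sketches.

    Take the `size` smallest hashes of the merged sketches and count how many
    of them occur in both inputs.
    """
    a, b = set(sketch_a), set(sketch_b)
    merged = sorted(a | b)[:size]
    if not merged:
        return 0.0
    shared = sum(1 for h in merged if h in a and h in b)
    return shared / len(merged)


def mash_distance(j, k):
    """Convert a Jaccard estimate j into a MASH distance for k-mer length k."""
    if j <= 0.0:
        return 1.0  # nothing shared: maximum distance
    if j >= 1.0:
        return 0.0  # identical sketches
    return -math.log(2.0 * j / (1.0 + j)) / k


# Hypothetical toy genomes; real sketches come from FASTQ or FASTA input.
genome_a = "ACGTTGCAAGGCTTAACCGGTTAGCATCGATCGGATTACA" * 3
genome_b = genome_a[:60] + "TTTTGGGGCCCCAAAA" + genome_a[76:]
sk_a = sketch(genome_a, k=11, size=20)
sk_b = sketch(genome_b, k=11, size=20)
d_ab = mash_distance(jaccard_estimate(sk_a, sk_b, size=20), k=11)
```

Only the sketches need to be kept, so comparing two samples touches a fixed
number of integers no matter how large the input files were.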
Originally I tried UPGMA, but neighbor joining turned out to be a better
approximation of the trees that we're looking for. I always need to give this
caveat. I always forget to give the caveat right away. MASHtree creates trees,
dendrograms, but it does not create a phylogeny. Neighbor joining would be an
evolutionary model? Yeah, I mean, neighbor joining is an evolutionary model, but
the difference is there's no ancestry inferred here on the nodes that neighbor
joining creates. We can't say that this is an ancestral state of these leaves.
And there's no discussion of evolution implied in the tree at all. It's just
clustering. These genomes seem to be closer to these genomes. So why does it
exist? At the time we had, you know, genomes coming off the shelf. It was
probably 2014 to 2016. I'd been thinking about this problem, and we kept getting
asked to compare genomes and make a rapid tree while we were in the middle of
outbreak analyses or doing population structure. And one day I just did
MASH on all of our genomes and compared them and then I made a neighbor joining
tree just stepwise and it turned out to be a pretty good tree. It described what
we needed to see and I started pipelining it. So it was really for my own
benefit and I'm glad that the rest of the community sees the value in it too.
Can you use it with long reads, maybe stream it off a MinION as reads are
produced? Yeah, I would love to test it with MinION. I haven't done too much
with that yet, and I think it does need separate tests for MinION just because
of the difference in the error profile. So anyway, who asked for this to be made and
commissioned it? This is going to sound egotistical, but I just sort of came up
with it. I did. I mean, this problem was being thrown at me every single day in
my branch and I just, it was just one of those times where I had enough and I
had to just do it. And I asked Adam Phillippy five different ways if he wanted
to do this project, if he was going to do it, if he was okay with me doing it,
if it was okay if I did it this way, if it was okay if I did a poster, if it was
okay if I published it, like, every single step of the way. And he was very
gracious. Adam is a really great person to talk with, and I really appreciate
everything about that. So I did come up with the actual tree-making part of it,
but the hard part, the algorithm, was from Adam's lab. And were there any earlier prototypes
of it at the time? I tried this with a few different things. Actually, speaking
of SaffronTree, you reminded me, I had this incredibly slow way, and I'm sure it
was even slower than what you did in Perl, and I did try to come up with
basically a Jaccard distance of k-mers, but it was incredibly slow and I dropped
the idea. Yeah. You see, I didn't drop it. I learned my lesson. Maybe if I did
this in Rust, then it would have been fast, I don't know. If you go back to the
MASH paper, there's a lot of tricks with their underlying libraries. I think
Cap'n Proto is the one that they mentioned that has a lot of optimization by
people way beyond our expertise. There's a lot of speed-up stuff they did to get
it so fast. I appreciate that. I appreciate when speed-ups are put in the
library. Another early prototype was I tried it with ANI, MUMmer-based ANI. And
MUMmer, for people who don't go way far back in bioinformatics, is again from
Adam Phillippy. So I think that's kind of fun. So MUMmer-based ANI. And again,
it was not as rapid because you still have to assemble the genome. So MASH
turned out to be the best alternative because you could use it on assemblies or
on raw reads. So where did
you end up publishing it? I chose JOSS. You and everybody else were kind of
talking about JOSS, and eventually I saw that FDA had a paper in it, a good
paper from our colleague at FDA, Steve Davis. And they convinced me that this is
not... like, it's a free journal. You know, to me, that's very suspicious. This
is the Journal of Open Source Software? Yes, but it's real. And it didn't take
too long after I investigated it. Like, this is a real journal. And I went for
it because it's this really transparent, open source method of peer review on
GitHub. And I actually wish everything was like this. So everything I can do,
I'm probably going to try to publish it in JOSS now. Awesome. Oh, I should
mention, speaking of ANI, that in my experience, and there is a little bit of
literature on it, the MASH distances that it calculates are comparable to
ANI. They're not equivalent, but comparable. That's where the fact that this
makes trees that make sense comes into it. Yeah, it was comparable. There's a
figure in the MASH paper where they showed that keeping 1,000 k-mers had a
really good correlation with ANI. And I kind of made an executive decision,
like, okay, it's fast, but it's still fast if I do 10,000 k-mers. And so I did
it that way just to make sure the resolution stayed pretty high. So where did
the name come from? I just glued together MASH and tree. And just like how you
were talking with Greg in a different episode, I started saying MASH tree, and
it just stuck. Internally, the resulting tree, people were just informally
calling it the MASH tree. And that stuck too. Yeah, I think there's something
about software tool names. The names that really stick are the ones that are
kind of dumb, slightly dumb, or just catchy in, you know, an earworm kind of
way. Totally agree. I'm done trying to make super sophisticated names. It needs
to be very straightforward now. I know, I'm all for, you know, the obscure Irish
language names. I like that. I do like that. I'll
talk about that another time, but I like it. Yeah, so anyway, what are the
unique selling points of it? If you are stuck in the weeds like me and you have
a bunch of huge FASTQ files coming at you and your higher-ups are asking you to
make trees very fast, well, MASHtree is rapid. A tree that would be done with
high-quality SNPs, you know, that would be done in an hour and a half, is done
instead in a minute with MASHtree. And that could be done on your laptop instead
of on the high-performance computer. The trees are good approximations of the
real phylogeny. I shouldn't say real phylogeny, because phylogenies are usually
inferred, but it does a really good approximation. So if you want a rapid,
approximate tree and you're willing to admit that it's not the final tree,
because you're going to run something else afterwards if you're going to do a
publication, then MASHtree is a really nice way to get that first round of
analysis. So what kind of organisms would you usually put through it in your
day-to-day work? I've been asked to put E. coli through it, or, you know,
related organisms, Shigella, or Listeria, or Vibrio. All those guys I've been
putting through it. I've seen other people use it for, you know, things that are
not enteric, like Legionella, and it's done a fantastic job. Awesome. So what language
is it written in? Oh, don't kill me. It's in Perl. I think that Perl was a good
choice, actually, even if I knew a few other scripting languages, because one of
the key points of MASHtree is the multi-threading. And I know basically all
other modern languages have multi-threading or some form of that, but at that
point in time, when I was starting to write this, I had a good grasp on
multi-threading in Perl. People have their valid criticisms of multi-threading
in Perl, but it got the job done really well. And is it just one big script, or
is it broken up into multiple files, or what? There is a main
MASHtree script, and it reads off of a single, actually a couple of different
library files. But after that it is packaged up into kind of the standard Perl
package. Can you install it with CPAN? Exactly. Yes, you can install it with
CPAN. That was sort of inspired by Torsten prodding me on GitHub issues. I think
he was prodding me on a different package, but I realized that it would be
easier to package up MASHtree at that time. How did you do the packaging? Did
you use Dist::Zilla or something like that? Perl has a few different packagers,
but I used MakeMaker, which basically creates a Makefile and then runs the
Makefile to install the whole package. And I presume
for JOSS you had tests in there as well? Yep, this was perhaps my more... it
wasn't the first time I did unit tests, but it was the more professional package
that I did, with professional-style unit tests, where I just picked every major
thing and even several minor things and tested, tested, tested. Awesome. So, how
is it documented? I'm not as sophisticated as the rest of
the world on documentation. I don't use LaTeX or Read the Docs. I just used
Markdown, so there's a main Markdown README in the main directory, and then that
links to other Markdown files in the repo under a subdirectory, docs. I don't
think it's as sophisticated, but I think there are advantages to it. Like, the
documentation always gets cloned with the repo, and Markdown is not too hard to
read on the command line. Yeah, that is a consideration. It is sometimes
annoying when you are stuck and then you have to open a browser to find the Read
the Docs page or whatever. So what
features are you most proud of for MASHtree? Um, I wish I could say I was proud
of making MASH itself, but I can just say that I'm proud of having that aha
moment that MASH was this incredibly fast method of comparing two genomes. I was
fortunate that Adam visited Atlanta in, I think, 2015 and showed it to us, and I
had that aha moment, like, we can make trees out of this thing. The thing I'm
most proud of is just having that aha moment. But I feel like it's possible,
just like with SaffronTree, that someone else could have thought of that. Some
of the design principles I'm proud of, like using multi-threading and packaging
it up for CPAN. By packaging it up for CPAN, it has standard installation, it
has standard unit tests and a standard whatever else, so it's a lot better
streamlined in the Linux environment. I'm proud of the streamlining and the
standardization of it. Yep, and I can see that it's been cited a few times
already, for quite a few different organisms, and people all over the place are
using it. How has the community used
the software, as far as you know? I think there have been a couple of slip-ups
where somebody said it was a phylogeny, unfortunately. But when someone's using
it right, I've seen several times people using it as the first pass for their
analysis. So if they need to know the population structure, if they've got
enough diversity, or if they rapidly want to know whether or not to exclude a
sample from an outbreak analysis. I've seen all of these. Or not just outbreak
analysis, but what if they have, you know, this question: is this related or
not? Is this the same isolate? Is this the same strain? It can be used to
rapidly exclude a sample before going into a more in-depth analysis. So I think
the community, when they use it correctly, are using it as a first pass
analysis. Yep, and has anyone used it in a more
complicated pipeline that you know of? There has been one instance that I saw. I
think it was the pandu pipeline in Australia, and it's being used where one data
set is a possible outbreak data set, and so it's used as a rapid sanity check, I
think, to see whether or not a sample or all samples belong with each other.
Otherwise, MASHtree itself is the more complex pipeline that uses MASH. I think
more often than not I would view MASHtree as the more complex pipeline that uses
the core program. Okay, and any future plans? Is there
something you'd like to add but you didn't have time to add yet? Oh man, there
was a great suggestion by Torsten for making this pipeline more generalizable,
and I wish I thought of that early on. So what if you give it just a different
mechanism for comparing distances, and that could be plugged in? So let's say
you have, like, a bunch of ANI distances, and then you plug them into the
tree-making program. I wish that I built that in, but I just don't have time for that.
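
The pluggable design suggested above could be sketched along these lines: a
tree builder that accepts any pairwise distance function and runs neighbor
joining on the result. This is only an illustration of the idea, not a MASHtree
feature. All names are hypothetical, and putting the root in the middle of the
last branch is an arbitrary choice made to keep the example short.

```python
from itertools import combinations


def neighbor_joining(names, dist):
    """Return a Newick string built by neighbor joining.

    names: list of unique leaf labels.
    dist:  dict mapping frozenset({a, b}) to the distance between a and b.
    """
    if not names:
        raise ValueError("need at least one leaf")
    d = dict(dist)  # working copy; distances to new internal nodes go here
    nodes = list(names)

    def get(a, b):
        return 0.0 if a == b else d[frozenset((a, b))]

    if len(nodes) == 1:
        return f"{nodes[0]};"
    while len(nodes) > 2:
        n = len(nodes)
        totals = {a: sum(get(a, b) for b in nodes) for a in nodes}
        # Join the pair that minimises the neighbor joining Q criterion.
        a, b = min(
            combinations(nodes, 2),
            key=lambda p: (n - 2) * get(*p) - totals[p[0]] - totals[p[1]],
        )
        d_ab = get(a, b)
        len_a = 0.5 * d_ab + (totals[a] - totals[b]) / (2 * (n - 2))
        len_b = d_ab - len_a
        # The Newick text of the joined pair doubles as the new node's id.
        new = f"({a}:{len_a:.4f},{b}:{len_b:.4f})"
        for c in nodes:
            if c not in (a, b):
                d[frozenset((new, c))] = 0.5 * (get(a, c) + get(b, c) - d_ab)
        nodes = [c for c in nodes if c not in (a, b)] + [new]
    a, b = nodes
    half = get(a, b) / 2  # arbitrary root placement on the final branch
    return f"({a}:{half:.4f},{b}:{half:.4f});"


def tree_from(samples, distance_fn):
    """Build a tree from any pairwise distance function (the pluggable part).

    samples: dict mapping a sample name to whatever distance_fn understands.
    """
    names = sorted(samples)
    dist = {
        frozenset((a, b)): distance_fn(samples[a], samples[b])
        for a, b in combinations(names, 2)
    }
    return neighbor_joining(names, dist)


# Hypothetical usage: samples are plain numbers, distance is their difference.
toy_tree = tree_from({"s1": 0.0, "s2": 1.0, "s3": 4.0}, lambda x, y: abs(x - y))
```

With this shape, MASH distances, ANI-derived distances, or anything else that
yields a symmetric pairwise distance could feed the same tree-building step.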
Otherwise, I'm leaving MASHtree basically alone as a stable program, and I'm
listening out for any bug reports. Yeah, that's cool. Thanks for the great
discussion. We've had a quick chat about some of the software we've created
ourselves, and there's always some interesting facts about how these different
tools came into being. Today we were talking about MASHtree, which makes trees
rapidly from raw reads. You can check it out on GitHub, and the paper's up on
JOSS. And that's all the time we have for this episode. See you next time. Thank
you all so much for listening to us at home. If you like this podcast, please
subscribe and like us on iTunes or Google Play, and if you don't like the
podcast, please don't do anything. This podcast was recorded by the Microbial
Bioinformatics Group. The opinions expressed here are our own and do not
necessarily reflect the views of CDC or the Quadram Institute.