Hello and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil Ali Khan and Dr. Andrew Page. I am Dr. Lee Katz. Both Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work on microbes in food and the impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the U.S. Hello and welcome to the Microbial Bioinformatics podcast. This week we continue to talk about phylogenetics, and again I am joined by Dr. Conor Meehan. Dr. Conor Meehan is a lecturer in molecular microbiology at the University of Bradford. He specializes in whole genome sequencing and molecular epidemiology of pathogens, particularly Mycobacterium tuberculosis, and genome-based bacterial taxonomy. And we're also joined today by Dr. Leo Martins, who is the head of phylogenomics at the Quadram Institute Bioscience. He enjoys developing and implementing tree-based models; some of his packages are biomc2, guenomu, and treesignal, and there'll be links for those in the show notes. He's only recently moved to working with bacteria. Previously, he worked with viruses and eukaryotes, but from a modeling and theoretical perspective. So thank you both for joining me again on the show. Thank you. Thanks for having me. All right, so let's get right into it. One of the critical decisions in phylogenetics is picking models. And I think for a lot of us, models are very complicated. So let's just go very broadly: what are the various models available for phylogenetics? So I will start by saying why we need models.
So models of evolution are there to help with the estimation of transitions between states, and that can be mutations from one nucleotide to another, or from one amino acid to another if you're using protein models. And a lot of this is, again, to help model multiple substitutions at a single site and try to, as closely as possible, fit the evolutionary patterns and pressures that would shape the process that creates the tree at the end. So I've often heard it described as: it doesn't have to be perfect. It's like the London tube map. It should be accurate enough to give you all the information that you need, but not so overly detailed that it's specific to only one set of analyses. Yeah, that's a nice way of summarizing it. So yeah, and then I think, on top of this, usually when you see the models written down, you have the plus G, plus I, plus something. And so once you describe this instantaneous transition between the states, you can also say that maybe the rates themselves vary along the tree. And this class of models is called covarion models. And you can also say that they vary across the sites. And so these are, I think a general term for them would be CAT or category models. And we have heterogeneity models, I would say. Yes, heterogeneity models, but it's heterogeneity between the sites, because you can also have heterogeneity between the branches. Between the branches, yeah. Yeah. And so for instance, for the heterogeneity between the sites, if you look at the software, you can have, I think, four different ways of parametrizing them. So they have similar names; for instance, the famous one is the CAT model from RAxML, which is different from the CAT model of PhyloBayes. And then at some point there was some, anyway, I won't go into the details, but they are different. And you also have the gamma distributed models, which, so this is in statistics called a random-effects model.
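To make the "+G" idea just mentioned a bit more concrete, here's a rough, self-contained numeric sketch (our own toy code, not taken from any of the packages discussed; function names are made up) of how a mean-one gamma distribution over site rates gets discretized into the equal-probability categories of a "+G4" model:

```python
import math

def gamma_pdf(x, alpha):
    # mean-one gamma: shape = alpha, rate = alpha
    return (alpha ** alpha) * x ** (alpha - 1) * math.exp(-alpha * x) / math.gamma(alpha)

def discrete_gamma_rates(alpha, k=4, grid=200_000, xmax=40.0):
    # brute-force: split the distribution into k equal-probability bins
    # and return the mean rate within each bin (the "+G{k}" categories)
    dx = xmax / grid
    cells = [((i + 0.5) * dx, gamma_pdf((i + 0.5) * dx, alpha) * dx) for i in range(grid)]
    total = sum(p for _, p in cells)
    target = total / k
    rates, mass, weighted = [], 0.0, 0.0
    for x, p in cells:
        mass += p
        weighted += p * x
        if mass >= target and len(rates) < k - 1:
            rates.append(weighted / mass)  # mean rate of this bin
            mass, weighted = 0.0, 0.0
    rates.append(weighted / mass)  # last bin takes the remaining tail
    return rates
```

For a shape parameter around 0.5 this gives one very slow category and one fast one, which is the usual picture of strong among-site rate variation; real implementations use exact gamma quantiles rather than a brute-force grid like this.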
That you'll say, I don't know what's in this site, I don't know what's the rate on this site, but it can belong to some rate that follows a gamma distribution. Okay, so the CAT doesn't mean concatenate. No, it means category. Okay. And it's because there was a, so historically, can I go a bit? So historically you had DNAML, which is the program from Felsenstein in the PHYLIP package. And then I think it was Gary Olsen who wrote a version of it that was called fastDNAml. So basically, it was just an algorithm change and the program became faster. And then he also wrote a program called DNArates, where it tries to estimate for each site what would be the optimal category, so what would be the optimal rate. It's like a multiplier; I think in Bayesian models you can see this as a multiplier. And then I think the first versions of RAxML were based on fastDNAml and DNArates, and so they imported the name CAT from DNArates. But then I think at the same time, it was Nicolas Lartillot who was developing PhyloBayes, and they also have this category model. But that was an amino acid based model, and so the category is a category for the equilibrium frequencies of the amino acids. Because I think the idea was that if you look at a protein, different sites can tolerate a different set of amino acids, because some of them are in the membrane, some of them are exposed. And then they say, well, maybe what changes between the sites in a protein is which amino acids they allow. So they change the equilibrium frequencies. And that's the PhyloBayes category model. So I said I wouldn't go into details, but I am. So I would say how models are built, and also how maximum likelihood and Bayesian methods work, can be quite complicated. But Paul Lewis does fantastic introductions to how all this works on phyloseminar.org, which is a seminar series run by Erick Matsen.
He did, I think it was a two-part series showing all this kind of stuff. So if people really want to know how maximum likelihood works, and the probability that goes into Bayesian methods, and all these model selections, and how the models are built, I suggest looking at Paul Lewis's lectures on that. We can link to it. Yeah, we'll link to that. Yeah, that's a good idea. OK, but which is your personal favorite? GTR? HKY, of course. HKY? Come on. No, OK. So there are two reasons, one personal and one practical. The personal one is that I did my PhD with Kishino, which is the K from HKY: Hasegawa, Kishino, Yano. But the second one is because I think this is the most complex model that you have an analytic solution for. So you don't need to solve the eigensystem every time you want to calculate these rates. In practice it doesn't change that much; it might be one percent, zero point two percent faster than using GTR. Yeah, I did my PhD with the "general" from GTR. Well, I did my PhD on HIV, which always needs a GTR model because of its massive complexity and mutations, so I just started with that. And I subscribe a little bit to the idea, which is where I guess we'll get into it, that overparameterization of a model is not anywhere near as much of an issue as underparameterization of a model. So it's sometimes difficult to justify why you would not use GTR, and a lot of people subscribe to that way of thinking. Anyway, both of them are wrong. So both are wrong. It's just about how wrong you are. So which is the best model? I mean, you set that one up. Which is the best one? Okay, the best model. The best model is whichever model testing tells you is your best model. So you have to do a model selection, because the best model is data dependent. So once you have your data, you have to apply a battery of tests, and then you'll see which one is the best model. In practice, this doesn't make a difference.
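Since HKY came up, here is a minimal sketch (our own toy code, not from any package) of what the HKY85 instantaneous rate matrix Q looks like: one transition/transversion ratio kappa plus the equilibrium base frequencies, which is why it sits between Jukes-Cantor and GTR in complexity. The example frequencies and kappa are made up.

```python
BASES = "ACGT"
TRANSITIONS = {("A", "G"), ("G", "A"), ("C", "T"), ("T", "C")}

def hky_rate_matrix(pi, kappa):
    # HKY85: the off-diagonal rate into base j is pi[j], multiplied by
    # kappa when the change is a transition; each row must sum to zero
    Q = [[0.0] * 4 for _ in range(4)]
    for i, x in enumerate(BASES):
        for j, y in enumerate(BASES):
            if i != j:
                Q[i][j] = pi[j] * (kappa if (x, y) in TRANSITIONS else 1.0)
        Q[i][i] = -sum(Q[i])
    # rescale so the expected rate is one substitution per site per unit time
    mu = -sum(pi[i] * Q[i][i] for i in range(4))
    return [[q / mu for q in row] for row in Q]
```

With kappa = 1 and equal frequencies this collapses to Jukes-Cantor; letting every unordered pair of bases have its own rate instead of the single kappa is what turns it into GTR.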
As Conor mentioned, it's very hard to justify one over another based on what you want to do, right, with the tree. Actually, the fundamental difference, we talked in the last episode about IQ-TREE and RAxML, it doesn't make a difference, but the people who make them fundamentally think two different things about models. Because IQ-TREE will do a test of all the models and then pick the right one and go forward. RAxML uses GTR, and that's it. You can now explicitly tell it to use other ones, but Alexis doesn't seem to think that you need to pick anything but GTR, and has kind of said that on multiple occasions with RAxML, as far as I know. Are there any other more weird bespoke models that are sort of in common use? Birth-death models or a whole bunch of different things? Yeah, so there's one class of models that I think people should be paying more attention to, but there's not much going on there, which is the Thorne, Kishino and Felsenstein models. So it's the TKF91 and TKF92 models, because they incorporate the indel process. So they model explicitly how indels go into the alignment. That model is the basis for doing alignment and phylogeny construction at the same time, like BAli-Phy and things like that. Yes, exactly. Yeah. The problem is that it's quite complicated to calculate. And I think with the first one, with the 91, they had a tendency to create very long indels, and then they had to fix that. And that's why you have the TKF92. To try to fix that.
And if you write down the likelihood equations, you can have, you know, indels from zero to infinity, and so at some point you have to, anyway, it's hard to calculate, and so far people didn't feel the importance of it. But I hope more people are going to be interested in this kind of model, because there's another, more recent one, called, I think, the Poisson Indel Process, which is also similar to this. I think it's a simplification of it. I think that's an important point that people might not be aware of: when you construct phylogenies, you generally do not include insertion or deletion events, or missing data. It is always just the base substitutions at a specific place. Yeah. So I'd say for model selection, or sorry, model implementation, working on nucleotides where you have no indels or missing data, that realm of research is, I would say, done extremely well. The next levels are really including indels, and very good ways to deal with that, and then going on to codon models and protein models, where if you're doing tree-of-life reconstructions, which Leo probably knows a lot more about than me, it's on the protein level, and those models of substitution are a lot more complex. Yeah, I think when you look at coding data, usually, I don't know, it might be easier to remove or to handle indels; usually the practitioner knows how to handle that. But then when you look at tree of life, where you have non-coding regions, you have to be more careful, because there's a lot of information in the indels there. Okay, and then following on from indels and weird models, there's always the issue of recombination. Yes, I said the R word: recombination, which is absolute poison when it comes to phylogenetics, or at least that's the feeling. So how, what do we do with recombination in our tree? How should we deal with it? Are there models to deal with it? Do we have to do something, as we've been talking about,
with our initial data? What's the best plan of attack for this? If you want to start, Conor. Yeah, the harsh answer is you probably shouldn't be building a tree if you have a lot of recombination. So if you work on a species which is recombination rich, like Burkholderia pseudomallei or something, its genome has been estimated to be about 70 or 80 percent resulting from recombination. Then if you remove all that recombination, which is what most people do with RDP, picking apart the noise from the signal, yeah, you can use ClonalFrameML, a fantastic program, by the way, to remove it from bacteria. But if you've removed 75% of the data, what are you trying to get from that tree? And I'm harping on all the time about it, but like, why are you building a tree? And there you need to start going towards phylogenetic networks, which is a much more underdeveloped field. So you can use something like SplitsTree to try to find out the network of how everything's connected, and you'll probably just find everything's connected to everything, because recombination is so rampant. Or try a Bayesian framework, like from Tim Vaughan, he has Bacter. Yeah, Bacter, inside of BEAST, which will work on some genes and get you the ancestral reconstruction graph, ancestral recombination graph, sorry, to tell you where the recombinations occurred within the timeline of these. But yeah, recombination is difficult. Most people put their hands over their ears, throw it away, and then don't think about it anymore. But I think it's where actually most of the interesting stuff is happening, because that's where your antimicrobial resistance is coming from, that's how you know what bacteria or viruses are in close contact with each other, because they recombined in some way. So I think we need to develop more tools that don't throw it away and actually start to utilize it. Yeah, I think that's the sad truth. If there's a lot of recombination, then you shouldn't be searching for a tree, you should be
looking for the set of trees, for the network. I'm not sure if SplitsTree is the answer, because, you know, anyway, you can describe the set of trees that best describe your data, but this doesn't come from the fact that different regions in the genome, or, you know, in your alignment, can come from different topologies. I don't know, I'm not, yeah. I think you might be wanting to look more at forest-based things, where they're trying to see if, across a set of trees, you always get the same groupings. So even if they're recombining, they're always recombining together, or they're more closely related in terms of the recombination events. Yeah, but that's more comparative genomics than phylogenetics at that stage. Yeah. So the way that I see it is, there's a Goldilocks number here. If recombination is rare compared to the substitutions, then you assume that you have enough signal to have trees along your genome, which means you can find the trees and the breakpoints where the tree changes along the genome. So for instance, this is what they do for viral recombinant forms. So in the case of influenza, we have this reassortment, so you know the breakpoints, but you assume the trees are different. But then if recombination is very, very common, like if you take a human population, a human sample, recombination is much more common compared to the substitution rate, and then there's no way you're going to get a tree, and it makes no sense to talk about a tree, unless you could have a tree of each one of the sites. In this case there are methods under the coalescent, so you forget about the tree, but you can still talk about the population, about the recombination rate, the substitution rate in the population. And then there are some methods that are something in between, I think they're more non-parametric, which is like Gubbins and ClonalFrame, and then
some other methods that try to remove sites which have an excess of homoplasies, or regions that appear to have an excess of substitutions, which are all based on the idea that something doesn't feel right within that tree. Yeah, I agree with all that. Yeah, that's a lot to unpack. So basically you have recombination in terms of what most people think recombination is, within-gene recombination, and that is quite difficult to work with. Then it's like, maybe building the tree really isn't what you should be doing; it's doing a lot more comparative genomic methods and figuring it out from there. If it's whole genes being transferred from one to the other, then you can build those individual gene trees and try to find those lateral gene transfer events with an AU test or something like that. And these are two completely different ways of going about it, and they have different implications for the bacteria itself. Yeah, let's follow up on that, because so far we've been mainly talking about a sort of supermatrix approach, where you're sort of treating your whole genome, core genome SNPs, as one gene. You're mashing it all together and using that sequence as one, which is one way to do it. And I don't think we've touched on a supertree. A supertree, and so what is that? I think most people are more familiar with the former. What is a supertree approach? How does that differ to what you would normally do, say RAxML with all of your core genome SNPs? And would that be more tolerant of recombination, or some of the problems we've been talking about? Okay, so I think the last question is harder, so whether it can handle recombination, I don't know. But just to give an overview, because I like supertrees, I like distances between trees, I like to look at trees. Because usually trees are something that we are very bad at interpreting, because if
somebody gives me two trees, I cannot tell if they are similar or not. They are only similar or not relative to some distance. So what are you looking at, what makes them different, what makes them similar? And so supertrees are methods for when you have, historically it's when you had incomplete information. So for instance, for one gene you only have a few species, and for a different gene you have different species, and then you want to, like, concatenate the overlap of these trees and have one that has information about all the species. But I think nowadays we are using the term more loosely, and I like that: it's when you have a collection of trees, how to summarize this information. So for instance, one of the most successful methods in these tree-of-life methods is called ASTRAL, and I consider ASTRAL a supertree approach, because it takes the individual trees from your individual genes, and it tries to find a tree that is optimal with regards, in this case, to the quartet distance to all of your trees. So in theory, maybe if you describe your problem as a recombination one, so if you can have the recombination distance between your trees and a species tree, maybe you can, you know, kind of help solve the recombination problem by finding the species tree, by finding the supertree, that minimizes this recombination disagreement with your individual trees. I don't know. Yeah, again, if it's a lateral gene transfer event of whole genes, supertree methods I think tend to do better. So you can also build supertrees with a subtree prune and regraft approach, which Chris Whidden had put out a few years ago, and build supertrees that way, and they tend to do much better with these lateral gene transfer events. But again, when it's a within-gene recombination, that's when it gets really, really difficult, and I don't think either method is better for
those. Is that pruning approach available or implemented in a package that's ready to use, or is it something... It is, yeah. So I have to stop here, I'm waving my arms, because there's an implementation in fangorn that I did, because I did this a long time ago. I implemented, it's not the proper SPR distance, but an approximation to the SPR distance that works very well when the distance is not very large; but when the distance is very large, you know, the trees are completely unrelated anyway. So if you use fangorn, there's an SPR distance there. It was because the lab that I worked in before, which is Robert Beiko's at Dalhousie in Canada, it was Chris Whidden who was working with him and doing his SPR work, so it's all on Rob's website, as far as I know, still. Or now, I think he's now working with Erick Matsen and they're doing stuff there. Yeah, but that's the kind of people who are doing that stuff as well now. Yeah, yeah, I think the first papers that I read associating this SPR distance with recombination were from Robert Beiko. Yeah, good people. Yeah, thank you. So alright guys, let me change tack a little bit. Out of all of what we've been talking about today, are there any particular areas we need to watch out for, particular data sets or particular problems where tools really struggle, that, out of all of your wisdom and expertise, people should know about and be aware of? Yeah, I think for me, it's not about the tool, but it's about the user. So I don't have a problem with the tools, but with the users. We should be very careful, as we accumulate more data, because we have this thing that we want to have the tree, we want to find a single tree, or we want to describe something and we want a point estimate. But we don't have point estimates in life. And so I think we should start thinking seriously about the diversity of data, about the uncertainty.
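On the point made earlier, that two trees are only similar or not relative to some distance, the simplest such distance is Robinson-Foulds: count the clades found in one tree but not the other. Here is a toy sketch (our own code, with rooted trees written as nested tuples, a representation we are assuming purely for illustration; real SPR distances like those discussed above are much harder to compute):

```python
def clades(tree):
    # collect non-trivial clades (frozensets of leaf names) from a rooted
    # tree written as nested tuples, e.g. (("A", "B"), ("C", "D"))
    found = set()
    def walk(node):
        if isinstance(node, str):
            return frozenset([node])
        leaves = frozenset().union(*(walk(child) for child in node))
        found.add(leaves)
        return leaves
    all_leaves = walk(tree)
    found.discard(all_leaves)  # the root clade carries no grouping signal
    return found

def rf_distance(tree_a, tree_b):
    # symmetric difference of the clade sets: 0 means identical topologies
    return len(clades(tree_a) ^ clades(tree_b))
```

So rf_distance((("A", "B"), ("C", "D")), (("A", "C"), ("B", "D"))) counts the two clades unique to each tree, while identical topologies score zero.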
You know, if you have ten genes, these ten genes might give you a different story, and you shouldn't, you know, you shouldn't lose your hair trying to find which one of the genes is telling you the truth. You know, it's like the Japanese movie Rashomon: you have five people telling a story, it's five different stories, but the facts are the same. Don't spoil it, Leo. Don't spoil it, Leo. But this movie is from the 50s. If you haven't watched Rashomon... No, please watch it. I very much like Rashomon. Please watch Rashomon if you haven't seen it. But you know, I think that's the thing. So if you have a large data set, if you have a lot of genes, a lot of samples, they will tell you different stories, and you have to listen to all the stories that they are trying to tell you. A tree with a thousand faces, by the way. For me, it's that a lot of these methods were built for complete-information, gene-based phylogenies, and the mathematics that goes into building these is way beyond my comprehension. It's for people much smarter, and it's difficult to build, and they just finally got to the point where they were good for genes, and then we all moved to whole genomes, and then we all now want to try to shove this whole genome data into programs that were not necessarily built for that. So my pet peeve, as anybody who knows me knows, is ascertainment bias correction of SNP data. The programs are built assuming that you've put in all of the data that you know. So you have your gene sequence, which has constant and variable sites, and people putting in SNP-only trees, they're just going to be wrong. So it's about correcting for that, putting in your constant sites. And then a lot of that has issues because we have repetitive regions. Whole genome sequencing tends to not always be whole genomes. It's actually whole genomes minus all the repetitive regions.
So it's understanding that the data you're putting in may not always be complete in the same way that the program is expecting it to be. So on trees based primarily on SNPs, what kind of errors are they going to introduce? For me, in my experience, I find that often if I run it with all the information and then with just the SNPs, the topology is more or less correct, but my branch lengths are rubbish. Yes. So going back to the models that we talked about, two things that go into something like an HKY or a GTR model are the rate of transition between the different nucleotides, but also the frequency of the nucleotides that are seen. And if you just put in the SNP data, the frequencies of these four nucleotides will most likely be incorrect, because you haven't accounted for all the constant sites and how many A, C, G, or T constant sites there are. So if you're working on tuberculosis like me, very high GC content, whereas the variable sites may be very high AT content, then you'll get a completely wrong model of evolution, which will give you completely wrong branch lengths. But there are models that do try to estimate this or work around it. Yes, you have the Lewis correction, which will do a single estimate recalculation without any extra information. You have the Felsenstein ascertainment bias correction, where you just tell it the total number of constant sites. And then you have the one Stamatakis proposed, where you tell it how many A's, how many C's, how many G's, and how many T's were constant sites. And if you put that in with your SNP data, it almost definitely will give you the correct information. Almost. Although myself and Leo did discuss this a little bit, and you should actually just put in the entire genome. Okay, so generally you can work around it, but ideally don't. Don't just use SNPs by themselves. You either need to put in the entire alignment, or you need to be putting in the constant sites as well.
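The GC-content point above can be shown with a toy example (the sequences below are made up purely to illustrate the effect): base frequencies estimated from only the variable columns of an alignment can be wildly different from those of the full alignment, which is exactly what throws off the model.

```python
from collections import Counter

def base_freqs(seqs):
    # empirical nucleotide frequencies over a list of aligned sequences
    counts = Counter(b for s in seqs for b in s)
    total = sum(counts.values())
    return {b: counts[b] / total for b in "ACGT"}

def variable_columns(seqs):
    # keep only the columns where the aligned sequences disagree (the "SNPs")
    keep = [col for col in zip(*seqs) if len(set(col)) > 1]
    return ["".join(row) for row in zip(*keep)]

# hypothetical GC-rich alignment whose variable sites happen to be AT-rich
aln = ["GCGCGCGCAT", "GCGCGCGCTA", "GCGCGCGCAA"]
snps_only = variable_columns(aln)
```

Here the full alignment is 80% GC while the SNP-only columns contain no G or C at all, so a model fitted to just the SNPs would infer absurd equilibrium frequencies, which is the bias the Lewis, Felsenstein, and Stamatakis corrections try to repair.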
Yeah, yeah, and so I think a reality check is: if you have 10,000 genomes, but then you only have, I don't know, 100 SNPs, can you actually have a tree of 10,000 tips, given that you only have 100 SNPs? Probably not. And so I think that's a reality check. Maybe you don't have enough data there to distinguish between one tree and another. It needs to be phylogenetically informative, for sure. Yeah, exactly. So, because I do a lot of clinical phylogenetics, things that are used for diagnostics are almost definitely not the things that can be used for transmission, because diagnostics is very low amounts of variation, and for transmission you want as much variation as possible. Yes, that's always something that I wrangled with when talking to clinicians, coming from a more purist microbiology background: the attitude is, the more SNPs the better, the more discriminatory power I have, the better it is. And then the question comes back to: yes, but you're not reflecting the species, you're not reflecting evolution anymore. You're introducing some other information. And so this tree has all the data in the world that you can feed it, but it's not informative. It's telling you the wrong thing. But this comes back to what we were talking about, where it's really important to think about the question. And yes, in some cases, for discrimination, you just want to tell things apart with as much information as possible, introduce as many variable sites as possible, regardless of where they're from, and you just want to say, is this the same or not, which is fine. So what is the next big thing for phylogenetics? I think we'll close with this question. Where to next? And I think we've sort of touched on it, but I'd like to hear a summary from the two of you. What's the next big area? What would you like to see, and what do you actually think is going to come next?
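Before the answers, the "phylogenetically informative" point just raised can be made concrete with a quick sketch (toy code, ours): counting parsimony-informative sites, a common sanity check before trusting a tree of many tips built from few SNPs, since a variable site only helps group taxa when at least two states each appear in at least two sequences.

```python
from collections import Counter

def parsimony_informative(seqs):
    # a column is parsimony-informative when at least two different bases
    # each appear in at least two of the aligned sequences
    n = 0
    for col in zip(*seqs):
        counts = Counter(col)
        if sum(1 for c in counts.values() if c >= 2) >= 2:
            n += 1
    return n
```

A column where every variant is a singleton discriminates isolates but supports no grouping, which is the diagnostics-versus-transmission tension described above.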
For me, because I moved from viral phylogenetics and molecular epidemiology into bacterial ones, the vast majority of programs, especially Bayesian programs, just cannot handle the vast amount of data that's in bacterial genomics, with all these levels of different types of recombination, but also just the size of the genomes and the size of the data sets that we're coming out with. So 10 years ago, a data set of 20 isolates was enough for a paper, and now it's like, oh, you only had 400 isolates? And then you're trying to build something with 400 isolates with either a lot of variation or not a lot of variation. And I think that's maybe the next step: scaling up a lot of Bayesian analysis to be able to handle larger and larger data sets, whether that's even possible. The other thing that I think is the next big thing, and it's been done a little bit, is alignment and tree estimation at the same time, as opposed to doing a two-step process where you align all your data and then you build a tree. Or alternatively, actually doing completely alignment-free phylogenetics using hidden Markov model approaches, which are very new in trying it that way. Yeah, I agree that alignment-phylogeny coestimation is the next big thing. There are a few software packages available, so PASTA and SATé; I don't remember which is older, which is newer. I think PASTA is the newer version. I can give a link later. Yeah. And I think it's for alignment, but since they construct the tree at some intermediate steps, it might help a lot in this coestimation. So SATé does an alignment and then builds a tree and then redoes the alignment based on that tree. Based on that tree, right, until it doesn't change, right? Yeah. Or something like this. Yeah, okay. Yeah. And then I think there's another software called PHMM3, which is, I think, what you mentioned about the hidden Markov model.
And also ProPIP, which is based on the Poisson Indel Process, and which was done by, I forgot the name, it's Maria Anisimova. Yeah. In Switzerland. And so I think they are using it to do the alignment, but also it helps in the tree inference. And I hope there's going to be much more of this in the future. And another thing that I would be happy to see more often, and I think it's the next big thing, is whole genome alignment. Yeah. And how to use whole genome alignments for phylogenetics, for phylogenomics. Yeah, replace my answer with that answer. That's definitely the best answer. Yeah, I think it was mentioned in the previous episode, right? And yeah, genome graphs, I'm looking at you, because I think the big promise of whole genome alignment is with genome graphs as well. All right. So I think that's more or less the time we've got for this session. Any final words from either of you two? Understand your data, and then build your tree from there. I would like to just mention one paper from 2017, where they show, I think I have the name of the paper here: contentious relationships in phylogenomic studies can be driven by a handful of genes. And it's in Nature Ecology & Evolution, 2017, and the PI was Antonis Rokas. Of course, yeah. Yeah, and they showed that if you look at a tree-of-life-sized alignment, so they have several data sets, and they show that in these data sets, just by removing a few genes, you can get a different phylogeny. And removing a few sites from a few genes, you also get a different phylogeny. So I always think about this, and there's a nice figure in this paper. I always think about this figure from this paper when I have a data set, and I think, maybe I'm removing the one that is going to change the topology. And so, yeah, I think you won't sleep well for a few days after reading this paper, but it's quite interesting.
All right, that's fantastic, OK. And so, yeah, to end on a high note. End on a high note, that's fantastic. All right. Everything's a lie. I want to thank my special guest, Dr. Leo Martins. Thank you. And Dr. Conor Meehan for joining me today. And I'm Nabil Ali Khan, and this has been the Microbial Bioinformatics Podcast. Thank you all so much for listening to us at home. If you like this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group and edited by Nick Waters. The opinions expressed here are our own and do not necessarily reflect the views of the CDC or the Quadram Institute.