Hi, and welcome to the MicroBinfie podcast. I am Nabeel, your host for today, and once again I am joined by our favorite arborists, Dr. Conor Meehan and Dr. Leo Martens. Dr. Conor Meehan is a lecturer in molecular microbiology at the University of Bradford. He specializes in whole genome sequencing and molecular epidemiology of pathogens, in particular Mycobacterium tuberculosis, and in genome-based bacterial taxonomy. Along with me is Dr. Leo Martens, who works with me and is head of phylogenomics at the Quadram Institute Bioscience. He enjoys developing and implementing tree-based models and has a number of software packages out there, such as BioMC², GNOMU, and TreeSignal. He previously worked with viruses and eukaryotes and has only recently switched to working with bacteria. So, thank you both for joining me again for another episode about the dark world of Bayesian magic.

Thank you. Let's do it. Let's do this.

All right, let's fire straight from the hip. We were previously talking about all of the theory behind Bayesian inference. Let's talk about how I get started. What's the simplest toy example I can start playing with to get my head around what this is actually doing?

So, classically, there's this class of graphical models that you can start playing with. Basically, it's a language. The most famous is Stan. The oldest one, I think, is BUGS, and WinBUGS. Stan is also quite old, but it's quite famous. And now we have another language, like a programming language, which is called Rev, which is for RevBayes. So if I had to learn Bayesian statistics again, I would start with Rev. I'd go to the tutorials, because it's very simple: you write out the parameters, and then you describe how they interact with each other. And then if you have data, you have your posterior. If you don't have data, you can simulate data. So that would be my first day playing with Bayesian models: Rev, and Stan, because Stan is also quite famous. I think it's developed by Andrew Gelman and others. And they have this so-called plate notation, where you basically draw the model. It's like a flowchart: you have circles and squares for variables, and arrows telling you which one is conditional on which. So you can describe your model by drawing it.

So, I would agree that the graphical models can be quite an easy way to help you understand how this works, and the RevBayes tutorials outline how all of these fit together. If you want to get into toy examples, as in toy data sets and biologically meaningful results, there is a course called Taming the BEAST, which is run in person every year, actually multiple times in multiple parts of the world. But the whole program is online as well. And Taming the BEAST has introductory, intermediate, and advanced tutorials. Most of them are actually based around eukaryotic work, so it's a lot of bears and those kinds of data sets. But they're very, very good for understanding what you're putting in and what's coming out at the end. So I would say learning the Rev language or similar, like Leo said, is good because you start to truly understand all the different aspects of the model. And then these Taming the BEAST or RevBayes tutorials can help you work through some actual data to get some answers.

Okay, that sounds straightforward.
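To make that first day concrete, here is what such a toy model boils down to underneath the graphical notation: a minimal Metropolis sampler estimating a coin's heads probability from ten flips. This is a plain Python sketch rather than Rev or Stan code, and the data and tuning values are invented for illustration.

```python
import math
import random

# Toy Bayesian inference: estimate a coin's heads probability p.
# Prior: Beta(2, 2). Data: 7 heads out of 10 flips (invented numbers).

heads, flips = 7, 10

def log_posterior(p):
    if not 0.0 < p < 1.0:
        return float("-inf")               # outside the prior's support
    log_prior = math.log(p) + math.log(1.0 - p)   # Beta(2,2), up to a constant
    log_lik = heads * math.log(p) + (flips - heads) * math.log(1.0 - p)
    return log_prior + log_lik

p = 0.5                                    # starting state of the chain
samples = []
for _ in range(20000):
    proposal = p + random.gauss(0.0, 0.1)  # symmetric random-walk proposal
    if math.log(random.random()) < log_posterior(proposal) - log_posterior(p):
        p = proposal                       # accept; otherwise keep the current value
    samples.append(p)

kept = samples[5000:]                      # discard burn-in
print("posterior mean of p:", sum(kept) / len(kept))
```

The analytic answer here is a Beta(9, 5) posterior with mean about 0.64, which is a handy way to check that the sampler is behaving.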
So, let's say I want to start using Bayesian approaches more ad hoc in my bioinformatics. Let's say I've got a new method and I want to take a Bayesian approach within it to find a particular answer. Is this easy to build into the way I currently write my code, or do I have to start from scratch?

Yeah, at some point you need collaboration. You want to talk to a statistician just to check things. For instance, I did my PhD in Bayesian phylogenetics, and my first-author paper had a mistake in it. And then somebody published a correction on it. This is just to say, you know, I felt pretty confident in my model, and it was wrong, basically. And I was the reviewer for the correction paper, and I had to accept it, because they were right and I was wrong. So even if you feel very confident, and even if you check everything you can, you can still make some silly mistakes, because writing the proposals and writing the models and the conditionals can be quite convoluted.

Yeah, I often run any of my analyses by somebody else as well, just to be sure.

If you really want to start playing with, let's say, writing software, or trying to make a Bayesian version of what you already have, I would suggest starting with one of two approaches. One of them is simulated annealing. Simulated annealing is the optimization version of a Bayesian model: instead of estimating the full posterior distribution, you just try to find the best value, or the best set of values. And since it's an optimization, even if you got something wrong in the conditionals, as long as the best value is the same, your simulated annealing will still work. So even when I'm writing models in software, I always implement a simulated annealing version, just to check that it recovers what should be the maximum likelihood estimate, let's say.

Another alternative would be to play with ABC, Approximate Bayesian Computation. Basically, you simulate. If you cannot calculate the likelihood, you simulate under a known model, and we have very good software for simulating almost anything. Then you have some rules for how to accept the simulated output, and in the end you have a distribution that you can use as a proxy for the posterior. So even if you cannot calculate the likelihood, you can still get a distribution of values, as long as you can simulate.

Yeah, I don't think it's very easy to automate Bayesian analysis, especially for phylogenetics, because it tends to be a little bit more dataset-specific. Each dataset requires something a little different in terms of the priors. That's why I would often say I'm good at building phylogenetic trees for certain organisms, but not others. And that's when you want to collaborate with the biologists, so you really understand what's happening underneath, or with more, quote-unquote, true Bayesian statisticians who can maybe extend the model to suit your needs if yours is a little bit more unique.

All right. So, let's say I'm a simple user of an existing tool, and we sort of touched on this in the last episode, but how do I interpret the results? I mean, I've been running this program for a few weeks, and I want my estimate. How do I know it's done? How do I know if it's more or less correct?
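Leo's simulate-and-accept description of ABC can be sketched in a few lines. This is a toy rejection sampler under an assumed exponential model with a made-up observed summary statistic; the prior range and tolerance are illustrative, not taken from any real package.

```python
import random

# Toy ABC rejection sampler. Suppose we observe only a summary statistic
# (the mean of 50 values, hypothetically 4.2) and assume the data come
# from an exponential distribution with unknown rate.

observed_mean = 4.2                        # hypothetical observed summary
n_obs = 50
tolerance = 0.1                            # acceptance rule: |sim - obs| < tol

accepted = []
while len(accepted) < 1000:
    rate = random.uniform(0.01, 2.0)       # draw a candidate from the prior
    simulated = [random.expovariate(rate) for _ in range(n_obs)]
    sim_mean = sum(simulated) / n_obs      # summary statistic of the simulation
    if abs(sim_mean - observed_mean) < tolerance:
        accepted.append(rate)              # kept draws approximate the posterior

print("approximate posterior mean rate:", sum(accepted) / len(accepted))
```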
There are a couple of different packages that will let you look at what we call convergence: to see whether, as you run the MCMC longer and longer, you are really exploring any more of the parameter space. The one used by the vast majority of people is a program called Tracer, which lets you visually inspect all these different parts, look at the ESS (effective sample size) to see if you've got enough independent samples, and compare across different runs to make sure it's done. I often run the model four or five times and compare them as well, to make sure the independent runs agree. There are alternatives to Tracer which I guess are not used as much but can sometimes be better, like RWTY, which stands for Are We There Yet?, an R package. And there are others, Bonsai also. There are lots of different ways to look for convergence, but that's often what we want to look at as a first step to see if we're done.

Yeah, in the non-phylogenetic context there's Coda, and I think Tracer borrowed some things from Coda. Coda is also an R package. And I like Are We There Yet? because it actually calculates the distances between the trees. So when you have a posterior sample of trees, it calculates the distances between them and then plots them, I think with an MDS, so you can actually see how your trees are moving through tree space. And if you want your estimate, as I said, I want to see the uncertainty, I want to see the credible sets. But the most common summary, since we don't have an average tree, is called the MAP tree, the maximum a posteriori tree. It's the tree that appeared most frequently in the posterior sample. And for the continuous variables you can have means and medians and so forth.

So, often when you're running it, at each state you've got a tree with all the data on it, and you've got a log. We use the log file to look for convergence and make sure that we've finished, and it may hold a lot of the parameters we actually care about, like maybe the mutation rate, et cetera; that often comes from the log. And then the trees basically get combined to create this maximum clade credibility tree. That's our estimate, the tree that goes in as figure whatever, with the uncertainties put onto it. And there, if you're looking again at phylogenetics and you want to be putting uncertainties on, there are programs to do that. FigTree is one that's very good for displaying uncertainties, but it is by far not the only one out there. Because you want to make sure it's not just the tree that you're putting into your paper; you also want to be putting these uncertainties onto it.

Yeah. And at least in theory, if you run your program for one week and then check the convergence, and the convergence looks like it's going somewhere but you're not satisfied yet, you can, if you have the exact same parameters, so if the model is identical, run it again for another week and then merge the results from both runs. You can assume that they are independent samples.

Like bootstrapping trees. Yeah.

Yeah. So I'll often run four or five chains, run them very long, merge them all together into one, and then thin it down to maybe 10,000 samples, so I get really independent samples within it. It's often overkill, but I'd rather overkill than underkill, if underkill is a word.

All right.
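For intuition, this is roughly the calculation behind the ESS figure that Tracer reports: correlated MCMC samples carry less information than independent ones, so the chain length is discounted by the autocorrelation time. This is a simplified sketch; Tracer and Coda use more careful estimators.

```python
# A back-of-the-envelope version of the ESS figure that Tracer reports.
# Correlated MCMC samples carry less information than independent ones,
# so the chain length n is discounted by the autocorrelation time tau.

def ess(trace):
    n = len(trace)
    mean = sum(trace) / n
    var = sum((x - mean) ** 2 for x in trace) / n
    if var == 0.0:
        return float(n)
    tau = 1.0                              # integrated autocorrelation time
    for lag in range(1, n // 2):
        acf = sum((trace[i] - mean) * (trace[i + lag] - mean)
                  for i in range(n - lag)) / ((n - lag) * var)
        if acf < 0.05:                     # truncate once correlation dies out
            break
        tau += 2.0 * acf
    return n / tau

# Usage: pass in one column of a BEAST or RevBayes log file (after burn-in).
```

A common rule of thumb is to aim for an ESS comfortably above 200 for every parameter in the log.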
Well, that's a fair bit to digest for everyone. But where do we see Bayesian software in the wild, at least as far as phylogenetics is concerned? Which are the most famous software packages?

I mean, BEAST is always the most famous, I would say, for multiple reasons. One is a very nice graphical interface, which means you can do a lot of the easy stuff in there. It's pretty stable software that runs quite well on a lot of different architectures; it's quite well made. And it comes with a lot of things for post-processing: a program for combining and thinning the log and tree files, one for creating that maximum clade credibility tree, Tracer to look at your convergence, and a lot of other tools to let you do a full Bayesian analysis. And there are pros and cons to that. It has a lot of defaults, and in the new versions it warns you when you've not changed any of the defaults, to tell you that you probably should be changing them.

Oh, that's clever. That's very clever.

It used to not. So often you would see a paper where they'd run it for 10,000 steps, and that's just the default in BEAST, and you're like, that's definitely not right. 10,000 is almost definitely not right for anything. But BEAST is very easy to use, which can be a little bit deceptive, because people will just go and run it and not understand that there's a lot of complexity you need to change in there.

Yeah, I was raised on MrBayes. When I was a small kid, I was playing with MrBayes. So I think I might have a bit of a bias towards RevBayes. I don't know if MrBayes still exists, but...

It's not being supported anymore.

Yeah. Okay. So people should move to RevBayes, right? So probably that's what I'm going to be looking to. And there was also PhyloBayes. PhyloBayes has some interesting models, and its convergence analysis was also nice. But yeah, I think BEAST is the big one, right? So I have a question, because we have BEAST 1 and BEAST 2. What's the difference between them? It's a proper fork, right? Not like GitHub using the word fork now; a proper fork. They've speciated, they've become independent projects, right?

Yeah, they have. So BEAST 1 and BEAST 2 are run by different groups now, with Andrew Rambaut and Alexei Drummond. And they do similar things, but slightly differently sometimes. And what's important to realize is that BEAST creates time trees almost exclusively. Bayesian trees don't have to be time-related; they can also be evolutionary-distance-related, like we often think of from maximum likelihood. But BEAST will only build time trees, as far as I'm aware.

Okay. So they always have a clock model there.

Yes, you always have to specify some kind of clock. Whereas MrBayes, PhyloBayes, and RevBayes will build any type of tree that you need. So RevBayes is fantastic. It runs very, very well. It's very robust. And it gives you a lot of options, but it's written in something similar to the R language. So it can be very off-putting to new people, because you have to code it. You have to say, I want this prior with this distribution. And it makes you think about every single aspect of the Bayesian analysis, which is actually a good thing, because you really should be thinking about all these aspects.
But it can be quite off-putting to people who are trying to learn Bayesian methods and learn coding at the same time. And an MCMC state in RevBayes means something different to what it does in BEAST.

Yeah. When I was playing with MrBayes, the first versions didn't have any clock model. It was like RAxML or IQ-TREE in that respect.

I think people assume that Bayesian trees mean time trees, and that's not true; it's because BEAST is so popular and it only builds time trees.

Can you build a time tree without Bayesian methods? Is there counterpart software?

Technically, yes. You can use what's called LSD, least-squares dating, where you build a maximum likelihood tree, which you then feed to it with a mutation rate and sampling times, and it'll build a time tree from that. And it works quite well for a lot of different things. It's definitely being actively developed and getting better. And a lot of the time it tends to get you the same tree, if you only care about the time tree and nothing else. Sometimes. But a lot of the work on LSD at the moment, if I'm honest, is people comparing it to BEAST, running both and saying, oh yeah, it actually did do as well. But it may not, so maybe you should do that comparison every time.

And I think maybe the first software for estimating relaxed clock models was MultiDivTime. I don't know if we can call that Bayesian; I think it was empirical Bayes, because it fixed the tree topology to the maximum likelihood one. So basically, you give it one tree, and then you give it node estimates, time estimates for the nodes, and MultiDivTime tries to estimate from that.

Just jiggles the branch lengths. Yeah. That would run pretty fast, though, wouldn't it?

Yeah, because it assumes the curvature around the maximum likelihood. And LSD runs really, really quickly.

So a way of thinking of it is that those maximum likelihood approaches basically split everything up: you build an alignment, end of sentence; build a tree, end of sentence; build a time tree. Whereas Bayesian analysis really should be doing the alignment and the phylogeny and the time and the entire model all as one thing. At the moment, we've still separated out alignment, tree building, dating, et cetera. But there are some things that are trying to do tree building and alignment construction at the same time inside a Bayesian framework. That's maybe a larger discussion, but we talked a little bit about it before with SATé and things like that.

Okay. Well, we keep saying models. We're talking a lot about models. Are these the good kind of models? What are examples of models?

Everything's a model. Everything's a model. All models are wrong, but some models are useful.

Yeah. So a tree is just a model of the evolutionary process that shaped the data that we see at the end. And I think it's good to remember that the tree is not a separate thing from the rest of the model. The tree is part of the model; it is a model of the evolutionary process. So when we think of models, again, I revert to molecular epidemiology, because that's what I do, and it's probably the largest use of Bayesian methods in the microbial bioinformatics world. A model that we'd often see is a clock model. The molecular clock is something that maybe a lot of people have heard of, or maybe have not.
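The intuition behind least-squares dating can be seen in a root-to-tip regression: under a clock, the distance from root to tip grows roughly linearly with sampling time, so the slope estimates the rate and the x-intercept estimates the root date. LSD itself solves a weighted least-squares problem over the whole tree; this sketch, with invented numbers, only shows the core idea.

```python
# Root-to-tip regression: the core intuition behind least-squares dating.
# Under a clock, root-to-tip distance grows linearly with sampling time,
# so a regression of distance on date gives the rate (slope) and the
# root age (x-intercept). All numbers below are made up for illustration.

dates = [2001.0, 2004.5, 2008.2, 2012.7, 2015.1]   # tip sampling times
dists = [0.0011, 0.0014, 0.0018, 0.0023, 0.0025]   # root-to-tip distances

n = len(dates)
mx = sum(dates) / n
my = sum(dists) / n
slope = sum((x - mx) * (y - my) for x, y in zip(dates, dists)) / \
        sum((x - mx) ** 2 for x in dates)
intercept = my - slope * mx

print("clock rate (subs/site/year):", slope)
print("estimated root date:", -intercept / slope)  # where distance hits zero
```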
It's the assumption that there is a stochastic relationship between time and the rate of mutation of a gene or a genome: that over a certain period of time, we would expect to see a certain number of mutations per a certain number of years, or whatever the units are. So if we know this rate, then we can correlate change through time and estimate divergence times, among other things. If we know the divergence times and put those into a model, then we can find out the mutation rate; and if we know the mutation rate, we can find out the divergence times. So the model that goes into this is a molecular clock model, and it can be of multiple different types. There can be a relaxed clock, where there's a different rate on each branch in the tree, or a strict clock, where there's just one rate for the whole tree. The branch rates can be exponentially distributed or log-normally distributed. They can be correlated, which means the rate on the ancestor branch in the tree affects the rate on the child branch, or they can be uncorrelated, where all the rates are free to vary. And you have to test which one of these is most appropriate for your data. That's probably the most common model that people are using for this kind of work.

And try to do that in a likelihood framework. Yeah.

So this is sometimes where those programs that are trying to do time trees outside of a Bayesian framework can fall down: they often make an assumption about which aspect of this model is the correct one.

So for a microbe, what is the correct one?

Almost definitely not a strict clock.

No, strict definitely doesn't sound correct, unless they were all sequenced in the same week, maybe. And they were within an outbreak.

Within an outbreak and within the same environment. From what we know of microbes, they're not going to be strict. It's normally an uncorrelated relaxed clock of some kind. And then whether it's exponential or log-normal depends on how much variation there is in the rate: whether it's a lot at one end and little at the other, or whether it's mostly in the middle with two tails.

So which is which in that?

Sorry: an exponential means that you have a lot at one end and only one tail, and a log-normal means that you have two tails to the distribution. That's mostly how you can think of it.

So you'd want exponential if, say, you're sequencing an outbreak, where it's very top-heavy with those samples, and then you've got a few contextual singletons outside of that?

Yeah, perhaps, perhaps. And if you've got two tails, then you've got something where there's a lot more going on.

Yeah, probably, because it depends on how far apart they are. Because the log-normal, it's the autocorrelated, no, it's the correlated model. It says that if you know the rate on one branch, maybe you know the rate on the neighboring branches. But as the nodes start to get far apart, you don't know anymore; you start losing information. That's why it's called correlated. So if it's in the same host, but it's quite diverse. So if you look at the tree and the branch lengths are small and the tips are close together on the tree, then probably the rates are going to be similar.

Yeah, this is where you test all these different parts of the model and see which one is appropriate.
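One way to see the difference between these clock models is to simulate branch rates under each. The sketch below draws rate multipliers under a strict clock, an uncorrelated log-normal clock, and an uncorrelated exponential clock; the mean rate and spread are hypothetical, and real programs estimate these rather than fix them.

```python
import random

# Sketch of the clock models described above, as rate multipliers on
# branches. Strict: one rate for every branch. Uncorrelated relaxed:
# each branch draws its own rate from a log-normal (two tails) or an
# exponential (mass piled at one end, a single tail) distribution.

mean_rate = 1e-7            # substitutions/site/year, hypothetical
n_branches = 8

strict = [mean_rate] * n_branches                   # one rate for the tree
ucld = [random.lognormvariate(0, 0.5) * mean_rate   # log-normal multipliers
        for _ in range(n_branches)]
uced = [random.expovariate(1.0) * mean_rate         # exponential multipliers
        for _ in range(n_branches)]

for name, rates in [("strict", strict), ("lognormal", ucld),
                    ("exponential", uced)]:
    print(name, ["%.2e" % r for r in rates])
```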
It's difficult to guess and say, oh, you have an outbreak, you should definitely be using this. It's more about testing, because what we see as an outbreak, for example, may not behave like one.

Yeah, but then that begs the question: if we run all of them, well, we don't run strict, fine; we run relaxed, we run correlated, we run exponential, we run log-normal, but how do we pick? You go back to: how do we assess?

You have what's called path sampling, or stepping-stone sampling; there are parts inside these programs that allow you to estimate a marginal likelihood. And that says: given that all the other parts of the model are fixed, if I use an uncorrelated or a correlated or an exponential or a log-normal clock, I run it for those four along different paths where I gradually bring the data in and make the data stronger. I can then compare and see which of these is more appropriate for the data. So there are ways to test these models inside programs like BEAST and RevBayes.

And "appropriate" means best fit?

Yes. It uses the Bayes factor. It's a Bayesian version of the likelihood ratio test. Like we talked about picking different models of evolution with an AIC or a BIC, it's a similar concept, in that you rank these different models and say which one is most appropriate for your data.

But it's a pain in the neck to calculate, because you have to run the chain many times along different paths, right?

Yeah, you normally have to run it like 30 or 40 times. So these things take a long time. So often it's one paper that tells you those things, and then another paper, or they say, just put that in the supplemental, and you're like, but that was six months' worth of work.

That's an example of a model. And another common example is what's called the coalescent, which describes the population and how it's evolving over time. A coalescent event is: if you randomly select two people from a certain time point in the population, how far back into the population, in terms of generations, do you go before they have an ancestor in common, before they coalesce into a single organism or something like that? And these models can then be put into your phylogenetic framework to help guide how the tree should look, or how the population size is changing, or how the mutation rate is changing, these kinds of things. So those are two of the models that, if you were running BEAST, you almost always have to specify in some way. And then there are lots of additional models on top of that, like birth-death skylines and all these kinds of things, depending on the analysis that you want to do.

So there was one paper by Sebastian Duchêne and some others where they calculated the clock rates for a bunch of different organisms and just put that in a paper.

Yeah, it looked like a cheat sheet for molecular clock rates.

Yes. I used that as my justification for why I picked a certain rate over other ones in my papers.

So you support that paper? You think that's a good place to start?

For Mycobacterium tuberculosis, it worked quite well, and he and other people redid the analysis, much more in depth, specifically for Mycobacterium tuberculosis a couple of months ago, and found very similar results. The difficulty is it can be a little bit misleading, because not all outbreaks or different things within one species will behave in the same way. Different MTB lineages seem to have slightly different clock rates, and short-term evolution also seems to have a different clock rate to long-term evolution.
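The stepping-stone idea, running a ladder of chains that gradually bring the data in and multiplying up the averaged likelihood ratios, can be demonstrated on a toy model where the marginal likelihood is known exactly. The sketch below uses a conjugate normal model so each rung can be sampled directly; phylogenetic programs do the same computation with MCMC at every rung, which is why it takes 30 or 40 runs. The data and ladder here are invented.

```python
import math
import random

# Toy stepping-stone estimate of a marginal likelihood (the quantity
# behind Bayes factors). Model: theta ~ N(0, 1), y_i ~ N(theta, 1).

data = [0.8, 1.3, 0.2, 1.1, 0.7]
n = len(data)
s = sum(data)
ss = sum(y * y for y in data)

def log_lik(theta):
    # Log of the product of N(y_i | theta, 1) densities.
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * (ss - 2 * theta * s + n * theta * theta))

betas = [(k / 20) ** 3 for k in range(21)]  # power ladder: prior (0) to posterior (1)
log_ml = 0.0
for b_prev, b_next in zip(betas, betas[1:]):
    # The power posterior at b_prev is conjugate here, so sample it exactly:
    prec = 1.0 + b_prev * n
    mu = b_prev * s / prec
    draws = [random.gauss(mu, 1.0 / math.sqrt(prec)) for _ in range(2000)]
    # Stepping stone: average the likelihood raised to the beta increment.
    logw = [(b_next - b_prev) * log_lik(t) for t in draws]
    m = max(logw)
    log_ml += m + math.log(sum(math.exp(w - m) for w in logw) / len(logw))

# Exact answer for this conjugate model, to verify the estimator:
exact = (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n)
         - 0.5 * (ss - s * s / (1 + n)))
print("stepping-stone estimate:", round(log_ml, 3))
print("exact log marginal lik :", round(exact, 3))
```

Running this for two competing models and differencing the log marginal likelihoods gives the log Bayes factor used to rank them.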
So it's a good paper to give you a starting point, a mean for a prior for your analysis. Does that make sense?

Yeah, it said for MTB it's around 10 to the minus five. So I often set that as my mean, but allow the analysis to explore other values around it. I don't just say this is the rate for MTB, but it definitely gives you an idea of what kind of ballpark you should be in.

So, okay, just to close this session: any final tricks of the trade, or war stories from both of you, to help people avoid falling into common pitfalls, or avoid raising the ire of the Bayesianists?

Yeah, there's a neat trick that I remember. Where was it? I think it was in one of the MrBayes manuals. Anyway, it applies to Bayesian analysis in general: instead of running your software with your data to get the posterior, you can run the software with the same parameters, the same settings, but with no data whatsoever, and look at what comes out. The posterior in the absence of data is the prior. So you can compare the posterior and the prior, because in phylogenetics, and in real life, the models are usually not non-informative; you have very informative prior distributions. And I think you can do this with BEAST, maybe by creating an alignment of only Ns.

Right, there's just a button that you press and it runs under the prior.

Okay, yeah. It's a nice trick that has become standard. You run it like this and then you'll see what the prior distribution would be. So before having any data, you can get an idea: what trees do you expect, what set of parameter values do you expect, with the choices that you set? Because then you can see, you know, if your posterior is the same as the prior, and then you say, well, maybe that's a bad sign.

If your posterior equals your prior, go back to the drawing board.

In a similar vein, often what people want to do is find out when the common ancestor of the entire data set was: when did my outbreak start, or when did it come into this new area, or something. What you're doing there is giving it tip samples; you're saying, I took sample X and it was sampled on this specific date. And what you can often do then is randomize the tip dates: just randomly reassign the dates to other tips, and see if you get the same age for the root, as in the start of the epidemic or something. If you do, then you know that those tip dates are not actually very informative, and you can't place a lot of reliance on the dates that you came up with, because it's not really the dates driving anything; it's the model. So date randomization is a good check as well, along with running it under the prior. And there are some extra checks to do.

Yeah, but maybe you have to change the term, because if you say randomization, some people might think frequentist.

Yeah, and it doesn't work for everything. My biggest trick of the trade is just: talk to other people about it. Doing Bayesian analysis by yourself is difficult, because you want to make sure that you've run everything correctly. So just collaborate. Don't feel like you have to do everything by yourself. Almost every person I do Bayesian work with is really, really nice and more than happy to collaborate and to answer questions, because we want the trees to be right and the analysis to be correct.
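A minimal version of that date-randomization test, using the simple root-to-tip rate estimate from the earlier sketch and made-up numbers: shuffle the sampling dates across tips, re-estimate the rate each time, and see whether the estimate from the real dates stands out from the shuffled ones.

```python
import random

# Sketch of the date-randomization test: reassign tip dates at random and
# re-estimate the clock rate. If the estimate from the real dates sits
# inside the shuffled distribution, the dates carry little temporal signal
# and the time estimates should not be trusted. Numbers are invented.

def rate_estimate(dates, dists):
    n = len(dates)
    mx, my = sum(dates) / n, sum(dists) / n
    return sum((x - mx) * (y - my) for x, y in zip(dates, dists)) / \
           sum((x - mx) ** 2 for x in dates)

dates = [2001.0, 2004.5, 2008.2, 2012.7, 2015.1]
dists = [0.0011, 0.0014, 0.0018, 0.0023, 0.0025]

true_rate = rate_estimate(dates, dists)
shuffled_rates = []
for _ in range(100):
    shuffled = dates[:]
    random.shuffle(shuffled)               # reassign dates to random tips
    shuffled_rates.append(rate_estimate(shuffled, dists))

print("rate with real dates:", true_rate)
print("shuffled-date range :", min(shuffled_rates), "to", max(shuffled_rates))
```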
And the more people that do it, the easier it is for everybody.

That's a wonderful sentiment. Please go and talk to your friendly local Bayesianist today; find local Bayesians in your area.

For me, a really good Bayesian analysis sometimes requires the biologist who really knows the organism, the person running everything in between, who is often the bioinformatician, and then maybe another person who's more on the Bayesian side. So what I often say is: in my normal job, I'm the modeler working with all the biologists, and when I work with a group like Tanja Stadler's, I'm the biologist working with all the modelers. So you need to know a little bit about both, but it's always good to have a person who's really working with that organism, who knows it and can say, does this seem right? So if I'm trying to say multidrug-resistant TB arose, and it seems to have arisen in 1920, they're like, that's not possible, something's definitely gone wrong; these drugs didn't exist until the 70s. So you need to ensure that the data looks right, and also make sure that the model is properly specified. So collaboration, I think, is good.

All right, excellent. I think that's a great sentiment, and we'll close this session on that note. I'd like to thank both Conor and Leo for joining me yet again for another episode of the MicroBinfie podcast.

Thanks, Nabeel. Thanks, all.

Yep. And so this is Nabeel from the MicroBinfie podcast; thanks for listening. Thank you all so much for listening to us at home. If you like this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group and edited by Nick Waters. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute. Thank you all for listening. I'm Nick Waters.