Hi, and welcome to the MicroBinfie podcast. I am Nabeel, your host for today, and once again I am joined by our favorite arborists, Dr. Conor Meehan and Dr. Leo Martens. Dr. Conor Meehan is a lecturer in molecular microbiology at the University of Bradford. He specializes in whole genome sequencing and molecular epidemiology of pathogens, in particular Mycobacterium tuberculosis, and in genome-based bacterial taxonomy. Along with me is Dr. Leo Martens, who works with me and is head of phylogenomics at the Quadram Institute Bioscience. He enjoys developing and implementing tree-based models and has a number of software packages out there, such as BioMC², GNOMU, and TreeSignal. He previously worked with viruses and eukaryotes and has only recently switched to working with bacteria. So, thank you both for joining me again for another episode about the dark world of Bayesian magic.

Thank you. Let's do it. Let's do this.

All right, let's fire straight from the hip. We were previously talking about all of the theory behind Bayesian inference. Let's talk about how I get started. What's the simplest toy example I can start playing with to get my head around what this is actually doing?

So, classically, there's this class of graphical models that you can start playing with. Basically, it's a language. The most famous is Stan. The oldest one, I think, is BUGS, and WinBUGS. Stan is also quite old, but it's quite famous. And now we have another language, like a programming language, which is called Rev, which is for RevBayes. So if I had to learn Bayesian statistics again, I would start with Rev. I'd go to the tutorials, because it's very simple: you write out the parameters, and then you describe how they interact with each other. And then if you have data, you have your posterior. If you don't have data, you can simulate data. So that would be my first day playing with Bayesian models: Rev, and Stan, because Stan is also quite famous. I think it's developed by Andrew Gelman and others. And they have this so-called plate notation, where you basically draw the model. It's like a flowchart: you have circles and squares for variables, and arrows telling you which one is conditional on which. So you can describe your model by drawing it.

So, I would agree that the graphical models can be quite an easy way to help you understand how this works, and the RevBayes tutorials outline how all of these fit together. If you want to get into toy examples, as in toy data sets and biologically meaningful results, there is a course called Taming the BEAST, which is run in person every year, actually multiple times in multiple parts of the world. But the whole program is online as well. And Taming the BEAST has introductory, intermediate, and advanced tutorials. Most of them are actually based around eukaryotic work, so it's a lot of bears and those kinds of data sets. But they're very, very good for understanding what you're putting in and what's coming out at the end. So I would say learning the Rev language or similar, like Leo said, is good because you start to truly understand all the different aspects of the model. And then these Taming the BEAST or RevBayes tutorials can help you work through some actual data to get some answers.

Okay, that sounds straightforward.
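To make that first day concrete, here is what such a toy model boils down to underneath the graphical notation: a minimal Metropolis sampler estimating a coin's heads probability from ten flips. This is a plain Python sketch rather than Rev or Stan code, and the data and tuning values are invented for illustration.

```python
import math
import random

# Toy Bayesian inference: estimate a coin's heads probability p.
# Prior: Beta(2, 2). Data: 7 heads out of 10 flips (invented numbers).

heads, flips = 7, 10

def log_posterior(p):
    if not 0.0 < p < 1.0:
        return float("-inf")               # outside the prior's support
    log_prior = math.log(p) + math.log(1.0 - p)   # Beta(2,2), up to a constant
    log_lik = heads * math.log(p) + (flips - heads) * math.log(1.0 - p)
    return log_prior + log_lik

p = 0.5                                    # starting state of the chain
samples = []
for _ in range(20000):
    proposal = p + random.gauss(0.0, 0.1)  # symmetric random-walk proposal
    if math.log(random.random()) < log_posterior(proposal) - log_posterior(p):
        p = proposal                       # accept; otherwise keep the current value
    samples.append(p)

kept = samples[5000:]                      # discard burn-in
print("posterior mean of p:", sum(kept) / len(kept))
```

The analytic answer here is a Beta(9, 5) posterior with mean about 0.64, which is a handy way to check that the sampler is behaving.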
So, let's say I want to start using Bayesian approaches more ad hoc in my bioinformatics. Let's say I've got a new method and I want to take a Bayesian approach within it to find a particular answer. Is this easy to build into the way I currently write my code, or do I have to start from scratch?

Yeah, at some point you need collaboration. You want to talk to a statistician just to check things. For instance, I did my PhD in Bayesian phylogenetics, and my first-author paper had a mistake in it. And then somebody published a correction on it. This is just to say, you know, I felt pretty confident in my model, and it was wrong, basically. And I was the reviewer for the correction paper, and I had to accept it, because they were right and I was wrong. So even if you feel very confident, and even if you check everything you can, you can still make some silly mistakes, because writing the proposals and writing the models and the conditionals can be quite convoluted.

Yeah, I often run any of my analyses by somebody else as well, just to be sure.

If you really want to start playing with, let's say, writing software, or trying to make a Bayesian version of what you already have, I would suggest starting with one of two approaches. One of them is simulated annealing. Simulated annealing is the optimization version of a Bayesian model: instead of estimating the full posterior distribution, you just try to find the best value, or the best set of values. And since it's an optimization, even if you got something wrong in the conditionals, as long as the best value is the same, your simulated annealing will still work. So even when I'm writing models in software, I always implement a simulated annealing version, just to check that it recovers what should be the maximum likelihood estimate, let's say.

Another alternative would be to play with ABC, Approximate Bayesian Computation. Basically, you simulate. If you cannot calculate the likelihood, you simulate under a known model, and we have very good software for simulating almost anything. Then you have some rules for how to accept the simulated output, and in the end you have a distribution that you can use as a proxy for the posterior. So even if you cannot calculate the likelihood, you can still get a distribution of values, as long as you can simulate.

Yeah, I don't think it's very easy to automate Bayesian analysis, especially for phylogenetics, because it tends to be a little bit more dataset-specific. Each dataset requires something a little different in terms of the priors. That's why I would often say I'm good at building phylogenetic trees for certain organisms, but not others. And that's when you want to collaborate with the biologists, so you really understand what's happening underneath, or with more, quote-unquote, true Bayesian statisticians who can maybe extend the model to suit your needs if yours is a little bit more unique.

All right. So, let's say I'm a simple user of an existing tool, and we sort of touched on this in the last episode, but how do I interpret the results? I mean, I've been running this program for a few weeks, and I want my estimate. How do I know it's done? How do I know if it's more or less correct?
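Leo's simulate-and-accept description of ABC can be sketched in a few lines. This is a toy rejection sampler under an assumed exponential model with a made-up observed summary statistic; the prior range and tolerance are illustrative, not taken from any real package.

```python
import random

# Toy ABC rejection sampler. Suppose we observe only a summary statistic
# (the mean of 50 values, hypothetically 4.2) and assume the data come
# from an exponential distribution with unknown rate.

observed_mean = 4.2                        # hypothetical observed summary
n_obs = 50
tolerance = 0.1                            # acceptance rule: |sim - obs| < tol

accepted = []
while len(accepted) < 1000:
    rate = random.uniform(0.01, 2.0)       # draw a candidate from the prior
    simulated = [random.expovariate(rate) for _ in range(n_obs)]
    sim_mean = sum(simulated) / n_obs      # summary statistic of the simulation
    if abs(sim_mean - observed_mean) < tolerance:
        accepted.append(rate)              # kept draws approximate the posterior

print("approximate posterior mean rate:", sum(accepted) / len(accepted))
```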
There are a couple of different packages that will let you look at what we call convergence: to see whether, as you run the MCMC longer and longer, you are really exploring any more of the parameter space. The one used by the vast majority of people is a program called Tracer, which lets you visually inspect all these different parts, look at the ESS (effective sample size) to see if you've got enough independent samples, and compare across different runs to make sure it's done. I often run the model four or five times and compare them as well, to make sure the independent runs agree. There are alternatives to Tracer which I guess are not used as much but can sometimes be better, like RWTY, which stands for Are We There Yet?, an R package. And there are others, Bonsai also. There are lots of different ways to look for convergence, but that's often what we want to look at as a first step to see if we're done.

Yeah, in the non-phylogenetic context there's Coda, and I think Tracer borrowed some things from Coda. Coda is also an R package. And I like Are We There Yet? because it actually calculates the distances between the trees. So when you have a posterior sample of trees, it calculates the distances between them and then plots them, I think with an MDS, so you can actually see how your trees are moving through tree space. And if you want your estimate, as I said, I want to see the uncertainty, I want to see the credible sets. But the most common summary, since we don't have an average tree, is called the MAP tree, the maximum a posteriori tree. It's the tree that appeared most frequently in the posterior sample. And for the continuous variables you can have means and medians and so forth.

So, often when you're running it, at each state you've got a tree with all the data on it, and you've got a log. We use the log file to look for convergence and make sure that we've finished, and it may hold a lot of the parameters we actually care about, like maybe the mutation rate, et cetera; that often comes from the log. And then the trees basically get combined to create this maximum clade credibility tree. That's our estimate, the tree that goes in as figure whatever, with the uncertainties put onto it. And there, if you're looking again at phylogenetics and you want to be putting uncertainties on, there are programs to do that. FigTree is one that's very good for displaying uncertainties, but it is by far not the only one out there. Because you want to make sure it's not just the tree that you're putting into your paper; you also want to be putting these uncertainties onto it.

Yeah. And at least in theory, if you run your program for one week and then check the convergence, and the convergence looks like it's going somewhere but you're not satisfied yet, you can, if you have the exact same parameters, so if the model is identical, run it again for another week and then merge the results from both runs. You can assume that they are independent samples.

Like bootstrapping trees. Yeah.

Yeah. So I'll often run four or five chains, run them very long, merge them all together into one, and then thin it down to maybe 10,000 samples, so I get really independent samples within it. It's often overkill, but I'd rather overkill than underkill, if underkill is a word.

All right.
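For intuition, this is roughly the calculation behind the ESS figure that Tracer reports: correlated MCMC samples carry less information than independent ones, so the chain length is discounted by the autocorrelation time. This is a simplified sketch; Tracer and Coda use more careful estimators.

```python
# A back-of-the-envelope version of the ESS figure that Tracer reports.
# Correlated MCMC samples carry less information than independent ones,
# so the chain length n is discounted by the autocorrelation time tau.

def ess(trace):
    n = len(trace)
    mean = sum(trace) / n
    var = sum((x - mean) ** 2 for x in trace) / n
    if var == 0.0:
        return float(n)
    tau = 1.0                              # integrated autocorrelation time
    for lag in range(1, n // 2):
        acf = sum((trace[i] - mean) * (trace[i + lag] - mean)
                  for i in range(n - lag)) / ((n - lag) * var)
        if acf < 0.05:                     # truncate once correlation dies out
            break
        tau += 2.0 * acf
    return n / tau

# Usage: pass in one column of a BEAST or RevBayes log file (after burn-in).
```

A common rule of thumb is to aim for an ESS comfortably above 200 for every parameter in the log.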
Well, that's a fair bit to digest for everyone. But where do we see Bayesian software in the wild, at least as far as phylogenetics is concerned? Which are the most famous software packages?

I mean, BEAST is always the most famous, I would say, for multiple reasons. One is a very nice graphical interface, which means you can do a lot of the easy stuff in there. It's pretty stable software that runs quite well on a lot of different architectures; it's quite well made. And it comes with a lot of things for post-processing: a program for combining and thinning the log and tree files, one for creating that maximum clade credibility tree, Tracer to look at your convergence, and a lot of other tools to let you do a full Bayesian analysis. And there are pros and cons to that. It has a lot of defaults, and in the new versions it warns you when you've not changed any of the defaults, to tell you that you probably should be changing them.

Oh, that's clever. That's very clever.

It used to not. So often you would see a paper where they'd run it for 10,000 steps, and that's just the default in BEAST, and you're like, that's definitely not right. 10,000 is almost definitely not right for anything. But BEAST is very easy to use, which can be a little bit deceptive, because people will just go and run it and not understand that there's a lot of complexity you need to change in there.

Yeah, I was raised on MrBayes. When I was a small kid, I was playing with MrBayes. So I think I might have a bit of a bias towards RevBayes. I don't know if MrBayes still exists, but...

It's not being supported anymore.

Yeah. Okay. So people should move to RevBayes, right? So probably that's what I'm going to be looking to. And there was also PhyloBayes. PhyloBayes has some interesting models, and its convergence analysis was also nice. But yeah, I think BEAST is the big one, right? So I have a question, because we have BEAST 1 and BEAST 2. What's the difference between them? It's a proper fork, right? Not like GitHub using the word fork now; a proper fork. They've speciated, they've become independent projects, right?

Yeah, they have. So BEAST 1 and BEAST 2 are run by different groups now, with Andrew Rambaut and Alexei Drummond. And they do similar things, but slightly differently sometimes. And what's important to realize is that BEAST creates time trees almost exclusively. Bayesian trees don't have to be time-related; they can also be evolutionary-distance-related, like we often think of from maximum likelihood. But BEAST will only build time trees, as far as I'm aware.

Okay. So they always have a clock model there.

Yes, you always have to specify some kind of clock. Whereas MrBayes, PhyloBayes, and RevBayes will build any type of tree that you need. So RevBayes is fantastic. It runs very, very well. It's very robust. And it gives you a lot of options, but it's written in something similar to the R language. So it can be very off-putting to new people, because you have to code it. You have to say, I want this prior with this distribution. And it makes you think about every single aspect of the Bayesian analysis, which is actually a good thing, because you really should be thinking about all these aspects.
But it can be quite off-putting to people who are trying to learn Bayesian methods and learn coding at the same time. And an MCMC state in RevBayes means something different to what it does in BEAST.

Yeah. When I was playing with MrBayes, the first versions didn't have any clock model. It was like RAxML or IQ-TREE in that respect.

I think people assume that Bayesian trees mean time trees, and that's not true; it's because BEAST is so popular and it only builds time trees.

Can you build a time tree without Bayesian methods? Is there counterpart software?

Technically, yes. You can use what's called LSD, least-squares dating, where you build a maximum likelihood tree, which you then feed to it with a mutation rate and sampling times, and it'll build a time tree from that. And it works quite well for a lot of different things. It's definitely being actively developed and getting better. And a lot of the time it tends to get you the same tree, if you only care about the time tree and nothing else. Sometimes. But a lot of the work on LSD at the moment, if I'm honest, is people comparing it to BEAST, running both and saying, oh yeah, it actually did do as well. But it may not, so maybe you should do that comparison every time.

And I think maybe the first software for estimating relaxed clock models was MultiDivTime. I don't know if we can call that Bayesian; I think it was empirical Bayes, because it fixed the tree topology to the maximum likelihood one. So basically, you give it one tree, and then you give it node estimates, time estimates for the nodes, and MultiDivTime tries to estimate from that.

Just jiggles the branch lengths. Yeah. That would run pretty fast, though, wouldn't it?

Yeah, because it assumes the curvature around the maximum likelihood. And LSD runs really, really quickly.

So a way of thinking of it is that those maximum likelihood approaches basically split everything up: you build an alignment, end of sentence; build a tree, end of sentence; build a time tree. Whereas Bayesian analysis really should be doing the alignment and the phylogeny and the time and the entire model all as one thing. At the moment, we've still separated out alignment, tree building, dating, et cetera. But there are some things that are trying to do tree building and alignment construction at the same time inside a Bayesian framework. That's maybe a larger discussion, but we talked a little bit about it before with SATé and things like that.

Okay. Well, we keep saying models. We're talking a lot about models. Are these the good kind of models? What are examples of models?

Everything's a model. Everything's a model. All models are wrong, but some models are useful.

Yeah. So a tree is just a model of the evolutionary process that shaped the data that we see at the end. And I think it's good to remember that the tree is not a separate thing from the rest of the model. The tree is part of the model; it is a model of the evolutionary process. So when we think of models, again, I revert to molecular epidemiology, because that's what I do, and it's probably the largest use of Bayesian methods in the microbial bioinformatics world. A model that we'd often see is a clock model. The molecular clock is something that maybe a lot of people have heard of, or maybe have not.
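The intuition behind least-squares dating can be seen in a root-to-tip regression: under a clock, the distance from root to tip grows roughly linearly with sampling time, so the slope estimates the rate and the x-intercept estimates the root date. LSD itself solves a weighted least-squares problem over the whole tree; this sketch, with invented numbers, only shows the core idea.

```python
# Root-to-tip regression: the core intuition behind least-squares dating.
# Under a clock, root-to-tip distance grows linearly with sampling time,
# so a regression of distance on date gives the rate (slope) and the
# root age (x-intercept). All numbers below are made up for illustration.

dates = [2001.0, 2004.5, 2008.2, 2012.7, 2015.1]   # tip sampling times
dists = [0.0011, 0.0014, 0.0018, 0.0023, 0.0025]   # root-to-tip distances

n = len(dates)
mx = sum(dates) / n
my = sum(dists) / n
slope = sum((x - mx) * (y - my) for x, y in zip(dates, dists)) / \
        sum((x - mx) ** 2 for x in dates)
intercept = my - slope * mx

print("clock rate (subs/site/year):", slope)
print("estimated root date:", -intercept / slope)  # where distance hits zero
```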
It's the assumption that there is a stochastic relationship between time and the rate of mutation of a gene or a genome: that over a certain period of time, we would expect to see a certain number of mutations per a certain number of years, or whatever the units are. So if we know this rate, then we can correlate change through time and estimate divergence times, among other things. If we know the divergence times and put those into a model, then we can find out the mutation rate; and if we know the mutation rate, we can find out the divergence times. So the model that goes into this is a molecular clock model, and it can be of multiple different types. There can be a relaxed clock, where there's a different rate on each branch in the tree, or a strict clock, where there's just one rate for the whole tree. The branch rates can be exponentially distributed or log-normally distributed. They can be correlated, which means the rate on the ancestor branch in the tree affects the rate on the child branch, or they can be uncorrelated, where all the rates are free to vary. And you have to test which one of these is most appropriate for your data. That's probably the most common model that people are using for this kind of work.

And try to do that in a likelihood framework. Yeah.

So this is sometimes where those programs that are trying to do time trees outside of a Bayesian framework can fall down: they often make an assumption about which aspect of this model is the correct one.

So for a microbe, what is the correct one?

Almost definitely not a strict clock.

No, strict definitely doesn't sound correct, unless they were all sequenced in the same week, maybe. And they were within an outbreak.

Within an outbreak and within the same environment. From what we know of microbes, they're not going to be strict. It's normally an uncorrelated relaxed clock of some kind. And then whether it's exponential or log-normal depends on how much variation there is in the rate: whether it's a lot at one end and little at the other, or whether it's mostly in the middle with two tails.

So which is which in that?

Sorry: an exponential means that you have a lot at one end and only one tail, and a log-normal means that you have two tails to the distribution. That's mostly how you can think of it.

So you'd want exponential if, say, you're sequencing an outbreak, where it's very top-heavy with those samples, and then you've got a few contextual singletons outside of that?

Yeah, perhaps, perhaps. And if you've got two tails, then you've got something where there's a lot more going on.

Yeah, probably, because it depends on how far apart they are. Because the log-normal, it's the autocorrelated, no, it's the correlated model. It says that if you know the rate on one branch, maybe you know the rate on the neighboring branches. But as the nodes start to get far apart, you don't know anymore; you start losing information. That's why it's called correlated. So if it's in the same host, but it's quite diverse. So if you look at the tree and the branch lengths are small and the tips are close together on the tree, then probably the rates are going to be similar.

Yeah, this is where you test all these different parts of the model and see which one is appropriate.
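One way to see the difference between these clock models is to simulate branch rates under each. The sketch below draws rate multipliers under a strict clock, an uncorrelated log-normal clock, and an uncorrelated exponential clock; the mean rate and spread are hypothetical, and real programs estimate these rather than fix them.

```python
import random

# Sketch of the clock models described above, as rate multipliers on
# branches. Strict: one rate for every branch. Uncorrelated relaxed:
# each branch draws its own rate from a log-normal (two tails) or an
# exponential (mass piled at one end, a single tail) distribution.

mean_rate = 1e-7            # substitutions/site/year, hypothetical
n_branches = 8

strict = [mean_rate] * n_branches                   # one rate for the tree
ucld = [random.lognormvariate(0, 0.5) * mean_rate   # log-normal multipliers
        for _ in range(n_branches)]
uced = [random.expovariate(1.0) * mean_rate         # exponential multipliers
        for _ in range(n_branches)]

for name, rates in [("strict", strict), ("lognormal", ucld),
                    ("exponential", uced)]:
    print(name, ["%.2e" % r for r in rates])
```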
It's difficult to guess and say, oh, you have an outbreak, you should definitely be using this. It's more about testing, because what we see as an outbreak, for example, may not behave like one.

Yeah, but then that begs the question: if we run all of them, well, we don't run strict, fine; we run relaxed, we run correlated, we run exponential, we run log-normal, but how do we pick? You go back to: how do we assess?

You have what's called path sampling, or stepping-stone sampling; there are parts inside these programs that allow you to estimate a marginal likelihood. And that says: given that all the other parts of the model are fixed, if I use an uncorrelated or a correlated or an exponential or a log-normal clock, I run it for those four along different paths where I gradually bring the data in and make the data stronger. I can then compare and see which of these is more appropriate for the data. So there are ways to test these models inside programs like BEAST and RevBayes.

And "appropriate" means best fit?

Yes. It uses the Bayes factor. It's a Bayesian version of the likelihood ratio test. Like we talked about picking different models of evolution with an AIC or a BIC, it's a similar concept, in that you rank these different models and say which one is most appropriate for your data.

But it's a pain in the neck to calculate, because you have to run the chain many times along different paths, right?

Yeah, you normally have to run it like 30 or 40 times. So these things take a long time. So often it's one paper that tells you those things, and then another paper, or they say, just put that in the supplemental, and you're like, but that was six months' worth of work.

That's an example of a model. And another common example is what's called the coalescent, which describes the population and how it's evolving over time. A coalescent event is: if you randomly select two people from a certain time point in the population, how far back into the population, in terms of generations, do you go before they have an ancestor in common, before they coalesce into a single organism or something like that? And these models can then be put into your phylogenetic framework to help guide how the tree should look, or how the population size is changing, or how the mutation rate is changing, these kinds of things. So those are two of the models that, if you were running BEAST, you almost always have to specify in some way. And then there are lots of additional models on top of that, like birth-death skylines and all these kinds of things, depending on the analysis that you want to do.

So there was one paper by Sebastian Duchêne and some others where they calculated the clock rates for a bunch of different organisms and just put that in a paper.

Yeah, it looked like a cheat sheet for molecular clock rates.

Yes. I used that as my justification for why I picked a certain rate over other ones in my papers.

So you support that paper? You think that's a good place to start?

For Mycobacterium tuberculosis, it worked quite well, and he and other people redid the analysis, much more in depth, specifically for Mycobacterium tuberculosis a couple of months ago, and found very similar results. The difficulty is it can be a little bit misleading, because not all outbreaks or different things within one species will behave in the same way. Different MTB lineages seem to have slightly different clock rates, and short-term evolution also seems to have a different clock rate to long-term evolution.
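The stepping-stone idea, running a ladder of chains that gradually bring the data in and multiplying up the averaged likelihood ratios, can be demonstrated on a toy model where the marginal likelihood is known exactly. The sketch below uses a conjugate normal model so each rung can be sampled directly; phylogenetic programs do the same computation with MCMC at every rung, which is why it takes 30 or 40 runs. The data and ladder here are invented.

```python
import math
import random

# Toy stepping-stone estimate of a marginal likelihood (the quantity
# behind Bayes factors). Model: theta ~ N(0, 1), y_i ~ N(theta, 1).

data = [0.8, 1.3, 0.2, 1.1, 0.7]
n = len(data)
s = sum(data)
ss = sum(y * y for y in data)

def log_lik(theta):
    # Log of the product of N(y_i | theta, 1) densities.
    return (-0.5 * n * math.log(2 * math.pi)
            - 0.5 * (ss - 2 * theta * s + n * theta * theta))

betas = [(k / 20) ** 3 for k in range(21)]  # power ladder: prior (0) to posterior (1)
log_ml = 0.0
for b_prev, b_next in zip(betas, betas[1:]):
    # The power posterior at b_prev is conjugate here, so sample it exactly:
    prec = 1.0 + b_prev * n
    mu = b_prev * s / prec
    draws = [random.gauss(mu, 1.0 / math.sqrt(prec)) for _ in range(2000)]
    # Stepping stone: average the likelihood raised to the beta increment.
    logw = [(b_next - b_prev) * log_lik(t) for t in draws]
    m = max(logw)
    log_ml += m + math.log(sum(math.exp(w - m) for w in logw) / len(logw))

# Exact answer for this conjugate model, to verify the estimator:
exact = (-0.5 * n * math.log(2 * math.pi) - 0.5 * math.log(1 + n)
         - 0.5 * (ss - s * s / (1 + n)))
print("stepping-stone estimate:", round(log_ml, 3))
print("exact log marginal lik :", round(exact, 3))
```

Running this for two competing models and differencing the log marginal likelihoods gives the log Bayes factor used to rank them.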
So it's a good paper to give you a starting point, a mean for a prior for your analysis. Does that make sense?

Yeah, it said for MTB it's around 10 to the minus five. So I often set that as my mean, but allow the analysis to explore other values around it. I don't just say this is the rate for MTB, but it definitely gives you an idea of what kind of ballpark you should be in.

So, okay, just to close this session: any final tricks of the trade, or war stories from both of you, to help people avoid falling into common pitfalls, or avoid raising the ire of the Bayesianists?

Yeah, there's a neat trick that I remember. Where was it? I think it was in one of the MrBayes manuals. Anyway, it applies to Bayesian analysis in general: instead of running your software with your data to get the posterior, you can run the software with the same parameters, the same settings, but with no data whatsoever, and look at what comes out. The posterior in the absence of data is the prior. So you can compare the posterior and the prior, because in phylogenetics, and in real life, the models are usually not non-informative; you have very informative prior distributions. And I think you can do this with BEAST, maybe by creating an alignment of only Ns.

Right, there's just a button that you press and it runs under the prior.

Okay, yeah. It's a nice trick that has become standard. You run it like this and then you'll see what the prior distribution would be. So before having any data, you can get an idea: what trees do you expect, what set of parameter values do you expect, with the choices that you set? Because then you can see, you know, if your posterior is the same as the prior, and then you say, well, maybe that's a bad sign.

If your posterior equals your prior, go back to the drawing board.

In a similar vein, often what people want to do is find out when the common ancestor of the entire data set was: when did my outbreak start, or when did it come into this new area, or something. What you're doing there is giving it tip samples; you're saying, I took sample X and it was sampled on this specific date. And what you can often do then is randomize the tip dates: just randomly reassign the dates to other tips, and see if you get the same age for the root, as in the start of the epidemic or something. If you do, then you know that those tip dates are not actually very informative, and you can't place a lot of reliance on the dates that you came up with, because it's not really the dates driving anything; it's the model. So date randomization is a good check as well, along with running it under the prior. And there are some extra checks to do.

Yeah, but maybe you have to change the term, because if you say randomization, some people might think frequentist.

Yeah, and it doesn't work for everything. My biggest trick of the trade is just: talk to other people about it. Doing Bayesian analysis by yourself is difficult, because you want to make sure that you've run everything correctly. So just collaborate. Don't feel like you have to do everything by yourself. Almost every person I do Bayesian work with is really, really nice and more than happy to collaborate and to answer questions, because we want the trees to be right and the analysis to be correct.
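A minimal version of that date-randomization test, using the simple root-to-tip rate estimate from the earlier sketch and made-up numbers: shuffle the sampling dates across tips, re-estimate the rate each time, and see whether the estimate from the real dates stands out from the shuffled ones.

```python
import random

# Sketch of the date-randomization test: reassign tip dates at random and
# re-estimate the clock rate. If the estimate from the real dates sits
# inside the shuffled distribution, the dates carry little temporal signal
# and the time estimates should not be trusted. Numbers are invented.

def rate_estimate(dates, dists):
    n = len(dates)
    mx, my = sum(dates) / n, sum(dists) / n
    return sum((x - mx) * (y - my) for x, y in zip(dates, dists)) / \
           sum((x - mx) ** 2 for x in dates)

dates = [2001.0, 2004.5, 2008.2, 2012.7, 2015.1]
dists = [0.0011, 0.0014, 0.0018, 0.0023, 0.0025]

true_rate = rate_estimate(dates, dists)
shuffled_rates = []
for _ in range(100):
    shuffled = dates[:]
    random.shuffle(shuffled)               # reassign dates to random tips
    shuffled_rates.append(rate_estimate(shuffled, dists))

print("rate with real dates:", true_rate)
print("shuffled-date range :", min(shuffled_rates), "to", max(shuffled_rates))
```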
And the more people that do it, the easier it is for everybody.

That's a wonderful sentiment. Please go and talk to your friendly local Bayesianist today; find local Bayesians in your area.

For me, a really good Bayesian analysis sometimes requires the biologist who really knows the organism, the person running everything in between, who is often the bioinformatician, and then maybe another person who's more on the Bayesian side. So what I often say is: in my normal job, I'm the modeler working with all the biologists, and when I work with a group like Tanja Stadler's, I'm the biologist working with all the modelers. So you need to know a little bit about both, but it's always good to have a person who's really working with that organism, who knows it and can say, does this seem right? So if I'm trying to say multidrug-resistant TB arose, and it seems to have arisen in 1920, they're like, that's not possible, something's definitely gone wrong; these drugs didn't exist until the 70s. So you need to ensure that the data looks right, and also make sure that the model is properly specified. So collaboration, I think, is good.

All right, excellent. I think that's a great sentiment, and we'll close this session on that note. I'd like to thank both Conor and Leo for joining me yet again for another episode of the MicroBinfie podcast.

Thanks, Nabeel. Thanks, all.

Yep. And so this is Nabeel from the MicroBinfie podcast; thanks for listening. Thank you all so much for listening to us at home. If you like this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group and edited by Nick Waters. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute. Thank you all for listening. I'm Nick Waters.