Hello and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Both Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work on microbes in food and their impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the U.S. Hello and welcome to the Microbial Bioinformatics podcast. This week I have some special guests joining me. I have Dr. Conor Meehan, who is a lecturer in molecular microbiology at the University of Bradford. He specializes in whole genome sequencing and molecular epidemiology of pathogens, primarily Mycobacterium tuberculosis, and in genome-based bacterial taxonomy. He says the programming language he's always wanted to learn properly is R. Also joining me today is Dr. Leo Martins, who is head of phylogenomics at the Quadram Institute Bioscience. He enjoys developing and implementing tree-based models, so he has a bunch of tools like biomc2, guenomu, and treesignal, and we'll have links for those packages in the show notes. He recently moved to working with bacteria. Previously he worked with viruses and eukaryotes, but from a modeling and theoretical perspective. He says he should have learned OpenCL by now, and next year he wants to write something in C++. Good day to you, gentlemen. Nice to have you on the show. So for today's episode, we are going to talk about phylogenetics. I think both of you are expert arborists. Yeah, supposedly. Tree herders. So we'll start off with a softball question for both of you. Conor, who are you and what do you do?
That's probably more of a difficult question. I do a little bit of everything. I've always worked on pathogens of some kind. So I started off with HIV and then moved into human microbiomes, lateral gene transfer a little bit more, and now into mycobacteria, but it's always been based around phylogenetics. So in HIV, that was small transmission trees. In microbiomes, it was actually lateral gene transfer and some taxonomy, and now I'm doing a lot of tuberculosis, Mycobacterium ulcerans as well, and other things, with a lot of Bayesian reconstructions and heavy use of a lot of phylogenetics programs. Okay, and what about you, Leo? So yeah, I'm a bioinformatician here at the Quadram Institute Bioscience, and what I'm doing right now is to provide both service and research. So on one hand, I install software and I run analyses for other researchers at the institute, especially when they involve some sort of phylogenetic inference. And at the same time, I also design and implement new software when I see there might be a gap in the current landscape or it will help future research by the community. By background, I'm a Bayesian phylogeneticist, so I usually develop Bayesian models for complex phylogenetic problems like recombination and species tree inference in the phylogenomics context. So it's clear both of you have some excellent credentials when it comes to phylogenetic reconstruction, and I'm assuming that between the two of you, you must have used every single program out there. And I was curious; I have a basic assumption that different data or different data sets require different approaches. Maybe they don't. Do you both agree at least with that sentiment? Yes. Mostly yes. Okay. Mostly yes? Would you like to clarify? I mean... A lot of the reasons for using something like parsimony or distance-based approaches have gone away, I would say.
So a lot of it now comes down to, in terms of the phylogenetic approach, something within a maximum likelihood framework or, even better, a Bayesian framework. So I think as we go along, you'll start seeing the simpler methods drop off and most of the data will go through some form or some different aspect of maximum likelihood or Bayesian inference. And is this simply because it's more feasible to compute using those more comprehensive methods, or is there a fundamental change in the theory? What's the difference? Why are we making that shift? I would say a little bit of both. So about 10 or 15 years ago, you would have had a computer dedicated to running your maximum likelihood tree, and you would leave it for days for maybe 20 to 30 taxa, but that's not the case anymore. So back then, parsimony and neighbor joining were done so that you could actually get some results within the year. But now, as computers get much, much faster and the code is streamlined a lot by some excellent programmers, maximum likelihood and Bayesian methods can be used on much larger datasets with more comprehensive models of evolution. And what about... Yeah, I understand. Yes, so I'm a bit biased because I'm a Bayesian. But I think that whenever you increase the capacity of the computers, people come up with more data and more complex models. So I don't know; I think we still have space for parsimony or for distance methods. Although I'm not an expert on them, I would say, for instance, if you have a short time frame, you know, an outbreak, or if you're looking at subspecies or clonal complexes, can't you just... because in this case, if you're using SNPs, you assume that you don't have back substitutions, you don't have anything weird going on there. So every substitution that happened in the population is present in your sequences. In this case, can't you use parsimony or distance?
I have the impression that it might be complicated to use Bayesian models for everything. And I think there's a renaissance of these, you know, quick and dirty methods. Yeah, I will say, often when I'm talking to people who want some help with molecular epidemiology, like these kinds of outbreaks, the first question always is: why are you building a tree? And a lot of the time, people are building a tree because that's what they think is required to go into their papers, and they don't always need that. So for an outbreak, it can just be, yes, quickly build a SNP matrix, and then just have some cutoffs to say these things are closer than these other things. And in that case, exactly, a parsimony or a distance approach is fine, because it's just coming from a matrix of some kind. As a quick background to why we sometimes use maximum likelihood and Bayesian methods, for people who may not know, it's to model multiple substitutions happening at the same column in an alignment. Like Leo said, if it's a very short time span, and it's in a small population, the likelihood of multiple mutations having occurred that you did not observe is so low that a parsimony or a distance matrix is fine, and will probably get you the same answer as maximum likelihood. Yeah, I agree with that. I think on short timeframes, when people want to work on an outbreak, they want to look at transmission clustering. And there, you may have to start moving into some of these Bayesian approaches which are trying to turn a phylogenetic tree into a transmission tree, like outbreaker or TransPhylo, which try to estimate how many events you may have missed. So as with everything, the answer is always yes and/or no. I was hoping that even if we were talking about very short, focused timeframes, like outbreaks, we could get some sort of consensus on what it looks like.
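Conor's "build a SNP matrix and apply cutoffs" approach can be sketched in a few lines. This is a hypothetical toy example, not any particular pipeline's implementation: the isolate names and sequences are invented, and a real analysis would start from a whole-genome SNP alignment produced by a dedicated variant-calling tool.

```python
from itertools import combinations

def snp_distance(a, b):
    """Count differing positions between two equal-length aligned sequences,
    ignoring sites where either sequence has a gap or ambiguous base."""
    return sum(1 for x, y in zip(a, b)
               if x != y and x in "ACGT" and y in "ACGT")

def cluster_by_cutoff(seqs, cutoff):
    """Single-linkage clustering: two isolates end up in the same cluster
    if a chain of pairwise SNP distances <= cutoff connects them."""
    names = list(seqs)
    parent = {n: n for n in names}  # union-find forest

    def find(n):
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    for a, b in combinations(names, 2):
        if snp_distance(seqs[a], seqs[b]) <= cutoff:
            parent[find(a)] = find(b)

    clusters = {}
    for n in names:
        clusters.setdefault(find(n), []).append(n)
    return sorted(clusters.values())

# Toy aligned "genomes" (hypothetical isolates)
seqs = {
    "iso1": "ACGTACGTAC",
    "iso2": "ACGTACGTAT",   # 1 SNP from iso1
    "iso3": "ACGAACGTAT",   # 2 SNPs from iso1
    "iso4": "TTGAGCCTAT",   # distant outlier
}
print(cluster_by_cutoff(seqs, cutoff=2))
```

With a cutoff of, say, 2 SNPs, the closely related isolates fall into one cluster and the outlier stays on its own; no tree is needed to make that call.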
It's really, you know, the standard thing that anyone tells you: it comes down to the question you're asking. What's appropriate? What's your data doing? How deeply do you want to model this? And how important is it to establish that line of succession in your data? Well, if you were working on viruses, and you have a small outbreak in a hospital, you look at the MRSA paper that Simon Harris did years ago, which kind of set that out, you can then just do a distance-based or a parsimony tree, because all you want to know is whether they are related to each other or not. And then it's just about the topology of the tree, and then they're probably fine. It's when you want to do something with that tree, as in look at clustering on that tree, or look at the branch lengths and know the times of when this transmission occurred and all that, that's when you've got to go to the more complex things. Yeah, I think I agree with you. When you want to use a tree seriously, you shouldn't rely on parsimony or distance. You want to have something that you can trust, right? Yeah, I think I can agree with that as well. I, myself, am quite fond of doing a first-pass neighbor-joining tree, no matter what size the data set, just to get an idea and make sure that it's sane. But I always keep the caveat that this is not really robust and you need to go back and do something more. There's still a very nice algorithmic trick there, right? So you can use that as a starting point for doing something. Even IQ-TREE or RAxML use some distance-based methods, just to have a first pass on your data. Yeah, I think a lot of people forget that all maximum likelihood approaches often start with a parsimony or a distance tree. And they refine that within RAxML or IQ-TREE or PhyML to get to something better from there. But you have to have something to start with. Well, all right, guys, let's go deeper. Does your advice, or the type of tools you would use, change when we start thinking about
looking at an entire clonal complex, or within a subspecies, not necessarily at the species level? It might be species level if it's a very monoclonal species, but just sort of within a species; not just an outbreak, you're going a bit broader. What would you recommend as the approach there? Does it change compared to these shorter time frames we've been talking about? So I will say a lot of my work recently has gone away from the phylogenetics itself and towards focusing on the data that you're going to put in to build that tree, because it just tends to be a bit more of a mess when you go to whole genomes or the whole species level. There, the maximum likelihood approaches tend to work quite well. Again, I'm a big RAxML-NG fan; other people use IQ-TREE a lot, and there's a big discussion about which one is better. In my opinion, neither is better. They're both quite robust at doing anything at that level. Then it's about what data you're going to put in there. And normally you're going to put in something like a core genome that you've created with Roary, or there's a program called progenomics which will also create these core genomes, and then you want to build a tree of some kind from there, normally maximum likelihood, if you just want to know the relationships between these things. Well, I know Leo is a massive IQ-TREE fan. Well, I use IQ-TREE because it's very easy to use, and it's quick. I've used RAxML in the past, but I remember for some data set it was taking too long, longer than I was used to with IQ-TREE, and then I went back to IQ-TREE. But I think I did it the other way around, because with RAxML, I'd been using that for years. Everyone was like, why aren't you using IQ-TREE? That's where it's at right now. I tried to do this huge clonal-complex-based tree and IQ-TREE was estimating that it was going to take me about two months, so RAxML-NG.
The new version of RAxML is much, much faster at these large trees. But IQ-TREE is definitely more comprehensive in a lot of the things it can do, especially picking the model for you and running it all and doing the bootstraps; it's definitely a great program. I'm contemplating going back to IQ-TREE. So for these more diverse data sets, Leo, do you agree with Conor that essentially it's less about the program you're using, and it's very dependent on the data? Yes. Yes, absolutely. And I think this is going to be more of an issue in the future. So for people who don't know, if it's so important, what constitutes bad data or good data? Well, it depends. Some people call this selection bias, or it can be gene shopping. So gene shopping is when you want to answer a problem, and, for instance, if you want to do a divergence time estimation, you need genes that, let's say, are constitutive genes. So not genes that are under weird selective scenarios; nearly neutral ones. Yeah, exactly. And so this is what they call gene shopping: you go into your genome and you find the genes that are the best to answer your question. But then there's the evil twin side of this, which is selection bias. So maybe you choose the genes for which you have more data available, or the genes that are easier to detect in more genomes, or the genes that you only have in single copy, because then you don't need to handle paralogy. And, by the way, I'm talking about this not only for microbes, because you have a lot of long-running discussions on parts of the tree of life that, when you look at them, boil down to which genes are you taking and which species are you taking, because it becomes a matter of: do you optimize the signal-to-noise ratio?
So, you know, by selecting the genes that are present in more species, you might actually be selecting one kind of signal and ignoring all the rest. So from this standpoint, I think choosing the genes and choosing the data matters: how do you throw away your samples, and how do you choose which samples to analyze? And then if you have a lot of data, I think people sometimes first do a clustering and then remove the sequences that are very similar and work with the other ones. But, you know, what if you're inducing some bias there? Conor, any comment from you? Yeah, data selection is a real big problem with these kinds of things. So there's become a huge trend of taking the core genome, where you have all of those genes that are present in every species that you're looking at, or every isolate you're looking at, and then concatenating that all together into what's called a superalignment, and then building a tree from that. And a difficulty, especially with bacteria, is lateral or horizontal gene transfer. You're assuming that everything in the core genome was vertically inherited and not horizontally inherited in what's called orthologous replacement, where the bacterium has a copy of that gene and just replaces it, through lateral gene transfer, with a copy from another bacterium. And if all the individual genes that you've concatenated together don't all have the same signal, you can get really weird trees, and maximum likelihood can have real problems with incongruent data inside of the same alignment. Understanding what data you should be putting in can be quite difficult. So if you take, for example, 16S: everybody uses 16S and they say that it's the best marker to use, and it's fine in some ways, but some bacteria have four or more copies of 16S, and they are not all identical, so which one of these do you choose?
Some of them have gotten 16S through horizontal gene transfer; which one of those do you use? So paralogs and orthologs and putting all that in, it's a lot more work than people probably think it is. Much of what we're talking about essentially extends across all life. From the sounds of our discussion: on the short time frame within an outbreak, it's sort of dependent exactly on your question and what you're going to do, but as you get wider, your input data is vitally important, more so than probably picking between RAxML or IQ-TREE. Yeah. I don't think that's the bit we talk about the most, but that's the bit that we work on the most, for sure. I always say with phylogenetics, you will always get a tree; every program will create a tree. It's not like you'll get an error that says nothing can come out at the end. No, you get a star sometimes. Yeah, you get a star, but that's still a tree, technically. It's acyclic, technically. It's about knowing a lot of the skills that go on top of that: how to check how good that tree is, in terms of does all the data support it, with something like a bootstrap or, better, in a Bayesian framework. Leo, do you want to chip in? Yeah, actually I'm very happy to hear Conor saying this, that you can get a tree out of anything. So, you know, garbage in, garbage out. Any set of sequences, even if they are random, even if they don't have common ancestry, completely random sequences: once you align them, you have signal there that the tree inference procedures will pick up, and they will tell you this is the tree. And it's very hard to tell. Sometimes it's easy: if you just take random sequences, probably your tree will have very large branch lengths, but still they're going to be finite, because the software assumes that you should have finite branch lengths there. But anyway, I think that's the big question, right?
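Leo's garbage-in, garbage-out point is easy to demonstrate. The sketch below is hypothetical: it uses plain Python with a deliberately minimal UPGMA (topology only, branch lengths omitted) rather than any real phylogenetics package. It hands four completely random sequences, with no common ancestry at all, to the clustering step, which still returns a fully resolved tree.

```python
import random

def hamming(a, b):
    """Number of mismatched positions between two equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def upgma(dist, names):
    """Minimal UPGMA: repeatedly merge the closest pair of clusters.
    dist maps frozenset({label_a, label_b}) -> distance. Returns Newick
    topology (no branch lengths, for brevity)."""
    clusters = {n: 1 for n in names}  # cluster label -> number of leaves
    while len(clusters) > 1:
        a, b = min(((x, y) for x in clusters for y in clusters if x < y),
                   key=lambda p: dist[frozenset(p)])
        merged = f"({a},{b})"
        size = clusters[a] + clusters[b]
        for c in clusters:
            if c not in (a, b):
                # size-weighted average distance to every other cluster
                d = (dist[frozenset((a, c))] * clusters[a]
                     + dist[frozenset((b, c))] * clusters[b]) / size
                dist[frozenset((merged, c))] = d
        del clusters[a], clusters[b]
        clusters[merged] = size
    return next(iter(clusters)) + ";"

random.seed(42)
# Four completely random sequences -- no shared ancestry whatsoever
names = ["r1", "r2", "r3", "r4"]
seqs = {n: "".join(random.choice("ACGT") for _ in range(60)) for n in names}
dist = {frozenset((a, b)): hamming(seqs[a], seqs[b])
        for a in names for b in names if a < b}
print(upgma(dist, names))  # still prints a fully resolved tree
```

The point is not that the resulting tree is meaningful; it's that nothing in the procedure itself can tell you it isn't.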
So, does the tree make sense? Okay, well, let's carry on from that. To both of you I put the question; let's take a hypothetical: a colleague or a student comes to you, very excited, with a picture of a phylogram, and they ask you, oh, look at this tree, isn't this great? What are the things you're looking at? What are you critical of, or what are you looking for, to make sure that that tree looks good and is believable, in line with what we're talking about? And you can say you need to look at this or that, and if you need to go back to primary data, what aspects of the primary data? Yeah, so I think the first thing that I would do is I would go back to the primary data, or I would ask: how many different models did you use? How many different algorithms did you use to get to this tree? And maybe take a look at the alignment, you know, ask: what if you reverse some of these sequences, will we get the same alignment? There's actually a test called Heads or Tails to see if the alignment is good or not. So you reverse all the sequences, because, you know, the alignment sometimes depends on the order in which it starts aligning and the order in which you put the sequences in. So the first thing would be to check the alignment, to see if there are a lot of gaps in there. So which part are you reversing? You can invert the order in which the sequences are added, but you can also reverse the whole alignment, so what's rightmost will become leftmost. If you reverse completely and then align again, and then reverse back at the end, and you get a different alignment from what you started with, this means that, you know, the aligner is doing something there that was not in the data. The alignment is forcing some things there.
That's a really good tip. Conor, any tips from you? What are you worried about? What do you look for? Yeah, alignment is so important when it comes to phylogenetics. And I would also say, again, it comes back to the question I always ask of why you are building the tree. So if it's to look at the relationships between the isolates within the same species, then different data sources can give you different trees. So you build a 16S tree, and then you build a multilocus sequence typing tree, and then you build a whole-genome-based tree, and these might all be different from each other. If they're all exactly the same, then that probably means there's a robustly strong signal for whatever this one tree is. If they're all different, that doesn't mean that any of them are wrong. It could actually be that's just a different signal, and then it comes back to the question of what are you trying to prove with this tree. Are you trying to say that these isolates evolved from a common ancestor with each other? Then it's about making sure that each individual gene that you put into that alignment doesn't suffer from recombination or lateral gene transfer or all this kind of thing. So for me, I want to prod at the tree as much as I can to see if it breaks. There's just one thing before we finish, which I think is part of a standard answer: look at the bootstraps. Yeah. So, you know, look at the confidence, the confidence interval. Or if you're working Bayesian, you can look at the distribution of trees. I think that's what you are hinting at, right? Sorry. Yeah, I would say if you are good at doing Bayesian, or if you're not, find someone who is good, and build a really good Bayesian tree. And if that has been tested properly, to make sure the priors were all correct and that your models going in were all correct, I'm more inclined to, quote unquote, believe that tree.
So my last question for this session is: are there any really stupid mistakes that people can sanity check for when they look at a phylogeny? A couple of ones that I've run into: someone shows me a tree, and it's meant to be a maximum likelihood tree, you know, very rigorously produced, but all of the branch lengths are the same. It's essentially a cladogram, or part of the tree looks like a cladogram. Another one is, as you mentioned before, certain taxa have exceedingly long branch lengths that don't make any sense. So are there any other things like that, where you look at a figure and you just go, okay, that's just wrong, something has happened? For me, it's where the "bio" of bioinformatician comes in. It's about knowing what you're working on. When I'm working on tuberculosis, I normally have a fair idea: I know what lineages should be beside what lineages, and what subspecies should be clustering with what subspecies. And if they're not doing that, then something has definitely gone wrong. And I see that quite a lot. The branch lengths: if someone's using SNP data and they haven't done an ascertainment bias correction or reintroduced the constant sites, the branch lengths will be wrong. And if you look at that branch scale and you're like, wait, this is species level and there are tiny branches, something has gone wrong there; you've cut out a lot of the data, maybe by mistake or something. I'm happy to hear that, because that's usually the first thing that I search for in a tree: the branch lengths. So if I see a lot of zeros or, you know, values on the limit of zero, because sometimes the software doesn't give exactly zero but will give you 10 to the minus 6 or something; if I see these epsilon-like branch lengths, then I know that something... it might not be wrong, but I would look at the data again. All right, well, that's fantastic. Thank you both.
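The branch-length red flags the guests describe (epsilon-length branches, implausibly long branches, cladogram-like uniform lengths) are easy to screen for automatically. Here is a hypothetical sketch, assuming the tree is available as a Newick string with branch lengths; the thresholds are illustrative choices, not standard values from any tool.

```python
import re

def branch_lengths(newick):
    """Extract all branch lengths (the ':<number>' annotations) from a
    Newick string; support values before the colon are ignored."""
    pattern = r":([0-9]*\.?[0-9]+(?:[eE][-+]?[0-9]+)?)"
    return [float(m) for m in re.findall(pattern, newick)]

def sanity_check(newick, epsilon=1e-5, long_factor=5.0):
    """Flag near-zero branches, suspiciously long branches, and
    all-identical branch lengths (a cladogram in disguise)."""
    lens = branch_lengths(newick)
    if not lens:
        return ["no branch lengths found"]
    warnings = []
    near_zero = [b for b in lens if b < epsilon]
    if near_zero:
        warnings.append(f"{len(near_zero)} branch(es) at ~zero length")
    mean = sum(lens) / len(lens)
    long_branches = [b for b in lens if b > long_factor * mean]
    if long_branches:
        warnings.append(f"{len(long_branches)} branch(es) > {long_factor}x the mean")
    if len(set(lens)) == 1:
        warnings.append("all branch lengths identical (cladogram?)")
    return warnings

# Toy tree with one epsilon-length branch and one very long branch
tree = "((A:0.001,B:0.0000001):0.002,(C:0.003,D:0.9):0.001);"
print(sanity_check(tree))  # flags the near-zero branch and the long branch
```

A scan like this won't tell you a tree is right, but it catches the "okay, that's just wrong" figures before they reach a paper.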
That's all the time we have for this session; maybe we'll have you back on for another episode, I don't know. So this is Nabil, with Conor and Leo, signing off from the Microbial Bioinformatics podcast. Thank you. Thank you. Thanks. Thank you all so much for listening to us at home. If you like this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics group and edited by Nick Waters. The opinions expressed here are our own and do not necessarily reflect the views of the CDC or the Quadram Institute.