Hello, and thank you for listening to the MicroBinfeed podcast. Here we will be
discussing topics in microbial bioinformatics. We hope that we can give you some
insights, tips, and tricks along the way. There's so much information we all
know from working in the field, but nobody writes it down. There is no manual,
and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My
co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz.
Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work on
microbes in food and the impact on human health. I work at the Centers for
Disease Control and Prevention and am an adjunct member at the University of
Georgia in
the U.S. Welcome to the Microbial Bioinformatics podcast. Andrew and I are your
co-hosts for today, and we're talking about some exciting new developments in
the area of comparative genomics. Our guests today are Dr. Zamin Iqbal, who
is a research group leader at the European Bioinformatics Institute. He leads a
computational genomics research group working on genetic variation in microbes,
developing methods for representing and understanding complex genetic variation,
and exploring surveillance and diagnostics for antimicrobial resistance. We're
also joined by Dr. Grace Blackwell, who is a postdoctoral fellow working jointly
at the European Bioinformatics Institute in Zamin's group and in Nick Thomson's
group at the Wellcome Sanger Institute. So welcome to you both. Thanks for the
invite. So let's start off with something easy. Normally we ask guests, who are
you and what do you normally do? I'm a computational biologist. I trained as a
mathematician once upon a time, and then as a software engineer once upon a
time, a bit more recently, and then moved into genomics. I spend my time
thinking about how bacteria evolve and how we can conceive of bacterial genetic
variation, and writing algorithms for doing that. The other half of my life is
spent on tuberculosis and moving TB sequencing into the clinic. All right. And Grace?
Yeah. So as mentioned, I'm a postdoctoral fellow, but during my PhD, I was
trained as a wet lab microbiologist with a bit of a genomics aspect, looking
into mobile genetic elements and how they move antimicrobial resistance
genes around. So when I came to EBI, I wanted to still keep that focus of
looking at MGEs, but on a much larger scale. So with Nick, I've learned a lot
more bioinformatics and computational skills to be able to do so. Okay. So, the
first question that everyone really wants to ask anyone who works at EBI: are you
completely loaded? Do you have tax-free cars? You've got all this data at your
disposal, the entire world's data right there, and what are you going to do with
it? So I moved to EBI in 2017. I arrived at the point where we were just in the
middle of trying to trial BIGSI on a big data set. And I was convinced this was
going to be perfect, because I'd be able just to run jobs on the cluster and they
would be able to see the ENA data locally. And so everything would be super easy.
And that was when I realized that life's not that simple, because a cloud is never
where you think it is, so the ENA data isn't physically there. So yeah, I ended up
failing badly. My PhD student ended up SCPing a lot more data than we ever
expected to SCP around. EBI is an exciting place. I mean, there's lots of
people, and it is exciting to have all of that data around. There's actually a
combination of two different things. One is thinking about
public health stuff and really, really thinking about what people need
practically. I think that changes the questions you ask. And so the BIGSI stuff
and Grace's stuff were all really triggered by us trying to work out what you
would do when you really have millions and millions of genomes and how you're
going to query them. And we wanted to do that for surveillance. But yeah, so
it's exciting being there with people who care about sort of looking after the
data. I don't have a Rolls-Royce or diplomatic immunity or anything like
that, sadly. So I really want BIGSI in production, you know, because for me, I
need to go on fishing expeditions all the time, right? So just the other day I
asked Nabil, could he look up some weird and wonderful genus or species for me
and get everything that's there. And I know there is, you know, a couple of
hundred, a couple of thousand of these particular species in the EBI. But what's
the best way to pull those out, you know, and identify them? And not only that,
to identify the ones that I want for my clade, you know, because I know there is
a bit of an emerging cluster there, but I only want the specific ones. But I'm
probably going to have to trawl through tens of thousands of different species,
different genomes to actually pull out the, you know, probably 20 that I need.
So what is the best way to do it? Do you have a solution for me or am I still
going to have to wait for BIGSI to come along? I don't think there's, I don't think
there's one single answer. Okay. There's, there's a technical question and
there's a practical question. So technically, sometimes what you want is to know
whether there's some specific sequence there, and then it's BIGSI or COBS or
others. I mean, there are clever people doing better things than COBS now, but
it's basically sequence search, or ideally BLAST-type stuff rather than just
presence. But that's only half of it. Like, you know, sometimes
someone wants to know, can I have a hundred genomes that roughly represent my
species, the whole diversity of the species, and BIGSI is not the right thing for
that. You know, you want something that represents the, you want genetic
distances of all your stuff basically from each other or a tree or something
like that. And sometimes you want to know is my, is my thing like your thing,
you know, you want something like, you know, what's the nearest thing to my
thing. And I think we're going to end up having to
index in multiple ways for multiple types of query. And really, I think the
thing that we don't have that we need is a general open API that allows you to
query multiple databases and say, I want to make this type of query. You know, if
you support Mash, can you tell me if you've got anything like this? And if you
support sequence search, do you have this sequence? That
kind of thing. It's sort of an obvious thing, but we should
have an open API that allows you to specify what kind of query you want, and
the server can reply saying yes, or no, I don't support it. And
that way, I mean, Grace has done an amazing
job with all of this data and other people are doing amazing jobs with other
chunks of data around the place. Either we spend all of our lives continually
aggregating it, or, guess which answer, we find a way to actually talk between
these databases. And in some cases there'd be legal reasons
why they can't share, or strong preferences, but they might be up for
saying, yes, I do have something like that. So Grace, what is COBS? COBS stands
for Compact Bit-Sliced Signature Index, and in my view, because I'm
not a developer, it's like BIGSI 2.0. So it indexes in the same way, to the best of
my knowledge, but it's just more compact. So it means that when I created this
index, it was now not 80 terabytes. It was only 800 gigs, and we could actually
perform searches a lot faster. BIGSI basically makes a massive rectangle, a big
matrix of Bloom filters. COBS pays attention to some of the genomes being
smaller. And so it's like integration, really: you
have rectangles underneath the curve instead of one massive rectangle. And so
what can you get out of it? Like what are the inputs to it and what are the
outputs from it? Yeah. So you can query any sequence, so this could be a gene,
a particular part of a gene that covers a SNP, or even an entire plasmid, which
I've done for quite a few people. And so what it does is it splits your query
into its overlapping k-mers and then queries each of those k-mers for its
presence or absence in the index, and you usually have to specify a threshold:
how many of those k-mers have to be present for you to get the resulting hit.
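As a rough illustration of the query procedure just described, here is a minimal Python sketch. It is not how COBS is implemented: COBS stores compact Bloom-filter signatures on disk, whereas this toy version simply keeps an exact set of k-mers per sample in memory. The function names (`kmers`, `build_index`, `query`), the sample IDs, the toy sequences, and the choice of k = 5 are all assumptions made for the example.

```python
from typing import Dict, List, Set


def kmers(seq: str, k: int) -> List[str]:
    """Return every overlapping k-mer of seq (empty if seq is shorter than k)."""
    seq = seq.upper()
    return [seq[i:i + k] for i in range(len(seq) - k + 1)]


def build_index(genomes: Dict[str, str], k: int) -> Dict[str, Set[str]]:
    """Map each sample ID to the exact set of k-mers found in its sequence.

    Assumption: an exact in-memory set stands in for the Bloom-filter
    signatures that a real index would use.
    """
    return {sample_id: set(kmers(seq, k)) for sample_id, seq in genomes.items()}


def query(index: Dict[str, Set[str]], query_seq: str, k: int,
          threshold: float) -> Dict[str, float]:
    """Return {sample_id: fraction of query k-mers present} for samples
    whose fraction is at least `threshold`."""
    query_kmers = kmers(query_seq, k)
    if not query_kmers:
        return {}
    hits = {}
    for sample_id, sample_kmers in index.items():
        present = sum(1 for km in query_kmers if km in sample_kmers)
        fraction = present / len(query_kmers)
        if fraction >= threshold:
            hits[sample_id] = fraction
    return hits


# Toy data: s2 differs from s1 by a single base; s3 is unrelated.
genomes = {
    "s1": "ACGTACGTGACCA",
    "s2": "ACGTACGAGACCA",
    "s3": "TTTTTTTTTTTTT",
}
index = build_index(genomes, k=5)

print(query(index, "ACGTACGTGACC", k=5, threshold=0.8))  # {'s1': 1.0}
print(query(index, "ACGTACGTGACC", k=5, threshold=0.3))  # {'s1': 1.0, 's2': 0.375}
```

In this toy case a single-base difference in s2 removes five of the eight query k-mers, because every k-mer overlapping the changed base stops matching exactly. That is why a high threshold finds close matches well but loses more divergent sequences.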
And then it will spit out, at the end, a list of the sample IDs that
have your query of interest and the specific threshold that's present. And it
does really well for identifying sequences that are quite close to what
you query, but because it needs a perfect match for each of those k-mers,
it's not good for identifying more divergent sequences. So would that be useful
for say, looking at plasmids that move around between different species? Yeah,
yeah. I've done exactly that. So one of my projects that I'm doing at the
moment is we're taking one of the replicons for the small plasmids, ColRNAI,
and pulling out all of the assemblies that have that plasmid, and I've used COBS
to do that. And that plasmid is found across kind of all of the Proteobacteria.
And have you noticed any unusual signals? Like I personally,
as a computer scientist was surprised when I saw say a plasmid from E. coli
turning up in Salmonella, but have you noticed anything else like that? Yeah,
for some plasmids the host range is quite well known, how far they can go.
Some can actually go across phyla, which I think is pretty amazing, to have the
replication system work that well, but others we know are more restricted. And
what I find really cool is when you have an identical sequence of a
plasmid in different species, especially over time, which can happen as well.
So these elements, I just think, are quite amazing in the way that they can
spread and evolve, but also stay the same for so long. What kind of size of
k-mers can you support? Like can you support say PCR primers, could you use it
for primer search, you know, and seeing which primers will work best for one
particular species, that kind of thing? There's no reason why you couldn't do
that. At the moment we've indexed it for 31-mers. It's annoying, because
we've got 90% of a job done.
We set up a system so that we could try to keep up with it live. And
while we were at it, we decided we were going to index everything at 15 and 31
so that we'd have both. And we got through to about January 2020, more or less.
And then some people left the team and COVID happened and all the rest of it.
And so the whole plan of moving that from research to ENA core basically fell
over. But we need to restart and try to catch up. There's,
what, 660,000 genomes in Grace's data set and there's now 1.7 million in the
ENA, so we really need to get our skates on to catch up with them. Well, don't
worry. There's been no sequencing of bacteria in the past year and a half
anyway, so. Yeah. That's a good point. Yeah. Imagine what it could have been at.
I mean, it's kind of exciting. For some of the questions, like host
ranges of mobile elements and things like that, it'd be quite hard to design a
study to sample the phylogeny widely enough to see them. I mean,
the things that we found, you know, they're widely enough spread that you
wouldn't know to go and look there to see them. It's only with this gradually
accumulated massive data set from all kinds of people that you find all this
stuff that you would never know to look for. That's one of the exciting things,
really, that we've got, it's a massive shared resource and we ought to be able
to make better use of it. So what kind of memory do you need for a system like
that? Not much, actually; it's all on disk. I mean, maybe a gig or two, but
the limitation is actually disk speed. So ideally it would all be on SSD. Or,
we've actually built an aggregation thing so you can query
multiple indexes. I mean, it's much more scalable if you can split many smaller
indexes across different servers. Are you able to look at AMR genes over all
species simultaneously and get kind of like a high level satellite overview of
how that is and how it changes over time? Yeah, so that's one of the analyses
that we did in the paper. So instead of doing it via COBS, I
actually ran NCBI's AMRFinder on each of the genomes individually. And so that
reported to us any predicted antimicrobial resistance genes, for the
acquired ones; we didn't end up including SNPs. And yeah, so we were
able to look at, you know, what genes are where and how many
antibiotic resistance genes are in a specific genome. We were slightly limited on
the over-time side, as we were working from the metadata that you get from a read
submission, which doesn't include isolation date. And so instead we had to use the
date made public as our sort of age estimate for that dataset. So we've been talking
about a range of different topics. And a lot of this is actually encapsulated in
this recent preprint from the both of you and your colleagues. So can you take
us through that? So it's out on bioRxiv as a
preprint called Exploring Bacterial Diversity via a Curated and Searchable
Snapshot of Archived DNA Sequences. And we'll put a link to that in the show notes so
people can find it. So can you take us through, maybe Grace, what was the
motivation for this? I mean, a lot of it we've kind of touched on in terms of
just getting on top of all the data out there, but what was it for you really
getting into this? So for me, when I came to the EBI and Sanger, along
with Zam and Nick, we wanted to look at mobile genetic elements and how they're
distributed within bacteria and how they've evolved over time. Those sorts of
questions were what we were interested in asking. But really we wanted to apply
this to all the data that was available. And so just when I joined, the BIGSI demo
was available, and that had everything up to the end of 2016, and it was indexed
just based on the reads. And so I could do my queries in BIGSI and I could get out
the datasets that had my element of interest. But when it came to analysis, I really
wanted the assemblies, and also to know some extra information about what's in the
assembly, such as other plasmid replicons or what AMR genes it contained. And so that
led to one conversation with Nick and Zam where it was like, let's just assemble
it all. And so that was probably early November 2018. And then we pulled in
Martin Hunt, who's an amazing developer and bioinformatician. And yeah, over
about three weeks he put together our download pipeline and assembly pipeline. And then
I spent the next six months actually running the pipeline and getting all the
661,405 bacterial assemblies. Yeah, and so now these are all up on an FTP server
via the ENA, I think, aren't they? We are in the process of submitting them to the
ENA. They're gonna be classed as third-party assemblies. It's also been about a
year and a half of trying to figure out how they class our assemblies and how
they wanted us to upload them, and making sure it was in a form that NCBI would
also agree with, and DDBJ as well. But yeah, so they're on the FTP, but hopefully
they'll be in the ENA and SRA soon. Zam, anything you wanted to add? Yeah, I think
it's quite exciting trying to build this. At the start of this, I was
sort of keen not to do the assemblies if I possibly could. When we did BIGSI, we
would just, I mean, it's because short-read assemblies are imperfect anyway. And
sometimes things have got mixtures and so on, but it is so much more valuable
having the assemblies and being able to actually properly align and try and
figure out what's going on, especially when you're trying to work out the
history of some mobile element and whether it's inserted in another mobile
element or whatever. So it's taken longer than we expected. Although, I mean, we
were obviously flat, but just all of the curation that Grace has had to go
through in order to get something high enough quality, I think it's just gonna
be really useful for people to have. I mean, one of the things we've been having
to justify to people recently was, does it matter that they're all
assembled the same way? Because you could just take collections of assemblies
that already exist, and it depends what you're gonna do. But if you're trying to
look for a signal across all of these genomes and there are all these sort of
implicit batch effects, because some of them have been assembled with one tool
and some with another, and you find a subtle signal, you've got to work quite
hard to persuade yourself that it's not actually an artifact of different
assemblers being used. So it depends what you're gonna do, but there is real
value in having a sort of completely uniform system for assembly and QC for the
whole thing. Yeah, I totally agree. I mean, if your interest is the fringe
elements, mobile genetic elements, repetitive elements that tend to be more
problematic, I can imagine if you pulled in a bunch that were just done on
Velvet or something like that, they would not be comparable. Yeah, exactly. But
I think people have forgotten the old days when
you really did get different outputs from different
assemblers. Well, it's a funny time for us now, because at one level with long
read assemblies, we're sort of expecting really perfect assemblies. And so now
looking at Illumina assemblies, they seem much worse than they would have done
before. I mean, yeah, I sort of look forward to a place, a time when we can have
this many. That's really where we have to go. That would be the dream, yeah.
That'd definitely be something amazing. But I mean, we touched on a few of the
different use cases already with digging around for mobile genetic elements or
AMR. But what are some of the others that for you were the main
motivations for this work? I suppose what I want to mention here is more the
extra indices that we put on top. So yeah, making it searchable was a big thing,
and that was COBS. But the other thing was we wanted to actually be able to
look at the phylogenetic relationship between the genomes that did contain
any gene or sequence of interest. And so that really led us to doing, at first,
the sourmash index. So we sketched each of the assemblies and then put that
into an index. And so that really allows, like, if you've, say, got a new genome and
you want to see how it is related to anything we've seen before, you should be
able to sketch it and search it against this database, and you'll be able to get
out however many of its closest neighbors you're after. And then the other
thing: we've kind of followed this up with John Lees's pp-sketchlib, and that
performs a core and accessory genome distance estimation. And so then
we're able to, yeah, actually have putative relationships between these genomes
and can make trees. Nothing, you know, too fancy, but enough to tell you if
you've maybe got a cluster of very closely related strains, or if they're
different clades within a species. You've given the kind of answer I might've
expected myself to give. So I'll give the kind of answer I'd expect you to give,
which is like, there's just so many unknowns with transposon and plasmid and
other mobile element biology. And to give definitive answers, you sort of need
to work very carefully, but to get good hypotheses, I mean, this is really fertile
ground. So I'm kind of excited about sort of picking individual transposons and
things and following them up through the data and trying to understand how and
when they, well, when is quite hard to estimate, but, you know, where they're
more and less abundant, how they've changed. Again, it's all sort of
circumstantial, but looking at what correlates with what, particularly stuff
like spacer elements, what correlation is there between spacer elements and the
mobile elements you do and don't find inside genomes, that kind of thing.
There's so much stuff to it. I have like a laundry list of genes of interest or
structures of interest that I would love to kind of get, dig into this and go
after, you know, maybe once in a glorious future when COVID isn't a problem. I
mean, from the other Jekyll and Hyde half of my personality, I'm quite keen on,
basically, so there's a weird thing, which is that k-mer indexes are a bit like
Google. They're sort of like a document retrieval problem. So in Google,
you pass in a word and you want to know which webpage
contains it. And here you pass in a k-mer and you want to know which genome.
Obviously bacterial language is so much more complicated than
English or French or whatever, but it is that kind of problem. And there are a
lot of super clever computer scientists out there with very clever ways of
doing these kinds of indexes. And you can already see, I mean, it's four
years, I guess, since BIGSI came out, and there are so many different
approaches now which are more efficient. So I think this data set is
just a really useful data set for people to benchmark on. And you've already
seen this amazing paper from Rayan Chikhi and co, where they're building
minimizer-space de Bruijn graphs, where the alphabet is minimizers instead of bases. I
think it's really important for us, for people who care about the majority of
cellular life on this planet, as opposed to humans, that we make the
data sets and the kind of problems we want to answer
really clear, because bioinformatics tools inevitably end up being
tuned for the use case that people want to use them on. And it's very easy
for things to get super tuned for working really well on human genomics, which
is important, but it's different. And I think we need to, you know, I
actually think there's room for, this is overkill, but, you know, instead of
having only 22 problems for mathematics, having that sort of list of bioinformaticians'
problems for when they're trying to deal with bacteria. I think the kinds of
problems you want to answer and the kind of data that you deal with are very
different because there's billions of years of diversity in there. So, I think
just from a, just providing data for other people to innovate on this.
Definitely. I see this as a step change towards more of the data analytics
that everyone else does in other fields and in computer science generally, where
biology has been a bit behind, because it is difficult to
articulate exactly what we're trying to do and trying to look for. I
tried to explain this kind of genomic data to physicists and they just lost their
heads. They're just like, what, how would you deal with that much error in your
sequencing? What do you mean, you can't trust it? Like, what are you doing with
it in the first place then? I mean, that is a challenge, that it keeps changing,
particularly with nanopore data, right? You know, the error
style is changing and even the basecaller is changing, and it's hard to
compare genomes basecalled with different methods. So I think I'll cross
back and ask Grace about the AMR results in the paper, because that I think is
the most sizable biological result, or maybe I'm
wrong. Do you want to take us through that? And what were the main
take-homes from that analysis? When you've looked at 600-odd thousand
genomes, what can you tell us? Yeah. So I suppose I just want to preface this
with some caveats about how the analysis was done, in that all of this is
predicted. So genotype does not necessarily mean phenotype, but
also we're missing, you know, mutational resistance, and that was just to do with
how it was run. And definitely for some organisms
that's, you know, the major pathway to resistance. And the other thing is,
just because I was running it on everything at one time, there have been some, you
know, sneaky core genes that get counted as resistance genes. So where
possible, I've tried to note that. But moving forward
from those caveats, it does show us some pretty interesting trends. There
are some genomes with a huge number of antimicrobial resistance genes. Like I
think there were some E. coli and Klebsiella that had from 32 to 34 different
antimicrobial resistance genes. And yeah, I went to check these isolates just
to be sure that there wasn't, you know, any strong sign of contamination in
the genomes themselves, but they looked pretty clean. So yeah, there are
some very resistant things out there. When we were just plotting the
number of resistance genes per genome in a genus, you could see some
definite peaks from certain genera. And when we compared back to the WHO
priority list for bacterial pathogens, where there's a real need to develop new
antibiotics because they are so antimicrobial resistant, yeah, we see the
same genera coming up. So these are problems that we already knew
about, but there are quite a number of species that came up that, you know,
aren't on that list. And so potentially they might be problems in the future or
might be reservoirs of antimicrobial resistance genes for our more problematic
clinical species. So I noticed that TB isn't even on that list. So is TB still a
problem? I think if you read the introduction to that paper, the one that gives
the WHO priority pathogen list, it then says "and TB". It's like a secondary thing.
It's because the numbers for TB are so high, it just dwarfs
everything else in the graph, right? I mean,
yeah, I think it dominates. And from the point of view of this analysis, I mean,
Grace was only looking for gene presence rather than SNP driven stuff, so it
wouldn't have shown up. Yeah, TB is a scary problem. There's lots of interesting
science, but it's as much or more a political and social problem of ending
poverty at some level, as much as, you know, diagnostics and surveillance and
drugs. I mean, all of which are important, particularly drugs. Yeah, it's a huge
deal. Just to go back to something you said there, Grace, you mentioned that you
found some E. coli with huge numbers of AMR genes. So are there duplicates, like
are they targeting the same drugs and they've just become super resistant
to particular drugs, or is it kind of a spread of everything? Both. Like I
think there's quite a broad range of coverage, but there definitely is some
redundancy. I've seen isolates, though I can't tell you specifically
about those ones, that have three different sulfonamide resistance genes
where, you know, one is definitely enough to confer high levels of
resistance to sulfonamides. So there definitely is some redundancy in there, but
with that many, I'd also expect it to cover the broad range of potential
antimicrobials that would be thrown at it. So I guess you'd call these something
like absurdly drug resistant as opposed to multi-drug resistant. I mean, without
the testing, I wouldn't want to call them pan, but my guess is they'd probably
be towards that. That's pretty scary, isn't it? I wish there was a better way of
strain banking rather than just, just, you know, banking so we could test
things. I mean, there'd be a ton of work to set up. You really hope there's a
big fitness cost there. Yeah. I mean, the other kind of thing to comment on with
this is we do know that like a lot of sequencing projects will actually select
for resistance. So you might be plating on a particular antimicrobial already.
Like I know a lot of people do it for the third-generation cephalosporin or carbapenem
resistance they want to select for in, maybe, that particular hospital. And so a
lot of what gets through into the sequencing databases are the ones that
are already resistant. So I think it would be really worthwhile to also make
sure you're pairing it with a susceptible isolate, if you're doing it from a
hospital. Let's talk more generally about that, the issue of sample bias,
because obviously what people are sequencing, especially for isolates, is what's
culturable. They're sequencing what's clinically relevant. Was that
readily apparent in the data set? Did that cause problems? Yeah, definitely
apparent. So for example, if we just look at the species breakdown of the 661,000,
90% of that data comes from just 20 bacterial species. Almost 30% of it is
Salmonella enterica alone. So definitely huge biases in what species are collected,
but even within species. So say we go into E. coli: huge numbers of ST11,
which is the O157:H7 serotype, so really important in disease, but also things like
ST131 that seem to be overrepresented in the data compared
to, you know, all the other STs, sequence types, of E. coli. So you've got, yeah,
you've got your species level biases, you've got within the species bias, and
then we've got selecting for particular resistant isolates bias as well. Because
you've got the US CDC, FDA are doing sequencing, PHE are doing sequencing, but
then other countries do virtually no sequencing, which does complain. You're
totally right. And you could think of that as funding bias rather than country
bias actually, because, so Grace has a statistic, which I've forgotten, but it's
like a handful of sequencing projects cover, I don't know what it is, tens of
percent of that data. So I've got 50 projects, 50% of the data, 50 sequencing
projects. Yeah, I mean, the bulk of it's going to be, because for things like
GenomeTrakr or the PHE enteric bacterial submissions, they're all just going
into one BioProject because it's a pain in the neck to keep setting up new
projects really. And I guess no one is really looking at those either if they're
just kind of regulatory, turn-the-handle type sequencing. So actually it's probably a
very nice data set to mine if you can handle it. Yeah. And I mean, actually
something that we've done in response to reviewers was to compare the
composition of our dataset against, you know, the genome assemblies in NCBI and
the PATRIC database. And just comparing the sample accession ID, which is
present in each, we actually found that we've got over 300,000 assemblies that
weren't in NCBI Assembly prior to this. So they only existed in the read
database. And so, yeah, like you said, they were probably not looked at or analysed
in any sort of great detail. When is enough sequencing enough? You know, when do
we stop for a particular pathogen? Like, say, take TB: is a hundred thousand TB genomes
enough, and can we just stop because it's all the same? It depends on your goal.
I've got about three answers to that. One is TB is unusual because sequencing it
is directly useful for treatment. So you get information that directly allows
you to decide what drugs you can treat with. So we're actually moving in the direction
of more countries starting to sequence routinely. So for the
individual patient, on that basis, we're going to keep going. You could
legitimately argue that we maybe don't need to share it, but I would
legitimately argue that we should keep it in particular because we don't know
now what's important. And we can later on discover that, you know, some mutation
caused drug resistance. And then you can look back and say, actually that
appeared in 2016 or whatever. So like, for example, people have discovered that
the bedaquiline resistance first appeared before people started treating with
bedaquiline. We don't know why. There's potentially shared mechanisms with
resistance to antileprosy drugs and clofazimine, but it's not totally clear that
those drugs are really heavily used, certainly not in South Africa. So I think
basically biology is complicated and drugs are only one selection pressure and
resistance mutations are not just resistance mutations, you know, they have
cellular metabolic effects. I think that's a cheap answer: we're going to
need to keep sequencing because of diagnostics. I think you're right that if
nothing much was going on now, you could learn a lot about ancient history without
having hundreds of thousands of genomes because it depends how far in the deep
past you want to go. The thing is that we're applying all these pressures now
and we sort of want to know what's happening now. So that, you know, there's
interesting examples. For example, here's an interesting one. Quite commonly
for TB you test with line probe assays, which are effectively PCR tests
for specific SNPs. If there's a resistance SNP that's not on your list, it's not
in your PCR test, that's a selective advantage for that bug because, you know,
you don't treat it appropriately. And so you're actually seeing
outbreaks where there's basically diagnostic-driven selection for mutations
that are not in your catalog. Like the chlamydia in Sweden. A few years ago in
Sweden, they had a huge drop in cases of chlamydia and it turned out it was just
that the particular PCR primers they were using weren't working anymore. And those
primers targeted a plasmid. Oh really? Like it was selecting, it was trying to
detect a plasmid, which like I have heard is very stable in chlamydia. It's
normally always there, but some changes happened. Yeah. So it wasn't even like
on the, like, chromosome, the marker. Oh wow. But anyway, I mean, I think we
keep going as long as changes are going to keep happening. So first of all,
we'd like to notice when resistance to new drugs appears, and we're pretty sure
it's going to happen fast, and when it happens, we sort of want to
try and backtrack and sort of find out when it appeared and where else it is. If
you're rolling out a regimen somewhere, I think in theory it would be important
to know, you know, is resistance already there? I mean, that's me being
paranoid. It's not a super likely thing. But I think in principle, when you're
testing a new antimicrobial, how do you test? I mean, obviously you test in the
lab. In principle, you would have clinical trials. It would be really good if
we started deep sequencing people in clinical trials and noticing low-frequency
mutations which start to appear and so on. And if I was a drug manufacturer and
I knew something caused resistance, I'd quite like the
ability to go and check, you know, is it out there already? Or is this, you
know, at least a priori something that doesn't exist in the world yet? It
depends. Anyway, another, another answer, which is maybe more constructive. If
you want to know how much AMR or whatever there is out there in the world, there
are other species we need to be sampling rather than the ones we already have.
Systematic random sampling from across the world would be super valuable. Yeah.
And just, yeah, increasing the diversity. So not just maybe different species,
but also the same species that we care about, but in different places in the
world from different environments, all of that, I think would really be valuable
and would increase what we know about bacteria. I guess even what's inside us,
you know, we don't have to look too far. There's so many bugs inside us we only
have terrible MAGs for, and not even isolates. So start at home. I mean, we
have to go back and redo all of this for different growth conditions, don't we?
Yeah, we do have to do that. The work never ends. Like microtiter plates with
different growth conditions, different drugs. That's the way forward. Well, it's
one way forward. We're going to be busy for a while, even if it wasn't for
COVID. Great. We have a job then for the next few decades. We're just
advertising that we should have a job for life. Well, until someone comes up
with a sequencing technology where we can just get the full chromosome out, you
know, a single cell goes in, magic, full chromosome, full plasmids, push the
button. That would be great. But let's say I give you 10,000 perfect genomes and
you want to understand either their evolutionary history or, you know, how
things are related and things like that. It's still not straightforward, in
particular because bacteria have pangenomes and, you know, it's not
strictly clonal evolution. I think there's a lot of open and
interesting stuff going on. The obvious stuff, which is, you know, the
difference between clonal inheritance and genes moving in and out, and what
compensatory mutation happens when stuff arrives and this kind of stuff. And
then the bonkers stuff with phage and plasmids and mobile elements and what's
going on there. I mean, I actually think, you know, once we're at
that place that you're talking about where you can get perfect genomes out,
which is not miles away, I think we're just going to start seeing so much
fast stuff happening. Like the mobile stuff is so much faster than SNPs. Yeah. I
think that's only really the end of the beginning. There's, there's an entire
universe to unravel there. I think that's exciting rather than scary, but maybe
it's a bit of both. Yeah, that's the bit that I'm really waiting for. So, as
you mentioned, that's one way of the future. What else is
on the cards in terms of future directions? Maybe we can couch it in just
this preprint. What next? I mean, you've got review comments to do and then
catch up with the other, you know, over half a million to do. If Grace has her
way, then we'll sort out protein search. I think that's
been aggressively going up the priority list. I think that
would be, I mean, COBS doesn't care what alphabet your data's in. So we could
six-way translate everything and then search it that way. What would be the
advantage of protein, say, if you did it that way and, yeah, what
would you do with it? I can tell you what I'd want, you know, if I had protein
search, and maybe you can solve it then. I want pathways, right? I want to know
which pathways are in, you know, all these different bugs. So I can go on a
fishing expedition and then just pull out all the stuff I need. So that's
much more than just a sequence algorithm question. Well, it's the
next level, you know, you go up to proteins and then you join the proteins
together and you have some kind of mechanical process or whatever. So, you know,
it's all about joining the dots and you've got the nucleotides and now you're
going to protein. And then the next obvious level is, you know, what does it do?
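
To make the six-way translation idea mentioned above a bit more concrete, here is a minimal Python sketch. It is only an illustration under stated assumptions: the function names are hypothetical, it uses the standard codon table (which assigns the same amino acids as the bacterial table 11), and it has nothing to do with how COBS itself is implemented. The translated frames would then be what you index instead of the nucleotides.

```python
from itertools import product

# Standard genetic code, codons ordered T, C, A, G at each position.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {"".join(c): aa for c, aa in zip(product(BASES, repeat=3), AMINO)}
COMPLEMENT = str.maketrans("ACGT", "TGCA")


def reverse_complement(seq):
    """Return the reverse complement of an upper-case DNA string."""
    return seq.translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate complete codons only; unknown codons (e.g. with N) become X."""
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, len(seq) - 2, 3)
    )


def six_frame_translate(seq):
    """Hypothetical helper: three forward and three reverse reading frames."""
    seq = seq.upper()
    rc = reverse_complement(seq)
    frames = {}
    for offset in range(3):
        frames[f"+{offset + 1}"] = translate(seq[offset:])
        frames[f"-{offset + 1}"] = translate(rc[offset:])
    return frames


frames = six_frame_translate("ATGGCCTAA")
print(frames["+1"])  # MA*
```

Note that every stretch of sequence gets translated in all six frames, coding or not, which is where the worry about junk in the index comes from.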
Yeah. I'm still stuck on the question of how to get the best proteins out of
all my assemblies, because, you know, do we run Prokka
on everything and just take what it says, when we know that you can improve
your annotations if you do it for a particular species and maybe you give it
a good reference genome? Or do we just, you know, six-way translate
everything and then just build a COBS index on that instead? But then we're
going to have a lot of potential junk. So yeah, I'm still stuck at the, then how do we
do the next step? But yeah, I can definitely see in the long run, what you're
suggesting would be very useful. And we've sneakily not answered Nabil's
question, which is who cares about protein? Why do you need protein search? I
mean, but Grace said before that, you know, these k-mer things, they only really
work for really, really close matches. And if you're looking for things that are
more evolutionarily diverged, basically amino acids are more conserved than DNA.
So you can sort of look further away in the tree. I mean, for me, what comes to
mind is looking at some of the garbage, like pseudogenization, looking at
pseudogenes and how they're distributed, because that's going to tell you a fair
bit about which clades are moving into different niches, which have
recently moved into a different niche, that sort of thing. And which genes,
you know, are only sort of variably pseudogenized within a
species and which are sort of consistently so as well. Anything else from you
two you'd like to promote or plug? Or I think one question we didn't ask is, is
there any new software coming out? So I can plug two things, but one is a
mystery. So one of the limitations of what we do at the moment is it's not
proper BLAST, right? We sort of break things up into k-mers and sort of say, you
know, demand that a certain number of them are present. And that works
pretty well, but it's not a proper alignment. And there is a mystery
person out there somewhere who I think is making big progress towards being able
to do that on data sets of this size. So I think you should look out for that.
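
The k-mer approach described just above, breaking the query into k-mers and demanding that a certain number of them are present, can be sketched roughly like this in Python. This is a toy illustration only: it uses plain in-memory sets rather than the compressed index a tool like COBS builds, and the function names, the default k of 31, and the 0.8 threshold are all assumptions made for the example.

```python
def kmers(seq, k):
    """Return the set of all k-length substrings of a sequence."""
    seq = seq.upper()
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}


def kmer_search(query, assemblies, k=31, threshold=0.8):
    """Hypothetical helper: report assemblies that contain at least
    `threshold` of the query's distinct k-mers.

    `assemblies` maps a sample name to its sequence. No alignment is
    done, so this only says how many k-mers are shared, not where.
    """
    query_kmers = kmers(query, k)
    if not query_kmers:
        return {}  # query shorter than k
    hits = {}
    for name, seq in assemblies.items():
        shared = len(query_kmers & kmers(seq, k))
        fraction = shared / len(query_kmers)
        if fraction >= threshold:
            hits[name] = fraction
    return hits


toy = {"sample_a": "ACGTACGTGGA", "sample_b": "TTTTTTTTTT"}
print(kmer_search("ACGTACGT", toy, k=4))  # {'sample_a': 1.0}
```

A single mismatch in the query knocks out up to k overlapping k-mers at once, which is one way to see why this kind of search works well for close matches and degrades for more diverged sequences.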
I'll point them to you, but not reveal. And then, less mysteriously, I
mean, in the group, we're spending a lot of time working on how
to, you know, simultaneously study SNP and gene presence
diversity. So our Pandora paper is sort of still in review, but I think
we're going to do a lot of exciting things with that, basically allowing
you to sort of look at mutational changes within the accessory genome and,
you know, thereby how that impacts phenotype. Lots of stuff in the field
coming, I think. All right. Any final words from you, Grace? I'm really excited
to start using this data set for what we designed it for when we set
out, which was to actually start looking at, you know, the distribution and
diversity of mobile elements and potentially predicting new ones from all
the data out there. But, you know, we realized along the way that
this was just one use case for all of these assemblies and this
resource. And so we really wanted to make it public and get it out there. And
yeah, so I really hope that, you know, the community finds it useful and if
they do have any troubles or questions about trying to utilize it or run it,
then yeah, please send us an email and we'll help where we can. And we really
will try and make it
web searchable. That is, we, I've been saying that for like 18 months, but we
really will. All right. And on that, we'll end this episode. So I want to thank
Zaman and Grace for joining us today. And we've been talking today about some
really exciting new developments in the area of comparative genomics, and that's
it for the MicroBinfy podcast. See you later. Thank you so much for listening to
us at home. If you like this podcast, please subscribe and rate us on iTunes,
Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at
MicroBinfy. And if you don't like this podcast, please don't do anything. This
podcast was recorded by the Microbial Bioinformatics Group. The opinions
expressed here are our own and do not necessarily reflect the views of CDC or
the Quadram Institute.