Hello, and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There's so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil Ali Khan and Dr. Andrew Page. I am Dr. Lee Katz. Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work on microbes in food and the impact on human health. I work at the Centers for Disease Control and Prevention, and am an adjunct member at the University of Georgia in the U.S. Hello, and welcome to another Software Deep Dive, where we interview the author of a bioinformatics software package. Today, we are having a chat about some of the software we have created ourselves, because behind all of our software are quirky details that never make it to the final paper. Today, Nabil is in the hot seat with EnteroBase. Nabil, now that you're in the hot seat, tell me about the background of the problem that you were trying to solve with EnteroBase. Yeah, I should point out that I joined EnteroBase, I think, roughly around one year after the grant was awarded. The proposal was put together by Mark Pallen and Mark Achtman, and they had a bunch of different problems they were both concurrently trying to solve. One issue was Mark Pallen had xBASE, which is a comparative genomics website tool that had pre-calculated analyses of all of the complete genomes out there for all the different bacteria, and you could just go and pull stuff down, pull homologs out, and look at alignments and things like that. And there was the MLST database, which Mark Achtman developed and curated for different genera, which were Escherichia, Salmonella, Moraxella, and Yersinia, and these are based off MLST schemes that he and his group had put together.
That was originally running at University College Cork, and then moved over to the University of Warwick when he moved over. Those two were mixing a bunch of different ideas. You had the idea of comparing a bunch of genomes together online and providing visualization and analysis tools for people, and then you had genotyping in MLST and looking at the population structure of bacterial species. And so the idea of EnteroBase was to take both of those and put them together for enteric pathogens, and Moraxella for some reason. What is Moraxella? I've never heard of it. I think there are a fair few species within the Moraxella genus, but the particular one for EnteroBase or MLST was Moraxella catarrhalis, which is a respiratory pathogen. It was a bit weird because it's a respiratory pathogen in a database called EnteroBase. So putting that all together, that's where EnteroBase tried to bring all of these different ideas together and do things at scale, and provide an easy web interface so that people who aren't interested in dealing with the command line can get some basic information out. All these things were put together in the previous MLST websites. And how'd you go about bringing them together into EnteroBase? And what added features did EnteroBase bring on top of those previous websites? The old MLST website was fairly simple. It was just a catalog of the allele sequences and the sequence type profiles available. So it was just giving that information, and you could search different things and submit new data to the database, and you could get sequence types assigned this way. And people would be uploading, well, they're not uploading, they're kind of just copying and pasting the Sanger traces that they had done. That was the evidence that they had a new allele, and then in turn they'd get a new sequence type assigned, which is a badge of honor for a lot of people, to be able to say that they've identified several novel sequence types in a study.
All of the legacy data was carried over from that database. People at that time were applying MLST to pathogens and rivers and just random bacteria in the environment and all over the place, and some really old historic samples as well. And those sequence types are maintained in that database, and they were pulled over into EnteroBase. So that information is there, it's called legacy data, and you can actually search it. You find it's just got a stub of the allele information and the sample information, but there's no sequencing attached to it, because there's no whole genome sequencing data. Other than the concepts and this little smidgen of data that we pulled across, everything else we redid from the ground up. We wrote the whole thing in Python using the Flask web framework, and the new database was done with Postgres. So everything was written again from the ground up. Just to clarify, you would trawl through the public archives and pull out all new sequencing data every day or week and do some processing or something? Initially we were running, I don't know now how often it's running, but we were running the crontab like every hour. So as data was being generated, as soon as it was coming out on the SRA, it was being fetched down and having this boilerplate set of pipelines applied to it. So basic QC with Kraken, and then assembly with SPAdes. We had our own pipeline, but it's more or less the same kind of workflow as what Shovill is doing. And then annotation on top of that with Prokka, and then genotyping with all of the different MLST schemes, rMLST, cgMLST, then providing all of that information in a web framework, and having it such that you can search that data, and you can search any fields, and you can mix and match and show different columns if you want.
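The boilerplate pipeline described here (fetch from the SRA, QC with Kraken, assemble, annotate with Prokka, genotype) can be sketched roughly as below. This is a minimal illustration only, not EnteroBase's actual code: the specific tool invocations, flags, and file layout are assumptions.

```python
# Rough sketch of the kind of per-sample cron pipeline described above.
# Tool choices and flags are illustrative guesses, not EnteroBase's real code.

def build_pipeline(accession: str, outdir: str = "out") -> list:
    """Return the ordered shell commands to process one SRA accession."""
    r1 = f"{outdir}/{accession}_1.fastq"
    r2 = f"{outdir}/{accession}_2.fastq"
    asm = f"{outdir}/{accession}/contigs.fasta"
    return [
        # 1. fetch the raw paired-end reads from the SRA
        f"fasterq-dump --split-files --outdir {outdir} {accession}",
        # 2. basic QC / species check with Kraken
        f"kraken2 --paired {r1} {r2} --report {outdir}/{accession}.kreport",
        # 3. de novo assembly (a SPAdes-based workflow, much like Shovill's)
        f"spades.py -1 {r1} -2 {r2} -o {outdir}/{accession}",
        # 4. annotation of the resulting assembly
        f"prokka --outdir {outdir}/{accession}.annot {asm}",
        # 5. genotyping: 7-gene MLST (rMLST/cgMLST calls would follow the same shape)
        f"mlst {asm}",
    ]

commands = build_pipeline("SRR0000001")
```

The point of the sketch is the fixed order of steps applied identically to every incoming accession, which is what makes an hourly crontab over the SRA feasible.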
One of the things that my colleague Zhemin Zhou spent a lot of time on, and I think this is one of the most valuable parts of EnteroBase, is he wrote a lot of tooling and scripts to convert the free text and sort of loose information that gets submitted to the SRA for a given sample, converting and categorizing it the best he can. And I think that's one of the best things that people go to EnteroBase and download, the metadata, because we've tried to curate it and clean it up. If it's a dataset we care about, we will manually go back and clean it up, but there's a lot of it that's automated, that just tries to make it systematic and consistent so people can use it. And I've seen people take that dataset, take that information, take the assemblies and shove them straight into some machine learning tool and use that to come up with predictions. As a user, I find it really, really useful. When I was in the Sanger Institute, we would have thousands of, say, Salmonella. We didn't have the same level of granularity and analysis that EnteroBase had. So often I would go into EnteroBase and see, I know what my sample is from the Sanger IDs, and can I go backwards and see what other interesting stuff we have in our database? Looking up, say, the cgMLST was phenomenally good. You can actually say to EnteroBase, okay, I want something that's within, say, five alleles. I can have five mismatches or 10 or whatever. And so you can set the threshold for how close you want. And you can go on this massive fishing expedition and just find extra data. You know, if you've one random sample, you can often go and find a load of others that are nearby or similar. Then you can go and include them in your analysis. So it's phenomenally good for that. I personally find the cgMLST stuff to be fantastic. But I believe it was one of your own schemes, wasn't it? The cgMLST, yeah, we developed in Warwick. And so how did you do the Salmonella one?
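The "within, say, five alleles" search described above boils down to counting allele differences between cgMLST profiles. Here is a toy sketch of that idea, not EnteroBase's implementation; the profile encoding and the handling of missing loci are assumptions.

```python
def allelic_distance(a, b):
    """Count loci where both profiles have an allele call and the calls differ.
    Loci missing in either profile (None) are ignored, a common convention
    when comparing cgMLST profiles from imperfect assemblies."""
    return sum(1 for x, y in zip(a, b)
               if x is not None and y is not None and x != y)

def neighbours(query, database, max_mismatches=5):
    """Names of database profiles within max_mismatches alleles of the query."""
    return [name for name, profile in database.items()
            if allelic_distance(query, profile) <= max_mismatches]

# Toy example: profiles are tuples of allele numbers over the same loci.
db = {
    "isolate_A": (1, 2, 3, 4, 5),      # identical to the query
    "isolate_B": (1, 2, 9, 4, 5),      # one allele different
    "isolate_C": (7, 8, 9, 10, None),  # mostly different; one missing locus
}
hits = neighbours((1, 2, 3, 4, 5), db, max_mismatches=2)  # A and B match
```

A real scheme has thousands of loci rather than five, but the threshold query works the same way.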
I remember there was more than one version. We designed the first set of loci that we were interested in based off an annotation that Jay Hinton did for his pet Salmonella, which is an ST313. We were interested in that because they'd gone back and they'd curated all of the gene starts and stops with transcriptional data, and they'd paid a lot of attention to trying to tidy that up. So that looked like a very good annotation to start from, as opposed to just an automated one. So we took that genome, and then we expanded the preliminary pan-genome set with a paper from Nuccio and Bäumler, which talks about a bunch of genomes that they had hand-annotated and cleaned up. So we took all of those, and that became a pan-genome panel. We ran it through all of the reference genomes we could get our hands on. So this included the complete genomes, and the NCTC genomes from Sanger and PHE that were done on PacBio, we included those as well. Can I just say that they may be of variable quality? No, that's fine. We didn't just take those genomes as a gold standard; we were trying to figure out which genes were generally conserved or not. And then based off that, we came up with this pan-genome set, and then verified it with a bunch of genomes, one per ribosomal ST, from all of the existing Salmonella data that we had assembled. And then from that, we have this massive sort of Roary-esque table of gene presence and absence, or whether a gene is truncated or not, for maybe several thousand Salmonella against this master panel of all of these pan-genome genes. And then we were playing around with that and looking at thresholds for which genes are consistent and generally conserved. I think they had to be in 98% of the genomes. You couldn't use a hard 100%, because you just get this sort of stochastic error; you get assemblies that are not that great.
If you said, okay, I want only the genes that were conserved in every Salmonella in my reference set, you'd basically have three, because you just had other assemblies that had problems and were missing stuff. So we had a soft cutoff, and then we checked whether the allele sequences that we extract have a lot of pseudogenes, or whether they tend to be truncated, or they're problematic because they don't align well, low complexity. We threw all of those out, and we came up with 3,000 genes that we use in the Salmonella scheme. And that method we applied for all the other species as well. Was it 3,002 genes? 3,002. Yeah. 3,002. Those two are very important. Oh, absolutely. Yeah. And version one was the first iteration of it. And for the PLoS Genetics paper, we went back and we revisited those, because we went from the 30,000 Salmonella genomes to 70-odd thousand; the database doubled in a year or two. And we were interested to see, were the genes that we had picked initially still good? Did they suddenly have issues? As we sequence more and more, we get more and more weird genomes, and then there's some proportion of Salmonella with some problem, like those genes in a particular serovar are all paralogous, or they're all pseudogenes, or they're all something. And then as a marker, they're useless. They start to break. We were interested to see, are these still stable after a couple of years? That's where version two comes out. We took out a bunch that weren't that great. I think it's up to now probably a quarter of a million Salmonella in those databases. Mostly it seems from CDC and FDA, you know, they seem to be piling them all in. Oops. Yeah. No, definitely. It's amazing the amount of sequencing. We have 267,000 Salmonella genomes from next generation sequencing up there. However, they're probably all from about 20 or 30 serovars out of the two and a half thousand.
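The soft 98% cutoff discussed above, versus a brittle hard 100%, can be illustrated on a presence/absence table like the Roary-style one mentioned earlier. The data and function below are invented for illustration and are only a cartoon of the scheme-building step.

```python
def select_core_genes(presence, threshold=0.98):
    """presence maps gene name -> list of True/False calls (present and intact)
    across the reference genomes. A gene is kept if it is called in at least
    `threshold` of genomes -- a soft cutoff that tolerates stochastic dropout
    caused by a few poor assemblies, which a hard 100% cutoff would not."""
    return sorted(gene for gene, calls in presence.items()
                  if sum(calls) / len(calls) >= threshold)

# Toy table over 100 genomes: geneB drops out of one bad assembly,
# geneC is genuinely variable across the population.
table = {
    "geneA": [True] * 100,                 # 100% conserved
    "geneB": [True] * 99 + [False],        # 99%: survives the soft cutoff
    "geneC": [True] * 80 + [False] * 20,   # 80%: excluded from the scheme
}
core = select_core_genes(table)            # geneA and geneB
strict = select_core_genes(table, 1.0)     # only geneA survives a hard 100%
```

With real data the hard cutoff collapses to almost nothing (the "basically three genes" problem above), which is exactly why the soft threshold is used.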
I think it's roughly 30 to 40% of them that are going to be either Enteritidis or Typhimurium. Jeez. That's really important for people who are looking at Salmonella, or even E. coli or any of these species, and they go, oh, I'll just look at the whole database and I'll just include all the data, because that's representative. And it's not, because there's a sample bias. You're picking clinical isolates; you're picking things that tend to occur in clinics or in food production. Those are the ones that are going to be there. So if you're going to just take all the data and say it's random, it isn't. And I often have to point this out on manuscripts that I review: just because there's a hundred thousand data points doesn't mean it's random or there's no bias. You have to be quite judicious with what you're selecting in your data. Yeah. I know for a lot of serovars, there might be one example with whole genome sequencing, or actually probably zero. I think we cover this in the PLoS Genetics paper. If we go back to the MLST done with Sanger traces, how many of those STs have genomes? There's about 30% that don't. They just never appear. And we're like, we've done 200, where are they? How come they've shown up in MLST and they've never shown up as a genome? We don't mention this in the paper because I couldn't hammer this down, but it seems that some of the STs are genuinely rare, weird stuff. So, Salmonella from the seventies that we don't have the genome for, or random stuff from rivers, weird places. And then some of it is single locus variants of very, very common sequence types, of things in Typhimurium and things like that, which is probably someone just making a mistake with the Sanger traces when they did the initial sequencing, because it was all manual. Even the copying and pasting into the website was manual. So I'm guessing there's a fair proportion of error in that as well. And someone was typing out the letters. Yeah.
Typing out the letters or something like that, or aligning it in a Word document by hand, this kind of stuff. It is for that reason that EnteroBase does not accept Sanger traces at all and insists that all new sequence types be generated from whole genome sequencing, because the amount of error is too great. There is some truth to it. I was in the meningitis lab for a year, which is heavily dominated by MLST historically. And it was definitely not run by bioinformaticians back then, definitely not. And people were just learning how to use the website and pasting it in and doing stuff like that. So I definitely think that there was room for error, unfortunately, especially with the numbers there. Yeah. Neisseria, that's Martin Maiden in Oxford who looks after that one. A shout out to Martin if he's listening. I don't think he does. And Keith Jolley. We have this caveat over and over again, that data generated from traditional microbiology and old MLST with Sanger traces and PFGE and so on, all of those different methods, they were manual. They had a fair margin of error that we found over time, but a lot of gems are hidden in those datasets as well. So if you spent a lot of time on that stuff, I'm not saying there's anything wrong with that, but there is a margin of error, and it doesn't scale very well, it seems. Yeah. And I need to give a little caveat too, because I think that everyone was highly qualified. I just think that, the way it had to be implemented, there was a chance for error, just an unavoidable side effect. There's a lot of good information in there. And those sequence types that came up, or serotypes, phage types, we still use that nomenclature. We still use serovar nomenclature in Salmonella. That is not going to go away. That is not going to suddenly be replaced by a cgMLST number or something like that. It's just too cool to call your serovar Newport or Dublin or Typhimurium or Choleraesuis. It's cool. Not cgMLST ST31582. Yeah.
So I saw that NCBI have their pathogen browser now, where they're trying to do pathogen-specific analyses and, I suppose, replicate some of the functionality of EnteroBase. But there isn't really any other website out there that I can think of that actually does this kind of thing and does it so well. Can you think of any, Nabil? I think I have to give a shout out to BIGSdb, Keith Jolley and Martin again. PubMLST got an update the other day and looks a lot prettier. I think the magic in EnteroBase is the idea of taking a species and thinking of it as a population, and looking at it top down and saying, how do I take a hundred thousand of these guys and talk about the differences between X and Y? How do I break this population down so I can explain it to someone else? And part of that is not just that here's a tool that does BLAST and gives you the numbers. It's deciding on a framework of reasonable thresholds, so that if you use these metrics, they translate to a strain. This is what a strain is. This is what a sequence type is. This is what a clonal complex is. This is what a species is. We're making those calls in EnteroBase, and we're trying to add this extra conceptualization of the data, rather than just presenting data. That, to me, when someone asks what's the critical difference between what EnteroBase does versus what something else does, is that. Because a lot of other tools are much more polished, much prettier, much faster. But I think that's the secret sauce in EnteroBase that makes it really special. Along those lines, I wanted to ask you about something you mentioned earlier, the cleanup actually, because that's a huge selling point. And that's something that I feel is going to hound us forever, especially with something like Salmonella, if we try to clean up the host information or the food vehicle that it came from. All those things are usually free text, and can be misspelled, or they can have different colloquial names or something.
Was that something that was cleaned up too? That was a massive undertaking. Zhemin did all of the coding. He wrote a classifier that is sitting in the background, taking the information from NCBI, parsing it, and then coming up with the categories you see in EnteroBase. And he's been iteratively improving on it as we go along. But to get it started, we sat down for a week going through the then 30,000 Salmonella records. Zhemin did a first pass and parsed out most of the information, but then we were going through and going, okay, what is this? And we're looking up scientific names for random species of North American foxes, going, okay, that's a fox. Okay. Yeah. Yeah. That's a fox. So we changed the host to wild animal. Because we don't know what a turkey is. We don't know the scientific name for turkey. Oh, that's a turkey. Okay. You'd know, Lee. I do not know the scientific name of a turkey. So yeah, we didn't know either. So we wound up doing a lot of manual mappings of different fields into categories so that we could simplify the data down and present it to everyone else. That's amazing. I think that we were trying to do something like that, classification with a food ontology, and that is even such a complicated field. People like Emma Griffiths and many, many other people are working on that, and I just think it's so hard. I mean, to that degree, it's insane, the level of detail in those sorts of ontologies for when you're trying to track information as a public health lab. That's insane. There's so much effort involved. I mean, we were just trying to figure out if the sample was from food or a chicken or a dog. Because there's a difference between Salmonella from a wild animal versus a pet. And there's a difference between, say, a live chicken on a farm versus a breast of chicken that you get in a package in a supermarket. Yep. But that's all chicken. I mean, is the host always chicken? I mean, exactly. So there's a lot of... An Easter chick also.
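The manual mapping described above, free-text host fields normalised into broad categories, might look like this in miniature. The mapping entries and function are invented for illustration; they are a tiny stand-in for the real curated classifier, not its actual rules.

```python
# Illustrative only: a toy version of the curated free-text-to-category tables.
HOST_MAP = {
    "gallus gallus": "Poultry",        # chicken's scientific name
    "chicken": "Poultry",
    "meleagris gallopavo": "Poultry",  # turkey's scientific name
    "turkey": "Poultry",
    "vulpes": "Wild animal",           # foxes
    "fox": "Wild animal",
    "homo sapiens": "Human",
}

def categorise_host(free_text):
    """Map a free-text SRA 'host' field onto a broad, consistent category."""
    key = free_text.strip().lower()
    for term, category in HOST_MAP.items():
        if term in key:
            return category
    return "Uncategorised"
```

The hard part in practice is not the lookup but building and maintaining the mapping table, which is exactly the week of manual curation described above.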
If you consider a chicken carcass in a bag at a supermarket, that's chicken. But do you consider a chicken Kiev, like crumbed chicken in a box? Is that still the same type of chicken? God, it could be anything in there. So I guess, what are the future plans? Because EnteroBase has been going for a while, and I presume the grant is probably near its end. So what's going to happen after this? I don't know. I work with you now, Andrew. I don't know what's happening. Mark hasn't told me what the future is. I'd hope it continues on in some fashion. I hope someone picks up the torch and carries it forward. Because it's going to be a bit sad if it disappears. There's a lot of people who rely on it these days. I mean, I don't know. Lee, have you been using it? I think it's great. And it's a hidden gem. I don't work as much in surveillance anymore. But when I was doing surveillance, yeah, I was using it a lot. I know that the cgMLST that you guys came up with, PulseNet has it for its backbone also. So it's definitely in wide use. And it's great. That's good to hear. Yeah. And I was always a big fan of it, particularly for finding related samples and for going on giant fishing expeditions very quickly. And I always promoted it to whoever I'd come across and say, you know, you've got to check this out. It'll save you probably months of work. We tried to capture some of the fishing expedition adventures in the Genome Research paper, where we go after Salmonella in badgers. And it's a pretty exciting little vignette in there. I mean, everyone knows badgers, right? Aren't they culling badgers currently? Yep. Because they're meant to be carrying TB, but yeah. Oh, no. That's a sad note. Well, this is a great discussion. Again, this was a quick chat about some of the software we created ourselves, in this case, EnteroBase. There's always some interesting facts about these different tools and how they came into being. You can check it out at enterobase.warwick.ac.uk.
The papers are in Genome Research and PLoS Genetics, and that's all the time we've had for this episode. See you next time. Thank you all so much for listening to us at home. If you like this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group and edited by Nick Waters. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.