Hello, and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There's so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil Ali Khan and Dr. Andrew Page. I am Dr. Lee Katz. Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work on microbes in food and the impact on human health. I work at the Centers for Disease Control and Prevention, and am an adjunct member at the University of Georgia in the U.S. Hello, and welcome to another Software Deep Dive, where we interview the author of a bioinformatics software package. Today, we are having a chat about some of the software we have created ourselves, because behind all of our software are quirky details that never make it to the final paper. Today, Nabil is in the hot seat with EnteroBase. Nabil, now that you're in the hot seat, tell me about the background of the problem that you were trying to solve with EnteroBase. Yeah, I should point out that I joined EnteroBase, I think, roughly around one year after the grant was awarded. The proposal was put together by Mark Pallen and Mark Achtman, and they had a bunch of different problems they were both concurrently trying to solve. One issue was Mark Pallen had xBASE, which is a comparative genomics website tool that had pre-calculated analyses of all of the complete genomes out there for all the different bacteria, and you could just go and pull stuff down, pull homologs out, and look at alignments and things like that. And there was the MLST database, which Mark Achtman developed and curated for different genera, which were Escherichia, Salmonella, Moraxella, and Yersinia, and these are based off MLST schemes that he and his group had put together.
That was originally running at University College Cork, and then moved over to the University of Warwick when he moved over. Those two were mixing a bunch of different ideas. You had the idea of comparing a bunch of genomes together online and providing visualization and analysis tools for people, and then you had genotyping in MLST and looking at the population structure of bacterial species. And so the idea of EnteroBase was to take both of those and put them together for enteric pathogens, and Moraxella for some reason. What is Moraxella? I've never heard of it. I think there are a fair few species within the Moraxella genus, but the particular one for EnteroBase or MLST was Moraxella catarrhalis, which is a respiratory pathogen. It was a bit weird because it's a respiratory pathogen in a database called EnteroBase. So putting that all together, that's where EnteroBase tried to bring all of these different ideas together and do things at scale, and provide an easy web interface so that people who aren't interested in dealing with the command line can get some basic information out. All these things were put together in the previous MLST websites. And how'd you go about bringing them together into EnteroBase? And what added features did EnteroBase bring on top of those previous websites? The old MLST website was fairly simple. It was just a catalog of the allele sequences and the sequence type profiles available. So it was just giving that information, and you could search different things and submit new data to the database, and you could get sequence types assigned this way. And people would be uploading, well, they're not uploading, they're kind of just copying and pasting the Sanger traces that they had done. That was the evidence that they had a new allele, and then in turn they'd get a new sequence type assigned, which is a badge of honor for a lot of people, to be able to say that they've identified several novel sequence types in a study.
All of the legacy data was carried over from that database. People at that time were applying MLST to pathogens and rivers and just random bacteria in the environment and all over the place, and some really old historic samples as well. And those sequence types are maintained in that database, and they were pulled over into EnteroBase. So that information is there, it's called legacy data, and you can actually search it. You find it's just got a stub of the allele information and the sample information, but there's no sequencing attached to it, because there's no whole genome sequencing data. Other than the concepts and this little smidgen of data that we pulled across, everything else we redid from the ground up. We wrote the whole thing in Python using the Flask web framework, and the new database was done with Postgres. So everything was written again from the ground up. Just to clarify, you would trawl through the public archives and pull out all new sequencing data every day or week and do some processing or something? Initially we were running, I don't know now how often it's running, but we were running the crontab like every hour. So as data was being generated, as soon as it was coming out on the SRA, it was being fetched down and having this boilerplate set of pipelines applied to it. So basic QC with Kraken, and then assembly with SPAdes. We had our own pipeline, but it's more or less the same kind of workflow as what Shovill is doing. And then annotation on top of that with Prokka, and then genotyping with all of the different MLST schemes, rMLST, cgMLST, then providing all of that information in a web framework, and having it such that you can search that data, and you can search any fields, and you can mix and match and show different columns if you want.
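The boilerplate pipeline described here (fetch from the SRA, QC with Kraken, assemble, annotate with Prokka, genotype) can be sketched roughly as below. This is a minimal illustration only, not EnteroBase's actual code: the specific tool invocations, flags, and file layout are assumptions.

```python
# Rough sketch of the kind of per-sample cron pipeline described above.
# Tool choices and flags are illustrative guesses, not EnteroBase's real code.

def build_pipeline(accession: str, outdir: str = "out") -> list:
    """Return the ordered shell commands to process one SRA accession."""
    r1 = f"{outdir}/{accession}_1.fastq"
    r2 = f"{outdir}/{accession}_2.fastq"
    asm = f"{outdir}/{accession}/contigs.fasta"
    return [
        # 1. fetch the raw paired-end reads from the SRA
        f"fasterq-dump --split-files --outdir {outdir} {accession}",
        # 2. basic QC / species check with Kraken
        f"kraken2 --paired {r1} {r2} --report {outdir}/{accession}.kreport",
        # 3. de novo assembly (a SPAdes-based workflow, much like Shovill's)
        f"spades.py -1 {r1} -2 {r2} -o {outdir}/{accession}",
        # 4. annotation of the resulting assembly
        f"prokka --outdir {outdir}/{accession}.annot {asm}",
        # 5. genotyping: 7-gene MLST (rMLST/cgMLST calls would follow the same shape)
        f"mlst {asm}",
    ]

commands = build_pipeline("SRR0000001")
```

The point of the sketch is the fixed order of steps applied identically to every incoming accession, which is what makes an hourly crontab over the SRA feasible.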
One of the things that my colleague Zhemin Zhou spent a lot of time on, and I think this is one of the most valuable parts of EnteroBase, is he wrote a lot of tooling and scripts to convert the free text and sort of loose information that gets submitted to the SRA for a given sample, converting and categorizing it the best he can. And I think that's one of the best things that people go to EnteroBase and download, the metadata, because we've tried to curate it and clean it up. If it's a dataset we care about, we will manually go back and clean it up, but there's a lot of it that's automated, that just tries to make it systematic and consistent so people can use it. And I've seen people take that dataset, take that information, take the assemblies and shove them straight into some machine learning tool and use that to come up with predictions. As a user, I find it really, really useful. When I was in the Sanger Institute, we would have thousands of, say, Salmonella. We didn't have the same level of granularity and analysis that EnteroBase had. So often I would go into EnteroBase and see, I know what my sample is from the Sanger IDs, and can I go backwards and see what other interesting stuff we have in our database? Looking up, say, the cgMLST was phenomenally good. You can actually say to EnteroBase, okay, I want something that's within, say, five alleles. I can have five mismatches or 10 or whatever. And so you can set the threshold for how close you want. And you can go on this massive fishing expedition and just find extra data. You know, if you've one random sample, you can often go and find a load of others that are nearby or similar. Then you can go and include them in your analysis. So it's phenomenally good for that. I personally find the cgMLST stuff to be fantastic. But I believe it was one of your own schemes, wasn't it? The cgMLST, yeah, we developed in Warwick. And so how did you do the Salmonella one?
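The "within, say, five alleles" search described above boils down to counting allele differences between cgMLST profiles. Here is a toy sketch of that idea, not EnteroBase's implementation; the profile encoding and the handling of missing loci are assumptions.

```python
def allelic_distance(a, b):
    """Count loci where both profiles have an allele call and the calls differ.
    Loci missing in either profile (None) are ignored, a common convention
    when comparing cgMLST profiles from imperfect assemblies."""
    return sum(1 for x, y in zip(a, b)
               if x is not None and y is not None and x != y)

def neighbours(query, database, max_mismatches=5):
    """Names of database profiles within max_mismatches alleles of the query."""
    return [name for name, profile in database.items()
            if allelic_distance(query, profile) <= max_mismatches]

# Toy example: profiles are tuples of allele numbers over the same loci.
db = {
    "isolate_A": (1, 2, 3, 4, 5),      # identical to the query
    "isolate_B": (1, 2, 9, 4, 5),      # one allele different
    "isolate_C": (7, 8, 9, 10, None),  # mostly different; one missing locus
}
hits = neighbours((1, 2, 3, 4, 5), db, max_mismatches=2)  # A and B match
```

A real scheme has thousands of loci rather than five, but the threshold query works the same way.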
I remember there was more than one version. We designed the first set of loci that we were interested in based off an annotation that Jay Hinton did for his pet Salmonella, which is an ST313. We were interested in that because they'd gone back and they'd curated all of the gene starts and stops with transcriptional data, and they'd paid a lot of attention to trying to tidy that up. So that looked like a very good annotation to start from, as opposed to just an automated one. So we took that genome, and then we expanded the preliminary pan-genome set with a paper from Nuccio and Bäumler, which talks about a bunch of genomes that they had hand-annotated and cleaned up. So we took all of those, and that became a pan-genome panel. We ran it through all of the reference genomes we could get our hands on. So this included the complete genomes, and the NCTC genomes from Sanger and PHE that were done on PacBio, we included those as well. Can I just say that they may be of variable quality? No, that's fine. We didn't just take those genomes as a gold standard; we were trying to figure out which genes were generally conserved or not. And then based off that, we came up with this pan-genome set, and then verified it with a bunch of genomes, one per ribosomal ST, from all of the existing Salmonella data that we had assembled. And then from that, we have this massive sort of Roary-esque table of gene presence and absence, or whether a gene is truncated or not, for maybe several thousand Salmonella against this master panel of all of these pan-genome genes. And then we were playing around with that and looking at thresholds for which genes are consistent and generally conserved. I think they had to be in 98% of the genomes. You couldn't use a hard 100%, because you just get this sort of stochastic error; you get assemblies that are not that great.
If you said, okay, I want only the genes that were conserved in every Salmonella in my reference set, you'd basically have three, because you just had other assemblies that had problems and were missing stuff. So we had a soft cutoff, and then we checked whether the allele sequences that we extract have a lot of pseudogenes, or whether they tend to be truncated, or they're problematic because they don't align well, low complexity. We threw all of those out, and we came up with 3,000 genes that we use in the Salmonella scheme. And that method we applied for all the other species as well. Was it 3,002 genes? 3,002. Yeah. 3,002. Those two are very important. Oh, absolutely. Yeah. And version one was the first iteration of it. And for the PLoS Genetics paper, we went back and we revisited those, because we went from the 30,000 Salmonella genomes to 70-odd thousand; the database doubled in a year or two. And we were interested to see, were the genes that we had picked initially still good? Did they suddenly have issues? As we sequence more and more, we get more and more weird genomes, and then there's some proportion of Salmonella with some problem, like those genes in a particular serovar are all paralogous, or they're all pseudogenes, or they're all something. And then as a marker, they're useless. They start to break. We were interested to see, are these still stable after a couple of years? That's where version two comes out. We took out a bunch that weren't that great. I think it's up to now probably a quarter of a million Salmonella in those databases. Mostly it seems from CDC and FDA, you know, they seem to be piling them all in. Oops. Yeah. No, definitely. It's amazing the amount of sequencing. We have 267,000 Salmonella genomes from next generation sequencing up there. However, they're probably all from about 20 or 30 serovars out of the two and a half thousand.
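The soft 98% cutoff discussed above, versus a brittle hard 100%, can be illustrated on a presence/absence table like the Roary-style one mentioned earlier. The data and function below are invented for illustration and are only a cartoon of the scheme-building step.

```python
def select_core_genes(presence, threshold=0.98):
    """presence maps gene name -> list of True/False calls (present and intact)
    across the reference genomes. A gene is kept if it is called in at least
    `threshold` of genomes -- a soft cutoff that tolerates stochastic dropout
    caused by a few poor assemblies, which a hard 100% cutoff would not."""
    return sorted(gene for gene, calls in presence.items()
                  if sum(calls) / len(calls) >= threshold)

# Toy table over 100 genomes: geneB drops out of one bad assembly,
# geneC is genuinely variable across the population.
table = {
    "geneA": [True] * 100,                 # 100% conserved
    "geneB": [True] * 99 + [False],        # 99%: survives the soft cutoff
    "geneC": [True] * 80 + [False] * 20,   # 80%: excluded from the scheme
}
core = select_core_genes(table)            # geneA and geneB
strict = select_core_genes(table, 1.0)     # only geneA survives a hard 100%
```

With real data the hard cutoff collapses to almost nothing (the "basically three genes" problem above), which is exactly why the soft threshold is used.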
I think it's roughly 30 to 40% of them that are going to be either Enteritidis or Typhimurium. Jeez. That's really important for people who are looking at Salmonella, or even E. coli or any of these species, and they go, oh, I'll just look at the whole database and I'll just include all the data, because that's representative. And it's not, because there's a sample bias. You're picking clinical isolates; you're picking things that tend to occur in clinics or in food production. Those are the ones that are going to be there. So if you're going to just take all the data and say it's random, it isn't. And I often have to point this out on manuscripts that I review: just because there's a hundred thousand data points doesn't mean it's random or there's no bias. You have to be quite judicious with what you're selecting in your data. Yeah. I know for a lot of serovars, there might be one example with whole genome sequencing, or actually probably zero. I think we cover this in the PLoS Genetics paper. If we go back to the MLST done with Sanger traces, how many of those STs have genomes? There's about 30% that don't. They just never appear. And we're like, we've done 200, where are they? How come they've shown up in MLST and they've never shown up as a genome? We don't mention this in the paper because I couldn't hammer this down, but it seems that some of the STs are genuinely rare, weird stuff. So, Salmonella from the seventies that we don't have the genome for, or random stuff from rivers, weird places. And then some of it is single locus variants of very, very common sequence types, of things in Typhimurium and things like that, which is probably someone just making a mistake with the Sanger traces when they did the initial sequencing, because it was all manual. Even the copying and pasting into the website was manual. So I'm guessing there's a fair proportion of error in that as well. And someone was typing out the letters. Yeah.
Typing out the letters or something like that, or aligning it in a Word document by hand, this kind of stuff. It is for that reason that EnteroBase does not accept Sanger traces at all and insists that all new sequence types be generated from whole genome sequencing, because the amount of error is too great. There is some truth to it. I was in the meningitis lab for a year, which is heavily dominated by MLST historically. And it was definitely not run by bioinformaticians back then, definitely not. And people were just learning how to use the website and pasting it in and doing stuff like that. So I definitely think that there was room for error, unfortunately, especially with the numbers there. Yeah. Neisseria, that's Martin Maiden in Oxford who looks after that one. A shout out to Martin if he's listening. I don't think he does. And Keith Jolley. We have this caveat over and over again, that data generated from traditional microbiology and old MLST with Sanger traces and PFGE and so on, all of those different methods, they were manual. They had a fair margin of error that we found over time, but a lot of gems are hidden in those datasets as well. So if you spent a lot of time on that stuff, I'm not saying there's anything wrong with that, but there is a margin of error, and it doesn't scale very well, it seems. Yeah. And I need to give a little caveat too, because I think that everyone was highly qualified. I just think that, the way it had to be implemented, there was a chance for error, just an unavoidable side effect. There's a lot of good information in there. And those sequence types that came up, or serotypes, phage types, we still use that nomenclature. We still use serovar nomenclature in Salmonella. That is not going to go away. That is not going to suddenly be replaced by a cgMLST number or something like that. It's just too cool to call your serovar Newport or Dublin or Typhimurium or Choleraesuis. It's cool. Not cgMLST ST31582. Yeah.
So I saw that NCBI have their pathogen browser now, where they're trying to do pathogen-specific analyses and, I suppose, replicate some of the functionality of EnteroBase. But there isn't really any other website out there that I can think of that actually does this kind of thing and does it so well. Can you think of any, Nabil? I think I have to give a shout out to BIGSdb, Keith Jolley and Martin again. PubMLST got an update the other day and looks a lot prettier. I think the magic in EnteroBase is the idea of taking a species and thinking of it as a population, and looking at it top down and saying, how do I take a hundred thousand of these guys and talk about the differences between X and Y? How do I break this population down so I can explain it to someone else? And part of that is not just that here's a tool that does BLAST and gives you the numbers. It's deciding on a framework of reasonable thresholds, so that if you use these metrics, they translate to a strain. This is what a strain is. This is what a sequence type is. This is what a clonal complex is. This is what a species is. We're making those calls in EnteroBase, and we're trying to add this extra conceptualization of the data, rather than just presenting data. That, to me, when someone asks what's the critical difference between what EnteroBase does versus what something else does, is that. Because a lot of other tools are much more polished, much prettier, much faster. But I think that's the secret sauce in EnteroBase that makes it really special. Along those lines, I wanted to ask you about something you mentioned earlier, the cleanup actually, because that's a huge selling point. And that's something that I feel is going to hound us forever, especially with something like Salmonella, if we try to clean up the host information or the food vehicle that it came from. All those things are usually free text, and can be misspelled, or they can have different colloquial names or something.
Was that something that was cleaned up too? That was a massive undertaking. Zhemin did all of the coding. He wrote a classifier that is sitting in the background, taking the information from NCBI, parsing it, and then coming up with the categories you see in EnteroBase. And he's been iteratively improving on it as we go along. But to get it started, we sat down for a week going through the then 30,000 Salmonella records. Zhemin did a first pass and parsed out most of the information, but then we were going through and going, okay, what is this? And we're looking up scientific names for random species of North American foxes, going, okay, that's a fox. Okay. Yeah. Yeah. That's a fox. So we changed the host to wild animal. Because we don't know what a turkey is. We don't know the scientific name for turkey. Oh, that's a turkey. Okay. You'd know, Lee. I do not know the scientific name of a turkey. So yeah, we didn't know either. So we wound up doing a lot of manual mappings of different fields into categories so that we could simplify the data down and present it to everyone else. That's amazing. I think that we were trying to do something like that, classification with a food ontology, and that is even such a complicated field. People like Emma Griffiths and many, many other people are working on that, and I just think it's so hard. I mean, to that degree, it's insane, the level of detail in those sorts of ontologies for when you're trying to track information as a public health lab. That's insane. There's so much effort involved. I mean, we were just trying to figure out if the sample was from food or a chicken or a dog. Because there's a difference between Salmonella from a wild animal versus a pet. And there's a difference between, say, a live chicken on a farm versus a breast of chicken that you get in a package in a supermarket. Yep. But that's all chicken. I mean, is the host always chicken? I mean, exactly. So there's a lot of... An Easter chick also.
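The manual mapping described above, free-text host fields normalised into broad categories, might look like this in miniature. The mapping entries and function are invented for illustration; they are a tiny stand-in for the real curated classifier, not its actual rules.

```python
# Illustrative only: a toy version of the curated free-text-to-category tables.
HOST_MAP = {
    "gallus gallus": "Poultry",        # chicken's scientific name
    "chicken": "Poultry",
    "meleagris gallopavo": "Poultry",  # turkey's scientific name
    "turkey": "Poultry",
    "vulpes": "Wild animal",           # foxes
    "fox": "Wild animal",
    "homo sapiens": "Human",
}

def categorise_host(free_text):
    """Map a free-text SRA 'host' field onto a broad, consistent category."""
    key = free_text.strip().lower()
    for term, category in HOST_MAP.items():
        if term in key:
            return category
    return "Uncategorised"
```

The hard part in practice is not the lookup but building and maintaining the mapping table, which is exactly the week of manual curation described above.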
If you consider a chicken carcass in a bag at a supermarket, that's chicken. But do you consider a chicken Kiev, like crumbed chicken in a box? Is that still the same type of chicken? God, it could be anything in there. So I guess, what are the future plans? Because EnteroBase has been going for a while, and I presume the grant is probably near its end. So what's going to happen after this? I don't know. I work with you now, Andrew. I don't know what's happening. Mark hasn't told me what the future is. I'd hope it continues on in some fashion. I hope someone picks up the torch and carries it forward. Because it's going to be a bit sad if it disappears. There's a lot of people who rely on it these days. I mean, I don't know. Lee, have you been using it? I think it's great. And it's a hidden gem. I don't work as much in surveillance anymore. But when I was doing surveillance, yeah, I was using it a lot. I know that the cgMLST that you guys came up with, PulseNet has it for its backbone also. So it's definitely in wide use. And it's great. That's good to hear. Yeah. And I was always a big fan of it, particularly for finding related samples and for going on giant fishing expeditions very quickly. And I always promoted it to whoever I'd come across and say, you know, you've got to check this out. It'll save you probably months of work. We tried to capture some of the fishing expedition adventures in the Genome Research paper, where we go after Salmonella in badgers. And it's a pretty exciting little vignette in there. I mean, everyone knows badgers, right? Aren't they culling badgers currently? Yep. Because they're meant to be carrying TB, but yeah. Oh, no. That's a sad note. Well, this is a great discussion. Again, this was a quick chat about some of the software we created ourselves, in this case, EnteroBase. There's always some interesting facts about these different tools and how they came into being. You can check it out at enterobase.warwick.ac.uk.
The papers are in Genome Research and PLoS Genetics, and that's all the time we've had for this episode. See you next time. Thank you all so much for listening to us at home. If you like this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group and edited by Nick Waters. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.