Hello, and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There's so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil Ali Khan and Dr. Andrew Page. I am Dr. Lee Katz. Andrew and Nabil work in the Quadram Institute in Norwich, UK, where they work on microbes in food and the impact on human health. I work at Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the U.S. Welcome to a new thing we are doing called Software Deep Dives, where we interview the author of a bioinformatics software package. Today, Henk den Bakker is in the hot seat with Sepia. He's an assistant professor at the University of Georgia. That's the state, not the country, in the U.S. I work with him through the food safety informatics group at the University of Georgia. Henk's an alum of the Wiedmann Lab and has a rich history working on things like Listeria and Campylobacter. Correct me if I'm wrong later on, please, but I think fungi and a lot of other things out there. He got into computational biology and bioinformatics from working on all these different things, and we're interviewing him today on Sepia. First of all, Henk, what is Sepia? Sepia is, I would say, yet another read classifier. Why do we need it? Being a taxonomist and somebody who uses read classifiers a lot, there were just a lot of things that I wanted to have in a read classifier that I didn't have yet. I wrote Sepia to address all those things. Sepia uses a couple of data structures. One of them is the compact hash table that Kraken 2 uses, and actually some of the principles and algorithms that Kraken 2 uses to classify reads.
One of the main components in read classification is the use of taxonomy, and taxonomy is very important. I'm a taxonomist by training. I'm interested in new taxonomies, for instance, the GTDB taxonomy versus the NCBI taxonomy. There are several taxonomies currently being used in the metagenomics field, and seeing how that influences your ability to classify reads, especially reads from organisms that are not necessarily well known, really interests me. It's a tool to experiment with those things. The other thing is that I'm interested in data structures. The compact hash table of Kraken 2 is already pretty compact, but can we make it more compact by combining it with something like a perfect hash function? So Sepia has those things. You say a hash table there. So how are you actually storing the sequence information? How are you encoding it? The sequence information is currently encoded in bits. So we have basically the compact hash table, or the hash table with the perfect hash function, which is one big vector containing unsigned 32-bit integers. And we can use those unsigned 32-bit integers to store both information about part of the sequence and the associated taxon that goes along with those sequences. So we first hash the sequence, and then we use that hash to find a position in that vector, and then we use either the hash value of the sequence of your minimizer, or part of your sequence, to confirm whether that's a good match or not. So how long would your k-mers be then in that case? Currently my k-mers can go up to 31 base pairs, though you don't have to use the full 31 base pairs. And one of the things that I'm excited about is that I can actually extend the size of k-mers, so we can go up to 64 base pairs. The language that I'm using, Rust, has an unsigned 128-bit integer type, so we should be able to make it even bigger. I don't know if that affects the performance of the software.
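The 2-bit packing described here — up to 31 bases in a u64, up to 64 in a u128 — can be sketched roughly as below. This is an illustrative Rust snippet with a hypothetical `encode_kmer` function, not Sepia's actual API:

```rust
// Illustrative sketch of 2-bit nucleotide packing: each base takes 2 bits,
// so a k-mer of up to 31 bases fits in a u64 (62 bits), and switching the
// integer type to u128 would allow k-mers of up to 64 bases.
fn encode_kmer(kmer: &str) -> Option<u64> {
    if kmer.len() > 31 {
        return None; // more than 31 bases will not fit in a u64
    }
    let mut packed: u64 = 0;
    for base in kmer.bytes() {
        let bits = match base {
            b'A' | b'a' => 0,
            b'C' | b'c' => 1,
            b'G' | b'g' => 2,
            b'T' | b't' => 3,
            _ => return None, // ambiguous base: skip this k-mer
        };
        packed = (packed << 2) | bits;
    }
    Some(packed)
}
```

For example, `encode_kmer("ACGT")` packs to the integer 27 (binary 00 01 10 11).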
So I got really excited when I looked at your code and I saw, this is Perl, because it has many of the same constructs as Perl and a similar kind of layout and syntax and whatnot. But I was very disappointed when you told me you'd abandoned Perl for some other frivolous fly-by-night language called Rust. Can you tell me more about that? First, I probably have to explain that I never abandoned Perl. I can read Perl. I write most of my scripts and things like that in it. Where Python is fast enough, I use Python. That's my go-to language at the moment. But if I write code that really is performance critical, like a read classifier, where I want to classify a couple of million reads in tens of data sets within a limited amount of time, that's where I use Rust. The Rust people say it much better than me: a language empowering everyone to build reliable and efficient software. And I think with Rust you can really get the same performance as C and C++. You can get to that level. Actually, which parts are you seeing that have Perl in them? Because when I started learning it, going from Perl to Rust, it just blew me away. I had to go step by step in the tutorial and learn a whole new language. Well, the way I see it is, you look at it, it says use, and then a library, and then a semicolon. You know, that's very Perly. True, true. Okay. Which got me. And then all the curly brackets and stuff like that. It is a very beautiful language, actually. Yeah, I think so. Honestly, my experience was getting so frustrated with it, but then being just gobsmacked when I translated my Perl over to Rust and got like a 10 or 20 fold speed increase. That's insane. Yeah, this language is insane. So that's absolutely the reason I chose it. The other thing is, I can't read C++. I can't read C, even just to look at algorithms and the details of some people's code.
And what always frustrates me is having to skip between files, like your header files and whatever you need. Here you just have one file; that's where your code is. Okay. Let's get back to why you didn't just use Kraken 2 and why you went and made your own classifier. I think Kraken 2 is fabulous and it's fast, but there are just things missing; I don't know if they already exist by now. One of the things that frustrated me was that there wasn't a batch mode. So if you start a Kraken 2 run, the first thing the software does is load the index, or database, whatever you want to call it. And if it's large, no matter how big your computer is, that takes a long time. That usually takes longer than the actual action of classifying your reads. So one of the things Sepia can do is a batch mode, where loading the database is done once. And then you can specify tab-delimited files with your sequence data and your sample names, and it will just do it in one go. So it takes like a minute to load your 80 to 90 gig index, and then it takes like 10 seconds per sample to classify the reads and give you nice summary files and all those things. So one thing with read classifiers I find is that you can have bits that are shared by different species, like maybe mobile genetic elements or AMR genes or virulence genes or whatever. And that can sometimes throw some weird curve balls. And it's just influenced by the number of samples that happen to be sequenced. Absolutely. Like, say, Salmonella; that's massively overrepresented compared to just the generic Salmonella you find in the soil. So how does your classifier work in that case? In that case, it will just be as bad as other classifiers.
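The batch mode described a moment ago — pay the index-loading cost once, then classify any number of samples against the in-memory index — can be sketched as below. `Index`, `classify`, and `classify_batch` are illustrative stand-ins, not Sepia's real types or functions:

```rust
use std::collections::HashMap;

// Hypothetical in-memory index: k-mer -> taxon ID. In a real classifier
// this would be the large compact hash table that takes minutes to load.
struct Index {
    kmer_to_taxon: HashMap<u64, u32>,
}

// Report the taxon with the most k-mer hits for one read, if any.
fn classify(index: &Index, read_kmers: &[u64]) -> Option<u32> {
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for k in read_kmers {
        if let Some(&taxon) = index.kmer_to_taxon.get(k) {
            *counts.entry(taxon).or_insert(0) += 1;
        }
    }
    counts.into_iter().max_by_key(|&(_, n)| n).map(|(t, _)| t)
}

// One loaded index, many samples: the expensive load happens once,
// before this call, and every sample reuses the same in-memory index.
fn classify_batch(index: &Index, samples: &[Vec<Vec<u64>>]) -> Vec<Vec<Option<u32>>> {
    samples
        .iter()
        .map(|reads| reads.iter().map(|r| classify(index, r)).collect())
        .collect()
}
```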
That's the other thing: I'm very focused on indexes that use reference strains, not type strains but reference strains, so instead of trying to index all of Salmonella, you take a median or centroid strain from a population and use that as a reference. That takes away some of the effect of some genera being overrepresented, but you still have that same problem. You can use HyperLogLog, for instance, to estimate how many k-mers, or how many minimizers, whatever, are represented by some of those elements. Say you sequence a soil microbiome, you run your read classifier, and you have like 100,000 reads that match Salmonella. 100,000 reads should be enough to cover a Salmonella genome several times. But if you find that it's actually a small subset of the k-mers, say 2,000 k-mers compared to the whole genome, which should be like 4 million, 5 million k-mers, then you can say it's probably a shared gene instead of the organism itself. So I'm working on that currently. The other thing that I find really helpful, and that I integrated into Sepia from the start, is that it gives you what I call a hit ratio. So a minimizer-based estimate of the k-mer similarity, the average k-mer similarity, of your reads compared to the reference strain or strains that the reads are classified as. So kind of like a Mash score of some description. Exactly, yes. It correlates really well with ANI, average nucleotide identity. And I find that really useful: if you have a really high score, since you're working with k-mers, that's something between 0.8 and 0.99, you never get a 1 unless you have exactly the same strain that you recovered from the metagenome, then you have a pretty good indication that you have that organism. The other thing is that you can filter out a lot of the noise.
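The hit ratio described here — an ANI-like, per-read score — boils down to the fraction of a read's k-mers (or minimizers) found in the reference. A minimal sketch, not Sepia's actual minimizer-based implementation:

```rust
use std::collections::HashSet;

// Illustrative per-read "hit ratio": the fraction of a read's k-mers that
// are present in the reference's k-mer set. A score near 1.0 suggests the
// organism itself; a score near 0.0 suggests noise or a shared element.
fn hit_ratio(read_kmers: &[u64], reference: &HashSet<u64>) -> f64 {
    if read_kmers.is_empty() {
        return 0.0;
    }
    let hits = read_kmers.iter().filter(|&k| reference.contains(k)).count();
    hits as f64 / read_kmers.len() as f64
}
```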
So with these read classifiers, things usually get over-classified. These classifiers always go to the lowest level that they can. But if you have like a k-mer similarity of 0.01, you know that's clearly noise and that's just the read classifier doing its over-classification thing. Just switching gears for a second, if you feel like it, do you want to give any hints on what you've been using Sepia for in any applied research? So currently I'm working on some metagenome projects, like using metagenomics to detect animal intrusion in farmlands, and using metagenomics to predict how long ago that animal dropped its feces on your land. We're also working on mapping the microbiome of the retail environment, so at the far end of the farm-to-food continuum in food safety, to see how we can relate microbiome data to the occurrence of things like Listeria or Salmonella. Where I use Sepia there is, if I have 16S datasets, within seconds I can quickly scan through a dataset and pick out reads of interest. Especially with an amplicon database like 16S, that's super, super fast. You don't even need the batch mode there. Have you used it for coronavirus yet? No, but I know it can do coronavirus. Yes. The next thing you need for your paper when you write it up: it's coronavirus capable. I made sure of that. Because you're going to get that everywhere, I'd imagine, at some point. The contamination from reagents will be coming through and whatnot. So tell me, going back to your, I suppose, animal droppings and food safety bit, does that mean that, say, if you get maybe contaminated lettuce or cantaloupes or whatever, you can give some kind of classification on where that might have come from? Yes and no. So one of the first things I did for that project was to see how long the native microbiota of animal droppings persists.
That is, in particular, how fast the part that's indicative of which species was associated with those droppings disappears. What you see in the first couple of days is that most of the typical obligate anaerobes disappear. They're the most indicative, I think. But I was quite surprised. So it depends, it depends how long ago the fecal contamination occurred. That's basically it. And if you think about foodborne pathogens like Salmonella, Listeria, and E. coli, they're actually some of the species I found in that data set that last the longest; we especially had some nice Escherichias that lasted the longest. So with Listeria, I've always heard that it's very difficult to take from the environment and you have to do an overnight culture, that kind of thing. You can't just pick it up off the ground or in a factory and do something with it. So what type of samples would you be dealing with in that case? So that's absolutely true. In all those data sets, you find Listeria in small numbers, even in data sets that you sequenced without any prior enrichment. So what we've done for a couple of these projects is culture enrichment. That actually takes a lot longer than an overnight culture. It means that you take a sample and you expose it to an environment that positively affects a few organisms, or your organisms of interest. That can be by using certain antibiotics, media, etc. For Listeria, that would typically be an overnight culture initially; it takes at least a couple of days to get from a soil sample to your Listeria cultures. Okay. So maybe let's get deep into the technical bit, right? Yes. And I recall that you were involved somehow with BIGSI, with Zamin Iqbal. Yeah. And did his kind of work influence you in any way? Absolutely. So I have another piece of software, also in Rust, called Color ID.
And it uses a Rust version of BIGSI that can actually be downloaded as a crate. It's publicly available. So it builds a BIGSI in memory; it's an in-memory use of a BIGSI. Your index gets loaded in memory and that makes it really, really fast. You cannot make BIGSIs that are as big as, like, the entire SRA, NCBI's SRA, but you can index tens of thousands of strains in a relatively small data structure. That kind of terrifies me, because I know that BIGSI at one point was running on half a terabyte of RAM. Exactly. Yeah. How much RAM does your algorithm require? That all depends on how many accessions you have. I guess what I'm asking is, can I run it on my laptop or do I need a bigger virtual machine to run it on? That all depends. So say you want to do all of the current GTDB version, R202, the latest version, with all reference strains. And when you say all reference strains, how many are we talking about? Here we have 50,000 references. And they are everything: archaea, bacteria, everything from cultured organisms to metagenome-assembled genomes. Any humans? Not yet. But if we look at the GTDB database with Sepia, that is 98 gigabytes. And that has to be loaded into your RAM. So it won't work on my laptop. It won't work on your laptop. Yeah. But 98 gigs is quite good actually, because I know Kraken can require a fair whack of RAM. One of the things that I've been experimenting with is the k-mer size versus the minimizer size, and how much that influences the accuracy of your read classification. After playing around with some values, a k-mer size of 31 and a minimizer size of 21 actually gets you a significantly smaller database, even if you use the same values in Kraken. So is that kind of indicating that there might be people who want to know what the parameters are for lower memory, and those who want maximum accuracy?
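The k-mer versus minimizer trade-off discussed here can be sketched as below: instead of storing every 31-mer, an index can store only the smallest 21-mer inside each 31-mer, which is what shrinks the database. Plain lexicographic order stands in for the hash ordering a real implementation would typically use:

```rust
// Illustrative minimizer selection: from one k-mer, keep only the
// lexicographically smallest m-mer. Many k-mers share a minimizer,
// so the index stores far fewer distinct keys.
fn minimizer(kmer: &str, m: usize) -> Option<&str> {
    if kmer.len() < m {
        return None; // k-mer too short to contain an m-mer
    }
    (0..=kmer.len() - m).map(|i| &kmer[i..i + m]).min()
}
```

For instance, with m = 3 the 7-mer "GATTACA" contributes only its smallest 3-mer, "ACA", rather than all five 3-mers it contains.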
So are you documenting that or detailing that? Yes, I will. I forget if we actually said it on the recording, but we literally just got access to Sepia as we started this podcast, so we're kind of talking and looking at the same time. Yes. And I really like talking about these things, because I'm literally writing the updated and extended markdown documentation. One of the things, since I've been working on this over several years, is that the help functions are really, really helpful. And I can tell you that because there are things that I didn't remember from, let's say, over a year ago, and I looked at my help function. So everything works with the help function. Well done for doing it right. And I have to say that one of the things that makes it really easy is this one crate. What do you mean by a crate? Is that like a container or something? Or are they close to library files? Yes. They're library files that you can download from a central repository. They're not just libraries or modules; they call them crates. Correct. Yeah. And you get them at crates.io. And the crate that's responsible for me writing good help functions is actually called clap. The clap crate is fabulous. One function that I want to add is a read filter. It's going to be integrated into Sepia, and I'm also going to have it as a standalone. So have you written a paper on this yet? No. I'm working on a million papers. So after this, we'll get something out as soon as possible. I think that's always the thing with writing software. You're writing loads of code and potentially help functions, and then you have to write a paper. I mean, there are a couple of neat things.
For instance, the data structure that uses the perfect hash function needs to know the set of all k-mers or minimizers that you want to index before the perfect hash function can be built. So I wrote a variation on the compact hash map, the compact hash set. It's a set that can take gigantic, ginormous numbers of k-mers or whatever, so you can infer the set of all k-mers in your data set before you start building your hash map. So can you take two different databases and then do set operations on them, and basically start doing something like GWAS? Oh, we're onto something here, actually. That would be interesting. I haven't thought about that. Yeah, because, you know, what's in common, what's different, and extract them out and then maybe go and mine for interesting things. Yes, that shouldn't be too hard. And if you can do more complex set operations, you can do some pretty phenomenal things. Okay. Off you go, implement that, and we'll write the Nature paper then. You could have a set of reference genomes which are your cases, and a set of reference genomes that are your controls. And then you say, okay, go and build me two separate databases, then take, say, the intersection, or whatever is not in the intersection. And then you have a unique database, maybe for finding Listeria; that would be pretty cool. Are you trying to scoop yourself on PlasmidTron? It's basically PlasmidTron that I'm reinventing here. Yeah, obviously done in a better way, because PlasmidTron was very much hacked together. I was curious, you were talking much earlier about the batch mode, being able to run samples through in a batch, and then I noticed in the source code that you've got some call-outs to Redis. Is that what's underwriting that, or what's your use of Redis in this? Redis, so this is one of the leftovers. I started building my own compact hash set for those big sets.
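The case-versus-control set operations floated in this exchange could look like the sketch below: given the k-mer sets of two databases, the set difference yields k-mers unique to one group, which could seed a discriminatory database. This is a hypothetical function, not part of Sepia:

```rust
use std::collections::HashSet;

// Illustrative set operation on two k-mer databases: everything in the
// "cases" set that is absent from the "controls" set. The same pattern
// works for intersection or symmetric difference.
fn unique_to_cases(cases: &HashSet<u64>, controls: &HashSet<u64>) -> HashSet<u64> {
    cases.difference(controls).copied().collect()
}
```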
I tried to do it with Redis, but basically I couldn't do the set operations there. So that's actually vestigial. Yeah. All right. There's a lot of vestigial stuff in the current code, which I may remove. So Redis, for those who don't know, is an in-memory cache data structure store. Basically it's just a giant key-value store. You can use it for a whole bunch of different things and you'll find it all over the place. So the way the taxonomy is stored in the index is actually different from Kraken 2. There are a lot of things that are very similar to Kraken 2, but also different. The taxonomies are stored as directed acyclic graphs. In that way, once you identify the k-mer, you can look up the taxonomy of a single organism fairly quickly. It goes from the lowest to the highest taxonomy level, say seven or eight steps that you need to infer the taxonomy. And then you can do some set functions to figure out what the most recent common ancestor is. So what would the output look like for Sepia, actually? Is it the sort of Kraken classification where each read gets assigned a thing, with that hierarchical number of reads, or bits or chunks, that support a particular taxon, and then the number that uniquely mapped to a particular taxon? Currently there are two outputs. There's a summary file, but it doesn't use that hierarchical structure of Kraken. You mean it's just a straight assignment to a particular genus or species, much like the Kraken output? So that's good. What it gives you is a taxon with the number of reads that hit that taxon, and the average k-mer or minimizer similarity, dependent on what you're using, per read. If you use the HLL, the HyperLogLog function, it will give you an estimate.
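The lineage lookup and most-recent-common-ancestor step described above can be sketched as below, assuming each taxon's lineage is available as a root-to-leaf path of seven or eight levels. This is illustrative only; Sepia's DAG-based implementation differs:

```rust
// Illustrative MRCA computation: given two root-to-leaf lineage paths
// (e.g. domain down to species), the most recent common ancestor is the
// last element the two paths share from the root.
fn lca<'a>(lineage_a: &[&'a str], lineage_b: &[&'a str]) -> Option<&'a str> {
    lineage_a
        .iter()
        .zip(lineage_b.iter())
        .take_while(|(a, b)| a == b)
        .last()
        .map(|(a, _)| *a)
}
```

For two hits whose lineages agree down to family but diverge at genus, this returns the shared family-level taxon.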
It will give you the total number of minimizers, or actually k-mers, that were found for that specific taxon, the cardinality, and then the total number divided by the cardinality. So you can infer a kind of coverage per organism. Okay. That's good, because that sounds more digestible than the raw Kraken report, which is not something you can just palm off to someone who doesn't necessarily know how to interpret it. So it's good that you've got something that sounds a lot more human and digestible. To keep the code fast, everything is in u32 or whatever; all your taxonomic designations are encoded. But the moment you get to the summary file and the per-read classification file, everything is human readable. I made sure of that. Let's see, there is a separate folder in the Sepia repository called scripts, and that has a Python script that actually generates Krona plots, or the input for Krona plots, from the classification file that Sepia generates. Another file is what I call the plus file. It will give you not only the average k-mer similarity, but also the distribution of those k-mer similarities, so you can see what the curve looks like. And I made that in the past to see if I could use a machine learning algorithm to filter the noise from the real hits. No, that's good. That sounds good. And definitely the Krona output; professors like the Krona output. Yeah. Nice and clickable for them. Yes, exactly. Interactive. Yeah, exactly. I think we kind of touched on this, but I am curious: what happens if there are reads that are very diverse, completely unrelated to your reference database? What is the chance that this program falsely assigns them to one of those taxa just because it has no idea? I think you touched on this confidence value that would help, but what would be the propensity here?
So the propensity of read classifiers in general is to just assign it to the lowest taxon possible. That's where that k-mer similarity comes in. Those hits usually have a very, very low average k-mer similarity, which rules them out as being true hits for that organism. Chances are it just gets classified as no hit; I specifically have a no-hits category. But I mean, the danger is real. I've heard a couple of talks, I think virology talks, where they used read classifiers and were thrown off by weird or disturbing classifications, which did not turn out to be the things they were reported as. Yeah, like Yersinia pestis on the subway, or whatever that was. Yes. I mean, that was a naive case, but yeah. Could you test your software with that? It would be a good test. Yeah, I'm really curious. Is that data set still out there, or has it been retracted? We can make up a data set. I mean, it's not that hard. Yeah. I think there are some comparison papers out there for read classifiers, kind of like Assemblathon, but for read classifiers. I can't remember the name of the papers. And they do have some data sets that they use that are these kind of gotcha ones, which should throw off some of these tools. So that would be a good benchmark, you know, pulling those down and having a go at those. A little more seriously, maybe the first benchmark should be something like the Zymo mock communities, that kind of thing. Oh, the sky's the limit, right? We can try and break it as much as we can. Now the code is out, we can try to break it as much as we like. Actually, yeah, Henk, what do you think people should be looking at first when they get to the repo? We're coming up with all sorts of awesome things, definitely awesome, but, you know, maybe you have awesome things too. Just give the software a run and see what you can do with it.
So the current implementation of the HyperLogLog function is not something I wrote myself, and it makes my code very slow, so I wouldn't use that yet. The other thing I use Sepia for is read classification of Oxford Nanopore reads. Because you have that flexibility in setting those parameters, you can really play with the ideal parameters for read classification of Oxford Nanopore or noisy reads. So somebody first coming to your repo should try out their Oxford Nanopore set on it? What you will see is that your average k-mer values are of course highly affected; they're not comparable to what you will find for Illumina data. But it does a pretty good job, I think. So I know that Minimap has an error model to cope with PacBio and Nanopore. I don't know if that's a thing you can just flag. I should have a look at that, definitely. Currently I think the best option is smaller databases, for instance Kalamari. If I make a Kalamari database with a k-mer size of just 21, which is fairly small, you can wiggle your way past those critical errors that Nanopore introduces. That works pretty well. So smaller k-mers definitely seem to do a good job, as long as they're not too small, because then you hit everything and that's not very valuable. Would you have like a two-pass hierarchy kind of thing? So maybe you start off with k-mers of, say, 11 or something crazy small, and then do a second pass. Yeah. I think I've mentioned Kalamari a couple of different times, but it's definitely worth mentioning again. It's a database of curated reference genomes, mostly bacterial, mostly foodborne, that we are using in-house over here, but I also have it up on GitHub. It's basically a list of accessions of these things and a script to download them, and documentation on how to build it for different databases. And I'm looking forward to documentation on how to build it for Sepia.
I just found the dataset that was used for comparing the read classifiers, and that is the CAMI dataset; I'd just blanked on the name. That's the Critical Assessment of Metagenome Interpretation. And I think the paper is in Nature Methods 2017, Sczyrba et al., if you want to look that up. Yes, absolutely. For the audience, if they want to use it themselves. Me too, I want to see this. I haven't seen this before. There are a couple of them. There's Sczyrba et al., and there's one by McIntyre et al., 2017, in Genome Biology, which I think is the sequel to that. So between those, if you're able to outperform everyone, then your thing is golden, if anyone wants to play around with those datasets. Yeah, I have a feeling you can't really resolve strain-level differences with read classifiers, unless you use a really big k-mer size and all those things. So one thing I want to mention is that Sepia will also check the consistency of your taxonomy. For instance, if the same genus name is found in different lineages, it will flag it, so you can have a look at it. If you combine a plant taxonomy with a bacterial taxonomy, you will find that there are some genus names that are used in both domains of life. The thing is, for nodes in my taxonomy, the name is the whole taxonomy string, not just the genus name, and that fixes it. I learned that pretty quickly when I started to combine plant and bacterial taxonomy with zoological taxonomy. Henk, where did the name Sepia come from? Oh yeah. So the name Sepia is actually a tribute to Kraken, because the kraken is a big octopus. Sepia is also a cephalopod, and it refers to the rust color, the pigment that you can make from its ink sac, which is rusty colored. So it's a humble cephalopod compared to the big Kraken. All right. Well, thanks for a great discussion. This was a quick chat about Sepia, the classifier that Henk den Bakker created.
There's always some interesting facts about these tools. So I'm glad that we talked it through, especially where the actual name came from. I loved diving into Rust and everything else about that too. For those who are listening, you can check it out on GitHub. That'll be in the show notes. And that's all the time we have for today. See you next time. Thank you so much for listening to us at home. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at MicroBinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.