Hello, and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There's so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil Ali Khan and Dr. Andrew Page. I am Dr. Lee Katz. Andrew and Nabil work in the Quadram Institute in Norwich, UK, where they work on microbes in food and the impact on human health. I work at Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the U.S. Welcome to a new thing we are doing called Software Deep Dives, where we interview the author of a bioinformatics software package. Today, Henk den Bakker is in the hot seat with Sepia. He's an assistant professor at the University of Georgia. That's the state, not the country, in the U.S. I work with him through the food safety informatics group at the University of Georgia. Henk's an alum of the Wiedmann Lab and has a rich history working on things like Listeria and Campylobacter. Correct me if I'm wrong later on, please, but I think fungi and a lot of other things out there. He got into computational biology and bioinformatics from working on all these different things, and we're interviewing him today on Sepia. First of all, Henk, what is Sepia? Sepia is, I would say, yet another read classifier. Why do we need it? Being a taxonomist and somebody who uses read classifiers a lot, there were just a lot of things that I wanted to have in a read classifier that I didn't have yet. I wrote Sepia to address all those things. Sepia uses a couple of data structures. One of them is the compact hash table that Kraken 2 uses, and actually some of the principles and algorithms that Kraken 2 uses to classify reads.
One of the main components in read classification is the use of taxonomy, and taxonomy is very important. I'm a taxonomist by training. I'm interested in new taxonomies, for instance, the GTDB taxonomy versus the NCBI taxonomy. There are several taxonomies currently being used in the metagenomics field, and seeing how that influences your ability to classify reads, especially reads from organisms that are not necessarily well known, really interests me. It's a tool to experiment with those things. The other thing is that I'm interested in data structures. The compact hash table of Kraken 2 is already pretty compact, but can we make it more compact by combining it with something like a perfect hash function? So Sepia has those things. You say a hash table there. So how are you actually storing the sequence information? How are you encoding it? The sequence information is currently encoded in bits. So we have basically the compact hash table, or the hash table with the perfect hash function, which is one big vector containing unsigned 32-bit integers. And we can use those unsigned 32-bit integers to store both information about part of the sequence and the associated taxon that goes along with those sequences. So we first hash the sequence, and then we use that hash to find a position in that vector, and then we use either the hash value of the sequence of your minimizer, or part of your sequence, to confirm whether that's a good match or not. So how long would your k-mers be then in that case? Currently my k-mers can go up to 31 base pairs, though you don't have to use the full 31 base pairs. And one of the things that I'm excited about is that I can actually extend the size of k-mers, so we can go up to 64 base pairs. The language that I'm using, Rust, has an unsigned 128-bit integer type, so we should be able to make it even bigger. I don't know if that affects the performance of the software.
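The 2-bit packing described here — up to 31 bases in a u64, up to 64 in a u128 — can be sketched roughly as below. This is an illustrative Rust snippet with a hypothetical `encode_kmer` function, not Sepia's actual API:

```rust
// Illustrative sketch of 2-bit nucleotide packing: each base takes 2 bits,
// so a k-mer of up to 31 bases fits in a u64 (62 bits), and switching the
// integer type to u128 would allow k-mers of up to 64 bases.
fn encode_kmer(kmer: &str) -> Option<u64> {
    if kmer.len() > 31 {
        return None; // more than 31 bases will not fit in a u64
    }
    let mut packed: u64 = 0;
    for base in kmer.bytes() {
        let bits = match base {
            b'A' | b'a' => 0,
            b'C' | b'c' => 1,
            b'G' | b'g' => 2,
            b'T' | b't' => 3,
            _ => return None, // ambiguous base: skip this k-mer
        };
        packed = (packed << 2) | bits;
    }
    Some(packed)
}
```

For example, `encode_kmer("ACGT")` packs to the integer 27 (binary 00 01 10 11).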
So I got really excited when I looked at your code and I saw, this is Perl, because it has many of the same constructs as Perl and a similar kind of layout and syntax and whatnot. But I was very disappointed when you told me you'd abandoned Perl for some other frivolous fly-by-night language called Rust. Can you tell me more about that? First, I probably have to explain that I never abandoned Perl. I can read Perl. I write most of my scripts and things like that in it. Where Python is fast enough, I use Python. That's my go-to language at the moment. But if I write code that really is performance critical, like a read classifier, where I want to classify a couple of million reads in tens of data sets within a limited amount of time, that's where I use Rust. The Rust people say it much better than me: a language empowering everyone to build reliable and efficient software. And I think with Rust you can really get the same performance as C and C++. You can get to that level. Actually, which parts are you seeing that have Perl in them? Because when I started learning it, going from Perl to Rust, it just blew me away. I had to go step by step in the tutorial and learn a whole new language. Well, the way I see it is, you look at it, it says use, and then a library, and then a semicolon. You know, that's very Perly. True, true. Okay. Which got me. And then all the curly brackets and stuff like that. It is a very beautiful language, actually. Yeah, I think so. Honestly, my experience was getting so frustrated with it, but then being just gobsmacked when I translated my Perl over to Rust and got like a 10 or 20 fold speed increase. That's insane. Yeah, this language is insane. So that's absolutely the reason I chose it. The other thing is, I can't read C++. I can't read C, even just to look at algorithms and the details of some people's code.
And what always frustrates me is having to skip between files, like your header files and whatever you need. Here you just have one file; that's where your code is. Okay. Let's get back to why you didn't just use Kraken 2 and why you went and made your own classifier. I think Kraken 2 is fabulous and it's fast, but there are just things missing; I don't know if they already exist by now. One of the things that frustrated me was that there wasn't a batch mode. So if you start a Kraken 2 run, the first thing the software does is load the index, or database, whatever you want to call it. And if it's large, no matter how big your computer is, that takes a long time. That usually takes longer than the actual action of classifying your reads. So one of the things Sepia can do is a batch mode, where loading the database is done once. And then you can specify tab-delimited files with your sequence data and your sample names, and it will just do it in one go. So it takes like a minute to load your 80 to 90 gig index, and then it takes like 10 seconds per sample to classify the reads and give you nice summary files and all those things. So one thing with read classifiers I find is that you can have bits that are shared by different species, like maybe mobile genetic elements or AMR genes or virulence genes or whatever. And that can sometimes throw some weird curve balls. And it's just influenced by the number of samples that happen to be sequenced. Absolutely. Like, say, Salmonella; that's massively overrepresented compared to just the generic Salmonella you find in the soil. So how does your classifier work in that case? In that case, it will just be as bad as other classifiers.
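The batch mode described a moment ago — pay the index-loading cost once, then classify any number of samples against the in-memory index — can be sketched as below. `Index`, `classify`, and `classify_batch` are illustrative stand-ins, not Sepia's real types or functions:

```rust
use std::collections::HashMap;

// Hypothetical in-memory index: k-mer -> taxon ID. In a real classifier
// this would be the large compact hash table that takes minutes to load.
struct Index {
    kmer_to_taxon: HashMap<u64, u32>,
}

// Report the taxon with the most k-mer hits for one read, if any.
fn classify(index: &Index, read_kmers: &[u64]) -> Option<u32> {
    let mut counts: HashMap<u32, usize> = HashMap::new();
    for k in read_kmers {
        if let Some(&taxon) = index.kmer_to_taxon.get(k) {
            *counts.entry(taxon).or_insert(0) += 1;
        }
    }
    counts.into_iter().max_by_key(|&(_, n)| n).map(|(t, _)| t)
}

// One loaded index, many samples: the expensive load happens once,
// before this call, and every sample reuses the same in-memory index.
fn classify_batch(index: &Index, samples: &[Vec<Vec<u64>>]) -> Vec<Vec<Option<u32>>> {
    samples
        .iter()
        .map(|reads| reads.iter().map(|r| classify(index, r)).collect())
        .collect()
}
```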
That's the other thing: I'm very focused on indexes that use reference strains, not type strains but reference strains, so instead of trying to index all of Salmonella, you take a median or centroid strain from a population and use that as a reference. That takes away some of the effect of some genera being overrepresented, but you still have that same problem. You can use HyperLogLog, for instance, to estimate how many k-mers, or how many minimizers, whatever, are represented by some of those elements. Say you sequence a soil microbiome, you run your read classifier, and you have like 100,000 reads that match Salmonella. 100,000 reads should be enough to cover a Salmonella genome several times. But if you find that it's actually a small subset of the k-mers, say 2,000 k-mers compared to the whole genome, which should be like 4 million, 5 million k-mers, then you can say it's probably a shared gene instead of the organism itself. So I'm working on that currently. The other thing that I find really helpful, and that I integrated into Sepia from the start, is that it gives you what I call a hit ratio. So a minimizer-based estimate of the k-mer similarity, the average k-mer similarity, of your reads compared to the reference strain or strains that the reads are classified as. So kind of like a Mash score of some description. Exactly, yes. It correlates really well with ANI, average nucleotide identity. And I find that really useful: if you have a really high score, since you're working with k-mers, that's something between 0.8 and 0.99, you never get a 1 unless you have exactly the same strain that you recovered from the metagenome, then you have a pretty good indication that you have that organism. The other thing is that you can filter out a lot of the noise.
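The hit ratio described here — an ANI-like, per-read score — boils down to the fraction of a read's k-mers (or minimizers) found in the reference. A minimal sketch, not Sepia's actual minimizer-based implementation:

```rust
use std::collections::HashSet;

// Illustrative per-read "hit ratio": the fraction of a read's k-mers that
// are present in the reference's k-mer set. A score near 1.0 suggests the
// organism itself; a score near 0.0 suggests noise or a shared element.
fn hit_ratio(read_kmers: &[u64], reference: &HashSet<u64>) -> f64 {
    if read_kmers.is_empty() {
        return 0.0;
    }
    let hits = read_kmers.iter().filter(|&k| reference.contains(k)).count();
    hits as f64 / read_kmers.len() as f64
}
```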
So with these read classifiers, things usually get over-classified. These classifiers always go to the lowest level that they can. But if you have like a k-mer similarity of 0.01, you know that's clearly noise and that's just the read classifier doing its over-classification thing. Just switching gears for a second, if you feel like it, do you want to give any hints on what you've been using Sepia for in any applied research? So currently I'm working on some metagenome projects, like using metagenomics to detect animal intrusion in farmlands, and using metagenomics to predict how long ago that animal dropped its feces on your land. We're also working on mapping the microbiome of the retail environment, so at the far end of the farm-to-food continuum in food safety, to see how we can relate microbiome data to the occurrence of things like Listeria or Salmonella. Where I use Sepia there is, if I have 16S datasets, within seconds I can quickly scan through a dataset and pick out reads of interest. Especially with an amplicon database like 16S, that's super, super fast. You don't even need the batch mode there. Have you used it for coronavirus yet? No, but I know it can do coronavirus. Yes. The next thing you need for your paper when you write it up: it's coronavirus capable. I made sure of that. Because you're going to get that everywhere, I'd imagine, at some point. The contamination from reagents will be coming through and whatnot. So tell me, going back to your, I suppose, animal droppings and food safety bit, does that mean that, say, if you get maybe contaminated lettuce or cantaloupes or whatever, you can give some kind of classification on where that might have come from? Yes and no. So one of the first things I did for that project was to see how long the native microbiota of animal droppings persists.
That is, in particular, how fast the part that's indicative of which species was associated with those droppings disappears. What you see in the first couple of days is that most of the typical obligate anaerobes disappear. They're the most indicative, I think. But I was quite surprised. So it depends, it depends how long ago the fecal contamination occurred. That's basically it. And if you think about foodborne pathogens like Salmonella, Listeria, and E. coli, they're actually some of the species I found in that data set that last the longest; we especially had some nice Escherichias that lasted the longest. So with Listeria, I've always heard that it's very difficult to take from the environment and you have to do an overnight culture, that kind of thing. You can't just pick it up off the ground or in a factory and do something with it. So what type of samples would you be dealing with in that case? So that's absolutely true. In all those data sets, you find Listeria in small numbers, even in data sets that you sequenced without any prior enrichment. So what we've done for a couple of these projects is culture enrichment. That actually takes a lot longer than an overnight culture. It means that you take a sample and you expose it to an environment that positively affects a few organisms, or your organisms of interest. That can be by using certain antibiotics, media, etc. For Listeria, that would typically be an overnight culture initially; it takes at least a couple of days to get from a soil sample to your Listeria cultures. Okay. So maybe let's get deep into the technical bit, right? Yes. And I recall that you were involved somehow with BIGSI, with Zamin Iqbal. Yeah. And did his kind of work influence you in any way? Absolutely. So I have another piece of software, also in Rust, called Color ID.
And it uses a Rust version of BIGSI that can actually be downloaded as a crate. It's publicly available. So it builds a BIGSI in memory; it's an in-memory use of a BIGSI. Your index gets loaded in memory and that makes it really, really fast. You cannot make BIGSIs that are as big as, like, the entire SRA, NCBI's SRA, but you can index tens of thousands of strains in a relatively small data structure. That kind of terrifies me, because I know that BIGSI at one point was running on half a terabyte of RAM. Exactly. Yeah. How much RAM does your algorithm require? That all depends on how many accessions you have. I guess what I'm asking is, can I run it on my laptop or do I need a bigger virtual machine to run it on? That all depends. So say you want to do all of the current GTDB version, R202, the latest version, with all reference strains. And when you say all reference strains, how many are we talking about? Here we have 50,000 references. And they are everything: archaea, bacteria, everything from cultured organisms to metagenome-assembled genomes. Any humans? Not yet. But if we look at the GTDB database with Sepia, that is 98 gigabytes. And that has to be loaded into your RAM. So it won't work on my laptop. It won't work on your laptop. Yeah. But 98 gigs is quite good actually, because I know Kraken can require a fair whack of RAM. One of the things that I've been experimenting with is the k-mer size versus the minimizer size, and how much that influences the accuracy of your read classification. After playing around with some values, a k-mer size of 31 and a minimizer size of 21 actually gets you a significantly smaller database, even if you use the same values in Kraken. So is that kind of indicating that there might be people who want to know what the parameters are for lower memory, and those who want maximum accuracy?
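The k-mer versus minimizer trade-off discussed here can be sketched as below: instead of storing every 31-mer, an index can store only the smallest 21-mer inside each 31-mer, which is what shrinks the database. Plain lexicographic order stands in for the hash ordering a real implementation would typically use:

```rust
// Illustrative minimizer selection: from one k-mer, keep only the
// lexicographically smallest m-mer. Many k-mers share a minimizer,
// so the index stores far fewer distinct keys.
fn minimizer(kmer: &str, m: usize) -> Option<&str> {
    if kmer.len() < m {
        return None; // k-mer too short to contain an m-mer
    }
    (0..=kmer.len() - m).map(|i| &kmer[i..i + m]).min()
}
```

For instance, with m = 3 the 7-mer "GATTACA" contributes only its smallest 3-mer, "ACA", rather than all five 3-mers it contains.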
So are you documenting that or detailing that? Yes, I will. I forget if we actually said it on the recording, but we literally just got access to Sepia as we started this podcast, so we're kind of talking and looking at the same time. Yes. And I really like talking about these things, because I'm literally writing the updated and extended markdown documentation. One of the things, since I've been working on this over several years, is that the help functions are really, really helpful. And I can tell you that because there are things that I didn't remember from, let's say, over a year ago, and I looked at my help function. So everything works with the help function. Well done for doing it right. And I have to say that one of the things that makes it really easy is this one crate. What do you mean by a crate? Is that like a container or something? Or are they close to library files? Yes. They're library files that you can download from a central repository. They're not just libraries or modules; they call them crates. Correct. Yeah. And you get them at crates.io. And the crate that's responsible for me writing good help functions is actually called clap. The clap crate is fabulous. One function that I want to add is a read filter. It's going to be integrated into Sepia, and I'm also going to have it as a standalone. So have you written a paper on this yet? No. I'm working on a million papers. So after this, we'll get something out as soon as possible. I think that's always the thing with writing software. You're writing loads of code and potentially help functions, and then you have to write a paper. I mean, there are a couple of neat things.
For instance, the data structure that uses the perfect hash function needs to know the set of all k-mers or minimizers that you want to index before the perfect hash function can be built. So I wrote a variation on the compact hash map, the compact hash set. It's a set that can take gigantic, ginormous numbers of k-mers or whatever, so you can infer the set of all k-mers in your data set before you start building your hash map. So can you take two different databases and then do set operations on them, and basically start doing something like GWAS? Oh, we're onto something here, actually. That would be interesting. I haven't thought about that. Yeah, because, you know, what's in common, what's different, and extract them out and then maybe go and mine for interesting things. Yes, that shouldn't be too hard. And if you can do more complex set operations, you can do some pretty phenomenal things. Okay. Off you go, implement that, and we'll write the Nature paper then. You could have a set of reference genomes which are your cases, and a set of reference genomes that are your controls. And then you say, okay, go and build me two separate databases, then take, say, the intersection, or whatever is not in the intersection. And then you have a unique database, maybe for finding Listeria; that would be pretty cool. Are you trying to scoop yourself on PlasmidTron? It's basically PlasmidTron that I'm reinventing here. Yeah, obviously done in a better way, because PlasmidTron was very much hacked together. I was curious, you were talking much earlier about the batch mode, being able to run samples through in a batch, and then I noticed in the source code that you've got some call-outs to Redis. Is that what's underwriting that, or what's your use of Redis in this? Redis, so this is one of the leftovers. I started building my own compact hash set for those big sets.
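The case-versus-control set operations floated in this exchange could look like the sketch below: given the k-mer sets of two databases, the set difference yields k-mers unique to one group, which could seed a discriminatory database. This is a hypothetical function, not part of Sepia:

```rust
use std::collections::HashSet;

// Illustrative set operation on two k-mer databases: everything in the
// "cases" set that is absent from the "controls" set. The same pattern
// works for intersection or symmetric difference.
fn unique_to_cases(cases: &HashSet<u64>, controls: &HashSet<u64>) -> HashSet<u64> {
    cases.difference(controls).copied().collect()
}
```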
I tried to do it with Redis, but basically I couldn't do the set operations there. So that's actually vestigial. Yeah. All right. There's a lot of vestigial stuff in the current code, which I may remove. So Redis, for those who don't know, is an in-memory cache data structure store. Basically it's just a giant key-value store. You can use it for a whole bunch of different things and you'll find it all over the place. So the way the taxonomy is stored in the index is actually different from Kraken 2. There are a lot of things that are very similar to Kraken 2, but also different. The taxonomies are stored as directed acyclic graphs. In that way, once you identify the k-mer, you can look up the taxonomy of a single organism fairly quickly. It goes from the lowest to the highest taxonomy level, say seven or eight steps that you need to infer the taxonomy. And then you can do some set functions to figure out what the most recent common ancestor is. So what would the output look like for Sepia, actually? Is it the sort of Kraken classification where each read gets assigned a thing, with that hierarchical number of reads, or bits or chunks, that support a particular taxon, and then the number that uniquely mapped to a particular taxon? Currently there are two outputs. There's a summary file, but it doesn't use that hierarchical structure of Kraken. You mean it's just a straight assignment to a particular genus or species, much like the Kraken output? So that's good. What it gives you is a taxon with the number of reads that hit that taxon, and the average k-mer or minimizer similarity, dependent on what you're using, per read. If you use the HLL, the HyperLogLog function, it will give you an estimate.
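The lineage lookup and most-recent-common-ancestor step described above can be sketched as below, assuming each taxon's lineage is available as a root-to-leaf path of seven or eight levels. This is illustrative only; Sepia's DAG-based implementation differs:

```rust
// Illustrative MRCA computation: given two root-to-leaf lineage paths
// (e.g. domain down to species), the most recent common ancestor is the
// last element the two paths share from the root.
fn lca<'a>(lineage_a: &[&'a str], lineage_b: &[&'a str]) -> Option<&'a str> {
    lineage_a
        .iter()
        .zip(lineage_b.iter())
        .take_while(|(a, b)| a == b)
        .last()
        .map(|(a, _)| *a)
}
```

For two hits whose lineages agree down to family but diverge at genus, this returns the shared family-level taxon.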
It will give you the total number of minimizers, or actually k-mers, that were found for that specific taxon, the cardinality, and then the total number divided by the cardinality. So you can infer a kind of coverage per organism. Okay. That's good, because that sounds more digestible than the raw Kraken report, which is not something you can just palm off to someone who doesn't necessarily know how to interpret it. So it's good that you've got something that sounds a lot more human and digestible. To keep the code fast, everything is in u32 or whatever; all your taxonomic designations are encoded. But the moment you get to the summary file and the per-read classification file, everything is human readable. I made sure of that. Let's see, there is a separate folder in the Sepia repository called scripts, and that has a Python script that actually generates Krona plots, or the input for Krona plots, from the classification file that Sepia generates. Another file is what I call the plus file. It will give you not only the average k-mer similarity, but also the distribution of those k-mer similarities, so you can see what the curve looks like. And I made that in the past to see if I could use a machine learning algorithm to filter the noise from the real hits. No, that's good. That sounds good. And definitely the Krona output; professors like the Krona output. Yeah. Nice and clickable for them. Yes, exactly. Interactive. Yeah, exactly. I think we kind of touched on this, but I am curious: what happens if there are reads that are very diverse, completely unrelated to your reference database? What is the chance that this program falsely assigns them to one of those taxa just because it has no idea? I think you touched on this confidence value that would help, but what would be the propensity here?
So the propensity of read classifiers in general is to just assign it to the lowest taxon possible. That's where that k-mer similarity comes in. Those hits usually have a very, very low average k-mer similarity, which rules them out as being true hits for that organism. Chances are it just gets classified as no hit; I specifically have a no-hits category. But I mean, the danger is real. I've heard a couple of talks, I think virology talks, where they used read classifiers and were thrown off by weird or disturbing classifications, which did not turn out to be the things they were reported as. Yeah, like Yersinia pestis on the subway, or whatever that was. Yes. I mean, that was a naive case, but yeah. Could you test your software with that? It would be a good test. Yeah, I'm really curious. Is that data set still out there, or has it been retracted? We can make up a data set. I mean, it's not that hard. Yeah. I think there are some comparison papers out there for read classifiers, kind of like Assemblathon, but for read classifiers. I can't remember the name of the papers. And they do have some data sets that they use that are these kind of gotcha ones, which should throw off some of these tools. So that would be a good benchmark, you know, pulling those down and having a go at those. A little more seriously, maybe the first benchmark should be something like the Zymo mock communities, that kind of thing. Oh, the sky's the limit, right? We can try and break it as much as we can. Now the code is out, we can try to break it as much as we like. Actually, yeah, Henk, what do you think people should be looking at first when they get to the repo? We're coming up with all sorts of awesome things, definitely awesome, but, you know, maybe you have awesome things too. Just give the software a run and see what you can do with it.
So the current implementation of the HyperLogLog function is not something I wrote myself, and it makes my code very slow, so I wouldn't use that yet. The other thing I use Sepia for is read classification of Oxford Nanopore reads. Because you have that flexibility in setting those parameters, you can really play with the ideal parameters for read classification of Oxford Nanopore or noisy reads. So somebody first coming to your repo should try out their Oxford Nanopore set on it? What you will see is that your average k-mer values are of course highly affected; they're not comparable to what you will find for Illumina data. But it does a pretty good job, I think. So I know that Minimap has an error model to cope with PacBio and Nanopore. I don't know if that's a thing you can just flag. I should have a look at that, definitely. Currently I think the best option is smaller databases, for instance Kalamari. If I make a Kalamari database with a k-mer size of just 21, which is fairly small, you can wiggle your way past those critical errors that Nanopore introduces. That works pretty well. So smaller k-mers definitely seem to do a good job, as long as they're not too small, because then you hit everything and that's not very valuable. Would you have like a two-pass hierarchy kind of thing? So maybe you start off with k-mers of, say, 11 or something crazy small, and then do a second pass. Yeah. I think I've mentioned Kalamari a couple of different times, but it's definitely worth mentioning again. It's a database of curated reference genomes, mostly bacterial, mostly foodborne, that we are using in-house over here, but I also have it up on GitHub. It's basically a list of accessions of these things and a script to download them, and documentation on how to build it for different databases. And I'm looking forward to documentation on how to build it for Sepia.
I just found the dataset that was used for comparing the read classifiers, and that is the CAMI dataset; I'd just blanked on the name. That's the Critical Assessment of Metagenome Interpretation. And I think the paper is in Nature Methods 2017, Sczyrba et al., if you want to look that up. Yes, absolutely. For the audience, if they want to use it themselves. Me too, I want to see this. I haven't seen this before. There are a couple of them. There's Sczyrba et al., and there's one by McIntyre et al., 2017, in Genome Biology, which I think is the sequel to that. So between those, if you're able to outperform everyone, then your thing is golden, if anyone wants to play around with those datasets. Yeah, I have a feeling you can't really resolve strain-level differences with read classifiers, unless you use a really big k-mer size and all those things. So one thing I want to mention is that Sepia will also check the consistency of your taxonomy. For instance, if the same genus name is found in different lineages, it will flag it, so you can have a look at it. If you combine a plant taxonomy with a bacterial taxonomy, you will find that there are some genus names that are used in both domains of life. The thing is, for nodes in my taxonomy, the name is the whole taxonomy string, not just the genus name, and that fixes it. I learned that pretty quickly when I started to combine plant and bacterial taxonomy with zoological taxonomy. Henk, where did the name Sepia come from? Oh yeah. So the name Sepia is actually a tribute to Kraken, because the kraken is a big octopus. Sepia is also a cephalopod, and it refers to the rust color, the pigment that you can make from its ink sac, which is rusty colored. So it's a humble cephalopod compared to the big Kraken. All right. Well, thanks for a great discussion. This was a quick chat about Sepia, the classifier that Henk den Bakker created.
There's always some interesting facts about these tools. So I'm glad that we talked it through, especially where the actual name came from. I loved diving into Rust and everything else about that too. For those who are listening, you can check it out on GitHub. That'll be in the show notes. And that's all the time we have for today. See you next time. Thank you so much for listening to us at home. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at MicroBinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.