Hello, and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Both Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work on microbes in food and their impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the US.

Hello and welcome to the MicroBinfie podcast. I'm your host today, and we're going to talk about SARS-CoV-2 data mining and how we build benchmark datasets. Again, we're joined by Lingzi and Jill, and we're going to talk about how they put together the benchmark datasets for SARS-CoV-2 that we talked about last time. So this is like a part two, really. So I suppose I'm going to open it up immediately to Jill. Jill, tell us about how you went about actually finding the right data to use.

Yeah. So as I kind of mentioned before, right, Adrian and I, at the time, were trying to figure out where to get the sequences from, and then how to download everything, and at the time on GISAID, I think it was just a little difficult to do a lot of bulk downloading. And so, as sometimes happens, Adrian and I kind of ended up working on the same problem and then went about it in two completely different ways. And I found this, I don't know if anyone else has used it before, but it's this Python package called Selenium, which allows you to remotely interact with web browsers. You're nodding, so you have used it, or no?

I've used it for continuous integration testing. Yeah, usually it's used as a test package for websites.

Gotcha.
Well, I commandeered it for other uses. And so I was just using it to search. We had a list of different ID numbers, and so I was using it to just go through and automate that search process. I'm sure there's a faster way to go about it, but I think it was also because I thought it was a fun tool, and so I wanted to play around with it in that sense. And so ultimately, Adrian got the data downloaded a different way, and so I stopped that, but we did end up reusing it again, because once we had a group of samples that we thought were good, there were still some checks that we needed: we wanted to make sure they were all Illumina, wanted to make sure they were all paired-end, wanted to make sure that they all used ARTIC primers, and checking that information. So then I used it again to go to NCBI's website and pull that information for the samples, and then gave Lingzi a filtered list of everything that did meet the criteria, rather than doing it one by one. Like any good bioinformatician, I'm just deeply lazy in some senses, and if I have to click buttons more than once, then I feel like it should be automated.

So were you trusting all the metadata that was on the SRA?

Yeah, we took what it said on SRA. I mean, we ran it through an Illumina-specific pipeline, so if there was something weird that came out, right, there were no reads or anything like that, then you'd be like, okay. Because at some point we ran some of it through the wrong pipeline, and then there were no reads in the R2, and we were like, why is there an R2 if this is actually Nanopore, right? So there were some oddities like that that we found. There were also some oddities in naming of things. That was some of what I found and had to specifically account for, because people didn't put the right information in the right place, if there is a right place, or they called it different things.
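The filtering step Jill describes, taking the SRA metadata at face value and keeping only Illumina, paired-end, ARTIC-primed runs, could be sketched roughly like this. The record fields (`platform`, `layout`, `primers`) and the run accessions are purely illustrative, not the actual SRA schema or the dataset's real accessions:

```python
# Hypothetical sketch: filter run metadata down to runs meeting the
# criteria described above (Illumina, paired-end, ARTIC primers).
# Field names and accessions are invented for illustration.

def meets_criteria(record):
    """Return True if a metadata record looks usable for the benchmark set."""
    return (
        record.get("platform", "").upper() == "ILLUMINA"
        and record.get("layout", "").upper() == "PAIRED"
        and "ARTIC" in record.get("primers", "").upper()
    )

runs = [
    {"run": "SRR0000001", "platform": "ILLUMINA", "layout": "PAIRED", "primers": "ARTIC V3"},
    {"run": "SRR0000002", "platform": "OXFORD_NANOPORE", "layout": "SINGLE", "primers": "ARTIC V3"},
    {"run": "SRR0000003", "platform": "ILLUMINA", "layout": "SINGLE", "primers": "ARTIC V3"},
]

# Keep only runs passing every check.
filtered = [r["run"] for r in runs if meets_criteria(r)]
print(filtered)  # ['SRR0000001']
```

In the episode the metadata was scraped from NCBI's pages with Selenium rather than held in a list like this; the point here is only the filtering logic applied once the fields were in hand.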
So there was no standard way of saying it. Other than saying Illumina, right, there were a lot of ways to say ARTIC V3: some just said version 3, some said V3, you know, there were like a thousand ways that people did that. So some of that, that was part of the process of what took a while, was combing through that, and if I wasn't pulling the right information, like, what are they actually saying on that webpage?

That's quite difficult, like unbelievably difficult, I'd say, particularly since you have people now using V4 and whatever, so in the future you're going to have to be extra careful, you know: did they use V3 or V4, or did they say they used one and actually used the wrong one? Because that totally messes up everything. So how did you find representative samples for the VOCs? Did you pick, like, the very first example? Did you go to the Pangolin lineages and look for their, I don't know what you call them, nominal descriptions of those lineages? Or, you know, how did you come up with them, or do you pick a centroid or something in a tree?

Yeah, maybe I can describe some of the process and then I think Lingzi can fill in a little bit more detail. So we kind of started talking about this, right, but first we had all these sequences that we got, then we basically ran them through this kind of process in which we did basic FastQC, right, just what is the basic information here; then we used SAMtools to look at the depth of coverage at every nucleotide position, so the average and the standard deviation; and then we ran them through the Titan pipeline, which gives tons of QC metrics, some of them being the number of Ns, the Pangolin lineage, the VADR alerts, whether there were amino acid insertions, deletions, or substitutions, and then, yeah, sequencing depth across the entire assembly.
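The "thousand ways to say ARTIC V3" problem is essentially string normalization. A minimal sketch of how free-text primer descriptions might be collapsed into one canonical label, assuming a simple regex is good enough (real submissions are messier, and this function is an invention for illustration, not part of the authors' pipeline):

```python
import re

def normalize_primer_version(raw):
    """Collapse free-text primer labels like 'version 3', 'V3', or
    'artic v.3' into a canonical 'ARTIC V<n>' form, or None if no
    version number can be recognized. Purely illustrative."""
    if raw is None:
        return None
    m = re.search(r"v(?:ersion)?\s*\.?\s*(\d+(?:\.\d+)?)", raw, re.IGNORECASE)
    if not m:
        return None
    return f"ARTIC V{m.group(1)}"

for raw in ["version 3", "V3", "artic v.3", "ARTIC version 4.1", "no idea"]:
    print(raw, "->", normalize_primer_version(raw))
```

A mapping like this still needs human review for the genuinely ambiguous cases Jill mentions, such as a submitter naming one version while having used another; no regex can catch that.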
So we got a bunch of QC metrics, and if those looked okay, then what we did for the VOI/VOC samples was we had an internal CDC reference, and we used Snippy to compare the SNPs, and then we were basically looking for sequences that had the fewest SNP differences to the internal CDC reference, and we were trying to make sure that the SNPs and changes that we were seeing were not in the spike protein, so we looked specifically at what was going on in the spike protein, and that it had the mutations that, you know, characterized it as that specific lineage. So Lingzi can probably add more detail.

Yeah, so the idea is, like, we're using the internal CDC reference for each lineage as a starting point, and then we go through GISAID, all the available assemblies for that lineage, and try to look for the ones which have, like, the fewest SNPs and the fewest ambiguous Ns. Once we have a subset, we have a link table between GISAID and the SRA, and then we go back to the SRA and run all the SRA reads through Titan, and also through SAMtools and FastQC, to check the quality: first, to make sure they meet our QC cutoff, and second, that they have the exact same spike mutations described on the CDC website for the VOIs and VOCs.

Okay, so how did you decide on which lineages to include and which ones not to include? Because there are a lot of lineages out there, and a lot of them just go nowhere. You know, they come and go, and that's it. They do nothing, particularly for the non-VOC stuff. And then for the VOCs, were you using your own definition of a VOC, or were you looking at the WHO list, or what? Because I know a lot of different people call things differently. A lot of people have said things are variants of concern, but not everyone agrees that they are variants of concern.
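The selection rule Lingzi describes, preferring the fewest SNP differences to the internal reference and breaking ties on the fewest ambiguous Ns, can be sketched as a simple sort. The candidate records, SNP counts, and sequences below are invented for illustration; in the real workflow the SNP counts came from Snippy against the internal CDC reference:

```python
# Hypothetical sketch: among candidate assemblies for one lineage, pick the
# one with the fewest SNPs vs. the reference, then the fewest ambiguous Ns.
# All values here are made up for illustration.

def count_ns(sequence):
    """Count ambiguous N bases in an assembly sequence."""
    return sequence.upper().count("N")

candidates = [
    {"id": "candidate-1", "snps_vs_ref": 4, "seq": "ACGTNNACGT"},
    {"id": "candidate-2", "snps_vs_ref": 2, "seq": "ACGTNNNNNN"},
    {"id": "candidate-3", "snps_vs_ref": 2, "seq": "ACGTACGTNA"},
]

# Tuple key: fewest SNPs first, fewest Ns as the tie-breaker.
best = min(candidates, key=lambda c: (c["snps_vs_ref"], count_ns(c["seq"])))
print(best["id"])  # candidate-3
```

The chosen assembly would then still need the lineage-defining spike mutations confirmed, as described above; fewest-differences alone is not sufficient.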
So when we developed this project, we didn't yet have the WHO nomenclature, so we used the CDC-defined VOIs and VOCs at that moment.

That's all we have time for on the MicroBinfie podcast. Thank you again to Lingzi and Jill for coming and talking to us about SARS-CoV-2 benchmarking datasets and the data mining behind them. And I hope you can join us again someday. Thank you.