Hello, and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Both Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work on microbes in food and their impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the US.

Hello and welcome to the MicroBinfie podcast. I'm your host today, and we're going to talk about SARS-CoV-2 data mining and how we build benchmark datasets. Again, we're joined by Lingzi and Jill, and we're going to talk about how they put together the benchmark datasets for SARS-CoV-2 that we talked about last time. So this is like a part two, really. So I suppose I'm going to open it up immediately to Jill. Jill, tell us about how you went about actually finding the right data to use.

Yeah. So as I kind of mentioned before, right, Adrian and I, at the time, were trying to figure out where to get the sequences from, and then how to download everything, and at the time on GISAID, I think it was just a little difficult to do a lot of bulk downloading. And so, as sometimes happens, Adrian and I kind of ended up working on the same problem and then went about it in two completely different ways. And I found this, I don't know if anyone else has used it before, but it's this Python package called Selenium, which allows you to remotely interact with web browsers. You're nodding, so you have used it, or no?

I've used it for continuous integration testing. Yeah, usually it's used as a test package for websites.

Gotcha.
Well, I commandeered it for other uses. And so I was just using it to search. We had a list of different ID numbers, and so I was using it to just go through and automate that search process. I'm sure there's a faster way to go about it, but I think it was also because I thought it was a fun tool, and so I wanted to play around with it in that sense. And so ultimately, Adrian got the data downloaded a different way, and so I stopped that, but we did end up reusing it again, because once we had a group of samples that we thought were good, there were still some checks that we needed: we wanted to make sure they were all Illumina, wanted to make sure they were all paired-end, wanted to make sure that they all used ARTIC primers, and checking that information. So then I used it again to go to NCBI's website and pull that information for the samples, and then gave Lingzi a filtered list of everything that did meet the criteria, rather than doing it one by one. Like any good bioinformatician, I'm just deeply lazy in some senses, and if I have to click buttons more than once, then I feel like it should be automated.

So were you trusting all the metadata that was on the SRA?

Yeah, we took what it said on SRA. I mean, we ran it through an Illumina-specific pipeline, so if there was something weird that came out, right, there were no reads or anything like that, then you'd be like, okay. Because at some point we ran some of it through the wrong pipeline, and then there were no reads in the R2, and we were like, why is there an R2 if this is actually Nanopore, right? So there were some oddities like that that we found. There were also some oddities in naming of things. That was some of what I found and had to specifically account for, because people didn't put the right information in the right place, if there is a right place, or they called it different things.
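The filtering step Jill describes, taking the SRA metadata at face value and keeping only Illumina, paired-end, ARTIC-primed runs, could be sketched roughly like this. The record fields (`platform`, `layout`, `primers`) and the run accessions are purely illustrative, not the actual SRA schema or the dataset's real accessions:

```python
# Hypothetical sketch: filter run metadata down to runs meeting the
# criteria described above (Illumina, paired-end, ARTIC primers).
# Field names and accessions are invented for illustration.

def meets_criteria(record):
    """Return True if a metadata record looks usable for the benchmark set."""
    return (
        record.get("platform", "").upper() == "ILLUMINA"
        and record.get("layout", "").upper() == "PAIRED"
        and "ARTIC" in record.get("primers", "").upper()
    )

runs = [
    {"run": "SRR0000001", "platform": "ILLUMINA", "layout": "PAIRED", "primers": "ARTIC V3"},
    {"run": "SRR0000002", "platform": "OXFORD_NANOPORE", "layout": "SINGLE", "primers": "ARTIC V3"},
    {"run": "SRR0000003", "platform": "ILLUMINA", "layout": "SINGLE", "primers": "ARTIC V3"},
]

# Keep only runs passing every check.
filtered = [r["run"] for r in runs if meets_criteria(r)]
print(filtered)  # ['SRR0000001']
```

In the episode the metadata was scraped from NCBI's pages with Selenium rather than held in a list like this; the point here is only the filtering logic applied once the fields were in hand.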
So there was no standard way of saying it. Other than saying Illumina, right, there were a lot of ways to say ARTIC V3: some just said version 3, some said V3, you know, there were like a thousand ways that people did that. So some of that, that was part of the process of what took a while, was combing through that, and if I wasn't pulling the right information, like, what are they actually saying on that webpage?

That's quite difficult, like unbelievably difficult, I'd say, particularly since you have people now using V4 and whatever, so in the future you're going to have to be extra careful, you know: did they use V3 or V4, or did they say they used one and actually used the wrong one? Because that totally messes up everything. So how did you find representative samples for the VOCs? Did you pick, like, the very first example? Did you go to the Pangolin lineages and look for their, I don't know what you call them, nominal descriptions of those lineages? Or, you know, how did you come up with them, or do you pick a centroid or something in a tree?

Yeah, maybe I can describe some of the process and then I think Lingzi can fill in a little bit more detail. So we kind of started talking about this, right, but first we had all these sequences that we got, then we basically ran them through this kind of process in which we did basic FastQC, right, just what is the basic information here; then we used SAMtools to look at the depth of coverage at every nucleotide position, so the average and the standard deviation; and then we ran them through the Titan pipeline, which gives tons of QC metrics, some of them being the number of Ns, the Pangolin lineage, the VADR alerts, whether there were amino acid insertions, deletions, or substitutions, and then, yeah, sequencing depth across the entire assembly.
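The "thousand ways to say ARTIC V3" problem is essentially string normalization. A minimal sketch of how free-text primer descriptions might be collapsed into one canonical label, assuming a simple regex is good enough (real submissions are messier, and this function is an invention for illustration, not part of the authors' pipeline):

```python
import re

def normalize_primer_version(raw):
    """Collapse free-text primer labels like 'version 3', 'V3', or
    'artic v.3' into a canonical 'ARTIC V<n>' form, or None if no
    version number can be recognized. Purely illustrative."""
    if raw is None:
        return None
    m = re.search(r"v(?:ersion)?\s*\.?\s*(\d+(?:\.\d+)?)", raw, re.IGNORECASE)
    if not m:
        return None
    return f"ARTIC V{m.group(1)}"

for raw in ["version 3", "V3", "artic v.3", "ARTIC version 4.1", "no idea"]:
    print(raw, "->", normalize_primer_version(raw))
```

A mapping like this still needs human review for the genuinely ambiguous cases Jill mentions, such as a submitter naming one version while having used another; no regex can catch that.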
So we got a bunch of QC metrics, and if those looked okay, then what we did for the VOI/VOC samples was we had an internal CDC reference, and we used Snippy to compare the SNPs, and then we were basically looking for sequences that had the fewest SNP differences to the internal CDC reference, and we were trying to make sure that the SNPs and changes that we were seeing were not in the spike protein, so we looked specifically at what was going on in the spike protein, and that it had the mutations that, you know, characterized it as that specific lineage. So Lingzi can probably add more detail.

Yeah, so the idea is, like, we're using the internal CDC reference for each lineage as a starting point, and then we go through GISAID, all the available assemblies for that lineage, and try to look for the ones which have, like, the fewest SNPs and the fewest ambiguous Ns. Once we have a subset, we have a link table between GISAID and the SRA, and then we go back to the SRA and run all the SRA reads through Titan, and also through SAMtools and FastQC, to check the quality: first, to make sure they meet our QC cutoff, and second, that they have the exact same spike mutations described on the CDC website for the VOIs and VOCs.

Okay, so how did you decide on which lineages to include and which ones not to include? Because there are a lot of lineages out there, and a lot of them just go nowhere. You know, they come and go, and that's it. They do nothing, particularly for the non-VOC stuff. And then for the VOCs, were you using your own definition of a VOC, or were you looking at the WHO list, or what? Because I know a lot of different people call things differently. A lot of people have said things are variants of concern, but not everyone agrees that they are variants of concern.
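The selection rule Lingzi describes, preferring the fewest SNP differences to the internal reference and breaking ties on the fewest ambiguous Ns, can be sketched as a simple sort. The candidate records, SNP counts, and sequences below are invented for illustration; in the real workflow the SNP counts came from Snippy against the internal CDC reference:

```python
# Hypothetical sketch: among candidate assemblies for one lineage, pick the
# one with the fewest SNPs vs. the reference, then the fewest ambiguous Ns.
# All values here are made up for illustration.

def count_ns(sequence):
    """Count ambiguous N bases in an assembly sequence."""
    return sequence.upper().count("N")

candidates = [
    {"id": "candidate-1", "snps_vs_ref": 4, "seq": "ACGTNNACGT"},
    {"id": "candidate-2", "snps_vs_ref": 2, "seq": "ACGTNNNNNN"},
    {"id": "candidate-3", "snps_vs_ref": 2, "seq": "ACGTACGTNA"},
]

# Tuple key: fewest SNPs first, fewest Ns as the tie-breaker.
best = min(candidates, key=lambda c: (c["snps_vs_ref"], count_ns(c["seq"])))
print(best["id"])  # candidate-3
```

The chosen assembly would then still need the lineage-defining spike mutations confirmed, as described above; fewest-differences alone is not sufficient.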
So when we developed this project, we didn't yet have the WHO nomenclature, so we used the CDC-defined VOIs and VOCs at that moment.

That's all we have time for on the MicroBinfie podcast. Thank you again to Lingzi and Jill for coming and talking to us about SARS-CoV-2 benchmarking datasets and the data mining behind them. And I hope you can join us again someday. Thank you.