Hello, and thank you for listening to the MicroBinfeed podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Both Andrew and Nabil work in the Quadram Institute in Norwich, UK, where they work on microbes in food and the impact on human health. I work at Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the U.S. Today in the booth, we're joined by a special guest, Leo Martins, head of phylogenomics at the Quadram Institute, and their arborist in residence. And then there's myself, Andrew, and Nabil, and everyone is now working on SARS-CoV-2. So that's what we're talking about today. We'll be focusing on some of the latest mistakes and issues you need to know about for SARS-CoV-2 analysis. So Andrew is head of informatics. Do you want to talk about some latest gotchas and issues? So I suppose I'm going to bounce this off to Leo first, right? You know, someone asked me today, I've got one SNP missing and it changes the lineage assignment completely and utterly. So what is going on and how do different methods work and, you know, how can I massage my data to make it look like it should? Well, I think one of the causes for this might be that, for instance, pangolin is based on a machine learning approach. It's a decision tree. And so what they have is a big table. Their labels, or the classes, are the lineages. And then they have the parameters that define what is a lineage. And these parameters are SNPs in particular positions.
If you're missing one of these or, you know, a few of these indicative features, which are the SNPs, then this is going to change the way that the decision tree handles the classes. So maybe your SNP is the one that defines or helps define one of the lineages. And so by missing that, or by missing a group of SNPs, it might go in a completely, you know, different direction in the decision tree. There are other ways of classifying your sequence, but they might also be sensitive to missing data. For instance, if you have your sequence and then you search for the closest sequence in a database. So you have a database with sequences that you know the lineage for. But now you have your query sequence and then you ask what's the closest one in that database. But then, of course, closeness doesn't take into account the phylogenetic position. And so maybe it's close because, let's say, some irrelevant SNPs are the same. But the SNP that would make the difference, the one that, you know, helps define the lineage, is not there. It might be missing data. And so you might also be in trouble there. Maybe the way to solve that is to actually look at your sequence in the tree. So you do an alignment and then you do a, you know, a phylogenetic inference. And then you look at where your query sequence is positioned in the tree, because good phylogenetic methods can handle missing data better. And in this case, you would see that even with the missing data, you know, your sequence is somehow closer to that cluster that is labeled as lineage something. I think this is what llama does. And you can also, you know, if you want to do it by hand, do it with IQ-TREE. It's like a phylogenetic placement. So you give the tree minus your sequence and then you give the alignment with all the sequences and then you look where it is in the tree. But I think that's what llama does, if I'm not mistaken.
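As a rough show-notes sketch of the closest-sequence approach described here, and of why one missing SNP can matter, the Python below compares a query against a tiny database while skipping sites with N. Everything in it is made up for illustration: the lineage names X.1 and X.2, the eight-site sequences, and the function names are assumptions, not how pangolin or llama actually work.

```python
def snp_distance(a, b):
    """Count differing sites, skipping any site where either sequence has an N."""
    return sum(1 for x, y in zip(a, b) if x != y and "N" not in (x, y))


def closest_lineages(query, database):
    """Return the sorted lineage labels of all database entries tied for smallest distance."""
    distances = {name: snp_distance(query, seq) for name, (_, seq) in database.items()}
    best = min(distances.values())
    return sorted({database[name][0] for name, d in distances.items() if d == best})


# Toy database: the two hypothetical lineages differ only at the seventh site (G vs T).
toy_database = {
    "ref_a": ("X.1", "ACGTACGT"),
    "ref_b": ("X.2", "ACGTACTT"),
}

complete_query = "ACGTACTT"  # lineage-defining site was sequenced
masked_query = "ACGTACNT"    # same sample, but that site dropped out

print(closest_lineages(complete_query, toy_database))
print(closest_lineages(masked_query, toy_database))
```

With the defining site present the query matches only X.2; once that site is masked, both entries are equally close and a distance-based call becomes a coin flip, which is the situation where looking at the tree helps.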
I guess sometimes, I have to admit, I do this by eye and by hand and just look at the SNPs. You know, if you've got a cluster and there's something a bit iffy, I will go in and eyeball the SNPs. And that is quite useful because often you can see, okay, there's one SNP or two or, you know, one region missing here and it's throwing things off, but the rest look fine. And it does reinforce that you shouldn't just blindly trust algorithms. You should sometimes, you know, go in and actually have a look, particularly if it's an important sample or whatnot, or you have other epidemiological data which tells you otherwise. And yeah, so look at your data. Yeah, yeah, exactly. And then look at the tree. So, I mean, I'm biased, but I say look at the tree, because you'll see if there's a long branch there. Or even if you just look at the pairwise distance, it might be classified as, let's say, B.1, but then you have a distance of, I don't know, five SNPs to the closest sequence in the database. And then there might be something going on there, right? You didn't sample enough. I think that's how they discovered B.1.1.7: they realized it was a cluster, and this cluster had quite a few SNPs compared to all the other ones in the same lineage. And some other epidemiological evidence, of course. Yeah, I mean, this is a general problem that you always have with clustering and classification, ruling things in, ruling things out. It's not really a solved problem. And I don't know, I always ran into it with genotyping and MLST data. It's a massive pain trying to just flatten your data and make it into these little buckets for people to access more easily. So maybe moving on to something associated, which is naming lineages. So Lee, how on earth do you name lineages? You know, are you going with the UK schemes or with the US Nextstrain schemes or what?
I think a lot of times I see the states, I'm just, I'm talking about this kind of from an outsider coming in still. So I'm still learning, but from the outside, it looks like the states are mostly using pangolin or pango lineages. And I guess we're trying to separate pango from the animal so that we can be more proper or something. And then, and I've seen people use variants of concern also. I'm still getting into the weeds a bit. I'm figuring out for myself, what are you guys doing over there? We're heavily invested in the pangolin lineages, you know, since they have come from our consortium and they're awesome. But it does get kind of a bit difficult when you get into the constellations of mutations of concern. And then it just becomes horrific because say in the UK, you have B.1.1.7, which is the UK variant, the Kent variant, variant of concern, blah, blah, blah. So lots of different names. But basically, eek has independently emerged multiple times throughout the tree, which is a problem because it's not like a distinct lineage, which is emerged. It is, you know, kind of multiple parallel emergences. Currently the thinking is we'll just call it B.1.1.7 plus eek, you know, is the way to describe it, you know, because you're saying it's got this mutation rather than a single point mutation occurring and then it's spreading everywhere. And you know, that has some kind of evolutionary history. So it's getting more complicated by the day. So that's confusing to me. Like what, how do you call it eek? Like, where does that name come from? I've heard it a few different times, but I don't, I don't think I've heard in depth, like why it's called eek. It's a replacement in the spike protein in the position 484 from an e to a k. And then this replacement you write down as e484k. 
But instead of saying E484K, you say eek, you know, just look at the first one and the last one, and you hope that there's not another one that also starts with an E and ends with a K, because then you're out of nicknames. If it's in a P lineage, then you can call it peek. So the thing is, there's a difference between B.1.1.7 plus eek versus regular eek, and then B.1.1.7 on its own is its own thing as well, right? So what I'm worried about is this particular combination being flagged, and then people just looking for the spike protein mutation and then freaking out about it. It's like, no, no, no, no, it's not the right background. It doesn't work like that. You can't just use the spike to name everything. Can I ask one more question, because you're an arborist and we have you here. So you could have eek multiple times in different lineages. Are we seeing some kind of convergent evolution? Yes, I think they don't claim yet that it's convergent evolution because they don't know what's the cause for this. But yeah, it's been observed several times. I think the most recent one was, I don't know, this week, last week, they observed eek in B.1.1.7. And if you look, there are three main lineages or variants of interest. Two of them, so it's the P2 and the other one is B.1.351, which was the one first seen in South Africa, these two had the eek, but the B.1.1.7 didn't. But now this week, they found out that there was a B.1.1.7 that has it. So it appeared again. And the other two of interest, so the B.1.351 and the P1 or P2, they also had it, you know, and they are quite different. So yeah, this is a case where it's appearing again. So far, they can only claim it's a homoplasy, you know, it's appearing again and again, but they still don't know if it's because of convergent evolution or, you know, it's just random, or drift.
I guess it takes some time to empirically untangle what these mutations are actually conferring for the virus. And there was a bit of hoopla this week in the UK where they identified lots of B.1.351, which originated in South Africa, in UK community transmission. And so they've gone for surge testing, kind of going door to door testing people and trying to contain and stamp it out in eight different areas in the UK. It's kind of difficult. Certainly at the moment we can only really detect that variant if you genome sequence it, you know, because of everything else we said. And so I'm guessing there's a lot of genome sequencing coming down the line just to see how much of it is in these eight different areas in the UK where we're doing surge testing and, you know, whatnot. But if it is just to contain eek, well then, you know, you're trying to hold back the ocean, and that's going to be a problem, I think, if it's independently emerging in other lineages as we've seen. I guess just to come back to the original question, how to name these things. So my feeling is: don't. If you feel tempted to give a new name, the first step would be to go to the COV-lineages site that we discussed last time. There's a link there where you can suggest a lineage. And then the first thing that you'll see is that there are some rules. So they say, what factors suggest your sequences form a new lineage? And then they say, well, you know, they have to cluster in the global tree with good support values, bootstrap values. You also need epidemiological support, an introduction in another geographic region. And if your new samples satisfy all these conditions, then you might think about suggesting a lineage, and then you click and you suggest it to them. And in the meanwhile, I think this is what happened to the P2 lineage. So the P2 lineage wasn't born P2.
So I think the first time when they published, I think it was on Virological, they didn't call it P2. They called it B.1.1.28 and then eek in parentheses, you know, E484K. And so if you look at the first publications, that's how they referred to what we now know as P2. So I like this. I think that they, you know, they did a very reasonable thing. So I know the history of that, and P2 came about to stop journalists from being confused, because someone had made a mistake or had misspoken and the media picked it up, and it was like, oh, you know, there's lots and lots of what we now know as P2 in the UK. You know, there's a few cases, and that freaked everyone out and, you know, the press went wild, and they had to kind of convey this message simply: okay, calm down here. You know, it's actually something different. We're calling it this, don't worry about it. You know, the one you really have to worry about is P1, which I don't think we've seen in the UK yet, have we? Check on the COV-lineages site. No. So we haven't seen that, but we have seen P2. Sure, we've seen P2 even in Norfolk, which is, you know, not exactly on the routes directly from Brazil, is it? Well, you're our resident Brazilian, Leonardo. No, no, I don't think so. So I think the P2 is the one that spread quickly in Rio, and it was of interest because it was a case of reinfection. And then they showed that, I think, the person was infected twice by B.1.1.28, but, you know, although they were the same lineage, they were different enough from each other that you could say it was a different one. But no, I don't think there are direct routes. No. Actually, while you're talking about co-infections, I read a paper the other day and they had found lots of cases of people being infected with two different lineages, and I was like, this sounds interesting. And of course the press were interested in it as well.
And it had been widely reported, but I looked at the paper and, like, there's no mention of controls. There's quite a lot of C to U, C to T SNPs, like 53% of all the SNPs are these SNPs. And as we know, these are kind of signs of degraded RNA. And that sets off alarm bells to me, saying that, well, maybe some of these mutations you're seeing are actually just degraded RNA and not a real signal. And in their genome sequencing they found quite a lot of co-infections, they called them, but I would call it contamination. And I strongly suspect that this group were just sequencing maybe not the cleanest, or maybe they weren't following the protocols exactly. What rang alarm bells as well for me was that they have their own bioinformatics pipelines that they've written from scratch, which is always a red flag because, you know, this stuff is hard. There's a lot of nuances, and maybe they just made a slight little error somewhere as well as messing up the sequences. I've seen some very poor quality sequences, not ours, but others. And the primary reason for that, as explained to me, was they just didn't get it into the freezer fast enough. You know, this is dealing with RNA, it's a fragile, fragile thing. And yeah, you've got to always be vigilant. I'm just wondering, is that a standard QC thing, looking for bias in base substitutions? I always do it as a standard QC myself, just to gauge, is this an extreme outlier or not? I think Torsten was the one who originally told me to have a look at it. And he had just a very simple way of doing it. Yeah. I mean, it'd be one of those things that can be nice just to bake into any pipeline. Now, of course, C to U mutations are legitimate and they do happen, but just not at the vast high rates that you sometimes get. I mean, we used to use that for ancient DNA. That's how you knew it was ancient DNA, because it was degraded in a very specific pattern.
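For the show notes, here is a minimal Python sketch of that kind of substitution-bias check: tally the SNP types between a reference and an aligned consensus and report what share are C to T (which is how C to U in the RNA appears in sequence data). The sequences and function names are invented for illustration, this is not the simple method Torsten used, and no cutoff is implied; you would compare the fraction across your own samples to spot outliers.

```python
from collections import Counter


def substitution_spectrum(ref, sample):
    """Tally substitution types (e.g. 'C>T') between two aligned sequences.

    Sites where either base is not A, C, G or T (N, gaps) are ignored.
    """
    counts = Counter()
    for r, s in zip(ref.upper(), sample.upper()):
        if r != s and r in "ACGT" and s in "ACGT":
            counts[f"{r}>{s}"] += 1
    return counts


def c_to_t_fraction(ref, sample):
    """Fraction of all SNPs that are C>T; 0.0 when there are no SNPs at all."""
    counts = substitution_spectrum(ref, sample)
    total = sum(counts.values())
    return counts["C>T"] / total if total else 0.0


# Toy aligned pair: three C>T changes and one A>G change.
toy_ref = "CCCCAAGG"
toy_sample = "TTTCAGGG"

print(dict(substitution_spectrum(toy_ref, toy_sample)))
print(c_to_t_fraction(toy_ref, toy_sample))
```

In this toy pair three of the four SNPs are C>T, so the fraction is 0.75. On real data the useful signal is a sample whose fraction sits far above the rest of the run.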
And then fragments would be shorter as well, wouldn't they? Yeah. Fragments were always shorter. And the edges of the fragment always had this weird, you know, bias. Okay. So another question that came up during the week was, if you're doing ARTIC Nanopore basecalling, do you need a GPU? And the answer is yeah. So a lot of people around the world are getting into Nanopore, but maybe they are from resource-constrained environments and they don't necessarily have stuff around. So people were asking, could they just use the fast basecalling mode? And the answer is maybe avoid it, you know, because in SARS-CoV-2 every SNP counts, and the errors are mostly random, but not always random. And so, yes, you need to use a GPU to get the highest possible quality data out there. Luckily, most gaming laptops will probably have enough power to do real-time basecalling for you with HAC mode. So that's high accuracy. And if you can just beg, borrow, or steal a GPU card or a gaming laptop, there's probably many around. You'll save yourself a lot of time, because if you do CPU basecalling, it takes like a million years and, you know, you don't want to do that. You want to get the stuff in and out. If you want to do it in the cloud, well, that can be difficult if you don't have, say, reliable electricity or internet. And yeah, you can't necessarily, you know, upload vast quantities of data to a GPU in the cloud. It also gets very expensive, very quick. So get a GPU card, make sure it's an NVIDIA, even an old one will do, like a 1080 or whatever. You don't have to go for the mega Bitcoin miner type GPUs, you know, a slightly cheaper one will work, but it has to be NVIDIA. Yeah. Just kick the kids off Minecraft. It actually pains me. You know, I see my kids watch these YouTube videos of Minecraft and, you know, sometimes they'll pop up the specs of the machine they're using.
And it's like, my God, you know, that guy's, probably spent two grand on a graphics card just to play Minecraft. That's insane. But then I think back to when I was younger and I was thinking, yeah, okay, yeah, I probably would have wasted money if I'd had it, you know, on a gaming machine. I don't know. I used to, used to have people program games onto their graphics calculators way back when. Yeah. They play some snake on your mobile. What happened to that? But moving on. So you've been running into some issues with logistics as well, Andrew. Yeah. So we've got a collaboration with Zimbabwe and we've been running into logistical problems because so many borders are now closed, you know, trying to send, say, nanopore reagents from the UK to Zimbabwe is quite difficult because dry ice, you know, it lasts a few days, but it won't last two weeks sitting in a warehouse in Stansted airport, which is what happened. And so we've had thousands of pounds of reagents destroyed because we haven't been able to actually get stuff, you know, urgent stuff, which we paid extra to ship actually out to the countries we need them to go to. We can get dry stuff out, you know, stuff that can go room temperature. So we've sent out like laptops and nanopore device and whatever. And interestingly, logistics is just insane. Like for, for shipping goods, you know, stuff goes all over the world. It seems, you know, even when you send it out the door in the same, in the same van, but. Logistics is hard and we don't necessarily know how to solve it, you know. If we want to spread sequencing around the world and do it in country, we need to be able to ship this stuff. Border closures and flight bans and whatnot really don't help in getting this stuff around the world. So we don't have a solution to that, but it is a challenge and it's eye-opening seeing all of the challenges, even for the very simplest of things that goes on. 
So another question we had was, can you look at recombination with the ARTIC protocol? And no. And the reason is that, with the ARTIC protocol, you know, you get chimeras in it. So you've got two different random bits joining together, you know, it's PCR. And what ultimately happens is that if you are sequencing on nanopore, you have the barcodes on both ends. And if you look for the barcodes on both ends and you have a very tight window of how long or how short a read should be, you get rid of most of these chimeras, you know, straight off the bat. When you do Illumina, unfortunately, you're looking at much shorter fragments. You're definitely not going to have the adapters and barcodes at both ends of the amplicons. And so, you know, you're dealing with teeny tiny windows into the amplicon. And if you see a chimera, if you see recombination in there or a signal of recombination in there, you can never be sure, is that just a chimera or is it real? And you can't use this kind of data. Well, you can't use it with nanopore, and you absolutely can't use it with Illumina, where you're going to see a lot of it and you're even more blind and in the dark. So the end result is, if you want to look at, say, recombination and that kind of structural variation in SARS-CoV-2, you have to use metagenomics or, you know, maybe hybrid capture and do de novo assemblies. You can't go back to consensus sequences and that kind of thing. So make sure you use the right technology to answer the question that you're asking. Yeah, no, I should mention with the Oxford Nanopore, you do need to run it with required barcodes at both ends, for MinKNOW and the different tools. That's definitely a requirement. Otherwise, yeah, you are going to pick up chimeras. They're not real chimeras, they're all artifacts, but you're going to get led down the garden path if you're not filtering against that.
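As a show-notes aside, the tight read-length window mentioned here can be sketched in a few lines of Python. This is only an illustration of the idea, not the ARTIC pipeline itself: the 400 and 700 base bounds, the read names, and the function name are placeholder assumptions, and you would pick bounds to match your own amplicon scheme. Requiring barcodes at both ends is a separate check that this sketch does not do.

```python
def filter_by_length(reads, min_len, max_len):
    """Keep only (name, sequence) reads whose length falls inside the window.

    Reads much longer than one amplicon are likely several amplicons stuck
    together; reads much shorter are fragments.
    """
    return [(name, seq) for name, seq in reads if min_len <= len(seq) <= max_len]


# Toy reads: one plausible single amplicon, one two-amplicon chimera, one fragment.
toy_reads = [
    ("single_amplicon", "A" * 450),
    ("likely_chimera", "A" * 900),
    ("short_fragment", "A" * 120),
]

# Placeholder bounds for illustration only.
kept = filter_by_length(toy_reads, min_len=400, max_len=700)
print([name for name, _ in kept])
```

Only the 450-base read survives here; the 900-base read is dropped as a probable chimera and the 120-base read as a fragment.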
And then if you stare at the data long enough, you know, you'll see all sorts of craziness going on there. You'll see maybe 10 of these things stuck together, or you'll see, you know, barcodes in the middle and all this kind of stuff. So it's not just, you know, barcodes at the end, it's, are there barcodes in the middle as well? And so that's why you set a maximum length on the reads, you know, to be just a bit bigger than the amplicon. Yep. One of the things I saw kicking around as a discussion point was a simple question around how to annotate VCF files and just see what the mutations are actually, you know, encoding. And different people suggested different things, but one of the nice solutions is to look at CoV-GLUE, which we'll put a link to in the show notes, which has a table, a catalog of all of the different replacements, insertions, and deletions that are there. So you can just put your sequence in and find out all the information about what those variants are going to do. It's kind of like an InterProScan, but for SARS-CoV-2. So yeah, have a look at CoV-GLUE. It's got all of that information there for you to play with. If you want to run it on your own machine, if you've got a FASTA file, you can put that into Nextclade, or if you've got a VCF file, some people use SnpEff, S-N-P-E-F-F, to figure that out. But between the three different resources, you should be able to annotate your mutations without too many problems. But yeah, it's not something obvious, how to do that. There's also a very nice Python script by Ben Jackson called type_variants. And I think it's incorporated into Pangolin and CIVET, and it basically gives you the coordinates. So if you have an aligned genome, you know, a genome aligned to the reference genome, then it can give you the amino acid replacements. And yeah, it's pretty nice too. And it's very small. So I think you can even incorporate it into your own software.
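To make the last idea concrete for the show notes, here is a small Python sketch of going from a genome aligned to the reference to amino acid replacements. It is a toy written for illustration, not Ben Jackson's script and not what CoV-GLUE, Nextclade, or SnpEff do internally: the 13-base reference, the coding region coordinates, and the function name are all made up, and it assumes the query is the same length as the reference with no insertions.

```python
# Standard genetic code, built from the usual TCAG ordering.
BASES = "TCAG"
AMINO = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"
CODON_TABLE = {
    a + b + c: AMINO[16 * i + 4 * j + k]
    for i, a in enumerate(BASES)
    for j, b in enumerate(BASES)
    for k, c in enumerate(BASES)
}


def amino_acid_changes(ref, query, cds_start, cds_end):
    """List replacements like 'E2K' for one coding region.

    cds_start and cds_end are 1-based, inclusive reference coordinates.
    Codons containing N or gaps in either sequence are skipped.
    """
    changes = []
    for offset in range(cds_start - 1, cds_end, 3):
        ref_codon = ref[offset:offset + 3].upper()
        query_codon = query[offset:offset + 3].upper()
        ref_aa = CODON_TABLE.get(ref_codon)
        query_aa = CODON_TABLE.get(query_codon)
        if ref_aa is None or query_aa is None or ref_aa == query_aa:
            continue
        codon_number = (offset - (cds_start - 1)) // 3 + 1
        changes.append(f"{ref_aa}{codon_number}{query_aa}")
    return changes


# Toy reference with a coding region at positions 3 to 11: ATG GAA TGA (M, E, stop).
toy_ref = "CCATGGAATGACC"
toy_query = "CCATGAAATGACC"  # GAA -> AAA in the second codon

print(amino_acid_changes(toy_ref, toy_query, cds_start=3, cds_end=11))
```

The single G to A change turns the second codon from glutamate to lysine, so the toy reports E2K, the same E-to-K pattern as the spike replacement discussed earlier in the episode, just at a made-up position.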
So that's easy to add to your Sage and Nextclade workflow or whatever you're doing. Okay. So in some, I suppose, more general news, I saw on Twitter that Emma Hodcroft has gone through all of her detailed maps from Nextstrain, looking at the variants of concern, and then just kind of picked out community spread. She's like, oh, it's there, it's there, it's there, you know, it's really wonderful. Like she's kind of got this Star Wars, use the force kind of mind, you know, she's able to spot community spread before, you know, other people are spotting it, which is quite interesting. And it's quite telling that there is just a lot of it around. This is not just something that countries have been able to contain, but it, you know, it's spread very rapidly. And these variants are everywhere. And it's probably not one introduction, or we know it's not one introduction in many cases. And often in cases, it's like lots and lots of introductions, and it's only by countries doing lots of sequencing, they're actually identifying that, you know, lots of it has got in and, you know, the horse is long gone, you know. Yeah, Nextstrain is great for that. And keep an eye out on Emma Hodcroft's Twitter. Yeah, if you want the latest and greatest information, she's definitely one person to follow. And in, I suppose, local news, Quadram has sequenced 10,000 genomes, in fact, actually more because this week, they put on another 1500. I mean, that's predominantly what we've been servicing in our local area. It is absolutely awful that we've gotten to that toll. But in terms of the scale of the lab, and the hard work that everyone has put in, it's just absolutely amazing. So most of it has been local Pillar 1, they call it, so that's stuff coming to the hospitals, and stuff of clinical concern. And then we have some national samples. So it's people go to drive up testing facilities, it's called Pillar 2 in the UK, and or have tests at home, they post them off. It's some of those. 
And then we have a thing called the REACT study as well. So that's where households are randomly chosen around the country, and they get them to send back swabs, and they see, in a very structured manner, how much coronavirus is out there, how many households have a positive member in them. It's random and happens differently every month. So you know, you do get a very, very good idea of what's there. And so we're sequencing the positives from those so that we can now get an understanding of what lineages are there, and answer questions like, you know, when was B.1.1.7, or whatever it's called, the Kent variant, when did that actually, you know, start being picked up in these national surveys? And so yeah, we'll have more answers on that in the future. Yeah, it's amazing looking back. Remember, I don't know, maybe this time last year, people were saying, this doesn't change very much. Why bother sequencing it at all? Well, actually, I said that as well at the time. I was like, well, you know, they're really into sequencing, but it's not changing much. Yeah, how wrong I was. Well, we were thinking actually, maybe it might help with stamping out outbreaks, you know, little outbreaks that might bubble up, you know, later on in the future. We didn't necessarily believe that it would be just this kind of car crash it is. But there you go. Yeah, it's one of those things you hate being wrong about. And well, we should end on a happier note. So we'll switch over to this article from Nature News. Scientists call for fully open sharing of coronavirus genome data. So that's a little op-ed piece in Nature News, pretty much asking that everyone do the right thing and get the data out as quickly as possible so we can get on top of this. I suppose there's a bit of politics there, though, in that, you know, you have the competing databases: INSDC, which is NCBI, EBI, and DDBJ.
And then you have GISAID, which is run from Germany. And so you have these competing, I suppose, fundamental ideas. You know, GISAID, I suppose, protects the data a little bit more and has more built-in protections, the idea being that it makes it more likely that hesitant people will share data. And then INSDC is very much more CC BY, so it's like, you know, let's just share data as quickly as possible and make it as open and easy to share, which is great. So my only concern is that INSDC is a bit slow at actually getting data out there. So it's not necessarily as useful for public health surveillance. Great for retrospective academic research and for going back later. But certainly it is a little bit of an issue. And that's why people are continuing to use GISAID. Yeah, I mean, regardless, I think the critical issue is the, I don't want to say politics, but it's just the nature of the ethics around the sharing of data, and making sure that people who produce it get what they need to get out of it and the people who need it get what they need to get out of it. And in a way where everyone can be happy and interact with each other. I mean, we've had that luxury in COG because it's all within the consortium, so everyone is able to speak sort of in a safe space and everyone kind of knows each other and it's all sort of nice. But in an open world, we need to have a kind of generic framework like that where we can put our data out there and know that it's going to be used correctly. And I mean, it's not just the thing of being scooped. It's a thing of, I can't imagine if we put out some genomes tomorrow and then someone else used it in an analysis which said something horrible. And then people come and point at us saying, well, you guys generated this data. So you're the ones responsible for this work.
Like, no, that's outside of our remit of us putting the data out. You know, these sorts of questions, it's a difficult problem of sharing data. So maybe to give an idea, I looked at 25 papers recently and 15 of them don't have any raw data available, like raw reads. That's just insane. Like that's a very low percentage. And if you don't have the raw reads, you can't really reproduce any results. You know, you have the genomes, consensus genomes or assemblies, but those aren't necessarily correct or, you know, methods slightly change over time. So you really need to go back to raw data, and to have such a tiny, tiny percentage being reproducible is just shocking. And GISAID has only the consensus sequences, right? So that's one of the issues that we could talk on. Now you'd think that everyone submits to GISAID, but no, unfortunately there is a sizeable percentage, about 20% of studies that I've seen, that don't actually even bother to submit to GISAID. And we found that as well with some countries who've approached us for help. You know, they're sequencing, but they're not actually making the data public. So they're taking all the data that everyone else in the world is producing to provide context for them and lineages for them, but they're not actually sending it back the other way saying, well, this is what we found as well. So, you know, it works both ways. I'll get off my high horse there. Yeah. I mean, not putting any data out. Is this not odd? Sorry. I like to take the side of people's privacy, and of protecting people who generate data, that's really important. But if you're not putting anything up and then expecting us to help you, well, sorry. Anyway, on that note, we should probably wrap up. Yeah. So that's all the time we have for today. We've been talking about some of the latest tips we've picked up about SARS-CoV-2 analysis.
And hopefully this catches you up as well. And yeah, so special thanks to Leo for joining us today. And we'll see you all next time. Thank you all so much for listening to us at home. If you liked this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group and edited by Nick Waters. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.