Hi, and welcome to the MicroBinfie podcast. I'm Nabeel and I'll be your host for today. We're back talking about COVID-19, and today we're going to discuss a new method, CoronaHiT, that allows massive multiplexing of SARS-CoV-2 genomes on Oxford Nanopore. Andrew Page is joining me today in his capacity as Head of Informatics at the Quadram Institute Bioscience. And we have two special guests: Justin O'Grady, Group Leader at the Quadram Institute Bioscience and Associate Professor of Medical Microbiology at the University of East Anglia, joins us again. And we have David Baker, who is Head of Sequencing at the Quadram Institute Bioscience. So let's get started. Dave, since it's your first time on the show, who are you and what do you normally do?

I run the core sequencing at the Quadram Institute and essentially process samples for the whole institute. I've been at the QIB since the middle of 2018.

Let's get started with CoronaHiT. So what is CoronaHiT? It's a preprint that came out a few days ago, and it describes this method of multiplexing on Oxford Nanopore. What exactly is the problem CoronaHiT is trying to solve?

There were a couple of main reasons to devise CoronaHiT: one, to increase throughput, and two, to decrease cost. So essentially being able to process more samples more quickly at a lower cost.

What would be the current cost with the off-the-shelf methods, and how many could you do at a time?

The method which was originally published several months ago processed 24 samples: 23 plus a negative control. And depending on where you're getting your reagents from, that was coming out at close to 40 pounds per sample. Whereas with our higher-throughput novel method, it went down, depending on how many you multiplex, to between about 13 and 18 pounds a sample. So less than half the cost per sample.

And how exactly does that magic happen? What are you doing differently in CoronaHiT?
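As an aside, the per-sample saving quoted here is easy to sanity-check. A quick back-of-envelope calculation using the approximate figures from the conversation (real reagent pricing varies by supplier):

```python
# Per-sample cost comparison using the approximate figures quoted above
# (GBP; illustrative, not a formal costing).

standard_per_sample = 40.0       # ~24-plex ligation-based ARTIC run
coronahit_range = (13.0, 18.0)   # CoronaHiT, depending on multiplex level

for cost in coronahit_range:
    saving = 1 - cost / standard_per_sample
    print(f"£{cost:.0f}/sample -> {saving:.0%} saving")

# Both ends of the range are "less than half the cost per sample".
assert all(c < standard_per_sample / 2 for c in coronahit_range)
```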
Well, essentially the first part of the process is generating the RNA, then the cDNA, and then doing the ARTIC PCR, which generates 98 or so overlapping 400 base pair fragments. Normally that goes into end repair and adaptor ligation, and obviously you need quite a few nanograms per sample to do that. Whereas what CoronaHiT does is use a novel low-input Nextera method to insert Nextera, or tagmentation, adaptors at the ends of the PCR fragments, and then do a PCR enrichment using barcoded primers of your choice. We've chosen the 24 base pair Nanopore barcodes. So essentially, instead of doing a ligation reaction with a native barcode, you're putting on a small adaptor and then PCRing on the barcodes. So it doesn't involve a ligation reaction.

Well, yeah, I think the original ARTIC protocol is very fast, actually; it's a rapid protocol, so you can get it done in a day. And we wanted to maintain the flexibility of being able to turn around results really quickly, alongside the higher-throughput kind of protocols that we've been using on Illumina sequencing machines. So we wanted a Nanopore version of what we do on those. So Dave came up with this idea where we could create this hybrid approach: we would add the adaptors by Nextera and then use the Nanopore 96 PCR barcodes for sequencing. And then we would be able to run that as quickly as you can run the ARTIC protocol, and keep that flexibility.

Just to be clear, we keep mentioning the ARTIC protocol. So ARTIC is what?

So the ARTIC protocol is a protocol that was devised by Josh Quick, Nick Loman and colleagues, which is a method for sequencing SARS-CoV-2 genomes on an Oxford Nanopore MinION or GridION system using a Nanopore flow cell. It takes about a working day to prepare the library and then do the sequencing overnight.
So you can have results within about 24 hours, and it is all based on a tiling PCR approach that was devised by Josh Quick for sequencing the entire ~30 kb SARS-CoV-2 RNA genome.

So the ARTIC protocol, as far as I know, is the de facto standard for everything we're doing in the UK. And I think it's been widely adopted across the rest of the world.

I think that's probably fair to say. There are other people who have devised other tiling schemes, but the ARTIC protocol seems to be globally the most heavily used, as far as I'm aware, at least.

And essentially CoronaHiT follows all of the initial steps of that, and just has that extra bit for the multiplexing.

Yeah. So we did 48 samples on a single flow cell, but we also tried 94 samples. Now you're probably thinking, why 94? Well, one of the barcodes we made didn't work, and one of them is for a negative control, which you should always run. We had a previous version where we actually had 16-base barcodes. That didn't work out as well, but we did try running it with 158 samples, and you get data, you get consensus sequences out; we just weren't able to demultiplex them very well on the bioinformatics side of things. Using slightly longer, or slightly higher quality, barcodes seemed to help quite a bit. It is important on the bioinformatics end of things to accurately demultiplex samples, because you're dealing with such sensitive material; you don't want data accidentally overflowing into other samples. So having a... what do they call it? Bleeding?

Barcode crosstalk.

Yeah. Barcode crosstalk. So that's what we did. And we've got a set of 94, well 95 if you include the negative control, that works really well. It goes through all the standard Nanopore demultiplexing software. So you just pop it into Guppy and it demultiplexes it. It's all built in.
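At its core, the double-ended demultiplexing that tools like Guppy perform boils down to finding the same barcode at both ends of a read. A minimal illustrative sketch (the barcode sequence, window size, and mismatch threshold below are made up; real demultiplexers use full alignment and quality scores, not plain Hamming distance):

```python
# Toy double-ended barcode demultiplexing. Requiring the SAME barcode at
# BOTH ends of a read is what suppresses barcode crosstalk between samples.

def hamming(a: str, b: str) -> int:
    """Mismatch count between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def best_match(window: str, barcodes: dict, max_mismatches: int):
    """Slide each barcode along the window; return the best barcode name."""
    best_name, best_dist = None, max_mismatches + 1
    for name, bc in barcodes.items():
        for i in range(len(window) - len(bc) + 1):
            d = hamming(window[i:i + len(bc)], bc)
            if d < best_dist:
                best_name, best_dist = name, d
    return best_name

def demultiplex(read: str, barcodes: dict, end_len: int = 60,
                max_mismatches: int = 3) -> str:
    """Assign a read only if the same barcode is found at both ends."""
    start = best_match(read[:end_len], barcodes, max_mismatches)
    end = best_match(read[-end_len:], barcodes, max_mismatches)
    return start if start is not None and start == end else "unclassified"

barcodes = {"BC01": "AAGAAAGTTGTCGGTGTCTTTGTG"}  # illustrative 24-mer
read = barcodes["BC01"] + "ACGT" * 100 + barcodes["BC01"]
print(demultiplex(read, barcodes))               # BC01
# Chop 10 bases off one end and the double-ended check rejects the read:
print(demultiplex(read[:-10], barcodes))         # unclassified
```

The second example illustrates the truncation problem discussed later in the episode: a read that loses part of one end barcode fails the both-ends test and goes unclassified.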
If you go for more custom barcodes, like we did initially, and as we have in testing as well, you need to use different methods like Porechop and add in the custom barcodes. So a slightly higher barrier, but not that bad.

Let's say that you are already running the ARTIC protocol. How much effort would it be to change over to what CoronaHiT describes?

Very easy, in that there are only a few reagents needed to switch over, one of them being the primers and one of them being a Nextera kit, and those two items are easily available. Otherwise, it's all the standard equipment you'd use for native barcoding, like TapeStations and Qubits and so on, that already exists. So yeah, relatively easy to switch over, I think.

So where did the idea for CoronaHiT actually come from, and which of you came up with it first?

Well, I first heard of the general idea on a big plane ride off to MRC Gambia, the Smiling Coast of Africa. Dave and I just happened to be sitting beside each other on the plane for, what, six or eight hours? And we just got chatting, and he said, oh, I've got this great idea for high-level multiplexing on PacBio. Not on Nanopore, but on PacBio. And we got talking, and we drew up a plan on the plane for how we could get this tested. We got back to the UK and started working through it. We got some money, we got some barcodes, got some samples. We were working with bacteria at that time. It worked okay. The only problem was that every experiment on PacBio is horrifically expensive. You can't do small experiments like you can on Nanopore, get it wrong, or reuse flow cells. We were doing experiments, and if one didn't work, that's a couple of grand down the drain. It just made it a little bit more difficult. But on the side, Dave, you went and did a little bit of Nanopore just to test things out, because you thought, oh yeah, this should work. And lo and behold, it worked even better.
Well, at the very start of all this, with the increased yield of the Sequel instrument, which is the PacBio instrument they released with the 1 million reads and then the 8 million chip, I thought immediately, with that throughput, this method might work really well on PacBio. And also, because the high-fidelity reads on PacBio are much higher quality than the raw reads on Nanopore, I didn't really think Nanopore would cut the mustard, as it were, in terms of the systematic errors. However, when I did the first experiment and I pooled my libraries and ran them on PacBio, I had part of the pool left over, so I just chucked it on a Nanopore flow cell. And even though it wasn't size selected... just to go back a little bit, one of the important things when you're doing large-insert sequencing on PacBio is that you have to get rid of all the small stuff. With this novel Nextera method, we optimized it so we would get a lot of fragments above 7 kb; however, there was still a lot of material below 7 kb. So the fraction that went onto PacBio was size selected, but the fraction that went onto Nanopore wasn't. So I was surprised even more that, without much size selection, it still performed probably as well as the PacBio. At that point it was a light bulb moment, and it was like, nah, let's forget about PacBio. And that's why the original designs for the barcodes were 16 base pairs: they were actually the PacBio-supported 16 base pair barcodes. So at the very start of this whole process, I did have a batch of 384 symmetrical combinations of barcodes. It transpires that not all of them work as efficiently as others. So there is still scope to go back to the 16 base pair barcodes, but obviously we've now moved to the 24 base pair ones, which slot into the bioinformatic pipelines for Nanopore. We also had a problem where we found that part of the end barcode would be chopped off.
So we have the same barcode on either end of the ARTIC amplicons, if that's what you call them. You need both barcodes fully there, with reasonable quality, to be able to say, yes, that's absolutely right. But sometimes you find that one will be truncated at the end, and so we were missing a lot of reads there. The 24-base barcodes really helped, because we had a bit more margin for error: you could lose a bit and the barcode would still be there; you could still recover it. They have a design with a pad at the end, so that if we do lose any of the end of the sequence, it's the pad that goes first, before it cuts into the barcode. The other way around that is buying more expensive, HPLC-purified primers, where you're unlikely to have missing bases at the five prime end of the primers.

We've learned quite a lot about barcode design over the last few weeks and months. When we looked at the design of the Oxford Nanopore 96 PCR barcodes, we were able to see that there was additional sequence at multiple points in the barcode construct, which was added, I assume, to get around the problem of losing some of the last bases of the barcode as you sequence, for various reasons. So when we redesigned, the barcode we look for is a 24-mer, but the complete construct is significantly longer. That was necessary to improve the overall performance of the method.

So the data we get off CoronaHiT isn't exactly identical to the data you get off the ARTIC protocol, because of the tagmentation. Do you want to talk about that?

So essentially, Nextera is designed more for whole genome sequencing, where it randomly inserts the transposomes. And if you use Nextera on amplicons, you do lose a small bit of sequence at the ends of the amplicons. That is a slight risk.
However, because these amplicons are tiled and overlap each other, with up to a hundred base pairs of overlap, losing a bit of sequence at the ends doesn't seem to affect the coverage. So that's why we were able to use Nextera. And in fact, I think the standard Nextera XT method is already used on COVID samples for Illumina sequencing out there in the community. So it's being used, but not necessarily for running on a Nanopore.

But that is an important point, because in the ARTIC protocol you get the full amplicon, so you get the synthetic sequence at either end. The pipeline maps that back and then masks it out, so you get exactly the genomic region in the middle. Whereas these are tagmented fragments: we still do that mapping, but we're not going to get the full length, and the primer sequence may not be there at all.

And actually, you probably don't need to mask, because those primer sequences at the ends of the ARTIC PCR products, which need to be masked out, probably don't exist; they've been removed.

No, we absolutely do need to mask, and we do.

Why is that?

Because sometimes part of the synthetic sequence makes it through. We have to be absolutely certain that we get rid of it, because otherwise you're going to get odd results coming through. And there are so few SNPs in the coronavirus genome, maybe an average of six or seven or eight at the moment, that if you have one false SNP in there, it can cause havoc.

Obviously the insertion of the transposase adapters is random, so sometimes it lands very close to the end of the PCR product, including some of the primer region, and sometimes it doesn't. Because it's random, sometimes you will get primer sequence, and therefore you do need to mask.

Okay, so here's a question. In the preprint you picked this number, 94. Why can't this number be higher? What would be the maximum number of genomes you could put on, and what is the limitation?
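(As an aside before the answer: the masking logic argued for above can be sketched as a toy version of what the ARTIC pipeline's primer-trimming step does. The coordinates below are illustrative, not real ARTIC primer positions.)

```python
# Toy primer masking: given reference coordinates of primer sites, mask
# any base of an aligned read that falls inside a primer region, so that
# synthetic primer sequence can never be called as a (false) SNP.

primer_regions = [(0, 24), (380, 404)]   # (start, end) on the reference

def mask_primers(read_seq: str, read_start: int, regions) -> str:
    """Replace primer-region bases with 'N' so consensus calling skips them."""
    masked = list(read_seq)
    for start, end in regions:
        for pos in range(start, end):
            i = pos - read_start          # reference pos -> read index
            if 0 <= i < len(masked):
                masked[i] = "N"
    return "".join(masked)

read = "ACGT" * 10                # 40 bp read aligned at reference pos 10
print(mask_primers(read, 10, primer_regions))
```

Because tagmentation lands randomly, a fragment may or may not carry primer sequence; masking by reference coordinate handles both cases the same way.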
Using the ARTIC protocol, I guess we get around 12 gigabases on average from a run if we run it for 24 to 36 hours; the flow cell seems to die pretty much after that. The short products mean that the flow cell doesn't last as long, or produce as many reads, as it would with longer 5 kb or 10 kb fragments.

That's a lot of data. And if you just needed 20x, 30x, 50x, or 100x on your 30 kb SARS-CoV-2 genome, you could theoretically put a huge number of viral genomes onto one flow cell. The problem is that the ARTIC protocol produces 98 amplicons, as we discussed earlier, but they're not evenly covered. Each of these primer pairs has a different efficiency; some primer pairs work better than others. And so you get what we call spiky coverage of the genome, meaning you might have a thousand-fold difference between the most covered part of the genome and the least covered parts. And that means you end up needing at least a thousand-x mean genome coverage for each of the SARS-CoV-2 genomes on a flow cell. Therefore, anywhere between 48 and 95 is about the maximum we can put on and still get the coverage we need to accurately call SNPs.

But it is quite impressive that Josh Quick managed to get 48 primer pairs to work together in one tube with decent efficiency. They have tried to optimize that further since the first iteration of the primers, but any change seems to improve one primer pair while another falls out; it turns out to be a bit like whack-a-mole, as it's been described before. It's very difficult to optimize much further. You could get around it by doing more reactions per sample: instead of doing two 48-primer-pair multiplexes, you could do four 24-primer-pair reactions. That might help, you might get better efficiency, but it adds cost and time.
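The capacity argument above is simple arithmetic. A sketch using the numbers from the conversation (the usable-yield fraction is my assumption, standing in for demultiplexing losses and QC failures, which the episode doesn't quantify):

```python
# Back-of-envelope multiplexing capacity from the numbers discussed:
# ~12 Gb per flow cell, a ~30 kb genome, and ~1000x mean coverage to
# rescue the weakest amplicons of a "spiky" tiling PCR.

run_yield_bp = 12e9       # approximate flow cell yield quoted above
genome_bp = 30_000        # SARS-CoV-2 genome length
required_cov = 1000       # mean depth needed given ~1000-fold spikiness
usable_fraction = 0.25    # ASSUMED: demux losses, unclassified reads, QC

max_samples = (run_yield_bp * usable_fraction) / (genome_bp * required_cov)
print(int(max_samples))   # 100 -- consistent with the 48-95 range used
```

With only 20-100x required (even coverage), the same arithmetic would allow thousands of genomes per flow cell, which is exactly why the spiky coverage is the binding constraint.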
And so I think they've hit a pretty good sweet spot there, really.

But how does it actually compare with real data, between standard Nanopore and, say, standard Illumina?

So the way I see it, when the COVID situation occurred and I read the ARTIC protocol, I immediately thought, why not put everything on an Illumina sequencer? One, because you can multiplex to a much higher level: with the yields from the Illumina NextSeq that we have, we can get up to 120 gigabases from a single high-output run. And the other reason is that Illumina is generally higher quality. So my initial thought was, let's do it on Illumina anyway. Doing 23 samples on Nanopore with the standard method, I just see that being more useful to labs, and in the field, where they don't have access to Illumina sequencers. The other benefit of Nanopore is that, as Justin mentioned earlier, you can multiplex any number of samples, from one or two up to 23 or 24 with the standard method, or up to 94 or 95 with the CoronaHiT method. And that just gives you more flexibility for quickly running samples through the pipeline. Whereas with Illumina, because of cost, if we had a small number of samples to run, we would have to fill up the rest of the flow cell with other samples, and that might take some time. So to summarize, I think CoronaHiT is perfect for rapid, day-to-day COVID sequencing. If you had a high number of samples consistently coming into a pipeline, then I would say the Illumina method would probably be the most effective way to do it.

I suppose Nanopore gives you rapid turnaround as well, because you can just run it with 10 samples, stop it, wash it, put it in the fridge or whatever, take it back out again, and then continue, maybe with a different set of samples. So it's super flexible and you get results so much quicker.
Also, we can analyze the reads in real time as they come off; we can get an idea of exactly what's in there and how well it's demultiplexing, and we can do the analysis pretty sharpish.

A lot of it is to do with the fact that different stages of the pandemic required different approaches. When we started, we wanted to be sequencing several hundred samples per week, and at the time we didn't have CoronaHiT. So our option was 24 on a Nanopore, or hundreds on a NextSeq. And it just didn't make sense for us to try to do ten runs of 24 to get up to 240 samples in a week; it was too laborious and we didn't have the people to do it. That was one of the reasons why CoronaHiT was a good idea: you could multiplex a lot more on a single Nanopore flow cell and get the numbers up. But also, now that we're moving away from the peak of the pandemic, where we were getting hundreds of samples a week, we're only getting tens of samples per week. And now you don't want to wait to batch those tens of samples per week up into hundreds so you can run them on an Illumina machine. Now you want to get your response and your information back regarding clusters and outbreaks that are happening locally; we want to get that data back to the people who can use it as quickly as possible. So we want to be able to run 10 one day, or 50 the next, or seven, whatever we need to do in a particular day or week. And therefore it's the flexibility, and the associated cost of that flexibility, which is great with CoronaHiT. Okay, the fewer samples you put on a flow cell, the more it will cost you per sample. But as Andrew says, you can wash that flow cell and reuse it with a larger batch the next time, with a different set of barcodes. And we have 95 barcodes to choose from. So that's great.

So Andrew, in terms of the output data that comes off CoronaHiT, how does it compare to other sequencing methods?
More or less, if you have the same coverage, you'll get more or less the same results from all the different methods. And we've checked that with Illumina, standard ARTIC Nanopore, and CoronaHiT. The difference in the data is that Illumina is paired-end and comes in smaller chunks; it's broken up, but it's a higher quality read. Then you have CoronaHiT, which is similar: a little bit smaller because of the tagmentation, but more or less you get most of the amplicon, about 300 to 400 bases on average, and the quality is lower because it's Nanopore. And then you have ARTIC, which is the full-length fragment plus the primer sequences on either end. They're all slightly different, but when it comes to the bioinformatics, they all work out more or less the same. Because you're sequencing at such a high depth, it doesn't matter that there are errors in the Nanopore reads; they all just kind of magically disappear. And of course, demultiplexing has gotten much, much better over time; Guppy is fantastic now. On the whole, you get more or less the same result at the end. You get the exact same consensus sequences. We've checked: we've run the same set of samples on Illumina, standard ARTIC Nanopore, and CoronaHiT. We've double-checked every single SNP, and they are 100% the same. So we're really happy about that. And then we throw them into a phylogenetic tree and they come out the same as well, obviously. The only variable is how much coverage you get, and that comes down to how many samples you actually put on a single flow cell. We've tried 48, we've tried 94. It's up to you exactly how good you want to get it. Obviously more coverage will give you those harder-to-reach regions. But I know other wet lab improvements have meant that coverage is a little bit more even.
There were a few well-known dropout regions, which were just really temperature sensitive. But now people understand that they can calibrate instruments, or use a slightly different method and be a little bit more careful in those regions, and it seems to have gotten better over time.

So in summary, use CoronaHiT.

I recall in the paper, we really went for it in terms of Ct, and we were sequencing some really, really poor samples, samples with very few copies of the virus. There's a direct correlation between the quality of the sequence you get out and the Ct value, which reflects the number of copies of the virus you have in the sample. So if you're really strict about it and you say, okay, I'm only going to do the really obvious samples, maybe a Ct below 30, then you can actually bump up the number of samples you can do in a single run. But if you have a wide variety and you want to try those harder-to-sequence samples, then you would need to knock it back to 48.

I think there's another way to put that cutoff: you can set a cutoff on Ct, or you can set it on the success of the ARTIC PCR. I've mentioned that before when we were talking. I think a two nanogram cutoff after the ARTIC PCR could be a good way to decide whether or not to include a sample for sequencing. The thing about doing it that way is that it costs a little bit more to get to that stage. If you put a Ct cutoff in place, you spend no money on samples that are likely to fail; you're not going to waste money on potentially poor samples. But if you don't want to miss anything that might work above Ct 30, then it might be useful to put in a quality control step based on how successful the ARTIC PCR is: if it's above two nanograms per microliter, you take it, as that has the highest chance of success in your sequencing run.

I heard on the grapevine that Nanopore themselves will be releasing native 96 barcodes soon.
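(The two sample-gating strategies just discussed can be expressed as simple predicates. The thresholds are the ones mentioned in the conversation; the function names are mine, for illustration only.)

```python
# The two QC gating strategies discussed above: a Ct cutoff applied
# before spending any reagents, or a post-ARTIC-PCR yield cutoff
# (~2 ng/ul) applied after amplification has already been paid for.

def include_by_ct(ct: float, max_ct: float = 30.0) -> bool:
    """Gate on the qPCR Ct: lower Ct means more virus, better sequencing."""
    return ct <= max_ct

def include_by_pcr_yield(ng_per_ul: float, min_yield: float = 2.0) -> bool:
    """Gate on ARTIC PCR product concentration instead."""
    return ng_per_ul >= min_yield

# A Ct-35 sample is rejected up front by the Ct gate, but could still
# pass the (costlier) post-PCR gate if the amplification happened to work.
print(include_by_ct(35.0), include_by_pcr_yield(3.1))  # False True
```

The trade-off is exactly as described: the Ct gate is cheap but discards salvageable samples; the yield gate recovers them at the cost of running the PCR on everything.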
So would CoronaHiT have to compete with that approach, or how would it be different?

Yeah, well, I think it does in some ways have to compete. It's about how easy it is to perform and how expensive it is. If, for the 96 barcodes from Oxford Nanopore, the amount of ligase required for the ligation stays the same as it is for the 24 native barcodes, then it will be an expensive approach. But if they reduce the amount of ligase required in each reaction, it could cut the cost to something similar to CoronaHiT. And then the other thing is simplicity. Our approach is pretty simple at the moment and only requires a few steps after the ARTIC PCR to get to sequencing. I'm not fully sure what the native barcoding protocol will look like, but at the moment it's a little bit more complex than ours. Maybe it will be similar; it depends what it looks like when it comes out.

Well, I guess it's watch this space. We'll have to see. We don't have all the details yet to make a firm statement, I suppose.

All right. So just to close up, Andrew, we've been talking today about CoronaHiT. What's it all about, and why should we be using it?

It's faster, cheaper, and more flexible, so we should all be using it, and it can drop straight into many existing pipelines. The bioinformatics is straightforward, the wet lab seems to be a little more straightforward than existing methods, and you're not constrained by having to run huge batches of samples. So I think overall it's something everyone should give a try if you're doing any coronavirus sequencing, and hopefully we'll be able to deploy it around the world, particularly to countries which are resource limited.

All right. Thanks, guys. Thanks for spending the time to come in and talk to the MicroBinfie podcast about all your work.

Thanks very much.

Great. Thanks.

That's all the time we have for today.
Again, I want to thank Justin O'Grady, Dave Baker, and Andrew Page for being on the show. We've been talking about CoronaHiT, a new method we've put out for massive multiplexing of SARS-CoV-2 genomes on the Oxford Nanopore platform. I think we touched on a lot of practical tips and tricks for sequencing this particular virus. And even if COVID isn't your thing, I hope you've gotten an idea of the thought process and considerations it takes to come up with a new protocol. This has been the MicroBinfie podcast. See you next time.