Hi, and welcome to the MicroBinfie podcast. I'm Nabeel and I'll be your host for today. We're back talking about COVID-19, and today we're going to discuss a new method, CoronaHiT, that allows massive multiplexing of SARS-CoV-2 genomes on Oxford Nanopore. Andrew Page is joining me today in his capacity as Head of Informatics at the Quadram Institute Bioscience. And we have two special guests: Justin O'Grady, Group Leader at the Quadram Institute Bioscience and Associate Professor of Medical Microbiology at the University of East Anglia, joins us again. And we have David Baker, who is Head of Sequencing at the Quadram Institute Bioscience. So let's get started. Dave, since it's your first time on the show, who are you and what do you normally do?

I run the core sequencing at the Quadram Institute and essentially process samples for the whole institute. I've been at the QIB since the middle of 2018.

Let's get started with CoronaHiT. So what is CoronaHiT? It's a preprint that came out a few days ago, and it describes this method of multiplexing on Oxford Nanopore. What exactly is the problem CoronaHiT is trying to solve?

There were a couple of main reasons to devise CoronaHiT: one, to increase throughput, and two, to decrease cost. So essentially being able to process more samples more quickly at a lower cost.

What would be the current cost with the off-the-shelf methods, and how many could you do at a time?

The method which was originally published several months ago processed 24 samples: 23 plus a negative control. And depending on where you're getting your reagents from, that was coming out at close to 40 pounds per sample. Whereas with our higher-throughput novel method, it went down, depending on how many you multiplex, to between about 13 and 18 pounds a sample. So less than half the cost per sample.

And how exactly does that magic happen? What are you doing differently in CoronaHiT?
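As an aside, the per-sample saving quoted here is easy to sanity-check. A quick back-of-envelope calculation using the approximate figures from the conversation (real reagent pricing varies by supplier):

```python
# Per-sample cost comparison using the approximate figures quoted above
# (GBP; illustrative, not a formal costing).

standard_per_sample = 40.0       # ~24-plex ligation-based ARTIC run
coronahit_range = (13.0, 18.0)   # CoronaHiT, depending on multiplex level

for cost in coronahit_range:
    saving = 1 - cost / standard_per_sample
    print(f"£{cost:.0f}/sample -> {saving:.0%} saving")

# Both ends of the range are "less than half the cost per sample".
assert all(c < standard_per_sample / 2 for c in coronahit_range)
```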
Well, essentially the first part of the process is generating the RNA, then the cDNA, and then doing the ARTIC PCR, which generates 98 or so overlapping 400 base pair fragments. Normally that goes into end repair and adaptor ligation, and obviously you need quite a few nanograms per sample to do that. Whereas what CoronaHiT does is use a novel low-input Nextera method to insert Nextera, or tagmentation, adaptors at the ends of the PCR fragments, and then do a PCR enrichment using barcoded primers of your choice. We've chosen the 24 base pair Nanopore barcodes. So essentially, instead of doing a ligation reaction with a native barcode, you're putting on a small adaptor and then PCRing on the barcodes. So it doesn't involve a ligation reaction.

Well, yeah, I think the original ARTIC protocol is very fast, actually; it's a rapid protocol, so you can get it done in a day. And we wanted to maintain the flexibility of being able to turn around results really quickly, alongside the higher-throughput kind of protocols that we've been using on Illumina sequencing machines. So we wanted a Nanopore version of what we do on those. So Dave came up with this idea where we could create this hybrid approach: we would add the adaptors by Nextera and then use the Nanopore 96 PCR barcodes for sequencing. And then we would be able to run that as quickly as you can run the ARTIC protocol, and keep that flexibility.

Just to be clear, we keep mentioning the ARTIC protocol. So ARTIC is what?

So the ARTIC protocol is a protocol that was devised by Josh Quick, Nick Loman and colleagues, which is a method for sequencing SARS-CoV-2 genomes on an Oxford Nanopore MinION or GridION system using a Nanopore flow cell. It takes about a working day to prepare the library and then do the sequencing overnight.
So you can have results within about 24 hours, and it is all based on a tiling PCR approach that was devised by Josh Quick for sequencing the entire ~30 kb SARS-CoV-2 RNA genome.

So the ARTIC protocol, as far as I know, is the de facto standard for everything we're doing in the UK. And I think it's been widely adopted across the rest of the world.

I think that's probably fair to say. There are other people who have devised other tiling schemes, but the ARTIC protocol seems to be globally the most heavily used, as far as I'm aware, at least.

And essentially CoronaHiT follows all of the initial steps of that, and just has that extra bit for the multiplexing.

Yeah. So we did 48 samples on a single flow cell, but we also tried 94 samples. Now you're probably thinking, why 94? Well, one of the barcodes we made didn't work, and one of them is for a negative control, which you should always run. We had a previous version where we actually had 16-base barcodes. That didn't work out as well, but we did try running it with 158 samples, and you get data, you get consensus sequences out; we just weren't able to demultiplex them very well on the bioinformatics side of things. Using slightly longer, or slightly higher quality, barcodes seemed to help quite a bit. It is important on the bioinformatics end of things to accurately demultiplex samples, because you're dealing with such sensitive material; you don't want data accidentally overflowing into other samples. So having a... what do they call it? Bleeding?

Barcode crosstalk.

Yeah. Barcode crosstalk. So that's what we did. And we've got a set of 94, well 95 if you include the negative control, that works really well. It goes through all the standard Nanopore demultiplexing software. So you just pop it into Guppy and it demultiplexes it. It's all built in.
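At its core, the double-ended demultiplexing that tools like Guppy perform boils down to finding the same barcode at both ends of a read. A minimal illustrative sketch (the barcode sequence, window size, and mismatch threshold below are made up; real demultiplexers use full alignment and quality scores, not plain Hamming distance):

```python
# Toy double-ended barcode demultiplexing. Requiring the SAME barcode at
# BOTH ends of a read is what suppresses barcode crosstalk between samples.

def hamming(a: str, b: str) -> int:
    """Mismatch count between equal-length strings."""
    return sum(x != y for x, y in zip(a, b))

def best_match(window: str, barcodes: dict, max_mismatches: int):
    """Slide each barcode along the window; return the best barcode name."""
    best_name, best_dist = None, max_mismatches + 1
    for name, bc in barcodes.items():
        for i in range(len(window) - len(bc) + 1):
            d = hamming(window[i:i + len(bc)], bc)
            if d < best_dist:
                best_name, best_dist = name, d
    return best_name

def demultiplex(read: str, barcodes: dict, end_len: int = 60,
                max_mismatches: int = 3) -> str:
    """Assign a read only if the same barcode is found at both ends."""
    start = best_match(read[:end_len], barcodes, max_mismatches)
    end = best_match(read[-end_len:], barcodes, max_mismatches)
    return start if start is not None and start == end else "unclassified"

barcodes = {"BC01": "AAGAAAGTTGTCGGTGTCTTTGTG"}  # illustrative 24-mer
read = barcodes["BC01"] + "ACGT" * 100 + barcodes["BC01"]
print(demultiplex(read, barcodes))               # BC01
# Chop 10 bases off one end and the double-ended check rejects the read:
print(demultiplex(read[:-10], barcodes))         # unclassified
```

The second example illustrates the truncation problem discussed later in the episode: a read that loses part of one end barcode fails the both-ends test and goes unclassified.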
If you go for more custom barcodes, like we did initially, and as we have in testing as well, you need to use different methods like Porechop and add in the custom barcodes. So a slightly higher barrier, but not that bad.

Let's say that you are already running the ARTIC protocol. How much effort would it be to change over to what CoronaHiT describes?

Very easy, in that there are only a few reagents needed to switch over, one of them being the primers and one of them being a Nextera kit, and those two items are easily available. Otherwise, it's all the standard equipment you'd use for native barcoding, like TapeStations and Qubits and so on, that already exists. So yeah, relatively easy to switch over, I think.

So where did the idea for CoronaHiT actually come from, and which of you came up with it first?

Well, I first heard of the general idea on a big plane ride off to MRC Gambia, the Smiling Coast of Africa. Dave and I just happened to be sitting beside each other on the plane for, what, six or eight hours? And we just got chatting, and he said, oh, I've got this great idea for high-level multiplexing on PacBio. Not on Nanopore, but on PacBio. And we got talking, and we drew up a plan on the plane for how we could get this tested. We got back to the UK and started working through it. We got some money, we got some barcodes, got some samples. We were working with bacteria at that time. It worked okay. The only problem was that every experiment on PacBio is horrifically expensive. You can't do small experiments like you can on Nanopore, get it wrong, or reuse flow cells. We were doing experiments, and if one didn't work, that's a couple of grand down the drain. It just made it a little bit more difficult. But on the side, Dave, you went and did a little bit of Nanopore just to test things out, because you thought, oh yeah, this should work. And lo and behold, it worked even better.
Well, at the very start of all this, with the increased yield of the Sequel instrument, which is the PacBio instrument they released with the 1 million reads and then the 8 million chip, I thought immediately, with that throughput, this method might work really well on PacBio. And also, because the high-fidelity reads on PacBio are much higher quality than the raw reads on Nanopore, I didn't really think Nanopore would cut the mustard, as it were, in terms of the systematic errors. However, when I did the first experiment and I pooled my libraries and ran them on PacBio, I had part of the pool left over, so I just chucked it on a Nanopore flow cell. And even though it wasn't size selected... just to go back a little bit, one of the important things when you're doing large-insert sequencing on PacBio is that you have to get rid of all the small stuff. With this novel Nextera method, we optimized it so we would get a lot of fragments above 7 kb; however, there was still a lot of material below 7 kb. So the fraction that went onto PacBio was size selected, but the fraction that went onto Nanopore wasn't. So I was surprised even more that, without much size selection, it still performed probably as well as the PacBio. At that point it was a light bulb moment, and it was like, nah, let's forget about PacBio. And that's why the original designs for the barcodes were 16 base pairs: they were actually the PacBio-supported 16 base pair barcodes. So at the very start of this whole process, I did have a batch of 384 symmetrical combinations of barcodes. It transpires that not all of them work as efficiently as others. So there is still scope to go back to the 16 base pair barcodes, but obviously we've now moved to the 24 base pair ones, which slot into the bioinformatic pipelines for Nanopore. We also had a problem where we found that part of the end barcode would be chopped off.
So we have the same barcode on either end of the ARTIC amplicons, if that's what you call them. You need both barcodes fully there, with reasonable quality, to be able to say, yes, that's absolutely right. But sometimes you find that one will be truncated at the end, and so we were missing a lot of reads there. The 24-base barcodes really helped, because we had a bit more margin for error: you could lose a bit and the barcode would still be there; you could still recover it. They have a design with a pad at the end, so that if we do lose any of the end of the sequence, it's the pad that goes first, before it cuts into the barcode. The other way around that is buying more expensive, HPLC-purified primers, where you're unlikely to have missing bases at the five prime end of the primers.

We've learned quite a lot about barcode design over the last few weeks and months. When we looked at the design of the Oxford Nanopore 96 PCR barcodes, we were able to see that there was additional sequence at multiple points in the barcode construct, which was added, I assume, to get around the problem of losing some of the last bases of the barcode as you sequence, for various reasons. So when we redesigned, the barcode we look for is a 24-mer, but the complete construct is significantly longer. That was necessary to improve the overall performance of the method.

So the data we get off CoronaHiT isn't exactly identical to the data you get off the ARTIC protocol, because of the tagmentation. Do you want to talk about that?

So essentially, Nextera is designed more for whole genome sequencing, where it randomly inserts the transposomes. And if you use Nextera on amplicons, you do lose a small bit of sequence at the ends of the amplicons. That is a slight risk.
However, because these amplicons are tiled and overlap each other, with up to a hundred base pairs of overlap, losing a bit of sequence at the ends doesn't seem to affect the coverage. So that's why we were able to use Nextera. And in fact, I think the standard Nextera XT method is already used on COVID samples for Illumina sequencing out there in the community. So it's being used, but not necessarily for running on a Nanopore.

But that is an important point, because in the ARTIC protocol you get the full amplicon, so you get the synthetic sequence at either end. The pipeline maps that back and then masks it out, so you get exactly the genomic region in the middle. Whereas these are tagmented fragments: we still do that mapping, but we're not going to get the full length, and the primer sequence may not be there at all.

And actually, you probably don't need to mask, because those primer sequences at the ends of the ARTIC PCR products, which need to be masked out, probably don't exist; they've been removed.

No, we absolutely do need to mask, and we do.

Why is that?

Because sometimes part of the synthetic sequence makes it through. We have to be absolutely certain that we get rid of it, because otherwise you're going to get odd results coming through. And there are so few SNPs in the coronavirus genome, maybe an average of six or seven or eight at the moment, that if you have one false SNP in there, it can cause havoc.

Obviously the insertion of the transposase adapters is random, so sometimes it lands very close to the end of the PCR product, including some of the primer region, and sometimes it doesn't. Because it's random, sometimes you will get primer sequence, and therefore you do need to mask.

Okay, so here's a question. In the preprint you picked this number, 94. Why can't this number be higher? What would be the maximum number of genomes you could put on, and what is the limitation?
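(As an aside before the answer: the masking logic argued for above can be sketched as a toy version of what the ARTIC pipeline's primer-trimming step does. The coordinates below are illustrative, not real ARTIC primer positions.)

```python
# Toy primer masking: given reference coordinates of primer sites, mask
# any base of an aligned read that falls inside a primer region, so that
# synthetic primer sequence can never be called as a (false) SNP.

primer_regions = [(0, 24), (380, 404)]   # (start, end) on the reference

def mask_primers(read_seq: str, read_start: int, regions) -> str:
    """Replace primer-region bases with 'N' so consensus calling skips them."""
    masked = list(read_seq)
    for start, end in regions:
        for pos in range(start, end):
            i = pos - read_start          # reference pos -> read index
            if 0 <= i < len(masked):
                masked[i] = "N"
    return "".join(masked)

read = "ACGT" * 10                # 40 bp read aligned at reference pos 10
print(mask_primers(read, 10, primer_regions))
```

Because tagmentation lands randomly, a fragment may or may not carry primer sequence; masking by reference coordinate handles both cases the same way.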
Using the ARTIC protocol, I guess we get around 12 gigabases on average from a run if we run it for 24 to 36 hours; the flow cell seems to die pretty much after that. The short products mean that the flow cell doesn't last as long, or produce as many reads, as it would with longer 5 kb or 10 kb fragments.

That's a lot of data. And if you just needed 20x, 30x, 50x, or 100x on your 30 kb SARS-CoV-2 genome, you could theoretically put a huge number of viral genomes onto one flow cell. The problem is that the ARTIC protocol produces 98 amplicons, as we discussed earlier, but they're not evenly covered. Each of these primer pairs has a different efficiency; some primer pairs work better than others. And so you get what we call spiky coverage of the genome, meaning you might have a thousand-fold difference between the most covered part of the genome and the least covered parts. And that means you end up needing at least a thousand-x mean genome coverage for each of the SARS-CoV-2 genomes on a flow cell. Therefore, anywhere between 48 and 95 is about the maximum we can put on and still get the coverage we need to accurately call SNPs.

But it is quite impressive that Josh Quick managed to get 48 primer pairs to work together in one tube with decent efficiency. They have tried to optimize that further since the first iteration of the primers, but any change seems to improve one primer pair while another falls out; it turns out to be a bit like whack-a-mole, as it's been described before. It's very difficult to optimize much further. You could get around it by doing more reactions per sample: instead of doing two 48-primer-pair multiplexes, you could do four 24-primer-pair reactions. That might help, you might get better efficiency, but it adds cost and time.
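The capacity argument above is simple arithmetic. A sketch using the numbers from the conversation (the usable-yield fraction is my assumption, standing in for demultiplexing losses and QC failures, which the episode doesn't quantify):

```python
# Back-of-envelope multiplexing capacity from the numbers discussed:
# ~12 Gb per flow cell, a ~30 kb genome, and ~1000x mean coverage to
# rescue the weakest amplicons of a "spiky" tiling PCR.

run_yield_bp = 12e9       # approximate flow cell yield quoted above
genome_bp = 30_000        # SARS-CoV-2 genome length
required_cov = 1000       # mean depth needed given ~1000-fold spikiness
usable_fraction = 0.25    # ASSUMED: demux losses, unclassified reads, QC

max_samples = (run_yield_bp * usable_fraction) / (genome_bp * required_cov)
print(int(max_samples))   # 100 -- consistent with the 48-95 range used
```

With only 20-100x required (even coverage), the same arithmetic would allow thousands of genomes per flow cell, which is exactly why the spiky coverage is the binding constraint.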
And so I think they've hit a pretty good sweet spot there, really.

But how does it actually compare with real data, between standard Nanopore and, say, standard Illumina?

So the way I see it, when the COVID situation occurred and I read the ARTIC protocol, I immediately thought, why not put everything on an Illumina sequencer? One, because you can multiplex to a much higher level: with the yields from the Illumina NextSeq that we have, we can get up to 120 gigabases from a single high-output run. And the other reason is that Illumina is generally higher quality. So my initial thought was, let's do it on Illumina anyway. Doing 23 samples on Nanopore with the standard method, I just see that being more useful to labs, and in the field, where they don't have access to Illumina sequencers. The other benefit of Nanopore is that, as Justin mentioned earlier, you can multiplex any number of samples, from one or two up to 23 or 24 with the standard method, or up to 94 or 95 with the CoronaHiT method. And that just gives you more flexibility for quickly running samples through the pipeline. Whereas with Illumina, because of cost, if we had a small number of samples to run, we would have to fill up the rest of the flow cell with other samples, and that might take some time. So to summarize, I think CoronaHiT is perfect for rapid, day-to-day COVID sequencing. If you had a high number of samples consistently coming into a pipeline, then I would say the Illumina method would probably be the most effective way to do it.

I suppose Nanopore gives you rapid turnaround as well, because you can just run it with 10 samples, stop it, wash it, put it in the fridge or whatever, take it back out again, and then continue, maybe with a different set of samples. So it's super flexible and you get results so much quicker.
Also, we can analyze the reads in real time as they come off; we can get an idea of exactly what's in there and how well it's demultiplexing, and we can do the analysis pretty sharpish.

A lot of it is to do with the fact that different stages of the pandemic required different approaches. When we started, we wanted to be sequencing several hundred samples per week, and at the time we didn't have CoronaHiT. So our option was 24 on a Nanopore, or hundreds on a NextSeq. And it just didn't make sense for us to try to do ten runs of 24 to get up to 240 samples in a week; it was too laborious and we didn't have the people to do it. That was one of the reasons why CoronaHiT was a good idea: you could multiplex a lot more on a single Nanopore flow cell and get the numbers up. But also, now that we're moving away from the peak of the pandemic, where we were getting hundreds of samples a week, we're only getting tens of samples per week. And now you don't want to wait to batch those tens of samples per week up into hundreds so you can run them on an Illumina machine. Now you want to get your response and your information back regarding clusters and outbreaks that are happening locally; we want to get that data back to the people who can use it as quickly as possible. So we want to be able to run 10 one day, or 50 the next, or seven, whatever we need to do in a particular day or week. And therefore it's the flexibility, and the associated cost of that flexibility, which is great with CoronaHiT. Okay, the fewer samples you put on a flow cell, the more it will cost you per sample. But as Andrew says, you can wash that flow cell and reuse it with a larger batch the next time, with a different set of barcodes. And we have 95 barcodes to choose from. So that's great.

So Andrew, in terms of the output data that comes off CoronaHiT, how does it compare to other sequencing methods?
More or less, if you have the same coverage, you'll get more or less the same results from all the different methods. And we've checked that with Illumina, standard ARTIC Nanopore, and CoronaHiT. The difference in the data is that Illumina is paired-end and comes in smaller chunks; it's broken up, but it's a higher quality read. Then you have CoronaHiT, which is similar: a little bit smaller because of the tagmentation, but more or less you get most of the amplicon, about 300 to 400 bases on average, and the quality is lower because it's Nanopore. And then you have ARTIC, which is the full-length fragment plus the primer sequences on either end. They're all slightly different, but when it comes to the bioinformatics, they all work out more or less the same. Because you're sequencing at such a high depth, it doesn't matter that there are errors in the Nanopore reads; they all just kind of magically disappear. And of course, demultiplexing has gotten much, much better over time; Guppy is fantastic now. On the whole, you get more or less the same result at the end. You get the exact same consensus sequences. We've checked: we've run the same set of samples on Illumina, standard ARTIC Nanopore, and CoronaHiT. We've double-checked every single SNP, and they are 100% the same. So we're really happy about that. And then we throw them into a phylogenetic tree and they come out the same as well, obviously. The only variable is how much coverage you get, and that comes down to how many samples you actually put on a single flow cell. We've tried 48, we've tried 94. It's up to you exactly how good you want to get it. Obviously more coverage will give you those harder-to-reach regions. But I know other wet lab improvements have meant that coverage is a little bit more even.
There were a few well-known dropout regions, which were just really temperature sensitive. But now people understand that they can calibrate instruments, or use a slightly different method and be a little bit more careful in those regions, and it seems to have gotten better over time.

So in summary, use CoronaHiT.

I recall in the paper, we really went for it in terms of Ct, and we were sequencing some really, really poor samples, samples with very few copies of the virus. There's a direct correlation between the quality of the sequence you get out and the Ct value, which reflects the number of copies of the virus you have in the sample. So if you're really strict about it and you say, okay, I'm only going to do the really obvious samples, maybe a Ct below 30, then you can actually bump up the number of samples you can do in a single run. But if you have a wide variety and you want to try those harder-to-sequence samples, then you would need to knock it back to 48.

I think there's another way to put that cutoff: you can set a cutoff on Ct, or you can set it on the success of the ARTIC PCR. I've mentioned that before when we were talking. I think a two nanogram cutoff after the ARTIC PCR could be a good way to decide whether or not to include a sample for sequencing. The thing about doing it that way is that it costs a little bit more to get to that stage. If you put a Ct cutoff in place, you spend no money on samples that are likely to fail; you're not going to waste money on potentially poor samples. But if you don't want to miss anything that might work above Ct 30, then it might be useful to put in a quality control step based on how successful the ARTIC PCR is: if it's above two nanograms per microliter, you take it, as that has the highest chance of success in your sequencing run.

I heard on the grapevine that Nanopore themselves will be releasing native 96 barcodes soon.
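(The two sample-gating strategies just discussed can be expressed as simple predicates. The thresholds are the ones mentioned in the conversation; the function names are mine, for illustration only.)

```python
# The two QC gating strategies discussed above: a Ct cutoff applied
# before spending any reagents, or a post-ARTIC-PCR yield cutoff
# (~2 ng/ul) applied after amplification has already been paid for.

def include_by_ct(ct: float, max_ct: float = 30.0) -> bool:
    """Gate on the qPCR Ct: lower Ct means more virus, better sequencing."""
    return ct <= max_ct

def include_by_pcr_yield(ng_per_ul: float, min_yield: float = 2.0) -> bool:
    """Gate on ARTIC PCR product concentration instead."""
    return ng_per_ul >= min_yield

# A Ct-35 sample is rejected up front by the Ct gate, but could still
# pass the (costlier) post-PCR gate if the amplification happened to work.
print(include_by_ct(35.0), include_by_pcr_yield(3.1))  # False True
```

The trade-off is exactly as described: the Ct gate is cheap but discards salvageable samples; the yield gate recovers them at the cost of running the PCR on everything.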
So would CoronaHiT have to compete with that approach, or how would it be different?

Yeah, well, I think it does in some ways have to compete. It's about how easy it is to perform and how expensive it is. If, for the 96 barcodes from Oxford Nanopore, the amount of ligase required for the ligation stays the same as it is for the 24 native barcodes, then it will be an expensive approach. But if they reduce the amount of ligase required in each reaction, it could cut the cost to something similar to CoronaHiT. And then the other thing is simplicity. Our approach is pretty simple at the moment and only requires a few steps after the ARTIC PCR to get to sequencing. I'm not fully sure what the native barcoding protocol will look like, but at the moment it's a little bit more complex than ours. Maybe it will be similar; it depends what it looks like when it comes out.

Well, I guess it's watch this space. We'll have to see. We don't have all the details yet to make a firm statement, I suppose.

All right. So just to close up, Andrew, we've been talking today about CoronaHiT. What's it all about, and why should we be using it?

It's faster, cheaper, and more flexible, so we should all be using it, and it can drop straight into many existing pipelines. The bioinformatics is straightforward, the wet lab seems to be a little more straightforward than existing methods, and you're not constrained by having to run huge batches of samples. So I think overall it's something everyone should give a try if you're doing any coronavirus sequencing, and hopefully we'll be able to deploy it around the world, particularly to countries which are resource limited.

All right. Thanks, guys. Thanks for spending the time to come in and talk to the MicroBinfie podcast about all your work.

Thanks very much.

Great. Thanks.

That's all the time we have for today.
Again, I want to thank Justin O'Grady, Dave Baker, and Andrew Page for being on the show. We've been talking about CoronaHiT, a new method we've put out for massive multiplexing of SARS-CoV-2 genomes on the Oxford Nanopore platform. I think we touched on a lot of practical tips and tricks for sequencing this particular virus. And even if COVID isn't your thing, I hope you've gotten an idea of the thought process and considerations it takes to come up with a new protocol. This has been the MicroBinfie podcast. See you next time.