This time on the MicroBinfie podcast, we come from the ARTIC Network and CLIMB-BIG-DATA joint workshop on COVID-19 data analysis, held on the 14th and 15th of January 2021. I'm going to give a talk on how you go about sequencing SARS-CoV-2, with quite a lot of reference to the COG-UK consortium. I was pondering this question last night and did a Twitter thread on it. In COG-UK at the moment we do a lot of nanopore sequencing, which is a rapid-turnaround protocol for up to 96 samples. In parallel with that there are also a number of high-throughput methods, usually using Illumina sequencing. We actually use two out of the three possible broad classes of sequencing approach: there's metagenomics, which is untargeted RNA-seq basically, and then two different classes of targeted methods, either amplicon sequencing or bait capture. Both of the targeted approaches are used in COG-UK. And it occurs to me that this model of a decentralised network draws in a lot of expertise from all over the UK, and that's actually what's led to the success of the project: having this broad base of diversity within the methods has allowed us to become what's been called in the press the best genomic surveillance system in the world. It's very gratifying to see that kind of thing. But I'm going a little bit off topic. Amplicon sequencing is what this talk is about. You might have heard the name: the ARTIC method, which is a sort of nickname for this amplicon sequencing approach. It's become popular because it's quite cheap and easy to scale, and it can recover genomes, or at least partial genomes, in the cycle threshold (Ct) 30 to 35 range, which makes it ideal for this kind of work, even in the presence of a high level of host material. The way it works in the nanopore context is that it uses multiplex PCR to generate amplicon pools, which we then barcode with native barcoding and sequence on nanopore. You will all have heard of PCR.
PCR is a classic 1980s technique. You start off with genomic DNA, then you have first-cycle products where you extend one primer, and then four second-cycle products where you extend the reverse primer, so after cycle two the amplification is exponential. The multiplex PCR method was born out of the Brazilian Zika road trip, where we were struggling with high Ct values for the clinical Zika virus isolates we received. The modal Ct value was about 36, which corresponds to a very low number of copies, somewhere in the 10 to 100 copies per millilitre range, so we were struggling with that. In the previous Ebola work we had done single-plex, one-step RT-PCR reactions, pooled them and sequenced them, but we needed something that used less RNA and was faster and more high throughput. So we ended up focusing on multiplex PCR, where you generate all the amplicons in one reaction. The trick there is making all of those individual reactions work in the same tube. That led to me reverse engineering a technique developed by Thermo Fisher called AmpliSeq; my idea was that there wasn't really anything too complicated going on, and that it was purely a primer design challenge. The latest iteration of our tool, which we call Primal Scheme, is a web-based tool enabling you to design the primers for this kind of multiplex PCR. All you have to do is upload a FASTA file of your references, and this can be multiple references from any virus. The tool is able to design the best primers to cover the diversity of the reference genomes that you upload; that's the most recent feature that Andy Smith and I, who developed this tool, have added. All you really need to set is the amplicon size and a name, and the tool will start.
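As a rough aside on the Ct figures mentioned above, the relationship between Ct and starting copy number can be sketched in a few lines. The numbers here are assumed for illustration, not taken from the talk's data.

```python
# A back-of-envelope illustration (numbers assumed): each ideal PCR cycle
# doubles the template, so a sample's starting copy number can be estimated
# from a standard of known concentration as
# standard_copies * 2 ** (Ct_standard - Ct_sample).

def estimate_copies(ct_sample, ct_standard, standard_copies, efficiency=2.0):
    """Estimate starting template copies, assuming perfect doubling per cycle."""
    return standard_copies * efficiency ** (ct_standard - ct_sample)

# A hypothetical 1e6-copy standard crossing threshold at Ct 20: a sample at
# Ct 36 is 2**16, about 65,000-fold lower, i.e. roughly 15 copies, which is
# the sort of regime the high-Ct Zika samples were in.
print(round(estimate_copies(36, 20, 1e6)))
```

This is why Ct 36 samples are so hard to sequence: a 16-cycle gap in Ct means tens of copies rather than millions.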
Primal Scheme applies a window over the beginning of the genome, aligns all the references with parasail to determine the CIGAR strings, and from those it chooses the most conserved primers. If you want it to design primers against your first reference only, you can turn on the pinned flag, and then it will only produce primers that occur in the first genome, which is the old behaviour for people who have used this for a while. There's an overlap between the amplicons, which means you can trim off the primers, which are obviously synthetic oligos, and only consider the actual sequence derived from the viral RNA. That's quite an important point, which we'll come back to later, because it's the main consideration when you're doing the analysis of this type of data. In terms of the PCR, you have to leave a gap between adjacent amplicons, because otherwise you would generate an overlap product: if you mixed those primers together, you would preferentially produce a short product between one left primer and the neighbouring right primer, which wouldn't cover any of the genome. So we have to separate them into two reactions. That said, the technique still allows you to produce around 96 amplicons in just two reactions, covering the 30,000-base-pair genome, so it's quite a high plexity in terms of the number of primers per reaction. There's natural variation in the efficiency of the primer pairs, which, even though you try to make them very similar, will differ, and the overlaps are the higher-coverage areas, essentially the coverage of region one and region two mixed together. In the analysis pipeline that gets normalised down, so that you have some even depth of coverage for all of the regions above the threshold.
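The two-pool tiling layout described above can be sketched as follows. This is a naive illustration with assumed sizes; the real amplicon coordinates come from Primal Scheme's own primer search.

```python
# A naive sketch of a tiling amplicon scheme (sizes assumed): adjacent
# amplicons overlap, so neighbours must alternate between pool 1 and pool 2;
# otherwise a left primer and the neighbouring right primer in the same tube
# would preferentially make a short overlap product.

def tile_genome(genome_len, amplicon_len=400, overlap=75):
    """Return (start, end, pool) tuples tiling a genome with two primer pools."""
    amplicons = []
    start, n = 0, 0
    while True:
        end = min(start + amplicon_len, genome_len)
        amplicons.append((start, end, 1 if n % 2 == 0 else 2))
        if end >= genome_len:
            break
        start, n = end - overlap, n + 1
    return amplicons

scheme = tile_genome(30_000)   # roughly a SARS-CoV-2-sized genome
print(len(scheme), scheme[:2])
```

Note how amplicon 2 starts inside amplicon 1's footprint: that overlap is what lets the analysis trim the synthetic primer sequence and still keep real viral bases at every position.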
You then end up with what we call dropouts, where you get no data at all for a particular region. That's because of amplification failure, which is an inevitable consequence of a scheme like this. It can be caused by poor primers, or by mutations in the primer binding site due to viral variation, although because of the overlap you should still get the adjacent amplicon covering that position; or by high-Ct or degraded input material, which leads to a gradual degradation of the reaction and more and more dropouts in the high-Ct samples. So those are the key features of what the data looks like for SARS-CoV-2. The initial protocol release was on, I think, the 23rd of January, so it's coming up to the first birthday of this protocol. The genome it was based on was actually the third version of the reference, which had only been released, I think, around the 18th, less than a week before. Subsequently we changed some primer pairs to try to improve the dropouts. To allow more people to have access to these primers, I organised a bulk-scale synthesis with IDT, which is manufactured, filled and packed by IDT and sent out by them, allowing more consistency in the primer pools people use. And it's actually really cheap: they only charge something like $30, I think partly because we paid for the synthesis. I don't know if we've effectively just bought it for everyone, but at least they send it out to everyone, which is much more convenient. So if you want the primers, you can order them directly from IDT and they'll send out the V3 pools, which is the only version available. The ARTIC Network uses a platform called protocols.io for sharing protocols.
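The dropout behaviour described above is simple to flag computationally. This is illustrative only; the depth threshold and amplicon names here are assumed, not from the ARTIC pipeline.

```python
# Illustrative only (threshold and names assumed): in the analysis, an
# amplicon with essentially no reads is a dropout; extra sequencing cannot
# recover it, because the PCR product was never made in the first place.

def find_dropouts(amplicon_depth, min_depth=20):
    """Return names of amplicons whose depth falls below min_depth."""
    return [name for name, depth in amplicon_depth.items() if depth < min_depth]

coverage = {"amp_64": 15_000, "amp_65": 3, "amp_66": 900, "amp_74": 0}
print(find_dropouts(coverage))  # ['amp_65', 'amp_74']
```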
protocols.io is an online platform which allows people to post open protocols and then comment on, fork and update them. It's not a full version tree like GitHub, which is what I don't like about it, but it is quite simple and basic, which I suppose is its strength, and it has proven to be quite popular. This is the third version, which we call the LoCost protocol; it's had over a hundred thousand views. It allows you to interact with other users and answer each other's questions, so it's a good forum for troubleshooting. It's also had something like 56 forks, so there are a number of sub-protocols derived from it. In addition, it's been adopted commercially by other companies. It's actually supported by ONT, who will support this protocol through their tech support, and they also have a protocol on their website which is very much in line with this one. QIAGEN have commercialised a product for it, Illumina have commercialised COVIDSeq, which is based on the multiplex PCR primers, and NEB have commercialised a kit called the ARTIC companion kit. So it's been very widely adopted across the community. The genome recovery from the PCR is limited by the input cDNA: the less you put in, the more dropouts occur as the reaction degrades. That's why we call them dropouts, because no amount of coverage will rescue those amplicons; they're just not there. It's a good way to think about how much sequencing you need, because you can see that the coverage curve rolls over after you've generated around a hundred thousand reads per sample, and you're not really going to achieve any more coverage than that, no matter how long you keep running the flow cell. That's the sort of decision you can make about how long to run your flow cell if you're using RAMPART.
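The stopping decision described above can be sketched as a simple saturation check. The thresholds and checkpoint values here are assumed for illustration; this is not RAMPART's logic.

```python
# A sketch of the stopping decision (thresholds assumed, not from RAMPART):
# once extra reads stop adding genome coverage, keeping the flow cell running
# gains nothing, because dropouts cannot be rescued by more depth.

def has_saturated(checkpoints, window=2, min_gain_pct=1.0):
    """checkpoints: list of (cumulative_reads, pct_genome_covered) tuples.
    True when the coverage gained over the last `window` checkpoints is
    below min_gain_pct percentage points."""
    if len(checkpoints) <= window:
        return False
    gain = checkpoints[-1][1] - checkpoints[-1 - window][1]
    return gain < min_gain_pct

checkpoints = [(10_000, 70.0), (25_000, 88.0), (50_000, 94.5),
               (100_000, 95.0), (150_000, 95.1)]
print(has_saturated(checkpoints))  # True: the last 100k reads added <1 point
```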
RAMPART shows you, once you've set the run up, the same amplicon coverage profile I was talking about earlier, with the reads collected and demultiplexed by sample, and this is the rolling-over curve I mentioned. You can see which samples are likely to have reached saturation, and then decide whether to carry on sequencing or stop. So that's the end of my talk, and I'd like to thank all of the rest of the ARTIC Network and all of our collaborators. Thank you very much. There are a couple of questions. First: have there been any developments in the use of the direct RNA kit for viral nanopore sequencing, perhaps thinking about sequencing clinical samples for genomic epidemiology? There are a couple of limitations with the direct RNA kit. One is that it's poly(A)-primed, so it's designed for messenger RNA. It also has a very high input requirement: something like a microgram of total RNA, or even more than that, perhaps 10 micrograms of total RNA, to get at least 100 nanograms of poly(A) RNA. That's too high for practical use in a clinical isolate setting. Obviously the virus you're sequencing needs to be polyadenylated, and if it isn't, you have to do additional polyadenylation steps, which reduce the efficiency even further. There was a nice proof-of-principle paper from an Australian group very early in the outbreak, showing the subgenomic RNA fragments using direct RNA and using them to annotate the genome, but other than that I'd say there hasn't been any major usage. Another question, from Hassan, is a practical one: if you design a primer scheme with Primal Scheme, how do you prepare pool A and pool B after the primer design? You get all of your primers in tubes or plates.
I'd probably recommend tubes for most schemes, because it's very difficult to avoid contamination if you're trying to do this with a multichannel pipette; I prefer to do it one tube at a time. You basically take all of the odd-numbered tubes and all of the even-numbered tubes: for the odds, that's 1 left and right, 3 left and right, 5 left and right into one rack, and for the evens, 2 left and right, 4 left and right, 6 left and right into another rack. Then take one tube at a time and transfer five microlitres from each of your resuspended primers at a hundred micromolar. After that you can do rebalancing: you start off with equimolar pools, sequence them, and then rebalance to try to make balanced pools, which will improve your genome completeness. But it's not possible to predict the efficiency of the reactions in advance; you have to do an equimolar pool first and then average over a number of samples to get decent balanced pools. This rebalancing has been done by a couple of different groups; the Sanger have a technique for calculating the weighting of each primer based on coverage, which is available on protocols.io, actually. There was a question from John Juma: how do you use Primal Scheme to design primers for segmented viruses, where you have multiple different nucleic acids? This is the next feature I want to put in; it's not properly possible at the moment, other than putting all the segments in as separate jobs. One of the functions in Primal Scheme's selection and ranking algorithm is that it looks for heterodimer-forming pairs in the same pool: it considers every new primer candidate against all the previously selected primers within that pool for heterodimer stability, using the thermodynamic engine from Primer3.
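The coverage-based rebalancing answered above can be sketched as below. The weighting formula is an assumption for illustration, not the published Sanger method.

```python
# A sketch of coverage-based pool rebalancing (weighting formula assumed):
# after sequencing an equimolar pilot pool, scale each primer pair's volume
# inversely to the mean coverage of its amplicon, averaged across samples,
# so under-performing pairs get proportionally more primer next time.

def rebalance_volumes(mean_coverage, base_volume_ul=5.0, floor_ul=1.0):
    """Give under-performing primer pairs proportionally more volume."""
    avg = sum(mean_coverage.values()) / len(mean_coverage)
    volumes = {}
    for pair, cov in mean_coverage.items():
        scale = avg / max(cov, 1)       # weak amplicons get more primer
        volumes[pair] = max(floor_ul, base_volume_ul * scale)
    return volumes

cov = {"pair_1": 1_000, "pair_2": 100, "pair_3": 10_000}
vols = rebalance_volumes(cov)
print(vols)
```

The floor stops an over-efficient pair being diluted out of the pool entirely.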
Primal Scheme can't do that heterodimer check if you have separate runs. So one of the things we want to add is the ability to handle segmented viruses properly. It could also be used for bacterial cgMLST panels, or any panel really, because we'll change the way you provide the input files so that you can add multiple input files, one for each segment or each gene in the panel, and it will run them as the same job. Instead of separating them into separate jobs, it will consider the primers within the same pool across segments. So you can do it as is now, but you won't get the full ability to reduce heterodimer formation. In fact, the INGRA has actually pooled multiple schemes together for different viruses to make these sort of super-pools, which is the same idea really, multiple schemes in the same pool. It will work, but you're not able to exploit the advantage of minimising dimer formation. Thank you. Another question, from Winfred: what is the success rate of designing Primal Scheme schemes for highly diverse genomes or populations? It's better than it was. The recent version, which can design primers against a multi-alignment of all of your input references, has improved its ability to work on more diverse reference sets compared to the previous version. But there is a limit to what you can do: if you're trying to design primers against a very large number of genomes, or a very diverse group of reference genomes, it's not going to be possible to find fully conserved primers. So at some point you should consider other methods, if that's the application you have. It's really good for single viral strains; it's not so good against massive groups of diverse virus families.
So that's the trade-off of doing something that is high sensitivity: it breaks down when things get more diverse. And it's probably worth adding that SARS-CoV-2 is an excellent use case in that regard, because all of the genomes in circulation share a very recent common ancestor from about a year ago, and very minimal genetic diversity has arisen during that time, so the scheme works very well. From Anna Cusco: how does the native barcoding protocol used here compare to the rapid barcoding protocol? Is the main advantage the higher number of barcodes for native, and therefore lower cost, or are there other differences between the two protocols? It's a good question. I think up until recently there was a difference in the number, but there are now 96 rapid barcodes, or at least they're coming soon; that's not the reason, though. The reason is that with rapid barcoding you only get a five-prime barcode, and for the very high-specificity demultiplexing we need here, you need to see two barcodes, one at the five-prime and one at the three-prime end. We actually only use reads that have both barcodes in the analysis, for that reason. The other issue with rapid barcoding is that it fragments the amplicons further: we start off at 400 base pairs, you'd go down even more, and nanopore sequencing has a lower limit; with the standard settings, for Guppy to base call a read it needs to be, I think, 250 base pairs. If you fragment your amplicons, you'll lose quite a number of reads that end up too short to base call. But that's a secondary consideration. Okay, excellent. And on this point about double barcoding, we found that to be very important, particularly because of the issue of amplicon dropouts you mentioned.
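The both-ends demultiplexing policy just described can be sketched as a small filter. The field names here are invented for illustration; only the policy itself (same barcode at both ends, minimum base-callable length) comes from the discussion above.

```python
# A sketch of the stringent demultiplexing policy described above (field
# names invented): a read is only assigned to a sample if the SAME barcode
# is found at both the 5' and 3' end, and reads below the base-callable
# length are discarded.

MIN_LENGTH = 250   # approximate lower limit for base calling, as noted above

def assign_read(read, min_length=MIN_LENGTH):
    """Return the barcode ID if the read passes both-ends demultiplexing, else None."""
    if read["length"] < min_length:
        return None                                  # too short to base call
    if read["barcode_5p"] is None or read["barcode_3p"] is None:
        return None                                  # need a barcode at BOTH ends
    if read["barcode_5p"] != read["barcode_3p"]:
        return None                                  # mismatched ends, discard
    return read["barcode_5p"]

reads = [
    {"length": 420, "barcode_5p": "BC03", "barcode_3p": "BC03"},  # accepted
    {"length": 410, "barcode_5p": "BC03", "barcode_3p": None},    # one end only
    {"length": 180, "barcode_5p": "BC07", "barcode_3p": "BC07"},  # too short
]
print([assign_read(r) for r in reads])  # ['BC03', None, None]
```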
If you have an amplicon dropout in one sample but not in another, and you have low barcoding specificity, can you just elaborate on what might happen? Yes, this is quite an important point. Actually, if you do this right, nanopore demultiplexing is better than Illumina in terms of barcode hopping. I say that with some confidence, not only because in the Illumina methods that use exclusion amplification there's a known phenomenon of index hopping, but because it's better even compared to the MiSeq. It's not uncommon to have negative controls with zero demultiplexed reads on nanopore, and that's not going to happen on any Illumina instrument. The reason it's so important is that you can have very big differences in coverage, two or three logs of difference in amplicon coverage between one region and the next, due to the efficiency of the primer pairs. Compare the most efficient region, which might have tens of thousands of X coverage, with a dropout region, which might have zero. If in a separate sample those profiles are inverted, then any amount of crosstalk from one sample to the other could push you over the coverage threshold in the incorrect sample, leading to variant calls in a region where you shouldn't have called anything. That's why it's really important to have highly stringent demultiplexing for this kind of work. And it's not only the efficiency of the regions: you could have a high-Ct sample and a low-Ct sample on the same run, and crosstalk from your low-Ct sample, say your positive control, could bleed across by barcode hopping into your high-Ct sample or your negative, and then you'd see coverage where you shouldn't see any.
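The crosstalk arithmetic above is worth making explicit. All the numbers here are assumed for illustration.

```python
# Illustrative numbers (assumed): with 2-3 logs of coverage difference
# between amplicons, even a small barcode-hopping rate can put apparent
# coverage into a sample whose own amplicon dropped out.

def hopped_depth(source_depth, hop_rate):
    """Depth leaking from one sample's amplicon into another via mis-assignment."""
    return source_depth * hop_rate

# Sample A has 20,000x on an amplicon where sample B has a dropout (0x).
# A 1% hop rate gives B an apparent 200x there, far above a 20x calling
# threshold, on a product that B never amplified.
leak = hopped_depth(20_000, 0.01)
print(leak, leak > 20)  # 200.0 True
```

This is the quantitative reason for demanding barcodes at both read ends: it pushes the effective hop rate low enough that negatives stay at zero.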
One last question, I think: what are the issues if you want to use Primal Scheme with non-viral genomes? And I guess, Ajda, you're thinking about bacterial genomes, perhaps. The main difficulty with bacteria is that some of the key bacterial species and pathogens you might be interested in are high GC, and that leads to more problems designing primers. That's why we introduced the high-GC mode. The reason you need different parameters is that high-GC primers are a lot shorter than low-GC primers, because they have different kinetics: the Tm difference is huge between, say, a 70% GC organism and the inverse. So we had to change all the parameters for the permitted primer lengths. If you want to do that, turn on high-GC mode and it will select the appropriate settings. If you have something in between, like 50 to 55% GC, then you might have to try both and see which one gives you the best results. Thank you all so much for listening to us at home. If you liked this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group and edited by Nick Waters. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.