This time on the MicroBinfie podcast, we come from the ARTIC Network and CLIMB-BIG-DATA joint workshop on COVID-19 data analysis, held on the 14th and 15th of January 2021. I'm going to give a talk on how you go about sequencing SARS-CoV-2, with quite a lot of reference to the COG-UK consortium. I was pondering this question last night and did a Twitter thread on it. In COG-UK at the moment we do a lot of nanopore sequencing, which is a rapid-turnaround protocol for up to 96 samples. In parallel with that there are also a number of high-throughput methods, usually using Illumina sequencing. We actually use two out of the three possible broad classes of sequencing approach: there's metagenomics, which is untargeted RNA-seq basically, and then two different classes of targeted methods, either amplicon sequencing or bait capture. Both of the targeted approaches are used in COG-UK. And it occurs to me that this model of a decentralised network draws in a lot of expertise from all over the UK, and that's actually what's led to the success of the project: having this broad base of diversity within the methods has allowed us to become what's been called in the press the best genomic surveillance system in the world. It's very gratifying to see that kind of thing. But I'm going a little bit off topic. Amplicon sequencing is what this talk is about. You might have heard the name: the ARTIC method, which is a sort of nickname for this amplicon sequencing approach. It's become popular because it's quite cheap and easy to scale, and it can recover genomes, or at least partial genomes, in the cycle threshold (Ct) 30 to 35 range, which makes it ideal for this kind of work, even in the presence of a high level of host material. The way it works in the nanopore context is that it uses multiplex PCR to generate amplicon pools, which we then barcode with native barcoding and sequence on nanopore. You will all have heard of PCR.
PCR is a classic 1980s technique. You start off with genomic DNA, then you have first-cycle products where you extend one primer, and then four second-cycle products where you extend the reverse primer, so after cycle two the amplification is exponential. The multiplex PCR method was born out of the Brazilian Zika road trip, where we were struggling with high Ct values for the clinical Zika virus isolates we received. The modal Ct value was about 36, which corresponds to a very low number of copies, somewhere in the 10 to 100 copies per millilitre range, so we were struggling with that. In the previous Ebola work we had done single-plex, one-step RT-PCR reactions, pooled them and sequenced them, but we needed something that used less RNA and was faster and more high throughput. So we ended up focusing on multiplex PCR, where you generate all the amplicons in one reaction. The trick there is making all of those individual reactions work in the same tube. That led to me reverse engineering a technique developed by Thermo Fisher called AmpliSeq; my idea was that there wasn't really anything too complicated going on, and that it was purely a primer design challenge. The latest iteration of our tool, which we call Primal Scheme, is a web-based tool enabling you to design the primers for this kind of multiplex PCR. All you have to do is upload a FASTA file of your references, and this can be multiple references from any virus. The tool is able to design the best primers to cover the diversity of the reference genomes that you upload; that's the most recent feature that Andy Smith and I, who developed this tool, have added. All you really need to set is the amplicon size and a name, and the tool will start.
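As a rough aside on the Ct figures mentioned above, the relationship between Ct and starting copy number can be sketched in a few lines. The numbers here are assumed for illustration, not taken from the talk's data.

```python
# A back-of-envelope illustration (numbers assumed): each ideal PCR cycle
# doubles the template, so a sample's starting copy number can be estimated
# from a standard of known concentration as
# standard_copies * 2 ** (Ct_standard - Ct_sample).

def estimate_copies(ct_sample, ct_standard, standard_copies, efficiency=2.0):
    """Estimate starting template copies, assuming perfect doubling per cycle."""
    return standard_copies * efficiency ** (ct_standard - ct_sample)

# A hypothetical 1e6-copy standard crossing threshold at Ct 20: a sample at
# Ct 36 is 2**16, about 65,000-fold lower, i.e. roughly 15 copies, which is
# the sort of regime the high-Ct Zika samples were in.
print(round(estimate_copies(36, 20, 1e6)))
```

This is why Ct 36 samples are so hard to sequence: a 16-cycle gap in Ct means tens of copies rather than millions.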
Primal Scheme applies a window over the beginning of the genome, aligns all the references with parasail to determine the CIGAR strings, and from those it chooses the most conserved primers. If you want it to design primers against your first reference only, you can turn on the pinned flag, and then it will only produce primers that occur in the first genome, which is the old behaviour for people who have used this for a while. There's an overlap between the amplicons, which means you can trim off the primers, which are obviously synthetic oligos, and only consider the actual sequence derived from the viral RNA. That's quite an important point, which we'll come back to later, because it's the main consideration when you're doing the analysis of this type of data. In terms of the PCR, you have to leave a gap between adjacent amplicons, because otherwise you would generate an overlap product: if you mixed those primers together, you would preferentially produce a short product between one left primer and the neighbouring right primer, which wouldn't cover any of the genome. So we have to separate them into two reactions. That said, the technique still allows you to produce around 96 amplicons in just two reactions, covering the 30,000-base-pair genome, so it's quite a high plexity in terms of the number of primers per reaction. There's natural variation in the efficiency of the primer pairs, which, even though you try to make them very similar, will differ, and the overlaps are the higher-coverage areas, essentially the coverage of region one and region two mixed together. In the analysis pipeline that gets normalised down, so that you have some even depth of coverage for all of the regions above the threshold.
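The two-pool tiling layout described above can be sketched as follows. This is a naive illustration with assumed sizes; the real amplicon coordinates come from Primal Scheme's own primer search.

```python
# A naive sketch of a tiling amplicon scheme (sizes assumed): adjacent
# amplicons overlap, so neighbours must alternate between pool 1 and pool 2;
# otherwise a left primer and the neighbouring right primer in the same tube
# would preferentially make a short overlap product.

def tile_genome(genome_len, amplicon_len=400, overlap=75):
    """Return (start, end, pool) tuples tiling a genome with two primer pools."""
    amplicons = []
    start, n = 0, 0
    while True:
        end = min(start + amplicon_len, genome_len)
        amplicons.append((start, end, 1 if n % 2 == 0 else 2))
        if end >= genome_len:
            break
        start, n = end - overlap, n + 1
    return amplicons

scheme = tile_genome(30_000)   # roughly a SARS-CoV-2-sized genome
print(len(scheme), scheme[:2])
```

Note how amplicon 2 starts inside amplicon 1's footprint: that overlap is what lets the analysis trim the synthetic primer sequence and still keep real viral bases at every position.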
You then end up with what we call dropouts, where you get no data at all for a particular region. That's because of amplification failure, which is an inevitable consequence of a scheme like this. It can be caused by poor primers, or by mutations in the primer binding site due to viral variation, although because of the overlap you should still get the adjacent amplicon covering that position; or by high-Ct or degraded input material, which leads to a gradual degradation of the reaction and more and more dropouts in the high-Ct samples. So those are the key features of what the data looks like for SARS-CoV-2. The initial protocol release was on, I think, the 23rd of January, so it's coming up to the first birthday of this protocol. The genome it was based on was actually the third version of the reference, which had only been released, I think, around the 18th, less than a week before. Subsequently we changed some primer pairs to try to improve the dropouts. To allow more people to have access to these primers, I organised a bulk-scale synthesis with IDT, which is manufactured, filled and packed by IDT and sent out by them, allowing more consistency in the primer pools people use. And it's actually really cheap: they only charge something like $30, I think partly because we paid for the synthesis. I don't know if we've effectively just bought it for everyone, but at least they send it out to everyone, which is much more convenient. So if you want the primers, you can order them directly from IDT and they'll send out the V3 pools, which is the only version available. The ARTIC Network uses a platform called protocols.io for sharing protocols.
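The dropout behaviour described above is simple to flag computationally. This is illustrative only; the depth threshold and amplicon names here are assumed, not from the ARTIC pipeline.

```python
# Illustrative only (threshold and names assumed): in the analysis, an
# amplicon with essentially no reads is a dropout; extra sequencing cannot
# recover it, because the PCR product was never made in the first place.

def find_dropouts(amplicon_depth, min_depth=20):
    """Return names of amplicons whose depth falls below min_depth."""
    return [name for name, depth in amplicon_depth.items() if depth < min_depth]

coverage = {"amp_64": 15_000, "amp_65": 3, "amp_66": 900, "amp_74": 0}
print(find_dropouts(coverage))  # ['amp_65', 'amp_74']
```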
protocols.io is an online platform which allows people to post open protocols and then comment on, fork and update them. It's not a full version tree like GitHub, which is what I don't like about it, but it is quite simple and basic, which I suppose is its strength, and it has proven to be quite popular. This is the third version, which we call the LoCost protocol; it's had over a hundred thousand views. It allows you to interact with other users and answer each other's questions, so it's a good forum for troubleshooting. It's also had something like 56 forks, so there are a number of sub-protocols derived from it. In addition, it's been adopted commercially by other companies. It's actually supported by ONT, who will support this protocol through their tech support, and they also have a protocol on their website which is very much in line with this one. QIAGEN have commercialised a product for it, Illumina have commercialised COVIDSeq, which is based on the multiplex PCR primers, and NEB have commercialised a kit called the ARTIC companion kit. So it's been very widely adopted across the community. The genome recovery from the PCR is limited by the input cDNA: the less you put in, the more dropouts occur as the reaction degrades. That's why we call them dropouts, because no amount of coverage will rescue those amplicons; they're just not there. It's a good way to think about how much sequencing you need, because you can see that the coverage curve rolls over after you've generated around a hundred thousand reads per sample, and you're not really going to achieve any more coverage than that, no matter how long you keep running the flow cell. That's the sort of decision you can make about how long to run your flow cell if you're using RAMPART.
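The stopping decision described above can be sketched as a simple saturation check. The thresholds and checkpoint values here are assumed for illustration; this is not RAMPART's logic.

```python
# A sketch of the stopping decision (thresholds assumed, not from RAMPART):
# once extra reads stop adding genome coverage, keeping the flow cell running
# gains nothing, because dropouts cannot be rescued by more depth.

def has_saturated(checkpoints, window=2, min_gain_pct=1.0):
    """checkpoints: list of (cumulative_reads, pct_genome_covered) tuples.
    True when the coverage gained over the last `window` checkpoints is
    below min_gain_pct percentage points."""
    if len(checkpoints) <= window:
        return False
    gain = checkpoints[-1][1] - checkpoints[-1 - window][1]
    return gain < min_gain_pct

checkpoints = [(10_000, 70.0), (25_000, 88.0), (50_000, 94.5),
               (100_000, 95.0), (150_000, 95.1)]
print(has_saturated(checkpoints))  # True: the last 100k reads added <1 point
```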
RAMPART shows you, once you've set the run up, the same amplicon coverage profile I was talking about earlier, with the reads collected and demultiplexed by sample, and this is the rolling-over curve I mentioned. You can see which samples are likely to have reached saturation, and then decide whether to carry on sequencing or stop. So that's the end of my talk, and I'd like to thank all of the rest of the ARTIC Network and all of our collaborators. Thank you very much. There are a couple of questions. First: have there been any developments in the use of the direct RNA kit for viral nanopore sequencing, perhaps thinking about sequencing clinical samples for genomic epidemiology? There are a couple of limitations with the direct RNA kit. One is that it's poly(A)-primed, so it's designed for messenger RNA. It also has a very high input requirement: something like a microgram of total RNA, or even more than that, perhaps 10 micrograms of total RNA, to get at least 100 nanograms of poly(A) RNA. That's too high for practical use in a clinical isolate setting. Obviously the virus you're sequencing needs to be polyadenylated, and if it isn't, you have to do additional polyadenylation steps, which reduce the efficiency even further. There was a nice proof-of-principle paper from an Australian group very early in the outbreak, showing the subgenomic RNA fragments using direct RNA and using them to annotate the genome, but other than that I'd say there hasn't been any major usage. Another question, from Hassan, is a practical one: if you design a primer scheme with Primal Scheme, how do you prepare pool A and pool B after the primer design? You get all of your primers in tubes or plates.
I'd probably recommend tubes for most schemes, because it's very difficult to avoid contamination if you're trying to do this with a multichannel pipette; I prefer to do it one tube at a time. You basically take all of the odd-numbered tubes and all of the even-numbered tubes: for the odds, that's 1 left and right, 3 left and right, 5 left and right into one rack, and for the evens, 2 left and right, 4 left and right, 6 left and right into another rack. Then take one tube at a time and transfer five microlitres from each of your resuspended primers at a hundred micromolar. After that you can do rebalancing: you start off with equimolar pools, sequence them, and then rebalance to try to make balanced pools, which will improve your genome completeness. But it's not possible to predict the efficiency of the reactions in advance; you have to do an equimolar pool first and then average over a number of samples to get decent balanced pools. This rebalancing has been done by a couple of different groups; the Sanger have a technique for calculating the weighting of each primer based on coverage, which is available on protocols.io, actually. There was a question from John Juma: how do you use Primal Scheme to design primers for segmented viruses, where you have multiple different nucleic acids? This is the next feature I want to put in; it's not properly possible at the moment, other than putting all the segments in as separate jobs. One of the functions in Primal Scheme's selection and ranking algorithm is that it looks for heterodimer-forming pairs in the same pool: it considers every new primer candidate against all the previously selected primers within that pool for heterodimer stability, using the thermodynamic engine from Primer3.
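The coverage-based rebalancing answered above can be sketched as below. The weighting formula is an assumption for illustration, not the published Sanger method.

```python
# A sketch of coverage-based pool rebalancing (weighting formula assumed):
# after sequencing an equimolar pilot pool, scale each primer pair's volume
# inversely to the mean coverage of its amplicon, averaged across samples,
# so under-performing pairs get proportionally more primer next time.

def rebalance_volumes(mean_coverage, base_volume_ul=5.0, floor_ul=1.0):
    """Give under-performing primer pairs proportionally more volume."""
    avg = sum(mean_coverage.values()) / len(mean_coverage)
    volumes = {}
    for pair, cov in mean_coverage.items():
        scale = avg / max(cov, 1)       # weak amplicons get more primer
        volumes[pair] = max(floor_ul, base_volume_ul * scale)
    return volumes

cov = {"pair_1": 1_000, "pair_2": 100, "pair_3": 10_000}
vols = rebalance_volumes(cov)
print(vols)
```

The floor stops an over-efficient pair being diluted out of the pool entirely.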
Primal Scheme can't do that heterodimer check if you have separate runs. So one of the things we want to add is the ability to handle segmented viruses properly. It could also be used for bacterial cgMLST panels, or any panel really, because we'll change the way you provide the input files so that you can add multiple input files, one for each segment or each gene in the panel, and it will run them as the same job. Instead of separating them into separate jobs, it will consider the primers within the same pool across segments. So you can do it as is now, but you won't get the full ability to reduce heterodimer formation. In fact, the INGRA has actually pooled multiple schemes together for different viruses to make these sort of super-pools, which is the same idea really, multiple schemes in the same pool. It will work, but you're not able to exploit the advantage of minimising dimer formation. Thank you. Another question, from Winfred: what is the success rate of designing Primal Scheme schemes for highly diverse genomes or populations? It's better than it was. The recent version, which can design primers against a multi-alignment of all of your input references, has improved its ability to work on more diverse reference sets compared to the previous version. But there is a limit to what you can do: if you're trying to design primers against a very large number of genomes, or a very diverse group of reference genomes, it's not going to be possible to find fully conserved primers. So at some point you should consider other methods, if that's the application you have. It's really good for single viral strains; it's not so good against massive groups of diverse virus families.
So that's the trade-off of doing something that is high sensitivity: it breaks down when things get more diverse. And it's probably worth adding that SARS-CoV-2 is an excellent use case in that regard, because all of the genomes in circulation share a very recent common ancestor from about a year ago, and very minimal genetic diversity has arisen during that time, so the scheme works very well. From Anna Cusco: how does the native barcoding protocol used here compare to the rapid barcoding protocol? Is the main advantage the higher number of barcodes for native, and therefore lower cost, or are there other differences between the two protocols? It's a good question. I think up until recently there was a difference in the number, but there are now 96 rapid barcodes, or at least they're coming soon; that's not the reason, though. The reason is that with rapid barcoding you only get a five-prime barcode, and for the very high-specificity demultiplexing we need here, you need to see two barcodes, one at the five-prime and one at the three-prime end. We actually only use reads that have both barcodes in the analysis, for that reason. The other issue with rapid barcoding is that it fragments the amplicons further: we start off at 400 base pairs, you'd go down even more, and nanopore sequencing has a lower limit; with the standard settings, for Guppy to base call a read it needs to be, I think, 250 base pairs. If you fragment your amplicons, you'll lose quite a number of reads that end up too short to base call. But that's a secondary consideration. Okay, excellent. And on this point about double barcoding, we found that to be very important, particularly because of the issue of amplicon dropouts you mentioned.
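The both-ends demultiplexing policy just described can be sketched as a small filter. The field names here are invented for illustration; only the policy itself (same barcode at both ends, minimum base-callable length) comes from the discussion above.

```python
# A sketch of the stringent demultiplexing policy described above (field
# names invented): a read is only assigned to a sample if the SAME barcode
# is found at both the 5' and 3' end, and reads below the base-callable
# length are discarded.

MIN_LENGTH = 250   # approximate lower limit for base calling, as noted above

def assign_read(read, min_length=MIN_LENGTH):
    """Return the barcode ID if the read passes both-ends demultiplexing, else None."""
    if read["length"] < min_length:
        return None                                  # too short to base call
    if read["barcode_5p"] is None or read["barcode_3p"] is None:
        return None                                  # need a barcode at BOTH ends
    if read["barcode_5p"] != read["barcode_3p"]:
        return None                                  # mismatched ends, discard
    return read["barcode_5p"]

reads = [
    {"length": 420, "barcode_5p": "BC03", "barcode_3p": "BC03"},  # accepted
    {"length": 410, "barcode_5p": "BC03", "barcode_3p": None},    # one end only
    {"length": 180, "barcode_5p": "BC07", "barcode_3p": "BC07"},  # too short
]
print([assign_read(r) for r in reads])  # ['BC03', None, None]
```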
If you have an amplicon dropout in one sample but not in another, and you have low barcoding specificity, can you just elaborate on what might happen? Yes, this is quite an important point. Actually, if you do this right, nanopore demultiplexing is better than Illumina in terms of barcode hopping. I say that with some confidence, not only because in the Illumina methods that use exclusion amplification there's a known phenomenon of index hopping, but because it's better even compared to the MiSeq. It's not uncommon to have negative controls with zero demultiplexed reads on nanopore, and that's not going to happen on any Illumina instrument. The reason it's so important is that you can have very big differences in coverage, two or three logs of difference in amplicon coverage between one region and the next, due to the efficiency of the primer pairs. Compare the most efficient region, which might have tens of thousands of X coverage, with a dropout region, which might have zero. If in a separate sample those profiles are inverted, then any amount of crosstalk from one sample to the other could push you over the coverage threshold in the incorrect sample, leading to variant calls in a region where you shouldn't have called anything. That's why it's really important to have highly stringent demultiplexing for this kind of work. And it's not only the efficiency of the regions: you could have a high-Ct sample and a low-Ct sample on the same run, and crosstalk from your low-Ct sample, say your positive control, could bleed across by barcode hopping into your high-Ct sample or your negative, and then you'd see coverage where you shouldn't see any.
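The crosstalk arithmetic above is worth making explicit. All the numbers here are assumed for illustration.

```python
# Illustrative numbers (assumed): with 2-3 logs of coverage difference
# between amplicons, even a small barcode-hopping rate can put apparent
# coverage into a sample whose own amplicon dropped out.

def hopped_depth(source_depth, hop_rate):
    """Depth leaking from one sample's amplicon into another via mis-assignment."""
    return source_depth * hop_rate

# Sample A has 20,000x on an amplicon where sample B has a dropout (0x).
# A 1% hop rate gives B an apparent 200x there, far above a 20x calling
# threshold, on a product that B never amplified.
leak = hopped_depth(20_000, 0.01)
print(leak, leak > 20)  # 200.0 True
```

This is the quantitative reason for demanding barcodes at both read ends: it pushes the effective hop rate low enough that negatives stay at zero.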
One last question, I think: what are the issues if you want to use Primal Scheme with non-viral genomes? And I guess, Ajda, you're thinking about bacterial genomes, perhaps. The main difficulty with bacteria is that some of the key bacterial species and pathogens you might be interested in are high GC, and that leads to more problems designing primers. That's why we introduced the high-GC mode. The reason you need different parameters is that high-GC primers are a lot shorter than low-GC primers, because they have different kinetics: the Tm difference is huge between, say, a 70% GC organism and the inverse. So we had to change all the parameters for the permitted primer lengths. If you want to do that, turn on high-GC mode and it will select the appropriate settings. If you have something in between, like 50 to 55% GC, then you might have to try both and see which one gives you the best results. Thank you all so much for listening to us at home. If you liked this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group and edited by Nick Waters. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.