Hello, and thank you for listening to the MicroBinfeed podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Both Andrew and Nabil work in the Quadram Institute in Norwich, UK, where they work on microbes in food and the impact on human health. I work at the Centers for Disease Control and Prevention, and am an adjunct member at the University of Georgia in the U.S. Hi, and welcome to the MicroBinfeed podcast. I'm Nabil, and I'll be your host for today, and we're back talking about COVID-19. Today, we're going to talk about our efforts to sequence the SARS-CoV-2 genomes present in positive clinical samples from the Norfolk region. Andrew Page is joining me today in his capacity as head of informatics at Quadram Institute Bioscience, and we have two special guests. Alison Mather is a group leader at Quadram Institute Bioscience. Once again, we are also joined by Justin O'Grady, who is a group leader at Quadram Institute Bioscience and Associate Professor of Medical Microbiology at the University of East Anglia. So, Alison, since it's your first time, who are you and what do you normally do? Thanks, Nabil, it's great to be with you guys. Normally, I'm interested in bacteria, so what I look at are the epidemiology, evolution and dynamics of foodborne and zoonotic bacteria, so bacteria that go back and forth between different host populations, with a particular focus on antimicrobial resistance. And to do this, we use both short and long read sequencing in both isolate-based genomics and metagenomics as well. That's cool. So let's get right into it.
And so today, we're talking about SARS-CoV-2, but for anyone who doesn't know, Justin, what actually is SARS- CoV-2? Well, I think most people probably know it's a virus, and it was first identified in Wuhan in China. There's some debate as to when it probably arose in the human population, but I can pretty much definitely say it wasn't engineered in a lab. There has been some talk about that on the internet. But it started in around maybe late November, December in Wuhan in China. And as we all know, it's spread across the world rather rapidly. It's been confirmed in 25 million cases, 25 million infections, and about 850,000 deaths. So it's had a major effect on all our lives. We're talking a lot about whole genome sequencing. So Alison, what can whole genome sequencing do to actually help us combat this particular virus? Well, whole genome sequencing obviously provides the highest resolution data that we have to understand the virus and how it's evolving and how it is changing over time. Of course, because it hasn't been around for a terribly long time, although it may feel like that, it's the combination of both the genomic data and the epidemiological data that can help us understand how it's spreading, both in a local context, either within a region as we're going to talk about today, or more globally, and also how the different control measures that have been applied are having an effect as well. So what we can do with the data is to understand how it's evolving, and by doing so, we can understand how that's affecting the pathogen spread. And this is organized as within COG-UK, and we previously mentioned COG-UK, but in case anyone missed it, Justin, what is COG-UK and how does QIB fit in within that? Well, COG-UK is a consortium, it's called the COVID-19 Genomics UK Consortium, and it was a £20 million funded grant from the Department of Health and Social Care in the UK, UKRI and Wellcome Trust. 
So these organizations got together and funded this grant, which was created to deliver large-scale rapid whole genome virus sequencing for the NHS and for public health agencies. So there are 12 academic partners and the four UK public health bodies, so that's Public Health England, Wales, Scotland and Northern Ireland. And yeah, I suppose that's a brief description of what it is. So you're just one of the members? No, we are one of the 12 academic partners. And I think there are more academic partners joining since the original 12. So we cover the East of England region, not all of the East of England, but mainly Norfolk and parts of Suffolk. So we sequence the SARS-CoV-2 positive cases, or as many as we can, from this particular region. And so what have you actually set up in Norfolk and what are your activities day-to-day now that you're working on this project? What we do is we work with the local hospital, the Norfolk and Norwich University Hospital, and they collect samples from all around the region and they do the virology testing for this particular region. So they collect samples from the hospitals in the other major centres in the region. So King's Lynn and Great Yarmouth, the hospitals that are there send their samples to the Norfolk and Norwich Hospital, and then we collect the samples from there and we sequence them. There are also other types of samples sent to the NNUH for testing. Sometimes we get some samples from hospital staff and their families. Sometimes we get samples from care homes, sometimes we get samples from the community, but there are different pillars for testing in the UK. Pillar one is the testing of samples that come from hospitals. That's most of the type of testing that we're involved with. Pillar two is mainly community testing and that came online later. And those pillar two samples, we don't get many of these days.
So what we do is we get all these samples, we collect them, we extract RNA when necessary. Sometimes we get already extracted RNA and then we create cDNA and we make libraries and we sequence them and then we pass the data on to bioinformatics team here at the QIB, which Andrew leads, and they crunch the data and determine lineages. And then we return that lineage information to the NHS or to the Norfolk test and trace teams here in Norfolk. And they use that data to investigate outbreaks and investigate hospital transmission, etc. Here's a fairly simple question. I mean, a couple of times you've mentioned Norfolk and for people outside the UK and maybe for a few people within the UK. Alison, where is Norfolk and who are the people that live there? Norfolk is a county which is on the east of England. So it is kind of, if you look at the map, there's the fat part of England and it's on the east and eastern side of that. So actually, it's a shame that this is through the medium of audio, because whenever I give a presentation, I always put up a few photographs of the glories of Norfolk. We've got lovely beaches and big skies and lots of countryside. And so it's quite a stable population, which is one of the reasons why this study, I think, is so nice. In general, Norfolk has an older population, which is expected to increase at a greater rate than the rest of England. And almost all of the population increase over the last five years has been in those over the age of 65. And that obviously has implications for a virus, which seems to affect the older populations more commonly. So over to you, Andrew, when you started this work, what were you hoping to find? To be honest, we weren't really sure what we're going to find. 
But I know from doing a lot of studies in the past, particularly large scale studies, that as soon as you start applying genomics to it, you're going to find a lot of stuff that you never knew was there and a lot of hidden links and a lot of really, really cool stuff. So we were just initially hoping to, I suppose, sequence everything we could possibly get our hands on, build some trees, see what's linked, what's not linked. And that seems to have proven very useful. And because we made a decision to sequence everything, you know, even really, really poor quality samples, you know, with very high CT values, very low viral loads, it's meant we could go on then and answer some more interesting questions that we never even thought we could. Whereas I think other groups in the world maybe set a cut off, you know, for money reasons, whatever, and said, OK, no, we're not going to do the samples with low viral loads. We're just going to focus on the really high quality ones and that's it. But, you know, I think we've done quite well identifying some really interesting things like false positives, limits of detection, the correlation between a low viral load and the quality of the genome sequence you get at the other end, this kind of stuff. One really interesting thing we found is that we don't just have, say, one massive outbreak. It's lots of small, different introductions and expansions within the region, which has proved invaluable because with just traditional epidemiology, you can't necessarily tell the difference between two people who walk into a pub: were they infected with the same thing, or were they infected with something different from two different infection sources? At least with genomics, you have some kind of idea of whether you can answer that question. So if the genomes are different enough, you might be able to tell, well, you know, this isn't the same thing.
It's, they're slightly different things. We also were looking at, throughout the pandemic, the spike protein mutation, the D614G, and that's linked to increased transmissibility of the virus in vivo. And they've gone on and found, yeah, it seems to be right. So that's interesting as well. And I think in all of our data, actually, we've seen a conversion from types with the wild type into the types of mutation. So that dominates, I think every sample that we get through the door is this, has this spike protein mutation there. Yeah. It sounds like there's a lot of like really cool, interesting vignettes. And I think we'll, we'll talk about them a little bit later, but what I want to set down is some, some basic background on some of the methodology that all of you are applying within this work. So Justin, I mean, you touched on this earlier, but just wanted to ask if there's anything else you wanted to add in terms of which samples you're investigating and how they're collected. It's quite variable. The sources of our samples are very variable. So I guess if anyone's trying to do this anywhere else, you'll know, you'll find out that there are many different diagnostic technologies used to identify positive patients. And some of them are more sensitive than others. Some of them use RT-qPCR. Some of them use other technologies. They all have different readouts. They all have different limits of detection. And so it can be quite a, quite a minefield trying to figure out what you should or should not sequence or what result on the diagnostic machine will mean that you will get a good sequence out of the ARTIC sequencing protocol. So I think that's something that we'll try to address. COG UK will try to address in the not too distant future. 
So we will, we will come out with, with some pre-print hopefully in the next few weeks that will maybe describe some of the differences between the technologies that are used for diagnosis and, and how you can kind of benchmark them against each other in some way. It is quite difficult to do that. And what we find is that, you know, some machines have automated extraction included and, but you don't get the RNA back out and therefore you've got to go back to primary sample to do extractions. Some machines involve an offline RNA extraction, which you can easily get. That's the nicest sample type for us. If we can, if the lab can give us RNA rather than giving us a primary sample, which we have to extract the RNA from. And then some, some, some machines give us a CT, some machines give us a takeoff value, some machines give us relative fluorescence units for the positive results. And so how they're related and linked is, is not a simple story and it can be very difficult to compare between different technologies. So that's just something to be aware of if you're going to try to do this. So I'll ask a question about the, about the surveillance from Alison and I think we mentioned clinical samples as being the main, main focus, but did you include any samples from the wider population? That's right. Most of the samples that we have been sequencing here and therefore we've been analyzing are diagnostic samples. So they'll come, they'll come to the hospitals. There's also a number, a smaller number of samples that came from testing of key workers. So those are healthcare workers or care workers or police and their families. But the vast majority of the community testing, they will have been processed through the lighthouse labs. And so it won't be included here. What's a lighthouse lab? Well, a lighthouse lab is, are these large labs that were set up by the government to provide testing, large scale testing for the UK. 
There are, am I correct, four or five of them at the moment. They have different capacities, but thousands of samples per day, tens of thousands of samples per day for most of them. And they're trying to increase that capacity all the time. So they are outside the normal clinical microbiology or clinical virology network and they are privately run. So Justin, you did mention very briefly some of the steps that you are taking in the lab, but what are the main components of the workflow that is required for this work? And I wanted to get an idea of, you know, the degree of complexity compared to normal lab techniques and the sort of timeframe or time pressure you're working under. There was a protocol developed for SARS-CoV-2 whole genome sequencing by the ARTIC Network, which is a network of academics. It's a collaborative thing funded by the Wellcome Trust. Josh Quick from Birmingham developed this protocol and he has since optimized it a number of times. And I think we're on version three now, the low cost protocol. And this version three is really quite heavily optimized and it's quite cheap. I think you can do a SARS-CoV-2 genome for about 10 or 12 pounds, 10 pounds, I think they're saying. And you can do it within a day. You can have your sequencing turnaround in less than 24 hours. That's from sample to result. I think the main steps involved are these. You will get the sample. You may or may not have to extract the RNA. And if you do have to extract the RNA, you have to decide on what RNA extraction method to use. And the reagents can be hard to get. RNA extraction reagents and stuff can be a little bit under pressure because of the massive amount of testing that's happening globally. So do factor that in if you're trying to set this up. You'll find it easier to get manual extraction kits and reagents than you will to get automated stuff.
Then you would move on to cDNA synthesis. So you've got to turn the RNA into cDNA and that's first strand cDNA synthesis. And then you move into the ARTIC PCR. And this is the real major advance in this method. This is the reason why this method is used globally by most people who are sequencing SARS-CoV-2 genomes. And that's because Josh was able to use his Primal Scheme algorithm to develop a set of primers that work in two 48-plex reactions. You've got 48 primer pairs multiplexed in a reaction and it works with high sensitivity to detect SARS-CoV-2 in samples. So we can detect samples with, you know, CT even higher than 35. We don't get the whole genome from very high CT samples, but we can get parts of the genome. So it's a very sensitive, high level multiplex PCR. It takes quite a long time. It takes about three and a half hours to do the PCR, but that's because it's such a complex reaction. Then you move from the ARTIC PCR into some form of library preparation. And Josh has cut this back, skipping washes and skipping quantification steps to make this as simple as possible. And you can sequence on Illumina if you want, you can sequence on Nanopore if you want. The protocols are going to be different, but Josh's ARTIC protocol is set up for Nanopore. And so it's using the new 96 native barcode Nanopore kits. And so you go through that process, you end up with a library and you sequence it for 12 hours or so, depending on the number of samples that you have on the flow cell. In the recent discussions, most people think a sweet spot for the number of samples to put on is probably around 48. 96 means you're going to have to run it for longer.
So what people are doing is, even if they have 96 samples, they'll put on two flow cells with 48 on each. Then they'll cycle the barcodes and wash the flow cells, and for the next 96 they'll put the opposite barcodes onto the opposite flow cells, reuse those flow cells and run them for a further 12 hours. But this is all in an effort to get the data as quickly as possible. So you run two flow cells, both for 12 hours, and you then get that data and you analyze it. And that's where my expertise falls over. Now, I guess that passes on to the hands of Andrew and his group. So Andrew, what is the bioinformatics applied to the sequencing data? I guess we have the easy part, you know, we get all of this raw data. And the first thing we need to do, whether it's Nanopore or Illumina, is base call it. We use really strict demultiplexing on Illumina and on Nanopore because, you know, even small errors are going to cause problems downstream. And then once you have that, we go on and run all the ARTIC pipelines, which thankfully are written in Nextflow. So it's quite straightforward to run. And it magically processes the data. So for Illumina, say, it will run iVar, and at the other end of that you get a consensus genome. For Illumina you need about 10x coverage minimum over a region of the genome, and then it'll call the bases there and build a consensus. For Nanopore you need a bit more, so about 20x minimum. But actually what we really get out the other end is about a thousand x or even more than a thousand x, so, you know, we're really confident that a base is a base. And the reason is that with amplicon sequencing you get huge variations in coverage, and you need to do a lot of sequencing to make sure that you get some of the amplicons which don't work as well.
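As a rough illustration of the minimum-depth rule just described (about 10x for Illumina, about 20x for Nanopore), the sketch below masks any position whose read depth falls under the threshold with an N. This is not the pipeline's actual code. The function name `mask_low_depth` and the toy inputs are hypothetical, and only the two thresholds come from the conversation.

```python
# Minimum depths quoted in the episode; everything else here is illustrative.
MIN_DEPTH = {"illumina": 10, "nanopore": 20}


def mask_low_depth(consensus: str, depth: list[int], platform: str) -> str:
    """Replace bases with 'N' wherever per-position depth is below the threshold."""
    if len(consensus) != len(depth):
        raise ValueError("consensus and depth must have the same length")
    threshold = MIN_DEPTH[platform]
    return "".join(
        base if d >= threshold else "N"
        for base, d in zip(consensus, depth)
    )


if __name__ == "__main__":
    # The same toy depths give different masks on the two platforms.
    print(mask_low_depth("ACGT", [1000, 15, 25, 5], "illumina"))
    print(mask_low_depth("ACGT", [1000, 15, 25, 5], "nanopore"))
```

A position with 15x depth passes the Illumina cut-off but is masked for Nanopore, which is the practical effect of the stricter threshold.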
You need roughly about a hundred thousand reads per sample on the flow cell, but that's a hundred thousand reads after you lose some because they're poor quality, or you lose some because they don't have barcodes at both ends. So on the Nanopore sequencer you lose roughly around 40% of the data. So imagine you put on your run and you've got 48 samples on a flow cell. You want to factor that in, so when you decide that the run has enough data, you want to think: I need a hundred thousand reads per sample on the flow cell, and then I need to factor in an extra 40% that I'm going to lose. So, you know, that's what you need to aim for. Yeah, and so once you get the consensus genomes, God, I guess the first thing we do actually is look at the blanks. I skipped a step. And that's usually when everyone is holding their breath, you know: what is the blank going to look like? Is it really going to be blank? Will there be any reads mapping to any part of the SARS-CoV-2 genome? Sometimes there are zero and then it's fantastic, but sometimes there are not and then the run has to be chucked. So I think things are looking an awful lot better over time, as the guys in the lab have gotten used to the protocols and used to the peculiarities and put in extra checks along the way. And on the other end, on the data processing end, we've got more used to what is really a problem and what isn't a problem. You know, if you get tiny little matches, that's not really contamination. But I know, Nabil, haven't you looked at a lot of blanks? What have you found?
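The read budget from the start of this exchange can be written down as a short calculation. This is a minimal sketch that assumes "lose roughly 40%" means 40% of the raw reads are discarded, so the raw target is the usable target divided by 0.6. If the speaker instead meant adding 40% on top, the target would be 140,000 raw reads per sample. The function name and parameters are hypothetical.

```python
def raw_reads_needed(samples: int,
                     usable_per_sample: int = 100_000,
                     loss_percent: int = 40) -> int:
    """Raw reads a run must produce so each sample keeps enough usable reads.

    Assumes loss_percent of raw reads are discarded (poor quality, or a
    barcode missing at one end). Integer arithmetic avoids rounding surprises.
    """
    if not 0 <= loss_percent < 100:
        raise ValueError("loss_percent must be in [0, 100)")
    usable_total = samples * usable_per_sample
    kept_percent = 100 - loss_percent
    # Ceiling division: round up so the target is never undershot.
    return -(-usable_total * 100 // kept_percent)


if __name__ == "__main__":
    print(raw_reads_needed(1))    # raw reads for one sample
    print(raw_reads_needed(48))   # raw reads for a 48-sample flow cell
```

Under that assumption, a 48-sample flow cell needs about 8 million raw reads before the run can be stopped.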
Yeah, there is a clear difference. Especially on the Illumina platform, you do get a fair proportion of primer dimer and odd bits and pieces, and I think it's mainly because you're sequencing so deeply. I mean, we're talking 1000x to 3000x on a given sample, so you do just get random bits and bobs. But those are quite obvious compared to clean, fully mapped sequence reads that do map to the genome, and you can decipher that quite well. But it does mean that you have to take a little bit of care and not just naively accept what the consensus sequence would be if you used every single mapped read. Yeah, I think there's a slight difference between Illumina and Nanopore in the way that the libraries are made. In our Illumina pipeline we have a second PCR, because we use Nextera and there's a Nextera PCR involved. So you get a second amplification stage, and that can lead to additional problems with contamination. But, you know, these are very sensitive PCRs. There are a lot of amplicons floating around the laboratory because they have to be sequenced, so tubes are open post PCR and things get contaminated in the lab quite easily. So we have to go to a lot of effort to make sure the sequencing does not get contaminated. And as Andrew says, from time to time it does happen, and when that does happen, you know, we are not willing to trust that run. So stuff has to be repeated, and we just basically repeat the run where a blank is obviously contaminated. And anything you want to add, Andrew, on, say, clustering and phylogenetic analysis? Absolutely, that's the most interesting bit, I think, personally. So the thing that people really want to know is how similar is this genome I have here to everything else that's been sequenced. That gives a rough idea of whether it's the same outbreak or a different outbreak or whatnot, and it can help rule things out.
What we do is we upload our data to COG, and then Andrew Rambaut's group run Pangolin and things like that over it, and it gives everything a lineage. It gives it multiple lineages, actually. It gives it a global lineage, and this is applicable globally, so it'll start with A or B and then a few numbers. And you also get UK lineages, which are more fine-grained towards the UK, and that gives us an idea of what's circulating in the UK and what's, you know, coming to a region, how it's evolved, that kind of thing. And it helps us to communicate between ourselves and with other people like Public Health England and Test and Trace and the hospital. And to achieve this we do it in different ways. So here in Quadram, Leo has made peroba, and that gives us lineages very quickly and builds the trees in a particular way. And Andrew Rambaut's group will build a tree, and if you think about it, you build a tree of life and then you kind of draw circles around different parts of the tree and you say, okay, well, these are clusters, and they have the same name and they're different in some way to another cluster. Often there will be defined differences between clusters, some kind of genetic variation, or maybe a set of genetic variants, and from there you get these numbers. But when we get data just straight off the sequencer, we want an answer very quickly. So what we do is we run a piece of software called civet, and that's again from Andrew Rambaut's group, I think it's Áine O'Toole and Verity Hill. And that is a fantastic piece of software. You just shove the genomes in and it produces a beautiful, pretty report. It tells you exactly what your genomes are similar to, how many variants are there, and what the variants are, and it very clearly tells you what you're dealing with. And it's quick as well. It only takes about an hour to run, and so we use that straight after everything comes off the sequencers.
So Andrew are you telling me that bioinformatics really is just putting the data in, pressing a button? Yeah but the hard part is interpreting it you know and then the other hard part is... I don't know what to get paid for, I really don't. So I'll take the report and then I'll go and do some more in-depth analysis you know kind of go backwards and see what else is in the cluster because it only tells you your stuff then it kind of collapses all the trees and everything to give you like just the absolute essential information but I like to go back into the wider clusters and then kind of search, go fishing more widely and see what else is there in Norfolk, where is it spread, this kind of thing and getting a you know a more general idea of what do we know, what are we missing, filling in the gaps and then also what are unique new introductions into the region and that's important as well because if you have maybe one week you've seven brand new lineages coming into the region that have never been seen before maybe they were here last April and have disappeared and have gone off and come back again, that's important information to know you know because the rate of new introductions will tell you I suppose how big of a problem you have coming down the road. I think at the moment we were having well last week there's about seven, this week about two introductions so it's at a reasonably low level but all it takes is one, all you need is that you know one spark and maybe in a factory or something like that or in a care home and then you know all hell breaks loose. 
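The weekly count of new introductions mentioned here can be approximated by asking which lineages appear in the region for the first time in each week. The sketch below is one simple way to do that under stated assumptions: each record is a hypothetical `(iso_week, lineage)` pair, the lineage labels are invented for the example, and a lineage that disappears and later returns is only counted once, which is cruder than the analysis described.

```python
from collections import defaultdict


def new_lineages_by_week(records):
    """Map each week to the lineages first observed in the region that week.

    records: iterable of (iso_week, lineage) pairs, e.g. ("2020-W35", "UK5").
    ISO week strings of this form sort chronologically as plain strings.
    """
    seen = set()
    first_seen = defaultdict(list)
    for week, lineage in sorted(records):
        if lineage not in seen:
            seen.add(lineage)
            first_seen[week].append(lineage)
    return dict(first_seen)


if __name__ == "__main__":
    # Invented example data, not results from the study.
    records = [
        ("2020-W35", "UK5"),
        ("2020-W36", "UK5"),
        ("2020-W36", "UK12"),
        ("2020-W35", "UK7"),
    ]
    for week, lineages in new_lineages_by_week(records).items():
        print(week, len(lineages), lineages)
```

Tracking the length of each weekly list over time gives the rate of new introductions that the speaker uses as an early warning signal.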
Thanks Andrew, you touched on a number of those, I think one of the most interesting things that I think is actually the diversity of lineages that we found here in Norfolk, so there's a hundred distinct UK lineages that were identified and one of the things I would say, so we talked about Norfolk having a stable population, low density and stable population but what we've seen here in terms of the diversity of the lineages is similar to what has been observed in the rest of the UK and in other regions as well, so it doesn't mean that people aren't coming and going and going on holiday so there's lots of opportunities for introductions and so what we're seeing in this county or in this region is similar to what has been observed elsewhere. Now that being said, digging around in some of these data as Andrew was saying we found some interesting aspects, so for example there are 16 UK lineages that were found in key workers that were not observed in patients or in other samples, so I think that's a really positive finding actually because of course it's not great that individuals have gotten affected but it has also demonstrated that the infection control practices were adequate so it prevented the onward transmission of those lineages or those infections in those key workers into the patients or the community that they were serving, so I think that's a really positive finding actually and also we were able to detect these kinds of things because we sequenced this so heavily, so it's about 42% or 43% of all positive cases that were identified in this region have been whole genome sequenced, so that gives us an unprecedented resolution in order to investigate these different potential transmissions or the diversity and try and pick apart some of the epidemiological links and what I also want to say here though is the reason we're able to do this and to be able to identify some of these cases or incidents that we're going to talk about in a few minutes is because we 
were working so closely with the hospital, with the NHS microbiology labs who are doing a fantastic job and are working really hard to make sure that all the samples are processed. And also with data scientists. So we're working with folks who can get the clinical data and can match it up to the samples that we have. And we're also working, as we said, with PHE and the track and trace. So it's actually the combination of the whole genome sequence data and the metadata and the clinicians and everyone working together that has allowed us to identify a lot of these situations where we can actually make a difference. And so Andrew, I'll hand it over to you to talk about maybe one or two of these cases. Well, first, just to mention, out of all the cases we sequenced, it was, well, one and a half thousand individuals. When it came to actually just high quality sequencing, there was about a thousand samples there, you know, and high quality is above, the consensus genome is above 90% and there's no contamination and, you know, it's all contiguous. And actually when we compared our high quality sequences to everyone else in the world, we found that actually we're kind of, if we were a country, we'd be, you know, number six in the world for the number of genomes you sequenced. So only five countries have sequenced more than us, you know, including the UK, US, people like that, you know. So it's pretty good going. We've done quite a reasonable job of getting high density sequencing there. And it's a tiny region as well. It's only what, a million people in Norfolk? Compared to the US is about 300 million. So we've done a reasonable job in terms of density and we can answer a lot of questions that people can't answer if you, unless you have done a lot of sequencing. It has allowed us, you know, by doing all the sequencing, we've samples from basically the entire first wave. So all the way from March up to August is the time period we covered. 
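The "high quality" definition given above (a consensus genome more than 90% complete, with no contamination) can be sketched as a simple filter. This is an illustrative reading of that definition rather than the group's actual QC code. The function names are hypothetical, completeness is taken to be the fraction of unambiguous A, C, G or T bases, and the contiguity criterion is left out.

```python
def genome_completeness(consensus: str) -> float:
    """Fraction of positions called as an unambiguous base (A, C, G or T)."""
    if not consensus:
        return 0.0
    called = sum(1 for base in consensus.upper() if base in "ACGT")
    return called / len(consensus)


def is_high_quality(consensus: str,
                    blank_contaminated: bool,
                    min_completeness: float = 0.90) -> bool:
    """Apply the two criteria quoted in the episode: above 90% and a clean blank."""
    if blank_contaminated:
        return False
    return genome_completeness(consensus) > min_completeness


if __name__ == "__main__":
    toy = "A" * 95 + "N" * 5          # 95% of positions called
    print(genome_completeness(toy))
    print(is_high_quality(toy, blank_contaminated=False))
    print(is_high_quality(toy, blank_contaminated=True))
```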
And we've sequenced as much as we could. We couldn't get everything, you know. A lot of samples. I don't know what happened, but it didn't make it to us. And unfortunately, a lot of the early questions from clinicians, they all seem to be samples that we hadn't actually sequenced. So it was a bit embarrassing initially. You know, they gave us a list, it was like 20 samples, these are the ones they're interested in. And we were like, well, sorry, you never sent them to us. But however, as time has gone on, we've actually, we've gotten such a nice big overview of what's gone on, that it's become more and more and more useful. And I think as time goes on, the sequencing data is going to become even more useful to everyone involved in outbreaks, particularly with rapid turnaround and better coordination and discussions with all of the different partners involved in public health response. So I wanna fire through a couple of key questions that everyone kind of asks about COVID at the moment. And then I think I'd really like to hear more about these deeper vignettes that we've been touching on throughout the episode. So first off, what is this mutation rate of coronavirus as far as Norfolk is concerned? So the evolutionary rate is about two SNPs a month. And it's actually, it's like clockwork, you know, when you're looking at data, like it's like bang, bang, bang. And other people have said it's maybe 2.5, other people have said it's less, but it seems that that's approximately the number of SNPs you can expect from the original Wuhan reference back in January. And so actually, if you pick up any genome and you just count the number of SNPs differences, you can kind of get a rough idea of what time period that was collected in. And it's quite useful for research. All right, and another question everyone always asks about is the spike protein mutation, the D614G mutation. So what does that look like in Norfolk? 
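The clock-like rate of about two SNPs a month lends itself to a back-of-the-envelope date estimate. The sketch below assumes a reference date of 1 January 2020 (the speaker only says "back in January"), an average month of 30.44 days, and a hypothetical function name. It gives a rough guide only, since the quoted rates range from under 2 to 2.5 SNPs a month.

```python
from datetime import date, timedelta

REFERENCE_DATE = date(2020, 1, 1)   # assumption: "back in January"
DAYS_PER_MONTH = 30.44              # average month length


def estimate_collection_date(snp_count: int,
                             snps_per_month: float = 2.0) -> date:
    """Rough collection date implied by SNP distance from the reference."""
    if snp_count < 0 or snps_per_month <= 0:
        raise ValueError("snp_count must be >= 0 and snps_per_month > 0")
    months = snp_count / snps_per_month
    return REFERENCE_DATE + timedelta(days=round(months * DAYS_PER_MONTH))


if __name__ == "__main__":
    for snps in (0, 6, 12):
        print(snps, estimate_collection_date(snps))
```

With the default rate, a genome 12 SNPs from the reference lands around the start of July 2020, and at 2.5 SNPs a month the same genome would be placed several weeks earlier.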
Well, back in March, it was kind of mostly the wild type, and then just the mutants came in, took over everything, and now it dominates completely. I mean, absolutely dominates. And it has been that way since about April. Okay, and another thing that's been recently floating around in the literature is COVID reinfection. So do you see any reinfection in the samples that you've sequenced? Well, actually, Nabil, I think you did this analysis, so you should be telling us. I just did the pairing of all the samples together, and no, we didn't see any case of reinfection in any of the longitudinal samples that we had. And okay, so we had quite a few, like we had a big data set, I think it was about 140, and we had up to six samples per patient. And fine, some of the samples were very short term, so a lot of, I think the median was about 13 days, and the mean was about 16 days in terms of time span for patients. But actually, we had some pretty long- term patients, and I think 71 days is the longest we have in our data, you know, with sampling all the way along. And in every single case, when we looked, there was no difference between them. It was usually the same, it was the same lineage in every case, bar, I think, two samples, and that was just, there was a little bit of ambiguity in there from Illumina sequencing, and actually, it was the same lineage. So we didn't find any cases of reinfection, but we are absolutely alert to it, and we are looking for it all the time, and I know clinicians are very much looking for it as well, particularly since we had the cases in Nevada and Hong Kong, and I'm sure it's gonna turn up in the UK, it's just a matter of time. And I think the first place we're gonna be looking at is the genomic data in the UK, because we have so much genome sequencing done here. 
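The pairing of longitudinal samples described above can be sketched as grouping samples by patient and flagging anyone whose samples carry more than one lineage. The record layout, patient identifiers and lineage labels below are hypothetical. As the two ambiguous Illumina samples show, a flag is only a prompt for closer inspection, not proof of reinfection.

```python
from collections import defaultdict
from datetime import date


def patients_with_lineage_change(samples):
    """Flag patients whose longitudinal samples do not all share one lineage.

    samples: iterable of (patient_id, collection_date, lineage) tuples.
    Returns {patient_id: (sorted lineages, days between first and last sample)}.
    """
    by_patient = defaultdict(list)
    for patient_id, collected, lineage in samples:
        by_patient[patient_id].append((collected, lineage))

    flagged = {}
    for patient_id, rows in by_patient.items():
        rows.sort()
        lineages = {lineage for _, lineage in rows}
        if len(lineages) > 1:
            span_days = (rows[-1][0] - rows[0][0]).days
            flagged[patient_id] = (sorted(lineages), span_days)
    return flagged


if __name__ == "__main__":
    # Invented example data, not results from the study.
    samples = [
        ("P1", date(2020, 4, 1), "B.1.1"),
        ("P1", date(2020, 4, 14), "B.1.1"),
        ("P2", date(2020, 3, 20), "B.1"),
        ("P2", date(2020, 5, 30), "B.2"),
    ]
    print(patients_with_lineage_change(samples))
```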
Clinically, it was always unlikely that the longitudinal data we had would show reinfection. These were patients who were still in hospital with the same episode, shall we say, of coronavirus, who had never left the hospital, so it was very unlikely that you were gonna find that they were reinfected with a different strain at any point. There has been a recent query, again, about a patient who had her first episode back in March and had become positive on a test recently, but when we looked into that further, there was a query over whether the machine used for testing had produced a false positive. Again, that person had symptoms for a long time. So I think, really, where you're probably gonna see the definitive evidence is in cases like the patient from Hong Kong reported in the literature. They had a screening test after returning home, having originally had it months earlier, and that test was performed randomly, and they picked up the infection almost accidentally, and they had the genomes from both episodes, and they were different, and that was a really clear-cut story. I think if it was a common thing, we'd see it a lot more, and there's only been two really confirmed, well, a few really well-confirmed cases around the globe. So obviously, there's more than that in reality, and it's hard to prove when it has actually happened, but I think in the UK, we're gonna be in a really strong position to do that, and I would imagine some of the first cases that we will see will come from healthcare workers, when we start to see people who were exposed to the virus during the first wave being exposed again and getting reinfected with a different lineage, and we'll be able to pick that up relatively quickly.
If that does indeed happen.
If that does indeed happen. Good point, and it might be a very rare event.
It might be that, because the virus isn't evolving very quickly, when you've developed your immunity to it, that immunity lasts, and it might last a year. It might last two years. It might last, who knows. But there was a lot of talk after those first patients were identified as being reinfected that, okay, that's proof now that immunity only lasts for four months. If that was truly the case, I think we would have seen a huge number more proven reinfections than we have seen. I think there's a lot to do with the host characteristics, you know, the individual people as well, that will have a lot to say on whether or not someone gets infected and how long immunity lasts.
Yeah, I'm sure it would be quite variable, but in general, we're doing a massive global experiment. There's been how many, we said 25 million confirmed cases, and yet we only have a handful of reinfections so far. So until we start seeing larger numbers of reinfections, we can say that, to date, immunity lasts at least six months, and, you know, who knows how long it's gonna last. We can only hope.
Let's move on to some of the longer narratives. We've had a lot of different numbers about what's going on, but I wanna hear some neat stories about what you found during this study. So let's ask Andrew. What can you tell us about anything in care facilities? You mentioned care facilities a few times.
We had this really interesting query come in, and the hospital were very interested in one care facility because there'd been a big outbreak there over the same period of time. And so I started looking at it, and I think there was, what, 15 samples? 14 of them happened to be from the same sub-lineage, which, you know, obviously is a problem. But then I started looking at, where else are samples from the sub-lineage found? And I looked on a map, and hey, presto, we have some hotspots. I was like, that's pretty weird.
So I go into the metadata, and I start looking around, digging deeper, and I find that quite a lot of people over the age of 85 are in particular areas on the map and in these hotspots, and they all have very similar addresses, or the same address. And so that indicated that something's going on here. And, you know, when you have a lot of elderly people with the same lineage in the same place, it probably means care homes. So we went back and, working with the hospital, we uncovered and unravelled this and found that actually, yes, there was not just one care home but six care homes. And digging into it, you know, this was the same strain, like 100% identical in a few cases, going between all of these different care homes. And that's not good. You know, the care homes had taken extreme precautions to try and keep all of the infection out, but obviously there had been a chink in the armour somewhere and then it had been allowed to spread between care facilities. And when I looked at one small town, which had two care facilities with the identical strain in both, and then at the wider area that it was in, there wasn't a match-up between the lineages circulating in the locality and the lineages in the care home. So the care home had one lineage and then in the locality there was like 13 different ones. So they'd done a good job of keeping the local lineages out, but not of keeping out the care facility lineage. You know, it was kind of an attack from the inside, unfortunately. And it was pretty serious. So that was the first care-home-only sub-lineage we identified. And we didn't find the same lineage in the hospitals or in the community. So luckily that lineage ended and became extinct. So the outbreak ended. But it does give some important lessons for what we should be doing.
We should be monitoring care facilities very closely, we should be sequencing the hell out of them, and we should be looking for all these hidden links between organisations so that we can keep people safe.
Yeah, but it's reassuring that this was able to be quickly identified. And on top of that, that there wasn't any influx from the outside community. I mean, the fact that we're able to silo this is reassuring.
Yeah, I think we have to say we were lucky in the sense that the lineage that we identified in the care homes was sufficiently different from the circulating lineages for us to be very sure of how it moved around the region. And I think that's something we've run into in that context a couple of times. So if, for example, an outbreak in a factory or a care home etc. was caused by a widely circulating lineage, like, for example, the most widely circulating lineage in the UK, which is UK5. What's the other designation for that? B.1.1 or something is the international designation. But lineage UK5, if that was the cause of an outbreak, it would be much harder to prove that, you know, the community and the care homes were not linked. So you can prove when things aren't being transmitted with common lineages, but it's very difficult to prove what is being transmitted with common lineages. So in the care home outbreak, it was a particular, you know, sub-lineage as we call it, which caused the outbreak, and therefore it was very easy to see that lineage in our data and pick out that outbreak between care homes. If it was UK5, we would never have been able to do that.
Let's cross over to Alison. I've got a question about hospital outbreaks and other types of outbreaks. Have you found any other outbreaks during the study?
So I think that's what Justin was saying. Sometimes it's easier to rule things out than to rule things in when you're asking whether it's an outbreak.
So we had one example where we had a question in from Ipswich Hospital and they had a number of samples, a number of patients in their hospital, that they wanted to know if they were linked or not. And so these were samples that were collected over a number of months. And again, as Andrew was explaining for the previous example, we could see that most of those samples were from patients over the age of 65, so they were older. And then we were able to sequence those, and the majority had high-quality sequences that we were able to evaluate. And then what we could see is that out of 18 high-quality genome sequences, there was a total of six global lineages and eight UK lineages. So for the number of samples that we had, that's quite a high diversity. And as Justin was talking about, UK5 is the most common UK lineage. And so the most commonly observed lineage in the set of samples was UK5. But the diversity that we were looking at within that setting and within those samples also indicated that it wasn't a single sustained hospital outbreak. It was likely that the patients had acquired it from the community and not that it was transmitted within the hospital. So I think that's really important. And again, what that indicates is that the infection control practices that were happening were sufficient to make sure that there wasn't sustained or onward transmission once those patients got into the hospital. So I think that's really important. And that's where these kinds of questions that come in from hospitals or care facilities, and working closely with the clinicians, can have a really high impact.
And then we had another case where we were looking at an outbreak in a food processing facility. And so this is on a different kind of timescale. So this happened really quickly. And I think this is where the utility of the rapid response that we can do with whole genome sequencing is really important.
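To make the reasoning in the hospital query above a little more concrete, here is a small Python sketch that summarises lineage diversity in a set of samples. It is only an illustration under stated assumptions. The function name is made up, and apart from UK5 the lineage labels in the example are placeholders, chosen only so that the totals match the 18 genomes and eight UK lineages quoted above. It reports numbers and leaves the interpretation to people with the epidemiological context, because many distinct lineages argue against a single sustained outbreak, while a single common lineage like UK5 would not prove one.

```python
from collections import Counter


def lineage_summary(lineages):
    """Summarise how diverse a set of lineage calls is.

    lineages: list of lineage labels, one per high-quality genome.
    Returns the number of genomes, the number of distinct lineages,
    the most common lineage, and the share of genomes it accounts for.
    """
    if not lineages:
        raise ValueError("need at least one lineage call")
    counts = Counter(lineages)
    top_lineage, top_count = counts.most_common(1)[0]
    return {
        "genomes": len(lineages),
        "distinct_lineages": len(counts),
        "most_common": top_lineage,
        "most_common_share": top_count / len(lineages),
    }


if __name__ == "__main__":
    # Placeholder labels (UK_A ... UK_G are invented for this example).
    hospital = (
        ["UK5"] * 6 + ["UK_A"] * 3 + ["UK_B"] * 3 + ["UK_C"] * 2
        + ["UK_D", "UK_E", "UK_F", "UK_G"]
    )
    print(lineage_summary(hospital))
```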
So in August, so just last month, there were 35 positive samples from workers at a food processing facility that we sequenced very rapidly at the Institute. So it was less than 24 hours. And this was part of an outbreak analysis that was between the NHS microbiology lab at the Norfolk and Norwich University Hospital Trust, the Norfolk County Council, Test and Trace, Public Health England and ourselves. So working together, we were able to get high-quality sequences and assign lineages. And what we found was, in contrast to what I was just describing at the hospital, that all the genomes within the food processing facility had the same global lineage and the same UK lineage. So it confirmed that these genomes were similar to each other. And I think I'll hand over to Andrew, because there's a few defining SNPs with which we were able to define a sub-lineage within these data as well.
Yeah, so when we looked at it, we found that there was a very definite signal. We had, I think, two SNPs that we hadn't seen anywhere else before. So we're quite confident that any genomes that have these two SNPs are probably from the same outbreak. And you know, we knew that they were all collected from the same physical facility, when they were doing large-scale testing. So what happened was one worker tested positive, they turned up in hospital, and then they went to the food processing facility, and they discovered a very, very high positivity rate, and they did large-scale screening, and they found quite a lot of people. And we've kept an eye on this lineage, because it's so distinct. We think it probably came from one introduction, potentially from mainland Europe. And this particular sub-lineage hadn't been seen in the UK before. And we've kept an eye on it in all of the sequencing we've been doing over the past few weeks, and we've seen it pop up.
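The idea of "defining SNPs" described above can be sketched with plain set operations. This is a hypothetical illustration, not the pipeline used at the Institute: the function names, sample names, and SNP labels such as "C1000T" are invented for the example. A SNP counts as defining here if every outbreak genome carries it and no previously seen genome does. Newly sequenced genomes are then flagged if they carry all of the defining SNPs.

```python
def defining_snps(outbreak, background):
    """outbreak / background: dict of genome name -> set of SNP labels.

    Returns SNPs shared by every outbreak genome and absent from
    every background genome.
    """
    if not outbreak:
        return set()
    shared = set.intersection(*outbreak.values())
    seen_elsewhere = set().union(*background.values())
    return shared - seen_elsewhere


def carries_markers(genome_snps, markers):
    """True if the genome has every marker SNP (and there is at least one)."""
    return bool(markers) and markers <= genome_snps


if __name__ == "__main__":
    # Invented example data; SNP labels are placeholders.
    outbreak = {
        "worker_1": {"A23403G", "C1000T", "G2000A"},
        "worker_2": {"A23403G", "C1000T", "G2000A", "T3000C"},
    }
    background = {
        "earlier_1": {"A23403G"},
        "earlier_2": {"A23403G", "T3000C"},
    }
    markers = defining_snps(outbreak, background)
    print(sorted(markers))

    new_samples = {
        "hospital_1": {"A23403G", "C1000T", "G2000A"},
        "hospital_2": {"A23403G", "C1000T"},
    }
    for name, snps in new_samples.items():
        print(name, carries_markers(snps, markers))
```

Requiring all markers keeps the match strict. With only two defining SNPs, a genome with low coverage at either site would be missed, so in practice ambiguous positions would need a separate look.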
And you know, some of these are going to be people who are working in this factory and have now developed COVID symptoms that are severe enough to warrant going to hospital, or maybe people who lived with them in the same households, this kind of thing. And so we're starting to see that in hospitals. But also we're starting to see random samples appearing in Norfolk, and even further afield, unfortunately, which is an indication that this has spread beyond just one facility and containment hasn't been as good as we would have hoped. But it can be quite difficult when you're dealing with a large number of people in one place, and you know, they may have scattered to the wind or whatnot, or they may not have followed the self-isolation as strictly as you would have hoped. And unfortunately, I think these things have gone beyond the facility. But knowing exactly what's there and what the cluster is and what the lineage is means that we can track it and we can trace it back. And I'm sure we'll probably look back in a few weeks' time and we'll know the full extent of what's happened. Hopefully it'll be contained.
All right, and any final statements from the rest of you? I'm particularly interested in what tips you would give anyone else looking to set this process up in their region. Or anything else you wanted to add?
You know, going into this, I wasn't sure how helpful constant sequencing of the genome of a virus with a relatively slow mutation rate would be. I'm not an epidemiologist or viral epidemiologist. But it's turned out to be extremely helpful in many cases. And with a lot of the stuff that we've done here in Norfolk, we've been able to help out with the public health and Test and Trace efforts that are ongoing in the county council and nationally. So I would say it's very much worthwhile.
And it's proving itself more and more. The government in the UK are backing it heavily and really want this to be turned into a service rather than a research grant. So I think that if this is something that you're considering doing, there are a huge number of resources out there already for you. You don't have to reinvent the wheel. The ARTIC protocol by the ARTIC Network and Josh Quick is extremely useful. And, you know, the lab methods are highly optimized and the analysis tools are there, and so you can set it up. If you are struggling with anything, I think you can contact me if you have issues with the lab setup. But I do think it's quite straightforward now and I think it's extremely valuable, and if you want to get in touch with your local hospital or public health agency and offer to do it for them and show them what can be done, I think they would be very keen for you to help out.
I think the other thing to say along those lines is that this was only possible through the collaborative efforts of many people, on the sequencing side, on the bioinformatics side, the clinicians, the data managers and the epidemiologists. So I think to get the most out of any data, including the whole genome sequence data, you need to have everyone working together.
Yeah, and a lot of people did this, you know, with no payment involved. This was all people volunteering to do this in a difficult time, and a huge number of people were involved and helped out, and some of them continued to stay involved and some people had to go back to their normal day jobs. But yeah, you know, it was a huge effort and it's resulted in a lot of really fascinating data. A lot of people need to be thanked. And yeah, there is a huge amount of work being done by people.
I mean, I remember last Friday at 11 o'clock at night, half my team seemed to be online analysing data because it had just come off the sequencer, and everyone was digging in and seeing what's there, how it is related and what interesting stuff you can get, so we can feed it back immediately to Test and Trace and to Public Health England and the hospital. So that's real dedication there for you, and I know people are very, very willing to dig into this because it is such an important public health issue in our daily lives, you know, it's taken over.
Okay, so that's all the time we have for this episode. We've been talking about genome surveillance efforts of SARS-CoV-2 in Norfolk within QIB as part of COG-UK. Much of what we're talking about will be available in a preprint on medRxiv quite soon. And if you want to learn more, you can have a look at that. And finally, I'd like to thank our guests, Alison and Justin, for being on. And hopefully I'll see you next time on the MicroBinfie podcast.
Thank you all so much for listening to us at home. If you like this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group and edited by Nick Waters. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.