Hi, and welcome to the Micro Binfie podcast. I'm Nabil, your host for today, and we've had a lot of people asking us to talk about COVID-19, so we put together this special episode to talk about it. Dr. Andrew Page is joining me today in his capacity as head of informatics at Quadram Institute Bioscience. And we have a special guest, Dr. Justin O'Grady, who is a group leader at the Quadram Institute and associate professor of medical microbiology at the University of East Anglia. Both have been leading the SARS-CoV-2 sequencing effort in our region, and so far the team has sequenced 1,500 SARS-CoV-2 genomes. So we'll be talking about how the UK effort is organized, what's been happening in the lab, and how that flows into the bioinformatics. I should mention that we live in Norfolk in the UK, which is a pretty small geographic region of about 900,000 people. So we have one genome for every 600 people, which is probably some of the densest sequencing coverage in the world.

So first off, Justin, who are you and what do you normally do?

I think you've introduced me quite well, but yeah, I usually work on medical microbiology applications, particularly diagnostics. I work on rapid detection of pathogens in clinical settings and also in food. More recently, I've been applying clinical metagenomics, that is, metagenomic sequencing of clinical samples, to detect pathogens faster than we currently do by culture.

Just in case anyone's been living under a rock, what is COVID-19, and what is SARS-CoV-2 or hCoV-19?

Well, it's a global pandemic; SARS-CoV-2 is the virus that caused the pandemic we're currently experiencing in every country in the world. It started, they believe, in December last year, some say maybe as early as November, and then it rapidly spread across the world, as we all know, and caused major disruption to our lives.

What is COG-UK, and how does that fit into COVID-19?

Well, COG-UK is a UK consortium of people who are trying to sequence the coronavirus genome: the COVID-19 Genomics UK consortium. There are 16 sites across the country, mainly academic labs, public health labs, and research institutes like the Quadram Institute and the Sanger. It covers the whole of the UK, so Wales, Scotland, England, and Northern Ireland. The idea is that we try to get as much coverage of all the regions across the UK as possible, and we try to build a family tree of the virus here in the UK, to show how it was introduced and how it has spread across the country. We can also use it for more in-depth analysis, for example to look at outbreaks in care homes or workplaces as the outbreak moves on.

Okay, so how did you both get involved in all of this?

Well, in March, I was on jury service, and an email came in to Justin and me asking, did you want to be part of a consortium to do some sequencing of coronavirus? And I thought, oh yeah, sure, why not? And it came from a guy who shares my office 20% of the time but runs the pathogen genomics service at Public Health Wales. So it was reusing existing connections and building a consortium, and in literally no time this entire consortium came together. And if you're not familiar with grants and that kind of thing, normally putting together a consortium would take months and months, or even years. This seemed to be a matter of a week or two.
And to be honest, we started doing everything before the paperwork was even done or signed, because this is obviously such an important thing to get done.

What about you, Justin?

Yeah, well, I got the same email as you did, Andrew, and then talked to John, our boss here, and then talked to Ian, the boss, and we discussed whether the Institute could deliver on such a program. We decided we should do it and that it would be a good idea to get involved. But like you, I thought it would take time, and I thought this was just an initial tester from them to see who would be interested. What I didn't realize at the time was that a grant had already been written and submitted, and it was only a few days after we received the email that they were awarded the grant. And then, within two weeks, we were sequencing genomes.

That's pretty fast.

Yeah. Particularly for things like ethics and building all the connections; it's just insane, really, when you think about it. And I think it was only possible because we had a lot of existing connections into the local hospital. For those of you who don't know, the Quadram Institute is based on a campus that also has the regional hospital, the Norfolk and Norwich University Hospital (NNUH), and the medical school for the local university. So within one small area, you have everyone you need to make a project like this happen.

It was fortuitous in some ways that we had everything. We had the doctors already on board from other projects. But what we had to do was go through the biorepository, because that allowed us to put in place the ethics we needed to collect these samples and start working on them almost immediately, long before the official ethics for the study had come through the COG-UK consortium from Public Health England. So we were able to get up and running very quickly, with the virology department giving us the samples and the biorepository handling the samples, handling the metadata, and anonymizing the samples for us so that we could proceed with the work.

So just to clarify, you both mentioned this biorepository and this interface between the hospital and our own research institute. How is the biorepository actually structured, and how does that help?

Well, the biorepository is an NNUH organization. It is able to collect excess samples from laboratories and allow researchers to test those excess samples under an overarching ethics approval that it holds. So they can give us access to these samples to test them. And then if we want any information on the patients, the patient metadata, we would need to get local ethical approval to get that information.

All of the work that you're talking about is operating under lockdown conditions. How has that affected the ability to respond so quickly?

Well, that's a good question. The good thing from the analysis point of view, and I'll let Andrew discuss this, is that you guys are able to work from home a little more easily than we are. But from the lab perspective, we had to get people to volunteer to come to work. We put out a call asking who would be willing to come and do some work on coronavirus sequencing, and a number of people responded, many from my group, because they have a lot of expertise in this clinical sequencing area anyway.
And so, yeah, we had to put together a bunch of volunteers who would be able to handle each part of the process. The process would start with sample collection from the clinical virology lab; then the samples would be brought back to the Quadram Institute, where we would make cDNA and do the PCR; and then we would move the samples to sequencing, and we had other people to do that job. And then the data would be sent to the bioinformatics team, and I'll let Andrew take over from there.

There were a lot of people involved; there must have been, what, about 26, I think, at one count? So huge numbers everywhere. And then everyone needed to have a backup, just in case they got coronavirus, because obviously that's a big risk as well.

We did try to separate teams at one point. We wanted independent teams that wouldn't be working at the same time in the lab, working in, shall we say, pods, so that they couldn't give the coronavirus to each other. If one pod went down, we'd be able to switch to another. But luckily we didn't have to make use of that, because nobody got coronavirus except me.

God help you. And was it bad, Justin?

I've had worse flus, but there were more lasting effects, I would say. I would even still sometimes notice a lack of energy, or a breathlessness at the top of the stairs. With other flus, once they were finished, you'd feel bad for a couple of days, but this seems to have dragged on with minor symptoms many weeks later.

So you're very dedicated to the cause.

Oh yes, yes, yes.

That's excellent. So I'm going to turn it over to talk about some of these different wet lab methods you've been using. If you're reading the literature, there are many different methods for sequencing SARS-CoV-2. Which did you pick, and why?

Well, we chose the ARTIC Network protocol, for a number of reasons. The people who developed the ARTIC protocol, Josh Quick, Nick Loman, and others, are heavily involved in the COG-UK project, so it made sense that we would use it. But also, it's probably one of the best protocols there is globally; in terms of sensitivity and specificity, it seems to be very good. It's a 98-primer-pair tiling PCR approach. There are other protocols out there which use larger amplicons: this one uses 400-base amplicons, whereas others use 1 kb or 1.2 kb amplicons, but they're not quite as sensitive. So I think it's a good balance between sensitivity and the data yields we get from either MinION flow cells or Illumina sequencing.

And there's no option for direct sequencing? It has to go through some amplification?

Well, the virus was discovered using metagenomics. There was a patient in China with an unknown severe respiratory infection, and they knew it was probably related to a coronavirus, but their existing coronavirus primers, their SARS primers and so on, weren't working. So they used metagenomics to sequence some samples, and they were able to detect and sequence the virus that way. But the problem with that approach is that it isn't as sensitive as a tiling PCR approach.
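To make the tiling PCR idea concrete, here is a minimal sketch in Python of an idealized scheme. It uses evenly spaced 400-base amplicons rather than the real ARTIC V3 primer coordinates, but it shows how 98 overlapping amplicons can span the ~29.9 kb SARS-CoV-2 genome, and why neighbouring amplicons are placed in alternating PCR pools so that overlapping products are never amplified in the same tube.

```python
# Idealized tiling scheme: NOT the real ARTIC V3 primer coordinates,
# just an illustration of how ~98 overlapping 400 bp amplicons can
# span the ~29.9 kb SARS-CoV-2 genome.
GENOME_LEN = 29_903   # Wuhan-Hu-1 reference length (MN908947.3)
AMPLICON_LEN = 400
N_AMPLICONS = 98

# Space amplicon starts evenly so the last one ends at the genome end.
step = (GENOME_LEN - AMPLICON_LEN) / (N_AMPLICONS - 1)

amplicons = []
for i in range(N_AMPLICONS):
    start = round(i * step)
    end = min(start + AMPLICON_LEN, GENOME_LEN)
    pool = 1 if i % 2 == 0 else 2  # neighbours never share a tube
    amplicons.append((start, end, pool))

print(f"step ~{step:.0f} bp, neighbour overlap ~{AMPLICON_LEN - step:.0f} bp")

# Sanity check: every genome position is covered by at least one amplicon.
covered = [False] * GENOME_LEN
for start, end, _ in amplicons:
    for pos in range(start, end):
        covered[pos] = True
print("fully tiled:", all(covered))
```

In this idealized layout, adjacent amplicons overlap by roughly 95 bases; the real ARTIC scheme places primers at hand-tuned positions, so its overlaps vary.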
So with metagenomics you would get full genomes from patients who had high viral loads, but you wouldn't from patients with lower viral loads. The tiling PCR approach gives you more complete genomes from more patients.

And that must have introduced several issues into the lab. What have you had to deal with in the lab in terms of the sequencing?

Yeah, so the protocol itself is somewhat laborious. It takes quite a bit of time, maybe seven or eight hours, to go from sample to sequencing. There are several steps along the way, and you have to have fairly robust sample handling procedures so that you don't mix patient samples up, and so you record them appropriately. Then there are challenges all the way along the protocol: you're generating very high concentrations of coronavirus sequence and PCR products, then you're washing them and getting them ready to make a library for sequencing. There is a lot of coronavirus nucleic acid moving around the laboratory, and that can cause issues and headaches with contamination.

So how much virus are we talking about? What kind of sequencing coverage does this translate to? Are there any measurements of how much virus there is?

We get the sample from the clinical virology lab. Sometimes we get it just in viral transport medium and have to do the RNA extraction ourselves, but a lot of the time we get the excess diagnostic sample that was tested by the clinical virology laboratory. They have tested it using a qPCR assay, which is quantitative, so it gives you a CT measurement, and that CT measurement is related to how many viral copies were in the sample. If the CT is in the thirties, there are only tens to hundreds of copies of the virus; if it's in the twenties, there are thousands to hundreds of thousands of copies in the sample. So we would know the concentration of the virus in the particular sample we were testing. The ARTIC protocol works quite well up to about a CT of 32 or 33, which is probably about a hundred copies of the virus in the sample. The protocol still works above that, but the genome coverage reduces as you get fewer copies of the genome.

Any tips for anyone listening at home to avoid these issues? What would you do differently if you had to do it again?

Yeah. You don't want to waste a lot of sequencing effort, but you also don't want to bias your sample collection by only sequencing lower-CT samples. You could put an artificial cutoff there and say, okay, I will only sequence samples with a CT below 32. But then you might miss some biology, in that you would only collect samples from patients who had a relatively high viral load. These might be asymptomatic patients, for example; imagine there was a certain lineage associated with asymptomatic patients. That's not the case, to be quite honest with you, but we didn't want to miss any biology, now or in the future, by putting a cutoff at CT 32.
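As a rough illustration of the CT-to-copies relationship Justin describes: qPCR CT is log-linear in starting template, with each extra cycle roughly halving the detectable input at perfect efficiency. Here is a minimal sketch, where the calibration point (CT 32 ≈ 100 copies) is a hypothetical anchor chosen only to match the ballpark figures quoted above, not a validated standard curve.

```python
# Rough CT-to-copies conversion. qPCR CT is log-linear in template
# amount: at ~100% efficiency each extra cycle halves the input needed
# to cross the detection threshold. The calibration point below
# (CT 32 ~ 100 copies) is a hypothetical anchor, not a real standard curve.
CAL_CT = 32.0
CAL_COPIES = 100.0

def estimate_copies(ct: float, efficiency: float = 1.0) -> float:
    """Approximate viral copies in a sample for a given CT value."""
    fold_per_cycle = 1.0 + efficiency  # 2.0 at perfect efficiency
    return CAL_COPIES * fold_per_cycle ** (CAL_CT - ct)

for ct in (22, 27, 32, 37):
    print(f"CT {ct}: ~{estimate_copies(ct):,.0f} copies")
# CT in the twenties -> thousands to hundreds of thousands of copies;
# CT in the thirties -> tens to hundreds of copies, matching the
# figures quoted in the discussion.
```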
So what I might recommend for labs doing this is not to waste too much money sequencing samples that will fail quality control thresholds. After the RT-PCR is performed, you could look at the concentration of nucleic acid by Qubit, and if it was, say, below two nanograms per microliter, you might decide not to run that sample through your sequencing protocol, because it's quite likely to fail at that stage.

And I know that we've been using Illumina over Nanopore. Why the decision to use the Illumina platform?

It's really because of throughput. Nanopore is flexible in terms of throughput if you're only looking at up to about 23 or 24 samples per run, but we were dealing with far higher numbers than that during the peak of the outbreak in April and May; we needed to sequence hundreds of samples per week. So the choice was easy, really. Once you get up to hundreds of samples, Illumina sequencing is quite flexible in terms of how many you can add, the cost is appropriate, and it's probably the right choice of technology, whereas you'd have to do several Nanopore runs to reach those kinds of numbers using the ARTIC Network Nanopore protocol.

So at this point, we've generated some sequence data, so I want to bring Andrew in to talk about some of the bioinformatics involved. Andrew, could you take us through the basic bioinformatics workflow for COVID-19?

Yeah, sure. When the sequences come off the sequencers, and I think the highest we had was 384 in a single run, it's all hands on deck. We actually have a Discord server running in the background where we all communicate all the time. When the sequences come off, they get processed through a pipeline. If it's Nanopore data, it has to get basecalled with Guppy; if it's Illumina data, it goes through the standard bcl2fastq conversion. And then we start doing some interesting work. For Illumina, we primarily use iVar. That, first of all, has to trim off the primers: there's synthetic sequence from the ARTIC protocol to enable the tiling PCR, and you have to mask that out. Otherwise you're going to see variants that may not really exist; they may just be in the synthetic primer sequence, not in the genome you're sequencing. So we mask those out with iVar, and then you build a consensus sequence. We're using a Nextflow pipeline developed by Tom Connor's lab, which we've tweaked slightly, and it more or less does these steps for us. So we get a consensus sequence out the other end, plus a BAM file with the reads mapped back to the reference.

And then we do some QC on these. We want to see: how much of the genome have we captured? Are there places dropping out? Early on we had problems with one set of primers where three specific regions would always drop out, but that's improved now that we've got primers from a different company, and I think there's been a little bit of fiddling with temperatures and such. We're doing all right. We never really get the ends of the genome, unfortunately; that's just how it is. We like to see reasonable coverage, but with amplicon sequencing you get huge differences in coverage throughout the genome. It's just what you've got to deal with, and as long as your algorithms can work with that, you're fine. A big mistake people make is to blindly take amplicon data and assemble it, or call variants directly from it, without considering that it needs to be treated in a separate manner.
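For anyone who wants to see what the primer-masking and consensus steps look like, here is a minimal sketch wrapping the published iVar and samtools command-line tools from Python. The file names and the ARTIC primer BED file are placeholders, and a production pipeline, like the Nextflow one mentioned above, wraps read mapping, QC, and error handling around these calls.

```python
# Sketch of the amplicon-aware steps in an iVar-based consensus
# workflow. Commands follow the published iVar/samtools CLIs, but file
# names and the ARTIC primer BED file are placeholders; a real pipeline
# also handles mapping, QC, and errors.
import subprocess

def trim_and_consensus(mapped_bam: str, primer_bed: str, prefix: str) -> None:
    """mapped_bam: reads mapped to the Wuhan-Hu-1 reference, sorted and indexed."""
    # 1. Soft-clip ARTIC primer sequence from the mapped reads, so the
    #    synthetic primer bases can never be called as "variants".
    subprocess.run(
        ["ivar", "trim", "-i", mapped_bam, "-b", primer_bed,
         "-p", f"{prefix}.trimmed"],
        check=True,
    )
    subprocess.run(
        ["samtools", "sort", "-o", f"{prefix}.sorted.bam", f"{prefix}.trimmed.bam"],
        check=True,
    )

    # 2. Call a consensus from the pileup; positions below the minimum
    #    depth (-m) are masked as N rather than guessed.
    mpileup = subprocess.Popen(
        ["samtools", "mpileup", "-aa", "-A", "-d", "0", "-Q", "0",
         f"{prefix}.sorted.bam"],
        stdout=subprocess.PIPE,
    )
    subprocess.run(
        ["ivar", "consensus", "-p", f"{prefix}.consensus", "-t", "0.75", "-m", "10"],
        stdin=mpileup.stdout,
        check=True,
    )
    mpileup.stdout.close()
    mpileup.wait()

# Hypothetical usage:
# trim_and_consensus("sample01.mapped.bam", "artic_v3_primers.bed", "sample01")
```

The key design point is the minimum-depth flag on `ivar consensus`: low-coverage positions become Ns rather than guesses, which is what makes the genome-recovery QC discussed next meaningful.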
Right, so then we have a consensus genome, and it goes to Nabil for data submission to COG.

Yep, so my major role in all of this is to take that data, have a look, and have a poke at the QC for it. And then, if we decide that the run is good, we submit it up to the central COG-UK consortium. They have a nice server where you send all your files, and a system for uploading the metadata that we've collected from the biorepository and from the hospitals. That then trickles down into GISAID, which is the central database storing coronavirus sequences from across the world. And eventually it should be going into the standard repositories for sequencing data, like the ENA and the SRA and so on.

And just to note that only about 20% of genomes that get submitted to GISAID actually end up in the ENA or in GenBank, or sorry, the SRA, which is a bit of a problem, because if you don't have the raw underlying reads, then there are certain types of analysis you can't really do. You can't go back and reanalyze data; you just have to take the consensus sequences or assemblies people have deposited in GISAID. That's how it is; you can't double-check anything.

For me, it struck me as a little bit odd that GISAID only keeps track of the consensus sequence, but I kind of just went with it. They do have their own internal QC, where they look for abnormalities in the consensus sequence, like random frameshifts they hadn't seen before, to make sure the data is valid. But the fact that the raw underlying reads are not available for the community to go back to and double-check seems a bit strange. I have been wrangling with submitting our data directly to the ENA, and, as always, it takes time.

What's the cutoff that GISAID puts on submission of sequences in terms of genome coverage?

Their hard cutoff is 90%: they want to see 90% of the genome recovered, compared back to the reference. Within COG, they're happy to entertain things a little lower than that, so they take 50%. But even then we submit everything, because for these high-CT samples you'll get pretty poor genomes, and yet some people, I'm sure, will want to do some analysis on them and will be able to pull out something from somewhere.

Yep, we religiously put up all of the data that we get hold of, along with all of the CT values as well. We try to put up as much as possible so people can go back and do that QC. It would be very difficult to figure that out if you only ever saw the good, 90%-plus consensus samples.

I mean, it's a different question really with that sort of data; you're not using it for epidemiological tracing at that point.

I guess it's worth mentioning that it's not always straightforward, Nabil, in terms of duplicate samples and samples that have been poorly labelled, and it can be difficult to get the metadata right before submission. So we always double-check our data.

That's true with any project of this scale. As you've both mentioned, this coming together so quickly meant coming up with new systems, new protocols, and new avenues of sharing information very fast, and we've had to reinvent and solve a lot of these problems of how we represent our data, check whether the metadata is valid, and so on. Getting this right seems even more vital than in our normal work, and I think everybody's very much involved in trying to get it as good as we can.
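A minimal sketch of that coverage check: count the unambiguous bases in a consensus FASTA against the ~29.9 kb reference length and compare with the thresholds quoted above (90% for GISAID, 50% within COG-UK). The function and file names are illustrative, not the consortium's actual tooling.

```python
# QC check along the lines described above: what fraction of the
# genome did we recover? Ns (and anything other than A/C/G/T) mark
# masked or missing positions in the consensus.
REF_LEN = 29_903  # Wuhan-Hu-1 reference genome length

def genome_recovered(consensus_fasta: str) -> float:
    """Fraction of the reference length covered by unambiguous bases."""
    with open(consensus_fasta) as fh:
        seq = "".join(line.strip() for line in fh if not line.startswith(">"))
    called = sum(1 for base in seq.upper() if base in "ACGT")
    return called / REF_LEN

def submission_targets(fraction: float) -> list[str]:
    """Which public thresholds does this genome clear? (Raw data and
    low-coverage genomes are still archived and shared regardless.)"""
    targets = []
    if fraction >= 0.90:
        targets.append("GISAID")   # hard 90% cutoff quoted above
    if fraction >= 0.50:
        targets.append("COG-UK")   # consortium accepts down to ~50%
    return targets

# Hypothetical usage:
# frac = genome_recovered("sample01.consensus.fa")
# print(f"{frac:.1%} recovered ->", submission_targets(frac) or ["archive only"])
```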
Yeah, I think working in genomics, it depends on your area, but often in microbial genomics there wouldn't be a great need to reach clinical standards of reporting. We work on clinical metagenomics and therefore have some experience in this area, but certainly, compared to sequencing foodborne pathogen genomes and so on, sequencing the coronavirus from patient samples, making sure that you report the data accurately, and not making any mistakes with patient metadata is extremely important. It's another step up in how you have to work with your data.

It's quite interesting that a lot of people who traditionally worked on bacteria moved immediately into working on viruses, because virology is a tiny and very specialized area. So it is quite good that so many people were able to transition over so rapidly.

Yeah, traditionally I've done very, very little virology in my career, so this would be the biggest virology project I've ever done. I don't claim to be a virologist in any way, but the skills I have in microbial sequencing were transferable from bacteriology to virology without too much difficulty. I'm still not a virologist, though. None of us are.

Well, we should mention that there are virologists and specialists within the consortium who handle that particular component. It really is playing to everyone's strengths.

Yeah, we have a team of virologists in the clinical virology lab who support us in our understanding of the data. But the sequencing and analysis procedures are not significantly different for a virus than they would be for a bacterium.

So, what next? We're at the end of wave one; how are we going to do things differently for wave two?

Yeah, that's something that's evolving all the time. What we've really been trying to make sure is that we don't sequence 1,500 genomes and then do nothing with the data. There was some really interesting work done by Oliver Pybus and Andrew Rambaut, published recently, which Nick Loman discussed on the BBC Radio 4 Today programme: how many introductions of the virus were made into the UK. They reckon it was around 1,350 separate introductions, and that's probably an underestimate. So that's a very interesting application of the data we're producing. But we now need to move from an overall understanding of what happened and how the virus moved into the country to using this data to help in the outbreak control programs that will run across the country at various public health agencies and local county councils. So I think the next stage for us is to start working with local county councils and PHE so that, if and when the second peak arrives, as we expect, we can use the data in as close to real time as possible to help inform public health interventions.

So this would be the canary in the coal mine that warns us if something is going amiss, or if things are calming down?

Well, I think its strength is probably in demonstrating whether transmission in a certain setting is likely or not.
So for example, if you had a school setting, a care home, or a factory where there have been multiple cases of coronavirus, this is a way you can genotype them and figure out whether you have transmission from person to person within that institution, or separate introductions of the virus into that institution, and those two scenarios would require different public health interventions. So it goes beyond just a positive result to give much deeper information on where the virus has come from and how it's spreading.

So that's all the time we have for this episode. I'm Nabil, I was your host for today, and I was joined by Andrew Page and Justin O'Grady. Thanks for being on the show.

Thanks very much.

We've been talking about coronavirus and the efforts within the Quadram Institute, as part of COG-UK, in sequencing and combating it. Thanks for tuning in. See you next time on the Micro Binfie podcast. Thank you.