This time on the MicroBinfie podcast, we come from the ARTIC Network and CLIMB Big Data joint workshop on COVID-19 data analysis, held on the 14th and 15th of January, 2021. So just to introduce the panel members, or actually I'm going to allow them to introduce themselves: I'm going to chair, and we're going to hear from Andrew Page from the Quadram Institute, Áine, who you've met already, Will, who you've met already, and Anna Price from Cardiff University. I'm just going to ask each of the panellists to introduce themselves properly and maybe share a story about something in COVID genomics that's worked well last year, or perhaps has been a total disaster, whichever you fancy. So Andrew, can you say hello and introduce yourself? Hello, I'm Andrew Page from the Quadram Institute in Norwich, which is in the east of England, and I'm head of informatics and do bioinformatics normally. But for COVID, we've been doing lots of sequencing for our little region, working with five different hospitals, and we've sequenced nearly 10,000 samples at this point. Some good things that came out of it are that we are talking directly to the public health teams, the track and trace people who in the local area will go knock on doors, so we can feed them information and say, maybe this little cluster here is important, or looks a bit odd, and they can go and track it down. And also working with the clinicians in the hospitals. So they might come to us with questions like, do we have one outbreak here, or is it ten different random COVID cases coming in from the community? And we can give them a heads up that maybe there are linked cases here potentially, and they can go and do a bit more work, or not. It doesn't always work out.
Often we find that the samples clinicians are most interested in are the ones that they haven't given us, so we can't sequence them, or ones that have very high CT values, so there's a very low viral load, and of course we've no genomes for those. Thanks, Andrew. Áine, do you want to reintroduce yourself and share an anecdote? Yeah, no worries. I'm Áine O'Toole. I'm in Andrew Rambaut's lab up at the University of Edinburgh. I was due to finish my PhD last October, but the pandemic has sort of taken precedence over thesis writing, so I'm still a PhD student with Andrew at the moment. Yeah, it's been a busy year, I think, for everybody. I've learned a lot. Pangolin was the first packaged tool that I wrote, and over the last year, me and Verity and other people in the lab have been coming up with other tools to try and make it as easy as possible for the people who are generating the data to actually do outbreak investigations and get information out of it. And I think for me, civet has been a really useful tool, on my side and hopefully from the users' side as well. As you know, there were install issues and stuff like that yesterday on VirtualBox; we'd never really trialled that before. But yeah, it's been really useful, because as I mentioned yesterday, at the very beginning last year it was a very manual process of doing outbreak investigations. We had people sequencing down at the hospital, and basically I would get emails from multiple people asking, can I make a tree out of these sequences, and it was just too much to juggle.
So between me and Verity, we wrote this tool civet, and it meant that the ability to do the outbreak report was in the hospital then, and obviously there's been development over the last few months, but for me that's been really useful, because it meant people could do it when they wanted to do it, and they wouldn't be waiting for me to do it as well, and I think it brought that ability out a little bit more, which was really nice. So civet has been a success, I think; it's still in development. Yeah, and these tools have been incredibly useful, certainly in the COG project and also globally. I know they're extremely popular now, so it's really great that they can be made available in the way that they are. Okay, Will, do you want to say a few words? Hi everyone, I'm Will Rowe, a postdoc in Nick's group. I was working on the ARTIC Network grant before the coronavirus struck, and now I've been working on this, mainly the pipeline, which we were all using yesterday. In terms of anecdotes, something good which has happened is that it's been a really positive and encouraging experience to be a part of something where lots of bioinformaticians all come together to work on a shared code base. I've been used to siloed bioinformaticians going from start to finished product, just publishing a paper and that being it, but this is much more exciting, a more natural way of evolving something, where you've got a massive user base and lots of people saying we want this feature, or we want the reports done a bit differently. So that's been really encouraging, just to be exposed to how I would like bioinformatics to work more in the future. That's been really good. But at the same time, I guess you could see this as a negative, in that you're working in such a fast-paced area.
Lots of bugs get incorporated, which you've got to go and squash at a later date. So it's a bit of a poisoned chalice really, but it's been a really good experience, and it's nice to have something which is being widely used. We'd be really grateful for everyone's feedback and future suggestions on this pipeline and other software. That'd be great. Yeah. I mean, the only software with no bugs is software with no users, right? That's the idea. I think you're right. The team spirit, particularly on the COG project, has been immense in terms of fast-paced development, and that's something we really want to open up more to the rest of the world, which is what this workshop is aiming to facilitate. So thank you, Will. Anna, do you want to give us a few words too? Hi, I'm Anna Price. I'm a research software engineer at Cardiff University, working with CLIMB and Supercomputing Wales. I've been involved with two analyses at Cardiff in conjunction with Public Health Wales: an analysis of a new variant in Wales, and also an analysis where we try to determine the rate of importation of cases from England into Wales. So my work has mostly involved analysis of the COG-UK metadata and the information that can be gleaned from it: using the metadata to generate information on the geographic distribution of lineages, and also using it to look at population characterisation. In terms of things that have gone well, I think I have to highlight the efforts of Public Health Wales and the incredible amount of work they've done. Just in the last week, PHW have sequenced over 2,000 samples, which is an incredible effort. In terms of things that have gone badly, I have to agree with Will: having to work very quickly on software means sometimes bugs get introduced and things go wrong.
So yes, that's been a little bit pressurised, but there's been a lot of work done and a huge amount of data generated as well. It's been an interesting time. Thanks, Anna. OK, that's great scene setting. I think we're going to try and set this panel discussion up by theme. And the first theme is really to try and get a discussion going around: what do you need? We're really interested to know what the gaps are, what the challenges and the difficulties are. And on that theme, I'll take the first question from Christine. Hi, good morning. I just asked whether it would be possible to get examples of spreadsheets and sample ID schemes that have been used to collect metadata from healthcare institutions and public health bodies, so that rather than reinventing the wheel where we are, we could adapt what's being used already. For example, we've started doing our own sequencing, we have a terrible sample ID scheme, and I'm trying to force people to change it. I would love some advice, and we've had some advice, but it would be nice just to see something that we could adapt. Thanks. Great question. Who would like to take that first? I can take it maybe. So with metadata, you can go a bit crazy. One extreme might be PHA4GE, who have a metadata scheme which is the ultimate if you really like metadata. But at the other extreme, you have something very minimal. And what we found is that if you ask for too much, people don't give you anything, or they give you virtually nothing. So you have to be very careful about what you ask for and what is really, really important. We can provide a sample metadata spreadsheet if you want. And we've had to put in columns annotated for the office staff who might have to fill these in, because computing systems can be not joined up.
I know in our local hospital, we have people who have to go into multiple different systems to pull stuff out, because there are something like 150 different systems in the hospitals that we support. And then, say, last week someone changed how data was processed, and we ended up having a team of four people just tracking down the metadata for 200 individuals; they spent a couple of days on that. So it can be quite a challenge. If you do come up with a metadata scheme, make sure that you ask for the absolute minimum that you need to do your job. So that might be an age, gender, that kind of thing. Collection date is very important. And just a few other small things: are people seriously ill, are they healthcare workers, and location as well. But if you dig way too deep and you start having 200 different pieces of information people have to fill in, you're going to find they'll take shortcuts or they'll just put in random boilerplate. Yeah, so just to recap there: Andrew mentioned the PHA4GE project, which is a consortium run, I think, out of SANBI in South Africa, and they've published a basic metadata specification. So that's a good option. In terms of the COG project, Christine, we've published a paper, or it's a preprint, on our local database, which is called Majora. That's the one that collects all of the metadata. And in there, there's a metadata specification that you could look at as well. So those are two options to start with. Let's take this question now from Laurence, who's asking if the panel have any recommendations about how to assess intrapatient variability, so intrahost variation in SARS-CoV-2. That's one question, and a follow-up question is: can it be done with the ARTIC protocol, and can it be done on Oxford Nanopore specifically?
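To make the "ask for the absolute minimum" advice concrete, here is a minimal sketch of a metadata spreadsheet template of the kind described above. The field names are illustrative only; they are not the PHA4GE or COG-UK specification, which you should consult for real schemas.

```python
import csv
import io

# A minimal per-sample metadata schema along the lines discussed in the panel.
# Field names are illustrative, not any published specification.
MINIMAL_FIELDS = [
    "sample_id",             # local identifier matching the sequencing run
    "collection_date",       # ISO 8601 (YYYY-MM-DD); very important to have
    "location",              # coarse geography, e.g. region
    "age",                   # or an age band, to reduce identifiability
    "sex",
    "is_healthcare_worker",  # yes / no / unknown
]

def write_template(handle):
    """Write an empty spreadsheet template containing just the header row,
    for office staff to fill in one row per sample."""
    writer = csv.DictWriter(handle, fieldnames=MINIMAL_FIELDS)
    writer.writeheader()

buf = io.StringIO()
write_template(buf)
print(buf.getvalue().strip())
```

The point is the brevity: six fields that someone juggling 150 hospital systems can actually fill in, rather than 200 fields that invite boilerplate.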
So who would like to talk about intrahost variation in general first, and then answer the questions about ARTIC and MinIONs? I don't mind having a start at that. Personally, I think the jury is currently out as to whether there is significant interesting information in a typical SARS-CoV-2 infection, i.e. quite a short infection, in terms of intrapatient variability. You've got to remember where that variation comes from. It can come from the fact that someone could be infected by a diverse set of genomes, if you like. For example, in a situation where the transmission bottleneck is wide and there's lots of circulating diversity, you could be infected by multiple lineages or multiple strains at the same time. In our experience, we don't think that's very common. We don't think there is a wide transmission bottleneck in SARS-CoV-2, and actually, when we look into it, it's very unusual that we find multiple lineages in the same person, or even much variability in the same person. In the context of a typical infection, where you get infected, you hopefully don't get too sick, but you might, and you recover within a week or two, that's not very much time for the virus to evolve much in the person. And that's what we tend to see in the sequencing. And although we do sometimes see what might be called co-infections, it's often quite hard to know if that's real or a technical artifact from, let's say, two different samples getting mixed up in the laboratory process. Because we're often dealing with amplification of very, very small amounts of starting material, even the tiniest amount of contamination between samples before the amplification could produce that type of result. But we don't see it very often either way, whether technical or biological. The flip side, though, is that there are now studies that are very interesting, looking at patients that don't seem to be able to clear the coronavirus quickly.
So I'm thinking about situations such as immunodeficient or immunocompromised hosts, whether through genetic reasons, because of an acquired illness like HIV, or because of cancer chemotherapy or other immunosuppressant drugs. There have been several case studies of patients in whom SARS-CoV-2 is not cleared rapidly, where infections can go on for a long time, months; I think the longest I've seen recorded is about 150 days. In those situations, that's enough time for an appreciable amount of diversity to accumulate in a person. And this might be accelerated by the host condition: if the immune response is not very good, it may give the virus a bit more of a playground to test out different combinations of mutations. And interestingly, the new variant that was detected in the UK, which is being called B.1.1.7, or it's got several other names now and I can't remember them all, has actually got a very large number of mutations, many more than you would expect at this point in the epidemic. With evolutionary rates of about two mutations a month, it's got about 10 or 20 more mutations than you'd expect on that clock. And it has been speculated to have emerged during a kind of chronic infection. In those kinds of cases, intra-host variation is quite interesting. And so the second part of your question, and I don't know who wants to take this on, is: can it be done with the ARTIC protocol, either on Illumina or MinION? We did actually publish a paper looking at this specific question: can you use ARTIC amplicon sequencing to look at intra-host variation? You actually can, but if you do a single sequence, a single replicate if you like, the frequencies are not that well correlated with metagenomics, just because of the stochasticity associated with PCR.
And this is particularly problematic at very low viral loads, very high CTs, because you are probably amplifying off one or two or a handful of template molecules, in which case you get a very poor frequency profile that probably doesn't make sense, because you can't find something at 5% or 10% allele frequency if you're amplifying off one copy of a molecule. It's simply mathematically impossible. We found that those results can be improved dramatically if you do replicate sequencing. So if you do the PCR three times from the same RNA, you get frequencies that much better match what you'd expect from an unamplified approach like metagenomics. And in regard to nanopore sequencing: well, nanopore sequencing clearly has a much higher error rate than Illumina sequencing, we would say about 5%, but actually that 5% is quite unevenly distributed across the genome. In some parts of the genome the error rate is actually much lower, and in some it's much higher, for example near homopolymers. So with nanopore sequencing, the frequencies generally correlate well with Illumina and with metagenomics in regions where you have detected that there is variation. If you know a priori that you're interested in a particular location, it tends to correlate quite well. It's not very good for discovering low-frequency variation down at the 1, 2, 3% mark, but at the 25 to 50% mark, it's quite good. Right, that was quite a long answer. Did any other panelists want to chip in and disagree, particularly if you want to disagree with me? Well, I'd just like to say that a lot of the variation, if you do just randomly look at a sample, is probably going to be contamination or something like that. So probably 99% of what you're going to chase is going to be just simple little bits of noise. So maybe skip that one and leave it for the moment. Yeah, I agree.
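The "mathematically impossible" point above can be made explicit with a little arithmetic: each starting template molecule contributes all-or-nothing to the amplified pool, so observed allele frequencies can only come in steps of 1/N for N template copies. A minimal sketch (the function name is ours, just for illustration):

```python
def min_detectable_frequency(template_copies: int) -> float:
    """Smallest non-zero allele frequency representable when amplifying
    from `template_copies` starting molecules: each template is either
    mutant or not, so true frequencies come in steps of 1/N."""
    if template_copies < 1:
        raise ValueError("need at least one template molecule")
    return 1.0 / template_copies

# At very high CT, with say 2 templates, nothing below 50% is meaningful:
print(min_detectable_frequency(2))     # 0.5
# With ~1000 templates, a 5% minor variant is at least representable:
print(min_detectable_frequency(1000))  # 0.001
```

This is why a 5% or 10% allele-frequency call from a sample amplified off one or two molecules cannot reflect real intra-host variation, and why replicate PCRs from the same RNA help recover believable frequencies.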
I think for a single time point, single sample, single sequence, the probability of finding things that are interesting is probably outweighed by the probability of finding things that are technical artefacts. I think that agrees with what you're saying, Andrew. In the situation where you do have a long-term infected patient, with multiple time points where you can track the trajectory of this intra-host variation, if you can see particular mutations going up or down in frequency, that's much stronger evidence that there's a real effect going on in that patient. Although several papers have been published talking about the use of intra-host variation for, for example, doing genomic epidemiology, I'm not actually convinced that it's adding a great amount of signal in terms of, let's say, resolving transmission chains. I don't think there's a very good evidence base for that at the moment, regardless of the protocol that you use. So I wouldn't get too bogged down with doing that. I'd much rather have lots of samples than look very deeply into an individual sample, if that makes sense. John, do you want to share your experience and ask your question directly? We were mandated to test COVID within my institution here, the International Livestock Research Institute. But before we gained that ability, there was only one institution, the Kenya Medical Research Institute, through a collaboration with the Wellcome Trust, that was mandated to do the testing. And since they are focusing on human research, they were also given the go-ahead to sequence the samples. They mostly have data on GISAID, and they've done the sequencing using MinION.
But here in my institute we have both Illumina and MinION, and a major challenge has been to convince the government to give us approval to sequence the samples, even though we do the testing. So it's taken quite an amount of time, but hopefully soon we will begin. So my question is: how can we better engage the government or the ministries of health in helping us tackle these problems, especially when they become a pandemic like SARS-CoV-2? That's a really great question, I think: how do you persuade the government that sequencing is important, and to support it? The UK has been in an interesting position, because I think we started our coronavirus sequencing long before the government was interested in coronavirus sequencing. But the discovery of new variants recently has really changed the interest levels of government. They are much, much more interested in using this information, particularly as it relates to issues around travel corridors and travel policies. And of course, we have a situation at the moment where UK nationals are banned from travelling to many countries because of the potential for us exporting new variants. But we are also now imposing travel bans on countries in South America because of the Brazil variant, and on South Africa because of the variant that was discovered there. So this turns very political very quickly. Would anyone like to comment specifically on John's point? I mean, it's probably worth saying that in the UK, we've been building the groundwork for a long time. Nick will remember when he first started working with me, we used to just joke about the fact that Public Health England would never, ever do genome sequencing. It was just inconceivable. This was just a research tool, and we in our ivory-tower universities would gloat over it, but we'd never get the public health authorities using it. But that transformation did come, and it came unevenly.
Some parts of Public Health England took to genome sequencing very easily, other parts not so much. But we'd already laid the groundwork in that we'd engaged with those people, and it was clear that it was a good thing to do. Obviously, there are issues about people sometimes not wanting to know the answer as well. If you can prove that there's an outbreak going on in a hospital and the patients are catching it in the hospital, whether it's MRSA or COVID, the hospital authorities will say, well, we don't want to know that, that affects our reputation. And for these issues you have to say, well, you've got to step back from that petty business about your reputation, and have a system in place where these kinds of things are mandated and people have to share information. But yeah, it's a slow process winning over opinion and building that bridge. One of the things that is always at risk is the clinical-academic interface: keeping people who are working in clinical practice and academics talking to each other, and actually having people who are qualified in both areas, able to take what we find in the university environment, the new methodologies and approaches and findings, and translate them into the clinical environment. It's always a struggle to keep that interface going. Sorry, no, I was just saying that in Trinidad, one of the things that helped us get our Ministry of Health and Public Health Agency on board very quickly was the assurance that we weren't going to be sending any samples overseas. So having that local capacity for the sequencing, we use the MinION, really helped us; they were not interested in having samples go anywhere else. And that made a big difference. So I don't know if that's helpful to John. That's really interesting.
I mean, one bit of practical advice from the UK, because I think other countries have got caught out here when they've engaged in genome sequencing programmes in public health, is to establish very, very early, ideally written down and passed by the government, the principle of data sharing. So if you state in your sampling protocol and your sequencing protocol that data produced will be submitted to public databases at the point of production, and get that agreed at the point that you start your protocol, that will probably not be so controversial. Establish that principle, and I think the UK did this and gets a lot of credit for sharing data; it makes it much harder for governments to then say, as Mark alluded to, that actually we don't want this information getting out because of the potential political costs associated with it. Once you've started sharing data, it's kind of hard to stop. But many countries maybe did it the other way around: started sequencing, and then sought permission to share data. And that is much harder, I think, to do. So that's just a practical point if you're getting started. I will add a simple comment: there's a nice WHO report that came out quite recently, and I did link it in my slides. It's very nice reading, because it makes a good case for sequencing, and it's written in a way that policymakers can understand. So if you're looking for the words to make the case, that might be a good place to look and to base some of your argument on. Thanks, Nabil. Anyone else want to contribute? I suppose you also need the political will from the very top, and that did help the UK quite a lot. Because I know, say, in Ireland, they got a huge pot of money early on, but then it took them six months before they were allowed to get permission to actually go and start doing the work, because they had to work out all the legalities.
Whereas in the UK, they put in legislation, which made it a little bit easier to kick things off. Yeah, it's better for you to establish your genome sequencing, and be able to analyse it and make recommendations to public health, rather than other people doing it. So one thing that you can suggest to government is that if you don't do your own genome sequencing, then because of the amount of travel that's still going on, people will make inferences about what's happening in your country by analysing data from returning travellers, let's say, or just make inferences on the basis of no data. So it's much better to have a data set of your own that you can see in real time and analyse to make public health decisions. And the other thing that really is focusing the mind is, as I mentioned before, the focus on the novel variants, and the potential link between these novel variants and increased transmissibility and potentially evading antibodies. Potentially evading natural immune responses means it's extra important to know what variants, what lineages, are circulating in your country, in order to try and, if you like, quarantine the more worrying ones, because these variants at this point have not spread globally, but, left unchecked, we expect that they will. Okay, I'm going to move on to some other questions. Ben, do you want to ask your question? Yeah, I work as a clinical microbiologist in Scotland, and we've established sequencing locally for outbreaks. But I'm just keen to keep my skills up with bioinformatics. I've done a couple of courses, like the Wellcome Trust ones and some FutureLearn ones, but I just really wondered if there were any more courses like this one that you're giving at the moment; if you could recommend them, that would be great. It's a very common question at the end of a bioinformatics course: are you running any more bioinformatics courses?
And I always wonder if that just means we've done it wrong, or... No, you've done it right, you've done it right, definitely. That's good to know. Who wants to answer that question from our panel? Great, Will. It's all right. Obviously, bioinformatics is a massive area, so it'd be useful to learn a little bit more about what you're after. But for someone who's new to bioinformatics and wants a bit more independence in it, I can thoroughly recommend learning a pipelining language like Nextflow or Snakemake. They have really good online tutorials available, which you can do in your own time. It's really easy to install them, and you can get some simple examples up and running really quickly. I'll paste a couple of links into the chat box where you can find those. But if you want more bioinformatics training, I guess, if you want to start digging into writing some of your own stuff, maybe some basic Python scripting, Rosalind offers some really good training. They set really short problems which you can have a go at answering yourself with any coding language, and that's what I used to learn bioinformatics, so I can recommend doing that as well, if that's more what you're after. As well as that, I think get engaged with the online community. If you want to start learning how things work and want a bit more guidance, most of these tool developers go out of their way to help you. So just go onto the GitHub of a tool you're interested in or want to learn to use more and raise an issue, or they have online chat forums and things. So it's definitely worth getting a bit more stuck in, in that sense, and that can really help you move along in your bioinformatics capacity as well. If there's anything in particular you want guidance or training on, if you could just say, then maybe we can be a bit more specific.
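For readers new to the Nextflow/Snakemake idea mentioned above, the core concept both tools share is declaring each step's inputs and outputs and letting the tool work out execution order. A toy illustration of that dependency-graph idea in plain Python (this is not either tool's API, and the file names and commands are invented):

```python
# Each "rule" names its output, what it needs, and what it would run.
# In Nextflow/Snakemake you declare something equivalent, and the tool
# resolves the order, parallelism, and re-running of stale steps for you.
steps = {
    "reads.bam":    {"needs": ["reads.fastq"],  "run": "align reads"},
    "consensus.fa": {"needs": ["reads.bam"],    "run": "call consensus"},
    "lineage.csv":  {"needs": ["consensus.fa"], "run": "assign lineage"},
}

def build_order(target, steps, done=None):
    """Resolve dependencies depth-first, returning rules in execution order.
    Inputs with no rule (raw files like reads.fastq) are assumed to exist."""
    done = [] if done is None else done
    for dep in steps.get(target, {}).get("needs", []):
        build_order(dep, steps, done)
    if target in steps and target not in done:
        done.append(target)
    return done

print(build_order("lineage.csv", steps))
# ['reads.bam', 'consensus.fa', 'lineage.csv']
```

Once this mental model clicks, the official Nextflow and Snakemake tutorials are mostly about syntax on top of it.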
I was also going to mention, so not an open-source thing now, unfortunately, but if you're in Scotland, I've done quite a few of the Edinburgh Genomics courses, which you do have to pay for. But I remember when I was first learning to code, I'd sort of taught myself and done some online courses in Python before, and actually having a taught, week-long course of formal Python learning was really helpful, and my coding improved quite a lot from that. They also do Linux courses, and various specific courses in addition to just the Python or coding ones. So I would recommend them, but unfortunately you do have to pay for them. So I'm going to be kind of controversial here maybe and say that you need to hire people who actually have the pre-existing skills, because it can take years to get to a level where you are sufficiently skilled to analyse this kind of data. A one-week course might be a nice introduction, or for you as a manager it might help you guide and direct other work, but in terms of actually doing it, you need to hire someone who's actually a professional in that area. I've been doing this, computer science, for more than 20 years, and still I don't necessarily think I'm an expert in this. When it comes to bioinformatics, particularly around individual pathogens, these things are so specific, and there are so many little gotchas and so many little quirks, it takes years and years to learn. I would say just hire someone who actually knows what they're doing and who can figure out where the best resources are and how to fix different problems. Otherwise you might spend six months faffing around when you could just hire someone who can do the same thing in two days, because they are pre-skilled. That's a provocative perspective. I like it.
And it's probably worth saying that SARS-CoV-2 is so new that a year ago there were no experts, and arguably there are still no experts, but I know exactly where you're coming from. I'm waiting for HR to put out the job descriptions that say, you know, COVID bioinformatician required, three years of experience. Yeah, exactly. Anna, did you have any perspectives as a research software engineer? Yeah, just to expand a bit on what Will said about Nextflow and Snakemake, I think it's definitely important to pick one and learn it if you're interested in writing your own bioinformatics pipelines, and in terms of running other people's as well, because increasingly people are turning to Nextflow and Snakemake. There is actually quite a good community around Nextflow called nf-core; it's probably worth a look, because they've got quite a lot of interesting pipelines that you might want to have a look at. Thank you. Yeah, and I do agree that these workflow languages are a really useful skill for any budding bioinformatician to learn. And also, I think Nabil made this point in his talk: there's a lot out there already that's pretty good to build off. So starting from scratch is great for learning bioinformatics, but for getting work done in a high-pressure environment, it's probably best to start with things that are well tested and, ideally, validated. We do have the SOPs that have been there since the start, since January, on the ARTIC website, and we can post up the links to those. The SOPs cover setting up a lab, sequencing, and the initial bioinformatics. We also have some phylogenetic tutorials made by Áine and Verity, focused around Ebola, from training courses in the past, that can also be adapted or looked at. So those are useful resources. But I mean, we could clearly put more documentation in.
Did you want to talk about any of those at all, Anya, in terms of resources that are available? Yeah. I was just going to mention that a couple of years ago, Nick, I, and a few of the others from the Arctic Network went to Ghana, and we had a whole week-long training session where we went from sample through sequencing, and then actually did hands-on, obviously in person at the time, bioinformatics teaching, bringing people through the pipeline. The thing that I remember is, I think it was really good, and I think people learned a lot from us, but a lot of the time got sucked up on very basic bioinformatics. It's one thing to run the pipeline, but even just knowing where you are when you're on the terminal, knowing what directory you're in, knowing where your files are, these are really basic things, and there are so many resources online that you can familiarize yourself with. So I think doing a course and being walked through these commands is great, but it's one thing to do that and maybe just be copying and pasting in from the notes, which works and is fine; actually, even just independently, you can get some practice by playing around with the terminal and getting familiar with it yourselves. Definitely. So a question from Rach Toom: how have the Arctic pipeline, pangolin, and civet been validated for use? Is there a standard process you tend to use? Yeah. I think pangolin and civet have maybe two different approaches when we talk about that. For pangolin, the actual data that goes into training the model is manually curated.
So in terms of the input, we are visually inspecting it and making sure that it's a sequence name mapped to a lineage assignment, and that's done by curating through the tree. Or if people have generated sequences and flagged that they have a new lineage or something like that, we can include it. If anybody is doing sequencing and notices that they have a cluster of sequences that they think should be a lineage, definitely flag it, either on the GitHub, or send me a message on Twitter or email, anything, and we can try to incorporate it into the scheme so you can get that lineage assigned. In terms of validating the output: we train the model, then we test it by running all of the sequences and doing 10-fold cross-validation of the model, and we get results for recall, accuracy, and precision for all the different lineages. Different lineages do vary, some have better recall, some worse, depending on how big the lineage is and whether it's defined by SNPs that might be present in multiple lineages, but on average we're at about 98.6% accuracy for the lineage assignments. So it does vary, and ambiguities and missing data can affect that too, but we're always updating and keeping an eye on these things. In terms of civet, the input is that big tree, so that sort of validation, along with sequence quality and things like that, is all part of grapevine, which is being run by Ben Jackson on CLIMB every day. Civet in itself is really just summarizing the data: you have your input sequence, we match it to the closest thing in the tree, and then we summarize the information. So in terms of actual analysis going on, there's really not that much, and a lot of the things that we're using are really well-validated tools already.
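To make the cross-validation metrics mentioned above concrete, here is a toy, pure-Python sketch of per-lineage recall and precision computed from paired true/predicted lineage assignments. The labels and numbers are made up for illustration; this is not pangolin's actual code.

```python
# Per-lineage recall/precision from paired true vs. predicted assignments,
# the kind of bookkeeping a cross-validation run would produce.
from collections import Counter

def per_lineage_metrics(true_labels, predicted_labels):
    """Return {lineage: {"recall": r, "precision": p}} from paired labels."""
    tp, fp, fn = Counter(), Counter(), Counter()
    for truth, pred in zip(true_labels, predicted_labels):
        if truth == pred:
            tp[truth] += 1          # correctly assigned
        else:
            fp[pred] += 1           # wrongly given this lineage
            fn[truth] += 1          # missed for the true lineage
    metrics = {}
    for lin in set(true_labels) | set(predicted_labels):
        metrics[lin] = {
            "recall": tp[lin] / (tp[lin] + fn[lin]) if tp[lin] + fn[lin] else 0.0,
            "precision": tp[lin] / (tp[lin] + fp[lin]) if tp[lin] + fp[lin] else 0.0,
        }
    return metrics

truth = ["B.1.1.7", "B.1.1.7", "B.1.177", "B.1.177"]
preds = ["B.1.1.7", "B.1.177", "B.1.177", "B.1.177"]
metrics = per_lineage_metrics(truth, preds)
# Here B.1.1.7 has recall 0.5 (one of its two sequences found) and precision 1.0.
```

As in the real system, the per-lineage numbers vary: a lineage that loses sequences to a near-identical neighbour will show lower recall even when overall accuracy is high.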
So things like Minimap2 by Heng Li, which is this incredible tool that aligns the whole 300,000 sequences in a matter of seconds; it's an amazing tool. So we are combining really good pieces of software from various people and then summarizing it in a nice report. So those are two different kinds of processes for that. Thank you. Certainly different groups will tend to validate their whole process, and that's quite important: they'll want to validate the entire process, including the lab work, the bioinformatics, and the answer. Generally with SARS-CoV-2, there isn't a huge amount of ground truth available for validation. So you can do things like sequence positive control material that's well characterized and check that you're producing the right variants. For example, in Birmingham, I think we've done something like 130 or 140 sequencing runs of SARS-CoV-2, and on every single run we include a positive control from an isolated SARS-CoV-2 virus cell culture and a negative control. So we check on every single run that we're pulling out the correct set of variants in the positive control and that the negative control is clean. But generally what we have to do for validation of pipelines is a kind of cross-validation: sequencing the same samples with slightly different methods and comparing them. For example, sequencing on Nanopore and Illumina and cross-checking that they give the same results is work that we've done in the past, and also checking different protocols, metagenomics versus amplicons, to check that they give the same results. And we also have a kind of broader validation, in the sense that everyone in the COG consortium is throwing in data from different places, with different protocols, that are brought together, and that often gives a very easy way of finding anything anomalous.
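The cross-checking idea described here, running the same sample through two platforms and diffing the calls, can be sketched in a few lines. The call sets and the position-to-ALT representation below are hypothetical, chosen just to make the comparison concrete.

```python
# Diff variant calls from two pipelines (e.g. Nanopore vs. Illumina)
# on the same sample; any disagreement is flagged for investigation.
def discordant_calls(calls_a, calls_b):
    """Return {position: (call_a, call_b)} wherever the two pipelines disagree."""
    positions = set(calls_a) | set(calls_b)
    return {pos: (calls_a.get(pos), calls_b.get(pos))
            for pos in positions
            if calls_a.get(pos) != calls_b.get(pos)}

nanopore = {3037: "T", 23403: "G", 28881: "A"}
illumina = {3037: "T", 23403: "G"}  # one call missing on this platform

disagreements = discordant_calls(nanopore, illumina)
# disagreements == {28881: ("A", None)}: a discrepancy to chase down,
# exactly the kind of anomaly pooled consortium data surfaces quickly.
```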
So for example, if you identify a new variant, we've found examples where some pipelines, like the Arctic pipeline, will call a deletion correctly (say the new variant is characterised by several short deletions in the genome), but other pipelines don't call that deletion correctly: they call it as N characters, or don't call it at all. That's a good way of doing a kind of cross-validation. Will's also done a lot of work on the Nanopore Arctic pipeline to make sure that it conforms to modern software development practices and testing; I don't know if you wanted to mention that at all. So whenever we make changes to the Arctic pipeline, we check it against positive control samples, basically to make sure that we haven't broken anything; there's a standard set of variants we expect to find when we run in the raw data and produce a consensus genome. But on that note, particularly for those wanting to implement bioinformatics practices and standardise them, I'd thoroughly recommend using tagged releases of all software, and doing it within controlled environments; conda environments are great for this. Be cautious using anything which hasn't had a release, because people can pollute the code in the main branch on GitHub, so if you just go and pull down a GitHub branch, it may or may not work properly. It's always worth checking the build tests; quite often they have a little badge saying whether they're passing or failing. So that's a good way to standardise things. For the Arctic pipeline and the stuff we're doing at COG, we use tagged releases so we can version-control the software all the way through. So if something does start to look a bit funny, maybe when we release a new version of the software, we can always roll back to a previous one until we've fixed any problems that might have arisen.
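The positive-control regression check described above amounts to a small set comparison: after any pipeline change, re-call variants on a well-characterised control and assert that the expected set still comes out. The variant notation and the expected set below are illustrative placeholders, not a real truth set.

```python
# Regression check for a pipeline change: the positive control must still
# yield exactly the known variant set, and nothing extra.
EXPECTED_CONTROL_VARIANTS = {"C3037T", "A23403G", "G28881A"}

def check_positive_control(called_variants, expected=EXPECTED_CONTROL_VARIANTS):
    """Return (missing, unexpected) variant sets for a positive-control run."""
    called = set(called_variants)
    return expected - called, called - expected

missing, unexpected = check_positive_control(["C3037T", "A23403G", "G28881A"])
assert not missing and not unexpected  # the change didn't break variant calling
```

Run against every tagged release, a check like this is what lets you notice quickly when a new version "starts to look a bit funny" and roll back.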
So yeah, I can thoroughly recommend that as a standard practice if you're doing bioinformatics. Maybe just to set expectations, though: it is a very rapidly changing area, and there are changes in all the software on a weekly basis. If you think that you can just freeze everything today and use that for the next two years, you're going to encounter a lot of problems, because there are people constantly improving things, finding new things, putting them in place. It's academic-quality software being updated as fast as possible in response to the pandemic. I think that's a really good point, actually; thanks for clarifying that. And I think it ties into your earlier point as well: you do need professionals to be able to tell when you need to change your software, and to validate the changes which are taking place, whether there's something you want to incorporate in your pipelines, or whether you should be sticking with a release which does something in particular. So yeah, very fair point. I might comment on that as well, about the software and also the lineage releases. If you have been doing sequencing, you'll have noticed that lineages get updated pretty regularly, and over the last three weeks very regularly, as we've had various sets of variants flagged and we're optimizing how we're assigning them. One thing that we do as well, that Will mentioned, is that we tag not only versions of pangolin, so pangolin itself and the assignment method that it uses is one thing, but we also have tagged, dated versions of the lineage releases. So if, say, you've done an analysis back in March or April last year, really early on, the lineages that you assigned then will be very different from the ones that get assigned now. The thing is, all of the versions are still available online.
So if you have done that analysis and want to replicate it, you can go back and use that version of the software and that version of the data, which all still exist. But if you want the most up-to-date information, I'd definitely recommend using the most recent version of both the pangolin lineage model and pangolin itself. Can I just add a caution? I've reviewed a lot of bioinformatics papers for SARS-CoV-2 genomics, and one of the most common problems is where people use commercial systems, black boxes, and they think, because the company tells them this is all really great, that you can just press the magic button. But these are most often the papers with very, very clear and obvious errors. So don't be fooled into just dropping in a standard commercial system and thinking everything is fine; often it may not be. Yeah, I think that's a really good point. Academic software gets a lot of stick, you know, the academic quality is not always thought to be that high, but commercial software can sometimes be worse because it has the illusion of quality. It's nicely packaged, it works very easily, it produces an output very easily, and these are the features that people look for in commercial packages: easy to install, nice GUI. But they won't know that you're using the Arctic amplicon scheme, they won't know anything about SARS-CoV-2; they'll just allow you to put in a FASTQ file and spit out an answer, making a bunch of assumptions. And those assumptions are probably invalid until the developers of that software understand how the field has progressed, understand how to deal with that data, and make an update. So in a way, the commercial packages can sometimes, as you say, actually produce the worst results.
In a few years' time they'll probably be the best, you know, but right now we're in a kind of transition period. Okay, that seems to have answered Rach's question, which is great. So I think we're coming to the end. We've had an hour of Q&A, and I think it's been really interesting; we've covered a couple of interesting topics. Then let's say goodbye. Bye from all the panellists. Thank you for attending, and we'll see you on the Slack or somewhere else soon, I'm sure. Thank you all so much for listening to us at home. If you like this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group and edited by Nick Waters. The opinions expressed here are our own and do not necessarily reflect the views of the CDC or the Quadram Institute.