This time on the Micro Binfie podcast, we come from the ARTIC Network and
CLIMB-BIG-DATA joint workshop on COVID-19 data analysis, held on the 14th and
15th of January, 2021.

So welcome everybody to this last Q&A session. The question is from Arnold,
who wants to know a bit more about what was going on with Polecat. I'm going to
ask Verity, but Polecat, if you didn't catch this, is a tool for investigating
clusters, so genomically defined clusters. That's about the limit of my
knowledge, so I'm going to turn that over to Verity to expand on my answer.
Yeah, basically that's what it's for. So it's a way to try and detect
outbreaks in a little bit more of a systematic way, because before, we'd look
at the tree and we'd say, oh, that looks a bit strange, or people that were
sequencing in the various different parts of the UK might highlight things and
say, oh, this looks a bit strange. So Polecat is a way of detecting things that
we might not otherwise notice. And basically what it does is it takes, well,
Andrew Rambaut wrote a kind of big, clever, magical function in Java that
summarizes the tree into lots and lots of different small groups and produces
summary statistics about each of those groups. And then the Polecat report will
summarize those things. And we're basically looking for things like a long
branch followed by a lot of sequences, because that can sometimes indicate
there's an outbreak, a really high growth rate, a lot more sequences than you'd
expect in a timeframe in a specific place. So we also look at the time and
spatial distribution of these things compared to what we expect, all of these
sorts of things. So it's to summarize the tree into useful things so we can
detect novel outbreaks. It was mostly developed on request from Public Health
England and the people that really asked for that, but it's usable. Yeah. And as
I understand it, one of the motivations is to try and detect these interesting
variants of concern, which have certain phylogenetic features. Like you
mentioned, they could be expanding more quickly than other clusters. Yeah. They
could have a very long stem, you know, a long branch indicating some kind of
mutational burst, like we talked about earlier, perhaps associated with
evolution in a host. And so I think the idea was to try and be able to detect,
without much a priori information, whether there was something that demanded a
kind of follow-up, epidemiological follow-up or... Yeah. And it wasn't necessarily
originally about finding variants that we were worried about, like the
specific mutations we're looking at now. I think it was originally also just
about outbreaks in general, because epidemiologically you can sometimes see
outbreaks appearing, but you might miss some because there are just so many
cases in one area that you might not know which ones are actually connected to
each other. So it wasn't all about the variants of concern, although it has
been useful for that later. It was just a new way of detecting outbreaks in a
slightly more focused way than the case data can sometimes be, if that makes
sense. And Polecat is something that people can run on their own data, is that
right? Is it available? I believe it's certainly available. It's on GitHub. I don't
honestly know how well it works on other data, but I think when Andrew's
function produces stuff, it does it for the whole tree. So I think if you have
a whole tree, you can do it, but I would need to look into that because I'm not
fully sure. Andrew wrote the kind of backend function and Áine wrote the report
section of it. So I'm not too familiar with the details of it.

So a question now is, could you give guidance on the difference between
phylotypes, COG-UK phylotypes, versus the pangolin lineages? And when should
you use one or the other? A topical question.
Yeah, so I can take this again. So the phylotypes are very, very high
resolution. Phylotypes are about describing specific shared SNPs that some
sequences might have. So really what the phylotypes end up being is a sort of
text codification of the tree. If you have the tree and you have the
phylotypes, it's kind of the same information. And so they're just about
shared SNPs. It's very, very high resolution. If you have two sequences with
the same phylotype, they're much, much more connected than two sequences with
the same pangolin lineage, because, as Áine said yesterday, the lineages are
sort of agnostic to size. So sometimes the lineages are really, really big.
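
As a minimal sketch of the resolution difference being described here, the
snippet below groups a handful of sequences first by lineage and then by
phylotype. The records and label strings are invented for illustration (real
COG-UK phylotype strings may be formatted differently), and `group_by` is a
hypothetical helper, not part of any COG-UK tool.

```python
from collections import defaultdict

# Hypothetical records: (sequence name, lineage label, phylotype label).
# The labels are made up for this sketch and are not real assignments.
RECORDS = [
    ("seq1", "B.1.1.7", "UK1_1.1"),
    ("seq2", "B.1.1.7", "UK1_1.1"),
    ("seq3", "B.1.1.7", "UK1_1.2"),
    ("seq4", "B.1.1.7", "UK1_2"),
    ("seq5", "B.1.177", "UK2_1"),
]


def group_by(records, column):
    """Group sequence names by the label in `column` (1 = lineage, 2 = phylotype)."""
    groups = defaultdict(list)
    for record in records:
        groups[record[column]].append(record[0])
    return dict(groups)


lineage_groups = group_by(RECORDS, 1)
phylotype_groups = group_by(RECORDS, 2)

# The same five sequences fall into fewer, larger lineage groups and into
# more, smaller phylotype groups.
print(len(lineage_groups), "lineage groups:", lineage_groups)
print(len(phylotype_groups), "phylotype groups:", phylotype_groups)
```

In this toy data, sharing a lineage says only that four sequences sit somewhere
in the same large group, while sharing a phylotype narrows that down to a pair.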
So if you have two sequences in the same lineage, they shared a common ancestor
at some point relatively recently, but it may not be that recently, whereas two
sequences with the same phylotype are going to be much more closely linked than
two sequences with the same lineage. So it depends on the resolution of the
question you're asking, really. Yeah, so that was my understanding: phylotypes
are very high resolution, pangolin lineages lower resolution. Yeah, I think
phylotype is the highest resolution thing we use, because like I say, it's the
tree. We just kind of write it out so that it can be easier to report on.
Yeah, excellent. So phylotype is useful when looking at, you know, hospital
outbreaks, something like that, very fine resolution, and lineages are more
used in the international setting, I suppose, but can also be used in hospital
outbreaks, as you pointed out in your practical demonstration.

This is another quick question, I think, and most
of the questions are coming to you, Verity, at the moment, I think, but feel
free to chip in anyone else. What's the main difference between civet and llama?
And I think that's the software tools rather than the animals. Yeah, llamas have
hooves. So no, yeah, so they're very similar. They have a lot of code in common.
The main difference is that we wrote llama to be used globally, and we wrote
civet to be used within the UK. So we were kind of angling it so that llama
would be for kind of broader scale, lineage level stuff in other countries. And
then civet would be a kind of specific application for cluster investigation
within the UK, where we were able to provide more high resolution data. So
civet has some extra stuff in it, like quite a lot of spatial things I've put
in, and mapping and all these sorts of things. And because mapping requires an
awful lot of metadata curation and management, that feature is only available
for the UK, so it is only in civet. As we move on, they are becoming more and
more similar, partly because we're moving towards making civet usable for
people outside of the UK, although some features probably won't be available
outside of the UK in civet. But yeah, civet and llama are becoming slightly
more similar; they're sort of different applications of a similar code base.
Does that make sense? They are getting increasingly similar.
It makes sense to me, I think.

And Tiago's got a follow-up question, which is directly related to this one.
There's a huge number of genomes in GISAID, and if you've got a genome from a
big lineage like B.1 in your country and you want to find the most closely
related sequences in the international database, how do you go about it? I'm
going to guess that you're going to suggest using civet or llama for something
like that. But he was also wondering if you could BLAST it. I honestly don't
know about BLASTing it. BLAST is not something I use particularly. I don't know
if anyone else has any thoughts. I wouldn't go BLASTing it. I thought you'd say
that. You know, you've got a tiny number of mutations and, yeah, BLAST will not
give you the resolution you need. Also, most genomes are not in GenBank. So if
you're talking about using NCBI BLAST on the public database, you'll miss an
awful lot of things that way. But I agree, it's not the most sensitive tool for
these types of sequences. It's not really designed for that. Yeah. So I think
llama would be a good one for
this because you could put in the sequences that you have and it would pull out
the relevant parts of the tree. And then you could look at what lineages the
rest of those parts of the tree are. So I think llama would be a good tool for
that. It's probably worth saying that if it's a B.1 lineage, you know, that's
going to be a huge number of sequences. And in effect, it's going to be very
hard for you to say anything about what the closest relative is, because the
fact that it's in the B.1 lineage basically tells you it's in this very, very
broadly distributed international lineage. So you may not be able to say much
more than that anyway. Yeah, that's very true. Most sequences are part of B.1
in some way. Yeah. So if you've got B.1.1.7 or something like that, then
obviously you've got more information to work with and you might be able to
infer more. So at the moment, B.1.1.7, also called the UK variant, we think
probably started in the UK and is being exported. So if you see one of those
nowadays in another country, you might assume it's a UK import, although it's
become so widespread so quickly that it may also be local transmission. So you
can't tell it from the sequence. You must tell that from the epidemiology of
that lineage, if you like: who described it first and where are the cases. Okay,
good. That was an interesting discussion, actually. I can't think of any other
tools that are really good for doing this other than just making a tree. And
effectively what civet and llama are doing is making a tree and finding a place
in it. I'm guessing what it's doing is trying to find the most similar
sequences by using a simple multiple sequence alignment type approach and then
building a tree of the things that are nearby. Is that right? Yeah, as I
understand it. So, yeah, we align sequences. And I'm not familiar with the
pipeline, so I don't want to say something wrong. But the idea is effectively
to recruit a bunch of sequences that you can then go and make a tree from,
effectively from an existing tree. Exactly. Yeah. It looks at the
big tree and it finds the right part of the tree and then it pulls out that
right part of the tree and builds it. And if you've got new sequences in there,
it will then like remake that small part of the tree with the new sequences. And
so you can add the new sequences on the fly. Yeah. And this is just to get
around the problem that you don't want to build a new tree every time you add a
sequence to a database of 400,000 sequences or something like that. Yeah,
exactly. You start with a reference tree. Reference trees are available from
the COG-UK project. Actually, we post one on our website every time the
pipeline runs, which is usually daily. I think other trees are available too.
I think Rob Lanfear curates a worldwide tree. So you can start with that as
your approach to finding neighbouring sequences, then pull out sequences and
build a tree in your favourite way, either manually with tree building
software or with one of the tree building pipelines; the Nextstrain Augur
pipeline is a nice way of easily drawing new trees.

OK, the next
question then is from Federico, also for Verity. So this is about pangoLEARN.
From what I understand, you're training the model for the machine learning
approach of pangoLEARN with a manually curated sequence lineage data set. Can
you comment a little bit further on how you are managing to manually curate
that data set? Another topical question. Yeah, so it's very entertaining. I
will say Áine does most of the manual curation. I help her because it's a huge
task, as I'm sure you can imagine. But broadly, what we do is we have the big
tree and we chop it up into smaller trees so that it's easier to open in the
tree viewer FigTree, because if you don't do that, then it doesn't scroll very
fast. And we basically go through it and we have what the old lineage
designation was. And we look at how that's being transported into the new
tree, because obviously trees are uncertain and you've got more sequences and
sometimes lineages split up. And you look at it and you say, OK, there's some
new sequences in there. We need to add those into this lineage. And then you go
down and you say, oh, this one actually looks like it might be a new
lineage. So then we call that one a new lineage. And then you go down and you
say, oh, we've already seen B.1.1.10 somewhere else. So that lineage
just split up. So now we need to give this one a new number. So we do that for
the whole tree, which is why it doesn't happen all that often. That's why it
happens every couple of months, just because it takes about a week's worth of
time to do that. In terms of helping, I'm sure that would be fantastic if
anyone does want to get involved with going through that tree and doing that
process. You can chat to me and Áine about that and we can get you involved,
because I think crowdsourcing it is definitely a good way to do it. Because
the last time we did a big lineage release, I think I helped Áine with like
one or two trees and that was totally fine from my point of view to do. And
then she had to do one or two fewer trees. So that was nice. So, yeah. Yeah.
Get in touch if you're keen to help out. That sounds great. Yeah. And I was
under the
impression that anyone that's doing sequencing, you know, if they identify
something that they think is epidemiologically or biologically interesting, can
propose that a particular part of the tree gets its own lineage assignment, via
the cov-lineages website and the contact details there, I guess. Yeah. And
actually we've been doing that recently. There was one in the US, I think,
that Áine recently did a kind of mini extra release for, to incorporate that,
and I think we had one from Brazil before, not the variant of interest, just a
separate lineage, all these sorts of things. Yeah. So we're trying to make it a
very open and democratic process. So, yeah. Any questions or any thoughts,
please do let us know.

Great. OK. Another civet question. Wow. All right. So I'm going to attempt
this. So new toy is asking, how
is civet tuned to break up clades in the display? So, you know, how does it
choose which, you know, what constitutes a subtree, I suppose? And also, if it
was to be adapted to other viruses that mutate faster, like HIV, you know, how
could you how could you use civet or how would you tune it? We have a series of
defaults inbuilt into civet which can be changed using the config file or the
command line. And for the ones that we use for choosing the tree, we look at
how far to go. So you've got your sequence of interest and, I think, we go two
nodes above the sequence of interest and two nodes down. I think
that's the default. And then we also have a radius measure. So if you were
interested in things slightly outside of this tree that you've got, you could
just increase that radius or increase your up and down distance. And that would
give you more of the tree. Part of the reason that we have it quite tight is
that these polytomies that we were all discussing yesterday can be really,
really, really big, especially with anything, anytime you get close to UK data,
because it's so heavily sampled here. So you end up with like these giant
polytomies that can be hundreds and hundreds of sequences, and sometimes
thousands, which we can't display very nicely. Basically, it doesn't look nice
to display them. And there's also a pixel limit in Python, it turns out. So
yeah, that's why it's quite tight. But you can change that.
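
To make the effect of the "up" distance concrete, here is a minimal sketch on
a toy tree. It is not civet's implementation: the tree, the node names and the
`local_subtree_tips` helper are all invented for illustration, and it models
only the step of climbing a number of nodes above the query and collecting
every tip underneath, ignoring the separate down distance and radius settings.

```python
# Toy tree stored as child -> parent. All names are hypothetical.
PARENT = {
    "n1": "root", "n2": "n1", "n3": "n2",
    "query": "n3", "tipA": "n3",
    "tipB": "n2", "tipC": "n1", "tipD": "root",
}


def ancestor(node, up):
    """Climb `up` nodes towards the root, stopping early at the root."""
    for _ in range(up):
        if node not in PARENT:
            break
        node = PARENT[node]
    return node


def tips_under(node):
    """Return all tips descending from `node`, in a stable order."""
    children = sorted(c for c, p in PARENT.items() if p == node)
    if not children:
        return [node]
    tips = []
    for child in children:
        tips.extend(tips_under(child))
    return tips


def local_subtree_tips(query, up):
    """Tips in the subtree rooted `up` nodes above the query."""
    return tips_under(ancestor(query, up))


for up in (1, 2, 3):
    print(up, local_subtree_tips("query", up))
```

Each extra step up pulls in more context around the query; on a real tree with
a large polytomy a single extra step could add hundreds of tips, which is the
reason given above for keeping the defaults tight.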
It's very easy to change. And you can like play around with it and see what
works for the question that you're asking. In terms of stuff like HIV, if you
wanted to adapt it, I think you would probably just play around with the default
settings that we have. I don't know. Yeah, Andrew has something to add.
Yeah, I mean, I've done this a lot playing with civet. And it's a bit like the
question earlier with B.1. Sometimes you'll make a tree and there won't be very
much context at all, so you will just see your queries and not much around
them, a couple of sequences, in which case I tend to increase the up distance,
or the up and down distances, to pull in more sequences. But if you were
putting in something like a B.1, quite a basal kind of thing, it will pull in
huge numbers of sequences. So in that case, you sometimes want to reduce the
distances again. So there's always a bit of tuning to be done, depending on how
densely populated that part of the tree is, would be my view of that. So I
would caution against
using tools built for one pathogen on another pathogen. Because say, if you take
HIV, it's quite common that people are infected with maybe multiple variants
or strains. And that's quite important clinically, because, you know,
it's implicated in treatment failures, things like that. And that's something
that clinicians who look at the data then are really interested in. And of
course, those don't really display very well in a tree, if you've, you know,
cloud some infection rather than, you know, beautiful, like isolates or
something like that. So I would just caution against trying to port one thing
over into another. Yeah. And I think in connection with that, it would change
the way you would interpret anything. So with SARS-CoV-2, if sequences are in
different subtrees under the default settings, that's far enough apart for
SARS-CoV-2 that we can pretty much rule out transmission. For something like
HIV that evolves much faster, the default settings in civet are appropriate
for a different virus, so if sequences were in different subtrees, that might
not actually rule out anything really, because of the biological context of the
virus. So yeah, just kind of connected to what Andrew said.

Federico's got quite an off-topic question, he admits. Are the sequences of the
mRNA and adenoviral vector-based vaccines available? And what's your opinion on
sharing that kind of information?
Well, I think at least for one mRNA vaccine, there's definitely a sequence
because I saw it posted on Twitter. And it was quite interesting because it was
encoded with kind of nucleotides, but also Greek letters associated with
modified RNA bases that are used in the mRNA vaccine. I'm not sure about the
other vaccines. I don't think all of the vaccine sequences are available. But
what's my opinion on sharing that kind of information? I think people should
share that kind of information. I expect that when the vaccines are used,
we'll end up sequencing the vaccine by accident when it's in people's systems.
And we'll end up knowing the sequences quite soon, just by chance, or people
will go and sequence it in the lab or something. But yeah, I don't see any
reason why that sequence data should be protected. I think it should be shared
and it would be interesting.

Okay, so here's another good question for
the panel. Do you have a preference for databases that report variation in SARS-
CoV-2? Hopefully everyone's got some thoughts on that. Well, I use CoV-GLUE and
clades.nextstrain.org, but they do give slightly different results and you
can't compare them directly. So just bear that in mind. If you do pick a
database, stick with that database and nothing else. Yeah, CoV-GLUE from the
CVR in Glasgow is a really nice database of variation: good web page, good easy
way to find common and rare mutations, insertions, deletions, things like that.
And as you say, Nextstrain is another great source of variation. I'm not aware
of many other variation databases, but I don't know if anyone else is. I mostly
use Andrew Rambaut's brain, to be honest. Just email Andrew. Oh, he'd love
that. In what ways do CoV-GLUE and Nextstrain differ? Well, CoV-GLUE is
entirely focused around that question of cataloging variation. So it's a
database of mutations, insertions and deletions for you to query. It's not so
much a phylogenetic platform. Nextstrain starts with a
phylogenetic tree, but it performs ancestral state reconstruction. So it
basically says, for clades in the tree, what are the lineage-defining
mutations, and it helpfully displays that in amino acid nomenclature. So it's
easier for you to find recurrent mutations that are interesting. People are
interested in particular mutations at the moment, because they're thought to
have biological properties, things like N501Y, E484K. There's a whole list of
them. 614G is the one that's been around for a long time. So you can use
Nextstrain to see when that mutation emerged. Whereas CoV-GLUE is much more
about cataloging the frequency. Nextstrain, because it is display based and
tree based, won't show everything in one view, because no one's figured out a
good way of displaying all that information. So you don't get quite as much
background about the database as CoV-GLUE gives you. But they are
complementary tools, I would say. So there is clades.nextstrain.org, which is
kind of like CoV-GLUE, but they call certain genes slightly differently. And
they seem to do different types of filtering as well. So you can get some
differences in what gets called each thing. Oh yeah, clades is excellent
actually, isn't it? I have to admit, I have never really used it, but you're
right. That's an excellent option. You can just dump your sequences in there and
it will give you a lot of that information, won't it? That's for analyzing your
own sequences, right? Yeah. Yeah. And if the variant has been assigned a
pangolin lineage, then you can put your sequences in the pangolin web app. And
if you were worried about the UK variant, for example, if it came out as
B.1.1.7, then it's that. So there's also that option if it's been assigned a
pangolin lineage. Very good. I've learned something. So this has been really
useful. And actually I needed something like that for a meeting in about half
an hour. So perfect timing.

Okay. This might be the last question. How would you describe the quality of
the variant calls in the software you described above? I'm not sure of the
answer to that one. One's a database and one's for your own sequences. So I'm
not sure it really works like that, but does anyone want to take a crack at that
question? Well, it depends on the quality of the consensus genomes you put in
and have been put in. So it's, it's only as good as the data that it's built
upon. So there's no way of really saying quality or not because a variant in a
FASTA file will be, you know, that's the variant. There's no ambiguity about
that, but if there's no reads to properly support it, well then that's a
problem. Well, you'll never know. Yeah. I mean, I think for Nextstrain
generally, they have curators there that will prune out and remove the more
obviously wrong sequences. So there's an element of curation on Nextstrain
generally. And I think the clades site offers some QC for your own sequences.
It will give you some QC metrics. You were shaking your head there, Andrew, but
I think it will try to do something for you, won't it? Yeah. It uses, I
suppose, some rules of thumb and says, well, you've got some private variants
here, so maybe it's not very good, but actually it might just be because you've
sequenced something from an underrepresented region of the world. So you have
to take all of these things with a pinch of salt. Yeah. And then I think
CoV-GLUE basically just reports, as you say, what's in the databases. So if
it's good sequence, it's a good variant call, but if it's not, it's not. Well,
yeah. They also do some nice QC checking against the ARTIC amplicon regions. So
it can highlight maybe if you've got a SNP in an area where they know it might
be a bit dodgy. Yeah, that's a good point. Okay. Hopefully that answers the
question.
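
Since the point here is that a variant call is only as good as the consensus
genome behind it, one rough check that can be run on a consensus sequence alone
is how much of it is masked or ambiguous. The sketch below is an assumption on
our part, not the QC that Nextstrain, the clades site or CoV-GLUE actually
perform; `consensus_qc` is a hypothetical helper, and it cannot tell you
whether the reads supported a call.

```python
def consensus_qc(sequence):
    """Summarize masked (N) and other ambiguous bases in a consensus sequence."""
    seq = sequence.upper()
    length = len(seq)
    n_count = seq.count("N")
    # Anything outside A, C, G, T, N and gap is counted as an ambiguity code.
    ambiguous = sum(1 for base in seq if base not in "ACGTN-")
    return {
        "length": length,
        "n_fraction": n_count / length if length else 0.0,
        "ambiguous": ambiguous,
    }


# Tiny made-up example: two masked bases and two ambiguity codes (R, Y).
print(consensus_qc("ACGTNNRY"))
```

A genome with a high fraction of N, or with many ambiguity codes, is one where
missing or unexpected variants deserve a second look at the underlying reads.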
Okay. So we appreciate all those really good questions. So we hope to see you
again soon, probably virtually on the Slack, on Twitter or elsewhere. Thank you
all so much for listening to us at home. If you liked this podcast, please
subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your
choice. And if you don't like this podcast, please don't do anything. This
podcast was recorded by the Microbial Bioinformatics Group and edited by Nick
Waters. The opinions expressed here are our own and do not necessarily reflect
the views of CDC or the Quadram Institute.