This time on the MicroBinfie podcast, we come from the ARTIC Network and CLIMB-BIG-DATA joint workshop on COVID-19 data analysis, held on the 14th and 15th of January, 2021.
So welcome, everybody, to this last Q&A session. There's a question from Arnold, wanting to know a bit more about what was going on with Polecat. I'm going to ask Verity, but Polecat, if you didn't catch this, is a tool for investigating clusters, so genomically defined clusters. That's about the limit of my knowledge, so I'm going to turn that over to Verity to expand on my answer.
Yeah, basically that's what it's for. So it's a way to try and detect outbreaks in a little bit more of a systematic way, because before we'd look at the tree and we'd say, oh, that looks a bit strange, or people that were sequencing in the various different parts of the UK might highlight things and say, oh, this looks a bit strange. So Polecat is a way of detecting things that we might not otherwise notice. And basically what it does is, well, Andrew Rambaut wrote a kind of big, clever, magical function in Java that summarizes the tree into lots and lots of different small groups and produces summary statistics about each of those groups. And then the Polecat report will summarize those things. And we're basically looking for things like a long branch followed by a lot of sequences, because that can sometimes indicate there's an outbreak, a really high growth rate, or a lot more sequences than you'd expect in a timeframe in a specific place. So we also look at the time and spatial distribution of these things compared to what we expect, all of these sorts of things. So it's to summarize the tree into useful things so we can detect novel outbreaks. It was mostly developed on request from Public Health England and the people that really asked for that, but it's usable. Yeah.
And as I understand it, one of the motivations is that these interesting variants of concern have certain phylogenetic features. Like you mentioned, they could be expanding more quickly than other clusters. They could have a very long stem, you know, a long branch indicating some kind of mutational burst, like we talked about earlier, perhaps associated with evolution in a host. And so I think the idea was to try and be able to detect, without much a priori information, whether there was something that demanded a kind of follow-up, epidemiological follow-up or...
Yeah. And it wasn't necessarily originally about finding variants that we were worried about, like the specific mutations we're looking at now. I think originally it was also just about outbreaks in general, because epidemiologically you can sometimes see outbreaks appearing, but you might miss some because there's just so many cases in one area that you might not know which ones are actually connected to each other. So it wasn't all about the variants of concern, although it has been useful for that later. It was kind of just a new way of detecting outbreaks in a slightly more focused way than sometimes the case data can be, if that makes sense.
And Polecat is something that people can run on their own data. Is that right? Is it available?
I believe it's certainly available. It's on GitHub. I don't honestly know how well it works on other data, but I think when Andrew's function produces stuff, it does it for the whole tree. So I think if you have a whole tree, you can do it, but I would need to look into that because I'm not fully sure. Andrew wrote the kind of backend function and Áine wrote the report section of it. So I'm not too familiar with the details of it.
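The screening idea described for Polecat (a long branch followed by many sequences, or an unusually high growth rate) can be illustrated with a small Python sketch. This is not Polecat's code: the `ClusterSummary` fields, the threshold values, and the crude sequences-per-day growth measure are all assumptions invented for the example.

```python
from dataclasses import dataclass


@dataclass
class ClusterSummary:
    """Hypothetical per-cluster summary statistics (not Polecat's real output)."""
    name: str
    stem_length: float   # substitutions on the branch leading to the cluster
    n_sequences: int     # number of sequences in the cluster
    days_spanned: int    # days between first and last sample

    @property
    def growth_rate(self) -> float:
        # Crude proxy: sequences per day over the sampled window.
        return self.n_sequences / max(self.days_spanned, 1)


def flag_clusters(clusters, min_stem=3, min_size=20, min_growth=2.0):
    """Return (name, reasons) for clusters that look worth following up.

    All thresholds are illustrative assumptions.
    """
    flagged = []
    for c in clusters:
        reasons = []
        if c.stem_length >= min_stem and c.n_sequences >= min_size:
            reasons.append("long branch followed by many sequences")
        if c.growth_rate >= min_growth:
            reasons.append("high growth rate")
        if reasons:
            flagged.append((c.name, reasons))
    return flagged


clusters = [
    ClusterSummary("cluster_a", stem_length=1, n_sequences=5, days_spanned=30),
    ClusterSummary("cluster_b", stem_length=8, n_sequences=60, days_spanned=20),
    ClusterSummary("cluster_c", stem_length=0, n_sequences=45, days_spanned=15),
]
flagged = flag_clusters(clusters)
for name, reasons in flagged:
    print(name, "->", "; ".join(reasons))
```

In practice the thresholds would need to be set against what is expected for a given place and time window, which is the comparison described in the answer above.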
So a question now is, could you give guidance on the difference between COG-UK phylotypes and the pangolin lineages? And when should you use one or the other? A topical question.
Yeah, so I can take this again. So the phylotypes are very, very high resolution. Phylotypes are about describing specific shared SNPs that some sequences might have. So really what the phylotypes end up doing is they're a sort of text codification of the tree. If you have the tree and you have the phylotypes, it's kind of the same information. If you have two sequences with the same phylotype, they're much, much more connected than two sequences with the same pangolin lineage, which, as Áine said yesterday, is sort of agnostic about the size of the lineage. So sometimes the lineages are really, really big. If you have two sequences in the same lineage, they shared a common ancestor at some point relatively recently, but it may not be that recently, whereas two sequences with the same phylotype are going to be much more closely linked. So it depends on the resolution of the question you're asking, really.
Yeah, so that was my understanding: phylotypes are very high resolution, pangolin lineages lower resolution.
Yeah, I think phylotype is the highest resolution thing we use, because, like I say, it's the tree. We just kind of write it out so that it can be easier to report on.
Yeah, excellent. So phylotype is useful when looking at, you know, hospital outbreaks, something like that, at very fine resolution, and lineages are more used in the international setting, I suppose, but can also be used in hospital outbreaks, as you pointed out in your practical demonstration.
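To make the "text codification of the tree" idea concrete, here is a minimal Python sketch. It assumes a toy tree written as nested lists and a made-up dotted labelling scheme; the real COG-UK phylotype format and the way phylotypes are derived from shared SNPs are not reproduced here.

```python
def assign_labels(node, label="1", out=None):
    """Label every tip with the dotted path of the internal node it hangs from.

    `node` is a nested list: strings are tips, lists are internal nodes.
    The labelling scheme is hypothetical and only mimics the idea of
    writing the tree out as text.
    """
    if out is None:
        out = {}
    child_index = 0
    for child in node:
        if isinstance(child, str):
            out[child] = label  # a tip inherits the label of its parent node
        else:
            child_index += 1
            assign_labels(child, f"{label}.{child_index}", out)
    return out


def shared_depth(label_a, label_b):
    """Count how many leading label components two sequences share."""
    depth = 0
    for x, y in zip(label_a.split("."), label_b.split(".")):
        if x != y:
            break
        depth += 1
    return depth


toy_tree = [["seqA", "seqB"], ["seqC", ["seqD", "seqE"]]]
labels = assign_labels(toy_tree)
print(labels)
print(shared_depth(labels["seqD"], labels["seqE"]))  # same node, deep shared path
print(shared_depth(labels["seqA"], labels["seqD"]))  # only the root is shared
```

Two sequences with an identical label sit on the same node of the toy tree, which is the sense in which a shared phylotype is a much tighter link than a shared lineage.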
This is another quick question, I think, and most of the questions are coming to you, Verity, at the moment, but feel free to chip in, anyone else. What's the main difference between civet and llama? And I think that's the software tools rather than the animals.
Yeah, llamas have hooves. So no, they're very similar. They have a lot of code in common. The main difference is that we wrote llama to be used globally, and we wrote civet to be used within the UK. So we were angling it so that llama would be for broader scale, lineage level stuff in other countries. And then civet would be a specific application for cluster investigation within the UK, where we were able to provide more high resolution data. So civet has some extra stuff in it, like quite a lot of spatial things I've put in, and mapping, and all these sorts of things. And because mapping requires an awful lot of metadata curation and management, that feature is only available for the UK, so it is only in civet. As we move on, they are becoming more and more similar, partly because we're moving towards making civet usable for people outside of the UK. Some features probably won't be available outside of the UK in civet. But yeah, civet and llama are becoming slightly more similar, and they're sort of different applications of a similar code base. Does that make sense?
It makes sense to me, I think. And Tiago's got a follow-up question, which is directly related to this one. There's a huge number of genomes in GISAID, and if you've got a genome from a big lineage like B.1 in his country, he wants to find the most closely related sequences in the international database. How does he go about it? I'm going to guess that you're going to suggest using civet or llama for something like that. But he was also wondering if you could BLAST it.
I honestly don't know about BLASTing it. BLAST is not something I use particularly. I don't know if anyone else has any thoughts.
I wouldn't go BLASTing it.
I thought you'd say that.
You know, you've got a tiny number of mutations and BLAST will not give you the resolution you need.
Also, most genomes are not in GenBank. So if you're talking about using NCBI BLAST on the public database, you'll miss an awful lot of things that way. But I agree, it's not the most sensitive tool for these types of sequences. It's not really designed for that.
Yeah. So I think llama would be a good one for this, because you could put in the sequences that you have and it would pull out the relevant parts of the tree. And then you could look at what lineages the rest of those parts of the tree are. So I think llama would be a good tool for that.
It's probably worth saying that if it's a B.1 lineage, that's going to be a huge number of sequences. And in effect, it's going to be very hard for you to say anything about what the closest relative is, because the fact that it's in B.1 basically tells you it's in this very, very broadly distributed international lineage. So you may not be able to say much more than that anyway.
Yeah, that's very true. Most sequences are part of B.1 in some way.
Yeah. So if you've got B.1.1.7 or something like that, then obviously you've got more information to work with and you might be able to infer it. At the moment, B.1.1.7, also called the UK variant, we think probably started in the UK and is being exported. So if you see one of those nowadays in another country, you might assume it's a UK import, although it's become so widespread so quickly that it may also be local transmission. So you kind of can't tell it from the sequence.
You sort of must tell that from the epidemiology of that lineage, if you like: who described it first and where the cases are.
Okay, good. That was an interesting discussion, actually. I can't think of any other tools that are really good for doing this other than just making a tree. And effectively what civet and llama are doing is making a tree, finding a place in it. I'm guessing what it's doing is trying to find the most similar sequences by using a simple multiple sequence alignment type approach, and then it's building a tree of the things that are nearby. Is that right?
Yeah, as I understand it. So, yeah, we align sequences. And I'm not familiar with the pipeline, so I don't want to say something wrong.
But the idea is effectively to recruit a bunch of sequences that you can then go and make a tree from, using an existing tree.
Exactly. Yeah. It looks at the big tree and it finds the right part of the tree, and then it pulls out that part of the tree and builds it. And if you've got new sequences in there, it will then remake that small part of the tree with the new sequences. So you can add the new sequences on the fly.
Yeah. And this is just to get around the problem that you don't want to build a whole new tree every time you add a sequence to a database of 400,000 sequences or something like that.
Yeah, exactly.
You start with a reference tree. Reference trees are available from the COG-UK project. Actually, we post one on our website every time the pipeline runs, which is usually daily. I think other trees are available too. I think Rob Lanfear curates a worldwide tree. So you can start with that as your approach to finding neighbouring sequences, then pull out sequences and build a tree in your favourite way, either manually with tree building software or with one of the tree building pipelines. The Nextstrain Augur pipeline is a nice way of easily drawing novel trees.
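The idea of recruiting the most similar sequences before building a local tree can be sketched as a ranking by SNP distance over an alignment. This is a toy illustration only and not how llama or civet are implemented (as described above, they pull the relevant part out of an existing big tree); the function names and the toy sequences are assumptions made for the example.

```python
def snp_distance(seq_a, seq_b):
    """Count differing positions between two aligned sequences.

    Positions where either sequence has something other than A, C, G or T
    (for example N or a gap) are skipped rather than counted as differences.
    """
    if len(seq_a) != len(seq_b):
        raise ValueError("sequences must be aligned to the same length")
    valid = set("ACGT")
    return sum(
        1
        for x, y in zip(seq_a.upper(), seq_b.upper())
        if x in valid and y in valid and x != y
    )


def closest_sequences(query, references, k=3):
    """Return the k reference names with the fewest SNP differences to the query."""
    scored = [(snp_distance(query, seq), name) for name, seq in references.items()]
    scored.sort()  # by distance, then by name for a stable order
    return [(name, dist) for dist, name in scored[:k]]


query = "ACGTACGTAC"
references = {
    "ref1": "ACGTACGTAC",
    "ref2": "ACGTACGTAT",
    "ref3": "ACNTACGTAC",
    "ref4": "TCGTTCGTAT",
}
neighbours = closest_sequences(query, references)
print(neighbours)
```

With only a handful of mutations separating SARS-CoV-2 genomes, many references can tie at the same distance, which is one reason the placement in a tree is more informative than a flat similarity ranking.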
OK, the next question then is from Federico, also for Verity. So this is about pangoLEARN. From what I understand, you're training the model for the machine learning approach of pangoLEARN with a manually curated sequence lineage data set. Can you comment a little bit further on how you are managing to manually curate that data set? Another topical question.
Yeah, so it's very entertaining. I will say Áine does most of the manual curation. I help her because it's a huge task, as I'm sure you can imagine. But broadly, what we do is we have the big tree and we chop it up into smaller trees so that it's easier to open in a tree viewer like FigTree, because if you don't do that, then it doesn't scroll very fast. And we basically go through it, and we have what the old lineage designation was. And we look at how that's being transported into the new tree, because obviously trees are uncertain and you've got more sequences and sometimes lineages split up. And you look at it and you say, OK, there are some new sequences in there. We need to add those into this lineage. And then you go down and you say, oh, this one actually looks like it might be a new lineage. So then we call that one a new lineage. And then you go down and you say, oh, we've already seen B.1.1.10 somewhere else. So that lineage just split up, and now we need to give this one a new number. So we do that for the whole tree, which is why it doesn't happen all that often. It happens every couple of months, just because it takes about a week's worth of time to do that. In terms of helping, that would be fantastic if anyone does want to get involved with going through that tree and doing that process. You can chat to me and Áine about that and we can get you involved, because I think crowdsourcing it is definitely a good way to do it.
Because the last time we did a big lineage release, I think I helped Áine with one or two trees and that was totally fine from my point of view to do. And then she had to do one or two fewer trees. So that was nice. So, yeah, get in touch if you're keen to help out.
That sounds great. And I was under the impression that anyone that's doing sequencing, if they identify something that they think is epidemiologically or biologically interesting, can propose that a particular part of the tree gets its own lineage assignment, via the cov-lineages website and the contact details there, I guess.
Yeah. And we've actually been doing that recently. There was one in the US, I think, that Áine recently did a kind of mini extra release for to incorporate that, and I think we had one from Brazil before, not the variant of interest, just a separate lineage, all these sorts of things. So we're trying to make it a very open and democratic process. Any questions or any thoughts, please do let us know.
Great. OK. Another civet question. Wow. All right. So I'm going to attempt this one. So new toy is asking, how is civet tuned to break up clades in the display? So, you know, how does it choose what constitutes a subtree, I suppose? And also, if it was to be adapted to other viruses that mutate faster, like HIV, how could you use civet or how would you tune it?
We have a series of defaults built into civet which can be changed using the config file or the command line. And for the ones that we use for choosing the tree, we look at how far to go. So you've got your sequence of interest, and I think we go two nodes above the sequence of interest and two nodes down. I think that's the default. And then we also have a radius measure.
So if you were interested in things slightly outside of this tree that you've got, you could just increase that radius or increase your up and down distance. And that would give you more of the tree. Part of the reason that we have it quite tight is that these polytomies that we were all discussing yesterday can be really, really big, especially any time you get close to UK data, because it's so heavily sampled here. So you end up with these giant polytomies that can be hundreds and hundreds of sequences, and sometimes thousands, which we can't display very nicely. Basically, it doesn't look nice to display them. And there's also a pixel limit in Python, it turns out. So that's why it's quite tight. But you can change that. It's very easy to change. And you can play around with it and see what works for the question that you're asking. In terms of stuff like HIV, if you wanted to adapt it, I think you would probably just play around with the default settings that we have. I don't know. Yeah, Andrew has something to add.
Yeah, I mean, I've done this a lot, playing with civet. And it's a bit like the question earlier with B.1. Sometimes you'll make a tree and there won't be very much context at all, so you will just see your queries and a couple of sequences around them, in which case I tend to increase the up and down distance to pull in more sequences. But if you were putting in something like B.1, quite a basal kind of thing, it will pull in huge numbers of sequences. So in that case, you sometimes want to reduce the distances again. So there's always a bit of tuning to be done, depending on how densely populated that part of the tree is, I think would be my view of that.
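The up and down distances discussed here can be pictured with a small Python sketch on a toy tree. This is one possible reading of "two nodes above and two nodes down" and is not civet's implementation: the parent-map representation, the function names and the exact cut-off rule are assumptions made for illustration.

```python
def build_children(parent):
    """Invert a child -> parent map into a parent -> [children] map."""
    children = {}
    for child, par in parent.items():
        children.setdefault(par, []).append(child)
    return children


def local_subtree(parent, tip, up=2, down=2):
    """Climb `up` nodes from `tip`, then collect tips within `down` levels below.

    Internal nodes reached at the depth limit are not expanded, so a tighter
    setting shows less of a densely sampled part of the tree.
    """
    node = tip
    for _ in range(up):
        if node not in parent:  # reached the root
            break
        node = parent[node]
    children = build_children(parent)
    tips = []
    stack = [(node, 0)]
    while stack:
        current, depth = stack.pop()
        kids = children.get(current, [])
        if not kids:
            tips.append(current)
        elif depth < down:
            stack.extend((kid, depth + 1) for kid in kids)
    return node, sorted(tips)


# Toy tree as a child -> parent map (hypothetical names).
parent = {
    "n1": "root", "n2": "root",
    "n3": "n1", "tipE": "n1",
    "query": "n3", "tipA": "n3",
    "n4": "n2", "tipD": "n2",
    "tipB": "n4", "tipC": "n4",
}
tight = local_subtree(parent, "query", up=2, down=2)
wide = local_subtree(parent, "query", up=3, down=3)
print(tight)
print(wide)
```

Widening both distances pulls in the whole toy tree, which mirrors the tuning described above: increase the distances when there is too little context, and reduce them when a basal query drags in huge numbers of sequences.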
So I would caution against using tools built for one pathogen on another pathogen. Because, say, if you take HIV, it's quite common that people are infected with multiple variants or strains. And that's quite important clinically, because it's implicated in treatment failures, things like that. And that's something that clinicians who look at the data are really interested in. And of course, those don't really display very well in a tree if you've got, you know, a cloud of infection rather than beautiful isolates or something like that. So I would just caution against trying to port one thing over into another.
Yeah. And I think, in connection with that, it would change the way you would interpret anything. So with SARS-CoV-2, if it's in a different subtree in the default settings, that's far enough away for SARS-CoV-2 that we can pretty much rule out transmission. For something like HIV that evolves much faster, the default settings in civet are ones that are appropriate for one virus, so if sequences were in different subtrees, that might not actually rule out anything, because of the biological context of the virus. So yeah, just connected to what Andrew said.
Federico's got quite an off-topic question, he admits. Are the sequences of the mRNA or adenoviral vector-based vaccines available? And what's your opinion on sharing that kind of information?
Well, I think at least for one mRNA vaccine, there's definitely a sequence, because I saw it posted on Twitter. And it was quite interesting because it was encoded with nucleotides, but also Greek letters associated with the modified RNA bases that are used in the mRNA vaccine. I'm not sure about the other vaccines. I don't think all of the vaccine sequences are available. But what's my opinion on sharing that kind of information? I think people should share that kind of information.
I expect when vaccines are used in people, we'll end up sequencing the vaccine by accident when it's in people's systems. And we'll end up knowing the sequences quite soon, just by chance, or people will go and sequence it in the lab or something. But yeah, I don't see any reason why that sequence data should be protected. I think it should be shared and it would be interesting.
Okay, so here's another good question for the panel. Do you have a preference for databases that report variation in SARS-CoV-2? Hopefully everyone's got some thoughts on that.
Well, I use CoV-GLUE and clades.nextstrain.org, but they do give slightly different results and you can't compare them directly. So just bear that in mind. If you do pick a database, stick with that database and nothing else.
Yeah, CoV-GLUE from the CVR in Glasgow is a really nice database of variation: good web page, and a good easy way to find common and rare mutations, insertions, deletions, things like that. And as you say, Nextstrain is another great source of variation. I'm not aware of many other variation databases, but I don't know if anyone else is.
I mostly use Andrew Rambaut's brain, to be honest. Just email Andrew.
Oh, he'd love that.
In what ways do CoV-GLUE and Nextstrain differ?
Well, CoV-GLUE is entirely focused around that question of cataloging variation. So it's a database of mutations, insertions and deletions for you to query. It's not so much a phylogenetic platform. Nextstrain starts with a phylogenetic tree, but it performs ancestral state reconstruction. So it basically says, for clades in the tree, what the lineage defining mutations are, and it helpfully displays that in amino acid nomenclature. So it's easier for you to find recurrent mutations that are interesting. People are interested in particular mutations at the moment, because they're thought to have biological properties, things like N501Y, E484K.
There's a whole list of them. 614G is the one that's been around for a long time. So you can use Nextstrain to see when that mutation emerged, whereas CoV-GLUE is much more about cataloging the frequency. Nextstrain, because it is display based, it's tree based. It won't show everything in one view because no one's figured out a good way of displaying all that information. So you don't get quite as much background about the database as CoV-GLUE gives you. But they are complementary tools, I would say.
So there is clades.nextstrain.org, which is kind of like CoV-GLUE, but they call certain genes slightly differently. And they seem to do different types of filtering as well. So you can get some differences in what gets called in each.
Oh yeah, clades is excellent actually, isn't it? I have to admit, I have never really used it, but you're right. That's an excellent option. You can just dump your sequences in there and it will give you a lot of that information, won't it? That's for analyzing your own sequences, right?
Yeah. And if the variant has been assigned a pangolin lineage, then you can put your sequences in the pangolin web app. If you were worried about the UK variant, for example, and it came out as B.1.1.7, then it's that. So there's also that option if it's been assigned a pangolin lineage.
Very good. I've learned something. So this has been really useful. And actually I needed something like that for a meeting in about half an hour. So perfect timing.
Okay. This might be the last question. How would you describe the quality of the variant calls in the software you described above? I'm not sure of the answer to that one. One's a database and one's for your own sequences. So I'm not sure it really works like that, but does anyone want to take a crack at that question?
Well, it depends on the quality of the consensus genomes you put in and that have been put in.
So it's only as good as the data that it's built upon. There's no way of really saying quality or not, because a variant in a FASTA file will be, you know, that's the variant. There's no ambiguity about that, but if there are no reads to properly support it, well then that's a problem.
Well, you'll never know.
Yeah. I mean, I think for Nextstrain generally, they have curators there that will prune out and remove the more obviously wrong sequences. So there's an element of curation on Nextstrain generally. And I think the clades site offers some QC for your own sequences. It will give you some QC metrics. You were shaking your head there, Andrew, but I think it will try to do something for you, won't it?
Yeah. It uses, I suppose, some rules of thumb and says, well, you've got some private variants here, so maybe it's not very good. But actually it might just be because you've sequenced something from an underrepresented region of the world. So you have to take all of these things with a pinch of salt.
Yeah. And then I think CoV-GLUE basically just reports, as you say, what's in the databases. So if it's a good sequence, it's a good variant call, but if it's not, it's not.
Well, yeah. They also do some nice QC checking against the ARTIC amplicon regions. So it can highlight maybe if you've got a SNP in an area where they know it might be a bit dodgy.
Yeah, that's a good point. Okay. Hopefully that answers the question. So we appreciate all those really good questions, and we hope to see you again soon, probably virtually on the Slack, on Twitter or elsewhere.
Thank you all so much for listening to us at home. If you liked this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice. And if you don't like this podcast, please don't do anything.
This podcast was recorded by the Microbial Bioinformatics Group and edited by Nick Waters. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.