Hello, and thank you for listening to the MicroBinfie podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Both Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work on microbes in food and their impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the U.S. Hello and welcome to the MicroBinfie podcast. Today we're giving a rapid roundup of what's been changing for SARS-CoV-2 slash COVID-19 genomics. Things are changing very quickly, so we should mention that it's the 19th of February, 2021, and some of what we mention might change by the time you hear this. Today we are again joined by a special guest, Peter van Heusden, a bioinformatician at the South African National Bioinformatics Institute. Peter has a system administration background and has moved over into the bioinformatics world. So welcome, Peter. Thank you very much, Lee. So let's get started with reviewing the latest changes regarding tools and resources. So I think Nextalign is the first thing on our list. Nextalign is basically a chunk of Nextclade. So there's a bit of backstory here. Nextclade was originally developed as a web server, so it does alignment in your browser. So dumping a bunch of things into the Nextclade website is a great way to crash your browser. It is also available as a Docker container. And then I wrote a CWL, Common Workflow Language, wrapper to run it on the command line, but that has always been a little bit ropey.
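As an aside, the speed of reference-based approaches like Nextalign comes from scaling: every genome is compared pairwise against one reference, so genomes can be processed independently (and in parallel) rather than in one enormous multiple alignment. Here is a toy Python sketch of the downstream idea, calling substitutions from a query already aligned to reference coordinates; the sequences and positions are invented for illustration, and real tools like Nextalign also perform the alignment step itself:

```python
# Toy illustration of reference-based variant calling: each genome is
# compared against the one reference, so half a million genomes means
# half a million independent comparisons, not one giant MSA.

def call_substitutions(reference: str, aligned_query: str) -> list:
    """List substitutions (e.g. 'G4A') between a reference and a query
    already aligned to the same coordinates. Gaps ('-') and ambiguous
    bases ('N') in the query are skipped."""
    if len(reference) != len(aligned_query):
        raise ValueError("query must be aligned to reference length")
    subs = []
    for pos, (ref_base, query_base) in enumerate(zip(reference, aligned_query), start=1):
        if ref_base != query_base and query_base not in "-N":
            subs.append(f"{ref_base}{pos}{query_base}")
    return subs

# Invented example sequences:
reference = "ATGGCTACGT"
query     = "ATGACTACTT"
print(call_substitutions(reference, query))  # ['G4A', 'G9T']
```

Once every genome is expressed as a list of substitutions against the same reference, lineage assignment and mutation lookups become simple set operations.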
And so now this is part of porting part of the code into something that's going to be a bit more command-line friendly. And isn't the idea that you can do hundreds of thousands of genomes on a laptop in a few minutes? Yeah, just an alignment, because what they're looking for here is an alignment against the reference genome, so that you know which variants you're talking about, and then you can do a Nextclade lineage assignment. So instead of running MAFFT for five days on half a million genomes, you can just knock it out in a few minutes. That is the idea. I haven't used the C++ version yet. I think I heard about it two days ago, but it certainly sounds like something to watch out for. Yeah. Richard Neher's been saying it's lightning fast on Twitter. I haven't had a play with it yet, but it's one to watch. Yeah, it's definitely one to watch. So there's a great tool called the COG-UK Mutation Explorer. It's more UK-based, right? But it is phenomenal, certainly if you're in the UK, for looking at mutations. And it lets you look at which mutations of concern, or which variants in general, are popping up, and then how frequent they are, how recent they are in the UK population, how many samples you're seeing, and frequencies and all of that. So you can very quickly zoom in on, say, an interesting mutation and see where else it has occurred. Is it occurring in multiple different lineages, or just in your lineage? And that's kind of cool. And then they also have a list of, say, all the lineages where E484K has been found, and all the other interesting or potentially interesting lineages, and how they're progressing in all these big datasets in COG-UK. So it's just a phenomenal way of actually interacting with the data, and zooming in very quickly to see, is this interesting? Is this just kind of a one-off that I've seen?
Or is it something that's maybe rapidly expanding in the UK? And you know, you're catching it first, and you're going to get your Nobel Prize. So yeah, check it out, the COG-UK Mutation Explorer. I was looking for something like this when I was working on the reinfections project, to see if, say, the infection-one sample is the same as the infection-two sample. And one line of evidence we were thinking about was something like this. This is great. So if you saw some mutation in the second infection sample, then you would look at the statistics and see whether it's a rare mutation. It's like, yep, that's definitely a different strain. Or no, it looks really similar. So how does this relate to CoV-GLUE? I guess it's related in the sense that they are calling the variants in CoV-GLUE, and then this is just aggregating them in a different way. Yeah, because this is a whole table. With CoV-GLUE, you can pop in, here's my thing, and then click on a particular variant and ask how common it is. But this, I just had a look, and it's a whole table of things. But ultimately, we want to do some statistics on this, don't we? We want to have this database and then say, what is the chance that this is happening, that this is important? But we kind of can't, or we're limited in what we can do, I guess, because you'd have to pull GISAID and deal with the licensing terms to build a derivative product from that. That is an issue. With COG-UK data, luckily, it's single-file downloads for anyone, updated every three days at the moment. So that at least gets you half the data that will be in GISAID. But you're right, that is a little bit of a limiting issue. So what's next? Well, moving on, Áine O'Toole and Verity Hill are continuing to expand the cov-lineages website with Grinch. So you've seen all the different reports on variants of concern; they have automated a lot of that.
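The kind of query the Mutation Explorer answers, which lineages a mutation like E484K has appeared in and how often, is at heart a frequency tally over per-sample variant calls. A minimal Python sketch with invented records (real inputs would come from COG-UK or GISAID variant tables):

```python
from collections import defaultdict

# Invented per-sample variant calls, for illustration only.
samples = [
    {"lineage": "B.1.351", "mutations": {"E484K", "N501Y"}},
    {"lineage": "P.1",     "mutations": {"E484K", "N501Y"}},
    {"lineage": "B.1.1.7", "mutations": {"N501Y"}},
    {"lineage": "B.1.525", "mutations": {"E484K"}},
]

def lineages_with(mutation, records):
    """Count, per lineage, how many samples carry the given mutation."""
    counts = defaultdict(int)
    for rec in records:
        if mutation in rec["mutations"]:
            counts[rec["lineage"]] += 1
    return dict(counts)

print(lineages_with("E484K", samples))
# {'B.1.351': 1, 'P.1': 1, 'B.1.525': 1}
```

A mutation showing up independently in several lineages, as E484K does here, is exactly the convergence signal the speakers describe looking for.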
And that's where you'll have, say, B.1.1.7 and what countries it's gone to, what the flight patterns are like, and how many genomes are in GISAID. And from that, you can very quickly understand the spread of things, sometimes even before it's reported in the media. And so they've expanded that to account for all these brand new lineages we're getting every day. So we're probably going to have hundreds by this time next year. Yeah, they have some interesting guidance on the website on whether a cluster of sequences counts as a new lineage, and they give different criteria you need to fulfil. So: a monophyletic group, good support values, epidemiological support, it looks like it's moving into a new geographic region, and it's circulating within a region. You know, it's not just one person sitting there, and there are defining mutations that make it a bit special. I mean, this is very rough; there's more information on the website. But yeah, if people are wondering what defines a new lineage versus just an outbreak, that's a good place to start. So Grinch is on the cov-lineages GitHub. I must admit, the readme isn't super informative at the moment. Apparently you feed it some kind of JSON and some kind of config file, and then it tells you something. But I presume this is what they're now using internally to automate all the amazing work they're doing for the cov-lineages website itself. So I guess so much of this stuff is changing so rapidly that documentation and training material and all of that is lagging behind. But you know, I prefer to have working code with incomplete documentation rather than the other way around. Absolutely. And we did a really fun little hackathon just before ASM NGS. And maybe we'll do something like that before ABPHM, you know, just get people together.
The focus of the hackathon before ASM NGS was on improving testing, and we got a bit familiar with GitHub Actions. So maybe we should embed some documentation work in the ABPHM hackathon. Absolutely. Now, in terms of new tools, it seems the US has woken up with a giant stack of cash, and you're going full steam ahead into sequencing. And you're also developing a lot of new tools and that kind of thing. So that's kind of cool. Lee, can you tell us more? No comment on the money part or anything, but yeah, I'm not directly involved with this, but it looks like we have a couple of new sites on the cdc.gov page. And they're basically maps of the US with our variants of concern. So that's cdc.gov slash coronavirus slash 2019-ncov, and you can go to it from there. It's basically an interactive map; one of them is of the world and one of them is the US, and you can see where variants of concern have arrived. So kudos to the team who did this. And I think it's just a really nice, straightforward map without a lot of confusing different things on it. It's very straightforward. It's really cool. You can colour-code things by different variants. So talking about getting all of our genome sequences out into the world, how are you guys actually sending sequences out there? First of all, is it a distributed system versus a centralized system where everybody submits to a central place and that central place submits to NCBI and GISAID? And then how do people even send it to you, like through FTP? Yeah, it'd be the latter in that regard. So the way COG-UK is set up is that all of the sites will generate the reads and the consensus sequence locally, and they'll send those via FTP to the central server where all of these are kept. Then those are processed and used to do all of the phylogenetic stuff you see on Twitter down the line. It's actually quite quick now; within a couple of days it's been pushed up to ENA and GISAID. But that's all done centrally.
The process of actually doing the submissions to the external databases requires a little bit of knowing what you're doing, especially when you're dealing with the INSDC databases. So that tends to get centralized. Yeah, exactly. Even the GISAID submission isn't as trivial as it could be; some people tend to struggle. And I can imagine that if all the sites had to do that on their own, they would basically spend half the week doing that rather than doing any actual work. So all the assembling and stuff, do you worry that people are doing it differently in your network and then sending it to you? Big time. But what can you do? I mean, there was a survey that went around COG-UK, I think, finding out the different versions of the different pipelines people are using. And, you know, some people might be slightly out of date, some people would be totally out of date, and some people would be forking the latest and making their own changes on top, or reinventing the wheel entirely. So even within one consortium, there are a lot of different ways of doing things, once you account for all the differences in versions. And that is a problem, because sometimes you will get a version of some software and it's going to have a bug in it; that's life in software development. And that's a bad thing, and it can come in and mess up your data. So you do have to be mindful of all of these things and look for batch effects on a large scale. But yeah, it is a problem. But you have to do all of this stuff as near as possible to the sequencing, because you don't want to be sending around huge FAST5 files or enormous FASTQ files for the human data; that's a non-runner, really. So you've got to do it locally, and then shove the data up. Okay. And what do you guys do, Peter?
At SANBI, we haven't been doing a lot of the sequencing ourselves, but I've been working with people up and down the continent who are working on it. And people are using a mixed bag of whatever works for them. I mean, we have a real skills gap. We've tried to solve that a little bit. We're doing some work based on the IRIDA platform, the one that was originally from PHAC in Canada, where we have a system that can be installed with a single docker-compose command. Well, I guess you need to pull the thing first and then do the docker-compose. But that'll work on a server with minimal systems administration skills. That'll work locally, so you don't need to ship those huge files everywhere. And it has a web interface. But mostly what we've got is: whoever has the ability to just get a sample out there does so. And unfortunately, the side effect of this is that we don't actually know the quality of everything, and we have to look at average metrics. In general, you look at things like the number of variants you expect to see at a particular time point, and things like that. And unfortunately, the way that we share data, because we're really sharing in GISAID, we're sharing consensus sequences, there's no trace to actually go back and figure out how this thing was put together. If you look at the methods sections of the metadata, it's very vague. So, if you had great bandwidth with all your partners, would the preferred way be to just have them send you, not FAST5, but maybe FASTQ files, and everything is assembled and characterized centrally? Look, I do think that we need to get more people involved and distribute the work. But I also think that we have to make it easier to actually store the raw data, maybe even at the FASTQ level, somewhere. At the moment, the databases where we can publish raw data are the INSDC databases. These are archives. These are not outbreak response databases.
They're not designed for public health needs, and you can see that in terms of how difficult they are to work with. So where are we going with this? I don't know. I mean, we've seen a couple of documents again more recently from the WHO, but we need to think seriously about what regional or even global outbreak response looks like from a data perspective. So are we going to be putting everything into the cloud now, into AWS or Azure, and hoping for the best with some kind of global centralized system? I mean, we've got a lot of concerns from LMICs about inappropriate use of data. And not only there; I remember some people in the Netherlands were rather annoyed at how some of their data was used by researchers in the UK. They put it out there related to the mink outbreaks, and then somebody picked it up and published a preprint quickly, and they were not happy. I think these sorts of concerns, and then general issues around material transfer agreements and so on, basically mean that you're going to need to have a lot of this in catch-all blanket consortia. I don't think you can freely chuck data out. The legal precedent around it just doesn't exist, to just put data out there and ensure that it's going to be used appropriately. So it should be a closed shop. Within that walled garden, it should be quite loosey-goosey about who can do what, but everyone sort of understands that they have to play nice and be helpful to the people who originally generated the data. In terms of how we actually shunt this around, I think getting started with the bioinformatics for COVID, running it in-house, is going to be problematic. Just having someone at a laptop trying to figure this out and not make mistakes is going to be quite tricky. So short reads are probably going to be pushed around via FTP, which is the lowest common denominator here. FTP will work on any computer, right?
So you're going to need to base your system around something like that, where you're shunting up the consensus sequence, and probably the reads, to some central place that's going to do all the analysis and then do the downstream submission to databases and put all the pieces together. I mean, who's going to generate the phylogenies? Are you going to run it locally? That's fine. But if you want to do something outside of the institution and look at everyone else's data within one region, like the US or East Africa or wherever, you can't put it up to ENA, wait for them to mint the accession codes, wait for the data to go public, and then pull it back. That's not going to work. So there is going to need to be this clearing house that everyone can use, where you just shunt data in, like cheques through a clearing house, clear them, build your analysis, and then push the results up to the central databases. I know locally, in the region that we serve, people want results really quickly; clinicians and public health officials want local results very, very fast to answer local questions. It might be that you have a patient in a hospital ward and you really want to know what variant they have, what lineage they have, because you may need to take some action, or you may need to do enhanced contact tracing or whatever. So you do also need very rapid analysis locally, and the ability to do that, to actually have an impact. I know a lot of these larger databases are great for retrospective academic research, where you have a lot of time, but when you have a pandemic and people are dying and you want to stop it, you've got to be quick. I mean, this also relates to the density of what you're doing. When you're sampling things at a certain level, all you can say is approximately what lineages you're seeing. When you're sampling things as densely as they are in the UK, then you start actually having something that is useful as an immediate local public health tool.
So the more work you do, the more the problems you face change. Even a high-level understanding of what's circulating will dictate government policy. For instance, which vaccines are they going to use? If there's a particular prevalent variant, it doesn't have to be fine-grained, but just having a basic idea of what's circulating in the population will change what real people are going to do. And they want to know that pretty quickly. And I can't imagine an excuse like, oh, I'm sorry, but the only guy who knows how to upload it to NCBI is off and no one else knows how to do it. That's not going to work. But I'm just saying, if we take your example there, vaccine deployment is a national-level decision. So if you have national centralized resources, you can guide national-level public health decisions. If you are sequencing down to being able to see what's happening in each district, then your district health people are going to want to know what's going on in their specific region. So the more you do, the more successful you are, the more people start asking you questions, and the systems change. Yeah, definitely. That's the curse of doing your job well: you just get more work. But I think there's another option besides the let's-dump-it-onto-AWS approach, because, let's face it, in certain legal jurisdictions AWS is just not a possibility for some people. But web-based software-as-a-service, I see that as another niche: if people have got enough bandwidth but not a lot of expertise, they upload something to a website and get some kind of overview. And then the data doesn't stay on that web store; it's still being moved constantly from their local systems to some website.
I mean, we do that with tools like CoV-GLUE and Nextclade and things like that, and the web-based Pangolin server that CRG runs. So there is a space for web-based things that are not cloud-based, platform-as-a-service things. So there's more than just local versus web versus cloud. Yeah, I think it has to be tailored around the user base that is going to be interacting with it. You don't want to expose the decision of whether this is a local resource or a cloud resource to someone who's not computationally orientated. They need a nice web interface where they drag and drop their file in and they get a result back in whatever timeframe; where you do the compute is your problem. And that's fine if you're doing, say, a plate a week kind of thing. And then the power users are going to want a different framework, where they probably run the stuff locally, and they have someone managing it and looking after it. So you'll have these different tiers of your data flow, of where things are done, and we'll have to be quite flexible with that. I've got a question for you, Lee, because of us here, I think you've got the most experience running a service that had to produce a standardized report every week before the days of SARS-CoV-2. I think it was on foodborne outbreaks, right? And then, you know, you start moving away from having a bunch of graduate students driving your infrastructure. Let's not forget that incredible people like Áine O'Toole and Verity Hill are still students. And then you start having a professional service for this, and that causes a different set of decisions about how things are done, and when things can change, and that kind of stuff.
When you do have a report like that, you have a business side instead of an academic side, and it kind of sucks. It's kind of cool at first, but then you are definitely expending a lot of energy. I was just going to say, the next step then would be for commercial service providers to take it over. You know, the academics have stepped into the breach to provide a quick temporary service, but we're not going to sit around for years and years on end just sequencing COVID and producing pretty trees. There has to be that next step where it's put into long-term, stable production. Yeah, my graduate school advisor actually warned me against trying to come up with actual tools, because then you're on the hook forever. Unless, I guess, a commercial partner takes it over. You can be emailed whenever, even after you leave Sanger, Andrew. Yeah, yeah, absolutely. And I mean, one of the things that I've seen with some of the public health labs in less-resourced places is that if you ever want to have an academic career again, you have to figure out how to still squeeze publications out of your work, while also doing what you described as a kind of business. It's a really tough position for people to be in, where there are so many competing priorities. Actually, in my own work, I totally was not thinking you were getting into my foodborne side. It's like my last lifetime. It did turn into a little bit of a business, doing reporting and everything for that kind of stuff. And that's ongoing. It's not something that has gone away. In fact, it's increased, if anything, and the future of it necessarily had to be to give it to some kind of team to take over, which is basically what we did for one aspect of that. Over time, we just realized it was a routine part of our work.
And we designated a few of us on the bioinformatics team in our foodborne lab to take it over, and they're the point people on it, even though I was the developer on that stuff. Can I ask, with the GenomeTrakr project, you put MiSeqs everywhere in the US, like into all the state labs, it seemed. Are those being repurposed for SARS-CoV-2 sequencing? I'm going to give a disclaimer that I'm not part of GenomeTrakr, so I'm going to say what I think I understand from the outside. So GenomeTrakr and PulseNet are separate networks. And what's really cool, along the lines of your question, is that all of our state labs have actually been given training in MiSeq sequencing and have been given MiSeqs, or they purchased them. And so our labs are equipped, thanks to PulseNet, thanks to GenomeTrakr. And yeah, they are starting to get repurposed. I won't say too much, since not a lot of it is public yet. But I think that a lot of these labs are getting a lot more geared towards SARS-CoV-2 work, in a really positive, constructive way. And a lot of it is thanks to the backbone of these networks. I guess it helps as well that you have lots of bioinformaticians who have been working on these projects for many years, working on, say, GalaxyTrakr and doing bioinformatics analysis. So you've kind of got a ready-made army, nearly, of people to do all of this work. Yeah, I wish we had more of an army, actually. But yes, there are some really effective and really hardworking people out there who have been trained as part of these networks, and they are amazing. And I like to call people out on the podcast for doing great work, and I want to, but I'm afraid I'm going to leave people out, because there are so many people.
So we have a really great network full of bioinformaticians and training leads and MiSeq operators and just amazing people, for sure, who have a rich history, especially in the foodborne world, but in other things too. Any other news? The CoronaHiT paper has now been published in Genome Medicine, and it's new and improved. You can basically do one prep that you can then very easily run on Nanopore or on Illumina. And we've been using it to do up to 1,500 SARS-CoV-2 samples on Illumina and up to 96 samples on a single MinION run. So it's very flexible and it scales to very, very high levels. You did 96 samples on a MinION before the 96 barcodes came out? Yes, we did 160 actually, but we didn't publish it. Okay, but you did many. I mean, is this something that I should still consider using, even now that there is a bigger barcode kit? Is it a better approach than what I can get from the manufacturer? It's cheaper. We make that comparison in the paper, so you'll have to read it and find out. And besides the paper, is it also on protocols.io? It is on protocols.io. I would suggest, if anyone wants to find out more, just to read the paper. It's more of a wet-labby paper, so I apologize if I don't fully understand all the technical wet-lab details, but I am assured by people who actually know what they're doing that it's awesome. I will go read the paper, even though I also know almost nothing about the wet lab. Who do we @ on Twitter? David Baker. So I took advantage of your paper, because Nabil and I got talking about it and we realized that there aren't a lot of benchmark datasets out there for SARS-CoV-2. And I reached out to the Public Health Alliance for Genomic Epidemiology community, and Nabil answered the call. I just said, who can provide some benchmark datasets? Like, if I came up with my own new pipeline for SARS-CoV-2 and I wanted to, for example, cluster in the best way, how do I benchmark?
And you guys have some really great data in there, and show how those genomes should be clustered. Yeah, because we were demonstrating the value of a sequencing protocol, we just wound up sequencing the same samples over and over again, using Illumina, standard Nanopore, and the low-cost CoronaHiT method. But for informaticians, this is great, because you've got multiple platforms supporting the same sample. So you can do really simple QC, like running a tool where you should get all of those same samples clustered together if you're doing a phylogeny, that sort of stuff. So yeah, it's a good dataset for that. And it's around a hundred samples that you can play around with for each platform. And all the data is in the ENA as well, so you get the raw reads for all of these different experiments. Yeah, I put up the raw reads, and our consensus sequences are also there, attached to the ENA records. Awesome. So I have to give the disclaimer that maybe when this episode comes out, it won't be exactly up on the site yet, but it's getting there, just because a couple of the samples, when I was actually doing it last week, were not available. So I'll try again. Yeah, it's been a bit slow getting the ENA stuff public. I don't know what's going on. I did make one more benchmark dataset, so we do have one online. Danny Park provided the accessions from, I apologize, I don't remember which paper it came from, but I appreciate him helping us out. A very nice benchmark dataset with three introductions of SARS-CoV-2, so you can test your pipelines and see how well they cluster. Yeah, that's a real-world outbreak example, I think. So yeah, this all came from a discussion within PHA4GE, on the PHA4GE Slack. So the Public Health Alliance for Genomic Epidemiology is exactly what the name says. We have been looking into how to create some standards, so people know, you know, instead of just going with, I invented this pipeline, so I'm going to use it.
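The replicate-based QC described above, where the same sample sequenced on different platforms should fall into the same cluster, can be approximated with nothing more than pairwise SNP distances. Here is a toy Python sketch with short, invented consensus sequences; a real check would use the full genomes from the benchmark dataset:

```python
from itertools import combinations

def snp_distance(a: str, b: str) -> int:
    """Count positions where two equal-length sequences disagree,
    ignoring positions with an ambiguous 'N' call."""
    return sum(1 for x, y in zip(a, b) if x != y and "N" not in (x, y))

# Invented consensus sequences, keyed by (sample, platform).
consensus = {
    ("sample1", "illumina"): "ATGGCTACGT",
    ("sample1", "nanopore"): "ATGGCTACGT",
    ("sample2", "illumina"): "ATGACTACTT",
    ("sample2", "nanopore"): "ATGACTACNT",
}

# Replicates of the same sample should be (near-)zero SNPs apart;
# different samples should be clearly further apart.
for (id_a, plat_a), (id_b, plat_b) in combinations(consensus, 2):
    d = snp_distance(consensus[(id_a, plat_a)], consensus[(id_b, plat_b)])
    kind = "same sample" if id_a == id_b else "different samples"
    print(f"{id_a}/{plat_a} vs {id_b}/{plat_b}: {d} SNPs ({kind})")
```

If a pipeline places replicates of one sample far apart, or merges distinct samples, that flags a problem with the pipeline rather than with the data.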
So that's one of the things that we need: good-quality benchmark datasets. We're also discussing what the interfaces look like. Let's actually focus on what the users want, rather than, oh my goodness, I wrote this thing, now everybody must use it. So Peter, you had a query about benchmark datasets with FAST5 data for testing, say, the ARTIC ONT pipeline. Let me give you a bit of background on that. And I know why it can't be shared, but I'd actually like to ask a question here. So the field bioinformatics software from ARTIC, the ARTIC software itself, can either use Medaka from the FASTQ to call variants, or it can use Nanopolish and some upstream stuff there. Now, of course, the upstream stuff assumes that you've been doing the sequencing yourself, but yeah, can you explain why it's not out there in the wild? Yeah, generally, or quite often, you have a lot of human reads that carry over into your sequencing. So you never really want to release human reads accidentally. So people will filter those out as they go through a pipeline, after basecalling. So that's probably why there isn't very much FAST5 data out there. I don't know if anyone has released FAST5 data for SARS-CoV-2 from ARTIC. There are a small number of test datasets in the ARTIC repository. I'm not sure if that's true, but yeah, I don't know to what extent people are using the Nanopolish route versus the Medaka route. I actually don't know the pros and cons of the two. We're using Medaka, I think. I was trying to explain this to a colleague of mine earlier, and I had to go so into the weeds, you know, of, this is what happens when you run a nanopore sequencer. And then I showed him a picture of a nanopore sequencer running, and he thought that the computer was the sequencer. And then I had to point out, no, you missed it in this photo, you missed this little thing sitting here next to the mouse.
And that's the actual sequencer there. Oh yeah. The future is now. So, going to any conferences anytime soon? Well, I've registered for ABPHM, Applied Bioinformatics and Public Health Microbiology, which is only a hundred pounds this year because it's virtual. Normally it's held in Hinxton every two years. Fantastic conference. I think I saw all of you at the last one, when we were doing something in person. I think, Peter, you had a mohawk the last time I saw you, actually. I kind of still do, but it's just a little bit further back on the head, receding, you know? So yeah, I mean, ABPHM, I kind of think that it's very seldom that a conference has spawned a field to the extent that ABPHM has. I think the first crowdsourced outbreak analysis happened during one of the ABPHMs? Yep. Yep. For the European O104 E. coli outbreak. And the abstract deadline is coming up on the 9th of March. So yeah, you know what you're going to be doing on the 8th of March. Yep. I will probably not be able to make it, so I might have to duck out of the podcast too for a little while. I've got a baby girl coming right around the time of that conference, mid-May. There won't be any fancy cookies from you this time. Man, that is like my favorite part of international conferences, just sharing candy or cookies or whatever else from your locality. I really enjoyed bringing Oreos to that last one. Well, congratulations, Lee. And yeah, I always remind people that I handed out cookies that I got from the foodborne outbreaks guy. That's awesome. I will bring more next time I see you guys, at whichever conference. So that's all the time we have for today. We've been talking about SARS-CoV-2. I want to thank Peter for joining us today, and thanks to everyone for listening. Thank you all so much for listening to us at home. If you like this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice.
And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group and edited by Nick Waters. The opinions expressed here are our own and do not necessarily reflect the views of the CDC or the Quadram Institute.