[Speaker A]: Hello, and thank you for listening to the MicroBinfie podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody really writes it down. There's no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. I am Dr. Lee Katz. My co-hosts are Dr. Nabil-Fareed Alikhan and Professor Andrew Page. Nabil is a senior bioinformatician at the Center for Genomic Pathogen Surveillance at the University of Oxford. Andrew is the CTO at Origin Sciences and Visiting Professor at the University of East Anglia. Hi, and welcome to the MicroBinfie podcast. Nabil and I are here. Andrew has been detained at his real job. In his place, I've brought on a friend and colleague, Clint. He's an expert in SARS-CoV-2 and has a background in virology. Just a reminder that the opinions expressed here are our own and not those of our employers. We're going to be talking about Pathoplexus, and today with us are three guests. We have Dr. Emma Hodcroft. She's a molecular epidemiologist and bioinformatician who studies how viruses evolve and spread. She's a co-developer of Nextstrain, created CoVariants.org to track COVID-19 variants, and recently co-founded Pathoplexus.org to make sharing viral genome data easier and fairer. Since 2023, she's led her own lab at the Swiss Tropical and Public Health Institute, where she's back to studying the viruses that keep circulating outside of pandemics. Welcome, Emma.
[Speaker B]: My pleasure to be here, Lee.
[Speaker A]: And we also have Dr. Theo Sanderson. He's an assistant professor at the London School of Hygiene and Tropical Medicine. Theo started his career studying malaria, doing a lot of functional genetics, but when the pandemic came along, he became fascinated by viral genomics.
SARS-CoV-2 brought an unexpected scale of data, and Theo built new tools for working with virus genome data at this scale, including Taxonium, a tree explorer. Now he's trying to expand these large-scale phylogenetic methods to a broader set of viruses. Welcome, Theo.
[Speaker C]: Thank you. Great to be here.
[Speaker A]: We also have here Mr. Arthur Shim Kasambula. He's an independent consultant for bioinformatics and data science. He graduated from Makerere University with a Bachelor of Science in Biomedical Laboratory Technology and a Master of Science in Bioinformatics. He has spent the last four years working with the Ministry of Health and USAID- and CDC-funded partners to help manage and coordinate outbreak data for COVID-19, Ebola, malaria, and the ongoing mpox outbreak in Uganda. Welcome, Arthur.
[Speaker C]: Thanks, thanks.
[Speaker A]: So just to start this off, why is it called Pathoplexus and what does it do? Emma or Theo, do you want to take this one?
[Speaker B]: Yeah, I'm happy to jump in. So Pathoplexus is a fantastic name, I hope, that conveys a little bit of the heart of what we're trying to do here. So the patho, no surprise, comes from pathogen, and then plexus is like an interwoven network. And this hopefully conveys what we're trying to do, which is to provide pathogen data in a way that helps people to be more interconnected with that data, and allows that data to be more open and also easier to share.
[Speaker D]: So, um...
[Speaker E]: Either Theo or Emma, do you want to tell us about what the platform actually does?
[Speaker C]: Sure. So Pathoplexus is a virus genome database, and what it allows is for somebody who sequenced a virus to share it with the broader community so it can be analysed for public health responses or to understand how viruses are evolving.
And so we try to provide both a really easy way to upload your data, just dragging and dropping data into a form to make it appear on the database, and we also try to make lots of nice ways to search the data: to be able to search for viruses with specific mutations, to be able to search for viruses from specific time periods, and to download the set of sequences associated with those viruses. Emma might want to say more.
[Speaker B]: I guess one thing that makes Pathoplexus a little bit unique is that, as some listeners may be aware, one of the biggest barriers to sharing, especially when this sharing is being done by academic labs, is fear of scooping. So essentially the worry that if you publish your data soon, someone else could publish a paper on it and you won't get credit for the work that you had to do to get those sequences. While at the same time, of course, we'd really prefer that when it comes to infectious disease, people share data quickly so that we can use that in an outbreak response, in learning about a new pathogen or a new spillover. So how do we handle the fact that we want data to be shared quickly, but people are worried that if they share quickly, they might not get credit? And at Pathoplexus, one of the ways we've tried to handle this is by having two ways to share data. So one way is to share that data openly, which is very much the same as if you were to put it on GenBank, so anyone can use that data, they don't need to ask you. Or you can actually share that data under what we call restricted use, which means that for up to one year, people can use it for public health and communication, but they can't publish on it or preprint on it without your permission.
And what we really hope, and what we've heard from users so far, is that this allows kind of a balance between sharing for public health use, making sure that data gets out there quickly, while still ensuring that people have time to write up a publication and get that credit that they need for the work that they've done.
[Speaker E]: Well, okay, so you mentioned mutations, Theo. Does this actually do analysis beyond just hosting sequences?
[Speaker C]: Yeah, so what happens when you drop a bunch of sequences into Pathoplexus is that, as soon as you do that in the upload form, what we call the pre-processing pipeline springs into action. And for the viruses that we're currently using, that's basically running Nextclade against them. So it's using Nextclade to assign lineages, to figure out mutations with respect to the reference genome. And that can both tell you about mutations, and it can also just give you a lot of QC information about how much coverage you've got of the genome and so on. And so within seconds of putting your data in, you get to see all those scores, and then you can decide: does the sequence look good and you can release it to the database, or do you want to go back and do some more QC on it first? So yeah, there's a bunch of pre-processing that happens on the data, and then as new lineages are defined and so on, we also reprocess the data so that it's always up to date with those lineage calls.
[Speaker E]: Okay, so that must mean there is a curation or some logic around the specific pathogen that's in the database. So which pathogens are covered and how do you pick them?
[Speaker B]: Yeah, so we launched Pathoplexus originally with just four pathogens: Ebola Zaire, Ebola Sudan, West Nile, and Crimean-Congo hemorrhagic fever. And we targeted these because they were viruses that are of public health concern, clearly, but also ones that didn't have what we call a good home.
So these were viruses where we saw people didn't want to share those sequences onto GenBank immediately. They were worried about, you know, getting that publication, as I just talked about, but they wanted to share the sequences for public health. So we were seeing them uploading sequences to Virological, to GitHub, to random web pages, which is great that they're sharing them, but obviously doesn't necessarily mean that data is very findable, because you need to know which of these various websites it might be on. And so we thought that we'd start Pathoplexus trying to meet that unmet need, so trying to have a place for viruses where people are trying to share those sequences, where there's a public health need to have that shared quickly, by providing kind of a home in Pathoplexus. And since then, we are really lucky we've been able to expand to other viruses. So, for example, we now have mpox and we have RSV-A, RSV-B, and HMPV, and we'll be launching measles as well in the next few days, which is really exciting. And the way that we decide which pathogens to add: they are all, of course, still of public health concern, but we actually rely on the community to help us make these decisions. So we ask that if people are interested in us adding a virus, they get some kind of indication of community support. We've had some surveys, we've had some letters, that just show that the community that works on that virus thinks that it would be useful to have that virus in Pathoplexus. Because we think that this way there's some input from the community that adding this isn't going to be something where we add it and actually it didn't need to be added, the community was happy with the system they had. Rather, we can really add functionality, and putting this into Pathoplexus will hopefully make a difference to public health and to research.
[Speaker A]: So that's a lot about, like, the data and community, and that might lead us to, like, how the site is run.
I wanted to point out that Clint might have had a question about that also.
[Speaker F]: Yeah, I had a question about the data deposit. You said that all of the data is available for use immediately, but the non-public data is just not allowed to be published on yet. Is that correct?
[Speaker C]: All the data on Pathoplexus is accessible to the public, but some data has this restricted use term attached to it, which means that, yeah, for a limited period of up to a year, people can't publish on it, but it can be used for public health responses and so on.
[Speaker F]: And a couple of follow-ups. So are the restricted sequences indicated in the FASTA header or something like that, so that you know that you've got one of those in your analysis?
[Speaker C]: So there's two ways it's indicated. So when you download metadata, there's a column with this field, but also, in the download form, you actually choose: do I want to download just the sequences that are available under the open use terms, which don't have any conditions attached, or do I also want to include those under the restricted use terms? So you really opt into receiving the restricted use sequences, and then by opting in, you're saying that you'll respect those conditions.
[Speaker F]: And one last follow-up there is, so are the restricted sequences available for something like to appear in aggregate, like on a public database or a tool like outbreak.info?
[Speaker B]: Yes, exactly. So we also wanted to make sure that, for example, when people are doing, you know, we can think back to a lot of the experiences we had in the pandemic. We had a lot of websites that popped up that helped to track variants, that looked at mutations, that looked at, you know, antibody binding, all sorts of super useful stuff. And for that, you need all the data.
We do have requirements around making sure that that is credited, you know, that there's an appropriate way to tell what data was used, that you included restricted data in that data set, and which of the data is restricted if you follow the links back to Pathoplexus. But we don't restrict that use, because we consider that that is essentially public health. You know, this is the future of public health, having these 21st century tools to do that.
[Speaker F]: That's really excellent. Thank you.
[Speaker C]: While we're on the subject of crediting, I guess something else we could mention is that if people are publishing papers using Pathoplexus data, there's a specific way of giving credit, which is that we have this SeqSet concept, where people can define a set of sequences that they used in their papers. So you go on to Pathoplexus, you list the accessions of the data that you used in the paper that you're publishing, and what that does is generate a SeqSet. That SeqSet has a DOI associated with it, and so you cite that DOI in your paper. What that allows us to do at Pathoplexus is to have a set of citations that can be tracked through the Crossref system, which can allow us to build tools that will really credit the generators of those sequences. So you could be able to see, and these are features that we're still working on building, but you can see, okay, what's the full list of papers that have used my sequences. And hopefully that really incentivizes, again, people to share their data, because they see the impact of that in the world.
[Speaker F]: I think it's a huge improvement on that concept. I'm very excited to hear that.
[Speaker A]: Yeah, actually, so like my personal anecdote is that I was using some SARS-CoV-2 website in 2021, sorry, 2020, and I did like an all-versus-all comparison of a thousand genomes.
And in order to cite all of those genomes, I had many pages of PDF to do that, and I just gave up and I didn't do it. That's really slick, the way that you had a solution for that.
[Speaker G]: Thanks.
[Speaker E]: So I wanted to bring Arthur in for a moment to tell us about, I suppose, as a power user, how does he feel about Pathoplexus?
[Speaker G]: Sure. And thanks. I think one of the things that Pathoplexus addresses are the barriers or the worries that come with sequencing labs: that sort of ownership, or their data being used for publications even before they get to do that, because maybe they get caught up with follow-up sequencing that comes around during outbreaks and all that. So having that restricted use is sort of an incentive. I mean, from my end, I've realized some of the labs, especially in Uganda I would say, find it easy to share the sequences via Pathoplexus with that sort of safety net, without any worries. But one of the monumental times, I think, that happened was during our recent Ebola outbreak. I think it was declared around... there were two things here, both a turning point for Pathoplexus but as well as a monumental time for our country, having almost near real-time characterization of the COVID, but then also just having it uploaded directly to Pathoplexus. And maybe what I can highlight there is how the process was smooth, besides the collaborations that happened with the team and the institutions within. But the teams were actually able to submit sequences without needing constant support from the Pathoplexus team. And probably that shows how smooth and how easy the process of submitting the sequences can be for users. But probably, I mean, Theo could just chip in on that.
[Speaker E]: I had a quick follow-up. I'm going to come back to the analysis provided by the platform. In your experience or with your partners, Arthur, was what it generated enough for you, for your use?
Or what else did you have to do in your use case?
[Speaker G]: Sure. I'll have to say that I'm not part of the team that actually did the analysis, but maybe Emma, you'll correct me if I'm wrong, some of the analysis features, I think, weren't yet up and running in Pathoplexus by then. But according to what the team highlighted and how they followed up in the process, I think they were able to leverage the ARTIC pipeline and then also do phylogenetic analysis of the sequences.
[Speaker B]: Yeah, I can follow up just a little bit, just to say that at Pathoplexus, our aim isn't to provide in-house the full phylogenetic analysis that one would want to do for, you know, a new Ebola outbreak, for example. We do hope to provide, as Theo said, some quality control metrics, mutation calling, information about the sequences. And we're very excited about a feature that we launched just a couple of months ago, which we call a link-out feature, where users have the ability to select sequences in Pathoplexus and then send those out to other websites where they can do things like visualize something or do an analysis. We only have two at the moment, but one of them is to link out to Nextclade, which is a tool, as Theo already covered a little bit, where you can do QC and lineage assignment, and it also does quick tree placement. And so this is something where, for example, users that either don't need a full phylogenetic analysis or want a quick preview of what a phylogenetic analysis might show could use this to quickly see where their sequence falls in comparison to other sequences.
That is, other sequences that are publicly available. And we would be very excited to add more link-out tools if people develop things that could take in those sequences and do some quick web-based analysis that would help users to better understand their sequence in the context of others, or what those mutations may do, or, I don't know, what the 3D protein structure looks like. These would all be really cool things that we could implement to send sequences straight to those tools.
[Speaker E]: Yeah, let's talk about interoperability. Is there an API, or how would a developer interact with Pathoplexus?
[Speaker C]: Yep, so there's definitely an API, there's actually a couple of APIs. So there's one API for submission, for submitters to use, which talks to the back end of the database and allows all sorts of submission or revision operations to be carried out on sequences. And then there's a query API, which at the moment uses LAPIS, which is a piece of software developed at ETH Zurich to allow querying large amounts of viral genetic data. That's the same API that's made available for CoV-Spectrum, which quite a lot of people know from the COVID era, and that allows you to do all sorts of queries for specific mutational patterns, for countries, for dates, and so on. So there's a fairly sophisticated API, and definitely interoperability is really important to us. You know, you can really imagine a flourishing ecosystem of lots and lots of different tools talking to each other, and what you need for that is databases which provide APIs that allow you to get the data you need to run a particular tool. So we're really keen to support that. Something else that is at an early stage is a command line interface as well, so that's another way that people could talk to Pathoplexus in the near future.
[Speaker F]: Are those incoming and outgoing data?
Is there a UI-less submission protocol or also a nice submission interface?
[Speaker C]: Absolutely. So yeah, you can submit without touching the UI at all. And yeah, I think that's a really sort of missing need, to a significant extent, in the world. So one of the things that Pathoplexus does is to ensure that all of the data ultimately gets to the INSDC databases, so GenBank, the ENA, and the DDBJ in Japan. And I think, for example, for GenBank at the moment, there isn't really, for most viruses, a way of submitting without using a UI or without sending an email. So I think providing easy ways to allow people to get data into INSDC is one of the contributions that we think we can make by providing these APIs.
[Speaker F]: So, for example, if I submitted an RSV sequence at Pathoplexus, what is the typical turnaround time from getting into Pathoplexus and then making it through INSDC back to GenBank?
[Speaker C]: Yeah, that's quite variable. We would hope that within one to two days we would have that sequence submitted, and because ENA is the INSDC database which has an API, we submit via the ENA API. So that's sort of our way into INSDC. But then, of course, there's the synchronization between the INSDC databases. At the moment, we're seeing probably anywhere from sort of five days to, yeah, up to maybe a month. And if it's a month, it's because it's requiring us to manually get in touch and figure out whether something has got stuck in some system not talking to some other system. So it can be a bit variable, but again, we're continuing to see what we can do to get these times as short as possible.
[Speaker B]: I guess it may just be worth emphasizing, though, that these are delays that unfortunately are outside of our control. When you submit to Pathoplexus, it'll be live on the Pathoplexus website in like five minutes.
But then, from when we submit it onward, we're working with all of these networks to try and help this go through a little bit faster and more seamlessly. But unfortunately, we directly cannot make them appear any faster in INSDC.
[Speaker E]: Well, that kind of is the point, isn't it? The fact that it does take time for data to go into INSDC no matter how much you want it to. And that's essentially why you want something like Pathoplexus, which can be more agile.
[Speaker C]: Yeah, I think there are different needs when you're, you know, in an outbreak setting compared to when you're submitting, you know, vast metagenomic data sets or something. And so, you know, we can really focus on this particular challenge of pathogen data in close to real time. So yeah, I think that that is a sort of strength that we can bring.
[Speaker E]: Yeah, I think we have to remember that the INSDC databases, GenBank, ENA, and DDBJ, are archives. It's like a library. I don't think it was intended to handle data at this velocity, which may change in time, but right now I think it still isn't something you can really do.
[Speaker A]: We're about at time, and so I'm going to, unfortunately, close the conversation for now. I think we'll have to pick this up again next time to hear more about other interesting things, especially future directions for Pathoplexus. Talk to you all later. Thank you so much for listening to our podcast. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. This podcast was recorded by the Microbial Bioinformatics Group. For more information, go to microbinfie.github.io. The opinions expressed here are our own and do not necessarily reflect the views of Origin Sciences, the Center for Genomic Pathogen Surveillance, or CDC.