Hello, and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There's so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Andrew and Nabil work in the Quadram Institute in Norwich, UK, where they work on microbes in food and the impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the US.
Hello, and welcome to the MicroBinfie podcast. Welcome to another software deep dive, where we interview the authors of a bioinformatics software package. We know that behind all software, there are quirky details that never make it into the final paper. So today, we're having a chat about some of the software that is behind COG-UK. So COG-UK is the COVID-19 Genomics UK Consortium, which was created to deliver large-scale and rapid whole genome virus sequencing to local NHS centers and the UK government, particularly for COVID-19. I'm joined today by Professor Nick Loman, who is a professor of microbial genomics and bioinformatics at the Institute of Microbiology and Infection at the University of Birmingham. We also have Sam Nicholls, who is a postdoc in Nick's lab and self-described wonderful wizard of COG, and Radoslaw Poplawski, who is CLIMB's cloud manager at the University of Birmingham. Today we're going to be talking about the central database infrastructure called Majora that underpins the sequencing efforts within COG-UK. Nick, let's kick off with you. What exactly is the problem Majora is trying to solve?
We decided in COG that we wanted to very rapidly have the capability to sequence SARS-CoV-2 genomes.
And at a meeting back in mid-March, we came up with the idea of the Coronavirus Genomics Consortium. And we just thought the best way of doing this would be to leverage all of the UK's deep expertise in sequencing and in molecular phylogenetics and create a network. And so we wanted everyone that could contribute genomes to be able to do that really, really quickly using whatever platforms they're comfortable with, even different sequencing protocols. And we actually ended up having a network of about 13 or 14 sites, which gave us a real problem: how could we do a centralized analysis of the data produced by the consortium? So Majora solves that problem, basically. It connects together all of the distributed sequencing hubs that generate both sequence data and sample metadata. And it provides a database and a software infrastructure to integrate that information so that we can generate analysis products, upload that data to public databases, and actually get on with the business of understanding coronavirus genomics.
I'm going to ask a kind of reviewer three question. What was the motivation behind setting this up? Like, I mean, what did Majora need to do that wasn't available at the time?
That's a really interesting question. I think that there are lots of different software infrastructures available. And of course, there are good examples of kind of networks of sequencing that go on. So for example, in the US for Salmonella and other foodborne bacterial pathogens, you have GenomeTrakr. In Canada, they have a system called IRIDA. And the UK has been at the forefront of also generating whole genome sequences for organisms like TB as well as the foodborne pathogens. But this is a new virus, and it's very unusual to have a distributed network that's basically just popped up effectively overnight. It takes months or years of intensive development to get a system that's kind of stable and working.
And so we needed something incredibly agile that would work basically immediately, that we knew that we could build on and layer on functionality that we would need. And there was just nothing off the shelf that we could easily use for this idea, particularly the fact that it's such a distributed system taking disparate types of sequencing information in. So as with all kind of classic bioinformatics problems, the solution is to roll your own. And Sam already had the sort of bones of a platform called Majora. It's very different to what we have now, but it was obvious to me that Sam would be able to bring something up very quickly to support this.
I want to bring Sam and Rad in now. Was there anything you want to add to that?
I think Nick's covered all of that, really. I think the special thing about Majora, like Nick was saying, is that it's allowed us to scale up to a very different sort of type of sequencing network, not necessarily one that's based in a very large sequencing center. We've got lots of different disparate sequencing sites doing lots of different types of sequencing with all sorts of library extraction techniques and bioinformatics techniques. And this is a way to bring all of those together.
And for you Rad, coming in as a cloud manager on the sysadmin side, does this seem really novel to you as well, this sort of technology?
It's the first time we did something in such a short amount of time that was a bespoke solution. I also think we just played to our strengths. Each of us is good at something and we sat together: okay, you can do this, you can do this, I can do that. And the end product is Majora and UranZone.
I think that's a good point, actually, that the way it was developed as well is not in isolation. I worked really closely with Rad to build it on CLIMB. We worked closely with other people to integrate it into systems as well. So it wasn't developed in isolation.
This whole ecosystem has sprung up out of Climb really and the connections that we have. Well, we started talking about the conception of the idea. So when did the work actually start? So I guess like Nick said, I had musings on what a LIMS should look like a little bit before the outbreak began. Towards the start of this year, we were ramping up sequencing for one of the projects that I'm attached to, and we decided that it needed some sort of LIMS to keep track of all of these samples for fecal metagenomics. And then, you know, when we were pulled off all of our grants at the beginning of March, Nick said, drop everything, I need you to do something, it's quite important. We realized that we could probably spin what I'd built into something that could serve this sort of concept of a large sequencing consortium. And that, I think, was at the beginning of March. And I have fond memories of Rad and I working, I think, the second weekend of March, pulling all sorts of crazy hours to build the first VM and start deploying software to it. So that was, I mean, I remember when we were talking with Andrew and Justin a few weeks ago, they talk about this call to action as a sort of phone call at midnight kind of thing. Was it much the same for both of you? Yeah, pretty much. I think sort of work was quiet for a week and Nick was sort of in lots of meetings. And then we get a little message on Thursday, March 12th, I think, Nick said, you need to drop everything. I need you to start building this system. So that was Thursday evening. We worked all weekend. On Monday, we had some bare bones infrastructure that we can just continue to work on. That was a crazy weekend. The first thing that we actually set up Majora to do was we retooled it to be a user management system. That was actually the first thing we got it doing that weekend, because we knew that loads of people were going to need to sign onto the system to upload data. 
So the very first thing that we wanted to do, before the pipelines, before Majora as a LIMS, was just provide a platform for people to provide these genomes so we could start coordinating an upload of them to GISAID. So Majora was actually a user management system at first, like that weekend. And I think we developed that in two or three days and had our first user on the Monday.
And I think it's worth pointing out that it sounds very ad hoc the way that we're describing it. And of course, it sort of necessarily is. Coronavirus, it surprised us how quickly it came to the UK and how big it got. Everyone was obviously watching the news back in early January and the reports coming out of Wuhan, but I think it still caught people on the hop. But although it's very ad hoc, I think it's very important to point out that there's a sort of deep well of expertise and knowledge and actual pieces of the puzzle that already existed. It wasn't about standing up an infrastructure from nothing. There was a huge amount of underlying stuff there. So obviously the CLIMB platform is a great example of something that's been developed over at least five years to be able to rapidly respond to kind of hardware infrastructure requests, knowing that the network's solid and knowing that all the firewalls work. Sam's got the basis of some software, but there's other stuff as well. Things like the ARTIC project that we leveraged, which had protocols ready to go for doing both the lab work and the bioinformatics. And then all of the sort of deep work in phylogenetics that's been established over the last sort of 15 years, with algorithms and the understanding of how to do genome epi. It is the sort of thing where we would have loved to have got a grant to fund all that stuff, but it's quite difficult to get a grant to fund that kind of infrastructure work. But given the kind of work that we've been doing, we were really in a position to say, actually, we know how to do this.
We've got most of the pieces. It's just a question of putting all the bricks together to make something that's sort of usable: if not a house, then, you know, at least an external garage. And we'll build all the other bits on later that we need to make it the kind of ultimate system that we continue to build today.
So I think you've both mentioned this precedent and particularly this early prototype of Majora. Is there anything you want to add about what that was? What would have been Majora had we not had the pandemic?
I think the important thing is that a lot of the concepts that I'd already put a lot of thought into were pulled over from the original Majora into the one that we have now. So everything in Majora is basically a process or an artifact. And if you're an artifact, you're a thing that exists, like a biosample or a file. And a process is something that happens to an artifact. And that sort of conceptual thinking took a while to come to, and to build the database models that would back that. So a lot of that was already in place and we could pull that over and utilize that to deploy COG. And a lot has come since, but it is important, like Nick was saying, that we did almost have quite a lot of these jigsaw puzzle pieces together. And really what we've done here is pull them into one coherent picture at a really fast pace.
In that process, who were the major people involved other than yourselves, obviously?
I guess if you look at the commit history of Majora, it's a one man band, right? But there are a lot of people involved. I mean, Rad might want to talk about who he's worked with to get the CLIMB infrastructure to where it is today.
I think Sam can take the absolute lion's share of credit for getting Majora done, but this only works with a huge consortium of people. A lot of them are working at risk, giving up their time to the consortium.
And we've got probably about 400, 500 people involved in generating data for this project, picking out samples. So the consortium itself is really wide. Majora just triggers a lot of downstream analysis. And so it relies on processed genomes coming up, and there's a lot of pipelines being built for that. And it relies on downstream analytics, and large numbers of groups have contributed to that effort as well. Too many for me to mention by name, but if you go to our website, cogconsortium.uk, you can see it's just a cast of hundreds involved there.
Obviously there's yours truly as well.
And you, Nabil. I mean, it couldn't happen without you, clearly.
It couldn't happen without me, clearly, yeah. But I think for some of the older audience, Sam, you'll have to actually explain where the name Majora actually comes from.
So if you don't know, I'm a bit of a Legend of Zelda fan. It's from one of the games in the series. Majora's Mask is like an item in this game. It's like a really powerful mask and it's capable of saving the world from its impending doom. So the name seemed really apt given our current situation.
Let's move on to some of the more technical aspects about how Majora is actually constructed. And I do know that the preprint is out in circulation and I was reading that earlier today. So from the preprint, Majora is actually a central database that's part of the larger workflow, as Nick has mentioned. To help other people understand what happens: for me on the data submission side, I upload SARS-CoV-2 sequencing data and contextual data to Majora. But then what happens to it?
Yes, a good question. I think in the context of COG, Majora has kind of become a term that describes almost everything that happens on CLIMB. And I think that kind of makes sense given that the way that we've built it is that people upload their sequence data to us through rsync or secure file transfer. And they provide the contextual metadata through the API.
And then that's kind of their part done. And actually there's a whole load of things going on behind the scenes that drive what's happening in CLIMB. Every day, an inbound distribution pipeline that we've nicknamed Elan runs. And it contacts Majora to say, can you tell me about all the metadata that you've seen this week? Tell me all the new sequences that I'm expecting to look for. And then it'll try and go and find those by resolving the file structure in people's directories on CLIMB. So effectively, it's just like a matching operation. We're looking for sequences and data that match the metadata that's been uploaded. And once that pairing process is complete, Elan then kicks in and starts doing its job. It's mostly responsible for making sure that the data looks good, looks sane. The BAM has alignments in it, the FASTA is the right size, and all that kind of thing. Then it generates a bunch of QC data and that's all fed to Majora. So Majora basically knows everything that happens on CLIMB. And then once Elan is finished, there's a sort of post pipeline that publishes all of the data to somewhere that the analysts can actually find. And this is where sort of the phylogenetics pipeline will spin up and kick in and start processing all of the new data.
And what languages are Majora and Elan actually written in then?
So Majora is a Django application. So that's a Python-based framework for developing web applications. The thing that I like about it is it's got a really nice object-relational mapper built into it. So it means dealing with databases is a little nicer than doing it in, say, raw SQL or something like that. It is one of the main reasons we were able to prototype things pretty quickly as well, because it has a really nice way of coding web applications. And then the inbound distribution pipeline is a Nextflow pipeline. So I learned Nextflow especially to join COG because I heard we already had some Nextflow pipelines in place.
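The matching operation described above, pairing uploaded metadata with files found in uploaders' directories, can be sketched roughly as follows. This is a minimal illustration in Python and not Elan's actual code: the function name `pair_uploads`, the directory layout, and the convention that files are named `<sample_id>.fasta` and `<sample_id>.bam` are all assumptions made for the example.

```python
from pathlib import Path


def pair_uploads(expected_ids, upload_root, suffixes=(".fasta", ".bam")):
    """Pair sample IDs known from metadata with files found on disk.

    Hypothetical convention: each sample is uploaded as <sample_id>.fasta
    and <sample_id>.bam somewhere beneath upload_root.
    """
    found = {}
    for path in Path(upload_root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            found.setdefault(path.stem, {})[path.suffix] = path

    paired, missing = {}, []
    for sample_id in expected_ids:
        files = found.get(sample_id, {})
        if all(suffix in files for suffix in suffixes):
            paired[sample_id] = files      # metadata and every expected file present
        else:
            missing.append(sample_id)      # metadata present, files incomplete

    # Files on disk that no uploaded metadata refers to.
    orphans = sorted(set(found) - set(expected_ids))
    return paired, missing, orphans
```

Only the samples in `paired` would go forward to quality control in a design like this; `missing` and `orphans` are the two kinds of mismatch (metadata without sequence data, and sequence data without metadata) that an uploader would need to resolve.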
So I switched from SnakeMake just for the occasion. And did you find that transition difficult? I tend to find Nextflow a little less intuitive than SnakeMake. Yeah, I definitely did. It was quite a learning curve for me. It does have like a completely different way of thinking compared to SnakeMake. So I have like a long Twitter thread that I'm still appending to about things that I've found out about Nextflow that I wish I learned at the beginning. What would be the highlight from that actually? If your Nextflow pipeline falls over and you want to pick it up again and you want to resume it, it's a single dash resume, not double dash resume. That's probably my main finding. Otherwise it'll just nuke everything and start over. My favorite thing about this is if you Google it, you'll find like loads of people complaining about it on the Nextflow repository. And the author just says, you should read the documentation and drink some more coffee. A single dash resume, that's insane. There is a reason. It's something about whether it gets passed through to the stuff running the commands or whether it's passed through to just Nextflow. So it kind of makes sense, but still. Well, that's a good tip. That's a good tip. I'll have to ask another like hard reviewer question. So is Majora well-engineered? Have you had time to go back and do proper tests throughout? Like Nick was saying, I mean, you know, the bits of Majora that are really cool kind of didn't appear out of thin air. Like I had the time to think about them towards the start of the year and they've been deployed a lot more quickly than they would have been if there wasn't a global pandemic happening. But I'm quite happy with the core like concepts of what the database is built on. As to whether it's well-engineered, I'd probably at a stretch say it's over-engineered because I've tried to make it really generic. So it's not actually, it's not actually tied to COG or tied to COVID-19 or anything like that. 
So the models could effectively work for anything. So that means it's really, really flexible, which was helpful for us because the needs and demands for the consortium were changing quite early on. You know, the metadata spec wasn't nailed down immediately. So we had to be sort of really agile in what we were collecting and how we were collecting it.
How did Peter van Heusden describe your code? Didn't he say it looked very homemade?
Homemade or more like sort of, I think it's more kind of like small batch artisanal kind of code, isn't it?
Yeah, absolutely. I don't think you'll find code like this anywhere else. But I think, you know, as for testing, I mean, everyone always wishes they'd write more tests. Right, it has some, but most of the testing that we did was, you know, we deployed it. We have a test instance of Majora. It's bright pink, so everyone knows it's the test instance, and we lovingly call it Magenta. We deploy everything to there and do integration tests. So we try and upload data to it and all that kind of thing before we push it over to production. So pretty much everything that we do is based on integration tests.
All right, that's good to hear. One thing that I quite enjoy coming as a user is the API component and the documentation around it. And I'm interested to hear how you've actually done that because I think that's actually a good example.
I'm glad you think it's a good example because it's pretty much all done by hand. I mean, the first API documentation I wrote in Markdown on GitHub. We kept it all on GitHub Pages so you could see how the API had changed and we were able to, you know, point people very quickly to where they should go if they wanted to do a particular thing with the API. But then I got really distracted by these very fancy interfaces that can load in a JSON or YAML configuration and show you a really fancy looking page for describing, you know, how the API works.
So I've moved to defining this YAML file. I think it follows the OpenAPI spec or something like that. And you can load that into ReDoc or Swagger or whatever takes your fancy. And it's basically the same thing, but seems to be harder for me to maintain than just a bunch of Markdown pages. So I'm not necessarily sure I'm happy that I made the switch.
Probably worth saying that a lot of API-driven development is obviously flavor of the month these days, but an API is often something that's retrofitted or plays second fiddle to the main interface. But it's probably fair to say the API is the main way to interface with Majora for everyone. And obviously some people will use the API clients directly to upload metadata. For example, I think the Sanger do that when they upload their sequences. But it's also worth saying that other groups have built interfaces on top of the API. So the main way that people interact with the system for uploading metadata is using a metadata uploader developed by Anthony Underwood and the team at David Aanensen's CGPS. And that's a really slick, you know, nice web interface that's very easy to use that drives Majora through the API. And I think encouraging API use and documenting that put us in a really nice situation for building more functionality, because it doesn't all fall to Sam to build any feature someone can think of. He doesn't have to sit there and write websites, which I happen to know that he seriously dislikes doing. So I think that's been another really key part of the success actually of Majora.
Yeah. Anthony's not with us, but it was a really, really sort of missing piece of the puzzle for Majora because, you know, we'd built this great database and we had these APIs, but how do we get people who are working away in NHS labs and that kind of thing to provide us metadata?
And they're certainly not going to go and download a command line client and work out how to, you know, pass it over the command line. So this missing piece of the puzzle where Anthony and his team build this JavaScript web application that would essentially just convert CSV files into JSON to post them to the Majora API is like an absolute godsend for me. Cause it means I didn't have to build it. It plays kind of suspenseful music while it's uploading, which everyone seems to enjoy. That's really, really good to hear. I mean, it's, it's such a temptation there to have an API and pretend that it's something forward facing and say, yeah, you can access everything and it can work. And then secretly the developers have stuff in the background, all these direct injections into the database kind of scripts that do the actual heavy lifting. So it's really good to hear that you actually put your money where your mouth is and put the API first. Definitely true. Because, you know, I actually get occasionally frustrated with, because I'll say, I just need to get, can I just get some information out of the database? And I just want a column or something, you know, and he will say like, well, I need to write a serializer and I need to, you know, write some code for that. And coming from a kind of, I was brought up in sort of nineties computing where you just get it straight from the database and SQL command or something like that, you know, before APIs were invented. And sometimes that's slightly frustrating, but Sam's absolutely right to enforce that interface because as soon as you start directly manipulating a SQL database or cutting corners to get the quick result, that's fine for a short amount of time, but then, you know, this sort of phenomenon of like technical debt and these problems start to emerge because you've kind of lost control of the engineering of how data should be put in and taken out again. 
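The conversion the uploader performs, turning rows of a CSV file into JSON to post to the API, can be sketched in a few lines. The real uploader is a JavaScript web application; this is a Python illustration only, and the `biosamples` key, the column names, and the function name `csv_to_payload` are assumptions rather than the actual Majora API schema.

```python
import csv
import io
import json


def csv_to_payload(csv_text):
    """Turn CSV text (one sample per row) into a JSON request body.

    The {"biosamples": [...]} shape is a hypothetical payload layout.
    """
    reader = csv.DictReader(io.StringIO(csv_text))
    biosamples = []
    for row in reader:
        # Drop empty cells so that optional fields are simply absent.
        record = {k: v.strip() for k, v in row.items() if k and v and v.strip()}
        if record:
            biosamples.append(record)
    return json.dumps({"biosamples": biosamples})
```

The resulting string would then be sent as the body of an HTTP POST to the metadata endpoint, and the JSON response used to report per-sample errors back to the person uploading.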
So I think it's absolutely right for him to be quite, dare I say, a little bit rigid about that. And I think it's to his credit really.
I think the thing with the APIs as well is that it really lends itself to this real-time angle, because stuff gets posted to these APIs and they respond in real time. So if there's something wrong with a biosample that you post, it says, hey, it's garbage because of this reason: you know, you've tried to collect it in the future, or collected the sample in 1970 or whatever. And that's shown in the uploader as well. So the data that we're getting is validated in real time, and it's added to the database once it's correct.
If I may ask, on a technical level, how are you actually implementing that validation step?
So this is something that Django gives us, which is nice. So this web framework, you know, it's also designed for building forms and all that kind of thing. So we trick Django into thinking that the API is basically somebody filling in a web form, and then you can pass it through to a bunch of predefined validators that are attached to the fields. And then you can say, well, if the collection date is after today, it's collected in the future, so stick that as an error message on that field. And then we pass that back as JSON through the API, and then Anthony's uploader will just render that and say, for this particular sample, you've tried to collect it in the future, so it's been rejected. Please fix this.
We've been talking about the upload process and Anthony's tool. And I mean, I know the story, but I want to hear the story about Metadata Friday. What is Metadata Friday? I think Metadata Friday was even in the press.
I think, yeah, Nick actually coined this in the IMI newsletter. So maybe he can explain.
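The per-field validation described above can be sketched without Django, to show the shape of the idea: validators attached to fields, with error messages collected per field and returned as JSON. This is a simplified stand-in and not Majora's code; in Django the same checks would typically live on form fields. The field names, the messages, and the `EARLIEST_PLAUSIBLE` cutoff are assumptions for illustration.

```python
import json
from datetime import date

# Hypothetical cutoff for rejecting implausibly old dates (e.g. 1970).
EARLIEST_PLAUSIBLE = date(2020, 1, 1)


def validate_collection_date(value, today):
    """Return a list of error messages for one collection date string."""
    try:
        collected = date.fromisoformat(value)
    except (TypeError, ValueError):
        return ["Enter a valid date in YYYY-MM-DD format."]
    errors = []
    if collected > today:
        errors.append("Sample cannot be collected in the future.")
    if collected < EARLIEST_PLAUSIBLE:
        errors.append("Collection date is implausibly early.")
    return errors


def validate_biosample(record, today=None):
    """Validate one submitted record and return an API-style response dict."""
    today = today or date.today()
    errors = {}
    if not record.get("sample_id"):
        errors["sample_id"] = ["This field is required."]
    date_errors = validate_collection_date(record.get("collection_date"), today)
    if date_errors:
        errors["collection_date"] = date_errors
    return {"success": not errors, "errors": errors}


if __name__ == "__main__":
    response = validate_biosample({"sample_id": "S1", "collection_date": "2999-01-01"})
    print(json.dumps(response, indent=2))
```

Because the errors are keyed by field name, a client such as the uploader can show each message next to the offending column for that sample.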
Well, yeah, no, I don't think I coined it, but certainly Metadata Friday became quite legendary early on because, you know, effectively it was a hard deadline for submission of both your sequences and your metadata for that week. At the start of the consortium, we were really running the pipeline on a weekly basis. So Majora would give you this kind of pre-warning mid-morning that you needed to get your data and your metadata uploaded. And it would basically run a test version of the pipeline and you'd have a kind of hour-long mad scramble to fix any problems that it threw up. And you know, the sort of problems are you've got metadata but you've got no genome data, or vice versa. And that was all caught. You know, Majora is quite tightly integrated with our Slack workspace, which has hundreds of consortium people on it. And so it ended up being a bit of a social event, really, everyone trying to get their data formatted in time. And it was actually really useful. One of the most tedious tasks of this consortium is the wrangling of metadata. We get metadata from lots of different sources and it has to be formalized and uploaded and validated. And, you know, actually, Sam doesn't do it for Birmingham. I do it and it's extremely tedious. And so Metadata Friday was good because everyone would kind of get together and sort of share that pain, if you like. And, you know, if you missed the deadline, it was a real bummer because you'd have to wait another week to have another go. Over time we made the pipeline run more and more frequently and now it actually runs daily. So sadly Metadata Friday is no more. And we haven't really got a satisfactory replacement for it, have we?
I remember when we moved it to twice a week, so it was Tuesday and Friday, but no one really took to Metadata Tuesday.
No, that Tuesday one didn't really work out. Yeah. And that was the idea in the Majora paper, we called it, you know, continuous integration, which is obviously a, a term stolen from software development. Yeah. But I like it. I think it works for genomes as well. You know, the data, the data and metadata is being continuously integrated into a various derived analysis product, batches of sequences, phylogenetic trees and reports. And that's something that's going on all the time. At the moment daily is pragmatic, but we have talked about moving to, to even twice daily and maybe one day it will just trigger every time a sequence is added, which would be kind of like the ultimate really. Let's change tack a little bit. This is to everybody, but I want to hear from Sam and Rad first. What features are you most proud of out of Majora? So that's a really hard question, actually. I mean, Rad's and I were sharing a drink over the internet a little while ago, kind of marveling at what we'd built. I think this was like shortly after, you know, we tweeted that we'd hit 50,000 sequences or whatever. And we were just like, man, we built like half of this in a weekend. We pulled all these pieces together. And so I think almost the whole thing, but the fact the whole thing moves and is pretty much almost entirely automated as well. So it's like a really hands off thing. It's almost like, you know, it's, it's something that we've built and it's almost run away from us in the sense that we're not actually directly in charge of it anymore. It runs itself. So I think like the whole, the whole thing, the whole ecosystem is kind of something that we're pretty proud of. I don't know if Rad has something that he's specifically happy with. That's the thing that we've built something that actually helping, we build it quickly and it works. Yeah. I'm proud that it works. That's always good. 
I think another nice thing, I don't know if it's what I'm most proud of, but I really like the way that the software has actually got a sort of personality. It's tightly integrated with our Slack and it's got lots of, you know, little Easter eggs, like the music when you upload your metadata, which went very well with Metadata Friday, and, you know, little kind of cheeky comments in the automated Slack channel. Because actually consortium work can be quite a drudge, you know, for people. The actual day to day of this stuff can be quite hard work and quite tedious. And so just little things like that, you know, it's different from kind of corporate software. It's different from using Excel, let's say. And I think, you know, it goes a long way to just keeping the morale up, and it's quite an underappreciated feature of software, you know, to just have that kind of personal touch.
I totally agree. It's very reminiscent of Torsten's kind of software. He always has a funny comment at the end.
Share and enjoy.
Yeah, definitely. Yeah. Torsten's. Yeah. You know, it gives you a little funny comment at the end when you finish running each time, which, yeah, it does lighten the mood. So, now you've taken a step back and had a look at 50,000, what's the count at the moment, the number of genomes that have gone through?
I think we're actually going to hit 75,000 today.
Wow.
A couple of milestones recently. So yeah, 75,000. And also Andrew Rambaut did the phylogenetic analysis of a hundred thousand, including GISAID sequences. So a hundred thousand tip tree is pretty remarkable, actually. That was one of the longstanding problems of genomic epidemiology, the scaling thing: how do you manage a tree of that size?
How do you build a tree of that size? And you know, there's been some good engineering on the phylogenetic side to enable hundred thousand tip trees. And it's clearly going to need to scale another order of magnitude, unfortunately. So that is pretty impressive.
Yeah. So on the back of that, let's talk about some of the future plans of where this is going to go. I mean, obviously the pandemic isn't over, so the routine day-to-day work will continue. But what about Majora generally as a platform, in terms of, say, other organisms or other future plans?
We haven't really thought about other organisms a great deal, although, you know, I do think that this is a model that demonstrably works very well for an epidemic disease, particularly one where real-time genomics is incredibly helpful. Maybe it would be useful for other types of routine surveillance as well. But, you know, I don't want to put any surprises on to Sam and Rad on this call, but for me I think the obvious way to extend this next is to think about the global situation. Quite a lot of this functionality that we're building has been really useful for the UK, but I think that functionality on CLIMB would also be really useful for other groups in other countries. And so for me it's about thinking about the global reach. And of course there are actually very good pragmatic reasons for a UK consortium to think about the global picture, because obviously we're very interested in the relative contribution of importations of the virus versus local spread, and of course other countries are interested in that as well, and we're all highly connected.
It's actually been quite difficult to do analysis of importations, because very few countries have generated sufficient genome data to allow us to confidently impute the origins of new cases. But we know that significant numbers of clusters do come from abroad. They obviously all came from abroad originally, and then that fell off because international transport was so limited, but that's opened up a little bit again, particularly over summer with holidays, and so we start to see more importations. It's very hard to know the origins, and that would be a pragmatic reason. The other thing is that we want to be able to support global users, and we've had a lot of requests from other countries wanting to do similar projects to COG, really looking for some help and support about how to manage the infrastructure.
Anything you'd like to add on a technical level to Majora in the future?
So one thing we'd like to add, I think, is more bioinformatics capabilities. At the moment Majora relies on the bioinformatics being done locally, in terms of processing consensus sequences and the BAM files, and that works very well. But I think there are certain types of analysis, for example of very, very closely related isolates, where you start thinking about whether there is a significance of, say, mixed positions in terms of transmission. In other viruses, mixed positions have sometimes been helpful in clarifying the direction of transmission, things like that. We're not currently in a position to look at that level of information, partly because we have disparate pipelines being used. I would quite like to have the ability to re-crunch the entire data set with unified pipelines that have models that understand things like mixed positions. So a bit more bioinformatics support would be one thing that I would like to see in Majora.
I think that'd be quite a cool technical thing for us to build from a CLIMB standpoint as well.
Yeah, it would mean that we could support groups that don't have dedicated bioinformaticians as well. Basically, I don't think you can contribute to COG without having your own bioinformatician. I definitely don't want to do bioinformaticians out of jobs, because I am one, and I think it's important that that expertise is widely distributed. But you could imagine a situation where more sequencing happens in more locations, where it's just not practical to have on-site bioinformaticians crunching data, so it might be better done remotely.
Rad, how do you feel about that, considering that's probably a lot of computational burden on you then?
That's fine. Compute nowadays is not a problem. It's just how you put it together: you need to think about how you're going to do each step and how it all ties together, because compute as a resource is now dirt cheap and it's everywhere. It's just a question of what you do with it.
I guess we're quite grateful that the genomes are so small as well, because storage is the real blocker these days, right? So I'm glad we're not sequencing human genomes. We are quite fortunate in that respect: even a hundred thousand coronavirus genomes with their BAMs is not actually a huge amount of storage, and that does make life a lot easier.
We sort of touched on this, but I wanted to hear this and then we'll wrap up. If someone did want to set up something similar, what would you advise, knowing what you know now?
I would definitely encourage people to check out Sam's pre-print on bioRxiv about Majora, because it goes through, in quite a lot of detail, some of the thought processes about why we made certain design decisions and certain trade-offs. I don't think we're necessarily suggesting that people adopt exactly the same model as us, but I think there are an awful lot of useful learning points in there for people to draw from. We needed to be a bit careful when Sam wrote the Majora paper that we're not proposing an off-the-shelf infrastructure that people can download and install. All the software is open source, it's pretty well documented, and it's certainly usable, but the bioinformatician's curse is to do something that's useful and end up supporting it for the rest of their lives, and we kind of wanted to avoid that. So for us it's much more about the model, and other countries have adopted a similar model. I know Canada has a distributed model of sequencing, and I know that the US, although it's been quite slow to get started, I would say, has also proposed a distributed model. For countries of that size that want to do genome sequencing: have a look, pick and choose from some of the things that we've done, and maybe think about reusing them if it's appropriate. The only other thing I would say, and this is something we tried to get across in the pre-print, is that the perfect is the enemy of the good.
There's no point spending a year getting the perfect infrastructure together before you start, because you've lost a year, and you don't have that amount of time when you're responding to a public health emergency. It's difficult sometimes, because of course you want everything to be perfect, but trying to be pragmatic is really vital, and I hope that message comes across in the pre-print as well.
Sam, anything from you?
Yeah, I was going to say, I think one of the highlights from our pre-print is how we deployed it internally as a sort of walled garden. It's not necessarily to keep data away from people; it just meant that we could actually turn things around in real time. If we had followed a model where we had to upload all of the reads and everything to a public database like ENA, we'd be waiting for the accessions to be minted and then have to re-download all the data again to do any bioinformatics. The model that we've built has really allowed people to upload sequences in real time and get them integrated into the data set on the same day. So we're seeing phylogenetic trees coming out hours after the sequences have been uploaded and matched to metadata, which just wouldn't be possible with a system that depended on you depositing everything first.
I guess one point that we touch on at the end of the pre-print as well is that, although a lot of the stuff we describe is a technical achievement, there was a lot of work from a lot of people across the whole consortium to define a metadata standard: lots and lots of meetings about what metadata to collect and what we can and can't collect, and lots of people navigating the legal frameworks of all the different public health agencies. For anyone who wants to build a system like ours, that's also something you can't neglect. Someone has to put the time in and work out how they're actually going to collect the metadata in a legal way.
And on that note, I think we'll end. Thanks for the great discussion. This has been a special software deep dive about the central database infrastructure called Majora that underpins the sequencing efforts within COG-UK. There is a pre-print now available on bioRxiv if you want to learn more, and all of the source code is up on Sam's GitHub. There'll be links in the show notes if you want to have a look. I want to thank Nick, Sam, and Radislav for joining me today. That's all the time we have for this episode, so see you next time.
Thank you all so much for listening to us at home. If you like this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group and edited by Nick Waters. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.