Hello, and thank you for listening to the Micro Binfie podcast. Here we will be
discussing topics in microbial bioinformatics. We hope that we can give you some
insights, tips, and tricks along the way. There's so much information we all
know from working in the field, but nobody writes it down. There is no manual,
and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My
co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Andrew
and Nabil work in the Quadram Institute in Norwich, UK, where they work on
microbes in food and the impact on human health. I work at Centers for Disease
Control and Prevention and am an adjunct member at the University of Georgia in
the US. Hello, and welcome to the Micro Binfie podcast. Welcome to another
software deep dive, where we interview the authors of a bioinformatics software
package. We know that behind all software, there are quirky details that never
make it into the final paper. So today, we're having a chat about some of the
software that is behind COG-UK. So COG-UK is the COVID-19 genomics UK
consortium, which was created to deliver large scale and rapid whole genome
virus sequencing to local NHS centers and the UK government, particularly for
COVID-19. I'm joined today by Professor Nick Loman, who is a professor of
microbial genomics and bioinformatics at the Institute of Microbiology and
Infection at the University of Birmingham. We also have Sam Nicholls, who is a
postdoc in Nick's lab and self-described wonderful wizard of COG, and Radoslaw
Poplawski, who is CLIMB's cloud manager at the University of Birmingham. Today
we're going to be talking about the central database infrastructure called
Majora that underpins the sequencing efforts within COG-UK. Nick, let's get
kicked off with you. What exactly is the problem Majora is trying to solve? We
decided in COG that we wanted to very rapidly have capability to sequence SARS-
CoV-2 genomes. And at a meeting back in mid-March, we came up with the idea of
the Coronavirus Genomics Consortium. And we just thought the best way of doing
this would be to leverage all of the UK's deep expertise in sequencing and in
molecular phylogenetics and create a network. And so we wanted everyone that
could contribute genomes to be able to do that really, really quickly using
whatever platforms they're comfortable with, even different sequencing
protocols. And we actually ended up having a network of about 13 or 14 sites
that gave us a real problem, which is how could we do a centralized analysis of
the data produced by the consortium? So Majora solves that problem, basically.
It connects together all of the distributed sequencing hubs that generate both
sequence data and sample metadata. And it provides a database and a software
infrastructure to integrate that information so that we can generate analysis
products, upload that data to public databases, and actually get on with the business
of understanding coronavirus genomics. I'm going to ask a kind of reviewer three
question. What was the motivation behind setting this up? Like, I mean, what did
Majora need to do that wasn't available at the time? That's a really interesting
question. I think that there are lots of different software infrastructures
available. And of course, there are good examples of kind of networks of
sequencing that go on. So for example, in the US for Salmonella and other
foodborne bacterial pathogens, you have GenomeTrakr. In Canada, they have a
system called IRIDA. And the UK has been at the forefront of also generating
whole genome sequences for organisms like TB as well as the foodborne. But this
is a new virus, and it's very unusual to have a distributed network that's
basically just popped up effectively overnight. It takes months or years of
intensive development to get a system that's kind of stable and working. And so
we needed something incredibly agile that would work basically immediately that
we knew that we could build on and layer on functionality that we would need.
And there was just nothing off the shelf that we could easily use
for this idea, particularly the fact that it's such a distributed system taking
disparate types of sequencing information in. So as with all kind of classic
bioinformatics problems, the solution is to roll your own. And Sam already had
the sort of bones of a platform called Majora. It's very different to what we
have now, but it was obvious to me that Sam would be able to bring something up
very quickly to support this. I want to bring Sam and Rad in now. Was there
anything you want to add to that? I think Nick's covered all of that, really. I
think the special thing about Majora, like Nick was saying, is that it's allowed
us to scale up to a very different sort of type of sequencing network, not
necessarily one that's based in a very large sequencing center. We've got lots
of different disparate sequencing sites doing lots of different types of
sequencing with all sorts of library extraction techniques and bioinformatics
techniques. And this is a way to bring all of those together. And for you Rad,
coming in as a cloud manager on the sysadmin side, does this seem really novel
to you as well, this sort of technology? It's the first time we did something in
such a short amount of time that was a bespoke solution. I also think we just
played to our strengths. Each of us is good at something and we sat together, okay,
you can do this, you can do this, I can do that. And the end product is Majora
and UranZone. I think that's a good point, actually, that the way it was
developed as well is not in isolation. I worked really closely with Rad to build
it on CLIMB. We worked closely with other people to integrate it into systems as
well. So it wasn't developed in isolation. This whole ecosystem has sprung up
out of CLIMB really and the connections that we have. Well, we started talking
about the conception of the idea. So when did the work actually start? So I
guess like Nick said, I had musings on what a LIMS should look like a little bit
before the outbreak began. Towards the start of this year, we were ramping up
sequencing for one of the projects that I'm attached to, and we decided that it
needed some sort of LIMS to keep track of all of these samples for fecal
metagenomics. And then, you know, when we were pulled off all of our grants at
the beginning of March, Nick said, drop everything, I need you to do something,
it's quite important. We realized that we could probably spin what I'd built
into something that could serve this sort of concept of a large sequencing
consortium. And that, I think, was at the beginning of March. And I have fond
memories of Rad and I working, I think, the second weekend of March, pulling all
sorts of crazy hours to build the first VM and start deploying software to it.
So that was, I mean, I remember when we were talking with Andrew and Justin a
few weeks ago, they talk about this call to action as a sort of phone call at
midnight kind of thing. Was it much the same for both of you? Yeah, pretty much.
I think sort of work was quiet for a week and Nick was sort of in lots of
meetings. And then we get a little message on Thursday, March 12th, I think,
Nick said, you need to drop everything. I need you to start building this
system. So that was Thursday evening. We worked all weekend. On Monday, we had
some bare bones infrastructure that we can just continue to work on. That was a
crazy weekend. The first thing that we actually set up Majora to do was we
retooled it to be a user management system. That was actually the first thing we
got it doing that weekend, because we knew that loads of people were going to
need to sign onto the system to upload data. So the very first thing that we
wanted to do before the pipelines, before Majora as a LIMS was just provide a
platform for people to provide these genomes so we could start coordinating an
upload of them to GISAID. So Majora was actually a user management system at
first, like that weekend. And I think we developed that in two or three days,
had our first user on the Monday. And I think it's worth pointing out, I think,
that it sounds very ad hoc the way that we're describing it. And of course, it
sort of necessarily is. Coronavirus, it surprised us how quickly it came to the
UK and how big it got. Everyone was obviously watching the news back in early
January and the reports coming out of Wuhan, but I think it still caught people
on the hop. But although it's very ad hoc, I think it's very important to point
out that there's a sort of deep well of expertise and knowledge and actual
pieces of the puzzle that already existed. It wasn't about standing up an
infrastructure from nothing. There was, there was a huge amount of underlying
stuff there. So obviously the CLIMB platform is a great example of something
that's been developed over at least five years to be able to rapidly respond to
kind of hardware infrastructure requests and knowing that the network's solid
and knowing that all the firewalls work. Sam's got the basis of some software,
but there's other stuff as well. Things like the ARTIC project that we
leveraged, which had protocols ready to go for doing both the lab work and the
bioinformatics. And then all of the sort of deep work in phylogenetics that's
been established over the last sort of 15 years with algorithms and the
understanding of how to do genome epi. It is the sort of thing where we would
have loved to have got a grant to fund, to fund all that stuff, but it's quite
difficult to get a grant to fund that kind of infrastructure work. But the kind
of work that we've been doing meant we could really say, actually, we know how
to do this. We've got most of the pieces. It's just a question of putting all
the bricks together to make something that's a sort of usable, if not a house,
a sort of, you know, at least an external garage. And we'll build all the other
bits on later that we need to make it the kind of, the ultimate system that we
continue to build today. So I think you've both mentioned this precedent and
particularly this early prototype of Majora. Is there anything you want to add
about what that was? What would have been Majora had we not had the pandemic? I
think the important thing is that a lot of the concepts that I'd already put a
lot of thought into were pulled over from the original Majora into the one that
we have now. So everything in Majora is basically a process or an artifact. And
if you're an artifact, you're a thing that exists like a bio sample or a file.
And a process is something that happens to an artifact. And that sort of
conceptual thinking took a while to come to and to build the database models
that would back that. So a lot of that was already in place and we could pull
that over and utilize that to deploy COG. And a lot has come since, but it is
important like Nick was saying that we did almost have quite a lot of these
jigsaw puzzle pieces together. And really what we've done here is pull them into
one coherent picture at a really fast pace. In that process, who were the major
people involved other than yourselves, obviously? I guess if you look at the
commit history of Majora, it's a one man band, right? But there are a lot of
people involved. I mean, Rad might want to talk about who he's worked with
to get the CLIMB infrastructure to where it is today. I think Sam can take the
absolute lion's share of credit for getting Majora done, but this only works
with a huge consortium of people. A lot of them are working at risk, working,
giving up their time to the consortium. And we've got probably about 400, 500
people involved in generating data for this project, picking out samples. So the
consortium itself is really wide. Majora just triggers a lot of downstream
analysis. And so it relies on processed genomes coming up and there's a lot of
pipelines being built for that. And it relies on downstream analytics, and
large numbers of groups have contributed to that effort as well. Too many for me
to mention by name, but if you go on our website, cogconsortium.uk, you can see
it's just a cast of hundreds involved there.
Obviously there's yours truly as well. And you, Nabil. I mean, it couldn't happen
without you, clearly. It couldn't happen without me, clearly, yeah. But I think
for some of the older audience, Sam, you'll have to actually explain where the
name Majora actually comes from. So if you don't know, I'm a bit of a Legend of
Zelda fan. It's from one of the games in the series. Majora's Mask is like an
item in this game. It's like a really powerful mask and it's capable of saving
the world from its impending doom. So the name seemed like really apt given our
current situation. Let's move on to some of the more technical aspects about how
Majora is actually constructed. And I do know that the preprint is out in
circulation and I was reading that earlier today. So from the preprint, Majora
actually acts as a central database, part of the larger workflow, as Nick has
mentioned. To help other people understand what happens, for me on the data
submission side, I upload SARS-CoV-2 sequencing data and contextual data to
Majora. But then what happens to it? Yes, a good question. I think in the
context of COG, Majora has kind of become a term that describes almost
everything that happens on CLIMB. And I think that kind of makes sense given
that the way that we've built it is that people upload their sequence data to us
through rsync or secure file transfer. And they provide the contextual
metadata through the API. And then that's kind of their part done. And actually
there's a whole load of things going on behind the scenes that drive what's
happening on CLIMB. Every day, an inbound distribution pipeline that we've
nicknamed Elan runs. And it contacts Majora to say, can you tell me about all
the metadata that you've seen this week? Tell me all the new sequences that I'm
expecting to look for. And then it'll try and go and find those by resolving the
file structure in people's directories on CLIMB. So effectively, it's just like
a matching operation. We're looking for sequences and data that match the
metadata that's been uploaded. And once that pairing process is complete, Elan
then kicks in and starts doing its job. It's mostly responsible for making sure
that the data looks good, looks sane: the BAM has alignments in it, the FASTA is the
right size and all that kind of thing. Then it generates a bunch of QC data and
that's all fed to Majora. So Majora basically knows everything that happens on
CLIMB. And then once Elan is finished, there's a sort of post pipeline that
publishes all of the data to somewhere that the analysts can actually find. And
this is where sort of the phylogenetics pipeline will spin up and kick in and
start processing all of the new data. And what languages are Majora and Elan
actually written in then? So Majora is a Django application. So that's a Python
based framework for developing web applications. The thing that I like about it,
it's got a really nice object relational mapper built into it. So it means
dealing with databases is a little nicer than doing it in say raw SQL or
something like that. It is one of the main reasons we were able to prototype
things pretty quickly as well because it has a really nice way of coding web
applications. And then the inbound distribution pipeline is a Nextflow pipeline.
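To illustrate the pairing step that this inbound pipeline performs (matching the sample records Majora expects against the sequence files found in uploaders' directories), here is a minimal Python sketch. Elan itself is a Nextflow pipeline, so this is not its code; the function name and the assumption that files are named `<sample_id>.fasta` and `<sample_id>.bam` are hypothetical, chosen only to show the matching idea.

```python
from pathlib import Path


def pair_metadata_with_files(expected_ids, upload_root):
    """Pair expected sample IDs with FASTA/BAM files found under upload_root.

    Hypothetical convention: each sample is uploaded as <sample_id>.fasta and
    <sample_id>.bam somewhere beneath the upload directory.
    Returns (paired, missing): a dict of complete pairs and a list of sample
    IDs that have metadata but lack one or both files.
    """
    found = {}
    # Walk the directory tree and index sequence files by sample ID.
    for path in sorted(Path(upload_root).rglob("*")):
        if path.suffix in {".fasta", ".bam"}:
            found.setdefault(path.stem, {})[path.suffix.lstrip(".")] = path

    paired, missing = {}, []
    for sample_id in expected_ids:
        files = found.get(sample_id, {})
        if "fasta" in files and "bam" in files:
            paired[sample_id] = files
        else:
            missing.append(sample_id)
    return paired, missing
```

Samples that end up in `missing` correspond to the "metadata but no genome data" case that comes up later in the conversation; only complete pairs would go on to the QC steps.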
So I learned Nextflow especially to join COG because I heard we already had some
Nextflow pipelines in place. So I switched from Snakemake just for the occasion.
And did you find that transition difficult? I tend to find Nextflow a little
less intuitive than Snakemake. Yeah, I definitely did. It was quite a learning
curve for me. It does have like a completely different way of thinking compared
to Snakemake. So I have like a long Twitter thread that I'm still appending to
about things that I've found out about Nextflow that I wish I learned at the
beginning. What would be the highlight from that actually? If your Nextflow
pipeline falls over and you want to pick it up again and you want to resume it,
it's single-dash resume (-resume), not double-dash resume (--resume). That's probably my main
finding. Otherwise it'll just nuke everything and start over. My favorite thing
about this is if you Google it, you'll find like loads of people complaining
about it on the Nextflow repository. And the author just says, you should read
the documentation and drink some more coffee. A single dash resume, that's
insane. There is a reason. It's something about whether it gets passed through
to the stuff running the commands or whether it's passed through to just
Nextflow. So it kind of makes sense, but still. Well, that's a good tip. That's
a good tip. I'll have to ask another like hard reviewer question. So is Majora
well-engineered? Have you had time to go back and do proper tests throughout?
Like Nick was saying, I mean, you know, the bits of Majora that are really cool
kind of didn't appear out of thin air. Like I had the time to think about them
towards the start of the year and they've been deployed a lot more quickly than
they would have been if there wasn't a global pandemic happening. But I'm quite
happy with the core like concepts of what the database is built on. As to
whether it's well-engineered, I'd probably at a stretch say it's over-engineered
because I've tried to make it really generic. So it's not actually, it's not
actually tied to COG or tied to COVID-19 or anything like that. So the models
could effectively work for anything. So that means it's really, really flexible,
which was helpful for us because the kind of the needs and demands for the
consortium were changing quite early on. You know, the metadata spec wasn't
nailed down immediately. So we had to be sort of really agile in what we were
collecting and how we were collecting it. What did Peter van Heusden, how did he
describe your code? Didn't he say it looked very homemade? Homemade or more like
sort of, I think it's more kind of like small batch artisanal kind of code,
isn't it? Yeah, absolutely. I don't think you'll find code like this anywhere
else. But I think, you know, as for testing, I mean, everyone always wishes
they'd write more tests. Right, it has some, but most of the testing, the
testing that we did was, you know, we deployed it. We have a test instance of
Majora. It's bright pink. So everyone knows it's the test instance and we
lovingly call it magenta. We deploy everything to there and do integration
tests. So we try and upload data to it and all that kind of thing before we push
it over to production. So pretty much everything that we do is based on
integration tests. All right, that's good to hear. One thing that I quite enjoy
coming as a user is the API component and the documentation around it. And I'm
interested to hear how you've actually done that because I think that's actually
a good example. I'm glad you think it's a good example because it's pretty much
all done by hand. I mean, the first API documentation I wrote in Markdown on
GitHub, we kept it all on GitHub pages so you could see how the API had changed
and we were able to, you know, point people very quickly to where they should go
if they wanted to do a particular thing with the API. But then I got really
distracted by these very fancy interfaces that can like load in a JSON or YAML
configuration and show you really fancy looking page for describing, you know,
how the API works. So I've moved to defining this YAML file. I think it follows
the OpenAPI spec or something like that. And you can load that into ReDoc or
Swagger or whatever takes your fancy. And it's basically the same thing, but
seems to be harder for me to maintain than just a bunch of Markdown pages. So
I'm not necessarily sure I'm happy that I made the switch. Probably worth
saying that a lot of API driven development is obviously flavor of the month
these days, but API is often something that's retrofitted or a second fiddle to
the main interface. But it's probably fair to say the API is the main way to
interface with Majora for everyone. And obviously some people will use the API
clients directly to upload metadata. For example, I think the Sanger do that
when they upload their sequences. But it's also worth saying that other groups
have built interfaces on top of the API. So the main way that people interact
with the system for uploading metadata is using a metadata uploader developed by
Anthony Underwood and the team at David Aanensen's CGPS. And that's a really
slick, you know, nice web interface that's very easy to use that drives Majora
through the API. And I think encouraging API use and documenting that
has put us in a really nice situation for building more
functionality because it's not, it doesn't all fall to Sam to build any feature
someone can think of. He doesn't have to sit there and write websites, which I
happen to know that he dislikes doing seriously. So I think that's been another
really key part of the success actually of Majora. Yeah. Anthony's not with us,
but it was a really, really sort of missing piece of the puzzle for Majora
because, you know, we'd built this, this great database and we had these APIs,
but how do we get people who are working away in NHS labs and that kind of thing
to provide us metadata? And they're certainly not going to go and download a
command line client and work out how to, you know, pass it over the command
line. So this missing piece of the puzzle where Anthony and his team built this
JavaScript web application that would essentially just convert CSV files into
JSON to post them to the Majora API is like an absolute godsend for me. Cause it
means I didn't have to build it. It plays kind of suspenseful music while it's
uploading, which everyone seems to enjoy. That's really, really good to hear. I
mean, it's, it's such a temptation there to have an API and pretend that it's
something forward facing and say, yeah, you can access everything and it can
work. And then secretly the developers have stuff in the background, all these
direct injections into the database kind of scripts that do the actual heavy
lifting. So it's really good to hear that you actually put your money where your
mouth is and put the API first. Definitely true. Because, you know, I actually
get occasionally frustrated with, because I'll say, I just need to get, can I
just get some information out of the database? And I just want a column or
something, you know, and he will say like, well, I need to write a serializer
and I need to, you know, write some code for that. And coming from a kind of, I
was brought up in sort of nineties computing where you just get it straight from
the database and SQL command or something like that, you know, before APIs were
invented. And sometimes that's slightly frustrating, but Sam's absolutely right
to enforce that interface because as soon as you start directly manipulating a
SQL database or cutting corners to get the quick result, that's fine for a short
amount of time, but then, you know, this sort of phenomenon of like technical
debt and these problems start to emerge because you've kind of lost control of
the engineering of how data should be put in and taken out again. So I think
it's absolutely right for him to be quite almost a little bit, could I dare I
say rigid about, about that. And I think it's to his credit really. I think the
thing with the APIs as well is that it really lends itself to this real-time angle
because stuff gets posted to these APIs and they respond in real time. So if
there's something wrong with a biosample that you post, it says, hey, it's
garbage because of this reason, you know, you've tried to collect it in the
future or, you know, collected the sample in 1970 or whatever. And that's shown
in the, the uploader as well. So the data that we're getting is validated in
real time. So it's added to the database once it's correct in real time. How are
you actually, if I may ask, on a technical level, how are you actually
implementing that validation step? So this is something that Django gives us,
which is nice. So this, this web framework, you know, it's also designed for
building forms and all that kind of thing. So we trick Django into thinking that
the, the API is basically somebody filling in a web form, and then you can pass
it through to a bunch of predefined validators that are attached to the fields.
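As a rough sketch of validators attached to fields, here is a date check that reports per-field errors as JSON. It is written in plain Python rather than with Django's form classes, and every name in it (`not_in_future`, `FIELD_VALIDATORS`, `validate_biosample`, the `collection_date` key, the response shape) is an illustrative assumption rather than Majora's actual code.

```python
import json
from datetime import date


def not_in_future(value, today):
    """Return an error message if the date is after today, else None."""
    if value > today:
        return "Sample cannot be collected in the future."
    return None


# Validators are attached per field, loosely mirroring how a form field
# carries its own list of checks. (Hypothetical field name.)
FIELD_VALIDATORS = {"collection_date": [not_in_future]}


def validate_biosample(payload, today=None):
    """Validate a posted biosample dict and return a JSON string of errors."""
    today = today or date.today()
    errors = {}
    try:
        collected = date.fromisoformat(payload.get("collection_date", ""))
    except (TypeError, ValueError):
        errors["collection_date"] = ["Enter a valid date (YYYY-MM-DD)."]
    else:
        for validator in FIELD_VALIDATORS["collection_date"]:
            message = validator(collected, today)
            if message:
                errors.setdefault("collection_date", []).append(message)
    # The caller (for example, an uploader UI) can render these per-field errors.
    return json.dumps({"success": not errors, "errors": errors})
```

Calling `validate_biosample({"collection_date": "2999-01-01"})` would return a JSON object whose `errors` entry is keyed by the offending field, which is the kind of message an uploader could display next to the rejected sample.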
And then you can say, well, if the collection date is after today, it's
collected in the future. So stick that as an error message on that field. And
then we pass that back as JSON through the API, and then Anthony's
uploader will just render that and say, for this particular sample, you've tried
to collect it in the future, so it's been rejected. Please fix this. We've been
talking about the upload process and Anthony's tool. And I mean, I know the
story, but I want, I want to hear the story about Metadata Friday. What is
Metadata Friday? I think it was even in the press Metadata Friday. I think,
yeah, Nick actually coined this in a, in the IMI newsletter. So maybe, maybe he
can explain. Well, yeah, no, I don't think I coined it, but certainly Metadata
Friday became quite legendary early on because, you know, effectively it was a
hard deadline for submission of both your sequences and your metadata for that
week. At the start of the, of the consortium, we were, we were really running
the pipeline on a weekly basis. So Majora would give you this kind of pre-
warning mid-morning that you needed to get your data and your metadata uploaded.
And it would, it would basically run a test version of the pipeline and, and
you'd have a kind of hour-long mad scramble to fix any problems that it threw
up. Like, and you know, the sort of problems are you've got metadata, but you've
got no sample, you've got no genome data, or vice versa. And those
things were all caught. You know, Majora is quite tightly integrated with
our Slack workspace, which has hundreds of consortium people on it. And so ended
up being a bit of a social event, really everyone trying to get their data
formatted in time. And, and it was actually really useful. One of the most
tedious tasks of this consortium is the wrangling of metadata. We
get metadata from lots of different sources and it has to be formalized and
uploaded and validated. And it's, you know, I actually, Sam doesn't do it for
Birmingham. I do it and it's, it's extremely tedious. And, and so Metadata
Friday was good because everyone would kind of get together and sort of share
that pain if you like. And it would, you know, if you missed the deadline, it
was, it was a real, it was a real bummer because you'd have to wait another week
to have another go. Over time we made the pipeline run more and more frequently
and now it runs, actually runs daily. So sadly Metadata Friday is, is no more.
And we haven't really got a satisfactory replacement for it, have we? I remember
when we moved it to twice a week, so it was Tuesday and Friday, but no one
really, no one really took Metadata Tuesday. No, that Tuesday one didn't really
work out. Yeah. And that was the idea in the Majora paper, we called it, you
know, continuous integration, which is obviously a, a term stolen from software
development. Yeah. But I like it. I think it works for genomes as well. You
know, the data, the data and metadata is being continuously integrated into a
various derived analysis product, batches of sequences, phylogenetic trees and
reports. And that's something that's going on all the time. At the moment daily
is pragmatic, but we have talked about moving to, to even twice daily and maybe
one day it will just trigger every time a sequence is added, which would be kind
of like the ultimate really. Let's change tack a little bit. This is to
everybody, but I want to hear from Sam and Rad first. What features are you most
proud of out of Majora? So that's a really hard question, actually. I mean,
Rad and I were sharing a drink over the internet a little while ago, kind of
marveling at what we'd built. I think this was like shortly after, you know, we
tweeted that we'd hit 50,000 sequences or whatever. And we were just like, man,
we built like half of this in a weekend. We pulled all these pieces together.
And so I think almost the whole thing, but the fact the whole thing moves and is
pretty much almost entirely automated as well. So it's like a really hands off
thing. It's almost like, you know, it's, it's something that we've built and
it's almost run away from us in the sense that we're not actually directly in
charge of it anymore. It runs itself. So I think like the whole, the whole
thing, the whole ecosystem is kind of something that we're pretty proud of. I
don't know if Rad has something that he's specifically happy with. The
thing is that we've built something that's actually helping; we built it quickly and
it works. Yeah. I'm proud that it works. That's always good. I think the,
another nice thing, I don't know if it's like what I'm most proud of, but I
think I really liked the way that the software has actually got a sort of
personality. As I say, it's tightly integrated with our Slack and it's got lots of,
you know, those little, just kind of little Easter eggs and little like the,
like the, the music when you play, when you upload your metadata, which went
very well with Metadata Friday and you know, little, little kind of cheeky
comments in the, in the automated Slack channel. Because actually consortium
work can be quite a drudge, you know, for people, you know, like the actual day
to day of this stuff can be quite hard work and quite tedious. And so just
little things like that, you know, it's sort of, it's different from kind of
corporate software. It's different from using Excel, let's say, you know, and,
and I think, you know, it goes a long way to just keeping the morale up and
it's, it's quite kind of underappreciated feature of software, you know, to just
have that kind of personal touch. I totally agree. It's very reminiscent of like
Torsten's kind of software. He always has a funny comment at the end. Share and
enjoy. Yeah, definitely. Yeah. Torsten's. Yeah. You know, like it gives you a
little, gives you a little funny comment at the end when you, when you finish
running each time, which is, yeah, it does, it does, it does lighten the mood.
So, now you've taken a step back and had a look at 50,000. What's the count at
the moment, number of genomes that have gone through? I think we're actually
going to hit 75,000 today. Wow. A couple of milestones recently. So yeah,
75,000. And also Andrew Rambaut did the phylogenetic analysis of a
hundred thousand, including GISAID sequences. So a hundred thousand tip tree
is pretty remarkable, actually. That was one of the longstanding problems of
genomic epidemiology is the scaling thing is how do you, how do you manage a
tree of that size? How do you build a tree of that size? And you know, it's been
some good engineering on the phylogenetic side to enable a hundred thousand tip
trees. And we're going to have to, it was, it's clearly going to need to scale
another order of magnitude, unfortunately. So that, that is pretty impressive.
Yeah. So on the back of that, let's talk about some of the future plans of where
this is going to go. I mean, obviously the, the pandemic isn't over, so the
routine day-to-day work will continue, but what about Majora generally as a
platform in terms of say other organisms or other future plans? We haven't
really thought about other organisms a great deal, although I do think that
this is a model that demonstrably works very well for an epidemic disease,
particularly one where real-time genomics is incredibly helpful. Maybe it would
be useful for other types of routine surveillance as well. But for me, and I
don't want to put any surprises on to Sam and Rad on this call, I think the
obvious way to extend this next is to think about the global situation. Quite a
lot of this functionality that we're building has been really useful for the UK,
but I think that functionality on CLIMB would also be really useful for other
groups in other countries, and so for me it's about thinking about the global
reach,
and of course there are actually very good pragmatic reasons for a UK consortium
to think about the global picture, because obviously we're very interested in
the relative contribution of importations of coronavirus versus local spread,
and of course other countries are interested in that as well, and we're all
highly connected. It's actually been quite difficult to do analysis of
importations, because very few countries have generated sufficient genome data
to allow us to confidently impute origins of new cases, but we know that
significant numbers of clusters do come from abroad. They
obviously all came from abroad originally, and then obviously that fell off
because international transport was so limited, but that's opened up a little
bit again, particularly over summer with holidays, and so we start to see more
importations. But it's very hard to know the origins, and that would be a kind
of pragmatic reason. But the other thing is, obviously, we want to be able to
support the global users, and we've had a lot of requests from other countries
wanting to do similar projects to COG, really looking for some help and support
about how to manage the infrastructure. Anything
you'd like to add on a technical level to Majora in the future? So one thing
we'd like to add, I think, is more bioinformatics capabilities. At the moment
Majora relies on the bioinformatics being done locally in terms of processing
consensus sequences and the BAM files, and that works very well, but I think
that there are certain types of analysis, for example of very closely related
isolates, where you start thinking about whether there is a significance of,
say, mixed positions in terms of transmission. In other viruses, mixed
positions have sometimes been helpful in clarifying direction of transmission,
things like that. We're not currently in a position to look at that level of
information, partly because we have disparate pipelines being used. I would
quite like to have the ability to re-crunch the entire data set with unified
pipelines that have models that understand things like mixed positions, so a
bit more bioinformatic support would be one thing that I would like to see in
Majora. I think that'd be quite a cool
technical thing for us to build from a CLIMB standpoint as well. Yeah, it would
mean that we could support groups that don't have dedicated bioinformaticians as
well. Basically, I don't think you can contribute to COG without having your own
bioinformatician. I definitely don't want to do bioinformaticians out of jobs,
because I am one, and I think it's important that that expertise is widely
distributed. But you could imagine a situation where more sequencing happens in
more locations where it's just not practical to have on-site bioinformaticians
crunching data, so it might be better for it to be done remotely. Rad, how do
you feel about that, considering that's probably a lot of computational burden
on you then? That's fine. Compute nowadays is not a problem. It's just how you
put it together: you need to think about how you're going to do each step and
how it all ties together, because compute as a resource is now dirt cheap and
it's everywhere. It's just a question of what you do with it. I guess we're
quite grateful that
the genomes are so small as well, because, you know, storage is the real blocker
these days, right? So I'm glad we're not sequencing human genomes. We are quite
fortunate in that respect too, in that even a hundred thousand coronavirus
genomes in their BAMs is not actually a huge amount of storage, and that does
make life a lot easier. We sort of touched on this, but I wanted to hear this
and then we'll wrap up. If someone did want to set up something similar, what
would you advise, knowing what you know now? I
would definitely encourage people to check out Sam's pre-print on bioRxiv about
Majora, just because it does go through, in quite a lot of detail, some of the
thought processes about why we made certain design decisions, maybe certain
trade-offs. I don't think we're necessarily suggesting that people adopt
exactly the same model as us, actually, but I think there's an awful lot of
useful learning points in there for people to draw from. We needed to be a bit
careful, when Sam wrote the Majora paper, that we weren't proposing an
off-the-shelf infrastructure that people can download and install. All the
software is open source, it's pretty well documented, it's certainly usable,
but you know, the bioinformatician's curse is to do something that's useful and
end up supporting it for the rest of their lives, and we kind of wanted to
avoid that. So I think for us it's much more about the model, and other
countries have adopted a similar model, actually. I know Canada has a
distributed model of sequencing. I know that the US, in fact, although it's
been quite slow to get started, I would say, has also proposed a distributed
model. I think for those countries of that size that want to do genome
sequencing: have a look and just basically pick and choose from some of the
things that we've done, and maybe think about reusing them if it's appropriate.
The only other thing I would say, and this is something we tried to get across
in the pre-print, is that the perfect is the enemy of the good. There's no
point spending a year getting the perfect infrastructure together before you
start, because you've lost a year, and you don't have that amount of time when
you're responding to a public health emergency. So trying to be pragmatic,
although it's difficult sometimes because you want everything to be perfect, is
really vital, and I think hopefully that message comes across in the pre-print
as well. Sam, anything from
you? Yeah, I was going to say, I think one of the highlights from our pre-print
is how we deployed it internally as a sort of walled garden. It's not
necessarily to keep data away from people, but it just meant that we could
actually turn things around in real time, whereas if we followed a model where
we had to upload all of the reads and everything to a public database like ENA,
we'd be waiting for the accessions to be minted and then have to re-download
all the data again to do any bioinformatics. So the model that we've built has
really allowed people to upload sequences in real time and get them integrated
into the data set on the same day. We're seeing phylogenetic trees coming out
hours after the sequences have been uploaded and matched to metadata, whereas
that just wouldn't be possible if you were using a system that depended on you
depositing everything first. I guess one point that we touch on at the end of
the pre-print as well is that although a lot of the stuff that we describe is a
technical achievement, there was a lot of work from a lot of people across the
whole consortium to define a metadata standard: lots and lots and lots of
meetings about what metadata to collect and what we can and can't collect, and
lots of people navigating the legal frameworks from all the different public
health agencies. And for anyone who wants to build a system like ours, that's
also something that you can't neglect. Someone has to put the time in and work
out how they're actually going to collect the metadata in a legal way. And on
that note, I
think we'll end, so thanks for the great discussion. This is a special software
deep dive about the central database infrastructure called Majora that underpins
the sequencing efforts within COG-UK. There is a pre-print now available on
bioRxiv if you want to learn more, and all of the source code is up on Sam's
GitHub. There'll be links in the show notes for that if you want to have a look.
I want to thank Nick, Sam, and Radoslaw for joining me today. That's all the time we
have for this episode, so see you next time. Thank you all so much for listening
to us at home. If you like this podcast, please subscribe and like us on iTunes,
Spotify, SoundCloud, or the platform of your choice. And if you don't like this
podcast, please don't do anything. This podcast was recorded by the Microbial
Bioinformatics Group and edited by Nick Waters. The opinions expressed here are
our own and do not necessarily reflect the views of CDC or the Quadram
Institute.