Hello, and thank you for listening to the MicroBinfeed podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There's so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Both Andrew and Nabil work in the Quadram Institute in Norwich, UK, where they work on microbes in food and the impact on human health. I work at the Centers for Disease Control and Prevention, and am an adjunct member at the University of Georgia in the U.S.

Hello, and welcome to another Software Deep Dive, where we interview the author of a bioinformatics software package, and today we're having a chat about some of the software we've created ourselves, because behind all of our software are quirky details that never make it into the final paper. So today, Andrew is in the hot seat with Roary. So to kick things off, Andrew, what is the problem that Roary is trying to solve?

So Roary just solves the problem of creating pan genomes for bacteria. It's a very simple problem, but when Roary came out and when it was developed, it was a really difficult problem to do at scale. People were doing pan genomes. Actually, maybe I'll start with what a pan genome is. So if you sequence a lot of bacteria, you do assemblies, you're going to get differences between those assemblies, and it might be mobile genetic elements or other random things like you might have plasmids, phage, integrons, this kind of thing. And this is fine if you just want to build, I don't know, a nice little tree, you might go and map everything back to a reference.
But if you want to look at all this variation in the accessory genome, which is everything that's not in the core conserved part of the genome, then you need some way of interrogating it and seeing what's in common in these strains and what's different and that kind of thing. And then also people want to know what genes are actually in every genome in their set or nearly in every genome. And those are your kind of core conserved things, which actually define what this set of bacteria is. What makes a Salmonella an actual Salmonella? It's probably this collection of 2,000 genes or 3,000 genes that really define it, how it works, how it operates, how it lives and dies. Whereas there can be a flow of other things coming in and out that maybe help it to survive in different environments, but they're not universally required.

It came about because within the Sanger Institute, they were struggling because they had these huge big sample collections, but they weren't able to actually pull out this information about the pan genomes very easily. What is accessory? What is core? And they were using tools like PanOCT and LS-BSR and GET_HOMOLOGUES, this kind of thing. Another member of my group started off going and looking at OrthoMCL and seeing could he actually modify that? And now if you've ever used OrthoMCL, you'll realize that it's probably not the easiest software in the world to use, particularly when they moved to more recent versions using a MySQL database. And it just created this whole nightmare of complexity. And doing something quite simple became quite difficult. It's fine for maybe eukaryotes where you've only a small number of samples, but when you actually have huge data sets like with bacteria, it becomes a huge problem very quickly. And we looked back in the literature and found that, well, people weren't really building pan genomes bigger than maybe 80 or 100 bacteria, which is quite small.
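The core versus accessory split described here can be sketched in a few lines of Python. This is only an illustration and not how the tool itself is implemented: the function name `split_core_accessory`, the toy gene names, and the genome labels are all made up for the example, and the input is assumed to be a simple mapping from each gene cluster to the set of genomes that carry it.

```python
from typing import Dict, Set, Tuple


def split_core_accessory(
    gene_presence: Dict[str, Set[str]],
    genomes: Set[str],
    core_fraction: float = 1.0,
) -> Tuple[Set[str], Set[str]]:
    """Split gene clusters into core and accessory sets.

    A gene is called core when the fraction of genomes carrying it is at
    least core_fraction (1.0 means present in every genome).
    """
    core: Set[str] = set()
    accessory: Set[str] = set()
    for gene, carriers in gene_presence.items():
        fraction = len(carriers & genomes) / len(genomes)
        if fraction >= core_fraction:
            core.add(gene)
        else:
            accessory.add(gene)
    return core, accessory


# Hypothetical toy data: four genomes, two shared genes, two variable ones.
genomes = {"g1", "g2", "g3", "g4"}
gene_presence = {
    "dnaA": {"g1", "g2", "g3", "g4"},
    "gyrB": {"g1", "g2", "g3", "g4"},
    "blaTEM": {"g2"},
    "phage_int": {"g1", "g3"},
}

core, accessory = split_core_accessory(gene_presence, genomes)
print("core:", sorted(core))
print("accessory:", sorted(accessory))
```

In this toy set the two genes found in all four genomes end up in the core, while the plasmid-like and phage-like genes fall into the accessory genome.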
So we set about trying to build something that could actually scale to thousands or tens of thousands of bacterial genomes. So we can do these much larger projects. And that's kind of how Roary came about. Originally, it just had the ominous name of the Pan Genome Pipeline internally. And the scripts were just called Pan Genome, Create Pan Genome, that kind of thing. So not very well named, but it got through the kind of early stages.

How did you switch to Roary then?

Because when we started opening it up, we realized we needed a better name than just the Pan Genome Pipeline. So I had to make up a name. And the name I made up was Roary. So like Roary the Racing Car, which is a kids TV program in the UK, and it's in Australia as well. Yeah, it kind of stuck. But really, my son is actually named Rory, spelled slightly differently. So that's kind of where the name came from as well. Might as well, you know, choose something a bit obscure that isn't currently used in software and then use it, you know, because then people can look it up easily. They can type, you know, conda install roary, that kind of thing, and get it quite quick.

I do remember one time, maybe last year or the year before, and you showed me a picture of your son and you said, this is the original Rory. So I think I know who it's really named after.

It's a good name. You know, it's a good Irish, Scottish name. And it's quite common in those countries.

One more question about the name, though. Yeah. Yeah, I've heard. We're all at home, right? Is he like in the next room and you're saying his name a lot? Is he kind of perking up, you think?

He's actually downstairs sick today. So yeah, he's probably listening in.

How does he feel about having such a prominent software named after him?

Well, I don't know. I think he's too young to actually understand, you know, that he's got a famous name.
But maybe in the future, he'll come to the dark side and work in infectious diseases and bioinformatics, you know, and you never know.

So did you choose that spelling? I guess I have one more question. Yeah. Did you choose the spelling? Because I don't know anything about Roary the Racing Car. I've never seen it before. Is that kind of an homage to the speed? Or is there some other meaning behind it?

No, there's no other meaning. It was just my son's name spelled slightly differently.

OK. Racing cars go fast.

And also racing cars go fast, you know, it sounded nice. It was unique. You know, that's the main thing, a unique namespace.

OK, we can move on. I can stop riffing on the name.

Well, I mean, when did the development actually start? You're talking about 100, you know, large data sets, 100 genomes. I mean, that must have been a while ago.

Yes, it started off like maybe 2013. So a fair while ago. And I think it was only published in 2015. And actually, it's only published as a two page Bioinformatics application note. So actually very, very short. But I think the supplemental material was like 30 pages long, as you do. And it was grand. The main thing about it was that internally within the Sanger Institute, people were using it in the group. So we were, you know, eating our own dog food. And we got some great feedback very rapidly. You know, what people actually need, what they don't need. And so it had some extra quirks originally. So, for example, it could scale massively on an HPC cluster. So the way the program is actually structured is a collection of scripts which kind of call each other. And then there are like job runners which run those particular jobs. And that's designed so that you can scale it out massively. But actually, after a while, I figured out that no one really did it that way. They basically said, I have, you know, 64 threads. Please run this on one machine with 64 CPUs. And that's how they ran it.
They weren't using the power, the awesome power of HPC where it could scale to thousands of simultaneous jobs in parallel. So basically that whole functionality is still there, but it's kind of hidden away and no longer advertised. But I think it's the best part. I spent a lot of time trying to get that to work.

Wait, I had no idea about this. So is that available? Is that like a secret thing that we can get to?

It only works for LSF, which is what the cluster at the time used. But yeah, you can actually use it. But you don't really need to, to be quite frank, you know, because it's quite fast. One of the big speed improvements, actually, that I think is really awesome is using CD-HIT to pre-cluster all the data, because fundamentally this method and probably most methods, they just do an all against all BLAST of genes. So not too fancy. But if you pre-cluster all the genes first, you massively shrink down the amount of comparisons you need to do. So that massively, massively speeds it up. And it scales as well, because if you're comparing one species of bacteria, there's only so many genes, unique genes that are going to be in there, you know, unless you have contamination, of course. But there's only so many unique genes. So as you add more and if you're continually clustering, it's not going to increase that much. It's going to increase linearly rather than this massive n by n comparisons. So that improvement came from a guy I think called Feruz Yeltsin, who was actually trying to use it to speed up OrthoMCL. He quickly abandoned that project, though, when he realized it was just difficult.

What was the most interesting feature that was requested at Sanger to put into that, maybe something unexpected?

So there were a lot of features requested and, you know, most of them were put in.
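To make the pre-clustering argument concrete, here is a rough Python sketch of why collapsing near-duplicate genes before an all against all comparison helps. It is a toy under stated assumptions, not the actual pipeline: the `precluster` function groups only exactly identical sequences, whereas CD-HIT clusters by a sequence identity threshold, and the protein names and sequences are invented for the example.

```python
from typing import Dict, List


def pairwise_comparisons(n: int) -> int:
    """Number of unordered pairs in an all against all comparison of n items."""
    return n * (n - 1) // 2


def precluster(proteins: Dict[str, str]) -> Dict[str, List[str]]:
    """Toy stand-in for CD-HIT: group proteins that have identical sequences.

    Returns a mapping from one representative name to all members of its
    cluster. Real CD-HIT uses an identity threshold, not exact matching.
    """
    by_sequence: Dict[str, List[str]] = {}
    for name, seq in proteins.items():
        by_sequence.setdefault(seq, []).append(name)
    return {members[0]: members for members in by_sequence.values()}


# Hypothetical proteins from three genomes of the same species.
proteins = {
    "g1_dnaA": "MSLSLW",
    "g2_dnaA": "MSLSLW",
    "g3_dnaA": "MSLSLW",
    "g1_gyrB": "MSNSYD",
    "g2_gyrB": "MSNSYD",
    "g3_gyrB": "MSNSYE",
    "g2_bla": "MSIQHF",
}

clusters = precluster(proteins)
before = pairwise_comparisons(len(proteins))
after = pairwise_comparisons(len(clusters))
print(f"{len(proteins)} proteins -> {before} comparisons")
print(f"{len(clusters)} representatives -> {after} comparisons")
```

Only the cluster representatives go on to the expensive comparison step, so adding more genomes of the same species mostly adds members to existing clusters rather than adding new representatives.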
But actually, I tried to integrate a lot of stuff that Torsten had told me about, you know, so I made sure that Roary took Torsten's Prokka tool as input and only Torsten's Prokka tool because, you know, trying to take in generic GFF files or generic annotation files is a royal pain. All right. So I focused just on the one annotation tool that is really easy to use, which is Prokka, and that's the only one that I wanted to take in. It worked really well and everything gets processed really easily. That solved a lot of problems getting data in. But, of course, you know, people then go and download data randomly from GenBank or from EMBL and they're like, oh, yeah, you know, I threw all of these different genomes in and I'm getting really weird results by randomly downloading them from the Internet. Well, actually, the reason is that they're all probably annotated with different tools, different gene predictors. They call genes slightly differently and that messes things up, you know. If you're going to analyze a set of bacteria, you really need to do everything up front the same and then analyze them in one go the same. You can't just randomly throw things together.

Yeah, there's a slight tragedy in that, that for a lot of data sets, for the kind of analysis we do at scale, it's more important that it's consistent rather than having sort of nuance or accuracy involved. So, yeah, there's a lot of painstakingly good, detailed GenBank annotation that gets scrubbed off and thrown away and you just use the generic one that's generated via RefSeq or you use the automated one generated by Prokka, which both do good jobs and they're consistent, which is what's important. But yeah, there's a lot of secret information that just disappears along the way. But then, I don't know, nobody finishes genomes anymore, do they?
Why would you want to when you can now do nanopore sequencing and then just get complete genomes, you know, or lots of them in one little experiment?

So what are the unique selling points of Roary?

It is that it is very fast. It's very conservative on memory, so it doesn't scale exponentially. It scales linearly. And it's been used in anger quite a lot for huge data sets. So I know people have used it for like 20,000 genomes and it's worked. Oh, another big selling point is that it's packaged up for a lot of different systems, so it's on Debian, so on Ubuntu, you can just type apt-get install roary or you can go to Homebrew or to Conda. You know, it's widely available and quite easy now to install. It wasn't always easy to install, you know, originally, but now it is. Also, it's written in the best language in the world, called Perl. And actually I find it's well engineered. So it's written with Moose. So it's got lots and lots of automated unit tests. It's, you know, structured nicely. So anyone can pick it up and probably read and work out what it does. If you happen to know Perl, which is the best language in the world.

I love hearing you say that. So wait, you actually switched. This is an aside, I guess, but just because you invoked Perl, it is the best language. You still like that, even though you've moved on?

No, I think maybe my heart is there because I worked in Perl for a fair while. So, you know, in my PhD, I did all like Java and MATLAB and stuff like that. Then I moved on to Ruby and Rails for about a year or two and then into Perl. And it's like, oh, yeah, Perl, or it was really getting back to Perl. And Perl is lovely. You know, I did lots and lots of years in that. And now, you know, I've moved on to Python. Python has its own pros and cons. But, you know, my first love, you know, proper love is Perl.

Yeah, is Python really that lovable? It's a bit generic, isn't it?
It doesn't have these funny idiosyncrasies that other languages have. It's quite pretty and it works, you know, perfectly indenting. What really gets me is, you know, when you have tabs and spaces and you mix them accidentally. Oh, yeah.

In terms of how it's documented, the one thing that people keep coming back to is the FAQ. And it seems to be quite a popular thing. But any time I got an unusual request or request for support, I would make sure to update the FAQ when I was developing it. And there's a lot of random stuff in there. Like someone was trying to run the Perl scripts with Python. Didn't work, obviously. People often, you know, just throw in random genomes and then expect it to work or they don't read the documentation or they want you to go and analyze their data. Whole range of things. So check out the FAQ. It's a barrel of laughs.

Yeah, I like there's some nice questions like, I haven't done any QC on my sequencing data and the pan genome looks very strange. Answer, garbage in, garbage out.

That is pretty much the most common support request, usually by people who have been, you know, handed a set of data. They're not bioinformaticians. They don't know how to analyze it. But their boss wants a pretty tree or their boss wants the pretty results for Nature papers. And they just don't want to learn the complexity of our field when actually, you know, you kind of have to. Like, I can't walk into a pathogen lab and start sloshing around chemicals without knowing what I'm doing. You know, I would be shot by health and safety. So it really comes down to you have to know what you're doing. And I would expect people to have a baseline knowledge and not just try and randomly run pieces of complex analytical software because they won't be able to interpret the results properly.

No, and there's actually some really intelligent FAQ questions as well, which explain some really simple, basic theory stuff as well.

Things like what?
There's some very long branches in my tree. What should I do? Or there's one here that says something. Why is there a sudden increase in core genome size every hundred genomes?

So, yes, that's basic mathematics. Unfortunately, that's another common question. And it's because a feature people wanted was, I want to be able to say this is a core gene, even if it's only present in 99 percent of my genomes. Well, you know, that means that you're going to get these steps every now and again due to simple rounding of whole numbers. Unfortunately, yeah. Or you could just say, I want the core only to be made up of genes which are in everything. But the problem is that assemblers don't necessarily assemble everything and you get random contig breaks. So you have to allow a little bit of fuzziness. Can you analyze my data for me? Sure. Pay me a lot of money and I'll do it.

But yeah, no, there's some funny questions in the FAQs and actually quite a few really intelligent, thoughtful ones in the FAQs as well. So definitely worth a read if anyone listening is interested.

Bioinformatics ASMR. Don't laugh, we should do it. It'll be hilarious.

Bioinformatics ASMR, what on earth is that going to be like?

Stroke my genomes. I made a 10 second video of it one time. It's hilarious. I'll put it on YouTube.

So it's been widely used in the community for a huge range of things. And I'm quite surprised actually how far the software has gone. It's got way over a thousand citations so far. And it seems people are using it like all over the world. I've gone to random conferences and people have been using it. So that's kind of cool. And it's used in a lot of public health labs as well, just for doing quick core genome analyses and that kind of thing and as part of pipelines. So it's nice to see the software being used. I haven't done updates on it for donkey's years because, you know, I've moved on. You don't get publications based on the stuff you did five years ago.
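The FAQ point about core genome size stepping up every hundred genomes comes down to whole-number rounding, and a small Python sketch can show it. This is an assumption about the arithmetic rather than the tool's actual code: the function names `allowed_missing` and `is_core` are invented here, and the 99 percent cutoff is taken from the discussion above. Integer arithmetic is used so there is no floating-point surprise at exact multiples of 100.

```python
def allowed_missing(n_genomes: int, core_percent: int = 99) -> int:
    """How many genomes a gene may be absent from and still count as core.

    With a 99 percent cutoff this is floor(n / 100), so it only changes
    when the number of genomes crosses a multiple of 100.
    """
    return n_genomes * (100 - core_percent) // 100


def is_core(n_present: int, n_genomes: int, core_percent: int = 99) -> bool:
    """True when the gene is present in at least core_percent of genomes."""
    return n_genomes - n_present <= allowed_missing(n_genomes, core_percent)


# Walk across the boundaries where the allowance changes.
for n in (99, 100, 199, 200):
    print(f"{n} genomes: a core gene may be missing from {allowed_missing(n)}")
```

Under this cutoff a gene missing from a single genome is not core in a set of 99 genomes but becomes core once the set reaches 100, which is why a batch of genes can suddenly join the core at each multiple of 100.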
And obviously that was in a previous job. So you have to continually move on, publishing stuff, publishing tools, and sometimes just leave the other ones to survive on their own. But it is on GitHub. So if anyone wants to make changes or additions, they're more than welcome to. And, you know, I can merge those in. Yeah, there are a few programs out there now, like in the past five years since it's been published, lots of other new programs have come along. In fact, there seems to be, you know, one a month coming out. Which is quite good. Things like PopPUNK, and Panaroo I think is a new one as well. There's loads. And I'd say if Roary doesn't suit you, then check those ones out.

Well, they're all benchmarking Roary. So they do say they're faster or better in different ways. So if Roary doesn't suit your needs, check them out.

Are we going to do anything with Scoary yet or is that another time?

Oh, yeah. Totally forgot about Scoary. Yeah, that's a great way of doing further analysis, like statistical analysis on your output. And yeah, that's a fantastic piece of software. I'd say check that out. So basically you can put in like, these are your cases, these are your controls within Roary, and it'll kind of do this magic and pull out a million different stats for you. And yeah, so Scoary is definitely one to use in conjunction with Roary.

Do you know who wrote Scoary?

Oh, it's a guy in Norway. Ola Brynildsrud. I would say check it out because it's a fantastic piece of software.

Can I do a bonus question for the premium users? What's the strangest use of Roary you've seen or what's the most unusual, most notable one?

My God, honestly, I don't know. I mean, people have tried to build like pan genomes of everything Gram-negative and all this kind of stuff. You know, you're not going to find much in there. Well, in fact, that's one of the most common ways of seeing if people haven't QC'd their data, you know, if they come back and they say, oh, I've only got 150 core genes.
And it's like, well, you probably just tried to create a pan genome of everything in, you know, Gram-negatives or Gram-positives or whatever. And you probably have contamination in there. Contamination is quite easy to spot, actually, in Roary. And I use it as a quick check as well all the time. And, yeah, I don't know, people don't do the basics, they kind of skip over, they think, ah, sure, I'll just try all these methods blindly, see what comes out. But number one always is QC, QC your data, please.

Good point. Right. Awesome. So thanks for the great discussion. This was just a quick chat about some of the software we've created ourselves. And there's always some interesting facts about how these different tools came into being. And yeah, today we've been talking about Roary, which allows you to generate pan genomes. And you can check it out on GitHub. There's the paper out in Bioinformatics. So yeah, follow up with that. Read the FAQs. And that's all the time we have for this episode. See you next time.

Thank you all so much for listening to us at home. If you like this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group and edited by Nick Waters. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.