Hello and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There's so much information we all know from working in the field, but nobody really writes it down, there's no manual, and it is assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan of Enterobase, GrapeTree, and BRIG fame, and Dr. Andrew Page of such works as PlasmidTron 5000, Roary, and Gubbins. I am Dr. Lee Katz, and you might know me from my tree-making pipeline Mashtree or my SNP pipeline Lyve-SET. Both Nabil and Andrew work at the Quadram Institute in Norwich, UK, where we work on microbes in food and the impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct professor at the University of Georgia in the US.

Hi, I am Nabil-Fareed Alikhan. Normally I'd be joined by my co-host, Lee Katz, but he's currently flying over the Atlantic, so joining me instead is Andrew Page, head of informatics at the Quadram Institute. And we're joined today by a special guest, Associate Professor Torsten Seemann, who is lead bioinformatician at the Microbiological Diagnostic Unit Public Health Laboratory in the Doherty Institute for Infection and Immunity at the University of Melbourne in Australia.

It's great to be here.

We wanted to ask you today, Torsten: how on earth do we write proper, good bioinformatic software that people actually use, as opposed to throwaway bioinformatic software that just goes in the bin after someone's published a paper?

Thanks for having me on the podcast, Andrew and Nabil. That's a good question. I guess I'm not an expert on how to write great software. I know a bit. I think what we all tend to agree on is what bad software is like. So what I tend to do is just try and avoid the bad things, and what results is usually kind of useful and good.
I'm not a trained software engineer, really. I just try and make things robust and, I guess, foolproof or idiot-proof.

And you write it all in Perl. Well done.

Yeah, I am an old dog. I do still write in Perl. It was my first language and it's sort of still my go-to. I'd love to change, but I'm not quite there yet. But ultimately, it doesn't matter what it's written in; it's how the user interacts with it, what their experience of the software is like. If it just works, it does what they want, and it doesn't fail often, they won't care what's underneath.

And that's the beauty of Prokka, which you wrote a few years ago. It just works out of the box with virtually no dependencies, no extra libraries, no nothing. And that's why everyone has adopted it and used it so widely.

Yeah, I think one of the keys to Prokka's success was that it came bundled with small databases, but databases that were good enough to get good results. And the other key thing was that you didn't need any settings. You could just give it your FASTA file of contigs and it would give you an answer.

Quickly.

Yeah, and relatively quickly. That was the key: it was much faster than many of the other solutions out there. A solution out of the box that just worked for most people. And it was written out of my own need. I needed to annotate genomes, and I guess that's always the best motivation, when you need to use something yourself. I originally wrote Prokka probably 14 years ago, when we were annotating a single genome and we were still in the days where we would actually curate individual genes. Prokka was a way to help bootstrap this manual curation process, to try and get the bulk of it done at once. But it wasn't long after the next-gen sequencing boom that suddenly we were annotating hundreds and thousands of genomes at a time.
And Prokka really became the standard part of the pipeline to perform that annotation and just get something good enough.

Yeah, it's interesting, because Prokka was not the first genome annotation tool available. There are plenty, like, you know, RAST, and there's another one we used called Manatee, which was an absolutely gargantuan pipeline. But for the most part Prokka has superseded all of these other tools, and it seems to be standard issue that you just generate a Prokka annotation. What do you think was the one key element that set it apart from the other existing tools that hadn't become as successful?

Well, I think it was simply the fact that it was a single download. At the time, you could just untar it and it would run. You didn't have to do anything else. It came bundled with a lot of binaries for Linux and Mac, and it just worked. You would run it on your contigs and get a GFF and a GenBank file, completely valid, and you could immediately use that in a downstream tool such as Artemis, or, in later years, it became immediately usable for putting into Roary to calculate your pan-genome and things like that. So yeah, that's the one thing: you could just download it and immediately run it.

What, you mean all of our Linux software can't just be downloaded and run easily?

Not in this universe that we operate in, no. It is a rare thing for that to work. Although the resurgence of new packaging systems like Conda has really changed that; people are putting in a lot of effort to get these tools working for you and easily installable.

And you do a lot of that work as well in your free time. You upload to Conda and make recipes. I think mainly you're in Brew.

Yeah, I started off in the Homebrew Science thing. That was mainly because of Shaun Jackman, a fellow bioinformatician from Canada, who was very helpful in teaching me how to package these things.
But the community has now tended towards Conda, because it can handle versioning and environments much better than Brew, which is more of a flat versioning system.

And why would that be important?

Reproducibility. And that other word beginning with R.

So you're not just known for Prokka. I know in this institute here in Quadram, everyone uses your tools; it's called the Torstenverse. People use Snippy for finding core SNPs. They use Nullarbor for producing these insanely useful reports on everything that someone would want out of bacterial isolate sequencing. And other things too, like ABRicate for calling AMR genes, and mlst for typing in a very trivial manner. Although the name of that one, you know, you need to change that name.

Yeah, the mlst tool is one of the few tools I have that doesn't have a proper name, and I kind of regret it. We did spend a lot of time brainstorming ideas for that, but the only kind of words that came up were things like 'molest', which were terrible, terrible ideas.

Thank God you didn't call it that.

And it's sort of hard to go back and rename it now. People seem to know it now, so I'm hesitant to change the name. But once again, all these tools came out for the same two reasons. One of the reasons was that when I wrote Prokka, I realized I needed a tool to find ribosomal RNA genes, and the existing tools weren't any good. So I ended up writing my own tool.

Is it Barrnap?

Barrnap, yeah. It's not perfect, but it's good enough. And it's fast, and it came, once again, bundled with lots of different databases, clean and simple, and worked out of the box. So nearly all my tools are modeled on that original Prokka experience, where I released the tool and people gave lots of good feedback and were happy with it. And that really encouraged me to write more tools. I've followed that Prokka model all along.
Minimum number of inputs: just a single file, whether it's contigs or your reads, maybe an output folder, but even then, not always.

One thing I like about Barrnap, and about your tools in general, is that you output in standard formats. Barrnap will give you GFF if you want it, and the same with Prokka. That's insanely useful, because you haven't gone down the path of inventing your own format, which a lot of developers do, and then it's just yet another useless format that we have to consider. You've taken the time to actually use a standard that pre-exists. That does take a lot of work, and I appreciate the effort you put in.

Well, thanks. I strongly recommend using standards like BED, GFF, VCF and FASTA, just so your tools can interoperate with all the other useful utilities. It makes pipelines much simpler to write when you don't have to write too much glue in between.

Absolutely, yeah.

And I guess the other thing, going back to Nabil's original question of what was the single thing that made Prokka successful: I think mainly it was that it was a command-line tool. That meant you could run it in a high-throughput manner, because suddenly we had all these genomes, and people didn't want to run them one by one, didn't want to upload them to a website one by one and collect all the downloads and then figure out which ones failed and resubmit. Being able to do it on the command line made it plug nicely into pipeline or workflow systems, with a well-defined input, FASTA, and well-defined outputs, like GFF. So yeah, it just made it easy.

Well, when I first used Prokka, the thing that excited me the most, especially being in Australia with poor internet, was that the databases for a lot of the annotation tools were very big and very complicated to download and install.
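As an aside on the standard-formats point above: emitting something like GFF3 instead of a home-grown format takes only a few lines. This is a purely illustrative sketch; the tool name and feature values are made up and are not Barrnap's actual code or output.

```python
# Illustrative sketch: writing results as standard GFF3 instead of a
# custom format. All feature values here are hypothetical.

def to_gff3(seqid, source, ftype, start, end, strand, attrs):
    """Format one feature as a tab-separated GFF3 line (1-based, inclusive)."""
    attr_str = ";".join(f"{key}={val}" for key, val in attrs.items())
    return "\t".join([seqid, source, ftype, str(start), str(end),
                      ".", strand, ".", attr_str])

print("##gff-version 3")
print(to_gff3("contig1", "mytool", "rRNA", 120, 1650, "+",
              {"ID": "rrna_1", "product": "16S ribosomal RNA"}))
```

Because the output is plain GFF3, it can go straight into downstream tools that already speak the format, with no custom parser needed.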
And I think in Prokka you had a scaled-down UniProt database, and you got, again, like 90% of what you would get with the more exhaustive database, but without all that overhead, and that made life a lot easier. So this comes back to one point that, if I'm writing software, I think might matter, and I'd like to get your opinion on it: building a singular tool that does one thing and does it well. If you can cut extra dependencies, do it. If you can cut database sizes, do it. If you can cut unnecessary parameters, do it. Is that generally what we're talking about here, less is more?

I do favor having fewer dependencies. Sure, the BioPerl part of Prokka has been a problem for people in the past. I probably won't use that in the next iteration; I might remove that dependency. But, funny story about reducing the size of the databases: part of the reason for that was to reduce the distribution size, because I was hosting it on our university website, and most of my users are outside Australia, and the pipe to the rest of the world was quite small back then, and people were complaining about downloading it, because it was like a 1.2 gigabyte tarball at the time. So I ended up helping them by putting it on my Dropbox, and I got banned from Dropbox one day later for too much bandwidth being used. So I had to write them a nice letter and say, I promise to delete it if you reinstate my Dropbox account, and they said yes, okay. So I put it onto Google Drive, and they never had a problem with the bandwidth. But that made me realize that people were still having trouble downloading back in the day. So, on what exactly I did there: people said, oh, why don't you just use the whole NR database in GenBank as your annotation source?
And I said, well, sure, but a lot of those genes are annotated poorly; they're just annotations copied from other bad annotations, a propagation of errors. So I thought, what should we do? Go back to the pure stuff we really know. And as you said, I used UniProt, but not just that: I used UniProt Swiss-Prot, which is the curated part. And I went even further. UniProt has evidence codes for each of its records, which tell you whether it was confirmed by proteomic evidence or RNA evidence, or how it came about. So I thought, well, let's just stick to the well-curated stuff. And that really shrunk it down to about 70,000 proteins, which ultimately represented a kind of core proteome for bacteria. So yeah, you could annotate 60% of your genes with just a very tiny database, with high reliability. But a lot of people still aren't happy; they want better annotations. So I've added capabilities now where you can supply some reference genomes and it'll incorporate those into the annotation.

Well, let's change track and talk about software in general. When you're catching up on the literature and trying to see what else people are doing, what are the sort of bugbears you have when interacting with a new bioinformatics tool?

Well, I think we all have a universal experience. The standard thing is you'll see a tweet, or get an email from someone, or a Slack message: oh, have you seen this new tool? And it's a link to a paper or to a bioRxiv preprint. So you go there excitedly, read the abstract and think, yep, this is it. This tool is going to solve all my problems. And then your first problem is actually finding the source code. So you read the paper and you're digging around for a URL somewhere. Sometimes you won't find one.

Contact the author.

Contact the author, or you'll just search for 'https' or search for the word 'github' in the paper, just hoping for a link. And you eventually find it.
And then you think you're 90% of the way there, but you're really like 10% of the way, because you then have to install it. Now, Conda and Brew and these packaging systems have made this a lot easier, but that's not true for new software. New software isn't packaged into those things quite yet; it's not mature enough. So you go through the complicated install instructions and eventually it kind of works. And then you run it on some tiny data set, or just some standard thing you try everything on. And then it fails. Some Python backtrace, or bash error, or missing module in Perl. It's the standard thing. I guess you guys have both experienced this as well.

Absolutely. I've installed hundreds of applications, and I've bashed my head against the wall so often where things just absolutely do not work. Or they will only ever work in the institution where they were written, because they've gone and hard-coded the parameters. Or they tell you, please go and edit the script at line 59 to change some tiny little thing. And you're thinking, this isn't proper software.

That's a terrible experience for your user to have first time around. They're excited, they want to use your software, they spend all this time on it, and then they can't get it working. And they maybe file some GitHub issues, and maybe get a response, or maybe never get a response. They're never going to use your software again. I don't want that to happen to people that use my software. I want their first experience to be a positive one.

So, just coming back, one thing that popped into my head when you said you read the abstract and you're looking for a URL: where would be the best place to put it in a manuscript? Where are you actually first expecting it?

For software papers, I think the software URL should be in the abstract. If it's not in the last line of the abstract, where is it going to be? There might be a specific section in some journals for software availability and stuff like that.
For example, Oxford's Bioinformatics has that, but it's kind of part of the abstract. I know a lot of people don't agree with me, but I think the software is the main output of the paper, so it should be in the abstract.

So I saw a blog post you wrote a while back on the minimum standards for bioinformatics command-line tools. How did you come up with that?

Well, yeah, it was born out of my frustration with all these tools that I couldn't get working. So I thought I would just write down some of the internal rules I have for my own software, and hope that other people would read that and adopt it. And the blog post ended up being quite popular, getting a lot of readers and a lot of sharing. And I got approached by the GigaScience journal at the time, and they asked if I would like to turn it into a paper. I'm a terrible paper writer, so that was a bit of a motivation to actually get a paper out. And I did get a paper out, and it was quite interesting; it helped improve the blog post a bit. And yeah, it's been quite a popular paper. It was one of the most downloaded papers from the GigaScience journal website at the time.

Wow, that's insane.

Until their database crashed and they lost all their statistics. So there was definite interest in this sort of stuff, even though it should be obvious to people; but obviously it's not.

So Torsten, what kind of things did you put into your recommendations?

Well, nothing too complicated, Andrew, just really basic things. These are what users will experience when they first try the software. Say you install a piece of software, and the first thing you do is run the command. Say you install Prokka: you will run prokka, and what do you expect to happen? You expect it to tell you something about what you didn't do, like you didn't provide this option, or you need to supply this. You don't expect it to just spit out an error.
You expect it to spit out a useful piece of information to tell you what to do next. Or, some software doesn't even have a help option; you need to have a help option. Or, even more dangerously, it just goes and runs, and you don't know what it's doing.

Yes, that's the worst. Deleting your data.

Yeah, often a piece of software run with no options might create a directory and start writing some files in there without any permission or awareness from the user, and that's very frustrating. So yeah, my goal is to do nothing unless you've got everything you need, and then tell the user at every point what's happening and what's going on. A lot of people quite enjoy all the text that gets scrolled past them in my software. They find it comforting, and I find it comforting as well, to know what's going on all the time. Although there are differing opinions; some people say it should be quiet unless the user tells it to be noisy.

Yeah, so I'm the other way around. I'm a bit noisier by default, and you can quieten it down. Most of my tools have a quiet option. Say you're embedding it in a large pipeline: you might not want all that verbose output.

It's interesting, a lot of the logging from your tools often has a fairly funny or comical sign-off at the end when it's finished running. I think it's Prokka that normally says 'Share and enjoy' at the end.

Prokka had a couple of inspirational quotes or funny quotes at the end. I don't know what made me put that in. I think I just did it to keep people interested when they were looking at their output. But yeah, it's become a bit of a trademark, where most of my tools have some kind of little remark at the end. And I think one of my latest tools might have a library of 20 random statements it gives at the end. I think it's Snippy, actually. Yeah, Snippy has quite a lot of random ones, and I get lots of good feedback about that, so it makes it fun for me as well.
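The minimum standards discussed here, print usage when run with no arguments, provide a help option, stay noisy by default with a quiet switch, and never write output anywhere the user didn't ask for, can be sketched in a few lines. This is an illustrative example only; the tool name and options are hypothetical, not from Torsten's actual code.

```python
# Illustrative sketch of the command-line minimum standards discussed above.
# "mytool" and all of its options are hypothetical.
import argparse
import sys

def build_parser():
    p = argparse.ArgumentParser(
        prog="mytool",
        description="Annotate contigs (illustrative example only).")
    p.add_argument("--version", action="version", version="mytool 1.0.0")
    p.add_argument("--quiet", action="store_true",
                   help="suppress progress messages (verbose by default)")
    p.add_argument("--outdir", required=True,
                   help="output directory (never chosen silently)")
    p.add_argument("contigs", help="input FASTA file of contigs")
    return p

def main(argv=None):
    argv = sys.argv[1:] if argv is None else argv
    if not argv:
        # Run with no arguments: explain what is missing, then do nothing.
        build_parser().print_usage(sys.stderr)
        return 2
    args = build_parser().parse_args(argv)
    if not args.quiet:
        # Noisy by default: tell the user what is happening at every step.
        print(f"Reading {args.contigs}, writing to {args.outdir}",
              file=sys.stderr)
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Run bare, it prints usage and exits non-zero instead of silently creating files; `--help` and `--version` come for free from argparse.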
And your software is very robust as well. I know in my own role we've run it maybe half a million times and it just works every time. The only thing that breaks is, was it tbl2asn?

Yeah, tbl2asn is a tool provided by NCBI which actually writes the GenBank files that Prokka produces. And it's become a bit of the bane of the Prokka universe. It's distributed in binary-only form, which is kind of against my beliefs. They also don't tell you when they're going to update it; a new binary version just magically appears on the FTP site. So this is something that's caused a lot of problems, and I want to remove it from Prokka.

And it just seems to expire after a year.

Yeah, every six months to a year it expires. So you need to replace it, but there's no warning.

Until it breaks your pipeline.

Yeah. So that is the main problem with Prokka, that step in the pipeline. I kind of regret using it. It guaranteed nice, compliant GenBank files, but it was a bit of a mistake looking back, and I won't be using it in any future tools.

Yeah. So for new people who are writing a tool, what sort of advice would you give to someone who wants to write a new tool, or who is wondering if they should try and write a new tool?

Well, my first advice would be: don't write a new tool. It's a big commitment. If you want to do it properly, it's going to take time. And the key thing is you have to maintain it. If you're just going to write a tool, publish it, and then abandon it, please do not start that project.
You're not really helping anybody. If you're not going to use the tool yourself, don't write it, because you'll never truly make it useful and reliable unless you are forced to use it and suffer through its problems. Only through that suffering will you improve the software. It's not that we don't care about other people's experiences, but ultimately, you know, time is limited. If it's failing all the time and causing you problems, or your colleagues are annoyed at you, then you will fix it. But if no one, including you, is going to use it, then yeah, don't bother.

What's the oldest tool that you wrote that you're still maintaining now? How many years back are we going?

I think Prokka is my oldest tool, actually. Prokka started as a series of modular bash scripts back in about 2004 to 2006; that was the genesis of Prokka. It was used internally only by us at the Department of Microbiology at Monash University, where I used to work. It didn't become the Prokka we all know until much later. I think the paper was 2014, but it was out in the open source community well before that. It took me a while to write the paper; as I've said, I'm not a good paper writer. I never feel like the tools are quite good enough to be published, but clearly that's not true, based on what we see published out there a lot of the time.

It has thousands of citations. I think it's 2,800.

Yep, so it's in the low thousands, depending on whether you use Web of Science or Google, which is an amazing, amazing number.
I never expected that. Everywhere I go, people know me as the Prokka guy, and I'm okay with that. I've made a lot of people happy, because it solved a problem they needed to solve, and it did it in a reliable way, and I'm proud of it. It's enabled a lot of research, and a lot of other pipelines and tools are built using the output of Prokka. I think it really set the tone for the rest of my career, that I was keen to write good-quality tools and make them available freely to the community. That set my whole career.

So let's say that you have followed these steps and you've identified a tool that you're willing to put a good, say, decade of looking after and developing into. How do you then distribute it and make it available to other people? What are the platforms available? Because at the moment it seems quite saturated. It'd be interesting to hear what's available, and then what you think is best, like your priority list of where you should put it: here, then here, then here.

Yeah, so Prokka, being my oldest tool, is probably a good place to start. GitHub would have existed around then, but I was hesitant. It seemed confusing, and I was a bit stubborn; I didn't want to use a revision control system. I'd used SVN and RCS back in undergrad, and that wasn't pleasant.

So I can appreciate that.

I was scared of GitHub; it seemed complicated. But I had friends and colleagues who helped me figure out the basics, and once I understood how it worked, and that it wasn't too hard to do the basics, I just loved it and said, this is what I have to use for everything. Before that, I was just distributing tarballs with version numbers on personal websites and stuff like that. But after I discovered GitHub, and realized that it could keep my issues organized, and pull requests, which I tend to not be very good at reviewing...
...I realized that was the place to go. Obviously any system like Bitbucket or GitLab is also fine; I'm just so used to the GitHub interface that the others are sort of confusing to me.

And since you mentioned hosting it on your own website, or even an institutional website, do you recommend that now as an approach?

No, I don't think so. If you are legally allowed by your institute to put it in a more public place, I encourage people to do that, just because universities rearrange their websites all the time. Software links go dead; they don't put in proper, you know, HTTP redirect codes for old links. GitHub is quite a static place, or one of the other repositories. SourceForge is the exception; I would encourage people not to use it. I've had a lot of issues with SourceForge. There have been, you know, malware issues with SourceForge, and I just find the whole interface very confusing and difficult. The others are just cleaner and simpler and more widely used.

So you chose the name Prokka, which I understand is from 'quokka'. But how do you choose the name of a new software package? It just seems really, really difficult.

Yeah, I think choosing a software name is very important. I know it seems a bit weird, but people like a good name. It is marketing, in a way. Prokka, you're right, rhymes with quokka, which is an Australian animal that lives off the west coast of Australia, but it originally stood for a sort of pseudo-acronym of 'prokaryotic annotation'. And yeah, I like the letter K; it's one of my favorite letters for some reason. And the key thing is that I want it to be unique. The namespace for software is huge and polluted with all the common words, so to stand out and not get any name conflicts, you need to find something a bit original.
So yeah, Prokka, Snippy, Barrnap, all these names were tested by googling first, in incognito mode, just to check if there were any conflicts, and if a conflict was minor enough and I liked the name, it'd be fine.

And like I said, you need to check Urban Dictionary.

Yeah, Urban Dictionary. Hopefully that would come up in a Google search. But yeah, that's definitely very important too.

What do you think of these acronyms?

Yeah, I'm not a fan of acronyms. Well, they can be good; if they end up as a nice unique name, then that's great.

Well, what is a backronym, for people who don't know? How does that work?

A backronym is where they've come up with a name, and then they retroactively fit a set of words to the name they chose. So they didn't choose the words and then make an acronym from them; they went the other way.

So, okay, it's like they really want to call it RUST, and they go, oh, it means Really Unique Search Tool.

Exactly.

Okay, sure. Yeah, but sometimes backronyms can be good.
I'm sure our listeners could probably provide some examples where the backronym was good, but most of the time, not.

And of course, when it's not, you win a JABBA award.

Yep, I think Keith Bradnam came up with the JABBA award for bad bioinformatics acronyms. It stands for 'just another bogus bioinformatics acronym', and there are plenty of bad examples out there, which we won't personally name or shame in this podcast.

Okay, so yeah, we've been talking about GitHub and Bitbucket, and these are sort of repositories to put your source code up and make it available. But that's often geared more towards people who can program, or are used to dealing with software. For people who just want to install a thing, there's a range of different options available, and I just wanted to hear a bit about those options, and then what you would prefer as the best way moving forward.

Yeah, you're spot on. The bulk of our users are applied users. They may be biologists who have done some sequencing, or some other data generation, and want to use our tools on their data. They're not other bioinformaticians writing pipelines. So you need to really cater for those people; they're the ones going to have the most difficulty with installing directly from GitHub.
Like you said. So, traditionally you would package your things in, you know, a Linux distribution package system like Debian's, or Red Hat has RPMs and stuff like that. But those have fallen out of favor, mainly because you can really only store one version at a time, but also because writing packages for those systems is quite complicated. Debian has Debian Med, which is the distribution for most of the bioinformatics software, and it had a very strict set of rules for software. And I think Andrew, I recall, had some experiences there.

Yeah, there's quite a lot of documentation you need to read, and it can be quite an ordeal. You need to do basically months of mentoring and training before you can get to the point where you can actually submit stuff. So it's quite a high barrier to entry.

Yeah, exactly. So what came along as alternatives to this? The other thing is that you usually had to be a root user to install using those systems, and most people these days, you know, are on shared clusters, or using systems where they don't have administrator access. So along came things like Homebrew, which started on Mac. But I guess the biggest ecosystem at the moment is Conda, which is based on Python. Conda allows you to install whole hierarchies of software without having any administrator rights or root access, and it's revolutionized, I would say, the bioinformatics installation experience for everybody over the last few years. I would say that for any new tool that comes along, if you're not packaged into Conda, you're not really going to be used.
Everybody's using Conda now, and if they can't find your tool there, somebody may package it for you. But ideally you would proactively try and get it packaged, or write to the Conda people and help them out: explain exactly what the dependencies are, or make it easier for them to package by being very clear in your documentation about what it needs, how it works, and how you set it up, and somebody out there will do it. There's a huge team of Bioconda developers who are just going around packaging up software tools that they don't even use themselves, but they're doing it for the community. So the easier and clearer you make it for them, the more likely they will produce a reliable Conda package for your tool. And I'm very lucky now that when I release a new version of a tool, there are people out there who go and package it before I've even realized. I feel like everybody's working together and things just happen.

And of course, if you make your tool easy to install, then people are going to use it and cite it. So it's not rocket science, but you know, you will get all these citations for free just by taking that extra step.

I have no doubt that the success of Prokka and its number of citations is purely because it just worked for most people, and therefore it immediately ended up being used in their research projects, and they ended up citing it in the papers that used the results. If it wasn't easy to install, the citations would be way lower. And, you know, in Australia, grant reviewers and so on are taking things like citations much more into account these days, because it's more about the translational impact of your work. Having a paper doesn't mean much if no one cited it. So yeah, that's been great for us in bioinformatics. It is an incentive to create high-quality software.

So I guess if you can't install the software, does it really exist?

That's right.
Just on packaging, though: what does that actually mean? Very briefly, what are you actually doing when you're packaging something, what's involved? Yeah, so traditionally, a lot of software a long time ago was written in something like C or C++, so the standard thing you would do is download the source code, run the standard configure script, then type make, then make install, and that would put it in some fixed place, usually under the /usr/local hierarchy, and that was all you had to do. Since then, scripting languages like Perl, Python, and R have really taken over from C, just because of the ease of writing software in them, and they came with their own internal packaging systems. Python has gone through a lot of different ones over the years, but now PyPI and the pip command have become the standard way to install stuff. So if you are writing Python software, you should really be thinking about packaging it in a way that it can be uploaded to PyPI, so that people can just type pip install whatever. It's a bit daunting at first, but there are some very good guides for it, and it's actually much simpler than you think. It usually just means putting your software into a subfolder and creating a TOML file or similar that describes what it is, what category of software it is, and what the starting command is. And if you can get into PyPI, that then makes it trivial to get into conda, because it becomes very easy for the people packaging for conda to write a wrapper for it. The problem with PyPI is that it installs your actual Python script, but if you use other commands like SAMtools or BWA, they won't be included; there's no guarantee they're installed alongside your PyPI package. That's where conda comes in.
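As a rough sketch of what that "subfolder plus a TOML file" step looks like, assuming a hypothetical tool called "mytool" with a `main()` entry point in `mytool/cli.py` (all names here are invented for illustration), the minimal layout and metadata might be:

```shell
# Hypothetical layout for a pip-installable tool; "mytool" and its
# module path are made-up names, not a real package.
mkdir -p mytool-src/mytool
cat > mytool-src/pyproject.toml <<'EOF'
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
name = "mytool"
version = "0.1.0"
description = "Example bioinformatics command-line tool"

# This line is what lets users type "mytool" after pip install:
[project.scripts]
mytool = "mytool.cli:main"
EOF
# From inside mytool-src/ you would then typically run:
#   python -m build       # builds an sdist and wheel into dist/
#   twine upload dist/*   # uploads the release to PyPI
cat mytool-src/pyproject.toml
```

Once that metadata exists, `pip install mytool` works for end users, and, as discussed above, a conda packager can wrap the PyPI release with very little extra effort.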
Because those tools are obviously not Python, and they don't belong in a repository based around Python. That's exactly right, whereas conda brings it all together for you. A conda recipe, as it's called, or a brew formula, is basically the set of steps that you would otherwise need to do manually to install a piece of software. Even better than that, conda can reuse existing recipes: say your tool depends on SAMtools. You don't have to install SAMtools in your recipe, because somebody has already written a conda package for it; you just declare SAMtools as a dependency, and it gets installed automatically. So the bulk of the conda recipe for your tool would basically be to install your Python software and pull in its dependencies. So yes: for Python use PyPI, for Perl you should use CPAN, and for R you should use CRAN. Once again, going back to standards: these are community standards for distributing software, and you should use them wherever possible. Okay, so we've now gotten to the point where we've picked a tool, we're going to look after it, and we've got it into a repository so people can download it and it's nice to use. But then how do we market it and get word out into the community that this is available, it's good, and people should have a look at it? That's a good question. Traditionally, the way to promote your software is to publish it, and as I've said, that's not something I'm very good at doing. How I started promoting things was mainly through blog posts: originally I would write a blog post about a tool, and then use Twitter. Twitter is probably still my main promotional channel for software, because I still haven't written most of the tools up. I just tweet links to GitHub, people can look at the GitHub page, and I usually tell them how to install it, which these days usually involves running conda install or brew install. So I try not to announce
anything until it's at a reasonably good state, but now that I've become known for writing software, people notice any new GitHub repositories I create and pounce on them quite early. I don't tend to make things private; I tend to start off public. So yeah, Twitter is probably my main avenue of advertising, but there are also the traditional journals, and preprints are common. I've yet to write a preprint; I probably should. Oh, they're awesome, I've been writing loads. Every paper we've had in the last few years has been preprinted, and I find it enormously successful, because people read it, they can get all the data, and it rarely changes by the time it's published, but you get it out a year or two earlier, because publishing can take so long to get to that final polished paper. Yeah, I think I should start, but it still involves writing something. One thing I'll mention on preprints as well: I'm a massive advocate, and I've been using them as much as possible in the last few years, but you do have to be careful, because some journals still do not like the idea of preprints being available. They want exclusive rights to the text, and you've sort of semi-published it already, so they may say they won't entertain the manuscript. The journal should be able to tell you, so you should ask the editors, but there are also online resources you can check, like Sherpa Romeo, which is a database with a ranking of how open and easy it is to publish in each journal: whether they allow open access, whether they're happy with preprints, postprints, and so on. So that's something to keep in mind, but otherwise, in general, I'd advocate preprints. They definitely take the anxiety out, because it's out there; you're not sitting there waiting, oh my god, it's still in review, when is this going to come
out. It's out there, and you're just going through the motions of getting it into print. Yeah, I've tended to feel that my GitHub and my documentation are kind of like a preprint, in a way, in that they contain most of the information about what the tool is about. I feel like a PDF file somewhere isn't really adding any value to that, but from a career point of view it probably would. But what about software-focused journals, like JOSS, the Journal of Open Source Software, which is somewhere in the middle? Yeah, this new JOSS journal: I love the model, I love that it's focused on usability. You don't need to prove that your tool is the best at anything or compare it to anything else; it's mainly about the quality of the software. I believe they have checklists that reviewers use to assess that there's quality documentation, that it's installable, all the things that I complain about or talk about in terms of software quality, JOSS has addressed. So I'm a big fan of that: it doesn't focus too much on comparisons or complicated details, it focuses on the software. Even if you're not necessarily submitting to JOSS, I think looking at those guidelines is a good resource for people who want to know the minimum requirements of writing good software. Yeah, I would agree with that. So now we've gone through the entire development cycle, and we're at the point where you're maintaining the software and you have user requests. It's great when you get that first email, "Hey, I use this software, it's great, but..." So what are the best strategies for managing user support, if you're maintaining Prokka for, like, ten years? Yeah, at first you get these emails and this small amount of positive feedback, and it's great, and then you keep getting more and more email, and you struggle to
keep up, you honestly do. So I encourage people not to email me; I sort of answer emails in bulk every few months. I encourage people to use the GitHub issues page, and the great thing I discovered about that is that there's a community of users out there who end up helping each other. I often get to an issue that's been filed, and somebody else has already helped that user, so I can just say thank you and close the issue; that was quite a surprising thing. I don't have a mailing list. I would like to, but I worry that it would just hide a lot of the content. I don't know the best way to do it: issues get closed, and so they sort of disappear as well. Should there be a frequently asked questions page, or a wiki on your GitHub site? I'm not sure what the best way is, but I've noticed a lot of my software is self-supporting in the community through Biostars or SEQanswers, these traditional bioinformatics forums online. So I think I owe a big debt to the community, who have helped me maintain Prokka. But there's still a huge burden on me, with pull requests that I don't have time to review and issues that I don't have time to fully investigate, and there is actually quite a large psychological burden: there's guilt associated with this software that you put out there. And you're not being paid to maintain this indefinitely? No, there is no payment. I'm very lucky that my employer sees maintaining this software as an important part of my job, but I understand that that is not true for a lot of fellow software engineers and fellow bioinformaticians, so I am privileged in that respect, and I'm trying to maintain it. But the long-term answer to this is not clear. There needs to be more funding for bioinformatics software maintenance, and I was recently quite excited to hear about the new Chan Zuckerberg Initiative grant system, so they will be having
three rounds a year of grants for software maintenance. So this is not money to write new software; it has to be software that's already out there and has a proven user base. They are offering grants of $50,000 to $250,000, with three rounds a year and, I think, about four to six grants per round, specifically for key tools in computational biology. So I'm investigating that as a possible way to help fund Prokka, a next version of Prokka, and other tools. Okay, well, we've actually been talking for quite a while, so let's sum up what we've been discussing, some of the main take-homes in terms of writing bioinformatics software. Yeah, I guess the first thing is to just make it as easy as possible to install. Every problem that people encounter installing a piece of software is going to halve your number of users, right? Just like putting equations in talks: every equation halves your audience, apparently, and I think the same thing applies to installation problems. Documentation: mine is not always perfect, but it's usually enough to get started and understand what's going on. So keep it clear and simple; if it's 100 pages and you can't figure out how to find the basic things you need to do, then that's almost worse than having no documentation. I'm a big believer in putting my software out there before I publish the paper. The community will find problems that you never even thought of, and I'm scared that if I publish first, I will then forget about my software and not care as much.
We've all seen that before: something's been published, the final publication, you go to the GitHub site, the code was added six months ago, there have been no changes since, and there are no issues, or issues that have never been responded to. Unfortunately, that's probably a student or a postdoc who wrote something, got a paper, and then left, so that software will never be used, and it just feels like a waste of effort. A common excuse for not doing this is that people are scared of putting their code out; they feel embarrassed. People have made fun of my monolithic Perl code, and they're right to make a bit of fun of it, but ultimately I've learned not to be embarrassed by it. It works; the proof is in the pudding, and the design behind it doesn't really matter too much on smaller software projects. Bigger ones, with large numbers of people involved, sure, you should probably apply some good software engineering principles, but my tools are usually pretty small, and surprisingly I've had good feedback from people who have read my monolithic Prokka Perl script and said it's so easy to understand: I can follow all the steps, you've commented each section, and I know how to alter it. So there's something to be said for that linear flow that's easy to walk through, rather than jumping between modules and not knowing where you are in the process. And most of all, if you write good software, I've found that a lot of people buy me beer when I travel around the world, so that is an incentive: free beer. So we owe you a lot of beers, then? Yeah, I don't expect to buy another beer for the next five years; well, that's the dream anyway. Absolutely. Well, thank you very much, Torsten, for this amazing chat.
Yeah, thank you, Torsten. So this has been Torsten Seemann, with Andrew Page and myself, Nabeel Ali Khan, for the microbial bioinformatics podcast, just signing off; we'll catch you next time. Thanks guys, had a great time chatting with you. Thank you all so much for listening to us at home. If you like this podcast, please subscribe and like us on iTunes or Google Play, and if you don't like the podcast, please don't do anything. This podcast was recorded by the microbial bioinformatics group. The opinions expressed here are our own and do not necessarily reflect the views of the CDC or the Quadram Institute.