Hello, and thank you for listening to the MicroBinfie podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Both Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work on microbes in food and their impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the US. Hello and welcome to the MicroBinfie podcast. In this episode, we'll be continuing to talk about the sustainability of bioinformatics software. We previously talked about the major issues that make it difficult for the community to take over software projects, which is what's implied by the way we approach development and funding. Let's first start off with the methods for making a sustainable tool. I think the first critical thing is documentation. I know I can sometimes be a bit guilty of a lack of documentation, because I'm itching to get a piece of software written, out the door, published, whatever, and then documentation comes last. But actually, we should be doing a lot more documentation, not just the usual documentation for developers, but also for an ordinary user: simple things like parameters, what they do, how they work, what the input format is, what the output format is, that kind of thing. I think documentation is the number one category, for sure, for making a tool sustainable. And I agree with you. All it takes is a usage statement or a readme, and you've made your project 1,000 times better. And you can go on from there; people are using Read the Docs or other things.
I think a toy example as part of the documentation really helps communicate exactly what people can expect as well. Often, I use those to reverse engineer how the program works. Or actually, the worst thing I find is having to go into the code to read it, to figure out how it works, to figure out what the input format is. I had to do that yesterday, trying to figure out what columns are expected in a spreadsheet by looking at the code to work out the column names, and then trying to guess what they meant by those, which isn't ideal. It's what you've got to do sometimes with academic software. So having some nice documentation would be great. And someone I think who does this really, really well is Ryan Wick over in Melbourne. He does really good documentation, and he does really good verbose output for his tools when they're running. I suppose he comes from a more professional software engineering background, so that's to be expected. But I wish every bioinformatics tool could be like that. There's one package that I really like that's available across multiple languages, called docopt. Instead of making you use getopt inside your package, it forces you to write the usage statement, it parses your usage statement, and then it actually creates the options inside the program for you. So it's a documentation-focused package that I encourage people to use. That's pretty cool. So I'm guilty. I come from a software engineering background, so I maybe approach software slightly differently to other people in that, say, if I'm doing object-oriented programming, I like to have lots and lots of files, lots of classes, unit tests. People complain, oh, why do you have 50 files in a simple little software package? I could squeeze that into one script with 300 lines.
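To illustrate the idea behind docopt, that the usage text itself drives option parsing, here is a minimal stdlib-only sketch. This is not the real docopt library (which is far more thorough); the tool name and options below are made up for illustration.

```python
import re

# A docopt-style usage string: the documentation IS the specification.
USAGE = """Usage:
  assemble.py --reads <reads.fastq> [--threads <n>] [--verbose]
"""

def parse_usage(usage):
    """Extract the long-option names declared in the usage string."""
    return re.findall(r"--(\w+)", usage)

def parse_args(usage, argv):
    """Tiny parser: only options listed in the usage string are accepted."""
    known = parse_usage(usage)
    args = {}
    it = iter(argv)
    for token in it:
        name = token.lstrip("-")
        if name not in known:
            raise SystemExit(usage)  # unknown option: print usage and exit
        # in this sketch, --verbose is a flag; the other options take a value
        args[name] = True if name == "verbose" else next(it)
    return args

print(parse_args(USAGE, ["--reads", "sample.fastq", "--threads", "4"]))
# {'reads': 'sample.fastq', 'threads': '4'}
```

The appeal of the approach is that the usage statement can never drift out of sync with the options the program actually accepts, because one is generated from the other.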
Well, it is sometimes better to be a little bit more verbose and spread things out, because simply by spreading it out and naming things, this file contains everything to do with a spreadsheet, and this file contains everything to do with parsing a spreadsheet, you can get across a lot of information just by how you break up your code. And it's easier to break up code in advance than it is to do it later on. And if you have a huge blob of code, just line upon line, not documented in any way, it can be quite hard for someone to come in and fix bugs and whatnot, whereas if it's broken up, then it's a lot easier. Personally, I don't like any methods over 10 lines of actual code. Once you get over that, then you should probably think about breaking it up into different methods. Also, what people sometimes forget is that for every programming language, there is usually a coding convention or a style guide, and you should stick to that, because it helps you greatly. Python has PEP 8, which sets things like how big the indent should be and how you should wrap your code. All these different things are written down for you. And the same with C, C++, everything has style guides that you should go with. And it does help you a lot, and helps other people who come along and read your code later on. You know, one of my pet hates is where people use CamelCase and, I'm not sure what to call it. Snake case. Snake case, yeah. But they use it wrong. And that's just annoying. You know, try and keep consistent for the language you're using. Also, don't call your variables names like Andrew1. There are very specific cases where you might use a single letter as a variable, and that's usually in a loop, say, as a counter, related to mathematical functions. You know, you might use i as a counter, that kind of thing. And that's pretty standard.
But in most cases, if you've got obscure names, no one's going to understand what you're doing. I like to have nice descriptive ones. So yeah, that'll help make your software a bit better. Oh, I think a few years ago, the people behind the game Doom came out with their style guide. And you can read it through, and it is actually really in-depth and awesome. And it's helped me out a little bit. It says things like put spaces around your equal signs, and your functions cannot modify variables in place, they must return things. It was a really cool style guide, actually. But that's really important, because so often people do crazy things like that, where you don't return an object, or you modify things in a hidden way. And that's just a pain to understand and to debug later on. Actually, one of my pet peeves is versioning, or rather where people don't version. And it's not that hard, you know, come on, add a little number, increment it when you're making a major change, minor change, whatever. You should be doing it. Because in a year's time, when someone comes back and says, oh yeah, I ran your code, I've got this answer, how come your new code isn't producing the same answer? You need to be able to say, OK, well, you used version 1.2.3, I can go back into my Git repository or my source control, pull it out, rerun it, and get the same answer again. You know, you should always be able to do that for reproducible, open science. Do you guys use, what is it, semver.org? There's a whole document on how to actually version your stuff. Oh, semantic versioning. Oh, yeah, the X.Y.Z, yeah, yeah. So for people who don't know, the first number is the major version. If you change this, it means you've broken the API for your code. So you've made something that's destructive. Maybe you've rewritten the interface, you've changed the options, you've changed the behavior. And then the next number is, so then it's like 1.2.
The next number is where you've made a major feature change, but you haven't broken the API. So maybe you've added some extra stuff. And then the final number is a patch or bug fix, where you've just maybe, you know, fixed a typo, that kind of thing. And this allows someone to see at a glance just how big a change has happened between the different versions. On the sustainability front: okay, documentation is good for the public to learn about your tool and understand things. But then we have this whole other level of institutional knowledge. So you have a piece of software, let's say SAMtools, since we mentioned it last time. And yes, you can go into the documentation and learn about it. But what if your organization has a whole process of how to QC the data and do this to the data and do that to the data? And some part of the process needs to use SAMtools. Well, you can codify in a document how you actually go from step A to step B. And a new person in your lab, or somebody who's been there for years who wants to pick it up again, basically has this whole instruction manual on how to do all the things, including that software. It also helps for things like validation in public health: if you've documented all the processes and the version numbers, people know that things aren't going to change and the results are going to be the same as well. So that is actually an important step in going from academic software, which is a bit iffy, to something that can be used for real decisions, real clinical decisions, or for public health surveillance, that kind of thing. And that is where we need to go ultimately. And beyond public health and major institutions, it's still important for our academic applications, where we want to go back to an old manuscript and reassess the data set that's presented; if you don't know the versions and you can't reconstruct exactly how they did it, you cannot reproduce the same result.
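The major.minor.patch scheme just described (as laid out at semver.org) is simple enough to reason about in code. Here is a minimal sketch of parsing and comparing versions the way a pipeline might when checking for breaking changes; the function names are illustrative, not from any particular library.

```python
def parse_semver(version):
    """Split a 'major.minor.patch' string into a tuple of ints."""
    major, minor, patch = version.split(".")
    return (int(major), int(minor), int(patch))

def breaking_change(old, new):
    """Under semantic versioning, a major-version bump signals a breaking API change."""
    return parse_semver(new)[0] > parse_semver(old)[0]

print(parse_semver("1.2.3"))               # (1, 2, 3)
print(breaking_change("1.2.3", "1.3.0"))   # False: new feature, API intact
print(breaking_change("1.2.3", "2.0.0"))   # True: expect breakage
```

Note that comparing as integer tuples matters: as plain strings, "0.9.9" would wrongly sort after "0.10.2".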
So you'll rerun it through an assembly program and you'll get a completely different answer, and you won't have any idea why. Another thing is that you should always have a clear license on your software, because if someone else is going to use it and extend it and whatever, they really want to know: can they make a change, are they allowed to distribute it, are they allowed to sell it? All of these things are quite important, and in some cases people aren't allowed to use software. Maybe it has a license which says no commercial use, and you work in a commercial company; that's obviously a problem. So have a clear license. All too often people don't put any license on whatsoever, they just put it in the public domain and say, oh yeah, it's open source software. But no, that's not enough. You need to be explicit and say, okay, I've got the BSD license, I've got the MIT license, I've got the GPL license. Personally, for all my software I use GPL version 3 out of habit, but I know other people use BSD, which allows for commercial use much more broadly, and it's ultimately down to your employer to decide what type of license you can apply, but you should always have one there. Totally agree. Yeah, there are some fun licenses, like the beerware license or the postcard license, where, yeah, the license is basically you can do whatever you want, but if it helps you, please send us a postcard or buy me a beer when you see me. One other thing that people usually add to software projects is a contributing document, which outlines how people can help with the development of the software: what aspects, what features, or what they need to do that would really help maintain it and keep things going. Because often when you look at a package and you want to use it and you think it's great and you'd rather not make a new one, there's no information on how you get involved and start building a community around that tool.
And ultimately that will help you, because you get free features and free bug reports or bug fixes for your code. Yeah, so a few things that you guys mentioned were validation tests, or just seeing if you can take apart a paper and redo what the authors did. I think that's an important practice that I've found over the last few years called unit testing, and I think this is like a 101 for software engineers like you guys, but it was mind-blowing. Every single thing I put in my software now, I make sure that there's a separate script to test the thing and make sure it works, and sometimes it breaks and it tells me really quickly what went wrong after I developed something. And that has really given me confidence, and other people confidence. And going to our big picture of sustainable software, it's essentially validation of your software, which we all need. You know, if you make a change, you have to revalidate it quickly, and you don't want to go around testing 20 different things manually. Yeah, unit testing, and there's a lot of different types of testing you can do. It's a whole industry in its own right. There's integration testing, there's behavior-driven development, where you test only the things that people are going to use and you don't really care about the stuff in the middle. And yeah, so testing is great. Actually, I did a degree in software engineering and we had one little module on testing in four years, and I thought, okay, this is nice, but we only got a flavor of it. And actually I think that's probably one of the most important modules of the entire degree. Testing is great, but as a rule of thumb, for doing proper software engineering, you should be spending about half your time writing tests and the other half writing code. You know, if you're only writing a little bit of tests on the side, well, then that's a problem.
You can also do test-driven development, which is where you write the tests and then you write the code. So your test should be failing, and then you write some code to make it pass, and so it's this kind of constant going back and forth. It does make you write good code and think about what you're going to do before you do it, because you're thinking of the tests and you're designing it all out in advance. And that is actually what I was doing today. I was writing some code and I was writing the tests first, so I'm quite proud of it. And it also means that you then cover all of the edge cases which you wouldn't normally cover, and ultimately I think it speeds up the development process, even though you're spending a lot of time and effort writing tests. And actually the effort is usually in handling obscure formats to get the data in and out. You know, you might have to mock things out, which is where you build a fake object and populate it to run through a test and then get stuff out the other end. That's quite a lot of work, but it always pays off, and the more tests you write, the better your software is going to be. Are you convinced? Wow. Well, I legitimately learned a lot more from you just now. I just did not know it was a whole industry, and that whole test-oriented software development. That is wonderful. So a unit test is the most granular type of testing. It's where you're testing one method: just, what goes in, does the right thing come out? But then you can test at the class level, you can test at the functional level; there are so many different ways you can test. You can even test, if you have a website: I click on the login button, type in my login details and click login, then I go to this page, I put a product in my cart and click buy. You can test all of those different things end-to-end in one test, and there are some lovely descriptive languages for it.
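As a concrete sketch of the unit-testing style described above, here is a small example using Python's built-in unittest module. The gc_content function and its edge cases are made up for illustration; in TDD you would write the test class first and then make it pass.

```python
import unittest

def gc_content(seq):
    """Fraction of G/C bases in a DNA sequence (case-insensitive)."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

class TestGcContent(unittest.TestCase):
    def test_all_gc(self):
        self.assertEqual(gc_content("GGCC"), 1.0)

    def test_mixed_case(self):
        self.assertEqual(gc_content("atGC"), 0.5)

    def test_empty_is_an_error(self):
        # the edge case: writing this test first forces the code to handle it
        with self.assertRaises(ValueError):
            gc_content("")

# run with: python -m unittest <this_file>
```

Each test exercises one behavior of one function, so when a change breaks something, the failing test name tells you immediately what went wrong and where.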
I haven't used them in years, because I haven't been in web programming in years, but it's Cucumber and things like that. That's back when I was a Ruby on Rails developer. But yeah, there's a lot of different things you can do, and it's great that you're getting into testing, and I hope you'll write a lot more tests after this. Absolutely. Let's say that you've written a fantastic sustainable tool with all of the different elements we've been discussing. Now you want to convince people that this tool you've written is in use and important, and you want people to maybe pay you, or at least give you the time, to keep on doing all of these wonderful things like developing new features, adding tests, and writing new documentation. What are some of the ways we can come up with to measure that impact, moving away from just the standard measurement of the number of citations for a paper? Well, there's the old classic, number of downloads. And if you have packaged up your software for something like Conda or Galaxy, they will also give you little badges telling you how many times people have installed or downloaded the software there. One of the problems with open source software is of course that there's no centralized control, and so you can't always get an exact number. GitHub will give you an idea of the number of downloads or clones, but it may be decentralized, and people can redistribute the software themselves if they feel like it. So it can be difficult. But one of the biggest measures is how are people using it, who's using it, where are they using it, how has it helped their research? Has someone gone and used it and written this amazing Nature paper? If they have, well then, great. How many different institutions are using it? Is it spread globally, or is it just a local thing? Is every user just in the east of England, or are they in the US and all around the world?
I think at the point where you feel like you have a good version and you're in good standing, it's a good time to reach out, if you haven't already, and see who's using it. It can be a literature search, or a survey of people, or if your software is website-based, scrape that data and see where those IP addresses come from; it gives you a general sense. On the whole analytics thing, I think Google has fantastic analytics, actually, if you need to use it and don't mind them spying on you. I've used things like GitHub stars, and how many people are watching a project, and who's watching it. I mean, sometimes I might have just a few people watching my project, but one time one of the people behind BioPerl started watching one of my projects, and I was like, oh, that's a really important person to me. I think old-fashioned mailing lists can also help: put out a survey and try to capture users, encourage users to sign up to a mailing list, give some basic idea of where they're from and what they may be doing. And that helps when you need to go back to someone and say, well, this is who's using it, and this is potentially what they're using it for. Yeah, I think we have forgotten mailing lists and things like that. That's a really good, almost boots-on-the-ground technique, but it's really good. So let's say that you've got your fantastic tool and you've got these metrics, and you've tried to convince people to fund you and keep it going, and you can't, you can't get external funding to keep it going. What options are there for you? Are we starting to look at commercial software, user-pays models? I mean, I think user-pays is a good way to keep software going, assuming that you can convince people to pay for it and you've got an income stream from that. But academics do tend to avoid paying for software, and seem to actively avoid it.
Yeah, often academics will say, this is an open source package, so why should I pay for support? And I know I've raised this sometimes with people from commercial companies who've asked for support for old software applications, and they don't even like paying for it either. You know, they expect, it's open source, you should get support for free. But it takes time and effort, and if it takes an hour or two or three of my time in the evening or on the weekend, well, you know, I'm not going to give up my family time to do all that for free. If you're a commercial company, you should pay for keeping that going. But in terms of whether we should convince people to pay, yeah, there are some models out there, like CLIMB Big Data, where they're transitioning from a totally free and open model to one where people make a small contribution, and those add up to continue providing a reasonable service. When I think of this stuff, I think of the Ubuntu model: okay, download the whole Ubuntu operating system for free, but if you want actual technical help, then you have to pay for their subscription model. And I think that's really legitimate. I think the other distributions of Linux have something similar too. I was just going to say, yes, that's the standard approach adopted by Red Hat, but it relies on having enough of the few people who do pay for support to subsidize the open source component as well. I'm not sure if it's feasible for smaller projects. One thing, I suppose, if you don't go down the commercial route yourself, is that a commercial software provider can come in and do the validation and accreditation that you maybe need to move academic software from academia into public health or into the clinic.
And a lot of what we do in microbial genomics and bioinformatics, we want to push down so that people can use these things in a doctor's surgery, maybe get test results in five minutes rather than in days or weeks. We know that a lot of this stuff is possible, but it's only when you start getting it out to a wider audience that it actually becomes exponentially more useful to everybody. And maybe that is something where we need commercial providers to build platforms for us, because you can't have a bioinformatician in every doctor's surgery in the country, or in the world; that's not going to happen. They will just want a single button that you push, and they know it's validated to work every time. It's not going to crash with some obscure error, it'll give accurate results, it has a lot of controls, all of these kinds of things. And that is something that an academic can't necessarily do, but a commercial software provider can. So maybe this can be a question to the community as well: how have people seen themselves moving from academia, from free and open source, to commercial, if that's the way they want to go? Because I think that's a really good open question. Although don't overprice it, because, you know, I'm not going to pay for it. Yeah. Well, the prices of some of these things are absolutely insane. You know, it can be hundreds of thousands of pounds, even for things that sound trivial. But they always price them at just below the level where you would think about hiring someone for a year. You know, hiring a postdoc for a year will cost you a hundred grand, and so they'll price it at maybe 80 grand, because they know that you're not going to hire a developer for a year to do it yourself. You'll just pay for it. I would say it's very similar in the US. You might get a software package that's just at that level, and you have the choice to either hire a person for a year or buy the thing.
And I think it's probably a smart price point from the commercial entity, but it makes us, unless it's in your budget. I know, it gives us a tough choice. Actually, this isn't really software, but this kind of stuff magically happens all the time. I had a hardware purchase, and the company gave us a quote, and then I went back and said, well, our budget is really this. And then magically the price was one dollar less than the budget. It's always just under the budget. Yeah, yeah. So then we were able to purchase it, and that's just how that particular thing went. We did get a discount from what it actually costs, but the company fit it to the actual budget. So maybe we've talked about different aspects of how someone can build a sustainable software tool, how they can convince people to support it, and some of the metrics that are available to them. And we also touched on commercial software and how that can be implemented so that software can persist longer than the stint of one person, longer than a three-year contract or something like that. It's clear that there needs to be a shift to longer-term plans for funding critical pieces of software, whether that be through a user-pays model or from the funders themselves. Not everything is going to work out where you just build it and they will come. That doesn't seem to work; we've seen those fail time and time again. And yeah, so hopefully some of the discussion here is helpful and interesting, and maybe it can generate ideas and make a change in the community. Thank you all so much for listening to us at home. If you like this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group and edited by Nick Waters.
The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.