Hello, and thank you for listening to the MicroBinfie podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Both Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work on microbes in food and their impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the US. Hello and welcome to the MicroBinfie podcast. In this episode, we'll be continuing to talk about the sustainability of bioinformatics software. We previously talked about the major issues that make it difficult for the community to take over software projects, which is what's implied by the way we approach development and funding. Let's first start off with the methods for making a sustainable tool. I think the first critical thing is documentation. I know I can sometimes be a bit guilty of a lack of documentation, because I'm itching to get a piece of software written, out the door, published, whatever, and then documentation comes last. But actually, we should be doing a lot more documentation, not just the usual documentation for developers, but also for an ordinary user: simple things like parameters, what they do, how they work, what the input format is, what the output format is, that kind of thing. I think documentation is the number one category, for sure, for making a tool sustainable. And I agree with you. All it takes is a usage statement or a readme, and you've made your project 1,000 times better. And you can go on from there; people are using Read the Docs or other things.
I think a toy example as part of the documentation really helps communicate exactly what people can expect as well. Often, I use those to reverse engineer how the program works. Or actually, the worst thing I find is having to go into the code to read it, to figure out how it works, to figure out what the input format is. I had to do that yesterday, trying to figure out what columns are expected in a spreadsheet by looking at the code to work out the column names, and then trying to guess what they meant by those, which isn't ideal. It's what you've got to do sometimes with academic software. So having some nice documentation would be great. And someone I think who does this really, really well is Ryan Wick over in Melbourne. He does really good documentation, and he does really good verbose output for his tools when they're running. I suppose he comes from a more professional software engineering background, so that's to be expected. But I wish every bioinformatics tool could be like that. There's one package that I really like that's available across multiple languages, called docopt. Instead of making you use getopt inside your package, it forces you to write the usage statement, it parses your usage statement, and then it actually creates the options inside the program for you. So it's a documentation-focused package that I encourage people to use. That's pretty cool. So I'm guilty. I come from a software engineering background, so I maybe approach software slightly differently to other people in that, say, if I'm doing object-oriented programming, I like to have lots and lots of files, lots of classes, unit tests. People complain, oh, why do you have 50 files in a simple little software package? I could squeeze that into one script with 300 lines.
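To illustrate the idea behind docopt, that the usage text itself drives option parsing, here is a minimal stdlib-only sketch. This is not the real docopt library (which is far more thorough); the tool name and options below are made up for illustration.

```python
import re

# A docopt-style usage string: the documentation IS the specification.
USAGE = """Usage:
  assemble.py --reads <reads.fastq> [--threads <n>] [--verbose]
"""

def parse_usage(usage):
    """Extract the long-option names declared in the usage string."""
    return re.findall(r"--(\w+)", usage)

def parse_args(usage, argv):
    """Tiny parser: only options listed in the usage string are accepted."""
    known = parse_usage(usage)
    args = {}
    it = iter(argv)
    for token in it:
        name = token.lstrip("-")
        if name not in known:
            raise SystemExit(usage)  # unknown option: print usage and exit
        # in this sketch, --verbose is a flag; the other options take a value
        args[name] = True if name == "verbose" else next(it)
    return args

print(parse_args(USAGE, ["--reads", "sample.fastq", "--threads", "4"]))
# {'reads': 'sample.fastq', 'threads': '4'}
```

The appeal of the approach is that the usage statement can never drift out of sync with the options the program actually accepts, because one is generated from the other.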
Well, it is sometimes better to be a little bit more verbose and spread things out, because simply by spreading it out and naming things, this file contains everything to do with a spreadsheet, and this file contains everything to do with parsing a spreadsheet, you can get across a lot of information just by how you break up your code. And it's easier to break up code in advance than it is to do it later on. And if you have a huge blob of code, just line upon line, not documented in any way, it can be quite hard for someone to come in and fix bugs and whatnot, whereas if it's broken up, then it's a lot easier. Personally, I don't like any methods over 10 lines of actual code. Once you get over that, then you should probably think about breaking it up into different methods. Also, what people sometimes forget is that for every programming language, there is usually a coding convention or a style guide, and you should stick to that, because it helps you greatly. Python has PEP 8, which sets things like how big the indent should be and how you should wrap your code. All these different things are written down for you. And the same with C, C++, everything has style guides that you should go with. And it does help you a lot, and helps other people who come along and read your code later on. You know, one of my pet hates is where people use CamelCase and, I'm not sure what to call it. Snake case. Snake case, yeah. But they use it wrong. And that's just annoying. You know, try and keep consistent for the language you're using. Also, don't call your variables names like Andrew1. There are very specific cases where you might use a single letter as a variable, and that's usually in a loop, say, as a counter, related to mathematical functions. You know, you might use i as a counter, that kind of thing. And that's pretty standard.
But in most cases, if you've got obscure names, no one's going to understand what you're doing. I like to have nice descriptive ones. So yeah, that'll help make your software a bit better. Oh, I think a few years ago, the people behind the game Doom came out with their style guide. And you can read it through, and it is actually really in-depth and awesome. And it's helped me out a little bit. It says things like put spaces around your equal signs, and your functions cannot modify variables in place, they must return things. It was a really cool style guide, actually. But that's really important, because so often people do crazy things like that, where you don't return an object, or you modify things in a hidden way. And that's just a pain to understand and to debug later on. Actually, one of my pet peeves is versioning, or rather where people don't version. And it's not that hard, you know, come on, add a little number, increment it when you're making a major change, minor change, whatever. You should be doing it. Because in a year's time, when someone comes back and says, oh yeah, I ran your code, I've got this answer, how come your new code isn't producing the same answer? You need to be able to say, OK, well, you used version 1.2.3, I can go back into my Git repository or my source control, pull it out, rerun it, and get the same answer again. You know, you should always be able to do that for reproducible, open science. Do you guys use, what is it, semver.org? There's a whole document on how to actually version your stuff. Oh, semantic versioning. Oh, yeah, the X.Y.Z, yeah, yeah. So for people who don't know, the first number is the major version. If you change this, it means you've broken the API for your code. So you've made something that's destructive. Maybe you've rewritten the interface, you've changed the options, you've changed the behavior. And then the next number is, so then it's like 1.2.
The next number is where you've made a major feature change, but you haven't broken the API. So maybe you've added some extra stuff. And then the final number is a patch or bug fix, where you've just maybe, you know, fixed a typo, that kind of thing. And this allows someone to see at a glance just how big a change has happened between the different versions. On the sustainability front: okay, documentation is good for the public to learn about your tool and understand things. But then we have this whole other level of institutional knowledge. So you have a piece of software, let's say SAMtools, since we mentioned it last time. And yes, you can go into the documentation and learn about it. But what if your organization has a whole process of how to QC the data and do this to the data and do that to the data? And some part of the process needs to use SAMtools. Well, you can codify in a document how you actually go from step A to step B. And a new person in your lab, or somebody who's been there for years who wants to pick it up again, basically has this whole instruction manual on how to do all the things, including that software. It also helps for things like validation in public health: if you've documented all the processes and the version numbers, people know that things aren't going to change and the results are going to be the same as well. So that is actually an important step in going from academic software, which is a bit iffy, to something that can be used for real decisions, real clinical decisions, or for public health surveillance, that kind of thing. And that is where we need to go ultimately. And beyond public health and major institutions, it's still important for our academic applications, where we want to go back to an old manuscript and reassess the data set that's presented; if you don't know the versions and you can't reconstruct exactly how they did it, you cannot reproduce the same result.
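The major.minor.patch scheme just described (as laid out at semver.org) is simple enough to reason about in code. Here is a minimal sketch of parsing and comparing versions the way a pipeline might when checking for breaking changes; the function names are illustrative, not from any particular library.

```python
def parse_semver(version):
    """Split a 'major.minor.patch' string into a tuple of ints."""
    major, minor, patch = version.split(".")
    return (int(major), int(minor), int(patch))

def breaking_change(old, new):
    """Under semantic versioning, a major-version bump signals a breaking API change."""
    return parse_semver(new)[0] > parse_semver(old)[0]

print(parse_semver("1.2.3"))               # (1, 2, 3)
print(breaking_change("1.2.3", "1.3.0"))   # False: new feature, API intact
print(breaking_change("1.2.3", "2.0.0"))   # True: expect breakage
```

Note that comparing as integer tuples matters: as plain strings, "0.9.9" would wrongly sort after "0.10.2".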
So you'll rerun it through an assembly program and you'll get a completely different answer, and you won't have any idea why. Another thing is that you should always have a clear license on your software, because if someone else is going to use it and extend it and whatever, they really want to know: can they make a change, are they allowed to distribute it, are they allowed to sell it? All of these things are quite important, and in some cases people aren't allowed to use software. Maybe it has a license which says no commercial use, and you work in a commercial company; that's obviously a problem. So have a clear license. All too often people don't put any license on whatsoever, they just put it in the public domain and say, oh yeah, it's open source software. But no, that's not enough. You need to be explicit and say, okay, I've got the BSD license, I've got the MIT license, I've got the GPL license. Personally, for all my software I use GPL version 3 out of habit, but I know other people use BSD, which allows for commercial use much more broadly, and it's ultimately down to your employer to decide what type of license you can apply, but you should always have one there. Totally agree. Yeah, there are some fun licenses, like the beerware license or the postcard license, where, yeah, the license is basically you can do whatever you want, but if it helps you, please send us a postcard or buy me a beer when you see me. One other thing that people usually add to software projects is a contributing document, which outlines how people can help with the development of the software: what aspects, what features, or what they need to do that would really help maintain it and keep things going. Because often when you look at a package and you want to use it and you think it's great and you'd rather not make a new one, there's no information on how you get involved and start building a community around that tool.
And ultimately that will help you, because you get free features and free bug reports or bug fixes for your code. Yeah, so a few things that you guys mentioned were validation tests, or just seeing if you can take apart a paper and redo what the authors did. I think that's an important practice that I've found over the last few years called unit testing, and I think this is like a 101 for software engineers like you guys, but it was mind-blowing. Every single thing I put in my software now, I make sure that there's a separate script to test the thing and make sure it works, and sometimes it breaks and it tells me really quickly what went wrong after I developed something. And that has really given me confidence, and other people confidence. And going to our big picture of sustainable software, it's essentially validation of your software, which we all need. You know, if you make a change, you have to revalidate it quickly, and you don't want to go around testing 20 different things manually. Yeah, unit testing, and there's a lot of different types of testing you can do. It's a whole industry in its own right. There's integration testing, there's behavior-driven development, where you test only the things that people are going to use and you don't really care about the stuff in the middle. And yeah, so testing is great. Actually, I did a degree in software engineering and we had one little module on testing in four years, and I thought, okay, this is nice, but we only got a flavor of it. And actually I think that's probably one of the most important modules of the entire degree. Testing is great, but as a rule of thumb, for doing proper software engineering, you should be spending about half your time writing tests and the other half writing code. You know, if you're only writing a little bit of tests on the side, well, then that's a problem.
You can also do test-driven development, which is where you write the tests and then you write the code. So your test should be failing, and then you write some code to make it pass, and so it's this kind of constant going back and forth. It does make you write good code and think about what you're going to do before you do it, because you're thinking of the tests and you're designing it all out in advance. And that is actually what I was doing today. I was writing some code and I was writing the tests first, so I'm quite proud of it. And it also means that you then cover all of the edge cases which you wouldn't normally cover, and ultimately I think it speeds up the development process, even though you're spending a lot of time and effort writing tests. And actually the effort is usually in handling obscure formats to get the data in and out. You know, you might have to mock things out, which is where you build a fake object and populate it to run through a test and then get stuff out the other end. That's quite a lot of work, but it always pays off, and the more tests you write, the better your software is going to be. Are you convinced? Wow. Well, I legitimately learned a lot more from you just now. I just did not know it was a whole industry, and that whole test-oriented software development. That is wonderful. So a unit test is the most granular type of testing. It's where you're testing one method: just, what goes in, does the right thing come out? But then you can test at the class level, you can test at the functional level; there are so many different ways you can test. You can even test, if you have a website: I click on the login button, type in my login details and click login, then I go to this page, I put a product in my cart and click buy. You can test all of those different things end-to-end in one test, and there are some lovely descriptive languages for it.
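As a concrete sketch of the unit-testing style described above, here is a small example using Python's built-in unittest module. The gc_content function and its edge cases are made up for illustration; in TDD you would write the test class first and then make it pass.

```python
import unittest

def gc_content(seq):
    """Fraction of G/C bases in a DNA sequence (case-insensitive)."""
    if not seq:
        raise ValueError("empty sequence")
    seq = seq.upper()
    return (seq.count("G") + seq.count("C")) / len(seq)

class TestGcContent(unittest.TestCase):
    def test_all_gc(self):
        self.assertEqual(gc_content("GGCC"), 1.0)

    def test_mixed_case(self):
        self.assertEqual(gc_content("atGC"), 0.5)

    def test_empty_is_an_error(self):
        # the edge case: writing this test first forces the code to handle it
        with self.assertRaises(ValueError):
            gc_content("")

# run with: python -m unittest <this_file>
```

Each test exercises one behavior of one function, so when a change breaks something, the failing test name tells you immediately what went wrong and where.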
I haven't used them in years, because I haven't been in web programming in years, but it's Cucumber and things like that. That's back when I was a Ruby on Rails developer. But yeah, there's a lot of different things you can do, and it's great that you're getting into testing, and I hope you'll write a lot more tests after this. Absolutely. Let's say that you've written a fantastic sustainable tool with all of the different elements we've been discussing. Now you want to convince people that this tool you've written is in use and important, and you want people to maybe pay you, or at least give you the time, to keep on doing all of these wonderful things like developing new features, adding tests, and writing new documentation. What are some of the ways we can come up with to measure that impact, moving away from just the standard measurement of the number of citations for a paper? Well, there's the old classic, number of downloads. And if you have packaged up your software for something like Conda or Galaxy, they will also give you little badges telling you how many times people have installed or downloaded the software there. One of the problems with open source software is of course that there's no centralized control, and so you can't always get an exact number. GitHub will give you an idea of the number of downloads or clones, but it may be decentralized, and people can redistribute the software themselves if they feel like it. So it can be difficult. But one of the biggest measures is how are people using it, who's using it, where are they using it, how has it helped their research? Has someone gone and used it and written this amazing Nature paper? If they have, well then, great. How many different institutions are using it? Is it spread globally, or is it just a local thing? Is every user just in the east of England, or are they in the US and all around the world?
I think at the point where you feel like you have a good version and you're in good standing, it's a good time to reach out, if you haven't already, and see who's using it. It can be a literature search, or a survey of people, or if your software is website-based, scrape that data and see where those IP addresses come from; it gives you a general sense. On the whole analytics thing, I think Google has fantastic analytics, actually, if you need to use it and don't mind them spying on you. I've used things like GitHub stars, and how many people are watching a project, and who's watching it. I mean, sometimes I might have just a few people watching my project, but one time one of the people behind BioPerl started watching one of my projects, and I was like, oh, that's a really important person to me. I think old-fashioned mailing lists can also help: put out a survey and try to capture users, encourage users to sign up to a mailing list, give some basic idea of where they're from and what they may be doing. And that helps when you need to go back to someone and say, well, this is who's using it, and this is potentially what they're using it for. Yeah, I think we have forgotten mailing lists and things like that. That's a really good, almost boots-on-the-ground technique, but it's really good. So let's say that you've got your fantastic tool and you've got these metrics, and you've tried to convince people to fund you and keep it going, and you can't, you can't get external funding to keep it going. What options are there for you? Are we starting to look at commercial software, user-pays models? I mean, I think user-pays is a good way to keep software going, assuming that you can convince people to pay for it and you've got an income stream from that. But academics do tend to avoid paying for software, and seem to actively avoid it.
Yeah, often academics will say, this is an open source package, so why should I pay for support? And I know I've raised this sometimes with people from commercial companies who've asked for support for old software applications, and they don't even like paying for it either. You know, they expect, it's open source, you should get support for free. But it takes time and effort, and if it takes an hour or two or three of my time in the evening or on the weekend, well, you know, I'm not going to give up my family time to do all that for free. If you're a commercial company, you should pay for keeping that going. But in terms of whether we should convince people to pay, yeah, there are some models out there, like CLIMB Big Data, where they're transitioning from a totally free and open model to one where people make a small contribution, and those add up to continue providing a reasonable service. When I think of this stuff, I think of the Ubuntu model: okay, download the whole Ubuntu operating system for free, but if you want actual technical help, then you have to pay for their subscription model. And I think that's really legitimate. I think the other distributions of Linux have something similar too. I was just going to say, yes, that's the standard approach adopted by Red Hat, but it relies on having enough of the few people who do pay for support to subsidize the open source component as well. I'm not sure if it's feasible for smaller projects. One thing, I suppose, if you don't go down the commercial route yourself, is that a commercial software provider can come in and do the validation and accreditation that you maybe need to move academic software from academia into public health or into the clinic.
And a lot of what we do in microbial genomics and bioinformatics, we want to push down so that people can use these things in a doctor's surgery, maybe get test results in five minutes rather than in days or weeks. We know that a lot of this stuff is possible, but it's only when you start getting it out to a wider audience that it actually becomes exponentially more useful to everybody. And maybe that is something where we need commercial providers to build platforms for us, because you can't have a bioinformatician in every doctor's surgery in the country, or in the world; that's not going to happen. They will just want a single button that you push, and they know it's validated to work every time. It's not going to crash with some obscure error, it'll give accurate results, it has a lot of controls, all of these kinds of things. And that is something that an academic can't necessarily do, but a commercial software provider can. So maybe this can be a question to the community as well: how have people seen themselves moving from academia, from free and open source, to commercial, if that's the way they want to go? Because I think that's a really good open question. Although don't overprice it, because, you know, I'm not going to pay for it. Yeah. Well, the prices of some of these things are absolutely insane. You know, it can be hundreds of thousands of pounds, even for things that sound trivial. But they always price them at just below the level where you would think about hiring someone for a year. You know, hiring a postdoc for a year will cost you a hundred grand, and so they'll price it at maybe 80 grand, because they know that you're not going to hire a developer for a year to do it yourself. You'll just pay for it. I would say it's very similar in the US. You might get a software package that's just at that level, and you have the choice to either hire a person for a year or buy the thing.
And I think it's probably a smart price point from the commercial entity, but it makes us, unless it's in your budget. I know, it gives us a tough choice. Actually, this isn't really software, but this kind of stuff magically happens all the time. I had a hardware purchase, and the company gave us a quote, and then I went back and said, well, our budget is really this. And then magically the price was one dollar less than the budget. It's always just under the budget. Yeah, yeah. So then we were able to purchase it, and that's just how that particular thing went. We did get a discount from what it actually costs, but the company fit it to the actual budget. So maybe we've talked about different aspects of how someone can build a sustainable software tool, how they can convince people to support it, and some of the metrics that are available to them. And we also touched on commercial software and how that can be implemented so that software can persist longer than the stint of one person, longer than a three-year contract or something like that. It's clear that there needs to be a shift to longer-term plans for funding critical pieces of software, whether that be through a user-pays model or from the funders themselves. Not everything is going to work out where you just build it and they will come. That doesn't seem to work; we've seen those fail time and time again. And yeah, so hopefully some of the discussion here is helpful and interesting, and maybe it can generate ideas and make a change in the community. Thank you all so much for listening to us at home. If you like this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group and edited by Nick Waters.
The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.