Hello and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There's so much information we all know from working in the field, but nobody really writes it down, there's no manual, and it is assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan of Enterobase, GrapeTree, and BRIG fame, and Dr. Andrew Page of such works as PlasmidTron 5000, Roary, and Gubbins. I am Dr. Lee Katz, and you might know me from my tree-making pipeline Mashtree or my SNP pipeline Lyve-SET. Both Nabil and Andrew work at the Quadram Institute in Norwich, UK, where we work on microbes in food and the impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct professor at the University of Georgia in the US.

Hi, I am Nabil-Fareed Alikhan. Normally I'd be joined by my co-host, Lee Katz, but he's currently flying over the Atlantic, so joining me instead is Andrew Page, head of informatics at the Quadram Institute. And we're joined today by a special guest, Associate Professor Torsten Seemann, who is lead bioinformatician at the Microbiological Diagnostic Unit Public Health Laboratory in the Doherty Institute for Infection and Immunity at the University of Melbourne in Australia.

It's great to be here.

We wanted to ask you today, Torsten: how on earth do we write proper, good bioinformatic software that people actually use, as opposed to throwaway bioinformatic software that just goes in the bin after someone's published a paper?

Thanks for having me on the podcast, Andrew and Nabil. That's a good question. I guess I'm not an expert on how to write great software. I know a bit. I think what we all tend to agree on is what bad software is like. So what I tend to do is just try and avoid the bad things, and what results is usually kind of useful and good.
I'm not a trained software engineer, really. I just try and make things robust and, I guess, foolproof or idiot-proof.

And you write it all in Perl. Well done.

Yeah, I am an old dog. I do still write in Perl. It was my first language and it's sort of still my go-to. I'd love to change, but I'm not quite there yet. But ultimately, it doesn't matter what it's written in; it's how the user interacts with it, what their experience of the software is like. If it just works, it does what they want, and it doesn't fail often, they won't care what's underneath.

And that's the beauty of Prokka, which you wrote a few years ago. It just works out of the box with virtually no dependencies, no extra libraries, no nothing. And that's why everyone has adopted it and used it so widely.

Yeah, I think one of the keys to Prokka's success was that it came bundled with small databases, but databases that were good enough to get good results. And the other key thing was that you didn't need any settings. You could just give it your FASTA file of contigs and it would give you an answer.

Quickly.

Yeah, and relatively quickly. That was the key: it was much faster than many of the other solutions out there. A solution out of the box that just worked for most people. And it was written out of my own need. I needed to annotate genomes, and I guess that's always the best motivation, when you need to use something yourself. I originally wrote Prokka probably 14 years ago, when we were annotating a single genome and we were still in the days where we would actually curate individual genes. Prokka was a way to help bootstrap this manual curation process, to try and get the bulk of it done at once. But it wasn't long after the next-gen sequencing boom that suddenly we were annotating hundreds and thousands of genomes at a time.
And Prokka really became the standard part of the pipeline to perform that annotation and just get something good enough.

Yeah, it's interesting, because Prokka was not the first genome annotation tool available. There are plenty, like, you know, RAST, and there's another one we used called Manatee, which was an absolutely gargantuan pipeline. But for the most part Prokka has superseded all of these other tools, and it seems to be standard issue that you just generate a Prokka annotation. What do you think was the one key element that set it apart from the other existing tools that hadn't become as successful?

Well, I think it was simply the fact that it was a single download. At the time, you could just untar it and it would run. You didn't have to do anything else. It came bundled with a lot of binaries for Linux and Mac, and it just worked. You would run it on your contigs and get a GFF and a GenBank file, completely valid, and you could immediately use that in a downstream tool such as Artemis, or, in later years, it became immediately usable for putting into Roary to calculate your pan-genome and things like that. So yeah, that's the one thing: you could just download it and immediately run it.

What, you mean all of our Linux software can't just be downloaded and run easily?

Not in this universe that we operate in, no. It is a rare thing for that to work. Although the resurgence of new packaging systems like Conda has really changed that; people are putting in a lot of effort to get these tools working for you and easily installable.

And you do a lot of that work as well in your free time. You upload to Conda and make recipes. I think mainly you're in Brew.

Yeah, I started off in the Homebrew Science thing. That was mainly because of Shaun Jackman, a fellow bioinformatician from Canada, who was very helpful in teaching me how to package these things.
But the community has now tended towards Conda, because it can handle versioning and environments much better than Brew, which is more of a flat versioning system.

And why would that be important?

Reproducibility. And that other word beginning with R.

So you're not just known for Prokka. I know in this institute here in Quadram, everyone uses your tools; it's called the Torstenverse. People use Snippy for finding core SNPs. They use Nullarbor for producing these insanely useful reports on everything that someone would want out of bacterial isolate sequencing. And other things too, like ABRicate for calling AMR genes, and mlst for typing in a very trivial manner. Although the name of that one, you know, you need to change that name.

Yeah, the mlst tool is one of the few tools I have that doesn't have a proper name, and I kind of regret it. We did spend a lot of time brainstorming ideas for that, but the only kind of words that came up were things like 'molest', which were terrible, terrible ideas.

Thank God you didn't call it that.

And it's sort of hard to go back and rename it now. People seem to know it now, so I'm hesitant to change the name. But once again, all these tools came out for the same two reasons. One of the reasons was that when I wrote Prokka, I realized I needed a tool to find ribosomal RNA genes, and the existing tools weren't any good. So I ended up writing my own tool.

Is it Barrnap?

Barrnap, yeah. It's not perfect, but it's good enough. And it's fast, and it came, once again, bundled with lots of different databases, clean and simple, and worked out of the box. So nearly all my tools are modeled on that original Prokka experience, where I released the tool and people gave lots of good feedback and were happy with it. And that really encouraged me to write more tools. I've followed that Prokka model all along.
Minimum number of inputs: just a single file, whether it's contigs or your reads, maybe an output folder, but even then, not always.

One thing I like about Barrnap, and about your tools in general, is that you output in standard formats. Barrnap will give you GFF if you want it, and the same with Prokka. That's insanely useful, because you haven't gone down the path of inventing your own format, which a lot of developers do, and then it's just yet another useless format that we have to consider. You've taken the time to actually use a standard that pre-exists. That does take a lot of work, and I appreciate the effort you put in.

Well, thanks. I strongly recommend using standards like BED, GFF, VCF and FASTA, just so your tools can interoperate with all the other useful utilities. It makes pipelines much simpler to write when you don't have to write too much glue in between.

Absolutely, yeah.

And I guess the other thing, going back to Nabil's original question of what was the single thing that made Prokka successful: I think mainly it was that it was a command-line tool. That meant you could run it in a high-throughput manner, because suddenly we had all these genomes, and people didn't want to run them one by one, didn't want to upload them to a website one by one and collect all the downloads and then figure out which ones failed and resubmit. Being able to do it on the command line made it plug nicely into pipeline or workflow systems, with a well-defined input, FASTA, and well-defined outputs, like GFF. So yeah, it just made it easy.

Well, when I first used Prokka, the thing that excited me the most, especially being in Australia with poor internet, was that the databases for a lot of the annotation tools were very big and very complicated to download and install.
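As an aside on the standard-formats point above: emitting something like GFF3 instead of a home-grown format takes only a few lines. This is a purely illustrative sketch; the tool name and feature values are made up and are not Barrnap's actual code or output.

```python
# Illustrative sketch: writing results as standard GFF3 instead of a
# custom format. All feature values here are hypothetical.

def to_gff3(seqid, source, ftype, start, end, strand, attrs):
    """Format one feature as a tab-separated GFF3 line (1-based, inclusive)."""
    attr_str = ";".join(f"{key}={val}" for key, val in attrs.items())
    return "\t".join([seqid, source, ftype, str(start), str(end),
                      ".", strand, ".", attr_str])

print("##gff-version 3")
print(to_gff3("contig1", "mytool", "rRNA", 120, 1650, "+",
              {"ID": "rrna_1", "product": "16S ribosomal RNA"}))
```

Because the output is plain GFF3, it can go straight into downstream tools that already speak the format, with no custom parser needed.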
And I think in Prokka you had a scaled-down UniProt database, and you got, again, like 90% of what you would get with the more exhaustive database, but without all that overhead, and that made life a lot easier. So this comes back to one point that, if I'm writing software, I think might matter, and I'd like to get your opinion on it: building a singular tool that does one thing and does it well. If you can cut extra dependencies, do it. If you can cut database sizes, do it. If you can cut unnecessary parameters, do it. Is that generally what we're talking about here, less is more?

I do favor having fewer dependencies. Sure, the BioPerl part of Prokka has been a problem for people in the past. I probably won't use that in the next iteration; I might remove that dependency. But, funny story about reducing the size of the databases: part of the reason for that was to reduce the distribution size, because I was hosting it on our university website, and most of my users are outside Australia, and the pipe to the rest of the world was quite small back then, and people were complaining about downloading it, because it was like a 1.2 gigabyte tarball at the time. So I ended up helping them by putting it on my Dropbox, and I got banned from Dropbox one day later for too much bandwidth being used. So I had to write them a nice letter and say, I promise to delete it if you reinstate my Dropbox account, and they said yes, okay. So I put it onto Google Drive, and they never had a problem with the bandwidth. But that made me realize that people were still having trouble downloading back in the day. So, on what exactly I did there: people said, oh, why don't you just use the whole NR database in GenBank as your annotation source?
And I said, well, sure, but a lot of those genes are annotated poorly; they're just annotations copied from other bad annotations, a propagation of errors. So I thought, what should we do? Go back to the pure stuff we really know. And as you said, I used UniProt, but not just that: I used UniProt Swiss-Prot, which is the curated part. And I went even further. UniProt has evidence codes for each of its records, which tell you whether it was confirmed by proteomic evidence or RNA evidence, or how it came about. So I thought, well, let's just stick to the well-curated stuff. And that really shrunk it down to about 70,000 proteins, which ultimately represented a kind of core proteome for bacteria. So yeah, you could annotate 60% of your genes with just a very tiny database, with high reliability. But a lot of people still aren't happy; they want better annotations. So I've added capabilities now where you can supply some reference genomes and it'll incorporate those into the annotation.

Well, let's change track and talk about software in general. When you're catching up on the literature and trying to see what else people are doing, what are the sort of bugbears you have when interacting with a new bioinformatics tool?

Well, I think we all have a universal experience. The standard thing is you'll see a tweet, or get an email from someone, or a Slack message: oh, have you seen this new tool? And it's a link to a paper or to a bioRxiv preprint. So you go there excitedly, read the abstract and think, yep, this is it. This tool is going to solve all my problems. And then your first problem is actually finding the source code. So you read the paper and you're digging around for a URL somewhere. Sometimes you won't find one.

Contact the author.

Contact the author, or you'll just search for 'https' or search for the word 'github' in the paper, just hoping for a link. And you eventually find it.
And then you think you're 90% of the way there, but you're really like 10% of the way, because you then have to install it. Now, Conda and Brew and these packaging systems have made this a lot easier, but that's not true for new software. New software isn't packaged into those things quite yet; it's not mature enough. So you go through the complicated install instructions and eventually it kind of works. And then you run it on some tiny data set, or just some standard thing you try everything on. And then it fails. Some Python backtrace, or bash error, or missing module in Perl. It's the standard thing. I guess you guys have both experienced this as well.

Absolutely. I've installed hundreds of applications, and I've bashed my head against the wall so often where things just absolutely do not work. Or they will only ever work in the institution where they were written, because they've gone and hard-coded the parameters. Or they tell you, please go and edit the script at line 59 to change some tiny little thing. And you're thinking, this isn't proper software.

That's a terrible experience for your user to have first time around. They're excited, they want to use your software, they spend all this time on it, and then they can't get it working. And they maybe file some GitHub issues, and maybe get a response, or maybe never get a response. They're never going to use your software again. I don't want that to happen to people that use my software. I want their first experience to be a positive one.

So, just coming back, one thing that popped into my head when you said you read the abstract and you're looking for a URL: where would be the best place to put it in a manuscript? Where are you actually first expecting it?

For software papers, I think the software URL should be in the abstract. If it's not in the last line of the abstract, where is it going to be? There might be a specific section in some journals for software availability and stuff like that.
For example, Oxford's Bioinformatics has that, but it's kind of part of the abstract. I know a lot of people don't agree with me, but I think the software is the main output of the paper, so it should be in the abstract.

So I saw a blog post you wrote a while back on the minimum standards for bioinformatics command-line tools. How did you come up with that?

Well, yeah, it was born out of my frustration with all these tools that I couldn't get working. So I thought I would just write down some of the internal rules I have for my own software, and hope that other people would read that and adopt it. And the blog post ended up being quite popular, getting a lot of readers and a lot of sharing. And I got approached by the GigaScience journal at the time, and they asked if I would like to turn it into a paper. I'm a terrible paper writer, so that was a bit of a motivation to actually get a paper out. And I did get a paper out, and it was quite interesting; it helped improve the blog post a bit. And yeah, it's been quite a popular paper. It was one of the most downloaded papers from the GigaScience journal website at the time.

Wow, that's insane.

Until their database crashed and they lost all their statistics. So there was definite interest in this sort of stuff, even though it should be obvious to people; but obviously it's not.

So Torsten, what kind of things did you put into your recommendations?

Well, nothing too complicated, Andrew, just really basic things. These are what users will experience when they first try the software. Say you install a piece of software, and the first thing you do is run the command. Say you install Prokka: you will run prokka, and what do you expect to happen? You expect it to tell you something about what you didn't do, like you didn't provide this option, or you need to supply this. You don't expect it to just spit out an error.
You expect it to spit out a useful piece of information to tell you what to do next. Or, some software doesn't even have a help option; you need to have a help option. Or, even more dangerously, it just goes and runs, and you don't know what it's doing.

Yes, that's the worst. Deleting your data.

Yeah, often a piece of software run with no options might create a directory and start writing some files in there without any permission or awareness from the user, and that's very frustrating. So yeah, my goal is to do nothing unless you've got everything you need, and then tell the user at every point what's happening and what's going on. A lot of people quite enjoy all the text that gets scrolled past them in my software. They find it comforting, and I find it comforting as well, to know what's going on all the time. Although there are differing opinions; some people say it should be quiet unless the user tells it to be noisy.

Yeah, so I'm the other way around. I'm a bit noisier by default, and you can quieten it down. Most of my tools have a quiet option. Say you're embedding it in a large pipeline: you might not want all that verbose output.

It's interesting, a lot of the logging from your tools often has a fairly funny or comical sign-off at the end when it's finished running. I think it's Prokka that normally says 'Share and enjoy' at the end.

Prokka had a couple of inspirational quotes or funny quotes at the end. I don't know what made me put that in. I think I just did it to keep people interested when they were looking at their output. But yeah, it's become a bit of a trademark, where most of my tools have some kind of little remark at the end. And I think one of my latest tools might have a library of 20 random statements it gives at the end. I think it's Snippy, actually. Yeah, Snippy has quite a lot of random ones, and I get lots of good feedback about that, so it makes it fun for me as well.
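The minimum standards discussed here, print usage when run with no arguments, provide a help option, stay noisy by default with a quiet switch, and never write output anywhere the user didn't ask for, can be sketched in a few lines. This is an illustrative example only; the tool name and options are hypothetical, not from Torsten's actual code.

```python
# Illustrative sketch of the command-line minimum standards discussed above.
# "mytool" and all of its options are hypothetical.
import argparse
import sys

def build_parser():
    p = argparse.ArgumentParser(
        prog="mytool",
        description="Annotate contigs (illustrative example only).")
    p.add_argument("--version", action="version", version="mytool 1.0.0")
    p.add_argument("--quiet", action="store_true",
                   help="suppress progress messages (verbose by default)")
    p.add_argument("--outdir", required=True,
                   help="output directory (never chosen silently)")
    p.add_argument("contigs", help="input FASTA file of contigs")
    return p

def main(argv=None):
    argv = sys.argv[1:] if argv is None else argv
    if not argv:
        # Run with no arguments: explain what is missing, then do nothing.
        build_parser().print_usage(sys.stderr)
        return 2
    args = build_parser().parse_args(argv)
    if not args.quiet:
        # Noisy by default: tell the user what is happening at every step.
        print(f"Reading {args.contigs}, writing to {args.outdir}",
              file=sys.stderr)
    return 0

if __name__ == "__main__":
    sys.exit(main())
```

Run bare, it prints usage and exits non-zero instead of silently creating files; `--help` and `--version` come for free from argparse.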
And your software is very robust as well. I know in my own role we've run it maybe half a million times and it just works every time. The only thing that breaks is, was it tbl2asn?

Yeah, tbl2asn is a tool provided by NCBI which actually writes the GenBank files that Prokka produces. And it's become a bit of the bane of the Prokka universe. It's distributed in binary-only form, which is kind of against my beliefs. They also don't tell you when they're going to update it; a new binary version just magically appears on the FTP site. So this is something that's caused a lot of problems, and I want to remove it from Prokka.

And it just seems to expire after a year.

Yeah, every six months to a year it expires. So you need to replace it, but there's no warning.

Until it breaks your pipeline.

Yeah. So that is the main problem with Prokka, that step in the pipeline. I kind of regret using it. It guaranteed nice, compliant GenBank files, but it was a bit of a mistake looking back, and I won't be using it in any future tools.

Yeah. So for new people who are writing a tool, what sort of advice would you give to someone who wants to write a new tool, or who is wondering if they should try and write a new tool?

Well, my first advice would be: don't write a new tool. It's a big commitment. If you want to do it properly, it's going to take time. And the key thing is you have to maintain it. If you're just going to write a tool, publish it, and then abandon it, please do not start that project.
You're not really helping anybody. If you're not going to use the tool yourself, don't write it, because you'll never truly make it useful and reliable unless you are forced to use it and suffer through its problems. Only through that suffering will you improve the software. It's not that we don't care about other people's experiences, but ultimately, you know, time is limited. If it's failing all the time and causing you problems, or your colleagues are annoyed at you, then you will fix it. But if no one, including you, is going to use it, then yeah, don't bother.

What's the oldest tool that you wrote that you're still maintaining now? How many years back are we going?

I think Prokka is my oldest tool, actually. Prokka started as a series of modular bash scripts back in about 2004 to 2006; that was the genesis of Prokka. It was used internally only by us at the Department of Microbiology at Monash University, where I used to work. It didn't become the Prokka we all know until much later. I think the paper was 2014, but it was out in the open source community well before that. It took me a while to write the paper; as I've said, I'm not a good paper writer. I never feel like the tools are quite good enough to be published, but clearly that's not true, based on what we see published out there a lot of the time.

It has thousands of citations. I think it's 2,800.

Yep, so it's in the low thousands, depending on whether you use Web of Science or Google, which is an amazing, amazing number.
I never expected that. Everywhere I go, people know me as the Prokka guy, and I'm okay with that. I've made a lot of people happy, because it solved a problem they needed to solve, and it did it in a reliable way, and I'm proud of it. It's enabled a lot of research, and a lot of other pipelines and tools are built using the output of Prokka. I think it really set the tone for the rest of my career, that I was keen to write good-quality tools and make them available freely to the community. That set my whole career.

So let's say that you have followed these steps and you've identified a tool that you're willing to put a good, say, decade of looking after and developing into. How do you then distribute it and make it available to other people? What are the platforms available? Because at the moment it seems quite saturated. It'd be interesting to hear what's available, and then what you think is best, like your priority list of where you should put it: here, then here, then here.

Yeah, so Prokka, being my oldest tool, is probably a good place to start. GitHub would have existed around then, but I was hesitant. It seemed confusing, and I was a bit stubborn; I didn't want to use a revision control system. I'd used SVN and RCS back in undergrad, and that wasn't pleasant.

So I can appreciate that.

I was scared of GitHub; it seemed complicated. But I had friends and colleagues who helped me figure out the basics, and once I understood how it worked, and that it wasn't too hard to do the basics, I just loved it and said, this is what I have to use for everything. Before that, I was just distributing tarballs with version numbers on personal websites and stuff like that. But after I discovered GitHub, and realized that it could keep my issues organized, and pull requests, which I tend to not be very good at reviewing...
...I realized that was the place to go. Obviously any system like Bitbucket or GitLab is also fine; I'm just so used to the GitHub interface that the others are sort of confusing to me.

And since you mentioned hosting it on your own website, or even an institutional website, do you recommend that now as an approach?

No, I don't think so. If you are legally allowed by your institute to put it in a more public place, I encourage people to do that, just because universities rearrange their websites all the time. Software links go dead; they don't put in proper, you know, HTTP redirect codes for old links. GitHub is quite a static place, or one of the other repositories. SourceForge is the exception; I would encourage people not to use it. I've had a lot of issues with SourceForge. There have been, you know, malware issues with SourceForge, and I just find the whole interface very confusing and difficult. The others are just cleaner and simpler and more widely used.

So you chose the name Prokka, which I understand is from 'quokka'. But how do you choose the name of a new software package? It just seems really, really difficult.

Yeah, I think choosing a software name is very important. I know it seems a bit weird, but people like a good name. It is marketing, in a way. Prokka, you're right, rhymes with quokka, which is an Australian animal that lives off the west coast of Australia, but it originally stood for a sort of pseudo-acronym of 'prokaryotic annotation'. And yeah, I like the letter K; it's one of my favorite letters for some reason. And the key thing is that I want it to be unique. The namespace for software is huge and polluted with all the common words, so to stand out and not get any name conflicts, you need to find something a bit original.
So yeah, Prokka, Snippy, Barrnap, all these names were tested by googling first, in incognito mode, just to check if there were any conflicts, and if a conflict was minor enough and I liked the name, it'd be fine.

And like I said, you need to check Urban Dictionary.

Yeah, Urban Dictionary. Hopefully that would come up in a Google search. But yeah, that's definitely very important too.

What do you think of these acronyms?

Yeah, I'm not a fan of acronyms. Well, they can be good; if they end up as a nice unique name, then that's great.

Well, what is a backronym, for people who don't know? How does that work?

A backronym is where they've come up with a name, and then they retroactively fit a set of words to the name they chose. So they didn't choose the words and then make an acronym from them; they went the other way.

So, okay, it's like they really want to call it RUST, and they go, oh, it means Really Unique Search Tool.

Exactly.

Okay, sure. Yeah, but sometimes backronyms can be good.
I'm sure our listeners could probably provide some examples where the backronym was good, but most of the time, not.

And of course, when it's not, you win a JABBA award.

Yep, I think Keith Bradnam came up with the JABBA award for bad bioinformatics acronyms. It stands for 'just another bogus bioinformatics acronym', and there are plenty of bad examples out there, which we won't personally name or shame in this podcast.

Okay, so yeah, we've been talking about GitHub and Bitbucket, and these are sort of repositories to put your source code up and make it available. But that's often geared more towards people who can program, or are used to dealing with software. For people who just want to install a thing, there's a range of different options available, and I just wanted to hear a bit about those options, and then what you would prefer as the best way moving forward.

Yeah, you're spot on. The bulk of our users are applied users. They may be biologists who have done some sequencing, or some other data generation, and want to use our tools on their data. They're not other bioinformaticians writing pipelines. So you need to really cater for those people; they're the ones going to have the most difficulty with installing directly from GitHub.
Like you said. So, traditionally you would package your things in, you know, a Linux distribution package system like Debian's, or Red Hat has RPMs and stuff like that. But those have fallen out of favor, mainly because you can really only store one version at a time, but also because writing packages for those systems is quite complicated. Debian has Debian Med, which is the distribution for most of the bioinformatics software, and it had a very strict set of rules for software. And I think Andrew, I recall, had some experiences there.

Yeah, there's quite a lot of documentation you need to read, and it can be quite an ordeal. You need to do basically months of mentoring and training before you can get to the point where you can actually submit stuff. So it's quite a high barrier to entry.

Yeah, exactly. So what came along as alternatives to this? The other thing is that you usually had to be a root user to install using those systems, and most people these days, you know, are on shared clusters, or using systems where they don't have administrator access. So along came things like Homebrew, which started on Mac. But I guess the biggest ecosystem at the moment is Conda, which is based on Python. Conda allows you to install whole hierarchies of software without having any administrator rights or root access, and it's revolutionized, I would say, the bioinformatics installation experience for everybody over the last few years. I would say that for any new tool that comes along, if you're not packaged into Conda, you're not really going to be used.
Everybody's using Conda now, and if they can't find your tool there, somebody may package it for you. But ideally you would proactively try and get it packaged, or write to the Conda people and help them out: explain exactly what the dependencies are, or make it easier for them to package by being very clear in your documentation about what it needs, how it works, and how you set it up, and somebody out there will do it. There's a huge team of Bioconda developers who are just going around packaging up software tools that they don't even use themselves, but they're doing it for the community. So the easier and clearer you make it for them, the more likely they will produce a reliable Conda package for your tool. And I'm very lucky now that when I release a new version of a tool, there are people out there who go and package it before I've even realized. I feel like everybody's working together and things just happen.

And of course, if you make your tool easy to install, then people are going to use it and cite it. So it's not rocket science, but you know, you will get all these citations for free just by taking that extra step.

I have no doubt that the success of Prokka and its number of citations is purely because it just worked for most people, and therefore it immediately ended up being used in their research projects, and they ended up citing it in the papers that used the results. If it wasn't easy to install, the citations would be way lower. And, you know, in Australia, grant reviewers and so on are taking things like citations much more into account these days, because it's more about the translational impact of your work. Having a paper doesn't mean much if no one cited it. So yeah, that's been great for us in bioinformatics. It is an incentive to create high-quality software.

So I guess if you can't install the software, does it really exist?

That's right.
Just on packaging, though: what does that actually mean? Very briefly, what are you actually doing when you're packaging something, what's involved? Yeah, so traditionally, a lot of software a long time ago was written in something like C or C++, so the standard thing you would do is download the source code, run the standard configure script, then type make, then make install, and that would put it in some fixed place, usually under the /usr/local hierarchy, and that was all you had to do. Since then, scripting languages like Perl, Python, and R have really taken over from C, just because of the ease of writing software in them, and they came with their own internal packaging systems. Python has gone through a lot of different ones over the years, but now PyPI and the pip command have become the standard way to install stuff. So if you are writing Python software, you should really be thinking about packaging it in a way that it can be uploaded to PyPI, so that people can just type pip install whatever. It's a bit daunting at first, but there are some very good guides for it, and it's actually much simpler than you think. It usually just means putting your software into a subfolder and creating a TOML file or similar that describes what it is, what category of software it is, and what the starting command is. And if you can get into PyPI, that then makes it trivial to get into conda, because it becomes very easy for the people packaging for conda to write a wrapper for it. The problem with PyPI is that it installs your actual Python script, but if you use other commands like SAMtools or BWA, they won't be included; there's no guarantee they're installed alongside your PyPI package. That's where conda comes in.
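As a rough sketch of what that "subfolder plus a TOML file" step looks like, assuming a hypothetical tool called "mytool" with a `main()` entry point in `mytool/cli.py` (all names here are invented for illustration), the minimal layout and metadata might be:

```shell
# Hypothetical layout for a pip-installable tool; "mytool" and its
# module path are made-up names, not a real package.
mkdir -p mytool-src/mytool
cat > mytool-src/pyproject.toml <<'EOF'
[build-system]
requires = ["setuptools"]
build-backend = "setuptools.build_meta"

[project]
name = "mytool"
version = "0.1.0"
description = "Example bioinformatics command-line tool"

# This line is what lets users type "mytool" after pip install:
[project.scripts]
mytool = "mytool.cli:main"
EOF
# From inside mytool-src/ you would then typically run:
#   python -m build       # builds an sdist and wheel into dist/
#   twine upload dist/*   # uploads the release to PyPI
cat mytool-src/pyproject.toml
```

Once that metadata exists, `pip install mytool` works for end users, and, as discussed above, a conda packager can wrap the PyPI release with very little extra effort.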
Because those tools are obviously not Python, and they don't belong in a repository based around Python. That's exactly right, whereas conda brings it all together for you. A conda recipe, as it's called, or a brew formula, is basically the set of steps that you would otherwise need to do manually to install a piece of software. Even better than that, conda can reuse existing recipes: say your tool depends on SAMtools. You don't have to install SAMtools in your recipe, because somebody has already written a conda package for it; you just declare SAMtools as a dependency, and it gets installed automatically. So the bulk of the conda recipe for your tool would basically be to install your Python software and pull in its dependencies. So yes: for Python use PyPI, for Perl you should use CPAN, and for R you should use CRAN. Once again, going back to standards: these are community standards for distributing software, and you should use them wherever possible. Okay, so we've now gotten to the point where we've picked a tool, we're going to look after it, and we've got it into a repository so people can download it and it's nice to use. But then how do we market it and get word out into the community that this is available, it's good, and people should have a look at it? That's a good question. Traditionally, the way to promote your software is to publish it, and as I've said, that's not something I'm very good at doing. How I started promoting things was mainly through blog posts: originally I would write a blog post about a tool, and then use Twitter. Twitter is probably still my main promotional channel for software, because I still haven't written most of the tools up. I just tweet links to GitHub, people can look at the GitHub page, and I usually tell them how to install it, which these days usually involves running conda install or brew install. So I try not to announce
anything until it's at a reasonably good state, but now that I've become known for writing software, people notice any new GitHub repositories I create and pounce on them quite early. I don't tend to make things private; I tend to start off public. So yeah, Twitter is probably my main avenue of advertising, but there are also the traditional journals, and preprints are common. I've yet to write a preprint; I probably should. Oh, they're awesome, I've been writing loads. Every paper we've had in the last few years has been preprinted, and I find it enormously successful, because people read it, they can get all the data, and it rarely changes by the time it's published, but you get it out a year or two earlier, because publishing can take so long to get to that final polished paper. Yeah, I think I should start, but it still involves writing something. One thing I'll mention on preprints as well: I'm a massive advocate, and I've been using them as much as possible in the last few years, but you do have to be careful, because some journals still do not like the idea of preprints being available. They want exclusive rights to the text, and you've sort of semi-published it already, so they may say they won't entertain the manuscript. The journal should be able to tell you, so you should ask the editors, but there are also online resources you can check, like Sherpa Romeo, which is a database with a ranking of how open and easy it is to publish in each journal: whether they allow open access, whether they're happy with preprints, postprints, and so on. So that's something to keep in mind, but otherwise, in general, I'd advocate preprints. They definitely take the anxiety out, because it's out there; you're not sitting there waiting, oh my god, it's still in review, when is this going to come
out. It's out there, and you're just going through the motions of getting it into print. Yeah, I've tended to feel that my GitHub and my documentation are kind of like a preprint, in a way, in that they contain most of the information about what the tool is about. I feel like a PDF file somewhere isn't really adding any value to that, but from a career point of view it probably would. But what about software-focused journals, like JOSS, the Journal of Open Source Software, which is somewhere in the middle? Yeah, this new JOSS journal: I love the model, I love that it's focused on usability. You don't need to prove that your tool is the best at anything or compare it to anything else; it's mainly about the quality of the software. I believe they have checklists that reviewers use to assess that there's quality documentation, that it's installable, all the things that I complain about or talk about in terms of software quality, JOSS has addressed. So I'm a big fan of that: it doesn't focus too much on comparisons or complicated details, it focuses on the software. Even if you're not necessarily submitting to JOSS, I think looking at those guidelines is a good resource for people who want to know the minimum requirements of writing good software. Yeah, I would agree with that. So now we've gone through the entire development cycle, and we're at the point where you're maintaining the software and you have user requests. It's great when you get that first email, "Hey, I use this software, it's great, but..." So what are the best strategies for managing user support, if you're maintaining Prokka for, like, ten years? Yeah, at first you get these emails and this small amount of positive feedback, and it's great, and then you keep getting more and more email, and you struggle to
keep up, you honestly do. So I encourage people not to email me; I sort of answer emails in bulk every few months. I encourage people to use the GitHub issues page, and the great thing I discovered about that is that there's a community of users out there who end up helping each other. I often get to an issue that's been filed, and somebody else has already helped that user, so I can just say thank you and close the issue; that was quite a surprising thing. I don't have a mailing list. I would like to, but I worry that it would just hide a lot of the content. I don't know the best way to do it: issues get closed, and so they sort of disappear as well. Should there be a frequently asked questions page, or a wiki on your GitHub site? I'm not sure what the best way is, but I've noticed a lot of my software is self-supporting in the community through Biostars or SEQanswers, these traditional bioinformatics forums online. So I think I owe a big debt to the community, who have helped me maintain Prokka. But there's still a huge burden on me, with pull requests that I don't have time to review and issues that I don't have time to fully investigate, and there is actually quite a large psychological burden: there's guilt associated with this software that you put out there. And you're not being paid to maintain this indefinitely? No, there is no payment. I'm very lucky that my employer sees maintaining this software as an important part of my job, but I understand that that is not true for a lot of fellow software engineers and fellow bioinformaticians, so I am privileged in that respect, and I'm trying to maintain it. But the long-term answer to this is not clear. There needs to be more funding for bioinformatics software maintenance, and I was recently quite excited to hear about the new Chan Zuckerberg Initiative grant system, so they will be having
three rounds a year of grants for software maintenance. So this is not money to write new software; it has to be software that's already out there and has a proven user base. They are offering grants of $50,000 to $250,000, with three rounds a year and, I think, about four to six grants per round, specifically for key tools in computational biology. So I'm investigating that as a possible way to help fund Prokka, a next version of Prokka, and other tools. Okay, well, we've actually been talking for quite a while, so let's sum up what we've been discussing, some of the main take-homes in terms of writing bioinformatics software. Yeah, I guess the first thing is to just make it as easy as possible to install. Every problem that people encounter installing a piece of software is going to halve your number of users, right? Just like putting equations in talks: every equation halves your audience, apparently, and I think the same thing applies to installation problems. Documentation: mine is not always perfect, but it's usually enough to get started and understand what's going on. So keep it clear and simple; if it's 100 pages and you can't figure out how to find the basic things you need to do, then that's almost worse than having no documentation. I'm a big believer in putting my software out there before I publish the paper. The community will find problems that you never even thought of, and I'm scared that if I publish first, I will then forget about my software and not care as much.
We've all seen that before: something's been published, the final publication, you go to the GitHub site, the code was added six months ago, there have been no changes since, and there are no issues, or issues that have never been responded to. Unfortunately, that's probably a student or a postdoc who wrote something, got a paper, and then left, so that software will never be used, and it just feels like a waste of effort. A common excuse for not doing this is that people are scared of putting their code out; they feel embarrassed. People have made fun of my monolithic Perl code, and they're right to make a bit of fun of it, but ultimately I've learned not to be embarrassed by it. It works; the proof is in the pudding, and the design behind it doesn't really matter too much on smaller software projects. Bigger ones, with large numbers of people involved, sure, you should probably apply some good software engineering principles, but my tools are usually pretty small, and surprisingly I've had good feedback from people who have read my monolithic Prokka Perl script and said it's so easy to understand: I can follow all the steps, you've commented each section, and I know how to alter it. So there's something to be said for that linear flow that's easy to walk through, rather than jumping between modules and not knowing where you are in the process. And most of all, if you write good software, I've found that a lot of people buy me beer when I travel around the world, so that is an incentive: free beer. So we owe you a lot of beers, then? Yeah, I don't expect to buy another beer for the next five years; well, that's the dream anyway. Absolutely. Well, thank you very much, Torsten, for this amazing chat.
Yeah, thank you, Torsten. So this has been Torsten Seemann, with Andrew Page and myself, Nabeel Ali Khan, for the microbial bioinformatics podcast, just signing off; we'll catch you next time. Thanks guys, had a great time chatting with you. Thank you all so much for listening to us at home. If you like this podcast, please subscribe and like us on iTunes or Google Play, and if you don't like the podcast, please don't do anything. This podcast was recorded by the microbial bioinformatics group. The opinions expressed here are our own and do not necessarily reflect the views of the CDC or the Quadram Institute.