Hello, and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody really writes it down. There's no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. I am Dr. Lee Katz. My co-hosts are Dr. Nabil-Fareed Alikhan and Professor Andrew Page. Nabil is a senior bioinformatician at the Centre for Genomic Pathogen Surveillance at the University of Oxford. Andrew is the CTO at Origin Sciences and Visiting Professor at the University of East Anglia. Hello and welcome to the Microbial Bioinformatics podcast. I'm Dr. Katz, and joining me are your co-hosts, Nabil and Andrew. Today we're diving into one of the most talked-about topics in the bioinformatics community: Nextflow. In the past, we've all been in support of Nextflow, and workflow languages in general, as a means of organizing your work. Now, in 2025, Nextflow is popular and promises portability, reproducibility, and scalability across systems. But for every bioinformatician who swears by it, there's another who's pulling their hair out over submission quirks, debugging nightmares, or endless DSL syntax confusion. In this episode, we're taking an honest look at the pros and cons of Nextflow in 2025. We'll explore where it truly shines, parallel processing, robust error handling, and seamless container integration, and where it continues to frustrate users, from steep learning curves to painful updates and costly ecosystem tools. So today, would you guys like to maybe take a side? Whether or not it's really our opinion, we'll just be opinionated. So here's what I'll propose: I'll take the cons, I'll say all the cons for Nextflow, and I'll take that position, whether or not that's really my opinion. It's not necessarily my opinion. Do you guys want to stand up for the other side? Well, I'm definitely pro, because we use it extensively at the moment. I'm happy to sit in the middle just to keep things balanced. Fair, fair. So everyone listening knows that this isn't necessarily our opinions, we're just having it out. Yeah. Good job, Lee. We'll edit all of this out, don't worry. Yeah, no, I'm very pro Nextflow. We use it extensively in our work, which is cancer bioinformatics at the moment and microbiome analysis, using Seqera Platform, which we used to call Nextflow Tower, which is just kind of an easier way to run Nextflow. Everything that I do in Nextflow is on AWS, so it's on the cloud, you know, shiny. And to be honest, for a lot of stuff I do, I will usually just pick up an nf-core pipeline and run it, because it does cover all the major things you might want to do, and it's only for more bespoke analyses that you want to actually drill down and make your own stuff. And then AI has just made it so trivial to take your repository and make it into a Nextflow pipeline. Quite literally the other day, I told the AI to make a little tool which just manipulates some indexes in a FASTQ file, then said, okay, make it into a Nextflow pipeline. Bang. There you go. And make it a Docker container. It's all very, very easy to do. And Nextflow is so simple and robust that you can just knock it out. So yeah, Nextflow is the way.
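For readers curious what that kind of single-tool wrapper ends up looking like, here is a minimal sketch of a one-process Nextflow DSL2 pipeline around a hypothetical FASTQ-manipulating tool. The tool name, container image, flags, and parameter are all invented for illustration, not something Andrew actually published.

    // Minimal sketch: wrap a hypothetical tool "fixfastq" as one Nextflow process.
    // The container image and command-line flags are placeholders.
    nextflow.enable.dsl = 2

    process FIX_FASTQ {
        container 'quay.io/example/fixfastq:0.1.0'   // hypothetical image

        input:
        path reads

        output:
        path "fixed_${reads}"

        script:
        """
        fixfastq --in ${reads} --out fixed_${reads}
        """
    }

    workflow {
        // params.reads would be supplied on the command line, e.g. --reads 'data/*.fastq.gz'
        Channel.fromPath(params.reads) | FIX_FASTQ
    }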
So let's talk a little bit more about the performance and efficiency. So does the, so next, what does promise performance, does that hold up in, in real time in the cloud or on an HPC in large scale architectures? So maybe Andrew, you want to start with the pros of, of what Nexo things does well in this space? I guess it depends on how you're running it and all it is, is an engine. You know, the performance really, it comes down to how well you've written the software and how well you've paralyzed it all. And if you've done a poor job there, then of course, next, it's not next to his problem that it doesn't run well, you know, so don't blame next flow for your sins. But what it really does well is just orchestrating everything. You don't have to worry about where is my data has a linked, you can just kind of join it together in Lego blocks and it makes life a million times easier. You can also then have stuff, you can have data in buckets and things like that. And so you can interact with it in a very different way by using object storage, say in the cloud, rather than having to worry about, okay, I'm using this VM where to put the data as a local, I've uploaded, downloaded, interact with it. It just kind of magically happens. So the performance is really how you use it. And if you code a poorly, then you're going to get poor performance. If you tell it, you need a huge number of processors and you need a huge amount of memory and it doesn't use it, well then, you know, that's, that's your problem, not next, next flows. Yeah. When I think that you are programming for it, it's like you can multi multi, like break it up as much as possible. Like you can, you can have a workflow, for example, that does genome assembly followed by annotation, for example. And you can, you can make it so that one genome is on one process at a time, or you can decide to break it up. Like one job is one genome for assembly, and then one is a genome on annotation, and you can break it up even more, or maybe you can break it up even more into like another job for gene prediction, followed by gene assembly, gene annotation. So it's like you can modularize it as much as you want and break it up as much as you want. But this also means to me that you have more jobs sometimes than the HPC can handle. So some HPCs might have a scheduler that might be able to keep track of something insane, like 10,000 jobs, but you know, you can be even more insane and you can submit a million jobs just by accident, by submitting so many jobs at once, and it can overwhelm your HPC. So I do appreciate it, but you know, I think it goes to what Andrew's saying, like you have to program it, it's going to be as smart or as insane as you are. And so you can overwhelm the HPC. I would also say that, you know, that the learning curve for it, like to make a job, to make a module, sorry, to make a module, you have to understand the syntax and okay, they do a good job of putting the manual, but it still makes it a steep learning curve to understand like what your input block is, what your output block is. If you have to process something using Java or God forbid, you have to learn Groovy, like that's hard. You know, I'm in a glass house, I program in Perl. So programming in Groovy though, I think is a big deal. And so just with all those different jobs, I think that you can really overwhelm it. 
And with each of those jobs, Nextflow creates an intermediate work folder, and each of those work folders has, I think, at least eight files as standard, and I'm sorry if I'm wrong on that, but it's somewhere between five and ten files, like .command.sh and .command.something. If you actually are insane and you have a million jobs, then you've created maybe eight times a million files and that many directories, and you might overwhelm your NFS file system. Sorry, I'll stop talking at this point, I've been ranting too long, back to you. I guess maybe HPC is not the right setup then for that, because, say, I'm running stuff in Amazon and in the cloud, and you have different problems: you don't have the NFS problem or the inode strain, because you're using object storage, and you can scale it up. Like the other day I had 3,800 CPUs running simultaneously, and if you did that on a real HPC, you would get a call very quickly from someone telling you to stop using all the resources, or you wouldn't be able to get them at all. And then if you have something like Tower, I think what it does is it will actually spin up one node and put a few of the small jobs on it. So I think some of it comes down to just the underlying infrastructure you have, and HPC, I think, you know, the old Slurm queues and all that, is kind of the old boy in the corner, whereas if you want to get the most efficient use out of it, you need to be moving to the cloud, and that is where it really shines, I think. However, you can make very expensive mistakes, and you can spend thousands of dollars in a few minutes if you mess up. Okay, plenty to think about just with that. I've run into the same issue. I think, yeah, I mean, what's nice in terms of the pros is you get the trace and reporting built into Nextflow, which is really nice. So if you're running a pipeline over and over again, you can actually monitor the resources, and just by changing the process configuration you can turn individual modules down in terms of resources, or really easily tune that so that you're not spending more than you have to, and that would be something that would be quite difficult to do from scratch. So it's nice to have that out of the box with Nextflow, and I think that's really the nice thing for performance and efficiency tuning. On the other hand, I've run into the same issue with the philosophy that there's a job for each sample and each genome, which means that, yeah, you've got this massive array of tasks being run at the same time. That tends to be difficult, because in some cases you actually want to run things in batches, like, you know, job arrays on HPC; even just running something small, you might want to run a thousand of them at a time. I'm sure it's similar on the cloud, Andrew, but for HPC, you're in a queue and you're waiting for a node to come available, so it takes a little while to spin up the job. So if it's a job that only runs for 10 seconds, it sort of doesn't make sense to do that, and Nextflow tends not to like people putting things into batches; the nf-core specification, really, they don't want you to do that, because what happens is, if you start batching things up, it's very difficult to track what failed and where and so on. So you're playing with this trade-off between efficiency and clarity of your workflow, which I don't have a solution for. I'm the guy on the fence, so I get to say whatever. Yeah, it can be slow.
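Going back to Nabil's resource-tuning point: the built-in trace and report files are switched on in nextflow.config, and per-label process resources can then be dialled down once you see what the tasks actually used. This is a sketch only; the label names, file names, and numbers below are made up for the example.

    // Illustrative nextflow.config fragment: enable trace/report, tune resources by label.
    trace {
        enabled = true
        file    = 'pipeline_trace.txt'
    }

    report {
        enabled = true
        file    = 'pipeline_report.html'
    }

    process {
        withLabel: 'process_low' {
            cpus   = 2
            memory = '4 GB'
        }
        withLabel: 'process_high' {
            cpus   = 8
            memory = '32 GB'
        }
    }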
Yeah, I didn't touch on that, but that's a good point. If you're submitting, for example, to an SGE cluster and you have all these small 10-second jobs, you actually also add on that wait time, because it takes, I think, 1 to 15 seconds for a job to be picked up. So your job could have been 10 seconds, but now it's up to 25 seconds long, you know, times a million jobs or times 10,000 jobs or whatever it is. I guess if you're in the cloud, often people will try and use spot instances as well, and that can add a certain degree of wait time, and it can add a certain degree of risk, because things might be cancelled and reclaimed. What is a spot instance? I'm not familiar. A spot instance is where there just happens to be free capacity, and you say, I will accept it on the condition that if you need the resources back, I will give them back and you'll kill my job, and you get a 90-second warning. And the benefit is you get like a 60% or 70% discount. That's a good deal. So yeah, for small jobs, it's fine. But for large jobs, like where you need a lot of memory, it is absolutely not fine, because if you're retrying something over and over again, then obviously it's going to cost more than you would have saved originally. Let's move on to the next topic then and keep it moving. So the next sort of area we wanted to talk about was workflow management. Nextflow comes with some degree of complexity, and is it worth it for the reproducibility promise that it makes? Let's start with Andrew with the pros. Yes, because it turns it into Lego, and you can have modules; it's very modular. And it's very easy to link things together and then see what that kind of graph looks like. Quite trivial, actually. And you can link things together and pass in output files and the like. Whereas in the olden days, if you've ever done workflow management with a dodgy bash script that you've written yourself, you'll know it is actually quite a complex task when you're linking things and making sure stuff is in the right place. And underlying all of that, of course, you may have different tools and different Docker containers, and you're moving data around to different places. And just bringing all that together into one very simple language, putting all the pieces together and saying what can be parallel and what can't, it's just phenomenal, and I think they've done it really, really well. Over to you, Lee. Yeah, so again, I'm going to stress that I am artificially taking the cons position here; in reality, I do like the way that this works. But okay, the cons. I would say that workflow management can be very difficult because you can't debug easily. As you are programming this, in my experience, you can make even a simple thing, maybe a 20-line process, and you run the process and Nextflow says that the process doesn't work, in so many words, and it's not clear why. Debugging is painful. And the community guidelines kind of tell you to go to AI tools: let the AI tell you what you did wrong. It won't. I actually think that that is kind of overkill. The actual error log output should tell you where to go in your problem. So you might have a workflow that is five different modules, and the debug output will narrow it down to one module, but where that error is, we don't know. You'll have to go to the AI. And that's a real issue, I think. So in that regard, you can make a Lego piece. Good metaphor.
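Circling back to the spot-instance and retry point from earlier in this exchange: the usual mitigation in Nextflow is its retry machinery, re-running a reclaimed or out-of-memory task a limited number of times and optionally escalating resources per attempt. The exit codes and numbers below are example values under assumed conditions, not a tested recommendation.

    // Illustrative config: retry tasks killed by preemption or the OOM killer (exit 143/137),
    // asking for more memory on each attempt; give up after two retries.
    process {
        errorStrategy = { task.exitStatus in [137, 143] ? 'retry' : 'finish' }
        maxRetries    = 2
        memory        = { 8.GB * task.attempt }
    }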
You can make a Lego piece in 10 minutes, maybe even five minutes, but then debugging where your issue is, that can take hours. Yeah. And that's always the way, isn't it? Regardless of whether it's Nextflow or any other way of running code, when you are dealing with this level of complexity, then it is going to take time. And I do admit that when you are going down many layers, when you're running everything in containers on remote machines, it is just more difficult to tease things out. And yet, they do need to make it a bit clearer. Maybe those AI tools can be built in automatically and just pop up a message in the command line or whatever, saying, hey, this is where your error is, this is the action you probably need to take. I think part of it comes from the pro and con of how it works. What it's doing for you is abstracting away a lot of boilerplate nonsense: package management, making sure things are glued together, handling outputs, handling resuming, handling resources on different architectures. That's great. But at the same time, because it's abstracting, it's obscuring things away from you, so then it's difficult to understand it when it doesn't work. The logging could be better. I haven't tried the AI tool, so I don't know how that would work. But then, on the other hand, we're talking about a process that wouldn't even be possible for some people, that they'd even be able to run the pipeline to begin with, because they would just be trying to hand-crank it in bash, making all sorts of mess. So I don't know. Again, I'm on the fence. It's neither here nor there. Thanks for joining us on this episode of the Microbial Bioinformatics podcast. We're going to split up this discussion into a few different rounds, so we hope you enjoyed this discussion on the good, the bad, and everything in between. We'll see you next time for the next part. Thank you so much for listening to our podcast. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. This podcast was recorded by the Microbial Bioinformatics Group. For more information, go to microbinfie.github.io. The opinions expressed here are our own and do not necessarily reflect the views of Origin Sciences, the Centre for Genomic Pathogen Surveillance, or CDC.