Hello, and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody really writes it down. There's no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. I am Dr. Lee Katz. My co-hosts are Dr. Nabil-Fareed Alikhan and Professor Andrew Page. Nabil is a senior bioinformatician at the Centre for Genomic Pathogen Surveillance at the University of Oxford. Andrew is the CTO at Origin Sciences and Visiting Professor at the University of East Anglia.

Hello and welcome to the Microbial Bioinformatics podcast. I'm Lee Katz, and joining me are your co-hosts, Nabil and Andrew. Today we are continuing to dive into one of the most talked-about topics in the bioinformatics community: Nextflow. In the past, we've all been in support of Nextflow, and of workflow languages in general, as a means of organizing your work. Today we're going to have a bit of a debate on the pros and cons. We'll ask the big question: if you're starting to write a new bioinformatics pipeline, is Nextflow still the best choice, or have AI tools or lighter workflow systems finally caught up? Again, I'm going to take the cons side, if you're both okay with that, and argue the cons even if they're not necessarily my opinion.

Well, thanks for playing devil's advocate, Lee. Let's kick off with one of the things we wanted to discuss. One of the key selling points of Nextflow is portability: the fact that you can develop a pipeline on your laptop, change a toggle and it's working on an HPC, change another toggle and it's working in the cloud, and so on. Do we feel that portability is one of Nextflow's strong suits, or are there chinks in the armour? Andrew, let's have some pros. Yes.
From my personal experience, I've built a Nextflow pipeline, I think it was for mapping human reads or something similarly straightforward, and I built it on a VM in Google Cloud. Then I just flicked a switch and it was parallelized on Google Batch, their scheduler. A few months later, I went over to AWS and ran the same pipeline there through Seqera Tower without changing the code. So I've tried it in three different setups on two different clouds, and it just worked. It proves that the portability is actual, it's real, not theoretical, because a lot of things are theoretical, and I've used it in anger. So they've done a good job. I think a lot of it comes down to containers, which is where all of the hard work is actually done. Obviously people have done a lot of work to containerize every bioinformatics tool out there, and once you have that, running everything and linking it all together is the next piece of the puzzle, and I think Nextflow has got that down. So, well done, lads.

Yeah, I do like how you can make it portable. When we're writing our Nextflow pipelines, one of the main selling points I give to my leadership is that we can use a pipeline the same way here as there. But I'm going to tell you the issues I have as well. What's hard for me is that sometimes I have to troubleshoot something, and the executable involved is layers deep. I might have an issue with script.py, some part of the open-source package I'm using. I can look into the Nextflow code and see that it's calling script.py, but where is that? It's not in the GitHub repo for this pipeline. It's in the container, or it's in the conda package. And usually, when you're running something in production, it's in the Docker container, not just in conda, right?
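The "flick a switch" portability Andrew describes is typically done with Nextflow configuration profiles chosen at run time. A minimal sketch of what such a `nextflow.config` might look like (the project ID, queue name, and buckets are hypothetical placeholders, not from the episode):

```groovy
// nextflow.config: one pipeline, several execution back-ends
docker.enabled = true   // every process runs in its container

profiles {
    // Local development on a laptop or a single VM
    standard {
        process.executor = 'local'
    }
    // Google Batch (project and bucket are hypothetical)
    gcb {
        process.executor = 'google-batch'
        google.project   = 'my-project-id'
        workDir          = 'gs://my-bucket/work'
    }
    // AWS Batch (queue and bucket are hypothetical)
    awsbatch {
        process.executor = 'awsbatch'
        process.queue    = 'my-batch-queue'
        workDir          = 's3://my-bucket/work'
    }
}
```

Switching clouds is then just `nextflow run main.nf -profile gcb` versus `-profile awsbatch`; the pipeline code itself is untouched.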
So I'll have to download the Docker container myself, run something like a Docker shell, go in there, see what the path is inside the container, find the executable in that file system, and then look at it. And even then I can't edit it, because it's a static container. So the troubleshooting is pretty hard. And if you actually do need to make a fix, then you need to put out a new container. The better situation is that you find out how the script works and you just fix the Nextflow code. But then you're back to what we talked about last time: you're debugging Nextflow code, and you're going to get really tough error logging. So I think the portability is fine, but it's obfuscated code.

I mean, on the other hand, you're not having to deal with installing different tools that behave differently on different architectures, right? Because that's all abstracted away by the fact that you're running things in containers. I think Nextflow leverages containers really well. Most of the pros and cons we're talking about come from the fact that you're using containerization in the first place, and Nextflow leverages it really, really well, which is great. And you have all of the issues we were just describing: yes, it's a bit difficult to understand the environment of the container versus the host, but you don't have to think about the host, which is good. I'm curious, Andrew, when you were running it on the cloud, in the example you gave, what were the containers being run on? Was it Kubernetes, or AWS Batch, sorry, Google Batch? How were they actually being executed?

So one of them was on AWS Batch, and actually I just remembered now that I messed up: it had millions of lines of logging output, which then went into the cloud logger, which I didn't realize until we got this, you know, thousand-pound bill kind of deal.
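The container-spelunking workflow Lee describes can be sketched roughly as follows in the shell (the image name is hypothetical; `script.py` is the example script from the discussion):

```shell
# Pull the same image the pipeline used (image name is hypothetical)
docker pull quay.io/biocontainers/sometool:1.0

# Open an interactive shell inside a throwaway container
docker run --rm -it --entrypoint /bin/bash quay.io/biocontainers/sometool:1.0

# Inside the container: locate the script and read it
which script.py            # find it on $PATH
cat "$(which script.py)"   # inspect the code (read-only; any edits die with the container)
```

As Lee notes, anything you change inside the running container is lost when it exits; a real fix means rebuilding and publishing a new image.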
So yes, maybe there's too much debugging information there, but you do have to be careful about those kinds of quirks: you may not be logging to a disposable file system on a VM, you may be logging to the cloud, where you're charged per operation.

There's one more part of portability I forgot to bring up, which is the config file. Maybe you'll have thoughts on this too, Andrew. I have found it difficult to know exactly how to write your config file and what to put into your GitHub repo. The code itself is portable, but you have to describe your architecture, and that can't necessarily go into your repo, or sometimes it does, and that's confusing. And even if it doesn't go into your repo, it can be sensitive information: it might be risky to put into a GitHub repo that you have this many nodes on your HPC, or this many CPUs, or other details about it. So over to you.

That is very interesting. Yes, you have to be careful. Okay. With great power comes great responsibility, and I guess we can just leave it at that. Let's talk a little more about robustness and everyone's favourite topic, documentation. Nextflow has, as we were saying, a sort of robustness built in, or at least it promises to. Is it worth the effort? Because you do have to work the way Nextflow wants you to. So let's have some pros from Andrew.

So when it works, it's fantastic. And I feel a bit worried for the next generation of programmers and IT people and bioinformaticians coming along, because they're all using ChatGPT and Copilot, and they don't actually understand how things work or understand the code. They would struggle to put everything together from scratch, because they're just saying, dear Copilot or ChatGPT, make this work, and oh, there's an error, tell me where the error is and how to fix it, or fix it for me. So yeah, that's a problem.
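One common way to handle the sensitive-config concern Lee raised earlier is to keep the repo's `nextflow.config` generic and layer a git-ignored, site-specific file on top at run time. A minimal sketch (all values here are hypothetical):

```groovy
// site.config: kept out of version control (e.g. listed in .gitignore)
process {
    executor = 'slurm'
    queue    = 'long'          // cluster queue name
    withLabel: 'big_mem' {
        memory = '500 GB'      // reveals node sizing, so keep it private
        cpus   = 64
    }
}
```

Running `nextflow run main.nf -c site.config` then applies these settings over the repo's defaults without committing any cluster details to GitHub.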
But Nextflow, luckily, is mostly fault-tolerant. When an error does occur, you can restart it, so you don't necessarily have to go back to the beginning. There are checkpoints along the way, and it works, to a degree. But there are many, many times when that doesn't work, particularly if you're using spot instances and a job gets interrupted; if it gets interrupted a few times, the run might fall over, or maybe you run out of memory, or you run out of disk space, or whatever. You can fix these things incrementally. But sometimes I've found that, maybe because of a bug, it goes and asks for more memory than is available in AWS, like 100 terabytes of RAM or something crazy like that. So the robustness comes with a responsibility to actually tell it the right information up front.

In terms of documentation, the documentation is okay. It's better than most, to be honest, particularly the nf-core stuff. And again, AI can document a lot of things for you, but only to a degree; you have to give it the right information to start off with. And there's a huge amount of documentation out there, so it can take time. There are podcasts and YouTube videos and everything, which help you navigate it if you don't like reading pointless documentation, because it all should just work initially. But I find it to be good.

Yeah, the documentation is excellent. And when they put out those short YouTube videos, those have been helpful for me too. Okay, so I'm going to take that stance again, I'm going to do the cons list again, so not necessarily my opinion. The documentation can be overwhelming. Even though it's all there for you, it takes a while to go through it, internalize those lessons for yourself, and really learn it. So it's overwhelming documentation, and it's a steep learning curve to even get to the modules and configuration part.
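The checkpointing Andrew mentions earlier is Nextflow's `-resume` flag (`nextflow run main.nf -resume` reuses cached task results rather than recomputing them), and the out-of-memory case is commonly handled with a retry strategy that escalates resources per attempt. A sketch of the relevant process directives (the process name and sizes are hypothetical):

```groovy
process assemble {
    // Re-run failed tasks with more memory, but cap the number of retries
    errorStrategy 'retry'
    maxRetries    2
    memory        { 8.GB * task.attempt }   // 8 GB, then 16 GB, then 24 GB

    // ... inputs, outputs, and script omitted for brevity
}
```

Recent Nextflow releases also support a `resourceLimits` process setting to cap requests, which guards against the kind of runaway 100-terabyte request Andrew describes.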
And even after you learn it all and become a beginner, or an intermediate, it's still time-consuming to find relevant information. If you need to look up, I don't know, what the shell block does, it might take you a while just to search around; if you need to find out how to put a variable into the shell block, it might take you a while to find that information. And there's a lot of more advanced information that you'll have to find either there or in the special videos they put out.

But I guess a lot of people aren't developing pipelines, they're just using them. So they won't care about documentation, because in most cases it'll just work. They're the end users: they go into some fancy GUI like Seqera Tower, click a button, magic happens, and they get the results at the other end. It's what we would all love to have happen all the time in bioinformatics, and we're getting steps closer to that. Whereas all of us on this call are bioinformatics developers, and we enjoy getting deep into the weeds and the nitty-gritty. We're constantly tweaking things and extending things and reinventing things and whatnot, and we're at the bleeding edge of science. So we're maybe a slightly different use case from the normal bioinformatician, if you know what I mean.

Fair points. But because Nextflow exposes all the modularity of the pipeline, the implicit assumption is that you're going to take a pipeline from somewhere, say you pick up the nf-core one, and you're going to tweak it. So you need to know enough to be able to make a change. Maybe you want to add a new output file. Maybe you want to run a tool with a different flag. Maybe you want to collect things into a CSV in a particular way.
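On the earlier "variable into the shell block" question: in a `script` block, Groovy variables interpolate with `$var`, while the `shell` block uses `!{var}` for Nextflow variables so that plain `$var` is left for Bash. A minimal sketch (process and variable names are hypothetical):

```groovy
process greet {
    input:
    val sample_id

    output:
    stdout

    // In a shell block, !{...} is Nextflow; $... stays Bash.
    // Note the single-quoted triple quotes, so Groovy does not interpolate.
    shell:
    '''
    host=$(hostname)
    echo "Processing sample !{sample_id} on $host"
    '''
}
```

This separation is the main reason to prefer `shell` over `script` when your command mixes pipeline variables with Bash's own `$` syntax.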
That takes a fair bit of time to get your head around. I appreciate that the AI tools seem to know a fair bit and can generate stuff for you now, and there's a lot of documentation that does explain this, but it's sort of difficult to find. I wish there was a bit more help with the syntax to make it easier, some more helper functions or something, because, yeah, people hit a wall pretty quickly. I think part of it is that when you have all of these different commands, conceptually, how are you meant to operate with them? What's the paradigm of how these things are meant to interact with each other? I don't think it's very well explained. The documentation is like a reference book, and then there are tutorials with worked examples. But it seems almost as if there should be a separate course that teaches, philosophically, what these things you're working with are and how they interact with each other. The fact that it's Groovy, a language most of us are not familiar with, also doesn't help, but whatever, it is what it is.

Absolutely. So let's leave it there. Yeah, let's leave it there. We'll cross into the next topic, which is scalability and maintenance.

Thank you for joining us on this episode of the Microbial Bioinformatics podcast. We hope you enjoyed our second discussion on Nextflow: the good, the bad, and everything in between. Thank you so much for listening to our podcast. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. This podcast was recorded by the Microbial Bioinformatics Group. For more information, go to microbinfie.github.io. The opinions expressed here are our own and do not necessarily reflect the views of Origin Sciences, the Centre for Genomic Pathogen Surveillance, or CDC.