[00:00:02] [Speaker A]: Hello, and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody really writes it down. There's no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. I am Dr. Lee Katz. My co-hosts are Dr. Nabil-Fareed Alikhan and Professor Andrew Page. Nabil is a senior bioinformatician at the Centre for Genomic Pathogen Surveillance at the University of Oxford. Andrew is the CTO at Origin Sciences and a visiting professor at the University of East Anglia. Hello and welcome to the Microbial Bioinformatics Podcast. I'm Dr. Katz, and joining me are your co-hosts, Nabil and Andrew. Today we will continue going through Nextflow: the good, the bad, and the ugly. We have a few more points of comparison, and once again I will volunteer to take the cons position. That is not necessarily my opinion; I'm just going to take that stance in this debate.
[00:01:04] [Speaker B]: Dr. Katz, by the end of this, this is going to be a meme. All right, I think where we left off was talking about scalability and maintenance. With rapid evolution, it's always changing, or there are new things coming online. Does that help or hurt the long-term stability of what users are doing?
[00:01:22] [Speaker C]: I guess it does help, in a way, that there are commercial companies behind it and it's got a community of people around Nextflow. When I worked at the Sanger Institute, there were, I think, seven or eight different workflow managers that had been developed internally, all of them equally as poor as the next. It's nice, ten years later, to see that the community overall has coalesced around one tool and one workflow manager, and all the activity is there: all the bugs are being ironed out, all the features are being added. There are other tools, of course; it's not just a Nextflow world, there are many others. But it's nice to have that scale, and it does bring stability. It does mean that it's not going to disappear tomorrow because someone's PhD has finished or a grant has ended.
[00:02:16] [Speaker A]: Yeah, so it does update very frequently. The good part is that it does pay attention to versioning, especially if you're following the nf-core standard. But if you're not, that's going to frequently break your workflows. Even if you are paying attention to the versioning, updating to the latest and greatest at the rate they keep developing at is a good thing, but it's also a bad thing, because you often find yourself behind. If you do want to update to the latest version and you have a professional pipeline, you need to keep updating and testing it, and that requires a lot of effort. If you are trying to keep up to date all the time, take the scenario where nf-core has an assembler, let's say SPAdes, and it has a module, so you can use the nf-core SPAdes module, and that module gets updated. Now let's say in this scenario you also have your own forked version of the module, and you've updated it just a little bit. Every single time they update it, you have to take their new code and manually put in the new stuff. Maybe that would be a git merge, but you'd have to look at it by eye and make sure you merged it correctly. And you'd have to keep testing it, and that does require a lot of work. So I think the rapid development is fine, and the good part is that it can happen separately from you, but if you are trying to keep up, it is going to be very hard.
[00:03:59] [Speaker C]: I guess you should be pushing your changes upstream to the open-source project rather than hiding them away in some government lab where no one can ever see them again. The fact that you're editing the code is really your fault; you should be pushing it and making pull requests so everyone else can benefit from your feature enhancements.
[00:04:21] [Speaker A]: All right, touché. Yeah, you should be. Or we should be updating. They are very nice in their stance of accepting open-source contributions, so I do appreciate that.
[00:04:31] [Speaker B]: I think one of the tradeoffs, though, is that now that you have a community and a group that has coalesced around the tool, you've lost a bit of the ability to do those tweaks. If you follow the rank-and-file nf-core specification, which you're encouraged to do, it might not necessarily be beneficial or optimal for what you're trying to do, but if you want to be in line with the standard, you stay in line with the standard. And it's a very opinionated standard: what constitutes a verified nf-core pipeline is very particular, and I've spoken to plenty of people who don't like that opinion, don't agree with it, and quite happily diverge from it. They do what Lee's doing: they take the thing as a template and then go off in their own direction, with the proviso, obviously, that if it breaks or something goes wrong, they have to deal with it themselves. All right, well, let's talk about one more topic. Andrew raised a nice point earlier about how it used to be, and maybe we can touch on that in the closing few minutes as well. But let's go to one of our last topics, on ecosystem, AI, and cost, which is always very, very important. Is Nextflow still relevant now that you've got AI that can just vomit out a Python script that does a lot of these things? How does the cost compare? Let's dig into some of that.
[00:06:09] [Speaker C]: Well, as a start, I think nf-core has actually been fantastic for the community. There are so many standard things that we have to do, and having people all come together and make one standard pipeline is great. I've been in labs where the standard rite of passage for a postdoc is to go and write their own SNP caller or their own mapping pipeline. We don't need to do that anymore; we can just use what is out there. We don't have to worry about all the esoteric little tweaks that have to happen for this weird and wonderful use case; it is just built in. And so nf-core is now my go-to for most standard tasks, because there is stuff there, and sometimes, given the choice between writing my own thing that gives a slightly different result versus using an nf-core pipeline, I'd go for nf-core, because I know I'm not going to be spending weeks doing stuff around it. And Seqera Tower, or Nextflow Tower as it used to be called, is great for making that easy. You just pop in the repository or whatever that you want to use, it goes and kind of magically makes it work with your stuff, and then it gives you everything back; your results are just magically there. And if you want to use a different cloud, you can, although it can be a little bit pricey, but we won't talk about that.
[00:07:41] [Speaker A]: You won't talk about that? So it is pricey, yeah. But I'll hit on what you're saying there, which is that if you want to do a quick mapping step or assembly step or whatever, you can quickly just type out "nf-core", space, and the whole command to run your pipeline. It is fantastic there. However, if you do want to level up and go to Seqera Tower, or is it called Seqera or Nextflow Tower at this point?
[00:08:12] [Speaker C]: Seqera Tower.
[00:08:13] [Speaker A]: Okay. It is expensive, so it's not for casual stuff. If you want to go professional, like really level up, it can cost a lot. And when you do bring that in as your whole standard, you're now looking at a whole suite of tools that are AI-based, because, in my cons opinion, you have an ecosystem where the error output is not easy to look at, and so you do need to pass it off to Copilot or Claude or ChatGPT to help understand it. Now those tools, Copilot and Claude, become another layer of boilerplate, where Nextflow was meant to simplify things. Now you have this AI stuff, and one tool that I love coming out of the Nextflow community is MultiQC. I'm sure we've talked about it before; I love it. But now you get these AI prompts inside of MultiQC, and it's kind of terrifying, because the scenario is that you are done with your pipeline, you have a report in the MultiQC web page, and you send it out to the world, right? You send it to your leadership, you send it to your collaborators, to a molecular biologist, someone who doesn't understand programming but does understand the topic. They know they're looking at reads, but there's a little prompt on the side that lets you have AI discuss the results. So now it goes from someone like me, an expert in what read cleaning is, to the next person, and that person writes into MultiQC, they might ask it to interpret the read cleaning, and it's not me helping them, it's whatever the machine is telling them. And that's a little bit terrifying to me.
[00:10:04] [Speaker C]: Yeah, and I guess the nuance is lost there, because obviously if you're analysing Salmonella, you will have very different considerations in your mind than if you're analysing a virus that's been amplified. The data will look different, or this data might look the same but you might be looking for different things, and it's terrifying that an AI will be, I guess, confidently pretending to be an expert but not understanding the nuances that an actual subject matter expert will have. So it is a bit terrifying that it's just there and you click a button. Nor can it read the graphs and stuff like that, or pull out data, or understand that maybe, I don't know, your target insert size was 400 going in from the lab, and it's come out as 200, and it says, oh, all is great. Actually, no, it's not; something's gone terribly wrong there with the experiment.
[00:11:00] [Speaker B]: Yeah, I don't know. Part of it comes back to the fact that the AI tools are pretty good at producing a boilerplate script that replicates a lot of the stuff you'd use Nextflow for. So we say Nextflow is good because it saves you from silly, obvious mistakes, like naming files wrong so that this runs and this doesn't, but you could have something that just generates it on the fly. Like you were saying yourself previously, Andrew, it generates the Nextflow workflow for you. But if the AI is generating the Nextflow workflow for you, why have it? Why specifically Nextflow, why not anything? What's the point?
[00:11:42] [Speaker C]: True. I guess there are other companies out there now trying to build different workflow managers and systems on top of workflow managers, and from what I've seen, none of them have really reached the bar. But theoretically, if there's enough documentation out there, enough training material, AI could go and build all of that automatically, being platform agnostic. So we're multiple levels of agnostic, and we're meta, meta, meta. But fundamentally, AI tools are here to stay, and we need to learn to work with them in a reasonable manner. I am terrified, as I said before, about the next generation coming along who are too addicted to them and don't understand how things actually work underneath, because things go wrong, and we see it all the time. Things do go wrong, and it makes incorrect assumptions, or it makes very, very subtle errors. But saying that, sorry, as an aside, Claude is amazing. So if you're going to choose an AI, Claude Sonnet 4 is my current go-to for Copilot; I find it much better than ChatGPT-5. But anyway, back to the actual topic. Yes, it is so easy now to generate Nextflow code. You can get AI to generate your script, to generate your code, to run your code, to interpret the output, to make a pretty picture. Yes, maybe we don't need these things in the future.
[00:13:10] [Speaker B]: No, but at that rate we don't need bioinformaticians, and we don't need the podcast either, so what...
[00:13:14] [Speaker C]: Yeah, we're unemployed.
[00:13:15] [Speaker B]: ...are we even doing here? Well, on that negative note, if we have a couple of minutes, I'd like to put some context on what the alternatives were, or how we used to do it, and maybe that'll help people understand the positions we've taken with Nextflow. So, Andrew, you've been around the block. How does something like Nextflow compare, what are some of the pros, against things like Bash or Galaxy, or even other workflow languages, Snakemake if anyone has any experience, and so on?
[00:13:51] [Speaker C]: The first ones I worked with were written in Perl, and they ran on HPCs. I think I went through three different Perl workflow managers when I started off, and they got better and better, but they were internal things running in only one organisation. They did the job, they scaled to whatever the state of the art was at that point, tens of thousands of samples, but then it kind of tipped over and I realised, hang on a second, doing something alone within one institution is not good. That's when I found Galaxy, and Galaxy is amazing, and we brought it into the Quadram Institute, and that enabled quite a lot of work to get done: standard pipelines, standard tools set up, and a web interface people could use, so biologists with very little technical knowledge could go and do bioinformatics for the first time. And then it's going that next step into Nextflow, which I think is the next evolution. It makes it even easier and even more scalable in a cloud environment, where you can scale not just to the HPC you have in the basement but to potentially millions of nodes, if you have a credit card big enough.
[00:15:04] [Speaker B]: Lee, anything you want to add about Nextflow and other alternatives?
[00:15:09] [Speaker A]: I'm just so curious what the Perl workflow managers were. I didn't even know. Is that DRMAA or something?
[00:15:17] [Speaker C]: They were internal to Sanger. I can't remember what the first one was called. One was called vr-codebase, and we used that in Pathogen Informatics for many years, and that was a pain, because I edited a lot of code in it that was inherited from the Thousand Genomes Project. Then the Thousand Genomes Project made VRTrack, I think it was called, which was like the next evolution of it, and that was for whatever the next 100,000-genome project was. VR meant vertebrate resequencing, meaning human. They built that, and it was made a little bit more cloud-friendly over time, but then Nextflow came along and just blew everything out of the water, because everyone coalesced around it. But Perl is the best.
[00:16:03] [Speaker A]: Perl's the best. Thank you.
[00:16:04] [Speaker B]: It's always the best. All right.
[00:16:06] [Speaker A]: Every now and then I try to look at Perl and I'm like, I could probably make a workflow manager out of this. And then I'm like, oh no, this actually is very hard; Nextflow probably is the best thing we can do. So I feel like there's always going to be that desire, that fire in me, to build something better. But I can't; this really is the best thing we can do, in this complicated Java-Groovy system. In the past I've worked with other workflow managers, actually. We tried our hand at Bpipe, which was also Java-based, and unfortunately it crashed too many times on us and we just stopped. That was one, and we tried... what was the other one? But they all ultimately were just not stable, and they were buried in Java, which I only have a basic understanding of. So I agree that we have the next best thing, which is Nextflow. The only other real alternative I can think of, and there are other alternatives out there, is WDL, or "widdle", and I think people are making that work really well for them too.
[00:17:28] [Speaker C]: Oh yeah, I totally forgot about that. Terra, yeah, I worked on that for two years. It's being used quite widely in public health, and again, it's from the Broad Institute, and Theiagen work a lot with it in public health. What they do is, it's kind of just an online web interface, and you can upload stuff and run it through standard validated pipelines, and biologists and epis can just point and click and magically it works. And it works really, really well. It's different from Nextflow, with a very different language. It has its own quirks, but it has its own strengths, because it's a very sample-centric world. So for your sample, this is the analysis output, whereas with Nextflow it's just, I've done an analysis, I don't care what sample it is. And linking those up is actually a challenge that the user has to solve: you have to remember that you've run this sample through 10 different analyses in 20 different directories, and pulling it all together is actually a bit of a pain. That's something that's missing from Seqera Tower; we don't have that kind of data management of results, and we need it. If anyone has solutions to that, please tell me.
[00:18:41] [Speaker A]: You know what WDL is really good at? WDL is wildly successful in making its name an adjective, and I want to do that in the future, because you make a "widdle" pipeline. What an amazing way to put it. I can't think of another piece of software whose name works as an adjective. Sorry, I didn't mean to interrupt you, Nabil. Go back.
[00:19:03] [Speaker B]: No, no, that's fine, it's really insightful. I was thinking a widdle... like a "widdle pipeline". Yeah, fantastic, I just realised that. I've played around with different things, and they all have their pros and cons. Ultimately, I would not go and say Nextflow is for everyone. There is a learning curve; it requires that you're working at a particular scale and that your infrastructure is set up a particular way. For someone who's just starting out it might be a lot, and their journey may take them towards Snakemake, or towards just custom Python scripts to run on their laptop, and that's it, that's as far as it goes. Terra.bio, I've never used it. I've seen demos of it. It looks interesting.
[00:20:00] [Speaker A]: It does seem to handle that storage issue better, which is interesting, so check that out if you're interested in this sort of thing. There's WDL, there's CWL, there are plenty of workflow languages out there. I'm sort of interested in the ones that are not process-based but task-based or data-based, where they take a record, they have an understanding of what the state of a complete data point should be, and they go off and run processes to sort of backfill. These are things like Airflow, or the DAGs that people use in the data science and analytics community. It's a different way of approaching it: you don't have to do a workflow as a process, like running through a list of things. So it's a very interesting space out there, and I think there's something in there for everybody if you just have a look around.
[00:20:50] [Speaker B]: Can you draw that out a little bit more? I wasn't aware of that. Data-based?
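To make the task-based or data-based idea concrete, here is a minimal Python sketch of the pattern described above and unpacked in the next answer: a table of records where any empty or invalid cell triggers the task that backfills it. The table, the count_reads task, and the validation rule are hypothetical illustrations, not any specific tool discussed in the episode.

```python
import pandas as pd

# Hypothetical table of records: one row per sample, one column per value
# that some task is responsible for producing.
records = pd.DataFrame({
    "sample": ["S1", "S2", "S3"],
    "n_reads": [1_200_000, None, "not_an_int"],  # one good, one missing, one invalid
})

def count_reads(sample_id: str) -> int:
    """Hypothetical task that (re)computes the value for one record.
    In a real setting this would run a tool; here it returns a placeholder."""
    return 1_000_000

def is_valid(value) -> bool:
    """Validation rule for the field: must be a non-missing, non-negative number."""
    return isinstance(value, (int, float)) and not pd.isna(value) and value >= 0

# The "workflow" is simply: find rows whose field is empty or fails validation,
# then run the task that backfills them, restoring the integrity of the table.
for idx, row in records.iterrows():
    if not is_valid(row["n_reads"]):
        records.at[idx, "n_reads"] = count_reads(row["sample"])

print(records)
```

This only shows the flavour of the approach; real systems in this space layer scheduling, dependency graphs, and retries on top of the same "declare the complete state, backfill what's missing" idea.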
[00:20:56] [Speaker A]: Task-based or data-based, where you basically have a data frame of records, and in that data frame you'd have a column, and there's a value needed in that space, right? So you would have a process of, oh, if this value is empty, go and do this, and that's how it ensures completeness. And you would have validation, like, if the value in this field isn't an integer, you need to go and do this and fill it.
[00:21:23] [Speaker B]: That's cool.
[00:21:23] [Speaker A]: And so you specify the tasks, and the whole thing is about the integrity of the data frame itself that you define; you go off and fill and fix all of those issues. And if you have multiple data frames, like a master table with all your numbers and a sub-table that's being used to make a report, it can have those relationships built in, and it checks: hey, this is out of date, this is out of sync, this isn't in line with the master table. So you can set up all of this relational stuff in other workflow specifications. Does it make sense for bioinformatics? I don't know, I haven't seen an implementation that makes sense yet, but there's a lot going on in this space. Data is everywhere, and everyone is trying to keep on top of it.
[00:22:10] [Speaker B]: So, just changing the topic slightly, where do you think Make and Snakemake fall in this? Because I actually think Make is an excellent workflow manager, but everyone always shoots me down for that.
[00:22:22] [Speaker A]: I think Snakemake is great. It's something that someone who knows a little bit of programming can be up and running with in an afternoon. It is really good for someone who doesn't want to learn how to do threading properly or worry about resource management, who wants to stay in Python on their machine and maybe share it with a couple of people in their group; that would be perfectly fine. It makes a lot of sense for that. And I've seen some production workflows in it that seem to work okay, so yeah, it's fine.
[00:22:54] [Speaker B]: All right. Well, thanks for helping me with my curiosity on that topic. You guys want to wrap it up?
[00:23:00] [Speaker A]: Yeah, I mean, this is just three guys talking smack in a room. If you have things you want to say, get in touch, say things at us, whatever. What's your Nextflow experience in 2025? You've heard what we think, so we'll pass it over to you. Thanks for joining us for this episode of the Microbial Bioinformatics podcast. We hope you enjoyed our discussion of Nextflow, the good, the bad, and everything in between, and we'll see you next time.
[00:23:32] [Speaker C]: Thank you so much for listening to our podcast. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. This podcast was recorded by the Microbial Bioinformatics Group. For more information, go to microbinfie.github.io. The opinions expressed here are our own and do not necessarily reflect the views of Origin Sciences, the Centre for Genomic Pathogen Surveillance, or the CDC.
[00:24:08] [Speaker B]: I guess one thing you need to consider is that... I've forgotten what I was going to say.
[00:24:18] [Speaker A]: That must be very important, then.