Hello, and thank you for listening to the MicroBinfie podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Both Andrew and Nabil work in the Quadram Institute in Norwich, UK, where they work on microbes in food and the impact on human health. I work at Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the U.S. Hello and welcome to the MicroBinfie podcast. Nabil and I are your co-hosts today, and we are joined by Drs. Erin Young and Kelsey Florek. Erin works as a bioinformatician at the Utah Department of Health. You might know her from her work on the Cecret pipeline for SARS-CoV-2. Kelsey is the senior genomics and data scientist at the Wisconsin State Laboratory of Hygiene and a steering committee member of StaPH-B. So first question goes to you, Kelsey. What is StaPH-B, and why was it founded? Thank you. Thanks for having me on here. So StaPH-B stands for the State Public Health Bioinformatics Workgroup, and it was really founded with the idea of bringing together collaborations between bioinformaticians and the bioinformatics that's happening at state public health laboratories. When it initially started, there were very few of us out there, and there wasn't a lot going on as far as communication and ideas on how to move the field forward. And so once we started getting together and collaborating on a Slack channel, it kind of just snowballed, and now we're up to almost 400 members. It's full of a lot of great people with a lot of differing levels of expertise, which I think is great because it provides a lot of learning opportunities.
So when I first started as a bioinformatician at UPHL, I was the only bioinformatician. So if I got stuck, it was nice to have a community to ask questions of, people who were dealing with the same things I was dealing with. So what are some of the activities that StaPH-B regularly gets involved with? It's really a conduit of communication between, again, like I said, laboratories that are on various projects, whether they're funded through NIH or CDC or other grant agencies. The idea is really to connect expertise and provide a working-level sense of where that expertise exists. A lot of laboratories are still getting into the game of sequencing and understanding the data that's coming off the machines. Having a resource or repository where people can connect, ask questions, and get answers about the ideas and projects going on in their lab, answers that may not necessarily exist elsewhere, is where StaPH-B has really been valuable. And in fact, just the Slack workspace alone has, I want to say, upwards of 50 channels, all oriented towards different activities that different work groups are working on. And they're really focused on trying to make that communication happen and make it easier for people who are trying out new activities, whether it's a new surveillance program or bringing on something that a laboratory may not have a lot of experience doing. So who can join StaPH-B? Is it open to anybody or is it sort of restricted to U.S. state health departments? Well, StaPH-B stands for State Public Health Bioinformatics. So that's the core audience. But for me, I'm not necessarily a gatekeeper. If that's a group of people that you work with or collaborate with, then I think you can join us. Yeah, as Erin said, it's open to everybody. It's the content that's in the workspace and the resources that are available.
They're really focused towards the state public health activities. So, you know, an academic bioinformatician may not get as much out of it as a state bioinformatician would. But there are still valuable resources there and practices that could be leveraged. Has anybody joined where you just said, wow, this famous person joined, or any other impression? It's a small community. So usually the new faces that we see are people who are just getting into the field of informatics or state public health. There have been a few moments where we've seen people join and it's great to have them involved in the project and see what they can contribute. Yeah, when I go to the about page on the StaPH-B website, I think it's hilarious, and I've known this for a little while, that it credits the founding of StaPH-B to my supervisor, Heather Carleton. I don't know. Do you have any comments on that? I think it's just really funny. It says Heather told us to stop bothering her and go talk amongst yourselves. Yeah, you know, I joined right after its founding, so I got to know everybody after this particular incident. But I think it's really telling. It was a time when there wasn't a lot of expertise in the state laboratories. Everybody was just getting started. And there were a lot of questions about various aspects of implementation. And even today, there are still some questions about standardization. And so this initial meeting of these people kicked off this idea of, hey, maybe you should start working among yourselves. You all have similar questions and may have different parts of the answer. What if you come together to the table and try to develop a solution yourselves? And that's really where StaPH-B has excelled.
A lot of the state public health laboratories work very closely with the CDC. We're involved with various CDC projects, and we're constantly adopting methods and practices that the CDC has so they work in our laboratories. And that translation is not always as smooth as we wish it could be. And so having this group, an organization where you have various people across the field who have all had experience dealing with these different problems and solving these different issues, makes it a really valuable resource. You can go and ask somebody, hey, I know you did this thing with AR bacteria and it worked really well. How did you get your workflow to translate to your compute environment? And how could we do that here in Wisconsin, or something like that? So having StaPH-B gives us that capability, and it also makes it easier for other labs down the road. So if a CDC project is very valuable and needs to be expanded, then working with a few pilot labs and StaPH-B helps orient that project better to how it might fit within a state laboratory and could help its success down the road. Yeah, so that ties into another question. What are the proudest achievements you've had from StaPH-B so far? So, I mean, you've been talking about collaboration, getting together, piloting software. What would you note as some of the biggest achievements? I would say the number one biggest achievement is the Slack workspace. It's phenomenal to see the conversations that happen there every day. Sometimes it's a lot more frequent than other times. Usually when there's a big new project coming out and there are a lot of questions, there's a flurry, but it's really great to see all of that collaboration and those activities happening. And so every time I see that there's an unread message on StaPH-B, it's the first thing I go look at.
I don't check my email as often as I check the StaPH-B Slack channel. So that's really telling. The other aspect I think has really been successful is the GitHub aspect of it. The collaborative workflows, how successful the Docker project has been. And it's been really great to see different laboratories coming together and working on workflows together and making things more accessible to laboratories. It's been really phenomenal to watch this project just grow and grow. I think one of the successes of the StaPH-B community is also the amount of training that we give each other. The success of the Docker repository wouldn't have been possible without a bunch of state public health bioinformaticians making a Docker image for the first time. And the workflows, the StaPH-B toolkit, and the trainings that go into helping these bioinformaticians use the command line and use these tools have, I think, been a great asset to public health. So for all of the listeners out there, what are some tips you can give, given your interaction with StaPH-B? Tips you could give people moving into this space? What would you have done differently, that kind of thing? Or what are some resources that you've discovered or developed yourselves that the general audience here could use? So one thing we've put together on the website is our videos, our monthly videos outlining various projects that are related to StaPH-B or state public health bioinformatics. Yeah, so that's a great resource for accessing some of the knowledge and expertise that's been developed at the state laboratories. For other groups trying to do something like this, you know, it really starts with a use case.
If there's a real need to start developing communication and building a centralized source of expertise that other people can then use, I think it just becomes a question of how you make that knowledge accessible and how you clear the pathways for communication. Slack has worked exceptionally well for us because we're such a disjointed group. All of us are part of different laboratories, and in each laboratory there are at most two or three bioinformaticians, and in many laboratories there's only one, if that. So having a common resource that everybody can get access to, that's easy to get access to, is really important. And I think one of the things that StaPH-B will continue to do going forward is make things more accessible. You know, more contribution guidelines, more examples, more tutorials, more training. And those are the aspects that I think will really help build this next generation of bioinformaticians and state data scientists, really. Yeah, on the website I spotted a few interesting trainings, so people can get a taste of what's up there. There are things like a session on SARS-CoV-2 data submissions, sequence similarity searching in terms of public health bioinformatics, and AMR detection. So the kind of things that people probably are running into day-to-day. There are a few seminars and blog-style, text-based trainings to play around with up on the site. And we'll drop a link to that in the show notes. So that all leads us to some of your more tangible things. And I know, Erin, a lot of people will know you as the person who came up with the Cecret pipeline. Do you want to tell us just some intro stuff about it? What is it? To do that, first I want to set the scene, because all workflows start with some sort of need or use case. So it was the spring of 2020. Things were beginning to get shut down and we were going to start sequencing SARS-CoV-2.
And the ARTIC group had this awesome protocol for sequencing SARS-CoV-2 on the Nanopore sequencing platform. And I don't have any complaints about that, but the way our public health laboratory was set up meant it would have been infinitely easier to sequence SARS-CoV-2 on the MiSeq. And so we developed a protocol to use the ARTIC primers, made that compatible with Illumina, and started sequencing on our Illumina MiSeq. But then there came the workflow aspect of it. The bioinformatic pipeline by the ARTIC group is really Nanopore-based. And so we needed something that was Illumina-based. And I tried out a bunch of tools. Kevin and I were researching all kinds of things. The first thing we noticed is the value of Ns in consensus sequences, because the first few things that we attempted would just fall back to the reference. If there was a SNP that was identified, the SNP would go into the sequence, but otherwise it would be the reference base, even at positions with no coverage. And that was a bit of a problem. And so we were going through a variety of different tools. And at that time, there wasn't a workflow that had the tools that we needed. And so Kevin developed a workflow. He named it Monroe. It was based on using Minimap2 as the aligner. And then I developed Cecret, which uses BWA as the default aligner. And there we go. It's essentially for some sort of small genome where the reads are amplicon-based. And so it takes these reads, maps them to a reference, trims the primers off, and creates a consensus sequence out of that. I have tried to make it as species-agnostic as I can. So it should work with other organisms, but the end user would need to provide a reference sequence and their primer BED files. So that's in essence what it is. It's for some sort of viral amplicon-based sequencing where there's a known reliable reference. The workflow has quite a bit of feature creep to it, a lot of which I have kept in for legacy reasons.
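[Editor's note] The point about Ns in consensus sequences can be illustrated with a minimal Python sketch. This is not code from Cecret or Monroe; the function name `call_consensus`, the `min_depth` threshold of 10, and the toy pileup are all assumptions made for illustration. It contrasts filling low-coverage positions with the reference base (the behavior described as a problem) against masking them with N.

```python
from collections import Counter


def call_consensus(reference, pileup, min_depth=10, fill_with_reference=False):
    """Build a consensus from per-position base counts.

    pileup[i] is a Counter of the bases observed at reference position i.
    Positions with fewer than min_depth reads are either masked with 'N'
    or, if fill_with_reference is True, silently filled with the reference.
    """
    if len(reference) != len(pileup):
        raise ValueError("reference and pileup must have the same length")
    consensus = []
    for ref_base, counts in zip(reference, pileup):
        depth = sum(counts.values())
        if depth == 0 or depth < min_depth:
            # Not enough evidence at this position.
            consensus.append(ref_base if fill_with_reference else "N")
        else:
            # Enough evidence: take the most frequently observed base.
            consensus.append(counts.most_common(1)[0][0])
    return "".join(consensus)


# Hypothetical six-base reference and read counts, for illustration only.
reference = "ACGTAC"
pileup = [
    Counter({"A": 30}),
    Counter({"C": 25}),
    Counter({"T": 40, "G": 2}),  # well-supported SNP: G -> T
    Counter({"T": 3}),           # too shallow to trust
    Counter(),                   # no coverage, e.g. an amplicon dropout
    Counter({"C": 12}),
]

masked = call_consensus(reference, pileup)
filled = call_consensus(reference, pileup, fill_with_reference=True)
print(masked)  # ACTNNC
print(filled)  # ACTTAC
```

In the filled version, the two unsupported positions look like confirmed reference bases, which is why masking with N is the more honest output when coverage is missing.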
When it was first developed, we weren't quite sure what QC metrics everybody was going to be using. So there are a lot of QC metrics still in there that people no longer care about. I don't think people are talking about the flagstat output from their BAM files anymore. So the feature creep, that is an interesting tidbit that I don't hear a lot. What have you kept in there? It still has samtools flagstat. And it has an option. Back at the beginning of the pandemic, especially when people were still understanding what a mutation was, many of these public health laboratories and bioinformaticians had never really encountered an indel before. Conceptually, what does that mean? And why is that important? What are these frame shifts? So say you use the default settings, you get a consensus sequence, you submit it somewhere, and it gets rejected because it's got a frame shift. There was a group of people who suggested that if you try a different aligner, then maybe that provides enough evidence to get that sequence accepted to GenBank or GISAID. And so that's why there's a toggle for which aligner to use in Cecret. You can either use BWA, which is the default, or you can use Minimap2. So there's that feature creep. People nowadays, if they're using Cecret, are generally just using the default settings, I think. Another feature that was added for the community, and then I don't think anybody used it, was the choice of iVar trim versus samtools ampliconclip, because there were some concerns about iVar trim. And so there was this other option that people could use. I haven't bothered to take it out because at this point, if I took it out, there would be that one person that I don't know about who's using that feature. And I don't actually want to make them mad. How many people do you think are using it? What's the scope of the number of people who would get mad if you just trimmed all this down?
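[Editor's note] As an aside on the frame shift discussion above, the core idea is that an insertion or deletion whose length is not a multiple of three shifts the reading frame of everything downstream in a coding region. The sketch below is an assumption-laden illustration, not how GenBank, GISAID, or Cecret check submissions: the function name `find_frameshift_indels` and the toy alignment are hypothetical, and it ignores where coding regions actually start and end.

```python
import re


def find_frameshift_indels(aligned_ref, aligned_query):
    """Report gap runs whose length is not a multiple of three.

    Both inputs are rows of a pairwise alignment with '-' for gaps.
    A gap run in the reference row is an insertion in the query;
    a gap run in the query row is a deletion.
    Returns (kind, alignment_column, length) tuples sorted by column.
    """
    if len(aligned_ref) != len(aligned_query):
        raise ValueError("aligned sequences must have the same length")
    hits = []
    for kind, row in (("insertion", aligned_ref), ("deletion", aligned_query)):
        for match in re.finditer(r"-+", row):
            length = match.end() - match.start()
            if length % 3 != 0:
                hits.append((kind, match.start(), length))
    return sorted(hits, key=lambda hit: hit[1])


# Toy alignment: a 3-base insertion (keeps the frame) and a 1-base deletion.
aligned_ref = "ATGGCC---AAATTTGGG"
aligned_query = "ATGGCCTTTAAAT-TGGG"

frameshifts = find_frameshift_indels(aligned_ref, aligned_query)
print(frameshifts)  # [('deletion', 13, 1)]
```

A single-base deletion like this one is sometimes real and sometimes an alignment or sequencing artifact, which is why trying a second aligner was suggested as a way to gather more evidence.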
I think most people have actually moved from Cecret to other workflows. nf-core has a very similar workflow called viralrecon that I think is a bit more popular. I don't see as many forks of Cecret nowadays, and these forks don't have as many changes. So I don't actually know how many people are using it, but I know there are more popular workflows out there. Oh, and Kevin's Monroe workflow ended up getting incorporated at Theiagen as their Titan workflow. And I don't know, maybe it's been renamed by now, but that one's also quite a bit more popular. Yeah, there's also the Nextflow ARTIC pipeline from Tom Connor's lab, written by Matt Bull. That's what we're using. Most of us in the UK are using this one. I think that's another flavor for people if they want to check that out as well. It sort of came about because we had to do something in a hurry. And yeah, you weren't there, Lee. It wasn't a fun time. Everyone was just trying to get something up and running and mashing together stuff that might help or might not help. And now it's matured, two years on. I feel kind of lucky that I didn't have to jump in right away. You're right. I wasn't there when the war started. It's definitely difficult to write software or maintain pipelines when your requirements around the organism itself, if you can even call it an organism, aren't nailed down yet. You don't even know what to expect. And then you had this midway-through-the-film twist where it decides, oh no, I'm not going to do this lineage stuff anymore. I'm going to start making variants. You have to think about variants. You're like, oh, okay. I was thumbing through some of our old outputs and it was interesting to see, right at the beginning, these really awful mapped BAM files and random directories. It looked like a mess. And then it starts to get more structured over time.
You see CIVET show up, you see Pangolin show up, you see Nextclade show up, you see tools being changed. You see iVar swapped out for FreeBayes, things like that, over time. You see this evolution in the data output. And yeah, just as a user, it sucked trying to deal with the changing, moving target. Adding onto that just a little bit, there have been interesting evolutions, not just in how we analyze COVID genomes, but in how we manage and work with data as well. It's a realization and a change that we've been experiencing pretty drastically in Wisconsin, where we've basically gone from sequencing at most 50 to 100 samples for an outbreak to 500 to 1,000 samples a week. And how you manage and work with that level of data is completely different. And so we're really trying to examine new ways of not just how we run our workflows and what tools we use to analyze genomes, but how we manage the data coming out of those workflows and how we connect that to public health. I want to hear more about what threw you for a loop as you were developing Cecret. What were you developing towards? And then what changed, and how did you have to change direction in the beginning of all this? Cecret is probably my first public workflow. And when I first put it up there and started sharing it on GitHub, I was absolutely paranoid that I was doing something wrong. And so every time somebody would fork my repository or whatever, I would track it down and see what changes they were making and try to figure out why they were making those changes, because generally the only feedback I would get was, oh, this isn't working. I wanted to create something that was usable for everybody, but also something that made sense scientifically and analytically, so that it would be not just user-friendly, but also useful in understanding the pandemic.
So having my code so public and so open to everybody and evaluated by so many people was a very different experience, but most people didn't actually have the problems with my code that I thought they would. Not that I'm as much of an expert at Python or Groovy or Nextflow as I could be. As for a turning point in the workflow, I don't really know that there was one. A lot of the shifts have been very gradual. The workflow today is very, very similar to what it was back in the summer of 2020, just with fewer bugs. And the use case for it is still essentially the same. And so I can't really say that there's been a huge dramatic turning point. This is a good lesson for everybody that, you know, you can just start messing around with some software that is crucial to you and it can be important for everybody, and the lesson is just, you know, don't be scared of that stage fright, I guess. I think you must've overcome it by now, because it's an incredibly popular repo and people are using it. My heart still panics every time I see a new issue. Okay. Well, I think that at this point, someone who's listening can either try it out or not, or send you a message saying how much they've enjoyed it. It's about time everybody wrote back to Erin to thank her for it instead of complaining about the features. All right. I had one final question, if that's all right. I want to know what the name means, because it's not "secret" as in top secret. It's Cecret, as in C-E-C-R-E-T. So there it was, beginning of the pandemic, and the name SARS-CoV-2 hadn't been nailed down yet. And so it was the novel coronavirus. And I came up with this really cool acronym that matched a landmark in Northern Utah. Cecret Lake is a really beautiful hike. If you're ever in Northern Utah, I recommend it.
It's a moderate to easy hike, and if you have children, they might be able to go on it, depending on how athletic your children are. And so it had an acronym. And then I went on maternity leave and I couldn't find where I had written the acronym down. It had something to do with COVID and "enriched". And I thought about trying to recreate it, but it doesn't seem to matter anymore. Okay. And so on that bombshell, we'll draw this episode to a close. This has been a COVID-enriched episode. I appreciate you guys coming on and I appreciate everyone listening to the episode. We'll see you next time. Thank you so much for listening to us at home. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at @microbinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.