Hello, and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There's so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Andrew and Nabil work in the Quadram Institute in Norwich, UK, where they work on microbes in food and the impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the U.S. Hi, I am Lee Katz, and welcome to the MicroBinfie podcast. Andrew and I are here to talk with Dr. Finlay Maguire. Nabil is on holiday at the moment. Finn is a bioinformatics researcher in the computer science department at Dalhousie University. Some of his activities include working with IRIDA and the CARD AMR database. Finn also works with nonprofits and social scientists to try to help them apply data analysis methods in their work. Today, we'll dive into how fast you can go from training to getting an impactful research publication. Andrew, so how fast is it to get an impactful research publication on something you just learned how to do? Approximately four and a half days. I jest. So this stems from, I had a visitor one time, who turned up to learn bioinformatics. He had no background in it, and he expected to learn all of bioinformatics and get, like, his Nature paper in the space of a week, four and a half days to be exact. Of course, that's not really realistic. It takes slightly longer than that to become trained in the area. You want to get a Nature paper on bioinformatics and you only have a week. Well, what do you do? 
One of the things that I've found by teaching at Dalhousie a bit is we do a lot of graduate courses that are cross-listed to CS and life sciences, and it's a bit of a nightmare trying to teach to both halves of the class at the same time, because one is always bored and getting trivial material. Yeah, that is always a problem, you know, when you get a course and someone is at the back, you know, on the Linux command line and they're just bashing away, they could do everything in two seconds, and then you have other people who barely know where the tab key is, and when you say pipe, they look at you blankly, and they can't understand, you know, why you need a space here, or just the kind of basic mechanics of commands and that kind of thing. As I say, the two mental models that I find tend to be missing the most are the file system, like just not having any real understanding of that, and then also understanding functions. Like, I'm always amazed that people lack those two mental models, because they're so axiomatic for us when we're teaching. But if you think back to when you learned to program and learned the command line, I know we spent probably the first six months just doing, like, if statements and for loops. When I did my undergrad in computer science, like, they took us through things very, very, very slowly and kept, you know, repeating it over and over again, and they showed, like, 20 different ways to use a for loop. Eventually, that kind of sunk in. I think when people do, say, a one-week training course, they're taking in so much information that they can't realistically be an expert at the end of that week. All you can really do is give them a flavor of, this is what you could do if you really wanted to go away and learn it. I think the core part of those courses is the teaching them to help themselves kind of thing. Absolutely, yeah. 
You're giving them the terms, you're giving them how to look up the man pages or the help flags, read the documentation, because if you don't manage that, then there is relatively limited long-term retention from those intensive courses. So I guess it's difficult, right? You get, say, a clinician who isn't a biologist and isn't a computer scientist, and they land and they expect to take in as much as possible, and, you know, usually very intelligent people, and they want to know everything as quickly as possible with the minimum of effort. But yet they want to do things that are really advanced, maybe at PhD level or beyond. And how do we then go about training that untrained monkey to work as quickly as possible with these really advanced, PhD-level topics, but without any of the foundations behind it? They haven't got the years of learning how to program, or the theoretical background in mathematics, or the theoretical background in some obscure piece of biology or sequencing molecular biology, that kind of thing. So where do we start? Are you thinking of an experience where you trained a clinician to learn the command line? And if so, how long did it take to train that clinician? I've had varying experiences of it. So the Wellcome Trust advanced courses in Hinxton will take in a wide variety of people, and I've helped with those in years gone by. You can have people who are clinicians or who basically know nothing about mathematics or genomics. And then mixed in with those, you know, people who are vastly further along. And it's hard because you've only got a week, and you have to go from, like, day one is introduction to the command line, you know, that's like an hour or two. And then it's an introduction to a few visual tools, like Artemis, that kind of thing. Then it's slap bang into things like the inner workings of mappers and assemblers and RNA-seq and in-depth sequencing. You know, it gets a bit crazy. 
All the content is there and they give you, like, a big folder to take home, and they give you a virtual machine and things to work on. But at the same time, at the end of that, you're not an expert in bioinformatics. I'm sure some people expect to be. But really, that takes years; that's, you know, five, ten years of work, of a university education full time, to get to that level, I think. Not a week. Do you have experience teaching clinicians or something similar? And how long did it take you? I've not taught that many clinicians. I was on a very interesting course a couple of years ago, which was one of those domain-focused courses. It was the International Course on Antibiotic Resistance. So it was a mix of the science of antibiotic resistance, etc., different drug classes, different mechanisms, but then also the bioinformatics, the analysis, here are the databases, yada, yada, yada. And that was an interesting course because it was a mix: there were a few bioinformaticians there, there were a few kind of wet lab scientists, there was a whole bunch of clinicians, there were pharmacists, and there was industry, like, pharmaceutical people. I found that with that breadth, taking the domain-first approach, like, okay, here's how this works, and this is the way that this bit is done, was great, and it worked quite well in that setting. Whereas if we'd gone for an algorithms approach, something like that, the imbalance in the different backgrounds would have been so much greater, it would have been so much less of a level playing field. You were going to end up with a lot of chaos, with, you know, the kid at the back who, you know, did a physics PhD and is running ksh or something, like, he's fine, he's got it, they're doing okay. But then, you know, there's someone else that's in internal medicine, and hasn't really done much computational research. 
I do think, especially from the life sciences side, there is that push towards the integration of computational methods into curricula. So they are getting that four-year undergrad or three-year undergrad where they're actually being exposed consistently, bit by bit: these are the tools, here's how you look up help, here's, like, one language or two languages that you kind of keep encountering. Do you really need to know all of this stuff? Do you need to know how to program to do bioinformatics analysis? Because I know recently we've set up Galaxy, which is like a website where you point and click and make workflows and can run analyses, and it's working quite well for people with no bioinformatics background at all. And so we have some more wet lab people who are able to just go and do some assemblies. They're able to do some analysis, they're able to BLAST stuff, and they can do quite a lot without any of the underlying knowledge of how an assembler works or how BLAST works or what options you need to use. They just kind of drop it in and magically a result comes out, and they know what the result is. And it's more about the data science analysis then, rather than this is the infrastructure and how it works. Yeah, I think one of the issues there is it's hugely dependent on which area of bioinformatics they're doing. So bacteria, it's a very well-tooled area compared to a lot of other areas. It's very, like, you know, you can run the assembler even on your laptop. It's not too bad. There are pretty well-defined workflows and protocols and tooling for that in Galaxy. You've already got workflows in there that'll do the analysis that you pretty much want to do. So my PhD was in sort of the eukaryote omics world, like the microbial eukaryote world. Yeah. And it's, as I said, I imagine everyone in the room has experienced at some point, it's just like the state of the tooling is just completely different. 
Everything is an edge case. You need a much bigger system to do even just the basic assembly. You're doing so many tweaks and so many changes because you're in more uncharted territory than doing the 10,000th E. coli assembly. That is quite true, because I remember groups in Sanger, they were doing, like, 50 eukaryotic worms. And then another group is doing, like, 20,000 bacteria, and it's trivial to get assemblies and annotated assemblies out the other end for bacteria. But then the eukaryotes, it was just painful. Each one was like pulling teeth. And then the annotation was even worse because there's so much variation there. And then, you know, there's maybe nothing closely related to port the annotation over from. So it's like, you're going way, way back to try and figure out what the hell is going on. Yeah. Just annotation and gene prediction, like, requires custom-trained models generally. What models? Well, yeah, exactly. So you're building off other models, but compared to, you know, let's just run a Prodigal or Prokka and, like, great, that'll do. Absolutely. Yeah. And then there's so much variation in life and in biology. So in eukaryotes, I remember someone told me sometimes there's two copies of the chromosome and then sometimes there's three. It depends on the life cycle stage. It's like, what? That's just crazy. So my PhD was looking at Paramecium bursaria, looking at endosymbiosis in that. So that's two nuclei. There's a somatic and a germline nucleus in one cell. It has a whole bunch of green algal endosymbionts. And then it's a phagotroph, so it's riddled with bacterial DNA. And then you have a giant virus that hangs about on the outside of it. And then I was trying to do single-cell transcriptomics of that system, to try and look at the transcriptome of that endosymbiosis. Sounds like you got one of the hardest projects you could possibly think up. We've submitted a paper for it now, only five years afterwards. 
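The "let's just run a Prodigal or Prokka" contrast is worth making concrete. As a hedged sketch (it assumes Prokka is installed and a `contigs.fa` assembly exists, and falls back to a message otherwise), bacterial annotation really is close to a one-liner:

```shell
# Sketch: bacterial annotation as a near one-liner with Prokka.
# Assumes prokka is on PATH and contigs.fa exists; degrades gracefully otherwise.
if command -v prokka >/dev/null 2>&1 && [ -f contigs.fa ]; then
    # One command produces genes, proteins, and a GFF annotation.
    prokka --outdir annotation --prefix sample contigs.fa
    msg="annotated"
else
    msg="prokka or contigs.fa not available; command shown as a sketch"
fi
echo "$msg"
```

There is no equivalent drop-in command for a eukaryote genome, where gene prediction generally needs custom-trained models, which is exactly the point being made here.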
A follow-on PhD continued the empirical work on that. Yeah, that was a... We bit off a bit more than we could chew on that one, I think, in retrospect. It's hugely dependent on which area you're in. So if you're teaching someone to do, you know, eukaryote bioinformatics, then learning something like Galaxy is probably going to work against you. Because the abstraction in something like Galaxy means it's much harder to go, like, okay, I know how to run this in Galaxy. Okay, now I need to go and tweak this. This doesn't quite make sense. Or this had a weird error because there was some random string, you know, something weird came through the sequencer and it just messed up the way this worked. And a whole bunch of weird characters came in because this has weird different bases or weird modifications. Or you need to use a thousand CPUs to go and analyze the data. Yeah, exactly. And no one's going to let you have a Galaxy interface to their entire HPC cluster, generally. So for those people who are trying to do bioinformatics in five days, it's more like, okay, here's one assembler. You're going to work on this one assembler, and you'd better do a deep dive on how all these parts work and how these tweaks make a difference and how the command line works and how the file systems work. That's always been my biggest issue with Galaxy. Acknowledging the bias that I'm very much on the side of being command-line happy, it's trying to run stuff on it. I find it very, like, okay, how do I actually get this to connect properly? Okay, oh, I've used the wrong, this file has the wrong extension. Okay, it's still a FASTQ, but someone didn't encode .fq for this workflow and it's .fastq in my actual folder. Although I've come from the command line side, and then the manual hacking of command lines and bash, and then, you know, some really dodgy Perl. 
And now I've settled on Galaxy and it's just like, oh, wow, this is just so easy for a lot of straightforward stuff. And I've done Nextflow. So as someone who would have to teach people, it solves a lot of problems. I don't have to teach them. But actually, what I've found really good, a really good training course we did, was in the Gambia, where we used virtual machines. We spun them up on MRC CLIMB, which is like an open-source cloud system. And so then every student had their own VM that was kind of pre-configured with standard stuff like Conda and a few assemblers, whatever. And then we left people to just go and work through some basic datasets. Seemed to work quite well. Maybe that's the future, you know? April Wright, who's at a primarily undergrad institution in Louisiana, I think Southeastern Louisiana University, wrote a really nice article in F1000Research about sort of teaching computing in biology classrooms and all the different sort of pros and cons of these different ways of doing it. Like having your cloud, having them install things on their own computer, which we've probably all experienced the chaos that can cause. And it's just a nice summary of a lot of it. And it ties into a lot of the literature about barriers to teaching, like what people actually want to learn, how it differs in undergrad, post-grad, et cetera. Might be of interest to people. I think there's the overhead of some of those systems, though; like, Galaxy does take quite an effort to learn how to use. No, it doesn't. Galaxy is easy. It does take some effort, though, especially to use it well. And then I'm like, depending on what someone's doing, is them investing that effort, would it be better placed in throwing them on the command line? I find people get scared, right? They get scared off when they see just this cursor; they don't know what to type. 
There's no, they can't just right-click and select from menus or whatever. It's literally a cursor, and they have to type in stuff that they're probably getting drip-fed. Often people will just blindly copy and paste or blindly type stuff in without really understanding what they're doing. And they don't understand, like, a dash is important. It's not like something you copy and paste from a Word document, where it can be a different character. They don't understand any of this kind of stuff. And there's so many concepts. Yeah. I mean, there's so many concepts to have to take in, just really, really basic stuff, to make it work. It scares a lot of people. We talked about really complicated biology projects, with more than diploidy or multiple nuclei. But if you had a project that kind of fit in the bounds of more normalcy, however you want to define that, and you had a structure like Galaxy, I feel like that simplifies things at least. So you could get to that Nature paper from knowing nothing a lot faster, maybe still not four and a half days, but maybe faster, if you had sort of a more normal project and you had a structure like Galaxy or similar, right? I'd agree with that. Definitely. I think the key issue is what is the end goal of that learner, especially beyond your Nature paper in the week? Like, are they going to be, are they interested? It's like, okay, we did this, ran all the things, it worked. And I thought, you know, I didn't really like the way that the assembler worked for that bit of the genome. Like, it was terrible for the mobile genetic elements and plasmids. I'm going to work on this a bit more. I want to change the way this bit works. Then they have to move to the command line entirely, and they have to move into that space. So it's a bit like, yeah, they got the paper, and they kind of got a very high-level overview of how this process works and where it doesn't work. 
But would they have been better just going to the command line a bit and taking that initial hit, that initial investment in time and effort? So I think people don't really care about the mechanics. They care about the other end, right? They want the data and the actually interesting results from the data. Often a PhD student will just be handed, here's a lot of samples, or here's a lot of sequencing data, this is the interesting question, go and figure out what's going on there. They don't care that they have to sort their BAM file and then index it, or how to do that in the most efficient manner. They just want to know: where are the SNPs? What does the tree look like? They don't really care how to make the tree. And what can they figure out in terms of how it relates to the metadata, and all the interesting results that they need for the paper, and make the pretty figures. So actually, maybe we should be teaching them more how to make the pretty pictures in R, rather than teaching them the mechanics of stringing things together to do basic stuff people have done a million times before. Especially, as we've discussed, from our biases of being command line junkies, I'm always surprised at the relative use of things like CARD's RGI web interface and the ResFinder interface versus the command line tools. It's a crazy amount of use. So yeah, those interfaces are used heavily. I think I still have the concern of, you don't want to end up in the extreme examples of people who just throw something in BLAST and claim the virus has been manufactured, or that RT-PCR doesn't work because there's a primer that matches humans, just for random recent examples. With everything abstracted away, you can't even begin to ask them those questions. 
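The "sort your BAM file and then index it" step mentioned above is, mechanically, just two commands. A minimal sketch (assuming samtools is installed and an unsorted `sample.bam` exists, both assumptions, so it falls back to a message otherwise):

```shell
# Sketch: the sort-then-index step a student runs before looking at SNPs.
# Assumes samtools is on PATH and sample.bam exists; degrades gracefully otherwise.
if command -v samtools >/dev/null 2>&1 && [ -f sample.bam ]; then
    samtools sort -o sample.sorted.bam sample.bam   # coordinate-sort the alignments
    samtools index sample.sorted.bam                # writes sample.sorted.bam.bai
    msg="sorted and indexed"
else
    msg="samtools or sample.bam not available; commands shown as a sketch"
fi
echo "$msg"
```

In Galaxy-style workflows these two steps happen invisibly, which is exactly the abstraction being debated here.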
Even the ones that care about, okay, how does this work, the one person in that room that cares about what samtools sort is actually doing, they can't even access some of those flags, especially in some workflows. True, because you don't need to worry about them. In Galaxy, they magically get sorted and you magically get an index. I will say one of the best trainings that I've witnessed was back in, I think, 2010 or maybe 2011, I believe, from Denmark. David Ussery came with his researcher, Shini, to CDC, and they taught a bunch of people who had never seen the command line how to run the command line with some very basic commands and use their package. It was called CMG-biotools, I believe. And they had already made a VM. You just run this command, this command, this command. They just straight up told you exactly what to do. And you got publication-ready figures. And I thought that was just visionary. I think the command line, for people in the room who knew nothing, was still a little hard, but they made it work. And I think if they had continued it, there would be some kind of interface for it, and they would definitely abstract away the things. I know that there's the Canadian Bioinformatics Workshops. I think she works for Tableau now, but there was Anamaria Crisan, I think at UBC, who did basically a PhD on visualization of analyses. And she had a whole session at one of the previous Metagenomics Canadian Bioinformatics Workshops. And that was really well received, because for the ones that don't care about the text on the command line, having the visualization actually was kind of, how do I put this, giving them the feeling of the success of the process and engaging... Did she do her PhD with Jennifer Gardy? I think she did, yeah. And yeah, because I remember she did an absolutely fantastic talk. It was on something crazy like... It was at ASM NGS or something. Like a paper on a reporting tool for clinicians reporting back AMR. 
And it was all about how do you make it so you can get across all the detailed information in a reasonable, concise manner to clinicians. And it was just a fantastic talk based on usability design. An evaluation of a whole genome sequencing clinical report for reference microbiology labs. There you go. In PeerJ. ASM NGS 2017, she presented that. Yeah, I was there. As was I, yeah. I agree with you. So she has this amazing visualization thesis, and if I've oversimplified it, I'm sorry if I did. But I think that it definitely dovetails with your earlier point, Andrew, that people just want to know how to run the thing and get the results. And Ana's work is kind of crucial to that. How do you get the results? How do you visualize them? And it would be great to focus training on getting those results out. So I think the most successful workshops I've been engaged in are not actually microbial bioinformatics focused. They're these kind of research capacity development workshops, an initiative called MicroResearch. So the idea of these MicroResearch courses, essentially, is there's a small amount of seed funding, and you take a whole bunch of people who have no research experience whatsoever, usually from, like, a mix of healthcare-associated backgrounds or whatever, and basically train them to develop a small project proposal and then do the whole process of doing that piece of research. And it's been absurdly successful. Something like 70% of people that do the course are still in research five years later. Most groups end up with a PubMed publication out of it, and then there are massive amounts of knowledge translation done with it. Like, it works incredibly well. And most of the course is designed around, like, you're paired with a research mentor afterwards and through the course, but most of it is designed to give you the terminology, give you the language, give you the resources to know how to ask how to do something. 
Because, you know, we're not teaching all of qualitative and quantitative stats and how to do research ethics and how to do knowledge translation in a two-week course, and writing and developing a whole research proposal, doing the grant review, etc. Like, that's not being done in that time frame. But here's, basically, here's all the terms, here is how to find more information about this, here is what you would ask to do this, here's what you would say, especially with the medical things, like here are the terms you would use to speak to the statistician to do that kind of analysis. How much pre-learning do you think someone should have before they come to you and ask, can I go on your course? I'd always expect people, maybe, if they're interested in an area, to go and read a book on the basics, mathematics, biology, computer science, whatever, if they want to do more advanced stuff, rather than, you know, having them sit there while you have to spoon-feed them. I think that having prerequisites is a really tough thing to solve, but it is necessary. You have to be able to say, because you don't come into a course knowing everything, on one extreme, and you don't come into a course knowing nothing. You're not like a newborn baby, so you have to define what a person has to know, and be reasonable. You can't say, like, you need to know the English language. I think some of it's just assumed. I find some people just think bioinformatics is easy, that it's clicking a few buttons and you get some result. The hard part is working in a lab, collecting the samples. It's not the magic that happens with sequencing and informatics. It's just, oh, make me a figure there, make me a tree and, you know, tell me the magical result that'll get me my Nature paper. By teaching the Galaxy workflow approach, or these kind of pre-baked solutions, do we not encourage that thinking? 
You would hope, but often people can't even get that far, you know, because they don't have the prerequisite knowledge of how different things work, or how sequencing works, or whatever. A lot of people just get sequencing and they're blind to what it actually means or how it was created. I think it's fine. I think that we have matured over the years, even before I got into this career. Like, we don't have to know Linux administration anymore. That's okay, that we've abstracted some stuff away. Well, if you're using VMs and Docker, you really do. Oh, okay. Well, okay, bad example. Maybe there's something, though, that we've matured away from. Like, you don't need to know everything anymore. If the postdoc orders the new blade cluster to arrive when he's away teaching a course and you're the only one with any computational skills, you end up having to go to the data center and set it up. But I mean, if somebody is learning Galaxy, you know, and I think that's a good example since Finn's here and he does IRIDA stuff. Like, just somebody from Canada who's just learning how to do stuff, and they're learning Galaxy, they don't need to know how to do system administration. They just need to know how to use the interface and get their results. Get the data in, in the right place, which is why, like, so much of it is devoted to uploaders and ways of uploading your data. And I'd say I think that works really well when most of the users of something like IRIDA are doing very standardized surveillance-type analyses, because that's its aim. The problem is, I think, that it's not as useful for research that goes off that beaten track a bit more. And that's where we kind of go back to the original point of, like, is that almost, not wasted effort, but misplaced effort, that would have been better placed, for the person wanting to do that kind of stuff, in going into the guts of the tools, going with less abstraction. 
Well, I know when I was building pipelines for different pathogen groups, I put in place automated systems to go and map everything, say, to a reference, or assemble everything automatically and annotate it if you can. And actually you can get quite far down the line. You can produce a lot of results very quickly, automatically, without having to do too much work on the back end. And then that means that the researchers can just go that one step further along. They don't have to worry about the ins and outs of mappers because the data is already mapped. You know, it's great if they do, but if they don't, then that's fine as well. Do you think that level of abstraction leads to less scrutiny of the results? It does. But at the same time, you don't have people making stupid basic mistakes because they don't know the insider knowledge that says, okay, sometimes if you have paired-end reads, maybe they overlap, and maybe people didn't know that. Or maybe, I don't know, SPAdes, when it has, say, overlapping reads, you have to treat them slightly differently to if they're not overlapping. Or, you know, all these tiny little things that you don't really think about, or read trimming or trimming adapters. There's some important stuff that maybe if you're in the field for a few years, you'll just have picked up, and maybe it's not even written down, but you know it anyway. Even the really basic stuff, like, why doesn't MAFFT work if I give it a bunch of genomes? It's a multiple sequence aligner. Yeah, and why is it taking so long? Is that a different problem? Yeah. Are there some basic things that you would teach somebody, like typing in dash-dash help, just some basics for every single command? Except on the command line, that doesn't always work. If you do dash h, in some commands, you know, that's an actual option in the command, so it's not universal. And the same with dash-dash help. You know, that's not universal either, so. 
Or you get the help automatically when no input's given, right? As long as it doesn't do anything destructive. I think that's one of the things I try to do the most when I'm teaching, especially doing practicals: not giving the answer when a question is asked, and being like, okay, here's how we would look that up. You know, here is the help, here is the help command. Okay, look what that option does. Okay, the default is this, if it says the default, or whatever, you know, this is the wrong way around, because we can see it in the help messages. I think that is a really key bit of teaching. It's quite hard to do via a decimal and a calculator. I guess when you are teaching that kind of command line stuff, it can be difficult, because often, with the examples, you want to give people a really complicated example, but at the same time, you need to have it run within a reasonable amount of time. You can't just say, okay, go and assemble your Plasmodium falciparum and then expect them to get an answer in five minutes, you know? And at the same time, you don't want them to do full bacterial assemblies. It might just be you want to focus on, say, assembling a plasmid. You know, that might run in five, ten minutes, or seconds, hopefully, but then you don't get as much biological insight out of that. So there's this kind of balance, and I've seen it done, well, I've seen it both ways, you know, too simple, too complicated, and both of them are a disaster. So there's some kind of intermediate in there that works quite well, but I don't know what it is. But the problem is, I don't think that intermediate is the same for all people in the class. True, and that's one of the problems. And the same with the whole level of abstraction issue. What is the appropriate level of abstraction? It's going to be different for different people in the class, depending on their learning objectives and their background. 
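The point that help conventions aren't universal is easy to demonstrate with standard tools: `--help` is a GNU convention, `-h` is often a real option rather than help (`ls -h` means human-readable sizes), and man pages are a separate channel again. A small sketch:

```shell
# Help conventions vary by tool; none of these is universal.
helpline=$(grep --help 2>&1 | head -n 1)   # GNU-style long flag: prints a usage line
echo "grep --help starts with: $helpline"

# -h is frequently a real option, not help: on ls it means human-readable sizes.
ls -h / >/dev/null 2>&1 && echo "-h on ls means human-readable sizes, not help"

# man pages are yet another channel, and not every system or tool ships them.
man -w grep >/dev/null 2>&1 && echo "grep has a man page" || echo "no man page found for grep"
```

Teaching students to try all three channels, rather than handing them the answer, is the habit being described here.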
The simple example: see an example of it, try to run it yourself, then, like, basically create a new one. That classic sort of repeated workflow that, you know, lots of the learning-to-program literature and guidelines, et cetera, kind of follow heavily. If they're a PhD student with their own data, and they've basically come to your workshop or your training course because they have their whole bunch of bacterial reads and they want to do something with them, great, because they have a complex example ready that they can kind of build in themselves for their "create" step. Whereas if they're just learning, you know, they're an undergrad or a master's student who's kind of learning it because, you know, this bioinformatics stuff might be useful in the future, they hear it's an up-and-coming field with podcasts coming out all the time, they don't have that; their "create" examples are very arbitrary. They don't have, like, an obvious connection to that more complex stuff. Finding that intermediate is, yeah, you kind of need to know the class, need to know the people you're training. I was contemplating whether there is a common denominator between everyone coming into the class with a bunch of different backgrounds, because we brought this up earlier in the conversation, and also because in grad school, for a bioinformatics graduate program, you have people coming in with a biology degree, or a computer science degree, or some other ones that are not as prevalent, like a statistics background or something. And I think it is a real challenge for the professors and everybody to come to common ground. And I mean, I'm thinking back to when I first entered the program. It's been a long time. It was an experience, because people with computer science backgrounds excelled super quickly at the beginning of the program. 
It was almost demoralizing from my point of view, because I was coming in with a biology background, but after the first few semesters, it was very apparent that even though they had the methodology down, they didn't have, like, the purpose yet, and that's where the biology came in, the destination of that journey. And I think the biology is really hard to catch up on, because it's a bunch of memorization, honestly. That was a very long answer to say that finding the common denominator for every single training class in graduate school is a very difficult and probably unsolved challenge. I agree entirely, but coming from the bias of having self-taught the CS, through the online courses, the workshops, et cetera, enough that I could somehow infiltrate the CS department, I find the life sciences is just that big blob that needs a bit more of a pedagogical guide through it: learning all the exceptions to all the rules, learning, actually, there's a second strand. You wrote a great algorithm, but you ignored half the data. Like, we've seen that in tools before, in quite good tools. The thing I've been very surprised at, though, and why I'm very hesitant on the abstraction, is actually some of the CS students. You think, you know, moving to a CS department, I'll be in the promised land, they'll be able to do all this on the command line effortlessly, they won't have to think about it. But actually, with a lot of the way CS teaching is happening now, with, you know, these online platforms or these very contrived Java examples, for example, if they're still doing that, I've been surprised at how poor a lot of CS students have been at just command line file munging: piping a file, copying a file, moving it around. That kind of messy getting-the-data-into-the-program stuff, when they haven't had a clean example of doing it. I think that's because a lot of CS isn't the getting-your-hands-dirty part.
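For readers outside the field, the "file munging" being described is nothing exotic; it is everyday plumbing along these lines (the filenames here are invented for illustration, not from the episode):

```shell
# Make a toy FASTA file and a directory to play with.
mkdir -p archive
printf ">seq1\nACGT\n>seq2\nGGCC\n" > genome.fasta

cp genome.fasta backup.fasta    # copying a file
mv backup.fasta archive/        # moving it around

# Piping: stream the file through grep and wc to count sequences,
# instead of opening it in an editor and counting by eye.
cat genome.fasta | grep "^>" | wc -l    # prints 2
```

None of this is conceptually hard, but it is exactly the messy getting-data-into-the-program step that contrived classroom exercises tend to skip.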
It is like writing code after you've done a bit of theory. So it's all about, say, I learned Java, unfortunately, and that was all about building objects and levels of abstraction and interfaces and all of that kind of jazz. But there's very little on the actual command line, on how you string these programs together. It was just, this is the code, this is the theory, you know, heavy theory, theory, theory, I presume so that you can move forward and you'll forever know the theory of computer science. So everything is easy. Everything is a network. Everything is a graph. That's what I reduce every problem to anyway. So maybe it did work. There is a lot of bad CS teaching out there, substantially, and I know a lot of people failed computer science when I did it as an undergrad. The attrition rate was enormous. But I think a lot of people only went into it, I started in '99, because that was around the first dot-com boom and everyone thought, oh yeah, this is a great area to get into, you'll make a fortune, you'll be a millionaire when you're 25. And then of course, people realized actually this is kind of hard, and, you know, there was like a 40% dropout rate in the first year. And now we just do the same thing with deep learning. Oh yeah. Which is, I mean, another interesting example of what is the right level of abstraction. Because you can do a deep learning bootcamp, and you can teach them to run a convnet on some data pretty damn quickly, but they have no idea what they're doing, necessarily. Well, when I interview people and machine learning comes up, because every CV these days mentions machine learning, I usually get them to draw out what a neural network is. And I say, okay, you know, you say you know neural networks, could you draw one, please? And half the people don't understand the absolute basics. The other half, you know, once they start even drawing it, it's like, grand, yeah, you're fine, you actually know what's going on.
I hadn't encountered that FizzBuzz of machine learning. I tend to go with logistic regression: how do they explain logistic regression? Very simple. Thanks for joining us, Finn. I think we made it through this topic unscathed. I think that you all have learned a bit about training, the common denominator between all students' backgrounds, the level of abstraction, and that you will not go from zero to Nature in one week. We seem to have no agreement on a lot of things, but maybe it's a good thing that we discussed several angles anyway. Thank you all so much for listening to us at home. If you liked this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group and edited by Nick Waters. The opinions expressed here are our own and do not necessarily reflect the views of the CDC or the Quadram Institute.