Hello, and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There's so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Andrew and Nabil work in the Quadram Institute in Norwich, UK, where they work on microbes in food and the impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the U.S. Hi, I am Lee Katz, and welcome to the MicroBinfie podcast. Andrew and I are here to talk with Dr. Finlay Maguire. Nabil is on holiday at the moment. Finn is a bioinformatics researcher in the computer science department at Dalhousie University. Some of his activities include working with IRIDA and the CARD AMR database. Finn also works with nonprofits and social scientists to try to help them apply data analysis methods in their work. Today, we'll dive into how fast you can go from training to getting an impactful research publication. Andrew, so how fast is it to get an impactful research publication on something you just learned how to do? Approximately four and a half days. I jest. So this stems from, I had a visitor one time, who turned up to learn bioinformatics. He had no background in it, and he expected to learn all of bioinformatics and get, like, his Nature paper in the space of a week, four and a half days to be exact. Of course, that's not really realistic. It takes slightly longer than that to become trained in the area. You want to get a Nature paper on bioinformatics and you only have a week. Well, what do you do? 
One of the things that I've found by teaching at Dalhousie a bit is we do a lot of graduate courses that are cross-listed to CS and life sciences, and it's a bit of a nightmare trying to teach to both halves of the class at the same time, because one is always bored and getting trivial material. Yeah, that is always a problem, you know, when you get a course and someone is at the back, you know, on the Linux command line and they're just bashing away, they could do everything in two seconds, and then you have other people who barely know where the tab key is, and when you say pipe, they look at you blankly, and they can't understand, you know, why you need a space here, or just the kind of basic mechanics of commands and that kind of thing. As I say, the two mental models that I find tend to be missing the most are the file system, like just not having any real understanding of that, and then also understanding functions. Like, I'm always amazed that people lack those two mental models, because they're so axiomatic for us when we're teaching. But if you think back to when you learned to program and learned the command line, I know we spent probably the first six months just doing, like, if statements and for loops. When I did my undergrad in computer science, like, they took us through things very, very, very slowly and kept, you know, repeating it over and over again, and they showed, like, 20 different ways to use a for loop. Eventually, that kind of sunk in. I think when people do, say, a one-week training course, they're taking in so much information that they can't realistically be an expert at the end of that week. All you can really do is give them a flavor of, this is what you could do if you really wanted to go away and learn it. I think the core part of those courses is the teaching them to help themselves kind of thing. Absolutely, yeah. 
You're giving them the terms, you're giving them how to look up the man pages or the help flags, read the documentation, because if you don't manage that, then there is relatively limited long-term retention from those intensive courses. So I guess it's difficult, right? You get, say, a clinician who isn't a biologist and isn't a computer scientist, and they land and they expect to take in as much as possible, and, you know, usually very intelligent people, and they want to know everything as quickly as possible with the minimum of effort. But yet they want to do things that are really advanced, maybe at PhD level or beyond. And how do we then go about training that untrained monkey to work as quickly as possible with these really advanced, PhD-level topics, but without any of the foundations behind it? They haven't got the years of learning how to program, or the theoretical background in mathematics, or the theoretical background in some obscure piece of biology or sequencing molecular biology, that kind of thing. So where do we start? Are you thinking of an experience where you trained a clinician to learn the command line? And if so, how long did it take to train that clinician? I've had varying experiences of it. So the Wellcome Trust advanced courses in Hinxton will take in a wide variety of people, and I've helped with those in years gone by. You can have people who are clinicians or who basically know nothing about mathematics or genomics. And then mixed in with those, you know, people who are vastly further along. And it's hard because you've only got a week, and you have to go from, like, day one is introduction to the command line, you know, that's like an hour or two. And then it's an introduction to a few visual tools, like Artemis, that kind of thing. Then it's slap bang into things like the inner workings of mappers and assemblers and RNA-seq and in-depth sequencing. You know, it gets a bit crazy. 
All the content is there and they give you, like, a big folder to take home, and they give you a virtual machine and things to work on. But at the same time, at the end of that, you're not an expert in bioinformatics. I'm sure some people expect to be. But really, that takes years; that's, you know, five, ten years of work, of a university education full time, to get to that level, I think. Not a week. Do you have experience teaching clinicians or something similar? And how long did it take you? I've not taught that many clinicians. I was on a very interesting course a couple of years ago, which was one of those domain-focused courses. It was the International Course on Antibiotic Resistance. So it was a mix of the science of antibiotic resistance, etc., different drug classes, different mechanisms, but then also the bioinformatics, the analysis, here are the databases, yada, yada, yada. And that was an interesting course because it was a mix: there were a few bioinformaticians there, there were a few kind of wet lab scientists, there was a whole bunch of clinicians, there were pharmacists, and there was industry, like, pharmaceutical people. I found that with that breadth, taking the domain-first approach, like, okay, here's how this works, and this is the way that this bit is done, was great, and it worked quite well in that setting. Whereas if we'd gone for an algorithms approach, something like that, the imbalance in the different backgrounds would have been so much greater, it would have been so much less of a level playing field. You were going to end up with a lot of chaos, with, you know, the kid at the back who, you know, did a physics PhD and is running ksh or something, like, he's fine, he's got it, they're doing okay. But then, you know, there's someone else that's in internal medicine, and hasn't really done much computational research. 
I do think, especially from the life sciences side, there is that push towards the integration of computational methods into curricula. So they are getting that four-year undergrad or three-year undergrad where they're actually being exposed consistently, bit by bit: these are the tools, here's how you look up help, here's, like, one language or two languages that you kind of keep encountering. Do you really need to know all of this stuff? Do you need to know how to program to do bioinformatics analysis? Because I know recently we've set up Galaxy, which is like a website where you point and click and make workflows and can run analyses, and it's working quite well for people with no bioinformatics background at all. And so we have some more wet lab people who are able to just go and do some assemblies. They're able to do some analysis, they're able to BLAST stuff, and they can do quite a lot without any of the underlying knowledge of how an assembler works or how BLAST works or what options you need to use. They just kind of drop it in and magically a result comes out, and they know what the result is. And it's more about the data science analysis then, rather than this is the infrastructure and how it works. Yeah, I think one of the issues there is it's hugely dependent on which area of bioinformatics they're doing. So bacteria, it's a very well-tooled area compared to a lot of other areas. It's very, like, you know, you can run the assembler even on your laptop. It's not too bad. There are pretty well-defined workflows and protocols and tooling for that in Galaxy. You've already got workflows in there that'll do the analysis that you pretty much want to do. So my PhD was in sort of the eukaryote omics world, like the microbial eukaryote world. Yeah. And it's, as I said, I imagine everyone in the room has experienced at some point, it's just like the state of the tooling is just completely different. 
Everything is an edge case. You need a much bigger system to do even just the basic assembly. You're doing so many tweaks and so many changes because you're in more uncharted territory than doing the 10,000th E. coli assembly. That is quite true, because I remember groups in Sanger, they were doing, like, 50 eukaryotic worms. And then another group is doing, like, 20,000 bacteria, and it's trivial to get assemblies and annotated assemblies out the other end for bacteria. But then the eukaryotes, it was just painful. Each one was like pulling teeth. And then the annotation was even worse because there's so much variation there. And then, you know, there's maybe nothing closely related to port the annotation over from. So it's like, you're going way, way back to try and figure out what the hell is going on. Yeah. Just annotation and gene prediction, like, requires custom-trained models generally. What models? Well, yeah, exactly. So you're building off other models, but compared to, you know, let's just run a Prodigal or Prokka and, like, great, that'll do. Absolutely. Yeah. And then there's so much variation in life and in biology. So in eukaryotes, I remember someone told me sometimes there's two copies of the chromosome and then sometimes there's three. It depends on the life cycle stage. It's like, what? That's just crazy. So my PhD was looking at Paramecium bursaria, looking at endosymbiosis in that. So that's two nuclei. There's a somatic and a germline nucleus in one cell. It has a whole bunch of green algal endosymbionts. And then it's a phagotroph, so it's riddled with bacterial DNA. And then you have a giant virus that hangs about on the outside of it. And then I was trying to do single-cell transcriptomics of that system, to try and look at the transcriptome of that endosymbiosis. Sounds like you got one of the hardest projects you could possibly think up. We've submitted a paper for it now, only five years afterwards. 
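The "let's just run a Prodigal or Prokka" contrast is worth making concrete. As a hedged sketch (it assumes Prokka is installed and a `contigs.fa` assembly exists, and falls back to a message otherwise), bacterial annotation really is close to a one-liner:

```shell
# Sketch: bacterial annotation as a near one-liner with Prokka.
# Assumes prokka is on PATH and contigs.fa exists; degrades gracefully otherwise.
if command -v prokka >/dev/null 2>&1 && [ -f contigs.fa ]; then
    # One command produces genes, proteins, and a GFF annotation.
    prokka --outdir annotation --prefix sample contigs.fa
    msg="annotated"
else
    msg="prokka or contigs.fa not available; command shown as a sketch"
fi
echo "$msg"
```

There is no equivalent drop-in command for a eukaryote genome, where gene prediction generally needs custom-trained models, which is exactly the point being made here.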
A follow-on PhD continued the empirical work on that. Yeah, that was a... We bit off a bit more than we could chew on that one, I think, in retrospect. It's hugely dependent on which area you're in. So if you're teaching someone to do, you know, eukaryote bioinformatics, then learning something like Galaxy is probably going to work against you. Because the abstraction in something like Galaxy means it's much harder to go, like, okay, I know how to run this in Galaxy. Okay, now I need to go and tweak this. This doesn't quite make sense. Or this had a weird error because there was some random string, you know, something weird came through the sequencer and it just messed up the way this worked. And a whole bunch of weird characters came in because this has weird different bases or weird modifications. Or you need to use a thousand CPUs to go and analyze the data. Yeah, exactly. And no one's going to let you have a Galaxy interface to their entire HPC cluster, generally. So for those people who are trying to do bioinformatics in five days, it's more like, okay, here's one assembler. You're going to work on this one assembler, and you'd better do a deep dive on how all these parts work and how these tweaks make a difference and how the command line works and how the file systems work. That's always been my biggest issue with Galaxy. Acknowledging the bias that I'm very much on the side of being command-line happy, it's trying to run stuff on it. I find it very, like, okay, how do I actually get this to connect properly? Okay, oh, I've used the wrong, this file has the wrong extension. Okay, it's still a FASTQ, but someone didn't encode .fq for this workflow and it's .fastq in my actual folder. Although I've come from the command line side, and then the manual hacking of command lines and bash, and then, you know, some really dodgy Perl. 
And now I've settled on Galaxy and it's just like, oh, wow, this is just so easy for a lot of straightforward stuff. And I've done Nextflow. So as someone who would have to teach people, it solves a lot of problems. I don't have to teach them. But actually, what I've found really good, a really good training course we did, was in the Gambia, where we used virtual machines. We spun them up on MRC CLIMB, which is like an open-source cloud system. And so then every student had their own VM that was kind of pre-configured with standard stuff like Conda and a few assemblers, whatever. And then we left people to just go and work through some basic datasets. Seemed to work quite well. Maybe that's the future, you know? April Wright, who's at a primarily undergrad institution in Louisiana, I think Southeastern Louisiana University, wrote a really nice article in F1000Research about sort of teaching computing in biology classrooms and all the different sort of pros and cons of these different ways of doing it. Like having your cloud, having them install things on their own computer, which we've probably all experienced the chaos that can cause. And it's just a nice summary of a lot of it. And it ties into a lot of the literature about barriers to teaching, like what people actually want to learn, how it differs in undergrad, post-grad, et cetera. Might be of interest to people. I think there's the overhead of some of those systems, though; like, Galaxy does take quite an effort to learn how to use. No, it doesn't. Galaxy is easy. It does take some effort, though, especially to use it well. And then I'm like, depending on what someone's doing, is them investing that effort, would it be better placed in throwing them on the command line? I find people get scared, right? They get scared off when they see just this cursor; they don't know what to type. 
There's no, they can't just right-click and select from menus or whatever. It's literally a cursor, and they have to type in stuff that they're probably getting drip-fed. Often people will just blindly copy and paste or blindly type stuff in without really understanding what they're doing. And they don't understand, like, a dash is important. It's not like something you copy and paste from a Word document, where it can be a different character. They don't understand any of this kind of stuff. And there's so many concepts. Yeah. I mean, there's so many concepts to have to take in, just really, really basic stuff, to make it work. It scares a lot of people. We talked about really complicated biology projects, with more than diploidy or multiple nuclei. But if you had a project that kind of fit in the bounds of more normalcy, however you want to define that, and you had a structure like Galaxy, I feel like that simplifies things at least. So you could get to that Nature paper from knowing nothing a lot faster, maybe still not four and a half days, but maybe faster, if you had sort of a more normal project and you had a structure like Galaxy or similar, right? I'd agree with that. Definitely. I think the key issue is what is the end goal of that learner, especially beyond your Nature paper in the week? Like, are they going to be, are they interested? It's like, okay, we did this, ran all the things, it worked. And I thought, you know, I didn't really like the way that the assembler worked for that bit of the genome. Like, it was terrible for the mobile genetic elements and plasmids. I'm going to work on this a bit more. I want to change the way this bit works. Then they have to move to the command line entirely, and they have to move into that space. So it's a bit like, yeah, they got the paper, and they kind of got a very high-level overview of how this process works and where it doesn't work. 
But would they have been better just going to the command line a bit and taking that initial hit, that initial investment in time and effort? So I think people don't really care about the mechanics. They care about the other end, right? They want the data and the actually interesting results from the data. Often a PhD student will just be handed, here's a lot of samples, or here's a lot of sequencing data, this is the interesting question, go and figure out what's going on there. They don't care that they have to sort their BAM file and then index it, or how to do that in the most efficient manner. They just want to know: where are the SNPs? What does the tree look like? They don't really care how to make the tree. And what can they figure out in terms of how it relates to the metadata, and all the interesting results that they need for the paper, and make the pretty figures. So actually, maybe we should be teaching them more how to make the pretty pictures in R, rather than teaching them the mechanics of stringing things together to do basic stuff people have done a million times before. Especially, as we've discussed, from our biases of being command line junkies, I'm always surprised at the relative use of things like CARD's RGI web interface and the ResFinder interface versus the command line tools. It's a crazy amount of use. So yeah, those interfaces are used heavily. I think I still have the concern of, you don't want to end up in the extreme examples of people who just throw something in BLAST and claim the virus has been manufactured, or that RT-PCR doesn't work because there's a primer that matches humans, just for random recent examples. With everything abstracted away, you can't even begin to ask them those questions. 
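The "sort your BAM file and then index it" step mentioned above is, mechanically, just two commands. A minimal sketch (assuming samtools is installed and an unsorted `sample.bam` exists, both assumptions, so it falls back to a message otherwise):

```shell
# Sketch: the sort-then-index step a student runs before looking at SNPs.
# Assumes samtools is on PATH and sample.bam exists; degrades gracefully otherwise.
if command -v samtools >/dev/null 2>&1 && [ -f sample.bam ]; then
    samtools sort -o sample.sorted.bam sample.bam   # coordinate-sort the alignments
    samtools index sample.sorted.bam                # writes sample.sorted.bam.bai
    msg="sorted and indexed"
else
    msg="samtools or sample.bam not available; commands shown as a sketch"
fi
echo "$msg"
```

In Galaxy-style workflows these two steps happen invisibly, which is exactly the abstraction being debated here.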
Even the ones that care about, okay, how does this work, the one person in that room that cares about what samtools sort is actually doing, they can't even access some of those flags, especially in some workflows. True, because you don't need to worry about them. In Galaxy, they magically get sorted and you magically get an index. I will say one of the best trainings that I've witnessed was back in, I think, 2010 or maybe 2011, I believe, from Denmark. David Ussery came with his researcher, Shini, to CDC, and they taught a bunch of people who had never seen the command line how to run the command line with some very basic commands and use their package. It was called CMG-biotools, I believe. And they had already made a VM. You just run this command, this command, this command. They just straight up told you exactly what to do. And you got publication-ready figures. And I thought that was just visionary. I think the command line, for people in the room who knew nothing, was still a little hard, but they made it work. And I think if they had continued it, there would be some kind of interface for it, and they would definitely abstract away the things. I know that there's the Canadian Bioinformatics Workshops. I think she works for Tableau now, but there was Anamaria Crisan, I think at UBC, who did basically a PhD on visualization of analyses. And she had a whole session at one of the previous Metagenomics Canadian Bioinformatics Workshops. And that was really well received, because for the ones that don't care about the text on the command line, having the visualization actually was kind of, how do I put this, giving them the feeling of the success of the process and engaging... Did she do her PhD with Jennifer Gardy? I think she did, yeah. And yeah, because I remember she did an absolutely fantastic talk. It was on something crazy like... It was at ASM NGS or something. Like a paper on a reporting tool for clinicians reporting back AMR. 
And it was all about how do you make it so you can get across all the detailed information in a reasonable, concise manner to clinicians. And it was just a fantastic talk based on usability design. An evaluation of a whole genome sequencing clinical report for reference microbiology labs. There you go. In PeerJ. ASM NGS 2017, she presented that. Yeah, I was there. As was I, yeah. I agree with you. So she has this amazing visualization thesis, and if I've oversimplified it, I'm sorry if I did. But I think that it definitely dovetails with your earlier point, Andrew, that people just want to know how to run the thing and get the results. And Ana's work is kind of crucial to that. How do you get the results? How do you visualize them? And it would be great to focus training on getting those results out. So I think the most successful workshops I've been engaged in are not actually microbial bioinformatics focused. They're these kind of research capacity development workshops, an initiative called MicroResearch. So the idea of these MicroResearch courses, essentially, is there's a small amount of seed funding, and you take a whole bunch of people who have no research experience whatsoever, usually from, like, a mix of healthcare-associated backgrounds or whatever, and basically train them to develop a small project proposal and then do the whole process of doing that piece of research. And it's been absurdly successful. Something like 70% of people that do the course are still in research five years later. Most groups end up with a PubMed publication out of it, and then there are massive amounts of knowledge translation done with it. Like, it works incredibly well. And most of the course is designed around, like, you're paired with a research mentor afterwards and through the course, but most of it is designed to give you the terminology, give you the language, give you the resources to know how to ask how to do something. 
Because, you know, we're not teaching all of qualitative and quantitative stats and how to do research ethics and how to do knowledge translation in a two-week course, and writing and developing a whole research proposal, doing the grant review, etc. Like, that's not being done in that time frame. But here's, basically, here's all the terms, here is how to find more information about this, here is what you would ask to do this, here's what you would say, especially with the medical things, like here are the terms you would use to speak to the statistician to do that kind of analysis. How much pre-learning do you think someone should have before they come to you and ask, can I go on your course? I'd always expect people, maybe, if they're interested in an area, to go and read a book on the basics, mathematics, biology, computer science, whatever, if they want to do more advanced stuff, rather than, you know, having them sit there while you have to spoon-feed them. I think that having prerequisites is a really tough thing to solve, but it is necessary. You have to be able to say, because you don't come into a course knowing everything, on one extreme, and you don't come into a course knowing nothing. You're not like a newborn baby, so you have to define what a person has to know, and be reasonable. You can't say, like, you need to know the English language. I think some of it's just assumed. I find some people just think bioinformatics is easy, that it's clicking a few buttons and you get some result. The hard part is working in a lab, collecting the samples. It's not the magic that happens with sequencing and informatics. It's just, oh, make me a figure there, make me a tree and, you know, tell me the magical result that'll get me my Nature paper. By teaching the Galaxy workflow approach, or these kind of pre-baked solutions, do we not encourage that thinking? 
You would hope, but often people can't even get that far, you know, because they don't have the prerequisite knowledge of how different things work, or how sequencing works, or whatever. A lot of people just get sequencing and they're blind to what it actually means or how it was created. I think it's fine. I think that we have matured over the years, even before I got into this career. Like, we don't have to know Linux administration anymore. That's okay, that we've abstracted some stuff away. Well, if you're using VMs and Docker, you really do. Oh, okay. Well, okay, bad example. Maybe there's something, though, that we've matured away from. Like, you don't need to know everything anymore. If the postdoc orders the new blade cluster to arrive when he's away teaching a course and you're the only one with any computational skills, you end up having to go to the data center and set it up. But I mean, if somebody is learning Galaxy, you know, and I think that's a good example since Finn's here and he does IRIDA stuff. Like, just somebody from Canada who's just learning how to do stuff, and they're learning Galaxy, they don't need to know how to do system administration. They just need to know how to use the interface and get their results. Get the data in, in the right place, which is why, like, so much of it is devoted to uploaders and ways of uploading your data. And I'd say I think that works really well when most of the users of something like IRIDA are doing very standardized surveillance-type analyses, because that's its aim. The problem is, I think, that it's not as useful for research that goes off that beaten track a bit more. And that's where we kind of go back to the original point of, like, is that almost, not wasted effort, but misplaced effort, that would have been better placed, for the person wanting to do that kind of stuff, in going into the guts of the tools, going with less abstraction. 
Well, I know when I was building pipelines for different pathogen groups, I put in place automated systems to go and map everything, say, to a reference, or assemble everything automatically and annotate it if you can. And actually you can get quite far down the line. You can produce a lot of results very quickly, automatically, without having to do too much work on the back end. And then that means that the researchers can just go that one step further along. They don't have to worry about the ins and outs of mappers because the data is already mapped. You know, it's great if they do, but if they don't, then that's fine as well. Do you think that level of abstraction leads to less scrutiny of the results? It does. But at the same time, you don't have people making stupid basic mistakes because they don't know the insider knowledge that says, okay, sometimes if you have paired-end reads, maybe they overlap, and maybe people didn't know that. Or maybe, I don't know, SPAdes, when it has, say, overlapping reads, you have to treat them slightly differently to if they're not overlapping. Or, you know, all these tiny little things that you don't really think about, or read trimming or trimming adapters. There's some important stuff that maybe if you're in the field for a few years, you'll just have picked up, and maybe it's not even written down, but you know it anyway. Even the really basic stuff, like, why doesn't MAFFT work if I give it a bunch of genomes? It's a multiple sequence aligner. Yeah, and why is it taking so long? Is that a different problem? Yeah. Are there some basic things that you would teach somebody, like typing in dash-dash help, just some basics for every single command? Except on the command line, that doesn't always work. If you do dash h, in some commands, you know, that's an actual option in the command, so it's not universal. And the same with dash-dash help. You know, that's not universal either, so. 
Or you get the help automatically when no input's given, right? As long as it doesn't do anything destructive. I think that's one of the things I try to do the most when I'm teaching, especially doing practicals: not giving the answer when a question is asked, and being like, okay, here's how we would look that up. You know, here is the help, here is the help command. Okay, look what that option does. Okay, the default is this, if it says the default, or whatever, you know, this is the wrong way around, because we can see it in the help messages. I think that is a really key bit of teaching. It's quite hard to do via a decimal and a calculator. I guess when you are teaching that kind of command line stuff, it can be difficult, because often, with the examples, you want to give people a really complicated example, but at the same time, you need to have it run within a reasonable amount of time. You can't just say, okay, go and assemble your Plasmodium falciparum and then expect them to get an answer in five minutes, you know? And at the same time, you don't want them to do full bacterial assemblies. It might just be you want to focus on, say, assembling a plasmid. You know, that might run in five, ten minutes, or seconds, hopefully, but then you don't get as much biological insight out of that. So there's this kind of balance, and I've seen it done, well, I've seen it both ways, you know, too simple, too complicated, and both of them are a disaster. So there's some kind of intermediate in there that works quite well, but I don't know what it is. But the problem is, I don't think that intermediate is the same for all people in the class. True, and that's one of the problems. And the same with the whole level of abstraction issue. What is the appropriate level of abstraction? It's going to be different for different people in the class, depending on their learning objectives and their background. 
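The point that help conventions aren't universal is easy to demonstrate with standard tools: `--help` is a GNU convention, `-h` is often a real option rather than help (`ls -h` means human-readable sizes), and man pages are a separate channel again. A small sketch:

```shell
# Help conventions vary by tool; none of these is universal.
helpline=$(grep --help 2>&1 | head -n 1)   # GNU-style long flag: prints a usage line
echo "grep --help starts with: $helpline"

# -h is frequently a real option, not help: on ls it means human-readable sizes.
ls -h / >/dev/null 2>&1 && echo "-h on ls means human-readable sizes, not help"

# man pages are yet another channel, and not every system or tool ships them.
man -w grep >/dev/null 2>&1 && echo "grep has a man page" || echo "no man page found for grep"
```

Teaching students to try all three channels, rather than handing them the answer, is the habit being described here.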
The simple example: see an example of it, try to run it yourself, then, like, basically create a new one. That classic sort of repeated workflow that, you know, lots of the learning-to-program literature and guidelines, et cetera, kind of follow heavily. If they're a PhD student with their own data, and they've basically come to your workshop or your training course because they have their whole bunch of bacterial reads and they want to do something with them, great, because they have a complex example ready that they can kind of build in themselves for their "create" step. Whereas if they're just learning, you know, they're an undergrad or a master's student who's kind of learning it because, you know, this bioinformatics stuff might be useful in the future, they hear it's an up-and-coming field with podcasts coming out all the time, they don't have that; their "create" examples are very arbitrary. They don't have, like, an obvious connection to that more complex stuff. Finding that intermediate is, yeah, you kind of need to know the class, need to know the people you're training. I was contemplating whether there is a common denominator between everyone coming into the class with a bunch of different backgrounds, because we brought this up earlier in the conversation, and also because in grad school, for a bioinformatics graduate program, you have people coming in with a biology degree, or a computer science degree, or some other ones that are not as prevalent, like a statistics background or something. And I think it is a real challenge for the professors and everybody to come to common ground. And I mean, I'm thinking back to when I first entered the program. It's been a long time. It was an experience, because people with computer science backgrounds excelled super quickly at the beginning of the program. 
It was almost demoralizing from my point of view, because I was coming in with a biology background, but after the first few semesters, it was very apparent that even though they had the methodology down, they didn't have, like, the purpose yet, and that's where the biology came in, the destination of that journey. And I think the biology is really hard to catch up on, because it's a bunch of memorization, honestly. That was a very long answer to say that finding the common denominator for every single training class in graduate school is a very difficult and probably unsolved challenge. I agree entirely, but coming from the bias of having self-taught the CS, through the online courses, the workshops, et cetera, enough that I could somehow infiltrate the CS department, I find the life sciences is just that big blob that needs a bit more of a pedagogical guide through it: learning all the exceptions to all the rules, learning, actually, there's a second strand. You wrote a great algorithm, but you ignored half the data. Like, we've seen that in tools before, in quite good tools. The thing I've been very surprised at, though, and why I'm very hesitant on the abstraction, is actually some of the CS students. You think, you know, moving to a CS department, I'll be in the promised land, they'll be able to do all this on the command line effortlessly, they won't have to think about it. But actually, with a lot of the way CS teaching is happening now, with, you know, these online platforms or these very contrived Java examples, for example, if they're still doing that, I've been surprised at how poor a lot of CS students have been at just command line file munging: piping a file, copying a file, moving it around. That kind of messy getting-the-data-into-the-program stuff, when they haven't had a clean example of doing it. I think that's because a lot of CS isn't the getting-your-hands-dirty part.
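For readers outside the field, the "file munging" being described is nothing exotic; it is everyday plumbing along these lines (the filenames here are invented for illustration, not from the episode):

```shell
# Make a toy FASTA file and a directory to play with.
mkdir -p archive
printf ">seq1\nACGT\n>seq2\nGGCC\n" > genome.fasta

cp genome.fasta backup.fasta    # copying a file
mv backup.fasta archive/        # moving it around

# Piping: stream the file through grep and wc to count sequences,
# instead of opening it in an editor and counting by eye.
cat genome.fasta | grep "^>" | wc -l    # prints 2
```

None of this is conceptually hard, but it is exactly the messy getting-data-into-the-program step that contrived classroom exercises tend to skip.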
It is like writing code after you've done a bit of theory. So it's all about, say, I learned Java, unfortunately, and that was all about building objects and levels of abstraction and interfaces and all of that kind of jazz. But there's very little on the actual command line, on how you string these programs together. It was just, this is the code, this is the theory, you know, heavy theory, theory, theory, I presume so that you can move forward and you'll forever know the theory of computer science. So everything is easy. Everything is a network. Everything is a graph. That's what I reduce every problem to anyway. So maybe it did work. There is a lot of bad CS teaching out there, substantially, and I know a lot of people failed computer science when I did it as an undergrad. The attrition rate was enormous. But I think a lot of people only went into it, I started in '99, because that was around the first dot-com boom and everyone thought, oh yeah, this is a great area to get into, you'll make a fortune, you'll be a millionaire when you're 25. And then of course, people realized actually this is kind of hard, and, you know, there was like a 40% dropout rate in the first year. And now we just do the same thing with deep learning. Oh yeah. Which is, I mean, another interesting example of what is the right level of abstraction. Because you can do a deep learning bootcamp, and you can teach them to run a convnet on some data pretty damn quickly, but they have no idea what they're doing, necessarily. Well, when I interview people and machine learning comes up, because every CV these days mentions machine learning, I usually get them to draw out what a neural network is. And I say, okay, you know, you say you know neural networks, could you draw one, please? And half the people don't understand the absolute basics. The other half, you know, once they start even drawing it, it's like, grand, yeah, you're fine, you actually know what's going on.
I hadn't encountered that FizzBuzz of machine learning. I tend to go with logistic regression: how do they explain logistic regression? Very simple. Thanks for joining us, Finn. I think we made it through this topic unscathed. I think that you all have learned a bit about training, the common denominator between all students' backgrounds, the level of abstraction, and that you will not go from zero to Nature in one week. We seem to have no agreement on a lot of things, but maybe it's a good thing that we discussed several angles anyway. Thank you all so much for listening to us at home. If you liked this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group and edited by Nick Waters. The opinions expressed here are our own and do not necessarily reflect the views of the CDC or the Quadram Institute.