Hello, and thank you for listening to the MicroBinFeed podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There's so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil Ali Khan and Dr. Andrew Page. I am Dr. Lee Katz. Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work on microbes in food and the impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the U.S. Hi, everyone. It's Nabil up here in the editing booth. In this episode, we continue talking to Thuy Jørgensen and Kai Blin about developing new protocols and software to track SARS-CoV-2 variants of interest. Okay, so maybe taking a step back, right? In terms of CT, how does your method work as the CT goes up, the viral load goes down, or the samples become degraded? So that's a great question. CT, of course, is not just CT. For example, the qPCR that's currently being run at DTU has seven dark cycles before it starts recording CT, and this changes the CT value. Some of the hospitals have tested this to be worth around five normal cycles, because the annealing temperature is higher in those seven dark cycles. So to the numbers I'm going to mention, you have to add about five cycles to compare to another setup. Basically, we have about 90% success up until CT 30 with our system, which would translate to CT 35 with other systems. And then it starts dropping off. We do 45 cycles in our PCR, and samples that have a CT of 35 in our system, so around 42 in other systems, only succeed in 25% of cases. But those are edge cases, so extreme that they would generally not be called by the qPCR either.
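As a rough illustration of the CT bookkeeping described above, here is a small sketch. The five-cycle offset and the success rates are the approximate figures quoted in the conversation, not official values, and the function name is mine:

```python
# Sketch of the CT conversion discussed above: the DTU qPCR runs seven
# "dark" cycles before recording CT, which the hospitals estimated is
# worth about five normal cycles. All numbers are illustrative.
DARK_CYCLE_OFFSET = 5  # approximate figure from the conversation


def dtu_ct_to_other_setup(ct):
    """Translate a CT recorded on the DTU setup to a comparable CT
    on a setup without the dark cycles."""
    return ct + DARK_CYCLE_OFFSET


# Approximate Sanger success rates quoted in the episode, keyed by
# DTU-setup CT: ~90% up to CT 30, dropping to ~25% at CT 35.
APPROX_SUCCESS_RATE = {30: 0.90, 35: 0.25}

print(dtu_ct_to_other_setup(30))  # a DTU CT of 30 is roughly CT 35 elsewhere
```

The point of the offset is only that CT values are not directly comparable between qPCR setups; any real conversion would need to be calibrated per instrument and protocol.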
So these results, again, we have the luxury of not having to make the decision. We only report what we find. What's your cutoff for a positive? Do you say a CT of 40, maybe where there's one virion? Yeah, we do not have one. All samples that cross the threshold on any of the targets in the normal qPCR, we Sanger sequence. And on the overall efficiency: the first batch of 195 samples we had, including the 5 to 10 samples that only recorded one target in the qPCR, had an overall efficiency of 85% in our Sanger setup. Have you done any limits of detection on your setup? No. But we do have the CT values of these samples, and in the manuscript that will come out, we have a comparison of the CT value against whether we got a result or not. I think it's pretty incredible that your CT threshold might be very high, that you can detect very small numbers of genomes. Just putting that in context, I think a lot of people around the world have a threshold of CT 30 or 29 or something like that for genome sequencing, and a company I talked with even said they had to go as low as 26 or 24. So depending on the qPCR setup, that can be different, but I understand completely, because our efficiency is not high either when we get to those CTs. The beautiful thing is just that it's so cheap. It costs $5 to run a sample from end to end, including all glassware, including everything. So we don't care if it doesn't work. We care if it works. Can I comment on something in the paper along these lines? I guess we haven't gotten into the actual paper itself, but I really liked some of the language you put in there, because you were basically saying, hey guys, this is 2000s technology and the old stuff still works. And I really like how you drove that home.
And I think you're driving that here in your language too. Yeah, so this was our approach. In my usual job, I hadn't touched Sanger sequencing since I started my master's thesis in 2008. This is not normal for me. It's not my first or second or third choice. I do nanopore sequencing and Illumina sequencing, and I've done all parts of the process, from extracting the DNA to pressing start on the NextSeq machines and on the MinIONs. And Kai and I just discussed it and figured that this would be the simplest thing that would work. And then we tried it. I was actually a little bit embarrassed at first, and we have also gotten some laughs. Somebody wrote to me jokingly, welcome to 1995. And I appreciate that a lot. So contamination is a huge issue in our lab, particularly when you're doing so much PCR. And I'm wondering, could you walk me through how you handle your controls and how you detect contamination? Yes. So basically it's the same people who set this up as the people who work on the normal qPCR detection. That means it's the same reagents, the same people, the same rooms. What we have is good physical separation of the samples. The master mix is prepared in one room. The PCR is set up in another room. It is run in a third room, and it is moved to new wells in a post-PCR room. People who enter the post-PCR room cannot go back to any of the other rooms. This lab usually deals with animal diseases and animal disease detection, so their customers are very, very dependent on them having a very clean setup with the least possible contamination. The way we then detect contamination is to have four negative controls per plate, spread out randomly.
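A minimal sketch of the random negative-control placement just described, assuming a standard 96-well plate layout. The plate size, well naming, and function name are my assumptions; only the "four negative controls, spread out randomly" part comes from the conversation:

```python
import random


def place_negative_controls(n_controls=4, seed=None):
    """Pick well positions on a 96-well plate (rows A-H, columns 1-12)
    to serve as negative controls, spread out at random."""
    rng = random.Random(seed)
    wells = [f"{row}{col}" for row in "ABCDEFGH" for col in range(1, 13)]
    return sorted(rng.sample(wells, n_controls))


print(place_negative_controls(seed=42))  # e.g. four distinct wells like ['A7', 'B1', ...]
```

Randomizing the positions from run to run, rather than fixing them, means a systematic contamination source (a bad pipette channel, an edge effect) can't consistently dodge the controls.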
And we have never... Very rarely we see one position being called as either a mutation or no mutation, but we've never seen a full-length sequence from a negative control. Contamination happens, that happens in absolutely every system, but we have minimized it in a lot of different parts of this chain. And do you run a positive control? We do not run a positive control. We have considered running a high-titer positive control, but we have not done that. That has actually been suggested by SSI as well, but we haven't implemented it yet. In the qPCR, you actually need to analyze a negative result. But if we get a negative result here, we don't analyze it. We only analyze when we get a call. So I think it's less important to have positive controls than it is in the qPCR. So how much does your method cost? $5 per sample, including everything. And it is simpler than any other workflow. There are no new machines, there are no new enzymes, there's minimal handling time, and there's minimal analysis. And there's almost no data storage, because it's really only 300 kilobytes per sequence that we get in. There's no multiplexing, there's no demultiplexing, there's no assembly step. So a lot of the steps that usually take up a lot of energy, even though they're not talked about so much, simply don't exist in this setup. So it costs $5, and it is so simple that anybody who can do a PCR test can also run a variant screen this way. And how much hands-on time is there? Setting up one PCR and spending five minutes moving some of that product into a new plate. Wow, that's really quick. Yes, it is a very, very quick and easy setup. Have you thought about deploying this elsewhere? Has anyone been in touch?
I mean, I know it's very new, but have you been talking to other people, maybe in countries that are just getting started with COVID? Yeah. I had a meeting with representatives from two African countries a couple of weeks ago where I tried to convince them. They have Sanger sequencing in-house, both places, in their COVID testing unit. And I tried to convince them that they could test all of their positive samples with no investment. I don't know if they have actually started, but I would love to spread the message. There's really no investment in starting this, and we scaled to full production in less than a week. So it's ridiculously simple, and anybody who has resources to do PCR testing has resources to do this as well. That's awesome. Okay. So you've got your software on GitHub and you've got a paper coming out on, I presume, bioRxiv. Is that right? Unfortunately not. bioRxiv said, no, thank you. So it's medRxiv, which actually turned out to be a little bit of an issue. In Denmark, this type of study, because it is completely anonymized data with no personal tracking information in it, does not need an ethics approval. medRxiv does require ethics approval. So that means that I have now had to write an ethics approval statement and ask the Danish Ethics Approval Board to say that it's not necessary by Danish law. And then I need to state that in my submission to medRxiv. Have you considered going to preprints.org or one of the other preprint servers? I have considered that, yes, but I have not done so. The protocol itself is actually up on protocols.io. So between that, the GitHub site, and I think there's a page on SSI that has the software so you can run it, right? Yeah. I think that's all anyone would require to get started, right? I don't think there's any additional info.
No, there's no more information you need to get started, but information on the actual calls, on the results that this generates, is not in either of those places. Seeing that it works, and how well it works, that will be in the preprint. I did notice that there is this web app from SSI that seems to be running the software, making it available to anybody. Could you tell us more about that? Part of our idea here was to have a method that was as accessible as humanly possible, to get the whole thing off the ground, right? Because, as Thuy said, we stood up the whole thing in about a week, and that was time that we could easily invest. But I also know that usually when you want people to use a piece of software, you of course also need to spend some time maintaining it. And it was clear that even if you just wanted to roll it out here in Denmark, it was going to get rolled out at the different hospitals and the different regional test centers and everybody, and a lot of these people don't have dedicated bioinformatics staff. So all of the getting set up, getting the dependencies installed and everything, needed to be as simple as possible, right? Can I add to that, that just running it on the command line would also be a challenge? This is a barrier that some people are not going to get over. Yeah, so we have the command line tool, and that's what I do all of the development on, and it's also my preferred way of running it. But it was clear that this was not end-user compatible for the people that we targeted this at. On the other hand, my usual go-to solution for this would be to set up a web service where you can just upload the data and run it. But of course, that also comes with challenges, because hospitals are rightfully squeamish about uploading patient sample data to some random website, right?
I mean, we could have built something that works within Denmark that would be compliant with all of the requirements by, I don't know, hosting it at SSI on some internal network or something like that. But again, in order to make it more widely accessible, it was nice to have the option to go for another solution. And there's a small spinoff company from the SSI, as far as I understand, that basically builds bioinformatics environments by transpiling whatever software you want to use into WebAssembly, and then you can run the whole thing in your web browser. They already had Bowtie and Samtools binaries available in WebAssembly, and then for this pipeline, what they had to do was go and build a Tracy WebAssembly library, to get Tracy in. And then I think they also have a Python interpreter, probably just CPython transpiled to WebAssembly, and that just runs the Python script. I'm not sure how exactly their runtime works under the hood, how they link the files and the input data and everything. But in the end, it's the command line script that runs within the browser in WebAssembly, dumps our standard output onto the screen, and creates a zip file of all of the results that you can then download. The data never really leaves your computer. It's really just the browser making the internal blob available via the download function. So basically you can go to the website, and even though that is at this biolib.com website, the actual data never leaves your system. That's amazing. So essentially, by opening the page, you're sort of downloading the app and running it in your browser, but all the data remains on your system, which is fantastic. This is space-age kind of technology. Yeah, this is pretty amazing. I mean, the performance takes quite a bit of a hit doing this.
It takes like three to four minutes to run this on a hundred samples instead of like 10 seconds, right? But for the convenience of not having to deal with any of the dependencies... As a bioinformatician, you will know that dependency handling is the thing that you spend 80% of your time on, and then you need to spend the other 80% on actually getting your job done. And here you take out the dependency handling part of the work, and that's really super convenient. It's not the most fancy interface ever. It's just a couple of input boxes where you put the zip file in and select what reference you want to use or something like that. Then you hit run and it comes back with a text box of the output, right? But it's workable and it gets you the results within a couple of minutes. Again, we're talking about an overall turnaround time of around a day or two, a bit more, from swab to results. So three or four minutes of runtime for the software is really not a big deal. Yeah, so the company that set this up is called BioLib, and I think they did a fantastic job. I think it's also worth mentioning that, guys, the push of a button is kind of a joke in the bioinformatics community, but this is really a push of a button. There's one button, and you push it to start, and then you get your result. The button says Run, by the way. Yeah, I mean, it's obviously a very single-purpose piece of software where you can do this, right?
I mean, the simplicity wouldn't scale to a more complex analysis pipeline, but particularly for this kind of analysis, and given the fact that the people who usually run it are not computer people whatsoever, the training required is very simple. Because as Thuy said, it's really just push one button, wait a bit, your results are there, you take them and put them wherever they need to be entered, done. You guys are really scaring me with the one-button thing. I know you're saying that it's really easy, but then something happens and then it's not easy. I mean, it's not for all pipelines, right? But for this particular one, you also don't have a ton of options for the command line script. There's a bunch of options that I basically built in because I thought they were need-to-have, right? So we started out with only looking at specific mutations, and I saw that in a couple of samples, we had mutations in the position, but not to the amino acid that people were interested in. The official policy was that we don't call those as having found that mutation, right? I mean, the P681 position, I think it's that one, has a couple of different possible mutations. There's a histidine and an arginine or something like that, and initially we were only interested in the histidine. So if we found an arginine, we were not supposed to report that. But there is a command line switch that says, well, I expected the histidine, but I found an arginine instead, because I thought it was neat. That was me not being able to switch off the scientist brain, I guess. Again, for the public health angle of this, the use is very simple, and the fewer things there are to twiddle, the less likely somebody is going to misuse it and get the wrong results, right?
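The expected-versus-observed amino acid logic Kai describes could look something like this sketch. The dictionary structure, function name, and flag name are my guesses for illustration; the P681 histidine/arginine example is from the conversation, and as discussed elsewhere in the episode, adding a new target mutation is just one more dictionary entry:

```python
# Hypothetical sketch of calling expected mutations at screened positions.
# One entry per position of interest: (reference amino acid, expected substitution).
EXPECTED_MUTATIONS = {
    681: ("P", "H"),  # e.g. spike P681H; adding a new target is one line here
}


def call_position(pos, observed_aa, report_unexpected=False):
    """Return a mutation call for one screened position, or None."""
    if pos not in EXPECTED_MUTATIONS:
        return None
    reference_aa, expected_aa = EXPECTED_MUTATIONS[pos]
    if observed_aa == reference_aa:
        return None  # wild type, nothing to report
    if observed_aa == expected_aa:
        return f"{reference_aa}{pos}{expected_aa}"
    if report_unexpected:
        # the optional switch: a substitution was found, just not the expected one
        return f"expected {reference_aa}{pos}{expected_aa}, found {observed_aa}"
    return None  # default policy: don't call unexpected substitutions


print(call_position(681, "H"))                          # P681H
print(call_position(681, "R"))                          # None
print(call_position(681, "R", report_unexpected=True))  # expected P681H, found R
```

Keeping the flag off by default matches the reporting policy described above, while still letting a curious scientist see the unexpected substitutions.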
And at the end of the day, from a public health perspective, not isolating a person that should have been isolated, or not doing the contact tracing with enough vigor, is worse: false positives are less problematic than false negatives in that sense, right? So again, it's a bit different from what you usually do when you work in science. Yeah, one more thing about what you said: we might have talked about this even last year on this podcast, but WebAssembly is kind of the future, I think, for a lot of stuff, to make sure things run. And I think that would be an interesting topic to focus on another time, exactly for the reasons that Kai was talking about. That you can deploy the app while getting around installation on a system, getting around dependencies and updates, and that you can assure the user the data remains on their machine, is a huge boon for public health applications. Obviously there's a lot of work in terms of making it more efficient, and that's just down to writing a better port of whatever tool to WebAssembly. But you can write Rust, right, Lee? So you can do all of it. I can do anything, yeah. Yeah. I'm magical now. I mean, it's really mainly interesting from the external-tools perspective, right? If we had targeted WebAssembly from the start, I could easily have rewritten the Python pieces of code that I currently have in Rust, but that still would only get you so far, right? Because again, we want to run Tracy, which is written in C++; we want to run Bowtie, which is written in C++; and Samtools is in C, I think. So those you also need to support. And for all the compiled languages, it's still pretty straightforward to get to WebAssembly. I think LLVM now has a WebAssembly target.
So technically, you can probably get everything from there into WebAssembly fairly decently as well. I must say, personally, I'm more impressed that they could just run a random Python script in WebAssembly and that worked, and of course, also all of the passing the data around. I think they have a slight patch set on top of the original script in order to make that happen. So the call-outs to the external tools, where I pipe from Bowtie into Samtools and stuff like that, they needed to do something slightly different there. There's a draft pull request on the GitHub repository that basically has their changes. We're not going to merge that, because we still want to be able to run the standalone command line tool, but it's out there so you can look at it and see what's happening. And of course, all of the magic is happening in their proprietary back end, so I don't really know how that works. But it's pretty nifty. I fully agree. This is actually also the setup that we are using. The BioLib page is not only there for other people to use; it's what the people who work in the pipeline day to day are using to analyze the results. And I think we have a couple of other installations as well. I've talked to some people from, I think, Aarhus University, who were setting up something there, and they're running everything on a cluster. That's where the Singularity images are coming from, because they were like, hey, we'd like to run it on our cluster, Singularity is the most straightforward thing to do there, so here's a Dockerfile, can you please add it? And the reason why it's in Bioconda is that when Thuy got started and we were playing with the prototypes, he was running on his Windows machine, and Bioconda was the thing that he had set up.
So it was just the easiest way to make sure that he could go ahead and try a new version of the prototype, if he could just install all of the dependencies from Bioconda. And then, seeing how all of the dependencies were in Bioconda anyway, I figured getting the tool itself set up in Bioconda, so you can install it from there too, was pretty straightforward. Bioconda has this Bioconda bot that basically picks up new releases when I do one to PyPI and automatically opens a new pull request. I basically just need to go over and say, yes, I did indeed release a new version, I'd like to have that one in Bioconda as well, it's fine, merge the request, please. And then within however long it takes for them to build and roll out the new packages, it will also be available in Bioconda. So again, as I mentioned before, to add a new mutation that falls within our window, all we need is five minutes of work: open the editor, find the code, add a line to the dictionary of the mutations we care about, and do a new release. And then within a couple of minutes, we have the new thing up and running. From a software engineering perspective, it's pretty straightforward and not a lot of work. That's nice, because it's a side project, right? It's an important side project, but it's also something I'm happy that I don't need to devote too much of my time to. I think if I can add one thing before we end, it's actually the scalability. I'm a little sad that I completely forgot to talk about that. But as Kai mentioned, one sample costs a fifth of what five samples cost, and it costs five times as much to run 25 samples as five. All the other systems don't scale particularly well; you need a minimum number of samples to press start on an Illumina machine.
But with this, you can take whatever number of samples you want, and it scales completely linearly. And it's dirt cheap. So that's all the time we have for today. We've been talking about SARS-CoV-2 and a new method for identifying variants of concern using Sanger sequencing. Welcome to 1995. I want to thank Kai and Thuy for joining us today. And thanks to everyone for listening. Thank you all so much for listening to us at home. If you like this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group and edited by Nick Waters. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.