Hello, and thank you for listening to the MicroBinfeed podcast. Here we will be
discussing topics in microbial bioinformatics. We hope that we can give you some
insights, tips, and tricks along the way. There's so much information we all
know from working in the field, but nobody writes it down. There is no manual,
and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My
co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz.
Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work
on microbes in food and the impact on human health. I work at the Centers for Disease
Control and Prevention and am an adjunct member at the University of Georgia in
the U.S.

Hello, and welcome to the MicroBinfeed podcast. We are joined today by
Emma Griffiths for her second appearance on the show. She was one of the
masterminds for the SARS-CoV-2 metadata in one of our previous episodes. Check
it out if you missed it. We are also joined by João Carriço, making his first
appearance. João currently works with bioMérieux as a bioinformatics data
scientist, and in his previous life, he was a tenured professor at the
University of Lisbon. Nabil and myself, Lee, are your hosts today, and so let's
get into it. And we're talking about ontologies today. What are they? Why do we
use them? To start us off, we came up with a little activity different from our
other episodes, so João, if you wouldn't mind starting us off, could you tell us
how to say "Good morning, all. I love bioinformatics" in Portuguese? Bom dia a
todos. Awesome. And Emma, you volunteered to do it in French. I sure did.
Bonjour à tous. J'aime la bioinformatique. And I'll do one in Spanish. Buenos
días a todos. Me gusta la bioinformática. So, I asked the questions earlier, just
to help us get into it. Who wants to give us the first definition of what an
ontology is? I can take a crack at that. Ontologies are a controlled vocabulary
where all of the fields are organized into a hierarchy, and there are logical
relationships in between all of the terms. And these different relationships can
link information together in different types of ways. Some are more complex than
others. Some are quite simple. So, ontologies are great ways to standardize
information, structure information, and to query information. João, what's your
hot take on that definition? I fully agree. I always prefer the shorter version.
Ontologies describe objects and the relationships between objects, but I think
Emma was spot on, because those relationships are complex. And even
between two items or objects, you can describe several levels of relationships.
And you can even describe relationships between relationships. So, relationships
can become the objects themselves, in a sense. It's quite abstract in its
essence, what an ontology is, but it also has a very critical role. I think one
of the reasons that we need to use it is that, like we discussed before, computers
are dumb, right? Yeah. We just said the same thing in three different languages.
How do we map the relationships between the things to a common meaning? For a
computer, well, Google Translate can do it, but it needs to know what those
relationships are. And if it's direct translation, it's one thing. But if it
gets into semantics and different interpretations, and particular expressions,
then it's much more complex. Why should we care? Is there an ontology for
hatred of ontologies? No, there's not an ontology for hatred of ontologies.
There are ontologies for bad words. There are ontologies for pretty much
everything, and that's because ontologies aren't really a new technology, right?
They're basically a way of understanding existence, right? They're very
philosophical. But they do have this practical application of being able to
structure information. Many people have implemented ontologies for structuring
information for different kinds of projects. From, I think, the U.S. Department
of Defense, Google, lots of different companies and agencies have implemented
ontologies. But the issue is that usually people use ontologies, like I've
already mentioned, in an agency-specific way. And the way you design or build or
architect your ontologies really affects the way that they can be used and the
way that they can interoperate. So if you don't build or construct your
ontologies in the same way, they don't work together. You still end up with
these silos of information. Even though the information is structured, it's not
structured in a compatible or interoperable way. There are efforts out there in
the community to build ontologies in a consistent way. But maybe that's a bit of
a deep dive for right off the top of the show. But yeah, why do you care about
ontologies? Because, just because of that, because they're a way to create
interoperability between datasets and between databases and between humans,
actually. I don't think that we even need to go that far to need ontologies.
Basically, this is a microbial bioinformatics podcast. So I know that all the
bioinformaticians love Excel sheets, right? So the thing is, we would love to
have the data well formatted in a way that is consistent, that we can process,
and where we understand exactly what each field is. We can do that inside our
own Excel sheet, for the person that is doing it, right? But we cannot pass it
on in a way that is comprehensible to others. And we cannot query two different
Excel sheets directly. So the ontology does what Emma said. If we have
ontologies that are truly interoperable, that means that actually you can cross
the meanings and definitions between two different parts or fields together for
extra value and knowledge. That's right. And I think that it also speaks to one
of the points that you made earlier. It's the meaning of words, right? You can
have the same word mean different things to different people. If you're trading
spreadsheets between different agencies that represent different sectors like
animal health and veterinary care and human health and the agricultural sector,
the same words can mean very different things. And the computer doesn't know
that, right? The computer only knows what you tell it. It's really that
equivalency in meaning that's really important. It's not just text searching.
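
The difference between searching text and searching meaning can be sketched in a
few lines of Python. This is only an illustration under stated assumptions: the
term IDs (EX:0001 and so on), the labels, and the is_a table are all invented
here and do not come from any real ontology.

```python
# Hypothetical mini-ontology: all IDs and labels are invented for illustration.
LABELS = {
    "EX:0001": "animal",
    "EX:0002": "chicken",
    "EX:0003": "broiler chicken",
    "EX:0004": "layer hen",
    "EX:0005": "chicken feed",
}

# child ID -> parent ID, read as "child is_a parent"
IS_A = {
    "EX:0002": "EX:0001",
    "EX:0003": "EX:0002",
    "EX:0004": "EX:0002",
}


def ancestors(term_id):
    """Follow is_a links upward and return every ancestor ID."""
    found = []
    while term_id in IS_A:
        term_id = IS_A[term_id]
        found.append(term_id)
    return found


def text_search(word):
    """Plain substring match on labels: knows nothing about meaning."""
    return sorted(t for t, label in LABELS.items() if word in label)


def descendants(term_id):
    """Every term that is a kind of term_id, according to the hierarchy."""
    return sorted(t for t in LABELS if term_id in ancestors(t))


print(text_search("chicken"))   # label matches, which include the feed
print(descendants("EX:0002"))   # kinds of chicken, which include the layer hen
```

In this toy example the text search picks up "chicken feed", which is not a
chicken, and misses "layer hen", which is. The hierarchy query gets both right
because it follows the relationships rather than the spelling.
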
Oh, yes. As you know, I got into ontologies because of the allele calling
algorithms, exactly because of that problem, Emma. Right? And my PhD
was basically microbial typing. I know it's such an area for many of my
mathematicians, but I'm okay with that. But the thing is people say terms and
assume it's the same thing when it's not. It happens a lot of the time with
things like "we have these strains" or "we have these isolates." Is it the same
thing? Is it the same thing in this context as in the other? What level of
knowledge does having one thing or the other imply? It's different. But I think
for me, the best anecdote is when you define what an allele is or what a locus
is. And this is
the same not only for gene by gene methods, but also for SNPs. I remember that
if you ask someone from human genetics what is a SNP definition, you'll get
something different from microbial bioinformatics. And it will also vary from
person to person and with the programs they use. So the definition of what a SNP
is can be very abstract. But if you want to know exactly what "I have a SNP in
this position in this context" means, then it's tied to the whole process that
you used to obtain that specific SNP, right? It's more complex: it's the set of
tools and parameters used that defines that SNP. And if you're
not using the same parameters or tools, then the definitions are not really
interoperable between different researchers. That's right. And one of the things
that I didn't mention in the definition earlier is that one of the nice things
about ontologies is that every term is always really well defined. And you can
disambiguate the meanings of terms using IDs. Every term in the ontology gets an
ID. So, you know, one of the examples that we use when we give talks on
ontologies is the usage of the word biscuits, right? If you talk about biscuits
in the US and biscuits in the UK, they're two different things, right? A biscuit
in the US is kind of more like what I would call a scone, whereas a biscuit
in the UK is like what I would call a cookie, right? So using ontologies, these
are very different things, but you can disambiguate these items using IDs.
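
The biscuit example can be written down as a small Python sketch. Everything in
it is an assumption made for illustration: the IDs EX:1001 and EX:1002, the
definitions, and the `resolve` helper are hypothetical and are not taken from
any real food ontology.

```python
# Hypothetical term records: IDs and definitions are invented for illustration.
TERMS = {
    "EX:1001": "baked good the speaker would call a cookie",
    "EX:1002": "baked good the speaker would call a scone",
}

# (word, region) -> term ID. The same word maps to different IDs by region.
SYNONYMS = {
    ("biscuit", "UK"): "EX:1001",
    ("cookie", "US"): "EX:1001",
    ("biscuit", "US"): "EX:1002",
}


def resolve(word, region):
    """Return the term ID for a word as used in a region, or None if unknown."""
    return SYNONYMS.get((word.lower(), region))


uk = resolve("biscuit", "UK")
us = resolve("biscuit", "US")
print(uk, TERMS[uk])
print(us, TERMS[us])
print(resolve("cookie", "US") == uk)  # same meaning, different word
```

Once a record carries the ID rather than the word, two spreadsheets that both say
"biscuit" can be compared on what was actually meant.
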
Another example is, you know, we were helping to design an ISO standard for the
use of whole genome sequencing for food microbiology. And we were responsible
for the metadata section. And one of the issues that came up was being able to
disambiguate what is a strain from what is an isolate, right? And if you start
digging around in the public repositories and looking up their definitions, they
don't provide much clarity on that. Being able to really articulate the meaning
of the words is very important. Another example, when we were standardizing some
metadata for an interagency project, I came across this one entry. Somebody had
said that the sample was from "animal layer crumb." And I was like, what? That
must be the crumbs at the bottom of a bag of animal feed. So I rewrote it as
that. And then the people came back to me and they're like, no, this is wrong.
What animal layer crumb is, is a special type of feed for chickens that are layers
as opposed to broilers. But if you don't have that specialized knowledge in that
field, that's jargon, right? And the metadata we see is full of jargon. So
if you don't have that specialized knowledge, again, it comes back to that
meaning of words. You can really misinterpret what's going on. And it's that
metadata that provides the context for interpreting your genomics information.
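
One way to guard against that kind of misreading is to map free text only
through a curated table and to flag anything unknown instead of guessing. The
Python sketch below assumes a hypothetical table with a single invented entry
(EX:2001); the ID, label, and `standardize` function are illustrative only.

```python
# Hypothetical curated mapping: the ID and label are invented for illustration.
CURATED = {
    "animal layer crumb": ("EX:2001", "feed for laying chickens"),
}


def standardize(free_text):
    """Map a free-text entry to a curated term, or flag it for review."""
    key = " ".join(free_text.lower().split())  # normalize case and spacing
    if key in CURATED:
        term_id, label = CURATED[key]
        return {"input": free_text, "id": term_id, "label": label,
                "status": "mapped"}
    # No guessing: unknown jargon goes back to someone who knows the field.
    return {"input": free_text, "id": None, "label": None,
            "status": "needs_review"}


for entry in ["Animal  layer crumb", "mystery pellets"]:
    print(standardize(entry))
```

The design choice is that an unmapped entry is returned with a "needs_review"
status rather than a best guess, so the person with the specialized knowledge
gets asked before the record is rewritten.
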
So, you know, mistakes like that, it seems like semantics when we're talking
about standardizing metadata, but it can have a big impact if it skews, you
know, your understanding of what's happening in the real world. Oh, I totally
agree. And you know that, Emma: for me, basically, I consider it the price
that we have to pay to speed things up and move to the next
level. And okay, well, that's the other thing. Everyone always says this is
the price we have to pay, and talks about ontologies like they're bad words. But
I just love ontologies. I love standardization. I mean, part of it's probably
because I'm nosy and I just want to look at people's data. I find the whole
world of semantics and data structures fascinating. So I guess
it's good that I'm doing that and you're not. No, no, I fully understand what
you say because I always hated philosophy in high school, but actually I feel
that ontology is close to philosophy. Oh, they are. Since we are all PhDs
here, right? I finally understood, while doing ontologies, why we are PhDs,
because, you know, the deep meaning of the terms actually has lots of
repercussions downstream, sometimes on the analysis. And if we want to get the
best data possible as input, then we have to have things very well defined to
avoid the very famous garbage in, garbage out. Yeah, exactly. I don't say it
like it's a bad word. I always said that if ontologies work, it's the kind of
thing that people won't even know is there, right? It's like the TCP/IP
protocols or the web protocols, or now the Zoom encoding protocols, that should
be in the background, gluing everything together.
Is there an analogy then? Like, is there an
invisible ontology layer that I'm always using and I'm not aware of it when I'm
doing bioinformatics? Well, when you use Google, they have a proprietary
ontology. So you probably are using ontologies every day and you don't realize
it. Like I said, they're implemented in all kinds of technologies and
industries. It's just that ontologies are slow in moving into public health.
And so I think that's why a lot of us don't really know about them so much, but
yeah. Most people see it as a huge overhead for organizing and putting the data
in the right format. And in my experience, most people are just doing their jobs
and they are doing that. It's assumed that that part needs to be done quickly
because, if they're seeing patients, if they are nurses, if they are people
doing another job, for them this is just the other thing, right? There's usually
no payoff for the one doing the
sample collection. It's all very well to have a wonderful ontology that us as
data scientists can play around with and write wonderful papers and say all of
that. But the poor sucker who has to fill that format in the first place, he
doesn't get a look in on what's going on. It doesn't matter. So why bother? And
I think that's still a massive hurdle. I fully agree. I fully agree because like
I said, for the end user, the thing needs to be transparent in ontology
application. And I think there's even a layer on top of ontologies, things like
natural language interpretation and disambiguation, that can be used to ease
that difficulty. And actually the use of smarter interfaces for data entry
should also alleviate that overhead that you mentioned, Nabil. But you're
absolutely right. So what is the added value for the person in the
field collecting the data for us to play with? That needs to be shown through
the results, and they should also be reported back to them, to say: look,
thanks to you doing this right, we managed to do this; otherwise it would be
impossible. That is sometimes very hard to achieve and almost utopian, right?
There's a number of things that you just mentioned that really resonate with me.
So, you know, one of them is the whole chicken-and-egg problem. To
convince people that ontologies are useful, you really need a good use case, but
to build up ontologies that are useful for people, you need that use case to
begin with. And then you need to be able to
demonstrate to the people that ontologies are useful for them to provide you
with a good use case. When we often talk to folks, we'll say, you know, we can
build the vocabulary for you. We can work with you to help standardize your
stuff. And they're like, well, what can, you know, what can you do? What, like,
what can you do for us? And like, can you show us some examples? And we're like,
well, I don't, what do you need? Tell me what you need. And then we can work
with you to fix it. So it's kind of, you know, this vicious cycle of not having
good examples for people, but you don't have good examples because, you know,
you don't know what their issues are, so you haven't been able to solve them
yet. But also, coming back to this idea of metadata kind of being a second-class
citizen: everybody gets really excited about the genomics results, but people
don't feel they get bang for their buck investing in the metadata side of
things. I think that's a very old-fashioned view now. Just producing a genome
used to be worth something; that was a technical marvel. It simply isn't
anymore. Generating sequencing data is completely useless without good
contextual information. And I'm probably preaching to the choir by saying that,
but just categorically, like you have to get better at this sort of thing.
Otherwise we're going to just do very, very boring publications. Oh, yes. And we
have also the question of scale, right? It used to be the case that you managed
to sequence one or two or three genomes and publish based on that. And now we
are just churning and churning genomes. So I think the metadata is actually more
valuable than the genomic data per se right now, because now the genomic data is
easy, but the metadata is hard. So maybe it's the next frontier. Well, I mean,
if you want, if anyone wants a practical application of what this is like, next
time you open up a publication with a phylogenetic tree on it, just put your
hand over the side of the tree that has all of the little blips that show you
what the country is, all of the color coding, just put your hand over that bit
and just look at this spiky little figure of the tree itself and tell me what
does that tell you? It tells you nothing. And that's what the sequencing data
tells you. Without the contextual information, there is no contrast. There's
nothing to say about any linkage between genotype and phenotype. There's nothing to
talk about what is going on in different niches. There's nothing to talk about
what's happening over time. That's all out the window. And if that is not high
fidelity data coming in, you're not going to be able to do it. You can't impute
it after the fact. You just can't make it up. I don't think it happens.
I think people have slowly understood that you can't just magic
this stuff out of thin air. You need to do better sampling and have these sort
of systems in place to begin with. And we are seeing a change where people take
this more seriously, but it's slow. Yeah. And I think that people tend to think
of metadata as just like a sample came from a chicken, but it's really the
methods as well. Right. So, as you were just mentioning, there's the purpose of
sampling. Was it for a cluster investigation? Was it for surveillance? Were you
monitoring? Was it for an outbreak investigation? You know, that affects the
samples in your tree and it affects the interpretation. I think we need to give
metadata more street cred. And, you know, we often get together at conferences
and we talk about the nuances of each other's tools, the genomics analysis
tools, and there's never any mention of metadata tools. And to me, that's like
the other half of the problem. I know, again, because ontologies and standards
and things are kind of a bad word in the community, they're not going to put
butts in seats, right? They're not sensations. That's interesting that you
raised that. Do you think that the Gene Ontology, the famous or infamous GO, is
sometimes the cause of that bad rep? Well, no, I mean, I think if you have an
experience with something that's tricky, you're going to obviously say, oh, this
whole concept or this whole field is, not garbage, but, you know, not as great.
Right. But that's one example, right? Like how many aligners are
there? How many, you know, we have lots of different tools for, for similar
things and people enjoy hearing about those things, but all you know is the
Gene Ontology. Like you just asked me now, where are ontologies
used in everyday life? And I, you know, I think they just don't get much
attention. And so it's that whole thing where you, you don't experience them and
you don't, you know, you don't see them in action. I think if anyone is going
to get upset about ontologies, they have to then stop using things like KEGG or
COG, or even some of the HMM databases like Pfam and so on, which are all
basically controlled language for describing gene function. And we very happily
take our, you know, predicted protein-coding genes and shove them into those
tools and look at the comings and goings of the different categories and say,
oh, it's a great result and whatever. All of that
heavy lifting of structuring those networks was done by someone else. And the
analysis that someone later is capable of doing is possible because someone
took the time long before to get that to work. And we're building on top of
it. You know, some of it's wrong. There are always issues. You keep changing it,
you keep tweaking it as we learn more and more, but that's the platform, and
that's the power that you can get if you just try to get some sort of order to
your data before you get too bogged down with your different analyses. Yeah,
and, you know, you're right. I think that all of these, you
know, ontologies and community standards, they do build on top of each other,
right? If you think of the kind of swirling free-text system there was
originally, then, you know, minimal metadata, minimal data for matching: every
step improves the system, but you've got to have people using it and complaining
about it and then doing something about it, right? And the other point to make
is that, you know, ontologies are just, they're essentially files, right? Of
instructions of how to structure information. You need a tool to implement the
ontologies, right? Or you need to build the ontology into the backend of your
system. And so, you know, I think that there is a heavy lift in using, I mean,
you can use your ontologies just to standardize your metadata, right? That's
still a heavy lift. Like going and converting what you've got in your database
to this other standard, that's not trivial. That takes a while. But if you can
have a tool that would do that for you, you can start to implement these things
a lot easier. And what João said before about, you know, how ontologies should be
invisible and should be working in the background. That is absolutely true. Like
you shouldn't have to think about ontologies. They should just be there working
for you, but somebody has to have engineered them into your system or given you
a tool to implement them. And I think we just don't have many people working on
that right now. And that's why we don't have good things, right? I mean, you
have lots of people working on the genomic side of things, but you don't have a
lot of people working on the ontology metadata side of things. So, yeah. But I
think that one thing that should be a warning is that defining an ontology is
not the kind of thing that a single person can sit in a room and do. It needs to
be a collaborative effort of discussion between the people using the ontology,
and you need to have agreement. And there's also that social aspect of getting,
you know, several humans to agree on something. That's always a problem. But
it's an important activity because a lot of the time, if people disagree, that
means that some things need to be disambiguated further, and people have to
understand that. But they sometimes feel that they are losing time just
discussing things instead of putting them into practice. That's why
I like Nabil's examples. You're right. You know, we have KEGG, COG, GO
ontologies. You have all this level of things that people could use, right? But
they mostly only use the part they understand, which is kind of the
controlled vocabulary part. We now have the power to, if you are doing climate
change and you have strain data, and if you can link climate data with the
sampling location, with the strain location, with the genomic briefing, that
strain of a strains in that location, things like that. And you can only
do that if you have everything very well defined and you know that
everything is talking about the same thing. Like we said, the computers are dumb,
right? And we have to tell them everything. Hey guys, I'm gonna cut you off
there. We're gonna talk about the practical applications next time. This has
been a really interesting conversation. Tune in on the next episode for
practical applications. Thank you all so much for listening to us at home. If
you like this podcast, please subscribe and like us on iTunes, Spotify,
SoundCloud, or the platform of your choice. And if you don't like this podcast,
please don't do anything. This podcast was recorded by the Microbial
Bioinformatics Group and edited by Nick Waters. The opinions expressed here are
our own and do not necessarily reflect the views of CDC or the Quadram
Institute.