Hello, and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There's so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Both Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work on microbes in food and their impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the US.

Hello and welcome to the MicroBinfie podcast. Generating data in our field is one thing, but sharing it with others is another challenge. Today we're talking about data sharing, which dovetails nicely with our ontology episode. We're joined again by Emma Griffiths and João Carriço. Emma is a research associate at the University of British Columbia in Vancouver, Canada. Her lab is embedded at the BC Centre for Disease Control, and she leads metadata harmonization for the Canadian COVID genomics network, CanCOGeN. João currently works with bioMérieux as a bioinformatics data scientist; in a previous life, he was a tenured professor at the University of Lisbon. I'm Nabil, your host for today. Let's get started.

As a catch-up for everyone who didn't hear the last episode: what is the single biggest reason we should use ontologies?

Well, it's hard to boil it down to a single reason, but I would say because they help to standardize information in a way that makes it interoperable.

Data sharing is the title of this episode, right? If we share data, we have to know how to share and how to collate the things we are sharing. We have to know that what we are sharing concerning, say, location is actually the same thing the other person means by location, or by susceptibility, and so on. So defining what things are is needed in order to share, and ontologies fill that need.

And for people who aren't quite convinced: what are some examples where this works that they can refer to, to get a better idea of what we're talking about?

One example we've been working on is helping to get standardized terms into GenomeTrakr. GenomeTrakr, as I'm sure people are familiar, is a network of labs looking at foodborne pathogens, led by the FDA, with many different contributors and labs all over the world. To better standardize the food sources and the environments that foodborne pathogens inhabit, GenomeTrakr currently provides at least two fields that take standardized ontology terms, and they are looking at analyses right now to better understand how that standardized information can enrich analysis. So that's one example. João?

Yeah. For instance, some time ago we developed TypOn, the microbial typing ontology, and we finally managed to start using it in chewBBACA and the Chewie-NS nomenclature server.
This allows us to define what a schema is, what an allele is, and what a locus is, and it helps clarify things and lets people share gene-by-gene schemas in a way that is more interoperable. You can then collate that with other ontologies for further meaning, allowing further exploration. So that's the case in microbial genomics, but in human genomics, and especially in drug discovery, there are lots of really good examples, right, Emma? I'm thinking about the work from Michel Dumontier on drug discovery using ontologies. You probably know more about that than I do.

So this is exactly it, right? It's using ontologies to create interoperability, to create knowledge bases. It's not just a particular database with standardized terms in it. In the last episode we talked about what ontologies are, and they're not just standardized terms. They provide definitions for terms, and IDs so that you can disambiguate meanings, but they also provide synonyms so that you can start to equate terms from different lexicons. And they have relationships between all of the different terms that enable you to query and link information in more complex ways. It's really by leveraging these relationships that you're able to build a more intricate understanding of drugs: how they work in the body, their chemistry. This all contributes to building a knowledge base, and that's the work going on in that lab.

There's another good example: ARO, the Antibiotic Resistance Ontology. It underpins the CARD database, which I think a lot of people listening use, and the RGI, the Resistance Gene Identifier. It's a really nice ontology that links together information about antibiotics, their mechanisms of action, and gene and genomics information. That has enabled this tool and this database to really move the needle in letting people analyze genomes for antimicrobial resistance. So I think that's a really good example.

I did not know that about the AMR databases. It definitely sounds, from both of you, like this sort of work is standing on the shoulders of giants: an iterative, cumulative effort that we get better at over time and that is worth the initial investment. Where I want to move next is: now that we've convinced everybody that this is how they should do things, how do they integrate it into their work? What do people need to keep in mind? Where does this fall over? What are the tips and tricks from both of you?

That's a good question. A good place to start is to standardize your metadata using community standards. If you use an ontology: so, there is a community of ontology builders called the OBO Foundry, the Open Biological and Biomedical Ontologies Foundry. They are a community of scientists using the same principles and practices to create the architecture of their ontologies, and they make ontologies about all kinds of things. Uberon, the anatomy ontology; the environment ontology; the disease ontology; the food ontology. For every domain you can think of, there's probably an ontology that would fit your metadata. You can go and check out their resources and start there.

Yeah, and there's actually a coronavirus infectious disease ontology by now. It's there. It's called CIDO.
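Editor's aside: to make Emma's description concrete, here is a minimal Python sketch of what an ontology term carries beyond a bare label. The IDs, labels, and hierarchy below are invented stand-ins, not entries from a real ontology; they just show how stable IDs plus synonyms allow disambiguation, and how is_a relationships support richer queries.

```python
# A toy sketch of what an ontology term carries beyond a bare label: a stable
# ID, a label, synonyms, and is_a links to parent terms. IDs and labels are
# invented stand-ins, not entries from a real ontology.

TERMS = {
    "FOO:0001": {"label": "food product",    "synonyms": [],               "parents": []},
    "FOO:0002": {"label": "poultry product", "synonyms": ["fowl product"], "parents": ["FOO:0001"]},
    "FOO:0003": {"label": "chicken meat",    "synonyms": ["chicken"],      "parents": ["FOO:0002"]},
}

def resolve(text):
    """Map a free-text label or synonym to a stable term ID."""
    needle = text.strip().lower()
    for term_id, term in TERMS.items():
        if needle == term["label"] or needle in (s.lower() for s in term["synonyms"]):
            return term_id
    return None

def ancestors(term_id):
    """Walk is_a links upward: the relationships that richer queries exploit."""
    found, stack = set(), list(TERMS[term_id]["parents"])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(TERMS[parent]["parents"])
    return found

print(resolve("chicken"))     # FOO:0003 -- a synonym disambiguated to a stable ID
print(ancestors("FOO:0003"))  # {'FOO:0002', 'FOO:0001'} -- enables "all poultry" queries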
Right. So, if I may, Emma, that is one take. You have to, well, that's the most important thing: you have to standardize your metadata. As for how to do that, the trick I usually use is to ask: what do you want to achieve? What is the question you are answering? Based on that, I take a bottom-up approach to the ontology, trying to classify the terms and understand how they fit together. But then we have to find the connections to other possible ontologies. For instance, if I'm talking about something genomic and someone tells me, oh, but I want to connect this to antibiotic resistance, then I have to see how I can integrate something like ARO, the antibiotic resistance ontology from CARD, like Emma said, and annotate following the rules of ARO, so I can make the connection between whatever I'm developing and the bigger scope. For me, it's always: start from the problem, let things grow naturally around it, and keep asking people, is this the only question? What more do you want to know? The aim is to have something achievable as fast as possible and build on it.

So the immediate problem is managing your information, right? That involves standardizing your terms and your metadata. And as you say, the next step is how you want to use it. It's generating all of these questions and thinking about how you want to use this information, because that also affects the kinds of ontologies you implement and how you implement them in your database or your system. One way of implementing ontologies is creating something like a graph database, so that you can start to query your data more effectively. There are things you can do to implement these ideas, but they all have different levels of lift: different levels of input of energy and resources. So you have to think about what your outcomes are and what questions you want to ask.

I might take the opportunity to ask some of the questions I keep running into, and maybe both of you can give me some advice; I'm sure the problems I face are very similar to what others deal with. These are very simple use cases, and maybe through them we can illustrate the potential, keeping in mind that it's not only for myself, but also something I'm going to share with other people. One of the issues we often have when we start off with a system is the one you've both been talking about: you have a lot of free text. What methods would you suggest for taking a paragraph of text and converting it into some category, and then into some sort of ontology term?

I'm going to do some shameless promotion here, but there are lots of tools that exist out there. You can go to the EMBL-EBI website: there's a nice tool there where you can put in big chunks of text, and it already has ontologies in its back end and will highlight all of the ontology terms in a paragraph of text, if that's what you want to do.
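Editor's aside: Emma doesn't name the EBI tool, but EMBL-EBI's Ontology Lookup Service (OLS) exposes a search API that gives the flavour of programmatic term lookup. The endpoint and response fields below are assumptions based on OLS documentation; check EBI's current docs before relying on them.

```python
# Hedged sketch: querying the EMBL-EBI Ontology Lookup Service (OLS) search
# API to map a phrase to candidate ontology terms. The endpoint and response
# fields are assumptions based on OLS documentation; verify before relying.
import requests

resp = requests.get(
    "https://www.ebi.ac.uk/ols4/api/search",
    params={"q": "chicken breast", "rows": 5},
    timeout=30,
)
resp.raise_for_status()

for doc in resp.json()["response"]["docs"]:
    # Each hit should carry a CURIE-style ID, a label, and its source ontology.
    print(doc.get("obo_id"), doc.get("label"), doc.get("ontology_name"))
```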
If what you're working with is short free-text records, like metadata fields, we have developed a tool called LexMapr that is designed precisely to standardize terms from that kind of text. It divides all of the words up into tokens and compares them, so it converts your short entries into standardized terms, and you can automate that. It's a command-line tool, and it's also available as a service, but there are lots of other tools out there that people are developing that will help you with this.

There's no magic bullet, though, unfortunately. Yeah, exactly, because as you know by now, people will put anything in those fields: lots of typos and lots of half-sentences, and that is really hard for a computer to decode. Like I like to say, computers are dumb. I consider this like the problem I had when I did statistical consulting for my colleagues: they would come to me with the data already collected, and I would say the same thing Ronald Fisher did: if you come to me only after the data are collected, the only thing I can do is a post-mortem on the data. Ideally, and this is a problem because people don't think about it from the beginning when they are collecting the data, and there's lots of legacy data that should be useful, you should try to define everything a priori, before data collection. That helps a lot: it speeds up data collection and helps anyone collecting the data. But if you have free text, then you have the tools Emma told you about, and in some cases you can get a very good run out of them. If you get 70 to 80 percent solved using those tools, that's already a big, big help, but rest assured that in some cases you'll have to dig in manually.

And I think that's an important point: at the point of information collection, or sample collection, you have to remember that genomics can be used for all kinds of things. It's very much part of the culture to share; in genomics it's been very common from the start to share data. So when you put your information, your sample, your metadata out into the world, you want to make it as informative as possible, so that people can use it in many different ways. For example, saying that a sample is from a chicken: what does that mean? Is it from retail? Is it from a live chicken running around? Is it a swab of a chicken in an abattoir? Try to be as descriptive as possible when you're collecting the information. But of course, that's still free text, and you still need to use that information.

So let's say I've now dealt with my free text, or at least taken the data and played around with the tools both of you mentioned, and I've got a rough idea of what I'm looking for. Am I then going to the OBO Foundry and looking up ontologies that relate to that?

Well, if you've used LexMapr or any of the other tools, they will have been accessing different ontologies, so you don't have to do that; if you've already used a tool, it's already done that for you. But if you want to know more, if you want to do more with your information, then yes, you can go to the OBO Foundry and have a look.
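Editor's aside: this is not LexMapr itself, just a toy sketch of the tokenize-and-compare idea Emma describes, with an invented vocabulary. Real tools layer scoring rules, stemming, and curated synonym lists on top of this.

```python
# Toy sketch of the tokenize-and-compare idea (illustration only, not LexMapr):
# split a short free-text metadata entry into tokens and pick the vocabulary
# term whose label or synonym overlaps it best. The vocabulary is invented.
import re

VOCAB = {
    "FOO:0003": ["chicken meat", "chicken"],
    "FOO:0007": ["chicken breast", "breast of chicken"],
    "FOO:0010": ["retail chicken sample"],
}

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def best_match(free_text, min_score=0.5):
    query, best = tokens(free_text), (None, 0.0)
    for term_id, labels in VOCAB.items():
        for label in labels:
            label_tokens = tokens(label)
            score = len(query & label_tokens) / len(query | label_tokens)  # Jaccard overlap
            if score > best[1]:
                best = (term_id, score)
    return best if best[1] >= min_score else (None, best[1])

print(best_match("swab of chicken breast"))  # ('FOO:0007', 0.75), rescued by a synonym
print(best_match("Chikcen breast, retail"))  # (None, 0.25): the typo blocks a confident match
```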
To follow on from João's point, then: I don't know very much about sample collection, but I want to insist that whoever is doing it includes certain information from the start. Where would I go to find a generic specification of the kinds of values that people look at or track? Would these ontologies help me understand what I could potentially be looking for, so I could pick and choose from that to give to someone else?

So, two things there. One, unfortunately, I don't know of a single repository of standards for all kinds of things. But there is the Genomic Standards Consortium, and they do an absolutely fabulous job of putting together standardized attributes, specific lists of standardized terms. There's MIxS, the Minimum Information about any (x) Sequence; there's MIMS; there's MIGS. So there are all kinds of attribute packages out there that you can use for standardizing your information, and then of course there are the attribute packages offered by the public repositories, the INSDC. But, another shameless plug: we also have a tool called GEEM. It's basically like an Amazon for ontologies. It enables you to create specifications; you can browse different ontologies. Say you wanted to describe food, food products, and food processes: you can select that particular ontology, browse it, pick whatever fields of information you want, and create a specification. Then you can use that spec to create data-entry forms. So you can go manually to the OBO Foundry, or look at MIxS or any of these different standards, and create your own standardized form, or you can get something like GEEM to do it for you.

I can't remember: do those include things like what the valid values would be, if that makes sense for those terms?

I think they do. The GSC standards all prescribe the formats and the suggested values; those are really good, and they give you examples of usage. They're currently in the process of developing further packages for food, food production environments, and animal feed, and I believe those will be made available soon. So when you go to submit your data, whether a field is clinical or environmental, there will also be an option for you to submit food information.
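Editor's aside: to show what "prescribed formats and suggested values" buys in practice, here is a minimal hypothetical validator. A spec lists each field's requiredness and permitted picklist values, the way an ontology-backed attribute package would; the field names and values are invented, not taken from a real GSC package.

```python
# Hypothetical sketch of validating a record against a MIxS-style attribute
# package: required fields plus controlled picklists. Field names and permitted
# values are invented for illustration, not taken from a real GSC package.

SPEC = {
    "geo_loc_name": {"required": True,  "allowed": None},  # free text, but must be present
    "env_medium":   {"required": True,  "allowed": {"soil", "water", "food matrix"}},
    "host":         {"required": False, "allowed": {"Homo sapiens", "Gallus gallus"}},
}

def validate(record):
    problems = []
    for field, rule in SPEC.items():
        value = record.get(field)
        if value is None:
            if rule["required"]:
                problems.append(f"missing required field: {field}")
        elif rule["allowed"] is not None and value not in rule["allowed"]:
            problems.append(f"{field}={value!r} not among permitted values {sorted(rule['allowed'])}")
    return problems

sample = {"geo_loc_name": "Canada: Vancouver", "env_medium": "chicken"}
for issue in validate(sample):
    print(issue)  # flags env_medium='chicken': use a term from the picklist instead
```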
I should put in a word of warning there, Emma; I sometimes have to break the dream before reforming it. The thing is, there are lots of ontologies, as everyone has understood by now, and there is some overlap between ontologies, and that overlap is not necessarily well defined. The word of caution I want to give: we started the previous episode talking about different languages, and there are small details in them. When I said "I love bioinformatics" in Portuguese, I said "I love", but not in a way that means the same thing that "love" means in English. There are slight differences that can make a world of difference, and this is one of the problems of ontologies. Like different languages, you have to understand exactly where they came from and what they apply to. And sometimes validations, like you said, Nabil, are not necessarily in the scope of the ontology. They can be; you can include them. But those validations can also be contested, because, for instance, a sequencing-related ontology built for the human domain will set different thresholds than one for the microbial domain. That's my word of caution when using specialized, high-level ontologies: you have to understand very well what a term means, and whether the same term in different ontologies actually means the same thing, or whether the terms are mapped. Efforts like the OBO Foundry try to take care of that, right, Emma? They try to validate, somehow, and to reuse instead of recreate.

Exactly, thanks for bringing that up. In the OBO Foundry, one of the key practices is reuse of terms. You can create new terms, like the biscuits example that we had in the previous episode; you can create new terms if, say, you didn't like a definition that somebody else created. But let me back up for a second. All of these ontologies are created by people, and all of these people have their own world views. If you are in animal health, or human health, or environmental research, or agriculture, your world view is going to be slightly different, because of what you have to do, what your research is, how you operate. Ontologies are meant to be universal: you're trying to use language that everybody understands. They're meant to represent truth, but different people's truths can differ depending on their world. So you're right that all the ontologies are meant to reuse terms as much as possible, they're meant to create this interoperability, but because different people are creating them, there can be differences in axioms, which result in differences in relationships and logic. So even though you can run a reasoner over different ontologies, clashes get created. Nothing is perfect. You're right that ontology development isn't as seamless as I'm painting it to be, and there are lots of issues.

What I wanted to highlight, Nabil, is that this is not a canned solution. It's not easy to set up an ontology, in the sense that I always feel it's more of a deep-dive activity. You actually have to go through the process, endure the process for a while; it's not easy for the uninitiated. But after that, one of the things I find quite rewarding, and eye-opening, is that the process itself reveals things we never thought about in the field we are looking at. For instance, we have some drafts of NGSOnto, an ontology for capturing the relationships in sequencing: from library prep through the raw data to the data produced in NGS sequencing activities. Building it actually helped me a lot to understand the process better and to see where things are not well defined.

I mean, to name something is to know something. Exactly. Is a rose by any other name still a rose? In this case that wouldn't apply: a rose by another name would not smell as sweet in the ontology. Unless they were synonyms. Yeah.
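Editor's aside: Emma mentions running a reasoner over ontologies, and João returns to reasoning below. Here is a small sketch of what inference adds, assuming the rdflib and owlrl Python packages and an invented class hierarchy: after materializing the RDFS closure, a sample typed only as "chicken meat" can also be retrieved as a "food product", a link never stated explicitly.

```python
# Sketch of ontology reasoning, assuming the rdflib and owlrl packages
# (pip install rdflib owlrl). The tiny class hierarchy is invented for
# illustration, not taken from a real ontology.
from rdflib import Graph, Namespace, RDF, RDFS
import owlrl

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.ChickenMeat, RDFS.subClassOf, EX.PoultryProduct))  # asserted
g.add((EX.PoultryProduct, RDFS.subClassOf, EX.FoodProduct))  # asserted
g.add((EX.sample42, RDF.type, EX.ChickenMeat))               # asserted

# Materialize everything the RDFS semantics let us infer.
owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)

# Never stated directly, but now present: sample42 is also a FoodProduct.
print((EX.sample42, RDF.type, EX.FoodProduct) in g)  # True
```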
So, between the two of you, I feel like there are a lot of existing efforts out there that can help, so that people don't have to start from scratch. Particularly the MIxS standards: those seem quite practical in how they're defined and can really help people jumpstart. But assessing whether these are appropriate, whether they're going to answer the question, or even merging different ones together, seems to be a very personal problem. There isn't really a way around that; it's a human element that has to be assessed, and not something anyone can give much advice about without really understanding the question someone is trying to ask.

Yeah. There's one quick point I want to make, though: ontologies and standards really only deliver on their promise when lots of people use them. When only small pockets of people use them, in bubbles, then you're really not going to get that interoperability. It's just like that xkcd comic: you're just building another standard, and then it's just one of many.

I usually use the metaphor of the railroad system. If you just build railroads inside a small town, and they never connect to other towns, other countries, and transcontinental transport lines, then it's good, but it's limited; it doesn't have the potential to connect everything. I started out trying to run away from ontologies, until I saw the Linked Data concepts from Tim Berners-Lee, which are really another level built on ontologies. We want linked data, right? How do we link data in order to share it? And then I realized: yes, to do it properly, we actually need ontologies, or simplified forms of ontologies. But if we want the full power of connecting and linking the data, then we need ontologies with all the aspects and nitty-gritty details.

If we do a little exposé: even our GenEpiO, the Genomic Epidemiology Ontology, and your TypOn and NGSOnto won't work together, because we designed them differently. One of our goals is to harmonize, right? But that's a good example of people creating ontologies for specific purposes; when the ways they build those things don't jibe, how you group things into classes and the kinds of relations you use, things don't work together. And then you're stuck with the same problem, the same siloing issue.

And one of the powers I like in ontologies is that once you have it all done, you can actually use reasoning on top of it. You can try to infer relationships between terms of ontologies, or of different ontologies in this case, categories in our metadata that we never realized were connected, just from the ontology itself and reasoning algorithms that let us explore that.

Right. I think that's already outside the scope of this one. Oh, I can throw a question back at you, Nabil. What was the reason you never used ontologies so far?

That's a good one. Why didn't I use ontologies? Because I didn't understand my data, and I had to write the paper now. I had to write the paper yesterday.

That's a very good point, because it takes time, right? Either you write the paper on ontologies, or you write the paper now. And it's exactly that.

I mean, the thing is, it's an iterative process. You start off with a dataset, you don't quite understand what it's about, and you learn it; you ask questions, you go backwards and forwards. And by the time you've understood what the data actually means, you've written the paper, you've generated the figures, it's been sent out for review. And then you don't need the ontology, because you've done the work. You've done the work.
Yeah, but you've cried a lot while you were doing the work, trying to fit things together. Whereas if you had started off, I think João mentioned this before, by saying, okay, these are the questions we're going to ask, where are the standards, let's put these structures into place before we collect the samples, before we do the analysis, then you don't have to worry about it while you're writing the paper and doing the analysis.

I think, for me, where this will come into its own is if and when I start creating the next project on the back of the one I've done. One of the things we haven't said is that, implicitly, everybody is already doing this. When you have to write the next proposal, you say: I've done all this work, it's in this paper, you can read it; now I'm going to do this next thing, give me money; I'm going to sample this, look at this information, collect this, compare it with that. You're building your ontology in free text, in a six-page document that you send to anonymous reviewers: you're doing that specification then and there, based on what you've done before. So for me, that's where the real value comes: if you walk into the project with this explicitly, you can just point to it and say, we're going to do this.

Data management should be part of every single grant application, right? It should be part of what reviewers are looking for. How are you structuring your information? How are you storing it? How are you using it? How are you going to reuse it?

At least in the UK, that is a mandatory section. You have to cover data management, and not just how you're going to look after the data, but how you're going to disseminate that information. And usually people give a very light explanation, like, oh yes, we will upload data to public repositories X, Y, Z. Okay, sure, you have to do that, but how are you going to make it interoperable? That's a question people tend not to ask, but it's one that is now mandatory for most research councils.

I think my point, and I'm just parroting what João is saying, is that you can run away from this as much as you like: you are going to be doing it one way or the other. You are going to have to structure your data, because that's the only way you'll be able to analyze it. Either you do it explicitly, in a nice tidy way that you can reuse the next time, over and over again, getting better at it, or you grope around in the dark and have an awful time of it.

My final comment on that relates to the nature of science: it's a cumulative activity. If you are a PhD student struggling with your paper, this is not something you really need to "waste" time on. I say "waste time" as a non-native speaker: it's not really a waste of time, but it is a very time-consuming way of thinking about the problem. But of course, if you are a group leader with a plan, where over time you need to put things together, then you realize that you have to do it.
You have to think ahead, and you have to say: look, PhD student number one, you have to have things in this format, because if you don't, they eventually leave and everything is in an Excel sheet that only they understand. Right. So it's extending the longevity of your data, right? Exactly: it's making it really data for the future, data that can be understood five and ten years from now. Right. Yeah, and it means you can squeeze a student project out of it, or something like that. Exactly. It pays dividends in the end.

Any final comments from the both of you on this?

Standardize, standardize, standardize. Think about standardization first, before the project. We know it's hard, but it will pay off a lot.

And don't just think about what you need for your own project and standardize that: think about how your data can be used in the future, and try to use international community standards as much as possible.

Don't reinvent the wheel, just realign it. Yes. Exactly, exactly. Yeah. The one thing that all bioinformaticians love to do. Yep. I've got wheels for days. Wheels for days. Yep.

All right, and on that bombshell, I think we'll draw to a close. I want to thank João and Emma for joining me today on the MicroBinfie podcast. We've been talking about sharing data, organizing data, and ontologies; we hope you've enjoyed it, and hope to see you next time. Thanks for having us, Nabil. Thank you very much for the invitation, Nabil. No worries. All right, see you all later.

Thank you all so much for listening to us at home. If you like this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics group and edited by Nick Waters. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.