Hello, and thank you for listening to the MicroBinfie podcast. Here we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There's so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Both Andrew and Nabil work at the Quadram Institute in Norwich, UK, where they work on microbes in food and their impact on human health. I work at the Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the US.

Hello and welcome to the MicroBinfie podcast. Generating data in our field is one thing, but sharing it with others is another challenge. Today we're talking about data sharing, which dovetails nicely with our ontology episode. We're joined again by Emma Griffiths and João Carriço. Emma is a research associate at the University of British Columbia in Vancouver, Canada. Her lab is embedded at the BC Centre for Disease Control, and she leads metadata harmonization for the Canadian COVID genomics network, CanCOGeN. João currently works with bioMérieux as a bioinformatics data scientist; in a previous life, he was a tenured professor at the University of Lisbon. I'm Nabil, your host for today. Let's get started.

As a catch-up for everyone who didn't hear the last episode: what is the single biggest reason we should use ontologies?

Well, it's hard to boil it down to a single reason, but I would say because they help to standardize information in a way that makes it interoperable.

Data sharing is the title of this episode, right? If we share data, we have to know how to share and how to collate the things we are sharing. We have to know that what we are sharing concerning, say, location is actually the same thing the other person means by location, or by susceptibility, and so on. So defining what things are is needed in order to share, and ontologies fill that need.

And for people who aren't quite convinced: what are some examples where this works that they can refer to, to get a better idea of what we're talking about?

One example we've been working on is helping to get standardized terms into GenomeTrakr. GenomeTrakr, as I'm sure people are familiar, is a network of labs looking at foodborne pathogens, led by the FDA, with many different contributors and labs all over the world. To better standardize the food sources and the environments that foodborne pathogens inhabit, GenomeTrakr currently provides at least two fields that take standardized ontology terms, and they are looking at analyses right now to better understand how that standardized information can enrich analysis. So that's one example. João?

Yeah. For instance, some time ago we developed TypOn, the microbial typing ontology, and we finally managed to start using it in chewBBACA and the Chewie-NS nomenclature server.
This allows us to define what a schema is, what an allele is, and what a locus is, and it helps clarify things and lets people share gene-by-gene schemas in a way that is more interoperable. You can then collate that with other ontologies for further meaning, allowing further exploration. So that's the case in microbial genomics, but in human genomics, and especially in drug discovery, there are lots of really good examples, right, Emma? I'm thinking about the work from Michel Dumontier on drug discovery using ontologies. You probably know more about that than I do.

So this is exactly it, right? It's using ontologies to create interoperability, to create knowledge bases. It's not just a particular database with standardized terms in it. In the last episode we talked about what ontologies are, and they're not just standardized terms. They provide definitions for terms, and IDs so that you can disambiguate meanings, but they also provide synonyms so that you can start to equate terms from different lexicons. And they have relationships between all of the different terms that enable you to query and link information in more complex ways. It's really by leveraging these relationships that you're able to build a more intricate understanding of drugs: how they work in the body, their chemistry. This all contributes to building a knowledge base, and that's the work going on in that lab.

There's another good example: ARO, the Antibiotic Resistance Ontology. It underpins the CARD database, which I think a lot of people listening use, and the RGI, the Resistance Gene Identifier. It's a really nice ontology that links together information about antibiotics, their mechanisms of action, and gene and genomics information. That has enabled this tool and this database to really move the needle in letting people analyze genomes for antimicrobial resistance. So I think that's a really good example.

I did not know that about the AMR databases. It definitely sounds, from both of you, like this sort of work is standing on the shoulders of giants: an iterative, cumulative effort that we get better at over time and that is worth the initial investment. Where I want to move next is: now that we've convinced everybody that this is how they should do things, how do they integrate it into their work? What do people need to keep in mind? Where does this fall over? What are the tips and tricks from both of you?

That's a good question. A good place to start is to standardize your metadata using community standards. If you use an ontology: so, there is a community of ontology builders called the OBO Foundry, the Open Biological and Biomedical Ontologies Foundry. They are a community of scientists using the same principles and practices to create the architecture of their ontologies, and they make ontologies about all kinds of things. Uberon, the anatomy ontology; the environment ontology; the disease ontology; the food ontology. For every domain you can think of, there's probably an ontology that would fit your metadata. You can go and check out their resources and start there.

Yeah, and there's actually a coronavirus infectious disease ontology by now. It's there. It's called CIDO.
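Editor's aside: to make Emma's description concrete, here is a minimal Python sketch of what an ontology term carries beyond a bare label. The IDs, labels, and hierarchy below are invented stand-ins, not entries from a real ontology; they just show how stable IDs plus synonyms allow disambiguation, and how is_a relationships support richer queries.

```python
# A toy sketch of what an ontology term carries beyond a bare label: a stable
# ID, a label, synonyms, and is_a links to parent terms. IDs and labels are
# invented stand-ins, not entries from a real ontology.

TERMS = {
    "FOO:0001": {"label": "food product",    "synonyms": [],               "parents": []},
    "FOO:0002": {"label": "poultry product", "synonyms": ["fowl product"], "parents": ["FOO:0001"]},
    "FOO:0003": {"label": "chicken meat",    "synonyms": ["chicken"],      "parents": ["FOO:0002"]},
}

def resolve(text):
    """Map a free-text label or synonym to a stable term ID."""
    needle = text.strip().lower()
    for term_id, term in TERMS.items():
        if needle == term["label"] or needle in (s.lower() for s in term["synonyms"]):
            return term_id
    return None

def ancestors(term_id):
    """Walk is_a links upward: the relationships that richer queries exploit."""
    found, stack = set(), list(TERMS[term_id]["parents"])
    while stack:
        parent = stack.pop()
        if parent not in found:
            found.add(parent)
            stack.extend(TERMS[parent]["parents"])
    return found

print(resolve("chicken"))     # FOO:0003 -- a synonym disambiguated to a stable ID
print(ancestors("FOO:0003"))  # {'FOO:0002', 'FOO:0001'} -- enables "all poultry" queries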
Right. So, if I may, Emma, that is one take. You have to, well, that's the most important thing: you have to standardize your metadata. As for how to do that, the trick I usually use is to ask: what do you want to achieve? What is the question you are answering? Based on that, I take a bottom-up approach to the ontology, trying to classify the terms and understand how they fit together. But then we have to find the connections to other possible ontologies. For instance, if I'm talking about something genomic and someone tells me, oh, but I want to connect this to antibiotic resistance, then I have to see how I can integrate something like ARO, the antibiotic resistance ontology from CARD, like Emma said, and annotate following the rules of ARO, so I can make the connection between whatever I'm developing and the bigger scope. For me, it's always: start from the problem, let things grow naturally around it, and keep asking people, is this the only question? What more do you want to know? The aim is to have something achievable as fast as possible and build on it.

So the immediate problem is managing your information, right? That involves standardizing your terms and your metadata. And as you say, the next step is how you want to use it. It's generating all of these questions and thinking about how you want to use this information, because that also affects the kinds of ontologies you implement and how you implement them in your database or your system. One way of implementing ontologies is creating something like a graph database, so that you can start to query your data more effectively. There are things you can do to implement these ideas, but they all have different levels of lift: different levels of input of energy and resources. So you have to think about what your outcomes are and what questions you want to ask.

I might take the opportunity to ask some of the questions I keep running into, and maybe both of you can give me some advice; I'm sure the problems I face are very similar to what others deal with. These are very simple use cases, and maybe through them we can illustrate the potential, keeping in mind that it's not only for myself, but also something I'm going to share with other people. One of the issues we often have when we start off with a system is the one you've both been talking about: you have a lot of free text. What methods would you suggest for taking a paragraph of text and converting it into some category, and then into some sort of ontology term?

I'm going to do some shameless promotion here, but there are lots of tools that exist out there. You can go to the EMBL-EBI website: there's a nice tool there where you can put in big chunks of text, and it already has ontologies in its back end and will highlight all of the ontology terms in a paragraph of text, if that's what you want to do.
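Editor's aside: Emma doesn't name the EBI tool, but EMBL-EBI's Ontology Lookup Service (OLS) exposes a search API that gives the flavour of programmatic term lookup. The endpoint and response fields below are assumptions based on OLS documentation; check EBI's current docs before relying on them.

```python
# Hedged sketch: querying the EMBL-EBI Ontology Lookup Service (OLS) search
# API to map a phrase to candidate ontology terms. The endpoint and response
# fields are assumptions based on OLS documentation; verify before relying.
import requests

resp = requests.get(
    "https://www.ebi.ac.uk/ols4/api/search",
    params={"q": "chicken breast", "rows": 5},
    timeout=30,
)
resp.raise_for_status()

for doc in resp.json()["response"]["docs"]:
    # Each hit should carry a CURIE-style ID, a label, and its source ontology.
    print(doc.get("obo_id"), doc.get("label"), doc.get("ontology_name"))
```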
If what you're working with is short free-text records, like metadata fields, we have developed a tool called LexMapr that is designed precisely to standardize terms from that kind of text. It divides all of the words up into tokens and compares them, so it converts your short entries into standardized terms, and you can automate that. It's a command-line tool, and it's also available as a service, but there are lots of other tools out there that people are developing that will help you with this.

There's no magic bullet, though, unfortunately. Yeah, exactly, because as you know by now, people will put anything in those fields: lots of typos and lots of half-sentences, and that is really hard for a computer to decode. Like I like to say, computers are dumb. I consider this like the problem I had when I did statistical consulting for my colleagues: they would come to me with the data already collected, and I would say the same thing Ronald Fisher did: if you come to me only after the data are collected, the only thing I can do is a post-mortem on the data. Ideally, and this is a problem because people don't think about it from the beginning when they are collecting the data, and there's lots of legacy data that should be useful, you should try to define everything a priori, before data collection. That helps a lot: it speeds up data collection and helps anyone collecting the data. But if you have free text, then you have the tools Emma told you about, and in some cases you can get a very good run out of them. If you get 70 to 80 percent solved using those tools, that's already a big, big help, but rest assured that in some cases you'll have to dig in manually.

And I think that's an important point: at the point of information collection, or sample collection, you have to remember that genomics can be used for all kinds of things. It's very much part of the culture to share; in genomics it's been very common from the start to share data. So when you put your information, your sample, your metadata out into the world, you want to make it as informative as possible, so that people can use it in many different ways. For example, saying that a sample is from a chicken: what does that mean? Is it from retail? Is it from a live chicken running around? Is it a swab of a chicken in an abattoir? Try to be as descriptive as possible when you're collecting the information. But of course, that's still free text, and you still need to use that information.

So let's say I've now dealt with my free text, or at least taken the data and played around with the tools both of you mentioned, and I've got a rough idea of what I'm looking for. Am I then going to the OBO Foundry and looking up ontologies that relate to that?

Well, if you've used LexMapr or any of the other tools, they will have been accessing different ontologies, so you don't have to do that; if you've already used a tool, it's already done that for you. But if you want to know more, if you want to do more with your information, then yes, you can go to the OBO Foundry and have a look.
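Editor's aside: this is not LexMapr itself, just a toy sketch of the tokenize-and-compare idea Emma describes, with an invented vocabulary. Real tools layer scoring rules, stemming, and curated synonym lists on top of this.

```python
# Toy sketch of the tokenize-and-compare idea (illustration only, not LexMapr):
# split a short free-text metadata entry into tokens and pick the vocabulary
# term whose label or synonym overlaps it best. The vocabulary is invented.
import re

VOCAB = {
    "FOO:0003": ["chicken meat", "chicken"],
    "FOO:0007": ["chicken breast", "breast of chicken"],
    "FOO:0010": ["retail chicken sample"],
}

def tokens(text):
    return set(re.findall(r"[a-z]+", text.lower()))

def best_match(free_text, min_score=0.5):
    query, best = tokens(free_text), (None, 0.0)
    for term_id, labels in VOCAB.items():
        for label in labels:
            label_tokens = tokens(label)
            score = len(query & label_tokens) / len(query | label_tokens)  # Jaccard overlap
            if score > best[1]:
                best = (term_id, score)
    return best if best[1] >= min_score else (None, best[1])

print(best_match("swab of chicken breast"))  # ('FOO:0007', 0.75), rescued by a synonym
print(best_match("Chikcen breast, retail"))  # (None, 0.25): the typo blocks a confident match
```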
To follow on from João's point, then: I don't know very much about sample collection, but I want to insist that whoever is doing it includes certain information from the start. Where would I go to find a generic specification of the kinds of values that people look at or track? Would these ontologies help me understand what I could potentially be looking for, so I could pick and choose from that to give to someone else?

So, two things there. One, unfortunately, I don't know of a single repository of standards for all kinds of things. But there is the Genomic Standards Consortium, and they do an absolutely fabulous job of putting together standardized attributes, specific lists of standardized terms. There's MIxS, the Minimum Information about any (x) Sequence; there's MIMS; there's MIGS. So there are all kinds of attribute packages out there that you can use for standardizing your information, and then of course there are the attribute packages offered by the public repositories, the INSDC. But, another shameless plug: we also have a tool called GEEM. It's basically like an Amazon for ontologies. It enables you to create specifications; you can browse different ontologies. Say you wanted to describe food, food products, and food processes: you can select that particular ontology, browse it, pick whatever fields of information you want, and create a specification. Then you can use that spec to create data-entry forms. So you can go manually to the OBO Foundry, or look at MIxS or any of these different standards, and create your own standardized form, or you can get something like GEEM to do it for you.

I can't remember: do those include things like what the valid values would be, if that makes sense for those terms?

I think they do. The GSC standards all prescribe the formats and the suggested values; those are really good, and they give you examples of usage. They're currently in the process of developing further packages for food, food production environments, and animal feed, and I believe those will be made available soon. So when you go to submit your data, whether a field is clinical or environmental, there will also be an option for you to submit food information.
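Editor's aside: to show what "prescribed formats and suggested values" buys in practice, here is a minimal hypothetical validator. A spec lists each field's requiredness and permitted picklist values, the way an ontology-backed attribute package would; the field names and values are invented, not taken from a real GSC package.

```python
# Hypothetical sketch of validating a record against a MIxS-style attribute
# package: required fields plus controlled picklists. Field names and permitted
# values are invented for illustration, not taken from a real GSC package.

SPEC = {
    "geo_loc_name": {"required": True,  "allowed": None},  # free text, but must be present
    "env_medium":   {"required": True,  "allowed": {"soil", "water", "food matrix"}},
    "host":         {"required": False, "allowed": {"Homo sapiens", "Gallus gallus"}},
}

def validate(record):
    problems = []
    for field, rule in SPEC.items():
        value = record.get(field)
        if value is None:
            if rule["required"]:
                problems.append(f"missing required field: {field}")
        elif rule["allowed"] is not None and value not in rule["allowed"]:
            problems.append(f"{field}={value!r} not among permitted values {sorted(rule['allowed'])}")
    return problems

sample = {"geo_loc_name": "Canada: Vancouver", "env_medium": "chicken"}
for issue in validate(sample):
    print(issue)  # flags env_medium='chicken': use a term from the picklist instead
```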
I should put in a word of warning there, Emma; I sometimes have to break the dream before reforming it. The thing is, there are lots of ontologies, as everyone has understood by now, and there is some overlap between ontologies, and that overlap is not necessarily well defined. The word of caution I want to give: we started the previous episode talking about different languages, and there are small details in them. When I said "I love bioinformatics" in Portuguese, I said "I love", but not in a way that means the same thing that "love" means in English. There are slight differences that can make a world of difference, and this is one of the problems of ontologies. Like different languages, you have to understand exactly where they came from and what they apply to. And sometimes validations, like you said, Nabil, are not necessarily in the scope of the ontology. They can be; you can include them. But those validations can also be contested, because, for instance, a sequencing-related ontology built for the human domain will set different thresholds than one for the microbial domain. That's my word of caution when using specialized, high-level ontologies: you have to understand very well what a term means, and whether the same term in different ontologies actually means the same thing, or whether the terms are mapped. Efforts like the OBO Foundry try to take care of that, right, Emma? They try to validate, somehow, and to reuse instead of recreate.

Exactly, thanks for bringing that up. In the OBO Foundry, one of the key practices is reuse of terms. You can create new terms, like the biscuits example that we had in the previous episode; you can create new terms if, say, you didn't like a definition that somebody else created. But let me back up for a second. All of these ontologies are created by people, and all of these people have their own world views. If you are in animal health, or human health, or environmental research, or agriculture, your world view is going to be slightly different, because of what you have to do, what your research is, how you operate. Ontologies are meant to be universal: you're trying to use language that everybody understands. They're meant to represent truth, but different people's truths can differ depending on their world. So you're right that all the ontologies are meant to reuse terms as much as possible, they're meant to create this interoperability, but because different people are creating them, there can be differences in axioms, which result in differences in relationships and logic. So even though you can run a reasoner over different ontologies, clashes get created. Nothing is perfect. You're right that ontology development isn't as seamless as I'm painting it to be, and there are lots of issues.

What I wanted to highlight, Nabil, is that this is not a canned solution. It's not easy to set up an ontology, in the sense that I always feel it's more of a deep-dive activity. You actually have to go through the process, endure the process for a while; it's not easy for the uninitiated. But after that, one of the things I find quite rewarding, and eye-opening, is that the process itself reveals things we never thought about in the field we are looking at. For instance, we have some drafts of NGSOnto, an ontology for capturing the relationships in sequencing: from library prep through the raw data to the data produced in NGS sequencing activities. Building it actually helped me a lot to understand the process better and to see where things are not well defined.

I mean, to name something is to know something. Exactly. Is a rose by any other name still a rose? In this case that wouldn't apply: a rose by another name would not smell as sweet in the ontology. Unless they were synonyms. Yeah.
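Editor's aside: Emma mentions running a reasoner over ontologies, and João returns to reasoning below. Here is a small sketch of what inference adds, assuming the rdflib and owlrl Python packages and an invented class hierarchy: after materializing the RDFS closure, a sample typed only as "chicken meat" can also be retrieved as a "food product", a link never stated explicitly.

```python
# Sketch of ontology reasoning, assuming the rdflib and owlrl packages
# (pip install rdflib owlrl). The tiny class hierarchy is invented for
# illustration, not taken from a real ontology.
from rdflib import Graph, Namespace, RDF, RDFS
import owlrl

EX = Namespace("http://example.org/")
g = Graph()
g.add((EX.ChickenMeat, RDFS.subClassOf, EX.PoultryProduct))  # asserted
g.add((EX.PoultryProduct, RDFS.subClassOf, EX.FoodProduct))  # asserted
g.add((EX.sample42, RDF.type, EX.ChickenMeat))               # asserted

# Materialize everything the RDFS semantics let us infer.
owlrl.DeductiveClosure(owlrl.RDFS_Semantics).expand(g)

# Never stated directly, but now present: sample42 is also a FoodProduct.
print((EX.sample42, RDF.type, EX.FoodProduct) in g)  # True
```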
So, between the two of you, I feel like there are a lot of existing efforts out there that can help, so that people don't have to start from scratch. Particularly the MIxS standards: those seem quite practical in how they're defined and can really help people jumpstart. But assessing whether these are appropriate, whether they're going to answer the question, or even merging different ones together, seems to be a very personal problem. There isn't really a way around that; it's a human element that has to be assessed, and not something anyone can give much advice about without really understanding the question someone is trying to ask.

Yeah. There's one quick point I want to make, though: ontologies and standards really only deliver on their promise when lots of people use them. When only small pockets of people use them, in bubbles, then you're really not going to get that interoperability. It's just like that xkcd comic: you're just building another standard, and then it's just one of many.

I usually use the metaphor of the railroad system. If you just build railroads inside a small town, and they never connect to other towns, other countries, and transcontinental transport lines, then it's good, but it's limited; it doesn't have the potential to connect everything. I started out trying to run away from ontologies, until I saw the Linked Data concepts from Tim Berners-Lee, which are really another level built on ontologies. We want linked data, right? How do we link data in order to share it? And then I realized: yes, to do it properly, we actually need ontologies, or simplified forms of ontologies. But if we want the full power of connecting and linking the data, then we need ontologies with all the aspects and nitty-gritty details.

If we do a little exposé: even our GenEpiO, the Genomic Epidemiology Ontology, and your TypOn and NGSOnto won't work together, because we designed them differently. One of our goals is to harmonize, right? But that's a good example of people creating ontologies for specific purposes; when the ways they build those things don't jibe, how you group things into classes and the kinds of relations you use, things don't work together. And then you're stuck with the same problem, the same siloing issue.

And one of the powers I like in ontologies is that once you have it all done, you can actually use reasoning on top of it. You can try to infer relationships between terms of ontologies, or of different ontologies in this case, categories in our metadata that we never realized were connected, just from the ontology itself and reasoning algorithms that let us explore that.

Right. I think that's already outside the scope of this one. Oh, I can throw a question back at you, Nabil. What was the reason you never used ontologies so far?

That's a good one. Why didn't I use ontologies? Because I didn't understand my data, and I had to write the paper now. I had to write the paper yesterday.

That's a very good point, because it takes time, right? Either you write the paper on ontologies, or you write the paper now. And it's exactly that.

I mean, the thing is, it's an iterative process. You start off with a dataset, you don't quite understand what it's about, and you learn it; you ask questions, you go backwards and forwards. And by the time you've understood what the data actually means, you've written the paper, you've generated the figures, it's been sent out for review. And then you don't need the ontology, because you've done the work. You've done the work.
Yeah, but you've cried a lot while you were doing the work, trying to fit things together. Whereas if you had started off, I think João mentioned this before, by saying, okay, these are the questions we're going to ask, where are the standards, let's put these structures into place before we collect the samples, before we do the analysis, then you don't have to worry about it while you're writing the paper and doing the analysis.

I think, for me, where this will come into its own is if and when I start creating the next project on the back of the one I've done. One of the things we haven't said is that, implicitly, everybody is already doing this. When you have to write the next proposal, you say: I've done all this work, it's in this paper, you can read it; now I'm going to do this next thing, give me money; I'm going to sample this, look at this information, collect this, compare it with that. You're building your ontology in free text, in a six-page document that you send to anonymous reviewers: you're doing that specification then and there, based on what you've done before. So for me, that's where the real value comes: if you walk into the project with this explicitly, you can just point to it and say, we're going to do this.

Data management should be part of every single grant application, right? It should be part of what reviewers are looking for. How are you structuring your information? How are you storing it? How are you using it? How are you going to reuse it?

At least in the UK, that is a mandatory section. You have to cover data management, and not just how you're going to look after the data, but how you're going to disseminate that information. And usually people give a very light explanation, like, oh yes, we will upload data to public repositories X, Y, Z. Okay, sure, you have to do that, but how are you going to make it interoperable? That's a question people tend not to ask, but it's one that is now mandatory for most research councils.

I think my point, and I'm just parroting what João is saying, is that you can run away from this as much as you like: you are going to be doing it one way or the other. You are going to have to structure your data, because that's the only way you'll be able to analyze it. Either you do it explicitly, in a nice tidy way that you can reuse the next time, over and over again, getting better at it, or you grope around in the dark and have an awful time of it.

My final comment on that relates to the nature of science: it's a cumulative activity. If you are a PhD student struggling with your paper, this is not something you really need to "waste" time on. I say "waste time" as a non-native speaker: it's not really a waste of time, but it is a very time-consuming way of thinking about the problem. But of course, if you are a group leader with a plan, where over time you need to put things together, then you realize that you have to do it.
You have to think ahead, and you have to say: look, PhD student number one, you have to have things in this format, because if you don't, they eventually leave and everything is in an Excel sheet that only they understand. Right. So it's extending the longevity of your data, right? Exactly: it's making it really data for the future, data that can be understood five and ten years from now. Right. Yeah, and it means you can squeeze a student project out of it, or something like that. Exactly. It pays dividends in the end.

Any final comments from the both of you on this?

Standardize, standardize, standardize. Think about standardization first, before the project. We know it's hard, but it will pay off a lot.

And don't just think about what you need for your own project and standardize that: think about how your data can be used in the future, and try to use international community standards as much as possible.

Don't reinvent the wheel, just realign it. Yes. Exactly, exactly. Yeah. The one thing that all bioinformaticians love to do. Yep. I've got wheels for days. Wheels for days. Yep.

All right, and on that bombshell, I think we'll draw to a close. I want to thank João and Emma for joining me today on the MicroBinfie podcast. We've been talking about sharing data, organizing data, and ontologies; we hope you've enjoyed it, and hope to see you next time. Thanks for having us, Nabil. Thank you very much for the invitation, Nabil. No worries. All right, see you all later.

Thank you all so much for listening to us at home. If you like this podcast, please subscribe and like us on iTunes, Spotify, SoundCloud, or the platform of your choice. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics group and edited by Nick Waters. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.