Hello, and thank you for listening to the MicroBinfie podcast. Here, we will be discussing topics in microbial bioinformatics. We hope that we can give you some insights, tips, and tricks along the way. There is so much information we all know from working in the field, but nobody writes it down. There is no manual, and it's assumed you'll pick it up. We hope to fill in a few of these gaps. My co-hosts are Dr. Nabil-Fareed Alikhan and Dr. Andrew Page. I am Dr. Lee Katz. Both Andrew and Nabil work in the Quadram Institute in Norwich, UK, where they work on microbes in food and the impact on human health. I work at Centers for Disease Control and Prevention and am an adjunct member at the University of Georgia in the US.

Welcome to the MicroBinfie podcast. Today, I've got Nabil and Lee joining me, and they're going to tell me about mobile genetic elements: what they are and why I should care. I come from a computer science background, so a lot of the terminology and biology is, I suppose, not my cup of tea. I haven't learned it, but these two guys, luckily, know all the buzzwords and all the lingo and are going to explain to me exactly what I should know, and hopefully, I'll become a better bioinformatician at the end of it. What is a mobile genetic element, and what kind of different ones are there?

Well, I should first start off and say that I've been working on COVID, and SARS-CoV-2 doesn't have any mobile genetic elements, so I'm pretty rusty on all of this stuff. But I'll give it a good go. So mobile genetic elements are genomic elements or segments that can mobilize themselves and move between different genomes, between different chromosomes. There's a plethora, a multitude of different types you can get. This includes things like bacteriophages, so bacteriophages that integrate as prophages into a host chromosome and then excise themselves and transduce elsewhere. You have transposons, you have plasmids, you have just genomic islands as well.
You have integrative conjugative elements and integrons, so these are things that look like plasmids, but then dive into the chromosome. It just keeps going on and on, and within these different classifications, they can be remixed, so you can have something that's sort of phage-looking, but isn't. You can have something that's plasmid-looking, but isn't. And it's incredibly difficult. As long as there's some sort of mechanism for it to transfer, then it's a mobile genetic element. There are even mobile genetic elements that don't seem to have obvious signatures of how they actually mobilize, or whether they're able to do it, whether they're autonomous in their ability to mobilize. So often you can have prophages that don't have all of the bacteriophage excision machinery, and they borrow it from somewhere else, and then they use that to excise along with something else. It's a wild, wonderful world out there, and a lot of us pretend that it's not actually there, try not to think of it too much, and pretend everything is just nice and tidy sequence types.

So it's just like the accessory genome that people talk about in pangenomics.

Yeah, I mean, just because something's in the accessory genome, it doesn't strictly mean that it is mobilized, but both things do kind of overlap with each other.

When you say mobilized, I mean, how do I tell the difference between maybe some phage-annotated genes in the chromosome that I've sequenced, and just a random contig I have with random annotated phage genes?

Phages are quite particular because you should hopefully see something for the capsid, you should see some tail proteins, things like that, for the sheath as well.
So they have a specific structure and you expect a certain panel of genes including certain proteins there, but not necessarily, because from the very beginning of bacterial genomics you can find instances of phage-like, prophage-like elements which don't necessarily have the full set of genes that you'd be expecting, so you're not quite sure if that's something that's self-contained and complete. There are certain tricks to identify the likelihood that something is a mobile genetic element.

Quite often I will run PlasmidFinder or something like that to look for the Inc type or whatever in my isolate. Now, how accurate is that, and what does it actually really tell me?

It's obviously looking for a certain replicon and it's trying to type based on the replicon. I don't know if PlasmidFinder also looks for genes involved in conjugation and gives you a type from that; I don't think it does. Some other tools do, but there are certain features you can use to identify a plasmid. Usually it's the Inc type, that's the most standard way of going after it.

But then, I've seen before where there are Inc types buried within the chromosome itself as well as within standalone plasmids, so how do you know which is which?

You don't. If you're dealing with short reads, you have no idea whether this is something that's been misassembled in a way that it's been merged into the chromosome by accident, or it could be an ICE, a genuine biological event where something like a plasmid has integrated into the chromosome. Difficult to say.

Usually people are these days using short reads to do large-scale isolate sequencing, so what are the things we should be wary of?

Yeah, if you are sequencing with short reads, then you might have some region of the genome, some IS element that repeats so many different times you won't be able to place it. Well, we've had some extreme examples of that.
We had this strain, this Vibrio strain, and it had three phages back-to-back-to-back in it, and we had to type it and everything, and PacBio was just becoming available to us and we were able to finally span it after a year of studying this. And it was so hard. So, I mean, the real danger with short-read sequencing is you won't be able to span that repeat, you won't be able to place where it is, but you might come across something with three of the same phages. In this case, it was a tandem phage, and even longer reads wouldn't be able to handle it, which is where really long-read sequencing comes in.

Yeah, some E. coli can have up to 10 or 20 prophage or prophage-looking repetitive elements in the genome, and each of those will probably be a breakpoint in your genome assembly.

I've had a lot of trouble with short reads and assembling pertussis. Are MGEs responsible for this hellhole of an assembly problem?

So with pertussis, they have a certain IS element. I forgot which number, but with IS elements, you'll number them like IS element number 1111 or something. And in some cases, I think with pertussis also, people actually use it as a sequence-typing method, and you can find it hundreds or maybe thousands of times in the genome. It's incredible, and it will really break up your genome. Shigella is another example of an organism that has tons of IS elements, and usually your N50s are pretty lousy.

Don't you mean E. coli?

Yeah. Yeah, I guess that extends to all enteroinvasive E. coli as well.

Grand. I have heard that CRISPRs are the mechanism that bacteria use against phage. Like how on earth does that work, and informatically, what can we do with it?

I mean, this is a sort of off-the-cuff explanation of CRISPRs, because there's a lot going on in terms of explaining how CRISPRs work. But it's like a bacterial acquired immune system, right?
So, they have a set of associated genes, and they have a panel of sequences that they use as targets to identify foreign DNA. And when they identify it, they then go off and destroy it. In some organisms, you'll find that the spacer sequences match phage, foreign plasmids, or other foreign DNA that it'll go after. CRISPRs are pretty easy to find at this point, because the associated genes are usually quite well-conserved in the species. So, they're easy to find, and the segment itself, the spacer sequences and the repeats between those spacer sequences, is usually very short. So the whole thing comes out quite nicely in a de novo genome assembly for most organisms. And it's a very unique structure. So there are tools like CRISPRFinder, and PILER-CR, I think, from Robert Edgar. There are tons of them: CRISPRdb, or MinCED is another one, which you can use to find CRISPR sequences.

So do CRISPRs and the transmission of CRISPRs follow the same evolutionary path as the chromosome? Like, say, if I take one ST of E. coli, will I find mostly that the CRISPRs in there are identical, you know, in all of those, or do they differ quite substantially?

Depends on the organism. It's a bit hit and miss. I would say you can't really use it as a rule of thumb. For Salmonella, it works a lot better, particularly in some serovars like Typhimurium. There the CRISPRs do follow, I guess; the spacer sequences were fixed quite a long time ago, and they have sort of followed along with the serovars. In other organisms like Mycobacterium tuberculosis, you have spoligotyping, and that is effectively looking at the CRISPR panel. That's what that's derived from and that's a standard typing method. So it really varies how CRISPRs relate.
You kind of think it shouldn't, because it's defined by the mobile DNA that the organism runs into, but sometimes it can be associated with the species tree, with the vertical evolution of the organism.

I think at one point or another, we tried to use it for genomic surveillance also, but it just is not as reliable as using SNPs or MLST, unfortunately.

Yeah, I think historically people were interested in CRISPR typing before genomics. It was a useful thing to try prior to genomics and you'll find a lot of literature where people, especially in the public health sphere, do it. And then, you know, once you have genomics, all of that quickly drops off because you realize it's much easier just to sequence a genome.

Okay, so assuming that I start with the FASTQ file, right, of an isolate, what do I do next if I want to actually really dig into the mobile genetic elements?

You want to assemble the genome somehow. You've got short reads, right? Short or long?

I mean, let's start with short.

All right, let's start with short. So either way you want to assemble the genome, because one of the key ways that you're going to identify your mobile genetic elements is the synteny of the genes. Just finding a singular transposon in the vacuum of gene space doesn't tell you anything. So assembly is a good way to go. Lee, what would you do next?

I'd probably do a rough and dirty annotation with Prokka just to have something I can look at in a genome browser like Artemis.

I feel like you stole my brain over there. That's exactly what I was going to say. I'm 100% agreeing with you. And then you want to get to the stage where you can just look at it and decide what to do next.

So you have your annotations, and how would you do a better annotation after that?

Because Prokka is just a generalized annotation, you might want to do something a little bit more plasmid-specific or whatever you think you might have.
So if Prokka is showing a bunch of IS elements, do some kind of IS element annotator. If you have a bunch of plasmid-looking genes, a plasmid annotator.

So one thing I would add, and I think Andrew was alluding to this earlier, is that a good trick, once you have your assembly, is to map your reads back onto your assembly and then look at the coverage. Because one obvious sign that you've got an MGE is a collapsed repeat. So it's a duplicated thing, that repetitive thing over and over again. If you plot the coverage, you'll have a massive spike over a certain region of the genome, even if you don't know what those genes do. That's a good indication that something is repeated over and over again, that it's an MGE.

Just as you said that, I've remembered that when you look at a GC plot, often if there's an MGE, it will have a different GC to the background chromosome, and that can be an easy way to spot it.

Yeah, a lot of prophages are AT-rich, generally speaking. So that's a good trick to find it.

So you mentioned Artemis there, but it is actually quite important to go and eyeball your data for basically any bioinformatics you do, because I trust no algorithms at all. They're good for getting you 90% of the way there, 95% of the way, but actually you have to dig in and see exactly what's going on and use your brain. And unfortunately, that kind of stuff can't be automated, no matter what people claim about artificial intelligence.

As I was saying at the beginning, mobile genetic elements are definitely a wild west still, and you can't trust a singular tool to get the answer right.

Okay, so now you've gone and you've got your assembly, you've got some annotation, how do you sort the wheat from the chaff?

So if you've given me a genome and say, Nabil, where are the prophages?
If there are any prophages, I would bung it into PHAST, which is a web service you could use, or PHASTER now, and that'll just quickly tell you. It has a set of criteria that it's using to measure different regions of the genome. Does it have, you know, the structural proteins we were talking about that would suggest a phage? Does it have these AT-rich sort of things? It has a panel of metrics, and it says, well, these are the most likely prophage regions in your genome. Most phage-finding methods work like this: they have a set of heuristics; they're not a complete solution. So, again, you go back to Artemis and double check whether those make sense. So there's things like PHASTER, Phage_Finder, PhiSpy; there's a ton that you can play around with and get slightly different results each time. And so then you have to figure out which one is right, but that's where I'd start for phages.

100% agree with you. Once again, same page.

So I know when I've manually looked through genomes at the annotation, you'll see like a phage tail annotated, and then a helix, and then, you know, some genes in between. And often, you know, I go and shove that sequence into NR in NCBI, and there's nothing. Like, this has never been seen, no one has ever deposited this ever. Like, what the hell do I do next?

Panic.

Yeah. I don't think there's anything you can do. You've got to go and talk to your friends in the wet lab and actually do some real microbiology and figure out the gene function of that thing.

Sounds hard. Sounds hard.

No, you have to do actual work at that point. There's really not too much you can do once you've bunged it into NR. You can try more sensitive alignment approaches. So PSI-BLAST kind of things, which look at not just the base alignment, but also the position and things like that.
You can do methods like that to look for very divergent, similar genes. Or again, looking at things like, I mean, obviously if you're searching NR, you are working in amino acid space, right? And not just nucleotide: NT versus NR. And you might have to start thinking about looking at structural predictions and comparing those. It gets pretty messy when you don't have any idea.

I was thinking when you were talking about all that, if I were looking for the moonshot and I was stressing out about it and I had zero annotation so far, I'd probably go to InterProScan or something like that to at least give me an idea what the domains are on there.

Yeah, there are some prophage databases or sort of MGE, like plasmid, databases that are a bit better curated that you can possibly throw them against to get an idea. Maybe Pfam might help you in that case, if you're really struggling.

Maybe that might be one way, but whenever I've suggested that and tried it, it doesn't help. Basically when I run into that thing and there's no result on NR, like nothing, it's like, I don't know, there's nothing I can do. I've tried all of those different things and nothing works. It's like, yep, well, this is just an unknown, a DUF.

Because Lee has had a power cut, obviously they're freezing down there in Atlanta, Georgia, we are going to call it a day there and we will catch up again on mobile genetic elements. So thank you once again to Nabil and Lee for this really insightful discussion.

Thank you so much for listening to us at home. If you like this podcast, please subscribe and rate us on iTunes, Spotify, SoundCloud, or the platform of your choice. Follow us on Twitter at @microbinfie. And if you don't like this podcast, please don't do anything. This podcast was recorded by the Microbial Bioinformatics Group. The opinions expressed here are our own and do not necessarily reflect the views of CDC or the Quadram Institute.
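
As a footnote to the episode, two of the tricks mentioned in the conversation (mapping reads back to the assembly and looking for coverage spikes, and looking for regions whose GC content differs from the background) can be roughed out in a few lines of Python. This is only a minimal sketch under stated assumptions, not a method described by the hosts: the function names, the window size, and the thresholds (`depth_fold`, `gc_delta`) are all hypothetical choices, and the input is assumed to be one contig as a string plus a per-base depth list of the same length (for example, parsed from the output of `samtools depth -a`).

```python
from statistics import median


def gc_fraction(seq):
    """Fraction of G/C among the unambiguous A/C/G/T bases in seq."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    acgt = gc + seq.count("A") + seq.count("T")
    return gc / acgt if acgt else 0.0


def flag_windows(seq, depths, window=1000, depth_fold=2.0, gc_delta=0.08):
    """Flag non-overlapping windows that look unusual.

    A window is flagged for "coverage" if its mean depth is at least
    depth_fold times the contig-wide median depth (a possible collapsed
    repeat), and for "gc" if its GC fraction differs from the contig-wide
    GC fraction by at least gc_delta. All thresholds are illustrative.
    """
    if len(seq) != len(depths):
        raise ValueError("seq and depths must have the same length")
    if not seq:
        return []

    baseline_depth = median(depths)
    baseline_gc = gc_fraction(seq)

    flagged = []
    for start in range(0, len(seq), window):
        end = min(start + window, len(seq))
        mean_depth = sum(depths[start:end]) / (end - start)
        window_gc = gc_fraction(seq[start:end])

        reasons = []
        if baseline_depth > 0 and mean_depth >= depth_fold * baseline_depth:
            reasons.append("coverage")
        if abs(window_gc - baseline_gc) >= gc_delta:
            reasons.append("gc")
        if reasons:
            flagged.append((start, end, reasons))
    return flagged


# Toy data (made up): a 50% GC background with a 1 kb AT-rich stretch
# in the middle that also has three times the background read depth.
toy_seq = "ACGT" * 1000 + "AATG" * 250 + "ACGT" * 1000
toy_depths = [30] * 4000 + [90] * 1000 + [30] * 4000

for start, end, reasons in flag_windows(toy_seq, toy_depths):
    print(start, end, ",".join(reasons))
```

In keeping with the advice in the episode, anything flagged this way is only a candidate: the windows would still need to be checked by eye in a genome browser such as Artemis, alongside the annotation.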