MicroBinfie Podcast, 36 Roary the pangenome pipeline

Released on December 10, 2020

We chat with Andrew about Roary, software for generating a pangenome, and learn about the background to the project, where the name comes from, hidden features, and the light-hearted FAQ.

  • Paper: academic.oup.com/bioinformatics/a…1/22/3691/240757
  • Software: github.com/sanger-pathogens/Roary
  • Documentation: sanger-pathogens.github.io/Roary/

Extra notes

  • The discussion focused on the challenges of building and analyzing bacterial pangenomes, especially at scale.

  • A pangenome is the full set of genes found across a species, covering both core and accessory genes. It is crucial for understanding the genetic variation that affects survival in different environments.

  • Roary was developed to generate pangenomes efficiently. It was originally built to meet needs within the Sanger Institute, where existing tools could not handle large bacterial datasets.

  • There is a distinction between core genomes (genes present in all or nearly all strains) and accessory genomes (genes that vary between strains). The ability to analyze these differences is essential for understanding bacterial functionality and evolution.

  • Other tools were initially used but faced scalability and usability issues, particularly when managing large datasets.

  • The Roary pipeline addressed these challenges by scaling out on High-Performance Computing (HPC) systems and by pre-clustering sequences with tools such as CD-HIT, which reduces the number and complexity of the comparisons needed.

  • The tool was designed to be fast, memory-efficient, and scalable to thousands of genomes, offering a unique capability in the field.

  • Roary's implementation included thoughtful design choices such as integration with Prokka for consistent genome annotation, which avoids the variability that comes from mixing different annotation tools.

  • Processing all datasets consistently was emphasized as crucial, since differing annotations can lead to incorrect analyses.

  • Roary is widely distributed and maintained in software ecosystems like Debian, Homebrew, and Conda, making it accessible for various users.

  • Despite its effectiveness and widespread use (evidenced by substantial citations), the software has not been updated recently due to shifts in focus within the field, although it remains open for community-driven updates via GitHub.

  • The podcast highlights broader trends in microbial bioinformatics, underscoring the importance of methodological consistency and scalability in large-scale genomic analyses.

  • Users are advised to prepare their genomic datasets consistently to avoid errors, reflecting a broader challenge in bioinformatics of ensuring data accuracy and completeness.
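
The core versus accessory distinction described in the notes can be illustrated with a small Python sketch. This is not Roary's code: the function name `split_core_accessory`, the toy gene and genome names, and the 0.99 default for "nearly all strains" are assumptions chosen for illustration. It takes a mapping from each gene cluster to the genomes that carry it and splits the clusters by how widely they are shared.

```python
from typing import Dict, Set, Tuple


def split_core_accessory(
    presence: Dict[str, Set[str]],
    n_genomes: int,
    core_fraction: float = 0.99,  # assumed cutoff for "nearly all strains"
) -> Tuple[Set[str], Set[str]]:
    """Split gene clusters into core and accessory sets.

    presence maps a gene cluster name to the set of genomes containing it.
    A cluster is core if it occurs in at least core_fraction of the genomes.
    """
    if n_genomes <= 0:
        raise ValueError("n_genomes must be positive")

    core: Set[str] = set()
    accessory: Set[str] = set()
    for gene, genomes in presence.items():
        if len(genomes) / n_genomes >= core_fraction:
            core.add(gene)
        else:
            accessory.add(gene)
    return core, accessory


# Hypothetical toy data: four genomes and four gene clusters.
presence = {
    "gyrA": {"g1", "g2", "g3", "g4"},
    "recA": {"g1", "g2", "g3", "g4"},
    "blaTEM": {"g2"},
    "tetA": {"g1", "g3"},
}

core, accessory = split_core_accessory(presence, n_genomes=4)
print(sorted(core))       # ['gyrA', 'recA']
print(sorted(accessory))  # ['blaTEM', 'tetA']
```

Lowering `core_fraction` moves more clusters into the core set, which is one reason the notes stress preparing and annotating every genome the same way before comparing results.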

Episode 36 transcript