Episode 80: Benchmark datasets for SARS-CoV-2
📅28 April 2022
⏱️00:43:05
🎙️Microbial Bioinformatics
👥Guests
CDC Pertussis Lab, Whole Genome Sequencing Researcher
CDC Waterborne Disease Prevention Branch, Comparative Genomics Researcher
In this episode of the MicroBinfie podcast, researchers from the CDC discuss their benchmark datasets for SARS-CoV-2 genome sequencing, which address the need for validation and quality control of bioinformatics pipelines.
Additional Resources
- For details on the earlier papers on bacterial datasets, see this article.
Connect with the Researchers
- Jill Hagey can be found on Twitter @JillHagey and through her website at jvhagey.github.io.
- Lingzi Xiaoli can be found on LinkedIn.
Both researchers helped develop the SARS-CoV-2 benchmark datasets discussed in this episode.
Key Points
1. SARS-CoV-2 Benchmark Dataset Development
- Created to support state partners in developing bioinformatics pipelines
- Designed to accommodate labs at different stages of sequencing capabilities
- Includes diverse sample types: outbreak scenarios, sequencing platform variations, and failed QC samples
2. Dataset Characteristics
- Covers multiple SARS-CoV-2 variants, including early Delta variants
- Provides raw reads, complete genomes, and phylogenetic trees
- Addresses challenges of host contamination and variable genome quality
3. Data Accessibility and Future Plans
- Uses downloadable TSV files with accession codes and checksums
- Ongoing project with plans to incorporate emerging variants
- Maintained through GitHub for continuous updates
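The TSV-plus-checksum approach mentioned above can be illustrated with a minimal Python sketch. It assumes a manifest with three hypothetical columns (`accession`, `filename`, `sha256`) and files that have already been downloaded to a local directory; the real dataset tables may use different column names and layouts, so treat this as an illustration of the idea rather than the project's own tooling.

```python
import csv
import hashlib
from pathlib import Path


def read_manifest(tsv_text):
    """Parse a tab-separated manifest into row dicts, skipping blank and '#' comment lines."""
    lines = [ln for ln in tsv_text.splitlines() if ln.strip() and not ln.startswith("#")]
    return list(csv.DictReader(lines, delimiter="\t"))


def sha256_of(path, chunk_size=1 << 20):
    """Compute the SHA-256 hex digest of a file, reading it in chunks."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_files(rows, data_dir):
    """Return {accession: 'ok' | 'mismatch' | 'missing'} for each manifest row.

    The column names used here (accession, filename, sha256) are assumptions
    made for this sketch, not the actual headers of the published tables.
    """
    results = {}
    for row in rows:
        path = Path(data_dir) / row["filename"]
        if not path.exists():
            results[row["accession"]] = "missing"
        elif sha256_of(path) == row["sha256"].strip().lower():
            results[row["accession"]] = "ok"
        else:
            results[row["accession"]] = "mismatch"
    return results
```

Checking every file against its recorded checksum before analysis means a truncated or corrupted download is caught early, rather than surfacing later as an unexplained difference in pipeline results.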
Take-Home Messages
- Benchmark datasets are crucial for standardizing SARS-CoV-2 genome analysis
- Quality control remains a significant challenge in COVID-19 genomic research
- Collaborative, adaptable approaches are key to managing evolving viral data