Episode 80: Benchmark datasets for SARS-CoV-2

📅28 April 2022

⏱️00:43:05

🎙️Microbial Bioinformatics

👥Guests

Lingzi Xiaoli

CDC Pertussis Lab, Whole Genome Sequencing Researcher

Jill Hagey

CDC Waterborne Disease Prevention Branch, Comparative Genomics Researcher

In this episode of the microbinfie podcast, researchers from the CDC discuss their comprehensive benchmark datasets for SARS-CoV-2 genome sequencing, addressing critical needs in bioinformatics pipeline validation and quality control.

Additional Resources

For details on previous papers related to bacterial datasets, you can refer to this article.

Connect with the Researchers

Jill Hagey can be found on Twitter @JillHagey and through her website at jvhagey.github.io.
Lingzi Xiaoli can be reached on LinkedIn.

Both researchers bring valuable insights into the development of datasets that contribute significantly to the scientific community's understanding of SARS-CoV-2.

Key Points

1. SARS-CoV-2 Benchmark Dataset Development

Created to support state partners in developing bioinformatics pipelines
Designed to accommodate labs at different stages of sequencing capabilities
Includes diverse sample types: outbreak scenarios, sequencing platform variations, and failed QC samples

2. Dataset Characteristics

Covers multiple SARS-CoV-2 variants, including early Delta variants
Provides raw reads, complete genomes, and phylogenetic trees
Addresses challenges of host contamination and variable genome quality

3. Data Accessibility and Future Plans

Uses downloadable TSV files with accession codes and checksums
Ongoing project with plans to incorporate emerging variants
Maintained through GitHub for continuous updates
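
As a rough illustration of how a TSV of accession codes and checksums might be used, the sketch below parses such a table and checks downloaded data against the listed checksums. It is a minimal sketch, not the project's own tooling: the column names `accession` and `sha256`, the choice of SHA-256, and the function names are all assumptions made for this example, and the real files may be laid out differently.

```python
import csv
import hashlib
import io


def parse_manifest(tsv_text):
    """Parse a tab-separated manifest (with a header row) into a list of dicts.

    The column names used below ("accession", "sha256") are hypothetical.
    """
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    return list(reader)


def sha256_hex(data):
    """Return the SHA-256 digest of a bytes object as a lowercase hex string."""
    return hashlib.sha256(data).hexdigest()


def verify_downloads(rows, downloaded):
    """Compare downloaded bytes against the checksums listed in the manifest.

    rows: output of parse_manifest.
    downloaded: dict mapping accession -> file contents as bytes.
    Returns a dict mapping accession -> True if the checksum matches,
    False if it differs or the accession was not downloaded.
    """
    results = {}
    for row in rows:
        accession = row["accession"]
        expected = row["sha256"].strip().lower()
        data = downloaded.get(accession)
        results[accession] = data is not None and sha256_hex(data) == expected
    return results


if __name__ == "__main__":
    # Toy data standing in for a downloaded read file; the accession is made up.
    reads = b"@read1\nACGT\n+\nIIII\n"
    manifest = "accession\tsha256\nSRR0000001\t" + sha256_hex(reads) + "\n"
    print(verify_downloads(parse_manifest(manifest), {"SRR0000001": reads}))
```

Keeping checksums next to accession codes lets a lab confirm that what it downloaded is byte-for-byte what the dataset authors intended before running a pipeline on it.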

Take-Home Messages

Benchmark datasets are crucial for standardizing SARS-CoV-2 genome analysis
Quality control remains a significant challenge in COVID-19 genomic research
Collaborative, adaptable approaches are key to managing evolving viral data