Episode 81: The people behind the benchmark datasets for SARS-CoV-2
📅29 April 2022
⏱️00:09:27
🎙️Microbial Bioinformatics
👥Guests
Jill Hagey (CDC)
Lingzi Xiaoli (CDC)
The microbinfie podcast explores the intricate process of developing benchmark datasets for SARS-CoV-2, highlighting the complex challenges of bioinformatics data mining and quality control.
Links and Resources
- Explore the datasets: CDC SARS-CoV-2 Datasets
- Check out a previous related episode for part 1 of the conversation.
Related Works
- Previous paper on bacterial datasets: PeerJ Article
Connect with Our Guests
Jill Hagey:
- Twitter: @JillHagey
- Website: jvhagey.github.io

Lingzi Xiaoli:
- LinkedIn: Lingzi Xiaoli
Stay tuned to learn more about the insights and implications of these datasets in the field of virology and genomics.
Extra notes
- The discussion revolves around creating benchmark datasets for SARS-CoV-2 and the methods used in acquiring and processing these datasets.
- Jill explored using the Python package Selenium to automate the process of downloading sequences by remotely interacting with web browsers, although this approach was later abandoned in favor of more efficient methods.
- Checking and verifying metadata was crucial, with an emphasis on sample consistency: confirming that all runs were Illumina, paired-end, and used ARTIC primers.
- Challenges arose from incorrect or inconsistent metadata entries on platforms like SRA, particularly when records did not specify the sequencing technology used, or when primer naming conventions varied (e.g., the many different ways submitters write "ARTIC V3").
- Initial quality control (QC) on sequence data involved FastQC, depth-of-coverage analysis using SAMtools, and running the Titan pipeline, which provided various QC metrics including Pangolin lineage assignments and detection of amino acid mutations.
- Selection of representative samples for variants of concern (VOCs) involved using tools like Snippy to minimize SNP differences relative to CDC internal references, while ensuring that the key lineage-defining mutations (primarily in the spike protein) were present.
- The filtering process was enhanced by a linkage between GISAID assemblies and SRA records, ensuring only high-quality datasets were selected for further comparison.
- Decisions on which viral lineages to include in the study were based on CDC-defined variants of interest or concern, since WHO nomenclature did not yet exist at the time.
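The metadata consistency checks described above can be sketched as a small filter. The field names, accessions, and primer-name variants below are illustrative assumptions, not the exact SRA schema:

```python
# Hypothetical sketch of the metadata consistency checks: keep only
# Illumina, paired-end runs that used ARTIC V3 primers, tolerating the
# many spellings submitters use for the primer scheme.

def is_artic_v3(name: str) -> bool:
    """Collapse spelling variants ("Artic_V3", "ARTIC-v3", ...) to one key."""
    key = "".join(ch for ch in name.lower() if ch.isalnum())
    return key in {"articv3", "articv3primers", "arcticv3"}

def passes_checks(record: dict) -> bool:
    """Apply the consistency criteria to one (hypothetical) metadata record."""
    return (
        record.get("platform", "").lower() == "illumina"
        and record.get("layout", "").lower() == "paired"
        and is_artic_v3(record.get("primers", ""))
    )

runs = [  # illustrative records, not real SRA entries
    {"run": "SRR0000001", "platform": "ILLUMINA", "layout": "PAIRED", "primers": "Artic_V3"},
    {"run": "SRR0000002", "platform": "OXFORD_NANOPORE", "layout": "SINGLE", "primers": "ARTIC V3"},
    {"run": "SRR0000003", "platform": "ILLUMINA", "layout": "PAIRED", "primers": ""},
]
kept = [r["run"] for r in runs if passes_checks(r)]  # ["SRR0000001"]
```

In practice the alias set would grow as new misspellings are encountered in submitted metadata, which is exactly the difficulty the episode describes.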
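The representative-sample selection can likewise be sketched as picking the run with the fewest SNPs relative to the reference, subject to the required spike mutations being present. The data structure is a hypothetical stand-in for parsed Snippy output, not the actual CDC workflow:

```python
# Hypothetical sketch of choosing one representative sample per variant:
# minimize the SNP count versus the internal reference, but require that
# all lineage-defining spike mutations are present.

def pick_representative(samples, required_spike_muts):
    """Return the ID of the eligible sample with the fewest SNPs, or None."""
    eligible = [
        s for s in samples
        if required_spike_muts <= set(s["spike_mutations"])
    ]
    if not eligible:
        return None
    return min(eligible, key=lambda s: s["snp_count"])["id"]

samples = [  # illustrative values, standing in for parsed Snippy results
    {"id": "A", "snp_count": 25, "spike_mutations": ["N501Y", "E484K"]},
    {"id": "B", "snp_count": 18, "spike_mutations": ["N501Y"]},
    {"id": "C", "snp_count": 21, "spike_mutations": ["N501Y", "E484K"]},
]
best = pick_representative(samples, {"N501Y", "E484K"})  # "C"
```

Note that sample B has the fewest SNPs overall but is skipped because it lacks a required mutation, illustrating why the mutation check runs before the minimization.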
Key technical tools and processes mentioned include:
- Selenium for web automation.
- SAMtools for depth of coverage analysis.
- Snippy for SNP comparison.
- Titan pipeline for detailed quality control.
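As a sketch of the depth-of-coverage step, mean depth can be computed from `samtools depth` output, which emits tab-separated lines of reference name, 1-based position, and depth. The example input lines and the reference name are illustrative:

```python
# Sketch: compute mean depth of coverage from `samtools depth` output.
# Each line is tab-separated: reference, 1-based position, depth.

def mean_depth(depth_lines):
    """Average per-position depth across all reported positions."""
    total = count = 0
    for line in depth_lines:
        ref, pos, depth = line.rstrip("\n").split("\t")
        total += int(depth)
        count += 1
    return total / count if count else 0.0

# In practice these lines would come from e.g.:
#   samtools depth -a aligned.bam
example = [
    "MN908947.3\t1\t100",
    "MN908947.3\t2\t150",
    "MN908947.3\t3\t50",
]
avg = mean_depth(example)  # 100.0
```

Using `samtools depth -a` matters for this calculation: it reports zero-depth positions too, so the mean is not inflated by skipping uncovered bases.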
The challenges highlighted predominantly concerned automated data retrieval, metadata verification, and naming conventions in datasets, reflecting common difficulties in the field of microbial bioinformatics.