Nabil-Fareed Alikhan

Bioinformatics · Microbial Genomics · Software Development

A Framework for Quality Control in Genomics

Posted on November 11, 2024

Part 2 of 10 in the series: Genomic Quality Control

When you receive results from bacterial genomics analyses such as genotyping, in silico serotyping, clustering, phylogenetic inference, or prediction of antimicrobial resistance (AMR) determinants, remember that your data has been through a long, multi-step journey. That journey could look something like this:

Basic workflow

Each of these steps has the potential to introduce errors that could drastically alter the final interpretation. As bioinformaticians looking at the data after sequencing, we have two easy opportunities to assess its quality as it makes its way through our workflow. We should:

QC opportunities

Many bioinformaticians have their own preferred approach for checking data quality. If you ask them about it, they will usually list a set of programs without much explanation of how those tools were selected or which issues they are meant to address. I will try to focus on describing how to approach the problem rather than giving you my own preferred solution. Here is a well-described genome assembly pipeline that covers both read quality control and genome assembly quality control.

GHRU pipeline

The pipeline (and figure) are from the GHRU SPAdes Assembly workflow. In my opinion, this is a comprehensive pipeline that produces good results, and you are welcome to use it. Hopefully you can see that, although the steps are clearly named, it is not obvious why each one is necessary.
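To make the two checkpoints (reads and assembly) a little more concrete, here is a minimal Python sketch of the kind of summary each one might start from. It is not taken from the GHRU pipeline. The function names `summarise_reads` and `summarise_assembly` are hypothetical, the metrics (read count, mean length, mean quality, contig count, total length, N50) are common illustrative choices rather than the criteria discussed in this series, and Phred+33 quality encoding with four-line FASTQ records is assumed.

```python
import io
from statistics import mean


def summarise_reads(handle):
    """Checkpoint 1: basic statistics for FASTQ reads (assumes 4 lines per record)."""
    lengths, qualities = [], []
    while True:
        header = handle.readline()
        if not header.strip():
            break  # end of file (or a trailing blank line)
        sequence = handle.readline().strip()
        handle.readline()  # '+' separator line, ignored
        quality = handle.readline().strip()
        lengths.append(len(sequence))
        # Assumption: qualities are Phred+33 encoded
        qualities.extend(ord(ch) - 33 for ch in quality)
    return {
        "reads": len(lengths),
        "mean_length": mean(lengths) if lengths else 0,
        "mean_quality": mean(qualities) if qualities else 0,
    }


def summarise_assembly(handle):
    """Checkpoint 2: contig count, total length and N50 for a FASTA assembly."""
    lengths, current, seen_header = [], 0, False
    for line in handle:
        line = line.strip()
        if line.startswith(">"):
            if seen_header:
                lengths.append(current)  # close the previous contig
            seen_header, current = True, 0
        elif line:
            current += len(line)
    if seen_header:
        lengths.append(current)  # close the final contig

    total = sum(lengths)
    n50, running = 0, 0
    # N50: length of the contig at which the running sum of
    # contig lengths (largest first) reaches half the assembly
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            n50 = length
            break
    return {"contigs": len(lengths), "total_length": total, "n50": n50}


if __name__ == "__main__":
    # Tiny in-memory examples; real data would be opened from files
    reads = io.StringIO("@r1\nACGT\n+\nIIII\n@r2\nACGTAC\n+\nIIIIII\n")
    assembly = io.StringIO(">c1\nACGTACGT\n>c2\nACGT\n>c3\nAC\n")
    print(summarise_reads(reads))
    print(summarise_assembly(assembly))
```

Numbers like these only describe the data. Deciding whether they are acceptable depends on the organism, the sequencing platform, and the question being asked, which is why the reasoning behind each check matters more than the tool that runs it.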

When working with genomic data, all quality control tools answer one or more of these broad questions:

These four questions can be broken down into seven criteria (plus a bonus criterion) for quality control of genomic data. Some of these relate to the sequenced reads, while others apply to the genome assembly.

We will break these categories down further in these sections:

Series

Genomic Quality Control

A comprehensive guide to quality control in bacterial genomics, from sequencing to assembly

1. Why is Quality Control Important in Genomics?
2. A Framework for Quality Control in Genomics
3. Quality Control Criteria for Sequenced Reads
4. Practical: Read Classification with Kraken2
5. Practical: Read Classification on Command Line
6. Practical: Quality Control for Short Reads
7. Quality Control Criteria for Genome Assemblies
8. Practical: Genome Assembly Quality Control Exercise
9. Glossary for Genomic Quality Control
10. Further Reading and Additional Resources