Nabil-Fareed Alikhan

Bioinformatics · Microbial Genomics · Software Development

Why is Quality Control Important in Genomics?

Posted on November 11, 2024

Part 1 of 10 in the series: Genomic Quality Control

In this series, we will look at a critical aspect of genomic epidemiology: the impact of sequencing and assembly errors on our interpretations. High-throughput sequencing has revolutionized our ability to track and understand infectious diseases, but the accuracy of our epidemiological insights depends heavily on the quality of the sequencing and assembly. Errors in these processes can distort our understanding of pathogen dynamics and lead to misleading conclusions about genetic diversity and transmission pathways. The examples below show the types of issues that can arise.

The real effect of poor data

In this example, I selected 12 Salmonella enterica ser. Choleraesuis genomes and created a "poorly assembled" genome from one of them (SAL_FC0090AA_AS). I then built a neighbour-joining tree with mashtree, a common tool that builds a tree from pairwise Mash distances (an estimate of average nucleotide identity) to show how similar genomes are. BAD_FC0090AA_AS.result.fasta is a clear outlier, and nowhere near FC0090AA_AS.result.fasta, the genome it was based on. If this analysis method could handle the poorly assembled genome, we would see the two genomes placed together. The difference between BAD_FC0090AA_AS.result.fasta and the others is large enough to change our interpretation. For instance, if the other genomes belonged to an outbreak, would we consider BAD_FC0090AA_AS.result.fasta part of that outbreak too?

Poor QC tree example
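The outlier in the tree can also be checked numerically from the pairwise distances the tree is built on. The following is a minimal sketch and not part of mashtree: the function name `flag_outliers`, the `factor` threshold, and the toy distances are all assumptions made up for illustration, not values from the analysis above.

```python
from statistics import median


def flag_outliers(distances, factor=5.0):
    """Return genomes whose median distance to all others is unusually large.

    distances: dict mapping a (name_a, name_b) tuple to a pairwise distance.
    factor: hypothetical cut-off; a genome is flagged when its median distance
            exceeds factor times the typical (median) per-genome value.
    """
    names = sorted({name for pair in distances for name in pair})
    per_genome = {}
    for name in names:
        # Collect every distance that involves this genome.
        involved = [d for pair, d in distances.items() if name in pair]
        per_genome[name] = median(involved)
    typical = median(per_genome.values())
    return [name for name in names if per_genome[name] > factor * typical]


# Toy distances, invented for illustration only.
toy = {
    ("A", "B"): 0.001,
    ("A", "C"): 0.001,
    ("B", "C"): 0.001,
    ("A", "BAD"): 0.05,
    ("B", "BAD"): 0.05,
    ("C", "BAD"): 0.05,
}
print(flag_outliers(toy))
```

A check like this only says that a genome is far from everything else. It cannot tell you whether that is real biology or a poor assembly, which is why the upstream quality checks still matter.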

💡 Tip

This error won't be so pronounced in real data, and you won't be able to spot it by eye in the final analysis. You need to check the data quality upstream to be confident in the result.

A very special Salmonella Typhi

Someone once came to me with results similar to the table below. Black means the genome had AMR determinants predicted to confer that resistance; white means they were absent. They were looking at multi-drug resistant Salmonella enterica serovar Typhi and found that one of their samples had a special profile of predicted AMR determinants that included extra mechanisms. They were very excited that they had found something new and wonderful and wanted me to just sanity check it.

Special Salmonella case

I did a basic check of the taxonomic classification of the sample and it came back as Klebsiella pneumoniae. It was not a special Typhi, but a run-of-the-mill Klebsiella that had been picked up by mistake.
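A check like this is easy to automate. The sketch below assumes the classification comes from a Kraken2 report (the tool used in the later practicals of this series; the text above does not say which tool was used here). It relies on the standard six tab-separated report columns: percentage, clade reads, direct reads, rank code, taxid, and name. The function names, the 50% threshold, and the toy report lines are assumptions for illustration. Typhi is a serovar, so the comparison is made at species level against Salmonella enterica.

```python
def top_species(report_text):
    """Return (name, percent) for the species-level line with the highest percentage."""
    best = None
    for line in report_text.splitlines():
        fields = line.split("\t")
        if len(fields) < 6:
            continue  # skip blank or malformed lines
        percent = float(fields[0])
        rank = fields[3]
        name = fields[5].strip()  # names are indented in the report
        if rank == "S" and (best is None or percent > best[1]):
            best = (name, percent)
    return best


def matches_expected(report_text, expected_species, min_percent=50.0):
    """True if the dominant species is the expected one (threshold is hypothetical)."""
    top = top_species(report_text)
    return top is not None and top[0] == expected_species and top[1] >= min_percent


# Toy report lines, invented for illustration only.
toy_report = (
    " 92.10\t921\t900\tS\t573\t      Klebsiella pneumoniae\n"
    "  1.20\t12\t12\tS\t28901\t      Salmonella enterica\n"
)
print(top_species(toy_report))
print(matches_expected(toy_report, "Salmonella enterica"))
```

Running a check like this on every sample before interpreting AMR profiles would catch a mislabelled or contaminated sample long before anyone gets excited about it.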

The null result

The most common manifestation of sequencing and genome assembly errors is that a downstream tool just doesn't work. Here is an error thrown by snippy when given poor quality data.

Null result error
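One way to avoid discovering a problem only when a downstream tool fails is a small pre-flight check on the assembly. The sketch below computes the contig count and N50 from FASTA text and reports anything that looks off. The function names and both thresholds are hypothetical placeholders and should be replaced with criteria suited to your organism and sequencing platform.

```python
def read_contig_lengths(fasta_text):
    """Return the length of each sequence in FASTA-formatted text."""
    lengths = []
    current = None
    for line in fasta_text.splitlines():
        line = line.strip()
        if line.startswith(">"):
            if current is not None:
                lengths.append(current)
            current = 0
        elif line and current is not None:
            current += len(line)
    if current is not None:
        lengths.append(current)
    return lengths


def n50(lengths):
    """Length of the contig at which half of the total assembly length is reached."""
    total = sum(lengths)
    running = 0
    for length in sorted(lengths, reverse=True):
        running += length
        if running * 2 >= total:
            return length
    return 0


def preflight(lengths, max_contigs=500, min_n50=20000):
    """Return a list of human-readable problems; thresholds are hypothetical."""
    problems = []
    if not lengths:
        problems.append("no contigs found")
        return problems
    if len(lengths) > max_contigs:
        problems.append(f"too many contigs: {len(lengths)} > {max_contigs}")
    if n50(lengths) < min_n50:
        problems.append(f"N50 too low: {n50(lengths)} < {min_n50}")
    return problems


# Tiny made-up FASTA, for illustration only.
toy_fasta = ">contig1\nACGT\nAC\n>contig2\nGG\n"
print(preflight(read_contig_lengths(toy_fasta)))
```

An empty list means the assembly passed these two checks; it does not mean the assembly is good, only that it is not obviously broken.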

This content was prepared as part of the GenEpi-BioTrain programme funded by ECDC. The GenEpi-BioTrain programme is an interdisciplinary course in genomics and epidemiology, which was held at Institut Pasteur between May 27th and June 7th, 2024.

Series

Genomic Quality Control

A comprehensive guide to quality control in bacterial genomics, from sequencing to assembly

1. Why is Quality Control Important in Genomics?
2. A Framework for Quality Control in Genomics
3. Quality Control Criteria for Sequenced Reads
4. Practical: Read Classification with Kraken2
5. Practical: Read Classification on Command Line
6. Practical: Quality Control for Short Reads
7. Quality Control Criteria for Genome Assemblies
8. Practical: Genome Assembly Quality Control Exercise
9. Glossary for Genomic Quality Control
10. Further Reading and Additional Resources