Nabil-Fareed Alikhan

Bioinformatics · Microbial Genomics · Software Development

Practical: Genome Assembly Quality Control Exercise

Posted on November 11, 2024

Part 8 of 10 in the series: Genomic Quality Control

Here we will look at some genomes and assess their quality. Most of these genomes have (in my opinion) have quality issues. Remember that we are looking for as close to a perfect reconstruction of the original genome as possible, and we can assess this in different ways:

These are explained here, Quality control criteria for genome assemblies.

Exercise: Assessing genome assembly quality

The exercise is to download some assembled draft genomes and assess if they pass routine quality control. The genomes are available at https://zenodo.org/records/10018484. You can directly download them on the command line with wget or similar.

wget -O additional_genomes_for_qc.zip https://zenodo.org/records/10018484/files/additional_genomes_for_qc.zip?download=1

The files are FASTA (they are plain text) inside the zip file. There are eight genomes. Each FASTA file will look something like this. There will be multiple contigs, one after the other. Each with a fasta header (>NODE_XXXX) followed by the nucleotide sequence (ATGC).

>NODE_126_length_2252_cov_16.6409_ID_251
CAGCGTGGACTGATGTTCAG......
>NODE_22_length_135487_cov_12.0245_ID_43
GGCCGAGGCTCCCCACCGGCGCGGG....
>NODE_68_length_16957_cov_12.5198_ID_135
TGGTGTTGGTGCCAACGGCCTGACC...
>NODE_16_length_182200_cov_11.9821_ID_31
GCCGCTTTTTCGCGTTGCTTAATCT...

Try to create a table of your assessment (with better explanations) like the one below:

Sample namePass/FailReason
sample_1Yes
sample_2No
sample_3Maybe
sample_4I don't know
sample_5Could you repeat the question?
sample_6
sample_7
sample_8

This exercise is open-ended, and you can use any tools you like. You can use the tools mention in Quality control criteria for genome assemblies, or you can try other tools. You can work in groups, or divide the different criteria between yourselves.

Here is a link to the assemblies

If you do not have time, or the capacity to run these analyses, please use the precomputed results below. It is more important that you learn how to interpret the results.

If you complete the task above, also try to answer the following questions:

What is N50?

What is GC Content?

What is "Genome Coverage"?

How does BUSCO measure completeness?

How does aligning to a reference genome help assess completeness?

How does using Kraken help assess contamination?

Answers to exercises

Series

Genomic Quality Control

A comprehensive guide to quality control in bacterial genomics, from sequencing to assembly

1Why is Quality Control Important in Genomics?
2A Framework for Quality Control in Genomics
3Quality Control Criteria for Sequenced Reads
4Practical: Read Classification with Kraken2
5Practical: Read Classification on Command Line
6Practical: Quality Control for Short Reads
7Quality Control Criteria for Genome Assemblies
8Practical: Genome Assembly Quality Control Exercise
9Glossary for Genomic Quality Control
10Further Reading and Additional Resources
← Previous
Quality Control Criteria for Genome Assemblies
Next →
Glossary for Genomic Quality Control