Nabil-Fareed Alikhan

Bioinformatics · Microbial Genomics · Software Development

Quality Control Criteria for Genome Assemblies

Posted on November 11, 2024

Part 7 of 10 in the series: Genomic Quality Control

Go back to A framework for quality control

Quality control processes for genome assemblies aim to answer "Does the genome assembly look like an intact genome from the organism I am expecting?" The four criteria (Contiguity, Completeness, Contamination and Correctness) each assess this in a different way.

The ultimate goal of genome assembly is to reconstruct the original genome sequence, but to do this we have sheared and amplified the DNA and read off short sequences, and errors are introduced at each step. At the time of writing, there is no reliable method of automatically generating a perfect and complete genome sequence. We must make do with what we have, and be able to assess whether the genome assembly is 'good enough' for our purposes.

There are many tools available to assess genome assembly quality. We will look at some of the most common ones.

Contiguity (Genome assembly)

As mentioned above, we are aiming for, but not expecting, a single complete genome (chromosome) sequence. Contiguity measures how contiguous the assembled genome is: the fewer fragments, the better. You can look at metrics like the N50 and L50. If the contigs are sorted from longest to shortest, N50 is the length of the contig at which the running total first reaches 50% of the total assembly length, and L50 is the number of contigs needed to reach that point. Higher N50 and lower L50 values are generally better. How can we assess contiguity?

💡 Tip

A "contig" (short for contiguous sequence) is a set of overlapping DNA segments that together represent a consensus region of DNA. In the context of genome assembly, contigs are created by piecing together shorter sequences, called reads, that have been obtained from sequencing technology.
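To make N50 and L50 concrete, here is a minimal shell sketch that computes both from a list of contig lengths. The helper names `contig_lengths` and `n50_l50` are hypothetical (they are not part of QUAST or any other tool mentioned here), and the six lengths in the worked example are made up for illustration.

```shell
# Hypothetical helper: print the length of each sequence in a FASTA file,
# one length per line.
contig_lengths() {
  awk '/^>/ { if (seen) print n; n = 0; seen = 1; next }
       { n += length($0) }
       END { if (seen) print n }' "$1"
}

# Hypothetical helper: read one contig length per line on stdin and
# print the N50 and L50 of that set of contigs.
n50_l50() {
  sort -rn | awk '
    NF { count++; len[count] = $1; total += $1 }
    END {
      running = 0
      for (i = 1; i <= count; i++) {
        running += len[i]
        # Stop at the first contig where the running total reaches half
        # of the assembly length.
        if (running * 2 >= total) {
          printf "N50=%d L50=%d\n", len[i], i
          exit
        }
      }
    }'
}

# Worked example with made-up lengths (total 290, half is 145):
# 80 + 70 = 150 reaches the halfway point, so N50 is 70 and L50 is 2.
printf '%s\n' 80 70 50 40 30 20 | n50_l50   # prints N50=70 L50=2
```

On a real assembly the two helpers can be chained, for example `contig_lengths assembly.fasta | n50_l50`. QUAST reports the same metrics, so this is only meant to show where the numbers come from.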

Running QUAST on the command line

quast.py assembly.fasta

QUAST offers various options and parameters to customize the analysis and output. For example, you can use the -o flag to specify a different output directory, or use -r to provide a reference genome for alignment if one is available.

Here's an example of a more customized command:

quast.py -o custom_output_folder -r reference.fasta -g gene_annotation.gff assembly.fasta

This command specifies a custom output folder, uses a reference genome for alignment, and provides a gene annotation file for additional analysis.

Please refer to the QUAST documentation for a full list of available options and their descriptions. The specific options and settings you use may depend on your analysis goals and the characteristics of your data.
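QUAST writes its summary metrics to a tab-separated `report.tsv` file in the output directory, with one metric per row. As a rough sketch, individual metrics can be pulled out of that file for use in scripts. The helper name `quast_metric` is hypothetical, and the small report written below is made-up stand-in data in the same two-column layout so that the snippet runs without QUAST installed; check the row labels against your own report.

```shell
# Hypothetical helper: print the value of one metric row from a QUAST report.tsv.
# Usage: quast_metric <report.tsv> <metric name>
quast_metric() {
  awk -F'\t' -v metric="$2" '$1 == metric { print $2 }' "$1"
}

# Made-up stand-in report: metric name, tab, value for a single assembly.
report=$(mktemp)
printf 'Assembly\tassembly\n# contigs\t12\nTotal length\t5123456\nN50\t812345\nL50\t3\n' > "$report"

quast_metric "$report" "N50"        # prints 812345
quast_metric "$report" "# contigs"  # prints 12
```

With the customized command above, the real file would be `custom_output_folder/report.tsv`.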

Completeness (Genome assembly)

Genome completeness refers to the extent to which a sequenced genome accurately represents the full genetic material of an organism. It is a measure of how thoroughly the genome has been sequenced and assembled, reflecting the presence of all expected genes, sequences, and structural elements. You can assess genome completeness by comparing your assembly to a reference genome, if one is available. You can also check how many of the genes expected to be essential for that organism are present in the assembly. How can we assess completeness?

Running BUSCO

Here is some example code to run BUSCO on our example data.

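# Activate the course environment and install BUSCO into it
conda activate week2
conda install -c conda-forge -c bioconda busco=5.5.0
# List the available lineage datasets
busco --list-datasets
# Download the example assembly
wget https://mmbdtp.github.io/seq-analysis/long_assembly.fasta
# Run BUSCO in genome mode against the bacteria_odb10 lineage
busco -i long_assembly.fasta --out assembly-busco --mode genome -l bacteria_odb10
# The summary file name ends with the value passed to --out
cat assembly-busco/short_summary.specific.bacteria_odb10.assembly-busco.txt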
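BUSCO's short summary contains a one-line result string of the form `C:98.4%[S:98.4%,D:0.0%],F:0.8%,M:0.8%,n:124` (complete, single-copy, duplicated, fragmented, missing, total markers searched). A minimal sketch for turning that line into a pass/fail check follows. The helper names, the 95% cut-off and the example line are assumptions for illustration; they are not BUSCO defaults or results from the example assembly.

```shell
# Hypothetical helper: extract the "complete" percentage (the C: value)
# from a BUSCO short summary file.
busco_complete_pct() {
  grep -o 'C:[0-9.]*%' "$1" | head -n 1 | tr -d 'C:%'
}

# Hypothetical helper: succeed if completeness is at or above a threshold.
# Usage: busco_passes <summary file> <minimum percent>
busco_passes() {
  pct=$(busco_complete_pct "$1")
  awk -v c="$pct" -v t="$2" 'BEGIN { exit !(c + 0 >= t + 0) }'
}

# Made-up summary line standing in for a real BUSCO summary file.
summary=$(mktemp)
printf '\tC:98.4%%[S:98.4%%,D:0.0%%],F:0.8%%,M:0.8%%,n:124\n' > "$summary"

busco_complete_pct "$summary"   # prints 98.4
if busco_passes "$summary" 95; then
  echo "completeness OK"
else
  echo "completeness below threshold"
fi
```

A sensible cut-off depends on the organism and on what the assembly will be used for, so treat the 95% here as a placeholder.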

Contamination (Genome assembly)

Contamination in genome assembly refers to the presence of extraneous DNA sequences from sources other than the target organism in the final assembled genome. These unwanted sequences can originate from various sources and can significantly compromise the accuracy and reliability of the genome assembly. Some common reasons for contamination in sequencing data include:

We can use Kraken2 in the same way that we used it for sequence reads.

Using Kraken2

Here is some example code to run Kraken2.

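conda activate my_env
conda install -y -c conda-forge -c bioconda kraken2
# List the Kraken2 databases available in the shared folder, with their sizes
du -h -d1 /shared/public/db/kraken2
# Classify each contig in the assembly; --use-names adds scientific names to the output
kraken2 --threads 8 --db /shared/public/db/kraken2/k2_standard_08gb/ --output long.hits.txt --report long.report.txt --use-names long_assembly.fasta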
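The Kraken2 report is a tab-separated table whose columns are: percentage of sequences in the clade, number of sequences in the clade, number assigned directly to the taxon, rank code, NCBI taxonomy ID and scientific name. When the input is an assembly, each sequence is a contig, so the percentages count contigs rather than bases. As a sketch, species-level rows above a cut-off can be listed like this. The helper name `kraken_species`, the 1% cut-off and the report rows are all hypothetical.

```shell
# Hypothetical helper: list species-level rows (rank code "S") from a
# Kraken2 report whose percentage is at or above a minimum.
# Usage: kraken_species <report file> <minimum percent>
kraken_species() {
  awk -F'\t' -v min="$2" '
    $4 == "S" && $1 + 0 >= min + 0 {
      pct = $1; name = $6
      gsub(/ /, "", pct)      # percentages are left-padded with spaces
      sub(/^ +/, "", name)    # names are indented by taxonomic depth
      printf "%s\t%s\n", pct, name
    }' "$1"
}

# Made-up report rows for an imaginary assembly of 200 contigs.
kreport=$(mktemp)
{
  printf '  5.00\t10\t10\tU\t0\tunclassified\n'
  printf ' 95.00\t190\t0\tR\t1\troot\n'
  printf ' 90.00\t180\t180\tS\t562\t              Escherichia coli\n'
  printf '  4.50\t9\t9\tS\t1280\t              Staphylococcus aureus\n'
  printf '  0.50\t1\t1\tS\t287\t              Pseudomonas aeruginosa\n'
} > "$kreport"

# Prints the E. coli and S. aureus rows; the 0.50% row is below the cut-off.
kraken_species "$kreport" 1
```

For the command above, the real report is `long.report.txt`. More than one species above the cut-off is a prompt to look closer, not proof of contamination, since closely related species can share sequence.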

Correctness (Genome assembly)

Assess the accuracy of your assembly by checking for misassemblies, such as structural errors, inversions, or translocations. Visualization tools like Artemis or Bandage can help identify such issues. Effectively, we are asking: is the genome assembly what we expect? How can we assess correctness?

BONUS: Circumstantial (Genome assembly)

These are not direct evidence of a good genome, but can be reassuring. Here is some circumstantial evidence of a good genome:

Remember that the quality of your bacterial genome assembly may also depend on the sequencing technology used, the software and parameters employed for assembly, and the quality of the source DNA. Careful evaluation and validation of your assembly are essential for accurate results.

Let's apply these criteria to some sample data in Practical - Genome assembly QC.
