Nabil-Fareed Alikhan

Bioinformatics · Microbial Genomics · Software Development

Glossary for Genomic Quality Control

Posted on November 11, 2024

Part 9 of 10 in the series: Genomic Quality Control

This glossary collects brief definitions for terms used across the Genomic Quality Control series.

Base Quality

A quality score assigned to each base in a sequencing read, indicating how confident the base caller is that the base was called correctly. It is usually expressed as a Phred score. Base quality scores are used to filter and trim sequencing reads, which improves the overall quality and reliability of the data. Quality control tools typically plot these scores along the read to reveal regions of low-quality bases.
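
As a rough illustration of quality trimming, the sketch below removes bases from the 3' end of a read until it reaches one that meets a threshold. The function name `trim_low_quality_tail` and the default threshold of 20 are assumptions made for this example; real trimming tools generally use more robust strategies, such as sliding windows.

```python
def trim_low_quality_tail(sequence, qualities, min_quality=20):
    """Remove bases from the 3' end until one meets the quality threshold.

    `qualities` is a list of Phred scores, one per base (hypothetical input).
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    end = len(sequence)
    # Walk back from the 3' end while the last kept base is below the threshold.
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


seq = "GATTACAGG"
quals = [38, 37, 36, 35, 30, 28, 22, 12, 5]
trimmed_seq, trimmed_quals = trim_low_quality_tail(seq, quals)
print(trimmed_seq)  # GATTACA
```

Only the tail is inspected here, so a low-quality base in the middle of a read is left in place.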

De novo genome assembly

The process of reconstructing a complete genome sequence from DNA sequencing reads without the aid of a reference genome. De novo assembly is typically performed when a reference genome is not available or when studying non-model organisms with significant genetic variation (i.e. most bacteria). De novo assembly poses several challenges, including repetitive regions, sequencing errors, and variations in genome structure and complexity. During de novo assembly, sequenced reads are overlapped and assembled into contiguous stretches of DNA, known as contigs. Additional steps may be performed to order and orient contigs into larger scaffolds, using paired-end or mate-pair information to bridge gaps between contigs and improve the continuity and accuracy of the assembly.
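
To make the idea of overlapping reads into contigs concrete, here is a toy greedy sketch in Python. It is not how production assemblers work (they rely on graph-based methods and must cope with sequencing errors, repeats, and reverse complements, none of which are handled here). The function names `overlap_length` and `greedy_assemble`, the `min_overlap` parameter, and the example reads are all invented for illustration.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for size in range(max_len, min_overlap - 1, -1):
        if left[-size:] == right[:size]:
            return size
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the pair of sequences with the longest exact overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best = (0, None, None)
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i != j:
                    size = overlap_length(left, right, min_overlap)
                    if size > best[0]:
                        best = (size, i, j)
        size, i, j = best
        if size == 0:
            # No remaining pair overlaps enough: what is left are the contigs.
            return contigs
        merged = contigs[i] + contigs[j][size:]
        contigs = [c for k, c in enumerate(contigs) if k not in (i, j)] + [merged]


reads = ["ATGGCGT", "GCGTACG", "TACGGA", "CCCTTT"]
print(greedy_assemble(reads))  # ['CCCTTT', 'ATGGCGTACGGA']
```

The first three reads chain together into one contig, while the fourth shares no overlap with the others and remains a separate contig.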

FASTQ

A text-based format that stores nucleotide sequences together with their quality scores. Each entry in a FASTQ file consists of four lines: a sequence identifier beginning with @, the raw sequence, a plus sign, and a string of quality scores with one character per base. FASTQ is widely used for storing and sharing raw sequencing data from high-throughput sequencing platforms. For example:

@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATTTGTGTTAAATACAAAATT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCC
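
A four-line record like this can be split and decoded with a few lines of Python. The sketch below assumes the common Phred+33 encoding, in which each quality character's ASCII code minus 33 gives the Phred score; the function name `parse_fastq_record` and the short example record are invented for illustration. For real data, an established parser is a safer choice than hand-rolled code.

```python
def parse_fastq_record(lines, offset=33):
    """Split one four-line FASTQ record and decode its quality string.

    Assumes Phred+33 encoding by default (an assumption of this sketch).
    """
    header, sequence, separator, quality = [line.rstrip("\n") for line in lines]
    if not header.startswith("@"):
        raise ValueError("identifier line must start with '@'")
    if not separator.startswith("+"):
        raise ValueError("third line must start with '+'")
    if len(sequence) != len(quality):
        raise ValueError("sequence and quality strings differ in length")
    # Each quality character encodes one Phred score as ASCII code minus offset.
    scores = [ord(char) - offset for char in quality]
    return header[1:], sequence, scores


record = ["@read1", "GATTACA", "+", "IIIIII#"]
name, sequence, scores = parse_fastq_record(record)
print(name, sequence, scores)  # read1 GATTACA [40, 40, 40, 40, 40, 40, 2]
```

The length check matters because every base must have exactly one quality character.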

Phred Score

A measure of base-calling quality produced by sequencing machines, expressed on a logarithmic scale. The score $Q$ is calculated as $Q = -10 \log_{10} P$, where $P$ is the probability of an incorrect base call. Phred scores are used to assess the accuracy of individual base calls in sequencing data, and higher scores indicate higher confidence. For example, a Phred score of 20 corresponds to a 1% error rate (i.e. 99% accuracy).
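
The relationship between a Phred score and its error probability can be checked with a small Python sketch. The function names are made up for this example; the arithmetic follows the definition of the score directly.

```python
import math


def phred_to_error_probability(q):
    """Convert a Phred score Q to the error probability P = 10^(-Q/10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Convert an error probability P to a Phred score Q = -10 * log10(P)."""
    if not 0 < p <= 1:
        raise ValueError("probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


for q in (10, 20, 30, 40):
    print(f"Q{q}: P = {phred_to_error_probability(q):g}")
# Q10: P = 0.1
# Q20: P = 0.01
# Q30: P = 0.001
# Q40: P = 0.0001
```

Each increase of 10 in the score corresponds to a tenfold drop in the error probability.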

Sequenced Reads

Short fragments of DNA or RNA sequences that are output from sequencing machines. Each read represents a portion of the original DNA or RNA molecule being sequenced. Sequenced reads are the raw data used for downstream bioinformatics analyses, such as alignment to a reference genome, variant calling, and assembly. Reads can be single-end (one end of the fragment is sequenced) or paired-end (both ends of the fragment are sequenced, providing more information for alignment and assembly).

Series

Genomic Quality Control

A comprehensive guide to quality control in bacterial genomics, from sequencing to assembly

1. Why is Quality Control Important in Genomics?
2. A Framework for Quality Control in Genomics
3. Quality Control Criteria for Sequenced Reads
4. Practical: Read Classification with Kraken2
5. Practical: Read Classification on Command Line
6. Practical: Quality Control for Short Reads
7. Quality Control Criteria for Genome Assemblies
8. Practical: Genome Assembly Quality Control Exercise
9. Glossary for Genomic Quality Control
10. Further Reading and Additional Resources