Glossary for Genomic Quality Control
Posted on November 11, 2024
This glossary collects brief definitions for terms used across the Genomic Quality Control series.
Base Quality
The quality score associated with each base in a sequencing read, indicating how likely it is that the base was called correctly. It is usually expressed as a Phred score. Base quality is used to filter and trim sequencing reads, which improves overall data quality and reliability. Quality control tools typically visualize it so that regions of low-quality bases within reads are easy to identify.
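As a rough illustration of quality-based trimming, the sketch below removes low-quality bases from the 3' end of a read. The function name, the threshold of 20, and the example read are assumptions made for illustration; dedicated trimming tools use more elaborate strategies, such as sliding windows.

```python
def trim_low_quality_tail(sequence, qualities, min_quality=20):
    """Drop bases from the 3' end until one meets the quality threshold.

    `qualities` is a list of integer Phred scores, one per base.
    The default threshold of 20 is an illustrative choice, not a standard.
    """
    if len(sequence) != len(qualities):
        raise ValueError("sequence and qualities must have the same length")
    end = len(sequence)
    # Walk backwards while the last remaining base is below the threshold.
    while end > 0 and qualities[end - 1] < min_quality:
        end -= 1
    return sequence[:end], qualities[:end]


# Hypothetical read whose quality drops off towards the 3' end.
trimmed_seq, trimmed_quals = trim_low_quality_tail(
    "ACGTACGT", [35, 36, 34, 30, 28, 12, 9, 5]
)
print(trimmed_seq, trimmed_quals)
```

Only the tail is trimmed here; a low-quality base in the middle of the read is kept, which is one reason real tools often combine trimming with a per-read filter.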
De novo genome assembly
The process of reconstructing a complete genome sequence from DNA sequencing reads without the aid of a reference genome. De novo assembly is typically performed when a reference genome is not available or when studying non-model organisms with significant genetic variation (e.g. most bacteria). It poses several challenges, including repetitive regions, sequencing errors, and variation in genome structure and complexity. During de novo assembly, overlapping sequenced reads are merged into contiguous stretches of DNA, known as contigs. Additional steps may then order and orient the contigs into larger scaffolds, using paired-end or mate-pair information to bridge the gaps between contigs and improve the continuity and accuracy of the assembly.
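The idea of overlapping reads into contigs can be shown with a toy greedy merge. This is only a sketch under strong assumptions: reads are error-free, all come from the same strand, and no read is contained inside another. Production assemblers use far more robust approaches and handle repeats, errors, and reverse complements. The function names and example reads are invented for illustration.

```python
def overlap_length(left, right, min_overlap=3):
    """Length of the longest suffix of `left` that equals a prefix of `right`."""
    max_len = min(len(left), len(right))
    for length in range(max_len, min_overlap - 1, -1):
        if left.endswith(right[:length]):
            return length
    return 0


def greedy_assemble(reads, min_overlap=3):
    """Repeatedly merge the two sequences that share the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, None, None
        for i, left in enumerate(contigs):
            for j, right in enumerate(contigs):
                if i == j:
                    continue
                length = overlap_length(left, right, min_overlap)
                if length > best_len:
                    best_len, best_i, best_j = length, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        # Join the pair, writing the shared overlap only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three hypothetical reads that tile one short region.
print(greedy_assemble(["ATGGCGT", "GCGTACC", "TACCTGA"]))
```

Reads that share no overlap of at least `min_overlap` bases are returned as separate contigs, which mirrors how gaps in coverage fragment a real assembly.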
FASTQ
A text-based format for storing both nucleotide sequence data and the corresponding quality scores. Each entry in a FASTQ file consists of four lines: a sequence identifier beginning with @, the raw sequence, a plus sign, and a string of quality scores with one character per base. FASTQ is widely used for storing and sharing raw sequencing data from high-throughput sequencing platforms. For example:
@SEQ_ID
GATTTGGGGTTCAAAGCAGTATCGATCAAATAGTAAATTTGTGTTAAATACAAAATT
+
!''*((((***+))%%%++)(%%%%).1***-+*''))**55CCF>>>>>>CCCCCC
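The four-line layout can be read with a few lines of Python. The sketch below assumes the simple case where every record occupies exactly four lines (no wrapped sequences); the function name and the sample record are made up for illustration, and established libraries are a better choice for real data.

```python
import io


def read_fastq(handle):
    """Yield (identifier, sequence, quality_string) for each four-line record."""
    while True:
        header = handle.readline()
        if not header:
            return  # end of file
        header = header.rstrip("\r\n")
        if not header:
            continue  # tolerate blank lines between records
        sequence = handle.readline().rstrip("\r\n")
        separator = handle.readline().rstrip("\r\n")
        quality = handle.readline().rstrip("\r\n")
        if not header.startswith("@"):
            raise ValueError(f"expected '@' at start of record, got: {header!r}")
        if not separator.startswith("+"):
            raise ValueError(f"expected '+' separator line, got: {separator!r}")
        if len(sequence) != len(quality):
            raise ValueError(f"sequence/quality length mismatch in {header!r}")
        yield header[1:], sequence, quality


# A hypothetical record held in memory instead of a file on disk.
example = io.StringIO("@read1\nACGT\n+\nIIII\n")
for identifier, sequence, quality in read_fastq(example):
    print(identifier, sequence, quality)
```

The length check reflects the rule that a record needs exactly one quality character per base.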
Phred Score
A measure of base-call quality produced by sequencing machines, expressed on a logarithmic scale. The score \(Q\) is calculated as \(Q = -10 \log_{10} P\), where \(P\) is the probability of an incorrect base call. Phred scores are used to assess the accuracy of individual base calls in sequencing data, and higher scores indicate higher confidence. For example, a Phred score of 20 corresponds to a 1% error rate (99% accuracy), and a score of 30 corresponds to a 0.1% error rate (99.9% accuracy).
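The formula is easy to check numerically. The sketch below converts between Phred scores and error probabilities and decodes a FASTQ quality string into scores. The function names are illustrative, and the ASCII offset of 33 is an assumption: it is the common Phred+33 encoding, but some older data uses an offset of 64.

```python
import math


def phred_to_error_probability(q):
    """Invert Q = -10 * log10(P) to get P = 10 ** (-Q / 10)."""
    return 10 ** (-q / 10)


def error_probability_to_phred(p):
    """Apply Q = -10 * log10(P) for an error probability 0 < P <= 1."""
    if not 0 < p <= 1:
        raise ValueError("error probability must be in the interval (0, 1]")
    return -10 * math.log10(p)


def decode_quality_string(quality, offset=33):
    """Turn a FASTQ quality string into integer Phred scores.

    The default offset of 33 assumes Phred+33 encoding.
    """
    return [ord(char) - offset for char in quality]


print(phred_to_error_probability(20))
print(error_probability_to_phred(0.001))
print(decode_quality_string("!5I"))
```

Because the scale is logarithmic, every increase of 10 in the score corresponds to a tenfold drop in the error probability.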
Sequenced Reads
Short fragments of DNA or RNA sequences that are output from sequencing machines. Each read represents a portion of the original DNA or RNA molecule being sequenced. Sequenced reads are the raw data used for downstream bioinformatics analyses, such as alignment to a reference genome, variant calling, and assembly. Reads can be single-end (one end of the fragment is sequenced) or paired-end (both ends of the fragment are sequenced, providing more information for alignment and assembly).