Nabil-Fareed Alikhan

Bioinformatics · Microbial Genomics · Software Development

Understanding File Compression

Posted on November 12, 2024

Part 1 of 2 in the series: Linux & File Compression

Bioinformatics data can be large. Sequenced reads for a single bacterial isolate may be several hundred megabytes, while metagenomic datasets will be much larger. An Illumina NextSeq 550 can produce up to 120 gigabase pairs per run (~120 gigabytes of data). To minimize the footprint of these data, most sequenced data is compressed in one way or another, and compressed files are what you will usually encounter.

In this module, we will cover some basics of file compression, look at the different formats you may encounter, and practice with one of the most common formats (gzip).

⚠️ Warning

You should avoid working with uncompressed sequenced read data (e.g. FASTQ or SAM). It is a waste of disk space and will slow down your analysis. Some bioinformatics tools you will encounter will force you to use uncompressed files, and this is a sign they are poorly written.
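As a rough illustration of working on compressed reads directly, the sketch below counts the records in a gzip-compressed FASTQ file without ever writing an uncompressed copy to disk. The function name `count_reads`, the file name, and the three made-up reads are assumptions for the example, and it assumes the simple four-lines-per-record FASTQ layout.

```python
import gzip
import os
import tempfile


def count_reads(fastq_gz_path):
    """Count records in a gzip-compressed FASTQ file (assumes 4 lines per record)."""
    n_lines = 0
    # Mode "rt" reads the file as text; lines are decompressed on the fly.
    with gzip.open(fastq_gz_path, "rt") as handle:
        for _ in handle:
            n_lines += 1
    return n_lines // 4


# Build a tiny, made-up example file so the sketch runs on its own.
example_path = os.path.join(tempfile.mkdtemp(), "my_reads.fastq.gz")
with gzip.open(example_path, "wt") as out:
    for i in range(3):
        out.write(f"@read{i}\nACGTACGT\n+\nIIIIIIII\n")

print(count_reads(example_path))
```

Because the file is read line by line, memory use stays small even for large read sets.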

What is file compression?

File compression is the process of reducing the size of one or more files or folders to save disk space or reduce the time required for file transfer. Compression algorithms remove redundancy and encode data more efficiently. File compression is a rich field of study in computer science.
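To make the idea of "removing redundancy" concrete, here is a minimal sketch using Python's standard `zlib` module (an implementation of DEFLATE). The two inputs are made up for illustration: one is a short DNA motif repeated many times, the other is random bytes with essentially no redundancy to exploit.

```python
import os
import zlib

# Highly redundant input: the same four bases repeated many times.
redundant = b"ACGT" * 25_000        # 100,000 bytes
# Input with essentially no redundancy: random bytes from the operating system.
random_bytes = os.urandom(100_000)  # 100,000 bytes

for label, data in [("redundant", redundant), ("random", random_bytes)]:
    compressed = zlib.compress(data)
    # Lossless compression: decompressing gives back exactly the original bytes.
    assert zlib.decompress(compressed) == data
    print(f"{label}: {len(data)} -> {len(compressed)} bytes")
```

The redundant input shrinks dramatically, while the random input barely changes in size. Real sequence data sits somewhere in between.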

Here is a table of common compression formats and their descriptions:

| Compression Format | Description | Commonly Used On |
|---|---|---|
| ZIP | Archive format that typically uses the DEFLATE algorithm; highly compatible | Multiple platforms |
| Gzip | Uses DEFLATE algorithm; commonly in Unix-like systems | Unix-like systems |
| 7-Zip | Open-source with high compression ratios | Windows, various platforms |
| RAR | Often used for large files and supports multiple methods | Multiple platforms |
| TAR | Groups files/directories (not compression by itself) | Multiple platforms |
| Bzip2 | Uses Burrows-Wheeler Transform; good compression ratios | Multiple platforms |
| LZMA | Known for high compression ratios | Linux and software distribution |
| Compressed Archive Formats | Native archive formats | Respective platforms |

Watch "Compression: Crash Course Computer Science #21" for a light introduction to file compression in general.

Exercise questions

What is a file extension?

Answer:

A string of characters attached to a filename, usually preceded by a full stop, that indicates the format of the file. For example, my_reads.fastq.gz has the extension .gz, indicating that it is a gzip-compressed file. Keep in mind that the file extension may not be a reliable indicator of the true file format.
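Since the extension can be misleading, one way to check a file's true format is to look at its first few bytes, often called "magic bytes". The sketch below is a simplified illustration covering only four formats. The helper name `sniff_format` and the lookup table are assumptions for the example, not part of any tool.

```python
import bz2
import gzip
import lzma

# Leading "magic bytes" for a few common formats (a deliberately short list).
MAGIC_BYTES = {
    b"\x1f\x8b": "gzip",
    b"BZh": "bzip2",
    b"\xfd\x37\x7a\x58\x5a\x00": "xz",
    b"PK\x03\x04": "zip",
}


def sniff_format(first_bytes):
    """Guess the compression format from the start of a file's content."""
    for magic, name in MAGIC_BYTES.items():
        if first_bytes.startswith(magic):
            return name
    return "unknown"


# Compress a made-up sequence in memory and inspect the resulting bytes.
print(sniff_format(gzip.compress(b"ACGT")))
print(sniff_format(bz2.compress(b"ACGT")))
print(sniff_format(lzma.compress(b"ACGT")))
print(sniff_format(b"@read1\nACGT\n"))  # plain, uncompressed text
```

For a file on disk, you would read the first handful of bytes in binary mode (`open(path, "rb").read(8)`) and pass them to the same function.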

What are the file extensions for the following compression formats?

  • Zip
  • Gzip
  • 7-Zip
  • RAR
  • TAR
  • BZIP2
  • LZMA
Answer:

| Compression Format | Extensions |
|---|---|
| ZIP | .zip |
| Gzip | .gz, .gzip |
| 7-Zip | .7z |
| RAR | .rar |
| TAR | .tar |
| Bzip2 | .bz2 |
| LZMA | .xz, .lzma |

What programs could you use to compress and decompress the following formats? (These programs can be for any operating system.)

  • Zip
  • Gzip
  • 7-Zip
  • RAR
  • TAR
  • BZIP2
  • LZMA
Answer:

You can suggest programs for any operating system

| Compression Format | Software |
|---|---|
| ZIP | unzip (Unix), WinZip (Windows), native software |
| Gzip | tar, gunzip, gzip (Unix) |
| 7-Zip | 7-Zip (Windows) |
| RAR | WinRAR (Windows) |
| TAR | tar (Unix) |
| Bzip2 | tar, bunzip2/bzip2 (Unix) |
| LZMA | tar (Unix) |

Which of these formats has the best compression ratio?

Answer:

The compression ratio is a measure of how effectively a compression algorithm reduces the size of data. It is typically expressed as a ratio or a percentage and represents the relationship between the original data size (before compression) and the compressed data size (after compression).

This is an open question. Depending on the data, different compression formats will have different compression ratios. From the list above, implementations of LZMA (e.g. xz) seem to give the best ratio for FASTQ, although they are not commonly used. See this review.

Other formats that perform well (but are not listed above) are dsrc2 (for Illumina FASTQ) and zstd.
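If you want to see how a compression ratio is calculated, the sketch below compares three formats available in Python's standard library (`gzip`, `bz2`, `lzma`). The input is made-up FASTQ-like text with random bases and a uniform quality string, so the numbers it prints are only illustrative and should not be read as a ranking for real reads. The helper names `make_fake_fastq` and `compression_ratio` are assumptions for the example.

```python
import bz2
import gzip
import lzma
import random


def make_fake_fastq(n_reads=2000, read_length=100, seed=0):
    """Build made-up FASTQ text: random bases and uniform quality scores."""
    rng = random.Random(seed)
    records = []
    for i in range(n_reads):
        bases = "".join(rng.choice("ACGT") for _ in range(read_length))
        records.append(f"@read{i}\n{bases}\n+\n{'I' * read_length}\n")
    return "".join(records).encode("ascii")


def compression_ratio(original, compressed):
    """Original size divided by compressed size; higher means smaller output."""
    return len(original) / len(compressed)


data = make_fake_fastq()
ratios = {
    "gzip": compression_ratio(data, gzip.compress(data)),
    "bzip2": compression_ratio(data, bz2.compress(data)),
    "xz (LZMA)": compression_ratio(data, lzma.compress(data)),
}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```

To get a meaningful answer for your own work, run the same comparison on a sample of your real data.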

What are considerations when choosing a compression format?

Answer:
  • Compression ratio
  • Time to compress/decompress
  • Software support
  • Streaming support
  • Compatibility with operating system(s)
  • Ease for all users involved
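The first two considerations often pull against each other. As a minimal sketch, the code below compresses the same made-up sequence data at three gzip levels and reports output size and time taken. The helper name `time_gzip` and the random input are assumptions for the example, and timings will vary between machines.

```python
import gzip
import random
import time

rng = random.Random(0)
# Made-up sequence data: 2,000 lines of 100 random bases each.
lines = ["".join(rng.choice("ACGT") for _ in range(100)) for _ in range(2000)]
data = "\n".join(lines).encode("ascii")


def time_gzip(data, level):
    """Return (compressed size in bytes, seconds taken) for one gzip level."""
    start = time.perf_counter()
    compressed = gzip.compress(data, compresslevel=level)
    elapsed = time.perf_counter() - start
    # Sanity check: the compressed bytes must decompress to the original.
    assert gzip.decompress(compressed) == data
    return len(compressed), elapsed


# Level 1 favours speed, level 9 favours size; 6 is a common middle ground.
results = {level: time_gzip(data, level) for level in (1, 6, 9)}
for level, (size, seconds) in results.items():
    print(f"level {level}: {len(data)} -> {size} bytes in {seconds:.4f} s")
```

Higher levels generally spend more time for a somewhat smaller file, so the right choice depends on whether storage or compute time is the tighter constraint.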

What is the difference between SAM and BAM formats?

Answer:

BAM files contain the same information as SAM files, but they are stored in a compressed binary format that is not readable by humans. In exchange, BAM files are smaller and more efficient for software to work with than SAM files, saving time and reducing the costs of computation and storage.

Series

Linux & File Compression

A practical guide to file compression in Linux for bioinformatics

1. Understanding File Compression
2. Using tar and gzip for Fun and Profit