Using tar and gzip for Fun and Profit
Posted on November 12, 2024
gzip (GNU zip) is a widely used compression tool and file format that reduces the size of files or data streams to save storage space or shorten transfer times over a network. It was developed as part of the GNU project and is commonly found on Unix-like operating systems, including Linux.
tar is a command-line utility and a file format used for archiving files and directories, that is, bundling them into a single file. tar does not compress anything by itself, but it is often combined with a compressor such as gzip. The term "tar" stands for "tape archive," referring to when data was stored on magnetic tapes. The tar utility is commonly used for creating, extracting, and managing archive files in the tar format.
Example files
Make some example files for us to play with.
You can use any combination of these, or make your own. Make at least 3 files. You can use a terminal text editor like vi or emacs, but do not use the GUI text editor.
grep --help > grep_help.txt
touch empty_file.txt
echo "This is some text" > sometext.txt
Exercise 1: Creating files
Create some example files, as described above.
Show Answer
grep --help > grep_help.txt
touch empty_file.txt
echo "This is some text" > sometext.txt
What is touch? What does touch do if the file already exists?
Show Answer
touch updates the access and modification times of each FILE to the current time.
A FILE argument that does not exist is created empty, unless -c or -h is supplied.
See man touch for more information.
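The behaviour described above can be checked with a short sketch. It assumes a throwaway directory from mktemp, and the file names (note.txt, new_empty.txt, never_created.txt) are hypothetical:

```shell
# Work in a temporary directory so no real files are affected.
workdir=$(mktemp -d)
note="$workdir/note.txt"

echo "keep me" > "$note"
touch "$note"                           # file exists: only its timestamps change
cat "$note"                             # the content is still there

touch "$workdir/new_empty.txt"          # file does not exist: created empty
touch -c "$workdir/never_created.txt"   # -c: do not create a missing file

ls -l "$workdir"
# When you are done: rm -rf "$workdir"
```

The listing should show note.txt and new_empty.txt only, since -c tells touch to skip files that do not exist.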
From the example above, what does the > do for the grep output?
Show Answer
The > redirects the standard output of the command to the specified file. > overwrites the file if it already exists; >> appends to it instead.
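A minimal sketch of the difference between the two redirections, using a hypothetical file name (redirect_demo.txt) in a temporary directory:

```shell
# Work in a temporary directory so no real files are affected.
workdir=$(mktemp -d)
demo="$workdir/redirect_demo.txt"

echo "first line"  >  "$demo"   # creates the file
echo "second line" >  "$demo"   # overwrites it: "first line" is gone
echo "third line"  >> "$demo"   # appends: the file now has two lines

cat "$demo"
# When you are done: rm -rf "$workdir"
```

The final cat should print "second line" followed by "third line".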
What does echo do?
Show Answer
Echo the STRING(s) to standard output. See man echo for more information.
How can you view the contents of these files?
Show Answer
cat is one way to view the contents of a file. cat will print the contents of a file to standard output. Try other options like head, less, more, tail, nl. You can also use a text editor like vi or emacs.
Using gzip
The remainder of this module will use the example files you've created to demonstrate how to use gzip and tar. gzip allows us to compress files, while tar allows us to group files together. Let's start with gzip. gzip should be pre-installed. To check if you have gzip installed, open a terminal and run:
gzip --version
If it's not installed, tell us! On other systems you can typically install it using your system's package manager (e.g., apt, yum, brew).
Compress a File
To compress a file using gzip, you can run the following command:
gzip filename
Replace filename with the name of the file you want to compress. This command will create a compressed file with the .gz extension, such as filename.gz.
Decompress a File
To decompress a .gz file, use the gunzip command:
gunzip filename.gz
This will restore the original file, removing the .gz extension.
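As a sketch of the full round trip, the following compresses and then restores a file, and uses cmp to confirm nothing was lost. The reference copy exists only for that comparison, and the temporary directory is an assumption of this example:

```shell
# Work in a temporary directory so no real files are affected.
workdir=$(mktemp -d)
original="$workdir/sometext.txt"

echo "This is some text" > "$original"
cp "$original" "$workdir/reference_copy.txt"   # kept only for the comparison

gzip "$original"          # sometext.txt is replaced by sometext.txt.gz
ls "$workdir"

gunzip "$original.gz"     # sometext.txt.gz is replaced by sometext.txt
cmp "$original" "$workdir/reference_copy.txt" && echo "round trip is lossless"
# When you are done: rm -rf "$workdir"
```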
Compression Options
- -d or --decompress: Decompress the specified file(s).
- -c or --stdout: Write compressed data to standard output (useful for piping data).
- -k or --keep: Keep the original file after compression (by default, the original file is removed).
- -v or --verbose: Display compression details.
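The options can be combined. The sketch below assumes a hypothetical numbers.txt generated with seq, and a gzip version that supports -k:

```shell
# Work in a temporary directory so no real files are affected.
workdir=$(mktemp -d)
seq 1 2000 > "$workdir/numbers.txt"     # sample data: the numbers 1 to 2000

# -k keeps numbers.txt next to the new numbers.txt.gz,
# -v reports what gzip did (the report goes to standard error).
gzip -kv "$workdir/numbers.txt"
ls "$workdir"

# -d decompresses and -c sends the result to standard output,
# so the .gz file itself is left untouched.
gzip -dc "$workdir/numbers.txt.gz" | head -n 3
# When you are done: rm -rf "$workdir"
```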
Compressing and Decompressing with Piping
You can also use gzip and gunzip with pipes to compress and decompress data on the fly. For example, to compress the output of a command and save it to a file:
command_to_generate_data | gzip -c > compressed_data.gz
And to decompress and process data from a compressed file:
gunzip -c compressed_data.gz | command_to_process_data
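Here is a concrete version of both pipelines, with printf standing in for command_to_generate_data and sort standing in for command_to_process_data. The file name letters.gz is hypothetical:

```shell
# Work in a temporary directory so no real files are affected.
workdir=$(mktemp -d)

# Compress the output of a command on the fly.
printf 'b\na\nc\n' | gzip -c > "$workdir/letters.gz"

# Decompress to standard output and hand the data to another command.
gunzip -c "$workdir/letters.gz" | sort
# When you are done: rm -rf "$workdir"
```

No uncompressed copy of the data is ever written to disk; it only passes through the pipes.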
Exercise 2: Using gzip
With the information above, and whatever else you can find via the internet, compress the files you created in Exercise 1.
Show Answer
gzip grep_help.txt
Now reverse it by decompressing those files.
Show Answer
gunzip grep_help.txt.gz
OR
gzip -d grep_help.txt.gz
What is the difference in size of the compressed files and the original files?
Show Answer
- grep_help.txt: 4042 bytes
- grep_help.txt.gz: 1580 bytes
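One way to measure the sizes yourself is with wc -c, which counts bytes. The sketch below uses a hypothetical numbers.txt in place of grep_help.txt, because the output of grep --help differs between systems, so your numbers will not match the ones above:

```shell
# Work in a temporary directory so no real files are affected.
workdir=$(mktemp -d)
seq 1 2000 > "$workdir/numbers.txt"

# Write the compressed copy next to the original without deleting it.
gzip -c "$workdir/numbers.txt" > "$workdir/numbers.txt.gz"

orig_size=$(wc -c < "$workdir/numbers.txt")
gz_size=$(wc -c < "$workdir/numbers.txt.gz")
echo "original:   $orig_size bytes"
echo "compressed: $gz_size bytes"

# gzip can also report sizes and the compression ratio itself.
gzip -l "$workdir/numbers.txt.gz"
# When you are done: rm -rf "$workdir"
```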
Can you compress one of the files while keeping the original (decompressed)?
Show Answer
gzip -k grep_help.txt
What happens to the file extension after compressing with gzip?
Show Answer
The file extension changes from .txt to .txt.gz, i.e. .gz is appended to the end of the existing file name.
How can you view the contents of these files?
Show Answer
zcat grep_help.txt.gz
gunzip -c grep_help.txt.gz # This will also work esp. for Mac users
What is the difference between | and >?
Show Answer
| is a pipe. It redirects the standard output of the command on the left to the standard input of the command on the right.
> redirects the standard output of the command on the left to the specified file. > overwrites the file if it already exists; >> appends to it instead.
Can you combine the steps in exercise 1 and 2, to create and compress a file in one step?
This is difficult. Hint, use piping.
Show Answer
grep --help | gzip -c > grep_help.txt.gz
Did you notice the -c flag for gzip? It tells gzip to write its output to standard output instead of to a file. When gzip is given no file name it reads from standard input and already writes to standard output, so -c is optional here, but including it makes the intent explicit. Did you also notice that we needed to give the full file name, including the extension, for the output file? The shell redirection (>) creates the file, not gzip, so the .gz extension is not appended automatically.
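One way to check how gzip behaves when it reads from a pipe is to run it with and without -c and compare what comes back after decompression. The file names below are hypothetical:

```shell
# Work in a temporary directory so no real files are affected.
workdir=$(mktemp -d)

echo "This is some text" | gzip    > "$workdir/no_flag.gz"
echo "This is some text" | gzip -c > "$workdir/with_flag.gz"

# Both files should decompress to the same line of text.
gunzip -c "$workdir/no_flag.gz"
gunzip -c "$workdir/with_flag.gz"
# When you are done: rm -rf "$workdir"
```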
Using tar
The tar command supports several operations on Linux and other Unix-like operating systems, such as creating archives, listing their contents, and extracting files from them. Here's a basic explanation of the most common ones:
Creating a Tar Archive
To create a tar archive, you typically use the -c (create) option, followed by the -f (file) option to specify the archive file's name. You also list the files and directories you want to include in the archive. For example:
tar -cvf archive.tar file1.txt file2.txt directory/
- -c: Create a new archive.
- -v: Verbose mode (optional, for displaying the progress).
- -f archive.tar: Specify the archive file name.
- file1.txt file2.txt directory/: List the files and directories to include in the archive.
Viewing the Contents of a Tar Archive
You can list the contents of a tar archive without extracting them using the -t (list) option:
tar -tvf archive.tar
-t: List the contents of the archive.
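Creating and listing can be tried together. This sketch builds a small archive from two sample files in a temporary directory and then lists it. It also uses -C so that tar looks for the input files in that directory:

```shell
# Work in a temporary directory so no real files are affected.
workdir=$(mktemp -d)
echo "This is some text" > "$workdir/sometext.txt"
touch "$workdir/empty_file.txt"

# -C makes tar change into $workdir before it reads the file names.
tar -cf "$workdir/archive.tar" -C "$workdir" sometext.txt empty_file.txt

# List the archive members without extracting them.
tar -tf "$workdir/archive.tar"
# When you are done: rm -rf "$workdir"
```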
Extracting Files from a Tar Archive
To extract files from a tar archive, you can use the -x (extract) option:
tar -xvf archive.tar
-x: Extract files from the archive.
Extracting Files to a Specific Directory
You can specify the target directory for extraction using the -C (change directory) option:
tar -xvf archive.tar -C /path/to/target_directory
-C /path/to/target_directory: Extract files to the specified directory.
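A sketch of extracting into a separate directory, assuming a hypothetical target named extracted. The target directory is created first, since -C expects it to exist:

```shell
# Work in a temporary directory so no real files are affected.
workdir=$(mktemp -d)
echo "This is some text" > "$workdir/sometext.txt"
tar -cf "$workdir/archive.tar" -C "$workdir" sometext.txt

# Create the target directory, then extract into it.
mkdir "$workdir/extracted"
tar -xf "$workdir/archive.tar" -C "$workdir/extracted"

cat "$workdir/extracted/sometext.txt"
# When you are done: rm -rf "$workdir"
```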
Compressing Tar Archives
You can create compressed tar archives using the gzip or bzip2 compression utilities. For example:
- To create a gzip-compressed tar archive:
tar -cvzf archive.tar.gz file1.txt file2.txt directory/
- To create a bzip2-compressed tar archive:
tar -cvjf archive.tar.bz2 file1.txt file2.txt directory/
To extract from compressed archives, you can use the -z option for gzip-compressed archives or the -j option for bzip2-compressed archives.
tar -xvzf archive.tar.gz
tar -xvjf archive.tar.bz2
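The gzip variant can be exercised end to end as follows. The directory names project and restored are hypothetical, and gzip -t is used as an extra check that the compressed stream is intact:

```shell
# Work in a temporary directory so no real files are affected.
workdir=$(mktemp -d)
mkdir "$workdir/project"
echo "This is some text" > "$workdir/project/sometext.txt"
touch "$workdir/project/empty_file.txt"

# Archive and compress the whole directory in one step.
tar -czf "$workdir/project.tar.gz" -C "$workdir" project
gzip -t "$workdir/project.tar.gz" && echo "gzip stream is intact"

# Extract it again somewhere else.
mkdir "$workdir/restored"
tar -xzf "$workdir/project.tar.gz" -C "$workdir/restored"
ls "$workdir/restored/project"
# When you are done: rm -rf "$workdir"
```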
Exercise 3: Using tar
With the information above, and whatever else you can find via the internet, bundle all the files you created in Exercise 1 into a tarball.
Show Answer
tar -cf my_files.tar grep_help.txt sometext.txt empty_file.txt
Here's a breakdown of the command and its options:
- tar: This is the command itself, used to work with tar archives.
- -c: This option stands for "create." It tells tar to create a new archive.
- -f my_files.tar: This option is followed by the name of the archive file you want to create. In this case, "my_files.tar" is the name of the archive file.
- grep_help.txt sometext.txt empty_file.txt: These are the names of the files you want to include in the archive. The tar command will add these three files to "my_files.tar."
You can look inside the archive with -t:
tar -tf my_files.tar
Now reverse it by extracting those files.
Show Answer
tar -xf my_files.tar
Here's a breakdown of the command and its options:
- tar: This is the command itself, used to work with tar archives.
- -x: This option stands for "extract." It tells tar to extract the contents of the specified archive.
- -f my_files.tar: This option is followed by the name of the archive file you want to extract. In this case, "my_files.tar" is the name of the archive file from which you want to extract files.
What is the difference in size of the tar file and the sum of the original files? Is this what you expected?
Show Answer
- grep_help.txt + sometext.txt: 4060 bytes
- my_files.tar: 10240 bytes
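Why is this? By default, GNU tar writes archives in records of 20 blocks of 512 bytes each, so even a small archive is padded out to 10240 bytes; see the GNU tar manual for details (the 20-block record is a long-standing tar convention rather than something GNU-specific, though some implementations may not pad regular files to a full record). Even without that padding, the archive is bigger than the sum of the file sizes, because tar stores a header with metadata for each file. Remember, tar bundles files together; it does not compress them.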
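You can observe the block structure directly. This sketch archives one small file and compares sizes. The exact archive size depends on your tar implementation, so the only assumption made here is that it is a whole number of 512-byte blocks:

```shell
# Work in a temporary directory so no real files are affected.
workdir=$(mktemp -d)
echo "This is some text" > "$workdir/sometext.txt"
tar -cf "$workdir/my_files.tar" -C "$workdir" sometext.txt

file_size=$(wc -c < "$workdir/sometext.txt")
tar_size=$(wc -c < "$workdir/my_files.tar")

echo "file:    $file_size bytes"
echo "archive: $tar_size bytes"
# A remainder of 0 means the archive is made of whole 512-byte blocks.
echo "archive size modulo 512: $((tar_size % 512))"
# When you are done: rm -rf "$workdir"
```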
What is the difference in file name of the output file between directly compressing with gzip and using tar?
Show Answer
tar will use whatever file name you have specified. It will not add additional file extensions. That is, you can make a tarball and not specify .tar as the file extension.
What is the purpose of the -v flag in the tar command?
Show Answer
-v: This option stands for "verbose." It instructs tar to print the name of each file as it is processed, whether tar is creating, listing, or extracting an archive. Combined with -t, it also shows details such as permissions, owner, and size.
Can you combine the steps in exercise 2 and 3, to create an archive and compress it in one step?
i.e. create a .tar.gz file in one step.
Show Answer
tar -cvzf my_files.tar.gz grep_help.txt sometext.txt empty_file.txt
Try running zcat on this file, what does zcat do?
Show Answer
zcat is identical to gunzip -c. (On some systems, zcat may be installed as gzcat to preserve the original link to compress.) zcat uncompresses either a list of files on the command line or its standard input and writes the uncompressed data on standard output. zcat will uncompress files that have the correct magic number whether they have a .gz suffix or not.
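Because the decompressed content of a .tar.gz file is itself a tar archive, that output can be piped straight into tar. The sketch below uses gunzip -c rather than zcat so that it also works on systems where zcat behaves differently, and "-f -" tells tar to read the archive from standard input:

```shell
# Work in a temporary directory so no real files are affected.
workdir=$(mktemp -d)
echo "This is some text" > "$workdir/sometext.txt"
tar -czf "$workdir/my_files.tar.gz" -C "$workdir" sometext.txt

# Decompress to standard output and let tar list what it receives.
gunzip -c "$workdir/my_files.tar.gz" | tar -tf -
# When you are done: rm -rf "$workdir"
```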
Linux & File Compression
A practical guide to file compression in Linux for bioinformatics