Representing Nucleotides
Posted on November 12, 2024
During sequencing, the nucleotide bases in a DNA or RNA sample (library) are determined by the sequencer. For each fragment in the library, a sequence is generated, also called a read, which is simply a succession of nucleotides. The sequence of a read is represented by a string of letters, where each letter represents a nucleotide base. The most common nucleotide bases are adenine (A), cytosine (C), guanine (G), and thymine (T). In RNA, thymine is replaced by uracil (U).
Sequencing instruments may also give ambiguous signals, which are represented by the letters R, Y, S, W, K, M, B, D, H, V, and N. This convention was set by the International Union of Pure and Applied Chemistry (IUPAC) in 1985. These letters represent the following combinations of nucleotides:
| IUPAC nucleotide code | Base |
|---|---|
| A | Adenine |
| C | Cytosine |
| G | Guanine |
| T (or U) | Thymine (or Uracil) |
| R | A or G |
| Y | C or T |
| S | G or C |
| W | A or T |
| K | G or T |
| M | A or C |
| B | C or G or T |
| D | A or G or T |
| H | A or C or T |
| V | A or C or G |
| N | any base |
| . or - | gap |
N means the base could not be determined. This is different from a gap, which means the base is not present in the sequence.
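The table above can also be written down as a lookup structure. The following is a minimal sketch, not part of any library: the names `IUPAC_CODES`, `GAP_CHARS`, and `possible_bases` are hypothetical, and it assumes a DNA alphabet, so `N` expands to A, C, G, or T.

```python
# IUPAC nucleotide codes mapped to the bases they can stand for.
# Assumption: DNA alphabet, so ambiguity codes expand to A/C/G/T rather than U.
IUPAC_CODES = {
    "A": "A", "C": "C", "G": "G", "T": "T", "U": "U",
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG",
    "N": "ACGT",
}
GAP_CHARS = {".", "-"}


def possible_bases(code):
    """Return the set of bases a single IUPAC code can represent."""
    if code in GAP_CHARS:
        return set()  # a gap means no base is present at this position
    return set(IUPAC_CODES[code])


print(sorted(possible_bases("R")))  # ['A', 'G']
print(sorted(possible_bases("N")))  # ['A', 'C', 'G', 'T']
print(possible_bases("-"))          # set()
```

Returning an empty set for a gap keeps the distinction made above: `N` is an unknown base, while a gap is no base at all.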
Read Johnson 2010 for more details.
Exercise: Working with IUPAC codes
Convert the following sequence to all possible combinations:
ATGRCTAYCGTG
Answer:
- ATGACTACCGTG
- ATGGCTACCGTG
- ATGACTATCGTG
- ATGGCTATCGTG
Convert this sequence to all possible combinations
ATGNATC-GTG
Answer:
- ATGAATC-GTG
- ATGTATC-GTG
- ATGGATC-GTG
- ATGCATC-GTG
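The expansions in the two exercises above can also be generated programmatically. Here is a rough sketch, assuming a DNA alphabet and treating gap characters as literal symbols that are copied through unchanged; `AMBIGUITY` and `expand_sequence` are hypothetical names chosen for illustration.

```python
from itertools import product

# Ambiguity codes and the bases each one may stand for (DNA alphabet assumed).
AMBIGUITY = {
    "R": "AG", "Y": "CT", "S": "CG", "W": "AT", "K": "GT", "M": "AC",
    "B": "CGT", "D": "AGT", "H": "ACT", "V": "ACG", "N": "ACGT",
}


def expand_sequence(sequence):
    """Return every unambiguous sequence that `sequence` could represent."""
    # Unambiguous bases and gap characters map to themselves.
    choices = [AMBIGUITY.get(ch, ch) for ch in sequence]
    # product() picks one option per position, in every possible combination.
    return ["".join(combo) for combo in product(*choices)]


for expanded in expand_sequence("ATGRCTAYCGTG"):
    print(expanded)
for expanded in expand_sequence("ATGNATC-GTG"):
    print(expanded)
```

The number of results is the product of the number of options at each position, so a sequence with many `N` characters expands very quickly (four options per `N`).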
Write pseudocode or a program to check if a sequence is valid according to the IUPAC nucleotide code.
This is difficult.
Answer:
A pseudocode solution:
    sequence = "ATGRCXTAYCGTG"
    valid_bases = "ACGTURYSWKMBDHVN"   (each code listed once, so a match is counted only once)
    true_count = 0
    For each character in the sequence:
        For each base in valid_bases:
            If character == base:
                true_count add 1
    If true_count == length of sequence:
        return True
    Else:
        return False
Pseudocode is a way to describe the logical steps a computer program should take without getting into the specifics of a particular programming language. It's a high-level description of the algorithm or logic you want to implement. Pseudocode uses plain language and simple structures to outline the sequence of operations and decisions in a program.
A Python solution using regular expressions:
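    import re

    def is_valid_nucleotide_sequence(sequence):
        # Each IUPAC nucleotide code appears once; gap characters are not accepted here.
        pattern = r'[ACGTURYSWKMBDHVN]*'
        # fullmatch requires the whole string to match, so a trailing newline is rejected too.
        return bool(re.fullmatch(pattern, sequence))

    # Example usage:
    sequence = "ATGRCXTAYCGTG"
    if is_valid_nucleotide_sequence(sequence):
        print("The sequence is valid.")
    else:
        print("The sequence contains invalid characters.")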
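A yes/no answer does not say where the problem is. As one possible variation, the sketch below uses a set lookup instead of a regular expression and reports the position of each invalid character. The function name `find_invalid_characters` and the `allow_gaps` switch are assumptions made for this example; like the solutions above, it is case-sensitive.

```python
# Allowed IUPAC nucleotide codes, plus the two gap characters from the table.
IUPAC_ALPHABET = set("ACGTURYSWKMBDHVN")
GAP_CHARS = set(".-")


def find_invalid_characters(sequence, allow_gaps=False):
    """Return a list of (position, character) pairs for invalid characters."""
    allowed = (IUPAC_ALPHABET | GAP_CHARS) if allow_gaps else IUPAC_ALPHABET
    # Positions are 0-based, matching Python string indexing.
    return [(i, ch) for i, ch in enumerate(sequence) if ch not in allowed]


print(find_invalid_characters("ATGRCXTAYCGTG"))                 # [(5, 'X')]
print(find_invalid_characters("ATGNATC-GTG"))                   # [(7, '-')]
print(find_invalid_characters("ATGNATC-GTG", allow_gaps=True))  # []
```

An empty list means the sequence is valid under the chosen rules, so `not find_invalid_characters(seq)` gives the same kind of answer as the boolean check.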