
Tips and tricks for Python

Posted on August 26, 2023

Banner image: an AI-generated picture (Midjourney) with the prompt 'computer programming in the style of Girl with a Pearl Earring by Johannes Vermeer'. You can share and adapt this image following a CC BY-SA 4.0 licence.

This post is part of a series that summarises a workshop we ran recently. We discussed programming in general, went through some common pitfalls, and then looked at some fun Python tricks.

There are several sections:

List comprehension

List comprehension in Python is a concise and expressive way to create lists by applying an expression to each item in an iterable (such as a list, tuple, or range) and optionally filtering the items based on a condition. It's a powerful feature that can simplify code and make it more readable, reducing the need for explicit loops.

The general syntax of a list comprehension is as follows:

python

new_list = [expression for item in iterable if condition]

Here's what each part means:

expression: This is the operation you want to perform on each item in the iterable to create the elements of the new list.
item: This represents each individual item in the iterable.
iterable: This is the collection of items you want to iterate over, such as a list, tuple, or range.
condition (optional): This is an optional part where you can specify a condition that determines whether an item is included in the new list.

python

fasta_headers = [
    ">gene1|speciesA",
    ">gene2|speciesB",
    ">gene3|speciesA",
]

# -------------------------------------------#
gene_names = [header.split("|")[0][1:] for header in fasta_headers]
print(f'My gene list is : {", ".join(gene_names)}')

# This syntax is not dramatically faster or better; it might be clearer in SOME situations.
# You can get the same result with a familiar loop.
gene_names_again = []
for header in fasta_headers:
    gene_names_again.append(header.split("|")[0][1:])

print(f'My gene list is : {", ".join(gene_names_again)}')

gene_names_filter = [header.split("|")[0][1:] for header in fasta_headers if header.endswith('speciesA')]
print(f'My gene list is : {", ".join(gene_names_filter)}')

The output:

python

My gene list is : gene1, gene2, gene3
My gene list is : gene1, gene2, gene3
My gene list is : gene1, gene3
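
The same pattern works on any iterable, not only lists. The sketch below is a minimal illustration using a tuple and a range; the sequences are illustrative data and the variable names are assumptions made for this example.

```python
# Illustrative sequences stored in a tuple rather than a list.
dna_sequences = ("ATCGAAGCT", "GTTAGTCC", "AGCGTAAGGT", "GATC")

# Expression only: the length of every sequence.
lengths = [len(seq) for seq in dna_sequences]

# Expression plus condition: lower-case only the sequences longer than 8 bases.
long_sequences = [seq.lower() for seq in dna_sequences if len(seq) > 8]

# Iterating over a range: split the first sequence into 3-base codons.
first = dna_sequences[0]
codons = [first[i:i + 3] for i in range(0, len(first), 3)]

print(lengths)         # [9, 8, 10, 4]
print(long_sequences)  # ['atcgaagct', 'agcgtaaggt']
print(codons)          # ['ATC', 'GAA', 'GCT']
```

Whatever the input type, the result of a list comprehension is always a new list.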

Dictionary comprehension (and zip)

Dictionaries can be created in a similar way.

python

fasta_headers = [
    ">gene1|speciesA",
    ">gene2|speciesB",
    ">gene3|speciesA",
]
gene_species = {header.split("|")[0][1:]: header.split("|")[1] for header in fasta_headers}
print(gene_species)

The output:

python

{'gene1': 'speciesA', 'gene2': 'speciesB', 'gene3': 'speciesA'}

It can get pretty intense. In this one line, we are merging two lists into a single dictionary.

python

codons = ["ATG", "GCT", "TAA", "CAG"]
amino_acids = ["Methionine", "Alanine", "STOP", "Glutamine"]
codon_to_aa = {codon: aa for codon, aa in zip(codons, amino_acids)}
print(codon_to_aa)

The output:

python

{'ATG': 'Methionine', 'GCT': 'Alanine', 'TAA': 'STOP', 'CAG': 'Glutamine'}
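
When the key and value need no further processing, passing the zipped pairs straight to dict() gives the same result. The sketch below also shows one thing to watch for: zip stops silently at the shorter input, so a simple length check (an addition for this example, not part of the workshop code) can catch mismatched lists.

```python
codons = ["ATG", "GCT", "TAA", "CAG"]
amino_acids = ["Methionine", "Alanine", "STOP", "Glutamine"]

# dict() accepts an iterable of (key, value) pairs, which is what zip yields.
codon_to_aa = dict(zip(codons, amino_acids))
print(codon_to_aa)

# zip stops at the shortest input, so extra codons are dropped without warning.
truncated = dict(zip(codons, amino_acids[:2]))
print(truncated)  # {'ATG': 'Methionine', 'GCT': 'Alanine'}

# A defensive check before zipping lists that are meant to line up.
if len(codons) != len(amino_acids):
    raise ValueError("codons and amino_acids must be the same length")
```

From Python 3.10 onwards, zip also accepts strict=True, which raises a ValueError when the inputs differ in length.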

Using enumerate

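enumerate gives you each item together with its position as you loop, so you do not need to maintain a separate counter. There's a lot of old code that doesn't use this preferred and cleaner syntax. Using enumerate is usually safer, as you don't have a count variable floating around that is all too easy to forget to increment.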

python

dna_sequences = ["ATCGAAGCT", "GTTAGTCC", "AGCGTAAGGT", "GATC"]
# ------------

for index, sequence in enumerate(dna_sequences, start=1):
    gc_content = (sequence.count("G") + sequence.count("C")) / len(sequence) * 100
    print(f"Sequence {index}: GC content = {round(gc_content,2)}%")

print('\nAgain with a counter')
count = 1
for sequence in dna_sequences:
    gc_content = (sequence.count("G") + sequence.count("C")) / len(sequence) * 100
    print(f"Sequence {count}: GC content = {round(gc_content,2)}%")
    count += 1
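
To see what enumerate actually produces, it can help to turn it into a list. This is a small sketch reusing the sequences above; the labelled dictionary and its "seq1"-style keys are made up for illustration.

```python
dna_sequences = ["ATCGAAGCT", "GTTAGTCC", "AGCGTAAGGT", "GATC"]

# enumerate yields (index, item) tuples; counting starts at 0 by default.
pairs = list(enumerate(dna_sequences))
print(pairs[0])  # (0, 'ATCGAAGCT')

# start=1 only changes the numbering, not which items are visited.
numbered = list(enumerate(dna_sequences, start=1))
print(numbered[-1])  # (4, 'GATC')

# enumerate also combines with comprehensions, e.g. to label each sequence.
labelled = {f"seq{index}": sequence
            for index, sequence in enumerate(dna_sequences, start=1)}
print(labelled["seq2"])  # GTTAGTCC
```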

Dictionary iteration - use .items()

You use a lot of dictionaries in Python, and you often want to fetch both the key and the value at the same time. Let's take our FASTA header example from before.

python

fasta_headers = [
    ">gene1|speciesA",
    ">gene2|speciesB",
    ">gene3|speciesA",
]
gene_species = {header.split("|")[0][1:]: header.split("|")[1] for header in fasta_headers}

for gene in gene_species: # This is OK
    print(f'My gene: {gene}; My species: {gene_species[gene]}')

print('\nAgain with items')
for gene, species in gene_species.items(): # This is BETTER
    print(f'My gene: {gene}; My species: {species}')
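
Having both key and value to hand is useful when you reshape a dictionary. As a minimal sketch, the loop below groups genes by species; genes_by_species is a name introduced for this example.

```python
gene_species = {"gene1": "speciesA", "gene2": "speciesB", "gene3": "speciesA"}

# Build a species -> list of genes mapping from the gene -> species one.
genes_by_species = {}
for gene, species in gene_species.items():
    # setdefault creates an empty list the first time a species is seen.
    genes_by_species.setdefault(species, []).append(gene)

print(genes_by_species)
# {'speciesA': ['gene1', 'gene3'], 'speciesB': ['gene2']}

# If only the values are needed, .values() avoids the key lookups entirely.
species_seen = set(gene_species.values())
```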

Use logging for proper scripts

The logging module in Python provides a flexible framework for emitting log messages from applications. It supports various log levels and destinations.

python

import logging

# Configure the logging settings
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S"
)

def find_motif(sequence, motif):
    positions = []
    for i in range(len(sequence) - len(motif) + 1):
        if sequence[i:i+len(motif)] == motif:
            positions.append(i)
            logging.info(f"Motif found at position {i}")
    return positions

# Example sequence and motif
sequence = "ATCGAAGCTGTTAGTCCAGCGTAAGGTGATC"
motif = "GTTA"

# Find the motif in the sequence
motif_positions = find_motif(sequence, motif)

if motif_positions:
    logging.info(f"Motif '{motif}' found at positions: {', '.join(map(str, motif_positions))}")
else:
    logging.warning(f"Motif '{motif}' not found in the sequence")
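
The log level acts as a filter, which is the main advantage over print statements: you can leave detailed messages in the code and choose how much to show. The sketch below uses a named logger with its own handler, assuming you want warnings only; the logger name "motif_demo" is hypothetical.

```python
import logging

# "motif_demo" is a hypothetical logger name chosen for this sketch.
logger = logging.getLogger("motif_demo")
logger.setLevel(logging.WARNING)  # only WARNING and above get through
logger.propagate = False          # avoid duplicate lines via the root logger

handler = logging.StreamHandler()  # writes to stderr by default
handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
logger.addHandler(handler)

motif = "GTTA"
# Passing arguments separately defers string formatting until the
# message is actually going to be emitted.
logger.info("Searching for motif %s", motif)  # suppressed: INFO is below WARNING
logger.warning("Motif %s not found", motif)   # emitted
```

Changing the level to logging.INFO or logging.DEBUG shows the more detailed messages without touching the rest of the script.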

Questions or comments? @ me on Mastodon @happykhan@mstdn.science or Twitter @happy_khan
