
Tips and tricks for Python

Posted on August 26, 2023
Banner image: an AI-generated picture (Midjourney) made with the prompt 'computer programming in the style of Girl with a Pearl Earring by Johannes Vermeer'. You can share and adapt this image under a CC BY-SA 4.0 licence.

This post is part of a series summarising a workshop we ran recently. We discussed programming and common pitfalls, and then looked at some fun Python tricks.

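This post covers the following sections:

  • List comprehension
  • Dictionary comprehension (and zip)
  • Using enumerate
  • Dictionary iteration - use .items()
  • Use logging for proper scripts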

List comprehension

List comprehension in Python is a concise and expressive way to create lists by applying an expression to each item in an iterable (such as a list, tuple, or range) and optionally filtering the items based on a condition. It's a powerful feature that can simplify code and make it more readable, reducing the need for explicit loops.

The general syntax of a list comprehension is as follows:

python
new_list = [expression for item in iterable if condition]

Here's what each part means:

  • expression: This is the operation you want to perform on each item in the iterable to create the elements of the new list.
  • item: This represents each individual item in the iterable.
  • iterable: This is the collection of items you want to iterate over, such as a list, tuple, or range.
  • condition (optional): This is an optional part where you can specify a condition that determines whether an item is included in the new list.
python
fasta_headers = [
    ">gene1|speciesA",
    ">gene2|speciesB",
    ">gene3|speciesA",
]
# -------------------------------------------#
# Take the text before the "|" and drop the leading ">".
gene_names = [header.split("|")[0][1:] for header in fasta_headers]
print(f'My gene list is : {", ".join(gene_names)}')
# This syntax is not dramatically faster or better; it might be clearer in SOME situations.
# You can get the same result with a familiar loop.
gene_names_again = []
for header in fasta_headers:
    gene_names_again.append(header.split("|")[0][1:])
print(f'My gene list is : {", ".join(gene_names_again)}')
# The optional condition keeps only the headers that belong to speciesA.
gene_names_filter = [header.split("|")[0][1:] for header in fasta_headers if header.endswith('speciesA')]
print(f'My gene list is : {", ".join(gene_names_filter)}')

The output:

python
My gene list is : gene1, gene2, gene3
My gene list is : gene1, gene2, gene3
My gene list is : gene1, gene3
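
The iterable does not have to be a list. As a minimal sketch (the sequence below is made up for illustration and is not from the workshop), a comprehension over a range can split a sequence into codons, and a condition can drop an incomplete codon at the end:

```python
# Hypothetical example sequence, chosen so the last codon is incomplete.
sequence = "ATGGCTCAGTAAGC"

# Step through the sequence three bases at a time.
codons = [sequence[i:i + 3] for i in range(0, len(sequence), 3)]

# Keep only complete codons (this drops the trailing "GC").
complete_codons = [codon for codon in codons if len(codon) == 3]

print(codons)
print(complete_codons)
```

The first list still contains the two-base leftover, which is why the second comprehension filters on length.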

Dictionary comprehension (and zip)

Dictionaries can be created in a similar way.

python
fasta_headers = [
    ">gene1|speciesA",
    ">gene2|speciesB",
    ">gene3|speciesA",
]
gene_species = {header.split("|")[0][1:]: header.split("|")[1] for header in fasta_headers}
print(gene_species)

The output:

python
{'gene1': 'speciesA', 'gene2': 'speciesB', 'gene3': 'speciesA'}

It can get pretty intense. In this one line, zip pairs up the items of two lists by position, and the comprehension merges those pairs into a single dictionary.

python
codons = ["ATG", "GCT", "TAA", "CAG"]
amino_acids = ["Methionine", "Alanine", "STOP", "Glutamine"]
codon_to_aa = {codon: aa for codon, aa in zip(codons, amino_acids)}
print(codon_to_aa)

The output:

python
{'ATG': 'Methionine', 'GCT': 'Alanine', 'TAA': 'STOP', 'CAG': 'Glutamine'}
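
One thing to watch with zip is that it stops at the shorter input. The sketch below reuses the lists from above, shows that dict(zip(...)) gives the same result as the comprehension, and then uses a deliberately truncated list (a hypothetical stand-in for incomplete data) to show keys being dropped without any error:

```python
codons = ["ATG", "GCT", "TAA", "CAG"]
amino_acids = ["Methionine", "Alanine", "STOP", "Glutamine"]

# dict() accepts an iterable of (key, value) pairs, which is what zip yields.
codon_to_aa = dict(zip(codons, amino_acids))
print(codon_to_aa)

# zip stops at the shortest input, so a missing value silently drops a key.
short_amino_acids = ["Methionine", "Alanine"]  # hypothetical truncated list
truncated = dict(zip(codons, short_amino_acids))
print(truncated)

# A length check up front makes the mismatch visible.
lengths_match = len(codons) == len(short_amino_acids)
print(f"Lists have matching lengths: {lengths_match}")
```

On Python 3.10 or later, zip(codons, short_amino_acids, strict=True) raises a ValueError instead of truncating.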

Using enumerate

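enumerate yields an (index, item) pair for each item as you loop over an iterable, and its optional start argument sets the first index. There's a lot of old code that doesn't use this cleaner, preferred syntax and keeps a counter by hand instead. Using enumerate is usually safer, as you don't have a count variable floating around that is all too easy to forget to increment.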

python
dna_sequences = ["ATCGAAGCT", "GTTAGTCC", "AGCGTAAGGT", "GATC"]
# ------------
for index, sequence in enumerate(dna_sequences, start=1):
    gc_content = (sequence.count("G") + sequence.count("C")) / len(sequence) * 100
    print(f"Sequence {index}: GC content = {round(gc_content,2)}%")
print('\nAgain with a counter')
count = 1
for sequence in dna_sequences:
    gc_content = (sequence.count("G") + sequence.count("C")) / len(sequence) * 100
    print(f"Sequence {count}: GC content = {round(gc_content,2)}%")
    count += 1
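
To see what enumerate actually hands to the loop, it can help to collect its output in a list. This is a small sketch using the same sequences; the length cut-off of 9 is an arbitrary value picked for illustration:

```python
dna_sequences = ["ATCGAAGCT", "GTTAGTCC", "AGCGTAAGGT", "GATC"]

# enumerate yields (index, item) tuples; start=1 makes the numbering 1-based.
numbered = list(enumerate(dna_sequences, start=1))
print(numbered)

# The pairs can feed a comprehension, here keeping the numbers of the
# sequences shorter than 9 bases (an arbitrary threshold for this sketch).
short_sequence_numbers = [index for index, sequence in numbered if len(sequence) < 9]
print(short_sequence_numbers)
```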

Dictionary iteration - use .items()

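You use a lot of dictionaries in Python, and you often want to fetch both the key and the value at the same time. Looping over a dictionary directly gives you only the keys, so each value needs a second lookup, whereas .items() yields (key, value) pairs. Let's take our FASTA header example from before.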

python
fasta_headers = [
    ">gene1|speciesA",
    ">gene2|speciesB",
    ">gene3|speciesA",
]
gene_species = {header.split("|")[0][1:]: header.split("|")[1] for header in fasta_headers}
for gene in gene_species: # This is OK
    print(f'My gene: {gene}; My species: {gene_species[gene]}')
print('\nAgain with items')
for gene, species in gene_species.items(): # This is BETTER
    print(f'My gene: {gene}; My species: {species}')
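
Having both key and value in hand is also convenient when reshaping a dictionary. As a minimal sketch (the grouping step is an illustration, not something covered in the workshop), this inverts the gene-to-species mapping into a species-to-genes mapping:

```python
gene_species = {"gene1": "speciesA", "gene2": "speciesB", "gene3": "speciesA"}

# Group gene names by species; setdefault creates the list the first time
# a species is seen and returns the existing list after that.
genes_by_species = {}
for gene, species in gene_species.items():
    genes_by_species.setdefault(species, []).append(gene)

print(genes_by_species)
```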

Use logging for proper scripts

The logging module in Python provides a flexible framework for emitting log messages from applications. It supports various log levels and destinations.

python
import logging
# Configure the logging settings
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(levelname)s - %(message)s",
    datefmt="%Y-%m-%d %H:%M:%S"
)
def find_motif(sequence, motif):
    positions = []
    for i in range(len(sequence) - len(motif) + 1):
        if sequence[i:i+len(motif)] == motif:
            positions.append(i)
            logging.info(f"Motif found at position {i}")
    return positions
# Example sequence and motif
sequence = "ATCGAAGCTGTTAGTCCAGCGTAAGGTGATC"
motif = "GTTA"
# Find the motif in the sequence
motif_positions = find_motif(sequence, motif)
if motif_positions:
    logging.info(f"Motif '{motif}' found at positions: {', '.join(map(str, motif_positions))}")
else:
    logging.warning(f"Motif '{motif}' not found in the sequence")
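
The levels and destinations mentioned above can be sketched with a named logger and a handler. The logger name "motif_demo" and the in-memory buffer are assumptions made for this example; in a real script the destination would more likely be the console or a file:

```python
import io
import logging

# A named logger (the name "motif_demo" is made up for this sketch).
logger = logging.getLogger("motif_demo")
logger.setLevel(logging.DEBUG)  # pass everything on; the handler filters
logger.propagate = False  # keep this demo out of the root logger's output

# Destination: an in-memory buffer here. A logging.FileHandler with a
# file path would write to a file instead.
buffer = io.StringIO()
handler = logging.StreamHandler(buffer)
handler.setLevel(logging.WARNING)  # drop anything below WARNING
handler.setFormatter(logging.Formatter("%(levelname)s - %(message)s"))
logger.addHandler(handler)

logger.info("Motif found at position %d", 9)  # filtered out by the handler
logger.warning("Motif '%s' not found in the sequence", "GTTA")

captured = buffer.getvalue()
print(captured, end="")
```

Only the warning reaches the buffer, because the handler's level is WARNING. Passing the values as arguments rather than building an f-string means the message is only formatted if it is actually emitted.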

Questions or comments? @ me on Mastodon @happykhan@mstdn.science or Twitter @happy_khan
