
My favourite python modules

Posted on August 27, 2023
an AI generated picture (Midjourney) with prompt; 'computer programming in the style of The Two Fridas by Frida Kahlo '. You can share and adapt this image following a CC BY-SA 4.0 licence.

This post is part of a series summarising a workshop we ran recently. We discussed programming and common pitfalls, and then looked at some fun Python tricks.

There are several sections:

My favourite python modules

  • Google Sheets API for Python
  • Openpyxl: allows for easy interaction with Excel files, enabling reading, writing, and manipulation of data within Excel spreadsheets.
  • Biopython: handles bioinformatics tasks, offering various tools for working with biological data.
  • NumPy: A fundamental package for numerical computations, providing support for arrays and matrices, along with mathematical functions to operate on these arrays efficiently.
  • Pandas: A powerful data analysis and manipulation library that provides data structures like DataFrames for handling structured data, making it easier to work with tabular data.
  • scikit-learn: A machine learning library that provides a wide range of tools for tasks like classification, regression, clustering, dimensionality reduction, and more.
  • logging: The built-in module for flexible and powerful logging capabilities, which is essential for debugging and monitoring applications.
  • click or typer: better options for building command-line interfaces (CLIs) in Python.
  • queue: the built-in module for creating thread-safe job queues in Python (the best approach for multi-threading).
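
To make the last two standard-library entries concrete, here is a minimal sketch of a job queue served by a few worker threads, with logging used to report progress. The names run_jobs, worker and the "jobs" logger are hypothetical, chosen only for this illustration, and the job itself (taking the length of a sequence) is a stand-in for real work.

```python
import logging
import queue
import threading

logging.basicConfig(level=logging.INFO, format="%(threadName)s %(levelname)s %(message)s")
logger = logging.getLogger("jobs")  # hypothetical logger name


def worker(func, jobs, results):
    """Take items off the job queue until a None sentinel arrives."""
    while True:
        item = jobs.get()
        if item is None:
            jobs.task_done()
            break
        value = func(item)
        logger.info("processed %r", item)
        results.put((item, value))
        jobs.task_done()


def run_jobs(func, items, n_workers=2):
    """Apply func to every item using a small pool of threads; return a dict."""
    jobs = queue.Queue()
    results = queue.Queue()
    threads = [
        threading.Thread(target=worker, args=(func, jobs, results))
        for _ in range(n_workers)
    ]
    for thread in threads:
        thread.start()
    for item in items:
        jobs.put(item)
    for _ in threads:
        jobs.put(None)  # one sentinel per worker so each thread stops
    jobs.join()
    for thread in threads:
        thread.join()
    collected = {}
    while not results.empty():
        item, value = results.get()
        collected[item] = value
    return collected


if __name__ == "__main__":
    # Stand-in job: the length of each sequence
    print(run_jobs(len, ["ATGC", "GG", "ATATAT"]))
```

The queue handles the locking, so the workers never share a list or dictionary directly; results are collected only after every thread has finished.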

Openpyxl example

Let's say you have a list of DNA sequences, and you want to calculate the GC content of each sequence (the percentage of bases that are either guanine (G) or cytosine (C)). You'll then store the sequences along with their GC content in an Excel file.

text
ATCGGCTAGCTAGCTAGCTAGCTGACGTAGC
CGTAGCTAGCTAGCTAGCTGACTAGCTAGC
...
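
Before involving Excel, the GC calculation can be tried on its own. The sketch below is one way to write it, assuming that lowercase bases should be counted and that an empty sequence is an error; the function name gc_content is just for this illustration.

```python
def gc_content(sequence):
    """Return GC content as a percentage of the sequence length."""
    sequence = sequence.strip().upper()
    if not sequence:
        raise ValueError("cannot compute GC content of an empty sequence")
    gc_count = sequence.count("G") + sequence.count("C")
    return 100 * gc_count / len(sequence)


# The two example sequences shown above
examples = [
    "ATCGGCTAGCTAGCTAGCTAGCTGACGTAGC",
    "CGTAGCTAGCTAGCTAGCTGACTAGCTAGC",
]
for seq in examples:
    print(f"{seq}\t{gc_content(seq):.1f}")
```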

Here's how you could use openpyxl to process this data and create an Excel file:

  1. Install the openpyxl library if you haven't already:
bash
pip install openpyxl
  2. Create a Python script:
python
from openpyxl import Workbook

# Read the DNA sequences from the file, skipping blank lines
with open('sequences.txt', 'r') as f:
    sequences = [line.strip() for line in f if line.strip()]

# Calculate GC content (as a percentage) for a sequence
def calculate_gc_content(sequence):
    sequence = sequence.upper()  # count lowercase bases too
    gc_count = sequence.count('G') + sequence.count('C')
    return (gc_count / len(sequence)) * 100

# Create a new workbook and select the active worksheet
wb = Workbook()
ws = wb.active

# Add headers to the worksheet
ws['A1'] = 'DNA Sequence'
ws['B1'] = 'GC Content (%)'

# Add sequence data and GC content to the worksheet
for row_num, sequence in enumerate(sequences, start=2):
    ws.cell(row=row_num, column=1, value=sequence)
    gc_content = calculate_gc_content(sequence)
    ws.cell(row=row_num, column=2, value=gc_content)

# Save the workbook
wb.save('gc_content_analysis.xlsx')

In this example:

  • We read the DNA sequences from the sequences.txt file and store them in the sequences list.
  • We define a function calculate_gc_content() to calculate the GC content of a given sequence.
  • We create a new workbook and select the active worksheet.
  • We add headers to the worksheet for the sequence and GC content.
  • We iterate through the sequences, calculate their GC content using the defined function, and populate the worksheet.
  • Finally, we save the workbook with the calculated GC content in an Excel file named gc_content_analysis.xlsx.

Running this script will result in an Excel file with two columns: one for the DNA sequences and another for their corresponding GC content percentages.
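
To check what ended up in a workbook, openpyxl can also read it back. The sketch below builds a small two-column workbook in memory and reloads it; the in-memory buffer and the two made-up rows are assumptions to keep the example self-contained. With a real file you would pass its path, such as 'gc_content_analysis.xlsx', to load_workbook() instead.

```python
from io import BytesIO

from openpyxl import Workbook, load_workbook

# Build a small workbook in memory with the same two-column layout
wb = Workbook()
ws = wb.active
ws.append(["DNA Sequence", "GC Content (%)"])
ws.append(["ATGC", 50.0])   # illustrative rows, not real results
ws.append(["GGCC", 100.0])

buffer = BytesIO()
wb.save(buffer)
buffer.seek(0)

# Read it back, skipping the header row
loaded = load_workbook(buffer)
rows = list(loaded.active.iter_rows(min_row=2, values_only=True))
for sequence, gc in rows:
    print(sequence, gc)
```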

Biopython example

This example converts a FASTA file to GenBank format.

  1. Install Biopython:
bash
pip install biopython
  2. Create a Python script:
python
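from Bio import SeqIO

# Input FASTA file name
fasta_file = "input.fasta"
# Output GenBank file name
genbank_file = "output.gb"

def with_molecule_type(records, molecule_type="DNA"):
    # GenBank output needs a molecule type; "DNA" is an assumption, change as needed
    for record in records:
        record.annotations["molecule_type"] = molecule_type
        yield record

# Read the FASTA file and write to a GenBank file
with open(fasta_file, "r") as fasta_handle, open(genbank_file, "w") as genbank_handle:
    records = SeqIO.parse(fasta_handle, "fasta")
    count = SeqIO.write(with_molecule_type(records), genbank_handle, "genbank")

print(f"Conversion complete: {count} record(s) written.")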

In this example:

  • We import the necessary module from Biopython.
  • We define the input FASTA file name and the desired output GenBank file name.
  • We use the SeqIO.parse() function to read the records from the FASTA file.
  • We use the SeqIO.write() function to write those records to the output file in GenBank format.

To use this script, replace "input.fasta" with the path to your input FASTA file and "output.gb" with the desired path for the output GenBank file. When you run the script, it will read the sequences from the FASTA file and write them to a GenBank file.
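
SeqIO works with any file-like handle, so the same conversion can be tried without touching the disk. The sketch below uses a made-up two-record FASTA string. It also sets a molecule_type annotation on each record, because recent Biopython releases refuse to write GenBank output without one; "DNA" is an assumption here and should be changed for RNA or protein input.

```python
from io import StringIO

from Bio import SeqIO

# Hypothetical two-record FASTA input, held in memory for the example
fasta_text = ">seq1\nATGCGTACGTTAG\n>seq2\nGGCCTTAAGGCC\n"

records = list(SeqIO.parse(StringIO(fasta_text), "fasta"))
for record in records:
    # FASTA does not say what kind of molecule this is; "DNA" is assumed here
    record.annotations["molecule_type"] = "DNA"

genbank_handle = StringIO()
n_written = SeqIO.write(records, genbank_handle, "genbank")
genbank_text = genbank_handle.getvalue()
print(genbank_text)
```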

Questions or comments? @ me on Mastodon @happykhan@mstdn.science or Twitter @happy_khan
