This post is part of a series that summarises a workshop we ran recently. We discussed programming and common pitfalls, and then looked at some fun Python tricks.
Let's say you have a list of DNA sequences, and you want to calculate the GC content of each sequence (the percentage of bases that are either guanine (G) or cytosine (C)). You'll then store the sequences along with their GC content in an Excel file.
ATCGGCTAGCTAGCTAGCTAGCTGACGTAGCCGTAGCTAGCTAGCTAGCTGACTAGCTAGC...
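Before bringing Excel into it, it can help to check the GC content arithmetic on a couple of short sequences. The sketch below is only an illustration: gc_percent is a hypothetical helper name, the two example sequences are made up, and the choices to upper-case the input and return 0.0 for an empty sequence are assumptions rather than part of the workshop code.

```python
def gc_percent(sequence: str) -> float:
    """Return the GC content of a DNA sequence as a percentage.

    Hypothetical helper for illustration: it upper-cases the input and
    returns 0.0 for an empty sequence instead of dividing by zero.
    """
    sequence = sequence.strip().upper()
    if not sequence:
        return 0.0
    gc_count = sequence.count("G") + sequence.count("C")
    return gc_count / len(sequence) * 100


# Two short, made-up sequences to check the arithmetic by hand.
examples = ["ATGC", "ggccat"]
results = {seq: gc_percent(seq) for seq in examples}
for seq, value in results.items():
    print(f"{seq}: {value:.1f}% GC")
```

For ATGC, two of the four bases are G or C, so the result is 50%. For ggccat, four of the six bases are, which gives roughly 66.7%.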
Here's how you could use openpyxl to process this data and create an Excel file.

Install the openpyxl library if you haven't already:

    pip install openpyxl
    from openpyxl import Workbook

    # Read the DNA sequences from the file, skipping blank lines
    with open('sequences.txt', 'r') as f:
        sequences = [line.strip() for line in f if line.strip()]

    # Calculate GC content for each sequence
    def calculate_gc_content(sequence):
        gc_count = sequence.count('G') + sequence.count('C')
        return (gc_count / len(sequence)) * 100

    # Create a new workbook and select the active worksheet
    wb = Workbook()
    ws = wb.active

    # Add headers to the worksheet
    ws['A1'] = 'DNA Sequence'
    ws['B1'] = 'GC Content (%)'

    # Add sequence data and GC content to the worksheet
    for row_num, sequence in enumerate(sequences, start=2):
        ws.cell(row=row_num, column=1, value=sequence)
        gc_content = calculate_gc_content(sequence)
        ws.cell(row=row_num, column=2, value=gc_content)

    # Save the workbook
    wb.save('gc_content_analysis.xlsx')
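In this example:

- We read the DNA sequences from the sequences.txt file and store them in the sequences list.
- We define a function, calculate_gc_content(), to calculate the GC content of a given sequence.
- We create a workbook, add a header row, and write each sequence with its GC content to the worksheet.
- We save the workbook as gc_content_analysis.xlsx.

Running this script will result in an Excel file with two columns: one for the DNA sequences and another for their corresponding GC content percentages.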
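The next example uses Biopython to convert a FASTA file to GenBank format. Install Biopython if you haven't already:

    pip install biopython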
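    from Bio import SeqIO

    # Input FASTA file name
    fasta_file = "input.fasta"
    # Output GenBank file name
    genbank_file = "output.gb"

    # Recent Biopython versions need a molecule_type annotation to write
    # GenBank; FASTA records carry none, so DNA is assumed here.
    def with_molecule_type(records, molecule_type="DNA"):
        for record in records:
            record.annotations["molecule_type"] = molecule_type
            yield record

    # Read the FASTA file and write to a GenBank file
    with open(fasta_file, "r") as fasta_handle, open(genbank_file, "w") as genbank_handle:
        records = SeqIO.parse(fasta_handle, "fasta")
        SeqIO.write(with_molecule_type(records), genbank_handle, "genbank")

    print("Conversion complete.")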
In this example:

- We import SeqIO from Biopython.
- We use the SeqIO.parse() function to read the sequences from the FASTA file.
- We use the SeqIO.write() function to write the parsed sequences to the GenBank file in GenBank format.

To use this script, replace "input.fasta" with the path to your input FASTA file and "output.gb" with the desired path for the output GenBank file. When you run the script, it will read the sequences from the FASTA file and write them to a GenBank file.
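If you are curious what reading a FASTA file involves, here is a rough standard-library sketch of the idea. It is not how Biopython implements SeqIO.parse(), and it skips edge cases that Biopython handles for you. The read_fasta name and the two-record example text are made up for illustration.

```python
import io
from typing import Iterator, TextIO, Tuple


def read_fasta(handle: TextIO) -> Iterator[Tuple[str, str]]:
    """Yield (identifier, sequence) pairs from a FASTA-formatted handle.

    Illustrative only: real FASTA files have edge cases that
    Biopython's SeqIO.parse() handles and this sketch does not.
    """
    identifier = None
    chunks = []
    for line in handle:
        line = line.strip()
        if not line:
            continue  # skip blank lines
        if line.startswith(">"):
            # A new header closes the previous record, if there was one.
            if identifier is not None:
                yield identifier, "".join(chunks)
            # The identifier is the first word after ">".
            parts = line[1:].split(maxsplit=1)
            identifier = parts[0] if parts else ""
            chunks = []
        elif identifier is not None:
            # Sequence lines may be wrapped, so collect them all.
            chunks.append(line)
    if identifier is not None:
        yield identifier, "".join(chunks)


# Made-up two-record FASTA text standing in for "input.fasta".
example = io.StringIO(">seq1 first example\nATGC\nGGTA\n>seq2\nTTAA\n")
records = list(read_fasta(example))
for identifier, sequence in records:
    print(identifier, len(sequence))
```

For real work, stick with SeqIO.parse(). The sketch is only meant to show that each record is a header line starting with ">" followed by one or more lines of sequence.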
Questions or comments? @ me on Mastodon @happykhan@mstdn.science or Twitter @happy_khan
The banner image is an AI-generated picture (Midjourney) with the prompt 'computer programming in the style of The Two Fridas by Frida Kahlo'. You can share and adapt this image following a CC BY-SA 4.0 licence.