Applying LLMs to Biology and Bioinformatics

We are building a command-line bioinformatics assistant that ingests a raw DNA sequence, finds open reading frames with local Python logic, and uses an LLM to interpret the peptides and answer a research question. It is aimed at biologists and bioinformatics students who need quick sequence insights without writing a full analysis pipeline from scratch.

What you'll need

An Oxlo.ai API key from https://portal.oxlo.ai
Python 3.10 or newer
The OpenAI SDK: pip install openai

Step 1: Set up the Oxlo.ai client

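Import the OpenAI SDK and point it at Oxlo.ai. Because Oxlo.ai is fully OpenAI-compatible, this one client is all the script needs. The key is read from the OXLO_API_KEY environment variable, so it never has to be written into the file.

import os

from openai import OpenAI

# Read the key from the environment instead of hard-coding it in the script.
client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key=os.environ["OXLO_API_KEY"])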

Step 2: Build a minimal DNA toolkit

Before calling the model, we need real biological data to ground the response. I will implement transcription, translation, and a simple ORF scanner with the standard codon table. No external bioinformatics libraries are required.

CODON_TABLE = {
    'UUU': 'F', 'UUC': 'F', 'UUA': 'L', 'UUG': 'L',
    'UCU': 'S', 'UCC': 'S', 'UCA': 'S', 'UCG': 'S',
    'UAU': 'Y', 'UAC': 'Y', 'UAA': '*', 'UAG': '*',
    'UGU': 'C', 'UGC': 'C', 'UGA': '*', 'UGG': 'W',
    'CUU': 'L', 'CUC': 'L', 'CUA': 'L', 'CUG': 'L',
    'CCU': 'P', 'CCC': 'P', 'CCA': 'P', 'CCG': 'P',
    'CAU': 'H', 'CAC': 'H', 'CAA': 'Q', 'CAG': 'Q',
    'CGU': 'R', 'CGC': 'R', 'CGA': 'R', 'CGG': 'R',
    'AUU': 'I', 'AUC': 'I', 'AUA': 'I', 'AUG': 'M',
    'ACU': 'T', 'ACC': 'T', 'ACA': 'T', 'ACG': 'T',
    'AAU': 'N', 'AAC': 'N', 'AAA': 'K', 'AAG': 'K',
    'AGU': 'S', 'AGC': 'S', 'AGA': 'R', 'AGG': 'R',
    'GUU': 'V', 'GUC': 'V', 'GUA': 'V', 'GUG': 'V',
    'GCU': 'A', 'GCC': 'A', 'GCA': 'A', 'GCG': 'A',
    'GAU': 'D', 'GAC': 'D', 'GAA': 'E', 'GAG': 'E',
    'GGU': 'G', 'GGC': 'G', 'GGA': 'G', 'GGG': 'G',
}

def transcribe(dna: str) -> str:
    return dna.upper().replace('T', 'U')

def translate(rna: str) -> str:
    protein = []
    for i in range(0, len(rna) - 2, 3):
        codon = rna[i:i + 3]
        protein.append(CODON_TABLE.get(codon, '?'))
    return ''.join(protein)

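def find_orfs(dna: str, min_protein_len: int = 30):
    """Scan the three forward reading frames for ORFs that run from M to a stop.

    Coordinates are 0-based. dna_end is exclusive and includes the stop codon,
    so dna[dna_start:dna_end] is the full ORF.
    """
    rna = transcribe(dna)
    orfs = []
    for frame in range(3):
        seq = rna[frame:]
        protein = translate(seq)
        start = 0
        while start < len(protein):
            if protein[start] == 'M':
                stop = protein.find('*', start)
                if stop == -1:
                    # No stop codon downstream: nothing left to report in this frame.
                    break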
                segment = protein[start:stop]
                if len(segment) >= min_protein_len:
                    dna_start = frame + start * 3
                    dna_end = frame + stop * 3 + 3
                    orfs.append({
                        "frame": frame + 1,
                        "dna_start": dna_start,
                        "dna_end": dna_end,
                        "length_aa": len(segment),
                        "sequence": segment
                    })
                start = stop + 1
            else:
                start += 1
    return orfs
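A hand-typed 64-entry table is easy to get wrong by a single character. As a cross-check, the sketch below rebuilds the standard genetic code from the compact 64-letter amino acid string of NCBI translation table 1, with bases in U, C, A, G order. The names build_codon_table and STANDARD_AAS are my own for illustration and are not part of the script above.

```python
from itertools import product

# Amino acids of the standard genetic code (NCBI translation table 1),
# listed for codons in UCAG order: UUU, UUC, UUA, UUG, UCU, ...
STANDARD_AAS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"


def build_codon_table(bases: str = "UCAG") -> dict[str, str]:
    """Hypothetical helper: derive the codon table instead of typing it."""
    codons = ("".join(triplet) for triplet in product(bases, repeat=3))
    return dict(zip(codons, STANDARD_AAS))


table = build_codon_table()
stop_codons = sorted(codon for codon, aa in table.items() if aa == "*")
print(len(table), stop_codons)  # 64 ['UAA', 'UAG', 'UGA']
```

Inside bio_agent.py, a line such as assert build_codon_table() == CODON_TABLE would confirm that the two tables agree.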

Step 3: Write the system prompt

The system prompt constrains the model to act as a bioinformatics assistant and keeps the output structured and concise.

SYSTEM_PROMPT = """You are a bioinformatics research assistant.
The user will provide a JSON object containing the DNA sequence length, a list of open reading frames (ORFs) detected by a local script, and a specific research question.

Your job is to:
1. Summarize each ORF with its length and frame position.
2. Predict possible protein families or functional domains based on the peptide sequences.
3. Answer the user's research question using the sequence evidence.
4. Suggest one follow-up experiment or database search.

Keep your answer concise. Use bullet points. If you are uncertain, say so."""

Step 4: Create the analysis agent

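This function runs the local ORF finder, packages the results as JSON, and sends them to Oxlo.ai. It lowers min_protein_len to 20 so that shorter ORFs in small fragments are still reported. I use kimi-k2.6 because its reasoning and coding strengths are a good fit for structured input such as this JSON payload.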

import json

def analyze_sequence(dna: str, question: str) -> str:
    orfs = find_orfs(dna, min_protein_len=20)

    context = {
        "dna_length": len(dna),
        "orfs_found": len(orfs),
        "orfs": orfs,
        "user_question": question
    }

    user_message = json.dumps(context, indent=2)

    response = client.chat.completions.create(
        model="kimi-k2.6",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
    )

    # content is optional in the SDK response, so fall back to an empty string.
    return response.choices[0].message.content or ""
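For sequences with many ORFs, the JSON payload can grow quickly. One option, sketched below, is to send only the longest few ORFs and tell the model how many were found in total. The names build_user_message, max_orfs, and orfs_sent are illustrative and not part of the original script, and the default cutoff of five is an arbitrary assumption.

```python
import json


def build_user_message(dna_length: int, orfs: list[dict], question: str,
                       max_orfs: int = 5) -> str:
    """Hypothetical helper: keep only the longest ORFs in the prompt."""
    longest = sorted(orfs, key=lambda orf: orf["length_aa"], reverse=True)[:max_orfs]
    context = {
        "dna_length": dna_length,
        "orfs_found": len(orfs),    # total detected locally
        "orfs_sent": len(longest),  # how many the model actually sees
        "orfs": longest,
        "user_question": question,
    }
    return json.dumps(context, indent=2)


# Toy ORF records shaped like the dictionaries returned by find_orfs.
toy_orfs = [
    {"frame": 1, "dna_start": 0, "dna_end": 75, "length_aa": 24, "sequence": "M" + "A" * 23},
    {"frame": 2, "dna_start": 4, "dna_end": 130, "length_aa": 41, "sequence": "M" + "K" * 40},
    {"frame": 3, "dna_start": 8, "dna_end": 101, "length_aa": 30, "sequence": "M" + "L" * 29},
]
message = build_user_message(300, toy_orfs, "Which ORF is longest?", max_orfs=2)
payload = json.loads(message)
print([orf["length_aa"] for orf in payload["orfs"]])  # [41, 30]
```

Reporting both orfs_found and orfs_sent lets the model say that its summary covers only part of the local results.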

Step 5: Add a CLI entrypoint

Finally, I will add a main block with a sample beta-globin fragment and a research question so we can run the script immediately.

if __name__ == "__main__":
    sample_dna = (
        "ATGGTGCACCTGACTCCTGAGGAGAAGTCTGCCGTTACTGCCCTGTGGGGCAAGGTG"
        "AACGTGGATGAAGTTGGTGGTGAGGCCCTGGGCAGGTTGGTATCAAGGTTACAAGAC"
        "AGGTTTAAGGAGACCAATAGAAACTGGGCATGTGGAGACAGAGAAGACTCTTGGGTT"
        "TCTGATAGGCACTGACTCTCTCTGCCTATTGGTCTATTTTCCCACCCTTAGGCTGCT"
        "GGTGGTCTACCCTTGGACCCAGAGGTTCTTTGAGTCCTTTGGGGATCTGTCCACTCC"
        "TGATGCTGTTATGGGCAACCCTAAGGTGAAGGCTCATGGCAAGAAAGTGCTCGGTGC"
        "CTTTAGTGATGGCCTGGCTCACCTGGACAACCTCAAGGGCACCTTTGCCACACTGAG"
        "TGAGCTGCACTGTGACAAGCTGCACGTGGATCCTGAGAACTTCAGGCTCCTGGGCAA"
        "TATGATCGTGAT"
    )

    question = (
        "Does this sequence contain a known globin family motif, "
        "and what is the likely function of the longest ORF?"
    )

    print(analyze_sequence(sample_dna, question))
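The main block above hard-codes the sequence. To take input from a file instead, something along these lines could work. It is a sketch that assumes a single-record FASTA or plain-text file. The helper read_sequence, the positional path argument, and the --question flag are hypothetical choices, not part of the original script.

```python
import argparse


def read_sequence(text: str) -> str:
    """Hypothetical helper: return the first record's bases, uppercased.

    Header lines starting with '>' are skipped, and reading stops at a
    second header so that only the first record is used.
    """
    bases = []
    seen_header = False
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            if seen_header or bases:
                break  # start of a second record
            seen_header = True
            continue
        bases.append(line.upper())
    sequence = "".join(bases)
    invalid = set(sequence) - set("ACGTN")
    if invalid:
        raise ValueError(f"Unexpected characters in sequence: {sorted(invalid)}")
    return sequence


def build_parser() -> argparse.ArgumentParser:
    parser = argparse.ArgumentParser(description="Ask a question about a DNA sequence.")
    parser.add_argument("path", help="FASTA or plain-text file with one DNA sequence")
    parser.add_argument("--question", required=True, help="research question for the model")
    return parser


# Demonstration with in-memory input rather than a real file.
args = build_parser().parse_args(["example.fa", "--question", "Any globin motif?"])
demo = read_sequence(">demo record\natggtg\nCACCTG\n>second\nTTTT\n")
print(args.path, demo)  # example.fa ATGGTGCACCTG
```

In the main block, the sequence would then come from reading the file at args.path through read_sequence, and it would be passed to analyze_sequence together with args.question.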

Run it

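Save the script as bio_agent.py, export your API key, and run it. The response shown after the commands illustrates the format only: the model's wording changes from run to run, and the ORF counts and lengths in a real run come from find_orfs, so they may differ from the figures here.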

$ export OXLO_API_KEY="sk-..."
$ python bio_agent.py

- ORF 1 (frame 1, 147 aa): Matches a classic globin fold. The peptide contains conserved histidine residues consistent with heme coordination.
- ORF 2 (frame 1, 44 aa): Short fragment with no clear functional domain match.

Answer: The longest ORF is consistent with beta-globin and likely functions in oxygen transport.

Follow-up: Run BLASTp against the NCBI nr database to confirm the exact subfamily and check for pathogenic variants.

Next steps

Two concrete ways to extend this assistant:

Integrate the NCBI E-utilities API to fetch real taxonomy and literature context before sending data to the model.
Add reverse-complement ORF scanning and benchmark the pipeline on long genomic contigs. Because Oxlo.ai uses flat per-request pricing, cost stays predictable even when you feed in long sequences that would inflate token bills elsewhere. See https://oxlo.ai/pricing for plan details.
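For the reverse-strand idea, one possible starting point is sketched below. It assumes ORFs are located on the reverse complement using 0-based, end-exclusive coordinates, as find_orfs reports them, and are then mapped back to the forward strand. The names reverse_complement and to_forward_coords are hypothetical.

```python
# Map each base to its complement; N stays N.
_COMPLEMENT = str.maketrans("ACGTN", "TGCAN")


def reverse_complement(dna: str) -> str:
    """Return the reverse complement of a DNA string (uppercased first)."""
    return dna.upper().translate(_COMPLEMENT)[::-1]


def to_forward_coords(start: int, end: int, seq_len: int) -> tuple[int, int]:
    """Convert a 0-based, end-exclusive span on the reverse complement
    into the equivalent span on the forward strand."""
    return seq_len - end, seq_len - start


forward = "AAATTACATGGG"
minus = reverse_complement(forward)  # CCCATGTAATTT
orf_start = minus.find("ATG")        # start codon position on the minus strand
orf_end = orf_start + 6              # ATG + TAA, stop codon included
fwd_start, fwd_end = to_forward_coords(orf_start, orf_end, len(forward))
print(minus, (fwd_start, fwd_end), forward[fwd_start:fwd_end])
# CCCATGTAATTT (3, 9) TTACAT
```

Running find_orfs on reverse_complement(dna) and converting each hit with to_forward_coords would report minus-strand ORFs in forward-strand coordinates, which keeps them comparable with the existing results.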