Applying LLMs in Biology


In this tutorial I build a literature screening agent that ingests biology paper abstracts and returns structured summaries with extracted genes, proteins, diseases, and suggested hypotheses. This saves molecular biologists and bioinformaticians hours of manual review when surveying new research. Because Oxlo.ai charges a flat rate per request rather than per token, feeding it a 3,000-token abstract costs the same as a one-sentence query, which makes high-volume literature screening economically practical.

What you'll need

- Python 3.10 or newer installed locally.
- The OpenAI SDK, which you can install with pip install openai.
- An Oxlo.ai API key from https://portal.oxlo.ai.

Step 1: Connect to Oxlo.ai

I initialize the client against Oxlo.ai's OpenAI-compatible endpoint. I picked llama-3.3-70b because it handles dense biomedical prose well, and Oxlo.ai serves it with no cold starts so the first request is as fast as the tenth.

from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

response = client.chat.completions.create(
    model="llama-3.3-70b",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Say hello"},
    ],
)
print(response.choices[0].message.content)
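
Hard-coding the key is fine for a first test, but a script that gets shared or committed is safer reading it from the environment. Here is a minimal sketch, assuming an environment variable named OXLO_API_KEY. Both that name and the load_api_key helper are chosen for illustration and are not something Oxlo.ai documents.

```python
import os


def load_api_key(env=os.environ, var_name="OXLO_API_KEY"):
    """Return the API key stored in `var_name`, or raise a clear error.

    OXLO_API_KEY is a hypothetical variable name used for this sketch.
    """
    key = env.get(var_name, "").strip()
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable first.")
    return key


# Usage with the client from Step 1 (left commented so the snippet has no side effects):
# client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key=load_api_key())
```

Passing the environment as a parameter keeps the helper easy to test with a plain dictionary.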

Step 2: Define the system prompt

The system prompt is the agent's instruction manual. I keep it in a top-level constant so I can tune it without touching the rest of the code.

SYSTEM_PROMPT = """You are a biomedical literature analyst.
Read the abstract provided by the user and produce a structured analysis with exactly these sections:
1. Summary: one paragraph explaining the core finding in plain language.
2. Key Entities: list any genes, proteins, cell lines, diseases, or drugs mentioned.
3. Methods: briefly note the experimental approach.
4. Hypothesis Suggestion: propose one follow-up experiment based on the findings.
Be concise. Use terminology appropriate for a molecular biology PhD student."""

Step 3: Analyze a single abstract

Now I wrap the API call in a function. Passing the full abstract in the user message lets the model see every detail before it summarizes.

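def analyze_abstract(abstract: str) -> str:
    """Return the model's four-section analysis of one abstract."""
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": abstract},
        ],
    )
    # message.content can be None, so fall back to an empty string.
    return response.choices[0].message.content or ""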

sample = (
    "p53 is a tumor suppressor that responds to DNA damage. "
    "Here we show that phosphorylation at Ser46 by HIPK2 selectively induces expression of p53AIP1, "
    "leading to apoptotic cell death in colorectal cancer cell lines. "
    "Knockdown of HIPK2 attenuated the DNA-damage-induced apoptosis. "
    "These results suggest a novel regulatory mechanism for p53-dependent apoptosis."
)

print(analyze_abstract(sample))
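
Because the system prompt asks for four named sections, the free-text report can be split into a dictionary for display or filtering. The sketch below assumes the model starts each section on its own line with the section name and a colon, optionally preceded by a number such as "1."; real replies may deviate (bold markers, different casing), and parse_report is a hypothetical helper, not part of any SDK.

```python
import re

SECTION_NAMES = ["Summary", "Key Entities", "Methods", "Hypothesis Suggestion"]

# Matches a section heading at the start of a line, with an optional "1. " prefix.
_HEADING = re.compile(
    r"^\s*(?:\d+\.\s*)?(" + "|".join(re.escape(n) for n in SECTION_NAMES) + r")\s*:\s*",
    re.MULTILINE,
)


def parse_report(text: str) -> dict[str, str]:
    """Split a report into {section name: body}; missing sections are omitted."""
    matches = list(_HEADING.finditer(text))
    sections = {}
    for i, match in enumerate(matches):
        # A section body runs until the next heading, or to the end of the text.
        end = matches[i + 1].start() if i + 1 < len(matches) else len(text)
        sections[match.group(1)] = text[match.end():end].strip()
    return sections
```

A reply with no recognizable headings yields an empty dictionary, which is a cheap signal that the model ignored the requested layout.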

Step 4: Extract structured entities

For downstream pipelines, I need machine-readable data. Oxlo.ai supports JSON mode, so I can request a JSON object and parse it into a Python dictionary with json.loads.

import json

EXTRACTION_PROMPT = """You are a biomedical NER extractor.
Read the abstract and return a single valid JSON object with these exact keys:
- genes: list of gene symbols mentioned
- proteins: list of protein names or symbols
- diseases: list of diseases or conditions
- cell_lines: list of cell lines used
- hypothesis: one string proposing a follow-up study
If a category is missing, return an empty list for it, or an empty string for hypothesis.
Return only the JSON object, with no markdown formatting."""

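def extract_entities(abstract: str) -> dict:
    """Return the entities found in one abstract as a dictionary."""
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": EXTRACTION_PROMPT},
            {"role": "user", "content": abstract},
        ],
        # JSON mode: ask the endpoint for a single JSON object.
        response_format={"type": "json_object"},
    )
    # json.loads raises json.JSONDecodeError if the reply is not valid JSON.
    return json.loads(response.choices[0].message.content)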
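
Even in JSON mode, a reply can be empty, omit a key, or put a string where a list was requested. The following is a minimal sketch of a defensive parser that coerces whatever comes back into the schema described in EXTRACTION_PROMPT. The parse_entities helper and the LIST_KEYS constant are names introduced here for illustration.

```python
import json

LIST_KEYS = ("genes", "proteins", "diseases", "cell_lines")


def parse_entities(raw) -> dict:
    """Parse the model's reply and coerce it to the extraction schema.

    Invalid JSON, a None reply, or a non-object reply yields the empty schema
    instead of raising.
    """
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        data = {}
    if not isinstance(data, dict):
        data = {}

    result = {}
    for key in LIST_KEYS:
        value = data.get(key, [])
        # Keep only string items; a value that is not a list becomes [].
        result[key] = [v for v in value if isinstance(v, str)] if isinstance(value, list) else []
    hypothesis = data.get("hypothesis", "")
    result["hypothesis"] = hypothesis if isinstance(hypothesis, str) else ""
    return result
```

One way to use it would be to call parse_entities(response.choices[0].message.content) in place of the bare json.loads call, so one malformed reply does not stop a long batch.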

Step 5: Batch screen multiple papers

I loop over a list of abstracts and collect both the human-readable summary and the structured entities, which means each abstract triggers two requests. Because Oxlo.ai uses request-based pricing, a long 4,000-token abstract costs the same as a short one, so I never need to truncate text to stay inside a cost budget. The model's context window is the only length limit that still applies.

def screen_abstracts(abstracts: list[str]) -> list[dict]:
    results = []
    for abstract in abstracts:
        summary = analyze_abstract(abstract)
        entities = extract_entities(abstract)
        results.append({
            "summary": summary,
            "entities": entities,
        })
    return results

papers = [
    (
        "p53 is a tumor suppressor that responds to DNA damage. "
        "Here we show that phosphorylation at Ser46 by HIPK2 selectively induces expression of p53AIP1, "
        "leading to apoptotic cell death in colorectal cancer cell lines. "
        "Knockdown of HIPK2 attenuated the DNA-damage-induced apoptosis. "
        "These results suggest a novel regulatory mechanism for p53-dependent apoptosis."
    ),
    (
        "CRISPR-Cas9 screening identified SLC7A11 as a vulnerability in KRAS-mutant lung adenocarcinoma. "
        "Genetic ablation of SLC7A11 in combination with glutamate dehydrogenase inhibition "
        "suppressed tumor growth in patient-derived xenograft models. "
        "Our findings reveal a metabolic dependency that can be targeted therapeutically."
    ),
]
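
The loop in screen_abstracts stops at the first exception, so one network error or malformed reply would discard every result collected so far. Below is a sketch of a more forgiving variant that records the error per paper and keeps going. screen_tolerant is a hypothetical name, and it takes the two analysis functions as arguments so it can be exercised without an API key.

```python
from typing import Callable


def screen_tolerant(
    abstracts: list[str],
    analyze: Callable[[str], str],
    extract: Callable[[str], dict],
) -> list[dict]:
    """Like screen_abstracts, but record a per-paper error instead of aborting."""
    results = []
    for index, abstract in enumerate(abstracts):
        record = {"index": index, "summary": None, "entities": None, "error": None}
        try:
            record["summary"] = analyze(abstract)
            record["entities"] = extract(abstract)
        except Exception as exc:  # network failures, invalid JSON, rate limits
            record["error"] = f"{type(exc).__name__}: {exc}"
        results.append(record)
    return results


# Usage with the functions defined earlier in this tutorial:
# reports = screen_tolerant(papers, analyze_abstract, extract_entities)
```

Records with a non-empty "error" field can then be retried separately rather than rerunning the whole batch.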

Run it

Execute the batch and print the reports. The first paper should return p53 and HIPK2 entities, while the second should surface SLC7A11 and KRAS.

if __name__ == "__main__":
    reports = screen_abstracts(papers)
    for idx, report in enumerate(reports, 1):
        print(f"--- Paper {idx} ---")
        print(report["summary"])
        print("\nEntities:")
        print(json.dumps(report["entities"], indent=2))
        print()

Example output:

--- Paper 1 ---
Summary: The study identifies a post-translational modification of p53 by HIPK2 at Ser46, which transcriptionally upregulates p53AIP1 to drive apoptosis in colorectal cancer cells. This positions HIPK2 as a critical kinase in the DNA-damage response axis.

Key Entities: p53, HIPK2, p53AIP1, colorectal cancer
Methods: Phosphorylation site mapping, knockdown via RNA interference, apoptosis assays in cancer cell lines
Hypothesis Suggestion: Test whether HIPK2/p53AIP1 signaling is suppressed in microsatellite-stable versus unstable colorectal tumors.

Entities:
{
  "genes": ["p53", "HIPK2", "p53AIP1"],
  "proteins": ["p53", "HIPK2", "p53AIP1"],
  "diseases": ["colorectal cancer"],
  "cell_lines": ["colorectal cancer cell lines"],
  "hypothesis": "Evaluate HIPK2 expression levels across colorectal cancer subtypes to determine predictive value for chemotherapy response."
}

--- Paper 2 ---
Summary: Using CRISPR-Cas9 screening, the authors demonstrate that SLC7A11 is a metabolic vulnerability in KRAS-mutant lung adenocarcinoma. Combining SLC7A11 loss with glutamate dehydrogenase inhibition significantly reduced tumor burden in patient-derived xenografts.

Key Entities: SLC7A11, KRAS, lung adenocarcinoma, glutamate dehydrogenase
Methods: CRISPR-Cas9 dropout screening, genetic ablation, patient-derived xenograft models
Hypothesis Suggestion: Determine whether SLC7A11 expression correlates with glutamate dependency across KRAS-mutant lung cancer patient samples.

Entities:
{
  "genes": ["SLC7A11", "KRAS"],
  "proteins": ["SLC7A11", "glutamate dehydrogenase"],
  "diseases": ["lung adenocarcinoma"],
  "cell_lines": [],
  "hypothesis": "Assess whether dietary glutamine restriction synergizes with SLC7A11 inhibitors in KRAS-mutant lung cancer models."
}
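
Once several papers have been screened, the per-paper JSON can be rolled up to see which entities recur across the batch. This is a small sketch assuming reports shaped like the output of screen_abstracts; count_entities is a hypothetical helper.

```python
from collections import Counter


def count_entities(reports: list[dict], key: str = "genes") -> Counter:
    """Count in how many reports each entity under `key` appears."""
    counts = Counter()
    for report in reports:
        entities = report.get("entities") or {}
        # A set avoids double-counting a symbol repeated within one paper.
        counts.update(set(entities.get(key, [])))
    return counts


# Usage: count_entities(reports, "diseases").most_common(10)
```

The counts compare exact strings, so variants of the same name would need normalizing first if the model spells an entity in more than one way.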

Next steps

Replace the hard-coded list with live calls to the NCBI E-utilities API so you can fetch PubMed abstracts by PMID or search term. You can also push the extracted JSON entities through Oxlo.ai's embedding models, such as bge-large, to build a searchable vector index of genes, diseases, and hypotheses across thousands of papers.
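
As a starting point for the PubMed step, here is a sketch that builds an E-utilities efetch request for a list of PMIDs using only the standard library. It assumes the plain-text abstract format is good enough for screening. The function names are hypothetical, and the returned text also contains citation fields (title, authors, journal) and several records in one reply, so splitting it per paper is left out.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

EFETCH_URL = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils/efetch.fcgi"


def build_efetch_url(pmids: list[str]) -> str:
    """Build an efetch URL that requests plain-text abstracts for the PMIDs."""
    params = {
        "db": "pubmed",
        "id": ",".join(pmids),
        "rettype": "abstract",
        "retmode": "text",
    }
    return f"{EFETCH_URL}?{urlencode(params)}"


def fetch_abstract_text(pmids: list[str], timeout: float = 30.0) -> str:
    """Download the plain-text records; this performs a network request."""
    with urlopen(build_efetch_url(pmids), timeout=timeout) as response:
        return response.read().decode("utf-8")
```

NCBI rate-limits unauthenticated clients, so a larger crawl would likely need pauses between calls or an NCBI API key; check the current E-utilities usage guidelines before running it at volume.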