DEV Community

shashank ms

Introduction to LLMs for Research

In this post I build a research synthesis agent that reads paper abstracts and produces structured literature reviews. This helps researchers map a domain and spot gaps without reading every full text. The agent runs entirely on Oxlo.ai, which charges per request instead of per token, so the cost of a run depends on the number of calls rather than on how long the abstracts or multi-paper batches are.

What you'll need

You need Python 3.10 or newer, the OpenAI SDK, and an Oxlo.ai API key. Grab a free key from https://portal.oxlo.ai. The free tier includes 60 requests per day, which is plenty for prototyping. Install the SDK:

pip install openai
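To get a feel for how far 60 requests go, here is a rough budget sketch. It assumes that each chat completion call counts as exactly one request against the daily limit, and that the agent built in the steps below makes one request for sub-questions, one per abstract, and one for the synthesis. The function names are hypothetical helpers, not part of any SDK.

```python
# Assumption: each chat completion call counts as exactly one request
# against the free-tier limit of 60 requests per day quoted above.
FREE_TIER_DAILY_REQUESTS = 60


def requests_per_run(num_papers):
    """One sub-question request, one analysis per paper, one synthesis."""
    return 1 + num_papers + 1


def max_papers_per_run(daily_limit=FREE_TIER_DAILY_REQUESTS):
    """Largest paper count a single run can handle within the daily limit."""
    return max(daily_limit - 2, 0)


print(requests_per_run(2))
print(max_papers_per_run())
```

Under these assumptions the two-paper example later in this post uses 4 requests, and a single run could cover at most 58 papers per day on the free tier.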

Step 1: Connect to Oxlo.ai

First, I initialize the client pointing at Oxlo.ai's OpenAI-compatible endpoint. I use llama-3.3-70b as the workhorse model because it handles long-context reasoning well. Since pricing is per request, the cost of each call stays flat regardless of input size.

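import os

from openai import OpenAI

# Read the key from the environment and fail early with a clear message,
# instead of sending a placeholder string that only produces an auth error.
api_key = os.getenv("OXLO_API_KEY")
if not api_key:
    raise RuntimeError("Set the OXLO_API_KEY environment variable first.")

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=api_key,
)

MODEL = "llama-3.3-70b"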

Step 2: Define the research persona

I keep the system prompt in its own variable so I can tweak it without touching the request logic. The prompt anchors the model to academic standards and forces structured output.

SYSTEM_PROMPT = """You are a research methodology assistant. Your job is to help researchers analyze academic abstracts and synthesize findings.

Rules:
- Be concise but thorough.
- Always structure output with clear headings.
- When analyzing papers, identify: Research Question, Methodology, Key Findings, Limitations, and Suggested Future Work.
- When comparing multiple papers, highlight agreements, contradictions, and gaps in the literature.
- Use bullet points for readability.
- Do not hallucinate citations. If no DOI or specific reference is provided, do not invent one."""

Step 3: Generate sub-questions from a topic

Before reading papers, I want the agent to break a broad topic into answerable research questions. This gives me a search roadmap.

def generate_research_questions(topic):
    user_prompt = f"""Given the research topic '{topic}', generate 5 specific sub-questions that would help map the current state of the literature.

Format each as:
1. [Sub-question]
   - Why it matters: [one sentence]
   - Disciplines involved: [list]"""

    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.3,
    )
    return response.choices[0].message.content

# Test
topic = "Large language models for automated program repair"
questions = generate_research_questions(topic)
print(questions)
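Because the prompt asks for a fixed numbered format, the reply can be turned into structured data for later steps, such as building search queries. The parser below is a minimal sketch that assumes the model follows the requested format exactly. parse_sub_questions is a hypothetical helper, and real replies may need more tolerant matching.

```python
import re


def parse_sub_questions(text):
    """Parse the numbered format requested in the prompt into a list of dicts."""
    items = []
    current = None
    for raw_line in text.splitlines():
        line = raw_line.strip()
        numbered = re.match(r"^(\d+)\.\s+(.*)$", line)
        if numbered:
            # A new "N. question" line starts a new item.
            current = {"question": numbered.group(2), "why": "", "disciplines": []}
            items.append(current)
        elif current is not None and line.lower().startswith("- why it matters:"):
            current["why"] = line.split(":", 1)[1].strip()
        elif current is not None and line.lower().startswith("- disciplines involved:"):
            raw = line.split(":", 1)[1]
            # Split the comma-separated list and drop trailing periods.
            current["disciplines"] = [d.strip(" .") for d in raw.split(",") if d.strip(" .")]
    return items


# Hand-written sample in the requested format; no request is made here.
sample_reply = """1. How does few-shot prompting compare to fine-tuning for program repair tasks?
   - Why it matters: Understanding the trade-offs guides practical adoption in industry.
   - Disciplines involved: Software engineering, natural language processing."""

parsed = parse_sub_questions(sample_reply)
print(parsed[0]["disciplines"])
```

Lines that do not match the expected pattern are skipped, so a reply with extra headings or commentary still yields whatever items could be recognized.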

Step 4: Analyze a single abstract

Next, I add a function that takes a raw abstract and extracts the structured components. Because Oxlo.ai charges per request, not per token, I can pass the full abstract plus the system prompt without watching the meter run on input length.

def analyze_abstract(title, abstract):
    user_prompt = f"""Title: {title}
Abstract: {abstract}

Analyze this paper using the required structure."""

    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.2,
    )
    return response.choices[0].message.content

# Example abstract
sample_title = "LLM-Patch: Faster Program Repair with Large Language Models"
sample_abstract = "We present LLM-Patch, a novel approach that uses large language models to generate patches for buggy programs. Our method combines fault localization with few-shot prompting to produce candidate repairs. Evaluated on Defects4J, LLM-Patch correctly fixes 47 bugs, outperforming previous neural approaches by 12 percent. Limitations include reliance on test suite quality and high computational cost."

analysis = analyze_abstract(sample_title, sample_abstract)
print(analysis)
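With per-request pricing and a daily request limit, re-analyzing the same abstract on every run is wasted budget. One option is a small on-disk cache keyed by a hash of the title and abstract. This is a sketch under my own assumptions: cached_analysis and the cache layout are hypothetical, and analyze_fn stands in for analyze_abstract so the example runs without an API call.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def cached_analysis(title, abstract, analyze_fn, cache_dir):
    """Return a stored analysis if one exists, otherwise call analyze_fn once and store it."""
    cache_dir = Path(cache_dir)
    cache_dir.mkdir(parents=True, exist_ok=True)
    # The cache key depends only on the title and abstract text.
    key = hashlib.sha256(f"{title}\n{abstract}".encode("utf-8")).hexdigest()
    path = cache_dir / f"{key}.json"
    if path.exists():
        return json.loads(path.read_text(encoding="utf-8"))["analysis"]
    analysis = analyze_fn(title, abstract)
    path.write_text(json.dumps({"title": title, "analysis": analysis}), encoding="utf-8")
    return analysis


# Demo with a stand-in for analyze_abstract so no request is made here.
calls = []


def fake_analyze(title, abstract):
    calls.append(title)
    return f"Analysis of {title}"


with tempfile.TemporaryDirectory() as tmp:
    first = cached_analysis("Paper A", "Some abstract.", fake_analyze, tmp)
    second = cached_analysis("Paper A", "Some abstract.", fake_analyze, tmp)

print(first == second, len(calls))
```

In the real script you would pass analyze_abstract as analyze_fn and a persistent directory as cache_dir. Note that the key ignores the system prompt and model name, so clear the cache after changing either.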

Step 5: Synthesize multiple papers

Single-paper analysis is useful, but the real value comes from synthesis. I feed the agent two or more analyses and ask it to find themes and contradictions.

def synthesize_literature(analyses):
    combined = "\n\n---\n\n".join([f"Paper {i+1}:\n{a}" for i, a in enumerate(analyses)])

    user_prompt = f"""Given the following structured analyses of multiple papers, produce a short synthesis that covers:
1. Common themes across the papers
2. Methodological differences
3. Contradictions or conflicting findings
4. Clear gaps in the literature that future work should address

Papers:
{combined}"""

    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_prompt},
        ],
        temperature=0.3,
        max_tokens=2048,
    )
    return response.choices[0].message.content
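The combined prompt grows with every paper, so a large enough list will eventually exceed the model's context window. A simple mitigation is to group analyses into batches under a size budget, synthesize each batch, and then synthesize the partial results. The sketch below covers only the grouping step. batch_analyses is a hypothetical helper, and the 12,000-character default is an arbitrary placeholder, not a limit documented by Oxlo.ai.

```python
def batch_analyses(analyses, max_chars=12000):
    """Group analyses into batches whose combined length stays within max_chars.

    A single analysis longer than max_chars still gets a batch of its own.
    """
    batches, current, size = [], [], 0
    for text in analyses:
        # Start a new batch when adding this text would exceed the budget.
        if current and size + len(text) > max_chars:
            batches.append(current)
            current, size = [], 0
        current.append(text)
        size += len(text)
    if current:
        batches.append(current)
    return batches


# Three short stand-in analyses with a tiny budget to show the grouping.
demo = batch_analyses(["a" * 5, "b" * 5, "c" * 5], max_chars=10)
print([len(batch) for batch in demo])
```

Each batch would then go through synthesize_literature, followed by one final call over the partial syntheses, so this approach costs one extra request per batch.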

Step 6: Wrap the agent in a simple CLI

I tie everything together in a small script that takes a topic, generates questions, analyzes a hardcoded list of abstracts, and prints the synthesis. In production you would fetch these from arXiv or Semantic Scholar.

def research_agent(topic, papers):
    print(f"=== Research Map: {topic} ===\n")

    questions = generate_research_questions(topic)
    print(questions)
    print("\n" + "=" * 50 + "\n")

    analyses = []
    for title, abstract in papers:
        print(f"Analyzing: {title}")
        analysis = analyze_abstract(title, abstract)
        analyses.append(analysis)
        print(analysis)
        print("\n" + "-" * 30 + "\n")

    print("=== Synthesis ===\n")
    synthesis = synthesize_literature(analyses)
    print(synthesis)
    return synthesis

papers = [
    (
        "LLM-Patch: Faster Program Repair with Large Language Models",
        "We present LLM-Patch, a novel approach that uses large language models to generate patches for buggy programs. Our method combines fault localization with few-shot prompting to produce candidate repairs. Evaluated on Defects4J, LLM-Patch correctly fixes 47 bugs, outperforming previous neural approaches by 12 percent. Limitations include reliance on test suite quality and high computational cost."
    ),
    (
        "Neural Program Repair with Execution-Guided Feedback",
        "This paper introduces ExecRepair, a neural program repair system that leverages execution traces to guide patch generation. Unlike static approaches, ExecRepair uses runtime error signals to rank candidate patches, resulting in a 23 percent improvement in plausible patch rate on the ManyBugs dataset. The main limitation is the need for executable test environments and increased runtime overhead."
    )
]

research_agent("Large language models for automated program repair", papers)
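The script above hardcodes both the topic and the papers. If you want an actual command-line interface, a minimal sketch using argparse could look like this. The --papers flag and the JSON layout (a list of objects with "title" and "abstract" keys) are my own assumptions for illustration.

```python
import argparse
import json


def build_parser():
    """Command-line interface: a positional topic plus a JSON file of papers."""
    parser = argparse.ArgumentParser(description="Research synthesis agent")
    parser.add_argument("topic", help="Research topic to map")
    parser.add_argument(
        "--papers",
        required=True,
        help="Path to a JSON file containing a list of title/abstract objects",
    )
    return parser


def load_papers(path):
    """Read the JSON file and return (title, abstract) tuples for research_agent."""
    with open(path, encoding="utf-8") as f:
        records = json.load(f)
    return [(record["title"], record["abstract"]) for record in records]


# In research_agent.py, this would replace the hardcoded call at the bottom:
#
#     if __name__ == "__main__":
#         args = build_parser().parse_args()
#         research_agent(args.topic, load_papers(args.papers))
```

You would then run it as python research_agent.py "Large language models for automated program repair" --papers papers.json.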

Run it

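Save the code from the steps above in a file named research_agent.py, then run it from your terminal. If you keep the test calls at the end of Steps 3 and 4 in the file, they run as well and use two extra requests.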

export OXLO_API_KEY="your-key-here"
python research_agent.py

With the example papers, the agent prints something like the following:

=== Research Map: Large language models for automated program repair ===

1. What categories of software bugs are most effectively resolved by LLM-based patch generation?
   - Why it matters: Identifying bug categories helps focus research efforts on high-impact areas.
   - Disciplines involved: Software engineering, machine learning.
2. How does few-shot prompting compare to fine-tuning for program repair tasks?
   - Why it matters: Understanding the trade-offs guides practical adoption in industry.
   - Disciplines involved: Software engineering, natural language processing.
...

=== Synthesis ===

Common themes:
- Both papers rely on automated test suites to validate generated patches.
- Neither addresses patches for non-functional requirements such as performance or security.

Methodological differences:
- LLM-Patch uses few-shot prompting with fault localization, while ExecRepair leverages execution traces for ranking.

Contradictions:
- LLM-Patch reports a 12 percent improvement over neural baselines, whereas ExecRepair claims a 23 percent improvement over static approaches. Direct comparison is difficult due to different datasets.

Gaps:
- No study evaluates hybrid approaches that combine few-shot LLM prompting with execution-guided feedback.
- Limited evaluation on real-world industrial codebases outside of Defects4J and ManyBugs.

Next steps

Wire the agent to the arXiv or Semantic Scholar API so it fetches live abstracts instead of hardcoded strings. If you plan to batch-process hundreds of papers, consider Oxlo.ai's Premium plan for priority queueing and higher daily limits. Details are at https://oxlo.ai/pricing.
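As a starting point for the arXiv route, here is a sketch that queries the public arXiv API and turns the Atom feed into the (title, abstract) tuples that research_agent expects. It uses only the standard library. The endpoint and the search_query, start, and max_results parameters follow the arXiv API documentation as I understand it, so check the current docs before relying on them. Both function names are hypothetical.

```python
import urllib.parse
import urllib.request
import xml.etree.ElementTree as ET

# Assumption: endpoint as described in the arXiv API documentation.
ARXIV_API = "http://export.arxiv.org/api/query"
ATOM = {"atom": "http://www.w3.org/2005/Atom"}


def parse_arxiv_feed(xml_data):
    """Extract (title, abstract) tuples from an Atom feed (str or bytes)."""
    root = ET.fromstring(xml_data)
    papers = []
    for entry in root.findall("atom:entry", ATOM):
        title = entry.findtext("atom:title", default="", namespaces=ATOM)
        summary = entry.findtext("atom:summary", default="", namespaces=ATOM)
        # Feed text is often wrapped across lines; collapse runs of whitespace.
        papers.append((" ".join(title.split()), " ".join(summary.split())))
    return papers


def fetch_arxiv_abstracts(query, max_results=5):
    """Fetch matching papers from arXiv. Makes a network call when invoked."""
    params = urllib.parse.urlencode(
        {"search_query": f"all:{query}", "start": 0, "max_results": max_results}
    )
    with urllib.request.urlopen(f"{ARXIV_API}?{params}", timeout=30) as resp:
        return parse_arxiv_feed(resp.read())
```

The result can be passed straight in as the papers argument, for example research_agent(topic, fetch_arxiv_abstracts(topic)). Keeping the parsing separate from the network call makes it easy to test against a saved feed.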

Another solid upgrade is switching the synthesis step to qwen-3-32b or kimi-k2.6 if you need stronger multilingual reasoning or vision support for figures and tables.
