LLM Evaluation Frameworks for Academic Research

#engineering #oxlo #ai

This tutorial builds a lightweight, reproducible evaluation framework that scores open-source LLMs against custom academic benchmarks, giving researchers a concrete way to compare reasoning quality across models. Because Oxlo.ai uses flat per-request pricing, you can run large batch evaluations with long rubrics and detailed judge prompts without the cost growing with token count.

What you'll need

Python 3.10 or newer
The OpenAI SDK: pip install openai
An Oxlo.ai API key from https://portal.oxlo.ai
Your key exported in the environment: export OXLO_API_KEY="your-key"
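
Before making any calls, it can help to fail fast when the key is missing instead of hitting a bare KeyError later. The following is a minimal sketch; require_api_key is a hypothetical helper written for this tutorial, not part of the OpenAI SDK or Oxlo.ai.

```python
import os


def require_api_key(env=None, name="OXLO_API_KEY"):
    """Return the API key from the environment, or raise a clear error.

    Hypothetical helper for this tutorial; `env` defaults to os.environ
    and exists only so the function can be exercised without a real key.
    """
    env = os.environ if env is None else env
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f'{name} is not set. Run: export {name}="your-key"')
    return key
```

Passing api_key=require_api_key() to the client constructor would then surface a readable message when the export step was skipped.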

Step 1: Verify connectivity with a lightweight model

Start by instantiating the client and confirming that calls route through Oxlo.ai correctly. I use DeepSeek V3.2 here because it responds quickly and sits on a free tier, so this smoke test costs nothing.

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.environ["OXLO_API_KEY"]
)

response = client.chat.completions.create(
    model="deepseek-v3.2",
    messages=[{"role": "user", "content": "Say 'Oxlo.ai evaluator ready'"}],
    max_tokens=20
)

print(response.choices[0].message.content)

Step 2: Define a reproducible benchmark dataset

Hardcode a small set of questions with explicit rubrics. Keeping the rubric text in the same file makes the evaluation auditable and easy to version control.

BENCHMARK = [
    {
        "id": "math-reasoning-01",
        "question": "A farmer has 17 sheep and all but 9 die. How many are left?",
        "rubric": "Score 1 if the final answer is explicitly 9. Score 0 if the final answer is any other number.",
        "max_score": 1
    },
    {
        "id": "logic-01",
        "question": "If it takes 5 machines 5 minutes to make 5 widgets, how long does it take 100 machines to make 100 widgets?",
        "rubric": "Score 1 if the answer is 5 minutes. Score 0.5 if the reasoning is partially correct but the final answer is wrong. Score 0 otherwise.",
        "max_score": 1
    },
    {
        "id": "coding-01",
        "question": "Write a Python function that returns the nth Fibonacci number using recursion.",
        "rubric": "Score 1 if the code is valid recursive Python with correct base cases. Score 0.5 if recursive but has minor bugs. Score 0 otherwise.",
        "max_score": 1
    }
]
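
Because the benchmark is hand-edited, a quick structural check before any API call can catch typos such as a missing rubric or a duplicated id. This is a sketch under the assumption that every entry uses the four keys shown above; validate_benchmark and REQUIRED_KEYS are hypothetical names.

```python
REQUIRED_KEYS = {"id", "question", "rubric", "max_score"}


def validate_benchmark(items):
    """Raise ValueError if any entry is malformed; return the entry count."""
    seen_ids = set()
    for index, item in enumerate(items):
        missing = REQUIRED_KEYS - set(item)
        if missing:
            raise ValueError(f"Entry {index} is missing keys: {sorted(missing)}")
        if item["id"] in seen_ids:
            raise ValueError(f"Duplicate benchmark id: {item['id']}")
        if item["max_score"] <= 0:
            raise ValueError(f"{item['id']}: max_score must be positive")
        seen_ids.add(item["id"])
    return len(items)
```

Calling validate_benchmark(BENCHMARK) once at startup keeps a malformed entry from surfacing halfway through a batch run.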

Step 3: Configure the judge model and scoring engine

The judge is itself an LLM call. I use Llama 3.3 70B because it follows structured instructions reliably. The system prompt below enforces JSON-only output so we can parse scores programmatically.

JUDGE_PROMPT = """You are an impartial academic evaluator. Your job is to grade a model's response to a question using the provided rubric.

Output ONLY a JSON object with this exact structure:
{
  "score": <numeric score per the rubric>,
  "justification": "<one-sentence explanation>"
}

Do not include markdown formatting or extra text."""

Now wrap the judge in a function that calls Oxlo.ai and clamps the returned score to the rubric maximum.

import json
from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.environ["OXLO_API_KEY"]
)

def score_response(question, rubric, candidate_answer, max_score):
    evaluation_prompt = f"""Question: {question}

Rubric: {rubric}

Candidate Answer:
{candidate_answer}

Evaluate according to the rubric and return the required JSON."""

    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "system", "content": JUDGE_PROMPT},
            {"role": "user", "content": evaluation_prompt}
        ],
        temperature=0.1,
        max_tokens=300
    )

    raw = response.choices[0].message.content.strip()
    # If the judge wrapped its JSON in a markdown code fence, keep only the fenced body.
    if raw.startswith("```"):
        raw = raw.split("```")[1].removeprefix("json").strip()

    result = json.loads(raw)
    result["score"] = min(result["score"], max_score)
    return result
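
json.loads raises if the judge adds any text around the object, and min alone lets a negative score through. The sketch below is one way to harden that step. parse_judge_output is a hypothetical helper, not the code used for the run in this post, and it assumes the judge's reply contains a single JSON object.

```python
import json


def parse_judge_output(raw, max_score):
    """Extract the judge's JSON object and clamp the score to [0, max_score].

    Hypothetical helper: assumes the reply holds one JSON object, possibly
    surrounded by a code fence or stray text.
    """
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end < start:
        raise ValueError(f"No JSON object found in judge output: {raw!r}")
    result = json.loads(raw[start:end + 1])
    score = float(result["score"])  # accepts 1, 0.5, or a numeric string
    result["score"] = max(0.0, min(score, float(max_score)))
    result.setdefault("justification", "")
    return result
```

Note that this version always returns the score as a float, so a perfect grade would print as 1.0 rather than 1.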

Step 4: Batch evaluate candidate models

With the judge ready, loop over the candidate models you want to compare. I picked Qwen 3 32B, Kimi K2.6, and DeepSeek V3.2 to cover multilingual reasoning, advanced coding, and efficient general reasoning. Each model answers every benchmark question, then the judge grades the answer.

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.environ["OXLO_API_KEY"]
)

CANDIDATE_MODELS = ["qwen-3-32b", "kimi-k2.6", "deepseek-v3.2"]

def run_candidate(model_id, question):
    response = client.chat.completions.create(
        model=model_id,
        messages=[{"role": "user", "content": question}],
        temperature=0.2,
        max_tokens=500
    )
    return response.choices[0].message.content

results = {m: [] for m in CANDIDATE_MODELS}

for model_id in CANDIDATE_MODELS:
    for item in BENCHMARK:
        answer = run_candidate(model_id, item["question"])
        grade = score_response(
            item["question"],
            item["rubric"],
            answer,
            item["max_score"]
        )
        results[model_id].append({
            "id": item["id"],
            "answer": answer,
            "score": grade["score"],
            "justification": grade["justification"]
        })
        print(f"Graded {model_id} on {item['id']}: {grade['score']}/{item['max_score']}")
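
For a reproducible evaluation, it is worth persisting the raw answers and grades rather than relying on console output, since both candidate and judge calls are sampled. A minimal sketch follows; save_results and the file layout are assumptions for illustration, and the judge_model default simply mirrors the model ID used in Step 3.

```python
import json
from datetime import datetime, timezone


def save_results(results, path, judge_model="llama-3.3-70b"):
    """Write graded results plus basic run metadata to a JSON file."""
    payload = {
        "created_at": datetime.now(timezone.utc).isoformat(),
        "judge_model": judge_model,
        "results": results,
    }
    with open(path, "w", encoding="utf-8") as handle:
        json.dump(payload, handle, indent=2, ensure_ascii=False)
    return path
```

A call such as save_results(results, "run.json") after the loop would leave an auditable record that can be committed alongside the benchmark.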

Step 5: Aggregate results and emit a comparison report

Finally, tally scores and print a clean table. This gives you an at-a-glance view of which model handles your specific academic tasks best.

def summarize(results):
    print(f"{'Model':<20} {'Total':<8} {'Mean':<6}")
    print("-" * 36)
    for model_id, grades in results.items():
        total = sum(g["score"] for g in grades)
        mean = total / len(grades) if grades else 0
        print(f"{model_id:<20} {total:<8.1f} {mean:<6.2f}")

summarize(results)

for model_id, grades in results.items():
    print(f"\n--- {model_id} ---")
    for g in grades:
        print(f"{g['id']}: {g['score']} | {g['justification']}")
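
The mean above is a raw average, which is fine while every question has max_score 1. If you later mix questions worth different amounts, a normalized figure may be easier to compare. The sketch below shows one way to compute it; normalized_scores is a hypothetical helper that assumes each grade's id appears in the benchmark.

```python
def normalized_scores(results, benchmark):
    """Return {model_id: fraction of available points earned}."""
    max_by_id = {item["id"]: item["max_score"] for item in benchmark}
    summary = {}
    for model_id, grades in results.items():
        earned = sum(g["score"] for g in grades)
        available = sum(max_by_id[g["id"]] for g in grades)
        # Guard against models with no graded answers.
        summary[model_id] = earned / available if available else 0.0
    return summary
```

For example, a model that earns 1 of 1 points on one question and 2 of 4 on another would get 3 / 5 = 0.6.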

Run it

Save the complete script as eval_framework.py, export your key, and execute it. Below is the condensed output from a real run.

$ export OXLO_API_KEY="sk-oxlo.ai-..."
$ python eval_framework.py

Oxlo.ai evaluator ready

Graded qwen-3-32b on math-reasoning-01: 1/1
Graded qwen-3-32b on logic-01: 1/1
Graded qwen-3-32b on coding-01: 1/1
Graded kimi-k2.6 on math-reasoning-01: 1/1
Graded kimi-k2.6 on logic-01: 0.5/1
Graded kimi-k2.6 on coding-01: 1/1
Graded deepseek-v3.2 on math-reasoning-01: 1/1
Graded deepseek-v3.2 on logic-01: 1/1
Graded deepseek-v3.2 on coding-01: 0.5/1

Model                Total    Mean  
------------------------------------
qwen-3-32b           3.0      1.00  
kimi-k2.6            2.5      0.83  
deepseek-v3.2        2.5      0.83  

--- qwen-3-32b ---
math-reasoning-01: 1 | The answer correctly states 9 sheep remain.
logic-01: 1 | The answer correctly identifies 5 minutes.
coding-01: 1 | The function is valid recursive Python with correct base cases.

--- kimi-k2.6 ---
math-reasoning-01: 1 | The answer explicitly states 9.
logic-01: 0.5 | The reasoning discusses rates but concludes 100 minutes.
coding-01: 1 | The recursive Python implementation is correct.

--- deepseek-v3.2 ---
math-reasoning-01: 1 | The answer is exactly 9.
logic-01: 1 | The answer correctly states 5 minutes.
coding-01: 0.5 | The function is recursive but omits the base case for n=0.

Wrap-up and next steps

Expand the benchmark by adding domain-specific questions from your field, then swap in larger judge models like DeepSeek R1 671B MoE for more nuanced grading. If you scale to hundreds of questions, Oxlo.ai's request-based pricing keeps the cost flat per call regardless of how verbose your rubrics become. See https://oxlo.ai/pricing for plan details.
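
Under per-request pricing, the number to budget for is the call count. In the loop from Step 4, each answer costs one candidate call plus one judge call. The helper below is a hypothetical sketch of that arithmetic; it deliberately leaves out prices, which depend on your plan.

```python
def count_requests(num_questions, num_models, judge_calls_per_answer=1):
    """Return total API calls: one candidate call plus judge calls per answer."""
    if min(num_questions, num_models, judge_calls_per_answer) < 0:
        raise ValueError("counts must be non-negative")
    answers = num_questions * num_models
    return answers * (1 + judge_calls_per_answer)
```

For the three-question, three-model run in this post, count_requests(3, 3) gives 18 calls, not counting the smoke test. Scaling to 200 questions with the same three models gives 1200.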