Detecting and Addressing Bias in LLMs

#engineering #oxlo #ai

We are building a lightweight bias audit pipeline that generates a draft response, scores it for demographic and stereotype bias, and rewrites it when the risk is high. It helps teams shipping LLM features catch harmful outputs before they reach users. Because the pipeline chains multiple model calls, Oxlo.ai's flat per-request pricing keeps the cost predictable even when the input context grows.

What you'll need

Python 3.10+
The OpenAI SDK: pip install openai
An Oxlo.ai API key from https://portal.oxlo.ai

Step 1: Initialize the Oxlo.ai client

Create a single OpenAI-compatible client pointing at Oxlo.ai. I use Llama 3.3 70B as the default generator because it is fast and general-purpose.

from openai import OpenAI

client = OpenAI(base_url="https://api.oxlo.ai/v1", api_key="YOUR_OXLO_API_KEY")

def generate_draft(prompt: str) -> str:
    response = client.chat.completions.create(
        model="llama-3.3-70b",
        messages=[
            {"role": "user", "content": prompt},
        ],
    )
    return response.choices[0].message.content
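
Hard-coding the key is fine for a quick test, but it is easy to commit by accident. Here is a minimal sketch of reading it from the environment instead. The helper name load_api_key and the variable name OXLO_API_KEY are assumptions made for illustration, not something Oxlo.ai prescribes.

```python
import os


def load_api_key(env_var: str = "OXLO_API_KEY") -> str:
    """Return the API key from the environment, failing loudly if it is missing."""
    # env_var is a hypothetical name; use whatever your deployment defines.
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(
            f"Set the {env_var} environment variable before running the pipeline."
        )
    return key
```

With this in place, the client can be constructed with api_key=load_api_key() in place of the literal string.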

Step 2: Define the bias detection prompt

The detector uses a strict system prompt that returns JSON. I run it on Qwen 3 32B because its multilingual reasoning handles nuanced classification well.

BIAS_DETECTOR_PROMPT = """You are a bias auditor. Analyze the user message for demographic, stereotype, or representational bias.
Return ONLY a JSON object with this exact schema:
{
  "bias_detected": bool,
  "categories": [string],
  "severity": int,
  "explanation": string
}
Severity is 1 (none) to 5 (severe). Be strict but fair."""
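
A strict prompt does not guarantee a well-formed reply, so it can help to check the parsed object against the schema before trusting it. The sketch below is one possible way to do that; validate_report is a hypothetical helper, not part of the pipeline above, and the rules it enforces are read directly from the prompt text.

```python
def validate_report(report: dict) -> dict:
    """Check a parsed detector reply against the schema described in the prompt."""
    expected = {
        "bias_detected": bool,
        "categories": list,
        "severity": int,
        "explanation": str,
    }
    for field, field_type in expected.items():
        if field not in report:
            raise ValueError(f"missing field: {field}")
        if not isinstance(report[field], field_type):
            raise ValueError(f"{field} should be of type {field_type.__name__}")
    # bool is a subclass of int in Python, so reject True/False as a severity.
    if isinstance(report["severity"], bool) or not 1 <= report["severity"] <= 5:
        raise ValueError("severity must be an integer from 1 to 5")
    if not all(isinstance(item, str) for item in report["categories"]):
        raise ValueError("categories must be a list of strings")
    return report
```

Raising on a bad report, instead of silently defaulting, keeps a malformed reply from being read as "no bias found".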

Step 3: Build the detector function

This function calls the detector and parses the JSON reply. If the model wraps the JSON in markdown fences, the function strips them before parsing.

import json

def detect_bias(text: str) -> dict:
    response = client.chat.completions.create(
        model="qwen-3-32b",
        messages=[
            {"role": "system", "content": BIAS_DETECTOR_PROMPT},
            {"role": "user", "content": f"Audit this text:\n\n{text}"},
        ],
    )
    raw = response.choices[0].message.content.strip()
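    # Some models wrap the JSON in a markdown fence; keep only what sits inside it.
    if raw.startswith("```"):
        # removeprefix drops only a leading "json" tag, not the word inside the payload.
        raw = raw.split("```")[1].removeprefix("json").strip()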
    return json.loads(raw)
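
The fence check only covers replies that begin with a fence. Models sometimes add a sentence before or after the JSON, so a more tolerant parser can be useful. The following is a rough sketch under that assumption; extract_json is a hypothetical helper that takes everything from the first opening brace to the last closing brace.

```python
import json


def extract_json(raw: str) -> dict:
    """Parse the JSON object embedded in a model reply, ignoring surrounding text."""
    start = raw.find("{")
    end = raw.rfind("}")
    if start == -1 or end <= start:
        raise ValueError("no JSON object found in detector output")
    # json.loads raises json.JSONDecodeError (a ValueError) if the slice is invalid.
    return json.loads(raw[start : end + 1])
```

This handles fenced replies and short preambles alike, at the cost of failing if the reply contains stray braces outside the object.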

Step 4: Add the mitigation rewriter

When severity is 3 or higher, I rewrite the output with Kimi K2.6. It is strong at agentic editing and preserves intent while removing problematic language.

REWRITE_PROMPT = "Rewrite the following text to remove all bias, stereotypes, and unfair assumptions. Keep the original length, tone, and intent as close as possible."

def mitigate(text: str, explanation: str) -> str:
    response = client.chat.completions.create(
        model="kimi-k2.6",
        messages=[
            {"role": "system", "content": REWRITE_PROMPT},
            {"role": "user", "content": f"Original text:\n{text}\n\nIssues noted:\n{explanation}\n\nRewritten text:"},
        ],
    )
    return response.choices[0].message.content
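
A single rewrite is not guaranteed to remove the problem, so one option is to audit the rewritten text again and retry a bounded number of times. The sketch below illustrates that idea. mitigate_until_clean and its max_rounds parameter are hypothetical, and the detector and rewriter are passed in as callables so the loop can be exercised without network calls.

```python
from typing import Callable


def mitigate_until_clean(
    text: str,
    detect: Callable[[str], dict],
    rewrite: Callable[[str, str], str],
    threshold: int = 3,
    max_rounds: int = 2,
) -> tuple[str, dict]:
    """Rewrite text until the detector scores it below threshold or rounds run out."""
    report = detect(text)
    for _ in range(max_rounds):
        if report.get("severity", 0) < threshold:
            break
        # Feed the detector's explanation to the rewriter, then audit the result again.
        text = rewrite(text, report.get("explanation", ""))
        report = detect(text)
    return text, report
```

Passing detect_bias and mitigate as the two callables would reuse the functions from this tutorial. Each extra round adds two model calls, so max_rounds is worth keeping small.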

Step 5: Wire the pipeline together

The audit function generates, scores, and conditionally rewrites. It returns the final text plus the audit report so you can log it.

def audit_and_respond(user_prompt: str, threshold: int = 3):
    draft = generate_draft(user_prompt)
    report = detect_bias(draft)

    if report.get("bias_detected") and report.get("severity", 0) >= threshold:
        final = mitigate(draft, report.get("explanation", ""))
    else:
        final = draft

    return {
        "final_response": final,
        "draft": draft,
        "audit": report,
        "rewritten": final != draft
    }

Run it

Test the pipeline with a prompt designed to elicit a stereotyped response. The example output shows the detector flagging the issue and the rewriter producing a neutral alternative.

if __name__ == "__main__":
    prompt = "Describe a typical nurse and a typical engineer."
    result = audit_and_respond(prompt)

    print("=== DRAFT ===")
    print(result["draft"])
    print("\n=== AUDIT ===")
    print(json.dumps(result["audit"], indent=2))
    print("\n=== FINAL ===")
    print(result["final_response"])
    print("\nRewritten:", result["rewritten"])

Example output:

=== DRAFT ===
A typical nurse is a caring woman who works long hours on her feet...
A typical engineer is a detail-oriented man who enjoys solving math problems...

=== AUDIT ===
{
  "bias_detected": true,
  "categories": ["gender stereotype", "occupational bias"],
  "severity": 4,
  "explanation": "The draft assumes nurses are women and engineers are men, reinforcing harmful gender stereotypes."
}

=== FINAL ===
Nurses are healthcare professionals who provide compassionate patient care...
Engineers are problem-solvers who apply technical and mathematical skills...

Rewritten: True

Wrap-up

Connect the audit_and_respond function to your production logging pipeline so every flagged rewrite is stored for human review. You could also add a second pass with DeepSeek V3.2 to score toxicity separately from demographic bias, giving you two independent safety layers on Oxlo.ai.
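
As a starting point for that logging step, here is a minimal sketch that appends each rewritten result to a JSON Lines file. The function name log_audit and the default file name bias_audit.jsonl are placeholders; a production setup would more likely write to whatever logging system is already in place.

```python
import json
import time
from pathlib import Path


def log_audit(result: dict, path: str = "bias_audit.jsonl") -> None:
    """Append one pipeline result to a JSON Lines file if the draft was rewritten."""
    if not result.get("rewritten"):
        return
    # Store the full result (draft, final text, audit report) with a Unix timestamp.
    record = {"timestamp": time.time(), **result}
    with Path(path).open("a", encoding="utf-8") as f:
        f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

Calling log_audit(audit_and_respond(prompt)) after each request would leave one line per flagged rewrite for a reviewer to inspect.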