DEV Community

shashank ms
Unlocking LLMs for Social Science Research

In this post I build a qualitative coding assistant that reads interview transcripts and returns structured thematic codes with supporting quotes. Social scientists often drown in raw text, and a tool like this speeds up the move from data to analysis. I run it on Oxlo.ai because flat per-request pricing keeps the cost predictable even when I feed it long transcripts in bulk.

What you'll need

An Oxlo.ai API key from https://portal.oxlo.ai, Python 3.10 or newer, and the OpenAI SDK installed with pip install openai.

Step 1: Configure the Oxlo.ai client

I start by initializing the OpenAI SDK to point at Oxlo.ai. I pick llama-3.3-70b as the default model because it handles long context windows and nuanced instruction following well.

from openai import OpenAI
import os

client = OpenAI(
    base_url="https://api.oxlo.ai/v1",
    api_key=os.environ.get("OXLO_API_KEY", "YOUR_OXLO_API_KEY")
)

MODEL = "llama-3.3-70b"

# Quick connectivity check
response = client.chat.completions.create(
    model=MODEL,
    messages=[{"role": "user", "content": "Say 'client ready'"}],
)
print(response.choices[0].message.content)

Step 2: Define the research persona

The system prompt is the most important part. It casts the model as a grounded-theory coding assistant that returns strict JSON with codes, definitions, and exemplar quotes.

SYSTEM_PROMPT = """You are a qualitative research methods assistant specializing in thematic analysis.
Your task is to read an interview excerpt and produce a JSON object with this exact structure:
{
  "codes": [
    {
      "code_name": "short label",
      "definition": "1 sentence describing what this code captures",
      "quote": "verbatim supporting quote from the text"
    }
  ]
}
Rules:
- Generate 3 to 5 codes per excerpt.
- Use the participant's own words in the quote field.
- Stay close to the data. Do not infer concepts not supported by the text.
- Output only the JSON object, with no markdown fences or extra commentary.
"""

Step 3: Prepare sample transcripts

I use a small batch of simulated interview snippets about remote work experiences. In production, these would be loaded from CSV or DOCX files.

TRANSCRIPTS = [
    {
        "participant_id": "P01",
        "text": "I moved to a smaller town because I no longer needed to commute. At first I loved the flexibility, but now I feel like my home is just an office. The boundary disappeared and I am burning out."
    },
    {
        "participant_id": "P02",
        "text": "My team is spread across three time zones. We rely heavily on async updates in Slack. I miss whiteboarding sessions, but the documentation discipline has actually made our decisions clearer."
    },
    {
        "participant_id": "P03",
        "text": "I have saved roughly ten hours a week not commuting. I reinvested that time into exercise and family meals. My stress levels dropped noticeably within the first month."
    }
]
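For the CSV case, a minimal loader sketch follows. It assumes a file with a header row and two columns named participant_id and text; that layout and the function name load_transcripts are my assumptions for illustration, so adjust them to match your own export.

```python
import csv

def load_transcripts(csv_path):
    """Read transcripts from a CSV with 'participant_id' and 'text' columns (assumed layout)."""
    transcripts = []
    with open(csv_path, newline="", encoding="utf-8") as handle:
        for row in csv.DictReader(handle):
            text = (row.get("text") or "").strip()
            if not text:
                continue  # skip rows with no transcript text
            transcripts.append({
                "participant_id": (row.get("participant_id") or "").strip(),
                "text": text,
            })
    return transcripts
```

The result has the same shape as the TRANSCRIPTS list above, so the rest of the script can use it unchanged.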

Step 4: Extract open codes

This function sends each transcript to Oxlo.ai with the system prompt, then parses the JSON response. I use Python's built-in json module to handle the model output.

import json

def extract_codes(participant_id, text):
    user_message = f"Participant: {participant_id}\nExcerpt:\n{text}"
    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": SYSTEM_PROMPT},
            {"role": "user", "content": user_message},
        ],
        temperature=0.2,
    )
    raw = response.choices[0].message.content.strip()

    # Handle occasional stray markdown
    if raw.startswith("```"):
        raw = raw.split("```")[1]
        if raw.startswith("json"):
            raw = raw[4:]

    return json.loads(raw.strip())

# Test on the first transcript
sample = extract_codes(TRANSCRIPTS[0]["participant_id"], TRANSCRIPTS[0]["text"])
print(json.dumps(sample, indent=2))
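The prompt asks for verbatim quotes, but nothing in the script checks that the model complied. Here is a small sketch of such a check. The helper name find_unsupported_quotes and the whitespace and case normalization are my own choices, not part of the original pipeline.

```python
import re

def _normalize(text):
    """Collapse whitespace and lowercase so trivial formatting differences are ignored."""
    return re.sub(r"\s+", " ", text).strip().lower()

def find_unsupported_quotes(codes, transcript_text):
    """Return the code entries whose quote is empty or absent from the transcript."""
    haystack = _normalize(transcript_text)
    unsupported = []
    for code in codes:
        quote = _normalize(code.get("quote", ""))
        if not quote or quote not in haystack:
            unsupported.append(code)
    return unsupported

# Usage with one real quote and one made-up quote (the second is illustrative only)
text = "At first I loved the flexibility, but now I feel like my home is just an office."
codes = [
    {"code_name": "blurred boundaries", "quote": "my home is just an office"},
    {"code_name": "isolation", "quote": "I feel lonely at home"},
]
print([c["code_name"] for c in find_unsupported_quotes(codes, text)])  # prints ['isolation']
```

Any flagged code deserves a manual look before it enters the analysis, since a paraphrased quote undermines the audit trail that qualitative work depends on.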

Step 5: Synthesize axial themes

After coding each case individually, I feed the full set of codes back into the model to generate higher-level themes and a short analytical memo. This mirrors the axial coding phase in grounded theory.

import json

def synthesize_memo(all_codes):
    # Serialize the codes to a JSON string for the prompt
    payload = json.dumps(all_codes, indent=2)

    synthesis_prompt = """You are a senior social science researcher.
Given the following open codes from multiple participants, perform axial coding.
Identify 2 to 4 broader themes, explain how the individual codes relate, and note any patterns or tensions.
Return your response as JSON with this structure:
{
  "themes": [
    {
      "theme_name": "name",
      "description": "paragraph explaining the theme",
      "related_codes": ["code_name_1", "code_name_2"]
    }
  ],
  "memo": "1 paragraph analytical summary"
}
Output only the JSON object.
"""

    response = client.chat.completions.create(
        model=MODEL,
        messages=[
            {"role": "system", "content": synthesis_prompt},
            {"role": "user", "content": payload},
        ],
        temperature=0.3,
    )
    raw = response.choices[0].message.content.strip()
    if raw.startswith("```"):
        raw = raw.split("```")[1]
        if raw.startswith("json"):
            raw = raw[4:]

    return json.loads(raw.strip())
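Both extract_codes and synthesize_memo repeat the same fence-stripping logic. One way to factor it out is sketched below. The helper name parse_json_response is hypothetical, and it keeps the same strategy as the original code (take whatever sits between the first pair of fences), adding only a clearer error when parsing fails.

```python
import json

FENCE = "`" * 3  # the markdown code fence marker

def parse_json_response(raw):
    """Parse model output as JSON, tolerating a surrounding markdown code fence."""
    text = raw.strip()
    if text.startswith(FENCE):
        # Keep only what sits between the first pair of fences
        text = text.split(FENCE)[1].lstrip()
        # Drop an optional language tag such as "json"
        if text.lower().startswith("json"):
            text = text[4:]
    try:
        return json.loads(text.strip())
    except json.JSONDecodeError as exc:
        preview = text.strip()[:80]
        raise ValueError(f"Model did not return valid JSON: {preview!r}") from exc
```

With this in place, each function could end with return parse_json_response(response.choices[0].message.content).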

Run it

The main loop processes every transcript, collects the codes, and then runs synthesis. Because Oxlo.ai charges a flat rate per request, I know the cost of the entire batch upfront without counting tokens: three coding requests plus one synthesis request.

if __name__ == "__main__":
    all_participant_codes = []

    for entry in TRANSCRIPTS:
        print(f"Coding {entry['participant_id']}...")
        result = extract_codes(entry["participant_id"], entry["text"])
        all_participant_codes.append({
            "participant_id": entry["participant_id"],
            "codes": result["codes"]
        })

    print("\n--- Open Codes ---")
    print(json.dumps(all_participant_codes, indent=2))

    print("\n--- Axial Synthesis ---")
    memo = synthesize_memo(all_participant_codes)
    print(json.dumps(memo, indent=2))

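Example output (note that P02 and P03 come back with only two codes each, below the three-code minimum in the prompt, so the model does not always follow that rule):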

Coding P01...
Coding P02...
Coding P03...

--- Open Codes ---
[
  {
    "participant_id": "P01",
    "codes": [
      {
        "code_name": "geographic mobility",
        "definition": "Participant relocated due to remote work flexibility",
        "quote": "I moved to a smaller town because I no longer needed to commute."
      },
      {
        "code_name": "blurred boundaries",
        "definition": "Collapse of separation between home and work spaces",
        "quote": "my home is just an office. The boundary disappeared"
      },
      {
        "code_name": "burnout risk",
        "definition": "Participant experiences exhaustion related to work demands",
        "quote": "I am burning out."
      }
    ]
  },
  {
    "participant_id": "P02",
    "codes": [
      {
        "code_name": "async collaboration",
        "definition": "Team relies on non-synchronous communication tools",
        "quote": "We rely heavily on async updates in Slack."
      },
      {
        "code_name": "documentation discipline",
        "definition": "Written records improve clarity of decisions",
        "quote": "the documentation discipline has actually made our decisions clearer"
      }
    ]
  },
  {
    "participant_id": "P03",
    "codes": [
      {
        "code_name": "time reclamation",
        "definition": "Participant recovers hours previously lost to commuting",
        "quote": "I have saved roughly ten hours a week not commuting."
      },
      {
        "code_name": "stress reduction",
        "definition": "Measurable decrease in psychological strain",
        "quote": "My stress levels dropped noticeably within the first month."
      }
    ]
  }
]

--- Axial Synthesis ---
{
  "themes": [
    {
      "theme_name": "spatial reconfiguration of work",
      "description": "Participants are actively reshaping where they live and work. The removal of commuting constraints enables geographic mobility, yet the same flexibility dissolves physical boundaries that previously protected personal space.",
      "related_codes": ["geographic mobility", "blurred boundaries"]
    },
    {
      "theme_name": "well-being trade-offs",
      "description": "Remote work produces divergent wellness outcomes. Some participants gain time for exercise and family, while others experience burnout due to the inability to disconnect.",
      "related_codes": ["burnout risk", "time reclamation", "stress reduction"]
    }
  ],
  "memo": "Remote work acts as a double-edged sword. While it structurally liberates time and location, the absence of spatial separation between home and office creates new psychological costs. Researchers should attend to how individual coping strategies mediate these opposing effects."
}
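A quick consistency check between the two stages can be useful here. The sketch below compares the related_codes cited by the themes against the open codes. The function name check_theme_coverage is my own, and it assumes both inputs have the shapes shown in the output above.

```python
def check_theme_coverage(open_codes, synthesis):
    """Compare theme references against the open codes.

    Returns (unknown, uncovered): names cited by themes that were never coded,
    and coded names that no theme references.
    """
    known = {c["code_name"] for entry in open_codes for c in entry["codes"]}
    cited = {name for theme in synthesis["themes"] for name in theme["related_codes"]}
    return sorted(cited - known), sorted(known - cited)
```

Run against the example output above, this returns no unknown names but reports "async collaboration" and "documentation discipline" as uncovered, meaning that neither of P02's codes feeds into any theme. That gap may be a defensible analytic choice or a sign the synthesis dropped a participant, so it is worth reviewing either way.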

Wrap-up

From here, you can extend the script to read from a CSV of interview data or add a validation layer that checks the JSON schema before saving to disk. If you are working with large batches, Oxlo.ai's request-based pricing means the cost per request stays flat even as transcripts grow longer, which makes it easy to budget for a larger research pipeline.
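As a starting point for the validation layer, here is a dependency-free sketch that checks an extract_codes result against the structure the system prompt asks for. The function name validate_codes and the default bounds are assumptions taken from the prompt's "3 to 5 codes" rule. A library such as jsonschema would be a reasonable alternative for stricter checks.

```python
REQUIRED_FIELDS = ("code_name", "definition", "quote")

def validate_codes(result, min_codes=3, max_codes=5):
    """Return a list of problems found in the result (an empty list means it is valid)."""
    codes = result.get("codes") if isinstance(result, dict) else None
    if not isinstance(codes, list):
        return ["top-level 'codes' list is missing"]
    problems = []
    if not min_codes <= len(codes) <= max_codes:
        problems.append(f"expected {min_codes}-{max_codes} codes, got {len(codes)}")
    for i, code in enumerate(codes):
        if not isinstance(code, dict):
            problems.append(f"codes[{i}] is not an object")
            continue
        for field in REQUIRED_FIELDS:
            value = code.get(field)
            if not isinstance(value, str) or not value.strip():
                problems.append(f"codes[{i}].{field} is missing or empty")
    return problems
```

Returning a list of problems rather than raising on the first one makes it easier to log everything wrong with a response, then decide whether to retry the request or flag the transcript for manual coding.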
