DEV Community: Sanskriti

ML Model Scored 86%. The Dataset It Learned From Was Biased. GitHub Copilot Helped Me See It.

Sanskriti — Wed, 03 Jun 2026 05:20:02 +0000

This is a submission for the GitHub Finish-Up-A-Thon Challenge

What I Built

In 2018 I trained a charity donor classifier for a Udacity machine learning nanodegree project. The task was to predict whether someone earns more than $50k per year as a proxy for donation likelihood for a fictional charity called CharityML. Gradient Boosting won the model comparison at 86.78% accuracy and an F-score of 0.7469. I submitted it, got my grade, and filed the notebook away.

Coming back in 2026, I did not just fix the code. I audited what the model actually learned. The answer was uncomfortable.

Demo

Live demo: https://sanskriti1991.github.io/machineLearningProjects/finding_donors/

GitHub repo: https://github.com/sanskriti1991/machineLearningProjects/tree/master/finding_donors

The demo lets you input census features and see how the model predicts donation likelihood. It shows a fairness warning for demographic groups with known prediction disparities and charts the prediction rates, false positive rates, and false negative rates across all demographic groups.

The Comeback Story

Before: a notebook that could not run on any modern setup

Opening the notebook in VS Code on Python 3.13 with current sklearn revealed three immediate problems.

The sklearn imports had not kept up with eight years of library changes:

# 2018 — no longer works
from sklearn.cross_validation import train_test_split
from sklearn.grid_search import GridSearchCV

The print statements were Python 2 syntax throughout. And the visualization helper file had an integer division bug where j/3 returns a float in Python 3, breaking array indexing entirely. One character fix changed j/3 to j//3. The notebook had silently needed it for eight years.

After: a running notebook with a fairness audit

Once the code ran, I started looking at the dataset more carefully. The UCI Adult Income dataset extracted from the 1994 US Census had appeared in hundreds of published research papers by 2021, spanning AI fairness, privacy preservation, and model debugging. UC Berkeley researchers published "Retiring Adult" at NeurIPS 2021 calling for it to be retired.

Their finding: the $50k income threshold used as the positive class label was the 76th income percentile overall in 1994, but the 88th percentile for Black Americans and the 89th percentile for women.

The model did not learn who donates. It learned who 1994 America paid well.

The fairness audit made that concrete:

Asian-Pac-Islander males predicted as likely donors: 32%
White males: 26%
Black females: 4%
American Indian females: nearly 0%

86.78% overall accuracy. Completely silent on all of the above.

My Experience with GitHub Copilot

Working with Copilot on this project was not a smooth straight line. It was honestly more like a collaboration that required patience on both sides.

The rate limit reality

I am on the free Copilot tier. Partway through the session, after several back and forth prompts fixing the deprecated imports and print statements, Copilot hit its rate limit and went quiet. I had to wait for it to reset before continuing. That could have been the moment I gave up. It wasn't. I kept the notebook open, documented what had been fixed so far, and came back when the quota reset.

Learning to prompt better

My first prompts were too broad. Asking Copilot to fix the entire notebook at once produced suggestions it could not apply directly to notebook cells in the browser environment. I had to adjust, breaking the task into smaller pieces and being more specific about what I needed. That back and forth was frustrating at first but it forced me to understand the changes rather than just accepting them blindly.

Where Copilot genuinely delivered

Once I found the right prompt style the three moments that mattered most were clear.

First, identifying the deprecated sklearn imports and explaining exactly why each module had moved. Old line, new line, reason. Clear and immediately useful.

Second, catching the integer division bug in visuals.py where j/3 silently breaks in Python 3. I would have spent a long time hunting that one down without Copilot pointing at the exact line.

Third, generating the full fairness audit from a single inline comment. That was the most impressive moment. One descriptive comment and Copilot produced working code that reconstructed demographic groups from one-hot encoded columns, calculated prediction rates and error rates by group, and saved the charts. It then summarized the findings in plain English:

"The model appears to have learned patterns reflecting 1994 wage inequality rather than actual donation likelihood. This suggests that systemic biases in income distribution at the time are influencing the model's predictions."

That is the sentence I should have written in 2018. Now I have.

Why this dataset was called for retirement (UC Berkeley, NeurIPS 2021): https://arxiv.org/abs/2108.04884

We Trusted Auto-Ack. The Queue Agreed. Our Costs Didn't.

Sanskriti — Sat, 30 May 2026 20:21:46 +0000

Most async bugs announce themselves. This one didn't.

No failed jobs. No customer complaints. No error logs. Just infrastructure costs climbing steadily with no obvious cause. It took correlating message IDs across logs to finally see it: the same message being processed two, sometimes three times per delivery.

The culprit was a race condition hiding inside an acknowledgment pattern.

What Happened

A consumer picked up a message and started doing work. That work took time. Before it finished, the queue's retry timeout fired, assumed failure, and redelivered the message to a second consumer. Now two workers were doing identical work concurrently, both completing successfully, both silently doubling the cost.

The system looked healthy by every normal metric. It just wasn't.

The Fix

One configuration change.

Python

# The problem
channel.basic_consume(queue='jobs', on_message_callback=process, auto_ack=True)

# The fix
def process(ch, method, properties, body):
    do_the_work(body)
    ch.basic_ack(delivery_tag=method.delivery_tag)

channel.basic_consume(queue='jobs', on_message_callback=process, auto_ack=False)

Java (Spring AMQP)

// The problem
@RabbitListener(queues = "jobs", ackMode = "AUTO")
public void process(String message) {
    doTheWork(message);
}

// The fix
@RabbitListener(queues = "jobs", ackMode = "MANUAL")
public void process(String message, Channel channel, @Header(AmqpHeaders.DELIVERY_TAG) long tag)
        throws IOException {
    doTheWork(message);
    channel.basicAck(tag, false);
}

Acknowledge after the work completes, not when the message arrives.

The Real Blindspots

This pattern shows up in any async system. Three things that hide it.

Auto-ack tells the queue you are done before you are. With auto-ack enabled, the queue marks the message delivered the moment your consumer receives it. If your worker takes longer than the visibility timeout to finish, the queue sees an unacknowledged message, assumes failure, and redelivers it. A second consumer picks it up and starts the same work. Both complete. Both looked successful. Neither knew about the other.

Manual acknowledgment closes this gap. The queue does not consider the message done until your code explicitly says so, after the work is genuinely finished.

Timeout values set for ideal conditions. When your worker runs slow due to load, cold start, or external API lag, the queue retries before you finish. Even with manual ack, if your visibility timeout is shorter than your worst-case processing time, you will see the same duplicate behavior. Your timeout needs to reflect worst-case latency, not average.

Idempotency masking the problem. If duplicate work produces the same result, nothing breaks visibly. No errors, no data corruption, just silent duplicate calls. The cost climbs and nothing alerts you. This is exactly why the bug survived as long as it did.

The Checklist

Before shipping any async worker:

Manual acknowledgment only. Ack after completion, never on receipt.
Timeout values account for worst-case latency, not average.
Every message has a correlation ID traceable across all consumers.
Worker operations are idempotent and safe to run twice.
You are monitoring work volume, not just queue depth.

The Learning

The queue delivered the message successfully. That is not the same as the work being done once.

I Baked a Football Cake and It Taught Me About Building AI Agents

Sanskriti — Sun, 30 Nov 2025 02:33:06 +0000

I recently baked a football cake and it helped me realize AI agents work just like layered desserts. Here’s how flavors, molds and icing maps to agentic design.

The code example below is designed to break down user goals into actionable steps and execute them using either custom tools or LLM reasoning. It uses regex to extract numbered steps from the LLM’s plan output. Executes each step by matching keywords like "search" or "compute" to the appropriate tool, or falls back to LLM reasoning.

Just like a custom cake has layers of flavor, structure, and decoration, an AI agent has its own stack.

It uses llama 3 LLM model and created two custom simple tools:

search_tool() - simulates a search engine returning mock results.
compute_tool() - simulates a computation task returning a placeholder result.

Implementing the agent architecture:

Base model: The sponge layer. I used llama 3, Ollama LLM model. It handles basic reasoning via the LLM.

# -------------------------------------------------------
# Ollama LLM Wrapper
# -------------------------------------------------------
class OllamaLLM:
    def __init__(self, model="llama3"):
        self.model = model

    def __call__(self, prompt: str) -> str:
        """Send a prompt to a local Ollama instance."""
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": self.model, "prompt": prompt, "stream": False}
        )
        text = json.loads(resp.text).get("response", "")
        return text

# -------------------------------------------------------
# Base Agent 
# -------------------------------------------------------
class AgentCore:
    def __init__(self, llm):
        self.llm = llm

    def reason(self, prompt):
        return self.llm(prompt)

Tool execution: The icing and decor are the search and compute tools.

# -------------------------------------------------------
# Local Tools
# -------------------------------------------------------
def search_tool(query: str) -> dict:
    return {
        "tool": "search",
        "query": query,
        "results": [
            {"title": "Top NFL QBs 2024", "eff": 98.1},
            {"title": "Quarterback Rankings", "eff": 95.6},
        ],
    }


def compute_tool(task: str) -> dict:
    return {
        "tool": "compute",
        "task": task,
        "result": 42,  # we pretend the tool computed something important
    }

Prompt logic: Like flavor choices, it defines behavior. It adds step parsing and tool execution.

# -------------------------------------------------------
# Agent prompt and structured tool execution
# -------------------------------------------------------
class StructuredAgent(AgentCore):

    def parse_steps(self, plan: str):
        """Extract step lines starting with numbers."""
        lines = plan.split("\n")
        steps = []
        for line in lines:
            match = STEP_REGEX.search(line.strip())
            if match:
                cleaned = match.group(2).strip()
                steps.append(cleaned)
        return steps

    def execute_step(self, step: str):
        step_lower = step.lower()

        if "search" in step_lower:
            return search_tool(step)

        if "calculate" in step_lower or "compute" in step_lower:
            return compute_tool(step)

        # fallback: let the model reason
        return self.reason(step)

    def run(self, goal: str):
        PLAN_PROMPT =f"""You are a task decomposition engine.  
Your ONLY job is to break the user's goal into a small set of concrete, functional steps.
Your outputs MUST stay within the domain of the user’s goal.  
If the goal references football, metrics, or sports, remain in that domain only.

RULES:
- Only return steps directly needed to complete the user’s goal.
- Do NOT invent topics, examples, reviews, or unrelated domains.
- Do NOT expand into full explanations.
- No marketing language.
- No creative writing.
- No assumptions beyond the user's exact goal.
- No extra commentary.

FORMAT:
1. <short step>
2. <short step>
3. <short step>

User goal: "{goal}"
"""
        plan = self.llm(PLAN_PROMPT)
        steps = self.parse_steps(plan)

        outputs = []
        for step in steps:
            outputs.append({
                "step": step,
                "output": self.execute_step(step)
            })
        return outputs

User-facing output: The final taste = formatted responses. It formats the output for user facing responses.

# -------------------------------------------------------
# User Facing Agent (Formatted Output Layer)
# -------------------------------------------------------
class FinalAgent(StructuredAgent):
    def respond(self, goal: str):
        results = self.run(goal)

        formatted = "\n".join(
            f"- **{r['step']}** → {r['output']}"
            for r in results
        )

        return (
            f"## Result for goal: *{goal}*\n\n"
            f"{formatted}\n"
        )

Takeaway

Whether you're baking or writing code, structure matters. Think in layers. And if you ever need a sweet analogy to explain AI agents try cake. Got a dev inspired dessert metaphor? Drop it in the comments and let’s make tech tasty.

Tutorial

System Requirements:
Python version: 3.11.6
Ollama: Install and run Ollama locally to serve the Llama 3 model.

import requests
import json
import re

# parser that detects all common LLM step styles, including:
# 1. Do X
# 1) Do X
# Step 1: Do X
# **Step 1:** Do X
# - Step 1: Do X
# ### Step 1
# Step One:

STEP_REGEX = re.compile(
    r"(?:^|\s)(?:\*\*)?(?:Step\s*)?(\d+)[\.\):\- ]+(.*)", re.IGNORECASE
)



# -------------------------------------------------------
# Ollama LLM Wrapper
# -------------------------------------------------------
class OllamaLLM:
    def __init__(self, model="llama3"):
        self.model = model

    def __call__(self, prompt: str) -> str:
        """Send a prompt to a local Ollama instance."""
        resp = requests.post(
            "http://localhost:11434/api/generate",
            json={"model": self.model, "prompt": prompt, "stream": False}
        )
        text = json.loads(resp.text).get("response", "")
        return text

# -------------------------------------------------------
# Base Agent 
# -------------------------------------------------------
class AgentCore:
    def __init__(self, llm):
        self.llm = llm

    def reason(self, prompt):
        return self.llm(prompt)

# -------------------------------------------------------
# Local Tools
# -------------------------------------------------------
def search_tool(query: str) -> dict:
    return {
        "tool": "search",
        "query": query,
        "results": [
            {"title": "Top NFL QBs 2024", "eff": 98.1},
            {"title": "Quarterback Rankings", "eff": 95.6},
        ],
    }


def compute_tool(task: str) -> dict:
    return {
        "tool": "compute",
        "task": task,
        "result": 42,  # we pretend the tool computed something important
    }



# -------------------------------------------------------
# Agent prompt and structured tool execution
# -------------------------------------------------------
class StructuredAgent(AgentCore):

    def parse_steps(self, plan: str):
        """Extract step lines starting with numbers."""
        lines = plan.split("\n")
        steps = []
        for line in lines:
            match = STEP_REGEX.search(line.strip())
            if match:
                cleaned = match.group(2).strip()
                steps.append(cleaned)
        return steps

    def execute_step(self, step: str):
        step_lower = step.lower()

        if "search" in step_lower:
            return search_tool(step)

        if "calculate" in step_lower or "compute" in step_lower:
            return compute_tool(step)

        # fallback: let the model reason
        return self.reason(step)

    def run(self, goal: str):
        PLAN_PROMPT =f"""You are a task decomposition engine.  
Your ONLY job is to break the user's goal into a small set of concrete, functional steps.
Your outputs MUST stay within the domain of the user’s goal.  
If the goal references football, metrics, or sports, remain in that domain only.

RULES:
- Only return steps directly needed to complete the user’s goal.
- Do NOT invent topics, examples, reviews, or unrelated domains.
- Do NOT expand into full explanations.
- No marketing language.
- No creative writing.
- No assumptions beyond the user's exact goal.
- No extra commentary.

FORMAT:
1. <short step>
2. <short step>
3. <short step>

User goal: "{goal}"
"""
        plan = self.llm(PLAN_PROMPT)
        steps = self.parse_steps(plan)

        outputs = []
        for step in steps:
            outputs.append({
                "step": step,
                "output": self.execute_step(step)
            })
        return outputs


# -------------------------------------------------------
# User Facing Agent (Formatted Output Layer)
# -------------------------------------------------------
class FinalAgent(StructuredAgent):
    def respond(self, goal: str):
        results = self.run(goal)

        formatted = "\n".join(
            f"- **{r['step']}** → {r['output']}"
            for r in results
        )

        return (
            f"## Result for goal: *{goal}*\n\n"
            f"{formatted}\n"
        )


# -------------------------------------------------------
# Test Cases
# -------------------------------------------------------
if __name__ == "__main__":
    agent = FinalAgent(llm=OllamaLLM("llama3"))

    tests = [
        "Compare NFL quarterback efficiency metrics and summarize insights.",
        "Search for top training drills for youth football players.",
        "Compute a simple metric and explain how you'd structure the process.",
    ]

    for i, t in enumerate(tests, 1):
        print("=" * 70)
        print(f"TEST {i}: {t}")
        print(agent.respond(t))
        print()

Sample Output

======================================================================
TEST 1: Compare NFL quarterback efficiency metrics and summarize insights.

Result for goal: Compare NFL quarterback efficiency metrics and summarize insights.

Gather data on NFL quarterback statistics → Here are some key statistics for NFL quarterbacks, gathered from various sources including Pro-Football-Reference.com, ESPN, and NFL.com:

Passing Statistics

Career Passing Yards: + Tom Brady: 73,517 yards (most in NFL history) + Drew Brees: 72,503 yards + Peyton Manning: 71,940 yards
Career Touchdowns: + Tom Brady: 624 touchdowns + Drew Brees: 571 touchdowns + Aaron Rodgers: 462 touchdowns
Interceptions: + Brett Favre: 336 interceptions (most in NFL history) + Eli Manning: 244 interceptions + Philip Rivers: 234 interceptions
Completion Percentage: + Aaron Rodgers: 65.5% completion percentage (highest in NFL history) + Drew Brees: 64.7% + Tom Brady: 63.4%

Rushing Statistics

Career Rushing Yards: + Cam Newton: 5,442 yards + Russell Wilson: 3,911 yards + Michael Vick: 3,844 yards
Career Rushing Touchdowns: + Cam Newton: 85 rushing touchdowns (most among QBs) + Russell Wilson: 44 rushing touchdowns + Michael Vick: 36 rushing touchdowns

Other Statistics

Win-Loss Record: + Tom Brady: 230-72 regular season record (best among QBs) + Drew Brees: 208-115 regular season record + Peyton Manning: 208-141 regular season record
Playoff Wins: + Tom Brady: 32 playoff wins (most in NFL history) + Joe Montana: 23 playoff wins + Terry Bradshaw: 20 playoff wins

Note: These statistics are accurate as of the end of the 2020 NFL season and may change over time.

I hope this helps! Let me know if you have any specific questions or if there's anything else I can help with.

Identify relevant efficiency metrics (e.g., passer rating, yards per attempt) → Here are some common efficiency metrics used to evaluate quarterbacks:

Passer Rating: A quarterback's cumulative performance based on completion percentage, yards per attempt, touchdowns, and interceptions. * Formula: (Completions - Attempts + Touchdowns * 5 + Interceptions * -2) / Attempted Passes
Yards Per Attempt (YPA): Measures a quarterback's average yardage gained per pass attempt.
Completion Percentage: The percentage of passes completed out of total attempts.
Touchdown-to-Interception Ratio: Evaluates a quarterback's ability to score touchdowns compared to throwing interceptions.
Red Zone Efficiency: Measures a quarterback's success in scoring touchdowns within the opponent's 20-yard line (e.g., red zone).
Third Down Conversion Percentage: Assesses a quarterback's ability to convert third-down plays into first downs.
Fourth Quarter Points Per Game: Evaluates a quarterback's performance in critical, late-game situations.
Adjusted Net Yards Per Attempt (ANY/A): A more advanced metric that adjusts for opponent strength and incorporates additional factors like sacks and fumbles.

Other efficiency metrics used to evaluate quarterbacks include:

Quarterback Wins: A simple measure of a quarterback's contribution to their team's win-loss record.
Passer Rating Average per Game: A variation of the traditional passer rating formula, adjusted for games played.
Sack Rate: Measures a quarterback's frequency of being sacked relative to their total pass attempts.
Fumble Rate: Evaluates a quarterback's tendency to fumble the ball relative to their total pass attempts.

Keep in mind that each metric has its strengths and weaknesses, and no single efficiency metric can fully capture a quarterback's performance.

Calculate averages and rankings for each quarterback → {'tool': 'compute', 'task': 'Calculate averages and rankings for each quarterback', 'result': 42}

How AI Thinks It Thinks: ChatGPT, Copilot and Gemini Explain Themselves

Sanskriti — Tue, 02 Sep 2025 04:26:06 +0000

When building LangChain agents, understanding how your LLM "reasons" isn’t just a technical curiosity, it’s critical for debugging. They operate as probabilistic engines trained on vast datasets, generating outputs that mimic reasoning, planning, and even self awareness. Whether they’re glorified autocomplete systems or emergent cognitive frameworks, they’re shaping how we build, debug, and deploy intelligent workflows. To truly understand what these models can and can’t do, I went straight to the source. I prompted ChatGPT, Microsoft Copilot, and Google Gemini to explain their own architecture, reasoning process, and limitations. The best way to learn how they think is to ask them to think about themselves. Here’s what they said.

Questions Asked

How do you work?
Can you explain more technically?
What’s the actual math behind it?
How do you create your own output with reason if all you are doing is token prediction?
Can you explain it step by step like I’m 5?
What are your limitations today?

Response Summary

Question	ChatGPT	Copilot	Gemini
1. How do you work?	Describes token prediction, training on large datasets, and emergent reasoning.	Explains transformer architecture and modular orchestration.	Covers transformer basics and training data.
2. Can you explain more technically?	Introduces embeddings, attention, and sampling with equations.	Details inference pipeline, memory, and tool integration.	Describes self-attention and model layers.
3. What’s the actual math behind it?	Shares softmax and sampling logic.	Explains conditional probability and token scoring.	Mentions attention weights and vector math.
4. How do you create your own output with reason if all you are doing is token prediction?	Frames reasoning as emergent behavior from token selection.	Simulates reasoning and planning via intermediate steps and tool calls.	Says reasoning emerges from scale and training.
5. Can you explain it step by step like I’m 5?	Uses analogies like smart autocomplete and storytelling.	Describes step-by-step planning like a checklist.	Uses library and storytelling metaphors.
6. What are your limitations today?	Hallucinations, context limits, bias, no memory.	No consciousness, no intent, sandboxed tools.	Hallucinations, bias, static knowledge, no real reasoning.

Verdict: Who Explained It Best?

Model	Clarity	Depth	Conciseness	Dev Utility
ChatGPT	Y	Y	Y	Y
Copilot	Y	Y	Y	Y
Gemini	Y	Maybe	Maybe	Maybe

Legend - Y (It's able to explain in detail); Maybe (Somewhat explained it)

Takeaway for Agent Builders

If you're designing agents that need to simulate reasoning, learn and make decisions. Copilot’s modular thinking and planning metaphors are especially useful. ChatGPT is great for understanding the math and mechanics. Gemini is fine but less actionable.

Don't forget to try it on your own and let's discuss in the comments.

Versions used -
Gemini 2.5 Flash
Copilot Quick Response Mode
ChatGPT GPT-5

LangChain + Ollama in the Wild: Hard Learned Lessons on Building a Custom LLM Agent

Sanskriti — Wed, 20 Aug 2025 05:32:18 +0000

I stumbled upon Ollama, an open-source application that makes it easy to run large language models locally with minimal setup. Integrating LangChain with Ollama is straightforward: you can wire up a model with just a few lines of code.

Turning that integration into shippable code, something reliable enough to move beyond a demo requires more care. Through trial and error, I ran into several pitfalls that made the difference between a quick prototype and a stable system.

For this article, I’ll use a simple scheduling agent as the example to walk through four key lessons.

src-https://ollama.com/public/blog/meta-ollama-llama3.png

1. Skipping Explicit Schemas

If you let the model “be concise,” it will drift into natural language instead of structured outputs. The fix is to define JSON schemas directly in the system prompt.

SYSTEM_PROMPT = """
You are a calendaring assistant. Actions:

1. create_meeting
   { "person": string, "datetime": string, "reason": string }

2. reschedule_meeting
   { "person": string, "new_datetime": string }

3. cancel_meeting
   { "person": string }

4. escalate_issue
   { "reason": string }

Output only valid JSON:
{ "action": "<name>", "input": {...} }
"""

2. Post Processing Is Not Optional

Even with a schema, the model sometimes returns almost JSON: stray commas, comments, or text. That’s why you should always sanitize before parsing.

import re, json

def clean_json(payload: str) -> str:
    no_comments = re.sub(r'//.*', '', payload)
    return re.sub(r',(\s*[}\]])', r'\1', no_comments).strip()

def parse_payload(raw: str):
    payload = clean_json(raw)
    return json.loads(payload)

3. No Timeouts/Retries

Network hiccups or model stalls will block your system if you don’t enforce limits. Ollama doesn’t provide retries or timeouts out of the box, so you need to add them at the call site.

from langchain_ollama import ChatOllama
from langchain.schema import SystemMessage, HumanMessage

llm = ChatOllama(model="llama3", temperature=0)

def call_with_retry(messages, retries=2):
    for attempt in range(retries):
        try:
            return llm.invoke(input=messages, timeout=10)  # enforce timeout
        except Exception:
            if attempt == retries - 1:
                raise

4. Ignoring Drifts

As prompts or models change, outputs can silently drift. A schema that worked last week may suddenly fail. Adding lightweight regression checks helps you catch this early.

def test_golden_case():
    messages = [SystemMessage(content=SYSTEM_PROMPT),
                HumanMessage(content="Book a lunch with Alex at 1pm")]
    ai_msg = call_with_retry(messages)
    cmd = parse_payload(ai_msg.content)
    assert "action" in cmd and "input" in cmd

Putting It Together:

import json, re
from langchain_ollama import ChatOllama
from langchain.schema import SystemMessage, HumanMessage, AIMessage

# 1) Define your agent meeting functions as JSON in the system prompt
SYSTEM_PROMPT = """
You are a calendaring assistant. Actions:

1. create_meeting
   { "person": string, "datetime": string, "reason": string }

2. reschedule_meeting
   { "person": string, "new_datetime": string }

3. cancel_meeting
   { "person": string }

4. escalate_issue
   { "reason": string }

Output only valid JSON:
{ "action": "<name>", "input": {...} }
"""

# 2) Init the Ollama model
llm = ChatOllama(model="llama3", temperature=0)

# 3) AI message as cleaned up JSON
def parse_payload(raw: str):
    payload = clean_json(raw)
    return json.loads(payload)

# 4) Remove any JS-style comments and trailing commas in objects/arrays
def clean_json(payload: str) -> str:
    no_comments = re.sub(r'//.*', '', payload)
    return re.sub(r',(\s*[}\]])', r'\1', no_comments).strip()

# 5) Invoke llm model with retry limit
def call_with_retry(messages, retries=2):
    for attempt in range(retries):
        try:
            return llm.invoke(input=messages)
        except Exception:
            if attempt == retries - 1:
                raise

# 6) Execute custom agent and to return the result
def run_agent(user_request: str):
    messages = [SystemMessage(content=SYSTEM_PROMPT), HumanMessage(content=user_request)]
    ai_msg: AIMessage = call_with_retry(messages)
    cmd = parse_payload(ai_msg.content)

    action, params = cmd["action"], cmd["input"]

    if action == "create_meeting":
        return {"result": f"Created meeting with {params['person']} on {params['datetime']}"}

    elif action == "reschedule_meeting":
        return {"result": f"Rescheduled {params['person']} to {params['new_datetime']}"}

    elif action == "cancel_meeting":
        return {"result": f"Cancelled meeting with {params['person']}"}

    elif action == "escalate_issue":
        return {"result": f"Escalated due to: {params['reason']}"}

    else:
        return {"error": f"Unknown action: {action}"}

# 7) Optional - regression check
def test_golden_case():
    messages = [SystemMessage(content=SYSTEM_PROMPT),
                HumanMessage(content="Reschedule a lunch with Alex at 1pm")]
    ai_msg = call_with_retry(messages)
    cmd = parse_payload(ai_msg.content)
    print(cmd["action"])
    print(cmd["input"])
    assert "action" in cmd and "input" in cmd

# Application start
if __name__ == "__main__":
    query = "Book a call with Mr. Russell for next Thursday at 3 PST for a quick lunch"
    print(run_agent(query))
    # Optional validation step
    #test_golden_case()

O/P: {'result': 'Created meeting with Mr. Russell on 2023-03-16T15:00:00-08:00'}

Takeaway

LangChain + Ollama is fast to set up, but brittle if you skip guardrails. These small investments turn a fragile code into a service you can trust.

QQ???

Ollama updated lib now supports tools, so would you wrap dispatch logic into a proper toolset (using LangChain Tool abstraction) or keep it closer to plain Python for control?