DEV Community: Sergey Inozemtsev

I built a conversational RAG memory for my wife's LinkedIn agency for $44

Sergey Inozemtsev — Mon, 01 Jun 2026 12:27:19 +0000

My wife runs an agency that helps tech founders grow their LinkedIn presence. That means dozens of calls every week — clients, their teams, new leads, partner agencies, and more. All recorded through TL;DV.

The problem: when she needs to pull together context from an interview with a founder, something the CTO mentioned two months ago, and an insight from last week's call with a marketer — there's no good way to do it. Pasting full transcripts into Claude burns through context limits fast, and every new session starts from scratch.

What it does

TLBrain indexes call transcripts and gives Claude a persistent, searchable memory of every client conversation.

When my wife asks "what did we discuss with Acme last month?" — Claude queries the index, retrieves the relevant transcript segments, and answers with actual context from the calls.

The flow looks like this:

TL;DV records a call and fires a webhook
An import service converts the transcript into a Google Doc and places it in the right client folder in Google Drive — automatically
A sync service picks up new and changed documents, parses them, generates summaries and facts via Gemini, and indexes everything into Qdrant
A remote MCP server connects to Claude — accessible as a tool in both Claude.ai chat and Claude Cowork, on any device

By default, Claude only gets the relevant segments retrieved for each query — but there's also a tool to fetch the full transcript when needed.

What it costs

232 transcripts indexed for $44 — one-time cost. Each new transcript costs ~$0.19 to index.

Infrastructure runs entirely on free tiers:

Cloud Run
Firestore
Google Drive
Qdrant Cloud

All four stay within free tier limits for a small agency workload.

The only meaningful cost is Gemini API (Tier 1) — generation and embeddings during indexing. Embeddings are generated only for summaries and facts — not for utterances. Utterances are stored with BM25 sparse vectors and retrieved by range. This keeps both cost and vector storage size low. Free tier has strict rate limits that would make indexing 200+ transcripts impractical.

A few details that keep the cost low: embeddings use text-embedding-004 with output_dimensionality=768 — 4× cheaper than the default 3072. Summary and facts are generated in a single Gemini request per window. And if a file hasn't changed, it's skipped entirely — Gemini is never called again for the same content.

The $0.19 per transcript is a one-time indexing cost — you pay to embed the conversation once, not every time Claude searches it. Stop recording calls for a month, go on vacation, pause the business — the system costs nothing during that time. You only pay again when new transcripts are indexed. That said, this assumes you stay within the free tiers — once your call volume grows beyond those limits, infrastructure costs will kick in.

Without an index, asking Claude about a specific client means pasting entire transcripts into the context window — hitting message limits fast, and starting from scratch every session. TLBrain sends Claude only the relevant fragments: up to 75 utterances out of hundreds.

Why not just use Claude Projects?

Claude Projects is the obvious first answer. But it has two hard limits for this use case:

Context ceiling — paste enough transcripts and you hit the limit. Every new session starts from scratch.
No structure — there's no concept of clients, dates, or searchable facts. Everything is a flat pile of documents.

Imagine she needs to find what a client said about budget across three calls from different months. With Claude Projects, she'd need to manually find the right transcripts, paste them one by one, and hope they fit in context. With 232 calls in the archive, that's not a workflow — that's a research project. TLBrain returns the relevant fragments in seconds.

Architecture

TL;DV API  ←─────────────────  Reconciliation (Cloud Run, daily)
     ↓
Webhook Handler (Cloud Function)
     ↓
Import Service (Cloud Run)
     ↓
Google Drive + Firestore  ←───  Sync Checker (Cloud Function, every 15 min)
     ↓
Vector Sync (Cloud Run)
     ↓
Qdrant Cloud + Firestore
     ↓
MCP Server (Cloud Run)
     ↓
Claude (chat / Cowork)

Note: Firestore is used throughout as the state store — tracking import status, content hashes, and sync state.

Six services sounds like a lot — but each split is intentional. The webhook handler must respond to TL;DV in under 2 seconds or the delivery is marked as failed, so import runs in a separate service. The MCP server is isolated from the sync pipeline so a slow indexing job never blocks Claude's queries. Services are also split by runtime pattern: Cloud Functions wake up, check something, and go back to sleep — no idle cost. Cloud Run containers handle long-running tasks and stay warm longer: the MCP server keeps its instance alive for 15 minutes after the last request, so there are no cold starts during an active session.

Client folders in Google Drive are the source of truth for data organization. Each subfolder under the root is a client name. The sync service doesn't know about TL;DV — it only reads Google Docs from Drive. The import service and sync service are fully decoupled.

A daily reconciliation job cross-checks TL;DV's API against Firestore and queues anything the webhook missed — so no transcript gets lost silently.

One unexpected benefit

If a transcript contains transcription errors, I simply edit the Google Doc.

The sync service detects the change, regenerates summaries, re-embeds affected chunks, and updates Qdrant automatically.

No admin panel required.

Why traditional RAG chunking fails on conversations

Standard RAG splits text by token count. That works for documents but breaks on transcripts:

speaker boundaries are lost
replies get split mid-thought
retrieval returns fragments without context

The fix: treat each utterance as the atomic unit.

{
    "type": "utterance",
    "doc_id": "1BxK...drive_file_id",
    "client_name": "Acme Corp",
    "dialog_date": "2025-03-12",
    "speaker": "Alice",
    "text": "I're planning to launch in Q3, budget is around 5000.",
    "order_index": 42,
    "version": "sha256_of_content",
}

But utterances alone lose context. "Price is 5000" is meaningless without the surrounding conversation. So I generate summaries over sliding windows using anchor-based windowing:

def generate_windows(
    utterances: list[dict],
    anchor_step: int = 3,
    half_window: int = 2,
) -> list[dict]:
    if not utterances:
        return []
    windows = []
    n = len(utterances)
    for i in range(0, n, anchor_step):
        start = max(0, i - half_window)
        end = min(n - 1, i + half_window)
        window_utterances = utterances[start : end + 1]
        windows.append({
            "center_index": utterances[i]["order_index"],
            "covered_range": [
                utterances[start]["order_index"],
                utterances[end]["order_index"],
            ],
            "utterances": window_utterances,
        })
    return windows

Summaries and facts are generated in English regardless of the original language — so Claude always queries in English for consistent retrieval quality.

Why semantic search alone isn't enough

With summaries indexed, semantic search works well — most of the time.

The failure case: "what was the price she mentioned?" Specific numbers, names, short factual statements — these rarely survive summarization. The summary might say "discussed pricing" but the actual figure only lives in the raw utterance.

Semantic miss. Keyword hit.

The fix: three parallel searches.

Facts handle structured values like prices and dates. BM25 catches what semantic search misses — exact company names, abbreviations, or foreign words that don't survive summarization. If a client mentioned a specific vendor by name, semantic search might return "discussed partnerships" — BM25 finds the exact utterance.

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=2) as executor:
    semantic_future = executor.submit(
        search_summaries_and_facts,
        query=query, client_name=client_name,
        date_from=date_from, date_to=date_to,
        top_k=15,
    )
    keyword_future = executor.submit(
        keyword_search_utterances,
        query=query, client_name=client_name,
        date_from=date_from, date_to=date_to,
    )
semantic_hits = semantic_future.result()   # dense, score >= 0.6
keyword_hits  = keyword_future.result()    # BM25, no threshold

# Pin: user_facts bypass score threshold entirely
user_fact_hits = search_user_facts(query, client_name=client_name)
pinned_hits = []
if user_fact_hits:
    hits_by_doc = {}
    for h in user_fact_hits:
        hits_by_doc[h["doc_id"]] = hits_by_doc.get(h["doc_id"], 0) + 1
    for doc_id, hit_count in hits_by_doc.items():
        pinned_hits.extend(search_summaries_for_doc(doc_id, query, top_k=hit_count * 5))

Note: ThreadPoolExecutor instead of asyncio.gather — the Qdrant Python SDK is synchronous. Real parallelism here comes from a thread pool, not coroutines.

Each search serves a different purpose:

Dense (semantic) over summaries and facts — finds topically relevant conversations
BM25 (keyword) over raw utterances — catches exact matches that don't survive summarization
Pin over user-added facts — forces specific documents into results regardless of score

Results are merged, overlapping ranges within the same document are combined, and the final
utterances are fetched by index range — no second search needed.

The output is a list of segments:

{
  "doc_id": "1BxKmN9vQ2rTzAp_drivefile",
  "client_name": "Acme Corp",
  "dialog_date": "2025-03-12",
  "segments": [
    {
      "range": [40, 46],
      "dialog": [
        {"speaker": "Alice",  "text": "So what's the timeline looking like?",             "order_index": 40},
        {"speaker": "Bob",    "text": "I need to be live by end of Q3.",                 "order_index": 41},
        {"speaker": "Alice",  "text": "I're planning to launch in Q3, budget is 5000.",  "order_index": 42},
        {"speaker": "Bob",    "text": "That works. Can you send a proposal by Friday?",   "order_index": 43},
        {"speaker": "Alice",  "text": "Sure, I'll have it over by Thursday.",             "order_index": 44}
      ]
    }
  ]
}

This is what gets sent to Claude as context. Not the full transcript — just the relevant
fragments.

What's next

Today TLBrain indexes 232 conversations and gives Claude access to years of client history — without loading entire transcripts into context.

The whole project is open source: TLBrain on GitHub

In the next post I'll cover how I turned this into a production remote MCP server with Google OAuth on Cloud Run.

One Tool Calling Interface for OpenAI, Claude, and Gemini

Sergey Inozemtsev — Thu, 12 Mar 2026 09:56:46 +0000

llm-api-adapter is an open‑source Python library designed to simplify working with multiple LLM providers.

Many AI applications today need to support multiple LLM providers.

Common reasons include:

cost optimization
fallback when a provider is unavailable
access to different model capabilities
experimentation with new models

In practice, the moment you try to support OpenAI, Claude, and Gemini, the integration becomes messy.

Tool calling alone already breaks portability:

Provider	Tool format
OpenAI	`tool_calls`
Anthropic	`tool_use` blocks
Google Gemini	`functionCall` / `functionResponse`

These are not just syntax differences.\
They require different request structures, response parsing, and execution loops.

Supporting multiple providers usually leads to:

provider-specific integration logic
provider-specific request/response handling
duplicated tool execution flows
multiple SDK dependencies

The result is more code, more bugs, and much harder provider switching.

To simplify this, I built llm-api-adapter — a small Python library that provides one unified interface for multiple LLM APIs.

Define tools once and run the same application logic across OpenAI, Anthropic, and Gemini.

Architecture

The adapter acts as a translation layer between your application and LLM providers.

              Application Logic
                     │
                     ▼
           UniversalLLMAPIAdapter
                     │
                     ▼
          Provider Translation Layer
                     │
                     ▼
 ┌─────────────┬─────────────┬─────────────┐
 │   OpenAI    │  Anthropic  │   Gemini    │
 │ tool_calls  │  tool_use   │ functionCall│
 └─────────────┴─────────────┴─────────────┘

Your application communicates with one interface, while the adapter converts requests and responses to the provider-specific formats.

Installation

pip install llm-api-adapter

The "Strawberry" problem

A classic example showing why tool calling matters:

How many "r" letters are in "strawberry"?

The correct answer is 3, but models often fail because they reason over tokens, not characters.

Best practice is:

Let the LLM reason, but delegate deterministic tasks to code.

This is exactly what tool calling enables.

Defining a tool once

With llm-api-adapter, tools are defined using a provider-agnostic schema.

from llm_api_adapter.models.tools import ToolSpec

tools = [
    ToolSpec(
        name="count_letter_in_word",
        description="Count how many times a specific letter appears in a word",
        json_schema={
            "type": "object",
            "properties": {
                "word": {"type": "string"},
                "letter": {"type": "string", "minLength": 1, "maxLength": 1},
            },
            "required": ["word", "letter"],
            "additionalProperties": False,
        },
    )
]

The adapter automatically converts this schema to:

OpenAI tools
Anthropic tool_use
Gemini functionCall

Running the same code across providers

The application logic remains identical.

Only the provider name, model, and API key change.

import json
from typing import Any, Dict

from llm_api_adapter.universal_adapter import UniversalLLMAPIAdapter
from llm_api_adapter.models.messages.chat_message import (
    UserMessage,
    AIMessage,
    ToolMessage,
)

def run_tool(name: str, args: Dict[str, Any]) -> Dict[str, Any]:
    if name == "count_letter_in_word":
        word, letter = args["word"], args["letter"]
        return {
            "word": word,
            "letter": letter,
            "count": word.lower().count(letter.lower()),
        }

providers = [
    ("openai", "gpt-5.2", openai_api_key),
    ("anthropic", "claude-haiku-4-5", anthropic_api_key),
    ("google", "gemini-2.5-flash", google_api_key),
]

for org, model, key in providers:
    adapter = UniversalLLMAPIAdapter(
        organization=org,
        model=model,
        api_key=key,
    )

    messages = [
        UserMessage('How many "r" letters are in "strawberry"?')
    ]

    first = adapter.chat(
        messages=messages,
        tools=tools,
        tool_choice="auto",
        max_tokens=1000,
    )

    if first.tool_calls:
        messages.append(
            AIMessage(content="", tool_calls=first.tool_calls)
        )
        for tc in first.tool_calls:
            result = run_tool(tc.name, tc.arguments)
            messages.append(
                ToolMessage(
                    tool_call_id=tc.call_id,
                    content=json.dumps(result),
                )
            )
        final = adapter.chat(
            messages=messages,
            previous_response=first,
            max_tokens=1000,
        )

        print(f"--- {org} / {model} ---")
        print(final.content)
        print()

Example output

Even though the models use different tokenization internally, they all trigger the tool correctly.

--- openai / gpt-5.2 ---
There are 3 letters "r" in "strawberry".

--- anthropic / claude-haiku-4-5 ---
There are 3 "r" letters in "strawberry".

--- google / gemini-2.5-flash ---
There are three "r" letters in "strawberry".

Without vs with an adapter

Problem	Without Adapter	With llm-api-adapter
Tool definitions	Provider specific	One universal schema
Tool execution	Custom logic per provider	Unified interface
Response parsing	Different formats	Single response model
Provider switching	Rewrite code	Change model string
Dependencies	Multiple SDKs	One library

Why this matters

Supporting multiple LLM providers normally requires separate integrations and duplicated logic.

A unified interface lets you:

keep application logic provider-agnostic
switch models without rewriting code
simplify agent architectures

Instead of adapting your code to each provider, you adapt the providers to your code.

GitHub

The project is open source.

👉 https://github.com/Inozem/llm_api_adapter

You will find full documentation, examples, and the source code in the repository.

Clean Architecture for AI Agents with Convo-Lang

Sergey Inozemtsev — Tue, 10 Feb 2026 20:47:41 +0000

Decoupling Orchestration from Reasoning

In this post, I’ll show how to design a clean, maintainable architecture for AI systems using Convo-Lang.

As a concrete example, I’ll use a hallucination-resistant AI agent that analyzes a job description, evaluates candidate fit against detailed professional experience, and generates a tailored resume only when the role is actually relevant.

In this setup, all reasoning and decision logic lives in Convo-Lang, while Python is used strictly for orchestration — loading inputs, executing agents, and wiring the pipeline together.

The goal of the example is not the resume itself. The goal is to demonstrate how to decouple orchestration from reasoning and build an AI system that is easy to understand, extend, and maintain over time.

The full working example is available in the Convo-Lang repository.

You can explore the complete code here: https://github.com/convo-lang/convo-lang/tree/main/packages/convo-lang-py/examples/02_patterns/resume_generator

You can clone it, run it locally, and experiment with it by simply replacing the job description and writing your own experience profile — the sample inputs live in the data/ folder.

What Convo-Lang actually is

Convo-Lang is not:

a prompt template engine
a thin wrapper around chat completions
a “nicer way to write prompts”

Convo-Lang is a domain-specific language for LLM reasoning and agent workflows.

It allows you to define:

explicit agent roles
typed input and output contracts
deterministic logic
schema-enforced outputs
multi-agent pipelines

All of this lives in .convo files — outside of application code.

Why resumes are a good stress test

Resume generation is a hostile domain for hallucinations:

inventing skills is unacceptable
inventing companies or roles is unacceptable
inventing dates is unacceptable
decisions must be explainable

A single “smart prompt” is the worst possible approach here.

So instead of asking how to prompt, I started by asking:
how should this system be modeled?

Modeling the system as Convo-Lang agents

The solution is built as five Convo-Lang agents, each responsible for exactly one thing:

JobDescriptionAnalyzer
Turns raw job text into structured requirements.
CandidateProfileAnalyzer
Converts free-form experience text into factual, structured data.
ProfileJobMatcher
Matches experience to requirements and explicitly lists gaps.
ResumeWriter
Generates a resume strictly from verified data.
FitEvaluator
Decides whether applying makes sense.

Each agent:

lives in its own .convo file
has a single responsibility
communicates only through typed contracts

This separation is not cosmetic.
It is the foundation of reliability.

Typed contracts instead of “return JSON please”

In most LLM systems, structured output is a suggestion.

In Convo-Lang, it is a contract.

Here is a real schema used by the CandidateProfileAnalyzer agent:

>define
ProfileData = struct(
  workExperience: array(
    struct(
      title: string
      companyName: string
      firstDate: string
      lastDate?: string
      summary: string
      experience: array(string)
    )
  )
  projects?: array(
    struct(
      title: string
      firstDate: string
      lastDate?: string
      experience: array(string)
    )
  )
)

This immediately changes system behavior:

required fields must exist
optional fields are explicit
invented fields are invalid
downstream agents can trust the data shape

Hallucinations don’t silently propagate.
They violate the contract.

Validating inputs before any reasoning happens

Hallucinations often start before generation.
They start when invalid or ambiguous input quietly enters the system.

Convo-Lang allows agents to validate inputs explicitly, before any reasoning takes place.

>define
JobData = struct(
  title: string
  mustRequirements: array(string)
  niceToHaveRequirements: array(string)
  keywords: array(string)
)

>do
jobData = new(JobData job_data)

That single line enforces a lot:

checks that job_data exists
validates required fields
enforces correct types
rejects malformed input early

If the input does not match JobData, the agent does not proceed.

The model never reasons over invalid data.

Here, input validation is part of the agent’s contract, not an afterthought.

Explainable matching instead of opaque scoring

The ProfileJobMatcher agent does not produce a mysterious score.

It produces:

only relevant roles and projects
explicit matchReasons for each item
two concrete gap lists: must-have and nice-to-have

MatchData = struct(
  coverageProfileData: struct(
    workExperience: array(
      title: string
      companyName: string
      firstDate: string
      lastDate?: string
      summary: string
      experience: array(string)
      matchReasons: array(string)
    )
    projects?: array(...)
  )
  gaps: struct(
    mustRequirements: array(string)
    niceToHaveRequirements: array(string)
  )
)

Nothing is hidden.
Every match and every gap is inspectable.

This output becomes the single source of truth for all downstream steps.

Deterministic logic inside the agent (not in prose)

A key feature of Convo-Lang is that deterministic logic lives next to reasoning.

In the FitEvaluator, the final decision is not guessed.
It is calculated.

>do
jobData = new(JobData job_data)
matchData = new(MatchData match_data)

totalConfidence = 100
jobRequirementsAmount = jobData.mustRequirements.length
requirementPoints = div(totalConfidence jobRequirementsAmount)

requirementGapAmount = matchData.gaps.mustRequirements.length
mainConfidence = mul(
  sub(jobRequirementsAmount requirementGapAmount)
  requirementPoints
)

decision = "apply"

if (lt(mainConfidence 70)) then (
  decision = "skip"
)
elif (lt(mainConfidence 90)) then (
  decision = "maybe apply"
)

This is business logic:

readable
reviewable
testable

The LLM explains the decision — but it does not invent the rules.

Schema-enforced output with `@json`

Convo-Lang does not rely on “please return JSON”.

It enforces it.

@json RecommendationData
>user
Help the candidate decide whether applying for this job makes sense.

If the output does not match RecommendationData, it is invalid.

Structured output is no longer a best-effort promise.
It is a guarantee.

Python as an orchestrator, not a reasoning layer

So where does Python fit into this architecture?

Python is intentionally boring.

It does not:

contain prompts
contain business rules
interpret free-form model output

It only:

loads input data
executes agents
passes validated JSON between them
handles I/O

job_data = convo_job_description_analyzer.complete(...)
profile_data = convo_candidate_profile_analyzer.complete(...)
match_data = convo_profile_job_matcher.complete(...)
resume_data = convo_resume_writer.complete(...)
decision = convo_fit_evaluator.complete(...)

All intelligence lives in .convo.
Python is just the runtime.

This separation is deliberate.

Why this separation matters

By keeping reasoning in Convo-Lang and orchestration in Python:

AI logic becomes portable
behavior is consistent across CLI, editor, and SDK
prompt changes don’t require backend redeploys
agent logic can be reviewed like code

The agents folder becomes the product.
The SDK becomes an implementation detail.

What this example actually demonstrates

This post is not really about resumes.

It demonstrates that Convo-Lang lets you:

treat LLM logic as first-class code
build multi-agent systems without prompt chaos
validate inputs and outputs explicitly
make hallucinations visible instead of hidden
scale reasoning without rewriting everything

That is why Convo-Lang is worth using.

Final takeaway

Hallucinations are rarely a model problem.
They are almost always an architecture problem.

Convo-Lang gives you the tools to fix that at the right level.

Resources

Convo-Lang core: https://github.com/convo-lang/convo-lang
Convo-Lang Python SDK: https://github.com/convo-lang/convo-lang/tree/main/packages/convo-lang-py
Resume agent example: https://github.com/convo-lang/convo-lang/tree/main/packages/convo-lang-py/examples/02_patterns/resume_generator
Documentation: https://learn.convo-lang.ai/

Prompts are logic, not strings: Why I contributed to Convo-Lang

Sergey Inozemtsev — Sun, 04 Jan 2026 13:36:56 +0000

If you’ve built anything non-trivial with LLMs, you’ve probably written code like this:

prompt = f"""
Analyze this job description: {job_description}
Analyze this candidate profile: {candidate_profile}
Decide whether the candidate is a good fit.
Return JSON.
"""

It works.
Until your project grows.

The problem: prompt spaghetti and technical debt

Hardcoding prompts directly into application code feels convenient at first.
But very quickly it turns into long-term technical debt:

Prompts become unreadable f-string monsters
Prompt changes require code changes and redeploys
Prompt versions drift across files and branches
Prompt engineers and copywriters are afraid to touch .py
Prompt logic, business logic, and orchestration logic get mixed together

At that point, prompts are:

hard to test
hard to reuse
hard to reason about

We already solved this problem for SQL, HTML, configs, and templates.
LLM prompts deserve the same treatment.

Prompts are not strings — they are logic

A modern LLM “prompt” is not just text.

It contains:

structure
contracts
conditions
branching
deterministic steps

Treating it as a Python string literal is the fastest way to lose control over your AI system.

That’s where Convo-Lang comes in.

Convo-Lang as an AI-native DSL

Convo-Lang is an open-source, AI-native DSL for building conversations and agent workflows.

Instead of embedding prompts into code, you define them in .convo files:

explicit schemas
role-based messages
deterministic logic blocks
structured outputs
multi-agent workflows

Your prompt becomes a first-class artifact, not a string buried in code.

How it works: Python as a thin runtime

Here’s all the Python code required to run a single agent:

print("Running FitEvaluator agent...")

convo_fit_evaluator = Conversation(agent_configs)
convo_fit_evaluator.add_convo_file("agents/fitEvaluator.convo")

job_apply_decision = convo_fit_evaluator.complete(
    variables={"job_data": job_data, "match_data": match_data}
)

Notice what’s missing:

no prompt text
no formatting logic
no hidden reasoning

Python is just the runtime.
All intelligence lives in .convo.

Typed contracts instead of free-form prompts

The core idea that changed how I think about prompts:

Agents should communicate through typed contracts, not vague instructions.

Example schema definitions used by an agent:

>define
JobData = struct(
  title:string
  mustRequirements:array(string)
  niceToHaveRequirements:array(string)
)

RecommendationData = struct(
  recommendation: struct(
    decision: enum("apply","maybe apply","skip")
    confidence: number
    summary: string
  )
)

This gives you:

explicit input shapes
explicit output contracts
predictable agent behavior

The agent no longer “guesses” what to return.

Deterministic logic lives next to the prompt

Convo-Lang is not just a prompt format.
It allows you to define explicit, deterministic logic inside the agent.

>do
jobData = new(JobData job_data)
matchData = new(MatchData match_data)

total = 100
reqCount = jobData.mustRequirements.length
niceCount = jobData.niceToHaveRequirements.length

reqPoints = div(total reqCount)
nicePoints = div(reqPoints 4)

reqGaps = matchData.gaps.mustRequirements.length
niceGaps = matchData.gaps.niceToHaveRequirements.length

confidence = add(
  mul(sub(reqCount reqGaps) reqPoints)
  mul(sub(niceCount niceGaps) nicePoints)
)

decision = "apply"
if (lt(confidence 70)) then (decision = "skip")
elif (lt(confidence 90)) then (decision = "maybe apply")

This is business logic, not prompt prose.

Schema-enforced output with `@json`

@json RecommendationData
>user
Return recommendation for this candidate and job.

This is not a suggestion.
It is schema‑enforced output validation.

Malformed or invalid responses don’t silently pass through.

Cross-SDK portability by design

Because the Convo-Lang core is implemented in TypeScript, it guarantees identical behavior across environments:

VS Code preview
CLI
Python runtime

If your prompt passes validation in the editor, it will behave the same way in your Python backend.

Write once. Run anywhere.

Architecture: a smart bridge, not a rewrite

The Python SDK does not reimplement the Convo-Lang engine.

Instead, it acts as a high‑performance bridge to the Node.js core, which handles parsing, validation, and async I/O.

This preserves full syntax and behavior parity across SDKs.

Separation of concerns — for real

With this approach:

.convo files own AI reasoning and decision logic
Python only orchestrates execution
Prompt engineers don’t touch backend code
Developers don’t rewrite prompts

The agents/ folder is the product.
Python is just the runtime.

Why I contributed to the Python SDK

I believe AI workflows need standards.

Prompts should be portable, testable, and explicit.

That’s why I helped bring Convo-Lang to Python.

Resources

Resume generator example (Python): https://github.com/convo-lang/convo-lang/tree/main/packages/convo-lang-py/examples/02_patterns/resume_generator
Convo-Lang core: https://github.com/convo-lang/convo-lang
Convo-Lang Python SDK: https://github.com/convo-lang/convo-lang/tree/main/packages/convo-lang-py
Documentation: https://learn.convo-lang.ai/

How I built a bulletproof CI/CD for my LLM Python library

Sergey Inozemtsev — Wed, 24 Dec 2025 09:20:35 +0000

When building an open-source library that integrates with multiple LLM providers (OpenAI, Anthropic, Google), reliability matters. Users expect upgrades to be safe and predictable.

This post describes the CI/CD setup I use for llm-api-adapter. The key idea is simple: test not only the code, but the actual published package.

The Strategy: two pipelines, three stages

I use a dual-pipeline setup aligned with GitHub Flow:

Dev pipeline — runs on every push to dev. Its job is early feedback and validating the distribution process.
Main pipeline — runs on main and version tags. Its job is stable, repeatable releases to PyPI.

Most CI setups stop at unit or integration tests. This one goes further by validating the artifact installed from TestPyPI.

1. Dev pipeline: pre-flight validation

The dev workflow is where most of the safety guarantees come from.

Stage A: Unit & Integration tests

Executed with pytest
Tests are separated via markers (unit, integration)
Fast feedback on logic and provider integration

Stage B: publish to TestPyPI

After tests pass, the package is built and published to TestPyPI.

This step catches issues that tests alone cannot:

Incorrect pyproject.toml
Missing files in the source distribution
Broken dependency declarations

Stage C: E2E tests from TestPyPI

This is the critical part of the pipeline.

The job:

Waits for TestPyPI to index the new release
Installs the package from TestPyPI, not from source
Pulls dependencies from the real PyPI
Runs real end-to-end tests using live API keys

pip install --index-url https://test.pypi.org/simple/ \
            --extra-index-url https://pypi.org/simple \
            llm-api-adapter

At this point, the CI environment matches what users will experience after pip install.

2. Main pipeline: controlled release

Once the package is validated in dev, changes move to main.

What runs on `main`

Full unit + integration test suite on every PR
No publishing on pushes

What triggers a release

A version tag (vX.Y.Z)
Build and publish to PyPI
Credentials handled via GitHub Secrets

By the time a tag is pushed, the same artifact has already passed E2E tests via TestPyPI.

Why this setup works

Before publishing to PyPI, I know that:

The code behaves correctly (unit + integration tests)
The package is installable from a registry (TestPyPI)
External LLM providers respond as expected (E2E tests)

Most importantly, this approach prevents broken versions from ever being published to PyPI.

Conclusion

If your library depends on external APIs, testing only the source code is not enough.

Testing the published artifact is what makes releases predictable and safe.

The full setup is fully public and reproducible:

👉 Repository:

https://github.com/Inozem/llm_api_adapter

👉 GitHub Actions workflows:

https://github.com/Inozem/llm_api_adapter/tree/main/.github/workflows

Question for you

How do you usually set up CI for your open-source projects?

Python LLM: reasoning is disabled by default in llm-api-adapter

Sergey Inozemtsev — Sun, 14 Dec 2025 09:00:27 +0000

Reasoning improves LLM output quality, but it is expensive and often unnecessary. Worse: most providers enable it implicitly or hide it behind non-obvious parameters.

Result: developers pay for reasoning even when they don’t need it.

The Problem

Reasoning is handled inconsistently across providers:

OpenAI: often enabled implicitly.
Gemini: controlled via thinkingConfig.
Anthropic (Claude): may enforce minimum reasoning tokens (e.g. 1024).
Nano/Mini models: sometimes impossible to disable reasoning entirely.

This leads to:

hidden costs
provider-specific conditionals
easy-to-miss misconfiguration

The Approach: Off by Default

Starting from llm_api_adapter v0.2.3, reasoning is disabled by default.

If it is not explicitly enabled:

no reasoning tokens are used
no extra cost is incurred
existing code keeps working

Costly features should be opt-in, not opt-out.

Enabling Reasoning Explicitly

When reasoning is actually required, it can be enabled via a single unified parameter:

String levels: "none" | "low" | "medium" | "high"
Numeric values: 256, 512, 1024, 2048, etc.

The adapter:

maps the value to provider-specific fields
applies correct formats per API
respects provider minimums
prevents invalid configurations

Installation

pip install llm-api-adapter

Example

from llm_api_adapter.universal_adapter import UniversalLLMAPIAdapter

messages = [
    {"role": "user", "content": "Explain quantum computing simply."}
]

# Pick a provider (same interface)
adapter = UniversalLLMAPIAdapter(
    organization="openai",
    model="gpt-5.1",
    api_key=openai_api_key,
)

# or
# adapter = UniversalLLMAPIAdapter(
#     organization="google",
#     model="gemini-2.5-pro",
#     api_key=google_api_key,
# )

# or
# adapter = UniversalLLMAPIAdapter(
#     organization="anthropic",
#     model="claude-sonnet-4-5",
#     api_key=anthropic_api_key,
# )

response = adapter.chat(
    messages=messages,
    reasoning_level="low",  # off by default, enabled explicitly
)

print(response.content)

Why This Matters

Lower and predictable costs
No accidental reasoning usage
Cleaner application code
Unified control across OpenAI, Claude, and Gemini

Repository

Source code and documentation: https://github.com/Inozem/llm-api-adapter

Structured prompts: how YAML cut my LLM costs by 30%

Sergey Inozemtsev — Wed, 05 Nov 2025 09:34:01 +0000

Result Summary:

Metric	Original Prompt	YAML Prompt	Savings
Tokens	355	251	−104 (−29.3%)
Cost	0.00001775 USD	0.00001255 USD	−29.3%

I'll show how this works using a popular prompt taken from the internet, rewritten in YAML format to show whether structured phrasing can reduce token count without harming quality.

Why it matters:

Fewer tokens → lower cost per request.
YAML forces clarity and structure, improving consistency of answers.
Easier to maintain and version prompts in code.

Original Prompt (PROMPT_A)

prompt_a = """
You are the "Architect Guide," specialized in assisting programmers who are experienced in individual module development but are looking to enhance their skills in understanding and managing entire project architectures.
Your primary roles and methods of guidance include:

Basics of Project Architecture: Start with foundational knowledge, focusing on principles and practices of inter-module communication and standardization in modular coding.
Integration Insights: Provide insights into how individual modules integrate and communicate within a larger system, using examples and case studies for effective project architecture demonstration.
Exploration of Architectural Styles: Encourage exploring different architectural styles, discussing their suitability for various types of projects, and provide resources for further learning.
Practical Exercises: Offer practical exercises to apply new concepts in real-world scenarios.
Analysis of Multi-layered Software Projects: Analyze complex software projects to understand their architecture, including layers like Frontend Application, Backend Service, and Data Storage.
Educational Insights: Focus on reviewing project readme files and source code for comprehensive understanding.
Use of Diagrams and Images: Utilize architecture diagrams and images to aid in understanding project structure and layer interactions.
Clarity Over Jargon: Avoid overly technical language, focusing on clear, understandable explanations.
No Coding Solutions: Focus on architectural concepts and practices rather than specific coding solutions.
Detailed Yet Concise Responses: Provide detailed responses that are concise and informative without being overwhelming.
Practical Application and Real-World Examples: Emphasize practical application with real-world examples.
Clarification Requests: Ask for clarification on vague project details or unspecified architectural styles to ensure accurate advice.
Professional and Approachable Tone: Maintain a professional yet approachable tone.
Use of Everyday Analogies: When discussing technical concepts, use everyday analogies to make them more accessible and understandable.
"""

Optimized YAML Prompt (PROMPT_B)

prompt_b = """
system: |
  Role: "Architect Guide"
  Purpose: Help developers skilled in module-level coding grow into understanding and managing full project architectures.

guidelines: |
  - Teach project architecture fundamentals: modular communication, standardization, and structure.
  - Explain module integration within larger systems using examples and case studies.
  - Compare architectural styles, discuss suitability, and share learning resources.
  - Provide practical exercises for real-world application.
  - Analyze multi-layered software (frontend, backend, data storage) to illustrate architecture.
  - Offer educational insights: review README files and source code for comprehension.
  - Use diagrams and visuals to clarify system interactions.
  - Prefer clarity over jargon; use plain, accessible language.
  - Focus on architecture concepts — no coding solutions.
  - Be detailed yet concise; avoid information overload.
  - Include real-world examples for practical relevance.
  - Ask clarifying questions about unclear project details.
  - Maintain a professional, approachable tone.
  - Use everyday analogies for complex concepts.

style: |
  Clear, didactic, structured.
  Encourage understanding of architecture as a living system, not just code components.
"""

Observation: The output quality didn’t just stay the same — it improved. ChatGPT understood the intent better, and responses became more focused.

Experiment Code Example

# Universal adapter
# using [llm-api-adapter](https://github.com/Inozem/llm_api_adapter)
# makes it easy to switch between different providers for testing
# can be installed easily via: pip install llm-api-adapter
from llm_api_adapter.universal_adapter import UniversalLLMAPIAdapter

messages_a = [
    {"role": "system", "content": prompt_a},
    {"role": "system", "content": "Help me to create weather application."},
]
messages_b = [
    {"role": "system", "content": prompt_b},
    {"role": "system", "content": "Help me to create weather application."},
]
adapter = UniversalLLMAPIAdapter(
    organization="openai",
    model="gpt-5-nano",
    api_key=openai_api_key,
)
# Runs
resp_a = adapter.chat(messages=messages_a)
resp_b = adapter.chat(messages=messages_b)
# Token savings
tokens_a = resp_a.usage.input_tokens
tokens_b = resp_b.usage.input_tokens
saved = tokens_a - tokens_b
rel = (saved / tokens_a) * 100 if tokens_a else 0
print("PROMPT A tokens:", tokens_a)
print("PROMPT B tokens:", tokens_b)
print("Saved tokens:", saved)
print(f"Relative saving: {rel:.1f}%")
# Cost
print("Cost A:", f"{resp_a.cost_input:.8f} {resp_a.currency}")
print("Cost B:", f"{resp_b.cost_input:.8f} {resp_b.currency}")

Prompt I used to generate the new YAML-formatted version:

Optimize this prompt into YAML format

Conclusion: Structured prompts are not just cleaner — they’re cheaper. Try YAML structuring in your next LLM project. It’s simple, reproducible, and can cut your costs by ~30%.

Unifying 3 LLM APIs in Python: OpenAI, Anthropic & Google with one SDK

Sergey Inozemtsev — Tue, 04 Nov 2025 12:34:19 +0000

A year ago, I released the first version of LLM API Adapter — a lightweight SDK that unified OpenAI, Anthropic, and Google APIs under one interface.

It got 7 ⭐ on GitHub and valuable feedback from early users.

That was enough motivation to take it to the next level.

What changed in the new version

The new version (v0.2.2) is now:

SDK-free — it talks directly to provider APIs, no external dependencies.
Unified — one chat() interface for all models (OpenAI, Anthropic, Google).
Transparent — automatic token and cost tracking.
Resilient — consistent error taxonomy across providers (auth, rate, timeout, token limits).
Tested — 98% unit test coverage.

Example: chat with any LLM

from llm_api_adapter.universal_adapter import UniversalLLMAPIAdapter

adapter = UniversalLLMAPIAdapter(provider="openai", model="gpt-5")

response = adapter.chat([
    {"role": "system", "content": "Be concise."},
    {"role": "user", "content": "Explain how LLM adapters work."},
])

print(response.content)

Switching models is as simple as changing two parameters:

adapter = UniversalLLMAPIAdapter(provider="anthropic",
                                 model="claude-sonnet-4-5")
# or
adapter = UniversalLLMAPIAdapter(provider="google",
                                 model="gemini-2.5-pro")

Token & cost tracking example

Every response now includes full token and cost accounting — no manual math needed.

from llm_api_adapter.universal_adapter import UniversalLLMAPIAdapter

google = UniversalLLMAPIAdapter(
    organization="google",
    model="gemini-2.5-pro",
    api_key=google_api_key
)

response = google.chat(**chat_params)

print(response.usage.input_tokens, "tokens",
      f"({response.cost_input} {response.currency})")
print(response.usage.output_tokens, "tokens",
      f"({response.cost_output} {response.currency})")
print(response.usage.total_tokens, "tokens",
      f"({response.cost_total} {response.currency})")

Output:

512 tokens (0.00025 USD)
137 tokens (0.00010 USD)
649 tokens (0.00035 USD)

Why I built this

Working with multiple LLMs used to mean rewriting the same code — again and again.

Each SDK had its own method names, parameter names, and error classes.

So I built a unified interface that abstracts those details.

One adapter — one consistent experience.

Join the project

You can try it now:

pip install llm-api-adapter

Docs & examples: github.com/Inozem/llm_api_adapter

If you like the idea — ⭐ star it or share feedback in Issues.

LLM API Adapter SDK for Python

Sergey Inozemtsev — Thu, 14 Nov 2024 10:49:57 +0000

Here is my LLM API Adapter SDK for Python that allows you to easily switch between different LLM APIs.

At the moment, it supports: OpenAI, Anthropic, and Google. And only the chat function (for now).

It simplifies integration and debugging as it has standardized error classes across all supported LLMs.

It also manages request parameters like temperature, max tokens, and other settings for better control.

To use the adapter, you need to download the library and obtain API keys for the LLMs you want. In the code, I demonstrated how easy it is to use it.

from llm_api_adapter.messages.chat_message import AIMessage, Prompt, UserMessage
from llm_api_adapter.universal_adapter import UniversalLLMAPIAdapter

messages = [
    Prompt(
        "You are a friendly assistant who explains complex concepts "
        "in simple terms."
    ),
    UserMessage("Hi! Can you explain how artificial intelligence works?"),
    AIMessage(
        "Sure! Artificial intelligence (AI) is a system that can perform "
        "tasks requiring human-like intelligence, such as recognizing images "
        "or understanding language. It learns by analyzing large amounts of "
        "data, finding patterns, and making predictions."
    ),
    UserMessage("How does AI learn?"),
]

gpt = UniversalLLMAPIAdapter(
    organization="openai",
    model="gpt-3.5-turbo",
    api_key=openai_api_key
)
gpt_response = gpt.generate_chat_answer(messages=messages)
print(gpt_response.content)

claude = UniversalLLMAPIAdapter(
    organization="anthropic",
    model="claude-3-haiku-20240307",
    api_key=anthropic_api_key
)
claude_response = claude.generate_chat_answer(messages=messages)
print(claude_response.content)

google = UniversalLLMAPIAdapter(
    organization="google",
    model="gemini-1.5-flash",
    api_key=google_api_key
)
google_response = google.generate_chat_answer(messages=messages)
print(google_response.content)

I have explained everything in more detail in the documentation: https://github.com/Inozem/llm_api_adapter

This is the first stage, and it is just the beginning. I'd love to hear your thoughts, feedback, or ideas on where it could go next.

GenAi #Python #LLM #OpenAI #GPT #Anthropic #Claude #Google #Gemini

DEV Community: Sergey Inozemtsev

I built a conversational RAG memory for my wife's LinkedIn agency for $44

What it does

What it costs

Why not just use Claude Projects?

Architecture

One unexpected benefit

Why traditional RAG chunking fails on conversations

Why semantic search alone isn't enough

What's next

One Tool Calling Interface for OpenAI, Claude, and Gemini

Architecture

Installation

The "Strawberry" problem

Defining a tool once

Running the same code across providers

Example output

Without vs with an adapter

Why this matters

GitHub

Clean Architecture for AI Agents with Convo-Lang

Decoupling Orchestration from Reasoning

What Convo-Lang actually is

Why resumes are a good stress test

Modeling the system as Convo-Lang agents

Typed contracts instead of “return JSON please”

Validating inputs before any reasoning happens

Explainable matching instead of opaque scoring

Deterministic logic inside the agent (not in prose)

Schema-enforced output with @json

Python as an orchestrator, not a reasoning layer

Why this separation matters

What this example actually demonstrates

Final takeaway

Resources

Prompts are logic, not strings: Why I contributed to Convo-Lang

The problem: prompt spaghetti and technical debt

Prompts are not strings — they are logic

Convo-Lang as an AI-native DSL

How it works: Python as a thin runtime

Typed contracts instead of free-form prompts

Deterministic logic lives next to the prompt

Schema-enforced output with @json

Cross-SDK portability by design

Architecture: a smart bridge, not a rewrite

Separation of concerns — for real

Why I contributed to the Python SDK

Resources

How I built a bulletproof CI/CD for my LLM Python library

The Strategy: two pipelines, three stages

1. Dev pipeline: pre-flight validation

Stage A: Unit & Integration tests

Stage B: publish to TestPyPI

Stage C: E2E tests from TestPyPI

2. Main pipeline: controlled release

What runs on main

What triggers a release

Why this setup works

Conclusion

Question for you

Python LLM: reasoning is disabled by default in llm-api-adapter

The Problem

The Approach: Off by Default

Enabling Reasoning Explicitly

Installation

Example

Why This Matters

Repository

Structured prompts: how YAML cut my LLM costs by 30%

Original Prompt (PROMPT_A)

Optimized YAML Prompt (PROMPT_B)

Experiment Code Example

Unifying 3 LLM APIs in Python: OpenAI, Anthropic & Google with one SDK

What changed in the new version

Example: chat with any LLM

Token & cost tracking example

Why I built this

Join the project

LLM API Adapter SDK for Python

GenAi #Python #LLM #OpenAI #GPT #Anthropic #Claude #Google #Gemini

Schema-enforced output with `@json`

Schema-enforced output with `@json`

What runs on `main`