DEV Community: Greg Mate

Your AI agent is leaking memory across users. Here's why and how to stop it

Greg Mate — Mon, 22 Jun 2026 11:13:06 +0000

Most agent demos connect to a CRM and update a record. Impressive in a presentation. Broken the moment a second user shows up.

The hard part is not the tool call. It is acting as the right person, remembering only their context, and making sure nothing leaks across users. This is the part that gets hand-waved in demos because it is genuinely annoying to get right.

We ran into this building a reference implementation for the Scalekit x Actian x Render Agents in Production Hackathon in San Francisco on June 27. Here is what we learned.

The problem with agent memory in multi-user systems

When you build an agent that acts on behalf of a user, there are two separate scoping problems:

What is the agent allowed to do on their behalf (identity, permissions, tokens)
What does the agent remember about them (context, history, prior decisions)

Most implementations treat these as the same problem. They are not. Scalekit handles the first one well. VectorAI DB handles the second. But getting them to agree on who the current user is requires deliberate wiring.

The naive approach, a single shared vector index for all users, fails quietly. Alice's agent starts pulling context that belongs to Bob. Nothing crashes. No errors. The output just gets subtly wrong in ways that are hard to debug.

One collection per user

VectorAI DB does not have a native multi-tenancy API. There are no user-scoped namespaces, no per-user tokens, no RBAC on individual collections. The Community Edition ships one isolation primitive: collections.

So the pattern is simple. One collection per user, named after the same identifier your auth layer already uses:

from actian_vectorai import VectorAIClient, VectorParams, Distance, CollectionExistsError

client = VectorAIClient("localhost:6574")

def get_or_create_user_collection(user_id: str, dim: int = 384):
    name = f"user-{user_id}-memories"
    try:
        client.collections.create(
            name,
            vectors_config=VectorParams(size=dim, distance=Distance.Cosine),
        )
    except CollectionExistsError:
        pass
    return name

The key detail: whatever string Scalekit uses as the user identifier for its connected account is the string you pass here. One source of truth, no mapping table, no sync to maintain.
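If user IDs can ever contain characters that should not end up in a collection name, it helps to validate them before they are formatted in. Here is a minimal sketch. It assumes a conservative allowlist of letters, digits, `_` and `-`, because VectorAI DB's actual collection-naming rules are not covered here. `collection_name_for()` is a hypothetical helper:

```python
import re

# Hypothetical allowlist: deliberately conservative, since the exact
# collection-naming rules of VectorAI DB are an assumption here.
_SAFE_ID = re.compile(r"[A-Za-z0-9_-]{1,64}")


def collection_name_for(user_id: str) -> str:
    """Map an authenticated user ID to that user's memory collection name.

    Rejects IDs outside the allowlist instead of silently rewriting them,
    so two different users can never collapse into the same collection.
    """
    if not _SAFE_ID.fullmatch(user_id):
        raise ValueError(f"unsupported user id for collection name: {user_id!r}")
    return f"user-{user_id}-memories"
```

`get_or_create_user_collection()` could call this instead of formatting the string inline. Rejecting is deliberate. If the helper replaced bad characters with `_` instead, `alice.b` and `alice_b` would end up sharing a collection, which is exactly the leak this pattern exists to prevent.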

Two things you cannot change after collection creation: vector dimension and distance metric. Pick your embedding model before you create any collections. Changing either later requires deleting the collection and losing all its data.
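Because the dimension is fixed at creation, a cheap check before every write catches an accidental model swap at the point of the mistake. Here is a sketch. It assumes a 384-dimension embedding model, since both BAAI/bge-small-en-v1.5 and all-MiniLM-L6-v2 produce 384-dimension vectors. `EMBEDDING_DIM` and `check_vector()` are hypothetical names:

```python
import math
from typing import List, Sequence

# Must match the `size` passed to VectorParams when the collection was created.
EMBEDDING_DIM = 384


def check_vector(vector: Sequence[float], dim: int = EMBEDDING_DIM) -> List[float]:
    """Validate an embedding before it is written to a collection."""
    if len(vector) != dim:
        raise ValueError(f"expected {dim}-dimension vector, got {len(vector)}")
    values = [float(v) for v in vector]
    if not all(math.isfinite(v) for v in values):
        raise ValueError("vector contains NaN or infinite values")
    return values
```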

Getting it running locally

pip install actian-vectorai-client

docker pull actian/vectorai:latest
docker run -d --name vectorai \
  -v ./local_data:/var/lib/actian-vectorai \
  -p 6573-6575:6573-6575 \
  -e ACTIAN_VECTORAI_ACCEPT_EULA=YES \
  actian/vectorai:latest

The container will not start without ACTIAN_VECTORAI_ACCEPT_EULA=YES. There is no error message, just an immediate exit with code 1.

One thing that will trip you up: the pip package is actian-vectorai-client but the import is actian_vectorai. Different strings, and an import written with the package name will fail.

from actian_vectorai import VectorAIClient, VectorParams, Distance, PointStruct

client = VectorAIClient("localhost:6574")

The dependency conflict nobody warns you about

If you are combining this with scalekit-sdk-python, you will hit a dependency conflict that is not version-specific and not obvious from the error message.

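scalekit-sdk-python==2.12.0 pins protobuf<7.0.0 and grpcio-status<1.67. actian-vectorai-client needs protobuf>=6.31.1. The protobuf ranges overlap on paper, but the older grpcio-status pin pulls protobuf back to 5.x when pip resolves everything, and then actian_vectorai fails at import time with: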

google.protobuf.runtime_version.VersionError: Detected incompatible Protobuf 
Gencode/Runtime versions when loading actian_vectorai_common.proto: 
gencode 6.31.1 runtime 5.29.6.

The fix:

# Install everything except scalekit normally
grep -v scalekit-sdk-python requirements.txt > /tmp/req.txt
pip install -r /tmp/req.txt

# Then install scalekit without its dependency resolution
pip install scalekit-sdk-python==2.12.0 --no-deps

# Explicitly reinstate the versions actian-vectorai-client needs
pip install "protobuf>=6.31.1" "grpcio-status>=1.67.0"

This works because scalekit's <1.67 grpcio-status constraint is stale metadata. At runtime, the newer versions are compatible. The --no-deps flag skips the constraint check. Not a blessed install path from Scalekit's side, but it works and the combination has been stable.
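Since `--no-deps` bypasses pip's own consistency checks, it is worth verifying the result after installing. This is a small sketch using the standard library's `importlib.metadata`. The minimum versions are the ones from the install commands above. `version_tuple()` is a simplified, hypothetical parser that only reads the leading numeric release part:

```python
import re
from importlib.metadata import PackageNotFoundError, version
from typing import List, Tuple

# Minimums taken from the install commands above.
REQUIRED = {"protobuf": (6, 31, 1), "grpcio-status": (1, 67, 0)}


def version_tuple(text: str) -> Tuple[int, ...]:
    """Parse the leading numeric part, e.g. '1.67.0rc1' -> (1, 67, 0)."""
    match = re.match(r"\d+(?:\.\d+)*", text)
    return tuple(int(part) for part in match.group(0).split(".")) if match else ()


def check_installed() -> List[str]:
    """Return a list of problems; an empty list means the pins look right."""
    problems = []
    for name, minimum in REQUIRED.items():
        try:
            installed = version(name)
        except PackageNotFoundError:
            problems.append(f"{name} is not installed")
            continue
        if version_tuple(installed) < minimum:
            wanted = ".".join(map(str, minimum))
            problems.append(f"{name} {installed} is older than {wanted}")
    return problems


if __name__ == "__main__":
    for problem in check_installed():
        print("WARNING:", problem)
```

Running it right after the three install steps would flag the 5.29.6 downgrade before the import error does.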

The capacity behavior you should know before demo day

Community Edition caps at 5,000 vectors total, across all your collections combined. That is not the surprising part.

The surprising part: the cap is enforced asynchronously. Writes succeed past the limit. About 30 seconds later, a background enforcement task runs and blocks further writes:

CapacityExceededError: Vector capacity exceeded: 5,005 vectors stored, 
limit is 5,000. Delete vectors or upgrade your licence to continue.

During a demo, this means your inserts can succeed, your reads can succeed, and then a write half a minute later raises CapacityExceededError. Nothing at the point of the original inserts tells you why. Worth knowing before you have 10 people watching.
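The server only enforces the cap after the fact. One option is to keep a client-side count and refuse writes before you reach the limit. Here is a minimal sketch. It assumes a single writer process, because the count is not shared across processes. `VectorBudget` is a hypothetical wrapper you would call before each upsert. The 5,000 limit is the Community Edition cap described above, and the headroom of 100 is an arbitrary choice:

```python
import threading


class VectorBudget:
    """Client-side guard for a vector cap that the server enforces lazily.

    Assumes this process is the only writer. If the database already
    holds data, seed `used` from the real collection sizes at startup.
    """

    def __init__(self, limit: int = 5_000, headroom: int = 100, used: int = 0):
        self.soft_limit = limit - headroom
        self.used = used
        self._lock = threading.Lock()

    def reserve(self, count: int) -> None:
        """Claim room for `count` new vectors, or raise before writing."""
        with self._lock:
            if self.used + count > self.soft_limit:
                raise RuntimeError(
                    f"refusing write: {self.used} + {count} vectors would pass "
                    f"the soft limit of {self.soft_limit}"
                )
            self.used += count

    def release(self, count: int) -> None:
        """Give back room after deleting vectors (or after a failed upsert)."""
        with self._lock:
            self.used = max(0, self.used - count)
```

Call `budget.reserve(len(points))` just before each upsert, and call `release()` after deletes or a failed write. The failure then surfaces at the call that caused it, not 30 seconds later.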

The 30-day trial unlocks 1 million vectors. Get that set up before the day if you are planning anything beyond a few users.

Deploying to Render

VectorAI DB is Docker-only right now, which Render handles natively. Deploy actian/vectorai:latest directly as a private Docker service; no custom Dockerfile needed.

Two things the service needs:

ACTIAN_VECTORAI_ACCEPT_EULA=YES as an environment variable
A persistent disk mounted at /var/lib/actian-vectorai, or you lose all data on every redeploy. This cannot be set via render.yaml on an existing service. It has to be added manually through the Render dashboard.

Keep the VectorAI DB service private, not public-facing. Your agent app connects to it over Render's internal network at vectorai-db:6574.
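To run the same agent code locally and on Render, one approach is to read the address from an environment variable. Here is a sketch. `VECTORAI_ADDRESS` is a hypothetical variable name you would set to `vectorai-db:6574` on the Render app service. The default matches the local Docker setup above:

```python
import os

# Hypothetical variable: set VECTORAI_ADDRESS=vectorai-db:6574 on Render.
DEFAULT_ADDRESS = "localhost:6574"


def vectorai_address() -> str:
    """Return host:port for VectorAI DB, preferring the environment."""
    address = os.environ.get("VECTORAI_ADDRESS", "").strip() or DEFAULT_ADDRESS
    host, sep, port = address.rpartition(":")
    if not sep or not host or not port.isdigit():
        raise ValueError(f"expected host:port, got {address!r}")
    return address
```

The client construction then becomes `VectorAIClient(vectorai_address())` in both environments.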

Build this at the hackathon

On June 27 in San Francisco, Scalekit, Actian, and Render are running a build day focused on agents that act as real users with real permissions. If this is the problem you want to work on, register here.

Our team will be on-site all day. The participant guide is live here with the install commands, the per-user pattern, and everything else in this post in a format you can keep open during the build.

I Built a Python Agent That Uses a Vector DB as Memory, Not Retrieval

Greg Mate — Thu, 11 Jun 2026 17:37:12 +0000

Vector databases are almost always talked about in the context of RAG. Store your documents, embed them, retrieve the relevant chunks at inference time. That's the default pattern and it works — until it doesn't.

I've been working on Actian VectorAI DB and started wondering: what if the vector DB isn't a document store at all? What if it's a memory layer for an agent?

So I built it to find out.

The Idea

The distinction sounds subtle but it matters. In a classic RAG setup, you pre-load a vector store with documents. The corpus is static. The agent queries it but never changes it.

What I wanted to build was different. An agent that writes to the vector store as it runs — storing every interaction as a vector — and then searches its own past conversations semantically when it needs context. The corpus is built from the agent's own history, not from documents you loaded upfront.

The agent is the author of its own knowledge base.

The Stack

Everything runs locally. No cloud, no external API calls, nothing leaving the machine:

Actian VectorAI DB: vector store and semantic search
Ollama + llama3.2: local LLM
BAAI/bge-small-en-v1.5: embedding model
Python: the glue

The fully local constraint wasn't just a preference; it was core to the premise. If the agent is storing personal memory, it shouldn't be doing it in someone else's cloud.

How It Works

Every time you send the agent a message, it does four things:

Embeds your message as a vector
Searches VectorAI DB for semantically similar past interactions
Injects the relevant memories into the system prompt
Responds, then stores the full exchange back into VectorAI DB

See:

def chat(self, user_message: str) -> str:
    """Process a user message and return the assistant reply."""
    # 1. Embed the incoming message for semantic search
    query_vec = embed(user_message)

    # 2. Recall semantically relevant memories (cross-session by default).
    # score_threshold=0.50 prevents loosely-related memories from being injected
    # as context. min_importance=0.5 excludes low-confidence episodic fragments
    # (episodes are stored at 0.3, explicit facts at 0.9).
    past_memories = self.memory.recall(
        query_vector=query_vec,
        limit=5,
        score_threshold=0.50,
        min_importance=0.5,
    )

    # 3. Build system prompt with injected memories
    system_prompt = self._build_system_prompt(past_memories)

    # 4. Extend short-term conversation window
    self.conversation.append({"role": "user", "content": user_message})

    # 5. Call the local LLM via Ollama
    messages = [{"role": "system", "content": system_prompt}] + self.conversation
    response = self.llm.chat.completions.create(
        model=self.model,
        messages=messages,
    )
    assistant_reply = response.choices[0].message.content

    # 6. Append reply to short-term window
    self.conversation.append({"role": "assistant", "content": assistant_reply})

    # 7. Persist this exchange as an episodic long-term memory
    # Episodic importance is kept low (0.3) intentionally: the agent's own
    # replies may contain errors or hallucinations. Explicit facts stored via
    # remember_fact() use importance=0.9 and will always rank above episodes.
    memory_text = f"User said: {user_message}\nAgent replied: {assistant_reply}"
    memory_vec = embed(memory_text)
    self.memory.remember(
        content=memory_text,
        vector=memory_vec,
        session_id=self.session_id,
        memory_type="episode",
        importance=0.3,
    )

    return assistant_reply

The search is cross-session by default. A memory from last Tuesday will surface today if it's semantically close enough to what you're asking. The collection lives on disk via Docker volume so it persists across restarts.

There's also a remember: <fact> command that stores explicit facts at a higher importance score (0.9), separately from the episodic conversation log.
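The command handling can be a simple prefix check in front of `chat()`. Here is a sketch under some assumptions. `route_message()` is a hypothetical helper, and the prefix is matched case-insensitively. The 0.9 and 0.3 importance values are the ones described in this post:

```python
from typing import NamedTuple

FACT_IMPORTANCE = 0.9     # explicit facts from "remember: <fact>"
EPISODE_IMPORTANCE = 0.3  # ordinary chat exchanges


class Routed(NamedTuple):
    kind: str          # "fact" or "chat"
    text: str
    importance: float


def route_message(message: str) -> Routed:
    """Split 'remember: <fact>' commands from ordinary chat input."""
    prefix = "remember:"
    stripped = message.strip()
    if stripped.lower().startswith(prefix):
        fact = stripped[len(prefix):].strip()
        if not fact:
            raise ValueError("remember: needs a fact after the colon")
        return Routed("fact", fact, FACT_IMPORTANCE)
    return Routed("chat", stripped, EPISODE_IMPORTANCE)
```

A `"fact"` result would be embedded and stored directly with `memory_type="fact"`. A `"chat"` result goes through the normal loop.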

What Broke Along the Way

The embedding model defaulted to a HuggingFace download on first run, which immediately broke the fully local setup. Fixed it by loading the model with local_files_only=True and requiring a one-time manual download before the first run — so the embedding step is fully offline on every subsequent run.

The Memory Decay Problem

The first version had a flat importance score for every interaction. Every exchange stored at 0.6, explicit facts at 0.9. No decay, no forgetting — the collection just grew indefinitely. That's fine as a proof of concept but it's not how memory actually works. Old, rarely referenced memories shouldn't compete equally with recent, frequently accessed ones.

So I added importance-weighted decay. Every memory now gets scored on four signals before being returned:

age_hours        = (now - timestamp) / 3600        # timestamps in seconds
recency          = exp(-age_hours / 168)           # falls by 1/e per week (half-life ~4.9 days)
access_frequency = min(access_count / 10.0, 1.0)   # saturates at 10 accesses

final_score = (
    0.6 * cosine_similarity
  + 0.2 * importance
  + 0.15 * recency
  + 0.05 * access_frequency
)

Cosine similarity still does the heavy lifting — it has to, otherwise semantically irrelevant memories would surface. But recency and access frequency now influence ranking. A memory from six weeks ago that's never been referenced again will lose ground to a recent one, even if the raw cosine similarity is similar.

The weights and half-life are module-level constants so they're easy to tune without touching the logic.
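Here is a runnable version of that scoring, written as a re-ranking step applied to the hits a vector search returns. It is a sketch rather than the repo's code. The `Memory` dataclass and the `decay_score()` and `rerank()` names are hypothetical. It also assumes timestamps are Unix seconds and that the search returns a cosine similarity alongside each hit:

```python
import math
import time
from dataclasses import dataclass
from typing import List, Optional, Tuple

# Module-level tuning constants (values from the formula above).
W_SIMILARITY = 0.6
W_IMPORTANCE = 0.2
W_RECENCY = 0.15
W_FREQUENCY = 0.05
DECAY_HOURS = 168.0       # e-folding time of one week
FREQ_SATURATION = 10.0    # accesses at which the frequency term maxes out


@dataclass
class Memory:
    content: str
    importance: float
    timestamp: float       # Unix seconds
    access_count: int = 0


def decay_score(memory: Memory, cosine_similarity: float,
                now: Optional[float] = None) -> float:
    """Blend semantic similarity with importance, recency and access frequency."""
    now = time.time() if now is None else now
    age_hours = max(0.0, (now - memory.timestamp) / 3600)
    recency = math.exp(-age_hours / DECAY_HOURS)
    frequency = min(memory.access_count / FREQ_SATURATION, 1.0)
    return (W_SIMILARITY * cosine_similarity
            + W_IMPORTANCE * memory.importance
            + W_RECENCY * recency
            + W_FREQUENCY * frequency)


def rerank(hits: List[Tuple[Memory, float]],
           now: Optional[float] = None) -> List[Tuple[float, Memory]]:
    """Re-score (memory, cosine) pairs from a vector search, best first."""
    scored = [(decay_score(memory, sim, now), memory) for memory, sim in hits]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

Re-ranking after retrieval keeps the expensive nearest-neighbour search inside the database. One consequence is that a memory must first make the raw top `limit` by cosine to be considered at all. Over-fetching, for example `limit * 3`, and trimming after `rerank()` is one way to compensate.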

The recall path also tracks access — every time a memory surfaces in a query, its access_count increments and last_accessed updates. Memories that keep coming up stay relevant. Ones that don't, fade.
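The bookkeeping side is small. Here is a sketch that assumes payloads are loaded into a hypothetical `MemoryRecord` object. In a real store you would write the updated `access_count` and `last_accessed` back to each point's payload using your client's payload-update call. That call is not shown here because the exact VectorAI DB method has not been verified:

```python
import time
from dataclasses import dataclass
from typing import Iterable, Optional


@dataclass
class MemoryRecord:
    content: str
    access_count: int = 0
    last_accessed: Optional[float] = None   # Unix seconds; None means never recalled


def mark_recalled(records: Iterable[MemoryRecord], now: Optional[float] = None) -> None:
    """Bump access stats for every memory that surfaced in a recall."""
    now = time.time() if now is None else now
    for record in records:
        record.access_count += 1
        record.last_accessed = now
```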

Here's what the ranked output looks like against four synthetic test memories:

Rank  Score    Imp   Content
  1   0.9135   0.9   recent + high access (1 hr old, 8 accesses)
  2   0.6776   0.9   old + high importance (30 days, 0 accesses)
  3   0.6704   0.3   recent + no access (2 hrs old, 0 accesses)
  4   0.5112   0.3   old + low importance (60 days, 0 accesses)

The recent, frequently accessed memory dominates. The old, low-importance one drops to the bottom regardless of semantic similarity. That's the behavior you want from something calling itself memory.

The Hallucination Problem

Persistent memory introduces a risk that RAG pipelines don't have in the same way: if the agent hallucinates something and stores it, that hallucination gets recalled as a confident memory in the next session. The wrong information compounds.

Three risks needed fixing.

The LLM had no instruction to stay within recalled memories. The original system prompt said "use these memories when relevant" — permissive enough that the model would freely supplement from its training data when memory was thin. Three explicit rules were added: only use facts from the listed memories for personal claims, say "I don't know" when no memory covers a question, and never infer or guess personal details.

Hallucinated replies were stored and recalled as truth. Every exchange was stored at importance=0.6, meaning a hallucinated reply could be recalled next session and treated as a confident memory. Episodic importance was lowered to 0.3 — well below explicit facts at 0.9 — so bad replies can never outrank things the user deliberately told the agent.

Weakly-matched memories were being injected as context. The recall threshold was low enough to pull in semantically distant memories that could mislead the LLM. The threshold was raised and a min_importance filter added so episodic fragments are excluded from injection entirely. Only explicitly stored facts ever reach the LLM.

The importance ladder now looks like this:

importance=0.9  ->  explicit facts (remember: <fact>)   always recalled if score ≥ 0.50
importance=0.5  ->  the min_importance gate             <- filter line
importance=0.3  ->  episodic exchanges (chat history)   never recalled, never injected
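Applied as a client-side filter, the ladder reduces to two comparisons. Here is a sketch with a hypothetical `Hit` tuple. Whether the thresholds are applied by the database or after retrieval depends on how your recall wrapper is written:

```python
from typing import List, NamedTuple


class Hit(NamedTuple):
    content: str
    score: float        # similarity score from the vector search
    importance: float   # 0.9 for explicit facts, 0.3 for episodes


SCORE_THRESHOLD = 0.50
MIN_IMPORTANCE = 0.5


def memories_for_prompt(hits: List[Hit]) -> List[str]:
    """Keep only strong matches on explicitly stored facts."""
    return [
        hit.content
        for hit in hits
        if hit.score >= SCORE_THRESHOLD and hit.importance >= MIN_IMPORTANCE
    ]
```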

A test suite with 5 offline pytest tests guards all three risks — mocking both the memory store and the LLM call, then inspecting the messages array sent to the model before it responds.
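Here is a sketch of what one such test can look like with `unittest.mock`, written against a deliberately tiny stand-in agent. `TinyAgent` is hypothetical and only mirrors the parts of `chat()` the test touches. The real tests would target the repo's agent class:

```python
from unittest.mock import MagicMock


class TinyAgent:
    """Hypothetical stand-in: recall memories, build a prompt, call the LLM."""

    def __init__(self, memory, llm, model="llama3.2"):
        self.memory, self.llm, self.model = memory, llm, model

    def chat(self, user_message: str) -> str:
        memories = self.memory.recall(query_vector=[0.0], limit=5,
                                      score_threshold=0.50, min_importance=0.5)
        listed = "\n".join(f"- {m}" for m in memories) or "(none)"
        system = ("Only use facts from these memories for personal claims. "
                  "If none apply, say you don't know.\n" + listed)
        response = self.llm.chat.completions.create(
            model=self.model,
            messages=[{"role": "system", "content": system},
                      {"role": "user", "content": user_message}],
        )
        return response.choices[0].message.content


def test_only_gated_facts_reach_the_llm():
    memory = MagicMock()
    memory.recall.return_value = ["User's name is Sam"]
    llm = MagicMock()
    llm.chat.completions.create.return_value.choices = [
        MagicMock(message=MagicMock(content="Hi Sam"))
    ]

    TinyAgent(memory, llm).chat("What's my name?")

    # The recall call must carry the importance gate...
    assert memory.recall.call_args.kwargs["min_importance"] == 0.5
    # ...and the system prompt sent to the model must contain the recalled fact.
    messages = llm.chat.completions.create.call_args.kwargs["messages"]
    assert "User's name is Sam" in messages[0]["content"]
```

Both the memory store and the LLM are mocks, so the test runs offline, and the assertions inspect exactly what the model would have seen.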

5 passed in 10.56s ✓

What I Found

When I examined how VectorAI DB was actually being used in the implementation, the key finding was this:

The corpus is built dynamically from the agent's own past conversations, not from a pre-loaded document index. The agent is the author of its own knowledge base, which accumulates at runtime.

That's the thing that makes this memory rather than retrieval. It's a small shift in how you think about what a vector DB is for: not a document store you query at inference time, but a persistent layer that grows with the agent over time, and now one that forgets appropriately too.

The agent works. Cross-session recall is functioning, decay is verified, the stack is fully local.

What's Next

Testing retrieval quality as the memory grows over longer periods
Exploring what other use cases this pattern unlocks beyond conversation memory

Find the repo here. If you're working on anything in this space — agentic memory, local-first AI stacks, or just fighting with MCP setup — I'd love to hear what you're seeing in the comments.

We're running our first hackathon: Build with VectorAI DB, win Claude subscriptions

Greg Mate — Thu, 09 Apr 2026 09:39:39 +0000

The Actian VectorAI DB Build Challenge is our first community hackathon, and we want to see what you build. Solo or team, beginner or experienced, local or cloud. If you've been looking for a reason to actually ship something with a vector database, this is it.

April 13-18, 2026 | Virtual | Register on DoraHacks

What you're building

An AI application that solves a real, tangible problem using Actian VectorAI DB. It can run on your laptop, on a server, in the cloud, wherever. The only rule: VectorAI DB has to be a core part of your stack, not something you bolted on at the end.

Your project also needs to go beyond basic similarity search. Pick at least one of these:

Hybrid Fusion - combine multiple search signals into one ranked result. Not just meaning, not just keywords. Both, fused together.

What that looks like in practice: A job board that ranks candidates by semantic fit ("backend engineer who gets distributed systems") AND keyword match ("Golang, Kubernetes") merged into one list using RRF or DBSF.

Filtered Search - pair vector search with structured filters on your data so results are actually useful, not just semantically close.

What that looks like in practice: A campus event finder that understands what you're looking for but also filters by date, location, and student org. So you're finding events you can go to, not just events that sound similar.

Named Vectors / Multimodal - store and search across different data types in the same collection. Text, images, audio, whatever fits your idea.

What that looks like in practice: A study tool where you search your notes by typing a question or uploading a diagram. Both hit the same knowledge base, just through different vector spaces.

Bonus points for running locally, on ARM, or offline. No fixed weight, judges' call.
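For the Hybrid Fusion option, Reciprocal Rank Fusion is simple enough to write by hand. Each document scores the sum of 1/(k + rank) over every ranked list it appears in. Here is a sketch in plain Python using the conventional k=60. Whether VectorAI DB exposes fusion server-side is not covered here, so this version works on result lists you already have:

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def reciprocal_rank_fusion(rankings: Sequence[Sequence[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of IDs (best first) into one list.

    Each ID earns 1 / (k + rank) from every list it appears in, with rank
    starting at 1. IDs missing from a list simply get nothing from it.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for a stable order.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))
```

In the job-board example, a candidate ranked second semantically and first on keywords would beat one ranked first semantically and third on keywords. RRF only looks at positions, so the two search signals never need to be on comparable score scales.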

Not sure what to build?

Some starting points, but don't let these limit you:

A RAG app over any dataset you actually care about (research papers, course notes, documentation, news)
A semantic search tool with smart filters (campus events, job listings, study materials)
A recommendation engine that combines meaning and metadata
An anomaly detection or monitoring system
An AI agent with vector-powered memory
A multimodal search tool across text and images

Getting started

The database runs in Docker and works natively on Mac (including Apple Silicon), Linux, and Windows. No Rosetta, no platform flags needed.

# Clone the repo and start the database
docker compose up

# Install the Python client
pip install actian-vectorai

Not sure where to begin? Start with the featured RAG example:

pip install -r examples/rag/requirements.txt
python examples/rag/rag_example.py

It walks you through building a complete retrieval-augmented generation app from scratch. You'll have something running in under 10 minutes.

VectorAI DB handles storage and search. You bring your own embedding model. A good default to start with is sentence-transformers/all-MiniLM-L6-v2: fast, lightweight, and good enough for most text use cases.

pip install sentence-transformers

For the full API docs and more examples, check the repo README linked in Discord.

Prizes

🥇 1st place team: Claude Max 5x, 3 months per person

🥈 2nd place team: Claude Max 5x, 1 month per person

🥉 3rd place team: Claude Pro, 1 month per person

Teams of up to 4. Solo submissions welcome.

How we judge

Use of Actian VectorAI DB (30%): Is VectorAI DB doing real work in this app? Does the team know why they used it the way they did?
Real-world impact (25%): Does it solve something people actually care about? Would someone use this?
Technical execution (25%): Does it work? Is the code coherent and the architecture thought through?
Demo and presentation (20%): Can you explain what you built and why it matters?

How to submit

All submissions go through DoraHacks. You'll need a public GitHub or GitLab repo with a README, a working demo (video, Loom, or live link), and a short write-up covering what you built, why, and which technical requirement you used.

Results announced April 20 on Discord.

Join us

Discord for support, team formation, and progress sharing: discord.gg/432A2M63Py

Drop a comment if you're in. See you April 13.

Building Your First AI Agent Without Frameworks

Greg Mate — Fri, 13 Jun 2025 10:50:56 +0000

Want to understand how AI agents actually work? Let's build one from scratch before jumping into frameworks.

Most AI agent tutorials start with LangGraph or CrewAI, which are great tools, but they can make it hard to understand what's happening underneath.

An agent is really just a language model that can call functions. Once you understand that, frameworks make way more sense.

Today we're building a customer support system using OpenAI's API and Python. This will give you the fundamentals that make any agent framework easier to use and debug.

What we're building:

A routing system that decides which "specialist" handles each query
Function-calling agents that can search FAQs and analyze sentiment
Simple state management to track conversations
Logic to escalate to humans when needed

By the end, you'll understand how agents work under the hood, making you much more effective when you do use frameworks.

An Agent is Just an LLM with Tools

Seriously, that's all there is to it:

Language model with a specific job
Functions it can call
Logic to decide when to use them

Everything else is just orchestration.

Let's start with the simplest possible agent:

import json
import os
from typing import Callable, List

import openai

# Set up OpenAI (get your API key from https://platform.openai.com/api-keys)
openai.api_key = os.getenv("OPENAI_API_KEY")

class SimpleAgent:
    def __init__(self, name: str, role: str, tools: List[Callable[[str], str]]):
        self.name = name
        self.role = role
        self.tools = {tool.__name__: tool for tool in tools}

    def respond(self, message: str) -> str:
        # Create tool descriptions for the model
        tool_descriptions = []
        for name, func in self.tools.items():
            tool_descriptions.append({
                "type": "function",
                "function": {
                    "name": name,
                    "description": func.__doc__ or f"Function {name}",
                    "parameters": {
                        "type": "object",
                        "properties": {
                            "query": {"type": "string", "description": "The input query"}
                        },
                        "required": ["query"]
                    }
                }
            })

        # Call OpenAI with function calling
        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[
                {"role": "system", "content": self.role},
                {"role": "user", "content": message}
            ],
            tools=tool_descriptions,
            tool_choice="auto"  # Let the model decide when to use tools
        )

        # Handle function calls
        if response.choices[0].message.tool_calls:
            tool_call = response.choices[0].message.tool_calls[0]
            function_name = tool_call.function.name
            arguments = json.loads(tool_call.function.arguments)

            # Execute the function
            if function_name in self.tools:
                result = self.tools[function_name](arguments["query"])
                return f"{self.name}: {result}"

        # Regular response if no function call
        return f"{self.name}: {response.choices[0].message.content}"

# Test it out
def search_faq(query: str) -> str:
    """Search the FAQ database for answers"""
    faqs = {
        "shipping": "Standard shipping takes 3-5 business days",
        "refund": "Refunds processed within 5-7 business days",
        "return": "Returns accepted within 30 days"
    }

    for topic, answer in faqs.items():
        if topic in query.lower():
            return answer
    return "No FAQ found for that topic"

# Create an FAQ agent
faq_agent = SimpleAgent(
    name="FAQ Assistant",
    role="You're a helpful FAQ assistant. Use the search_faq function to find answers to customer questions.",
    tools=[search_faq]
)

# Test it
print(faq_agent.respond("How long does shipping take?"))
# FAQ Assistant: Standard shipping takes 3-5 business days

Done. You just built an AI agent. It understands questions, knows when to use its tool, and gives helpful answers.

Adding More Specialists

Now let's add agents that handle different stuff:

def analyze_sentiment(query: str) -> str:
    """Analyze the emotional tone of customer messages"""
    # Simple keyword approach; you could use Hugging Face Transformers (https://huggingface.co/docs/transformers/index) for a real sentiment model
    negative_words = ["angry", "frustrated", "terrible", "awful", "hate"]
    urgent_words = ["urgent", "immediately", "asap", "emergency"]

    query_lower = query.lower()

    if any(word in query_lower for word in urgent_words):
        return "URGENT: Customer needs immediate attention"
    elif any(word in query_lower for word in negative_words):
        return "NEGATIVE: Customer is frustrated, handle with care"
    else:
        return "NEUTRAL: Standard response appropriate"

def check_escalation_needed(query: str) -> str:
    """Determine if human escalation is needed"""
    escalation_triggers = [
        "speak to manager", "cancel account", "legal action", 
        "complaint", "lawsuit", "terrible service"
    ]

    if any(trigger in query.lower() for trigger in escalation_triggers):
        return "ESCALATE: Route to human agent immediately"
    else:
        return "CONTINUE: AI agent can handle this query"

# Create specialized agents
sentiment_agent = SimpleAgent(
    name="Sentiment Analyzer",
    role="You analyze customer emotions. Use analyze_sentiment to understand how the customer is feeling.",
    tools=[analyze_sentiment]
)

escalation_agent = SimpleAgent(
    name="Escalation Manager", 
    role="You decide when customers need human help. Use check_escalation_needed to evaluate queries.",
    tools=[check_escalation_needed]
)

The Router: Deciding Who Handles What

Here's where it gets interesting - we need something to decide which agent handles each message:

class AgentRouter:
    def __init__(self):
        self.agents = {
            "faq": faq_agent,
            "sentiment": sentiment_agent,
            "escalation": escalation_agent
        }
        self.conversation_history = []

    def route_query(self, query: str) -> str:
        """Decide which agent should handle this query"""

        # Save the conversation
        self.conversation_history.append({"role": "user", "content": query})

        # Basic routing - you could make this way smarter
        query_lower = query.lower()

        # Check for escalation triggers first
        if any(word in query_lower for word in ["manager", "complaint", "cancel", "lawsuit"]):
            agent_name = "escalation"
        # Check for emotional language
        elif any(word in query_lower for word in ["angry", "frustrated", "urgent", "terrible"]):
            agent_name = "sentiment"
        # Default to FAQ for standard questions
        else:
            agent_name = "faq"

        # Get response from the right agent
        agent = self.agents[agent_name]
        response = agent.respond(query)

        # Save that too
        self.conversation_history.append({"role": "assistant", "content": response})

        return f"[Routed to {agent_name.upper()}]\n{response}"

    def get_conversation_summary(self) -> str:
        """Get a summary of the conversation so far"""
        if not self.conversation_history:
            return "No conversation yet"

        summary = f"Conversation with {len(self.conversation_history)//2} exchanges:\n"
        for msg in self.conversation_history[-4:]:  # Last 2 exchanges
            role = "Customer" if msg["role"] == "user" else "Agent"
            summary += f"{role}: {msg['content']}\n"

        return summary

# Test the complete system
router = AgentRouter()

print("=== Customer Support Agent System ===\n")

# Test different types of queries
test_queries = [
    "How long does shipping take?",
    "I'm really frustrated with this terrible service!",
    "I want to speak to your manager right now!",
    "What's your return policy?"
]

for query in test_queries:
    print(f"Customer: {query}")
    response = router.route_query(query)
    print(f"{response}\n")

print("Conversation Summary:")
print(router.get_conversation_summary())

Making It Smarter: Let the AI Do the Routing

Keyword matching works, but we can do better. Let's use the LLM itself to make routing decisions:

class SmartRouter:
    def __init__(self):
        self.agents = {
            "faq": faq_agent,
            "sentiment": sentiment_agent, 
            "escalation": escalation_agent
        }
        self.conversation_history = []

    def smart_route(self, query: str) -> str:
        """Use AI to decide which agent should handle the query"""

        routing_prompt = f"""You're routing customer queries to specialists.

        Options:
        - faq: Standard questions about policies, shipping, returns
        - sentiment: Upset or frustrated customers  
        - escalation: Complex complaints or requests for managers

        Customer: "{query}"

        Which specialist? Just answer: faq, sentiment, or escalation"""

        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": routing_prompt}],
            temperature=0
        )

        agent_choice = response.choices[0].message.content.strip().lower()

        # Default to FAQ if something weird happens
        if agent_choice not in self.agents:
            agent_choice = "faq"

        # Get response from chosen agent
        agent_response = self.agents[agent_choice].respond(query)

        return f"[Smart routed to {agent_choice.upper()}]\n{agent_response}"

# Test smart routing
smart_router = SmartRouter()

print("=== Smart Routing Test ===\n")

smart_test_queries = [
    "My package is late and I'm getting married tomorrow!",
    "Do you accept international credit cards?", 
    "This is absolutely ridiculous, I want my money back immediately!",
    "Can I return something I bought 3 weeks ago?"
]

for query in smart_test_queries:
    print(f"Customer: {query}")
    response = smart_router.smart_route(query)
    print(f"{response}\n")
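Exact string matching on the model's reply is brittle. A reply like `FAQ.` or `I'd say escalation` falls through to the default. One hedged option is to normalize the reply before checking it. `parse_route()` below is a hypothetical helper, not part of the OpenAI SDK:

```python
import re
from typing import Iterable


def parse_route(reply: str, allowed: Iterable[str], default: str = "faq") -> str:
    """Map a free-text routing reply onto one of the allowed agent names."""
    names = list(allowed)
    # Lowercase and split into plain words, dropping punctuation.
    for word in re.findall(r"[a-z]+", reply.lower()):
        if word in names:
            return word  # first allowed name the model mentioned
    return default
```

Inside `smart_route()`, `agent_choice = parse_route(response.choices[0].message.content or "", self.agents)` would replace the strip, lower and membership check. Iterating the dict yields its keys, and the fallback to `faq` is kept.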

Adding Memory: Making Conversations Actually Work

Real support conversations build on what happened before. Here's how to add memory:

class MemoryAwareRouter:
    def __init__(self):
        self.agents = {
            "faq": faq_agent,
            "sentiment": sentiment_agent,
            "escalation": escalation_agent
        }
        self.conversation_memory = []
        self.customer_context = {
            "sentiment_history": [],
            "escalated": False,
            "resolved_issues": []
        }

    def process_with_memory(self, query: str) -> str:
        """Process query with full conversation context"""

        # Save current message
        self.conversation_memory.append({"role": "user", "content": query, "timestamp": "now"})

        # Build context summary
        context = self._build_context()

        routing_prompt = f"""Previous conversation context:
        {context}

        Current message: "{query}"

        Which specialist should handle this?
        - faq: Standard questions
        - sentiment: Emotional customers
        - escalation: Complex issues or if already escalated

        Consider the conversation history. Answer: faq, sentiment, or escalation"""

        response = openai.chat.completions.create(
            model="gpt-4o-mini",
            messages=[{"role": "user", "content": routing_prompt}],
            temperature=0
        )

        agent_choice = response.choices[0].message.content.strip().lower()
        if agent_choice not in self.agents:
            agent_choice = "faq"

        # Update customer context based on routing
        if agent_choice == "sentiment":
            self.customer_context["sentiment_history"].append("negative")
        elif agent_choice == "escalation":
            self.customer_context["escalated"] = True

        # Get enhanced response with context
        agent_response = self._get_contextual_response(agent_choice, query)

        # Add to memory
        self.conversation_memory.append({
            "role": "assistant", 
            "content": agent_response,
            "agent": agent_choice
        })

        return f"[Contextual routing to {agent_choice.upper()}]\n{agent_response}"

    def _build_context(self) -> str:
        """Build conversation context summary"""
        if not self.conversation_memory:
            return "New conversation"

        context = f"Conversation history: {len(self.conversation_memory)} messages\n"
        context += f"Customer escalated: {self.customer_context['escalated']}\n"
        context += f"Negative sentiment detected: {len(self.customer_context['sentiment_history'])} times\n"

        # Include last few exchanges
        recent = self.conversation_memory[-4:]
        for msg in recent:
            role = "Customer" if msg["role"] == "user" else f"Agent ({msg.get('agent', 'unknown')})"
            context += f"{role}: {msg['content'][:100]}...\n"

        return context

    def _get_contextual_response(self, agent_name: str, query: str) -> str:
        """Get response with conversation context"""
        agent = self.agents[agent_name]

        # Add context to the agent's response
        if self.customer_context["escalated"] and agent_name != "escalation":
            prefix = "[Customer previously escalated] "
        elif len(self.customer_context["sentiment_history"]) > 1:
            prefix = "[Customer has been frustrated multiple times] "
        else:
            prefix = ""

        response = agent.respond(query)
        return prefix + response

# Test memory-aware system
memory_router = MemoryAwareRouter()

print("=== Memory-Aware Conversation ===\n")

conversation_flow = [
    "What's your return policy?",
    "That's not good enough, I'm really frustrated!",
    "I want to speak to someone who can actually help me!",
    "Fine, what information do you need for the return?"
]

for query in conversation_flow:
    print(f"Customer: {query}")
    response = memory_router.process_with_memory(query)
    print(f"{response}\n")

What You Actually Built

You just created a complete customer support system using basic Python and OpenAI. Here's what you learned:

The fundamentals:

✅ Agents = LLM + functions + routing logic
✅ Function calling lets agents take actions
✅ Smart routing decides who handles what
✅ State management keeps conversations coherent
✅ Memory makes agents context-aware

Why this approach:

You'll understand what frameworks actually do for you
Easier to debug when things go wrong
You can customize behavior exactly how you want
Works with any LLM provider
Good foundation before learning frameworks

Making It Production Ready

To actually deploy this, you'd need:

The basics:

Error handling (APIs fail)
Database for conversation storage
Rate limiting (prevent abuse)
Proper logging

The nice-to-haves:

Real sentiment analysis model
Integration with your FAQ database
Actual escalation to humans (Slack API, email, etc.)
Analytics on what's working
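For the "database for conversation storage" item, here is a minimal sketch using Python's built-in sqlite3 module. The `ConversationStore` class, its table schema, and the `conversation_id` key are illustrative assumptions rather than part of the router above. `recent()` returns dicts with the same `role`/`agent`/`content` keys, oldest first, that `_build_context` reads from `conversation_memory`, and every query is filtered by conversation so one customer's turns never show up in another's context.

```python
import sqlite3
import time
from typing import Dict, List, Optional


class ConversationStore:
    """Persist chat turns in SQLite so memory survives process restarts."""

    def __init__(self, path: str = "conversations.db") -> None:
        self.conn = sqlite3.connect(path)
        self.conn.execute(
            """CREATE TABLE IF NOT EXISTS messages (
                   id INTEGER PRIMARY KEY AUTOINCREMENT,
                   conversation_id TEXT NOT NULL,
                   role TEXT NOT NULL,
                   agent TEXT,
                   content TEXT NOT NULL,
                   created_at REAL NOT NULL
               )"""
        )
        self.conn.commit()

    def append(self, conversation_id: str, role: str, content: str,
               agent: Optional[str] = None) -> None:
        # Parameterized query: user text is never formatted into the SQL string
        self.conn.execute(
            "INSERT INTO messages (conversation_id, role, agent, content, created_at) "
            "VALUES (?, ?, ?, ?, ?)",
            (conversation_id, role, agent, content, time.time()),
        )
        self.conn.commit()

    def recent(self, conversation_id: str, limit: int = 4) -> List[Dict]:
        # Newest `limit` rows for this conversation only, returned oldest-first
        rows = self.conn.execute(
            "SELECT role, agent, content FROM messages "
            "WHERE conversation_id = ? ORDER BY id DESC LIMIT ?",
            (conversation_id, limit),
        ).fetchall()
        return [{"role": r, "agent": a, "content": c} for r, a, c in reversed(rows)]


# ":memory:" keeps the demo self-contained; pass a file path to persist
store = ConversationStore(":memory:")
store.append("conv-1", "user", "What's your return policy?")
store.append("conv-1", "assistant", "30 days with a receipt.", agent="faq")
store.append("conv-2", "user", "Where is my order?")
```

SQLite is enough for a single instance; once several workers share conversations, the same interface can sit in front of Postgres or Redis.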

When frameworks make sense:
Now you understand what LangGraph, CrewAI, and AutoGen do - they handle the routing and orchestration you just built manually. They're great when:

You need complex multi-step workflows
You want pre-built integrations and tools
You're working on a team that benefits from standardized patterns
You need features like human-in-the-loop or advanced state management

The key is knowing when the abstraction helps versus when you need more control.

The Real Lesson

AI agents are organized LLMs with specific jobs and the ability to call functions. The "multi-agent" part is smart routing and state management.

Understanding these fundamentals makes you better at using any framework because you know what's happening underneath. Start here, then use frameworks when their features solve real problems you're facing.

Built something cool with this? I'd love to see what you made - drop it in the comments!

How to Prevent AI Agents From Breaking in Production

Greg Mate — Fri, 06 Jun 2025 12:21:12 +0000

Deploying AI agents in production is trickier than most teams expect. What works perfectly in development often becomes a reliability nightmare once real traffic hits.

Incident reports show clear patterns: the same few issues cause the majority of production failures.

42% of AI agent failures come from hallucinated API calls, and another 23% are GPU memory leaks. These aren't edge cases - they're systematic problems that need systematic solutions.

Here's what's actually breaking and how to prevent it.

Common failure patterns

Hallucinated API calls

LLMs generate code that looks correct but calls non-existent methods or deprecated endpoints. Traditional validation tools miss this because the code is syntactically valid - it just references APIs that don't exist in your environment.

Teams often spend significant time debugging what appears to be infrastructure issues when the root cause is the AI making incorrect assumptions about available APIs.

GPU memory leaks

A known vulnerability in AMD, Apple, and Qualcomm GPUs can cause AI workloads to leak over 180MB per inference cycle. In Kubernetes environments, this can cascade across pods and eventually crash entire nodes.

Standard monitoring often doesn't catch this until resource exhaustion is already occurring.

Cascading failures

AI agents are more interconnected than typical microservices. A single malformed operation can stall agent threads for extended periods, and recovery processes often reset accumulated context, leading to broader system failures.

Insufficient observability

Most teams monitor traditional infrastructure metrics but lack visibility into AI-specific behavior like GPU utilization patterns, token consumption, and model performance degradation.

Practical solutions

Constrain API generation

Instead of relying on post-generation validation, limit what the LLM can suggest in the first place by providing explicit API context:

# Extract what's actually available
global_deps = extract_imports(codebase)
local_deps = parse_function_calls(current_module)

# Tell the LLM what it can actually use
prompt = f"""
Available APIs: {global_deps}
Local functions: {local_deps}
Task: {user_request}
"""

Teams using dependency-constrained prompting report fewer API hallucinations. The approach is straightforward: if you don't tell the LLM about APIs that don't exist, it's less likely to invent them.
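For Python codebases, one way to build the `global_deps` and `local_deps` lists is the standard library's `ast` module. This is a minimal sketch: `extract_imports`, `local_functions` and `build_prompt` are hypothetical helper names, and the approach only sees names a file imports or defines, not which attributes actually exist on those modules.

```python
import ast
from typing import List


def extract_imports(source: str) -> List[str]:
    """Return the modules and names a Python source file imports."""
    names = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.Import):
            names.extend(alias.name for alias in node.names)
        elif isinstance(node, ast.ImportFrom):
            module = node.module or ""  # None for relative imports like "from . import x"
            names.extend(f"{module}.{alias.name}" for alias in node.names)
    return sorted(set(names))


def local_functions(source: str) -> List[str]:
    """Return names of functions and classes defined at module level."""
    tree = ast.parse(source)
    return [
        node.name
        for node in tree.body
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))
    ]


def build_prompt(source: str, user_request: str) -> str:
    # Only advertise what the file can really reach
    return (
        f"Available APIs: {', '.join(extract_imports(source))}\n"
        f"Local functions: {', '.join(local_functions(source))}\n"
        "Only call the APIs listed above.\n"
        f"Task: {user_request}\n"
    )


example = """
import requests
from tenacity import retry, stop_after_attempt

def fetch_order(order_id):
    return requests.get(f"https://api.example.com/orders/{order_id}")
"""
print(build_prompt(example, "Add retries to fetch_order"))
```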

Implement GPU resource controls

Set explicit resource limits in your container orchestration:

resources:
  limits:
    nvidia.com/gpu: 1
    memory: "4Gi"
  requests:
    memory: "4Gi"
    cpu: "2"

Monitor GPU memory usage and restart containers before they crash:

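#!/bin/bash
# Threshold in MiB (nvidia-smi reports MiB); 7500 is roughly 90% of an 8GB card
THRESHOLD=7500
while true; do
  # Use the highest usage across all GPUs on the node
  vram_usage=$(nvidia-smi --query-gpu=memory.used --format=csv,noheader,nounits | sort -n | tail -1)
  if [ "$vram_usage" -gt "$THRESHOLD" ]; then
    kubectl rollout restart deployment/ai-agent
    sleep 300  # give the rollout time to finish instead of restarting every 30s
  fi
  sleep 30
done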

Proactive restarts like this trade a brief rollout for avoiding a hard out-of-memory crash in production.

Version AI components as units

AI agents consist of multiple interdependent components: models, vector databases, prompt templates, and configuration. These should be versioned and deployed together:

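# ai-agent-chart/Chart.yaml
apiVersion: v2          # Helm 3 chart format
name: ai-agent-chart
version: 1.0.0          # illustrative; bump whenever any dependency version changes
dependencies:
  # Subcharts are expected under charts/; add a repository: field to pull them from a chart repo
  - name: llm-model
    version: "1.2.3"
  - name: vector-db
    version: "0.9.1"
  - name: prompt-templates
    version: "2.1.0"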

Deploying the entire bundle as a unit prevents version mismatches that can cause subtle but significant failures.

Add AI-specific monitoring

Traditional APM tools don't capture AI-specific metrics. You need to track GPU utilization, token consumption, and model performance alongside business outcomes. OpenTelemetry provides a good foundation for this:

import time

from opentelemetry import trace

tracer = trace.get_tracer(__name__)

# `model` and `count_tokens` stand in for your own inference client and tokenizer
def ai_inference(prompt, user_id):
    with tracer.start_as_current_span("ai_inference") as span:
        start_time = time.perf_counter()  # monotonic clock for measuring durations

        span.set_attribute("prompt.length", len(prompt))
        span.set_attribute("user.id", user_id)

        response = model.generate(prompt)

        span.set_attribute("response.length", len(response))
        span.set_attribute("inference.duration", time.perf_counter() - start_time)
        span.set_attribute("tokens.consumed", count_tokens(prompt + response))

        return response

Correlating these metrics with infrastructure data helps identify when GPU pressure affects response quality.

Build resilient fallback systems

Retry transient failures on external API calls with exponential backoff:

import requests
from tenacity import retry, stop_after_attempt, wait_exponential

@retry(
    stop=stop_after_attempt(3),
    wait=wait_exponential(multiplier=1, min=2, max=10),
    reraise=True,  # surface the original exception instead of tenacity's RetryError
)
def call_external_api(endpoint, payload):
    response = requests.post(endpoint, json=payload, timeout=10)
    response.raise_for_status()
    return response.json()
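Retries handle blips, but they keep hammering a dependency that is hard down. A circuit breaker stops calling it for a cool-down period after repeated failures, then lets a single trial call through. This is a minimal single-process, single-threaded sketch; `CircuitBreaker`, `CircuitOpenError` and their parameters are hypothetical names, and libraries such as pybreaker provide fuller versions.

```python
import time


class CircuitOpenError(RuntimeError):
    """Raised while the breaker is open and calls are being short-circuited."""


class CircuitBreaker:
    """Closed -> open after N failures -> half-open trial after a cool-down."""

    def __init__(self, failure_threshold: int = 5, reset_timeout: float = 30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self.clock = clock          # injectable so behaviour can be tested
        self.failures = 0
        self.opened_at = None       # None means the circuit is closed

    def call(self, fn, *args, **kwargs):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.reset_timeout:
                raise CircuitOpenError("circuit open; skipping call")
            # Cool-down elapsed: half-open, allow this one trial call
        try:
            result = fn(*args, **kwargs)
        except Exception:
            self.failures += 1
            # A failed trial, or too many failures, (re)opens the circuit
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()
            raise
        # Any success closes the circuit and clears the count
        self.failures = 0
        self.opened_at = None
        return result


breaker = CircuitBreaker(failure_threshold=3, reset_timeout=30)
# Usage: breaker.call(call_external_api, endpoint, payload)
```

Wrapping the retried function means one breaker failure represents a call that already exhausted its retries, which keeps the breaker from tripping on single transient errors.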

Have a clear escalation path when AI components fail:

def ai_with_fallback(user_request):
    try:
        return ai_agent.process(user_request)
    except AIAgentError:
        return rule_based_handler.process(user_request)
    except Exception:
        escalate_to_human(user_request)
        return "Request escalated to support team"

Making AI agents production-ready

AI agents in production require the same operational discipline as any other critical system. The difference is that they have unique failure modes that traditional monitoring and deployment practices don't address.

Teams that succeed treat AI agents as complex distributed systems with proper observability, resource management, and graceful degradation. The ones that struggle try to deploy them like traditional applications.

The good news is that once you address these systematic issues, AI agents become much more predictable and reliable in production environments.

Deploy AI Agents Without Infrastructure Headaches

Greg Mate — Fri, 30 May 2025 11:17:40 +0000

Platform engineers have a new nightmare: explaining to their CTO why the AI agent deployment that worked perfectly in staging is now burning through $50,000/month in production. The Terraform config looks flawless. The security groups are properly configured. The ECS tasks are healthy. But somehow, the vector database is choking on embeddings, the LLM gateway is routing traffic to the wrong regions, and the workflow orchestration is stuck in an infinite retry loop.

Traditional IaC tools weren't built for this complexity.

Traditional IaC Can't Handle AI Workloads

When ChatGPT generates your Terraform config, it looks perfect. But deploy it and everything breaks:

# This looks right but will fail in production
resource "aws_security_group" "ai_agent" {
  name = "ai-agent-sg"

  ingress {
    from_port   = 443
    to_port     = 443
    protocol    = "tcp"
    cidr_blocks = ["0.0.0.0/0"]  # ❌ Too permissive
  }
}

resource "aws_ecs_service" "ai_agent" {
  name            = "ai-agent"
  cluster         = aws_ecs_cluster.main.id
  task_definition = aws_ecs_task_definition.ai_agent.arn

  # ❌ Missing: vector DB networking, LLM provider configs, 
  # retry policies, cost controls, monitoring...
}

LLMs generating IaC are trained on public examples, not production systems. They miss vector database networking, multi-provider LLM failover, and other complexities that break under real traffic.

AI agents need completely different infrastructure:

Traditional Layer:         AI-Specific Layer:
- Compute (ECS/Lambda)     - Vector Database (Pinecone/Weaviate)
- Storage (S3/EBS)         - LLM Gateway (Multi-provider routing)
- Database (RDS)           - Workflow Orchestration (Temporal/Prefect)
- Networking (VPC/ALB)     - Model Serving & State Management

Each has its own failure modes and scaling patterns that traditional IaC treats as generic cloud resources.

What Actually Works

Pulumi for AI Infrastructure

Pulumi has native AI providers that treat vector databases and LLM gateways as real infrastructure. The trade-off? Your team needs to learn TypeScript/Python instead of HCL, and you're betting on a smaller ecosystem than Terraform's.

Alternative approaches:

Custom Terraform providers - Build your own for AI services (more work, but stays in Terraform)
Terraform + scripts - Use Terraform for basic infra, scripts for AI-specific parts
AWS CDK - Good if you're AWS-only

import * as pinecone from "@pulumi/pinecone";
import * as temporal from "@pulumi/temporal";

// Native vector database support
const vectorIndex = new pinecone.Index("knowledge-base", {
    name: "customer-support-kb",
    metric: "cosine",
    dimension: 1536,
    spec: {
        serverless: {
            cloud: "aws",
            region: "us-east-1"
        }
    }
});

// Workflow orchestration as code
const aiWorkflow = new temporal.Namespace("ai-workflows", {
    namespace: "customer-support",
    retention: "7d"
});

Temporal Handles Complex AI Workflows

Temporal manages the orchestration that AI agents need. Downsides: another system to operate, and your team needs to learn workflow concepts.

Alternatives:

Prefect - Similar to Temporal but more Python-native
Step Functions - AWS-native, simpler but less powerful
Kubernetes Jobs - If you want to stay close to K8s

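from datetime import timedelta

from temporalio import workflow
from temporalio.common import RetryPolicy

# search_knowledge_base and call_llm_with_context are activities registered
# with the worker; needs_human_review must be a pure, deterministic function

@workflow.defn
class CustomerSupportAgent:
    def __init__(self) -> None:
        self.approved = False

    @workflow.signal
    def approve(self) -> None:
        # A reviewer (or another service) sends this signal to unblock the workflow
        self.approved = True

    @workflow.run
    async def handle_request(self, user_query: str) -> str:
        # Survives infrastructure failures: activity results are recorded in history
        context = await workflow.execute_activity(
            search_knowledge_base,
            user_query,
            start_to_close_timeout=timedelta(seconds=30),
        )

        # Automatic retries with backoff (each attempt still needs a timeout)
        response = await workflow.execute_activity(
            call_llm_with_context,
            {"query": user_query, "context": context},
            start_to_close_timeout=timedelta(seconds=60),
            retry_policy=RetryPolicy(maximum_attempts=3),
        )

        # Long-running workflows (hours/days/weeks): wait durably for approval
        if needs_human_review(response):
            await workflow.wait_condition(lambda: self.approved)

        return response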

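Cost controls can live in the same infrastructure code. For example, a Pulumi component can run training on Fargate Spot while inference runs on steady capacity:

import pulumi
import pulumi_aws as aws

class CostOptimizedAI(pulumi.ComponentResource):
    def __init__(self, name: str, opts=None):
        # Component resources register under a custom type token
        super().__init__("custom:ai:CostOptimizedAI", name, None, opts)
        child = pulumi.ResourceOptions(parent=self)

        # Fargate Spot capacity for interruptible training jobs
        self.training_cluster = aws.ecs.Cluster(f"{name}-training", opts=child)
        aws.ecs.ClusterCapacityProviders(
            f"{name}-training-capacity",
            cluster_name=self.training_cluster.name,
            capacity_providers=["FARGATE_SPOT"],
            opts=child,
        )

        # Steady on-demand capacity for production inference
        # (cluster/task_definition omitted; calculate_optimal_capacity is your own sizing logic)
        self.inference_service = aws.ecs.Service(
            f"{name}-inference",
            desired_count=self.calculate_optimal_capacity(),
            opts=child,
        )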

Security and Operational Considerations

API Key Management:

Use AWS Secrets Manager or Azure Key Vault for LLM API keys
Rotate keys on a schedule (AWS Secrets Manager and Azure Key Vault both support automated rotation)
Never put API keys in your IaC code - use secret references
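On the application side, a secret reference means the deployment passes only the secret's name or ARN, and the code resolves the value at startup. Here is a minimal sketch using boto3's Secrets Manager client. The secret name and the `api_key` JSON field are assumptions, and the client is injectable so the function can be exercised without AWS.

```python
import json
from typing import Any, Optional


def get_llm_api_key(secret_id: str, client: Optional[Any] = None,
                    field: str = "api_key") -> str:
    """Resolve an LLM API key from AWS Secrets Manager at runtime.

    The IaC only supplies `secret_id` (a name or ARN), never the key itself.
    """
    if client is None:
        import boto3  # imported lazily so tests can inject a fake client

        client = boto3.client("secretsmanager")
    secret = client.get_secret_value(SecretId=secret_id)["SecretString"]
    try:
        parsed = json.loads(secret)
    except ValueError:
        return secret  # stored as a plain-text secret
    # JSON secrets: pick the field; a missing field raises KeyError loudly
    return parsed[field] if isinstance(parsed, dict) else secret
```

Call it once at startup (or cache the result) rather than per request, so rotation is picked up on the next deploy or restart without adding a Secrets Manager call to every inference.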

Rollback Strategy:

AI infrastructure changes can break in subtle ways
Always test rollbacks in staging first
Keep vector database backups before schema changes
Use blue-green deployments for model updates

Team Training:

Budget 2-4 weeks for engineers to learn Pulumi + Temporal
Start with one person, then spread knowledge
Document your AI infrastructure patterns for the team

Monitoring That Actually Matters

Regular monitoring misses what matters for AI systems, and with AI infrastructure spending projected to hit $223 billion by 2028, the cost of that blind spot keeps growing. Track AI-specific metrics next to the usual ones:

import * as aws from "@pulumi/aws";
import * as pulumi from "@pulumi/pulumi";

const aiMetrics = new aws.cloudwatch.Dashboard("ai-observability", {
    dashboardName: "ai-observability",
    dashboardBody: pulumi.jsonStringify({
        widgets: [{
            type: "metric",
            properties: {
                region: "us-east-1", // set to the region you deploy in
                metrics: [
                    // Traditional metrics (add ClusterName/ServiceName dimensions for your service)
                    ["AWS/ECS", "CPUUtilization"],
                    ["AWS/ECS", "MemoryUtilization"],

                    // AI-specific custom metrics; your application must publish these itself
                    ["AI/VectorDB", "QueryLatency"],
                    ["AI/LLM", "TokensPerSecond"],
                    ["AI/LLM", "ResponseQuality"],
                    ["AI/Workflow", "CompletionRate"],
                    ["AI/Cost", "DollarPerInteraction"]
                ],
                title: "AI System Health"
            }
        }]
    })
});

// Alert on cost spikes
const costSpike = new aws.cloudwatch.MetricAlarm("ai-cost-spike", {
    comparisonOperator: "GreaterThanThreshold",
    namespace: "AI/Cost",
    metricName: "DollarPerInteraction",
    statistic: "Average",
    period: 300,          // evaluate 5-minute averages
    evaluationPeriods: 1,
    threshold: 0.50, // Alert if cost per interaction > $0.50
    alarmDescription: "AI infrastructure costs spiking"
});
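CloudWatch only has a `DollarPerInteraction` metric in the `AI/Cost` namespace if the application publishes it. Here is a minimal sketch using boto3's `put_metric_data`. The per-token prices are placeholder values, not any provider's actual rates, and the client is injectable so the payload can be checked without AWS.

```python
from typing import Any, Optional

# Hypothetical prices per 1K tokens; substitute your provider's current rates
PRICE_PER_1K_INPUT = 0.00015
PRICE_PER_1K_OUTPUT = 0.0006


def interaction_cost(input_tokens: int, output_tokens: int) -> float:
    """Dollar cost of one interaction from its token counts."""
    return (input_tokens / 1000 * PRICE_PER_1K_INPUT
            + output_tokens / 1000 * PRICE_PER_1K_OUTPUT)


def publish_interaction_cost(input_tokens: int, output_tokens: int,
                             client: Optional[Any] = None) -> dict:
    """Send the cost as a custom CloudWatch metric the alarm can evaluate."""
    if client is None:
        import boto3  # lazy import so the function can run against a fake client

        client = boto3.client("cloudwatch")
    payload = {
        "Namespace": "AI/Cost",  # must match the dashboard and alarm namespace
        "MetricData": [{
            "MetricName": "DollarPerInteraction",
            "Value": interaction_cost(input_tokens, output_tokens),
            "Unit": "None",
        }],
    }
    client.put_metric_data(**payload)
    return payload
```

One API call per interaction gets expensive at high volume. Batching several data points into one `MetricData` list, or emitting CloudWatch Embedded Metric Format logs, are the usual alternatives.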

What Teams Are Seeing

People adopting AI-native infrastructure report significant improvements:

10-100x lower costs with serverless vector databases vs. provisioned capacity
Self-hosted models can cost significantly less than API-based solutions for high-volume workloads

Companies using Temporal for AI workflows report significantly reduced debugging time and improved reliability for long-running AI processes.

Start here:

Check your AI costs - How much are you spending compared to self-hosted options?
Pick one AI workflow to rebuild as a test
Try Pulumi with Pinecone - deploy a test vector database

Next month:

Move critical AI workflows to Temporal
Set up cost monitoring and alerts
Add AI-specific observability

Companies building reliable, cheap AI infrastructure stopped using traditional IaC tools. They switched to AI-native approaches that treat AI workloads properly.

Your call: Keep fighting with Terraform and burning money, or use patterns that actually work.

AI Deployment: Why Serverless is Perfect (and Terrible)

Greg Mate — Wed, 28 May 2025 10:40:29 +0000

Your AI agent works perfectly in development. You've tested the reasoning chains, the tool integrations are solid, and the responses are exactly what users need. Then you deploy to production and everything breaks.

The timeout kills your multi-step workflows after 15 minutes. Your bundle exceeds the 250MB limit because you need scikit-learn, pandas, and a vector database client. Cold starts take 6+ seconds while your models load, making real-time interactions impossible.

Sound familiar? You're not alone. One developer working on an e-commerce recommendation engine discovered that "scikit-learn and pandas libraries increased the size of my deployment package beyond the AWS Lambda package limits." Another found that loading their TensorFlow model caused API calls to time out after 29 seconds.

Here's the thing: serverless isn't broken for AI. You're just hitting the boundaries of what it was designed for. Traditional serverless platforms were built for quick, stateless web requests—not long-running AI agent workflows that need to maintain context, load large models, and perform complex reasoning chains.

But before you abandon serverless entirely, understand this: for certain AI workloads, serverless is absolutely perfect. The question isn't whether to use serverless for AI—it's knowing when it works brilliantly and when it fails catastrophically.

When Serverless Shines for AI Deployments

Serverless excels in three specific AI scenarios that traditional infrastructure can't match.

Unpredictable Traffic Patterns

AI applications often experience extreme traffic variability. Your chatbot gets mentioned in a tweet and suddenly handles 1000x its normal load. A content generation API processes 10 requests per hour during quiet periods, then 1,000 per hour during marketing campaigns.

Serverless platforms scale automatically from zero to high concurrency without capacity planning. AWS Lambda allows 1,000 concurrent executions per Region by default (a quota you can raise), and each function can add up to 1,000 concurrent executions every 10 seconds as demand grows. You pay only for actual compute time, not for idle servers waiting for the next AI inference request.
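To check whether a spike fits inside the default quota, AWS's sizing rule of thumb is concurrency ≈ requests per second × average duration in seconds. LLM-backed calls are slow, so duration matters as much as traffic. A small sketch of that arithmetic; the traffic numbers are made up for illustration, and the 1,000 figure is the default per-Region quota shared by all functions in the account.

```python
import math

LAMBDA_DEFAULT_CONCURRENCY = 1000  # default per-Region quota, shared by all functions


def required_concurrency(requests_per_second: float, avg_duration_s: float) -> int:
    """Concurrent executions needed: arrival rate x time each request is in flight."""
    return math.ceil(requests_per_second * avg_duration_s)


def fits_default_quota(requests_per_second: float, avg_duration_s: float) -> bool:
    return required_concurrency(requests_per_second, avg_duration_s) <= LAMBDA_DEFAULT_CONCURRENCY


# 200 req/s with 4-second LLM calls: 800 concurrent executions, within the quota
print(required_concurrency(200, 4), fits_default_quota(200, 4))  # 800 True
# Same traffic with 8-second calls: 1,600, so requests would be throttled
print(required_concurrency(200, 8), fits_default_quota(200, 8))  # 1600 False
```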

Event-Driven AI Processing

Many AI workflows fit perfectly into event-driven patterns. Document uploaded → extract text → summarize content. New customer signup → analyze preferences → generate personalized recommendations. Code commit → run AI code review → post feedback.

These discrete, triggered operations align with serverless strengths. Each event spawns an independent function execution that processes the task and terminates. No need to manage background services or polling mechanisms.
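For example, the document-uploaded-then-summarize flow maps directly onto an S3-triggered Lambda function. Here is a minimal Python sketch. `summarize_document` is a hypothetical stand-in for the text extraction and LLM calls; the `Records[].s3.bucket.name` / `Records[].s3.object.key` shape is the standard S3 event notification format.

```python
import urllib.parse


def summarize_document(bucket: str, key: str) -> str:
    """Placeholder: fetch the object, extract its text, and call an LLM."""
    return f"summary of s3://{bucket}/{key}"


def lambda_handler(event, context):
    """Invoked by S3 ObjectCreated notifications; each record is independent."""
    results = []
    for record in event.get("Records", []):
        bucket = record["s3"]["bucket"]["name"]
        # S3 URL-encodes object keys in notifications (a space arrives as '+')
        key = urllib.parse.unquote_plus(record["s3"]["object"]["key"])
        results.append(summarize_document(bucket, key))
    return {"processed": len(results), "summaries": results}
```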

Simple Inference Tasks

Lightweight AI operations—sentiment analysis, text classification, simple embeddings generation—work excellently in serverless environments. These tasks typically complete within seconds, use manageable dependencies, and don't require complex state management.

A sentiment analysis API using a pre-trained model can process requests in under 100ms with warm starts, providing excellent user experience while benefiting from serverless cost efficiency.
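Warm starts only help if the model is loaded outside the per-request path. In Python Lambda functions, the common approach is to cache it at module scope, which lives as long as the execution environment. Here is a sketch under that assumption. The keyword-based `predict` is a toy stand-in for a real pre-trained model, and `_load_model` is a hypothetical loader.

```python
import time

_model = None  # cached per execution environment; reused by warm invocations


def _load_model():
    """Hypothetical loader; a real one would deserialize a trained classifier."""
    time.sleep(0.01)  # stands in for slow model loading on a cold start
    positive = {"great", "love", "thanks", "perfect"}
    negative = {"late", "broken", "refund", "terrible"}

    def predict(text: str) -> str:
        words = set(text.lower().split())
        score = len(words & positive) - len(words & negative)
        return "positive" if score > 0 else "negative" if score < 0 else "neutral"

    return predict


def get_model():
    global _model
    if _model is None:  # only the first (cold) invocation pays the load cost
        _model = _load_model()
    return _model


def lambda_handler(event, context):
    return {"sentiment": get_model()(event["text"])}
```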

The Serverless Reality Check

The problems start when your AI workloads bump against fundamental serverless constraints.

Timeout Limitations Kill Complex Workflows

AWS Lambda caps execution at 15 minutes. Vercel Functions limits vary by plan: 60 seconds on Hobby, 300 seconds on Pro, 900 seconds on Enterprise. Cloudflare Workers allows unlimited wall-clock time but restricts CPU time to 5 minutes.

Multi-step AI agent workflows routinely exceed these limits. Consider a research agent that:

Searches multiple data sources (2-3 minutes)
Processes and analyzes findings (3-5 minutes)
Generates comprehensive report (5-8 minutes)
Formats and delivers output (1-2 minutes)

Total runtime: 11-18 minutes. The upper end exceeds Lambda's 15-minute cap, and even the fastest run exceeds Vercel's Hobby and Pro limits, so the function gets killed before the workflow completes.

Real-world example: AI agents performing "extract, transform, and load (ETL) jobs and content generation workflows such as creating PDF files or media transcoding require fast, scalable local storage to process large amounts of data quickly"—operations that frequently exceed serverless timeout constraints.
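The research-agent arithmetic can be checked against the wall-clock limits quoted above. This sketch leaves Cloudflare Workers out because its 5-minute cap applies to CPU time, not elapsed time. It also reports the longest single step: none exceeds 8 minutes, which is what makes splitting the workflow into separate invocations viable. The step table simply restates the example; the names are illustrative.

```python
# Step durations in minutes (min, max), restating the research-agent example
STEPS = {
    "search sources": (2, 3),
    "analyze findings": (3, 5),
    "generate report": (5, 8),
    "format and deliver": (1, 2),
}

# Wall-clock limits quoted above, in minutes
LIMITS = {
    "Vercel Hobby": 1,
    "Vercel Pro": 5,
    "AWS Lambda / Vercel Enterprise": 15,
}


def total_runtime(steps):
    """Best-case and worst-case end-to-end runtime."""
    lo = sum(low for low, _ in steps.values())
    hi = sum(high for _, high in steps.values())
    return lo, hi


def longest_step(steps):
    """Worst-case duration of the slowest single step."""
    return max(high for _, high in steps.values())


def check_limits(steps, limits):
    lo, hi = total_runtime(steps)
    report = {}
    for platform, limit in limits.items():
        if hi <= limit:
            report[platform] = "always fits"
        elif lo <= limit:
            report[platform] = "fits only on fast runs"
        else:
            report[platform] = "never fits"
    return report


print(total_runtime(STEPS))  # (11, 18)
print(longest_step(STEPS))   # 8
print(check_limits(STEPS, LIMITS))
```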

Bundle Size Problems Block AI Dependencies

Traditional serverless deployments face severe size restrictions:

AWS Lambda ZIP packages: 50MB compressed, 250MB uncompressed
Vercel Functions: 250MB uncompressed including layers
Cloudflare Workers: 3MB free, 10MB paid plans

Popular AI libraries routinely exceed these limits. Scikit-learn, pandas, numpy, and scipy together often surpass 250MB. Add a vector database client like Pinecone or Weaviate, plus an LLM SDK, and you're well beyond platform constraints.
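Before choosing between a ZIP and a container image, it helps to measure what your dependencies actually occupy once installed. Here is a minimal sketch: install into a local folder (for example with `pip install -r requirements.txt --target build/`) and sum the file sizes against the 250MB uncompressed limit. The function names are hypothetical, and MB here means 1,048,576 bytes.

```python
import os

LAMBDA_UNZIPPED_LIMIT_MB = 250  # ZIP deployments, including layers


def dir_size_mb(path: str) -> float:
    """Total size of all regular files under `path`, in MB (1 MB = 1,048,576 bytes)."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            full = os.path.join(root, name)
            if not os.path.islink(full):  # don't count symlink targets twice
                total += os.path.getsize(full)
    return total / (1024 * 1024)


def check_bundle(path: str, limit_mb: float = LAMBDA_UNZIPPED_LIMIT_MB) -> bool:
    """Print the bundle size and return True if it fits under the limit."""
    size = dir_size_mb(path)
    print(f"{path}: {size:.1f} MB of {limit_mb} MB")
    return size <= limit_mb


# Usage after installing dependencies into build/:
# check_bundle("build")
```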

The introduction of AWS Lambda container images (up to 10GB) fundamentally changes this landscape, but requires more complex deployment processes and sacrifices some serverless simplicity.

Cold Start Performance Destroys User Experience

AI workloads suffer dramatically from cold start penalties. Research shows that 99.9% of cold starts take up to 6.99 seconds for Java-based AI applications, while warm starts complete in just 33 milliseconds.

Loading TensorFlow models during cold starts can cause initial API calls to time out after 29 seconds, though subsequent warm function calls process images in under one second. This unpredictable performance makes serverless unsuitable for real-time AI interactions where users expect immediate responses.

The cold start penalty compounds with AI complexity: larger models, more dependencies, and initialization-heavy frameworks all extend startup times beyond acceptable user experience thresholds.

Making Serverless Work: Practical Patterns

You can work around serverless limitations with architectural patterns designed for AI workloads.

1. Workflow Suspension and Resume

Break long-running AI processes into discrete steps with state persistence between invocations. Each step saves progress to external storage, enabling the next function to continue from checkpoint.

// Step 1: Initial Analysis
export const analyzeInput = async (event) => {
  const analysis = await performAnalysis(event.input);

  // Save state to Redis/DynamoDB
  await saveState(event.workflowId, { 
    step: 'analysis',
    result: analysis,
    nextStep: 'generate'
  });

  // Trigger next step
  await triggerNextStep(event.workflowId);

  return { status: 'processing', workflowId: event.workflowId };
};

// Step 2: Content Generation  
export const generateContent = async (event) => {
  const state = await loadState(event.workflowId);
  const content = await generateFromAnalysis(state.result);

  await saveState(event.workflowId, {
    step: 'complete',
    finalResult: content
  });

  return { status: 'complete', result: content };
};

This pattern lets a workflow run far longer than any single function's timeout, as long as each step stays within the limit and progress is persisted between steps.
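The same checkpoint-and-resume idea in Python, reduced to its core: each invocation loads the saved state, runs exactly one step, saves, and returns. `InMemoryStateStore` is a hypothetical stand-in for Redis or DynamoDB. In a real deployment, a `"processing"` status would trigger the next invocation, for example through a queue message.

```python
from typing import Callable, Dict, List, Tuple


class InMemoryStateStore:
    """Stand-in for Redis/DynamoDB: maps workflow_id -> saved state."""

    def __init__(self) -> None:
        self._data: Dict[str, dict] = {}

    def save(self, workflow_id: str, state: dict) -> None:
        self._data[workflow_id] = dict(state)

    def load(self, workflow_id: str) -> dict:
        return dict(self._data.get(workflow_id, {"step_index": 0, "result": None}))


def run_next_step(store, workflow_id: str,
                  steps: List[Tuple[str, Callable]], payload) -> dict:
    """Run exactly one step per invocation, checkpointing before returning."""
    state = store.load(workflow_id)
    i = state["step_index"]
    if i >= len(steps):
        return {"status": "complete", "result": state["result"]}
    name, fn = steps[i]
    # The first step gets the original payload; later steps get the previous result
    state["result"] = fn(payload if i == 0 else state["result"])
    state["step_index"] = i + 1
    store.save(workflow_id, state)
    done = state["step_index"] == len(steps)
    return {"status": "complete" if done else "processing",
            "step": name, "result": state["result"]}
```

Because the step index is saved only after a step finishes, a crash mid-step reruns that step on the next invocation, so each step should be idempotent.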

2. External State Management

AI agents require sophisticated state management beyond serverless stateless models. Externalize all persistent data to dedicated storage:

Redis/ElastiCache: Conversation context, short-term agent memory
PostgreSQL/MongoDB: Long-term user preferences, interaction history
Vector databases: Embeddings storage for semantic search and RAG

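// Assumes an ioredis-style client (node-redis v4 names the method setEx)
export const chatAgent = async (event) => {
  // Load conversation context; Redis stores strings, so parse the JSON
  const raw = await redis.get(`chat:${event.userId}`);
  const context = raw ? JSON.parse(raw) : { messages: [] };

  // Process with context
  const response = await generateResponse(event.message, context);

  // Update conversation state; the key expires after 1 hour of inactivity
  await redis.setex(`chat:${event.userId}`, 3600, JSON.stringify({
    messages: [...context.messages, event.message, response],
    lastActivity: Date.now()
  }));

  return response;
};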

3. Container-Based Deployment

Use AWS Lambda container images to eliminate bundle size constraints. Include complete AI frameworks and pre-trained models within container deployments.

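FROM public.ecr.aws/lambda/python:3.9

# Install dependencies first so this layer stays cached between code changes
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy model files during build (baked into the image, no download at cold start)
COPY models/ ${LAMBDA_TASK_ROOT}/models/

COPY app.py ${LAMBDA_TASK_ROOT}

# Handler: the lambda_handler function in app.py
CMD ["app.lambda_handler"]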

Container deployment enables 10GB packages while maintaining serverless operational benefits, though with increased deployment complexity.

4. Smart Cold Start Mitigation

Implement strategies to minimize cold start impact:

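Model Pre-warming: Invoke the function on a schedule so an execution environment stays warm with the model already loaded. This keeps about one environment warm; Provisioned Concurrency is the managed option when you need more: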

// Scheduled every 5 minutes
export const keepWarm = async () => {
  const modelExists = await checkModelAvailability();
  if (!modelExists) {
    await downloadAndCacheModel();
  }
  return { status: 'model ready' };
};

Progressive Response: Return immediate acknowledgment, then stream results:

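export const aiInference = async (event) => {
  // Immediate response
  const responseId = generateId();
  await sendInitialResponse(responseId);

  // Hand the heavy work to a queue or async invocation (enqueueJob is a placeholder).
  // Lambda freezes the environment once the handler returns, so an un-awaited
  // background promise may never finish.
  await enqueueJob(event.input, responseId);

  return { responseId, status: 'processing' };
};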

Platform-Specific Considerations

AWS Lambda: Enterprise-Grade with Complexity Trade-offs

Strengths: Longest timeouts (15 minutes), container support up to 10GB, mature ecosystem, Provisioned Concurrency for predictable performance.

Best for: Complex AI workflows, enterprise deployments requiring compliance and integration with AWS services.

Limitations: Cold start performance, complex configuration for container deployments.

Vercel Functions: Developer Experience with Timeout Constraints

Strengths: Excellent developer experience, edge distribution, Fluid Compute for extended durations.

Best for: Simple AI APIs, content generation workflows, applications prioritizing deployment simplicity.

Limitations: Aggressive timeout limits (60 seconds on free tier), bundle size restrictions persist.

Cloudflare Workers: Global Edge with Memory Constraints

Strengths: Global edge distribution, unlimited wall-clock time, recent CPU limit increases to 5 minutes.

Best for: Real-time AI inference requiring global distribution, lightweight AI operations.

Limitations: 128MB memory limit, 10MB maximum bundle size, V8 runtime restrictions.

When NOT to Use Serverless for AI

Certain AI workloads fundamentally conflict with serverless constraints:

Always-On AI Agents: Customer service bots, monitoring systems, and agents requiring continuous availability benefit from dedicated infrastructure avoiding cold start penalties.

Heavy Model Inference: Large language models that need more memory than Lambda's 10GB maximum, or specialized hardware such as GPUs, exceed what serverless function platforms provide.

Complex Multi-Agent Systems: Workflows requiring persistent communication between multiple AI agents, shared memory, or complex coordination patterns work better with traditional infrastructure.

High-Volume Production Workloads: Applications processing thousands of AI requests per minute may find dedicated infrastructure more cost-effective than per-invocation serverless pricing.

Hybrid Architectures: Best of Both Worlds

Most production AI systems benefit from hybrid approaches combining serverless and traditional infrastructure. AWS Step Functions provides excellent orchestration for these patterns:

Router Pattern

Use serverless functions as intelligent routers directing requests to appropriate processing infrastructure:

export const aiRouter = async (event) => {
  const complexity = analyzeRequestComplexity(event);

  if (complexity.simple) {
    return await processServerless(event);
  } else {
    return await queueForContainerProcessing(event);
  }
};

Hot/Cold Architecture

Maintain always-on infrastructure for baseline load, serverless for traffic spikes:

Containers handle predictable, consistent traffic
Serverless functions scale for demand peaks
Cost optimization through usage pattern matching

Making the Right Choice for Your AI Deployment

Use this decision framework when evaluating serverless for AI workloads:

Choose Serverless When:

Execution time consistently under 10 minutes
Traffic patterns are unpredictable or bursty
Dependencies fit within platform bundle limits (or container deployment acceptable)
Workflow can be broken into discrete steps
Cold start latency is acceptable for use case

Choose Traditional Infrastructure When:

Workflows require 15+ minutes execution time
Always-on availability is critical
Memory requirements exceed 10GB
Complex multi-agent coordination needed
Consistent sub-second response times required

Consider Hybrid When:

Traffic patterns combine baseline and spike loads
Some workflows fit serverless constraints, others don't
Cost optimization across variable usage patterns is a priority
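If you want the checklist applied consistently across teams, you can encode it as a small function. This is one possible reading of the bullets above, not a rule from any platform. The `Workload` fields and the priority order (hybrid first, then hard blockers, then serverless fit) are assumptions to adjust for your own constraints.

```python
from dataclasses import dataclass


@dataclass
class Workload:
    max_runtime_min: float             # longest end-to-end run
    splittable: bool                   # can be broken into checkpointed steps
    bursty_traffic: bool
    always_on: bool
    memory_gb: float
    multi_agent_coordination: bool
    needs_consistent_subsecond: bool
    has_steady_baseline: bool = False


def recommend(w: Workload) -> str:
    """Map the checklist to 'serverless', 'traditional', or 'hybrid'."""
    # Baseline load plus spikes: containers for the floor, functions for peaks
    if w.has_steady_baseline and w.bursty_traffic:
        return "hybrid"
    hard_blockers = (
        w.always_on
        or w.memory_gb > 10
        or w.multi_agent_coordination
        or w.needs_consistent_subsecond
        or (w.max_runtime_min >= 15 and not w.splittable)
    )
    serverless_fit = w.max_runtime_min < 10 or w.splittable
    if hard_blockers or not serverless_fit:
        return "traditional"
    return "serverless"


chatbot = Workload(max_runtime_min=0.1, splittable=False, bursty_traffic=True,
                   always_on=False, memory_gb=1, multi_agent_coordination=False,
                   needs_consistent_subsecond=False)
print(recommend(chatbot))
```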

The Bottom Line

Serverless isn't universally perfect or terrible for AI deployment—it's contextual. Simple, discrete AI operations work excellently in serverless environments, providing cost efficiency and automatic scaling. Complex, long-running AI agent workflows require architectural adaptations or alternative infrastructure.

The key is matching your specific AI workload characteristics to platform capabilities rather than forcing incompatible patterns. As serverless platforms continue evolving—container support, extended timeouts, better cold start performance—the viable use cases for serverless AI will expand.

Start by auditing your current AI deployment challenges against serverless constraints. If timeout limits, bundle sizes, or cold start performance block your use case, consider hybrid architectures or traditional infrastructure. If your workflows fit serverless patterns, you'll benefit from simplified operations and automatic scaling.

The serverless AI landscape changes rapidly. What's impossible today may be trivial next year. But right now, success depends on honest assessment of your requirements against current platform realities—not wishful thinking about what serverless should support.

5 Developer Pain Points Solved by Internal Developer Platforms

Greg Mate — Fri, 16 May 2025 12:03:56 +0000

Ever feel like you spend more time wrestling with tools than actually building stuff? You're not alone.

According to GitLab's research, developers waste up to 75% of their time just maintaining toolchains rather than coding. Even worse, over 78% of DevOps professionals report wasting between 25% and 100% of their time keeping their toolchain running.

Traditional development is like being handed a giant bin of unsorted LEGO bricks and told to build a castle. You spend most of your time digging through the pile looking for the right pieces, and everyone builds differently.

Platform engineering is like getting those official LEGO kits with sorted pieces, clear instructions, and modular components. You still have creative freedom, but you're not wasting hours hunting for that one specific brick or reinventing foundations that have already been perfected.

I've spent years documenting developer workflows and watching teams struggle with the same problems over and over. Let's look at five major pain points and how Internal Developer Platforms (IDPs) actually solve them.

What's an Internal Developer Platform anyway?

Before diving in, a quick definition: an IDP is a self-service layer that sits on top of your infrastructure and tools, abstracting away complexity so developers can focus on building rather than configuring. Think of it as a unified interface for your entire development lifecycle.

No more jumping between 10+ tools just to deploy a simple feature.

Pain Point #1: Deployment Bottlenecks

The Problem

How long does it take your team to get code from commit to production? For most teams, it's days or weeks. Elite teams deploy in under a day.

The bottleneck isn't usually the code—it's the deployment process itself. When deployments require specialized knowledge or manual steps, everything slows down. If the one person who knows how to deploy is on vacation, you're stuck.

The Solution

IDPs provide self-service templates for deployments. Instead of developers needing to understand the underlying infrastructure, they get standardized workflows with the right guardrails.

With a platform approach, your team can:

Deploy without waiting for DevOps/platform teams
Use templates that enforce best practices
Automate the entire CI/CD pipeline
Deploy with a single click or command

Getting Started

You don't need a huge budget to implement this. Start with:

GitHub Actions or GitLab CI for automated pipelines
Docker (used by 59% of professional developers) for consistent environments
Standardized deployment scripts checked into your repo

Set up templates for your most common deployment types and build from there.
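A self-service template can start as a function that fills in a standard deployment spec and rejects requests that break the guardrails. The sketch below is hypothetical and only shows the shape of the idea. That includes the DeployRequest fields, the allowed environments, the replica ceiling, and the "no :latest tags" rule.

```python
from dataclasses import dataclass, field

# Hypothetical guardrails a platform team might enforce on every deployment.
ALLOWED_ENVIRONMENTS = {"dev", "staging", "prod"}
MAX_REPLICAS = 10


@dataclass
class DeployRequest:
    service: str
    image: str
    environment: str
    replicas: int = 1
    labels: dict[str, str] = field(default_factory=dict)


def render_deployment(req: DeployRequest) -> dict:
    """Check a request against the guardrails and return a standard spec."""
    if req.environment not in ALLOWED_ENVIRONMENTS:
        raise ValueError(f"unknown environment: {req.environment}")
    # Inspect only the last path segment so a registry port (host:5000/app)
    # is not mistaken for a tag.
    name = req.image.rsplit("/", 1)[-1]
    if ":" not in name or name.endswith(":latest"):
        raise ValueError("image must use an explicit, non-latest tag")
    if not 1 <= req.replicas <= MAX_REPLICAS:
        raise ValueError(f"replicas must be between 1 and {MAX_REPLICAS}")
    return {
        "service": req.service,
        "image": req.image,
        "environment": req.environment,
        "replicas": req.replicas,
        # Standard labels are applied last, so user labels cannot override them.
        "labels": {**req.labels, "app": req.service, "env": req.environment},
    }
```

The developer supplies only what varies (service, image, environment). Everything the platform team cares about is enforced in one place.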

Pain Point #2: Context Switching Costs

The Problem

Each interruption costs developers 20+ minutes to regain focus. When developers have to switch between different tasks, tools, and contexts, productivity tanks.

The math is brutal, even at a conservative 10 minutes per switch: if every build costs each of 10 engineers 10 minutes of focus at $72/hour, that's $120 lost per build. With 50 builds per day and 22 working days, you're burning $132,000 monthly in lost productivity.
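Here is the same calculation as a small function. The defaults are the figures above, so you can swap in your own team size, rate, and build volume. The function name and parameter names are ours, for illustration only.

```python
def monthly_context_switch_cost(
    engineers: int = 10,
    minutes_lost_per_build: float = 10,
    hourly_rate: float = 72,
    builds_per_day: int = 50,
    working_days: int = 22,
) -> tuple[float, float]:
    """Return (cost per build, cost per month) in dollars."""
    # 10 engineers x 10 minutes x $72/hour / 60 minutes = $120 per build.
    per_build = engineers * minutes_lost_per_build * hourly_rate / 60
    per_month = per_build * builds_per_day * working_days
    return per_build, per_month


per_build, per_month = monthly_context_switch_cost()
print(f"${per_build:,.0f} per build, ${per_month:,.0f} per month")
# prints: $120 per build, $132,000 per month
```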

The 2024 State of Developer Productivity report found "time spent gathering project context" tied for the biggest productivity leak (26%).

The Solution

Platform engineering attacks this by creating unified interfaces and standardized workflows. Instead of switching between CI/CD tools, cloud consoles, monitoring dashboards, and ticketing systems, developers get a single interface.

Implementing an IDP gives you:

One portal for accessing all development resources
Integrated workflows that reduce tool-switching
Standardized processes that become muscle memory
Fewer interruptions due to missing context

Getting Started

For smaller teams, you can start with:

A centralized dashboard linking to your most-used tools
Consistent CLI tools that work across projects
Documentation that follows the same structure for all services
Automation for workflows that currently require multiple tools

Pain Point #3: Environment Inconsistency

The Problem

"It works on my machine" might be the most frustrating phrase in software development. Environment inconsistencies waste countless hours on debugging issues that only appear in specific environments.

When dev, test, and production environments don't match, you're essentially testing different systems. Problems appear out of nowhere during deployment, and fixing them becomes a painful guessing game.

The Solution

IDPs provide standardized environment templates and self-service provisioning. This ensures consistency across all stages of development.

With a platform approach:

Every environment uses identical configurations
Developers can spin up environments on-demand
Configuration changes propagate consistently
Local development matches production

Getting Started

Begin with:

Docker for containerizing applications
Docker Compose for local development environments
Environment configuration stored as code
Automated environment provisioning scripts

Even small teams can implement these practices incrementally.
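Storing environment configuration as code also makes drift checkable. The sketch below layers per-environment overrides on a shared base. It then reports override keys the base never declares, which is a common source of "works on my machine" surprises. It is a hypothetical helper rather than any particular tool, and the keys and values are placeholders.

```python
# Shared base configuration: every environment gets exactly these keys.
BASE = {"DB_PORT": "5432", "LOG_LEVEL": "info", "FEATURE_X": "off"}

# Per-environment overrides should change values, not introduce new keys.
OVERRIDES = {
    "dev": {"LOG_LEVEL": "debug", "DEBUG_TOOLBAR": "1"},  # DEBUG_TOOLBAR is drift
    "prod": {"FEATURE_X": "on"},
}


def resolve(env: str) -> dict[str, str]:
    """Merge the shared base with one environment's overrides."""
    return {**BASE, **OVERRIDES.get(env, {})}


def find_drift(base: dict[str, str], overrides: dict[str, dict[str, str]]) -> dict[str, set[str]]:
    """Map each environment to the override keys that the base does not declare."""
    drift = {}
    for env, values in overrides.items():
        extra = set(values) - set(base)
        if extra:
            drift[env] = extra
    return drift


print(find_drift(BASE, OVERRIDES))  # {'dev': {'DEBUG_TOOLBAR'}}
```

Running a check like this in CI turns "every environment uses identical configurations" from a convention into something that fails loudly.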

Pain Point #4: Cognitive Load from Multiple Tools

The Problem

Most teams juggle 6+ different tools, with 13% managing up to 14 different tools in their development chain. Each tool has its own interface, quirks, and mental model.

Learning and remembering how to use all these tools creates massive cognitive overhead, especially for new team members.

The Solution

Platform engineering streamlines development by providing standardized tools and interfaces. IDPs create a single point of entry for developers to access everything they need.

Implementing a platform approach gives you:

Uniform interfaces across different tools
Standardized workflows that work the same way everywhere
Simplified onboarding for new team members
Lower learning curve for daily tasks

Getting Started

Start by:

Auditing your current toolchain to identify redundancies
Creating consistent interfaces for your most-used tools
Building wrapper scripts that standardize common commands
Setting up a simple internal portal or wiki that provides single-point access
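A wrapper script is often the cheapest uniform interface. In the minimal sketch below, one standard verb such as "test" maps to whatever the project's toolchain needs, chosen by marker files in the project root. The verbs and the marker-to-command table are assumptions made for illustration.

```python
from pathlib import Path

# Hypothetical table: marker file -> standard verb -> native command.
TOOLCHAINS = {
    "go.mod": {"test": ["go", "test", "./..."], "build": ["go", "build", "./..."]},
    "package.json": {"test": ["npm", "test"], "build": ["npm", "run", "build"]},
    "pyproject.toml": {"test": ["python", "-m", "pytest"], "build": ["python", "-m", "build"]},
}


def detect_files(root: Path) -> set[str]:
    """Names of the regular files in the project root."""
    return {p.name for p in root.iterdir() if p.is_file()}


def resolve_command(verb: str, project_files: set[str]) -> list[str]:
    """Translate a standard verb such as 'test' into the project's native command."""
    for marker, verbs in TOOLCHAINS.items():
        if marker in project_files and verb in verbs:
            return verbs[verb]
    raise LookupError(f"no '{verb}' command defined for this project")
```

A thin entry point would call resolve_command(sys.argv[1], detect_files(Path.cwd())) and pass the result to subprocess.run. That way the same verb, for example "test", means the same thing in every repository.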

Pain Point #5: Security & Compliance Overhead

The Problem

Security is crucial but often becomes a productivity killer. Manual security reviews, compliance checks, and remediations consume valuable development time and delay deployments.

When security is bolted on at the end rather than built in from the start, it creates friction and frustration.

The Solution

Platform engineering embraces "self-service with guardrails." IDPs build security into workflows rather than tacking it on afterward.

With a platform approach:

Security scanning happens automatically in pipelines
Compliance checks run continuously
Policy enforcement happens transparently
Developers get instant feedback on security issues

Getting Started

Even small teams can implement:

Pre-commit hooks for basic security checks
Automated vulnerability scanning in CI pipelines
Compliance-as-code using tools like OPA
Security templates for new projects
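As an example of the first item, a pre-commit hook can be a short script that refuses commits containing obvious secrets. This is a deliberately tiny sketch with two illustrative patterns: the AWS access key ID format and PEM private-key headers. Dedicated scanners ship far larger rule sets, so treat it as a starting point.

```python
import re
import subprocess
import sys

# Two illustrative patterns; real scanners use many more rules.
SECRET_PATTERNS = {
    "AWS access key ID": re.compile(r"\bAKIA[0-9A-Z]{16}\b"),
    "private key": re.compile(r"-----BEGIN (?:RSA |EC |OPENSSH )?PRIVATE KEY-----"),
}


def find_secrets(text: str) -> list[str]:
    """Return the names of the patterns that match the given text."""
    return [name for name, pattern in SECRET_PATTERNS.items() if pattern.search(text)]


def staged_diff() -> str:
    """The staged diff (zero context lines), as produced by git."""
    return subprocess.run(
        ["git", "diff", "--cached", "-U0"], capture_output=True, text=True, check=True
    ).stdout


def main() -> int:
    """Exit status for a pre-commit hook: non-zero blocks the commit."""
    hits = find_secrets(staged_diff())
    for name in hits:
        print(f"possible {name} in staged changes", file=sys.stderr)
    return 1 if hits else 0
```

Saved as an executable .git/hooks/pre-commit that ends with sys.exit(main()), it stops the commit whenever a pattern matches.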

Leveraging What You Already Have

The good news? You probably already have the foundation for platform engineering in place. The trick is connecting these pieces into a cohesive experience:

Your Git workflow can expand beyond code versioning to include configuration and Infrastructure as Code specs.

Those Docker containers you use for local development? With some standardization, they become the basis for consistent environments across your pipeline.

That CI/CD pipeline you built for testing? It can become the backbone of a self-service deployment platform.

The key isn't getting new tools—it's connecting what you have in smarter ways. Focus on eliminating the manual steps between these systems first, then build interfaces that make the process seamless.

What's your team's biggest development pain point? Let me know in the comments!

Streamlining Multi-Tenant Kubernetes: A Practical Implementation Guide for 2025

Greg Mate — Wed, 14 May 2025 14:55:05 +0000

Let's face it: running multiple applications on separate clusters is a resource nightmare. If you've got different teams or customers needing isolated environments, you're probably spending way more on infrastructure than you need to.

Multi-tenancy in Kubernetes offers a solution, but it comes with its own set of challenges. How do you ensure proper isolation? What about resource allocation? And the big one – security?

This guide provides practical steps for implementing multi-tenant Kubernetes that actually works in production environments. By the end, you'll have a roadmap for consolidating your infrastructure while maintaining isolation where it matters.

What Multi-Tenancy Actually Means in 2025

Multi-tenancy has become a bit of a buzzword, but at its core, it still means the same thing: multiple users sharing the same infrastructure. In Kubernetes, we typically see two flavors:

Multiple teams within an organization: Different departments or projects sharing a cluster, where team members have access through kubectl or GitOps controllers
Multiple customer instances: SaaS applications running customer workloads on shared infrastructure

The key tradeoffs haven't changed much over the years, either. You're always balancing:

Isolation: Keeping tenants from accessing or messing with each other's resources
Resource efficiency: Maximizing hardware utilization and reducing costs
Operational complexity: Making sure your team can actually manage this setup

What has changed are the tools and patterns. Pure namespace-based isolation is still common, but we've seen a shift toward more sophisticated approaches using hierarchical namespaces, virtual clusters, and service meshes. Let's start with the building blocks you'll need for a practical implementation.

For more detail on how Kubernetes itself approaches multi-tenancy, see the multi-tenancy page in the official Kubernetes documentation.

The Building Blocks: Practical Implementation Guide

Namespace Configuration That Actually Works

Namespaces are your first line of defense in multi-tenancy. Here's a modern namespace configuration with isolation in mind:

apiVersion: v1
kind: Namespace
metadata:
  name: tenant-a
  labels:
    tenant: tenant-a
    pod-security.kubernetes.io/enforce: baseline
    pod-security.kubernetes.io/audit: restricted
    pod-security.kubernetes.io/warn: restricted
    network-isolation: enabled  # custom label; the k8s.io prefix is reserved for Kubernetes components

This does a few key things:

Creates a dedicated namespace for the tenant
Labels it for easier filtering and policy targeting
Applies Pod Security Standards (the modern replacement for Pod Security Policies)
Marks it for network isolation (a label enforces nothing on its own; NetworkPolicies do the actual isolation)

When organizing namespaces, many teams follow a pattern like {tenant}-{environment} (e.g., marketing-dev, marketing-prod). For SaaS applications, you might use customer IDs or similar identifiers.

The key thing to remember: namespaces alone aren't enough for true isolation. They're just containers for resources – you need additional controls to enforce boundaries.

RBAC That Actually Isolates Tenants

Role-Based Access Control (RBAC) is essential for preventing tenants from accessing each other's resources. Here's a pattern that works well in practice:

# Tenant admin role
apiVersion: rbac.authorization.k8s.io/v1
kind: Role
metadata:
  namespace: tenant-a
  name: tenant-admin
rules:
- apiGroups: ["", "apps", "batch"]
  resources: ["pods", "services", "deployments", "jobs"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
- apiGroups: ["networking.k8s.io"]
  resources: ["ingresses"]
  verbs: ["get", "list", "watch", "create", "update", "patch"]
- apiGroups: [""]
  resources: ["configmaps", "secrets"]
  verbs: ["get", "list", "watch", "create", "update", "patch", "delete"]
---
# Binding for tenant admin
apiVersion: rbac.authorization.k8s.io/v1
kind: RoleBinding
metadata:
  name: tenant-a-admin-binding
  namespace: tenant-a
subjects:
- kind: User
  name: tenant-a-admin
  apiGroup: rbac.authorization.k8s.io
roleRef:
  kind: Role
  name: tenant-admin
  apiGroup: rbac.authorization.k8s.io

Notice a few important things here:

The role is scoped to a specific namespace (tenant-a)
It grants permissions for common resources but nothing cluster-wide
The binding associates a user with this role

The pattern is simple but effective: create a set of standard roles for each tenant (admin, developer, viewer), each scoped to the tenant's namespace(s).
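Stamping out those standard roles for every tenant is easy to script. The sketch below builds Role manifests as plain dictionaries for three tiers. The verb sets for the tiers are assumptions, and the rule covers only the workload resources from the Role above; extend it with the ingress and configmap/secret rules as needed. kubectl apply -f accepts the JSON output as well as YAML.

```python
import json

READ = ["get", "list", "watch"]

# Hypothetical tiers; adjust the verb sets to your own policy.
TIERS = {
    "admin": READ + ["create", "update", "patch", "delete"],
    "developer": READ + ["create", "update", "patch"],
    "viewer": READ,
}


def tenant_roles(namespace: str) -> list[dict]:
    """Build one namespace-scoped Role per tier for a tenant."""
    return [
        {
            "apiVersion": "rbac.authorization.k8s.io/v1",
            "kind": "Role",
            "metadata": {"namespace": namespace, "name": f"tenant-{tier}"},
            "rules": [
                {
                    "apiGroups": ["", "apps", "batch"],
                    "resources": ["pods", "services", "deployments", "jobs"],
                    "verbs": verbs,
                }
            ],
        }
        for tier, verbs in TIERS.items()
    ]


print(json.dumps(tenant_roles("tenant-a")[-1], indent=2))
```

Generating roles from one table keeps every tenant's permissions identical. Tightening a tier then becomes a one-line change.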

One mistake I see teams make is being too generous with permissions. Start restrictive and loosen gradually as needed – it's much easier than trying to lock things down after a breach.

Network Policies That Actually Isolate Traffic

Network isolation is critical for multi-tenancy. By default, all pods in a Kubernetes cluster can talk to each other – not what you want in a multi-tenant environment.

Here's a practical network policy that isolates tenant traffic:

apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: tenant-isolation
  namespace: tenant-a
spec:
  podSelector: {}  # Applies to all pods in namespace
  policyTypes:
  - Ingress
  - Egress
  ingress:
  - from:
    - namespaceSelector:
        matchLabels:
          tenant: tenant-a
  egress:
  - to:
    - namespaceSelector:
        matchLabels:
          tenant: tenant-a
  - to:
    - namespaceSelector:
        matchLabels:
          common-services: "true"

This policy does two important things:

Allows ingress traffic only from the same tenant's namespace
Allows egress traffic only to the same tenant's namespace or to namespaces labeled as common services

The second part is particularly important – your tenants probably need access to shared services like monitoring, logging, or databases. By labeling those namespaces as common-services: "true", you create controlled exceptions to your isolation rules.

A common mistake is forgetting about DNS and other cluster services. Make sure your network policies allow access to kube-system services that tenants need, or you'll have some very confusing debugging sessions.
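One way to avoid that is to generate each tenant's policy with a DNS exception built in. This sketch reproduces the policy above as a Python dictionary and adds an egress rule to port 53 in kube-system. It relies on the kubernetes.io/metadata.name label that Kubernetes sets automatically on namespaces (v1.21 and later). It also assumes cluster DNS runs in kube-system, as it does in most distributions, and that each namespace is named after its tenant.

```python
def tenant_network_policy(tenant: str) -> dict:
    """NetworkPolicy isolating a tenant namespace, with DNS egress allowed."""
    same_tenant = {"namespaceSelector": {"matchLabels": {"tenant": tenant}}}
    common = {"namespaceSelector": {"matchLabels": {"common-services": "true"}}}
    dns = {
        "to": [{"namespaceSelector": {
            "matchLabels": {"kubernetes.io/metadata.name": "kube-system"}}}],
        # DNS uses UDP for most queries and falls back to TCP for large responses.
        "ports": [{"protocol": "UDP", "port": 53}, {"protocol": "TCP", "port": 53}],
    }
    return {
        "apiVersion": "networking.k8s.io/v1",
        "kind": "NetworkPolicy",
        "metadata": {"name": "tenant-isolation", "namespace": tenant},
        "spec": {
            "podSelector": {},  # applies to all pods in the namespace
            "policyTypes": ["Ingress", "Egress"],
            "ingress": [{"from": [same_tenant]}],
            "egress": [{"to": [same_tenant]}, {"to": [common]}, dns],
        },
    }
```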

Resource Quotas to Prevent Noisy Neighbors

One bad tenant can ruin the party for everyone by consuming all available resources. Resource quotas prevent this "noisy neighbor" problem:

apiVersion: v1
kind: ResourceQuota
metadata:
  name: tenant-a-quota
  namespace: tenant-a
spec:
  hard:
    requests.cpu: "10"
    requests.memory: 20Gi
    limits.cpu: "20" 
    limits.memory: 40Gi
    persistentvolumeclaims: "20"
    services: "30"
    count/deployments.apps: "25"
    count/statefulsets.apps: "10"

This quota sets limits on:

CPU and memory consumption (both requests and limits)
Number of persistent volume claims (storage)
Number of services and workloads (deployments, statefulsets)

Setting appropriate quota sizes takes some experimentation. Monitor actual usage patterns and adjust accordingly – too restrictive and legitimate workloads fail, too loose and you're back to the noisy neighbor problem.

Pro tip: In addition to ResourceQuotas (which operate at namespace level), use LimitRanges to set default and maximum limits for individual containers. This prevents tenants from creating resource-hungry pods that still fit within their overall quota.
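A LimitRange for the same namespace could look like the sketch below, built as a Python dictionary and dumped to JSON for kubectl apply -f. The resource name container-limits and all CPU and memory values are placeholders; size them from the usage you observe.

```python
import json


def tenant_limit_range(namespace: str) -> dict:
    """Per-container defaults and ceilings; all values are placeholders."""
    return {
        "apiVersion": "v1",
        "kind": "LimitRange",
        "metadata": {"name": "container-limits", "namespace": namespace},
        "spec": {
            "limits": [
                {
                    "type": "Container",
                    # Limits applied to containers that don't set their own.
                    "default": {"cpu": "500m", "memory": "512Mi"},
                    # Requests applied to containers that don't set their own.
                    "defaultRequest": {"cpu": "250m", "memory": "256Mi"},
                    # No single container may exceed these.
                    "max": {"cpu": "2", "memory": "4Gi"},
                }
            ]
        },
    }


print(json.dumps(tenant_limit_range("tenant-a"), indent=2))
```

The defaults also help with the quota itself. Because tenant-a's ResourceQuota constrains CPU and memory requests and limits, pods that specify none would otherwise be rejected at creation.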

Real-World Implementation Benefits

Research and industry reports show clear benefits when organizations implement proper multi-tenancy in Kubernetes environments:

According to documented implementations, organizations typically see:

30-40% reduction in infrastructure costs by consolidating multiple single-tenant clusters
Significant decrease in time spent on cluster maintenance and updates
Improved resource utilization, often doubling from around 30-35% to 70% or more
Better standardization across development teams

However, implementation isn't without challenges. Common issues include:

Resistance from teams concerned about workload security and isolation
Migration complexity for existing applications
Learning curve for new multi-tenant tooling and workflows
Special accommodations needed for resource-intensive or security-sensitive workloads

This highlights an important point: multi-tenancy isn't all-or-nothing. Many successful implementations use a hybrid approach, keeping some high-security or high-performance workloads on dedicated clusters while consolidating standard workloads in shared environments.

Solving the Big Three Challenges

Challenge 1: Security Vulnerabilities

Cross-tenant data leakage and escalation attacks are the nightmare scenarios in multi-tenant environments. Here's a practical security checklist:

Enforce Pod Security Standards:

   apiVersion: v1
   kind: Namespace
   metadata:
     name: tenant-a
     labels:
       pod-security.kubernetes.io/enforce: restricted
       pod-security.kubernetes.io/enforce-version: v1.29

The "restricted" profile prevents pods from running as privileged, accessing host namespaces, or using dangerous capabilities.

Isolate tenant storage:
Use StorageClasses with tenant-specific access controls, or better yet, separate storage backends for sensitive data.
Implement regular security scanning:
Tools like Trivy, Falco, and Kube-bench can identify vulnerabilities in your multi-tenant setup.
Audit, audit, audit:
Enable audit logging and regularly review access patterns – many breaches are detected through unusual access.
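Reviewing access patterns can be partly automated. The audit log backend writes one JSON event per line to the file set with the API server's --audit-log-path flag. Each event carries user.username, verb, and objectRef.namespace. The sketch below flags requests where a tenant's user touches another tenant's namespace. It assumes a hypothetical naming convention in which usernames and namespaces both start with the tenant name (for example tenant-a-admin working in tenant-a-dev).

```python
import json
from collections.abc import Iterable, Iterator
from typing import Optional


def tenant_of(name: str, tenants: list[str]) -> Optional[str]:
    """Map a username or namespace to its tenant under the naming convention."""
    for tenant in tenants:
        if name == tenant or name.startswith(tenant + "-"):
            return tenant
    return None


def cross_tenant_events(lines: Iterable[str], tenants: list[str]) -> Iterator[tuple[str, str, str]]:
    """Yield (username, namespace, verb) for audit events that cross tenant lines."""
    for line in lines:
        event = json.loads(line)
        user = event.get("user", {}).get("username", "")
        namespace = (event.get("objectRef") or {}).get("namespace")
        if not namespace:
            continue  # cluster-scoped or non-resource request
        user_tenant = tenant_of(user, tenants)
        ns_tenant = tenant_of(namespace, tenants)
        # Flag only requests where both sides map to a tenant and they differ.
        if user_tenant and ns_tenant and user_tenant != ns_tenant:
            yield user, namespace, event.get("verb", "")
```

With RBAC configured as above, such requests should be denied. A steady stream of them still signals misconfigured tooling or someone probing boundaries.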

Challenge 2: Resource Contention

Even with resource quotas, you can still run into contention issues. Here are some practical solutions:

Pod Priority and Preemption:

   apiVersion: scheduling.k8s.io/v1
   kind: PriorityClass
   metadata:
     name: tenant-high-priority
   value: 1000000

Assign different priority classes to tenant workloads based on their importance.

Pod Anti-Affinity:

   affinity:
     podAntiAffinity:
       requiredDuringSchedulingIgnoredDuringExecution:
       - labelSelector:
           matchExpressions:
           - key: tenant
             operator: In
             values:
             - tenant-a
         topologyKey: "kubernetes.io/hostname"

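This prevents two pods from the same tenant (pods carrying the label tenant: tenant-a) from being scheduled on the same node, spreading that tenant's load. Because the rule is required, a pod with no eligible node stays Pending; preferredDuringSchedulingIgnoredDuringExecution gives a softer spread.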

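Quality of Service Classes: Kubernetes assigns each pod a QoS class (Guaranteed, Burstable, or BestEffort) based on its containers' requests and limits. Choose requests and limits so each tenant workload lands in the appropriate class, because the class influences which pods are evicted first when a node runs short of resources.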
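Since you influence the QoS class only through requests and limits, it helps to see how the class is derived. The function below is a simplified re-implementation for illustration, not Kubernetes code. It ignores init containers, considers only CPU and memory, and compares quantity strings literally. The rules it follows are these:

- Guaranteed when every container has CPU and memory limits with requests equal to them.
- BestEffort when no container sets any CPU or memory request or limit.
- Burstable otherwise.

```python
COMPUTE = ("cpu", "memory")


def qos_class(containers: list[dict]) -> str:
    """Simplified QoS classification for a pod's regular (non-init) containers.

    Each container is {"requests": {...}, "limits": {...}}. Quantities are
    compared as strings, so "500m" and "0.5" count as different here, whereas
    Kubernetes compares parsed quantities.
    """
    any_set = False
    guaranteed = True
    for c in containers:
        limits = {k: v for k, v in c.get("limits", {}).items() if k in COMPUTE}
        # A limit without a request defaults the request to the limit.
        requests = {**limits, **{k: v for k, v in c.get("requests", {}).items() if k in COMPUTE}}
        if requests:
            any_set = True
        for resource in COMPUTE:
            if resource not in limits or requests[resource] != limits[resource]:
                guaranteed = False
    if not any_set:
        return "BestEffort"
    return "Guaranteed" if guaranteed else "Burstable"
```

For a latency-critical tenant service, setting limits only (or requests equal to limits) for both CPU and memory on every container yields Guaranteed. Batch jobs left with no resources at all fall into BestEffort.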

Challenge 3: Operational Complexity

Managing dozens or hundreds of tenants manually isn't feasible. Here's how to simplify operations:

Automate tenant provisioning:
Create a standardized process for spinning up new tenant namespaces, applying policies, and setting quotas.
Use a tenant operator:
Tools like Capsule or the Multi-Tenant Operator can handle tenant lifecycle management, from creation to termination:

   apiVersion: tenancy.stakater.com/v1alpha1
   kind: Tenant
   metadata:
     name: tenant-a
   spec:
     owners:
     - name: tenant-a-admin
       kind: User
     namespaces:
     - tenant-a-dev
     - tenant-a-prod
     quota:
       hard:
         requests.cpu: '10'
         requests.memory: 20Gi
     resourcePooling: true
     namespacePrefix: tenant-a-

Implement tenant-aware monitoring:
Tag all metrics and logs with tenant identifiers to simplify debugging and enable tenant-specific dashboards.
Create self-service capabilities:
Build internal tools that let tenants manage their own resources within the constraints you define.
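For tenant-aware monitoring, Python's standard logging module can stamp a tenant identifier onto every record through a LoggerAdapter. This is a minimal sketch, not a standard. The TenantAdapter class, the tenant_id field name, the logger name "billing", and the log format are our own choices.

```python
import logging


class TenantAdapter(logging.LoggerAdapter):
    """Attach the adapter's tenant_id to every record it logs."""

    def process(self, msg, kwargs):
        # Merge tenant_id into any 'extra' the caller passed, instead of replacing it.
        kwargs["extra"] = {**kwargs.get("extra", {}), "tenant_id": self.extra["tenant_id"]}
        return msg, kwargs


logger = logging.getLogger("billing")
handler = logging.StreamHandler()
handler.setFormatter(logging.Formatter("%(levelname)s tenant=%(tenant_id)s %(message)s"))
logger.addHandler(handler)
logger.setLevel(logging.INFO)

log = TenantAdapter(logger, {"tenant_id": "tenant-a"})
log.info("invoice generated")
```

Because the tenant ID travels as a structured field on each record, a JSON formatter or log shipper can index it. Tenant-specific dashboards then need no message parsing.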

Wrapping Up: Is Multi-Tenancy Right for You?

Multi-tenant Kubernetes isn't a silver bullet, but it can significantly reduce costs and operational overhead when implemented correctly. Here's a quick checklist to decide if it's right for your organization:

✅ You have multiple teams or customers using similar infrastructure
✅ You're comfortable with the security implications of shared infrastructure
✅ You have the operational maturity to implement and maintain isolation
✅ The cost savings outweigh the increased complexity

The implementation patterns we've covered – namespace isolation, RBAC, network policies, and resource quotas – provide a solid foundation for most multi-tenant environments. Start small, perhaps with just two teams or customers, and expand as you gain confidence in your isolation mechanisms.

Remember, you don't have to go all-in on multi-tenancy. Many organizations use a hybrid approach, with shared clusters for most workloads and dedicated clusters for high-security or high-performance applications.

Whatever approach you choose, make sure your teams understand the boundaries and limitations of your multi-tenant setup. Technical controls are important, but so is user education – a confused tenant can unintentionally cause problems for everyone.

What's your experience with multi-tenant Kubernetes? Have you implemented any of these patterns, or do you have alternative approaches? Share your thoughts in the comments below.

Goodbye, 2023! dyrector.io’s Annual Recap

Greg Mate — Wed, 20 Dec 2023 11:04:33 +0000

2023 is coming to an end, which means it's time to look back at what happened with the dyrector.io team and project over the past 12 months.

January – Full Stack Highlighted dyrector.io

After the lengthy Christmas break, with full stomachs and a couple of extra kilograms, we were caught off guard by a real surprise: the Full Stack platform featured dyrector.io in its highlights.

Team-wise, the most notable event was our Minus 30 hike in the pleasant January weather. It was a great occasion to chat about both tech-related and unrelated things, and to taste some pálinka.

February – dyrector.io Alpha Dropped

The first weeks of February were all about attending FOSDEM and the upcoming launch of dyrector.io on Product Hunt. On the day of the launch we made alpha access available.

Our Product Hunt launch turned out to be a shot at the buzzer, but we still did well: launching 6 hours into the voting, we reached the #11 spot. The same day we shipped a new release and a demo video. Busier than planned, but it went well.

At the conference in Belgium, we caught up with a lot of like-minded people eager to learn about open-source software.

At the same time, our teammate Levi showed up in the local cloud meetup scene as both an organizer and a presenter. Another teammate, Nándi, was interviewed on the Uptime Community podcast series about DevOps, ChatGPT, and open source.

March – Three (Hundred) Is the Magic Number

We doubled down on catering to a self-hosting audience in the first months of 2023, which helped us reach 300 stars on GitHub on the 3rd of March. We also published a bunch of blog posts about self-hosting certain types of applications.

In March, we published our Awesome repository of infrastructure-related questions. We find it useful when onboarding someone to a new project that involves maintaining infrastructure.

Another important event of the month was Docker announcing the end of Free Teams on Docker Hub. The backlash was inevitable, and so was Docker backing out of its plan to monetize Free Teams.

April – Adventures in the UK and Hungary

Part of our team took a business trip to the UK to visit Hanover Displays at their HQ in Brighton. While Levi and Gopher were there, they also paid a visit to the LEGO HQ for a meetup.

After the UK trip, the whole team left the office for a few days of team building, where we could all unwind together.

Levi attended KubeCon in Amsterdam, too, which turned out to be the funniest way to reach 420 stars on GitHub on April 20th. Trust me, we didn’t plan this whatsoever.

May – 0.4.0 & Roadmap Published

After a Q1 busy with refactoring and making dyrector.io's code more efficient, we started shipping releases faster. The first step was 0.4.0, which didn't deliver any significant functional changes but was important for accelerating our release cycle in the long run.

At the same time, we published our roadmap on GitHub and added new issues to the repository.

We also made some new friends: ConfigCat reviewed the platform on their blog.

June – Team Building in Croatia & 1000000000 Stars

Release 0.5.0 was a special moment for our team: the first version in months to include new features. To celebrate, we went to Croatia to finish working on the new version and chill on the sunny beach.

This was the perfect way to kick off our summer. After the trip to Croatia, we were able to consistently release on a bi-weekly basis, shipping new features again and again.

After 0.5.0 dropped, we passed 512 stars on GitHub, or 1000000000 in binary.

July – Automated Deployments With dyrector.io

One of the most significant features we added this year was the auto-deployment capability. The GitHub Actions compatible feature came out on July 14 in release 0.6.0.

A very pleasant surprise was when Nevo David mentioned dyrector.io in his blog post, which resulted in increased exposure and interest in the platform. In a few days we gained hundreds of stars on GitHub.

Even though it was the middle of summer, we took no breaks. Between releases packed with new features, we went sailing on Lake Balaton, and Nándi and Geri even completed the Lake Balaton Cross Swimming.

At the end of July Levi attended WeAreDevelopers 2023 in Berlin.

August – dyrector.io Turns International

The most significant change was internal: our teammate Nándi moved to the Netherlands with his girlfriend. We officially became a remote-first company, while the rest of the team still showed up at the office every day. At his goodbye party, we sent him off with a few cans of his favorite beverages for the road.

We launched dyrector.io on a new platform called Dev Hunt, which is an open-source Product Hunt alternative. With the help of our community, we were able to reach the #1 spot and the Developer Tool of the Week title that comes with it.

In other cloud-related news, HashiCorp announced it was changing the license of its products, including Terraform, to the Business Source License, which prompted the founding of OpenTF, later renamed OpenTofu.

September – Product Hunt Launch #2

Most of August was spent preparing for our Product Hunt launch in September. The date was set for September 8th, since we knew a product like ours only had a chance at a significant result on a Friday.

The result: #6 in the daily rankings and top 50 in the weekly rankings, with around 260 votes. Definitely an impressive result for a heavily developer-focused tool.

In the meantime, Levi took care of networking: he appeared in the Follow The Pattern podcast, attended InfoBip’s Shift conference in Croatia, and went to Kubernetes Community Days in Vienna.

October – darklens Enters the Scene

The biggest achievement of October in our household was a one-week sprint while more than half of the team was on vacation. Three of our teammates, two developers and one marketer, joined forces to develop a complementary product to dyrector.io.

We named this tool darklens, which makes Docker logs and container settings available in your browser. A week after the sprint we launched darklens on Product Hunt for an impressive #14 spot with 140 upvotes.

November – Team Building in Portugal

Over the summer, the whole team managed to snag developer tickets to Web Summit in Lisbon. As soon as we got the confirmation, we started planning our trip to Portugal. With a little sightseeing and networking at the conference, the week we spent in Lisbon turned out to be a blast, and we made a lot of new connections.

One of the coolest things of the year was when people found the invitation card for our CTF puzzle and came to our Discord channel or stopped by to say hi at Web Summit.

December – 0.10.0 Dropped

The latest release of dyrector.io, 0.10.0, dropped in early December. You can find out more about it on GitHub.

That’s it for 2023. So long, and thanks for all the fish!

This blogpost was written by the team of dyrector.io. dyrector.io is an open-source continuous delivery & deployment platform with version management.

Support us with a star on GitHub.

5 Use Cases When Containerization Is Absolutely Useless for You

Greg Mate — Thu, 30 Nov 2023 14:19:59 +0000

#1 Static, Unchanging Environments

If your application has minimal dependencies and operates consistently across different environments without the need for isolation, containerization may offer little benefit.

Example:

Your application is the only process running on the machine.

#2 Limited Scalability Needs

For applications with predictable and steady workloads that do not require rapid scaling or dynamic resource allocation, the overhead of containerization might outweigh the advantages.

Example:

Small-scale IoT apps.

#3 Simple, Standalone Applications

In cases where your application is straightforward, lacks dependencies, and isn't part of a larger ecosystem with varied technologies, containerization may introduce unnecessary complexity.

Example:

Zero-dependency binaries. Debugging a host process is also more straightforward than debugging the same process inside a container.
Offline applications installed from external media and running without an internet connection.

#4 Resource-Constrained Environments

On systems with extremely limited resources, such as embedded devices or constrained hardware, the overhead of running containerization platforms might not be justified.

Example:

Microelectronics.

#5 Desktop Applications

Sounds exotic, huh? For good reason: it would be very unusual to use containers for desktop applications. Similar isolation techniques exist, but they are not widespread.

Example:

cs_16_nosteam_portable.exe😅

If You Really Need to Containerize...

You can use dyrector.io to deploy and manage containerized services.

⭐ Star dyrector.io on GitHub:

https://github.com/dyrector-io/dyrectorio

Dagger 101: How to Get Started with Containerized CI Workflows

Greg Mate — Thu, 23 Nov 2023 11:04:26 +0000

Continuous Integration and Continuous Delivery are the secret sauce for shipping new features consistently and reliably. However, the effectiveness of this process is closely tied to the tooling that orchestrates it. Common pain points of CI/CD systems include slow feedback loops, vendor lock-in, lack of abstraction, limited composability, and YAML itself. This is where Dagger comes into the spotlight, promising a more unified and faster path.

Introduction

The development and deployment process at dyrector.io has become faster every year as we adopt and integrate better tools and methods, but we aim to unify and accelerate it further. Dagger's philosophy aligns with what we consider crucial for a truly rapid and seamless process:

Local testing: Enable developers to test their code instantly, locally
Programmable CI: Replace messy YAML-based, complex CI with code
Compatibility: If it runs in a container, you can add it to your pipeline
Portability: The same pipeline can run on your local machine, a CI runner, a dedicated server, or any container hosting service
Universal caching: Every operation is cached by default, and caching works the same everywhere

Currently, to spin up our stack for local testing, we can use either our own dyrector.io Go CLI (we'll refer to it as dyo throughout this post) with our custom commands, or Docker Compose with its YAML. We also maintain a GitHub Actions workflow for running end-to-end tests on GitHub. This setup lacks coherence, because we cannot reuse the GitHub Actions workflow YAML locally or in a different CI/CD environment.

We want to get closer to shipping every single day, or even multiple times a day, as quickly as possible, using the same tool locally and in CI. Dagger feels like an actual innovation in CI/CD, and it seems it will enable us to do that. The Dagger team also puts a strong focus on community feedback when designing and building what people really need.

Setting up Dagger CI/CD

We would like to use Dagger locally with the dyo Go CLI. For this we need the Dagger Go SDK for integration (Dagger offers SDKs for several languages) and the Dagger Engine, which runs our pipelines. We built a small proof of concept (POC) to evaluate whether we could run our entire stack locally with Dagger. If the POC is successful, we plan to use the same setup in our GitHub workflow, using GitHub Actions only to trigger the Dagger pipeline.

Steps to set up Dagger for our project:

Install the Dagger Go SDK (again, you can use any other Dagger SDK for your project, but we use Go). Go to your existing project (in our case, dyrectorio) and run:

$ go get dagger.io/dagger
$ go mod tidy

Add a local Dagger test to our Makefile. This gives us a simple and fast “make test”, similar to our other commands.

# Shortcut for local testing
.PHONY: test
test:
	go run golang/cmd/dagger/main.go

Create a Dagger main.go
We already have dyo, dagent and crane in our golang/cmd directory, so we put dagger there too.
Import the Dagger SDK

Create a Dagger client using the SDK
This lets you interact with the Dagger Engine and create pipelines.

Create Dagger pipelines

Additional note:
Installing the Dagger CLI is optional. It is a command-line tool for interacting with the Dagger Engine, and it has a nice terminal UI with parallel progress bars that are visually impressive if you are into that sort of thing.

Install the Dagger CLI

$ cd /usr/local
$ curl -L https://dl.dagger.io/dagger/install.sh | sh

Workflow Integration

As you will see, the “Dagger way” is very “Docker-ish”. No surprise there: one of Dagger's co-founders is Solomon Hykes, who previously founded Docker and served as its CTO.

To show you concrete code examples from our POC:

Import Dagger SDK
In our main.go:

import (
    "context"
    "dagger.io/dagger"
    …
)

Create a Dagger client using the SDK

func initDaggerClient(ctx context.Context) *dagger.Client {
    client, err := dagger.Connect(ctx, dagger.WithLogOutput(os.Stdout))
    if err != nil {
        panic(err)
    }
    return client
}

And we can call this initDaggerClient() function in our main() like this:

    ctx := context.Background()
    client := initDaggerClient(ctx)
    defer client.Close()

Run unit tests on our NestJS-based Crux backend:

func runCruxUnitTestPipeline(ctx context.Context, client *dagger.Client) {
    log.Info().Msg("Run crux unit test pipeline...")

    _, err := client.Container().From("node:20-alpine").
        WithDirectory("/src", client.Host().Directory("web/crux/"), dagger.ContainerWithDirectoryOpts{
            Exclude: []string{"node_modules"},
        }).
        WithWorkdir("/src").
        WithExec([]string{"npm", "ci"}).
        WithExec([]string{"npm", "run", "test"}).
        Stdout(ctx)
    if err != nil {
        panic(err)
    }

    log.Info().Msg("Crux unit test pipeline done.")
}

We can call this runCruxUnitTestPipeline() function in our main():

    runCruxUnitTestPipeline(ctx, client)

Running unit tests on our Next.js-based Crux UI frontend is very similar to the code above. We only need to change the host directory to “web/crux-ui/” and add a “.next” exclusion; everything else remains the same:

    WithDirectory("/src", client.Host().Directory("web/crux-ui/"), dagger.ContainerWithDirectoryOpts{
        Exclude: []string{"node_modules", ".next"},
    }).

A slightly more advanced example: running our Crux backend in production mode (as we do for the e2e test) with a connected PostgreSQL database service container:

func getEnv(envPath string) map[string]string {
    cruxEnv, err := godotenv.Read(envPath)
    if err != nil {
        panic(err)
    }
    return cruxEnv
}

func getCruxPostgres(client *dagger.Client, cruxEnv map[string]string) *dagger.Container {
    databaseURL := cruxEnv["DATABASE_URL"]
    parsedURL, err := url.Parse(databaseURL)
    if err != nil {
        panic(err)
    }
    postgresUsername := parsedURL.User.Username()
    postgresPassword, _ := parsedURL.User.Password()
    postgresDB := strings.TrimPrefix(parsedURL.Path, "/")

    dataCache := client.CacheVolume("data")

    cruxPostgres := client.Pipeline("crux-postgres").Container().From("postgres:14.2-alpine").
        WithMountedCache("/data", dataCache).
        WithEnvVariable("POSTGRES_USER", postgresUsername).
        WithEnvVariable("POSTGRES_PASSWORD", postgresPassword).
        WithEnvVariable("POSTGRES_DB", postgresDB).
        WithEnvVariable("PGDATA", "/data/postgres").
        WithExposedPort(5432)

    return cruxPostgres
}

func runCruxProd(ctx context.Context, client *dagger.Client, cruxPostgres *dagger.Container) *dagger.Container {
    crux := client.Pipeline("crux").Container().From("node:20-alpine")
    crux = crux.
        WithDirectory("/src", client.Host().Directory("web/crux/"), dagger.ContainerWithDirectoryOpts{
            Exclude: []string{"node_modules"},
        }).
        WithWorkdir("/src").
        WithServiceBinding("localhost", cruxPostgres).
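        // Uncomment to bust the cache: a new value on every run forces this step and all later ones to rerun.
        // WithEnvVariable("NOCACHE", time.Now().String()).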
        WithExec([]string{"npm", "ci"}).
        WithExec([]string{"npm", "run", "build"}).
        WithExec([]string{"npm", "run", "prisma:migrate"}).
        WithExec([]string{"npm", "run", "start:prod"})

    _, err := crux.Stdout(ctx)
    if err != nil {
        panic(err)
    }

    return crux
}

We can run the above code in our main() like this:

    cruxEnv := getEnv("web/crux/.env") 
    cruxPostgres := getCruxPostgres(client, cruxEnv) 
    runCruxProd(ctx, client, cruxPostgres)
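getCruxPostgres derives the database credentials from DATABASE_URL, while runCruxProd binds the database service under the alias "localhost", presumably so that the unchanged URL in web/crux/.env still resolves. A slightly more defensive variant could derive both sides of the connection from the same URL: the POSTGRES_* variables for the database container, and a DATABASE_URL whose host is a descriptive service alias. The minimal sketch below illustrates this; deriveDBConfig, dbConfig and the "postgres" alias are hypothetical names, and the example URL is made up:

```go
package main

import (
	"errors"
	"fmt"
	"net"
	"net/url"
	"strings"
)

// dbConfig holds both sides of the database connection derived from one URL.
type dbConfig struct {
	PostgresEnv map[string]string // for the postgres service container
	AppURL      string            // DATABASE_URL for the app container
}

// deriveDBConfig parses a DATABASE_URL, returns the POSTGRES_* variables for
// the database container, and rewrites the host to serviceAlias (keeping the
// port). Missing credentials are reported instead of passed on as "".
func deriveDBConfig(databaseURL, serviceAlias string) (dbConfig, error) {
	u, err := url.Parse(databaseURL)
	if err != nil {
		return dbConfig{}, fmt.Errorf("parse DATABASE_URL: %w", err)
	}
	user := u.User.Username() // Username and Password are nil-safe
	password, hasPassword := u.User.Password()
	db := strings.TrimPrefix(u.Path, "/") // a query like ?schema=public is not part of Path
	switch {
	case user == "":
		return dbConfig{}, errors.New("DATABASE_URL has no username")
	case !hasPassword:
		return dbConfig{}, errors.New("DATABASE_URL has no password")
	case db == "":
		return dbConfig{}, errors.New("DATABASE_URL has no database name")
	}

	// Swap only the hostname; keep the port if one is present.
	if port := u.Port(); port != "" {
		u.Host = net.JoinHostPort(serviceAlias, port)
	} else {
		u.Host = serviceAlias
	}

	return dbConfig{
		PostgresEnv: map[string]string{
			"POSTGRES_USER":     user,
			"POSTGRES_PASSWORD": password,
			"POSTGRES_DB":       db,
		},
		AppURL: u.String(),
	}, nil
}

func main() {
	cfg, err := deriveDBConfig("postgresql://crux:secret@localhost:5432/crux?schema=public", "postgres")
	if err != nil {
		panic(err)
	}
	fmt.Println(cfg.PostgresEnv["POSTGRES_DB"]) // crux
	fmt.Println(cfg.AppURL)                     // postgresql://crux:secret@postgres:5432/crux?schema=public
}
```

With the 0.8.x API used above, cfg.PostgresEnv would feed the WithEnvVariable calls in getCruxPostgres. The crux container would then use WithServiceBinding("postgres", cruxPostgres).WithEnvVariable("DATABASE_URL", cfg.AppURL). This assumes the app and the Prisma CLI let a real environment variable take precedence over the copied .env file (dotenv's default is not to override variables that are already set), so check that in your setup. Also note that the official postgres image applies POSTGRES_USER, POSTGRES_PASSWORD and POSTGRES_DB only when it initializes an empty data directory. With the cached "data" volume, a credential change in .env will not re-create the user.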

Note that we built our POC with Dagger 0.8.x in September, and the code snippets above reflect that version. Even then, Dagger was developing the new Services v2 API (which we will need for our complex e2e pipeline) in a separate feature branch, and the team promised on their Discord server that this new API, with some breaking changes, would be included in Dagger 0.9. We weren't the only ones asking for parallel, long-running service containers, and they kept their word: Services v2 shipped in Dagger 0.9.0, released at the end of October. Shouts to Team Dagger!

We put our POC on hold in October, but we have been keeping an eye on Services v2 developments. We will try out Services v2 in the near future and dedicate another blog post to whether we managed to run our entire e2e pipeline with Dagger.

Dagger efficiently caches each step of a pipeline. It automatically caches source code copies, containers and builds, and when developers configure cache volumes programmatically, it also caches mounted data such as database files, node_modules and the Go build cache. Our logs from a rerun without code changes show this clearly:

    copy web/crux/ CACHED
    > in host.directory web/crux/
    …
    pull docker.io/library/postgres:14.2-alpine CACHED
    > in crux-postgres > from postgres:14.2-alpine
    > in crux > service bvqf991cmob5i.97ul8ph8qf1qc.dagger.local
    …
    exec docker-entrypoint.sh postgres
    > in crux > service bvqf991cmob5i.97ul8ph8qf1qc.dagger.local
    [0.15s] PostgreSQL Database directory appears to contain a database; Skipping initialization
    …
    [0.30s] 2023-11-08 10::11.131 UTC [15] LOG:  database system is ready to accept connections
    ...
    exec docker-entrypoint.sh npm run build CACHED
    > in crux
    exec docker-entrypoint.sh npm run prisma:migrate CACHED
    > in crux
    exec docker-entrypoint.sh npm ci CACHED
    > in crux
    copy / /src CACHED
    > in crux
    exec docker-entrypoint.sh npm run start:prod
    > in crux
    [0.57s] > crux@0.7.0 start:prod
    [0.57s] > node dist/main
    [2.31s] [Nest] 33  - 11/07/2023, 14:24:13.142 AM     LOG [NestFactory] Starting Nest application...
    ...
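The CACHED markers reflect that a step's result depends only on its inputs and on the steps before it. The toy model below is not Dagger's implementation (the Dagger Engine builds on BuildKit's content-addressed cache). It only illustrates the idea: each step's cache key is a hash of the previous key plus the step's own definition. It shows why an unchanged rerun is fully cached, and why the commented-out NOCACHE line in runCruxProd acts as a cache buster. Changing one step's input changes its key and the keys of every later step. The step strings and function names are made up for illustration:

```go
package main

import (
	"crypto/sha256"
	"encoding/hex"
	"fmt"
)

// cacheKeys derives each step's key from the previous step's key plus the
// step's own definition, so one change invalidates that step and all later ones.
func cacheKeys(steps []string) []string {
	keys := make([]string, len(steps))
	parent := ""
	for i, s := range steps {
		sum := sha256.Sum256([]byte(parent + "\x00" + s))
		parent = hex.EncodeToString(sum[:])
		keys[i] = parent
	}
	return keys
}

// simulateRun reports, per step, whether it would be served from the cache,
// and records every key it computed for later runs.
func simulateRun(cache map[string]bool, steps []string) []bool {
	hits := make([]bool, len(steps))
	for i, k := range cacheKeys(steps) {
		hits[i] = cache[k]
		cache[k] = true
	}
	return hits
}

func main() {
	cache := map[string]bool{}
	pipeline := []string{"from node:20-alpine", "copy web/crux/", "env NOCACHE=1", "exec npm ci", "exec npm run build"}
	fmt.Println(simulateRun(cache, pipeline)) // [false false false false false]
	fmt.Println(simulateRun(cache, pipeline)) // [true true true true true]

	// Cache buster: only the changed step and the steps after it rerun.
	pipeline[2] = "env NOCACHE=2"
	fmt.Println(simulateRun(cache, pipeline)) // [true true false false false]
}
```

This is also why the order of steps matters for cache efficiency: steps that change rarely (base image, dependency install) should come before steps that change often (copying source code, building).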

Challenges and Lessons Learned

We were able to run most of our stack with Dagger 0.8.x, running the Crux backend and the Crux-UI frontend separately. However, our full e2e test requires Dagger 0.9.x and its Services v2 API, so that we can run Crux, Crux-UI, Traefik and Kratos as long-running service containers for the Playwright e2e container.

If you want to know more about Services v2, Dagger wrote a blog post about it:

Dagger 0.9: Host-to-container, container-to-host, and other networking improvements: https://dagger.io/blog/dagger-0-9

Best Practices for Dagger CI/CD

Being able to write our CI/CD code in Go, in a Docker-like style, was refreshing. Here are some general tips:

Iterate small: Start with a small POC to understand how Dagger fits into your workflow before scaling up
Community engagement: Stay active in Dagger's community forums or Discord channels for support and to keep up with the latest developments
Documentation: Keep your Dagger configurations well-documented to ease onboarding and maintenance
Monitor and optimize: Regularly review the performance of your pipelines and optimize caching strategies for better efficiency

Conclusion

We have seen firsthand the transformative nature of Dagger and the flexibility of its programmable pipelines. It stands out as a forward-thinking solution, addressing typical CI/CD bottlenecks with a developer-centric approach. Since Dagger is relatively new and evolving, keeping an eye on updates and community feedback can help in adopting best practices as they emerge.

Dagger Resources

There's still a lot to learn about Dagger, so it is worth checking out the following resources:

You can explore further on Dagger's official website: https://dagger.io
For those eager to dive deeper into Dagger's capabilities, the Dagger documentation is an excellent resource: https://docs.dagger.io
For absolute hackers: https://github.com/dagger/dagger
Dagger Discord community: https://discord.gg/dagger-io

This blog post was written by the dyrector.io team. dyrector.io is an open-source continuous delivery & deployment platform with version management.

Support us with a star on GitHub.