DEV Community

ForgeWorkflows

Posted on • Originally published at forgeworkflows.com

Building AI Agents: Scratch vs. Agent-Service-Toolkit

Seventy-two percent of organizations now use AI in at least one business function, up from 50% in prior years, according to McKinsey's State of AI 2024 report. That adoption curve has a dirty secret: most of those deployments are still running on duct-taped notebook exports, not maintainable services. The gap between "it works in a Jupyter cell" and "it runs reliably in production" has eaten more engineering hours than any prompt engineering problem I've seen.

The question I keep getting from engineers in Python and FastAPI communities is not whether to build with LLMs—that decision is already made. The question is how to structure the surrounding infrastructure so the thing doesn't collapse the first time a real user hits it. That's the comparison I want to work through here: building your own orchestration layer from scratch versus adopting an opinionated toolkit like Agent-Service-Toolkit, which bundles LangGraph, FastAPI, and Streamlit into a single deployable scaffold. Both paths are legitimate. Neither is free.

Why This Decision Matters More Than Model Choice

Most tutorials stop at the LLM call. They show you how to send a message and parse a response, then leave you to figure out how to wrap that in an API, persist conversation state, handle retries, and give a non-engineer a way to test it. That gap is where projects stall.

I made this mistake myself—built our first Autonomous SDR on a flat three-node architecture where research, scoring, and writing all reported to a single orchestrator. On five leads, it worked fine. At fifty, the scoring node sat idle waiting on research outputs that had nothing to do with scoring. The fix wasn't a better model. It was splitting into discrete nodes with explicit handoff contracts between them. That change cut end-to-end processing time and made each node independently testable. Every pipeline we've built since uses explicit inter-node schemas for exactly that reason—implicit data passing doesn't hold up when load increases. If you're curious how we think about that kind of quality standard across our builds, that's documented in our BQS methodology.
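To make "explicit handoff contracts" concrete, here is a minimal sketch of what a typed handoff between a research node and a scoring node can look like. The field names and classes (ResearchOutput, ScoringInput) are illustrative assumptions, not the actual schemas from our pipeline:

```python
from dataclasses import dataclass

# Hypothetical handoff contract between a research node and a scoring node.
# Frozen dataclasses make the contract explicit and the payload immutable.

@dataclass(frozen=True)
class ResearchOutput:
    lead_id: str
    company_summary: str
    signals: list

@dataclass(frozen=True)
class ScoringInput:
    lead_id: str
    signals: list

def to_scoring_input(research: ResearchOutput) -> ScoringInput:
    # The handoff is an explicit, typed conversion -- nothing passes implicitly,
    # so each node can be tested against its input schema in isolation.
    return ScoringInput(lead_id=research.lead_id, signals=research.signals)

research = ResearchOutput("L-042", "Mid-market SaaS, hiring SDRs.", ["hiring", "funding"])
scoring = to_scoring_input(research)
print(scoring.lead_id)  # L-042
```

Because the scoring node only depends on ScoringInput, it no longer has any reason to wait on research fields it never reads, which is exactly the coupling that stalled our flat architecture.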

Approach A: Building the Stack Yourself

Rolling your own infrastructure gives you complete control over every dependency. You choose the orchestration library, the API framework, the state store, and the frontend. For teams with strong opinions about their existing stack—say, a FastAPI service already running in Kubernetes with a Redis cache—this is often the right call.

The cost is real, though. You write the boilerplate for session management, tool registration, streaming responses, and error handling. You wire up the OpenAPI docs manually or accept that they won't exist. You build a test UI or tell your product manager to wait. We've seen this path take two to three weeks before a single business stakeholder can interact with the system. That's not a criticism—it's the honest tradeoff. Custom infrastructure is more adaptable long-term but slower to validate.
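As a small taste of the boilerplate you own on the custom path, here is a minimal retry wrapper of the kind every scratch-built stack ends up accumulating. This is a sketch under simplified assumptions; a production version would also need jitter, timeout budgets, structured logging, and exception filtering:

```python
import time

def with_retries(fn, attempts=3, base_delay=0.5):
    """Call fn(), retrying on exception with exponential backoff.

    Minimal sketch: retries every exception, which a real service
    should narrow to transient errors (timeouts, rate limits).
    """
    for attempt in range(attempts):
        try:
            return fn()
        except Exception:
            if attempt == attempts - 1:
                raise
            time.sleep(base_delay * 2 ** attempt)

# Simulated flaky dependency that succeeds on the third call.
calls = {"n": 0}

def flaky():
    calls["n"] += 1
    if calls["n"] < 3:
        raise TimeoutError("transient")
    return "ok"

print(with_retries(flaky, base_delay=0.01))  # ok
```

Multiply this by session management, streaming, and tool registration and the two-to-three-week estimate stops looking pessimistic.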

There's also a maintenance surface to consider. Every custom abstraction you write is one more thing your team owns. When LangGraph ships a breaking change, you absorb it directly. When FastAPI updates its dependency injection model, you update your wrappers. This is manageable for a dedicated platform team. For a two-person ML team at a startup, it's a meaningful ongoing tax.

Approach B: Agent-Service-Toolkit as Opinionated Scaffold

Agent-Service-Toolkit takes a different position: make the infrastructure decisions for you so you can focus on the logic that's actually specific to your use case. The toolkit ships with LangGraph handling orchestration and memory, FastAPI providing the HTTP layer with automatic OpenAPI documentation and async support, and Streamlit giving you a working test interface without writing a single line of React.

The practical effect is that you scaffold a working service—one where a non-technical stakeholder can open a browser tab and interact with your system—in hours rather than days. That speed matters most during the validation phase, when you're still figuring out whether the core behavior is correct before investing in custom UI.

The limitation is the flip side of the benefit: the toolkit is opinionated. If your organization already runs a different API framework, or if you need a state persistence layer that isn't what LangGraph provides out of the box, you're either adapting the toolkit or fighting it. I've watched teams spend more time customizing an opinionated scaffold than they would have spent building the relevant pieces themselves. The toolkit is not a universal answer—it's the right answer for teams who don't yet have strong infrastructure opinions and want to move from concept to testable service quickly.

The Infrastructure Decisions That Actually Differ

Three specific decisions separate these approaches in practice.

State management. LangGraph's built-in checkpointing handles conversation memory and node state persistence. Building this yourself means choosing between Redis, a relational database, or an in-memory store—and writing the serialization logic for each. The toolkit makes this decision for you. If that decision fits your constraints, you save the work. If it doesn't, you're overriding a core assumption.
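To illustrate what "writing the serialization logic yourself" means, here is a toy stand-in for conversation-state checkpointing. LangGraph ships checkpointers that handle this for you; on the custom path, something like the following is yours to build and maintain. The class and storage backend here are assumptions for illustration, with an in-memory dict where a real build would use Redis or a relational table:

```python
import json

class CheckpointStore:
    """Toy checkpoint store keyed by conversation thread id.

    In-memory sketch only -- a production version needs a durable
    backend, versioning, and a cleanup policy.
    """

    def __init__(self):
        self._store = {}

    def save(self, thread_id, state):
        # The JSON round-trip forces state to be serializable up front,
        # so non-serializable objects fail at save time, not restore time.
        self._store[thread_id] = json.dumps(state)

    def load(self, thread_id):
        raw = self._store.get(thread_id)
        return json.loads(raw) if raw else {}

store = CheckpointStore()
store.save("thread-1", {"messages": ["hi"], "step": 2})
print(store.load("thread-1")["step"])  # 2
```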

Inter-node contracts. This is where I see the most failures in custom builds. When nodes pass data implicitly—through shared mutable state or loosely typed dictionaries—debugging becomes archaeology. Agent-Service-Toolkit's LangGraph integration encourages typed state schemas between nodes. You can ignore this and pass raw dicts, but the scaffold nudges you toward explicit contracts. That nudge has real value, especially for teams newer to multi-node orchestration.
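The "typed state schema" the scaffold nudges you toward often looks like a TypedDict, which is a common way to declare graph state in LangGraph. A sketch, with field names that are my assumptions rather than anything the toolkit prescribes:

```python
from typing import TypedDict

# Illustrative agent-state schema. With a schema like this, a static
# checker (mypy or pyright) flags missing or misspelled keys; a raw
# dict gives you the same bug as a KeyError at runtime instead.

class AgentState(TypedDict):
    query: str
    retrieved_docs: list
    draft: str

def research_node(state: AgentState) -> AgentState:
    # The contract is visible in the signature instead of being
    # buried in dict accesses scattered through the node body.
    docs = ["doc about " + state["query"]]
    return {"query": state["query"], "retrieved_docs": docs, "draft": state["draft"]}

state: AgentState = {"query": "pricing", "retrieved_docs": [], "draft": ""}
print(research_node(state)["retrieved_docs"])  # ['doc about pricing']
```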

Deployment surface. A FastAPI service with auto-generated OpenAPI docs is immediately consumable by other services, by Postman, by your QA team. A custom Flask wrapper with no docs is not. The toolkit gives you the former by default. Building it yourself means you either invest in documentation tooling or accept that integration will be harder than it needs to be. For teams building automation pipelines that other systems will call—the kind of work we cover in our analysis of manual vs. automated operations—this matters more than it might seem upfront.

When to Build From Scratch

Choose the custom path when your organization has an existing service architecture that the toolkit would conflict with. If you're running a Go-based microservices platform and Python is a second-class citizen, forcing a Python-first scaffold into that environment creates more problems than it solves. Similarly, if your state management requirements are unusual—multi-tenant isolation, compliance-driven audit logging, or a specific vector store integration—you'll spend more time bending the toolkit than building the relevant pieces directly.

Custom builds also make sense when the team has the bandwidth to own the infrastructure long-term. A dedicated platform engineering team that will maintain the orchestration layer across multiple products gets more value from a custom foundation they fully understand than from a third-party scaffold they partially understand.

When the Toolkit Wins

Use Agent-Service-Toolkit when speed to a testable, shareable prototype is the primary constraint. If you need a product manager or a client to interact with the system within a week, the Streamlit frontend alone justifies the toolkit. You're not building a demo—you're building a real service with a real API—but you get the demo surface for free.

It also wins for teams building their first production LLM service. The opinionated structure teaches good patterns: typed state, explicit tool registration, async request handling. Engineers who build with the toolkit first tend to make better infrastructure decisions when they eventually do build custom systems, because they've seen what a well-structured service looks like. That's not a small thing. The adoption gap in many organizations isn't a model problem—it's an infrastructure literacy problem.

What We'd Do Differently

Start with explicit node schemas before writing any logic. Whether you use the toolkit or build from scratch, define the data contract between each processing step before you write the first function. We skipped this on an early build and spent two days debugging a failure that turned out to be a missing key in a dict passed between nodes. Typed schemas would have caught it at definition time.
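The difference between catching that bug at definition time versus two days later is easy to show. A hypothetical version of the handoff that bit us, with illustrative field names:

```python
from dataclasses import dataclass

# Hypothetical payload for the handoff where the missing-key bug hid.

@dataclass
class NodePayload:
    lead_id: str
    score: float

# With a raw dict, the forgotten field is silent here...
raw = {"lead_id": "L-7"}        # 'score' omitted -- no error
print(raw.get("score"))         # None -- surfaces nodes later as a mystery

# ...while the typed contract fails immediately at construction time.
try:
    NodePayload(lead_id="L-7")  # missing 'score'
except TypeError as exc:
    print("caught at definition time:", exc)
```

The failure moves from "somewhere downstream, eventually" to "the exact line where the contract was violated."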

Don't use the Streamlit frontend as your long-term UI. It's excellent for validation and internal testing. It is not a customer-facing interface. Teams that ship the Streamlit layer to end users accumulate UI debt that's harder to pay down than if they'd planned for a proper frontend from the start. Use it for what it's designed for—rapid internal feedback—then build the real interface when the behavior is validated.

Plan for the toolkit's update cycle before you commit. Agent-Service-Toolkit is a third-party dependency. LangGraph itself is under active development as of 2026. Before adopting any opinionated scaffold, check the project's release cadence and breaking change history. A scaffold that hasn't been updated in six months when its core dependencies have shipped three major versions is a liability, not an asset.
