DEV Community: Manychat Engineering

How we solved navigation in a modular SwiftUI app with TCA

Manychat Engineering — Tue, 30 Jun 2026 08:29:08 +0000

We looked for a navigation pattern that fit modular TCA apps. We didn’t find one, so we built our own.

In a SwiftUI app, navigation is one of the first things that breaks once the project stops being simple. NavigationStack and a few @State booleans work well enough in small projects. But when an app goes modular, features move into separate Swift packages, flows need to be reused, deep links show up, and someone eventually asks how navigation logic gets tested. That’s when the simple patterns stop being enough.

I’m Evgeny Serdyukov, iOS Tech Lead at Manychat, and this is the story of how we came up with a SwiftUI navigation architecture that actually works for our modular app, built around TCA.

Navigation challenges in a modular app

In a modular app, features are isolated by design. A feature doesn’t know what screen comes before it or after it — and it shouldn’t. But navigation still has to connect everything somehow. How do you do that without features depending on each other?

When we committed to SwiftUI navigation, we quickly realized the examples wouldn’t take us very far. Neither Apple’s documentation nor most community articles cover navigation in a large modular app. SwiftUI gives you all the navigation primitives you need — stacks, pushes, pops, and sheets — but offers little guidance on how those pieces should fit together in a modular architecture.

The next problem appears when a feature becomes a flow. Imagine a multi-step feature (a sub-flow) such as checkout, with screens for choosing a delivery address, selecting a payment method, confirming the order, and showing payment success. Internally, it wants its own navigation stack.

Unfortunately, SwiftUI doesn’t allow nesting a NavigationStack inside another NavigationStack. The usual workaround is to present such flows as sheets, giving them a separate navigation context. That works until design requirements get involved.

Our designers wanted the same flow to be reusable in different contexts: sometimes pushed directly into the current navigation stack, sometimes presented as a sheet. The implementation shouldn’t care which presentation style was chosen.

TCA adds another layer of complexity. In TCA, navigation isn’t owned by SwiftUI — it’s part of the feature’s state and managed by reducers. In a modular app, every complex feature has its own reducer, and naturally wants to own its own stack. But that runs straight into SwiftUI’s single-stack limitation.

By this point, we had a fairly concrete set of requirements.

#	Requirement	Why
R1	Inline push instead of a modal	Per-screen swipe-back, host navigation bar stays visible.
R2	One Feature per sub-flow	Three-step Checkout is one reducer, not three.
R3	Host doesn't know sub-flow internals	Adding a step inside Checkout mustn't touch the host.
R4	Sub-flow doesn't know its host	Same Checkout binary plugs into any host.

To explain the solution, I’ll use a simplified checkout flow: Home → Checkout → Payment → Receipt. The full example is on GitHub: a small but complete app with real features, deep links, modal presentations, and tests.

Alternatives considered

1. Canonical TCA StackState

TCA has a built-in navigation system. The standard approach is to declare a StackState inside the host reducer, a typed list of everything that can be pushed onto the stack:

@Reducer struct ShopFlow {
 struct State { var path = StackState<Path.State>() }
 @Reducer enum Path {
     case checkout(CheckoutRootFeature)
     case payment(PaymentFeature)
     case receipt(ReceiptFeature)
 }
}

Advancing the sub-flow is straightforward:

state.path.append(.payment(.init()))

At first glance, this looks exactly like what we need.

The problem is that Path lives in the host, ShopFlow. The host must therefore enumerate every screen that can ever appear on the stack, including the internal screens of nested sub-flows. Add a step inside Checkout, and you’re editing ShopFlow.

A three-step sub-flow becomes three separate reducers with delegate handoffs between them. The host knows every internal transition and every screen.

For us, that broke the purpose of modularity.

2. TCACoordinators (johnpatrickmorgan)

TCACoordinators brings the Coordinator pattern to TCA and is probably the most commonly suggested solution for complex navigation. It offers two ways to handle sub-flows:

Option A: present the sub-flow as a sheet or full-screen cover. This keeps sub-flows isolated, but they can no longer participate in the parent’s navigation stack. Users can’t navigate through the sub-flow using the same push-based experience, and swipe-back behavior changes.

For us, that violated the requirement that a flow must be reusable both as a push and as a sheet.

Option B: flatten steps into the parent’s screen enum.

This removes the sheet limitation, but puts us back in the same situation as the canonical StackState approach: the parent coordinator must enumerate every screen from every child flow.

The modularity problem returns.

3. Global AppPath enum

Another common approach is to define a single navigation enum at the application root.

To compile, the Hub module must depend on every feature in the app at once:

public enum AppPath {
 case checkout(CheckoutFeature)
 case payment(PaymentFeature)
 case receipt(ReceiptFeature)
 // …all other navigation path cases
}

Any host can append any case.

This feels flexible at first, but the flexibility is an illusion. The enum must import every feature, and every feature depends on the enum — the dependency arrows stop pointing one way and the module boundary collapses. Every new screen edits the same file, turning it into a chokepoint for merge conflicts and full-app rebuilds. Worse, ownership leaks upward: the knowledge that payment leads to receipt now lives at the root, not inside checkout. Different name, same shape — this is the god router we set out to avoid.

Alternative	R1 inline push	R2 single Feature	R3 host-blind	R4 sub-flow blind
Canonical StackState<Path>	✅	❌	❌	❌
TCACoordinators (sheet)	❌	✅	✅	✅
TCACoordinators (flattened)	✅	❌	❌	❌
Global AppPath	✅	❌	❌	❌

We explored six to eight variations in total. These three represent the main architectural archetypes. Each solved part of the problem, but every one came with a tradeoff we weren’t willing to accept. So we built our own.

The solution: the flow pattern

We didn’t invent this from scratch. The foundation is the Coordinator pattern, originally described by Soroush Khanlou in 2015: a dedicated object owns navigation logic while features emit events.

We call it a Flow rather than a Coordinator for two reasons.

First, to avoid confusion with TCACoordinators, which implements a different navigation model.

Second, a Flow is more than a classic coordinator. It’s a full TCA reducer that owns the child state and can mutate it directly. A traditional Coordinator decides which screen is shown. A Flow also decides what data that screen starts with and can update that data at any point through the normal TCA state tree.

The architecture is built on three primitives:

Shared — a reference-semantic, lock-protected shared value provided by swift-sharing. The host flow owns one instance and passes it down to child features.
Path enums — typed hashable values pushed into the NavigationStack path. Each feature declares its own.
.navigationStackContext — a custom TCA reducer modifier that handles push, cleanup-on-pop, and effect cancellation in one place.

The key idea is simple: navigation is shared state, not imperative routing. A feature navigates by mutating its own state; the shared stack reflects that change, and .navigationStackContext keeps the two in sync, in both directions, automatically. No feature ever knows the whole map; it only owns its own slice of it.

Before diving into the primitives, it helps to see where everything lives. Here’s how this maps to an actual project structure:

App/
Packages/
├── ThirdParty/ ← TCA wrapper
├── Core/ ← NavigationState, navigationStackContext DSL
├── Features/ ← Isolated feature reducers + views
└── Flows/ ← Navigation composition layer

Dependency arrows flow in one direction:

App → Flows → Features → Core → ThirdParty.

Features don’t import Flows. Flows don’t know how a feature navigates internally.

Here is how flows (navigation layer) interact with features:

Solid arrows represent state ownership: parent reducers own child state and pass down the shared navigation stack.

Dashed arrows represent delegate actions: child features report meaningful events, and the parent decides what to do next.

Features never talk to each other directly. Everything goes through the Flow above them.

NavigationState

SwiftUI’s built-in NavigationPath is opaque. You can push and pop values, but you can’t inspect what’s currently on the stack. Real navigation logic often needs that visibility: to pop to an anchor, check whether a screen is already pushed, or validate deep-link state.

NavigationState wraps NavigationPath with a parallel [AnyHashable] array, making the stack fully inspectable:

public struct NavigationState: Equatable, @unchecked Sendable {
 private var _path = NavigationPath()
 public private(set) var items: [AnyHashable] = []

 public var path: NavigationPath {
     get { _path }
     set {
         // Called by SwiftUI on back-swipe — trim items to match
         let diff = _path.count - newValue.count
         if diff > 0 { items.removeLast(min(diff, items.count)) }
         _path = newValue
     }
 }

 public mutating func append<V: Hashable & Sendable>(_ value: V) {
     _path.append(value)
     items.append(AnyHashable(value))
 }

 /// Pop everything after the most recent occurrence of `value`; anchor stays.
 public mutating func popTo<V: Hashable & Sendable>(_ value: V) { ... }

 /// Pop the anchor and everything after it.
 public mutating func popPast<V: Hashable & Sendable>(_ value: V) { ... }
}

The implementation maintains a single invariant:

path.count == items.count

The path setter expects only shorter writes, which is how SwiftUI signals a back-swipe. Everything else goes through append.
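The bodies of popTo and popPast are elided in the excerpt above. As a rough sketch of the semantics their doc comments describe, here is a plain-Swift stand-in that keeps only the inspectable array. InspectableStack is a hypothetical name, and the real NavigationState would also have to trim its NavigationPath by the same number of elements to preserve the invariant.

```swift
// Hypothetical stand-in for NavigationState: only the inspectable array,
// without the SwiftUI NavigationPath mirror.
struct InspectableStack {
    var items: [AnyHashable] = []

    init() {}

    mutating func append<V: Hashable>(_ value: V) {
        items.append(AnyHashable(value))
    }

    /// Pops everything after the most recent occurrence of `value`; the anchor stays.
    /// Does nothing if `value` is not on the stack.
    mutating func popTo<V: Hashable>(_ value: V) {
        guard let index = items.lastIndex(of: AnyHashable(value)) else { return }
        items.removeSubrange((index + 1)...)
    }

    /// Pops the most recent occurrence of `value` and everything after it.
    /// Does nothing if `value` is not on the stack.
    mutating func popPast<V: Hashable>(_ value: V) {
        guard let index = items.lastIndex(of: AnyHashable(value)) else { return }
        items.removeSubrange(index...)
    }
}

enum CheckoutPath: Hashable { case checkout, payment, receipt }

var stack = InspectableStack()
stack.append(CheckoutPath.checkout)
stack.append(CheckoutPath.payment)
stack.append(CheckoutPath.receipt)

stack.popTo(CheckoutPath.payment)     // receipt is removed, payment stays
stack.popPast(CheckoutPath.checkout)  // checkout and everything after it is removed
```

Searching from the end (lastIndex) matters when the same tag can legitimately appear twice on the stack; whether the actual implementation does this is an assumption based on the phrase "most recent occurrence".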

Typed path enums

Every feature that participates in navigation declares its own path enum:

public enum CheckoutPath: Hashable, Sendable {
 case checkout
 case payment
 case receipt
}

This enum is the feature’s public navigation contract.

The host Flow knows that CheckoutPath exists, and nothing more.

Internal screens are represented as enum cases. Adding a new screen to the checkout flow means adding a new case and updating the destinations inside the same feature package. The host remains unchanged.
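To make the "own slice" idea concrete, here is a small plain-Swift illustration of how tags from unrelated features can sit in one type-erased array while each feature reads back only its own cases. ProfilePath is a hypothetical second feature that does not appear in the example app.

```swift
enum CheckoutPath: Hashable { case checkout, payment, receipt }

// Hypothetical second feature, imagined as living in a different package.
enum ProfilePath: Hashable { case profile, settings }

// The shared stack stores type-erased tags, so enums from unrelated
// packages can coexist without knowing about each other.
let items: [AnyHashable] = [
    AnyHashable(ProfilePath.profile),
    AnyHashable(CheckoutPath.checkout),
    AnyHashable(CheckoutPath.payment),
]

// Each feature recovers only its own slice by casting back to its enum.
let checkoutScreens = items.compactMap { $0 as? CheckoutPath }

// Membership checks work on the erased values directly.
let isPaymentPushed = items.contains(AnyHashable(CheckoutPath.payment))
```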

The navigationStackContext

This is the piece that ties everything together and makes everything declarative. You declare your navigable children once:

// Inside CheckoutFeature.body:
.navigationStackContext(nav: \.$nav, pathAction: \.pathDidChange) { context in
 context.ifLet(\.payment, action: \.payment, path: CheckoutPath.payment) {
     PaymentFeature()
 }
 context.ifLet(\.receipt, action: \.receipt, path: CheckoutPath.receipt) {
     ReceiptFeature()
 }
}

What this one modifier buys you:

1. Auto-push. When state.payment changes from nil to a value, the modifier automatically appends CheckoutPath.payment to the shared navigation stack — nav.items.

Without the modifier, you would have to write the navigation mutation by hand:

state.$nav.withLock {
 $0.append(CheckoutPath.payment)
}

2. Cleanup on pop. When the user swipes back, SwiftUI emits .pathDidChange(newPath). The modifier checks whether CheckoutPath.payment is still present in the navigation stack. If not, it sets state.payment = nil. TCA’s ifLet then automatically cancels any in-flight payment effects.

3. Programmatic pop. Calling state.$nav.withLock { $0.popPast(CheckoutPath.payment) } updates the navigation stack. The modifier observes the change and triggers the same cleanup flow as a user-initiated back swipe.

The result is a clear separation of responsibilities: reducers manage state, while navigationStackContext manages navigation synchronization and cleanup.
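The two sync directions can be modelled without TCA or SwiftUI at all. The sketch below is a simplified model, not the modifier’s actual implementation: CheckoutModel, autoPush, and cleanUp are hypothetical names, and a String stands in for PaymentFeature.State.

```swift
enum CheckoutPath: Hashable { case checkout, payment, receipt }

// Hypothetical plain-data model; String stands in for PaymentFeature.State.
struct CheckoutModel: Equatable {
    var stack: [CheckoutPath] = [.checkout]
    var payment: String? = nil
}

/// State -> stack: run after a reducer pass. If the child went from nil to a
/// value and its tag is not on the stack yet, push the tag.
func autoPush(before: CheckoutModel, after: inout CheckoutModel) {
    if before.payment == nil, after.payment != nil, !after.stack.contains(.payment) {
        after.stack.append(.payment)
    }
}

/// Stack -> state: run when the stack shrinks (back swipe or programmatic pop).
/// If the child's tag is gone, drop the child state.
func cleanUp(_ model: inout CheckoutModel, newStack: [CheckoutPath]) {
    model.stack = newStack
    if !newStack.contains(.payment) {
        model.payment = nil
    }
}

var model = CheckoutModel()
let before = model
model.payment = "card ending 4242"        // the reducer only assigns state
autoPush(before: before, after: &model)   // the tag is pushed for it

cleanUp(&model, newStack: [.checkout])    // simulated back swipe clears the child
```

In the real modifier, clearing the child state is also what lets TCA’s ifLet cancel the child’s in-flight effects; this model leaves effects out entirely.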

Inside navigationStackContext

The DSL is built from two lower-level primitives composed together:

1. ifLetNavigated. It extends TCA’s standard .ifLet with navigation awareness:

func ifLetNavigated<ChildState, ChildAction, Path, Child>(
 _ child: WritableKeyPath<State, ChildState?>,
 action childAction: CaseKeyPath<Action, ChildAction>,
 nav: KeyPath<State, Shared<NavigationState>>,
 path: Path,
 syncOn syncAction: CaseKeyPath<Action, NavigationPath>,
 relayTo childSyncAction: CaseKeyPath<ChildAction, NavigationPath>? = nil,
 @ReducerBuilder<ChildState, ChildAction> then childReducer: () -> Child
) -> some Reducer<State, Action>

It keeps the child state and navigation state synchronized.

Specifically, it:

Watches the syncAction (path changes). If the child’s path tag disappears from nav.items, it sets the child state back to nil.
Watches state transitions from nil to a value and automatically pushes the corresponding path tag to nav.items.
Optionally forwards navigation changes to the child via childSyncAction. This is the relayTo mechanism for nested sub-flows.

2. observingNavChanges. Whenever the navigation stack shrinks — whether because of a back swipe or a programmatic pop — it re-dispatches pathAction:

func observingNavChanges(
 nav: KeyPath<State, Shared<NavigationState>>,
 pathAction: CaseKeyPath<Action, NavigationPath>
) -> some Reducer<State, Action>

This creates a single navigation event stream for all navigation mutations. Whether the user swiped back or a reducer called popPast, navigation follows the same cleanup path.

navigationStackContext — a result builder DSL — composes the primitives above.

It collects _NavChild declarations, applies ifLetNavigated to each one, and wraps the result with observingNavChanges:

func navigationStackContext(
 nav: KeyPath<State, Shared<NavigationState>>,
 pathAction: CaseKeyPath<Action, NavigationPath>,
 @_NavigationBuilder _ build: (NavigationStackContext<State, Action>) -> [_NavChild<State, Action>]
) -> some Reducer<State, Action> {
 let context = NavigationStackContext(nav: nav, pathAction: pathAction)
 let children = build(context)
 var result: any Reducer<State, Action> = self
 for child in children {
     result = child.apply(result) // chain ifLetNavigated for each child
 }
 return _AnyReducerBox(base: result)
     .observingNavChanges(nav: nav, pathAction: pathAction)
}

Flows: the composition layer

A Flow is a reducer that owns a Shared<NavigationState> and coordinates navigation between features. Here’s ShopFlow from the diagram above:

@Reducer
public struct ShopFlow {
 @ObservableState
 public struct State: Equatable, Sendable {
     @Shared(value: NavigationState()) public var nav

     public var checkout: CheckoutFeature.State?
     public var home = ShopHomeFeature.State()
 }

 public enum Action {
     case checkout(CheckoutFeature.Action)
     case home(ShopHomeFeature.Action)
     case openURL(URL)
     case pathDidChange(NavigationPath)
 }

 public var body: some ReducerOf<Self> {
     Scope(state: \.home, action: \.home) { ShopHomeFeature() }

     Reduce { state, action in
         switch action {
         case let .home(.delegate(.buyButtonTapped(productID))):
             // Just assign state — navigationStackContext auto-pushes
             state.checkout = CheckoutFeature.State(
                 nav: state.$nav,
                 productID: productID
             )
             return .none

         case .checkout(.delegate(.finished)):
             state.$nav.withLock { $0.popPast(CheckoutPath.checkout) }
             return .none

         case let .openURL(url):
             guard let route = ShopDeepLinkParser.parse(url) else { return .none }
             // ...
             return .none

         case .checkout, .home, .pathDidChange:
             return .none
         }
     }
     .navigationStackContext(nav: \.$nav, pathAction: \.pathDidChange) { context in
         context.ifLet(
             \.checkout,
              action: \.checkout,
              path: CheckoutPath.checkout,
              relayTo: \.pathDidChange
         ) {
             CheckoutFeature()
         }
     }
 }
}

ShopFlow doesn’t know that Checkout contains a Payment screen. It doesn’t know that Checkout eventually shows a Receipt screen. It doesn’t even know how many steps the flow contains.

The only thing it knows is that CheckoutPath.checkout is the entry point.

When home signals that the user tapped Buy, ShopFlow assigns CheckoutFeature.State and passes down its shared navigation stack. navigationStackContext auto-pushes the first checkout screen.

From that point on, everything inside CheckoutFeature is opaque to the host. The host Flow sees the flow as a single destination, regardless of how many screens live behind it.

The relayTo: \.pathDidChange argument forwards path changes from ShopFlow into CheckoutFeature’s own navigationStackContext, so that Checkout’s internal screens get proper cleanup when the user swipes all the way back.

The view layer

ShopFlowView’s job is simply to connect SwiftUI’s NavigationStack to the Flow’s shared navigation state and register feature destinations. Here’s the root view for our example flow:

public struct ShopFlowView: View {
 @SwiftUI.Bindable var store: StoreOf<ShopFlow>

 public var body: some View {
     NavigationStack(path: $store.nav.path.sending(\.pathDidChange)) {
         ShopHomeView(store: store.scope(state: \.home, action: \.home))
             .checkoutDestinations(store: store.scope(state: \.checkout, action: \.checkout))
     }
 }
}

.sending(\.pathDidChange) is the bridge between SwiftUI’s NavigationStack and TCA. When the user swipes back, SwiftUI writes a shorter NavigationPath into the binding. .sending converts that into a .pathDidChange action and sends it through the reducer. That’s what triggers navigationStackContext’s cleanup.

checkoutDestinations is a View extension defined inside CheckoutFeature — the feature owns its destination registration:

public extension View {
 func checkoutDestinations(store: StoreOf<CheckoutFeature>?) -> some View {
     navigationDestination(for: CheckoutPath.self) { path in
         if let store {
             switch path {
             case .checkout:
                 CheckoutView(store: store)
             case .payment:
                 if let s = store.scope(state: \.payment, action: \.payment) {
                     PaymentView(store: s)
                 }
             case .receipt:
                 if let s = store.scope(state: \.receipt, action: \.receipt) {
                     ReceiptView(store: s)
                 }
             }
         }
     }
 }
}

The host Flow attaches checkoutDestinations, but it doesn’t know how Checkout routes internally. The routing logic lives inside the feature package.

Adding a new Checkout screen means updating CheckoutPath and checkoutDestinations. Both changes happen inside the CheckoutFeature package. ShopFlowView doesn’t change.

One more thing is worth mentioning. navigationDestination(for:) must be attached to the root view managed by the NavigationStack, not to one of the pushed destination views.

SwiftUI scopes destination registrations to the subtree of the view that declares them. Registering destinations inside a pushed screen can lead to routes that are ignored or resolved inconsistently.

For that reason, all destination registration in this architecture happens at the stack root, while the destination definitions themselves remain owned by the corresponding feature packages.

Deep linking

Deep links are handled entirely in the Flow layer, where navigation state lives:

case let .openURL(url):
 guard let route = ShopDeepLinkParser.parse(url) else { return .none }
 switch route {
 case let .checkout(productID):
     state.checkout = CheckoutFeature.State(nav: state.$nav, productID: productID)
     return .none

 case let .checkoutPayment(productID):
     state.checkout = CheckoutFeature.State(nav: state.$nav, productID: productID)
     return .send(.checkout(.deepLinkToPayment))

 case let .receipt(orderID):
     state.checkout = CheckoutFeature.State(nav: state.$nav, productID: "unknown")
     return .send(.checkout(.deepLinkToReceipt(orderID: orderID)))
 }

The parser itself is a pure function and can be tested independently of the rest of the navigation system:

public enum ShopDeepLinkParser {
 public static func parse(_ url: URL) -> ShopDeepLink? {
     let segments = pathSegments(from: url)
     // shop://checkout/{productID}
     if segments.count == 2, segments[0] == "checkout" {
         return .checkout(productID: segments[1])
     }
     // shop://checkout/{productID}/payment
     if segments.count == 3, segments[0] == "checkout", segments[2] == "payment" {
         return .checkoutPayment(productID: segments[1])
     }
     // https://example.com/receipt/{orderID}
     if segments.count == 2, segments[0] == "receipt" {
         return .receipt(orderID: segments[1])
     }
     return nil
 }
}
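The parser relies on a pathSegments(from:) helper that the excerpt does not show. One possible shape, written here as a free function and assuming the URL formats in the comments above, has to account for the fact that in a custom-scheme URL such as shop://checkout/{productID} the first segment is parsed as the host rather than as part of the path:

```swift
import Foundation

/// Hypothetical helper: returns the meaningful segments of a deep-link URL.
func pathSegments(from url: URL) -> [String] {
    // pathComponents reports "/" separators as their own entries; drop them.
    var segments = url.pathComponents.filter { $0 != "/" }

    // For custom schemes ("shop://checkout/..."), Foundation treats the first
    // segment as the host, so it is put back in front. For web links the host
    // is the domain and is not part of the route.
    let scheme = url.scheme?.lowercased()
    if scheme != "http", scheme != "https", let host = url.host, !host.isEmpty {
        segments.insert(host, at: 0)
    }
    return segments
}

let paymentLink = URL(string: "shop://checkout/espresso-machine/payment")!
let receiptLink = URL(string: "https://example.com/receipt/order-42")!

let paymentSegments = pathSegments(from: paymentLink)
let receiptSegments = pathSegments(from: receiptLink)
```

With this normalization, both link styles reach the parser as a flat array of segments, which is what the count and index checks in parse(_:) expect.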

Since navigation state is plain data, handling a deep link is no different from any other flow entry. The Flow just builds the right state and sends the actions. The only view-layer piece is forwarding the URL:

struct ContentView: View {
 @State private var store = Store(initialState: ShopFlow.State()) {
     ShopFlow()
 }
 var body: some View {
     ShopFlowView(store: store)
         .onOpenURL { store.send(.openURL($0)) }
 }
}

Everything past that — parsing, building state, pushing the stack — is pure reducer logic. Which is exactly why deep-link tests need no simulator.

Modal presentations

Modals (sheets and alerts) use TCA’s @Presents, and a key principle holds: a presentation lives with the feature that raises it; the host never touches it.

CheckoutFeature owns a promo-code sheet:

@ObservableState
public struct State: Equatable, Sendable {
 @Presents public var promoCode: PromoCodeFeature.State?
 // ...
}

A sheet is presented by assigning state, and dismissed by setting it back to nil:

case .promoCodeButtonTapped:
 state.promoCode = PromoCodeFeature.State()
 return .none

case let .promoCode(.presented(.delegate(.applied(code)))):
 state.appliedPromoCode = code
 state.promoCode = nil // dismiss the sheet
 return .none

PaymentFeature owns the alert. It’s the one that knows a payment failed. The host never sees the alert:

case .failedButtonTapped:
 state.alert = AlertState {
     TextState("Payment failed")
 } actions: {
     ButtonState(action: .retryPayment) { TextState("Try again") }
     ButtonState(role: .cancel, action: .cancelPayment) { TextState("Cancel") }
 } message: {
     TextState("Your payment could not be processed.")
 }
 return .none

case .alert(.presented(.cancelPayment)):
 return .send(.delegate(.cancelled)) // tell the host; it pops the stack

Both are wired with .ifLet, each inside its own feature’s body:

// inside CheckoutFeature
.ifLet(\.$promoCode, action: \.promoCode) { PromoCodeFeature() }

// inside PaymentFeature
.ifLet(\.$alert, action: \.alert)

Push navigation and modal presentations are symmetric: both are optional states that TCA manages via .ifLet. The only difference is whether you route through NavigationStack or a .sheet / .alert modifier. And just like push navigation, a modal stays encapsulated in the feature that owns it — PaymentFeature decides to alert; CheckoutFeature only learns the user cancelled via a delegate action, then pops the stack.

What a feature needs to be a sub-flow

A feature only needs a few things to become a reusable sub-flow that can participate in any host’s navigation stack.

It defines a typed path enum describing the screens it can navigate to: a *Path enum with Hashable & Sendable cases for the feature’s screens.
It defines its own destination registration: a *Destinations(store:) view extension that registers navigationDestination(for: *Path.self) and maps path cases to views.
It accepts a navigation stack from its host: State.init(nav: Shared<NavigationState>), with a default value of Shared(value: NavigationState()) so the feature can also run standalone in a sheet.

If the feature contains nested navigation, it also:

Declares case pathDidChange(NavigationPath) in Action.
Uses .navigationStackContext in its body.
The parent then passes relayTo: \.pathDidChange to forward path changes inward.

The feature lives in Features/, not Flows/. It doesn’t own NavigationState, it uses one passed in from outside.

Here’s what that looks like in practice:

@Reducer
public struct CheckoutFeature {
 @ObservableState
 public struct State: Equatable, Sendable {
     @Shared public var nav: NavigationState // shared with host

     public var payment: PaymentFeature.State?
     public var receipt: ReceiptFeature.State?
     public var productID: String
     // ...
 }

 public enum Action {
     case continueButtonTapped
     case deepLinkToPayment
     case pathDidChange(NavigationPath) // receives path changes from host
     case payment(PaymentFeature.Action)
     case receipt(ReceiptFeature.Action)
     case delegate(Delegate)
     // ...
 }
}

The @Shared reference means CheckoutFeature and its host share the same NavigationState. When Checkout pushes .payment, the host’s NavigationStack sees the change immediately. The default value in init means the same feature also works standalone in a sheet.

Testing

TCA’s TestStore makes navigation logic straightforward to test. The entire stack is inspectable via nav.items:

func test_buyButtonPushesCheckout() async {
 let store = TestStore(initialState: ShopFlow.State()) { ShopFlow() }
 store.exhaustivity = .off

 await store.send(.home(.buyButtonTapped))
 await store.skipReceivedActions()

 XCTAssertEqual(store.state.checkout?.productID, "coffee-grinder")
 XCTAssertTrue(store.state.nav.items.contains { ($0 as? CheckoutPath) == .checkout })
}

func test_backSwipeCleansPaymentButKeepsCheckout() async {
 let store = TestStore(initialState: ShopFlow.State()) { ShopFlow() }
 store.exhaustivity = .off

 await store.send(.home(.buyButtonTapped))
 await store.skipReceivedActions()
 await store.send(.checkout(.continueButtonTapped))

 // Simulate swipe-back to checkout (remove payment from path)
 var path = NavigationPath()
 path.append(CheckoutPath.checkout)
 await store.send(.pathDidChange(path))
 await store.skipReceivedActions()

 XCTAssertNotNil(store.state.checkout) // checkout stays
 XCTAssertNil(store.state.checkout?.payment) // payment cleaned up
 XCTAssertEqual(
     store.state.nav.items.compactMap { $0 as? CheckoutPath },
     [.checkout]
 )
}

func test_deepLinkToPaymentBuildsFeatureAndStack() async {
 let store = TestStore(initialState: ShopFlow.State()) { ShopFlow() }
 store.exhaustivity = .off

 await store.send(.openURL(URL(string: "shop://checkout/espresso-machine/payment")!))
 await store.skipReceivedActions()

 XCTAssertEqual(store.state.checkout?.productID, "espresso-machine")
 XCTAssertNotNil(store.state.checkout?.payment)
 XCTAssertEqual(
     store.state.nav.items.compactMap { $0 as? CheckoutPath },
     [.checkout, .payment]
 )
}

func test_receiptCloseFinishesFlow() async {
 let store = TestStore(initialState: ShopFlow.State()) { ShopFlow() }
 store.exhaustivity = .off

 await store.send(.openURL(URL(string: "https://example.com/receipt/order-42")!))
 await store.skipReceivedActions()
 await store.send(.checkout(.receipt(.closeButtonTapped)))
 await store.skipReceivedActions()

 XCTAssertNil(store.state.checkout)
 XCTAssertTrue(store.state.nav.items.isEmpty)
}

You can test deep links, back-swipes, alert interactions, and multi-screen flows without ever launching a simulator. Navigation logic is just state transformation.

Conclusion

This is a niche solution.

If you’re building a small app or working in a team of two, the canonical StackState works fine. The complexity here pays off when you have multiple teams working on isolated feature packages, designers who want full flexibility in how sub-flows are presented, and a codebase where build times matter.

The cost is real: roughly 150 lines of custom infrastructure in Core. In return:

Features are isolated reducers that emit delegate actions. They don’t know what comes before or after them.
Flows own the Shared<NavigationState>, compose features, and route delegate actions between them.
navigationStackContext handles push, cleanup, and effect cancellation automatically.
Deep links are just state construction and actions — no different from any other flow entry.
Navigation logic is plain state, so tests need no simulator.

Most importantly, sub-flows remain self-contained.

We pushed the same CheckoutFeature binary into three different hosts without touching it once. Our designers get their animations. A change inside any feature doesn’t touch Flows or App — the compiler sees the boundary and rebuilds only what changed.

In practice, incremental builds run in roughly 2 seconds.

Enjoy this kind of problem-solving? Come build cool things with a great team — join Manychat.

Practical observability checklist for APIs, workers & jobs. Part 1

Manychat Engineering — Tue, 23 Jun 2026 12:02:11 +0000

The minimum set of signals that helps you understand what’s happening in production before users tell you something is wrong.

Production has a special talent for turning “seems fine” into “why is everything on fire?”

The service is up. Dashboards are green. Then reality hits: a restart that never reaches readiness, a worker that quietly stops consuming events, a scheduled job that never runs, latency creeping upward until users notice first.

Most production failures are not mysterious. They are predictable, observable, and usually fixable. Yet they still turn into incidents because we discover them too late. After enough incidents, a pattern becomes hard to ignore: we’re not missing fixes first — we’re missing signals.

A green dashboard can still hide a broken workload.

That realization changes how you think about observability. The question is no longer “Do we have Grafana?”, “Do we collect logs?”, or “Should we add tracing?”

The real question is: can we understand what is happening in production before users — or another team — tell us something is wrong?

I’m Daria, a Python Engineer at Manychat with a QA/SDET background and a strong preference for systems that are boring to run. Over the past year, my team shipped a new class of production Python services for data processing and analytics and built their observability from scratch. This article is the checklist that emerged from that work.

It’s intentionally vendor-agnostic. The goal is not to recommend a particular monitoring stack, framework, or observability platform. The goal is to identify the minimum set of signals that tells you whether a workload is healthy and doing the job it’s supposed to do.

It covers three workload types: an HTTP API, a background worker (queue consumer, event processor, task worker, or anything else that does work outside the request/response path), and a scheduled job. Each fails differently, so each needs a different observability baseline.

What “observable” means in practice

Before adding dashboards, alerts, or traces, ask a simpler question: what does “working” actually mean for this service? Not “is the process running?” or “does Kubernetes think the pod is alive?”

But — what does correct behavior look like from the outside?

For an API: it accepts requests and responds correctly within acceptable latency. For a worker: it consumes events, handles them successfully, keeps backlog under control, and makes progress recently enough. For a scheduled job: it ran today, completed successfully, processed a non-suspicious amount of data, and produced output fresh enough for the product.

Once you can answer that, observability becomes much easier to reason about. A system is observable enough when you can answer important operational questions about production services quickly:

If you can answer these in minutes, debugging becomes more predictable. If not, you guess — and guessing under production pressure leads to random dashboard clicking, noisy Slack threads, and fixing the first visible symptom instead of the actual problem.

This is why observability should start from operational questions, not from tools. Tools are implementation details; signals are the product.

A metric is useful if it answers a question.

A log is useful if it helps reconstruct what happened.

A trace is useful if it connects behavior across components.

An alert is useful if it tells the right people about an actionable problem early enough.

The goal is not more data. It’s the right questions and signals.

Workload type matters

It is tempting to use one generic checklist for every production workload. But an HTTP API, a worker, and a scheduled job fail differently, so each needs a different observability baseline.

Different workloads fail differently.

API checklist

An API is usually the first to show when something breaks — users send requests, downstream services call it, error rates and latency surface fast.

The core questions are:

Is it up?
Is it ready?
Is it serving requests?
Is it failing?
Is it fast enough?
Is traffic normal?

Here’s what to watch for an API.

Health and readiness

Liveness, readiness, and user-facing checks are related but answer different questions. Liveness: is this process alive, or should it be restarted? Readiness: can this instance safely receive traffic right now? A user-facing or a synthetic check: does the service behave correctly from the outside?

A process can be alive without being ready. A service can be technically ready and still fail a real user-facing flow.

An HTTP 200 alone may not be enough: you may also want to check response latency, expected response shape, or data freshness. A useful readiness check should reflect whether the service can actually do its job: database connectivity, required configuration, critical dependencies, internal startup state.

The important part is not to turn readiness into a heavy synthetic transaction. The important part is to avoid the false comfort of “the process exists, therefore the service is fine.”
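The difference between liveness and readiness can be sketched in a few lines. This is a minimal illustration, not a framework integration: the `readiness` helper, the check names, and the stand-in check functions are all assumptions made for the example.

```python
from typing import Callable, Dict, Tuple

# Hypothetical check type: returns True when the dependency is usable.
Check = Callable[[], bool]


def liveness() -> bool:
    # Liveness only answers "can this process respond at all?"
    return True


def readiness(checks: Dict[str, Check]) -> Tuple[bool, Dict[str, str]]:
    """Run every check and report a per-dependency status.

    A check that raises counts as failed, so one broken dependency
    cannot crash the readiness endpoint itself.
    """
    results: Dict[str, str] = {}
    for name, check in checks.items():
        try:
            results[name] = "ok" if check() else "failed"
        except Exception as exc:
            # Report the error class only; details belong in logs.
            results[name] = f"error:{type(exc).__name__}"
    ready = all(status == "ok" for status in results.values())
    return ready, results


# Stand-in checks (assumptions, not real clients).
def db_ping() -> bool:
    return True


def config_loaded() -> bool:
    raise KeyError("MISSING_SETTING")


ready, detail = readiness({"database": db_ping, "config": config_loaded})
```

Here the process is alive, but `ready` is false because the configuration check failed. That is exactly the case a liveness-only probe would miss.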

Useful signals:

service/pod availability,
readiness status,
restart count,
startup failures,
dependency readiness when critical.

Common mistakes:

treating liveness as readiness,
alerting only when the pod disappears,
not alerting when the service exists but never becomes ready.

Request rate and throughput

Request rate gives context. A latency spike during a traffic surge tells a different story from the same spike during normal load. A sudden drop can also be a signal — maybe clients stopped calling, routing broke, a feature flag changed, or an upstream service failed.

Useful signals:

requests per second,
requests by route/endpoint,
traffic split by status code,
traffic split by important client/source if applicable.

Careful with labels. Endpoint labels are useful but raw URLs, account IDs, user IDs, or arbitrary request parameters can create high cardinality and make your metrics backend very unhappy.
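One way to keep endpoint labels bounded is to map raw paths onto a fixed set of route templates before they reach a metric. The sketch below assumes a hypothetical route table and a catch-all `other` value; many web frameworks expose the matched route template directly, which is preferable when available.

```python
import re

# Hypothetical route templates exposed by the service.
KNOWN_ROUTES = [
    (re.compile(r"^/accounts/\d+$"), "/accounts/{id}"),
    (re.compile(r"^/accounts/\d+/reports/[0-9a-f-]+$"),
     "/accounts/{id}/reports/{report_id}"),
    (re.compile(r"^/health$"), "/health"),
]


def route_label(path: str) -> str:
    """Map a raw request path to a bounded label value."""
    path = path.split("?", 1)[0]  # drop query parameters
    for pattern, template in KNOWN_ROUTES:
        if pattern.match(path):
            return template
    # Unknown paths share one value, so scanners and typos
    # cannot create new time series.
    return "other"
```

With this mapping, the number of distinct label values is capped at the number of templates plus one, regardless of how many accounts or reports exist.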

Error rate

It is one of the first signals people expect from an API. You want to know:

how many requests fail,
whether failures are client-side or server-side,
which endpoints are affected,
whether the failure is sustained or just a tiny spike.

Useful signals:

5xx rate,
4xx rate when meaningful,
error ratio by endpoint,
exception count by error class,
dependency error count if the API calls databases, caches, queues, or external services.
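To tell a sustained failure from a tiny spike, one simple approach is to require the error ratio to stay above a threshold for several consecutive windows. The sketch below assumes per-window `(errors, total)` counts are already available; the threshold and window count are illustrative and would need tuning per service.

```python
from typing import Sequence, Tuple


def error_ratio(errors: int, total: int) -> float:
    """Share of failed requests; zero traffic counts as zero errors."""
    return errors / total if total else 0.0


def is_sustained(
    windows: Sequence[Tuple[int, int]],
    threshold: float,
    min_windows: int,
) -> bool:
    """True when each of the last `min_windows` windows exceeds the threshold."""
    if min_windows < 1 or len(windows) < min_windows:
        return False
    recent = windows[-min_windows:]
    return all(error_ratio(e, t) > threshold for e, t in recent)
```

For example, a single window at 30% errors followed by a window at 1% is a spike, while three consecutive windows above 5% would count as sustained under a 5% threshold.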

Latency, especially tail latency

Averages are often too polite. They hide the pain. Users don’t experience the average request — they experience the one they’re waiting for right now. That’s why p95 and p99 are usually more useful than average latency alone.

Useful signals:

p50 latency for baseline behavior,
p95 latency for common bad experience,
p99 latency for tail behavior,
latency by endpoint,
latency of important dependencies when available.

Common mistakes:

looking only at average latency,
histogram buckets too coarse to show the real problem,
one latency SLO applied to very different endpoints.

One thing worth knowing: if your dashboard shows p99 stuck exactly at the highest histogram bucket boundary for a long time, the real latency may be worse than the chart can show. That’s not a healthy signal, that’s an instrumentation limitation.
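The flatline effect is easy to reproduce. The sketch below assumes cumulative buckets with a final +Inf bucket and estimates quantiles by linear interpolation inside a bucket, which is a common approach for bucketed histograms; the bucket bounds and sample values are made up for the example.

```python
from bisect import bisect_left
from typing import List, Sequence

INF = float("inf")


def bucket_counts(samples: Sequence[float], bounds: Sequence[float]) -> List[int]:
    """Cumulative counts per upper bound, plus a final +Inf bucket."""
    all_bounds = list(bounds) + [INF]
    counts = [0] * len(all_bounds)
    for value in samples:
        # A value lands in the first bucket whose bound is >= value.
        counts[bisect_left(all_bounds, value)] += 1
    for i in range(1, len(counts)):
        counts[i] += counts[i - 1]  # make the counts cumulative
    return counts


def estimate_quantile(q: float, bounds: Sequence[float], cumulative: Sequence[int]) -> float:
    """Estimate a quantile from cumulative bucket counts."""
    total = cumulative[-1]
    if total == 0:
        raise ValueError("no observations")
    rank = q * total
    all_bounds = list(bounds) + [INF]
    for i, count in enumerate(cumulative):
        if count >= rank:
            if all_bounds[i] == INF:
                # Nothing is known beyond the highest finite bound,
                # so the estimate is clamped there.
                return bounds[-1]
            lower = all_bounds[i - 1] if i > 0 else 0.0
            below = cumulative[i - 1] if i > 0 else 0
            in_bucket = count - below
            if in_bucket == 0:
                return lower
            return lower + (all_bounds[i] - lower) * (rank - below) / in_bucket
    return bounds[-1]


# Hypothetical data: most requests are fast, 5% take 8 seconds.
bounds = [0.1, 0.5, 1.0]
samples = [0.05] * 90 + [0.3] * 5 + [8.0] * 5
counts = bucket_counts(samples, bounds)
p50 = estimate_quantile(0.50, bounds, counts)
p99 = estimate_quantile(0.99, bounds, counts)
```

In this data the slowest requests take 8 seconds, but `p99` comes out as 1.0, the highest finite bucket bound. Adding buckets above the expected worst case is what makes the real tail visible.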

Domain-specific signals

Generic API metrics are necessary but not always enough. Many production issues only make sense when you add one or two domain-specific signals:

cache hit/miss ratio,
cache invalidation count,
downstream query duration,
number of records returned,
rate of empty responses,
feature-specific processing outcomes,
calls to a critical third-party dependency.

Do not turn everything into a metric. Add the signals that explain important system behavior.

Worker / event processor checklist

Workers are tricky because they can look alive while doing nothing useful. A worker can be running as a process but failing as a workload. For workers, “alive” is not the same as “working”.

The process is running. The service instance is running. CPU is fine. The platform reports healthy. But no events are being consumed. Or they’re read and fail during handling. Or one poison message blocks everything. Or the backlog quietly grows while the worker is technically “up”.

For workers, liveness is not enough. The real question is: is it making progress?

Read rate / consumption rate

First, you need to know whether the worker is actually reading from the queue — events from Kafka, RabbitMQ, SQS, tasks from a queue, messages from a stream, whatever your architecture uses.

Useful signals:

events/messages/tasks read total,
read rate over time,
read failures by error class,
last read timestamp.

A worker that isn’t reading may still look alive. Without these metrics, you’ll discover the problem indirectly — through stale data, customer reports, or a growing backlog.

Processing outcomes

Reading work is not the same as handling it successfully. A worker may consume events but fail while processing them — and without outcome metrics, you won’t know.

Useful signals:

processed/handled total,
success count,
failed count,
skipped/unhandled count,
retry count,
failure by error class,
failure by handler/event type/task type.

A good metric shape to aim for:

events_handled_total{handler, event_type, outcome}
events_processing_failed_total{handler, event_type, error_class}

The exact names should follow your project and monitoring conventions. What matters is the model: count handled work, separate outcomes, keep labels bounded, and make failures explorable by handler, event type, and error class.
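As a rough sketch of that model, the two counters can be kept as plain dictionaries keyed by label tuples. A real service would use its metrics library instead; the `handle` wrapper and its arguments are assumptions made for the example.

```python
from collections import Counter
from typing import Any, Callable

# Keys are label tuples: (handler, event_type, outcome or error_class).
events_handled_total: Counter = Counter()
events_processing_failed_total: Counter = Counter()


def handle(
    handler_name: str,
    event_type: str,
    handler: Callable[[Any], None],
    event: Any,
) -> bool:
    """Run a handler and count the outcome with bounded labels."""
    try:
        handler(event)
    except Exception as exc:
        events_handled_total[(handler_name, event_type, "failure")] += 1
        # The exception class name is a bounded label;
        # the exception message is not, so it stays out of metrics.
        events_processing_failed_total[
            (handler_name, event_type, type(exc).__name__)
        ] += 1
        return False
    events_handled_total[(handler_name, event_type, "success")] += 1
    return True
```

Every handled event increments exactly one outcome, so success and failure counts always add up to the total handled work.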

Backlog / queue depth

If your architecture has a queue, backlog is one of the most important things to watch.

Useful signals:

queue depth,
oldest message age,
lag by partition/topic/stream when applicable,
backlog growth rate.

Backlog needs context. A queue depth of 100 may be perfectly fine in one system and catastrophic in another. What matters is whether the worker can catch up and whether the delay violates product expectations.

Show backlog, processing rate, and failure rate together on the same dashboard. If backlog says work exists, processing rate says it’s moving, and failure rate stays quiet — you’re good.
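Whether the worker can catch up is simple arithmetic over those signals. The helpers below are an illustrative sketch; the function names are hypothetical, and the rates are assumed to be averages over the same time window.

```python
from typing import Optional


def backlog_growth_per_second(
    depth_before: int, depth_after: int, interval_seconds: float
) -> float:
    """Positive means the backlog is growing, negative means it is draining."""
    return (depth_after - depth_before) / interval_seconds


def seconds_to_drain(
    depth: int, arrival_rate: float, processing_rate: float
) -> Optional[float]:
    """Estimated time to empty the queue at the current rates.

    Returns None when the worker is not catching up.
    """
    if depth == 0:
        return 0.0
    net_rate = processing_rate - arrival_rate
    if net_rate <= 0:
        return None
    return depth / net_rate
```

A queue depth of 600 with 10 arrivals and 30 completions per second drains in about 30 seconds. The same depth with equal arrival and processing rates never drains, which is the case worth alerting on.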

Processing rate and backlog together.

Last successful progress timestamp

This is one of the most useful signals for silent failures, when the worker looks alive but isn’t actually doing anything. Track the timestamp of the last successful progress point — whether it is a read, a completed processing step or a full read+process cycle, depending on what “progress” means for your worker.

Useful signals:

last_read_timestamp_seconds,
last_processed_timestamp_seconds,
last_successful_task_timestamp_seconds.
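A staleness check on top of such a timestamp might look like the sketch below. The `ProgressTracker` class and the idle threshold are assumptions for illustration; in practice the timestamp is usually exported as a gauge and the comparison lives in an alert rule.

```python
import time
from typing import Callable, Optional


class ProgressTracker:
    """Remembers when the worker last completed a unit of work."""

    def __init__(self, clock: Callable[[], float] = time.time) -> None:
        self._clock = clock
        self.last_processed_timestamp_seconds: Optional[float] = None

    def mark_processed(self) -> None:
        # Call this only after the work item was handled successfully.
        self.last_processed_timestamp_seconds = self._clock()

    def seconds_since_progress(self) -> Optional[float]:
        if self.last_processed_timestamp_seconds is None:
            return None
        return self._clock() - self.last_processed_timestamp_seconds

    def is_stalled(self, max_idle_seconds: float) -> bool:
        idle = self.seconds_since_progress()
        # "Never processed anything" is treated as stalled here;
        # a real alert would likely allow a startup grace period.
        return idle is None or idle > max_idle_seconds
```

The threshold depends on the workload: a worker on a busy stream may be stalled after a minute of silence, while a low-traffic queue can legitimately be idle for hours.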

Processing duration

Workers need duration metrics too, but the question is different from APIs — not request/response time, but how long it actually takes to process a unit of work.

Useful signals:

processing duration histogram,
p50/p95/p99 processing time,
duration by handler/task type,
slow processing count.

Common mistake:

measuring the wrong boundary and then misinterpreting the result.

If you decorate a high-level handle_event function, your histogram may include routing, validation, handler execution, logging, dependency calls, and error handling. That’s still useful, but know what you’re actually measuring.
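To make the boundary explicit, it can help to time the outer function and the handler separately. This is a minimal sketch: the `timed` context manager, the boundary names, and the in-memory `durations` store are assumptions standing in for a real histogram.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import Any, Callable, Dict, Iterator, List

# Stand-in for duration histograms, keyed by boundary name.
durations: Dict[str, List[float]] = defaultdict(list)


@contextmanager
def timed(boundary: str) -> Iterator[None]:
    """Record how long the wrapped block took, even if it raises."""
    start = time.perf_counter()
    try:
        yield
    finally:
        durations[boundary].append(time.perf_counter() - start)


def handle_event(event: Dict[str, Any], handlers: Dict[str, Callable]) -> Any:
    # Outer boundary: routing, handler execution, and error handling.
    with timed("handle_event_total"):
        handler = handlers[event["type"]]
        # Inner boundary: the handler alone.
        with timed("handler_only"):
            return handler(event)
```

Comparing the two series shows how much time is spent around the handler rather than in it, which is the part a single high-level decorator hides.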

Scheduled job checklist

Scheduled jobs fail even more quietly. A daily job may do nothing for 23 hours and still be healthy, which makes generic service-style monitoring a poor fit.

The first question is whether it ran successfully when it was supposed to. Then: how long did it take, did it process the expected amount of data, when was the last success, and is the result still fresh enough for whatever depends on it.

Last run timestamp

You need to know when the job last started.

Useful signal: last_run_timestamp_seconds.

This tells you whether the scheduler triggered the job at all. If the last run timestamp is too old, the problem may be scheduling, deployment, permissions, environment configuration, or the job process not starting.

Last success timestamp

A job can run and fail. That is why the last run is not enough.

Useful signal: last_success_timestamp_seconds.

This is often the best freshness signal for scheduled jobs.

Last run status

A simple status metric is extremely practical.

Useful signal: last_run_status where 1 = success, 0 = failure.

This gives a clear “latest result” view.

Duration

Duration helps detect degradation before complete failure.

Useful signals:

last_run_duration_seconds,
duration history over time,
p95/p99 duration if the job runs frequently enough.

For daily jobs, even a simple last-duration gauge can help.

It tells you whether the job is getting slower, whether an increase in data volume affected runtime, whether a dependency slowed down, and whether the job is getting close to exceeding its scheduling window.
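The four signals covered so far (last run, last success, last status, and duration) can be recorded by one small wrapper around the job. The sketch below writes them into a plain dictionary; the `run_job` name and the dictionary are assumptions, and a real job would push the values to its metrics backend before exiting.

```python
import time
from typing import Callable, Dict


def run_job(
    job: Callable[[], None],
    metrics: Dict[str, float],
    clock: Callable[[], float] = time.time,
) -> None:
    """Run a job and record run, success, status, and duration signals."""
    started = clock()
    # Set before the work starts, so a crash still leaves evidence of the run.
    metrics["last_run_timestamp_seconds"] = started
    try:
        job()
    except Exception:
        metrics["last_run_status"] = 0
        raise  # let the scheduler see the failure too
    else:
        metrics["last_run_status"] = 1
        metrics["last_success_timestamp_seconds"] = clock()
    finally:
        metrics["last_run_duration_seconds"] = clock() - started
```

Note that a failed run updates the run timestamp and the status but leaves the last success timestamp untouched, which is what keeps it useful as a freshness signal.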

Output / records processed

Success status alone can be misleading for data jobs — the job may complete without producing anything useful.

Useful signals:

records processed,
records inserted/updated/deleted,
number of accounts/customers/entities processed,
output freshness,
number of empty results,
validation failures.

This is where business-level metrics can be helpful.

Reading the signals together

These signals become most useful when you read them together:

last run recent + last status failure = job ran but failed
last run old + last success old = job may not be running
last run recent + last success recent + duration increased = job works but may be slowing down
last run succeeded + records processed is unexpectedly zero = the job ran but produced nothing useful

Records processed unexpectedly zero in the last successful run.

This is why scheduled job observability should not rely on one status flag alone. You want enough signals to distinguish “did not run”, “ran and failed”, “ran and succeeded”, and “ran but produced suspicious output”.
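Those four states can be derived mechanically from the signals. The function below is an illustrative sketch: the `JobSignals` structure, the thresholds, and the rule that fewer records than expected counts as suspicious are assumptions that would differ per job.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class JobSignals:
    last_run_timestamp_seconds: Optional[float]
    last_run_status: Optional[int]  # 1 = success, 0 = failure
    records_processed: int


def diagnose_job(
    signals: JobSignals,
    now: float,
    max_age_seconds: float,
    min_expected_records: int = 1,
) -> str:
    """Classify the latest state of a scheduled job."""
    last_run = signals.last_run_timestamp_seconds
    if last_run is None or now - last_run > max_age_seconds:
        return "did not run"
    if signals.last_run_status != 1:
        return "ran and failed"
    if signals.records_processed < min_expected_records:
        return "ran but produced suspicious output"
    return "ran and succeeded"
```

The order of the checks matters: a stale run timestamp is reported first, because the status and record count from an old run say nothing about today.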

Don’t forget about dependency and infrastructure metrics

Application metrics tell you how the workload behaves. Dependency and infrastructure metrics help explain why.

If API latency goes up, the cause may be in the application, database, cache, external API, connection pool, or infrastructure. For a database-backed service, API latency should be visible together with database query duration, connection pool behavior, database errors/timeouts, and storage-level signals such as IOPS or read/write latency when relevant.

API latency goes up due to slow DB query.

Useful signals:

DB connection count / pool usage,
query latency,
slow queries,
DB errors/timeouts,
IOPS / disk latency for managed databases such as RDS,
cache hit/miss ratio,
cache latency,
external dependency latency/error rate,
CPU and memory,
restarts,
disk and network signals.

Infrastructure metrics support investigation, but they don’t replace user-impact signals. High CPU is context. High p99 latency is impact. A service can have normal CPU usage and still return wrong data. A worker can have a healthy pod and still stop processing. A scheduled job can have no alarming resource usage because it never ran.

Start from workload behavior, and use infrastructure metrics to explain what you find.

***

That’s the metrics side covered: what to watch for APIs, workers, and scheduled jobs, and how to read the signals together.

Metrics tell you something is wrong. But they won’t tell you what exactly happened, or where in the system it happened. That’s what logs and traces are for, and knowing when to reach for which one is half the battle. We’ll also talk about alerting that actually pages you for the right reasons, and a rollout order of the observability setup that won’t kill you. All in the second part.

Stay tuned!

Why an engineer should try running a community

Manychat Engineering — Tue, 16 Jun 2026 08:52:01 +0000

What do you get from organizing engineering meetups besides extra work? After organizing several of them, I have a few answers.

The latest meetup in April 2026.

Over three years at Manychat, I switched teams a few times and got to know a lot of engineers across the company. Still, I often learned about interesting technical projects completely by accident.

We had sprint reviews, of course, but they’re focused on product progress. The engineering stories behind the work rarely make it onto the agenda.

At some point, I got tired of relying on luck. We already had engineering communities, and when an opportunity came up to get more involved in the frontend one, I saw it as a chance to create a place where engineers from different disciplines could share what they were building and learn from each other.

Before starting, I had a long conversation with my manager. It forced me to think more carefully about why I wanted to do this in the first place: what value it would bring, not only to the company and the team, but also to me personally.

That question stayed with me. Five meetups later, with the sixth coming at the end of June, I think I have an answer.

I’m Egor, Frontend Tech Lead at Manychat. In this article, I won’t talk about why engineering communities are good for companies. Instead, I want to share what running one can give you as an engineer.

Building your network

When I joined Manychat, there were about 15 frontend engineers. I knew all of them. Today the company has more than 400 people, and that’s simply no longer possible.

We have Slack channels for introductions. Every new hire gets a welcome message. The problem is that someone new joins almost every day. Reading an introduction is easy. Actually getting to know people is much harder. If you’re not working on the same project, you may never interact at all. As systems grow, repositories split, and teams become more specialized, that becomes even more likely.

Organizing a meetup gives you a legitimate reason to talk to people you’d otherwise never meet. You learn what another team is building, what problems they’re solving, and which shortcuts they’ve already discovered.

Without these conversations, people slowly disappear into their own corners of the company. Important knowledge becomes local knowledge.

Our first offline meetup in July 2025.

Building soft skills

Ordering food, booking a room, finding a clicker that actually works: that’s the easy part (and it can be delegated). The harder part starts when you need people.

Most engineers spend their day talking to the same handful of people. Community work is different. Suddenly you’re asking strangers for favors: to share their work and knowledge, to spend their time preparing a talk, and finally to stand in front of a room full of colleagues and explain something they normally only discuss with their team. You have no authority over them, no leverage, no budget. Just your ability to explain why it matters and why it’s worth their time.

And then there are all the people outside engineering. Running a meetup means asking for budget, finding time in people’s schedules, promoting events, and getting organizational support. That usually involves managers, HR, and various stakeholders. They care about different things than engineers do. Learning how to explain the same initiative to different audiences is a useful skill on its own.

That turns out to be surprisingly useful practice for all kinds of soft skills. The era of highly paid engineers who can quietly ship tickets and avoid people altogether seems to be ending.

Whether you want to become a tech lead, an engineering manager, or simply a stronger engineer, your ability to work with people matters more every year. Running a community gives you a place to practice at low stakes.

And one more bonus. When you organize events, bring people together, and keep things moving, colleagues start associating you with ownership. You’re no longer just the person working on a specific project. You’re the person making something happen because you care.

Our second meetup in November 2025.

Expanding your picture of what’s happening

One thing I noticed after changing teams several times is how easy it is to develop a very narrow view of the company.

Sprint reviews are mostly about product increments. There’s no time to talk about new technical practices, architectural decisions, or how a team changed the way it works.

You can spend years in your slice of the codebase without understanding what’s going on around you. Meanwhile, colleagues are solving interesting problems and building things that could be useful to you too. Without a place where people share that, it just never comes up. Meetups became that place for us.

One of the meetup’s real wins

We had our own routing mechanism for a long time, then migrated to React Router, but URL management still wasn’t particularly pleasant. One engineer, as part of their PDP, implemented an abstraction layer for unified URL handling with fully typed parameters.

A meetup gave him a stage. Other engineers started experimenting with it. Today, every frontend engineer uses that solution. Without the meetup, that work would have stayed inside a single team. That’s probably one of my favorite examples of what these events can do.

I’ve switched teams several times and worked in very different parts of the product, and I still learn something new every time. That’s probably the most personally valuable thing I get as a community organizer.

Here’s what attendance looked like at our latest meetup in April 2026.

Ok, I’m in, where do you start?

After five meetups, I don’t think there’s a magic formula. But there are a few things that will help you start from square one:

1. Go to other people’s meetups first.

Before organizing anything, become part of someone else’s community. Watch how they run events, talk to organizers, ask what broke. Starting from scratch sounds exciting but it’s also an efficient way to collect every avoidable mistake yourself.

2. Start smaller.

It’s tempting to announce a company-wide engineering meetup from day one. Don’t.

Start with a group where you already have relationships — mobile, backend, frontend. Start with your own unit. The first two meetups naturally gravitated toward frontend topics. That’s where I knew the most people and where finding speakers was easiest.

3. Find people who care about the same thing.

This is the most important one. Communities don’t die because of a lack of ideas. They die because one person gets tired. If the entire community depends on one organizer, it already has an expiration date. Find people who care about the same thing and build a core group. They will keep things going when you’re too overwhelmed with work tasks or just out of energy.

4. Learn how to motivate people.

To do that, you need to know what the speaker gets out of it. Sometimes it’s visibility and recognition. Sometimes it’s public-speaking practice. Or just a chance to spend time with interesting colleagues and free pizza.

The same applies to attendees. A meetup without an audience is every organizer’s nightmare. Trust me.

5. Get support before you start.

Meetups happen during work hours, and so does the preparation for them. Most managers support this kind of activity. Some won’t. Get explicit support from your manager or leadership before investing months into building something.

And if your company has people responsible for employer branding or DevRel, talk to them early. These teams are often looking for exactly this kind of initiative. They can help promote events through internal communication channels, attract more attendees, and generally make sure your meetup doesn’t depend entirely on word of mouth. More importantly, they can help with motivation by offering something more tangible than recognition: for example, points that can be exchanged for company merch, small rewards, or other perks.

6. Don’t do everything yourself.

If you’re speaking, ask someone else to host. If you’re organizing, let someone else own parts of the process. Communities become more sustainable when responsibility is shared. And you’ll enjoy them more too. Because at some point you stop running an event and start being part of a community.

Want to be part of our engineering community? Take a look at our open positions. Maybe we’re looking for you.

How to Build an Agentic Design System People (and Agents) Will Actually Use (Part 1)

Manychat Engineering — Wed, 10 Jun 2026 08:35:30 +0000

On building Manychat’s design system in eight working days, and why those days were even possible.

Most design systems don’t fail by being wrong. They fail by multiplying.

When Manychat shifted mobile-first, we needed a design system that worked across all three platforms. We had one — mature, well-maintained, living on the web. But mobile had never adopted it. So each platform did what made sense locally: iOS added its own semantics, Android added its own, web shipped its own names. We’d been building parallel design systems — none of them quite agreeing on what subtle text or warning yellow actually meant.

Hello 👋 I’m Thanh, an iOS Engineer at Manychat. This series is about what we did about it: building an agentic design system that works across all our platforms, in just eight working days.

This is the first chapter focusing on what a solid design system should have in its foundation and how we rebuilt it to be AI-driven. If you finish it and you’re not slightly curious how AI can change the way you build — I failed 😔.

A design system is a shared language

A design system isn’t a component library, a Figma file, or even tokens in the abstract. It’s an agreement — a shared language between design and engineering for what the product should look like, feel like, and behave like, written in a form both sides can read, write, and hold each other to. Tokens encode the agreement: Link, Danger, Surface, Subtle. Not the same hex, but the same meaning.

Without that language, a codebase isn’t a system — it’s a collection of coincidences that happen to look like one. Engineers re-derive every spacing decision from scratch. Dark mode becomes a parallel product maintained by guesswork. New engineers spend their first week learning which color belongs where, knowledge that lives in tribal memory rather than documentation. Designers feel it from the other side: they invest in a Figma system and engineering ships something close. Nobody’s fault — there’s just no vocabulary to hold either side accountable.

Users notice eventually, not as bugs but as wobble — spacing that almost matches, colors that almost pair, a dark mode that almost works.

The question isn’t whether you need a design system. It’s how you build one that actually fits into the development flow instead of sitting next to it. Here’s what worked for us.

Do tokens before components

Most design systems start where they can ship: buttons, inputs, cards. That’s how we started too.

Brad Frost’s Atomic Design borrows directly from chemistry — atoms (button, input, label) bond into molecules, molecules into organisms, then templates, then pages. Simple, stable things combine into complex, situational ones.

Frost himself extended the model downward, treating tokens as the sub-atomic layer beneath atoms — the particles, and the rules atoms are made of. We use the same framing.

Sub-atoms are the raw decisions every atom is built from: color, spacing, radius, shadow, motion, typography. Other teams call them tokens. Either way, users don’t see them. Users see what they make possible.

Think of it like a food pyramid: what’s at the base supports everything above.

Skip them, and components rest on quicksand. A “primary button” isn’t a stable concept if “primary” doesn’t resolve to a specific background, a specific radius, a specific spacing rhythm, or a specific shadow language. Without those, every primary button in the codebase is a small re-derivation. The next brand refresh will break every component that never agreed on what “primary” meant.

Start with components, and you’ll rewrite them. Start with sub-atoms, and you’ll build on something that holds.

Name tokens by intent

Token names have nouns but no verbs — they describe a value, not an intent. I once found blue300 in our iOS codebase used as a text color, a button background, an icon tint, a tab indicator, and a border stroke — all in files written by different people, technically valid against the palette, none of them intentionally chosen against a shared rule. It was even worse: neutral100 had more than a hundred uses across the same codebase, working as disabled backgrounds, card surfaces, separators, chip fills, divider lines. One gray, five intentions.

The fix is a semantic layer. We built a three-layer architecture:

Core: raw values. Eleven color families, scales from s0 to s900. Platform-agnostic, defined once, and referenced everywhere, with no product opinion baked in.
Semantic: intent-based tokens, organized into five categories: text, icon, background, border, shimmer. Semantic tokens are named using patterns such as text.brand, background.warningDefault, icon.danger. Each token resolves to different core values depending on context: light mode, dark mode, high contrast, brand variant.
Component: names scoped to specific components. button.primaryBackground maps to a semantic background.

The layers talk in one direction: component → semantic → core. The payoff is change isolation. Need to update a brand color? Adjust core; everything downstream follows. Need a high-contrast accessibility mode? Remap semantics. White-label the product? Swap the semantic resolution. None of it requires touching the components that consume the tokens.
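The one-way resolution can be sketched with three lookup tables. Everything in this example is hypothetical: the token names follow the patterns described above, but the hex values, the light/dark mode keys, and the `resolve` helper are assumptions, not the actual Manyfest data.

```python
from typing import Dict

# Core: raw values with no product opinion (hex values are made up).
CORE: Dict[str, str] = {
    "blue.s300": "#79B8FF",
    "blue.s500": "#1F6FEB",
    "neutral.s0": "#FFFFFF",
    "neutral.s900": "#0D1117",
}

# Semantic: intent-based names that resolve per context.
SEMANTIC: Dict[str, Dict[str, str]] = {
    "background.brand": {"light": "blue.s500", "dark": "blue.s300"},
    "text.onBrand": {"light": "neutral.s0", "dark": "neutral.s900"},
}

# Component: names scoped to one component, pointing at semantics only.
COMPONENT: Dict[str, str] = {
    "button.primaryBackground": "background.brand",
    "button.primaryLabel": "text.onBrand",
}


def resolve(component_token: str, mode: str) -> str:
    """Follow component -> semantic -> core and return the raw value."""
    semantic_token = COMPONENT[component_token]
    core_token = SEMANTIC[semantic_token][mode]
    return CORE[core_token]
```

Changing a brand color means editing one entry in `CORE`; the component table never changes, which is the change isolation described above.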

The categories themselves do enforcement work. If a token lives under text, it can only be used for text. If an engineer reaches for background.brand to color an icon, the name itself signals the mistake — before any linter, review, or designer catches it. Naming by intent turns the taxonomy into a guardrail.

Start where it hurts

Don’t try to define everything at once. For us, the categories that hurt most were color and spacing — the most visible sources of inconsistency.

Your starting point might be different: typography, elevation, something product-specific. It doesn’t matter. A small system that works beats a large one that’s half-finished.

Build with AI in mind, and expect everything to change

By the time you read this, half of this is already different.

AI changes two things at once: the tools you use to build a design system, and the requirements the system itself has to meet. Structure and semantic clarity — everything we covered above — matter more when AI is in the loop. An AI agent reads your system more literally than a person does. If there are some gaps, a person fills them with intuition; an agent fills them with whatever fits the pattern — including the flawed ones.

For months we’d been building the first version of the design system the classic way: multi-week negotiations on what subtle text should mean, designer-engineer back-and-forth on dark-mode pairings, reviewed screen by screen. That process gave us the right foundation — semantic layers, tokens, atoms, a shared language. But the structure wasn’t designed for AI to read. So we restarted from scratch to make it AI-driven.

What we ended up building is close to what the 2026 design-systems community has started calling an agentic design system. Thanks to AI, in just eight days we could structurally encode the foundation we had as machine-readable infrastructure rather than static, human-oriented documentation. The result is Manyfest Design System: one Figma file for all platforms (web, iOS, Android), built to be read by both humans and AI agents.

Just as the classic process taught us how to lay the foundation, the AI-first rebuild taught us what that foundation needs to support.

One: design intent has to be explicit

Designers and engineers bring years of product context to every decision. AI doesn’t — not yet. So the token structure has to carry that context instead.

Our schema adopts the W3C Design Tokens Community Group standard, which has become the industry baseline. The standard provides the what: value, type, and description. We added the intent block to capture the why. This is what the community now calls token intent metadata — the structured rules about usage and pairings that transform a token from a simple hex string into something an AI agent can actually reason over.

So in Manyfest, every token carries metadata about why. It’s a hex string plus the role it plays, the surfaces it’s allowed on, and the contrast guarantees it carries.

// illustrative — actual schema is still being formalized
{
  "color.background.warning": {
    "$value": "{color.core.yellow.500}",
    "$type": "color",
    "$description": "Surface tone for non-blocking warning states",
    "$intent": {
      "useFor": ["banners", "inline alerts", "form-field warnings"],
      "doNotUseFor": ["error states", "destructive actions", "icon foregrounds"],
      "pairsWith": ["text.warning", "icon.warning", "border.warning"]
    }
  }
}

The metadata isn’t decoration. It’s the difference between an AI agent getting the token right by guessing, and getting it right because the token told it.
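As a rough illustration of how a tool or agent could consult that metadata, the sketch below loads the illustrative token from above into a Python dictionary and checks a proposed usage against its intent block. The `check_usage` function and its three result values are assumptions, not part of the Manyfest tooling.

```python
from typing import Any, Dict

# The illustrative token from the example above, as a Python dictionary.
WARNING_BACKGROUND: Dict[str, Any] = {
    "$value": "{color.core.yellow.500}",
    "$type": "color",
    "$description": "Surface tone for non-blocking warning states",
    "$intent": {
        "useFor": ["banners", "inline alerts", "form-field warnings"],
        "doNotUseFor": ["error states", "destructive actions", "icon foregrounds"],
        "pairsWith": ["text.warning", "icon.warning", "border.warning"],
    },
}


def check_usage(token: Dict[str, Any], usage: str) -> str:
    """Classify a proposed usage as allowed, forbidden, or unspecified."""
    intent = token.get("$intent", {})
    if usage in intent.get("doNotUseFor", []):
        return "forbidden"
    if usage in intent.get("useFor", []):
        return "allowed"
    # The token says nothing about this usage, so a human should decide.
    return "unspecified"
```

The third result matters as much as the other two: an explicit "unspecified" gives an agent a reason to ask instead of guessing.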

Two: reviews stop being only between humans

Designers and engineers used to review each other in PRs and Figma threads. Now there’s a third reader.

We have a shared skill (figma-component-review) that takes a Figma URL, parses the file, pulls the design variables and the component context, scans the matching package code in the design-system repo, and writes back a categorized list of questions: for the designer (intent ambiguity, missing variants, accessibility gaps) and for the engineer (token mismatch, naming drift, component reuse opportunities). The questions land as comments on the Figma node, where the designer is already working.

The point isn’t to replace the design review. It’s to surface the same questions earlier — the ones both sides would otherwise hit three weeks later, in a PR. That’s only possible because the system has structure: semantic tokens, named intent, atomic hierarchy. Give the skill a flat palette and it has nothing to compare against.

Three: AI catches mistakes humans miss

Not by writing code. By reading it.

We have AI in the loop on every PR: descriptions auto-written from the diff in the template, Conventional Commit titles normalized, contextual labels applied (accessibility, performance, migration), a split gate that blocks merges over 1,000 lines. It takes a whole category of small mistakes off the team’s plate.

This is an example of a PR description that AI filled in from the template we give it. You can find the full version here.

The real value shows up in smaller moments. While setting up Manyfest’s iOS skill — the file that tells AI agents how to scaffold a new component — an AI reviewer caught three mistakes:

a cyclic dependency I’d introduced in a preview helper;
a scope error that made an internal helper accidentally public;
three doc references pointing to a file I’d renamed two PRs earlier.

AI wasn’t writing my code, it was catching my mistakes — and that’s the version of AI-driven engineering I trust.

Four: don’t let AI decide, let it accelerate

AI isn’t magic, at least not at the moment I’m writing this. It hallucinates with confidence. It invents APIs that don’t exist and writes code that compiles but quietly does something else. Check the screenshot below: at one point our skill description randomly slipped into another language for no apparent reason. Because why not, I guess.

Treat it like a senior teammate — sharp, fast, sometimes too confident. Brief it carefully. Read its work. Push back when it’s wrong. It implements faster than you can but it doesn’t decide better than you can. The judgment stays yours.

The receipts

Eight working days — that’s how long it took to ship the AI-driven version of the design system. The classic version, the one we restarted from, had taken months. But those eight days were only possible because we had already spent those months, without AI, finding the right approach.

The system isn’t done yet. What’s done is the foundation: Manyfest in Figma is the source of truth, and everything defined there translates automatically into the format each platform speaks — twenty-three components, ready to wire in on every change.

Building a design system that AI can read pays off immediately: a designer can start prototyping by dropping a Figma file into Claude with the Figma MCP installed. What used to take a sprint takes a day.

Chapter 2 is coming next with more on building an AI-driven design system and the skills we developed along the way.

How we hire for infrastructure at Manychat: from first call to offer in 2 weeks

Manychat Engineering — Thu, 04 Jun 2026 09:44:19 +0000

Most hiring drags on because teams have only a vague idea of who they’re actually looking for. We decided to fix that first.

The reason most hiring takes months has nothing to do with the number of stages. It’s that there’s no clear definition of the target. So companies compensate: they collect more CVs, add more interviews, compare candidates to each other — and still struggle to make a call.

We decided to remove that guesswork. Instead of growing the candidate pipeline, we built a process where a strong candidate can be identified and hired immediately.

I’m Dmitry, Head of Infrastructure at Manychat, and in this post I’ll walk through how we hire senior infrastructure engineers in ten days — by knowing exactly who we’re looking for and what we’re evaluating them against.

Know what you’re looking for

“Knowing who you’re looking for” isn’t about years of experience or a tech stack. It’s about understanding what this person will actually own. What problems will they solve? What level of autonomy is expected? What does good look like in six months?

Without answers to those questions, every candidate looks “somewhat relevant.” With them, the wrong ones are easy to spot.

Score candidates against the target, not against each other

When the previous point fails, the usual practice kicks in: companies build a large candidate pool and pick the best one from it. Candidates get compared to each other, not against a defined set of expectations. This creates delays — when a strong candidate comes in, you can’t commit immediately, because you need to see three to five more first.

We don’t do that. When the right candidate shows up, we say yes.

Filter hard at the top, move fast at the bottom

Knowing exactly who we want means we can filter aggressively at the top of the funnel. Here’s how it worked for our Senior SRE search.

We looked at two things on a resume. First, scale — large user bases, high-traffic systems, complex migrations, strict reliability requirements, or cloud/platform constraints. That experience can come from large companies, fast-growing startups, open-source platforms, or other high-scale environments.

Second, adaptability. The stack matters, but it’s not the whole picture. For a Senior SRE, we were looking for a generalist, not a specialist. Technologies change — what’s dominant today won’t be in two years. A CV that shows someone has switched technologies multiple times — started on bash, moved to Ansible, then picked up Kubernetes and AWS — is a reliable signal they can keep growing.

Out of 150 applications per hiring cycle (roughly 1,000 per quarter in total), around a dozen make it to the TA Screening Call. Five to seven reach the technical interview. Two get to the hiring manager interview. At least one gets the offer.

Hiring cycle
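To make the filtering ratio explicit, the funnel numbers can be turned into stage-to-stage pass rates. In this sketch "around a dozen" is taken as 12 and "five to seven" as the midpoint 6; both are simplifying assumptions, as are the type and function names.

```typescript
interface Stage {
  name: string;
  candidates: number;
}

// Figures from the text; 12 and 6 are assumed point values for the ranges.
const funnel: Stage[] = [
  { name: "Applications", candidates: 150 },
  { name: "TA screening call", candidates: 12 },
  { name: "Technical interview", candidates: 6 },
  { name: "Hiring manager interview", candidates: 2 },
  { name: "Offer", candidates: 1 },
];

// Share of candidates passing from each stage to the next, in percent,
// rounded to one decimal place.
function passRates(stages: Stage[]): { from: string; to: string; percent: number }[] {
  return stages.slice(1).map((stage, i) => ({
    from: stages[i].name,
    to: stage.name,
    percent: Math.round((stage.candidates / stages[i].candidates) * 1000) / 10,
  }));
}

for (const rate of passRates(funnel)) {
  console.log(`${rate.from} -> ${rate.to}: ${rate.percent}%`);
}
```

With these inputs, 8% of applications reach the screening call and the later stages each pass between a third and a half, which matches the "filter hard at the top" idea.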

Clear evaluation criteria during interview stages

This is the key. Before any interview, we build not just question lists but scorecards: what good and bad answers look like. The scale is simple: needs improvement, meets expectations, exceeds expectations, and red flags. We place the candidate on it. That keeps evaluations consistent and removes cognitive load from the team. We write scorecards for both hard skills and soft skills.

For the technical interview, we have six topics: AWS, Kubernetes, Terraform, CI/CD, Observability, Security.

https://medium.com/media/84ee690a1e103580f0a685297bec3523/href
https://medium.com/media/380a79fab054f48dbe6b89e6cedca914/href

None of them includes live coding or similar exercises: they’re easily bypassed with AI, so we don’t see the point. We just talk. The candidate meets the whole team at once — we’re a small enough team that everyone joins the call.

Scorecards get filled out by each member of the team independently, on the day of the interview. If there are significant disagreements, we do a debrief — sometimes that means revisiting the criteria to arrive at a shared read on the candidate. Usually by the next day we know whether they’re moving forward. If the team says no, the process ends there. No wild cards from me.

The hiring manager interview works the same way — scorecards again. My job is to assess engineering maturity: what problems they’ve actually solved, what they’ve led, at what scale. I focus on four things: autonomy, data mindset, incident troubleshooting, project ownership.

To assess autonomy, I ask: Tell me about the last improvement you initiated without being asked. What was the problem, what did you do, and what was the result?

If there’s something to tell, I go deeper: Which metrics changed? Who pushed back? Specific dates, tools, results?

Here’s how the scoring looks for this dimension:

https://medium.com/media/169599f31e4a417e7397cc79ca05ddfc/href

On our scale: “I upgraded Kubernetes” — needs improvement. “I migrated the whole company to a new stack” — meets expectations. “I led a cross-functional initiative — breaking up a monolith, moving between architectures, technically and organizationally complex” — exceeds expectations.

After this interview I know the candidate’s leadership level: what they can take on, how large a piece I can hand them. Sometimes the answer is the opposite — strong technically, but not ready for autonomous work, or missing the product mindset entirely.

The result of tech and leadership interviews is a filled-in candidate profile. For example, a candidate knows AWS well, has a gap in Kubernetes, but is solid on observability, and has led complex projects. From that I can tell whether they’re a fit for the team and whether they can carry what I need them to carry. The decision becomes easy.

What about the stages and timeline?

Application review takes up to two weeks but we push to move faster.

The hardest part logistically is getting the whole team together for the technical interview. Fitting the screening and the technical into one week rarely happens, but that’s what we aim for.

After the technical, I get on a call with the candidate within two days — already with the team’s feedback in hand. The final decision takes another day or two at most.

Total: two weeks on average from first call to offer.

If this sounds like your kind of hiring process and you’re an experienced SRE, follow me on LinkedIn to stay in the loop on new openings. They are coming.

No sprints for a quarter: an experiment in a Scrum-culture company

Manychat Engineering — Thu, 14 May 2026 09:42:10 +0000

How we shipped a core product rebuild in one quarter without a single sprint.

At the end of last year, we ran a double experiment: brought together a team from across different units and traded our beloved Scrum for Kanban. All to rethink the architecture of one of our core products so it could grow and scale faster. One constraint: users shouldn’t notice anything had changed. Metrics had to hold.

I’m Dmitry, Frontend Area Lead at Manychat. Here’s what happened — spoiler: it worked, and we’re doing it again — and when it might be worth trying for you, especially if you’re staring down a project full of unknowns.

Where the problem came from

Manychat is a marketing automation platform for messengers. One of our core features is the Flow Builder: a canvas where you drag, connect, and configure automation sequences. It’s a powerful tool, but not the friendliest for beginners.

So we built EasyBuilder — a simplified interface with pre-packaged solutions. No canvas, just pick a scenario, walk through a few steps, configure in a couple of clicks. Done.

Experiments showed good results. Adoption grew. Then we hit the ceiling. It became clear that to keep growing, the product needed a fundamental rethink.

The problem was structural: EasyBuilder (front) and the Flow Builder (back) existed as two parallel systems. The frontend had no knowledge of the flow structure — it just collected form values and sent them to the backend, which assembled the automation from a rigid template. Every new scenario meant starting from scratch: adding anything new meant hardcoding it manually.

Rebuilding EasyBuilder wasn’t just a technical challenge. It also came with an organizational one. At Manychat, we have the following setup: a core team maintains the platform, a growth team ships features on top. One builds the foundation, the other builds on it. This is an intentional structure.

The catch was knowledge transfer. Whichever team built the new foundation first, the other would eventually need to pick it up — with zero context. Building that bridge from scratch would have been expensive and slow.

Two challenges arrived together: technical and organizational. They needed a single answer.

Tiger team

The obvious move was to spin up a cross-functional team and hand them the task. We figured out pretty quickly that wouldn’t work. Every team still had its own backlog, its own priorities. Work on EasyBuilder would have dragged on. And we needed this done in a quarter max.

What we actually needed:

Full isolation from ongoing area priorities — a team thinking about exactly one thing
Not a standard headcount (two backend, two or three frontend, a designer — too many), but the specific people for the specific job
No onboarding budget: the task required people who already knew the context

The idea was to build a standalone team of people pulled from their regular units for the project, who would return when it was done and spread the expertise across their teams.

These are the areas that donated their teammates for the project.

That became the Tiger team. Yes, same as for NASA’s Apollo 13. We aimed high🙂.

The team: two frontend engineers, one backend engineer, a staff engineer from the infra team, and me — also a frontend engineer, but acting as a tech process lead throughout the project. A PM and designer synced with us as needed rather than full-time.

Kanban instead of Scrum

The team was only half the experiment. The other half was the process.

Manychat runs on Scaled Scrum. Teams share a unified backlog, live by sprints, retrospectives, PBRs, and the rest of the Scrum ritual calendar. Large tasks get broken into small, predictable pieces so teams can move steadily and keep metrics stable. The principle: a team enters a sprint with minimal uncertainty, because you have to deliver within a fixed window.

Our task was the opposite of that. We didn’t know how much was actually in there. It was a Pandora’s box: you open it and you don’t know what you’ll find. Scrum would have forced us to package that uncertainty into sprints — break it into tasks, estimate, commit to a fixed cadence. Any estimate would have been a guess, and we’d have burned energy planning things we didn’t yet understand. For example, the first two weeks were research only. We had no concrete tasks to put in a sprint. Scrum wouldn’t have worked at all.

So we went with Kanban: no sprints, but a roadmap instead. We moved through whole epics from start to finish and reprioritized whenever the task demanded it. For a company where Scrum is the basic framework, this was an experiment inside an experiment.

Tiger team in action

The first two weeks looked like a startup. Every day in the office, in meeting rooms — brainstorming, challenging each other’s architectural proposals. The scope of the task was opening up gradually, and we needed to understand it before we could move.

A fragment of our work roadmap from October–November.

When we had a foundation, we started moving through the roadmap. Whole epics at a time. When something unexpected surfaced, we dealt with it on the spot and kept going.

The PM and designer weren’t in the room full-time — they joined weekly syncs and worked asynchronously: we’d hand off a task, they’d discuss it, come back with a decision and a design. That worked fine for most of the project. Where we hit friction was when a product decision was blocking a technical one, and the PM wasn’t available. Some sizable tasks stalled in the queue because we couldn’t move without a call.

In the end, 90% of the project was built in this mode. The remaining 10% — final fixes and polish — was handed off to a product team, who wrapped it up within a single sprint.

What we actually built

Instead of writing custom code for every new scenario, we created a meta-language on top of the Flow Builder that lets you assemble forms and configure scenarios automatically. Unlike EasyBuilder, where the frontend and backend operated as two separate systems connected by an established, rigid contract, Quick Automations (as we called it) works with a single Flow Builder entity end-to-end.

Before, adding a new scenario meant a developer hardcoding it from scratch. Now, a developer isn’t needed at all. A product manager can come in, configure a scenario themselves — and a ready-made form appears for the user. No developer involvement, no new sprint per configuration.
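The article does not show the meta-language itself, so the following is only a loose sketch of the general idea of deriving a form from configuration instead of hardcoding it per scenario. Every name here (`ScenarioConfig`, `buildForm`, the field types, the widget names) is hypothetical and does not describe the real Quick Automations format.

```typescript
// Hypothetical scenario config; field names and types are illustrative only.
type FieldType = "text" | "select" | "toggle";

interface ScenarioField {
  key: string;
  label: string;
  type: FieldType;
  options?: string[];
}

interface ScenarioConfig {
  id: string;
  title: string;
  fields: ScenarioField[];
}

interface FormControl {
  name: string;
  label: string;
  widget: string;
  choices: string[];
}

// Derive the user-facing form from the config alone: no per-scenario code.
function buildForm(config: ScenarioConfig): FormControl[] {
  const widgets: Record<FieldType, string> = {
    text: "TextInput",
    select: "Dropdown",
    toggle: "Switch",
  };
  return config.fields.map((field) => ({
    name: `${config.id}.${field.key}`,
    label: field.label,
    widget: widgets[field.type],
    choices: field.options ?? [],
  }));
}

const commentReply: ScenarioConfig = {
  id: "comment-reply",
  title: "Reply to comments",
  fields: [
    { key: "keyword", label: "Trigger keyword", type: "text" },
    { key: "channel", label: "Channel", type: "select", options: ["Instagram", "TikTok"] },
  ],
};

console.log(buildForm(commentReply));
```

Under this model, adding a scenario means adding a config entry, which is why a product manager could do it without a developer.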

This is how it looks when a product manager creates a new automation scenario using Flow Builder. On the right is the result the user will see.

And this is the user interface, the same as it was for EasyBuilder 1.0.

Knowledge transfer in practice

One of the expected benefits of this setup was knowledge transfer. The people who built the product went back to their permanent teams — taking everything they learned with them. That made it easier to land the solution in any area and keep improving it in parallel.

They ran workshops for engineers — walked through how the new system works, helped colleagues tackle tasks in ways that built understanding rather than just output. They ran sessions for product teams on how to configure scenarios and build new automations without involving developers.

That was the goal: teams who inherited the product can move independently — add new configurations, run experiments. The tiger team is gone, but what it built lives in the product and in the people.

When to use this

A temporary project team isn’t a universal tool. Neither is Scrum or Kanban. Before reaching for either, it’s worth asking: is the process serving the work, or is the work serving the process?

This project helped us answer that question. Here’s what we learned about when this format makes sense — and when it doesn’t.

It is definitely worth trying when:

The task is one-time, but strategically critical. Not your regular product flow, but something that needs to be done once and done right — laying a foundation that features will build on.
The scope is unpredictable. If you don’t know how much is actually in there, sprints will create pressure where you need flexibility.
The task doesn’t fit the standard process. Especially if a large task can’t be broken into sprints without losing its meaning.
The people you need are spread across teams. If the expertise is distributed across different units, and those people already know the context — no onboarding required.

And it may be less effective if:

The task is small and fits cleanly into a regular sprint. No need to build a separate structure.
The company isn’t willing to temporarily release people from their teams. Without that, a tiger team won’t fly.
There’s no clear endpoint. The team risks becoming permanent — which is a different story entirely.

As for our case, the experiment worked 100%. We got a solution to the problem. App metrics held. We’re doing it again.

Frontend Performance Patterns to speed up your Web App

Manychat Engineering — Mon, 04 May 2026 10:53:16 +0000

How intentional loading decisions keep your app fast at scale.

Frontend performance is not a late-stage cleanup task. It’s not tech debt. It’s a set of decisions we make every day while we code — what we load, when we load it, and how we render it. The answer depends on the importance of the code, its size, and when the user actually needs it.

Get that wrong, and the browser pays for everything upfront — bytes, main thread, network — whether the user ever sees it or not.

I’m Liana, a frontend engineer at Manychat. In this article we cover five patterns we use to get this right — four import strategies and compression — and how we measure whether it’s working.

Import Patterns

Manychat uses 24 import patterns, grouped into 7 categories — from static and dynamic to type, asset, style, re-exports, and legacy tooling. In practice, two dominate: static imports at 99.6% of the codebase, dynamic imports at 0.4%.

Pattern 1 — Static Imports

Static imports are required for the first meaningful paint. Layout, router, core UI — everything that must exist before the user sees anything. If it’s there on first load, it’s a static import.

Static imports follow a fixed order, which the formatter enforces: Node.js plugins go at the top, then React core, then external packages, then dependencies, modules, and other packages.

Pattern 2 — Dynamic imports

0.4% of the codebase. Small number, high impact.
You can see how they differ from static imports in the code: you need a separate file where you define the path and what should be imported lazily.

Where is it in the UI? Everywhere where there’s a very heavy screen. Each route becomes a separate chunk which is loaded only on navigation.

A good example is the Flow Player page that you can access by creating an automation, sharing it from the CMS list, and handing the link to someone outside Manychat. It’s heavy. There’s no reason to pay for it on app load.

A dynamic import means you don’t pay for code until the user goes there.
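The idea of loading a route's chunk only on navigation can be sketched without any framework. The `lazyOnce` and `createRouter` helpers below are illustrative names, not Manychat's code. In a bundler setup the loader would typically be something like `() => import("./pages/FlowPlayer")` (a hypothetical path); here a stand-in loader keeps the example self-contained.

```typescript
// A loader is any function that returns a promise of a module.
type Loader<T> = () => Promise<T>;

// Wrap a loader so the chunk is requested at most once, on first use.
function lazyOnce<T>(loader: Loader<T>): Loader<T> {
  let pending: Promise<T> | undefined;
  return () => {
    if (!pending) {
      pending = loader().catch((error) => {
        pending = undefined; // allow a retry after a failed download
        throw error;
      });
    }
    return pending;
  };
}

// Route table: nothing is loaded until navigate() asks for that path.
function createRouter<T>(routes: Record<string, Loader<T>>) {
  const lazyRoutes = new Map<string, Loader<T>>();
  for (const path of Object.keys(routes)) {
    lazyRoutes.set(path, lazyOnce(routes[path]));
  }
  return {
    async navigate(path: string): Promise<T> {
      const load = lazyRoutes.get(path);
      if (!load) throw new Error(`No route for ${path}`);
      return load();
    },
  };
}

// Usage with a stand-in loader instead of a real import().
const router = createRouter({
  "/flow-player": async () => ({ title: "Flow Player" }),
});
router.navigate("/flow-player").then((page) => console.log(page.title));
```

Caching the promise rather than the resolved module means two quick navigations to the same route share one download.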

Pattern 3 — Import on Interaction

This is used for optional UI — modals, popovers, or similar things triggered by the user.

We actually don’t use it in our codebase, but we use something very similar: import on render, which is lazy-loaded on mount, not on interaction. You can see this in our modals — all of them work exactly the same way. Why? Because our modals are very lightweight, and there’s no need for import-on-interaction specifically. All our modals render immediately inside Suspense — they just load their chunks lazily, avoiding the cost of features many users never open.

Pattern 4 — Import on Visibility

The component loads when it enters the viewport. This avoids competing with the initial render and reduces the chunk size.

A good example is infinite scroll in TikTok and Instagram automation. When you want to pick a post or reel, a modal opens — and if you have a lot of them, you get an infinite scroll. It avoids loading chunks the user hasn’t scrolled to yet. We have a reusable sentinel component that handles this across the app.

Why not just one import pattern?
One pattern doesn’t fit all. Our goal is to make a right-sized chunk for what we want to load — because it’s bytes plus main thread plus network. We define them the same way we define error severity:

Critical — load immediately (static)
Heavy — load on navigation (dynamic)
Optional — load on interaction
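The three levels above can be written down as a lookup table, which makes the loading decision explicit per module. This is a sketch under assumptions: the module names are invented, and the strategy labels are shorthand for the patterns described earlier.

```typescript
type Importance = "critical" | "heavy" | "optional";
type Strategy = "static" | "dynamic" | "on-interaction";

// Mapping taken from the list above.
const STRATEGY_BY_IMPORTANCE: Record<Importance, Strategy> = {
  critical: "static",
  heavy: "dynamic",
  optional: "on-interaction",
};

interface ModuleInfo {
  name: string;
  importance: Importance;
}

// Group module names by the loading strategy their importance implies.
function planLoading(modules: ModuleInfo[]): Record<Strategy, string[]> {
  const plan: Record<Strategy, string[]> = {
    static: [],
    dynamic: [],
    "on-interaction": [],
  };
  for (const mod of modules) {
    plan[STRATEGY_BY_IMPORTANCE[mod.importance]].push(mod.name);
  }
  return plan;
}

// Hypothetical module names, for illustration only.
console.log(
  planLoading([
    { name: "AppLayout", importance: "critical" },
    { name: "FlowPlayerPage", importance: "heavy" },
    { name: "ExportModal", importance: "optional" },
  ]),
);
```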

Can we combine them?
Yes — in layers, not as one mega-pattern. And you don’t literally need every technique. That’s overengineering. Define what you want to optimize and why.

Why don’t we use some patterns — hover/focus, prefetch, idle prefetch?
It’s always a question of cost and benefit, and we always run the risk around cache and CDN. Sometimes absence is a deliberate priority — not that we disagree with the idea.

Will it actually change performance?
Like other optimization patterns — it depends on what you measure and where it hurts. Treat them as targeted experiments. They’re useful when you want a specific interaction to feel immediate.

Compression pattern

Imports decide what code loads and when — compression decides how much it weighs when it gets there. Where imports operate at the application bundle layer, compression operates at the origin server and CDN — how you finally deliver bytes to the client.

At Manychat we use two compression utilities: Gzip and Brotli. Both shrink files before they travel over the network and decompress them transparently in the browser. Both are lossless encodings for text-like content (JS, CSS, HTML, JSON, SVG — not already-compressed binaries like most JPEG/PNG).

You can check this in the network tab — go to JS files, look at response headers (look for content-encoding: br or gzip).

Gzip is the classic one, supported by every browser. Brotli compresses better, is a little slower, but delivers smaller chunks. Browsers and CDNs pick the best mutually supported algorithm from Accept-Encoding — supporting both gives a good baseline with Gzip and better size where Brotli is available.

Rule: prefer Brotli for static assets where supported; keep Gzip as fallback where needed.
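In practice the CDN or web server handles this negotiation for you, but the rule is simple enough to sketch. The functions below are illustrative, not a real server API: they parse an `Accept-Encoding` header and apply a server-side preference of Brotli first, Gzip second. For brevity the sketch treats any non-zero q-value as acceptable and does not rank codings by the client's weights.

```typescript
type Encoding = "br" | "gzip" | "identity";

// Parse an Accept-Encoding header into coding -> q-value (default q = 1).
function parseAcceptEncoding(header: string): Map<string, number> {
  const weights = new Map<string, number>();
  for (const part of header.split(",")) {
    const [rawName, ...params] = part.trim().split(";");
    const name = (rawName ?? "").trim().toLowerCase();
    if (!name) continue;
    let q = 1;
    for (const param of params) {
      const [key, value] = param.trim().split("=");
      if ((key ?? "").trim().toLowerCase() === "q") {
        const parsed = Number.parseFloat(value ?? "");
        if (!Number.isNaN(parsed)) q = parsed;
      }
    }
    weights.set(name, q);
  }
  return weights;
}

// Server preference from the rule above: Brotli where supported, Gzip as fallback.
function pickEncoding(acceptEncoding: string): Encoding {
  const weights = parseAcceptEncoding(acceptEncoding);
  const wildcard = weights.get("*") ?? 0;
  for (const candidate of ["br", "gzip"] as const) {
    const q = weights.get(candidate) ?? wildcard;
    if (q > 0) return candidate; // q=0 means "explicitly not acceptable"
  }
  return "identity"; // send the response uncompressed
}

console.log(pickEncoding("gzip, deflate, br")); // "br"
console.log(pickEncoding("gzip, deflate"));     // "gzip"
console.log(pickEncoding("br;q=0, gzip"));      // "gzip"
```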

How do we know the patterns are working?

With user-centric metrics and repeatable team habits.

We rely on Core Web Vitals — loading experience, responsiveness, and visual stability — via the web-vitals library from Google. When a user stays on a page long enough, an analytics event fires: app_metrics user_interactive_performance. The central place in the codebase is log_web_vitals.

We track two things more closely:

INP (Interaction to Next Paint) — real-world interaction responsiveness (a Core Web Vital since March 2024, when it replaced FID)
Long Tasks — where the user can feel that the app is stuck

Both are collected only for logged-in users and visible in Grafana dashboards. For Flow Builder specifically — the heaviest part of Manychat — we track INP on both desktop and mobile. In an ideal world every major component would have its own dashboard. For now, we start where it hurts most.
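To read those dashboards, it helps to know the usual cut-offs. The sketch below classifies INP with the publicly documented thresholds (200 ms and 500 ms) and counts long tasks, meaning main-thread tasks longer than 50 ms. The `InteractionSample` shape and the function names are assumptions for illustration, not the payload of the analytics event mentioned above.

```typescript
type Rating = "good" | "needs improvement" | "poor";

// Documented INP thresholds: up to 200 ms is good, above 500 ms is poor.
function rateInp(inpMs: number): Rating {
  if (inpMs <= 200) return "good";
  if (inpMs <= 500) return "needs improvement";
  return "poor";
}

// A long task is main-thread work that runs for more than 50 ms.
const LONG_TASK_THRESHOLD_MS = 50;

function countLongTasks(taskDurationsMs: number[]): number {
  return taskDurationsMs.filter((d) => d > LONG_TASK_THRESHOLD_MS).length;
}

// Hypothetical sample shape for one page visit.
interface InteractionSample {
  device: "desktop" | "mobile";
  inpMs: number;
  taskDurationsMs: number[];
}

function toDashboardRow(sample: InteractionSample) {
  return {
    device: sample.device,
    inpMs: sample.inpMs,
    rating: rateInp(sample.inpMs),
    longTasks: countLongTasks(sample.taskDurationsMs),
  };
}

console.log(toDashboardRow({ device: "mobile", inpMs: 340, taskDurationsMs: [12, 80, 45, 130] }));
```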

For audits and payload analysis we use Lighthouse — a built-in Chrome tool that generates a detailed performance report for any page. It’s useful for catching issues before they reach real users.
For day-to-day development we use browser DevTools — the network and performance tabs show what’s happening in real time while we code.

Performance is not a one-time fix. Every import decision, every byte that travels over the wire — these are choices that compound over time. Get them right consistently, and users never notice. Get them wrong, and they leave.

The patterns we covered — imports and compression — are not exotic optimizations. They’re the baseline. The metrics are what keep you honest: if you can’t see it, you can’t improve it.

If you want to learn more about how we build Manychat and who we’re currently looking for, check out Manychat Careers.

From idea to MVP in a hackathon with AI: 6 principles that get you there

Manychat Engineering — Thu, 23 Apr 2026 08:41:56 +0000

How to run a hackathon that results in a working MVP, and how AI fits into the process.

Manychat helps creators and brands automate conversations on Instagram, TikTok and beyond. The more the customer uses it, the richer the picture gets — automations running, leads coming in, content performing differently across posts. We wanted to surface that picture clearly, in one place, directly for the people running those automations.

We had the data. What we didn’t have was clarity on which features would actually be useful, and in what form. Instead of locking that question into a quarterly plan and finding out three months later, we decided to find out in a week.

Three people: two backend engineers, one PM, plus a data engineer who wasn’t full-time but was critical during the data prep phase. A couple of AI agents. Three days in the Amsterdam office, two days of remote prep. The output: an MVP running on real user data and seven customer interviews the following week.

My name is Artur, I’m a Python Tech Lead here at Manychat. Here’s what we learned from our hackathon experience — and where AI made the difference.

1. Start the hackathon before the hackathon

The first thing to sort out in advance is access. AI tools, repositories, internal documentation, relevant services. Permissions always take longer than you expect and can become a serious blocker mid-hackathon. Deal with it before you start.

The second thing is product hypotheses. We spent two days exploring data from our data warehouse. The internal documentation was a Google Sheets file with table and view descriptions — not exactly human-readable — and a data engineer who helped us navigate it. We packaged all of that as context, fed it to an LLM (Claude Code and the Claude desktop app), and asked it to find something productively useful for our customers. The PM turned those findings into clear product hypotheses, wrote the first PRD, and built early prototypes in Lovable — screenshots of which we later used as UI references during the hackathon.

The key point here isn’t that the LLM “invented” the product. It helped us quickly make sense of unfamiliar data and structure what we already intuitively wanted to validate.

2. Planning takes 80% of the time — and that’s fine

During the hackathon itself, we barely wrote any code by hand. Most of the time went to something else: figuring out exactly what needed to be done and formulating it precisely enough for an AI agent to execute.

Review the plan, not the code. The main insight from day one: the more precisely a task is defined for the agent, the fewer iterations you need. We used plan mode in Claude Code before every task, challenged the plan as a team, and saved it as soon as we were happy with it.

We didn’t review the code — but that doesn’t mean quality didn’t matter. Instead of code review, we validated the output with tests. What mattered wasn’t how it was written, but whether it worked correctly.

Save the plan immediately. Once, we had a great plan. We ran it, but the result wasn’t good enough. So we did what people usually do in that situation and ran /clear to start fresh. The plan disappeared, and with it some of the most valuable work we had done. We had to rebuild from scratch. After that, saving the plan became a mandatory step before any iteration.

Don’t ask for refactoring — reformulate. If the output isn’t right, don’t ask the agent to improve or fix what’s there. Reformulate the task, update the plan, and ask it to redo the whole thing. Especially when the context window is already half full — at that point, trying to patch the existing result only makes things worse.

3. Delegate coordination to AI agents

If planning takes 80% of the time, the next challenge is turning that plan into a system of well-aligned tasks. This is where things usually start to fall apart: dependencies get lost, responsibilities overlap, and the overall structure drifts.

We found that delegating coordination to AI agents helped keep everything consistent. We created one sub-agent per direction. Each sub-agent was responsible for writing a detailed implementation plan for its part of the work. Once the sub-agents had their plans, a “parent” agent merged them into a single whole — and checked for conflicts, duplication, or blockers between tasks. This replaced a significant chunk of the coordination overhead between people.
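In the hackathon this merge was done by an agent working on written plans, not by code. Still, the three checks it performed (duplication, conflicts, blockers) are easy to picture as a function. The plan format below is entirely hypothetical, and "conflict" is simplified to mean that two owners plan to touch the same file.

```typescript
// Hypothetical plan format; the article does not describe the real one.
interface PlannedTask {
  id: string;
  owner: string;       // which sub-agent or direction owns the task
  touches: string[];   // files or modules the task will change
  dependsOn: string[]; // ids of tasks that must exist in some plan
}

interface MergeReport {
  tasks: PlannedTask[];
  duplicates: string[];
  conflicts: string[];
  blockers: string[];
}

function mergePlans(plans: PlannedTask[][]): MergeReport {
  const tasks: PlannedTask[] = [];
  const knownIds = new Set<string>();
  const duplicates: string[] = [];
  const ownersByTarget = new Map<string, Set<string>>();

  for (const plan of plans) {
    for (const task of plan) {
      tasks.push(task);
      if (knownIds.has(task.id)) duplicates.push(task.id);
      knownIds.add(task.id);
      for (const target of task.touches) {
        const owners = ownersByTarget.get(target) ?? new Set<string>();
        owners.add(task.owner);
        ownersByTarget.set(target, owners);
      }
    }
  }

  // Conflict: more than one owner plans to change the same target.
  const conflicts: string[] = [];
  ownersByTarget.forEach((owners, target) => {
    if (owners.size > 1) conflicts.push(target);
  });

  // Blocker: a task depends on something that no plan provides.
  const blockers: string[] = [];
  for (const task of tasks) {
    for (const dep of task.dependsOn) {
      if (!knownIds.has(dep)) blockers.push(`${task.id} waits on missing ${dep}`);
    }
  }

  return { tasks, duplicates, conflicts, blockers };
}
```

A report like this is what lets people review the merged plan instead of cross-checking each sub-plan by hand.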

The approach let a three-person team — one PM and two engineers — work in parallel and move fast. So fast, we weren’t quite ready for it.

4. Keep tasks as independent as possible

People should work in parallel on maximally different things. This is less obvious than it sounds.

On day one, the two of us split the backend by endpoints: one handler each. Seemed logical. We finished unexpectedly quickly and had to jump into another planning iteration — coordinating again, figuring out what’s next. With AI, this kind of split doesn’t really make sense: an agent closes a handler faster than you can get out of each other’s way.

The next day we changed the approach: each person takes a fully independent direction. One writes a feature end-to-end, the other handles infrastructure. Or one does backend, the other does frontend. For the latter, AI came to the rescue again: Claude Code with Figma via MCP let us assemble an interface from Manychat’s existing design system components — no frontend experience required.

While the engineers were building, the PM was running a parallel track: refining scope, defining the Ideal Customer Profile, finding customers who matched it, and tapping the marketing team’s existing warm contacts to line up interviews.

Parallelization didn’t go away — it just moved up a level. Instead of “two people on one module,” it became “each person owns an entire layer.”

5. Log your compromises

During an MVP hackathon, you’re going to take shortcuts. Intentionally. The goal isn’t to avoid them, but to make sure they don’t turn into hidden debt.

We kept a “Compromises” table and updated it as we went. Every deliberate technical or process shortcut got its own line. If the product validates, we have a concrete list of what needs to be brought up to production standards. Nothing gets forgotten, nothing quietly becomes permanent.

6. Stay focused on the original goal

When we saw how fast we were moving, we decided to add an unplanned feature: AI Summary. A short insight block generated on dashboard load — top post, best automation, overall account dynamics for the week. It wasn’t in the original plan, took half a day, and turned out to be one of the most talked-about things in customer interviews.

Then we got ambitious. Instead of showing the feature from a laptop, we decided to run a real experiment with a feature flag for actual customers. I took on the negotiations with infra and security so the team could keep shipping. Security didn’t approve it — and for a fair reason. Our solution was read-only, which made it safe in its current state. But the security team flagged that if any write operations were added down the line, the architecture would need to be revisited to account for that. They weren’t ready to approve this potential risk quickly, and we lost a person-day.

At the retro, the conclusion was simple: every new action needs to be validated against the original goal. The MVP was never meant for production in the first place. We just thought — we’re moving fast, why not? It worked for the unplanned feature. It didn’t work for the prod deploy.

What we ended up with

We shipped a working MVP analytics dashboard running on real user data.

The dashboard has three parts. At the top, an AI Summary: a short auto-generated insight about the last 7 days, covering total leads, DMs, comments, URL clicks, the top-performing post with CTR, and one clear action to take.

Below that, Account Performance: four KPI cards — automations running, leads collected, average CTR, time saved.

And the main section: Top Content by Conversion — a table of posts ranked by leads, with reach, clicks, CTR, engagement, saves, and automation status. For creators who monetize on Instagram, the key question is always the same: which content is actually working? Not working in terms of likes — working in terms of leads, DMs, conversions. We wanted to make that visible at a glance, so they could double down on what’s performing and add automations where they’re missing out. Posts without automations show an “Add automation” prompt inline, and the customer can act on it without leaving the page.

The week after the hackathon, the PM ran seven interviews with real customers on their real data. Four said they’d use the dashboard regularly. We got a concrete list of what was missing and clear input for the next quarter’s planning.

In a normal cycle, development alone would take around two sprints — four weeks. User validation in larger companies can stretch for months: UX research queues, alignment meetings, synthesis. Here, it took three days to build and one week to get real customer feedback.

A fast MVP wasn’t even the goal. Reducing risk was. Instead of committing a full quarter to something that might not land, you validate the idea first, and then decide whether it’s worth investing in. The hackathon format makes this possible. A small team goes from hypothesis to a validated MVP in about two weeks.

AI changes the equation. Implementation becomes a non-issue: one engineer with a system of agents can cover what used to require a team. The bottleneck shifts. It’s no longer about writing code — it’s about knowing what to build. AI helps there too.

Want to be part of the team building Manychat? See what roles we have open right now.

PHP Fibers: Simplifying Async Code and Speeding Up Development

Manychat Engineering — Thu, 16 Apr 2026 12:54:49 +0000

How serialization overhead, a surprise OpenSSL upgrade, and idle workers pushed us toward PHP 8.1 Fibers, and what changed when we did.

I’m Max, Infrastructure Team Lead at Manychat. This is the next part of our PHP series.

In the previous article, we built concurrent HTTP requests in PHP without threads — using curl_multi_exec to let a single worker handle multiple external calls at once. It worked. Then our AI features expanded, external calls multiplied, and the model started buckling under its own complexity.

This article is about what we did next: PHP 8.1 Fibers, and how they changed the way our workers process payloads.

What exactly was wrong with concurrent requests?

The curl_multi_exec architecture came with a steep price. To make pseudo-concurrency work, we had to explicitly serialize and deserialize requests, responses and exceptions at every async boundary. That meant a significant refactor, new internal tooling, and conventions developers had to follow just to write new code correctly. As AI features grew in scope and number, the cognitive overhead became impossible to ignore.

Error Handling Complexity. Handling exceptions, timeouts, and corner cases got increasingly painful. Every new scenario (retries, network failures, edge cases) required explicit handling, and since context had to survive serialization boundaries, each one added another layer of boilerplate.

Scattered Context. The hardest part wasn’t writing the code — it was reading it afterward. Business logic was split across serialization points: some state lived before the async boundary, some after. Tracing a single payload through the system meant mentally jumping between the sync worker, the async queue, and back. Code reviews became genuinely hard.

Testing Overhead. Testing also became more complicated. Tests had to account for the full serialization/deserialization chain. Even a simple mock meant verifying multiple intermediate steps instead of a single function call.

Idle Workers. Before Fibers, Meta API calls stayed synchronous — serializing and deserializing state at the point of the call would have required even more refactoring, so we just didn’t touch it. The average response time is around 250ms. Not slow enough to panic over — but not fast enough to ignore at Manychat’s scale. During that time, the worker just sat there.

Bottom line: the code was getting harder to read, harder to test, and harder to extend. Development was slowing down — and everyone felt it.

Three more things that made us rethink

While we were sitting with those outcomes, three things happened in parallel:

1. We solved our memory problem via PCNTL fork. By using pcntl_fork() to spawn workers, we enabled OPcache sharing and Linux copy-on-write — significantly reducing the memory footprint of each worker. In theory, we could almost stop worrying about idle workers; they were no longer consuming nearly as much memory. But they still consumed network connections. So the problem wasn’t fully gone.

2. Ubuntu upgrade revealed a new bottleneck. We migrated from Ubuntu 20.04 to the new LTS and CPU load jumped 10%. Nothing in our code had changed.

We dug in. The problem was OpenSSL 3.0 — shipped with the new Ubuntu — which made SSL handshakes significantly more expensive. OpenSSL’s root certificate store on Linux is one large concatenated file — and the new version introduced mutex-style locking when iterating through it. Even Facebook’s own optimization of using a single root certificate file didn’t fully absorb the hit.

The cause was our still-synchronous calls to the Meta API. Each payload opened a new TCP connection. At Manychat’s scale, that added up fast — and that 10% CPU overhead became the trigger for the next step.

So Anton Gorin, Chief Architect of Manychat, and I decided to combine our existing async worker — built on curl_multi_exec — with Fibers introduced in PHP 8.1.

What is a Fiber?

Fibers are a low-level mechanism for cooperative multitasking: pause execution at any point and resume it later from exactly the same spot, with no threads and no extra processes.

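<?php

// Create a fiber; nothing runs until start() is called.
$fiber = new Fiber(function() {
    echo "Suspending…\n";
    // Pause here and hand 16 back to the caller of start().
    $last = Fiber::suspend(16);
    // resume(42) continues from this point, so $last is 42.
    echo "Resuming with last value {$last}\n";
});

$last = $fiber->start();
echo "Suspended with last value {$last}\n";
$fiber->resume(42);

Output:

Suspending…
Suspended with last value 16
Resuming with last value 42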

Unlike true multithreading, Fibers run within a single OS thread and don’t execute in parallel. Instead, they switch context explicitly via Fiber::suspend() and resume. That makes them well-suited for I/O-bound work: yield control while waiting for a response, do something else, come back when it’s ready.

The new payload processing flow with fibers

Previously, every HTTP request meant serialize, hand off, wait, deserialize, restore. Here’s what that looked like in practice:

1. The sync worker picks a payload from the queue and starts processing it.
2. When execution hits an external HTTP call, the worker serializes the request along with the current business-logic state and writes it to the async task queue.
3. The async worker reads from the queue, deserializes multiple requests, and executes them concurrently via curl_multi_exec.
4. When a response is ready, the async worker serializes it together with the updated state and writes it back to the sync task queue.
5. A sync worker picks it up, deserializes everything, restores business-logic state, and continues from where it left off.

This is the diagram of this very complex flow:

With Fibers, the logic was simpler:

1. The worker starts a fiber and begins processing a payload.
2. When execution hits an external HTTP call (Meta API, LLM, whatever), the fiber suspends, returning the request that needs to be executed.
3. Control passes to the Guzzle request loop, which sends the request. If no response is ready yet, the worker immediately starts the next fiber and begins processing another payload.
4. When a response becomes available in the Guzzle loop, the corresponding fiber resumes from exactly where it stopped.
5. If that fiber produces another request, it suspends again and goes back into the loop.

Within a single worker, several fibers may be suspended at the same time, each waiting for its response, while at most one is actively executing at any given moment. How many fibers can be in flight depends on configuration.
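The scheduling loop described above can be sketched in a few lines. The worker in this article is PHP and its source is not shown, so this is only an analogue under stated assumptions: Python generators stand in for fibers (yield plays the role of Fiber::suspend(), send() the role of resume()), and handle_payload, run_worker, and the transport callback are hypothetical names. Responses are assumed to arrive in request order, which a real HTTP loop does not guarantee.

```python
from collections import deque


def handle_payload(name, log):
    """One 'fiber': yields each request it needs and is resumed with the response."""
    profile = yield f"GET /profile/{name}"   # suspend until the response arrives
    log.append(f"{name}: got {profile}")
    reply = yield f"POST /send/{name}"       # suspend again for a second call
    log.append(f"{name}: got {reply}")


def run_worker(payloads, transport):
    """Drive every fiber in one thread; only one of them runs at a time."""
    log = []
    pending = deque()                        # (fiber, request) pairs awaiting a response
    for name in payloads:
        fiber = handle_payload(name, log)
        pending.append((fiber, next(fiber)))  # run until the first suspend point
    while pending:
        fiber, request = pending.popleft()
        response = transport(request)        # stand-in for the HTTP request loop
        try:
            # Resume exactly where the fiber stopped; it may yield another request.
            pending.append((fiber, fiber.send(response)))
        except StopIteration:
            pass                             # this payload is finished
    return log


log = run_worker(["a", "b"], transport=lambda req: f"200 for {req}")
print("\n".join(log))
```

The log shows the two payloads interleaving: both first calls complete before either second call, even though each handler reads as straight-line code.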

What are the wins? And one trade-off

More cases to make async — like API calls via Meta SDK

Before Fibers, making Meta API calls async meant serializing and deserializing business state around every call. We just didn’t bother. With Fibers, we added a suspend point and called it done. A single Meta API call takes ~250ms — small individually, but Manychat makes billions of them. The compound effect is massive.

Savings on resources: CPU and connections

We rewrote part of the Facebook SDK to reuse connections. One HTTP/2 connection per worker, multiplexed across multiple requests. No repeated TCP handshakes. No OpenSSL overhead per request.

CPU usage returned to previous levels.

Asynchronous Sleep

Sometimes we need to wait — for example, before retrying after an HTTP 500, or to ensure correct message order before sending the next one. A regular sleep() blocks the entire process. If API errors spike and retry logic misbehaves, you’ve put the whole server to sleep.

With fibers, we can implement an asynchronous sleep. A specific fiber sleeps for a defined interval while the worker continues processing other fibers.
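As an illustration of the idea, here is a minimal sketch of a non-blocking sleep, again using Python generators as a stand-in for PHP fibers. Everything in it is an assumption made for the example: the scheduler, the simulated clock, and the names retrying_payload, quick_payload, and run. A fiber "sleeps" by yielding a delay; the scheduler parks it on a timer heap and keeps running the others.

```python
import heapq


def retrying_payload(name, log):
    log.append(f"{name}: attempt 1 failed")
    yield 2.0                          # sleep 2 seconds without blocking the worker
    log.append(f"{name}: attempt 2 ok")


def quick_payload(name, log):
    log.append(f"{name}: done")
    yield 0.0                          # nothing to wait for


def run(fibers):
    """Tiny scheduler with a simulated clock; each fiber yields a delay in seconds."""
    now, seq, timers = 0.0, 0, []      # timers is a heap of (wake_at, seq, fiber)
    for fiber in fibers:
        heapq.heappush(timers, (now, seq, fiber))
        seq += 1
    while timers:
        wake_at, _, fiber = heapq.heappop(timers)
        now = max(now, wake_at)        # advance the simulated clock
        try:
            delay = next(fiber)        # run the fiber until it asks to sleep
        except StopIteration:
            continue                   # fiber finished
        heapq.heappush(timers, (now + delay, seq, fiber))
        seq += 1
    return now


log = []
finished_at = run([retrying_payload("a", log), quick_payload("b", log)])
print(log, finished_at)
```

Payload b completes while payload a is still waiting out its retry delay, which is the property a blocking sleep() cannot offer.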

Simpler code

No more serialization. No more deserialization. Business context stays where it belongs — inside the fiber. Developers don’t even need to know they’re inside a fiber. The code looks like regular PHP — because for all practical purposes, it is.

In practice: instead of pushing a request back into the queue on retry, you just do an asynchronous sleep.

Simpler tests

Testing async code with Guzzle required enormous effort — the full serialization/deserialization chain had to be accounted for, and even a simple mock meant verifying multiple intermediate steps. With Fibers, the code reads linearly and tests follow naturally. That said, some things are hard to reproduce outside production — but in practice, if it worked in dev, it worked in prod.

One trade-off: blast radius

Fibers came with one compromise. Previously, our guiding principle was “better to crash hard than silently suffer from error”: on any non-fatal warning, log it and terminate the worker. One payload lost, clean slate.

With multiple fibers suspended simultaneously, that no longer works. Terminating the worker interrupts all in-flight payloads at once. We redesigned exception handling so that catchable errors terminate only the affected fiber while the worker continues processing others. Fatal errors — like out-of-memory — still take down the entire process. If five payloads are in flight, all five are lost.

This meant working through existing technical debt and committing to treating critical errors as critical — actually reacting to them, not letting them slide. Migrating to PHP 8.5 helped: it introduced stack traces for fatal errors, which made them significantly easier to diagnose and fix.

Could we have done less work?

Probably. Revolt, ReactPHP, AMPHP and OpenSwoole all solve similar problems and would have saved us from building a custom event loop. AMPHP in particular goes further — async SQL queries, not just HTTP, and battle-tested error handling out of the box.

But we didn’t start from a blank slate. We already had a Guzzle-based event loop from the earlier proof of concept, and adding Fibers on top was the natural next step. Starting over today, we’d look at Revolt first and skip the custom event loop entirely.

What we’d keep regardless: developers don’t need to know they’re inside a fiber. The wrapping happens under the hood. That was a deliberate choice — and it’s the part that matters most in a large codebase with many contributors.

This article is based on a talk I gave at PHP Talks #7. If you’d rather watch than read, the video is [here](https://www.youtube.com/watch?v=in_XaE0T5IY).

Migrating the Manychat iOS App from Xcode Frameworks and CocoaPods to Swift Package Manager

Manychat Engineering — Thu, 26 Mar 2026 09:47:47 +0000

How we migrated a large, high-load iOS app from Xcode Frameworks and CocoaPods to Swift Package Manager — without freezing feature development.

In August 2024, the CocoaPods project entered maintenance mode. Then in November 2024 it was announced that the trunk will become read-only on December 2, 2026. After that, publishing new pods or updating existing ones will no longer be possible.

That announcement became the final trigger for our migration to Swift Package Manager.

But honestly, we already had enough reasons to leave long before that.

The starting point

Our starting point was a fairly large and mature iOS application: the Manychat app with about 220k MAU.

By the time the migration started, the codebase was already modular:

~20 external dependencies managed through CocoaPods
~20 internal dependencies implemented as Xcode Frameworks

Ironically, this very modularization through Xcode Frameworks turned out to be the root of many problems.

Internal modules were implemented as separate targets in .xcodeproj, each compiled into a dynamic .framework.

Each dynamic framework is its own bundle, containing duplicated Swift metadata and adding hundreds of lines to the already monstrous project.pbxproj.

This dynamic linking came with several unpleasant side effects:

1. Bloated app size

107 MB for a mobile app is… not exactly elegant.

2. Painful configuration

Every framework required manual configuration: build settings, build phases, code signing, and target configuration.

Multiply that by dozens of modules and you get a configuration hell.

3. Endless conflicts in project.pbxproj

Whenever merge conflicts appeared, developers had to manually resolve them inside the gigantic and barely readable project.pbxproj file. A special kind of suffering.

And finally, maintaining two dependency systems at once (CocoaPods for external libraries and Xcode Frameworks for internal modules) was becoming increasingly painful.

Swift Package Manager promised a much cleaner future: a single, native dependency system fully integrated with the Apple ecosystem.

So we decided it was time.

Migration strategy

We split the migration into three phases:

Move external dependencies from CocoaPods to SPM.
Migrate internal frameworks from Xcode Frameworks to SPM.
Refine the architecture afterward.

Phase 1: External dependencies — leaving CocoaPods

We migrated dependencies one by one, starting with libraries that already supported SPM.

A typical migration looked like this:

Add the package in Xcode: File → Add Package Dependencies.
Provide the repository URL and choose the exact version.
Select the package products and attach them to the appropriate target.
Remove the dependency from the Podfile.
Verify imports still work.
Run tests.

Some libraries didn’t support SPM yet. Most of them were forks hosted in Manychat repositories, such as centrifuge-ios (WebSocket client) or DiffTableDirector (table diffing library). For those, we had to add SPM support ourselves: write Package.swift, publish tags and releases.

One dependency required special treatment: our Kotlin Multiplatform analytics library (KMM). To integrate it via SPM we first had to add SPM support on the KMM side. That meant building an XCFramework inside the KMM project.

This phase of migration happened incrementally and took roughly 3 months. At the end of this phase we achieved native Xcode integration: SPM dependencies were connected directly through Xcode (XCRemoteSwiftPackageReference in .xcodeproj) without additional wrappers.

At this point we dropped storyboards entirely. We’d been using our own fork of SwinjectStoryboard to support them, and we were done with both.

Phase 2: Migrating internal frameworks to SPM packages

This was the hardest part.

After external dependencies moved to SPM, we ran into a fundamental problem: Xcode Frameworks cannot properly consume transitive dependencies from SPM packages.

Here’s a concrete example:

Logically, Services should receive Alamofire transitively through Networking. In reality, Xcode does not propagate SPM dependencies through chains of dynamic frameworks. Instead, it throws linker errors: undefined symbols.

Why? Because dynamic frameworks and SPM packages live in completely different dependency resolution worlds. This limitation of Xcode’s build system made the migration even more complicated.

We saw two ways forward.

Option 1: declare all dependencies explicitly and migrate frameworks gradually. Every Xcode Framework that needs a transitive SPM dependency would then declare it as a direct one, which means duplicating dependencies.

Before migration:

After migration:

This approach would allow gradual migration, and reduce the risk of errors at each step. But the downside was obvious: it would require duplicating dependencies across all targets. The dependency graph would become a mess — real dependencies tangled with workarounds, impossible to tell apart. And every time a new SPM package is added, all dependent frameworks would need updating.

In other words: a temporary solution that would likely become permanent. We didn’t like that.

Option 2: migrate everything in one massive PR. Convert all Xcode Frameworks into local SPM packages at once.

The upside was tempting: a clean dependency graph from day one, no duplicated dependencies, and proper transitive resolution.

The tradeoff was equally obvious: a huge pull request touching the entire codebase while active development continued. High risk. Lots of potential conflicts.

After a week of debating and experimenting, we chose Option 2. Better one painful surgery than months of living with architectural band-aids.

The big migration

The actual migration looked like this.

We created a single Package.swift describing all 25 modules — about 515 lines. This was intentional: maintaining one manifest is easier than managing dozens.

Example:

// swift-tools-version: 5.10
import PackageDescription

let package = Package(
   name: "Modules",
   platforms: [.iOS(.v17)],
   products: LocalModule.allCases.map { $0.product },
   dependencies: ExternalPackage.packages,
   targets: LocalModule.allCases.map { $0.target }
)

Deleted all framework targets from project.pbxproj.
Added Modules as a local SPM package (XCLocalSwiftPackageReference).
Attached products to the main app target.
Connected packages to test targets.
Ran all tests and fixed compilation errors.
Merged it into dev.

The whole migration took four very intense days.

New dependency architecture

During the migration we also cleaned up the architecture. Some improvements became possible thanks to SPM; others we implemented simply because the migration gave us a good opportunity.

We reorganized the existing frameworks into two types of SPM modules: Core and Feature. Core for shared application components and Feature for individual features.

Modules/

├── Core/ ← Infrastructure and shared components (20 modules)

│ ├── Core/ — Foundation: TCA, Swinject, base extensions

│ ├── Models/ — Domain models

│ └── ...

│

└── Feature/ ← Product features (5 modules)

    ├── Automations/

    └── ...

We introduced helpers for dependency declaration and validation:

Shared dependencies for feature modules: a helper that encapsulates the dependencies every feature module needs:

private func featureDependencies() -> [Dependency] {
   [
       local(.core),
       ...
       external(.swinject),
       external(.tca),
   ]
}

Features then add only their own specific dependencies on top:

case .automations:
   featureDependencies()
   ...
   external(.kingfisher)

DependencyBuilder — so each module explicitly declares both its local and external dependencies:

@DependencyBuilder
private var dependencies: [Dependency] {
   switch self {
   case .core:
       external(.swinject)
       ...

   case .models:
       local(.core)
       external(.dateTools)
       ...

   case .networking:
       local(.core)
       local(.models)
       ...

   // ... the remaining modules
   }
}

Cycle detection — a validation check that prevents feature modules from depending on each other:

private func validateNoFeatureToFeatureDependencies() {
    let featureNames = Set(LocalModule.allCases.filter(\.isFeature).map(\.name))
    for module in LocalModule.allCases where module.isFeature {
        let featureDeps = module.target.dependencies.compactMap { dep -> String? in
            switch dep {
            case .byNameItem(name: let name, _):
                return featureNames.contains(name) ? name : nil
            default:
                return nil
            }
        }
        if !featureDeps.isEmpty {
            preconditionFailure(
                "Feature module '\(module.name)' must not depend on other feature modules: \(featureDeps.joined(separator: ", "))"
            )
        }
    }
}

Results

The most important outcome: we are no longer dependent on CocoaPods and now rely entirely on the native Apple dependency system.

But the migration produced several additional benefits:

App size down 31%. From 106.6 MB to 73.9 MB. This was a side effect of converting Xcode Frameworks to SPM packages, not something we specifically optimized for.

Where the savings came from:

Static linking. SPM packages link statically by default. LTO (Link-Time Optimization) can see the entire app at once instead of separate frameworks, and strips more aggressively.
Deduplication of Swift metadata. Each dynamic framework previously carried its own copy of Swift type metadata. Static linking lets the compiler deduplicate them.
Dead code stripping. The static linker removes unused code more effectively. Dynamic frameworks were always linked in full — even if the app used 10% of a .framework, the other 90% stayed in the binary.
No bundle overhead. Each .framework is a directory with an Info.plist, headers, and a code signature. With 20+ frameworks, that adds up.

A clear dependency graph. No words needed.

This is before:

This is after:

App launch 30% faster. From 1.1–1.3s down to a stable 0.84s. Before migration, dyld had to load and bind every dynamic framework at startup, which showed up in the pre-main phase. With static linking, there’s one binary to load.

App start time: 1.29s before migration (5.13.1) vs 0.84s after (6.8.0).

The drop is hard to miss: launch time fell from ~1.0–1.1s to 0.84s after the migration in late February.

Clean build time down 28%. From 156 seconds to 113 seconds. SPM modules build in parallel just like frameworks — but without the .framework bundle copy phase and per-framework code signing.

The full migration took about six months: preparation, the actual move, and post-migration cleanup. We’re currently in the refinement phase, continuing to thin out the main app target by moving remaining logic into existing modules or extracting new ones, and migrating tests into SPM.

How to survive LLM Traffic Spikes in Python

Manychat Engineering — Thu, 05 Mar 2026 13:09:06 +0000

What it takes to route, rate-limit, and failover hundreds of LLM calls per second without breaking production.

At Manychat, we serve AI-powered automation to thousands of Instagram and messaging accounts. Behind that experience sits our Python AI Service — a layer between our product and multiple LLM providers, handling hundreds of LLM calls per second in production.

It works well. Until it doesn’t.

LLM calls don’t behave like traditional API requests. They take seconds, not milliseconds. They’re expensive. They come with strict rate limits. And when a single LLM provider goes down, your feature can go down with it.

Horizontal scaling doesn’t solve this. Adding more servers won’t lift provider limits or fix upstream outages. What you actually need is a control layer — one that decides where traffic goes, when to back off, and how to fail without taking the entire system down.

I’m Sergi Porta, Python Team Lead in the Manychat AI unit. In this article, I’ll walk through the LLM traffic routing architecture we use in our Python AI service. I’ll explain the core gateway patterns for multi-provider LLM traffic, with a practical focus on failover logic, rate limiting, and monitoring, and show how this allows our AI service to handle hundreds of LLM calls per second in production while surviving spikes and provider outages.

The goal is simple: survive spikes of hundreds of LLM calls per second and provider outages without drama.

Python AI Service: Technical Stack

Before diving into routing, a quick look at the service itself. The Python AI Service is built with Python 3.13 and FastAPI, relying heavily on asyncio to handle long-running LLM calls and other I/O-heavy workloads.

We use SQLAlchemy and Alembic to manage configuration, metadata, and lifecycle state. Even though the service is focused on LLM traffic, it still needs to behave like any other production system: consistent, observable, and predictable.

Python AI Service architecture.

For reliability, we work with multiple providers (Azure and OpenAI) and we are planning to add more providers to satisfy the product needs in the future. That gives us flexibility — but also complexity.

Each provider behaves differently. Latency varies. Rate limits differ. Availability patterns are not the same. At our scale, ignoring those differences is not an option.

We need to monitor each deployment in real time, route traffic dynamically based on capacity, and recover automatically from partial outages.

The solution was to introduce a dedicated abstraction layer — a routing layer that hides provider-specific complexity from the rest of the application.

LLMRouting: Turning Providers into a Resilient Pool

The routing layer, LLMRouting, is built on top of LiteLLM’s Router. Here’s how it works. We define multiple backend deployments, each of which may contain replicas of the same LLM model. Instead of calling a specific provider directly, the agent sends the request to the routing layer and simply specifies the model it wants to use. What happens behind the scenes is abstracted away from the application.

The first core mechanism is weighted routing. Not all deployments are equal. Some run on provisioned throughput tiers. Others are pay-as-you-go. Some are cheaper. Some are faster. We assign each deployment a numeric weight, which determines how much traffic it receives. The higher the weight, the larger the share of requests.

Rate limits are inevitable at scale. When a deployment starts returning 429 responses during a traffic spike, the router doesn’t stubbornly retry the same endpoint. It shifts traffic to other healthy deployments in the pool.

If a deployment becomes fully unavailable, it enters a cooldown period. During that time, it is temporarily removed from rotation, and the remaining backends absorb the traffic.

This logic applies not only to deployments, but to entire providers. If Azure experiences an outage, traffic can be routed directly to OpenAI. Because the same model alias exists across providers, failover happens within the retry window. The result: even a full provider outage doesn’t immediately cause errors for users.

Now let’s look at the mechanisms that make this work in practice: weighted routing, rate-limit handling, cooldowns, and fallbacks.

Weighted Routing

Each model alias — for example, gpt-4o-mini — maps to multiple deployments across different providers. Every deployment has a numeric weight that determines its share of traffic.

In our current production setup, the primary Azure deployment carries a weight of 8 (about 73% of requests). A secondary Azure deployment carries a weight of 2 (roughly 18%). A direct OpenAI fallback has a weight of 1 (around 9%).

Here’s how that distribution looks conceptually:

The router uses a weighted random selection strategy (simple shuffle). Most requests are directed to provisioned-throughput tiers, while pay-as-you-go tiers remain warm and ready.
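The percentages quoted above follow directly from the weights, and the selection itself can be reproduced with the standard library. This is a minimal sketch, not the router's implementation: the deployment names are hypothetical, and random.choices stands in for whatever weighted shuffle the router performs internally.

```python
import random
from collections import Counter

# Hypothetical deployment names; the weights are the ones quoted in the text.
WEIGHTS = {"azure-primary": 8, "azure-secondary": 2, "openai-direct": 1}


def expected_shares(weights):
    """Share of traffic per deployment: weight divided by the sum of weights."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def pick_deployment(weights, rng=random):
    """Weighted random choice of one deployment for a single request."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


shares = expected_shares(WEIGHTS)
for name, share in shares.items():
    print(f"{name}: {share:.1%}")

# Simulate 11,000 requests with a fixed seed to see the shares emerge.
rng = random.Random(0)
counts = Counter(pick_deployment(WEIGHTS, rng) for _ in range(11_000))
print(counts.most_common())
```

With weights 8, 2, and 1 the total is 11, so the shares are 8/11, 2/11, and 1/11, which round to the 73%, 18%, and 9% mentioned above.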

Traffic distribution isn’t hardcoded. It’s defined in YAML. That means we can rebalance weights or shift traffic across providers within seconds without deploying new code.

Handling Rate Limits

When a deployment returns a 429 rate-limit response, the router does not retry the same endpoint. Instead, it immediately selects another deployment from the pool and retries the request — up to four attempts in total (1 original and 3 retries). Because each model alias maps to multiple backends, a rate limit in one Azure region is usually resolved by routing the retry to another Azure deployment or directly to OpenAI.
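The retry behaviour can be sketched as a small loop. The names here (RateLimited, call_with_failover, fake_send, and the deployment names) are hypothetical and not part of LiteLLM; the sketch also assumes the candidate list is already ordered by the router and simply tries each deployment once, up to four attempts.

```python
class RateLimited(Exception):
    """Stand-in for an HTTP 429 response from one deployment."""


def call_with_failover(candidates, send, max_attempts=4):
    """Send to each candidate in turn, moving on whenever one is rate-limited."""
    tried = []
    for deployment in candidates[:max_attempts]:   # 1 original + 3 retries
        tried.append(deployment)
        try:
            return send(deployment), tried
        except RateLimited:
            continue                               # never retry the same endpoint
    raise RateLimited(f"rate-limited on every attempt: {tried}")


def fake_send(deployment):
    # Simulate a spike that exhausts both Azure deployments.
    if deployment.startswith("azure"):
        raise RateLimited(deployment)
    return f"ok from {deployment}"


result, tried = call_with_failover(
    ["azure-primary", "azure-secondary", "openai-direct"], fake_send
)
print(result, tried)
```

In this run the request succeeds on the third attempt, after both simulated Azure deployments return a rate-limit error.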

Every rate-limit event is tracked per backend through a custom Prometheus callback. Grafana dashboards make it immediately visible when a deployment is approaching its capacity ceiling. That visibility allows us to adjust routing weights proactively instead of reacting to outages after they cascade.

Cooldowns: Isolating Failing Deployments

Cooldowns prevent failing deployments from absorbing traffic they can’t serve. When a deployment crosses a failure threshold, the router removes it from the routing pool for a defined time window. During that period, only healthy deployments receive traffic.

After the cooldown window expires, the deployment is reintroduced into rotation. This isolation is critical during partial outages. Instead of spreading failures across all incoming requests, the system converges on healthy endpoints within seconds.
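A cooldown mechanism of this kind can be modelled with two dictionaries. This is a sketch under assumptions, not the router's code: the CooldownTracker class, the threshold of 2 failures, and the 30-second window are illustrative values, since the article does not state the real ones. Time is passed in explicitly so the behaviour is easy to follow.

```python
class CooldownTracker:
    """Removes a deployment from rotation after too many consecutive failures."""

    def __init__(self, failure_threshold=2, cooldown_seconds=30.0):
        self.failure_threshold = failure_threshold   # assumed value
        self.cooldown_seconds = cooldown_seconds     # assumed value
        self.failures = {}        # deployment -> consecutive failure count
        self.cooling_until = {}   # deployment -> time it may rejoin the pool

    def record_failure(self, deployment, now):
        self.failures[deployment] = self.failures.get(deployment, 0) + 1
        if self.failures[deployment] >= self.failure_threshold:
            self.cooling_until[deployment] = now + self.cooldown_seconds
            self.failures[deployment] = 0

    def record_success(self, deployment):
        self.failures[deployment] = 0

    def healthy(self, deployments, now):
        """Deployments that are not currently cooling down."""
        return [d for d in deployments if self.cooling_until.get(d, 0.0) <= now]


pool = ["azure-primary", "openai-direct"]
tracker = CooldownTracker()
tracker.record_failure("azure-primary", now=0.0)
before = tracker.healthy(pool, now=0.5)     # one failure: still in rotation
tracker.record_failure("azure-primary", now=1.0)
during = tracker.healthy(pool, now=5.0)     # threshold crossed: cooling down
after = tracker.healthy(pool, now=31.0)     # window expired: back in rotation
print(before, during, after)
```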

Fallbacks Across Deployments and Providers

Fallbacks operate at both the routing and application levels.

At the routing layer, if retries on a primary deployment are exhausted, traffic shifts to the remaining tiers, including cross-provider fallbacks. Because the same model alias exists across providers, even a full regional outage does not require manual intervention. The router reroutes traffic within the retry window.

At the application level, an additional safety net handles edge cases such as empty-content responses. Before surfacing an error to the user, the service retries the entire call. In practice, this means that even during provider-side instability, traffic can be rerouted in under a second without visible degradation for the end user.
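The application-level safety net might look roughly like the following. This is a hypothetical sketch: complete_with_safety_net and call_router are illustrative names, and a single extra attempt is an assumption, as the article does not say how many times the call is repeated.

```python
def complete_with_safety_net(call_router, prompt, max_app_retries=1):
    """Repeat the whole routed call when the model returns empty content."""
    for _ in range(1 + max_app_retries):
        content = call_router(prompt)
        if content and content.strip():
            return content
    raise ValueError("LLM returned empty content after application-level retry")


# Simulated router: the first response is empty, the second is usable.
responses = iter(["", "Hello! How can I help?"])
calls = []


def flaky_router(prompt):
    calls.append(prompt)
    return next(responses)


answer = complete_with_safety_net(flaky_router, "greet the user")
print(answer, len(calls))
```

Because the retry wraps the entire routed call, the second attempt goes through weighted selection again and may land on a different deployment.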

Monitoring and Observability: Seeing Problems Before Users Do

Routing and failover logic are only as good as your visibility into them. Our observability stack relies on two core systems: Prometheus and Grafana for real-time metrics and alerting, and OpenTelemetry for distributed tracing across the full request lifecycle.

Prometheus Metrics and Grafana Dashboards

Every LLM call passing through the router is instrumented via a custom Prometheus callback.

We record high-granularity metrics at both the model and backend levels — enough detail to understand not just that something is wrong, but where.

Model-level metrics include total call counts, latency distributions, token usage (prompt and completion), and error rates. Metrics are labeled by model alias and agent name, allowing us to isolate the performance of specific features, for example, intent detection versus flow generation.

Backend-level metrics record which deployment handled each request and categorize the outcome into a controlled set: success, timeout, rate_limit, api_error, and other. Keeping this taxonomy small helps maintain manageable Prometheus cardinality while still providing enough signal to diagnose routing behavior.
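To illustrate why a small outcome taxonomy keeps cardinality bounded, here is a sketch of the classification step. The function names and the mapping rules are assumptions, and a plain Counter stands in for a labelled Prometheus counter; the actual callback is not shown in the article.

```python
from collections import Counter

OUTCOMES = ("success", "timeout", "rate_limit", "api_error", "other")


def classify_outcome(status=None, error=None):
    """Collapse any result into one of the five outcome labels."""
    if isinstance(error, TimeoutError):
        return "timeout"
    if status == 429:
        return "rate_limit"
    if status is not None and status >= 400:
        return "api_error"
    if error is None:
        return "success"
    return "other"


calls = Counter()   # stand-in for a counter labelled by (backend, outcome)


def record(backend, status=None, error=None):
    calls[(backend, classify_outcome(status, error))] += 1


record("azure-primary", status=200)
record("azure-primary", status=429)
record("openai-direct", error=TimeoutError())
record("openai-direct", error=ValueError("unexpected payload"))
print(dict(calls))
```

However many distinct errors occur, each backend contributes at most five label combinations, so the number of time series grows with the number of deployments rather than with the variety of failures.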

LLM provider metrics dashboard: routing weights, errors, and latency.

These metrics feed into dedicated Grafana dashboards that help us answer four critical questions:

1. Is traffic distributed as expected?

We verify that routing weights are respected and detect unexpected shifts caused by cooldowns or failovers.

2. Is latency degrading anywhere?

P50, P95, and P99 histograms are broken down by backend to surface provider-specific slowdowns.

3. Are errors isolated or systemic?

Outcome breakdowns show whether failures are limited to a single deployment or spreading across the pool.

4. Is cost drifting?

Token counters per model and agent help detect prompt regressions and unexpected usage spikes.

Call counts and latency dashboard.

Errors dashboard.

Cost analysis dashboard.
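For the latency question, percentiles are normally estimated from cumulative histogram buckets rather than from raw samples. The sketch below shows that estimation with linear interpolation inside a bucket, similar in spirit to what Prometheus does at query time; the function name and bucket boundaries are illustrative assumptions.

```python
import math


def histogram_quantile(q: float, buckets) -> float:
    """Estimate a quantile from cumulative buckets.

    `buckets` is a non-empty list of (upper_bound_seconds, cumulative_count)
    pairs sorted by upper bound, typically ending with an infinite bound.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0
    for upper, count in buckets:
        if count >= rank:
            in_bucket = count - prev_count
            if math.isinf(upper) or in_bucket == 0:
                # No finite upper bound to interpolate towards.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            return prev_bound + (upper - prev_bound) * (rank - prev_count) / in_bucket
        prev_bound, prev_count = upper, count
    return prev_bound
```

The estimate is only as precise as the bucket layout, so bucket boundaries are worth placing near the latency thresholds that matter.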

Alerting System

Dashboards are useful for investigation. Alerts are what trigger action.

Slack notifications fire under two primary conditions:

P95 latency thresholds. Alerts trigger when latency exceeds defined limits (typically between 3.5 and 5 seconds, depending on the model). This helps catch provider slowdowns before users feel them.

Error rate breaches. An error rate above a certain threshold triggers an immediate notification. At our traffic level, that’s not a minor glitch — it’s a strong signal of an outage or misconfiguration that requires attention.
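The two conditions can be sketched as a small evaluation function. In practice these rules would more likely live in the alerting system itself; the function names and the error-rate limit used in the usage note are hypothetical, while the P95 limit falls in the 3.5 to 5 second range mentioned above.

```python
def p95(latencies):
    # Nearest-rank P95 over a window of observed latencies (seconds).
    ordered = sorted(latencies)
    rank = (95 * len(ordered) + 99) // 100  # ceil(0.95 * n) in integer arithmetic
    return ordered[rank - 1]


def evaluate_alerts(latencies, errors, total, p95_limit, error_rate_limit):
    """Return human-readable alert messages for one evaluation window."""
    alerts = []
    if latencies:
        observed = p95(latencies)
        if observed > p95_limit:
            alerts.append(f"P95 latency {observed:.2f}s exceeds {p95_limit:.2f}s")
    if total > 0:
        rate = errors / total
        if rate > error_rate_limit:
            alerts.append(f"error rate {rate:.1%} exceeds {error_rate_limit:.1%}")
    return alerts
```

For example, `evaluate_alerts(window, errors=3, total=20, p95_limit=3.5, error_rate_limit=0.05)` returns one message per breached condition, which a notifier could then forward to Slack.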

Monitoring the Asyncio Event Loop

For high-throughput asynchronous services, the health of the Python event loop is monitored via two dedicated metrics:

Event loop delay. This measures the gap between expected and actual asyncio.sleep intervals. Spikes above 1 ms usually indicate CPU-bound work or blocking calls that are starving the loop and increasing LLM response latency.

Active task count. Tracking the number of running tasks helps detect backpressure caused by slow upstream responses or sudden spikes in concurrency.
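Both metrics can be collected with the standard library alone. The sketch below is one possible way to measure them, not the service's implementation: `measure_loop_delay`, `active_task_count`, and `demo` are hypothetical names, and the intervals are arbitrary.

```python
import asyncio
import time


async def measure_loop_delay(interval: float, samples: int):
    """Sleep repeatedly and record how much later than expected each wake-up is."""
    delays = []
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        elapsed = time.perf_counter() - start
        # Anything beyond the requested interval is time the loop was busy elsewhere.
        delays.append(max(0.0, elapsed - interval))
    return delays


def active_task_count() -> int:
    # Counts unfinished tasks on the running loop; call it from inside a coroutine.
    return len(asyncio.all_tasks())


async def demo():
    monitor = asyncio.create_task(measure_loop_delay(interval=0.01, samples=20))
    await asyncio.sleep(0.015)  # give the monitor time to start sampling
    time.sleep(0.1)  # deliberately blocking call that starves the loop
    delays = await monitor
    return max(delays), active_task_count()
```

Running `asyncio.run(demo())` shows the blocking `time.sleep` surfacing as a delay spike far above the usual sub-millisecond noise. In a service, the measured values would be exported as gauges or histograms instead of being returned.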

Distributed Tracing with OpenTelemetry

While metrics provide high-level status, OpenTelemetry provides the context needed for deep investigation. The service automatically instruments three layers: FastAPI (HTTP requests), OpenAI API calls, and SQLAlchemy (database queries).

Each trace spans the full lifecycle of a request — from the initial HTTP call through intent detection, embedding generation, database lookups, and final LLM completion. We propagate custom attributes via OpenTelemetry baggage to preserve business context:

manychat.account_id: links spans to specific customer accounts.
manychat.session_id: associates spans with unique automation sessions.

These attributes let us pinpoint whether a bottleneck originated in an LLM backend, an embedding call, or a database query for a specific request. Traces are exported via OTLP gRPC and stored in S3 for long-term analysis.

Observability Checklist for Production LLM Routing

The following points define the requirements for the production LLM routing layer:

Backend call counts and outcomes to verify routing weights and failover activation.
Latency histograms by deployment to isolate provider-side slowdowns.
Error classification to distinguish between expected rate limits and critical authentication or timeout issues.
Token usage tracking for cost management and identifying prompt regressions.
Event loop health monitoring to detect blocking calls or task backpressure.
Distributed traces with business context to correlate LLM performance with database and embedding latency.
Threshold-based alerts for P95 latency and error rate breaches.

What Is Next for the Python AI Service and LLM Routing?

Our routing-based architecture is solid, and we’re evolving it further.

Next, we’re extending the routing layer with RPM/TPM-aware selection, latency-based routing, and cost optimization, so the system can automatically prefer the most efficient deployment available in real time.

Rate-limit handling will also evolve. We’re introducing exponential backoff and more granular retry policies per error type.
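To make the idea concrete, here is a minimal sketch of exponential backoff with full jitter, driven by a per-error-type policy table. Since this work is still planned, everything here is an assumption: the policy values, the `RETRY_POLICIES` table, and the function names are hypothetical.

```python
import random


def backoff_delays(attempts: int, base: float = 0.5, cap: float = 8.0, rng=random.random):
    """Full-jitter exponential backoff: each delay is drawn from [0, min(cap, base * 2**n)]."""
    return [rng() * min(cap, base * 2 ** n) for n in range(attempts)]


# Hypothetical per-error-type policies; the numbers are placeholders.
RETRY_POLICIES = {
    "rate_limit": {"attempts": 4, "base": 1.0},
    "timeout": {"attempts": 2, "base": 0.25},
    "api_error": {"attempts": 1, "base": 0.5},
}


def delays_for(outcome: str, rng=random.random):
    policy = RETRY_POLICIES.get(outcome)
    if policy is None:
        # Unlisted error types (for example authentication failures) are not retried.
        return []
    return backoff_delays(policy["attempts"], policy["base"], rng=rng)
```

The jitter spreads retries from many concurrent requests over time, which avoids hitting a rate-limited deployment again in synchronized bursts.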

Cooldowns will become more refined. Instead of a single threshold, we’ll define explicit “allowed failure” counts and tailor cooldown durations to the type of error, distinguishing between expected rate-limit spikes and critical authentication failures.

Fallback logic will also be extended to span multiple model groups. For example, falling back from gpt-4o-mini to gpt-4o, or dynamically selecting models based on context window size and content policies.

We’re also considering integrating other providers in the future, such as Anthropic, Gemini, and potentially self-hosted models.

One hundred LLM calls per second once felt ambitious. Now we’re preparing for thousands. But that’s a story for another article.