<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Manychat Engineering</title>
    <description>The latest articles on DEV Community by Manychat Engineering (@manychattech).</description>
    <link>https://dev.to/manychattech</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1109338%2Fd0e5c8de-a9c1-4408-8f06-b2652c797b8f.jpg</url>
      <title>DEV Community: Manychat Engineering</title>
      <link>https://dev.to/manychattech</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/manychattech"/>
    <language>en</language>
    <item>
      <title>Practical observability checklist for APIs, workers &amp; jobs. Part 1</title>
      <dc:creator>Manychat Engineering</dc:creator>
      <pubDate>Tue, 23 Jun 2026 12:02:11 +0000</pubDate>
      <link>https://dev.to/manychat/practical-observability-checklist-for-apis-workers-jobs-part-1-24i3</link>
      <guid>https://dev.to/manychat/practical-observability-checklist-for-apis-workers-jobs-part-1-24i3</guid>
      <description>&lt;h4&gt;
  
  
  The minimum set of signals that helps you understand what’s happening in production before users tell you something is wrong.
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fd6lpwwa9jtgfui70nbca.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fd6lpwwa9jtgfui70nbca.png" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Production has a special talent for turning “seems fine” into “why is everything on fire?”&lt;/p&gt;

&lt;p&gt;The service is up. Dashboards are green. Then reality hits: a restart that never reaches readiness, a worker that quietly stops consuming events, a scheduled job that never runs, latency creeping upward until users notice first.&lt;/p&gt;

&lt;p&gt;Most production failures are not mysterious. They are predictable, observable, and usually fixable. Yet they still turn into incidents because we discover them too late. After enough incidents, a pattern becomes hard to ignore: &lt;em&gt;we’re not missing fixes first — we’re missing signals.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fvw5opwpsn8f40eob4fzv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fvw5opwpsn8f40eob4fzv.png" width="800" height="388"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A green dashboard can still hide a broken workload.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That realization changes how you think about observability. The question is no longer “Do we have Grafana?”, “Do we collect logs?”, or “Should we add tracing?”&lt;/p&gt;

&lt;p&gt;The real question is: can we understand what is happening in production before users — or another team — tell us something is wrong?&lt;/p&gt;

&lt;p&gt;I’m &lt;a href="https://www.linkedin.com/in/dariakors/" rel="noopener noreferrer"&gt;Daria&lt;/a&gt;, a Python Engineer at &lt;a href="https://careers.manychat.com/team/engineering" rel="noopener noreferrer"&gt;Manychat&lt;/a&gt; with a QA/SDET background and a strong preference for systems that are boring to run. Over the past year, my team shipped a new class of production Python services for data processing and analytics and built their observability from scratch. This article is the checklist that emerged from that work.&lt;/p&gt;

&lt;p&gt;It’s intentionally vendor-agnostic. The goal is not to recommend a particular monitoring stack, framework, or observability platform. The goal is to identify the minimum set of signals that tells you whether a workload is healthy and doing the job it’s supposed to do.&lt;/p&gt;

&lt;p&gt;It covers three workload types: &lt;strong&gt;an HTTP API&lt;/strong&gt;, &lt;strong&gt;a background worker&lt;/strong&gt; (a queue consumer, event processor, or task worker: anything that does work outside the request/response path), and &lt;strong&gt;a scheduled job&lt;/strong&gt;. Each fails differently, so each needs a different observability baseline.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What “observable” means in practice&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Before adding dashboards, alerts, or traces, ask a simpler question: what does “working” actually mean for this service? Not “is the process running?” or “does Kubernetes think the pod is alive?”&lt;br&gt;&lt;br&gt;
Instead: what does correct behavior look like from the outside?&lt;/p&gt;

&lt;p&gt;For an &lt;strong&gt;API:&lt;/strong&gt; it accepts requests and responds correctly within acceptable latency. For a &lt;strong&gt;worker:&lt;/strong&gt; it consumes events, handles them successfully, keeps backlog under control, and has made progress recently enough. For a &lt;strong&gt;scheduled job:&lt;/strong&gt; it ran today, completed successfully, processed a non-suspicious amount of data, and produced output fresh enough for the product.&lt;/p&gt;

&lt;p&gt;Once you can answer that, observability becomes much easier to reason about. A system is observable enough when you can answer important operational questions about production services quickly:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fvr0esvk3whpiygw2vbro.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fvr0esvk3whpiygw2vbro.png" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you can answer these in minutes, debugging becomes more predictable. If not, you guess — and guessing under production pressure leads to random dashboard clicking, noisy Slack threads, and fixing the first visible symptom instead of the actual problem.&lt;/p&gt;

&lt;p&gt;This is why observability should start from operational questions, not from tools. Tools are implementation details; signals are the product.&lt;/p&gt;

&lt;p&gt;A metric is useful if it answers a question.&lt;br&gt;&lt;br&gt;
A log is useful if it helps reconstruct what happened.&lt;br&gt;&lt;br&gt;
A trace is useful if it connects behavior across components.&lt;br&gt;&lt;br&gt;
An alert is useful if it tells the right people about an actionable problem early enough.&lt;/p&gt;

&lt;p&gt;The goal is not more data. It’s the right questions and signals.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Workload type matters&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;It is tempting to use one generic checklist for every production workload. But an HTTP API, a worker, and a scheduled job fail differently, so each needs a different observability baseline.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fhxzax06b3flj5ntf525f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fhxzax06b3flj5ntf525f.png" width="800" height="388"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Different workloads fail differently.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  API checklist
&lt;/h3&gt;

&lt;p&gt;An API is usually the first to show when something breaks — users send requests, downstream services call it, error rates and latency surface fast.&lt;/p&gt;

&lt;p&gt;The core questions are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Is it up?&lt;/li&gt;
&lt;li&gt;Is it ready?&lt;/li&gt;
&lt;li&gt;Is it serving requests?&lt;/li&gt;
&lt;li&gt;Is it failing?&lt;/li&gt;
&lt;li&gt;Is it fast enough?&lt;/li&gt;
&lt;li&gt;Is traffic normal?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here’s what to watch for an API.&lt;/p&gt;

&lt;h4&gt;
  
  
  Health and readiness
&lt;/h4&gt;

&lt;p&gt;Liveness, readiness, and user-facing checks are related but answer different questions. Liveness: is this process alive, or should it be restarted? Readiness: can this instance safely receive traffic right now? A user-facing or synthetic check: does the service behave correctly from the outside?&lt;/p&gt;

&lt;p&gt;A process can be alive without being ready. A service can be technically ready and still fail a real user-facing flow.&lt;br&gt;&lt;br&gt;
An HTTP 200 alone may not be enough: you may also want to check response latency, expected response shape, or data freshness. A useful readiness check should reflect whether the service can actually do its job: database connectivity, required configuration, critical dependencies, internal startup state.&lt;/p&gt;

&lt;p&gt;The important part is not to turn readiness into a heavy synthetic transaction. The important part is to avoid the false comfort of “the process exists, therefore the service is fine.”&lt;/p&gt;
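
&lt;p&gt;The difference between liveness and readiness can be sketched in a few lines of Python. This is a minimal illustration, not a framework integration: check_readiness, database_reachable, and config_loaded are hypothetical names, and the failing database check is simulated.&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class ReadinessReport:
    ready: bool
    failed: List[str]


def liveness() -> bool:
    # Liveness only says that the process can still respond at all.
    return True


def check_readiness(checks: Dict[str, Callable[[], bool]]) -> ReadinessReport:
    """Run each cheap dependency check; a False result or an exception means not ready."""
    failed = []
    for name, check in checks.items():
        try:
            ok = check()
        except Exception:
            ok = False  # a crashing check counts as a failed dependency
        if not ok:
            failed.append(name)
    return ReadinessReport(ready=not failed, failed=failed)


# Hypothetical checks: replace with a real DB ping and real config validation.
def database_reachable() -> bool:
    raise ConnectionError("connection refused")  # simulated outage


def config_loaded() -> bool:
    return True


report = check_readiness({"database": database_reachable, "config": config_loaded})
print("alive:", liveness(), "ready:", report.ready, "failed:", report.failed)
```

&lt;p&gt;The process is alive, yet the instance should not receive traffic. Keeping each check cheap (a ping, not a full transaction) is what stops readiness from turning into a heavy synthetic test.&lt;/p&gt;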

&lt;p&gt;&lt;strong&gt;Useful signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;service/pod availability,&lt;/li&gt;
&lt;li&gt;readiness status,&lt;/li&gt;
&lt;li&gt;restart count,&lt;/li&gt;
&lt;li&gt;startup failures,&lt;/li&gt;
&lt;li&gt;dependency readiness when critical.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Common mistakes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;treating liveness as readiness,&lt;/li&gt;
&lt;li&gt;alerting only when the pod disappears,&lt;/li&gt;
&lt;li&gt;not alerting when the service exists but never becomes ready.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Request rate and throughput
&lt;/h4&gt;

&lt;p&gt;Request rate gives context. A latency spike during a traffic surge tells a different story from the same spike during normal load. A sudden drop can also be a signal — maybe clients stopped calling, routing broke, a feature flag changed, or an upstream service failed.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Useful signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;requests per second,&lt;/li&gt;
&lt;li&gt;requests by route/endpoint,&lt;/li&gt;
&lt;li&gt;traffic split by status code,&lt;/li&gt;
&lt;li&gt;traffic split by important client/source if applicable.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Careful with labels.&lt;/strong&gt; Endpoint labels are useful but raw URLs, account IDs, user IDs, or arbitrary request parameters can create high cardinality and make your metrics backend very unhappy.&lt;/p&gt;
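
&lt;p&gt;One way to keep endpoint labels bounded is to collapse variable path segments before using the path as a label. The sketch below is an assumption-heavy fallback: route_label is a hypothetical helper, and the rules (numeric IDs and UUID-like segments become {id}) are illustrative. If your web framework exposes the matched route template, that is usually a better label source.&lt;/p&gt;

```python
import re

# Assumed rule: segments that look like a UUID are treated as identifiers.
_UUID = re.compile(r"^[0-9a-fA-F]{8}(-[0-9a-fA-F]{4}){3}-[0-9a-fA-F]{12}$")


def route_label(path: str) -> str:
    """Turn a raw request path into a low-cardinality label value."""
    path = path.split("?", 1)[0]  # drop the query string entirely
    parts = []
    for segment in path.strip("/").split("/"):
        if segment.isdigit() or _UUID.match(segment):
            parts.append("{id}")  # collapse identifiers into one placeholder
        else:
            parts.append(segment)
    return "/" + "/".join(parts)


print(route_label("/accounts/42/users/7?page=3"))
print(route_label("/reports/123e4567-e89b-12d3-a456-426614174000"))
```

&lt;p&gt;Every account and user now maps to the same label value, so the number of time series depends on the number of routes, not the number of customers.&lt;/p&gt;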

&lt;h4&gt;
  
  
  Error rate
&lt;/h4&gt;

&lt;p&gt;Error rate is one of the first signals people expect from an API. You want to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how many requests fail,&lt;/li&gt;
&lt;li&gt;whether failures are client-side or server-side,&lt;/li&gt;
&lt;li&gt;which endpoints are affected,&lt;/li&gt;
&lt;li&gt;whether the failure is sustained or just a tiny spike.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Useful signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;5xx rate,&lt;/li&gt;
&lt;li&gt;4xx rate when meaningful,&lt;/li&gt;
&lt;li&gt;error ratio by endpoint,&lt;/li&gt;
&lt;li&gt;exception count by error class,&lt;/li&gt;
&lt;li&gt;dependency error count if the API calls databases, caches, queues, or external services.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Latency, especially tail latency
&lt;/h4&gt;

&lt;p&gt;Averages are often too polite. They hide the pain. Users don’t experience the average request — they experience the one they’re waiting for right now. That’s why p95 and p99 are usually more useful than average latency alone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Useful signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;p50 latency for baseline behavior,&lt;/li&gt;
&lt;li&gt;p95 latency for common bad experience,&lt;/li&gt;
&lt;li&gt;p99 latency for tail behavior,&lt;/li&gt;
&lt;li&gt;latency by endpoint,&lt;/li&gt;
&lt;li&gt;latency of important dependencies when available.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Common mistakes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;looking only at average latency,&lt;/li&gt;
&lt;li&gt;histogram buckets too coarse to show the real problem,&lt;/li&gt;
&lt;li&gt;one latency SLO applied to very different endpoints.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;One thing worth knowing: if your dashboard shows p99 stuck exactly at the highest histogram bucket boundary for a long time, the real latency may be worse than the chart can show. That’s not a healthy signal; it’s an instrumentation limitation.&lt;/p&gt;
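
&lt;p&gt;Why the chart flattens can be shown with a simplified quantile estimate over cumulative buckets. This is a sketch in the spirit of how bucketed histograms are usually interpolated, not any specific backend’s implementation; estimate_quantile and the bucket values are made up for illustration.&lt;/p&gt;

```python
import math
from typing import List, Tuple

Bucket = Tuple[float, int]  # (upper bound in seconds, cumulative count)


def estimate_quantile(q: float, buckets: List[Bucket]) -> float:
    """Linear interpolation inside the bucket that contains the q-th observation.

    Buckets must be sorted by upper bound and end with math.inf.
    """
    rank = q * buckets[-1][1]
    lower_bound, lower_count = 0.0, 0
    for upper_bound, count in buckets:
        if count >= rank:
            in_bucket = count - lower_count
            if math.isinf(upper_bound) or in_bucket == 0:
                # Beyond the last finite boundary we cannot interpolate:
                # the estimate is clamped to that boundary.
                return lower_bound
            fraction = (rank - lower_count) / in_bucket
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Invented data: 15 of 1000 requests were slower than the largest bucket (1.0 s).
buckets = [(0.1, 900), (0.5, 980), (1.0, 985), (math.inf, 1000)]
p99 = estimate_quantile(0.99, buckets)
overflow_share = (buckets[-1][1] - buckets[-2][1]) / buckets[-1][1]
print(p99, overflow_share)
```

&lt;p&gt;Here p99 comes out as exactly 1.0 whether the slow requests took 1.2 seconds or 40 seconds. Watching the share of observations that land beyond the last finite bucket is one way to notice that the buckets need to be extended.&lt;/p&gt;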

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fhhwv1qnzui1vuhmghw2d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fhhwv1qnzui1vuhmghw2d.png" width="799" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Domain-specific signals
&lt;/h4&gt;

&lt;p&gt;Generic API metrics are necessary but not always enough. Many production issues only make sense when you add one or two domain-specific signals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;cache hit/miss ratio,&lt;/li&gt;
&lt;li&gt;cache invalidation count,&lt;/li&gt;
&lt;li&gt;downstream query duration,&lt;/li&gt;
&lt;li&gt;number of records returned,&lt;/li&gt;
&lt;li&gt;rate of empty responses,&lt;/li&gt;
&lt;li&gt;feature-specific processing outcomes,&lt;/li&gt;
&lt;li&gt;calls to a critical third-party dependency.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Do not turn everything into a metric. Add the signals that explain important system behavior.&lt;/p&gt;

&lt;h3&gt;
  
  
  Worker / event processor checklist
&lt;/h3&gt;

&lt;p&gt;Workers are tricky because they can look alive while doing nothing useful. A worker can be running as a process but failing as a workload. For workers, “alive” is not the same as “working”.&lt;/p&gt;

&lt;p&gt;The process is running. The service instance is running. CPU is fine. The platform reports healthy. But no events are being consumed. Or they’re read and fail during handling. Or one poison message blocks everything. Or the backlog quietly grows while the worker is technically “up”.&lt;/p&gt;

&lt;p&gt;For workers, liveness is not enough. The real question is: is it making progress?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fbonmogqds46e7tb36wpd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fbonmogqds46e7tb36wpd.png" width="800" height="435"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Read rate / consumption rate
&lt;/h4&gt;

&lt;p&gt;First, you need to know whether the worker is actually reading from the queue — events from Kafka, RabbitMQ, SQS, tasks from a queue, messages from a stream, whatever your architecture uses.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Useful signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;events/messages/tasks read total,&lt;/li&gt;
&lt;li&gt;read rate over time,&lt;/li&gt;
&lt;li&gt;read failures by error class,&lt;/li&gt;
&lt;li&gt;last read timestamp.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A worker that isn’t reading may still look alive. Without these metrics, you’ll discover the problem indirectly — through stale data, customer reports, or a growing backlog.&lt;/p&gt;

&lt;h4&gt;
  
  
  Processing outcomes
&lt;/h4&gt;

&lt;p&gt;Reading work is not the same as handling it successfully. A worker may consume events but fail while processing them — and without outcome metrics, you won’t know.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Useful signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;processed/handled total,&lt;/li&gt;
&lt;li&gt;success count,&lt;/li&gt;
&lt;li&gt;failed count,&lt;/li&gt;
&lt;li&gt;skipped/unhandled count,&lt;/li&gt;
&lt;li&gt;retry count,&lt;/li&gt;
&lt;li&gt;failure by error class,&lt;/li&gt;
&lt;li&gt;failure by handler/event type/task type.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A good metric shape to aim for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;events_handled_total{handler, event_type, outcome}&lt;/li&gt;
&lt;li&gt;events_processing_failed_total{handler, event_type, error_class}&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The exact names should follow your project and monitoring conventions. What matters is the model: count handled work, separate outcomes, keep labels bounded, and make failures explorable by handler, event type, and error class.&lt;/p&gt;
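
&lt;p&gt;As a rough illustration of that model, here is a stdlib-only sketch that uses collections.Counter as a stand-in for a real metrics client. The helper names and the allowed outcome set are assumptions; in a real service these would be counters registered with whatever metrics library you already use.&lt;/p&gt;

```python
from collections import Counter

# Assumed closed set of outcomes: anything else is folded into "unknown"
# so that a typo or a new code path cannot create unbounded label values.
ALLOWED_OUTCOMES = {"success", "failed", "skipped"}

events_handled_total: Counter = Counter()
events_processing_failed_total: Counter = Counter()


def record_handled(handler: str, event_type: str, outcome: str) -> None:
    if outcome not in ALLOWED_OUTCOMES:
        outcome = "unknown"
    events_handled_total[(handler, event_type, outcome)] += 1


def record_failure(handler: str, event_type: str, error: Exception) -> None:
    # The error class name is bounded; the error message is not, so it stays in logs.
    record_handled(handler, event_type, "failed")
    error_class = type(error).__name__
    events_processing_failed_total[(handler, event_type, error_class)] += 1


record_handled("billing", "invoice_created", "success")
record_handled("billing", "invoice_created", "success")
record_failure("billing", "invoice_created", TimeoutError("db timed out after 30s"))

print(dict(events_handled_total))
print(dict(events_processing_failed_total))
```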

&lt;h4&gt;
  
  
  &lt;strong&gt;Backlog / queue depth&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;If your architecture has a queue, backlog is one of the most important things to watch.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Useful signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;queue depth,&lt;/li&gt;
&lt;li&gt;oldest message age,&lt;/li&gt;
&lt;li&gt;lag by partition/topic/stream when applicable,&lt;/li&gt;
&lt;li&gt;backlog growth rate.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Backlog needs context. A queue depth of 100 may be perfectly fine in one system and catastrophic in another. What matters is whether the worker can catch up and whether the delay violates product expectations.&lt;/p&gt;

&lt;p&gt;Show backlog, processing rate, and failure rate together on the same dashboard. If backlog says work exists, processing rate says it’s moving, and failure rate stays quiet — you’re good.&lt;/p&gt;
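
&lt;p&gt;Whether a given queue depth is acceptable depends on how fast it drains. A back-of-the-envelope estimate, assuming roughly constant rates, makes the point; seconds_to_drain is a hypothetical helper and the numbers are invented.&lt;/p&gt;

```python
import math


def seconds_to_drain(backlog: int, arrival_rate: float, processing_rate: float) -> float:
    """Estimated time to empty the queue, with rates in messages per second."""
    if backlog == 0:
        return 0.0
    net_rate = processing_rate - arrival_rate
    if net_rate > 0:
        return backlog / net_rate
    return math.inf  # the worker is not keeping up, so the backlog never drains


# The same backlog of 100 messages in two different systems:
print(seconds_to_drain(100, arrival_rate=5.0, processing_rate=10.0))
print(seconds_to_drain(100, arrival_rate=10.0, processing_rate=5.0))
```

&lt;p&gt;In the first system the backlog clears in about 20 seconds; in the second it only grows. Comparing the estimate with the delay the product can tolerate is more informative than a fixed queue-depth threshold.&lt;/p&gt;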

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F27m4clbvk4ld1ng8ktwn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F27m4clbvk4ld1ng8ktwn.png" width="800" height="193"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Processing rate and backlog together.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Last successful progress timestamp
&lt;/h4&gt;

&lt;p&gt;This is one of the most useful signals for silent failures, when the worker looks alive but isn’t actually doing anything. Track the timestamp of the last successful progress point — whether it is a read, a completed processing step or a full read+process cycle, depending on what “progress” means for your worker.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Useful signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;last_read_timestamp_seconds,&lt;/li&gt;
&lt;li&gt;last_processed_timestamp_seconds,&lt;/li&gt;
&lt;li&gt;last_successful_task_timestamp_seconds.&lt;/li&gt;
&lt;/ul&gt;
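
&lt;p&gt;A progress timestamp is cheap to maintain inside the worker. The sketch below assumes a hypothetical ProgressTracker class and an invented 300-second silence threshold; the clock is injectable only so the example can simulate time passing.&lt;/p&gt;

```python
import time
from typing import Callable


class ProgressTracker:
    """Remembers when the worker last made successful progress."""

    def __init__(self, max_silence_seconds: float, clock: Callable[[], float] = time.time):
        self._clock = clock
        self.max_silence_seconds = max_silence_seconds
        # Startup counts as progress so a fresh worker is not flagged immediately.
        self.last_processed_timestamp_seconds = clock()

    def mark_progress(self) -> None:
        # Call this only after a unit of work completed successfully.
        self.last_processed_timestamp_seconds = self._clock()

    def seconds_since_progress(self) -> float:
        return self._clock() - self.last_processed_timestamp_seconds

    def is_stalled(self) -> bool:
        return self.seconds_since_progress() > self.max_silence_seconds


# Simulated clock for the example.
now = [1000.0]
tracker = ProgressTracker(max_silence_seconds=300, clock=lambda: now[0])

now[0] += 120
tracker.mark_progress()
stalled_right_after = tracker.is_stalled()

now[0] += 301  # the process stays "up" but nothing completes
stalled_later = tracker.is_stalled()
print(stalled_right_after, stalled_later)
```

&lt;p&gt;A quiet queue also produces an old timestamp, so it may make sense to alert on staleness only when the backlog shows that work exists.&lt;/p&gt;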

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fpakvr6sihyq9xd4s6xgy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fpakvr6sihyq9xd4s6xgy.png" width="799" height="195"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Processing duration
&lt;/h4&gt;

&lt;p&gt;Workers need duration metrics too, but the question is different from APIs — not request/response time, but how long it actually takes to process a unit of work.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Useful signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;processing duration histogram,&lt;/li&gt;
&lt;li&gt;p50/p95/p99 processing time,&lt;/li&gt;
&lt;li&gt;duration by handler/task type,&lt;/li&gt;
&lt;li&gt;slow processing count.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Common mistake:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;measuring the wrong boundary and then misinterpreting the result.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you decorate a high-level handle_event function, your histogram may include routing, validation, handler execution, logging, dependency calls, and error handling. That’s still useful, but know what you’re actually measuring.&lt;/p&gt;
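
&lt;p&gt;To make the boundary explicit, you can time the outer function and the inner handler separately. This is a minimal sketch: timed, the boundary names, and the sleep calls standing in for real work are all illustrative, and a real worker would record into a histogram instead of a list.&lt;/p&gt;

```python
import time
from collections import defaultdict
from contextlib import contextmanager

# Stand-in for a duration histogram, keyed by the boundary being measured.
durations = defaultdict(list)


@contextmanager
def timed(boundary: str):
    start = time.perf_counter()
    try:
        yield
    finally:
        # Record the duration even when the wrapped code raises.
        durations[boundary].append(time.perf_counter() - start)


def handle_event(event: dict) -> None:
    with timed("handle_event_total"):
        time.sleep(0.01)  # stand-in for routing and validation
        with timed("handler_only"):
            time.sleep(0.01)  # stand-in for the actual handler logic


handle_event({"type": "demo"})
print({name: round(values[0], 3) for name, values in durations.items()})
```

&lt;p&gt;The outer measurement always includes the inner one, so a slowdown that shows up only in handle_event_total points at routing, validation, or other overhead, not at the handler itself.&lt;/p&gt;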

&lt;h3&gt;
  
  
  Scheduled job checklist
&lt;/h3&gt;

&lt;p&gt;Scheduled jobs fail even more quietly. A daily job may do nothing for 23 hours and still be healthy, which makes generic service-style monitoring a poor fit.&lt;/p&gt;

&lt;p&gt;The first question is whether it ran successfully when it was supposed to. Then: how long did it take, did it process the expected amount of data, when was the last success, and is the result still fresh enough for whatever depends on it.&lt;/p&gt;

&lt;h4&gt;
  
  
  Last run timestamp
&lt;/h4&gt;

&lt;p&gt;You need to know when the job last started.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Useful signal:&lt;/strong&gt; last_run_timestamp_seconds.&lt;br&gt;&lt;br&gt;
This tells you whether the scheduler triggered the job at all. If the last run timestamp is too old, the problem may be scheduling, deployment, permissions, environment configuration, or the job process not starting.&lt;/p&gt;

&lt;h4&gt;
  
  
  Last success timestamp
&lt;/h4&gt;

&lt;p&gt;A job can run and fail. That is why the last run is not enough.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Useful signal:&lt;/strong&gt; last_success_timestamp_seconds.&lt;br&gt;&lt;br&gt;
This is often the best freshness signal for scheduled jobs.&lt;/p&gt;

&lt;h4&gt;
  
  
  Last run status
&lt;/h4&gt;

&lt;p&gt;A simple status metric is extremely practical.&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Useful signal:&lt;/strong&gt; last_run_status where 1 = success, 0 = failure.&lt;br&gt;&lt;br&gt;
This gives a clear “latest result” view.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Duration&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Duration helps detect degradation before complete failure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Useful signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;last_run_duration_seconds,&lt;/li&gt;
&lt;li&gt;duration history over time,&lt;/li&gt;
&lt;li&gt;p95/p99 duration if the job runs frequently enough.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For daily jobs, even a simple last-duration gauge can help.&lt;/p&gt;

&lt;p&gt;It tells you whether the job is getting slower, whether an increase in data volume affected runtime, whether a dependency slowed down, and whether the job is getting close to exceeding its scheduling window.&lt;/p&gt;

&lt;h4&gt;
  
  
  Output / records processed
&lt;/h4&gt;

&lt;p&gt;Success status alone can be misleading for data jobs — the job may complete without producing anything useful.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Useful signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;records processed,&lt;/li&gt;
&lt;li&gt;records inserted/updated/deleted,&lt;/li&gt;
&lt;li&gt;number of accounts/customers/entities processed,&lt;/li&gt;
&lt;li&gt;output freshness,&lt;/li&gt;
&lt;li&gt;number of empty results,&lt;/li&gt;
&lt;li&gt;validation failures.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is where business-level metrics can be helpful.&lt;/p&gt;

&lt;h4&gt;
  
  
  Reading the signals together
&lt;/h4&gt;

&lt;p&gt;These signals become most useful when you read them together:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;last run recent + last status failure = job ran but failed&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;last run old + last success old = job may not be running&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;last run recent + last success recent + duration increased = job works but may be slowing down&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;last run succeeded + records processed is unexpectedly zero = job works but produces nothing useful&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
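
&lt;p&gt;The combinations above can be written down as a small decision function. This is a sketch under assumptions: JobSignals and diagnose are hypothetical names, the thresholds are invented, and the check order (did it run, did it fail, is the output suspicious, is it slowing down) is one reasonable choice among several.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass
class JobSignals:
    seconds_since_last_run: float
    last_run_status: int  # 1 = success, 0 = failure
    last_run_duration_seconds: float
    records_processed: int


def diagnose(
    signals: JobSignals,
    max_age_seconds: float,
    usual_duration_seconds: float,
    slow_factor: float = 2.0,  # assumed: "slow" means twice the usual duration
) -> str:
    if signals.seconds_since_last_run > max_age_seconds:
        return "did not run"
    if signals.last_run_status == 0:
        return "ran and failed"
    if signals.records_processed == 0:
        return "ran but produced suspicious output"
    if signals.last_run_duration_seconds > usual_duration_seconds * slow_factor:
        return "ran and succeeded, but is slowing down"
    return "ran and succeeded"


# A daily job that reported success an hour ago but processed nothing.
daily = JobSignals(
    seconds_since_last_run=3600,
    last_run_status=1,
    last_run_duration_seconds=40,
    records_processed=0,
)
print(diagnose(daily, max_age_seconds=26 * 3600, usual_duration_seconds=600))
```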

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Feissf6iqz2s6c0m2h02z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Feissf6iqz2s6c0m2h02z.png" width="798" height="191"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Records processed is unexpectedly zero in the last successful run.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This is why scheduled job observability should not rely on one status flag alone. You want enough signals to distinguish “did not run”, “ran and failed”, “ran and succeeded”, and “ran but produced suspicious output”.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F0p4qq9ez7f2o2oesteta.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2F0p4qq9ez7f2o2oesteta.png" width="800" height="388"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Don’t forget about dependency and infrastructure metrics
&lt;/h3&gt;

&lt;p&gt;Application metrics tell you how the workload behaves. Dependency and infrastructure metrics help explain why.&lt;br&gt;&lt;br&gt;
If API latency goes up, the cause may be in the application, database, cache, external API, connection pool, or infrastructure. For a database-backed service, API latency should be visible together with database query duration, connection pool behavior, database errors/timeouts, and storage-level signals such as IOPS or read/write latency when relevant.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ff5cnb1onbpnml8lgxi5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Ff5cnb1onbpnml8lgxi5k.png" width="800" height="195"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;API latency goes up due to a slow DB query.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Useful signals:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;DB connection count / pool usage,&lt;/li&gt;
&lt;li&gt;query latency,&lt;/li&gt;
&lt;li&gt;slow queries,&lt;/li&gt;
&lt;li&gt;DB errors/timeouts,&lt;/li&gt;
&lt;li&gt;IOPS / disk latency for managed databases such as RDS,&lt;/li&gt;
&lt;li&gt;cache hit/miss ratio,&lt;/li&gt;
&lt;li&gt;cache latency,&lt;/li&gt;
&lt;li&gt;external dependency latency/error rate,&lt;/li&gt;
&lt;li&gt;CPU and memory,&lt;/li&gt;
&lt;li&gt;restarts,&lt;/li&gt;
&lt;li&gt;disk and network signals.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Infrastructure metrics support investigation, but they don’t replace user-impact signals. High CPU is context. High p99 latency is impact. A service can have normal CPU usage and still return wrong data. A worker can have a healthy pod and still stop processing. A scheduled job can have no alarming resource usage because it never ran.&lt;br&gt;&lt;br&gt;
Start from workload behavior, and use infrastructure metrics to explain what you find.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;***&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
That’s the metrics side covered: what to watch for APIs, workers, and scheduled jobs, and how to read the signals together.&lt;/p&gt;

&lt;p&gt;Metrics tell you something is wrong. But they won’t tell you what exactly happened, or where in the system it happened. That’s what logs and traces are for, and knowing when to reach for which one is half the battle. We’ll also talk about alerting that actually pages you for the right reasons, and a rollout order for the observability setup that won’t kill you. All in the second part.&lt;/p&gt;

&lt;p&gt;Stay tuned!&lt;/p&gt;




</description>
      <category>observability</category>
      <category>infrastructure</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Why an engineer should try running a community</title>
      <dc:creator>Manychat Engineering</dc:creator>
      <pubDate>Tue, 16 Jun 2026 08:52:01 +0000</pubDate>
      <link>https://dev.to/manychat/why-an-engineer-should-try-running-a-community-55p8</link>
      <guid>https://dev.to/manychat/why-an-engineer-should-try-running-a-community-55p8</guid>
      <description>&lt;h4&gt;
  
  
  What do you get from organizing engineering meetups besides extra work? After organizing several of them, I have a few answers.
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8o2kjpyh4e35qtngzzdb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8o2kjpyh4e35qtngzzdb.png" width="800" height="432"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The latest meetup in April 2026.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Over three years at Manychat, I switched teams a few times and got to know a lot of engineers across the company. Still, I often learned about interesting technical projects completely by accident.&lt;/p&gt;

&lt;p&gt;We had sprint reviews, of course, but they’re focused on product progress. The engineering stories behind the work rarely make it onto the agenda.&lt;/p&gt;

&lt;p&gt;At some point, I got tired of relying on luck. We already had engineering communities, and when an opportunity came up to get more involved in the frontend one, I saw it as a chance to create a place where engineers from different disciplines could share what they were building and learn from each other.&lt;/p&gt;

&lt;p&gt;Before starting, I had a long conversation with my manager. It forced me to think more carefully about why I wanted to do this in the first place: what value it would bring, not only to the company and the team, but also to me personally.&lt;/p&gt;

&lt;p&gt;That question stayed with me. Five meetups later, with the sixth coming at the end of June, I think I have an answer.&lt;/p&gt;

&lt;p&gt;I’m &lt;a href="https://www.linkedin.com/in/egor-naidikov-a23390215/" rel="noopener noreferrer"&gt;Egor&lt;/a&gt;, Frontend Techlead at &lt;a href="https://careers.manychat.com/team/engineering" rel="noopener noreferrer"&gt;Manychat&lt;/a&gt;. In this article, I won’t talk about why engineering communities are good for companies. Instead, I want to share what running one can give you as an engineer.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Building your network&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When I joined Manychat, there were about 15 frontend engineers. I knew all of them. Today the company has more than 400 people, and that’s simply no longer possible.&lt;/p&gt;

&lt;p&gt;We have Slack channels for introductions. Every new hire gets a welcome message. The problem is that someone new joins almost every day. Reading an introduction is easy. Actually getting to know people is much harder. If you’re not working on the same project, you may never interact at all. As systems grow, repositories split, and teams become more specialized, that becomes even more likely.&lt;/p&gt;

&lt;p&gt;Organizing a meetup gives you a legitimate reason to talk to people you’d otherwise never meet. You learn what another team is building, what problems they’re solving, and which shortcuts they’ve already discovered.&lt;br&gt;&lt;br&gt;
Without these conversations, people slowly disappear into their own corners of the company. Important knowledge becomes local knowledge.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdmvisqg24z11g6qc4xu.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxdmvisqg24z11g6qc4xu.jpeg" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Our first offline meetup in July 2025.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Building soft skills&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Ordering food, booking a room, finding a clicker that actually works: that’s the easy part (and it can be delegated). The harder part starts when you need people.&lt;/p&gt;

&lt;p&gt;Most engineers spend their day talking to the same handful of people. Community work is different. Suddenly you’re asking strangers for favors: to share their work and knowledge, to spend their time preparing a talk, and finally to stand in front of a room full of colleagues and explain something they normally only discuss with their team. You have no authority over them, no leverage, no budget. Just your ability to explain why it matters and why it’s worth their time.&lt;/p&gt;

&lt;p&gt;And then there are all the people outside engineering. Running a meetup means asking for budget, finding time in people’s schedules, promoting events, and getting organizational support. That usually involves managers, HR, and various stakeholders. They care about different things than engineers do. Learning how to explain the same initiative to different audiences is a useful skill on its own.&lt;/p&gt;

&lt;p&gt;That turns out to be surprisingly useful practice for all kinds of soft skills. The era of highly paid engineers who can quietly ship tickets and avoid people altogether seems to be ending.&lt;/p&gt;

&lt;p&gt;Whether you want to become a tech lead, an engineering manager, or simply a stronger engineer, your ability to work with people matters more every year. Running a community gives you a place to practice at low stakes.&lt;/p&gt;

&lt;p&gt;And one more bonus. When you organize events, bring people together, and keep things moving, colleagues start associating you with ownership. You’re no longer just the person working on a specific project. You’re the person making something happen because you care.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffy7ply8xlhr0nx7twt2p.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffy7ply8xlhr0nx7twt2p.jpeg" width="800" height="600"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Our second meetup in November 2025.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Expanding your picture of what’s happening&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;One thing I noticed after changing teams several times is how easy it is to develop a very narrow view of the company.&lt;br&gt;&lt;br&gt;
Sprint reviews are mostly about product increments. There’s no time to talk about new technical practices, architectural decisions, or how a team changed the way it works.&lt;/p&gt;

&lt;p&gt;You can spend years in your slice of the codebase without understanding what’s going on around you. Meanwhile, colleagues are solving interesting problems and building things that could be useful to you too. Without a place where people share that, it just never comes up. Meetups became that place for us.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;One of the meetup’s real wins&lt;br&gt;&lt;br&gt;
We had our own routing mechanism for a long time, then migrated to React Router, but URL management still wasn’t particularly pleasant. One engineer, as part of their PDP, implemented an abstraction layer for unified URL handling with fully typed parameters.&lt;br&gt;&lt;br&gt;
A meetup gave them a stage. Other engineers started experimenting with it. Today, every frontend engineer uses that solution. Without the meetup, that work would have stayed inside a single team. That’s probably one of my favorite examples of what these events can do.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;I’ve switched teams several times and worked in very different parts of the product, and I still learn something new every time. That’s probably the most personally valuable thing I get as a community organizer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrct0dsnxf14i777bgtd.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmrct0dsnxf14i777bgtd.jpeg" width="800" height="534"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Here’s what attendance looked like at our latest meetup in April 2026.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Ok, I’m in, where do you start?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;After five meetups, I don’t think there’s a magic formula. But there are a few things that will help you get started:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Go to other people’s meetups first.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Before organizing anything, become part of someone else’s community. Watch how they run events, talk to organizers, ask what broke. Starting from scratch sounds exciting, but it’s also an efficient way to collect every avoidable mistake yourself.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Start smaller.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
It’s tempting to announce a company-wide engineering meetup from day one. Don’t.&lt;br&gt;&lt;br&gt;
Start with a group where you already have relationships — mobile, backend, frontend. Start with your own unit. The first two meetups naturally gravitated toward frontend topics. That’s where I knew the most people and where finding speakers was easiest.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Find people who care about the same thing.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
This is the most important one. Communities don’t die because of a lack of ideas. They die because one person gets tired. If the entire community depends on one organizer, it already has an expiration date. Find people who care about the same thing and build a core group. They will keep things going when you’re too overwhelmed with work tasks or just out of energy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Learn how to motivate people.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
To do that, you need to know what the speaker gets out of it. Sometimes it’s visibility and recognition. Sometimes it’s public-speaking practice. Or just a chance to spend time with interesting colleagues and free pizza.&lt;br&gt;&lt;br&gt;
The same applies to attendees. A meetup without an audience is every organizer’s nightmare. Trust me.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. Get support before you start.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Meetups happen during work hours, and so does the preparation for them. Most managers support this kind of activity. Some won’t. Get explicit support from your manager or leadership before investing months into building something.&lt;br&gt;&lt;br&gt;
And if your company has people responsible for employer branding or DevRel, talk to them early. In reality, these teams are often looking for exactly this kind of initiative. They can help promote events through internal communication channels, attract more attendees, and generally make sure your meetup doesn’t depend entirely on word of mouth. More importantly, they can help with motivation by offering something more tangible than recognition: for example, points that can be exchanged for company merch, small rewards, or other perks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;6. Don’t do everything yourself.&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
If you’re speaking, ask someone else to host. If you’re organizing, let someone else own parts of the process. Communities become more sustainable when responsibility is shared. And you’ll enjoy them more too. Because at some point you stop running an event and start being part of a community.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want to be part of our engineering community? Take a look at our&lt;/em&gt; &lt;a href="https://careers.manychat.com/team/engineering" rel="noopener noreferrer"&gt;&lt;em&gt;open positions&lt;/em&gt;&lt;/a&gt;&lt;em&gt;. Maybe we’re looking for you.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>community</category>
      <category>softwareengineering</category>
      <category>devrel</category>
      <category>meetup</category>
    </item>
    <item>
      <title>How to Build an Agentic Design System People (and Agents) Will Actually Use (Part 1)</title>
      <dc:creator>Manychat Engineering</dc:creator>
      <pubDate>Wed, 10 Jun 2026 08:35:30 +0000</pubDate>
      <link>https://dev.to/manychat/how-to-build-an-agentic-design-system-people-and-agents-will-actually-use-part-1-4i03</link>
      <guid>https://dev.to/manychat/how-to-build-an-agentic-design-system-people-and-agents-will-actually-use-part-1-4i03</guid>
      <description>&lt;h4&gt;
  
  
  &lt;em&gt;On building Manychat’s design system in eight working days, and why those days were even possible.&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzil1kr08d1m52e040m5b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzil1kr08d1m52e040m5b.png" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most design systems don’t fail by being wrong. They fail by multiplying.&lt;/p&gt;

&lt;p&gt;When Manychat shifted to mobile-first, we needed a design system that worked across all three platforms. We had one: mature, well-maintained, living on the web. But mobile had never adopted it. So each platform did what made sense locally: iOS added its own semantics, Android added its own, web shipped its own names. We’d been building parallel design systems, none of them quite agreeing on what subtle text or warning yellow actually meant.&lt;/p&gt;

&lt;p&gt;Hello 👋, my name is &lt;a href="https://www.linkedin.com/in/thanhdolong" rel="noopener noreferrer"&gt;Thanh&lt;/a&gt;, and I’m an iOS Engineer at &lt;a href="https://careers.manychat.com/team/engineering" rel="noopener noreferrer"&gt;Manychat&lt;/a&gt;. This series is about what we did about it: building an agentic design system that works across all platforms, in just eight working days.&lt;/p&gt;

&lt;p&gt;This first chapter focuses on what a solid design system needs in its foundation and how we rebuilt ours to be AI-driven. If you finish it and you’re not at least slightly curious about how AI can change the way you build, I failed 😔.&lt;/p&gt;

&lt;h3&gt;
  
  
  A design system is a shared language
&lt;/h3&gt;

&lt;p&gt;A design system isn’t a component library, a Figma file, or even tokens in the abstract. It’s an agreement — a shared language between design and engineering for what the product should look like, feel like, and behave like, written in a form both sides can read, write, and hold each other to. Tokens encode the agreement: &lt;em&gt;Link&lt;/em&gt;, &lt;em&gt;Danger&lt;/em&gt;, &lt;em&gt;Surface&lt;/em&gt;, &lt;em&gt;Subtle&lt;/em&gt;. Not the same hex, but the same meaning.&lt;/p&gt;

&lt;p&gt;Without that language, a codebase isn’t a system — it’s a collection of coincidences that happen to look like one. Engineers re-derive every spacing decision from scratch. Dark mode becomes a parallel product maintained by guesswork. New engineers spend their first week learning which color belongs where, knowledge that lives in tribal memory rather than documentation. Designers feel it from the other side: they invest in a Figma system and engineering ships something close. Nobody’s fault — there’s just no vocabulary to hold either side accountable.&lt;/p&gt;

&lt;p&gt;Users notice eventually, not as bugs but as wobble — spacing that almost matches, colors that almost pair, a dark mode that almost works.&lt;/p&gt;

&lt;p&gt;The question isn’t whether you need a design system. It’s how you build one that actually fits into the development flow instead of sitting next to it. Here’s what worked for us.&lt;/p&gt;

&lt;h3&gt;
  
  
  Do tokens before components
&lt;/h3&gt;

&lt;p&gt;Most design systems start where they can ship: buttons, inputs, cards. That’s how we started too.&lt;/p&gt;

&lt;p&gt;Brad Frost’s &lt;a href="https://atomicdesign.bradfrost.com/" rel="noopener noreferrer"&gt;Atomic Design&lt;/a&gt; borrows directly from chemistry — atoms (button, input, label) bond into molecules, molecules into organisms, then templates, then pages. Simple, stable things combine into complex, situational ones.&lt;/p&gt;

&lt;p&gt;Frost himself &lt;a href="https://bradfrost.com/blog/post/design-tokens-atomic-design-%E2%9D%A4%EF%B8%8F/" rel="noopener noreferrer"&gt;extended the model downward&lt;/a&gt;, treating tokens as the &lt;em&gt;sub-atomic&lt;/em&gt; layer beneath atoms — the particles, and the rules atoms are made of. We use the same framing.&lt;/p&gt;

&lt;p&gt;Sub-atoms are the raw decisions every atom is built from: color, spacing, radius, shadow, motion, typography. Other teams call them tokens. Either way, users don’t see them. Users see what they make possible.&lt;/p&gt;

&lt;p&gt;Think of it like a food pyramid: what’s at the base supports everything above.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkha6ua7v78tlgboytyrm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkha6ua7v78tlgboytyrm.png" width="800" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Skip them, and components rest on quicksand. A “primary button” isn’t a stable concept if “primary” doesn’t resolve to a specific background, a specific radius, a specific spacing rhythm, or a specific shadow language. Without those, every primary button in the codebase is a small re-derivation. The next brand refresh will break every component that never agreed on what “primary” meant.&lt;/p&gt;

&lt;p&gt;Start with components, and you’ll rewrite them. Start with sub-atoms, and you’ll build on something that holds.&lt;/p&gt;

&lt;h3&gt;
  
  
  Name tokens by intent
&lt;/h3&gt;

&lt;p&gt;Raw palette names have nouns but no verbs: they describe a value, not an intent. I once found &lt;em&gt;blue300&lt;/em&gt; in our iOS codebase used as a text color, a button background, an icon tint, a tab indicator, and a border stroke, all in files written by different people. Every use was technically valid against the palette; none was chosen against a shared rule. It got worse: &lt;em&gt;neutral100&lt;/em&gt; had more than a hundred uses across the same codebase, serving as disabled backgrounds, card surfaces, separators, chip fills, and divider lines. &lt;em&gt;One gray, five intentions.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The fix is a semantic layer. We built a three-layer architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Core&lt;/strong&gt;: raw values. Eleven color families, with scales from &lt;em&gt;s0&lt;/em&gt; to &lt;em&gt;s900&lt;/em&gt;. Platform-agnostic, defined once, and referenced everywhere, with no product opinion baked in.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Semantic&lt;/strong&gt;: intent-based tokens in five categories (&lt;em&gt;text&lt;/em&gt;, &lt;em&gt;icon&lt;/em&gt;, &lt;em&gt;background&lt;/em&gt;, &lt;em&gt;border&lt;/em&gt;, &lt;em&gt;shimmer&lt;/em&gt;). Semantic tokens are named using patterns such as &lt;em&gt;text.brand&lt;/em&gt;, &lt;em&gt;background.warningDefault&lt;/em&gt;, &lt;em&gt;icon.danger&lt;/em&gt;. Each token resolves to different core values depending on context: light mode, dark mode, high contrast, brand variant.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Component&lt;/strong&gt;: names scoped to specific components. For example, &lt;em&gt;button.primaryBackground&lt;/em&gt; maps to a semantic background token.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The layers talk in one direction: component → semantic → core. The payoff is change isolation. Need to update a brand color? Adjust core; everything downstream follows. Need a high-contrast accessibility mode? Remap semantics. White-label the product? Swap the semantic resolution. None of it requires touching the components that consume the tokens.&lt;/p&gt;
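&lt;p&gt;To make the one-way lookup concrete, here is a minimal TypeScript sketch of how a component token could resolve through the layers. It isn’t Manyfest’s actual code: the table names, the &lt;em&gt;resolve&lt;/em&gt; helper, the token keys, and the hex values are all assumptions made up for illustration.&lt;/p&gt;

```typescript
// Hypothetical token tables: names and hex values are illustrative only.
type Mode = "light" | "dark";

interface ModeValues {
  light: string;
  dark: string;
}

// Core: raw values with no product opinion.
const core: { [name: string]: string } = {
  "blue.s500": "#2563eb",
  "blue.s300": "#93c5fd",
};

// Semantic: intent-based names that point at a core token per mode.
const semantic: { [name: string]: ModeValues } = {
  "background.brand": { light: "blue.s500", dark: "blue.s300" },
};

// Component: names scoped to one component, each pointing at a semantic token.
const component: { [name: string]: string } = {
  "button.primaryBackground": "background.brand",
};

// Resolve in one direction only: component -> semantic -> core.
function resolve(componentToken: string, mode: Mode): string {
  const semanticName = component[componentToken];
  if (semanticName === undefined) {
    throw new Error(`Unknown component token: ${componentToken}`);
  }
  const byMode = semantic[semanticName];
  if (byMode === undefined) {
    throw new Error(`Unknown semantic token: ${semanticName}`);
  }
  const value = core[byMode[mode]];
  if (value === undefined) {
    throw new Error(`Unknown core token: ${byMode[mode]}`);
  }
  return value;
}

console.log(resolve("button.primaryBackground", "light")); // "#2563eb"
console.log(resolve("button.primaryBackground", "dark")); // "#93c5fd"
```

&lt;p&gt;In this sketch, a brand refresh would only touch the &lt;em&gt;core&lt;/em&gt; table and a new mode would only touch &lt;em&gt;semantic&lt;/em&gt;; the component entry stays the same either way.&lt;/p&gt;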

&lt;p&gt;The categories themselves do enforcement work. If a token lives under &lt;em&gt;text&lt;/em&gt;, it can only be used for text. If an engineer reaches for &lt;em&gt;background.brand&lt;/em&gt; to color an icon, the name itself signals the mistake before any linter, review, or designer catches it. Naming by intent turns the taxonomy into a guardrail.&lt;/p&gt;
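&lt;p&gt;On platforms with a type system, the same guardrail can also be pushed into the compiler. The TypeScript sketch below is one possible way to do it, not how Manyfest is implemented: the helper functions and their return strings are hypothetical, and only the category prefixes come from the text.&lt;/p&gt;

```typescript
// Hypothetical category types built from the token prefixes.
type TextToken = `text.${string}`;
type IconToken = `icon.${string}`;
type BackgroundToken = `background.${string}`;

// Each styling helper accepts only tokens from its own category.
function textColor(token: TextToken): string {
  return `color:${token}`;
}

function iconTint(token: IconToken): string {
  return `tint:${token}`;
}

function backgroundFill(token: BackgroundToken): string {
  return `fill:${token}`;
}

console.log(textColor("text.brand"));
console.log(iconTint("icon.danger"));
console.log(backgroundFill("background.brand"));

// iconTint("background.brand") would be rejected by the TypeScript compiler,
// because "background.brand" does not match the `icon.${string}` pattern.
```

&lt;p&gt;The types only check the prefix, so this catches cross-category misuse but not a misspelled token name; a generated list of exact names would be stricter.&lt;/p&gt;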

&lt;h3&gt;
  
  
  Start where it hurts
&lt;/h3&gt;

&lt;p&gt;Don’t try to define everything at once. For us, the categories that hurt most were color and spacing — the most visible sources of inconsistency.&lt;/p&gt;

&lt;p&gt;Your starting point might be different: typography, elevation, something product-specific. It doesn’t matter. A small system that works beats a large one that’s half-finished.&lt;/p&gt;

&lt;h3&gt;
  
  
  Build with AI in mind, and expect everything to change
&lt;/h3&gt;

&lt;p&gt;By the time you read this, half of this is already different.&lt;br&gt;&lt;br&gt;
AI changes two things at once: the tools you use to build a design system, and the requirements the system itself has to meet. Structure and semantic clarity, everything we covered above, matter more when AI is in the loop. An AI agent reads your system more literally than a person does. Where there are gaps, a person fills them with intuition; an agent fills them with whatever fits the pattern, including the flawed ones.&lt;/p&gt;

&lt;p&gt;For months we’d been building the first version of the design system the classic way: multi-week negotiations on what subtle text should mean, designer-engineer back-and-forth on dark-mode pairings, reviewed screen by screen. That process gave us the right foundation — semantic layers, tokens, atoms, a shared language. But the structure wasn’t designed for AI to read. So we restarted from scratch to make it AI-driven.&lt;/p&gt;

&lt;p&gt;What we ended up building is close to what the 2026 design-systems community has started calling an &lt;em&gt;agentic design system&lt;/em&gt;. Thanks to AI, in just eight days we could encode the foundation we already had as machine-readable infrastructure rather than static, human-oriented documentation. The result is the &lt;strong&gt;Manyfest Design System&lt;/strong&gt;: one Figma file for all platforms (web, iOS, Android), built to be read by both humans and AI agents.&lt;/p&gt;

&lt;p&gt;Just as the classic process taught us how to lay the foundation, the AI-first rebuild taught us what that foundation needs to support.&lt;/p&gt;

&lt;h4&gt;
  
  
  One: design intent has to be explicit
&lt;/h4&gt;

&lt;p&gt;Designers and engineers bring years of product context to every decision. AI doesn’t, at least not yet. So the token structure has to carry that context instead.&lt;br&gt;&lt;br&gt;
Our schema follows the &lt;a href="https://www.designtokens.org/" rel="noopener noreferrer"&gt;W3C Design Tokens Community Group&lt;/a&gt; format, which has become the industry baseline. The format provides the &lt;em&gt;what&lt;/em&gt;: value, type, and description. We added an &lt;em&gt;intent&lt;/em&gt; block to capture the &lt;em&gt;why&lt;/em&gt;. This is what the community now calls &lt;strong&gt;token intent metadata&lt;/strong&gt;: the structured rules about usage and pairings that turn a token from a simple hex string into something an AI agent can actually reason over.&lt;/p&gt;

&lt;p&gt;So in Manyfest, every token carries metadata about &lt;em&gt;why.&lt;/em&gt; It’s a hex string plus the role it plays, the surfaces it’s allowed on, and the contrast guarantees it carries.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;illustrative&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;—&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;actual&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;schema&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;is&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;still&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;being&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;formalized&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"color.background.warning"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"$value"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"{color.core.yellow.500}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"$type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"color"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"$description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Surface tone for non-blocking warning states"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"$intent"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"useFor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"banners"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"inline alerts"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"form-field warnings"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"doNotUseFor"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"error states"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"destructive actions"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"icon foregrounds"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"pairsWith"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"text.warning"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"icon.warning"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"border.warning"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The metadata isn’t decoration. It’s the difference between an AI agent getting the token right by guessing, and getting it right because the token told it.&lt;/p&gt;
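&lt;p&gt;As a rough idea of how an agent or a lint step might consume that metadata, here is a small TypeScript sketch. It reuses the illustrative token from the JSON above; the &lt;em&gt;checkUsage&lt;/em&gt; function, the &lt;em&gt;Verdict&lt;/em&gt; values, and the interface names are assumptions for the example, not part of our real schema or of the Design Tokens format.&lt;/p&gt;

```typescript
// Hypothetical types mirroring the illustrative $intent block.
interface TokenIntent {
  useFor: string[];
  doNotUseFor: string[];
  pairsWith: string[];
}

interface Token {
  $value: string;
  $type: string;
  $description: string;
  $intent: TokenIntent;
}

const tokens: { [name: string]: Token } = {
  "color.background.warning": {
    $value: "{color.core.yellow.500}",
    $type: "color",
    $description: "Surface tone for non-blocking warning states",
    $intent: {
      useFor: ["banners", "inline alerts", "form-field warnings"],
      doNotUseFor: ["error states", "destructive actions", "icon foregrounds"],
      pairsWith: ["text.warning", "icon.warning", "border.warning"],
    },
  },
};

type Verdict = "allowed" | "forbidden" | "unspecified";

// Decide from the metadata alone whether a token fits a given usage.
// Explicit prohibitions win over explicit permissions.
function checkUsage(tokenName: string, usage: string): Verdict {
  const token = tokens[tokenName];
  if (token === undefined) {
    throw new Error(`Unknown token: ${tokenName}`);
  }
  if (token.$intent.doNotUseFor.indexOf(usage) !== -1) return "forbidden";
  if (token.$intent.useFor.indexOf(usage) !== -1) return "allowed";
  return "unspecified";
}

console.log(checkUsage("color.background.warning", "banners")); // "allowed"
console.log(checkUsage("color.background.warning", "error states")); // "forbidden"
console.log(checkUsage("color.background.warning", "tooltips")); // "unspecified"
```

&lt;p&gt;The "unspecified" verdict is the interesting one: it marks exactly the gap where an agent would otherwise guess, so it is a natural place to ask a human instead.&lt;/p&gt;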

&lt;h4&gt;
  
  
  Two: reviews stop being only between humans
&lt;/h4&gt;

&lt;p&gt;Designers and engineers used to review each other in PRs and Figma threads. Now there’s a third reader.&lt;/p&gt;

&lt;p&gt;We have a shared skill (&lt;a href="https://gist.github.com/thanhdolong/229d2fa226b45e878571ec0f1e35ba62" rel="noopener noreferrer"&gt;figma-component-review&lt;/a&gt;) that takes a Figma URL, parses the file, pulls the design variables and the component context, scans the matching package code in the design-system repo, and writes back a categorized list of questions: &lt;em&gt;for the designer&lt;/em&gt; (intent ambiguity, missing variants, accessibility gaps) and &lt;em&gt;for the engineer&lt;/em&gt; (token mismatch, naming drift, component reuse opportunities). The questions land as comments on the Figma node, where the designer is already working.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1l5lx63hv2ogtwi3y1ik.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1l5lx63hv2ogtwi3y1ik.png" width="800" height="741"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The point isn’t to replace the design review. It’s to surface the same questions earlier — the ones both sides would otherwise hit three weeks later, in a PR. That’s only possible because the system has structure: semantic tokens, named intent, atomic hierarchy. Give the skill a flat palette and it has nothing to compare against.&lt;/p&gt;

&lt;h4&gt;
  
  
  Three: AI catches mistakes humans miss
&lt;/h4&gt;

&lt;p&gt;Not by writing code. By reading it.&lt;/p&gt;

&lt;p&gt;We have AI in the loop on every PR: descriptions auto-written from the diff in the template, Conventional Commit titles normalized, contextual labels applied (accessibility, performance, migration), a split gate that blocks merges over 1,000 lines. It takes a whole category of small mistakes off the team’s plate.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd4se68t1wcwx3e9s3wtl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd4se68t1wcwx3e9s3wtl.png" width="800" height="738"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;An example of a PR description filled in by AI from the template we give it. You can find the full version here.&lt;/em&gt;&lt;/p&gt;
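&lt;p&gt;The split gate is the simplest of these checks to picture. Below is a minimal TypeScript sketch of the idea, under a few assumptions: the &lt;em&gt;FileChange&lt;/em&gt; shape is hypothetical rather than a real CI API, and counting additions plus deletions is our guess at a reasonable definition of “lines” for the example.&lt;/p&gt;

```typescript
// Hypothetical diff summary: field names are assumptions, not a real CI API.
interface FileChange {
  path: string;
  additions: number;
  deletions: number;
}

// Threshold taken from the text: merges over 1,000 lines are blocked.
const MAX_CHANGED_LINES = 1000;

// Assumption: a "line" is any added or deleted line in the diff.
function changedLines(files: FileChange[]): number {
  return files.reduce((sum, f) => sum + f.additions + f.deletions, 0);
}

// Returns true when the PR should be split before it can merge.
function mustSplit(files: FileChange[]): boolean {
  return changedLines(files) > MAX_CHANGED_LINES;
}

const smallPr: FileChange[] = [
  { path: "Button.swift", additions: 120, deletions: 30 },
];
const largePr: FileChange[] = [
  { path: "Tokens.swift", additions: 800, deletions: 150 },
  { path: "Button.swift", additions: 60, deletions: 10 },
];

console.log(mustSplit(smallPr)); // false (150 changed lines)
console.log(mustSplit(largePr)); // true (1020 changed lines)
```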

&lt;p&gt;The real value shows up in smaller moments. While setting up Manyfest’s iOS skill — the file that tells AI agents how to scaffold a new component — an AI reviewer caught three mistakes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a cyclic dependency I’d introduced in a preview helper;&lt;/li&gt;
&lt;li&gt;a scope error that made an internal helper accidentally public;&lt;/li&gt;
&lt;li&gt;three doc references pointing to a file I’d renamed two PRs earlier.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;AI wasn’t writing my code, it was catching my mistakes — and that’s the version of AI-driven engineering I trust.&lt;/p&gt;

&lt;h4&gt;
  
  
  Four: don’t let AI decide, let it accelerate
&lt;/h4&gt;

&lt;p&gt;AI isn’t magic, at least not at the moment I’m writing this. It hallucinates with confidence. It invents APIs that don’t exist and writes code that compiles but quietly does something else. Check the screenshot below: at one point our skill description randomly slipped into another language for no apparent reason. Because why not, I guess.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiq2vuornyl21fxfunnxs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiq2vuornyl21fxfunnxs.png" width="800" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Treat it like a senior teammate — sharp, fast, sometimes too confident. Brief it carefully. Read its work. Push back when it’s wrong. It implements faster than you can but it doesn’t decide better than you can. The judgment stays yours.&lt;/p&gt;

&lt;h3&gt;
  
  
  The receipts
&lt;/h3&gt;

&lt;p&gt;Eight working days: that’s how long it took to ship the AI-driven version of the design system. The classic version, the one we restarted from, had taken months. But the eight days were only possible because we had already spent those months, without AI, finding the right approach.&lt;/p&gt;

&lt;p&gt;The system isn’t done yet. What’s done is the foundation: Manyfest in Figma is the source of truth, and everything defined there translates automatically into the format each platform speaks — twenty-three components, ready to wire in on every change.&lt;/p&gt;

&lt;p&gt;Building a design system that AI can read pays off immediately: a designer can start prototyping by dropping a Figma file into Claude with the &lt;a href="https://help.figma.com/hc/en-us/articles/32132100833559" rel="noopener noreferrer"&gt;Figma MCP&lt;/a&gt; installed. What used to take a sprint takes a day.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Chapter 2 is coming next with more on building an AI-driven design system and the skills we developed along the way.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>design</category>
      <category>iosappdevelopment</category>
      <category>softwaredevelopment</category>
      <category>mobileappdevelopment</category>
    </item>
    <item>
      <title>How we hire for infrastructure at Manychat: from first call to offer in 2 weeks</title>
      <dc:creator>Manychat Engineering</dc:creator>
      <pubDate>Thu, 04 Jun 2026 09:44:19 +0000</pubDate>
      <link>https://dev.to/manychat/how-we-hire-for-infrastructure-at-manychat-from-first-call-to-offer-in-2-weeks-4bpe</link>
      <guid>https://dev.to/manychat/how-we-hire-for-infrastructure-at-manychat-from-first-call-to-offer-in-2-weeks-4bpe</guid>
      <description>&lt;h3&gt;
  
  
  How we hire for infrastructure at Manychat: from first call to offer in 10 days
&lt;/h3&gt;

&lt;h4&gt;
  
  
  &lt;em&gt;Most hiring drags on because teams have only a vague idea of who they’re actually looking for. We decided to fix that first.&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6z0dccof98ioahix2ly2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6z0dccof98ioahix2ly2.png" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The reason most hiring takes months has nothing to do with the number of stages. It’s that there’s no clear definition of the target. So companies compensate: they collect more CVs, add more interviews, compare candidates to each other — and still struggle to make a call.&lt;/p&gt;

&lt;p&gt;We decided to remove that guesswork. Instead of growing the candidate pipeline, we built a process where a strong candidate can be identified and hired immediately.&lt;/p&gt;

&lt;p&gt;I’m &lt;a href="https://www.linkedin.com/in/dmitry-yackevich-72648244/" rel="noopener noreferrer"&gt;Dmitry&lt;/a&gt;, Head of Infrastructure at Manychat, and in this post I’ll walk through how we hire senior infrastructure engineers in ten days — by knowing exactly who we’re looking for and what we’re evaluating them against.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Know what you’re looking for&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;“Knowing who you’re looking for” isn’t about years of experience or a tech stack. It’s about understanding what this person will actually own. What problems will they solve? What level of autonomy is expected? What does good look like in six months?&lt;/p&gt;

&lt;p&gt;Without answers to those questions, every candidate looks “somewhat relevant.” With them, the wrong ones are easy to spot.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Score candidates against the target, not against each other&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When the previous point fails, the usual practice kicks in: companies build a large candidate pool and pick the best one from it. Candidates get compared to each other, not against a defined set of expectations. This creates delays — when a strong candidate comes in, you can’t commit immediately, because you need to see three to five more first.&lt;/p&gt;

&lt;p&gt;We don’t do that. When the right candidate shows up, we say yes.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Filter hard at the top, move fast at the bottom&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Knowing exactly who we want means we can filter aggressively at the top of the funnel. Here’s how it worked for our Senior SRE search.&lt;/p&gt;

&lt;p&gt;We looked at two things on a resume. First, scale — large user bases, high-traffic systems, complex migrations, strict reliability requirements, or cloud/platform constraints. That experience can come from large companies, fast-growing startups, open-source platforms, or other high-scale environments.&lt;/p&gt;

&lt;p&gt;Second, adaptability. The stack matters, but it’s not the whole picture. For a Senior SRE, we were looking for a generalist, not a specialist. Technologies change — what’s dominant today won’t be in two years. A CV that shows someone has switched technologies multiple times — started on bash, moved to Ansible, then picked up Kubernetes and AWS — is a reliable signal they can keep growing.&lt;/p&gt;

&lt;p&gt;Out of 150 applications per hiring cycle (roughly 1,000 per quarter in total), around a dozen make it to the TA Screening Call. Five to seven reach the technical interview. Two get to the hiring manager interview. At least one gets the offer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F921cdb3gg3bqrb0vfbcm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F921cdb3gg3bqrb0vfbcm.png" width="800" height="295"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Hiring cycle&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Clear evaluation criteria during interview stages&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This is the key. Before any interview, we build not just question lists but scorecards: what good and bad answers look like. The scale is simple: needs improvement, meets expectations, exceeds expectations, and red flags. We place the candidate on it. That keeps evaluations consistent and removes cognitive load from the team. We write scorecards for both hard skills and soft skills.&lt;/p&gt;

&lt;p&gt;For the &lt;strong&gt;technical interview,&lt;/strong&gt; we have six topics: AWS, Kubernetes, Terraform, CI/CD, Observability, Security.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/media/84ee690a1e103580f0a685297bec3523/href" rel="noopener noreferrer"&gt;https://medium.com/media/84ee690a1e103580f0a685297bec3523/href&lt;/a&gt;&lt;br&gt;
&lt;a href="https://medium.com/media/380a79fab054f48dbe6b89e6cedca914/href" rel="noopener noreferrer"&gt;https://medium.com/media/380a79fab054f48dbe6b89e6cedca914/href&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;None of these topics involves live coding or similar exercises: they’re easily bypassed with AI, and we don’t see the point. We just talk. The candidate meets the whole team at once; we’re a small enough team that everyone joins the call.&lt;/p&gt;

&lt;p&gt;Scorecards get filled out by each member of the team independently, on the day of the interview. If there are significant disagreements, we do a debrief — sometimes that means revisiting the criteria to arrive at a shared read on the candidate. Usually by the next day we know whether they’re moving forward. If the team says no, the process ends there. No wild cards from me.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;hiring manager interview&lt;/strong&gt; works the same way — scorecards again. My job is to assess engineering maturity: what problems they’ve actually solved, what they’ve led, at what scale. I focus on four things: autonomy, data mindset, incident troubleshooting, project ownership.&lt;/p&gt;

&lt;p&gt;To assess autonomy, I ask: &lt;em&gt;Tell me about the last improvement you initiated without being asked. What was the problem, what did you do, and what was the result?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;If there’s something to tell, I go deeper: &lt;em&gt;Which metrics changed? Who pushed back? Specific dates, tools, results?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here’s how the scoring looks for this dimension:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/media/169599f31e4a417e7397cc79ca05ddfc/href" rel="noopener noreferrer"&gt;https://medium.com/media/169599f31e4a417e7397cc79ca05ddfc/href&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On our scale: “I upgraded Kubernetes” — needs improvement. “I migrated the whole company to a new stack” — meets expectations. “I led a cross-functional initiative — breaking up a monolith, moving between architectures, technically and organizationally complex” — exceeds expectations.&lt;/p&gt;

&lt;p&gt;After this interview I know the candidate’s leadership level: what they can take on, how large a piece I can hand them. Sometimes the answer is the opposite — strong technically, but not ready for autonomous work, or missing the product mindset entirely.&lt;/p&gt;

&lt;p&gt;The result of tech and leadership interviews is a filled-in candidate profile. For example, a candidate knows AWS well, has a gap in Kubernetes, but is solid on observability, and has led complex projects. From that I can tell whether they’re a fit for the team and whether they can carry what I need them to carry. The decision becomes easy.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What about the stages and timeline?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Application review takes up to two weeks but we push to move faster.&lt;/p&gt;

&lt;p&gt;The hardest part logistically is getting the whole team together for the technical interview. Fitting the screening and the technical into one week rarely happens, but that’s what we aim for.&lt;br&gt;&lt;br&gt;
After the technical, I get on a call with the candidate within two days — already with the team’s feedback in hand. The final decision takes another day or two at most.&lt;/p&gt;

&lt;p&gt;Total: two weeks on average from first call to offer.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5damhqhoycifkdbzfyl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp5damhqhoycifkdbzfyl.png" width="797" height="109"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If this sounds like your kind of hiring process and you’re an &lt;a href="https://job-boards.greenhouse.io/manychat/jobs/8077934002?gh_jid=8077934002" rel="noopener noreferrer"&gt;experienced SRE engineer&lt;/a&gt;, follow me on &lt;a href="https://www.linkedin.com/in/dmitry-yackevich-72648244/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt; to stay in the loop on new openings. They are coming.&lt;/p&gt;




</description>
      <category>hiring</category>
      <category>hr</category>
      <category>hiringprocess</category>
      <category>process</category>
    </item>
    <item>
      <title>No sprints for a quarter: an experiment in a Scrum-culture company</title>
      <dc:creator>Manychat Engineering</dc:creator>
      <pubDate>Thu, 14 May 2026 09:42:10 +0000</pubDate>
      <link>https://dev.to/manychat/no-sprints-for-a-quarter-an-experiment-in-a-scrum-culture-company-5755</link>
      <guid>https://dev.to/manychat/no-sprints-for-a-quarter-an-experiment-in-a-scrum-culture-company-5755</guid>
      <description>&lt;h4&gt;
  
  
  How we shipped a core product rebuild in one quarter without a single sprint.
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9hnw9m8wownnu2wla7kx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9hnw9m8wownnu2wla7kx.png" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the end of last year, we ran a double experiment: we brought together a team from different units and traded our beloved Scrum for Kanban. The goal was to rethink the architecture of one of our core products so it could grow and scale faster. One constraint: users shouldn’t notice anything had changed. Metrics had to hold.&lt;/p&gt;

&lt;p&gt;I’m &lt;a href="http://www.linkedin.com/in/dmitry-mudryak-23a2491a2" rel="noopener noreferrer"&gt;Dmitry&lt;/a&gt;, Frontend Area Lead at Manychat. Here’s what happened — spoiler: it worked, and we’re doing it again — and when it might be worth trying for you, especially if you’re staring down a project full of unknowns.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Where the problem came from&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Manychat is a marketing automation platform for messengers. One of our core features is the Flow Builder: a canvas where you drag, connect, and configure automation sequences. It’s a powerful tool, but not the friendliest for beginners.&lt;/p&gt;

&lt;p&gt;So we built EasyBuilder — a simplified interface with pre-packaged solutions. No canvas, just pick a scenario, walk through a few steps, configure in a couple of clicks. Done.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8se0zwwfs33aoepel8il.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8se0zwwfs33aoepel8il.png" width="800" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Experiments showed good results. Adoption grew. Then we hit the ceiling. It became clear that to keep growing, the product needed a fundamental rethink.&lt;/p&gt;

&lt;p&gt;The problem was structural: EasyBuilder (front) and the Flow Builder (back) existed as two parallel systems. The frontend had no knowledge of the flow structure: it just collected form values and sent them to the backend, which assembled the automation from a rigid template. Every new scenario meant starting from scratch and hardcoding it manually.&lt;/p&gt;

&lt;p&gt;Rebuilding EasyBuilder wasn’t just a technical challenge. It also came with an organizational one. At Manychat, we have the following setup: a core team maintains the platform, a growth team ships features on top. One builds the foundation, the other builds on it. This is an intentional structure.&lt;/p&gt;

&lt;p&gt;The catch was knowledge transfer. Whichever team built the new foundation first, the other would eventually need to pick it up — with zero context. Building that bridge from scratch would have been expensive and slow.&lt;/p&gt;

&lt;p&gt;Two challenges arrived together: technical and organizational. They needed a single answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tiger team
&lt;/h3&gt;

&lt;p&gt;The obvious move was to spin up a cross-functional team and hand them the task. We figured out pretty quickly that wouldn’t work. Every team still had its own backlog, its own priorities. Work on EasyBuilder would have dragged on. And we needed this done in a quarter max.&lt;/p&gt;

&lt;p&gt;What we actually needed:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full isolation from ongoing area priorities — a team thinking about exactly one thing&lt;/li&gt;
&lt;li&gt;Not a standard headcount (two backend, two or three frontend, and a designer would be too many), but the specific people for the specific job&lt;/li&gt;
&lt;li&gt;No onboarding budget: the task required people who already knew the context&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The idea was to build a standalone team of people pulled from their regular units for the project, who would return when it was done and spread the expertise across their teams.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9u9mx4rq9u5jakaf0o0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fu9u9mx4rq9u5jakaf0o0.png" width="800" height="800"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;These are the areas that donated their teammates for the project.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;That became the Tiger team. Yes, same as for NASA’s Apollo 13. We aimed high🙂.&lt;/p&gt;

&lt;p&gt;The team: two frontend engineers, one backend engineer, a staff engineer from the infra team, and me, also a frontend engineer but acting as tech process lead throughout the project. A PM and designer synced with us as needed rather than full-time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Kanban instead of Scrum
&lt;/h3&gt;

&lt;p&gt;The team was only half the experiment. The other half was the process.&lt;/p&gt;

&lt;p&gt;Manychat runs on &lt;a href="https://medium.com/manychat-engineering/quality-approaches-when-developing-a-product-how-we-avoid-engineering-struggles-35330a7e18d1" rel="noopener noreferrer"&gt;Scaled Scrum&lt;/a&gt;. Teams share a unified backlog, live by sprints, retrospectives, PBRs, and the rest of the Scrum ritual calendar. Large tasks get broken into small, predictable pieces so teams can move steadily and keep metrics stable. The principle: a team enters a sprint with minimal uncertainty, because you have to deliver within a fixed window.&lt;/p&gt;

&lt;p&gt;Our task was the opposite of that. We didn’t know how much was actually in there. It was a Pandora’s box: you open it and you don’t know what you’ll find. Scrum would have forced us to package that uncertainty into sprints — break it into tasks, estimate, commit to a fixed cadence. Any estimate would have been a guess, and we’d have burned energy planning things we didn’t yet understand. For example, the first two weeks were research only. We had no concrete tasks to put in a sprint. Scrum wouldn’t have worked at all.&lt;/p&gt;

&lt;p&gt;So we went with Kanban: no sprints, but a roadmap instead. We moved through whole epics from start to finish and reprioritized when the task demanded it. For a company where Scrum is the basic framework, this was an experiment inside an experiment.&lt;/p&gt;

&lt;h3&gt;
  
  
  Tiger team in action
&lt;/h3&gt;

&lt;p&gt;The first two weeks looked like a startup. Every day we were in the office, in meeting rooms, brainstorming and challenging each other’s architectural proposals. The scope of the task was opening up gradually, and we needed to understand it before we could move.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdsowons2ne2sdznv7kei.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdsowons2ne2sdznv7kei.png" width="800" height="439"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;A fragment of our work roadmap from October–November.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;When we had a foundation, we started moving through the roadmap. Whole epics at a time. When something unexpected surfaced, we dealt with it on the spot and kept going.&lt;/p&gt;

&lt;p&gt;The PM and designer weren’t in the room full-time — they joined weekly syncs and worked asynchronously: we’d hand off a task, they’d discuss it, come back with a decision and a design. That worked fine for most of the project. Where we hit friction was when a product decision was blocking a technical one, and the PM wasn’t available. Some sizable tasks stalled in the queue because we couldn’t move without a call.&lt;/p&gt;

&lt;p&gt;In the end, 90% of the project was built in this mode. The remaining 10% — final fixes and polish — was handed off to a product team, who wrapped it up within a single sprint.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What we actually built&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Instead of writing custom code for every new scenario, we created a meta-language on top of the Flow Builder that lets you assemble forms and configure scenarios automatically. Unlike EasyBuilder, where the frontend and backend operated as two separate systems bound by a rigid, established contract, Quick Automations (as we named it) works with a single Flow Builder entity end-to-end.&lt;/p&gt;

&lt;p&gt;Before, adding a new scenario meant a developer hardcoding it from scratch. Now, a developer isn’t needed at all. A product manager can come in, configure a scenario themselves — and a ready-made form appears for the user. No developer involvement, no new sprint per configuration.&lt;/p&gt;

&lt;p&gt;This is what creating a new automation scenario in Flow Builder looks like for a product manager. On the right is the result the user will see.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8la517mpdunql69f3da0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8la517mpdunql69f3da0.png" width="799" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And this is the user interface, the same as it was for EasyBuilder 1.0.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8y4rmwp7c1bcd0hkdn2b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8y4rmwp7c1bcd0hkdn2b.png" width="799" height="491"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Knowledge transfer in practice
&lt;/h3&gt;

&lt;p&gt;One of the expected benefits of this setup was knowledge transfer. The people who built the product went back to their permanent teams — taking everything they learned with them. That made it easier to land the solution in any area and keep improving it in parallel.&lt;/p&gt;

&lt;p&gt;They ran workshops for engineers — walked through how the new system works, helped colleagues tackle tasks in ways that built understanding rather than just output. They ran sessions for product teams on how to configure scenarios and build new automations without involving developers.&lt;/p&gt;

&lt;p&gt;That was the goal: teams who inherited the product can move independently — add new configurations, run experiments. The tiger team is gone, but what it built lives in the product and in the people.&lt;/p&gt;

&lt;h3&gt;
  
  
  When to use this
&lt;/h3&gt;

&lt;p&gt;A temporary project team isn’t a universal tool. Neither is Scrum or Kanban. Before reaching for either, it’s worth asking: is the process serving the work, or is the work serving the process?&lt;/p&gt;

&lt;p&gt;This project helped us answer that question. Here’s what we learned about when this format makes sense — and when it doesn’t.&lt;/p&gt;

&lt;p&gt;It is definitely worth trying when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;The task is one-time, but strategically critical.&lt;/strong&gt; Not your regular product flow, but something that needs to be done once and done right — laying a foundation that features will build on.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The scope is unpredictable.&lt;/strong&gt; If you don’t know how much is actually in there, sprints will create pressure where you need flexibility.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The task doesn’t fit the standard process.&lt;/strong&gt; Especially if a large task can’t be broken into sprints without losing its meaning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;The people you need are spread across teams.&lt;/strong&gt; If the expertise is distributed across different units, and those people already know the context — no onboarding required.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;And it may be less effective if:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The task is small and fits cleanly into a regular sprint. No need to build a separate structure.&lt;/li&gt;
&lt;li&gt;The company isn’t willing to temporarily release people from their teams. Without that, a tiger team won’t fly.&lt;/li&gt;
&lt;li&gt;There’s no clear endpoint. The team risks becoming permanent — which is a different story entirely.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In our case, the experiment fully worked. We solved the problem, app metrics held, and we’re doing it again.&lt;/p&gt;




</description>
      <category>softwareengineering</category>
      <category>process</category>
      <category>softwaredevelopment</category>
      <category>kanban</category>
    </item>
    <item>
      <title>Frontend Performance Patterns to speed up your Web App</title>
      <dc:creator>Manychat Engineering</dc:creator>
      <pubDate>Mon, 04 May 2026 10:53:16 +0000</pubDate>
      <link>https://dev.to/manychat/frontend-performance-patterns-to-speed-up-your-web-app-4d0i</link>
      <guid>https://dev.to/manychat/frontend-performance-patterns-to-speed-up-your-web-app-4d0i</guid>
      <description>&lt;h2&gt;
  
  
  How intentional loading decisions keep your app fast at scale.
&lt;/h2&gt;

&lt;p&gt;Frontend performance is not a late-stage cleanup task. It’s not tech debt. It’s a set of decisions we make every day while we code — what we load, when we load it, and how we render it. The answer depends on the importance of the code, its size, and when the user actually needs it.&lt;/p&gt;

&lt;p&gt;Get that wrong, and the browser pays for everything upfront — bytes, main thread, network — whether the user ever sees it or not.&lt;/p&gt;

&lt;p&gt;I’m &lt;a href="https://www.linkedin.com/in/lianansan/" rel="noopener noreferrer"&gt;Liana&lt;/a&gt;, a frontend engineer at &lt;a href="https://careers.manychat.com/team/engineering" rel="noopener noreferrer"&gt;Manychat&lt;/a&gt;. In this article we cover five patterns we use to get this right — four import strategies and compression — and how we measure whether it’s working.&lt;/p&gt;

&lt;h2&gt;
  
  
  Import Patterns
&lt;/h2&gt;

&lt;p&gt;Manychat uses 24 import patterns, grouped into 7 categories, from static and dynamic to type, asset, style, re-exports, and legacy tooling. In practice, two dominate: static imports at 99.6% of the codebase, dynamic imports at 0.4%.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pattern 1 — Static Imports
&lt;/h4&gt;

&lt;p&gt;Static imports are for code required for the first meaningful paint: layout, router, core UI, everything that must exist before the user sees anything. If it’s there on first load, it’s a static import.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbbcbfdpc2a7gmfb0y9r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgbbcbfdpc2a7gmfb0y9r.png" alt=" " width="492" height="510"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Look carefully at how the static imports are ordered: Node.js plugins go at the top, because that’s how the formatter sorts them, then React core, then external packages, then dependencies, modules, and other packages.&lt;/p&gt;

&lt;h4&gt;
  
  
  Pattern 2 — Dynamic imports
&lt;/h4&gt;

&lt;p&gt;0.4% of the codebase. Small number, high impact.&lt;br&gt;
You can see how they differ from static imports in the code: you need a separate file that defines the path and what gets imported lazily:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvmd68ij87k44jfoygsn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjvmd68ij87k44jfoygsn.png" alt=" " width="799" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where is it in the UI? Anywhere there’s a very heavy screen. Each route becomes a separate chunk that is loaded only on navigation.&lt;/p&gt;

&lt;p&gt;A good example is the Flow Player page that you can access by creating an automation, sharing it from the CMS list, and handing the link to someone outside Manychat. It’s heavy. There’s no reason to pay for it on app load.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft7vckgcowng3k92iorhn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft7vckgcowng3k92iorhn.png" alt=" " width="800" height="526"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With a dynamic import, you don’t pay for code until the user goes there.&lt;/p&gt;
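&lt;p&gt;As a rough sketch of the idea (not Manychat’s actual code), route-level splitting boils down to a table of loader functions that run only on navigation. The route paths, page names, and the &lt;code&gt;navigate&lt;/code&gt; helper below are hypothetical. In a real bundle each stubbed loader would be a dynamic &lt;code&gt;import()&lt;/code&gt; call, which bundlers typically turn into a separate chunk.&lt;/p&gt;

```typescript
// Records which page chunks were requested, so the lazy behaviour is observable.
const loadLog: string[] = [];

// Hypothetical route table. In a real app each loader would look like
// `() => import("./pages/FlowPlayerPage")`; the stubs keep this sketch self-contained.
const routes = {
  "/dashboard": async () => {
    loadLog.push("dashboard");
    return { default: "DashboardPage" };
  },
  "/flow-player": async () => {
    loadLog.push("flow-player");
    return { default: "FlowPlayerPage" };
  },
};

type RoutePath = keyof typeof routes;

// Nothing is loaded at startup: a loader runs only when the user navigates to its route.
function navigate(path: RoutePath) {
  return routes[path]();
}

// Usage: only the Flow Player chunk is requested; the dashboard loader never runs.
navigate("/flow-player").then((page) => console.log("render", page.default));
```

&lt;p&gt;The design choice is that the cost of a heavy page is attached to the act of navigating to it, so users who never open that page never download it.&lt;/p&gt;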

&lt;h4&gt;
  
  
  Pattern 3 — Import on Interaction
&lt;/h4&gt;

&lt;p&gt;This is used for optional UI — modals, popovers, or similar things triggered by the user.&lt;/p&gt;

&lt;p&gt;We actually don’t use it in our codebase, but we use something very similar: import on render, which is lazy-loaded on mount, not on interaction. You can see this in our modals — all of them work exactly the same way. Why? Because our modals are very lightweight, and there’s no need for import-on-interaction specifically. All our modals render immediately inside Suspense — they just load their chunks lazily, avoiding the cost of features many users never open.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F937w5b4d32iy1zw7s42i.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F937w5b4d32iy1zw7s42i.png" alt=" " width="800" height="561"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Pattern 4 — Import on Visibility
&lt;/h4&gt;

&lt;p&gt;The component loads when it enters the viewport. This avoids competing with the initial render and reduces the chunk size.&lt;/p&gt;

&lt;p&gt;A good example is infinite scroll in TikTok and Instagram automation. When you want to pick a post or reel, a modal opens — and if you have a lot of them, you get an infinite scroll. It avoids loading chunks the user hasn’t scrolled to yet. We have a reusable sentinel component that handles this across the app.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vrs423ivsf47sxpv1ln.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1vrs423ivsf47sxpv1ln.png" alt=" " width="800" height="634"&gt;&lt;/a&gt;&lt;/p&gt;
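&lt;p&gt;The sentinel idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the reusable component itself: &lt;code&gt;createSentinelHandler&lt;/code&gt; and &lt;code&gt;VisibilityEntry&lt;/code&gt; are hypothetical names, and the handler is written against a minimal entry shape so that it runs without a browser. In the browser it would be passed as the callback to the standard &lt;code&gt;IntersectionObserver&lt;/code&gt; constructor, observing an empty element placed after the last item in the list.&lt;/p&gt;

```typescript
// Minimal shape of what the browser's IntersectionObserver reports per observed element.
interface VisibilityEntry {
  isIntersecting: boolean;
}

// Hypothetical helper: builds the observer callback that requests the next page
// whenever the sentinel (an empty element after the last item) becomes visible.
function createSentinelHandler(loadNextPage: () => void) {
  return (entries: VisibilityEntry[]) => {
    if (entries.some((entry) => entry.isIntersecting)) {
      loadNextPage();
    }
  };
}

// Stand-in for fetching the next batch of posts or reels.
let pagesRequested = 0;
const onSentinelChange = createSentinelHandler(() => {
  pagesRequested += 1;
});

// In the browser this would be wired up roughly as:
//   new IntersectionObserver(onSentinelChange).observe(sentinelElement);
onSentinelChange([{ isIntersecting: false }]); // sentinel still off-screen: nothing loads
onSentinelChange([{ isIntersecting: true }]); // sentinel scrolled into view: load one page
console.log(pagesRequested); // prints 1
```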

&lt;p&gt;&lt;strong&gt;Why not just one import pattern?&lt;/strong&gt;&lt;br&gt;
One pattern doesn’t fit all. Our goal is a right-sized chunk for each thing we load, because every chunk costs bytes, main-thread time, and network time. We classify chunks the same way we classify error severity:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Critical&lt;/strong&gt; — load immediately (static)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Heavy&lt;/strong&gt; — load on navigation (dynamic)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optional&lt;/strong&gt; — load on interaction&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Can we combine them?&lt;/strong&gt;&lt;br&gt;
Yes — in layers, not as one mega-pattern. And you don’t literally need every technique. That’s overengineering. Define what you want to optimize and why.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why don’t we use some patterns — hover/focus, prefetch, idle prefetch?&lt;/strong&gt;&lt;br&gt;
It’s always a question of cost and benefit, and every extra loading strategy adds risk around caching and the CDN. Sometimes leaving a pattern out is a deliberate prioritization call, not a disagreement with the idea.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Will it actually change performance?&lt;/strong&gt;&lt;br&gt;
Like other optimization patterns — it depends on what you measure and where it hurts. Treat them as targeted experiments. They’re useful when you want a specific interaction to feel immediate.&lt;/p&gt;

&lt;h4&gt;
  
  
  Compression pattern
&lt;/h4&gt;

&lt;p&gt;Imports decide what code loads and when — compression decides how much it weighs when it gets there. Where imports operate at the application bundle layer, compression operates at the origin server and CDN — how you finally deliver bytes to the client.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nayye4yvmik07fq6nob.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2nayye4yvmik07fq6nob.png" alt=" " width="799" height="363"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At Manychat we use two compression formats: Gzip and Brotli. Both shrink files before they travel over the network, and the browser decompresses them transparently. Both are lossless encodings suited to text-like content (JS, CSS, HTML, JSON, SVG), not to already-compressed binaries like most JPEG/PNG.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy4zvsayaa1u91w3770tc.png" alt=" " width="800" height="360"&gt;&lt;/th&gt;
&lt;th&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fplm8y5fzljzj4ufrxo6z.png" alt=" " width="799" height="361"&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;You can check this in the network tab — go to JS files, look at response headers (look for content-encoding: br or gzip).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Gzip is the classic one, supported by every browser. Brotli usually produces smaller files, at the cost of somewhat slower compression. The browser lists the encodings it accepts in the Accept-Encoding request header, and the server or CDN picks the best one both sides support. Supporting both gives a Gzip baseline everywhere and smaller payloads where Brotli is available.&lt;/p&gt;

&lt;p&gt;Rule: prefer Brotli for static assets where supported; keep Gzip as fallback where needed.&lt;/p&gt;
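&lt;p&gt;In practice the web server or CDN performs this negotiation for you. Purely to make the rule concrete, here is a simplified sketch of the decision. pickEncoding is a hypothetical helper, and it only treats q=0 as “not acceptable” instead of ranking encodings by q-value.&lt;/p&gt;

```typescript
type Encoding = "br" | "gzip" | "identity";

// Picks a response encoding for a static asset from the Accept-Encoding request header.
// Simplification: an encoding is acceptable unless its q-value is 0; q-values are not ranked.
function pickEncoding(acceptEncoding: string): Encoding {
  const weights: { [name: string]: number } = {};
  for (const part of acceptEncoding.toLowerCase().split(",")) {
    const pieces = part.split(";").map((piece) => piece.trim());
    const name = pieces[0] ?? "";
    if (name === "") continue;
    const q = pieces.slice(1).find((piece) => piece.startsWith("q="));
    const weight = q === undefined ? 1 : Number(q.slice(2));
    weights[name] = Number.isNaN(weight) ? 0 : weight;
  }
  // Brotli first, Gzip as the fallback, uncompressed as the last resort.
  const preferred: Encoding[] = ["br", "gzip"];
  for (const candidate of preferred) {
    const weight = weights[candidate] ?? weights["*"] ?? 0;
    if (weight > 0) return candidate;
  }
  return "identity";
}
```

&lt;p&gt;For example, pickEncoding("gzip, deflate, br") yields "br", while a client that sends only "gzip, deflate" gets "gzip".&lt;/p&gt;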

&lt;h4&gt;
  
  
  How do we know the patterns are working?
&lt;/h4&gt;

&lt;p&gt;With user-centric metrics and repeatable team habits.&lt;/p&gt;

&lt;p&gt;We rely on Core Web Vitals — loading experience, responsiveness, and visual stability — via the &lt;a href="https://github.com/GoogleChrome/web-vitals" rel="noopener noreferrer"&gt;web vitals library&lt;/a&gt; from Google. When a user stays on a page long enough, an analytics event fires: app_metrics user_interactive_performance. The central place in the codebase is log_web_vitals.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzmc0pzznrh6gok4498h.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnzmc0pzznrh6gok4498h.png" alt=" " width="800" height="232"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Beyond Core Web Vitals we track two more things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;INP (Interaction to Next Paint) — real-world interaction responsiveness&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Long Tasks — where the user can feel that the app is stuck&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both are collected only for logged-in users and visible in Grafana dashboards. For Flow Builder specifically, the heaviest part of Manychat, we track INP on both desktop and mobile. In an ideal world every major component would have its own dashboard. For now, we start where it hurts most.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F44lhk8re4ergiiehh3fe.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F44lhk8re4ergiiehh3fe.png" alt=" " width="800" height="242"&gt;&lt;/a&gt;&lt;/p&gt;
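&lt;p&gt;A minimal sketch of how such measurements could be turned into an event payload, assuming the raw INP value and the long-task durations have already been collected in the browser (for instance via onINP from the web-vitals library and a PerformanceObserver for longtask entries). The helper names and payload fields are hypothetical, not our production schema; the 200 ms and 500 ms boundaries are the published INP thresholds.&lt;/p&gt;

```typescript
type Rating = "good" | "needs-improvement" | "poor";

// Published INP thresholds, in milliseconds.
const INP_GOOD_MS = 200;
const INP_POOR_MS = 500;
// Main-thread tasks longer than this count as long tasks.
const LONG_TASK_MS = 50;

// Maps an INP value to the rating bucket used in Core Web Vitals reporting.
function rateInp(valueMs: number): Rating {
  if (valueMs > INP_POOR_MS) return "poor";
  if (valueMs > INP_GOOD_MS) return "needs-improvement";
  return "good";
}

// Builds the payload for one analytics event; the field names are hypothetical.
function buildPerformanceEvent(inpMs: number, taskDurationsMs: number[]) {
  const longTasks = taskDurationsMs.filter((duration) => duration > LONG_TASK_MS);
  return {
    inpMs,
    inpRating: rateInp(inpMs),
    longTaskCount: longTasks.length,
    longTaskTotalMs: longTasks.reduce((sum, duration) => sum + duration, 0),
  };
}
```

&lt;p&gt;Sending the rating alongside the raw value makes it easy to chart the share of “good” interactions per page without re-deriving the thresholds in every dashboard query.&lt;/p&gt;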

&lt;p&gt;For audits and payload analysis we use &lt;a href="https://github.com/GoogleChrome/lighthouse" rel="noopener noreferrer"&gt;Lighthouse&lt;/a&gt; — a built-in Chrome tool that generates a detailed performance report for any page. It’s useful for catching issues before they reach real users.&lt;br&gt;
For day-to-day development we use browser DevTools — the network and performance tabs show what’s happening in real time while we code.&lt;/p&gt;




&lt;p&gt;Performance is not a one-time fix. Every import decision, every byte that travels over the wire — these are choices that compound over time. Get them right consistently, and users never notice. Get them wrong, and they leave.&lt;/p&gt;

&lt;p&gt;The patterns we covered — imports and compression — are not exotic optimizations. They’re the baseline. The metrics are what keep you honest: if you can’t see it, you can’t improve it.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you want to learn more about how we build Manychat and who we’re currently looking for, check out &lt;a href="https://careers.manychat.com/" rel="noopener noreferrer"&gt;Manychat Careers&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>frontend</category>
      <category>performance</category>
    </item>
    <item>
      <title>From idea to MVP in a hackathon with AI: 6 principles that get you there</title>
      <dc:creator>Manychat Engineering</dc:creator>
      <pubDate>Thu, 23 Apr 2026 08:41:56 +0000</pubDate>
      <link>https://dev.to/manychat/from-idea-to-mvp-in-a-hackathon-with-ai-6-principles-that-get-you-there-2p7i</link>
      <guid>https://dev.to/manychat/from-idea-to-mvp-in-a-hackathon-with-ai-6-principles-that-get-you-there-2p7i</guid>
      <description>&lt;h4&gt;
  
  
  &lt;em&gt;How to run a hackathon that results in a working MVP, and how AI fits into the process.&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fppb7zuodei7je58c93mi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fppb7zuodei7je58c93mi.png" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://careers.manychat.com/team/engineering" rel="noopener noreferrer"&gt;Manychat&lt;/a&gt; helps creators and brands automate conversations on Instagram, TikTok and beyond. The more the customer uses it, the richer the picture gets — automations running, leads coming in, content performing differently across posts. We wanted to surface that picture clearly, in one place, directly for the people running those automations.&lt;/p&gt;

&lt;p&gt;We had the data. What we didn’t have was clarity on which features would actually be useful, and in what form. Instead of locking that question into a quarterly plan and finding out three months later, we decided to find out in a week.&lt;/p&gt;

&lt;p&gt;Three people: two backend engineers, one PM, plus a data engineer who wasn’t full-time but was critical during the data prep phase. A couple of AI agents. Three days in the &lt;a href="https://careers.manychat.com/office/amsterdam" rel="noopener noreferrer"&gt;Amsterdam office&lt;/a&gt;, two days of remote prep. The output: an MVP running on real user data and seven customer interviews the following week.&lt;/p&gt;

&lt;p&gt;My name is &lt;a href="https://www.linkedin.com/in/artur-epremyan-b16422236/" rel="noopener noreferrer"&gt;Artur&lt;/a&gt;, I’m a Python Tech Lead here at &lt;a href="https://careers.manychat.com/team/engineering" rel="noopener noreferrer"&gt;Manychat&lt;/a&gt;. Here’s what we learned from our hackathon experience — and where AI made the difference.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;1. Start the hackathon before the hackathon&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The first thing to sort out in advance is access. AI tools, repositories, internal documentation, relevant services. Permissions always take longer than you expect and can become a serious blocker mid-hackathon. Deal with it before you start.&lt;/p&gt;

&lt;p&gt;The second thing is product hypotheses. We spent two days exploring data from our data warehouse. The internal documentation was a Google Sheets file with table and view descriptions (not exactly human-readable), and a data engineer helped us navigate it. We packaged all of that as context, fed it to an LLM (Claude Code and the Claude desktop app), and asked it to find something that would be useful to our customers from a product standpoint. The PM turned those findings into clear product hypotheses, wrote the first PRD, and built early prototypes in Lovable; we later used screenshots of those prototypes as UI references during the hackathon.&lt;/p&gt;

&lt;p&gt;The key point here isn’t that the LLM “invented” the product. It helped us quickly make sense of unfamiliar data and structure what we already intuitively wanted to validate.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;2. Planning takes 80% of the time — and that’s fine&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;During the hackathon itself, we barely wrote any code by hand. Most of the time went to something else: figuring out exactly what needed to be done and formulating it precisely enough for an AI agent to execute.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Review the plan, not the code.&lt;/strong&gt; The main insight from day one: the more precisely a task is defined for the agent, the fewer iterations you need. We used plan mode in Claude Code before every task, challenged the plan as a team, and saved it as soon as we were happy with it.&lt;/p&gt;

&lt;p&gt;We didn’t review the code — but that doesn’t mean quality didn’t matter. Instead of code review, we validated the output with tests. What mattered wasn’t how it was written, but whether it worked correctly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Save the plan immediately.&lt;/strong&gt; Once, we had a great plan. We ran it, but the result wasn’t good enough. So we did what people usually do in that situation and ran /clear to start fresh. The plan disappeared along with the context, and with it some of the most valuable work we had done. We had to rebuild from scratch. After that, saving the plan became a mandatory step before any iteration.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Don’t ask for refactoring — reformulate.&lt;/strong&gt; If the output isn’t right, don’t ask the agent to improve or fix what’s there. Reformulate the task, update the plan, and ask it to redo the whole thing. Especially when the context window is already half full — at that point, trying to patch the existing result only makes things worse.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;3. Delegate coordination to AI agents&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;If planning takes 80% of the time, the next challenge is turning that plan into a system of well-aligned tasks. This is where things usually start to fall apart: dependencies get lost, responsibilities overlap, and the overall structure drifts.&lt;/p&gt;

&lt;p&gt;We found that delegating coordination to AI agents helped keep everything consistent. We created one sub-agent per direction. Each sub-agent was responsible for writing a detailed implementation plan for its part of the work. Once the sub-agents had their plans, a “parent” agent merged them into a single whole — and checked for conflicts, duplication, or blockers between tasks. This replaced a significant chunk of the coordination overhead between people.&lt;/p&gt;

&lt;p&gt;The approach let a three-person team — one PM and two engineers — work in parallel and move fast. So fast, we weren’t quite ready for it.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;4. Keep tasks as independent as possible&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;People should work in parallel on maximally different things. This is less obvious than it sounds.&lt;/p&gt;

&lt;p&gt;On day one, the two of us split the backend by endpoints: one handler each. Seemed logical. We finished unexpectedly quickly and had to jump into another planning iteration — coordinating again, figuring out what’s next. With AI, this kind of split doesn’t really make sense: an agent closes a handler faster than you can get out of each other’s way.&lt;/p&gt;

&lt;p&gt;The next day we changed the approach: each person takes a fully independent direction. One writes a feature end-to-end, the other handles infrastructure. Or one does backend, the other does frontend. For the latter, AI came to the rescue again: Claude Code with Figma via MCP let us assemble an interface from Manychat’s existing design system components — no frontend experience required.&lt;/p&gt;

&lt;p&gt;While the engineers were building, the PM was running a parallel track: refining scope, defining the Ideal Customer Profile, finding customers who matched it, and tapping the marketing team’s existing warm contacts to line up interviews.&lt;/p&gt;

&lt;p&gt;Parallelization didn’t go away — it just moved up a level. Instead of “two people on one module,” it became “each person owns an entire layer.”&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;5. Log your compromises&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;During an MVP hackathon, you’re going to take shortcuts. Intentionally. The goal isn’t to avoid them, but to make sure they don’t turn into hidden debt.&lt;/p&gt;

&lt;p&gt;We kept a “Compromises” table and updated it as we went. Every deliberate technical or process shortcut got its own line. If the product validates, we have a concrete list of what needs to be brought up to production standards. Nothing gets forgotten, nothing quietly becomes permanent.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;6. Stay focused on the original goal&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;When we saw how fast we were moving, we decided to add an unplanned feature: AI Summary. A short insight block generated on dashboard load — top post, best automation, overall account dynamics for the week. It wasn’t in the original plan, took half a day, and turned out to be one of the most talked-about things in customer interviews.&lt;/p&gt;

&lt;p&gt;Then we got ambitious. Instead of showing the feature from a laptop, we decided to run a real experiment with a feature flag for actual customers. I took on the negotiations with infra and security so the team could keep shipping. Security didn’t approve it — and for a fair reason. Our solution was read-only, which made it safe in its current state. But the security team flagged that if any write operations were added down the line, the architecture would need to be revisited to account for that. They weren’t ready to approve this potential risk quickly, and we lost a person-day.&lt;/p&gt;

&lt;p&gt;At the retro, the conclusion was simple: every new action needs to be validated against the original goal. The MVP was never meant for production in the first place. We just thought — we’re moving fast, why not? It worked for the unplanned feature. It didn’t work for the prod deploy.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What we ended up with&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We shipped a working MVP analytics dashboard running on real user data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqa7bfhbo9fslp3gynqs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkqa7bfhbo9fslp3gynqs.png" width="800" height="518"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The dashboard has three parts. At the top, an &lt;strong&gt;AI Summary&lt;/strong&gt;: a short auto-generated insight about the last 7 days, covering total leads, DMs, comments, URL clicks, the top-performing post with CTR, and one clear action to take.&lt;br&gt;&lt;br&gt;
Below that, &lt;strong&gt;Account Performance:&lt;/strong&gt; four KPI cards showing automations running, leads collected, average CTR, and time saved.&lt;br&gt;&lt;br&gt;
And the main section: &lt;strong&gt;Top Content by Conversion&lt;/strong&gt;, a table of posts ranked by leads, with reach, clicks, CTR, engagement, saves, and automation status. For creators who monetize on Instagram, the key question is always the same: which content is actually working? Not working in terms of likes, but working in terms of leads, DMs, conversions. We wanted to make that visible at a glance, so they could double down on what’s performing and add automations where they’re missing out. Posts without automations show an “Add automation” prompt inline, and the customer can act on it without leaving the page.&lt;/p&gt;

&lt;p&gt;The week after the hackathon, the PM ran seven interviews with real customers on their real data. Four said they’d use the dashboard regularly. We got a concrete list of what was missing and clear input for the next quarter’s planning.&lt;/p&gt;

&lt;p&gt;In a normal cycle, development alone would take around two sprints — four weeks. User validation in larger companies can stretch for months: UX research queues, alignment meetings, synthesis. Here, it took three days to build and one week to get real customer feedback.&lt;/p&gt;

&lt;p&gt;A fast MVP wasn’t even the goal. Reducing risk was. Instead of committing a full quarter to something that might not land, you validate the idea first, and then decide whether it’s worth investing in. The hackathon format makes this possible. A small team goes from hypothesis to a validated MVP in about two weeks.&lt;/p&gt;

&lt;p&gt;AI changes the equation. Implementation becomes a non-issue: one engineer with a system of agents can cover what used to require a team. The bottleneck shifts. It’s no longer about writing code — it’s about knowing what to build. AI helps there too.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Want to be part of the team building Manychat? See what&lt;/em&gt; &lt;a href="https://careers.manychat.com/team/engineering" rel="noopener noreferrer"&gt;&lt;em&gt;roles we have open&lt;/em&gt;&lt;/a&gt; &lt;em&gt;right now.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>softwareengineering</category>
      <category>softwaredevelopment</category>
      <category>hackathonorganizing</category>
      <category>hackathon</category>
    </item>
    <item>
      <title>PHP Fibers: Simplifying Async Code and Speeding Up Development</title>
      <dc:creator>Manychat Engineering</dc:creator>
      <pubDate>Thu, 16 Apr 2026 12:54:49 +0000</pubDate>
      <link>https://dev.to/manychat/php-fibers-simplifying-async-code-and-speeding-up-development-g79</link>
      <guid>https://dev.to/manychat/php-fibers-simplifying-async-code-and-speeding-up-development-g79</guid>
      <description>&lt;h3&gt;
  
  
  PHP Fibers: simplifying async code and speeding up development
&lt;/h3&gt;

&lt;h4&gt;
  
  
  How serialization overhead, a surprise OpenSSL upgrade, and idle workers pushed us toward PHP 8.1 Fibers, and what changed when we did.
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftrlcz24fqc7gtwe5pygn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ftrlcz24fqc7gtwe5pygn.png" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’m &lt;a href="https://www.linkedin.com/in/maxim-fomichev/" rel="noopener noreferrer"&gt;Max&lt;/a&gt;, Infrastructure Team Lead at &lt;a href="https://careers.manychat.com/team/engineering" rel="noopener noreferrer"&gt;Manychat&lt;/a&gt;. This is the next part of our PHP series.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/manychat-engineering/simulating-%D1%81oncurrent-requests-how-we-achieved-high-performance-http-in-php-without-threads-c3a94bae6c3b" rel="noopener noreferrer"&gt;In the previous article&lt;/a&gt;, we built concurrent HTTP requests in PHP without threads — using curl_multi_exec to let a single worker handle multiple external calls at once. It worked. Then our AI features expanded, external calls multiplied, and the model started buckling under its own complexity.&lt;/p&gt;

&lt;p&gt;This article is about what we did next: PHP 8.1 Fibers, and how they changed the way our workers process payloads.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What exactly was wrong with concurrent requests?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The curl_multi_exec architecture came with a steep price. To make pseudo-concurrency work, we had to explicitly serialize and deserialize requests, responses and exceptions at every async boundary. That meant a significant refactor, new internal tooling, and conventions developers had to follow just to write new code correctly. As AI features grew in scope and number, the cognitive overhead became impossible to ignore.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error Handling complexity.&lt;/strong&gt; Handling exceptions, timeouts, and corner cases got increasingly painful. Every new scenario (retries, network failures, edge cases) required explicit handling, and since context had to survive serialization boundaries, each one added another layer of boilerplate.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scattered Context.&lt;/strong&gt; The hardest part wasn’t writing the code — it was reading it afterward. Business logic was split across serialization points: some state lived before the async boundary, some after. Tracing a single payload through the system meant mentally jumping between the sync worker, the async queue, and back. Code reviews became genuinely hard.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Testing Overhead.&lt;/strong&gt; Testing also became more complicated. Tests had to account for the full serialization/deserialization chain. Even a simple mock meant verifying multiple intermediate steps instead of a single function call.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Idle Workers.&lt;/strong&gt; Before Fibers, Meta API calls stayed synchronous — serializing and deserializing state at the point of the call would have required even more refactoring, so we just didn’t touch it. The average response time is around 250ms. Not slow enough to panic over — but not fast enough to ignore at Manychat’s scale. During that time, the worker just sat there.&lt;/p&gt;

&lt;p&gt;Bottom line: the code was getting harder to read, harder to test, and harder to extend. Development was slowing down — and everyone felt it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Three more things that made us rethink
&lt;/h3&gt;

&lt;p&gt;While we were sitting with those outcomes, three things happened in parallel:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. We solved our memory problem via PCNTL fork.&lt;/strong&gt; By using pcntl_fork() to spawn workers, we enabled &lt;a href="https://medium.com/manychat-engineering/slashing-php-cli-memory-consumption-techniques-for-high-performance-apps-b89a6caf1406" rel="noopener noreferrer"&gt;OPcache sharing and Linux copy-on-write&lt;/a&gt; — significantly reducing the memory footprint of each worker. In theory, we could almost stop worrying about idle workers; they were no longer consuming nearly as much memory. But they still consumed network connections. So the problem wasn’t fully gone.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. An Ubuntu upgrade revealed a new bottleneck.&lt;/strong&gt; We migrated from Ubuntu 20.04 to the new LTS and CPU load jumped by 10%. Nothing in our code had changed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3f8zs1hkpnvo91zjaach.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3f8zs1hkpnvo91zjaach.png" width="798" height="212"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We dug in. The problem was OpenSSL 3.0 — shipped with the new Ubuntu — which made SSL handshakes significantly more expensive. OpenSSL’s root certificate store on Linux is one large concatenated file — and the new version introduced mutex-style locking when iterating through it. Even Facebook’s own optimization of using a single root certificate file didn’t fully absorb the hit.&lt;/p&gt;

&lt;p&gt;The cause was our still-synchronous calls to the Meta API. Each payload opened a new TCP connection, and every new connection meant another TLS handshake. At Manychat’s scale, that added up fast, and that 10% CPU overhead became the trigger for the next step.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kvy7h5xwbjhs78mbga8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kvy7h5xwbjhs78mbga8.png" width="800" height="220"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So Anton Gorin, Chief Architect of Manychat, and I decided to combine our existing async worker — built on curl_multi_exec — with Fibers introduced in PHP 8.1.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What is a Fiber?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Fibers are a low-level mechanism for cooperative multitasking: you can pause execution at any point and resume it later from exactly the same spot, with no threads and no extra processes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight php"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;?php&lt;/span&gt;

&lt;span class="nv"&gt;$fiber&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="k"&gt;new&lt;/span&gt; &lt;span class="nc"&gt;Fiber&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;function&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;

    &lt;span class="k"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Suspending…&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

    &lt;span class="nv"&gt;$last&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Fiber&lt;/span&gt;&lt;span class="o"&gt;::&lt;/span&gt;&lt;span class="nf"&gt;suspend&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;

    &lt;span class="k"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Resuming with last value &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;$last&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="p"&gt;});&lt;/span&gt;

&lt;span class="nv"&gt;$last&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nv"&gt;$fiber&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;start&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;

&lt;span class="k"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"Suspended with last value &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nv"&gt;$last&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;"&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="nv"&gt;$fiber&lt;/span&gt;&lt;span class="o"&gt;-&amp;gt;&lt;/span&gt;&lt;span class="nf"&gt;resume&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;42&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This prints:&lt;br&gt;
Suspending…&lt;br&gt;
Suspended with last value 16&lt;br&gt;
Resuming with last value 42&lt;/p&gt;

&lt;p&gt;Unlike true multithreading, Fibers run within a single OS thread and don’t execute in parallel. Instead, they switch context explicitly via Fiber::suspend() and resume. That makes them well-suited for I/O-bound work: yield control while waiting for a response, do something else, come back when it’s ready.&lt;/p&gt;

&lt;h3&gt;
  
  
  The new payload processing flow with fibers
&lt;/h3&gt;

&lt;p&gt;Previously, every HTTP request meant serialize, hand off, wait, deserialize, restore. Here’s what that looked like in practice:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The sync worker picks a payload from the queue and starts processing it.&lt;/li&gt;
&lt;li&gt;When execution hits an external HTTP call, the worker serializes the request along with the current business-logic state and writes it to the async task queue.&lt;/li&gt;
&lt;li&gt;The async worker reads from the queue, deserializes multiple requests, and executes them concurrently via curl_multi_exec.&lt;/li&gt;
&lt;li&gt;When a response is ready, the async worker serializes it together with the updated state and writes it back to the sync task queue.&lt;/li&gt;
&lt;li&gt;A sync worker picks it up, deserializes everything, restores business-logic state, and continues from where it left off.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is that flow as a diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sf2cq8wq3fyq6vywy1w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0sf2cq8wq3fyq6vywy1w.png" width="800" height="454"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With Fibers, the logic was simpler:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The worker starts a fiber and begins processing a payload.&lt;/li&gt;
&lt;li&gt;When execution hits an external HTTP call — Meta API, LLM, whatever — the fiber suspends, returning the request that needs to be executed.&lt;/li&gt;
&lt;li&gt;Control passes to the Guzzle request loop, which sends the request. If no response is ready yet, the worker immediately starts the next fiber and begins processing another payload.&lt;/li&gt;
&lt;li&gt;As soon as a response is available in the Guzzle loop, the corresponding fiber resumes from exactly where it stopped.&lt;/li&gt;
&lt;li&gt;If that fiber produces another request, it suspends again and goes back into the loop.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvgeccjfkph0c7tfo6yxv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvgeccjfkph0c7tfo6yxv.png" width="800" height="438"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Within a single worker, several fibers can be suspended at once, each waiting for its own response, while at most one fiber is actively executing at any moment. The number of fibers in flight depends on configuration.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbyy5xve3vwe9fj32kty9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbyy5xve3vwe9fj32kty9.png" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;What are the wins? And one trade-off&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;More cases to make async — like API calls via Meta SDK&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Before Fibers, making Meta API calls async meant serializing and deserializing business state around every call. We just didn’t bother. With Fibers, we added a suspend point and called it done. A single Meta API call takes ~250ms — small individually, but Manychat makes billions of them. The compound effect is massive.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Savings on resources: CPU and connections&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
We rewrote part of the Facebook SDK to reuse connections. One HTTP/2 connection per worker, multiplexed across multiple requests. No repeated TCP handshakes. No OpenSSL overhead per request.&lt;/p&gt;

&lt;p&gt;CPU usage returned to previous levels.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Asynchronous Sleep&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Sometimes we need to wait — for example, before retrying after an HTTP 500, or to ensure correct message order before sending the next one. A regular sleep() blocks the entire process. If API errors spike and retry logic misbehaves, you’ve put the whole server to sleep.&lt;/p&gt;

&lt;p&gt;With fibers, we can implement an asynchronous sleep. A specific fiber sleeps for a defined interval while the worker continues processing other fibers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simpler code&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
No more serialization. No more deserialization. Business context stays where it belongs — inside the fiber. Developers don’t even need to know they’re inside a fiber. The code looks like regular PHP — because for all practical purposes, it is.&lt;br&gt;&lt;br&gt;
In practice: instead of pushing a request back into the queue on retry, you just do an asynchronous sleep.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simpler tests&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Testing async code with Guzzle required enormous effort — the full serialization/deserialization chain had to be accounted for, and even a simple mock meant verifying multiple intermediate steps. With Fibers, the code reads linearly and tests follow naturally. That said, some things are hard to reproduce outside production — but in practice, if it worked in dev, it worked in prod.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;One trade-off: blast radius&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Fibers came with one compromise. Previously, our guiding principle was “better to crash hard than silently suffer from error”: on any non-fatal warning, log it and terminate the worker. One payload lost, clean slate.&lt;/p&gt;

&lt;p&gt;With multiple fibers suspended simultaneously, that no longer works. Terminating the worker interrupts all in-flight payloads at once. We redesigned exception handling so that catchable errors terminate only the affected fiber while the worker continues processing others. Fatal errors — like out-of-memory — still take down the entire process. If five payloads are in flight, all five are lost.&lt;/p&gt;

&lt;p&gt;This meant working through existing technical debt and committing to treating critical errors as critical — actually reacting to them, not letting them slide. Migrating to PHP 8.5 helped: it introduced stack traces for fatal errors, which made them significantly easier to diagnose and fix.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Could we have done less work?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Probably. Revolt, ReactPHP, AMPHP and OpenSwoole all solve similar problems and would have saved us from building a custom event loop. AMPHP in particular goes further — async SQL queries, not just HTTP, and battle-tested error handling out of the box.&lt;/p&gt;

&lt;p&gt;But we didn’t start from a blank slate. We already had a Guzzle-based event loop from the earlier proof of concept, and adding Fibers on top was the natural next step. Starting over today, we’d look at Revolt first and skip the custom event loop entirely.&lt;/p&gt;

&lt;p&gt;What we’d keep regardless: developers don’t need to know they’re inside a fiber. The wrapping happens under the hood. That was a deliberate choice — and it’s the part that matters most in a large codebase with many contributors.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;This article is based on a talk I gave at&lt;/em&gt; &lt;a href="https://www.meetup.com/barcelona-php-talks" rel="noopener noreferrer"&gt;&lt;em&gt;PHP Talks #7&lt;/em&gt;&lt;/a&gt;&lt;em&gt;. If you’d rather watch than read, the video is&lt;/em&gt; &lt;a href="https://www.youtube.com/watch?v=in_XaE0T5IY" rel="noopener noreferrer"&gt;&lt;em&gt;here&lt;/em&gt;&lt;/a&gt;&lt;em&gt;.&lt;/em&gt;&lt;/p&gt;




</description>
      <category>phpdevelopers</category>
      <category>php</category>
      <category>softwaredevelopment</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Migrating the Manychat iOS App from Xcode Frameworks and CocoaPods to Swift Package Manager</title>
      <dc:creator>Manychat Engineering</dc:creator>
      <pubDate>Thu, 26 Mar 2026 09:47:47 +0000</pubDate>
      <link>https://dev.to/manychat/migrating-the-manychat-ios-app-from-xcode-frameworks-and-cocoapods-to-swift-package-manager-2nc1</link>
      <guid>https://dev.to/manychat/migrating-the-manychat-ios-app-from-xcode-frameworks-and-cocoapods-to-swift-package-manager-2nc1</guid>
      <description>&lt;h4&gt;
  
  
  &lt;em&gt;How we migrated a large, high-load iOS app from Xcode Frameworks and CocoaPods to Swift Package Manager — without freezing feature development.&lt;/em&gt;
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqhzxjtpgvh800d4nq8p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwqhzxjtpgvh800d4nq8p.png" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In August 2024, the CocoaPods project entered maintenance mode. Then in November 2024 &lt;a href="https://blog.cocoapods.org/CocoaPods-Support-Plans/" rel="noopener noreferrer"&gt;it was announced&lt;/a&gt; that the trunk will become read-only on December 2, 2026. After that, publishing new pods or updating existing ones will no longer be possible.&lt;br&gt;&lt;br&gt;
That announcement became the final trigger for our migration to Swift Package Manager.&lt;/p&gt;

&lt;p&gt;But honestly, we already had enough reasons to leave long before that.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;The starting point&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Our starting point was a fairly large and mature iOS application: the Manychat app with about 220k MAU.&lt;br&gt;&lt;br&gt;
By the time the migration started, the codebase was already modular:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;~20 external dependencies managed through CocoaPods&lt;/li&gt;
&lt;li&gt;~20 internal dependencies implemented as Xcode Frameworks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Ironically, this very modularization through Xcode Frameworks turned out to be the root of many problems.&lt;/p&gt;

&lt;p&gt;Internal modules were implemented as separate targets in .xcodeproj, each compiled into a dynamic .framework.&lt;/p&gt;

&lt;p&gt;Each dynamic framework is its own bundle, containing duplicated Swift metadata and adding hundreds of lines to the already monstrous project.pbxproj.&lt;/p&gt;

&lt;p&gt;This dynamic linking came with several unpleasant side effects:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Bloated app size&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
107 MB for a mobile app is… not exactly elegant.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Painful configuration&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Every framework required manual configuration: build settings, build phases, code signing, target configuration.&lt;br&gt;&lt;br&gt;
Multiply that by dozens of modules and you get configuration hell.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Endless conflicts in project.pbxproj&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Whenever merge conflicts appeared, developers had to manually resolve them inside the gigantic and barely readable project.pbxproj file. A special kind of suffering.&lt;/p&gt;

&lt;p&gt;And finally, maintaining two dependency systems at once (CocoaPods for external libraries and Xcode Frameworks for internal modules) was becoming increasingly painful.&lt;/p&gt;

&lt;p&gt;Swift Package Manager promised a much cleaner future: &lt;strong&gt;a single, native dependency system fully integrated with the Apple ecosystem.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So we decided it was time.&lt;/p&gt;
&lt;h3&gt;
  
  
  Migration strategy
&lt;/h3&gt;

&lt;p&gt;We split the migration into three phases:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Move external dependencies from CocoaPods to SPM.&lt;/li&gt;
&lt;li&gt;Migrate internal frameworks from Xcode Frameworks to SPM.&lt;/li&gt;
&lt;li&gt;Refine the architecture afterward.&lt;/li&gt;
&lt;/ol&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;Phase 1: External dependencies — leaving CocoaPods&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;We migrated dependencies one by one, starting with libraries that already supported SPM.&lt;/p&gt;

&lt;p&gt;A typical migration looked like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Add the package in Xcode: File → Add Package Dependencies.
&lt;/li&gt;
&lt;li&gt;Provide the repository URL and choose the exact version.&lt;/li&gt;
&lt;li&gt;Select the package products and attach them to the appropriate target.&lt;/li&gt;
&lt;li&gt;Remove the dependency from the Podfile.&lt;/li&gt;
&lt;li&gt;Verify imports still work.&lt;/li&gt;
&lt;li&gt;Run tests.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Some libraries didn’t support SPM yet. Most of them were forks hosted in Manychat repositories, such as centrifuge-ios (WebSocket client) or DiffTableDirector (table diffing library). For those, we had to add SPM support ourselves: write a Package.swift, then publish tags and releases.&lt;/p&gt;
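
&lt;p&gt;For a fork that ships plain Swift sources, adding SPM support can be as small as the manifest below. This is a minimal sketch, not the actual manifest of either fork: the package name, the platform version, and the source path are assumptions for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;// swift-tools-version: 5.10
import PackageDescription

// Hypothetical manifest for a forked library; names and paths are placeholders.
let package = Package(
    name: "ForkedLibrary",
    platforms: [.iOS(.v17)],
    products: [
        // The product is what the app target links against.
        .library(name: "ForkedLibrary", targets: ["ForkedLibrary"])
    ],
    targets: [
        // `path` points at the existing sources, so files do not have to move.
        .target(name: "ForkedLibrary", path: "Sources")
    ]
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;SPM resolves versions from Git tags, so once the manifest is committed, a semantic-version tag is what lets the app pin an exact version of the fork.&lt;/p&gt;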

&lt;p&gt;&lt;em&gt;One dependency required special treatment: our Kotlin Multiplatform analytics library (KMM). To integrate it via SPM we first had to add SPM support on the KMM side. That meant building an XCFramework inside the KMM project.&lt;/em&gt;&lt;/p&gt;
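
&lt;p&gt;On the consuming side, a prebuilt XCFramework is exposed to SPM through a binary target. A minimal sketch, assuming the KMM build produces an artifact named Analytics.xcframework (a hypothetical name) that sits next to the manifest:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;// swift-tools-version: 5.10
import PackageDescription

// Hypothetical wrapper package around a prebuilt KMM XCFramework.
let package = Package(
    name: "Analytics",
    products: [
        .library(name: "Analytics", targets: ["Analytics"])
    ],
    targets: [
        // A binary target references a compiled artifact instead of sources.
        .binaryTarget(name: "Analytics", path: "Analytics.xcframework")
    ]
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;For a remotely hosted artifact, the binaryTarget(name:url:checksum:) variant takes the URL of a zipped XCFramework and its checksum instead of a local path.&lt;/p&gt;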

&lt;p&gt;This phase of migration happened incrementally and took roughly &lt;strong&gt;3 months.&lt;/strong&gt; At the end of this phase we achieved native Xcode integration: SPM dependencies were connected directly through Xcode (XCRemoteSwiftPackageReference in .xcodeproj) without additional wrappers.&lt;br&gt;&lt;br&gt;
At this point we dropped storyboards entirely. We’d been using our own fork of SwinjectStoryboard to support them, and we were done with both.&lt;/p&gt;
&lt;h3&gt;
  
  
  Phase 2: Migrating internal frameworks to SPM packages
&lt;/h3&gt;

&lt;p&gt;This was the hardest part.&lt;/p&gt;

&lt;p&gt;After external dependencies moved to SPM, we ran into a fundamental problem: Xcode Frameworks cannot properly consume transitive dependencies from SPM packages.&lt;/p&gt;

&lt;p&gt;Here’s a concrete example:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzi5o4robhwvxvnxll8h5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzi5o4robhwvxvnxll8h5.png" width="800" height="125"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Logically, &lt;em&gt;Services&lt;/em&gt; should receive &lt;em&gt;Alamofire&lt;/em&gt; transitively through &lt;em&gt;Networking&lt;/em&gt;. In reality, Xcode does not propagate SPM dependencies through chains of dynamic frameworks. Instead, it throws linker errors: &lt;em&gt;undefined symbols.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Why? Because dynamic frameworks and SPM packages live in completely different dependency resolution worlds. This limitation of Xcode’s build system made the migration even more complicated.&lt;/p&gt;

&lt;p&gt;We saw two ways forward.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 1: declare all dependencies explicitly and migrate frameworks gradually.&lt;/strong&gt; Every Xcode Framework that needs a transitive SPM dependency then declares it as a direct one, which means duplicating dependencies.&lt;/p&gt;

&lt;p&gt;Before migration:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzi5o4robhwvxvnxll8h5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzi5o4robhwvxvnxll8h5.png" width="800" height="125"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After migration:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faoga62po7snpad1kaicm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faoga62po7snpad1kaicm.png" width="800" height="186"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkskqyfjyuuhefkix9zxr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkskqyfjyuuhefkix9zxr.png" width="800" height="186"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This approach would allow gradual migration, and reduce the risk of errors at each step. But the downside was obvious: it would require duplicating dependencies across all targets. The dependency graph would become a mess — real dependencies tangled with workarounds, impossible to tell apart. And every time a new SPM package is added, all dependent frameworks would need updating.&lt;br&gt;&lt;br&gt;
In other words: a temporary solution that would likely become permanent. We didn’t like that.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 2: migrate everything in one massive PR.&lt;/strong&gt; Convert all Xcode Frameworks into local SPM packages at once.&lt;br&gt;&lt;br&gt;
The upside was tempting: a clean dependency graph from day one, no duplicated dependencies, and proper transitive resolution.&lt;br&gt;&lt;br&gt;
The tradeoff was equally obvious: a huge pull request touching the entire codebase while active development continued. High risk. Lots of potential conflicts.&lt;/p&gt;

&lt;p&gt;After a week of debating and experimenting, we chose &lt;strong&gt;Option 2&lt;/strong&gt;. Better one painful surgery than months of living with architectural band-aids.&lt;/p&gt;
&lt;h3&gt;
  
  
  &lt;strong&gt;The big migration&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The actual migration looked like this.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We created a &lt;strong&gt;single Package.swift&lt;/strong&gt; describing all 25 modules — about 515 lines. This was intentional: maintaining one manifest is easier than managing dozens.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="c1"&gt;// swift-tools-version: 5.10&lt;/span&gt;
&lt;span class="kd"&gt;import&lt;/span&gt; &lt;span class="kt"&gt;PackageDescription&lt;/span&gt;

&lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;package&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Package&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
   &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;"Modules"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;platforms&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;iOS&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;v17&lt;/span&gt;&lt;span class="p"&gt;)],&lt;/span&gt;
   &lt;span class="nv"&gt;products&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;LocalModule&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;allCases&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;$0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;product&lt;/span&gt; &lt;span class="p"&gt;},&lt;/span&gt;
   &lt;span class="nv"&gt;dependencies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;ExternalPackage&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;packages&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
   &lt;span class="nv"&gt;targets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kt"&gt;LocalModule&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;allCases&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;map&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="nv"&gt;$0&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt; &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol start="2"&gt;
&lt;li&gt;&lt;p&gt;Deleted all framework targets from project.pbxproj.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Added Modules as a local SPM package (XCLocalSwiftPackageReference).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Attached products to the main app target.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Connected packages to test targets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Ran all tests and fixed compilation errors.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Merged it into dev.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
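
&lt;p&gt;The manifest in step 1 relies on two helper types, LocalModule and ExternalPackage, whose definitions are not shown. One possible shape is sketched below. The cases, the pinned Swinject version, and the folder layout are assumptions for illustration, not the actual Manychat manifest.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;// swift-tools-version: 5.10
import PackageDescription

// Hypothetical: third-party packages used by the local modules.
enum ExternalPackage: CaseIterable {
    case swinject

    var package: Package.Dependency {
        switch self {
        case .swinject:
            // Placeholder version; pin whatever the project actually uses.
            return .package(url: "https://github.com/Swinject/Swinject.git", exact: "2.9.1")
        }
    }

    static var packages: [Package.Dependency] { allCases.map { $0.package } }
}

// Hypothetical: one case per local module; only two are shown here.
enum LocalModule: String, CaseIterable {
    case core = "Core"
    case models = "Models"

    var name: String { rawValue }

    // Each module is exposed as a library product with the same name.
    var product: Product { .library(name: name, targets: [name]) }

    // Assumes the sources of a module live under Core/ModuleName in the package.
    var target: Target {
        .target(name: name, dependencies: dependencies, path: "Core/\(name)")
    }

    var dependencies: [Target.Dependency] {
        switch self {
        case .core:
            return [.product(name: "Swinject", package: "Swinject")]
        case .models:
            return [.target(name: LocalModule.core.name)]
        }
    }
}

let package = Package(
    name: "Modules",
    platforms: [.iOS(.v17)],
    products: LocalModule.allCases.map { $0.product },
    dependencies: ExternalPackage.packages,
    targets: LocalModule.allCases.map { $0.target }
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;With this shape, adding a module means adding one enum case and its dependency list; products and targets are derived from the enum, so they cannot drift apart.&lt;/p&gt;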

&lt;p&gt;The whole migration took four very intense days.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;New dependency architecture&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;During the migration we also cleaned up the architecture. Some improvements became possible thanks to SPM; others we implemented simply because the migration gave us a good opportunity.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We reorganized the existing frameworks into two types of SPM modules: Core and Feature. &lt;strong&gt;Core&lt;/strong&gt; for shared application components and &lt;strong&gt;Feature&lt;/strong&gt; for individual features.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Modules/
├── Core/              ← Infrastructure and shared components (20 modules)
│   ├── Core/          ← Foundation: TCA, Swinject, base extensions
│   ├── Models/        ← Domain models
│   └── ...
│
└── Feature/           ← Product features (5 modules)
    ├── Automations/
    └── ...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We introduced helpers for dependency declaration and validation:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Shared dependencies for feature modules&lt;/strong&gt;: a single function that returns the dependencies every feature module needs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;featureDependencies&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Dependency&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="p"&gt;[&lt;/span&gt;
       &lt;span class="nf"&gt;local&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;core&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
       &lt;span class="o"&gt;...&lt;/span&gt;
       &lt;span class="nf"&gt;external&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;swinject&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
       &lt;span class="nf"&gt;external&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;tca&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
   &lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Features then add only their own specific dependencies on top:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;automations&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
   &lt;span class="nf"&gt;featureDependencies&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
   &lt;span class="o"&gt;...&lt;/span&gt;
   &lt;span class="nf"&gt;external&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kingfisher&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;DependencyBuilder&lt;/strong&gt;  — so each module explicitly declares both its local and external dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;@DependencyBuilder&lt;/span&gt;
&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="k"&gt;var&lt;/span&gt; &lt;span class="nv"&gt;dependencies&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="kt"&gt;Dependency&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="k"&gt;self&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
   &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;core&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
       &lt;span class="nf"&gt;external&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;swinject&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="o"&gt;...&lt;/span&gt;
   &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;models&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
       &lt;span class="nf"&gt;local&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;core&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="nf"&gt;external&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dateTools&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="o"&gt;...&lt;/span&gt;
   &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nv"&gt;networking&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
       &lt;span class="nf"&gt;local&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;core&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="nf"&gt;local&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;models&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
       &lt;span class="o"&gt;...&lt;/span&gt;
   &lt;span class="c1"&gt;// ... the remaining modules&lt;/span&gt;
   &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
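
&lt;p&gt;The @DependencyBuilder attribute above is a Swift result builder whose definition is not shown. A minimal, self-contained sketch of how such a builder could work follows. The Dependency enum is a simplified stand-in for the PackageDescription type, and every name here is an illustrative assumption.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;// Simplified stand-in for PackageDescription's Target.Dependency (assumption).
enum Dependency: Equatable {
    case local(String)
    case external(String)
}

@resultBuilder
enum DependencyBuilder {
    // A single dependency becomes a one-element list.
    static func buildExpression(_ dependency: Dependency) -&amp;gt; [Dependency] { [dependency] }
    // A shared list, such as featureDependencies(), is spliced in as-is.
    static func buildExpression(_ dependencies: [Dependency]) -&amp;gt; [Dependency] { dependencies }
    // Statements inside one block are concatenated in order.
    static func buildBlock(_ parts: [Dependency]...) -&amp;gt; [Dependency] { parts.flatMap { $0 } }
    // Required so that `switch` and `if/else` can be used inside the builder.
    static func buildEither(first: [Dependency]) -&amp;gt; [Dependency] { first }
    static func buildEither(second: [Dependency]) -&amp;gt; [Dependency] { second }
}

// Dependencies shared by every feature module (hypothetical contents).
func featureDependencies() -&amp;gt; [Dependency] {
    [.local("Core"), .external("Swinject")]
}

enum Module: CaseIterable {
    case core, models, automations

    @DependencyBuilder
    var dependencies: [Dependency] {
        switch self {
        case .core:
            Dependency.external("Swinject")
        case .models:
            Dependency.local("Core")
            Dependency.external("DateTools")
        case .automations:
            featureDependencies()
            Dependency.external("Kingfisher")
        }
    }
}

// Two shared dependencies plus Kingfisher.
print(Module.automations.dependencies.count) // prints 3
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The two buildExpression overloads are what allow a single dependency and a whole shared list to sit side by side in the same case without brackets or commas.&lt;/p&gt;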



&lt;p&gt;&lt;strong&gt;Cycle detection&lt;/strong&gt;: a validation check that stops with a precondition failure if any feature module depends on another feature module, which rules out cycles between features:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight swift"&gt;&lt;code&gt;&lt;span class="kd"&gt;private&lt;/span&gt; &lt;span class="kd"&gt;func&lt;/span&gt; &lt;span class="nf"&gt;validateNoFeatureToFeatureDependencies&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;featureNames&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="kt"&gt;Set&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="kt"&gt;LocalModule&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;allCases&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;filter&lt;/span&gt;&lt;span class="p"&gt;(\&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isFeature&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(\&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;module&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt; &lt;span class="kt"&gt;LocalModule&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;allCases&lt;/span&gt; &lt;span class="k"&gt;where&lt;/span&gt; &lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isFeature&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;featureDeps&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;target&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dependencies&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;compactMap&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt; &lt;span class="n"&gt;dep&lt;/span&gt; &lt;span class="o"&gt;-&amp;gt;&lt;/span&gt; &lt;span class="kt"&gt;String&lt;/span&gt;&lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="k"&gt;in&lt;/span&gt;
            &lt;span class="k"&gt;switch&lt;/span&gt; &lt;span class="n"&gt;dep&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="k"&gt;case&lt;/span&gt; &lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;byNameItem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="k"&gt;let&lt;/span&gt; &lt;span class="nv"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;featureNames&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;contains&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;?&lt;/span&gt; &lt;span class="nv"&gt;name&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;nil&lt;/span&gt;
            &lt;span class="k"&gt;default&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="kc"&gt;nil&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="n"&gt;featureDeps&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;isEmpty&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="nf"&gt;preconditionFailure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
                &lt;span class="s"&gt;"Feature module '&lt;/span&gt;&lt;span class="se"&gt;\(&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="s"&gt;' must not depend on other feature modules: &lt;/span&gt;&lt;span class="se"&gt;\(&lt;/span&gt;&lt;span class="n"&gt;featureDeps&lt;/span&gt;&lt;span class="o"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;joined&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;separator&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="s"&gt;", "&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="se"&gt;)&lt;/span&gt;&lt;span class="s"&gt;"&lt;/span&gt;
            &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;p&gt;The most important outcome: we are no longer dependent on CocoaPods and now rely entirely on the native Apple dependency system.&lt;/p&gt;

&lt;p&gt;But the migration produced several additional benefits:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;App size down 31%.&lt;/strong&gt; From 106.6 MB to 73.9 MB, a saving of 32.7 MB. This was a side effect of converting Xcode Frameworks to SPM packages, not something we specifically optimized for.&lt;/p&gt;

&lt;p&gt;Where the savings came from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Static linking.&lt;/strong&gt; SPM packages link statically by default. LTO (Link-Time Optimization) can see the entire app at once instead of separate frameworks, and strips more aggressively.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Deduplication of Swift metadata.&lt;/strong&gt; Each dynamic framework previously carried its own copy of Swift type metadata. Static linking lets the linker deduplicate them.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dead code stripping.&lt;/strong&gt; The static linker removes unused code more effectively. Dynamic frameworks were always linked in full — even if the app used 10% of a .framework, the other 90% stayed in the binary.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No bundle overhead.&lt;/strong&gt; Each .framework is a directory with an Info.plist, headers, and a code signature. With 20+ frameworks, that adds up.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;A clear dependency graph.&lt;/strong&gt; No words needed.&lt;br&gt;&lt;br&gt;
This is before:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94e6cq1e3sitf2yk9qv7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F94e6cq1e3sitf2yk9qv7.png" width="799" height="298"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is after:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2uhstsa0is8btfn38hv4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2uhstsa0is8btfn38hv4.png" width="800" height="373"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;App launch 30% faster.&lt;/strong&gt; From 1.1–1.3s down to a stable 0.84s. Before the migration, dyld had to load and bind every dynamic framework at startup, which showed up in the pre-main phase. With static linking, there’s one binary to load.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidejnchux43oun84i61x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fidejnchux43oun84i61x.png" width="800" height="579"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;App start time: 1.29s before migration (5.13.1) vs 0.84s after (6.8.0).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faxoarnjd1nf7nuzrlsxf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faxoarnjd1nf7nuzrlsxf.png" width="800" height="487"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;The drop is hard to miss: launch time fell from ~1.0–1.1s to 0.84s after the migration in late February.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Clean build time down 28%.&lt;/strong&gt; From 156 seconds to 113 seconds. SPM modules build in parallel just like frameworks — but without the .framework bundle copy phase and per-framework code signing.&lt;/p&gt;

&lt;p&gt;The full migration took about six months: preparation, the actual move, and post-migration cleanup. We’re currently in the refinement phase, continuing to thin out the main app target by moving remaining logic into existing modules or extracting new ones, and migrating tests into SPM.&lt;/p&gt;




</description>
      <category>cocoapods</category>
      <category>swiftpackagemanager</category>
      <category>swift</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>How to survive LLM Traffic Spikes in Python</title>
      <dc:creator>Manychat Engineering</dc:creator>
      <pubDate>Thu, 05 Mar 2026 13:09:06 +0000</pubDate>
      <link>https://dev.to/manychat/how-to-survive-llm-traffic-spikes-in-python-4en</link>
      <guid>https://dev.to/manychat/how-to-survive-llm-traffic-spikes-in-python-4en</guid>
      <description>&lt;h4&gt;
  
  
  What it takes to route, rate-limit, and failover hundreds of LLM calls per second without breaking production.
&lt;/h4&gt;

&lt;p&gt;At &lt;a href="https://careers.manychat.com/team/engineering" rel="noopener noreferrer"&gt;Manychat&lt;/a&gt;, we serve AI-powered automation to thousands of Instagram and messaging accounts. Behind that experience sits our Python AI Service — a layer between our product and multiple LLM providers, handling hundreds of LLM calls per second in production.&lt;/p&gt;

&lt;p&gt;It works well. Until it doesn’t.&lt;/p&gt;

&lt;p&gt;LLM calls don’t behave like traditional API requests. They take seconds, not milliseconds. They’re expensive. They come with strict rate limits. And when a single LLM provider goes down, your feature can go down with it.&lt;br&gt;&lt;br&gt;
Horizontal scaling doesn’t solve this. Adding more servers won’t lift provider limits or fix upstream outages. What you actually need is a control layer — one that decides where traffic goes, when to back off, and how to fail without taking the entire system down.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feeocld0li9ktdz2eezjc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feeocld0li9ktdz2eezjc.png" width="800" height="432"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’m Sergi Porta, Python Team Lead in the Manychat AI unit. In this article, I’ll walk through the LLM traffic routing architecture behind our Python AI Service: the core gateway patterns for multi-provider LLM traffic, with a practical focus on failover logic, rate limiting, and monitoring.&lt;/p&gt;

&lt;p&gt;The goal is simple: survive spikes of hundreds of LLM calls per second and provider outages without drama.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python AI Service: Technical Stack
&lt;/h3&gt;

&lt;p&gt;Before diving into routing, a quick look at the service itself. The Python AI Service is built with &lt;strong&gt;Python 3.13&lt;/strong&gt; and &lt;strong&gt;FastAPI&lt;/strong&gt;, relying heavily on &lt;strong&gt;asyncio&lt;/strong&gt; to handle long-running LLM calls and other I/O-heavy workloads.&lt;/p&gt;

&lt;p&gt;We use &lt;strong&gt;SQLAlchemy&lt;/strong&gt; and &lt;strong&gt;Alembic&lt;/strong&gt; to manage configuration, metadata, and lifecycle state. Even though the service is focused on LLM traffic, it still needs to behave like any other production system: consistent, observable, and predictable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnbtz3mahk3vg5qsvfno.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdnbtz3mahk3vg5qsvfno.png" width="800" height="602"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Python AI Service architecture.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;For reliability, we work with multiple providers (Azure and OpenAI), and we plan to add more as product needs grow. That gives us flexibility, but also complexity.&lt;br&gt;&lt;br&gt;
Each provider behaves differently. Latency varies. Rate limits differ. Availability patterns are not the same. At our scale, ignoring those differences is not an option.&lt;br&gt;&lt;br&gt;
We need to monitor each deployment in real time, route traffic dynamically based on capacity, and recover automatically from partial outages.&lt;/p&gt;

&lt;p&gt;The solution was to introduce a dedicated abstraction layer — a routing layer that hides provider-specific complexity from the rest of the application.&lt;/p&gt;

&lt;h3&gt;
  
  
  LLMRouting: Turning Providers into a Resilient Pool
&lt;/h3&gt;

&lt;p&gt;The routing layer, LLMRouting, is built on top of &lt;a href="https://docs.litellm.ai/docs/routing" rel="noopener noreferrer"&gt;LiteLLM’s Router library&lt;/a&gt;. Here’s how it works. We define multiple backend deployments, each of which may contain replicas of the same LLM model. Instead of calling a specific provider directly, the agent sends the request to the routing layer and simply specifies the model it wants to use. What happens behind the scenes is abstracted away from the application.&lt;/p&gt;

&lt;p&gt;The first core mechanism is &lt;strong&gt;weighted routing&lt;/strong&gt;. Not all deployments are equal. Some run on provisioned throughput tiers. Others are pay-as-you-go. Some are cheaper. Some are faster. We assign each deployment a numeric weight, which determines how much traffic it receives. The higher the weight, the larger the share of requests.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate limits&lt;/strong&gt; are inevitable at scale. When a deployment starts returning 429 responses during a traffic spike, the router doesn’t stubbornly retry the same endpoint. It shifts traffic to other healthy deployments in the pool.&lt;/p&gt;

&lt;p&gt;If a deployment becomes fully unavailable, it enters a &lt;strong&gt;cooldown period&lt;/strong&gt;. During that time, it is temporarily removed from rotation, and the remaining backends absorb the traffic.&lt;/p&gt;

&lt;p&gt;This logic applies not only to deployments, but to entire providers. If Azure experiences an outage, traffic can be routed directly to OpenAI. Because the same model alias exists across providers, &lt;strong&gt;failover&lt;/strong&gt; happens within the retry window. The result: even a full provider outage doesn’t immediately cause errors for users.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzd4c4087x2bt2574722.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzd4c4087x2bt2574722.gif" width="800" height="458"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now let’s look at the mechanisms that make this work in practice: weighted routing, rate-limit handling, cooldowns, and fallbacks.&lt;/p&gt;

&lt;h4&gt;
  
  
  Weighted Routing
&lt;/h4&gt;

&lt;p&gt;Each model alias — for example, gpt-4o-mini — maps to multiple deployments across different providers. Every deployment has a numeric weight that determines its share of traffic.&lt;/p&gt;

&lt;p&gt;In our current production setup, the primary Azure deployment carries a weight of 8 (about 73% of requests). A secondary Azure deployment carries a weight of 2 (roughly 18%). A direct OpenAI fallback has a weight of 1 (around 9%).&lt;/p&gt;

&lt;p&gt;Here’s how that distribution looks conceptually:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kzj7pfskyms8a7tm0xz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8kzj7pfskyms8a7tm0xz.png" width="800" height="233"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The router uses a weighted random selection strategy (simple shuffle). Most requests are directed to provisioned-throughput tiers, while pay-as-you-go tiers remain warm and ready.&lt;/p&gt;
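
&lt;p&gt;As an illustration only, weighted random selection and the resulting traffic shares can be sketched with the Python standard library. The deployment names below are hypothetical, and this is not LiteLLM’s internal implementation; only the weights 8, 2, and 1 come from the setup described above.&lt;/p&gt;

```python
import random
from collections import Counter

# Hypothetical deployment names; the weights are the ones quoted in the text.
WEIGHTS = {"azure-primary": 8, "azure-secondary": 2, "openai-direct": 1}


def expected_shares(weights):
    """Return each deployment's expected share of traffic (weight / total)."""
    total = sum(weights.values())
    return {name: weight / total for name, weight in weights.items()}


def pick_deployment(weights, rng):
    """Pick one deployment at random, proportionally to its weight."""
    names = list(weights)
    return rng.choices(names, weights=[weights[n] for n in names], k=1)[0]


if __name__ == "__main__":
    rng = random.Random(42)
    draws = 10_000
    counts = Counter(pick_deployment(WEIGHTS, rng) for _ in range(draws))
    for name, share in expected_shares(WEIGHTS).items():
        observed = counts[name] / draws
        print(f"{name}: expected {share:.1%}, observed {observed:.1%}")
```

&lt;p&gt;With weights of 8, 2, and 1 the total is 11, so the expected shares are 8/11, 2/11, and 1/11, which round to the 73%, 18%, and 9% quoted above.&lt;/p&gt;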

&lt;p&gt;Traffic distribution isn’t hardcoded. It’s defined in YAML. That means we can rebalance weights or shift traffic across providers within seconds without deploying new code.&lt;/p&gt;

&lt;h4&gt;
  
  
  Handling Rate Limits
&lt;/h4&gt;

&lt;p&gt;When a deployment returns a 429 rate-limit response, the router does not retry the same endpoint. Instead, it immediately selects another deployment from the pool and retries the request — up to four attempts in total (1 original and 3 retries). Because each model alias maps to multiple backends, a rate limit in one Azure region is usually resolved by routing the retry to another Azure deployment or directly to OpenAI.&lt;/p&gt;
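
&lt;p&gt;A minimal sketch of this retry behaviour, under stated assumptions: &lt;code&gt;RateLimited&lt;/code&gt;, &lt;code&gt;call_with_retries&lt;/code&gt;, and the &lt;code&gt;send&lt;/code&gt; callback are hypothetical names, not part of LiteLLM. The sketch simply drops a rate-limited deployment from the candidate set for the current request, so it stops early if the pool has fewer deployments than attempts.&lt;/p&gt;

```python
import random


class RateLimited(Exception):
    """Stand-in for a provider's HTTP 429 response."""


def call_with_retries(deployments, send, rng, max_attempts=4):
    """Try up to max_attempts deployments, never reusing one that returned 429.

    deployments maps a deployment name to its weight.
    send(name) performs the call and raises RateLimited on a 429.
    Returns a (deployment name, result) tuple.
    """
    remaining = dict(deployments)
    for _ in range(max_attempts):  # 1 original attempt plus 3 retries
        if not remaining:
            break
        names = list(remaining)
        name = rng.choices(names, weights=[remaining[n] for n in names], k=1)[0]
        try:
            return name, send(name)
        except RateLimited:
            # Do not hit the same endpoint again for this request.
            del remaining[name]
    raise RateLimited("every attempted deployment was rate limited")
```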

&lt;p&gt;Every rate-limit event is tracked per backend through a custom Prometheus callback. Grafana dashboards make it immediately visible when a deployment is approaching its capacity ceiling. That visibility allows us to adjust routing weights proactively instead of reacting to outages after they cascade.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cooldowns: Isolating Failing Deployments
&lt;/h4&gt;

&lt;p&gt;Cooldowns prevent failing deployments from absorbing traffic they can’t serve. When a deployment crosses a failure threshold, the router removes it from the routing pool for a defined time window. During that period, only healthy deployments receive traffic.&lt;/p&gt;

&lt;p&gt;After the cooldown window expires, the deployment is reintroduced into rotation. This isolation is critical during partial outages. Instead of spreading failures across all incoming requests, the system converges on healthy endpoints within seconds.&lt;/p&gt;
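
&lt;p&gt;The cooldown idea can be sketched as a small tracker. This is an illustration rather than the router’s actual code: the class name, the threshold of 3 failures, and the 30-second window are all assumptions chosen for the example.&lt;/p&gt;

```python
import time


class CooldownTracker:
    """Remove a deployment from rotation after too many consecutive failures."""

    def __init__(self, allowed_failures=3, cooldown_seconds=30.0, clock=time.monotonic):
        self.allowed_failures = allowed_failures  # assumed threshold
        self.cooldown_seconds = cooldown_seconds  # assumed window
        self.clock = clock
        self.failures = {}       # name: consecutive failure count
        self.cooling_until = {}  # name: time at which it may return

    def record_success(self, name):
        self.failures[name] = 0

    def record_failure(self, name):
        self.failures[name] = self.failures.get(name, 0) + 1
        if self.failures[name] >= self.allowed_failures:
            # Threshold crossed: start the cooldown and reset the counter.
            self.cooling_until[name] = self.clock() + self.cooldown_seconds
            self.failures[name] = 0

    def is_available(self, name):
        return self.clock() >= self.cooling_until.get(name, 0.0)

    def healthy(self, names):
        """Return only the deployments that are currently in rotation."""
        return [name for name in names if self.is_available(name)]
```

&lt;p&gt;A router built this way would pass only &lt;code&gt;healthy(...)&lt;/code&gt; deployments to its selection step, and a deployment rejoins the pool automatically once its window has passed.&lt;/p&gt;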

&lt;h4&gt;
  
  
  Fallbacks Across Deployments and Providers
&lt;/h4&gt;

&lt;p&gt;Fallbacks operate at both the routing and application levels.&lt;/p&gt;

&lt;p&gt;At the routing layer, if retries on a primary deployment are exhausted, traffic shifts to the remaining tiers, including cross-provider fallbacks. Because the same model alias exists across providers, even a full regional outage does not require manual intervention. The router reroutes traffic within the retry window.&lt;/p&gt;

&lt;p&gt;At the application level, an additional safety net handles edge cases such as empty-content responses. Before surfacing an error to the user, the service retries the entire call. In practice, this means that even during provider-side instability, traffic can be rerouted in under a second without visible degradation for the end user.&lt;/p&gt;
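
&lt;p&gt;A rough sketch of such an application-level safety net, assuming a hypothetical async &lt;code&gt;call&lt;/code&gt; that returns the completion text. The number of attempts and the exception name are illustrative choices, not values taken from the service.&lt;/p&gt;

```python
import asyncio


class EmptyContentError(Exception):
    """Raised when every attempt returned an empty completion."""


async def complete_with_safety_net(call, max_attempts=2):
    """Retry the entire call when the model returns empty content.

    call is an async function with no arguments that returns the text
    of the completion (possibly empty or None).
    """
    for _ in range(max_attempts):
        text = await call()
        if text and text.strip():
            return text
    raise EmptyContentError(f"empty content after {max_attempts} attempts")


if __name__ == "__main__":
    replies = iter(["", "Hello!"])

    async def fake_call():
        # First attempt is empty, second one succeeds.
        return next(replies)

    print(asyncio.run(complete_with_safety_net(fake_call)))
```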

&lt;h3&gt;
  
  
  Monitoring and Observability: Seeing Problems Before Users Do
&lt;/h3&gt;

&lt;p&gt;Routing and failover logic are only as good as your visibility into them. Our observability stack relies on two core systems: Prometheus and Grafana for real-time metrics and alerting, and OpenTelemetry for distributed tracing across the full request lifecycle.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prometheus Metrics and Grafana Dashboards
&lt;/h3&gt;

&lt;p&gt;Every LLM call passing through the router is instrumented via a custom Prometheus callback.&lt;/p&gt;

&lt;p&gt;We record high-granularity metrics at both the model and backend levels — enough detail to understand not just that something is wrong, but where.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model-level metrics&lt;/strong&gt; include total call counts, latency distributions, token usage (prompt and completion), and error rates. Metrics are labeled by model alias and agent name, allowing us to isolate the performance of specific features, for example, intent detection versus flow generation.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Backend-level metrics&lt;/strong&gt; record which deployment handled each request and categorize the outcome into a controlled set: success, timeout, rate_limit, api_error, and other. Keeping this taxonomy small helps maintain manageable Prometheus cardinality while still providing enough signal to diagnose routing behavior.&lt;/p&gt;
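
&lt;p&gt;To make the idea of a small outcome taxonomy concrete, here is a library-free sketch. The exception classes are hypothetical, and a plain &lt;code&gt;Counter&lt;/code&gt; stands in for a labeled Prometheus counter; the five outcome names are the ones listed above.&lt;/p&gt;

```python
from collections import Counter

# The controlled set of outcome labels described in the text.
OUTCOMES = ("success", "timeout", "rate_limit", "api_error", "other")


class RateLimitError(Exception):
    """Hypothetical error for a 429 response."""


class ApiError(Exception):
    """Hypothetical error for any other provider-side failure."""


def classify(error):
    """Map an exception (or None for success) onto one fixed outcome label."""
    if error is None:
        return "success"
    if isinstance(error, TimeoutError):
        return "timeout"
    if isinstance(error, RateLimitError):
        return "rate_limit"
    if isinstance(error, ApiError):
        return "api_error"
    return "other"  # anything unexpected collapses into a single bucket


# Stand-in for a counter labeled by backend and outcome.
calls_total = Counter()


def record(backend, error=None):
    """Count one call for the given backend and return its outcome label."""
    outcome = classify(error)
    calls_total[(backend, outcome)] += 1
    return outcome
```

&lt;p&gt;Because every error collapses into one of five labels, the number of time series grows with the number of backends, not with the variety of error messages.&lt;/p&gt;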

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3sbkjdtp17bmxaox3rbq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3sbkjdtp17bmxaox3rbq.png" alt="LLM Providers Metrics weights, error, latency dashboard." width="800" height="413"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;LLM provider metrics dashboard: weights, errors, and latency.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;These metrics feed into dedicated Grafana dashboards that help us answer four critical questions:&lt;br&gt;&lt;br&gt;
&lt;strong&gt;1. Is traffic distributed as expected?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
We verify that routing weights are respected and detect unexpected shifts caused by cooldowns or failovers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;2. Is latency degrading anywhere?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
P50, P95, and P99 histograms are broken down by backend to surface provider-specific slowdowns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. Are errors isolated or systemic?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Outcome breakdowns show whether failures are limited to a single deployment or spreading across the pool.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Is cost drifting?&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Token counters per model and agent help detect prompt regressions and unexpected usage spikes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc816p367cvlbcwoumbvy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc816p367cvlbcwoumbvy.png" width="800" height="284"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Call counts and latency dashboard.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdeux27ud15jtf004wtes.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdeux27ud15jtf004wtes.png" width="800" height="376"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Errors dashboard.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67hgoqydmov51hdydsjt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F67hgoqydmov51hdydsjt.png" width="799" height="259"&gt;&lt;/a&gt;&lt;br&gt;
&lt;em&gt;Cost analysis dashboard.&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Alerting System
&lt;/h3&gt;

&lt;p&gt;Dashboards are useful for investigation. Alerts are what trigger action.&lt;br&gt;&lt;br&gt;
Slack notifications fire under two primary conditions:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;P95 latency thresholds.&lt;/strong&gt; Alerts trigger when latency exceeds defined limits (typically between 3.5 and 5 seconds, depending on the model). This helps catch provider slowdowns before users feel them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Error rate breaches.&lt;/strong&gt; An error rate above a certain threshold triggers an immediate notification. At our traffic level, that’s not a minor glitch — it’s a strong signal of an outage or misconfiguration that requires attention.&lt;/p&gt;
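
&lt;p&gt;The alerts themselves are fired from the metrics stack, but the arithmetic behind the two conditions can be illustrated in a few lines. In this sketch the 5-second P95 limit is the upper end of the range mentioned above, while the 5% error-rate threshold and the function names are assumptions made for the example.&lt;/p&gt;

```python
import statistics


def p95(latencies):
    """Return the 95th percentile of a list of latencies, in seconds."""
    # quantiles(n=100) returns 99 cut points; index 94 is the 95th percentile.
    return statistics.quantiles(latencies, n=100, method="inclusive")[94]


def should_alert(latencies, errors, total, p95_limit=5.0, max_error_rate=0.05):
    """Return the list of breached conditions for one evaluation window."""
    breached = []
    if p95(latencies) > p95_limit:
        breached.append("p95_latency")
    if total and errors / total > max_error_rate:
        breached.append("error_rate")
    return breached


if __name__ == "__main__":
    # 10% of calls took 6 seconds, so the P95 exceeds the 5 second limit.
    window = [1.0] * 90 + [6.0] * 10
    print(should_alert(window, errors=0, total=100))
```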

&lt;h3&gt;
  
  
  Monitoring the Asyncio Event Loop
&lt;/h3&gt;

&lt;p&gt;In a high-throughput asynchronous service, a stalled event loop delays every in-flight request, so we monitor its health with two dedicated metrics:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Event loop delay.&lt;/strong&gt; This measures the gap between expected and actual asyncio.sleep intervals. Spikes above 1 ms usually indicate CPU-bound work or blocking calls that are starving the loop and, in turn, increasing LLM response latency.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Active task count.&lt;/strong&gt; Tracking the number of running tasks helps detect backpressure caused by slow upstream responses or sudden spikes in concurrency.&lt;/p&gt;
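
&lt;p&gt;Both metrics can be approximated with the standard library alone. The sketch below is an assumption about how such probes might look, not the service’s actual instrumentation; the function names, the sampling interval, and the demo’s deliberate blocking call are illustrative.&lt;/p&gt;

```python
import asyncio
import time


async def measure_loop_delay(interval=0.05, samples=5):
    """Sleep for interval seconds repeatedly and record how late each wake-up is."""
    delays = []
    for _ in range(samples):
        start = time.perf_counter()
        await asyncio.sleep(interval)
        elapsed = time.perf_counter() - start
        delays.append(max(0.0, elapsed - interval))
    return delays


def active_task_count():
    """Number of unfinished tasks; must be called from inside a running loop."""
    return len(asyncio.all_tasks())


async def main():
    monitor = asyncio.create_task(measure_loop_delay(samples=3))
    await asyncio.sleep(0)  # let the monitor start its first sleep
    time.sleep(0.2)         # deliberately block the loop to provoke a delay
    delays = await monitor
    print(f"max delay: {max(delays) * 1000:.0f} ms, tasks: {active_task_count()}")


if __name__ == "__main__":
    asyncio.run(main())
```

&lt;p&gt;The blocking &lt;code&gt;time.sleep&lt;/code&gt; in the demo makes the first wake-up arrive well after its 50 ms target, which is the kind of spike the delay metric is meant to expose.&lt;/p&gt;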

&lt;h3&gt;
  
  
  Distributed Tracing with OpenTelemetry
&lt;/h3&gt;

&lt;p&gt;While metrics provide high-level status, OpenTelemetry provides the context needed for deep investigation. The service automatically instruments three layers: FastAPI (HTTP requests), OpenAI API calls, and SQLAlchemy (database queries).&lt;/p&gt;

&lt;p&gt;Each trace spans the full lifecycle of a request — from the initial HTTP call through intent detection, embedding generation, database lookups, and final LLM completion. We propagate custom attributes via OpenTelemetry baggage to preserve business context:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;manychat.account_id: links spans to specific customer accounts.&lt;/li&gt;
&lt;li&gt;manychat.session_id: associates spans with unique automation sessions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These attributes let us pinpoint whether a bottleneck originated in an LLM backend, an embedding call, or a database query for a specific request. Traces are exported via OTLP gRPC and stored in S3 for long-term analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Observability Checklist for Production LLM Routing
&lt;/h3&gt;

&lt;p&gt;Here are the observability requirements for a production LLM routing layer:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Backend call counts and outcomes&lt;/strong&gt; to verify routing weights and failover activation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency histograms by deployment&lt;/strong&gt; to isolate provider-side slowdowns.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error classification&lt;/strong&gt; to distinguish between expected rate limits and critical authentication or timeout issues.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token usage tracking&lt;/strong&gt; for cost management and identifying prompt regressions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Event loop health monitoring&lt;/strong&gt; to detect blocking calls or task backpressure.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Distributed traces with business context&lt;/strong&gt; to correlate LLM performance with database and embedding latency.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Threshold-based alerts&lt;/strong&gt; for P95 latency and error rate breaches.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  What is next for Python AI Service and LLM Routing?
&lt;/h3&gt;

&lt;p&gt;Our LLMRouting-based architecture is solid, and we’re evolving it further.&lt;/p&gt;

&lt;p&gt;Next, we’re extending the routing layer with RPM/TPM-aware selection, latency-based routing, and cost optimization, so the system can automatically prefer the most efficient deployment available in real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Rate-limit handling&lt;/strong&gt; will also evolve. We’re introducing exponential backoff and more granular retry policies per error type.&lt;/p&gt;
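
&lt;p&gt;For readers unfamiliar with the technique, exponential backoff can be sketched as follows. The base delay, growth factor, cap, and optional jitter are illustrative defaults, not values from the planned implementation.&lt;/p&gt;

```python
import random


def backoff_delays(attempts, base=0.5, factor=2.0, cap=8.0, rng=None):
    """Return the wait before each retry: base * factor**i, capped at cap.

    If rng is given, each delay is drawn uniformly between 0 and the
    capped value ("full jitter"), so clients do not retry in lockstep.
    """
    delays = []
    for i in range(attempts):
        delay = min(cap, base * factor ** i)
        if rng is not None:
            delay = rng.uniform(0.0, delay)
        delays.append(delay)
    return delays


if __name__ == "__main__":
    print(backoff_delays(5))
    print(backoff_delays(5, rng=random.Random(7)))
```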

&lt;p&gt;&lt;strong&gt;Cooldowns&lt;/strong&gt; will become more refined. Instead of a single threshold, we’ll define explicit “allowed failure” counts and tailor cooldown durations to the type of error, distinguishing between expected rate-limit spikes and critical authentication failures.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Fallback logic&lt;/strong&gt; will also be extended to span multiple model groups. For example, falling back from gpt-4o-mini to gpt-4o, or dynamically selecting models based on context window size and content policies.&lt;/p&gt;

&lt;p&gt;We’re also considering integrating &lt;strong&gt;other providers&lt;/strong&gt; in the future, such as Anthropic, Gemini, and potentially self-hosted models.&lt;/p&gt;

&lt;p&gt;One hundred LLM calls per second once felt ambitious. Now we’re preparing for thousands. But that’s a story for another article.&lt;/p&gt;




</description>
      <category>softwareengineering</category>
      <category>ai</category>
      <category>python</category>
    </item>
  </channel>
</rss>
