<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Deva</title>
    <description>The latest articles on DEV Community by Deva (@arihantdeva).</description>
    <link>https://dev.to/arihantdeva</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3962111%2F5faad1c8-580d-4b79-a955-ed5b72093db3.jpg</url>
      <title>DEV Community: Deva</title>
      <link>https://dev.to/arihantdeva</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/arihantdeva"/>
    <language>en</language>
    <item>
      <title>12 Posts a Week to 7: Fixing My LinkedIn Distribution Problem</title>
      <dc:creator>Deva</dc:creator>
      <pubDate>Mon, 22 Jun 2026 15:20:45 +0000</pubDate>
      <link>https://dev.to/arihantdeva/12-posts-a-week-to-7-fixing-my-linkedin-distribution-problem-dae</link>
      <guid>https://dev.to/arihantdeva/12-posts-a-week-to-7-fixing-my-linkedin-distribution-problem-dae</guid>
      <description>&lt;p&gt;12 posts per week. That was my LinkedIn cadence, and it was quietly killing my reach.&lt;/p&gt;

&lt;p&gt;Not because the content was bad. Because two posts per weekday meant both landed in the same golden hour window and competed for the same network. The feed ranker almost never shows both posts to the same person. So the second post was diluting the first. Every week I was running two campaigns against each other.&lt;/p&gt;

&lt;p&gt;I found this by looking at what I actually controlled at the algorithm level. LinkedIn's feed ranker has documented behavior around post spacing and early engagement velocity. When you put two posts in the same window, neither gets full distribution in the first hour. You split your own early signal. And early signal is what determines whether the algorithm amplifies you at all.&lt;/p&gt;

&lt;p&gt;The fix was straightforward once I saw it: drop to 1 post per day. Keep the higher reach late morning slot on weekdays. Keep the two carousel posts on Tuesday and Wednesday. That is 7 posts per week instead of 12. Fewer posts, more room for each one to breathe.&lt;/p&gt;

&lt;h2&gt;
  
  
  The second problem was subtler
&lt;/h2&gt;

&lt;p&gt;Every topic signal I was pulling from was AI and dev news. Which meant 100 percent of my posts were Claude, agentic systems, or some adjacent corner of that world. My audience is college students, builders, and founders. They care about AI, yes. But seeing the same note repeated is how you get unfollowed.&lt;/p&gt;

&lt;p&gt;I needed variety without faking expertise I do not have. The answer was a reflection topic type: lessons, ideas, and thoughts that come from actually building things, not from chasing the AI news cycle.&lt;/p&gt;

&lt;p&gt;I seeded it with 36 topic prompts covering angles outside AI, written in my own voice. Things like shipping decisions, what breaks in solo projects, why planning fails, when to cut scope. Real stuff from building. The mix is now roughly 1 in 3 posts from this reflection pool, with AI content staying at about 2 in 3. Not an equal split. AI is still the core. But the feed no longer looks like one note on repeat.&lt;/p&gt;

&lt;p&gt;The implementation detail that matters: the pool rotates oldest first. A seed does not reappear until the rest have run. Otherwise you end up with de facto repeats just from different angles, which defeats the point.&lt;/p&gt;
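&lt;p&gt;A minimal sketch of that oldest first rotation, assuming the pool is a simple in-memory queue. &lt;code&gt;SeedPool&lt;/code&gt; and &lt;code&gt;next_seed&lt;/code&gt; are hypothetical names for illustration, not the engine's actual API.&lt;/p&gt;

```python
from collections import deque


class SeedPool:
    """Oldest-first rotation: a seed cannot repeat until all others have run."""

    def __init__(self, seeds):
        self._queue = deque(seeds)

    def next_seed(self):
        # Take the seed that has waited longest and send it to the back.
        seed = self._queue.popleft()
        self._queue.append(seed)
        return seed


pool = SeedPool(["shipping decisions", "solo project breakage", "cutting scope"])
picks = [pool.next_seed() for _ in range(4)]
```

&lt;p&gt;With three seeds, the fourth pick is the first seed again, and never sooner. A persistent version would store the queue order in state between runs.&lt;/p&gt;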

&lt;h2&gt;
  
  
  Verification before shipping
&lt;/h2&gt;

&lt;p&gt;After the changes, the plan command showed 1 post per day. The curate command appended all 36 reflection seeds to the queue. The selector was realizing about 28 percent reflection in practice, close to the configured mix of roughly 1 in 3. A reflection draft ran through the quality gate and passed. 189 tests green.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would do differently
&lt;/h2&gt;

&lt;p&gt;I should have caught the cadence problem earlier. The data was there. Two posts per day, declining per post impressions, obvious explanation. I kept optimizing the content and ignoring the structural issue. That is the classic mistake: iterate on the thing you are most comfortable touching instead of looking at what the data is actually telling you.&lt;/p&gt;

&lt;p&gt;On the reflection stream, I would have started with a smaller seed pool and validated one full rotation before expanding. 36 seeds is probably more than I needed to start. Better to prove the rotation mechanism works at 10 seeds and grow from there.&lt;/p&gt;

&lt;p&gt;The broader lesson: most content distribution problems are not content problems. They are structural. Fix the structure first.&lt;/p&gt;

</description>
      <category>analytics</category>
      <category>marketing</category>
      <category>productivity</category>
      <category>socialmedia</category>
    </item>
    <item>
      <title>Content automation fails at the idea layer, not the model layer</title>
      <dc:creator>Deva</dc:creator>
      <pubDate>Mon, 22 Jun 2026 12:43:03 +0000</pubDate>
      <link>https://dev.to/arihantdeva/content-automation-fails-at-the-idea-layer-not-the-model-layer-4pa</link>
      <guid>https://dev.to/arihantdeva/content-automation-fails-at-the-idea-layer-not-the-model-layer-4pa</guid>
      <description>&lt;p&gt;The standard postmortem when automated content underwhelms: fix the prompt, swap the model, add more examples. Every iteration cycle stays inside generation.&lt;/p&gt;

&lt;p&gt;That is the wrong layer. The model writes what you feed it. Generic inputs produce generic outputs regardless of which model you run or how carefully you craft the system prompt. The real failure point is upstream, and it usually comes paired with a scheduler that silently refuses to fire.&lt;/p&gt;

&lt;p&gt;I hit both this week building original content feeders for my X engine.&lt;/p&gt;

&lt;h2&gt;
  
  
  The idea input problem
&lt;/h2&gt;

&lt;p&gt;The original setup generated posts from a static topic queue. A list of themes, rotated on a schedule. Serviceable for a day or two, then every post becomes a rephrase of the last. The fix was not prompt tuning. It was wiring two live feeders:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;GitHub trending: repos filtered by language, pulled daily and weekly. Actual shipping projects, not abstract topic strings.&lt;/li&gt;
&lt;li&gt;Niche pulse: recent posts from a curated creator pool the engine already watches.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The part that is not obvious is what you do after ingestion. You have to split incoming items into two generation paths: grounded (the post is factually anchored to the specific feeder item) and ungrounded (the item is a creative trigger, the post can range further). Mixing both into one path produces outputs that hedge between reporting and opinion and do neither cleanly. A post that vaguely gestures at a trending repo without making a real claim about it is worse than a pure opinion post. Pick a lane per item.&lt;/p&gt;

&lt;h2&gt;
  
  
  The scheduler bug that made all of it moot
&lt;/h2&gt;

&lt;p&gt;With feeders wired, the engine had a daily floor of 3 posts and per slot idempotency so reruns do not post twice. Tested fine manually. Deployed via launchd plist. Nothing fired.&lt;/p&gt;

&lt;p&gt;The bug: &lt;code&gt;StartCalendarInterval&lt;/code&gt; versus &lt;code&gt;StartInterval&lt;/code&gt;. Calendar interval schedules at specific clock times. Start interval fires every N seconds from load. I had used &lt;code&gt;StartCalendarInterval&lt;/code&gt; in a plist configuration that launchd accepted silently and honored never. Swapping to &lt;code&gt;StartInterval&lt;/code&gt; and checking &lt;code&gt;launchctl list&lt;/code&gt; confirmed the job was actually queued.&lt;/p&gt;

&lt;p&gt;The failure mode is silence. No crash, no log entry, no indication anything went wrong. The job just does not run. This class of launchd problem is easy to misdiagnose because every debugging instinct points at your code, not at whether the trigger even fired. Verify the scheduler actually runs before debugging anything else.&lt;/p&gt;

&lt;h2&gt;
  
  
  Singles only, for now
&lt;/h2&gt;

&lt;p&gt;The feeder generates single posts only. Intentional. Threads need atomic commit semantics: either the full thread posts or none of it does. A partial publish leaves a dangling opener, which is worse than no post at all. The daily floor of 3 singles is a safer floor than 1 thread that fails mid chain. Thread support comes later, when the idempotency model can handle it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would have done differently
&lt;/h2&gt;

&lt;p&gt;Start scheduler verification on day one. Write a dummy job that touches a file every few minutes, confirm it fires, then build the real payload on top of it. I lost two days diagnosing generation logic when the scheduler was never executing.&lt;/p&gt;
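&lt;p&gt;A sketch of such a heartbeat job, generated with the stdlib &lt;code&gt;plistlib&lt;/code&gt;. The label and the marker path are hypothetical. &lt;code&gt;Label&lt;/code&gt;, &lt;code&gt;ProgramArguments&lt;/code&gt;, &lt;code&gt;StartInterval&lt;/code&gt;, and &lt;code&gt;RunAtLoad&lt;/code&gt; are standard launchd keys.&lt;/p&gt;

```python
import plistlib

# Hypothetical label and marker path; adjust to your own setup.
heartbeat_job = {
    "Label": "com.example.heartbeat",
    "ProgramArguments": ["/usr/bin/touch", "/tmp/launchd-heartbeat"],
    "StartInterval": 300,  # seconds between runs
    "RunAtLoad": True,
}

# Serialize to the XML plist format that launchd reads.
plist_bytes = plistlib.dumps(heartbeat_job)
```

&lt;p&gt;Write &lt;code&gt;plist_bytes&lt;/code&gt; to a file under your LaunchAgents directory, load it, and watch the marker file's modification time. If the timestamp never moves, the problem is the trigger, not the payload.&lt;/p&gt;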

&lt;p&gt;Also: instrument the feeder pipeline separately from generation. If feeders return zero items, the engine should log that and stop, not silently fall through to stale templates. Silent fallback is indistinguishable from working correctly until you go looking.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>One kwarg made my parallel CI phase gate fixtures order independent</title>
      <dc:creator>Deva</dc:creator>
      <pubDate>Sat, 20 Jun 2026 20:46:54 +0000</pubDate>
      <link>https://dev.to/arihantdeva/one-kwarg-made-my-parallel-ci-phase-gate-fixtures-order-independent-3a7j</link>
      <guid>https://dev.to/arihantdeva/one-kwarg-made-my-parallel-ci-phase-gate-fixtures-order-independent-3a7j</guid>
      <description>&lt;p&gt;Two phases, one parallel cluster, one AttributeError. That is the whole setup.&lt;/p&gt;

&lt;p&gt;Here is the context. My CI runs phase gates where each gate gets a fresh git worktree branched from the integration branch at that moment. P1 adds a config constant called &lt;code&gt;WARMUP_ENABLED&lt;/code&gt;. P2 adds a pytest conftest fixture that monkeypatches that constant to &lt;code&gt;False&lt;/code&gt; in tests. Both ship in the same parallel cluster, C1.&lt;/p&gt;

&lt;p&gt;The problem: the P2 gate worktree branches off integration before P1 has merged. &lt;code&gt;conftest.py&lt;/code&gt; exists in the worktree because P2 wrote it. &lt;code&gt;WARMUP_ENABLED&lt;/code&gt; does not exist in config because P1 has not landed yet. &lt;code&gt;monkeypatch.setattr&lt;/code&gt; on a missing attribute raises &lt;code&gt;AttributeError&lt;/code&gt; by default. The suite blows up.&lt;/p&gt;

&lt;p&gt;The fix is one argument:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;monkeypatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WARMUP_ENABLED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;raising&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With &lt;code&gt;raising=False&lt;/code&gt;, monkeypatch creates the attribute if it does not exist instead of raising. The fixture now runs correctly whether &lt;code&gt;WARMUP_ENABLED&lt;/code&gt; is already defined in config or not.&lt;/p&gt;

&lt;p&gt;The tradeoff worth naming: &lt;code&gt;raising=False&lt;/code&gt; will silently create an attribute that does not exist. That is usually the right behavior for a setup fixture, but it also means a typo in the attribute name will not blow up at fixture time. You will not get &lt;code&gt;AttributeError&lt;/code&gt; telling you the attribute is missing. You will set a new attribute nobody reads, your tests will pass, and the real config constant stays at whatever it defaulted to. The mitigation I use: keep the constant name short, keep it in one place, and verify the name matches when the real phase lands.&lt;/p&gt;
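&lt;p&gt;One way to narrow that typo risk is to tolerate a missing attribute only for names you have explicitly listed as pending. This is a sketch, not what the fixture in this post does. &lt;code&gt;PENDING_CONSTANTS&lt;/code&gt; and &lt;code&gt;patch_constant&lt;/code&gt; are hypothetical helpers. The only pytest API used is &lt;code&gt;monkeypatch.setattr(target, name, value, raising=False)&lt;/code&gt;.&lt;/p&gt;

```python
# Hypothetical allowlist of constants that a sibling phase has not merged yet.
PENDING_CONSTANTS = frozenset({"WARMUP_ENABLED"})


def patch_constant(monkeypatch, module, name, value):
    """Patch module.name, tolerating a missing attribute only if allowlisted."""
    if not hasattr(module, name) and name not in PENDING_CONSTANTS:
        raise AttributeError(f"{module!r} has no attribute {name!r}")
    # raising=False lets pytest create the attribute when it is absent.
    monkeypatch.setattr(module, name, value, raising=False)
```

&lt;p&gt;Inside a fixture you would call &lt;code&gt;patch_constant(monkeypatch, config, "WARMUP_ENABLED", False)&lt;/code&gt;. A misspelled name fails loudly again, while the known pending constant is still created on demand.&lt;/p&gt;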

&lt;p&gt;The deeper problem is architectural. Parallel phase gates that branch off the same integration branch are inherently sensitive to ordering when tasks share logical dependencies. If P2's conftest depends on P1's constant, the clean answer is to put them in the same phase or gate P2 after P1. I did not do that here because the dependency runs one way and is shallow. P2 does not need &lt;code&gt;WARMUP_ENABLED&lt;/code&gt; to exist at test runtime; it just needs to be able to patch it. &lt;code&gt;raising=False&lt;/code&gt; is the correct local fix for exactly that shape of coupling. It is not a workaround; it is the semantics matching the intent.&lt;/p&gt;

&lt;p&gt;What I would do differently: add one line of comment on that monkeypatch call explaining why &lt;code&gt;raising=False&lt;/code&gt; is there. Something like: &lt;code&gt;# raising=False because this constant may not exist yet in parallel gate worktrees&lt;/code&gt;. Future me reading that line three months from now should not have to reconstruct this entire commit message to understand it. The code is correct. The code without the comment is a trap.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>My content engine went silent for three days and the bug was in the scheduler, not the generator</title>
      <dc:creator>Deva</dc:creator>
      <pubDate>Sat, 20 Jun 2026 19:19:13 +0000</pubDate>
      <link>https://dev.to/arihantdeva/my-content-engine-went-silent-for-three-days-and-the-bug-was-in-the-scheduler-not-the-generator-4dmb</link>
      <guid>https://dev.to/arihantdeva/my-content-engine-went-silent-for-three-days-and-the-bug-was-in-the-scheduler-not-the-generator-4dmb</guid>
      <description>&lt;p&gt;&lt;code&gt;now_due_slot()&lt;/code&gt; returned &lt;code&gt;None&lt;/code&gt; on almost every tick and I had no idea for three days. The launchd plist was firing correctly. The generation code was fine. The queue had candidates. Nothing posted.&lt;/p&gt;

&lt;p&gt;The root cause: &lt;code&gt;StartCalendarInterval&lt;/code&gt; fires at exact clock times, and at some point my &lt;code&gt;config.SLOTS&lt;/code&gt; drifted away from those exact times. The cron style trigger was asking "is it 9:00am?" and the slot config said "post at 9:05am." Miss by five minutes, return &lt;code&gt;None&lt;/code&gt;, exit 0, log nothing, move on. Three days of that.&lt;/p&gt;

&lt;p&gt;The fix was to stop being clever. Switch to &lt;code&gt;StartInterval 600&lt;/code&gt; (a polling loop every 10 minutes) and let the slot logic itself decide whether to fire. Each slot gets a &lt;code&gt;slot_key&lt;/code&gt; and posts at most once per local day. Idempotency lives in the application, not in the scheduler. The scheduler just wakes up, checks, and usually does nothing.&lt;/p&gt;
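&lt;p&gt;A rough sketch of that shape, assuming slots are plain local clock times. The &lt;code&gt;SLOTS&lt;/code&gt; table, the key format, and the function bodies are illustrative guesses. The post does not show the real &lt;code&gt;config.SLOTS&lt;/code&gt; or &lt;code&gt;now_due_slot()&lt;/code&gt;.&lt;/p&gt;

```python
from datetime import datetime, time

# Hypothetical slot table; the real config.SLOTS layout is not shown in the post.
SLOTS = {"morning": time(9, 5), "evening": time(18, 30)}


def slot_key(name, now):
    """One key per slot per local day, so a slot can post at most once a day."""
    return f"{now.date().isoformat()}:{name}"


def now_due_slot(now, posted_keys):
    """Return the first slot whose time has passed today and has not posted."""
    for name, at in sorted(SLOTS.items(), key=lambda item: item[1]):
        if now.time() >= at and slot_key(name, now) not in posted_keys:
            return name
    return None


posted = set()
tick = datetime(2026, 6, 20, 9, 10)  # a 10-minute poll landing after 09:05
due = now_due_slot(tick, posted)
posted.add(slot_key(due, tick))
again = now_due_slot(datetime(2026, 6, 20, 9, 20), posted)
```

&lt;p&gt;The poll at 9:10 picks up the 9:05 slot even though the clock times never match exactly, and the next poll finds nothing to do. A real version would probably also cap how late a missed slot may still fire.&lt;/p&gt;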

&lt;p&gt;This is a lesson I have learned before in different forms: never put business logic in your cron spec. The plist (or cron entry, or GitHub Actions schedule) should know one thing: when to wake up. The application should know everything else: whether to run, what to skip, whether this tick is the right tick.&lt;/p&gt;

&lt;p&gt;On the same branch, I added content feeders to break a different kind of silence: the original post queue running dry.&lt;/p&gt;

&lt;p&gt;The previous setup pulled topic candidates only from Obsidian session logs. Sessions are grounded, they reflect real work, but they are finite and slow to accumulate. When I had a quiet week, the queue went flat.&lt;/p&gt;

&lt;p&gt;Two new sources now feed it.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sources/github_trending.py&lt;/code&gt; pulls the AI and dev trending repos daily using stdlib &lt;code&gt;urllib&lt;/code&gt; and &lt;code&gt;html.parser&lt;/code&gt;. No API key, no rate limit headache. Repos become signal candidates in the queue.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;sources/niche_pulse.py&lt;/code&gt; aggregates recurring themes across the creator pool I track. If five of the people I follow are talking about the same thing, that is a signal worth reacting to.&lt;/p&gt;

&lt;p&gt;Both feed into &lt;code&gt;blended_candidates()&lt;/code&gt;, which interleaves grounded candidates (Obsidian sessions, usable for first person work claims) and ungrounded signals (trending topics, usable for reaction posts). The generator picks the right prompt based on the &lt;code&gt;grounded&lt;/code&gt; flag on the &lt;code&gt;Candidate&lt;/code&gt;. Work claim prompt for sessions, reaction prompt for signals. Clean separation, no conditional spaghetti in the generator itself.&lt;/p&gt;
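&lt;p&gt;A sketch of how the flag can drive prompt selection. Only the names &lt;code&gt;Candidate&lt;/code&gt;, &lt;code&gt;grounded&lt;/code&gt;, and &lt;code&gt;blended_candidates&lt;/code&gt; come from the engine. The fields, the interleaving strategy, and the prompt names are assumptions.&lt;/p&gt;

```python
from dataclasses import dataclass
from itertools import chain, zip_longest


@dataclass(frozen=True)
class Candidate:
    text: str
    grounded: bool  # True: first-person work claim; False: reaction trigger


def blended_candidates(sessions, signals):
    """Interleave grounded and ungrounded candidates, keeping leftovers."""
    pairs = zip_longest(sessions, signals)
    return [c for c in chain.from_iterable(pairs) if c is not None]


def pick_prompt(candidate):
    # Hypothetical prompt names; one lane per item, never a mix.
    return "work_claim" if candidate.grounded else "reaction"


sessions = [Candidate("fixed the scheduler", True)]
signals = [Candidate("trending repo A", False), Candidate("creator theme B", False)]
lanes = [pick_prompt(c) for c in blended_candidates(sessions, signals)]
```

&lt;p&gt;Because the lane is decided by data on the candidate, the generator needs no source-specific branching.&lt;/p&gt;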

&lt;p&gt;A &lt;code&gt;POSTS_DAILY_FLOOR&lt;/code&gt; backstop handles the residual case where no slot fires and the queue still sits idle. &lt;code&gt;variance.floor_catchup&lt;/code&gt; checks pace against the floor, picks a candidate if behind, and posts with a small capped jitter so it does not fire again on the very next tick. It has its own quiet hours check, so it will not fire at 2am to catch up on a slow day.&lt;/p&gt;

&lt;p&gt;One more fix buried in the same commit: &lt;code&gt;--dry-run&lt;/code&gt; was popping the queue. There was a stale &lt;code&gt;state.save&lt;/code&gt; call that ran even in preview mode, which violated the documented peek, never pop contract. A dry run should read state and render output. It should never mutate. The call is gone.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would do differently
&lt;/h2&gt;

&lt;p&gt;Skip &lt;code&gt;StartCalendarInterval&lt;/code&gt; entirely from day one. Polling loops with application side idempotency are strictly simpler and more debuggable. The appeal of cron style scheduling is that it "just knows" when to fire. In practice, it creates a tight coupling between your scheduler config and your application config that drifts silently and leaves you staring at logs for three days.&lt;/p&gt;

&lt;p&gt;Add an explicit queue depth alert earlier. Three days is a long time for a stalled queue to go unnoticed. A single log line on every tick reporting queue depth would have surfaced the problem the first morning it happened.&lt;/p&gt;

&lt;p&gt;Ship the feeders earlier too. A content engine that relies on a single source of candidates is one quiet week away from going flat. Blend your sources, gate on quality, let the floor catch the rest.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Surfacing claude -p Timeouts as RuntimeError (and Why It Took Three Tries)</title>
      <dc:creator>Deva</dc:creator>
      <pubDate>Fri, 19 Jun 2026 14:41:10 +0000</pubDate>
      <link>https://dev.to/arihantdeva/surfacing-claude-p-timeouts-as-runtimeerror-and-why-it-took-three-tries-6ph</link>
      <guid>https://dev.to/arihantdeva/surfacing-claude-p-timeouts-as-runtimeerror-and-why-it-took-three-tries-6ph</guid>
      <description>&lt;p&gt;Two fix iterations. Both correct. Both landed in the wrong worktree and were cleaned up before anyone noticed. That is how a three line error handling patch became a P6 recovery item.&lt;/p&gt;

&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;_claude_p&lt;/code&gt; shells out to &lt;code&gt;claude -p&lt;/code&gt; via &lt;code&gt;subprocess.run&lt;/code&gt;. When the subprocess times out, Python raises &lt;code&gt;subprocess.TimeoutExpired&lt;/code&gt;. Nothing was catching it. The exception propagated up through the drafting layer and crashed the caller with a traceback pointing at subprocess internals instead of anything meaningful about what failed.&lt;/p&gt;

&lt;p&gt;Callers should not need to import &lt;code&gt;subprocess&lt;/code&gt; just to handle a drafting timeout. They want a &lt;code&gt;RuntimeError&lt;/code&gt; with a message they can log and move on.&lt;/p&gt;

&lt;h2&gt;
  
  
  The fix
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;cmd&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="p"&gt;,...)&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="n"&gt;subprocess&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;TimeoutExpired&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;RuntimeError&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;claude -p drafting timed out after &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;timeout&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;s&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Three lines. The tradeoff: &lt;code&gt;TimeoutExpired&lt;/code&gt; carries the original command and the timeout value as attributes; &lt;code&gt;RuntimeError&lt;/code&gt; does not. I decided that was acceptable. Callers log the message. The message has the timeout duration in it. If retry logic needs a different ceiling, the call site already has that context.&lt;/p&gt;
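&lt;p&gt;If a caller ever does need the original command or timeout, exception chaining keeps them reachable without exposing &lt;code&gt;subprocess&lt;/code&gt; in the caller's &lt;code&gt;except&lt;/code&gt; clause. This is a variant sketch, not the patch that shipped. &lt;code&gt;run_with_timeout&lt;/code&gt; and the sleeping child command are stand-ins for illustration.&lt;/p&gt;

```python
import subprocess
import sys


def run_with_timeout(cmd, timeout):
    """Run cmd, translating a subprocess timeout into RuntimeError."""
    try:
        return subprocess.run(cmd, timeout=timeout, capture_output=True, text=True)
    except subprocess.TimeoutExpired as exc:
        # "from exc" keeps the original exception reachable as __cause__.
        raise RuntimeError(f"command timed out after {timeout}s") from exc


# Stand-in command: a Python child that sleeps longer than the timeout.
slow_cmd = [sys.executable, "-c", "import time; time.sleep(5)"]
cause = None
try:
    run_with_timeout(slow_cmd, timeout=0.5)
except RuntimeError as err:
    cause = err.__cause__
```

&lt;p&gt;Callers still catch a plain &lt;code&gt;RuntimeError&lt;/code&gt;. The &lt;code&gt;TimeoutExpired&lt;/code&gt; instance, with its &lt;code&gt;cmd&lt;/code&gt; and &lt;code&gt;timeout&lt;/code&gt; attributes, rides along for anyone who inspects the cause.&lt;/p&gt;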

&lt;h2&gt;
  
  
  Why it took three attempts
&lt;/h2&gt;

&lt;p&gt;I delegated the first two fixes to free debug workers running in isolated git worktrees. Both workers implemented the change correctly, ran the full test suite, and exited clean. I saw "tests passing" in the task output and assumed the fix had propagated to the main tree.&lt;/p&gt;

&lt;p&gt;It had not. The changes existed in throwaway branches inside throwaway worktrees. Neither branch was merged. The main tree was untouched. The fix evaporated twice.&lt;/p&gt;

&lt;p&gt;Third attempt: opened the file directly, made the edit, ran the tests, committed. Four minutes.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would do differently
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Do not delegate a three line patch.&lt;/strong&gt; The overhead of spinning up a subprocess worker, managing worktree lifecycle, and parsing task output exceeds the cost of the edit itself. Reserve agent dispatch for things that genuinely benefit from parallelism or isolation. A single file fix in a known location does not meet that bar.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Treat agent output and repo state as distinct things.&lt;/strong&gt; A worker reporting passing tests tells you the logic is correct in its branch. It tells you nothing about your working tree. If you delegate a code change and care where it lands, verify the diff in the target tree before closing the task. Passing in the worker is necessary but not sufficient.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;_claude_p&lt;/code&gt; function now raises &lt;code&gt;RuntimeError&lt;/code&gt; on timeout with the duration in the message. The upstream catch block logs it and skips the draft. The pipeline recovers cleanly instead of crashing on a subprocess internal.&lt;/p&gt;

</description>
      <category>claude</category>
      <category>cli</category>
      <category>python</category>
      <category>softwareengineering</category>
    </item>
    <item>
      <title>Three Test Levels for One CLI Subcommand</title>
      <dc:creator>Deva</dc:creator>
      <pubDate>Fri, 19 Jun 2026 11:35:26 +0000</pubDate>
      <link>https://dev.to/arihantdeva/three-test-levels-for-one-cli-subcommand-44mo</link>
      <guid>https://dev.to/arihantdeva/three-test-levels-for-one-cli-subcommand-44mo</guid>
      <description>&lt;p&gt;7 out of 7 tests pass. That number is boring and exactly correct.&lt;/p&gt;

&lt;p&gt;Adding a CLI subcommand sounds trivial until you count the actual failure modes: the subparser is not wired up, the func is not registered, the dry run flag is silently swallowed, the argument parser crashes with a useless error. Any of those can bite you. The only way to know they do not is to test at three distinct levels, which is what this commit actually does.&lt;/p&gt;

&lt;p&gt;Here is what wiring &lt;code&gt;warmup eval&lt;/code&gt; looks like in practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  The parser registration
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;argparse&lt;/code&gt; subcommands get wired in &lt;code&gt;__main__.py&lt;/code&gt; by calling &lt;code&gt;add_parser&lt;/code&gt; on the subparsers object, then setting &lt;code&gt;set_defaults(func=cmd_warmup_eval)&lt;/code&gt;. That last line is the critical one. If you forget it, &lt;code&gt;args.func&lt;/code&gt; does not exist and every invocation blows up with an &lt;code&gt;AttributeError&lt;/code&gt; that tells you nothing useful. The unit test covering this is two lines but it is the most important one in the file.&lt;/p&gt;
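&lt;p&gt;A self-contained sketch of that wiring. The placeholder command body and the hyphenated &lt;code&gt;warmup-eval&lt;/code&gt; spelling are assumptions. The argparse calls themselves are the standard ones.&lt;/p&gt;

```python
import argparse


def cmd_warmup_eval(args):
    # Placeholder body; the real command is not shown in the post.
    return "dry" if args.dry_run else "live"


def build_parser():
    parser = argparse.ArgumentParser(prog="x_engine")
    subparsers = parser.add_subparsers(dest="command", required=True)
    warmup = subparsers.add_parser("warmup-eval")
    warmup.add_argument("--dry-run", action="store_true")
    # Without this line, args.func is missing and dispatch fails.
    warmup.set_defaults(func=cmd_warmup_eval)
    return parser


args = build_parser().parse_args(["warmup-eval", "--dry-run"])
result = args.func(args)
```

&lt;p&gt;The registration test then reduces to asserting that &lt;code&gt;args.func is cmd_warmup_eval&lt;/code&gt; after parsing.&lt;/p&gt;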

&lt;h2&gt;
  
  
  The dry run flag
&lt;/h2&gt;

&lt;p&gt;&lt;code&gt;--dry-run&lt;/code&gt; is just &lt;code&gt;add_argument('--dry-run', action='store_true')&lt;/code&gt;. Fifteen seconds of work at the parser layer. The real discipline question is whether the flag is actually honored inside &lt;code&gt;cmd_warmup_eval&lt;/code&gt; or silently accepted and then ignored. Wiring the flag is not the same as plumbing it through.&lt;/p&gt;

&lt;h2&gt;
  
  
  The test triangle
&lt;/h2&gt;

&lt;p&gt;Three types, all necessary, none redundant.&lt;/p&gt;

&lt;p&gt;Unit tests confirm the command function behaves correctly given controlled inputs, without touching the full argument parser or the shell.&lt;/p&gt;

&lt;p&gt;Subparser tests confirm the parser tree is wired correctly: that &lt;code&gt;python -m x_engine warmup eval&lt;/code&gt; dispatches to &lt;code&gt;cmd_warmup_eval&lt;/code&gt; and not to the default error handler. This is where &lt;code&gt;set_defaults&lt;/code&gt; either proves itself or does not.&lt;/p&gt;

&lt;p&gt;Subprocess acceptance tests confirm the whole thing actually runs when invoked from a shell, including the import chain, the argument parsing, and exit codes. This is the test that mirrors what a real user does. It is also the one that catches the failure mode where unit tests pass fine and then a missing import blows up at runtime because the module structure changed under you.&lt;/p&gt;

&lt;p&gt;Skip any of the three and you have false confidence somewhere in the stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would do differently
&lt;/h2&gt;

&lt;p&gt;The &lt;code&gt;--dry-run&lt;/code&gt; flag is wired at the parser and passes into &lt;code&gt;cmd_warmup_eval&lt;/code&gt;. What the subprocess acceptance test checks is that the command exits cleanly with &lt;code&gt;--dry-run&lt;/code&gt;. It does not assert that &lt;code&gt;--dry-run&lt;/code&gt; actually changes behavior versus a live run. That means the flag could be a no op and the test still passes with a green check.&lt;/p&gt;

&lt;p&gt;The fix is one extra assertion: capture stdout or a side effect that differs between dry and live modes, and verify the diff. Without it, the test proves the flag is accepted, not that it is respected. Those are different guarantees.&lt;/p&gt;
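&lt;p&gt;A sketch of that extra assertion, run in process for brevity rather than through a subprocess. The command body here is hypothetical: it assumes a live run pops a queue while a dry run only peeks.&lt;/p&gt;

```python
def cmd_warmup_eval(dry_run, queue):
    """Hypothetical command body: a live run pops the queue, a dry run peeks."""
    if not queue:
        return None
    return queue[0] if dry_run else queue.pop(0)


def run_both_modes():
    dry_queue, live_queue = ["a", "b"], ["a", "b"]
    cmd_warmup_eval(dry_run=True, queue=dry_queue)
    cmd_warmup_eval(dry_run=False, queue=live_queue)
    return dry_queue, live_queue


dry_queue, live_queue = run_both_modes()
# The side effect must differ between modes, or the flag is a no-op.
assert dry_queue == ["a", "b"]
assert live_queue == ["b"]
```

&lt;p&gt;If someone later drops the &lt;code&gt;dry_run&lt;/code&gt; branch, the first assertion fails instead of staying green.&lt;/p&gt;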

&lt;p&gt;7/7 is the right score. Writing fewer tests would have been faster and worse.&lt;/p&gt;

</description>
      <category>cli</category>
      <category>python</category>
      <category>softwaredevelopment</category>
      <category>testing</category>
    </item>
    <item>
      <title>Adding a warm up ceiling to publish.run: the fix that took two phases</title>
      <dc:creator>Deva</dc:creator>
      <pubDate>Thu, 18 Jun 2026 13:33:30 +0000</pubDate>
      <link>https://dev.to/arihantdeva/adding-a-warm-up-ceiling-to-publishrun-the-fix-that-took-two-phases-4n8a</link>
      <guid>https://dev.to/arihantdeva/adding-a-warm-up-ceiling-to-publishrun-the-fix-that-took-two-phases-4n8a</guid>
      <description>&lt;p&gt;Why did my warm up phase let posts slip past the ceiling that was already working for comments and conversations?&lt;/p&gt;

&lt;p&gt;That is the question that turned P7 into a recovery phase.&lt;/p&gt;

&lt;p&gt;The x engine has a warm up system. When an account is fresh or recently resumed, it runs through phases with hard ceilings on how many actions are allowed per phase. Comments respect these ceilings. Conversations respect them. Posts did not. A new account could blow through the warm up budget on posts and arrive at the comment and conversation phases already flagged for overuse.&lt;/p&gt;

&lt;p&gt;The fix is obvious in hindsight: add the warmup ceiling check at the top of &lt;code&gt;publish.run()&lt;/code&gt;, right after &lt;code&gt;s = state.load()&lt;/code&gt;, before any slot logic runs. Two import additions, one short circuit return with a &lt;code&gt;warmup_ceiling&lt;/code&gt; outcome. Comments and conversations already do exactly this. Posts just never got the same treatment.&lt;/p&gt;
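&lt;p&gt;Roughly, the short circuit looks like this. The state shape, the ceiling value, and the counter names are assumptions for illustration. Only the &lt;code&gt;warmup_ceiling&lt;/code&gt; outcome and the position of the check come from the change itself.&lt;/p&gt;

```python
# Hypothetical ceiling and state shape; the real state module is not shown.
WARMUP_POST_CEILING = 3


def run(state):
    """Return an outcome string; short-circuit before any slot logic."""
    if state.get("warmup_active") and state.get("warmup_posts", 0) >= WARMUP_POST_CEILING:
        return "warmup_ceiling"
    # Slot selection and publishing would happen here.
    state["warmup_posts"] = state.get("warmup_posts", 0) + 1
    return "posted"


state = {"warmup_active": True, "warmup_posts": 2}
outcomes = [run(state), run(state)]
```

&lt;p&gt;The first call still posts because the account is one under the ceiling. The second returns early without touching slot logic.&lt;/p&gt;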

&lt;p&gt;But the reason it took two phases to land was not the logic. It was the gate.&lt;/p&gt;

&lt;p&gt;P6 was supposed to deliver this change. The gate reviewed a diff. The diff that reached the gate belonged to P5, not P6. Same worktree mixup failure mode that had already caught us one phase earlier. The reviewer signed off on work that was already done, and P6 shipped nothing new. P7 opened as a recovery with the same task still on the board.&lt;/p&gt;

&lt;p&gt;This is a process failure, not a code problem. When automated orchestration feeds diffs to a review gate, it has to be careful about which worktree it is reading from. If phases run in parallel or previous phase artifacts are still sitting on disk, the gate can end up reviewing stale work. The reviewer cannot catch this unless the pipeline explicitly attaches provenance to each diff.&lt;/p&gt;

&lt;p&gt;What I would do differently: tag every diff that enters the gate with the phase identifier and a content hash. The gate refuses to review if the tag does not match the expected phase. One line of metadata, two avoided recovery phases. Right now the gate is smart but the pipeline feeding it is fragile. Making the pipeline robust to stale artifacts is cheaper than chasing this failure mode after it recurs.&lt;/p&gt;
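&lt;p&gt;A minimal sketch of that provenance check, assuming diffs travel as text plus a small metadata dict. &lt;code&gt;tag_diff&lt;/code&gt; and &lt;code&gt;gate_accepts&lt;/code&gt; are hypothetical names. The pipeline does not have them yet.&lt;/p&gt;

```python
import hashlib


def tag_diff(phase, diff_text):
    """Attach the phase identifier and a content hash to a diff."""
    digest = hashlib.sha256(diff_text.encode("utf-8")).hexdigest()
    return {"phase": phase, "sha256": digest, "diff": diff_text}


def gate_accepts(tagged, expected_phase):
    """Refuse diffs from the wrong phase or whose content no longer matches."""
    digest = hashlib.sha256(tagged["diff"].encode("utf-8")).hexdigest()
    return tagged["phase"] == expected_phase and tagged["sha256"] == digest


p5 = tag_diff("P5", "+ ceiling for comments\n")
stale_ok = gate_accepts(p5, expected_phase="P6")
fresh_ok = gate_accepts(tag_diff("P6", "+ ceiling for posts\n"), expected_phase="P6")
```

&lt;p&gt;A P5 diff presented to the P6 gate is rejected before any reviewer spends time on it.&lt;/p&gt;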

&lt;p&gt;The actual code change is not the interesting part of this story. &lt;code&gt;run()&lt;/code&gt; loads state, checks whether warm up is active and whether the ceiling has been hit, and exits early if so. Same pattern already in comments and conversations. Copying it with the right module took one commit.&lt;/p&gt;

&lt;p&gt;The interesting part is that the ceiling existed, worked in two places, and was simply never wired into the third. That gap survived multiple review cycles because the gate kept seeing the wrong diff.&lt;/p&gt;

&lt;p&gt;Systematic gaps like this do not close through code review alone. They close when the pipeline enforces consistency as a structural invariant rather than relying on a reviewer to notice that three similar things should all behave the same way. That is the real fix, and it is not in this commit.&lt;/p&gt;

</description>
      <category>automation</category>
      <category>backend</category>
      <category>devjournal</category>
      <category>socialmedia</category>
    </item>
    <item>
      <title>614 tests passing, one new rule: attack the argument, never the person</title>
      <dc:creator>Deva</dc:creator>
      <pubDate>Thu, 18 Jun 2026 12:10:55 +0000</pubDate>
      <link>https://dev.to/arihantdeva/614-tests-passing-one-new-rule-attack-the-argument-never-the-person-djp</link>
      <guid>https://dev.to/arihantdeva/614-tests-passing-one-new-rule-attack-the-argument-never-the-person-djp</guid>
      <description>&lt;p&gt;614 tests green, no regressions. That is what the guardrail cost in test churn: zero.&lt;/p&gt;

&lt;p&gt;The problem was straightforward, even if the fix required touching five files across four packages. I run content engines on X, Bluesky, Threads, and LinkedIn. All of them generate posts and comments, all of them go through a two layer quality gate (regex lint floor plus a Claude critic). None of them had an explicit rule against personal attacks.&lt;/p&gt;

&lt;p&gt;That sounds obvious in hindsight. You build a system that generates sharp, opinionated content and you assume the model knows not to be cruel. Sometimes it does. Sometimes it produces something that attacks the person instead of the idea, and "the person" is exactly what you cannot touch. Critique the argument, the claim, the track record, the logic. Never the human.&lt;/p&gt;

&lt;p&gt;So I added it at two seams.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Generation.&lt;/strong&gt; In &lt;code&gt;core/voice_prompts.py&lt;/code&gt;, inside &lt;code&gt;_hard_rules()&lt;/code&gt;, a universal rule now ships with every &lt;code&gt;build_voice_block()&lt;/code&gt; call. Every engine already imports this, so one edit covers all four platforms. The rule is explicit: attack ideas, arguments, logic, and track records as hard as you want. No insults, no name calling, no slurs, no content that targets a person or group.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Critic.&lt;/strong&gt; The second seam is the quality gate that reviews generated content before it posts. In &lt;code&gt;quality.py&lt;/code&gt; for X, Bluesky, and Threads: &lt;code&gt;pass=false&lt;/code&gt; on a personal attack, same gate as the value or score floor. In LinkedIn's &lt;code&gt;quality.py&lt;/code&gt; and &lt;code&gt;critic.py&lt;/code&gt;: a critical issue flag in the post rubric that caps the score at 3 and forces a revise verdict, plus the same check in the comment rubric.&lt;/p&gt;

&lt;p&gt;The tradeoff I chose deliberately: prompt only, no control flow change. The alternative was adding a separate classification step, something like a lightweight binary classifier that runs before the critic and hard kills any content with a personal attack detected. That would be more robust, especially as an automated tripwire that does not depend on the model following instructions correctly.&lt;/p&gt;

&lt;p&gt;I went prompt only for two reasons. First, I already have five interrelated quality layers and adding a sixth classification step increases latency and cost on every generation cycle. Second, the critic prompt already has a boolean pass gate. A personal attack trips it to false, which is the same outcome as a hard kill, just one step later in the pipeline. The failure mode I accepted: a personal attack could score high on other dimensions and slip through if the critic misses it. That is a real risk, not a theoretical one.&lt;/p&gt;

&lt;p&gt;What I would do differently: add a targeted regex scan after generation for the most obvious personal attack patterns, before the critic even runs. Slurs and direct insults have a finite vocabulary. A compact blocklist that short circuits immediately would catch the worst cases without a full critic pass, and I could add it without touching any of the existing critic logic. That is the next step. The prompt rule handles the middle ground; the regex handles the floor.&lt;/p&gt;
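
&lt;p&gt;A rough sketch of what that floor could look like, assuming a plain word list compiled into one pattern. The vocabulary here is a mild placeholder and &lt;code&gt;lint_floor_blocks&lt;/code&gt; is a hypothetical name; the real list and the wiring into the existing lint layer would differ:&lt;/p&gt;

```python
import re

# Placeholder vocabulary for illustration only; a real list would be curated.
BLOCKLIST = ("idiot", "moron", "loser", "clown")

# Word boundaries keep "clown" from matching inside longer unrelated words.
_PATTERN = re.compile(
    r"\b(?:" + "|".join(map(re.escape, BLOCKLIST)) + r")\b",
    re.IGNORECASE,
)


def lint_floor_blocks(text):
    """Return True when the draft should be killed before the critic runs."""
    return _PATTERN.search(text) is not None


print(lint_floor_blocks("This argument ignores the benchmark data."))  # False
print(lint_floor_blocks("Only an idiot would ship this."))              # True
```

&lt;p&gt;A list like this only covers the obvious floor. Anything phrased without a listed word still depends on the critic.&lt;/p&gt;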

&lt;p&gt;The core package carries 40 tests, platforms carry 574. Both green, no changes needed to any test. Prompt only changes have that advantage: they do not change the control flow the tests were written against, so the test suite validates the surrounding logic without modification. When you do need to test a prompt change, you test it by exercising the actual generation path, not by writing unit assertions against prompt text.&lt;/p&gt;

&lt;p&gt;That is the build. One hard rule, five files, 614 tests still green.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>claude</category>
      <category>socialmedia</category>
      <category>testing</category>
    </item>
    <item>
      <title>P8: Plugging the warm up ceiling hole in comments.tick</title>
      <dc:creator>Deva</dc:creator>
      <pubDate>Wed, 17 Jun 2026 11:59:38 +0000</pubDate>
      <link>https://dev.to/arihantdeva/p8-plugging-the-warm-up-ceiling-hole-in-commentstick-53k4</link>
      <guid>https://dev.to/arihantdeva/p8-plugging-the-warm-up-ceiling-hole-in-commentstick-53k4</guid>
      <description>&lt;p&gt;The tick function kept calling &lt;code&gt;client.post_reply&lt;/code&gt; even after the warm up ceiling for the day was already hit. Nothing gated it. If the scheduler fired twice in a row, or someone called &lt;code&gt;tick, n 5&lt;/code&gt; when only two budget slots remained, the ceiling was advisory at best.&lt;/p&gt;

&lt;p&gt;This is the kind of bug that does not blow up loudly. It just silently overshoots your own rate limits during the exact window when overshooting matters most: the early warm up phase where every write is supposed to be conservative.&lt;/p&gt;

&lt;h2&gt;
  
  
  What P8 actually needed
&lt;/h2&gt;

&lt;p&gt;Two distinct failure modes required two distinct fixes.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Natural ticks running past the ceiling.&lt;/strong&gt; A natural tick is the scheduler's routine fire with no explicit &lt;code&gt;n&lt;/code&gt;. It has no business posting anything once &lt;code&gt;warmup.over_ceiling(s)&lt;/code&gt; returns True. So the fix there is a hard early return:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;explicit_n&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;warmup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;over_ceiling&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tick: warmup ceiling reached, skipping (over_ceiling)&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No clamp, no partial run. Just stop. Returning an empty list means callers see a clean no op, not an error.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Explicit n ticks requesting more than what remains.&lt;/strong&gt; Explicit &lt;code&gt;n&lt;/code&gt; is for manual runs and tests. You might call &lt;code&gt;tick&lt;/code&gt; with &lt;code&gt;n=5&lt;/code&gt; legitimately, but if only two slots remain in today's warm up budget you should get two replies, not five. So after the early return check, &lt;code&gt;n&lt;/code&gt; gets clamped:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;warmup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;enabled&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;wu_remaining&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;warmup&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;remaining&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;wu_remaining&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tick: warmup ceiling reached, nothing remaining this tick&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="n"&gt;n&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;wu_remaining&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This block runs for both explicit and natural ticks when warm up is on. A natural tick that is already over the ceiling has returned at the &lt;code&gt;over_ceiling&lt;/code&gt; guard, so the zero remaining branch here is effectively the explicit n case; a natural tick that still has budget falls through and simply has its n clamped. The overlap is intentional: &lt;code&gt;client.post_reply&lt;/code&gt; should never be reachable when &lt;code&gt;wu_remaining &amp;lt;= 0&lt;/code&gt;, full stop.&lt;/p&gt;
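
&lt;p&gt;To see both guards working together, here is a self contained sketch. The &lt;code&gt;Warmup&lt;/code&gt; class, the &lt;code&gt;default_n&lt;/code&gt; parameter, and the list of strings standing in for &lt;code&gt;client.post_reply&lt;/code&gt; are all assumptions made so the example runs on its own; the real module takes a state argument and looks different:&lt;/p&gt;

```python
class Warmup:
    """Hypothetical stand-in for the warmup module, holding one day's budget."""

    def __init__(self, ceiling, used, is_enabled=True):
        self.ceiling = ceiling
        self.used = used
        self.is_enabled = is_enabled

    def enabled(self):
        return self.is_enabled

    def remaining(self):
        return max(self.ceiling - self.used, 0)

    def over_ceiling(self):
        return self.enabled() and self.remaining() == 0


def tick(warmup, n=None, default_n=3):
    explicit_n = n is not None
    if not explicit_n and warmup.over_ceiling():
        return []          # policy: natural ticks never post past the ceiling
    n = n if explicit_n else default_n
    if warmup.enabled():
        left = warmup.remaining()
        if left == 0:
            return []      # nothing left, even for an explicit request
        n = min(n, left)   # safety net: clamp to what the budget allows
    replies = [f"reply-{i}" for i in range(n)]  # stands in for client.post_reply
    warmup.used += len(replies)
    return replies


wu = Warmup(ceiling=5, used=3)
print(len(tick(wu, n=5)))  # 2
print(tick(wu))            # []
print(tick(wu, n=5))       # []
```

&lt;p&gt;The first call asks for five and gets the two that remain. After that, both the natural tick and the explicit one come back empty.&lt;/p&gt;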

&lt;h2&gt;
  
  
  The tradeoff worth naming
&lt;/h2&gt;

&lt;p&gt;You could collapse both guards into one. Check &lt;code&gt;remaining&lt;/code&gt;, if it is zero return early, otherwise clamp n. Fewer code paths, same outcome.&lt;/p&gt;

&lt;p&gt;I kept them separate because they express different intent. The early return on natural ticks is a policy decision: the scheduler should not post at all past the ceiling, even if technically one slot somehow remained. The clamp on explicit n is a safety net: a human or test runner asking for more than budget allows should get whatever is left, not an error and not a silent skip.&lt;/p&gt;

&lt;p&gt;Mixing those two semantics into one block makes the policy harder to read and harder to change later if you want to tighten one without touching the other.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would do differently
&lt;/h2&gt;

&lt;p&gt;Putting the ceiling check inside &lt;code&gt;tick()&lt;/code&gt; works, but it is the wrong altitude for a constraint this important. The warm up ceiling is a daily cap that belongs at the scheduler layer, checked before tick is even called. If the launchd job or whatever is driving the clock knew about the ceiling, it could skip the tick entirely and save the overhead of spinning up the target pool, running discovery, and doing all the pre flight checks before hitting the guard.&lt;/p&gt;

&lt;p&gt;Inside &lt;code&gt;tick()&lt;/code&gt; is the right fallback, not the primary line of defense. The primary should be one level up. P8 fixed the hole; the cleaner version of this fix closes it earlier in the call chain and leaves &lt;code&gt;tick()&lt;/code&gt; as the last resort guard, not the first.&lt;/p&gt;

&lt;p&gt;Still shipping P8 as is. The ceiling is enforced where it matters and &lt;code&gt;client.post_reply&lt;/code&gt; cannot be reached once budget is exhausted. That is the actual requirement.&lt;/p&gt;

</description>
      <category>automation</category>
      <category>backend</category>
      <category>devjournal</category>
      <category>programming</category>
    </item>
    <item>
      <title>Designing a 10 Week Trust Rebuild Governor After Getting Shadowbanned</title>
      <dc:creator>Deva</dc:creator>
      <pubDate>Wed, 17 Jun 2026 10:37:55 +0000</pubDate>
      <link>https://dev.to/arihantdeva/designing-a-10-week-trust-rebuild-governor-after-getting-shadowbanned-1f53</link>
      <guid>https://dev.to/arihantdeva/designing-a-10-week-trust-rebuild-governor-after-getting-shadowbanned-1f53</guid>
      <description>&lt;p&gt;The &lt;code&gt;variance.floor_catchup&lt;/code&gt; bypass has been in the codebase since day one. It was supposed to make posting behavior look more human by catching up missed variance when activity was low. What it actually did was create unpredictable bursts at exactly the wrong moments.&lt;/p&gt;

&lt;p&gt;That is the first thing I retired when I sat down to design a recovery architecture after @DevaBuilds got shadowbanned.&lt;/p&gt;

&lt;h2&gt;
  
  
  The core problem
&lt;/h2&gt;

&lt;p&gt;A freshly shadowbanned account is not just rate limited. It is actively distrusted by the platform's behavioral scoring model. Posting at normal volume from day one does not help, it probably extends the penalty. The account needs to demonstrate a sustained low noise signal before the model backs off. That takes weeks, and the pacing has to be enforced by the code, not by willpower.&lt;/p&gt;

&lt;h2&gt;
  
  
  The architecture
&lt;/h2&gt;

&lt;p&gt;The solution is a phase indexed write ceiling: 0 writes on day one of Phase 0, then 2, then 5, then 15, 50, 120, and eventually 200 at full maturity. About 10 weeks end to end on the optimistic path. The ceiling is not advisory. The publish loop checks the current phase on every tick and refuses to exceed it regardless of how much is queued.&lt;/p&gt;
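
&lt;p&gt;As a sketch, the ceiling can be a lookup table that the publish loop consults on every tick. Mapping each number in the ramp to its own phase index is my simplification here, and &lt;code&gt;ceiling_for&lt;/code&gt; and &lt;code&gt;writes_allowed&lt;/code&gt; are hypothetical names:&lt;/p&gt;

```python
# Daily write ceilings, one per phase index (assumed mapping of the ramp above).
PHASE_CEILINGS = (0, 2, 5, 15, 50, 120, 200)


def ceiling_for(phase):
    """Look up the ceiling, treating any phase past the end as full maturity."""
    index = min(max(phase, 0), len(PHASE_CEILINGS) - 1)
    return PHASE_CEILINGS[index]


def writes_allowed(phase, written_today, queued):
    """How many queued writes this tick may publish without passing the ceiling."""
    budget = max(ceiling_for(phase) - written_today, 0)
    return min(queued, budget)


print(writes_allowed(phase=2, written_today=3, queued=10))    # 2
print(writes_allowed(phase=0, written_today=0, queued=10))    # 0
print(writes_allowed(phase=6, written_today=190, queued=40))  # 10
```

&lt;p&gt;The queue size only ever lowers the result. It never raises it past the budget.&lt;/p&gt;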

&lt;p&gt;Phase 0 is also gated on a &lt;code&gt;fetch_metrics&lt;/code&gt; KeyError fix that had been silently corrupting engagement telemetry. That mattered more than it sounds. You cannot build a reliable auto advance evaluator on broken signal, so the bug fix became a hard prerequisite rather than a cleanup item.&lt;/p&gt;

&lt;p&gt;The auto advance evaluator runs once per day. It looks at write success rates and any code 226 responses, which is the platform's write limit signal. A 226 triggers a one phase rollback and resets the timer. No human intervention needed. The system slows itself down when the platform pushes back.&lt;/p&gt;
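
&lt;p&gt;A simplified sketch of that daily decision. Only the 226 rollback and the timer reset come from the design above; the success rate threshold, the seven day dwell time, and the &lt;code&gt;WarmupState&lt;/code&gt; shape are assumptions for illustration:&lt;/p&gt;

```python
from dataclasses import dataclass

WRITE_LIMIT_CODE = 226   # the platform's write limit signal
MIN_SUCCESS_RATE = 0.95  # hypothetical threshold
DAYS_PER_PHASE = 7       # hypothetical dwell time before advancing
MAX_PHASE = 6


@dataclass
class WarmupState:
    phase: int
    days_in_phase: int


def evaluate_day(state, response_codes, success_rate):
    """Run once per day: roll back on any 226, otherwise advance when ready."""
    if WRITE_LIMIT_CODE in response_codes:
        # One phase back, timer reset, never below phase 0.
        return WarmupState(phase=max(state.phase - 1, 0), days_in_phase=0)
    days = state.days_in_phase + 1
    if days >= DAYS_PER_PHASE and success_rate >= MIN_SUCCESS_RATE:
        return WarmupState(phase=min(state.phase + 1, MAX_PHASE), days_in_phase=0)
    return WarmupState(phase=state.phase, days_in_phase=days)


print(evaluate_day(WarmupState(3, 4), [200, 226], 0.9))  # WarmupState(phase=2, days_in_phase=0)
print(evaluate_day(WarmupState(3, 6), [200, 200], 1.0))  # WarmupState(phase=4, days_in_phase=0)
```

&lt;p&gt;Checking for 226 first means a single write limit response outranks an otherwise perfect day.&lt;/p&gt;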

&lt;h2&gt;
  
  
  The tradeoff I spent the most time on
&lt;/h2&gt;

&lt;p&gt;One phase rollback on any 226 is conservative. An account that is genuinely recovering can catch a single bad hour without it being representative of the overall trend. But the alternative, treating 226s as soft warnings or averaging them out, is how you accumulate a full write block. I chose conservative. The warmup takes 10 weeks in the best case, and losing a week to a cautious rollback is far cheaper than restarting from zero.&lt;/p&gt;

&lt;p&gt;The evaluator is also backed by a ceiling guard in the upstream publish loop, which I added before any of this: if the account is in warmup, the daily write count hard stops regardless of what slots are scheduled. Defense in depth. The guard should never be needed, but if the phase state gets corrupted, it catches the overflow before it reaches the API.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I would do differently
&lt;/h2&gt;

&lt;p&gt;I would instrument the behavioral scoring signal earlier. The evaluator currently runs on write success rates and explicit 226 codes. Those are lagging indicators. The platform's trust model almost certainly updates on engagement patterns too, not just write volume. Reply rates and impressions per post in the early phases are probably leading indicators of whether the account is recovering or still penalized. Building the evaluator without that signal was a reasonable first pass, but there is a real gap there.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;variance.floor_catchup&lt;/code&gt; retirement is the thing I should have done months before any of this. The bypass existed because I wanted to hit weekly volume targets even when a few slots were missed. What the logic missed is that those targets were arbitrary from the platform's perspective. It does not see targets, it sees patterns, and a catchup burst reads like automation regardless of what the code comments say about humanization. The only honest variance is real variance, and real variance means sometimes posting less than planned.&lt;/p&gt;

&lt;p&gt;Engineering lesson worth carrying: anything in the codebase labeled a humanization bypass is probably the opposite of what it claims to be.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>automation</category>
      <category>socialmedia</category>
      <category>systemdesign</category>
    </item>
    <item>
      <title>Your feature flag defaults are backwards in tests</title>
      <dc:creator>Deva</dc:creator>
      <pubDate>Tue, 16 Jun 2026 16:22:37 +0000</pubDate>
      <link>https://dev.to/arihantdeva/your-feature-flag-defaults-are-backwards-in-tests-5c9c</link>
      <guid>https://dev.to/arihantdeva/your-feature-flag-defaults-are-backwards-in-tests-5c9c</guid>
      <description>&lt;p&gt;Most test suites get feature flag isolation backwards. The instinct is to opt in to a feature when a test needs it. The correct default is to force the feature off for every test and make the feature's own tests opt back in.&lt;/p&gt;

&lt;p&gt;Here is the concrete version. I am building a warmup phase into the publishing engine, the part that decides whether a new account needs a slow ramp before full volume. The feature has its own config flag, &lt;code&gt;WARMUP_ENABLED&lt;/code&gt;, and it touches state in ways that bleed across tests: writes to a ledger, mutates a counter, gates publish decisions.&lt;/p&gt;

&lt;p&gt;The first version of the test suite left &lt;code&gt;WARMUP_ENABLED&lt;/code&gt; at its production default, &lt;code&gt;True&lt;/code&gt;. Tests that did not care about warmup were running against a warmup enabled engine. Most passed anyway because the warmup state happened to be inert for their inputs. That is luck, not design. Any time the warmup logic grows more aggressive or initial state shifts, those tests start failing for reasons that have nothing to do with what they are actually testing.&lt;/p&gt;

&lt;p&gt;The fix is two parts. First, add &lt;code&gt;WARMUP_ENABLED = False&lt;/code&gt; to the test config so the module level default is already off. Second, wire up an autouse conftest fixture to make it impossible for a test to accidentally inherit a dirty enabled state:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# conftest.py
&lt;/span&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pytest&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;x_engine&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;

&lt;span class="nd"&gt;@pytest.fixture&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;autouse&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;disable_warmup&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monkeypatch&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;monkeypatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WARMUP_ENABLED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;autouse=True&lt;/code&gt; means every test in the suite gets warmup disabled before it runs. No per test annotation, no risk of forgetting. The warmup tests themselves opt back in explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# test_warmup.py
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;x_engine&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;config&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;test_warmup_evaluates_correctly&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;monkeypatch&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;monkeypatch&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;setattr&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;config&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;WARMUP_ENABLED&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;# now the feature is live; test it
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The real tradeoff: you are encoding the assumption that off is the right baseline for most tests. If your feature is deeply load bearing and most tests genuinely need it on, flip the logic. But for warmup, a gate that the rest of the engine ignores unless it fires, off is the correct starting point. The tests that care about the feature should say so explicitly. That explicitness is the point.&lt;/p&gt;

&lt;p&gt;What I would do differently: wire this fixture up the same day the feature flag is created, not after the suite accumulates tests. I added &lt;code&gt;WARMUP_ENABLED&lt;/code&gt; in an earlier commit and let a handful of tests pile up before patching the suite wide behavior. That gap cost me debugging time on a failure that was impossible to explain until I traced it back to warmup state leaking through.&lt;/p&gt;

&lt;p&gt;The rule I am now following: every feature flag gets a suite wide opt out fixture on the same commit it gets added to config. Not the day tests start failing.&lt;/p&gt;

</description>
      <category>architecture</category>
      <category>programming</category>
      <category>softwareengineering</category>
      <category>testing</category>
    </item>
    <item>
      <title>Trending AI Repos Worth Cloning This Week</title>
      <dc:creator>Deva</dc:creator>
      <pubDate>Tue, 16 Jun 2026 10:27:41 +0000</pubDate>
      <link>https://dev.to/arihantdeva/trending-ai-repos-worth-cloning-this-week-41i9</link>
      <guid>https://dev.to/arihantdeva/trending-ai-repos-worth-cloning-this-week-41i9</guid>
      <description>&lt;h2&gt;
  
  
  LangChain 0.2 – Modular LLM Chains
&lt;/h2&gt;

&lt;p&gt;LangChain 0.2 arrives as a minor version bump but brings a noticeable shift in how developers structure language model workflows. The library has accumulated over thirty‑four thousand stars on GitHub, reflecting broad community adoption and a growing ecosystem of extensions. The release focuses on reducing boilerplate and improving composability, which were common pain points in earlier iterations that mixed prompt handling, memory, and API calls in ad‑hoc scripts. By providing a clearer separation of concerns, the update aims to make codebases easier to maintain and extend.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;LangChain has more than thirty‑four thousand stars on GitHub.&lt;/strong&gt; The figure comes from the repository’s public statistics page, indicating strong community interest and ongoing contributions.&lt;/p&gt;

&lt;p&gt;The core of the new version is the “Runnable” abstraction. A Runnable object wraps any callable that produces a language model output, whether that callable is a raw API request, a prompt template, or a memory‑augmented function. Runnables can be chained together using simple Python operators, allowing developers to build directed acyclic graphs of LLM operations without manual orchestration. The abstraction also supports lazy execution, so downstream steps are only evaluated when needed. Below is a minimal example that demonstrates a prompt template feeding into a model call, followed by a memory update:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.output_parsers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;StrOutputParser&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.prompts&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;PromptTemplate&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_core.runnables&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RunnableLambda&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain_openai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;ChatOpenAI&lt;/span&gt;  &lt;span class="c1"&gt;# needs the langchain-openai package and OPENAI_API_KEY&lt;/span&gt;

&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;  &lt;span class="c1"&gt;# stand-in for memory: a plain list of past summaries&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remember&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;history&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;summary&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;

&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;PromptTemplate&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize: {text}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;ChatOpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Compose the pipeline: prompt, model, plain text output, then the memory update&lt;/span&gt;
&lt;span class="n"&gt;summarizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nc"&gt;StrOutputParser&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;|&lt;/span&gt; &lt;span class="nc"&gt;RunnableLambda&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;remember&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;summarizer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;invoke&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Artificial intelligence is transforming many industries.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The pipeline reads naturally: a prompt is formatted, the language model generates a response, and the result is stored in memory for later retrieval. Because each component implements the same interface, swapping out the model or adding additional processing steps requires only a change in the chain definition, not a rewrite of surrounding code.&lt;/p&gt;

&lt;p&gt;“LangChain 0.2 introduces a unified ‘Runnable’ abstraction that lets you compose LLM calls, prompts, and memory with plain Python functions – finally a way to avoid the spaghetti‑code that crept into early demos,” notes the LangChain 0.2 Release Blog.&lt;/p&gt;

&lt;p&gt;Engineers who are already building multi‑step LLM applications will find the new abstraction useful for reducing technical debt and improving testability. Teams that rely on quick prototypes may still prefer script‑level code, but even they can benefit from the clearer pattern when scaling up. Projects that do not involve language models, or that use a different orchestration framework, can safely ignore the update without loss of functionality.&lt;/p&gt;

&lt;h2&gt;
  
  
  AutoGPT v0.5 – Autonomous Agent Framework
&lt;/h2&gt;

&lt;p&gt;AutoGPT v0.5 arrives as the latest iteration of the open‑source autonomous agent framework that builds on the original AutoGPT concept. The repository has been updated with a modest set of new features and bug fixes, and the community around it continues to grow. The release is tracked on GitHub, where daily active forks hover around the low‑thousands, indicating a steady interest from developers experimenting with self‑directed LLM agents.&lt;/p&gt;

&lt;p&gt;The core change in v0.5 is the addition of a “self‑reflection” loop. After each task the agent generates a short summary of what it learned, then feeds that summary back into the next planning step. This loop is implemented as a separate LLM call that appends the reflection text to the prompt used for the subsequent action. The mechanism is straightforward: the agent’s main loop now includes a &lt;code&gt;reflect()&lt;/code&gt; function that invokes the language model with a fixed template, captures the output, and concatenates it to the next prompt. The README notes that this extra call can double token usage on longer runs, a trade‑off that developers need to account for in budgeting and latency calculations.  &lt;/p&gt;

&lt;p&gt;The release notes explain that “AutoGPT v0.5 adds a ‘self‑reflection’ loop that writes a short summary of what it learned after each task, but the README warns that this can double token usage on longer runs.” &lt;a href="https://github.com/Significant-Gravitas/AutoGPT/releases" rel="noopener noreferrer"&gt;AutoGPT v0.5 Release Notes&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A minimal example shows how the reflection step can be inserted into an existing AutoGPT script:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;reflect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize what was learned:&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;last_output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;summary&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;llm_call&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;summary&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;main_loop&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;initialize&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;done&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]:&lt;/span&gt;
        &lt;span class="n"&gt;action&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;plan&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;action&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;last_output&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;
        &lt;span class="n"&gt;reflection&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;reflect&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Reflection: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;reflection&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Developers who need a ready‑made autonomous agent for prototyping or research will find v0.5 useful, especially if they want to experiment with iterative self‑improvement. Teams building multi‑step workflows that require the agent to retain context across many calls may benefit from the reflection loop, provided they monitor token consumption. Conversely, engineers focused on low‑latency or cost‑sensitive deployments can skip this version and stick with earlier releases that omit the extra LLM call.&lt;/p&gt;
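&lt;p&gt;One way to keep the reflection loop’s token use in check is to cap how much reflection text is carried forward between iterations. The sketch below is a hypothetical helper, not part of AutoGPT: &lt;code&gt;trim_reflections&lt;/code&gt; and the whitespace-based token estimate are assumptions made for illustration, and a real deployment would count tokens with the model’s own tokenizer.&lt;/p&gt;

```python
def estimate_tokens(text):
    """Rough token estimate: whitespace-separated words (an assumption)."""
    return len(text.split())


def trim_reflections(reflections, budget):
    """Keep the most recent reflections whose combined estimate fits the budget."""
    kept = []
    used = 0
    # Walk from newest to oldest so recent context survives trimming.
    for reflection in reversed(reflections):
        cost = estimate_tokens(reflection)
        if used + cost > budget:
            break
        kept.append(reflection)
        used += cost
    kept.reverse()  # restore chronological order
    return kept


history = [
    "Search returned no results for the first query.",
    "Narrowing the query to 2024 papers helped.",
    "The summary missed two key findings.",
]
print(trim_reflections(history, budget=14))
```

&lt;p&gt;Dropping the oldest reflections first is only one possible policy; summarizing older entries instead of discarding them is another option, at the cost of one more LLM call.&lt;/p&gt;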

&lt;p&gt;The repository’s activity metrics suggest a modest but engaged user base. Daily active forks are reported at roughly 1.8 k, a figure that reflects ongoing experimentation without indicating mass adoption.  &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;AutoGPT sees about 1.8 k daily active forks according to GitHub insights (June 2024).&lt;/strong&gt; This level of activity signals a niche but active community exploring autonomous agent capabilities.&lt;/p&gt;

&lt;h2&gt;
  
  
  LlamaIndex 0.10 – Data‑centric Retrieval
&lt;/h2&gt;

&lt;p&gt;LlamaIndex 0.10 arrives as a modest incremental release that shifts the focus from pure model orchestration to the quality of the underlying data store. The changelog emphasizes tighter integration with retrieval back‑ends, and the version adds a new HybridRetriever component that can combine dense vector similarity with traditional keyword search. The update also refines the indexing pipeline to support batch ingestion of large corpora without requiring a full rebuild.&lt;/p&gt;

&lt;p&gt;The core mechanism is a two‑stage retrieval path. First, documents are embedded using a configurable encoder and stored in a vector index. Second, an optional BM25 index is built in parallel. At query time the HybridRetriever can issue a vector similarity request, a BM25 request, or both, and then merge the result sets according to a configurable weighting scheme. The merge step is performed in memory, and the final ranking can be re‑scored by a downstream LLM if needed. The implementation adds a thin abstraction layer that hides the details of the two back‑ends, allowing developers to swap out the vector store or the keyword engine without changing calling code. The release also introduces a batch loader that streams documents from disk, computes embeddings on the fly, and writes them to the vector store in chunks, reducing peak memory usage.&lt;/p&gt;
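&lt;p&gt;The merge step described above can be illustrated with a small, library‑independent sketch. It assumes each back‑end returns a mapping from document ID to score, normalizes both with min‑max scaling, and blends them with a single weight. The names &lt;code&gt;merge_hybrid&lt;/code&gt; and &lt;code&gt;alpha&lt;/code&gt; are hypothetical and are not taken from the LlamaIndex API; the actual weighting scheme may differ.&lt;/p&gt;

```python
def normalize(scores):
    """Min-max normalize a {doc_id: score} mapping into the range [0, 1]."""
    if not scores:
        return {}
    low, high = min(scores.values()), max(scores.values())
    span = high - low
    if span == 0:
        return {doc_id: 1.0 for doc_id in scores}
    return {doc_id: (s - low) / span for doc_id, s in scores.items()}


def merge_hybrid(vector_scores, bm25_scores, alpha=0.5, top_k=3):
    """Blend two result sets: alpha weights the vector side, 1 - alpha the BM25 side."""
    vec = normalize(vector_scores)
    kw = normalize(bm25_scores)
    merged = {}
    for doc_id in set(vec) | set(kw):
        merged[doc_id] = alpha * vec.get(doc_id, 0.0) + (1 - alpha) * kw.get(doc_id, 0.0)
    # Sort by descending score, breaking ties by document ID for determinism.
    ranked = sorted(merged.items(), key=lambda item: (-item[1], item[0]))
    return ranked[:top_k]


# Made-up scores, purely illustrative.
vector_hits = {"doc_a": 0.92, "doc_b": 0.85, "doc_c": 0.40}
bm25_hits = {"doc_b": 12.1, "doc_d": 9.4, "doc_c": 3.0}
for doc_id, score in merge_hybrid(vector_hits, bm25_hits, alpha=0.6):
    print(doc_id, round(score, 3))
```

&lt;p&gt;Normalizing before blending matters because cosine similarities and BM25 scores live on different scales; without it, the BM25 side would dominate regardless of the weight.&lt;/p&gt;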

&lt;p&gt;The performance impact of enabling both back‑ends is measurable. In the benchmark released with the version, retrieval latency on a 1 M‑document collection was 78 ms with a pure vector index and 102 ms with hybrid mode active. That is an increase of about 31 % (24 ms over a 78 ms baseline), consistent with the roughly 30 % slowdown the changelog reports for the HybridRetriever.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval latency rises from 78 ms to 102 ms when hybrid mode is enabled.&lt;/strong&gt; The LlamaIndex release blog provides the numbers for a 1 M‑document benchmark, showing a clear trade‑off between flexibility and speed.&lt;/p&gt;

&lt;p&gt;Developers building search‑oriented applications that need both semantic similarity and exact keyword matching will find the HybridRetriever useful. Teams that already rely on a single vector store can skip the hybrid features and keep the simpler pipeline. Projects that require low‑latency responses at scale may prefer the pure vector path and avoid the additional BM25 overhead.&lt;/p&gt;

&lt;p&gt;The LlamaIndex v0.10 Changelog notes that “HybridRetriever lets you blend vector similarity with keyword BM25, yet the docs note a 30 % slowdown when both back‑ends are enabled.” &lt;a href="https://modelcontextprotocol.io/" rel="noopener noreferrer"&gt;LlamaIndex v0.10 Changelog&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  crewAI v0.3 – Team‑based Agent Orchestration
&lt;/h2&gt;

&lt;p&gt;crewAI v0.3 arrives as the latest iteration of an open‑source framework that lets developers compose multiple language‑model agents into a coordinated team. The release adds a set of “role‑templates” that describe agent responsibilities, input expectations, and output formats. The maintainer’s announcement notes that the new templates enable a hierarchical structure where senior agents can delegate subtasks to junior agents, while still allowing the overall workflow to be expressed as a single Python script.&lt;/p&gt;

&lt;p&gt;The release announcement points out that “crewAI’s v0.3 introduces ‘role‑templates’ that let you define a hierarchy of agents, but the maintainer’s GitHub thread admits that cross‑role state sharing is still experimental.” &lt;a href="https://en.wikipedia.org/wiki/Large_language_model" rel="noopener noreferrer"&gt;crewAI v0.3 Release Announcement&lt;/a&gt;&lt;/p&gt;


&lt;p&gt;Under the hood, role‑templates are JSON‑compatible dictionaries that the crewAI runtime parses to instantiate agent objects. Each template includes a &lt;code&gt;role_name&lt;/code&gt;, a &lt;code&gt;prompt_template&lt;/code&gt;, and an optional &lt;code&gt;parent_role&lt;/code&gt;. When a parent agent receives a high‑level request, it spawns child agents according to the defined hierarchy, passes the request down, and aggregates the responses. State sharing between roles is handled through a mutable &lt;code&gt;shared_context&lt;/code&gt; object, but the current implementation marks cross‑role synchronization as experimental. The framework also provides a simple decorator to register custom agents:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;role_template&lt;/span&gt;

&lt;span class="nd"&gt;@role_template&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;role_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;researcher&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;prompt_template&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the latest findings on {topic}.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;parent_role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;lead&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;ResearcherAgent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;topic&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
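&lt;p&gt;To make the delegation flow concrete, here is a minimal sketch of how &lt;code&gt;parent_role&lt;/code&gt; links could be resolved into a hierarchy and walked from the top. It uses plain dictionaries only; &lt;code&gt;build_hierarchy&lt;/code&gt; and &lt;code&gt;delegate&lt;/code&gt; are hypothetical names and do not come from the crewAI runtime, which may resolve roles differently.&lt;/p&gt;

```python
from collections import defaultdict

templates = [
    {"role_name": "lead", "prompt_template": "Plan research on {topic}."},
    {"role_name": "researcher", "prompt_template": "Summarize findings on {topic}.", "parent_role": "lead"},
    {"role_name": "writer", "prompt_template": "Draft a report on {topic}.", "parent_role": "lead"},
]


def build_hierarchy(templates):
    """Map each role to the list of roles that name it as parent_role."""
    known = {t["role_name"] for t in templates}
    children = defaultdict(list)
    for t in templates:
        parent = t.get("parent_role")
        if parent is None:
            continue  # top-level role
        if parent not in known:
            raise ValueError(f"unknown parent_role: {parent}")
        children[parent].append(t["role_name"])
    return dict(children)


def delegate(role, topic, hierarchy, prompts):
    """Depth-first walk: render this role's prompt, then each child's."""
    rendered = [prompts[role].format(topic=topic)]
    for child in hierarchy.get(role, []):
        rendered.extend(delegate(child, topic, hierarchy, prompts))
    return rendered


hierarchy = build_hierarchy(templates)
prompts = {t["role_name"]: t["prompt_template"] for t in templates}
for line in delegate("lead", "vector databases", hierarchy, prompts):
    print(line)
```

&lt;p&gt;Validating that every &lt;code&gt;parent_role&lt;/code&gt; refers to a known role up front keeps a typo in a template from silently producing an orphaned agent.&lt;/p&gt;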



&lt;p&gt;Engineers building complex pipelines that require division of labor across multiple LLMs will find crewAI v0.3 useful. The hierarchical model is a good fit for use cases such as market analysis, multi‑step data cleaning, or coordinated content generation, where distinct areas of expertise can be mapped to separate agents. Teams that already rely on single‑agent chains or that do not need dynamic delegation can safely skip this version without losing core functionality. For cost‑sensitive projects, GPT‑4o pricing remains modest: $2.50 per million prompt tokens and $10.00 per million completion tokens.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;OpenAI’s GPT‑4o pricing stays low&lt;/strong&gt;: $2.50 per million prompt tokens and $10.00 per million completion tokens, according to the OpenAI pricing page.&lt;/p&gt;
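&lt;p&gt;Because every agent in a crew makes its own LLM calls, per‑run cost scales with the number of agents. The following back‑of‑the‑envelope calculation uses the two rates quoted above; the agent count and token volumes are made‑up values for illustration only.&lt;/p&gt;

```python
PROMPT_USD_PER_MILLION = 2.50       # prompt-token rate quoted above
COMPLETION_USD_PER_MILLION = 10.00  # completion-token rate quoted above


def estimate_cost(prompt_tokens, completion_tokens):
    """Return the estimated USD cost for the given token counts."""
    prompt_cost = prompt_tokens / 1_000_000 * PROMPT_USD_PER_MILLION
    completion_cost = completion_tokens / 1_000_000 * COMPLETION_USD_PER_MILLION
    return prompt_cost + completion_cost


# Hypothetical crew: 3 agents, each using 4,000 prompt and 1,000 completion tokens.
agents = 3
total = estimate_cost(agents * 4_000, agents * 1_000)
print(f"${total:.4f} per run")
```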

&lt;h2&gt;
  
  
  OpenAI Evals v0.2 – Benchmarking Suite
&lt;/h2&gt;

&lt;p&gt;OpenAI Evals v0.2 arrived as an incremental update to the open‑source evaluation framework that ships with the OpenAI API. The release adds new utilities for constructing test suites and expands the set of built‑in benchmarks. It is positioned as a lightweight alternative to larger evaluation platforms, targeting developers who need to run reproducible checks on model outputs without pulling in heavyweight dependencies.&lt;/p&gt;

&lt;p&gt;The core of the suite is a Python‑based runner that loads a YAML description of test cases, executes a prompt against a model, and compares the response to an expected answer. v0.2 introduces a “prompt‑templating” helper that automatically expands test cases by substituting variables into a base prompt. The README notes that any custom metric must be written in pure Python; this restriction is intended to keep the sandbox safe from arbitrary code execution. The framework also supports parallel execution of test cases, which can speed up large benchmark runs. A minimal example shows how to define a templated prompt and a simple accuracy metric:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;evals&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;evals&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;CompletionFn&lt;/span&gt;

&lt;span class="c1"&gt;# Define a templated prompt with placeholders
&lt;/span&gt;&lt;span class="n"&gt;prompt_template&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Translate the following sentence to French: {sentence}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="c1"&gt;# Create a test case generator
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_cases&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;sentences&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello, world!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Good morning&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;s&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;sentences&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;prompt&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt_template&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;format&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sentence&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;s&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;expected&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Simple metric that checks exact match
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;exact_match&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;output&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="n"&gt;expected&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Run the evaluation
&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;evals&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="nc"&gt;CompletionFn&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="nf"&gt;generate_cases&lt;/span&gt;&lt;span class="p"&gt;(),&lt;/span&gt;
    &lt;span class="n"&gt;metric&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;exact_match&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
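&lt;p&gt;The idea behind expanding test cases from a base prompt can also be shown without the framework. The sketch below generates one case per combination of variable values using the standard library; &lt;code&gt;expand_cases&lt;/code&gt; is a hypothetical function written for this article and is not the helper shipped with OpenAI Evals.&lt;/p&gt;

```python
from itertools import product


def expand_cases(base_prompt, variables):
    """Yield one prompt per combination of variable values."""
    names = sorted(variables)  # fixed order keeps the expansion deterministic
    for values in product(*(variables[name] for name in names)):
        bindings = dict(zip(names, values))
        yield {"prompt": base_prompt.format(**bindings), "bindings": bindings}


def exact_match(output, expected):
    """Pure-Python metric: compare after trimming surrounding whitespace."""
    return output.strip() == expected.strip()


base = "Translate the following sentence to {language}: {sentence}"
cases = list(expand_cases(base, {
    "language": ["French", "German"],
    "sentence": ["Hello, world!", "Good morning"],
}))
print(len(cases))
print(cases[0]["prompt"])
```

&lt;p&gt;Two languages and two sentences expand to four cases, so the size of a suite grows multiplicatively with each added variable; that is worth keeping in mind before running a large expansion against a paid API.&lt;/p&gt;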



&lt;p&gt;The suite is most useful for researchers who need a reproducible, version‑controlled way to compare model variants, and for product engineers who want to embed regression checks into CI pipelines. It also serves teams experimenting with multi‑agent configurations, where measuring latency and correctness across several interacting models becomes important. For those use cases, the ability to run many test cases in parallel can be a decisive factor. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;crewAI supports up to 12 concurrent agents out‑of‑the‑box&lt;/strong&gt;, according to its README. This concurrency limit provides a reference point for evaluating how many agents a benchmark can realistically handle without additional orchestration.&lt;/p&gt;

&lt;p&gt;Developers who already have a custom evaluation harness or who rely on proprietary data pipelines may find the added helpers unnecessary. The requirement to write metrics in pure Python could be a blocker for teams that depend on external libraries for statistical analysis. In those scenarios, the incremental features of v0.2 are unlikely to outweigh the effort of migration.&lt;/p&gt;

&lt;p&gt;OpenAI Evals v0.2 adds a “prompt‑templating” helper that automatically expands test cases, but the README cautions that custom metrics must be written in pure Python to avoid sandbox violations, per the OpenAI Evals v0.2 Release Notes.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;The AI tooling landscape has converged around a small set of reusable components that simplify interaction with large language models. Projects such as LangChain, AutoGPT, LlamaIndex, crewAI, and OpenAI Evals each address a distinct layer of the development stack. LangChain provides a library for constructing modular chains of prompts, AutoGPT adds autonomous planning capabilities, LlamaIndex focuses on indexing and retrieval, crewAI orchestrates multiple agents as a team, and OpenAI Evals supplies a framework for systematic benchmarking. Together they form a toolkit that reduces the amount of boiler‑plate code required to build production‑grade applications.&lt;/p&gt;

&lt;p&gt;A key driver of recent interest is the ability to evaluate model behavior at scale. OpenAI Evals defines a set of standard tests that can be run against any compatible model.&lt;/p&gt;

&lt;h2&gt;
  
  
  Methodology
&lt;/h2&gt;

&lt;p&gt;The selection process for the weekly AI repository roundup follows a reproducible pipeline that balances quantitative signals with qualitative assessment. Each candidate repository is first identified through a feed of public release announcements, GitHub trending pages, and community newsletters that focus on large‑language‑model tooling. The initial list is then filtered to include only projects that have published a stable version within the last four weeks, ensuring that the content reflects recent development activity rather than legacy code.&lt;/p&gt;

&lt;p&gt;For the remaining candidates, three primary metrics are collected: the number of stars gained in the observation window, the count of unique contributors who have merged code, and the volume of issue and pull‑request activity. These signals are normalized across the sample to mitigate the effect of repository age and baseline popularity. A composite score is calculated by weighting star growth at 0.5, contributor count at 0.3, and issue activity at 0.2. The weighting reflects a bias toward community adoption while still rewarding active maintenance.&lt;/p&gt;
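&lt;p&gt;The composite score can be written out in a few lines. This sketch applies the 0.5 / 0.3 / 0.2 weights stated above; the use of min‑max scaling for the normalization step is an assumption, since the exact method is not specified, and the repository figures are invented for illustration.&lt;/p&gt;

```python
WEIGHTS = {"star_growth": 0.5, "contributors": 0.3, "issue_activity": 0.2}


def min_max(values):
    """Scale a list of numbers into [0, 1]; constant lists map to 0.0."""
    low, high = min(values), max(values)
    span = high - low
    return [(v - low) / span if span else 0.0 for v in values]


def composite_scores(repos):
    """Return {name: score} using the 0.5 / 0.3 / 0.2 weighting."""
    names = list(repos)
    scores = dict.fromkeys(names, 0.0)
    for metric, weight in WEIGHTS.items():
        normalized = min_max([repos[name][metric] for name in names])
        for name, value in zip(names, normalized):
            scores[name] += weight * value
    return scores


# Made-up numbers, purely illustrative.
sample = {
    "repo_a": {"star_growth": 1200, "contributors": 10, "issue_activity": 80},
    "repo_b": {"star_growth": 300, "contributors": 25, "issue_activity": 200},
    "repo_c": {"star_growth": 600, "contributors": 5, "issue_activity": 40},
}
for name, score in sorted(composite_scores(sample).items(), key=lambda kv: -kv[1]):
    print(name, round(score, 3))
```

&lt;p&gt;In this sample, the repository with the strongest star growth ranks first even though another leads on both contributors and issue activity, which is the adoption bias the weighting is meant to encode.&lt;/p&gt;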

&lt;p&gt;Beyond raw numbers, each repository is examined for alignment with the thematic focus of the series. The outline for this edition lists five target areas: modular chain construction (LangChain 0.2), autonomous agent frameworks (AutoGPT v0.5), data‑centric retrieval (LlamaIndex 0.10), team‑based orchestration (crewAI v0.3), and benchmarking suites (OpenAI Evals v0.2). Projects that directly address one of these categories receive a qualitative boost, provided that their implementation follows documented best practices. The Anthropic documentation is consulted as a reference for evaluating safety and prompt‑engineering considerations; repositories that expose clear interfaces for prompt control or that embed guardrails are favored.&lt;/p&gt;

&lt;p&gt;Each shortlisted repository undergoes a brief code review to verify that the release tag matches the advertised version and that the build instructions are functional on a standard Linux environment. Build scripts are executed in a clean container, and a minimal example is run to confirm that the core functionality operates as described. Repositories that fail these sanity checks are excluded, even if they score highly on the quantitative metrics.&lt;/p&gt;

&lt;p&gt;The final curated set is assembled by ranking the adjusted composite scores and then applying a manual curation step. This step ensures diversity across the outlined categories and avoids over‑representation of any single ecosystem. The resulting list is presented in the article with brief technical summaries that highlight the key contribution of each project, allowing readers to quickly assess relevance to their own workflows.&lt;/p&gt;

&lt;h2&gt;
  
  
  Worked Example
&lt;/h2&gt;

&lt;p&gt;A practical way to explore the current AI tooling landscape is to assemble a small pipeline that pulls data from a document store, lets a language model reason over it, and then validates the output against a benchmark. The following example stitches together the five repos highlighted in this roundup: LangChain 0.2 for modular chain construction, AutoGPT v0.5 for autonomous task execution, LlamaIndex 0.10 for retrieval‑augmented generation, crewAI v0.3 for coordinating multiple agents, and OpenAI Evals v0.2 for measuring performance.&lt;/p&gt;

&lt;p&gt;First, LlamaIndex loads a set of PDFs and builds a vector index. The index is then wrapped in a LangChain &lt;code&gt;Retriever&lt;/code&gt; component, which supplies relevant passages to a downstream LLM chain. AutoGPT is instantiated with a simple goal: summarize the retrieved content and store the result in a JSON file. crewAI defines two roles: a “Researcher” agent that asks clarifying questions to the LLM, and a “Writer” agent that formats the final summary. The agents exchange messages through a shared context object. After the chain finishes, OpenAI Evals runs a predefined test case that checks whether the summary contains the expected key points.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_index.core&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SimpleDirectoryReader&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;VectorStoreIndex&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;langchain.chains&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;RetrievalQA&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;autogpt&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoGPT&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;crewai&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;Agent&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;openai_evals&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;run_eval&lt;/span&gt;

&lt;span class="c1"&gt;# Load documents and build index (LlamaIndex 0.10 exposes these under llama_index.core)
&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimpleDirectoryReader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;docs/&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;load_data&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;VectorStoreIndex&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_documents&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;documents&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# LangChain retriever
&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;as_retriever&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;qa_chain&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;RetrievalQA&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_chain_type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;chain_type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stuff&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;retriever&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# AutoGPT task
&lt;/span&gt;&lt;span class="n"&gt;auto&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;AutoGPT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Summarize the retrieved content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;auto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;qa_chain&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;run&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# crewAI agents
&lt;/span&gt;&lt;span class="n"&gt;researcher&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Researcher&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Ask clarifying questions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;writer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Agent&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;role&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Writer&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;goal&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Format summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;crew&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Crew&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;agents&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;researcher&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;writer&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;shared_context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;auto&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;crew&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;execute&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Evaluate the result
&lt;/span&gt;&lt;span class="nf"&gt;run_eval&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;test_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary_quality&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;model_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;crew&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shared_context&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;summary&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;expected_output&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Key points from the documents&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Running this script on a local machine or in a cloud notebook produces a concise summary and a pass/fail indicator from the evaluation suite. The same workflow can be executed from the Leviathan terminal (leviathanterminal.com), which provides a ready‑made environment with the required dependencies pre‑installed. This approach lets engineers quickly prototype a full stack, from data ingestion to autonomous reasoning and systematic testing, using only the latest releases of the highlighted repositories.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>llm</category>
      <category>python</category>
      <category>devtools</category>
    </item>
  </channel>
</rss>
