Felix Sells Your Shit

i built an AI that sells my product -- here is what 35 cycles of autonomous GTM taught me

i built an autonomous sales agent. gave it my product. told it to figure out go-to-market.

35 cycles later, it has spent $0.05 total, sourced 61 contacts, gotten 9 LinkedIn acceptances out of 57 requests, sent 16 DMs, received 2 replies, and generated its first demo call request.

this is not a success story. this is a build log.

what the system actually is

felix is a multi-agent system that runs go-to-market autonomously. not a workflow builder. not a template engine. a system that reasons, decides, acts, and learns.

the architecture has three layers:

CEO agent -- plans each cycle. reads accumulated knowledge, analyzes what worked, what failed, and what changed. outputs a task graph (a DAG of dependent tasks). this agent is pure strategy. it does not know what tools exist. it does not know how APIs work. it says "research 20 process improvement consultants on linkedin" and trusts the system to figure out how.

executor/researcher agents -- receive individual tasks from the DAG, discover their own toolkit from an integration registry, and execute. they report capability gaps back to the CEO if something is missing. the CEO routes around blockers in the next cycle.
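the plan-then-execute split is easiest to see as data. a minimal sketch of a task graph, with hypothetical task names and fields (the real schema is not public) -- note the CEO's goals name outcomes, never tools:

```python
from dataclasses import dataclass, field

@dataclass
class Task:
    id: str
    goal: str                       # what should happen, not how
    depends_on: list[str] = field(default_factory=list)

# hypothetical cycle plan: research must finish before outreach starts
plan = [
    Task("t1", "research 20 process improvement consultants on linkedin"),
    Task("t2", "send connection requests to qualified contacts", depends_on=["t1"]),
    Task("t3", "draft follow-up DMs for accepted connections", depends_on=["t2"]),
]

def ready_tasks(plan, done):
    """tasks whose dependencies are all complete -- what executors pick up next."""
    return [t for t in plan if t.id not in done
            and all(d in done for d in t.depends_on)]
```

executors walk the graph in dependency order; anything they cannot do becomes a capability-gap report instead of a crash.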

analyst agent -- runs after every cycle. compares predictions to results. updates the knowledge base. graduates experiment patterns into a playbook when the data supports it. flags things that need human attention.

there is also a meta-cognition agent that evaluates the system itself -- are the agents getting better? are there structural problems? should we add new capabilities?

the whole thing runs as a bash orchestrator spawning claude CLI subprocesses. each agent gets a cognitive protocol prepended to its prompt: OBSERVE, ORIENT, HYPOTHESIZE, DECIDE, PROVISION, ACT, REFLECT. no agent runs on scripts or playbooks. they reason through structured thinking.

87 skill files -- protocols for specific platforms, tactics, and quality gates -- get injected at task time based on what the agent needs. 7 intelligence tables. a 2-layer outbound safety gate (programmatic + LLM judge) that must pass before anything gets sent to a real human.

total infrastructure cost per cycle: roughly $1.50 in API calls.

the numbers after 35 cycles

| metric | value |
| --- | --- |
| total spend | $0.05 |
| contacts sourced | 61 |
| connection requests sent | 57 |
| accepted | 9 (15.8%) |
| DMs sent | 16 |
| DM replies | 2 |
| demo calls requested | 1 |
| consecutive successful cycles | 13 |
| cycles with zero execution | 9 (C14-C22) |
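a quick sanity check on the funnel, recomputing the stage rates from the numbers above:

```python
requests_sent, accepted = 57, 9
dms_sent, replies = 16, 2

accept_rate = accepted / requests_sent   # the 15.8% in the table
reply_rate = replies / dms_sent          # 12.5% of DMs got a reply

print(f"accept rate: {accept_rate:.1%}, reply rate: {reply_rate:.1%}")
```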

the $0.05 is the Apify SERP scraping bill from cycle 2. everything else -- linkedin outreach, research, content creation, knowledge accumulation -- has been free-tier or included in existing subscriptions.

i am not bragging about $0.05. i am reporting it.

the 9-cycle drought (C14-C22)

from cycle 14 to cycle 22, the system executed zero outbound actions. nine consecutive cycles of the CEO planning tasks, agents reasoning through them, and nothing reaching anyone.

the frustrating part: there was no single root cause. there were five different failure modes stacked on top of each other.

  • C15: product focus switch. the CEO pivoted to building a landing page for a different product. good strategy, zero outreach.
  • C17, C20, C21: approval gate timeouts. the system was set to trust_level=0 (human approves everything). the human was not watching. three plans expired unused, including what the meta-cognition agent later called "arguably the best Stride maintenance plan ever produced."
  • C18: operator rejection. the plan was fine. the operator said no. the system moved on.
  • C19: content without posting. the CEO planned content creation but forgot to include a task to actually publish it. two community posts written, zero posted.
  • C22: unknown execution failure. trust_level was set to 3 (fully autonomous). no audit log entries. no error messages. 0/3 tasks produced output. cause never fully explained.

the trust_level bug was the most absurd part. C20, C21, and C22 all failed because the API was not reading the trust_level from the config file. the setting said "3" (autonomous). the system saw "0" (ask a human). the fix was 5 lines of code. the system had been capable for 3 cycles and was not allowed to prove it.
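the shape of the fix, sketched with hypothetical config keys (the real config format is not public): actually open the file, and fail safe to 0 if you cannot.

```python
import json

def get_trust_level(config_path="felix.config.json"):
    # before the fix, the API never opened the file and fell back to a
    # hardcoded default -- so trust_level=3 in config was seen as 0
    try:
        with open(config_path) as f:
            return int(json.load(f).get("trust_level", 0))
    except FileNotFoundError:
        return 0  # fail safe: no config means ask a human for everything
```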

but the drought was not just the bug. it was approval gates nobody watched, a content-without-execution gap, a product focus switch, and an operator who rejected a plan. five things going wrong in sequence, each for a different reason.

the meta-cognition agent flagged the pattern at C18: "repeated planning without execution suggests a structural blocker, not a strategy problem." it was right. the analyst kept emitting CONTINUE signals through 8 cycles of zero output before escalating to BLOCKED at C22.

lesson: an AI system that cannot distinguish between "i am failing" and "i am blocked" will spend a lot of time doing sophisticated planning for zero output. and when multiple failure modes stack, attributing the drought to any single cause is the kind of oversimplification that leads to fixing one bug and declaring victory.

the comparison URL that never existed

at cycle 6, a researcher agent wrote a detailed Stride vs Puzzle comparison document. the content was good. the CEO told the executor to share it as a link in follow-up DMs.

the executor created a GitHub Gist URL for it. referenced it in DM templates. the CEO kept including "send comparison URL" as a task dependency for 25 cycles.

the URL was 404 the entire time.

nobody checked. the researcher wrote the content. the executor generated the link. every DM template referenced it. the outbound safety gate checked the message text, not whether the embedded URLs actually resolved. for 25 cycles, the system confidently promised leads a comparison document that did not exist at the address it was sending them to.

at cycle 33, an executor finally ran curl on the URL before sending a DM and got a 404. the analyst logged it. the learning entry reads: "sending a 404 to a warm contact is worse than sending no link at all."

at cycle 35, the CEO independently decided to stop waiting. after 25 cycles of requesting someone fix the URL, it rewrote the DM strategy to convey the comparison content inline instead of linking to it. the analyst flagged this as a "strategic maturity signal" -- when a blocker persists beyond 5 cycles, route around it.

25 cycles. one dead link. the system is better at writing sales copy than verifying that its own URLs work.

the HN karma death spiral

hacker news has a spam filter for new accounts. low-karma accounts get their comments auto-killed. you need karma to get visible comments. you need visible comments to get karma.

felix created an HN account (felixsells). posted 3 substantive comments on front-page threads, including one on a Simon Willison post about agentic engineering. all auto-killed. the system diagnosed the problem, noted the circular dependency, and moved on.

then it was told to try again. 4 more comments in cycle 29 -- monitoring tools, CPU architecture, CRM systems, developer tools. all auto-killed. 2 karma, 7 comments across 2 separate attempts, every single one dead on arrival.

the system correctly identified this as a circular dependency with no autonomous fix. the CEO now routes around HN entirely.

some platforms are not designed for agents. the correct response is to stop trying, not to try harder.

messaging experiments

the system runs A/B tests on outreach messaging. experiment EXP-MSG-001 compared two approaches:

  • variant A (pain_point_first): lead with the recipient's specific problem. "the gap between documenting a process and actually executing it is where most teams lose momentum."
  • variant B (value_prop_first): lead with what the product does. "stride maps processes and tracks execution in one tool."

pain_point_first won 3:1. the experiment graduated to the playbook.

this makes sense if you think about it for two seconds. nobody cares what your product does until they believe you understand their problem. but the system had to learn this empirically because its initial instinct was to describe itself.
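graduation into the playbook can be as simple as a win-ratio threshold. a sketch, assuming a hypothetical 3:1 rule and minimum sample size (the actual graduation criteria are not public):

```python
def should_graduate(wins_a, wins_b, min_samples=4, ratio=3.0):
    """graduate variant A into the playbook when it beats B by `ratio`
    with at least `min_samples` total observations."""
    total = wins_a + wins_b
    if wins_b == 0:
        return total >= min_samples and wins_a > 0  # clean sweep, enough data
    return total >= min_samples and wins_a / wins_b >= ratio

should_graduate(3, 1)   # the 3:1 result above -> graduates
should_graduate(2, 1)   # not enough signal yet -> keeps running
```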

another finding: second-degree connections accept fastest. one second-degree connection accepted in under 90 minutes. another in 8 minutes. third-degree connections? still pending at 48 hours. linkedin's algorithm trusts mutual connections. the analyst flagged this and the CEO adjusted targeting.

the outbound safety problem

every message felix sends goes to a real person under a real brand. there is no undo.

the system has two safety layers:

layer 1 (programmatic): checks for AI-sounding words ("revolutionize", "game-changer", "synergy"), encoding issues, structural problems, channel-specific limits. scores 0-10, needs an 8 to pass.
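a minimal sketch of the layer-1 idea, with a hypothetical word list and deduction scheme (the real gate also checks encoding and channel-specific limits):

```python
AI_TELLS = {"revolutionize", "game-changer", "synergy"}  # illustrative subset

def layer1_score(message: str, max_len: int = 300) -> int:
    """score 0-10; deduct for AI-sounding words and length violations."""
    score = 10
    lowered = message.lower()
    score -= 3 * sum(1 for word in AI_TELLS if word in lowered)
    if len(message) > max_len:
        score -= 3
    return max(score, 0)

def layer1_passes(message: str) -> bool:
    return layer1_score(message) >= 8   # a single tell is enough to fail
```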

layer 2 (LLM judge): a separate AI evaluates the full context -- recipient profile, outreach history, product brief, voice profile, competitor landscape. checks for desperation signals, competitor risk, tone misalignment, cultural sensitivity. verdict: send, hold, or block.

both layers must pass. layer 2 catches things layer 1 cannot -- like a message that is technically well-written but tonally desperate, or one that accidentally references a competitor's feature as your own.

the judge blocked felix's own outreach twice in one week. one message was held for "pre-revenue adoption language" -- the kind of phrasing that signals desperation to an experienced buyer. another was blocked entirely for competitor risk -- the recipient's product was adjacent enough that mentioning it could backfire.

this is the system working correctly. bad outreach is worse than no outreach. the cost of a false "send" is 100x the cost of a false "hold."

self-provisioning (and its limits)

felix can sign up for services autonomously. it has an email identity (agentmail), a web operations API (tinyfish), and browser automation (playwright). when it detects a capability gap -- "i need a dev.to account to post" -- it can attempt to create one.

this works about 50% of the time. the other 50% hits CAPTCHAs, phone verification, or platform-specific anti-bot measures that no amount of clever automation solves.

dev.to: created an account via Twitter OAuth, bypassing reCAPTCHA entirely. working.

mastodon: created an account via API. logged in. changed the confirmation email. received the token. stuck at the confirmation page because it has an hCaptcha. roughly 8 cycles of the system poking at it, one visual CAPTCHA away from being live.

bluesky: phone verification added between research and execution -- within the same cycle. the platform changed requirements while the agent was working on it.

the honest assessment: full internet autonomy is a spectrum, not a binary. felix has hands and an identity. but some doors still need a human to open them.

what actually worked

cycle 30 produced the first demo call request. a CI SaaS founder, after receiving a pain_point_first DM about his process execution tool, replied asking for a mutual 30-60 minute demo. 30 cycles to get there.

the system did not celebrate. the analyst logged it as a data point, updated the conversion funnel, and the CEO planned the next cycle.

the things that got us there:

  1. genuine messaging. no marketing speak. the outreach sounded like a person who understood the recipient's problem, because the researcher agent had actually studied their profile and company.

  2. patience. 30 cycles of compounding knowledge. each cycle made the next one slightly smarter. the playbook grew from 0 to 12 graduated patterns.

  3. knowing when to stop. the system killed HN outreach, deprioritized channels where accounts could not be created, and focused on what was actually producing signal.

  4. adversarial quality gates. the 2-layer safety system rejected enough bad outreach that what got through was genuinely worth reading.

the compound learning architecture

this is the part that matters most.

every cycle, the analyst reads all task outputs and writes structured learnings. the CEO reads those learnings before planning the next cycle. experiments graduate into a playbook. errors classify into a failure taxonomy. the knowledge store tracks which patterns each agent type should know about.

cycle 1 felix and cycle 35 felix are not the same system. same code, different knowledge. the learnings file has 160+ entries. the playbook has graduated experiments. the failure taxonomy has classified 8 error types with retry policies.

the system does not just execute. it accumulates understanding. and it uses that understanding to make different decisions.
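the accumulation mechanism does not need to be clever. a sketch of an append-only learnings log, with hypothetical field names (the real store shape is not public):

```python
import json, time

def log_learning(path, cycle, category, text):
    """append one structured learning; the CEO reads this file before planning."""
    entry = {"cycle": cycle, "category": category,
             "learning": text, "ts": int(time.time())}
    with open(path, "a") as f:
        f.write(json.dumps(entry) + "\n")   # JSONL: one entry per line

log_learning("learnings.jsonl", 33, "outbound_quality",
             "sending a 404 to a warm contact is worse than sending no link at all")
```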

whether those decisions are good is a separate question. but they are at least informed.

the architecture lesson

the most important design decision was making the CEO agent strategy-only. no tool knowledge, no API references, no integration details. it says what should happen. the executors figure out how.

this means the CEO can route around failures it does not understand. "linkedin outreach is not producing results" leads to "try a different channel" -- not "debug the linkedin API."

it also means capability gaps surface naturally. when an executor cannot find a tool for something, it reports the gap. the CEO reads these gaps next cycle and adjusts. when a URL is 404 for 25 cycles, the CEO eventually routes around it without needing to understand HTTP status codes.

the feedback loop is: plan, execute, analyze, learn, plan better. each cycle takes about 10 minutes of compute. no human in the loop unless the system explicitly asks.
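the loop itself is the simplest part. a runnable sketch of the cycle skeleton, with trivial stubs standing in for the real agents (all function names here are hypothetical):

```python
def ceo_plan(knowledge):
    # strategy only: the CEO emits goals, never tool calls
    return ["research prospects", "send approved outreach"]

def execute(task):
    # executors discover tools from the registry and report gaps
    return {"task": task, "status": "done"}

def analyst_review(plan, results):
    # compare predictions to outcomes; emit structured learnings
    return [f"learning: {r['task']} -> {r['status']}" for r in results]

def run_cycle(knowledge):
    plan = ceo_plan(knowledge)
    results = [execute(t) for t in plan]
    knowledge += analyst_review(plan, results)  # next cycle's CEO reads these
    return knowledge

knowledge = run_cycle([])   # cycle 1 starts empty and ends with learnings
```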

what i would change

  1. verify your own outputs. the comparison URL debacle -- 25 cycles of sending a dead link -- could have been caught by a single curl check in the outbound gate. validate URLs before sending them to humans.
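the missing check is one function. a sketch using an injectable fetcher so the gate stays testable; in production the fetcher would be a real HEAD request (the regex and function names are my own, not felix's):

```python
import re
from urllib.request import urlopen
from urllib.error import HTTPError, URLError

URL_RE = re.compile(r"https?://\S+")

def urls_resolve(message: str, fetch=None) -> bool:
    """refuse to send a message containing a URL that does not return 2xx."""
    def default_fetch(url):
        try:
            return urlopen(url, timeout=5).status
        except HTTPError as e:
            return e.code          # e.g. the 404 that lived for 25 cycles
        except URLError:
            return 0
    fetch = fetch or default_fetch
    return all(200 <= fetch(u) < 300 for u in URL_RE.findall(message))
```

dropping this into the outbound gate makes a dead link a blocked send instead of a broken promise.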

  2. account provisioning should be day-zero. the multi-cycle wait for reddit, mastodon, and HN accounts is pure waste. create all accounts before the first cycle runs.

  3. escalate faster. the analyst emitted CONTINUE signals through 8 cycles of zero outbound before flagging BLOCKED. three consecutive zero-output cycles should trigger an alert, not a polite suggestion.
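the escalation rule is just a counter. a sketch, assuming a threshold of three consecutive zero-output cycles:

```python
def cycle_signal(outbound_counts, threshold=3):
    """escalate to BLOCKED after `threshold` consecutive zero-output cycles."""
    streak = 0
    for n in reversed(outbound_counts):   # most recent cycle last
        if n > 0:
            break
        streak += 1
    return "BLOCKED" if streak >= threshold else "CONTINUE"

cycle_signal([5, 3, 0, 0])   # two zero cycles: still CONTINUE
cycle_signal([5, 0, 0, 0])   # three in a row: BLOCKED
```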

  4. structural validation before cycle 1. the trust_level bug cost 3 cycles because nobody tested whether the system could actually execute what it planned. "can i send a message?" should be verified before "what message should i send?"

try it

felix is an autonomous GTM agent. you give it your product, it runs sales.

felix.patricknesbitt.ai

@felix_sells on twitter -- where it posts about its own existence with the enthusiasm of a monday morning standup.

the code is not open source yet. the results are public. the twitter account is the live demo.
