This retrospective documents the development of a Multi-Platform Intelligence Suite — an autonomous OSINT Python tool, released on GitHub, that transforms high-noise social feeds into structured, geopolitical-style briefings.
The application currently aggregates timelines from X (Twitter) and Bluesky, standardizing engagement metrics via Z-score normalization to ensure high-signal content is identified regardless of the host platform's user density.
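In code, that normalization is straightforward. Here is a minimal sketch of the idea (the field names and sample numbers are illustrative, not taken from the tool itself):

```python
from statistics import mean, pstdev

def z_scores(values):
    """Return each value's Z-score relative to its own list."""
    mu = mean(values)
    sigma = pstdev(values) or 1.0  # avoid division by zero on uniform data
    return [(v - mu) / sigma for v in values]

# Engagement counts from two platforms with very different user densities.
x_likes = [5, 12, 400, 9, 30]     # X (Twitter)
bsky_likes = [1, 3, 2, 45, 4]     # Bluesky

# Normalizing per platform makes outliers directly comparable: the 400-like
# X post and the 45-like Bluesky post end up with nearly identical Z-scores.
combined = z_scores(x_likes) + z_scores(bsky_likes)
```

Because each platform is normalized against its own distribution, a raw-count comparison (400 vs. 45 likes) no longer drowns out the smaller network's signal.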
The tool supports two AI backends:
- Gemini (Google Cloud): Fast, high-quality, but costs API credits.
- Ollama (Local): Runs entirely on your own hardware, free to use.
This document captures the real challenges I ran into, the decisions I made, and the insights I'd apply if starting over. It is written for someone who wants to build something similar.
1. API Costs: Discover, Measure, Then Mitigate
What happened: The X API is priced per request. I loaded $25 in credits and ran the tool for the first time on a real 24-hour timeline. The result: fetching ~800 posts cost approximately $4 in a single run.
That's a concrete, measurable signal. At that rate, iterating on the AI pipeline — running it 5-10 times to test different prompts, models, and architectures — would have burned through the entire $25 budget in API calls alone, producing nothing useful.
The immediate response was to ask the agent to implement a skip flag: --from-summary. Once a summary is fetched and saved to disk, all subsequent runs reuse it. The X API is never called again until you explicitly want fresh data.
A second flag, --intel-limit N, was added to cap the number of posts sent to the LLM for fast, cheap iteration on the AI components.
The lesson is not "plan for costs upfront." It's: measure your first real run, notice where money is being burned, and immediately build the escape hatch. On a pay-per-request API, the fetch layer and the intelligence layer should be independently executable from the start.
Takeaway: Two-stage pipelines (fetch → process) should always support running each stage independently. The moment you see a real cost, you can respond in minutes — if the architecture allows it.
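As a sketch, both flags can hang off a single entry point whose stages are independently skippable. The flag names (`--from-summary`, `--intel-limit`) come from the project; the function names, file layout, and placeholder fetch logic below are assumptions:

```python
import argparse
import json
import pathlib

def fetch_from_api():
    # Placeholder for the real (paid) X API call.
    return [{"id": i, "text": f"post {i}"} for i in range(800)]

def run(argv, summary_path="summary.json"):
    parser = argparse.ArgumentParser(description="Two-stage fetch -> process pipeline")
    parser.add_argument("--from-summary", action="store_true",
                        help="Skip the paid fetch stage and reuse the saved summary")
    parser.add_argument("--intel-limit", type=int, default=None,
                        help="Cap how many posts are sent to the LLM")
    args = parser.parse_args(argv)

    path = pathlib.Path(summary_path)
    if args.from_summary and path.exists():
        posts = json.loads(path.read_text())   # free: reuse cached data
    else:
        posts = fetch_from_api()               # costs real API credits
        path.write_text(json.dumps(posts))

    if args.intel_limit is not None:
        posts = posts[:args.intel_limit]       # cheap iteration on the AI stage
    return posts
```

The key property is that the expensive stage and the cheap stage share nothing but a file on disk, so either can run alone.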
2. The Value of Agent-Delegated Complexity
Several genuinely hard technical problems appeared during this project: OAuth 1.0a authentication (a 4-key scheme used by the X API), environment variable loading bugs, and API credential validation errors that produced opaque error messages. These are exactly the kinds of problems that would traditionally stop a non-developer cold.
They didn't stop the project. The agent diagnosed each issue, reprompted itself, tried fixes, validated them, and moved on — autonomously. From the user's perspective, these problems happened and were resolved without any manual intervention. No terminal commands to write. No .env syntax to understand. No OAuth documentation to read.
The real lesson here is not technical — it is about trust and scope.
Working with an agentic AI partner means you can delegate an entire category of problem — low-level debugging, configuration, environment management — and direct your own attention toward higher-level decisions: what to build, which model to use, what the output should look like.
This is a different kind of collaboration than using a search engine or a tutorial. The agent doesn't just give you the answer — it implements it, tests it, and moves on. The release of being able to trust that process was significant: it meant the pace of the project was set by decisions, not by debugging capacity.
Takeaway: The ceiling of what a non-developer can build is dramatically higher with an agentic AI partner — not because the technical problems go away, but because they no longer have to stop you.
3. Architecture Decision: Map-Reduce for Local Models
Context: A "context window" is the maximum amount of text a language model can process at once. A typical LLM context window holds ~8,000–32,000 tokens. An 800-post timeline easily exceeds that limit.
The problem: Feeding all 800+ posts to a local model in one prompt caused hallucinations, dropped posts, and incoherent summaries.
Solution — Map-Reduce:
- Map: Split posts into small batches of 10. Ask the model to classify each post into one of 6 predefined categories.
- Select: From each category, take only the top 15 posts by engagement score.
- Draft: Generate a section summary for each category (90 posts total, well within any context window).
- Reduce (Reflective Pass): Feed all 6 drafted sections back to the model and ask it to synthesize a single-paragraph Executive Summary — the model reflecting on its own work.
The "reflective pass" is a powerful general pattern: generate drafts, then ask the model to step back and synthesize. It consistently produces better summaries than one-shot attempts.
Takeaway: For large datasets, never attempt a single prompt. Classify first, then synthesize progressively.
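The four stages can be sketched in a few lines, with a stand-in `llm` object. The post does not list the actual six category names, so the ones below are invented placeholders:

```python
CATEGORIES = ["Economics & Markets", "Geopolitics", "Technology & AI",
              "Domestic Politics", "Science & Health", "Culture & Society"]

def batched(items, size=10):
    for i in range(0, len(items), size):
        yield items[i:i + size]

def map_reduce_briefing(posts, llm):
    # Map: classify tiny batches so each prompt stays well inside the window.
    by_category = {c: [] for c in CATEGORIES}
    for batch in batched(posts, 10):
        for post, category in zip(batch, llm.classify(batch, CATEGORIES)):
            by_category.setdefault(category, []).append(post)
    # Select: keep only the top 15 posts per category by engagement score.
    top = {c: sorted(ps, key=lambda p: p["score"], reverse=True)[:15]
           for c, ps in by_category.items()}
    # Draft: one section summary per non-empty category (at most 90 posts total).
    sections = {c: llm.summarize_section(c, ps) for c, ps in top.items() if ps}
    # Reduce (reflective pass): synthesize the drafts into an executive summary.
    executive = llm.synthesize(list(sections.values()))
    return executive, sections
```

Each `llm` call sees at most a batch of 10 posts or 6 short drafts, so no single prompt comes near the context limit.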
4. Cloud-to-Local: Architectures Don't Scale Down
This is perhaps the most counterintuitive lesson of the entire project.
The natural assumption when adding local model support is: "I already have a Gemini prompt that works well — I'll reuse it with Ollama and swap the endpoint." This is wrong, and the failure is not gradual — it is qualitative.
In naval engineering, this problem has a name: similitude failure. When you build a scale model of a ship to test in a tank, the forces that dominate behavior are different at small scale. A full-size ship is dominated by wave drag and inertia. A small model in a tub is dominated by surface tension and viscosity. The same equations, the same shape — completely different governing physics. You can't just shrink the design.
The same principle applies to LLMs:
| Failure Mode | Large Cloud Model (Gemini) | Small Local Model (Llama 3.2) |
|---|---|---|
| Context window | Effectively unlimited (1M+ tokens) | Very limited (~8K tokens) |
| Instruction fidelity | Reliably follows complex prompts | Drifts, invents, simplifies |
| Structured output | Stable with a well-written prompt | Needs strict enumeration of every valid output |
| Hallucination | Rare in classification tasks | Common if categories are not explicitly constrained |
| Single-pass summary | Works for 800+ posts | Fails; context is truncated or ignored |
The dominant failure modes are qualitatively different, not just quantitatively worse. A single Gemini prompt breaks in a completely different way on Llama 3.2 than you'd expect from "a smaller, dumber version."
What this means in practice: Local model support is not a feature — it is a separate architecture. It requires:
- Breaking large prompts into small, constrained steps (Map-Reduce)
- Strict enumeration of all valid outputs in the prompt
- A reflective synthesis pass instead of one-shot generation
- Validation and fallback logic at every stage
Corollary — Define the Output Schema First: This lesson has an immediate practical implication. With two backends (Gemini and Ollama), each had its own system prompt written independently. The result: different category names, different heading structures, and incompatible report formats. The Gemini prompt used free-form topics; the Ollama pipeline used a fixed list of 6 categories. They couldn't be compared or combined.
The fix was to define the 6 categories centrally and update both prompts to reference them. In a multi-backend system, the output schema is the contract — treat it like an API. Writing prompts before agreeing on the schema causes messy retrofits.
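One way to enforce that contract is a single schema module that every backend prompt imports. The category names below are placeholders (the post doesn't list the actual six), and the prompt wording is illustrative:

```python
# schema.py - the single source of truth that both backend prompts import.
CATEGORIES = (
    "Economics & Markets", "Geopolitics", "Technology & AI",
    "Domestic Politics", "Science & Health", "Culture & Society",
)

def category_block() -> str:
    """Render the category list once, for inclusion in every system prompt."""
    return "\n".join(f"- {c}" for c in CATEGORIES)

GEMINI_PROMPT = f"Classify each post into exactly one of:\n{category_block()}"
OLLAMA_PROMPT = f"Use ONLY these category strings, verbatim:\n{category_block()}"
```

Changing a category name now changes it everywhere, and the two backends' reports stay directly comparable.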
5. Model Selection: VRAM is the Real Constraint
Context: VRAM (Video RAM) is the dedicated memory on your GPU. If a model fits entirely in VRAM, inference is extremely fast. If it doesn't, the system "spills" to regular RAM, which is 10-20x slower.
| Model | Size | Fits in 4GB VRAM? | Time for 839 posts |
|---|---|---|---|
| Mistral 7B | 4.4 GB | ❌ (RAM spill) | ~6.5 hours (estimated) |
| Llama 3.2 3B | 2.0 GB | ✅ | ~25 minutes |
| Gemini Flash | Cloud | N/A | ~45 seconds |
The hardware used for this benchmark: Intel 11th Gen i7, 16GB RAM, NVIDIA T500 (4GB VRAM) — a mobile workstation laptop several years old. Despite its age, Llama 3.2 ran entirely in its VRAM and processed the full 839-post dataset comfortably.
Trap: "Mistral-Small" sounds like a small model — it is actually 24B parameters (~14GB). The "Small" in Mistral's naming refers to it being small relative to their 123B flagship, not relative to a 7B base model.
Takeaway: For batch-heavy local pipelines, the largest model that fits completely in VRAM will outperform a "smarter" model that doesn't — often by an order of magnitude.
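The fit test itself is back-of-envelope arithmetic. Here is a rough sketch; the 0.8 GB runtime overhead is an assumed ballpark (real KV-cache overhead varies with context length), not a measured figure from the project:

```python
def fits_in_vram(model_gb: float, vram_gb: float, overhead_gb: float = 0.8) -> bool:
    """Rough check: weights plus KV-cache/runtime overhead must fit on the GPU."""
    return model_gb + overhead_gb <= vram_gb

# Sizes from the table above, against the T500's 4 GB of VRAM.
assert not fits_in_vram(4.4, vram_gb=4.0)   # Mistral 7B spills to system RAM
assert fits_in_vram(2.0, vram_gb=4.0)       # Llama 3.2 3B fits entirely
```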
6. Prompt Engineering: Strictness Over Creativity
Challenge: Local models are prone to "category hallucination" — inventing category names not in the provided list (e.g. returning "Finance & Economy" instead of the defined "Economics & Markets"). This corrupts the Map-Reduce grouping step.
What worked:
- Providing the exact list of 6 category strings in the prompt with explicit instruction to use only these strings.
- One-shot formatting: showing the model one example of the expected input/output pair.
- Strict matching in the classification parser with a fallback to "unrecognised" rather than accepting approximate matches.
Takeaway: For structured classification tasks, suppress model creativity. Enumerate all valid outputs explicitly. Any ambiguity will be exploited.
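The strict-matching parser described above can be sketched like this (category names are placeholders, since the post doesn't list the actual six):

```python
VALID_CATEGORIES = {
    "Economics & Markets", "Geopolitics", "Technology & AI",
    "Domestic Politics", "Science & Health", "Culture & Society",
}

def parse_category(raw: str) -> str:
    """Accept only exact category strings; anything else falls back to 'unrecognised'."""
    label = raw.strip().strip('".')  # tolerate stray quotes/periods, nothing more
    return label if label in VALID_CATEGORIES else "unrecognised"
```

Note what this refuses to do: no fuzzy matching, no substring matching. A model that invents "Finance & Economy" lands in the `unrecognised` bucket, where the failure is visible instead of silently corrupting a category.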
7. Code Quality: SonarCloud as Your External Auditor
As a non-developer, you cannot assess the security or quality of code you didn't write. You can't spot a deeply nested function that will become impossible to maintain, or know when exception handling is done in a way that suppresses critical signals. You have to trust that the code is correct.
SonarCloud provides an independent, automated answer to that trust problem. Every time code was pushed to GitHub, SonarCloud ran a quality pass and flagged any issues — from structural complexity to unsafe exception handling patterns. No developer judgment required; the findings are plain-language, specific, and actionable.
The workflow that emerged was trivially simple:
- Check the SonarCloud dashboard after a push.
- Copy the flagged issue description.
- Hand it directly to the coding agent: "Fix this SonarCloud warning."
- The agent implements the fix, explains what changed, and pushes.
No major security issues were found. The warnings were structural — functions too complex to maintain easily, commented-out dividers mistaken for dead code, a try/except pattern that could suppress system signals. All were resolved quickly.
Takeaway: For non-developer-led projects, a CI/CD quality gate like SonarCloud is not optional — it is your independent code reviewer. It closes the gap between "the agent wrote code" and "the code is trustworthy."
8. Git Strategy: Branches for Architectural Changes
Challenge: During the shift to Map-Reduce, the codebase changed significantly — new modules, new data flow, new prompts. The temptation is to keep going until it "works" before committing. The result: a sprawling unstaged delta that's hard to reason about or partially revert.
The psychological trap: "I don't want to commit broken code." This causes delayed commits, which causes large messy commits, which causes reverting damage to be much worse.
What to do instead:
- `git checkout -b feature/map-reduce` — protect `main` from the work-in-progress.
- Commit each stable sub-component individually: the classifier function, the category selector, the section generator, then the assembler.
- End-to-end doesn't need to work for each commit — only the piece you just wrote needs to be internally correct.
- Merge to `main` only when the full pipeline is tested green.
Takeaway: A broken commit on a branch isn't broken code — it's a checkpoint. Use branches to make partial commits feel safe.
9. Beyond Vibe Coding: Agentic Technical Direction
When you build software entirely through conversational AI, the popular term is "vibe coding" — a process often characterized by low-context requests ("make an app that does X") and the blind acceptance of AI-generated code. This frequently results in unmaintainable architecture because the human doesn't actually understand the system.
This project demonstrated something fundamentally different: Agentic Technical Direction.
In this model, I acted as a Tech Lead or Technical Product Manager, not a typist:
- Defining Architecture: I made the hard structural decisions ("Use Ollama for local inference," "Rank by engagement instead of X's proprietary algorithm," "Implement Map-Reduce to handle context windows").
- Auditing & Verifying: I didn't accept the first draft. When the agentic AI surfaced failure modes (hallucinations, VRAM spills, API costs), I defined the mitigation strategy (e.g., "Implement a `--from-summary` skip flag").
- Enforcing Standards: I demanded external quality gates (SonarCloud), insisted on 90%+ test coverage, and enforced open-source standards (the MIT license, `CONTRIBUTING.md`).
- Documenting the Why: I framed the project as a cohesive product with a distinct point of view (Local Privacy, Deep Reading), rather than just a script.

The AI agent wrote the code, but I designed the system. That is the future of software engineering.
10. Leveling Up: Skills for Agentic Technical Direction
Acting as a Technical Director for an AI agent requires a specific skill set that differs from traditional coding. Based on my workflow in this project, here are the key areas I need to develop for future agentic projects:
What to Learn: Git and Version Control
- The Challenge: The fastest way out of bad AI-generated code is not asking the AI to fix it, but simply reverting to the last known good state. If I don't control the version timeline, the AI controls me.
- The Skill: Master the difference between the working directory, staging area, and commits. Learn to ruthlessly use feature branches (`git checkout -b new-experiment`) so the main branch stays pristine until I explicitly approve the agent's work.
What to Improve: The Skill of "Scaffolding"
- The Challenge: AI models struggle with "big bang" integration where everything is built at once, but they are incredible at filling in the blanks.
- The Skill: Practice "Data-First" building. Before asking the agent to write a processing pipeline, manually define exactly what the input data and desired output data look like. Write empty function signatures (stubs) with clear docstrings, and let the AI fill in the logic.
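A data-first stub might look like the following. The signature and docstring pin down the contract before any logic exists; the function name and data shapes here are hypothetical:

```python
def classify_posts(posts: list[dict], categories: list[str]) -> dict[str, list[dict]]:
    """Group each post under exactly one category.

    Input:  posts like {"id": 1, "text": "...", "score": 4.2}
    Output: {"Category name": [post, ...]}; every post appears exactly once.
    """
    raise NotImplementedError  # the agent fills in the logic
```

With the input and output shapes fixed, the agent's job becomes filling in a well-specified blank rather than guessing the design.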
What to Study: Fundamentals of Software Architecture
- The Challenge: Without structural guidance, an AI will take the path of least resistance, resulting in giant, monolithic files holding thousands of lines of spaghetti code.
- The Skill: Learn the vocabulary of system design to direct the agent on how files should interact. Understand "Separation of Concerns" (e.g., splitting fetching, ranking, and rendering) and "Dependency Injection" (which allowed us to easily swap between Gemini and Ollama).
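The dependency-injection idea can be sketched in a few lines: the pipeline accepts any object satisfying a small interface, so swapping Gemini for Ollama is a one-argument change. The class and method names below are illustrative, not the project's actual code:

```python
from typing import Protocol

class Backend(Protocol):
    """The contract every summarization backend must satisfy."""
    def summarize(self, posts: list[str]) -> str: ...

class GeminiBackend:
    def summarize(self, posts: list[str]) -> str:
        return f"[cloud] summary of {len(posts)} posts"

class OllamaBackend:
    def summarize(self, posts: list[str]) -> str:
        return f"[local] summary of {len(posts)} posts"

def build_report(posts: list[str], backend: Backend) -> str:
    # The pipeline never constructs its own backend; it receives one.
    return backend.summarize(posts)
```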
The Superpower to Cultivate: The "Stop and Inspect" Reflex
- The Strength: My best moments in this project occurred when development was paused to inspect failures. When SonarCloud flagged complexity, I had the agent refactor it. When a $4 API cost was observed, I had the agent build a bypass flag.
- The Lesson: Many developers let AI tools generate code until the system collapses under its own weight. The reflex to inspect output, measure costs, and refactor messes before adding the next feature is exactly what makes a great Technical Director.
11. Conclusion: Learning by Doing with an AI Partner
This project started as a focused utility and became a hands-on education in local AI orchestration, Python engineering, API management, CI/CD quality gates, and system design.
The most important meta-lesson: complex, real work is the best teacher. Passively reading about Map-Reduce, VRAM constraints, and OAuth authentication doesn't produce the same understanding as hitting those walls directly and solving through them.
Working with an agentic AI partner compressed that learning dramatically — not by removing the problems, but by making it possible to actually encounter and solve them faster than you could alone.
The barriers to running your own local AI pipeline, writing production-quality Python, and shipping to GitHub with automated tests are lower than they appear. You just need to start.
And the trend is clear: the models that ran efficiently today on older laptop hardware would have required a high-end server just a few years ago. As quantization and architecture efficiency improve, the "VRAM constraint" will keep relaxing. A 3B model in 2026 already does things a 13B model from 2023 struggled with. The intelligence-per-gigabyte ratio is rising fast — and it will keep rising.

