Daniel Nwaneri

Posted on May 27

What Building My Own AI Bot Taught Me About Generative AI

#ai #llm #rag #webdev

Intelligence as high-quality retrieval

I built a bot that draws on my own X bookmarks and likes. Around 50,000 of them, accumulated over years of lurking, arguing, and clicking the save button on things that made me stop scrolling.

The technical part isn't complicated in principle. You pull your export, embed the text, build a RAG pipeline, add a style prompt derived from your own writing patterns, and you get something that responds to prompts by retrieving your most relevant saved content and riffing from there. I called it Bookmark Brain, which is either clever or embarrassing — I haven't decided.
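The steps above can be sketched in a few lines. This is a minimal illustration, not the actual Bookmark Brain code: the sample bookmarks, the style prompt wording, and the function names are all hypothetical, and the word-count `embed` function is a toy stand-in for a real embedding model. The output is the prompt that would be handed to an LLM; the generation call itself is left out.

```python
import math
import re
from collections import Counter

# Hypothetical stand-ins for exported bookmarks; not real data.
BOOKMARKS = [
    "REST APIs should version through headers, not URL paths.",
    "Most AI demos hide how much retrieval is doing the work.",
    "Good API design makes the wrong thing hard to do.",
]

# Assumed wording; a real style prompt would be derived from the author's writing.
STYLE_PROMPT = "Write in a specific, slightly annoyed voice."


def embed(text):
    """Toy embedding: lowercase word counts. A real pipeline would call an embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two word-count vectors."""
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, docs, k=2):
    """Return the k documents most similar to the query, as (score, text) pairs."""
    q = embed(query)
    scored = [(cosine(q, embed(doc)), doc) for doc in docs]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:k]


def build_prompt(query, docs, k=2):
    """Compose style instructions, retrieved context, and the question into one prompt."""
    context = "\n".join(f"- {text}" for _, text in retrieve(query, docs, k))
    return f"{STYLE_PROMPT}\n\nSaved context:\n{context}\n\nQuestion: {query}"


if __name__ == "__main__":
    print(build_prompt("What makes good API design?", BOOKMARKS))
```

Swapping `embed` for a real embedding model changes the quality of the neighbors, not the shape of the pipeline.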

What I didn't expect was how much it would clarify my thinking about what generative AI actually is.

The bot works too well. That's the problem.

When I ask it about API design opinions or takes on the current AI hype cycle, it returns something that sounds like me — specific, slightly annoyed, grounded in a particular set of concerns — better than most general-purpose LLMs do when I prompt them with "write in my voice." The difference isn't the model. It's the retrieval layer. The model in both cases is doing the same approximate thing. What changes is what it retrieves before it starts generating.

That realization landed harder than I expected: a significant chunk of what we call AI "intelligence" is retrieval. The system finds related content, mixes it with the query, and produces output shaped by that specific neighborhood of the embedding space. It's not thinking. It's not understanding. It's doing something closer to extremely sophisticated autocomplete with a memory. The illusion of reasoning comes from the quality of what was retrieved, not from inference happening in any deep sense.

The uncomfortable follow-on: I started noticing the same thing in myself. A lot of what I'd been calling original thinking was my brain doing something structurally similar — retrieving from a curated internal dataset of influences, combining them in ways that felt novel, outputting with enough fluency to pass as insight. The bot didn't make me feel smarter. It made me suspicious of my own cognition.

My bot sounds coherent because my bookmarks are coherent. I've spent years curating a specific worldview — skeptical of tech hype, interested in systems and incentives, irritated by vague abstraction. That worldview is baked into the dataset. Retrieval finds it. The model outputs it in grammatical sentences. The whole thing looks like intelligence from the outside.

Then the Granta thing happened.

If you missed it: Granta, the literary magazine, ran a piece flagged by AI detectors. Turns out the writing was human, and older than the detectors themselves: pre-2022, written before the tools used to assess it even existed.

The writer, understandably, was furious. The editorial response was clumsy. What struck me was the confidence behind the process — the idea that a detector score constitutes evidence of anything meaningful.

It doesn't. AI detectors are probabilistic classifiers trained on distributional differences between human and AI writing. Dense, formal, or unusual prose trips them constantly. Academic writing, translated text, anything with a compressed or structured style — all of these get flagged. The detector isn't reading. It's pattern-matching statistical features. And those features shift as models improve, as writing styles evolve, as the gap between the training distribution and current reality widens.
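One statistical feature often cited for detectors is perplexity: how predictable a text is under some language model. The sketch below is a toy, not how any specific detector works. It assumes a unigram model with add-one smoothing, an invented reference corpus, and an arbitrary threshold, where a real tool would use an LLM and more features. It shows the failure mode described above: formal human prose that resembles the reference corpus scores as predictable and gets flagged.

```python
import math
import re
from collections import Counter


def tokenize(text):
    return re.findall(r"[a-z']+", text.lower())


class UnigramModel:
    """Toy language model: word frequencies with add-one smoothing."""

    def __init__(self, corpus):
        tokens = tokenize(corpus)
        self.counts = Counter(tokens)
        self.total = len(tokens)
        self.vocab_size = len(self.counts) + 1  # one extra slot for unseen words

    def prob(self, token):
        return (self.counts[token] + 1) / (self.total + self.vocab_size)

    def perplexity(self, text):
        tokens = tokenize(text)
        if not tokens:
            raise ValueError("text has no tokens")
        log_prob = sum(math.log(self.prob(t)) for t in tokens)
        return math.exp(-log_prob / len(tokens))


def flag_as_ai(model, text, threshold):
    """Flag text whose perplexity falls below the threshold (hypothetical rule)."""
    return model.perplexity(text) < threshold


# Invented reference corpus standing in for "what machine text looks like".
REFERENCE = (
    "the results of the study indicate that the method is effective and "
    "the analysis of the data supports the conclusion that the approach is reliable"
)

if __name__ == "__main__":
    model = UnigramModel(REFERENCE)
    formal_human = "the analysis of the data indicate that the method is reliable"
    casual_human = "honestly my gut says this whole thing smells kinda off"
    for text in (formal_human, casual_human):
        print(round(model.perplexity(text), 1), flag_as_ai(model, text, 25.0), text)
```

Both sample sentences are human-written by construction; only the formal one falls under the threshold.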

Watching publications, employers, and universities lean on these tools as if they're reliable is the same energy as relying on a polygraph. The tool isn't detecting deception. It's detecting nervousness, or formality, or the wrong register for the context. The score doesn't mean what the people reading it think it means.

What the Granta situation made concrete for me: we have a collective problem with mistaking a signal for the thing the signal supposedly measures. Perplexity score is not authenticity. Semantic similarity is not understanding. And this is the same confusion that inflates most AI capability claims.

Here's the irony I live with every day.

I use AI heavily. I build with it, write with it, prototype faster because of it. I'm not performing skepticism while secretly relying on it — I'm actually relying on it, out in the open, and also genuinely skeptical of what it's doing and why the claims around it are so often overconfident.

Yes, I'm part of the problem. I know that. But I built Bookmark Brain precisely because I wanted to understand what the problem actually is — not at the level of takes and op-eds, but at the level of retrieval logs and embedding distances and why a particular output came out the way it did. The people most confident about AI — evangelists and critics alike — are usually the ones who haven't built anything with it. They're responding to the outputs. I wanted to see the pipes.

My bot makes this concrete in a specific way. Because I can see exactly what it's doing — retrieve, compose, style-match — I can no longer pretend the underlying process is mysterious. It isn't. It's a very good pattern engine. And the patterns it's good at are the ones humans have already made enough times to constitute a retrievable signal.

The things it can't do are equally clear. It can't tell me something genuinely new. It can't resolve contradictions in my bookmarks; it just retrieves whichever side of an argument is more semantically proximate to my query. It has no persistent sense of what I care about most. Whatever priorities it appears to have live in the embedding vectors and the retrieval ranking, not in anything like a value structure. If I've saved content across five years on Nigerian economic policy, it can retrieve that content. It cannot tell me what I should think about a new development that doesn't yet exist in those embeddings.
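The contradiction point can be shown with a toy retriever. The two saved notes and both queries below are invented, and word-count cosine similarity stands in for real embeddings (an assumption for illustration only). The retriever never weighs the two positions against each other; it returns whichever one shares more vocabulary with the question.

```python
import math
import re
from collections import Counter

# Two invented bookmarks that take opposite sides of the same argument.
NOTES = {
    "pro": "microservices help large teams ship independently",
    "con": "microservices add operational overhead small teams cannot afford",
}


def embed(text):
    """Toy embedding: lowercase word counts, standing in for a real embedding model."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[token] * b[token] for token in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def closest_side(query):
    """Return the key of the note nearest to the query. No reconciliation happens."""
    q = embed(query)
    return max(NOTES, key=lambda side: cosine(q, embed(NOTES[side])))


if __name__ == "__main__":
    # Same topic, different phrasing, opposite "opinion" retrieved.
    print(closest_side("should large teams ship microservices independently"))
    print(closest_side("can small teams afford microservices overhead"))
```

The answer to "what do I think about microservices" depends on how the question is worded, because the ranking is the only thing deciding between the two notes.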

That's not a criticism. It's just an accurate description of what the tool is. The criticism is when people — including, honestly, past me — talk about these systems as if they're operating at a different level entirely.

Most people initially misunderstand generative AI the same way. They see the output and map it to human cognition because that's the only reference frame available. The output sounds like thinking. Therefore it is thinking. The logic is understandable and wrong.

What's actually happening is closer to: the system holds a compressed representation of a large body of existing human expression, retrieves the most contextually relevant parts, and generates a continuation that's statistically consistent with that neighborhood. That's not nothing. In fact it's remarkable. But it's not reasoning. It's not understanding. And it absolutely is not reliable in domains where the training distribution doesn't match the actual problem.

Building Bookmark Brain made this concrete rather than abstract. I could watch the retrieval logs. I could see what it was pulling. I could trace why a particular response came out the way it did. That transparency — available only because I built it — is exactly what's missing when people interact with closed systems and anthropomorphize the outputs.

The piece of this I'm still sitting with is about curation.

My bot is useful because I curated carefully for years. The quality of the output is downstream of the quality of the input — not the model, not the prompt engineering, the input. 50,000 bookmarks that reflect a consistent set of concerns, an identifiable worldview, real opinions.

If I'd bookmarked everything uncritically, the bot would be incoherent. Garbage in, garbage out, but at scale and with a convincing fluency that would make the garbage harder to spot.

That's the thing about generative AI broadly: it doesn't make bad data good. It makes it fluent. And fluency is exactly the property that makes it hard for people — including detectors, including reviewers, including people who should know better — to evaluate what's actually in front of them.

I built a tool that sounds like me. It works because of what I put into it, not because of anything the model does that's particularly special. The model is a compositor. The dataset is the author.

That's the most clarifying thing I've learned. It's what almost every discussion about these systems gets wrong.

Top comments (26)

Mykola Kondratiuk • Jun 3

what surprised me building something similar - once you add the style layer it stops feeling like retrieval and starts feeling like you actually said it. RAG without that is basically a glorified search.

Daniel Nwaneri • Jun 3

The style layer point is the one most people skip when they write about RAG. Without it you have a search engine with extra steps. With it you have something that feels authored and that gap is where the interesting questions live. The retrieval is doing the work but the style layer is what makes it feel like thought rather than lookup. Which is exactly what unsettled me about it.

Mykola Kondratiuk • Jun 3

honestly, style is the finish work but retrieval is the foundation — i've seen polished prose wrapped around the wrong chunks and it's worse than a bad UI because it feels authoritative. if the chunking is off, no style layer rescues it. style multiplies retrieval quality, it doesn't replace it.

Daniel Nwaneri • Jun 3

"Feels authoritative" is the exact failure mode. Bad UI signals itself. Broken layout, missing button, you know something is wrong. Bad chunking with good style signals nothing. The output reads clean and confident and is wrong about what it retrieved. That's a harder bug to catch because the surface looks fine.

Which means chunking strategy is actually a trust problem, not just a performance problem. The style layer is downstream of it entirely. You're right that it multiplies retrieval quality rather than replacing it. The implication is that most RAG evaluations are measuring the wrong layer...

Mykola Kondratiuk • Jun 3

yeah, from the PM side there's no ticket for this. model sounds right, PM ships it. the gap surfaces weeks later when the agent contradicts itself across artifacts and someone finally traces it back to the retrieval layer.

Ingo Steinke, web developer • Jun 3 • Edited

Such a bookmark brain is an idea that has kept fascinating me ever since computers got more powerful and the internet connected us to so many new sources and ideas. Fuzzy full-text search seemed to be an obvious answer, but what about synonyms? Tagging, hyperlinking, and bookmarking showed that we still needed curation and communication. And what about information getting outdated? And, now, what about AI hallucination?

Your project post points out two important things: AI, or any other processing, doesn't make bad data good. And many of "our own thoughts" are just shaped by our past experience and input. You conclude that the dataset is the author. I think that's a generalization neglecting actual analysis, creativity and findings. But in the context of AI and thinking, you're right.

What's also missing in many current discussions: facts still matter, creativity still matters, and infinite monkey authors still won't write like Shakespeare. And AI won't, either.

Daniel Nwaneri • Jun 3

The pushback on "dataset is the author" is fair and I'll take it partially. The essay was making a narrower claim: that for this specific system, the quality of output is downstream of curation quality, not model quality. The generalization breaks down when you introduce analysis and creativity that genuinely recombines rather than retrieves. The honest version is: the dataset is the dominant author and the model is a compositor with limited creative range.

The infinite monkey point cuts deeper than it looks though. It's not just that volume doesn't produce quality. It's that style is irreducible to pattern frequency. Shakespeare's compression, the specific weight of a line, isn't statistically recoverable from enough similar text. That's the ceiling. It's not a scale problem....

Mr. Marquez • May 28

I literally created an account just to say this...

John Stuart Mill argued that we cannot logically confirm other people have minds, but can only assume they do based on an argument from analogy. Mill reasoned that because other humans have bodies like our own and exhibit similar outward signs and behaviors, it is reasonable to infer they possess similar inner lives and consciousness, even though this remains an unproven assumption rather than certain knowledge.

Actual Intelligence entails not only learning and retrieval, but deduction and generalization as well; and, most important of all, does not equate frequency of repetition with truth. A single observation, the single instance of an unexpected occurrence, accepted as an undeniable fact, is sufficient to call into question a lifetime of assumptions.

As a thought experiment, consider the following: if an LLM were trained on 8th century knowledge, how intelligent would we consider it?

Daniel Nwaneri • May 28

The Mill framing cuts right to it. We extend mind-attribution by analogy — similar body, similar behavior, therefore similar inner life. LLMs pass the behavioral test well enough that the analogy fires automatically, even when the substrate is completely different.

The 8th century thought experiment is the sharper point though. The output would look primitive, and we'd call it unintelligent but the architecture would be identical. Which means what we're actually measuring when we say "intelligent" is the quality of the training distribution, not the reasoning process. That's not a small concession.

The part I'd push back on slightly: deduction and generalization. LLMs do something that looks like both, often convincingly. The failure mode isn't absence of those behaviors. It's that they're not grounded. A single contradicting observation should update the model. It doesn't. The weights are frozen. That's the real gap.

Mr. Marquez • May 30

You're right to push back on deduction and generalization (induction).
My mental model of human cognition and intelligence isn't formalized just yet, and those, I'll readily admit, were recently incorporated temporary placeholder concepts - better than what I had, but not quite good enough. The fact of the matter is that it's difficult to put my finger on it and put it into words. I glimpse but can't grasp. Discussion helps.
Inference, induction and deduction are integral to reason and logic; and reason and logic are inextricable from memory. Memory, as it so happens, is precisely what we've built and termed AI - a memory retrieval system. Induction is how it learns, from specific instances it derives general principles based on pattern recognition, which it can then apply to specific instances, such as forming a coherent sentence, and that's a form of deduction.
But its deduction is contextual, relying on those same patterns. Which is why, if asked: "Should I walk or drive to the carwash that's two blocks away?", it answers: "Walk". Because it's relying on language patterns for 'reasoning', not actually reasoning about what's been said. There is no conception of reality, as such, and therefore nothing to ground the words.
Deep learning systems that simulate outcomes based on known principles, that's a whole other story. That's much closer to human cognition. It's narrow intelligence, admittedly, but a step in the right direction. A necessary one.

Daniel Nwaneri • May 30

The carwash example is the cleanest version of this I've seen. The answer is linguistically correct and situationally wrong and the model has no mechanism to know the difference because there's no referent, only pattern. "Two blocks away" is a spatial fact. The model processes it as a token relationship.

The deep learning simulation point is where it gets interesting though. Physics simulators, protein folding, weather models — those systems are grounded in the thing they're modeling. The outputs are constrained by reality, not just by prior outputs. That's a genuinely different epistemic situation. The question I keep coming back to: is the gap between those systems and current LLMs architectural, or is it about what the training signal is coupled to? Because if it's the latter, grounding is a design choice, not a fundamental limit.

Mr. Marquez • Jun 2

I believe it to be architectural.

RLHF is... interesting, in its implications. I find it decidedly ironic that neural networks were proposed and presented as an alternative to rule-based programming, to avoid the near impossible complexity of writing rule-based programs that could suit every eventuality, as I've heard Geoffrey Hinton state multiple times, only to turn around and say: "Well, that worked, but not quite well enough. Maybe if we correct it with rules...".

That's how I think of RLHF: "if output is X, replace with Y" or "if gethumaninput(x>y): x else y". That being the case, I question the premise itself. How far did the neural network architecture for machine learning progress the field towards actual intelligence? How much of what we consider noteworthy and significant about LLMs is actually derived or influenced by human feedback? Of course, guardrails are an essential security feature, so that step cannot be avoided, but my understanding from ablative models is that RLHF cements both the security features and the adherence and coherence of the model, which does beg the question: machine learned, or rule-based?

I suspect that RLHF -- again, ironically -- is actually making the models dumber and less accurate. After a long back and forth with GPT on a complex subject I knew little about, I realized it judged its own output as incorrect. When I pointed this out, it replied: "That's intentional. We're peeling away layers of complexity. The language is imprecise but pedagogically useful." My conclusion? Humans might prefer erroneous information that's understandable to correct information that is not. Hence my suspicion.

But getting back to the point at hand, architectural vs training signal. I believe we've modelled memory. With memory, knowledge. With knowledge, the appearance of reason and intelligence. But human reason functions more like a simulation: a world model based on items and properties, known relationships and assumptions alike, where deconstruction and recombination take place, and trial and error is allowed to occur until the desired eventuality is arrived at. That is why, in my view, AlphaGo, AlphaFold, and the like are the real breakthrough in the field. Applying the technology to build a world model that isn't based on an abstract symbolic representation of the world -- that being language.

We've essentially distilled an abstraction from an abstraction.
That's a world model that is too far removed from the world it models, for my particular taste.

Daniel Nwaneri • Jun 2

"Abstraction from an abstraction" is the formulation I've been circling without landing on. Language is already a lossy compression of experience. Training on language trains on the compression, not the thing compressed. AlphaFold works because the training signal is coupled to physical reality — the protein either folds or it doesn't. The feedback is grounded. LLMs get feedback on whether the output satisfies a human, which is a judgment about the abstraction, not the territory.

The RLHF point is sharper than it first looks. If the model learns that humans prefer confident, digestible wrongness over accurate complexity, then RLHF is systematically optimizing away from truth and toward palatability. That's not a guardrail problem — that's the training objective working exactly as designed, just toward the wrong target.

Where I'd push: is the fix a better world model, or a better feedback signal? AlphaGo has a world model but it's narrow by design. Generalizing that architecture requires grounding every domain in something as unambiguous as a Go board or a protein structure. Most of the domains we actually care about don't have that...

Mr. Marquez • Jun 2

"Most of the domains we actually care about don't have that..."

Don't they, though? It's a tough question. On the one hand, humanity's priority is science. Science is practical. It doesn't care if we understand why something happens, as long as it's predictable, replicable and applicable. Then we can make use of it. On the other hand, some advancements may be so intricate that we fail to foresee if there might be a simple underlying set of rules that governs it.

An Alpha... Material something-or-other would be interesting, because it'd materialize both branches. On the one hand, I trust new materials could be discovered by sheer volume and speed in trial and error iteration, but actually making that molecule in the real world would require devising a chemical process, optimizing it, discarding it if too resource intensive, etc., and even being aware of raw material costs, transportation, energy, initial investment, market value for the final product, etc., etc., etc. That's a bridge model, in a sense. A useful application that demands another model be designed, so that the first may be utilized to its fullest potential. The latter seems infinitely complex, but that is the ideal use case for machine learning algorithms: seemingly impossible, actually manageable.

Personally, I don't believe we'll reach anything that actually resembles reason and intelligence until and if words become associated with a 3d model that understands, through the same learning process, the physical laws of the universe. That's a big architecture change, though, a technology unbeknownst to us, far beyond what we currently possess. And, for all the time, effort, research and investment it'd require, it is most likely unnecessary at this point in time. There is valuable low hanging fruit still... in narrow intelligence.

Undoubtedly, better training data and better feedback is the most immediate and accessible improvement. I suppose that that is why data annotation jobs are on the rise. The ROI justifies it. Everything else is hypothetical, and therefore a business risk.

Daniel Nwaneri • Jun 3

The bridge model framing is the useful move here. AlphaMaterials would need a second model just to be usable: one that understands chemical synthesis constraints, supply chains, energy costs. That's not a limitation, that's actually how narrow intelligence compounds. Each model handles one grounded domain and the value comes from chaining them, not from building one system that reasons across all of them.

The 3D world model thesis is probably right as a direction and wrong as a timeline. Grounding language in physical simulation is the architecture change that would close the gap but the annotation ROI point cuts against waiting for it. You can extract enormous value from better-grounded narrow systems right now, which is exactly why the investment keeps flowing there instead of toward the harder architectural problem.

The part I keep sitting with: data annotation jobs rising is a signal that the feedback signal problem is understood internally even if it's not framed that way publicly. That's essentially admitting the current architecture needs human grounding at scale to stay useful. Which circles back to your earlier point: machine learned or rule-based? Increasingly, both.

Varsha Ojha • May 28

Building your own AI bot teaches the messy parts quickly. The demo is easy. The real learning starts with prompts, memory, bad outputs, edge cases, and figuring out where human control still needs to stay in the loop.

Syed Ahmer Shah • Jun 3

Congratulations on shipping your first AI bot, Dan! There is a massive difference between reading about how LLMs work and actually wrestling with the APIs and logic to build something yourself.

leob • May 28 • Edited

Interesting experiment! But that's a rather "humbling" take - that LLMs are doing not much more than regurgitating what we've put into them!

Well, at a basic level - however, I still think (not based on any real deep insight, but on "things I've read") that when you just put enough "stuff" into it (on a gargantuan scale - what the likes of OpenAI, Anthropic, Google etc etc do), and you let the LLM do the "probabilistic recombinatorial" thing, some rather amazing results come out - is that maybe what they call "emergent behavior"?

So they're saying that, at a fundamental level, we don't understand how LLMs work - we do understand the "low level mechanics" (the math and the hardware), but not how it produces its frankly baffling results - for lack of a better explanation, we call it "emergent behavior" ...

However - isn't the same true for our own brains, that fundamentally we don't really understand how it works? Again, we do understand the "low level mechanics" (the neurons and the synapses and so on), but not how it produces its astonishing results - again, we call it "emergent behavior" ...

The parallels are striking - the biggest difference is that AI doesn't have "initiative of its own", while humans do (and I think we should be happy about that) - probably that's because we were driven by evolutionary pressures (survival) to have that 'initiative'? It's kind of inherent in all biological organisms - LLMs don't reproduce, don't evolve, don't try to "survive" - if they would, would that be "AGI"? But it would obviously be scary!

Another difference is that LLMs don't really seem to be capable of "original thought" - or would they (if "scaled up enough")? The litmus test would be whether LLMs could at some point come up with a genuinely new theory (like Einstein's relativity theory, or quantum mechanics) - then again, most humans aren't capable of that either, it requires a rare stroke of genius and a lot of "coincidence" - right time, right place, pre-existing knowledge to work with ...

Daniel Nwaneri • May 28

The emergence parallel is real but I'd push back on the symmetry. With brains, we don't understand the mechanism or the output reliably. The behavior surprises us in both directions.

With LLMs, we understand the mechanism completely and still can't predict the output. That's a different kind of mystery.

The Einstein test is the sharp version of this: relativity required rejecting the dominant framework, not recombining within it. A model trained on Newtonian physics at sufficient scale would produce better Newtonian predictions, not special relativity. The gap isn't scale. It's that genuine theoretical breakthroughs require treating anomalies as load-bearing rather than noise to average over.

leob • May 28 • Edited

Well you're right :-)

I was under the impression that (neuro)biologists understand the basic mechanisms, and would be able to fully understand the nervous system of very simple animals (there's a roundworm that has exactly 302 (hermaphrodite) or 383 (male) neurons) - but even that turns out to be currently out of reach - according to Google:

"No, scientists do not fully understand how the nervous system works, even in the simplest animals. While we have extensively mapped the physical wiring of some organisms, a complete structural map does not automatically reveal how the brain computes information to produce complex, dynamic behaviors."

Bummer - so even for that tiny roundworm, its nervous system is too complex for us to fully understand - let alone if you multiply that approximately a 100 million fold to get a human brain! That seems a somewhat "unsolvable" puzzle - I was a bit too optimistic there, lol ;-)

Maybe AI could someday help us solve that puzzle - using an artificial "brain" that we don't understand to help us understand another brain which we also don't understand ;-)

P.S. yeah I also don't believe that current LLMs would ever have the "originality" to invent a totally new theory (relativity etc), even if we'd scale it up by orders of magnitude - maybe the point is it's "rigid" and lacks the plasticity of a biological brain? I can't prove it but there must be fundamental differences ...

Daniel Nwaneri • May 28

The C. elegans point actually sharpens the original argument. 302 neurons, fully mapped connectome, and behavior still isn't predictable from the wiring diagram alone. That's the thing: structure doesn't explain dynamics.

Which cuts both ways: it makes the brain-LLM parallel weaker, not stronger. We can't explain either from first principles but the reason we can't is different in each case.

For LLMs it's the dimensionality of the weight space. For biological systems it's that the map is incomplete and the territory keeps changing. The roundworm's neurons are plastic. The weights aren't.

leob • May 28

Yeah you're right, probably the similarities (if there are any) are just skin deep - and I have the feeling that we don't even know what we don't know! ;-)

Andy Stewart • May 28

Spot on! This hits the nail on the head—RAG is essentially about retrieval, not true reasoning. A highly curated, local dataset is the real soul of AI; the model is just the assembly worker. Defending our data sovereignty and becoming true data curators is the ultimate hard-core asset in this era.

xulingfeng • May 28

This is a solid take on ai. One thing I'm curious about: how does this handle edge cases at scale? We hit some interesting bottlenecks around the 50-agent mark that forced us to rethink the architecture.

Followed you — keen to follow your work on this! 👋