I decided to build an LLM Twin using a clean ETL + FTI (feature, training, inference) architecture, thinking it would be structured, scalable, and elegant.
It started well.
I designed a proper ETL pipeline:
extract data from blogs, GitHub, and posts
clean and normalize everything
store it nicely in a database
Simple, right?
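In spirit, the pipeline was just three functions. Here's a rough sketch of that extract → transform → load flow; the helpers and the dict-as-database are hypothetical stand-ins for illustration, not the actual implementation:

```python
import re
from dataclasses import dataclass

@dataclass
class RawDocument:
    source: str   # e.g. "blog", "github", or "post"
    content: str

def extract(sources):
    # Extract: the real pipeline would crawl blogs, GitHub, and posts;
    # here we just wrap pre-fetched strings.
    return [RawDocument(source=s, content=c) for s, c in sources]

def transform(doc):
    # Transform: strip HTML tags and collapse whitespace.
    text = re.sub(r"<[^>]+>", " ", doc.content)
    text = re.sub(r"\s+", " ", text).strip()
    return RawDocument(source=doc.source, content=text)

def load(docs, db):
    # Load: "store it nicely" -- here the database is just a dict
    # keyed by source type.
    for doc in docs:
        db.setdefault(doc.source, []).append(doc.content)
    return db

docs = extract([("blog", "<p>Hello   world</p>"), ("post", "plain  text")])
db = load([transform(d) for d in docs], {})
print(db)  # {'blog': ['Hello world'], 'post': ['plain text']}
```

On paper, each stage has one job and the data flows one way. That's the version that survives about two days of contact with real HTML.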
Then reality happened.
My “clean data pipeline” slowly became:
random HTML scraping
inconsistent formats
mysterious edge cases
But technically…
it was still an ETL pipeline 😅
The idea was smart though:
Instead of overcomplicating things, I reduced everything to just three types:
articles
repositories
posts
Which meant I could scale easily later without rewriting everything.
That part actually worked.
But here’s the funny part.
I thought I was building a system that understands data.
What I really built was a system that shows me:
how messy real-world data is
how optimistic my assumptions were
and how “simple architecture” becomes complex in 2 days
Final Thought
You don’t build an LLM system in one go.
You:
build something messy
make it work
then slowly make it make sense
And somewhere along the way…
your “LLM Twin” starts looking less like a tool,
and more like a mirror of your own engineering decisions.
