After collecting data, the next step in the FTI architecture is the Feature Pipeline.
This is the part where your messy digital life becomes something an ML system can actually use.
Articles.
Posts.
Code.
Notes.
All raw, all useless, until processed.
What the Feature Pipeline does
Raw data → clean → chunk → embed → feature store
That's it.
But this step is more important than training.
Bad features = bad model.
Different data needs different processing
Your LLM Twin does not treat everything the same.
Articles → long text
Posts → short text
Code → structured text
Each needs different:
cleaning
chunking
embedding
Same pipeline, different logic.
That's why grouping by type, not platform, was important.
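The "same pipeline, different logic" idea can be sketched as a dispatch table keyed by data category. All names and splitting rules below are hypothetical placeholders, not the book's actual implementation:

```python
# Hypothetical sketch: one pipeline, per-category logic.
# Grouping is by TYPE ("post"), not platform, so a LinkedIn post
# and an X post flow through the same handler.

CHUNKERS = {
    "article": lambda t: t.split("\n\n"),  # long text: paragraph-level chunks
    "post":    lambda t: [t],              # short text: keep it whole
    "code":    lambda t: t.split("\n\n"),  # structured text: block-level split
}

def process(category: str, text: str) -> list[str]:
    """Route a document to the chunking logic for its category."""
    return CHUNKERS[category](text)

print(process("post", "short update"))        # ['short update']
print(process("article", "intro\n\nbody"))    # ['intro', 'body']
```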
Step 1: Cleaning
Remove noise.
HTML
emojis (sometimes)
formatting
duplicates
broken text
Clean data → better fine-tuning.
We also save this version for training.
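A minimal cleaning pass over the noise types listed above might look like this. The regexes and rules are illustrative choices, not the exact cleaning logic of the original system:

```python
import re

def clean_text(raw: str) -> str:
    """Remove common scraping noise (illustrative rules only)."""
    text = re.sub(r"<[^>]+>", " ", raw)         # strip HTML tags
    text = re.sub(r"[^\x00-\x7F]+", " ", text)  # drop emojis/non-ASCII (optional, per the post)
    text = re.sub(r"\s+", " ", text).strip()    # collapse broken whitespace/formatting
    return text

def dedupe(docs: list[str]) -> list[str]:
    """Drop exact duplicate documents, preserving order."""
    return list(dict.fromkeys(docs))

print(clean_text("<p>Hello   world!</p>"))  # Hello world!
```

This cleaned version is exactly the snapshot worth saving for fine-tuning later.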
Step 2: Chunking
LLMs (and embedding models) have context limits, so huge texts won't fit.
So we split.
Articles → big chunks
Posts → small chunks
Code → syntax chunks
Chunking is critical for RAG.
Bad chunking = bad retrieval.
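A common baseline is a sliding window with overlap, so context isn't lost at chunk boundaries. This is a generic sketch (the sizes and character-based splitting are illustrative; real settings differ per data type):

```python
def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping character windows.

    The overlap keeps context that straddles a boundary retrievable
    from at least one chunk, which matters for RAG quality.
    """
    step = chunk_size - overlap
    return [
        text[i:i + chunk_size]
        for i in range(0, max(len(text) - overlap, 1), step)
    ]

chunks = chunk_text("a" * 500, chunk_size=200, overlap=50)
print(len(chunks))  # 3 windows: [0:200], [150:350], [300:500]
```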
Step 3: Embedding
Now we convert text into vectors.
text → embedding → vector DB
This allows:
similarity search
RAG
context retrieval
Your vector DB becomes your memory.
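The mechanism in miniature: embed everything once, then answer queries by vector similarity. This is a toy sketch, not a real embedding model or vector DB; production systems would use a trained embedding model and a dedicated store such as Qdrant:

```python
import math

def embed(text: str) -> list[float]:
    # Toy "embedding": normalized character-frequency vector over a-z.
    # A real pipeline calls a trained embedding model here.
    counts = [text.lower().count(chr(c)) for c in range(ord("a"), ord("z") + 1)]
    norm = math.sqrt(sum(v * v for v in counts)) or 1.0
    return [v / norm for v in counts]

def cosine(a: list[float], b: list[float]) -> float:
    # Vectors are already normalized, so the dot product is cosine similarity.
    return sum(x * y for x, y in zip(a, b))

# A "vector DB" as a plain list of (text, vector) pairs.
store = [(t, embed(t)) for t in ["python tips", "rust tricks", "python tricks"]]

def search(query: str, k: int = 2) -> list[str]:
    """Return the k stored texts most similar to the query."""
    q = embed(query)
    ranked = sorted(store, key=lambda item: cosine(q, item[1]), reverse=True)
    return [t for t, _ in ranked[:k]]

print(search("python"))  # ['python tips', 'python tricks']
```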
Logical feature store (simple but powerful)
Instead of building a heavy feature store, we use:
vector DB
metadata
versioning logic
Why?
Because we need both:
offline data (training)
online data (RAG)
So we keep two snapshots:
clean data → training dataset
embedded data → RAG dataset
Simple. Flexible. Enough.
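A "logical" feature store is mostly disciplined bookkeeping over artifacts you already have. A sketch of the two versioned snapshots, with all names hypothetical:

```python
from dataclasses import dataclass, field

@dataclass
class LogicalFeatureStore:
    """Two versioned views over the same source data:
    - cleaned documents for fine-tuning (offline)
    - embedded chunks for RAG (online)
    """
    training_snapshots: dict[str, list[str]] = field(default_factory=dict)
    rag_snapshots: dict[str, list[tuple[str, list[float]]]] = field(default_factory=dict)

    def save_training(self, version: str, cleaned_docs: list[str]) -> None:
        # Offline snapshot: clean text, ready for dataset generation.
        self.training_snapshots[version] = cleaned_docs

    def save_rag(self, version: str, embedded: list[tuple[str, list[float]]]) -> None:
        # Online snapshot: (chunk, vector) pairs, ready for retrieval.
        self.rag_snapshots[version] = embedded

fs = LogicalFeatureStore()
fs.save_training("v1", ["clean article text"])
fs.save_rag("v1", [("chunk one", [0.1, 0.2])])
```

Versioning both snapshots together is what keeps training and inference consistent: the same `v1` data feeds both sides.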
Why this design is smart
The feature pipeline gives you:
clean data for fine-tuning
embeddings for RAG
versioned datasets
modular system
scalable architecture
And most important:
Training and inference use the same features
No mismatch.
No chaos.
Beautiful FTI design.



