Building InternFlow (Part 2): Designing an AI Pipeline Without Calling GPT APIs

#machinelearning #tutorial #ai #python

How I built a Retrieval-Augmented Generation system for code review and resume generation — and why avoiding hosted LLMs was the right call.

One of the first decisions I made when designing InternFlow's AI layer was this: I didn't want to depend on OpenAI or any hosted LLM API for the core functionality.

This wasn't about cost alone — though that matters for a student project. It was about understanding how AI products actually work underneath the marketing.

Why not just call GPT?

The core problem with sending code to a hosted LLM without context: it hallucinates. It'll write confident-sounding review comments about patterns it hasn't seen. For a resume generator, it'll invent metrics. None of that is useful for a student trying to land an internship.

Hosted LLM API ❌	Local RAG Pipeline ✅
Hallucinations on repo-specific code	Answers grounded in real repo data
API cost scales with every request	Fixed infrastructure cost
No control over context window	Full control over what the model sees
Third-party dependency	Runs entirely on our infrastructure

The RAG pipeline

RAG stands for Retrieval-Augmented Generation. Instead of asking a model to answer from memory, you first retrieve relevant context, then pass it to the model alongside the question.

Here's how the InternFlow pipeline works step by step:

Step 1: fetch_repository()
        ↓
        Pull metadata + source files from GitHub API
        Filter to relevant file types (.py, .ts, .md, etc.)

Step 2: chunk_content()
        ↓
        Split files into overlapping chunks
        Function-level splitting for code
        Paragraph-level for documentation

Step 3: generate_embeddings()
        ↓
        Run each chunk through sentence-transformers
        Local embedding model — no GPU required

Step 4: store_in_faiss()
        ↓
        Index all embeddings in a FAISS vector index
        Persisted to disk, one index per repository

Step 5: retrieve_context(query)
        ↓
        Embed the query at generation time
        Retrieve top-k most similar chunks

Step 6: generate(context + prompt)
        ↓
        Pass retrieved context to local language model
        Output is grounded in actual repository content
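
As an illustration of the function-level splitting in Step 2, here is a minimal sketch for Python sources using the standard-library ast module. It is not InternFlow's actual chunk_content(): the helper name chunk_python_source and the dictionary layout are assumptions made for this example, and it only looks at top-level functions and classes (decorator lines are not included in the chunk text).

```python
import ast


def chunk_python_source(source: str, path: str = "<memory>") -> list[dict]:
    """Split Python source into one chunk per top-level function or class.

    Hypothetical helper: the name and the returned dict layout are assumptions.
    """
    tree = ast.parse(source)
    chunks = []
    for node in tree.body:
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef)):
            chunks.append(
                {
                    "path": path,
                    "name": node.name,
                    "start_line": node.lineno,
                    "end_line": node.end_lineno,
                    # Exact source text spanned by this definition
                    "text": ast.get_source_segment(source, node),
                }
            )
    return chunks


example = '''import os

def add(a, b):
    return a + b

class Greeter:
    def hello(self):
        return "hi"
'''

chunks = chunk_python_source(example, path="demo.py")
for chunk in chunks:
    print(chunk["name"], chunk["start_line"], chunk["end_line"])
```

Keeping the file path and line range with each chunk means a retrieved chunk can later be cited back to its location in the repository.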

RAG reduces hallucinations because the model answers from retrieved evidence rather than from memory. For code review this matters because a comment is only useful if it refers to code that actually exists in the repository.

The tools I used

# Embedding model
from sentence_transformers import SentenceTransformer
model = SentenceTransformer('all-MiniLM-L6-v2')

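# Vector store
import faiss
embedding_dim = model.get_sentence_embedding_dimension()  # 384 for all-MiniLM-L6-v2
index = faiss.IndexFlatL2(embedding_dim)  # exact (brute-force) L2 search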

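# Chunking (sizes are measured in characters by default, not tokens)
# Newer LangChain releases ship this class in the langchain_text_splitters package
from langchain.text_splitter import RecursiveCharacterTextSplitter
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=64,
)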

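The all-MiniLM-L6-v2 model is small (roughly 22 million parameters, producing 384-dimensional embeddings), so it runs on CPU without meaningful latency. IndexFlatL2 does an exact brute-force search, which FAISS handles efficiently even with thousands of chunks per repository.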
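
For intuition about what the similarity search is doing, here is a pure-Python sketch of an exact top-k search by squared L2 distance, which is conceptually what a flat L2 index computes. FAISS does this in optimized native code; the function names below are made up for this example and are not FAISS APIs.

```python
import heapq


def squared_l2(a: list[float], b: list[float]) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def search_flat_l2(vectors: list[list[float]], query: list[float], k: int) -> list[tuple[float, int]]:
    """Return the k nearest vectors as (squared_distance, index) pairs, closest first."""
    scored = ((squared_l2(vec, query), idx) for idx, vec in enumerate(vectors))
    return heapq.nsmallest(k, scored)


# Toy 2-dimensional "embeddings"; real ones would have 384 dimensions
vectors = [[0.0, 0.0], [1.0, 0.0], [0.0, 2.0], [3.0, 3.0]]
results = search_flat_l2(vectors, query=[1.0, 0.0], k=2)
print(results)
```

Every query is compared against every stored vector, so the cost grows linearly with the number of chunks. At thousands of chunks per repository that is cheap, which is why an exact flat index is a reasonable starting point before reaching for approximate indexes.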

What made this harder than expected

Inference speed — Local models are slow without a GPU. I had to experiment with quantization and pick model sizes carefully. The right model for this use case is the smallest one that produces acceptable output.

Docker image size — A single AI service image with model weights bloated to gigabytes. I ended up downloading weights at container startup rather than baking them into the image layer, which made deployments much faster.
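
A minimal sketch of the download-at-startup idea, assuming the weights directory sits on a persistent volume so that only the first container start pays for the download. The ensure_weights helper, the .complete marker file, and the injected download callable are all hypothetical; the real download would call whatever fetches your model files.

```python
import tempfile
from pathlib import Path
from typing import Callable


def ensure_weights(cache_dir: Path, download: Callable[[Path], None]) -> bool:
    """Download weights into cache_dir unless a previous run already finished.

    Returns True if a download happened, False if the cache was reused.
    The marker file is written last, so an interrupted download is retried.
    """
    marker = cache_dir / ".complete"
    if marker.exists():
        return False
    cache_dir.mkdir(parents=True, exist_ok=True)
    download(cache_dir)
    marker.touch()
    return True


# Demo with a fake downloader standing in for the real network call
calls = []


def fake_download(target: Path) -> None:
    calls.append(target)
    (target / "model.bin").write_bytes(b"\x00")


with tempfile.TemporaryDirectory() as tmp:
    cache = Path(tmp) / "weights"
    first = ensure_weights(cache, fake_download)   # downloads
    second = ensure_weights(cache, fake_download)  # reuses the cache

print(first, second, len(calls))
```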

Memory management — Loading multiple models simultaneously caused OOM errors. Each service needed explicit memory budgets and lazy loading — models loaded on first request, not at startup.

Prompt engineering — Getting consistent, structured output from local models required significantly more iteration than with hosted APIs. The model needs very explicit instructions about output format.
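
Explicit format instructions pair well with strict parsing on the way out. Below is a sketch of one way to pull a JSON array out of raw model text and validate it. The schema (file, line, comment) and the function name parse_review_output are assumptions invented for this example, not InternFlow's real output format.

```python
import json
import re

# Hypothetical schema for a single review comment
REQUIRED_KEYS = {"file", "line", "comment"}


def parse_review_output(raw: str) -> list[dict]:
    """Extract and validate a JSON array of review comments from model text.

    Raises ValueError if no valid array is found, so the caller can retry.
    """
    # Local models often wrap JSON in chatty prose, so grab the bracketed span
    match = re.search(r"\[.*\]", raw, flags=re.DOTALL)
    if match is None:
        raise ValueError("no JSON array found in model output")
    items = json.loads(match.group(0))  # JSONDecodeError is a ValueError subclass
    if not isinstance(items, list):
        raise ValueError("expected a JSON array")
    for item in items:
        if not isinstance(item, dict) or not REQUIRED_KEYS.issubset(item):
            raise ValueError(f"malformed review item: {item!r}")
    return items


raw_output = (
    "Sure! Here is the review:\n"
    '[{"file": "app.py", "line": 12, "comment": "Unused import os"}]\n'
    "Hope that helps."
)
comments = parse_review_output(raw_output)
print(comments)
```

Raising on malformed output instead of passing it along lets the calling code retry the generation or fall back, rather than showing a student a broken review.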

The result

When a student connects their GitHub repo and pushes a commit, InternFlow can now:

Review the diff against their existing codebase context
Generate resume bullets that reference actual functions, actual metrics, actual decisions — not generic boilerplate

The pipeline runs entirely on our infrastructure, with no per-request API cost.

The bigger lesson: most of the engineering in an AI product has nothing to do with AI. It's data pipelines, memory management, latency budgets, and output parsing. Calling a hosted API hides all of that from you.

In Part 3, I'll walk through the deployment challenges: getting all of this running reliably in production with Docker, Nginx, SSL, and persistent storage.

I'm building InternFlow — connect your GitHub, get AI code reviews on every commit, and generate ATS-ready resume bullets from your real work.

→ Try it free at intern-flow.in