Build a Complete RAG System with TypeScript: From Embeddings to Retrieval
Retrieval-Augmented Generation (RAG) has become the standard pattern for building AI applications that can reason over private data. Instead of fine-tuning models on your documents, RAG retrieves relevant context at query time, making it more flexible and cost-effective. With NeuroLink, building a production-ready RAG system in TypeScript takes minutes, not weeks.
Understanding the RAG Pipeline
A complete RAG system consists of four key components:
- Document Processing: Loading and chunking documents
- Embedding Generation: Converting text to vector representations
- Vector Storage: Persisting embeddings for efficient retrieval
- Retrieval & Generation: Finding relevant context and generating responses
NeuroLink provides unified APIs for all these stages, working seamlessly across 13 AI providers.
Step 1: Document Processing and Chunking
Start by processing your documents with NeuroLink's built-in RAG capabilities:
```typescript
import { NeuroLink } from "@juspay/neurolink";
import { readFile } from "fs/promises";

const neurolink = new NeuroLink();

// Load your documents
const documents = await Promise.all([
  readFile("./docs/api-reference.md", "utf-8"),
  readFile("./docs/guides/authentication.md", "utf-8"),
  readFile("./docs/guides/webhooks.md", "utf-8"),
  readFile("./docs/faq.md", "utf-8")
]);

// Configure chunking strategy
const ragConfig = {
  chunking: {
    strategy: "semantic", // semantic, fixed, recursive, or markdown
    chunkSize: 512,       // tokens per chunk
    chunkOverlap: 50,     // overlap between chunks
    separators: ["\n## ", "\n### ", "\n\n", "\n", ". ", " "]
  },
  embedding: {
    model: "text-embedding-3-large",
    provider: "openai"
  }
};
```
NeuroLink supports multiple chunking strategies, including:

| Strategy | Best For |
|---|---|
| semantic | Preserves meaning boundaries |
| fixed | Simple, predictable chunks |
| recursive | Hierarchical document structure |
| markdown | Markdown-formatted docs |
| code | Source code files |
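As a point of reference, the simplest of these, fixed-size chunking with overlap, can be sketched in a few lines. This is a character-based toy to illustrate the sliding-window idea, not NeuroLink's token-aware implementation:

```typescript
// Toy fixed-size chunker with overlap (character-based for simplicity).
// NeuroLink's real strategies work on tokens and semantic boundaries;
// this only shows the sliding window behind the "fixed" strategy.
function fixedChunks(text: string, size: number, overlap: number): string[] {
  if (overlap >= size) throw new Error("overlap must be smaller than size");
  const chunks: string[] = [];
  const step = size - overlap;
  for (let start = 0; start < text.length; start += step) {
    chunks.push(text.slice(start, start + size));
    if (start + size >= text.length) break; // last window reached the end
  }
  return chunks;
}
```

Each chunk shares its last `overlap` characters with the next one, so a sentence split at a boundary still appears whole in at least one chunk.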
Step 2: Generating Embeddings
NeuroLink's embed() and embedMany() APIs handle embedding generation with automatic batching and provider fallback:
```typescript
// Generate embeddings for a single text
const singleEmbedding = await neurolink.embed({
  text: "How do I authenticate API requests?",
  model: "text-embedding-3-large",
  provider: "openai"
});

console.log(singleEmbedding.embedding); // number[] vector

// Batch embed multiple documents efficiently
const chunks = [
  "Authentication requires an API key in the Authorization header...",
  "Webhooks are sent as POST requests to your configured endpoint...",
  "Rate limits are enforced per API key at 1000 requests per minute...",
  // ... hundreds more chunks
];

const embeddings = await neurolink.embedMany({
  texts: chunks,
  model: "text-embedding-3-large",
  provider: "openai",
  batchSize: 100 // Automatic batching for rate limits
});

console.log(`Generated ${embeddings.length} embeddings`);
```
The embedMany() API automatically:
- Batches requests to respect provider rate limits
- Retries failed batches with exponential backoff
- Falls back to alternative providers if configured
- Caches results to reduce API costs
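The batching-with-retry pattern itself is worth seeing in miniature. The sketch below is a generic illustration of how such a helper can work, with `embedBatch` standing in for any provider call; it is not NeuroLink's internal code:

```typescript
// Generic batched-embedding helper with exponential backoff.
// `embedBatch` is a stand-in for an arbitrary provider call; this
// illustrates the pattern, not NeuroLink's internals.
async function embedInBatches(
  texts: string[],
  embedBatch: (batch: string[]) => Promise<number[][]>,
  batchSize = 100,
  maxRetries = 3,
  baseDelayMs = 1000
): Promise<number[][]> {
  const results: number[][] = [];
  for (let i = 0; i < texts.length; i += batchSize) {
    const batch = texts.slice(i, i + batchSize);
    for (let attempt = 0; ; attempt++) {
      try {
        results.push(...(await embedBatch(batch)));
        break; // batch succeeded, move to the next one
      } catch (err) {
        if (attempt >= maxRetries) throw err;
        // Exponential backoff: baseDelayMs, 2x, 4x, ...
        await new Promise(r => setTimeout(r, baseDelayMs * 2 ** attempt));
      }
    }
  }
  return results;
}
```

A transient rate-limit error on one batch triggers a delayed retry of just that batch, so earlier results are never recomputed.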
Step 3: Vector Storage and Retrieval
For production systems, you'll want to store embeddings in a vector database. Here's how to integrate with popular options:
```typescript
import { Pinecone } from "@pinecone-database/pinecone";

// Initialize vector store
const pinecone = new Pinecone({ apiKey: process.env.PINECONE_API_KEY! });
const index = pinecone.index("neurolink-docs");

// Store embeddings with metadata
async function storeEmbeddings(chunks: string[], embeddings: number[][]) {
  const records = chunks.map((chunk, i) => ({
    id: `chunk-${i}`,
    values: embeddings[i],
    metadata: {
      text: chunk,
      source: "api-reference.md",
      timestamp: Date.now()
    }
  }));
  await index.upsert(records);
}

// Retrieve relevant chunks
async function retrieveContext(query: string, topK: number = 5) {
  // Generate query embedding
  const queryEmbedding = await neurolink.embed({
    text: query,
    model: "text-embedding-3-large",
    provider: "openai"
  });

  // Search vector store
  const results = await index.query({
    vector: queryEmbedding.embedding,
    topK,
    includeMetadata: true
  });

  return results.matches?.map(m => m.metadata?.text) ?? [];
}
```
Step 4: The Complete RAG Query
Now put it all together for intelligent document Q&A:
```typescript
async function askQuestion(question: string) {
  // 1. Retrieve relevant context
  const contexts = await retrieveContext(question, 5);

  // 2. Build augmented prompt
  const contextText = contexts.join("\n\n---\n\n");

  // 3. Generate answer with context
  const result = await neurolink.generate({
    input: {
      text: `Answer the question based on the following context.

Context:
${contextText}

Question: ${question}

Provide a clear, accurate answer. If the context doesn't contain
the answer, say "I don't have enough information to answer that."`
    },
    model: "claude-4-sonnet",
    provider: "anthropic"
  });

  return result.output.text;
}

// Usage
const answer = await askQuestion(
  "What's the rate limit for webhook endpoints?"
);
console.log(answer);
```
Simplified RAG with NeuroLink's Built-in Support
For common use cases, NeuroLink offers a streamlined approach—just pass your files directly:
```typescript
// One-line RAG: NeuroLink handles chunking, embedding, and retrieval
const result = await neurolink.generate({
  input: {
    text: "What are the authentication requirements?",
    files: ["./docs/api-reference.md", "./docs/guides/authentication.md"]
  },
  model: "claude-4-sonnet",
  provider: "anthropic",
  rag: {
    chunking: "semantic",
    chunkSize: 512,
    search: "hybrid" // semantic + keyword search
  }
});

// NeuroLink automatically:
// 1. Reads the files
// 2. Chunks them using the semantic strategy
// 3. Generates embeddings
// 4. Retrieves relevant chunks
// 5. Includes them in the prompt
// 6. Generates the response
```
Advanced: Hybrid Search and Reranking
For better retrieval quality, combine semantic search with keyword matching and reranking:
```typescript
const result = await neurolink.generate({
  input: { text: "How do I handle webhook retries?" },
  rag: {
    files: ["./docs/webhooks.md", "./docs/error-handling.md"],
    search: "hybrid",   // semantic + keyword
    reranker: "cohere", // cohere, flashrank, or cross-encoder
    topK: 10,           // retrieve top 10
    rerankTopK: 5       // return top 5 after reranking
  }
});
```
Rerankers reorder results based on relevance to the query, significantly improving accuracy for complex questions.
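The reranking step itself is conceptually simple: score each retrieved candidate against the query with a stronger (but slower) model, then keep only the best. A generic sketch, where `score` stands in for a cross-encoder or a hosted rerank API call:

```typescript
// Generic rerank: re-score candidates with a stronger relevance model,
// then keep the top rerankTopK. `score` is a stand-in for a cross-encoder
// or rerank API; this illustrates the pattern, not NeuroLink's code.
async function rerank(
  query: string,
  candidates: string[],
  score: (query: string, doc: string) => Promise<number>,
  rerankTopK: number
): Promise<string[]> {
  const scored = await Promise.all(
    candidates.map(async doc => ({ doc, s: await score(query, doc) }))
  );
  return scored
    .sort((a, b) => b.s - a.s) // highest relevance first
    .slice(0, rerankTopK)
    .map(x => x.doc);
}
```

Because the expensive model only sees the `topK` candidates from the fast vector search, the added latency stays bounded.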
Building a Production-Ready Document Q&A System
Here's a complete example with error handling and streaming:
```typescript
import { NeuroLink } from "@juspay/neurolink";
import { Pinecone } from "@pinecone-database/pinecone";

class DocumentQASystem {
  private neurolink: NeuroLink;
  private vectorStore: any;

  constructor() {
    this.neurolink = new NeuroLink();
    this.vectorStore = new Pinecone({
      apiKey: process.env.PINECONE_API_KEY!
    }).index("docs");
  }

  // Ingest documents into the system
  async ingestDocuments(files: string[]) {
    for (const file of files) {
      await this.neurolink.generate({
        input: { files: [file] },
        rag: {
          chunking: "semantic",
          chunkSize: 512,
          storeEmbeddings: async (embeddings, chunks) => {
            await this.vectorStore.upsert(
              chunks.map((chunk, i) => ({
                id: `${file}-${i}`,
                values: embeddings[i],
                metadata: { text: chunk, source: file }
              }))
            );
          }
        }
      });
    }
  }

  // Ask a question with a streaming response
  async *ask(question: string) {
    // Get query embedding
    const { embedding } = await this.neurolink.embed({
      text: question,
      model: "text-embedding-3-large"
    });

    // Retrieve context
    const searchResults = await this.vectorStore.query({
      vector: embedding,
      topK: 5,
      includeMetadata: true
    });
    const context = searchResults.matches
      ?.map((m: any) => m.metadata?.text)
      .join("\n\n");

    // Stream the response
    const stream = await this.neurolink.stream({
      input: {
        text: `Context:\n${context}\n\nQuestion: ${question}\n\nAnswer:`
      },
      model: "claude-4-sonnet",
      provider: "anthropic"
    });

    for await (const chunk of stream.stream) {
      if ("content" in chunk) {
        yield chunk.content;
      }
    }
  }
}

// Usage
const qa = new DocumentQASystem();
await qa.ingestDocuments(["./docs/api.md", "./docs/faq.md"]);

// Stream the answer
for await (const token of qa.ask("What are the webhook retry policies?")) {
  process.stdout.write(token);
}
```
Best Practices for Production RAG
- Choose chunk sizes wisely: 512 tokens works for most docs; use smaller chunks (256) for dense technical content
- Use semantic chunking: Preserves meaning better than fixed-size chunks
- Implement hybrid search: Combines semantic similarity with keyword matching
- Add reranking: Improves precision, especially for large knowledge bases
- Monitor retrieval metrics: Track hit rate, latency, and relevance scores
- Handle edge cases: Return "I don't know" when confidence is low
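For the last point, a common implementation is to gate on the best retrieval score: if even the top match's cosine similarity to the query is below a threshold, skip generation and refuse. A minimal sketch, where the 0.75 threshold is an arbitrary starting point you should tune against your own data:

```typescript
// Cosine similarity between two equal-length vectors.
function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  return dot / (Math.sqrt(na) * Math.sqrt(nb));
}

// Gate generation on retrieval confidence. The 0.75 default is a
// tunable assumption, not a NeuroLink or provider default.
function shouldAnswer(
  queryVec: number[],
  topMatchVec: number[],
  threshold = 0.75
): boolean {
  return cosine(queryVec, topMatchVec) >= threshold;
}
```

When `shouldAnswer` returns false, return the fixed "I don't have enough information" response instead of calling the model, which also saves a generation round trip.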
Conclusion
Building a RAG system doesn't require complex infrastructure or multiple libraries. NeuroLink provides a unified, provider-agnostic approach to document processing, embedding generation, and retrieval—letting you focus on building great AI experiences rather than plumbing.
Whether you need quick one-line RAG or a fully customized pipeline with hybrid search and reranking, NeuroLink scales from prototype to production seamlessly.
NeuroLink — The Universal AI SDK for TypeScript
- GitHub: github.com/juspay/neurolink
- Install: `npm install @juspay/neurolink`
- Docs: docs.neurolink.ink
- Blog: blog.neurolink.ink — 150+ technical articles