Chunking in AI - The Secret Sauce You're Missing

Hey folks! 👋

You know what keeps me up at night? Thinking about how to make our AI systems smarter and more efficient. Today, I want to talk about something that might sound basic but is crucial when building kick-ass AI applications: chunking ✨.

What the heck is chunking anyway? 🤔

Think of chunking as your AI's way of breaking down a massive buffet of information into manageable, bite-sized portions. Just like how you wouldn't try to stuff an entire pizza in your mouth at once (or maybe you would, no judgment here!), your AI needs to break down large texts into smaller pieces to process them effectively.

This is especially important for what we call RAG (Retrieval-Augmented Generation) models. These bad boys don't just make stuff up - they actually go and fetch real information from external sources. Pretty neat, right?
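To make that concrete, here's a minimal sketch of the retrieve-then-generate loop. Heads up: searchChunks and callLlm are hypothetical placeholders for your vector store and LLM client, not any particular library - swap in whatever you actually use.

// Hypothetical dependencies - stand-ins for your real vector search and LLM client
declare function searchChunks(query: string, opts: { topK: number }): Promise<string[]>;
declare function callLlm(prompt: string): Promise<string>;

async function answerWithRag(question: string): Promise<string> {
  // 1. Retrieve: fetch the chunks most relevant to the question
  const chunks = await searchChunks(question, { topK: 5 });

  // 2. Augment: put the retrieved context into the prompt
  const prompt = `Answer using only the context below.\n\nContext:\n${chunks.join('\n---\n')}\n\nQuestion: ${question}`;

  // 3. Generate: the model answers grounded in that context
  return callLlm(prompt);
}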

Why should you care? 🎯

Look, if you're building anything that deals with text - whether it's a customer support chatbot or a fancy knowledge base search - getting chunking right is the difference between an AI that gives spot-on answers and one that's just... meh.

Chunks too big? The answer you need gets buried in noise.
Chunks too small? The model loses the surrounding context.

The Good, Bad, and Ugly of Text Chunking Strategies 📚

Chunking is critical in RAG systems because it directly impacts how well the retrieval module pulls relevant data and how much context the generation module has to work with. In my own projects, mediocre retrieval almost always traced back to how I was splitting my text. Let me save you some headaches and share what I've learned about different chunking strategies.

1. Fixed-Length Chunking: The "Assembly Line" Approach 🏭

You know how Henry Ford revolutionized car manufacturing with the assembly line? Fixed-length chunking is kind of like that - it's all about consistency and predictability.

function fixedLengthChunk(text: string, chunkSize: number = 1000): string[] {
  // The 's' (dotAll) flag lets '.' match newlines, so multi-line text isn't silently dropped
  return text.match(new RegExp(`.{1,${chunkSize}}`, 'gs')) || [];
}

The Good:

  • Predictable as British weather (which means very predictable... just kidding! 😅)
  • Super easy to parallelize (your DevOps team will love you)

The Bad:

  • About as graceful as me attempting ballet - it'll brutally chop your sentences in half

When to use it? When you need speed and consistency more than preserving context.

2. Sentence-Based Chunking: The Grammar Nazi's Choice 📚

function sentenceChunk(text: string): string[] {
  // This regex isn't perfect, but hey, what is in life?
  return text.match(/[^.!?]+[.!?]+/g) || [];
}

The Good:

  • Keeps your sentences intact (Grammar enthusiasts, rejoice! 🎉)
  • Great for chatbots that need to sound human-like

The Bad:

  • Some sentences are longer than a Netflix binge session
  • Others are shorter than my attention span

3. Paragraph-Based Chunking: The Goldilocks Zone 📝

This one's like finding that perfect porridge temperature - when it works, it really works.

function paragraphChunk(text: string): string[] {
  return text.split(/\n\s*\n/); // Simple but effective!
}

The Good:

  • Usually captures complete ideas
  • Works great with well-structured documents

The Bad:

  • Some paragraphs are longer than my AWS bill explanations

4. Recursive Chunking: The Inception Approach 🌀

BWAAAAM (That's the Inception horn, in case you're wondering)

function recursiveChunk(text: string, maxSize: number = 1000): string[] {
  if (text.length <= maxSize) return [text];

  // Find the last sentence boundary before the size limit
  const splitPoint = text.lastIndexOf('.', maxSize - 1);
  if (splitPoint === -1) return [text]; // no boundary found - fall back to one oversized chunk

  const firstHalf = text.slice(0, splitPoint + 1);
  const secondHalf = text.slice(splitPoint + 1);

  // Pass maxSize down so a custom limit isn't lost in the recursion
  return [...recursiveChunk(firstHalf, maxSize), ...recursiveChunk(secondHalf, maxSize)];
}

The Good:

  • As flexible as a yoga instructor
  • Great for handling complex document structures

The Bad:

  • Can get as complicated as explaining serverless to your grandma

5. Semantic Chunking: The Smart Kid in Class 🧠

This is like having an AI to help your AI. How meta is that?

// embedText and findSemanticBoundaries are placeholders for your embedding pipeline -
// see the sketch below for what typically happens inside them
import { embedText, findSemanticBoundaries } from './your-fancy-ml-library';

async function semanticChunk(text: string): Promise<string[]> {
  const embeddings = await embedText(text);
  const boundaries = findSemanticBoundaries(embeddings);
  return boundaries.map(([start, end]) => text.slice(start, end));
}
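Curious what actually lives inside that fancy ML library? A common recipe is to embed each sentence and start a new chunk whenever the similarity between neighboring sentences drops. Here's a rough sketch of that idea - embedText is still a hypothetical placeholder for whatever embedding model you use, but the cosine-similarity logic is the real mechanism.

// embedText is a hypothetical placeholder for your embedding model
declare function embedText(sentences: string[]): Promise<number[][]>;

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const normB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return dot / (normA * normB);
}

async function semanticChunkBySimilarity(text: string, threshold: number = 0.75): Promise<string[]> {
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  const embeddings = await embedText(sentences);

  const chunks: string[] = [];
  let current = sentences[0];

  for (let i = 1; i < sentences.length; i++) {
    // A similarity dip between neighboring sentences suggests a topic shift - cut here
    if (cosineSimilarity(embeddings[i - 1], embeddings[i]) < threshold) {
      chunks.push(current.trim());
      current = sentences[i];
    } else {
      current += sentences[i];
    }
  }
  chunks.push(current.trim());
  return chunks;
}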

The Good:

  • As smart as a caffeinated software engineer
  • Keeps related concepts together

The Bad:

  • Computationally expensive (prepare for your AWS bill to make you cry)
  • More complex than explaining why you need another mechanical keyboard

6. Sliding Window Chunking: The "Better Safe Than Sorry" Approach 🔄

function slidingWindowChunk(text: string, windowSize: number = 1000, overlap: number = 200): string[] {
  const chunks: string[] = [];
  let start = 0;

  while (start < text.length) {
    const end = Math.min(start + windowSize, text.length);
    chunks.push(text.slice(start, end));

    // Once we've hit the end of the text, stop - otherwise the last window is a pure duplicate
    if (end === text.length) break;

    start += windowSize - overlap;
  }

  return chunks;
}

The Good:

  • Ensures no information falls through the cracks
  • Like having multiple security cameras with overlapping views

The Bad:

  • Creates more redundancy than a Kubernetes cluster
  • Can make your storage costs go brrrrr 💸

What Should You Use? 🤔

Here's my rule of thumb (there's a quick code sketch right after the list):

  1. Start with fixed-length or sentence-based chunking
  2. If that doesn't work, try sliding window
  3. If you have the compute resources and need high accuracy, go for semantic chunking
  4. If all else fails, grab a coffee and try recursive chunking
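And if you want to flip between strategies while you experiment, a thin dispatcher over the functions from earlier does the job (shown here as a sketch - semanticChunk is the async, expensive one):

type ChunkStrategy = 'fixed' | 'sentence' | 'paragraph' | 'sliding' | 'recursive' | 'semantic';

// Thin wrapper over the chunkers defined above - handy for A/B testing strategies
async function chunk(text: string, strategy: ChunkStrategy): Promise<string[]> {
  switch (strategy) {
    case 'fixed':     return fixedLengthChunk(text, 1000);
    case 'sentence':  return sentenceChunk(text);
    case 'paragraph': return paragraphChunk(text);
    case 'sliding':   return slidingWindowChunk(text, 1000, 200);
    case 'recursive': return recursiveChunk(text, 1000);
    case 'semantic':  return semanticChunk(text); // needs the embedding pipeline from strategy #5
  }
}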

Let's Get Our Hands Dirty: Real Examples 💻

Python Example: Recursive Chunking with LangChain

First, let's look at a Python example that uses LangChain's RecursiveCharacterTextSplitter - the library's take on the recursive strategy from #4 above:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader

def recursive_chunk(file_path):
    # Load the document
    loader = TextLoader(file_path)
    document = loader.load()

    # Create a text splitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
        separators=["\n\n", "\n", " ", ""]
    )

    # Split the document into chunks
    chunks = text_splitter.split_documents(document)

    return chunks

# Example usage
chunks = recursive_chunk('knowledge_base.txt')
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk.page_content[:50]}...")

Node.js and CDK Example: Building a Knowledge Base

Now, let's build something real - a serverless knowledge base using AWS CDK and Node.js! 🚀

First, the CDK infrastructure (this is where the magic happens):

import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as s3n from 'aws-cdk-lib/aws-s3-notifications';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as opensearch from 'aws-cdk-lib/aws-opensearchservice';

export class KnowledgeBaseStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // S3 bucket to store our documents
    const documentBucket = new s3.Bucket(this, 'DocumentBucket', {
      removalPolicy: cdk.RemovalPolicy.DESTROY,
    });

    // OpenSearch domain for storing our chunks
    const openSearchDomain = new opensearch.Domain(this, 'DocumentSearch', {
      version: opensearch.EngineVersion.OPENSEARCH_2_5,
      capacity: {
        dataNodes: 1,
        dataNodeInstanceType: 't3.small.search',
      },
      ebs: {
        volumeSize: 10,
      },
    });

    // Lambda function for processing documents
    const processorFunction = new lambda.Function(this, 'ProcessorFunction', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda'),
      environment: {
        OPENSEARCH_DOMAIN: openSearchDomain.domainEndpoint,
      },
      timeout: cdk.Duration.minutes(5),
    });

    // Grant permissions
    documentBucket.grantRead(processorFunction);
    openSearchDomain.grantWrite(processorFunction);

    // Fire the processor Lambda whenever a new document lands in the bucket
    documentBucket.addEventNotification(
      s3.EventType.OBJECT_CREATED,
      new s3n.LambdaDestination(processorFunction),
    );
  }
}
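One thing the stack above doesn't show is the CDK app entry point that actually instantiates it. Assuming the stack lives in lib/knowledge-base-stack.ts (the file layout here is just the usual CDK convention - adjust to taste), it's only a few lines:

import * as cdk from 'aws-cdk-lib';
import { KnowledgeBaseStack } from '../lib/knowledge-base-stack';

const app = new cdk.App();
new KnowledgeBaseStack(app, 'KnowledgeBaseStack');

Run cdk deploy and CDK provisions the bucket, the OpenSearch domain, and the Lambda in one go.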

And now, the Lambda function that does the chunking and indexing:

import { S3Event } from 'aws-lambda';
// Heads up: the Node.js 18 Lambda runtime ships AWS SDK v3 only, so bundle aws-sdk v2
// with your function (or port these calls to @aws-sdk/client-s3)
import { S3 } from 'aws-sdk';
import { Client } from '@opensearch-project/opensearch';
import { defaultProvider } from '@aws-sdk/credential-provider-node';
import { AwsSigv4Signer } from '@opensearch-project/opensearch/aws';

const s3 = new S3();
const CHUNK_SIZE = 1000;
const CHUNK_OVERLAP = 200;

// Create OpenSearch client
const client = new Client({
  ...AwsSigv4Signer({
    region: process.env.AWS_REGION,
    service: 'es',
    getCredentials: () => {
      const credentialsProvider = defaultProvider();
      return credentialsProvider();
    },
  }),
  node: `https://${process.env.OPENSEARCH_DOMAIN}`,
});

export const handler = async (event: S3Event) => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));

    // Get the document from S3
    const { Body } = await s3.getObject({ Bucket: bucket, Key: key }).promise();
    const text = Body.toString('utf-8');

    // Chunk the document
    const chunks = chunkText(text);

    // Index chunks in OpenSearch
    for (const [index, chunk] of chunks.entries()) {
      await client.index({
        index: 'knowledge-base',
        body: {
          content: chunk,
          documentKey: key,
          chunkIndex: index,
          timestamp: new Date().toISOString(),
        },
      });
    }
  }
};

function chunkText(text: string): string[] {
  const chunks: string[] = [];
  let start = 0;

  while (start < text.length) {
    const end = Math.min(start + CHUNK_SIZE, text.length);
    let chunk = text.slice(start, end);

    // Try to break at a sentence boundary
    const lastPeriod = chunk.lastIndexOf('.');
    if (lastPeriod !== -1 && lastPeriod !== chunk.length - 1) {
      chunk = chunk.slice(0, lastPeriod + 1);
    }

    chunks.push(chunk);

    // Stop once the chunk reaches the end of the document - avoids a redundant final chunk
    if (start + chunk.length >= text.length) break;

    start = Math.max(start + chunk.length - CHUNK_OVERLAP, start + 1);
  }

  return chunks;
}
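One optional refinement: the Lambda above leans on OpenSearch's dynamic mapping to create the knowledge-base index on first write. If you'd rather pin the field types (a keyword field for documentKey makes per-document filtering much nicer), you can create the index up front. A minimal setup sketch, run once before indexing:

async function ensureIndex() {
  const exists = await client.indices.exists({ index: 'knowledge-base' });
  if (exists.body) return; // already created

  await client.indices.create({
    index: 'knowledge-base',
    body: {
      mappings: {
        properties: {
          content: { type: 'text' },        // full-text searchable chunk body
          documentKey: { type: 'keyword' }, // exact-match filtering per source document
          chunkIndex: { type: 'integer' },
          timestamp: { type: 'date' },
        },
      },
    },
  });
}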

How It All Works Together 🔄

  1. Document Upload: When you upload a document to the S3 bucket, it triggers our Lambda function.
  2. Processing: The Lambda function:
    • Retrieves the document from S3
    • Chunks it using our smart chunking algorithm
    • Indexes each chunk in OpenSearch with metadata
  3. Retrieval: Later, when your application needs to find information, it can query OpenSearch to find the most relevant chunks.

Here's a quick example of how you might query this knowledge base:

async function queryKnowledgeBase(query: string) {
  const response = await client.search({
    index: 'knowledge-base',
    body: {
      query: {
        multi_match: {
          query: query,
          fields: ['content'],
        },
      },
    },
  });

  return response.body.hits.hits.map(hit => ({
    content: hit._source.content,
    documentKey: hit._source.documentKey,
    score: hit._score,
  }));
}

The AWS Advantage 🌥️

Using AWS services like S3, Lambda, and OpenSearch gives us:

  • Serverless scalability (no servers to manage!)
  • Pay-per-use pricing (your wallet will thank you)
  • Managed services (less ops work = more coding fun)

Final Thoughts 🤔

There you have it, folks! A real-world example of how to implement chunking in a serverless knowledge base. The best part? It scales automatically and happily handles large documents (within Lambda's timeout and memory limits, of course).

Remember, the key to good chunking is:

  1. Choose the right chunk size for your use case
  2. Consider overlap to maintain context
  3. Use natural boundaries when possible (like sentences or paragraphs)

What's your experience with building knowledge bases? Have you tried different chunking strategies? Let me know in the comments below! 👇
