Chunking in AI - The Secret Sauce You're Missing

Hey folks! 👋

You know what keeps me up at night? Thinking about how to make our AI systems smarter and more efficient. Today, I want to talk about something that might sound basic but is crucial when building kick-ass AI applications: chunking ✨.

What the heck is chunking anyway? 🤔

Think of chunking as your AI's way of breaking down a massive buffet of information into manageable, bite-sized portions. Just like how you wouldn't try to stuff an entire pizza in your mouth at once (or maybe you would, no judgment here!), your AI needs to break down large texts into smaller pieces to process them effectively.

This is especially important for what we call RAG (Retrieval-Augmented Generation) models. These bad boys don't just make stuff up - they actually go and fetch real information from external sources. Pretty neat, right?
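To make that concrete, here's a minimal sketch of the retrieve-then-generate loop. Heads up: searchChunks and callLlm are hypothetical placeholders for your vector store and LLM client, not any particular library - swap in whatever you actually use.

// Hypothetical dependencies - stand-ins for your real vector search and LLM client
declare function searchChunks(query: string, opts: { topK: number }): Promise<string[]>;
declare function callLlm(prompt: string): Promise<string>;

async function answerWithRag(question: string): Promise<string> {
  // 1. Retrieve: fetch the chunks most relevant to the question
  const chunks = await searchChunks(question, { topK: 5 });

  // 2. Augment: put the retrieved context into the prompt
  const prompt = `Answer using only the context below.\n\nContext:\n${chunks.join('\n---\n')}\n\nQuestion: ${question}`;

  // 3. Generate: the model answers grounded in that context
  return callLlm(prompt);
}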

Why should you care? 🎯

Look, if you're building anything that deals with text - whether it's a customer support chatbot or a fancy knowledge base search - getting chunking right is the difference between an AI that gives spot-on answers and one that's just... meh.

Chunks too big? The answer you need gets buried in noise.
Chunks too small? The model loses the surrounding context.

The Good, Bad, and Ugly of Text Chunking Strategies 📚

Chunking is critical in RAG systems because it directly impacts how well the retrieval module pulls relevant data and how much context the generation module has to work with. In my own projects, mediocre retrieval almost always traced back to how I was splitting my text. Let me save you some headaches and share what I've learned about different chunking strategies.

1. Fixed-Length Chunking: The "Assembly Line" Approach 🏭

You know how Henry Ford revolutionized car manufacturing with the assembly line? Fixed-length chunking is kind of like that - it's all about consistency and predictability.

function fixedLengthChunk(text: string, chunkSize: number = 1000): string[] {
  // The 's' (dotAll) flag lets '.' match newlines, so multi-line text isn't silently dropped
  return text.match(new RegExp(`.{1,${chunkSize}}`, 'gs')) || [];
}

The Good:

  • Predictable as British weather (which means very predictable... just kidding! 😅)
  • Super easy to parallelize (your DevOps team will love you)

The Bad:

  • About as graceful as me attempting ballet - it'll brutally chop your sentences in half

When to use it? When you need speed and consistency more than preserving context.

2. Sentence-Based Chunking: The Grammar Nazi's Choice 📚

function sentenceChunk(text: string): string[] {
  // This regex isn't perfect, but hey, what is in life?
  return text.match(/[^.!?]+[.!?]+/g) || [];
}

The Good:

  • Keeps your sentences intact (Grammar enthusiasts, rejoice! 🎉)
  • Great for chatbots that need to sound human-like

The Bad:

  • Some sentences are longer than a Netflix binge session
  • Others are shorter than my attention span

3. Paragraph-Based Chunking: The Goldilocks Zone 📝

This one's like finding that perfect porridge temperature - when it works, it really works.

function paragraphChunk(text: string): string[] {
  return text.split(/\n\s*\n/); // Simple but effective!
}

The Good:

  • Usually captures complete ideas
  • Works great with well-structured documents

The Bad:

  • Some paragraphs are longer than my AWS bill explanations

4. Recursive Chunking: The Inception Approach 🌀

BWAAAAM (That's the Inception horn, in case you're wondering)

function recursiveChunk(text: string, maxSize: number = 1000): string[] {
  if (text.length <= maxSize) return [text];

  // Find the last sentence boundary before the size limit
  const splitPoint = text.lastIndexOf('.', maxSize - 1);
  if (splitPoint === -1) return [text]; // no boundary found - fall back to one oversized chunk

  const firstHalf = text.slice(0, splitPoint + 1);
  const secondHalf = text.slice(splitPoint + 1);

  // Pass maxSize down so a custom limit isn't lost in the recursion
  return [...recursiveChunk(firstHalf, maxSize), ...recursiveChunk(secondHalf, maxSize)];
}

The Good:

  • As flexible as a yoga instructor
  • Great for handling complex document structures

The Bad:

  • Can get as complicated as explaining serverless to your grandma

5. Semantic Chunking: The Smart Kid in Class 🧠

This is like having an AI to help your AI. How meta is that?

// embedText and findSemanticBoundaries are placeholders for your embedding pipeline -
// see the sketch below for what typically happens inside them
import { embedText, findSemanticBoundaries } from './your-fancy-ml-library';

async function semanticChunk(text: string): Promise<string[]> {
  const embeddings = await embedText(text);
  const boundaries = findSemanticBoundaries(embeddings);
  return boundaries.map(([start, end]) => text.slice(start, end));
}
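Curious what actually lives inside that fancy ML library? A common recipe is to embed each sentence and start a new chunk whenever the similarity between neighboring sentences drops. Here's a rough sketch of that idea - embedText is still a hypothetical placeholder for whatever embedding model you use, but the cosine-similarity logic is the real mechanism.

// embedText is a hypothetical placeholder for your embedding model
declare function embedText(sentences: string[]): Promise<number[][]>;

function cosineSimilarity(a: number[], b: number[]): number {
  const dot = a.reduce((sum, x, i) => sum + x * b[i], 0);
  const normA = Math.sqrt(a.reduce((sum, x) => sum + x * x, 0));
  const normB = Math.sqrt(b.reduce((sum, x) => sum + x * x, 0));
  return dot / (normA * normB);
}

async function semanticChunkBySimilarity(text: string, threshold: number = 0.75): Promise<string[]> {
  const sentences = text.match(/[^.!?]+[.!?]+/g) || [text];
  const embeddings = await embedText(sentences);

  const chunks: string[] = [];
  let current = sentences[0];

  for (let i = 1; i < sentences.length; i++) {
    // A similarity dip between neighboring sentences suggests a topic shift - cut here
    if (cosineSimilarity(embeddings[i - 1], embeddings[i]) < threshold) {
      chunks.push(current.trim());
      current = sentences[i];
    } else {
      current += sentences[i];
    }
  }
  chunks.push(current.trim());
  return chunks;
}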

The Good:

  • As smart as a caffeinated software engineer
  • Keeps related concepts together

The Bad:

  • Computationally expensive (prepare for your AWS bill to make you cry)
  • More complex than explaining why you need another mechanical keyboard

6. Sliding Window Chunking: The "Better Safe Than Sorry" Approach 🔄

function slidingWindowChunk(text: string, windowSize: number = 1000, overlap: number = 200): string[] {
  const chunks: string[] = [];
  let start = 0;

  while (start < text.length) {
    const end = Math.min(start + windowSize, text.length);
    chunks.push(text.slice(start, end));

    // Once we've hit the end of the text, stop - otherwise the last window is a pure duplicate
    if (end === text.length) break;

    start += windowSize - overlap;
  }

  return chunks;
}

The Good:

  • Ensures no information falls through the cracks
  • Like having multiple security cameras with overlapping views

The Bad:

  • Creates more redundancy than a Kubernetes cluster
  • Can make your storage costs go brrrrr 💸

What Should You Use? 🤔

Here's my rule of thumb (there's a quick code sketch right after the list):

  1. Start with fixed-length or sentence-based chunking
  2. If that doesn't work, try sliding window
  3. If you have the compute resources and need high accuracy, go for semantic chunking
  4. If all else fails, grab a coffee and try recursive chunking
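And if you want to flip between strategies while you experiment, a thin dispatcher over the functions from earlier does the job (shown here as a sketch - semanticChunk is the async, expensive one):

type ChunkStrategy = 'fixed' | 'sentence' | 'paragraph' | 'sliding' | 'recursive' | 'semantic';

// Thin wrapper over the chunkers defined above - handy for A/B testing strategies
async function chunk(text: string, strategy: ChunkStrategy): Promise<string[]> {
  switch (strategy) {
    case 'fixed':     return fixedLengthChunk(text, 1000);
    case 'sentence':  return sentenceChunk(text);
    case 'paragraph': return paragraphChunk(text);
    case 'sliding':   return slidingWindowChunk(text, 1000, 200);
    case 'recursive': return recursiveChunk(text, 1000);
    case 'semantic':  return semanticChunk(text); // needs the embedding pipeline from strategy #5
  }
}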

Let's Get Our Hands Dirty: Real Examples 💻

Python Example: Recursive Chunking with LangChain

First, let's look at a Python example that uses LangChain's RecursiveCharacterTextSplitter - the library's take on the recursive strategy from #4 above:

from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain.document_loaders import TextLoader

def recursive_chunk(file_path):
    # Load the document
    loader = TextLoader(file_path)
    document = loader.load()

    # Create a text splitter
    text_splitter = RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        length_function=len,
        separators=["\n\n", "\n", " ", ""]
    )

    # Split the document into chunks
    chunks = text_splitter.split_documents(document)

    return chunks

# Example usage
chunks = recursive_chunk('knowledge_base.txt')
for i, chunk in enumerate(chunks):
    print(f"Chunk {i}: {chunk.page_content[:50]}...")

Node.js and CDK Example: Building a Knowledge Base

Now, let's build something real - a serverless knowledge base using AWS CDK and Node.js! 🚀

First, the CDK infrastructure (this is where the magic happens):

import * as cdk from 'aws-cdk-lib';
import * as s3 from 'aws-cdk-lib/aws-s3';
import * as s3n from 'aws-cdk-lib/aws-s3-notifications';
import * as lambda from 'aws-cdk-lib/aws-lambda';
import * as opensearch from 'aws-cdk-lib/aws-opensearchservice';

export class KnowledgeBaseStack extends cdk.Stack {
  constructor(scope: cdk.App, id: string, props?: cdk.StackProps) {
    super(scope, id, props);

    // S3 bucket to store our documents
    const documentBucket = new s3.Bucket(this, 'DocumentBucket', {
      removalPolicy: cdk.RemovalPolicy.DESTROY,
    });

    // OpenSearch domain for storing our chunks
    const openSearchDomain = new opensearch.Domain(this, 'DocumentSearch', {
      version: opensearch.EngineVersion.OPENSEARCH_2_5,
      capacity: {
        dataNodes: 1,
        dataNodeInstanceType: 't3.small.search',
      },
      ebs: {
        volumeSize: 10,
      },
    });

    // Lambda function for processing documents
    const processorFunction = new lambda.Function(this, 'ProcessorFunction', {
      runtime: lambda.Runtime.NODEJS_18_X,
      handler: 'index.handler',
      code: lambda.Code.fromAsset('lambda'),
      environment: {
        OPENSEARCH_DOMAIN: openSearchDomain.domainEndpoint,
      },
      timeout: cdk.Duration.minutes(5),
    });

    // Grant permissions
    documentBucket.grantRead(processorFunction);
    openSearchDomain.grantWrite(processorFunction);

    // Fire the processor Lambda whenever a new document lands in the bucket
    documentBucket.addEventNotification(
      s3.EventType.OBJECT_CREATED,
      new s3n.LambdaDestination(processorFunction),
    );
  }
}
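One thing the stack above doesn't show is the CDK app entry point that actually instantiates it. Assuming the stack lives in lib/knowledge-base-stack.ts (the file layout here is just the usual CDK convention - adjust to taste), it's only a few lines:

import * as cdk from 'aws-cdk-lib';
import { KnowledgeBaseStack } from '../lib/knowledge-base-stack';

const app = new cdk.App();
new KnowledgeBaseStack(app, 'KnowledgeBaseStack');

Run cdk deploy and CDK provisions the bucket, the OpenSearch domain, and the Lambda in one go.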

And now, the Lambda function that does the chunking and indexing:

import { S3Event } from 'aws-lambda';
// Heads up: the Node.js 18 Lambda runtime ships AWS SDK v3 only, so bundle aws-sdk v2
// with your function (or port these calls to @aws-sdk/client-s3)
import { S3 } from 'aws-sdk';
import { Client } from '@opensearch-project/opensearch';
import { defaultProvider } from '@aws-sdk/credential-provider-node';
import { AwsSigv4Signer } from '@opensearch-project/opensearch/aws';

const s3 = new S3();
const CHUNK_SIZE = 1000;
const CHUNK_OVERLAP = 200;

// Create OpenSearch client
const client = new Client({
  ...AwsSigv4Signer({
    region: process.env.AWS_REGION,
    service: 'es',
    getCredentials: () => {
      const credentialsProvider = defaultProvider();
      return credentialsProvider();
    },
  }),
  node: `https://${process.env.OPENSEARCH_DOMAIN}`,
});

export const handler = async (event: S3Event) => {
  for (const record of event.Records) {
    const bucket = record.s3.bucket.name;
    const key = decodeURIComponent(record.s3.object.key.replace(/\+/g, ' '));

    // Get the document from S3
    const { Body } = await s3.getObject({ Bucket: bucket, Key: key }).promise();
    const text = Body.toString('utf-8');

    // Chunk the document
    const chunks = chunkText(text);

    // Index chunks in OpenSearch
    for (const [index, chunk] of chunks.entries()) {
      await client.index({
        index: 'knowledge-base',
        body: {
          content: chunk,
          documentKey: key,
          chunkIndex: index,
          timestamp: new Date().toISOString(),
        },
      });
    }
  }
};

function chunkText(text: string): string[] {
  const chunks: string[] = [];
  let start = 0;

  while (start < text.length) {
    const end = Math.min(start + CHUNK_SIZE, text.length);
    let chunk = text.slice(start, end);

    // Try to break at a sentence boundary
    const lastPeriod = chunk.lastIndexOf('.');
    if (lastPeriod !== -1 && lastPeriod !== chunk.length - 1) {
      chunk = chunk.slice(0, lastPeriod + 1);
    }

    chunks.push(chunk);

    // Stop once the chunk reaches the end of the document - avoids a redundant final chunk
    if (start + chunk.length >= text.length) break;

    start = Math.max(start + chunk.length - CHUNK_OVERLAP, start + 1);
  }

  return chunks;
}
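One optional refinement: the Lambda above leans on OpenSearch's dynamic mapping to create the knowledge-base index on first write. If you'd rather pin the field types (a keyword field for documentKey makes per-document filtering much nicer), you can create the index up front. A minimal setup sketch, run once before indexing:

async function ensureIndex() {
  const exists = await client.indices.exists({ index: 'knowledge-base' });
  if (exists.body) return; // already created

  await client.indices.create({
    index: 'knowledge-base',
    body: {
      mappings: {
        properties: {
          content: { type: 'text' },        // full-text searchable chunk body
          documentKey: { type: 'keyword' }, // exact-match filtering per source document
          chunkIndex: { type: 'integer' },
          timestamp: { type: 'date' },
        },
      },
    },
  });
}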

How It All Works Together 🔄

  1. Document Upload: When you upload a document to the S3 bucket, it triggers our Lambda function.
  2. Processing: The Lambda function:
    • Retrieves the document from S3
    • Chunks it using our smart chunking algorithm
    • Indexes each chunk in OpenSearch with metadata
  3. Retrieval: Later, when your application needs to find information, it can query OpenSearch to find the most relevant chunks.

Here's a quick example of how you might query this knowledge base:

async function queryKnowledgeBase(query: string) {
  const response = await client.search({
    index: 'knowledge-base',
    body: {
      query: {
        multi_match: {
          query: query,
          fields: ['content'],
        },
      },
    },
  });

  return response.body.hits.hits.map(hit => ({
    content: hit._source.content,
    documentKey: hit._source.documentKey,
    score: hit._score,
  }));
}

The AWS Advantage 🌥️

Using AWS services like S3, Lambda, and OpenSearch gives us:

  • Serverless scalability (no servers to manage!)
  • Pay-per-use pricing (your wallet will thank you)
  • Managed services (less ops work = more coding fun)

Final Thoughts 🤔

There you have it, folks! A real-world example of how to implement chunking in a serverless knowledge base. The best part? It scales automatically and happily handles large documents (within Lambda's timeout and memory limits, of course).

Remember, the key to good chunking is:

  1. Choose the right chunk size for your use case
  2. Consider overlap to maintain context
  3. Use natural boundaries when possible (like sentences or paragraphs)

What's your experience with building knowledge bases? Have you tried different chunking strategies? Let me know in the comments below! 👇
