Mihindu Ranasinghe
Why AI Systems Become Expensive: Tokenization, Chunking, and Retrieval Design in the Cloud (AWS)

When building modern AI knowledge systems, discussions often jump directly to prompts, retrieval pipelines, or model selection. However, long before a model generates an answer, something more fundamental happens: your data must be transformed into a format that models can understand and retrieve efficiently.

This transformation typically involves several foundational steps:

1. Tokenization – Converting raw text into model-readable units
2. Chunking – Splitting documents into manageable segments
3. Vectorization – Converting text into embeddings
4. Indexing – Storing vectors for efficient similarity search

These steps form the foundation of retrieval-based AI systems, and design decisions at this stage often have a greater impact on system performance than prompt engineering or model tuning.

These architectural considerations are also increasingly relevant for modern AI development tools such as Claude Code, OpenAI Codex–based systems, and other AI-powered coding assistants. Although these tools may appear to operate purely through conversational prompts, they often rely on similar retrieval pipelines under the hood: they index codebases, documentation, and contextual information before generating responses. As a result, the same factors discussed in this article, such as tokenization, chunking strategies, embeddings, and retrieval design, can significantly influence both the performance and the token consumption of these systems.

In this article, we explore how these processes work together and how they can be implemented using Amazon Bedrock and Amazon OpenSearch Service.

The Retrieval Pipeline Begins Before Retrieval

Before a model retrieves or generates information, data must first pass through a preparation pipeline.

Each stage introduces trade-offs affecting:

  • Retrieval accuracy
  • Latency
  • Operational cost

Designing this pipeline carefully is essential for scalable GenAI systems.


Tokenization: How Models Read Text

Large language models do not process text directly. Instead, they operate on tokens, which are smaller units derived from text.

Tokens may represent:

  • Whole words
  • Parts of words
  • Punctuation
  • Whitespace

For example:
Cloud computing enables scalable AI systems.

might be tokenized as:
["Cloud", " computing", " enables", " scalable", " AI", " systems", "."]

Tokenization is necessary because models operate within fixed context windows, defined by the maximum number of tokens they can process at once.

However, token counts vary depending on the model's tokenizer.

Tokenization Algorithms

Tokenization is performed using trained algorithms rather than simple whitespace splitting.

Common approaches include:

| Algorithm | Description |
| --- | --- |
| Byte Pair Encoding (BPE) | Merges frequently occurring character pairs |
| WordPiece | Used in many transformer-based models |
| Unigram Language Model | Probabilistic token selection |
| SentencePiece | Language-agnostic tokenizer framework |

Many modern models rely on Byte Pair Encoding or similar subword tokenization techniques (Sennrich et al., 2016).

Example: Byte Pair Encoding

Consider the word tokenization. A BPE tokenizer may split it as:
["token", "ization"] or ["token", "iz", "ation"]
depending on how the tokenizer's vocabulary was learned.

Subword tokenization allows models to represent rare or unseen words efficiently without requiring extremely large vocabularies.
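The longest-match idea behind subword tokenization can be illustrated with a short, self-contained sketch. The vocabulary below is entirely hypothetical; real tokenizers learn vocabularies of tens of thousands of subwords from large corpora, and production systems should use the model's own tokenizer rather than anything hand-rolled:

```python
def subword_tokenize(word, vocab):
    """Greedy longest-match subword tokenization (a WordPiece-style sketch)."""
    tokens = []
    start = 0
    while start < len(word):
        # Try the longest possible substring first, shrinking until a match.
        for end in range(len(word), start, -1):
            piece = word[start:end]
            if piece in vocab:
                tokens.append(piece)
                start = end
                break
        else:
            # No known subword: fall back to a single character.
            tokens.append(word[start])
            start += 1
    return tokens

# Hypothetical vocabulary for illustration only.
vocab = {"token", "iz", "ation", "cloud", "comput", "ing"}

print(subword_tokenize("tokenization", vocab))  # ['token', 'iz', 'ation']
```

Because "ization" is not in this toy vocabulary, the word falls apart into three pieces; a tokenizer trained on different data would produce a different split, which is exactly why token counts vary between models.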

Why Tokenization Matters

Tokenization affects several key aspects of AI system design.

Context Window Limits

Models can only process a limited number of tokens at once. Depending on the model, context windows may range from thousands to hundreds of thousands of tokens.

Large documents therefore need to be split into smaller chunks before processing.

Cost

Many generative AI platforms charge based on token usage. When generating embeddings or performing inference with Amazon Bedrock, token counts directly influence operational costs.

Retrieval Quality

Chunk sizes and overlaps are typically defined in tokens. Tokenization therefore influences:

  • Chunk boundaries
  • Embedding context
  • Retrieval precision

Token Cost Optimization in GenAI Pipelines

In production systems, whether powering AI assistants, coding copilots, or knowledge retrieval pipelines, token usage is not just a technical constraint; it is also a major cost driver. Most generative AI platforms charge based on the number of input and output tokens processed. When models are accessed through services such as Amazon Bedrock, inefficient prompt construction or overly large context windows can significantly increase operational cost.

Token usage grows quickly at scale. For example, a system sending 2,000 tokens per request and serving 10,000 requests per day processes 20 million tokens daily. Even small inefficiencies in chunking or retrieval strategies can therefore become expensive.
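The arithmetic above is easy to sketch in a few lines. The per-token price used here is purely illustrative, not a quote for any real model:

```python
# Back-of-the-envelope token cost estimate.
tokens_per_request = 2_000
requests_per_day = 10_000

daily_tokens = tokens_per_request * requests_per_day
print(daily_tokens)  # 20000000

# Assume a hypothetical price of $0.50 per million input tokens
# (check your provider's actual pricing).
price_per_million_tokens = 0.50
daily_cost = daily_tokens / 1_000_000 * price_per_million_tokens
monthly_cost = daily_cost * 30

print(f"${daily_cost:.2f} per day, ${monthly_cost:.2f} per month")
```

Even at this modest hypothetical rate, trimming a few hundred unnecessary tokens of context per request compounds into a meaningful monthly saving.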

Design decisions such as chunk size, overlap, and retrieval precision directly influence token consumption. Smaller, well-structured chunks combined with accurate vector retrieval (for example using Amazon OpenSearch Service) help ensure that only the most relevant context is sent to the model.

Treating token efficiency as an architectural concern, not just a billing metric, can significantly improve the scalability and cost-efficiency of GenAI systems.


Chunking: Preparing Documents for Retrieval

Chunking refers to the process of splitting large documents into smaller segments prior to indexing.

Effective chunking improves:

  • Retrieval accuracy
  • Semantic coherence of embeddings
  • Query relevance

Poor chunking may result in fragmented information or irrelevant context.

Several strategies are commonly used.

1. Standard Chunking

The simplest method is fixed-size chunking, where text is split based on a predetermined token length.

Example configuration:

Chunk size: 300 tokens
Overlap: 20%

Overlap ensures that information spanning chunk boundaries remains available.

Typical parameters include:

| Parameter | Purpose |
| --- | --- |
| Chunk size | Number of tokens per chunk |
| Overlap | Shared tokens between consecutive chunks |

Many systems default to approximately 300 tokens per chunk, although optimal sizes vary depending on document structure and query patterns.
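A minimal fixed-size chunker with overlap might look like the sketch below. Token counting is simplified here to whitespace-split words; a production system would count tokens with the embedding model's own tokenizer:

```python
def chunk_tokens(tokens, chunk_size=300, overlap_ratio=0.2):
    """Split a token list into fixed-size chunks with a fractional overlap."""
    overlap = int(chunk_size * overlap_ratio)
    step = chunk_size - overlap            # how far the window advances each time
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break                          # last window reached the end
    return chunks

# Simplified: treat whitespace-separated words as "tokens".
words = ("cloud " * 1000).split()
chunks = chunk_tokens(words, chunk_size=300, overlap_ratio=0.2)

print(len(chunks))      # 4
print(len(chunks[0]))   # 300
```

With a 300-token window and 20% overlap, each chunk shares its last 60 tokens with the start of the next one, so sentences that straddle a boundary remain retrievable from at least one chunk.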

2. Hierarchical Chunking

Hierarchical chunking preserves document structure by splitting text into nested segments.

Example structure:

Document
 ├── Section
 │    ├── Paragraph
 │    │     ├── Sentence

This enables retrieval at multiple levels of granularity:

  • fine-grained paragraph retrieval
  • broader section-level context

Hierarchical chunking is particularly useful for large technical documentation or knowledge bases.
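A simple two-level version of this idea can be sketched as follows. The heading convention (`## `) and blank-line paragraph separation are assumptions about the input format, not a general parser:

```python
def hierarchical_chunks(document):
    """Split a document into sections (by heading) with nested paragraphs.
    Sketch only: assumes sections start with '## ' and paragraphs are
    separated by blank lines."""
    sections = []
    current = {"heading": None, "paragraphs": []}
    for block in document.split("\n\n"):
        block = block.strip()
        if not block:
            continue
        if block.startswith("## "):
            if current["heading"] or current["paragraphs"]:
                sections.append(current)       # close the previous section
            current = {"heading": block[3:], "paragraphs": []}
        else:
            current["paragraphs"].append(block)
    sections.append(current)
    return sections

doc = """## Networking

VPCs isolate workloads.

Subnets segment address space.

## Storage

S3 stores objects durably."""

for section in hierarchical_chunks(doc):
    print(section["heading"], len(section["paragraphs"]))
```

Indexing both levels lets a query retrieve a single precise paragraph while still being able to pull in the enclosing section when broader context is needed.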

3. Semantic Chunking

Traditional chunking splits text mechanically. Semantic chunking instead splits text based on meaning and topic boundaries.

Instead of dividing text every fixed number of tokens, semantic chunking analyzes content to group sentences that represent a coherent concept.

Foundation models available through Amazon Bedrock can assist in identifying semantic boundaries.

This often produces more meaningful chunks, improving retrieval quality.
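One common way to implement this is to embed each sentence and start a new chunk whenever the similarity between consecutive sentences drops below a threshold. The sketch below uses a toy bag-of-words "embedding" over a tiny fixed vocabulary so it runs standalone; a real system would call an embedding model (for example, one available through Amazon Bedrock) instead, and the 0.3 threshold is an arbitrary illustrative value:

```python
import math

def toy_embed(sentence):
    """Stand-in for a real embedding model: bag-of-words over a tiny vocabulary."""
    vocab = ["cloud", "network", "cost", "billing", "storage"]
    words = sentence.lower().split()
    return [words.count(term) for term in vocab]

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def semantic_chunks(sentences, threshold=0.3):
    """Start a new chunk whenever similarity to the previous sentence drops."""
    chunks = [[sentences[0]]]
    for prev, cur in zip(sentences, sentences[1:]):
        if cosine(toy_embed(prev), toy_embed(cur)) >= threshold:
            chunks[-1].append(cur)      # same topic: extend the current chunk
        else:
            chunks.append([cur])        # topic shift: start a new chunk
    return chunks

sentences = [
    "cloud network design matters",
    "network latency affects cloud workloads",
    "billing depends on cost allocation",
    "cost reports simplify billing",
]
print(semantic_chunks(sentences))
```

Here the first two sentences share networking vocabulary and the last two share billing vocabulary, so the text naturally splits into two topically coherent chunks rather than at an arbitrary token boundary.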


Vectorization: Converting Text into Embeddings

After chunking, each segment is converted into a vector embedding.

Embeddings represent text as numerical vectors capturing semantic meaning.

Example:
"cloud infrastructure"

might become:
[0.43, 0.24, 0.54, 0.12, 0.53]

Embeddings allow systems to compare semantic similarity between text segments.
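The standard comparison metric is cosine similarity. The 5-dimensional vectors below are illustrative only; real embeddings have hundreds or thousands of dimensions:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity: 1.0 for identical directions, 0.0 for orthogonal."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Illustrative vectors: two related phrases and one unrelated one.
cloud_infra = [0.43, 0.24, 0.54, 0.12, 0.53]
data_center = [0.41, 0.30, 0.50, 0.15, 0.49]
fruit       = [0.02, 0.88, 0.05, 0.91, 0.03]

print(cosine_similarity(cloud_infra, data_center))  # close to 1.0
print(cosine_similarity(cloud_infra, fruit))        # noticeably lower
```

Retrieval systems rank stored chunks by this score against the query's embedding, which is exactly what the vector index in the next section accelerates.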

Embedding generation can be performed using models provided through Amazon Bedrock.

Optimizing Embeddings

Embedding design has a direct impact on performance and cost.

Embeddings vary in vector dimensionality: higher-dimensional vectors can capture more semantic nuance, but they require more storage and computational resources.


Indexing Embeddings for Retrieval

Once embeddings are generated, they must be stored in a structure supporting efficient similarity search.

Vector databases achieve this using Approximate Nearest Neighbor (ANN) algorithms, which allow searching across millions of vectors efficiently (Aumüller et al., 2020).

Amazon OpenSearch Service supports vector search capabilities using ANN indexing.
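A k-NN index mapping for Amazon OpenSearch Service might look like the sketch below. The index and field names are illustrative, and the dimension of 1536 assumes Titan Text Embeddings v1 output; adjust it to whatever embedding model you use. The actual creation call (shown in comments) uses the opensearch-py client and requires a live domain:

```python
# k-NN index mapping sketch for Amazon OpenSearch Service.
index_body = {
    "settings": {"index.knn": True},
    "mappings": {
        "properties": {
            "text": {"type": "text"},          # the original chunk text
            "embedding": {
                "type": "knn_vector",          # enables ANN similarity search
                "dimension": 1536,             # must match the embedding model
            },
        }
    },
}

# With the opensearch-py client, the index would be created like this:
# from opensearchpy import OpenSearch
# client = OpenSearch(hosts=[{"host": "my-domain-endpoint", "port": 443}])
# client.indices.create(index="documents", body=index_body)

print(index_body["mappings"]["properties"]["embedding"]["type"])  # knn_vector
```

Each document indexed into this mapping carries both the raw chunk text and its embedding, so a similarity search can return human-readable context directly.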

Example AWS Architecture

This architecture separates concerns across:

  • Ingestion
  • Processing
  • Embedding generation
  • Vector indexing
  • Retrieval

Example: Generating Embeddings with Amazon Bedrock

import boto3
import json

bedrock = boto3.client("bedrock-runtime")

text = "Cloud architecture improves scalability and resilience."

# Titan embedding models expect a JSON body with an "inputText" field.
response = bedrock.invoke_model(
    modelId="amazon.titan-embed-text-v1",
    contentType="application/json",
    accept="application/json",
    body=json.dumps({"inputText": text}),
)

# The response body is a JSON document containing the embedding vector.
result = json.loads(response["body"].read())
embedding = result["embedding"]

print(len(embedding))  # vector dimensionality (1536 for Titan Embeddings v1)

The generated vector can then be indexed in Amazon OpenSearch Service for similarity search.
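A retrieval query against that index follows the OpenSearch k-NN query shape. The index and field names below are illustrative, and in practice `query_vector` would be produced by the same embedding model used at indexing time:

```python
# Sketch of a k-NN similarity query (names are illustrative).
query_vector = [0.1] * 1536           # placeholder for a real query embedding

knn_query = {
    "size": 3,                         # return the 3 most similar chunks
    "query": {
        "knn": {
            "embedding": {             # the knn_vector field in the index
                "vector": query_vector,
                "k": 3,
            }
        }
    },
}

# With an opensearch-py client connected to the domain:
# results = client.search(index="documents", body=knn_query)
# for hit in results["hits"]["hits"]:
#     print(hit["_score"], hit["_source"]["text"])

print(knn_query["size"])  # 3
```

Keeping `size` small is one of the cost levers discussed earlier: only the top few chunks are forwarded to the model, bounding the number of input tokens per request.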

Final Thoughts

  • Tokenization, chunking, vectorization, and indexing form the core infrastructure of retrieval-based AI systems.
  • Although much attention is given to models and prompt engineering, the effectiveness of an AI system often depends more on how knowledge is structured and indexed before retrieval.
  • Platforms such as Amazon Bedrock and Amazon OpenSearch Service provide powerful building blocks for implementing these pipelines.
  • However, achieving high-quality retrieval ultimately depends on designing the data preparation process thoughtfully.
  • Efficient tokenization and chunking strategies not only improve retrieval quality but also play a critical role in controlling token consumption and operational cost in large-scale GenAI systems.

References

  • Sennrich, R., Haddow, B., & Birch, A. (2016). Neural Machine Translation of Rare Words with Subword Units. ACL.
  • Vaswani, A. et al. (2017). Attention Is All You Need. NeurIPS.
  • Kudo, T., & Richardson, J. (2018). SentencePiece: A simple and language independent subword tokenizer and detokenizer for Neural Text Processing.
  • Aumüller, M., Bernhardsson, E., & Faithfull, A. (2020). ANN-Benchmarks: A Benchmarking Tool for Approximate Nearest Neighbor Algorithms.
  • Salton, G., & Buckley, C. (1988). Term-Weighting Approaches in Automatic Text Retrieval.
  • AWS Documentation – Amazon Bedrock
  • AWS Documentation – Amazon OpenSearch Vector Search

If you're building AI systems today, optimizing tokenization and chunking strategies early can save significant cost and complexity later.
