Beyond ChatGPT: Understanding the Core Building Blocks of Generative AI

Ramya D.N Rao — Tue, 30 Jun 2026 09:32:11 +0000

Most developers have experimented with ChatGPT or GitHub Copilot. But when it comes to building AI-powered applications, simply calling an LLM API isn't enough. Understanding what's happening behind the scenes helps you design systems that are scalable, reliable, and cost-effective.

In this article, we'll explore four concepts every software engineer should know: tokens, transformers, embeddings, and Retrieval-Augmented Generation (RAG).

1. LLMs Think in Tokens, Not Words

One of the biggest misconceptions about Large Language Models (LLMs) is that they understand words like humans do. In reality, they process tokens: units of text that can be a whole word, part of a word, or a punctuation mark.

For example, the prompt

Explain dependency injection in Spring Boot.

is first converted into a sequence of tokens before the model processes it.

Why does this matter?

Most hosted LLM APIs bill by the number of input and output tokens.
Longer prompts increase latency and cost.
Every model has a maximum context window measured in tokens.

When building AI applications, prompt design isn't just about getting better answers; it's also about optimizing performance and cost.
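To make the three points above concrete, here is a minimal sketch that counts tokens, estimates cost, and checks a context limit. It rests on loud assumptions: the word-and-punctuation splitter is only a rough stand-in for a real subword tokenizer, and the prices and the 128-token window are made-up numbers, not any provider's actual figures.

```python
import re

# Hypothetical prices in USD per 1,000 tokens; real prices vary by provider and model.
PRICE_PER_1K_INPUT = 0.0005
PRICE_PER_1K_OUTPUT = 0.0015


def rough_tokenize(text: str) -> list[str]:
    """Split text into words and punctuation marks.

    This is an approximation only: production tokenizers use subword
    schemes and will usually produce different counts.
    """
    return re.findall(r"\w+|[^\w\s]", text)


def estimate_cost(input_tokens: int, output_tokens: int) -> float:
    """Estimate the cost of one request from its token counts."""
    return (
        (input_tokens / 1000) * PRICE_PER_1K_INPUT
        + (output_tokens / 1000) * PRICE_PER_1K_OUTPUT
    )


def fits_context(input_tokens: int, max_output_tokens: int, context_window: int) -> bool:
    """Input and output share the same window, so both must fit together."""
    return input_tokens + max_output_tokens <= context_window


prompt = "Explain dependency injection in Spring Boot."
tokens = rough_tokenize(prompt)
print(tokens)
print(len(tokens))  # 7 with this rough splitter
print(fits_context(len(tokens), max_output_tokens=100, context_window=128))
print(estimate_cost(len(tokens), output_tokens=100))
```

In a real application you would swap `rough_tokenize` for the tokenizer that matches your model, since counts differ between model families.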

2. Transformers: The Breakthrough Behind Modern AI

Before 2017, most language models processed text one token at a time using recurrent architectures such as RNNs and LSTMs. They struggled with long inputs because information from earlier tokens tended to fade as the sequence grew.

The introduction of the Transformer architecture changed this with a mechanism called self-attention.

Instead of reading text sequentially, transformers analyze the relationships between all tokens in the input at the same time.

Consider this sentence:

"The server restarted because it ran out of memory."

When the model processes "it", self-attention lets it give more weight to "the server" than to "memory", which is how it resolves what "it" refers to.

This ability to capture context efficiently is what powers modern LLMs like GPT, Gemini, Claude, and Llama.
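The core computation behind self-attention can be sketched in a few lines of plain Python. This is an illustration under stated assumptions, not how any production model is implemented: the two-dimensional vectors are hand-picked so that "it" lands close to "server", and the same vectors are reused as queries, keys, and values, whereas a real transformer learns separate projections for each.

```python
import math


def softmax(scores: list[float]) -> list[float]:
    """Turn raw scores into weights that are positive and sum to 1."""
    peak = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def self_attention(queries, keys, values):
    """Scaled dot-product attention over small lists of vectors."""
    d_k = len(keys[0])
    outputs, all_weights = [], []
    for q in queries:
        # Score this token against every token, then normalize.
        scores = [dot(q, k) / math.sqrt(d_k) for k in keys]
        weights = softmax(scores)
        # The output is a weighted average of the value vectors.
        output = [
            sum(w * v[i] for w, v in zip(weights, values))
            for i in range(len(values[0]))
        ]
        outputs.append(output)
        all_weights.append(weights)
    return outputs, all_weights


# Hand-picked toy vectors (assumption): "it" is placed near "server".
tokens = ["server", "it", "memory"]
vectors = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]

_, weights = self_attention(vectors, vectors, vectors)
it_weights = dict(zip(tokens, weights[1]))
print(it_weights)  # "it" puts more weight on "server" than on "memory"
```

Every token is scored against every other token in one pass, which is what lets the architecture relate distant words without stepping through the sequence.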

3. Embeddings Enable Semantic Search

Suppose a customer searches:

"How can I get my money back?"

But your documentation only contains:

"Request a refund."

A keyword search may fail because the exact words don't match.

This is where embeddings come in.

Embeddings convert text into high-dimensional vectors that capture semantic meaning. Even though the wording is different, both sentences produce vectors that are close together in vector space.

This enables semantic search, allowing applications to retrieve information based on meaning rather than exact keywords.
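"Close together in vector space" is usually measured with cosine similarity. The sketch below shows the comparison itself; the three-dimensional vectors are hand-written placeholders (an assumption for illustration), whereas a real embedding model would return vectors with hundreds or thousands of dimensions. The "Track your shipment." entry is a hypothetical second document added for contrast.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


# Placeholder embedding for the query "How can I get my money back?"
query_vector = [0.9, 0.1, 0.2]

# Placeholder embeddings for two documentation entries.
doc_vectors = {
    "Request a refund.": [0.8, 0.2, 0.1],
    "Track your shipment.": [0.1, 0.9, 0.3],
}

scores = {
    text: cosine_similarity(query_vector, vector)
    for text, vector in doc_vectors.items()
}
best_match = max(scores, key=scores.get)
print(best_match)  # Request a refund.
```

Notice that the query and the best match share no keywords; the ranking comes entirely from the vectors.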

Common use cases include:

Enterprise document search
Recommendation systems
FAQ retrieval
Knowledge assistants

4. Why Enterprise AI Uses RAG

A common misconception is that LLMs "know everything." In reality, they only know what was in their training data, which stops at a cutoff date and does not include your private information.

Imagine asking:

"What is our company's leave policy?"

The model has no knowledge of your internal HR documents.

Instead of retraining the model, modern AI systems use Retrieval-Augmented Generation (RAG).

A typical workflow looks like this:

User Question
│
▼
Generate Embedding
│
▼
Search Vector Database
│
▼
Retrieve Relevant Documents
│
▼
LLM Generates Grounded Answer

Rather than relying on memory alone, the model first retrieves the most relevant documents and then generates a response based on that context.

This approach typically improves accuracy and reduces hallucinations, because the answer is grounded in retrieved text rather than in the model's recollection alone.
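The retrieval half of this workflow can be sketched without any external services. Everything here is an assumption made to keep the example self-contained: `embed` is a toy bag-of-words counter standing in for a real embedding model (so it matches shared words, not synonyms), the Python list stands in for a vector database, and the three policy sentences are invented sample data.

```python
import math
import re
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for an embedding model: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[term] * b[term] for term in a)  # Counter gives 0 for missing terms
    norm = (
        math.sqrt(sum(v * v for v in a.values()))
        * math.sqrt(sum(v * v for v in b.values()))
    )
    return dot / norm if norm else 0.0


def retrieve(question: str, documents: list[str], k: int = 1) -> list[str]:
    """Return the k documents most similar to the question."""
    question_vector = embed(question)
    ranked = sorted(
        documents,
        key=lambda doc: cosine(question_vector, embed(doc)),
        reverse=True,
    )
    return ranked[:k]


# Invented sample documents standing in for an internal knowledge base.
documents = [
    "Leave policy: employees receive 20 days of paid leave per year.",
    "Expense policy: submit receipts within 30 days of purchase.",
    "Security policy: rotate passwords every 90 days.",
]

top = retrieve("What is our company's leave policy?", documents, k=1)
print(top[0])
```

A production system would precompute and index the document vectors instead of embedding every document on each query, but the ranking idea is the same.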

A Practical Use Case

Imagine you're building an AI assistant for an e-commerce platform.

A customer asks:

"Can I return a damaged product after 45 days?"

Instead of expecting the LLM to guess, your application can:

Convert the question into an embedding.
Search a vector database containing return policy documents.
Retrieve the relevant policy.
Send both the user's question and the retrieved document to the LLM.
Generate a response grounded in your company's actual policy.
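The last two steps, combining the question with the retrieved policy and calling the model, might look like the sketch below. It makes several assumptions: the policy sentence is invented for illustration, the prompt wording is just one reasonable choice, and `fake_llm` is a placeholder you would replace with a call to your provider's client.

```python
from typing import Callable


def build_grounded_prompt(question: str, passages: list[str]) -> str:
    """Combine the user's question with retrieved passages in one prompt."""
    context = "\n".join(f"[{i}] {p}" for i, p in enumerate(passages, start=1))
    return (
        "Answer the customer's question using only the policy excerpts below. "
        "If the excerpts do not cover the question, say so.\n\n"
        f"Policy excerpts:\n{context}\n\n"
        f"Question: {question}\n"
        "Answer:"
    )


def answer(question: str, passages: list[str], llm: Callable[[str], str]) -> str:
    """Send the grounded prompt to whichever model function is supplied."""
    return llm(build_grounded_prompt(question, passages))


def fake_llm(prompt: str) -> str:
    # Placeholder for a real model call; it only reports the prompt size.
    return f"(model response to a {len(prompt)}-character prompt)"


# Invented policy text standing in for what the vector search returned.
retrieved = ["Damaged items may be returned within 30 days of delivery."]
question = "Can I return a damaged product after 45 days?"

grounded_prompt = build_grounded_prompt(question, retrieved)
print(grounded_prompt)
print(answer(question, retrieved, fake_llm))
```

Passing the model in as a function keeps the prompt-building logic easy to test without network access or API keys.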

This architecture helps keep responses accurate, up to date, and specific to your business, provided the indexed documents are themselves kept current.

Final Thoughts

Generative AI is much more than a chat interface. The real engineering lies in understanding how tokens, transformers, embeddings, and retrieval work together.

As software engineers, we don't need to build foundation models from scratch. But understanding these building blocks enables us to design AI systems that are scalable, explainable, and production-ready.

The next time you integrate an LLM into your application, remember that the API call is only a small part of the solution. The real value comes from the architecture you build around it.