How I Built a RAG System That Actually Understands Business Metrics (Part 1: Foundation)
The Problem: When AI Can't Find Your Data
Picture this: Your marketing team asks your AI assistant, "How's this month's revenue?"
The AI responds: "I don't have access to that data."
Frustrating, right? But here's the real problem: you DO have the data. It's sitting in your database, perfectly organized, just waiting to be used. The AI simply doesn't know which metrics to look for.
This is the story of how I solved that problem by building a Retrieval-Augmented Generation (RAG) system that bridges the gap between natural language questions and structured business metrics.
Why RAG? Understanding LLM Limitations
Large Language Models like GPT have three fundamental limitations:
- Time Constraint: Trained only up to a specific cutoff date
- Domain Constraint: No knowledge of your company's internal data
- Real-time Constraint: Can't access continuously changing data
RAG addresses all three by combining the LLM's language understanding with real-time retrieval from your own data.
The Architecture Overview
Here's what we're building:
User Question
↓
[1] Text Embedding (Vector Search)
↓
[2] Structured Outputs (GPT Filtering)
↓
[3] Database Query
↓
Final Answer
This article covers steps 1-2, which form the foundation of intelligent metric retrieval.
Starting Point: Understanding Our Data Structure
Our Spring Batch process generates metrics in this format:
order_count|advertiser
conversion_rate|advertiser
total_revenue|advertiser
campaign_roas|ad_impressions|advertiser
traffic_by_channel|google|industry
traffic_by_channel|facebook|industry
The problem? This flat structure makes semantic search difficult.
Solution: Restructure into hierarchical maps
Step 1 - Separate metrics by domain and group sub-metrics under each parent (advertiser map shown):
{
"order_count": [],
"total_revenue": [],
"campaign_roas": [
"ad_impressions",
"ad_clicks",
"ad_spend"
],
"traffic_by_channel": [
"google",
"facebook",
"direct"
]
}
Step 2 - Create domain classification:
{
"order_count": "advertiser",
"conversion_rate": "common",
"total_revenue": "common",
"avg_purchase_count": "industry"
}
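The restructuring above can be sketched in plain Java. The class and method names below are my own, and I'm assuming two things the article implies but doesn't state: the flat lines always end with the domain segment, and a metric that appears under more than one domain gets classified as "common".

```java
import java.util.*;

public class MetricRestructurer {

    // Parses flat "metric|domain" or "metric|sub_metric|domain" lines into:
    //   (1) hierarchy: parent metric -> list of sub-metrics
    //   (2) domains:   metric -> "advertiser", "industry", or "common"
    public static void restructure(List<String> flatLines,
                                   Map<String, List<String>> hierarchy,
                                   Map<String, String> domains) {
        for (String line : flatLines) {
            String[] parts = line.split("\\|");
            String metric = parts[0];
            String domain = parts[parts.length - 1]; // assumption: domain is always last
            hierarchy.computeIfAbsent(metric, k -> new ArrayList<>());
            if (parts.length == 3) {
                hierarchy.get(metric).add(parts[1]); // middle segment is a sub-metric
            }
            // Assumption: a metric seen under two different domains becomes "common"
            domains.merge(metric, domain,
                    (prev, next) -> prev.equals(next) ? prev : "common");
        }
    }

    public static void main(String[] args) {
        Map<String, List<String>> hierarchy = new LinkedHashMap<>();
        Map<String, String> domains = new LinkedHashMap<>();
        restructure(List.of(
                "order_count|advertiser",
                "campaign_roas|ad_impressions|advertiser",
                "traffic_by_channel|google|industry",
                "traffic_by_channel|facebook|industry"
        ), hierarchy, domains);
        System.out.println(hierarchy);
        System.out.println(domains);
    }
}
```

Running this on the sample lines from earlier yields the two maps shown above: a hierarchy map with sub-metrics nested under their parents, and a flat domain map.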
Step 1: Text Embedding for Semantic Search
The Core Concept: Convert text into numerical vectors that capture semantic meaning.
How It Works
When a user asks "How's the revenue?", we need to find that this relates to the "total_revenue" metric - even though the words don't match exactly.
Process:
- Convert question → 1536-dimensional vector
- Convert all metrics → 1536-dimensional vectors
- Calculate cosine similarity
- Extract top candidates
OpenAI Embedding API
@Service
public class EmbeddingService {

    @Value("${openai.api.key}")
    private String apiKey;

    private final RestTemplate restTemplate;

    public EmbeddingService(RestTemplateBuilder builder) {
        this.restTemplate = builder.build();
    }

    public double[] getEmbedding(String text) {
        Map<String, Object> requestBody = Map.of(
                "model", "text-embedding-3-small",
                "input", text
        );

        HttpHeaders headers = new HttpHeaders();
        headers.setBearerAuth(apiKey);
        headers.setContentType(MediaType.APPLICATION_JSON);

        HttpEntity<Map<String, Object>> request = new HttpEntity<>(requestBody, headers);

        ResponseEntity<Map> response = restTemplate.postForEntity(
                "https://api.openai.com/v1/embeddings",
                request,
                Map.class
        );

        // Extract the embedding array from response.data[0].embedding
        Map<String, Object> data = (Map<String, Object>)
                ((List<Object>) response.getBody().get("data")).get(0);
        List<Double> embedding = (List<Double>) data.get("embedding");

        return embedding.stream()
                .mapToDouble(Double::doubleValue)
                .toArray();
    }
}
This service generates embeddings using OpenAI's API.
Cosine Similarity Calculation
public class VectorMath {

    public static double cosineSimilarity(double[] vec1, double[] vec2) {
        double dotProduct = 0.0;
        double norm1 = 0.0;
        double norm2 = 0.0;

        for (int i = 0; i < vec1.length; i++) {
            dotProduct += vec1[i] * vec2[i];
            norm1 += vec1[i] * vec1[i];
            norm2 += vec2[i] * vec2[i];
        }

        return dotProduct / (Math.sqrt(norm1) * Math.sqrt(norm2));
    }
}
This calculates semantic similarity between two vectors.
Example Output
Question: "How's this month's revenue?"
Top Matches:
- "total_revenue": 0.89 similarity ✓
- "order_count": 0.72 similarity
- "conversion_rate": 0.45 similarity
Real Example:
Question: "Show me sales performance"
Vector Comparison:
[0.4, 0.1, 0.6, ...] (question)
vs
[0.3, 0.2, 0.7, ...] (total_revenue) → 89% match
[0.1, 0.5, -0.2, ...] (order_count) → 72% match
[0.2, 0.6, -0.4, ...] (conversion_rate) → 45% match
Result: Extract top 10 candidates
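Putting the embedding and similarity pieces together, the top-10 extraction might look like the sketch below. `CandidateRanker` is my own name, and the cosine function is inlined so the sketch is self-contained; in the real pipeline the vectors would come from `EmbeddingService`, so the toy low-dimensional vectors in the tests just stand in for the 1536-dimensional ones.

```java
import java.util.*;
import java.util.stream.*;

public class CandidateRanker {

    // Same cosine similarity as VectorMath above, inlined for self-containment.
    static double cosine(double[] a, double[] b) {
        double dot = 0, n1 = 0, n2 = 0;
        for (int i = 0; i < a.length; i++) {
            dot += a[i] * b[i];
            n1 += a[i] * a[i];
            n2 += b[i] * b[i];
        }
        return dot / (Math.sqrt(n1) * Math.sqrt(n2));
    }

    // Ranks every metric embedding against the question embedding
    // and keeps the K most similar metric names.
    public static List<String> topK(double[] question,
                                    Map<String, double[]> metricVectors,
                                    int k) {
        return metricVectors.entrySet().stream()
                .sorted(Comparator.comparingDouble(
                        (Map.Entry<String, double[]> e) ->
                                cosine(question, e.getValue())).reversed())
                .limit(k)
                .map(Map.Entry::getKey)
                .collect(Collectors.toList());
    }
}
```

With k = 10 this produces exactly the candidate list shown above, ordered by similarity.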
Problem: 10 candidates is still too many for accurate answers. We need more precise filtering.
Step 2: Structured Outputs for Precise Classification
Why Structured Outputs?
Without Structured Outputs:
AI Response: "You should check total_revenue and maybe order_count as well..."
↓ Unpredictable format, hard to parse
With Structured Outputs:
{
"metrics": ["total_revenue", "order_count"]
}
↓ Clean array, ready to use
The JSON Schema
Map<String, Object> schema = Map.of(
        "type", "object",
        "properties", Map.of(
                "metrics", Map.of(
                        "type", "array",
                        "items", Map.of("type", "string"),
                        "description", "List of relevant metric names"
                )
        ),
        "required", List.of("metrics"),
        "additionalProperties", false
);
This schema forces GPT to return exactly the format we need.
System Prompt Design
String systemPrompt = String.format("""
        You are an analytics assistant for e-commerce advertisers.

        ### Critical Rules
        1) If you don't understand the question → return []
        2) If unrelated to advertising data → return []
        3) Extract ONLY metric names, no explanations
        4) Use exact names from the list below

        ### Available Metrics
        %s
        """, String.join(", ", candidateMetrics));
Clear instructions ensure consistent results.
Java Implementation
@Slf4j
@Service
public class MetricClassificationService {

    @Value("${openai.api.key}")
    private String apiKey;

    private final RestTemplate restTemplate;

    public MetricClassificationService(RestTemplateBuilder builder) {
        this.restTemplate = builder.build();
    }

    public List<String> classifyMetrics(String userQuestion,
                                        List<String> candidateMetrics) {
        String systemPrompt = buildSystemPrompt(candidateMetrics);

        Map<String, Object> requestBody = Map.of(
                "model", "gpt-4o-mini",
                "messages", List.of(
                        Map.of("role", "system", "content", systemPrompt),
                        Map.of("role", "user", "content", userQuestion)
                ),
                "response_format", Map.of(
                        "type", "json_schema",
                        "json_schema", Map.of(
                                "name", "map_metrics",
                                "strict", true,
                                "schema", buildSchema()
                        )
                )
        );

        HttpHeaders headers = new HttpHeaders();
        headers.setBearerAuth(apiKey);
        headers.setContentType(MediaType.APPLICATION_JSON);

        HttpEntity<Map<String, Object>> entity = new HttpEntity<>(requestBody, headers);

        ResponseEntity<Map> response = restTemplate.postForEntity(
                "https://api.openai.com/v1/chat/completions",
                entity,
                Map.class
        );

        return parseResponse(response.getBody());
    }

    private List<String> parseResponse(Map<String, Object> response) {
        try {
            // Navigate choices[0].message.content, then parse the JSON string inside it
            Map<String, Object> choice = (Map<String, Object>)
                    ((List<Object>) response.get("choices")).get(0);
            Map<String, Object> message = (Map<String, Object>) choice.get("message");
            String content = (String) message.get("content");

            ObjectMapper mapper = new ObjectMapper();
            Map<String, Object> data = mapper.readValue(content, Map.class);
            return (List<String>) data.get("metrics");
        } catch (Exception e) {
            log.error("Failed to parse structured output", e);
            return Collections.emptyList();
        }
    }
}
This service handles the complete Structured Outputs workflow.
Real-World Example
Input: "How's this month's revenue?" + the 10 candidates from vector search
GPT Processing:
- "total_revenue" : revenue mentioned (selected)
- "order_count" : related to revenue (selected)
- "page_views" : not about revenue (rejected)
- "cart_abandonment" : not about revenue (rejected)
Output: ["total_revenue", "order_count"]
Complete Flow Example
Let's see the entire system in action:
Step 1: Vector Search
User: "How's this month's revenue?"
↓
Top 10 candidates:
- total_revenue (0.89)
- order_count (0.78)
- gross_profit (0.76)
- net_revenue (0.75)
- sales_amount (0.74)
- purchase_count (0.71)
- transaction_value (0.69)
- page_views (0.45)
- bounce_rate (0.32)
- conversion_rate (0.28)
Step 2: GPT Structured Filtering
Input: 10 candidates
↓
GPT Analysis:
"The question asks about 'revenue', so:
- total_revenue ✓ (direct match)
- net_revenue ✓ (revenue metric)
- order_count ✗ (count, not revenue)
- page_views ✗ (engagement, not revenue)"
↓
Output: ["total_revenue", "net_revenue"]
Step 3: Database Query
SELECT total_revenue, net_revenue
FROM metrics
WHERE period = 'current_month'
↓
Final Answer: "This month's total revenue is $125,000..."
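To tie the flow together, here's a minimal orchestration sketch. The class and parameter names are mine, not from the production code; the functional interfaces stand in for `EmbeddingService`, the vector ranker, and `MetricClassificationService`, so the wiring can be shown (and tested) without live API calls.

```java
import java.util.*;
import java.util.function.*;

// Minimal sketch of the two-step pipeline described above.
public class MetricRetrievalPipeline {

    private final Function<String, double[]> embed;                        // question -> vector
    private final BiFunction<double[], Integer, List<String>> rank;        // vector, k -> top-k metric names
    private final BiFunction<String, List<String>, List<String>> classify; // question, candidates -> filtered metrics

    public MetricRetrievalPipeline(Function<String, double[]> embed,
                                   BiFunction<double[], Integer, List<String>> rank,
                                   BiFunction<String, List<String>, List<String>> classify) {
        this.embed = embed;
        this.rank = rank;
        this.classify = classify;
    }

    // Step 1: vector search narrows all metrics to the top-10 candidates.
    // Step 2: GPT structured filtering picks the exact metrics to query.
    public List<String> resolveMetrics(String question) {
        double[] questionVector = embed.apply(question);
        List<String> candidates = rank.apply(questionVector, 10);
        return classify.apply(question, candidates);
    }
}
```

The returned metric names then feed directly into the Step 3 database query.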
Results So Far
Before RAG:
User: "How's this month's revenue?"
AI: "I don't have access to that data."
After Part 1:
User: "How's this month's revenue?"
System: Successfully identified ["total_revenue", "net_revenue"]
→ Can now query the correct metrics
Success Rate: 95% accuracy for flat metrics like "total_revenue", "order_count"
Current Limitation
This system works well for simple metrics but fails on hierarchical ones:
User: "How's my Google traffic?"
Current System:
- Finds "traffic_by_channel" ✓
- Can't identify the "google" sub-metric ✗
We need: "traffic_by_channel|google|advertiser"
The Problem: We need to search not just for parent metrics, but also navigate their hierarchical structure to find specific child metrics like "google" under "traffic_by_channel".
What's Next?
In Part 2, I'll show you how I solved this using:
- MetricForest Data Structure: A tree-based system for hierarchical metrics
- Neighbor Node Exploration: Finding related parent-child metric pairs
- Two-Stage GPT Filtering: Precise classification at each hierarchy level
Example of what Part 2 solves:
User: "How's my Google traffic?"
System: "traffic_by_channel|google|advertiser"
User: "Compare my Facebook traffic to industry average"
System:
- "traffic_by_channel|facebook|advertiser"
- "traffic_by_channel|facebook|industry"
Result: 98% accuracy even for complex hierarchical queries.
Need Help Implementing RAG?
I help companies integrate AI systems with their existing Spring Boot infrastructure.
Specializing in:
- Spring Boot + OpenAI integration
- Custom RAG pipelines
- E-commerce analytics systems
📧 Contact: [junyoungmoon9857@gmail.com]
Tags: #rag #system-design #vectordatabase #backend