How we built a seamless data pipeline from AWS Athena to React dashboards using dbt and Cube.js
The Challenge: From Raw Events to Dashboard Insights
Picture this: You have millions of ridepooling trip events flowing into AWS Athena, and you need to serve them up as beautiful, interactive dashboards to your stakeholders. The data needs to be:
- Fast to query (sub-second response times)
- Always fresh (reflecting recent events)
- Flexible (various time ranges and filters)
- Scalable (handling growing data volumes)
This is the story of how we built a modern analytics stack that connects all these pieces seamlessly.
The Architecture: Four Layers, One Goal
Our stack consists of four key layers, each solving a specific problem:
UI Layer (React)
↕️
Semantic Layer (Cube.js)
↕️
Transformation Layer (dbt)
↕️
Storage Layer (AWS Athena)
Let's walk through each layer and see how they work together.
Layer 1: AWS Athena - The Foundation
At the bottom, raw event data lives in S3 and is queried through AWS Athena, with a multitenant architecture:
-- Raw trip events table
ridepooling__moia__service__trip_entities.ent__trip
ridepooling__tenant_1__service__trip_entities.ent__trip
ridepooling__tenant_2__service__trip_entities.ent__trip
Each tenant gets isolated data storage, but the schema remains consistent. Events flow in continuously from our ridepooling services.
Athena's role:
- Serverless SQL engine over S3 data lakes
- Cost-effective for large datasets
- Handles complex queries with good performance
- Integrates natively with AWS ecosystem
Layer 2: dbt - The Transformation Engine
dbt sits on top of Athena, transforming raw events into analytics-ready models:
-- stop_network_usage.sql
{{ config(materialized="view") }}

select
    stops.stop_id,
    stops.service_area_uuid,
    stops.name_en as stop_name_en,
    stops.name_de as stop_name_de,
    date(trip_usage.occurred_at) as trip_date,
    count(distinct trip_usage.trip_id) as daily_trip_count
from
    {{ multitenancy_source("ridepooling", "service__service_area_entities", "att__stop__details_latest") }} stops
    left join (
        select link__stop_id__pickup as stop_id, trip_id, occurred_at
        from {{ multitenancy_source("ridepooling", "service__trip_entities", "ent__trip") }}
        where link__stop_id__pickup is not null
        union
        select link__stop_id__delivery as stop_id, trip_id, occurred_at
        from {{ multitenancy_source("ridepooling", "service__trip_entities", "ent__trip") }}
        where link__stop_id__delivery is not null
    ) trip_usage on stops.stop_id = trip_usage.stop_id
group by
    stops.stop_id,
    stops.service_area_uuid,
    stops.name_en,
    stops.name_de,
    date(trip_usage.occurred_at)
Key dbt features we leverage:
- Multitenancy macros: Dynamic table routing per tenant
- Jinja templating: Conditional logic and variables
- View materialization: Always-fresh results with no refresh schedule to maintain
- Modular design: Reusable transformations across tenants
The multitenant magic:
-- This macro resolves to tenant-specific tables at runtime
{{ multitenancy_source("ridepooling", "service__trip_entities", "ent__trip") }}
-- Becomes: ridepooling__moia__service__trip_entities.ent__trip
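The resolution logic is simple enough to sketch in plain JavaScript (the real implementation is a Jinja macro inside dbt; the function and tenant names below are purely illustrative):

```javascript
// Illustrative re-implementation of what the multitenancy_source macro does:
// build a fully qualified Athena table name from tenant + source parts.
// The actual macro is Jinja in dbt; "moia" here is an example tenant.
function multitenancySource(domain, schemaSuffix, table, tenant) {
  return `${domain}__${tenant}__${schemaSuffix}.${table}`;
}

const table = multitenancySource(
  "ridepooling", "service__trip_entities", "ent__trip", "moia"
);
console.log(table); // ridepooling__moia__service__trip_entities.ent__trip
```

Because the tenant is just a parameter, the same model SQL compiles cleanly against every tenant's isolated database.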
Layer 3: Cube.js - The Semantic Layer
Cube.js acts as our semantic layer, defining business metrics and providing a consistent API:
// stop-network-usage.js
cube(`StopNetworkUsage`, {
  sql_table: `ridepooling__${COMPILE_CONTEXT.tenant_name}__service__stop_network_data_products__${COMPILE_CONTEXT.env}.stop_network_usage`,

  dimensions: {
    stopId: {
      sql: `stop_id`,
      type: `string`,
      primary_key: true
    },
    stopNameEn: {
      sql: `stop_name_en`,
      type: `string`
    },
    tripDate: {
      sql: `trip_date`,
      type: `time`
    }
  },

  measures: {
    totalTrips: {
      sql: `daily_trip_count`,
      type: `sum`
    },
    avgDailyTrips: {
      sql: `daily_trip_count`,
      type: `avg`
    }
  }
});
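To make the two measures concrete: given the per-day rows the dbt view produces, totalTrips sums daily_trip_count while avgDailyTrips averages it. A quick sketch with made-up numbers (hypothetical data, not production values):

```javascript
// Hypothetical daily rows for one stop, mirroring the dbt view's output shape.
const rows = [
  { trip_date: "2024-05-01", daily_trip_count: 40 },
  { trip_date: "2024-05-02", daily_trip_count: 60 },
  { trip_date: "2024-05-03", daily_trip_count: 50 },
];

// totalTrips: measure of type `sum` over daily_trip_count
const totalTrips = rows.reduce((acc, r) => acc + r.daily_trip_count, 0);

// avgDailyTrips: measure of type `avg` over daily_trip_count
const avgDailyTrips = totalTrips / rows.length;

console.log(totalTrips, avgDailyTrips); // 150 50
```

Cube.js performs these aggregations in Athena via generated SQL, of course; the point is that dashboard authors only ever reference the named measures.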
Cube.js superpowers:
- Semantic definitions: Business users think "total trips," not SQL aggregations
- Automatic SQL generation: Converts REST/GraphQL requests to optimized SQL
- Caching layer: Sub-second response times for repeated queries
- Multi-tenant aware: Same schema definition works across all tenants
- Time dimension intelligence: Handles date ranges, comparisons, and time series
Layer 4: React UI - The Experience
Finally, our React dashboard consumes Cube.js APIs to create beautiful visualizations:
// StopUsageDashboard.jsx
import { useCubeQuery } from '@cubejs-client/react';

// LoadingSpinner and LineChart are app-local components.
const StopUsageDashboard = () => {
  const { resultSet, isLoading } = useCubeQuery({
    measures: ['StopNetworkUsage.totalTrips'],
    timeDimensions: [{
      dimension: 'StopNetworkUsage.tripDate',
      granularity: 'day',
      dateRange: 'last 30 days'
    }],
    dimensions: ['StopNetworkUsage.stopNameEn']
  });

  if (isLoading) return <LoadingSpinner />;
  if (!resultSet) return null; // query errored or hasn't resolved yet

  return (
    <div>
      <h2>Stop Usage Over Time</h2>
      <LineChart
        data={resultSet.chartPivot()}
        xAxis="time"
        yAxis="StopNetworkUsage.totalTrips"
      />
    </div>
  );
};
UI benefits:
- Declarative queries: Describe what you want, not how to get it
- Automatic optimization: Cube.js handles SQL generation and caching
- Real-time updates: Data stays fresh through view materialization
- Rich interactions: Time range pickers, filters, drill-downs work out of the box
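Those interactions all reduce to changes in the declarative query object. For example, adding a stop-name filter and a custom date range touches no SQL at all (filter values and dates below are illustrative):

```javascript
// A Cube.js query with an extra filter and an explicit date range —
// the UI only ever edits this declarative object, never SQL.
const query = {
  measures: ['StopNetworkUsage.totalTrips'],
  dimensions: ['StopNetworkUsage.stopNameEn'],
  timeDimensions: [{
    dimension: 'StopNetworkUsage.tripDate',
    granularity: 'week',
    dateRange: ['2024-01-01', '2024-03-31'] // example range from a date picker
  }],
  filters: [{
    member: 'StopNetworkUsage.stopNameEn',
    operator: 'contains',
    values: ['Central'] // illustrative value from a search box
  }]
};
```

Pass this object to useCubeQuery (or cubejsApi.load) and Cube.js handles SQL generation, tenant routing, and caching as before.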
The Data Flow: From Event to Insight
Here's how a single trip event becomes a dashboard insight:
1. Event arrives: Trip booked with pickup stop A and delivery stop B
2. S3 stores: Raw event lands in a tenant-specific partition, queryable via Athena
3. dbt transforms: The view aggregates daily usage per stop, handling both pickup and delivery
4. Cube.js semantics: Converts results into business metrics like "total trips per stop"
5. UI displays: A chart showing stop usage trends
The magic is that steps 3-5 happen in real-time during user queries, ensuring fresh data.
Key Design Decisions & Trade-offs
✅ Why Views Over Tables?
- Always fresh data vs. potential performance hit
- Simplicity vs. automation complexity
- Perfect for our use case: Daily monitoring with reasonable data volumes
✅ Why Cube.js Over Direct SQL?
- Semantic consistency across different front-ends
- Automatic optimization and caching
- Developer productivity: No hand-written SQL endpoints to build and maintain
- Multi-tenant support built-in
✅ Why dbt Over Raw Athena?
- Version control for transformations
- Testing and documentation built-in
- Modularity and reusability across tenants
- Software engineering practices for data
✅ Why This Stack?
- Serverless: Scales automatically, pay-per-use
- Modern: Embraces current best practices
- Flexible: Each layer can be swapped independently
- Developer-friendly: Code-first approach throughout
Performance Characteristics
Our stack delivers impressive performance:
- Query latency: 200-800ms for typical dashboard queries
- Data freshness: Real-time (views query source directly)
- Scalability: Handles millions of events, hundreds of concurrent users
- Cost efficiency: Serverless architecture scales to zero
Performance tips:
- Cube.js pre-aggregations for frequently accessed metrics
- Athena partition pruning with date filters
- Strategic use of dbt incremental models for large fact tables
- Smart caching strategies in the UI layer
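As an example of the first tip, a rollup pre-aggregation could be added to the StopNetworkUsage cube. A sketch, assuming daily granularity and an hourly refresh suit the dashboard (both are illustrative choices):

```javascript
// Sketch: a rollup pre-aggregation for the StopNetworkUsage cube.
// Granularity and refresh interval are illustrative, not our actual settings.
cube(`StopNetworkUsage`, {
  // ...sql_table, dimensions, and measures as defined earlier...

  pre_aggregations: {
    dailyStopUsage: {
      measures: [CUBE.totalTrips],
      dimensions: [CUBE.stopNameEn],
      time_dimension: CUBE.tripDate,
      granularity: `day`,
      refresh_key: { every: `1 hour` }
    }
  }
});
```

Matching queries are then served from the pre-built rollup instead of hitting Athena, at the cost of up to an hour of staleness for those metrics.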
Operational Considerations
Monitoring & Observability
- Cube.js: Built-in query performance metrics
- dbt: Model run logs and test results
- Athena: CloudWatch metrics for query performance
- UI: User experience monitoring and error tracking
Multi-tenant Isolation
- Data isolation: Separate databases per tenant
- Schema consistency: Same models work across tenants
- Runtime flexibility: Switch tenants via variables
- Security: Row-level security enforced at storage layer
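Where storage-level isolation alone isn't enough, Cube.js can additionally enforce filters at query time through the queryRewrite hook in its configuration. A minimal sketch, assuming the security context carries a serviceAreas claim and the cube exposes a serviceAreaUuid dimension (both hypothetical names):

```javascript
// Sketch of a Cube.js queryRewrite hook that appends a row-level filter
// from the caller's security context. `serviceAreas` and
// `StopNetworkUsage.serviceAreaUuid` are hypothetical names — adapt them
// to your actual auth claims and cube dimensions.
// In cube.js config this would be exported as: module.exports = { queryRewrite }
const queryRewrite = (query, { securityContext }) => {
  if (securityContext && securityContext.serviceAreas) {
    query.filters = query.filters || [];
    query.filters.push({
      member: "StopNetworkUsage.serviceAreaUuid",
      operator: "equals",
      values: securityContext.serviceAreas,
    });
  }
  return query;
};
```

This keeps defense in depth: even a misconfigured dashboard query can only see the service areas its token allows.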
Development Workflow
- dbt development: Local development with sample data
- Cube.js iteration: Schema development with dev APIs
- UI prototyping: Component development with mock data
- Integration testing: End-to-end testing across layers
- Deployment: GitOps with environment promotion
Lessons Learned
🎯 Start Simple
We initially over-engineered with complex incremental models and scheduling. Views solved our freshness problem elegantly.
🎯 Semantic Layer is Worth It
Cube.js transformed our development velocity. No more custom SQL endpoints or manual optimization.
🎯 Multi-tenancy From Day One
Planning for multiple tenants upfront saved massive refactoring later.
🎯 Choose Your Materializations Wisely
Not everything needs to be a table. Views, incremental, and tables each have their place.
What's Next?
Our stack continues to evolve:
- Real-time streaming: Moving from batch to stream processing
- Advanced analytics: ML models integrated into the pipeline
- Self-service: Empowering business users to create their own metrics
- Global scale: Multi-region deployment strategies
The Bottom Line
Building a modern analytics stack isn't just about picking the right tools—it's about creating a coherent data experience that serves everyone from data engineers to business stakeholders.
Our UI → Cube.js → dbt → Athena stack delivers:
- Fast insights for decision makers
- Productive development for engineers
- Scalable architecture for growing data needs
- Maintainable codebase for long-term success
The secret sauce isn't any single technology—it's how they work together to create something greater than the sum of their parts.
What's your experience with modern analytics stacks? Have you found different combinations that work well for your use cases? I'd love to hear about your architecture decisions and trade-offs.