How we built a seamless data pipeline from AWS Athena to React dashboards using dbt and Cube.js
The Challenge: From Raw Events to Dashboard Insights
Picture this: You have millions of ridepooling trip events flowing into AWS Athena, and you need to serve them up as beautiful, interactive dashboards to your stakeholders. The data needs to be:
- Fast to query (sub-second response times)
- Always fresh (reflecting recent events)
- Flexible (various time ranges and filters)
- Scalable (handling growing data volumes)
This is the story of how we built a modern analytics stack that connects all these pieces seamlessly.
The Architecture: Four Layers, One Goal
Our stack consists of four key layers, each solving a specific problem:
UI Layer (React)
↕️
Semantic Layer (Cube.js)
↕️
Transformation Layer (dbt)
↕️
Storage Layer (AWS Athena)
Let's walk through each layer and see how they work together.
Layer 1: AWS Athena - The Foundation
At the bottom, raw event data lives in S3 and is queried through AWS Athena, with a multitenant architecture:
-- Raw trip events table
ridepooling__moia__service__trip_entities.ent__trip
ridepooling__tenant_1__service__trip_entities.ent__trip
ridepooling__tenant_2__service__trip_entities.ent__trip
Each tenant gets isolated data storage, but the schema remains consistent. Events flow in continuously from our ridepooling services.
Athena's role:
- Serverless SQL engine over S3 data lakes
- Cost-effective for large datasets
- Handles complex queries with good performance
- Integrates natively with AWS ecosystem
Layer 2: dbt - The Transformation Engine
dbt sits on top of Athena, transforming raw events into analytics-ready models:
-- stop_network_usage.sql
{{ config(materialized="view") }}

select
    stops.stop_id,
    stops.service_area_uuid,
    stops.name_en as stop_name_en,
    stops.name_de as stop_name_de,
    date(trip_usage.occurred_at) as trip_date,
    count(distinct trip_usage.trip_id) as daily_trip_count
from
    {{ multitenancy_source("ridepooling", "service__service_area_entities", "att__stop__details_latest") }} stops
    left join (
        select link__stop_id__pickup as stop_id, trip_id, occurred_at
        from {{ multitenancy_source("ridepooling", "service__trip_entities", "ent__trip") }}
        where link__stop_id__pickup is not null
        union
        select link__stop_id__delivery as stop_id, trip_id, occurred_at
        from {{ multitenancy_source("ridepooling", "service__trip_entities", "ent__trip") }}
        where link__stop_id__delivery is not null
    ) trip_usage on stops.stop_id = trip_usage.stop_id
group by
    stops.stop_id,
    stops.service_area_uuid,
    stops.name_en,
    stops.name_de,
    date(trip_usage.occurred_at)
Key dbt features we leverage:
- Multitenancy macros: Dynamic table routing per tenant
- Jinja templating: Conditional logic and variables
- View materialization: Always-fresh results with no refresh schedule to maintain
- Modular design: Reusable transformations across tenants
The multitenant magic:
-- This macro resolves to tenant-specific tables at runtime
{{ multitenancy_source("ridepooling", "service__trip_entities", "ent__trip") }}
-- Becomes: ridepooling__moia__service__trip_entities.ent__trip
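The resolution logic is simple enough to sketch in plain JavaScript (the real implementation is a Jinja macro inside dbt; the function and tenant names below are purely illustrative):

```javascript
// Illustrative re-implementation of what the multitenancy_source macro does:
// build a fully qualified Athena table name from tenant + source parts.
// The actual macro is Jinja in dbt; "moia" here is an example tenant.
function multitenancySource(domain, schemaSuffix, table, tenant) {
  return `${domain}__${tenant}__${schemaSuffix}.${table}`;
}

const table = multitenancySource(
  "ridepooling", "service__trip_entities", "ent__trip", "moia"
);
console.log(table); // ridepooling__moia__service__trip_entities.ent__trip
```

Because the tenant is just a parameter, the same model SQL compiles cleanly against every tenant's isolated database.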
Layer 3: Cube.js - The Semantic Layer
Cube.js acts as our semantic layer, defining business metrics and providing a consistent API:
// stop-network-usage.js
cube(`StopNetworkUsage`, {
  sql_table: `ridepooling__${COMPILE_CONTEXT.tenant_name}__service__stop_network_data_products__${COMPILE_CONTEXT.env}.stop_network_usage`,

  dimensions: {
    stopId: {
      sql: `stop_id`,
      type: `string`,
      primary_key: true
    },
    stopNameEn: {
      sql: `stop_name_en`,
      type: `string`
    },
    tripDate: {
      sql: `trip_date`,
      type: `time`
    }
  },

  measures: {
    totalTrips: {
      sql: `daily_trip_count`,
      type: `sum`
    },
    avgDailyTrips: {
      sql: `daily_trip_count`,
      type: `avg`
    }
  }
});
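To make the two measures concrete: given the per-day rows the dbt view produces, totalTrips sums daily_trip_count while avgDailyTrips averages it. A quick sketch with made-up numbers (hypothetical data, not production values):

```javascript
// Hypothetical daily rows for one stop, mirroring the dbt view's output shape.
const rows = [
  { trip_date: "2024-05-01", daily_trip_count: 40 },
  { trip_date: "2024-05-02", daily_trip_count: 60 },
  { trip_date: "2024-05-03", daily_trip_count: 50 },
];

// totalTrips: measure of type `sum` over daily_trip_count
const totalTrips = rows.reduce((acc, r) => acc + r.daily_trip_count, 0);

// avgDailyTrips: measure of type `avg` over daily_trip_count
const avgDailyTrips = totalTrips / rows.length;

console.log(totalTrips, avgDailyTrips); // 150 50
```

Cube.js performs these aggregations in Athena via generated SQL, of course; the point is that dashboard authors only ever reference the named measures.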
Cube.js superpowers:
- Semantic definitions: Business users think "total trips," not SQL aggregations
- Automatic SQL generation: Converts REST/GraphQL requests to optimized SQL
- Caching layer: Sub-second response times for repeated queries
- Multi-tenant aware: Same schema definition works across all tenants
- Time dimension intelligence: Handles date ranges, comparisons, and time series
Layer 4: React UI - The Experience
Finally, our React dashboard consumes Cube.js APIs to create beautiful visualizations:
// StopUsageDashboard.jsx
import { useCubeQuery } from '@cubejs-client/react';

// LoadingSpinner and LineChart are app-local components.
const StopUsageDashboard = () => {
  const { resultSet, isLoading } = useCubeQuery({
    measures: ['StopNetworkUsage.totalTrips'],
    timeDimensions: [{
      dimension: 'StopNetworkUsage.tripDate',
      granularity: 'day',
      dateRange: 'last 30 days'
    }],
    dimensions: ['StopNetworkUsage.stopNameEn']
  });

  if (isLoading) return <LoadingSpinner />;
  if (!resultSet) return null; // query errored or hasn't resolved yet

  return (
    <div>
      <h2>Stop Usage Over Time</h2>
      <LineChart
        data={resultSet.chartPivot()}
        xAxis="time"
        yAxis="StopNetworkUsage.totalTrips"
      />
    </div>
  );
};
UI benefits:
- Declarative queries: Describe what you want, not how to get it
- Automatic optimization: Cube.js handles SQL generation and caching
- Real-time updates: Data stays fresh through view materialization
- Rich interactions: Time range pickers, filters, drill-downs work out of the box
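Those interactions all reduce to changes in the declarative query object. For example, adding a stop-name filter and a custom date range touches no SQL at all (filter values and dates below are illustrative):

```javascript
// A Cube.js query with an extra filter and an explicit date range —
// the UI only ever edits this declarative object, never SQL.
const query = {
  measures: ['StopNetworkUsage.totalTrips'],
  dimensions: ['StopNetworkUsage.stopNameEn'],
  timeDimensions: [{
    dimension: 'StopNetworkUsage.tripDate',
    granularity: 'week',
    dateRange: ['2024-01-01', '2024-03-31'] // example range from a date picker
  }],
  filters: [{
    member: 'StopNetworkUsage.stopNameEn',
    operator: 'contains',
    values: ['Central'] // illustrative value from a search box
  }]
};
```

Pass this object to useCubeQuery (or cubejsApi.load) and Cube.js handles SQL generation, tenant routing, and caching as before.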
The Data Flow: From Event to Insight
Here's how a single trip event becomes a dashboard insight:
1. Event arrives: Trip booked with pickup stop A and delivery stop B
2. S3 stores: Raw event lands in a tenant-specific partition, queryable via Athena
3. dbt transforms: The view aggregates daily usage per stop, handling both pickup and delivery
4. Cube.js semantics: Converts results into business metrics like "total trips per stop"
5. UI displays: A chart showing stop usage trends
The magic is that steps 3-5 happen in real-time during user queries, ensuring fresh data.
Key Design Decisions & Trade-offs
✅ Why Views Over Tables?
- Always fresh data vs. potential performance hit
- Simplicity vs. automation complexity
- Perfect for our use case: Daily monitoring with reasonable data volumes
✅ Why Cube.js Over Direct SQL?
- Semantic consistency across different front-ends
- Automatic optimization and caching
- Developer productivity: No hand-written SQL endpoints to build and maintain
- Multi-tenant support built-in
✅ Why dbt Over Raw Athena?
- Version control for transformations
- Testing and documentation built-in
- Modularity and reusability across tenants
- Software engineering practices for data
✅ Why This Stack?
- Serverless: Scales automatically, pay-per-use
- Modern: Embraces current best practices
- Flexible: Each layer can be swapped independently
- Developer-friendly: Code-first approach throughout
Performance Characteristics
Our stack delivers impressive performance:
- Query latency: 200-800ms for typical dashboard queries
- Data freshness: Real-time (views query source directly)
- Scalability: Handles millions of events, hundreds of concurrent users
- Cost efficiency: Serverless architecture scales to zero
Performance tips:
- Cube.js pre-aggregations for frequently accessed metrics
- Athena partition pruning with date filters
- Strategic use of dbt incremental models for large fact tables
- Smart caching strategies in the UI layer
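As an example of the first tip, a rollup pre-aggregation could be added to the StopNetworkUsage cube. A sketch, assuming daily granularity and an hourly refresh suit the dashboard (both are illustrative choices):

```javascript
// Sketch: a rollup pre-aggregation for the StopNetworkUsage cube.
// Granularity and refresh interval are illustrative, not our actual settings.
cube(`StopNetworkUsage`, {
  // ...sql_table, dimensions, and measures as defined earlier...

  pre_aggregations: {
    dailyStopUsage: {
      measures: [CUBE.totalTrips],
      dimensions: [CUBE.stopNameEn],
      time_dimension: CUBE.tripDate,
      granularity: `day`,
      refresh_key: { every: `1 hour` }
    }
  }
});
```

Matching queries are then served from the pre-built rollup instead of hitting Athena, at the cost of up to an hour of staleness for those metrics.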
Operational Considerations
Monitoring & Observability
- Cube.js: Built-in query performance metrics
- dbt: Model run logs and test results
- Athena: CloudWatch metrics for query performance
- UI: User experience monitoring and error tracking
Multi-tenant Isolation
- Data isolation: Separate databases per tenant
- Schema consistency: Same models work across tenants
- Runtime flexibility: Switch tenants via variables
- Security: Row-level security enforced at storage layer
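Where storage-level isolation alone isn't enough, Cube.js can additionally enforce filters at query time through the queryRewrite hook in its configuration. A minimal sketch, assuming the security context carries a serviceAreas claim and the cube exposes a serviceAreaUuid dimension (both hypothetical names):

```javascript
// Sketch of a Cube.js queryRewrite hook that appends a row-level filter
// from the caller's security context. `serviceAreas` and
// `StopNetworkUsage.serviceAreaUuid` are hypothetical names — adapt them
// to your actual auth claims and cube dimensions.
// In cube.js config this would be exported as: module.exports = { queryRewrite }
const queryRewrite = (query, { securityContext }) => {
  if (securityContext && securityContext.serviceAreas) {
    query.filters = query.filters || [];
    query.filters.push({
      member: "StopNetworkUsage.serviceAreaUuid",
      operator: "equals",
      values: securityContext.serviceAreas,
    });
  }
  return query;
};
```

This keeps defense in depth: even a misconfigured dashboard query can only see the service areas its token allows.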
Development Workflow
- dbt development: Local development with sample data
- Cube.js iteration: Schema development with dev APIs
- UI prototyping: Component development with mock data
- Integration testing: End-to-end testing across layers
- Deployment: GitOps with environment promotion
Lessons Learned
🎯 Start Simple
We initially over-engineered with complex incremental models and scheduling. Views solved our freshness problem elegantly.
🎯 Semantic Layer is Worth It
Cube.js transformed our development velocity. No more custom SQL endpoints or manual optimization.
🎯 Multi-tenancy From Day One
Planning for multiple tenants upfront saved massive refactoring later.
🎯 Choose Your Materializations Wisely
Not everything needs to be a table. Views, incremental, and tables each have their place.
What's Next?
Our stack continues to evolve:
- Real-time streaming: Moving from batch to stream processing
- Advanced analytics: ML models integrated into the pipeline
- Self-service: Empowering business users to create their own metrics
- Global scale: Multi-region deployment strategies
The Bottom Line
Building a modern analytics stack isn't just about picking the right tools—it's about creating a coherent data experience that serves everyone from data engineers to business stakeholders.
Our UI → Cube.js → dbt → Athena stack delivers:
- Fast insights for decision makers
- Productive development for engineers
- Scalable architecture for growing data needs
- Maintainable codebase for long-term success
The secret sauce isn't any single technology—it's how they work together to create something greater than the sum of their parts.
What's your experience with modern analytics stacks? Have you found different combinations that work well for your use cases? I'd love to hear about your architecture decisions and trade-offs.