DEV Community

Jose Torreblanca
Jose Torreblanca

Posted on

Building a Modern Analytics Stack: UI Cube.js dbt Athena Integration

How we built a seamless data pipeline from AWS Athena to React dashboards using dbt and Cube.js

The Challenge: From Raw Events to Dashboard Insights

Picture this: You have millions of ridepooling trip events flowing into AWS Athena, and you need to serve them up as beautiful, interactive dashboards to your stakeholders. The data needs to be:

  • Fast to query (sub-second response times)
  • Always fresh (reflecting recent events)
  • Flexible (various time ranges and filters)
  • Scalable (handling growing data volumes)

This is the story of how we built a modern analytics stack that connects all these pieces seamlessly.

The Architecture: Four Layers, One Goal

Our stack consists of four key layers, each solving a specific problem:

UI Layer (React) 
    ↕️ 
Semantic Layer (Cube.js)
    ↕️
Transformation Layer (dbt)
    ↕️
Storage Layer (AWS Athena)
Enter fullscreen mode Exit fullscreen mode

Let's walk through each layer and see how they work together.

Layer 1: AWS Athena - The Foundation

At the bottom, we have AWS Athena storing our raw event data in S3 with a multitenant architecture:

-- Raw trip events table
ridepooling__moia__service__trip_entities.ent__trip
ridepooling__tenant_1__service__trip_entities.ent__trip
ridepooling__tenant_2__service__trip_entities.ent__trip
Enter fullscreen mode Exit fullscreen mode

Each tenant gets isolated data storage, but the schema remains consistent. Events flow in continuously from our ridepooling services.

Athena's role:

  • Serverless SQL engine over S3 data lakes
  • Cost-effective for large datasets
  • Handles complex queries with good performance
  • Integrates natively with AWS ecosystem

Layer 2: dbt - The Transformation Engine

dbt sits on top of Athena, transforming raw events into analytics-ready models:

-- stop_network_usage.sql
{{ config(materialized="view") }}

select
    stops.stop_id,
    stops.service_area_uuid,
    stops.name_en as stop_name_en,
    stops.name_de as stop_name_de,
    date(trip_usage.occurred_at) as trip_date,
    count(distinct trip_usage.trip_id) as daily_trip_count
from
    {{ multitenancy_source("ridepooling", "service__service_area_entities", "att__stop__details_latest") }} stops
left join (
    select link__stop_id__pickup as stop_id, trip_id, occurred_at
    from {{ multitenancy_source("ridepooling", "service__trip_entities", "ent__trip") }}
    where link__stop_id__pickup is not null
    union
    select link__stop_id__delivery as stop_id, trip_id, occurred_at  
    from {{ multitenancy_source("ridepooling", "service__trip_entities", "ent__trip") }}
    where link__stop_id__delivery is not null
) trip_usage on stops.stop_id = trip_usage.stop_id
group by stops.stop_id, stops.service_area_uuid, stops.name_en, stops.name_de, date(trip_usage.occurred_at)
Enter fullscreen mode Exit fullscreen mode

Key dbt features we leverage:

  • Multitenancy macros: Dynamic table routing per tenant
  • Jinja templating: Conditional logic and variables
  • View materialization: Always fresh data without staleness
  • Modular design: Reusable transformations across tenants

The multitenant magic:

-- This macro resolves to tenant-specific tables at runtime
{{ multitenancy_source("ridepooling", "service__trip_entities", "ent__trip") }}
-- Becomes: ridepooling__moia__service__trip_entities.ent__trip
Enter fullscreen mode Exit fullscreen mode

Layer 3: Cube.js - The Semantic Layer

Cube.js acts as our semantic layer, defining business metrics and providing a consistent API:

// stop-network-usage.js
cube(`StopNetworkUsage`, {
  sql_table: `ridepooling__${COMPILE_CONTEXT.tenant_name}__service__stop_network_data_products__${COMPILE_CONTEXT.env}.stop_network_usage`,

  dimensions: {
    stopId: {
      sql: `stop_id`,
      type: `string`,
      primary_key: true
    },

    stopNameEn: {
      sql: `stop_name_en`,
      type: `string`
    },

    tripDate: {
      sql: `trip_date`,
      type: `time`
    }
  },

  measures: {
    totalTrips: {
      sql: `daily_trip_count`,
      type: `sum`
    },

    avgDailyTrips: {
      sql: `daily_trip_count`,
      type: `avg`
    }
  }
});
Enter fullscreen mode Exit fullscreen mode

Cube.js superpowers:

  • Semantic definitions: Business users think "total trips," not SQL aggregations
  • Automatic SQL generation: Converts REST/GraphQL requests to optimized SQL
  • Caching layer: Sub-second response times for repeated queries
  • Multi-tenant aware: Same schema definition works across all tenants
  • Time dimension intelligence: Handles date ranges, comparisons, and time series

Layer 4: React UI - The Experience

Finally, our React dashboard consumes Cube.js APIs to create beautiful visualizations:

// StopUsageDashboard.jsx
import { useCubeQuery } from '@cubejs-client/react';

const StopUsageDashboard = () => {
  const { resultSet, isLoading } = useCubeQuery({
    measures: ['StopNetworkUsage.totalTrips'],
    timeDimensions: [{
      dimension: 'StopNetworkUsage.tripDate',
      granularity: 'day',
      dateRange: 'last 30 days'
    }],
    dimensions: ['StopNetworkUsage.stopNameEn']
  });

  if (isLoading) return <LoadingSpinner />;

  return (
    <div>
      <h2>Stop Usage Over Time</h2>
      <LineChart 
        data={resultSet.chartPivot()} 
        xAxis="time"
        yAxis="StopNetworkUsage.totalTrips"
      />
    </div>
  );
};
Enter fullscreen mode Exit fullscreen mode

UI benefits:

  • Declarative queries: Describe what you want, not how to get it
  • Automatic optimization: Cube.js handles SQL generation and caching
  • Real-time updates: Data stays fresh through view materialization
  • Rich interactions: Time range pickers, filters, drill-downs work out of the box

The Data Flow: From Event to Insight

Here's how a single trip event becomes a dashboard insight:

  1. Event arrives: Trip booked with pickup stop A and delivery stop B
  2. Athena stores: Raw event in tenant-specific S3 partition
  3. dbt transforms: View aggregates daily usage per stop, handling both pickup and delivery
  4. Cube.js semantics: Converts to business metrics like "total trips per stop"
  5. UI displays: Beautiful chart showing stop usage trends

The magic is that steps 3-5 happen in real-time during user queries, ensuring fresh data.

Key Design Decisions & Trade-offs

✅ Why Views Over Tables?

  • Always fresh data vs. potential performance hit
  • Simplicity vs. automation complexity
  • Perfect for our use case: Daily monitoring with reasonable data volumes

✅ Why Cube.js Over Direct SQL?

  • Semantic consistency across different front-ends
  • Automatic optimization and caching
  • Developer productivity - no more manual SQL writing
  • Multi-tenant support built-in

✅ Why dbt Over Raw Athena?

  • Version control for transformations
  • Testing and documentation built-in
  • Modularity and reusability across tenants
  • Software engineering practices for data

✅ Why This Stack?

  • Serverless: Scales automatically, pay-per-use
  • Modern: Embraces current best practices
  • Flexible: Each layer can be swapped independently
  • Developer-friendly: Code-first approach throughout

Performance Characteristics

Our stack delivers impressive performance:

  • Query latency: 200-800ms for typical dashboard queries
  • Data freshness: Real-time (views query source directly)
  • Scalability: Handles millions of events, hundreds of concurrent users
  • Cost efficiency: Serverless architecture scales to zero

Performance tips:

  • Cube.js pre-aggregations for frequently accessed metrics
  • Athena partition pruning with date filters
  • Strategic use of dbt incremental models for large fact tables
  • Smart caching strategies in the UI layer

Operational Considerations

Monitoring & Observability

  • Cube.js: Built-in query performance metrics
  • dbt: Model run logs and test results
  • Athena: CloudWatch metrics for query performance
  • UI: User experience monitoring and error tracking

Multi-tenant Isolation

  • Data isolation: Separate databases per tenant
  • Schema consistency: Same models work across tenants
  • Runtime flexibility: Switch tenants via variables
  • Security: Row-level security enforced at storage layer

Development Workflow

  1. dbt development: Local development with sample data
  2. Cube.js iteration: Schema development with dev APIs
  3. UI prototyping: Component development with mock data
  4. Integration testing: End-to-end testing across layers
  5. Deployment: GitOps with environment promotion

Lessons Learned

🎯 Start Simple

We initially over-engineered with complex incremental models and scheduling. Views solved our freshness problem elegantly.

🎯 Semantic Layer is Worth It

Cube.js transformed our development velocity. No more custom SQL endpoints or manual optimization.

🎯 Multi-tenancy From Day One

Planning for multiple tenants upfront saved massive refactoring later.

🎯 Choose Your Materializations Wisely

Not everything needs to be a table. Views, incremental, and tables each have their place.

What's Next?

Our stack continues to evolve:

  • Real-time streaming: Moving from batch to stream processing
  • Advanced analytics: ML models integrated into the pipeline
  • Self-service: Empowering business users to create their own metrics
  • Global scale: Multi-region deployment strategies

The Bottom Line

Building a modern analytics stack isn't just about picking the right tools—it's about creating a coherent data experience that serves everyone from data engineers to business stakeholders.

Our UI → Cube.js → dbt → Athena stack delivers:

  • Fast insights for decision makers
  • Productive development for engineers
  • Scalable architecture for growing data needs
  • Maintainable codebase for long-term success

The secret sauce isn't any single technology—it's how they work together to create something greater than the sum of their parts.


What's your experience with modern analytics stacks? Have you found different combinations that work well for your use cases? I'd love to hear about your architecture decisions and trade-offs.

Top comments (0)