Building a Lightweight Camunda Monitoring Dashboard: From Enterprise Pain to Open Source Solution

Dmitriy Champa

"You can't improve what you can't measure." - Peter Drucker

πŸŽ₯ Video Demo

Want to see it in action before installing? Watch the feature demonstration on YouTube:

▢️ Watch Demo: Champa Camunda 7 Health Monitor

Note: The video shows our enterprise version, but the open-source version includes about 70% of the same features. Core monitoring capabilities are identical - the enterprise version adds multi-tenancy, advanced alerting, and SSO integration.

The Story

Last year, our production Camunda clusters started showing strange behavior. Process instances were stuck. Job executors were falling behind. Incidents were piling up. But our monitoring dashboards? They showed everything was "green" 🟒 - or just "no data to show."

😀 The Problem: Why Monitoring Camunda is Frustrating

I've been working with Camunda 7 for years, and monitoring has always been the painful part. Not because monitoring tools don't exist - quite the opposite. I've tried them all:

  • Datadog - Great for infrastructure, expensive, doesn't understand workflow engines
  • Grafana + Prometheus - Powerful but requires extensive configuration for Camunda-specific metrics
  • Promtail + Loki - Built custom log parsing pipelines, spent more time maintaining them than using them
  • ELK Stack - Overkill for operational monitoring

Each solution had the same problems:

❌ The Pain Points

1. Too Heavy

Every enterprise monitoring solution feels like bringing a sledgehammer to hang a picture frame. I just want to know if my Camunda cluster is healthy - I don't need distributed tracing across 47 microservices.

2. Multiple Tools, Multiple Headaches

  • Configure Prometheus exporters
  • Maintain Grafana dashboards
  • Set up log shippers
  • Keep everything in sync
  • Debug why Node 3's metrics stopped flowing

3. Generic Metrics β‰  Operational Insights

Sure, I can see that:

  • βœ… CPU usage is at 45%
  • βœ… Memory usage looks fine
  • βœ… Database connections are healthy
  • βœ… HTTP response times are normal

But what I actually need to know at 3 AM when getting paged:

  • ❓ Why are 1,200 process instances stuck for 7+ days?
  • ❓ Which node is rejecting 80% of job acquisitions?
  • ❓ Are message correlations timing out?
  • ❓ Is job executor throughput dropping node by node?
  • ❓ Which process definitions are consuming the most resources?

4. Configuration Hell

Want to monitor job executor metrics per node? Here's your 200-line Prometheus query. Want to track stuck instances? Write a custom exporter. Want it to look presentable? Build 15 Grafana panels processing data from several sources.

πŸ’‘ The Realization

After years of fighting with enterprise monitoring stacks, I realized: I don't need a monitoring platform. I need a Camunda health dashboard.

One tool. One configuration file. One dashboard. That's it.

So I built it.

πŸ—οΈ Building a Better Way

The requirements were simple:

Must Have:

  1. Real-time visibility into ALL Camunda nodes in the cluster
  2. Process execution metrics (instances, tasks, jobs, incidents)
  3. Per-node health and workload distribution
  4. JVM metrics integrated with Camunda context
  5. Database performance insights
  6. Alerting capabilities

Technical Constraints:

  • Lightweight (can't add another heavyweight monitoring stack)
  • Read-only access (security requirement)
  • No Camunda code changes (we don't control deployments)
  • Works with PostgreSQL (our DB of choice)

πŸ› οΈ The Technical Stack

After evaluating options, I chose simplicity over sophistication:

Backend:

  • Flask - Lightweight Python web framework (no blueprints, single file to start)
  • psycopg2 - Direct PostgreSQL access for database metrics
  • requests - Camunda REST API integration

Frontend:

  • Alpine.js - Minimal reactive framework (15KB!)
  • Tailwind CSS - Utility-first styling (CDN version, no build step)
  • Chart.js - Clean visualizations
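
To make the wiring concrete, here's a minimal sketch of how the three backend pieces fit together. The route name, URLs, and credentials are illustrative, not the project's actual code:

# Minimal sketch of the stack wiring (illustrative names, not the shipped code)
import os

import psycopg2
import requests
from flask import Flask, jsonify

app = Flask(__name__)
CAMUNDA = os.environ.get("CAMUNDA_BASE_URL", "http://localhost:8080/engine-rest")
DB_DSN = os.environ.get("DATABASE_URL", "postgresql://monitor:secret@localhost/camunda")

@app.route("/api/health")
def health():
    # requests: Camunda's /version endpoint doubles as a liveness check
    engine = requests.get(f"{CAMUNDA}/version", timeout=5).json()

    # psycopg2: read-only SQL for metrics the REST API doesn't expose
    with psycopg2.connect(DB_DSN) as conn, conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM act_ru_incident")
        open_incidents = cur.fetchone()[0]

    return jsonify(version=engine["version"], open_incidents=open_incidents)

if __name__ == "__main__":
    app.run(port=5000)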

πŸ€” "Wait, This Looks Too Simple..."

You're right. This architecture is intentionally simplified for the open-source version. Here's why:

What's NOT here (that would be in a commercial product):

  • ❌ Flask blueprints for modular code organization
  • ❌ Tailwind build pipeline with purging and optimization
  • ❌ Asset bundling and minification
  • ❌ Comprehensive test suite with 80%+ coverage
  • ❌ Database migrations framework
  • ❌ Advanced caching strategies
  • ❌ Multi-tenancy support

Why This Simplified Approach?

  1. Lower barrier to contribution - Any Python developer can read and modify the code
  2. Faster deployment - No build steps, just run it
  3. Easier debugging - Single-file architecture means fewer places for bugs to hide
  4. Good enough for most use cases - Monitoring 3-10 Camunda nodes? This works great.

πŸ“Š The Trade-offs

Aspect          | Simplified (Current)  | Production-Grade
----------------|-----------------------|-----------------------------
Setup time      | 5 minutes             | 30+ minutes
Code complexity | Low                   | Medium-High
Maintainability | Good for small teams  | Better for large teams
Performance     | Great for <10 nodes   | Optimized for 50+ nodes
Customization   | Easy to hack          | Structured extension points

The Philosophy:

This is a monitoring tool, not a SaaS platform. It should be:

  • βœ… Deployable in minutes - Not hours of setup
  • βœ… Understandable by any developer - Not requiring framework expertise
  • βœ… Hackable - Easy to add your custom metrics
  • βœ… Lightweight - Runs on a t2.micro if needed

πŸ’‘ "Can I Use This in Production?"

Short answer: Yes, with considerations.

Long answer:

We've been running a more sophisticated version internally for over a year. This open-source version strips away our company-specific customizations but retains the core monitoring capabilities.

For production use:

  • βœ… Small-to-medium deployments (1-10 nodes): Use as-is
  • ⚠️ Large deployments (10+ nodes): Consider adding caching and load balancing (see the caching sketch after this list)
  • ⚠️ High-security environments: Review the security section, add authentication layer
  • ⚠️ Mission-critical monitoring: Run redundantly, export to Prometheus for historical data
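
For the 10+ node case, even a tiny in-process TTL cache cuts redundant REST and SQL round-trips. A minimal sketch - the decorator and fetch_cluster_metrics are illustrative, not part of the shipped code:

import time
from functools import wraps

def ttl_cache(seconds=10):
    """Cache a zero-argument fetch function for a few seconds."""
    def decorator(fn):
        state = {"value": None, "expires": 0.0}

        @wraps(fn)
        def wrapper():
            now = time.monotonic()
            if now >= state["expires"]:
                state["value"] = fn()  # refresh on expiry
                state["expires"] = now + seconds
            return state["value"]
        return wrapper
    return decorator

@ttl_cache(seconds=10)
def fetch_cluster_metrics():
    ...  # the expensive REST + SQL aggregation goes here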

Why These Choices Matter:

  • Total footprint < 500KB (excluding Python dependencies)
  • Startup time < 5 seconds
  • Memory footprint < 100MB
  • No build step - deploy and run

Compare this to a typical Grafana + Prometheus + Exporters setup:

  • Installation: 1-2 hours
  • Configuration: 2-4 hours
  • Memory: 500MB - 2GB
  • Maintenance: Ongoing

For Camunda-specific monitoring, the lightweight approach wins.

πŸ“Š What We Monitor

1. Cluster Overview

Total Nodes: 3
Running Nodes: 3 βœ…
Engine Version: 7.21.0
Response Time: 45ms avg

2. Process Execution Metrics

  • Active process instances
  • User tasks awaiting action
  • External tasks in queue
  • Open incidents (by type)
  • Stuck instances (configurable threshold)
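
Most of these counts come straight from Camunda's standard REST /count endpoints (stuck instances additionally need the history API or SQL). A minimal sketch, with an illustrative base URL:

import requests

BASE = "http://localhost:8080/engine-rest"  # illustrative

def count(resource, **params):
    # Every Camunda 7 /count endpoint returns {"count": <int>}
    r = requests.get(f"{BASE}/{resource}/count", params=params, timeout=5)
    r.raise_for_status()
    return r.json()["count"]

snapshot = {
    "active_instances": count("process-instance"),
    "open_user_tasks": count("task"),
    "external_tasks": count("external-task"),
    "open_incidents": count("incident"),
}
print(snapshot)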

3. Job Executor Health

This is where it gets interesting. Per-node metrics:

  • Job acquisition rate (success/rejection ratio)
  • Job execution throughput (jobs/minute)
  • Failed jobs (no retries remaining)
  • Executable jobs in queue

Why This Matters: If Node 2 is rejecting 80% of job acquisitions while Nodes 1 and 3 are fine, you've found your problem.
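
Camunda writes these metrics to ACT_RU_METER_LOG and tags every row with REPORTER_, the ID of the node that reported it - that's what makes the per-node breakdown possible. A minimal read-only sketch (the connection string and 15-minute window are illustrative):

import psycopg2

DSN = "postgresql://monitor:secret@localhost/camunda"  # illustrative

SQL = """
SELECT reporter_, name_, SUM(value_)
FROM act_ru_meter_log
WHERE name_ IN ('job-acquisition-attempt', 'job-acquired-success',
                'job-execution-rejected', 'job-failed')
  AND timestamp_ > NOW() - INTERVAL '15 minutes'
GROUP BY reporter_, name_
ORDER BY reporter_, name_;
"""

with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    cur.execute(SQL)
    for reporter, name, value in cur.fetchall():
        print(f"{reporter:>12}  {name:<28} {value}")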

4. JVM Deep Dive

Integration with Prometheus JMX Exporter gives us:

  • Heap utilization per node
  • GC pause times
  • Thread counts (daemon vs non-daemon)
  • CPU load per node
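
Each node's JMX exporter serves plain Prometheus text, so scraping it is straightforward. A sketch using the prometheus_client parser (hosts and ports are illustrative; jvm_memory_bytes_used is a standard metric exposed by the JMX javaagent):

import requests
from prometheus_client.parser import text_string_to_metric_families

# Illustrative hosts/ports; each Camunda node runs its own exporter
EXPORTERS = {
    "node1": "http://node1:9404/metrics",
    "node2": "http://node2:9404/metrics",
}

for node, url in EXPORTERS.items():
    text = requests.get(url, timeout=5).text
    for family in text_string_to_metric_families(text):
        if family.name == "jvm_memory_bytes_used":
            for sample in family.samples:
                if sample.labels.get("area") == "heap":
                    print(f"{node}: heap used = {sample.value / 1024 ** 2:.0f} MiB")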

5. Database Analytics

Direct PostgreSQL queries reveal:

  • Storage usage by Camunda table
  • Slow query identification
  • Archivable instances (completed processes)
  • Connection pool utilization
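
For example, storage usage per table is one read-only query against the PostgreSQL catalog. A sketch (connection string illustrative; Camunda's table names fold to lowercase act_* in PostgreSQL):

import psycopg2

DSN = "postgresql://monitor:secret@localhost/camunda"  # illustrative

SQL = """
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
WHERE relname LIKE 'act\\_%'
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;
"""

with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    cur.execute(SQL)
    for table, size in cur.fetchall():
        print(f"{table:<30} {size}")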

🎯 The Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Web Browser   β”‚
β”‚   (Dashboard)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Flask Backend  β”‚
β”‚                 β”‚
β”‚  β€’ REST APIs    β”‚
β”‚  β€’ Metrics Agg  β”‚
β”‚  β€’ Prometheus   β”‚
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
     β”‚     β”‚
     β”‚     └─────────────┐
     β–Ό                   β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Camunda    β”‚   β”‚  PostgreSQL  β”‚
β”‚  REST API   β”‚   β”‚   Database   β”‚
β”‚             β”‚   β”‚              β”‚
β”‚  Node 1     β”‚   β”‚ β€’ ACT_* tbl  β”‚
β”‚  Node 2     β”‚   β”‚ β€’ Metrics    β”‚
β”‚  Node 3     β”‚   β”‚ β€’ Analytics  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  JMX Exporter   β”‚
β”‚  (per node)     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ’‘ Key Design Decisions

1. Lazy Loading

Dashboard loads in < 1 second. Metric cards load on-demand as you scroll.

<!-- Alpine.js lazy loading (x-intersect comes from the official @alpinejs/intersect plugin) -->
<div x-data="{ loaded: false }" 
     x-intersect="loaded = true">
  <template x-if="loaded">
    <!-- Load expensive metrics -->
  </template>
</div>

2. No Browser Storage

Critical learning: localStorage/sessionStorage aren't available in all deployment contexts. Everything stays in memory or calls the backend.

3. Read-Only Database Access

We query Camunda's tables directly but ONLY with SELECT statements. The DB user should have zero write permissions.

-- Example: find long-running (stuck) instances.
-- The runtime table ACT_RU_EXECUTION has no timestamps,
-- so we query the history table instead.
SELECT COUNT(*)
FROM ACT_HI_PROCINST
WHERE START_TIME_ < NOW() - INTERVAL '7 days'
  AND END_TIME_ IS NULL;

4. Prometheus Integration

Export metrics in Prometheus format for Grafana dashboards:

@app.route('/metrics')
def prometheus_metrics():
    metrics = [
        f'camunda_active_instances {get_active_count()}',
        f'camunda_open_incidents {get_incident_count()}',
    ]
    # get_node_statuses() is a helper returning {node_name: 0 or 1}
    for node, status in get_node_statuses().items():
        metrics.append(f'camunda_node_status{{node="{node}"}} {status}')
    return '\n'.join(metrics), 200, {'Content-Type': 'text/plain'}

πŸš€ From Internal to Open Source

After running this successfully in production - and building an enterprise-level version on top of it - I realized other Camunda users face the same challenges. So I created a lightweight, open-source version.

What Changed for Open Source:

  • βœ… Removed proprietary business logic
  • βœ… Added comprehensive documentation
  • βœ… Docker support + docker-compose
  • βœ… Configuration via environment variables (see the sketch after this list)
  • βœ… Dark mode (because developers love dark mode πŸŒ™ - and so do my eyes)
  • βœ… Responsive design for mobile monitoring
  • βœ… Removed extra security features and user management
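
If you're curious how the environment-variable configuration looks from the Python side, here's the general shape. The variable names are illustrative - check .env.example in the repo for the real ones:

import os

# Illustrative variable names, not necessarily the shipped ones
CAMUNDA_NODES = [u.strip() for u in os.environ.get("CAMUNDA_NODES", "").split(",") if u.strip()]
DATABASE_URL = os.environ.get("DATABASE_URL", "")
REFRESH_SECONDS = int(os.environ.get("REFRESH_SECONDS", "30"))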

What Stayed the Same:

  • βœ… Core monitoring capabilities
  • βœ… Lightweight architecture
  • βœ… Production-ready code

πŸ“¦ Try It Yourself

Quick start in under 5 minutes:

git clone https://github.com/bibacrm/camunda-health-monitor.git
cd camunda-health-monitor

# Configure your Camunda nodes
cp .env.example .env
nano .env  # Add your Camunda URLs

# Install and run
pip install -r requirements.txt
python app.py

Visit http://localhost:5000 and you're monitoring!

Docker enthusiasts?

docker-compose up -d

GitHub: bibacrm / camunda-health-monitor - Lightweight monitoring dashboard for Camunda 7 clusters with real-time metrics, JVM health tracking, and database analytics. Built with Flask, Alpine.js, and Tailwind CSS.

πŸŽ“ Lessons Learned

1. Specific > Generic

Domain-specific monitoring beats generic APM for operational insights.

2. Lightweight Wins

You don't need React + Redux + TypeScript for a monitoring dashboard. Alpine.js + Tailwind is often enough.

3. Direct Database Access is Powerful

When used responsibly (read-only!), querying Camunda's tables directly gives you insights no API can provide. In production and staging environments, the right approach is to point the dashboard at a read-only replica of the Camunda database.

4. Start Internal, Go Open Source

Building for your own needs first ensures you solve real problems. Open-sourcing second ensures quality.

5. Simplify the Architecture

Lower barriers to contribution = more community engagement. Flask blueprints and build pipelines can come later.

πŸ’¬ The "Enterprise vs. Open Source" Balance

I wrestled with how much to include in the open-source version. Too simple = not useful. Too complex = hard to contribute to.

What I landed on:

  • Core monitoring features = 100% open source
  • Advanced features (multi-tenancy, SSO, custom alerting pipelines) = Enterprise version

This lets the community benefit from solid monitoring while leaving room for commercial sustainability.

🌟 Check It Out

GitHub: https://github.com/bibacrm/camunda-health-monitor

If you're running Camunda 7 in production, give it a try! I'd love to hear:

  • What works well?
  • What's missing?
  • What would make this indispensable for your team?

⭐ Star the repo if you find it useful!

πŸ’¬ Drop a comment with your Camunda monitoring challenges

πŸ”§ Contribute if you'd like to add features


πŸ’¬ Discussion Questions

What monitoring challenges do you face with Camunda or other workflow engines?

Have you found a monitoring setup that actually works? What's your current pain point?

Drop a comment below - I read and respond to every one!

I'm also thinking about open-sourcing a web-based BPMN/DMN linter (built on bpmn.io). Would that be interesting to the community?


Building better monitoring tools, one metric at a time. Follow me for more deep-dives into DevOps, BPM, and open-source development.

Top comments (1)

Dmitriy Champa

πŸ‘‹ Author here!

Really curious to hear from the community:

  1. What monitoring tools are you currently using for workflow engines?
  2. What's your biggest pain point with monitoring Camunda (or similar BPM platforms)?
  3. What metrics would make this tool indispensable for your team?

I'll be hanging around to answer questions and discuss ideas!