Building a Lightweight Camunda Monitoring Dashboard: From Enterprise Pain to Open Source Solution

Dmitriy Champa

"You can't improve what you can't measure." - Peter Drucker

πŸŽ₯ Video Demo

Want to see it in action before installing? Watch the feature demonstration on YouTube:

▢️ Watch Demo: Champa Camunda 7 Health Monitor

Note: The video shows our enterprise version, but the open-source version includes about 70% of the same features. Core monitoring capabilities are identical - the enterprise version adds multi-tenancy, advanced alerting, and SSO integration.

The Story

Last year, our production Camunda clusters started showing strange behavior. Process instances were stuck. Job executors were falling behind. Incidents were piling up. But our monitoring dashboards? They showed everything was "green" 🟒 - or just "no data to show."

😀 The Problem: Why Monitoring Camunda is Frustrating

I've been working with Camunda 7 for years, and monitoring has always been the painful part. Not because monitoring tools don't exist - quite the opposite. I've tried them all:

  • Datadog - Great for infrastructure, expensive, doesn't understand workflow engines
  • Grafana + Prometheus - Powerful but requires extensive configuration for Camunda-specific metrics
  • Promtail + Loki - Built custom log parsing pipelines, spent more time maintaining them than using them
  • ELK Stack - Overkill for operational monitoring

Each solution had the same problems:

❌ The Pain Points

1. Too Heavy

Every enterprise monitoring solution feels like bringing a sledgehammer to hang a picture frame. I just want to know if my Camunda cluster is healthy - I don't need distributed tracing across 47 microservices.

2. Multiple Tools, Multiple Headaches

  • Configure Prometheus exporters
  • Maintain Grafana dashboards
  • Set up log shippers
  • Keep everything in sync
  • Debug why Node 3's metrics stopped flowing

3. Generic Metrics β‰  Operational Insights

Sure, I can see that:

  • βœ… CPU usage is at 45%
  • βœ… Memory usage looks fine
  • βœ… Database connections are healthy
  • βœ… HTTP response times are normal

But what I actually need to know at 3 AM when getting paged:

  • ❓ Why are 1,200 process instances stuck for 7+ days?
  • ❓ Which node is rejecting 80% of job acquisitions?
  • ❓ Are message correlations timing out?
  • ❓ Is job executor throughput dropping node by node?
  • ❓ Which process definitions are consuming the most resources?

4. Configuration Hell

Want to monitor job executor metrics per node? Here's your 200-line Prometheus query. Want to track stuck instances? Write a custom exporter. Want it to look presentable? Build 15 Grafana panels processing data from several sources.

πŸ’‘ The Realization

After years of fighting with enterprise monitoring stacks, I realized: I don't need a monitoring platform. I need a Camunda health dashboard.

One tool. One configuration file. One dashboard. That's it.

So I built it.

πŸ—οΈ Building a Better Way

The requirements were simple:

Must Have:

  1. Real-time visibility into ALL Camunda nodes in the cluster
  2. Process execution metrics (instances, tasks, jobs, incidents)
  3. Per-node health and workload distribution
  4. JVM metrics integrated with Camunda context
  5. Database performance insights
  6. Alerting capabilities

Technical Constraints:

  • Lightweight (can't add another heavyweight monitoring stack)
  • Read-only access (security requirement)
  • No Camunda code changes (we don't control deployments)
  • Works with PostgreSQL (our DB of choice)

πŸ› οΈ The Technical Stack

After evaluating options, I chose simplicity over sophistication:

Backend:

  • Flask - Lightweight Python web framework (no blueprints, single file to start)
  • psycopg2 - Direct PostgreSQL access for database metrics
  • requests - Camunda REST API integration

Frontend:

  • Alpine.js - Minimal reactive framework (15KB!)
  • Tailwind CSS - Utility-first styling (CDN version, no build step)
  • Chart.js - Clean visualizations
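
To make the wiring concrete, here's a minimal sketch of how the three backend pieces fit together. The route name, URLs, and credentials are illustrative, not the project's actual code:

# Minimal sketch of the stack wiring (illustrative names, not the shipped code)
import os

import psycopg2
import requests
from flask import Flask, jsonify

app = Flask(__name__)
CAMUNDA = os.environ.get("CAMUNDA_BASE_URL", "http://localhost:8080/engine-rest")
DB_DSN = os.environ.get("DATABASE_URL", "postgresql://monitor:secret@localhost/camunda")

@app.route("/api/health")
def health():
    # requests: Camunda's /version endpoint doubles as a liveness check
    engine = requests.get(f"{CAMUNDA}/version", timeout=5).json()

    # psycopg2: read-only SQL for metrics the REST API doesn't expose
    with psycopg2.connect(DB_DSN) as conn, conn.cursor() as cur:
        cur.execute("SELECT COUNT(*) FROM act_ru_incident")
        open_incidents = cur.fetchone()[0]

    return jsonify(version=engine["version"], open_incidents=open_incidents)

if __name__ == "__main__":
    app.run(port=5000)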

πŸ€” "Wait, This Looks Too Simple..."

You're right. This architecture is intentionally simplified for the open-source version. Here's why:

What's NOT here (that would be in a commercial product):

  • ❌ Flask blueprints for modular code organization
  • ❌ Tailwind build pipeline with purging and optimization
  • ❌ Asset bundling and minification
  • ❌ Comprehensive test suite with 80%+ coverage
  • ❌ Database migrations framework
  • ❌ Advanced caching strategies
  • ❌ Multi-tenancy support

Why This Simplified Approach?

  1. Lower barrier to contribution - Any Python developer can read and modify the code
  2. Faster deployment - No build steps, just run it
  3. Easier debugging - Single-file architecture means fewer places for bugs to hide
  4. Good enough for most use cases - Monitoring 3-10 Camunda nodes? This works great.

πŸ“Š The Trade-offs

Aspect          | Simplified (Current)  | Production-Grade
----------------|-----------------------|-----------------------------
Setup time      | 5 minutes             | 30+ minutes
Code complexity | Low                   | Medium-High
Maintainability | Good for small teams  | Better for large teams
Performance     | Great for <10 nodes   | Optimized for 50+ nodes
Customization   | Easy to hack          | Structured extension points

The Philosophy:

This is a monitoring tool, not a SaaS platform. It should be:

  • βœ… Deployable in minutes - Not hours of setup
  • βœ… Understandable by any developer - Not requiring framework expertise
  • βœ… Hackable - Easy to add your custom metrics
  • βœ… Lightweight - Runs on a t2.micro if needed

πŸ’‘ "Can I Use This in Production?"

Short answer: Yes, with considerations.

Long answer:

We've been running a more sophisticated version internally for over a year. This open-source version strips away our company-specific customizations but retains the core monitoring capabilities.

For production use:

  • βœ… Small-to-medium deployments (1-10 nodes): Use as-is
  • ⚠️ Large deployments (10+ nodes): Consider adding caching and load balancing (see the caching sketch after this list)
  • ⚠️ High-security environments: Review the security section, add authentication layer
  • ⚠️ Mission-critical monitoring: Run redundantly, export to Prometheus for historical data
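
For the 10+ node case, even a tiny in-process TTL cache cuts redundant REST and SQL round-trips. A minimal sketch - the decorator and fetch_cluster_metrics are illustrative, not part of the shipped code:

import time
from functools import wraps

def ttl_cache(seconds=10):
    """Cache a zero-argument fetch function for a few seconds."""
    def decorator(fn):
        state = {"value": None, "expires": 0.0}

        @wraps(fn)
        def wrapper():
            now = time.monotonic()
            if now >= state["expires"]:
                state["value"] = fn()  # refresh on expiry
                state["expires"] = now + seconds
            return state["value"]
        return wrapper
    return decorator

@ttl_cache(seconds=10)
def fetch_cluster_metrics():
    ...  # the expensive REST + SQL aggregation goes here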

Why These Choices Matter:

  • Total footprint < 500KB (excluding Python dependencies)
  • Startup time < 5 seconds
  • Memory footprint < 100MB
  • No build step - deploy and run

Compare this to a typical Grafana + Prometheus + Exporters setup:

  • Installation: 1-2 hours
  • Configuration: 2-4 hours
  • Memory: 500MB - 2GB
  • Maintenance: Ongoing

For Camunda-specific monitoring, the lightweight approach wins.

πŸ“Š What We Monitor

1. Cluster Overview

Total Nodes: 3
Running Nodes: 3 βœ…
Engine Version: 7.21.0
Response Time: 45ms avg

2. Process Execution Metrics

  • Active process instances
  • User tasks awaiting action
  • External tasks in queue
  • Open incidents (by type)
  • Stuck instances (configurable threshold)
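
Most of these counts come straight from Camunda's standard REST /count endpoints (stuck instances additionally need the history API or SQL). A minimal sketch, with an illustrative base URL:

import requests

BASE = "http://localhost:8080/engine-rest"  # illustrative

def count(resource, **params):
    # Every Camunda 7 /count endpoint returns {"count": <int>}
    r = requests.get(f"{BASE}/{resource}/count", params=params, timeout=5)
    r.raise_for_status()
    return r.json()["count"]

snapshot = {
    "active_instances": count("process-instance"),
    "open_user_tasks": count("task"),
    "external_tasks": count("external-task"),
    "open_incidents": count("incident"),
}
print(snapshot)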

3. Job Executor Health

This is where it gets interesting. Per-node metrics:

  • Job acquisition rate (success/rejection ratio)
  • Job execution throughput (jobs/minute)
  • Failed jobs (no retries remaining)
  • Executable jobs in queue

Why This Matters: If Node 2 is rejecting 80% of job acquisitions while Nodes 1 and 3 are fine, you've found your problem.
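
Camunda writes these metrics to ACT_RU_METER_LOG and tags every row with REPORTER_, the ID of the node that reported it - that's what makes the per-node breakdown possible. A minimal read-only sketch (the connection string and 15-minute window are illustrative):

import psycopg2

DSN = "postgresql://monitor:secret@localhost/camunda"  # illustrative

SQL = """
SELECT reporter_, name_, SUM(value_)
FROM act_ru_meter_log
WHERE name_ IN ('job-acquisition-attempt', 'job-acquired-success',
                'job-execution-rejected', 'job-failed')
  AND timestamp_ > NOW() - INTERVAL '15 minutes'
GROUP BY reporter_, name_
ORDER BY reporter_, name_;
"""

with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    cur.execute(SQL)
    for reporter, name, value in cur.fetchall():
        print(f"{reporter:>12}  {name:<28} {value}")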

4. JVM Deep Dive

Integration with Prometheus JMX Exporter gives us:

  • Heap utilization per node
  • GC pause times
  • Thread counts (daemon vs non-daemon)
  • CPU load per node
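
Each node's JMX exporter serves plain Prometheus text, so scraping it is straightforward. A sketch using the prometheus_client parser (hosts and ports are illustrative; jvm_memory_bytes_used is a standard metric exposed by the JMX javaagent):

import requests
from prometheus_client.parser import text_string_to_metric_families

# Illustrative hosts/ports; each Camunda node runs its own exporter
EXPORTERS = {
    "node1": "http://node1:9404/metrics",
    "node2": "http://node2:9404/metrics",
}

for node, url in EXPORTERS.items():
    text = requests.get(url, timeout=5).text
    for family in text_string_to_metric_families(text):
        if family.name == "jvm_memory_bytes_used":
            for sample in family.samples:
                if sample.labels.get("area") == "heap":
                    print(f"{node}: heap used = {sample.value / 1024 ** 2:.0f} MiB")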

5. Database Analytics

Direct PostgreSQL queries reveal:

  • Storage usage by Camunda table
  • Slow query identification
  • Archivable instances (completed processes)
  • Connection pool utilization
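
For example, storage usage per table is one read-only query against the PostgreSQL catalog. A sketch (connection string illustrative; Camunda's table names fold to lowercase act_* in PostgreSQL):

import psycopg2

DSN = "postgresql://monitor:secret@localhost/camunda"  # illustrative

SQL = """
SELECT relname,
       pg_size_pretty(pg_total_relation_size(relid)) AS total_size
FROM pg_catalog.pg_statio_user_tables
WHERE relname LIKE 'act\\_%'
ORDER BY pg_total_relation_size(relid) DESC
LIMIT 10;
"""

with psycopg2.connect(DSN) as conn, conn.cursor() as cur:
    cur.execute(SQL)
    for table, size in cur.fetchall():
        print(f"{table:<30} {size}")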

🎯 The Architecture

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚   Web Browser   β”‚
β”‚   (Dashboard)   β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Flask Backend  β”‚
β”‚                 β”‚
β”‚  β€’ REST APIs    β”‚
β”‚  β€’ Metrics Agg  β”‚
β”‚  β€’ Prometheus   β”‚
β””β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”¬β”€β”€β”€β”€β”€β”€β”˜
     β”‚     β”‚
     β”‚     └─────────────┐
     β–Ό                   β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”   β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  Camunda    β”‚   β”‚  PostgreSQL  β”‚
β”‚  REST API   β”‚   β”‚   Database   β”‚
β”‚             β”‚   β”‚              β”‚
β”‚  Node 1     β”‚   β”‚ β€’ ACT_* tbl  β”‚
β”‚  Node 2     β”‚   β”‚ β€’ Metrics    β”‚
β”‚  Node 3     β”‚   β”‚ β€’ Analytics  β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜   β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
         β”‚
         β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚  JMX Exporter   β”‚
β”‚  (per node)     β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

πŸ’‘ Key Design Decisions

1. Lazy Loading

Dashboard loads in < 1 second. Metric cards load on-demand as you scroll.

<!-- Alpine.js lazy loading (x-intersect comes from the official @alpinejs/intersect plugin) -->
<div x-data="{ loaded: false }" 
     x-intersect="loaded = true">
  <template x-if="loaded">
    <!-- Load expensive metrics -->
  </template>
</div>

2. No Browser Storage

Critical learning: localStorage/sessionStorage aren't available in all deployment contexts. Everything stays in memory or calls the backend.

3. Read-Only Database Access

We query Camunda's tables directly but ONLY with SELECT statements. The DB user should have zero write permissions.

-- Example: find long-running (stuck) instances.
-- The runtime table ACT_RU_EXECUTION has no timestamps,
-- so we query the history table instead.
SELECT COUNT(*)
FROM ACT_HI_PROCINST
WHERE START_TIME_ < NOW() - INTERVAL '7 days'
  AND END_TIME_ IS NULL;

4. Prometheus Integration

Export metrics in Prometheus format for Grafana dashboards:

@app.route('/metrics')
def prometheus_metrics():
    metrics = [
        f'camunda_active_instances {get_active_count()}',
        f'camunda_open_incidents {get_incident_count()}',
    ]
    # get_node_statuses() is a helper returning {node_name: 0 or 1}
    for node, status in get_node_statuses().items():
        metrics.append(f'camunda_node_status{{node="{node}"}} {status}')
    return '\n'.join(metrics), 200, {'Content-Type': 'text/plain'}

πŸš€ From Internal to Open Source

After running this successfully in production - and building an enterprise-level version on top of it - I realized other Camunda users face the same challenges. So I created a lightweight, open-source version.

What Changed for Open Source:

  • βœ… Removed proprietary business logic
  • βœ… Added comprehensive documentation
  • βœ… Docker support + docker-compose
  • βœ… Configuration via environment variables (see the sketch after this list)
  • βœ… Dark mode (because developers love dark mode πŸŒ™ - and so do my eyes)
  • βœ… Responsive design for mobile monitoring
  • βœ… Removed extra security features and user management
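
If you're curious how the environment-variable configuration looks from the Python side, here's the general shape. The variable names are illustrative - check .env.example in the repo for the real ones:

import os

# Illustrative variable names, not necessarily the shipped ones
CAMUNDA_NODES = [u.strip() for u in os.environ.get("CAMUNDA_NODES", "").split(",") if u.strip()]
DATABASE_URL = os.environ.get("DATABASE_URL", "")
REFRESH_SECONDS = int(os.environ.get("REFRESH_SECONDS", "30"))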

What Stayed the Same:

  • βœ… Core monitoring capabilities
  • βœ… Lightweight architecture
  • βœ… Production-ready code

πŸ“¦ Try It Yourself

Quick start in under 5 minutes:

git clone https://github.com/bibacrm/camunda-health-monitor.git
cd camunda-health-monitor

# Configure your Camunda nodes
cp .env.example .env
nano .env  # Add your Camunda URLs

# Install and run
pip install -r requirements.txt
python app.py

Visit http://localhost:5000 and you're monitoring!

Docker enthusiasts?

docker-compose up -d

GitHub: bibacrm / camunda-health-monitor - Lightweight monitoring dashboard for Camunda 7 clusters with real-time metrics, JVM health tracking, and database analytics. Built with Flask, Alpine.js, and Tailwind CSS.

πŸŽ“ Lessons Learned

1. Specific > Generic

Domain-specific monitoring beats generic APM for operational insights.

2. Lightweight Wins

You don't need React + Redux + TypeScript for a monitoring dashboard. Alpine.js + Tailwind is often enough.

3. Direct Database Access is Powerful

When used responsibly (read-only!), querying Camunda's tables directly gives you insights no API can provide. In production and staging environments, the right approach is to point the dashboard at a read-only replica of the Camunda database.

4. Start Internal, Go Open Source

Building for your own needs first ensures you solve real problems. Open-sourcing second ensures quality.

5. Simplify the Architecture

Lower barriers to contribution = more community engagement. Flask blueprints and build pipelines can come later.

πŸ’¬ The "Enterprise vs. Open Source" Balance

I wrestled with how much to include in the open-source version. Too simple = not useful. Too complex = hard to contribute to.

What I landed on:

  • Core monitoring features = 100% open source
  • Advanced features (multi-tenancy, SSO, custom alerting pipelines) = Enterprise version

This lets the community benefit from solid monitoring while leaving room for commercial sustainability.

🌟 Check It Out

GitHub: https://github.com/bibacrm/camunda-health-monitor

If you're running Camunda 7 in production, give it a try! I'd love to hear:

  • What works well?
  • What's missing?
  • What would make this indispensable for your team?

⭐ Star the repo if you find it useful!

πŸ’¬ Drop a comment with your Camunda monitoring challenges

πŸ”§ Contribute if you'd like to add features


πŸ’¬ Discussion Questions

What monitoring challenges do you face with Camunda or other workflow engines?

Have you found a monitoring setup that actually works? What's your current pain point?

Drop a comment below - I read and respond to every one!

I'm also thinking about open-sourcing a web-based BPMN/DMN linter (built on bpmn.io). Would that be interesting to the community?


Building better monitoring tools, one metric at a time. Follow me for more deep-dives into DevOps, BPM, and open-source development.

Top comments (1)

Dmitriy Champa

πŸ‘‹ Author here!

Really curious to hear from the community:

  1. What monitoring tools are you currently using for workflow engines?
  2. What's your biggest pain point with monitoring Camunda (or similar BPM platforms)?
  3. What metrics would make this tool indispensable for your team?

I'll be hanging around to answer questions and discuss ideas!