"You can't improve what you can't measure." - Peter Drucker
🎥 Video Demo
Want to see it in action before installing? Watch the feature demonstration on YouTube:
▶️ Watch Demo: Champa Camunda 7 Health Monitor
Note: The video shows our enterprise version, but the open-source version includes about 70% of the same features. Core monitoring capabilities are identical - the enterprise version adds multi-tenancy, advanced alerting, and SSO integration.
The Story
Last year, our production Camunda clusters started showing strange behavior. Process instances were stuck. Job executors were falling behind. Incidents were piling up. But our monitoring dashboards? They showed everything was "green" 🟢, or just "no data to show".
🤔 The Problem: Why Monitoring Camunda is Frustrating
I've been working with Camunda 7 for years, and monitoring has always been the painful part. Not because monitoring tools don't exist - quite the opposite. I've tried them all:
- Datadog - Great for infrastructure, expensive, doesn't understand workflow engines
- Grafana + Prometheus - Powerful but requires extensive configuration for Camunda-specific metrics
- Promtail + Loki - Built custom log parsing pipelines, spent more time maintaining them than using them
- ELK Stack - Overkill for operational monitoring
Each solution had the same problems:
❌ The Pain Points
1. Too Heavy
Every enterprise monitoring solution feels like bringing a sledgehammer to hang a picture frame. I just want to know if my Camunda cluster is healthy - I don't need distributed tracing across 47 microservices.
2. Multiple Tools, Multiple Headaches
- Configure Prometheus exporters
- Maintain Grafana dashboards
- Set up log shippers
- Keep everything in sync
- Debug why Node 3's metrics stopped flowing
3. Generic Metrics ≠ Operational Insights
Sure, I can see that:
- ✅ CPU usage is at 45%
- ✅ Memory usage looks fine
- ✅ Database connections are healthy
- ✅ HTTP response times are normal
But here's what I actually need to know at 3 AM when I get paged:
- ❌ Why are 1,200 process instances stuck for 7+ days?
- ❌ Which node is rejecting 80% of job acquisitions?
- ❌ Are message correlations timing out?
- ❌ Is job executor throughput dropping node by node?
- ❌ Which process definitions are consuming the most resources?
4. Configuration Hell
Want to monitor job executor metrics per node? Here's your 200-line Prometheus query. Want to track stuck instances? Write a custom exporter. Want it to look presentable? Build 15 Grafana panels processing data from several sources.
💡 The Realization
After years of fighting with enterprise monitoring stacks, I realized: I don't need a monitoring platform. I need a Camunda health dashboard.
One tool. One configuration file. One dashboard. That's it.
So I built it.
🏗️ Building a Better Way
The requirements were simple:
Must Have:
- Real-time visibility into ALL Camunda nodes in the cluster
- Process execution metrics (instances, tasks, jobs, incidents)
- Per-node health and workload distribution
- JVM metrics integrated with Camunda context
- Database performance insights
- Alerting capabilities
Technical Constraints:
- Lightweight (can't add another heavyweight monitoring stack)
- Read-only access (security requirement)
- No Camunda code changes (we don't control deployments)
- Works with PostgreSQL (our DB of choice)
🛠️ The Technical Stack
After evaluating options, I chose simplicity over sophistication:
Backend:
- Flask - Lightweight Python web framework (no blueprints, single file to start)
- psycopg2 - Direct PostgreSQL access for database metrics
- requests - Camunda REST API integration
Frontend:
- Alpine.js - Minimal reactive framework (15KB!)
- Tailwind CSS - Utility-first styling (CDN version, no build step)
- Chart.js - Clean visualizations
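To make that "simplicity over sophistication" concrete, here is a minimal single-file sketch of the pattern (route names and payloads are illustrative, not the project's actual code): one Flask app serving both the dashboard page and the JSON endpoints it polls.

```python
from flask import Flask, jsonify, render_template

app = Flask(__name__)

@app.route("/")
def dashboard():
    # Single HTML page; Alpine.js and Tailwind come in via CDN, no build step
    return render_template("dashboard.html")

@app.route("/api/cluster/status")
def cluster_status():
    # Hypothetical endpoint shape; the real app aggregates per-node checks
    return jsonify({"total_nodes": 3, "running_nodes": 3})

if __name__ == "__main__":
    app.run(host="0.0.0.0", port=5000)
```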
π€ "Wait, This Looks Too Simple..."
You're right. This architecture is intentionally simplified for the open-source version. Here's why:
What's NOT here (that would be in a commercial product):
- ❌ Flask blueprints for modular code organization
- ❌ Tailwind build pipeline with purging and optimization
- ❌ Asset bundling and minification
- ❌ Comprehensive test suite with 80%+ coverage
- ❌ Database migrations framework
- ❌ Advanced caching strategies
- ❌ Multi-tenancy support
Why This Simplified Approach?
- Lower barrier to contribution - Any Python developer can read and modify the code
- Faster deployment - No build steps, just run it
- Easier debugging - Single-file architecture means fewer places for bugs to hide
- Good enough for most use cases - Monitoring 3-10 Camunda nodes? This works great.
📊 The Trade-offs
| Aspect | Simplified (Current) | Production-Grade |
|---|---|---|
| Setup time | 5 minutes | 30+ minutes |
| Code complexity | Low | Medium-High |
| Maintainability | Good for small teams | Better for large teams |
| Performance | Great for <10 nodes | Optimized for 50+ nodes |
| Customization | Easy to hack | Structured extension points |
The Philosophy:
This is a monitoring tool, not a SaaS platform. It should be:
- ✅ Deployable in minutes - Not hours of setup
- ✅ Understandable by any developer - Not requiring framework expertise
- ✅ Hackable - Easy to add your custom metrics
- ✅ Lightweight - Runs on a t2.micro if needed
π‘ "Can I Use This in Production?"
Short answer: Yes, with considerations.
Long answer:
We've been running a more sophisticated version internally for over a year. This open-source version strips away our company-specific customizations but retains the core monitoring capabilities.
For production use:
- ✅ Small-to-medium deployments (1-10 nodes): Use as-is
- ⚠️ Large deployments (10+ nodes): Consider adding caching, load balancing
- ⚠️ High-security environments: Review the security section, add authentication layer
- ⚠️ Mission-critical monitoring: Run redundantly, export to Prometheus for historical data
Why These Choices Matter:
- Total footprint < 500KB (excluding Python dependencies)
- Startup time < 5 seconds
- Memory footprint < 100MB
- No build step - deploy and run
Compare this to a typical Grafana + Prometheus + Exporters setup:
- Installation: 1-2 hours
- Configuration: 2-4 hours
- Memory: 500MB - 2GB
- Maintenance: Ongoing
For Camunda-specific monitoring, the lightweight approach wins.
📊 What We Monitor
1. Cluster Overview
```
Total Nodes:     3
Running Nodes:   3 ✅
Engine Version:  7.21.0
Response Time:   45ms avg
```
2. Process Execution Metrics
- Active process instances
- User tasks awaiting action
- External tasks in queue
- Open incidents (by type)
- Stuck instances (configurable threshold)
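Most of these counts come straight from Camunda 7's standard REST API. A minimal sketch, assuming the default `engine-rest` base path on each node (the base URL is yours to adjust):

```python
import requests

BASE = "http://camunda-node-1:8080/engine-rest"  # assumption: default REST path

def count(resource):
    # Camunda 7's standard count endpoints, e.g. GET /process-instance/count
    resp = requests.get(f"{BASE}/{resource}/count", timeout=5)
    resp.raise_for_status()
    return resp.json()["count"]

active_instances = count("process-instance")
open_user_tasks  = count("task")
external_tasks   = count("external-task")
open_incidents   = count("incident")
```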
3. Job Executor Health
This is where it gets interesting. Per-node metrics:
- Job acquisition rate (success/rejection ratio)
- Job execution throughput (jobs/minute)
- Failed jobs (no retries remaining)
- Executable jobs in queue
Why This Matters: If Node 2 is rejecting 80% of job acquisitions while Nodes 1 and 3 are fine, you've found your problem.
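What makes the per-node breakdown possible is that Camunda writes its engine metrics to the ACT_RU_METER_LOG table with one REPORTER_ value per node. A sketch of the kind of query behind this view (the 15-minute window is illustrative):

```python
# Per-node job acquisition counters from Camunda's meter log.
PER_NODE_SQL = """
SELECT reporter_, name_, SUM(value_) AS total
FROM act_ru_meter_log
WHERE name_ IN ('job-acquired-success', 'job-acquired-failure')
  AND timestamp_ > NOW() - INTERVAL '15 minutes'
GROUP BY reporter_, name_
"""

def job_acquisition_by_node(conn):
    # Returns {(node, metric): total} for the recent window
    with conn.cursor() as cur:
        cur.execute(PER_NODE_SQL)
        return {(reporter, name): total for reporter, name, total in cur.fetchall()}
```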
4. JVM Deep Dive
Integration with Prometheus JMX Exporter gives us:
- Heap utilization per node
- GC pause times
- Thread counts (daemon vs non-daemon)
- CPU load per node
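The dashboard doesn't speak JMX itself; it scrapes each node's JMX exporter endpoint and picks out the relevant series. A sketch, assuming the exporter listens on port 9404 (a common jmx_exporter convention; verify the port and metric names against your own exporter config):

```python
import requests

def heap_used_bytes(node_host, port=9404):
    # Scrape the exporter's Prometheus text output and pull out heap usage;
    # jvm_memory_bytes_used comes from the standard JVM collectors.
    text = requests.get(f"http://{node_host}:{port}/metrics", timeout=5).text
    for line in text.splitlines():
        if line.startswith('jvm_memory_bytes_used{area="heap"'):
            return float(line.rsplit(" ", 1)[1])
    return None
```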
5. Database Analytics
Direct PostgreSQL queries reveal:
- Storage usage by Camunda table
- Slow query identification
- Archivable instances (completed processes)
- Connection pool utilization
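The storage view, for example, comes from PostgreSQL's own catalog rather than from Camunda. A sketch of the underlying query (in PostgreSQL, Camunda's tables are lowercased `act_*` names):

```python
# Storage used per Camunda table, straight from PostgreSQL's statistics views.
TABLE_SIZE_SQL = """
SELECT relname AS table_name,
       pg_total_relation_size(relid) AS total_bytes
FROM pg_catalog.pg_statio_user_tables
WHERE relname LIKE 'act_%'
ORDER BY total_bytes DESC
"""

def table_sizes(conn):
    with conn.cursor() as cur:
        cur.execute(TABLE_SIZE_SQL)
        return cur.fetchall()  # [(table_name, total_bytes), ...]
```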
🎯 The Architecture
```
┌─────────────────┐
│   Web Browser   │
│   (Dashboard)   │
└────────┬────────┘
         │
         ▼
┌─────────────────┐
│  Flask Backend  │
│                 │
│  • REST APIs    │
│  • Metrics Agg  │
│  • Prometheus   │
└────┬───────┬────┘
     │       │
     │       └──────────┐
     ▼                  ▼
┌─────────────┐   ┌──────────────┐
│   Camunda   │   │  PostgreSQL  │
│  REST API   │   │   Database   │
│             │   │              │
│   Node 1    │   │  • ACT_* tbl │
│   Node 2    │   │  • Metrics   │
│   Node 3    │   │  • Analytics │
└──────┬──────┘   └──────────────┘
       │
       ▼
┌─────────────────┐
│  JMX Exporter   │
│  (per node)     │
└─────────────────┘
```
💡 Key Design Decisions
1. Lazy Loading
Dashboard loads in < 1 second. Metric cards load on-demand as you scroll.
```html
<!-- Alpine.js lazy loading (x-intersect requires the @alpinejs/intersect plugin) -->
<div x-data="{ loaded: false }"
     x-intersect="loaded = true">
  <template x-if="loaded">
    <!-- Load expensive metrics -->
  </template>
</div>
```
2. No Browser Storage
Critical learning: localStorage/sessionStorage aren't available in all deployment contexts. Everything stays in memory or calls the backend.
3. Read-Only Database Access
We query Camunda's tables directly but ONLY with SELECT statements. The DB user should have zero write permissions.
```sql
-- Example: count stuck instances (started 7+ days ago, still running).
-- Note: START_TIME_/END_TIME_ live on the history table ACT_HI_PROCINST,
-- not on ACT_RU_EXECUTION.
SELECT COUNT(*)
FROM ACT_HI_PROCINST
WHERE START_TIME_ < NOW() - INTERVAL '7 days'
  AND END_TIME_ IS NULL;
```
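On top of a SELECT-only database role, the session itself can be opened read-only as a second guard rail. A sketch with psycopg2 (connection details are placeholders):

```python
import psycopg2

conn = psycopg2.connect(
    host="db.example.internal",  # placeholder
    dbname="camunda",
    user="camunda_monitor",      # a SELECT-only role
    password="change-me",
)
# Refuse writes at the session level, even if a bad query slips through review
conn.set_session(readonly=True, autocommit=True)
```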
4. Prometheus Integration
Export metrics in Prometheus format for Grafana dashboards:
```python
@app.route('/metrics')
def prometheus_metrics():
    # get_active_count / get_incident_count / get_node_statuses are
    # app-level helpers (the last returns per-node statuses by node name)
    metrics = [
        f'camunda_active_instances {get_active_count()}',
        f'camunda_open_incidents {get_incident_count()}',
    ]
    for node, status in get_node_statuses().items():
        metrics.append(f'camunda_node_status{{node="{node}"}} {status}')
    return '\n'.join(metrics) + '\n', 200, {'Content-Type': 'text/plain'}
```
🚀 From Internal to Open Source
After running this successfully in production, and while developing an enterprise-level version, I realized other Camunda users face the same challenges. So I created a lightweight, open-source version.
What Changed for Open Source:
- ✅ Removed proprietary business logic
- ✅ Added comprehensive documentation
- ✅ Docker support + docker-compose
- ✅ Configuration via environment variables
- ✅ Dark mode (because developers love dark mode 🌙, and so do my eyes)
- ✅ Responsive design for mobile monitoring
- ✅ Removed extra security features and user management
What Stayed the Same:
- ✅ Core monitoring capabilities
- ✅ Lightweight architecture
- ✅ Production-ready code
📦 Try It Yourself
Quick start in under 5 minutes:
```bash
git clone https://github.com/bibacrm/camunda-health-monitor.git
cd camunda-health-monitor

# Configure your Camunda nodes
cp .env.example .env
nano .env  # Add your Camunda URLs

# Install and run
pip install -r requirements.txt
python app.py
```
Visit http://localhost:5000 and you're monitoring!
Docker enthusiasts?
```bash
docker-compose up -d
```
📚 Lessons Learned
1. Specific > Generic
Domain-specific monitoring beats generic APM for operational insights.
2. Lightweight Wins
You don't need React + Redux + TypeScript for a monitoring dashboard. Alpine.js + Tailwind is often enough.
3. Direct Database Access is Powerful
When used responsibly (read-only!), querying Camunda's tables directly gives you insights no API can provide. In production and staging environments, the safest approach is to point the monitor at a read-only replica of the Camunda database.
4. Start Internal, Go Open Source
Building for your own needs first ensures you solve real problems. Open-sourcing second ensures quality.
5. Simplify the Architecture
Lower barriers to contribution = more community engagement. Flask blueprints and build pipelines can come later.
π¬ The "Enterprise vs. Open Source" Balance
I wrestled with how much to include in the open-source version. Too simple = not useful. Too complex = hard to contribute to.
What I landed on:
- Core monitoring features = 100% open source
- Advanced features (multi-tenancy, SSO, custom alerting pipelines) = Enterprise version
This lets the community benefit from solid monitoring while leaving room for commercial sustainability.
🔗 Check It Out
GitHub: https://github.com/bibacrm/camunda-health-monitor
If you're running Camunda 7 in production, give it a try! I'd love to hear:
- What works well?
- What's missing?
- What would make this indispensable for your team?
⭐ Star the repo if you find it useful!
💬 Drop a comment with your Camunda monitoring challenges
🔧 Contribute if you'd like to add features
💬 Discussion Questions
What monitoring challenges do you face with Camunda or other workflow engines?
Have you found a monitoring setup that actually works? What's your current pain point?
Drop a comment below - I read and respond to every one!
I'm also thinking about open-sourcing a web-based BPMN/DMN linter (built on bpmn.io). Would that be interesting to the community?
Building better monitoring tools, one metric at a time. Follow me for more deep-dives into DevOps, BPM, and open-source development.
