Zandile
I Built a Real-Time Load Shedding Platform for 60 million South Africans — Here's What Nearly Broke Me

Load shedding is not an inconvenience in South Africa. It's a national operating condition.

Businesses plan around it. Hospitals prepare for it. Families structure their evenings by it. And for software engineers, it's a constant reminder that the infrastructure assumptions baked into most cloud architecture tutorials — "your services will always be reachable", "your data will always flow" — simply don't hold here.

I built GridSense SA to solve a real problem: South Africans need fast, reliable load shedding intelligence, and the existing tools are brittle, slow, or expensive. What I didn't expect was how much the project would teach me about building truly resilient cloud systems — not the theoretical kind, but the kind that keeps working when the power goes off.

This is the story of that build: the architecture decisions, the thing that nearly broke me, and what I learned that no tutorial had prepared me for.


The Problem Worth Solving

The EskomSePush API is a lifeline. It exposes live national load shedding data — current stage, area schedules, status updates. Millions of South Africans depend on apps built on top of it every single day.

But raw API polling is fragile. If your service goes down during Stage 4 (exactly when everyone is hammering the endpoint), you lose data. If you can't validate what you receive, bad data flows straight to your users. And if you have no event history, you have no ability to predict, alert, or analyse patterns over time.

GridSense SA addresses all three. It ingests live Eskom data, streams it through a validated event pipeline, and serves it through a low-latency REST API — all on AWS, all infrastructure-as-code, all observable in real time.


The Architecture (Before I Explain What Went Wrong)

The system is deliberately event-driven. Here's the high-level flow:

EskomSePush API
      ↓ (polled every 5 minutes)
Eskom Ingestor  (Python / Kubernetes)
      ↓
Kafka Topic: eskom.generation.raw  (AWS MSK)
      ↓
Data Validator (Python / Kubernetes)
      ↓
Kafka Topic: eskom.generation.validated
      ↓
API Gateway (FastAPI / Kubernetes)
      ↓
REST API consumers

Three microservices. Three Kafka topics (plus a dead-letter queue for failed validation). Everything running on EKS. Everything provisioned with Terraform.
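The ingestor's core loop is conceptually simple. Here's a minimal sketch — the function and parameter names are my illustration, not the project's actual code; in the real service, `produce` would wrap a Kafka producer's send call:

```python
import json
import time

def run_ingestor(fetch_status, produce, interval_s=300, max_cycles=None):
    """Poll an upstream status API and forward each reading to Kafka.

    fetch_status: callable returning a dict (e.g. a wrapper around the
                  EskomSePush status endpoint).
    produce:      callable(topic, value_bytes) -- in production, a thin
                  wrapper around KafkaProducer.send.
    """
    cycles = 0
    while max_cycles is None or cycles < max_cycles:
        event = fetch_status()
        # Raw events go to the unvalidated topic; validation happens downstream.
        produce("eskom.generation.raw", json.dumps(event).encode("utf-8"))
        cycles += 1
        if max_cycles is None or cycles < max_cycles:
            time.sleep(interval_s)

# Example wiring with stub callables (interval_s=0 so it runs instantly):
sent = []
run_ingestor(lambda: {"stage": 4}, lambda t, v: sent.append((t, v)),
             interval_s=0, max_cycles=2)
```

Keeping the fetch and produce callables injectable makes the loop trivially testable without a live broker.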

The technology choices were deliberate:

  • AWS MSK over self-managed Kafka — I wanted to focus on the application logic, not broker operations. MSK handles replication, patching, and monitoring.
  • EKS over ECS — Kubernetes gives me portability and a skill that translates globally. ECS would have been easier, but the learning was the point.
  • FastAPI over Flask/Django — async by default, auto-generated Swagger docs, Pydantic validation built in. The right tool for an event-driven data API.
  • Terraform modules — not scripts, not inline config. Reusable modules for EKS, MSK, and VPC that I can carry to any future project.

The full stack: AWS (EKS 1.30, MSK 3.5.1, ECR, VPC), Terraform, Kubernetes, Apache Kafka, Python 3.11, FastAPI, Prometheus, Grafana, GitHub Actions.


What Nearly Broke Me: Kafka on MSK

Here's the part they don't tell you in the getting-started guide.

MSK is not Kafka-with-a-UI. It's Kafka with IAM, with VPC-enforced network boundaries, with TLS that you cannot turn off, with broker endpoints that change format depending on whether you're connecting from inside or outside the cluster, and with a bootstrap server string that looks deceptively simple until your Python client throws a [SSL: CERTIFICATE_VERIFY_FAILED] error at 11pm on a Tuesday.

These were the specific walls I hit:

1. The bootstrap server format

MSK gives you three types of endpoints: plaintext (disabled by default and you should keep it that way), TLS, and SASL/SCRAM. I spent the better part of a day confused about why my Kafka client was connecting but not producing, before realising I was using the plaintext endpoint format with a TLS configuration. The error message pointed nowhere useful.

The fix was embarrassingly simple once I found it — use the TLS bootstrap servers and explicitly pass the CA cert path — but getting there required reading through AWS documentation, three Stack Overflow threads, and one GitHub issue from 2021 that finally had the right answer buried in a comment.
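For concreteness, here's a sketch of the fix, assuming the kafka-python client (the helper just builds the client kwargs; real code would pass them to `kafka.KafkaProducer(**cfg)` — the broker hostname below is a placeholder):

```python
def msk_tls_config(tls_bootstrap, ca_path="/etc/ssl/certs/ca-bundle.crt"):
    """Build kafka-python client kwargs for an MSK TLS connection."""
    return {
        # Must be the TLS endpoints (port 9094 on MSK), not plaintext (9092).
        "bootstrap_servers": tls_bootstrap,
        "security_protocol": "SSL",   # not PLAINTEXT
        "ssl_cafile": ca_path,        # explicit CA bundle path
    }

cfg = msk_tls_config(["b-1.example.kafka.af-south-1.amazonaws.com:9094"])
```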

2. IAM authentication vs SASL/SCRAM

MSK supports both. Pick one before you start and commit to it. I initially tried IAM-based authentication because it felt cleaner (no credentials to manage), but the Python kafka-python library's MSK IAM support at the time required a separate aws-msk-iam-sasl-signer package that added complexity to every service. I switched to SASL/SCRAM with Secrets Manager, and it became significantly simpler to reason about.
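A sketch of the SASL/SCRAM variant, again assuming kafka-python (the function takes the credentials as arguments; in the real services they'd be fetched first via boto3's `get_secret_value` — note MSK requires SCRAM secrets in Secrets Manager to be named with an `AmazonMSK_` prefix):

```python
def msk_scram_config(bootstrap, username, password):
    """Build kafka-python client kwargs for MSK SASL/SCRAM.

    MSK's SASL/SCRAM listener is on port 9096, and SCRAM-SHA-512 is the
    mechanism MSK supports.
    """
    return {
        "bootstrap_servers": bootstrap,
        "security_protocol": "SASL_SSL",
        "sasl_mechanism": "SCRAM-SHA-512",
        "sasl_plain_username": username,
        "sasl_plain_password": password,
    }

cfg = msk_scram_config(
    ["b-1.example.kafka.af-south-1.amazonaws.com:9096"],
    "gridsense",          # placeholder -- real value comes from Secrets Manager
    "not-a-real-password",
)
```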

3. Consumer group offsets in Kubernetes

When your Kafka consumer restarts — which happens constantly in Kubernetes during deployments, node scaling, or pod evictions — you need to decide what happens to its offset. auto.offset.reset=earliest means it reprocesses everything. latest means it skips what it missed. Neither is always right. For load shedding data where a missed event means a missed stage change, I settled on earliest with idempotent processing in the validator — each event is checked for duplicate message IDs before being written downstream.
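The duplicate check can be as simple as a bounded window of recently seen message IDs. A minimal sketch (GridSense's actual validator may differ — a fixed-size window will re-admit an ID once it's evicted, so the capacity has to comfortably exceed the replay depth):

```python
from collections import OrderedDict

class DedupWindow:
    """Remember the last `capacity` message IDs so an earliest-offset
    replay doesn't reprocess events."""

    def __init__(self, capacity=10_000):
        self._seen = OrderedDict()
        self._capacity = capacity

    def is_new(self, message_id):
        if message_id in self._seen:
            return False
        self._seen[message_id] = True
        if len(self._seen) > self._capacity:
            self._seen.popitem(last=False)  # evict the oldest ID
        return True

# Tiny capacity to show the eviction behaviour: the final "a" is accepted
# again because it was evicted from the window when "d" arrived.
window = DedupWindow(capacity=3)
results = [window.is_new(m) for m in ["a", "b", "a", "c", "d", "a"]]
```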

4. Topic creation is not automatic

MSK does not auto-create topics by default (and you should keep auto.create.topics.enable=false in production). Every topic in GridSense — eskom.generation.raw, eskom.generation.validated, eskom.generation.deadletter, and the five future-use topics — had to be explicitly created. I scripted this with a kafka-client pod running in the cluster, but it was one of those invisible steps I only discovered when events weren't flowing and I couldn't work out why.
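A sketch of how that scripting can look — the partition counts here are illustrative, not GridSense's actual values; the generated commands use `kafka-topics.sh`, which ships with Kafka and is what you'd run from an in-cluster kafka-client pod:

```python
# Illustrative topic config (partition counts are my assumption).
TOPICS = {
    "eskom.generation.raw":        {"partitions": 3},
    "eskom.generation.validated":  {"partitions": 3},
    "eskom.generation.deadletter": {"partitions": 1},
}

def create_topic_cmd(name, partitions, bootstrap, replication=2):
    """Render a kafka-topics.sh command for one topic.

    --if-not-exists makes the script idempotent across reruns.
    """
    return (f"kafka-topics.sh --create --topic {name} "
            f"--partitions {partitions} --replication-factor {replication} "
            f"--bootstrap-server {bootstrap} --if-not-exists")

cmds = [create_topic_cmd(name, cfg["partitions"], "localhost:9092")
        for name, cfg in TOPICS.items()]
```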

None of these are insurmountable. But they are the kind of friction that doesn't appear in architecture diagrams, and they're exactly the kind of thing that distinguishes engineers who have actually run Kafka in production from those who have only read about it.


The Data Validator: Seven Rules Between Good Data and Bad

One decision I'm proud of is the validation layer. Rather than treating the EskomSePush API as a trusted source, the Data Validator applies seven rules to every raw event before it moves downstream:

  1. Required fields present — does the event have all expected keys?
  2. Stage range valid — load shedding stages run 0–8. Anything outside is corrupted data.
  3. Timestamp freshness — events older than 30 minutes are stale and rejected.
  4. No future timestamps — an event claiming to be from the future is invalid.
  5. Known source — the event source must be a recognised identifier.
  6. Schema version match — forward compatibility if the API changes its format.
  7. Duplicate detection — the same event ID should not be processed twice.

Events failing validation go to eskom.generation.deadletter for inspection rather than being silently dropped. This matters: silent data loss is harder to debug than visible dead-letter accumulation.
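A condensed sketch of rules 1–5 (field names, the return shape, and the ISO-8601 timestamp format are my assumptions, not the validator's actual contract — and it assumes timestamps carry a UTC offset):

```python
from datetime import datetime, timedelta, timezone

REQUIRED = {"id", "stage", "timestamp", "source"}
MAX_AGE = timedelta(minutes=30)

def validate(event, now=None, known_sources=("eskomsepush",)):
    """Apply validation rules 1-5; returns (ok, reason)."""
    now = now or datetime.now(timezone.utc)
    missing = REQUIRED - event.keys()
    if missing:                                    # rule 1: required fields
        return False, f"missing fields: {sorted(missing)}"
    if not 0 <= event["stage"] <= 8:               # rule 2: stage range
        return False, "stage out of range"
    ts = datetime.fromisoformat(event["timestamp"])
    if ts > now:                                   # rule 4: no future events
        return False, "future timestamp"
    if now - ts > MAX_AGE:                         # rule 3: freshness
        return False, "stale event"
    if event["source"] not in known_sources:       # rule 5: known source
        return False, "unknown source"
    return True, "ok"
```

Failures carry a reason string, which is what you'd want attached to dead-letter events for later inspection.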


Observability From Day One

Every production system needs to answer three questions: Is it running? Is it healthy? Where is it slow?

GridSense answers all three with Prometheus and Grafana, deployed into the cluster from day one rather than bolted on later. Key metrics tracked per service:

  • Message throughput (events/second produced and consumed)
  • Consumer lag (how far behind the validator is from the ingestor)
  • Validation pass/fail rates
  • API response time (p50, p95, p99)
  • Dead-letter queue depth

The Grafana dashboard is the first thing I open when something feels off. Consumer lag is particularly telling — if the validator falls behind, it usually means a burst of invalid events hitting the dead-letter queue, which usually means the upstream API has changed something.
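Lag itself is just arithmetic over offsets. A sketch, assuming you've already pulled the per-partition numbers (with kafka-python that would be `KafkaConsumer.end_offsets()` and `KafkaConsumer.committed()`; here it's pure math):

```python
def consumer_lag(end_offsets, committed):
    """Total lag across partitions.

    end_offsets / committed: {partition: offset} dicts. A partition with no
    committed offset counts as fully behind (offset 0).
    """
    return sum(end_offsets[p] - committed.get(p, 0) for p in end_offsets)

# Partition 0 is 20 messages behind, partition 1 is caught up.
lag = consumer_lag({0: 120, 1: 200}, {0: 100, 1: 200})
```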

[Screenshot: Grafana Kubernetes cluster dashboard — GridSense SA CPU at 3.05% and memory at 27.7%, with a CPU spike at 16:20 during a consumer restart event]


Real Costs, Not Estimates

I want to be honest about something most architecture posts aren't: this costs real money to run.

Resource                  Cost/day
EskomSePush API ingestion stack:
EKS Control Plane         $2.40
2× t3.medium nodes        $2.02
MSK 2× kafka.t3.small     $2.16
NAT Gateway               $1.08
Total                     ~$7.65/day

That's $229/month for a development environment. For a learning project, that's significant — especially when you're paying in rand. My cost management strategy: destroy the expensive resources (EKS nodes, MSK cluster, NAT Gateway) when I'm not actively developing, and bring them back up when I resume. Terraform makes this a single command in both directions.

The lesson here is real: cloud cost visibility is not optional. Every production team I've spoken to has a story about a runaway bill. Building FinOps awareness into a project from the start — knowing exactly what each component costs and having a strategy to control it — is a skill that transfers directly to enterprise work.


What's Coming Next

GridSense is a data pipeline today. The Kafka topics I've pre-provisioned tell the story of what it becomes:

  • predictions.stage.forecast — an ML model predicting the next load shedding stage based on historical patterns
  • alerts.triggered — push notifications when stage changes are detected
  • weather.readings.raw — correlation between weather data and generation capacity
  • municipality.schedules.parsed — area-level schedule intelligence, not just national stage
  • user.reports.validated — crowdsourced ground-truth reports from users

The pipeline is built to receive all of this. The intelligence layer is what gets built on top of it.

I'm also planning to add:

  • A CloudFront + S3 offline fallback — so the API remains partially functional even when the EKS cluster is unreachable during an outage
  • A POPIA compliance module — automated checks for data residency and privacy controls, which maps directly to GDPR for any international deployments
  • A public demo endpoint — so you can actually hit the API and see live data

What This Project Taught Me

Six months ago I would have described myself as an intermediate AWS engineer. I knew EC2, S3, Lambda, RDS. I had done some container work and had basic Terraform knowledge.

GridSense forced me into territory I had only read about: managed Kafka, Kubernetes networking, multi-AZ data pipelines, event-driven microservices, production observability. More importantly, it forced me to think about infrastructure the way South African infrastructure demands — not "what happens when everything works," but "what happens when the power goes off, the API goes down, and your consumers restart mid-stream."

That question — *what happens when everything doesn't work?* — is, I've come to believe, the real question cloud engineering is trying to answer. Load shedding just makes it impossible to ignore.


The Repository

GridSense SA is open source. The full codebase — Terraform modules, Kubernetes manifests, Python services, CI/CD pipeline — is on GitHub.

*GitHub:* github.com/zandiletsh/gridsense-sa

If you're building something similar — resilient cloud architecture in environments where infrastructure reliability can't be assumed — I'd love to connect.


*Zandile Tshabalala is a cloud engineer based in South Africa, building infrastructure that works even when the power doesn't. Find her on GitHub at @zandiletsh and on LinkedIn: www.linkedin.com/in/zandile-tsh*
