Ever been paged at 2 AM because a Glue job silently failed?
Me too. And that pain led me to build SiliconPrimeX, an autonomous, AI-powered self-healing system for AWS Glue failures. No more sifting through logs or scrambling to patch configs. Now, my pipeline heals itself before I even blink.
The Problem
Glue job fails.
You get pinged.
You read CloudWatch logs line-by-line.
You figure out the issue.
You manually update the Glue job.
This used to take me 20–45 minutes per failure. Now? ~10 seconds.
The Solution: SiliconPrimeX
SiliconPrimeX is a fully autonomous RCA (Root Cause Analysis) + Auto-Patching Engine for AWS Glue.
It listens for job failures, fetches the logs, sends them to Google Gemini, and automatically applies fixes, like changing WorkerType or NumberOfWorkers. It also sends RCA-rich alerts via SNS.
What It Does:
Parses failed Glue job logs from S3
Asks Gemini: "Why did this fail? How should I fix it?" (see the sketch below)
Applies the patch (via update_job)
Sends an alert with RCA to your team
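To make the Gemini step concrete, here is a minimal sketch of how a Lambda could ask Gemini 2.0 Flash for an RCA plus a structured patch. The prompt wording, the ask_gemini_for_rca helper name, and the JSON response schema are illustrative assumptions for this post, not the repo's actual code; it uses the google-generativeai SDK.

```python
# Minimal sketch (assumptions: google-generativeai SDK, a GEMINI_API_KEY env var,
# and a JSON patch schema chosen for illustration -- not the repo's exact code).
import json
import os

import google.generativeai as genai

genai.configure(api_key=os.environ["GEMINI_API_KEY"])
model = genai.GenerativeModel("gemini-2.0-flash")

def ask_gemini_for_rca(log_text: str) -> dict:
    """Ask Gemini why the Glue job failed and what config patch to apply."""
    prompt = (
        "You are a data platform SRE. This AWS Glue job failed.\n"
        "1. Explain the root cause in two sentences.\n"
        "2. Suggest a config patch as a JSON object with keys "
        "'WorkerType' and 'NumberOfWorkers' (or an empty object if no patch helps).\n\n"
        f"Logs:\n{log_text[:15000]}"  # truncate to keep the prompt within limits
    )
    response = model.generate_content(prompt)
    text = response.text
    # Naive parsing: grab the first JSON object in the reply; real code should be stricter.
    start, end = text.find("{"), text.rfind("}") + 1
    patch = json.loads(text[start:end]) if start != -1 else {}
    return {"rca": text, "patch": patch}
```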
Imagine a clean, beautiful Lambda + S3 + Gemini-powered auto-healer doing its magic:
Gemini 2.0 Flash for LLM-based RCA
AWS Lambda for automation
S3 for storing job logs
DynamoDB for RCA storage
Glue for data jobs
SNS for alerts
Try It Yourself
Create an S3 trigger for Glue logs
Use Lambda to call Gemini with the log as prompt
Parse and patch the Glue job config using update_job
Store results in DynamoDB
Trigger alerts with boto3.client('sns').publish() (full wiring sketch below)
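Here is a hedged end-to-end sketch of how those five steps could fit together in an S3-triggered Lambda handler. The resource names (RCA_TABLE, SNS_TOPIC_ARN) and the ask_gemini_for_rca helper from the earlier sketch are placeholders; the actual repo may structure things differently. One real gotcha worth noting: get_job returns read-only fields (Name, CreatedOn, ...) that update_job's JobUpdate parameter rejects, so they have to be stripped before patching.

```python
# End-to-end wiring sketch (placeholders: RCA_TABLE, SNS_TOPIC_ARN, and the
# ask_gemini_for_rca helper from the sketch above -- adjust to your own setup).
import json
import os
import time
import urllib.parse

import boto3

s3 = boto3.client("s3")
glue = boto3.client("glue")
sns = boto3.client("sns")
rca_table = boto3.resource("dynamodb").Table(os.environ.get("RCA_TABLE", "glue-rca"))

def lambda_handler(event, context):
    # 1. S3 put event for a failed job's log file (object keys are URL-encoded in events).
    record = event["Records"][0]["s3"]
    bucket = record["bucket"]["name"]
    key = urllib.parse.unquote_plus(record["object"]["key"])
    log_text = s3.get_object(Bucket=bucket, Key=key)["Body"].read().decode("utf-8")

    # 2. Ask Gemini for the RCA and a config patch (helper defined earlier).
    job_name = key.split("/")[-1].replace(".log", "")  # assumes logs are named after the job
    result = ask_gemini_for_rca(log_text)

    # 3. Patch the Glue job. Copy only fields that JobUpdate accepts, then overlay the patch.
    if result["patch"]:
        current = glue.get_job(JobName=job_name)["Job"]
        job_update = {k: v for k, v in current.items()
                      if k not in ("Name", "CreatedOn", "LastModifiedOn",
                                   "AllocatedCapacity", "MaxCapacity")}
        job_update.update(result["patch"])  # e.g. WorkerType / NumberOfWorkers
        glue.update_job(JobName=job_name, JobUpdate=job_update)

    # 4. Persist the RCA for auditing.
    rca_table.put_item(Item={"job_name": job_name, "ts": int(time.time()),
                             "rca": result["rca"], "patch": json.dumps(result["patch"])})

    # 5. Alert the team with the RCA attached.
    sns.publish(TopicArn=os.environ["SNS_TOPIC_ARN"],
                Subject=f"[SiliconPrimeX] Auto-patched {job_name}",
                Message=result["rca"])
    return {"status": "ok", "job": job_name}
```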
GitHub + Live Demo
GitHub Repo: https://github.com/ashishkesari18/Data-Engineering-Projects/tree/main/AWS%2BData%20Engineering/SiliconPrimeX
Medium Article: https://medium.com/@ashishkesari018/siliconprimex-building-an-autonomous-self-healing-data-platform-on-aws-c7a73703795c
Future Add-ons:
Retry mechanism before patching
Slack integration for alerts
Scheduled RCA audit logs
Multi-job patching recommendations
Why I Built This
As someone passionate about serverless reliability, I wanted to prove that:
Glue jobs can self-diagnose
LLMs like Gemini can reason over failure logs
We don't need to burn ops hours on routine errors
I Need Your Feedback
I'd love to hear your thoughts:
How can I make this more scalable?
Would you use this in prod?
Got suggestions for extra automation?
Let's Connect
https://www.linkedin.com/in/ashishk18/