DEV Community

Ashish Kesari


🛠️ SiliconPrimeX – Healing AWS Glue Jobs Autonomously with Gemini & Lambda 🚑✨

Ever been paged at 2 AM because a Glue job silently failed?

Me too. And that pain led me to build SiliconPrimeX, an autonomous, AI-powered self-healing system for AWS Glue failures. No more sifting through logs or scrambling to patch configs. Now, my pipeline heals itself before I even blink. 😎

🚨 The Problem
❌ Glue job fails.
🀯 You get pinged.
πŸ” You read CloudWatch logs line-by-line.
πŸ› οΈ You figure out the issue.
πŸ”§ You manually update the Glue job.

This used to take me 20–45 minutes per failure. Now? ~10 seconds.

💡 The Solution: SiliconPrimeX
SiliconPrimeX is a fully autonomous RCA (Root Cause Analysis) + Auto-Patching Engine for AWS Glue.

It listens for job failures, fetches the logs, sends them to Google Gemini, and automatically applies fixes, such as changing WorkerType or NumberOfWorkers. It also sends RCA-rich alerts via SNS.
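
Here's a minimal sketch of the Gemini step, assuming the model is asked for a JSON-only reply that gets validated before anything touches the job. The prompt wording, the JSON keys (root_cause, worker_type, number_of_workers), the allow-list, the worker cap, and the helper names build_prompt and parse_fix are all illustrative assumptions, not the code from the repo. The actual Gemini call is left out on purpose.

```python
import json

# Assumed guard rails: only accept values on this allow-list and under this cap.
ALLOWED_WORKER_TYPES = {"G.1X", "G.2X", "G.4X", "G.8X"}
MIN_WORKERS, MAX_WORKERS = 2, 50


def build_prompt(log_text: str, max_chars: int = 20_000) -> str:
    """Build the RCA prompt from the tail of the log, where failures usually show up."""
    tail = log_text[-max_chars:]
    return (
        "An AWS Glue job failed. Why did this fail? How should I fix it?\n"
        'Reply with JSON only: {"root_cause": str, '
        '"worker_type": str or null, "number_of_workers": int or null}\n\n'
        "Log:\n" + tail
    )


def parse_fix(reply: str) -> dict:
    """Extract and validate the JSON object in the reply; raise ValueError if unusable."""
    start, end = reply.find("{"), reply.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object in reply")
    data = json.loads(reply[start:end + 1])  # JSONDecodeError is a ValueError

    fix = {"root_cause": str(data.get("root_cause", "unknown"))}

    worker_type = data.get("worker_type")
    if worker_type is not None:
        if worker_type not in ALLOWED_WORKER_TYPES:
            raise ValueError(f"unexpected worker type: {worker_type!r}")
        fix["WorkerType"] = worker_type

    workers = data.get("number_of_workers")
    if workers is not None:
        is_int = isinstance(workers, int) and not isinstance(workers, bool)
        if not is_int or not MIN_WORKERS <= workers <= MAX_WORKERS:
            raise ValueError(f"unexpected worker count: {workers!r}")
        fix["NumberOfWorkers"] = workers

    return fix
```

Rejecting anything outside the allow-list means a hallucinated worker type fails loudly instead of ending up in a job definition.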

✨ What It Does:
Parses failed Glue job logs from S3

Asks Gemini: "Why did this fail? How should I fix it?"

Applies the patch (via update_job)

Sends an alert with RCA to your team
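
The patch step can be sketched like this. One thing to watch: update_job replaces the job definition with whatever is in JobUpdate, so the usual approach is to read the current definition with get_job, copy it, and change only the fields you mean to change. The fix dict (validated suggestion with optional WorkerType and NumberOfWorkers keys), the helper names, and the exact set of keys stripped below are assumptions for illustration, so check them against the Glue API docs for your job type.

```python
# Keys that get_job returns but JobUpdate does not accept, plus capacity
# fields that conflict with WorkerType/NumberOfWorkers (assumed list).
STRIP_KEYS = {"Name", "CreatedOn", "LastModifiedOn", "AllocatedCapacity", "MaxCapacity"}
PATCHABLE_KEYS = ("WorkerType", "NumberOfWorkers")


def build_job_update(current_job: dict, fix: dict) -> dict:
    """Copy the current job definition and overlay only the patchable fields."""
    update = {k: v for k, v in current_job.items() if k not in STRIP_KEYS}
    for key in PATCHABLE_KEYS:
        if key in fix:
            update[key] = fix[key]
    return update


def apply_fix(glue, job_name: str, fix: dict) -> dict:
    """Patch the job via a boto3 Glue client; return the JobUpdate sent, or {} if nothing to change."""
    if not any(key in fix for key in PATCHABLE_KEYS):
        return {}  # RCA only, no config change suggested
    current = glue.get_job(JobName=job_name)["Job"]
    update = build_job_update(current, fix)
    glue.update_job(JobName=job_name, JobUpdate=update)
    return update
```

In a Lambda you would call it as apply_fix(boto3.client("glue"), "my-job", fix), where "my-job" is a placeholder name. Passing the client in keeps the function easy to test without AWS.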

🖼️ Imagine a clean, beautiful Lambda + S3 + Gemini-powered auto-healer doing its magic

🧠 Gemini Flash 2.0 for LLM-based RCA
☁️ AWS Lambda for automation
πŸ“„ S3 for storing job logs
πŸ§ͺ DynamoDB for RCA storage
πŸ”§ Glue for data jobs
πŸ“£ SNS for alerts

Try It Yourself
Create an S3 trigger for Glue logs

Use Lambda to call Gemini with the log as prompt

Parse and patch the Glue job config using update_job

Store results in DynamoDB

Trigger alerts with boto3.client('sns').publish()
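
To make the first and last steps concrete, here's a rough sketch of the bookends: pulling the log location out of the S3 event, then storing the RCA and publishing the alert. The DynamoDB item layout, the message format, and the helper names are assumptions for illustration. How you map a log key to a Glue job name depends on your bucket layout, so job_name is passed in.

```python
import json
from datetime import datetime, timezone
from urllib.parse import unquote_plus


def log_objects(event: dict) -> list:
    """Return (bucket, key) pairs from an S3 event notification.

    Object keys arrive URL-encoded in the event, so they are decoded here.
    """
    return [
        (r["s3"]["bucket"]["name"], unquote_plus(r["s3"]["object"]["key"]))
        for r in event.get("Records", [])
    ]


def record_and_alert(table, sns, topic_arn: str, job_name: str, fix: dict) -> dict:
    """Store the RCA in DynamoDB and publish it to SNS.

    table: a boto3 DynamoDB Table resource; sns: a boto3 SNS client.
    """
    patch = {k: v for k, v in fix.items() if k != "root_cause"}
    item = {
        "job_name": job_name,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "root_cause": fix.get("root_cause", "unknown"),
        "patch": json.dumps(patch),  # stored as a string to keep the item simple
    }
    table.put_item(Item=item)
    sns.publish(
        TopicArn=topic_arn,
        Subject=f"Glue job {job_name} auto-patched"[:99],  # SNS subjects must stay under 100 chars
        Message=json.dumps(item, indent=2),
    )
    return item
```

Inside the Lambda handler, you would loop over log_objects(event), read each object with the S3 client, run the RCA, and finish with record_and_alert.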

📎 GitHub + Live Demo
GitHub Repo: https://github.com/ashishkesari18/Data-Engineering-Projects/tree/main/AWS%2BData%20Engineering/SiliconPrimeX

Medium Article: https://medium.com/@ashishkesari018/siliconprimex-building-an-autonomous-self-healing-data-platform-on-aws-c7a73703795c

Future Add-ons:

Retry mechanism before patching

Slack integration for alerts

Scheduled RCA audit logs

Multi-job patching recommendations

Why I Built This
As someone passionate about serverless reliability, I wanted to prove that:

✅ Glue jobs can self-diagnose
✅ LLMs like Gemini can reason on failure logs
✅ We don't need to burn ops hours for routine errors

🚀 I Need Your Feedback
I'd love to hear your thoughts:

How can I make this more scalable?

Would you use this in prod?

Got suggestions for extra automation?

🙌 Let's Connect
🔗 https://www.linkedin.com/in/ashishk18/
