DEV Community: Naveen Ayalla

I Got Tired of Claude Code Guessing Wrong, So I Built an MCP Toolkit

Naveen Ayalla — Tue, 09 Jun 2026 00:29:07 +0000

AI coding agents are useful, but they still have one frustrating habit:

They guess.

You ask something reasonable like:

“Where do we validate user input before inserting into the database?”

And instead of knowing where to look, the agent starts reading files one by one.

In a small project, that is fine.

In a real production codebase with 80,000+ lines, multiple engineers, old decisions, half-renamed folders, and years of accumulated context, this gets messy fast.

The agent reads a handful of files, hits context limits, and gives you an answer that sounds confident but points to the wrong part of the codebase.

I got tired of that, so I built an open-source MCP toolkit to fix it.

What I Built

I built MCP Server Toolkit, a collection of four Model Context Protocol servers that give AI coding agents direct access to the things they need:

Your codebase
Your database
Your docs
Your git history

Repo:

https://github.com/naveenayalla1-CS50/mcp-server-toolkit

The goal is simple:

Stop making the agent guess. Give it tools that know where to look.

Why MCP?

The Model Context Protocol, or MCP, lets AI agents call external tools in a standardized way.

Instead of the agent reading random files and hoping the right context fits, it can call a purpose-built tool like:

search_code("validate user input")

And get back file paths, line numbers, and relevant context.

That means fewer wrong guesses, fewer wasted tokens, and much better answers in large codebases.

The Four Servers

1. `mcp-code-search`

Searches across your repo and returns relevant matches with file paths, line numbers, and surrounding context.

Example:

You: Find all places where we call sendEmail

Agent calls search_code("sendEmail")

Results:
api/users.ts:89
services/email.ts:42
jobs/reminders.ts:117

It also includes targeted read_file and list_files tools so the agent can inspect only the files it actually needs.
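To make the shape of those results concrete, here is a minimal sketch of a search that returns file paths, line numbers, and surrounding context. It runs over files already loaded in memory. The `searchCode` function, the `SourceFile` and `Match` types, and the sample contents are assumptions for illustration, not the toolkit's actual implementation.

```typescript
// Hypothetical shapes, used only for this sketch.
interface SourceFile {
  path: string;
  contents: string;
}

interface Match {
  path: string;
  line: number; // 1-based line number of the matching line
  context: string[]; // the matching line plus its neighbours
}

// Naive substring search over files that are already loaded in memory.
function searchCode(files: SourceFile[], query: string, contextLines = 1): Match[] {
  const matches: Match[] = [];
  for (const file of files) {
    const lines = file.contents.split("\n");
    lines.forEach((text, index) => {
      if (text.indexOf(query) === -1) return;
      const start = Math.max(0, index - contextLines);
      const end = Math.min(lines.length, index + contextLines + 1);
      matches.push({ path: file.path, line: index + 1, context: lines.slice(start, end) });
    });
  }
  return matches;
}

// Usage with made-up file contents.
const sampleFiles: SourceFile[] = [
  { path: "services/email.ts", contents: "export function sendEmail(to: string) {\n  // ...\n}" },
  { path: "api/users.ts", contents: "const user = loadUser(id);\nawait sendEmail(user.email);" },
];
const hits = searchCode(sampleFiles, "sendEmail");
console.log(hits.map((hit) => hit.path + ":" + hit.line)); // services/email.ts:1 and api/users.ts:2
```

Returning a small context window with each match is what lets the agent decide which file to open next without reading everything.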

2. `mcp-database`

Lets the agent answer natural-language questions by running read-only SQL against your database.

Example:

You: How many users signed up in the last 7 days?

Agent runs:

SELECT count(*) FROM users
WHERE created_at > now() - interval '7 days';

It supports Postgres and SQLite.

The database server is read-only by default. You have to explicitly enable writable mode if you want writes.

That default matters. I did not want an agent anywhere near production data with write permissions unless the developer intentionally allowed it.
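One way to picture a read-only default is a guard that rejects anything other than a single SELECT before the query reaches the database. The sketch below is an assumption about how such a check could look, not the toolkit's code: `isReadOnlyQuery` and `WRITE_KEYWORDS` are hypothetical names, and the check is deliberately naive. It will not catch functions with side effects, so a database role without write privileges should remain the real safeguard.

```typescript
// Keywords that must never appear in a read-only query (includes SELECT ... INTO).
const WRITE_KEYWORDS = /\b(insert|update|delete|drop|alter|truncate|create|grant|merge|into)\b/i;

// Naive guard, not a SQL parser: it errs on the side of rejecting queries.
function isReadOnlyQuery(sql: string): boolean {
  // Ignore trailing semicolons, then reject anything with stacked statements.
  const statement = sql.trim().replace(/;+\s*$/, "");
  if (statement.length === 0 || statement.indexOf(";") !== -1) return false;

  // Only plain SELECTs and CTEs are allowed to start a statement.
  const firstWord = (statement.split(/\s+/)[0] || "").toLowerCase();
  if (firstWord !== "select" && firstWord !== "with") return false;

  // Catches data-modifying CTEs such as WITH d AS (DELETE ...) SELECT ...
  return !WRITE_KEYWORDS.test(statement);
}
```

A check like this can produce false rejections, for example when a column is literally named `update`. For a guard that sits in front of production data, rejecting too much is the safer failure mode.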

3. `mcp-docs`

Indexes a folder of Markdown docs with no embedding setup and no external API.

You can point it at internal docs, runbooks, API references, or project notes.

Example:

You: What does our runbook say about rolling back a deployment?

Agent calls search_docs("rollback deployment")

Result:
docs/ops/deploy.md:47
"To rollback: run ./scripts/rollback.sh <version>..."

It works locally and does not send your docs anywhere.
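Searching Markdown without embeddings can be as simple as keyword scoring. The sketch below counts how many query terms appear on each line and returns the best lines with their path and line number. It is an illustration of the general idea under my own assumptions: `searchDocs`, `tokenize`, and the types are hypothetical, and the real server may rank results differently.

```typescript
// Hypothetical shapes, used only for this sketch.
interface DocFile {
  path: string;
  contents: string;
}

interface DocHit {
  path: string;
  line: number; // 1-based line number
  text: string;
  score: number; // number of distinct query terms found on the line
}

// Lowercase and split on anything that is not a letter or digit.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((token) => token.length > 0);
}

function searchDocs(docs: DocFile[], query: string, limit = 3): DocHit[] {
  const terms = tokenize(query);
  const found: DocHit[] = [];
  for (const doc of docs) {
    doc.contents.split("\n").forEach((text, index) => {
      const words = tokenize(text);
      const score = terms.filter((term) => words.indexOf(term) !== -1).length;
      if (score > 0) found.push({ path: doc.path, line: index + 1, text, score });
    });
  }
  // Highest score first, then keep only the top results.
  return found.sort((a, b) => b.score - a.score).slice(0, limit);
}

// Usage with a made-up runbook.
const runbooks: DocFile[] = [
  {
    path: "docs/ops/deploy.md",
    contents: "# Deployment\n\nTo rollback a deployment: run ./scripts/rollback.sh <version>",
  },
];
const topHit = searchDocs(runbooks, "rollback deployment")[0];
console.log(topHit.path + ":" + topHit.line); // docs/ops/deploy.md:3
```

Everything here runs in-process on local text, which is the property that keeps the docs from leaving the machine.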

4. `mcp-git`

Lets the agent query git history, diffs, blame, and branches.

This is useful when the agent needs to understand not just what the code does, but why it changed.

Example:

You: Why was this validation added?

Agent checks git blame and recent commits for that file.

Install

The fastest way to get started is:

npx mcp-server-toolkit@latest init

That launches an interactive setup.

Pick the servers you want, provide DATABASE_URL if you are using the database server, and it generates the config for Claude Code, Cursor, Windsurf, or any other MCP-compatible client.

Manual Claude config example:

{
  "mcpServers": {
    "code-search": {
      "command": "npx",
      "args": ["-y", "@mcp-toolkit/code-search", "."]
    },
    "database": {
      "command": "npx",
      "args": ["-y", "@mcp-toolkit/database", "--read-only"],
      "env": {
        "DATABASE_URL": "${DATABASE_URL}"
      }
    },
    "docs": {
      "command": "npx",
      "args": ["-y", "@mcp-toolkit/docs", "./docs"]
    },
    "git": {
      "command": "npx",
      "args": ["-y", "@mcp-toolkit/git", "."]
    }
  }
}

Restart your MCP-compatible client and the tools are available.

Why I Made It

This project came from a very specific frustration.

AI coding agents are getting better, but they still struggle when the answer is buried inside a large repo.

The problem is not always reasoning.

A lot of the time, the problem is retrieval.

The agent simply does not know where to look.

MCP servers help solve that by giving the agent focused tools:

Instead of:
Read random files → guess → maybe answer correctly

You get:
Search with the right tool → inspect relevant result → answer with context

That is a much better workflow for real codebases.

Building Your Own Server

I also added @mcp-toolkit/core to make building new MCP servers easier.

A simple server looks like this:

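import { createServer, tool, z } from '@mcp-toolkit/core';

// Placeholder so the example runs on its own; swap in a call to your flag service.
async function fetchFlags(env: string): Promise<Record<string, boolean>> {
  return { 'new-checkout': env === 'staging' };
}

const server = createServer({
  name: 'my-server',
  version: '1.0.0',
});

server.addTool(
  tool({
    name: 'get_feature_flags',
    description: 'Get all active feature flags for an environment',
    // Zod schema: the agent must pass an object with a string `env` field.
    input: z.object({
      env: z.string().describe('Environment name: staging or production'),
    }),
    run: async ({ env }) => {
      const flags = await fetchFlags(env);

      // Return the flags as pretty-printed JSON text.
      return {
        content: JSON.stringify(flags, null, 2),
      };
    },
  })
);

server.start();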

You can scaffold a new server with:

npm run new-server -- my-server-name

What I Learned

A few things stood out while building this:

The MCP TypeScript SDK is solid.
Most of my time went into the actual tool logic, not the protocol plumbing.

Read-only defaults matter.
Especially for database access. Agents should not get write permissions by accident.

Zod works really well for tool input validation.
When the agent passes the wrong input shape, the error is usually clear enough that it can self-correct.

Roadmap

The current version includes:

Code search
Database
Docs
Git

Next, I am thinking about:

Notion
Linear/Jira
A small web UI for registered tools
More examples for custom MCP servers

Contributions are welcome.

Try It

GitHub:

https://github.com/naveenayalla1-CS50/mcp-server-toolkit

Install:

npx mcp-server-toolkit@latest init

If you try it, I would love feedback, issues, or PRs.

Especially if you are using Claude Code, Cursor, Windsurf, or another MCP-compatible coding agent on a large repo.

Moving RAG From Demo to Production on Databricks: A Developer-Focused Checklist

Naveen Ayalla — Mon, 08 Jun 2026 01:10:23 +0000

This article is adapted from my original post in the Databricks Community and is shared here for developers, data engineers, and GenAI practitioners building production AI workflows.

A RAG demo is easy to build compared to a production RAG system.

For a demo, you can upload documents, create embeddings, connect an LLM, ask a question, and return an answer.

That is a great starting point.

But production needs more than a working answer.

A production RAG workflow has to answer questions like:

Is the source data trusted?
Is the user allowed to access this content?
Did the system retrieve the right context?
Is the answer grounded in that context?
Can we monitor quality, latency, cost, and failures?
Who owns the data and the workflow after launch?

When these questions are ignored, many GenAI projects stall after the demo stage.

Below is a practical checklist I use when thinking about RAG workflows on Databricks.

Demo vs. Production

Area	Demo Thinking	Production Thinking
Data	Use sample documents.	Use trusted, current, approved data.
Access	Assume one access level.	Enforce user permissions and sensitive-data rules.
Retrieval	Return similar chunks.	Return the right context for the right user.
Response	Generate a helpful answer.	Answer only from supported context.
Evaluation	Try a few test prompts.	Measure retrieval quality, groundedness, correctness, and failures.
Monitoring	Check usage.	Track quality, latency, cost, errors, and feedback.
Ownership	AI team owns everything.	Data owners, platform teams, and business users share ownership.

1. Start With a Narrow Use Case

The first mistake is trying to index everything.

A better starting point is one clear use case.

Examples:

Help support teams answer product questions faster.
Help analysts search internal documentation.
Help engineers troubleshoot pipeline failures.
Help business users understand policy documents.

A narrow use case helps you choose better data, test better questions, and measure value more clearly.

2. Use Data You Can Trust

Not every document should go into a RAG system.

Before indexing content, ask:

Who owns the data?
Is it current?
Is it approved for this use case?
Does it include sensitive information?
Which users should be allowed to see it?

If the source data is outdated or poorly governed, the generated answer will not be reliable.

3. Add Metadata Early

Metadata is easy to skip in a demo, but it becomes very useful in production.

Useful metadata includes:

document owner
source system
updated date
department
product name
region
sensitivity level
access group

Metadata helps with filtering, debugging, governance, and retrieval quality.

For example, if two documents answer the same question but one is newer, metadata can help the system prefer the latest source.
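A minimal sketch of that preference, assuming each retrieved candidate carries a relevance score and an `updatedAt` date in ISO format. The `SourceCandidate` type, the `pickPreferredSource` function, and the tolerance value are hypothetical choices for illustration.

```typescript
// Hypothetical candidate shape; updatedAt uses YYYY-MM-DD so string order matches date order.
interface SourceCandidate {
  docId: string;
  score: number; // retrieval relevance, higher is better
  updatedAt: string;
}

// Among candidates that score within `tolerance` of the best match, prefer the newest.
function pickPreferredSource(
  candidates: SourceCandidate[],
  tolerance = 0.05,
): SourceCandidate | undefined {
  if (candidates.length === 0) return undefined;
  const best = Math.max(...candidates.map((candidate) => candidate.score));
  const closeEnough = candidates.filter((candidate) => best - candidate.score <= tolerance);
  return closeEnough.reduce((newest, candidate) =>
    candidate.updatedAt > newest.updatedAt ? candidate : newest,
  );
}

// Usage: two near-identical matches, one of them stale.
const preferred = pickPreferredSource([
  { docId: "refund-policy-v1", score: 0.91, updatedAt: "2023-01-10" },
  { docId: "refund-policy-v2", score: 0.89, updatedAt: "2025-03-02" },
  { docId: "shipping-faq", score: 0.4, updatedAt: "2026-01-01" },
]);
console.log(preferred ? preferred.docId : "none"); // refund-policy-v2
```

The tolerance keeps recency from overriding relevance: a brand-new but unrelated document never wins just because it is new.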

4. Build Access Control Into Retrieval

In enterprise RAG, access control cannot be an afterthought.

If a user cannot access a document directly, they should not be able to access it through an AI assistant.

This means the retrieval layer should respect permissions, sensitivity rules, and data ownership.

On Databricks, this is where a governed lakehouse design becomes important. The AI workflow should follow the same governance principles as the rest of the data platform.
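As a rough sketch of permission-aware retrieval, the filter below drops every chunk the user is not entitled to before ranking, so restricted text can never reach the prompt. The `Chunk` and `User` shapes and the `retrieveForUser` function are assumptions for illustration. In a real deployment the check would be enforced by the platform's governance layer rather than by application code alone.

```typescript
// Hypothetical shapes, used only for this sketch.
interface Chunk {
  id: string;
  text: string;
  accessGroups: string[]; // groups allowed to read the source document
  score: number; // retrieval relevance, higher is better
}

interface User {
  id: string;
  groups: string[];
}

// Filter by permission first, then rank, then truncate to the top K.
function retrieveForUser(chunks: Chunk[], user: User, topK = 3): Chunk[] {
  return chunks
    .filter((chunk) => chunk.accessGroups.some((group) => user.groups.indexOf(group) !== -1))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

Filtering after ranking would be a subtle bug: a restricted chunk could take a top-K slot and then be removed, leaving the user with fewer results than an unrestricted query would return.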

5. Evaluate Retrieval and Generation Separately

When a RAG answer is wrong, it is important to know why.

The issue may be retrieval.
The issue may be the model.
The issue may be missing data.
The issue may be stale content.
The issue may be bad chunking.

That is why I prefer to evaluate retrieval and answer generation separately.

Evaluation Area	Main Question
Retrieval quality	Did the system retrieve the right context?
Answer quality	Did the model use the context correctly?

This makes debugging much easier.
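The two questions in the table can be scored independently. The sketch below uses recall at K for retrieval and a crude word-overlap ratio as a stand-in for groundedness. Both functions and their names are my own illustrative choices. Production evaluation would normally rely on labeled datasets and stronger answer-quality judges.

```typescript
// Retrieval quality: what share of the known-relevant documents made it into the top K?
function recallAtK(retrievedIds: string[], relevantIds: string[], k: number): number {
  if (relevantIds.length === 0) return 0;
  const topK = retrievedIds.slice(0, k);
  const found = relevantIds.filter((id) => topK.indexOf(id) !== -1).length;
  return found / relevantIds.length;
}

// Keep only words longer than three characters to skip most filler words.
function contentWords(text: string): string[] {
  return text.toLowerCase().split(/[^a-z0-9]+/).filter((word) => word.length > 3);
}

// Answer quality proxy: what share of the answer's content words appear in the context?
function supportRatio(answer: string, context: string): number {
  const answerWords = contentWords(answer);
  if (answerWords.length === 0) return 0;
  const contextWords = contentWords(context);
  const supported = answerWords.filter((word) => contextWords.indexOf(word) !== -1).length;
  return supported / answerWords.length;
}

// Usage: retrieval found one of two relevant documents; the answer stays within the context.
console.log(recallAtK(["d3", "d1", "d9"], ["d1", "d2"], 2)); // 0.5
console.log(supportRatio("Run the rollback script", "To rollback, run the rollback script")); // 1
```

A low recall with a high support ratio points at retrieval. The reverse points at generation.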

6. Tell the Model When to Stop

One of the most useful production rules is simple:

If the retrieved context is not enough, say that the information is not available instead of guessing.

For internal business users, a confident wrong answer is worse than a clear limitation.

A good RAG system should know when not to answer.
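A minimal sketch of that rule: if no retrieved chunk clears a relevance threshold, return a fixed refusal and skip the model call. Otherwise, build a prompt that repeats the same instruction. The threshold of 0.6, the `decide` function, and the message text are assumptions for illustration and would need tuning against real evaluation data.

```typescript
interface ScoredChunk {
  text: string;
  score: number; // similarity between 0 and 1, higher is better
}

type RagDecision =
  | { kind: "refuse"; message: string }
  | { kind: "answer"; prompt: string };

const NOT_AVAILABLE = "That information is not available in the approved sources.";

// Refuse before calling the model when nothing relevant was retrieved.
function decide(question: string, chunks: ScoredChunk[], minScore = 0.6): RagDecision {
  const usable = chunks.filter((chunk) => chunk.score >= minScore);
  if (usable.length === 0) {
    return { kind: "refuse", message: NOT_AVAILABLE };
  }
  const context = usable.map((chunk) => "- " + chunk.text).join("\n");
  const prompt =
    "Answer only from the context below. If the context does not contain the answer, reply: " +
    NOT_AVAILABLE +
    "\n\nContext:\n" +
    context +
    "\n\nQuestion: " +
    question;
  return { kind: "answer", prompt };
}
```

The check runs twice on purpose: once in code, where it is deterministic, and once in the prompt, where it covers chunks that look relevant but do not contain the answer.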

7. Monitor After Launch

A RAG system changes after it goes live.

Users ask new questions.
Documents get updated.
Models change.
Costs change.
Business rules change.

After launch, monitor:

user feedback
failed questions
retrieval quality
latency
cost
error rate
outdated sources
low-confidence answers

Monitoring should feed back into better data preparation, improved metadata, better prompts, and stronger evaluation datasets.
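Several of those signals can be derived from a simple per-request log. The sketch below summarizes error rate, 95th-percentile latency (nearest-rank method), and average cost. The `RequestLog` fields and the `summarize` function are hypothetical. Real systems would typically compute these in the platform's monitoring tooling.

```typescript
// Hypothetical per-request record.
interface RequestLog {
  latencyMs: number;
  costUsd: number;
  failed: boolean;
}

interface Summary {
  requests: number;
  errorRate: number;
  p95LatencyMs: number;
  avgCostUsd: number;
}

function summarize(logs: RequestLog[]): Summary {
  if (logs.length === 0) {
    return { requests: 0, errorRate: 0, p95LatencyMs: 0, avgCostUsd: 0 };
  }
  // Sort a copy of the latencies in ascending order.
  const latencies = logs.map((log) => log.latencyMs).sort((a, b) => a - b);
  // Nearest-rank percentile: the value at position ceil(0.95 * n), counted from 1.
  const rank = Math.ceil(0.95 * latencies.length);
  const failures = logs.filter((log) => log.failed).length;
  const totalCost = logs.reduce((sum, log) => sum + log.costUsd, 0);
  return {
    requests: logs.length,
    errorRate: failures / logs.length,
    p95LatencyMs: latencies[rank - 1],
    avgCostUsd: totalCost / logs.length,
  };
}
```

Tracking the 95th percentile rather than the average keeps a handful of very slow answers from hiding behind many fast ones.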

Final Thought

Production RAG is not just an LLM connected to a vector index.

It is a governed data product.

It needs trusted data, metadata, permissions, evaluation, monitoring, and clear ownership.

Databricks can be a strong foundation for this kind of workflow because data engineering, governance, machine learning, and AI workflows can be connected through the lakehouse approach.

I would like to hear from other developers and data engineers:

What has been the hardest part of moving RAG from demo to production: access control, retrieval quality, evaluation, monitoring, cost, or user adoption?

Original post in the Databricks Community: https://community.databricks.com/t5/data-engineering/from-rag-demo-to-production-on-databricks-7-things-teams-should/m-p/158526#M54730

From 2 AM Failures to 10x Speed: How We Escaped the Stored Procedure Prison

Naveen Ayalla — Sat, 06 Jun 2026 07:18:53 +0000

The Hook: The 2 AM Wake-Up Call

It was 2 AM on a Tuesday. Again.

My phone buzzed with the all-too-familiar alert: “SAP HANA ETL job failed — timeout exceeded.” This wasn't just a technical glitch; it was a systemic failure of an architecture pushed to its limits. Our nightly batch processes, powered by over 200 deeply nested stored procedures, had become a fragile ecosystem. For the third time that month, the finance team would be without their reports at 8 AM, and as the Data Engineering Lead, the responsibility sat squarely on my shoulders.

The reality of managing these legacy systems is that you eventually stop being an engineer and start being an archaeologist. These procedures had evolved over seven years — written by consultants long gone and modified by developers who left no documentation. Debugging them felt like an archaeological excavation of technical debt. I knew that to save our SLAs (and my sleep schedule), we had to migrate to Databricks and PySpark.

Takeaway 1: The “Black Box” Logic is Your Biggest Liability

In my experience, the primary danger of a legacy SAP HANA environment isn't just the performance lag; it's the existential risk of “black box” logic. When your business logic is trapped in proprietary SQLScript, it becomes a liability. Without version control or unit tests, the core of your company's data intelligence is unverified and unscalable.

This creates a “proprietary prison” of vendor lock-in. When business logic is coupled so tightly to a specific database dialect, your ability to scale is limited by the physical constraints of expensive, vertically-scaled hardware. From a business continuity standpoint, undocumented, nested logic isn't just annoying — it's a risk to the entire department's standing with leadership.

“The worst part? Nobody truly understood the stored procedures anymore. They had evolved over 7 years — written by consultants who left, modified by developers who forgot to document, and nested so deeply that debugging felt like archaeological excavation.”

Takeaway 2: Killing Cursors is the Key to 10x Performance

The most significant performance leap came from a fundamental shift in mindset: killing cursors. Our SAP HANA stored procedures relied heavily on row-by-row cursor loops, which are inherently sequential and resource-heavy. By moving to the set-based operations of PySpark, we didn't just see marginal gains; we saw a transformation.

The numbers speak for themselves. In our migration, a process that once took 45 minutes using a SAP HANA cursor was slashed to just 90 seconds, a 30x improvement. But the real “aha!” moment for our team was our most complex procedure, which dropped from 94 minutes to a mere 6 minutes, roughly a 15.7x jump in performance. This is the power of moving from sequential database execution to a distributed processing mindset.

SAP HANA Pattern	PySpark Equivalent	Why It's Faster
Cursors & Manual Loops	Set-Based Operations	Distributed execution across the cluster nodes
Nested Row Calculations	Window Functions	One partitioned pass with an optimized execution plan instead of per-row lookups
Recursive CTEs	GraphFrames	Efficient handling of deep hierarchical trees
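The speed-up figures for the two procedures are plain ratios of runtime before and after the migration. A quick check of that arithmetic, using the runtimes quoted above (the `speedup` helper is just an illustrative name):

```typescript
// Speed-up factor: how many times faster the new runtime is.
function speedup(beforeMinutes: number, afterMinutes: number): number {
  return beforeMinutes / afterMinutes;
}

const cursorJob = speedup(45, 1.5); // 45 minutes down to 90 seconds
const complexProcedure = speedup(94, 6); // 94 minutes down to 6 minutes

console.log(cursorJob.toFixed(1) + "x"); // 30.0x
console.log(complexProcedure.toFixed(1) + "x"); // 15.7x
```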

Takeaway 3: Cost Reductions are Not Just Marginal, They're Massive

When I presented the billing report after our first month on the Databricks Lakehouse, the numbers didn't just meet expectations; they fundamentally changed our standing with the CFO. We achieved a staggering reduction of roughly 83% in monthly compute costs, dropping from $18,500 to just $3,200.

That is a saving of $15,300 per month, or roughly $183,600 per year. That's more than enough to fund an additional senior headcount for the team.
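For anyone who wants to rerun the numbers with their own bill, the calculation is short. The function and field names below are illustrative. The inputs are the two monthly figures quoted above.

```typescript
interface CostComparison {
  monthlySavings: number;
  annualSavings: number;
  reductionPercent: number;
}

// Compare two monthly compute bills, both in the same currency.
function compareMonthlyCost(before: number, after: number): CostComparison {
  const monthlySavings = before - after;
  return {
    monthlySavings,
    annualSavings: monthlySavings * 12,
    reductionPercent: (monthlySavings / before) * 100,
  };
}

const computeBill = compareMonthlyCost(18500, 3200);
console.log(computeBill.monthlySavings); // 15300
console.log(computeBill.annualSavings); // 183600
console.log(computeBill.reductionPercent.toFixed(1) + "%"); // 82.7%
```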

The financial drain of SAP HANA comes from its vertical scaling model: you pay for high-spec, expensive nodes that sit idle or underutilized just to handle peak loads. By moving to a Lakehouse architecture, we stopped paying for peak capacity 24/7. We now pay only for the compute we actually use during transformations, leveraging elastic scaling and cost-effective cloud storage.

Takeaway 4: Technical “Gotchas” Live in the Details of NULLs and Decimals

If you're planning this jump, let me tell you exactly where the bodies are buried. While the performance gains are intoxicating, the technical nuances can compromise your data integrity if you aren't disciplined:

NULL Aggregations: SAP HANA and Spark handle NULLs differently. In my experience, skipping a thorough validation of these differences is the fastest way to lose the trust of your finance stakeholders.
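Decimal Precision: SAP HANA supports higher precision by default. Spark's DecimalType defaults to precision 10 and scale 0 and allows a maximum precision of 38, so to avoid rounding errors that could break a balance sheet, you must explicitly set decimal precision and scale at the schema level in PySpark.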
The “Side Effect” Nightmare: Stored procedures often modify data outside their declared scope — a concept foreign to PySpark's functional, side-effect-free programming model. We had to reverse-engineer these hidden behaviors to ensure the new logic captured every “ghost” rule.
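Architect-Level Partitioning: Don't rely on auto-partitioning. For my workload, partitioning by order_year, a low-cardinality column that most queries filter on, reduced scan times by 80%. Avoid partitioning directly on high-cardinality columns, which tends to produce many small files.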

To ensure safety, we mandated row-for-row validation for a full week before the cutover. The execution engine changed, but the data remained 100% identical.
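A row-for-row check of this kind can be sketched as a keyed comparison between the legacy output and the migrated output. In the sketch below, amounts are kept as decimal text so that no floating-point rounding sneaks into the comparison, and NULL only ever equals NULL. The `LedgerRow` shape and the `compareOutputs` function are assumptions for illustration, not the validation harness we actually ran.

```typescript
// Hypothetical row shape: amount stays as decimal text, null represents SQL NULL.
interface LedgerRow {
  key: string;
  amount: string | null;
}

interface OutputDiff {
  missing: string[]; // keys present in legacy output but absent after migration
  extra: string[]; // keys that only exist after migration
  mismatched: string[]; // keys present in both with different amounts
}

function compareOutputs(legacy: LedgerRow[], migrated: LedgerRow[]): OutputDiff {
  // Prototype-free objects used as simple lookup tables.
  const migratedByKey: Record<string, LedgerRow> = Object.create(null);
  const legacyKeys: Record<string, boolean> = Object.create(null);
  for (const row of migrated) migratedByKey[row.key] = row;

  const diff: OutputDiff = { missing: [], extra: [], mismatched: [] };
  for (const row of legacy) {
    legacyKeys[row.key] = true;
    const other = migratedByKey[row.key];
    if (other === undefined) {
      diff.missing.push(row.key);
    } else if (other.amount !== row.amount) {
      // Strict equality: "0.00" and null are different, and so are "1.0" and "1.00".
      diff.mismatched.push(row.key);
    }
  }
  for (const row of migrated) {
    if (!legacyKeys[row.key]) diff.extra.push(row.key);
  }
  return diff;
}
```

Treating a NULL that silently became 0.00 as a mismatch is the point: that is exactly the kind of difference that changes an aggregate without changing the row count.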

Takeaway 5: Transitioning from a Database to a Development Platform

The real “escape” from the legacy prison isn't just about code; it's about engineering rigor. We moved from a database-centric world to a platform-centric world.

By adopting Databricks, we introduced CI/CD, Git version control, and automated unit testing into our workflow. This democratization of the codebase meant that junior engineers could finally understand and modify transformations that were previously locked behind the “black box” of senior-only SQLScript knowledge.

Furthermore, features like Delta Time Travel became our “get out of jail free” card. Being able to “time travel” back to a previous state of the data to debug a failure in minutes — rather than hours of archaeological digging — has completely changed our operational velocity. We are no longer just running daily batches; we are now near real-time and streaming ready.

Conclusion: Your Future Sleep Schedule Depends on the Leap

The results of this migration were definitive: we slashed our total nightly ETL runtime from 5 hours and 23 minutes to just 28 minutes. We turned a $220k annual liability into a lean, $38k high-performance asset.

By escaping the proprietary prison of stored procedures, we empowered our team, secured our data integrity, and restored our sleep schedules. The data team is no longer the bottleneck; we are the engine of the company.

Here is the question for you: Is your current technical debt a managed expense, or is it a liability waiting to wake you up at 2 AM? It might be time to take the leap.

Building a PySpark and AWS Glue ETL Pipeline for Search Keyword Revenue Analysis

Naveen Ayalla — Sun, 31 May 2026 23:10:33 +0000

I published a public data engineering project that demonstrates a cloud-based ETL pipeline for analyzing web analytics search keyword revenue.

The project uses PySpark, AWS Glue, Amazon S3, and Terraform to process hit-level web analytics data, extract external search engine domains and keywords, parse revenue, and generate a sorted reporting output.

Key concepts covered:

Batch ETL pipeline design
PySpark transformations
AWS Glue job configuration
S3 input and output workflow
Revenue aggregation logic
Terraform infrastructure examples
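The core transformation can be sketched in a few lines, independent of Spark: pull the search engine domain and keyword out of each hit's referrer, sum revenue per pair, and sort descending. The version below is a plain TypeScript illustration under my own assumptions, not the project's PySpark code. The `Hit` shape, the search-engine allow-list, and the `q` and `p` keyword parameters are all hypothetical choices.

```typescript
interface Hit {
  referrer: string; // full referrer URL recorded on the hit
  revenue: number; // revenue attributed to the hit, 0 when there was no purchase
}

interface KeywordRevenue {
  domain: string;
  keyword: string;
  revenue: number;
}

// Assumed allow-list of external search engines and their keyword parameters.
const SEARCH_ENGINE = /(?:^|\.)((?:google|bing|yahoo)\.com)$/;
const KEYWORD_PARAMS = ["q", "p"];

function extractSearch(referrer: string): { domain: string; keyword: string } | null {
  let url: URL;
  try {
    url = new URL(referrer);
  } catch {
    return null; // not a parseable URL
  }
  const engine = SEARCH_ENGINE.exec(url.hostname.toLowerCase());
  if (engine === null) return null; // not an external search engine
  for (const param of KEYWORD_PARAMS) {
    const value = url.searchParams.get(param);
    if (value) return { domain: engine[1], keyword: value.trim().toLowerCase() };
  }
  return null;
}

function revenueByKeyword(hits: Hit[]): KeywordRevenue[] {
  const index: Record<string, KeywordRevenue> = Object.create(null);
  const rows: KeywordRevenue[] = [];
  for (const hit of hits) {
    const search = extractSearch(hit.referrer);
    if (search === null) continue;
    const key = search.domain + "\t" + search.keyword;
    const existing = index[key];
    if (existing === undefined) {
      const row = { domain: search.domain, keyword: search.keyword, revenue: hit.revenue };
      index[key] = row;
      rows.push(row);
    } else {
      existing.revenue += hit.revenue;
    }
  }
  // Highest revenue first, matching the sorted reporting output.
  return rows.sort((a, b) => b.revenue - a.revenue);
}
```

Lowercasing the keyword before grouping means "Ipod" and "ipod" count as the same search term.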

This is a generic open-source portfolio project and does not include proprietary or company-provided data.

GitHub: https://github.com/naveenayalla1-CS50/search-keyword-performance-revenue

Feedback from data engineers and cloud data practitioners is welcome.
