<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Manikandan Pandurangan</title>
    <description>The latest articles on DEV Community by Manikandan Pandurangan (@manikandan_pandurangan_16).</description>
    <link>https://dev.to/manikandan_pandurangan_16</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3997727%2Fe81d2b6e-1a14-4753-9a14-1eca9ad09d4e.jpg</url>
      <title>DEV Community: Manikandan Pandurangan</title>
      <link>https://dev.to/manikandan_pandurangan_16</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/manikandan_pandurangan_16"/>
    <language>en</language>
    <item>
      <title>What If Your Employees Never Had to Know Which System to Check?</title>
      <dc:creator>Manikandan Pandurangan</dc:creator>
      <pubDate>Wed, 24 Jun 2026 16:07:11 +0000</pubDate>
      <link>https://dev.to/manikandan_pandurangan_16/what-if-your-employees-never-had-to-know-which-system-to-check-4me1</link>
      <guid>https://dev.to/manikandan_pandurangan_16/what-if-your-employees-never-had-to-know-which-system-to-check-4me1</guid>
      <description>&lt;p&gt;&lt;em&gt;A practical look at building one AI desk that talks to your documents, your database, and the web. All at the same time.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  The problem nobody talks about out loud
&lt;/h2&gt;

&lt;p&gt;Someone on the operations team needs to know the incident response runbook for a specific system. They ask a colleague. That colleague isn't sure. They dig through Confluence, try a search, find something from 2022, hope it's still valid.&lt;/p&gt;

&lt;p&gt;Meanwhile someone in data analytics wants yesterday's order count. They open a BI tool. Filter wrong. Give up. Ping the data team.&lt;/p&gt;

&lt;p&gt;These are not technology failures. They're routing failures. The answers exist. Nobody knows where to look.&lt;/p&gt;

&lt;p&gt;One Desk AI is a working attempt to fix that.&lt;/p&gt;




&lt;h2&gt;
  
  
  What it actually does
&lt;/h2&gt;

&lt;p&gt;One question. One box. The system figures out where the answer is.&lt;/p&gt;

&lt;p&gt;Ask it something about an internal process and it searches your uploaded documents using semantic (meaning-based) search. Ask it about data and it writes and runs a database query on your behalf. Ask it something general and it searches the web, reads the relevant pages, and gives you a summary.&lt;/p&gt;

&lt;p&gt;You don't choose which mode. The system does.&lt;/p&gt;

&lt;p&gt;The response comes back the same way every time: clean, with any personal data removed, and with a full trace of which agent ran, why it was chosen, and how long each step took.&lt;/p&gt;




&lt;h2&gt;
  
  
  The four agents behind the single answer
&lt;/h2&gt;

&lt;p&gt;The system is built from four specialized agents, and each one does exactly one job. One of the first three answers the question, and the fourth always runs last.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Knowledge Brain&lt;/strong&gt; handles your internal documents. It uses vector search (think of it as search that understands meaning, not just keywords) over an OpenSearch index. If a question contains an organization name or mentions internal content, this agent runs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SQL Agent&lt;/strong&gt; handles data questions. It does not simply generate a query and run it. It generates the query, then has a second model verify it for safety before execution. This prevents the obvious disasters.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Research Agent&lt;/strong&gt; handles everything else. It runs a Google search, reads the actual pages, and synthesizes a response. Not snippets. The full content.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Author Agent&lt;/strong&gt; runs last on every response. It reformats the output for readability, strips any personally identifiable information, and is the only agent that writes to the user. One output point, always.&lt;/p&gt;

&lt;p&gt;There is also a guardrail layer at the front. Before any agent runs, the input is checked for SQL injection attempts and similar attacks. Blocked inputs never reach the agents.&lt;/p&gt;
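
&lt;p&gt;A minimal sketch of that flow, guardrail first and routing second. The keyword lists, the injection patterns, and the function names are hypothetical, and simple keyword rules stand in for whatever logic the real router uses. Checking data keywords before organization names is also an assumption.&lt;/p&gt;

```python
import re

# Hypothetical settings; the article keeps the real ones in a single YAML file.
ORG_NAMES = ["acme"]
DATA_WORDS = ["how many", "total", "average"]
INJECTION_PATTERNS = [
    r";\s*drop\s+table",
    r"\bunion\s+select\b",
    r"'\s*or\s+'1'\s*=\s*'1",
]


def is_blocked(question):
    """Return True when the input matches a known injection pattern."""
    return any(re.search(p, question, re.IGNORECASE) for p in INJECTION_PATTERNS)


def route(question):
    """Pick one agent name, or 'blocked' if the guardrail rejects the input."""
    if is_blocked(question):
        return "blocked"
    text = question.lower()
    if any(word in text for word in DATA_WORDS):
        return "sql_agent"
    if any(org in text for org in ORG_NAMES):
        return "knowledge_brain"
    # Everything else falls through to web research.
    return "research_agent"
```

&lt;p&gt;Whatever route returns, the Author Agent would still run afterwards on the chosen agent's output.&lt;/p&gt;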




&lt;h2&gt;
  
  
  Why this matters for non-technical managers
&lt;/h2&gt;

&lt;p&gt;The honest version: most AI tools that companies buy are wrappers around a chat interface with one source of truth. Ask a question about a document, get an answer about the document. That's it.&lt;/p&gt;

&lt;p&gt;This system connects three sources at once and decides between them per question. A new hire asking "what's our leave policy?" gets the HR document. A manager asking "how many leave requests were filed last quarter?" gets a live database query. No manual switching. No knowing in advance which system holds the answer.&lt;/p&gt;

&lt;p&gt;The evaluation layer matters too. Every response is written to a database table: the original question, which agent handled it, the raw output, the final response, whether any PII was masked, and latency per stage. That's not for debugging. That's the audit trail compliance teams ask for and rarely get from AI tools.&lt;/p&gt;
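
&lt;p&gt;As a rough illustration, one audit row could look like the sketch below. The table name, the column names, and the use of SQLite are assumptions made for the example, not the repository's actual schema.&lt;/p&gt;

```python
import json
import sqlite3
from dataclasses import dataclass, field


@dataclass
class AuditRecord:
    question: str
    agent: str
    raw_output: str
    final_response: str
    pii_masked: bool
    latency_ms: dict = field(default_factory=dict)  # stage name to milliseconds


def save(conn, rec):
    """Append one record to a hypothetical audit_log table."""
    conn.execute(
        "CREATE TABLE IF NOT EXISTS audit_log ("
        "question TEXT, agent TEXT, raw_output TEXT, "
        "final_response TEXT, pii_masked INTEGER, latency_ms TEXT)"
    )
    conn.execute(
        "INSERT INTO audit_log VALUES (?, ?, ?, ?, ?, ?)",
        (rec.question, rec.agent, rec.raw_output, rec.final_response,
         int(rec.pii_masked), json.dumps(rec.latency_ms)),
    )
    conn.commit()
```

&lt;p&gt;Storing the per-stage latency as JSON keeps the sketch short. A real table might use one column per stage instead.&lt;/p&gt;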




&lt;h2&gt;
  
  
  Where it runs and what it costs to operate
&lt;/h2&gt;

&lt;p&gt;The system deploys on AWS ECS Fargate, which means there are no servers to manage. The containers run as ECS tasks, and the service can scale the number of tasks with demand. The AI model runs on Amazon Bedrock (Claude 3.5 Sonnet), which means pay-per-use pricing with no GPU procurement.&lt;/p&gt;

&lt;p&gt;For a mid-sized company with moderate query volume, the infrastructure cost is low. The bigger cost is the initial setup: getting documents into the search index, configuring which organization names the knowledge agent should recognize, and connecting the database.&lt;/p&gt;

&lt;p&gt;Setup instructions are in the repository. The config that controls everything, including which organizations the system knows about, is a single YAML file. One edit, redeploy, done.&lt;/p&gt;




&lt;h2&gt;
  
  
  What this is not
&lt;/h2&gt;

&lt;p&gt;It is not a replacement for a proper knowledge management system. If your internal documents are scattered, incomplete, or outdated, this system will surface scattered, incomplete, or outdated answers with confidence.&lt;/p&gt;

&lt;p&gt;It does not handle ambiguous questions well. "How are we doing?" will confuse the routing layer. Specific questions get better answers.&lt;/p&gt;

&lt;p&gt;It does not do anything about data quality in your database. If the data is wrong, the query results are wrong.&lt;/p&gt;




&lt;h2&gt;
  
  
  A note on the technical choices
&lt;/h2&gt;

&lt;p&gt;The LangGraph version constraint is worth reading even if you never touch the code. The README documents which exact package versions work together and why upgrading them individually causes silent failures. That section alone is worth saving for anyone who has debugged a LangChain version mismatch at midnight.&lt;/p&gt;

&lt;p&gt;The WebSocket implementation is also non-obvious. Long-running AI responses don't fit neatly into a standard HTTP request-response cycle. The system streams progress events back to the client so users see which agent is running while the answer is being assembled. The Angular integration contract is documented with working code.&lt;/p&gt;




&lt;h2&gt;
  
  
  Guardrails and PII masking are built to extend
&lt;/h2&gt;

&lt;p&gt;This is a multi-agent system, so the guardrail layer was designed as infrastructure rather than a fixed list of rules.&lt;/p&gt;

&lt;p&gt;The input guardrail blocks SQL injection attempts before any agent runs. The Author Agent masks PII on every response, regardless of which agent produced the answer. Both sit at fixed points in the graph, so adding a new check means editing one file, not hunting across four agents.&lt;/p&gt;

&lt;p&gt;The config that controls routing also controls which patterns the guardrail flags. A team in a regulated industry can add domain-specific risk patterns in the same YAML that defines their organization names. No code change required for the common cases.&lt;/p&gt;

&lt;p&gt;The masking rules follow the same pattern. New entity types can be added through config. The current defaults cover emails, phone numbers, and national ID formats for the Indian regulatory context (aligned with DPDP Act requirements). Adding a new format is a config entry.&lt;/p&gt;
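
&lt;p&gt;A small sketch of config-driven masking, assuming the rules are plain regular expressions keyed by entity label. The labels and patterns below are illustrative guesses (an email, a 12-digit Aadhaar-style number, and an Indian mobile number), not the repository's defaults.&lt;/p&gt;

```python
import re

# Hypothetical rules; the article keeps the equivalent settings in a YAML file.
MASKING_RULES = {
    "EMAIL": r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+",
    "AADHAAR": r"\b\d{4}[\s-]?\d{4}[\s-]?\d{4}\b",
    "PHONE": r"(?:\+91[\s-]?)?\b[6-9]\d{9}\b",
}


def mask_pii(text, rules=MASKING_RULES):
    """Replace each match with a [LABEL] placeholder and report whether anything changed."""
    masked = text
    for label, pattern in rules.items():
        masked = re.sub(pattern, "[" + label + "]", masked)
    return masked, masked != text
```

&lt;p&gt;In this sketch, adding a new entity type means adding one more entry to the dictionary, which mirrors the one-config-entry idea described above.&lt;/p&gt;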

&lt;p&gt;The design intent: guardrails in a multi-agent system should be easier to extend than a single-model wrapper because the insertion points are explicit and documented.&lt;/p&gt;




&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;The current system handles text queries. The next logical step is audio: speak the question, get the answer read back. The architecture already supports it. Web search quality could also improve by adding domain filtering for trusted sources.&lt;/p&gt;




&lt;h2&gt;
  
  
  Read the full code
&lt;/h2&gt;

&lt;p&gt;Everything described here is working code, not a demo or a mockup.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Manisoft55-lab/one-desk-ai" rel="noopener noreferrer"&gt;github.com/Manisoft55-lab/one-desk-ai&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The README walks through everything from local setup with Docker Compose to full ECS Fargate deployment. The test client lets you try all four agent paths with one command.&lt;/p&gt;

&lt;p&gt;If you build something on top of it or run into a version issue the docs don't cover, open an issue.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;Built at Manisoft Labs. Questions about the architecture or deployment: find me on &lt;a href="https://www.linkedin.com/in/manikandan-pandurangan-tech-enthusiast/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>ai</category>
      <category>agents</category>
      <category>programming</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Don't Let Your Jarvis Become Ultron: A Field Guide to Testing Agentic AI Systems</title>
      <dc:creator>Manikandan Pandurangan</dc:creator>
      <pubDate>Tue, 23 Jun 2026 01:46:43 +0000</pubDate>
      <link>https://dev.to/manikandan_pandurangan_16/dont-let-your-jarvis-become-ultron-a-field-guide-to-testing-agentic-ai-system-5c7m</link>
      <guid>https://dev.to/manikandan_pandurangan_16/dont-let-your-jarvis-become-ultron-a-field-guide-to-testing-agentic-ai-system-5c7m</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fqio1bcle4ty16uwlzy38.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.us-east-2.amazonaws.com%2Fuploads%2Farticles%2Fqio1bcle4ty16uwlzy38.png" alt=" " width="800" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 1: Component tests.&lt;/strong&gt; Write deterministic unit tests for each layer: test_research_agent.py, test_web_search_tool.py, test_user_profile_memory.py. Use mock data that your domain expert has signed off on. These run on every commit, cost nothing, and catch the obvious breakages before any LLM call gets billed. While you're here, stub the external APIs too (GA4, Shopify, Meta, OpenSearch). If a test goes red because Shopify was down, it isn't telling you anything about your agent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 2: The prompt repository.&lt;/strong&gt; Sit with the domain expert and collect the sharpest prompts you can, the ones that force specific tools, functions, agents, and memory to fire. Tag each prompt with what it's supposed to exercise, and group prompts by business area so a change in one area only re-runs its own set. This is the most valuable thing you'll build, so treat it that way.&lt;/p&gt;

&lt;p&gt;Two kinds of prompts people forget. First, the failure cases: out-of-scope questions, prompt injection, ambiguous input, empty or malformed tool responses, timeouts. In banking, a prompt that checks whether the agent correctly refuses to give financial advice is a real test, not an edge case. Second, the multi-turn cases. Memory bugs show up across a conversation, not in a single call. Does it carry context forward, drop it when it should, and never leak one user's profile into another user's session? That last one matters a lot under DPDP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 3: Coverage and trajectory.&lt;/strong&gt; Run the whole repository and confirm every agent and tool actually fired. That's the coverage check. Then go one level deeper and look at the path the agent took to get there. A tool firing isn't the same as the right tool firing, with the right arguments, in the right order, without three pointless detours, and recovering when a tool returned an error. This trajectory check is the part most teams skip, and it's the part that's specific to agents rather than plain LLM apps.&lt;/p&gt;
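
&lt;p&gt;One way to express a trajectory check, as a sketch. It assumes each run can be reduced to an ordered list of tool calls. ToolCall and check_trajectory are hypothetical names, and error recovery is left out to keep the example short.&lt;/p&gt;

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    tool: str
    args: tuple = ()  # (name, value) pairs, kept hashable for easy comparison


def check_trajectory(expected, actual, max_extra_calls=0):
    """Return a list of problems; an empty list means the path was acceptable."""
    problems = []
    # Every expected call must appear in the actual run, in the same order.
    remaining = iter(actual)
    for step in expected:
        if not any(call == step for call in remaining):
            problems.append("missing or out of order: " + step.tool)
    # Calls beyond the expected ones count as detours.
    extra = len(actual) - len(expected)
    if extra > max_extra_calls:
        problems.append("detours: " + str(extra) + " extra call(s)")
    return problems
```

&lt;p&gt;Because the arguments are part of the comparison, the right tool called with the wrong arguments is reported as a missing step rather than a pass.&lt;/p&gt;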

&lt;p&gt;&lt;strong&gt;Stage 4: Versioned runs and capture.&lt;/strong&gt; Stamp every run with a version (gpt-5.5-upgrade-20260623) and store every response against it. Now regression is something you can point to instead of something you argue about. Two additions. Run each prompt several times rather than once, because the model is stochastic and a single scored run is closer to a coin flip than a test. Track the pass rate and the variance. And capture cost, tokens, latency, and tool-call count on every run, because the upgrade decision is a trade-off. "Four percent more accurate at three times the tokens and twice the latency" is a business call, and you can't make it without those numbers in front of you.&lt;/p&gt;
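
&lt;p&gt;The repeated-run idea can be sketched with the standard library. The function name and the 0.8 pass threshold are assumptions. The scores stand for whatever per-run metric your evaluator produces.&lt;/p&gt;

```python
from statistics import mean, pvariance


def summarize_runs(scores, pass_threshold=0.8):
    """Summarize repeated runs of one prompt under one version stamp."""
    passes = [1 if s >= pass_threshold else 0 for s in scores]
    return {
        "runs": len(scores),
        "pass_rate": mean(passes),
        "mean_score": mean(scores),
        "variance": pvariance(scores),  # population variance across the runs
    }
```

&lt;p&gt;Two versions with the same mean score can differ sharply in variance, which is why both numbers are worth storing next to cost, tokens, and latency.&lt;/p&gt;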

&lt;p&gt;&lt;strong&gt;Stage 5: Ground truth store.&lt;/strong&gt; Keep domain-expert-verified ground truths for each prompt and tool, versioned the same way (...-20250510). One thing to decide early: who is allowed to change a ground truth, and how that approval gets recorded. When the product changes for real, old ground truths go stale, and without an update process the suite slowly starts failing things that are actually correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 6: The evaluator.&lt;/strong&gt; Score each candidate run against the ground truth using Ragas plus an LLM judge, on precision, recall, completeness, correctness, and whatever else the business asks for. The catch is that your judge is also an LLM, with its own biases toward longer answers, toward whatever comes first, toward its own style. Keep a small set of human-labelled examples and check how often the judge agrees with the humans. If you don't, you get metrics that are wrong and confident at the same time.&lt;/p&gt;
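
&lt;p&gt;Checking the judge against humans can start as simply as the sketch below. It assumes both sides assign one categorical label per example, and it adds Cohen's kappa as one common way to correct raw agreement for chance. The function names are hypothetical.&lt;/p&gt;

```python
def agreement_rate(judge, human):
    """Fraction of examples where the judge and the human gave the same label."""
    if len(judge) != len(human) or not judge:
        raise ValueError("need two non-empty label lists of equal length")
    return sum(j == h for j, h in zip(judge, human)) / len(judge)


def cohen_kappa(judge, human):
    """Agreement corrected for chance; 1.0 is perfect, 0.0 is chance level."""
    observed = agreement_rate(judge, human)
    n = len(judge)
    labels = set(judge) | set(human)
    # Chance agreement: probability both pick the same label independently.
    expected = sum((judge.count(l) / n) * (human.count(l) / n) for l in labels)
    if expected == 1.0:
        return 1.0
    return (observed - expected) / (1 - expected)
```

&lt;p&gt;A judge that agrees with humans 75 percent of the time can still have a modest kappa when one label dominates, so it helps to look at both numbers.&lt;/p&gt;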

&lt;p&gt;&lt;strong&gt;Stage 7: Dashboard and human review.&lt;/strong&gt; Surface the low-scoring cases and let a human confirm or correct both the ground truth and the new response. The same screen does double duty: the labels people produce here are what you use to calibrate the judge in Stage 6.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Stage 8: CI/CD.&lt;/strong&gt; Decide where this runs. Component tests on every pull request, the full evaluation suite nightly and before a release, and a gate that blocks the deploy when scores fall below a threshold. A suite that nothing in your pipeline calls won't get run, and won't get maintained.&lt;/p&gt;
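
&lt;p&gt;The deploy gate itself can be very small. This sketch assumes the evaluator produces one aggregate score per metric. The metric names and threshold values in any real setup are yours to choose, and the function names here are hypothetical.&lt;/p&gt;

```python
def gate(scores, thresholds):
    """Return the metrics that fall below their threshold; a missing metric counts as failing."""
    return sorted(
        name for name, minimum in thresholds.items()
        if name not in scores or minimum > scores[name]
    )


def main(scores, thresholds):
    """Print each failing metric and return a process exit code."""
    failures = gate(scores, thresholds)
    for name in failures:
        print("below threshold: " + name)
    return 1 if failures else 0  # a non-zero exit code fails the pipeline step
```

&lt;p&gt;A CI job would pass the return value of main to sys.exit, so a failing metric stops the deploy step.&lt;/p&gt;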

</description>
      <category>ai</category>
      <category>agents</category>
      <category>agentaichallenge</category>
    </item>
  </channel>
</rss>
