<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Manoj Aggarwal</title>
    <description>The latest articles on DEV Community by Manoj Aggarwal (@manoj6543).</description>
    <link>https://dev.to/manoj6543</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3665780%2Fee3dfc42-16cf-4b53-8a2a-547fd949b464.png</url>
      <title>DEV Community: Manoj Aggarwal</title>
      <link>https://dev.to/manoj6543</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/manoj6543"/>
    <language>en</language>
    <item>
      <title>Agentic AI in Payments Is Exciting. It’s Also a Minefield.</title>
      <dc:creator>Manoj Aggarwal</dc:creator>
      <pubDate>Sun, 08 Mar 2026 00:09:05 +0000</pubDate>
      <link>https://dev.to/manoj6543/agentic-ai-in-payments-is-exciting-its-also-a-minefield-3hk</link>
      <guid>https://dev.to/manoj6543/agentic-ai-in-payments-is-exciting-its-also-a-minefield-3hk</guid>
      <description>&lt;p&gt;We're on the verge of AI that doesn't just advise on payments — it &lt;em&gt;executes&lt;/em&gt; them. Book the flight, split the bill, pay the invoice, trigger the refund. All autonomously, on your behalf.&lt;/p&gt;

&lt;p&gt;That's genuinely powerful. It's also terrifying if you think about it for more than five seconds.&lt;/p&gt;

&lt;p&gt;The moment an AI agent has real financial authority, you've introduced a new attack surface for fraud, a new compliance headache for PCI-DSS, and a new category of failure where the system doesn't hallucinate text — it hallucinates a &lt;em&gt;transaction&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;A few things I kept coming back to while thinking through this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency vs. safety is a real tradeoff.&lt;/strong&gt; Payments are optimized for speed. Fraud checks, human-in-the-loop review, and agent sandboxing all add friction. Where do you draw the line before the UX collapses?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Autonomous doesn't mean unaccountable.&lt;/strong&gt; If an AI agent executes a payment incorrectly, who owns it? The user who granted permissions? The company that deployed the agent? The model provider? This isn't solved.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Constrained agents are underrated.&lt;/strong&gt; The instinct is to give agents as much autonomy as possible. But scoped permissions — agents that can &lt;em&gt;only&lt;/em&gt; act within explicit guardrails — are probably the right default for high-stakes financial actions. Boring, but correct.&lt;/p&gt;
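
&lt;p&gt;To make "explicit guardrails" concrete, here is a toy sketch (all names and limits are hypothetical, not from any real agent framework): every action the agent proposes passes through an allowlist and a hard spend cap, and anything over the cap escalates to a human.&lt;/p&gt;

```python
# Hypothetical guardrail layer for a payment agent. ALLOWED_ACTIONS,
# MAX_AMOUNT_USD, and execute_action are illustrative names, not a real API.

ALLOWED_ACTIONS = {"refund", "invoice_payment"}  # explicit scope allowlist
MAX_AMOUNT_USD = 200.00                          # hard per-transaction cap

def execute_action(action: str, amount_usd: float) -> str:
    """Run a proposed payment action only if it passes every guardrail."""
    if action not in ALLOWED_ACTIONS:
        return "rejected: action outside agent scope"
    if amount_usd > MAX_AMOUNT_USD:
        return "escalated: needs human-in-the-loop approval"
    return f"executed: {action} for ${amount_usd:.2f}"
```

&lt;p&gt;Boring, as promised: the agent can still be wrong, but the blast radius is bounded by design.&lt;/p&gt;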

&lt;p&gt;I wrote up the full breakdown on HackerNoon, covering fraud detection architecture, compliance risk, decision boundaries, and what "human-in-the-loop" actually needs to look like for this to work in production.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://hackernoon.com/agentic-ai-in-payments-balancing-autonomy-and-risk" rel="noopener noreferrer"&gt;Read the full article on HackerNoon&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Curious whether others building in this space are leaning into full autonomy or keeping agents on a shorter leash — and why.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>fintech</category>
      <category>webdev</category>
      <category>discuss</category>
    </item>
    <item>
      <title>The Advertising Model is Coming for AI Agents (And It's Worse Than You Think)</title>
      <dc:creator>Manoj Aggarwal</dc:creator>
      <pubDate>Mon, 02 Mar 2026 05:32:06 +0000</pubDate>
      <link>https://dev.to/manoj6543/the-advertising-model-is-coming-for-ai-agents-and-its-worse-than-you-think-fe3</link>
      <guid>https://dev.to/manoj6543/the-advertising-model-is-coming-for-ai-agents-and-its-worse-than-you-think-fe3</guid>
      <description>&lt;p&gt;We already know how the advertising model corrupted social media. Now it's heading for AI agents — and the mechanics make it potentially worse.&lt;/p&gt;

&lt;p&gt;When an agent books your flight, recommends a tool, or summarizes your options, how will you know if that answer was shaped by a sponsored result? Unlike a clearly labeled Google ad, the influence could be baked directly into the model's training data or RAG pipeline — invisible to the user.&lt;/p&gt;

&lt;p&gt;I wrote about where this is already happening (Perplexity's sponsored answers), how affiliate-style incentives could quietly distort agent recommendations, and what "hyper-personalized" ad targeting looks like when the agent knows everything about you.&lt;/p&gt;

&lt;p&gt;👉 &lt;a href="https://hackernoon.com/the-advertising-model-is-coming-for-ai-agents-and-its-worse-than-you-think" rel="noopener noreferrer"&gt;Read the full article on HackerNoon&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Curious if others are thinking about this — especially those building agents that make recommendations or take actions on behalf of users.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>machinelearning</category>
      <category>webdev</category>
      <category>discuss</category>
    </item>
    <item>
      <title>Would I Use LLMs to Build Dynamic Product Ads Today? Maybe</title>
      <dc:creator>Manoj Aggarwal</dc:creator>
      <pubDate>Mon, 09 Feb 2026 21:16:09 +0000</pubDate>
      <link>https://dev.to/manoj6543/would-i-use-llms-to-build-dynamic-product-ads-today-maybe-2dmk</link>
      <guid>https://dev.to/manoj6543/would-i-use-llms-to-build-dynamic-product-ads-today-maybe-2dmk</guid>
      <description>&lt;p&gt;I led Dynamic Product Ads at Twitter, where we matched millions of users to hundreds of millions of e-commerce products in real time. The output was the top 5-6 products that each user was most likely to buy. The system used product and user embeddings with classic ML models to serve personalized ads at Twitter's scale. We saw 15-18% improvements in CTR and 12% improvements in conversion rates compared to brand advertisements.&lt;/p&gt;

&lt;p&gt;That was a few years ago. Now everyone talks about AI and Large Language Models as if they will revolutionize everything. So I found myself asking: if I were building Dynamic Product Ads today, would I use LLMs? And more importantly, what would not change at all, and why?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The answer: I'd use LLMs for about 20% of the system, specifically for generating embeddings, and keep everything else the same.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Our Original Approach
&lt;/h2&gt;

&lt;p&gt;The problem at hand: recommend products a user is most likely to click and buy while they are quickly scrolling their timeline. There were millions of products to choose from, across many advertisers, and millions of users scrolling at the same time. The ad-serving pipeline had to complete each prediction in under 50 milliseconds, end to end. The approach was very textbook (for 2022, at least).&lt;/p&gt;

&lt;h3&gt;
  
  
  Product Embeddings
&lt;/h3&gt;

&lt;p&gt;We used each product's metadata, like title, description, category, price, etc., and encoded it in a 128-dimensional dense vector space. We also utilized signals like user engagement with this product or conversion patterns to calculate embeddings.&lt;/p&gt;

&lt;h3&gt;
  
  
  User Embeddings
&lt;/h3&gt;

&lt;p&gt;Users were represented by vectors based on signals like their engagement on the platform, profile information, and past purchases. Even geographies and the time of day played a key role here.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Matching Model
&lt;/h3&gt;

&lt;p&gt;At inference time, we would use a two-stage approach. First, we would run a fast approximate nearest neighbor search to retrieve candidate products whose embeddings were close to user embeddings. Then, we would use a gradient boosted decision tree to score those candidates, incorporating additional features like recency, price signals, and context like time of day.&lt;/p&gt;

&lt;p&gt;This approach worked. The models and the ANN index were explainable, debuggable, and, most importantly, fast enough for Twitter's scale.&lt;/p&gt;
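
&lt;p&gt;For readers who want the shape of that pipeline in code, here is a minimal numpy-only sketch. It is illustrative, not our production system: a real deployment would use an ANN index such as Faiss for stage one and a trained gradient boosted model for stage two, and the data and weights below are made up.&lt;/p&gt;

```python
import numpy as np

# Minimal two-stage sketch: stage 1 retrieves candidates by embedding
# similarity; stage 2 rescores them with extra features. A real system
# would use an ANN index (e.g. Faiss) and a GBDT model instead of the
# linear stand-in below. All data here is random and illustrative.

rng = np.random.default_rng(0)
product_embs = rng.normal(size=(1000, 128))  # 1,000 products, 128-dim
user_emb = rng.normal(size=128)

def retrieve(user, products, k=50):
    """Stage 1: cosine-similarity top-k candidate retrieval."""
    sims = products @ user / (
        np.linalg.norm(products, axis=1) * np.linalg.norm(user))
    return np.argsort(sims)[::-1][:k]        # indices of the top-k products

def score(candidates, recency, price_signal):
    """Stage 2: rescore the short candidate list (GBDT stand-in)."""
    return 0.6 * recency[candidates] + 0.4 * price_signal[candidates]

candidates = retrieve(user_emb, product_embs)
ranked = candidates[np.argsort(score(candidates, rng.random(1000),
                                     rng.random(1000)))[::-1][:6]]
```

&lt;p&gt;The key property is that the expensive similarity search runs once over the full catalog, while the richer scoring model only sees the short candidate list.&lt;/p&gt;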

&lt;h2&gt;
  
  
  How I'd Approach This Today
&lt;/h2&gt;

&lt;p&gt;It's 2026 now. If I were building this system today, here's what I'd actually change.&lt;/p&gt;

&lt;h3&gt;
  
  
  Better Product Embeddings with LLM Encoders
&lt;/h3&gt;

&lt;p&gt;The biggest improvement would be in generating better product embeddings. Modern Large Language Models are remarkably good at understanding semantic meaning and context. Instead of stitching together product descriptions (most of which were pretty bad to begin with), I'd use an LLM-based encoder to generate product embeddings.&lt;/p&gt;

&lt;p&gt;This matters because a product titled "running shoes" would be semantically close to "sneakers for jogging", even though the two share no exact words. Modern sentence transformers from Hugging Face, like all-MiniLM-L6-v2, handle this effortlessly.&lt;/p&gt;

&lt;p&gt;We once had a Nike product catalog entry titled 'Air Max 270 React' that our 2022 embeddings couldn't match to users searching for 'cushioned running shoes' or 'athletic sneakers' because there was no keyword overlap. The product got 35-40% fewer impressions than similar items in its first week until we collected enough engagement data. An LLM-based encoder would have understood the semantic relationship immediately.&lt;/p&gt;

&lt;h3&gt;
  
  
  Improved Cold-Start Handling
&lt;/h3&gt;

&lt;p&gt;LLMs would also make cold-start handling better. When a new product appears in a catalog, an LLM can extract rich signals from product descriptions, reviews, and images to generate a reasonable initial embedding. Similarly, for new users with sparse engagement history, modern encoders can better understand profile information and initial tweets (if any) to create meaningful representations. Cold start was always the weakest point of our classic embeddings-based user-product matching.&lt;/p&gt;

&lt;p&gt;So, where would LLMs fit into the actual architecture?&lt;/p&gt;

&lt;h3&gt;
  
  
  Hybrid Approach
&lt;/h3&gt;

&lt;p&gt;I would still use classic ML for the retrieval, scoring, and serving layers. The architecture would look like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Feature&lt;/th&gt;
&lt;th&gt;LLM or Classic&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM-based encoder to generate user and product embeddings&lt;/td&gt;
&lt;td&gt;LLM&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Match embeddings to generate candidate products per user&lt;/td&gt;
&lt;td&gt;Classic&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final scoring and ranking&lt;/td&gt;
&lt;td&gt;Classic&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Why would I use classic models for scoring? The reasons are latency, cost, and explainability.&lt;/p&gt;

&lt;p&gt;LLMs cannot score products for millions of users in under 10 milliseconds, and they are overkill for combining numerical features into a ranking decision; classic models do this in microseconds. At Twitter's scale, the difference between 1ms and 10ms of inference time translates to millions of dollars in infrastructure costs and measurable drops in user engagement.&lt;/p&gt;

&lt;p&gt;Cost matters more than people admit. Running LLM inference for every prediction request would cost 50-100x more than our classic approach.&lt;/p&gt;
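
&lt;p&gt;A back-of-envelope version of that argument, with clearly made-up per-inference prices (the real numbers depend entirely on your models and hardware):&lt;/p&gt;

```python
# Back-of-envelope cost comparison. The per-1k-prediction prices below are
# assumptions for illustration only, not measured figures.
requests_per_day = 500_000_000        # hypothetical ad prediction volume
classic_cost_per_1k = 0.00002         # dollars per 1k classic predictions
llm_cost_per_1k = 0.002               # dollars per 1k LLM predictions

classic_daily = requests_per_day / 1000 * classic_cost_per_1k
llm_daily = requests_per_day / 1000 * llm_cost_per_1k
ratio = llm_daily / classic_daily     # 100x under these assumptions
```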

&lt;h2&gt;
  
  
  What About Generating Ad Copy?
&lt;/h2&gt;

&lt;p&gt;There is a lot of hype about using LLMs to generate personalized ad copy on the fly or to reason about user intent in real time. This is where it gets hard to decide whether LLMs would actually be useful.&lt;/p&gt;

&lt;p&gt;Generating ad copy with LLMs introduces unacceptable risks: hallucinations about product features, inconsistent branding, and content that is impossible to review at this scale. The system would need to show millions of ad variations per day, with no way to check them all for accuracy and brand safety. One hallucinated claim, like calling a product "waterproof" when it isn't or "FDA-approved" when it isn't, would create legal liability. The risk doesn't justify the marginal lift in engagement.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Does Not Change?
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Understanding user intent is still the hardest part.&lt;/strong&gt; Whether the system uses embeddings from 2022 or LLMs from 2026, the fundamental challenge remains the same: inferring what someone wants from noisy signals. Someone who tweets about running shoes might be a marathon runner shopping for their next pair, or a casual observer who just watched a race on television. Solving this takes good data, thoughtful feature engineering, and lots of experimentation. No model architecture solves it on its own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Latency requirements are non-negotiable.&lt;/strong&gt; At scale, every millisecond counts. Users would abandon experiences that feel slow. Ad systems cannot slow down the loading of the timeline. I have seen several ML infrastructure systems crumble in A/B tests because they added 100ms of latency to a well-oiled system. The model could be better, but the latency requirement trumps all.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Last-mile problems remain the same.&lt;/strong&gt; Issues like cold start for new products or users, and data quality issues when catalogs have missing or incorrect product descriptions, are still there. These problems are orthogonal to the model and require system design thinking, not a fancier architecture.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Iteration speed beats model sophistication.&lt;/strong&gt; A team that can run 10 experiments per week with a simpler model will constantly outperform a team running 1 experiment per week with a very sophisticated model. The ability to test quickly, measure results, and iterate is more valuable than marginal improvements in model quality. When we launched Dynamic Product Ads, we ran 3-4 experiments per week. We tested different embedding dimensions, different ANN algorithms, and different features in the scoring model. Most experiments failed. But the ones that worked compounded. That velocity mattered more than picking the "perfect" model architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Real Question: Where Is the Bottleneck?
&lt;/h2&gt;

&lt;p&gt;To be honest, most of the "how would you build it today with modern AI" discussion misses the point. The question should not be what is possible with the new technology. It should be, "What is the actual bottleneck in your system where AI can help?"&lt;/p&gt;

&lt;p&gt;For us, the bottleneck was never the quality of the embeddings. It was understanding user intent, handling data quality issues in the product catalogs, managing cold-start problems, and building systems that could handle the scale. Modern AI genuinely helps with some of these, but the fundamental system challenges don't change.&lt;/p&gt;

&lt;p&gt;If I were building this system today, I'd spend 20% of my effort on "using LLM to generate better embeddings" and 80% on the same problems of scale, data quality, experimentation, and understanding user intent.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Unpopular Opinion
&lt;/h2&gt;

&lt;p&gt;The tech industry loves revolution narratives. There was similar hype around blockchain and Web3 a few years ago. But the truth is, most production ML systems already work, scale, and make money. Rebuilding one around LLMs end to end might yield a 5% improvement while being 10x slower and 100x more expensive. Modern AI is genuinely valuable when applied thoughtfully to bottlenecks, not as a replacement for a system that already works.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
      <category>architecture</category>
    </item>
    <item>
      <title>AI and my workflow</title>
      <dc:creator>Manoj Aggarwal</dc:creator>
      <pubDate>Tue, 20 Jan 2026 18:52:35 +0000</pubDate>
      <link>https://dev.to/manoj6543/ai-and-my-workflow-2gej</link>
      <guid>https://dev.to/manoj6543/ai-and-my-workflow-2gej</guid>
      <description>&lt;p&gt;I used to spend weeks writing technical design documents. But last month, I had one ready within a day. The difference was not that I got faster, but that my approach to document writing changed.&lt;/p&gt;

&lt;p&gt;After spending a decade building software for users and enterprises, this year felt different. Not because the problems changed, but because how I approached them did.&lt;/p&gt;

&lt;p&gt;AI tools quietly integrated themselves into my workflow and now everything has shifted.&lt;/p&gt;

&lt;h2&gt;
  
  
  Document Writing and Reviewing
&lt;/h2&gt;

&lt;p&gt;As an engineering leader, my job requires me to write and review technical and product documentation, ranging from proof-of-concept proposals to full-fledged design documents containing structured API designs, system diagrams, and storage layouts. Until last year, writing a solid technical design document would take me weeks. Most of the time went into structuring the document so that reviewers could quickly grasp my proposed solution, critique it, and approve it.&lt;/p&gt;

&lt;p&gt;That changed this year.&lt;/p&gt;

&lt;p&gt;Now, I start by prompting with a few framing questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What is this document for? Is it a system design proposal for a feature, or for an entirely new system?&lt;/li&gt;
&lt;li&gt;What problem is the system solving?&lt;/li&gt;
&lt;li&gt;Does the AI need to refer to existing documents, like product requirements?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Within seconds, I have a well-structured first draft. Recently, for a proof-of-concept proposal, it generated placeholder content with sections like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Overview&lt;/li&gt;
&lt;li&gt;Proposed System Design (with a blank Lucidchart link)&lt;/li&gt;
&lt;li&gt;Logic and Value proposition&lt;/li&gt;
&lt;li&gt;Screenshots&lt;/li&gt;
&lt;li&gt;Success Criteria&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;At that point, all I had to do was fill in the blanks. More than half the mechanical work was done.&lt;/p&gt;

&lt;p&gt;What AI didn't do was tell me whether this was the right system to build. It didn't challenge my assumptions about scale or dependencies. That judgment still fell to me.&lt;/p&gt;

&lt;p&gt;Document reviewing became easier too. Instead of reading entire product requirement documents end to end, I ask Gemini to summarize them and answer targeted questions like "Who is the audience for this feature?". It answers in natural language, using the document as the source of truth.&lt;/p&gt;

&lt;h2&gt;
  
  
  Writing and reviewing code
&lt;/h2&gt;

&lt;p&gt;Writing unit tests was never fun and never will be. But they are essential! What differentiates a good software engineer isn't just writing code, but thinking through edge cases and proving those cases are handled through thorough testing.&lt;/p&gt;

&lt;p&gt;Until recently, writing unit or integration tests was a time-consuming manual process. Boilerplate code to initialize mocks, set up factories, and fetch test data had to be written by hand. That workflow has changed.&lt;/p&gt;

&lt;p&gt;After writing a new controller for a .NET API, along with its business logic, CRUD layer, and MongoDB repository, I prompt Copilot to generate unit tests. In seconds, it produces a &lt;code&gt;*Tests.cs&lt;/code&gt; file for each new file I have written, with unit tests covering most happy paths and several common error cases.&lt;/p&gt;

&lt;p&gt;Most, but not all.&lt;/p&gt;

&lt;p&gt;Here's where it gets interesting: Copilot generated tests for a POST API that looked perfect. They had proper assertions, clean setup methods, and good naming conventions, and they all passed. But they were useless. The business logic behind the API wrote to three different collections in a MongoDB database in a single transaction. The tests checked that calling the API wrote to the database, not that the writes happened transactionally, such that one failure would roll back the entire transaction. They covered the happy path but none of the scenarios that actually mattered.&lt;/p&gt;

&lt;p&gt;It is still my responsibility to identify the true corner cases, ones that require product context. And when I give Copilot additional prompts about the new cases to test, it fills in the remaining tests quickly and accurately. I still have to think about some of these scenarios, but the execution does not have to be done by me.&lt;/p&gt;

&lt;p&gt;Code review has also become more focused. Rather than spending time on mechanical issues like missing error handling or unsafe type casts, automated workflows send the entire diff to models from Anthropic, Google, or OpenAI to generate a first pass of comments on GitHub.&lt;/p&gt;

&lt;p&gt;My role is to ensure those comments are resolved correctly, that the business logic makes sense, and that the system isn't regressing in subtle ways. The authority to approve a pull request still lies with me, since I have the product context.&lt;/p&gt;

&lt;h2&gt;
  
  
  Research
&lt;/h2&gt;

&lt;p&gt;Understanding an unfamiliar codebase is never easy. Every team has its own conventions for structuring code, managing configurations, and handling service-to-service communication. Yet this understanding is crucial when designing a new feature, especially one that needs to integrate with existing systems.&lt;/p&gt;

&lt;p&gt;Previously, I would spend hours using tools like Sourcegraph or GitHub code search to build a mental model of how a system worked. Tracing how service A interacted with service B took effort and manually reading through large portions of code.&lt;/p&gt;

&lt;p&gt;Not anymore.&lt;/p&gt;

&lt;p&gt;Now, I clone each repository, open them in VSCode and ask targeted questions to Copilot. For example:&lt;/p&gt;

&lt;p&gt;Does this service emit events to the Kafka topic XYZ?&lt;br&gt;
If not, list all the topics that this service emits to.&lt;/p&gt;

&lt;p&gt;Using Claude Sonnet 4.5 in Copilot's agent mode, it scans the relevant source files and responds in natural language within a minute. No more head-scratching code archaeology; I get a clear, high-level understanding almost immediately.&lt;/p&gt;

&lt;p&gt;This is where AI delivers disproportionate value. Not in writing code, but in reading it and understanding it quickly enough that I can make architectural decisions fast.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Actually Changed
&lt;/h2&gt;

&lt;p&gt;Over the past year, my role quietly shifted. I spend more time deciding what to build and less time actually building.&lt;/p&gt;

&lt;p&gt;AI gave me confidence, not just raw productivity. I can now explore unfamiliar codebases faster and validate ideas with ChatGPT without second-guessing myself.&lt;/p&gt;

&lt;p&gt;What AI has not changed is accountability. Every line of code I ship, every pull request I approve, and every design document I author, still carries my name. If something breaks, that responsibility is mine, not ChatGPT's or Claude's or Gemini's. The trap is that it is easier than ever to ship something that might look real, but is fundamentally wrong.&lt;/p&gt;

&lt;p&gt;The engineers who thrive in 2026 won't be those who use AI the most, but those who know what to ask for and what to verify; those who understand that AI doesn't replace judgment, it amplifies it, for good and for bad.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>documentation</category>
      <category>productivity</category>
      <category>writing</category>
    </item>
  </channel>
</rss>
