As a 1st-year engineering student diving into distributed systems, I’ve noticed a massive inefficiency in how growing engineering teams operate: they don't just duplicate code—they duplicate intent. One squad builds an "Auth Service" while another builds a "Login Backend." They use different words, so standard Jira or GitHub keyword searches never catch the overlap until the code is already shipped.
I recently built Kanso, a cloud-native semantic intelligence layer, to solve this. It integrates directly into the IDE and alerts developers if they are typing a feature that already exists elsewhere in the organization.
Here is a breakdown of the event-driven software architecture patterns I used, and how I kept the entire round-trip semantic validation under a strict 500ms latency budget.
The Core Architecture Pattern
Kanso operates on a fully decoupled, serverless hub-and-spoke model. The IDE extensions (VS Code, Kiro) act as edge sensors, while the heavy lifting is handled by an AWS serverless backend.
(AWS Architecture Diagram)
The Stack:
- Ingestion: Amazon API Gateway (with JWT Lambda Authorizers for multi-tenant isolation)
- Compute: AWS Lambda (Python 3.11). Python was the pragmatic choice here for rapid integration with the AWS Bedrock SDKs.
- Embeddings & Reasoning: Amazon Bedrock (Titan & Claude 3 Sonnet)
- Vector Database: Amazon OpenSearch Serverless (KNN Search)
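To make the embedding step concrete, here is a minimal sketch of how a Lambda might build a Titan embedding request for Bedrock. The model ID and payload shape are my assumptions based on the Titan text-embeddings API; check the Bedrock console for the exact model ID enabled in your region.

```python
import json

# Assumed model ID for Titan text embeddings; verify against your account.
TITAN_MODEL_ID = "amazon.titan-embed-text-v1"

def build_embedding_request(text: str) -> dict:
    """Build the keyword arguments for a bedrock-runtime invoke_model call."""
    return {
        "modelId": TITAN_MODEL_ID,
        "contentType": "application/json",
        "accept": "application/json",
        "body": json.dumps({"inputText": text}),
    }

# Inside the Lambda handler, the actual call would look roughly like:
#   bedrock = boto3.client("bedrock-runtime")
#   response = bedrock.invoke_model(**build_embedding_request("Auth Service"))
#   embedding = json.loads(response["body"].read())["embedding"]
```

Keeping the request construction in a pure function like this makes the payload easy to unit-test without any network access.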
Solving the 500ms Edge Challenge
Even as a student, I know that blocking the main thread of an IDE is a cardinal sin. For this tool to be usable, the time from a developer pausing their typing to receiving a validated semantic alert had to be practically imperceptible.
1. Edge Debouncing & Caching
Instead of firing an API call on every keystroke, the VS Code extension uses a 500ms debouncer. I also implemented an intelligent embedding cache at the edge. If a developer deletes and retypes the same phrase, the extension pulls from the local cache rather than hitting the API Gateway, aggressively reducing noisy network calls.
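The cache logic looks roughly like this. The real extension is TypeScript, but the idea translates directly; the TTL, normalization, and class name here are illustrative, not Kanso's actual implementation.

```python
import hashlib
import time

class EdgeEmbeddingCache:
    """Sketch of the extension-side cache: skip the network round-trip
    when the developer deletes and retypes a phrase we already embedded."""

    def __init__(self, ttl_seconds: float = 300.0):
        self._store: dict[str, tuple[float, list[float]]] = {}
        self._ttl = ttl_seconds

    @staticmethod
    def _key(text: str) -> str:
        # Normalize case and whitespace so trivial edits still hit the cache.
        normalized = " ".join(text.lower().split())
        return hashlib.sha256(normalized.encode()).hexdigest()

    def get(self, text: str):
        entry = self._store.get(self._key(text))
        if entry is None:
            return None
        stored_at, embedding = entry
        if time.monotonic() - stored_at > self._ttl:
            return None  # stale entry: fall through to the API
        return embedding

    def put(self, text: str, embedding: list[float]) -> None:
        self._store[self._key(text)] = (time.monotonic(), embedding)
```

A cache hit means zero network calls: the extension reuses the stored vector instead of touching API Gateway at all.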
2. Fast Vector Retrieval
Once the text hits the backend, AWS Lambda uses the Titan model to generate a vector embedding. To query historical initiatives quickly, I utilized OpenSearch Serverless. Its KNN (k-nearest neighbors) matching allows the system to scan thousands of partitioned, tenant-specific vectors in milliseconds.
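A tenant-partitioned k-NN query against OpenSearch can be expressed as a `bool` query that filters on the tenant before running the vector match. This is a sketch; the field names (`initiative_vector`, `tenant_id`) are placeholders I chose for illustration, not Kanso's actual index mapping.

```python
def build_knn_query(embedding: list[float], tenant_id: str, k: int = 5) -> dict:
    """OpenSearch k-NN search restricted to a single tenant's vectors."""
    return {
        "size": k,
        "query": {
            "bool": {
                # Cheap term filter ensures strict multi-tenant isolation.
                "filter": [{"term": {"tenant_id": tenant_id}}],
                # k-NN clause finds the k nearest initiative embeddings.
                "must": [
                    {"knn": {"initiative_vector": {"vector": embedding, "k": k}}}
                ],
            }
        },
    }

# The query dict would then be sent via opensearch-py's client.search(...).
```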
3. The LLM Circuit Breaker
Vector similarity alone generates too many false positives (e.g., "Implement Stripe" vs. "Deprecate Stripe"). However, running every query through a Large Language Model is too slow for an IDE extension.
My solution was a two-tiered filter:
- Tier 1: OpenSearch returns candidate vectors. If the similarity is below 85%, the Lambda drops the event immediately.
- Tier 2: If (and only if) the score exceeds 85%, the payload is routed to Claude 3 Sonnet via Bedrock. Claude acts as the final contextual validator, returning a strict JSON response (is_duplicate: true/false) based on intent and status conflicts.
This pattern ensures that the slower, compute-heavy LLM is only invoked when absolutely necessary, preserving the overall latency budget.
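The circuit-breaker logic itself fits in a few lines. In this sketch, `validate_with_llm` stands in for the Bedrock call to Claude; the function name, threshold constant, and payload shape are illustrative assumptions.

```python
import json

SIMILARITY_THRESHOLD = 0.85  # Tier 1 cutoff from the design above

def check_duplicate(candidate: dict, validate_with_llm) -> bool:
    """Two-tiered filter: a cheap score gate in front of the LLM.

    `candidate` holds the OpenSearch hit, e.g. {"score": 0.92, "text": ...}.
    `validate_with_llm` wraps the Bedrock/Claude call and must return
    strict JSON like '{"is_duplicate": true}'.
    """
    # Tier 1: drop low-similarity hits without ever invoking the LLM.
    if candidate["score"] < SIMILARITY_THRESHOLD:
        return False
    # Tier 2: let the LLM judge intent ("Implement" vs. "Deprecate").
    verdict = json.loads(validate_with_llm(candidate["text"]))
    return verdict.get("is_duplicate", False)
```

Because Tier 1 handles the vast majority of events, the expensive Claude invocation stays off the hot path and the latency budget holds.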
The Result
By combining fast vector retrieval with conditional LLM reasoning, Kanso prevents redundant engineering effort in real time without lagging the developer's environment.
The full project is currently a semi-finalist in the AWS 10,000 AIdeas Competition. As a 1st-year student, making it to the Top 300 would be incredible. If you found this architectural breakdown helpful, I would greatly appreciate your vote! To support Kanso:
- Click the link below to visit my AWS Builder Center page.
- Log in with your AWS or Amazon account.
- Hit the "Like" button at the top of the article!
Let me know in the comments how your team currently handles redundant engineering work!
— Vignesh
