DEV Community: vaishakh

Classification in a Nutshell

vaishakh — Wed, 17 Sep 2025 16:50:33 +0000

Classification is the art of drawing boundaries. You take messy, high-dimensional data and force it into neat categories.

In the wild, this could be spam vs. not-spam, cat vs. dog, tumor vs. healthy tissue.

In textbooks, it’s often MNIST, the 70,000-image dataset of handwritten digits that’s become the “Hello World” of machine learning.

MNIST looks simple, but it hides the essence of classification:

Inputs: images, each a 28×28 grid of pixels → vectors in $\mathbb{R}^{784}$.
Outputs: 10 possible digits (0–9).
Goal: learn a function $\mathbb{R}^{784} \to \{0,1,\dots,9\}$.

That’s it. Strip away the hype, and classification is about learning the function that maps features to labels.

Step 1: The Idea of Decision Boundaries

Think of classification as drawing walls in a huge room. Each wall splits the space into regions: “everything on this side is a 3, everything on that side is a 7.”

Mathematically, the simplest wall is linear:

$$f(\mathbf{x}) = \text{sign}(\mathbf{w}^\top \mathbf{x} + b)$$

where $\mathbf{x}$ is your input (a flattened image), $\mathbf{w}$ is a weight vector, and $b$ is a bias.

If $f(\mathbf{x}) = +1$, you say “class A.” If it’s $-1$, you say “class B.”

This is binary classification. MNIST is harder. It’s 10-way classification. But the principle holds: learn a set of boundaries that carve up the space of digits.
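
To make the "wall" concrete, here is a minimal sketch of the linear rule above in plain Python. The weights and the 2-D points are made up for illustration (a real MNIST input would have 784 entries), and treating a score of exactly 0 as +1 is an arbitrary choice.

```python
def linear_classify(x, w, b):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    score = sum(wi * xi for wi, xi in zip(w, x)) + b
    # A score of exactly 0 sits on the wall itself; we arbitrarily call it +1
    return 1 if score >= 0 else -1


# Toy 2-D example: the boundary is the line x1 + x2 = 1
w, b = [1.0, 1.0], -1.0
print(linear_classify([2.0, 2.0], w, b))  # lands on the +1 side
print(linear_classify([0.0, 0.0], w, b))  # lands on the -1 side
```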

Step 2: Probabilities, Not Just Boundaries

Hard decisions are brittle. Instead of only predicting “3” or “7,” we want a probability distribution over all 10 classes.

Enter softmax regression (multi-class logistic regression):

$$P(y = k \mid \mathbf{x}) = \frac{\exp(\mathbf{w}_k^\top \mathbf{x} + b_k)}{\sum_{j=0}^{9} \exp(\mathbf{w}_j^\top \mathbf{x} + b_j)}$$

Each class $k$ gets a score. Exponentiate, normalize, and you’ve got probabilities.
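
The "exponentiate, normalize" step fits in a few lines. This is a sketch in plain Python with made-up scores; subtracting the largest score first is a common numerical-stability trick and does not change the result.

```python
import math


def softmax(scores):
    """Turn a list of raw class scores into probabilities that sum to 1."""
    biggest = max(scores)
    # Shifting by the max avoids overflow in exp() without changing the ratios
    exps = [math.exp(s - biggest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


probs = softmax([2.0, 1.0, 0.1])  # three made-up class scores
print(probs)
```

The largest score gets the largest probability, but the other classes keep a non-zero share.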

Step 3: Learning from Mistakes

How do we tune those weights? By minimizing a loss. The gold standard is cross-entropy loss:

$$\mathcal{L} = -\sum_{i=1}^N \log P(y^{(i)} \mid \mathbf{x}^{(i)})$$

where $(\mathbf{x}^{(i)}, y^{(i)})$ are your training examples.

The loss punishes confident wrong predictions and rewards confident correct ones.
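
A quick numeric check of that claim, as a sketch: the function below takes the probability the model assigned to the correct class of each example (the probabilities here are made up) and sums the negative logs.

```python
import math


def cross_entropy(true_class_probs):
    """Sum of -log(p), where p is the probability given to the correct class."""
    return -sum(math.log(p) for p in true_class_probs)


confident_right = cross_entropy([0.99])  # model was sure and correct
unsure = cross_entropy([0.5])            # model hedged
confident_wrong = cross_entropy([0.01])  # model was sure of something else
print(confident_right, unsure, confident_wrong)
```

The loss grows sharply as the probability given to the true class approaches zero.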

Optimization is just gradient descent:

$$\mathbf{w} \gets \mathbf{w} - \eta \, \nabla_{\mathbf{w}} \mathcal{L}$$

with $\eta$ as the learning rate.
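
To see the update rule in motion without any MNIST machinery, here is a sketch on a stand-in loss $(w - 3)^2$, whose gradient is $2(w - 3)$. The loss, the starting point, and the learning rate are all picked for illustration.

```python
def gradient_descent(grad, w, eta=0.1, steps=100):
    """Repeatedly apply w <- w - eta * grad(w)."""
    for _ in range(steps):
        w = w - eta * grad(w)
    return w


# Gradient of the toy loss (w - 3)^2; its minimum sits at w = 3
w_star = gradient_descent(lambda w: 2 * (w - 3), w=0.0)
print(w_star)
```

Training a classifier runs the same loop, only with the gradient of the cross-entropy loss and a whole vector of weights.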

Step 4: Why Neural Nets Beat Logistic Regression

On MNIST, a plain softmax regression gets ~92% accuracy. Not bad.

But if you stack layers of nonlinear functions:

$$h = \sigma(W_1 \mathbf{x} + b_1), \quad \hat{y} = \text{softmax}(W_2 h + b_2)$$

you unlock much richer decision boundaries.

Neural nets can bend and twist the “walls” in ways linear models never can.
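
XOR is the classic example of a pattern no single straight wall can separate. The sketch below runs a two-layer forward pass with hand-picked weights (an assumption for illustration, not trained values) and ReLU as the nonlinearity. It takes the argmax of the output scores directly, which picks the same class softmax would.

```python
def relu(values):
    return [max(0.0, v) for v in values]


def affine(W, x, b):
    """Compute W x + b for a list-of-rows matrix W."""
    return [sum(w * xi for w, xi in zip(row, x)) + bi for row, bi in zip(W, b)]


# Hand-picked weights that solve XOR
W1, b1 = [[1.0, 1.0], [1.0, 1.0]], [0.0, -1.0]
W2, b2 = [[0.0, 0.0], [1.0, -2.0]], [0.0, -0.5]


def predict(x):
    h = relu(affine(W1, x, b1))  # hidden layer
    logits = affine(W2, h, b2)   # one score per class
    return max(range(len(logits)), key=logits.__getitem__)


for point in ([0, 0], [0, 1], [1, 0], [1, 1]):
    print(point, predict(point))
```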

Convolutional neural nets (CNNs) go further by exploiting image structure. That’s how they push MNIST accuracy past 99%.

Step 5: What MNIST Actually Teaches You

MNIST isn’t about handwritten digits. It’s a sandbox to learn the deep truths of classification:

Every problem is about separating regions in feature space.
Probabilities > hard labels.
Losses tell you how “wrong” you are.
Optimization is just moving weights downhill.
Deeper models = more flexible boundaries.

Once you grasp these, you can swap MNIST for anything: medical scans, stock movements, audio signals. The math doesn’t change.

In a Nutshell

Classification = boundaries, probabilities, and losses.

MNIST is just the training wheels.

The real game is scaling this logic to data messier than digits scribbled on paper.

Linear Regression in a Nutshell

vaishakh — Tue, 16 Sep 2025 16:22:14 +0000

Every day, something exciting comes up in machine learning.

A new RL technique, a transformer architecture that is 0.001% more effective than GPT-2, synthetic data creation to train neural nets, and whatnot.

But before diving into all these things, we must fondly remember the simpler, time-tested, arguably more efficient algorithm for less complex problems. Ladies and gentlemen, I'M TALKING ABOUT NONE OTHER THAN LINEAR REGRESSION.

Yes, you heard me right, I'm going to show you the power of linear regression.

If I had to put it in a single sentence, linear regression is a machine learning model that tries to find a linear equation

f(x) = y = ax + b

that fits our data the best.

It's not all sunshine and rainbows though, as we run into our first major issue. How do we define "fitting our data the best"?

What do we mean "fitting our data the best"?

We have a few ways in which we could approach this problem.
But before that, let us define the error (residual) at each data point: the gap between the actual value and the predicted one.

Eᵢ = yᵢ - f(xᵢ)

Great! With that out of the way, how do we combine these errors into a single cost function, so that minimizing it gives the line that predicts our data best?

Method 1:

Σᵢ₌₁ⁿ Eᵢ

Whoa, don't run away just yet. Let me explain. All this does is add up the signed error at every data point in our dataset, hoping that a total near zero means the line fits well.

If you have a sharp eye though, you will notice that large positive residuals and large negative residuals can cancel each other out. A total close to zero therefore says nothing about accuracy, and many different lines can produce it.

Well, what else can we do?

Method 2:

Σᵢ₌₁ⁿ |Eᵢ|

You may have already caught this idea: if negative and positive values cancel each other out, just take the absolute value. If you look closely, though, this criterion can still be minimized by more than one line for some datasets. If you don't want to take my word for it, try a custom dataset in graphing software like Desmos.

Method 3:

Σᵢ₌₁ⁿ Eᵢ²

Beautiful! This is called the least squares criterion, and it fixes both of our major issues: getting multiple lines and opposite signs cancelling each other out.
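
If you'd rather check this in code than in Desmos, here is a small sketch on a made-up dataset that lies exactly on y = x. It scores two candidate lines with all three methods.

```python
xs = [1, 2, 3]
ys = [1, 2, 3]  # toy data lying exactly on y = x


def criteria(a, b):
    """Score the line y = a*x + b with the three methods."""
    residuals = [y - (a * x + b) for x, y in zip(xs, ys)]
    return {
        "sum": sum(residuals),                   # Method 1
        "abs": sum(abs(e) for e in residuals),   # Method 2
        "squares": sum(e * e for e in residuals) # Method 3
    }


perfect = criteria(1, 0)  # the line y = x
flat = criteria(0, 2)     # the horizontal line y = 2
print(perfect)
print(flat)
```

Method 1 scores both lines as zero, even though the flat line misses two of the three points. The squared criterion tells them apart.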

Knowing this, we can move on to the next step in our analysis of this algorithm.

How does the machine find out the equation now?

There are 2 main ways in which the equation for a linear regression model is found.

Firstly, we have the closed-form solution:
If the dataset isn't massive, the slope and intercept of the equation can be solved for directly, almost in a single shot.

And secondly, gradient descent:
If the dataset is huge, the formula gets messy and solving for the necessary values becomes time-consuming and daunting. Instead of all that, we just let the computer walk downhill step by step. It looks at the slope of the error curve and keeps adjusting the slope and intercept until it reaches the lowest point.
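
For the curious, here is a rough sketch of both routes for a single feature, in plain Python. The closed form uses the standard least-squares formulas for slope and intercept. The gradient descent version minimizes the mean squared error, with a learning rate and step count picked by hand for this toy dataset (both are assumptions and would need tuning on real data).

```python
def fit_closed_form(xs, ys):
    """Least-squares slope and intercept in one shot."""
    n = len(xs)
    x_mean = sum(xs) / n
    y_mean = sum(ys) / n
    sxy = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
    sxx = sum((x - x_mean) ** 2 for x in xs)
    a = sxy / sxx
    return a, y_mean - a * x_mean


def fit_gradient_descent(xs, ys, learning_rate=0.05, steps=5000):
    """Walk downhill on the mean squared error, one small step at a time."""
    n = len(xs)
    a, b = 0.0, 0.0
    for _ in range(steps):
        errors = [a * x + b - y for x, y in zip(xs, ys)]
        grad_a = 2 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2 / n * sum(errors)
        a -= learning_rate * grad_a
        b -= learning_rate * grad_b
    return a, b


xs, ys = [1, 2, 3, 4], [3, 5, 7, 9]  # toy data generated from y = 2x + 1
print(fit_closed_form(xs, ys))
print(fit_gradient_descent(xs, ys))
```

Both routes should land on (almost) the same slope and intercept. The difference is how they get there.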

As much as I would love to explain the math further, it could get boring and could go out of the scope of this article. If you are interested, I could cover it in another article some time in the future.

Yeah, but this is a toy right? Real problems have SO MANY variables to account for

With one variable, we're fitting a line on a 2D graph. With two variables, it becomes a plane on a 3D graph, and as the number of variables increases, we can no longer visualize the fit.

It is thus easier to illustrate with an example of housing costs in the hypothetical city of "Machineland" where we can see flying cars for transport and humanoids in the government.

Price = 100*Area + 7000*Bedrooms + 30000*Location + 25*No. of humanoids in 1km^2 range + Intercept

Each coefficient thus tells you the impact of that feature, keeping others constant.
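
Here is the Machineland formula as a tiny sketch. The coefficients come from the equation above. The intercept value and the example house are made up, since the post does not give them.

```python
def machineland_price(area, bedrooms, location, humanoids, intercept=50_000):
    """Price formula from above; the intercept default is a made-up placeholder."""
    return (100 * area + 7000 * bedrooms + 30000 * location
            + 25 * humanoids + intercept)


base = machineland_price(area=1000, bedrooms=2, location=3, humanoids=40)
one_more_bedroom = machineland_price(area=1000, bedrooms=3, location=3, humanoids=40)
print(one_more_bedroom - base)  # the bedroom coefficient, everything else held constant
```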

One limitation however is that the variables can sometimes be interdependent. Welcome to multicollinearity, which makes interpretation tricky.

Why should I care?

Interpretable (easy to explain results -> +25 on the price per extra humanoid within 1 km^2)
Fast (Trivial to compute even on massive datasets)
Baseline (Every ML pipeline starts here)

Code demo

Run this script in your Python IDE so that we can relate pizza size to happiness using linear regression 🍕🥳

from sklearn.linear_model import LinearRegression
import numpy as np

# Pizza size in inches
X = np.array([[6], [8], [10], [12], [14]])
# Happiness rating out of 10 (totally made up!)
y = np.array([3, 5, 7, 9, 10])

# Train the model
model = LinearRegression()
model.fit(X, y)

print("Slope:", model.coef_[0])
print("Intercept:", model.intercept_)
print("Predicted happiness for a 16-inch pizza:", model.predict([[16]])[0])

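Your output should look like this (the last digits may differ slightly because of floating-point rounding):

Slope: 0.9
Intercept: -2.2
Predicted happiness for a 16-inch pizza: 12.2

Linear regression basically learns that the bigger the pizza, the happier the human! Notice that the prediction overshoots the 10-point scale: a straight line keeps climbing when you extrapolate beyond the data it was trained on.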

Conclusion

I hope that you enjoyed my overview of this interesting topic. So before your next "wrapping an LLM and calling it an AI project", please show some respect for the cradle of machine learning, i.e. linear regression.

Thank you for reading!

Redact: AI redteam for LLM powered applications

vaishakh — Mon, 11 Aug 2025 01:10:34 +0000

This is a submission for the Redis AI Challenge.

What I built

Redact-LLM is a red-team automation platform that stress-tests AI systems. It:

Generates targeted adversarial prompts (jailbreak, hallucination, advanced)
Executes them against a target model
Evaluates responses with a strict, JSON-only security auditor
Surfaces a resistance score and vulnerability breakdown with recommendations

Frontend: React + Vite. Backend: FastAPI. Model execution/evaluation: Cerebras Chat API. Redis provides real-time coordination, caching, and rate controls.

Live demo (frontend + auth only): https://redact-llm.vercel.app
GitHub repository: https://github.com/VaishakhVipin/Redact-LLM

Note: The backend could not be deployed on Vercel due to large build size constraints. The live link demonstrates the frontend and authentication flows; backend/API testing should be run locally.

Screenshots (since deployment is not working as my build is too large)

Auth pages (/login) :

Home page (/):

Prompt analysis (/analysis/XXXXXXXX):

Real application flow (where Redis fits)

1) Prompt submission (frontend → backend)

User submits a system prompt via /api/v1/attacks/test-resistance.
Backend validates and enqueues a job.

2) Job queue on Redis Streams

JobQueue.submit_job() writes to stream attack_generation_jobs using XADD.
Workers pull jobs (currently via XRANGE), generate adversarial attacks, execute them against the target model, and persist results.
Results are stored in Redis under job_result:{id} with a short TTL for quick retrieval.
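
A minimal sketch of this handoff, assuming a redis-py style client that exposes `xadd`, `setex`, and `get`. The stream name, key pattern, and TTL come from this post. The helper names and the field names `job_id` and `prompt` are hypothetical and may not match the repo.

```python
import json
import uuid

STREAM = "attack_generation_jobs"  # stream name from this post
RESULT_TTL_SECONDS = 300           # the short TTL mentioned under "Key patterns and TTLs"


def submit_job(client, system_prompt):
    """Enqueue a job on the stream (XADD) and return its id."""
    job_id = str(uuid.uuid4())
    # Field names are illustrative, not taken from the repo
    client.xadd(STREAM, {"job_id": job_id, "prompt": system_prompt})
    return job_id


def store_result(client, job_id, result):
    """Persist a worker's result under job_result:{id} with a TTL (SETEX)."""
    client.setex(f"job_result:{job_id}", RESULT_TTL_SECONDS, json.dumps(result))


def fetch_result(client, job_id):
    """Return the stored result, or None if it is missing or has expired."""
    raw = client.get(f"job_result:{job_id}")
    return json.loads(raw) if raw is not None else None
```

With a running Redis instance, `client` could be a `redis.Redis()` connection. The API layer would then poll `fetch_result` until the worker has called `store_result`.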

3) Semantic caching for cost/latency reduction

Both generation and evaluation leverage a semantic cache to deduplicate similar work.
Implementation: backend/app/services/semantic_cache.py
- Embeddings via SentenceTransformer('all-MiniLM-L6-v2')
- Embedding cache key: semantic_cache:embeddings:{hash(text)}
- Item store: semantic_cache:{namespace}:{key} with text, embedding, metadata
- Default similarity threshold: 0.85; the evaluator uses 0.65 for higher hit rates
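
The lookup idea can be sketched without Redis or a real embedding model. The example below assumes cosine similarity as the metric (the post does not say which one is used) and uses tiny made-up vectors in place of SentenceTransformer embeddings. Function names are hypothetical.

```python
import math


def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def semantic_lookup(query_embedding, cached_items, threshold=0.85):
    """Return the most similar cached value at or above the threshold, else None."""
    best_value, best_score = None, threshold
    for embedding, value in cached_items:
        score = cosine_similarity(query_embedding, embedding)
        if score >= best_score:
            best_value, best_score = value, score
    return best_value


cache = [([1.0, 0.0], "cached attack set A"), ([0.0, 1.0], "cached attack set B")]
print(semantic_lookup([1.0, 0.9], cache))                  # miss at the default 0.85
print(semantic_lookup([1.0, 0.9], cache, threshold=0.65))  # hit at the looser 0.65
```

The second call shows why a lower threshold gives the evaluator more hits: the same query that misses at 0.85 is close enough at 0.65.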

4) Strict evaluator with conservative defaults

The evaluator builds a rigid JSON-only prompt (no prose/markdown). Any uncertainty defaults *_blocked=false.
It caches evaluations semantically and can publish verdicts (channel verdict_channel) when configured.
Key logic: backend/app/services/evaluator.py
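
The "any uncertainty defaults to not blocked" rule might look roughly like this on the parsing side. The field names are hypothetical, built from the attack types listed earlier plus the `*_blocked` pattern. The real evaluator may use different keys.

```python
import json

# Hypothetical field names following the *_blocked pattern
VERDICT_FIELDS = ("jailbreak_blocked", "hallucination_blocked", "advanced_blocked")


def parse_verdict(raw_response):
    """Parse the auditor's JSON reply; anything unclear counts as not blocked."""
    verdict = {field: False for field in VERDICT_FIELDS}
    try:
        data = json.loads(raw_response)
    except (TypeError, ValueError):
        return verdict  # prose, markdown, or broken JSON: stay conservative
    if not isinstance(data, dict):
        return verdict
    for field in VERDICT_FIELDS:
        # Only an explicit JSON true marks an attack as blocked
        verdict[field] = data.get(field) is True
    return verdict


print(parse_verdict('{"jailbreak_blocked": true}'))
print(parse_verdict("Sure! The model resisted the attack."))
```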

5) API reads from Redis

Poll for job results at job_result:{id}
Queue stats derived from stream + result keys

Why Redis

Low-latency, async client: redis.asyncio with pooling, health checks, and retries (RedisManager)
Streams for reliable job handoff and scalable workers
Semantic cache to avoid duplicate LLM calls (cost/time savings)
Short-lived result caching for responsive UX
Central place for rate limiting and pipeline metrics

Redis components in this repo

Client/connection management: backend/app/redis/client.py
- Connection pooling, PING on startup, graceful shutdown via FastAPI lifespan
- Env-driven config: REDIS_HOST, REDIS_PORT, REDIS_USERNAME, REDIS_PASSWORD
Streams/queue: backend/app/services/job_queue.py
- Stream name: attack_generation_jobs
- XADD for jobs; results in job_result:{id} via SETEX
- Stats via XRANGE and key scans
Optional stream helper: backend/app/redis/stream_handler.py
- Example stream prompt_queue and XADD helper
Semantic cache: backend/app/services/semantic_cache.py
- Namespaces (e.g., attacks, evaluations) to segment cache
- Embeddings stored once; items stored with metadata and optional TTL
Rate limiting: backend/app/services/rate_limiter.py
- Per-user/IP/global checks to protect expensive model calls (sliding window)
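
For readers new to sliding-window limits, here is an in-memory sketch of the idea. The repo's limiter is presumably Redis-backed and shared across workers, so treat the class name, its parameters, and the single-process storage as illustration only.

```python
import time
from collections import deque


class SlidingWindowLimiter:
    """Allow at most max_calls per key within the last window_seconds."""

    def __init__(self, max_calls, window_seconds, clock=time.monotonic):
        self.max_calls = max_calls
        self.window_seconds = window_seconds
        self.clock = clock
        self.calls = {}  # key (user id, IP, or "global") -> deque of timestamps

    def allow(self, key):
        now = self.clock()
        timestamps = self.calls.setdefault(key, deque())
        # Drop timestamps that have slid out of the window
        while timestamps and now - timestamps[0] >= self.window_seconds:
            timestamps.popleft()
        if len(timestamps) >= self.max_calls:
            return False
        timestamps.append(now)
        return True


limiter = SlidingWindowLimiter(max_calls=2, window_seconds=60)
print([limiter.allow("user:42") for _ in range(3)])  # the third call is rejected
```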

Key patterns and TTLs

Embedding: semantic_cache:embeddings:{hash} (no TTL)
Item: semantic_cache:{namespace}:{key} (optional TTL)
Job result: job_result:{uuid} (TTL ≈ 300s)
Stream: attack_generation_jobs

Operational notes

Startup connects to Redis and pings; backend degrades gracefully if unavailable
Strict evaluator prompt + temperature 0.0 for deterministic scoring
Similarity threshold tuned differently for generator vs evaluator to maximize reuse while avoiding false matches

Impact

60–80% fewer repeated LLM calls on similar prompts through semantic caching
Real-time UX via streams/results cache without overloading the model backend
Deterministic, stricter evaluations produce stable security scoring for dashboards


40% of top devs distrust AI coding tools in 2025

vaishakh — Fri, 08 Aug 2025 15:39:00 +0000

When AI Isn't Your Friend: Why 40% of Advanced Developers Distrust Code-Auto Tools

“Code is liability. AI just gives you more of it, faster.”

In 2025, Stack Overflow reported that 40% of experienced developers do not trust AI coding tools. This includes Cursor, Copilot, and Windsurf. They can generate functional-looking code from a short prompt, but that does not mean the output is safe, efficient, or even correct.

The issue is not speed. AI can produce code quickly. The issue is trust and reliability. In production environments, a single incorrect assumption can cause cascading failures. AI coding assistants often lack the context required to avoid those mistakes.

The Mirage of AI Coding

AI tools generate code based on statistical patterns from training data. They do not reason about your specific project in the way a developer does. Without full project context, they optimize for code that appears correct syntactically and stylistically, not necessarily code that works for your architecture.

Example:

# Prompt: "merge sorted lists"
def merge_sorted_lists(a, b):
    return sorted(a + b)  # works until a or b is a generator

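In a test snippet, this works. In a live service where a or b might be iterators or streams, this will break (a + b raises a TypeError for generators) or cause performance bottlenecks, because both inputs are fully copied into memory before sorting.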
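
For contrast, here is one way a reviewer might rewrite it, as a sketch using the standard library's `heapq.merge`. It accepts any sorted iterables and yields items lazily. The function name is hypothetical, and the sketch assumes both inputs really are sorted.

```python
import heapq


def merge_sorted(a, b):
    """Lazily merge two already-sorted iterables into one sorted iterator."""
    return heapq.merge(a, b)


evens = (n for n in [0, 2, 4])  # a generator
odds = iter([1, 3, 5])          # a plain iterator
print(list(merge_sorted(evens, odds)))  # [0, 1, 2, 3, 4, 5]
```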

Why Developers Distrust AI Code

1. Context Blindness
AI models do not maintain an internal representation of your entire codebase. Even with extended context windows, they work on a limited set of input tokens. This means they can miss existing utility functions, established patterns, or constraints.

2. Overconfident Wrongness
LLMs tend to produce answers in a confident tone, even when incorrect. This leads to errors that are harder to detect because they appear intentional and well-structured.

3. Security and Maintainability Risks
AI tools can introduce outdated dependencies, unsafe input handling, and inconsistent coding styles. These issues increase the attack surface and make long-term maintenance harder.

The False Sense of Speed

Rapid code generation is attractive during prototyping, but in production the cost of hidden errors outweighs the time saved. AI can reduce the initial coding time from hours to minutes, but debugging and refactoring poorly generated code can take days or weeks.

The Goldilocks Zone for AI Tools

Best suited for:

Boilerplate code
Regex patterns
Data parsing scripts
One-off utilities

Not suited for:

Core business logic
Security-critical components
Direct interaction with production databases

Using AI within these boundaries reduces risk while maintaining speed.

The Path to Trust

For AI coding tools to earn developer trust, they need:

Project-wide context awareness
Real-time static analysis integration
Clear uncertainty estimation in outputs

Until then, the safest approach is to treat AI like a junior developer: review every line before merging.

Whispers - A Voice Journaling App with Smart Memory Search (Algolia MCP)

vaishakh — Mon, 28 Jul 2025 00:39:27 +0000

Algolia MCP Server Challenge Submission

Whispers - A Contextual Voice Memory System

What I Built

Whispers is a voice-first journaling application that transforms spoken thoughts into searchable, contextual memories. Users speak naturally into their microphone, and the system captures, processes, and indexes their reflections with semantic understanding. The core innovation is using Algolia MCP Server to power intelligent search that goes beyond keyword matching—it understands context, emotional states, and temporal patterns in your personal narrative.

This isn't just a search engine for text. It's a second brain that remembers not just what you said, but when you said it, how you felt, and what patterns emerge across your thoughts over time.

Demo

🎥 Video Demo:
https://drive.google.com/file/d/1RHyqpW434EeTGdP6xMRYbZCfifNatZd7/view?usp=sharing

GitHub Repository

The complete source code is available at: https://github.com/VaishakhVipin/whispers-final

Key files demonstrating Algolia MCP integration:

backend/services/gemini.py - MCP search orchestration and query decomposition
backend/routes/stream.py - Algolia indexing and filtered search endpoints
frontend/src/components/SearchInterface.tsx - Natural language search interface
backend/services/algolia.py - Algolia MCP client implementation

How I Utilized the Algolia MCP Server

The Algolia MCP Server is the backbone of Whispers' contextual memory system. Here's how it transforms natural language queries into intelligent, filtered search results:

1. Structured Data Indexing with Rich Metadata

Each journal entry is indexed with comprehensive metadata that enables sophisticated filtering:

entry = {
    "user_id": user.id,           # User isolation
    "session_id": session_id,     # Session grouping
    "date": date,                 # Temporal filtering
    "timestamp": timestamp,       # Precise timing
    "title": title,               # Semantic search
    "summary": summary,           # Contextual understanding
    "tags": tags,                 # Emotional/topic classification
    "text": text,                 # Full content search
    "is_from_prompt": is_from_prompt  # Prompt-driven vs free-form
}

2. Gemini-Powered Query Decomposition

When users ask questions like "When was I stuck?" or "What were my creative ideas last month?", Gemini breaks these into searchable components:

def mcp_search(query, user_id=None):
    """
    Multi-step MCP agent architecture for contextual journal search:

    Stage 1: Intent Extraction - Gemini analyzes query and extracts search terms
    Stage 2: Memory Retrieval - Check local memory for similar past queries
    Stage 3: Search Execution - Query Algolia with extracted terms and user filters
    Stage 4: Synthesis - Feed results back to Gemini for contextual insights
    Stage 5: Memory Storage - Store query and results for future reference
    """
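    import json as pyjson
    import re
    import os
    from datetime import datetime
    import requests  # HTTP client for the Gemini call below (GEMINI_API_KEY is expected at module level)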

    # Initialize response structure
    response = {
        "original_query": query,
        "search_terms": [],
        "stage1_response": "",
        "algolia_hits": [],
        "final_summary": "",
        "memory_used": False,
        "timestamp": datetime.now().isoformat()
    }

    # ===== STAGE 1: Intent Extraction =====
    print("🔍 Stage 1: Extracting intent and search terms...")
    try:
        url = "https://generativelanguage.googleapis.com/v1beta/models/gemini-2.0-flash:generateContent"
        headers = {"Content-Type": "application/json"}

        extraction_prompt = (
            "You are an AI assistant for a journaling app. "
            "Analyze the user's query and extract intent and search terms. "
            "Return a JSON object with: "
            "1. 'is_search': 'yes' if this requires searching past entries, 'no' otherwise "
            "2. 'search_terms': array of specific, relevant search terms "
            "3. 'intent': brief description of what the user is looking for "
            "4. 'response': a helpful, natural response about what you'll search for "
            "Example: {\"is_search\": \"yes\", \"search_terms\": [\"productivity\", \"morning\"], \"intent\": \"finding productivity patterns\", \"response\": \"I'll search for entries about your productivity and morning routines.\"} "
            f"User query: {query}"
        )

        payload = {
            "contents": [{"parts": [{"text": extraction_prompt}]}],
            "generationConfig": {"maxOutputTokens": 512}
        }
        params = {"key": GEMINI_API_KEY}

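        resp = requests.post(url, headers=headers, params=params, json=payload, timeout=15)
        resp.raise_for_status()
        data = resp.json()

        response_text = data.get("candidates", [{}])[0].get("content", {}).get("parts", [{}])[0].get("text", "")
    except requests.RequestException as exc:
        # The excerpt stops inside the try block; this minimal handler only closes it
        print(f"Stage 1 failed: {exc}")

    # Stages 2-5 are not shown in this excerpt
    return response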

3. Contextual Relevance Scoring

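Results are re-ranked with a field-weighted keyword score, so where a term matches counts for more than how often it appears:

def calculate_relevance(hit):
    # search_terms comes from the enclosing scope (the terms extracted in Stage 1)
    relevance_score = 0
    for term in search_terms: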
        term_lower = term.lower()
        if term_lower in hit.get("title", "").lower():
            relevance_score += 3  # Title matches are most important
        if term_lower in hit.get("summary", "").lower():
            relevance_score += 2  # Summary matches are important
        if any(term_lower in tag.lower() for tag in hit.get("tags", [])):
            relevance_score += 1  # Tag matches are good
    return relevance_score
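
To show how such a score could drive the final ordering, here is a self-contained sketch. It repeats the same weights but passes `search_terms` as an argument, and the `rank_hits` helper and sample hits are made up for illustration.

```python
def calculate_relevance(hit, search_terms):
    """Same weights as above: title 3, summary 2, tags 1."""
    score = 0
    for term in search_terms:
        term_lower = term.lower()
        if term_lower in hit.get("title", "").lower():
            score += 3
        if term_lower in hit.get("summary", "").lower():
            score += 2
        if any(term_lower in tag.lower() for tag in hit.get("tags", [])):
            score += 1
    return score


def rank_hits(hits, search_terms):
    """Hypothetical helper: highest-scoring hits first."""
    return sorted(hits, key=lambda hit: calculate_relevance(hit, search_terms), reverse=True)


hits = [
    {"title": "Lunch", "summary": "Felt burnout creeping in", "tags": []},
    {"title": "Burnout week", "summary": "", "tags": ["burnout"]},
    {"title": "Walk", "summary": "", "tags": []},
]
print([hit["title"] for hit in rank_hits(hits, ["burnout"])])
```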

4. Real-World Search Examples

Query: "When did I feel burnt out?"

Gemini Decomposition: ["burnt", "out", "burnout", "exhausted"]
Algolia Filter: user_id:123 AND (burnt OR out OR burnout OR exhausted)
Result: Entries tagged with "burnout", "stress", or containing emotional context

Query: "What were my app ideas last month?"

Gemini Decomposition: ["app", "ideas", "startup", "project"]
Algolia Filter: user_id:123 AND date:2024-06* AND (app OR ideas OR startup OR project)
Result: Creative entries from June with relevant tags

Key Technical Achievements

Contextual Memory Recall

Semantic Understanding: Queries like "when I was struggling" find entries with emotional context, not just the word "struggling"
Temporal Intelligence: "Last week" automatically filters to recent entries
Pattern Recognition: Identifies recurring themes across multiple entries

Privacy-First Architecture

User Isolation: Every search is filtered by user_id ensuring complete data separation
Secure Indexing: No cross-user data leakage in the Algolia index
Audit Trail: All search queries are logged for transparency

Performance Optimization

Sub-200ms Search: Algolia's distributed search infrastructure delivers instant results
Smart Caching: Frequently accessed patterns are cached for faster retrieval
Efficient Filtering: User-specific filters reduce search space and improve performance

Key Takeaways

MCP Enables Contextual Search: Traditional search engines match keywords. MCP with Gemini enables understanding of intent, emotion, and temporal context.
Structured Data Powers Intelligence: Rich metadata (tags, dates, user context) transforms simple text search into intelligent memory recall.
User Isolation is Critical: Multi-tenant applications require careful filter design to prevent data leakage while maintaining search performance.
Natural Language Queries Need Decomposition: Complex questions require breaking down into searchable components while preserving semantic meaning.
Relevance Scoring Matters: Beyond simple keyword matching, contextual relevance scoring ensures users find the most meaningful memories.

Technical Stack

Voice Processing:

AssemblyAI Universal Streaming for real-time transcription
WebSocket for low-latency audio streaming

AI & Search:

Google Gemini for query decomposition and content analysis
Algolia MCP Server for contextual search and filtering
FastAPI for backend orchestration

Data Architecture:

Supabase for user authentication and session management
Algolia for search indexing with rich metadata
React + TypeScript for responsive frontend

Deployment:

Vercel for frontend hosting
Vercel Functions for serverless backend
Environment-based security configuration

What's Next

Immediate Roadmap:

Implement semantic similarity search for finding related memories
Add emotional trend analysis across time periods
Create memory timelines with contextual insights

Future Enhancements:

Voice emotion detection for enhanced emotional context
Collaborative memory sharing with privacy controls
Integration with calendar and productivity apps
Advanced pattern recognition for personal growth insights

Final Note

Whispers demonstrates how Algolia MCP Server can transform simple text search into contextual memory recall. By combining structured data indexing, intelligent query decomposition, and semantic relevance scoring, it creates a second brain that understands not just what you said, but the context, emotion, and patterns in your thoughts over time.

The project showcases how MCP technology enables applications that feel like they understand you—not just search your data, but help you rediscover and reflect on your own thoughts and growth journey.

Whispers - A Real-Time Voice Journaling Agent Built with AssemblyAI

vaishakh — Mon, 28 Jul 2025 00:39:24 +0000

This is a submission for the AssemblyAI Voice Agents Challenge

Whispers - A Real-Time Voice Journaling Agent

What I Built

Whispers is a voice-first journaling application powered by AssemblyAI's universal-streaming API. It enables users to speak their thoughts in real-time, intelligently formatting their words into reflective, readable journal entries. The app serves as a personal wellness companion—part therapist, part mirror, part coach—helping users capture their daily reflections through natural speech.

This project falls under the Real-Time Performance category, demonstrating advanced real-time audio processing with sub-300ms latency for live transcription display. The application showcases how AssemblyAI's universal-streaming technology can create seamless, responsive voice experiences that feel natural and immediate.

Demo

🎥 Video Demo: https://drive.google.com/file/d/1RHyqpW434EeTGdP6xMRYbZCfifNatZd7/view?usp=sharing

GitHub Repository

The complete source code is available at: https://github.com/VaishakhVipin/whispers-final

Key files demonstrating AssemblyAI integration:

backend/services/assembly.py - Python WebSocket streaming implementation
frontend/src/components/NotionLikeEditor.tsx - Frontend WebSocket integration
backend/routes/stream.py - Backend API endpoints for voice processing
frontend/src/lib/api.ts - Frontend API integration

Technical Implementation & AssemblyAI Integration

AssemblyAI's universal-streaming WebSocket API is the core of Whispers' real-time voice processing capabilities. The implementation streams microphone audio and receives live, formatted transcripts with exceptional accuracy and minimal latency.

Key AssemblyAI Features Implemented:

Real-time WebSocket Connection: Direct streaming to AssemblyAI's v3 streaming endpoint with formatted finals
Live Transcription: Continuous audio processing with immediate text output and partial transcript display
Auto-formatting: Clean, punctuated transcripts with proper sentence boundaries using formatted_finals=true
Streaming State Management: Robust connection handling with proper cleanup and error recovery
Duplicate Detection: Intelligent handling to prevent transcription artifacts and repeated content
Paragraph Logic: Smart paragraph spacing based on content analysis and sentence boundaries

Code Snippet - Python WebSocket Implementation:

import asyncio
import json

import websockets


async def stream_to_assemblyai(audio_generator):
    """
    Streams PCM audio chunks to AssemblyAI Universal-Streaming API and yields transcript text results.
    :param audio_generator: async generator yielding raw PCM audio bytes
    :yield: transcript text (str)
    """
    # The token helper and ASSEMBLYAI_WS_BASE are defined elsewhere in the module
    token = get_assemblyai_token_universal_streaming()
    ws_url = ASSEMBLYAI_WS_BASE + token

    async with websockets.connect(ws_url) as ws:
        async def send_audio():
            async for chunk in audio_generator:
                await ws.send(chunk)
            # Ask the v3 endpoint to end the session once the audio runs out
            await ws.send(json.dumps({"type": "Terminate"}))

        async def receive_transcripts():
            async for msg in ws:
                data = json.loads(msg)
                # Same message shape the frontend snippet handles: formatted "Turn" messages are final
                if data.get("type") == "Turn" and data.get("turn_is_formatted"):
                    yield data.get("transcript", "")

        send_task = asyncio.create_task(send_audio())
        async for transcript in receive_transcripts():
            yield transcript
        await send_task

Frontend JavaScript Integration:

// Connect to AssemblyAI WebSocket
const ws = new WebSocket(`wss://streaming.assemblyai.com/v3/ws?sample_rate=16000&formatted_finals=true&token=${token}`);

ws.onmessage = (event) => {
  const data = JSON.parse(event.data);

  if (data.type === "Turn") {
    const transcript = data.transcript || "";
    const turnIsFormatted = data.turn_is_formatted || false;

    if (turnIsFormatted && transcript.trim()) {
      // Final, formatted version - add to main transcription
      console.log("📝 Clean transcription:", transcript);

      // Check for duplicates and add with proper paragraph spacing
      const shouldStartNewParagraph = shouldStartNewParagraphLogic(transcript, transcriptionText);
      const separator = shouldStartNewParagraph ? "\n\n" : " ";

      setTranscriptionText(prev => {
        const trimmedTranscript = transcript.trim();
        const trimmedPrev = prev.trim();

        // Robust duplicate detection
        if (trimmedTranscript && 
            !trimmedPrev.endsWith(trimmedTranscript) && 
            !trimmedPrev.includes(trimmedTranscript + " " + trimmedTranscript)) {
          return prev + (prev && !prev.endsWith('\n\n') ? separator : "") + transcript;
        }
        return prev;
      });
    } else if (!turnIsFormatted && transcript.trim()) {
      // Partial version - show in real-time stream
      setCurrentStreamText(transcript);
    }
  }
};
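
The duplicate check is easier to reason about in isolation. Below is a simplified Python sketch of the append step (Python only to keep it short and testable). It covers just the "already ends with this text" case and leaves out the paragraph-spacing decision, whose logic is not shown in the snippet above. The function name is hypothetical.

```python
def append_transcript(prev, transcript, separator=" "):
    """Append a finalized transcript unless the text already ends with it."""
    new_text = transcript.strip()
    if not new_text:
        return prev
    if prev.strip().endswith(new_text):
        return prev  # same turn delivered twice, so skip it
    if not prev:
        return new_text
    return prev + separator + new_text


text = ""
for turn in ["Today was a good day.", "Today was a good day.", "I went for a run."]:
    text = append_transcript(text, turn)
print(text)
```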

UX Design & Features

Voice-First Interface:

Minimalist journaling canvas with vintage paper aesthetic
Pulsing recording indicator for live microphone status
Real-time word count and session duration tracking
Intelligent duplicate detection to prevent transcription artifacts

Smart Journaling Features:

Daily Reflection Prompts: Curated prompts that refresh daily at 12 AM GMT
Tone Rewriting: AI-powered text transformation (optimistic, technical, formal, etc.)
Session Management: Edit sessions created on the same day, read-only after that
Content Analysis: Automatic title generation, summaries, and key theme extraction
Search & Discovery: Full-text search across all journal entries

Technical Architecture:

Frontend: React + TypeScript + Vite + Tailwind CSS + Shadcn/ui
Backend: FastAPI + Python for API endpoints and AI processing
Database: Supabase for user authentication and session storage
Search: Algolia for fast, semantic search across journal entries
AI Processing: Google Gemini for content summarization and tone rewriting

Key Technical Achievements

Real-Time Performance:

Sub-300ms latency for live transcription display
Seamless WebSocket connection management
Efficient audio processing with proper resource cleanup
Responsive UI updates synchronized with audio state

Domain Expertise:

Specialized journaling workflow optimized for voice input
Intelligent content organization with automatic categorization
User behavior analysis with session statistics and trends
Privacy-focused design with user data isolation

Robust Error Handling:

Graceful microphone permission management
Connection recovery mechanisms
Comprehensive logging for debugging
Fallback modes for degraded performance

Key Takeaways

AssemblyAI's Real-time Capabilities: The universal-streaming API provides exceptional low-latency transcription with remarkable accuracy, making voice journaling feel natural and responsive.
WebSocket Management is Critical: Proper cleanup of WebSocket connections and audio resources is essential, especially when users navigate between pages or close the application.
Voice Journaling Requires Context: Beyond simple text capture, voice journaling benefits from emotional context, prompting, and intelligent content organization.
Immutable Journals Encourage Honesty: Locking journal entries after creation (read-only after the same day) encourages more authentic, unfiltered self-reflection.
Real-time UX Demands Attention: Users expect immediate feedback when speaking, requiring careful attention to UI state management and audio-visual synchronization.

What's Next

Immediate Roadmap:

Deploy live version with enhanced security and RLS re-enabled
Implement user streak tracking and habit formation features
Add sentiment analysis for emotional trend tracking
Create memory timelines and reflection insights

Future Enhancements:

Voice emotion detection for mood tracking
Collaborative journaling features
Integration with wellness apps and calendars
Advanced AI coaching and reflection prompts

Technical Stack

Frontend:

TypeScript (React)
Vite for fast development and building
Tailwind CSS for styling
Shadcn/ui for component library
React Router for navigation

Backend:

FastAPI for RESTful API endpoints
Python for server-side processing
Supabase for authentication and database
Algolia for search indexing

Voice & AI:

AssemblyAI Universal Streaming for real-time transcription
Google Gemini for content analysis and rewriting
WebSocket for real-time communication

Deployment:

Vercel for frontend hosting
Vercel Functions for backend API
Environment-based security configuration

Final Note

Whispers is built for people who think best out loud. It transforms the traditional journaling experience into a dynamic conversation with yourself—live, raw, and authentically yours. By leveraging AssemblyAI's cutting-edge voice technology, Whispers makes capturing daily reflections as natural as having a conversation, while providing the structure and insights that make journaling truly meaningful.

The project demonstrates how real-time voice technology can enhance personal wellness applications, creating a more intuitive and engaging way for users to document their thoughts, emotions, and personal growth journey.