DEV Community: inquistive

Building an AI-Powered YouTube Intelligence Assistant Using Voice AI and Multi-Agent Workflows

inquistive — Tue, 19 May 2026 11:57:00 +0000

Introduction

YouTube has become one of the largest sources of knowledge on the internet. From AI research discussions and startup podcasts to technical tutorials and industry analysis, creators upload hours of valuable content every single day. However, consuming all of this information manually is nearly impossible, especially for users subscribed to dozens or even hundreds of channels.

To solve this problem, I built an AI-powered YouTube intelligence assistant that can search subscribed YouTube channels, extract transcripts from videos, summarize the content, and answer user questions conversationally through voice interaction.

The system combines:
Voice AI
Multi-agent orchestration
Transcript understanding
Large language models

into a single automated workflow.

The complete system is designed to function like a personalized AI research assistant for YouTube content.

The Core Idea
The goal of the project is simple.
Instead of manually watching long videos, users should be able to ask questions naturally using voice and instantly receive summaries or answers extracted directly from YouTube video transcripts.

For example, a user can ask:

“What did AI creators say about OpenAI this week?”

“Summarize the latest Lex Fridman podcast.”

The AI system automatically searches the user’s subscribed channels, identifies relevant videos, extracts transcripts, processes the information using large language models, and returns a conversational response through voice.

This transforms YouTube into an interactive conversational knowledge system.

Overall Workflow Architecture
The workflow is designed as a multi-stage AI pipeline where each component performs a specific responsibility.

The architecture looks like this:

Voice Input
    ↓
Webhook Trigger
    ↓
AI Agent 1 (Search + Orchestration)
    ↓
YouTube API Calls
    ↓
Transcript Extraction
    ↓
AI Agent 2 (Summarization + Q&A)
    ↓
Response Formatting
    ↓
Voice Output

The entire pipeline is connected to an ElevenLabs Voice AI system, allowing users to interact with YouTube content naturally using speech.

Voice AI Integration with ElevenLabs
The interaction begins with ElevenLabs Voice AI. This component acts as the conversational interface between the user and the workflow.
When the user speaks, ElevenLabs performs speech-to-text conversion and sends the query to the automation workflow through a webhook endpoint.

For example, if the user says:

“Summarize the latest AI video from my subscriptions.”

the voice agent converts the speech into text and sends a structured request to the workflow.
The webhook acts as the entry point for the entire system.

After processing is completed, the final AI-generated response is returned back to ElevenLabs, which converts the response into natural speech.
This creates a fully conversational experience where the user can “talk” to YouTube content instead of manually browsing videos.

Webhook Trigger System
The webhook node is responsible for receiving incoming requests from the voice assistant.
It acts as the starting point of the workflow and accepts user queries in real time. Once a request is received, the workflow begins processing the user’s intent.

A typical incoming request may look like this:


{
  "query": "What did AI creators discuss about AGI recently?"
}

This query is then passed to the first AI agent for reasoning and orchestration.

AI Agent 1 — Search and Orchestration Layer
The first AI agent functions as the orchestration layer of the system. Its primary responsibility is to understand the user query and determine how the workflow should proceed.

This agent is connected to multiple tools and APIs, including:

Gemini AI model
YouTube API requests
Search utilities
Metadata retrieval tools

The agent performs several important tasks:
Understanding user intent
Identifying relevant topics
Searching subscribed channels
Selecting appropriate videos
Generating structured outputs for downstream processing

For example, if the user asks:

“What are my subscribed creators saying about AI agents?”

the agent identifies:
The topic (“AI agents”)
Relevant subscribed channels
Recent related videos
Appropriate video IDs
This modular approach separates retrieval and orchestration from deep reasoning, improving scalability and reducing hallucinations.

YouTube API Integration
Once the first agent understands the query, the workflow interacts with YouTube APIs to fetch relevant information.

The APIs are used to retrieve:
Subscribed channels
Recent uploads
Video metadata
Search results
Video identifiers

This makes the system highly personalized because the search is restricted to the user’s subscriptions rather than the entire YouTube platform.
The workflow dynamically identifies videos that are most relevant to the user’s query.

JSON Parsing and Structured Data Handling
After the first AI agent completes its reasoning process, the generated output is converted into structured JSON data.

A typical output may include:

{
  "videoId": "abc123",
  "title": "The Future of AI Agents",
  "channel": "AI Explained"
}

The parsing layer extracts important fields such as:
Video IDs
Titles
Transcript references
Metadata

This structured format allows downstream components to process information efficiently and reliably.

Transcript Extraction System
One of the most important parts of the workflow is transcript extraction.
The workflow calls an external transcript API that retrieves subtitles or captions from YouTube videos. This step converts spoken video content into machine-readable text.

For example, the system may receive:

{
  "transcript": "Today we are discussing the future of autonomous AI agents..."
}

This transcript becomes the primary knowledge source for the language model.
Instead of analyzing raw video, the AI processes structured textual content, making summarization and question answering significantly more efficient.

AI Agent 2 — Transcript Intelligence and Reasoning
The second AI agent is focused entirely on transcript understanding and knowledge extraction.
Unlike the first agent, which handles orchestration and retrieval, this agent specializes in:

Summarization
Contextual reasoning
Question answering
Insight extraction
Semantic understanding

The transcript is passed to an OpenAI chat model such as GPT-4o or GPT-4.1, which processes the content and generates high-quality responses.

Users can ask questions such as:

“Summarize this video in five points.”
“What did the speaker say about startup funding?”
“List the key AI trends mentioned in the discussion.”
The AI agent analyzes the transcript and generates concise, human-readable answers.

Why Multi-Agent Architecture Matters
A key design decision in this workflow is the use of multiple AI agents instead of a single monolithic model.

The first agent handles:
Orchestration
Retrieval
API interactions
Workflow decisions

The second agent handles:
Deep reasoning
Summarization
Transcript analysis
Semantic understanding

This separation improves the overall architecture by making the system:
More modular
Easier to debug
More scalable
Less prone to hallucinations
More efficient in handling complex workflows

The modular multi-agent design also makes it easier to upgrade individual components independently in the future.

Response Formatting and Voice Output

Once the summarization is completed, the response is passed through a formatting layer that converts it into a schema compatible with the voice assistant.

For example:

{
  "response": "The video discusses recent advances in autonomous AI agents and their impact on software development."
}

This response is then returned to ElevenLabs, which converts the text back into natural speech.
The user ultimately experiences a seamless conversational interaction where spoken questions are answered using information extracted directly from YouTube videos.

Key Advantages of the System
One of the biggest strengths of this workflow is personalization. Since the system focuses only on subscribed channels, the generated summaries are highly relevant to the user’s interests.
The system also eliminates the need to manually watch long-form content. Instead of spending hours consuming videos, users can retrieve insights instantly through natural language interaction.
Another major advantage is scalability. The workflow can easily be expanded to support:

Podcasts
Educational lectures
Research papers
Interviews
Technical discussions
Industry news monitoring

The architecture effectively transforms YouTube into a searchable AI-powered knowledge base.

Conclusion
This project demonstrates how modern AI systems can combine voice interfaces, retrieval pipelines, transcript understanding, and large language models to create highly interactive knowledge assistants.

By integrating:

ElevenLabs Voice AI
YouTube APIs
Transcript extraction systems
Gemini orchestration agents
OpenAI reasoning models

the workflow transforms YouTube from a passive video platform into a conversational AI-powered research system.

The architecture highlights the growing potential of multi-agent AI systems capable of retrieving, understanding, and summarizing long-form multimedia content in real time.

As AI workflows continue to evolve, systems like this could become the foundation for next-generation research assistants, educational copilots, podcast intelligence platforms, and personalized knowledge retrieval systems.

Beyond Simple OCR: Building an Autonomous VLM Auditor for E-Commerce Scale

inquistive — Sun, 05 Apr 2026 14:02:28 +0000

In the world of global e-commerce, “dirty data” is a multi-billion dollar problem. Product dimensions (Length, Width, Height) are often inconsistent across databases, leading to shipping errors, warehouse mismatches, and customer returns.

Traditional OCR struggles with complex specification badges, and manual auditing is impossible at the scale of millions of ASINs. Enter the Autonomous VLM Auditor — a high-efficiency pipeline utilizing the newly released Qwen2.5-VL to extract, verify, and self-correct product metadata.

The Novelty: What Makes This Different?

Most Vision-Language Model (VLM) implementations focus on captioning or chat. This project introduces three specific technical novelties:

1. The “Big Brain, Small Footprint” Strategy
To process over 6,000 images at scale, we utilized 4-Bit Quantization (NF4) via BitsAndBytes. In the world of VLMs, memory is the primary bottleneck. By compressing the model's weights from 16-bit to 4-bit, we reduced the VRAM footprint by nearly 70%.

Why 4-bit? * Hardware Accessibility: It allows the Qwen2.5-VL-3B model to run comfortably on a standard 15GB VRAM envelope, such as a Kaggle T4 GPU or a consumer-grade RTX 3060.

Precision Preservation: Through NormalFloat4 (NF4) and bfloat16 compute types, we maintain high reasoning accuracy. The model doesn't just see the numbers; it retains the "intelligence" required to understand spatial context in product images without the massive hardware cost.
Throughput: Smaller memory requirements mean faster loading and more stable long-term batch processing without hitting memory walls.

bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16
)

2. The Agentic Audit Loop
Extraction is only half the battle. The core innovation here is the Agentic Self-Evaluation logic. Instead of blindly trusting the AI, the system:

Extracts dimensions from the image.
Normalizes units (converting CM to Inches on the fly).
Audits the output against Ground Truth using a 10% tolerance threshold.
Categorizes results into VERIFIED, PARTIAL_DISCREPANCY, or CRITICAL_DISCREPANCY.

3. Robust Extraction Engine (Regex-JSON Hybrid)
VLMs are notoriously wordy. To turn a conversational AI response into a production-ready database entry, we implemented a robust Regex Parser that identifies JSON structures within the model’s chat output. This ensures that even if the model “thinks out loud,” the system only captures the structured {'L': val, 'W': val, 'H': val} payload.

The Technical Deep-Dive

Memory-Efficient Vision Processing
To prevent Out-Of-Memory (OOM) errors during long-running batch jobs, the pipeline utilizes aggressive memory management:

Strategic memory cleanup after every 5 images

with torch.no_grad():
    generated_ids = model.generate(**inputs, max_new_tokens=128)
del inputs, generated_ids
torch.cuda.empty_cache()
gc.collect()

This ensures the VRAM “waterline” remains flat, allowing the agent to process thousands of images without degrading performance.

Handling Multi-Modal Discrepancies
The “Audit Logic” accounts for the messiness of real-world data. By implementing an is_close function with a 0.1 + 0.5 tolerance, we account for both rounding differences (standard vs. metric) and minor OCR misreadings, focusing only on the "Critical Discrepancies" that actually impact the bottom line.

Why This Matters for the Future of Data Science
We are moving away from “AI as a tool” and toward “AI as an Auditor.” By combining the visual reasoning of Qwen2.5-VL with structured verification logic, we’ve built a system that doesn’t just see — it understands and validates. For businesses managing massive inventories, this approach replaces thousands of human hours with a single, reproducible Python loop.

The result? A verified, high-integrity dataset ready for logistics, analytics, and better customer experiences.

Conclusion-Building the Trust Layer for Visual AI
The true value of this project isn’t just that it works — it’s that it establishes a scalable trust layer between raw pixels and reliable structured data.

By employing 4-bit quantization via BitsAndBytes with the Qwen2.5-VL model, we have demonstrated that state-of-the-art vision processing doesn't require state-of-the-art hardware budgets. This optimization democratizes high-performance VLM auditing, allowing anyone with modest hardware to enforce strict data integrity over thousands of products.

We are moving past the initial excitement of “Generative AI” and into the crucial phase of Autonomous Validation. This closed-loop agent architecture proves that AI can not only perform complex tasks but also criticize its own performance against business logic, paving the way for fully autonomous, high-integrity data pipelines in e-commerce and beyond.

Secure AI-Powered Dependency Conflict Resolution with Auth0 Authentication"

inquistive — Mon, 20 Oct 2025 07:40:02 +0000

This is a submission for the Auth0 for AI Agents Challenge

What I Built

I built an agentic AI application that streamlines complex package dependency management by integrating AI-powered conflict resolution, natural language guidance, and automatic package installation. This tool interacts with multiple public APIs for package metadata but secures these external calls through Auth0’s robust authentication mechanisms. The AI agents autonomously make dependency decisions while operating under fine-grained access control, solving real-world challenges in software maintenance and security.

This AI-powered dependency management solution is implemented as a Command Line Interface (CLI) tool. This design choice enables easy integration into developer workflows and automation pipelines without requiring a graphical interface.
Users can run the tool directly within terminals or CI/CD environments to:

Query package dependencies and versions
Detect and resolve conflicts
Perform AI-guided installation and updates securely The CLI interface provides concise, user-friendly logs and summaries, making the AI-driven recommendations and Auth0-protected actions clear and accessible to developers and automation scripts alike.

One core challenge this AI-powered tool tackles is resolving dependency conflicts that frequently arise when installing or upgrading software packages. These conflicts happen because different packages require overlapping dependencies but may specify incompatible version ranges.

When the tool checks the package environment, it uses AI to:

Detect conflicts such as when one package demands version 2.0 of a library but another requires 3.0, which can cause errors.
Provide natural language explanations of the conflicts, making it easier for developers to understand what is causing the issue.
Recommend resolutions like downgrading or upgrading specific dependencies to compatible versions.
Automatically apply fixes by choosing the best matching package versions based on historical success and AI prediction.
Audit and log each decision and action, including authentication through Auth0 to ensure only authorized fixes are executed.

This automated conflict resolution system helps avoid manual trial-and-error, prevents broken builds, and accelerates secure package installation with confidence. It greatly improves developer productivity by handling the complexity of dependency trees with intelligent guidance under strict security controls.

Demo

The project repository is: Repo

Project demo on Youtube: Try it Out

The demo includes:

Secure authentication logs showing successful Auth0 token fetch and cache

AI-driven dependency checks and conflict resolution messages

Automated installation commands issued only within authenticated sessions

Comprehensive audit records with sensitive information sanitized for security compliance

Example :

Commands:

Install a specific package version:

python Enhanced_version_conflict.py install <package_name> <version>  # or only <package_name>

If version conflict exist then Auto-resolve:

python Enhanced_version_conflict.py auto-resolve <package_name> <version>  # or only <package_name>

How I Used Auth0 for AI Agents

This application includes two primary AI agents that work in tandem with Auth0-secured authentication to automate complex dependency management tasks:

This AI agent uses Google’s Gemini API to analyze dependency conflicts and explain them in natural language — why they occur, and how to fix them. It interprets package metadata, checks version compatibility, and recommends optimal version ranges using contextual reasoning.

It then leverages Gradient AI for intelligent decision-making — predicting the most stable versions, learning from past resolutions, and automating fixes securely via Auth0-authenticated sessions.

Together, Gemini provides insight, while Gradient ensures action — creating a self-learning, auto-resolving system that keeps builds stable and developers confident.

Both AI agents operate within authenticated and authorized sessions secured by Auth0. Their API calls and decision-making processes are protected by bearer tokens to ensure trusted execution. Audit logs capture all AI agent activities, supporting traceability and security compliance.

My application leverages several key Auth0 features and APIs to enable secure agent authentication and authorization:

Authentication API (Client Credentials Grant): The AI agents authenticate securely by exchanging client credentials for bearer tokens, suitable for machine-to-machine trust without user intervention.
Token Management and Caching: Tokens are automatically handled—refreshed on expiry, cached in a secure file, and reused efficiently, ensuring continuous authenticated access without manual intervention.
Bearer Token Authorization: All API requests to external services (e.g., package registries, AI advisers) are made with valid Auth0-issued bearer tokens in Authorization headers, preventing unauthorized access.
Audit Logging API: All security events, including token requests, refresh, and API usage, are locally logged with detailed audit trails. Logs sanitize sensitive header data for privacy and compliance.
Session Tracking: Detailed analytics for each authenticated session include token lifetimes, scopes, and request timestamps, enabling precise monitoring of AI agent activities and access privileges.

This integration ensures that AI agents operate under strong security controls, accessing only authorized resources and maintaining compliance with modern authentication requirements.

Lessons Learned and Takeaways

Developing this project highlighted the crucial role of authentication in building secure, trustable AI applications. Managing token lifecycles with Auth0 greatly simplified ensuring uninterrupted access while maintaining security. Designing effective audit logging gave me insight into balancing transparency and privacy.

Challenges included managing smooth token refresh during continuous AI workflows and sanitizing logs to avoid leaking secrets while preserving useful diagnostics. This experience reinforced that integrating authentication deeply into AI agents enables not only security but also trust and accountability in autonomous systems.

For other developers, deeply understanding Auth0’s token flow and audit capabilities is essential for building AI applications that interact with multiple protected APIs and sensitive resources.

Conclusion

Below is the flow chart diagram to understand it better:

Contact : Me

Dependency Hell is Alive? Developers often face version mismatches when installing libraries. One wrong update shows “version incompatible,” breaks workflows, wastes hours of debugging and stalls projects. Who else has faced this? Share your war stories!💥

inquistive — Tue, 14 Oct 2025 06:33:48 +0000