<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aryan Kargwal</title>
    <description>The latest articles on DEV Community by Aryan Kargwal (@aryankargwal).</description>
    <link>https://dev.to/aryankargwal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F917475%2F82a15bfc-f7ee-407e-a29a-24f9a080c744.jpg</url>
      <title>DEV Community: Aryan Kargwal</title>
      <link>https://dev.to/aryankargwal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aryankargwal"/>
    <language>en</language>
    <item>
      <title>Top 7 AI Agent Supervision Platforms in 2025</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Mon, 27 Oct 2025 20:05:47 +0000</pubDate>
      <link>https://dev.to/aryankargwal/top-ai-7-agent-supervision-platforms-in-2025-2767</link>
      <guid>https://dev.to/aryankargwal/top-ai-7-agent-supervision-platforms-in-2025-2767</guid>
      <description>&lt;p&gt;I’ve spent time on both sides of AI. Hacking together small local demos to test the latest models. And helping enterprises figure out how to put agents into production. One lesson I learnt is that supervision is what keeps things from breaking.&lt;/p&gt;

&lt;p&gt;When you train a model, you expect failures. Runs can stall. Loss curves drift. A mislabeled field can throw the whole pipeline off. You watch it closely because you know unchecked training goes sideways fast.&lt;/p&gt;

&lt;p&gt;Now imagine that same volatility playing out in production. An AI agent running live, talking to customers, acting in workflows. Without supervision, it’s a system learning in public with no one in control.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is AI agent supervision?
&lt;/h2&gt;

&lt;p&gt;AI agent supervision is the practice of watching over and guiding autonomous AI systems as they work. Instead of leaving agents to run unchecked, supervision gives people visibility into what they’re doing, a way to measure if they’re getting things right, and controls to step in when they go off track.&lt;/p&gt;

&lt;p&gt;It’s less about the technology itself and more about the human role of keeping these systems accountable. As agents become part of real workflows — answering customers, running ad campaigns, drafting code, moving money — supervision is what makes them trustworthy.&lt;/p&gt;

&lt;p&gt;AI supervision connects the day-to-day details (logs, dashboards, feedback) with the bigger picture of regulatory, ethical, and safety compliance inside organizations.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does supervision work in AI agents?
&lt;/h2&gt;

&lt;p&gt;Supervision means every agent run is traceable, controllable, and improvable. In practice: you can see the steps an agent took, you can block or reroute unsafe actions while it’s running, and you have a regular way to learn from outcomes so the next run is better.&lt;/p&gt;

&lt;h3&gt;
  
  
  Observability turns the invisible visible
&lt;/h3&gt;

&lt;p&gt;The core problem with agents is opacity. Observability fixes that by recording the full path of a run: retrieval results, tool calls with inputs and outputs, model tokens used, latency by step, and the final outcome. With this trail you can replay a run, compare a good path versus a bad one, and connect answers back to sources.&lt;/p&gt;

&lt;p&gt;Two capabilities make this useful day-to-day:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Trace &amp;amp; replay.&lt;/strong&gt; Open any run and see each step in order. Pinpoint where a wrong source slipped in or where a tool returned an error.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Attribution.&lt;/strong&gt; Link the final answer to specific documents, queries, or tools. If a claim can’t be traced, treat it as ungrounded and fix the routing or retrieval.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
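
&lt;p&gt;As a rough sketch of what such a trail looks like, here is a minimal, hypothetical trace recorder in Python. The step kinds and tool names are invented for illustration; real platforms persist these events and layer search, diffing, and attribution on top.&lt;/p&gt;

```python
import json
import time

class RunTrace:
    """Record every step of an agent run so it can be replayed later."""

    def __init__(self, run_id):
        self.run_id = run_id
        self.steps = []

    def record(self, kind, name, payload, outcome):
        # kind is the step type ("retrieval", "tool_call", "llm", ...);
        # payload and outcome must be JSON-serializable so runs can be stored.
        self.steps.append({
            "ts": time.time(),
            "kind": kind,
            "name": name,
            "input": payload,
            "output": outcome,
        })

    def replay(self):
        # Walk the run in order, e.g. to spot where a bad source slipped in.
        for i, step in enumerate(self.steps):
            print(f"{i}: {step['kind']}/{step['name']} -> {json.dumps(step['output'])[:80]}")

trace = RunTrace("run-001")
trace.record("retrieval", "kb_search", {"query": "refund policy"}, {"doc": "policy_v2.md"})
trace.record("tool_call", "send_email", {"to": "user@example.com"}, {"status": "sent"})
trace.replay()
```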

&lt;h3&gt;
  
  
  Real-time alerts and guardrails keep workflows safe
&lt;/h3&gt;

&lt;p&gt;Think of supervision here the same way you’d think about monitoring in DevOps. In production systems, you don’t wait for a server to crash before reacting — you set alerts for CPU spikes or failed health checks.&lt;/p&gt;

&lt;p&gt;AI agents need the same treatment. Alerts flag unusual behavior the moment it happens, and guardrails are the automated rules that stop agents from doing something out of bounds, like calling the wrong API or overspending on tokens.&lt;/p&gt;

&lt;p&gt;Here is what that looks like in practice &lt;strong&gt;in AI agent supervision:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Alerting when an agent loop runs too long:&lt;/strong&gt; PagerDuty-style ping when an agent retries a tool more than 20 times.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Guardrail blocking unsafe tool calls:&lt;/strong&gt; Block any payment API call that doesn’t include a valid deal ID.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Token or API spend limits:&lt;/strong&gt; Automatically kill a run if it burns more than $5 worth of tokens in a single request.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Safety or policy checks at runtime:&lt;/strong&gt; Flag outputs if the generated text violates brand or safety filters.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The point is to give agents the same resilient safety net that DevOps teams already rely on, where alerts flag trouble as it starts and guardrails automatically contain the impact while still allowing human oversight when it matters.&lt;/p&gt;
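
&lt;p&gt;Stripped to its core, that safety net is just checks wrapped around every tool call and model request. A minimal, hypothetical sketch (the tool names and limits are invented; real guardrail layers are configurable and far more robust):&lt;/p&gt;

```python
class GuardrailViolation(Exception):
    """Raised when a run crosses an alert threshold or breaks a rule."""

class SupervisedRun:
    """Toy runtime guardrails: a retry cap, a spend budget, and a payment check."""

    MAX_ATTEMPTS = 20      # alert threshold from the example above
    MAX_SPEND_USD = 5.00   # kill the run past this token budget

    def __init__(self):
        self.attempts = {}
        self.spend_usd = 0.0

    def charge(self, usd):
        # Called by the token-metering layer after each model request.
        self.spend_usd += usd
        if self.spend_usd > self.MAX_SPEND_USD:
            raise GuardrailViolation("token budget exceeded; run killed")

    def call_tool(self, name, payload):
        self.attempts[name] = self.attempts.get(name, 0) + 1
        if self.attempts[name] > self.MAX_ATTEMPTS:
            raise GuardrailViolation(f"{name} looping; paging on-call")
        if name == "payments.charge" and not payload.get("deal_id"):
            # Guardrail: block payment calls that lack a valid deal ID.
            raise GuardrailViolation("payment call missing deal_id; blocked")
        return {"status": "ok"}

run = SupervisedRun()
print(run.call_tool("payments.charge", {"deal_id": "D-42", "amount": 10}))  # prints {'status': 'ok'}
```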

&lt;h3&gt;
  
  
  Performance reviews drive continuous improvement
&lt;/h3&gt;

&lt;p&gt;Finally, even when agents run smoothly, they can lose accuracy or drift away from business goals over time. That’s why supervision borrows from the workplace: hold reviews, look back at performance, assess decision-making, and decide what needs to change.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9olguocnu12qxmqytw11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9olguocnu12qxmqytw11.png" alt=" " width="686" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Platforms like Wayfound extend the idea of employee performance reviews to agents, using session recordings and interaction data to spot recurring gaps such as failed actions or knowledge blind spots, then suggesting targeted improvements. The difference is simply that the “employee” under review is now an AI agent.&lt;/p&gt;

&lt;p&gt;For teams that want more technical control, open-source frameworks such as OpenAI’s Evals or Arize Phoenix offer benchmarking and trace replay. Those tools require more engineering effort but allow precise measurement and fine-tuned experimentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Benefits of Supervising AI Workflows
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Reliability and low risk for production-scale agents
&lt;/h3&gt;

&lt;p&gt;Reliability is the first promise of supervision. In DevOps, teams earn stability with CI/CD, monitoring, and paging. The same people now run intelligent workflows, so the instincts carry over: surface problems early and design for recovery. &lt;/p&gt;

&lt;p&gt;Concretely, carry over the discipline you already trust:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Versioning and rollback&lt;/strong&gt; so bad changes don’t linger.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automated checks in CI&lt;/strong&gt; to stop regressions before release.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitoring and paging&lt;/strong&gt; on user-visible symptoms, not just internals.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Runbooks and incident response&lt;/strong&gt; to shorten time to mitigation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deployed ML systems do degrade without oversight. That’s drift: training and real-world data diverge, and quality falls unless you watch and adapt. CMU’s Software Engineering Institute calls out this decay in production and focuses on practical drift detection.&lt;/p&gt;

&lt;p&gt;Recent industry observations underline the need for ongoing performance monitoring and show that detection effectiveness depends on data conditions, not on blind faith in a single metric. Supervision brings the loop you need: recording runs, replay for diagnosis, alerts when behavior veers, and policy checks aligned with governance guidance.&lt;/p&gt;
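
&lt;p&gt;The underlying drift check can be sketched in a few lines: compare a live window of some per-run metric against a reference window and alert on a large shift. The numbers below are invented for illustration.&lt;/p&gt;

```python
import statistics

def drift_score(reference, live):
    """Standardized shift of a monitored metric versus a healthy reference window.

    Deliberately crude: production systems use PSI, KS tests, or learned
    detectors, but the loop is the same: compare live data against a baseline.
    """
    ref_mean = statistics.mean(reference)
    ref_sd = statistics.stdev(reference) or 1e-9  # guard against a flat baseline
    return abs(statistics.mean(live) - ref_mean) / ref_sd

THRESHOLD = 2.0  # alert when the live window shifts by 2+ reference std-devs

# e.g. answer length per run, sampled from a known-good week vs. today
reference = [120, 130, 125, 118, 132, 127, 121, 129]
stable = [124, 126, 122, 131]
drifted = [210, 240, 198, 225]  # the agent suddenly rambling

print(round(drift_score(reference, stable), 2))   # small: no alert
print(round(drift_score(reference, drifted), 2))  # large: page someone
assert drift_score(reference, drifted) > THRESHOLD
```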

&lt;h3&gt;
  
  
  Faster iteration and shorter deployment cycles
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai" rel="noopener noreferrer"&gt;McKinsey’s &lt;em&gt;State of AI 2025&lt;/em&gt;&lt;/a&gt; shows that companies with structured oversight scale AI use cases faster and with fewer blockers. The reason is simple: when you can see problems clearly, you don’t waste weeks firefighting.&lt;/p&gt;

&lt;p&gt;For developers, the pain points look familiar. A bad data source throws outputs off. Debugging takes days because you can’t replay the run. Testing agents in production feels like testing code directly in prod — unstable and slow.&lt;/p&gt;

&lt;p&gt;This is where AI &lt;strong&gt;agent supervision platforms&lt;/strong&gt; really shine over traditional indexing and LLM libraries. Instead of stitching together frameworks like LangChain and writing custom tracing code, you get these capabilities out of the box. Supervisor-style setups automatically cluster recurring failure patterns and knowledge gaps across sessions.&lt;/p&gt;

&lt;p&gt;When a run fails, you can replay it and watch the exact path the agent took. That makes it obvious where the wrong source crept in. Over time, those replays reveal patterns — gaps in knowledge, broken tools — that you’d never catch skimming raw logs.&lt;/p&gt;
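
&lt;p&gt;In its simplest form, that pattern-finding is just counting where failed runs died. A toy sketch with invented run records:&lt;/p&gt;

```python
from collections import Counter

# Hypothetical replay records: each failed run tagged with the step that broke it.
failed_runs = [
    {"run_id": "r1", "failed_step": "kb_search", "error": "stale index"},
    {"run_id": "r2", "failed_step": "kb_search", "error": "stale index"},
    {"run_id": "r3", "failed_step": "crm.update", "error": "timeout"},
    {"run_id": "r4", "failed_step": "kb_search", "error": "no results"},
]

# Cluster by failing step: a recurring cluster points at a broken tool
# or a knowledge gap that skimming raw logs would likely miss.
hotspots = Counter(run["failed_step"] for run in failed_runs)
print(hotspots.most_common())  # [('kb_search', 3), ('crm.update', 1)]
```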

&lt;h3&gt;
  
  
  Governance and accountability for enterprises
&lt;/h3&gt;

&lt;p&gt;Gartner frames the solution as “&lt;a href="https://www.gartner.com/en/articles/guardian-agents" rel="noopener noreferrer"&gt;The Rise of Guardian Agents&lt;/a&gt;”: AI designed to watch over other AI. Early forms check quality and enforce policy; more mature forms can observe processes in real time; the end state is active protection, where unsafe actions are blocked before they reach customers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpuq4h7p0jphfqykeonx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpuq4h7p0jphfqykeonx.png" alt=" " width="543" height="140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These functions don’t replace governance teams, but they give them something they’ve lacked until now: continuous visibility and enforcement mechanisms that keep pace with rapid deployment.&lt;/p&gt;

&lt;p&gt;Guardian agents are emerging as the operational mechanisms that keep governance from falling behind innovation. They turn compliance from a periodic audit into an always-on layer of accountability.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the current AI policies and standards enterprises must follow?
&lt;/h2&gt;

&lt;p&gt;Technical fixes alone won’t make agents trustworthy. Enterprises also need to track the laws, standards, and frameworks that define what “responsible AI” actually means. Policy is moving quickly; at the same time, global standards bodies have begun publishing management-system standards, creating a shared language for how organizations prove oversight.&lt;/p&gt;

&lt;p&gt;What matters is evidence: logs, streams, audits, and human-in-the-loop controls that stand up to regulators and standards bodies. Without that, governance risks getting outpaced by innovation — and once that gap opens, trust is almost impossible to recover.&lt;/p&gt;

&lt;p&gt;Here are the most relevant policies and standards to watch right now:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn19ious53djzwe6pf4vd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn19ious53djzwe6pf4vd.png" alt=" " width="652" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The frameworks above set rules for consent, data minimization, and grievance redress; implementation is still ongoing in several jurisdictions.&lt;/p&gt;

&lt;p&gt;Beyond these, keep an eye on policymakers around the world: the OECD, UNESCO, the U.S. AI Safety Institute, the EU AI Office, and the UK AI Safety Institute are all shaping rules that could soon harden into law.&lt;/p&gt;

&lt;h2&gt;
  
  
  Top AI Agent Supervision Platforms
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.wayfound.ai/" rel="noopener noreferrer"&gt;Wayfound&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faym2bafn2ao0mj3wjjsg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faym2bafn2ao0mj3wjjsg.png" alt=" " width="800" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Enterprise leaders who need business-friendly oversight of AI agents across departments, with seamless integration into existing workflows.&lt;/p&gt;

&lt;p&gt;Wayfound positions itself as the first proactive AI Agent Supervisor — what Gartner calls a &lt;em&gt;Guardian Agent&lt;/em&gt;. It combines LLM-as-a-judge reasoning with an organizational lens, treating supervision as management rather than technical oversight. The platform turns abstract observability into a process that mirrors how teams set goals, review performance, and adapt over time.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.wayfound.ai/pages/agentic-workflows" rel="noopener noreferrer"&gt;dashboard&lt;/a&gt; puts business users in direct control. They can define roles, objectives, and evaluation rules aligned with business metrics — then assess agent behavior without relying on engineering. Oversight becomes continuous and hands-on, allowing decisions to stay close to outcomes rather than filtered through technical mediation.&lt;/p&gt;

&lt;p&gt;At the core of this loop is Wayfound’s &lt;a href="https://www.wayfound.ai/post/self-improving-ai-agents-are-here-with-wayfound-mcp" rel="noopener noreferrer"&gt;Model Context Protocol (MCP) integration&lt;/a&gt;. It lets agents query Wayfound during execution to verify actions, follow new guidelines, and apply lessons from previous runs. When goals or policies change, updates apply instantly, turning feedback into live iteration rather than a post-deployment task.&lt;/p&gt;

&lt;p&gt;For developers, MCP provides automatic instrumentation and clear visibility into behavior. Wayfound captures traces, errors, and decision paths, surfacing practical summaries instead of raw logs. It shows exactly where things went wrong and what adjustments would make the next run better.&lt;/p&gt;

&lt;p&gt;Together, these layers form a self-improving supervision cycle that works across CRMs, analytics tools, and agent frameworks. Wayfound becomes the shared space where business and engineering converge, ensuring agents stay aligned and continuously improving in real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Supervisor dashboard with ability to write natural language custom evaluations and business-friendly observability and review&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Real-time alerts that highlight risks and optimization opportunities&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Agent improvement suggestions that can be implemented seamlessly through MCP &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Easy integration for any AI Agent via SDK, APIs, MCP as well as native integrations with Salesforce Agentforce&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Enterprise contracts with deployment support; pricing on request.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://docs.langchain.com/langsmith/home" rel="noopener noreferrer"&gt;LangSmith&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhlg8iinwsnj8f5c410yc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhlg8iinwsnj8f5c410yc.png" alt=" " width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Engineering teams that need to replay agent runs, debug failures, and fine-tune prompts in detail.&lt;/p&gt;

&lt;p&gt;LangSmith grew out of &lt;a href="https://www.langchain.com/pricing" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt;, a more focused attempt to turn open-source building blocks into a systematic debugging layer. Its biggest strength is &lt;a href="https://docs.langchain.com/langsmith/evaluation-concepts" rel="noopener noreferrer"&gt;replay and trace&lt;/a&gt;. Developers can walk back through every intermediate step, surfacing exactly where logic went wrong.&lt;/p&gt;

&lt;p&gt;It also doubles as a testing platform. By fixing datasets of sample prompts and replaying them against updated models, teams can measure regression and confirm whether changes improve results. That structured QA approach is something observability tools alone don’t provide.&lt;/p&gt;
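
&lt;p&gt;The shape of that structured QA is easy to see in plain Python. This is not LangSmith’s actual API, just a minimal sketch of fixed-dataset regression testing with invented prompts and checks:&lt;/p&gt;

```python
def evaluate(agent, dataset):
    """Replay a fixed dataset against an agent and return its pass rate."""
    passed = sum(1 for case in dataset if case["check"](agent(case["prompt"])))
    return passed / len(dataset)

# Toy "agent versions": the new one handles refund questions, the old one punts.
old_agent = lambda prompt: "I can't help with that."
new_agent = lambda prompt: ("Refunds are processed within 5 days."
                            if "refund" in prompt else "ok")

dataset = [
    {"prompt": "How do refunds work?", "check": lambda out: "refund" in out.lower()},
    {"prompt": "Say ok", "check": lambda out: out == "ok"},
]

print(evaluate(old_agent, dataset))  # 0.0
print(evaluate(new_agent, dataset))  # 1.0
```

Running the same dataset after every prompt or model change turns "did it get better?" into a number you can gate releases on.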

&lt;p&gt;The flip side is that LangSmith is unapologetically developer-centric. Non-technical stakeholders won’t find much value in JSON replays or raw traces. Without engineering commitment, it risks being shelfware, since the interface assumes technical comfort from its users.&lt;/p&gt;

&lt;p&gt;Another limitation is scope. LangSmith excels at debugging single agents but &lt;a href="https://www.reddit.com/r/LangChain/comments/1khjrnx/langsmith_has_been_great_but_starting_to_feel/" rel="noopener noreferrer"&gt;lacks governance and policy enforcement features&lt;/a&gt;. Enterprises looking for guardrails or executive dashboards usually need to pair it with broader supervision platforms to achieve complete oversight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Full traces of inputs, outputs, and tool calls&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dataset management for structured evaluation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Replay environment to walk through reasoning&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Built-in hooks for automated testing&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free tier available; usage-based plans scale with volume.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.lakera.ai/lakera-guard" rel="noopener noreferrer"&gt;Lakera Guard&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzm4ma6ht9zz94pap5g9j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzm4ma6ht9zz94pap5g9j.png" alt=" " width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that run agents in production and worry about jailbreaks, data leaks, or reckless tool use.&lt;/p&gt;

&lt;p&gt;Lakera Guard’s strongest play is ease of adoption. In &lt;a href="https://www.reddit.com/r/LocalLLaMA/comments/1m22w76/securing_ai_agents_with_honeypots_catch_prompt/" rel="noopener noreferrer"&gt;this Reddit thread&lt;/a&gt;, developers described it as a “drop-in proxy” — you just point your agent traffic to Lakera’s endpoint and instantly add a layer of injection filtering. That simplicity is a huge draw.&lt;/p&gt;

&lt;p&gt;At runtime, it can block jailbreak-style inputs and protect sensitive functions. One engineer noted that Lakera “catches weird edge cases” where users try to exfiltrate hidden prompts. That’s real production defense, deployable out of the box with minimal effort.&lt;/p&gt;

&lt;p&gt;But overhead comes with it. As another user put it: &lt;em&gt;“nice to have a drop-in solution — not so nice to have additional wait-steps in a large, branched agentic loop.”&lt;/em&gt; False positives and latency can make guardrails feel heavy in complex pipelines.&lt;/p&gt;

&lt;p&gt;Lakera also isn’t a full observability tool. It keeps you safe in the moment but won’t give dashboards or long-term metrics. Most teams pair it with LangSmith for debugging or Wayfound for governance so they’re covered across the full supervision stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Real-time prompt injection detection&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Guardrails for sensitive tool and API calls&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Drop-in proxy endpoints for LLM requests&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Filters for unsafe or policy-violating outputs&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free developer tier, with custom pricing for enterprise deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://coralogix.com/" rel="noopener noreferrer"&gt;Coralogix&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzf8hatf1ku8equ5tk40w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzf8hatf1ku8equ5tk40w.png" alt=" " width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Engineering and data teams that need unified observability for infrastructure and AI agents, with context awareness, compliance tracking, and drift detection.&lt;/p&gt;

&lt;p&gt;Coralogix began as a log analytics platform and gradually evolved into a full AI observability layer. The acquisition of Aporia added model telemetry and runtime guardrails, giving teams a single console for monitoring both agent behavior and system performance.&lt;/p&gt;

&lt;p&gt;The AI Center dashboard tracks drift, latency, and cost using the same ingestion layer that powers traditional log pipelines. Each inference can be traced from API call to output without manual tracing or separate monitoring scripts.&lt;/p&gt;

&lt;p&gt;A big differentiator is cost visibility. Coralogix logs each token, API call, and compute expense, showing cost per agent and raising alerts for anomalies. You can set custom budgets per agent to prevent runaway usage. &lt;/p&gt;

&lt;p&gt;To keep costs manageable, Coralogix offers index-free querying, remote archive queries (on your own S3 or cloud storage), and tools like “Drop Irrelevant Metrics” to prune what isn’t useful after ingestion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Token &amp;amp; resource cost tracking with anomaly alerts&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Archive &amp;amp; index-free querying to reduce storage/query overheads&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Olly, a natural-language assistant for observability insights&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Usage-based model with a free developer tier. Enterprise plans include advanced AI Center analytics and extended retention options.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.giskard.ai/" rel="noopener noreferrer"&gt;Giskard&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xfon7s0250anopio9dy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xfon7s0250anopio9dy.png" alt=" " width="800" height="556"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that want to test agents and LLMs pre-deployment using open-source tooling rather than paying for a commercial supervision suite.&lt;/p&gt;

&lt;p&gt;Giskard’s core idea is that supervision should start before production. Its open-source framework lets you scan models and datasets to expose problems like hallucinations, biased completions, or injection vulnerabilities without relying on closed third-party platforms.&lt;/p&gt;

&lt;p&gt;A standout is &lt;a href="https://www.giskard.ai/" rel="noopener noreferrer"&gt;RAGET, their retrieval evaluation tool&lt;/a&gt;. Instead of eyeballing responses, it matches outputs against reference datasets and surfaces where the retrieval logic breaks down. That makes it useful for RAG pipelines, which often fail in subtle, context-driven ways.&lt;/p&gt;

&lt;p&gt;Developers appreciate that it is free and flexible. You can &lt;a href="https://medium.com/@rohithreddy66666/mastering-rag-evaluation-a-deep-dive-into-giskard-ai-for-pdf-chatbots-c6d1163330aa" rel="noopener noreferrer"&gt;script evaluations directly in Python or use the UI to set custom rules&lt;/a&gt;. But it does take work — you need to define good datasets and tests, otherwise results don’t mean much.&lt;/p&gt;

&lt;p&gt;The limitation is that Giskard stops at testing. It helps you find weaknesses pre-deployment but doesn’t offer continuous runtime monitoring. Most teams that adopt it still layer on observability or guardrail tools once agents are in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Automated scanning for hallucinations, bias, and injection risks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;RAGET framework for retrieval evaluation against ground-truth data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Python APIs and UI for test creation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Extensible with custom rules and datasets&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Completely open-source, with paid support options for enterprises.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.ibm.com/products/watsonx" rel="noopener noreferrer"&gt;IBM WatsonX&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgl16dpj12nx734dcu1aw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgl16dpj12nx734dcu1aw.png" alt=" " width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Enterprises that care more about auditability and control than speed — teams that must prove AI reliability to regulators and leadership.&lt;/p&gt;

&lt;p&gt;IBM’s Watsonx.governance suite brings structured oversight to AI systems. It automates documentation and risk scoring while managing bias checks across pipelines. Model registries, version control, and lineage tracking connect directly to corporate compliance systems for end-to-end accountability.&lt;/p&gt;

&lt;p&gt;The main advantage of Watsonx is the support network behind it. Clients work with IBM Consulting, prebuilt industry templates, and integration pathways into enterprise software such as SAP or Salesforce. For regulated firms, that guidance often replaces months of internal coordination.&lt;/p&gt;

&lt;p&gt;The trade-off is complexity. Setup involves many moving parts, and flexibility narrows once policies are locked in. Watsonx performs best in predictable settings where models change gradually. Smaller teams seeking fast iteration usually find it cumbersome.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Model and data governance automation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Risk, bias, and compliance reporting&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Integration with enterprise data and risk suites&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;IBM Consulting and sector-specific templates&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Enterprise licensing with optional consulting packages. Costs vary by compliance level and contract structure.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.nvidia.com/en-us/ai-data-science/products/nemo/" rel="noopener noreferrer"&gt;NVIDIA NeMo Rails&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3mn3eg0wxxi5ipfoo4m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3mn3eg0wxxi5ipfoo4m.png" alt=" " width="800" height="568"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Engineering teams running complex LLM or agent workloads that need programmable control over behavior and safety boundaries.&lt;/p&gt;

&lt;p&gt;NeMo Guardrails is a rule-driven framework for supervising conversational AI. It specifies what agents can discuss, how they respond, and which tools they may access. Rules are written in a lightweight configuration language and enforced during runtime without retraining.&lt;/p&gt;

&lt;p&gt;The framework can run locally or through &lt;a href="https://www.nvidia.com/en-us/ai/foundry/" rel="noopener noreferrer"&gt;NVIDIA’s AI Foundry platform&lt;/a&gt;. Teams use it to combine rule enforcement with retrieval or vector-based search, adding safety layers over existing generative pipelines. Integration with LangChain and Triton simplifies deployment inside production environments.&lt;/p&gt;

&lt;p&gt;In practice, Guardrails delivers reliable behavior once tuned but requires setup and testing effort. Each rule must be validated under expected load to prevent latency or misfires. After calibration, it produces consistent, policy-aligned outputs that satisfy enterprise safety requirements.&lt;/p&gt;
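
&lt;p&gt;Conceptually, the rules are just data that a runtime check consults before the model answers or a tool runs. Here is a hedged sketch of that idea in plain Python; the topics and tool names are invented, and this is not NeMo Guardrails’ actual Colang syntax:&lt;/p&gt;

```python
# Hypothetical rule set in the spirit of a guardrails config file.
RULES = {
    "blocked_topics": ["medical advice", "legal advice"],
    "allowed_tools": ["kb_search", "ticket.create"],
}

def enforce(message, tool=None, rules=RULES):
    """Check a user message and optional tool request against declarative rules."""
    text = message.lower()
    for topic in rules["blocked_topics"]:
        if topic in text:
            return (False, f"refused: topic '{topic}' is out of bounds")
    if tool is not None and tool not in rules["allowed_tools"]:
        return (False, f"refused: tool '{tool}' is not on the allowlist")
    return (True, "ok")

print(enforce("open a ticket for me", tool="ticket.create"))  # (True, 'ok')
print(enforce("give me some legal advice"))
```

Because the rules live outside the model, policy changes take effect at the next request with no retraining, which is the property the framework trades setup effort for.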

&lt;p&gt;&lt;strong&gt;Key Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Rule-based control for model topics, responses, and tool use&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Integration with LangChain, Triton, and retrieval pipelines&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Local or cloud operation via NVIDIA AI Foundry&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Templates and SDKs for regulated and safety-focused workloads&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Enterprise features are available through AI Enterprise and Foundry subscriptions, which include managed support and deployment assistance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started with AI Supervision
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Choose your first supervision platform
&lt;/h3&gt;

&lt;p&gt;You don’t need to deploy everything at once. Start by picking the platform that fits your immediate need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Wayfound&lt;/strong&gt; gives executives and governance teams an AI Agent Supervisor and dashboards to review and improve agent behavior without digging through code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LangSmith&lt;/strong&gt; lets engineers replay runs and debug prompts when agents misfire.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lakera Guard&lt;/strong&gt; sits in front of agents to block prompt injection and risky actions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Coralogix&lt;/strong&gt; tracks every agent session in real time, scoring quality, latency, and cost while surfacing issues like drift.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Giskard&lt;/strong&gt; offers an open-source framework to test agents against datasets before pushing to production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IBM WatsonX&lt;/strong&gt; equips enterprises with model governance, bias detection, and audit reporting to keep AI development compliant and explainable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;NVIDIA NeMo Guardrails&lt;/strong&gt; gives engineers fine-grained control over what agents can say or do, enforcing safety and policy rules directly at runtime.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Connect your agents and integrations
&lt;/h3&gt;

&lt;p&gt;Start by linking your agents to the platform, then extend coverage into the everyday tools your teams already rely on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Communication tools such as Slack or Teams so alerts can be shared immediately&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Business systems like Salesforce or HubSpot to tie oversight directly into customer and sales activity&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data pipelines and API layers that carry context between agents and backend systems&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage and knowledge bases, where supervision can check retrievals and prevent drift in long-term memory&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Configure policies and guardrails
&lt;/h3&gt;

&lt;p&gt;Decide what your agents are not allowed to do. Common starting points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Restrict access to sensitive data fields (e.g., customer IDs in Salesforce)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set runtime rules on tool use so costly or high-risk actions require approval&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add brand, compliance, or tone filters before responses go out&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Layer in prompt-injection defenses to stop adversarial inputs&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
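
&lt;p&gt;These policies all reduce to a gate that every agent action passes through before it executes. Here is a minimal sketch of that gate; the field pattern, cost threshold, and action shape are illustrative assumptions, not any specific platform's configuration:&lt;/p&gt;

```python
import re

# A minimal policy gate: redact sensitive fields and flag costly actions
# for human approval. The pattern and threshold below are illustrative.
SENSITIVE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # e.g. SSN-like IDs

def apply_policies(action):
    """Return a sanitized action plus an approval flag."""
    text = SENSITIVE.sub("[REDACTED]", action["payload"])
    needs_approval = action.get("estimated_cost_usd", 0) > 50
    return {"payload": text, "needs_approval": needs_approval}
```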

&lt;p&gt;For further reading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.lakera.ai/blog/guide-to-prompt-injection" rel="noopener noreferrer"&gt;Lakera’s guide on prompt injection defense&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.nist.gov/itl/ai-risk-management-framework" rel="noopener noreferrer"&gt;NIST AI Risk Management Framework&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://trust.salesforce.com/en/trust-and-compliance-documentation/" rel="noopener noreferrer"&gt;Salesforce Trust and Compliance documentation&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Run your first performance review
&lt;/h3&gt;

&lt;p&gt;A performance review is the first moment supervision feels concrete. Instead of chasing logs or scattered feedback, you can watch how an agent handled a real conversation and see where the rules you set actually mattered. It’s a shift from theory into something you can observe and adjust.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjadufzdabk0oriti4hrh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjadufzdabk0oriti4hrh.png" alt=" " width="800" height="700"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Wayfound’s review panel makes agent behavior and guideline checks visible in one place.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wayfound’s review panel is a good illustration. The transcript sits alongside the decisions the agent made, with clear signals showing how those choices lined up with the company guidelines. Explanations and feedback boxes capture what worked well and where improvements are needed, making the outcome easy to understand without requiring technical expertise or extra reporting layers.&lt;/p&gt;

&lt;p&gt;For teams used to heavy oversight processes, this streamlined view shows how reviews can be both explainable and lightweight enough to slot into existing workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Expand gradually across workflows
&lt;/h3&gt;

&lt;p&gt;Once you’ve run that first review, the temptation is to wire up every agent and every process at once. Resist it. Supervision scales best when you add coverage step by step. Start with the workflows where risk or cost is highest, and let the early reviews teach you which rules matter most.&lt;/p&gt;

&lt;p&gt;A good tip here is to keep track of what you change. Treat each adjustment to a policy, guideline, or review process as an experiment, and note what effect it has. That simple habit makes scaling far easier, because you’re learning how supervision fits the way your teams actually work.&lt;/p&gt;




&lt;p&gt;Enterprise adoption is already showing that many traditional teams struggle to integrate AI, weighed down by layers of bureaucracy. Much of that bureaucracy exists in good faith, to protect customers and keep operations stable.&lt;/p&gt;

&lt;p&gt;The problem is when that oversight can’t keep up with the speed of deployment. What begins as protection turns into friction, and innovation spills into unmanaged risk. Structured supervision is what closes that gap: it gives enterprises the accountability they need without sacrificing the pace of adoption.&lt;/p&gt;

&lt;p&gt;I’m Aryan. I’ve worked in the AI agent, LLM, and MLOps space for a while now, and my upcoming PhD is focused on governance and explainability for AI agents. If this is something you’re working on or curious about, connect with me on &lt;a href="//www.linkedin.com/in/aryankargwal"&gt;LinkedIn&lt;/a&gt;; I’d love to continue the conversation.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>agents</category>
      <category>mcp</category>
    </item>
    <item>
      <title>DeepSeek-R1: Redefining the Reinforcement Learning in AI</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Mon, 10 Feb 2025 16:47:58 +0000</pubDate>
      <link>https://dev.to/aryankargwal/deepseek-r1-redefining-the-reinforcement-learning-in-ai-47e6</link>
      <guid>https://dev.to/aryankargwal/deepseek-r1-redefining-the-reinforcement-learning-in-ai-47e6</guid>
      <description>&lt;p&gt;The rapid evolution of large language models (LLMs) has reshaped the AI landscape, with OpenAI leading the charge. However, the emergence of open-source models like &lt;strong&gt;DeepSeek-R1&lt;/strong&gt; challenges the dominance of proprietary systems.&lt;/p&gt;

&lt;p&gt;As users question the justification behind hefty subscription fees, such as OpenAI's $200 premium plan (which could fund a year's worth of caffeine for developers), DeepSeek-R1 offers compelling answers.&lt;/p&gt;

&lt;p&gt;This blog delves into DeepSeek-R1's architecture, performance, and what it means for the future of AI.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;The Resurgence of Reinforcement Learning in Generative AI&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Over the past year, the use of &lt;a href="https://youtu.be/T_X4XFwKX8k?si=XwLkpXLs8k0IAmlb" rel="noopener noreferrer"&gt;Reinforcement Learning from Human Feedback&lt;/a&gt; (RLHF) saw a notable decline in generative AI development. Many organizations shifted focus towards scaling model architectures and fine-tuning on large datasets without the added complexity of reinforcement learning (RL).&lt;/p&gt;

&lt;p&gt;However, &lt;strong&gt;DeepSeek-R1&lt;/strong&gt; has reinvigorated interest in RL by demonstrating how it can be applied cost-effectively while significantly enhancing model performance—without needing the budget of a small country.&lt;/p&gt;

&lt;p&gt;So, how does DeepSeek incorporate RL into its training program at a fraction of the cost typically associated with such techniques? Let's break down three key RL strategies they employed:&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Group Relative Policy Optimization (GRPO):&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;DeepSeek-R1 utilizes GRPO, an efficient variant of traditional policy optimization algorithms. Unlike standard approaches that rely heavily on resource-intensive critic models, GRPO estimates baselines from group scores, significantly reducing computational overhead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzx6yq8lm3465jpozkjpb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzx6yq8lm3465jpozkjpb.png" alt=" " width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Think of it as a group project. Instead of one person doing all the work, everyone contributes just enough to earn an A. This technique allows DeepSeek-R1 to optimize model performance with minimal resources, focusing on relative improvements within sampled outputs rather than absolute performance metrics.&lt;/p&gt;
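
&lt;p&gt;The core of GRPO is small enough to sketch: each sampled output’s reward is normalized against the mean and standard deviation of its own group, which is what replaces the learned critic. A simplified illustration (not DeepSeek’s training code):&lt;/p&gt;

```python
# Group-relative advantage per the GRPO formulation: normalize each
# sampled output's reward by its group's mean and standard deviation,
# so no separate critic model is needed. Simplified sketch.
def grpo_advantages(rewards, eps=1e-8):
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```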

&lt;h4&gt;
  
  
  &lt;strong&gt;Reward Modeling with Rule-Based Systems:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Instead of relying on costly neural reward models prone to issues like reward hacking (yes, even AI knows how to game the system), DeepSeek-R1 adopts a rule-based reward system. This system emphasizes two types of rewards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy rewards:&lt;/strong&gt; Assessing the correctness of outputs in tasks like math and coding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Format rewards:&lt;/strong&gt; Ensuring responses follow structured reasoning patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This cost-effective approach maintains training stability without continuous retraining of complex reward models. It’s like teaching a model that 2+2=4 and that math answers shouldn't come with emojis—simple yet effective.&lt;/p&gt;
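
&lt;p&gt;Both reward types can be approximated with a few lines of pattern matching. The &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt;/&lt;code&gt;&amp;lt;answer&amp;gt;&lt;/code&gt; tags follow the response template described in the DeepSeek-R1 paper; the scoring itself is a simplified sketch of the idea, not the production reward code:&lt;/p&gt;

```python
import re

# Simplified sketch of the two rule-based rewards. The <think>/<answer>
# tags follow DeepSeek-R1's response template; real scoring is more involved.
def format_reward(response):
    pattern = r"<think>.+?</think>\s*<answer>.+?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), re.S) else 0.0

def accuracy_reward(response, gold_answer):
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    return 1.0 if m and m.group(1).strip() == gold_answer else 0.0
```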

&lt;h4&gt;
  
  
  &lt;strong&gt;Reinforcement Learning with Cold Start Data:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;DeepSeek-R1 introduces a 'cold start' phase to address early-stage instability common in RL training. This involves fine-tuning the base model on a small, high-quality dataset before applying RL.&lt;/p&gt;

&lt;p&gt;By establishing a strong initial performance baseline, DeepSeek-R1 accelerates convergence during RL training, reducing the computational cost typically required to achieve high reasoning capabilities.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Breaking Down DeepSeek-R1&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Developed by DeepSeek-AI, DeepSeek-R1 is part of a new generation of reasoning-focused LLMs. It comes in two variants.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1-Zero:&lt;/strong&gt; Trained using large-scale reinforcement learning (RL) without supervised fine-tuning (SFT), showcasing raw reasoning capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1:&lt;/strong&gt; Built on a multi-stage training pipeline that includes cold-start data, SFT, and extensive RL, achieving performance comparable to OpenAI's o1-1217 model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike traditional LLMs, which rely heavily on supervised datasets, DeepSeek-R1-Zero's performance emerges from pure RL, organically incentivizing reasoning behaviors.&lt;/p&gt;

&lt;p&gt;This approach allows the model to develop self-verification, reflection, and complex chain-of-thought (CoT) reasoning without human biases embedded through SFT—essentially the AI equivalent of learning to ride a bike without training wheels.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Performance That Rivals the Best&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;DeepSeek-R1 doesn't just claim to be competitive; its benchmark results prove it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvtcdbhtdb7gjmmnzuav.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvtcdbhtdb7gjmmnzuav.png" alt=" " width="546" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AIME 2024 (Pass@1):&lt;/strong&gt; 79.8%, outperforming OpenAI-o1-mini and matching OpenAI-o1-1217.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MATH-500 (Pass@1):&lt;/strong&gt; 97.3%, rivaling OpenAI's models in mathematical reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MMLU:&lt;/strong&gt; 90.8%, showcasing broad knowledge comprehension.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Codeforces (Percentile):&lt;/strong&gt; 96.3%, indicating elite coding competition performance (because what's more satisfying than beating both AI and humans at code challenges?).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhrn75rksi2s38x2nb7r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhrn75rksi2s38x2nb7r.png" alt=" " width="616" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The distilled models, with 1.5B to 70B parameters, also outperform existing open-source benchmarks. For instance, &lt;strong&gt;DeepSeek-R1-Distill-Qwen-32B&lt;/strong&gt; surpasses QwQ-32B-Preview in all major reasoning tasks, proving that size isn’t everything—optimization is key.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Why This Matters: The Open-Source Revolution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The success of DeepSeek-R1 challenges the notion that proprietary models inherently offer superior value. Here’s why:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Performance Parity:&lt;/strong&gt; DeepSeek-R1 achieves results on par with, and sometimes exceeding, OpenAI's top-tier models, especially in reasoning-intensive tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Efficiency:&lt;/strong&gt; Open-source models eliminate the need for expensive API subscriptions, providing high-quality AI capabilities without recurring costs (so you can finally cancel that subscription and still afford your morning latte).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transparency:&lt;/strong&gt; Unlike black-box proprietary models, open-source projects offer transparency, fostering trust and enabling community-driven improvements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customization:&lt;/strong&gt; Organizations can fine-tune models like DeepSeek-R1 to meet specific needs, something not easily achievable with closed-source APIs. Sometimes, you just need your AI to have that particular flair.&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  Building AI Agents using DeepSeek R1
&lt;/h3&gt;

&lt;p&gt;If you're excited about DeepSeek-R1's potential but wondering how to integrate it into practical applications, look no further than &lt;a href="//studio.botpress.cloud"&gt;Botpress&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;This powerful platform allows you to build sophisticated AI agents to handle customer support, automate workflows, and even assist in coding tasks without breaking the bank.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/7ZuHsVcQoVc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;By leveraging DeepSeek-R1 on Botpress, you can recreate much of the agentic functionality that proprietary models offer, but at a fraction of the cost.&lt;/p&gt;

</description>
      <category>deepseek</category>
      <category>opensource</category>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>Benchmarking Pixtral Large vs Pixtral 12B</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Mon, 25 Nov 2024 21:52:24 +0000</pubDate>
      <link>https://dev.to/tunehqai/benchmarking-pixtral-large-vs-pixtral-12b-2l0e</link>
      <guid>https://dev.to/tunehqai/benchmarking-pixtral-large-vs-pixtral-12b-2l0e</guid>
      <description>&lt;p&gt;Youtube: &lt;a href="https://youtu.be/O412lbdYQA0" rel="noopener noreferrer"&gt;Click Me&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Multimodal AI has taken significant leaps in recent years, and &lt;strong&gt;Mistral AI's Pixtral Large&lt;/strong&gt; is no exception. This new Vision-Language Model (VLM) aims to redefine benchmarks in multimodal understanding and reasoning. In this post, I’ll dive into Pixtral Large's capabilities, its performance against its predecessor, Pixtral 12B, and GPT-4V, and share my benchmarking experiments to help you make informed decisions when choosing your next VLM.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What is Pixtral Large?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Pixtral Large is Mistral AI’s latest multimodal innovation. Building on the foundation of Pixtral 12B, it introduces enhanced reasoning and comprehension capabilities. Whether tackling complex math problems on datasets like &lt;strong&gt;MathVista&lt;/strong&gt;, document comprehension from &lt;strong&gt;DocVQA&lt;/strong&gt;, or visual-question answering with &lt;strong&gt;VQAv2&lt;/strong&gt;, Pixtral Large consistently sets itself apart with superior performance.&lt;/p&gt;

&lt;p&gt;At its core, Pixtral Large is powered by a &lt;strong&gt;123 billion-parameter multimodal decoder&lt;/strong&gt; and a &lt;strong&gt;1 billion-parameter vision encoder&lt;/strong&gt;, making it a true powerhouse. It supports up to &lt;strong&gt;30 high-resolution images&lt;/strong&gt; within a &lt;strong&gt;128K context window&lt;/strong&gt;, allowing it to handle complex, large-scale reasoning tasks effortlessly. Its decoder builds on the &lt;strong&gt;Mistral Large 2&lt;/strong&gt; text model, giving it strong text processing alongside its multimodal capabilities.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Technical Specifications&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Although the exact architecture of Pixtral Large remains undisclosed, it likely builds upon Pixtral 12B's &lt;strong&gt;shared-embedding multimodal transformer decoder&lt;/strong&gt;. This setup enables it to process &lt;strong&gt;multi-image inferences&lt;/strong&gt; and perform high-quality cross-modal reasoning, excelling at tasks that require a deep integration of visual and textual data.&lt;/p&gt;

&lt;p&gt;Here are some standout specs of Pixtral Large:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parameters&lt;/strong&gt;: 123 billion (multimodal decoder) + 1 billion (vision encoder)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Window&lt;/strong&gt;: 128K tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image Support&lt;/strong&gt;: Up to 30 high-resolution images&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Applications&lt;/strong&gt;: Math reasoning, document comprehension, chart understanding, and more&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Pixtral Large vs. Pixtral 12B&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The shift from Pixtral 12B to Pixtral Large represents a nuanced tradeoff:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pixtral 12B&lt;/strong&gt;: Balanced capabilities across tasks, excelling in &lt;strong&gt;label-based&lt;/strong&gt; and &lt;strong&gt;rationale-based&lt;/strong&gt; evaluations.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pixtral Large&lt;/strong&gt;: Falls behind in label-based tasks but shines in &lt;strong&gt;rationale-based performance&lt;/strong&gt;, indicating superior reasoning and explanation capabilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This evolution demonstrates Pixtral Large’s focus on tasks requiring deeper comprehension and reasoning, making it a strong contender for specialized use cases.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Benchmarking Results&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Datasets Used&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To test Pixtral Large, I benchmarked it against its predecessor and GPT-4V using two datasets:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ArxivQA&lt;/strong&gt;: Research paper-based QA tasks with GPT-4V inferences for comparison.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flickr30k&lt;/strong&gt;: A classic image captioning dataset enhanced with GPT-4O-generated captions.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Evaluation Metrics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I used &lt;strong&gt;Cosine Similarity&lt;/strong&gt; to measure semantic alignment between generated outputs and reference data. Metrics included &lt;strong&gt;win rate&lt;/strong&gt;, &lt;strong&gt;average similarity&lt;/strong&gt;, and &lt;strong&gt;top-1, top-5, top-10 scores&lt;/strong&gt;.&lt;/p&gt;
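
&lt;p&gt;For reference, the metrics themselves are straightforward to compute once you have embeddings. The short vectors below are stand-ins for real sentence embeddings, and the helper names are my own:&lt;/p&gt;

```python
# Cosine similarity and win rate, sketched in plain Python. In the actual
# benchmark the vectors would come from a sentence-embedding model.
def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return dot / norm

def win_rate(model_sims, baseline_sims):
    """Fraction of samples where the model's similarity beats the baseline's."""
    wins = sum(m > b for m, b in zip(model_sims, baseline_sims))
    return wins / len(model_sims)
```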




&lt;h3&gt;
  
  
  &lt;strong&gt;ArxivQA Results&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;From &lt;strong&gt;1,000 randomly selected images&lt;/strong&gt;, Pixtral Large demonstrated a stronger ability to reason through scientific and mathematical content. While it struggled with label-based evaluations compared to Pixtral 12B, it outperformed in rationale-based tasks. This indicates a shift toward &lt;strong&gt;deeper reasoning capabilities&lt;/strong&gt;, ideal for complex QA scenarios.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3e11lo2vfnp6jrt1yewv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3e11lo2vfnp6jrt1yewv.png" alt=" " width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Flickr30k Results&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For the &lt;strong&gt;Flickr30k Captioning Benchmark&lt;/strong&gt;, Pixtral Large produced slight improvements over Pixtral 12B when evaluated against &lt;strong&gt;human-generated captions&lt;/strong&gt;. However, neither model managed a majority win rate on this task.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frmvfwfxfe3xgglyforty.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frmvfwfxfe3xgglyforty.png" alt=" " width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Interestingly, when compared to &lt;strong&gt;GPT-4V captions&lt;/strong&gt;, Pixtral Large performed well, though it fell slightly behind Pixtral 12B in top-ranked matches. These results highlight Pixtral Large’s potential but also suggest areas for improvement in precision and caption generation.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Using Pixtral Large on Tune Studio&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Due to the model's size and resource requirements, I used &lt;strong&gt;Tune Studio&lt;/strong&gt; for benchmarking. With its user-friendly interface and efficient inference scripts, I was able to process &lt;strong&gt;500 images per hour&lt;/strong&gt;, completing the job for under $20. This makes Tune Studio a valuable tool for researchers and developers working on large-scale AI projects.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Pixtral Large represents a significant step forward in multimodal AI, offering enhanced reasoning and cross-modal comprehension. While it may not surpass Pixtral 12B in every aspect, its focus on rationale-based tasks makes it a compelling choice for applications requiring deeper understanding.&lt;/p&gt;

&lt;p&gt;For developers, researchers, and enterprises looking for cutting-edge VLMs, Pixtral Large offers a mix of power and precision that’s hard to beat.&lt;/p&gt;




&lt;p&gt;What do you think about Pixtral Large? Is it the next big thing in VLMs, or do you see potential in other models like GPT-4V? Let me know your thoughts in the comments below! 🚀&lt;/p&gt;

</description>
      <category>llm</category>
      <category>vlm</category>
      <category>benchmarking</category>
      <category>research</category>
    </item>
    <item>
      <title>Transform UI Screenshots into HTML &amp; CSS with Qwen Coder and Qwen VL</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Thu, 14 Nov 2024 16:43:38 +0000</pubDate>
      <link>https://dev.to/tunehqai/transform-ui-screenshots-into-html-css-with-qwen-coder-and-qwen-vl-12nl</link>
      <guid>https://dev.to/tunehqai/transform-ui-screenshots-into-html-css-with-qwen-coder-and-qwen-vl-12nl</guid>
      <description>&lt;p&gt;🎥Youtube Video Link: &lt;a href="https://youtu.be/TK1TDe7fHGI" rel="noopener noreferrer"&gt;Click Me&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine this: you’re working on a website redesign, and you’ve just captured a UI screenshot that embodies the look you want. Wouldn’t it be incredible if you could automatically turn that image into HTML and CSS? This tutorial will show you exactly how to make that happen, transforming visual designs into code using cutting-edge &lt;strong&gt;vision-language models (VLMs)&lt;/strong&gt; and &lt;strong&gt;Qwen Coder&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;In this setup, we’ll build a pipeline where an AI model analyzes your UI design image, understands its layout, colors, typography, and structure, and then generates clean, organized HTML and CSS code. This process opens up a world of possibilities for &lt;strong&gt;UI prototyping, automated design-to-code workflows&lt;/strong&gt;, and quick mockup generation.&lt;/p&gt;

&lt;p&gt;Some cool points we'll cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Upload and Describe UI Designs&lt;/strong&gt;: How we upload a UI screenshot and get a detailed breakdown of the design elements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate HTML &amp;amp; CSS with AI&lt;/strong&gt;: Transforming these descriptions into fully functional HTML and CSS code for quick web design prototyping.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s get started!&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Setting Up API Details and Image Encoding
&lt;/h3&gt;

&lt;p&gt;First, let’s configure the API endpoint, headers, and a helper function to encode images into Base64. This encoding step allows us to send the image data to the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BytesIO&lt;/span&gt;

&lt;span class="c1"&gt;# Set API details
&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://proxy.tune.app/chat/completions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Replace with your actual API key
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Encode image in Base64 format
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;encode_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RGBA&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Convert RGBA to RGB
&lt;/span&gt;    &lt;span class="n"&gt;buffered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffered&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JPEG&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffered&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getvalue&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Querying the Vision-Language Model for Description
&lt;/h3&gt;

&lt;p&gt;In this step, we’ll create a function that sends the UI image to the VLM and asks for a detailed description. A good description captures the key aspects of the UI, including &lt;strong&gt;color schemes, typography, layout structure, and icons&lt;/strong&gt;, since these are the details the code model relies on to generate accurate HTML and CSS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Query the model for a description of the image
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base64_image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;image_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data:image/jpeg;base64,&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base64_image&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="n"&gt;image_content&lt;/span&gt;
                &lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[{}])[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Extracting HTML and CSS Code from Model Response
&lt;/h3&gt;

&lt;p&gt;Once we have the description, we prompt &lt;strong&gt;Qwen Coder&lt;/strong&gt; to generate HTML and CSS that match the described layout. We then parse the response, pulling out the HTML and CSS blocks so they can be written to separate files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="c1"&gt;# Extract HTML and CSS from model response
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_html_css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;html_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;### HTML\n```

html\n(.*?)

```&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DOTALL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;css_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;### CSS.*\n```

css\n(.*?)

```&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DOTALL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;html_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;html_match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;html_match&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="n"&gt;css_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;css_match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;css_match&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;html_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;css_code&lt;/span&gt;

&lt;span class="c1"&gt;# Save HTML and CSS to files
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;css_code&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;index.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;html_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;html_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;styles.css&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;css_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;css_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;css_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
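&lt;p&gt;If you want to sanity-check the parsing logic before wiring it into the app, here is a self-contained version of the extractor run against a toy response. Note that the &lt;code&gt;### HTML&lt;/code&gt; and &lt;code&gt;### CSS&lt;/code&gt; headings followed by fenced code blocks are an assumption about how the code model formats its reply; adjust the patterns if your model answers differently. The snippet builds the triple-backtick fence programmatically just to keep it readable here.&lt;/p&gt;

```python
import re

# Assumed response format: "### HTML" / "### CSS" headings, each followed
# by a fenced code block. FENCE is three backticks, built up so that the
# literal markers don't clutter (or break) this example.
FENCE = "`" * 3

def extract_html_css(response_text):
    html_pat = r"### HTML\n" + FENCE + r"html\n(.*?)\n" + FENCE
    css_pat = r"### CSS.*\n" + FENCE + r"css\n(.*?)\n" + FENCE
    html_match = re.search(html_pat, response_text, re.DOTALL)
    css_match = re.search(css_pat, response_text, re.DOTALL)
    html_code = html_match.group(1).strip() if html_match else ""
    css_code = css_match.group(1).strip() if css_match else ""
    return html_code, css_code

# A toy model reply in the expected shape
sample = (
    "### HTML\n" + FENCE + "html\nhello\n" + FENCE + "\n\n"
    "### CSS\n" + FENCE + "css\nbody { margin: 0; }\n" + FENCE
)
html_code, css_code = extract_html_css(sample)
```

&lt;p&gt;If either pattern fails to match, the function falls back to empty strings, which is what the Streamlit app later checks for before writing files.&lt;/p&gt;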



&lt;h3&gt;
  
  
  Step 4: Building the Streamlit App for User Interaction
&lt;/h3&gt;

&lt;p&gt;Our final step is setting up the &lt;strong&gt;Streamlit&lt;/strong&gt; interface. This UI allows users to upload images, choose a model, generate descriptions, and output HTML/CSS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;streamlit&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;

&lt;span class="c1"&gt;# Streamlit UI setup
&lt;/span&gt;&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Image Description and HTML/CSS Generation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model_choice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;selectbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Select Model for Image Understanding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                            &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen/qwen-2-vl-72b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mistral/pixtral-12B-2409&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta/llama-3.2-90b-vision&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                            &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;uploaded_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;file_uploader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Choose an image...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;button&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate Description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;uploaded_image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uploaded_image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;base64_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;encode_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Generate the UI description
&lt;/span&gt;        &lt;span class="n"&gt;description_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please analyze this software interface image and provide a detailed description.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;query_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base64_image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_choice&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subheader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generated Description:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Generate HTML and CSS
&lt;/span&gt;            &lt;span class="n"&gt;html_css_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are TuneStudio, a coding assistant that generates HTML and CSS based on descriptions.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please create HTML and CSS based on the following detailed description: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen/qwen-2.5-coder-32b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3000&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;html_css_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;html_css_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[{}])[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;html_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;css_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_html_css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html_css_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;html_code&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;css_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="nf"&gt;write_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;css_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HTML and CSS files have been generated.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HTML/CSS extraction failed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subheader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generated HTML and CSS:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html_css_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error generating HTML/CSS.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please upload an image.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
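&lt;p&gt;To try the app locally, save the combined script from the steps above (here assumed to be named &lt;code&gt;app.py&lt;/code&gt;; use whatever filename you chose), install the dependencies, and launch Streamlit:&lt;/p&gt;

```shell
# Install the libraries used across the steps, then start the app
pip install streamlit pillow requests
streamlit run app.py
```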



&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;With this setup, you’ve created a pipeline that not only automates the analysis of UI images but also translates them into HTML and CSS. This workflow is a major time-saver for developers, designers, and anyone involved in UI design. Now, you can turn visual ideas into functional code with the power of AI!&lt;/p&gt;

&lt;p&gt;Let me know if you run into any questions or issues in the comments below.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>tutorial</category>
      <category>python</category>
    </item>
    <item>
      <title>Image Generation using Janus 1.3B🔮</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Thu, 24 Oct 2024 22:03:31 +0000</pubDate>
      <link>https://dev.to/aryankargwal/multi-modality-and-image-gen-in-a-13b-model-1ef2</link>
      <guid>https://dev.to/aryankargwal/multi-modality-and-image-gen-in-a-13b-model-1ef2</guid>
      <description>&lt;p&gt;Code: &lt;a href="https://github.com/aryankargwal/genai-tutorials" rel="noopener noreferrer"&gt;Click Me&lt;/a&gt;&lt;br&gt;
Youtube: &lt;a href="https://youtu.be/N1EFxzD7G0E" rel="noopener noreferrer"&gt;Click Me&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Today, we’re diving into something exciting: &lt;strong&gt;Janus 1.3B&lt;/strong&gt;, one of the smallest truly multimodal LLMs that is still genuinely capable. What sets Janus apart is that, despite its small size, it delivers strong results in both natural language processing and image generation. It is a perfect example of where AI is heading: smaller models that remain versatile and multimodal.&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;Janus 1.3B&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;So, what exactly is &lt;strong&gt;&lt;a href="https://arxiv.org/abs/2410.13848" rel="noopener noreferrer"&gt;Janus 1.3B&lt;/a&gt;&lt;/strong&gt;? At its core, Janus is a &lt;strong&gt;vision-language model (VLM)&lt;/strong&gt; designed to handle textual and visual data. With just &lt;strong&gt;1.3 billion parameters&lt;/strong&gt;, Janus is significantly smaller than some of the other LLMs and multimodal models we’ve discussed on the channel. But don’t let its size fool you; it can perform both text and image generation, making it a powerful tool despite its relatively compact size.&lt;/p&gt;

&lt;p&gt;Unlike most models, which specialise in one area or need large architectures to function effectively in multiple domains, Janus achieves this multimodal functionality at a much smaller scale. This is a massive step in making AI more efficient, accessible, and, most importantly, scalable.&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;How Does Janus Work?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let’s start with its architecture. Janus processes text understanding, multimodal understanding, and visual generation through independent encoding methods that eventually feed into a unified autoregressive transformer. This design allows it to handle different types of input—text, images, or a combination of both—in a highly efficient manner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjywqcs8dkx5czy2p7fo6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjywqcs8dkx5czy2p7fo6.png" alt=" " width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s the breakdown of how it all works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Text Understanding&lt;/strong&gt;: Janus employs a built-in tokenizer from its underlying LLM. This tokenizer converts text into discrete IDs (tokens), which are transformed into feature representations. The LLM processes these features in the same way as any other text-based model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multimodal Understanding&lt;/strong&gt;: For image understanding, Janus integrates &lt;strong&gt;SigLIP&lt;/strong&gt;, a powerful vision encoder that extracts high-dimensional semantic features from images. These features are flattened from a 2D grid into a 1D sequence and passed through an understanding adaptor. This adaptor maps the image features into the input space of the LLM, so that image and text data are represented in a way the model can process together.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Image Generation&lt;/strong&gt;: Janus utilizes a &lt;strong&gt;Vector Quantization (VQ)&lt;/strong&gt; tokenizer to generate images. This tokenizer converts images into a sequence of discrete IDs. These ID sequences are flattened and passed through a generation adaptor, which maps them into the LLM’s input space. This allows Janus to generate image content from a text description. A specialized image prediction head is trained for this task, while Janus relies on the LLM’s existing text prediction head for text-based tasks.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
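&lt;p&gt;To make the VQ step concrete, here is a minimal NumPy sketch of vector quantization in general; the codebook size and feature dimension below are illustrative, not the model’s actual ones. Each patch feature is replaced by the ID of its nearest codebook entry, and decoding is simply a lookup.&lt;/p&gt;

```python
import numpy as np

# Conceptual sketch of vector quantization (not the model's actual tokenizer):
# every patch feature vector is replaced by the index of its nearest codebook
# entry, turning an image into a sequence of discrete IDs an LLM can model.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 8))            # illustrative codebook: 512 codes, dim 8
patch_features = rng.normal(size=(24 * 24, 8))  # e.g. a 24x24 grid of patch features

def vq_encode(features, codebook):
    # Squared L2 distance from every feature to every code, argmin per feature
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)                     # discrete token IDs

def vq_decode(ids, codebook):
    return codebook[ids]                        # decoding is just a table lookup

ids = vq_encode(patch_features, codebook)
recon = vq_decode(ids, codebook)
print(ids.shape, recon.shape)                   # (576,) (576, 8)
```

&lt;p&gt;The real tokenizer learns its codebook during training and decodes with a neural decoder rather than a raw lookup, but the encode-to-IDs, decode-from-IDs contract is the same.&lt;/p&gt;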

&lt;p&gt;Once the inputs, whether text, image, or both, are converted into feature sequences, Janus concatenates them into a unified multimodal feature sequence. This sequence is then fed into the LLM for processing, making it capable of generating text and images based on the input it receives.&lt;/p&gt;
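&lt;p&gt;Shape-wise, the fusion described above can be sketched as follows; the hidden sizes and token counts here are assumed for illustration, and the understanding adaptor is reduced to a single projection matrix:&lt;/p&gt;

```python
import numpy as np

# Illustrative shapes only: an "understanding adaptor" is essentially a learned
# projection from the vision encoder's width into the LLM's hidden size, after
# which image and text features live in one unified sequence.
hidden = 2048                                   # assumed LLM hidden size
text_emb = np.zeros((12, hidden))               # 12 text tokens, already embedded
img_feat = np.zeros((24, 24, 1024))             # patch grid from the vision encoder (width 1024 assumed)
adaptor = np.zeros((1024, hidden))              # adaptor weights (hypothetical)

img_seq = img_feat.reshape(-1, 1024) @ adaptor  # flatten 2D grid -> 1D sequence, then project
fused = np.concatenate([img_seq, text_emb], 0)  # one multimodal sequence for the transformer
print(fused.shape)                              # (588, 2048)
```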


&lt;h3&gt;
  
  
  &lt;strong&gt;Janus Multi-Modal Performance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now, let’s talk performance. Despite its relatively small size of 1.3 billion parameters, Janus is competitive across several multimodal tasks. It excels in &lt;strong&gt;Visual Question Answering (VQA)&lt;/strong&gt; benchmarks, &lt;strong&gt;COCO Captioning&lt;/strong&gt;, and &lt;strong&gt;Image-Text Retrieval&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptbtlfh8je9okva3ngos.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptbtlfh8je9okva3ngos.png" alt="Janus MultiModal" width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Janus is designed to handle real-world multimodal applications where parameter efficiency is critical. While larger models might outperform Janus on tasks that require deep reasoning over complex text or high-resolution images, Janus hits a sweet spot by balancing efficiency and performance for general-purpose multimodal applications.&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;How to Use Janus for Multi-Modal Integration&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now, let us see how to use the model for multimodal inference. Below is an example of how to set up the &lt;code&gt;generate_answer&lt;/code&gt; function, which takes an image and a question as inputs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Load the VL-GPT model, tokenizer, and visual language chat processor
&lt;/span&gt;    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_vl_gpt_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_tokenizer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;vl_chat_processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_vl_chat_processor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Define conversation structure
&lt;/span&gt;    &lt;span class="n"&gt;conversation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; [image: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Prepare image for processing
&lt;/span&gt;    &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;preprocess_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Prepare inputs for the model
&lt;/span&gt;    &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vl_chat_processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;conversation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Generate input embeddings
&lt;/span&gt;    &lt;span class="n"&gt;input_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Generate answer using the VL-GPT model
&lt;/span&gt;    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;decode_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this code, we load the necessary components, prepare the image and question for processing, and generate a response that combines visual context with the posed question.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Janus Image Generation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Finally, let’s examine Janus’ image generation capabilities. While it’s not as large as dedicated models like &lt;strong&gt;DALL-E 2&lt;/strong&gt; or &lt;strong&gt;Stable Diffusion&lt;/strong&gt;, Janus still creates high-quality images from textual inputs in an incredibly compact form.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpw87fay5kqhu197haacr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpw87fay5kqhu197haacr.png" alt="Janus Image Gen" width="800" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As mentioned, Janus uses the VQ tokenizer to convert images into discrete tokens. At generation time, the LLM predicts these image tokens autoregressively from the text prompt, and the VQ tokenizer’s decoder then maps the completed token sequence back into pixels. The result? Images that are highly coherent and contextually accurate, especially when dealing with more straightforward or abstract prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How to Use Janus for Image Generation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The process starts with tokenizing the prompt using the &lt;code&gt;vl_chat_processor&lt;/code&gt;. This converts the text into numerical representations that the model can understand.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Tokenize the prompt
&lt;/span&gt;    &lt;span class="n"&gt;tokenized_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vl_chat_processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Create initial embeddings from tokens
&lt;/span&gt;    &lt;span class="n"&gt;initial_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenized_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Generate image tokens iteratively
&lt;/span&gt;    &lt;span class="n"&gt;image_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_tokens&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_next_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;initial_embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;image_tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;initial_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;initial_embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Decode tokens into an image
&lt;/span&gt;    &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decode_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Save image to disk
&lt;/span&gt;    &lt;span class="nf"&gt;save_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_image.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code illustrates generating an image based on a text prompt using Janus. It showcases the iterative process of generating image tokens while ensuring relevance to the original prompt.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;So there you have it—&lt;strong&gt;Janus 1.3B&lt;/strong&gt;, a small but compelling multimodal model that punches well above its weight. Its ability to handle text understanding, multimodal reasoning, and image generation in such a compact framework is a testament to the efficiency of its design. &lt;/p&gt;

&lt;p&gt;For those interested in multimodal AI that can be deployed in real-world applications without massive computational power, Janus is a model you should watch.&lt;/p&gt;

</description>
      <category>streamlit</category>
      <category>transformers</category>
      <category>computervision</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Building a Multi-Turn-Assistant Application using Llama, Claude and GPT4o</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Fri, 18 Oct 2024 17:32:13 +0000</pubDate>
      <link>https://dev.to/tunehqai/building-a-multi-turn-assistant-application-using-llama-claude-and-gpt4o-1ieo</link>
      <guid>https://dev.to/tunehqai/building-a-multi-turn-assistant-application-using-llama-claude-and-gpt4o-1ieo</guid>
      <description>&lt;p&gt;💻Github: &lt;a href="https://github.com/aryankargwal/genai-tutorials/tree/main/multi-turn-agents" rel="noopener noreferrer"&gt;https://github.com/aryankargwal/genai-tutorials/tree/main/multi-turn-agents&lt;/a&gt;&lt;br&gt;
🎥Youtube: &lt;a href="https://youtu.be/S9iHpExFrTs" rel="noopener noreferrer"&gt;https://youtu.be/S9iHpExFrTs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this guide, we’ll explore the development of a &lt;strong&gt;multi-turn AI assistant application&lt;/strong&gt; using &lt;strong&gt;LLMs&lt;/strong&gt; and AI assistant integration. This application is designed to streamline complex workflows such as &lt;strong&gt;internet retrieval&lt;/strong&gt;, &lt;strong&gt;market research&lt;/strong&gt;, &lt;strong&gt;campaign generation&lt;/strong&gt;, and &lt;strong&gt;image creation&lt;/strong&gt;. Throughout this process, we will rely on &lt;strong&gt;Tune Studio&lt;/strong&gt; for AI model orchestration, and &lt;strong&gt;Streamlit&lt;/strong&gt; for the front-end user interface. The end goal is to create a fully automated assistant-led pipeline that performs end-to-end tasks by interacting with multiple AI assistants in a sequential manner—also known as a multi-turn workflow.&lt;/p&gt;


&lt;h3&gt;
  
  
  What is a Multi-Turn AI Assistant Application?
&lt;/h3&gt;

&lt;p&gt;In the context of AI and automation, a &lt;strong&gt;multi-turn assistant application&lt;/strong&gt; is one where multiple interactions (or "turns") are required to complete a task. The application maintains context throughout these turns and allows each assistant or model to perform specific sub-tasks in a coordinated manner. This approach contrasts with single-turn applications, where the AI assistant addresses a single user query without needing to track prior inputs or outputs.&lt;/p&gt;

&lt;p&gt;In this tutorial, the multi-turn approach allows AI assistants to collaborate across multiple steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Market Research Assistant&lt;/strong&gt; gathers data from the web.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analytics Assistant&lt;/strong&gt; processes the research and generates insights.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Campaign Generation Assistant&lt;/strong&gt; creates marketing strategies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image Generation Assistant&lt;/strong&gt; produces a campaign poster.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each assistant plays a crucial role and passes context to the next in line, ensuring a smooth and coherent user experience.&lt;/p&gt;
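&lt;p&gt;The hand-off can be sketched in a few lines; the assistant functions below are hypothetical stand-ins for the API-backed ones shown later, but the shape is the same: each turn's output becomes part of the next turn's input, so context flows forward.&lt;/p&gt;

```python
# Minimal sketch of the multi-turn hand-off. Each function is a placeholder
# for a model-backed assistant; only the chaining pattern matters here.
def market_research(query):
    return f"research about {query}"

def analytics(research):
    return f"insights from ({research})"

def campaign(insights):
    return f"campaign built on ({insights})"

def poster(campaign_text):
    return f"poster for ({campaign_text})"

def run_pipeline(query):
    research = market_research(query)   # turn 1: gather data
    insights = analytics(research)      # turn 2: analyze, carrying turn 1's output
    plan = campaign(insights)           # turn 3: generate the campaign
    return poster(plan)                 # turn 4: render the poster

result = run_pipeline("smart water bottles")
print(result)
```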


&lt;h3&gt;
  
  
  What Are AI Assistants?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;AI assistants&lt;/strong&gt; are digital agents powered by machine learning models that help users perform tasks, answer questions, and provide recommendations. Unlike co-pilots or AI agents, AI assistants focus on assisting with user-driven tasks, such as scheduling meetings, performing web searches, or, in our case, handling marketing tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F375blztvdwjnozw6bcxg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F375blztvdwjnozw6bcxg.png" alt="Assistants v Copilots v Agents" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are three distinct categories of LLM-driven tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI Assistants&lt;/strong&gt;: Designed to respond to user commands and requests. Common examples include virtual assistants like &lt;strong&gt;Siri&lt;/strong&gt; or &lt;strong&gt;Alexa&lt;/strong&gt;, but they can also handle more specialized workflows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Co-Pilots&lt;/strong&gt;: These tools work alongside humans, helping improve tasks as they are being performed. Examples include &lt;strong&gt;Grammarly&lt;/strong&gt; for writing and &lt;strong&gt;GitHub Copilot&lt;/strong&gt; for coding.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI Agents&lt;/strong&gt;: Autonomous agents that plan and perform tasks with minimal user input, for example &lt;strong&gt;ReAct&lt;/strong&gt;-style agents and agentic workflows.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In our application, the &lt;strong&gt;AI assistants&lt;/strong&gt; are key players in achieving each part of the task while ensuring user control and input at every step. Now, let’s break down how we’ve integrated multiple assistants to create a seamless marketing and campaign generation tool.&lt;/p&gt;


&lt;h3&gt;
  
  
  Step-by-Step Breakdown of the Application
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6qiqsichvw1fipfyfd4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6qiqsichvw1fipfyfd4.png" alt="Application Flow" width="429" height="824"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  1. &lt;strong&gt;Performing Market Research with an AI Assistant&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;In this first step, the AI assistant is responsible for gathering relevant information from the internet. We use a &lt;strong&gt;Llama 3.1&lt;/strong&gt; model fine-tuned for research tasks to collect numerical data, trends, and insights from across the web.&lt;/p&gt;

&lt;p&gt;Here’s the core code for this assistant's function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_market_research_assistant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kargwalaryan/research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;frequency_penalty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;research_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function sends a user query to the &lt;strong&gt;Tune Studio&lt;/strong&gt; API, which uses a fine-tuned model to fetch relevant market research. The model acts as a &lt;strong&gt;subject matter expert&lt;/strong&gt; on the specific topic or product the user inquires about.&lt;/p&gt;
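&lt;p&gt;For context, a chat-completions-style endpoint like this typically nests the generated text under &lt;code&gt;choices[0].message.content&lt;/code&gt;. The sketch below shows how the globals the function relies on might be set up and how the answer could be pulled out of the response; the endpoint URL and key are placeholders, not real values.&lt;/p&gt;

```python
# Hedged setup sketch: research_url and headers are the globals the assistant
# functions post to. The URL and key below are placeholder assumptions.
research_url = "https://proxy.tune.app/chat/completions"  # assumed endpoint
headers = {
    "Authorization": "YOUR_TUNE_API_KEY",                 # placeholder
    "Content-Type": "application/json",
}

def extract_text(response_json):
    # Chat-completions responses nest the text under choices[0].message.content
    return response_json["choices"][0]["message"]["content"]

# A response shaped like what the API returns (contents invented for the demo)
sample = {"choices": [{"message": {"role": "assistant", "content": "42% YoY growth"}}]}
print(extract_text(sample))  # 42% YoY growth
```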

&lt;h4&gt;
  
  
  2. &lt;strong&gt;Analyzing Research and Creating Insights&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Once the data is gathered, the next assistant steps in to analyze the research. This assistant runs on &lt;strong&gt;Claude Sonnet&lt;/strong&gt;, a model known for its compliance, safety, and conversational adaptability.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_analytics_assistant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;research_text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;user_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Here is some market research data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;research_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Extract all the marketing insights and generate a campaign prompt.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are TuneStudio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic/claude-3.5-sonnet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;frequency_penalty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;research_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the &lt;strong&gt;Claude Sonnet&lt;/strong&gt; model processes the research and extracts stylistic and strategic insights that will inform the next step—campaign generation.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. &lt;strong&gt;Generating the Marketing Campaign&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;For campaign generation, we need an assistant that not only understands the market analysis but can also create a compelling, structured campaign. The &lt;strong&gt;Claude Sonnet&lt;/strong&gt; model shines in this area, as it generates an engaging and compliant campaign strategy based on market trends.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_campaign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;analysis_text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate a marketing campaign based on this analysis: {analysis_result}.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kargwalaryan/campaign-gen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;frequency_penalty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;research_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This assistant pulls from the insights gathered and creates a comprehensive campaign that could be deployed over the next few months.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. &lt;strong&gt;Image Generation for the Campaign Poster&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The final assistant in this pipeline produces a visual asset, a campaign poster, using &lt;strong&gt;GPT-4o&lt;/strong&gt;, which can generate images from textual descriptions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_image_generation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;analysis_text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate a campaign poster based on this analysis: {analysis_result}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kargwalaryan/image-gen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;frequency_penalty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;research_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This model generates a creative campaign poster based on the strategy developed in the earlier steps, completing the entire marketing pipeline.&lt;/p&gt;
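The four assistants described above form a chain, with each stage's output feeding the next. Here is a minimal sketch of that orchestration, assuming each call helper returns an OpenAI-style response dict; the `extract_text` helper and the exact wiring are illustrative, not the article's exact implementation:

```python
def extract_text(response: dict) -> str:
    """Pull the assistant message out of an OpenAI-style response dict."""
    return response["choices"][0]["message"]["content"]

def run_campaign_pipeline(topic, call_research, call_analysis,
                          call_campaign, call_image):
    """Chain the four assistants: each stage's text feeds the next one."""
    research = extract_text(call_research(topic))
    analysis = extract_text(call_analysis(research))
    campaign = extract_text(call_campaign(analysis))
    poster = extract_text(call_image(analysis))
    return {"research": research, "analysis": analysis,
            "campaign": campaign, "poster": poster}
```

Passing the call functions in as arguments also makes it easy to swap one assistant for another without touching the rest of the pipeline.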




&lt;h3&gt;
  
  
  Why Use Multi-Turn Assistant Workflows?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Multi-turn workflows&lt;/strong&gt; allow for complex tasks to be broken into smaller, manageable operations, each handled by a specialized AI assistant. This ensures that the final output is not only accurate but also aligned with the user's overall goals.&lt;/p&gt;

&lt;p&gt;Some of the key advantages of multi-turn workflows include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context Retention&lt;/strong&gt;: The application retains context across different stages of the workflow. This allows each assistant to build upon the work of previous assistants.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task Specialization&lt;/strong&gt;: Each assistant is optimized for a specific sub-task, ensuring higher performance in individual areas like research, analysis, campaign generation, and image creation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility and Customization&lt;/strong&gt;: You can easily modify or swap out assistants to suit different applications. For example, you could replace the market research assistant with one better suited to another industry or domain.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Creating a &lt;strong&gt;multi-turn AI assistant application&lt;/strong&gt; allows you to harness the power of multiple LLMs and assistants to handle complex tasks in a highly structured way. By breaking down tasks into distinct stages and integrating models like &lt;strong&gt;Llama 3.1&lt;/strong&gt;, &lt;strong&gt;Claude Sonnet&lt;/strong&gt;, and &lt;strong&gt;GPT-4o&lt;/strong&gt;, you can build intelligent, autonomous pipelines that help users with everything from market research to visual content creation.&lt;/p&gt;

&lt;p&gt;This approach is ideal for applications where tasks need to be completed in a step-by-step manner while maintaining context across all steps.&lt;/p&gt;

&lt;p&gt;Let me know if you have any questions or suggestions for further improvement! Stay tuned for more advanced tutorials on LLMs and VLMs.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>llm</category>
      <category>streamlit</category>
      <category>python</category>
    </item>
    <item>
      <title>Stress Testing VLMs: Multi QnA and Description Tasks</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Mon, 14 Oct 2024 14:12:09 +0000</pubDate>
      <link>https://dev.to/aryankargwal/stress-testing-vlms-multi-qna-and-description-tasks-569g</link>
      <guid>https://dev.to/aryankargwal/stress-testing-vlms-multi-qna-and-description-tasks-569g</guid>
      <description>&lt;p&gt;Video Link: &lt;a href="https://youtu.be/pwW9zwVQ4L8" rel="noopener noreferrer"&gt;https://youtu.be/pwW9zwVQ4L8&lt;/a&gt;&lt;br&gt;
Repository Link: &lt;a href="https://github.com/aryankargwal/genai-tutorials/tree/main" rel="noopener noreferrer"&gt;https://github.com/aryankargwal/genai-tutorials/tree/main&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;In the fast-evolving world of AI, Vision-Language Models (VLMs) have garnered attention for their ability to understand and generate responses based on visual and textual inputs. However, testing these models in a structured environment and comparing their performance across various scenarios is still a challenging task. This blog will walk you through an experiment where we used a custom-built &lt;strong&gt;Streamlit&lt;/strong&gt; web application to stress test multiple VLMs like &lt;strong&gt;Llama 3.2&lt;/strong&gt;, &lt;strong&gt;Qwen 2 VL&lt;/strong&gt;, and &lt;strong&gt;GPT-4o&lt;/strong&gt; on a range of tasks. We analyzed their &lt;strong&gt;response tokens&lt;/strong&gt;, &lt;strong&gt;latency&lt;/strong&gt;, and accuracy in generating answers to complex, multimodal questions.&lt;/p&gt;

&lt;p&gt;Note, however, that most of the findings are still under wraps: this application is part of my ongoing work on a VLM benchmark, the first installment of which you can check out on Hugging Face as &lt;a href="https://huggingface.co/datasets/kargwalaryan/SynCap-Flickr8k" rel="noopener noreferrer"&gt;SynCap-Flickr8k&lt;/a&gt;!&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Compare Vision-Language Models?
&lt;/h2&gt;

&lt;p&gt;The ability to compare the performance of different VLMs across domains is critical for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Understanding model efficiency (tokens used, latency).&lt;/li&gt;
&lt;li&gt;Measuring how well models can generate coherent responses based on image inputs and textual prompts.&lt;/li&gt;
&lt;li&gt;Creating benchmark datasets to further improve and fine-tune VLMs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To achieve this, we built a &lt;strong&gt;VLM Stress Testing Web App&lt;/strong&gt; in Python, utilizing &lt;strong&gt;Streamlit&lt;/strong&gt; for a user-friendly interface. This allowed us to upload images, input textual prompts, and obtain model-generated responses in real time. The app also calculated and logged critical metrics such as the number of tokens used in responses and latency.&lt;/p&gt;
&lt;h3&gt;
  
  
  Project Setup
&lt;/h3&gt;

&lt;p&gt;Our main application file, &lt;code&gt;app.py&lt;/code&gt;, uses &lt;strong&gt;Streamlit&lt;/strong&gt; as the frontend and is integrated with API requests to call different VLM models. Each query to a model includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Image&lt;/strong&gt;: Encoded in Base64 format.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Question&lt;/strong&gt;: A text input by the user.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model ID&lt;/strong&gt;: We allow users to choose between multiple VLMs.&lt;/li&gt;
&lt;/ul&gt;
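Encoding an image for the request payload is a one-liner with the standard library; a small sketch (the helper name and file path are illustrative):

```python
import base64

def encode_image(path: str) -> str:
    """Read an image file and return its Base64-encoded string for the API payload."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```

The resulting string is embedded in a `data:image/jpeg;base64,...` URL, as the query code below shows.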

&lt;p&gt;The API response includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Answer&lt;/strong&gt;: The model-generated text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: Time taken for the model to generate the answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token Count&lt;/strong&gt;: Number of tokens used by the model in generating the response.&lt;/li&gt;
&lt;/ul&gt;
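Latency and completion-token counts can be captured by wrapping each model call. A minimal sketch, assuming the endpoint returns an OpenAI-compatible `usage` object (adjust the field names if your API differs):

```python
import time

def timed_query(query_fn, *args, **kwargs):
    """Wrap a model call: return (response, latency in seconds, completion tokens)."""
    start = time.perf_counter()
    response = query_fn(*args, **kwargs)
    latency = time.perf_counter() - start
    # 'usage' assumes an OpenAI-compatible response schema.
    tokens = response.get("usage", {}).get("completion_tokens")
    return response, latency, tokens
```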

&lt;p&gt;Below is the code structure for querying the models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base64_image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frequency_penalty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;image_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data:image/jpeg;base64,&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base64_image&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="n"&gt;image_content&lt;/span&gt;
                &lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;frequency_penalty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;frequency_penalty&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Task Definitions and Experiments
&lt;/h2&gt;

&lt;p&gt;We tested four different tasks across multiple domains using the following models:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Llama 3.2&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Qwen 2 VL&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GPT-4o&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Domains:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Medical&lt;/strong&gt;: Questions related to complex medical scenarios.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retail&lt;/strong&gt;: Product-related queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CCTV&lt;/strong&gt;: Surveillance footage analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Art&lt;/strong&gt;: Generating artistic interpretations and descriptions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The experiment involved five queries per task for each model, and we recorded the following metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tokens&lt;/strong&gt;: The number of tokens used by the model to generate a response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: Time taken to return the response.&lt;/li&gt;
&lt;/ul&gt;
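The mean and standard-deviation columns in the tables below can be reproduced with Python's `statistics` module; the values here are illustrative, not the recorded data:

```python
from statistics import mean, stdev

def summarize(values):
    """Return (mean, sample standard deviation), rounded to two decimals."""
    return round(mean(values), 2), round(stdev(values), 2)

# Illustrative per-question latencies in seconds:
summary = summarize([1.2, 1.5, 1.1, 1.4, 1.3])
```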

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Token Usage Comparison
&lt;/h4&gt;

&lt;p&gt;The tables below highlight the token usage across the four domains for both &lt;strong&gt;Llama&lt;/strong&gt; and &lt;strong&gt;GPT&lt;/strong&gt; models.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Task&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q1 Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q2 Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q3 Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q4 Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q5 Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Mean Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Standard Deviation (Tokens)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Medical (Llama)&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3.2&lt;/td&gt;
&lt;td&gt;4.81&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retail (Llama)&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;td&gt;83&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;124&lt;/td&gt;
&lt;td&gt;60.8&lt;/td&gt;
&lt;td&gt;32.77&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CCTV (Llama)&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;81&lt;/td&gt;
&lt;td&gt;83&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;124&lt;/td&gt;
&lt;td&gt;69.2&lt;/td&gt;
&lt;td&gt;37.29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Art (Llama)&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;71&lt;/td&gt;
&lt;td&gt;88&lt;/td&gt;
&lt;td&gt;154&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;72.8&lt;/td&gt;
&lt;td&gt;51.21&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Task&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q1 Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q2 Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q3 Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q4 Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q5 Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Mean Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Standard Deviation (Tokens)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Medical (GPT)&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2.8&lt;/td&gt;
&lt;td&gt;4.04&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retail (GPT)&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;17.8&lt;/td&gt;
&lt;td&gt;8.53&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CCTV (GPT)&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;16.8&lt;/td&gt;
&lt;td&gt;7.69&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Art (GPT)&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;43&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;40.6&lt;/td&gt;
&lt;td&gt;35.73&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Latency Comparison
&lt;/h4&gt;

&lt;p&gt;Latency, measured in seconds, is another critical factor in evaluating the model's performance, especially for real-time applications. The following tables display latency results for the same set of tasks.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Task&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q1 Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q2 Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q3 Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q4 Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q5 Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Mean Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Standard Deviation (Latency)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Medical (Llama)&lt;/td&gt;
&lt;td&gt;0.74&lt;/td&gt;
&lt;td&gt;0.97&lt;/td&gt;
&lt;td&gt;0.78&lt;/td&gt;
&lt;td&gt;0.98&lt;/td&gt;
&lt;td&gt;1.19&lt;/td&gt;
&lt;td&gt;0.93&lt;/td&gt;
&lt;td&gt;0.19&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retail (Llama)&lt;/td&gt;
&lt;td&gt;1.63&lt;/td&gt;
&lt;td&gt;3.00&lt;/td&gt;
&lt;td&gt;3.02&lt;/td&gt;
&lt;td&gt;1.67&lt;/td&gt;
&lt;td&gt;3.14&lt;/td&gt;
&lt;td&gt;2.49&lt;/td&gt;
&lt;td&gt;0.74&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CCTV (Llama)&lt;/td&gt;
&lt;td&gt;1.63&lt;/td&gt;
&lt;td&gt;3.00&lt;/td&gt;
&lt;td&gt;3.02&lt;/td&gt;
&lt;td&gt;1.67&lt;/td&gt;
&lt;td&gt;3.14&lt;/td&gt;
&lt;td&gt;2.49&lt;/td&gt;
&lt;td&gt;0.74&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Art (Llama)&lt;/td&gt;
&lt;td&gt;1.35&lt;/td&gt;
&lt;td&gt;2.46&lt;/td&gt;
&lt;td&gt;2.91&lt;/td&gt;
&lt;td&gt;4.45&lt;/td&gt;
&lt;td&gt;2.09&lt;/td&gt;
&lt;td&gt;2.65&lt;/td&gt;
&lt;td&gt;1.06&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Task&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q1 Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q2 Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q3 Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q4 Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q5 Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Mean Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Standard Deviation (Latency)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Medical (GPT)&lt;/td&gt;
&lt;td&gt;1.35&lt;/td&gt;
&lt;td&gt;1.50&lt;/td&gt;
&lt;td&gt;1.21&lt;/td&gt;
&lt;td&gt;1.50&lt;/td&gt;
&lt;td&gt;1.23&lt;/td&gt;
&lt;td&gt;1.38&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retail (GPT)&lt;/td&gt;
&lt;td&gt;1.24&lt;/td&gt;
&lt;td&gt;1.77&lt;/td&gt;
&lt;td&gt;2.12&lt;/td&gt;
&lt;td&gt;1.35&lt;/td&gt;
&lt;td&gt;1.83&lt;/td&gt;
&lt;td&gt;1.63&lt;/td&gt;
&lt;td&gt;0.29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CCTV (GPT)&lt;/td&gt;
&lt;td&gt;1.20&lt;/td&gt;
&lt;td&gt;2.12&lt;/td&gt;
&lt;td&gt;1.80&lt;/td&gt;
&lt;td&gt;1.35&lt;/td&gt;
&lt;td&gt;1.83&lt;/td&gt;
&lt;td&gt;1.68&lt;/td&gt;
&lt;td&gt;0.32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Art (GPT)&lt;/td&gt;
&lt;td&gt;1.24&lt;/td&gt;
&lt;td&gt;1.77&lt;/td&gt;
&lt;td&gt;7.69&lt;/td&gt;
&lt;td&gt;3.94&lt;/td&gt;
&lt;td&gt;2.41&lt;/td&gt;
&lt;td&gt;3.61&lt;/td&gt;
&lt;td&gt;2.29&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Observations
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Token Efficiency&lt;/strong&gt;: Llama models generally use fewer tokens in response generation for simpler tasks like &lt;strong&gt;Medical&lt;/strong&gt; compared to more complex domains like &lt;strong&gt;Art&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: Latency is higher for more complex images, especially for tasks like &lt;strong&gt;Retail&lt;/strong&gt; and &lt;strong&gt;Art&lt;/strong&gt;, indicating that these models take more time when generating in-depth descriptions or analyzing images.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT vs. Llama&lt;/strong&gt;: GPT models generally had lower token counts across the tasks, but the latency was comparable, with GPT showing slightly more variability in complex tasks like &lt;strong&gt;Art&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Conclusion and Future Work
&lt;/h3&gt;

&lt;p&gt;This experiment highlights the importance of evaluating both token efficiency and latency when stress testing VLMs. The &lt;strong&gt;VLM Stress Test App&lt;/strong&gt; allows us to quickly compare multiple models and analyze their performance across a variety of real-world tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Future Plans&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Additional Models&lt;/strong&gt;: We plan to add more models like &lt;strong&gt;Mistral&lt;/strong&gt; and &lt;strong&gt;Claude&lt;/strong&gt; to the comparison.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expanded Dataset&lt;/strong&gt;: New tasks in domains like &lt;strong&gt;Legal&lt;/strong&gt; and &lt;strong&gt;Education&lt;/strong&gt; will be added to challenge the models further.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy Metrics&lt;/strong&gt;: We'll also integrate accuracy metrics like &lt;strong&gt;BLEU&lt;/strong&gt; and &lt;strong&gt;ROUGE&lt;/strong&gt; scores in the next iteration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check out our &lt;a href="https://github.com/aryankargwal/genai-tutorials/tree/main" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; for the code and further instructions on how to set up and run your own VLM experiments.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>streamlit</category>
      <category>vlm</category>
      <category>benchmarking</category>
    </item>
    <item>
      <title>Doing Multihop on HotPotQA Using Qwen 2.5 72B</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Thu, 26 Sep 2024 14:46:21 +0000</pubDate>
      <link>https://dev.to/tunehqai/doing-multihop-on-hotpotqa-using-qwen-25-72b-43a1</link>
      <guid>https://dev.to/tunehqai/doing-multihop-on-hotpotqa-using-qwen-25-72b-43a1</guid>
      <description>&lt;p&gt;When dealing with complex question-answering tasks, a single-hop retrieval approach might not be enough. Questions often require synthesizing information from multiple sources. That’s where &lt;strong&gt;MultiHop Question Answering (QA)&lt;/strong&gt; comes into play, requiring more advanced tools for retrieval and reasoning. In this post, I’ll describe how I built a multi-hop QA pipeline using &lt;strong&gt;DSPy&lt;/strong&gt;, &lt;strong&gt;ColBERT&lt;/strong&gt;, &lt;strong&gt;TuneAPI&lt;/strong&gt;, and &lt;strong&gt;Qwen 2.5 72B&lt;/strong&gt; to handle multi-step reasoning over a knowledge base.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Key Tools
&lt;/h2&gt;

&lt;p&gt;Before diving into the code, let’s first break down the key tools and libraries that power this pipeline:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;DSPy&lt;/strong&gt; (Demonstrate-Search-Predict)
&lt;/h3&gt;

&lt;p&gt;DSPy is a Python library that helps structure multi-step processes for tasks like retrieval-augmented generation and multi-hop question answering. It allows us to define a clear, modular flow for handling complex information retrieval tasks and integrate language models effectively.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;ColBERT&lt;/strong&gt; (Contextualized Late Interaction over BERT)
&lt;/h3&gt;

&lt;p&gt;ColBERT is a dense retrieval model designed to retrieve passages efficiently from large corpora. It works by encoding both the query and documents in a low-dimensional space and comparing them to find relevant matches. For multi-hop QA, ColBERT helps identify the most pertinent passages to answer complex questions.&lt;/p&gt;
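ColBERT's late-interaction ("MaxSim") scoring can be illustrated with a small pure-Python sketch: each query token embedding takes its best match over the document's token embeddings, and the maxima are summed. The toy vectors below stand in for real learned embeddings:

```python
from math import sqrt

def _unit(vec):
    """Normalize a vector to unit length."""
    n = sqrt(sum(x * x for x in vec))
    return [x / n for x in vec]

def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction (MaxSim): for each query token embedding,
    take its best cosine similarity over all document token embeddings,
    then sum those maxima into one relevance score."""
    docs = [_unit(v) for v in doc_vecs]
    return sum(
        max(sum(a * b for a, b in zip(_unit(q), d)) for d in docs)
        for q in query_vecs
    )
```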

&lt;h3&gt;
  
  
  3. &lt;strong&gt;TuneAPI&lt;/strong&gt; (API Proxy for LLMs)
&lt;/h3&gt;

&lt;p&gt;TuneAPI acts as a proxy API to interact with LLMs such as Qwen. This lets us access the powerful inference capabilities of LLMs and customize how they process inputs and generate responses.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Qwen 2.5 72B&lt;/strong&gt; (Alibaba’s Large Language Model)
&lt;/h3&gt;

&lt;p&gt;Qwen 2.5 72B is a state-of-the-art large language model developed by Alibaba. While the Qwen family also includes dedicated vision-language variants, the 72B model itself is a text model that excels in natural language reasoning, making it a great choice for multi-hop QA tasks where nuanced reasoning over text is required.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. &lt;strong&gt;HotPotQA&lt;/strong&gt; (Dataset for Multi-Hop QA)
&lt;/h3&gt;

&lt;p&gt;HotPotQA is a dataset designed specifically for multi-hop question answering. It contains questions that require information from multiple documents to arrive at an accurate answer, making it ideal for training and evaluating multi-hop QA systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting Up the MultiHopQA Pipeline
&lt;/h2&gt;

&lt;p&gt;The goal here is to build an end-to-end pipeline that can retrieve relevant documents using &lt;strong&gt;ColBERT&lt;/strong&gt;, pass the retrieved contexts to &lt;strong&gt;Qwen 2.5 72B&lt;/strong&gt; for reasoning, and finally output the predicted answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code Walkthrough
&lt;/h3&gt;

&lt;p&gt;Let’s break down the process into manageable steps. Here’s the code for building the pipeline:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Importing Required Libraries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dsp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LM&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dspy.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HotPotQA&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dsp.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;deduplicate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We start by importing the necessary libraries: &lt;strong&gt;dspy&lt;/strong&gt; for building the pipeline (the &lt;strong&gt;ColBERT&lt;/strong&gt; retrieval client is configured through it later), the &lt;code&gt;LM&lt;/code&gt; base class and &lt;code&gt;deduplicate&lt;/code&gt; helper from &lt;code&gt;dsp&lt;/code&gt;, the &lt;code&gt;HotPotQA&lt;/code&gt; loader to provide multi-hop questions, and &lt;strong&gt;requests&lt;/strong&gt; to interact with the &lt;strong&gt;TuneAPI&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Creating a Custom Language Model Client
&lt;/h3&gt;

&lt;p&gt;To use Qwen, we need a custom class to handle API requests. We interact with Qwen via the TuneAPI to submit prompts and retrieve responses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CustomLMClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LM&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://proxy.tune.app/chat/completions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;basic_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are TuneStudio, answer the question based on the context given to you.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;frequency_penalty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;frequency_penalty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;custom_lm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CustomLMClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;qwen/qwen-2.5-72b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;your_api_key_here&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This class wraps the Qwen model: it formats the prompt, handles API communication, and parses the response. The &lt;code&gt;basic_request&lt;/code&gt; method takes care of sending requests to the &lt;strong&gt;TuneAPI&lt;/strong&gt; and returning the raw JSON payload.&lt;/p&gt;
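&lt;p&gt;One gap worth noting: &lt;code&gt;basic_request&lt;/code&gt; returns raw JSON, yet later the pipeline calls &lt;code&gt;self.lm_client(prompt)&lt;/code&gt; and takes &lt;code&gt;response[0]&lt;/code&gt;, which expects a list of completion strings. A small &lt;code&gt;__call__&lt;/code&gt; that parses the response bridges the two. This is a hedged sketch: the field names assume an OpenAI-compatible response schema (the shape the Tune proxy appears to follow), and &lt;code&gt;CallableLM&lt;/code&gt; is a hypothetical wrapper, not part of the original code.&lt;/p&gt;

```python
# Hedged sketch: SimplifiedBaleen later calls self.lm_client(prompt) and
# indexes response[0], but basic_request returns raw JSON. A __call__ that
# extracts completion strings bridges the gap. Field names below assume an
# OpenAI-compatible schema; verify against the actual Tune payload.

def extract_completions(response_json):
    """Return the completion strings from an OpenAI-style chat response."""
    return [choice["message"]["content"]
            for choice in response_json.get("choices", [])]

class CallableLM:  # hypothetical wrapper around CustomLMClient.basic_request
    def __init__(self, request_fn):
        self.request_fn = request_fn  # e.g. custom_lm.basic_request

    def __call__(self, prompt, **kwargs):
        return extract_completions(self.request_fn(prompt, **kwargs))
```

With something like this in place, `response[0]` in `generate_answer` yields the first completion string rather than failing on a raw JSON dict.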

&lt;h3&gt;
  
  
  3. Configuring Retrieval and Language Model
&lt;/h3&gt;

&lt;p&gt;Next, we configure &lt;strong&gt;ColBERT&lt;/strong&gt; and set up DSPy to use our custom language model client for inference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;colbertv2_wiki17_abstracts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ColBERTv2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://20.102.90.50:2017/wiki17_abstracts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;custom_lm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;colbertv2_wiki17_abstracts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;strong&gt;ColBERTv2&lt;/strong&gt; retrieves relevant Wikipedia abstracts. These abstracts will be passed to the language model for deeper reasoning.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Loading HotPotQA Dataset
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HotPotQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eval_seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dev_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;trainset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_inputs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;devset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_inputs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We load a small subset of the &lt;strong&gt;HotPotQA&lt;/strong&gt; dataset (20 training and 50 dev examples) for testing. This dataset will provide multi-hop questions for the pipeline.&lt;/p&gt;
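&lt;p&gt;To clarify what the seed and size arguments control, here is a rough plain-Python sketch of the same idea (DSPy's internal sampling differs in detail): the seeds make the draws reproducible, and the sizes cap how many examples each split keeps.&lt;/p&gt;

```python
import random

# Rough sketch of what train_seed / eval_seed / train_size / dev_size mean.
# DSPy's actual sampling logic differs in detail; this only illustrates the
# semantics: seeded, reproducible draws of fixed-size splits.
def split_dataset(examples, train_size, dev_size, train_seed=1, eval_seed=2023):
    pool = list(examples)
    random.Random(train_seed).shuffle(pool)        # reproducible train draw
    train, rest = pool[:train_size], pool[train_size:]
    random.Random(eval_seed).shuffle(rest)         # reproducible dev draw
    return train, rest[:dev_size]

train, dev = split_dataset(range(100), train_size=20, dev_size=50)
print(len(train), len(dev))  # 20 50
```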

&lt;h3&gt;
  
  
  5. Simplified Baleen for Multi-Hop Retrieval
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Simplified Baleen&lt;/strong&gt; class handles the multi-hop retrieval process. It repeatedly retrieves passages, feeds them into the language model, and finally generates an answer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SimplifiedBaleen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lm_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;passages_per_hop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_hops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retrieve&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;passages_per_hop&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_hops&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_hops&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lm_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lm_client&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; Context: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;context_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Given the following information: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context_str&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Answer the question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lm_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_hops&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;passages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;passages&lt;/span&gt;
            &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;deduplicate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;passages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Prediction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the core of our pipeline. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generates queries based on previously retrieved context.&lt;/li&gt;
&lt;li&gt;Retrieves relevant documents using &lt;strong&gt;ColBERT&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Passes the final context to &lt;strong&gt;Qwen&lt;/strong&gt; to generate the answer.&lt;/li&gt;
&lt;/ul&gt;
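&lt;p&gt;The control flow in the bullets above can be traced end to end with toy stand-ins for the retriever and the language model (the dictionary lookup and one-liner answer function below are placeholders for ColBERT and Qwen, nothing more):&lt;/p&gt;

```python
# Toy mock of the hop loop: a dictionary lookup stands in for ColBERT and a
# trivial function stands in for Qwen, so the control flow is easy to trace.
def dedup(items):
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def multi_hop(question, retrieve, answer_fn, max_hops=2, k=2):
    context = []
    for _ in range(max_hops):
        # Each hop's query folds in what earlier hops retrieved.
        query = f"{question} Context: {' '.join(context)}"
        context = dedup(context + retrieve(query, k))
    return answer_fn(context, question)

docs = {"paris": "Paris is the capital of France.",
        "france": "France is in Europe."}
retrieve = lambda q, k: [d for key, d in docs.items() if key in q.lower()][:k]
answer_fn = lambda ctx, q: ctx[0] if ctx else "unknown"

# Hop 1 finds the Paris passage; hop 2's query now mentions France, so the
# second passage is pulled in as well.
print(multi_hop("Where is Paris?", retrieve, answer_fn))
```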

&lt;h3&gt;
  
  
  6. Running the Pipeline
&lt;/h3&gt;

&lt;p&gt;We define a question and pass it through the pipeline to retrieve the answer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;my_question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What position on the Billboard Top 100 did Alison Moyet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s late summer hit achieve?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;uncompiled_baleen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimplifiedBaleen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lm_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;custom_lm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;uncompiled_baleen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;my_question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;my_question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Predicted Answer: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieved Contexts (truncated): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The question is answered using passages retrieved over one or more hops (a single hop with the defaults above) and reasoned over by &lt;strong&gt;Qwen 2.5 72B&lt;/strong&gt;. The final answer is printed alongside the retrieved contexts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This project highlights the growing importance of &lt;strong&gt;multi-hop question answering&lt;/strong&gt; and how combining modern tools like &lt;strong&gt;ColBERT&lt;/strong&gt; for retrieval and &lt;strong&gt;Qwen&lt;/strong&gt; for reasoning can provide powerful solutions. By leveraging datasets like &lt;strong&gt;HotPotQA&lt;/strong&gt;, it’s easier to experiment and fine-tune these pipelines for real-world QA systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Future Plans:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Experiment with more retrieval-augmented generation tasks.&lt;/li&gt;
&lt;li&gt;Extend this pipeline to support more languages and domain-specific datasets.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For more NLP tutorials and walkthroughs&lt;/strong&gt;, feel free to check out my &lt;a href="https://www.youtube.com/@AryanKargwal" rel="noopener noreferrer"&gt;YouTube Channel&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>qwen</category>
      <category>tutorial</category>
      <category>tunestudio</category>
    </item>
    <item>
      <title>Benchmarking Pixtral 12B: MistralAI's New VLM</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Wed, 18 Sep 2024 20:45:32 +0000</pubDate>
      <link>https://dev.to/aryankargwal/benchmarking-pixtral-12b-mistralais-new-vlm-ff</link>
      <guid>https://dev.to/aryankargwal/benchmarking-pixtral-12b-mistralais-new-vlm-ff</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/aryankargwal/vlm-benchmarks" rel="noopener noreferrer"&gt;GitHub Link&lt;/a&gt;&lt;br&gt;
&lt;a href="https://youtu.be/MwryGctpWrM" rel="noopener noreferrer"&gt;Youtube Link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the fast-evolving world of AI, Vision-Language Models (VLMs) are breaking new ground. Today, we are diving into &lt;a href="https://mistral.ai/news/pixtral-12b/" rel="noopener noreferrer"&gt;Pixtral 12B&lt;/a&gt;, the latest VLM from Mistral AI, which I benchmarked against GPT-4V on multiple datasets. This technical blog will walk you through my benchmarking process and share insights on how Pixtral 12B fares against GPT-4V across various tasks.&lt;/p&gt;

&lt;p&gt;Pixtral 12B is an exciting release, and it brings several innovations to the table, including a 400M parameter vision encoder and a massive 128K token context window. If you’re working on any image-to-text pipelines, this might be the model you need. Let’s dig into the details.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Pixtral 12B?
&lt;/h3&gt;

&lt;p&gt;Pixtral 12B, Mistral AI's latest VLM, is built for complex multimodal tasks, such as chart analysis, code generation from images, and multi-image inferences. Its unique architecture features a &lt;strong&gt;400M parameter vision encoder&lt;/strong&gt; capable of processing images at their native resolution and aspect ratio, significantly reducing preprocessing efforts. Additionally, the &lt;strong&gt;128K token context window&lt;/strong&gt; allows it to handle up to 2,000 images in one batch, streamlining image processing at scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcu3nts1m9o8rpophyh4k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcu3nts1m9o8rpophyh4k.png" alt=" " width="800" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model is versatile across various tasks, especially in understanding visuals with intricate details, such as diagrams. It even supports multi-image inferences, a feature highly beneficial for complex scenarios like medical imaging or sequential image analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Datasets and Benchmarks
&lt;/h3&gt;

&lt;p&gt;For this benchmarking exercise, I evaluated Pixtral 12B on three key datasets:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ArxivQA&lt;/strong&gt;: A large collection of research paper-based question-answering tasks.

&lt;ul&gt;
&lt;li&gt;Dataset link: &lt;a href="https://huggingface.co/datasets/MMInstruction/ArxivQA?row=0" rel="noopener noreferrer"&gt;ArxivQA on Hugging Face&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VisIT Benchmark&lt;/strong&gt;: A dataset for vision-language instruction tasks inspired by real-life scenarios.

&lt;ul&gt;
&lt;li&gt;Dataset link: &lt;a href="https://huggingface.co/datasets/mlfoundations/VisIT-Bench?row=0" rel="noopener noreferrer"&gt;VisIT Bench on Hugging Face&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flickr30K&lt;/strong&gt;: A long-standing image captioning dataset with both human-generated and GPT-4o captions.

&lt;ul&gt;
&lt;li&gt;Dataset link: &lt;a href="https://huggingface.co/datasets/nlphuji/flickr30k" rel="noopener noreferrer"&gt;Flickr30K on Hugging Face&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In addition to evaluating Pixtral 12B, I also used &lt;strong&gt;GPT-4v&lt;/strong&gt; for comparison. The core evaluation metric was &lt;strong&gt;Cosine Similarity&lt;/strong&gt;, which measures the semantic similarity between a generated caption and its reference. From these per-sample similarity scores I derive a &lt;strong&gt;win rate&lt;/strong&gt; and &lt;strong&gt;top-k scores&lt;/strong&gt; (top-1, top-5, and top-10) for both the Pixtral-generated captions and the GPT-4v outputs.&lt;/p&gt;
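&lt;p&gt;As a rough sketch of how these aggregate metrics can be derived from per-sample similarity scores (this is my reading of the setup, not the exact benchmarking code; function names are illustrative):&lt;/p&gt;

```python
def win_rate(model_scores, baseline_scores):
    # Fraction of samples where the model's similarity to the reference
    # beats the baseline's similarity for the same sample
    wins = sum(1 for m, b in zip(model_scores, baseline_scores) if m > b)
    return wins / len(model_scores)

def top_k_score(scores, k):
    # Mean similarity over the k best-scoring samples -- one plausible
    # reading of "top-k score"; the original scripts may differ
    best = sorted(scores, reverse=True)[:k]
    return sum(best) / len(best)
```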

&lt;h3&gt;
  
  
  Cosine Similarity with all-MiniLM-L6-v2
&lt;/h3&gt;

&lt;p&gt;In this benchmarking process, I used &lt;strong&gt;Cosine Similarity&lt;/strong&gt; to evaluate the quality of the model-generated captions and responses by comparing them with reference texts. Specifically, I leveraged the &lt;strong&gt;all-MiniLM-L6-v2&lt;/strong&gt; model, a lightweight transformer model fine-tuned for sentence embedding tasks, to compute the embeddings of both the predicted and reference texts.&lt;/p&gt;

&lt;h4&gt;
  
  
  Why Cosine Similarity?
&lt;/h4&gt;

&lt;p&gt;Cosine Similarity is an efficient and commonly used metric for measuring the &lt;strong&gt;semantic similarity&lt;/strong&gt; between two pieces of text. Unlike traditional methods like BLEU or METEOR, which emphasize exact word matching, Cosine Similarity evaluates the &lt;strong&gt;contextual alignment&lt;/strong&gt; between two text embeddings, making it ideal for tasks like image captioning and question answering where the meaning of the text matters more than the exact word sequence.&lt;/p&gt;

&lt;p&gt;For each comparison, both the reference and predicted texts were transformed into &lt;strong&gt;vector embeddings&lt;/strong&gt; using the all-MiniLM-L6-v2 model, and the cosine similarity score was calculated as:&lt;/p&gt;

&lt;p&gt;\[&lt;br&gt;
\text{Cosine Similarity} = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|}&lt;br&gt;
\]&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A&lt;/strong&gt; and &lt;strong&gt;B&lt;/strong&gt; represent the embeddings of the predicted and reference texts, respectively.&lt;/li&gt;
&lt;li&gt;The result is a score between -1 and 1, where 1 indicates that the two vectors are perfectly aligned (high similarity), and -1 indicates they are diametrically opposed.&lt;/li&gt;
&lt;/ul&gt;
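&lt;p&gt;The formula above is straightforward to implement. Here is a minimal pure-Python sketch of the scoring step (in the actual pipeline, A and B come from all-MiniLM-L6-v2 sentence embeddings rather than the toy vectors shown here):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    # Euclidean norms of each vector
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```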

&lt;h4&gt;
  
  
  Why all-MiniLM-L6-v2?
&lt;/h4&gt;

&lt;p&gt;I chose &lt;strong&gt;all-MiniLM-L6-v2&lt;/strong&gt; because of its balance between &lt;strong&gt;speed&lt;/strong&gt; and &lt;strong&gt;performance&lt;/strong&gt;. The model, with just 22 million parameters, is capable of generating high-quality sentence embeddings that can efficiently compute similarity scores in real-time. Despite being compact, it retains much of the semantic understanding found in larger models, making it ideal for scenarios like benchmarking where large volumes of data need to be processed quickly.&lt;/p&gt;

&lt;p&gt;Here’s why &lt;strong&gt;all-MiniLM-L6-v2&lt;/strong&gt; was the perfect fit for this task:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Efficient Embeddings&lt;/strong&gt;: It generates high-quality embeddings that are lightweight yet semantically rich.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Due to its small size, it scales well with large datasets without compromising inference speed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accurate Semantic Representation&lt;/strong&gt;: It captures a strong semantic understanding, essential when comparing captions or answers where the meaning matters more than exact matches.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This embedding model enabled me to compute &lt;strong&gt;cosine similarity&lt;/strong&gt; for various benchmarks like &lt;strong&gt;ArxivQA&lt;/strong&gt;, &lt;strong&gt;VisIT&lt;/strong&gt;, and &lt;strong&gt;Flickr30K&lt;/strong&gt;, allowing for a more nuanced evaluation of how well Pixtral 12B and GPT-4v perform on these datasets.&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluation Setup and Methodology
&lt;/h3&gt;

&lt;p&gt;To evaluate Pixtral 12B’s performance, I used &lt;a href="https://studio.tune.app/playground" rel="noopener noreferrer"&gt;&lt;strong&gt;Tune Studio&lt;/strong&gt;&lt;/a&gt;, which offers &lt;strong&gt;unlimited API calls&lt;/strong&gt; and provides fast inference with &lt;strong&gt;350+ instruction inferences/hour&lt;/strong&gt; and &lt;strong&gt;500+ captioning inferences/hour&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo05xbnfir2rzxbuesvtt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo05xbnfir2rzxbuesvtt.png" alt=" " width="654" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each dataset was benchmarked as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ArxivQA&lt;/strong&gt;: I sampled 1,000 randomly selected images from a pool of 100,000 research paper-based questions. The model had to select the correct answer from multiple options and provide a rationale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VisIT Benchmark&lt;/strong&gt;: I evaluated the model on 500+ images covering real-life VLM applications. The task required Pixtral 12B to generate instruction-based responses from the images, which were then compared against human- and GPT-4-generated captions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flickr30K&lt;/strong&gt;: For this dataset, Pixtral 12B generated captions for 1,000 random images. These captions were compared with both human and GPT-4o-generated captions.&lt;/li&gt;
&lt;/ul&gt;
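&lt;p&gt;For reproducibility, the random subsets described above can be drawn with a fixed seed; a minimal sketch (the pool size matches ArxivQA, but the seed and function name are illustrative):&lt;/p&gt;

```python
import random

def sample_subset(items, k, seed=42):
    # Draw k distinct samples from the dataset; fixing the seed keeps
    # repeated benchmark runs comparable
    rng = random.Random(seed)
    return rng.sample(items, k)
```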

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;h4&gt;
  
  
  ArxivQA
&lt;/h4&gt;

&lt;p&gt;On the &lt;strong&gt;ArxivQA&lt;/strong&gt; dataset, Pixtral 12B faced the challenge of generating accurate answers for research-based questions. Compared to &lt;strong&gt;GPT-4v&lt;/strong&gt;, Pixtral 12B’s multi-word responses lowered the &lt;strong&gt;win rate&lt;/strong&gt;, but its &lt;strong&gt;rationale score&lt;/strong&gt; remained high, showcasing its ability to reason through complex topics.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;GPT-4v (Labels)&lt;/th&gt;
&lt;th&gt;GPT-4v (Rationale)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Win Rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20.2%&lt;/td&gt;
&lt;td&gt;23.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Top-1 Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;90.8%&lt;/td&gt;
&lt;td&gt;94.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Top-5 Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;84.3%&lt;/td&gt;
&lt;td&gt;94.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Top-10 Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;77.2%&lt;/td&gt;
&lt;td&gt;92.66%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  VisIT Benchmark
&lt;/h4&gt;

&lt;p&gt;The &lt;strong&gt;VisIT Benchmark&lt;/strong&gt; focuses on real-world VLM tasks, making it a more practical measure of Pixtral 12B’s capabilities. Pixtral 12B performed well against &lt;strong&gt;GPT-4’s&lt;/strong&gt; captions, showing improved instruction-following abilities, especially when dealing with more specific queries.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Human Captions&lt;/th&gt;
&lt;th&gt;GPT-4 (Captions)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Win Rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;9.1%&lt;/td&gt;
&lt;td&gt;37.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Top-1 Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;88.4%&lt;/td&gt;
&lt;td&gt;95.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Top-5 Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;85.8%&lt;/td&gt;
&lt;td&gt;94.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Top-10 Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;84.4%&lt;/td&gt;
&lt;td&gt;93.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Flickr30K
&lt;/h4&gt;

&lt;p&gt;For &lt;strong&gt;Flickr30K&lt;/strong&gt;, Pixtral 12B’s performance was close to &lt;strong&gt;GPT-4v&lt;/strong&gt;, especially for machine-generated captions, though it scored lower when compared to human captions due to its more concise and objective outputs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Data Captions&lt;/th&gt;
&lt;th&gt;GPT-4v (Captions)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Win Rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;5.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Top-1 Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;33.6%&lt;/td&gt;
&lt;td&gt;98.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Top-5 Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;27.9%&lt;/td&gt;
&lt;td&gt;97.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Top-10 Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;96.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In conclusion, &lt;strong&gt;Pixtral 12B&lt;/strong&gt; proves to be a formidable contender in the VLM space. While it may not fully outshine &lt;strong&gt;GPT-4&lt;/strong&gt; in terms of creative reasoning, its &lt;strong&gt;analytical&lt;/strong&gt; and &lt;strong&gt;cognitive&lt;/strong&gt; capabilities make it a valuable tool for tasks involving structured visual data, like charts, diagrams, and instructional content. It’s faster, cheaper, and more scalable for applications that rely on image-to-text processing.&lt;/p&gt;

&lt;p&gt;As I continue to explore Pixtral 12B and other models, I’ll be sharing code and updates on my &lt;a href="https://github.com/aryankargwal/vlm-benchmarks" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;. If you’re curious about Pixtral’s performance in other benchmarks or know of datasets I should test, feel free to reach out in the comments!&lt;/p&gt;

</description>
      <category>llm</category>
      <category>vlm</category>
      <category>benchmarking</category>
      <category>gpt4</category>
    </item>
    <item>
      <title>SurvBot🎥: Automatic Surveillance Tagging using Moondream and Streamlit</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Tue, 27 Aug 2024 19:29:18 +0000</pubDate>
      <link>https://dev.to/aryankargwal/survbot-automatic-surveillance-tagging-using-moondream-and-streamlit-5f3j</link>
      <guid>https://dev.to/aryankargwal/survbot-automatic-surveillance-tagging-using-moondream-and-streamlit-5f3j</guid>
      <description>&lt;p&gt;&lt;a href="https://youtu.be/9NspeuVio6I" rel="noopener noreferrer"&gt;YOUTUBE VIDEO LIVE NOW🔗&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’m still riding the GenAI train, testing and tweaking new apps using LLMs and VLMs like a mad scientist in a digital lab. My latest experiment? Resurrecting my old projects with some modern, no-nonsense solutions. Enter Moondream 2, the open-source VLM that plays nicely with Streamlit. I put it to work creating an intelligent video tagging system for surveillance because who doesn’t love a bit of AI snooping?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95ivjm4hkfses1iyiatc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95ivjm4hkfses1iyiatc.png" alt=" " width="800" height="504"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ivuxwe2vjkjphkunlje.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ivuxwe2vjkjphkunlje.png" alt=" " width="800" height="893"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flij8to0o2gxrcir3jvsv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flij8to0o2gxrcir3jvsv.png" alt=" " width="788" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In my latest tutorial, I’ll walk you through deploying a VLM locally. No cloud needed; just good old-fashioned DIY. You’ll also get the lowdown on tackling tokenization and the other infernal tasks involved in passing an image through a VLM. Trust me, it’s more fun than it sounds!&lt;/p&gt;
&lt;h2&gt;
  
  
  The Model: Moondream
&lt;/h2&gt;

&lt;p&gt;Moondream is a highly versatile and modular Vision Language Model (VLM) capable of performing various vision-related tasks. From answering questions based on images and detecting objects with bounding boxes to generating accurate image captions, Moondream is designed to deliver reliable results across various applications. It's an advanced tool for developers looking to integrate powerful Vision AI capabilities into their projects.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdffxjiif21annbdjk5a7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdffxjiif21annbdjk5a7.png" alt=" " width="800" height="568"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built to run efficiently across multiple platforms, Moondream stands out as a compact, open-source VLM that combines performance with accessibility. It’s a great choice for developing next-level AI Vision applications without the burden of heavy or complex models such as GPT-4o and Gemma. The Apache 2.0 license also lets us use the model freely for our own use cases.&lt;/p&gt;
&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;Moving to the implementation, we are looking at four major functionalities: loading the VLM, setting up a tokenizer for the logging, extracting frames from the uploaded video, and finally running inference to store the logs in a CSV.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuw665cx14n1jz56s2lrn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuw665cx14n1jz56s2lrn.png" alt=" " width="800" height="126"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are using a Streamlit workflow to set up the application's input and output streams. To see the full implementation, check out the accompanying GitHub repository or the YouTube tutorial linked at the top of this post.&lt;/p&gt;
&lt;h3&gt;
  
  
  Loading Model and Tokenizer
&lt;/h3&gt;

&lt;p&gt;We load the Moondream VLM through Hugging Face's AutoModelForCausalLM. This call downloads all the model weights and caches them in our web application instance, avoiding repeated downloads.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Warning: The Model is Over 2.5 GB, So Mind Your Internet Connection&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Cache the model and tokenizer to avoid downloading them repeatedly
@st.cache_resource
def load_model_and_tokenizer():
    model_id = "vikhyatk/moondream2"
    revision = "2024-07-23"

    model = AutoModelForCausalLM.from_pretrained(
        model_id, trust_remote_code=True, revision=revision,
        torch_dtype=torch.float16).to("cuda")

    tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
    return model, tokenizer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Extracting Frames with Timestamp
&lt;/h3&gt;

&lt;p&gt;The next function we write handles the uploaded CCTV Surveillance Footage, letting us capture frames according to time intervals. This will also help us identify Key Frames later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Function to extract frames from video and their timestamps
def extract_frames_with_timestamps(video_path, interval=0.2):
    cap = cv2.VideoCapture(video_path)
    frames = []
    timestamps = []
    frame_rate = cap.get(cv2.CAP_PROP_FPS)
    # Use an integer frame step so the modulo check below is reliable
    # even when interval * fps is fractional
    step = max(1, int(interval * frame_rate))
    success, image = cap.read()
    count = 0

    while success:
        timestamp_ms = cap.get(cv2.CAP_PROP_POS_MSEC)
        timestamp_sec = timestamp_ms / 1000.0

        if count % step == 0:
            img = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
            frames.append(img)
            timestamps.append(timestamp_sec)

        success, image = cap.read()
        count += 1

    cap.release()

    print(f"Total frames captured: {len(frames)}")
    return frames, timestamps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Frame Inference
&lt;/h3&gt;

&lt;p&gt;Now we pass the frames one by one to the model with the prompt "Describe this image." to get descriptions for the video logs. We also add a simple heuristic to estimate key frames: a frame is flagged whenever its description contains more than five words not seen in the previous frame's description.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Extract frames and timestamps from the video
    frames, timestamps = extract_frames_with_timestamps(video_path, interval=1)  # Extract 1 frame per second

    # Process each frame using the model
    descriptions = []
    prev_description_words = set()
    key_frames = []

    with st.spinner("Processing..."):
        for i, frame in enumerate(frames):
            enc_image = model.encode_image(frame)
            description = model.answer_question(enc_image, "Describe this image.", tokenizer)
            filtered_words = list(filter_description(description))  # Convert to list
# Logic for Key Frames
            new_words = set(filtered_words) - prev_description_words
            if len(new_words) &amp;gt; 5:
                key_frames.append((timestamps[i], frame))

            descriptions.append((timestamps[i], filtered_words))
            prev_description_words = set(filtered_words)  # Ensure it remains a set
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Streamlit Formatting for Images
&lt;/h3&gt;

&lt;p&gt;Finally, for the more keen, here is the code for formatting the displayed frames and keyframes using Streamlit Commands.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; # Display the frames in a grid layout
    num_columns = 3  # Number of columns in the grid
    num_rows = (len(frames) + num_columns - 1) // num_columns  # Calculate number of rows needed

    for row in range(num_rows):
        cols = st.columns(num_columns)
        for col in range(num_columns):
            index = row * num_columns + col
            if index &amp;lt; len(frames):
                frame = frames[index]
                cols[col].image(frame, caption=f"Frame {index + 1} at {timestamps[index]:.2f}s")

     # Display key frames in a grid layout
    if key_frames:
        st.write("Key Frames:")
        num_columns_key_frames = 3  # Number of columns for key frames grid
        num_rows_key_frames = (len(key_frames) + num_columns_key_frames - 1) // num_columns_key_frames  # Calculate number of rows needed

        for row in range(num_rows_key_frames):
            cols = st.columns(num_columns_key_frames)
            for col in range(num_columns_key_frames):
                index = row * num_columns_key_frames + col
                if index &amp;lt; len(key_frames):
                    timestamp, frame = key_frames[index]
                    cols[col].image(frame, caption=f"Key Frame {index + 1} at {timestamp:.2f}s")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So there you have it, a crash course in making your old projects feel new again with a bit of VLM magic. Whether you want to impress your boss or geek out over some next-level AI, Moondream 2 has your back. If you’re anything like me, you’ll probably wonder why you didn’t do this sooner. Go forth and tag those videos like a pro!&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>computervision</category>
    </item>
    <item>
      <title>Boss Llama: Building a Smart Interview Simulator using Llama 3.1 70B</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Thu, 22 Aug 2024 17:22:20 +0000</pubDate>
      <link>https://dev.to/tunehqai/boss-llama-building-a-smart-interview-simulator-using-llama-31-70b-egi</link>
      <guid>https://dev.to/tunehqai/boss-llama-building-a-smart-interview-simulator-using-llama-31-70b-egi</guid>
      <description>&lt;p&gt;Moving a bit further from writing and exploring the world of LLMs as just a consumer of the vast knowledge and context awareness, I took it upon myself to build a product out of these tools. Boss Llama is an interactive intelligent interview simulator on Meta's &lt;a href="https://llama.meta.com/" rel="noopener noreferrer"&gt;Llama 3.1 70B&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Although not a novel implementation, I hope my tutorial helps you understand the various API calls and functions involved in processing chat, inputs, and data files in a Streamlit Web Application using &lt;a href="https://studio.tune.app/playground" rel="noopener noreferrer"&gt;Tune Studio&lt;/a&gt; to deploy our model and set API gateways for inference. So, let's look at how you can replicate the same results.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Model: Llama 3.1 70B
&lt;/h2&gt;

&lt;p&gt;There are good reasons why Meta's open-source Llama 3.1 70B became my choice for the product. It expands the context window of its predecessor, Llama 3 70B, from 8K to 128K tokens, and doubles the output-token limit from 2048 to 4096.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm08cwjym5wvx23vu0sy1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm08cwjym5wvx23vu0sy1.png" alt="Quality Comparison" width="486" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tvnuevo5di4vgnnx9ad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tvnuevo5di4vgnnx9ad.png" alt="Speed Comparison" width="481" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx5yf5lstktb3mwc5zrz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx5yf5lstktb3mwc5zrz.png" alt="Price Comparison" width="483" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analysis taken from &lt;a href="https://artificialanalysis.ai/models/llama-3-1-instruct-70b" rel="noopener noreferrer"&gt;Artificial Analysis&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Earlier, I had tried a locally hosted Llama 3.1 8B for the task, but it lacks the context window this use case demands: interviews typically involve extended exchanges averaging 1,500 to 2,000 words, or 2,000+ tokens. So let us now check out how you can build such an application yourself.&lt;/p&gt;
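&lt;p&gt;A quick back-of-the-envelope check of that sizing, assuming the common rule of thumb of roughly 1.3 tokens per English word (the exact ratio depends on the tokenizer):&lt;/p&gt;

```python
def estimate_tokens(word_count, tokens_per_word=1.3):
    # Rough heuristic: English prose averages about 1.3 tokens per word
    # with Llama-style tokenizers; actual counts vary with the text
    return int(word_count * tokens_per_word)
```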

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;Regarding the implementation, we have chosen &lt;a href="https://streamlit.io/" rel="noopener noreferrer"&gt;Streamlit&lt;/a&gt; as the base of our operations, tying the outputs of the API calls to a chat interface. Since the larger model demands more VRAM than a typical local setup offers, I offload inference to Tune Studio's API.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq64wwop4t1chbb9ftm6u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq64wwop4t1chbb9ftm6u.png" alt="Boss Llama UI" width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While discussing the code, however, we will skip over the Streamlit part and focus on its API aspect. If you wish to see how I implement that, check out my video tutorial on &lt;a href="https://youtu.be/cAHipN7CQwE?si=UBm2-Avyy4Bpx8E8" rel="noopener noreferrer"&gt;YouTube here&lt;/a&gt; or head over to the &lt;a href="https://github.com/aryankargwal/boss_llama" rel="noopener noreferrer"&gt;Github Repository here&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;For the upcoming code, here are some essential variables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Conversation: A session state variable holding all the conversation exchanges between the bot and the user.&lt;/li&gt;
&lt;li&gt;Difficulty: The difficulty of the simulated interview&lt;/li&gt;
&lt;li&gt;API Key: Your Tune Studio API Key&lt;/li&gt;
&lt;li&gt;Max Questions: Number of Questions in the Interview&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  System Prompt
&lt;/h3&gt;

&lt;p&gt;Finetuning is the best way to get a model to perform exactly how you want it to; well, I went for the next best thing: a thoughtful and thorough system prompt. While choosing the system prompt, we should be detailed with our requirements, as the model tends to meander and hallucinate if such instructions are not given.&lt;/p&gt;

&lt;p&gt;The latest adversarial training on the modern Llama models also allows us to pass such a system prompt, avoiding any prompt leakage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are Boss Llama, an advanced AI interviewer. Your goal is to conduct a comprehensive and intelligent interview with the candidate based on the following details:

1. Job Description: {st.session_state.job_description}
2. Difficulty Level: {difficulty}
3. Number of Questions: {st.session_state.max_questions}

Instructions:
1. Start the Interview:
   - Begin by presenting a detailed job description for the role based on the difficulty level and the provided job description. Try to keep this introduction small and to the point as the user already knows what they are interviewing for.
   - Provide a warm welcome message to start the interview and set a positive tone.
2. Generate and Ask Questions:
   - Ask a series of questions, up to the specified number of questions. Ensure these questions are relevant to the job description and appropriately challenging according to the difficulty level.
   - Provide clear and concise prompts that assess the candidate's skills, knowledge, and fit for the role.

3. Conduct the Interview:
  - Engage with the candidate in a conversational manner. If the candidate's responses are vague or insufficient, let them know about it and give them a chance to improve, but count it as one more question.
   - Maintain a professional and supportive tone throughout the interview.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Generating Responses
&lt;/h3&gt;

&lt;p&gt;We will generate the responses using an HTTP request to Tune Studio's chat-completions endpoint (the Python equivalent of the curl command Tune Studio provides). This request links your working code to a model of your choice on Tune Studio, which hosts a large library of free models for inference, plus more advanced models with custom fine-tuning and deployment options for hard-core enthusiasts.&lt;/p&gt;

&lt;p&gt;The variable "conversation," which incrementally holds the ongoing conversation, is called every time to create a response that adds to the existing discussion.&lt;/p&gt;
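&lt;p&gt;Concretely, the conversation variable follows the OpenAI-style message format, with each turn appended so that every API call sees the full interview history. A minimal sketch (the Streamlit session-state wiring is omitted, and the helper name is illustrative):&lt;/p&gt;

```python
# The conversation starts with the interviewer system prompt
conversation = [{"role": "system", "content": "You are Boss Llama, an AI interviewer."}]

def add_exchange(history, user_text, assistant_text):
    # Append the candidate's answer and the model's reply in order,
    # preserving the alternating role structure the API expects
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": assistant_text})
    return history
```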

&lt;p&gt;With parameters such as temperature, frequency_penalty, and max_tokens, we can also tweak the quality of responses, further enhancing the feeling of being interviewed by a real interviewer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Function to call the API
def generate_response(conversation, apikey):
    url = "https://proxy.tune.app/chat/completions"
    headers = {
        "Authorization": apikey,  # Your API key
        "Content-Type": "application/json"
    }

    # Construct the payload for the API call
    payload = {
        "temperature": 0.9,
        "messages": conversation,
        "model": "meta/llama-3.1-70b-instruct",
        "stream": False,
        "frequency_penalty": 0.2,
        "max_tokens": 500
    }

    # Send the POST request to the API
    response = requests.post(url, headers=headers, data=json.dumps(payload))

    # Check if the request was successful
    if response.status_code == 200:
        # Extract the response from the JSON output
        return response.json()["choices"][0]["message"]["content"]
    else:
        return f"Error: {response.status_code} - {response.text}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Generate Evaluations
&lt;/h3&gt;

&lt;p&gt;For the evaluations, we are using a similar API call. We only pass the individual exchanges from the bot and the user to run an assessment using a suitable system prompt.&lt;/p&gt;

&lt;p&gt;This second call plays the interviewer's stricter side: it reviews each exchange from a third-person perspective and returns feedback and a score that feed back into the web application.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Function to generate evaluations on the interview
def generate_evaluation(question, answer, difficulty, apikey):
    url = "https://proxy.tune.app/chat/completions"
    headers = {
        "Authorization": apikey,
        "Content-Type": "application/json"
    }

    payload = {
        "temperature": 0.7,
        "messages": [
            {"role": "system", "content": f"Evaluate the following answer based on the job description difficulty level: {difficulty}."},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"}
        ],
        "model": "meta/llama-3.1-70b-instruct",
        "stream": False,
        "frequency_penalty": 0.2,
        "max_tokens": 500
    }

    try:
        response = requests.post(url, headers=headers, data=json.dumps(payload))
        response.raise_for_status()
        result = response.json()
        feedback = result.get("choices", [{}])[0].get("message", {}).get("content", "No feedback provided")
        # Note: chat-completion responses do not normally include a numeric
        # "score" field, so this falls back to 0 unless the API returns one;
        # asking the model for a score in the prompt and parsing it out of
        # the feedback text is the more reliable route.
        score = result.get("choices", [{}])[0].get("score", 0)
        return feedback, score
    except requests.RequestException as e:
        return f"Error: {e}", 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
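&lt;p&gt;A hypothetical bit of glue code (the names exchanges and evaluations are assumptions, and generate_evaluation is stubbed so the sketch runs on its own) shows how each exchange's feedback and score can be collected into the record format the PDF report expects:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical glue: collect each exchange's evaluation into the
# record format the PDF report consumes. generate_evaluation is
# stubbed here so the sketch runs standalone.
def generate_evaluation(question, answer, difficulty, apikey):
    return f"Solid answer for a {difficulty} role.", 8  # stub

exchanges = [("What is a list comprehension?",
              "A concise way to build lists from iterables.")]

evaluations = []
for question, answer in exchanges:
    feedback, score = generate_evaluation(question, answer, "medium", "key")
    evaluations.append({"Question": question, "Answer": answer,
                        "Feedback": feedback, "Score": score})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;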



&lt;h3&gt;
  
  
  Download PDF Report
&lt;/h3&gt;

&lt;p&gt;Finally, what good is an evaluation you cannot take with you? The last function links the outputs of the evaluation function to FPDF to generate a clean, downloadable PDF report.&lt;/p&gt;

&lt;p&gt;What is FPDF, you ask? FPDF, originally a PHP class, is a library for PDF document generation in Python. Compared to other options available online, FPDF provides a more streamlined and straightforward way to generate PDFs. (It also lets us add PNGs, JPEGs, and GIFs to the PDF, which would be a boon if we ever want to include diagrams in the report.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Function to generate a PDF report
def generate_pdf_report(evaluations):
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Arial", size=12)

    pdf.cell(0, 10, "Interview Evaluation Report", ln=True, align="C")
    pdf.ln(10)  # Add a line break

    for evaluation in evaluations:
        pdf.set_font("Arial", style='B', size=12)
        pdf.multi_cell(0, 10, evaluation["Question"])
        pdf.set_font("Arial", size=12)
        pdf.multi_cell(0, 10, evaluation["Answer"])
        pdf.multi_cell(0, 10, f"Feedback: {evaluation['Feedback']}")
        pdf.multi_cell(0, 10, f"Score: {evaluation['Score']}")
        pdf.ln(5)  # Add a line break

    # Save the PDF to a temporary file
    temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".pdf")
    pdf.output(temp_file.name)
    return temp_file.name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Here, we saw an implementation of Llama 3.1 70B in a Streamlit web application that simulates interviews through a chat interface. Although the final product lacks accessibility features such as TTS or STT, it shows how capably even a model that is not fine-tuned can operate on system prompts alone.&lt;/p&gt;

</description>
      <category>streamlit</category>
      <category>tunestudio</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>LMQL, AAAL Pt.6</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Fri, 02 Aug 2024 04:00:00 +0000</pubDate>
      <link>https://dev.to/tunehqai/lmql-aaal-pt6-22ib</link>
      <guid>https://dev.to/tunehqai/lmql-aaal-pt6-22ib</guid>
      <description>&lt;p&gt;In my journey to enhance adversarial robustness in LLMs, I explored LMQL (Language Model Query Language). This tool is a programming language that allows seamless integration of LLM interaction into program code, providing a structured way to manage model inputs and outputs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmk2snvij63y5rq44xus.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmk2snvij63y5rq44xus.png" alt=" " width="620" height="544"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LMQL stands out by enabling developers to specify constraints and rules directly within their code. This feature is crucial for preventing adversarial attacks such as prompt injection and token manipulation. By defining strict constraints, developers can ensure that the model processes only valid and safe inputs, reducing the risk of malicious manipulations.&lt;/p&gt;
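&lt;p&gt;As an illustrative, pseudocode-style sketch of LMQL's query syntax (the model name and constraints are chosen for the example and not tested against a live backend), a query can cap and terminate the model's output declaratively in its where clause:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch of an LMQL query: the where clause constrains decoding itself,
# so over-long or runaway completions are rejected at generation time.
argmax
    "Q: {question}\n"
    "A: [ANSWER]"
from
    "openai/gpt-3.5-turbo"
where
    len(TOKENS(ANSWER)) &lt; 100 and STOPS_AT(ANSWER, "\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;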

&lt;p&gt;Additionally, LMQL supports dynamic control over model interactions. Developers can programmatically adjust the model’s behavior based on real-time input validation and monitoring. This flexibility allows for quick responses to potential adversarial attacks, enhancing the overall security of the LLM.&lt;/p&gt;

&lt;p&gt;Another advantage of LMQL is its ability to integrate with existing guardrail tools. For example, combining LMQL with Llama Guard or Nvidia NeMo Guardrails can create a multi-layered defense system. This integration allows for more robust input validation, ethical content generation, and comprehensive logging and monitoring.&lt;/p&gt;

&lt;p&gt;LMQL also facilitates better transparency and explainability. By embedding model interactions within the code, developers can easily trace and audit the model’s decision-making process. This transparency is vital for identifying and mitigating adversarial attacks, ensuring the model’s outputs are trustworthy and reliable.&lt;/p&gt;

&lt;p&gt;In conclusion, LMQL offers a powerful and flexible way to harden LLMs. Its support for declarative constraints, dynamic control, and integration with existing guardrail tools makes it a valuable addition to any adversarial robustness strategy. Stay tuned for more insights into practical implementations of these tools in real-world applications.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>openai</category>
      <category>developer</category>
      <category>cybersecurity</category>
    </item>
  </channel>
</rss>
