<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Aryan Kargwal</title>
    <description>The latest articles on DEV Community by Aryan Kargwal (@aryankargwal).</description>
    <link>https://dev.to/aryankargwal</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F917475%2F82a15bfc-f7ee-407e-a29a-24f9a080c744.jpg</url>
      <title>DEV Community: Aryan Kargwal</title>
      <link>https://dev.to/aryankargwal</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/aryankargwal"/>
    <language>en</language>
    <item>
      <title>Top 7 AI Agent Supervision Platforms in 2025</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Mon, 27 Oct 2025 20:05:47 +0000</pubDate>
      <link>https://dev.to/aryankargwal/top-ai-7-agent-supervision-platforms-in-2025-2767</link>
      <guid>https://dev.to/aryankargwal/top-ai-7-agent-supervision-platforms-in-2025-2767</guid>
      <description>&lt;p&gt;I’ve spent time on both sides of AI. Hacking together small local demos to test the latest models. And helping enterprises figure out how to put agents into production. One lesson I learnt is that supervision is what keeps things from breaking.&lt;/p&gt;

&lt;p&gt;When you train a model, you expect failures. Runs can stall. Loss curves drift. A mislabeled field can throw the whole pipeline off. You watch it closely because you know unchecked training goes sideways fast.&lt;/p&gt;

&lt;p&gt;Now imagine that same volatility playing out in production. An AI agent running live, talking to customers, acting in workflows. Without supervision, it’s a system learning in public with no one in control.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is AI agent supervision?
&lt;/h2&gt;

&lt;p&gt;AI agent supervision is the practice of watching over and guiding autonomous AI systems as they work. Instead of leaving agents to run unchecked, supervision gives people visibility into what they’re doing, a way to measure if they’re getting things right, and controls to step in when they go off track.&lt;/p&gt;

&lt;p&gt;It’s less about the technology itself and more about the human role of keeping these systems accountable. As agents become part of real workflows — answering customers, running ad campaigns, drafting code, moving money — supervision is what makes them trustworthy.&lt;/p&gt;

&lt;p&gt;AI supervision connects the day-to-day details (logs, dashboards, feedback) with the bigger picture of regulatory, ethical, and safety compliance inside organizations.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does supervision work in AI agents?
&lt;/h2&gt;

&lt;p&gt;Supervision means every agent run is traceable, controllable, and improvable. In practice: you can see the steps an agent took, you can block or reroute unsafe actions while it’s running, and you have a regular way to learn from outcomes so the next run is better.&lt;/p&gt;

&lt;h3&gt;
  
  
  Observability turns the invisible visible
&lt;/h3&gt;

&lt;p&gt;The core problem with agents is opacity. Observability fixes that by recording the full path of a run: retrieval results, tool calls with inputs and outputs, model tokens used, latency by step, and the final outcome. With this trail you can replay a run, compare a good path versus a bad one, and connect answers back to sources.&lt;/p&gt;

&lt;p&gt;Two capabilities make this useful day-to-day:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Trace &amp;amp; replay.&lt;/strong&gt; Open any run and see each step in order. Pinpoint where a wrong source slipped in or where a tool returned an error.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Attribution.&lt;/strong&gt; Link the final answer to specific documents, queries, or tools. If a claim can’t be traced, treat it as ungrounded and fix the routing or retrieval.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
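
&lt;p&gt;As a rough sketch of what such a trail looks like, here is a minimal, hypothetical trace recorder in Python. The step kinds and tool names are invented for illustration; real platforms persist these events and layer search, diffing, and attribution on top.&lt;/p&gt;

```python
import json
import time

class RunTrace:
    """Record every step of an agent run so it can be replayed later."""

    def __init__(self, run_id):
        self.run_id = run_id
        self.steps = []

    def record(self, kind, name, payload, outcome):
        # kind is the step type ("retrieval", "tool_call", "llm", ...);
        # payload and outcome must be JSON-serializable so runs can be stored.
        self.steps.append({
            "ts": time.time(),
            "kind": kind,
            "name": name,
            "input": payload,
            "output": outcome,
        })

    def replay(self):
        # Walk the run in order, e.g. to spot where a bad source slipped in.
        for i, step in enumerate(self.steps):
            print(f"{i}: {step['kind']}/{step['name']} -> {json.dumps(step['output'])[:80]}")

trace = RunTrace("run-001")
trace.record("retrieval", "kb_search", {"query": "refund policy"}, {"doc": "policy_v2.md"})
trace.record("tool_call", "send_email", {"to": "user@example.com"}, {"status": "sent"})
trace.replay()
```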

&lt;h3&gt;
  
  
  Real-time alerts and guardrails keep workflows safe
&lt;/h3&gt;

&lt;p&gt;Think of supervision here the same way you’d think about monitoring in DevOps. In production systems, you don’t wait for a server to crash before reacting — you set alerts for CPU spikes or failed health checks.&lt;/p&gt;

&lt;p&gt;AI agents need the same treatment. Alerts flag unusual behavior the moment it happens, and guardrails are the automated rules that stop agents from doing something out of bounds, like calling the wrong API or overspending on tokens.&lt;/p&gt;

&lt;p&gt;Here is what that looks like in practice &lt;strong&gt;in AI agent supervision:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Alerting when an agent loop runs too long:&lt;/strong&gt; PagerDuty-style ping when an agent retries a tool more than 20 times.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Guardrail blocking unsafe tool calls:&lt;/strong&gt; Block any payment API call that doesn’t include a valid deal ID.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Token or API spend limits:&lt;/strong&gt; Automatically kill a run if it burns more than $5 worth of tokens in a single request.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Safety or policy checks at runtime:&lt;/strong&gt; Flag outputs if the generated text violates brand or safety filters.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The point is to give agents the same resilient safety net that DevOps teams already rely on, where alerts flag trouble as it starts and guardrails automatically contain the impact while still allowing human oversight when it matters.&lt;/p&gt;
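
&lt;p&gt;Stripped to its core, that safety net is just checks wrapped around every tool call and model request. A minimal, hypothetical sketch (the tool names and limits are invented; real guardrail layers are configurable and far more robust):&lt;/p&gt;

```python
class GuardrailViolation(Exception):
    """Raised when a run crosses an alert threshold or breaks a rule."""

class SupervisedRun:
    """Toy runtime guardrails: a retry cap, a spend budget, and a payment check."""

    MAX_ATTEMPTS = 20      # alert threshold from the example above
    MAX_SPEND_USD = 5.00   # kill the run past this token budget

    def __init__(self):
        self.attempts = {}
        self.spend_usd = 0.0

    def charge(self, usd):
        # Called by the token-metering layer after each model request.
        self.spend_usd += usd
        if self.spend_usd > self.MAX_SPEND_USD:
            raise GuardrailViolation("token budget exceeded; run killed")

    def call_tool(self, name, payload):
        self.attempts[name] = self.attempts.get(name, 0) + 1
        if self.attempts[name] > self.MAX_ATTEMPTS:
            raise GuardrailViolation(f"{name} looping; paging on-call")
        if name == "payments.charge" and not payload.get("deal_id"):
            # Guardrail: block payment calls that lack a valid deal ID.
            raise GuardrailViolation("payment call missing deal_id; blocked")
        return {"status": "ok"}

run = SupervisedRun()
print(run.call_tool("payments.charge", {"deal_id": "D-42", "amount": 10}))  # prints {'status': 'ok'}
```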

&lt;h3&gt;
  
  
  Performance reviews drive continuous improvement
&lt;/h3&gt;

&lt;p&gt;Finally, even when agents run smoothly, they can lose accuracy or drift away from business goals over time. That’s why supervision borrows from the workplace: hold reviews, look back at performance, assess decision-making, and decide what needs to change.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9olguocnu12qxmqytw11.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9olguocnu12qxmqytw11.png" alt=" " width="686" height="348"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Platforms like Wayfound extend the idea of employee performance reviews to agents, using session recordings and interaction data to spot recurring gaps such as failed actions or knowledge blind spots, then suggesting targeted improvements. The difference is simply that the “employee” under review is now an AI agent.&lt;/p&gt;

&lt;p&gt;For teams that want more technical control, open-source frameworks such as OpenAI’s Evals or Arize Phoenix offer benchmarking and trace replay. Those tools require more engineering effort but allow precise measurement and fine-tuned experimentation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Benefits of Supervising AI Workflows
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Reliability and low risk for production-scale agents
&lt;/h3&gt;

&lt;p&gt;Reliability is the first promise of supervision. In DevOps, teams earn stability with CI/CD, monitoring, and paging. The same people now run intelligent workflows, so the instincts carry over: surface problems early and design for recovery. &lt;/p&gt;

&lt;p&gt;Concretely, carry over the discipline you already trust:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Versioning and rollback&lt;/strong&gt; so bad changes don’t linger.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Automated checks in CI&lt;/strong&gt; to stop regressions before release.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Monitoring and paging&lt;/strong&gt; on user-visible symptoms, not just internals.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Runbooks and incident response&lt;/strong&gt; to shorten time to mitigation.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Deployed ML systems do degrade without oversight. That’s drift: training and real-world data diverge, and quality falls unless you watch and adapt. CMU’s Software Engineering Institute calls out this decay in production and focuses on practical drift detection.&lt;/p&gt;

&lt;p&gt;Recent industry observations underline the need for ongoing performance monitoring and show that detection effectiveness depends on data conditions, not on blind faith in a single metric. Supervision brings the loop you need: recording runs, replay for diagnosis, alerts when behavior veers, and policy checks aligned with governance guidance.&lt;/p&gt;
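
&lt;p&gt;The underlying drift check can be sketched in a few lines: compare a live window of some per-run metric against a reference window and alert on a large shift. The numbers below are invented for illustration.&lt;/p&gt;

```python
import statistics

def drift_score(reference, live):
    """Standardized shift of a monitored metric versus a healthy reference window.

    Deliberately crude: production systems use PSI, KS tests, or learned
    detectors, but the loop is the same: compare live data against a baseline.
    """
    ref_mean = statistics.mean(reference)
    ref_sd = statistics.stdev(reference) or 1e-9  # guard against a flat baseline
    return abs(statistics.mean(live) - ref_mean) / ref_sd

THRESHOLD = 2.0  # alert when the live window shifts by 2+ reference std-devs

# e.g. answer length per run, sampled from a known-good week vs. today
reference = [120, 130, 125, 118, 132, 127, 121, 129]
stable = [124, 126, 122, 131]
drifted = [210, 240, 198, 225]  # the agent suddenly rambling

print(round(drift_score(reference, stable), 2))   # small: no alert
print(round(drift_score(reference, drifted), 2))  # large: page someone
assert drift_score(reference, drifted) > THRESHOLD
```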

&lt;h3&gt;
  
  
  Faster iteration and shorter deployment cycles
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://www.mckinsey.com/capabilities/quantumblack/our-insights/the-state-of-ai" rel="noopener noreferrer"&gt;McKinsey’s &lt;em&gt;State of AI 2025&lt;/em&gt;&lt;/a&gt; shows that companies with structured oversight scale AI use cases faster and with fewer blockers. The reason is simple: when you can see problems clearly, you don’t waste weeks firefighting.&lt;/p&gt;

&lt;p&gt;For developers, the pain points look familiar. A bad data source throws outputs off. Debugging takes days because you can’t replay the run. Testing agents in production feels like testing code directly in prod — unstable and slow.&lt;/p&gt;

&lt;p&gt;This is where AI &lt;strong&gt;agent supervision platforms&lt;/strong&gt; really shine over traditional indexing and LLM libraries. Instead of stitching together frameworks like LangChain and writing custom tracing code, you get these capabilities out of the box. Supervisor-style setups automatically cluster recurring failure patterns and knowledge gaps across sessions.&lt;/p&gt;

&lt;p&gt;When a run fails, you can replay it and watch the exact path the agent took. That makes it obvious where the wrong source crept in. Over time, those replays reveal patterns — gaps in knowledge, broken tools — that you’d never catch skimming raw logs.&lt;/p&gt;
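
&lt;p&gt;In its simplest form, that pattern-finding is just counting where failed runs died. A toy sketch with invented run records:&lt;/p&gt;

```python
from collections import Counter

# Hypothetical replay records: each failed run tagged with the step that broke it.
failed_runs = [
    {"run_id": "r1", "failed_step": "kb_search", "error": "stale index"},
    {"run_id": "r2", "failed_step": "kb_search", "error": "stale index"},
    {"run_id": "r3", "failed_step": "crm.update", "error": "timeout"},
    {"run_id": "r4", "failed_step": "kb_search", "error": "no results"},
]

# Cluster by failing step: a recurring cluster points at a broken tool
# or a knowledge gap that skimming raw logs would likely miss.
hotspots = Counter(run["failed_step"] for run in failed_runs)
print(hotspots.most_common())  # [('kb_search', 3), ('crm.update', 1)]
```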

&lt;h3&gt;
  
  
  Governance and accountability for enterprises
&lt;/h3&gt;

&lt;p&gt;Gartner frames the solution as “&lt;a href="https://www.gartner.com/en/articles/guardian-agents" rel="noopener noreferrer"&gt;The Rise of Guardian Agents&lt;/a&gt;”: AI designed to watch over other AI. Early forms check quality and enforce policy; more mature forms can observe processes in real time; the end state is active protection, where unsafe actions are blocked before they reach customers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpuq4h7p0jphfqykeonx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgpuq4h7p0jphfqykeonx.png" alt=" " width="543" height="140"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These functions don’t replace governance teams, but they give them something they’ve lacked until now: continuous visibility and enforcement mechanisms that keep pace with rapid deployment.&lt;/p&gt;

&lt;p&gt;Guardian agents are emerging as the operational mechanisms that keep governance from falling behind innovation. They turn compliance from a periodic audit into an always-on layer of accountability.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the current AI policies and standards enterprises must follow?
&lt;/h2&gt;

&lt;p&gt;Technical fixes alone won’t make agents trustworthy. Enterprises also need to track the laws, standards, and frameworks that define what “responsible AI” actually means. Policy is moving quickly; at the same time, global standards bodies have begun publishing management-system standards, creating a shared language for how organizations prove oversight.&lt;/p&gt;

&lt;p&gt;What matters is evidence: logs, streams, audits, and human-in-the-loop controls that stand up to regulators and standards bodies. Without that, governance risks getting outpaced by innovation — and once that gap opens, trust is almost impossible to recover.&lt;/p&gt;

&lt;p&gt;Here are the most relevant policies and standards to watch right now:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn19ious53djzwe6pf4vd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fn19ious53djzwe6pf4vd.png" alt=" " width="652" height="455"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The frameworks above set rules for consent, data minimization, and grievance redress; implementation is still ongoing in several jurisdictions.&lt;/p&gt;

&lt;p&gt;Beyond these, keep an eye on policymakers around the world: the OECD, UNESCO, the U.S. AI Safety Institute, the EU AI Office, and the UK AI Safety Institute are all shaping rules that could soon harden into law.&lt;/p&gt;

&lt;h2&gt;
  
  
  Top AI Agent Supervision Platforms
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.wayfound.ai/" rel="noopener noreferrer"&gt;Wayfound&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faym2bafn2ao0mj3wjjsg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faym2bafn2ao0mj3wjjsg.png" alt=" " width="800" height="525"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Enterprise leaders who need business-friendly oversight of AI agents across departments, with seamless integration into existing workflows.&lt;/p&gt;

&lt;p&gt;Wayfound positions itself as the first proactive AI Agent Supervisor — what Gartner calls a &lt;em&gt;Guardian Agent&lt;/em&gt;. It combines LLM-as-a-judge reasoning with an organizational lens, treating supervision as management rather than technical oversight. The platform turns abstract observability into a process that mirrors how teams set goals, review performance, and adapt over time.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://www.wayfound.ai/pages/agentic-workflows" rel="noopener noreferrer"&gt;dashboard&lt;/a&gt; puts business users in direct control. They can define roles, objectives, and evaluation rules aligned with business metrics — then assess agent behavior without relying on engineering. Oversight becomes continuous and hands-on, allowing decisions to stay close to outcomes rather than filtered through technical mediation.&lt;/p&gt;

&lt;p&gt;At the core of this loop is Wayfound’s &lt;a href="https://www.wayfound.ai/post/self-improving-ai-agents-are-here-with-wayfound-mcp" rel="noopener noreferrer"&gt;Model Context Protocol (MCP) integration&lt;/a&gt;. It lets agents query Wayfound during execution to verify actions, follow new guidelines, and apply lessons from previous runs. When goals or policies change, updates apply instantly, turning feedback into live iteration rather than a post-deployment task.&lt;/p&gt;

&lt;p&gt;For developers, MCP provides automatic instrumentation and clear visibility into behavior. Wayfound captures traces, errors, and decision paths, surfacing practical summaries instead of raw logs. It shows exactly where things went wrong and what adjustments would make the next run better.&lt;/p&gt;

&lt;p&gt;Together, these layers form a self-improving supervision cycle that works across CRMs, analytics tools, and agent frameworks. Wayfound becomes the shared space where business and engineering converge, ensuring agents stay aligned and continuously improving in real time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Supervisor dashboard with ability to write natural language custom evaluations and business-friendly observability and review&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Real-time alerts that highlight risks and optimization opportunities&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Agent improvement suggestions that can be implemented seamlessly through MCP &lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Easy integration for any AI Agent via SDK, APIs, MCP as well as native integrations with Salesforce Agentforce&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Enterprise contracts with deployment support; pricing on request.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://docs.langchain.com/langsmith/home" rel="noopener noreferrer"&gt;LangSmith&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhlg8iinwsnj8f5c410yc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhlg8iinwsnj8f5c410yc.png" alt=" " width="800" height="418"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Engineering teams that need to replay agent runs, debug failures, and fine-tune prompts in detail.&lt;/p&gt;

&lt;p&gt;LangSmith grew out of &lt;a href="https://www.langchain.com/pricing" rel="noopener noreferrer"&gt;LangChain&lt;/a&gt;, a more focused attempt to turn open-source building blocks into a systematic debugging layer. Its biggest strength is &lt;a href="https://docs.langchain.com/langsmith/evaluation-concepts" rel="noopener noreferrer"&gt;replay and trace&lt;/a&gt;. Developers can walk back through every intermediate step, surfacing exactly where logic went wrong.&lt;/p&gt;

&lt;p&gt;It also doubles as a testing platform. By fixing datasets of sample prompts and replaying them against updated models, teams can measure regression and confirm whether changes improve results. That structured QA approach is something observability tools alone don’t provide.&lt;/p&gt;
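
&lt;p&gt;The shape of that structured QA is easy to see in plain Python. This is not LangSmith’s actual API, just a minimal sketch of fixed-dataset regression testing with invented prompts and checks:&lt;/p&gt;

```python
def evaluate(agent, dataset):
    """Replay a fixed dataset against an agent and return its pass rate."""
    passed = sum(1 for case in dataset if case["check"](agent(case["prompt"])))
    return passed / len(dataset)

# Toy "agent versions": the new one handles refund questions, the old one punts.
old_agent = lambda prompt: "I can't help with that."
new_agent = lambda prompt: ("Refunds are processed within 5 days."
                            if "refund" in prompt else "ok")

dataset = [
    {"prompt": "How do refunds work?", "check": lambda out: "refund" in out.lower()},
    {"prompt": "Say ok", "check": lambda out: out == "ok"},
]

print(evaluate(old_agent, dataset))  # 0.0
print(evaluate(new_agent, dataset))  # 1.0
```

Running the same dataset after every prompt or model change turns "did it get better?" into a number you can gate releases on.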

&lt;p&gt;The flip side is that LangSmith is unapologetically developer-centric. Non-technical stakeholders won’t find much value in JSON replays or raw traces. Without engineering commitment, it risks being shelfware, since the interface assumes technical comfort from its users.&lt;/p&gt;

&lt;p&gt;Another limitation is scope. LangSmith excels at debugging single agents but &lt;a href="https://www.reddit.com/r/LangChain/comments/1khjrnx/langsmith_has_been_great_but_starting_to_feel/" rel="noopener noreferrer"&gt;lacks governance and policy enforcement features&lt;/a&gt;. Enterprises looking for guardrails or executive dashboards usually need to pair it with broader supervision platforms to achieve complete oversight.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Full traces of inputs, outputs, and tool calls&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Dataset management for structured evaluation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Replay environment to walk through reasoning&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Built-in hooks for automated testing&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free tier available; usage-based plans scale with volume.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.lakera.ai/lakera-guard" rel="noopener noreferrer"&gt;Lakera Guard&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzm4ma6ht9zz94pap5g9j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzm4ma6ht9zz94pap5g9j.png" alt=" " width="800" height="260"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that run agents in production and worry about jailbreaks, data leaks, or reckless tool use.&lt;/p&gt;

&lt;p&gt;Lakera Guard’s strongest play is ease of adoption. In &lt;a href="https://www.reddit.com/r/LocalLLaMA/comments/1m22w76/securing_ai_agents_with_honeypots_catch_prompt/" rel="noopener noreferrer"&gt;this Reddit thread&lt;/a&gt;, developers described it as a “drop-in proxy” — you just point your agent traffic to Lakera’s endpoint and instantly add a layer of injection filtering. That simplicity is a huge draw.&lt;/p&gt;

&lt;p&gt;At runtime, it can block jailbreak-style inputs and protect sensitive functions. One engineer noted that Lakera “catches weird edge cases” where users try to exfiltrate hidden prompts. That’s real production defense, deployable out of the box with minimal effort.&lt;/p&gt;

&lt;p&gt;But overhead comes with it. As another user put it: &lt;em&gt;“nice to have a drop-in solution — not so nice to have additional wait-steps in a large, branched agentic loop.”&lt;/em&gt; False positives and latency can make guardrails feel heavy in complex pipelines.&lt;/p&gt;

&lt;p&gt;Lakera also isn’t a full observability tool. It keeps you safe in the moment but won’t give dashboards or long-term metrics. Most teams pair it with LangSmith for debugging or Wayfound for governance so they’re covered across the full supervision stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Real-time prompt injection detection&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Guardrails for sensitive tool and API calls&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Drop-in proxy endpoints for LLM requests&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Filters for unsafe or policy-violating outputs&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Free developer tier, with custom pricing for enterprise deployments.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://coralogix.com/" rel="noopener noreferrer"&gt;Coralogix&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzf8hatf1ku8equ5tk40w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzf8hatf1ku8equ5tk40w.png" alt=" " width="800" height="387"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Engineering and data teams that need unified observability for infrastructure and AI agents, with context awareness, compliance tracking, and drift detection.&lt;/p&gt;

&lt;p&gt;Coralogix began as a log analytics platform and gradually evolved into a full AI observability layer. The acquisition of Aporia added model telemetry and runtime guardrails, giving teams a single console for monitoring both agent behavior and system performance.&lt;/p&gt;

&lt;p&gt;The AI Center dashboard tracks drift, latency, and cost using the same ingestion layer that powers traditional log pipelines. Each inference can be traced from API call to output without manual tracing or separate monitoring scripts.&lt;/p&gt;

&lt;p&gt;A big differentiator is cost visibility. Coralogix logs each token, API call, and compute expense, showing cost per agent and raising alerts for anomalies. You can set custom budgets per agent to prevent runaway usage. &lt;/p&gt;

&lt;p&gt;To keep costs manageable, Coralogix offers index-free querying, remote archive queries (on your own S3 or cloud storage), and tools like “Drop Irrelevant Metrics” to prune what isn’t useful after ingestion.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Token &amp;amp; resource cost tracking with anomaly alerts&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Archive &amp;amp; index-free querying to reduce storage/query overheads&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Olly, a natural-language assistant for observability insights&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Usage-based model with a free developer tier. Enterprise plans include advanced AI Center analytics and extended retention options.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.giskard.ai/" rel="noopener noreferrer"&gt;Giskard&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xfon7s0250anopio9dy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6xfon7s0250anopio9dy.png" alt=" " width="800" height="556"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Teams that want to test agents and LLMs pre-deployment using open-source tooling rather than paying for a commercial supervision suite.&lt;/p&gt;

&lt;p&gt;Giskard’s core idea is that supervision should start before production. Its open-source framework lets you scan models and datasets to expose problems like hallucinations, biased completions, or injection vulnerabilities without relying on closed third-party platforms.&lt;/p&gt;

&lt;p&gt;A standout is &lt;a href="https://www.giskard.ai/" rel="noopener noreferrer"&gt;RAGET, their retrieval evaluation tool&lt;/a&gt;. Instead of eyeballing responses, it matches outputs against reference datasets and surfaces where the retrieval logic breaks down. That makes it useful for RAG pipelines, which often fail in subtle, context-driven ways.&lt;/p&gt;

&lt;p&gt;Developers appreciate that it is free and flexible. You can &lt;a href="https://medium.com/@rohithreddy66666/mastering-rag-evaluation-a-deep-dive-into-giskard-ai-for-pdf-chatbots-c6d1163330aa" rel="noopener noreferrer"&gt;script evaluations directly in Python or use the UI to set custom rules&lt;/a&gt;. But it does take work — you need to define good datasets and tests, otherwise results don’t mean much.&lt;/p&gt;

&lt;p&gt;The limitation is that Giskard stops at testing. It helps you find weaknesses pre-deployment but doesn’t offer continuous runtime monitoring. Most teams that adopt it still layer on observability or guardrail tools once agents are in production.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Automated scanning for hallucinations, bias, and injection risks&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;RAGET framework for retrieval evaluation against ground-truth data&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Python APIs and UI for test creation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Extensible with custom rules and datasets&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Completely open-source, with paid support options for enterprises.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.ibm.com/products/watsonx" rel="noopener noreferrer"&gt;IBM WatsonX&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgl16dpj12nx734dcu1aw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgl16dpj12nx734dcu1aw.png" alt=" " width="800" height="401"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Enterprises that care more about auditability and control than speed — teams that must prove AI reliability to regulators and leadership.&lt;/p&gt;

&lt;p&gt;IBM’s Watsonx.governance suite brings structured oversight to AI systems. It automates documentation and risk scoring while managing bias checks across pipelines. Model registries, version control, and lineage tracking connect directly to corporate compliance systems for end-to-end accountability.&lt;/p&gt;

&lt;p&gt;The main advantage of Watsonx is the support network behind it. Clients work with IBM Consulting, prebuilt industry templates, and integration pathways into enterprise software such as SAP or Salesforce. For regulated firms, that guidance often replaces months of internal coordination.&lt;/p&gt;

&lt;p&gt;The trade-off is complexity. Setup involves many moving parts, and flexibility narrows once policies are locked in. Watsonx performs best in predictable settings where models change gradually. Smaller teams seeking fast iteration usually find it cumbersome.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Model and data governance automation&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Risk, bias, and compliance reporting&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Integration with enterprise data and risk suites&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;IBM Consulting and sector-specific templates&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Enterprise licensing with optional consulting packages. Costs vary by compliance level and contract structure.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;a href="https://www.nvidia.com/en-us/ai-data-science/products/nemo/" rel="noopener noreferrer"&gt;NVIDIA NeMo Rails&lt;/a&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3mn3eg0wxxi5ipfoo4m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk3mn3eg0wxxi5ipfoo4m.png" alt=" " width="800" height="568"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Best for:&lt;/strong&gt; Engineering teams running complex LLM or agent workloads that need programmable control over behavior and safety boundaries.&lt;/p&gt;

&lt;p&gt;NeMo Guardrails is a rule-driven framework for supervising conversational AI. It specifies what agents can discuss, how they respond, and which tools they may access. Rules are written in a lightweight configuration language and enforced during runtime without retraining.&lt;/p&gt;

&lt;p&gt;The framework can run locally or through &lt;a href="https://www.nvidia.com/en-us/ai/foundry/" rel="noopener noreferrer"&gt;NVIDIA’s AI Foundry platform&lt;/a&gt;. Teams use it to combine rule enforcement with retrieval or vector-based search, adding safety layers over existing generative pipelines. Integration with LangChain and Triton simplifies deployment inside production environments.&lt;/p&gt;

&lt;p&gt;In practice, Guardrails delivers reliable behavior once tuned but requires setup and testing effort. Each rule must be validated under expected load to prevent latency or misfires. After calibration, it produces consistent, policy-aligned outputs that satisfy enterprise safety requirements.&lt;/p&gt;
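
&lt;p&gt;Conceptually, the rules are just data that a runtime check consults before the model answers or a tool runs. Here is a hedged sketch of that idea in plain Python; the topics and tool names are invented, and this is not NeMo Guardrails’ actual Colang syntax:&lt;/p&gt;

```python
# Hypothetical rule set in the spirit of a guardrails config file.
RULES = {
    "blocked_topics": ["medical advice", "legal advice"],
    "allowed_tools": ["kb_search", "ticket.create"],
}

def enforce(message, tool=None, rules=RULES):
    """Check a user message and optional tool request against declarative rules."""
    text = message.lower()
    for topic in rules["blocked_topics"]:
        if topic in text:
            return (False, f"refused: topic '{topic}' is out of bounds")
    if tool is not None and tool not in rules["allowed_tools"]:
        return (False, f"refused: tool '{tool}' is not on the allowlist")
    return (True, "ok")

print(enforce("open a ticket for me", tool="ticket.create"))  # (True, 'ok')
print(enforce("give me some legal advice"))
```

Because the rules live outside the model, policy changes take effect at the next request with no retraining, which is the property the framework trades setup effort for.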

&lt;p&gt;&lt;strong&gt;Key Features&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Rule-based control for model topics, responses, and tool use&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Integration with LangChain, Triton, and retrieval pipelines&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Local or cloud operation via NVIDIA AI Foundry&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Templates and SDKs for regulated and safety-focused workloads&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Pricing:&lt;/strong&gt; Enterprise features are available through AI Enterprise and Foundry subscriptions, which include managed support and deployment assistance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Getting Started with AI Supervision
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Choose your first supervision platform
&lt;/h3&gt;

&lt;p&gt;You don’t need to deploy everything at once. Start by picking the platform that fits your immediate need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Wayfound&lt;/strong&gt; gives executives and governance teams an AI Agent Supervisor and dashboards to review and improve agent behavior without digging through code.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;LangSmith&lt;/strong&gt; lets engineers replay runs and debug prompts when agents misfire.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Lakera Guard&lt;/strong&gt; sits in front of agents to block prompt injection and risky actions.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Coralogix&lt;/strong&gt; tracks every agent session in real time, scoring quality, latency, and cost while surfacing issues like drift.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Giskard&lt;/strong&gt; offers an open-source framework to test agents against datasets before pushing to production.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;IBM WatsonX&lt;/strong&gt; equips enterprises with model governance, bias detection, and audit reporting to keep AI development compliant and explainable.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;NVIDIA NeMo Guardrails&lt;/strong&gt; gives engineers fine-grained control over what agents can say or do, enforcing safety and policy rules directly at runtime.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  2. Connect your agents and integrations
&lt;/h3&gt;

&lt;p&gt;Start by linking your agents to the platform, then extend coverage into the everyday tools your teams already rely on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Communication tools such as Slack or Teams so alerts can be shared immediately&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Business systems like Salesforce or HubSpot to tie oversight directly into customer and sales activity&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Data pipelines and API layers that carry context between agents and backend systems&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Storage and knowledge bases, where supervision can check retrievals and prevent drift in long-term memory&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  3. Configure policies and guardrails
&lt;/h3&gt;

&lt;p&gt;Decide what your agents are not allowed to do. Common starting points:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Restrict access to sensitive data fields (e.g., customer IDs in Salesforce)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Set runtime rules on tool use so costly or high-risk actions require approval&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Add brand, compliance, or tone filters before responses go out&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Layer in prompt-injection defenses to stop adversarial inputs&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
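
&lt;p&gt;These policies all reduce to a gate that every agent action passes through before it executes. Here is a minimal sketch of that gate; the field pattern, cost threshold, and action shape are illustrative assumptions, not any specific platform's configuration:&lt;/p&gt;

```python
import re

# A minimal policy gate: redact sensitive fields and flag costly actions
# for human approval. The pattern and threshold below are illustrative.
SENSITIVE = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")  # e.g. SSN-like IDs

def apply_policies(action):
    """Return a sanitized action plus an approval flag."""
    text = SENSITIVE.sub("[REDACTED]", action["payload"])
    needs_approval = action.get("estimated_cost_usd", 0) > 50
    return {"payload": text, "needs_approval": needs_approval}
```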

&lt;p&gt;For further reading:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.lakera.ai/blog/guide-to-prompt-injection" rel="noopener noreferrer"&gt;Lakera’s guide on prompt injection defense&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://www.nist.gov/itl/ai-risk-management-framework" rel="noopener noreferrer"&gt;NIST AI Risk Management Framework&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://trust.salesforce.com/en/trust-and-compliance-documentation/" rel="noopener noreferrer"&gt;Salesforce Trust and Compliance documentation&lt;/a&gt;&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  4. Run your first performance review
&lt;/h3&gt;

&lt;p&gt;A performance review is the first moment supervision feels concrete. Instead of chasing logs or scattered feedback, you can watch how an agent handled a real conversation and see where the rules you set actually mattered. It’s a shift from theory into something you can observe and adjust.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjadufzdabk0oriti4hrh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjadufzdabk0oriti4hrh.png" alt=" " width="800" height="700"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Wayfound’s review panel makes agent behavior and guideline checks visible in one place.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Wayfound’s review panel is a good illustration. The transcript sits alongside the decisions the agent made, with clear signals showing how those choices lined up with the company guidelines. Explanations and feedback boxes capture what worked well and where improvements are needed, making the outcome easy to understand without requiring technical expertise or extra reporting layers.&lt;/p&gt;

&lt;p&gt;For teams used to heavy oversight processes, this streamlined view shows how reviews can be both explainable and lightweight enough to slot into existing workflows.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. Expand gradually across workflows
&lt;/h3&gt;

&lt;p&gt;Once you’ve run that first review, the temptation is to wire up every agent and every process at once. Resist it. Supervision scales best when you add coverage step by step. Start with the workflows where risk or cost is highest, and let the early reviews teach you which rules matter most.&lt;/p&gt;

&lt;p&gt;A good tip here is to keep track of what you change. Treat each adjustment to a policy, guideline, or review process as an experiment, and note what effect it has. That simple habit makes scaling far easier, because you’re learning how supervision fits the way your teams actually work.&lt;/p&gt;




&lt;p&gt;Enterprise adoption is already showing that many traditional teams struggle to integrate AI, weighed down by layers of bureaucracy. Much of that bureaucracy exists in good faith, to protect customers and keep operations stable.&lt;/p&gt;

&lt;p&gt;The problem is when that oversight can’t keep up with the speed of deployment. What begins as protection turns into friction, and innovation spills into unmanaged risk. Structured supervision is what closes that gap: it gives enterprises the accountability they need without sacrificing the pace of adoption.&lt;/p&gt;

&lt;p&gt;I’m Aryan. I’ve worked in the AI agent, LLM, and MLOps space for a while now, and my upcoming PhD is focused on governance and explainability for AI agents. If this is something you’re working on or curious about, connect with me on &lt;a href="//www.linkedin.com/in/aryankargwal"&gt;LinkedIn&lt;/a&gt;; I’d love to continue the conversation.&lt;/p&gt;

</description>
      <category>ai</category>
      <category>architecture</category>
      <category>agents</category>
      <category>mcp</category>
    </item>
    <item>
      <title>DeepSeek-R1: Redefining the Reinforcement Learning in AI</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Mon, 10 Feb 2025 16:47:58 +0000</pubDate>
      <link>https://dev.to/aryankargwal/deepseek-r1-redefining-the-reinforcement-learning-in-ai-47e6</link>
      <guid>https://dev.to/aryankargwal/deepseek-r1-redefining-the-reinforcement-learning-in-ai-47e6</guid>
      <description>&lt;p&gt;The rapid evolution of large language models (LLMs) has reshaped the AI landscape, with OpenAI leading the charge. However, the emergence of open-source models like &lt;strong&gt;DeepSeek-R1&lt;/strong&gt; challenges the dominance of proprietary systems.&lt;/p&gt;

&lt;p&gt;As users question the justification behind hefty subscription fees, such as OpenAI's $200 premium plan (which could fund a year's worth of caffeine for developers), DeepSeek-R1 offers compelling answers.&lt;/p&gt;

&lt;p&gt;This blog delves into DeepSeek-R1's architecture, performance, and what it means for the future of AI.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;The Resurgence of Reinforcement Learning in Generative AI&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Over the past year, the use of &lt;a href="https://youtu.be/T_X4XFwKX8k?si=XwLkpXLs8k0IAmlb" rel="noopener noreferrer"&gt;Reinforcement Learning from Human Feedback&lt;/a&gt; (RLHF) saw a notable decline in generative AI development. Many organizations shifted focus towards scaling model architectures and fine-tuning on large datasets without the added complexity of reinforcement learning (RL).&lt;/p&gt;

&lt;p&gt;However, &lt;strong&gt;DeepSeek-R1&lt;/strong&gt; has reinvigorated interest in RL by demonstrating how it can be applied cost-effectively while significantly enhancing model performance—without needing the budget of a small country.&lt;/p&gt;

&lt;p&gt;So, how does DeepSeek incorporate RL into its training program at a fraction of the cost typically associated with such techniques? Let's break down three key RL strategies they employed:&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Group Relative Policy Optimization (GRPO):&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;DeepSeek-R1 utilizes GRPO, an efficient variant of traditional policy optimization algorithms. Unlike standard approaches that rely heavily on resource-intensive critic models, GRPO estimates baselines from group scores, significantly reducing computational overhead.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzx6yq8lm3465jpozkjpb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzx6yq8lm3465jpozkjpb.png" alt=" " width="800" height="347"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Think of it as a group project. Instead of one person doing all the work, everyone contributes just enough to earn an A. This technique allows DeepSeek-R1 to optimize model performance with minimal resources, focusing on relative improvements within sampled outputs rather than absolute performance metrics.&lt;/p&gt;
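
&lt;p&gt;The core of GRPO is small enough to sketch: each sampled output’s reward is normalized against the mean and standard deviation of its own group, which is what replaces the learned critic. A simplified illustration (not DeepSeek’s training code):&lt;/p&gt;

```python
# Group-relative advantage per the GRPO formulation: normalize each
# sampled output's reward by its group's mean and standard deviation,
# so no separate critic model is needed. Simplified sketch.
def grpo_advantages(rewards, eps=1e-8):
    mean = sum(rewards) / len(rewards)
    std = (sum((r - mean) ** 2 for r in rewards) / len(rewards)) ** 0.5
    return [(r - mean) / (std + eps) for r in rewards]
```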

&lt;h4&gt;
  
  
  &lt;strong&gt;Reward Modeling with Rule-Based Systems:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Instead of relying on costly neural reward models prone to issues like reward hacking (yes, even AI knows how to game the system), DeepSeek-R1 adopts a rule-based reward system. This system emphasizes two types of rewards:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy rewards:&lt;/strong&gt; Assessing the correctness of outputs in tasks like math and coding.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Format rewards:&lt;/strong&gt; Ensuring responses follow structured reasoning patterns.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This cost-effective approach maintains training stability without continuous retraining of complex reward models. It’s like teaching a model that 2+2=4 and that math answers shouldn't come with emojis—simple yet effective.&lt;/p&gt;
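
&lt;p&gt;Both reward types can be approximated with a few lines of pattern matching. The &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt;/&lt;code&gt;&amp;lt;answer&amp;gt;&lt;/code&gt; tags follow the response template described in the DeepSeek-R1 paper; the scoring itself is a simplified sketch of the idea, not the production reward code:&lt;/p&gt;

```python
import re

# Simplified sketch of the two rule-based rewards. The <think>/<answer>
# tags follow DeepSeek-R1's response template; real scoring is more involved.
def format_reward(response):
    pattern = r"<think>.+?</think>\s*<answer>.+?</answer>"
    return 1.0 if re.fullmatch(pattern, response.strip(), re.S) else 0.0

def accuracy_reward(response, gold_answer):
    m = re.search(r"<answer>(.*?)</answer>", response, re.S)
    return 1.0 if m and m.group(1).strip() == gold_answer else 0.0
```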

&lt;h4&gt;
  
  
  &lt;strong&gt;Reinforcement Learning with Cold Start Data:&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;DeepSeek-R1 introduces a 'cold start' phase to address early-stage instability common in RL training. This involves fine-tuning the base model on a small, high-quality dataset before applying RL.&lt;/p&gt;

&lt;p&gt;By establishing a strong initial performance baseline, DeepSeek-R1 accelerates convergence during RL training, reducing the computational cost typically required to achieve high reasoning capabilities.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Breaking Down DeepSeek-R1&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Developed by DeepSeek-AI, DeepSeek-R1 is part of a new generation of reasoning-focused LLMs. It comes in two variants.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1-Zero:&lt;/strong&gt; Trained using large-scale reinforcement learning (RL) without supervised fine-tuning (SFT), showcasing raw reasoning capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;DeepSeek-R1:&lt;/strong&gt; Built on a multi-stage training pipeline that includes cold-start data, SFT, and extensive RL, achieving performance comparable to OpenAI's o1-1217 model.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Unlike traditional LLMs, which rely heavily on supervised datasets, DeepSeek-R1-Zero's performance emerges from pure RL, organically incentivizing reasoning behaviors.&lt;/p&gt;

&lt;p&gt;This approach allows the model to develop self-verification, reflection, and complex chain-of-thought (CoT) reasoning without human biases embedded through SFT—essentially the AI equivalent of learning to ride a bike without training wheels.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Performance That Rivals the Best&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;DeepSeek-R1 doesn't just claim to be competitive; its benchmark results prove it:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvtcdbhtdb7gjmmnzuav.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyvtcdbhtdb7gjmmnzuav.png" alt=" " width="546" height="446"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;AIME 2024 (Pass@1):&lt;/strong&gt; 79.8%, outperforming OpenAI-o1-mini and matching OpenAI-o1-1217.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MATH-500 (Pass@1):&lt;/strong&gt; 97.3%, rivaling OpenAI's models in mathematical reasoning.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MMLU:&lt;/strong&gt; 90.8%, showcasing broad knowledge comprehension.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Codeforces (Percentile):&lt;/strong&gt; 96.3%, indicating elite coding competition performance (because what's more satisfying than beating both AI and humans at code challenges?).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhrn75rksi2s38x2nb7r.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhhrn75rksi2s38x2nb7r.png" alt=" " width="616" height="251"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The distilled models, with 1.5B to 70B parameters, also outperform existing open-source benchmarks. For instance, &lt;strong&gt;DeepSeek-R1-Distill-Qwen-32B&lt;/strong&gt; surpasses QwQ-32B-Preview in all major reasoning tasks, proving that size isn’t everything—optimization is key.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Why This Matters: The Open-Source Revolution&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The success of DeepSeek-R1 challenges the notion that proprietary models inherently offer superior value. Here’s why:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Performance Parity:&lt;/strong&gt; DeepSeek-R1 achieves results on par with, and sometimes exceeding, OpenAI's top-tier models, especially in reasoning-intensive tasks.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost Efficiency:&lt;/strong&gt; Open-source models eliminate the need for expensive API subscriptions, providing high-quality AI capabilities without recurring costs (so you can finally cancel that subscription and still afford your morning latte).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Transparency:&lt;/strong&gt; Unlike black-box proprietary models, open-source projects offer transparency, fostering trust and enabling community-driven improvements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Customization:&lt;/strong&gt; Organizations can fine-tune models like DeepSeek-R1 to meet specific needs, something not easily achievable with closed-source APIs. Sometimes, you just need your AI to have that particular flair.&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  Building AI Agents using DeepSeek R1
&lt;/h3&gt;

&lt;p&gt;If you're excited about DeepSeek-R1's potential but wondering how to integrate it into practical applications, look no further than &lt;a href="//studio.botpress.cloud"&gt;Botpress&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;This powerful platform allows you to build sophisticated AI agents to handle customer support, automate workflows, and even assist in coding tasks without breaking the bank.&lt;/p&gt;

&lt;p&gt;&lt;iframe width="710" height="399" src="https://www.youtube.com/embed/7ZuHsVcQoVc"&gt;
&lt;/iframe&gt;
&lt;/p&gt;

&lt;p&gt;By leveraging DeepSeek-R1 on Botpress, you can recreate much of the agentic functionality that proprietary models offer, but at a fraction of the cost.&lt;/p&gt;

</description>
      <category>deepseek</category>
      <category>opensource</category>
      <category>ai</category>
      <category>agents</category>
    </item>
    <item>
      <title>Benchmarking Pixtral Large vs Pixtral 12B</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Mon, 25 Nov 2024 21:52:24 +0000</pubDate>
      <link>https://dev.to/tunehqai/benchmarking-pixtral-large-vs-pixtral-12b-2l0e</link>
      <guid>https://dev.to/tunehqai/benchmarking-pixtral-large-vs-pixtral-12b-2l0e</guid>
      <description>&lt;p&gt;Youtube: &lt;a href="https://youtu.be/O412lbdYQA0" rel="noopener noreferrer"&gt;Click Me&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Multimodal AI has taken significant leaps in recent years, and &lt;strong&gt;Mistral AI's Pixtral Large&lt;/strong&gt; is no exception. This new Vision-Language Model (VLM) aims to redefine benchmarks in multimodal understanding and reasoning. In this post, I’ll dive into Pixtral Large's capabilities, its performance against its predecessor, Pixtral 12B, and GPT-4V, and share my benchmarking experiments to help you make informed decisions when choosing your next VLM.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;What is Pixtral Large?&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Pixtral Large is Mistral AI’s latest multimodal innovation. Building on the foundation of Pixtral 12B, it introduces enhanced reasoning and comprehension capabilities. Whether tackling complex math problems on datasets like &lt;strong&gt;MathVista&lt;/strong&gt;, document comprehension from &lt;strong&gt;DocVQA&lt;/strong&gt;, or visual-question answering with &lt;strong&gt;VQAv2&lt;/strong&gt;, Pixtral Large consistently sets itself apart with superior performance.&lt;/p&gt;

&lt;p&gt;At its core, Pixtral Large is powered by a &lt;strong&gt;123 billion-parameter multimodal decoder&lt;/strong&gt; and a &lt;strong&gt;1 billion-parameter vision encoder&lt;/strong&gt;, making it a true powerhouse. It supports up to &lt;strong&gt;30 high-resolution images&lt;/strong&gt; within a &lt;strong&gt;128K context window&lt;/strong&gt;, allowing it to handle complex, large-scale reasoning tasks effortlessly. Its decoder builds on the &lt;strong&gt;Mistral Large 2&lt;/strong&gt; text model, giving it strong text processing alongside its multimodal capabilities.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Technical Specifications&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Although the exact architecture of Pixtral Large remains undisclosed, it likely builds upon Pixtral 12B's &lt;strong&gt;shared-embedding multimodal transformer decoder&lt;/strong&gt;. This setup enables it to process &lt;strong&gt;multi-image inferences&lt;/strong&gt; and perform high-quality cross-modal reasoning, excelling at tasks that require a deep integration of visual and textual data.&lt;/p&gt;

&lt;p&gt;Here are some standout specs of Pixtral Large:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Parameters&lt;/strong&gt;: 123 billion (multimodal decoder) + 1 billion (vision encoder)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Context Window&lt;/strong&gt;: 128K tokens&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image Support&lt;/strong&gt;: Up to 30 high-resolution images&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Applications&lt;/strong&gt;: Math reasoning, document comprehension, chart understanding, and more&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Pixtral Large vs. Pixtral 12B&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;The shift from Pixtral 12B to Pixtral Large represents a nuanced tradeoff:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Pixtral 12B&lt;/strong&gt;: Balanced capabilities across tasks, excelling in &lt;strong&gt;label-based&lt;/strong&gt; and &lt;strong&gt;rationale-based&lt;/strong&gt; evaluations.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pixtral Large&lt;/strong&gt;: Falls behind in label-based tasks but shines in &lt;strong&gt;rationale-based performance&lt;/strong&gt;, indicating superior reasoning and explanation capabilities.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This evolution demonstrates Pixtral Large’s focus on tasks requiring deeper comprehension and reasoning, making it a strong contender for specialized use cases.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Benchmarking Results&lt;/strong&gt;
&lt;/h2&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Datasets Used&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;To test Pixtral Large, I benchmarked it against its predecessor and GPT-4V using two datasets:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ArxivQA&lt;/strong&gt;: Research paper-based QA tasks with GPT-4V inferences for comparison.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flickr30k&lt;/strong&gt;: A classic image captioning dataset enhanced with GPT-4O-generated captions.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Evaluation Metrics&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;I used &lt;strong&gt;Cosine Similarity&lt;/strong&gt; to measure semantic alignment between generated outputs and reference data. Metrics included &lt;strong&gt;win rate&lt;/strong&gt;, &lt;strong&gt;average similarity&lt;/strong&gt;, and &lt;strong&gt;top-1, top-5, top-10 scores&lt;/strong&gt;.&lt;/p&gt;
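
&lt;p&gt;For reference, the metrics themselves are straightforward to compute once you have embeddings. The short vectors below are stand-ins for real sentence embeddings, and the helper names are my own:&lt;/p&gt;

```python
# Cosine similarity and win rate, sketched in plain Python. In the actual
# benchmark the vectors would come from a sentence-embedding model.
def cosine_similarity(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = (sum(a * a for a in u) ** 0.5) * (sum(b * b for b in v) ** 0.5)
    return dot / norm

def win_rate(model_sims, baseline_sims):
    """Fraction of samples where the model's similarity beats the baseline's."""
    wins = sum(m > b for m, b in zip(model_sims, baseline_sims))
    return wins / len(model_sims)
```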




&lt;h3&gt;
  
  
  &lt;strong&gt;ArxivQA Results&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;From &lt;strong&gt;1,000 randomly selected images&lt;/strong&gt;, Pixtral Large demonstrated a stronger ability to reason through scientific and mathematical content. While it struggled with label-based evaluations compared to Pixtral 12B, it outperformed in rationale-based tasks. This indicates a shift toward &lt;strong&gt;deeper reasoning capabilities&lt;/strong&gt;, ideal for complex QA scenarios.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3e11lo2vfnp6jrt1yewv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3e11lo2vfnp6jrt1yewv.png" alt=" " width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Flickr30k Results&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For the &lt;strong&gt;Flickr30k Captioning Benchmark&lt;/strong&gt;, Pixtral Large produced slight improvements over Pixtral 12B when evaluated against &lt;strong&gt;human-generated captions&lt;/strong&gt;. However, neither model managed a majority win rate on this task.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frmvfwfxfe3xgglyforty.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frmvfwfxfe3xgglyforty.png" alt=" " width="800" height="383"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Interestingly, when compared to &lt;strong&gt;GPT-4V captions&lt;/strong&gt;, Pixtral Large performed well, though it fell slightly behind Pixtral 12B in top-ranked matches. These results highlight Pixtral Large’s potential but also suggest areas for improvement in precision and caption generation.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Using Pixtral Large on Tune Studio&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Due to the model's size and resource requirements, I used &lt;strong&gt;Tune Studio&lt;/strong&gt; for benchmarking. With its user-friendly interface and efficient inference scripts, I was able to process &lt;strong&gt;500 images per hour&lt;/strong&gt;, completing the job for under $20. This makes Tune Studio a valuable tool for researchers and developers working on large-scale AI projects.&lt;/p&gt;




&lt;h2&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;Pixtral Large represents a significant step forward in multimodal AI, offering enhanced reasoning and cross-modal comprehension. While it may not surpass Pixtral 12B in every aspect, its focus on rationale-based tasks makes it a compelling choice for applications requiring deeper understanding.&lt;/p&gt;

&lt;p&gt;For developers, researchers, and enterprises looking for cutting-edge VLMs, Pixtral Large offers a mix of power and precision that’s hard to beat.&lt;/p&gt;




&lt;p&gt;What do you think about Pixtral Large? Is it the next big thing in VLMs, or do you see potential in other models like GPT-4V? Let me know your thoughts in the comments below! 🚀&lt;/p&gt;

</description>
      <category>llm</category>
      <category>vlm</category>
      <category>benchmarking</category>
      <category>research</category>
    </item>
    <item>
      <title>Transform UI Screenshots into HTML &amp; CSS with Qwen Coder and Qwen VL</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Thu, 14 Nov 2024 16:43:38 +0000</pubDate>
      <link>https://dev.to/tunehqai/transform-ui-screenshots-into-html-css-with-qwen-coder-and-qwen-vl-12nl</link>
      <guid>https://dev.to/tunehqai/transform-ui-screenshots-into-html-css-with-qwen-coder-and-qwen-vl-12nl</guid>
      <description>&lt;p&gt;🎥Youtube Video Link: &lt;a href="https://youtu.be/TK1TDe7fHGI" rel="noopener noreferrer"&gt;Click Me&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Imagine this: you’re working on a website redesign, and you’ve just captured a UI screenshot that embodies the look you want. Wouldn’t it be incredible if you could automatically turn that image into HTML and CSS? This tutorial will show you exactly how to make that happen, transforming visual designs into code using cutting-edge &lt;strong&gt;vision-language models (VLMs)&lt;/strong&gt; and &lt;strong&gt;Qwen Coder&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;In this setup, we’ll build a pipeline where an AI model analyzes your UI design image, understands its layout, colors, typography, and structure, and then generates clean, organized HTML and CSS code. This process opens up a world of possibilities for &lt;strong&gt;UI prototyping, automated design-to-code workflows&lt;/strong&gt;, and quick mockup generation.&lt;/p&gt;

&lt;p&gt;Some cool points we'll cover:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Upload and Describe UI Designs&lt;/strong&gt;: How we upload a UI screenshot and get a detailed breakdown of the design elements.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generate HTML &amp;amp; CSS with AI&lt;/strong&gt;: Transforming these descriptions into fully functional HTML and CSS code for quick web design prototyping.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Let’s get started!&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Setting Up API Details and Image Encoding
&lt;/h3&gt;

&lt;p&gt;First, let’s configure the API endpoint, headers, and a helper function to encode images into Base64. This encoding step allows us to send the image data to the model.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;PIL&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;io&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;BytesIO&lt;/span&gt;

&lt;span class="c1"&gt;# Set API details
&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://proxy.tune.app/chat/completions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;YOUR_API_KEY&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# Replace with your actual API key
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="c1"&gt;# Encode image in Base64 format
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;encode_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RGBA&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;convert&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;RGB&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# Convert RGBA to RGB
&lt;/span&gt;    &lt;span class="n"&gt;buffered&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;BytesIO&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;save&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffered&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;format&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;JPEG&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;base64&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;b64encode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;buffered&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getvalue&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;decode&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 2: Querying the Vision-Language Model for Description
&lt;/h3&gt;

&lt;p&gt;In this step, we’ll create a function that sends the UI image to the VLM and asks for a detailed description. A good description captures the key aspects of the UI, including &lt;strong&gt;color schemes, typography, layout structure, and icons&lt;/strong&gt;, since these are the details the code model relies on to generate accurate HTML and CSS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Query the model for a description of the image
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base64_image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1000&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;image_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data:image/jpeg;base64,&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base64_image&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="n"&gt;image_content&lt;/span&gt;
                &lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[{}])[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; - &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 3: Extracting HTML and CSS Code from Model Response
&lt;/h3&gt;

&lt;p&gt;Once we have the description, we prompt &lt;strong&gt;Qwen Coder&lt;/strong&gt; to generate HTML and CSS that match the described layout. We then parse the response, pulling out the HTML and CSS blocks so they can be written to separate files.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;

&lt;span class="c1"&gt;# Extract HTML and CSS from model response
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;extract_html_css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;html_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;### HTML\n```

html\n(.*?)

```&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DOTALL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;css_match&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;search&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;### CSS.*\n```

css\n(.*?)

```&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DOTALL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;html_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;html_match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;html_match&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="n"&gt;css_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;css_match&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;group&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;css_match&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;html_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;css_code&lt;/span&gt;

&lt;span class="c1"&gt;# Save HTML and CSS to files
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;write_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;css_code&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;index.html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;html_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;html_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;styles.css&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;w&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;css_file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;css_file&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;write&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;css_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
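&lt;p&gt;If you want to sanity-check the parsing logic before wiring it into the app, here is a self-contained version of the extractor run against a toy response. Note that the &lt;code&gt;### HTML&lt;/code&gt; and &lt;code&gt;### CSS&lt;/code&gt; headings followed by fenced code blocks are an assumption about how the code model formats its reply; adjust the patterns if your model answers differently. The snippet builds the triple-backtick fence programmatically just to keep it readable here.&lt;/p&gt;

```python
import re

# Assumed response format: "### HTML" / "### CSS" headings, each followed
# by a fenced code block. FENCE is three backticks, built up so that the
# literal markers don't clutter (or break) this example.
FENCE = "`" * 3

def extract_html_css(response_text):
    html_pat = r"### HTML\n" + FENCE + r"html\n(.*?)\n" + FENCE
    css_pat = r"### CSS.*\n" + FENCE + r"css\n(.*?)\n" + FENCE
    html_match = re.search(html_pat, response_text, re.DOTALL)
    css_match = re.search(css_pat, response_text, re.DOTALL)
    html_code = html_match.group(1).strip() if html_match else ""
    css_code = css_match.group(1).strip() if css_match else ""
    return html_code, css_code

# A toy model reply in the expected shape
sample = (
    "### HTML\n" + FENCE + "html\nhello\n" + FENCE + "\n\n"
    "### CSS\n" + FENCE + "css\nbody { margin: 0; }\n" + FENCE
)
html_code, css_code = extract_html_css(sample)
```

&lt;p&gt;If either pattern fails to match, the function falls back to empty strings, which is what the Streamlit app later checks for before writing files.&lt;/p&gt;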



&lt;h3&gt;
  
  
  Step 4: Building the Streamlit App for User Interaction
&lt;/h3&gt;

&lt;p&gt;Our final step is setting up the &lt;strong&gt;Streamlit&lt;/strong&gt; interface. This UI allows users to upload images, choose a model, generate descriptions, and output HTML/CSS.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;streamlit&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;

&lt;span class="c1"&gt;# Streamlit UI setup
&lt;/span&gt;&lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;title&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Image Description and HTML/CSS Generation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;model_choice&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;selectbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Select Model for Image Understanding&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                            &lt;span class="n"&gt;options&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen/qwen-2-vl-72b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;openai/gpt-4o&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mistral/pixtral-12B-2409&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;meta/llama-3.2-90b-vision&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                            &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;uploaded_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;file_uploader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Choose an image...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;type&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;jpeg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;png&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;

&lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;button&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate Description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;uploaded_image&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;Image&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;uploaded_image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;base64_image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;encode_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="c1"&gt;# Generate the UI description
&lt;/span&gt;        &lt;span class="n"&gt;description_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please analyze this software interface image and provide a detailed description.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;description&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;query_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base64_image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;description_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_choice&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subheader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generated Description:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Generate HTML and CSS
&lt;/span&gt;            &lt;span class="n"&gt;html_css_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are TuneStudio, a coding assistant that generates HTML and CSS based on descriptions.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please create HTML and CSS based on the following detailed description: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;description&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
                &lt;span class="p"&gt;],&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;qwen/qwen-2.5-coder-32b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;3000&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;

            &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;html_css_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;html_css_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[{}])[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{}).&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;html_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;css_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;extract_html_css&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html_css_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;html_code&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="n"&gt;css_code&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="nf"&gt;write_files&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;css_code&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;success&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HTML and CSS files have been generated.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                    &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;HTML/CSS extraction failed.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

                &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;subheader&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generated HTML and CSS:&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;code&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html_css_code&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;language&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;html&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;error&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Error generating HTML/CSS.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;st&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;warning&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Please upload an image.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
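&lt;p&gt;To try the app locally, save the combined script from the steps above (here assumed to be named &lt;code&gt;app.py&lt;/code&gt;; use whatever filename you chose), install the dependencies, and launch Streamlit:&lt;/p&gt;

```shell
# Install the libraries used across the steps, then start the app
pip install streamlit pillow requests
streamlit run app.py
```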



&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;With this setup, you’ve created a pipeline that not only automates the analysis of UI images but also translates them into HTML and CSS. This workflow is a major time-saver for developers, designers, and anyone involved in UI design. Now, you can turn visual ideas into functional code with the power of AI!&lt;/p&gt;

&lt;p&gt;Let me know if you run into any questions or issues in the comments below.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>javascript</category>
      <category>tutorial</category>
      <category>python</category>
    </item>
    <item>
      <title>Image Generation using Janus 1.3B🔮</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Thu, 24 Oct 2024 22:03:31 +0000</pubDate>
      <link>https://dev.to/aryankargwal/multi-modality-and-image-gen-in-a-13b-model-1ef2</link>
      <guid>https://dev.to/aryankargwal/multi-modality-and-image-gen-in-a-13b-model-1ef2</guid>
      <description>&lt;p&gt;Code: &lt;a href="https://github.com/aryankargwal/genai-tutorials" rel="noopener noreferrer"&gt;Click Me&lt;/a&gt;&lt;br&gt;
Youtube: &lt;a href="https://youtu.be/N1EFxzD7G0E" rel="noopener noreferrer"&gt;Click Me&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Today, we’re diving into something exciting: &lt;strong&gt;Janus 1.3B&lt;/strong&gt;, one of the smallest truly multimodal LLMs that is still genuinely capable. What sets Janus apart is that, despite its small size, it delivers strong results in both natural language processing and image generation. It is a perfect example of where AI is heading: smaller models that remain versatile and multimodal.&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;Janus 1.3B&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;So, what exactly is &lt;strong&gt;&lt;a href="https://arxiv.org/abs/2410.13848" rel="noopener noreferrer"&gt;Janus 1.3B&lt;/a&gt;&lt;/strong&gt;? At its core, Janus is a &lt;strong&gt;vision-language model (VLM)&lt;/strong&gt; designed to handle textual and visual data. With just &lt;strong&gt;1.3 billion parameters&lt;/strong&gt;, Janus is significantly smaller than some of the other LLMs and multimodal models we’ve discussed on the channel. But don’t let its size fool you; it can perform both text and image generation, making it a powerful tool despite its relatively compact size.&lt;/p&gt;

&lt;p&gt;Unlike most models, which specialise in one area or need large architectures to function effectively in multiple domains, Janus achieves this multimodal functionality at a much smaller scale. This is a massive step in making AI more efficient, accessible, and, most importantly, scalable.&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;How Does Janus Work?&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Let’s start with its architecture. Janus processes text understanding, multimodal understanding, and visual generation through independent encoding methods that eventually feed into a unified autoregressive transformer. This design allows it to handle different types of input—text, images, or a combination of both—in a highly efficient manner.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjywqcs8dkx5czy2p7fo6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjywqcs8dkx5czy2p7fo6.png" alt=" " width="800" height="329"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s the breakdown of how it all works:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Text Understanding&lt;/strong&gt;: Janus employs a built-in tokenizer from its underlying LLM. This tokenizer converts text into discrete IDs (tokens), which are transformed into feature representations. The LLM processes these features in the same way as any other text-based model.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Multimodal Understanding&lt;/strong&gt;: For image understanding, Janus integrates &lt;strong&gt;SigLIP&lt;/strong&gt;, a powerful vision encoder that extracts high-dimensional semantic features from images. These features are flattened from a 2D grid into a 1D sequence and passed through an understanding adaptor. This adaptor maps the image features into the input space of the LLM, so that image and text data are represented in a way the model can process together.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Image Generation&lt;/strong&gt;: Janus utilizes a &lt;strong&gt;Vector Quantization (VQ)&lt;/strong&gt; tokenizer to generate images. This tokenizer converts images into a sequence of discrete IDs. These ID sequences are flattened and passed through a generation adaptor, which maps them into the LLM’s input space. This allows Janus to generate image content from a text description. A specialized image prediction head is trained for this task, while Janus relies on the LLM’s existing text prediction head for text-based tasks.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;
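&lt;p&gt;To make the VQ step concrete, here is a minimal NumPy sketch of vector quantization in general; the codebook size and feature dimension below are illustrative, not the model’s actual ones. Each patch feature is replaced by the ID of its nearest codebook entry, and decoding is simply a lookup.&lt;/p&gt;

```python
import numpy as np

# Conceptual sketch of vector quantization (not the model's actual tokenizer):
# every patch feature vector is replaced by the index of its nearest codebook
# entry, turning an image into a sequence of discrete IDs an LLM can model.
rng = np.random.default_rng(0)
codebook = rng.normal(size=(512, 8))            # illustrative codebook: 512 codes, dim 8
patch_features = rng.normal(size=(24 * 24, 8))  # e.g. a 24x24 grid of patch features

def vq_encode(features, codebook):
    # Squared L2 distance from every feature to every code, argmin per feature
    d = ((features[:, None, :] - codebook[None, :, :]) ** 2).sum(-1)
    return d.argmin(axis=1)                     # discrete token IDs

def vq_decode(ids, codebook):
    return codebook[ids]                        # decoding is just a table lookup

ids = vq_encode(patch_features, codebook)
recon = vq_decode(ids, codebook)
print(ids.shape, recon.shape)                   # (576,) (576, 8)
```

&lt;p&gt;The real tokenizer learns its codebook during training and decodes with a neural decoder rather than a raw lookup, but the encode-to-IDs, decode-from-IDs contract is the same.&lt;/p&gt;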

&lt;p&gt;Once the inputs, whether text, image, or both, are converted into feature sequences, Janus concatenates them into a unified multimodal feature sequence. This sequence is then fed into the LLM for processing, making it capable of generating text and images based on the input it receives.&lt;/p&gt;
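&lt;p&gt;Shape-wise, the fusion described above can be sketched as follows; the hidden sizes and token counts here are assumed for illustration, and the understanding adaptor is reduced to a single projection matrix:&lt;/p&gt;

```python
import numpy as np

# Illustrative shapes only: an "understanding adaptor" is essentially a learned
# projection from the vision encoder's width into the LLM's hidden size, after
# which image and text features live in one unified sequence.
hidden = 2048                                   # assumed LLM hidden size
text_emb = np.zeros((12, hidden))               # 12 text tokens, already embedded
img_feat = np.zeros((24, 24, 1024))             # patch grid from the vision encoder (width 1024 assumed)
adaptor = np.zeros((1024, hidden))              # adaptor weights (hypothetical)

img_seq = img_feat.reshape(-1, 1024) @ adaptor  # flatten 2D grid -> 1D sequence, then project
fused = np.concatenate([img_seq, text_emb], 0)  # one multimodal sequence for the transformer
print(fused.shape)                              # (588, 2048)
```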


&lt;h3&gt;
  
  
  &lt;strong&gt;Janus Multi-Modal Performance&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now, let’s talk performance. Despite its relatively small size of 1.3 billion parameters, Janus is competitive across several multimodal tasks. It excels in &lt;strong&gt;Visual Question Answering (VQA)&lt;/strong&gt; benchmarks, &lt;strong&gt;COCO Captioning&lt;/strong&gt;, and &lt;strong&gt;Image-Text Retrieval&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptbtlfh8je9okva3ngos.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptbtlfh8je9okva3ngos.png" alt="Janus MultiModal" width="800" height="523"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Janus is designed to handle real-world multimodal applications where parameter efficiency is critical. While larger models might outperform Janus on tasks that require deep reasoning over complex text or high-resolution images, Janus hits a sweet spot by balancing efficiency and performance for general-purpose multimodal applications.&lt;/p&gt;


&lt;h3&gt;
  
  
  &lt;strong&gt;How to Use Janus for Multi-Modal Integration&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Now, let us see how to use the model for multimodal inference. Below is an example of how to set up the &lt;code&gt;generate_answer&lt;/code&gt; function, which takes an image and a question as inputs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Load the VL-GPT model, tokenizer, and visual language chat processor
&lt;/span&gt;    &lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_vl_gpt_model&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_tokenizer&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;vl_chat_processor&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;load_vl_chat_processor&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Define conversation structure
&lt;/span&gt;    &lt;span class="n"&gt;conversation&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; [image: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="c1"&gt;# Prepare image for processing
&lt;/span&gt;    &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;preprocess_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Prepare inputs for the model
&lt;/span&gt;    &lt;span class="n"&gt;inputs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vl_chat_processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;process&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;conversation&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Generate input embeddings
&lt;/span&gt;    &lt;span class="n"&gt;input_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Generate answer using the VL-GPT model
&lt;/span&gt;    &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;decode_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this code, we load the necessary components, prepare the image and question for processing, and generate a response that combines visual context with the posed question.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Janus Image Generation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Finally, let’s examine Janus’ image generation capabilities. While it’s not as large as dedicated models like &lt;strong&gt;DALL-E 2&lt;/strong&gt; or &lt;strong&gt;Stable Diffusion&lt;/strong&gt;, Janus still creates high-quality images from textual inputs in an incredibly compact form.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpw87fay5kqhu197haacr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpw87fay5kqhu197haacr.png" alt="Janus Image Gen" width="800" height="319"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As mentioned, Janus uses the VQ tokenizer to convert images into discrete tokens. At generation time, the LLM predicts these image tokens autoregressively from the text prompt, and the VQ tokenizer’s decoder then maps the completed token sequence back into pixels. The result? Images that are highly coherent and contextually accurate, especially when dealing with more straightforward or abstract prompts.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;How to Use Janus for Image Generation&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;The process starts with tokenizing the prompt using the &lt;code&gt;vl_chat_processor&lt;/code&gt;. This converts the text into numerical representations that the model can understand.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Tokenize the prompt
&lt;/span&gt;    &lt;span class="n"&gt;tokenized_prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vl_chat_processor&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;tokenize&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Create initial embeddings from tokens
&lt;/span&gt;    &lt;span class="n"&gt;initial_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokenized_prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Generate image tokens iteratively
&lt;/span&gt;    &lt;span class="n"&gt;image_tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;num_tokens&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_next_token&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;initial_embeddings&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;image_tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;initial_embeddings&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;update_embeddings&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;initial_embeddings&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Decode tokens into an image
&lt;/span&gt;    &lt;span class="n"&gt;image&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;decode_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image_tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Save image to disk
&lt;/span&gt;    &lt;span class="nf"&gt;save_image&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;output_image.jpg&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This code illustrates generating an image based on a text prompt using Janus. It showcases the iterative process of generating image tokens while ensuring relevance to the original prompt.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;strong&gt;Conclusion&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;So there you have it—&lt;strong&gt;Janus 1.3B&lt;/strong&gt;, a small but compelling multimodal model that punches well above its weight. Its ability to handle text understanding, multimodal reasoning, and image generation in such a compact framework is a testament to the efficiency of its design. &lt;/p&gt;

&lt;p&gt;For those interested in multimodal AI that can be deployed in real-world applications without massive computational power, Janus is a model you should watch.&lt;/p&gt;

</description>
      <category>streamlit</category>
      <category>transformers</category>
      <category>computervision</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Building a Multi-Turn-Assistant Application using Llama, Claude and GPT4o</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Fri, 18 Oct 2024 17:32:13 +0000</pubDate>
      <link>https://dev.to/tunehqai/building-a-multi-turn-assistant-application-using-llama-claude-and-gpt4o-1ieo</link>
      <guid>https://dev.to/tunehqai/building-a-multi-turn-assistant-application-using-llama-claude-and-gpt4o-1ieo</guid>
      <description>&lt;p&gt;💻Github: &lt;a href="https://github.com/aryankargwal/genai-tutorials/tree/main/multi-turn-agents" rel="noopener noreferrer"&gt;https://github.com/aryankargwal/genai-tutorials/tree/main/multi-turn-agents&lt;/a&gt;&lt;br&gt;
🎥Youtube: &lt;a href="https://youtu.be/S9iHpExFrTs" rel="noopener noreferrer"&gt;https://youtu.be/S9iHpExFrTs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this guide, we’ll explore the development of a &lt;strong&gt;multi-turn AI assistant application&lt;/strong&gt; using &lt;strong&gt;LLMs&lt;/strong&gt; and AI assistant integration. This application is designed to streamline complex workflows such as &lt;strong&gt;internet retrieval&lt;/strong&gt;, &lt;strong&gt;market research&lt;/strong&gt;, &lt;strong&gt;campaign generation&lt;/strong&gt;, and &lt;strong&gt;image creation&lt;/strong&gt;. Throughout this process, we will rely on &lt;strong&gt;Tune Studio&lt;/strong&gt; for AI model orchestration, and &lt;strong&gt;Streamlit&lt;/strong&gt; for the front-end user interface. The end goal is to create a fully automated assistant-led pipeline that performs end-to-end tasks by interacting with multiple AI assistants in a sequential manner—also known as a multi-turn workflow.&lt;/p&gt;


&lt;h3&gt;
  
  
  What is a Multi-Turn AI Assistant Application?
&lt;/h3&gt;

&lt;p&gt;In the context of AI and automation, a &lt;strong&gt;multi-turn assistant application&lt;/strong&gt; is one where multiple interactions (or "turns") are required to complete a task. The application maintains context throughout these turns and allows each assistant or model to perform specific sub-tasks in a coordinated manner. This approach contrasts with single-turn applications, where the AI assistant addresses a single user query without needing to track prior inputs or outputs.&lt;/p&gt;

&lt;p&gt;In this tutorial, the multi-turn approach allows AI assistants to collaborate across multiple steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Market Research Assistant&lt;/strong&gt; gathers data from the web.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Analytics Assistant&lt;/strong&gt; processes the research and generates insights.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Campaign Generation Assistant&lt;/strong&gt; creates marketing strategies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Image Generation Assistant&lt;/strong&gt; produces a campaign poster.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each assistant plays a crucial role and passes context to the next in line, ensuring a smooth and coherent user experience.&lt;/p&gt;
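&lt;p&gt;The hand-off can be sketched in a few lines; the assistant functions below are hypothetical stand-ins for the API-backed ones shown later, but the shape is the same: each turn's output becomes part of the next turn's input, so context flows forward.&lt;/p&gt;

```python
# Minimal sketch of the multi-turn hand-off. Each function is a placeholder
# for a model-backed assistant; only the chaining pattern matters here.
def market_research(query):
    return f"research about {query}"

def analytics(research):
    return f"insights from ({research})"

def campaign(insights):
    return f"campaign built on ({insights})"

def poster(campaign_text):
    return f"poster for ({campaign_text})"

def run_pipeline(query):
    research = market_research(query)   # turn 1: gather data
    insights = analytics(research)      # turn 2: analyze, carrying turn 1's output
    plan = campaign(insights)           # turn 3: generate the campaign
    return poster(plan)                 # turn 4: render the poster

result = run_pipeline("smart water bottles")
print(result)
```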


&lt;h3&gt;
  
  
  What Are AI Assistants?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;AI assistants&lt;/strong&gt; are digital agents powered by machine learning models that help users perform tasks, answer questions, and provide recommendations. Unlike co-pilots or AI agents, AI assistants focus on assisting with user-driven tasks, such as scheduling meetings, performing web searches, or, in our case, handling marketing tasks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F375blztvdwjnozw6bcxg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F375blztvdwjnozw6bcxg.png" alt="Assistants v Copilots v Agents" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are three distinct categories of LLM-driven tools:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI Assistants&lt;/strong&gt;: Designed to respond to user commands and requests. Common examples include virtual assistants like &lt;strong&gt;Siri&lt;/strong&gt; or &lt;strong&gt;Alexa&lt;/strong&gt;, but they can also handle more specialized workflows.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Co-Pilots&lt;/strong&gt;: These tools work alongside humans, helping improve tasks as they are being performed. Examples include &lt;strong&gt;Grammarly&lt;/strong&gt; for writing and &lt;strong&gt;GitHub Copilot&lt;/strong&gt; for coding.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;AI Agents&lt;/strong&gt;: Autonomous agents that plan and perform tasks with minimal user input, for example &lt;strong&gt;ReAct&lt;/strong&gt;-style agents and agentic workflows.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In our application, the &lt;strong&gt;AI assistants&lt;/strong&gt; are key players in achieving each part of the task while ensuring user control and input at every step. Now, let’s break down how we’ve integrated multiple assistants to create a seamless marketing and campaign generation tool.&lt;/p&gt;


&lt;h3&gt;
  
  
  Step-by-Step Breakdown of the Application
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6qiqsichvw1fipfyfd4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd6qiqsichvw1fipfyfd4.png" alt="Application Flow" width="429" height="824"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h4&gt;
  
  
  1. &lt;strong&gt;Performing Market Research with an AI Assistant&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;In this first step, the AI assistant is responsible for gathering relevant information from the internet. We use a &lt;strong&gt;Llama 3.1&lt;/strong&gt; model fine-tuned for research tasks to collect numerical data, trends, and insights from across the web.&lt;/p&gt;

&lt;p&gt;Here’s the core code for this assistant's function:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_market_research_assistant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kargwalaryan/research&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;frequency_penalty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;research_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function sends a user query to the &lt;strong&gt;Tune Studio&lt;/strong&gt; API, which uses a fine-tuned model to fetch relevant market research. The model acts as a &lt;strong&gt;subject matter expert&lt;/strong&gt; on the specific topic or product the user inquires about.&lt;/p&gt;
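&lt;p&gt;For context, a chat-completions-style endpoint like this typically nests the generated text under &lt;code&gt;choices[0].message.content&lt;/code&gt;. The sketch below shows how the globals the function relies on might be set up and how the answer could be pulled out of the response; the endpoint URL and key are placeholders, not real values.&lt;/p&gt;

```python
# Hedged setup sketch: research_url and headers are the globals the assistant
# functions post to. The URL and key below are placeholder assumptions.
research_url = "https://proxy.tune.app/chat/completions"  # assumed endpoint
headers = {
    "Authorization": "YOUR_TUNE_API_KEY",                 # placeholder
    "Content-Type": "application/json",
}

def extract_text(response_json):
    # Chat-completions responses nest the text under choices[0].message.content
    return response_json["choices"][0]["message"]["content"]

# A response shaped like what the API returns (contents invented for the demo)
sample = {"choices": [{"message": {"role": "assistant", "content": "42% YoY growth"}}]}
print(extract_text(sample))  # 42% YoY growth
```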

&lt;h4&gt;
  
  
  2. &lt;strong&gt;Analyzing Research and Creating Insights&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;Once the data is gathered, the next assistant steps in to analyze the research. This assistant runs on &lt;strong&gt;Claude Sonnet&lt;/strong&gt;, a model known for its compliance, safety, and conversational adaptability.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_analytics_assistant&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;research_text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;user_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Here is some market research data: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;research_text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;. Extract all the marketing insights and generate a campaign prompt.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are TuneStudio&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;user_content&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;anthropic/claude-3.5-sonnet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;frequency_penalty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;300&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;research_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, the &lt;strong&gt;Claude Sonnet&lt;/strong&gt; model processes the research and extracts stylistic and strategic insights that will inform the next step—campaign generation.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. &lt;strong&gt;Generating the Marketing Campaign&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;For campaign generation, we need an assistant that not only understands the market analysis but can also create a compelling, structured campaign. The &lt;strong&gt;Claude Sonnet&lt;/strong&gt; model shines in this area, as it generates an engaging and compliant campaign strategy based on market trends.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_campaign&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;analysis_text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate a marketing campaign based on this analysis: {analysis_result}.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kargwalaryan/campaign-gen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;frequency_penalty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;150&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;research_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This assistant pulls from the insights gathered and creates a comprehensive campaign that could be deployed over the next few months.&lt;/p&gt;

&lt;h4&gt;
  
  
  4. &lt;strong&gt;Image Generation for the Campaign Poster&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;The final assistant in this pipeline produces a visual asset, a campaign poster, using &lt;strong&gt;GPT-4o&lt;/strong&gt;, which can generate images from textual descriptions.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;call_image_generation&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;analysis_text&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;payload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Generate a campaign poster based on this analysis: {analysis_result}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;kargwalaryan/image-gen&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;frequency_penalty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;research_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;dumps&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;payload&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This model generates a creative campaign poster based on the strategy developed in the earlier steps, completing the entire marketing pipeline.&lt;/p&gt;
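The four assistants described above form a chain, with each stage's output feeding the next. Here is a minimal sketch of that orchestration, assuming each call helper returns an OpenAI-style response dict; the `extract_text` helper and the exact wiring are illustrative, not the article's exact implementation:

```python
def extract_text(response: dict) -> str:
    """Pull the assistant message out of an OpenAI-style response dict."""
    return response["choices"][0]["message"]["content"]

def run_campaign_pipeline(topic, call_research, call_analysis,
                          call_campaign, call_image):
    """Chain the four assistants: each stage's text feeds the next one."""
    research = extract_text(call_research(topic))
    analysis = extract_text(call_analysis(research))
    campaign = extract_text(call_campaign(analysis))
    poster = extract_text(call_image(analysis))
    return {"research": research, "analysis": analysis,
            "campaign": campaign, "poster": poster}
```

Passing the call functions in as arguments also makes it easy to swap one assistant for another without touching the rest of the pipeline.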




&lt;h3&gt;
  
  
  Why Use Multi-Turn Assistant Workflows?
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Multi-turn workflows&lt;/strong&gt; allow for complex tasks to be broken into smaller, manageable operations, each handled by a specialized AI assistant. This ensures that the final output is not only accurate but also aligned with the user's overall goals.&lt;/p&gt;

&lt;p&gt;Some of the key advantages of multi-turn workflows include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Context Retention&lt;/strong&gt;: The application retains context across different stages of the workflow. This allows each assistant to build upon the work of previous assistants.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Task Specialization&lt;/strong&gt;: Each assistant is optimized for a specific sub-task, ensuring higher performance in individual areas like research, analysis, campaign generation, and image creation.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flexibility and Customization&lt;/strong&gt;: You can easily modify or swap out assistants to suit different applications. For example, you could replace the market research assistant with one better suited to another industry or domain.&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Creating a &lt;strong&gt;multi-turn AI assistant application&lt;/strong&gt; allows you to harness the power of multiple LLMs and assistants to handle complex tasks in a highly structured way. By breaking down tasks into distinct stages and integrating models like &lt;strong&gt;Llama 3.1&lt;/strong&gt;, &lt;strong&gt;Claude Sonnet&lt;/strong&gt;, and &lt;strong&gt;GPT-4o&lt;/strong&gt;, you can build intelligent, autonomous pipelines that help users with everything from market research to visual content creation.&lt;/p&gt;

&lt;p&gt;This approach is ideal for applications where tasks need to be completed in a step-by-step manner while maintaining context across all steps.&lt;/p&gt;

&lt;p&gt;Let me know if you have any questions or suggestions for further improvement! Stay tuned for more advanced tutorials on LLMs and VLMs.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>llm</category>
      <category>streamlit</category>
      <category>python</category>
    </item>
    <item>
      <title>Stress Testing VLMs: Multi QnA and Description Tasks</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Mon, 14 Oct 2024 14:12:09 +0000</pubDate>
      <link>https://dev.to/aryankargwal/stress-testing-vlms-multi-qna-and-description-tasks-569g</link>
      <guid>https://dev.to/aryankargwal/stress-testing-vlms-multi-qna-and-description-tasks-569g</guid>
      <description>&lt;p&gt;Video Link: &lt;a href="https://youtu.be/pwW9zwVQ4L8" rel="noopener noreferrer"&gt;https://youtu.be/pwW9zwVQ4L8&lt;/a&gt;&lt;br&gt;
Repository Link: &lt;a href="https://github.com/aryankargwal/genai-tutorials/tree/main" rel="noopener noreferrer"&gt;https://github.com/aryankargwal/genai-tutorials/tree/main&lt;/a&gt;&lt;/p&gt;



&lt;p&gt;In the fast-evolving world of AI, Vision-Language Models (VLMs) have garnered attention for their ability to understand and generate responses based on visual and textual inputs. However, testing these models in a structured environment and comparing their performance across various scenarios is still a challenging task. This blog will walk you through an experiment where we used a custom-built &lt;strong&gt;Streamlit&lt;/strong&gt; web application to stress test multiple VLMs like &lt;strong&gt;Llama 3.2&lt;/strong&gt;, &lt;strong&gt;Qwen 2 VL&lt;/strong&gt;, and &lt;strong&gt;GPT-4o&lt;/strong&gt; on a range of tasks. We analyzed their &lt;strong&gt;response tokens&lt;/strong&gt;, &lt;strong&gt;latency&lt;/strong&gt;, and accuracy in generating answers to complex, multimodal questions.&lt;/p&gt;

&lt;p&gt;Note, however, that most of the findings are still under wraps: this application is part of my ongoing work on a VLM benchmark, the first installment of which you can check out on Hugging Face as &lt;a href="https://huggingface.co/datasets/kargwalaryan/SynCap-Flickr8k" rel="noopener noreferrer"&gt;SynCap-Flickr8k&lt;/a&gt;!&lt;/p&gt;
&lt;h2&gt;
  
  
  Why Compare Vision-Language Models?
&lt;/h2&gt;

&lt;p&gt;The ability to compare the performance of different VLMs across domains is critical for:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Understanding model efficiency (tokens used, latency).&lt;/li&gt;
&lt;li&gt;Measuring how well models can generate coherent responses based on image inputs and textual prompts.&lt;/li&gt;
&lt;li&gt;Creating benchmark datasets to further improve and fine-tune VLMs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To achieve this, we built a &lt;strong&gt;VLM Stress Testing Web App&lt;/strong&gt; in Python, utilizing &lt;strong&gt;Streamlit&lt;/strong&gt; for a user-friendly interface. This allowed us to upload images, input textual prompts, and obtain model-generated responses in real time. The app also calculated and logged critical metrics such as the number of tokens used in responses and latency.&lt;/p&gt;
&lt;h3&gt;
  
  
  Project Setup
&lt;/h3&gt;

&lt;p&gt;Our main application file, &lt;code&gt;app.py&lt;/code&gt;, uses &lt;strong&gt;Streamlit&lt;/strong&gt; as the frontend and is integrated with API requests to call different VLM models. Each query to a model includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Image&lt;/strong&gt;: Encoded in Base64 format.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Question&lt;/strong&gt;: A text input by the user.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model ID&lt;/strong&gt;: We allow users to choose between multiple VLMs.&lt;/li&gt;
&lt;/ul&gt;
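Encoding an image for the request payload is a one-liner with the standard library; a small sketch (the helper name and file path are illustrative):

```python
import base64

def encode_image(path: str) -> str:
    """Read an image file and return its Base64-encoded string for the API payload."""
    with open(path, "rb") as f:
        return base64.b64encode(f.read()).decode("utf-8")
```

The resulting string is embedded in a `data:image/jpeg;base64,...` URL, as the query code below shows.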

&lt;p&gt;The API response includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Answer&lt;/strong&gt;: The model-generated text.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: Time taken for the model to generate the answer.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token Count&lt;/strong&gt;: Number of tokens used by the model in generating the response.&lt;/li&gt;
&lt;/ul&gt;
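Latency and completion-token counts can be captured by wrapping each model call. A minimal sketch, assuming the endpoint returns an OpenAI-compatible `usage` object (adjust the field names if your API differs):

```python
import time

def timed_query(query_fn, *args, **kwargs):
    """Wrap a model call: return (response, latency in seconds, completion tokens)."""
    start = time.perf_counter()
    response = query_fn(*args, **kwargs)
    latency = time.perf_counter() - start
    # 'usage' assumes an OpenAI-compatible response schema.
    tokens = response.get("usage", {}).get("completion_tokens")
    return response, latency, tokens
```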

&lt;p&gt;Below is the code structure for querying the models:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;query_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base64_image&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frequency_penalty&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;image_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;image_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;data:image/jpeg;base64,&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;base64_image&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;

    &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;model_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
            &lt;span class="p"&gt;{&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                    &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                    &lt;span class="n"&gt;image_content&lt;/span&gt;
                &lt;span class="p"&gt;]&lt;/span&gt;
            &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;frequency_penalty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;frequency_penalty&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Task Definitions and Experiments
&lt;/h2&gt;

&lt;p&gt;We tested four different tasks across multiple domains using the following models:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;strong&gt;Llama 3.2&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;Qwen 2 VL&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;GPT-4o&lt;/strong&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Domains:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Medical&lt;/strong&gt;: Questions related to complex medical scenarios.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Retail&lt;/strong&gt;: Product-related queries.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CCTV&lt;/strong&gt;: Surveillance footage analysis.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Art&lt;/strong&gt;: Generating artistic interpretations and descriptions.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The experiment involved five queries per task for each model, and we recorded the following metrics:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Tokens&lt;/strong&gt;: The number of tokens used by the model to generate a response.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: Time taken to return the response.&lt;/li&gt;
&lt;/ul&gt;
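The mean and standard-deviation columns in the tables below can be reproduced with Python's `statistics` module; the values here are illustrative, not the recorded data:

```python
from statistics import mean, stdev

def summarize(values):
    """Return (mean, sample standard deviation), rounded to two decimals."""
    return round(mean(values), 2), round(stdev(values), 2)

# Illustrative per-question latencies in seconds:
summary = summarize([1.2, 1.5, 1.1, 1.4, 1.3])
```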

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Token Usage Comparison
&lt;/h4&gt;

&lt;p&gt;The tables below highlight the token usage across the four domains for both &lt;strong&gt;Llama&lt;/strong&gt; and &lt;strong&gt;GPT&lt;/strong&gt; models.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Task&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q1 Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q2 Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q3 Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q4 Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q5 Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Mean Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Standard Deviation (Tokens)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Medical (Llama)&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;12&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;3.2&lt;/td&gt;
&lt;td&gt;4.81&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retail (Llama)&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;39&lt;/td&gt;
&lt;td&gt;83&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;124&lt;/td&gt;
&lt;td&gt;60.8&lt;/td&gt;
&lt;td&gt;32.77&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CCTV (Llama)&lt;/td&gt;
&lt;td&gt;18&lt;/td&gt;
&lt;td&gt;81&lt;/td&gt;
&lt;td&gt;83&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;124&lt;/td&gt;
&lt;td&gt;69.2&lt;/td&gt;
&lt;td&gt;37.29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Art (Llama)&lt;/td&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;71&lt;/td&gt;
&lt;td&gt;88&lt;/td&gt;
&lt;td&gt;154&lt;/td&gt;
&lt;td&gt;40&lt;/td&gt;
&lt;td&gt;72.8&lt;/td&gt;
&lt;td&gt;51.21&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Task&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q1 Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q2 Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q3 Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q4 Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q5 Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Mean Tokens&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Standard Deviation (Tokens)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Medical (GPT)&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;2.8&lt;/td&gt;
&lt;td&gt;4.04&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retail (GPT)&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;17.8&lt;/td&gt;
&lt;td&gt;8.53&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CCTV (GPT)&lt;/td&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;td&gt;14&lt;/td&gt;
&lt;td&gt;29&lt;/td&gt;
&lt;td&gt;16.8&lt;/td&gt;
&lt;td&gt;7.69&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Art (GPT)&lt;/td&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;13&lt;/td&gt;
&lt;td&gt;102&lt;/td&gt;
&lt;td&gt;43&lt;/td&gt;
&lt;td&gt;35&lt;/td&gt;
&lt;td&gt;40.6&lt;/td&gt;
&lt;td&gt;35.73&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Latency Comparison
&lt;/h4&gt;

&lt;p&gt;Latency, measured in seconds, is another critical factor in evaluating the model's performance, especially for real-time applications. The following tables display latency results for the same set of tasks.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Task&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q1 Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q2 Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q3 Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q4 Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q5 Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Mean Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Standard Deviation (Latency)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Medical (Llama)&lt;/td&gt;
&lt;td&gt;0.74&lt;/td&gt;
&lt;td&gt;0.97&lt;/td&gt;
&lt;td&gt;0.78&lt;/td&gt;
&lt;td&gt;0.98&lt;/td&gt;
&lt;td&gt;1.19&lt;/td&gt;
&lt;td&gt;0.93&lt;/td&gt;
&lt;td&gt;0.19&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retail (Llama)&lt;/td&gt;
&lt;td&gt;1.63&lt;/td&gt;
&lt;td&gt;3.00&lt;/td&gt;
&lt;td&gt;3.02&lt;/td&gt;
&lt;td&gt;1.67&lt;/td&gt;
&lt;td&gt;3.14&lt;/td&gt;
&lt;td&gt;2.49&lt;/td&gt;
&lt;td&gt;0.74&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CCTV (Llama)&lt;/td&gt;
&lt;td&gt;1.63&lt;/td&gt;
&lt;td&gt;3.00&lt;/td&gt;
&lt;td&gt;3.02&lt;/td&gt;
&lt;td&gt;1.67&lt;/td&gt;
&lt;td&gt;3.14&lt;/td&gt;
&lt;td&gt;2.49&lt;/td&gt;
&lt;td&gt;0.74&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Art (Llama)&lt;/td&gt;
&lt;td&gt;1.35&lt;/td&gt;
&lt;td&gt;2.46&lt;/td&gt;
&lt;td&gt;2.91&lt;/td&gt;
&lt;td&gt;4.45&lt;/td&gt;
&lt;td&gt;2.09&lt;/td&gt;
&lt;td&gt;2.65&lt;/td&gt;
&lt;td&gt;1.06&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;&lt;strong&gt;Task&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q1 Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q2 Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q3 Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q4 Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Q5 Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Mean Latency&lt;/strong&gt;&lt;/th&gt;
&lt;th&gt;&lt;strong&gt;Standard Deviation (Latency)&lt;/strong&gt;&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Medical (GPT)&lt;/td&gt;
&lt;td&gt;1.35&lt;/td&gt;
&lt;td&gt;1.50&lt;/td&gt;
&lt;td&gt;1.21&lt;/td&gt;
&lt;td&gt;1.50&lt;/td&gt;
&lt;td&gt;1.23&lt;/td&gt;
&lt;td&gt;1.38&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retail (GPT)&lt;/td&gt;
&lt;td&gt;1.24&lt;/td&gt;
&lt;td&gt;1.77&lt;/td&gt;
&lt;td&gt;2.12&lt;/td&gt;
&lt;td&gt;1.35&lt;/td&gt;
&lt;td&gt;1.83&lt;/td&gt;
&lt;td&gt;1.63&lt;/td&gt;
&lt;td&gt;0.29&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CCTV (GPT)&lt;/td&gt;
&lt;td&gt;1.20&lt;/td&gt;
&lt;td&gt;2.12&lt;/td&gt;
&lt;td&gt;1.80&lt;/td&gt;
&lt;td&gt;1.35&lt;/td&gt;
&lt;td&gt;1.83&lt;/td&gt;
&lt;td&gt;1.68&lt;/td&gt;
&lt;td&gt;0.32&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Art (GPT)&lt;/td&gt;
&lt;td&gt;1.24&lt;/td&gt;
&lt;td&gt;1.77&lt;/td&gt;
&lt;td&gt;7.69&lt;/td&gt;
&lt;td&gt;3.94&lt;/td&gt;
&lt;td&gt;2.41&lt;/td&gt;
&lt;td&gt;3.61&lt;/td&gt;
&lt;td&gt;2.29&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Observations
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Token Efficiency&lt;/strong&gt;: Llama models generally use fewer tokens in response generation for simpler tasks like &lt;strong&gt;Medical&lt;/strong&gt; compared to more complex domains like &lt;strong&gt;Art&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Latency&lt;/strong&gt;: Latency is higher for more complex images, especially for tasks like &lt;strong&gt;Retail&lt;/strong&gt; and &lt;strong&gt;Art&lt;/strong&gt;, indicating that these models take more time when generating in-depth descriptions or analyzing images.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GPT vs. Llama&lt;/strong&gt;: GPT models generally had lower token counts across the tasks, but the latency was comparable, with GPT showing slightly more variability in complex tasks like &lt;strong&gt;Art&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Conclusion and Future Work
&lt;/h3&gt;

&lt;p&gt;This experiment highlights the importance of evaluating both token efficiency and latency when stress testing VLMs. The &lt;strong&gt;VLM Stress Test App&lt;/strong&gt; allows us to quickly compare multiple models and analyze their performance across a variety of real-world tasks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Future Plans&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Additional Models&lt;/strong&gt;: We plan to add more models like &lt;strong&gt;Mistral&lt;/strong&gt; and &lt;strong&gt;Claude&lt;/strong&gt; to the comparison.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Expanded Dataset&lt;/strong&gt;: New tasks in domains like &lt;strong&gt;Legal&lt;/strong&gt; and &lt;strong&gt;Education&lt;/strong&gt; will be added to challenge the models further.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accuracy Metrics&lt;/strong&gt;: We'll also integrate accuracy metrics like &lt;strong&gt;BLEU&lt;/strong&gt; and &lt;strong&gt;ROUGE&lt;/strong&gt; scores in the next iteration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Check out our &lt;a href="https://github.com/aryankargwal/genai-tutorials/tree/main" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt; for the code and further instructions on how to set up and run your own VLM experiments.&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>streamlit</category>
      <category>vlm</category>
      <category>benchmarking</category>
    </item>
    <item>
      <title>Doing Multihop on HotPotQA Using Qwen 2.5 72B</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Thu, 26 Sep 2024 14:46:21 +0000</pubDate>
      <link>https://dev.to/tunehqai/doing-multihop-on-hotpotqa-using-qwen-25-72b-43a1</link>
      <guid>https://dev.to/tunehqai/doing-multihop-on-hotpotqa-using-qwen-25-72b-43a1</guid>
      <description>&lt;p&gt;When dealing with complex question-answering tasks, a single-hop retrieval approach might not be enough. Questions often require synthesizing information from multiple sources. That’s where &lt;strong&gt;MultiHop Question Answering (QA)&lt;/strong&gt; comes into play, requiring more advanced tools for retrieval and reasoning. In this post, I’ll describe how I built a multi-hop QA pipeline using &lt;strong&gt;DSPy&lt;/strong&gt;, &lt;strong&gt;ColBERT&lt;/strong&gt;, &lt;strong&gt;TuneAPI&lt;/strong&gt;, and &lt;strong&gt;Qwen 2.5 72B&lt;/strong&gt; to handle multi-step reasoning over a knowledge base.&lt;/p&gt;

&lt;h2&gt;
  
  
  Understanding the Key Tools
&lt;/h2&gt;

&lt;p&gt;Before diving into the code, let’s first break down the key tools and libraries that power this pipeline:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. &lt;strong&gt;DSPy&lt;/strong&gt; (Demonstrate-Search-Predict)
&lt;/h3&gt;

&lt;p&gt;DSPy is a Python library that helps structure multi-step processes for tasks like retrieval-augmented generation and multi-hop question answering. It allows us to define a clear, modular flow for handling complex information retrieval tasks and integrate language models effectively.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. &lt;strong&gt;ColBERT&lt;/strong&gt; (Contextualized Late Interaction over BERT)
&lt;/h3&gt;

&lt;p&gt;ColBERT is a dense retrieval model designed to retrieve passages efficiently from large corpora. It works by encoding both the query and documents in a low-dimensional space and comparing them to find relevant matches. For multi-hop QA, ColBERT helps identify the most pertinent passages to answer complex questions.&lt;/p&gt;
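ColBERT's late-interaction ("MaxSim") scoring can be illustrated with a small pure-Python sketch: each query token embedding takes its best match over the document's token embeddings, and the maxima are summed. The toy vectors below stand in for real learned embeddings:

```python
from math import sqrt

def _unit(vec):
    """Normalize a vector to unit length."""
    n = sqrt(sum(x * x for x in vec))
    return [x / n for x in vec]

def maxsim_score(query_vecs, doc_vecs):
    """ColBERT-style late interaction (MaxSim): for each query token embedding,
    take its best cosine similarity over all document token embeddings,
    then sum those maxima into one relevance score."""
    docs = [_unit(v) for v in doc_vecs]
    return sum(
        max(sum(a * b for a, b in zip(_unit(q), d)) for d in docs)
        for q in query_vecs
    )
```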

&lt;h3&gt;
  
  
  3. &lt;strong&gt;TuneAPI&lt;/strong&gt; (API Proxy for LLMs)
&lt;/h3&gt;

&lt;p&gt;TuneAPI acts as a proxy API to interact with LLMs such as Qwen. This lets us access the powerful inference capabilities of LLMs and customize how they process inputs and generate responses.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. &lt;strong&gt;Qwen 2.5 72B&lt;/strong&gt; (Alibaba’s Large Language Model)
&lt;/h3&gt;

&lt;p&gt;Qwen 2.5 72B is a state-of-the-art large language model developed by Alibaba. While the Qwen family also includes dedicated vision-language variants, the 72B model itself is a text model that excels in natural language reasoning, making it a great choice for multi-hop QA tasks where nuanced reasoning over text is required.&lt;/p&gt;

&lt;h3&gt;
  
  
  5. &lt;strong&gt;HotPotQA&lt;/strong&gt; (Dataset for Multi-Hop QA)
&lt;/h3&gt;

&lt;p&gt;HotPotQA is a dataset designed specifically for multi-hop question answering. It contains questions that require information from multiple documents to arrive at an accurate answer, making it ideal for training and evaluating multi-hop QA systems.&lt;/p&gt;




&lt;h2&gt;
  
  
  Setting Up the MultiHopQA Pipeline
&lt;/h2&gt;

&lt;p&gt;The goal here is to build an end-to-end pipeline that can retrieve relevant documents using &lt;strong&gt;ColBERT&lt;/strong&gt;, pass the retrieved contexts to &lt;strong&gt;Qwen 2.5 72B&lt;/strong&gt; for reasoning, and finally output the predicted answer.&lt;/p&gt;

&lt;h3&gt;
  
  
  Code Walkthrough
&lt;/h3&gt;

&lt;p&gt;Let’s break down the process into manageable steps. Here’s the code for building the pipeline:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Importing Required Libraries
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dsp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LM&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dspy.datasets&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;HotPotQA&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;dsp.utils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;deduplicate&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We start by importing the necessary libraries: &lt;strong&gt;dspy&lt;/strong&gt; for building the pipeline (the &lt;strong&gt;ColBERT&lt;/strong&gt; retrieval client is configured through it later), the &lt;code&gt;LM&lt;/code&gt; base class and &lt;code&gt;deduplicate&lt;/code&gt; helper from &lt;code&gt;dsp&lt;/code&gt;, the &lt;code&gt;HotPotQA&lt;/code&gt; loader to provide multi-hop questions, and &lt;strong&gt;requests&lt;/strong&gt; to interact with the &lt;strong&gt;TuneAPI&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Creating a Custom Language Model Client
&lt;/h3&gt;

&lt;p&gt;To use Qwen, we need a custom class to handle API requests. We interact with Qwen via the TuneAPI to submit prompts and retrieve responses.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;CustomLMClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;LM&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://proxy.tune.app/chat/completions&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;history&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{}&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;basic_request&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;str&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Authorization&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;model&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;messages&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are TuneStudio, answer the question based on the context given to you.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;],&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;temperature&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;max_tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;frequency_penalty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;frequency_penalty&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mf"&gt;0.2&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stream&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="p"&gt;}&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;base_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="n"&gt;custom_lm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;CustomLMClient&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;qwen/qwen-2.5-72b&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;your_api_key_here&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This class wraps the Qwen model: it formats the prompt, handles API communication, and parses the response. The &lt;code&gt;basic_request&lt;/code&gt; method takes care of sending requests to the &lt;strong&gt;TuneAPI&lt;/strong&gt; and returning the raw JSON payload.&lt;/p&gt;
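&lt;p&gt;One gap worth noting: &lt;code&gt;basic_request&lt;/code&gt; returns raw JSON, yet later the pipeline calls &lt;code&gt;self.lm_client(prompt)&lt;/code&gt; and takes &lt;code&gt;response[0]&lt;/code&gt;, which expects a list of completion strings. A small &lt;code&gt;__call__&lt;/code&gt; that parses the response bridges the two. This is a hedged sketch: the field names assume an OpenAI-compatible response schema (the shape the Tune proxy appears to follow), and &lt;code&gt;CallableLM&lt;/code&gt; is a hypothetical wrapper, not part of the original code.&lt;/p&gt;

```python
# Hedged sketch: SimplifiedBaleen later calls self.lm_client(prompt) and
# indexes response[0], but basic_request returns raw JSON. A __call__ that
# extracts completion strings bridges the gap. Field names below assume an
# OpenAI-compatible schema; verify against the actual Tune payload.

def extract_completions(response_json):
    """Return the completion strings from an OpenAI-style chat response."""
    return [choice["message"]["content"]
            for choice in response_json.get("choices", [])]

class CallableLM:  # hypothetical wrapper around CustomLMClient.basic_request
    def __init__(self, request_fn):
        self.request_fn = request_fn  # e.g. custom_lm.basic_request

    def __call__(self, prompt, **kwargs):
        return extract_completions(self.request_fn(prompt, **kwargs))
```

With something like this in place, `response[0]` in `generate_answer` yields the first completion string rather than failing on a raw JSON dict.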

&lt;h3&gt;
  
  
  3. Configuring Retrieval and Language Model
&lt;/h3&gt;

&lt;p&gt;Next, we configure &lt;strong&gt;ColBERT&lt;/strong&gt; and set up DSPy to use our custom language model client for inference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;colbertv2_wiki17_abstracts&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ColBERTv2&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;http://20.102.90.50:2017/wiki17_abstracts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;settings&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;configure&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;custom_lm&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rm&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;colbertv2_wiki17_abstracts&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, &lt;strong&gt;ColBERTv2&lt;/strong&gt; retrieves relevant Wikipedia abstracts. These abstracts will be passed to the language model for deeper reasoning.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Loading HotPotQA Dataset
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;dataset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;HotPotQA&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;train_seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;train_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;eval_seed&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2023&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;dev_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;test_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;trainset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_inputs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;train&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;devset&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;with_inputs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;question&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;dataset&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;dev&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We load a small subset of the &lt;strong&gt;HotPotQA&lt;/strong&gt; dataset (20 training and 50 dev examples) for testing. This dataset will provide multi-hop questions for the pipeline.&lt;/p&gt;
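&lt;p&gt;To clarify what the seed and size arguments control, here is a rough plain-Python sketch of the same idea (DSPy's internal sampling differs in detail): the seeds make the draws reproducible, and the sizes cap how many examples each split keeps.&lt;/p&gt;

```python
import random

# Rough sketch of what train_seed / eval_seed / train_size / dev_size mean.
# DSPy's actual sampling logic differs in detail; this only illustrates the
# semantics: seeded, reproducible draws of fixed-size splits.
def split_dataset(examples, train_size, dev_size, train_seed=1, eval_seed=2023):
    pool = list(examples)
    random.Random(train_seed).shuffle(pool)        # reproducible train draw
    train, rest = pool[:train_size], pool[train_size:]
    random.Random(eval_seed).shuffle(rest)         # reproducible dev draw
    return train, rest[:dev_size]

train, dev = split_dataset(range(100), train_size=20, dev_size=50)
print(len(train), len(dev))  # 20 50
```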

&lt;h3&gt;
  
  
  5. Simplified Baleen for Multi-Hop Retrieval
&lt;/h3&gt;

&lt;p&gt;The &lt;strong&gt;Simplified Baleen&lt;/strong&gt; class handles the multi-hop retrieval process. It repeatedly retrieves passages, feeds them into the language model, and finally generates an answer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;class&lt;/span&gt; &lt;span class="nc"&gt;SimplifiedBaleen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;Module&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lm_client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;passages_per_hop&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_hops&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;super&lt;/span&gt;&lt;span class="p"&gt;().&lt;/span&gt;&lt;span class="nf"&gt;__init__&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;retrieve&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;passages_per_hop&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_hops&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;max_hops&lt;/span&gt;
        &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;lm_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;lm_client&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; Context: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;query&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;generate_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;context_str&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Given the following information: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context_str&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;Answer the question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;lm_client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;forward&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
        &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;range&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;max_hops&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
            &lt;span class="n"&gt;query&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_query&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="n"&gt;passages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;passages&lt;/span&gt;
            &lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;deduplicate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;passages&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;answer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_answer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;dspy&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Prediction&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is the core of our pipeline. It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Generates queries based on previously retrieved context.&lt;/li&gt;
&lt;li&gt;Retrieves relevant documents using &lt;strong&gt;ColBERT&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Passes the final context to &lt;strong&gt;Qwen&lt;/strong&gt; to generate the answer.&lt;/li&gt;
&lt;/ul&gt;
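&lt;p&gt;The control flow in the bullets above can be traced end to end with toy stand-ins for the retriever and the language model (the dictionary lookup and one-liner answer function below are placeholders for ColBERT and Qwen, nothing more):&lt;/p&gt;

```python
# Toy mock of the hop loop: a dictionary lookup stands in for ColBERT and a
# trivial function stands in for Qwen, so the control flow is easy to trace.
def dedup(items):
    seen, out = set(), []
    for x in items:
        if x not in seen:
            seen.add(x)
            out.append(x)
    return out

def multi_hop(question, retrieve, answer_fn, max_hops=2, k=2):
    context = []
    for _ in range(max_hops):
        # Each hop's query folds in what earlier hops retrieved.
        query = f"{question} Context: {' '.join(context)}"
        context = dedup(context + retrieve(query, k))
    return answer_fn(context, question)

docs = {"paris": "Paris is the capital of France.",
        "france": "France is in Europe."}
retrieve = lambda q, k: [d for key, d in docs.items() if key in q.lower()][:k]
answer_fn = lambda ctx, q: ctx[0] if ctx else "unknown"

# Hop 1 finds the Paris passage; hop 2's query now mentions France, so the
# second passage is pulled in as well.
print(multi_hop("Where is Paris?", retrieve, answer_fn))
```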

&lt;h3&gt;
  
  
  6. Running the Pipeline
&lt;/h3&gt;

&lt;p&gt;We define a question and pass it through the pipeline to retrieve the answer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;my_question&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;What position on the Billboard Top 100 did Alison Moyet&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s late summer hit achieve?&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;uncompiled_baleen&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SimplifiedBaleen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;lm_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;custom_lm&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;pred&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;uncompiled_baleen&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;my_question&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;my_question&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Predicted Answer: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;answer&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Retrieved Contexts (truncated): &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="si"&gt;:&lt;/span&gt;&lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;...&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;pred&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The question is answered using passages retrieved over one or more hops (a single hop with the defaults above) and reasoned over by &lt;strong&gt;Qwen 2.5 72B&lt;/strong&gt;. The final answer is printed alongside the retrieved contexts.&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This project highlights the growing importance of &lt;strong&gt;multi-hop question answering&lt;/strong&gt; and how combining modern tools like &lt;strong&gt;ColBERT&lt;/strong&gt; for retrieval and &lt;strong&gt;Qwen&lt;/strong&gt; for reasoning can provide powerful solutions. By leveraging datasets like &lt;strong&gt;HotPotQA&lt;/strong&gt;, it’s easier to experiment and fine-tune these pipelines for real-world QA systems.&lt;/p&gt;

&lt;h3&gt;
  
  
  Future Plans:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Experiment with more retrieval-augmented generation tasks.&lt;/li&gt;
&lt;li&gt;Extend this pipeline to support more languages and domain-specific datasets.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;For more NLP tutorials and walkthroughs&lt;/strong&gt;, feel free to check out my &lt;a href="https://www.youtube.com/@AryanKargwal" rel="noopener noreferrer"&gt;YouTube Channel&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>rag</category>
      <category>qwen</category>
      <category>tutorial</category>
      <category>tunestudio</category>
    </item>
    <item>
      <title>Benchmarking Pixtral 12B: MistralAI's New VLM</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Wed, 18 Sep 2024 20:45:32 +0000</pubDate>
      <link>https://dev.to/aryankargwal/benchmarking-pixtral-12b-mistralais-new-vlm-ff</link>
      <guid>https://dev.to/aryankargwal/benchmarking-pixtral-12b-mistralais-new-vlm-ff</guid>
      <description>&lt;p&gt;&lt;a href="https://github.com/aryankargwal/vlm-benchmarks" rel="noopener noreferrer"&gt;GitHub Link&lt;/a&gt;&lt;br&gt;
&lt;a href="https://youtu.be/MwryGctpWrM" rel="noopener noreferrer"&gt;Youtube Link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the fast-evolving world of AI, Vision-Language Models (VLMs) are breaking new ground. Today, we are diving into &lt;a href="https://mistral.ai/news/pixtral-12b/" rel="noopener noreferrer"&gt;Pixtral 12B&lt;/a&gt;, the latest VLM from Mistral AI, which I benchmarked against GPT-4V on multiple datasets. This technical blog will walk you through my benchmarking process and share insights on how Pixtral 12B fares against GPT-4V across various tasks.&lt;/p&gt;

&lt;p&gt;Pixtral 12B is an exciting release, and it brings several innovations to the table, including a 400M parameter vision encoder and a massive 128K token context window. If you’re working on any image-to-text pipelines, this might be the model you need. Let’s dig into the details.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Pixtral 12B?
&lt;/h3&gt;

&lt;p&gt;Pixtral 12B, Mistral AI's latest VLM, is built for complex multimodal tasks, such as chart analysis, code generation from images, and multi-image inferences. Its unique architecture features a &lt;strong&gt;400M parameter vision encoder&lt;/strong&gt; capable of processing images at their native resolution and aspect ratio, significantly reducing preprocessing efforts. Additionally, the &lt;strong&gt;128K token context window&lt;/strong&gt; allows it to handle up to 2,000 images in one batch, streamlining image processing at scale.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcu3nts1m9o8rpophyh4k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcu3nts1m9o8rpophyh4k.png" alt=" " width="800" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The model is versatile across various tasks, especially in understanding visuals with intricate details, such as diagrams. It even supports multi-image inferences, a feature highly beneficial for complex scenarios like medical imaging or sequential image analysis.&lt;/p&gt;

&lt;h3&gt;
  
  
  Datasets and Benchmarks
&lt;/h3&gt;

&lt;p&gt;For this benchmarking exercise, I evaluated Pixtral 12B on three key datasets:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;ArxivQA&lt;/strong&gt;: A large collection of research paper-based question-answering tasks.

&lt;ul&gt;
&lt;li&gt;Dataset link: &lt;a href="https://huggingface.co/datasets/MMInstruction/ArxivQA?row=0" rel="noopener noreferrer"&gt;ArxivQA on Hugging Face&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VisIT Benchmark&lt;/strong&gt;: A dataset for vision-language instruction tasks inspired by real-life scenarios.

&lt;ul&gt;
&lt;li&gt;Dataset link: &lt;a href="https://huggingface.co/datasets/mlfoundations/VisIT-Bench?row=0" rel="noopener noreferrer"&gt;VisIT Bench on Hugging Face&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flickr30K&lt;/strong&gt;: A long-standing image captioning dataset with both human-generated and GPT-4o captions.

&lt;ul&gt;
&lt;li&gt;Dataset link: &lt;a href="https://huggingface.co/datasets/nlphuji/flickr30k" rel="noopener noreferrer"&gt;Flickr30K on Hugging Face&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In addition to evaluating Pixtral 12B, I also used &lt;strong&gt;GPT-4v&lt;/strong&gt; for comparison. The core evaluation metric was &lt;strong&gt;Cosine Similarity&lt;/strong&gt;, which measures the semantic similarity between a generated caption and its reference. From these per-sample similarity scores I derive a &lt;strong&gt;win rate&lt;/strong&gt; and &lt;strong&gt;top-k scores&lt;/strong&gt; (top-1, top-5, and top-10) for both the Pixtral-generated captions and the GPT-4v outputs.&lt;/p&gt;
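&lt;p&gt;As a rough sketch of how these aggregate metrics can be derived from per-sample similarity scores (this is my reading of the setup, not the exact benchmarking code; function names are illustrative):&lt;/p&gt;

```python
def win_rate(model_scores, baseline_scores):
    # Fraction of samples where the model's similarity to the reference
    # beats the baseline's similarity for the same sample
    wins = sum(1 for m, b in zip(model_scores, baseline_scores) if m > b)
    return wins / len(model_scores)

def top_k_score(scores, k):
    # Mean similarity over the k best-scoring samples -- one plausible
    # reading of "top-k score"; the original scripts may differ
    best = sorted(scores, reverse=True)[:k]
    return sum(best) / len(best)
```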

&lt;h3&gt;
  
  
  Cosine Similarity with all-MiniLM-L6-v2
&lt;/h3&gt;

&lt;p&gt;In this benchmarking process, I used &lt;strong&gt;Cosine Similarity&lt;/strong&gt; to evaluate the quality of the model-generated captions and responses by comparing them with reference texts. Specifically, I leveraged the &lt;strong&gt;all-MiniLM-L6-v2&lt;/strong&gt; model, a lightweight transformer model fine-tuned for sentence embedding tasks, to compute the embeddings of both the predicted and reference texts.&lt;/p&gt;

&lt;h4&gt;
  
  
  Why Cosine Similarity?
&lt;/h4&gt;

&lt;p&gt;Cosine Similarity is an efficient and commonly used metric for measuring the &lt;strong&gt;semantic similarity&lt;/strong&gt; between two pieces of text. Unlike traditional methods like BLEU or METEOR, which emphasize exact word matching, Cosine Similarity evaluates the &lt;strong&gt;contextual alignment&lt;/strong&gt; between two text embeddings, making it ideal for tasks like image captioning and question answering where the meaning of the text matters more than the exact word sequence.&lt;/p&gt;

&lt;p&gt;For each comparison, both the reference and predicted texts were transformed into &lt;strong&gt;vector embeddings&lt;/strong&gt; using the all-MiniLM-L6-v2 model, and the cosine similarity score was calculated as:&lt;/p&gt;

&lt;p&gt;\[&lt;br&gt;
\text{Cosine Similarity} = \frac{\mathbf{A} \cdot \mathbf{B}}{\|\mathbf{A}\| \, \|\mathbf{B}\|}&lt;br&gt;
\]&lt;/p&gt;

&lt;p&gt;Where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;A&lt;/strong&gt; and &lt;strong&gt;B&lt;/strong&gt; represent the embeddings of the predicted and reference texts, respectively.&lt;/li&gt;
&lt;li&gt;The result is a score between -1 and 1, where 1 indicates that the two vectors are perfectly aligned (high similarity), and -1 indicates they are diametrically opposed.&lt;/li&gt;
&lt;/ul&gt;
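&lt;p&gt;The formula above is straightforward to implement. Here is a minimal pure-Python sketch of the scoring step (in the actual pipeline, A and B come from all-MiniLM-L6-v2 sentence embeddings rather than the toy vectors shown here):&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    # Dot product of the two embedding vectors
    dot = sum(x * y for x, y in zip(a, b))
    # Euclidean norms of each vector
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)
```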

&lt;h4&gt;
  
  
  Why all-MiniLM-L6-v2?
&lt;/h4&gt;

&lt;p&gt;I chose &lt;strong&gt;all-MiniLM-L6-v2&lt;/strong&gt; because of its balance between &lt;strong&gt;speed&lt;/strong&gt; and &lt;strong&gt;performance&lt;/strong&gt;. The model, with just 22 million parameters, is capable of generating high-quality sentence embeddings that can efficiently compute similarity scores in real-time. Despite being compact, it retains much of the semantic understanding found in larger models, making it ideal for scenarios like benchmarking where large volumes of data need to be processed quickly.&lt;/p&gt;

&lt;p&gt;Here’s why &lt;strong&gt;all-MiniLM-L6-v2&lt;/strong&gt; was the perfect fit for this task:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Efficient Embeddings&lt;/strong&gt;: It generates high-quality embeddings that are lightweight yet semantically rich.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt;: Due to its small size, it scales well with large datasets without compromising inference speed.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Accurate Semantic Representation&lt;/strong&gt;: It captures a strong semantic understanding, essential when comparing captions or answers where the meaning matters more than exact matches.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This embedding model enabled me to compute &lt;strong&gt;cosine similarity&lt;/strong&gt; for various benchmarks like &lt;strong&gt;ArxivQA&lt;/strong&gt;, &lt;strong&gt;VisIT&lt;/strong&gt;, and &lt;strong&gt;Flickr30K&lt;/strong&gt;, allowing for a more nuanced evaluation of how well Pixtral 12B and GPT-4v perform on these datasets.&lt;/p&gt;

&lt;h3&gt;
  
  
  Evaluation Setup and Methodology
&lt;/h3&gt;

&lt;p&gt;To evaluate Pixtral 12B’s performance, I used &lt;a href="https://studio.tune.app/playground" rel="noopener noreferrer"&gt;&lt;strong&gt;Tune Studio&lt;/strong&gt;&lt;/a&gt;, which offers &lt;strong&gt;unlimited API calls&lt;/strong&gt; and provides fast inference with &lt;strong&gt;350+ instruction inferences/hour&lt;/strong&gt; and &lt;strong&gt;500+ captioning inferences/hour&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo05xbnfir2rzxbuesvtt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo05xbnfir2rzxbuesvtt.png" alt=" " width="654" height="310"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Each dataset was benchmarked as follows:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ArxivQA&lt;/strong&gt;: I sampled 1,000 randomly selected images from a pool of 100,000 research paper-based questions. The model had to select the correct answer from multiple options and provide a rationale.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;VisIT Benchmark&lt;/strong&gt;: I evaluated the model on 500+ images covering real-life VLM applications. The task required Pixtral 12B to generate instruction-based responses from the images, which were then compared against human- and GPT-4-generated captions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Flickr30K&lt;/strong&gt;: For this dataset, Pixtral 12B generated captions for 1,000 random images. These captions were compared with both human and GPT-4o-generated captions.&lt;/li&gt;
&lt;/ul&gt;
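&lt;p&gt;For reproducibility, the random subsets described above can be drawn with a fixed seed; a minimal sketch (the pool size matches ArxivQA, but the seed and function name are illustrative):&lt;/p&gt;

```python
import random

def sample_subset(items, k, seed=42):
    # Draw k distinct samples from the dataset; fixing the seed keeps
    # repeated benchmark runs comparable
    rng = random.Random(seed)
    return rng.sample(items, k)
```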

&lt;h3&gt;
  
  
  Results
&lt;/h3&gt;

&lt;h4&gt;
  
  
  ArxivQA
&lt;/h4&gt;

&lt;p&gt;On the &lt;strong&gt;ArxivQA&lt;/strong&gt; dataset, Pixtral 12B faced the challenge of generating accurate answers for research-based questions. Compared to &lt;strong&gt;GPT-4v&lt;/strong&gt;, Pixtral 12B’s multi-word responses lowered the &lt;strong&gt;win rate&lt;/strong&gt;, but its &lt;strong&gt;rationale score&lt;/strong&gt; remained high, showcasing its ability to reason through complex topics.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;GPT-4v (Labels)&lt;/th&gt;
&lt;th&gt;GPT-4v (Rationale)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Win Rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;20.2%&lt;/td&gt;
&lt;td&gt;23.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Top-1 Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;90.8%&lt;/td&gt;
&lt;td&gt;94.2%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Top-5 Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;84.3%&lt;/td&gt;
&lt;td&gt;94.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Top-10 Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;77.2%&lt;/td&gt;
&lt;td&gt;92.66%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  VisIT Benchmark
&lt;/h4&gt;

&lt;p&gt;The &lt;strong&gt;VisIT Benchmark&lt;/strong&gt; focuses on real-world VLM tasks, making it a more practical measure of Pixtral 12B’s capabilities. Pixtral 12B performed well against &lt;strong&gt;GPT-4’s&lt;/strong&gt; captions, showing improved instruction-following abilities, especially when dealing with more specific queries.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Human Captions&lt;/th&gt;
&lt;th&gt;GPT-4 (Captions)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Win Rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;9.1%&lt;/td&gt;
&lt;td&gt;37.5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Top-1 Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;88.4%&lt;/td&gt;
&lt;td&gt;95.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Top-5 Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;85.8%&lt;/td&gt;
&lt;td&gt;94.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Top-10 Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;84.4%&lt;/td&gt;
&lt;td&gt;93.4%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h4&gt;
  
  
  Flickr30K
&lt;/h4&gt;

&lt;p&gt;For &lt;strong&gt;Flickr30K&lt;/strong&gt;, Pixtral 12B’s performance was close to &lt;strong&gt;GPT-4v&lt;/strong&gt;, especially for machine-generated captions, though it scored lower when compared to human captions due to its more concise and objective outputs.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Data Captions&lt;/th&gt;
&lt;th&gt;GPT-4v (Captions)&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Win Rate&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;5.1%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Top-1 Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;33.6%&lt;/td&gt;
&lt;td&gt;98.7%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Top-5 Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;27.9%&lt;/td&gt;
&lt;td&gt;97.8%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Top-10 Score&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;0%&lt;/td&gt;
&lt;td&gt;96.6%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In conclusion, &lt;strong&gt;Pixtral 12B&lt;/strong&gt; proves to be a formidable contender in the VLM space. While it may not fully outshine &lt;strong&gt;GPT-4&lt;/strong&gt; in terms of creative reasoning, its &lt;strong&gt;analytical&lt;/strong&gt; and &lt;strong&gt;cognitive&lt;/strong&gt; capabilities make it a valuable tool for tasks involving structured visual data, like charts, diagrams, and instructional content. It’s faster, cheaper, and more scalable for applications that rely on image-to-text processing.&lt;/p&gt;

&lt;p&gt;As I continue to explore Pixtral 12B and other models, I’ll be sharing code and updates on my &lt;a href="https://github.com/aryankargwal/vlm-benchmarks" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;. If you’re curious about Pixtral’s performance in other benchmarks or know of datasets I should test, feel free to reach out in the comments!&lt;/p&gt;

</description>
      <category>llm</category>
      <category>vlm</category>
      <category>benchmarking</category>
      <category>gpt4</category>
    </item>
    <item>
      <title>SurvBot🎥: Automatic Surveillance Tagging using Moondream and Streamlit</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Tue, 27 Aug 2024 19:29:18 +0000</pubDate>
      <link>https://dev.to/aryankargwal/survbot-automatic-surveillance-tagging-using-moondream-and-streamlit-5f3j</link>
      <guid>https://dev.to/aryankargwal/survbot-automatic-surveillance-tagging-using-moondream-and-streamlit-5f3j</guid>
      <description>&lt;p&gt;&lt;a href="https://youtu.be/9NspeuVio6I" rel="noopener noreferrer"&gt;YOUTUBE VIDEO LIVE NOW🔗&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’m still riding the GenAI train, testing and tweaking new apps using LLMs and VLMs like a mad scientist in a digital lab. My latest experiment? Resurrecting my old projects with some modern, no-nonsense solutions. Enter Moondream 2, the open-source VLM that plays nicely with Streamlit. I put it to work creating an intelligent video tagging system for surveillance because who doesn’t love a bit of AI snooping?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95ivjm4hkfses1iyiatc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F95ivjm4hkfses1iyiatc.png" alt=" " width="800" height="504"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ivuxwe2vjkjphkunlje.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9ivuxwe2vjkjphkunlje.png" alt=" " width="800" height="893"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flij8to0o2gxrcir3jvsv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flij8to0o2gxrcir3jvsv.png" alt=" " width="788" height="488"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In my latest tutorial, I’ll walk you through deploying a VLM locally. No cloud needed; just good old-fashioned DIY. You’ll also get the lowdown on tackling tokenization and the other infernal tasks involved in passing an image through a VLM. Trust me, it’s more fun than it sounds!&lt;/p&gt;
&lt;h2&gt;
  
  
  The Model: Moondream
&lt;/h2&gt;

&lt;p&gt;Moondream is a highly versatile and modular Vision Language Model (VLM) capable of performing various vision-related tasks. From answering questions based on images and detecting objects with bounding boxes to generating accurate image captions, Moondream is designed to deliver reliable results across various applications. It's an advanced tool for developers looking to integrate powerful Vision AI capabilities into their projects.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdffxjiif21annbdjk5a7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdffxjiif21annbdjk5a7.png" alt=" " width="800" height="568"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Built to run efficiently across multiple platforms, Moondream stands out as a compact, open-source VLM that combines performance with accessibility. It’s a great choice for developing next-level AI Vision applications without the burden of heavy or complex models such as GPT-4o and Gemma. The Apache 2.0 license also lets us use the model freely for our own use cases.&lt;/p&gt;
&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;Moving to the implementation, we are looking at four major functionalities: loading the VLM, setting up a tokenizer for the logging, extracting frames from the uploaded video, and finally running inference to store the logs in a CSV.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuw665cx14n1jz56s2lrn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fuw665cx14n1jz56s2lrn.png" alt=" " width="800" height="126"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We are using a Streamlit workflow to set up the application's input and output streams. To see the full implementation, check out the accompanying GitHub repository or the YouTube tutorial linked at the top of this post.&lt;/p&gt;
&lt;h3&gt;
  
  
  Loading Model and Tokenizer
&lt;/h3&gt;

&lt;p&gt;We load the Moondream VLM through Hugging Face's AutoModelForCausalLM. This call downloads all the model weights and caches them in our web application instance, avoiding repeated downloads.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Warning: The Model is Over 2.5 GB, So Mind Your Internet Connection&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Cache the model and tokenizer to avoid downloading them repeatedly
@st.cache_resource
def load_model_and_tokenizer():
    model_id = "vikhyatk/moondream2"
    revision = "2024-07-23"

    model = AutoModelForCausalLM.from_pretrained(
        model_id, trust_remote_code=True, revision=revision,
        torch_dtype=torch.float16).to("cuda")

    tokenizer = AutoTokenizer.from_pretrained(model_id, revision=revision)
    return model, tokenizer
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Extracting Frames with Timestamp
&lt;/h3&gt;

&lt;p&gt;The next function we write handles the uploaded CCTV Surveillance Footage, letting us capture frames according to time intervals. This will also help us identify Key Frames later.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Function to extract frames from video and their timestamps
def extract_frames_with_timestamps(video_path, interval=0.2):
    cap = cv2.VideoCapture(video_path)
    frames = []
    timestamps = []
    frame_rate = cap.get(cv2.CAP_PROP_FPS)
    # Use an integer frame step so the modulo check below is reliable
    # even when interval * fps is fractional
    step = max(1, int(interval * frame_rate))
    success, image = cap.read()
    count = 0

    while success:
        timestamp_ms = cap.get(cv2.CAP_PROP_POS_MSEC)
        timestamp_sec = timestamp_ms / 1000.0

        if count % step == 0:
            img = Image.fromarray(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
            frames.append(img)
            timestamps.append(timestamp_sec)

        success, image = cap.read()
        count += 1

    cap.release()

    print(f"Total frames captured: {len(frames)}")
    return frames, timestamps
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Frame Inference
&lt;/h3&gt;

&lt;p&gt;Now we pass the frames one by one to the model with the prompt "Describe this image." to get descriptions for the video logs. We also add a simple heuristic to estimate key frames: a frame is flagged whenever its description contains more than five words not seen in the previous frame's description.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Extract frames and timestamps from the video
    frames, timestamps = extract_frames_with_timestamps(video_path, interval=1)  # Extract 1 frame per second

    # Process each frame using the model
    descriptions = []
    prev_description_words = set()
    key_frames = []

    with st.spinner("Processing..."):
        for i, frame in enumerate(frames):
            enc_image = model.encode_image(frame)
            description = model.answer_question(enc_image, "Describe this image.", tokenizer)
            filtered_words = list(filter_description(description))  # Convert to list
# Logic for Key Frames
            new_words = set(filtered_words) - prev_description_words
            if len(new_words) &amp;gt; 5:
                key_frames.append((timestamps[i], frame))

            descriptions.append((timestamps[i], filtered_words))
            prev_description_words = set(filtered_words)  # Ensure it remains a set
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Streamlit Formatting for Images
&lt;/h3&gt;

&lt;p&gt;Finally, for the more keen, here is the code for formatting the displayed frames and keyframes using Streamlit Commands.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt; # Display the frames in a grid layout
    num_columns = 3  # Number of columns in the grid
    num_rows = (len(frames) + num_columns - 1) // num_columns  # Calculate number of rows needed

    for row in range(num_rows):
        cols = st.columns(num_columns)
        for col in range(num_columns):
            index = row * num_columns + col
            if index &amp;lt; len(frames):
                frame = frames[index]
                cols[col].image(frame, caption=f"Frame {index + 1} at {timestamps[index]:.2f}s")

     # Display key frames in a grid layout
    if key_frames:
        st.write("Key Frames:")
        num_columns_key_frames = 3  # Number of columns for key frames grid
        num_rows_key_frames = (len(key_frames) + num_columns_key_frames - 1) // num_columns_key_frames  # Calculate number of rows needed

        for row in range(num_rows_key_frames):
            cols = st.columns(num_columns_key_frames)
            for col in range(num_columns_key_frames):
                index = row * num_columns_key_frames + col
                if index &amp;lt; len(key_frames):
                    timestamp, frame = key_frames[index]
                    cols[col].image(frame, caption=f"Key Frame {index + 1} at {timestamp:.2f}s")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So there you have it, a crash course in making your old projects feel new again with a bit of VLM magic. Whether you want to impress your boss or geek out over some next-level AI, Moondream 2 has your back. If you’re anything like me, you’ll probably wonder why you didn’t do this sooner. Go forth and tag those videos like a pro!&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>python</category>
      <category>machinelearning</category>
      <category>computervision</category>
    </item>
    <item>
      <title>Boss Llama: Building a Smart Interview Simulator using Llama 3.1 70B</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Thu, 22 Aug 2024 17:22:20 +0000</pubDate>
      <link>https://dev.to/tunehqai/boss-llama-building-a-smart-interview-simulator-using-llama-31-70b-egi</link>
      <guid>https://dev.to/tunehqai/boss-llama-building-a-smart-interview-simulator-using-llama-31-70b-egi</guid>
      <description>&lt;p&gt;Moving a bit further from writing and exploring the world of LLMs as just a consumer of the vast knowledge and context awareness, I took it upon myself to build a product out of these tools. Boss Llama is an interactive intelligent interview simulator on Meta's &lt;a href="https://llama.meta.com/" rel="noopener noreferrer"&gt;Llama 3.1 70B&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Although not a novel implementation, I hope my tutorial helps you understand the various API calls and functions involved in processing chat, inputs, and data files in a Streamlit Web Application using &lt;a href="https://studio.tune.app/playground" rel="noopener noreferrer"&gt;Tune Studio&lt;/a&gt; to deploy our model and set API gateways for inference. So, let's look at how you can replicate the same results.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Model: Llama 3.1 70B
&lt;/h2&gt;

&lt;p&gt;There are good reasons why Meta's open-source Llama 3.1 70B became my choice for the product. It expands the context window of its predecessor, Llama 3 70B, from 8K to 128K tokens, and doubles the output-token limit from 2048 to 4096.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm08cwjym5wvx23vu0sy1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm08cwjym5wvx23vu0sy1.png" alt="Quality Comparison" width="486" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tvnuevo5di4vgnnx9ad.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1tvnuevo5di4vgnnx9ad.png" alt="Speed Comparison" width="481" height="345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx5yf5lstktb3mwc5zrz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgx5yf5lstktb3mwc5zrz.png" alt="Price Comparison" width="483" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Analysis taken from &lt;a href="https://artificialanalysis.ai/models/llama-3-1-instruct-70b" rel="noopener noreferrer"&gt;Artificial Analysis&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Earlier, I had tried a locally hosted Llama 3.1 8B for the task, but it lacks the context window this use case demands: interviews typically involve extended exchanges averaging 1,500 to 2,000 words, or 2,000+ tokens. So let us now check out how you can build such an application yourself.&lt;/p&gt;
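&lt;p&gt;A quick back-of-the-envelope check of that sizing, assuming the common rule of thumb of roughly 1.3 tokens per English word (the exact ratio depends on the tokenizer):&lt;/p&gt;

```python
def estimate_tokens(word_count, tokens_per_word=1.3):
    # Rough heuristic: English prose averages about 1.3 tokens per word
    # with Llama-style tokenizers; actual counts vary with the text
    return int(word_count * tokens_per_word)
```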

&lt;h2&gt;
  
  
  Implementation
&lt;/h2&gt;

&lt;p&gt;Regarding the implementation, we have chosen &lt;a href="https://streamlit.io/" rel="noopener noreferrer"&gt;Streamlit&lt;/a&gt; as the base of our operations, tying the outputs of the API calls to a chat interface. Since the larger model demands more VRAM than a typical local setup offers, I offload inference to Tune Studio's API.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq64wwop4t1chbb9ftm6u.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq64wwop4t1chbb9ftm6u.png" alt="Boss Llama UI" width="800" height="417"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While discussing the code, however, we will skip over the Streamlit part and focus on its API aspect. If you wish to see how I implement that, check out my video tutorial on &lt;a href="https://youtu.be/cAHipN7CQwE?si=UBm2-Avyy4Bpx8E8" rel="noopener noreferrer"&gt;YouTube here&lt;/a&gt; or head over to the &lt;a href="https://github.com/aryankargwal/boss_llama" rel="noopener noreferrer"&gt;Github Repository here&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;For the upcoming code, here are some essential variables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Conversation: A session state variable holding all the conversation exchanges between the bot and the user.&lt;/li&gt;
&lt;li&gt;Difficulty: The difficulty of the simulated interview&lt;/li&gt;
&lt;li&gt;API Key: Your Tune Studio API Key&lt;/li&gt;
&lt;li&gt;Max Questions: Number of Questions in the Interview&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  System Prompt
&lt;/h3&gt;

&lt;p&gt;Finetuning is the best way to get a model to perform exactly how you want it to; well, I went for the next best thing: a thoughtful and thorough system prompt. While choosing the system prompt, we should be detailed with our requirements, as the model tends to meander and hallucinate if such instructions are not given.&lt;/p&gt;

&lt;p&gt;The latest adversarial training on the modern Llama models also allows us to pass such a system prompt, avoiding any prompt leakage.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;You are Boss Llama, an advanced AI interviewer. Your goal is to conduct a comprehensive and intelligent interview with the candidate based on the following details:

1. Job Description: {st.session_state.job_description}
2. Difficulty Level: {difficulty}
3. Number of Questions: {st.session_state.max_questions}

Instructions:
1. Start the Interview:
   - Begin by presenting a detailed job description for the role based on the difficulty level and the provided job description. Try to keep this introduction small and to the point as the user already knows what they are interviewing for.
   - Provide a warm welcome message to start the interview and set a positive tone.
2. Generate and Ask Questions:
   - Ask a series of questions, up to the specified number of questions. Ensure these questions are relevant to the job description and appropriately challenging according to the difficulty level.
   - Provide clear and concise prompts that assess the candidate's skills, knowledge, and fit for the role.

3. Conduct the Interview:
  - Engage with the candidate in a conversational manner. If the candidate's responses are vague or insufficient, let them know about it and give them a chance to improve, but count it as one more question.
   - Maintain a professional and supportive tone throughout the interview.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Generating Responses
&lt;/h3&gt;

&lt;p&gt;We will generate the responses using an HTTP request to Tune Studio's chat-completions endpoint (the Python equivalent of the curl command Tune Studio provides). This request links your working code to a model of your choice on Tune Studio, which hosts a large library of free models for inference, plus more advanced models with custom fine-tuning and deployment options for hard-core enthusiasts.&lt;/p&gt;

&lt;p&gt;The variable "conversation," which incrementally holds the ongoing conversation, is called every time to create a response that adds to the existing discussion.&lt;/p&gt;
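&lt;p&gt;Concretely, the conversation variable follows the OpenAI-style message format, with each turn appended so that every API call sees the full interview history. A minimal sketch (the Streamlit session-state wiring is omitted, and the helper name is illustrative):&lt;/p&gt;

```python
# The conversation starts with the interviewer system prompt
conversation = [{"role": "system", "content": "You are Boss Llama, an AI interviewer."}]

def add_exchange(history, user_text, assistant_text):
    # Append the candidate's answer and the model's reply in order,
    # preserving the alternating role structure the API expects
    history.append({"role": "user", "content": user_text})
    history.append({"role": "assistant", "content": assistant_text})
    return history
```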

&lt;p&gt;With parameters such as temperature, frequency_penalty, and max_tokens, we can also tweak the quality of responses, further enhancing the feeling of being interviewed by a real interviewer.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Function to call the API
def generate_response(conversation, apikey):
    url = "https://proxy.tune.app/chat/completions"
    headers = {
        "Authorization": apikey,  # Your API key
        "Content-Type": "application/json"
    }

    # Construct the payload for the API call
    payload = {
        "temperature": 0.9,
        "messages": conversation,
        "model": "meta/llama-3.1-70b-instruct",
        "stream": False,
        "frequency_penalty": 0.2,
        "max_tokens": 500
    }

    # Send the POST request to the API
    response = requests.post(url, headers=headers, data=json.dumps(payload))

    # Check if the request was successful
    if response.status_code == 200:
        # Extract the response from the JSON output
        return response.json()["choices"][0]["message"]["content"]
    else:
        return f"Error: {response.status_code} - {response.text}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Generate Evaluations
&lt;/h3&gt;

&lt;p&gt;For the evaluations, we are using a similar API call. We only pass the individual exchanges from the bot and the user to run an assessment using a suitable system prompt.&lt;/p&gt;

&lt;p&gt;This second call plays the interviewer's stricter side: it reviews each exchange from a third-person perspective and returns feedback and a score that feed back into the web application.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Function to generate evaluations on the interview
def generate_evaluation(question, answer, difficulty, apikey):
    url = "https://proxy.tune.app/chat/completions"
    headers = {
        "Authorization": apikey,
        "Content-Type": "application/json"
    }

    payload = {
        "temperature": 0.7,
        "messages": [
            {"role": "system", "content": f"Evaluate the following answer based on the job description difficulty level: {difficulty}."},
            {"role": "user", "content": f"Question: {question}\nAnswer: {answer}"}
        ],
        "model": "meta/llama-3.1-70b-instruct",
        "stream": False,
        "frequency_penalty": 0.2,
        "max_tokens": 500
    }

    try:
        response = requests.post(url, headers=headers, data=json.dumps(payload))
        response.raise_for_status()
        result = response.json()
        feedback = result.get("choices", [{}])[0].get("message", {}).get("content", "No feedback provided")
        # Note: chat-completion responses do not normally include a numeric
        # "score" field, so this falls back to 0 unless the API returns one;
        # asking the model for a score in the prompt and parsing it out of
        # the feedback text is the more reliable route.
        score = result.get("choices", [{}])[0].get("score", 0)
        return feedback, score
    except requests.RequestException as e:
        return f"Error: {e}", 0
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
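&lt;p&gt;A hypothetical bit of glue code (the names exchanges and evaluations are assumptions, and generate_evaluation is stubbed so the sketch runs on its own) shows how each exchange's feedback and score can be collected into the record format the PDF report expects:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical glue: collect each exchange's evaluation into the
# record format the PDF report consumes. generate_evaluation is
# stubbed here so the sketch runs standalone.
def generate_evaluation(question, answer, difficulty, apikey):
    return f"Solid answer for a {difficulty} role.", 8  # stub

exchanges = [("What is a list comprehension?",
              "A concise way to build lists from iterables.")]

evaluations = []
for question, answer in exchanges:
    feedback, score = generate_evaluation(question, answer, "medium", "key")
    evaluations.append({"Question": question, "Answer": answer,
                        "Feedback": feedback, "Score": score})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;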



&lt;h3&gt;
  
  
  Download PDF Report
&lt;/h3&gt;

&lt;p&gt;Finally, what good is an evaluation you cannot take with you? The last function links the outputs of the evaluation function to FPDF to generate a clean, downloadable PDF report.&lt;/p&gt;

&lt;p&gt;What is FPDF, you ask? FPDF, originally a PHP class, is a library for PDF document generation in Python. Compared to other options available online, FPDF provides a more streamlined and straightforward way to generate PDFs. (It also lets us add PNGs, JPEGs, and GIFs to the PDF, which would be a boon if we ever want to include diagrams in the report.)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Function to generate a PDF report
def generate_pdf_report(evaluations):
    pdf = FPDF()
    pdf.add_page()
    pdf.set_font("Arial", size=12)

    pdf.cell(0, 10, "Interview Evaluation Report", ln=True, align="C")
    pdf.ln(10)  # Add a line break

    for evaluation in evaluations:
        pdf.set_font("Arial", style='B', size=12)
        pdf.multi_cell(0, 10, evaluation["Question"])
        pdf.set_font("Arial", size=12)
        pdf.multi_cell(0, 10, evaluation["Answer"])
        pdf.multi_cell(0, 10, f"Feedback: {evaluation['Feedback']}")
        pdf.multi_cell(0, 10, f"Score: {evaluation['Score']}")
        pdf.ln(5)  # Add a line break

    # Save the PDF to a temporary file
    temp_file = tempfile.NamedTemporaryFile(delete=False, suffix=".pdf")
    pdf.output(temp_file.name)
    return temp_file.name
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Here, we saw an implementation of Llama 3.1 70B in a Streamlit web application that simulates interviews through a chat interface. Although the final product lacks accessibility features such as TTS or STT, it shows how capably even a model that is not fine-tuned can operate on system prompts alone.&lt;/p&gt;

</description>
      <category>streamlit</category>
      <category>tunestudio</category>
      <category>llm</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>LMQL, AAAL Pt.6</title>
      <dc:creator>Aryan Kargwal</dc:creator>
      <pubDate>Fri, 02 Aug 2024 04:00:00 +0000</pubDate>
      <link>https://dev.to/tunehqai/lmql-aaal-pt6-22ib</link>
      <guid>https://dev.to/tunehqai/lmql-aaal-pt6-22ib</guid>
      <description>&lt;p&gt;In my journey to enhance adversarial robustness in LLMs, I explored LMQL (Language Model Query Language). This tool is a programming language that allows seamless integration of LLM interaction into program code, providing a structured way to manage model inputs and outputs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmk2snvij63y5rq44xus.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdmk2snvij63y5rq44xus.png" alt=" " width="620" height="544"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;LMQL stands out by enabling developers to specify constraints and rules directly within their code. This feature is crucial for preventing adversarial attacks such as prompt injection and token manipulation. By defining strict constraints, developers can ensure that the model processes only valid and safe inputs, reducing the risk of malicious manipulations.&lt;/p&gt;
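&lt;p&gt;As an illustrative, pseudocode-style sketch of LMQL's query syntax (the model name and constraints are chosen for the example and not tested against a live backend), a query can cap and terminate the model's output declaratively in its where clause:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch of an LMQL query: the where clause constrains decoding itself,
# so over-long or runaway completions are rejected at generation time.
argmax
    "Q: {question}\n"
    "A: [ANSWER]"
from
    "openai/gpt-3.5-turbo"
where
    len(TOKENS(ANSWER)) &lt; 100 and STOPS_AT(ANSWER, "\n")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;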

&lt;p&gt;Additionally, LMQL supports dynamic control over model interactions. Developers can programmatically adjust the model’s behavior based on real-time input validation and monitoring. This flexibility allows for quick responses to potential adversarial attacks, enhancing the overall security of the LLM.&lt;/p&gt;

&lt;p&gt;Another advantage of LMQL is its ability to integrate with existing guardrail tools. For example, combining LMQL with Llama Guard or Nvidia NeMo Guardrails can create a multi-layered defense system. This integration allows for more robust input validation, ethical content generation, and comprehensive logging and monitoring.&lt;/p&gt;

&lt;p&gt;LMQL also facilitates better transparency and explainability. By embedding model interactions within the code, developers can easily trace and audit the model’s decision-making process. This transparency is vital for identifying and mitigating adversarial attacks, ensuring the model’s outputs are trustworthy and reliable.&lt;/p&gt;

&lt;p&gt;In conclusion, LMQL offers a powerful and flexible way to harden LLMs. Its support for declarative constraints, dynamic control, and integration with existing guardrail tools makes it a valuable addition to any adversarial robustness strategy. Stay tuned for more insights into practical implementations of these tools in real-world applications.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>openai</category>
      <category>developer</category>
      <category>cybersecurity</category>
    </item>
  </channel>
</rss>
