
Alain Airom

The Power of Bob: A Masterclass in Execution

A new test using “Bob” to generate a project from scratch in less than one hour!

Introduction

At IBM, we don’t just “release” software; we push it to its limits with rigorous testing before it ever hits General Availability (GA). When it comes to Bob — our AI-powered IDE and modernization assistant — hundreds of IBMers (myself included!) have been conducting intensive, real-world trials to ensure it’s battle-ready. I decided to put Bob to the test by building a sophisticated OCR agent. By orchestrating a stack of Docling, Ollama, and Granite, alongside Llama and Qwen Vision models, I was able to build an out-of-the-box application that doesn’t just process text — it delivers a full-scale observability suite with metrics dashboards for Prometheus, Grafana, and OpenTelemetry.

From Concept to Code in 60 Minutes

The most remarkable part of this experience? I was able to build the entire application in roughly one hour.

Of course, development is rarely a straight line. I encountered the usual hurdles — runtime errors and configuration tweaks — but this is where Bob truly shines. Instead of scouring documentation or Stack Overflow, I simply fed the error messages back to Bob. He diagnosed the issues, suggested corrections, and helped me iterate in real time. Within an hour, the application wasn’t just written; it was up and running.

Is the application perfect? Not yet — there is always room for enhancement. However, the sheer productivity gain is undeniable. By handling the heavy lifting of boilerplate code, documentation, and troubleshooting, Bob allows developers and testers to focus on high-level logic and innovation rather than syntax and setup.

Key highlights of the Bob experience:

  • Transparent Reasoning: Bob doesn’t just give you code; he explains the “why” behind his logic.
  • Security First: He respects boundaries, explicitly asking for permission before accessing local folders or sensitive directories.
  • Instant Documentation: He provides a comprehensive README and documentation out of the box.

Below, I’ve illustrated some of the core components of the application. If you want to dive into the code yourself, you can find the link to the GitHub repository in the Links section at the end of this post.

Implementation: The Project Structure

OCR Agent with Ollama

A Python agent application for OCR (Optical Character Recognition) that uses local Ollama vision models, with comprehensive metrics monitoring via Prometheus and OpenTelemetry and an interactive Streamlit GUI dashboard. Docker/Podman support is included for easy deployment.

Features

  • 🤖 Multiple Vision Models: Support for llama3.2-vision:latest, granite3.2-vision:2b, and qwen2-vl:2b
  • 📄 Document Processing: Image and PDF text extraction using Docling
  • 📊 Metrics Monitoring (see the collector sketch just after this list):
      • Prometheus metrics exporter
      • OpenTelemetry-compatible format
      • Real-time Grafana integration support
  • 🎨 GUI Dashboard: Interactive Streamlit dashboard for metrics visualization
  • 🔄 Batch Processing: Process multiple documents efficiently
  • ⚡ Performance Tracking: Token usage, latency, and error monitoring
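
To make the monitoring bullets concrete, here is a minimal sketch of what a Prometheus-backed collector behind src/metrics/collector.py could look like. It is inferred purely from the calls ocr_agent.py makes (record_request, record_error, record_document, set_active_requests, set_model_info); the metric names and label sets are my assumptions, not the repository's actual definitions.

from prometheus_client import Counter, Gauge, Histogram, Info


class MetricsCollector:
    """Prometheus-backed collector mirroring the calls made in ocr_agent.py."""

    def __init__(self) -> None:
        self.requests = Counter(
            "ocr_requests_total", "OCR requests processed", ["model", "status"])
        self.latency = Histogram(
            "ocr_request_duration_seconds", "OCR request latency", ["model"])
        self.tokens = Counter(
            "ocr_tokens_total", "Tokens consumed", ["model", "kind"])
        self.errors = Counter(
            "ocr_errors_total", "Errors by exception type", ["model", "error_type"])
        self.documents = Counter(
            "ocr_documents_total", "Documents processed", ["format", "status"])
        self.pages = Counter(
            "ocr_pages_total", "Pages processed", ["format", "status"])
        self.active = Gauge(
            "ocr_active_requests", "In-flight OCR requests", ["model"])
        self.model_info = Info("ocr_model", "Active and available models")

    def record_request(self, model, status, duration, tokens=None):
        self.requests.labels(model=model, status=status).inc()
        self.latency.labels(model=model).observe(duration)
        for kind, count in (tokens or {}).items():
            self.tokens.labels(model=model, kind=kind).inc(count)

    def record_error(self, model, error_type):
        self.errors.labels(model=model, error_type=error_type).inc()

    def record_document(self, file_format, status, pages=1):
        self.documents.labels(format=file_format, status=status).inc()
        self.pages.labels(format=file_format, status=status).inc(pages)

    def set_active_requests(self, model, count):
        self.active.labels(model=model).set(count)

    def set_model_info(self, info):
        self.model_info.info(info)


# Module-level singleton, matching "from ..metrics.collector import metrics_collector"
metrics_collector = MetricsCollector()

In the real project, an OpenTelemetry exporter would sit alongside these Prometheus objects to produce the OTEL-compatible stream the feature list mentions.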

While the full stack involves several moving parts, the heart of the operation lives in one clean file. Below is the ocr_agent.py script, showcasing how Bob helped me orchestrate the different models.

"""OCR Agent with Ollama integration."""
import logging
import time
import base64
from pathlib import Path
from typing import Optional, Dict, Any, List
import ollama
from PIL import Image
from docling.document_converter import DocumentConverter
from docling.datamodel.base_models import InputFormat

from ..metrics.collector import metrics_collector
from ..config_manager import config_manager

logger = logging.getLogger(__name__)


class OCRAgent:
    """Agent for OCR processing using Ollama vision models."""

    def __init__(self, model: Optional[str] = None):
        """Initialize OCR agent.

        Args:
            model: Model name to use (defaults to config default)
        """
        self.config = config_manager.get_config()
        self.model = model or self.config.ollama.default_model
        self.client = ollama.Client(host=self.config.ollama.base_url)
        self.doc_converter = DocumentConverter()

        # Set model info in metrics
        metrics_collector.set_model_info({
            "default_model": self.model,
            "available_models": ",".join(self.config.ollama.models)
        })

        logger.info(f"OCR Agent initialized with model: {self.model}")

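    # Note: the Ollama Python client accepts image file paths directly (as
    # extract_text_from_image does below), so this base64 helper is currently
    # unused; it would only be needed for endpoints expecting inline base64.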
    def _encode_image(self, image_path: str) -> str:
        """Encode image to base64.

        Args:
            image_path: Path to image file

        Returns:
            Base64 encoded image string
        """
        with open(image_path, 'rb') as f:
            return base64.b64encode(f.read()).decode('utf-8')

    def _validate_file(self, file_path: str) -> bool:
        """Validate file format.

        Args:
            file_path: Path to file

        Returns:
            True if valid, False otherwise
        """
        path = Path(file_path)
        if not path.exists():
            logger.error(f"File not found: {file_path}")
            return False

        ext = path.suffix.lower().lstrip('.')
        if ext not in self.config.agent.supported_formats:
            logger.error(f"Unsupported format: {ext}")
            return False

        return True

    def extract_text_from_image(self, image_path: str, 
                               prompt: Optional[str] = None) -> Dict[str, Any]:
        """Extract text from an image using Ollama vision model.

        Args:
            image_path: Path to image file
            prompt: Optional custom prompt

        Returns:
            Dictionary with extracted text and metadata
        """
        if not self._validate_file(image_path):
            metrics_collector.record_error(self.model, "invalid_file")
            return {"error": "Invalid file", "text": ""}

        start_time = time.time()
        metrics_collector.set_active_requests(self.model, 1)

        try:
            # Default prompt for OCR
            if prompt is None:
                prompt = "Extract all text from this image. Provide the text exactly as it appears, maintaining formatting and structure."

            # Call Ollama vision model
            response = self.client.chat(
                model=self.model,
                messages=[{
                    'role': 'user',
                    'content': prompt,
                    'images': [image_path]
                }]
            )

            duration = time.time() - start_time

            # Extract response
            text = response['message']['content']

            # Record metrics
            tokens = {
                'prompt': response.get('prompt_eval_count', 0),
                'completion': response.get('eval_count', 0)
            }

            metrics_collector.record_request(
                model=self.model,
                status='success',
                duration=duration,
                tokens=tokens
            )

            # Record document processing
            file_format = Path(image_path).suffix.lower().lstrip('.')
            metrics_collector.record_document(file_format, 'success', pages=1)

            logger.info(f"Successfully extracted text from {image_path} in {duration:.2f}s")

            return {
                "text": text,
                "model": self.model,
                "duration": duration,
                "tokens": tokens,
                "status": "success"
            }

        except Exception as e:
            duration = time.time() - start_time
            logger.error(f"Error extracting text from {image_path}: {e}")

            metrics_collector.record_request(
                model=self.model,
                status='error',
                duration=duration
            )
            metrics_collector.record_error(self.model, type(e).__name__)

            file_format = Path(image_path).suffix.lower().lstrip('.')
            metrics_collector.record_document(file_format, 'error', pages=1)

            return {
                "error": str(e),
                "text": "",
                "model": self.model,
                "duration": duration,
                "status": "error"
            }

        finally:
            metrics_collector.set_active_requests(self.model, 0)

    def extract_text_from_pdf(self, pdf_path: str) -> Dict[str, Any]:
        """Extract text from PDF using Docling.

        Args:
            pdf_path: Path to PDF file

        Returns:
            Dictionary with extracted text and metadata
        """
        if not self._validate_file(pdf_path):
            metrics_collector.record_error(self.model, "invalid_file")
            return {"error": "Invalid file", "text": ""}

        start_time = time.time()

        try:
            # Convert PDF using Docling
            result = self.doc_converter.convert(pdf_path)

            # Extract text
            text = result.document.export_to_markdown()

            duration = time.time() - start_time

            # Get page count
            pages = len(result.document.pages) if hasattr(result.document, 'pages') else 1

            # Record metrics
            metrics_collector.record_document('pdf', 'success', pages=pages)

            logger.info(f"Successfully extracted text from PDF {pdf_path} ({pages} pages) in {duration:.2f}s")

            return {
                "text": text,
                "pages": pages,
                "duration": duration,
                "status": "success",
                "method": "docling"
            }

        except Exception as e:
            duration = time.time() - start_time
            logger.error(f"Error extracting text from PDF {pdf_path}: {e}")

            metrics_collector.record_document('pdf', 'error', pages=1)
            metrics_collector.record_error(self.model, type(e).__name__)

            return {
                "error": str(e),
                "text": "",
                "duration": duration,
                "status": "error",
                "method": "docling"
            }

    def process_document(self, file_path: str, 
                        prompt: Optional[str] = None) -> Dict[str, Any]:
        """Process any supported document type.

        Args:
            file_path: Path to document
            prompt: Optional custom prompt for image processing

        Returns:
            Dictionary with extracted text and metadata
        """
        path = Path(file_path)
        ext = path.suffix.lower().lstrip('.')

        if ext == 'pdf':
            return self.extract_text_from_pdf(file_path)
        else:
            return self.extract_text_from_image(file_path, prompt)

    def batch_process(self, file_paths: List[str], 
                     prompt: Optional[str] = None) -> List[Dict[str, Any]]:
        """Process multiple documents.

        Args:
            file_paths: List of file paths
            prompt: Optional custom prompt for image processing

        Returns:
            List of results for each document
        """
        results = []

        for file_path in file_paths:
            result = self.process_document(file_path, prompt)
            results.append({
                "file": file_path,
                **result
            })

        return results

    def switch_model(self, model: str):
        """Switch to a different model.

        Args:
            model: Model name to switch to
        """
        if model in self.config.ollama.models:
            self.model = model
            logger.info(f"Switched to model: {model}")
            metrics_collector.set_model_info({
                "current_model": self.model,
                "available_models": ",".join(self.config.ollama.models)
            })
        else:
            logger.warning(f"Model {model} not in configured models")

# Made with Bob
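
For orientation, here is a hypothetical quick-start for driving the agent. The import path assumes the package layout shown in the project tree below, and ./_input/report.pdf is a placeholder file name.

# Hypothetical quick-start; adjust the import path to your checkout's layout.
from src.agent.ocr_agent import OCRAgent

agent = OCRAgent(model="granite3.2-vision:2b")  # omit the argument to use the configured default

# Single document: PDFs are routed to Docling, images to the Ollama vision model
result = agent.process_document("./_input/img1.png")
print(result.get("status"), result.get("text", "")[:200])

# Batch processing, with an optional custom prompt applied to images
results = agent.batch_process(
    ["./_input/img1.png", "./_input/report.pdf"],  # report.pdf is a placeholder
    prompt="Transcribe every line of text, preserving layout.",
)
for r in results:
    print(r["file"], r.get("status"), r.get("duration"))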

The whole project structure is as follows:

agent-telemetry-bob/
├── src/
│   ├── agent/
│   │   ├── __init__.py
│   │   └── ocr_agent.py              # Core OCR agent with Ollama integration
│   ├── metrics/
│   │   ├── __init__.py
│   │   ├── collector.py              # Metrics collection (Prometheus + OTEL)
│   │   └── prometheus_exporter.py    # Prometheus HTTP server
│   ├── gui/
│   │   ├── __init__.py
│   │   └── dashboard.py              # Streamlit metrics dashboard
│   ├── __init__.py
│   ├── config_manager.py             # Configuration management with Pydantic
│   └── main.py                       # CLI entry point
├── config/
│   └── config.yaml                   # Application configuration
├── monitoring/
│   ├── prometheus.yml                # Prometheus scrape config
│   ├── otel-collector-config.yml     # OpenTelemetry collector config
│   └── grafana-datasources.yml       # Grafana datasource config
├── examples/
│   └── basic_usage.py                # Usage examples
├── tests/                            # Test directory (ready for pytest)
├── docker-compose.yml                # Full stack deployment
├── Dockerfile                        # Application container
├── requirements.txt                  # Python dependencies
├── setup.py                          # Package setup
├── .gitignore                        # Git ignore rules
├── README.md                         # Full documentation
├── QUICKSTART.md                     # Quick start guide
└── PROJECT_OVERVIEW.md               # Project overview (source of this listing)
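
For reference, the configuration fields the agent reads (ollama.base_url, ollama.default_model, ollama.models, agent.supported_formats) suggest Pydantic models along these lines. This is a sketch inferred from ocr_agent.py; the actual config_manager.py, and the default values shown here, may differ.

# Hedged sketch of src/config_manager.py, inferred from the fields the agent reads.
from typing import List

import yaml
from pydantic import BaseModel


class OllamaConfig(BaseModel):
    base_url: str = "http://localhost:11434"      # Ollama's standard local endpoint
    default_model: str = "llama3.2-vision:latest"
    models: List[str] = [
        "llama3.2-vision:latest",
        "granite3.2-vision:2b",
        "qwen2-vl:2b",
    ]


class AgentConfig(BaseModel):
    # Assumed format list; the real config.yaml defines the authoritative set
    supported_formats: List[str] = ["png", "jpg", "jpeg", "pdf"]


class AppConfig(BaseModel):
    ollama: OllamaConfig = OllamaConfig()
    agent: AgentConfig = AgentConfig()


class ConfigManager:
    """Parses config/config.yaml into typed, validated settings."""

    def __init__(self, path: str = "config/config.yaml") -> None:
        with open(path) as f:
            self._config = AppConfig(**(yaml.safe_load(f) or {}))

    def get_config(self) -> AppConfig:
        return self._config


# Module-level singleton, as imported by ocr_agent.py
config_manager = ConfigManager()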

  • The main UI of the application 👇

  • And the Metrics 📊

  • Bob also provides the scripts to start and stop the services 😁

#!/bin/bash

# Start all services script for OCR Agent Telemetry
# This script provides two options: monitoring only or full stack

set -e

echo "🚀 OCR Agent Telemetry - Service Startup"
echo "========================================"
echo ""

# Set Docker host for Podman (this socket path is machine-specific; adjust it
# to your own Podman machine's API socket)
export DOCKER_HOST="unix:///var/folders/29/91237wyj7cqgg2z22rtd2mx00000gn/T/podman/podman-machine-default-api.sock"

# Check if argument is provided
if [ "$1" == "monitoring" ]; then
    echo "📊 Starting MONITORING SERVICES ONLY..."
    echo "   - Prometheus (port 9090)"
    echo "   - Grafana (port 3000)"
    echo "   - OpenTelemetry Collector (ports 4317, 4318, 8889)"
    echo ""
    docker-compose up -d prometheus grafana otel-collector

elif [ "$1" == "full" ]; then
    echo "📦 Starting FULL STACK (including OCR Agent)..."
    echo "   - OCR Agent (ports 8000, 8501)"
    echo "   - Prometheus (port 9090)"
    echo "   - Grafana (port 3000)"
    echo "   - OpenTelemetry Collector (ports 4317, 4318, 8889)"
    echo ""
    echo "⚠️  Note: This will build the OCR Agent Docker image (may take several minutes)"
    echo ""
    docker-compose up -d

else
    echo "Usage: $0 [monitoring|full]"
    echo ""
    echo "Options:"
    echo "  monitoring  - Start only monitoring services (Prometheus, Grafana, OTEL Collector)"
    echo "  full        - Start all services including OCR Agent application"
    echo ""
    echo "Examples:"
    echo "  $0 monitoring   # Quick start for monitoring only"
    echo "  $0 full         # Start everything (requires Docker build)"
    exit 1
fi

echo ""
echo "⏳ Waiting for services to be ready..."
sleep 5

echo ""
echo "✅ Services started successfully!"
echo ""
echo "📍 Access Points:"
echo "   - OpenTelemetry Collector (HTTP): http://localhost:4318"
echo "   - OpenTelemetry Collector (gRPC): localhost:4317"
echo "   - Prometheus Metrics: http://localhost:8889/metrics"
echo "   - Prometheus UI: http://localhost:9090"
echo "   - Grafana UI: http://localhost:3000 (admin/admin)"

if [ "$1" == "full" ]; then
    echo "   - OCR Agent Metrics: http://localhost:8000/metrics"
    echo "   - OCR Agent GUI: http://localhost:8501"
fi

echo ""
echo "📊 Check service status: docker-compose ps"
echo "📋 View logs: docker-compose logs -f [service-name]"
echo "🛑 Stop services: ./stop-services.sh"

# Made with Bob

And the corresponding stop script:

#!/bin/bash

# Stop all services script for OCR Agent Telemetry
# This script stops all Docker/Podman services

set -e

echo "🛑 Stopping OCR Agent Telemetry Services..."
echo ""

# Set Docker host for Podman (machine-specific path; see the note in the start script)
export DOCKER_HOST="unix:///var/folders/29/91237wyj7cqgg2z22rtd2mx00000gn/T/podman/podman-machine-default-api.sock"

# Stop all docker-compose services
echo "📦 Stopping Docker Compose services..."
docker-compose down

echo ""
echo "✅ All services stopped successfully!"
echo ""
echo "To start services again, run: ./start-services.sh"

# Made with Bob
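
As a side note, the "OCR Agent Metrics" endpoint on port 8000 that the start script advertises is served by src/metrics/prometheus_exporter.py. Assuming it uses the standard prometheus_client HTTP server, a minimal version could be as small as the sketch below; the actual file may differ.

# Minimal sketch of a Prometheus exporter, assuming prometheus_client's
# built-in HTTP server; the real prometheus_exporter.py may do more.
from prometheus_client import start_http_server


def start_exporter(port: int = 8000) -> None:
    """Expose the default registry at http://localhost:<port>/metrics."""
    start_http_server(port)  # serves /metrics from a background thread


if __name__ == "__main__":
    import time

    start_exporter()
    print("Serving Prometheus metrics on http://localhost:8000/metrics")
    while True:
        time.sleep(60)  # keep the process alive so Prometheus can scrape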

Of course, a few more ‘bells and whistles’ could make it even handier, but the core functionality is already at a production-ready standard. The fact that a developer can spin up a fully documented, instrumented application of this caliber so quickly is a testament to the power of the Bob IDE.

Does it even work?

Of course! :D

  • The input image 🤔

  • And the output 📤
python -m src.main ./_input/img1.png
2025-12-17 09:10:46,137 - INFO - OCR Agent initialized with model: llama3.2-vision:latest
2025-12-17 09:10:46,137 - INFO - Processing file: ./_input/img1.png
2025-12-17 09:11:40,517 - INFO - HTTP Request: POST http://localhost:11434/api/chat "HTTP/1.1 200 OK"
2025-12-17 09:11:40,520 - INFO - Successfully extracted text from ./_input/img1.png in 54.38s

================================================================================
File: ./_input/img1.png
Model: llama3.2-vision:latest
Duration: 54.38s
Tokens: {'prompt': 30, 'completion': 511}
================================================================================

Extracted Text:
The image presents a comprehensive overview of the various roles within the data science field, highlighting their distinct responsibilities and contributions to the industry. The image is divided into two main sections: a graphic illustrating the different roles and a list of bullet points detailing their responsibilities.

**Graphic:**

*   The graphic features a blue background with a gradient effect, transitioning from light to dark blue.
*   It showcases 12 icons, each representing a unique role in the data science field, including:
    *   Data Scientist
    *   Data Engineer
    *   Business Analyst
    *   Statistician
    *   Machine Learning Engineer
    *   AI Engineer
    *   NLP Engineer
    *   BI Analyst
    *   Data Architect
    *   Data Analyst
    *   Data Engineer
    *   Business Analyst
*   Each icon is accompanied by a brief description of the role's responsibilities.

**Bullet Points:**

*   The list of bullet points provides a detailed explanation of the roles and their responsibilities, including:
    *   Data Scientist: responsible for collecting, processing, and analyzing data to extract insights and make informed decisions.
    *   Data Engineer: responsible for designing, building, and maintaining the infrastructure that stores and processes data.
    *   Business Analyst: responsible for analyzing data to identify trends and patterns, and making recommendations for business improvement.
    *   Statistician: responsible for collecting and analyzing data to make informed decisions.
    *   Machine Learning Engineer: responsible for designing and developing machine learning models to predict outcomes and make recommendations.
    *   AI Engineer: responsible for developing and implementing AI models to improve business processes.
    *   NLP Engineer: responsible for developing and implementing NLP models to improve business processes.
    *   BI Analyst: responsible for creating reports and visualizations to help business leaders make informed decisions.
    *   Data Architect: responsible for designing and implementing data systems to support business processes.
    *   Data Analyst: responsible for collecting and analyzing data to make informed decisions.
    *   Data Engineer: responsible for designing, building, and maintaining the infrastructure that stores and processes data.
    *   Business Analyst: responsible for analyzing data to identify trends and patterns, and making recommendations for business improvement.

In summary, the image provides a comprehensive overview of the various roles within the data science field, highlighting their distinct responsibilities and contributions to the industry. The graphic and bullet points work together to provide a clear and concise understanding of the different roles and their responsibilities.
================================================================================


🚀🚀🚀🚀🚀

Conclusion: The New Standard for Development

What I experienced over the course of a single hour wasn’t just a coding session; it was a glimpse into the future of software engineering. By partnering with Bob, I was able to bridge the gap between a complex multi-model concept and a professional-grade, observability-backed application in record time. At IBM, we understand that “good enough” isn’t the standard — it has to be tested, documented, and production-ready. Bob meets that challenge head-on, handling the heavy lifting of troubleshooting and boilerplate so that we can focus on high-value innovation. Whether you are modernizing legacy systems or building the next generation of AI agents, Bob proves that with the right assistant, the distance between “idea” and “execution” has never been shorter. Super Bob does it again.

🚀 What’s Next?

Building this OCR agent was just the beginning. Now that the foundation is laid with Bob, here are a few ways I plan to take this project — and my use of Bob — even further:

  • Advanced Model Chaining: Fine-tuning how Bob handles the logic between Docling and Vision LLMs for even more complex document layouts.
  • Custom Grafana Alerts: Automating the deployment of custom alerting rules for the Prometheus/OpenTelemetry stack directly through Bob’s interface.
  • Scaling the Community: I’ll be sharing more “Bob-built” modules in the coming weeks to show how we can standardize AI-powered tools across our teams.

💬 Join the Conversation

I’m curious to see what you can build in your first 60 minutes with Bob!

  • Have you tried Bob for modernization or integration yet?
  • What’s the most “impossible” deadline you’ve met using an AI assistant?
  • Check out the repo: Head over to the GitHub repository to clone the OCR agent, try it out, and let me know your thoughts in the comments below.
Let’s keep pushing the boundaries of what’s possible — one hour at a time!

Links
