PaperBanana: Automating Research Diagrams With an Agentic AI Framework

#datascience #ai #machinelearning #python

Google just shipped a framework that turns natural language into publication-ready figures. Here's how the agentic pipeline actually works, with real code.

I want to tell you about the specific kind of frustration that makes researchers consider career changes.

You've just finished a three-month experiment. The results are clean, the story is clear, and all you need to do is produce the figures for the paper. Six hours later you're on Stack Overflow at 11pm, trying to figure out why matplotlib is cutting off your axis labels in the PDF export, and the insight you were excited about six hours ago feels very far away.

PaperBanana is Google AI's answer to this. It's an agentic framework that takes natural language descriptions and produces publication-ready research figures: not rough drafts that need cleanup, but figures you can drop directly into a Nature or NeurIPS submission. The GitHub activity around it has been significant, and the architecture underneath deserves attention independent of the diagram use case.

This is a technical deep-dive. We're going to cover what PaperBanana is, how the agentic loop works and how to actually use it, with code that runs.

What Makes This Different From Previous Attempts

The graveyard of "natural language to chart" tools is substantial. Most of them fail in the same way: they generate a plausible first attempt and then have no mechanism for improving it. The gap between a plausible matplotlib output and a publication-ready figure is significant (typography, colour accessibility, journal-specific formatting requirements, legend placement, resolution), and closing that gap requires iteration.

PaperBanana's core insight is that figure generation is a multi-criteria quality problem and single-pass generation can't solve it reliably. The solution is an agentic critic-generator loop that iterates until quality thresholds are met. The Critic agent produces structured, actionable feedback. The Generator agent acts on that feedback. The loop continues until the output satisfies defined publication standards or hits a maximum iteration count.

This sounds simple. It works remarkably well. And the architecture pattern generalises to any task where quality is multidimensional.
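
To make the loop concrete, here is a minimal sketch of a critic-generator cycle on a toy task (tidying a figure caption). It is not PaperBanana's code: `critic`, `generator`, `refine` and the three rules are hypothetical stand-ins for the LLM-backed agents, chosen only to show the control flow of "score, give named feedback, fix, repeat until the threshold or the iteration cap".

```python
from typing import Callable, Dict, List, Tuple

# Hypothetical fix table: each feedback label maps to one concrete repair.
FIXES: Dict[str, Callable[[str], str]] = {
    "strip_whitespace": lambda s: s.strip(),
    "capitalise": lambda s: s[:1].upper() + s[1:],
    "add_period": lambda s: s + ".",
}


def critic(caption: str) -> Tuple[float, List[str]]:
    """Toy critic: score a caption against three rules and name each failure."""
    text = caption.strip()
    problems = []
    if caption != text:
        problems.append("strip_whitespace")
    if not text[:1].isupper():
        problems.append("capitalise")
    if not text.endswith("."):
        problems.append("add_period")
    return 1.0 - len(problems) / len(FIXES), problems


def generator(caption: str, problems: List[str]) -> str:
    """Toy generator: act on the first item only, so the loop has to iterate."""
    return FIXES[problems[0]](caption)


def refine(draft: str, threshold: float = 0.85, max_iterations: int = 6):
    """Iterate until the critic's score clears the threshold or the cap is hit."""
    for iteration in range(1, max_iterations + 1):
        score, problems = critic(draft)
        if score >= threshold:
            return draft, score, iteration
        draft = generator(draft, problems)
    score, _ = critic(draft)  # re-score the final draft after the last fix
    return draft, score, max_iterations


print(refine("  accuracy by method "))
# ('Accuracy by method.', 1.0, 4)
```

The important design choice is that the critic returns named problems, not just a number. A bare score tells the generator that something is wrong; a label tells it what to change.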

The Agent Architecture

Four agents. Specific responsibilities. Structured handoffs between them.

INPUT
(natural language description + data)
         ↓
  ┌─────────────────┐
  │  PLANNER AGENT  │
  │                 │
  │ • Interprets    │
  │   request       │
  │ • Selects chart │
  │   type          │
  │ • Identifies    │
  │   data          │
  │   transforms    │
  │ • Outputs spec  │
  └────────┬────────┘
           ↓
  ┌─────────────────┐
  │  CODE GENERATOR │◄──────────────────┐
  │  AGENT          │                   │
  │                 │                   │
  │ • Translates    │                   │
  │   spec to code  │                   │
  │ • matplotlib /  │                   │
  │   seaborn /     │                   │
  │   plotly        │                   │
  └────────┬────────┘                   │
           ↓                            │
  ┌─────────────────┐                   │
  │ EXECUTOR AGENT  │                   │
  │                 │                   │
  │ • Runs code in  │                   │
  │   sandbox       │                   │
  │ • Captures      │                   │
  │   output +      │                   │
  │   errors        │                   │
  └────────┬────────┘                   │
           ↓                            │
  ┌─────────────────┐    FAIL           │
  │  CRITIC AGENT   │───────────────────┘
  │                 │
  │ • Evaluates vs  │
  │   pub standards │
  │ • Structured    │
  │   feedback      │
  └────────┬────────┘
           │ PASS
           ↓
        OUTPUT
   (publication-ready figure)

The Planner runs once. The Code Generator → Executor → Critic loop then repeats until the quality threshold is reached or the iteration cap is hit. In practice this converges in three to five iterations for most figure types.
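
The Executor step is the easiest one to picture in plain Python. The sketch below is an assumption about how such a step could work, not PaperBanana's implementation: `run_generated_code` and `ExecutionResult` are hypothetical names. It runs a generated script in a separate interpreter with a timeout and captures stdout and stderr. A subprocess gives you isolation from the host process, not a security sandbox, so treat it as the minimum rather than the goal.

```python
import subprocess
import sys
import tempfile
from dataclasses import dataclass
from pathlib import Path


@dataclass
class ExecutionResult:
    ok: bool
    stdout: str
    stderr: str


def run_generated_code(code: str, timeout_s: float = 30.0) -> ExecutionResult:
    """Run generated plotting code in a fresh interpreter and capture its output."""
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "figure.py"
        script.write_text(code, encoding="utf-8")
        try:
            proc = subprocess.run(
                [sys.executable, str(script)],
                cwd=workdir,          # any files the script writes land in the temp dir
                capture_output=True,
                text=True,
                timeout=timeout_s,
            )
        except subprocess.TimeoutExpired:
            return ExecutionResult(False, "", f"timed out after {timeout_s}s")
    return ExecutionResult(proc.returncode == 0, proc.stdout, proc.stderr)


good = run_generated_code("print('figure saved')")
bad = run_generated_code("raise ValueError('figsize must be positive')")
print(good.ok, bad.ok)
# True False
```

When a run fails, `bad.stderr` holds the traceback, which is the kind of text a generator needs in order to repair its own code on the next pass.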

The Critic agent is the piece that makes this work. It doesn't output "this needs improvement"; it outputs structured feedback with severity ratings and specific suggested fixes:

{
  "iteration": 2,
  "quality_score": 0.76,
  "feedback": [
    {
      "severity": "HIGH",
      "category": "formatting",
      "message": "Figure width 195mm exceeds two-column maximum of 180mm",
      "suggested_fix": "Set figsize=(7.09, height) — 7.09 inches = 180mm"
    },
    {
      "severity": "MEDIUM", 
      "category": "accessibility",
      "message": "Colour palette fails deuteranopia simulation",
      "suggested_fix": "Replace current palette with Wong 2011: ['#000000', '#E69F00', '#56B4E9', '#009E73', '#F0E442', '#0072B2', '#D55E00', '#CC79A7']"
    },
    {
      "severity": "LOW",
      "category": "typography",
      "message": "Axis label font weight lighter than journal standard",
      "suggested_fix": "Add fontweight='bold' to xlabel() and ylabel() calls"
    }
  ]
}

This specificity is what makes the loop converge efficiently. The Code Generator doesn't need to guess what to fix; it receives exact, implementable instructions.
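
Here is one way a generator-side consumer of that JSON could look. This is a sketch under assumptions: `FeedbackEntry`, `parse_feedback` and `to_instructions` are hypothetical helpers, and only the field names (`severity`, `category`, `message`, `suggested_fix`) come from the example above. It parses the payload, sorts items so HIGH severity comes first, and renders them as a numbered instruction list.

```python
import json
from dataclasses import dataclass
from typing import List

SEVERITY_ORDER = {"HIGH": 0, "MEDIUM": 1, "LOW": 2}


@dataclass(frozen=True)
class FeedbackEntry:
    severity: str
    category: str
    message: str
    suggested_fix: str


def parse_feedback(raw: str) -> List[FeedbackEntry]:
    """Parse critic JSON and return entries sorted by severity, highest first."""
    payload = json.loads(raw)
    entries = [FeedbackEntry(**item) for item in payload["feedback"]]
    return sorted(entries, key=lambda entry: SEVERITY_ORDER[entry.severity])


def to_instructions(entries: List[FeedbackEntry]) -> str:
    """Render entries as a numbered list a code generator can act on."""
    return "\n".join(
        f"{n}. [{e.severity}/{e.category}] {e.message}. Fix: {e.suggested_fix}"
        for n, e in enumerate(entries, start=1)
    )


raw = """
{
  "iteration": 2,
  "quality_score": 0.76,
  "feedback": [
    {"severity": "LOW", "category": "typography",
     "message": "Axis label font weight lighter than journal standard",
     "suggested_fix": "Add fontweight='bold' to xlabel() and ylabel() calls"},
    {"severity": "HIGH", "category": "formatting",
     "message": "Figure width 195mm exceeds two-column maximum of 180mm",
     "suggested_fix": "Set figsize=(7.09, height)"}
  ]
}
"""

print(to_instructions(parse_feedback(raw)))
```

Sorting by severity matters when the iteration budget is small: a width violation that gets the figure rejected should be fixed before a font-weight nit.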

Installation and Setup

git clone https://github.com/google-research/paperbanana
cd paperbanana
pip install -r requirements.txt

PaperBanana supports multiple LLM backends. For the Critic agent specifically, Claude Sonnet produced notably better structured feedback than the alternatives in our testing, and the specificity and actionability of that feedback directly affect how fast the loop converges.

from paperbanana import PaperBanana, Config

config = Config(
    llm_backend="anthropic",          # or "openai"
    model="claude-sonnet-4-5",
    max_iterations=6,
    quality_threshold=0.85,
    output_format="pdf",
    dpi=300,
    style_preset="nature"
    # Options: "nature", "science", "ieee", 
    #          "neurips", "arxiv", "custom"
)

pb = PaperBanana(config=config)
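
It helps to think of a style preset as a small bundle of journal constraints. The sketch below is a guess at that idea, not PaperBanana's internal format: `StylePreset` and its methods are hypothetical. The 180mm width comes from the Critic example above and the 9pt Helvetica Neue from the Nature-style request in the next section; the dictionary keys are standard matplotlib rcParams names.

```python
from dataclasses import dataclass
from typing import Dict, Tuple

MM_PER_INCH = 25.4


@dataclass(frozen=True)
class StylePreset:
    """Hypothetical bundle of journal constraints."""
    max_width_mm: float
    font_family: str
    font_size_pt: float

    def figsize(self, aspect: float = 0.6) -> Tuple[float, float]:
        """Full-width (width, height) in inches; aspect is height / width."""
        width_in = self.max_width_mm / MM_PER_INCH
        return round(width_in, 2), round(width_in * aspect, 2)

    def pixel_width(self, dpi: int) -> int:
        """Width in pixels if the figure is rasterised at the given dpi."""
        return round(self.max_width_mm / MM_PER_INCH * dpi)

    def rc_params(self, dpi: int = 300) -> Dict[str, object]:
        """Settings in the shape matplotlib.rcParams.update() accepts."""
        return {
            "figure.figsize": self.figsize(),
            "font.family": self.font_family,
            "font.size": self.font_size_pt,
            "savefig.dpi": dpi,
            "pdf.fonttype": 42,  # embed TrueType fonts so text stays editable
        }


nature = StylePreset(max_width_mm=180, font_family="Helvetica Neue", font_size_pt=9)
print(nature.figsize())         # (7.09, 4.25)
print(nature.pixel_width(300))  # 2126
```

If you ever need to reproduce a figure by hand, passing `nature.rc_params()` to `matplotlib.rcParams.update` before plotting gets you most of the way to the same dimensions and typography.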

Your First Publication Figure

The simplest use case. Describe what you want, provide your data, get a figure:

import pandas as pd

# Experimental results data
results = pd.DataFrame({
    'Method': ['Baseline', 'Method A', 'Method B', 
               'Method C', 'Ours'],
    'Accuracy': [71.3, 78.6, 82.1, 84.7, 89.3],
    'F1_Score': [68.9, 76.2, 80.4, 83.1, 87.8],
    'Inference_ms': [12.3, 45.6, 38.2, 61.4, 29.7],
    'Accuracy_std': [0.8, 1.1, 0.9, 1.3, 0.7],
    'F1_std': [1.2, 0.9, 1.1, 1.0, 0.8]
})

description = """
Create a grouped bar chart comparing five methods on 
Accuracy and F1 Score. Use a colourblind-accessible palette.
Include error bars from the std columns. Highlight the 
'Ours' group with a distinct visual treatment.
Add a horizontal dashed line at 85 labeled 
'State-of-the-art threshold'. Nature journal formatting,
two-column width (180mm), 9pt Helvetica Neue.
Legend outside plot area, upper right.
"""

result = pb.generate(
    description=description,
    data=results,
    output_path="./figures/method_comparison.pdf"
)

print(f"Iterations: {result.iterations}")
print(f"Quality score: {result.quality_score:.2f}")
print(f"Saved: {result.output_path}")

# Output:
# Iterations: 4
# Quality score: 0.91
# Saved: ./figures/method_comparison.pdf

Four iterations. Quality score above threshold. Figure ready for submission.

Inspecting the Iteration Log

The iteration inspector is one of PaperBanana's most useful features for understanding what the agent loop is doing and for debugging when it doesn't converge the way you expect:

# Inspect what happened at each iteration
for i, step in enumerate(result.iteration_log):
    print(f"\n{'─'*48}")
    print(f"ITERATION {i+1} │ Quality: {step.quality_score:.2f}")
    print(f"{'─'*48}")

    if step.critic_feedback:
        print("Critic feedback:")
        for item in step.critic_feedback:
            icon = "🔴" if item.severity == "HIGH" else \
                   "🟡" if item.severity == "MEDIUM" else "🟢"
            print(f"  {icon} [{item.category}] {item.message}")
    else:
        print("  ✓ Quality threshold met")

For the method comparison figure above, the log looked like this:

────────────────────────────────────────────────
ITERATION 1 │ Quality: 0.58
────────────────────────────────────────────────
Critic feedback:
  [accessibility] Default colour cycle fails 
     protanopia simulation — replace with 
     colourblind-safe palette
  [formatting] Figure 210mm wide, exceeds 
     two-column 180mm maximum
  [typography] Legend inside plot area, 
     overlapping bars at right edge
  [data] Error bars present but cap size 0 — 
     not visible in print
  [style] Minor: grid lines too prominent, 
     reduce alpha to 0.3

────────────────────────────────────────────────
ITERATION 2 │ Quality: 0.74
────────────────────────────────────────────────
Critic feedback:
  [formatting] 'Ours' group not visually 
     distinct — add hatching or edge highlight
  [typography] Axis labels 8pt, journal 
     minimum 9pt
  [data] State-of-the-art line label font 
     size inconsistent with axis labels

────────────────────────────────────────────────
ITERATION 3 │ Quality: 0.86
────────────────────────────────────────────────
Critic feedback:
  [formatting] Minor: x-axis label padding 
     slightly tight — increase labelpad to 8

────────────────────────────────────────────────
ITERATION 4 │ Quality: 0.91
────────────────────────────────────────────────
  ✓ Quality threshold met — output generated

This is the loop in practice. The first iteration catches the structural issues: wrong dimensions, inaccessible colours. The second iteration catches the medium-severity items. By iteration three the feedback is minor. Iteration four crosses the threshold.
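
When a run does not converge, the per-iteration scores usually tell you why. The helper below is a hypothetical diagnostic, not part of PaperBanana: `diagnose` and its `min_gain` cut-off are assumptions. It classifies a score history as converged, regressing, stalled or still improving, and the example scores are made up for illustration.

```python
from typing import List


def diagnose(scores: List[float], threshold: float, min_gain: float = 0.02) -> str:
    """Classify a run from its per-iteration quality scores."""
    if not scores:
        return "no iterations recorded"
    if scores[-1] >= threshold:
        return f"converged after {len(scores)} iterations"
    # Gain between each pair of consecutive iterations.
    gains = [later - earlier for earlier, later in zip(scores, scores[1:])]
    if gains and gains[-1] < 0:
        return "regressing: the last fix lowered the score"
    if len(gains) >= 2 and all(gain < min_gain for gain in gains[-2:]):
        return "stalled: feedback is no longer moving the score"
    return "still improving: consider a higher iteration cap"


print(diagnose([0.60, 0.70, 0.71, 0.715], threshold=0.85))
# stalled: feedback is no longer moving the score
```

With the iteration log from the snippet above, the input would be `[step.quality_score for step in result.iteration_log]`. A stalled run usually points at a description the Critic's criteria cannot satisfy, while a regressing run suggests two feedback items are pulling against each other.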

Extending the Critic for Journal-Specific Requirements

The built-in Critic uses general publication standards. For specific journal requirements or custom style guides, extend it:

from paperbanana.agents import CriticAgent
from paperbanana.evaluation import EvaluationCriteria, FeedbackItem

class NeurIPS2026Critic(CriticAgent):
    """
    Critic extended with NeurIPS 2026 
    camera-ready requirements.
    """

    REQUIREMENTS = EvaluationCriteria(
        max_width_mm=177,
        min_font_size_pt=9,
        required_font_family="Times New Roman",
        colour_accessibility=True,
        max_pdf_size_mb=10,
        required_format="PDF/A",
        prohibited_elements=["rasterized_text", "embedded_fonts_missing"]
    )

    def evaluate(self, figure, code, spec):
        # Run base evaluation
        base = super().evaluate(figure, code, spec)

        # Layer NeurIPS-specific checks
        neurips_items = []

        # Dimension check
        width_mm = figure.width_inches * 25.4
        if width_mm > self.REQUIREMENTS.max_width_mm:
            neurips_items.append(FeedbackItem(
                severity="HIGH",
                category="neurips_format",
                message=f"Width {width_mm:.1f}mm exceeds "
                        f"NeurIPS max {self.REQUIREMENTS.max_width_mm}mm",
                suggested_fix=f"Set figsize width to "
                              f"{self.REQUIREMENTS.max_width_mm/25.4:.2f}"
                              f" inches"
            ))

        # Font check  
        detected_font = figure.get_primary_font()
        if detected_font != self.REQUIREMENTS.required_font_family:
            neurips_items.append(FeedbackItem(
                severity="HIGH",
                category="neurips_format", 
                message=f"Font '{detected_font}' — NeurIPS requires "
                        f"'{self.REQUIREMENTS.required_font_family}'",
                suggested_fix="Set plt.rcParams['font.family'] = "
                             "'Times New Roman' before plotting"
            ))

        return base.merge_feedback(neurips_items)

# Use custom critic
config = Config(
    critic_agent=NeurIPS2026Critic(),
    llm_backend="anthropic",
    model="claude-sonnet-4-5",
    quality_threshold=0.90  # Higher bar for camera-ready
)

pb = PaperBanana(config=config)

The pattern (a base agent with domain-specific extension via structured feedback items) applies directly to other agentic use cases. Document review agents with organisation-specific criteria. Code review agents with team-specific standards. Data validation agents with domain-specific rules. The architecture is the same.
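
To show the same shape outside figure generation, here is a small self-contained sketch for the code review case. Everything in it is hypothetical (`Finding`, `BaseCodeCritic`, `TeamCodeCritic` and both rules); it uses plain string checks where a real critic would call an LLM or a linter. The point is the structure: the subclass calls the base evaluation, then appends its own structured findings.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Finding:
    severity: str
    category: str
    message: str
    suggested_fix: str


class BaseCodeCritic:
    """General-purpose checks that apply to any Python snippet."""

    max_line_length = 100

    def evaluate(self, source: str) -> List[Finding]:
        findings = []
        for lineno, line in enumerate(source.splitlines(), start=1):
            if len(line) > self.max_line_length:
                findings.append(Finding(
                    severity="LOW",
                    category="style",
                    message=f"Line {lineno} is {len(line)} characters long",
                    suggested_fix=f"Wrap at {self.max_line_length} characters",
                ))
        return findings


class TeamCodeCritic(BaseCodeCritic):
    """Layers one team-specific rule on top of the base checks."""

    def evaluate(self, source: str) -> List[Finding]:
        findings = super().evaluate(source)
        for lineno, line in enumerate(source.splitlines(), start=1):
            if line.strip().startswith("print("):
                findings.append(Finding(
                    severity="HIGH",
                    category="team_standard",
                    message=f"Line {lineno} uses print() for diagnostics",
                    suggested_fix="Use the logging module instead",
                ))
        return findings


snippet = "import os\nprint('debug value')\n"
for finding in TeamCodeCritic().evaluate(snippet):
    print(f"[{finding.severity}] {finding.message}: {finding.suggested_fix}")
# [HIGH] Line 2 uses print() for diagnostics: Use the logging module instead
```

Because both critics return the same `Finding` type, whatever consumes the feedback does not need to know which layer produced it.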

Batch Generation for Multi-Figure Papers

Real papers have multiple figures and they need to be visually consistent:

# training_df, ablation_df and sample_images are assumed to be loaded earlier
figure_set = [
    {
        "id": "fig1_training_curves",
        "description": """
            Training and validation loss curves for three 
            model variants over 100 epochs. Log scale y-axis. 
            Use solid lines for training, dashed for validation.
            Mark the convergence epoch with a vertical line.
        """,
        "data": training_df
    },
    {
        "id": "fig2_ablation",
        "description": """
            Horizontal bar chart showing ablation study results.
            Highlight the full model row. Sort by performance 
            descending. Include percentage improvement labels 
            on each bar.
        """,
        "data": ablation_df
    },
    {
        "id": "fig3_qualitative",
        "description": """
            3x3 grid showing input/output pairs for qualitative 
            evaluation. Three rows: success cases, failure cases,
            edge cases. Add a thin red border on failure cases.
        """,
        "data": sample_images
    }
]

batch = pb.generate_batch(
    figures=figure_set,
    output_dir="./paper_figures/",
    consistency_check=True,  # Verify visual consistency 
                              # across all figures
    shared_style={
        "style_preset": "neurips",
        "colour_palette": "wong2011",
        "base_font_size": 9
    }
)

for fig_id, result in batch.results.items():
    status = "✓" if result.converged else "⚠"
    print(f"{status} {fig_id}: "
          f"{result.iterations} iterations, "
          f"quality {result.quality_score:.2f}")

# Output:
# ✓ fig1_training_curves: 3 iterations, quality 0.88
# ✓ fig2_ablation: 4 iterations, quality 0.91  
# ✓ fig3_qualitative: 5 iterations, quality 0.87

The consistency_check=True parameter runs a post-generation agent pass that verifies colour palette consistency, font size matching and style coherence across all figures. It's the detail that's tedious to manage manually and that PaperBanana handles automatically.
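
The article describes that pass as agent-driven. As a rough mental model only, a rule-based version could compare each figure's style against the majority value for each setting. The function below is a hypothetical stand-in (`check_consistency` and the style dictionaries are made up for illustration), not what PaperBanana runs.

```python
from collections import Counter
from typing import Dict, List, Sequence


def check_consistency(
    styles: Dict[str, Dict[str, object]],
    keys: Sequence[str] = ("palette", "font_size"),
) -> List[str]:
    """Report figures whose style differs from the majority value for each key."""
    if not styles:
        return []
    issues = []
    for key in keys:
        counts = Counter(style.get(key) for style in styles.values())
        expected, _ = counts.most_common(1)[0]
        for fig_id, style in styles.items():
            if style.get(key) != expected:
                issues.append(
                    f"{fig_id}: {key}={style.get(key)!r}, expected {expected!r}"
                )
    return issues


styles = {
    "fig1_training_curves": {"palette": "wong2011", "font_size": 9},
    "fig2_ablation": {"palette": "wong2011", "font_size": 9},
    "fig3_qualitative": {"palette": "tab10", "font_size": 8},
}
for issue in check_consistency(styles):
    print(issue)
# fig3_qualitative: palette='tab10', expected 'wong2011'
# fig3_qualitative: font_size=8, expected 9
```

Each issue string has the same role as a Critic feedback item: it names the figure, the setting and the expected value, so it can be fed back into another generation pass.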

Where the Architecture Goes Beyond Research

The critic-generator loop with structured feedback is the pattern. PaperBanana implements it for research figures. The same architecture handles any task where quality is multidimensional and single-pass generation can't reliably satisfy all dimensions simultaneously.

Code review with team-specific standards. Document formatting with compliance requirements. Data pipeline validation against schema contracts. Report generation with brand guidelines. The structure is identical: define your quality criteria in the Critic, let the Generator iterate against them, exit when thresholds are met.

Understanding how PaperBanana implements this at a concrete level is the most transferable thing in this article. The diagram generation is useful. The agentic pattern underneath it is what you want to carry into your next project.

For the full deep-dive into the PaperBanana agentic AI framework, the Dextra Labs writeup covers what a single Dev.to article can't fit: the Planner's specification format, the Critic's evaluation rubrics and the prompt engineering that makes the loop converge reliably.

This is one example of agentic AI solving a specific, high-friction workflow in research. Dextra Labs builds and deploys production agentic systems at enterprise scale across industries: custom agent architectures for document processing, data validation and complex multi-step automation across real enterprise workflows. The patterns that make PaperBanana reliable in a research context are the same patterns that make enterprise agents reliable in production.