(The Senses) Image Generation & Media

#ai #openai #privacy #security

The Catalyst

Field note: Nano Banana Pro and reactive image gen

I hit a real workflow failure mode: a proactive image stack (Nano Banana Pro) that would spontaneously generate images while the agent was effectively listening in on a chat. Worse, it would regenerate images people had shared in the conversation. That was the biggest day-to-day nuisance, and a big part of why I moved to OpenAI’s image API (DALL·E): generate only when explicitly asked, not because the conversation suggested something visual.

Eyes that see too much

The moment you add images, audio, and video, the model can see your camera roll path, a cached thumbnail, or a viral meme. A malicious payload can be in the image, not the caption. I wanted senses (multimedia) without giving the model surveillance over my disk or a blank cheque to generate images for strangers.

Phase 3 of the series is The Senses: how OpenClaw exposes images, how you deny image-generation tools by default, how allow-scopes work per channel, and how to keep users engaged while a heavy operation runs.

Covered in other articles: identity leakage via workspace files and cached media (see OpenClaw Skill Shield and Setting up OpenClaw). Here the focus is tooling and config for multimedia.

Overview

In my openclaw.json:

tools.deny includes openai-image-gen at the top level, so the model is not given a casual path to DALL·E tools even if the skill package exists.
tools.media enables image, audio, and video, each with a default deny and explicit allow rules that match a channel and a keyPrefix (e.g. your owner WhatsApp direct thread key, expressed as a placeholder in docs).
skills.entries.openai-image-gen can still hold ${OPENAI_API_KEY} for when you deliberately re-enable the skill in a controlled way.

The Silas skill (SKILL.md) adds behavioural law: do not call image-gen tools for non-operator sessions, treat blocked vision input as blocked, and never guess the pixels.

In this section:

1. Image Generation: Deny First, Enable Deliberately
2. Inbound Media: Scopes, Not “On for the World”
3. Filesystem: Workspace-Only
4. Latency: Keeping Humans Calm While “Senses” Work
5. Checklist: Senses in Production

1. Image Generation: Deny First, Enable Deliberately

Mechanism	Purpose
`tools.deny: ["openai-image-gen"]`	A global deny list removes the tool from the agent’s easy reach.
Skill config `openai-image-gen.apiKey`	When you do enable, keys live in env, not in chat logs.
`SKILL.md` image-gen section	Behavioural backstop: even if a tool slipped through, the model is instructed to refuse for non-operator contacts.

New-user default: start with openai-image-gen denied until you have (a) a billing/usage cap you accept, and (b) a clear “who may request images” policy (owner session vs everyone). The Connection article (part 4) describes how my WhatsApp bridge maps allowFrom and session keys to who counts as the operator, so that “owner-only” in config and “owner-only” in SKILL.md refer to the same person in practice.
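The two layers in the table (global deny, then an operator-only rule) can be sketched as a single check. This is a minimal illustration, not OpenClaw's implementation: the Session class, the is_operator flag, the may_call_tool function, and the group key string are all hypothetical names introduced for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """Hypothetical session record; is_operator would come from your channel mapping."""
    key: str
    is_operator: bool


def may_call_tool(tool: str, deny: list[str], session: Session) -> bool:
    """Two-layer check: global deny first, then the operator-only rule for image gen."""
    if tool in deny:
        # Config layer: a denied tool is unavailable to everyone, operator included.
        return False
    if tool == "openai-image-gen" and not session.is_operator:
        # Behavioural layer (mirrors the SKILL.md rule): refuse non-operator contacts.
        return False
    return True


owner = Session(key="whatsapp:direct:+1XXXXXXXXXX", is_operator=True)
stranger = Session(key="whatsapp:group:example", is_operator=False)  # made-up key format

print(may_call_tool("openai-image-gen", ["openai-image-gen"], owner))  # False
print(may_call_tool("openai-image-gen", [], owner))                    # True
print(may_call_tool("openai-image-gen", [], stranger))                 # False
```

The first call shows the new-user default: with the tool in the deny list, even the owner cannot trigger image generation until the deny entry is deliberately removed.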

2. Inbound Media: Scopes, Not “On for the World”

Under tools.media, the image, audio, and video blocks share the same pattern:

"default": "deny"
"rules": one or more { "action": "allow", "match": { "channel": "whatsapp", "keyPrefix": "..." } }

What keyPrefix means in practice: it is a channel-specific routing key. Your OpenClaw build should document the exact string format. Treat it as a capability: only the threads you list get inbound multimodal access at the tool layer.

Example (use your own key prefix, not a copy-paste of someone’s phone number):

"media": {
  "image": {
    "enabled": true,
    "scope": {
      "default": "deny",
      "rules": [
        { "action": "allow", "match": { "channel": "whatsapp", "keyPrefix": "whatsapp:direct:+1XXXXXXXXXX" } }
      ]
    }
  }
}

Repeat the same idea for audio and video if you want symmetric behaviour. If a modality should stay off entirely, set enabled: false for that block instead of relying on empty rules.
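To make the default-deny semantics concrete, here is a small sketch of how a scope block like the one above could be evaluated. The evaluation order (first matching rule wins, prefix match on the session key, rules without a keyPrefix never match) is my assumption for illustration; check your OpenClaw build for the real behaviour. The function names scope_allows and modality_allowed are hypothetical.

```python
def scope_allows(scope: dict, channel: str, session_key: str) -> bool:
    """First matching rule wins; if nothing matches, the default applies."""
    for rule in scope.get("rules", []):
        match = rule.get("match", {})
        if match.get("channel") != channel:
            continue
        prefix = match.get("keyPrefix")
        # Assumption: a rule with no keyPrefix is ignored rather than matching everyone.
        if not prefix or not session_key.startswith(prefix):
            continue
        return rule.get("action") == "allow"
    # A missing default is treated as deny.
    return scope.get("default", "deny") == "allow"


def modality_allowed(media: dict, modality: str, channel: str, session_key: str) -> bool:
    """A modality must be enabled AND pass its scope to be usable."""
    block = media.get(modality, {})
    if not block.get("enabled", False):
        return False
    return scope_allows(block.get("scope", {}), channel, session_key)


media = {
    "image": {
        "enabled": True,
        "scope": {
            "default": "deny",
            "rules": [
                {"action": "allow",
                 "match": {"channel": "whatsapp",
                           "keyPrefix": "whatsapp:direct:+1XXXXXXXXXX"}},
            ],
        },
    }
}

print(modality_allowed(media, "image", "whatsapp", "whatsapp:direct:+1XXXXXXXXXX"))  # True
print(modality_allowed(media, "image", "whatsapp", "whatsapp:direct:+1YYYYYYYYYY"))  # False
print(modality_allowed(media, "audio", "whatsapp", "whatsapp:direct:+1XXXXXXXXXX"))  # False
```

The last call returns False because audio has no block at all, which this sketch treats the same as enabled: false.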

channels.whatsapp.mediaMaxMb: set an upper bound (my config uses 50 MB) so a single oversized upload (a document sent as video, for example) cannot exhaust disk or the gateway.
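If you also want to enforce the cap in your own glue code before handing a file to a heavy job, the check is a one-liner. A sketch only: within_media_cap is a hypothetical helper, and whether the gateway counts a megabyte as 1024 * 1024 or 10**6 bytes is an assumption you should verify.

```python
def within_media_cap(size_bytes: int, media_max_mb: int = 50) -> bool:
    """Return True if an inbound file fits under the configured cap."""
    # Assumption: 1 MB = 1024 * 1024 bytes; your gateway may count 10**6 instead.
    return size_bytes <= media_max_mb * 1024 * 1024


print(within_media_cap(10 * 1024 * 1024))      # True: a 10 MB file fits
print(within_media_cap(50 * 1024 * 1024 + 1))  # False: one byte over the cap
```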

3. Filesystem: Workspace-Only

tools.fs.workspaceOnly: true means the model’s file tools are anchored to the configured workspace, not an arbitrary path. That pairs with:

Inbound media cache living under your OpenClaw media areas (separate from random OS paths, depending on your build)
Outbound or generated files you intentionally place under workspace/media/... when you want the agent to reference them

Rule of thumb: if the LLM can read a file, assume it can be summarised or exfiltrated unless the session scope and skills forbid it. Deny is the default; allow is a contract.
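For intuition about what “anchored to the workspace” has to mean, here is a minimal containment check of the kind a workspace-only file tool needs. It is an illustrative sketch, not OpenClaw's code: resolve_in_workspace is a hypothetical helper, and the "workspace" directory name is a placeholder.

```python
from pathlib import Path


def resolve_in_workspace(workspace: Path, requested: str) -> Path:
    """Resolve `requested` against the workspace and refuse anything that escapes it."""
    root = workspace.resolve()
    # resolve() collapses ".." segments and follows symlinks before the check,
    # and joining an absolute `requested` replaces the root entirely.
    candidate = (root / requested).resolve()
    if not candidate.is_relative_to(root):  # Path.is_relative_to needs Python 3.9+
        raise PermissionError(f"{requested!r} is outside the workspace")
    return candidate


workspace = Path("workspace")  # placeholder; use your configured workspace path

print(resolve_in_workspace(workspace, "media/photo.jpg").name)  # photo.jpg

for attempt in ("../secrets.txt", "/etc/passwd"):
    try:
        resolve_in_workspace(workspace, attempt)
    except PermissionError as err:
        print("blocked:", err)
```

Resolving before comparing matters: a plain string prefix check would let media/../../secrets.txt through.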

4. Latency: Keeping Humans Calm While “Senses” Work

Problem: A minute of silence feels like a dropped message, especially on WhatsApp.

Patterns that work:

ACK early where your channel allows it (reactions, short “Received, processing” copy).
Chunk work: transcribe or describe in stages, not one giant block at the end.
Set expectations in SOUL.md / identity: the assistant can say it may take a few seconds for audio or large images.
Debounce (channel): a longer debounceMs on the WhatsApp channel reduces double-firing on slow networks. You trade a little latency for fewer duplicate heavy jobs. See the Connection article for debounceMs as wiring, not as a speed hack.
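The early-ACK pattern from the list above can be sketched with asyncio. This is an assumption-heavy illustration: handle_media, the send callback, and the ack_after_s threshold are hypothetical names, and a real channel would send a reaction or message through its own API. The sketch only ACKs when the job is actually slow, so quick replies do not get a redundant “processing” line.

```python
import asyncio
from typing import Awaitable, Callable

Send = Callable[[str], Awaitable[None]]


async def handle_media(send: Send, work: Callable[[], Awaitable[str]],
                       ack_after_s: float = 1.0) -> None:
    """Run `work`; if it is still running after `ack_after_s`, send a short ACK first."""
    task = asyncio.create_task(work())
    done, _ = await asyncio.wait({task}, timeout=ack_after_s)
    if not done:
        await send("Received, processing...")
    await send(await task)  # the final answer, once the heavy job finishes


async def demo() -> list[str]:
    sent: list[str] = []

    async def send(text: str) -> None:
        sent.append(text)  # stand-in for a channel send

    async def slow_transcription() -> str:
        await asyncio.sleep(0.2)  # stand-in for a heavy media job
        return "Transcript: ..."

    await handle_media(send, slow_transcription, ack_after_s=0.05)
    return sent


print(asyncio.run(demo()))  # ['Received, processing...', 'Transcript: ...']
```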

Reality check: a fast model with large media still hits API limits. The UX fix is communication, not overpromising in the system prompt.

Cultural matters (ties to the Voice article): when replying in a second language, a short localised “working on it” line often lands better than English.

5. Checklist: Senses in Production

Check	You want
Image gen	Deny tool globally until policy is explicit.
Inbound image/audio/video	Default deny; allow only named channel + key prefix.
Model behaviour	`SKILL.md` matches config (no “secret” image gen path).
Disk and limits	`mediaMaxMb` sane; monitor `workspace/media` growth.
User trust	Early ACK + honest latency messaging.
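The config-side rows of this checklist can be checked mechanically. Below is a rough audit sketch over a parsed openclaw.json; the key paths follow the ones named in this article, but the exact nesting in your build is an assumption, audit_senses is a hypothetical helper, and the 50 MB ceiling is simply my own setting used as a default. The SKILL.md and ACK rows still need a human review.

```python
def audit_senses(config: dict, max_mb: int = 50) -> list[str]:
    """Return human-readable findings; an empty list means the config rows pass."""
    findings = []
    tools = config.get("tools", {})

    if "openai-image-gen" not in tools.get("deny", []):
        findings.append("image gen: openai-image-gen is not in tools.deny")

    for modality in ("image", "audio", "video"):
        block = tools.get("media", {}).get(modality, {})
        if not block.get("enabled", False):
            continue  # modality is off entirely, nothing to scope
        if block.get("scope", {}).get("default") != "deny":
            findings.append(f"{modality}: scope default is not deny")

    if not tools.get("fs", {}).get("workspaceOnly", False):
        findings.append("fs: tools.fs.workspaceOnly is not true")

    cap = config.get("channels", {}).get("whatsapp", {}).get("mediaMaxMb")
    if cap is None or cap > max_mb:
        findings.append(f"limits: channels.whatsapp.mediaMaxMb is unset or above {max_mb}")

    return findings


good = {
    "tools": {
        "deny": ["openai-image-gen"],
        "fs": {"workspaceOnly": True},
        "media": {"image": {"enabled": True, "scope": {"default": "deny", "rules": []}}},
    },
    "channels": {"whatsapp": {"mediaMaxMb": 50}},
}

print(audit_senses(good))     # []
print(len(audit_senses({})))  # 3
```

An empty config produces three findings (no deny entry, no workspace lock, no size cap); media scopes are skipped because no modality is enabled.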