Yoshiaki Hirokawa

Posted on Jun 27

How I auto-generate 800+ App Store screenshots across 39 languages and 3 devices

#ios #swift #automation #indiedev

App Store screenshots are the highest-leverage marketing asset an app has — and the most painful to maintain. Now multiply that pain by 39 languages and 3 device classes. Doing that by hand is not "tedious," it's impossible to keep in sync.

So I built a pipeline that turns one command into 819 finished, captioned, device-correct screenshots for Cadento, my SwiftUI focus timer. Here's the architecture.

The scale problem

The output target:

Device	Shots per language	Languages	Total
iPhone 6.9"	10	39	390
iPad 13"	8	39	312
Apple Watch	3	39	117

That's 819 images (390 + 312 + 117), each needing the right localized UI and the right localized caption. Change one screen design and every image of that screen has to be regenerated in all 39 languages. Hand-editing is off the table; the only sane answer is "rebuild everything from source on demand."

The pipeline, end to end

XCUITest (per language)  →  raw localized PNGs
        ↓
extract from .xcresult
        ↓
Python + Pillow: compose background + device frame + caption
        ↓
AppStore画像/<device>/<lang>/1..N.png   (exact store dimensions)

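Five stages in total: the steps above, plus a separate capture path for Live Activity and Apple Watch shots that feeds the same compositor. Each is independently re-runnable.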

Stage 1 — Capture real localized screens with XCUITest

The key insight: don't fake screenshots, drive the real app. A UI test launches the app, forces a specific language/locale, navigates to each screen, and snapshots it.

Language and locale come in as environment variables so one test file covers every language:

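// Inside an XCTestCase test method (the file needs `import XCTest`).
let app = XCUIApplication()

// Language and locale are injected from outside; fall back to US English.
let lang   = ProcessInfo.processInfo.environment["SHOT_LANG"]   ?? "en"
let locale = ProcessInfo.processInfo.environment["SHOT_LOCALE"] ?? "en_US"

// Override the app's language and region for this launch only.
app.launchArguments += ["-AppleLanguages", "(\(lang))"]
app.launchArguments += ["-AppleLocale", locale]
app.launch()

// Navigate to each screen, then attach a screenshot that is kept in the .xcresult.
let shot = XCTAttachment(screenshot: app.screenshot())
shot.lifetime = .keepAlways
add(shot)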

A shell loop runs this once per language. Because it's the actual app, the screenshots are guaranteed to match what users see — including RTL flips for Arabic/Hebrew and text expansion in German.
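
The per-language loop could be sketched in Python as below, to match the rest of the pipeline (the original uses a shell loop). The scheme name, simulator name, results folder, and the `LOCALES` map are assumptions for illustration only. One detail worth knowing: xcodebuild forwards an environment variable to the test runner only when it carries a `TEST_RUNNER_` prefix, which it strips, so `TEST_RUNNER_SHOT_LANG` arrives in the test as `SHOT_LANG`.

```python
import os
import shutil
import subprocess

# Hypothetical subset of the 39 languages: app language -> region locale.
LOCALES = {"en": "en_US", "ja": "ja_JP", "de": "de_DE", "ar": "ar_SA"}


def capture_command(lang, locale, scheme="Cadento",
                    destination="platform=iOS Simulator,name=iPhone 16 Pro Max",
                    results_dir="results"):
    """Build the xcodebuild invocation and environment for one language."""
    bundle = os.path.join(results_dir, f"{lang}.xcresult")
    cmd = [
        "xcodebuild", "test",
        "-scheme", scheme,                # assumed scheme name
        "-destination", destination,      # assumed simulator
        "-resultBundlePath", bundle,
    ]
    env = dict(os.environ)
    # The TEST_RUNNER_ prefix is stripped before the test process sees it.
    env["TEST_RUNNER_SHOT_LANG"] = lang
    env["TEST_RUNNER_SHOT_LOCALE"] = locale
    return cmd, env, bundle


def capture_all(locales=LOCALES):
    """Run the UI test once per language, one result bundle each."""
    for lang, locale in locales.items():
        cmd, env, bundle = capture_command(lang, locale)
        # xcodebuild refuses to overwrite an existing result bundle.
        shutil.rmtree(bundle, ignore_errors=True)
        subprocess.run(cmd, env=env, check=True)
```

Keeping the command builder separate from the runner makes it easy to print the commands for a dry run before committing to a full 39-language capture.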

Stage 2 — Extract PNGs from the .xcresult

XCUITest buries screenshots inside an .xcresult bundle. A small Python script walks the result and pulls out the raw PNGs into a flat per-language folder. Nothing clever — just plumbing so the next stage has clean inputs.
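
A minimal sketch of the flattening half of this step. It assumes the attachments have already been exported from the .xcresult into a folder (for example with Xcode's xcresulttool), that the export folder and the output folder are separate, and that the exported file names sort in capture order. The function name and folder layout are hypothetical, not the author's script.

```python
import shutil
from pathlib import Path


def flatten_pngs(exported_dir, out_root, lang):
    """Copy every PNG under exported_dir into out_root/<lang>/01.png, 02.png, ..."""
    out_dir = Path(out_root) / lang
    out_dir.mkdir(parents=True, exist_ok=True)

    # Clear old output so a re-run with fewer shots leaves no stale files.
    for old in out_dir.glob("*.png"):
        old.unlink()

    written = []
    # Assumption: sorted paths match the order the screens were captured in.
    for index, src in enumerate(sorted(Path(exported_dir).rglob("*.png")), start=1):
        dst = out_dir / f"{index:02d}.png"
        shutil.copyfile(src, dst)
        written.append(dst)
    return written
```

Giving each attachment a sortable name in the UI test (an assumed convention such as a numeric prefix) would make the ordering assumption hold by construction.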

Stage 3 — Compose with Python + Pillow

This is where raw screens become marketing. For each shot, Pillow:

1. Draws the branded background (generated separately, app-themed gradients)
2. Places the device frame
3. Drops the raw screenshot into the frame at the correct offset
4. Renders the localized caption on top, pulled from a per-language strings map

The caption text is itself localized (39 languages of ASO copy), so the marketing message reads natively, not just the UI underneath it. Font fallback matters here: CJK, Arabic, Hebrew, Thai, and Devanagari all need the right font or you get tofu (□□□).
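
One way to handle font fallback is an explicit language-to-font table with a default. The sketch below uses Noto font file names as an assumption; the post does not say which fonts the real pipeline uses, and the function name is hypothetical.

```python
# Assumed font files; any face that covers the script works.
FONT_BY_LANG = {
    "ja": "NotoSansCJKjp-Bold.otf",
    "ko": "NotoSansCJKkr-Bold.otf",
    "zh-Hans": "NotoSansCJKsc-Bold.otf",
    "zh-Hant": "NotoSansCJKtc-Bold.otf",
    "ar": "NotoSansArabic-Bold.ttf",
    "he": "NotoSansHebrew-Bold.ttf",
    "th": "NotoSansThai-Bold.ttf",
    "hi": "NotoSansDevanagari-Bold.ttf",
}
DEFAULT_FONT = "NotoSans-Bold.ttf"  # Latin, Cyrillic, Greek


def font_file_for(lang):
    """Pick a caption font: exact tag first, then base language, then default."""
    if lang in FONT_BY_LANG:
        return FONT_BY_LANG[lang]
    base = lang.split("-")[0]  # e.g. "ar-SA" falls back to "ar"
    return FONT_BY_LANG.get(base, DEFAULT_FONT)
```

Picking the right file only prevents tofu. Scripts that need shaping, such as Arabic and Devanagari, may also require Pillow to be built with its Raqm layout engine to render joined letters and conjuncts correctly, so it is worth checking those captions by eye.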

Stage 4 — Live Activity & Watch shots

Live Activity (Dynamic Island / lock screen) and Apple Watch screens are generated through their own paths and folded into the same compositor, so the final set is consistent across all surfaces.

Stage 5 — Output to exact store dimensions

Everything lands in a predictable tree at the exact pixel sizes App Store Connect requires:

AppStore画像/iPhone_6.9/<lang>/1..10.png   (1320×2868)
AppStore画像/iPad_13/<lang>/1..8.png       (2064×2752)
AppStore画像/AppleWatch/<lang>/1..3.png    (410×502)
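
A completeness check over this tree is cheap to add before uploading. The sketch below reads the width and height straight from each PNG header using only the standard library; the `SPEC` table mirrors the counts and sizes listed above, while the function names are hypothetical.

```python
import struct
from pathlib import Path

PNG_SIG = b"\x89PNG\r\n\x1a\n"

# device folder -> (shots per language, (width, height)), from the tree above
SPEC = {
    "iPhone_6.9": (10, (1320, 2868)),
    "iPad_13": (8, (2064, 2752)),
    "AppleWatch": (3, (410, 502)),
}


def png_size(path):
    """Return (width, height) from the IHDR chunk at the start of a PNG."""
    with open(path, "rb") as f:
        head = f.read(24)
    if len(head) < 24 or head[:8] != PNG_SIG or head[12:16] != b"IHDR":
        raise ValueError(f"not a PNG: {path}")
    return struct.unpack(">II", head[16:24])


def check_tree(root, langs, spec=SPEC):
    """List every missing or wrongly sized image under root."""
    problems = []
    for device, (count, size) in spec.items():
        for lang in langs:
            for i in range(1, count + 1):
                path = Path(root) / device / lang / f"{i}.png"
                if not path.is_file():
                    problems.append(f"missing: {path}")
                elif png_size(path) != size:
                    problems.append(f"wrong size {png_size(path)}: {path}")
    return problems
```

An empty list from `check_tree("AppStore画像", langs)` with all 39 language codes means all 819 files exist at the expected dimensions.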

From here it's a straight upload (I drive App Store Connect's API to swap a single device's set without touching the others — but that's another post).

Lessons from running it for real

- Drive the real app, don't mock. The whole value is that screenshots can't lie about what the UI does in each language.
- Environment variables > 39 test targets. One parameterized UI test beats copy-pasted code every time.
- Font fallback is not optional. Test the hardest scripts (Arabic, Thai, Hindi, CJK) early or you'll ship boxes.
- Make every stage idempotent. A design change should be one command away from 819 fresh images, not a weekend.
- Separate UI capture from caption rendering. Redesign the screen? Re-run stage 1. Rewrite the marketing copy? Re-run stage 3. They shouldn't be coupled.
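
The last two lessons can be made concrete with a timestamp check, so that a stage rebuilds an image only when one of its inputs is newer than the output. This is a make-style sketch with hypothetical names; the post does not describe how the real pipeline decides what to re-run.

```python
from pathlib import Path


def needs_rebuild(output, inputs):
    """True if output is missing or older than any of its inputs."""
    out = Path(output)
    if not out.exists():
        return True
    out_mtime = out.stat().st_mtime
    return any(Path(p).stat().st_mtime > out_mtime for p in inputs)
```

For the compositor, the inputs of one final image would be its raw screenshot and the caption file for that language, so editing the marketing copy invalidates stage 3 outputs without forcing a new capture.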

The payoff: when I change a screen or a tagline, I'm not dreading a manual marathon. I run the pipeline, and the entire localized store presence updates itself.

I'm a solo iOS developer from Japan building small, deeply localized apps. Cadento (focus timer, 39 languages) is on the App Store. Ask me anything about the pipeline in the comments.

Top comments (1)

Hayrullah Kar • Jun 27

Managing 800+ localized screenshots manually is an absolute pipeline bottleneck. Parameterizing XCUITest via env vars and decoupling the frame/caption compositing is pure engineering maturity.