DEV Community

Lucas Santos Rodrigues

How to Use Replicate the Right Way in Your Next.js App (And Ship a Real Product With It)

Most tutorials show you how to call Replicate. Few show you how to use it well inside a real production app. This article covers the mistakes I made and the patterns that actually work — using Goodbye Watermark as a real-world case study.


What Is Replicate, Really?

Replicate is a cloud API that lets you run AI models — image generation, video, audio, vision — without owning a single GPU. You send an HTTP request, a model runs on their infrastructure, and you get the result back.

The business model is pay-per-prediction: you're charged for the time the model actually runs, not idle time. That means cold boots don't affect your cost — only your latency.


1. Understand the Prediction Lifecycle Before Writing Any Code

Every Replicate call creates a prediction — an object with a lifecycle:

starting → processing → succeeded (or failed / canceled)
  • starting: model is booting (cold start happens here)
  • processing: predict() is actively running
  • succeeded: output is ready — but files are deleted after 1 hour

That last point is critical. If you're not saving outputs immediately, you'll lose them. More on that below.
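The lifecycle above is small enough to encode directly, which makes status checks explicit in your own code. A minimal TypeScript sketch (the type and helper names are mine, not the SDK's):

```typescript
// The statuses a prediction moves through, per the lifecycle above.
type PredictionStatus =
  | "starting"
  | "processing"
  | "succeeded"
  | "failed"
  | "canceled";

// Terminal states: once a prediction reaches one of these, it will
// never change again, so polling can stop.
function isTerminal(status: PredictionStatus): boolean {
  return status === "succeeded" || status === "failed" || status === "canceled";
}
```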


2. Polling vs. Webhooks: Choose the Right Strategy

Replicate gives you three ways to handle async predictions:

Polling (simplest, fine for most apps)

// Create the prediction
const prediction = await replicate.predictions.create({
  model: "owner/model-name",
  input: { image: imageUrl },
});

// Poll until done
let result = prediction;
while (result.status !== "succeeded" && result.status !== "failed") {
  await new Promise((r) => setTimeout(r, 1000));
  result = await replicate.predictions.get(result.id);
}

Works well for short-lived predictions (under ~15s). Simple to implement. The tradeoff: you're making repeated requests even when nothing has changed.
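The loop above can also be factored into a reusable helper. This is a sketch with the getter injected, so it isn't tied to the Replicate client; in real use you'd pass `(id) => replicate.predictions.get(id)`:

```typescript
// Poll a prediction until it reaches a terminal state.
// `get` stands in for replicate.predictions.get, injected for testability.
async function pollUntilDone<T extends { id: string; status: string }>(
  first: T,
  get: (id: string) => Promise<T>,
  intervalMs = 1000,
): Promise<T> {
  let current = first;
  while (current.status !== "succeeded" && current.status !== "failed") {
    await new Promise((r) => setTimeout(r, intervalMs));
    current = await get(current.id);
  }
  return current;
}
```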

Webhooks (better for longer or background tasks)

const prediction = await replicate.predictions.create({
  model: "owner/model-name",
  input: { image: imageUrl },
  // VERCEL_URL has no protocol, so prefix it yourself
  webhook: `https://${process.env.VERCEL_URL}/api/webhooks`,
  webhook_events_filter: ["completed"], // only fire when done
});

Replicate POSTs to your URL when the prediction finishes. No polling loop. If there are network issues, they retry automatically.

Use webhooks when:

  • Predictions take more than ~10-15 seconds
  • You want to persist results to a database
  • You're building background processing flows

Tip: Add query params to your webhook URL to carry context:

https://yourapp.com/api/webhooks?userId=abc123&predictionType=watermark
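On the receiving end, your handler can read that context back out of the request URL. A minimal, framework-free sketch (the helper name is mine; in a Next.js route handler you'd call it with `request.url`):

```typescript
// Recover the context smuggled into the webhook URL's query string.
function webhookContext(url: string): {
  userId: string | null;
  predictionType: string | null;
} {
  const params = new URL(url).searchParams;
  return {
    userId: params.get("userId"),
    predictionType: params.get("predictionType"),
  };
}
```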

When to use each

| Scenario | Use |
| --- | --- |
| Fast model, UX waits for result | Polling |
| Slow model, fire and notify | Webhooks |
| Background job, store to DB | Webhooks |
| Quick prototype | Polling |

3. Cold Starts Are Real — Here's How to Handle Them

When a model hasn't been used recently, it needs to "boot up." This can add several seconds of latency on the first request after idle time.

For casual traffic: Cold boots are fine. You only pay for actual compute, not boot time.

For production apps with consistent traffic: use a Deployment with `min_instances` set to 1:

// Create a deployment (via the dashboard or API) with min_instances = 1
// to keep the model warm 24/7, then send predictions to it:
const prediction = await replicate.deployments.predictions.create(
  "your-username", "your-deployment-name",
  { input: { image: imageUrl } },
);

This costs more (you're paying to keep the instance warm) but eliminates cold start latency entirely.

For Goodbye Watermark, I don't use a deployment because the traffic is spread across the day and a few seconds of latency on first boot is acceptable. But if you're building something with strict SLA requirements — use deployments.


4. Save Outputs Immediately — They Expire in 1 Hour

This is the gotcha that trips up everyone:

Input and output files are automatically deleted after 1 hour for any predictions created through the API.

If your app doesn't save the result right after succeeded, it's gone. Your options:

Option A: Stream back to the client immediately

// Next.js API route
export async function POST(request: Request) {
  const { imageUrl } = await request.json();
  const output = await replicate.run("owner/model", { input: { image: imageUrl } });
  // output is typically a file URL (or an array of them): fetch it
  // and stream the bytes straight back to the client
  const file = await fetch(String(Array.isArray(output) ? output[0] : output));
  return new Response(file.body, { headers: { "Content-Type": "image/png" } });
}

Option B: Save to your own storage (Supabase Storage, S3, etc.)

const output = await replicate.run("owner/model", { input });
const response = await fetch(output[0]); // download from Replicate before it expires
const buffer = await response.arrayBuffer();
await supabase.storage
  .from("outputs")
  .upload(`${userId}/${id}.png`, buffer, { contentType: "image/png" });

For Goodbye Watermark, I stream the result directly back to the client. The user downloads it immediately. No storage needed, no expiry problem.


5. Next.js Config: Don't Forget This

If you're displaying output images from Replicate in a Next.js <Image> component, add this to your config or you'll get a domain error:

// next.config.ts
const nextConfig = {
  images: {
    remotePatterns: [
      {
        protocol: "https",
        hostname: "replicate.delivery",
      },
      {
        protocol: "https",
        hostname: "*.replicate.delivery",
      },
    ],
  },
};

Small thing, but it will bite you in production.


6. Error Handling That Doesn't Suck

Real-world Replicate usage needs to handle:

  • Network timeouts
  • Model errors (bad input format, unsupported file type)
  • Rate limits (429)
  • Prediction timeouts (30 min hard cap)

// Inside a Next.js route handler (NextResponse comes from "next/server")
try {
  const prediction = await replicate.predictions.create({ ... });

  if (prediction?.error) {
    return NextResponse.json({ error: prediction.error }, { status: 500 });
  }

  // poll with timeout safety
  let result = prediction;
  const deadline = Date.now() + 60_000; // 60s max wait

  while (result.status !== "succeeded" && result.status !== "failed") {
    if (Date.now() > deadline) {
      return NextResponse.json({ error: "Prediction timed out" }, { status: 504 });
    }
    await new Promise((r) => setTimeout(r, 1500));
    result = await replicate.predictions.get(result.id);
  }

  if (result.status === "failed") {
    return NextResponse.json({ error: "Model failed" }, { status: 500 });
  }

  return NextResponse.json({ output: result.output });

} catch (err) {
  return NextResponse.json({ error: "Unexpected error" }, { status: 500 });
}

Set your own deadline. Replicate's hard limit is 30 minutes, but your users don't want to wait more than ~60 seconds for most tasks.


7. Rate Limits to Know

From Replicate's docs:

  • Create prediction: 600 requests/minute
  • All other endpoints: 3000 requests/minute

For most indie apps, you won't hit these. If you do, they return a 429 — build retry logic with exponential backoff.
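A minimal backoff sketch, assuming the thrown error exposes an HTTP `status` field (adapt the check to however your client surfaces 429s; the helper name is mine):

```typescript
// Retry a request with exponential backoff when it fails with a 429.
async function withBackoff<T>(fn: () => Promise<T>, maxRetries = 3): Promise<T> {
  for (let attempt = 0; ; attempt++) {
    try {
      return await fn();
    } catch (err: any) {
      // Give up after maxRetries, or immediately for non-rate-limit errors
      if (attempt >= maxRetries || err?.status !== 429) throw err;
      const delay = 500 * 2 ** attempt; // 500ms, 1s, 2s, ...
      await new Promise((r) => setTimeout(r, delay));
    }
  }
}
```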


8. Choosing the Right Model

Replicate hosts thousands of models. Two categories matter:

Official models — maintained by Replicate, always warm, stable API, predictable per-output pricing. Best for production use.

Community models — more variety, charged by compute time, may have cold starts, API can change between versions.

For Goodbye Watermark, I use the Qwen model for watermark removal. The choice came down to output quality and how well it handled semi-transparent watermarks — which are significantly harder than solid text watermarks. Testing a few models on realistic samples before committing to one is worth the extra hour.


Real-World Case Study: Goodbye Watermark

Goodbye Watermark is an AI watermark removal tool built with Next.js + Replicate + Vercel. The full stack is:

  • Frontend: Next.js + Tailwind CSS
  • AI: Replicate (Qwen model)
  • Hosting: Vercel
  • Payments: Stripe (two credit tiers)

The entire MVP was built in ~1 hour. The hardest part wasn't the UI — it was getting consistent output quality from the model across different watermark types.

Current results:

  • ~150 weekly organic users
  • $0 paid acquisition
  • Zero infrastructure management

Replicate made the difference. Running my own GPU inference would have added weeks of setup and ongoing ops overhead. Instead, I spent that time on the UX and monetization.


TL;DR — The Patterns That Matter

  1. Understand the prediction lifecycle — especially the 1-hour file expiry
  2. Use polling for short tasks, webhooks for long/background ones
  3. Use Deployments if cold start latency is a problem for your UX
  4. Save or stream outputs immediately after succeeded
  5. Add replicate.delivery to your Next.js image domains
  6. Set your own deadline — don't wait 30 minutes for a user-facing request
  7. Test multiple models before committing — quality varies significantly

Replicate is genuinely one of the best tools for indie developers shipping AI products fast. Use it well and you can build something real in a weekend.


Built something with Replicate? Drop it in the comments — always curious to see what people are shipping.
