Observation Haki for Manga: Why I Built a Change Data Capture (CDC) Pipeline Just to Read Manga

#go #springboot #kafka #webscraping

If you are a manga reader, you know the pain.

New chapters are scattered across a dozen scanlation and official sites (MangaDex, MangaFire, MangaPlus, Asura Scans). Each site has a different layout, a different API (or none at all), and there is no unified way to track updates. Checking them all by hand every day is tedious and slow.

In One Piece, Observation Haki (Kenbunshoku Haki) allows a warrior to sense the presence, strength, and movements of others before they can act.

I wanted that power for my manga reading list. So, I did what any software engineer would do: I over-engineered a real-time Change Data Capture (CDC) pipeline using Go, PostgreSQL, Redpanda, and Spring Boot.

Here is how I built it, and why I think this hybrid stack is a solid pattern for building robust, self-hosted web trackers.

The Grand Line: System Architecture

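Rather than writing a monolithic scraper that makes API requests and spams webhooks directly, I designed a decoupled, event-driven architecture: a Go scraper writes to PostgreSQL, newly inserted chapters are published as change events to Redpanda, and a Spring Boot service consumes those events and sends the notifications.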

Let's break down the main components and how they map to our engineering (and anime) concepts.

1. The Scraper: Shadow Clone Jutsu (Go + Colly)

To scrape multiple sites quickly, the scraper must be fast, lightweight, and concurrent. Go is a natural fit.

I built a concurrent runner that spins up a separate worker (a goroutine) for each scanlation platform. Think of them as Naruto’s Shadow Clones (Kage Bunshin no Jutsu). They spread out across the web, fetch pages with the Colly scraping framework, extract metadata, and send it back to the main goroutine.
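To make the fan-out concrete, here is a minimal sketch of one goroutine per platform reporting back over a channel. It is not the project's actual runner: the Adapter interface, staticAdapter, and runAll are hypothetical names invented for this illustration, and the adapters return canned data instead of calling Colly.

```go
package main

import (
	"fmt"
	"sort"
	"sync"
)

// Chapter is a trimmed-down stand-in for the project's model.Chapter.
type Chapter struct {
	Source string
	Number float64
}

// Adapter is a hypothetical per-site scraper interface (an assumption).
type Adapter interface {
	Name() string
	FetchChapters() ([]Chapter, error)
}

// staticAdapter returns canned data so the sketch runs without network access.
type staticAdapter struct {
	name    string
	numbers []float64
	fail    bool
}

func (a staticAdapter) Name() string { return a.name }

func (a staticAdapter) FetchChapters() ([]Chapter, error) {
	if a.fail {
		return nil, fmt.Errorf("site unreachable")
	}
	out := make([]Chapter, 0, len(a.numbers))
	for _, n := range a.numbers {
		out = append(out, Chapter{Source: a.name, Number: n})
	}
	return out, nil
}

// runAll starts one goroutine per adapter and gathers the results.
// A failing site is logged and skipped; it does not block the others.
func runAll(adapters []Adapter) []Chapter {
	// Buffered so every worker can send without waiting for a reader.
	results := make(chan []Chapter, len(adapters))
	var wg sync.WaitGroup
	for _, a := range adapters {
		wg.Add(1)
		go func(a Adapter) {
			defer wg.Done()
			chapters, err := a.FetchChapters()
			if err != nil {
				fmt.Printf("skipping %s: %v\n", a.Name(), err)
				return
			}
			results <- chapters
		}(a)
	}
	wg.Wait()
	close(results)

	var all []Chapter
	for batch := range results {
		all = append(all, batch...)
	}
	// Goroutines finish in arbitrary order, so sort for stable output.
	sort.Slice(all, func(i, j int) bool {
		if all[i].Source != all[j].Source {
			return all[i].Source < all[j].Source
		}
		return all[i].Number < all[j].Number
	})
	return all
}

func main() {
	all := runAll([]Adapter{
		staticAdapter{name: "asurascans", numbers: []float64{12, 11}},
		staticAdapter{name: "mangadex", numbers: []float64{1117}},
		staticAdapter{name: "offline-site", fail: true},
	})
	for _, ch := range all {
		fmt.Printf("%s chapter %g\n", ch.Source, ch.Number)
	}
}
```

The buffered channel is what lets the main goroutine wait for all workers before reading; with an unbuffered channel the collection loop would have to run concurrently with the workers.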

Here’s a snippet of how the scraper fetches chapters using HTML selectors:

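// From scraper/internal/adapter/asurascans.go
// Match every link whose href contains "/chapter/".
c.OnHTML(`a[href*="/chapter/"]`, func(e *colly.HTMLElement) {
    href := e.Attr("href")
    if href == "" {
        return
    }

    // Expect exactly one "/chapter/" separator: "<series path>/chapter/<number>".
    parts := strings.Split(href, "/chapter/")
    if len(parts) != 2 {
        return
    }

    // Chapter numbers can be fractional (e.g. 10.5), so parse them as float64.
    chapterNum, err := strconv.ParseFloat(strings.TrimRight(parts[1], "/"), 64)
    if err != nil {
        return // skip links whose suffix is not a number
    }

    // href is treated as site-relative here, so the base URL is prepended.
    chapters = append(chapters, model.Chapter{
        Number: chapterNum,
        URL:    asurascansBase + href,
        IsNew:  true,
    })
})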

2. The Change Engine: Observation Haki (PostgreSQL Diff)

Once the raw data is fetched, the system needs to determine what has actually changed. We don't want to receive duplicate alerts. This is where Observation Haki (our Diff Engine) comes in.

The Go scraper runs a Postgres transactional upsert:

It upserts the manga_series table (updating headers, cover image, status).
It attempts to insert the chapters into the chapters table.
The database schema has a unique constraint: UNIQUE(series_id, chapter_num).

With ON CONFLICT DO NOTHING, Postgres silently skips a chapter that already exists in our database, and the RETURNING clause yields no row for it. If the chapter is new, Postgres saves it and returns the generated UUID. The Go code treats a returned ID as the signal for a fresh release:

// From scraper/internal/db/postgres.go
func (d *DB) InsertChapter(ctx context.Context, seriesID string, ch model.Chapter) (string, error) {
    var id string
    err := d.pool.QueryRow(ctx, `
        INSERT INTO chapters (series_id, chapter_num, title, url, release_date, is_new)
        VALUES ($1, $2, $3, $4, $5, true)
        ON CONFLICT (series_id, chapter_num) DO NOTHING
        RETURNING id
    `, seriesID, ch.Number, ch.Title, ch.URL, ch.ReleaseDate).Scan(&id)
    // ...
}

3. The Eventing Layer: Domain Expansion (Redpanda / Kafka)

Directly invoking notifications inside the scraper is a classic anti-pattern. If your Discord webhook rate-limits you, or your service goes offline, you lose the event.

To solve this, I activated a Domain Expansion: a pocket message broker (using Redpanda locally and Aiven Kafka in production) to act as a buffer.

Instead of deploying a full, heavy Debezium Connector, the Go scraper serializes database changes into a Debezium-compatible JSON payload directly:

{
  "op": "c",
  "after": {
    "id": "7ca648b2-5f65-4d2c-8067-27083042a3cf",
    "series_id": "bfd0d829-114c-47bc-ad6c-d2c67f407784",
    "chapter_num": "1117.00",
    "url": "https://mangaplus.shueisha.co.jp/viewer/1021287",
    "is_new": true
  }
}

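The scraper publishes this event to the Kafka topic. As long as the producer waits for broker acknowledgements and retries failed sends, consumers get at-least-once delivery, which also means the notifier has to tolerate an occasional duplicate.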

4. The Notifier: Hell Butterflies (Spring Boot)

In Bleach, Shinigami use Hell Butterflies (Jigokuchō) to safely guide messages between worlds.

In our pipeline, the Spring Boot application consumes events from the Redpanda stream and dispatches them as Discord, Slack, or Telegram webhook payloads.

Spring Boot is the perfect choice for the consumer layer due to its rich ecosystem of integration libraries and battle-tested thread pool listeners.

// From notification-service/.../service/ChapterEventService.java
public void processChapterEvent(String message) throws JsonProcessingException {
    JsonNode root = mapper.readTree(message);
    String op = root.path("op").asText();
    if (!"c".equals(op)) return; // Only notify on 'create' events

    JsonNode after = root.path("after");
    String chapterId = after.path("id").asText();
    String chapterNum = after.path("chapter_num").asText();
    String url = after.path("url").asText();

    // seriesTitle and title are resolved in code omitted from this excerpt.
    // Route to active webhook endpoints
    Map<String, Boolean> results = notifierRegistry.sendAll(seriesTitle, chapterNum, title, url);

    // Mark chapter as notified so it won't trigger alerts again
    chapterRepo.markNotified(chapterId);
}
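The two ideas in that handler, fanning one event out to every channel and remembering what has already been announced, are language-agnostic. Here is a minimal Go sketch of the same pattern, kept in Go only to match the rest of the examples. It is not a port of the Spring service: Notifier, fakeNotifier, sendAll, and dispatcher are hypothetical, and the in-memory notified set stands in for the database flag.

```go
package main

import (
	"errors"
	"fmt"
)

// Notifier is a hypothetical interface with one implementation per channel.
type Notifier interface {
	Name() string
	Send(seriesTitle, chapterNum, url string) error
}

// fakeNotifier simulates a webhook that either succeeds or is rate-limited.
type fakeNotifier struct {
	name string
	fail bool
}

func (f fakeNotifier) Name() string { return f.name }

func (f fakeNotifier) Send(seriesTitle, chapterNum, url string) error {
	if f.fail {
		return errors.New("webhook rate limited")
	}
	return nil
}

// sendAll tries every notifier and records success per channel, so one
// failing webhook does not stop the others.
func sendAll(ns []Notifier, seriesTitle, chapterNum, url string) map[string]bool {
	results := make(map[string]bool, len(ns))
	for _, n := range ns {
		results[n.Name()] = n.Send(seriesTitle, chapterNum, url) == nil
	}
	return results
}

// dispatcher remembers announced chapter IDs so a redelivered event
// (possible under at-least-once delivery) does not alert twice.
type dispatcher struct {
	notifiers []Notifier
	notified  map[string]bool
}

// handle returns the per-channel results and whether the event was processed.
func (d *dispatcher) handle(chapterID, seriesTitle, chapterNum, url string) (map[string]bool, bool) {
	if d.notified[chapterID] {
		return nil, false
	}
	results := sendAll(d.notifiers, seriesTitle, chapterNum, url)
	d.notified[chapterID] = true
	return results, true
}

func main() {
	d := &dispatcher{
		notifiers: []Notifier{
			fakeNotifier{name: "discord"},
			fakeNotifier{name: "slack", fail: true},
		},
		notified: make(map[string]bool),
	}
	results, processed := d.handle("ch-1", "One Piece", "1117.00", "https://example.com/ch-1")
	fmt.Println(results, processed)
	_, processed = d.handle("ch-1", "One Piece", "1117.00", "https://example.com/ch-1")
	fmt.Println("duplicate processed:", processed)
}
```

Like the Java excerpt, this sketch marks a chapter as notified even when some channels fail; retrying only the failed channels would need the per-channel results to be stored as well.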

Here is how the notification alerts look in action:

Developer Experience (DX) First

A major problem with multi-service stacks is setup complexity. To fix this, I wrote an interactive setup wizard, a small CLI in Go. Running go run ./configure walks you through the configuration step by step.

With a single docker compose up -d, you can have a Kafka-compatible broker (Redpanda), a Prometheus monitoring dashboard, PostgreSQL, and the scraper up and running in seconds.

Conclusion & Code

Building manga-cdc proved that enterprise concepts like CDC, decoupled message brokers, and hybrid-language microservices aren't just for huge web companies operating at scale. They are powerful tools for building side projects that are resilient, modular, and extremely fun to build in public.

You can inspect the complete source code, run the setup wizard, and try it yourself on GitHub:

👉 GitHub Repository: aeswibon/manga-cdc

Let me know in the comments: What is the most over-engineered side project you have built to solve a minor daily annoyance?