DEV Community

Stephen Dev

I built Search Crawler: A Reference Architecture using DDD and CQRS

Most DDD tutorials use a banking app or an e-commerce store as the example domain. Which is fine, but honestly… kind of boring. I wanted something with more moving parts — async jobs, queues, retries, anti-bot protection — so I built a full-stack Bing keyword scraper instead.

The result is search-crawler: a NestJS + Next.js app where you upload a CSV of keywords, and it scrapes Bing search results for each one using Puppeteer, with a real-time dashboard to track progress.

But the scraper itself isn't the point of this post. The point is how it's structured — and why I think a scraping system is actually a great domain for learning DDD, CQRS, and domain events.

Let's dig in.


Why a scraper is a great DDD domain

A scraping system has a ton of natural domain complexity:

  • A keyword isn't just a string — it has a lifecycle (queued → processing → completed/failed)
  • Failures are domain concepts, not just exceptions (should we retry? how many times?)
  • Multiple services need to react when a scrape finishes (update status, save results, write audit log)
  • Reads and writes have wildly different performance profiles (writes are slow async jobs, reads need to be instant)

These constraints make DDD patterns feel natural rather than forced. Let's look at how each one plays out.
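To make that lifecycle concrete, here's a minimal sketch of the status transitions. The `KeywordStatus` enum matches the entity code later in the post; the transition map and `canTransition` helper are my own illustration, not code from the repo:

```typescript
// A sketch of the keyword lifecycle: queued → processing → completed/failed.
// The transition map is illustrative; the real rules live in KeywordEntity.
enum KeywordStatus {
  QUEUED = 'queued',
  PROCESSING = 'processing',
  COMPLETED = 'completed',
  FAILED = 'failed',
}

const allowedTransitions: Record<KeywordStatus, KeywordStatus[]> = {
  [KeywordStatus.QUEUED]: [KeywordStatus.PROCESSING],
  [KeywordStatus.PROCESSING]: [
    KeywordStatus.COMPLETED,
    KeywordStatus.QUEUED, // a failed attempt with retries left goes back to the queue
    KeywordStatus.FAILED,
  ],
  [KeywordStatus.COMPLETED]: [], // terminal
  [KeywordStatus.FAILED]: [],    // terminal
};

function canTransition(from: KeywordStatus, to: KeywordStatus): boolean {
  return allowedTransitions[from].includes(to);
}
```

Notice that "failed attempt" and "permanently failed" are different transitions out of PROCESSING; that distinction is exactly what the aggregate encodes.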


Layer 1: The Domain — where the real logic lives

The most important thing in DDD is keeping your domain layer pure. No NestJS decorators, no database imports, no HTTP stuff. Just plain TypeScript classes that model your business rules.

The Aggregate Root

The core of the domain is KeywordEntity, which is the aggregate root. Here's what that means in practice:

// src/domain/keyword/keyword.entity.ts

export class KeywordEntity extends AggregateRoot {
  private status: KeywordStatus;
  private attempts: number;

  static create(keyword: string, uploadId: string): KeywordEntity {
    const entity = new KeywordEntity(/* ... */);
    entity.addDomainEvent(new KeywordQueuedEvent(entity.id, keyword));
    return entity;
  }

  markAsProcessing(): void {
    this.status = KeywordStatus.PROCESSING;
    this.addDomainEvent(new KeywordProcessingStartedEvent(this.id));
  }

  markAsCompleted(results: ScrapeResult[]): void {
    this.status = KeywordStatus.COMPLETED;
    this.addDomainEvent(new KeywordCompletedEvent(this.id, results));
  }

  markAsFailed(reason: string): void {
    this.attempts++;
    if (this.attempts < MAX_RETRIES) {
      this.status = KeywordStatus.QUEUED; // back to the queue
      this.addDomainEvent(new KeywordScrapeAttemptFailedEvent(this.id, reason));
    } else {
      this.status = KeywordStatus.FAILED;
      this.addDomainEvent(new KeywordFailedEvent(this.id, reason));
    }
  }
}

A few things to notice:

  1. The entity controls its own state. You can't just set keyword.status = 'completed' from outside. You call markAsCompleted() and the entity decides what that means.

  2. It raises domain events. Every meaningful state change publishes an event. These events are how the rest of the system finds out what happened — without tight coupling.

  3. The retry logic lives here. Should we retry after a failure? How many times? That's a business rule, so it belongs in the domain — not in some infrastructure retry config.

The base classes in pkg/ddd/

The project includes a small pkg/ddd/ folder with base classes — AggregateRoot, Entity, ValueObject, DomainEvent. This is the most educational part of the repo for anyone learning DDD, because you can see exactly what an aggregate root needs to do:

// pkg/ddd/aggregate-root.ts

export abstract class AggregateRoot extends Entity {
  private _domainEvents: DomainEvent[] = [];

  protected addDomainEvent(event: DomainEvent): void {
    this._domainEvents.push(event);
  }

  public pullDomainEvents(): DomainEvent[] {
    const events = [...this._domainEvents];
    this._domainEvents = [];
    return events;
  }
}

Simple. The aggregate collects events internally, and they get pulled out (and cleared) when it's time to publish them.
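A quick illustration of the pull-and-clear semantics, using minimal stand-ins for the base classes (not the repo's actual `pkg/ddd/` code):

```typescript
// Minimal stand-ins showing that pullDomainEvents() both returns and clears events.
interface DomainEvent {
  readonly name: string;
}

abstract class AggregateRoot {
  private _domainEvents: DomainEvent[] = [];

  protected addDomainEvent(event: DomainEvent): void {
    this._domainEvents.push(event);
  }

  public pullDomainEvents(): DomainEvent[] {
    const events = [...this._domainEvents];
    this._domainEvents = [];
    return events;
  }
}

class Demo extends AggregateRoot {
  touch(): void {
    this.addDomainEvent({ name: 'Touched' });
  }
}

const demo = new Demo();
demo.touch();
console.log(demo.pullDomainEvents().length); // 1: the collected event
console.log(demo.pullDomainEvents().length); // 0: pulling cleared the buffer
```

The clearing matters: if the repository saved the same aggregate twice, a non-clearing implementation would publish the same events twice.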


Layer 2: Domain Events — decoupling without chaos

Domain events are one of those concepts that sounds complicated but is actually really clean once you see it in action.

The idea: when something important happens in the domain, the aggregate publishes an event. Other parts of the system listen for that event and react. The aggregate doesn't know or care who's listening.

In this project, KeywordEntity raises five different events:

Event                              When
KeywordQueuedEvent                 Keyword created and added to the queue
KeywordProcessingStartedEvent      Worker picks it up
KeywordCompletedEvent              Scrape finished successfully
KeywordScrapeAttemptFailedEvent    Scrape failed, will retry
KeywordFailedEvent                 All retries exhausted

One of the best examples of how this pays off is the KeywordTimelineEventHandler:

// src/application/event-handlers/keyword-timeline.handler.ts

@EventsHandler(KeywordCompletedEvent, KeywordFailedEvent, KeywordProcessingStartedEvent)
export class KeywordTimelineEventHandler
  implements IEventHandler<KeywordCompletedEvent | KeywordFailedEvent | KeywordProcessingStartedEvent>
{
  constructor(private readonly timelineRepo: ITimelineRepository) {}

  async handle(
    event: KeywordCompletedEvent | KeywordFailedEvent | KeywordProcessingStartedEvent,
  ): Promise<void> {
    await this.timelineRepo.save({
      keywordId: event.keywordId,
      eventType: event.constructor.name,
      occurredAt: new Date(),
    });
  }
}

The KeywordEntity has zero idea this audit trail exists. Tomorrow you could add a Slack notification handler, or an email alert handler, and the domain doesn't change at all. That's the power of events.
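For instance, a failure-notification handler might look like this. This is a hypothetical example, not code from the repo: the `Notifier` interface and class name are mine, and the event shape mirrors the events described above.

```typescript
// Hypothetical: a handler that reacts to KeywordFailedEvent with a notification.
// The domain never learns this handler exists; it just raised the event.
interface KeywordFailedEvent {
  keywordId: string;
  reason: string;
}

interface Notifier {
  send(message: string): Promise<void>; // e.g. backed by a Slack webhook
}

class KeywordFailureNotificationHandler {
  constructor(private readonly notifier: Notifier) {}

  async handle(event: KeywordFailedEvent): Promise<void> {
    await this.notifier.send(
      `Keyword ${event.keywordId} failed permanently: ${event.reason}`,
    );
  }
}
```

In NestJS CQRS you'd register it with an @EventsHandler(KeywordFailedEvent) decorator, exactly like the timeline handler above.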

How the repository publishes events

There's a subtle but important pattern in the repository base class:

// pkg/ddd/repository.base.ts

async update(entity: AggregateRoot): Promise<void> {
  await this.persist(entity);
  const events = entity.pullDomainEvents();
  await this.eventBus.publishAll(events);
}

Every time you save an aggregate, its events are automatically published. You never have to remember to do it manually. This is easy to miss but really important — it means your application code stays clean.


Layer 3: CQRS — separating reads from writes

CQRS (Command Query Responsibility Segregation) means: writes go through commands, reads go through queries. They're separate paths with separate handlers.

Here's why this matters for a scraper specifically: writing is slow (you're running async Puppeteer jobs) but reading needs to be fast (the dashboard is polling for live status). If you share the same model for both, you end up with compromises everywhere.

Commands (write side)

// src/application/commands/queue-keyword/queue-keyword.command.ts
export class QueueKeywordCommand {
  constructor(
    public readonly keyword: string,
    public readonly uploadId: string,
  ) {}
}

// src/application/commands/queue-keyword/queue-keyword.handler.ts
@CommandHandler(QueueKeywordCommand)
export class QueueKeywordHandler implements ICommandHandler<QueueKeywordCommand> {
  constructor(private readonly keywordRepo: IKeywordRepository) {}

  async execute(command: QueueKeywordCommand): Promise<void> {
    const keyword = KeywordEntity.create(command.keyword, command.uploadId);
    await this.keywordRepo.save(keyword); // this also publishes KeywordQueuedEvent
  }
}

The controller sends a command. The handler creates the aggregate, saves it, and domain events get published automatically. Clean.

Queries (read side)

// src/application/queries/get-keyword-status/get-keyword-status.handler.ts
@QueryHandler(GetKeywordStatusQuery)
export class GetKeywordStatusHandler implements IQueryHandler<GetKeywordStatusQuery> {
  constructor(private readonly keywordReadRepo: IKeywordReadRepository) {}

  async execute(query: GetKeywordStatusQuery): Promise<KeywordStatusDto> {
    // Can be optimized independently — raw SQL, Redis cache, whatever
    return this.keywordReadRepo.findStatus(query.keywordId);
  }
}

The read side can be optimized completely independently. You could add Redis caching here, use a read replica, or denormalize the data — all without touching the write side.
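As a sketch of what that caching could look like, here's a cache-aside wrapper around the read path. The in-memory Map stands in for Redis, and the names (`KeywordStatusDto`, `findStatus`) follow the handler above; none of this is the repo's actual code:

```typescript
// Cache-aside on the read path: check the cache, fall back to the read repo,
// then populate the cache. A Map stands in for Redis here.
interface KeywordStatusDto {
  keywordId: string;
  status: string;
}

class CachedStatusReader {
  private cache = new Map<string, KeywordStatusDto>();

  constructor(
    private readonly findStatus: (id: string) => Promise<KeywordStatusDto>,
  ) {}

  async getStatus(keywordId: string): Promise<KeywordStatusDto> {
    const hit = this.cache.get(keywordId);
    if (hit) return hit; // cache hit: skip the database entirely

    const dto = await this.findStatus(keywordId); // cache miss: ask the read repo
    this.cache.set(keywordId, dto);
    return dto;
  }
}
```

One nice consequence of the event-driven design: invalidation has an obvious home. An event handler listening for KeywordCompletedEvent or KeywordFailedEvent could evict the stale entry.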

The controllers are also split, which makes the intent clear at a glance:

src/interface/http/
├── keyword-command.controller.ts   ← POST /keywords, POST /upload
└── keyword-query.controller.ts     ← GET /keywords, GET /keywords/:id/status

Layer 4: BullMQ as the Event Bus

Here's where it gets interesting. The KeywordQueuedEvent needs to actually trigger a scrape job. How does that happen?

BullMQ acts as the event bus between the application layer and the scraper worker.

// src/infrastructure/queue/bullmq.event-bus.ts

export class BullMQEventBus implements IEventBus {
  async publish(event: DomainEvent): Promise<void> {
    if (event instanceof KeywordQueuedEvent) {
      await this.scrapeQueue.add('scrape', {
        keywordId: event.keywordId,
        keyword: event.keyword,
      });
    }
  }
}

The domain doesn't know BullMQ exists. The domain raises a KeywordQueuedEvent. The infrastructure layer translates that into a BullMQ job. This is the dependency inversion principle in action — the domain defines an IEventBus interface, and infrastructure implements it.
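Here's what that port/adapter split might look like in miniature. The `IEventBus` interface name comes from the post; the in-memory adapter is my own sketch, the kind of thing you'd use in unit tests alongside the real BullMQEventBus:

```typescript
// The port the domain depends on, and a trivial in-memory adapter.
// BullMQEventBus is just another adapter implementing the same port.
interface DomainEvent {
  readonly name: string;
}

interface IEventBus {
  publish(event: DomainEvent): Promise<void>;
  publishAll(events: DomainEvent[]): Promise<void>;
}

// Illustrative in-memory adapter, handy for testing repositories and handlers
// without Redis or BullMQ in the loop.
class InMemoryEventBus implements IEventBus {
  public readonly published: DomainEvent[] = [];

  async publish(event: DomainEvent): Promise<void> {
    this.published.push(event);
  }

  async publishAll(events: DomainEvent[]): Promise<void> {
    for (const event of events) await this.publish(event);
  }
}
```

Because the repository base class only sees IEventBus, swapping the in-memory version for BullMQ (or Kafka, or SNS) never touches domain or application code.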

The worker on the other side picks up the job and runs the scrape:

// src/infrastructure/workers/scrape.worker.ts

@Processor('scrape-queue')
export class ScrapeWorker {
  @Process('scrape')
  async handleScrape(job: Job<{ keywordId: string; keyword: string }>) {
    const keyword = await this.keywordRepo.findById(job.data.keywordId);
    keyword.markAsProcessing();
    await this.keywordRepo.update(keyword);

    try {
      const results = await this.puppeteerScraper.scrape(job.data.keyword);
      keyword.markAsCompleted(results);
    } catch (err) {
      keyword.markAsFailed(err.message);
    }

    await this.keywordRepo.update(keyword); // publishes KeywordCompletedEvent or KeywordFailedEvent (or a retry event)
  }
}

Notice how the worker speaks the domain language: markAsProcessing(), markAsCompleted(), markAsFailed(). It never manually sets status fields. The aggregate owns the state transitions.


Layer 5: Puppeteer Stealth

This one's more infrastructure than DDD, but it's worth a quick mention because it's the part that actually makes the scraper work in production.

Bing (like most sites) has bot detection. Without countermeasures, Puppeteer gets blocked pretty quickly. The scraper handles this with:

  • puppeteer-extra-plugin-stealth — patches browser fingerprints (WebGL, canvas, navigator properties)
  • User-agent rotation — cycles through realistic UA strings
  • Exponential backoff — on failure, wait longer before retry (handled by BullMQ's built-in retry config)
  • Request interception — blocks images and CSS to speed up page loads

// src/infrastructure/scraper/puppeteer.scraper.ts

import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

const browser = await puppeteer.use(StealthPlugin()).launch({
  args: ['--no-sandbox', '--disable-setuid-sandbox'],
});
const page = await browser.newPage();

await page.setRequestInterception(true);
page.on('request', (req) => {
  if (['image', 'stylesheet', 'font'].includes(req.resourceType())) {
    req.abort();
  } else {
    req.continue();
  }
});
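The exponential backoff lives in BullMQ's job options rather than the scraper itself. A sketch of what that config might look like when adding a job (the numbers are illustrative, not the repo's actual values; BullMQ documents its built-in exponential strategy as doubling the base delay on each retry):

```typescript
// Illustrative BullMQ job options: retry up to 3 times, doubling the delay.
// The attempt count would be kept in line with the domain's MAX_RETRIES.
const jobOptions = {
  attempts: 3,
  backoff: {
    type: 'exponential' as const,
    delay: 5_000, // first retry after ~5s, then ~10s, then ~20s
  },
};

// The delay the exponential strategy produces for the Nth retry:
function exponentialDelay(baseDelayMs: number, attemptsMade: number): number {
  return baseDelayMs * 2 ** (attemptsMade - 1);
}
```

Note the layering: the domain decides whether to retry (in markAsFailed), while BullMQ only decides when the retry runs.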

The important thing architecturally: all of this is infrastructure. The domain doesn't know how scraping works. It just calls this.scraper.scrape(keyword) via an IScraper interface. If you wanted to swap Puppeteer for Playwright tomorrow, you'd change one file.


The full picture

Here's how everything connects end to end:

HTTP POST /upload (CSV file)
    │
    ▼
ImportKeywordsCommand
    │
    ▼
KeywordEntity.create()  ──raises──▶  KeywordQueuedEvent
    │                                        │
    ▼                                        ▼
keywordRepo.save()               BullMQEventBus.publish()
                                             │
                                             ▼
                                   BullMQ Job added to queue
                                             │
                                             ▼
                                      ScrapeWorker picks up job
                                             │
                                  ┌──────────┴──────────┐
                                  ▼                     ▼
                         keyword.markAsCompleted()  keyword.markAsFailed()
                                  │                     │
                                  ▼                     ▼
                         KeywordCompletedEvent    KeywordFailedEvent (or retry)
                                  │
                                  ▼
                         KeywordTimelineEventHandler → audit log

What I'd do differently

A few things I'd change with the benefit of hindsight:

  • Add an outbox pattern. Right now domain events are published in the same transaction as the save. If the event bus call fails after the save, you lose events. An outbox table (events persisted to DB, then picked up by a background job) would make this reliable.

  • Separate read models. The query side currently reads from the same tables as the write side. For a learning project that's fine, but in production you'd want dedicated read models (potentially materialized views or a separate read DB) so reads can scale independently.

  • More value objects. keyword is currently just a string. It could be a SearchKeyword value object that validates length, strips special characters, and normalizes whitespace. The domain would be richer for it.
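To make that last point concrete, here's a sketch of what a SearchKeyword value object could look like. This is my illustration of the idea, covering whitespace normalization and a length check (the exact rules and limits would be up to the domain):

```typescript
// A sketch of the SearchKeyword value object: validation and normalization
// happen once, at construction, so the rest of the domain can trust the value.
class SearchKeyword {
  private constructor(public readonly value: string) {}

  static create(raw: string): SearchKeyword {
    const normalized = raw.trim().replace(/\s+/g, ' ');
    if (normalized.length === 0 || normalized.length > 100) {
      throw new Error('Search keyword must be 1-100 characters');
    }
    return new SearchKeyword(normalized);
  }

  equals(other: SearchKeyword): boolean {
    return this.value === other.value; // value objects compare by value, not identity
  }
}
```

KeywordEntity.create() would then accept a SearchKeyword instead of a string, and invalid keywords could never enter the system in the first place.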


Check it out

The full repo is at https://github.com/stephendo-dev/monorepo-ddd-cqrs-search-crawler.

If you're learning DDD/CQRS and tired of todo app examples, give it a look — and if it helped you, a ⭐ goes a long way.

Questions or feedback? Drop them in the comments below.

