DEV Community

Stephen Dev

I built Search Crawler: A Reference Architecture using DDD and CQRS

Most DDD tutorials use a banking app or an e-commerce store as the example domain. Which is fine, but honestly… kind of boring. I wanted something with more moving parts — async jobs, queues, retries, anti-bot protection — so I built a full-stack Bing keyword scraper instead.

The result is search-crawler: a NestJS + Next.js app where you upload a CSV of keywords, and it scrapes Bing search results for each one using Puppeteer, with a real-time dashboard to track progress.

But the scraper itself isn't the point of this post. The point is how it's structured — and why I think a scraping system is actually a great domain for learning DDD, CQRS, and domain events.

Let's dig in.


Why a scraper is a great DDD domain

A scraping system has a ton of natural domain complexity:

  • A keyword isn't just a string — it has a lifecycle (queued → processing → completed/failed)
  • Failures are domain concepts, not just exceptions (should we retry? how many times?)
  • Multiple services need to react when a scrape finishes (update status, save results, write audit log)
  • Reads and writes have wildly different performance profiles (writes are slow async jobs, reads need to be instant)

These constraints make DDD patterns feel natural rather than forced. Let's look at how each one plays out.
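To make that lifecycle concrete, here's a minimal sketch of the status transitions. The `KeywordStatus` enum matches the entity code later in the post; the transition map and `canTransition` helper are my own illustration, not code from the repo:

```typescript
// A sketch of the keyword lifecycle: queued → processing → completed/failed.
// The transition map is illustrative; the real rules live in KeywordEntity.
enum KeywordStatus {
  QUEUED = 'queued',
  PROCESSING = 'processing',
  COMPLETED = 'completed',
  FAILED = 'failed',
}

const allowedTransitions: Record<KeywordStatus, KeywordStatus[]> = {
  [KeywordStatus.QUEUED]: [KeywordStatus.PROCESSING],
  [KeywordStatus.PROCESSING]: [
    KeywordStatus.COMPLETED,
    KeywordStatus.QUEUED, // a failed attempt with retries left goes back to the queue
    KeywordStatus.FAILED,
  ],
  [KeywordStatus.COMPLETED]: [], // terminal
  [KeywordStatus.FAILED]: [],    // terminal
};

function canTransition(from: KeywordStatus, to: KeywordStatus): boolean {
  return allowedTransitions[from].includes(to);
}
```

Notice that "failed attempt" and "permanently failed" are different transitions out of PROCESSING; that distinction is exactly what the aggregate encodes.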


Layer 1: The Domain — where the real logic lives

The most important thing in DDD is keeping your domain layer pure. No NestJS decorators, no database imports, no HTTP stuff. Just plain TypeScript classes that model your business rules.

The Aggregate Root

The core of the domain is KeywordEntity, which is the aggregate root. Here's what that means in practice:

// src/domain/keyword/keyword.entity.ts

export class KeywordEntity extends AggregateRoot {
  private status: KeywordStatus;
  private attempts: number;

  static create(keyword: string, uploadId: string): KeywordEntity {
    const entity = new KeywordEntity(/* ... */);
    entity.addDomainEvent(new KeywordQueuedEvent(entity.id, keyword));
    return entity;
  }

  markAsProcessing(): void {
    this.status = KeywordStatus.PROCESSING;
    this.addDomainEvent(new KeywordProcessingStartedEvent(this.id));
  }

  markAsCompleted(results: ScrapeResult[]): void {
    this.status = KeywordStatus.COMPLETED;
    this.addDomainEvent(new KeywordCompletedEvent(this.id, results));
  }

  markAsFailed(reason: string): void {
    this.attempts++;
    if (this.attempts < MAX_RETRIES) {
      this.status = KeywordStatus.QUEUED; // back to the queue
      this.addDomainEvent(new KeywordScrapeAttemptFailedEvent(this.id, reason));
    } else {
      this.status = KeywordStatus.FAILED;
      this.addDomainEvent(new KeywordFailedEvent(this.id, reason));
    }
  }
}

A few things to notice:

  1. The entity controls its own state. You can't just set keyword.status = 'completed' from outside. You call markAsCompleted() and the entity decides what that means.

  2. It raises domain events. Every meaningful state change publishes an event. These events are how the rest of the system finds out what happened — without tight coupling.

  3. The retry logic lives here. Should we retry after a failure? How many times? That's a business rule, so it belongs in the domain — not in some infrastructure retry config.

The base classes in pkg/ddd/

The project includes a small pkg/ddd/ folder with base classes — AggregateRoot, Entity, ValueObject, DomainEvent. This is the most educational part of the repo for anyone learning DDD, because you can see exactly what an aggregate root needs to do:

// pkg/ddd/aggregate-root.ts

export abstract class AggregateRoot extends Entity {
  private _domainEvents: DomainEvent[] = [];

  protected addDomainEvent(event: DomainEvent): void {
    this._domainEvents.push(event);
  }

  public pullDomainEvents(): DomainEvent[] {
    const events = [...this._domainEvents];
    this._domainEvents = [];
    return events;
  }
}

Simple. The aggregate collects events internally, and they get pulled out (and cleared) when it's time to publish them.
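A quick illustration of the pull-and-clear semantics, using minimal stand-ins for the base classes (not the repo's actual `pkg/ddd/` code):

```typescript
// Minimal stand-ins showing that pullDomainEvents() both returns and clears events.
interface DomainEvent {
  readonly name: string;
}

abstract class AggregateRoot {
  private _domainEvents: DomainEvent[] = [];

  protected addDomainEvent(event: DomainEvent): void {
    this._domainEvents.push(event);
  }

  public pullDomainEvents(): DomainEvent[] {
    const events = [...this._domainEvents];
    this._domainEvents = [];
    return events;
  }
}

class Demo extends AggregateRoot {
  touch(): void {
    this.addDomainEvent({ name: 'Touched' });
  }
}

const demo = new Demo();
demo.touch();
console.log(demo.pullDomainEvents().length); // 1: the collected event
console.log(demo.pullDomainEvents().length); // 0: pulling cleared the buffer
```

The clearing matters: if the repository saved the same aggregate twice, a non-clearing implementation would publish the same events twice.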


Layer 2: Domain Events — decoupling without chaos

Domain events are one of those concepts that sounds complicated but is actually really clean once you see it in action.

The idea: when something important happens in the domain, the aggregate publishes an event. Other parts of the system listen for that event and react. The aggregate doesn't know or care who's listening.

In this project, KeywordEntity raises five different events:

Event                              When
KeywordQueuedEvent                 Keyword created and added to the queue
KeywordProcessingStartedEvent      Worker picks it up
KeywordCompletedEvent              Scrape finished successfully
KeywordScrapeAttemptFailedEvent    Scrape failed, will retry
KeywordFailedEvent                 All retries exhausted

One of the best examples of how this pays off is the KeywordTimelineEventHandler:

// src/application/event-handlers/keyword-timeline.handler.ts

@EventsHandler(KeywordCompletedEvent, KeywordFailedEvent, KeywordProcessingStartedEvent)
export class KeywordTimelineEventHandler
  implements IEventHandler<KeywordCompletedEvent | KeywordFailedEvent | KeywordProcessingStartedEvent>
{
  constructor(private readonly timelineRepo: ITimelineRepository) {}

  async handle(
    event: KeywordCompletedEvent | KeywordFailedEvent | KeywordProcessingStartedEvent,
  ): Promise<void> {
    await this.timelineRepo.save({
      keywordId: event.keywordId,
      eventType: event.constructor.name,
      occurredAt: new Date(),
    });
  }
}

The KeywordEntity has zero idea this audit trail exists. Tomorrow you could add a Slack notification handler, or an email alert handler, and the domain doesn't change at all. That's the power of events.
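For instance, a failure-notification handler might look like this. This is a hypothetical example, not code from the repo: the `Notifier` interface and class name are mine, and the event shape mirrors the events described above.

```typescript
// Hypothetical: a handler that reacts to KeywordFailedEvent with a notification.
// The domain never learns this handler exists; it just raised the event.
interface KeywordFailedEvent {
  keywordId: string;
  reason: string;
}

interface Notifier {
  send(message: string): Promise<void>; // e.g. backed by a Slack webhook
}

class KeywordFailureNotificationHandler {
  constructor(private readonly notifier: Notifier) {}

  async handle(event: KeywordFailedEvent): Promise<void> {
    await this.notifier.send(
      `Keyword ${event.keywordId} failed permanently: ${event.reason}`,
    );
  }
}
```

In NestJS CQRS you'd register it with an @EventsHandler(KeywordFailedEvent) decorator, exactly like the timeline handler above.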

How the repository publishes events

There's a subtle but important pattern in the repository base class:

// pkg/ddd/repository.base.ts

async update(entity: AggregateRoot): Promise<void> {
  await this.persist(entity);
  const events = entity.pullDomainEvents();
  await this.eventBus.publishAll(events);
}

Every time you save an aggregate, its events are automatically published. You never have to remember to do it manually. This is easy to miss but really important — it means your application code stays clean.


Layer 3: CQRS — separating reads from writes

CQRS (Command Query Responsibility Segregation) means: writes go through commands, reads go through queries. They're separate paths with separate handlers.

Here's why this matters for a scraper specifically: writing is slow (you're running async Puppeteer jobs) but reading needs to be fast (the dashboard is polling for live status). If you share the same model for both, you end up with compromises everywhere.

Commands (write side)

// src/application/commands/queue-keyword/queue-keyword.command.ts
export class QueueKeywordCommand {
  constructor(
    public readonly keyword: string,
    public readonly uploadId: string,
  ) {}
}

// src/application/commands/queue-keyword/queue-keyword.handler.ts
@CommandHandler(QueueKeywordCommand)
export class QueueKeywordHandler implements ICommandHandler<QueueKeywordCommand> {
  constructor(private readonly keywordRepo: IKeywordRepository) {}

  async execute(command: QueueKeywordCommand): Promise<void> {
    const keyword = KeywordEntity.create(command.keyword, command.uploadId);
    await this.keywordRepo.save(keyword); // this also publishes KeywordQueuedEvent
  }
}

The controller sends a command. The handler creates the aggregate, saves it, and domain events get published automatically. Clean.

Queries (read side)

// src/application/queries/get-keyword-status/get-keyword-status.handler.ts
@QueryHandler(GetKeywordStatusQuery)
export class GetKeywordStatusHandler implements IQueryHandler<GetKeywordStatusQuery> {
  constructor(private readonly keywordReadRepo: IKeywordReadRepository) {}

  async execute(query: GetKeywordStatusQuery): Promise<KeywordStatusDto> {
    // Can be optimized independently — raw SQL, Redis cache, whatever
    return this.keywordReadRepo.findStatus(query.keywordId);
  }
}

The read side can be optimized completely independently. You could add Redis caching here, use a read replica, or denormalize the data — all without touching the write side.
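As a sketch of what that caching could look like, here's a cache-aside wrapper around the read path. The in-memory Map stands in for Redis, and the names (`KeywordStatusDto`, `findStatus`) follow the handler above; none of this is the repo's actual code:

```typescript
// Cache-aside on the read path: check the cache, fall back to the read repo,
// then populate the cache. A Map stands in for Redis here.
interface KeywordStatusDto {
  keywordId: string;
  status: string;
}

class CachedStatusReader {
  private cache = new Map<string, KeywordStatusDto>();

  constructor(
    private readonly findStatus: (id: string) => Promise<KeywordStatusDto>,
  ) {}

  async getStatus(keywordId: string): Promise<KeywordStatusDto> {
    const hit = this.cache.get(keywordId);
    if (hit) return hit; // cache hit: skip the database entirely

    const dto = await this.findStatus(keywordId); // cache miss: ask the read repo
    this.cache.set(keywordId, dto);
    return dto;
  }
}
```

One nice consequence of the event-driven design: invalidation has an obvious home. An event handler listening for KeywordCompletedEvent or KeywordFailedEvent could evict the stale entry.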

The controllers are also split, which makes the intent clear at a glance:

src/interface/http/
├── keyword-command.controller.ts   ← POST /keywords, POST /upload
└── keyword-query.controller.ts     ← GET /keywords, GET /keywords/:id/status

Layer 4: BullMQ as the Event Bus

Here's where it gets interesting. The KeywordQueuedEvent needs to actually trigger a scrape job. How does that happen?

BullMQ acts as the event bus between the application layer and the scraper worker.

// src/infrastructure/queue/bullmq.event-bus.ts

export class BullMQEventBus implements IEventBus {
  async publish(event: DomainEvent): Promise<void> {
    if (event instanceof KeywordQueuedEvent) {
      await this.scrapeQueue.add('scrape', {
        keywordId: event.keywordId,
        keyword: event.keyword,
      });
    }
  }
}

The domain doesn't know BullMQ exists. The domain raises a KeywordQueuedEvent. The infrastructure layer translates that into a BullMQ job. This is the dependency inversion principle in action — the domain defines an IEventBus interface, and infrastructure implements it.
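Here's what that port/adapter split might look like in miniature. The `IEventBus` interface name comes from the post; the in-memory adapter is my own sketch, the kind of thing you'd use in unit tests alongside the real BullMQEventBus:

```typescript
// The port the domain depends on, and a trivial in-memory adapter.
// BullMQEventBus is just another adapter implementing the same port.
interface DomainEvent {
  readonly name: string;
}

interface IEventBus {
  publish(event: DomainEvent): Promise<void>;
  publishAll(events: DomainEvent[]): Promise<void>;
}

// Illustrative in-memory adapter, handy for testing repositories and handlers
// without Redis or BullMQ in the loop.
class InMemoryEventBus implements IEventBus {
  public readonly published: DomainEvent[] = [];

  async publish(event: DomainEvent): Promise<void> {
    this.published.push(event);
  }

  async publishAll(events: DomainEvent[]): Promise<void> {
    for (const event of events) await this.publish(event);
  }
}
```

Because the repository base class only sees IEventBus, swapping the in-memory version for BullMQ (or Kafka, or SNS) never touches domain or application code.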

The worker on the other side picks up the job and runs the scrape:

// src/infrastructure/workers/scrape.worker.ts

@Processor('scrape-queue')
export class ScrapeWorker {
  @Process('scrape')
  async handleScrape(job: Job<{ keywordId: string; keyword: string }>) {
    const keyword = await this.keywordRepo.findById(job.data.keywordId);
    keyword.markAsProcessing();
    await this.keywordRepo.update(keyword);

    try {
      const results = await this.puppeteerScraper.scrape(job.data.keyword);
      keyword.markAsCompleted(results);
    } catch (err) {
      keyword.markAsFailed(err.message);
    }

    await this.keywordRepo.update(keyword); // publishes KeywordCompletedEvent or KeywordFailedEvent (or a retry event)
  }
}

Notice how the worker speaks the domain language: markAsProcessing(), markAsCompleted(), markAsFailed(). It never manually sets status fields. The aggregate owns the state transitions.


Layer 5: Puppeteer Stealth

This one's more infrastructure than DDD, but it's worth a quick mention because it's the part that actually makes the scraper work in production.

Bing (like most sites) has bot detection. Without countermeasures, Puppeteer gets blocked pretty quickly. The scraper handles this with:

  • puppeteer-extra-plugin-stealth — patches browser fingerprints (WebGL, canvas, navigator properties)
  • User-agent rotation — cycles through realistic UA strings
  • Exponential backoff — on failure, wait longer before retry (handled by BullMQ's built-in retry config)
  • Request interception — blocks images and CSS to speed up page loads

// src/infrastructure/scraper/puppeteer.scraper.ts

import puppeteer from 'puppeteer-extra';
import StealthPlugin from 'puppeteer-extra-plugin-stealth';

const browser = await puppeteer.use(StealthPlugin()).launch({
  args: ['--no-sandbox', '--disable-setuid-sandbox'],
});
const page = await browser.newPage();

await page.setRequestInterception(true);
page.on('request', (req) => {
  if (['image', 'stylesheet', 'font'].includes(req.resourceType())) {
    req.abort();
  } else {
    req.continue();
  }
});
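The exponential backoff lives in BullMQ's job options rather than the scraper itself. A sketch of what that config might look like when adding a job (the numbers are illustrative, not the repo's actual values; BullMQ documents its built-in exponential strategy as doubling the base delay on each retry):

```typescript
// Illustrative BullMQ job options: retry up to 3 times, doubling the delay.
// The attempt count would be kept in line with the domain's MAX_RETRIES.
const jobOptions = {
  attempts: 3,
  backoff: {
    type: 'exponential' as const,
    delay: 5_000, // first retry after ~5s, then ~10s, then ~20s
  },
};

// The delay the exponential strategy produces for the Nth retry:
function exponentialDelay(baseDelayMs: number, attemptsMade: number): number {
  return baseDelayMs * 2 ** (attemptsMade - 1);
}
```

Note the layering: the domain decides whether to retry (in markAsFailed), while BullMQ only decides when the retry runs.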

The important thing architecturally: all of this is infrastructure. The domain doesn't know how scraping works. It just calls this.scraper.scrape(keyword) via an IScraper interface. If you wanted to swap Puppeteer for Playwright tomorrow, you'd change one file.


The full picture

Here's how everything connects end to end:

HTTP POST /upload (CSV file)
    │
    ▼
ImportKeywordsCommand
    │
    ▼
KeywordEntity.create()  ──raises──▶  KeywordQueuedEvent
    │                                        │
    ▼                                        ▼
keywordRepo.save()               BullMQEventBus.publish()
                                             │
                                             ▼
                                   BullMQ Job added to queue
                                             │
                                             ▼
                                      ScrapeWorker picks up job
                                             │
                                  ┌──────────┴──────────┐
                                  ▼                     ▼
                         keyword.markAsCompleted()  keyword.markAsFailed()
                                  │                     │
                                  ▼                     ▼
                         KeywordCompletedEvent    KeywordFailedEvent (or retry)
                                  │
                                  ▼
                         KeywordTimelineEventHandler → audit log

What I'd do differently

A few things I'd change with the benefit of hindsight:

  • Add an outbox pattern. Right now domain events are published in the same transaction as the save. If the event bus call fails after the save, you lose events. An outbox table (events persisted to DB, then picked up by a background job) would make this reliable.

  • Separate read models. The query side currently reads from the same tables as the write side. For a learning project that's fine, but in production you'd want dedicated read models (potentially materialized views or a separate read DB) so reads can scale independently.

  • More value objects. keyword is currently just a string. It could be a SearchKeyword value object that validates length, strips special characters, and normalizes whitespace. The domain would be richer for it.
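To make that last point concrete, here's a sketch of what a SearchKeyword value object could look like. This is my illustration of the idea, covering whitespace normalization and a length check (the exact rules and limits would be up to the domain):

```typescript
// A sketch of the SearchKeyword value object: validation and normalization
// happen once, at construction, so the rest of the domain can trust the value.
class SearchKeyword {
  private constructor(public readonly value: string) {}

  static create(raw: string): SearchKeyword {
    const normalized = raw.trim().replace(/\s+/g, ' ');
    if (normalized.length === 0 || normalized.length > 100) {
      throw new Error('Search keyword must be 1-100 characters');
    }
    return new SearchKeyword(normalized);
  }

  equals(other: SearchKeyword): boolean {
    return this.value === other.value; // value objects compare by value, not identity
  }
}
```

KeywordEntity.create() would then accept a SearchKeyword instead of a string, and invalid keywords could never enter the system in the first place.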


Check it out

The full repo is at https://github.com/stephendo-dev/monorepo-ddd-cqrs-search-crawler.

If you're learning DDD/CQRS and tired of todo app examples, give it a look — and if it helped you, a ⭐ goes a long way.

Questions or feedback? Drop them in the comments below.

