When building a SaaS product that integrates with a third-party API like Stripe, you will eventually face a problem that seems simple on the surface: how do you import existing data when a user connects their account?
In theory it sounds trivial. Fetch the data, save it to your database, done. In practice, you are dealing with paginated APIs, rate limits, foreign key dependencies, partial failures, and users staring at a spinner with no idea what is happening.
I ran into all of these problems while building Billy — a billing infrastructure layer for developers that lets you manage Stripe subscriptions, tokens, and seat-based billing through a simple SDK. When a developer connects their existing Stripe account, Billy needs to import all their products, prices, customers, and subscriptions automatically. Some accounts have tens of thousands of records. The naive approach falls apart immediately.
This article walks through the architecture I ended up with — from tracking sync state to paginating safely, handling failures gracefully, and surfacing progress to the user in real time. Everything is in NestJS with BullMQ and MikroORM, but the concepts apply to any stack.
Why the Naive Approach Fails
Let me start with what not to do, because understanding the failure modes makes the solution obvious.
The first instinct is something like this:
async function syncCustomers(stripeSecretKey: string) {
  const stripe = new Stripe(stripeSecretKey)
  const customers = await stripe.customers.list({ limit: 100 })

  for (const customer of customers.data) {
    await customerRepository.save(mapToCustomer(customer))
  }
}
This has at least five problems.
It only fetches the first page. Stripe paginates at 100 items maximum per request. If your user has 5,000 customers, this function imports 100 and silently drops the rest. No error, no warning.
It runs synchronously in the request cycle. If you call this from an HTTP endpoint, the request will time out for any account with significant data. Even if it doesn't time out, you are holding an HTTP request open for minutes on work that belongs in the background.
It has no progress tracking. The user sees a loading spinner and has no idea if something is happening or if the app is broken. This is a terrible experience.
It is not idempotent. If something fails halfway through and you re-run it, you get duplicate records or errors from unique constraint violations.
It ignores dependency order. Subscriptions reference customers and prices. Prices reference products. If you import subscriptions before customers exist in your database, you get foreign key violations and the whole thing crashes.
Each of these is a real problem in production. Let me walk through how to solve all of them.
The Architecture Overview
The solution has four main components working together:
- A sync log entity that tracks the state of each sync operation with full history
- A BullMQ background job that runs the sync outside the HTTP request cycle
- An ordered execution pipeline that respects data dependencies
- Idempotent upsert operations that make re-runs safe
Let me go through each one in detail.
Designing the Sync Log
The first mistake I almost made was storing sync status as fields on the Project entity itself:
// Don't do this
type Project = {
  // ...
  syncStatus: 'never' | 'pending' | 'running' | 'completed' | 'failed'
  lastSyncedAt: Date | null
  syncError: string | null
  customersCount: number
  subscriptionsCount: number
}
This works for the simplest case but breaks down quickly. You lose history — if a sync fails and you re-run it, you overwrite the previous state. You can't track which specific step failed. You can't show granular progress. And your Project entity grows indefinitely as you add more sync types.
The better approach is a separate ProjectSyncLog entity where each sync type gets its own log entry:
type SyncType =
  | 'sync_products'
  | 'sync_prices'
  | 'sync_coupons'
  | 'sync_promotion_codes'
  | 'sync_customers'
  | 'sync_subscriptions'
  | 'full_sync'

type ProjectSyncLog = {
  id: string
  projectId: string
  type: SyncType
  status: 'pending' | 'running' | 'completed' | 'failed'
  currentStep: number
  totalSteps: number | null
  syncedCount: number
  failedCount: number
  error: string | null
  startedAt: Date
  completedAt: Date | null
}
A full sync creates six log entries — one for each type. If products and prices succeed but customers fail, you know exactly what happened and you can re-run just sync_customers without touching anything else.
The Project entity only keeps a lightweight summary for quick access:
type Project = {
  // ...
  lastSyncedAt: Date | null
  syncStatus: 'never' | 'pending' | 'running' | 'completed' | 'failed'
}
This separation of concerns makes the system much easier to reason about. The log is the source of truth for sync state. The project fields are just a convenience cache.
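Because the log entries are the source of truth, the project-level status can be derived from them whenever the cache needs refreshing. A minimal sketch of that aggregation, assuming simplified shapes (`SyncLogSummary` and `deriveOverallStatus` are my own names for illustration, not from Billy's codebase):

```typescript
type LogStatus = 'pending' | 'running' | 'completed' | 'failed'

interface SyncLogSummary {
  type: string
  status: LogStatus
}

// Derive the project-level status from the individual log entries:
// any failure wins, then any in-flight work, otherwise completed.
function deriveOverallStatus(logs: SyncLogSummary[]): LogStatus {
  if (logs.some((log) => log.status === 'failed')) return 'failed'
  if (logs.some((log) => log.status === 'running' || log.status === 'pending')) {
    return 'running'
  }
  return 'completed'
}
```

This keeps the cached fields on the Project entity trivially recomputable from the logs, so a bug in the cache never loses information.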
Running Sync as a Background Job
The sync must not run in the HTTP request cycle. When a user creates a project and connects their Stripe account, you want to respond immediately and do the heavy lifting in the background.
With BullMQ in NestJS this is straightforward:
// In CreateProjectHandler
@CommandHandler(CreateProjectCommand)
export class CreateProjectHandler implements ICommandHandler<CreateProjectCommand> {
  constructor(
    private readonly projectRepository: ProjectRepositoryPort,
    @InjectQueue('project-sync') private readonly syncQueue: Queue,
  ) {}

  async execute(command: CreateProjectCommand) {
    const project = Project.create(command.payload)
    await this.projectRepository.create(project)

    // Dispatch sync job and return immediately
    await this.syncQueue.add(
      'full-sync',
      {
        projectId: project.id.value,
        stripeSecretKey: project.stripeSecretKey,
      },
      {
        attempts: 3,
        backoff: {
          type: 'exponential',
          delay: 5000,
        },
      },
    )

    return { data: project.id.value }
  }
}
The user gets a response in milliseconds. The sync job runs whenever a worker picks it up.
The job processor handles the actual orchestration:
@Processor('project-sync')
export class ProjectSyncProcessor {
  constructor(
    private readonly commandBus: CommandBus,
    private readonly projectRepository: ProjectRepositoryPort,
  ) {}

  @Process('full-sync')
  async handleFullSync(job: Job<{ projectId: string; stripeSecretKey: string }>) {
    const { projectId, stripeSecretKey } = job.data

    await this.projectRepository.updateSyncStatus(projectId, 'running')

    const syncTypes: SyncType[] = [
      'sync_products',
      'sync_prices',
      'sync_coupons',
      'sync_promotion_codes',
      'sync_customers',
      'sync_subscriptions',
    ]

    let hasFailures = false

    for (const type of syncTypes) {
      try {
        await this.commandBus.execute(
          new RunSyncTypeCommand({ projectId, stripeSecretKey, type }),
        )
      } catch (error) {
        hasFailures = true
        // Log the failure but continue with the next sync type.
        // Each type manages its own log entry.
      }
    }

    const finalStatus = hasFailures ? 'failed' : 'completed'
    await this.projectRepository.updateSyncStatus(projectId, finalStatus)
  }
}
Notice the sequential for loop rather than Promise.all. This is intentional. Prices reference products by their Stripe product ID. Subscriptions reference both customers and prices. Running them in parallel will cause foreign key violations for any user who has these relationships in their Stripe data — which is basically everyone.
The dependency order is:
Products → Prices → Coupons → Promotion Codes → Customers → Subscriptions
Always sync in this order and you will never hit a dependency issue.
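One way to make that ordering explicit, and to guard against someone reordering the list later, is to encode the dependencies and assert that the order satisfies them. A small sketch under my own assumptions (the `dependsOn` map and `isValidOrder` helper are illustrative, not part of the codebase described here):

```typescript
type SyncType =
  | 'sync_products'
  | 'sync_prices'
  | 'sync_coupons'
  | 'sync_promotion_codes'
  | 'sync_customers'
  | 'sync_subscriptions'

// Which types must already be synced before each type can run.
const dependsOn: Record<SyncType, SyncType[]> = {
  sync_products: [],
  sync_prices: ['sync_products'],
  sync_coupons: [],
  sync_promotion_codes: ['sync_coupons'],
  sync_customers: [],
  sync_subscriptions: ['sync_customers', 'sync_prices'],
}

// Check that every type appears after all of its dependencies.
function isValidOrder(order: SyncType[]): boolean {
  const seen = new Set<SyncType>()
  for (const type of order) {
    if (dependsOn[type].some((dep) => !seen.has(dep))) return false
    seen.add(type)
  }
  return true
}
```

A check like this can run in a unit test so a reordering that breaks the dependency chain fails CI instead of failing in production with foreign key violations.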
Paginating Through Stripe Without Loading Everything Into Memory
Stripe uses cursor-based pagination. Each list endpoint accepts limit (max 100) and starting_after (the ID of the last item from the previous page). You keep fetching until has_more is false.
The cleanest way to handle this in TypeScript is with an async generator:
async function* paginateStripe<T extends { id: string }>(
  fetchPage: (cursor?: string) => Promise<Stripe.ApiList<T>>
): AsyncGenerator<T> {
  let cursor: string | undefined = undefined

  while (true) {
    const page = await fetchPage(cursor)

    for (const item of page.data) {
      yield item
    }

    if (!page.has_more) break
    cursor = page.data[page.data.length - 1].id
  }
}
An async generator yields items one at a time as pages arrive. You never load the entire dataset into memory — you process each item as it comes. For a user with 50,000 customers, this is the difference between 50MB of memory usage and 500MB.
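You can see the generator walk every page without hitting Stripe by driving it with a stubbed `fetchPage` that serves fixed pages; the stub below is purely illustrative (the `ApiListLike` shape mimics Stripe's `{ data, has_more }` list envelope):

```typescript
interface ApiListLike<T> {
  data: T[]
  has_more: boolean
}

// Same cursor-walking generator as above, typed against the stub shape.
async function* paginate<T extends { id: string }>(
  fetchPage: (cursor?: string) => Promise<ApiListLike<T>>
): AsyncGenerator<T> {
  let cursor: string | undefined = undefined
  while (true) {
    const page = await fetchPage(cursor)
    for (const item of page.data) yield item
    if (!page.has_more) break
    cursor = page.data[page.data.length - 1].id
  }
}

// Stub: 250 items served in pages of 100, like Stripe's cursor API.
const items = Array.from({ length: 250 }, (_, i) => ({ id: `cus_${i}` }))

async function fetchPage(cursor?: string): Promise<ApiListLike<{ id: string }>> {
  const start = cursor ? items.findIndex((x) => x.id === cursor) + 1 : 0
  return {
    data: items.slice(start, start + 100),
    has_more: start + 100 < items.length,
  }
}
```

Iterating `paginate(fetchPage)` with `for await` visits all 250 items across three pages while only ever holding one page in memory.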
Usage is clean and readable:
for await (const customer of paginateStripe((cursor) =>
  stripe.customers.list({
    limit: 100,
    starting_after: cursor,
  })
)) {
  await this.upsertCustomer(customer, projectId)
  syncLog.syncedCount++
}
You can also add rate limit handling inside the generator. Stripe allows 100 read requests per second on most endpoints, but if you are running multiple syncs in parallel you may hit limits:
async function* paginateStripe<T extends { id: string }>(
  fetchPage: (cursor?: string) => Promise<Stripe.ApiList<T>>,
  delayMs = 50
): AsyncGenerator<T> {
  let cursor: string | undefined = undefined

  while (true) {
    const page = await fetchPage(cursor)

    for (const item of page.data) {
      yield item
    }

    if (!page.has_more) break
    cursor = page.data[page.data.length - 1].id

    // Small delay between pages to avoid rate limits
    await new Promise((resolve) => setTimeout(resolve, delayMs))
  }
}
A 50ms delay per 100-item page adds about half a second per 1,000 items, which is negligible for a background job, and it keeps you well within Stripe's rate limits.
Making Every Write Idempotent
Background jobs can fail and be retried. Network timeouts happen. Your database can be temporarily unavailable. If a sync crashes at customer 3,000 and restarts from the beginning, you need writes to be safe to repeat.
The solution is upsert operations keyed on the external ID — in this case the Stripe ID:
async upsertCustomer(stripeCustomer: Stripe.Customer, projectId: string): Promise<void> {
  const existing = await this.customerRepository.findByStripeId(stripeCustomer.id)

  if (existing) {
    existing.update({
      email: stripeCustomer.email ?? null,
      name: stripeCustomer.name ?? null,
      phone: stripeCustomer.phone ?? null,
      metadata: stripeCustomer.metadata ?? null,
      description: stripeCustomer.description ?? null,
    })
    await this.customerRepository.save(existing)
    return
  }

  const customer = Customer.create({
    projectId,
    stripeCustomerId: stripeCustomer.id,
    externalUserId: stripeCustomer.metadata?.billyExternalUserId ?? stripeCustomer.id,
    email: stripeCustomer.email ?? '',
    name: stripeCustomer.name ?? '',
    phone: stripeCustomer.phone ?? null,
    metadata: stripeCustomer.metadata ?? null,
    status: 'active',
  })
  await this.customerRepository.create(customer)
}
If the record exists, update it. If not, create it. Running this function ten times with the same input produces the same result as running it once. This is the idempotency guarantee you need.
One important detail: for customers coming from an existing Stripe account that was not previously connected to Billy, there is no billyExternalUserId in the metadata. In that case I fall back to using the Stripe customer ID as the external user ID. The developer can update this later once they integrate the SDK into their application.
The same pattern applies to products, prices, and subscriptions. Always check by Stripe ID first before creating.
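The idempotency guarantee is easy to verify in isolation. Here is a stripped-down sketch of the same check-then-write pattern against an in-memory map standing in for the repository (the shapes are simplified and the names are mine, not Billy's actual repository code); running it twice with the same input leaves the store unchanged:

```typescript
interface StripeCustomerLike {
  id: string
  email: string | null
  name: string | null
}

interface CustomerRecord {
  stripeCustomerId: string
  projectId: string
  email: string
  name: string
}

// In-memory stand-in for the repository, keyed by Stripe ID.
const store = new Map<string, CustomerRecord>()

function upsertCustomer(stripeCustomer: StripeCustomerLike, projectId: string): void {
  const existing = store.get(stripeCustomer.id)
  if (existing) {
    // Record exists: update mutable fields in place.
    existing.email = stripeCustomer.email ?? ''
    existing.name = stripeCustomer.name ?? ''
    return
  }
  // No record yet: create one keyed on the external (Stripe) ID.
  store.set(stripeCustomer.id, {
    stripeCustomerId: stripeCustomer.id,
    projectId,
    email: stripeCustomer.email ?? '',
    name: stripeCustomer.name ?? '',
  })
}
```

One caveat worth knowing: check-then-create has a race if two workers process the same customer concurrently. A unique constraint on the Stripe ID column (or a database-level upsert) closes that gap and makes the guarantee hold under concurrency as well.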
Handling Failures Gracefully
Each sync type runs in its own try/catch block and maintains its own log entry. This isolation means a failure in one type does not cascade to others:
@CommandHandler(RunSyncTypeCommand)
export class RunSyncTypeHandler implements ICommandHandler<RunSyncTypeCommand> {
  async execute(command: RunSyncTypeCommand): Promise<void> {
    const { projectId, stripeSecretKey, type } = command.payload

    const syncLog = await this.syncLogRepository.create({
      projectId,
      type,
      status: 'running',
      syncedCount: 0,
      failedCount: 0,
      currentStep: 0,
      startedAt: new Date(),
    })

    try {
      switch (type) {
        case 'sync_products':
          await this.syncProducts(projectId, stripeSecretKey, syncLog)
          break
        case 'sync_prices':
          await this.syncPrices(projectId, stripeSecretKey, syncLog)
          break
        case 'sync_customers':
          await this.syncCustomers(projectId, stripeSecretKey, syncLog)
          break
        case 'sync_subscriptions':
          await this.syncSubscriptions(projectId, stripeSecretKey, syncLog)
          break
        // ... other types
      }

      syncLog.status = 'completed'
      syncLog.completedAt = new Date()
    } catch (error) {
      syncLog.status = 'failed'
      syncLog.error = error.message
      syncLog.completedAt = new Date()
      throw error // Re-throw so the processor knows this type failed
    } finally {
      await this.syncLogRepository.save(syncLog)
    }
  }
}
Individual item failures are handled differently from type-level failures. If one customer fails to import due to bad data, you probably want to log it and continue rather than abort the entire customer sync:
async syncCustomers(projectId: string, stripeSecretKey: string, syncLog: ProjectSyncLog) {
  const stripe = this.stripeClientFactory.create(stripeSecretKey)

  for await (const customer of paginateStripe((cursor) =>
    stripe.customers.list({ limit: 100, starting_after: cursor })
  )) {
    try {
      await this.upsertCustomer(customer, projectId)
      syncLog.syncedCount++
    } catch (error) {
      syncLog.failedCount++
      // Log the individual failure but keep going
      this.logger.warn(`Failed to import customer ${customer.id}: ${error.message}`)
    }

    // Persist progress every 100 items
    if ((syncLog.syncedCount + syncLog.failedCount) % 100 === 0) {
      syncLog.currentStep = syncLog.syncedCount
      await this.syncLogRepository.save(syncLog)
    }
  }
}
This way, ten customers with malformed metadata do not block the import of the other 4,990. The failedCount field tells the user exactly how many items had problems.
Surfacing Progress to the User
Real-time progress is not just a nice-to-have — it is the difference between a user thinking the feature is broken and a user trusting the system.
I poll the sync status endpoint from the frontend using React Query's refetchInterval. The key insight is that you only poll while a sync is running — once it completes or fails you stop polling automatically:
const { data: syncStatus } = useProjectSyncStatus(projectId, {
  refetchInterval: (query) => {
    const isSyncActive = query.state.data?.syncLogs?.some(
      (log) => log.status === 'running' || log.status === 'pending',
    )
    return isSyncActive ? 2000 : false
  },
})
The backend endpoint returns the current state of all sync logs:
@Get(':projectId/sync-status')
async getSyncStatus(@Param('projectId') projectId: string) {
  const syncLogs = await this.syncLogRepository.findByProjectId(projectId)
  const project = await this.projectRepository.findById(projectId)

  return {
    syncStatus: project.syncStatus,
    syncLogs: syncLogs.map((log) => ({
      type: log.type,
      status: log.status,
      syncedCount: log.syncedCount,
      failedCount: log.failedCount,
      currentStep: log.currentStep,
      error: log.error,
    })),
  }
}
In the UI I show a list of sync operations with individual progress for each:
✓ Products 12 synced
✓ Prices 31 synced
⟳ Customers 1,247 / syncing...
Subscriptions waiting
This gives the user complete visibility. They know the system is working, they know how much data has been imported, and they can see immediately if something goes wrong.
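The rendering itself is a simple mapping from log entry to display line. A sketch of how those rows could be produced (`renderSyncRow` and its status icons are illustrative choices, not the actual dashboard code):

```typescript
interface SyncRow {
  label: string
  status: 'pending' | 'running' | 'completed' | 'failed'
  syncedCount: number
}

// Map each sync log entry to a one-line progress string.
function renderSyncRow(row: SyncRow): string {
  switch (row.status) {
    case 'completed':
      return `✓ ${row.label} ${row.syncedCount} synced`
    case 'running':
      return `⟳ ${row.label} ${row.syncedCount} / syncing...`
    case 'failed':
      return `✗ ${row.label} failed after ${row.syncedCount}`
    default:
      return `  ${row.label} waiting`
  }
}
```

Because the endpoint already returns one entry per sync type, the whole panel is just `syncLogs.map(renderSyncRow)`.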
Allowing Re-Sync
The final piece is letting users trigger a re-sync manually. This is important for two reasons: recovering from failures, and keeping data fresh if a user has added new records in Stripe after the initial import.
The re-sync endpoint accepts an optional type parameter:
@Post(':projectId/sync')
async triggerSync(
  @Param('projectId') projectId: string,
  @Body() dto: { type?: SyncType },
) {
  const project = await this.projectRepository.findById(projectId)

  await this.syncQueue.add('full-sync', {
    projectId,
    stripeSecretKey: project.stripeSecretKey,
    types: dto.type ? [dto.type] : undefined, // undefined = all types
  })

  return { queued: true }
}
If you pass a specific type, only that sync runs. If you pass nothing, the full sync runs. The idempotent upserts mean this is always safe to trigger — you will never end up with duplicate data.
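For the processor to honor that optional filter, it needs to resolve the payload into an ordered list before looping, always preserving the canonical dependency order even when a caller requests types out of order. A minimal sketch of that resolution step (`resolveSyncTypes` is my naming for illustration, not from the codebase):

```typescript
type SyncType =
  | 'sync_products'
  | 'sync_prices'
  | 'sync_coupons'
  | 'sync_promotion_codes'
  | 'sync_customers'
  | 'sync_subscriptions'

// Canonical dependency order; any filtered subset must preserve it.
const ORDERED_SYNC_TYPES: SyncType[] = [
  'sync_products',
  'sync_prices',
  'sync_coupons',
  'sync_promotion_codes',
  'sync_customers',
  'sync_subscriptions',
]

// Resolve the job payload's optional filter into an ordered list:
// no filter means run everything; a filter keeps canonical order.
function resolveSyncTypes(requested?: SyncType[]): SyncType[] {
  if (!requested || requested.length === 0) return ORDERED_SYNC_TYPES
  return ORDERED_SYNC_TYPES.filter((type) => requested.includes(type))
}
```

Filtering the canonical list, rather than iterating the requested list directly, means a request like `['sync_subscriptions', 'sync_products']` still runs products first.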
The Result
When a developer connects their Stripe account to Billy, their entire billing history appears within minutes. Products, prices, customers, and subscriptions — all imported automatically, with real-time progress visible in the dashboard.
No manual migration scripts. No CSV exports. No "please re-create your products in our system". Just connect and go.
This turned out to be one of the highest-leverage features in the product. The biggest objection to adopting new billing infrastructure is always "I already have data in Stripe". With automatic sync, that objection disappears.
Key Takeaways
Use a separate log entity per sync operation rather than fields on the parent entity. You get history, granular failure tracking, and the ability to re-run individual steps.
Run sync types sequentially in dependency order. Products before prices before customers before subscriptions. Parallel execution will give you foreign key violations.
Use async generators to paginate external APIs. You process items as they arrive rather than loading everything into memory.
Make every write idempotent by upserting on the external ID. Jobs will be retried. Network will fail. Your writes need to be safe to repeat.
Persist progress every N items rather than every item. Writing to the database on every record will slow your sync significantly. Every 100 items is a reasonable balance.
Show real-time progress to the user. Polling with React Query's refetchInterval is simple and effective. Users who can see progress trust the system. Users staring at a spinner do not.
Isolate failures per sync type so a problem with one type does not block the others. Track individual item failures separately from type-level failures.
If you are building a SaaS and want billing infrastructure that handles subscriptions, tokens, and seat-based billing without rebuilding Stripe logic from scratch — I am building Billy for exactly this. Early access coming soon.