Bahador Raghibizadeh
Introducing db2lake: A Lightweight and Powerful ETL Framework for Node.js

Transferring data from operational databases to data lakes or warehouses is a critical need in the modern data landscape. Traditional ETL (Extract, Transform, Load) tools are often complex or rely on costly cloud infrastructure, making them overkill for small to medium-sized projects. db2lake is an open-source Node.js-based framework that simplifies this process with a lightweight, flexible, and intuitive API. It not only handles data extraction and loading but also supports transformation and integration with various architectures like webhooks and database triggers. In this article, we introduce db2lake, its available drivers, the transformer capability, suggested architectures, testing with Vitest, and a comparison with competitors.

Why db2lake? Key Benefits

db2lake is designed for developers who want to implement ETL quickly without the complexity of enterprise tools:

  • Lightweight and Fast: The core package (@db2lake/core) is minimal and uses streaming and batch processing for high performance.
  • TypeScript-Native: Strong typing reduces errors and enhances the development experience.
  • Free and Open-Source: Licensed under MIT, suitable for personal, startup, and commercial projects.
  • Node.js Integration: Seamlessly fits into the JavaScript ecosystem, no new languages required.

With db2lake, you can transfer data from PostgreSQL to Redshift in just a few lines of code!

Installation and Quick Start

Install via npm:

npm install @db2lake/core @db2lake/driver-mysql @db2lake/driver-bigquery

Simple example:

import { Pipeline, ITransformer, ILogger } from '@db2lake/core';
import { MySQLSourceDriver } from '@db2lake/driver-mysql';
import { BigQueryDestinationDriver } from '@db2lake/driver-bigquery';

// --- Configure drivers (fill with your credentials) ---
const mysqlConfig = {
    query: 'SELECT * FROM users WHERE id > ? LIMIT 50',
    params: [0],
    cursorField: 'id',
    cursorParamsIndex: 0,
    connectionUri: 'mysql://user:password@localhost:3306/shopdb'
};

const bigqueryConfig = {
    bigQueryOptions: {
        keyFilename: './service-account.json',
        projectId: 'my-project-id'
    },
    dataset: 'my_dataset',
    table: 'users',
    batchSize: 1000,
    // Optional: extra write options passed through to BigQuery
    writeOptions: {
        sourceFormat: 'NEWLINE_DELIMITED_JSON'
    }
};

// --- Transformer: adapt source row shape to destination schema ---
const transformer: ITransformer<any, any> = (rows) => rows.map(r => ({
    id: r.id,
    fullName: `${r.name}`,
    createdAt: r.created_at instanceof Date ? r.created_at.toISOString() : r.created_at
}));

// --- Logger ---
const logger: ILogger = (level, message, data) => {
    const ts = new Date().toISOString();
    console.log(`${ts} [${level.toUpperCase()}] ${message}`);
    if (data) console.debug(data);
};

async function main() {
    const source = new MySQLSourceDriver(mysqlConfig);
    const dest = new BigQueryDestinationDriver(bigqueryConfig);

    const pipeline = new Pipeline(source, dest, transformer, logger);

    try {
        await pipeline.run();
        console.log('Pipeline finished', pipeline.getMetrics());
    } catch (err) {
        console.error('Pipeline error', err);
    }
}

main().catch(err => { console.error(err); process.exit(1); });

Available Drivers

db2lake uses a modular design: the core (@db2lake/core) and separate drivers for sources and destinations. Available drivers:

Sources:

  • PostgreSQL (@db2lake/driver-postgres): Supports complex queries, incremental loads, and Supabase integration.
  • MySQL (@db2lake/driver-mysql): Fast streaming with mysql2/promise.
  • Oracle (@db2lake/driver-oracle): For legacy systems.
  • Firestore (@db2lake/driver-firestore): For NoSQL databases.

Destinations:

  • BigQuery (@db2lake/driver-bigquery): Batch inserts for analytics.
  • Databricks (@db2lake/driver-databricks): Loads to Delta Lake tables.
  • Redshift (@db2lake/driver-redshift): Upsert and bulk loads for AWS.
  • Snowflake (@db2lake/driver-snowflake): Supported only as a destination.

Developers can contribute new drivers via the project’s GitHub repository.
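
Because every driver plugs into the same Pipeline, switching sources or destinations is mostly a configuration change. Below is a hedged sketch of the PostgreSQL-to-Redshift case mentioned earlier; the class names follow the MySQL/BigQuery naming pattern and the config fields mirror the quick-start example, so check each driver's README for the exact options.

import { Pipeline } from '@db2lake/core';
import { PostgresSourceDriver } from '@db2lake/driver-postgres';
import { RedshiftDestinationDriver } from '@db2lake/driver-redshift';

// Field names below mirror the MySQL/BigQuery example and are assumptions,
// not the drivers' documented options.
const source = new PostgresSourceDriver({
    connectionUri: 'postgres://user:password@localhost:5432/shopdb',
    query: 'SELECT * FROM users WHERE id > $1 LIMIT 50',
    params: [0],
    cursorField: 'id',
    cursorParamsIndex: 0
});

const dest = new RedshiftDestinationDriver({
    connectionUri: 'redshift://user:password@my-cluster:5439/analytics',
    table: 'users',
    batchSize: 1000
});

// transformer and logger are the ones defined in the quick-start example
await new Pipeline(source, dest, transformer, logger).run();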

Transformer Capability: Smart Data Processing

db2lake supports data transformation (the T in ETL) through the ITransformer interface, allowing you to modify data before loading, such as mapping fields or adding calculations.
Example:

// --- Transformer: adapt source row shape to destination schema ---
const transformer: ITransformer<any, any> = (rows) => rows.map(r => ({
    id: r.id,
    fullName: `${r.name}`,
    createdAt: r.created_at instanceof Date ? r.created_at.toISOString() : r.created_at
}));

Transformers can be async and integrate with tools like Lodash, making db2lake a complete ETL solution.
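
Because transformers can be async, they can also enrich rows or lean on helper libraries. Here is a minimal sketch that maps snake_case column names to camelCase with Lodash and adds a derived field; the asyncTransformer name and the assumption that ITransformer accepts an async function are illustrative.

import _ from 'lodash';
import { ITransformer } from '@db2lake/core';

// Convert snake_case column names to camelCase keys and add a derived field
const asyncTransformer: ITransformer<any, any> = async (rows) =>
    rows.map(r => ({
        ...Object.fromEntries(Object.entries(r).map(([key, value]) => [_.camelCase(key), value])),
        loadedAt: new Date().toISOString()   // timestamp of when this batch was transformed
    }));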

Suggested Architectures: How to Deploy db2lake

db2lake is flexible and works in various architectures, depending on project needs:

1. Scheduled Cronjob (For Periodic ETL)

Why? Simple for weekly/daily transfers of new data using cursors.
Implementation:

const cron = require('node-cron');
cron.schedule('0 2 * * 0', runETL);  // Every Sunday at 2 AM
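
Here, runETL is simply a helper that builds and runs the pipeline from the quick-start example. A minimal sketch:

async function runETL() {
    const pipeline = new Pipeline(
        new MySQLSourceDriver(mysqlConfig),
        new BigQueryDestinationDriver(bigqueryConfig),
        transformer,
        logger
    );
    await pipeline.run();
}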

Benefits: No real-time overhead; ideal for startups.

2. Event-Driven with Webhook (For Real-Time)

Why? Perfect for databases like Supabase that support webhooks for table changes.
Implementation:

const express = require('express');
const app = express();
app.use(express.json());

app.post('/webhook', async (req, res) => {
  // Supabase database webhooks send a payload with a "type" field (INSERT / UPDATE / DELETE)
  if (req.body.type === 'INSERT') {
    await runETL();  // re-use the runETL helper sketched in the cron section
  }
  res.send('OK');
});

app.listen(3000);

3. Database Trigger Functions (For Internal Automation)

Why? Minimal latency with database triggers (e.g., PostgreSQL pg_notify).
Implementation: SQL trigger sending signals to a Node.js server.
PostgreSQL Example:

CREATE FUNCTION notify_change() RETURNS trigger AS $$
BEGIN
  PERFORM pg_notify('data_change', row_to_json(NEW)::text);
  RETURN NEW;
END;
$$ LANGUAGE plpgsql;

CREATE TRIGGER table_change AFTER INSERT OR UPDATE ON your_table
FOR EACH ROW EXECUTE PROCEDURE notify_change();
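
On the Node.js side, a pg client can LISTEN on the same channel and kick off the pipeline whenever a notification arrives. A minimal sketch, assuming the pg package and the runETL helper from the cron section:

const { Client } = require('pg');

const listener = new Client({ connectionString: 'postgres://user:password@localhost:5432/shopdb' });

async function listen() {
  await listener.connect();
  await listener.query('LISTEN data_change');       // same channel used by pg_notify above
  listener.on('notification', async (msg) => {
    console.log('Change received:', msg.payload);   // JSON text of the inserted/updated row
    await runETL();
  });
}

listen().catch(console.error);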

In all of these architectures, cursors track the last transferred row and prevent duplicate transfers. For scaling, run the worker under Docker or PM2.
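
For PM2, a cron-style restart in the ecosystem file turns the ETL script into a scheduled worker. A minimal sketch (the file and app names are illustrative):

// ecosystem.config.js — run the ETL script on a schedule under PM2
module.exports = {
  apps: [{
    name: 'db2lake-etl',
    script: './etl.js',
    cron_restart: '0 2 * * 0',   // same schedule as the node-cron example above
    autorestart: false           // let each run exit instead of restarting immediately
  }]
};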

Testing with Vitest: Ensuring Quality

To ensure db2lake’s reliability, unit and integration tests are run before each release using Vitest, chosen for its speed and TypeScript integration. Check the GitHub repository for test details.
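
The same approach works for your own pipelines. For example, a transformer is easy to unit-test in isolation; the sketch below exercises the transformer from the quick-start example:

import { describe, expect, it } from 'vitest';
import { ITransformer } from '@db2lake/core';

// the transformer from the quick-start example
const transformer: ITransformer<any, any> = (rows) => rows.map(r => ({
    id: r.id,
    fullName: `${r.name}`,
    createdAt: r.created_at instanceof Date ? r.created_at.toISOString() : r.created_at
}));

describe('transformer', () => {
    it('maps source rows to the destination schema', () => {
        const out = transformer([{ id: 1, name: 'Ada', created_at: new Date('2025-01-01T00:00:00Z') }]);
        expect(out).toEqual([{ id: 1, fullName: 'Ada', createdAt: '2025-01-01T00:00:00.000Z' }]);
    });
});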

Comparison with Competitors

db2lake stands out with its simplicity and Node.js focus. Where traditional ETL tools bring heavyweight infrastructure or costly cloud services, db2lake keeps the whole pipeline in a few lines of TypeScript that run anywhere Node.js runs. It excels for developers seeking simplicity and control.

Advanced Tips and Future

  • Security: Use Snyk for vulnerability scanning and add a security badge to the README.
  • Monitoring: Integrate with Winston for logging (see the sketch after this list).
  • Future: New drivers (e.g., MongoDB), serverless support (Vercel), and a UI dashboard.
  • Contribute: Join the project on GitHub!
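
As a sketch of the Winston integration, db2lake's ILogger callback (shown in the quick start) can forward to a Winston logger instead of console.log; the level mapping below assumes db2lake uses level names Winston understands.

import winston from 'winston';
import { ILogger } from '@db2lake/core';

const winstonLogger = winston.createLogger({
    level: 'debug',
    transports: [new winston.transports.Console()]
});

// Forward db2lake log calls to Winston
const logger: ILogger = (level, message, data) => winstonLogger.log(level, message, data ?? {});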

Conclusion

db2lake is a lightweight, powerful ETL framework for transferring data from operational databases to data lakes and warehouses such as BigQuery, Databricks, Redshift, and Snowflake (destination only). With TypeScript support, transformers, flexible architectures, and Vitest-backed testing, it's an excellent choice for Node.js developers. Get started with npm install @db2lake/core and send your data to the lake!
