Auto-Detecting CSV Schemas for Lightning-Fast ClickHouse Ingestion with Parquet

As a data engineer, one of the most repetitive tasks I face is ingesting data from CSV files. The problem isn't just loading the data; it's the ceremony that comes with it. Every time a new data source appears, I have to manually inspect the columns, define a table schema, and write a script to load it. What if the CSV has 100 columns? What if the data types are ambiguous? This process is tedious and error-prone.

I wanted a better way. My goal was to create a Node.js script that could:

  1. Read any CSV file without prior knowledge of its structure.
  2. Auto-detect the schema, including column names and data types.
  3. Convert the CSV to Parquet, a highly efficient columnar storage format.
  4. Prepare for ingestion into ClickHouse, which loves Parquet.

In this article, I'll walk you through the proof-of-concept I built to solve this exact problem.

Why Parquet and ClickHouse?

ClickHouse is an open-source, column-oriented database built for speed. It's a beast for analytical queries (OLAP).

Apache Parquet is a columnar storage format: instead of storing data row by row, it stores it column by column. This is a perfect match for ClickHouse because the database can read only the columns a query actually needs, dramatically reducing I/O and speeding up queries.

Combining the two means you get efficient storage and lightning-fast analytics.

The Concept: From CSV to Parquet

Our script will perform a three-step process:

  1. Schema Detection: Read the CSV headers and a few sample rows to infer the data type of each column (e.g., string, number, boolean).
  2. Data Transformation: Convert the raw string values from the CSV into their inferred types.
  3. Parquet Conversion: Write the transformed data and the detected schema into a new .parquet file.

Let's start with a simple CSV file to demonstrate.

Our Example CSV: sample-data.csv

id,first_name,last_name,email,is_active,created_at,balance
1,John,Doe,john.doe@example.com,true,2023-01-15T10:00:00Z,150.75
2,Jane,Smith,jane.smith@example.com,false,2023-02-20T11:30:00Z,200.00
3,Peter,Jones,peter.jones@example.com,true,2023-03-10T09:05:00Z,50.25
4,Mary,Williams,mary.w@example.com,true,2023-04-01T15:45:00Z,300.50

This file has a nice mix of data types: integers, strings, booleans, timestamps, and floating-point numbers.

Step 1: Setting Up the Project

First, let's set up a simple Node.js project and install the necessary libraries. We'll use papaparse for robust CSV parsing and parquetjs for writing Parquet files.

npm init -y
npm install papaparse parquetjs

Step 2: Auto-Detecting the Schema

This is the core of our solution. We need a function that can look at the string data from the CSV and make an educated guess about its real type.

Here’s a simple type inference function. It checks whether a value looks like a boolean or a number; dates and everything else are treated as strings (UTF8), which keeps the schema simple.

index.js

// Function to infer data type from a string value
function inferType(value) {
    if (value === 'true' || value === 'false') return 'BOOLEAN';
    if (!isNaN(value) && value.trim() !== '') return 'DOUBLE'; // Use DOUBLE for all numbers for simplicity
    if (!isNaN(Date.parse(value))) return 'UTF8'; // Dates will be stored as strings (UTF8)
    return 'UTF8'; // Default to string
}
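
As a quick sanity check against values from our sample CSV (illustrative only, using the function above):

// Expected results for a few sample values:
console.log(inferType('true'));                  // BOOLEAN
console.log(inferType('150.75'));                // DOUBLE
console.log(inferType('2023-01-15T10:00:00Z'));  // UTF8 (dates kept as strings)
console.log(inferType('John'));                  // UTF8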

Now, we can use this to build a schema from the CSV file. We'll read the first data row to infer the types for each column.

index.js

const fs = require('fs');
const papa = require('papaparse');

// ... inferType function from above ...

async function detectSchema(filePath) {
    const fileContent = fs.readFileSync(filePath, 'utf8');
    const result = papa.parse(fileContent, { header: true, preview: 1 });

    if (result.errors.length > 0) {
        throw new Error('Error parsing CSV for schema detection.');
    }

    const firstRecord = result.data[0];
    const schema = {};

    for (const header in firstRecord) {
        const value = firstRecord[header];
        schema[header] = { type: inferType(value) };
    }

    return schema;
}

When we run detectSchema('sample-data.csv'), it will produce an object like this, which is exactly the format parquetjs needs:

{
  "id": { "type": "DOUBLE" },
  "first_name": { "type": "UTF8" },
  "last_name": { "type": "UTF8" },
  "email": { "type": "UTF8" },
  "is_active": { "type": "BOOLEAN" },
  "created_at": { "type": "UTF8" },
  "balance": { "type": "DOUBLE" }
}
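
One caveat: we infer each type from the first data row only, so a column whose first value happens to be blank or unusual can be mis-typed. A straightforward hardening, sketched below (this variant is my own illustration, not part of the script above), is to sample several rows and fall back to UTF8 whenever the samples disagree:

// Illustrative variant: infer types from up to `sampleSize` rows instead of one.
// A column keeps BOOLEAN or DOUBLE only if every sampled value agrees;
// any conflict widens the column to UTF8 (string).
function detectSchemaFromSample(filePath, sampleSize = 100) {
    const fileContent = fs.readFileSync(filePath, 'utf8');
    const result = papa.parse(fileContent, { header: true, preview: sampleSize, skipEmptyLines: true });

    const schema = {};
    for (const row of result.data) {
        for (const header in row) {
            const detected = inferType(row[header]);
            if (!schema[header]) {
                schema[header] = { type: detected };
            } else if (schema[header].type !== detected) {
                schema[header] = { type: 'UTF8' };
            }
        }
    }
    return schema;
}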

Step 3: Converting CSV to Parquet

With the schema detected, we can now read the entire CSV and write it to a Parquet file. The process involves:

  1. Creating a ParquetSchema and a ParquetWriter.
  2. Iterating through the CSV rows.
  3. Casting each value to its detected type (e.g., converting the string "true" to the boolean true).
  4. Appending the transformed row to the writer.

Here's the code to bring it all together:

index.js

const { ParquetSchema, ParquetWriter } = require('parquetjs');

// ... detectSchema and inferType functions ...

async function convertCsvToParquet(csvPath, parquetPath) {
    console.log('Detecting schema...');
    const schemaDefinition = await detectSchema(csvPath);
    const schema = new ParquetSchema(schemaDefinition);

    const writer = await ParquetWriter.openFile(schema, parquetPath);

    const fileContent = fs.readFileSync(csvPath, 'utf8');
    const result = papa.parse(fileContent, { header: true, skipEmptyLines: true });

    console.log(`Found ${result.data.length} records. Converting to Parquet...`);

    for (const row of result.data) {
        const processedRow = {};
        for (const key in row) {
            const value = row[key];
            const type = schemaDefinition[key].type;

            // Cast values to their proper types
            if (type === 'BOOLEAN') {
                processedRow[key] = value === 'true';
            } else if (type === 'DOUBLE') {
                processedRow[key] = parseFloat(value);
            } else {
                processedRow[key] = value;
            }
        }
        await writer.appendRow(processedRow);
    }

    await writer.close();
    console.log(`Parquet file written to ${parquetPath}`);
}

// Run the conversion
convertCsvToParquet('sample-data.csv', 'output.parquet').catch(console.error);

After running this script (node index.js), you'll have an output.parquet file ready for ClickHouse!
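
If you want to double-check the file before loading it, parquetjs can also read it back. Here is a small optional check (the previewParquet helper is just for illustration):

const { ParquetReader } = require('parquetjs');

// Read every record from the Parquet file and print it to the console.
async function previewParquet(parquetPath) {
    const reader = await ParquetReader.openFile(parquetPath);
    const cursor = reader.getCursor();
    let record;
    while ((record = await cursor.next())) {
        console.log(record);
    }
    await reader.close();
}

previewParquet('output.parquet').catch(console.error);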

Step 4: Ingesting into ClickHouse

Now for the easy part. Ingesting the Parquet file into ClickHouse is incredibly simple. First, you need a table whose columns mirror the detected schema (Float64 for DOUBLE, String for UTF8, and Boolean for BOOLEAN).

CREATE TABLE my_data (
    id Float64,
    first_name String,
    last_name String,
    email String,
    is_active Boolean,
    created_at String,
    balance Float64
) ENGINE = MergeTree()
ORDER BY id;
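
Since the script already knows the schema, you could also generate this DDL instead of writing it by hand. Here's a rough sketch (the toClickHouseDDL helper and its type mapping are my own illustration, not part of the script above):

// Hypothetical helper: turn the detected schema into a ClickHouse CREATE TABLE statement.
function toClickHouseDDL(tableName, schemaDefinition, orderBy) {
    const typeMap = { DOUBLE: 'Float64', BOOLEAN: 'Boolean', UTF8: 'String' };
    const columns = Object.entries(schemaDefinition)
        .map(([name, def]) => `    ${name} ${typeMap[def.type] || 'String'}`)
        .join(',\n');
    return `CREATE TABLE ${tableName} (\n${columns}\n) ENGINE = MergeTree()\nORDER BY ${orderBy};`;
}

// Example: print the DDL for our sample file
// detectSchema('sample-data.csv').then(s => console.log(toClickHouseDDL('my_data', s, 'id')));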

Then, you can use the clickhouse-client to ingest the file directly.

clickhouse-client --query="INSERT INTO my_data FORMAT Parquet" < output.parquet

ClickHouse reads the Parquet file's schema and streams the data directly into the table. It's fast, efficient, and requires no hand-written INSERT statements with long lists of values.
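
As a quick sanity check (assuming the table and file above), a simple count should come back as 4, one per row of our sample CSV:

clickhouse-client --query="SELECT count() FROM my_data"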

Conclusion

By automating schema detection and converting CSVs to Parquet, we've created a powerful and reusable ETL pipeline. This approach saves a massive amount of time, reduces manual errors, and leverages the high-performance capabilities of both Parquet and ClickHouse.

This proof-of-concept can be expanded with more robust type detection, error handling, and integration into a larger data workflow, but it's a fantastic starting point for streamlining your data ingestion process.
