As a data engineer, one of the most repetitive tasks I face is ingesting data from CSV files. The problem isn't just loading the data; it's the ceremony that comes with it. Every time a new data source appears, I have to manually inspect the columns, define a table schema, and write a script to load it. What if the CSV has 100 columns? What if the data types are ambiguous? This process is tedious and error-prone.
I wanted a better way. My goal was to create a Node.js script that could:
- Read any CSV file without prior knowledge of its structure.
- Auto-detect the schema, including column names and data types.
- Convert the CSV to Parquet, a highly efficient columnar storage format.
- Prepare for ingestion into ClickHouse, which loves Parquet.
In this article, I'll walk you through the proof-of-concept I built to solve this exact problem.
Why Parquet and ClickHouse?
ClickHouse is an open-source, column-oriented database built for speed. It's a beast for analytical queries (OLAP).
Apache Parquet is a columnar storage format. Instead of storing data in rows, it stores it in columns. This is a perfect match for ClickHouse because it allows the database to read only the columns a query actually needs, dramatically reducing I/O and speeding up queries.
Combining the two means you get efficient storage and lightning-fast analytics.
The Concept: From CSV to Parquet
Our script will perform a three-step process:
- Schema Detection: Read the CSV headers and a few sample rows to infer the data type of each column (e.g., string, number, boolean).
- Data Transformation: Convert the raw string values from the CSV into their inferred types.
- Parquet Conversion: Write the transformed data and the detected schema into a new .parquet file.
Let's start with a simple CSV file to demonstrate.
Our Example CSV: sample-data.csv
id,first_name,last_name,email,is_active,created_at,balance
1,John,Doe,john.doe@example.com,true,2023-01-15T10:00:00Z,150.75
2,Jane,Smith,jane.smith@example.com,false,2023-02-20T11:30:00Z,200.00
3,Peter,Jones,peter.jones@example.com,true,2023-03-10T09:05:00Z,50.25
4,Mary,Williams,mary.w@example.com,true,2023-04-01T15:45:00Z,300.50
This file has a nice mix of data types: integers, strings, booleans, timestamps, and floating-point numbers.
Step 1: Setting Up the Project
First, let's set up a simple Node.js project and install the necessary libraries. We'll use papaparse for robust CSV parsing and parquetjs for writing Parquet files.
npm init -y
npm install papaparse parquetjs
Step 2: Auto-Detecting the Schema
This is the core of our solution. We need a function that can look at the string data from the CSV and make an educated guess about its real type.
Here’s a simple type inference function. It checks if a value is a boolean, a number, or a date. If it's none of those, it defaults to a string.
index.js
// Function to infer data type from a string value
function inferType(value) {
if (value === 'true' || value === 'false') return 'BOOLEAN';
if (!isNaN(value) && value.trim() !== '') return 'DOUBLE'; // Use DOUBLE for all numbers for simplicity
if (!isNaN(Date.parse(value))) return 'UTF8'; // Dates will be stored as strings (UTF8)
return 'UTF8'; // Default to string
}
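To sanity-check the inference logic, here's a quick spot-check against values from our sample file (the expected results in the comments are simply what the function as written should return):
// Spot-checking inferType against values from sample-data.csv
console.log(inferType('1'));                    // DOUBLE
console.log(inferType('John'));                 // UTF8
console.log(inferType('true'));                 // BOOLEAN
console.log(inferType('2023-01-15T10:00:00Z')); // UTF8 (dates kept as strings)
console.log(inferType('150.75'));               // DOUBLE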
Now, we can use this to build a schema from the CSV file. We'll read the first data row to infer the types for each column.
index.js
const fs = require('fs');
const papa = require('papaparse');
// ... inferType function from above ...
async function detectSchema(filePath) {
const fileContent = fs.readFileSync(filePath, 'utf8');
const result = papa.parse(fileContent, { header: true, preview: 1 });
if (result.errors.length > 0) {
throw new Error('Error parsing CSV for schema detection.');
}
const firstRecord = result.data[0];
const schema = {};
for (const header in firstRecord) {
const value = firstRecord[header];
schema[header] = { type: inferType(value) };
}
return schema;
}
When we run detectSchema('sample-data.csv'), it will produce an object like this, which is exactly the format parquetjs needs:
{
"id": { "type": "DOUBLE" },
"first_name": { "type": "UTF8" },
"last_name": { "type": "UTF8" },
"email": { "type": "UTF8" },
"is_active": { "type": "BOOLEAN" },
"created_at": { "type": "UTF8" },
"balance": { "type": "DOUBLE" }
}
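If you want to see that output yourself, a throwaway call like this (not part of the final script) will print the detected schema:
// Quick inspection of the detected schema
detectSchema('sample-data.csv')
  .then((schema) => console.log(JSON.stringify(schema, null, 2)))
  .catch(console.error);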
Step 3: Converting CSV to Parquet
With the schema detected, we can now read the entire CSV and write it to a Parquet file. The process involves:
- Creating a ParquetSchema and a ParquetWriter.
- Iterating through the CSV rows.
- Casting each value to its detected type (e.g., converting the string "true" to the boolean true).
- Appending the transformed row to the writer.
Here's the code to bring it all together:
index.js
const { ParquetSchema, ParquetWriter } = require('parquetjs');
// ... detectSchema and inferType functions ...
async function convertCsvToParquet(csvPath, parquetPath) {
console.log('Detecting schema...');
const schemaDefinition = await detectSchema(csvPath);
const schema = new ParquetSchema(schemaDefinition);
const writer = await ParquetWriter.openFile(schema, parquetPath);
const fileContent = fs.readFileSync(csvPath, 'utf8');
const result = papa.parse(fileContent, { header: true, skipEmptyLines: true });
console.log(`Found ${result.data.length} records. Converting to Parquet...`);
for (const row of result.data) {
const processedRow = {};
for (const key in row) {
const value = row[key];
const type = schemaDefinition[key].type;
// Cast values to their proper types
if (type === 'BOOLEAN') {
processedRow[key] = value === 'true';
} else if (type === 'DOUBLE') {
processedRow[key] = parseFloat(value);
} else {
processedRow[key] = value;
}
}
await writer.appendRow(processedRow);
}
await writer.close();
console.log(`Parquet file written to ${parquetPath}`);
}
// Run the conversion
convertCsvToParquet('sample-data.csv', 'output.parquet').catch(console.error);
After running this script (node index.js), you'll have an output.parquet file ready for ClickHouse!
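If you'd like to double-check the output before going anywhere near ClickHouse, parquetjs can also read the file back. Here's a minimal verification sketch (a separate throwaway script, not part of index.js):
// verify.js — read output.parquet back and print each record
const { ParquetReader } = require('parquetjs');

async function verify(parquetPath) {
  const reader = await ParquetReader.openFile(parquetPath);
  const cursor = reader.getCursor();
  let record;
  while ((record = await cursor.next())) {
    console.log(record);
  }
  await reader.close();
}

verify('output.parquet').catch(console.error);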
Step 4: Ingesting into ClickHouse
Now for the easy part. Ingesting the Parquet file into ClickHouse is incredibly simple. First, you need a table with a matching schema.
CREATE TABLE my_data (
id Float64,
first_name String,
last_name String,
email String,
is_active Boolean,
created_at String,
balance Float64
) ENGINE = MergeTree()
ORDER BY id;
Then, you can use the clickhouse-client to ingest the file directly.
clickhouse-client --query="INSERT INTO my_data FORMAT Parquet" < output.parquet
ClickHouse reads the Parquet file's schema and streams the data directly into the table. It's fast, efficient, and requires no complex INSERT statements with tons of values.
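To confirm the rows actually landed, a quick count should return 4 for our sample file:
clickhouse-client --query="SELECT count() FROM my_data"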
Conclusion
By automating schema detection and converting CSVs to Parquet, we've created a powerful and reusable ETL pipeline. This approach saves a massive amount of time, reduces manual errors, and leverages the high-performance capabilities of both Parquet and ClickHouse.
This proof-of-concept can be expanded with more robust type detection, error handling, and integration into a larger data workflow, but it's a fantastic starting point for streamlining your data ingestion process.
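As one example of that "more robust type detection", you could sample several rows instead of trusting the first one, and only keep an inferred type if every sampled value agrees, falling back to UTF8 otherwise. Here's a rough sketch of that idea, reusing the fs, papa, and inferType pieces from above (the sampleSize parameter and the detectSchemaFromSample name are my own, illustrative additions):
// Infer each column's type from up to `sampleSize` rows instead of just one.
// If sampled values disagree (or are empty), fall back to UTF8.
function detectSchemaFromSample(filePath, sampleSize = 50) {
  const fileContent = fs.readFileSync(filePath, 'utf8');
  const result = papa.parse(fileContent, {
    header: true,
    preview: sampleSize,
    skipEmptyLines: true,
  });
  const schema = {};
  for (const row of result.data) {
    for (const header in row) {
      const type = inferType(String(row[header] ?? ''));
      if (!schema[header]) {
        schema[header] = { type };
      } else if (schema[header].type !== type) {
        schema[header] = { type: 'UTF8' }; // conflicting samples -> treat as string
      }
    }
  }
  return schema;
}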