DEV Community

nk_Enuke

Reading CSVs with varying column counts that pandas cannot read using DuckDB

What are CSV files with varying column counts?

6, 54
58,81,62
75,84,64,21,55
20,71,55,32

This is what they look like.
For example, CSV files output from devices sometimes have rows after the first line that contain more columns than expected.

Let's try converting this file to a pandas DataFrame

So, let's mindlessly convert it to a pandas DataFrame with the following code:

import pandas as pd
df = pd.read_csv("./csv_file1.csv")
print(df)

This resulted in the following error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 5

This means that line 3 had 5 columns when only 3 were expected, which caused the error.
It appears that pandas determines the expected number of columns from the first few lines of the file (here it settled on 3, though I'm not sure how many lines it inspects by default). So when a later row contains more columns than expected, parsing fails and the data cannot be read.
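As an aside, if you happen to know the maximum column count in advance, pandas itself can read such a file: passing explicit column names via the `names` parameter makes it pad shorter rows with NaN instead of raising an error. A minimal sketch under that assumption (the sample file is recreated inline, and the `col_` names are my own, not part of the original file):

```python
import pandas as pd

# Recreate the sample file from the article.
with open("csv_file1.csv", "w", encoding="utf-8") as f:
    f.write("6, 54\n58,81,62\n75,84,64,21,55\n20,71,55,32\n")

# If the maximum column count (here 5) is known in advance, passing
# explicit names lets pandas pad shorter rows with NaN instead of failing.
df = pd.read_csv(
    "csv_file1.csv",
    header=None,                             # the file has no header row
    names=[f"col_{i+1}" for i in range(5)],  # one name per expected column
)
print(df)
```

The catch, of course, is that you must already know the maximum width, which is exactly what the solutions below work around.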

Solution 1: Find the maximum number of columns and fill the DataFrame with empty values for missing parts

So I thought I could open the CSV file with something other than pandas, find the maximum number of columns, and then fill rows with fewer columns with empty values. Here's the code I created:

import pandas as pd
with open('./csv_file1.csv', 'r', encoding='utf-8') as file:
    lines = file.readlines()
comma_count_max = max(line.count(',') for line in lines)
columns = [f'col_{i+1}' for i in range(comma_count_max + 1)]
data = []
for line in lines:
    row_data = line.strip().split(',')
    # Add empty strings as needed
    if len(row_data) < len(columns):
        row_data += [''] * (len(columns) - len(row_data))
    data.append(row_data)
df = pd.DataFrame(data, columns=columns)
print(df)

Since pandas errors out on read, I open the file with Python's open() function and find the line with the most commas. That comma count + 1 is the maximum number of columns, and rows with fewer columns are padded with empty strings.
The result is:

  col_1 col_2 col_3 col_4 col_5
0     6    54                  
1    58    81    62            
2    75    84    64    21    55
3    20    71    55    32      

This meets the requirement, but consider data like the following:

6, 54
58,81,62,,,,,
75,84,64,21,55
20,71,55,32

For data like this, the result would be:

  col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8
0     6    54                                    
1    58    81    62                              
2    75    84    64    21    55                  
3    20    71    55    32        

This is fine for a few rows, but for large files I suspected it would be quite heavy to read the whole CSV in Python, scan every line for the maximum comma count, and pad each row one by one.
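As a side note, if those trailing commas are just noise, one way to keep Solution 1 from creating phantom columns is to strip trailing empty fields before computing the maximum width. A minimal sketch (my own variation on the code above, not part of the original solution):

```python
import pandas as pd

# Sample data with trailing commas on the second row, as in the article.
with open("csv_file2.csv", "w", encoding="utf-8") as f:
    f.write("6, 54\n58,81,62,,,,,\n75,84,64,21,55\n20,71,55,32\n")

rows = []
with open("csv_file2.csv", "r", encoding="utf-8") as f:
    for line in f:
        fields = line.strip().split(",")
        # Drop trailing empty fields so ",,,,," does not inflate the width.
        while fields and fields[-1] == "":
            fields.pop()
        rows.append(fields)

width = max(len(r) for r in rows)
columns = [f"col_{i+1}" for i in range(width)]
data = [r + [""] * (width - len(r)) for r in rows]
df = pd.DataFrame(data, columns=columns)
print(df)
```

With this, the example above comes out as 5 columns instead of 8. Whether dropping trailing empty fields is safe depends on your data, so treat it as an option rather than a default.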

Solution 2: Use DuckDB's null_padding option

DuckDB's read_csv has a null_padding option that pads rows with missing columns using NULL values.

import duckdb
con = duckdb.connect(database=":memory:")
print(con.sql("SELECT * FROM read_csv('csv_file1.csv',null_padding=true)"))

Running these three lines of code gives us...?

┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ column0 │ column1 │ column2 │ column3 │ column4 │ column5 │ column6 │ column7 │
│  int64  │  int64  │  int64  │  int64  │  int64  │ varchar │ varchar │ varchar │
├─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│       6 │      54 │    NULL │    NULL │    NULL │ NULL    │ NULL    │ NULL    │
│      58 │      81 │      62 │    NULL │    NULL │ NULL    │ NULL    │ NULL    │
│      75 │      84 │      64 │      21 │      55 │ NULL    │ NULL    │ NULL    │
│      20 │      71 │      55 │      32 │    NULL │ NULL    │ NULL    │ NULL    │
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

It automatically pads every row up to the maximum number of columns with NULLs, and even generates headers for all of them.
(Though naturally, the number of columns created matches the maximum comma count, so trailing commas still produce extra columns.)
By the way, when you want to convert to a DataFrame:

import duckdb
con = duckdb.connect(database=":memory:")
print(con.sql("SELECT * FROM read_csv('csv_file1.csv',null_padding=true)").df())

Adding .df() at the end like this converts it to a pandas DataFrame.
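One more side note on those extra all-NULL columns from trailing commas: once you have a pandas DataFrame, they can be dropped with dropna(axis=1, how="all"). A minimal sketch on a DataFrame shaped like DuckDB's output above (the values are hand-copied here rather than produced by DuckDB, so the setup is illustrative):

```python
import pandas as pd
import numpy as np

# A DataFrame shaped like DuckDB's output above: the last three
# columns exist only because of trailing commas and hold only NULLs.
df = pd.DataFrame(
    {
        "column0": [6, 58, 75, 20],
        "column1": [54, 81, 84, 71],
        "column2": [np.nan, 62, 64, 55],
        "column3": [np.nan, np.nan, 21, 32],
        "column4": [np.nan, np.nan, 55, np.nan],
        "column5": [np.nan] * 4,
        "column6": [np.nan] * 4,
        "column7": [np.nan] * 4,
    }
)

# Drop columns that contain nothing but NULLs.
df = df.dropna(axis=1, how="all")
print(df.columns.tolist())
```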
If you're facing this issue, please give it a try! 🙌
