DEV Community

nk_Enuke

Reading CSVs with varying column counts that pandas cannot read using DuckDB

What are CSV files with varying column counts?

6, 54
58,81,62
75,84,64,21,55
20,71,55,32

This is what they look like.
For example, CSV files output from devices sometimes have rows after the first line that contain more columns than expected.

Let's try converting this file to a pandas DataFrame

So, let's mindlessly convert it to a pandas DataFrame with the following code:

import pandas as pd
df = pd.read_csv("./csv_file1.csv")
print(df)

This resulted in the following error:

pandas.errors.ParserError: Error tokenizing data. C error: Expected 3 fields in line 3, saw 5

This means that line 3 had 5 columns when only 3 were expected, which caused the error.
It appears that pandas determines the expected number of columns from the first few lines of the file (here it settled on 3, though I'm not sure how many lines it inspects by default). So when a later row contains more columns than expected, parsing fails and the data cannot be read.
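As an aside, if you happen to know the maximum column count in advance, pandas itself can read such a file: passing explicit column names via the `names` parameter makes it pad shorter rows with NaN instead of raising an error. A minimal sketch under that assumption (the sample file is recreated inline, and the `col_` names are my own, not part of the original file):

```python
import pandas as pd

# Recreate the sample file from the article.
with open("csv_file1.csv", "w", encoding="utf-8") as f:
    f.write("6, 54\n58,81,62\n75,84,64,21,55\n20,71,55,32\n")

# If the maximum column count (here 5) is known in advance, passing
# explicit names lets pandas pad shorter rows with NaN instead of failing.
df = pd.read_csv(
    "csv_file1.csv",
    header=None,                             # the file has no header row
    names=[f"col_{i+1}" for i in range(5)],  # one name per expected column
)
print(df)
```

The catch, of course, is that you must already know the maximum width, which is exactly what the solutions below work around.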

Solution 1: Find the maximum number of columns and fill the DataFrame with empty values for missing parts

So I thought I could open the CSV file with something other than pandas, find the maximum number of columns, and then fill rows with fewer columns with empty values. Here's the code I created:

import pandas as pd
with open('./csv_file1.csv', 'r', encoding='utf-8') as file:
    lines = file.readlines()
comma_count_max = max(line.count(',') for line in lines)
columns = [f'col_{i+1}' for i in range(comma_count_max + 1)]
data = []
for line in lines:
    row_data = line.strip().split(',')
    # Add empty strings as needed
    if len(row_data) < len(columns):
        row_data += [''] * (len(columns) - len(row_data))
    data.append(row_data)
df = pd.DataFrame(data, columns=columns)
print(df)

Since pandas errors out on read, I open the file with Python's open() function and find the line with the most commas. That comma count + 1 is the maximum number of columns, and rows with fewer columns are padded with empty strings.
The result is:

  col_1 col_2 col_3 col_4 col_5
0     6    54                  
1    58    81    62            
2    75    84    64    21    55
3    20    71    55    32      

This meets the requirement, but consider data like the following:

6, 54
58,81,62,,,,,
75,84,64,21,55
20,71,55,32

For data like this, the result would be:

  col_1 col_2 col_3 col_4 col_5 col_6 col_7 col_8
0     6    54                                    
1    58    81    62                              
2    75    84    64    21    55                  
3    20    71    55    32        

This is fine for a few rows, but for large files I suspected it would be quite heavy to read the whole CSV in Python, scan every line for the maximum comma count, and pad each row one by one.
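As a side note, if those trailing commas are just noise, one way to keep Solution 1 from creating phantom columns is to strip trailing empty fields before computing the maximum width. A minimal sketch (my own variation on the code above, not part of the original solution):

```python
import pandas as pd

# Sample data with trailing commas on the second row, as in the article.
with open("csv_file2.csv", "w", encoding="utf-8") as f:
    f.write("6, 54\n58,81,62,,,,,\n75,84,64,21,55\n20,71,55,32\n")

rows = []
with open("csv_file2.csv", "r", encoding="utf-8") as f:
    for line in f:
        fields = line.strip().split(",")
        # Drop trailing empty fields so ",,,,," does not inflate the width.
        while fields and fields[-1] == "":
            fields.pop()
        rows.append(fields)

width = max(len(r) for r in rows)
columns = [f"col_{i+1}" for i in range(width)]
data = [r + [""] * (width - len(r)) for r in rows]
df = pd.DataFrame(data, columns=columns)
print(df)
```

With this, the example above comes out as 5 columns instead of 8. Whether dropping trailing empty fields is safe depends on your data, so treat it as an option rather than a default.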

Solution 2: Use DuckDB's null_padding option

DuckDB's read_csv has a null_padding option that pads rows with missing columns using NULL values.

import duckdb
con = duckdb.connect(database=":memory:")
print(con.sql("SELECT * FROM read_csv('csv_file1.csv',null_padding=true)"))

Running these three lines of code gives us...?

┌─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┬─────────┐
│ column0 │ column1 │ column2 │ column3 │ column4 │ column5 │ column6 │ column7 │
│  int64  │  int64  │  int64  │  int64  │  int64  │ varchar │ varchar │ varchar │
├─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┼─────────┤
│       6 │      54 │    NULL │    NULL │    NULL │ NULL    │ NULL    │ NULL    │
│      58 │      81 │      62 │    NULL │    NULL │ NULL    │ NULL    │ NULL    │
│      75 │      84 │      64 │      21 │      55 │ NULL    │ NULL    │ NULL    │
│      20 │      71 │      55 │      32 │    NULL │ NULL    │ NULL    │ NULL    │
└─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┴─────────┘

It automatically pads every row up to the maximum number of columns with NULLs, and even generates headers for all of them.
(Though naturally, the number of columns created matches the maximum comma count, so trailing commas still produce extra columns.)
By the way, when you want to convert to a DataFrame:

import duckdb
con = duckdb.connect(database=":memory:")
print(con.sql("SELECT * FROM read_csv('csv_file1.csv',null_padding=true)").df())

Adding .df() at the end like this converts it to a pandas DataFrame.
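One more side note on those extra all-NULL columns from trailing commas: once you have a pandas DataFrame, they can be dropped with dropna(axis=1, how="all"). A minimal sketch on a DataFrame shaped like DuckDB's output above (the values are hand-copied here rather than produced by DuckDB, so the setup is illustrative):

```python
import pandas as pd
import numpy as np

# A DataFrame shaped like DuckDB's output above: the last three
# columns exist only because of trailing commas and hold only NULLs.
df = pd.DataFrame(
    {
        "column0": [6, 58, 75, 20],
        "column1": [54, 81, 84, 71],
        "column2": [np.nan, 62, 64, 55],
        "column3": [np.nan, np.nan, 21, 32],
        "column4": [np.nan, np.nan, 55, np.nan],
        "column5": [np.nan] * 4,
        "column6": [np.nan] * 4,
        "column7": [np.nan] * 4,
    }
)

# Drop columns that contain nothing but NULLs.
df = df.dropna(axis=1, how="all")
print(df.columns.tolist())
```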
If you're facing this issue, please give it a try! 🙌
