read_csv file using Pandas skipping rows based on content

#python #programming #datascience

Ideally each CSV file you work with will contain homogeneous data: only one row of header and in ever row similar type of data.
Unfortunately real world sometimes gives us files where several tables are mixed. The question arises how can we read the content
of such file using the read_csv
method of Pandas?

One could use the skiprows and the skipfooter parameters, but what if I don't the number of the rows we need to skip?

The file

This a very short version of a file I just created for the demo. The real file is much, much bigger.

code,from,to,departure,length,price,tickets
CM 2201,BUD,LTN,2016-05-29 06:00,2:35,80,20
CM 2203,BUD,LTN,2016-05-29 09:00,2:35,80,17

CM 2202,LTN,BUD,2016-06-10 08:10,2:25,120,5


Planet name,Distance (AU),Mass
Mercury,0.4,0.055
Venus,0.7,0.815

City 1,City 2,distance
Budapest,Bukarest,1200
Tel Aviv,Beirut,500

This file has 3 different tables in it.

Read everything

import pandas as pd

filename = "examples/data/mixed.csv"
df = pd.read_csv(filename)
print(df)

Reading everything will not result in anything useful. The header of the 2nd and the 3rd table are seen as data in the first table.
Along with the content of those tables.

          code           from        to         departure length  price  tickets
0      CM 2201            BUD       LTN  2016-05-29 06:00   2:35   80.0     20.0
1      CM 2203            BUD       LTN  2016-05-29 09:00   2:35   80.0     17.0
2      CM 2202            LTN       BUD  2016-06-10 08:10   2:25  120.0      5.0
3  Planet name  Distance (AU)      Mass               NaN    NaN    NaN      NaN
4      Mercury            0.4     0.055               NaN    NaN    NaN      NaN
5        Venus            0.7     0.815               NaN    NaN    NaN      NaN
6       City 1         City 2  distance               NaN    NaN    NaN      NaN
7     Budapest       Bukarest      1200               NaN    NaN    NaN      NaN
8     Tel Aviv         Beirut       500               NaN    NaN    NaN      NaN

We could try to fiddle with the dataframe to locate the rows that are relevant to our data, but I thought a much filtering the original data would be
simpler.

Set the skips manually

Before trying to go to the dynamic solution I wanted to see if the skiprows and skipfooter parameters work as I expect.
So we have a version where we supply these two parameters specific to the sample csv file:

import pandas as pd

filename = "examples/data/mixed.csv"
df = pd.read_csv(filename, skiprows=7, skipfooter=4, engine="python")
print(df)

The result is this:

  Planet name  Distance (AU)   Mass
0     Mercury            0.4  0.055
1       Venus            0.7  0.815

I also had to supply the engine="python" to avoid a warning:
ParserWarning: Falling back to the 'python' engine because the 'c' engine does not support skipfooter; you can avoid this warning by specifying engine='python'.

This also made me worried that maybe this will cause the whole loading to work a lot slower. We'll have to see it on real data.

Calculate the skips

Finally we have the code to calculate the number of rows automatically. I did not want to do it with Pandas as this solution, where we only care about strings, seems to be much faster.
(But I have not measured it.)

import pandas as pd

def get_row(rows, text):
    matches = list(filter(lambda row: row[1].rstrip("\n") == text, enumerate(rows)))
    # print(matches)

    if len(matches) == 0:
        raise Exception(f"header {text} could not be found")
    if len(matches) > 1:
        raise Exception(f"duplicate header {text} was found")

    return matches[0][0]


filename = "examples/data/mixed.csv"
with open(filename) as fh:
    rows = fh.readlines()

starting_row = get_row(rows, "Planet name,Distance (AU),Mass")
# print(starting_row) # 7

end_row = len(rows) - get_row(rows, "City 1,City 2,distance")
# print(end_row) # 4

df = pd.read_csv(filename, skiprows=starting_row, skipfooter=end_row, engine="python")
print(df)

First we read the content of the csv file into memory as a list of rows. Then we use the get_row function to find the row-number.

enumerate() returns a list of tuples of the form (row_number, row).

We use the filter on the enumerated list. The filter function takes element 1 of the tuple (which is the current row in the file) and checks if it is equal to the string we supplied.

filter returns a filter object, we use list to flatten it into a list.

Then we make sure we found exactly one such row. I know theoretically the file should be correct and have exactly this format. But I know in reality....
So I prefer to report an error or an ambiguity on the input data then to give false results.

That's it.

I hope this will be useful to someone.

DEV Community

read_csv file using Pandas skipping rows based on content

The file

Read everything

Set the skips manually

Calculate the skips

Top comments (0)