Bala Priya C

How to Build a Pandas DataFrame from Multiple Files

In any Data Science project, the steps of importing data, followed by data cleaning and Exploratory Data Analysis (EDA), are extremely important.

Let us say we have the required dataset in CSV format, but it is stored across multiple files instead of a single file. We would ideally like to read the data from all of these files into a single pandas DataFrame for use in the subsequent steps.

The most straightforward way to do it is to read the data from each of those files into a separate DataFrame and then concatenate them into a single large DataFrame. This can be memory inefficient and involves writing redundant code. So, is there a better way to do it? Yes, there is. In this blog post, let us work through a couple of simple examples to see how.

Let us import pandas under its usual alias pd.

import pandas as pd

Building a DataFrame from multiple files (row-wise)

We shall use a sample dataset for our example; let us read the data from http://bit.ly/smallstocks into a DataFrame stocks using the read_csv() function of pandas.

stocks = pd.read_csv('http://bit.ly/smallstocks', parse_dates=['Date']) 
print(stocks)


Let us say this data was stored in 3 separate CSV files, one for each day, named stocks1.csv, stocks2.csv and stocks3.csv, as shown below.

stocks1 = pd.read_csv('data/stocks1.csv')
stocks2 = pd.read_csv('data/stocks2.csv')
stocks3 = pd.read_csv('data/stocks3.csv')

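For completeness, the naive approach mentioned earlier amounts to concatenating these three DataFrames manually, something like this:

# naive approach: stack the three DataFrames read in above, one below the other
pd.concat([stocks1, stocks2, stocks3])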

This works, but reading each of these CSV files into a separate DataFrame and concatenating them by hand becomes cumbersome as the number of such files increases. A better solution is to use Python's built-in glob module. Let us import the glob() function from it.

from glob import glob

We can pass in a pattern to glob(), including wildcard characters, and it will return a list of all files that match that pattern.

stock_files = sorted(glob('data/stocks*.csv'))

In this case, glob() is looking in the “data” subdirectory for all CSV files that start with the word “stocks”. glob() returns filenames in an arbitrary order, which is why we have sorted the list using Python’s built-in sorted() function.
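
With the three files named as above, the resulting list would look something like this:

print(stock_files)
# ['data/stocks1.csv', 'data/stocks2.csv', 'data/stocks3.csv']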

Now that we’ve collected all the files across which our dataset is spread, we can use a generator expression to read in each of the files with read_csv() and pass the results to the concat() function, which will concatenate the rows into a single DataFrame.

pd.concat((pd.read_csv(file) for file in stock_files))

It looks like we’ve successfully brought all the data from the three files into a single DataFrame, but there are duplicate values in the index. To avoid that, we can set the ignore_index argument to True to tell the concat() function to ignore the index and use the default integer index instead.

pd.concat((pd.read_csv(file) for file in stock_files), ignore_index=True)

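As a side note, if you also want to keep track of which file each row came from, you can tag the rows inside the generator expression. Here is a minimal sketch, adapted from the pandas tricks notebook linked in the references (the column name filename is just an arbitrary choice):

# assign() adds a 'filename' column recording the source file for every row
pd.concat((pd.read_csv(file).assign(filename=file) for file in stock_files), ignore_index=True)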

This method is useful when each file contains rows from our dataset. What if each file instead contains columns from our dataset?


Building a DataFrame from multiple files (column-wise)

Here’s an example in which the drinks dataset has been split into two CSV files, each containing a subset of the columns. Let us first read in the full dataset for reference, and then the two pieces:

drinks = pd.read_csv('http://bit.ly/drinksbycountry')
drinks1 = pd.read_csv('data/drinks1.csv').head()
drinks2 = pd.read_csv('data/drinks2.csv').head()


Similar to the procedure we followed earlier, we’ll start by using glob():

drink_files = sorted(glob('data/drinks*.csv'))

And this time, we’ll tell the concat() function to concatenate along the columns by specifying the axis argument as ‘columns’.

pd.concat((pd.read_csv(file) for file in drink_files), axis='columns').head()


We see that we’ve obtained a single DataFrame with all six columns. Note that when concatenating column-wise, concat() aligns rows on the index, which works here because both files share the same default integer index. These were simple examples to illustrate the idea; in practice, where a dataset is chunked across many files, this approach is a lot more helpful.
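
Putting it all together, here is a minimal end-to-end sketch of the row-wise workflow, assuming (as in the examples above) that the CSV files live in a data/ subdirectory:

from glob import glob
import pandas as pd

# collect the matching files in a predictable order
stock_files = sorted(glob('data/stocks*.csv'))

# read each file and stack the rows into one DataFrame with a fresh integer index
stocks = pd.concat((pd.read_csv(file) for file in stock_files), ignore_index=True)
print(stocks.shape)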

Thank you for reading!😊

References

[1] https://docs.python.org/3/library/glob.html
[2] https://github.com/justmarkham/pandas-videos/blob/master/top_25_pandas_tricks.ipynb
[3] https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.read_csv.html

Cover image: Photo by Markus Spiske on Unsplash
