DEV Community

Cover image for How to combine and summarize excel/CSV files in Python?
Lara
Lara

Posted on

How to combine and summarize excel/CSV files in Python?

This is the begging of a series to share simple and short lines of code in Python that helped me to automate my work.

Scenario: You have monthly sales reports for the year 2021, and your need to put them all in one file to create a yearly report and generate simple statistics. How can we do that?


Prerequisites:

  1. All files must have the same set of columns (same name, spelling, and number)
  2. Have all files in one folder
  3. All files must have the same extension (either .xlsx, .cvs)
  4. Any version of Python - mine is 7.29.0.
  5. You will need the below libraries:

Set up

First, install the above libraries if you do not have them installed already. You can use the below commands and tweak them depending on the IDE you use, I use Spyder.

Spyder:
pip install pandas
pip install os
pip install glob


Importing the files

1. Set up the path
path = r'C:/Folder Name'

2. Get the list of files
This will get for us the names of all the excel/csv files in the particular folder we specified in the path.
file_list = glob.glob(path + "/*.xlsx")
Just change .xlsx to .csv throughout the code and it will work the same.

3. Importing the files
This is a for loop that will go through each file we got in the file_list, read it as a data frame, and put it inside a list of data frames by appending it to the list.

excel_list = []

for file in file_list:
excelFile =pd.read_excel(file)
excelFile['fileName']=os.path.basename(file)
excel_list.append(excelFile)

Merging the files into one

1. Concatenating the files into one data frame
excel_merged = pd.concat(excel_list, ignore_index=True)

2. Export the file to the same path
excel_merged.to_excel(path+'/All Lists.xlsx', index=False)


Create summary and basic statists

Get all columns in the data

excel_merged.columns.tolist()

Generate a general summary of the data

excel_merged.describe()

This will give stats of all numeric columns (mean, medium, STD, count, max, min, etc.)

You can use the below to get stats on a certain numeric column

excel_merged['column name'].describe()

Get stats group by certain month
summary=excel_merged.groupby(["Month"])[["Sales Value"]].describe()
summary

This will show us statistics on sales per month - it is a good indicator to see if we have peaks or low sales in certain months.

Plot the data

f, ax = plt.subplots()

ax.bar(x=excel_merged['Month'],
height=excel_merged['Sales'],
color="purple")

ax.set(title="Plot of Sales per Month")
plt.show()

Troubleshooting

In case a file was missing ensure that the file is in the file_list, and it has the same extension and is in the folder.
If you have more columns than the original lists -this means there is a file with either extra columns or does not have the same spelling.
Throughout the codes, just change the path from .xlsx to .csv and it will work the same.

Well done!

Image description

Oldest comments (0)