Recently I was comparing the performance of different machine learning models and wanted to add entries to a Pandas DataFrame as I evaluated each one. Adding new rows turned out to be a little harder than I expected and required some mild searching, so I'm preserving the two solutions I found here in case they help someone else.
When would you need to Append Rows?
So, first of all, when would you want to append to a Pandas DataFrame?
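DataFrames in Pandas are tabular data structures built to be loaded and then filtered and transformed into new DataFrames. They aren't strictly immutable, but most Pandas operations return a new DataFrame rather than changing the original in place.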
Most of the time when you're working with DataFrames you're doing things like performing exploratory data analysis, plotting data on charts, and building a training dataset for machine learning tasks. All of these tasks involve transforming data from one shape to another, not adding new observations to a DataFrame.
However, sometimes you'll want to have a collection of data together in a DataFrame for analysis purposes and you won't know all the pieces at once. For example, if you're doing something iterative in nature and you want to track metrics associated with each iteration, that's a case where you'd need to append new rows as you discover them.
This is the case I found myself examining earlier this week as I sought to compare model performance metrics for several different machine learning algorithms.
Let's look at the code I used to solve this problem and a few alternatives.
Concatenating Pandas DataFrames
The current recommended way of adding additional observations to a Pandas DataFrame is to call the concat function.
The concat function does not live on a specific DataFrame, but rather on the Pandas library itself. If you typically do import pandas as pd, then you can call it as pd.concat.
concat takes in a list of DataFrames and outputs a single DataFrame with the rows from each of the DataFrames added in sequence. This means that you can use concat to join together more than just two DataFrames at a time.
Here's a sample of concat in action:
import pandas as pd
# CSV file is not relevant, just assuming you have some pre-existing data
df = pd.read_csv('models.csv')
# Define columns to add, each with a list of values for the new row(s)
data = {
    "Model": ["Linear Regression"],
    "R2": [0.042],
    "MSE": [3.14],
}
# We need a new DataFrame with the new contents
df_new_rows = pd.DataFrame(data)
# Call pd.concat to create a new DataFrame from several existing ones
df = pd.concat([df, df_new_rows])
Note that when we call pd.concat we pass it a list containing the original DataFrame named df and the new rows, and we store the result back in that same variable. This is because concat, like most Pandas functions, does not modify the original DataFrame but creates and returns a new DataFrame instead.
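To make that point concrete, here's a minimal sketch that stores the result in a separate variable so both DataFrames can be inspected. The small df built inline is just a stand-in for whatever you loaded from disk, and its values are made up.

```python
import pandas as pd

# Stand-in for the pre-existing data (hypothetical values, not from models.csv)
df = pd.DataFrame({"Model": ["Baseline"], "R2": [0.0], "MSE": [9.99]})
df_new_rows = pd.DataFrame(
    {"Model": ["Linear Regression"], "R2": [0.042], "MSE": [3.14]}
)

# Store the result under a new name so the original stays visible
combined = pd.concat([df, df_new_rows])

print(len(df))        # the original still has only its own row
print(len(combined))  # the returned DataFrame has both rows
```

If you never assign the return value of pd.concat to anything, the new rows are simply discarded.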
Now, let's talk about data. In the code snippet above, data represents the rows to add. It is a dictionary with three keys, one per column: Model, R2, and MSE. Each key maps to a list of values, representing that column's value for the rows being added. Here we're only adding a single row, so each of these lists has just one value. If you wanted to add multiple rows, you would have multiple values in each list.
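As a rough illustration of the multi-row case, the sketch below adds two rows at once. The model names and metric values are invented for the example; the only thing that matters is that every list has the same length, with position i in each list describing row i.

```python
import pandas as pd

# Stand-in for the existing DataFrame (hypothetical values)
df = pd.DataFrame({"Model": ["Linear Regression"], "R2": [0.042], "MSE": [3.14]})

# Two new rows: the first value in each list belongs to the first row,
# the second value to the second row
data = {
    "Model": ["Decision Tree", "Random Forest"],
    "R2": [0.31, 0.47],
    "MSE": [2.10, 1.62],
}

df = pd.concat([df, pd.DataFrame(data)])
print(df)
```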
One final note before we move on. Here we're appending the row to the end of the DataFrame by placing it last in the list we pass to concat. If we instead wanted the new row to appear first, we'd swap the order as follows:
df = pd.concat([df_new_rows, df])
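One thing worth knowing when you reorder like this: by default concat keeps each DataFrame's own index labels, so you can end up with repeated labels. Passing ignore_index=True asks Pandas to renumber the rows from zero. Here's a small sketch, again with made-up data standing in for the CSV:

```python
import pandas as pd

# Stand-in data (hypothetical values)
df = pd.DataFrame({"Model": ["A", "B"], "R2": [0.1, 0.2], "MSE": [5.0, 4.0]})
df_new_rows = pd.DataFrame(
    {"Model": ["Linear Regression"], "R2": [0.042], "MSE": [3.14]}
)

# Default behavior: each DataFrame keeps its original index labels
kept = pd.concat([df_new_rows, df])
print(kept.index.tolist())        # [0, 0, 1]

# ignore_index=True discards the old labels and renumbers the rows
renumbered = pd.concat([df_new_rows, df], ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]
```

Whether the repeated labels matter depends on what you do next; if you later select rows by label, renumbering tends to save some confusion.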
All told, concat is a versatile way of stitching together data.
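Since concat accepts any number of DataFrames, one way to handle the iterative scenario from earlier is to collect a small DataFrame per iteration in a plain Python list and concatenate once at the end. This is only a sketch: the results list is a hypothetical stand-in for real model evaluations, and the numbers are invented.

```python
import pandas as pd

# Hypothetical (name, R2, MSE) results standing in for real evaluations
results = [
    ("Linear Regression", 0.042, 3.14),
    ("Decision Tree", 0.31, 2.10),
    ("Random Forest", 0.47, 1.62),
]

frames = []
for name, r2, mse in results:
    # One single-row DataFrame per evaluated model
    frames.append(pd.DataFrame({"Model": [name], "R2": [r2], "MSE": [mse]}))

# A single concat call joins all of them, in list order
metrics = pd.concat(frames, ignore_index=True)
print(metrics)
```

Calling concat once at the end avoids copying the growing DataFrame on every iteration, which is what happens if you call concat inside the loop.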
The Old Way: Appending Rows to a Pandas DataFrame
Before we move on, I want to mention the append function present on Pandas DataFrames.
The append function was deprecated in Pandas 1.4 and removed entirely in Pandas 2.0, so you should not rely on it in current or new code. However, you may still see it in older code, so I include it just for completeness.
The append function is associated with a specific DataFrame, so it is a bit simpler to invoke:
import pandas as pd
# CSV file is not relevant, just assuming you have some pre-existing data
df = pd.read_csv('models.csv')
# Define columns to add, each with a list of values for the new row(s)
data = {
    "Model": ["Linear Regression"],
    "R2": [0.042],
    "MSE": [3.14],
}
# We need a new DataFrame with the new contents
df_new_rows = pd.DataFrame(data)
# Call append to add in the new rows.
# WARNING: deprecated in Pandas 1.4 and removed in 2.0, where this line raises AttributeError
df = df.append(df_new_rows)
Here we call append on the original DataFrame and pass it a single DataFrame containing all the rows to append. Like other functions on DataFrames, this operation results in a new DataFrame.
I personally find append to be more intuitive and easier to discover, but concat gives us greater flexibility and is the way of the future.
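If you run into old code that calls append on a version of Pandas where it is no longer available, the translation to concat is usually mechanical. Below is a sketch of one common pattern, appending a single row given as a dictionary. The row contents and the inline df are made up for illustration.

```python
import pandas as pd

# Stand-in for the existing DataFrame (hypothetical values)
df = pd.DataFrame({"Model": ["Baseline"], "R2": [0.0], "MSE": [9.99]})

# A single new row expressed as a plain dictionary of scalars
row = {"Model": "Linear Regression", "R2": 0.042, "MSE": 3.14}

# Old style: df = df.append(row, ignore_index=True)
# concat equivalent: wrap the dictionary in a one-row DataFrame first
df = pd.concat([df, pd.DataFrame([row])], ignore_index=True)
print(df)
```

Wrapping the dictionary in a list tells the DataFrame constructor to treat it as one row rather than as a mapping of columns to lists.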
Final Thoughts on Concat
So, there we go: just use concat and you'll have a much better time appending rows to Pandas DataFrames.
This is one of those articles I wrote for myself to refer back to later on, because I don't frequently want to append things to DataFrames, but when I do, it's helpful to have a small reference for it.
Let me know if this helped you and if there are other aspects of Pandas you'd love for me to cover as well.