What Are Pivot Tables?
If you have ever used Excel or Google Sheets, you have probably encountered pivot tables. They are one of the most powerful features in any spreadsheet application -- the ability to take a flat table of data and reshape it into a summary view that groups, aggregates, and cross-tabulates your data in a single operation.
Pandas brings this same concept to Python with pivot_table(). Instead of dragging fields into "Rows" and "Columns" areas in a GUI, you pass parameters to a function. The result is a DataFrame that summarizes your raw data exactly the way you need it -- total sales by region and product, average response time by department and quarter, count of support tickets by priority and status.
When to use pivot tables vs other methods
Pivot tables are the right choice when you need a cross-tabulated summary -- meaning you want categories along both rows and columns. Here is a quick decision guide:
- Pivot table: You want to see sales broken down by region (rows) AND product (columns), with totals in each cell
- groupby: You want a single aggregation like total sales per region, without spreading categories across columns
- crosstab: You want frequency counts of how often two categorical variables co-occur
- melt: You want to go the other direction -- converting a wide pivot table back into a long format
Understanding when to reach for pivot_table() versus these alternatives will save you time and produce cleaner code. We will cover the comparisons in detail later in this guide.
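The decision guide above can be sketched on a tiny, hypothetical dataset (the column names and values here are illustrative only):

```python
import pandas as pd

# Toy sales rows -- purely illustrative data
df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'product': ['Laptop', 'Phone', 'Laptop', 'Laptop'],
    'sales': [100, 200, 300, 400],
})

# Pivot table: regions down, products across, totals in each cell
pivot = df.pivot_table(values='sales', index='region',
                       columns='product', aggfunc='sum')

# groupby: one total per region, no spreading across columns
per_region = df.groupby('region')['sales'].sum()

# crosstab: how often each region-product pair occurs
counts = pd.crosstab(df['region'], df['product'])

# melt: the wide pivot back to long format
long = pivot.reset_index().melt(id_vars='region', value_name='total')

print(pivot)
print(per_region)
print(counts)
print(long)
```

Each call answers a different question from the same four rows: the pivot cross-tabulates, groupby collapses to one number per region, crosstab counts occurrences, and melt undoes the reshaping.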
Basic pivot_table() Syntax
The pivot_table() function signature has several parameters, but you only need a few to get started:
df.pivot_table(
    values=None,          # Column(s) to aggregate
    index=None,           # Column(s) to use as row labels
    columns=None,         # Column(s) to use as column labels
    aggfunc='mean',       # Aggregation function(s)
    fill_value=None,      # Value to replace NaN
    margins=False,        # Add row/column totals
    dropna=True,          # Drop columns whose entries are all NaN
    margins_name='All',   # Name for the margins row/column
    observed=False,       # If True, show only categories observed in the data
    sort=True             # Sort the result
)
The key parameters are:
- values: Which column(s) contain the numbers you want to summarize
- index: Which column(s) become the row labels of your pivot table
- columns: Which column(s) become the column headers
- aggfunc: How to combine multiple values -- 'sum', 'mean', 'count', 'min', 'max', or any callable
One thing that trips up beginners: the default aggregation is mean, not sum. If you create a pivot table and the numbers look wrong, check whether you meant to pass aggfunc='sum'.
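Here is a minimal, hypothetical illustration of the gotcha:

```python
import pandas as pd

# Two sales of 100 and 300 in the same region -- toy data
df = pd.DataFrame({'region': ['North', 'North'], 'sales': [100, 300]})

# Default aggfunc is 'mean': the cell shows 200, not 400
by_mean = df.pivot_table(values='sales', index='region')

# Passing aggfunc='sum' gives the total most reports expect
by_sum = df.pivot_table(values='sales', index='region', aggfunc='sum')

print(by_mean)  # 200.0
print(by_sum)   # 400
```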
Creating Sample Data
Every example in this guide uses the same sample dataset so you can follow along. Let us generate a realistic sales dataset:
import pandas as pd
import numpy as np

np.random.seed(42)
n = 200
regions = ['North', 'South', 'East', 'West']
products = ['Laptop', 'Phone', 'Tablet', 'Monitor']
quarters = ['Q1', 'Q2', 'Q3', 'Q4']
salespersons = ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']
df = pd.DataFrame({
    'date': pd.date_range('2025-01-01', periods=n, freq='B'),
    'region': np.random.choice(regions, n),
    'product': np.random.choice(products, n),
    'quarter': np.random.choice(quarters, n),
    'salesperson': np.random.choice(salespersons, n),
    'sales': np.random.randint(100, 5000, n).astype(float),
    'quantity': np.random.randint(1, 50, n),
    'discount': np.round(np.random.uniform(0, 0.3, n), 2)
})
print(df.head(10))
date region product quarter salesperson sales quantity discount
0 2025-01-01 South Tablet Q3 Diana 3944.0 22 0.07
1 2025-01-02 East Laptop Q4 Alice 1491.0 17 0.23
2 2025-01-03 North Phone Q1 Eve 2411.0 44 0.18
3 2025-01-06 West Tablet Q2 Bob 4001.0 11 0.05
4 2025-01-07 South Monitor Q1 Charlie 1387.0 33 0.29
5 2025-01-08 East Phone Q3 Diana 2690.0 8 0.12
6 2025-01-09 North Laptop Q4 Alice 1534.0 29 0.01
7 2025-01-10 West Monitor Q2 Eve 3287.0 41 0.22
8 2025-01-13 South Tablet Q1 Bob 488.0 15 0.08
9 2025-01-14 East Laptop Q3 Charlie 4215.0 3 0.16
This gives us 200 rows of sales data with regions, products, quarters, salespersons, sales amounts, quantities, and discount rates. Enough variety to demonstrate every pivot table feature.
Core Examples
Single Value, Single Index
The simplest pivot table: one value column, one index, one column header.
pivot = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum'
)
print(pivot)
product Laptop Monitor Phone Tablet
region
East 30482.0 25319.0 29877.0 22145.0
North 28190.0 31002.0 27654.0 24893.0
South 26743.0 22187.0 31290.0 28456.0
West 29105.0 27891.0 25433.0 30012.0
Each cell shows total sales for that region-product combination. Rows are regions, columns are products -- a classic cross-tabulation.
Multiple Values
You can aggregate more than one column at the same time by passing a list to values:
pivot_multi = df.pivot_table(
values=['sales', 'quantity'],
index='region',
columns='product',
aggfunc='sum'
)
print(pivot_multi)
quantity sales
product Laptop Monitor Phone Tablet Laptop Monitor Phone Tablet
region
East 142 168 134 119 30482.0 25319.0 29877.0 22145.0
North 156 171 148 131 28190.0 31002.0 27654.0 24893.0
South 139 155 162 144 26743.0 22187.0 31290.0 28456.0
West 161 147 138 153 29105.0 27891.0 25433.0 30012.0
Pandas creates a MultiIndex on the columns -- the top level is the value column (quantity, sales) and the second level is the product. You can access specific sections using:
# Get just the sales portion
print(pivot_multi['sales'])
# Get sales for Laptops across all regions
print(pivot_multi['sales']['Laptop'])
Multiple Index Levels (Hierarchical)
Pass a list to index for hierarchical row grouping:
pivot_hier = df.pivot_table(
values='sales',
index=['region', 'salesperson'],
columns='product',
aggfunc='sum'
)
print(pivot_hier.head(12))
product Laptop Monitor Phone Tablet
region salesperson
East Alice 7823.0 5190.0 6342.0 4510.0
Bob 5412.0 6890.0 5123.0 3892.0
Charlie 6190.0 4231.0 7654.0 5432.0
Diana 5890.0 4567.0 4312.0 4123.0
Eve 5167.0 4441.0 6446.0 4188.0
North Alice 6234.0 7812.0 5678.0 5190.0
Bob 5890.0 6123.0 4567.0 4321.0
Charlie 4312.0 5678.0 6890.0 5012.0
Diana 6123.0 5890.0 4231.0 4890.0
Eve 5631.0 5499.0 6288.0 5480.0
South Alice 5890.0 4567.0 7123.0 6234.0
Bob 4321.0 5234.0 5890.0 5123.0
This creates a MultiIndex on the rows. Each region contains nested rows for each salesperson, giving you a drill-down view of sales.
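Once the rows carry a MultiIndex, you can drill into it with .loc and .xs. A small sketch with toy data (the names are illustrative only):

```python
import pandas as pd

# Hypothetical long data with two grouping levels
df = pd.DataFrame({
    'region': ['East', 'East', 'East', 'West'],
    'salesperson': ['Alice', 'Alice', 'Bob', 'Alice'],
    'product': ['Laptop', 'Phone', 'Laptop', 'Laptop'],
    'sales': [100, 200, 300, 400],
})

pivot = df.pivot_table(values='sales', index=['region', 'salesperson'],
                       columns='product', aggfunc='sum')

# All salespeople within one region (drops the 'region' level)
east = pivot.loc['East']

# One salesperson across all regions, selected on the second index level
alice = pivot.xs('Alice', level='salesperson')

print(east)
print(alice)
```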
Multiple Aggregation Functions
Pass a list to aggfunc to compute multiple statistics at once:
pivot_agg = df.pivot_table(
values='sales',
index='region',
aggfunc=['sum', 'mean', 'count', 'std']
)
print(pivot_agg.round(2))
sum mean count std
sales sales sales sales
region
East 107823.0 2157.46 50 1312.45
North 111739.0 2234.78 50 1287.92
South 108676.0 2173.52 50 1345.18
West 112441.0 2248.82 50 1298.67
Each aggregation function gets its own column group. This is useful for building summary reports where you need to see totals alongside averages and counts.
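The resulting columns are a MultiIndex of (aggfunc, value column), which you can index into. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Toy data: one value column, one grouping column
df = pd.DataFrame({'region': ['East', 'East', 'West'],
                   'sales': [100, 300, 200]})

pivot_agg = df.pivot_table(values='sales', index='region',
                           aggfunc=['sum', 'mean'])

# Columns are a MultiIndex: (aggfunc, value column)
sums = pivot_agg['sum']                         # just the sum columns
east_mean = pivot_agg[('mean', 'sales')].loc['East']

print(pivot_agg)
print(east_mean)  # 200.0
```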
Custom Aggregation Functions
You are not limited to built-in functions. Pass any callable:
def revenue_range(x):
    return x.max() - x.min()

def top_sale(x):
    return x.nlargest(1).values[0]
pivot_custom = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc=revenue_range
)
print(pivot_custom)
product Laptop Monitor Phone Tablet
region
East 4512.0 4190.0 4321.0 3890.0
North 4234.0 4567.0 4012.0 4123.0
South 4678.0 3890.0 4456.0 4312.0
West 4345.0 4234.0 3789.0 4567.0
You can also use lambda functions for quick one-off calculations:
# 90th percentile of sales
pivot_pct = df.pivot_table(
values='sales',
index='region',
aggfunc=lambda x: np.percentile(x, 90)
)
print(pivot_pct)
sales
region
East 4123.0
North 4234.0
South 4345.0
West 4456.0
You can also pass a dictionary to aggfunc to apply different functions to different columns:
pivot_dict = df.pivot_table(
index='region',
aggfunc={
'sales': 'sum',
'quantity': 'mean',
'discount': 'max'
}
)
print(pivot_dict.round(2))
discount quantity sales
region
East 0.29 24.12 107823.0
North 0.28 25.34 111739.0
South 0.30 23.89 108676.0
West 0.27 24.67 112441.0
This is extremely powerful for dashboards where each metric needs a different treatment -- sum for revenue, average for quantities, maximum for discount rates.
Real-World Examples
Sales Analysis
Let us build the kind of sales reports that analysts create daily.
Revenue by region and product with totals:
sales_pivot = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum',
margins=True,
margins_name='Total'
)
print(sales_pivot)
product Laptop Monitor Phone Tablet Total
region
East 30482.0 25319.0 29877.0 22145.0 107823.0
North 28190.0 31002.0 27654.0 24893.0 111739.0
South 26743.0 22187.0 31290.0 28456.0 108676.0
West 29105.0 27891.0 25433.0 30012.0 112441.0
Total 114520.0 106399.0 114254.0 105506.0 440679.0
Monthly sales trends:
df['month'] = df['date'].dt.to_period('M')
monthly_pivot = df.pivot_table(
values='sales',
index='month',
columns='region',
aggfunc='sum'
)
print(monthly_pivot.head(6))
region East North South West
month
2025-01 12345.0 13456.0 11234.0 14567.0
2025-02 11890.0 12678.0 12456.0 13234.0
2025-03 13567.0 11234.0 13890.0 12678.0
2025-04 12234.0 14567.0 12678.0 11890.0
2025-05 14678.0 13890.0 11567.0 13456.0
2025-06 11456.0 12234.0 14234.0 12890.0
Top performers -- average sale per salesperson by region:
performer_pivot = df.pivot_table(
values='sales',
index='salesperson',
columns='region',
aggfunc='mean'
).round(2)
# Add an overall average column
performer_pivot['Overall'] = performer_pivot.mean(axis=1).round(2)
# Sort by overall performance
performer_pivot = performer_pivot.sort_values('Overall', ascending=False)
print(performer_pivot)
region East North South West Overall
salesperson
Diana 2567.34 2890.12 2123.45 2678.90 2564.95
Alice 2345.67 2678.34 2456.78 2234.56 2428.84
Charlie 2123.45 2456.78 2789.12 2012.34 2345.42
Eve 2234.56 2123.45 2345.67 2567.89 2317.89
Bob 2012.34 2345.67 2234.56 2456.78 2262.34
Financial Reporting
Pivot tables are a natural fit for financial data. Here is how to build a budget-vs-actual comparison:
# Create financial data
np.random.seed(123)
departments = ['Engineering', 'Marketing', 'Sales', 'Operations', 'HR']
categories = ['Salaries', 'Software', 'Travel', 'Equipment', 'Training']
fin_data = []
for dept in departments:
    for cat in categories:
        budget = np.random.randint(10000, 100000)
        actual = budget + np.random.randint(-15000, 15000)
        fin_data.append({
            'department': dept,
            'category': cat,
            'budget': budget,
            'actual': actual
        })
fin_df = pd.DataFrame(fin_data)
fin_df['variance'] = fin_df['actual'] - fin_df['budget']
fin_df['variance_pct'] = ((fin_df['actual'] - fin_df['budget']) / fin_df['budget'] * 100).round(1)
# Budget pivot
budget_pivot = fin_df.pivot_table(
values=['budget', 'actual', 'variance'],
index='department',
columns='category',
aggfunc='sum'
)
print(budget_pivot['variance'])
category Equipment Salaries Software Training Travel
department
Engineering -3200 12400 -8900 5600 -1200
HR 8900 -4500 2300 -7800 3400
Marketing -5600 7800 11200 -2300 -9100
Operations 4300 -8900 6700 1200 -5600
Sales -1200 3400 -5600 8900 7800
Income statement summary:
# Create income statement data
income_data = pd.DataFrame({
'line_item': ['Revenue', 'Revenue', 'Revenue', 'Revenue',
'COGS', 'COGS', 'COGS', 'COGS',
'OpEx', 'OpEx', 'OpEx', 'OpEx'],
'quarter': ['Q1', 'Q2', 'Q3', 'Q4'] * 3,
'amount': [
250000, 275000, 310000, 340000, # Revenue
100000, 110000, 124000, 136000, # COGS
80000, 85000, 90000, 95000 # OpEx
]
})
income_pivot = income_data.pivot_table(
values='amount',
index='line_item',
columns='quarter',
aggfunc='sum'
)
# Reorder rows logically
income_pivot = income_pivot.reindex(['Revenue', 'COGS', 'OpEx'])
income_pivot.loc['Gross Profit'] = income_pivot.loc['Revenue'] - income_pivot.loc['COGS']
income_pivot.loc['Net Income'] = income_pivot.loc['Gross Profit'] - income_pivot.loc['OpEx']
print(income_pivot)
quarter Q1 Q2 Q3 Q4
line_item
Revenue 250000 275000 310000 340000
COGS 100000 110000 124000 136000
OpEx 80000 85000 90000 95000
Gross Profit 150000 165000 186000 204000
Net Income 70000 80000 96000 109000
This shows how you can use pivot tables as a starting point and then add computed rows to build complete financial statements.
Survey Data Analysis
Pivot tables are excellent for analyzing survey data, where you often need to cross-tabulate demographic groups with response categories.
# Generate survey data
np.random.seed(99)
n_responses = 500
survey_df = pd.DataFrame({
'age_group': np.random.choice(['18-25', '26-35', '36-45', '46-55', '56+'], n_responses,
p=[0.15, 0.30, 0.25, 0.20, 0.10]),
'gender': np.random.choice(['Male', 'Female', 'Non-binary'], n_responses,
p=[0.45, 0.45, 0.10]),
'satisfaction': np.random.choice(['Very Dissatisfied', 'Dissatisfied', 'Neutral',
'Satisfied', 'Very Satisfied'], n_responses,
p=[0.05, 0.10, 0.25, 0.35, 0.25]),
'would_recommend': np.random.choice(['Yes', 'No', 'Maybe'], n_responses,
p=[0.55, 0.15, 0.30]),
'score': np.random.randint(1, 11, n_responses)
})
# Response distribution by age group
response_pivot = survey_df.pivot_table(
values='score',
index='age_group',
columns='satisfaction',
aggfunc='count',
fill_value=0
)
# Reorder satisfaction columns logically
sat_order = ['Very Dissatisfied', 'Dissatisfied', 'Neutral', 'Satisfied', 'Very Satisfied']
response_pivot = response_pivot[sat_order]
print(response_pivot)
satisfaction Very Dissatisfied Dissatisfied Neutral Satisfied Very Satisfied
age_group
18-25 4 8 19 26 18
26-35 8 15 38 52 37
36-45 6 12 31 44 32
46-55 5 10 25 35 25
56+ 2 5 12 18 13
Cross-tabulation of demographics with recommendation likelihood:
recommend_pivot = survey_df.pivot_table(
values='score',
index='age_group',
columns='would_recommend',
aggfunc='mean'
).round(2)
print(recommend_pivot)
would_recommend Maybe No Yes
age_group
18-25 5.23 4.12 6.78
26-35 5.45 3.89 6.92
36-45 5.67 4.34 7.12
46-55 5.12 3.67 6.45
56+ 5.89 4.56 7.34
This immediately reveals that respondents who say "Yes" to recommending the product also tend to give higher satisfaction scores, consistently across all age groups.
Advanced Features
margins (Subtotals)
The margins parameter adds row and column totals -- similar to the "Grand Total" in Excel pivot tables:
pivot_margins = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum',
margins=True,
margins_name='Grand Total'
)
print(pivot_margins)
product Laptop Monitor Phone Tablet Grand Total
region
East 30482.0 25319.0 29877.0 22145.0 107823.0
North 28190.0 31002.0 27654.0 24893.0 111739.0
South 26743.0 22187.0 31290.0 28456.0 108676.0
West 29105.0 27891.0 25433.0 30012.0 112441.0
Grand Total 114520.0 106399.0 114254.0 105506.0 440679.0
The margins_name parameter lets you customize the label. By default it is 'All'.
fill_value (Missing Data)
When certain index-column combinations do not exist in your data, pandas fills those cells with NaN. Use fill_value to replace them:
# Create data with gaps
sparse_df = df[df['sales'] > 3000] # Remove some rows to create gaps
pivot_sparse = sparse_df.pivot_table(
values='sales',
index='salesperson',
columns='product',
aggfunc='count',
fill_value=0 # Replace NaN with 0
)
print(pivot_sparse)
product Laptop Monitor Phone Tablet
salesperson
Alice 5 3 4 6
Bob 4 5 3 2
Charlie 3 4 6 5
Diana 6 2 5 3
Eve 2 6 3 4
Without fill_value=0, any missing combination would show NaN instead of 0. This is particularly important when you pass the pivot table to plotting functions or export it to Excel, where NaN values can cause issues.
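A small, hypothetical example makes the difference visible:

```python
import pandas as pd

# 'South' never sold a Phone in this toy data
df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'product': ['Laptop', 'Phone', 'Laptop'],
    'sales': [100, 200, 300],
})

with_nan = df.pivot_table(values='sales', index='region',
                          columns='product', aggfunc='sum')
with_zero = df.pivot_table(values='sales', index='region',
                           columns='product', aggfunc='sum', fill_value=0)

print(with_nan)   # South/Phone is NaN
print(with_zero)  # South/Phone is 0
```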
dropna parameter
By default, pivot_table() drops columns where all values are NaN. Set dropna=False to keep them:
# Add some NaN values to demonstrate
df_with_nan = df.copy()
df_with_nan.loc[df_with_nan['region'] == 'East', 'sales'] = np.nan
pivot_dropna = df_with_nan.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum',
dropna=False # Keep columns even if all values are NaN
)
print(pivot_dropna)
The dropna=False flag is important when you need a consistent schema in your output -- for example, when writing pivot tables to a database table that expects fixed columns.
observed parameter (Categorical Data)
When your index or column contains Categorical dtype, pandas includes all categories by default -- even ones that do not appear in the data. Use observed=True to show only categories that exist in your data:
# Convert to categorical with extra categories
df['region'] = pd.Categorical(
df['region'],
categories=['North', 'South', 'East', 'West', 'Central'] # 'Central' doesn't exist in data
)
# Without observed=True -- includes empty 'Central' row
pivot_unobserved = df.pivot_table(
values='sales',
index='region',
aggfunc='sum',
observed=False
)
print(pivot_unobserved)
sales
region
North 111739.0
South 108676.0
East 107823.0
West 112441.0
Central NaN
# With observed=True -- only shows categories present in data
pivot_observed = df.pivot_table(
values='sales',
index='region',
aggfunc='sum',
observed=True
)
print(pivot_observed)
sales
region
North 111739.0
South 108676.0
East 107823.0
West 112441.0
Starting with pandas 2.2, relying on the default observed=False with categorical data emits a FutureWarning, and in pandas 3.0 the default changes to observed=True. It is best practice to always set observed explicitly when working with categorical data so you are not surprised by empty rows or columns in your pivot tables.
sort parameter
By default, the result is sorted by the index. Set sort=False to preserve the order of first appearance:
pivot_unsorted = df.pivot_table(
values='sales',
index='region',
aggfunc='sum',
sort=False
)
print(pivot_unsorted)
This is useful when your data already has a meaningful order (like chronological time periods) that you want to preserve rather than alphabetical sorting.
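A toy example shows the difference in row order (the data is hypothetical):

```python
import pandas as pd

# 'West' appears before 'East' in the raw data
df = pd.DataFrame({'region': ['West', 'East', 'West'],
                   'sales': [100, 200, 300]})

sorted_pivot = df.pivot_table(values='sales', index='region', aggfunc='sum')
unsorted_pivot = df.pivot_table(values='sales', index='region',
                                aggfunc='sum', sort=False)

print(list(sorted_pivot.index))    # alphabetical order
print(list(unsorted_pivot.index))  # order of first appearance
```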
pivot_table() vs groupby().unstack()
These two approaches often produce identical results, but they differ in syntax, flexibility, and edge-case behavior.
Side-by-side comparison:
# Method 1: pivot_table
result_pivot = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum',
fill_value=0
)
# Method 2: groupby + unstack
result_groupby = (
df.groupby(['region', 'product'])['sales']
.sum()
.unstack(fill_value=0)
)
# They produce the same result
print(result_pivot.equals(result_groupby))
# True
When to use each:
| Feature | pivot_table() | groupby().unstack() |
|---|---|---|
| Readability for cross-tabs | Better -- single function call | Requires chaining |
| Multiple aggfuncs | Built-in (aggfunc=['sum', 'mean']) | Requires .agg() then .unstack() |
| margins/subtotals | Built-in (margins=True) | Must compute manually |
| fill_value | Built-in parameter | Available in .unstack(fill_value=) |
| Performance on large data | Slightly slower | Slightly faster |
| Flexibility for complex agg | Limited to aggfunc parameter | Full .agg() flexibility |
| Chaining with other ops | Harder to chain | Natural in method chains |
Performance differences:
For most datasets, the difference is negligible. On datasets with 10M+ rows, groupby().unstack() can be 10-20% faster because pivot_table() does additional validation and handles edge cases internally. However, pivot_table() gives you margins and multiple aggfuncs for free, which would require additional code with the groupby approach.
Rule of thumb: Use pivot_table() when you want a quick cross-tabulation with subtotals. Use groupby().unstack() when you are building a longer method chain or need maximum performance.
pivot_table() vs crosstab()
pd.crosstab() is a specialized function for frequency tables -- counting how often combinations of categories occur.
# crosstab for frequency counts
ct = pd.crosstab(
df['region'],
df['product'],
margins=True
)
print(ct)
product Laptop Monitor Phone Tablet All
region
East 13 12 14 11 50
North 12 14 11 13 50
South 14 11 13 12 50
West 11 13 12 14 50
All 50 50 50 50 200
The equivalent pivot_table call:
# Same result with pivot_table
pt = df.pivot_table(
index='region',
columns='product',
aggfunc='size',
fill_value=0,
margins=True
)
When to use crosstab:
- Frequency counts are your primary goal
- You want normalized proportions (use normalize=True, normalize='index', or normalize='columns')
- You are working with Series objects rather than columns from a DataFrame
# Normalized crosstab -- shows proportions instead of counts
ct_normalized = pd.crosstab(
df['region'],
df['product'],
normalize='index' # Each row sums to 1.0
).round(3)
print(ct_normalized)
product Laptop Monitor Phone Tablet
region
East 0.260 0.240 0.280 0.220
North 0.240 0.280 0.220 0.260
South 0.280 0.220 0.260 0.240
West 0.220 0.260 0.240 0.280
crosstab() is essentially a convenience wrapper around pivot_table() with aggfunc='count' as the default. Use it when counting is your goal and you want the cleaner syntax.
pivot_table() vs melt() (Reverse Operation)
melt() is the inverse of pivoting. It converts a wide-format DataFrame (like a pivot table) back into a long format.
# Create a pivot table
wide = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum'
)
print("Wide format (pivot table):")
print(wide)
print()
# Melt it back to long format
long = wide.reset_index().melt(
id_vars='region',
value_vars=['Laptop', 'Monitor', 'Phone', 'Tablet'],
var_name='product',
value_name='total_sales'
)
print("Long format (melted):")
print(long)
Wide format (pivot table):
product Laptop Monitor Phone Tablet
region
East 30482.0 25319.0 29877.0 22145.0
North 28190.0 31002.0 27654.0 24893.0
South 26743.0 22187.0 31290.0 28456.0
West 29105.0 27891.0 25433.0 30012.0
Long format (melted):
region product total_sales
0 East Laptop 30482.0
1 North Laptop 28190.0
2 South Laptop 26743.0
3 West Laptop 29105.0
4 East Monitor 25319.0
5 North Monitor 31002.0
6 South Monitor 22187.0
7 West Monitor 27891.0
8 East Phone 29877.0
9 North Phone 27654.0
10 South Phone 31290.0
11 West Phone 25433.0
12 East Tablet 22145.0
13 North Tablet 24893.0
14 South Tablet 28456.0
15 West Tablet 30012.0
Derived columns: you can pivot, compute new columns from the summary, and melt back to long format later if a downstream tool needs it. For example, finding each region's top product:
# Start long -> pivot -> modify -> melt back
pivoted = df.pivot_table(values='sales', index='region', columns='product', aggfunc='sum')
pivoted['top_product'] = pivoted.idxmax(axis=1)
print(pivoted['top_product'])
region
East Laptop
North Monitor
South Phone
West Tablet
Name: top_product, dtype: object
Understanding the pivot-melt relationship is essential for reshaping data between the formats that different tools expect -- databases prefer long format, dashboards and reports often prefer wide format.
Performance on Large Datasets
Pivot tables can become slow on large datasets. Here are benchmarks and optimization strategies.
Benchmarking with 1M+ rows
import time

# Generate a large dataset
np.random.seed(42)
n_large = 2_000_000
large_df = pd.DataFrame({
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_large),
    'product': np.random.choice(['A', 'B', 'C', 'D', 'E'], n_large),
    'channel': np.random.choice(['Online', 'Retail', 'Wholesale'], n_large),
    'sales': np.random.uniform(10, 10000, n_large),
    'quantity': np.random.randint(1, 100, n_large)
})
start = time.time()
pivot_large = large_df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum'
)
elapsed = time.time() - start
print(f"2M rows, simple pivot: {elapsed:.3f}s")
# 2M rows, simple pivot: 0.089s
start = time.time()
pivot_complex = large_df.pivot_table(
values=['sales', 'quantity'],
index=['region', 'channel'],
columns='product',
aggfunc=['sum', 'mean'],
margins=True
)
elapsed = time.time() - start
print(f"2M rows, complex pivot: {elapsed:.3f}s")
# 2M rows, complex pivot: 0.412s
Memory optimization with Category dtype
Converting string columns to category dtype before pivoting can dramatically reduce memory usage and improve speed:
# Before: object dtype
print(f"Memory before: {large_df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
# Convert to category
for col in ['region', 'product', 'channel']:
    large_df[col] = large_df[col].astype('category')
print(f"Memory after: {large_df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
Memory before: 168.0 MB
Memory after: 38.2 MB
That is a 77% reduction in memory usage. The pivot table operation also runs faster because pandas can use integer codes internally instead of comparing strings.
# Benchmark with category dtype
start = time.time()
pivot_cat = large_df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum',
observed=True # Important: always use with categoricals
)
elapsed = time.time() - start
print(f"2M rows with category dtype: {elapsed:.3f}s")
# 2M rows with category dtype: 0.052s (42% faster)
Chunked processing for very large datasets
When your dataset does not fit in memory, process it in chunks:
def chunked_pivot(filepath, chunksize=500_000):
    """Build a pivot table from a CSV file in chunks."""
    result = None
    for chunk in pd.read_csv(filepath, chunksize=chunksize):
        chunk_pivot = chunk.pivot_table(
            values='sales',
            index='region',
            columns='product',
            aggfunc='sum'
        )
        if result is None:
            result = chunk_pivot
        else:
            result = result.add(chunk_pivot, fill_value=0)
    return result
# Usage:
# pivot = chunked_pivot('large_sales_data.csv')
This works for additive aggregations like sum and count. For mean, you would need to track both the sum and count separately, then compute the mean at the end:
def chunked_mean_pivot(filepath, chunksize=500_000):
    """Build a mean pivot table from a CSV file in chunks."""
    total_sum = None
    total_count = None
    for chunk in pd.read_csv(filepath, chunksize=chunksize):
        chunk_sum = chunk.pivot_table(
            values='sales', index='region', columns='product', aggfunc='sum'
        )
        chunk_count = chunk.pivot_table(
            values='sales', index='region', columns='product', aggfunc='count'
        )
        if total_sum is None:
            total_sum = chunk_sum
            total_count = chunk_count
        else:
            total_sum = total_sum.add(chunk_sum, fill_value=0)
            total_count = total_count.add(chunk_count, fill_value=0)
    return total_sum / total_count
Pre-filtering to reduce data volume
Often the biggest performance win is simply reducing the data before pivoting:
# Instead of pivoting all 2M rows then filtering the result...
# Filter FIRST, then pivot
filtered = large_df[large_df['region'].isin(['North', 'South'])]
pivot_filtered = filtered.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum'
)
This can be orders of magnitude faster than pivoting the full dataset and then selecting rows from the result.
Common Errors and Fixes
ValueError: Index contains duplicate entries, cannot reshape
This error comes from pivot(), not pivot_table(). If you see it, you probably used the wrong function:
# This FAILS if there are duplicate region-product combinations
# df.pivot(index='region', columns='product', values='sales')
# ValueError: Index contains duplicate entries, cannot reshape
# This WORKS because pivot_table aggregates duplicates
df.pivot_table(index='region', columns='product', values='sales', aggfunc='sum')
Fix: Switch from pivot() to pivot_table() and specify an aggfunc.
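A minimal reproduction of the error and its fix, using toy data:

```python
import pandas as pd

# Duplicate region-product pairs: 'North'/'Laptop' appears twice
df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'product': ['Laptop', 'Laptop', 'Phone'],
    'sales': [100, 200, 300],
})

# pivot() cannot decide what to put in the North/Laptop cell
try:
    df.pivot(index='region', columns='product', values='sales')
except ValueError as e:
    print(f"pivot() failed: {e}")

# pivot_table() aggregates the duplicates instead
result = df.pivot_table(values='sales', index='region',
                        columns='product', aggfunc='sum')
print(result)  # North/Laptop is 300
```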
NaN values appearing unexpectedly
NaN appears when a particular index-column combination has no data:
# Some region-product combinations might have no sales
# Use fill_value to handle this
pivot = df.pivot_table(
values='sales',
index='salesperson',
columns='product',
aggfunc='sum',
fill_value=0 # Replace NaN with 0
)
NaN can also appear when your source data contains NaN values. Clean the data first or use dropna in your aggregation:
# Remove rows with NaN sales before pivoting
clean_df = df.dropna(subset=['sales'])
pivot_clean = clean_df.pivot_table(values='sales', index='region', aggfunc='sum')
TypeError: Could not convert to numeric
This happens when your value column contains mixed types -- for example, strings mixed with numbers:
# Diagnose the issue
print(df['sales'].dtype)
print(df['sales'].apply(type).value_counts())
# Fix: convert to numeric, coercing errors to NaN
df['sales'] = pd.to_numeric(df['sales'], errors='coerce')
Common causes include CSV files where some rows have text like "N/A" or "--" in numeric columns. Always validate your data types before pivoting.
Memory errors with large datasets
If you get a MemoryError, try these approaches in order:
- Use category dtype for index and column fields (reduces memory 50-80%)
- Filter data first -- only include rows you actually need
- Reduce granularity -- do you really need daily data, or would monthly work?
- Use chunked processing as shown in the performance section above
- Increase available memory -- or switch to a tool designed for out-of-core computation like Dask
# Dask pivot table for datasets that don't fit in memory
import dask.dataframe as dd

ddf = dd.read_csv('huge_file.csv')
# Note: Dask pivot_table has limited functionality compared to pandas
result = ddf.groupby(['region', 'product'])['sales'].sum().compute().unstack()
DataError: No numeric types to aggregate
This occurs when pivot_table() cannot find numeric columns to aggregate:
# If your values column is stored as string
df['sales'] = df['sales'].astype(str) # Simulating the problem
# pivot_table will fail or produce unexpected results
# Fix: ensure your value columns are numeric
df['sales'] = pd.to_numeric(df['sales'])
Best Practices
1. Choose the right aggregation function
The default mean is rarely what business users expect. Most of the time, they want sum or count. Always set aggfunc explicitly:
# Explicit is better than implicit
pivot = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum' # Always specify this
)
2. Use observed=True for categoricals
If any of your index or column fields are Categorical dtype, always pass observed=True. Otherwise your pivot table will include rows and columns for categories that do not exist in the data, creating confusing NaN-filled rows:
pivot = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum',
observed=True # Critical for categorical data
)
3. Convert strings to category before pivoting
For any column that goes into index or columns, converting to category dtype improves both memory usage and performance:
for col in ['region', 'product', 'quarter']:
    df[col] = df[col].astype('category')
4. Name your pivot tables meaningfully
Give your pivot table variables descriptive names that indicate what they contain:
# Bad
pt = df.pivot_table(values='sales', index='region', columns='product', aggfunc='sum')
# Good
revenue_by_region_product = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum'
)
5. Chain .style for presentation
When presenting pivot tables in Jupyter notebooks, use the .style accessor for formatting:
pivot = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum'
)
styled = (
pivot.style
.format('${:,.0f}')
.background_gradient(cmap='YlOrRd')
.set_caption('Total Sales by Region and Product')
)
styled
This renders a color-coded, properly formatted table in Jupyter -- much more readable than raw numbers.
6. Export to Excel with formatting
Pivot tables often end up in Excel. Use to_excel() with openpyxl for formatted output:
pivot = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum',
margins=True,
margins_name='Total'
)
# Basic export
pivot.to_excel('sales_pivot.xlsx', sheet_name='Revenue Summary')
# With formatting using openpyxl
from openpyxl.utils.dataframe import dataframe_to_rows
from openpyxl import Workbook
from openpyxl.styles import Font, numbers
wb = Workbook()
ws = wb.active
ws.title = 'Revenue Summary'
for r_idx, row in enumerate(dataframe_to_rows(pivot, index=True, header=True), 1):
    for c_idx, value in enumerate(row, 1):
        cell = ws.cell(row=r_idx, column=c_idx, value=value)
        if isinstance(value, (int, float)) and r_idx > 2:
            cell.number_format = '$#,##0'
wb.save('sales_pivot_formatted.xlsx')
7. Validate your results
Always sanity-check your pivot table against the source data:
pivot = df.pivot_table(values='sales', index='region', columns='product', aggfunc='sum')
# Check that the grand total matches
assert abs(pivot.sum().sum() - df['sales'].sum()) < 0.01, "Totals don't match!"
# Check row totals against groupby
region_totals = df.groupby('region')['sales'].sum()
pivot_row_totals = pivot.sum(axis=1)
pd.testing.assert_series_equal(region_totals, pivot_row_totals, check_names=False)
These assertions catch bugs early, especially when you have filtering or data cleaning steps before the pivot.
Putting It All Together
Here is a complete example that combines everything we have covered -- creating a polished sales dashboard from raw data:
# Load and prepare data
import numpy as np
import pandas as pd

np.random.seed(42)
n = 10000
df = pd.DataFrame({
    'date': pd.date_range('2025-01-01', periods=n, freq='h'),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n),
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor'], n),
    'sales': np.random.uniform(100, 5000, n).round(2),
    'quantity': np.random.randint(1, 20, n),
})
# Convert to category for performance
for col in ['region', 'product']:
    df[col] = df[col].astype('category')
df['month'] = df['date'].dt.to_period('M')
# 1. Revenue summary with margins
revenue = df.pivot_table(
    values='sales',
    index='region',
    columns='product',
    aggfunc='sum',
    margins=True,
    margins_name='Total',
    observed=True
).round(2)
print("=== Revenue by Region & Product ===")
print(revenue)
print()
# 2. Monthly trends
monthly = df.pivot_table(
    values='sales',
    index='month',
    columns='region',
    aggfunc='sum',
    observed=True
).round(2)
print("=== Monthly Revenue Trend ===")
print(monthly.head())
print()
# 3. Multi-metric summary
summary = df.pivot_table(
    index='region',
    aggfunc={
        'sales': ['sum', 'mean', 'count'],
        'quantity': ['sum', 'mean']
    },
    observed=True
).round(2)
print("=== Multi-Metric Summary ===")
print(summary)
print()
# 4. Validate
total_sales = df['sales'].sum()
pivot_total = revenue.loc['Total', 'Total']
print(f"Source total: ${total_sales:,.2f}")
print(f"Pivot total: ${pivot_total:,.2f}")
print(f"Match: {abs(total_sales - pivot_total) < 0.01}")
This pattern -- prepare, convert dtypes, pivot with explicit parameters, validate -- is the foundation of reliable data analysis with pandas pivot tables. Whether you are building a quick summary in a Jupyter notebook or an automated reporting pipeline, these practices will serve you well.
Pandas pivot tables are one of the most versatile tools in the Python data analysis toolkit. Master the core parameters -- values, index, columns, aggfunc, margins, and fill_value -- and you can reshape virtually any tabular dataset into the summary view your analysis requires.
Frequently Asked Questions
What is the difference between pivot() and pivot_table() in pandas?
pivot() is a simple reshape operation that cannot handle duplicate entries -- it raises an error if duplicate index/column combinations exist. pivot_table() is more powerful: it aggregates duplicate values using functions like mean, sum, or count. Use pivot() for simple reshaping and pivot_table() when your data has duplicates or you need aggregation.
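A minimal illustration of the difference, using a small DataFrame invented for this example -- the ('North', 'Laptop') combination appears twice, which pivot() rejects but pivot_table() aggregates:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'product': ['Laptop', 'Laptop', 'Phone'],
    'sales': [100, 200, 300],
})

# pivot() raises because ('North', 'Laptop') appears twice
try:
    df.pivot(index='region', columns='product', values='sales')
except ValueError as err:
    print(f"pivot() raised: {err}")

# pivot_table() sums the duplicates instead of raising
pt = df.pivot_table(index='region', columns='product',
                    values='sales', aggfunc='sum')
print(pt)
```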
How do I handle missing values in a pandas pivot table?
Use the fill_value parameter to replace NaN values: df.pivot_table(values='sales', index='region', columns='quarter', fill_value=0). You can also chain .fillna() after creating the pivot table for more complex filling strategies.
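A quick sketch with made-up data where one region has no Q2 rows, so its cell would otherwise be NaN:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'quarter': ['Q1', 'Q2', 'Q1'],
    'sales': [100, 150, 200],
})

# Without fill_value=0, the South/Q2 cell would be NaN
pt = df.pivot_table(values='sales', index='region',
                    columns='quarter', aggfunc='sum', fill_value=0)
print(pt)
```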
What is the difference between pivot_table() and groupby() in pandas?
pivot_table() creates a cross-tabulated view with categories as both rows and columns. groupby() produces a single-index result with aggregate values. pivot_table(index='region', columns='product', values='sales', aggfunc='sum') is equivalent to groupby(['region','product'])['sales'].sum().unstack(), but more concise.
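You can verify the equivalence yourself on a toy dataset (invented for this example) -- both approaches produce the same cross-tabulated result:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'product': ['Laptop', 'Phone', 'Laptop', 'Phone'],
    'sales': [100, 200, 300, 400],
})

via_pivot = df.pivot_table(index='region', columns='product',
                           values='sales', aggfunc='sum')
via_groupby = df.groupby(['region', 'product'])['sales'].sum().unstack()

# Both routes yield the same shape and elements
print(via_pivot.equals(via_groupby))
```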
Can I use multiple aggregation functions in a pivot table?
Yes. Pass a list of functions: df.pivot_table(values='sales', index='region', aggfunc=['sum', 'mean', 'count']). You can also pass a dictionary to apply different functions to different value columns: aggfunc={'sales': 'sum', 'quantity': 'mean'}.
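Both forms on a small invented dataset -- note that the list form adds an extra column level, one per function:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'sales': [100, 200, 300],
    'quantity': [1, 2, 3],
})

# List of functions: columns become a MultiIndex of (function, column)
multi = df.pivot_table(values='sales', index='region',
                       aggfunc=['sum', 'mean', 'count'])

# Dict: a different function for each value column
mixed = df.pivot_table(index='region',
                       aggfunc={'sales': 'sum', 'quantity': 'mean'})
print(multi)
print(mixed)
```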
How do I add subtotals to a pandas pivot table?
Use the margins parameter: df.pivot_table(values='sales', index='region', columns='product', aggfunc='sum', margins=True, margins_name='Total'). This adds a Total row and column showing the aggregate of all values. Note that margins gives grand totals only; for per-group subtotals you need to compute them separately (for example with groupby) and concatenate the results.
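On a small invented dataset, the margins row and column look like this -- the bottom-right cell is the grand total:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'product': ['Laptop', 'Phone', 'Laptop'],
    'sales': [100, 200, 300],
})

# margins=True appends a 'Total' row and column of aggregates
pt = df.pivot_table(values='sales', index='region', columns='product',
                    aggfunc='sum', margins=True, margins_name='Total')
print(pt)
```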
Originally published at aicodereview.cc