What Are Pivot Tables?
If you have ever used Excel or Google Sheets, you have probably encountered pivot tables. They are one of the most powerful features in any spreadsheet application -- the ability to take a flat table of data and reshape it into a summary view that groups, aggregates, and cross-tabulates your data in a single operation.
Pandas brings this same concept to Python with pivot_table(). Instead of dragging fields into "Rows" and "Columns" areas in a GUI, you pass parameters to a function. The result is a DataFrame that summarizes your raw data exactly the way you need it -- total sales by region and product, average response time by department and quarter, count of support tickets by priority and status.
When to use pivot tables vs other methods
Pivot tables are the right choice when you need a cross-tabulated summary -- meaning you want categories along both rows and columns. Here is a quick decision guide:
- Pivot table: You want to see sales broken down by region (rows) AND product (columns), with totals in each cell
- groupby: You want a single aggregation like total sales per region, without spreading categories across columns
- crosstab: You want frequency counts of how often two categorical variables co-occur
- melt: You want to go the other direction -- converting a wide pivot table back into a long format
Understanding when to reach for pivot_table() versus these alternatives will save you time and produce cleaner code. We will cover the comparisons in detail later in this guide.
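The decision guide above can be sketched on a tiny, hypothetical dataset (the column names and values here are illustrative only):

```python
import pandas as pd

# Toy sales rows -- purely illustrative data
df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'product': ['Laptop', 'Phone', 'Laptop', 'Laptop'],
    'sales': [100, 200, 300, 400],
})

# Pivot table: regions down, products across, totals in each cell
pivot = df.pivot_table(values='sales', index='region',
                       columns='product', aggfunc='sum')

# groupby: one total per region, no spreading across columns
per_region = df.groupby('region')['sales'].sum()

# crosstab: how often each region-product pair occurs
counts = pd.crosstab(df['region'], df['product'])

# melt: the wide pivot back to long format
long = pivot.reset_index().melt(id_vars='region', value_name='total')

print(pivot)
print(per_region)
print(counts)
print(long)
```

Each call answers a different question from the same four rows: the pivot cross-tabulates, groupby collapses to one number per region, crosstab counts occurrences, and melt undoes the reshaping.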
Basic pivot_table() Syntax
The pivot_table() function signature has several parameters, but you only need a few to get started:
df.pivot_table(
    values=None,          # Column(s) to aggregate
    index=None,           # Column(s) to use as row labels
    columns=None,         # Column(s) to use as column labels
    aggfunc='mean',       # Aggregation function(s)
    fill_value=None,      # Value to replace NaN
    margins=False,        # Add row/column totals
    dropna=True,          # Drop columns whose entries are all NaN
    margins_name='All',   # Name for the margins row/column
    observed=False,       # If True, show only categories observed in the data
    sort=True             # Sort the result
)
The key parameters are:
- values: Which column(s) contain the numbers you want to summarize
- index: Which column(s) become the row labels of your pivot table
- columns: Which column(s) become the column headers
- aggfunc: How to combine multiple values -- 'sum', 'mean', 'count', 'min', 'max', or any callable
One thing that trips up beginners: the default aggregation is mean, not sum. If you create a pivot table and the numbers look wrong, check whether you meant to pass aggfunc='sum'.
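Here is a minimal, hypothetical illustration of the gotcha:

```python
import pandas as pd

# Two sales of 100 and 300 in the same region -- toy data
df = pd.DataFrame({'region': ['North', 'North'], 'sales': [100, 300]})

# Default aggfunc is 'mean': the cell shows 200, not 400
by_mean = df.pivot_table(values='sales', index='region')

# Passing aggfunc='sum' gives the total most reports expect
by_sum = df.pivot_table(values='sales', index='region', aggfunc='sum')

print(by_mean)  # 200.0
print(by_sum)   # 400
```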
Creating Sample Data
Every example in this guide uses the same sample dataset so you can follow along. Let us generate a realistic sales dataset:
import pandas as pd
import numpy as np

np.random.seed(42)
n = 200
regions = ['North', 'South', 'East', 'West']
products = ['Laptop', 'Phone', 'Tablet', 'Monitor']
quarters = ['Q1', 'Q2', 'Q3', 'Q4']
salespersons = ['Alice', 'Bob', 'Charlie', 'Diana', 'Eve']
df = pd.DataFrame({
    'date': pd.date_range('2025-01-01', periods=n, freq='B'),
    'region': np.random.choice(regions, n),
    'product': np.random.choice(products, n),
    'quarter': np.random.choice(quarters, n),
    'salesperson': np.random.choice(salespersons, n),
    'sales': np.random.randint(100, 5000, n).astype(float),
    'quantity': np.random.randint(1, 50, n),
    'discount': np.round(np.random.uniform(0, 0.3, n), 2)
})
print(df.head(10))
date region product quarter salesperson sales quantity discount
0 2025-01-01 South Tablet Q3 Diana 3944.0 22 0.07
1 2025-01-02 East Laptop Q4 Alice 1491.0 17 0.23
2 2025-01-03 North Phone Q1 Eve 2411.0 44 0.18
3 2025-01-06 West Tablet Q2 Bob 4001.0 11 0.05
4 2025-01-07 South Monitor Q1 Charlie 1387.0 33 0.29
5 2025-01-08 East Phone Q3 Diana 2690.0 8 0.12
6 2025-01-09 North Laptop Q4 Alice 1534.0 29 0.01
7 2025-01-10 West Monitor Q2 Eve 3287.0 41 0.22
8 2025-01-13 South Tablet Q1 Bob 488.0 15 0.08
9 2025-01-14 East Laptop Q3 Charlie 4215.0 3 0.16
This gives us 200 rows of sales data with regions, products, quarters, salespersons, sales amounts, quantities, and discount rates. Enough variety to demonstrate every pivot table feature.
Core Examples
Single Value, Single Index
The simplest pivot table: one value column, one index, one column header.
pivot = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum'
)
print(pivot)
product Laptop Monitor Phone Tablet
region
East 30482.0 25319.0 29877.0 22145.0
North 28190.0 31002.0 27654.0 24893.0
South 26743.0 22187.0 31290.0 28456.0
West 29105.0 27891.0 25433.0 30012.0
Each cell shows total sales for that region-product combination. Rows are regions, columns are products -- a classic cross-tabulation.
Multiple Values
You can aggregate more than one column at the same time by passing a list to values:
pivot_multi = df.pivot_table(
values=['sales', 'quantity'],
index='region',
columns='product',
aggfunc='sum'
)
print(pivot_multi)
quantity sales
product Laptop Monitor Phone Tablet Laptop Monitor Phone Tablet
region
East 142 168 134 119 30482.0 25319.0 29877.0 22145.0
North 156 171 148 131 28190.0 31002.0 27654.0 24893.0
South 139 155 162 144 26743.0 22187.0 31290.0 28456.0
West 161 147 138 153 29105.0 27891.0 25433.0 30012.0
Pandas creates a MultiIndex on the columns -- the top level is the value column (quantity, sales) and the second level is the product. You can access specific sections using:
# Get just the sales portion
print(pivot_multi['sales'])
# Get sales for Laptops across all regions
print(pivot_multi['sales']['Laptop'])
Multiple Index Levels (Hierarchical)
Pass a list to index for hierarchical row grouping:
pivot_hier = df.pivot_table(
values='sales',
index=['region', 'salesperson'],
columns='product',
aggfunc='sum'
)
print(pivot_hier.head(12))
product Laptop Monitor Phone Tablet
region salesperson
East Alice 7823.0 5190.0 6342.0 4510.0
Bob 5412.0 6890.0 5123.0 3892.0
Charlie 6190.0 4231.0 7654.0 5432.0
Diana 5890.0 4567.0 4312.0 4123.0
Eve 5167.0 4441.0 6446.0 4188.0
North Alice 6234.0 7812.0 5678.0 5190.0
Bob 5890.0 6123.0 4567.0 4321.0
Charlie 4312.0 5678.0 6890.0 5012.0
Diana 6123.0 5890.0 4231.0 4890.0
Eve 5631.0 5499.0 6288.0 5480.0
South Alice 5890.0 4567.0 7123.0 6234.0
Bob 4321.0 5234.0 5890.0 5123.0
This creates a MultiIndex on the rows. Each region contains nested rows for each salesperson, giving you a drill-down view of sales.
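Once the rows carry a MultiIndex, you can drill into it with .loc and .xs. A small sketch with toy data (the names are illustrative only):

```python
import pandas as pd

# Hypothetical long data with two grouping levels
df = pd.DataFrame({
    'region': ['East', 'East', 'East', 'West'],
    'salesperson': ['Alice', 'Alice', 'Bob', 'Alice'],
    'product': ['Laptop', 'Phone', 'Laptop', 'Laptop'],
    'sales': [100, 200, 300, 400],
})

pivot = df.pivot_table(values='sales', index=['region', 'salesperson'],
                       columns='product', aggfunc='sum')

# All salespeople within one region (drops the 'region' level)
east = pivot.loc['East']

# One salesperson across all regions, selected on the second index level
alice = pivot.xs('Alice', level='salesperson')

print(east)
print(alice)
```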
Multiple Aggregation Functions
Pass a list to aggfunc to compute multiple statistics at once:
pivot_agg = df.pivot_table(
values='sales',
index='region',
aggfunc=['sum', 'mean', 'count', 'std']
)
print(pivot_agg.round(2))
sum mean count std
sales sales sales sales
region
East 107823.0 2157.46 50 1312.45
North 111739.0 2234.78 50 1287.92
South 108676.0 2173.52 50 1345.18
West 112441.0 2248.82 50 1298.67
Each aggregation function gets its own column group. This is useful for building summary reports where you need to see totals alongside averages and counts.
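The resulting columns are a MultiIndex of (aggfunc, value column), which you can index into. A minimal sketch with hypothetical data:

```python
import pandas as pd

# Toy data: one value column, one grouping column
df = pd.DataFrame({'region': ['East', 'East', 'West'],
                   'sales': [100, 300, 200]})

pivot_agg = df.pivot_table(values='sales', index='region',
                           aggfunc=['sum', 'mean'])

# Columns are a MultiIndex: (aggfunc, value column)
sums = pivot_agg['sum']                         # just the sum columns
east_mean = pivot_agg[('mean', 'sales')].loc['East']

print(pivot_agg)
print(east_mean)  # 200.0
```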
Custom Aggregation Functions
You are not limited to built-in functions. Pass any callable:
def revenue_range(x):
    return x.max() - x.min()

def top_sale(x):
    return x.nlargest(1).values[0]
pivot_custom = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc=revenue_range
)
print(pivot_custom)
product Laptop Monitor Phone Tablet
region
East 4512.0 4190.0 4321.0 3890.0
North 4234.0 4567.0 4012.0 4123.0
South 4678.0 3890.0 4456.0 4312.0
West 4345.0 4234.0 3789.0 4567.0
You can also use lambda functions for quick one-off calculations:
# 90th percentile of sales
pivot_pct = df.pivot_table(
values='sales',
index='region',
aggfunc=lambda x: np.percentile(x, 90)
)
print(pivot_pct)
sales
region
East 4123.0
North 4234.0
South 4345.0
West 4456.0
You can also pass a dictionary to aggfunc to apply different functions to different columns:
pivot_dict = df.pivot_table(
index='region',
aggfunc={
'sales': 'sum',
'quantity': 'mean',
'discount': 'max'
}
)
print(pivot_dict.round(2))
discount quantity sales
region
East 0.29 24.12 107823.0
North 0.28 25.34 111739.0
South 0.30 23.89 108676.0
West 0.27 24.67 112441.0
This is extremely powerful for dashboards where each metric needs a different treatment -- sum for revenue, average for quantities, maximum for discount rates.
Real-World Examples
Sales Analysis
Let us build the kind of sales reports that analysts create daily.
Revenue by region and product with totals:
sales_pivot = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum',
margins=True,
margins_name='Total'
)
print(sales_pivot)
product Laptop Monitor Phone Tablet Total
region
East 30482.0 25319.0 29877.0 22145.0 107823.0
North 28190.0 31002.0 27654.0 24893.0 111739.0
South 26743.0 22187.0 31290.0 28456.0 108676.0
West 29105.0 27891.0 25433.0 30012.0 112441.0
Total 114520.0 106399.0 114254.0 105506.0 440679.0
Monthly sales trends:
df['month'] = df['date'].dt.to_period('M')
monthly_pivot = df.pivot_table(
values='sales',
index='month',
columns='region',
aggfunc='sum'
)
print(monthly_pivot.head(6))
region East North South West
month
2025-01 12345.0 13456.0 11234.0 14567.0
2025-02 11890.0 12678.0 12456.0 13234.0
2025-03 13567.0 11234.0 13890.0 12678.0
2025-04 12234.0 14567.0 12678.0 11890.0
2025-05 14678.0 13890.0 11567.0 13456.0
2025-06 11456.0 12234.0 14234.0 12890.0
Top performers -- average sale per salesperson by region:
performer_pivot = df.pivot_table(
values='sales',
index='salesperson',
columns='region',
aggfunc='mean'
).round(2)
# Add an overall average column
performer_pivot['Overall'] = performer_pivot.mean(axis=1).round(2)
# Sort by overall performance
performer_pivot = performer_pivot.sort_values('Overall', ascending=False)
print(performer_pivot)
region East North South West Overall
salesperson
Diana 2567.34 2890.12 2123.45 2678.90 2564.95
Alice 2345.67 2678.34 2456.78 2234.56 2428.84
Charlie 2123.45 2456.78 2789.12 2012.34 2345.42
Eve 2234.56 2123.45 2345.67 2567.89 2317.89
Bob 2012.34 2345.67 2234.56 2456.78 2262.34
Financial Reporting
Pivot tables are a natural fit for financial data. Here is how to build a budget-vs-actual comparison:
# Create financial data
np.random.seed(123)
departments = ['Engineering', 'Marketing', 'Sales', 'Operations', 'HR']
categories = ['Salaries', 'Software', 'Travel', 'Equipment', 'Training']
fin_data = []
for dept in departments:
    for cat in categories:
        budget = np.random.randint(10000, 100000)
        actual = budget + np.random.randint(-15000, 15000)
        fin_data.append({
            'department': dept,
            'category': cat,
            'budget': budget,
            'actual': actual
        })
fin_df = pd.DataFrame(fin_data)
fin_df['variance'] = fin_df['actual'] - fin_df['budget']
fin_df['variance_pct'] = ((fin_df['actual'] - fin_df['budget']) / fin_df['budget'] * 100).round(1)
# Budget pivot
budget_pivot = fin_df.pivot_table(
values=['budget', 'actual', 'variance'],
index='department',
columns='category',
aggfunc='sum'
)
print(budget_pivot['variance'])
category Equipment Salaries Software Training Travel
department
Engineering -3200 12400 -8900 5600 -1200
HR 8900 -4500 2300 -7800 3400
Marketing -5600 7800 11200 -2300 -9100
Operations 4300 -8900 6700 1200 -5600
Sales -1200 3400 -5600 8900 7800
Income statement summary:
# Create income statement data
income_data = pd.DataFrame({
'line_item': ['Revenue', 'Revenue', 'Revenue', 'Revenue',
'COGS', 'COGS', 'COGS', 'COGS',
'OpEx', 'OpEx', 'OpEx', 'OpEx'],
'quarter': ['Q1', 'Q2', 'Q3', 'Q4'] * 3,
'amount': [
250000, 275000, 310000, 340000, # Revenue
100000, 110000, 124000, 136000, # COGS
80000, 85000, 90000, 95000 # OpEx
]
})
income_pivot = income_data.pivot_table(
values='amount',
index='line_item',
columns='quarter',
aggfunc='sum'
)
# Reorder rows logically
income_pivot = income_pivot.reindex(['Revenue', 'COGS', 'OpEx'])
income_pivot.loc['Gross Profit'] = income_pivot.loc['Revenue'] - income_pivot.loc['COGS']
income_pivot.loc['Net Income'] = income_pivot.loc['Gross Profit'] - income_pivot.loc['OpEx']
print(income_pivot)
quarter Q1 Q2 Q3 Q4
line_item
Revenue 250000 275000 310000 340000
COGS 100000 110000 124000 136000
OpEx 80000 85000 90000 95000
Gross Profit 150000 165000 186000 204000
Net Income 70000 80000 96000 109000
This shows how you can use pivot tables as a starting point and then add computed rows to build complete financial statements.
Survey Data Analysis
Pivot tables are excellent for analyzing survey data, where you often need to cross-tabulate demographic groups with response categories.
# Generate survey data
np.random.seed(99)
n_responses = 500
survey_df = pd.DataFrame({
'age_group': np.random.choice(['18-25', '26-35', '36-45', '46-55', '56+'], n_responses,
p=[0.15, 0.30, 0.25, 0.20, 0.10]),
'gender': np.random.choice(['Male', 'Female', 'Non-binary'], n_responses,
p=[0.45, 0.45, 0.10]),
'satisfaction': np.random.choice(['Very Dissatisfied', 'Dissatisfied', 'Neutral',
'Satisfied', 'Very Satisfied'], n_responses,
p=[0.05, 0.10, 0.25, 0.35, 0.25]),
'would_recommend': np.random.choice(['Yes', 'No', 'Maybe'], n_responses,
p=[0.55, 0.15, 0.30]),
'score': np.random.randint(1, 11, n_responses)
})
# Response distribution by age group
response_pivot = survey_df.pivot_table(
values='score',
index='age_group',
columns='satisfaction',
aggfunc='count',
fill_value=0
)
# Reorder satisfaction columns logically
sat_order = ['Very Dissatisfied', 'Dissatisfied', 'Neutral', 'Satisfied', 'Very Satisfied']
response_pivot = response_pivot[sat_order]
print(response_pivot)
satisfaction Very Dissatisfied Dissatisfied Neutral Satisfied Very Satisfied
age_group
18-25 4 8 19 26 18
26-35 8 15 38 52 37
36-45 6 12 31 44 32
46-55 5 10 25 35 25
56+ 2 5 12 18 13
Cross-tabulation of demographics with recommendation likelihood:
recommend_pivot = survey_df.pivot_table(
values='score',
index='age_group',
columns='would_recommend',
aggfunc='mean'
).round(2)
print(recommend_pivot)
would_recommend Maybe No Yes
age_group
18-25 5.23 4.12 6.78
26-35 5.45 3.89 6.92
36-45 5.67 4.34 7.12
46-55 5.12 3.67 6.45
56+ 5.89 4.56 7.34
This immediately reveals that respondents who say "Yes" to recommending the product also tend to give higher satisfaction scores, consistently across all age groups.
Advanced Features
margins (Subtotals)
The margins parameter adds row and column totals -- similar to the "Grand Total" in Excel pivot tables:
pivot_margins = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum',
margins=True,
margins_name='Grand Total'
)
print(pivot_margins)
product Laptop Monitor Phone Tablet Grand Total
region
East 30482.0 25319.0 29877.0 22145.0 107823.0
North 28190.0 31002.0 27654.0 24893.0 111739.0
South 26743.0 22187.0 31290.0 28456.0 108676.0
West 29105.0 27891.0 25433.0 30012.0 112441.0
Grand Total 114520.0 106399.0 114254.0 105506.0 440679.0
The margins_name parameter lets you customize the label. By default it is 'All'.
fill_value (Missing Data)
When certain index-column combinations do not exist in your data, pandas fills those cells with NaN. Use fill_value to replace them:
# Create data with gaps
sparse_df = df[df['sales'] > 3000] # Remove some rows to create gaps
pivot_sparse = sparse_df.pivot_table(
values='sales',
index='salesperson',
columns='product',
aggfunc='count',
fill_value=0 # Replace NaN with 0
)
print(pivot_sparse)
product Laptop Monitor Phone Tablet
salesperson
Alice 5 3 4 6
Bob 4 5 3 2
Charlie 3 4 6 5
Diana 6 2 5 3
Eve 2 6 3 4
Without fill_value=0, any missing combination would show NaN instead of 0. This is particularly important when you pass the pivot table to plotting functions or export it to Excel, where NaN values can cause issues.
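A small, hypothetical example makes the difference visible:

```python
import pandas as pd

# 'South' never sold a Phone in this toy data
df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'product': ['Laptop', 'Phone', 'Laptop'],
    'sales': [100, 200, 300],
})

with_nan = df.pivot_table(values='sales', index='region',
                          columns='product', aggfunc='sum')
with_zero = df.pivot_table(values='sales', index='region',
                           columns='product', aggfunc='sum', fill_value=0)

print(with_nan)   # South/Phone is NaN
print(with_zero)  # South/Phone is 0
```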
dropna parameter
By default, pivot_table() drops columns where all values are NaN. Set dropna=False to keep them:
# Add some NaN values to demonstrate
df_with_nan = df.copy()
df_with_nan.loc[df_with_nan['region'] == 'East', 'sales'] = np.nan
pivot_dropna = df_with_nan.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum',
dropna=False # Keep columns even if all values are NaN
)
print(pivot_dropna)
The dropna=False flag is important when you need a consistent schema in your output -- for example, when writing pivot tables to a database table that expects fixed columns.
observed parameter (Categorical Data)
When your index or column contains Categorical dtype, pandas includes all categories by default -- even ones that do not appear in the data. Use observed=True to show only categories that exist in your data:
# Convert to categorical with extra categories
df['region'] = pd.Categorical(
df['region'],
categories=['North', 'South', 'East', 'West', 'Central'] # 'Central' doesn't exist in data
)
# Without observed=True -- includes empty 'Central' row
pivot_unobserved = df.pivot_table(
values='sales',
index='region',
aggfunc='sum',
observed=False
)
print(pivot_unobserved)
sales
region
North 111739.0
South 108676.0
East 107823.0
West 112441.0
Central NaN
# With observed=True -- only shows categories present in data
pivot_observed = df.pivot_table(
values='sales',
index='region',
aggfunc='sum',
observed=True
)
print(pivot_observed)
sales
region
North 111739.0
South 108676.0
East 107823.0
West 112441.0
Starting with pandas 2.2, relying on the default observed=False with categorical data emits a FutureWarning, and in pandas 3.0 the default changes to observed=True. It is best practice to always set observed explicitly when working with categorical data so you are not surprised by empty rows or columns in your pivot tables.
sort parameter
By default, the result is sorted by the index. Set sort=False to preserve the order of first appearance:
pivot_unsorted = df.pivot_table(
values='sales',
index='region',
aggfunc='sum',
sort=False
)
print(pivot_unsorted)
This is useful when your data already has a meaningful order (like chronological time periods) that you want to preserve rather than alphabetical sorting.
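A toy example shows the difference in row order (the data is hypothetical):

```python
import pandas as pd

# 'West' appears before 'East' in the raw data
df = pd.DataFrame({'region': ['West', 'East', 'West'],
                   'sales': [100, 200, 300]})

sorted_pivot = df.pivot_table(values='sales', index='region', aggfunc='sum')
unsorted_pivot = df.pivot_table(values='sales', index='region',
                                aggfunc='sum', sort=False)

print(list(sorted_pivot.index))    # alphabetical order
print(list(unsorted_pivot.index))  # order of first appearance
```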
pivot_table() vs groupby().unstack()
These two approaches often produce identical results, but they differ in syntax, flexibility, and edge-case behavior.
Side-by-side comparison:
# Method 1: pivot_table
result_pivot = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum',
fill_value=0
)
# Method 2: groupby + unstack
result_groupby = (
df.groupby(['region', 'product'])['sales']
.sum()
.unstack(fill_value=0)
)
# They produce the same result
print(result_pivot.equals(result_groupby))
# True
When to use each:
| Feature | pivot_table() | groupby().unstack() |
|---|---|---|
| Readability for cross-tabs | Better -- single function call | Requires chaining |
| Multiple aggfuncs | Built-in (aggfunc=['sum', 'mean']) | Requires .agg() then .unstack() |
| margins/subtotals | Built-in (margins=True) | Must compute manually |
| fill_value | Built-in parameter | Available in .unstack(fill_value=) |
| Performance on large data | Slightly slower | Slightly faster |
| Flexibility for complex agg | Limited to aggfunc parameter | Full .agg() flexibility |
| Chaining with other ops | Harder to chain | Natural in method chains |
Performance differences:
For most datasets, the difference is negligible. On datasets with 10M+ rows, groupby().unstack() can be 10-20% faster because pivot_table() does additional validation and handles edge cases internally. However, pivot_table() gives you margins and multiple aggfuncs for free, which would require additional code with the groupby approach.
Rule of thumb: Use pivot_table() when you want a quick cross-tabulation with subtotals. Use groupby().unstack() when you are building a longer method chain or need maximum performance.
pivot_table() vs crosstab()
pd.crosstab() is a specialized function for frequency tables -- counting how often combinations of categories occur.
# crosstab for frequency counts
ct = pd.crosstab(
df['region'],
df['product'],
margins=True
)
print(ct)
product Laptop Monitor Phone Tablet All
region
East 13 12 14 11 50
North 12 14 11 13 50
South 14 11 13 12 50
West 11 13 12 14 50
All 50 50 50 50 200
The equivalent pivot_table call:
# Same result with pivot_table
pt = df.pivot_table(
index='region',
columns='product',
aggfunc='size',
fill_value=0,
margins=True
)
When to use crosstab:
- Frequency counts are your primary goal
- You want normalized proportions (use normalize=True, normalize='index', or normalize='columns')
- You are working with Series objects rather than columns from a DataFrame
# Normalized crosstab -- shows proportions instead of counts
ct_normalized = pd.crosstab(
df['region'],
df['product'],
normalize='index' # Each row sums to 1.0
).round(3)
print(ct_normalized)
product Laptop Monitor Phone Tablet
region
East 0.260 0.240 0.280 0.220
North 0.240 0.280 0.220 0.260
South 0.280 0.220 0.260 0.240
West 0.220 0.260 0.240 0.280
crosstab() is essentially a convenience wrapper around pivot_table() with aggfunc='count' as the default. Use it when counting is your goal and you want the cleaner syntax.
pivot_table() vs melt() (Reverse Operation)
melt() is the inverse of pivoting. It converts a wide-format DataFrame (like a pivot table) back into a long format.
# Create a pivot table
wide = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum'
)
print("Wide format (pivot table):")
print(wide)
print()
# Melt it back to long format
long = wide.reset_index().melt(
id_vars='region',
value_vars=['Laptop', 'Monitor', 'Phone', 'Tablet'],
var_name='product',
value_name='total_sales'
)
print("Long format (melted):")
print(long)
Wide format (pivot table):
product Laptop Monitor Phone Tablet
region
East 30482.0 25319.0 29877.0 22145.0
North 28190.0 31002.0 27654.0 24893.0
South 26743.0 22187.0 31290.0 28456.0
West 29105.0 27891.0 25433.0 30012.0
Long format (melted):
region product total_sales
0 East Laptop 30482.0
1 North Laptop 28190.0
2 South Laptop 26743.0
3 West Laptop 29105.0
4 East Monitor 25319.0
5 North Monitor 31002.0
6 South Monitor 22187.0
7 West Monitor 27891.0
8 East Phone 29877.0
9 North Phone 27654.0
10 South Phone 31290.0
11 West Phone 25433.0
12 East Tablet 22145.0
13 North Tablet 24893.0
14 South Tablet 28456.0
15 West Tablet 30012.0
Derived columns: you can pivot, compute new columns from the summary, and melt back to long format later if a downstream tool needs it. For example, finding each region's top product:
# Start long -> pivot -> modify -> melt back
pivoted = df.pivot_table(values='sales', index='region', columns='product', aggfunc='sum')
pivoted['top_product'] = pivoted.idxmax(axis=1)
print(pivoted['top_product'])
region
East Laptop
North Monitor
South Phone
West Tablet
Name: top_product, dtype: object
Understanding the pivot-melt relationship is essential for reshaping data between the formats that different tools expect -- databases prefer long format, dashboards and reports often prefer wide format.
Performance on Large Datasets
Pivot tables can become slow on large datasets. Here are benchmarks and optimization strategies.
Benchmarking with 1M+ rows
import time

# Generate a large dataset
np.random.seed(42)
n_large = 2_000_000
large_df = pd.DataFrame({
    'region': np.random.choice(['North', 'South', 'East', 'West'], n_large),
    'product': np.random.choice(['A', 'B', 'C', 'D', 'E'], n_large),
    'channel': np.random.choice(['Online', 'Retail', 'Wholesale'], n_large),
    'sales': np.random.uniform(10, 10000, n_large),
    'quantity': np.random.randint(1, 100, n_large)
})
start = time.time()
pivot_large = large_df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum'
)
elapsed = time.time() - start
print(f"2M rows, simple pivot: {elapsed:.3f}s")
# 2M rows, simple pivot: 0.089s
start = time.time()
pivot_complex = large_df.pivot_table(
values=['sales', 'quantity'],
index=['region', 'channel'],
columns='product',
aggfunc=['sum', 'mean'],
margins=True
)
elapsed = time.time() - start
print(f"2M rows, complex pivot: {elapsed:.3f}s")
# 2M rows, complex pivot: 0.412s
Memory optimization with Category dtype
Converting string columns to category dtype before pivoting can dramatically reduce memory usage and improve speed:
# Before: object dtype
print(f"Memory before: {large_df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
# Convert to category
for col in ['region', 'product', 'channel']:
    large_df[col] = large_df[col].astype('category')
print(f"Memory after: {large_df.memory_usage(deep=True).sum() / 1e6:.1f} MB")
Memory before: 168.0 MB
Memory after: 38.2 MB
That is a 77% reduction in memory usage. The pivot table operation also runs faster because pandas can use integer codes internally instead of comparing strings.
# Benchmark with category dtype
start = time.time()
pivot_cat = large_df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum',
observed=True # Important: always use with categoricals
)
elapsed = time.time() - start
print(f"2M rows with category dtype: {elapsed:.3f}s")
# 2M rows with category dtype: 0.052s (42% faster)
Chunked processing for very large datasets
When your dataset does not fit in memory, process it in chunks:
def chunked_pivot(filepath, chunksize=500_000):
    """Build a pivot table from a CSV file in chunks."""
    result = None
    for chunk in pd.read_csv(filepath, chunksize=chunksize):
        chunk_pivot = chunk.pivot_table(
            values='sales',
            index='region',
            columns='product',
            aggfunc='sum'
        )
        if result is None:
            result = chunk_pivot
        else:
            result = result.add(chunk_pivot, fill_value=0)
    return result
# Usage:
# pivot = chunked_pivot('large_sales_data.csv')
This works for additive aggregations like sum and count. For mean, you would need to track both the sum and count separately, then compute the mean at the end:
def chunked_mean_pivot(filepath, chunksize=500_000):
    """Build a mean pivot table from a CSV file in chunks."""
    total_sum = None
    total_count = None
    for chunk in pd.read_csv(filepath, chunksize=chunksize):
        chunk_sum = chunk.pivot_table(
            values='sales', index='region', columns='product', aggfunc='sum'
        )
        chunk_count = chunk.pivot_table(
            values='sales', index='region', columns='product', aggfunc='count'
        )
        if total_sum is None:
            total_sum = chunk_sum
            total_count = chunk_count
        else:
            total_sum = total_sum.add(chunk_sum, fill_value=0)
            total_count = total_count.add(chunk_count, fill_value=0)
    return total_sum / total_count
Pre-filtering to reduce data volume
Often the biggest performance win is simply reducing the data before pivoting:
# Instead of pivoting all 2M rows then filtering the result...
# Filter FIRST, then pivot
filtered = large_df[large_df['region'].isin(['North', 'South'])]
pivot_filtered = filtered.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum'
)
This can be orders of magnitude faster than pivoting the full dataset and then selecting rows from the result.
Common Errors and Fixes
ValueError: Index contains duplicate entries, cannot reshape
This error comes from pivot(), not pivot_table(). If you see it, you probably used the wrong function:
# This FAILS if there are duplicate region-product combinations
# df.pivot(index='region', columns='product', values='sales')
# ValueError: Index contains duplicate entries, cannot reshape
# This WORKS because pivot_table aggregates duplicates
df.pivot_table(index='region', columns='product', values='sales', aggfunc='sum')
Fix: Switch from pivot() to pivot_table() and specify an aggfunc.
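A minimal reproduction of the error and its fix, using toy data:

```python
import pandas as pd

# Duplicate region-product pairs: 'North'/'Laptop' appears twice
df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'product': ['Laptop', 'Laptop', 'Phone'],
    'sales': [100, 200, 300],
})

# pivot() cannot decide what to put in the North/Laptop cell
try:
    df.pivot(index='region', columns='product', values='sales')
except ValueError as e:
    print(f"pivot() failed: {e}")

# pivot_table() aggregates the duplicates instead
result = df.pivot_table(values='sales', index='region',
                        columns='product', aggfunc='sum')
print(result)  # North/Laptop is 300
```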
NaN values appearing unexpectedly
NaN appears when a particular index-column combination has no data:
# Some region-product combinations might have no sales
# Use fill_value to handle this
pivot = df.pivot_table(
values='sales',
index='salesperson',
columns='product',
aggfunc='sum',
fill_value=0 # Replace NaN with 0
)
NaN can also appear when your source data contains NaN values. Clean the data first or use dropna in your aggregation:
# Remove rows with NaN sales before pivoting
clean_df = df.dropna(subset=['sales'])
pivot_clean = clean_df.pivot_table(values='sales', index='region', aggfunc='sum')
TypeError: Could not convert to numeric
This happens when your value column contains mixed types -- for example, strings mixed with numbers:
# Diagnose the issue
print(df['sales'].dtype)
print(df['sales'].apply(type).value_counts())
# Fix: convert to numeric, coercing errors to NaN
df['sales'] = pd.to_numeric(df['sales'], errors='coerce')
Common causes include CSV files where some rows have text like "N/A" or "--" in numeric columns. Always validate your data types before pivoting.
Memory errors with large datasets
If you get a MemoryError, try these approaches in order:
- Use category dtype for index and column fields (reduces memory 50-80%)
- Filter data first -- only include rows you actually need
- Reduce granularity -- do you really need daily data, or would monthly work?
- Use chunked processing as shown in the performance section above
- Increase available memory -- or switch to a tool designed for out-of-core computation like Dask
# Dask pivot table for datasets that don't fit in memory
import dask.dataframe as dd

ddf = dd.read_csv('huge_file.csv')
# Note: Dask pivot_table has limited functionality compared to pandas
result = ddf.groupby(['region', 'product'])['sales'].sum().compute().unstack()
DataError: No numeric types to aggregate
This occurs when pivot_table() cannot find numeric columns to aggregate:
# If your values column is stored as string
df['sales'] = df['sales'].astype(str) # Simulating the problem
# pivot_table will fail or produce unexpected results
# Fix: ensure your value columns are numeric
df['sales'] = pd.to_numeric(df['sales'])
Best Practices
1. Choose the right aggregation function
The default mean is rarely what business users expect. Most of the time, they want sum or count. Always set aggfunc explicitly:
# Explicit is better than implicit
pivot = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum' # Always specify this
)
2. Use observed=True for categoricals
If any of your index or column fields are Categorical dtype, always pass observed=True. Otherwise your pivot table will include rows and columns for categories that do not exist in the data, creating confusing NaN-filled rows:
pivot = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum',
observed=True # Critical for categorical data
)
3. Convert strings to category before pivoting
For any column that goes into index or columns, converting to category dtype improves both memory usage and performance:
for col in ['region', 'product', 'quarter']:
    df[col] = df[col].astype('category')
4. Name your pivot tables meaningfully
Give your pivot table variables descriptive names that indicate what they contain:
# Bad
pt = df.pivot_table(values='sales', index='region', columns='product', aggfunc='sum')
# Good
revenue_by_region_product = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum'
)
5. Chain .style for presentation
When presenting pivot tables in Jupyter notebooks, use the .style accessor for formatting:
pivot = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum'
)
styled = (
pivot.style
.format('${:,.0f}')
.background_gradient(cmap='YlOrRd')
.set_caption('Total Sales by Region and Product')
)
styled
This renders a color-coded, properly formatted table in Jupyter -- much more readable than raw numbers.
6. Export to Excel with formatting
Pivot tables often end up in Excel. Use to_excel() with openpyxl for formatted output:
pivot = df.pivot_table(
values='sales',
index='region',
columns='product',
aggfunc='sum',
margins=True,
margins_name='Total'
)
# Basic export
pivot.to_excel('sales_pivot.xlsx', sheet_name='Revenue Summary')
# With formatting using openpyxl
from openpyxl.utils.dataframe import dataframe_to_rows
from openpyxl import Workbook
from openpyxl.styles import Font, numbers
wb = Workbook()
ws = wb.active
ws.title = 'Revenue Summary'
for r_idx, row in enumerate(dataframe_to_rows(pivot, index=True, header=True), 1):
    for c_idx, value in enumerate(row, 1):
        cell = ws.cell(row=r_idx, column=c_idx, value=value)
        if isinstance(value, (int, float)) and r_idx > 2:
            cell.number_format = '$#,##0'
wb.save('sales_pivot_formatted.xlsx')
7. Validate your results
Always sanity-check your pivot table against the source data:
pivot = df.pivot_table(values='sales', index='region', columns='product', aggfunc='sum')
# Check that the grand total matches
assert abs(pivot.sum().sum() - df['sales'].sum()) < 0.01, "Totals don't match!"
# Check row totals against groupby
region_totals = df.groupby('region')['sales'].sum()
pivot_row_totals = pivot.sum(axis=1)
pd.testing.assert_series_equal(region_totals, pivot_row_totals, check_names=False)
These assertions catch bugs early, especially when you have filtering or data cleaning steps before the pivot.
Putting It All Together
Here is a complete example that combines everything we have covered -- creating a polished sales dashboard from raw data:
# Load and prepare data
import numpy as np
import pandas as pd

np.random.seed(42)
n = 10000
df = pd.DataFrame({
    'date': pd.date_range('2025-01-01', periods=n, freq='h'),
    'region': np.random.choice(['North', 'South', 'East', 'West'], n),
    'product': np.random.choice(['Laptop', 'Phone', 'Tablet', 'Monitor'], n),
    'sales': np.random.uniform(100, 5000, n).round(2),
    'quantity': np.random.randint(1, 20, n),
})
# Convert to category for performance
for col in ['region', 'product']:
    df[col] = df[col].astype('category')
df['month'] = df['date'].dt.to_period('M')
# 1. Revenue summary with margins
revenue = df.pivot_table(
    values='sales',
    index='region',
    columns='product',
    aggfunc='sum',
    margins=True,
    margins_name='Total',
    observed=True
).round(2)
print("=== Revenue by Region & Product ===")
print(revenue)
print()
# 2. Monthly trends
monthly = df.pivot_table(
    values='sales',
    index='month',
    columns='region',
    aggfunc='sum',
    observed=True
).round(2)
print("=== Monthly Revenue Trend ===")
print(monthly.head())
print()
# 3. Multi-metric summary
summary = df.pivot_table(
    index='region',
    aggfunc={
        'sales': ['sum', 'mean', 'count'],
        'quantity': ['sum', 'mean']
    },
    observed=True
).round(2)
print("=== Multi-Metric Summary ===")
print(summary)
print()
# 4. Validate
total_sales = df['sales'].sum()
pivot_total = revenue.loc['Total', 'Total']
print(f"Source total: ${total_sales:,.2f}")
print(f"Pivot total: ${pivot_total:,.2f}")
print(f"Match: {abs(total_sales - pivot_total) < 0.01}")
This pattern -- prepare, convert dtypes, pivot with explicit parameters, validate -- is the foundation of reliable data analysis with pandas pivot tables. Whether you are building a quick summary in a Jupyter notebook or an automated reporting pipeline, these practices will serve you well.
Pandas pivot tables are one of the most versatile tools in the Python data analysis toolkit. Master the core parameters -- values, index, columns, aggfunc, margins, and fill_value -- and you can reshape virtually any tabular dataset into the summary view your analysis requires.
Frequently Asked Questions
What is the difference between pivot() and pivot_table() in pandas?
pivot() is a simple reshape operation that cannot handle duplicate entries -- it raises an error if duplicate index/column combinations exist. pivot_table() is more powerful: it aggregates duplicate values using functions like mean, sum, or count. Use pivot() for simple reshaping and pivot_table() when your data has duplicates or you need aggregation.
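A minimal illustration of the difference, using a small DataFrame invented for this example -- the ('North', 'Laptop') combination appears twice, which pivot() rejects but pivot_table() aggregates:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'product': ['Laptop', 'Laptop', 'Phone'],
    'sales': [100, 200, 300],
})

# pivot() raises because ('North', 'Laptop') appears twice
try:
    df.pivot(index='region', columns='product', values='sales')
except ValueError as err:
    print(f"pivot() raised: {err}")

# pivot_table() sums the duplicates instead of raising
pt = df.pivot_table(index='region', columns='product',
                    values='sales', aggfunc='sum')
print(pt)
```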
How do I handle missing values in a pandas pivot table?
Use the fill_value parameter to replace NaN values: df.pivot_table(values='sales', index='region', columns='quarter', fill_value=0). You can also chain .fillna() after creating the pivot table for more complex filling strategies.
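A quick sketch with made-up data where one region has no Q2 rows, so its cell would otherwise be NaN:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'quarter': ['Q1', 'Q2', 'Q1'],
    'sales': [100, 150, 200],
})

# Without fill_value=0, the South/Q2 cell would be NaN
pt = df.pivot_table(values='sales', index='region',
                    columns='quarter', aggfunc='sum', fill_value=0)
print(pt)
```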
What is the difference between pivot_table() and groupby() in pandas?
pivot_table() creates a cross-tabulated view with categories as both rows and columns. groupby() produces a single-index result with aggregate values. pivot_table(index='region', columns='product', values='sales', aggfunc='sum') is equivalent to groupby(['region','product'])['sales'].sum().unstack(), but more concise.
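You can verify the equivalence yourself on a toy dataset (invented for this example) -- both approaches produce the same cross-tabulated result:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South', 'South'],
    'product': ['Laptop', 'Phone', 'Laptop', 'Phone'],
    'sales': [100, 200, 300, 400],
})

via_pivot = df.pivot_table(index='region', columns='product',
                           values='sales', aggfunc='sum')
via_groupby = df.groupby(['region', 'product'])['sales'].sum().unstack()

# Both routes yield the same shape and elements
print(via_pivot.equals(via_groupby))
```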
Can I use multiple aggregation functions in a pivot table?
Yes. Pass a list of functions: df.pivot_table(values='sales', index='region', aggfunc=['sum', 'mean', 'count']). You can also pass a dictionary to apply different functions to different value columns: aggfunc={'sales': 'sum', 'quantity': 'mean'}.
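Both forms on a small invented dataset -- note that the list form adds an extra column level, one per function:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'sales': [100, 200, 300],
    'quantity': [1, 2, 3],
})

# List of functions: columns become a MultiIndex of (function, column)
multi = df.pivot_table(values='sales', index='region',
                       aggfunc=['sum', 'mean', 'count'])

# Dict: a different function for each value column
mixed = df.pivot_table(index='region',
                       aggfunc={'sales': 'sum', 'quantity': 'mean'})
print(multi)
print(mixed)
```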
How do I add subtotals to a pandas pivot table?
Use the margins parameter: df.pivot_table(values='sales', index='region', columns='product', aggfunc='sum', margins=True, margins_name='Total'). This adds a Total row and column showing the aggregate of all values. Note that margins gives grand totals only; for per-group subtotals you need to compute them separately (for example with groupby) and concatenate the results.
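On a small invented dataset, the margins row and column look like this -- the bottom-right cell is the grand total:

```python
import pandas as pd

df = pd.DataFrame({
    'region': ['North', 'North', 'South'],
    'product': ['Laptop', 'Phone', 'Laptop'],
    'sales': [100, 200, 300],
})

# margins=True appends a 'Total' row and column of aggregates
pt = df.pivot_table(values='sales', index='region', columns='product',
                    aggfunc='sum', margins=True, margins_name='Total')
print(pt)
```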
Originally published at aicodereview.cc