NIRANJAN LAMICHHANE

My Data Science Journey: Restaurant Tips Analysis

Project: Exploratory Data Analysis on Restaurant Tips Dataset
Duration: Full EDA Process
Dataset: 244 restaurant transactions (243 after cleaning), 7 variables
Status: ✅ COMPLETED


📊 PROJECT OVERVIEW

Dataset Information

  • Source: Restaurant tips dataset
  • Initial Size: 244 rows × 7 columns
  • Final Size: 243 rows × 7 columns (after cleaning)
  • Variables:
    • Numerical: total_bill, tip, size
    • Categorical: sex, smoker, day, time

Project Goal

Understand what factors influence tipping behavior in restaurants through comprehensive exploratory data analysis.


🧹 PHASE 1: DATA CLEANING (Investigation 1.3)

1.1 Missing Values Investigation

Hypothesis: "The Null Hypothesis" - Why might data be missing?

What I Did:

# Checked for missing values
data.isnull().sum()
data.isnull().any()
(data.isnull().sum() / len(data)) * 100  # Percentage

Results:

  • 0 missing values in all columns
  • This suggests the data was collected carefully or cleaned before release
  • No imputation or removal needed

Learning Moment: Not all datasets have missing data, but always check!


1.2 Duplicate Detection

Hypothesis: Could identical transactions exist legitimately?

What I Did:

# Found duplicates
num_duplicates = data.duplicated().sum()
duplicates = data[data.duplicated(keep=False)]

# Removed them
data_clean = data.drop_duplicates()

Results:

  • Found 1 duplicate row
    • Bill: $13.00, Tip: $2.00, Female, Smoker, Thursday, Lunch, Party of 2
    • Row 198 and Row 202 were IDENTICAL
  • Decision: Removed as likely data entry error
  • Result: 244 rows → 243 rows

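Key Insight: The dataset has no timestamp or table ID, so an exact repeat cannot be proven to be an error. With all seven fields matching, though, a data entry slip is more likely than two genuine transactions.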
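Why does one duplicate correspond to two row numbers? A minimal sketch on made-up data (the `toy` frame and its values are hypothetical, not the tips dataset) shows how the `keep` argument changes what `duplicated()` flags:

```python
import pandas as pd

# Toy data (hypothetical values): rows 1 and 3 are exact copies of each other.
toy = pd.DataFrame({
    "total_bill": [16.99, 13.00, 21.01, 13.00],
    "tip":        [1.01,  2.00,  3.50,  2.00],
    "size":       [2,     2,     3,     2],
})

# Default keep='first': only the later copy is flagged, so the count is 1.
n_flagged = int(toy.duplicated().sum())

# keep=False flags every member of a duplicate group, so both rows show up.
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first copy and removes the later one.
toy_clean = toy.drop_duplicates()

print(n_flagged)
print(list(all_copies.index))
print(toy_clean.shape)
```

Counting with the default and inspecting with `keep=False` is a convenient pairing: the first gives the number of rows that will be dropped, the second shows every row involved.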


1.3 Outlier Investigation

Hypothesis: "The Outlier Tribunal" - Are extreme values errors or legitimate?

What I Did:

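# Created boxplots (one figure each, so the two boxes are not drawn on top of each other)
plt.boxplot(data['tip'])
plt.title('Tip ($)')
plt.show()

plt.boxplot(data['total_bill'])
plt.title('Total Bill ($)')
plt.show()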

# Calculated IQR boundaries
Q1 = data['tip'].quantile(0.25)
Q3 = data['tip'].quantile(0.75)
IQR = Q3 - Q1
upper_boundary = Q3 + (1.5 * IQR)

# Found outliers
outliers = data[data['tip'] > upper_boundary]

Mathematical Formula:

IQR = Q3 - Q1
Upper Boundary = Q3 + (1.5 × IQR)
Lower Boundary = Q1 - (1.5 × IQR)

For Tips:
Q1 = $2.00
Q3 = $3.56
IQR = $1.56
Upper Boundary = $5.90

Outliers Found:
| Bill | Tip | Tip % | Verdict |
|---------|--------|-------|-----------------|
| $50.81 | $10.00 | 19.7% | ✅ Legitimate |
| $48.33 | $9.00 | 18.6% | ✅ Legitimate |
| $39.42 | $7.58 | 19.2% | ✅ Legitimate |
| $48.27 | $6.73 | 13.9% | ✅ Legitimate |

Decision: Kept all outliers - they are large bills with reasonable tip percentages (13.9% to 19.7%)

Key Insight: Outliers aren't always errors! Verify with context (tip percentage in this case).
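The IQR rule and the tip-percentage check can be combined into one step. The sketch below is one way to do it, using a hypothetical helper `iqr_bounds` and made-up values rather than the real dataset:

```python
import pandas as pd

def iqr_bounds(values, k=1.5):
    """Return (lower, upper) fences: Q1 - k*IQR and Q3 + k*IQR."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return float(q1 - k * iqr), float(q3 + k * iqr)

# Toy data (hypothetical values), not the real tips dataset.
toy = pd.DataFrame({
    "total_bill": [10.0, 12.0, 15.0, 18.0, 20.0, 22.0, 25.0, 50.0],
    "tip":        [1.5,  2.0,  2.5,  3.0,  3.0,  3.5,  4.0,  10.0],
})

# Compute the context column BEFORE filtering, so the filtered rows carry it.
toy["tip_pct"] = toy["tip"] / toy["total_bill"] * 100

lower, upper = iqr_bounds(toy["tip"])
flagged = toy[(toy["tip"] < lower) | (toy["tip"] > upper)]

print(f"fences: {lower:.2f} to {upper:.2f}")
print(flagged[["total_bill", "tip", "tip_pct"]].round(1))
```

A flagged row whose `tip_pct` sits in the usual range is a large bill rather than a suspicious entry, which is the reasoning used for the verdicts in the table above.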


🔬 PHASE 2: BIVARIATE ANALYSIS (Investigation 2.2)

Overview: Testing 7 Relationships

For each relationship, I followed the scientific method:

  1. Hypothesis - Make a prediction
  2. Visualization - Create appropriate chart
  3. Analysis - Interpret the pattern
  4. Conclusion - Accept or reject hypothesis

2.1 Relationship #1: Total Bill → Tip

My Hypothesis:

  • "As total_bill increases, tip will increase WEAKLY"
  • Reasoning: "Tip is 'keep the change' - not percentage based"
  • Confidence: MEDIUM
  • Expected: Weak/no relationship

What I Did:

plt.scatter(data_clean['total_bill'], data_clean['tip'])
plt.xlabel('Total Bill ($)')
plt.ylabel('Tip ($)')
plt.title('Relationship Between Total Bill and Tip Amount')
plt.show()

Results:

  • Pattern: Strong upward linear trend
  • Correlation: r = 0.67 (Strong positive)
  • Points cluster around an imaginary upward-sloping line, with noticeable scatter

Hypothesis Verdict: REJECTED

What I Learned:

  • My hypothesis was WRONG - and that's okay!
  • Reality: People tend to tip a percentage of the bill, roughly 15-20% (not "keep the change")
  • Mechanism: Tip ≈ Bill × tip rate, so bigger bills produce bigger tips; r = 0.67 rather than 1.0 means the rate varies from table to table
  • Key insight: "Learning happens with mistakes" - being wrong is part of science!

Business Insight: Higher bills = higher tips. Restaurants should encourage higher spending.
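The scatter plot shows the trend, but the numbers behind it (r, r², and the fitted line) have to be computed separately. A minimal sketch, assuming a frame shaped like `data_clean`; the `toy` values here are hypothetical:

```python
import numpy as np
import pandas as pd

# Toy data (hypothetical values), standing in for data_clean.
toy = pd.DataFrame({
    "total_bill": [10.0, 15.0, 20.0, 25.0, 30.0, 40.0],
    "tip":        [2.0,  2.5,  3.0,  4.5,  4.0,  6.0],
})

# Pearson correlation, and the share of tip variance it accounts for.
r = toy["total_bill"].corr(toy["tip"])
r_squared = r ** 2

# Least-squares line: tip ≈ slope * total_bill + intercept.
slope, intercept = np.polyfit(toy["total_bill"], toy["tip"], deg=1)

print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")
print(f"tip ≈ {slope:.3f} * bill + {intercept:.2f}")
```

The intercept is worth a look: if it is clearly above zero, tips are not purely proportional to the bill, and a 10% larger bill then predicts a somewhat smaller than 10% increase in tip.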


2.2 Relationship #2: Party Size → Total Bill

My Hypothesis:

  • "As party size increases, total_bill increases STRONGLY"
  • Reasoning: "More people = more food (obvious!)"
  • Confidence: HIGH

What I Did:

plt.scatter(data_clean['size'], data_clean['total_bill'])

Results:

  • Pattern: Grouped upward trend (vertical columns)
  • Correlation: r = 0.60 (Medium-strong positive)
  • Party size = discrete (1,2,3,4,5,6), not continuous
  • Size 2 most common, with widest bill range ($10-$40)

Hypothesis Verdict: CONFIRMED (but weaker than expected)

Key Insight:

  • "Party size predicts bill, but doesn't determine it completely"
  • A couple can outspend a group of 4 depending on what they order
  • What people ORDER matters more than HOW MANY people

2.3 Relationship #3: Party Size → Tip

My Hypothesis:

  • "As party size increases, tip increases STRONGLY"
  • Reasoning: "More people → bigger bill → percentage-based tip → more tip"
  • Confidence: HIGH

Results:

  • Pattern: Upward trend from size 1-4, then FLATTENS at 5-6
  • Correlation: r = 0.49 (Medium-weak)
  • Non-linear relationship!

Hypothesis Verdict: ⚠️ PARTIALLY CORRECT

Surprising Discovery:

  • Tips increase up to party size 4
  • Tips PLATEAU at sizes 5-6 (don't increase further!)

Possible Explanations:

  1. Automatic gratuity - Restaurants add mandatory 15-18% for large parties
  2. Social loafing - "Someone else will tip well, so I don't need to"
  3. Different occasions - Large parties = kids/families (tip standard)
  4. Splitting complications - Harder to calculate when splitting 6 ways

Key Insight: Large parties tip differently than expected - real behavioral economics!
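Before reading much into a plateau, it helps to see how many tables sit behind each party size, since averages over a handful of rows are noisy. A sketch on made-up data (the `toy` values are hypothetical):

```python
import pandas as pd

# Toy data (hypothetical values): large parties are deliberately rare here.
toy = pd.DataFrame({
    "size": [1,   2,   2,   2,   3,   3,   4,   4,   5,   6],
    "tip":  [1.5, 2.0, 2.5, 3.0, 3.0, 3.5, 4.0, 4.5, 4.0, 4.5],
})

# Count, mean and median tip for each party size.
by_size = toy.groupby("size")["tip"].agg(["count", "mean", "median"])
print(by_size)
```

If sizes 5 and 6 turn out to have only a few rows each, the flattening may be a small-sample effect as much as a behavioral one, which is one more candidate to add to the explanations listed above.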


2.4 Relationship #4: Day of Week → Tip

My Hypothesis:

  • Highest: Sunday (weekend celebration mood)
  • Lowest: Wednesday (people just filling stomach)
  • Expected difference: MEDIUM
  • Confidence: MEDIUM

What I Did:

sns.boxplot(x='day', y='tip', data=data_clean,
            order=['Sun','Mon','Tue','Wed','Thur','Fri','Sat'])

Results:
| Day | Avg Tip | Verdict |
|-----------|---------|----------------|
| Saturday | $3.00 | 🏆 Highest |
| Sunday | $2.90 | High |
| Mon/Tue/Wed | $2.25 | 🔻 Lowest (tie) |

Hypothesis Verdict: ⚠️ PARTIALLY WRONG

What I Got Wrong:

  • Predicted Sunday highest → Actually Saturday highest
  • Predicted Wednesday lowest → Correct (tied with Mon/Tue)

Key Observations:

  • Saturday has most high-tip outliers (special occasions, date nights)
  • Sunday has LARGEST box (most variation) - diverse crowd
  • Weekdays cluster together (consistent lower tipping)

Key Insight:
"Sunday = diverse people = diverse tipping = large variation in tips"
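A boxplot is good for spread, but the per-day averages in a table like the one above are easier to verify with a direct calculation. The sketch below also checks which day labels actually occur, because a label passed to `order=` that never appears in the data just leaves an empty slot. The `toy` frame and its labels are assumptions for illustration:

```python
import pandas as pd

# Toy data (hypothetical values); day labels follow the 'Thur' spelling used above.
toy = pd.DataFrame({
    "day": ["Thur", "Fri", "Sat", "Sat", "Sun", "Sun", "Thur"],
    "tip": [2.0,    2.5,   3.0,   3.5,   3.0,   4.0,   2.5],
})

# Keep calendar order, but only for days that are present in the data.
week = ["Mon", "Tue", "Wed", "Thur", "Fri", "Sat", "Sun"]
seen = set(toy["day"])
present = [d for d in week if d in seen]

# Count, mean and median tip per day, in calendar order.
day_stats = (
    toy.groupby("day")["tip"]
       .agg(["count", "mean", "median"])
       .reindex(present)
)

print(present)
print(day_stats)
```

The `present` list can be passed straight to `order=` in `sns.boxplot`, so the chart and the table describe the same set of days.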


2.5 Relationship #5: Time (Lunch vs Dinner) → Tip

My Hypothesis:

  • Dinner will have higher tips
  • Reasoning: "Night time = people more generous; lunch = people in rush"
  • Confidence: MEDIUM

Results:
| Time | Avg Tip | Difference |
|--------|---------|------------|
| Dinner | $3.00 | — |
| Lunch | $2.20 | -$0.80 |

Hypothesis Verdict: CONFIRMED!

Key Insight:

  • $0.80 difference - this is the BIGGEST categorical effect!
  • Time of day is the STRONGEST categorical predictor
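  • Possible explanation (not tested here): lunch customers are rushed and may be less satisfied with service, while dinner is relaxed and celebratory
  • Dinner bills may also simply be larger, so part of the gap could come from bill size rather than generosity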

Business Recommendation: Prioritize dinner service quality!
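A dollar gap between lunch and dinner can come from bigger bills, more generous tipping, or both. Comparing tip percentage alongside tip amount helps separate the two. A sketch with hypothetical values, chosen so the two measures disagree:

```python
import pandas as pd

# Toy data (hypothetical values), not the real tips dataset.
toy = pd.DataFrame({
    "time":       ["Lunch", "Lunch", "Lunch", "Dinner", "Dinner", "Dinner"],
    "total_bill": [10.0,    12.0,    14.0,    20.0,     25.0,     30.0],
    "tip":        [1.8,     2.1,     2.4,     3.2,      3.8,      4.5],
})
toy["tip_pct"] = toy["tip"] / toy["total_bill"] * 100

# Average bill, tip, and tip percentage for each time of day.
by_time = toy.groupby("time")[["total_bill", "tip", "tip_pct"]].mean()
print(by_time.round(2))
```

In this toy example dinner wins on dollars but lunch wins on percentage, which is exactly the situation where "dinner customers are more generous" would be the wrong reading.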


2.6 Relationship #6: Sex (Male vs Female) → Tip

My Hypothesis:

  • Males will tip MORE
  • Reasoning: "Female waitresses + male customers trying to impress"
  • Confidence: MEDIUM

Results:
| Sex | Avg Tip | Difference |
|--------|---------|------------|
| Female | $3.20 | — |
| Male | $3.00 | -$0.20 |

Hypothesis Verdict: REJECTED!

What I Got Wrong:

  • Females actually tip SLIGHTLY more (or it's basically equal)
  • The difference is minimal ($0.20)
  • Sex is NOT a strong predictor

Key Insight: Gender stereotypes about tipping don't hold up in this dataset!


2.7 Relationship #7: Smoker vs Non-Smoker → Tip

My Hypothesis:

  • Non-smokers will tip MORE
  • Reasoning: "Smokers save money for cigarettes instead of tipping"
  • Confidence: LOW

Results:
| Smoker Status | Avg Tip | Difference |
|---------------|---------|------------|
| Smokers | $3.00 | — |
| Non-smokers | $2.80 | -$0.20 |

Hypothesis Verdict: REJECTED!

Honest Reflection: "Cannot figure out why" - and that's okay!

Possible Explanations:

  • Smokers sit outside/at bar (different atmosphere?)
  • Correlation, not causation (maybe age/demographic differences)
  • Small difference ($0.20) might be random chance
  • Need more data to understand

Key Insight: Not every pattern has an obvious explanation - intellectual honesty matters!
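Whether a $0.20 gap "might be random chance" can be checked rather than guessed. One option is a permutation test: shuffle the group labels many times and see how often a gap at least as large appears. This is a sketch, not something from the original analysis; the function name `permutation_p_value` and the sample values are hypothetical.

```python
import numpy as np

def permutation_p_value(a, b, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(n_perm):
        # Reassign group membership at random, keeping group sizes fixed.
        shuffled = rng.permutation(pooled)
        diff = abs(shuffled[:a.size].mean() - shuffled[a.size:].mean())
        if diff >= observed:
            hits += 1
    # Add-one correction keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_perm + 1)

# Toy tips (hypothetical values) for two groups.
smokers = [3.0, 2.5, 4.0, 3.5, 2.0, 3.0]
non_smokers = [2.8, 2.4, 3.9, 3.2, 2.1, 2.6]

p = permutation_p_value(smokers, non_smokers)
print(f"p ≈ {p:.3f}")
```

A large p-value means shuffled labels produce gaps this size routinely, so the difference would not be worth explaining. With the real data, the inputs would be something like `data_clean.loc[data_clean['smoker'] == 'Yes', 'tip']`, assuming the column uses 'Yes'/'No' labels.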


📈 PHASE 3: CORRELATION ANALYSIS

What I Did:

# Correlation matrix
correlation_matrix = data_clean[['total_bill', 'tip', 'size']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')

Results:

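| Pair | Correlation | Strength | Interpretation |
|-------------------|-------------|---------------|-------------------------|
| total_bill ↔ tip | 0.67 | Strong | 🏆 Strongest predictor |
| size ↔ total_bill | 0.60 | Medium-Strong | More people = more food |
| size ↔ tip | 0.49 | Medium-Weak | Non-linear (plateaus) |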

Key Insight:
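Tips scale roughly as a percentage of the total bill, which is consistent with the 0.67 correlation. If the percentage were exactly fixed, r would be 1.0; the gap reflects how much tip rates vary between tables.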


🎨 PHASE 4: PAIRPLOT (Visual Summary)

What I Did:

sns.pairplot(data_clean, 
             vars=['total_bill', 'tip', 'size'],
             hue='time',  # Color by lunch/dinner
             diag_kind='hist')

Observations:

From diagonal (distributions):

  • Most common tip: ~$2-3
  • Most common bill: ~$15-20
  • Most common party size: 2 people

From scatter plots:

  • total_bill vs tip: Clear upward trend (confirms r=0.67)
  • size vs others: Grouped patterns (discrete variable)
  • Lunch (blue) vs Dinner (orange): Overlap mostly, but dinner shifts slightly higher

Overall Impression: Relationships are SOMEWHAT CLEAR - not perfect, but strong enough to be meaningful


🎯 KEY FINDINGS SUMMARY

Strongest Predictors (Ranked):

  1. Total Bill (r=0.67) 🥇

    • Explains ~45% of tip variation (r² = 0.67² = 0.45)
    • Clear linear relationship
    • Percentage-based tipping (15-20%)
  2. Time of Day ($0.80 difference) 🥈

    • Dinner tips $0.80 more than lunch
    • Strongest categorical effect
    • Reflects rushed vs relaxed dining
  3. Party Size (r=0.49 with tip) 🥉

    • Medium effect, but NON-LINEAR
    • Plateaus at size 5-6
    • Different behavior for large groups
  4. Day of Week ($0.75 difference)

    • Saturday highest ($3.00)
    • Weekdays lowest (~$2.25)
    • Weekend vs weekday effect
  5. Sex ($0.20 difference) - WEAK

    • Minimal difference
    • Nearly equal tipping
  6. Smoker Status ($0.20 difference) - WEAK

    • Minimal difference
    • Unexplained pattern

💡 BUSINESS RECOMMENDATIONS

Based on data analysis, restaurant owners should:

1. FOCUS ON INCREASING BILL AMOUNT 🎯

Why: Strongest correlation (0.67) - higher bills directly lead to higher tips

Actions:

  • Upsell appetizers, drinks, desserts
  • Create combo deals that increase bill
  • Train servers on suggestive selling
  • Offer premium menu items

Expected Impact: If tips scale proportionally with the bill, a 10% increase in average bill → ~10% increase in tips


2. PRIORITIZE DINNER SERVICE 🌙

Why: Dinner tips $0.80 (36%) more than lunch

Actions:

  • Allocate best servers to dinner shift
  • Focus marketing on dinner hours
  • Create special dinner ambiance
  • Dinner-specific promotions

Expected Impact: Shift focus to higher-margin time period


3. OPTIMIZE FOR PARTY SIZES 2-4 👥

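Why: Tips rise with party size up to 4 and then plateau, so these sizes likely give the best tip-to-effort ratio (server effort was not measured in this dataset)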

Actions:

  • Table arrangements favor 2-4 person parties
  • Special deals for couples/small groups
  • Don't overinvest in large party accommodations (tips plateau)

Expected Impact: Maximize tips per table/server time


4. WEEKEND FOCUS 📅

Why: Saturday/Sunday have higher tips

Actions:

  • Premium staffing on weekends
  • Weekend specials/events
  • Higher-end menu items on weekends

5. DON'T DISCRIMINATE BY SEX/SMOKER ⚖️

Why: These factors have minimal effect

Insight: Treat all customers equally - demographics don't significantly predict tipping


🧠 PERSONAL LEARNING JOURNEY

What I Learned About Data Science:

1. The Scientific Method Works!

  • Make hypothesis → Test → Analyze → Conclude
  • Being wrong is GOOD - that's how we learn!
  • Quote: "Learning happens with mistakes"

2. Hypotheses Can Be Wrong

My Wrong Predictions:

  • ❌ Thought tipping was "keep the change" → Actually percentage-based
  • ❌ Thought Sunday would have highest tips → Actually Saturday
  • ❌ Thought males tip more → Actually nearly equal
  • ❌ Thought smokers tip less → Actually slightly more

Lesson: Don't trust assumptions - test with data!

3. Correlation ≠ Causation

  • Smokers tip more, but WHY?
  • Could be confounding variables (age, location, etc.)
  • Need more data to understand mechanisms

4. Context Matters

  • Outliers aren't always errors
  • $10 tip on $50 bill = 20% (normal!)
  • Always calculate percentages/ratios for context

5. Data Quality First

  • Clean data = reliable analysis
  • Check for: missing values, duplicates, outliers
  • This dataset was excellent (0 missing!)

6. Visualization is Powerful

  • Scatter plots → see relationships
  • Box plots → compare groups
  • Correlation matrix → see everything at once
  • Pairplot → ultimate summary

7. Different Charts for Different Data

  • Numerical vs Numerical → Scatter plot
  • Categorical vs Numerical → Box plot / Bar chart
  • All at once → Pairplot, Correlation matrix

What I Learned About Python/Tools:

Python Libraries:

import pandas as pd           # Data manipulation
import numpy as np            # Numerical operations
import matplotlib.pyplot as plt  # Basic plotting
import seaborn as sns         # Statistical plotting

Key Functions Mastered:

Pandas:

data.head()              # First 5 rows
data.shape               # Dimensions
data.describe()          # Statistics
data.isnull().sum()      # Missing values
data.duplicated().sum()  # Duplicates
data.drop_duplicates()   # Remove duplicates
data['column']           # Select column
data[condition]          # Filter rows
data.groupby('day')['tip'].mean()  # Group and aggregate
data.corr(numeric_only=True)       # Correlation matrix (numeric columns only)

Matplotlib:

plt.scatter(x, y)        # Scatter plot
plt.xlabel('label')      # X-axis label
plt.ylabel('label')      # Y-axis label
plt.title('title')       # Title
plt.grid()               # Grid lines
plt.show()               # Display plot

Seaborn:

sns.boxplot(x='category', y='number', data=df)
sns.heatmap(corr_matrix, annot=True)
sns.pairplot(df, vars=['col1','col2'], hue='category')

Important Concepts:

1. Order Matters!

# WRONG: the filtered copy is made before the new column exists
outliers = data[data['tip'] > 5]
data['tip_pct'] = data['tip'] / data['total_bill']
print(outliers['tip_pct'])  # KeyError: 'tip_pct'

# RIGHT: add the column first, then filter
data['tip_pct'] = data['tip'] / data['total_bill']
outliers = data[data['tip'] > 5]
print(outliers['tip_pct'])  # WORKS!

2. Matplotlib vs Seaborn:

  • matplotlib = Low-level, flexible, more code
  • seaborn = High-level, easy, pretty defaults
  • Use both together!

3. Data Types Matter:

  • float64 / int64 = Can do math
  • object = Text, can't do math
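Checking and fixing data types takes only a few lines. The sketch below uses a made-up frame in which a numeric column was accidentally read as text; the values are hypothetical:

```python
import pandas as pd

# Toy data (hypothetical values): 'tip' holds numbers stored as text.
toy = pd.DataFrame({
    "total_bill": [16.99, 10.34, 21.01],
    "tip":        ["1.01", "1.66", "3.50"],
    "sex":        ["Female", "Male", "Male"],
})
print(toy.dtypes)  # text columns show up as object (or a string dtype in newer pandas)

# Convert text that holds numbers; errors='coerce' turns unparseable values into NaN.
toy["tip"] = pd.to_numeric(toy["tip"], errors="coerce")

# Keep only numeric columns before computing correlations.
numeric = toy.select_dtypes(include="number")
print(numeric.corr())
```

Selecting numeric columns first avoids errors from text columns such as `sex` when building a correlation matrix.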

Skills I Developed:

Technical Skills:

  • Data cleaning and preprocessing
  • Exploratory data analysis
  • Statistical thinking
  • Data visualization
  • Python programming
  • Using Jupyter notebooks

Analytical Skills:

  • Hypothesis formation
  • Pattern recognition
  • Critical thinking
  • Drawing insights from data
  • Making business recommendations

Soft Skills:

  • Scientific method application
  • Intellectual honesty ("I don't know")
  • Learning from mistakes
  • Persistence through challenges
  • Clear communication of findings

📊 COMPLETE VISUALIZATIONS CREATED

  1. ✅ Scatter Plot: total_bill vs tip
  2. ✅ Scatter Plot: size vs total_bill
  3. ✅ Scatter Plot: size vs tip
  4. ✅ Box Plot: tip by day of week
  5. ✅ Box Plot: tip by time (lunch/dinner)
  6. ✅ Box Plot: tip by sex
  7. ✅ Box Plot: tip by smoker status
  8. ✅ Correlation Matrix Heatmap
  9. ✅ Pairplot (all relationships)

Total: 9 professional visualizations


🎓 FINAL REFLECTION

What Worked Well:

  • Systematic approach (hypothesis → test → analyze)
  • Using appropriate visualizations for each relationship
  • Being open to being wrong
  • Thorough data cleaning before analysis

What I'd Do Differently:

  • Could explore interaction effects (e.g., day × time)
  • Could calculate tip percentages earlier for context
  • Could test non-linear relationships more formally
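As a starting point for the interaction idea, a pivot table puts one factor on the rows and the other on the columns. This is a sketch on hypothetical values, not a result from the analysis above:

```python
import pandas as pd

# Toy data (hypothetical values): not every day has both lunch and dinner rows.
toy = pd.DataFrame({
    "day":  ["Thur",  "Thur",   "Fri",   "Fri",    "Sat",    "Sun"],
    "time": ["Lunch", "Dinner", "Lunch", "Dinner", "Dinner", "Dinner"],
    "tip":  [2.0,     3.0,      2.5,     3.5,      3.0,      3.2],
})

# Mean tip for every day/time combination; missing combinations become NaN.
mean_tip = toy.pivot_table(index="day", columns="time", values="tip", aggfunc="mean")

# Row counts per cell, to see how much data each mean rests on.
counts = toy.pivot_table(index="day", columns="time", values="tip", aggfunc="count")

print(mean_tip)
print(counts)
```

Empty cells matter here: if some days have no lunch rows at all, the day effect and the time effect are partly the same effect, and the two cannot be ranked independently.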

Most Surprising Finding:

"Party size plateaus at 5-6 people!"

  • Expected linear relationship
  • Discovered real-world behavioral economics
  • Shows the value of looking at data, not just assumptions

Most Important Lesson:

"Total amount of spending is the determining factor"

  • Simple but powerful
  • Actionable for businesses
  • Data-driven decision making

End of Journey Summary

"The only real mistake is the one from which we learn nothing." - Henry Ford

"In God we trust. All others must bring data." - W. Edwards Deming
