NIRANJAN LAMICHHANE

Posted on Dec 28

My Data Science Journey: Restaurant Tips Analysis

#ai #datascience

Project: Exploratory Data Analysis on Restaurant Tips Dataset
Duration: Full EDA Process
Dataset: 243 restaurant transactions after cleaning (244 before), 7 variables
Status: ✅ COMPLETED

📊 PROJECT OVERVIEW

Dataset Information

Source: Restaurant tips dataset
Initial Size: 244 rows × 7 columns
Final Size: 243 rows × 7 columns (after cleaning)
Variables:
- Numerical: total_bill, tip, size
- Categorical: sex, smoker, day, time

Project Goal

Understand what factors influence tipping behavior in restaurants through comprehensive exploratory data analysis.

🧹 PHASE 1: DATA CLEANING (Investigation 1.3)

1.1 Missing Values Investigation

Hypothesis: "The Null Hypothesis" - Why might data be missing?

What I Did:

# Checked for missing values
data.isnull().sum()
data.isnull().any()
(data.isnull().sum() / len(data)) * 100  # Percentage

Results:

✅ 0 missing values in all columns
This suggests good data collection quality (or a dataset that was already cleaned)
No imputation or removal needed

Learning Moment: Not all datasets have missing data, but always check!

1.2 Duplicate Detection

Hypothesis: Could identical transactions exist legitimately?

What I Did:

# Found duplicates
num_duplicates = data.duplicated().sum()
duplicates = data[data.duplicated(keep=False)]

# Removed them
data_clean = data.drop_duplicates()

Results:

Found 1 duplicate row
- Bill: $13.00, Tip: $2.00, Female, Smoker, Thursday, Lunch, Party of 2
- Row 198 and Row 202 were IDENTICAL
Decision: Removed as likely data entry error
Result: 244 rows → 243 rows

Key Insight: Two transactions that match on every column (bill, tip, sex, smoker, day, time, party size) are unlikely to be a coincidence, so a data entry error is the more plausible explanation.
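To make the difference between the two `duplicated()` calls above concrete, here is a minimal sketch on a made-up four-row table (the values are hypothetical, not the restaurant data). It assumes only that pandas is installed.

```python
import pandas as pd

# Hypothetical mini-table: rows 1 and 3 are exact copies of each other.
toy = pd.DataFrame({
    "total_bill": [10.0, 13.0, 20.5, 13.0],
    "tip": [1.5, 2.0, 3.0, 2.0],
    "day": ["Sat", "Thur", "Sun", "Thur"],
})

# Default keep='first' flags only the later copy, so .sum() counts extra rows.
later_copies = toy.duplicated()

# keep=False flags every member of a duplicate group, handy for inspection.
all_copies = toy.duplicated(keep=False)

print("extra rows:", later_copies.sum())
print(toy[all_copies])

# drop_duplicates keeps the first occurrence and removes the later copy.
deduped = toy.drop_duplicates()
print(len(toy), "rows ->", len(deduped), "rows")
```

This is why the count is 1 while the inspection frame shows 2 rows: one pair, one extra copy.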

1.3 Outlier Investigation

Hypothesis: "The Outlier Tribunal" - Are extreme values errors or legitimate?

What I Did:

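# Created boxplots (one figure per variable so the boxes don't overlap)
plt.boxplot(data['tip'])
plt.title('Tip ($)')
plt.show()

plt.boxplot(data['total_bill'])
plt.title('Total Bill ($)')
plt.show()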

# Calculated IQR boundaries
Q1 = data['tip'].quantile(0.25)
Q3 = data['tip'].quantile(0.75)
IQR = Q3 - Q1
upper_boundary = Q3 + (1.5 * IQR)

# Found outliers
outliers = data[data['tip'] > upper_boundary]

Mathematical Formula:

IQR = Q3 - Q1
Upper Boundary = Q3 + (1.5 × IQR)
Lower Boundary = Q1 - (1.5 × IQR)

For Tips:
Q1 = $2.00
Q3 = $3.56
IQR = $1.56
Upper Boundary = $5.90
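The arithmetic above can be checked with a small helper. This is a sketch only: `iqr_bounds` is a name made up for this example, and the quartiles are the rounded values quoted above.

```python
def iqr_bounds(q1, q3, k=1.5):
    """Return (lower, upper) outlier boundaries from the two quartiles."""
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr


# Rounded quartiles for tips, as reported above.
lower, upper = iqr_bounds(2.00, 3.56)
print(f"lower = {lower:.2f}, upper = {upper:.2f}")
```

The lower boundary comes out negative, and a tip cannot be negative, so only the upper boundary can flag anything for this column.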

Outliers Found:
| Bill | Tip | Tip % | Verdict |
|---------|--------|-------|-----------------|
| $50.81 | $10.00 | 19.7% | ✅ Legitimate |
| $48.33 | $9.00 | 18.6% | ✅ Legitimate |
| $39.42 | $7.58 | 19.2% | ✅ Legitimate |
| $48.27 | $6.73 | 13.9% | ✅ Legitimate |

Decision: Kept all outliers - they come from large bills and have reasonable tip percentages (roughly 14-20%)

Key Insight: Outliers aren't always errors! Verify with context (tip percentage in this case).
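The "Tip %" column is just tip divided by bill. A minimal sketch that recomputes it for the four rows listed in the table (the column name `tip_pct` is my own choice for this example):

```python
import pandas as pd

# The four bill/tip pairs from the outlier table above.
rows = pd.DataFrame({
    "total_bill": [50.81, 48.33, 39.42, 48.27],
    "tip": [10.00, 9.00, 7.58, 6.73],
})

# Tip as a percentage of the bill, rounded to one decimal place.
rows["tip_pct"] = (rows["tip"] / rows["total_bill"] * 100).round(1)
print(rows)
```

Seen this way, a $10 tip stops looking extreme: it is under 20% of the bill.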

🔬 PHASE 2: BIVARIATE ANALYSIS (Investigation 2.2)

Overview: Testing 7 Relationships

For each relationship, I followed the scientific method:

Hypothesis - Make a prediction
Visualization - Create appropriate chart
Analysis - Interpret the pattern
Conclusion - Accept or reject hypothesis

2.1 Relationship #1: Total Bill → Tip

My Hypothesis:

"As total_bill increases, tip will increase WEAKLY"
Reasoning: "Tip is 'keep the change' - not percentage based"
Confidence: MEDIUM
Expected: Weak/no relationship

What I Did:

plt.scatter(data_clean['total_bill'], data_clean['tip'])
plt.xlabel('Total Bill ($)')
plt.ylabel('Tip ($)')
plt.title('Relationship Between Total Bill and Tip Amount')
plt.show()

Results:

Pattern: Strong upward linear trend
Correlation: r = 0.67 (Strong positive)
Points cluster around an imaginary upward-sloping line (with visible scatter, since r is 0.67 and not 1)

Hypothesis Verdict: ❌ REJECTED

What I Learned:

My hypothesis was WRONG - and that's okay!
Reality: People tip 15-20% of bill (percentage-based, not "keep change")
Mechanism: Bill × 15-20% = Tip (mathematical relationship)
Key insight: "Learning happens with mistakes" - being wrong is part of science!
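A scatter plot shows the trend, but the correlation and the slope of a fitted line put numbers on it. The sketch below uses synthetic data, NOT the restaurant dataset: bills are drawn at random, and tips follow an assumed 15% rule plus noise. The rate and noise level are assumptions chosen for illustration.

```python
import numpy as np

# Synthetic data (illustration only): bills between $10 and $50,
# tips at an assumed 15% rate plus random noise.
rng = np.random.default_rng(0)
bill = rng.uniform(10, 50, size=200)
tip = 0.15 * bill + rng.normal(0, 1.0, size=200)

# Pearson correlation between bill and tip.
r = np.corrcoef(bill, tip)[0, 1]

# Least-squares line: tip = slope * bill + intercept.
slope, intercept = np.polyfit(bill, tip, deg=1)

print(f"r = {r:.2f}, slope = {slope:.3f}, intercept = {intercept:.2f}")
```

With percentage-based tipping, the slope should land near the tipping rate. With "keep the change" tipping, the slope would be close to zero.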

Business Insight: Higher bills = higher tips. Restaurants should encourage higher spending.

2.2 Relationship #2: Party Size → Total Bill

My Hypothesis:

"As party size increases, total_bill increases STRONGLY"
Reasoning: "More people = more food (obvious!)"
Confidence: HIGH

What I Did:

plt.scatter(data_clean['size'], data_clean['total_bill'])

Results:

Pattern: Grouped upward trend (vertical columns)
Correlation: r = 0.60 (Medium-strong positive)
Party size = discrete (1,2,3,4,5,6), not continuous
Size 2 most common, with widest bill range ($10-$40)

Hypothesis Verdict: ✅ CONFIRMED (but weaker than expected)

Key Insight:

"Party size predicts bill, but doesn't determine it completely"
A couple can outspend a group of 4 depending on what they order
What people ORDER matters more than HOW MANY people

2.3 Relationship #3: Party Size → Tip

My Hypothesis:

"As party size increases, tip increases STRONGLY"
Reasoning: "More people → bigger bill → percentage-based tip → more tip"
Confidence: HIGH

Results:

Pattern: Upward trend from size 1-4, then FLATTENS at 5-6
Correlation: r = 0.49 (Medium-weak)
Non-linear relationship!

Hypothesis Verdict: ⚠️ PARTIALLY CORRECT

Surprising Discovery:

Tips increase up to party size 4
Tips PLATEAU at sizes 5-6 (don't increase further!)

Possible Explanations:

Automatic gratuity - Restaurants add mandatory 15-18% for large parties
Social loafing - "Someone else will tip well, so I don't need to"
Different occasions - Large parties = kids/families (tip standard)
Splitting complications - Harder to calculate when splitting 6 ways

Key Insight: Large parties tip differently than expected - real behavioral economics!
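One way to check a plateau beyond eyeballing the scatter plot is to compare the average tip at each party size. A minimal sketch on made-up numbers (the `toy` table is hypothetical and only mimics the shape described above):

```python
import pandas as pd

# Hypothetical tips, two per party size, built to flatten after size 4.
toy = pd.DataFrame({
    "size": [1, 1, 2, 2, 3, 3, 4, 4, 5, 5, 6, 6],
    "tip": [1.0, 1.5, 2.0, 3.0, 3.0, 3.5, 4.0, 4.5, 4.0, 4.2, 4.1, 4.3],
})

# Average tip for each party size.
mean_by_size = toy.groupby("size")["tip"].mean()

# Change in the average when one more person joins the table.
step = mean_by_size.diff()

print(pd.DataFrame({"mean_tip": mean_by_size, "change": step}))
```

A plateau shows up as changes that shrink toward zero. On real data, the group counts are worth checking too, since sizes 5 and 6 usually have very few rows.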

2.4 Relationship #4: Day of Week → Tip

My Hypothesis:

Highest: Sunday (weekend celebration mood)
Lowest: Wednesday (people just filling stomach)
Expected difference: MEDIUM
Confidence: MEDIUM

What I Did:

sns.boxplot(x='day', y='tip', data=data_clean,
            order=['Sun','Mon','Tue','Wed','Thur','Fri','Sat'])

Results:
| Day | Avg Tip | Verdict |
|-----------|---------|----------------|
| Saturday | $3.00 | 🏆 Highest |
| Sunday | $2.90 | High |
| Mon/Tue/Wed| $2.25 | 🔻 Lowest (tie)|

Hypothesis Verdict: ⚠️ PARTIALLY WRONG

What I Got Wrong:

Predicted Sunday highest → Actually Saturday highest
Predicted Wednesday lowest → Correct (tied with Mon/Tue)

Key Observations:

Saturday has most high-tip outliers (special occasions, date nights)
Sunday has LARGEST box (most variation) - diverse crowd
Weekdays cluster together (consistent lower tipping)
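The "size of the box" in a boxplot is the interquartile range, so it can be computed per day instead of read off the chart. A minimal sketch on made-up tips (hypothetical values; the helper `iqr` is defined here, not a pandas built-in):

```python
import pandas as pd

# Hypothetical tips: Saturday is tight, Sunday is spread out.
toy = pd.DataFrame({
    "day": ["Sat", "Sat", "Sat", "Sat", "Sun", "Sun", "Sun", "Sun"],
    "tip": [2.5, 3.0, 3.0, 3.5, 1.5, 2.5, 3.5, 5.0],
})


def iqr(values):
    """Interquartile range: the height of the box in a boxplot."""
    return values.quantile(0.75) - values.quantile(0.25)


# Count, centre (mean and median) and spread (IQR) for each day.
summary = toy.groupby("day")["tip"].agg(["count", "mean", "median", iqr])
print(summary)
```

Putting mean and median side by side also helps: a few big tips pull the mean up while the median stays put.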

Key Insight:
"Sunday = diverse people = diverse tipping = large variation in tips"

2.5 Relationship #5: Time (Lunch vs Dinner) → Tip

My Hypothesis:

Dinner will have higher tips
Reasoning: "Night time = people more generous; lunch = people in rush"
Confidence: MEDIUM

Results:
| Time | Avg Tip | Difference |
|--------|---------|------------|
| Dinner | $3.00 | — |
| Lunch | $2.20 | -$0.80 |

Hypothesis Verdict: ✅ CONFIRMED!

Key Insight:

$0.80 difference - this is the BIGGEST categorical effect!
Time of day is the STRONGEST categorical predictor
Possible explanation: lunch customers are in a rush (the data has no measure of satisfaction, so this is a guess)
Dinner tends to be a more relaxed, celebratory setting
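The gap can be stated in dollars or relative to lunch. A quick worked check using the two averages from the table above:

```python
# Average tips from the results table above.
dinner, lunch = 3.00, 2.20

diff = dinner - lunch          # absolute gap in dollars
pct = diff / lunch * 100       # gap relative to the lunch average

print(f"${diff:.2f} more at dinner ({pct:.0f}% above lunch)")
```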

Business Recommendation: Prioritize dinner service quality!

2.6 Relationship #6: Sex (Male vs Female) → Tip

My Hypothesis:

Males will tip MORE
Reasoning: "Female waitresses + male customers trying to impress"
Confidence: MEDIUM

Results:
| Sex | Avg Tip | Difference |
|--------|---------|------------|
| Female | $3.20 | — |
| Male | $3.00 | -$0.20 |

Hypothesis Verdict: ❌ REJECTED!

What I Got Wrong:

Females actually tip SLIGHTLY more (or it's basically equal)
The difference is minimal ($0.20)
Sex is NOT a strong predictor

Key Insight: My gender stereotype about tipping didn't hold up in this data!

2.7 Relationship #7: Smoker vs Non-Smoker → Tip

My Hypothesis:

Non-smokers will tip MORE
Reasoning: "Smokers save money for cigarettes instead of tipping"
Confidence: LOW

Results:
| Smoker Status | Avg Tip | Difference |
|---------------|---------|------------|
| Smokers | $3.00 | — |
| Non-smokers | $2.80 | -$0.20 |

Hypothesis Verdict: ❌ REJECTED!

Honest Reflection: "Cannot figure out why" - and that's okay!

Possible Explanations:

Smokers sit outside/at bar (different atmosphere?)
Correlation, not causation (maybe age/demographic differences)
Small difference ($0.20) might be random chance
Need more data to understand

Key Insight: Not every pattern has an obvious explanation - intellectual honesty matters!
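Whether a $0.20 gap "might be random chance" can be probed with a permutation test: shuffle the group labels many times and see how often a gap at least as large appears by accident. The sketch below is one possible approach, not something from the original analysis. The function name `permutation_p_value` and the synthetic tips are assumptions for illustration.

```python
import numpy as np


def permutation_p_value(a, b, n_perm=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_perm):
        # Shuffle all tips, then split them into two groups of the original sizes.
        shuffled = rng.permutation(pooled)
        gap = abs(shuffled[:len(a)].mean() - shuffled[len(a):].mean())
        if gap >= observed:
            hits += 1
    # The +1 keeps the estimate away from an impossible p-value of exactly 0.
    return (hits + 1) / (n_perm + 1)


# Synthetic tips (NOT the real data): two groups whose true means differ by $0.20.
rng = np.random.default_rng(1)
smokers = rng.normal(3.0, 1.4, size=90)
non_smokers = rng.normal(2.8, 1.4, size=150)

p = permutation_p_value(smokers, non_smokers)
print(f"p-value: {p:.3f}")
```

A large p-value means shuffled labels produce gaps like this all the time, so the difference should not be read as a real effect.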

📈 PHASE 3: CORRELATION ANALYSIS

What I Did:

# Correlation matrix
correlation_matrix = data_clean[['total_bill', 'tip', 'size']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')

Results:

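| Pair | Correlation | Strength | Interpretation |
|------------------|-------------|---------------|-------------------------|
| total_bill ↔ tip | 0.67 | Strong | 🏆 Strongest predictor |
| size ↔ total_bill | 0.60 | Medium-Strong | More people = more food |
| size ↔ tip | 0.49 | Medium-Weak | Non-linear (plateaus) |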

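Key Insight:
Tips are roughly a fixed percentage of the total bill, which is consistent with the 0.67 correlation. (An exactly fixed percentage would give r = 1; the gap reflects how much individual tipping varies.)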

🎨 PHASE 4: PAIRPLOT (Visual Summary)

What I Did:

sns.pairplot(data_clean, 
             vars=['total_bill', 'tip', 'size'],
             hue='time',  # Color by lunch/dinner
             diag_kind='hist')

Observations:

From diagonal (distributions):

Most common tip: ~$2-3
Most common bill: ~$15-20
Most common party size: 2 people

From scatter plots:

total_bill vs tip: Clear upward trend (confirms r=0.67)
size vs others: Grouped patterns (discrete variable)
Lunch (blue) vs Dinner (orange): Overlap mostly, but dinner shifts slightly higher

Overall Impression: Relationships are SOMEWHAT CLEAR - not perfect, but strong enough to be meaningful

🎯 KEY FINDINGS SUMMARY

Strongest Predictors (Ranked):

Total Bill (r=0.67) 🥇
- Explains ~45% of tip variation (r² = 0.67² = 0.45)
- Clear linear relationship
- Percentage-based tipping (15-20%)
Time of Day ($0.80 difference) 🥈
- Dinner tips $0.80 more than lunch
- Strongest categorical effect
- Reflects rushed vs relaxed dining
Party Size (r=0.49 with tip) 🥉
- Medium effect, but NON-LINEAR
- Plateaus at size 5-6
- Different behavior for large groups
Day of Week ($0.75 difference)
- Saturday highest ($3.00)
- Weekdays lowest (~$2.25)
- Weekend vs weekday effect
Sex ($0.20 difference) - WEAK
- Minimal difference
- Nearly equal tipping
Smoker Status ($0.20 difference) - WEAK
- Minimal difference
- Unexplained pattern
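Squaring a correlation gives the share of variation the two variables have in common. A short worked check for the three correlations reported above (the dictionary names are just labels for this example):

```python
# Correlations reported in the correlation analysis above.
correlations = {
    "total_bill vs tip": 0.67,
    "size vs total_bill": 0.60,
    "size vs tip": 0.49,
}

# r squared: proportion of variance shared by each pair.
shared = {pair: round(r ** 2, 2) for pair, r in correlations.items()}

for pair, r2 in shared.items():
    print(f"{pair}: r^2 = {r2:.2f}")
```

So party size on its own accounts for only about a quarter of the variation in tips, which fits its third-place ranking.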

💡 BUSINESS RECOMMENDATIONS

Based on data analysis, restaurant owners should:

1. FOCUS ON INCREASING BILL AMOUNT 🎯

Why: Strongest correlation (0.67) - higher bills go hand in hand with higher tips

Actions:

Upsell appetizers, drinks, desserts
Create combo deals that increase bill
Train servers on suggestive selling
Offer premium menu items

Expected Impact: 10% increase in average bill → ~10% increase in tips, assuming customers keep tipping roughly the same percentage

2. PRIORITIZE DINNER SERVICE 🌙

Why: Dinner tips $0.80 (36%) more than lunch

Actions:

Allocate best servers to dinner shift
Focus marketing on dinner hours
Create special dinner ambiance
Dinner-specific promotions

Expected Impact: Shift focus to higher-margin time period

3. OPTIMIZE FOR PARTY SIZES 2-4 👥

Why: These sizes have best tip-to-effort ratio

Actions:

Table arrangements favor 2-4 person parties
Special deals for couples/small groups
Don't overinvest in large party accommodations (tips plateau)

Expected Impact: Maximize tips per table/server time

4. WEEKEND FOCUS 📅

Why: Saturday/Sunday have higher tips

Actions:

Premium staffing on weekends
Weekend specials/events
Higher-end menu items on weekends

5. DON'T DISCRIMINATE BY SEX/SMOKER ⚖️

Why: These factors have minimal effect

Insight: Treat all customers equally - demographics don't significantly predict tipping

🧠 PERSONAL LEARNING JOURNEY

What I Learned About Data Science:

1. The Scientific Method Works!

Make hypothesis → Test → Analyze → Conclude
Being wrong is GOOD - that's how we learn!
Quote: "Learning happens with mistakes"

2. Hypotheses Can Be Wrong

My Wrong Predictions:

❌ Thought tipping was "keep the change" → Actually percentage-based
❌ Thought Sunday would have highest tips → Actually Saturday
❌ Thought males tip more → Actually nearly equal
❌ Thought smokers tip less → Actually slightly more

Lesson: Don't trust assumptions - test with data!

3. Correlation ≠ Causation

Smokers tip more, but WHY?
Could be confounding variables (age, location, etc.)
Need more data to understand mechanisms

4. Context Matters

Outliers aren't always errors
$10 tip on $50 bill = 20% (normal!)
Always calculate percentages/ratios for context

5. Data Quality First

Clean data = reliable analysis
Check for: missing values, duplicates, outliers
This dataset was excellent (0 missing!)

6. Visualization is Powerful

Scatter plots → see relationships
Box plots → compare groups
Correlation matrix → see everything at once
Pairplot → ultimate summary

7. Different Charts for Different Data

Numerical vs Numerical → Scatter plot
Categorical vs Numerical → Box plot / Bar chart
All at once → Pairplot, Correlation matrix

What I Learned About Python/Tools:

Python Libraries:

import pandas as pd           # Data manipulation
import numpy as np            # Numerical operations
import matplotlib.pyplot as plt  # Basic plotting
import seaborn as sns         # Statistical plotting

Key Functions Mastered:

Pandas:

data.head()              # First 5 rows
data.shape               # Dimensions
data.describe()          # Statistics
data.isnull().sum()      # Missing values
data.duplicated().sum()  # Duplicates
data.drop_duplicates()   # Remove duplicates
data['column']           # Select column
data[condition]          # Filter rows
data.groupby('day')['tip'].mean()  # Group by a column, then aggregate
data.corr(numeric_only=True)       # Correlation matrix (numeric columns only)

Matplotlib:

plt.scatter(x, y)        # Scatter plot
plt.xlabel()             # X-axis label
plt.ylabel()             # Y-axis label
plt.title()              # Title
plt.grid()               # Grid lines
plt.show()               # Display plot

Seaborn:

sns.boxplot(x='category', y='number', data=df)
sns.heatmap(corr_matrix, annot=True)
sns.pairplot(df, vars=['col1','col2'], hue='category')

Important Concepts:

1. Order Matters!

# WRONG
outliers = data[data['tip'] > 5]                    # filtered copy is made here...
data['tip_pct'] = data['tip'] / data['total_bill']  # ...so the new column never reaches it
print(outliers['tip_pct'])  # KeyError: 'tip_pct'

# RIGHT
data['tip_pct'] = data['tip'] / data['total_bill']
outliers = data[data['tip'] > 5]
print(outliers['tip_pct'])  # WORKS!

2. Matplotlib vs Seaborn:

matplotlib = Low-level, flexible, more code
seaborn = High-level, easy, pretty defaults
Use both together!

3. Data Types Matter:

float64 / int64 = Can do math
object = Text, can't do math
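A column that looks numeric can still be stored as text, for example when one cell holds "n/a". The sketch below is a hypothetical example (made-up values and column names) of spotting that with `dtypes` and converting with `pd.to_numeric`. Depending on the pandas version, the text column is reported as `object` or as a dedicated string dtype.

```python
import pandas as pd

# Hypothetical raw data: 'tip' was read in as text because of the "n/a" entry.
raw = pd.DataFrame({
    "tip": ["2.00", "3.50", "n/a"],
    "size": [2, 3, 4],
})
print(raw.dtypes)

# errors="coerce" turns anything unparseable into NaN instead of raising.
raw["tip_num"] = pd.to_numeric(raw["tip"], errors="coerce")

print(raw["tip_num"].dtype)
print("missing after conversion:", raw["tip_num"].isna().sum())
print("mean tip:", raw["tip_num"].mean())  # NaN is skipped by default
```

The coerced NaN also shows up in `isnull().sum()`, so this step belongs before the missing-value check.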

Skills I Developed:

✅ Technical Skills:

Data cleaning and preprocessing
Exploratory data analysis
Statistical thinking
Data visualization
Python programming
Using Jupyter notebooks

✅ Analytical Skills:

Hypothesis formation
Pattern recognition
Critical thinking
Drawing insights from data
Making business recommendations

✅ Soft Skills:

Scientific method application
Intellectual honesty ("I don't know")
Learning from mistakes
Persistence through challenges
Clear communication of findings

📊 COMPLETE VISUALIZATIONS CREATED

✅ Scatter Plot: total_bill vs tip
✅ Scatter Plot: size vs total_bill
✅ Scatter Plot: size vs tip
✅ Box Plot: tip by day of week
✅ Box Plot: tip by time (lunch/dinner)
✅ Box Plot: tip by sex
✅ Box Plot: tip by smoker status
✅ Correlation Matrix Heatmap
✅ Pairplot (all relationships)

Total: 9 professional visualizations

🎓 FINAL REFLECTION

What Worked Well:

Systematic approach (hypothesis → test → analyze)
Using appropriate visualizations for each relationship
Being open to being wrong
Thorough data cleaning before analysis

What I'd Do Differently:

Could explore interaction effects (e.g., day × time)
Could calculate tip percentages earlier for context
Could test non-linear relationships more formally
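For the interaction idea (day × time), a pivot table of average tips is a simple starting point. A minimal sketch on made-up rows (the `toy` table is hypothetical, not output from this project):

```python
import pandas as pd

# Hypothetical transactions covering some day/time combinations.
toy = pd.DataFrame({
    "day": ["Thur", "Thur", "Fri", "Fri", "Sat", "Sun"],
    "time": ["Lunch", "Dinner", "Lunch", "Dinner", "Dinner", "Dinner"],
    "tip": [2.0, 3.0, 2.5, 3.5, 3.0, 3.2],
})

# Rows = day, columns = time, cells = average tip for that combination.
table = toy.pivot_table(index="day", columns="time", values="tip", aggfunc="mean")
print(table)
```

Empty cells (NaN) are informative too: they show combinations that never occur, which means "day" and "time" effects are partly tangled together.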

Most Surprising Finding:

"Party size plateaus at 5-6 people!"

Expected linear relationship
Discovered real-world behavioral economics
Shows the value of looking at data, not just assumptions

Most Important Lesson:

"Total amount of spending is the determining factor"

Simple but powerful
Actionable for businesses
Data-driven decision making

End of Journey Summary

"The only real mistake is the one from which we learn nothing." - Henry Ford

"In God we trust. All others must bring data." - W. Edwards Deming