Project: Exploratory Data Analysis on Restaurant Tips Dataset
Duration: Full EDA Process
Dataset: 243 restaurant transactions, 7 variables
Status: ✅ COMPLETED
📊 PROJECT OVERVIEW
Dataset Information
- Source: Restaurant tips dataset
- Initial Size: 244 rows × 7 columns
- Final Size: 243 rows × 7 columns (after cleaning)
- Variables:
  - Numerical: total_bill, tip, size
  - Categorical: sex, smoker, day, time
Project Goal
Understand what factors influence tipping behavior in restaurants through comprehensive exploratory data analysis.
🧹 PHASE 1: DATA CLEANING (Investigation 1.3)
1.1 Missing Values Investigation
Hypothesis: "The Null Hypothesis" - Why might data be missing?
What I Did:
# Checked for missing values
data.isnull().sum()
data.isnull().any()
(data.isnull().sum() / len(data)) * 100 # Percentage
Results:
- ✅ 0 missing values in all columns
- This suggests the dataset was already complete and well curated
- No imputation or removal needed
Learning Moment: Not all datasets have missing data, but always check!
1.2 Duplicate Detection
Hypothesis: Could identical transactions exist legitimately?
What I Did:
# Found duplicates
num_duplicates = data.duplicated().sum()
duplicates = data[data.duplicated(keep=False)]
# Removed them
data_clean = data.drop_duplicates()
Results:
- Found 1 duplicate row
- Bill: $13.00, Tip: $2.00, Female, Smoker, Thursday, Lunch, Party of 2
- Row 198 and Row 202 were IDENTICAL
- Decision: Removed as likely data entry error
- Result: 244 rows → 243 rows
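Key Insight: Two transactions that match on every column (bill, tip, sex, smoker, day, time, party size) are unlikely to be a coincidence, so a data entry error is the more plausible explanation. This is a judgment call, not a proof.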
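To show why `duplicated().sum()` reports 1 while `duplicated(keep=False)` returns 2 rows, here is a minimal sketch on a toy frame. The three rows are made up for illustration; only the repeated row mirrors the bill and tip reported above.

```python
import pandas as pd

# Toy data (hypothetical): rows 0 and 2 are identical on every column.
toy = pd.DataFrame({
    "total_bill": [13.00, 20.50, 13.00],
    "tip": [2.00, 3.00, 2.00],
    "day": ["Thur", "Sat", "Thur"],
})

# Default keep='first': only the later copy is flagged as a duplicate.
num_duplicates = int(toy.duplicated().sum())

# keep=False flags every member of a duplicate group, which helps when inspecting them.
all_copies = toy[toy.duplicated(keep=False)]

# drop_duplicates() keeps the first copy and preserves the original index labels.
toy_clean = toy.drop_duplicates()

print(num_duplicates)
print(all_copies.index.tolist())
print(toy_clean.shape)
```

Because the index labels are preserved, the cleaned frame has a gap where the dropped row used to be; `reset_index(drop=True)` would renumber it if that matters later.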
1.3 Outlier Investigation
Hypothesis: "The Outlier Tribunal" - Are extreme values errors or legitimate?
What I Did:
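# Created boxplots (one figure each, so they do not overlap on the same axes)
plt.boxplot(data['tip'])
plt.title('Tip ($)')
plt.show()
plt.boxplot(data['total_bill'])
plt.title('Total Bill ($)')
plt.show()
# Calculated IQR boundaries
Q1 = data['tip'].quantile(0.25)
Q3 = data['tip'].quantile(0.75)
IQR = Q3 - Q1
upper_boundary = Q3 + (1.5 * IQR)
lower_boundary = Q1 - (1.5 * IQR)
# Found outliers (tips above the upper boundary)
outliers = data[data['tip'] > upper_boundary]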
Mathematical Formula:
IQR = Q3 - Q1
Upper Boundary = Q3 + (1.5 × IQR)
Lower Boundary = Q1 - (1.5 × IQR)
For Tips:
Q1 = $2.00
Q3 = $3.56
IQR = $1.56
Upper Boundary = $5.90
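As a quick check of the arithmetic, the boundaries can be recomputed from the rounded quartiles reported above. This is only a sketch using those rounded values, not a recomputation from the raw data. It also shows that the lower boundary comes out negative, and since a tip cannot be below $0, only the upper boundary can flag anything here.

```python
# Quartiles as reported above (rounded to cents).
q1, q3 = 2.00, 3.56

iqr = q3 - q1
upper_boundary = q3 + 1.5 * iqr
lower_boundary = q1 - 1.5 * iqr

print(f"IQR = {iqr:.2f}")
print(f"Upper boundary = {upper_boundary:.2f}")
print(f"Lower boundary = {lower_boundary:.2f}")
```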
Outliers Found:
| Bill | Tip | Tip % | Verdict |
|---------|--------|-------|-----------------|
| $50.81 | $10.00 | 19.7% | ✅ Legitimate |
| $48.33 | $9.00 | 18.6% | ✅ Legitimate |
| $39.42 | $7.58 | 19.2% | ✅ Legitimate |
| $48.27 | $6.73 | 13.9% | ✅ Legitimate |
Decision: Kept all outliers - they represent large parties with reasonable tip percentages
Key Insight: Outliers aren't always errors! Verify with context (tip percentage in this case).
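The tip percentages in the table can be reproduced in a couple of lines. The sketch below retypes the four bill and tip pairs from the table by hand; the `tip_pct` column name is my own choice for illustration.

```python
import pandas as pd

# The four outlier rows listed in the table above.
outliers = pd.DataFrame({
    "total_bill": [50.81, 48.33, 39.42, 48.27],
    "tip": [10.00, 9.00, 7.58, 6.73],
})

# Tip as a percentage of the bill gives the context that the raw dollar amount lacks.
outliers["tip_pct"] = (outliers["tip"] / outliers["total_bill"] * 100).round(1)

print(outliers)
```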
🔬 PHASE 2: BIVARIATE ANALYSIS (Investigation 2.2)
Overview: Testing 7 Relationships
For each relationship, I followed the scientific method:
- Hypothesis - Make a prediction
- Visualization - Create appropriate chart
- Analysis - Interpret the pattern
- Conclusion - Accept or reject hypothesis
2.1 Relationship #1: Total Bill → Tip
My Hypothesis:
- "As total_bill increases, tip will increase WEAKLY"
- Reasoning: "Tip is 'keep the change' - not percentage based"
- Confidence: MEDIUM
- Expected: Weak/no relationship
What I Did:
plt.scatter(data_clean['total_bill'], data_clean['tip'])
plt.xlabel('Total Bill ($)')
plt.ylabel('Tip ($)')
plt.title('Relationship Between Total Bill and Tip Amount')
plt.show()
Results:
- Pattern: Strong upward linear trend
- Correlation: r = 0.67 (Strong positive)
- Points cluster around an imaginary upward-sloping line
Hypothesis Verdict: ❌ REJECTED
What I Learned:
- My hypothesis was WRONG - and that's okay!
- Reality: Tips scale with the bill, at roughly 15-20% of it (percentage-based, not "keep the change")
- Mechanism: Tip ≈ Bill × 15-20%, with the exact rate varying from customer to customer
- Key insight: "Learning happens with mistakes" - being wrong is part of science!
Business Insight: Higher bills = higher tips. Restaurants should encourage higher spending.
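The correlation coefficient reported above comes from a single pandas call. Here is a minimal sketch on toy numbers (hypothetical values, not rows from the dataset), which also squares r to get the share of variance a straight-line fit would account for.

```python
import pandas as pd

# Toy data (hypothetical values, not rows from the dataset).
toy = pd.DataFrame({
    "total_bill": [10.0, 15.0, 20.0, 30.0, 40.0],
    "tip": [2.0, 2.0, 3.5, 4.0, 7.0],
})

# Series.corr defaults to the Pearson coefficient.
r = toy["total_bill"].corr(toy["tip"])

# r squared: share of tip variance accounted for by a linear fit on total_bill.
r_squared = r ** 2

print(f"r = {r:.2f}, r^2 = {r_squared:.2f}")
```

On the cleaned data the equivalent call would be `data_clean['total_bill'].corr(data_clean['tip'])`.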
2.2 Relationship #2: Party Size → Total Bill
My Hypothesis:
- "As party size increases, total_bill increases STRONGLY"
- Reasoning: "More people = more food (obvious!)"
- Confidence: HIGH
What I Did:
plt.scatter(data_clean['size'], data_clean['total_bill'])
Results:
- Pattern: Grouped upward trend (vertical columns)
- Correlation: r = 0.60 (Medium-strong positive)
- Party size = discrete (1,2,3,4,5,6), not continuous
- Size 2 most common, with widest bill range ($10-$40)
Hypothesis Verdict: ✅ CONFIRMED (but weaker than expected)
Key Insight:
- "Party size predicts bill, but doesn't determine it completely"
- A couple can outspend a group of 4 depending on what they order
- What people ORDER matters more than HOW MANY people
2.3 Relationship #3: Party Size → Tip
My Hypothesis:
- "As party size increases, tip increases STRONGLY"
- Reasoning: "More people → bigger bill → percentage-based tip → more tip"
- Confidence: HIGH
Results:
- Pattern: Upward trend from size 1-4, then FLATTENS at 5-6
- Correlation: r = 0.49 (Medium-weak)
- Non-linear relationship!
Hypothesis Verdict: ⚠️ PARTIALLY CORRECT
Surprising Discovery:
- Tips increase up to party size 4
- Tips PLATEAU at sizes 5-6 (don't increase further!)
Possible Explanations:
- Automatic gratuity - Restaurants add mandatory 15-18% for large parties
- Social loafing - "Someone else will tip well, so I don't need to"
- Different occasions - Large parties = kids/families (tip standard)
- Splitting complications - Harder to calculate when splitting 6 ways
Key Insight: Large parties tip differently than expected - real behavioral economics!
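One way to check for a plateau numerically is to average the tip per party size and look at the step from one size to the next. The sketch below uses toy numbers that I made up to show the shape; it does not reproduce the dataset.

```python
import pandas as pd

# Toy data (hypothetical): tips rise with party size, then level off.
toy = pd.DataFrame({
    "size": [1, 2, 2, 3, 3, 4, 4, 5, 5, 6],
    "tip":  [1.5, 2.5, 2.7, 3.3, 3.5, 4.0, 4.2, 4.1, 4.1, 4.2],
})

# Mean tip and number of tables for each party size.
by_size = toy.groupby("size")["tip"].agg(["mean", "count"])

# Change in mean tip from one party size to the next; values near zero suggest a plateau.
by_size["step"] = by_size["mean"].diff()

print(by_size)
```

The `count` column is worth reading alongside the means: an average over only a handful of tables is noisy, so a flat spot among the largest parties should be treated with caution.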
2.4 Relationship #4: Day of Week → Tip
My Hypothesis:
- Highest: Sunday (weekend celebration mood)
- Lowest: Wednesday (people just filling stomach)
- Expected difference: MEDIUM
- Confidence: MEDIUM
What I Did:
sns.boxplot(x='day', y='tip', data=data_clean,
            order=['Sun','Mon','Tue','Wed','Thur','Fri','Sat'])
Results:
| Day | Avg Tip | Verdict |
|-----------|---------|----------------|
| Saturday | $3.00 | 🏆 Highest |
| Sunday | $2.90 | High |
| Mon/Tue/Wed| $2.25 | 🔻 Lowest (tie)|
Hypothesis Verdict: ⚠️ PARTIALLY WRONG
What I Got Wrong:
- Predicted Sunday highest → Actually Saturday highest
- Predicted Wednesday lowest → Correct (tied with Mon/Tue)
Key Observations:
- Saturday has most high-tip outliers (special occasions, date nights)
- Sunday has LARGEST box (most variation) - diverse crowd
- Weekdays cluster together (consistent lower tipping)
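A box plot draws the median as its center line, while the table above reports averages, so it helps to compute both. Here is a minimal sketch with a small helper; `tip_summary` is a hypothetical name of my own, and the toy rows are invented for illustration.

```python
import pandas as pd

def tip_summary(df: pd.DataFrame, by: str) -> pd.DataFrame:
    """Mean, median, and row count of tip for each level of a categorical column."""
    return (
        df.groupby(by, observed=True)["tip"]
          .agg(["mean", "median", "count"])
          .sort_values("mean", ascending=False)
    )

# Toy data (hypothetical values, not rows from the dataset).
toy = pd.DataFrame({
    "day": ["Sat", "Sat", "Sun", "Sun", "Thur", "Thur"],
    "tip": [3.0, 4.0, 2.5, 3.5, 2.0, 2.5],
})

summary = tip_summary(toy, "day")
print(summary)
```

The same helper should work unchanged for `time`, `sex`, and `smoker`, for example `tip_summary(data_clean, 'time')`.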
Key Insight:
"Sunday = diverse people = diverse tipping = large variation in tips"
2.5 Relationship #5: Time (Lunch vs Dinner) → Tip
My Hypothesis:
- Dinner will have higher tips
- Reasoning: "Night time = people more generous; lunch = people in rush"
- Confidence: MEDIUM
Results:
| Time | Avg Tip | Difference |
|--------|---------|------------|
| Dinner | $3.00 | — |
| Lunch | $2.20 | -$0.80 |
Hypothesis Verdict: ✅ CONFIRMED!
Key Insight:
- $0.80 difference - this is the BIGGEST categorical effect!
- Time of day is the STRONGEST categorical predictor
- Possible explanation: lunch customers are rushed, while dinner is more relaxed and celebratory
- The data does not record mood or satisfaction, and dinner bills may simply be larger, so this remains a guess
Business Recommendation: Prioritize dinner service quality!
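A dollar gap between lunch and dinner can come from bigger dinner bills rather than more generous tipping. Comparing the tip rate next to the dollar amount separates the two. This sketch uses toy numbers (hypothetical), chosen so that the rate is identical in both groups while the dollar tips differ.

```python
import pandas as pd

# Toy data (hypothetical): dinner bills are larger, the tip rate is the same.
toy = pd.DataFrame({
    "time": ["Lunch", "Lunch", "Dinner", "Dinner"],
    "total_bill": [10.0, 20.0, 20.0, 40.0],
    "tip": [1.5, 3.0, 3.0, 6.0],
})

toy["tip_pct"] = toy["tip"] / toy["total_bill"] * 100

# Average dollar tip and average tip rate side by side.
by_time = toy.groupby("time")[["tip", "tip_pct"]].mean()

print(by_time)
```

If the real data showed a similar tip rate at lunch and dinner, the dollar difference would be explained by bill size alone.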
2.6 Relationship #6: Sex (Male vs Female) → Tip
My Hypothesis:
- Males will tip MORE
- Reasoning: "Female waitresses + male customers trying to impress"
- Confidence: MEDIUM
Results:
| Sex | Avg Tip | Difference |
|--------|---------|------------|
| Female | $3.20 | — |
| Male | $3.00 | -$0.20 |
Hypothesis Verdict: ❌ REJECTED!
What I Got Wrong:
- Females actually tip SLIGHTLY more (or it's basically equal)
- The difference is minimal ($0.20)
- Sex is NOT a strong predictor
Key Insight: Gender stereotypes about tipping don't hold up in data!
2.7 Relationship #7: Smoker vs Non-Smoker → Tip
My Hypothesis:
- Non-smokers will tip MORE
- Reasoning: "Smokers save money for cigarettes instead of tipping"
- Confidence: LOW
Results:
| Smoker Status | Avg Tip | Difference |
|---------------|---------|------------|
| Smokers | $3.00 | — |
| Non-smokers | $2.80 | -$0.20 |
Hypothesis Verdict: ❌ REJECTED!
Honest Reflection: "Cannot figure out why" - and that's okay!
Possible Explanations:
- Smokers sit outside/at bar (different atmosphere?)
- Correlation, not causation (maybe age/demographic differences)
- Small difference ($0.20) might be random chance
- Need more data to understand
Key Insight: Not every pattern has an obvious explanation - intellectual honesty matters!
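To get a feel for whether a small gap "might be random chance", one option is a permutation test: shuffle the group labels many times and count how often a gap at least as large appears. This is a technique I am swapping in for illustration, not something the analysis above ran. The function name and the toy tips are hypothetical.

```python
import numpy as np

def permutation_p_value(a, b, n_resamples=5000, seed=0):
    """Two-sided permutation test for a difference in group means."""
    rng = np.random.default_rng(seed)
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    observed = abs(a.mean() - b.mean())
    pooled = np.concatenate([a, b])
    hits = 0
    for _ in range(n_resamples):
        shuffled = rng.permutation(pooled)  # reassign group labels at random
        diff = abs(shuffled[:a.size].mean() - shuffled[a.size:].mean())
        hits += int(diff >= observed)
    # Add-one correction keeps the estimate strictly above zero.
    return (hits + 1) / (n_resamples + 1)

# Toy data (hypothetical tips, not rows from the dataset).
smokers = [3.0, 2.5, 4.0, 3.5, 2.0, 3.0]
non_smokers = [2.8, 2.6, 3.6, 3.4, 2.2, 3.0]

p = permutation_p_value(smokers, non_smokers)
print(f"p = {p:.3f}")
```

A large p-value would mean that random label shuffles often produce a gap this big, which is consistent with the "random chance" explanation. It would not prove that there is no effect.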
📈 PHASE 3: CORRELATION ANALYSIS
What I Did:
# Correlation matrix
correlation_matrix = data_clean[['total_bill', 'tip', 'size']].corr()
sns.heatmap(correlation_matrix, annot=True, cmap='coolwarm')
Results:
| Pair | Correlation | Strength | Interpretation |
|---|---|---|---|
| total_bill ↔ tip | 0.67 | Strong | 🏆 Strongest predictor |
| size ↔ total_bill | 0.60 | Medium-Strong | More people = more food |
| size ↔ tip | 0.49 | Medium-Weak | Non-linear (plateaus) |
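Why would percentage-based tipping give r = 0.67 rather than r = 1? A small simulation helps: a fixed tip rate produces a perfect correlation, and letting the rate vary between tables pulls it below 1. The bills, the 16% rate, and the spread of the rate are all assumptions I picked for illustration, not estimates from the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated bills (hypothetical, not the dataset).
bills = rng.uniform(5, 50, size=500)

# Case 1: everyone tips exactly 16% of the bill.
fixed_tips = 0.16 * bills

# Case 2: the tip rate varies from table to table around 16%.
rates = rng.normal(loc=0.16, scale=0.06, size=500)
varying_tips = rates * bills

r_fixed = np.corrcoef(bills, fixed_tips)[0, 1]
r_varying = np.corrcoef(bills, varying_tips)[0, 1]

print(f"fixed rate:   r = {r_fixed:.2f}")
print(f"varying rate: r = {r_varying:.2f}")
```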
Key Insight:
"Tips are set mostly as a percentage of the total bill" - this fits the 0.67 correlation. The percentage varies from customer to customer, which is why r is below 1.0.
🎨 PHASE 4: PAIRPLOT (Visual Summary)
What I Did:
sns.pairplot(data_clean,
             vars=['total_bill', 'tip', 'size'],
             hue='time',  # Color by lunch/dinner
             diag_kind='hist')
Observations:
From diagonal (distributions):
- Most common tip: ~$2-3
- Most common bill: ~$15-20
- Most common party size: 2 people
From scatter plots:
- total_bill vs tip: Clear upward trend (confirms r=0.67)
- size vs others: Grouped patterns (discrete variable)
- Lunch (blue) vs Dinner (orange): Overlap mostly, but dinner shifts slightly higher
Overall Impression: Relationships are SOMEWHAT CLEAR - not perfect, but strong enough to be meaningful
🎯 KEY FINDINGS SUMMARY
Strongest Predictors (Ranked):
1. Total Bill (r=0.67) 🥇
- Explains ~45% of tip variation (r² = 0.67² ≈ 0.45)
- Clear linear relationship
- Percentage-based tipping (15-20%)
2. Time of Day ($0.80 difference) 🥈
- Dinner tips $0.80 more than lunch
- Strongest categorical effect
- Reflects rushed vs relaxed dining
3. Party Size (r=0.49 with tip) 🥉
- Medium effect, but NON-LINEAR
- Plateaus at size 5-6
- Different behavior for large groups
4. Day of Week ($0.75 difference)
- Saturday highest ($3.00)
- Weekdays lowest (~$2.25)
- Weekend vs weekday effect
5. Sex ($0.20 difference) - WEAK
- Minimal difference
- Nearly equal tipping
6. Smoker Status ($0.20 difference) - WEAK
- Minimal difference
- Unexplained pattern
💡 BUSINESS RECOMMENDATIONS
Based on data analysis, restaurant owners should:
1. FOCUS ON INCREASING BILL AMOUNT 🎯
Why: Strongest correlation (0.67) - higher bills directly lead to higher tips
Actions:
- Upsell appetizers, drinks, desserts
- Create combo deals that increase bill
- Train servers on suggestive selling
- Offer premium menu items
Expected Impact: 10% increase in average bill → ~10% increase in tips, assuming customers keep tipping the same percentage
2. PRIORITIZE DINNER SERVICE 🌙
Why: Dinner tips $0.80 (36%) more than lunch
Actions:
- Allocate best servers to dinner shift
- Focus marketing on dinner hours
- Create special dinner ambiance
- Dinner-specific promotions
Expected Impact: Shift focus to higher-margin time period
3. OPTIMIZE FOR PARTY SIZES 2-4 👥
Why: These sizes have best tip-to-effort ratio
Actions:
- Table arrangements favor 2-4 person parties
- Special deals for couples/small groups
- Don't overinvest in large party accommodations (tips plateau)
Expected Impact: Maximize tips per table/server time
4. WEEKEND FOCUS 📅
Why: Saturday/Sunday have higher tips
Actions:
- Premium staffing on weekends
- Weekend specials/events
- Higher-end menu items on weekends
5. DON'T DISCRIMINATE BY SEX/SMOKER ⚖️
Why: These factors have minimal effect
Insight: Treat all customers equally - demographics don't significantly predict tipping
🧠 PERSONAL LEARNING JOURNEY
What I Learned About Data Science:
1. The Scientific Method Works!
- Make hypothesis → Test → Analyze → Conclude
- Being wrong is GOOD - that's how we learn!
- Quote: "Learning happens with mistakes"
2. Hypotheses Can Be Wrong
My Wrong Predictions:
- ❌ Thought tipping was "keep the change" → Actually percentage-based
- ❌ Thought Sunday would have highest tips → Actually Saturday
- ❌ Thought males tip more → Actually nearly equal
- ❌ Thought smokers tip less → Actually slightly more
Lesson: Don't trust assumptions - test with data!
3. Correlation ≠ Causation
- Smokers tip more, but WHY?
- Could be confounding variables (age, location, etc.)
- Need more data to understand mechanisms
4. Context Matters
- Outliers aren't always errors
- $10 tip on $50 bill = 20% (normal!)
- Always calculate percentages/ratios for context
5. Data Quality First
- Clean data = reliable analysis
- Check for: missing values, duplicates, outliers
- This dataset was excellent (0 missing!)
6. Visualization is Powerful
- Scatter plots → see relationships
- Box plots → compare groups
- Correlation matrix → see everything at once
- Pairplot → ultimate summary
7. Different Charts for Different Data
- Numerical vs Numerical → Scatter plot
- Categorical vs Numerical → Box plot / Bar chart
- All at once → Pairplot, Correlation matrix
What I Learned About Python/Tools:
Python Libraries:
import pandas as pd # Data manipulation
import numpy as np # Numerical operations
import matplotlib.pyplot as plt # Basic plotting
import seaborn as sns # Statistical plotting
Key Functions Mastered:
Pandas:
data.head() # First 5 rows
data.shape # Dimensions
data.describe() # Statistics
data.isnull().sum() # Missing values
data.duplicated().sum() # Duplicates
data.drop_duplicates() # Remove duplicates
data['column'] # Select column
data[condition] # Filter rows
data.groupby('day')['tip'].mean() # Group and aggregate (groupby needs a column name)
data.corr(numeric_only=True) # Correlation matrix (numeric columns only)
Matplotlib:
plt.scatter(x, y) # Scatter plot
plt.xlabel() # X-axis label
plt.ylabel() # Y-axis label
plt.title() # Title
plt.grid() # Grid lines
plt.show() # Display plot
Seaborn:
sns.boxplot(x='category', y='number', data=df)
sns.heatmap(corr_matrix, annot=True)
sns.pairplot(df, vars=['col1','col2'], hue='category')
Important Concepts:
1. Order Matters!
# WRONG: outliers is created before tip_pct exists, so it never gets the column
outliers = data[data['tip'] > 5]
data['tip_pct'] = data['tip'] / data['total_bill']
print(outliers['tip_pct']) # KeyError: 'tip_pct'
# RIGHT: add the column first, then filter
data['tip_pct'] = data['tip'] / data['total_bill']
outliers = data[data['tip'] > 5]
print(outliers['tip_pct']) # WORKS!
2. Matplotlib vs Seaborn:
- matplotlib = Low-level, flexible, more code
- seaborn = High-level, easy, pretty defaults
- Use both together!
3. Data Types Matter:
- float64 / int64 = Can do math
- object = Text, can't do math
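When a column that should be numeric was read in as text, `pd.to_numeric` can convert it. Here is a minimal sketch on a toy frame; the messy `tip` values are invented for illustration, since the tips dataset itself did not need this step.

```python
import pandas as pd

# Toy data (hypothetical): 'tip' was read in as text, e.g. from a messy CSV.
toy = pd.DataFrame({
    "total_bill": [10.0, 20.0, 30.0],
    "tip": ["1.50", "3.00", "n/a"],
})

print(toy.dtypes)  # inspect the type of every column before doing math

# errors='coerce' turns unparseable entries into NaN instead of raising.
toy["tip"] = pd.to_numeric(toy["tip"], errors="coerce")

print(toy["tip"].mean())  # NaN values are skipped by default
```

Coercing creates missing values, so it is worth re-running `isnull().sum()` afterwards.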
Skills I Developed:
✅ Technical Skills:
- Data cleaning and preprocessing
- Exploratory data analysis
- Statistical thinking
- Data visualization
- Python programming
- Using Jupyter notebooks
✅ Analytical Skills:
- Hypothesis formation
- Pattern recognition
- Critical thinking
- Drawing insights from data
- Making business recommendations
✅ Soft Skills:
- Scientific method application
- Intellectual honesty ("I don't know")
- Learning from mistakes
- Persistence through challenges
- Clear communication of findings
📊 COMPLETE VISUALIZATIONS CREATED
- ✅ Scatter Plot: total_bill vs tip
- ✅ Scatter Plot: size vs total_bill
- ✅ Scatter Plot: size vs tip
- ✅ Box Plot: tip by day of week
- ✅ Box Plot: tip by time (lunch/dinner)
- ✅ Box Plot: tip by sex
- ✅ Box Plot: tip by smoker status
- ✅ Correlation Matrix Heatmap
- ✅ Pairplot (all relationships)
Total: 9 professional visualizations
🎓 FINAL REFLECTION
What Worked Well:
- Systematic approach (hypothesis → test → analyze)
- Using appropriate visualizations for each relationship
- Being open to being wrong
- Thorough data cleaning before analysis
What I'd Do Differently:
- Could explore interaction effects (e.g., day × time)
- Could calculate tip percentages earlier for context
- Could test non-linear relationships more formally
Most Surprising Finding:
"Party size plateaus at 5-6 people!"
- Expected linear relationship
- Discovered real-world behavioral economics
- Shows the value of looking at data, not just assumptions
Most Important Lesson:
"Total amount of spending is the determining factor"
- Simple but powerful
- Actionable for businesses
- Data-driven decision making
End of Journey Summary
"The only real mistake is the one from which we learn nothing." - Henry Ford
"In God we trust. All others must bring data." - W. Edwards Deming