The Bottleneck in Exploratory Data Analysis (EDA)
When performing EDA on massive datasets, a common anti-pattern is pulling the entire dataset into memory (a Pandas DataFrame) just to calculate basic statistics or plot a graph. This approach leads to Out-Of-Memory (OOM) errors and skyrocketing cloud costs.

As a data engineer focused on statistical rigor and system reliability, my approach is to push the math down to the database layer and extract only what is mathematically necessary for visualization.

In this post, I will demonstrate how to analyze the relationship between trip distance and tip amounts in the chicago_taxi_trips dataset (hundreds of millions of rows) by combining BigQuery's native statistical functions with Python's Seaborn library.
Step 1: Compute the Pearson Correlation in BigQuery
Instead of downloading data to calculate correlation, we can use BigQuery's CORR() function. This computes the Pearson correlation coefficient ($r$) across the entire population natively in the distributed system.
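For reference, CORR(x, y) implements the standard Pearson formula over all qualifying rows:

$$r = \frac{\sum_i (x_i - \bar{x})(y_i - \bar{y})}{\sqrt{\sum_i (x_i - \bar{x})^2}\,\sqrt{\sum_i (y_i - \bar{y})^2}}$$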
SQL
-- Calculate Pearson correlation across the entire dataset
SELECT
CORR(trip_miles, tips) AS pearson_correlation
FROM
`bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
trip_miles > 0 AND tips > 0;
Result: The query returns approximately 0.48, indicating a moderate positive correlation.
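As a sanity check, the same statistic can be reproduced locally on a sample with NumPy. The sketch below uses synthetic data (the column names mirror the query, but the values are generated, not taken from the taxi dataset):

```python
import numpy as np

# Synthetic stand-in for a sampled (trip_miles, tips) pair set
rng = np.random.default_rng(42)
trip_miles = rng.exponential(scale=5.0, size=10_000)
tips = 0.5 * trip_miles + rng.normal(scale=2.0, size=10_000)

# np.corrcoef returns the 2x2 correlation matrix; [0, 1] is Pearson's r
r = np.corrcoef(trip_miles, tips)[0, 1]
print(f"Pearson r on the synthetic sample: {r:.2f}")
```

The point is that CORR() in SQL and np.corrcoef() in Python compute the same quantity; the difference is only where the computation happens.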
Step 2: Unbiased Random Sampling for Visualization
Plotting millions of points on a scatter plot is visually useless (overplotting) and computationally expensive. To visualize the trend accurately, we must extract a statistically unbiased sample.

We achieve this with BigQuery's RAND() function, performing random sampling directly in the SQL layer.
SQL
-- Extract a 0.1% random sample for unbiased visualization
SELECT
trip_miles,
tips
FROM
`bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE
trip_miles > 0 AND tips > 0
AND RAND() < 0.001
LIMIT 1000;
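The RAND() < 0.001 predicate is equivalent to an independent Bernoulli(0.001) coin flip per row, so the expected sample size is about 0.1% of the filtered population. A quick local simulation illustrates this (the row count here is an arbitrary stand-in, not the actual table size):

```python
import random

random.seed(7)
population = 1_000_000  # stand-in for the filtered row count
p = 0.001               # same threshold as RAND() < 0.001

# Each row is kept independently with probability p, as in the SQL predicate
sample_size = sum(1 for _ in range(population) if random.random() < p)
print(f"Sampled {sample_size} of {population} rows (~{100 * p}% expected)")
```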
Step 3: Visualization in Vertex AI Workbench
Once the unbiased sample is extracted, we load it into a Pandas DataFrame and use Seaborn to inspect the data distribution visually.
Python
from google.cloud import bigquery
import seaborn as sns
import matplotlib.pyplot as plt
# Initialize BigQuery client
client = bigquery.Client()
query = """
SELECT trip_miles, tips
FROM `bigquery-public-data.chicago_taxi_trips.taxi_trips`
WHERE trip_miles > 0 AND tips > 0 AND RAND() < 0.001
LIMIT 1000
"""
# Load the sampled data into memory
df = client.query(query).to_dataframe()
# Plotting the scatter plot
sns.set_theme(style="whitegrid")
plt.figure(figsize=(10, 6))
# Using alpha=0.5 to identify density and overlapping points
sns.scatterplot(x='trip_miles', y='tips', data=df, alpha=0.5, color='steelblue')
plt.title('Relationship between Trip Distance and Tip Amount', fontsize=14, pad=15)
plt.xlabel('Trip Miles', fontsize=12)
plt.ylabel('Tips ($)', fontsize=12)
plt.show()
Statistical Findings & Engineering Takeaways
- Correlation Trend: The scatter plot visually confirms the moderate positive correlation ($r \approx 0.48$) calculated in BigQuery. As trip_miles increases, tips generally increase.
- Outlier Detection: We can visually identify anomalies, such as extreme tips (>$50) for very short distances (<2 miles). From a QA and data integrity perspective, these outliers require further investigation (e.g., system glitches or premium services).
- Architectural Efficiency: By relying on BigQuery for population-level statistical computation and limiting Pandas/Seaborn to sample visualization, we ensure mathematical accuracy while eliminating the risk of memory overflow.

Building reliable data pipelines requires not just writing code, but understanding where the computation should logically occur.
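As a closing illustration, the outlier check mentioned in the takeaways can be expressed as a simple boolean mask on the sampled DataFrame. The sketch below uses a tiny hand-made frame with the thresholds from the text ($50 tip, 2 miles); the values are illustrative only:

```python
import pandas as pd

# Tiny hand-made sample standing in for the BigQuery extract
df = pd.DataFrame({
    "trip_miles": [0.8, 1.5, 12.0, 25.0],
    "tips":       [60.0, 3.0, 10.0, 55.0],
})

# Flag suspicious rows: very short trips with extreme tips
mask = (df["tips"] > 50) & (df["trip_miles"] < 2)
outliers = df[mask]
print(outliers)
```

Rows matching the mask are candidates for manual review rather than automatic deletion; an extreme tip may be a data-entry glitch or a legitimate premium ride.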