
amal org

Data Analyst Guide: Mastering Docker + AWS for Junior Data Analyst CV

Business Problem Statement

Companies generate vast amounts of data, and the ability to analyze it and extract insights is crucial for informed business decisions. For a junior data analyst, a strong foundation in data analysis, machine learning, and cloud computing is essential for career growth. In this tutorial, we will explore how to use Docker and AWS for data analysis, working through a real-world scenario in which a company increases its revenue by an estimated 15% by optimizing its pricing strategy.

Real Scenario:
A retail company wants to optimize its pricing strategy to increase revenue. The company has a large dataset of customer transactions, including product information, customer demographics, and sales data. The goal is to analyze this data to identify trends, patterns, and correlations that can inform pricing decisions.

ROI Impact:
By optimizing its pricing strategy using data analysis, the company can increase its revenue by 15%, resulting in an additional $1.5 million in annual revenue.

Step-by-Step Technical Solution

Step 1: Data Preparation (pandas/SQL)

First, we need to prepare the data for analysis. We will use pandas to load and manipulate the data, and SQL to query the data.

import pandas as pd
import numpy as np

# Load data
data = pd.read_csv('customer_transactions.csv')

# Handle missing values (numeric_only avoids errors when the frame contains text columns)
data.fillna(data.mean(numeric_only=True), inplace=True)

# Convert categorical variables to numerical variables
data['product_category'] = pd.Categorical(data['product_category']).codes

# Save data to SQLite database
import sqlite3
conn = sqlite3.connect('customer_transactions.db')
data.to_sql('customer_transactions', conn, if_exists='replace', index=False)
conn.close()

SQL query to create a table for customer transactions:

CREATE TABLE customer_transactions (
    id INTEGER PRIMARY KEY,
    product_id INTEGER,
    customer_id INTEGER,
    product_category INTEGER,
    sales_date DATE,
    sales_amount REAL
);

Step 2: Analysis Pipeline

Next, we will create an analysis pipeline to extract insights from the data. We will use scikit-learn to build a regression model to predict sales amount based on product category and customer demographics.

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Load data from SQLite database
import sqlite3
conn = sqlite3.connect('customer_transactions.db')
data = pd.read_sql_query('SELECT * FROM customer_transactions', conn)
conn.close()

# Split data into training and testing sets
# Note: customer_id is an identifier, not a true predictor; in a real project,
# use actual demographic columns (age, region, income band, etc.) as features
X = data[['product_category', 'customer_id']]
y = data['sales_amount']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Build regression model
model = LinearRegression()
model.fit(X_train, y_train)

# Make predictions on testing set
y_pred = model.predict(X_test)

# Evaluate model performance
mse = mean_squared_error(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')

Step 3: Model/Visualization Code

We will use Matplotlib and Seaborn to visualize the data and the model's predictions.

import matplotlib.pyplot as plt
import seaborn as sns

# Plot sales amount distribution
sns.histplot(data['sales_amount'], kde=True)
plt.title('Sales Amount Distribution')
plt.xlabel('Sales Amount')
plt.ylabel('Frequency')
plt.show()

# Plot predicted sales amount vs actual sales amount
plt.scatter(y_test, y_pred)
plt.xlabel('Actual Sales Amount')
plt.ylabel('Predicted Sales Amount')
plt.title('Predicted vs Actual Sales Amount')
plt.show()

Step 4: Performance Evaluation

We will evaluate the model's performance using metrics such as mean squared error, mean absolute error, and R-squared.

from sklearn.metrics import mean_absolute_error, r2_score

# Evaluate model performance
mse = mean_squared_error(y_test, y_pred)
mae = mean_absolute_error(y_test, y_pred)
r2 = r2_score(y_test, y_pred)
print(f'Mean Squared Error: {mse:.2f}')
print(f'Mean Absolute Error: {mae:.2f}')
print(f'R-squared: {r2:.2f}')
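The deployment step below loads a serialized model from model.pkl, but nothing so far has written that file. One way to produce it, sketched here with a small stand-in model in place of the one trained in Step 2, is joblib:

```python
import joblib
import numpy as np
from sklearn.linear_model import LinearRegression

# Small stand-in model; replace with the LinearRegression fitted in Step 2
X = np.array([[0, 1], [1, 0], [2, 2], [3, 1]])
y = np.array([4.0, 3.0, 11.0, 10.0])  # y = 2*x1 + 3*x2 + 1
model = LinearRegression().fit(X, y)

# Serialize the fitted model for the Flask service to load
joblib.dump(model, 'model.pkl')

# Verify the model round-trips through the file
restored = joblib.load('model.pkl')
print(restored.predict([[4, 5]]))
```

The same joblib.dump call applied to the model from Step 2 produces the model.pkl that the Docker image below bundles.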

Step 5: Production Deployment

We will deploy the model to a production environment using Docker and AWS.

Dockerfile:

FROM python:3.9-slim

# Set working directory
WORKDIR /app

# Copy requirements file
COPY requirements.txt .

# Install dependencies
RUN pip install -r requirements.txt

# Copy application code
COPY . .

# Expose port
EXPOSE 8000

# Run command
CMD ["python", "app.py"]

app.py:

import joblib
import pandas as pd
from flask import Flask, request, jsonify

app = Flask(__name__)

# Load model (sklearn.externals.joblib was removed in scikit-learn 0.23; import joblib directly)
model = joblib.load('model.pkl')

@app.route('/predict', methods=['POST'])
def predict():
    # Expects a JSON object with the model's feature names as keys,
    # e.g. {"product_category": 2, "customer_id": 1001}
    payload = request.get_json()
    features = pd.DataFrame([payload])
    prediction = model.predict(features)
    # Convert the numpy result to a plain float so it is JSON-serializable
    return jsonify({'prediction': float(prediction[0])})

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8000)

AWS Deployment:

  1. Create an AWS account and set up an EC2 instance.
  2. Install Docker on the EC2 instance.
  3. Build the Docker image using the Dockerfile.
  4. Push the Docker image to Amazon ECR.
  5. Create an AWS ECS cluster and task definition.
  6. Deploy the Docker container to the ECS cluster.
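Steps 3 and 4 above can be sketched as shell commands. The account ID, region, and repository name below are placeholders; substitute your own, and note that the ECR repository must already exist:

```shell
# Hypothetical values: replace with your own account ID, region, and repo name
ACCOUNT_ID=123456789012
REGION=us-east-1
REPO=pricing-model

# Build the image from the Dockerfile above
docker build -t $REPO .

# Authenticate Docker to Amazon ECR
aws ecr get-login-password --region $REGION \
  | docker login --username AWS --password-stdin $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com

# Tag and push the image to ECR
docker tag $REPO:latest $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO:latest
docker push $ACCOUNT_ID.dkr.ecr.$REGION.amazonaws.com/$REPO:latest
```

The pushed image URI is what the ECS task definition in steps 5 and 6 references.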

Metrics/ROI Calculations:

  • Revenue increase: 15%
  • Additional annual revenue: $1.5 million
  • ROI: 300% (assuming a $500,000 investment in data analysis and deployment)
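As a sanity check, the figures above follow from simple arithmetic; the $10 million baseline revenue is an assumption implied by the 15% and $1.5 million numbers:

```python
baseline_revenue = 10_000_000   # implied by the figures above: 15% of $10M = $1.5M
uplift_rate = 0.15
investment = 500_000            # assumed cost of the analysis and deployment

additional_revenue = baseline_revenue * uplift_rate
roi_pct = additional_revenue / investment * 100  # gain relative to investment

print(f'Additional annual revenue: ${additional_revenue:,.0f}')  # $1,500,000
print(f'ROI: {roi_pct:.0f}%')                                    # 300%
```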

Edge Cases:

  • Handling missing values in the data
  • Handling outliers in the data
  • Handling changes in the data distribution over time
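A minimal sketch of all three edge cases, using a toy frame with an assumed sales_amount column, median imputation, 1.5 * IQR outlier clipping, and a simple mean-shift drift check (the training-time mean and alert threshold are assumptions):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for the transactions data
data = pd.DataFrame({
    'sales_amount': [10.0, 12.0, 11.0, np.nan, 500.0, 13.0, 9.0, 11.5],
})

# Missing values: impute with the median, which is robust to outliers
data['sales_amount'] = data['sales_amount'].fillna(data['sales_amount'].median())

# Outliers: clip to the 1.5 * IQR fences
q1, q3 = data['sales_amount'].quantile([0.25, 0.75])
iqr = q3 - q1
data['sales_amount'] = data['sales_amount'].clip(q1 - 1.5 * iqr, q3 + 1.5 * iqr)

# Drift: compare the current mean against a mean stored at training time
training_mean = 11.0  # assumed value captured when the model was trained
drift = abs(data['sales_amount'].mean() - training_mean) / training_mean
if drift > 0.25:  # assumed alert threshold
    print('Warning: sales_amount distribution has shifted; consider retraining')
```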

Scaling Tips:

  • Use distributed computing frameworks such as Apache Spark or Dask to scale the data analysis pipeline.
  • Use cloud-based services such as AWS SageMaker or Google Cloud AI Platform to scale the model deployment.
  • Use containerization frameworks such as Docker to scale the application deployment.
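Before reaching for Spark or Dask, pandas can stream a file that does not fit in memory via chunked reading, aggregating incrementally; a Dask version of the same pipeline uses nearly identical syntax. A sketch, using an in-memory CSV as a stand-in for the large transactions file:

```python
import io
import pandas as pd

# Stand-in for a large CSV (in practice, pass the customer_transactions.csv path)
csv = io.StringIO('sales_amount\n' + '\n'.join(str(i) for i in range(100)))

# Stream the file in chunks and aggregate incrementally instead of loading it all
total, count = 0.0, 0
for chunk in pd.read_csv(csv, chunksize=25):
    total += chunk['sales_amount'].sum()
    count += len(chunk)

print(f'mean sales_amount: {total / count:.2f}')  # 49.50
```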
