DEV Community

Cover image for YouTube Data Processing Pipeline
Lagat Josiah
Lagat Josiah

Posted on

YouTube Data Processing Pipeline

YouTube Data Processing Pipeline

A comprehensive data pipeline that extracts, processes, and analyzes YouTube channel statistics using modern data engineering tools and practices.

πŸš€ Overview

This pipeline automates the collection and analysis of YouTube channel data, providing insights into channel performance, subscriber growth, and content metrics. The system is built with scalability and reproducibility in mind, using industry-standard tools for data orchestration, storage, and analysis.

πŸ“Š Pipeline Architecture

YouTube API β†’ Airflow (Orchestration) β†’ PostgreSQL (Storage) β†’ Jupyter (Analysis)
Enter fullscreen mode Exit fullscreen mode

Components:

  • Airflow: Workflow orchestration and scheduling
  • PostgreSQL: Relational data storage
  • Jupyter: Data analysis and visualization
  • Docker: Containerization and service management

πŸ› οΈ Tech Stack

Component Technology Purpose
Orchestration Apache Airflow 2.8.1 Workflow scheduling & monitoring
Database PostgreSQL 13 Data storage & persistence
Analysis Jupyter Lab Data exploration & visualization
Containerization Docker & Docker Compose Environment management
Language Python 3.8+ Data processing & API integration

πŸ“ Project Structure

youtube-pipeline/
β”œβ”€β”€ docker-compose.yml          # Main service configuration
β”œβ”€β”€ start-youtube-pipeline.sh   # Quick start script
β”œβ”€β”€ airflow/
β”‚   β”œβ”€β”€ dags/
β”‚   β”‚   └── youtube_pipeline.py # Main data extraction DAG
β”‚   β”œβ”€β”€ logs/                   # Airflow execution logs
β”‚   └── plugins/                # Custom Airflow plugins
β”œβ”€β”€ jupyter/
β”‚   └── notebooks/
β”‚       └── youtube_analysis.ipynb  # Data analysis notebook
β”œβ”€β”€ postgres_data/              # Database volume
└── backup/                     # Backup directory for configurations
Enter fullscreen mode Exit fullscreen mode

πŸš€ Quick Start

Prerequisites

  • Docker & Docker Compose
  • YouTube Data API v3 key

Installation & Setup

  1. Clone and setup the project:
   git clone <repository-url>
   cd youtube-pipeline
Enter fullscreen mode Exit fullscreen mode
  1. Set up YouTube API credentials:
   # Set your YouTube API key in Airflow
   docker-compose exec airflow airflow variables set YOUTUBE_API_KEY "your_api_key_here"
Enter fullscreen mode Exit fullscreen mode
  1. Start the pipeline:
   # Use the quick start script
   ./start-youtube-pipeline.sh

   # Or manually
   docker-compose up -d
Enter fullscreen mode Exit fullscreen mode
  1. Access the services:

πŸ“ˆ Pipeline Features

Data Collection

  • Automated Extraction: Scheduled every 6 hours
  • Multiple Channels: Configurable channel monitoring
  • Comprehensive Metrics:
    • Subscriber counts
    • Total views
    • Video counts
    • Channel metadata
    • Historical tracking

Data Processing

  • Incremental Updates: UPSERT operations to avoid duplicates
  • Error Handling: Graceful failure recovery
  • Data Validation: Type checking and constraint enforcement

Monitoring & Alerting

  • Airflow Dashboard: Real-time pipeline monitoring
  • Task Logs: Detailed execution logs
  • Health Checks: Service status monitoring

πŸ”§ Configuration

YouTube Channels

Edit airflow/dags/youtube_pipeline.py to modify monitored channels:

CHANNEL_IDS = [
    "UC_x5XG1OV2P6uZZ5FSM9Ttw",  # Google for Developers
    "UCq-Fj5jknLsU6-M0R3PiHcA",  # YouTube Creators
    "UCBJycsmduvYEL83R_U4JriQ",  # Marques Brownlee
    # Add more channel IDs here
]
Enter fullscreen mode Exit fullscreen mode

Scheduling

Modify the DAG schedule in youtube_pipeline.py:

schedule_interval=timedelta(hours=6)  # Change as needed
Enter fullscreen mode Exit fullscreen mode

πŸ“Š Data Schema

channel_stats Table

Column Type Description
channel_id VARCHAR(255) PRIMARY KEY YouTube channel ID
channel_title TEXT Channel name
channel_description TEXT Channel description
channel_created_at TIMESTAMP Channel creation date
total_views BIGINT Lifetime views
subscriber_count BIGINT Current subscribers
video_count BIGINT Total videos published
processed_at TIMESTAMP Data collection timestamp

video_stats Table

Column Type Description
video_id VARCHAR(255) PRIMARY KEY YouTube video ID
video_title TEXT Video title
channel_id VARCHAR(255) Associated channel
view_count BIGINT Video views
like_count BIGINT Video likes
comment_count BIGINT Number of comments
published_at TIMESTAMP Video publish date
processed_at TIMESTAMP Data collection timestamp

🎯 Usage Examples

Data Analysis in Jupyter

# Connect to database
import psycopg2
import pandas as pd

conn = psycopg2.connect(
    host="postgres",
    database="youtube_db", 
    user="postgres",
    password="password",
    port=5432
)

# Load channel data
df = pd.read_sql("SELECT * FROM channel_stats", conn)
print(f"Tracking {len(df)} YouTube channels")
Enter fullscreen mode Exit fullscreen mode

Common Queries

-- Channel growth analysis
SELECT channel_title, subscriber_count, total_views,
       ROUND(total_views::numeric / subscriber_count, 2) as views_per_subscriber
FROM channel_stats 
ORDER BY subscriber_count DESC;

-- Historical trends (when multiple runs exist)
SELECT channel_title, subscriber_count, processed_at
FROM channel_stats 
WHERE channel_id = 'UC_x5XG1OV2P6uZZ5FSM9Ttw'
ORDER BY processed_at DESC;
Enter fullscreen mode Exit fullscreen mode

πŸ”„ Operations

Starting Services

./start-youtube-pipeline.sh
Enter fullscreen mode Exit fullscreen mode

Stopping Services

docker-compose down
docker stop jupyter
Enter fullscreen mode Exit fullscreen mode

Monitoring

# Check service status
docker-compose ps

# View logs
docker-compose logs airflow
docker logs jupyter

# Check pipeline runs
docker-compose exec airflow airflow dags list-runs -d youtube_data_pipeline
Enter fullscreen mode Exit fullscreen mode

Maintenance

# Backup database
docker-compose exec postgres pg_dump -U postgres youtube_db > backup.sql

# Update channel list
# Edit airflow/dags/youtube_pipeline.py

# Check API quota usage
# Monitor Google Cloud Console
Enter fullscreen mode Exit fullscreen mode

πŸ›‘οΈ Security Considerations

  • API keys stored in Airflow Variables (encrypted)
  • Database credentials in Docker environment variables
  • Network isolation between services
  • Regular dependency updates

πŸ“ˆ Scaling & Extensions

Potential Enhancements

  • Real-time Processing: Add Kafka for stream processing
  • Data Warehouse: Integrate with BigQuery or Snowflake
  • Monitoring: Add Prometheus/Grafana for metrics
  • Alerting: Slack/email notifications for anomalies
  • ML Integration: Predictive analytics for growth trends

Performance Optimization

  • Batch Processing: Group API calls to reduce quota usage
  • Caching: Implement Redis for frequently accessed data
  • Partitioning: Time-based partitioning of video stats
  • Indexing: Optimize database query performance

πŸ› Troubleshooting

Common Issues

  1. YouTube API Quota Exceeded

    • Solution: Monitor usage in Google Cloud Console
    • Alternative: Reduce update frequency or request quota increase
  2. Airflow Database Connection Issues

   # Reset database connection
   docker-compose restart airflow
Enter fullscreen mode Exit fullscreen mode
  1. Jupyter Not Accessible
   # Restart Jupyter container
   docker restart jupyter
Enter fullscreen mode Exit fullscreen mode
  1. Missing YouTube Data
    • Check API key validity
    • Verify channel IDs are correct
    • Review Airflow task logs

Logs & Debugging

# Airflow task logs
docker-compose exec airflow airflow tasks logs youtube_data_pipeline extract_youtube_data

# Service logs
docker-compose logs
docker logs jupyter
Enter fullscreen mode Exit fullscreen mode

🀝 Contributing

  1. Fork the repository
  2. Create a feature branch
  3. Make changes with tests
  4. Submit a pull request

Development Setup

# Create virtual environment
python -m venv venv
source venv/bin/activate

# Install dependencies
pip install -r requirements.txt

# Run tests
pytest tests/
Enter fullscreen mode Exit fullscreen mode

πŸ“„ License

This project is licensed under the MIT License - see the LICENSE file for details.

πŸ™ Acknowledgments

  • YouTube Data API v3 for data access
  • Apache Airflow community for orchestration tools
  • Jupyter project for analysis environment
  • Docker community for containerization

πŸ“ž Support

For issues and questions:

  1. Check troubleshooting section above
  2. Review service logs
  3. Open a GitHub issue
  4. Contact maintainers

Maintained by: Josiah Lagat

Last Updated: October 2024

Version: 1.0.0


⭐ Star this repo if you find it helpful!

Top comments (0)