How I Streamed Live Binance L2 Order Book Data on AWS for ~$10/Month

Kalu Goodness Chibueze

TL;DR

What we built: A fully automated Binance Level 2 order book streaming system on AWS Free Tier
Cost: Uses AWS credits (~$10/month worth) for 260K+ snapshots/day across BTC, ETH, SOL
Team: 6 WorldQuant University students with IAM-managed access
Tech stack: EC2 (t3.micro) + Python + S3 + CloudWatch + Systemd
Dataset size: ~100-130 GB (historical + 6 months live streaming)
Research focus: Liquidity stress detection in cryptocurrency markets
Result: 100% autonomous operation with zero manual intervention

We've all been there. You have a brilliant idea that needs collaboration and lots of data to succeed, and suddenly access and budget become your biggest obstacles.

Having decided to pick up my long-held interest in research again, I spoke to a couple of colleagues at WorldQuant University and we decided to just get it done. We wanted a fully open source project with a high-frequency finance dataset. But there were two big problems:

Each teammate needed access to ~30-50 GB of data. Downloading individually? Terrible for bandwidth and money.

We wanted the project to be truly reproducible, so anyone with the right technique could replicate it cheaply.

Our solution: AWS Free Tier + careful planning.


The Team

Our 6-person WorldQuant University research team:

  • Kalu Goodness (me)
  • Edet Joseph
  • Igboke Hannah
  • Ejike Uchenna
  • Kunde Godfrey
  • Fagbuyi Emmanuel

We're all students tackling high-frequency cryptocurrency market microstructure and liquidity stress detection.


The Plan

Static download: Grab historical trade data first for 5 key market periods (COVID crash, bull run, Terra collapse, post-FTX, ETF approval), roughly 30-50 GB in total

Live streaming: Continuous Level 2 order book snapshots every second for BTC/USDT, ETH/USDT, and SOL/USDT

Central S3 storage: Team members get read-only access via IAM

Budget & monitoring: Multi-threshold alerts and automatic stoppage to avoid surprises

Team IAM setup: Users in a group with policies attached for simplicity and safety

We used t3.micro EC2 in London for proximity and cost efficiency, running Python scripts in a virtual environment.


Part 1: EC2 Instance Setup

Instance Specifications:

Name: (our_s3_bucket)
Type: t3.micro (1 vCPU, 1GB RAM)
Storage: 20GB EBS (gp3)
Region: eu-west-2 (London - lowest latency to our team)
OS: Amazon Linux 2023
IAM Role: (ec2-s3-role)

Why London? Most of our team is in Europe/Africa, so this gave us the best latency.

Pro Tip: Always attach IAM roles to your EC2 instance during creation. Attaching them later sometimes requires stopping the instance, which interrupts your data collection.
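For reference, here is a hedged boto3 sketch of launching such an instance with the role attached at creation time. The AMI ID, key pair, security group, and Name tag below are placeholders, not our actual values:

import boto3

ec2 = boto3.client('ec2', region_name='eu-west-2')

# Launch a t3.micro with the IAM instance profile attached at creation,
# so the streamer never needs access keys on disk.
response = ec2.run_instances(
    ImageId='ami-xxxxxxxxxxxxxxxxx',            # placeholder: Amazon Linux 2023 AMI in eu-west-2
    InstanceType='t3.micro',
    MinCount=1,
    MaxCount=1,
    KeyName='my-key-pair',                      # placeholder
    SecurityGroupIds=['sg-xxxxxxxxxxxxxxxxx'],  # placeholder
    IamInstanceProfile={'Name': 'ec2-s3-role'}, # instance profile wrapping the role
    BlockDeviceMappings=[{
        'DeviceName': '/dev/xvda',
        'Ebs': {'VolumeSize': 20, 'VolumeType': 'gp3'},
    }],
    TagSpecifications=[{
        'ResourceType': 'instance',
        'Tags': [{'Key': 'Name', 'Value': 'binance-l2-streamer'}],  # placeholder name
    }],
)
print(response['Instances'][0]['InstanceId'])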

IAM Role Policy for EC2:

The EC2 instance needs to write to S3, push CloudWatch metrics, and create SNS alerts:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::our-s3-bucket",
                "arn:aws:s3:::our-s3-bucket/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData",
                "cloudwatch:PutMetricAlarm"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "sns:Publish"
            ],
            "Resource": "arn:aws:sns:*:*:binance-streamer-alerts"
        }
    ]
}

Important: Using IAM roles instead of access keys means no hardcoded credentials, automatic key rotation, easier auditing, and zero risk of accidentally committing secrets to GitHub.
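A quick sanity check that boto3 really is using the role (and not keys on disk): the call below resolves temporary credentials from the instance metadata service automatically.

import boto3

# No access keys anywhere: boto3 pulls temporary credentials from the
# EC2 instance metadata service via the attached role.
sts = boto3.client('sts', region_name='eu-west-2')
print(sts.get_caller_identity()['Arn'])  # should show an assumed-role ARN for ec2-s3-role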


Part 2: AWS Budgeting (The Most Important Part!)

Before writing any code, we set up comprehensive budget controls. This is not optional for student projects.

Quick note on AWS Free Tier changes: AWS recently changed how their free tier works. You now get $100 in credits when you sign up, plus $20 credits for trying out specific services like Lambda, EC2, RDS, and AWS Budgets. The big change is that on the free tier, you cannot be charged beyond your credits. It's a hard limit. Once you hit zero credits and don't top up, AWS emails you to upgrade and spins everything down if you don't comply. Some services are still free within limits, but most usage now draws from your credits immediately.

Budget: $50 worth of credits per month (we started with $180 in total credits)

Our 5-Threshold Alert System:

  1. 50% ($25): Email alert
  2. 80% ($40): Email + Slack notification
  3. 90% ($45): Email to all team members
  4. 95% ($47.50): Stop EC2 instance
  5. 100% ($50): Terminate EC2 instance

Critical Setup Note: Thresholds 4 and 5 require the EC2 instance to exist before creating the budget. Both the budget and EC2 instance must be in the same AWS region, or the actions won't trigger. This step caught me initially and cost me an hour of debugging!
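However you set it up, the email thresholds can also be scripted. Here is a minimal boto3 sketch; the account ID and email are placeholders, and the stop/terminate actions at thresholds 4 and 5 require AWS Budget Actions on top of this, which are not shown:

import boto3

budgets = boto3.client('budgets', region_name='us-east-1')  # Budgets uses a global endpoint
ACCOUNT_ID = '123456789012'                                  # placeholder
ALERT_EMAIL = 'your-email@example.com'                       # placeholder

budgets.create_budget(
    AccountId=ACCOUNT_ID,
    Budget={
        'BudgetName': 'binance-streamer-monthly',
        'BudgetLimit': {'Amount': '50', 'Unit': 'USD'},
        'TimeUnit': 'MONTHLY',
        'BudgetType': 'COST',
    },
    # Email-only thresholds (1-3); thresholds 4 and 5 need Budget Actions.
    NotificationsWithSubscribers=[
        {
            'Notification': {
                'NotificationType': 'ACTUAL',
                'ComparisonOperator': 'GREATER_THAN',
                'Threshold': pct,
                'ThresholdType': 'PERCENTAGE',
            },
            'Subscribers': [{'SubscriptionType': 'EMAIL', 'Address': ALERT_EMAIL}],
        }
        for pct in (50.0, 80.0, 90.0)
    ],
)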


Part 3: Static Historical Data Download

Before firing up the live streamer script, we ran a one-off Python script to grab historical Binance trade data.

What We Downloaded:

We targeted 5 specific market periods for our liquidity stress research:

Window 1 - COVID Crash (Feb-Jul 2020): 6 months

Window 2 - Bull Run (Nov 2020-Apr 2021): 6 months

Window 3 - Terra Collapse (Nov 2021-Apr 2022): 6 months

Window 4 - Post-FTX (Nov 2022-Apr 2023): 6 months

Window 5 - ETF Approval (Jul-Dec 2024): 6 months

Assets: BTC/USDT, ETH/USDT, SOL/USDT (note: SOL wasn't listed until Aug 2020)

Total: 44.6 GB compressed, 84 files

The script:

  • Downloads from Binance Data Vision API
  • Verifies SHA256 checksums
  • Uploads directly to S3
  • Deletes local copies to save disk space
  • Runtime: ~15-20 minutes

binance_trade_data.py (excerpt)

import os
import requests
import boto3
import hashlib
from tqdm import tqdm

s3_client = boto3.client('s3')
BASE_URL = "https://data.binance.vision/data/spot/monthly/trades"
ASSETS = ["BTCUSDT", "ETHUSDT", "SOLUSDT"]
S3_BUCKET = "our-s3-bucket"

def download_and_upload(url, local_path, s3_key, checksum_url):
    # Download with progress bar
    response = requests.get(url, stream=True)
    total_size = int(response.headers.get('content-length', 0))

    with open(local_path, 'wb') as f:
        with tqdm(total=total_size, unit='B', unit_scale=True) as pbar:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
                pbar.update(len(chunk))

    # Verify checksum
    checksum_response = requests.get(checksum_url)
    expected_checksum = checksum_response.text.strip().split()[0]

    sha256_hash = hashlib.sha256()
    with open(local_path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            sha256_hash.update(chunk)

    if sha256_hash.hexdigest() == expected_checksum:
        print("✓ Checksum verified")
        # Upload to S3
        s3_client.upload_file(local_path, S3_BUCKET, s3_key)
        os.remove(local_path)  # Delete local copy
        return True
    else:
        print("✗ Checksum mismatch!")
        return False
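The excerpt above is the worker; here is a rough sketch of the driver loop around it, assuming the standard data.binance.vision monthly layout (SYMBOL-trades-YYYY-MM.zip plus a matching .CHECKSUM file) and reusing the names defined above. The month list and S3 prefix are illustrative only:

# Illustrative driver for download_and_upload(); months shown cover Window 1 only.
WINDOW_1 = ['2020-02', '2020-03', '2020-04', '2020-05', '2020-06', '2020-07']

for symbol in ASSETS:
    if symbol == 'SOLUSDT':
        continue  # SOL wasn't listed until Aug 2020, so it has no Window 1 files
    for month in WINDOW_1:
        fname = f'{symbol}-trades-{month}.zip'
        url = f'{BASE_URL}/{symbol}/{fname}'
        checksum_url = f'{url}.CHECKSUM'
        s3_key = f'binance-historical/{symbol}/{fname}'   # illustrative prefix
        download_and_upload(url, f'/tmp/{fname}', s3_key, checksum_url)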

This gave us a clean historical baseline before starting the live stream.

Why these specific periods? Each window represents a major market event: COVID crash (liquidity crisis), Bull run (FOMO), Terra collapse (contagion), FTX collapse (systemic failure), and ETF approval (institutional adoption). Perfect for studying liquidity under stress.


Part 4: Live L2 Order Book Streamer

After the static data, we started the real-time snapshots using Python with multi-threading.

Streaming Configuration:

Interval: 1-second snapshots per symbol

Depth: Top 100 bid/ask levels

Symbols: BTC/USDT, ETH/USDT, SOL/USDT (running concurrently in separate threads)

API: Binance REST API (not WebSocket - simpler and more reliable for our use case)

Output: S3 bucket with Hive-style partitioning

Why REST API instead of WebSocket?

  • Simpler implementation
  • 1-second granularity is sufficient for liquidity research
  • Complete order book snapshots (not deltas)
  • Automatic retries and error handling
  • No re-connection logic needed

S3 Data Structure:

s3://our-s3-bucket/binance-l2-data/
├── symbol=BTCUSDT/
│   └── year=2025/
│       └── month=10/
│           └── day=22/
│               └── hour=10/
│                   ├── 20251022_100001_123456.json
│                   └── 20251022_100002_234567.json
├── symbol=ETHUSDT/
└── symbol=SOLUSDT/

Each JSON file contains:

  • Full order book (100 levels)
  • Best bid/ask prices
  • Calculated metrics (spread, liquidity, imbalance)
  • Timestamp metadata

Streamer Core Logic

import json
import time
import boto3
import requests
import threading
from datetime import datetime

class BinanceL2Streamer:
    def __init__(self, symbol, s3_bucket, depth=100, interval=1):
        self.symbol = symbol
        self.s3_bucket = s3_bucket
        self.depth = depth
        self.interval = interval

        self.s3 = boto3.client('s3', region_name='eu-west-2')
        self.cloudwatch = boto3.client('cloudwatch', region_name='eu-west-2')
        self.api_url = "https://api.binance.com/api/v3/depth"

    def get_order_book(self):
        """Fetch current order book from Binance REST API"""
        params = {'symbol': self.symbol, 'limit': self.depth}
        response = requests.get(self.api_url, params=params, timeout=5)
        response.raise_for_status()
        return response.json()

    def enrich_snapshot(self, raw_data):
        """Calculate derived metrics"""
        bids = raw_data.get('bids', [])
        asks = raw_data.get('asks', [])

        best_bid = float(bids[0][0]) if bids else 0
        best_ask = float(asks[0][0]) if asks else 0
        spread = best_ask - best_bid
        mid_price = (best_bid + best_ask) / 2

        # Calculate liquidity at different depths
        bid_liquidity_5 = sum(float(b[0]) * float(b[1]) for b in bids[:5])
        ask_liquidity_5 = sum(float(a[0]) * float(a[1]) for a in asks[:5])

        return {
            'symbol': self.symbol,
            'timestamp': datetime.utcnow().isoformat(),
            'bids': bids,
            'asks': asks,
            'metrics': {
                'mid_price': mid_price,
                'spread': spread,
                'spread_bps': (spread / mid_price * 10000) if mid_price else 0,
                'bid_liquidity_5': bid_liquidity_5,
                'ask_liquidity_5': ask_liquidity_5,
                'imbalance_5': (
                    (bid_liquidity_5 - ask_liquidity_5) / (bid_liquidity_5 + ask_liquidity_5)
                    if (bid_liquidity_5 + ask_liquidity_5) > 0 else 0
                )
            }
        }

    def upload_to_s3(self, data):
        """Upload with time-based partitioning"""
        timestamp = datetime.utcnow()
        s3_key = (
            f"binance-l2-data/symbol={self.symbol}/"
            f"year={timestamp.year}/month={timestamp.month:02d}/"
            f"day={timestamp.day:02d}/hour={timestamp.hour:02d}/"
            f"{timestamp.strftime('%Y%m%d_%H%M%S_%f')}.json"
        )

        self.s3.put_object(
            Bucket=self.s3_bucket,
            Key=s3_key,
            Body=json.dumps(data),
            ContentType='application/json'
        )

    def run(self):
        """Main streaming loop"""
        while True:
            try:
                raw_data = self.get_order_book()
                enriched_data = self.enrich_snapshot(raw_data)
                self.upload_to_s3(enriched_data)
            except Exception as e:
                # Log and keep going: a single failed request or upload
                # should never kill the streaming thread.
                print(f"[{self.symbol}] snapshot failed: {e}")
            time.sleep(self.interval)

# Run all three symbols in parallel threads
SYMBOLS = ["BTCUSDT", "ETHUSDT", "SOLUSDT"]
threads = []

for symbol in SYMBOLS:
    streamer = BinanceL2Streamer(symbol, "our-s3-bucket")
    thread = threading.Thread(target=streamer.run, daemon=True)
    thread.start()
    threads.append(thread)

# Keep main thread alive
while True:
    time.sleep(1)

The script also pushes custom CloudWatch metrics every 60 seconds (a minimal sketch of that call follows the list below):

  • Snapshots captured
  • Error count
  • Spread (basis points)
  • Order book imbalance
  • Bid/ask liquidity
  • Heartbeat (liveness check)
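Here is a hedged sketch of that CloudWatch call. SnapshotsCaptured and ErrorCount match the metric names used by the alarm script in Part 6; the other metric names are assumptions:

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='eu-west-2')

def push_metrics(symbol, snapshots, errors, spread_bps, imbalance):
    """Push one 60-second batch of custom metrics for a symbol."""
    dims = [{'Name': 'Symbol', 'Value': symbol}]
    cloudwatch.put_metric_data(
        Namespace='BinanceStreamer',
        MetricData=[
            {'MetricName': 'SnapshotsCaptured', 'Dimensions': dims, 'Value': snapshots, 'Unit': 'Count'},
            {'MetricName': 'ErrorCount', 'Dimensions': dims, 'Value': errors, 'Unit': 'Count'},
            {'MetricName': 'SpreadBps', 'Dimensions': dims, 'Value': spread_bps, 'Unit': 'None'},
            {'MetricName': 'OrderBookImbalance', 'Dimensions': dims, 'Value': imbalance, 'Unit': 'None'},
            {'MetricName': 'Heartbeat', 'Dimensions': dims, 'Value': 1, 'Unit': 'Count'},
        ],
    )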

Part 5: Team IAM Access Setup

To keep things safe but shareable, we used IAM user groups with read-only policies.

The Process:

  1. Created IAM user group: BinanceReadOnlyTeam
  2. Created 6 individual IAM users (one per team member)
  3. Added all users to the group
  4. Attached read-only S3 policy to the group
  5. Generated access keys for each user
  6. Sent keys via email with a Jupyter notebook showing how to use them in Google Colab
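Steps 1 to 5 can also be scripted. A hedged boto3 sketch with a hypothetical user name; the policy file referenced here is just the read-only JSON shown below, saved locally:

import boto3

iam = boto3.client('iam')
GROUP = 'BinanceReadOnlyTeam'

# Steps 1 and 4: create the group and attach the read-only policy inline.
iam.create_group(GroupName=GROUP)
with open('readonly_s3_policy.json') as f:      # hypothetical local copy of the policy below
    iam.put_group_policy(GroupName=GROUP,
                         PolicyName='BinanceS3ReadOnly',
                         PolicyDocument=f.read())

# Steps 2, 3 and 5: one user per teammate, added to the group, with an access key.
user = 'teammate1'                              # hypothetical user name
iam.create_user(UserName=user)
iam.add_user_to_group(GroupName=GROUP, UserName=user)
key = iam.create_access_key(UserName=user)['AccessKey']
print(key['AccessKeyId'])                       # share the secret out-of-band, never commit it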

Read-Only S3 Policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListBucket",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::our-s3-bucket"
        },
        {
            "Sid": "ReadObjects",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::our-s3-bucket/*"
        }
    ]
}

What team members CAN do:

  • List files in S3
  • Download data to Google Colab/local machines
  • Read all historical and live data

What team members CANNOT do:

  • Upload/modify/delete data
  • Create EC2 instances
  • See billing information
  • Modify IAM permissions
  • Access other AWS services

This setup meant everyone had access without risking accidental writes or sharing the EC2 instance's role. We got individual accountability (unique keys per person), easy revocation (remove a user from the group), centralized management (one policy change affects everyone), and a complete audit trail via CloudTrail.
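For completeness, here is a minimal read-side sketch of what a teammate runs in Colab with their read-only key. The key values and object prefix are placeholders, and list_objects_v2 returns at most 1,000 keys per call, so a full hour needs a paginator:

import json
import boto3
import pandas as pd

s3 = boto3.client(
    's3',
    region_name='eu-west-2',
    aws_access_key_id='AKIA...',            # placeholder: per-member key from the onboarding email
    aws_secret_access_key='...',            # placeholder
)

BUCKET = 'our-s3-bucket'
PREFIX = 'binance-l2-data/symbol=BTCUSDT/year=2025/month=10/day=22/hour=10/'

# Load the pre-computed metrics from the first 100 snapshots of one hour.
keys = [o['Key'] for o in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get('Contents', [])]
rows = []
for key in keys[:100]:
    snap = json.loads(s3.get_object(Bucket=BUCKET, Key=key)['Body'].read())
    rows.append({'timestamp': snap['timestamp'], **snap['metrics']})

df = pd.DataFrame(rows)
print(df[['timestamp', 'mid_price', 'spread_bps', 'imbalance_5']].head())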


Part 6: Automation with Bash & Systemd

To make the system fully autonomous, we created bash scripts and systemd services.

Script 1: create_alarms.sh

Creates all CloudWatch alarms and SNS topic:

#!/bin/bash

SYMBOLS=("BTCUSDT" "ETHUSDT" "SOLUSDT")
REGION="eu-west-2"

# Create the SNS topic (idempotent) and capture its ARN
SNS_TOPIC_ARN=$(aws sns create-topic \
  --name binance-streamer-alerts \
  --region $REGION \
  --query TopicArn --output text)

# Subscribe email
aws sns subscribe \
  --topic-arn $SNS_TOPIC_ARN \
  --protocol email \
  --notification-endpoint your-email@example.com \
  --region $REGION

# Create alarms for each symbol
for SYMBOL in "${SYMBOLS[@]}"; do
    # High error rate alarm
    aws cloudwatch put-metric-alarm \
        --alarm-name "BinanceStreamer-HighErrorRate-${SYMBOL}" \
        --namespace "BinanceStreamer" \
        --metric-name "ErrorCount" \
        --dimensions Name=Symbol,Value=$SYMBOL \
        --statistic Sum \
        --period 300 \
        --threshold 10 \
        --comparison-operator GreaterThanThreshold \
        --alarm-actions $SNS_TOPIC_ARN \
        --region $REGION

    # No data received alarm
    aws cloudwatch put-metric-alarm \
        --alarm-name "BinanceStreamer-NoData-${SYMBOL}" \
        --namespace "BinanceStreamer" \
        --metric-name "SnapshotsCaptured" \
        --dimensions Name=Symbol,Value=$SYMBOL \
        --statistic SampleCount \
        --period 300 \
        --threshold 1 \
        --comparison-operator LessThanThreshold \
        --alarm-actions $SNS_TOPIC_ARN \
        --treat-missing-data breaching \
        --region $REGION
done

Script 2: setup_binance_services.sh

Creates systemd services for automatic startup and restarts:

#!/bin/bash

# Create streamer service
sudo tee /etc/systemd/system/binance-streamer.service > /dev/null <<EOF
[Unit]
Description=Binance L2 Order Book Streamer
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=ec2-user
WorkingDirectory=/home/ec2-user
ExecStart=/home/ec2-user/trade_download/bin/python /home/ec2-user/binance_l2_streamer.py
Restart=always
RestartSec=10
MemoryMax=512M
CPUQuota=50%
Environment="PYTHONUNBUFFERED=1"

[Install]
WantedBy=multi-user.target
EOF

# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable binance-streamer.service
sudo systemctl start binance-streamer.service

With Restart=always and RestartSec=10, our streamer becomes virtually immortal. Script crashes? Auto-restart in 10 seconds. EC2 reboots? Starts automatically on boot. Zero manual intervention needed.

Monitoring Commands:

# View live logs
sudo journalctl -u binance-streamer.service -f

# Check service status
sudo systemctl status binance-streamer.service

# Restart if needed
sudo systemctl restart binance-streamer.service

Part 7: CloudWatch Monitoring & Alerts

We created comprehensive monitoring with 7 alarms per symbol (21 total):

  1. High Error Rate - More than 10 errors in 5 minutes
  2. No Data Received - No snapshots in 5 minutes
  3. High Spread - Spread > 50 bps (potential liquidity crisis)
  4. Extreme Order Book Imbalance - 70% imbalance for 2 minutes
  5. EC2 Status Check Failed - Instance health issues
  6. High CPU Usage - Above 80% for 10 minutes
  7. Low Disk Space - Disk usage above 80%

All alarms send SNS email notifications to 2 members for redundancy (me and Joseph).
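The bash script in Part 6 creates alarms 1 and 2; the other five follow the same pattern. As an example, here is a hedged boto3 sketch of alarm 3, assuming the SpreadBps metric name from the earlier sketch and reusing the redacted SNS ARN:

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='eu-west-2')
SNS_TOPIC_ARN = 'arn:aws:sns:<region>:<account-id-redacted>:<topic>'

for symbol in ('BTCUSDT', 'ETHUSDT', 'SOLUSDT'):
    # Alarm 3: average spread above 50 bps over 5 minutes (potential liquidity stress).
    cloudwatch.put_metric_alarm(
        AlarmName=f'BinanceStreamer-HighSpread-{symbol}',
        Namespace='BinanceStreamer',
        MetricName='SpreadBps',                     # assumption: must match the streamer's metric name
        Dimensions=[{'Name': 'Symbol', 'Value': symbol}],
        Statistic='Average',
        Period=300,
        EvaluationPeriods=1,
        Threshold=50,
        ComparisonOperator='GreaterThanThreshold',
        AlarmActions=[SNS_TOPIC_ARN],
    )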

Example alarm trigger email:

ALARM: "BinanceStreamer-HighErrorRate-BTCUSDT" in EU (London)

Threshold Crossed: ErrorCount > 10 for 2 datapoints within 10 minutes

Current State: ALARM
Previous State: OK

Reason: Threshold Crossed: 1 datapoint [15.0] was greater than 
the threshold (10.0).

This gives us peace of mind. We know immediately if anything goes wrong.


Part 8: Actual Costs & Results

After 72 Hours (3 Days) of Operation:

  • Total snapshots: 259,200 per symbol (777,600 total)
  • S3 storage: 2.6 GB live data + 44.6 GB historical = 47.2 GB
  • Uptime: 100% (0 errors)
  • Files created: ~384,000 JSON files
  • Avg file size: 7.5 KB per snapshot

Cost Breakdown:

  • EC2 (t3.micro, 72 hours): $0.93 worth of credits
  • S3 storage (47.2 GB): $1.08 worth of credits
  • S3 PUT requests: $0.05 worth of credits
  • Data transfer: $0.02 worth of credits
  • Total for 3 days: $2.08 worth of credits

Projected Monthly Cost: ~$10 worth of credits per month

Well under our $50 budget, and remember, on the free tier you can't actually be charged real money. Once your credits hit zero, AWS just emails you and shuts things down unless you upgrade. It's a hard limit.


Architecture Diagram

Here's the complete system architecture showing the full data flow from Binance API through our EC2 instance to S3 storage, with CloudWatch monitoring and IAM-controlled team access.

┌─────────────────────────────────────────────────────────┐
│              Binance REST API                           │
│  (Order book snapshots for BTC/ETH/SOL every 1 second)  │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│  EC2 Instance (t3.micro, eu-west-2)                     │
│  ┌───────────────────────────────────────────────┐      │
│  │  Python Streamer (3 threads)                  │      │
│  │  • binance_l2_streamer.py                     │      │
│  │  • Systemd auto-restart                       │      │
│  │  • IAM role: <ec2-s3-role>                    │      │
│  └───────────────────────────────────────────────┘      │
└────────────┬──────────────────┬─────────────────────────┘
             │                  │
             │                  └──────────────┐
             ▼                                 ▼
┌────────────────────────────┐   ┌────────────────────────┐
│  S3: our-s3-bucket         │   │  CloudWatch Metrics    │
│  • Historical: 44.6 GB     │   │  • Snapshots/min       │
│  • Live L2: 2.6 GB         │   │  • Error count         │
│  • Time-partitioned        │   │  • Spread, liquidity   │
│  • 777K+ snapshots         │   │  • Order book balance  │
└────────────┬───────────────┘   └──────────┬─────────────┘
             │                              │
             │                              ▼
             │                   ┌─────────────────────────┐
             │                   │  CloudWatch Alarms      │
             │                   │  • 21 alarms total      │
             │                   │  • 7 per symbol         │
             │                   └──────────┬──────────────┘
             │                              │
             ▼                              ▼
┌────────────────────────────┐   ┌────────────────────────┐
│  IAM User Group            │   │  SNS Topic             │
│  BinanceReadOnlyTeam       │   │  streamer-alerts-topic │
│  • 6 team members          │   │                        │
│  • Read-only S3 access     │   │  • Email notifications │
│  • Individual access keys  │   │  • 2 team members      │
└────────────────────────────┘   └────────────────────────┘
             │
             ▼
┌────────────────────────────────────────────────────────┐
│  Team Analysis (Google Colab / Local)                  │
│  • Pandas DataFrames                                   │
│  • Liquidity stress analysis                           │
│  • Market microstructure research                      │
└────────────────────────────────────────────────────────┘

Lessons Learned

Here's what actually happened during our 3-day journey from idea to production system.

What Worked Really Well:

  • Budget-first approach - Set up cost controls BEFORE infrastructure = zero surprises
  • IAM roles for EC2 - No credential management headaches at all
  • User groups for team - Adding/removing members took literally seconds
  • Systemd for reliability - Scripts survived EC2 restarts, no manual intervention
  • Time-based S3 partitioning - Made data queries fast and cheap
  • REST over WebSocket - Simpler code, fewer bugs, good enough for our needs

Challenges We Hit:

  • Budget actions didn't trigger: we created the budget before the EC2 instance existed. Fix: create the EC2 instance first, then recreate the budget in the same region. Time lost: 1 hour.
  • Disk space filled up in 6 hours: we kept local copies after the S3 upload. Fix: add os.remove(local_path) after each upload. Time lost: 30 minutes.
  • CloudWatch metrics not appearing: wrong region in the boto3 client. Fix: change region_name='us-east-1' to 'eu-west-2'. Time lost: 20 minutes.
  • Systemd service wouldn't start: wrong Python interpreter path. Fix: point ExecStart at the venv Python, /home/ec2-user/trade_download/bin/python. Time lost: 15 minutes.
  • IAM permissions initially too broad: we gave the EC2 role full S3 access. Fix: restrict it to the specific bucket ARN. Time lost: 45 minutes.

Debugging Tip: 90% of our issues were solved by checking sudo journalctl -u binance-streamer.service -n 50. Always check logs first before assuming AWS is broken!


Part 9: Making the Data Research-Ready

Getting raw data is only half the battle. We needed to transform 44.6 GB of compressed historical data and continuous JSON snapshots into something you could actually query and analyze.

The full data processing pipeline deserves its own article (unzipping historical data, converting JSON to Parquet, AWS Athena setup, feature engineering, etc.), but here's what we built:

  • The transformation: Raw JSON → Parquet (60% storage reduction, 10-20x faster queries)
  • The interface: AWS Athena with time-based partitioning (query 6 months in seconds)
  • The features: Pre-computed metrics like log_returns, volume_imbalance, effective_spread, and liquidity_score

Result: Team members can run complex queries across the entire dataset without downloading anything locally. What would take minutes of local processing now happens in seconds via Athena.

This data processing pipeline will be covered in detail in Part 2, along with our feature engineering approach and the actual liquidity stress detection models.
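As a teaser, here is a rough sketch of the JSON-to-Parquet step, assuming pandas with pyarrow installed. The prefix and output name are placeholders, and the real pipeline also adds the engineered features listed above:

import json
import boto3
import pandas as pd

s3 = boto3.client('s3', region_name='eu-west-2')
BUCKET = 'our-s3-bucket'
PREFIX = 'binance-l2-data/symbol=BTCUSDT/year=2025/month=10/day=22/hour=10/'   # one hour of one symbol

# Flatten the JSON snapshots under the prefix into a single Parquet file of metrics.
rows = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get('Contents', []):
        snap = json.loads(s3.get_object(Bucket=BUCKET, Key=obj['Key'])['Body'].read())
        rows.append({'symbol': snap['symbol'], 'timestamp': snap['timestamp'], **snap['metrics']})

df = pd.DataFrame(rows)
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.to_parquet('btcusdt_2025102210.parquet', index=False)   # then upload back to S3 under a parquet/ prefix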


What We're Researching

Our project focuses on liquidity stress detection in cryptocurrency markets using order book microstructure.

Research Questions:

  1. Can we predict liquidity crises before they happen by analyzing order book imbalance?
  2. How does liquidity behave differently across BTC, ETH, and SOL during stress events?
  3. Are there early warning signals in bid-ask spreads during major market moves?
  4. How quickly does liquidity recover after major crashes (COVID, Terra, FTX)?

Methodology:

  • Compare L2 snapshots during "normal" vs "stress" periods
  • Analyze spread widening, order book depth depletion, and imbalance patterns
  • Build predictive models using machine learning on microstructure features
  • Validate findings across all three assets and five historical windows

This dataset will enable us (and the broader research community) to study these questions with unprecedented granularity.
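As a toy illustration of question 3 (not our actual models, which come in Part 2), a rolling z-score over the per-snapshot spread can flag unusual widening; the window length and cutoff below are arbitrary:

import pandas as pd

def spread_stress_flags(df: pd.DataFrame, window: str = '30min', z_cut: float = 3.0) -> pd.Series:
    """Flag snapshots whose spread is an outlier versus the trailing window.

    Expects a DataFrame indexed by timestamp with a 'spread_bps' column,
    e.g. loaded from the Parquet files sketched earlier.
    """
    rolling = df['spread_bps'].rolling(window)
    z = (df['spread_bps'] - rolling.mean()) / rolling.std()
    return z > z_cut

# Example: count flagged seconds per hour as a crude stress indicator.
# flags = spread_stress_flags(df.set_index('timestamp').sort_index())
# print(flags.resample('1h').sum())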


Full System Workflow

Here's what happens after initial setup (completely hands-off):

  1. EC2 boots → systemd starts binance-streamer.service
  2. Python script launches → 3 threads (BTC, ETH, SOL) start streaming
  3. Every second → Each thread fetches order book, enriches data, uploads to S3
  4. Every 60 seconds → CloudWatch metrics pushed (snapshots, errors, spread, etc.)
  5. CloudWatch monitors → Checks all 21 alarms continuously
  6. If alarm triggers → SNS sends email to team leads immediately
  7. If script crashes → Systemd auto-restarts after 10 seconds
  8. If EC2 reboots → Everything starts automatically on boot
  9. Team members → Download data anytime via read-only IAM credentials

Total manual intervention required: Zero


Current Data Collection Status

As of writing this article:

Live Streaming (Day 3):

  • 777,600 total snapshots captured
  • 0 errors (100% uptime)
  • 2.6 GB of live L2 data
  • All CloudWatch alarms functioning
  • Team members accessing data successfully

Historical Data:

  • 44.6 GB across 5 market periods
  • 84 monthly files downloaded and verified
  • All checksums validated

Projected 6-Month Collection:

  • ~15.8 million snapshots per symbol (~47 million total)
  • ~50-80 GB additional live data
  • Total dataset: ~100-130 GB

Code Repository

All scripts (Python + Bash) and setup documentation will be released on GitHub:

GitHub: Coming soon (will include complete setup guide, all scripts, and infrastructure templates)

What will be included:

  • binance_l2_streamer.py - Multi-threaded live streamer
  • binance_trade_data.py - Historical data downloader
  • create_alarms.sh - CloudWatch alarm automation
  • setup_binance_services.sh - Systemd service installer
  • IAM policy templates (JSON)
  • Complete setup documentation
  • Team access guide (Google Colab notebook)

Making It Open Source

Once our research concludes (~6 months), we'll release:

1. The complete dataset (~50-100 GB)

  • Historical trade data (5 market periods)
  • 6 months of live L2 snapshots
  • All three symbols (BTC, ETH, SOL)

2. All infrastructure code (might come earlier)

  • Terraform templates for one-click deployment
  • Docker containers (alternative to EC2)
  • Complete setup scripts

3. Analysis notebooks

  • Liquidity stress detection algorithms
  • Market microstructure analysis
  • Visualization dashboards

4. Research paper

  • Methodology
  • Findings
  • Reproducibility instructions

Why? High-frequency financial data is expensive and gatekept. Open-source data democratizes quantitative finance research for students and independent researchers.


How You Can Replicate This

Everything is designed to be plug-and-play. If you can SSH into a server and run a bash script, you can replicate this entire system.

Requirements:

  • AWS account with Free Tier ($100 credits for new accounts, plus $20 bonus credits per service)
  • Basic Python knowledge
  • SSH access to Linux
  • 30-60 minutes for initial setup

Estimated Costs:

  • First few months: Draws from your $100-140 in credits
  • Monthly credit usage: ~$8-12 worth
  • One-time setup time: 1-2 hours

Steps:

  1. Clone our GitHub repo (link coming soon)
  2. Follow the setup guide in README.md
  3. Run setup_binance_services.sh on your EC2 instance
  4. Configure your budget and alarms
  5. Start streaming!

The entire infrastructure is designed to be plug-and-play reproducible.


Want to Collaborate?

We're always interested in collaborating with other researchers working on:

  • Market microstructure
  • Cryptocurrency liquidity
  • High-frequency data analysis
  • Open-source finance datasets

Reach out if you:

  • Want to use our dataset for your research
  • Have ideas for improving the infrastructure
  • Want to contribute to the open-source release
  • Are building similar systems and want to share knowledge

Closing Thoughts

Building this system taught us that good data infrastructure doesn't require a huge budget or enterprise tools. It requires clear thinking about requirements and constraints, careful planning (especially around costs), simple and reliable tools over complex ones, good documentation for reproducibility, and team collaboration with clear ownership.

The hardest part wasn't the code. It was figuring out the architecture, IAM policies, budget controls, and making everything work together reliably.

If six students can build a production-grade market data pipeline on $10/month worth of credits, you can too.

Thanks for reading!


This article is part of our ongoing open-source research project. Star our GitHub repo when it's released to stay updated on the dataset publication and research findings!
