How I Streamed Live Binance L2 Order Book Data on AWS for ~$10/Month

Kalu Goodness Chibueze

TL;DR

What we built: A fully automated Binance Level 2 order book streaming system on AWS Free Tier
Cost: Uses AWS credits (~$10/month worth) for 260K+ snapshots/day across BTC, ETH, SOL
Team: 6 WorldQuant University students with IAM-managed access
Tech stack: EC2 (t3.micro) + Python + S3 + CloudWatch + Systemd
Dataset size: ~100-130 GB (historical + 6 months live streaming)
Research focus: Liquidity stress detection in cryptocurrency markets
Result: 100% autonomous operation with zero manual intervention

We've all been there. You have a brilliant idea that needs collaboration and lots of data to succeed, and suddenly access and budget become your biggest obstacles.

Having decided to pick up my long-held interest in research again, I spoke to a couple of colleagues at WorldQuant University and we decided to just get it done. We wanted a fully open source project with a high-frequency finance dataset. But there were two big problems:

Each teammate needed access to ~30-50 GB of data. Downloading individually? Terrible for bandwidth and money.

We wanted the project to be truly reproducible, so anyone with the right technique could replicate it cheaply.

Our solution: AWS Free Tier + careful planning.


The Team

Our 6-person WorldQuant University research team:

  • Kalu Goodness (me)
  • Edet Joseph
  • Igboke Hannah
  • Ejike Uchenna
  • Kunde Godfrey
  • Fagbuyi Emmanuel

We're all students tackling high-frequency cryptocurrency market microstructure and liquidity stress detection.


The Plan

Static download: Grab historical trade data first for 5 key market periods (COVID crash, bull run, Terra collapse, post-FTX, ETF approval), roughly 30-50 GB in total

Live streaming: Continuous Level 2 order book snapshots every second for BTC/USDT, ETH/USDT, and SOL/USDT

Central S3 storage: Team members get read-only access via IAM

Budget & monitoring: Multi-threshold alerts and automatic stoppage to avoid surprises

Team IAM setup: Users in a group with policies attached for simplicity and safety

We used t3.micro EC2 in London for proximity and cost efficiency, running Python scripts in a virtual environment.


Part 1: EC2 Instance Setup

Instance Specifications:

Name: (our_s3_bucket)
Type: t3.micro (1 vCPU, 1GB RAM)
Storage: 20GB EBS (gp3)
Region: eu-west-2 (London - lowest latency to our team)
OS: Amazon Linux 2023
IAM Role: (ec2-s3-role)

Why London? Most of our team is in Europe/Africa, so this gave us the best latency.

Pro Tip: Always attach IAM roles to your EC2 instance during creation. Attaching them later sometimes requires stopping the instance, which interrupts your data collection.
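For reference, here is a hedged boto3 sketch of launching such an instance with the role attached at creation time. The AMI ID, key pair, security group, and Name tag below are placeholders, not our actual values:

import boto3

ec2 = boto3.client('ec2', region_name='eu-west-2')

# Launch a t3.micro with the IAM instance profile attached at creation,
# so the streamer never needs access keys on disk.
response = ec2.run_instances(
    ImageId='ami-xxxxxxxxxxxxxxxxx',            # placeholder: Amazon Linux 2023 AMI in eu-west-2
    InstanceType='t3.micro',
    MinCount=1,
    MaxCount=1,
    KeyName='my-key-pair',                      # placeholder
    SecurityGroupIds=['sg-xxxxxxxxxxxxxxxxx'],  # placeholder
    IamInstanceProfile={'Name': 'ec2-s3-role'}, # instance profile wrapping the role
    BlockDeviceMappings=[{
        'DeviceName': '/dev/xvda',
        'Ebs': {'VolumeSize': 20, 'VolumeType': 'gp3'},
    }],
    TagSpecifications=[{
        'ResourceType': 'instance',
        'Tags': [{'Key': 'Name', 'Value': 'binance-l2-streamer'}],  # placeholder name
    }],
)
print(response['Instances'][0]['InstanceId'])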

IAM Role Policy for EC2:

The EC2 instance needs to write to S3, push CloudWatch metrics, and create SNS alerts:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucket"
            ],
            "Resource": [
                "arn:aws:s3:::our-s3-bucket",
                "arn:aws:s3:::our-s3-bucket/*"
            ]
        },
        {
            "Effect": "Allow",
            "Action": [
                "cloudwatch:PutMetricData",
                "cloudwatch:PutMetricAlarm"
            ],
            "Resource": "*"
        },
        {
            "Effect": "Allow",
            "Action": [
                "sns:Publish"
            ],
            "Resource": "arn:aws:sns:*:*:binance-streamer-alerts"
        }
    ]
}

Important: Using IAM roles instead of access keys means no hardcoded credentials, automatic key rotation, easier auditing, and zero risk of accidentally committing secrets to GitHub.
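A quick sanity check that boto3 really is using the role (and not keys on disk): the call below resolves temporary credentials from the instance metadata service automatically.

import boto3

# No access keys anywhere: boto3 pulls temporary credentials from the
# EC2 instance metadata service via the attached role.
sts = boto3.client('sts', region_name='eu-west-2')
print(sts.get_caller_identity()['Arn'])  # should show an assumed-role ARN for ec2-s3-role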


Part 2: AWS Budgeting (The Most Important Part!)

Before writing any code, we set up comprehensive budget controls. This is not optional for student projects.

Quick note on AWS Free Tier changes: AWS recently changed how their free tier works. You now get $100 in credits when you sign up, plus $20 credits for trying out specific services like Lambda, EC2, RDS, and AWS Budgets. The big change is that on the free tier, you cannot be charged beyond your credits. It's a hard limit. Once you hit zero credits and don't top up, AWS emails you to upgrade and spins everything down if you don't comply. Some services are still free within limits, but most usage now draws from your credits immediately.

Budget: $50 worth of credits per month (we started with $180 in total credits)

Our 5-Threshold Alert System:

  1. 50% ($25): Email alert
  2. 80% ($40): Email + Slack notification
  3. 90% ($45): Email to all team members
  4. 95% ($47.50): Stop EC2 instance
  5. 100% ($50): Terminate EC2 instance

Critical Setup Note: Thresholds 4 and 5 require the EC2 instance to exist before creating the budget. Both the budget and EC2 instance must be in the same AWS region, or the actions won't trigger. This step caught me initially and cost me an hour of debugging!
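However you set it up, the email thresholds can also be scripted. Here is a minimal boto3 sketch; the account ID and email are placeholders, and the stop/terminate actions at thresholds 4 and 5 require AWS Budget Actions on top of this, which are not shown:

import boto3

budgets = boto3.client('budgets', region_name='us-east-1')  # Budgets uses a global endpoint
ACCOUNT_ID = '123456789012'                                  # placeholder
ALERT_EMAIL = 'your-email@example.com'                       # placeholder

budgets.create_budget(
    AccountId=ACCOUNT_ID,
    Budget={
        'BudgetName': 'binance-streamer-monthly',
        'BudgetLimit': {'Amount': '50', 'Unit': 'USD'},
        'TimeUnit': 'MONTHLY',
        'BudgetType': 'COST',
    },
    # Email-only thresholds (1-3); thresholds 4 and 5 need Budget Actions.
    NotificationsWithSubscribers=[
        {
            'Notification': {
                'NotificationType': 'ACTUAL',
                'ComparisonOperator': 'GREATER_THAN',
                'Threshold': pct,
                'ThresholdType': 'PERCENTAGE',
            },
            'Subscribers': [{'SubscriptionType': 'EMAIL', 'Address': ALERT_EMAIL}],
        }
        for pct in (50.0, 80.0, 90.0)
    ],
)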


Part 3: Static Historical Data Download

Before firing up the live streamer script, we ran a one-off Python script to grab historical Binance trade data.

What We Downloaded:

We targeted 5 specific market periods for our liquidity stress research:

Window 1 - COVID Crash (Feb-Jul 2020): 6 months

Window 2 - Bull Run (Nov 2020-Apr 2021): 6 months

Window 3 - Terra Collapse (Nov 2021-Apr 2022): 6 months

Window 4 - Post-FTX (Nov 2022-Apr 2023): 6 months

Window 5 - ETF Approval (Jul-Dec 2024): 6 months

Assets: BTC/USDT, ETH/USDT, SOL/USDT (note: SOL wasn't listed until Aug 2020)

Total: 44.6 GB compressed, 84 files

The script:

  • Downloads from Binance Data Vision API
  • Verifies SHA256 checksums
  • Uploads directly to S3
  • Deletes local copies to save disk space
  • Runtime: ~15-20 minutes

binance_trade_data.py (excerpt)

import os
import requests
import boto3
import hashlib
from tqdm import tqdm

s3_client = boto3.client('s3')
BASE_URL = "https://data.binance.vision/data/spot/monthly/trades"
ASSETS = ["BTCUSDT", "ETHUSDT", "SOLUSDT"]
S3_BUCKET = "our-s3-bucket"

def download_and_upload(url, local_path, s3_key, checksum_url):
    # Download with progress bar
    response = requests.get(url, stream=True)
    total_size = int(response.headers.get('content-length', 0))

    with open(local_path, 'wb') as f:
        with tqdm(total=total_size, unit='B', unit_scale=True) as pbar:
            for chunk in response.iter_content(chunk_size=8192):
                f.write(chunk)
                pbar.update(len(chunk))

    # Verify checksum
    checksum_response = requests.get(checksum_url)
    expected_checksum = checksum_response.text.strip().split()[0]

    sha256_hash = hashlib.sha256()
    with open(local_path, 'rb') as f:
        for chunk in iter(lambda: f.read(8192), b''):
            sha256_hash.update(chunk)

    if sha256_hash.hexdigest() == expected_checksum:
        print("✓ Checksum verified")
        # Upload to S3
        s3_client.upload_file(local_path, S3_BUCKET, s3_key)
        os.remove(local_path)  # Delete local copy
        return True
    else:
        print("✗ Checksum mismatch!")
        return False
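The excerpt above is the worker; here is a rough sketch of the driver loop around it, assuming the standard data.binance.vision monthly layout (SYMBOL-trades-YYYY-MM.zip plus a matching .CHECKSUM file) and reusing the names defined above. The month list and S3 prefix are illustrative only:

# Illustrative driver for download_and_upload(); months shown cover Window 1 only.
WINDOW_1 = ['2020-02', '2020-03', '2020-04', '2020-05', '2020-06', '2020-07']

for symbol in ASSETS:
    if symbol == 'SOLUSDT':
        continue  # SOL wasn't listed until Aug 2020, so it has no Window 1 files
    for month in WINDOW_1:
        fname = f'{symbol}-trades-{month}.zip'
        url = f'{BASE_URL}/{symbol}/{fname}'
        checksum_url = f'{url}.CHECKSUM'
        s3_key = f'binance-historical/{symbol}/{fname}'   # illustrative prefix
        download_and_upload(url, f'/tmp/{fname}', s3_key, checksum_url)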

This gave us a clean historical baseline before starting the live stream.

Why these specific periods? Each window represents a major market event: COVID crash (liquidity crisis), Bull run (FOMO), Terra collapse (contagion), FTX collapse (systemic failure), and ETF approval (institutional adoption). Perfect for studying liquidity under stress.


Part 4: Live L2 Order Book Streamer

After the static data, we started the real-time snapshots using Python with multi-threading.

Streaming Configuration:

Interval: 1-second snapshots per symbol

Depth: Top 100 bid/ask levels

Symbols: BTC/USDT, ETH/USDT, SOL/USDT (running concurrently in separate threads)

API: Binance REST API (not WebSocket - simpler and more reliable for our use case)

Output: S3 bucket with Hive-style partitioning

Why REST API instead of WebSocket?

  • Simpler implementation
  • 1-second granularity is sufficient for liquidity research
  • Complete order book snapshots (not deltas)
  • Automatic retries and error handling
  • No re-connection logic needed

S3 Data Structure:

s3://our-s3-bucket/binance-l2-data/
├── symbol=BTCUSDT/
│   └── year=2025/
│       └── month=10/
│           └── day=22/
│               └── hour=10/
│                   ├── 20251022_100001_123456.json
│                   └── 20251022_100002_234567.json
├── symbol=ETHUSDT/
└── symbol=SOLUSDT/

Each JSON file contains:

  • Full order book (100 levels)
  • Best bid/ask prices
  • Calculated metrics (spread, liquidity, imbalance)
  • Timestamp metadata

Streamer Core Logic

import json
import time
import boto3
import requests
import threading
from datetime import datetime

class BinanceL2Streamer:
    def __init__(self, symbol, s3_bucket, depth=100, interval=1):
        self.symbol = symbol
        self.s3_bucket = s3_bucket
        self.depth = depth
        self.interval = interval

        self.s3 = boto3.client('s3', region_name='eu-west-2')
        self.cloudwatch = boto3.client('cloudwatch', region_name='eu-west-2')
        self.api_url = "https://api.binance.com/api/v3/depth"

    def get_order_book(self):
        """Fetch current order book from Binance REST API"""
        params = {'symbol': self.symbol, 'limit': self.depth}
        response = requests.get(self.api_url, params=params, timeout=5)
        response.raise_for_status()
        return response.json()

    def enrich_snapshot(self, raw_data):
        """Calculate derived metrics"""
        bids = raw_data.get('bids', [])
        asks = raw_data.get('asks', [])

        best_bid = float(bids[0][0]) if bids else 0
        best_ask = float(asks[0][0]) if asks else 0
        spread = best_ask - best_bid
        mid_price = (best_bid + best_ask) / 2

        # Calculate liquidity at different depths
        bid_liquidity_5 = sum(float(b[0]) * float(b[1]) for b in bids[:5])
        ask_liquidity_5 = sum(float(a[0]) * float(a[1]) for a in asks[:5])

        return {
            'symbol': self.symbol,
            'timestamp': datetime.utcnow().isoformat(),
            'bids': bids,
            'asks': asks,
            'metrics': {
                'mid_price': mid_price,
                'spread': spread,
                'spread_bps': (spread / mid_price * 10000) if mid_price else 0,
                'bid_liquidity_5': bid_liquidity_5,
                'ask_liquidity_5': ask_liquidity_5,
                'imbalance_5': (
                    (bid_liquidity_5 - ask_liquidity_5) / (bid_liquidity_5 + ask_liquidity_5)
                    if (bid_liquidity_5 + ask_liquidity_5) > 0 else 0
                )
            }
        }

    def upload_to_s3(self, data):
        """Upload with time-based partitioning"""
        timestamp = datetime.utcnow()
        s3_key = (
            f"binance-l2-data/symbol={self.symbol}/"
            f"year={timestamp.year}/month={timestamp.month:02d}/"
            f"day={timestamp.day:02d}/hour={timestamp.hour:02d}/"
            f"{timestamp.strftime('%Y%m%d_%H%M%S_%f')}.json"
        )

        self.s3.put_object(
            Bucket=self.s3_bucket,
            Key=s3_key,
            Body=json.dumps(data),
            ContentType='application/json'
        )

    def run(self):
        """Main streaming loop"""
        while True:
            try:
                raw_data = self.get_order_book()
                enriched_data = self.enrich_snapshot(raw_data)
                self.upload_to_s3(enriched_data)
            except Exception as e:
                # Log and keep going: a single failed request or upload
                # should never kill the streaming thread.
                print(f"[{self.symbol}] snapshot failed: {e}")
            time.sleep(self.interval)

# Run all three symbols in parallel threads
SYMBOLS = ["BTCUSDT", "ETHUSDT", "SOLUSDT"]
threads = []

for symbol in SYMBOLS:
    streamer = BinanceL2Streamer(symbol, "our-s3-bucket")
    thread = threading.Thread(target=streamer.run, daemon=True)
    thread.start()
    threads.append(thread)

# Keep main thread alive
while True:
    time.sleep(1)

The script also pushes custom CloudWatch metrics every 60 seconds (a minimal sketch of that call follows the list below):

  • Snapshots captured
  • Error count
  • Spread (basis points)
  • Order book imbalance
  • Bid/ask liquidity
  • Heartbeat (liveness check)
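Here is a hedged sketch of that CloudWatch call. SnapshotsCaptured and ErrorCount match the metric names used by the alarm script in Part 6; the other metric names are assumptions:

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='eu-west-2')

def push_metrics(symbol, snapshots, errors, spread_bps, imbalance):
    """Push one 60-second batch of custom metrics for a symbol."""
    dims = [{'Name': 'Symbol', 'Value': symbol}]
    cloudwatch.put_metric_data(
        Namespace='BinanceStreamer',
        MetricData=[
            {'MetricName': 'SnapshotsCaptured', 'Dimensions': dims, 'Value': snapshots, 'Unit': 'Count'},
            {'MetricName': 'ErrorCount', 'Dimensions': dims, 'Value': errors, 'Unit': 'Count'},
            {'MetricName': 'SpreadBps', 'Dimensions': dims, 'Value': spread_bps, 'Unit': 'None'},
            {'MetricName': 'OrderBookImbalance', 'Dimensions': dims, 'Value': imbalance, 'Unit': 'None'},
            {'MetricName': 'Heartbeat', 'Dimensions': dims, 'Value': 1, 'Unit': 'Count'},
        ],
    )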

Part 5: Team IAM Access Setup

To keep things safe but shareable, we used IAM user groups with read-only policies.

The Process:

  1. Created IAM user group: BinanceReadOnlyTeam
  2. Created 6 individual IAM users (one per team member)
  3. Added all users to the group
  4. Attached read-only S3 policy to the group
  5. Generated access keys for each user
  6. Sent keys via email with a Jupyter notebook showing how to use them in Google Colab
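Steps 1 to 5 can also be scripted. A hedged boto3 sketch with a hypothetical user name; the policy file referenced here is just the read-only JSON shown below, saved locally:

import boto3

iam = boto3.client('iam')
GROUP = 'BinanceReadOnlyTeam'

# Steps 1 and 4: create the group and attach the read-only policy inline.
iam.create_group(GroupName=GROUP)
with open('readonly_s3_policy.json') as f:      # hypothetical local copy of the policy below
    iam.put_group_policy(GroupName=GROUP,
                         PolicyName='BinanceS3ReadOnly',
                         PolicyDocument=f.read())

# Steps 2, 3 and 5: one user per teammate, added to the group, with an access key.
user = 'teammate1'                              # hypothetical user name
iam.create_user(UserName=user)
iam.add_user_to_group(GroupName=GROUP, UserName=user)
key = iam.create_access_key(UserName=user)['AccessKey']
print(key['AccessKeyId'])                       # share the secret out-of-band, never commit it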

Read-Only S3 Policy:

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "ListBucket",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:GetBucketLocation"
            ],
            "Resource": "arn:aws:s3:::our-s3-bucket"
        },
        {
            "Sid": "ReadObjects",
            "Effect": "Allow",
            "Action": [
                "s3:GetObject",
                "s3:GetObjectVersion"
            ],
            "Resource": "arn:aws:s3:::our-s3-bucket/*"
        }
    ]
}

What team members CAN do:

  • List files in S3
  • Download data to Google Colab/local machines
  • Read all historical and live data

What team members CANNOT do:

  • Upload/modify/delete data
  • Create EC2 instances
  • See billing information
  • Modify IAM permissions
  • Access other AWS services

This setup meant everyone had access without risking accidental writes or sharing the EC2 instance's role. We got individual accountability (unique keys per person), easy revocation (remove a user from the group), centralized management (one policy change affects everyone), and a complete audit trail via CloudTrail.
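For completeness, here is a minimal read-side sketch of what a teammate runs in Colab with their read-only key. The key values and object prefix are placeholders, and list_objects_v2 returns at most 1,000 keys per call, so a full hour needs a paginator:

import json
import boto3
import pandas as pd

s3 = boto3.client(
    's3',
    region_name='eu-west-2',
    aws_access_key_id='AKIA...',            # placeholder: per-member key from the onboarding email
    aws_secret_access_key='...',            # placeholder
)

BUCKET = 'our-s3-bucket'
PREFIX = 'binance-l2-data/symbol=BTCUSDT/year=2025/month=10/day=22/hour=10/'

# Load the pre-computed metrics from the first 100 snapshots of one hour.
keys = [o['Key'] for o in s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX).get('Contents', [])]
rows = []
for key in keys[:100]:
    snap = json.loads(s3.get_object(Bucket=BUCKET, Key=key)['Body'].read())
    rows.append({'timestamp': snap['timestamp'], **snap['metrics']})

df = pd.DataFrame(rows)
print(df[['timestamp', 'mid_price', 'spread_bps', 'imbalance_5']].head())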


Part 6: Automation with Bash & Systemd

To make the system fully autonomous, we created bash scripts and systemd services.

Script 1: create_alarms.sh

Creates all CloudWatch alarms and SNS topic:

#!/bin/bash

SYMBOLS=("BTCUSDT" "ETHUSDT" "SOLUSDT")
REGION="eu-west-2"

# Create the SNS topic (idempotent) and capture its ARN
SNS_TOPIC_ARN=$(aws sns create-topic \
  --name binance-streamer-alerts \
  --region $REGION \
  --query TopicArn --output text)

# Subscribe email
aws sns subscribe \
  --topic-arn $SNS_TOPIC_ARN \
  --protocol email \
  --notification-endpoint your-email@example.com \
  --region $REGION

# Create alarms for each symbol
for SYMBOL in "${SYMBOLS[@]}"; do
    # High error rate alarm
    aws cloudwatch put-metric-alarm \
        --alarm-name "BinanceStreamer-HighErrorRate-${SYMBOL}" \
        --namespace "BinanceStreamer" \
        --metric-name "ErrorCount" \
        --dimensions Name=Symbol,Value=$SYMBOL \
        --statistic Sum \
        --period 300 \
        --threshold 10 \
        --comparison-operator GreaterThanThreshold \
        --alarm-actions $SNS_TOPIC_ARN \
        --region $REGION

    # No data received alarm
    aws cloudwatch put-metric-alarm \
        --alarm-name "BinanceStreamer-NoData-${SYMBOL}" \
        --namespace "BinanceStreamer" \
        --metric-name "SnapshotsCaptured" \
        --dimensions Name=Symbol,Value=$SYMBOL \
        --statistic SampleCount \
        --period 300 \
        --threshold 1 \
        --comparison-operator LessThanThreshold \
        --alarm-actions $SNS_TOPIC_ARN \
        --treat-missing-data breaching \
        --region $REGION
done

Script 2: setup_binance_services.sh

Creates systemd services for automatic startup and restarts:

#!/bin/bash

# Create streamer service
sudo tee /etc/systemd/system/binance-streamer.service > /dev/null <<EOF
[Unit]
Description=Binance L2 Order Book Streamer
Wants=network-online.target
After=network-online.target

[Service]
Type=simple
User=ec2-user
WorkingDirectory=/home/ec2-user
ExecStart=/home/ec2-user/trade_download/bin/python /home/ec2-user/binance_l2_streamer.py
Restart=always
RestartSec=10
MemoryMax=512M
CPUQuota=50%
Environment="PYTHONUNBUFFERED=1"

[Install]
WantedBy=multi-user.target
EOF

# Enable and start service
sudo systemctl daemon-reload
sudo systemctl enable binance-streamer.service
sudo systemctl start binance-streamer.service

With Restart=always and RestartSec=10, our streamer becomes virtually immortal. Script crashes? Auto-restart in 10 seconds. EC2 reboots? Starts automatically on boot. Zero manual intervention needed.

Monitoring Commands:

# View live logs
sudo journalctl -u binance-streamer.service -f

# Check service status
sudo systemctl status binance-streamer.service

# Restart if needed
sudo systemctl restart binance-streamer.service

Part 7: CloudWatch Monitoring & Alerts

We created comprehensive monitoring with 7 alarms per symbol (21 total):

  1. High Error Rate - More than 10 errors in 5 minutes
  2. No Data Received - No snapshots in 5 minutes
  3. High Spread - Spread > 50 bps (potential liquidity crisis)
  4. Extreme Order Book Imbalance - 70% imbalance for 2 minutes
  5. EC2 Status Check Failed - Instance health issues
  6. High CPU Usage - Above 80% for 10 minutes
  7. Low Disk Space - Disk usage above 80%

All alarms send SNS email notifications to 2 members for redundancy (me and Joseph).
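The bash script in Part 6 creates alarms 1 and 2; the other five follow the same pattern. As an example, here is a hedged boto3 sketch of alarm 3, assuming the SpreadBps metric name from the earlier sketch and reusing the redacted SNS ARN:

import boto3

cloudwatch = boto3.client('cloudwatch', region_name='eu-west-2')
SNS_TOPIC_ARN = 'arn:aws:sns:<region>:<account-id-redacted>:<topic>'

for symbol in ('BTCUSDT', 'ETHUSDT', 'SOLUSDT'):
    # Alarm 3: average spread above 50 bps over 5 minutes (potential liquidity stress).
    cloudwatch.put_metric_alarm(
        AlarmName=f'BinanceStreamer-HighSpread-{symbol}',
        Namespace='BinanceStreamer',
        MetricName='SpreadBps',                     # assumption: must match the streamer's metric name
        Dimensions=[{'Name': 'Symbol', 'Value': symbol}],
        Statistic='Average',
        Period=300,
        EvaluationPeriods=1,
        Threshold=50,
        ComparisonOperator='GreaterThanThreshold',
        AlarmActions=[SNS_TOPIC_ARN],
    )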

Example alarm trigger email:

ALARM: "BinanceStreamer-HighErrorRate-BTCUSDT" in EU (London)

Threshold Crossed: ErrorCount > 10 for 2 datapoints within 10 minutes

Current State: ALARM
Previous State: OK

Reason: Threshold Crossed: 1 datapoint [15.0] was greater than 
the threshold (10.0).

This gives us peace of mind. We know immediately if anything goes wrong.


Part 8: Actual Costs & Results

After 72 Hours (3 Days) of Operation:

  • Total snapshots: 259,200 per symbol (777,600 total)
  • S3 storage: 2.6 GB live data + 44.6 GB historical = 47.2 GB
  • Uptime: 100% (0 errors)
  • Files created: ~384,000 JSON files
  • Avg file size: 7.5 KB per snapshot

Cost Breakdown:

  • EC2 (t3.micro, 72 hours): $0.93 worth of credits
  • S3 storage (47.2 GB): $1.08 worth of credits
  • S3 PUT requests: $0.05 worth of credits
  • Data transfer: $0.02 worth of credits
  • Total for 3 days: $2.08 worth of credits

Projected Monthly Cost: ~$10 worth of credits per month

Well under our $50 budget, and remember, on the free tier you can't actually be charged real money. Once your credits hit zero, AWS just emails you and shuts things down unless you upgrade. It's a hard limit.


Architecture Diagram

Here's the complete system architecture showing the full data flow from Binance API through our EC2 instance to S3 storage, with CloudWatch monitoring and IAM-controlled team access.

┌─────────────────────────────────────────────────────────┐
│              Binance REST API                           │
│  (Order book snapshots for BTC/ETH/SOL every 1 second)  │
└────────────────────┬────────────────────────────────────┘
                     │
                     ▼
┌─────────────────────────────────────────────────────────┐
│  EC2 Instance (t3.micro, eu-west-2)                     │
│  ┌───────────────────────────────────────────────┐      │
│  │  Python Streamer (3 threads)                  │      │
│  │  • binance_l2_streamer.py                     │      │
│  │  • Systemd auto-restart                       │      │
│  │  • IAM role: <ec2-s3-role>                    │      │
│  └───────────────────────────────────────────────┘      │
└────────────┬──────────────────┬─────────────────────────┘
             │                  │
             │                  └──────────────┐
             ▼                                 ▼
┌────────────────────────────┐   ┌────────────────────────┐
│  S3: our-s3-bucket         │   │  CloudWatch Metrics    │
│  • Historical: 44.6 GB     │   │  • Snapshots/min       │
│  • Live L2: 2.6 GB         │   │  • Error count         │
│  • Time-partitioned        │   │  • Spread, liquidity   │
│  • 777K+ snapshots         │   │  • Order book balance  │
└────────────┬───────────────┘   └──────────┬─────────────┘
             │                              │
             │                              ▼
             │                   ┌─────────────────────────┐
             │                   │  CloudWatch Alarms      │
             │                   │  • 21 alarms total      │
             │                   │  • 7 per symbol         │
             │                   └──────────┬──────────────┘
             │                              │
             ▼                              ▼
┌────────────────────────────┐   ┌────────────────────────┐
│  IAM User Group            │   │  SNS Topic             │
│  BinanceReadOnlyTeam       │   │  streamer-alerts-topic │
│  • 6 team members          │   │                        │
│  • Read-only S3 access     │   │  • Email notifications │
│  • Individual access keys  │   │  • 2 team members      │
└────────────────────────────┘   └────────────────────────┘
             │
             ▼
┌────────────────────────────────────────────────────────┐
│  Team Analysis (Google Colab / Local)                  │
│  • Pandas DataFrames                                   │
│  • Liquidity stress analysis                           │
│  • Market microstructure research                      │
└────────────────────────────────────────────────────────┘

Lessons Learned

Here's what actually happened during our 3-day journey from idea to production system.

What Worked Really Well:

  • Budget-first approach - Set up cost controls BEFORE infrastructure = zero surprises
  • IAM roles for EC2 - No credential management headaches at all
  • User groups for team - Adding/removing members took literally seconds
  • Systemd for reliability - Scripts survived EC2 restarts, no manual intervention
  • Time-based S3 partitioning - Made data queries fast and cheap
  • REST over WebSocket - Simpler code, fewer bugs, good enough for our needs

Challenges We Hit:

  • Budget actions didn't trigger: we created the budget before the EC2 instance existed. Fix: create the EC2 instance first, then recreate the budget in the same region. Time lost: 1 hour.
  • Disk space filled up in 6 hours: we kept local copies after the S3 upload. Fix: add os.remove(local_path) after each upload. Time lost: 30 minutes.
  • CloudWatch metrics not appearing: wrong region in the boto3 client. Fix: change region_name='us-east-1' to 'eu-west-2'. Time lost: 20 minutes.
  • Systemd service wouldn't start: wrong Python interpreter path. Fix: point ExecStart at the venv Python, /home/ec2-user/trade_download/bin/python. Time lost: 15 minutes.
  • IAM permissions initially too broad: we gave the EC2 role full S3 access. Fix: restrict it to the specific bucket ARN. Time lost: 45 minutes.

Debugging Tip: 90% of our issues were solved by checking sudo journalctl -u binance-streamer.service -n 50. Always check logs first before assuming AWS is broken!


Part 9: Making the Data Research-Ready

Getting raw data is only half the battle. We needed to transform 44.6 GB of compressed historical data and continuous JSON snapshots into something you could actually query and analyze.

The full data processing pipeline deserves its own article (unzipping historical data, converting JSON to Parquet, AWS Athena setup, feature engineering, etc.), but here's what we built:

  • The transformation: Raw JSON → Parquet (60% storage reduction, 10-20x faster queries)
  • The interface: AWS Athena with time-based partitioning (query 6 months in seconds)
  • The features: Pre-computed metrics like log_returns, volume_imbalance, effective_spread, and liquidity_score

Result: Team members can run complex queries across the entire dataset without downloading anything locally. What would take minutes of local processing now happens in seconds via Athena.

This data processing pipeline will be covered in detail in Part 2, along with our feature engineering approach and the actual liquidity stress detection models.
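As a teaser, here is a rough sketch of the JSON-to-Parquet step, assuming pandas with pyarrow installed. The prefix and output name are placeholders, and the real pipeline also adds the engineered features listed above:

import json
import boto3
import pandas as pd

s3 = boto3.client('s3', region_name='eu-west-2')
BUCKET = 'our-s3-bucket'
PREFIX = 'binance-l2-data/symbol=BTCUSDT/year=2025/month=10/day=22/hour=10/'   # one hour of one symbol

# Flatten the JSON snapshots under the prefix into a single Parquet file of metrics.
rows = []
paginator = s3.get_paginator('list_objects_v2')
for page in paginator.paginate(Bucket=BUCKET, Prefix=PREFIX):
    for obj in page.get('Contents', []):
        snap = json.loads(s3.get_object(Bucket=BUCKET, Key=obj['Key'])['Body'].read())
        rows.append({'symbol': snap['symbol'], 'timestamp': snap['timestamp'], **snap['metrics']})

df = pd.DataFrame(rows)
df['timestamp'] = pd.to_datetime(df['timestamp'])
df.to_parquet('btcusdt_2025102210.parquet', index=False)   # then upload back to S3 under a parquet/ prefix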


What We're Researching

Our project focuses on liquidity stress detection in cryptocurrency markets using order book microstructure.

Research Questions:

  1. Can we predict liquidity crises before they happen by analyzing order book imbalance?
  2. How does liquidity behave differently across BTC, ETH, and SOL during stress events?
  3. Are there early warning signals in bid-ask spreads during major market moves?
  4. How quickly does liquidity recover after major crashes (COVID, Terra, FTX)?

Methodology:

  • Compare L2 snapshots during "normal" vs "stress" periods
  • Analyze spread widening, order book depth depletion, and imbalance patterns
  • Build predictive models using machine learning on microstructure features
  • Validate findings across all three assets and five historical windows

This dataset will enable us (and the broader research community) to study these questions with unprecedented granularity.
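As a toy illustration of question 3 (not our actual models, which come in Part 2), a rolling z-score over the per-snapshot spread can flag unusual widening; the window length and cutoff below are arbitrary:

import pandas as pd

def spread_stress_flags(df: pd.DataFrame, window: str = '30min', z_cut: float = 3.0) -> pd.Series:
    """Flag snapshots whose spread is an outlier versus the trailing window.

    Expects a DataFrame indexed by timestamp with a 'spread_bps' column,
    e.g. loaded from the Parquet files sketched earlier.
    """
    rolling = df['spread_bps'].rolling(window)
    z = (df['spread_bps'] - rolling.mean()) / rolling.std()
    return z > z_cut

# Example: count flagged seconds per hour as a crude stress indicator.
# flags = spread_stress_flags(df.set_index('timestamp').sort_index())
# print(flags.resample('1h').sum())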


Full System Workflow

Here's what happens after initial setup (completely hands-off):

  1. EC2 boots → systemd starts binance-streamer.service
  2. Python script launches → 3 threads (BTC, ETH, SOL) start streaming
  3. Every second → Each thread fetches order book, enriches data, uploads to S3
  4. Every 60 seconds → CloudWatch metrics pushed (snapshots, errors, spread, etc.)
  5. CloudWatch monitors → Checks all 21 alarms continuously
  6. If alarm triggers → SNS sends email to team leads immediately
  7. If script crashes → Systemd auto-restarts after 10 seconds
  8. If EC2 reboots → Everything starts automatically on boot
  9. Team members → Download data anytime via read-only IAM credentials

Total manual intervention required: Zero


Current Data Collection Status

As of writing this article:

Live Streaming (Day 3):

  • 777,600 total snapshots captured
  • 0 errors (100% uptime)
  • 2.6 GB of live L2 data
  • All CloudWatch alarms functioning
  • Team members accessing data successfully

Historical Data:

  • 44.6 GB across 5 market periods
  • 84 monthly files downloaded and verified
  • All checksums validated

Projected 6-Month Collection:

  • ~15.8 million snapshots per symbol (~47 million total)
  • ~50-80 GB additional live data
  • Total dataset: ~100-130 GB

Code Repository

All scripts (Python + Bash) and setup documentation will be released on GitHub:

GitHub: Coming soon (will include complete setup guide, all scripts, and infrastructure templates)

What will be included:

  • binance_l2_streamer.py - Multi-threaded live streamer
  • binance_trade_data.py - Historical data downloader
  • create_alarms.sh - CloudWatch alarm automation
  • setup_binance_services.sh - Systemd service installer
  • IAM policy templates (JSON)
  • Complete setup documentation
  • Team access guide (Google Colab notebook)

Making It Open Source

Once our research concludes (~6 months), we'll release:

1. The complete dataset (~50-100 GB)

  • Historical trade data (5 market periods)
  • 6 months of live L2 snapshots
  • All three symbols (BTC, ETH, SOL)

2. All infrastructure code (might come earlier)

  • Terraform templates for one-click deployment
  • Docker containers (alternative to EC2)
  • Complete setup scripts

3. Analysis notebooks

  • Liquidity stress detection algorithms
  • Market microstructure analysis
  • Visualization dashboards

4. Research paper

  • Methodology
  • Findings
  • Reproducibility instructions

Why? High-frequency financial data is expensive and gatekept. Open-source data democratizes quantitative finance research for students and independent researchers.


How You Can Replicate This

Everything is designed to be plug-and-play. If you can SSH into a server and run a bash script, you can replicate this entire system.

Requirements:

  • AWS account with Free Tier ($100 credits for new accounts, plus $20 bonus credits per service)
  • Basic Python knowledge
  • SSH access to Linux
  • 30-60 minutes for initial setup

Estimated Costs:

  • First few months: Draws from your $100-140 in credits
  • Monthly credit usage: ~$8-12 worth
  • One-time setup time: 1-2 hours

Steps:

  1. Clone our GitHub repo (link coming soon)
  2. Follow the setup guide in README.md
  3. Run setup_binance_services.sh on your EC2 instance
  4. Configure your budget and alarms
  5. Start streaming!

The entire infrastructure is designed to be plug-and-play reproducible.


Want to Collaborate?

We're always interested in collaborating with other researchers working on:

  • Market microstructure
  • Cryptocurrency liquidity
  • High-frequency data analysis
  • Open-source finance datasets

Reach out if you:

  • Want to use our dataset for your research
  • Have ideas for improving the infrastructure
  • Want to contribute to the open-source release
  • Are building similar systems and want to share knowledge

Closing Thoughts

Building this system taught us that good data infrastructure doesn't require a huge budget or enterprise tools. It requires clear thinking about requirements and constraints, careful planning (especially around costs), simple and reliable tools over complex ones, good documentation for reproducibility, and team collaboration with clear ownership.

The hardest part wasn't the code. It was figuring out the architecture, IAM policies, budget controls, and making everything work together reliably.

If six students can build a production-grade market data pipeline on $10/month worth of credits, you can too.

Thanks for reading!


This article is part of our ongoing open-source research project. Star our GitHub repo when it's released to stay updated on the dataset publication and research findings!
