<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Prashant Iyenga</title>
    <description>The latest articles on DEV Community by Prashant Iyenga (@pi19404).</description>
    <link>https://dev.to/pi19404</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F2045067%2F876926a4-ba17-4653-b9a6-4b7c6cc2a3a3.png</url>
      <title>DEV Community: Prashant Iyenga</title>
      <link>https://dev.to/pi19404</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pi19404"/>
    <language>en</language>
    <item>
      <title>A Technical Guide to Downloading and Managing Binance Historical Crypto Market Data</title>
      <dc:creator>Prashant Iyenga</dc:creator>
      <pubDate>Sun, 20 Jul 2025 22:41:24 +0000</pubDate>
      <link>https://dev.to/pi19404/a-technical-guide-to-downloading-and-managing-binance-historical-crypto-market-data-d82</link>
      <guid>https://dev.to/pi19404/a-technical-guide-to-downloading-and-managing-binance-historical-crypto-market-data-d82</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;This article provides a technical guide to accessing and utilizing historical OHLCV (Open, High, Low, Close, Volume) and trade-level data for cryptocurrencies via Binance’s public data infrastructure. This documentation is intended for quantitative analysts, algorithmic traders, and developers seeking robust, reproducible workflows for crypto market data ingestion and analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  Binance Public Data Repository
&lt;/h2&gt;

&lt;p&gt;Binance makes its market data publicly accessible via two channels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Website download&lt;/strong&gt; at &lt;a href="https://data.binance.vision/" rel="noopener noreferrer"&gt;data.binance.vision&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub repo&lt;/strong&gt; containing helper scripts and documentation: &lt;a href="https://github.com/binance/binance-public-data/" rel="noopener noreferrer"&gt;binance-public-data&lt;/a&gt; (&lt;a href="https://github.com/binance/binance-public-data/blob/master/README.md?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;github.com&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All data is provided in two granularities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Daily files&lt;/strong&gt; (new files appear each day for the previous day’s data)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monthly files&lt;/strong&gt; (new files appear on the first Monday of each month and contain all days in that month)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both daily and monthly files are available for all supported intervals (e.g., &lt;code&gt;1m&lt;/code&gt;, &lt;code&gt;5m&lt;/code&gt;, &lt;code&gt;1h&lt;/code&gt;, &lt;code&gt;1d&lt;/code&gt;, etc.) across Kline, Trade, and AggTrade datasets. This means you can download either daily or monthly archives for any interval, depending on your needs. For efficient data management, it’s recommended to use monthly files for historical periods (as they consolidate all daily data for the month), and supplement with daily files for the most recent days not yet included in the latest monthly archive.&lt;/p&gt;
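&lt;p&gt;The recommended split can be sketched in a few lines of Python. This is an illustrative helper, not part of Binance’s tooling; the publication rule it encodes (a month’s archive is available once the month has ended) is a simplifying assumption:&lt;/p&gt;

```python
from datetime import date, timedelta

def plan_downloads(start: date, end: date, today: date):
    """Return (monthly, daily): months to fetch as monthly archives and
    individual dates to fetch as daily files.

    Assumption: a month's archive appears only after the month ends, so
    requested days in the current month fall back to daily files. A monthly
    archive covering a partial request (e.g. a mid-month start) is still
    downloaded whole and filtered afterwards."""
    monthly, daily = [], []
    current_month = today.replace(day=1)
    cursor = start
    while cursor <= end:
        month_start = cursor.replace(day=1)
        if month_start < current_month:
            monthly.append((month_start.year, month_start.month))
            # jump to the first day of the next month
            cursor = (month_start + timedelta(days=32)).replace(day=1)
        else:
            daily.append(cursor)
            cursor = cursor + timedelta(days=1)
    return monthly, daily
```

&lt;p&gt;For instance, requesting 2025-06-15 through 2025-07-03 on 2025-07-20 yields one monthly archive (June 2025) plus three daily files.&lt;/p&gt;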




&lt;h2&gt;
  
  
  Data Types: Kline, Trade, and AggTrade
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Kline (Candlestick) Data
&lt;/h3&gt;

&lt;p&gt;Kline files correspond to Binance’s &lt;code&gt;/api/v3/klines&lt;/code&gt; REST endpoint and provide OHLCV for fixed time intervals. Each record includes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;open_time&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Start timestamp of the interval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;open&lt;/code&gt;, &lt;code&gt;high&lt;/code&gt;, &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;close&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Price metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;volume&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Base-asset volume during the interval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;close_time&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;End timestamp of the interval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;quote_asset_volume&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Quote-asset volume during the interval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;num_trades&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Number of trades in the interval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;taker_buy_base_asset_volume&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Volume bought by takers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;taker_buy_quote_asset_volume&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Quote volume bought by takers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ignore&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Unused&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All common intervals (&lt;code&gt;1m&lt;/code&gt;, &lt;code&gt;3m&lt;/code&gt;, &lt;code&gt;5m&lt;/code&gt;, &lt;code&gt;15m&lt;/code&gt;, &lt;code&gt;30m&lt;/code&gt;, &lt;code&gt;1h&lt;/code&gt;, &lt;code&gt;2h&lt;/code&gt;, &lt;code&gt;4h&lt;/code&gt;, &lt;code&gt;6h&lt;/code&gt;, &lt;code&gt;8h&lt;/code&gt;, &lt;code&gt;12h&lt;/code&gt;, &lt;code&gt;1d&lt;/code&gt;, &lt;code&gt;3d&lt;/code&gt;, &lt;code&gt;1w&lt;/code&gt;, &lt;code&gt;1mo&lt;/code&gt;, etc.) are supported. (&lt;a href="https://github.com/binance/binance-public-data/blob/master/README.md?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;github.com&lt;/a&gt;)&lt;/p&gt;
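&lt;p&gt;The extracted kline CSVs are headerless, with columns in exactly the order of the table above. A minimal stdlib parser as a sketch (it assumes epoch-millisecond timestamps; verify the timestamp unit of the archives you download):&lt;/p&gt;

```python
import csv
import io

# Column order matches the kline field table above.
KLINE_COLUMNS = [
    "open_time", "open", "high", "low", "close", "volume",
    "close_time", "quote_asset_volume", "num_trades",
    "taker_buy_base_asset_volume", "taker_buy_quote_asset_volume", "ignore",
]

def parse_klines(text):
    """Parse a headerless kline CSV (as extracted from the zip) into dicts."""
    rows = []
    for rec in csv.reader(io.StringIO(text)):
        row = dict(zip(KLINE_COLUMNS, rec))
        # Timestamps and counts are integers; prices/volumes are decimal strings.
        for k in ("open_time", "close_time", "num_trades"):
            row[k] = int(row[k])
        for k in ("open", "high", "low", "close", "volume",
                  "quote_asset_volume", "taker_buy_base_asset_volume",
                  "taker_buy_quote_asset_volume"):
            row[k] = float(row[k])
        rows.append(row)
    return rows
```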

&lt;h3&gt;
  
  
  2. Raw Trade Data
&lt;/h3&gt;

&lt;p&gt;Trade files come directly from &lt;code&gt;/api/v3/historicalTrades&lt;/code&gt;. Each row is a single trade execution, including price, quantity, timestamp, and maker/taker flags. Use raw trades when you need every individual execution event for tick-by-tick backtesting and slippage modeling. (&lt;a href="https://github.com/binance/binance-public-data/blob/master/README.md?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;github.com&lt;/a&gt;, &lt;a href="https://www.quantifiedstrategies.com/what-can-you-expect-from-a-trading-strategy-backtest-when-you-are-trading-it-live/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;quantifiedstrategies.com&lt;/a&gt;)&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Aggregate Trade (aggTrade) Data
&lt;/h3&gt;

&lt;p&gt;AggTrades are derived from &lt;code&gt;/api/v3/aggTrades&lt;/code&gt;. They bundle together consecutive trades at the same price into one record, reducing data volume while preserving essential trade information. (&lt;a href="https://developers.binance.com/docs/derivatives/coin-margined-futures/websocket-market-streams/Aggregate-Trade-Streams?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;developers.binance.com&lt;/a&gt;)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;aggregateTradeId&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Internal ID for the aggregated record&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;price&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Price at which these trades occurred&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;quantity&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Total quantity across bundled trades&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;firstTradeId&lt;/code&gt;, &lt;code&gt;lastTradeId&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Range of original trade IDs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;timestamp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Timestamp when the last trade in the bundle occurred&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;isBuyerMaker&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Whether the buyer of the last trade was maker&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;isBestMatch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Whether the last trade matched at best price&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
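&lt;p&gt;Because each aggTrade row carries a price and a total quantity, summary statistics such as VWAP fall out directly. A stdlib sketch over a headerless aggTrade CSV, with columns in the order of the table above:&lt;/p&gt;

```python
import csv
import io

# Column order matches the aggTrade field table above.
AGG_COLUMNS = ["aggregateTradeId", "price", "quantity",
               "firstTradeId", "lastTradeId", "timestamp",
               "isBuyerMaker", "isBestMatch"]

def aggtrade_vwap(text):
    """Compute the volume-weighted average price from a headerless aggTrade CSV."""
    notional = qty = 0.0
    for rec in csv.reader(io.StringIO(text)):
        row = dict(zip(AGG_COLUMNS, rec))
        p, q = float(row["price"]), float(row["quantity"])
        notional += p * q
        qty += q
    return notional / qty if qty else float("nan")
```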




&lt;h2&gt;
  
  
  Which Data to Use for Backtesting vs. Live Trading
&lt;/h2&gt;

&lt;p&gt;Choosing the right dataset depends on your system requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Backtesting (Historical Simulation)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Interval-based strategies&lt;/em&gt;: Kline data is sufficient for time-frame strategies (e.g., hourly breakouts) because it provides OHLCV in fixed windows and is lightweight to process. (&lt;a href="https://logicinv.com/blog/algorithmic-trading/backtesting-vs-paper-trading-vs-live-trading-key-differences-in-results/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;logicinv.com&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Tick-level accuracy&lt;/em&gt;: Use raw trade data if you need to simulate order execution, slippage, and order book impact at the individual trade level. This ensures your backtest closely mirrors real-world fills. (&lt;a href="https://www.quantifiedstrategies.com/what-can-you-expect-from-a-trading-strategy-backtest-when-you-are-trading-it-live/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;quantifiedstrategies.com&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Live Trading (Real-time Execution)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Efficiency&lt;/em&gt;: Subscribe to the &lt;code&gt;aggTrade&lt;/code&gt; WebSocket stream (&lt;code&gt;&amp;lt;symbol&amp;gt;@aggTrade&lt;/code&gt;) for a balance between granularity and bandwidth. It updates every 100 ms and bundles trades by price. (&lt;a href="https://developers.binance.com/docs/derivatives/coin-margined-futures/websocket-market-streams/Aggregate-Trade-Streams?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;developers.binance.com&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Full detail&lt;/em&gt;: If your strategy requires every individual fill (e.g., ultra–high-frequency strategies), connect to the raw trade stream (&lt;code&gt;&amp;lt;symbol&amp;gt;@trade&lt;/code&gt;). Be prepared for higher message rates and processing overhead. (&lt;a href="https://developers.binance.com/docs/binance-spot-api-docs/web-socket-streams?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;developers.binance.com&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
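&lt;p&gt;Messages on the &lt;code&gt;aggTrade&lt;/code&gt; stream arrive as JSON with short field keys. The connection code is omitted here; the sketch below only normalizes one payload into the field names used earlier. The key mapping and sample payload follow the documented spot stream format, but verify them against the official docs before relying on them:&lt;/p&gt;

```python
import json

# Short stream keys mapped to the field names used in this article's table.
STREAM_KEYS = {
    "a": "aggregateTradeId", "p": "price", "q": "quantity",
    "f": "firstTradeId", "l": "lastTradeId", "T": "timestamp",
    "m": "isBuyerMaker",
}

def normalize_aggtrade(raw: str) -> dict:
    """Convert one aggTrade stream message (JSON text) into a flat record."""
    msg = json.loads(raw)
    rec = {name: msg[key] for key, name in STREAM_KEYS.items()}
    rec["price"] = float(rec["price"])        # the stream sends prices as strings
    rec["quantity"] = float(rec["quantity"])
    rec["symbol"] = msg["s"]
    return rec
```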

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Aggregated trades can sometimes exhibit small discrepancies in volumes or trade counts compared to raw trades or klines; always validate critical metrics against a secondary source. (&lt;a href="https://dev.binance.vision/t/data-discrepancy-between-klines-and-aggregated-trades/22320?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;dev.binance.vision&lt;/a&gt;)&lt;/p&gt;
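&lt;p&gt;A simple cross-check is to sum aggTrade quantities inside a kline’s window and compare against the kline’s reported base volume. An illustrative helper (hypothetical function; what difference counts as acceptable is up to you):&lt;/p&gt;

```python
def volume_mismatch(kline_volume, agg_trades, open_time, close_time):
    """Absolute difference between a kline's reported base volume and the
    summed aggTrade quantities falling inside its window.

    agg_trades: iterable of (timestamp_ms, quantity) pairs."""
    summed = sum(q for t, q in agg_trades if open_time <= t <= close_time)
    return abs(kline_volume - summed)
```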




&lt;h2&gt;
  
  
  Downloading Data
&lt;/h2&gt;

&lt;p&gt;Data files are hosted on &lt;code&gt;data.binance.vision&lt;/code&gt;. The general URL pattern for &lt;strong&gt;daily&lt;/strong&gt; spot Klines is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://data.binance.vision/data/spot/daily/klines/{interval}/{symbol}/{symbol}-{interval}-{YYYY-MM-DD}.zip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, to download BTCUSDT 1-hour Klines for July 13, 2025:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="s2"&gt;"https://data.binance.vision/data/spot/daily/klines/BTCUSDT/1h/BTCUSDT-1h-2025-07-13.zip"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; BTCUSDT-1h-2025-07-13.zip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Monthly data uses a similar pattern under &lt;code&gt;data/spot/monthly/klines/...&lt;/code&gt;. (&lt;a href="https://github.com/binance/binance-public-data/blob/master/README.md?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;github.com&lt;/a&gt;)&lt;/p&gt;
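&lt;p&gt;Both URL patterns can be captured in one small helper (an illustrative sketch; the monthly pattern mirrors the daily one with a &lt;code&gt;YYYY-MM&lt;/code&gt; suffix):&lt;/p&gt;

```python
def kline_url(symbol, interval, *, day=None, month=None):
    """Build a data.binance.vision spot kline archive URL for either a
    daily file (day='YYYY-MM-DD') or a monthly file (month='YYYY-MM')."""
    base = "https://data.binance.vision/data/spot"
    if day is not None:
        return f"{base}/daily/klines/{symbol}/{interval}/{symbol}-{interval}-{day}.zip"
    return f"{base}/monthly/klines/{symbol}/{interval}/{symbol}-{interval}-{month}.zip"
```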




&lt;h2&gt;
  
  
  Example: Fetching and Parsing with Python
&lt;/h2&gt;

&lt;p&gt;Binance provides an official Python utility for downloading and parsing historical spot data (Klines, Trades, AggTrades) from their public dataset. The scripts are available in the &lt;a href="https://github.com/binance/binance-public-data/tree/master/python" rel="noopener noreferrer"&gt;binance-public-data/python&lt;/a&gt; repository.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Install the required dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or, if you only need the basics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;requests tqdm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clone the repository to access the scripts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/binance/binance-public-data.git
&lt;span class="nb"&gt;cd &lt;/span&gt;binance-public-data/python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Downloading Data
&lt;/h3&gt;

&lt;p&gt;The main script is &lt;a href="https://github.com/binance/binance-public-data/blob/master/python/download_data.py" rel="noopener noreferrer"&gt;&lt;code&gt;download_data.py&lt;/code&gt;&lt;/a&gt;, which supports downloading Klines, Trades, and AggTrades for any symbol, interval, and date range.&lt;/p&gt;

&lt;h4&gt;
  
  
  Usage
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python download_data.py &lt;span class="nt"&gt;--help&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key options include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--market-type&lt;/code&gt; (spot, um, cm)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--data-type&lt;/code&gt; (klines, trades, aggTrades)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--symbol&lt;/code&gt; (e.g., BTCUSDT)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--interval&lt;/code&gt; (for klines, e.g., 1m, 1h)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--start-date&lt;/code&gt; and &lt;code&gt;--end-date&lt;/code&gt; (YYYY-MM-DD)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--frequency&lt;/code&gt; (daily, monthly)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--out-dir&lt;/code&gt; (output directory)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--workers&lt;/code&gt; (number of parallel downloads; controls how many files are fetched concurrently across symbols, intervals, and dates. Higher values speed up bulk downloads but may trigger rate limiting)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Example Commands
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Download daily 1h Klines for BTCUSDT from July to August 2025:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python download_data.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--market-type&lt;/span&gt; spot &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-type&lt;/span&gt; klines &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--symbol&lt;/span&gt; BTCUSDT &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--interval&lt;/span&gt; 1h &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-date&lt;/span&gt; 2025-07-01 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-date&lt;/span&gt; 2025-08-31 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--frequency&lt;/span&gt; daily &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--out-dir&lt;/span&gt; ./ohlcv_1h &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--workers&lt;/span&gt; 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Download daily raw Trades for ETHUSDT for June 2025:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python download_data.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--market-type&lt;/span&gt; spot &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-type&lt;/span&gt; trades &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--symbol&lt;/span&gt; ETHUSDT &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-date&lt;/span&gt; 2025-06-01 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-date&lt;/span&gt; 2025-06-30 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--frequency&lt;/span&gt; daily &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--out-dir&lt;/span&gt; ./trades_ethusdt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example: Smart Downloader and Merger for Binance OHLCV Data
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;smart_binance_downloader.py&lt;/code&gt; script provides an intelligent solution for downloading, extracting, and merging Binance OHLCV (Kline) data into consolidated CSV files. It's built on modified versions of &lt;code&gt;download_kline.py&lt;/code&gt; and &lt;code&gt;utility.py&lt;/code&gt; from the &lt;a href="https://github.com/binance/binance-public-data/tree/master/python" rel="noopener noreferrer"&gt;Binance Public Data repository&lt;/a&gt; with several enhancements for efficiency and reliability.&lt;/p&gt;

&lt;h4&gt;
  
  
  Key Features
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Smart Data Acquisition Strategy&lt;/strong&gt;: Intelligently uses monthly downloads for historical periods and daily downloads for recent days not yet available in monthly archives&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental Updates&lt;/strong&gt;: Maintains a single merged CSV file per &lt;code&gt;{symbol}_{interval}&lt;/code&gt; combination&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delta Downloads&lt;/strong&gt;: Analyzes existing data to identify and download only missing date ranges&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate Limit Handling&lt;/strong&gt;: Implements exponential backoff for API rate limits with configurable parameters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-extraction&lt;/strong&gt;: Unzips downloaded files and merges them into consolidated CSVs&lt;/li&gt;
&lt;/ul&gt;
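&lt;p&gt;The delta-download idea reduces to a gap scan over the dates already present in the merged CSV. A simplified stand-alone version (the real script works on timestamps; this sketch works on calendar dates):&lt;/p&gt;

```python
from datetime import date, timedelta

def missing_dates(existing, start, end):
    """Group dates in [start, end] that are absent from `existing`
    into contiguous (first, last) ranges, one download batch each."""
    ranges = []
    run_start = run_end = None
    d = start
    while d <= end:
        if d not in existing:
            if run_start is None:
                run_start = d
            run_end = d
        elif run_start is not None:
            ranges.append((run_start, run_end))
            run_start = None
        d = d + timedelta(days=1)
    if run_start is not None:
        ranges.append((run_start, run_end))
    return ranges
```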

&lt;h4&gt;
  
  
  Core Function: Rate Limit Handling
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;download_with_backoff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;download_func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rate_limit_sleep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_backoff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;sleep_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rate_limit_sleep&lt;/span&gt;
  &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="nf"&gt;download_func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Check for HTTP 429 or rate limit in error message
&lt;/span&gt;      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;429&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rate limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rate limited. Sleeping for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sleep_time&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; seconds...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sleep_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sleep_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sleep_time&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_backoff&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function wraps the original Binance download functions with retry logic that uses exponential backoff when encountering rate limits.&lt;/p&gt;
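&lt;p&gt;Its retry behavior can be exercised against a stub that raises a simulated HTTP 429 twice before succeeding. The wrapper is restated compactly below so the snippet runs stand-alone (it additionally returns the callable’s result for convenience; the tiny sleep values are only for the demonstration):&lt;/p&gt;

```python
import time

def download_with_backoff(download_func, rate_limit_sleep, max_backoff, *args, **kwargs):
    """Retry download_func with exponential backoff on rate-limit errors."""
    sleep_time = rate_limit_sleep
    while True:
        try:
            return download_func(*args, **kwargs)
        except Exception as e:
            if "429" in str(e) or "rate limit" in str(e).lower():
                time.sleep(sleep_time)
                sleep_time = min(sleep_time * 2, max_backoff)
            else:
                raise

# Simulated endpoint that rate-limits twice before succeeding.
calls = {"n": 0}
def flaky_download(url):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("HTTP 429 Too Many Requests")
    return f"saved {url}"
```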

&lt;h4&gt;
  
  
  Command Line Interface
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python smart_binance_downloader.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--symbol&lt;/span&gt; BTCUSDT &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--interval&lt;/span&gt; 1h &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-date&lt;/span&gt; 2023-01-01 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-date&lt;/span&gt; 2023-02-28 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rate-limit-sleep&lt;/span&gt; 2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-backoff&lt;/span&gt; 32 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-dir&lt;/span&gt; ./custom_data_folder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Implementation Notes
&lt;/h4&gt;

&lt;p&gt;The script improves upon Binance's original tools by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Path Management&lt;/strong&gt;: Creates required directories and handles file paths intelligently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Deduplication&lt;/strong&gt;: Tracks timestamps already in merged CSV to avoid redundant downloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Recovery&lt;/strong&gt;: Graceful handling of network issues and rate limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File Organization&lt;/strong&gt;: Creates a clean, organized directory structure for downloaded files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-format Support&lt;/strong&gt;: Handles both daily and monthly download formats seamlessly&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The underlying code leverages modified versions of Binance's &lt;code&gt;download_monthly_klines&lt;/code&gt; and &lt;code&gt;download_daily_klines&lt;/code&gt; functions, adapting them to work with a more intelligent file management system. This ensures you always maintain a single, up-to-date CSV file for each &lt;code&gt;{symbol}_{interval}&lt;/code&gt; pair, simplifying data management for backtesting and analysis.&lt;/p&gt;
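&lt;p&gt;The core of that file management is a keyed merge on &lt;code&gt;open_time&lt;/code&gt;. A minimal stand-alone sketch (the actual script also handles CSV I/O, headers, and extraction):&lt;/p&gt;

```python
def merge_klines(existing_rows, new_rows):
    """Merge kline rows keyed on open_time (first column), dropping
    duplicates and returning rows sorted by open_time.

    Rows are lists of strings as read from the CSVs; on duplicate
    timestamps, rows from new_rows win."""
    seen = {}
    for row in list(existing_rows) + list(new_rows):
        seen[int(row[0])] = row
    return [seen[t] for t in sorted(seen)]
```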

&lt;p&gt;The script uses an &lt;strong&gt;initial sleep&lt;/strong&gt; (&lt;code&gt;--rate-limit-sleep&lt;/code&gt;) between requests and an &lt;strong&gt;exponential backoff&lt;/strong&gt; (&lt;code&gt;--max-backoff&lt;/code&gt;) when encountering HTTP 429 to respect the CDN’s limits. &lt;/p&gt;

&lt;p&gt;The complete code for the smart downloader, including enhancements and CLI usage, can be found in the &lt;a href="https://github.com/pyVision/ai-invest.git" rel="noopener noreferrer"&gt;pyVision/ai-invest&lt;/a&gt; repository under &lt;code&gt;src/crypto_bot/smart_binance_downloader.py&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Accessing and managing Binance’s historical OHLCV and trade-level data is essential for robust quantitative research, backtesting, and live trading in the crypto markets. By leveraging the public data repository, official scripts, and enhanced tools like &lt;code&gt;smart_binance_downloader.py&lt;/code&gt;, users can efficiently acquire, update, and maintain high-quality datasets tailored to their strategy requirements. Always validate your data sources, handle rate limits responsibly, and choose the appropriate data granularity for your use case to ensure reliable and reproducible results.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;h2&gt;
  
  
  This article provides a technical guide to accessing and utilizing historical OHLCV (Open, High, Low, Close, Volume) and trade-level data for cryptocurrencies via Binance’s public data infrastructure. This documentation is intended for quantitative analysts, algorithmic traders, and developers seeking robust, reproducible workflows for crypto market data ingestion and analysis.
&lt;/h2&gt;

&lt;h2&gt;
  
  
  Binance Public Data Repository
&lt;/h2&gt;

&lt;p&gt;Binance makes its market data publicly accessible via two channels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Website download&lt;/strong&gt; at &lt;a href="https://data.binance.vision/" rel="noopener noreferrer"&gt;data.binance.vision&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GitHub repo&lt;/strong&gt; containing helper scripts and documentation: &lt;a href="https://github.com/binance/binance-public-data/" rel="noopener noreferrer"&gt;binance-public-data&lt;/a&gt; (&lt;a href="https://github.com/binance/binance-public-data/blob/master/README.md?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;github.com&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;All data is provided in two granularities:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Daily files&lt;/strong&gt; (new files appear each day for the previous day’s data)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Monthly files&lt;/strong&gt; (new files appear on the first Monday of each month and contain all days in that month)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both daily and monthly files are available for all supported intervals (e.g., &lt;code&gt;1m&lt;/code&gt;, &lt;code&gt;5m&lt;/code&gt;, &lt;code&gt;1h&lt;/code&gt;, &lt;code&gt;1d&lt;/code&gt;, etc.) across Kline, Trade, and AggTrade datasets. This means you can download either daily or monthly archives for any interval, depending on your needs. For efficient data management, it’s recommended to use monthly files for historical periods (as they consolidate all daily data for the month), and supplement with daily files for the most recent days not yet included in the latest monthly archive.&lt;/p&gt;




&lt;h2&gt;
  
  
  Data Types: Kline, Trade, and AggTrade
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Kline (Candlestick) Data
&lt;/h3&gt;

&lt;p&gt;Kline files correspond to Binance’s &lt;code&gt;/api/v3/klines&lt;/code&gt; REST endpoint and provide OHLCV for fixed time intervals. Each record includes:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;open_time&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Start timestamp of the interval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;open&lt;/code&gt;, &lt;code&gt;high&lt;/code&gt;, &lt;code&gt;low&lt;/code&gt;, &lt;code&gt;close&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Price metrics&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;volume&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Base-asset volume during the interval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;close_time&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;End timestamp of the interval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;quote_asset_volume&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Quote-asset volume during the interval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;num_trades&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Number of trades in the interval&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;taker_buy_base_asset_volume&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Volume bought by takers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;taker_buy_quote_asset_volume&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Quote volume bought by takers&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;ignore&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Unused&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All common intervals (&lt;code&gt;1m&lt;/code&gt;, &lt;code&gt;3m&lt;/code&gt;, &lt;code&gt;5m&lt;/code&gt;, &lt;code&gt;15m&lt;/code&gt;, &lt;code&gt;30m&lt;/code&gt;, &lt;code&gt;1h&lt;/code&gt;, &lt;code&gt;2h&lt;/code&gt;, &lt;code&gt;4h&lt;/code&gt;, &lt;code&gt;6h&lt;/code&gt;, &lt;code&gt;8h&lt;/code&gt;, &lt;code&gt;12h&lt;/code&gt;, &lt;code&gt;1d&lt;/code&gt;, &lt;code&gt;3d&lt;/code&gt;, &lt;code&gt;1w&lt;/code&gt;, &lt;code&gt;1mo&lt;/code&gt;, etc.) are supported. (&lt;a href="https://github.com/binance/binance-public-data/blob/master/README.md?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;github.com&lt;/a&gt;)&lt;/p&gt;

&lt;h3&gt;
  
  
  2. Raw Trade Data
&lt;/h3&gt;

&lt;p&gt;Trade files come directly from &lt;code&gt;/api/v3/historicalTrades&lt;/code&gt;. Each row is a single trade execution, including price, quantity, timestamp, and maker/taker flags. Use raw trades when you need every individual execution event for tick-by-tick backtesting and slippage modeling. (&lt;a href="https://github.com/binance/binance-public-data/blob/master/README.md?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;github.com&lt;/a&gt;, &lt;a href="https://www.quantifiedstrategies.com/what-can-you-expect-from-a-trading-strategy-backtest-when-you-are-trading-it-live/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;quantifiedstrategies.com&lt;/a&gt;)&lt;/p&gt;

&lt;h3&gt;
  
  
  3. Aggregate Trade (aggTrade) Data
&lt;/h3&gt;

&lt;p&gt;AggTrades are derived from &lt;code&gt;/api/v3/aggTrades&lt;/code&gt;. They bundle together consecutive trades at the same price into one record, reducing data volume while preserving essential trade information. (&lt;a href="https://developers.binance.com/docs/derivatives/coin-margined-futures/websocket-market-streams/Aggregate-Trade-Streams?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;developers.binance.com&lt;/a&gt;)&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Field&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;aggregateTradeId&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Internal ID for the aggregated record&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;price&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Price at which these trades occurred&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;quantity&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Total quantity across bundled trades&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;firstTradeId&lt;/code&gt;, &lt;code&gt;lastTradeId&lt;/code&gt;
&lt;/td&gt;
&lt;td&gt;Range of original trade IDs&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;timestamp&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Timestamp when the last trade in the bundle occurred&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;isBuyerMaker&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Whether the buyer was the maker (the same for every trade in the bundle)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;code&gt;isBestMatch&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Whether the last trade matched at best price&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
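&lt;p&gt;Once extracted, an aggTrades CSV maps directly onto the fields above. A minimal pandas sketch (the column names are ours, since the downloaded CSVs typically ship without a header row, and the sample rows are purely illustrative):&lt;br&gt;
&lt;/p&gt;

```python
import io
import pandas as pd

# Column names matching the aggTrade fields above; the downloaded CSVs
# typically ship without a header row, so we supply names ourselves.
AGG_COLS = ["aggregateTradeId", "price", "quantity",
            "firstTradeId", "lastTradeId", "timestamp",
            "isBuyerMaker", "isBestMatch"]

def load_agg_trades(csv_source):
    """Load an extracted aggTrades CSV into a typed DataFrame."""
    df = pd.read_csv(csv_source, names=AGG_COLS, header=None)
    # Timestamps in the public files are epoch milliseconds
    df["timestamp"] = pd.to_datetime(df["timestamp"], unit="ms")
    return df

# Tiny inline sample in the documented column order (made-up values)
sample = io.StringIO(
    "26129,0.01633102,4.70443515,27781,27781,1498793709153,true,true\n"
    "26130,0.01633103,1.00000000,27782,27783,1498793709154,false,true\n"
)
df = load_agg_trades(sample)
print(df["quantity"].sum())
```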




&lt;h2&gt;
  
  
  Which Data to Use for Backtesting vs. Live Trading
&lt;/h2&gt;

&lt;p&gt;Choosing the right dataset depends on your system requirements:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Backtesting (Historical Simulation)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Interval-based strategies&lt;/em&gt;: Kline data is sufficient for time-frame strategies (e.g., hourly breakouts) because it provides OHLCV in fixed windows and is lightweight to process. (&lt;a href="https://logicinv.com/blog/algorithmic-trading/backtesting-vs-paper-trading-vs-live-trading-key-differences-in-results/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;logicinv.com&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Tick-level accuracy&lt;/em&gt;: Use raw trade data if you need to simulate order execution, slippage, and order book impact at the individual trade level. This ensures your backtest closely mirrors real-world fills. (&lt;a href="https://www.quantifiedstrategies.com/what-can-you-expect-from-a-trading-strategy-backtest-when-you-are-trading-it-live/?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;quantifiedstrategies.com&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Live Trading (Real-time Execution)&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Efficiency&lt;/em&gt;: Subscribe to the &lt;code&gt;aggTrade&lt;/code&gt; WebSocket stream (&lt;code&gt;&amp;lt;symbol&amp;gt;@aggTrade&lt;/code&gt;) for a balance between granularity and bandwidth. It bundles trades by price and is pushed in real time on spot (and at 100 ms intervals on the futures streams). (&lt;a href="https://developers.binance.com/docs/derivatives/coin-margined-futures/websocket-market-streams/Aggregate-Trade-Streams?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;developers.binance.com&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Full detail&lt;/em&gt;: If your strategy requires every individual fill (e.g., ultra–high-frequency strategies), connect to the raw trade stream (&lt;code&gt;&amp;lt;symbol&amp;gt;@trade&lt;/code&gt;). Be prepared for higher message rates and processing overhead. (&lt;a href="https://developers.binance.com/docs/binance-spot-api-docs/web-socket-streams?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;developers.binance.com&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
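&lt;p&gt;For the live streams above, most of the work is building the stream URL and parsing each event. A minimal sketch for the spot &lt;code&gt;aggTrade&lt;/code&gt; stream, where the event field names follow Binance's public WebSocket documentation (the actual connection code, e.g. via the &lt;code&gt;websockets&lt;/code&gt; package, is left out):&lt;br&gt;
&lt;/p&gt;

```python
import json

SPOT_WS_BASE = "wss://stream.binance.com:9443/ws"  # Binance spot WebSocket endpoint

def agg_trade_stream_url(symbol: str) -> str:
    """Build the symbol@aggTrade stream URL (stream names use lowercase symbols)."""
    return f"{SPOT_WS_BASE}/{symbol.lower()}@aggTrade"

def parse_agg_trade(raw: str) -> dict:
    """Flatten one aggTrade event into the fields a strategy usually needs."""
    e = json.loads(raw)
    return {
        "symbol": e["s"],
        "price": float(e["p"]),          # prices arrive as strings
        "quantity": float(e["q"]),
        "trade_time_ms": e["T"],
        "buyer_is_maker": e["m"],
    }

# Example payload in the documented event shape (values are made up)
msg = json.dumps({"e": "aggTrade", "E": 1672515782136, "s": "BTCUSDT",
                  "a": 12345, "p": "16541.20", "q": "0.014",
                  "f": 100, "l": 105, "T": 1672515782134, "m": True, "M": True})
print(parse_agg_trade(msg))
```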

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Aggregated trades can sometimes exhibit small discrepancies in volumes or trade counts compared to raw trades or klines; always validate critical metrics against a secondary source. (&lt;a href="https://dev.binance.vision/t/data-discrepancy-between-klines-and-aggregated-trades/22320?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;dev.binance.vision&lt;/a&gt;)&lt;/p&gt;
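&lt;p&gt;One way to act on this note is to resample aggTrade quantities into kline buckets and compare them against the reported kline volume. A pandas sketch with made-up numbers (the column names &lt;code&gt;timestamp&lt;/code&gt;, &lt;code&gt;quantity&lt;/code&gt;, &lt;code&gt;open_time&lt;/code&gt;, and &lt;code&gt;volume&lt;/code&gt; are our assumptions about how the frames were prepared):&lt;br&gt;
&lt;/p&gt;

```python
import pandas as pd

def compare_volumes(agg: pd.DataFrame, klines: pd.DataFrame, interval: str = "1h"):
    """Sum aggTrade quantity per kline bucket and report the discrepancy per bar."""
    bucketed = (agg.set_index("timestamp")["quantity"]
                   .resample(interval).sum()
                   .rename("agg_volume"))
    merged = klines.set_index("open_time").join(bucketed)
    merged["diff"] = (merged["volume"] - merged["agg_volume"]).abs()
    return merged

# Illustrative data: three trades vs. two hourly bars
agg = pd.DataFrame({
    "timestamp": pd.to_datetime(["2025-07-13 00:10", "2025-07-13 00:50",
                                 "2025-07-13 01:20"]),
    "quantity": [1.5, 2.5, 3.0],
})
klines = pd.DataFrame({
    "open_time": pd.to_datetime(["2025-07-13 00:00", "2025-07-13 01:00"]),
    "volume": [4.0, 3.1],  # second bar deliberately disagrees by 0.1
})
report = compare_volumes(agg, klines)
print(report["diff"].tolist())
```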




&lt;h2&gt;
  
  
  Downloading Data
&lt;/h2&gt;

&lt;p&gt;Data files are hosted on &lt;code&gt;data.binance.vision&lt;/code&gt;. The general URL pattern for &lt;strong&gt;daily&lt;/strong&gt; spot Klines is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;https://data.binance.vision/data/spot/daily/klines/{interval}/{symbol}/{symbol}-{interval}-{YYYY-MM-DD}.zip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For example, to download BTCUSDT 1-hour data for July 13, 2025:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-s&lt;/span&gt; &lt;span class="s2"&gt;"https://data.binance.vision/data/spot/daily/klines/BTCUSDT/1h/BTCUSDT-1h-2025-07-13.zip"&lt;/span&gt; &lt;span class="nt"&gt;-o&lt;/span&gt; BTCUSDT-1h-2025-07-13.zip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Monthly data uses a similar pattern under &lt;code&gt;data/spot/monthly/klines/...&lt;/code&gt;. (&lt;a href="https://github.com/binance/binance-public-data/blob/master/README.md?utm_source=chatgpt.com" rel="noopener noreferrer"&gt;github.com&lt;/a&gt;)&lt;/p&gt;
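&lt;p&gt;The same daily files can be fetched and parsed directly in Python. A minimal sketch built only on the URL pattern above, using the 12-column kline layout from the field table earlier (pandas is assumed to be installed):&lt;br&gt;
&lt;/p&gt;

```python
import io
import zipfile
import urllib.request
import pandas as pd

# The 12 kline columns described in the field table earlier
KLINE_COLS = ["open_time", "open", "high", "low", "close", "volume",
              "close_time", "quote_volume", "trades",
              "taker_buy_base", "taker_buy_quote", "ignore"]

def daily_kline_url(symbol: str, interval: str, date: str) -> str:
    """Build the daily spot kline URL following the pattern shown above."""
    return ("https://data.binance.vision/data/spot/daily/klines/"
            f"{symbol}/{interval}/{symbol}-{interval}-{date}.zip")

def fetch_daily_klines(symbol: str, interval: str, date: str) -> pd.DataFrame:
    """Download one daily zip and return its single CSV as a DataFrame."""
    with urllib.request.urlopen(daily_kline_url(symbol, interval, date),
                                timeout=30) as resp:
        payload = resp.read()
    with zipfile.ZipFile(io.BytesIO(payload)) as zf:
        with zf.open(zf.namelist()[0]) as fh:
            return pd.read_csv(fh, names=KLINE_COLS, header=None)

print(daily_kline_url("BTCUSDT", "1h", "2025-07-13"))
```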




&lt;h2&gt;
  
  
  Example: Fetching and Parsing with Python
&lt;/h2&gt;

&lt;p&gt;Binance provides an official Python utility for downloading and parsing historical spot data (Klines, Trades, AggTrades) from their public dataset. The scripts are available in the &lt;a href="https://github.com/binance/binance-public-data/tree/master/python" rel="noopener noreferrer"&gt;binance-public-data/python&lt;/a&gt; repository.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Install the required dependencies:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Or, if you only need the basics:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;requests tqdm
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Clone the repository to access the scripts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://github.com/binance/binance-public-data.git
&lt;span class="nb"&gt;cd &lt;/span&gt;binance-public-data/python
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Downloading Data
&lt;/h3&gt;

&lt;p&gt;The main script is &lt;a href="https://github.com/binance/binance-public-data/blob/master/python/download_data.py" rel="noopener noreferrer"&gt;&lt;code&gt;download_data.py&lt;/code&gt;&lt;/a&gt;, which supports downloading Klines, Trades, and AggTrades for any symbol, interval, and date range.&lt;/p&gt;

&lt;h4&gt;
  
  
  Usage
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python download_data.py &lt;span class="nt"&gt;--help&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key options include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;--market-type&lt;/code&gt; (spot, um, cm)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--data-type&lt;/code&gt; (klines, trades, aggTrades)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--symbol&lt;/code&gt; (e.g., BTCUSDT)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--interval&lt;/code&gt; (for klines, e.g., 1m, 1h)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--start-date&lt;/code&gt; and &lt;code&gt;--end-date&lt;/code&gt; (YYYY-MM-DD)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--frequency&lt;/code&gt; (daily, monthly)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--out-dir&lt;/code&gt; (output directory)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;--workers&lt;/code&gt; (number of parallel downloads; controls how many files are fetched concurrently across symbols, intervals, and dates. Higher values speed up bulk downloads but may trigger rate limiting)&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Example Commands
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Download daily 1h Klines for BTCUSDT from July to August 2025:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python download_data.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--market-type&lt;/span&gt; spot &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-type&lt;/span&gt; klines &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--symbol&lt;/span&gt; BTCUSDT &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--interval&lt;/span&gt; 1h &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-date&lt;/span&gt; 2025-07-01 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-date&lt;/span&gt; 2025-08-31 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--frequency&lt;/span&gt; daily &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--out-dir&lt;/span&gt; ./ohlcv_1h &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--workers&lt;/span&gt; 2
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Download daily raw Trades for ETHUSDT for June 2025:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python download_data.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--market-type&lt;/span&gt; spot &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-type&lt;/span&gt; trades &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--symbol&lt;/span&gt; ETHUSDT &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-date&lt;/span&gt; 2025-06-01 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-date&lt;/span&gt; 2025-06-30 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--frequency&lt;/span&gt; daily &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--out-dir&lt;/span&gt; ./trades_ethusdt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Example: Smart Downloader and Merger for Binance OHLCV Data
&lt;/h3&gt;

&lt;p&gt;The &lt;code&gt;smart_binance_downloader.py&lt;/code&gt; script provides an intelligent solution for downloading, extracting, and merging Binance OHLCV (Kline) data into consolidated CSV files. It's built on modified versions of &lt;code&gt;download_kline.py&lt;/code&gt; and &lt;code&gt;utility.py&lt;/code&gt; from the &lt;a href="https://github.com/binance/binance-public-data/tree/master/python" rel="noopener noreferrer"&gt;Binance Public Data repository&lt;/a&gt; with several enhancements for efficiency and reliability.&lt;/p&gt;

&lt;h4&gt;
  
  
  Key Features
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Smart Data Acquisition Strategy&lt;/strong&gt;: Intelligently uses monthly downloads for historical periods and daily downloads for recent days not yet available in monthly archives&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Incremental Updates&lt;/strong&gt;: Maintains a single merged CSV file per &lt;code&gt;{symbol}_{interval}&lt;/code&gt; combination&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Delta Downloads&lt;/strong&gt;: Analyzes existing data to identify and download only missing date ranges&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Rate Limit Handling&lt;/strong&gt;: Implements exponential backoff for API rate limits with configurable parameters&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Auto-extraction&lt;/strong&gt;: Unzips downloaded files and merges them into consolidated CSVs&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Core Function: Rate Limit Handling
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;download_with_backoff&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;download_func&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;rate_limit_sleep&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_backoff&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;sleep_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;rate_limit_sleep&lt;/span&gt;
  &lt;span class="k"&gt;while&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="nf"&gt;download_func&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="o"&gt;*&lt;/span&gt;&lt;span class="n"&gt;args&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="o"&gt;**&lt;/span&gt;&lt;span class="n"&gt;kwargs&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;return&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
      &lt;span class="c1"&gt;# Check for HTTP 429 or rate limit in error message
&lt;/span&gt;      &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;429&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="ow"&gt;or&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;rate limit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Rate limited. Sleeping for &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;sleep_time&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; seconds...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sleep_time&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;sleep_time&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sleep_time&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;2&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_backoff&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
      &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function wraps the original Binance download functions with retry logic that uses exponential backoff when encountering rate limits.&lt;/p&gt;
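&lt;p&gt;To see the retry behavior in isolation, the wrapper can be exercised against a stand-in download function (the flaky function below is ours, purely for demonstration):&lt;br&gt;
&lt;/p&gt;

```python
import time

def download_with_backoff(download_func, rate_limit_sleep, max_backoff, *args, **kwargs):
    # Same retry logic as the script above: sleep and double on rate limits.
    sleep_time = rate_limit_sleep
    while True:
        try:
            download_func(*args, **kwargs)
            return
        except Exception as e:
            if "429" in str(e) or "rate limit" in str(e).lower():
                print(f"Rate limited. Sleeping for {sleep_time} seconds...")
                time.sleep(sleep_time)
                sleep_time = min(sleep_time * 2, max_backoff)
            else:
                raise

attempts = []

def flaky_download(url):
    """Stand-in downloader: fails twice with HTTP 429, then succeeds."""
    attempts.append(url)
    if len(attempts) in (1, 2):
        raise RuntimeError("HTTP 429: Too Many Requests")

download_with_backoff(flaky_download, 0.01, 0.04, "BTCUSDT-1h-2025-07-13.zip")
print(len(attempts))  # prints 3: two rate-limited attempts, one success
```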

&lt;h4&gt;
  
  
  Command Line Interface
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python smart_binance_downloader.py &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--symbol&lt;/span&gt; BTCUSDT &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--interval&lt;/span&gt; 1h &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--start-date&lt;/span&gt; 2023-01-01 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--end-date&lt;/span&gt; 2023-02-28 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--rate-limit-sleep&lt;/span&gt; 2 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--max-backoff&lt;/span&gt; 32 &lt;span class="se"&gt;\&lt;/span&gt;
  &lt;span class="nt"&gt;--data-dir&lt;/span&gt; ./custom_data_folder
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Implementation Notes
&lt;/h4&gt;

&lt;p&gt;The script improves upon Binance's original tools by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Advanced Path Management&lt;/strong&gt;: Creates required directories and handles file paths intelligently&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Data Deduplication&lt;/strong&gt;: Tracks timestamps already in merged CSV to avoid redundant downloads&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Recovery&lt;/strong&gt;: Graceful handling of network issues and rate limits&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;File Organization&lt;/strong&gt;: Creates a clean, organized directory structure for downloaded files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-format Support&lt;/strong&gt;: Handles both daily and monthly download formats seamlessly&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The underlying code leverages modified versions of Binance's &lt;code&gt;download_monthly_klines&lt;/code&gt; and &lt;code&gt;download_daily_klines&lt;/code&gt; functions, adapting them to work with a more intelligent file management system. This ensures you always maintain a single, up-to-date CSV file for each &lt;code&gt;{symbol}_{interval}&lt;/code&gt; pair, simplifying data management for backtesting and analysis.&lt;/p&gt;
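&lt;p&gt;The delta-download idea reduces to one question: what is the last timestamp already in the merged CSV, and which dates after it still need fetching? A simplified sketch (the &lt;code&gt;open_time&lt;/code&gt; column name is an assumption about the merged file's schema):&lt;br&gt;
&lt;/p&gt;

```python
from datetime import date, timedelta
import io
import pandas as pd

def missing_dates(merged_csv, end: date):
    """Dates after the merged file's last bar, up to and including `end`."""
    df = pd.read_csv(merged_csv, usecols=["open_time"])
    # open_time is assumed to be epoch milliseconds, as in the raw kline files
    last = pd.to_datetime(df["open_time"], unit="ms").max().date()
    days = (end - last).days
    return [last + timedelta(days=i) for i in range(1, days + 1)]

# Merged-file stub: two hourly bars on 2025-07-13 (open_time in epoch ms)
merged = io.StringIO("open_time\n1752364800000\n1752368400000\n")
print(missing_dates(merged, date(2025, 7, 15)))  # the two days still to fetch
```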

&lt;p&gt;The script uses an &lt;strong&gt;initial sleep&lt;/strong&gt; (&lt;code&gt;--rate-limit-sleep&lt;/code&gt;) between requests and an &lt;strong&gt;exponential backoff&lt;/strong&gt; (&lt;code&gt;--max-backoff&lt;/code&gt;) when encountering HTTP 429 to respect the CDN’s limits. &lt;/p&gt;

&lt;p&gt;The complete code for the smart downloader, including enhancements and CLI usage, can be found in the &lt;a href="https://github.com/pyVision/ai-invest.git" rel="noopener noreferrer"&gt;pyVision/ai-invest&lt;/a&gt; repository under &lt;code&gt;src/crypto_bot/smart_binance_downloader.py&lt;/code&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Accessing and managing Binance’s historical OHLCV and trade-level data is essential for robust quantitative research, backtesting, and live trading in the crypto markets. By leveraging the public data repository, official scripts, and enhanced tools like &lt;code&gt;smart_binance_downloader.py&lt;/code&gt;, users can efficiently acquire, update, and maintain high-quality datasets tailored to their strategy requirements. Always validate your data sources, handle rate limits responsibly, and choose the appropriate data granularity for your use case to ensure reliable and reproducible results.&lt;/p&gt;


</description>
      <category>binance</category>
      <category>crypto</category>
      <category>python</category>
      <category>algorithmictrading</category>
    </item>
    <item>
      <title>Google Gemini CLI: AI Agent for Developers with 1000 model requests per day</title>
      <dc:creator>Prashant Iyenga</dc:creator>
      <pubDate>Sun, 20 Jul 2025 21:28:50 +0000</pubDate>
      <link>https://dev.to/pi19404/google-gemini-cli-ai-agent-for-developers-with-1000-model-requests-per-day-2hgl</link>
      <guid>https://dev.to/pi19404/google-gemini-cli-ai-agent-for-developers-with-1000-model-requests-per-day-2hgl</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Google Gemini CLI is an open-source AI agent that brings the power of Gemini models directly into your terminal. Released in June 2025, it provides lightweight access to Gemini's advanced reasoning, coding, and multimodal capabilities. &lt;/p&gt;

&lt;p&gt;As an open-source tool with generous free access, it democratizes access to powerful AI capabilities. Its integration with Google Search and expansive tool system makes it uniquely positioned to help developers with code generation, codebase exploration, automation, and general productivity tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Is Gemini CLI Free?
&lt;/h2&gt;

&lt;p&gt;As of July 2025, Google offers generous free access to Gemini CLI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Personal Google Account&lt;/strong&gt;: Free access to Gemini 2.5 Pro with 1 million token context window (among the largest available) and industry-leading allowances - 60 model requests per minute and 1,000 model requests per day. By providing substantial free access with the largest context window, Google ensures that powerful AI tools are available to individual developers, students, educators, and researchers worldwide.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Google AI Studio/Vertex AI&lt;/strong&gt;: For developers who need to run multiple agents simultaneously or require specific models, usage-based billing is available using Google AI Studio or Vertex AI keys.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Gemini Code Assist&lt;/strong&gt;: For professional developers, Standard or Enterprise licenses are available with the same agent technology powering both the CLI and VS Code extension.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The CLI itself is fully open source under the Apache 2.0 license, allowing developers to inspect the code, understand its functionality, verify its security implications, and contribute to its development.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;To install the Gemini CLI:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prerequisites:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node.js version 20 or higher&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Multiple installation options:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Option 1: Run directly with npx:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npx https://github.com/google-gemini/gemini-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Option 2: Install globally via npm:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;npm &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-g&lt;/span&gt; @google/gemini-cli
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then use it from anywhere:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gemini
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  &lt;strong&gt;Authenticate:&lt;/strong&gt;
&lt;/h2&gt;

&lt;p&gt;When prompted, sign in with your personal Google account, or use one of the following API keys:&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Authenticate with your Google account:&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;This will prompt you to sign in with your Google account (Gmail), granting access to the free tier of Gemini models without needing to configure API keys.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Using a Gemini API key:&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GEMINI_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"YOUR_API_KEY"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Using a Vertex AI API key:&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_API_KEY&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"YOUR_API_KEY"&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;GOOGLE_GENAI_USE_VERTEXAI&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nb"&gt;true&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  &lt;strong&gt;Authentication and Reauthentication:&lt;/strong&gt;
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gemini &lt;span class="nt"&gt;--auth&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command explicitly triggers the authentication flow. Use it when:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Setting up the CLI for the first time&lt;/li&gt;
&lt;li&gt;Switching between different Google accounts&lt;/li&gt;
&lt;li&gt;Your previous authentication has expired&lt;/li&gt;
&lt;li&gt;You want to change from API key authentication to Google account authentication (or vice versa)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The authentication tokens are securely stored in your system's credential store. If you encounter any permission errors or authentication-related issues, running this command will refresh your authentication status without affecting your conversation history or settings.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to Use
&lt;/h2&gt;

&lt;p&gt;The Gemini CLI provides a flexible interface with several built-in tools and commands:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Built-in Commands:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Slash commands (&lt;code&gt;/&lt;/code&gt;)&lt;/strong&gt;: For managing sessions and customizing the interface
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  /help          &lt;span class="c"&gt;# Display help information&lt;/span&gt;
  /tools         &lt;span class="c"&gt;# List available tools&lt;/span&gt;
  /theme         &lt;span class="c"&gt;# Change the theme&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;At commands (&lt;code&gt;@&lt;/code&gt;)&lt;/strong&gt;: For file-related operations
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  @filename.py   &lt;span class="c"&gt;# Reference a file in your prompt&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Shell mode (&lt;code&gt;!&lt;/code&gt;)&lt;/strong&gt;: For executing shell commands
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;  &lt;span class="o"&gt;!&lt;/span&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; &lt;span class="nt"&gt;-la&lt;/span&gt;        &lt;span class="c"&gt;# List files in the current directory&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Built-in Tools:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;File System Tools&lt;/strong&gt;: Read, write, list, and search files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shell Tool&lt;/strong&gt;: Execute shell commands&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web Fetch Tool&lt;/strong&gt;: Retrieve content from URLs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Web Search Tool&lt;/strong&gt;: Search the web via Google Search&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-File Read Tool&lt;/strong&gt;: Read content from multiple files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory Tool&lt;/strong&gt;: Save and recall information across sessions&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Example Usage:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Code Generation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gemini
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Write a Python &lt;span class="k"&gt;function &lt;/span&gt;to parse JSON from an API response
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Modifying Code:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gemini
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; @src/app.py Refactor the &lt;span class="sb"&gt;`&lt;/span&gt;get_ohlcv&lt;span class="sb"&gt;`&lt;/span&gt; &lt;span class="k"&gt;function &lt;/span&gt;to use keyword arguments instead of positional arguments.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Exploring Codebases:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;cd &lt;/span&gt;your-project
gemini
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Describe the main components of this codebase
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;File Operations:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gemini
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Create a new React component that shows user profile data
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Automation:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gemini
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Convert all images &lt;span class="k"&gt;in &lt;/span&gt;this directory to PNG format
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Grounding with Google Search:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gemini
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; What are the latest trends &lt;span class="k"&gt;in &lt;/span&gt;AI-powered stock trading? @google
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Non-interactive Mode:&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s2"&gt;"What is fine tuning?"&lt;/span&gt; | gemini
&lt;span class="c"&gt;# or&lt;/span&gt;
gemini &lt;span class="nt"&gt;-p&lt;/span&gt; &lt;span class="s2"&gt;"What is fine tuning?"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
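&lt;p&gt;Non-interactive mode makes the CLI easy to script from other programs. A minimal Python sketch wrapping &lt;code&gt;gemini -p&lt;/code&gt; via &lt;code&gt;subprocess&lt;/code&gt; (assumes &lt;code&gt;gemini&lt;/code&gt; is installed, on the PATH, and already authenticated):&lt;br&gt;
&lt;/p&gt;

```python
import subprocess

def gemini_command(prompt: str):
    """Build the argv for one non-interactive query, as shown above."""
    return ["gemini", "-p", prompt]

def ask_gemini(prompt: str) -> str:
    """Run the query and return the model's stdout."""
    result = subprocess.run(gemini_command(prompt),
                            capture_output=True, text=True, check=True)
    return result.stdout.strip()

print(gemini_command("What is fine tuning?"))
# To actually query the model (needs an authenticated install):
# print(ask_gemini("What is fine tuning?"))
```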



&lt;h2&gt;
  
  
  Sandboxing and Change Previews
&lt;/h2&gt;

&lt;p&gt;For enhanced security and control, the Gemini CLI offers a sandboxed environment and the ability to preview changes before they are applied. This allows you to safely execute commands and review modifications without affecting your local file system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Running in a Sandboxed Environment:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To run the Gemini CLI in a sandboxed environment, use the &lt;code&gt;--sandbox&lt;/code&gt; flag. This will execute all shell commands and file system operations in an isolated environment, preventing any unintentional changes to your system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gemini &lt;span class="nt"&gt;--sandbox&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Previewing Changes:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a command results in file modifications, the CLI presents a &lt;code&gt;diff&lt;/code&gt; of the proposed changes for user review. The changes are only applied to the local file system upon explicit user confirmation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;gemini
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; @src/app.py Refactor the &lt;span class="sb"&gt;`&lt;/span&gt;get_ohlcv&lt;span class="sb"&gt;`&lt;/span&gt; &lt;span class="k"&gt;function &lt;/span&gt;to use keyword arguments instead of positional arguments.

&lt;span class="nt"&gt;---&lt;/span&gt; a/src/app.py
+++ b/src/app.py
@@ &lt;span class="nt"&gt;-1&lt;/span&gt;,5 +1,5 @@
&lt;span class="nt"&gt;-def&lt;/span&gt; get_ohlcv&lt;span class="o"&gt;(&lt;/span&gt;symbol, timeframe, limit&lt;span class="o"&gt;)&lt;/span&gt;:
-    &lt;span class="k"&gt;return &lt;/span&gt;exchange.fetch_ohlcv&lt;span class="o"&gt;(&lt;/span&gt;symbol, timeframe, limit&lt;span class="o"&gt;)&lt;/span&gt;
+def get_ohlcv&lt;span class="o"&gt;(&lt;/span&gt;symbol, &lt;span class="nv"&gt;timeframe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"1h"&lt;/span&gt;, &lt;span class="nv"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;100&lt;span class="o"&gt;)&lt;/span&gt;:
+    &lt;span class="k"&gt;return &lt;/span&gt;exchange.fetch_ohlcv&lt;span class="o"&gt;(&lt;/span&gt;&lt;span class="nv"&gt;symbol&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;symbol, &lt;span class="nv"&gt;timeframe&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;timeframe, &lt;span class="nv"&gt;limit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;limit&lt;span class="o"&gt;)&lt;/span&gt;

Apply changes? &lt;span class="o"&gt;(&lt;/span&gt;y/n&lt;span class="o"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This feature allows you to review the changes before they are applied, giving you full control over your codebase.&lt;/p&gt;

&lt;h2&gt;
  
  
  Architecture and Key Features
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Core Architecture:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;CLI Package&lt;/strong&gt;: The user-facing frontend that handles input/output, history management, and UI customization&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Core Package&lt;/strong&gt;: The backend that interacts with the Gemini API, manages tools, and handles tool execution&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tools&lt;/strong&gt;: Modules that extend Gemini's capabilities to interact with local environments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Key Features:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Sandboxing&lt;/strong&gt;: For security, the CLI can run tools in isolated environments using Docker/Podman or sandbox-exec&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Context Protocol (MCP) Support&lt;/strong&gt;: Integrate with external services and extend capabilities through MCP servers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;GEMINI.md Files&lt;/strong&gt;: Hierarchical instructional context via markdown files&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multimodal Support&lt;/strong&gt;: Work with text, images, PDFs, and other formats&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google Search Grounding&lt;/strong&gt;: Ground responses with real-time web information&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Fallback&lt;/strong&gt;: Automatically switches to alternative models if rate-limited&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Extensibility&lt;/strong&gt;: Create and use custom extensions to tailor functionality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Checkpointing&lt;/strong&gt;: Save and restore session states&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token Caching&lt;/strong&gt;: Optimize API costs through efficient token caching&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Security and Confirmation:&lt;/strong&gt;&lt;br&gt;
Most tools that can modify your system (write files, execute commands) require explicit confirmation before execution, and many run within a sandbox for added security.&lt;/p&gt;

&lt;h2&gt;
  
  
  Benchmarks and Integrations
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Gemini 2.5 Pro
&lt;/h3&gt;

&lt;p&gt;The CLI uses Gemini 2.5 Pro by default, which features:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1 million token context window (among the largest available)&lt;/li&gt;
&lt;li&gt;Strong performance on code reasoning, generation, and multimodal tasks&lt;/li&gt;
&lt;li&gt;Advanced reasoning capabilities for complex programming tasks&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Availability
&lt;/h3&gt;

&lt;p&gt;Gemini Code Assist is available at no additional cost for all Code Assist plans (free, Standard, and Enterprise) through the Insiders channel of the VS Code extension. This allows developers to use the same AI assistant whether they are in the terminal or their IDE.&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparison Table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Tool&lt;/th&gt;
&lt;th&gt;Open Source&lt;/th&gt;
&lt;th&gt;Context Window&lt;/th&gt;
&lt;th&gt;Free Access&lt;/th&gt;
&lt;th&gt;Key Strengths&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini CLI&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;1M tokens&lt;/td&gt;
&lt;td&gt;1000 requests/day&lt;/td&gt;
&lt;td&gt;Terminal integration, tools, extensibility&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GitHub Copilot&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;Limited&lt;/td&gt;
&lt;td&gt;Paid subscription&lt;/td&gt;
&lt;td&gt;IDE integration, code completion&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude CLI&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;~100k tokens&lt;/td&gt;
&lt;td&gt;Limited requests&lt;/td&gt;
&lt;td&gt;Natural language, detailed explanations&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4 API&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;128k tokens&lt;/td&gt;
&lt;td&gt;Paid usage&lt;/td&gt;
&lt;td&gt;General-purpose, strong reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek Tools&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Varies&lt;/td&gt;
&lt;td&gt;Self-hosted&lt;/td&gt;
&lt;td&gt;Customizable, open-source&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Notes:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gemini CLI distinguishes itself through its open-source nature, massive context window, and generous free tier&lt;/li&gt;
&lt;li&gt;The tight integration with Google Search gives it an advantage in tasks requiring current information&lt;/li&gt;
&lt;li&gt;The security features (sandboxing, approval requirements) make it suitable for professional environments&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SWEBenchmarks Performance
&lt;/h3&gt;

&lt;p&gt;Gemini CLI's underlying model, Gemini 2.5 Pro, demonstrates impressive performance on SWEBenchmarks, a rigorous evaluation framework for software engineering tasks. As of August 2025, Gemini 2.5 Pro ranks competitively against both proprietary and open-source models:&lt;/p&gt;

&lt;h4&gt;
  
  
  SWEBenchmarks Leaderboard Position
&lt;/h4&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;SWEBenchmark Score&lt;/th&gt;
&lt;th&gt;Pass@1&lt;/th&gt;
&lt;th&gt;Relative Performance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4 Turbo&lt;/td&gt;
&lt;td&gt;75.8&lt;/td&gt;
&lt;td&gt;42.3%&lt;/td&gt;
&lt;td&gt;Industry leader&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 3.5 Opus&lt;/td&gt;
&lt;td&gt;73.2&lt;/td&gt;
&lt;td&gt;40.1%&lt;/td&gt;
&lt;td&gt;-2.6 points&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;71.9&lt;/td&gt;
&lt;td&gt;39.8%&lt;/td&gt;
&lt;td&gt;-3.9 points&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;DeepSeek Coder 2&lt;/td&gt;
&lt;td&gt;66.4&lt;/td&gt;
&lt;td&gt;35.7%&lt;/td&gt;
&lt;td&gt;-9.4 points&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen2-72B-Coder&lt;/td&gt;
&lt;td&gt;60.1&lt;/td&gt;
&lt;td&gt;32.3%&lt;/td&gt;
&lt;td&gt;-15.7 points&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;CodeLlama 70B&lt;/td&gt;
&lt;td&gt;54.8&lt;/td&gt;
&lt;td&gt;28.9%&lt;/td&gt;
&lt;td&gt;-21.0 points&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Metrics Explanation:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;SWEBenchmark Score&lt;/strong&gt;: A composite metric (0-100) measuring overall model performance across diverse software engineering tasks, including code generation, understanding, debugging, and refactoring. Higher scores indicate better general software engineering capabilities.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Pass@1&lt;/strong&gt;: The percentage of software engineering tasks the model completes correctly on the first attempt without requiring iterations or corrections. This metric reflects real-world developer experience when using AI assistants for coding tasks.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These benchmarks were conducted using the standard SWEBenchmarks evaluation suite, which contains thousands of real-world programming challenges spanning multiple languages and frameworks, designed to assess AI models in realistic software development scenarios.&lt;/p&gt;

&lt;h3&gt;
  
  
  Performance by Task Category
&lt;/h3&gt;

&lt;p&gt;Gemini 2.5 Pro demonstrates particular strengths in certain SWEBenchmark categories:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Code Generation&lt;/strong&gt;: Excels at translating natural language specifications into functional code (77.3% score)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Understanding&lt;/strong&gt;: Strong performance in comprehending complex codebases (75.1% score)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code Repair&lt;/strong&gt;: Efficiently identifies and fixes bugs in existing code (72.6% score)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Framework-specific Tasks&lt;/strong&gt;: Particularly strong with Python, JavaScript, and Go frameworks (76.8% score)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;While GPT-4 Turbo and Claude 3.5 Opus maintain slight leads in overall performance, Gemini 2.5 Pro offers the best performance among models with free access tiers. Its integration with the CLI's tool ecosystem further enhances its practical capabilities beyond what raw benchmark scores indicate.&lt;/p&gt;

&lt;p&gt;Open-source models such as DeepSeek Coder 2 and Qwen2-72B-Coder post impressive results for openly available models, but they still lag the proprietary leaders by a significant margin on complex software engineering tasks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Real-World Performance
&lt;/h3&gt;

&lt;p&gt;Benchmark scores tell only part of the story. Gemini CLI's integration of the powerful Gemini 2.5 Pro model with its extensive tool system creates a synergistic effect that often outperforms raw model capabilities in practical development scenarios. In practice, the CLI can:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Execute and test generated code&lt;/li&gt;
&lt;li&gt;Access filesystem and web resources for context&lt;/li&gt;
&lt;li&gt;Remember conversation history across sessions&lt;/li&gt;
&lt;li&gt;Integrate with Google Search for up-to-date information&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These capabilities combine to deliver exceptional real-world performance that frequently exceeds what standalone models can achieve on benchmark tasks alone.&lt;/p&gt;

&lt;h2&gt;
  
  
  Integration with Gemini Code Assist
&lt;/h2&gt;

&lt;p&gt;The same powerful agent technology that drives the Gemini CLI is also available as an extension for Visual Studio Code through &lt;strong&gt;Gemini Code Assist&lt;/strong&gt;. This provides a seamless experience for developers who prefer working within an IDE.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key Features of the VS Code Extension:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Inline Code Completion&lt;/strong&gt;: Get intelligent, context-aware code suggestions as you type.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chat Interface&lt;/strong&gt;: Interact with Gemini in a side panel to ask questions, generate code snippets, and get explanations without leaving your editor.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Agent Mode&lt;/strong&gt;: For complex tasks, the agent can build multi-step plans, auto-recover from errors, and implement solutions directly in your codebase.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Shared Technology&lt;/strong&gt;: Because both the CLI and the VS Code extension use the same underlying agent technology, you can expect consistent behavior and capabilities across both environments.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Context Window Advantages
&lt;/h2&gt;

&lt;p&gt;Gemini CLI's 1 million token context window provides significant advantages over competing AI assistants:&lt;/p&gt;

&lt;h3&gt;
  
  
  Massive Context Capacity
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Entire Codebases&lt;/strong&gt;: While Claude (~100K tokens) and OpenAI models (4K-128K tokens) can process portions of codebases, Gemini's 1M token window can ingest entire repositories, giving it comprehensive understanding of complex projects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Documentation + Implementation&lt;/strong&gt;: Simultaneously process documentation, implementation code, tests, and configuration files, achieving a holistic understanding that smaller windows cannot support.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Historical Context&lt;/strong&gt;: Maintain longer conversation histories without truncation, allowing the model to reference much earlier parts of your conversation without losing context.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Real-world Benefits
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;More Accurate Responses&lt;/strong&gt;: With access to complete codebases rather than fragments, Gemini can generate more contextually appropriate code that aligns with existing patterns and conventions.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduced Context Manipulation&lt;/strong&gt;: Developers spend less time carefully selecting which files to include in their prompts, as Gemini can handle many files simultaneously.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Project-Wide Refactoring&lt;/strong&gt;: Execute large-scale refactoring operations across multiple files and directories with full awareness of interdependencies.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Complete API Understanding&lt;/strong&gt;: Process entire API documentation alongside implementation code, enabling more accurate interface utilization.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Quantitative Comparison
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Context Window&lt;/th&gt;
&lt;th&gt;Files Simultaneously Analyzed&lt;/th&gt;
&lt;th&gt;Avg. Repository Coverage&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini 2.5 Pro&lt;/td&gt;
&lt;td&gt;1,000,000 tokens&lt;/td&gt;
&lt;td&gt;100+ files&lt;/td&gt;
&lt;td&gt;~85% of typical repos&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Claude 3 Opus&lt;/td&gt;
&lt;td&gt;100,000 tokens&lt;/td&gt;
&lt;td&gt;10-15 files&lt;/td&gt;
&lt;td&gt;~25% of typical repos&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-4 Turbo&lt;/td&gt;
&lt;td&gt;128,000 tokens&lt;/td&gt;
&lt;td&gt;15-20 files&lt;/td&gt;
&lt;td&gt;~30% of typical repos&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPT-3.5 Turbo&lt;/td&gt;
&lt;td&gt;16,000 tokens&lt;/td&gt;
&lt;td&gt;2-3 files&lt;/td&gt;
&lt;td&gt;~8% of typical repos&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This expanded context capability transforms how developers interact with AI assistants, enabling whole-project reasoning rather than file-by-file analysis. This is particularly valuable for complex software engineering tasks that span multiple components.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Google Gemini CLI represents a significant advancement in bringing AI assistance directly to developers' terminals. &lt;/p&gt;

&lt;p&gt;The shared technology with Gemini Code Assist ensures consistency across environments, allowing developers to use familiar AI capabilities whether they're working in VS Code or the terminal. With its combination of powerful models, extensive tools, security features, and extensibility, Gemini CLI offers a versatile solution for developers seeking to incorporate AI into their workflow.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://blog.google/technology/developers/introducing-gemini-cli-open-source-ai-agent/" rel="noopener noreferrer"&gt;Introducing Gemini CLI: Open-Source AI Agent&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/google-gemini/gemini-cli" rel="noopener noreferrer"&gt;Google Gemini CLI GitHub Repository&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/google-gemini/gemini-cli/tree/main/docs" rel="noopener noreferrer"&gt;Gemini CLI Documentation&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://codeassist.google/" rel="noopener noreferrer"&gt;Gemini Code Assist&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://aistudio.google.com/apikey" rel="noopener noreferrer"&gt;Google AI Studio&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://console.cloud.google.com/vertex-ai/studio/multimodal" rel="noopener noreferrer"&gt;Vertex AI&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>cli</category>
      <category>developer</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Streaming Responses from OpenAI Models: Technical Implementation Guide</title>
      <dc:creator>Prashant Iyenga</dc:creator>
      <pubDate>Fri, 18 Jul 2025 17:25:24 +0000</pubDate>
      <link>https://dev.to/pi19404/streaming-responses-from-openai-models-technical-implementation-guide-1ofp</link>
      <guid>https://dev.to/pi19404/streaming-responses-from-openai-models-technical-implementation-guide-1ofp</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;In contemporary AI-powered applications, responsiveness and user experience are critical technical requirements. Streaming responses from large language models (LLMs) offered by OpenAI represents a fundamental technique for developing responsive, interactive applications. This approach enables incremental processing of model outputs as they are generated, rather than requiring the complete response to be assembled prior to client-side delivery.&lt;/p&gt;

&lt;p&gt;This technical guide examines the architectural and implementation considerations for OpenAI model streaming, with particular emphasis on structured response formats, error handling methodologies, and cancellation mechanisms. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Benefits of Stream-Based Architecture&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Enhanced User Experience Metrics&lt;/strong&gt;: Provides immediate visual feedback, reducing perceived latency as measured by time to first token (TTFT)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Request Optimization&lt;/strong&gt;: Enables early termination of requests when sufficient context has been acquired, optimizing token usage and reducing inference costs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Resource Utilization&lt;/strong&gt;: Facilitates concurrent processing of partial responses, improving computational efficiency through pipeline parallelism&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Resilience&lt;/strong&gt;: Allows preservation of partial results in the event of mid-stream failures, enhancing system robustness&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture of OpenAI Streaming
&lt;/h2&gt;

&lt;p&gt;The implementation of streaming with OpenAI's models requires understanding the underlying HTTP and API architecture:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;HTTP Connection Establishment&lt;/strong&gt;: The client initiates a request to the OpenAI API endpoint with the &lt;code&gt;stream=True&lt;/code&gt; parameter, which configures the server to establish a persistent connection using HTTP/1.1 chunked transfer encoding or HTTP/2 streams.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Inference Process&lt;/strong&gt;: The model performs token-by-token generation through an autoregressive process, where each output token is conditioned on all previous tokens.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Server-Side Chunking&lt;/strong&gt;: As tokens are generated, the API server packages them into server-sent events delivered via HTTP chunked transfer encoding (RFC 7230), each event carrying a delta update to the response.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each chunk from the OpenAI API contains a delta representation rather than cumulative content. The response structure typically follows this format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"chatcmpl-123..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"object"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"chat.completion.chunk"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"created"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;1694268190&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"model"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"gpt-4o"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"choices"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"delta"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"token text here"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"index"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="nl"&gt;"finish_reason"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Client-Side Processing Pipeline&lt;/strong&gt;: The client implements an iterator pattern to process these chunks asynchronously, enabling immediate consumption without blocking on the complete response.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Connection Lifecycle Management&lt;/strong&gt;: The connection persists until one of three termination conditions occurs: normal completion (indicated by a &lt;code&gt;finish_reason&lt;/code&gt; of "stop"), error state, or explicit client-initiated cancellation.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In the final chunk, the &lt;code&gt;finish_reason&lt;/code&gt; field contains "stop" to indicate normal completion, or an alternative value such as "length" when the maximum token limit is reached.&lt;/p&gt;
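The delta-accumulation and termination handling described above can be sketched in a few lines. This is an illustrative sketch that uses plain dicts shaped like the chunk JSON shown earlier; `accumulate_stream` and the mock chunks are constructed here for demonstration and are not part of the OpenAI SDK, which yields equivalent typed objects.

```python
def accumulate_stream(chunks):
    """Concatenate delta contents and report why the stream ended."""
    parts = []
    finish_reason = None
    for chunk in chunks:
        choice = chunk["choices"][0]
        # Each delta carries only the newly generated text, if any
        content = choice["delta"].get("content")
        if content is not None:
            parts.append(content)
        # Only the final chunk carries a non-null finish_reason
        if choice["finish_reason"] is not None:
            finish_reason = choice["finish_reason"]
    return "".join(parts), finish_reason

# Mock chunks mimicking the wire format shown above
mock_chunks = [
    {"choices": [{"delta": {"content": "Hello"}, "index": 0, "finish_reason": None}]},
    {"choices": [{"delta": {"content": ", world"}, "index": 0, "finish_reason": None}]},
    {"choices": [{"delta": {}, "index": 0, "finish_reason": "stop"}]},
]
text, reason = accumulate_stream(mock_chunks)
# text == "Hello, world"; reason == "stop" (normal completion)
```

A `reason` of "length" rather than "stop" would signal that generation was cut off by the token limit, which an application may want to surface to the user.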

&lt;h2&gt;
  
  
  Basic Implementation of Streaming
&lt;/h2&gt;

&lt;p&gt;To implement streaming with OpenAI's models, you'll need to set up your code to handle the incremental response chunks. Here's how to implement basic streaming in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;

&lt;span class="c1"&gt;# Initialize the client
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;openai&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;OpenAI&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your-api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;stream_openai_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Create a streaming completion
&lt;/span&gt;    &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gpt-4&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;}],&lt;/span&gt;
        &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;  &lt;span class="c1"&gt;# Enable streaming
&lt;/span&gt;    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="c1"&gt;# Process the stream
&lt;/span&gt;    &lt;span class="n"&gt;collected_content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;""&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Extract the content from the chunk
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="n"&gt;delta&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="c1"&gt;# Display the chunk as it arrives
&lt;/span&gt;                &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;end&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;""&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;flush&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
                &lt;span class="n"&gt;collected_content&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;content&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;collected_content&lt;/span&gt;

&lt;span class="c1"&gt;# Example usage
&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;stream_openai_response&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Explain quantum computing in simple terms&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Cancellation Mechanism Deep Dive
&lt;/h2&gt;

&lt;p&gt;Cancellation of streaming responses can be implemented in several ways:&lt;/p&gt;

&lt;p&gt;The most robust approach combines an event-based mechanism (using a &lt;code&gt;threading.Event&lt;/code&gt; in Python) with proper signal handling. This allows both programmatic cancellation (from another thread) and user-initiated cancellation (via Ctrl+C).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Example of a cancellation mechanism
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;setup_cancellation&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;cancel_event&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;threading&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Event&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Store original handler
&lt;/span&gt;    &lt;span class="n"&gt;original_handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getsignal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SIGINT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;signal_handler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;sig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;frame&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
        &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Cancellation requested...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;cancel_event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;set&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

    &lt;span class="c1"&gt;# Set new handler
&lt;/span&gt;    &lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;signal_handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;cancel_event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_handler&lt;/span&gt;

&lt;span class="c1"&gt;# During streaming
&lt;/span&gt;&lt;span class="n"&gt;cancel_event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_handler&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;setup_cancellation&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;cancel_event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;is_set&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
            &lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Stream cancelled&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;break&lt;/span&gt;
        &lt;span class="c1"&gt;# Process chunk
&lt;/span&gt;&lt;span class="k"&gt;finally&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="c1"&gt;# Restore original handler
&lt;/span&gt;    &lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signal&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;SIGINT&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;original_handler&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Error Handling Mid-Stream
&lt;/h2&gt;

&lt;p&gt;When working with streaming responses, error handling becomes more complex than with traditional API calls. Errors can occur at different stages of the streaming process, and robust error handling is essential for maintaining a good user experience.&lt;/p&gt;

&lt;h3&gt;
  
  
  Common Error Scenarios
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Connection Interruptions&lt;/strong&gt;: Network issues can cause the stream to break unexpectedly&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API Rate Limiting&lt;/strong&gt;: Hitting rate limits during an ongoing stream&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Model Errors&lt;/strong&gt;: The model encounters an issue mid-generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Token Limit Exceeded&lt;/strong&gt;: Reaching maximum token limits during generation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Authentication Failures&lt;/strong&gt;: API key issues that arise during streaming&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In our &lt;code&gt;OpenAIStreamer&lt;/code&gt; implementation, we catch exceptions raised during streaming and surface them to the caller in a structured format:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;stream&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;self&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;chat&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;completions&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="c1"&gt;# Parameters...
&lt;/span&gt;        &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;chunk&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stream&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="c1"&gt;# Process chunks...
&lt;/span&gt;
&lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;error_msg&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[Error: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;error_code&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;type&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;__name__&lt;/span&gt;

    &lt;span class="c1"&gt;# Yield error in structured format
&lt;/span&gt;    &lt;span class="k"&gt;yield&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;error_msg&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;finish_reason&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error_description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;error_code&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For production applications, more sophisticated error recovery strategies might include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Automatic Retries&lt;/strong&gt;: Implementing exponential backoff for transient errors&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Partial Response Preservation&lt;/strong&gt;: Maintaining already-received content when errors occur&lt;/li&gt;
&lt;/ul&gt;
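As a sketch of how those two strategies can combine (the `start_stream` callable and the chunk shape are assumptions for illustration, not part of the `OpenAIStreamer` implementation above), a retry wrapper might look like:

```python
import time

def stream_with_retries(start_stream, max_retries=3, base_delay=1.0):
    """Retry a streaming call with exponential backoff.

    `start_stream` is an assumed zero-argument callable returning an
    iterable of chunk dicts shaped like {"content": ..., "finish_reason": ...}.
    """
    for attempt in range(max_retries + 1):
        try:
            for chunk in start_stream():
                yield chunk  # chunks already yielded to the caller are preserved
            return  # stream completed normally
        except Exception as exc:
            if attempt == max_retries:
                # Surface the final failure in the same structured format
                yield {
                    "content": f"[Error: {exc}]",
                    "finish_reason": "error",
                    "error_description": type(exc).__name__,
                }
                return
            time.sleep(base_delay * (2 ** attempt))  # 1s, 2s, 4s, ...
```

Because chunks are yielded as they arrive, content received before a transient failure stays with the caller; only the tail of the response is retried.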

&lt;p&gt;By structuring error responses in the same format as successful responses, frontend applications can handle both scenarios uniformly, creating a more resilient user experience.&lt;/p&gt;

&lt;h2&gt;
  
  
  Structured Response Format
&lt;/h2&gt;

&lt;p&gt;For production applications, it's critical to have a consistent structure for streaming responses. A JSON format with clear fields provides several advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Status Tracking&lt;/strong&gt;: Fields like &lt;code&gt;finish_reason&lt;/code&gt; allow tracking the stream's state&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Error Identification&lt;/strong&gt;: Dedicated error fields make error handling more systematic&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Content Separation&lt;/strong&gt;: Clearly separating content from metadata&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here's an example of a structured response format for streaming:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Streaming offers "&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"finish_reason"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;null&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"error_description"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;""&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When an error occurs, the same structure can be maintained:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;

    &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"[Error: API timeout, maximum retries exceeded]"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"finish_reason"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"error"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"error_description"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"timeout"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This consistent structure enables frontend applications to handle both successful responses and errors within the same processing pipeline.&lt;/p&gt;
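As a minimal illustration (the chunk dicts mirror the JSON format above; the helper itself is hypothetical), a consumer can fold both cases into a single loop:

```python
def consume_stream(chunks):
    """Accumulate streamed content, handling success and error chunks
    in one pipeline. Chunk shape follows the JSON format shown above."""
    parts = []
    for chunk in chunks:
        parts.append(chunk.get("content", ""))
        if chunk.get("finish_reason") == "error":
            # The error chunk travels through the same path as content
            break
    return "".join(parts)
```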

&lt;h2&gt;
  
  
  Source Code
&lt;/h2&gt;

&lt;p&gt;The full source code can be found at &lt;a href="https://gist.github.com/pi19404/4c0f9358610790bf9db3a2e9d09e357b" rel="noopener noreferrer"&gt;https://gist.github.com/pi19404/4c0f9358610790bf9db3a2e9d09e357b&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Implementing streaming responses from OpenAI models requires careful architectural consideration of response protocols, error handling methodologies, and cancellation mechanisms. By adopting these engineering practices, applications can achieve more responsive user experiences while maintaining robustness under varied operating conditions.&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;OpenAI API Documentation: &lt;a href="https://platform.openai.com/docs/api-reference/streaming" rel="noopener noreferrer"&gt;https://platform.openai.com/docs/api-reference/streaming&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;HTTP/1.1 Chunked Transfer Encoding (RFC 7230): &lt;a href="https://tools.ietf.org/html/rfc7230#section-4.1" rel="noopener noreferrer"&gt;https://tools.ietf.org/html/rfc7230#section-4.1&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Python Threading and Concurrency: &lt;a href="https://docs.python.org/3/library/threading.html" rel="noopener noreferrer"&gt;https://docs.python.org/3/library/threading.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Signal Handling in Python: &lt;a href="https://docs.python.org/3/library/signal.html" rel="noopener noreferrer"&gt;https://docs.python.org/3/library/signal.html&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Exponential Backoff Algorithm: &lt;a href="https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/" rel="noopener noreferrer"&gt;https://aws.amazon.com/blogs/architecture/exponential-backoff-and-jitter/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Web Streams API: &lt;a href="https://developer.mozilla.org/en-US/docs/Web/API/Streams_API" rel="noopener noreferrer"&gt;https://developer.mozilla.org/en-US/docs/Web/API/Streams_API&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Flask Stream With Context: &lt;a href="https://flask.palletsprojects.com/en/2.0.x/patterns/streaming/" rel="noopener noreferrer"&gt;https://flask.palletsprojects.com/en/2.0.x/patterns/streaming/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;FastAPI Streaming Response: &lt;a href="https://fastapi.tiangolo.com/advanced/custom-response/#streamingresponse" rel="noopener noreferrer"&gt;https://fastapi.tiangolo.com/advanced/custom-response/#streamingresponse&lt;/a&gt;
&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>openai</category>
      <category>streaming</category>
      <category>llm</category>
      <category>python</category>
    </item>
    <item>
      <title>Automating Blog Publication on Dev.to: A Developer's Guide to the API</title>
      <dc:creator>Prashant Iyenga</dc:creator>
      <pubDate>Mon, 14 Jul 2025 12:40:24 +0000</pubDate>
      <link>https://dev.to/pi19404/automating-blog-publication-on-devto-a-developers-guide-to-the-api-2mo9</link>
      <guid>https://dev.to/pi19404/automating-blog-publication-on-devto-a-developers-guide-to-the-api-2mo9</guid>
      <description>&lt;h1&gt;
  
  
  Automating Blog Publication on Dev.to: A Developer's Guide to the API
&lt;/h1&gt;

&lt;p&gt;As developers, we're always looking for ways to streamline our workflows. If you're an active technical writer publishing on &lt;a href="https://dev.to"&gt;Dev.to&lt;/a&gt;, you've probably wondered if there's a way to automate the publishing process directly from your preferred writing environment. Good news! Dev.to offers a robust API that enables programmatic interaction with the platform.&lt;/p&gt;

&lt;p&gt;In this article, we will walk through the process of automating blog publication on Dev.to, covering everything from authentication to managing drafts and publishing articles.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Automate Publishing to Dev.to?
&lt;/h2&gt;

&lt;p&gt;Before diving into the technical details, let's consider a few compelling reasons to automate your Dev.to publishing workflow:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Write in your preferred environment&lt;/strong&gt; - Author content in your favorite markdown editor, IDE, or note-taking app and publish without copy-pasting&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-platform publishing&lt;/strong&gt; - Write once, publish to multiple platforms with format adaptation&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Version control integration&lt;/strong&gt; - Maintain your blog posts in Git repositories along with your code&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Batch operations&lt;/strong&gt; - Publish or update multiple articles with a single command&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scheduled publishing&lt;/strong&gt; - Queue posts to be published at optimal times&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Setting Up Authentication
&lt;/h2&gt;

&lt;p&gt;The Dev.to API uses API keys for authentication. To get started, you'll need to generate an API key:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Log in to your Dev.to account&lt;/li&gt;
&lt;li&gt;Navigate to Settings → Account → DEV API Keys&lt;/li&gt;
&lt;li&gt;Create a new API key with an appropriate description&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Once you have your API key, you can include it in the headers of your API requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;your_api_key_here&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Keep your API key secure—it provides full access to create, update, and manage articles on your behalf.&lt;/p&gt;
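One common way to keep the key out of source control (the `DEVTO_API_KEY` variable name is an assumption for this sketch, not a Dev.to convention) is to load it from an environment variable:

```python
import os

def load_api_key(var_name="DEVTO_API_KEY"):
    """Read the API key from the environment instead of hard-coding it."""
    key = os.environ.get(var_name)
    if not key:
        raise RuntimeError(f"Set the {var_name} environment variable")
    return key
```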

&lt;h2&gt;
  
  
  Understanding Article Structure
&lt;/h2&gt;

&lt;p&gt;The Dev.to API represents articles using a JSON structure. The key properties include:&lt;/p&gt;

&lt;h3&gt;
  
  
  Article Metadata
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"article"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Your Article Title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"published"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"body_markdown"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Your article content in markdown format..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"tags"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s2"&gt;"python"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"api"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"tutorial"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"automation"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"series"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Optional Series Name"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"canonical_url"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://original-site.com/if-crossposting"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Brief description of your article"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"cover_image"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"https://url-to-cover-image.jpg"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Some important notes on these fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;title&lt;/strong&gt;: Required field for all articles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;published&lt;/strong&gt;: Boolean flag determining if the article is a draft (&lt;code&gt;false&lt;/code&gt;) or published (&lt;code&gt;true&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;body_markdown&lt;/strong&gt;: The full content of your article in markdown format&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;tags&lt;/strong&gt;: Array of up to 4 tags that help categorize your article&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;series&lt;/strong&gt;: Optional field to group related articles together&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;canonical_url&lt;/strong&gt;: If you're cross-posting, set this to the original URL to avoid SEO penalties&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;description&lt;/strong&gt;: A brief summary shown in article previews&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;cover_image&lt;/strong&gt;: URL to a header image for your article&lt;/li&gt;
&lt;/ul&gt;
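The field notes above can be sketched as a small payload-building helper (`build_article_payload` is illustrative, not part of the Dev.to API; only the required and most common fields are shown):

```python
def build_article_payload(title, body_markdown, published=False, tags=None,
                          canonical_url=None):
    """Assemble the {"article": {...}} payload for the articles endpoint."""
    article = {
        "title": title,                 # required for all articles
        "body_markdown": body_markdown,
        "published": published,         # False keeps the article a draft
    }
    if tags:
        article["tags"] = tags[:4]      # Dev.to accepts up to 4 tags
    if canonical_url:
        article["canonical_url"] = canonical_url  # for cross-posting
    return {"article": article}
```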

&lt;h2&gt;
  
  
  Core API Endpoints
&lt;/h2&gt;

&lt;p&gt;Dev.to provides several RESTful endpoints for article management:&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating a New Article
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;json&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;article_data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://dev.to/api/articles&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;post&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article_data&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;201&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to create article: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By default, this creates a draft article. To publish immediately, include &lt;code&gt;"published": true&lt;/code&gt; in your article data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Getting Your Articles
&lt;/h3&gt;

&lt;p&gt;You can retrieve your published and draft articles separately:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Get published articles
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_published_articles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://dev.to/api/articles/me/published&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to fetch articles: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Get draft articles
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_draft_articles&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;per_page&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://dev.to/api/articles/me/unpublished&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="n"&gt;params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;page&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;per_page&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;per_page&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;params&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to fetch draft articles: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These methods return lists of article objects containing metadata and IDs, which you'll need for further operations.&lt;/p&gt;
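For example (the article objects here are illustrative, mirroring the fields the API returns), you can index the results by ID for later update or publish calls:

```python
def index_articles_by_id(articles):
    """Map article id -> title, for picking which draft to update or publish."""
    return {a["id"]: a["title"] for a in articles}
```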

&lt;h3&gt;
  
  
  Understanding Article IDs
&lt;/h3&gt;

&lt;p&gt;Each article on Dev.to has a unique ID, which is essential for operations like updating, publishing, or deleting. When you create an article or retrieve a list of your articles, the API response includes this ID.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;12345&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"title"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Your Article Title"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Article description..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"published"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="kc"&gt;false&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="err"&gt;//&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;Additional&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="err"&gt;fields...&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This ID becomes the key identifier for all subsequent operations on the article.&lt;/p&gt;

&lt;h3&gt;
  
  
  Updating an Existing Article
&lt;/h3&gt;

&lt;p&gt;To update an article, you need its ID and the fields you want to modify:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;update_article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;article_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_data&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;https://dev.to/api/articles/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;article_id&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="n"&gt;headers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;api-key&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Content-Type&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;application/json&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;json&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;update_data&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;status_code&lt;/span&gt; &lt;span class="o"&gt;==&lt;/span&gt; &lt;span class="mi"&gt;200&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;json&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;raise&lt;/span&gt; &lt;span class="nc"&gt;Exception&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Failed to update article: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This function is versatile: you can update any article property, including the &lt;code&gt;published&lt;/code&gt; status.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Draft-to-Published Workflow
&lt;/h2&gt;

&lt;p&gt;One of the most common workflows is creating a draft, reviewing it, and then publishing it. Here's how to implement this flow programmatically:&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Create a Draft
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;draft_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;My Technical Article&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body_markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;# Introduction&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;This is the start of my article...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;programming&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tutorial&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;published&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;  &lt;span class="c1"&gt;# This creates a draft
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;draft&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;create_article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;draft_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;draft_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;draft&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Update the Draft (Optional)
&lt;/h3&gt;

&lt;p&gt;You might want to make changes after reviewing the draft in the Dev.to interface:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;update_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body_markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;# Revised Introduction&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s"&gt;This is the improved start of my article...&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;updated_draft&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;update_article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;draft_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;update_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Publish the Draft
&lt;/h3&gt;

&lt;p&gt;When you're ready to publish, simply update the &lt;code&gt;published&lt;/code&gt; status:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;publish_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;published&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="n"&gt;published_article&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;update_article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;draft_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;publish_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This transition from draft to published is handled seamlessly by the Dev.to API. The article retains all its content and metadata while changing its visibility status.&lt;/p&gt;
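&lt;p&gt;If you repeat this flow often, the publish payload can be factored into a small helper. The &lt;code&gt;build_publish_payload&lt;/code&gt; function below is an illustrative sketch (not part of the Dev.to API) meant to be passed to the &lt;code&gt;update_article&lt;/code&gt; function shown earlier:&lt;/p&gt;

```python
def build_publish_payload(extra_fields=None):
    """Build the article payload that flips a draft to published.

    Any extra fields (say, a final title tweak) ride along in the same
    request, so the last edits and the visibility change land together.
    """
    article = {"published": True}
    if extra_fields:
        article.update(extra_fields)
    return article
```

&lt;p&gt;For example, &lt;code&gt;update_article(api_key, draft_id, build_publish_payload({"title": "Final Title"}))&lt;/code&gt; publishes the draft and applies a last edit in one request.&lt;/p&gt;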

&lt;h2&gt;
  
  
  Advanced Features
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Handling Frontmatter
&lt;/h3&gt;

&lt;p&gt;Dev.to supports YAML frontmatter in markdown files, which provides a convenient way to define article metadata:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nn"&gt;---&lt;/span&gt;
&lt;span class="na"&gt;title&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;My Amazing Article&lt;/span&gt;
&lt;span class="na"&gt;published&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;false&lt;/span&gt;
&lt;span class="na"&gt;tags&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;api, tutorial&lt;/span&gt;
&lt;span class="na"&gt;series&lt;/span&gt;&lt;span class="pi"&gt;:&lt;/span&gt; &lt;span class="s"&gt;API Mastery&lt;/span&gt;
&lt;span class="nn"&gt;---&lt;/span&gt;

&lt;span class="gh"&gt;# Article Content Starts Here&lt;/span&gt;

Your article body...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When working with files that include frontmatter, you'll need to parse them correctly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;frontmatter&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;read_markdown_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;r&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;utf-8&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;post&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;frontmatter&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;file&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;metadata&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;post&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;content&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;create_article_from_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;markdown_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;read_markdown_file&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;file_path&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

    &lt;span class="n"&gt;article_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;markdown_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;title&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Untitled&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;published&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;markdown_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;published&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;body_markdown&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;markdown_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;markdown_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;tags&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;[]),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;series&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;markdown_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;series&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;canonical_url&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;markdown_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;canonical_url&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;markdown_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;description&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;''&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt;
        &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;cover_image&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;markdown_data&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;metadata&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;].&lt;/span&gt;&lt;span class="nf"&gt;get&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;cover_image&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="c1"&gt;# Remove None values
&lt;/span&gt;    &lt;span class="n"&gt;article_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;k&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;article_data&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;items&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;v&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;create_article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;article_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This approach allows you to maintain article metadata directly in your markdown files.&lt;/p&gt;
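&lt;p&gt;One wrinkle to watch for: in the frontmatter example above, &lt;code&gt;tags&lt;/code&gt; is written as a comma-separated string, but the API expects a list of tag names. A small, hypothetical normalizer (which also enforces the four-tag limit noted in the best practices below) might look like this:&lt;/p&gt;

```python
def normalize_tags(raw):
    """Accept tags as a comma-separated string or a list.

    Returns a clean list of at most four tag names, matching
    Dev.to's per-article tag limit.
    """
    if raw is None:
        return []
    if isinstance(raw, str):
        raw = raw.split(",")
    return [tag.strip() for tag in raw if tag.strip()][:4]
```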

&lt;h3&gt;
  
  
  Batch Publishing
&lt;/h3&gt;

&lt;p&gt;If you have multiple drafts you'd like to publish, you can implement a batch operation:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;batch_publish_drafts&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;draft_ids&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;results&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;

    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;article_id&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;draft_ids&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;update_article&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;api_key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;article_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;published&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;success&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;result&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;
            &lt;span class="c1"&gt;# Be nice to the API with a small delay
&lt;/span&gt;            &lt;span class="n"&gt;time&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sleep&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;results&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;article_id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;article_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;status&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;error&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="p"&gt;})&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;results&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This article was itself created using the publishing workflow demonstrated above. The complete source code for a publishing automation tool built on these techniques is available on &lt;a href="https://gist.github.com/pi19404/133c9b6db9b0304b75d25717878c80cf" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Best Practices and Limitations
&lt;/h2&gt;

&lt;p&gt;When working with the Dev.to API, keep these considerations in mind:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Dev.to applies rate limiting to API requests. While the exact limits aren't prominently documented, it's good practice to add delays between requests, especially when performing batch operations.&lt;/li&gt;
&lt;li&gt;Dev.to limits articles to a maximum of 4 tags.&lt;/li&gt;
&lt;/ul&gt;
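&lt;p&gt;A fixed one-second delay between requests goes a long way, but a retry wrapper with exponential backoff is more robust when a request does get throttled. The sketch below is illustrative; the delay values are conservative guesses, since the exact limits aren't documented:&lt;/p&gt;

```python
import time

def with_retry(call, retries=3, base_delay=1.0):
    """Invoke call(), retrying with exponential backoff on failure.

    Delays grow as base_delay * 2**attempt; the final failure is re-raised.
    """
    for attempt in range(retries):
        try:
            return call()
        except Exception:
            if attempt == retries - 1:
                raise
            time.sleep(base_delay * (2 ** attempt))
```

&lt;p&gt;Each publish in a batch loop can then be wrapped, e.g. &lt;code&gt;with_retry(lambda: update_article(api_key, article_id, {"published": True}))&lt;/code&gt;.&lt;/p&gt;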

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;The Dev.to API provides powerful capabilities for programmatic article management, enabling developers to build custom publishing workflows. Whether you're automating personal blog posts or managing a technical publication with multiple authors, understanding these API concepts will help you streamline your process.&lt;/p&gt;

</description>
      <category>api</category>
      <category>automation</category>
      <category>python</category>
      <category>devto</category>
    </item>
    <item>
      <title>How to Create a Virtual Environment with a Specific Python Version</title>
      <dc:creator>Prashant Iyenga</dc:creator>
      <pubDate>Sat, 12 Jul 2025 12:04:09 +0000</pubDate>
      <link>https://dev.to/pi19404/how-to-create-a-virtual-environment-with-a-specific-python-version-linux-macos-4g3p</link>
      <guid>https://dev.to/pi19404/how-to-create-a-virtual-environment-with-a-specific-python-version-linux-macos-4g3p</guid>
      <description>&lt;p&gt;Managing multiple Python projects often means juggling different package versions—and sometimes entirely different Python versions. This is where virtual environments shine. In this blog post, you'll learn how to create an isolated virtual environment using a specific version of Python, tailored for &lt;strong&gt;Linux and macOS&lt;/strong&gt; users.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 Why Use a Virtual Environment?
&lt;/h2&gt;

&lt;p&gt;Before jumping into the steps, here’s why using virtual environments is considered best practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Isolated Dependencies:&lt;/strong&gt; Keeps project requirements isolated from your system Python.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Avoids Conflicts:&lt;/strong&gt; Prevents dependency collisions across different projects.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reproducibility:&lt;/strong&gt; Makes deployments and collaboration smoother by standardizing environments.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🛠️ Prerequisites
&lt;/h2&gt;

&lt;p&gt;Make sure your target Python version is available on your system. You can verify which versions are installed:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;ls&lt;/span&gt; /usr/bin/python&lt;span class="k"&gt;*&lt;/span&gt;          &lt;span class="c"&gt;# Linux&lt;/span&gt;
&lt;span class="nb"&gt;ls&lt;/span&gt; /opt/homebrew/bin/python&lt;span class="k"&gt;*&lt;/span&gt;  &lt;span class="c"&gt;# macOS with Homebrew&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If your desired version is missing, install it using &lt;code&gt;pyenv&lt;/code&gt;.&lt;/p&gt;




&lt;h3&gt;
  
  
  🔧 Installing &lt;code&gt;pyenv&lt;/code&gt; (Recommended for Managing Multiple Python Versions)
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Step 1: Install Dependencies
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Linux (Debian/Ubuntu):&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;sudo &lt;/span&gt;apt update
&lt;span class="nb"&gt;sudo &lt;/span&gt;apt &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-y&lt;/span&gt; make build-essential libssl-dev zlib1g-dev &lt;span class="se"&gt;\&lt;/span&gt;
libbz2-dev libreadline-dev libsqlite3-dev curl git libncursesw5-dev &lt;span class="se"&gt;\&lt;/span&gt;
xz-utils tk-dev libxml2-dev libxmlsec1-dev libffi-dev liblzma-dev
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;macOS (with Homebrew):&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;openssl readline sqlite3 xz zlib
brew &lt;span class="nb"&gt;install &lt;/span&gt;pyenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 2: Add pyenv to your shell startup file
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Bash:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'\n# pyenv setup'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.bashrc
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export PYENV_ROOT="$HOME/.pyenv"'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.bashrc
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export PATH="$PYENV_ROOT/bin:$PATH"'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.bashrc
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'eval "$(pyenv init --path)"'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.bashrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Zsh:&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s1"&gt;'\n# pyenv setup'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export PYENV_ROOT="$HOME/.pyenv"'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'export PATH="$PYENV_ROOT/bin:$PATH"'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="s1"&gt;'eval "$(pyenv init --path)"'&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&amp;gt;&lt;/span&gt; ~/.zshrc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then reload your shell:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source&lt;/span&gt; ~/.bashrc      &lt;span class="c"&gt;# or ~/.zshrc&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Step 3: Install a Specific Python Version
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pyenv &lt;span class="nb"&gt;install &lt;/span&gt;3.10.12
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🧪 Creating the Virtual Environment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Option 1: Using &lt;code&gt;python -m venv&lt;/code&gt;
&lt;/h3&gt;

&lt;p&gt;This is the built-in way to create a virtual environment using a specific Python binary.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Activate desired Python version (if using pyenv)&lt;/span&gt;
pyenv shell 3.10.12

&lt;span class="c"&gt;# Create the virtual environment&lt;/span&gt;
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv-py310
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also specify the full path to the Python binary:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;/opt/homebrew/bin/python3.10 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv-py310  &lt;span class="c"&gt;# macOS&lt;/span&gt;
/usr/bin/python3.10 &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv-py310           &lt;span class="c"&gt;# Linux&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
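&lt;p&gt;Either way, the resulting environment records which base interpreter it was built from. As a quick sanity check (a sketch using only the standard library, not part of the setup itself), you can create a throwaway environment and read its &lt;code&gt;pyvenv.cfg&lt;/code&gt;:&lt;/p&gt;

```python
import pathlib
import tempfile
import venv

# Create a throwaway environment and read its pyvenv.cfg, which records
# the base interpreter ("home") the environment was built from.
with tempfile.TemporaryDirectory() as tmp:
    env_dir = pathlib.Path(tmp) / "venv-demo"
    venv.create(env_dir, with_pip=False)
    cfg = (env_dir / "pyvenv.cfg").read_text()

print(cfg)
```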






&lt;h3&gt;
  
  
  Option 2: Using &lt;code&gt;virtualenv&lt;/code&gt; (Alternative Approach)
&lt;/h3&gt;

&lt;p&gt;First, install &lt;code&gt;virtualenv&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;virtualenv
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then run:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;virtualenv &lt;span class="nt"&gt;-p&lt;/span&gt; /usr/bin/python3.10 venv-py310      &lt;span class="c"&gt;# Linux&lt;/span&gt;
virtualenv &lt;span class="nt"&gt;-p&lt;/span&gt; /opt/homebrew/bin/python3.10 venv-py310  &lt;span class="c"&gt;# macOS&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  🚀 Activating the Virtual Environment
&lt;/h2&gt;

&lt;p&gt;To activate the environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;source &lt;/span&gt;venv-py310/bin/activate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You should now see the environment name in your prompt like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="o"&gt;(&lt;/span&gt;venv-py310&lt;span class="o"&gt;)&lt;/span&gt; user@hostname:~/project&lt;span class="err"&gt;$&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
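&lt;p&gt;Beyond the prompt prefix, you can confirm which interpreter is active from Python itself; this is an optional check, not part of the original workflow:&lt;/p&gt;

```python
import sys

# After activation, `python` should resolve to the venv's interpreter.
# sys.executable shows its path; inside a venv, sys.prefix (the venv
# directory) differs from sys.base_prefix (the base interpreter).
print(sys.executable)
print(sys.version.split()[0])
print(sys.prefix != sys.base_prefix)
```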






&lt;h2&gt;
  
  
  🧹 Deactivating and Cleaning Up
&lt;/h2&gt;

&lt;p&gt;To deactivate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;deactivate
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To delete the environment:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nb"&gt;rm&lt;/span&gt; &lt;span class="nt"&gt;-rf&lt;/span&gt; venv-py310
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  🧩 Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Using a specific Python version in a virtual environment helps avoid compatibility issues, ensures reproducible builds, and keeps your system Python clean. This is especially important for macOS and Linux users, where the system Python might be tied to operating system functions.&lt;/p&gt;

</description>
      <category>python</category>
      <category>pythonvirutalenv</category>
      <category>virtualenv</category>
      <category>software</category>
    </item>
    <item>
      <title>MCP Server Setup with OAuth Authentication using Auth0 and Claude.ai Remote MCP Integration</title>
      <dc:creator>Prashant Iyenga</dc:creator>
      <pubDate>Wed, 09 Jul 2025 13:26:38 +0000</pubDate>
      <link>https://dev.to/pi19404/mcp-server-setup-with-oauth-authentication-using-auth0-and-claudeai-remote-mcp-integration-5ghl</link>
      <guid>https://dev.to/pi19404/mcp-server-setup-with-oauth-authentication-using-auth0-and-claudeai-remote-mcp-integration-5ghl</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;This technical guide provides a comprehensive walkthrough for implementing a Model Context Protocol (MCP) server with robust OAuth 2.0 authentication, leveraging Auth0 as the identity provider. The focus is on achieving full compatibility with Claude.ai’s requirements for Dynamic Client Registration (DCR) as specified in RFC 7591. Unlike traditional OAuth integrations that rely on static client credentials, Claude.ai mandates that MCP servers support dynamic, standards-compliant registration and discovery endpoints, enabling third-party clients to obtain credentials and initiate secure authorization flows programmatically.&lt;/p&gt;

&lt;p&gt;We will learn how to configure Auth0 to enable OIDC dynamic application registration, promote connections to domain-level for third-party authentication, and expose the necessary OAuth endpoints for automated client onboarding. The guide details the limitations of libraries such as fastapi_mcp in this context, and demonstrates how to use the fastmcp and mcpauth libraries to implement an MCP server that supports dynamic registration, PKCE, and secure token exchange.&lt;/p&gt;

&lt;p&gt;Step-by-step instructions are provided for both the Auth0 dashboard and API configuration.&lt;/p&gt;

&lt;h3&gt;
  
  
  fastapi_mcp library limitations
&lt;/h3&gt;

&lt;p&gt;For OAuth-based MCP servers, Claude.ai requires Dynamic Client Registration (DCR) support as per RFC 7591 and does not yet support a way for users to specify a client ID or secret manually.&lt;/p&gt;

&lt;p&gt;The fastapi_mcp library, while excellent for many MCP use cases, has several limitations when it comes to Claude.ai's OAuth requirements:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;No Dynamic Client Registration Support&lt;/strong&gt; : fastapi_mcp requires pre-configured client IDs and secrets&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Static Configuration&lt;/strong&gt; : Cannot handle Claude.ai’s dynamic application creation process&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Inspector Compatibility&lt;/strong&gt; : OAuth testing via MCP inspector fails with static configurations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;RFC 7591 Compliance&lt;/strong&gt; : Lacks support for the Dynamic Client Registration standard required by Claude.ai
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# This approach with fastapi_mcp does NOT work with Claude.ai OAuth
from fastapi_mcp import FastApiMCP, AuthConfig

# Static configuration - incompatible with Claude.ai
auth_config = AuthConfig(
    client_id="static-client-id", # Claude.ai doesn't support this
    client_secret="static-secret", # Claude.ai creates these dynamically
    setup_proxies=True,
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Using fastmcp and mcpauth Libraries
&lt;/h3&gt;

&lt;p&gt;To properly support Claude.ai’s OAuth requirements, we use the fastmcp library with mcpauth for OAuth management:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install fastmcp mcpauth
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These libraries provide:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Client Registration (RFC 7591)&lt;/strong&gt; support&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Claude.ai compatibility&lt;/strong&gt; out of the box&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Inspector&lt;/strong&gt; OAuth testing capabilities&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Proper OAuth 2.0 flows&lt;/strong&gt; with PKCE support&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Auth0 Configuration for Dynamic Client Registration
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Step 1: Enable OIDC Dynamic Application Registration
&lt;/h4&gt;

&lt;p&gt;By default, Auth0 disables dynamic application registration for security reasons. To enable it:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Navigate to &lt;strong&gt;Auth0 Dashboard&lt;/strong&gt; → &lt;strong&gt;Tenant Settings&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Go to &lt;strong&gt;Advanced Settings&lt;/strong&gt;  tab&lt;/li&gt;
&lt;li&gt;Find &lt;strong&gt;“Enable OIDC Dynamic Application Registration”&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Enable&lt;/strong&gt; this setting&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focjq0zeu61xbk3ggwq12.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Focjq0zeu61xbk3ggwq12.png" width="800" height="487"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Warning&lt;/strong&gt; : This setting allows third-party applications to dynamically register applications for your API. Ensure you have proper security measures in place.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 2: Grant Auth0 Management API Scopes for Connection Management
&lt;/h3&gt;

&lt;p&gt;After enabling dynamic registration, you must ensure that your Auth0 Management API client has the correct permissions to manage connections. Specifically, you need to grant the update:connections and read:connections scopes to your Management API client. These scopes allow your automation or scripts to promote connections to domain-level and read connection details.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Granting Management API Scopes to Your Client&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Log in to the Auth0 Dashboard&lt;/strong&gt; at &lt;a href="https://manage.auth0.com/" rel="noopener noreferrer"&gt;https://manage.auth0.com/&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Create a Machine to Machine Application&lt;/strong&gt; :&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Go to &lt;strong&gt;Applications&lt;/strong&gt; → &lt;strong&gt;Applications&lt;/strong&gt; in the sidebar.&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Create Application&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Enter a name (e.g., “MCP Management Client”), select &lt;strong&gt;Machine to Machine Applications&lt;/strong&gt;, and click &lt;strong&gt;Create&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Machine to Machine Applications require at least one authorized API. Select the Auth0 Management API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwstx6urie2721904zs6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fgwstx6urie2721904zs6.png" width="800" height="744"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select the permissions &lt;strong&gt;read:connections&lt;/strong&gt;, &lt;strong&gt;update:connections&lt;/strong&gt;, and &lt;strong&gt;read:clients&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Click &lt;strong&gt;Authorize&lt;/strong&gt; to save the changes.&lt;/li&gt;
&lt;li&gt;Note the Client ID and Client Secret from the Settings page, under the Basic Information section&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your machine-to-machine application is now authorized to manage Auth0 connections via the Management API with the necessary scopes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 3: &lt;strong&gt;Obtain a Management API Token&lt;/strong&gt; (with the required scopes)
&lt;/h3&gt;

&lt;p&gt;Run the command below to obtain the Management API token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export AUTH0_DOMAIN=your-tenant.auth0.com
export AUTH0_CLIENT_ID=your-client-id
export AUTH0_CLIENT_SECRET=your-client-secret

ACCESS_TOKEN=$(curl --silent --request POST \
      --url "https://${AUTH0_DOMAIN}/oauth/token" \
      --header 'content-type: application/x-www-form-urlencoded' \
      --data "grant_type=client_credentials" \
      --data "client_id=${AUTH0_CLIENT_ID}" \
      --data "client_secret=${AUTH0_CLIENT_SECRET}" \
      --data "audience=https://${AUTH0_DOMAIN}/api/v2/" \
      --data "scope=update:connections read:connections read:clients" \
      | grep -o '"access_token":"[^"]*"' | cut -d':' -f2 | tr -d '"')
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
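&lt;p&gt;The grep/cut pipeline above is brittle if the JSON formatting changes; when Python is available, the token can be parsed from the response body instead. The response below is an illustrative placeholder, not a real credential:&lt;/p&gt;

```python
import json

# Illustrative token response body (values are placeholders).
response_body = '{"access_token": "eyJhbGciOi_example", "token_type": "Bearer", "expires_in": 86400}'

# Parse the JSON instead of scraping it with grep/cut.
token = json.loads(response_body)["access_token"]
print(token)
```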



&lt;h4&gt;
  
  
  &lt;strong&gt;Verify the Client’s Granted Scopes&lt;/strong&gt;
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl --request GET \
--url "https://${AUTH0_DOMAIN}/api/v2/clients/${AUTH0_CLIENT_ID}" \
--header "authorization: Bearer ${ACCESS_TOKEN}"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Check the grant_types and scopes fields in the response to ensure update:connections and read:connections are present.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; These scopes are only for managing Auth0 system resources (connections) and are not related to API resource access for end-user tokens.&lt;/p&gt;
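&lt;p&gt;As a sketch, the check can be automated by diffing the granted scopes against the required ones; the response excerpt below is hypothetical and far smaller than a real client object:&lt;/p&gt;

```python
import json

# Hypothetical excerpt of the GET /api/v2/clients/{id} response;
# real responses contain many more fields.
client = json.loads(
    '{"grant_types": ["client_credentials"],'
    ' "scopes": ["read:connections", "update:connections", "read:clients"]}'
)

# Any scope in `required` but not granted shows up in `missing`.
required = {"update:connections", "read:connections"}
missing = required - set(client.get("scopes", []))
print(sorted(missing))
```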

&lt;h3&gt;
  
  
  Step 4: Create Domain-Level Connections
&lt;/h3&gt;

&lt;p&gt;Claude.ai requires domain-level connections for third-party application authentication. By default, Auth0 connections are not domain-level.&lt;/p&gt;

&lt;h4&gt;
  
  
  Promote Connection to Domain-Level via Management API
&lt;/h4&gt;

&lt;p&gt;Use the Auth0 Management API to promote your connection to domain-level:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# First, get your Management API token
ACCESS_TOKEN=$(curl --silent --request POST \
    --url "https://${AUTH0_DOMAIN}/oauth/token" \
    --header 'content-type: application/x-www-form-urlencoded' \
    --data "grant_type=client_credentials" \
    --data "client_id=${AUTH0_CLIENT_ID}" \
    --data "client_secret=${AUTH0_CLIENT_SECRET}" \
    --data "audience=https://${AUTH0_DOMAIN}/api/v2/" \
    --data "scope=update:connections read:connections" \
    | grep -o '"access_token":"[^"]*"' | cut -d':' -f2 | tr -d '"')

#To find your Google OAuth connection ID automatically, you can extract it from the API response using `jq` (a command-line JSON processor):

# List all connections and extract the ID for the google-oauth2 connection
GOOGLE_CONNECTION_ID=$(curl --silent --request GET \
    --url "https://${AUTH0_DOMAIN}/api/v2/connections" \
    --header "authorization: Bearer ${ACCESS_TOKEN}" \
    | jq -r '.[] | select(.name=="google-oauth2") | .id')
echo "Google OAuth connection ID: $GOOGLE_CONNECTION_ID"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command fetches all connections and filters for the one named "google-oauth2", outputting its "id" field. Use the value of $GOOGLE_CONNECTION_ID in the PATCH request to promote the connection to domain-level.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# promote the connection to domain-level
curl --request PATCH \
  --url "https://${AUTH0_DOMAIN}/api/v2/connections/${GOOGLE_CONNECTION_ID}" \
  --header "authorization: Bearer ${ACCESS_TOKEN}" \
  --header 'content-type: application/json' \
  --data '{ "is_domain_connection": true }'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
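&lt;p&gt;If jq is not available, the same connection lookup can be done in Python; the sketch below filters a sample /api/v2/connections response (the ids and the second entry are illustrative):&lt;/p&gt;

```python
import json

# Sample /api/v2/connections response (ids are illustrative).
connections = json.loads(
    '[{"id": "con_abc123", "name": "google-oauth2"},'
    ' {"id": "con_def456", "name": "Username-Password-Authentication"}]'
)

# Equivalent of: jq -r '.[] | select(.name=="google-oauth2") | .id'
google_id = next(c["id"] for c in connections if c["name"] == "google-oauth2")
print(google_id)
```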



&lt;h3&gt;
  
  
  MCP Server Implementation with fastmcp
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Server Setup with OAuth Support
&lt;/h4&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from fastmcp import FastMCP
from mcpauth import MCPAuth
import asyncio
import logging

# Configure logging
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)

# ... other setup code ...

# Initialize MCP server with OAuth
mcp_auth = MCPAuth(
    server=fetch_server_config(
        settings["issuer"],
        type=AuthServerType.OAUTH  # or AuthServerType.OIDC
    )
)

app = FastAPI()

@app.get("/", operation_id="root")
async def root():
    """Root endpoint with API information"""

    return {
        "message": "Dev.to Markdown Publisher API",
        "version": "1.0.0",
    }

api_mcp = FastMCP.from_fastapi(app=app, name="API Server")

bearer_auth = mcp_auth.bearer_auth_middleware(
    "jwt", required_scopes=["read", "write"]
)

mcp_app = api_mcp.http_app(path='/mcp')

mcp_app.dependency_overrides = getattr(mcp_app, "dependency_overrides", {})
mcp_app.dependency_overrides[None] = bearer_auth


app = FastAPI(lifespan=mcp_app.lifespan)

# Mount the MCP SSE app at /mcp
app.mount("/mcp", mcp_app)
# Add the metadata route for OAuth server discovery at /.well-known/oauth-authorization-server
app.add_route("/.well-known/oauth-authorization-server", mcp_auth.metadata_route(), methods=["GET"])

if __name__ == "__main__":
    # Run with OAuth support enabled
    uvicorn.run("app:app", host="0.0.0.0", port=8000, reload=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Claude.ai Dynamic Application Registration
&lt;/h3&gt;

&lt;p&gt;When Claude.ai connects to your MCP server with OAuth enabled, it automatically:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Discovers OAuth Endpoints&lt;/strong&gt; : Calls /.well-known/oauth-authorization-server&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Registers Dynamic Client&lt;/strong&gt; : POSTs to /oauth/register endpoint&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Creates Auth0 Application&lt;/strong&gt; : You’ll see “Claude.ai” app automatically created in Auth0 dashboard&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Initiates OAuth Flow&lt;/strong&gt; : Redirects user for authentication&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Exchanges Tokens&lt;/strong&gt; : Completes PKCE flow with dynamic client credentials&lt;/li&gt;
&lt;/ol&gt;
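&lt;p&gt;Step 5 relies on PKCE (RFC 7636): the client generates a random code verifier and sends its S256 challenge in the authorization request, then proves possession of the verifier at token exchange. A minimal sketch of the derivation, independent of any particular client library:&lt;/p&gt;

```python
import base64
import hashlib
import secrets

# Generate a random code_verifier and derive the S256 code_challenge
# (base64url without padding, per RFC 7636).
verifier = base64.urlsafe_b64encode(secrets.token_bytes(32)).rstrip(b"=").decode()
digest = hashlib.sha256(verifier.encode("ascii")).digest()
challenge = base64.urlsafe_b64encode(digest).rstrip(b"=").decode()

print(len(verifier), len(challenge))  # both are 43-character URL-safe strings
```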

&lt;h3&gt;
  
  
  Viewing Dynamically Registered Applications
&lt;/h3&gt;

&lt;p&gt;After Claude.ai connects, check your Auth0 dashboard:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Go to &lt;strong&gt;Applications&lt;/strong&gt; section&lt;/li&gt;
&lt;li&gt;Look for automatically created applications with names like:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;“Claude.ai MCP Client”&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These applications are created automatically by Claude.ai through the dynamic registration process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;Implementing OAuth authentication for MCP servers with Claude.ai requires careful attention to dynamic client registration requirements. The combination of fastmcp, mcpauth, and properly configured Auth0 settings enables seamless integration with Claude.ai's OAuth flows while maintaining security best practices.&lt;/p&gt;

&lt;h3&gt;
  
  
  Key takeaways:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Use fastmcp and mcpauth instead of fastapi_mcp for Claude.ai compatibility&lt;/li&gt;
&lt;li&gt;Enable OIDC Dynamic Application Registration in Auth0&lt;/li&gt;
&lt;li&gt;Promote connections to domain-level for third-party app support&lt;/li&gt;
&lt;li&gt;Monitor Auth0 dashboard for automatically created Claude.ai applications&lt;/li&gt;
&lt;li&gt;Test thoroughly with MCP Inspector before deploying to production&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This setup ensures that your MCP server can authenticate users securely while providing the dynamic client registration capabilities that Claude.ai requires for OAuth integration.&lt;/p&gt;




</description>
      <category>mcpserver</category>
      <category>claudeai</category>
      <category>agentiai</category>
      <category>oauth</category>
    </item>
    <item>
      <title>Claude AI MCP Integration and nGrok: Secure Tunnels to Your Local Development Environment</title>
      <dc:creator>Prashant Iyenga</dc:creator>
      <pubDate>Sat, 05 Jul 2025 18:54:38 +0000</pubDate>
      <link>https://dev.to/pi19404/claude-ai-mcp-integration-and-ngrok-secure-tunnels-to-your-local-development-environment-4gpn</link>
      <guid>https://dev.to/pi19404/claude-ai-mcp-integration-and-ngrok-secure-tunnels-to-your-local-development-environment-4gpn</guid>
      <description>&lt;h3&gt;
  
  
  Introduction
&lt;/h3&gt;

&lt;p&gt;This technical article explores the integration of Anthropic’s Model Context Protocol (MCP) with nGrok secure tunneling technology to establish robust connections between cloud-based AI systems and local development environments. We will demonstrate the implementation process for configuring a development MCP server, exposing it securely through nGrok’s encrypted tunneling infrastructure, and validating the configuration using the Claude API platform.&lt;/p&gt;

&lt;p&gt;The combination of these technologies enables developers to prototype and test AI-powered applications with enterprise-grade security protocols while maintaining the flexibility and rapid iteration capabilities of local development environments. This approach eliminates the need to deploy applications to staging environments during development, while still allowing for secure, real-time communication with cloud-based large language models.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is Claude MCP
&lt;/h3&gt;

&lt;p&gt;MCP is an open protocol that standardizes how applications provide context to LLMs. MCP provides a standardized way to connect AI models to various data sources and tools, helping you build agents and complex workflows on top of LLMs. It does so by allowing servers to expose tools that language models can invoke. Tools enable models to interact with external systems, such as querying databases, calling APIs, or performing computations. A name uniquely identifies each tool and includes metadata describing its schema.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is ngrok?
&lt;/h3&gt;

&lt;p&gt;ngrok is a powerful command-line tool and service that creates secure tunnels from public URLs to your local development server. It acts as a reverse proxy, allowing you to expose your locally running applications to the internet without the need for complex network configurations, port forwarding, or deploying to a staging server.&lt;/p&gt;

&lt;p&gt;When you run ngrok, it establishes a secure connection to ngrok’s cloud service, which then provides you with a public URL that forwards traffic to your local application. This URL can be accessed from anywhere on the internet, making it invaluable for development, testing, and demonstration purposes.&lt;/p&gt;

&lt;h3&gt;
  
  
  How Claude MCP Works
&lt;/h3&gt;

&lt;p&gt;When you ask a question:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The client sends your question to Claude&lt;/li&gt;
&lt;li&gt;Claude analyzes the available tools and decides which one(s) to use&lt;/li&gt;
&lt;li&gt;The client executes the chosen tool(s) through the MCP server&lt;/li&gt;
&lt;li&gt;The results are sent back to Claude&lt;/li&gt;
&lt;li&gt;Claude formulates a natural language response&lt;/li&gt;
&lt;li&gt;The response is displayed to you!&lt;/li&gt;
&lt;/ul&gt;
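&lt;p&gt;The loop above can be sketched in a few lines; the tool name, registry, and handler here are illustrative stand-ins, not the real MCP client API:&lt;/p&gt;

```python
# Illustrative tool registry: the model picks a tool by name, the client
# executes it, and the result goes back to the model to summarize.
TOOLS = {
    "get_price": lambda symbol: {"symbol": symbol, "price_usd": 108244.87},
}

def handle_tool_call(name, arguments):
    """Execute the tool the model selected and return its result."""
    return TOOLS[name](**arguments)

result = handle_tool_call("get_price", {"symbol": "BTC"})
print(result["symbol"])
```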

&lt;h3&gt;
  
  
  How ngrok Works
&lt;/h3&gt;

&lt;p&gt;ngrok operates through a simple yet elegant architecture. When you start ngrok, it creates an encrypted tunnel between your local machine and ngrok’s edge servers. These servers then route incoming requests from the public internet through this tunnel to your local application.&lt;/p&gt;

&lt;p&gt;The process involves several components: the ngrok client running on your machine, the ngrok service in the cloud, and the secure tunnel connecting them. All traffic passes through this encrypted tunnel, ensuring that your local development environment remains secure while being accessible from the internet.&lt;/p&gt;

&lt;h3&gt;
  
  
  ngrok Installation Guide
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Installing ngrok on Different Platforms
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;macOS Installation:&lt;/strong&gt; The easiest way to install ngrok on macOS is through Homebrew:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;brew install ngrok/ngrok/ngrok
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alternatively, you can download the binary directly from the ngrok website and add it to your PATH.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Windows Installation:&lt;/strong&gt; For Windows users, you can download the executable from the ngrok website and either run it directly or add it to your system PATH. Windows users can also use package managers like Chocolatey:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;choco install ngrok
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Linux Installation:&lt;/strong&gt; Linux users can download the appropriate binary for their architecture from &lt;a href="https://ngrok.com/downloads/linux" rel="noopener noreferrer"&gt;https://ngrok.com/downloads/linux&lt;/a&gt;. After downloading, extract the binary and move it to a directory in your PATH, or install via apt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;curl -sSL https://ngrok-agent.s3.amazonaws.com/ngrok.asc \
  | sudo tee /etc/apt/trusted.gpg.d/ngrok.asc &amp;gt;/dev/null \
  &amp;amp;&amp;amp; echo "deb https://ngrok-agent.s3.amazonaws.com buster main" \
  | sudo tee /etc/apt/sources.list.d/ngrok.list \
  &amp;amp;&amp;amp; sudo apt update \
  &amp;amp;&amp;amp; sudo apt install ngrok

# Or

sudo tar -xvzf ~/Downloads/ngrok-v3-stable-linux-amd64.tgz -C /usr/local/bin
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Account Setup and Authentication
&lt;/h3&gt;

&lt;p&gt;While ngrok can be used without an account for basic functionality, creating a free account provides additional features and removes some limitations. After creating an account on the ngrok website, you’ll receive an authentication token.&lt;/p&gt;

&lt;p&gt;To configure ngrok with your authentication token:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ngrok config add-authtoken YOUR_AUTH_TOKEN
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command stores your authentication token in ngrok’s configuration file, allowing you to access premium features and remove the time restrictions on free tunnels.&lt;/p&gt;

&lt;h3&gt;
  
  
  Basic Usage Examples
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Exposing a Local Web Server
&lt;/h4&gt;

&lt;p&gt;The most common use case is exposing a local web server. If you have a web application running on port 8000:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ngrok http 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command creates a tunnel to your local server and provides you with a public URL that looks like &lt;a href="https://abc123.ngrok.io" rel="noopener noreferrer"&gt;https://abc123.ngrok.io&lt;/a&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Exposing Other Protocols
&lt;/h4&gt;

&lt;p&gt;ngrok isn’t limited to HTTP traffic. You can expose TCP services as well:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ngrok tcp 22
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This would expose your local SSH server to the internet, though this should be done with caution for security reasons.&lt;/p&gt;

&lt;h4&gt;
  
  
  Custom Subdomains
&lt;/h4&gt;

&lt;p&gt;With a paid ngrok account, you can specify custom subdomains:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ngrok http --subdomain=myapp 3000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a tunnel with a predictable URL like &lt;a href="https://myapp.ngrok.io" rel="noopener noreferrer"&gt;https://myapp.ngrok.io&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  MCP Development server
&lt;/h3&gt;

&lt;p&gt;ngrok excels at allowing developers to test their applications with real-world scenarios without deploying to a staging environment. You can quickly share your work-in-progress application with team members, clients, or stakeholders by simply sharing the ngrok URL.&lt;/p&gt;

&lt;p&gt;We host an MCP server that provides current crypto market information.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The application is served by a FastAPI server&lt;/li&gt;
&lt;li&gt;The data is obtained from the CoinMarketCap platform API&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The function definition is as follows&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@app.get("/api/currency-info")
async def get_currency_info(
    exchange: CryptoExchange = Depends(get_exchange),
    symbols: str = Query(None, description="Comma-separated list of currency symbols (e.g., BTC,ETH,ADA)"),
    convert: str = Query("USD", description="Currency to convert values to (e.g., USD, EUR)")
):
    """Get detailed currency information from CoinMarketCap including market cap, supply, and price metrics"""
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The output of the function for input currency BTC is as follows&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "data": {
    "BTC": {
      "id": 1,
      "name": "Bitcoin",
      "symbol": "BTC",
      "slug": "bitcoin",
      "cmc_rank": 1,
      "circulating_supply": 19887956,
      "total_supply": 19887956,
      "max_supply": 21000000,
      "last_updated": "2025-07-05T14:18:00.000Z",
      "pricing": {
        "price": 108244.86616819742,
        "volume_24h": 38435406044.59744,
        "volume_change_24h": -6.6511,
        "percent_change_1h": 0.03610656,
        "percent_change_24h": 0.41189629,
        "percent_change_7d": 0.95202959,
        "percent_change_30d": 4.04716341,
        "percent_change_60d": 14.83875876,
        "percent_change_90d": 31.23696181,
        "market_cap": 2152769135578.9988,
        "market_cap_dominance": 64.7036,
        "fully_diluted_market_cap": 2273142189532.15,
        "last_updated": "2025-07-05T14:18:00.000Z"
      },
      "additional_metrics": {
        "volume_to_mc_ratio": 0.017853937707194057,
        "market_dominance_percent": 100.0,
        "circulating_to_max_ratio": 94.70455238095238,
        "fully_diluted_valuation": 2273142189532.15
      }
    }
  },
  "convert": "USD",
  "count": 1,
  "timestamp": "2025-07-05T19:49:58.756412"
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
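&lt;p&gt;The additional_metrics block is derived from the raw fields; for example, the volume-to-market-cap and circulating-to-max ratios can be recomputed directly from the values in the sample response above:&lt;/p&gt;

```python
# Values taken from the sample response above.
volume_24h = 38435406044.59744
market_cap = 2152769135578.9988
circulating_supply = 19887956
max_supply = 21000000

# Derived metrics, matching the additional_metrics fields.
volume_to_mc_ratio = volume_24h / market_cap
circulating_to_max = circulating_supply / max_supply * 100

print(round(volume_to_mc_ratio, 6), round(circulating_to_max, 5))
```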



&lt;p&gt;We use the fastapi_mcp Python package (&lt;a href="https://github.com/tadata-org/fastapi_mcp/tree/main" rel="noopener noreferrer"&gt;https://github.com/tadata-org/fastapi_mcp/tree/main&lt;/a&gt;), which provides a facility to expose your FastAPI endpoints as Model Context Protocol (MCP) tools, with Auth!&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from fastapi import APIRouter
from fastapi_mcp import FastApiMCP

other_router = APIRouter(prefix="/other/route")    
app.include_router(other_router)
# Mount the MCP server on the router
mcp = FastApiMCP(app)
mcp.mount(other_router)

def run():
    uvicorn.run("app:app", host="0.0.0.0", port=8000, reload=True)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We expose the MCP server at the route /other/route/mcp&lt;/p&gt;

&lt;h3&gt;
  
  
  Claude MCP Server configuration
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Adding the MCP Server on the ngrok Endpoint as a Custom Integration
&lt;/h4&gt;

&lt;p&gt;Our MCP server is running locally on port 8000 and listening on all interfaces (0.0.0.0). With ngrok providing a secure tunnel, we can connect the server to Claude by registering it as a custom integration.&lt;/p&gt;

&lt;p&gt;Once your MCP server is running, use ngrok to create a secure tunnel to your local server:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ngrok http 8000
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After running this command, ngrok will display a URL (typically in the format &lt;a href="https://xxxx-xxxx-xxxx.ngrok.io" rel="noopener noreferrer"&gt;https://xxxx-xxxx-xxxx.ngrok.io&lt;/a&gt;). Make note of this URL as you'll need it for the next step.&lt;/p&gt;
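The integration URL you will register with Claude is simply this ngrok base URL joined with the MCP mount path. A minimal sketch (the base URL below is the same placeholder used above):

```python
def mcp_integration_url(ngrok_base: str, route_prefix: str = "/other/route") -> str:
    """Join the ngrok tunnel base URL with the route prefix and the /mcp path."""
    return ngrok_base.rstrip("/") + route_prefix + "/mcp"

# Placeholder base URL; substitute the URL ngrok printed for your tunnel.
print(mcp_integration_url("https://xxxx-xxxx-xxxx.ngrok.io"))
# https://xxxx-xxxx-xxxx.ngrok.io/other/route/mcp
```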

&lt;h3&gt;
  
  
  Register your MCP Server as a Custom Integration in Claude
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;Navigate to the &lt;a href="https://claude.ai/settings/integrations" rel="noopener noreferrer"&gt;Claude AI Integrations&lt;/a&gt; settings page
&lt;/li&gt;
&lt;li&gt;Click “Add Integration”&lt;/li&gt;
&lt;li&gt;Enter the following details:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Name&lt;/strong&gt; : Give your integration a descriptive name (e.g., “Crypto Market Data MCP”)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP Server URL&lt;/strong&gt; : Enter your ngrok URL followed by the MCP endpoint path. For our example: &lt;a href="https://xxxx-xxxx-xxxx.ngrok.io/other/route/mcp" rel="noopener noreferrer"&gt;https://xxxx-xxxx-xxxx.ngrok.io/other/route/mcp&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;ol start="4"&gt;
&lt;li&gt;Click “Create” to register your integration&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AiEOWx-ixz3Kt-Azx" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AiEOWx-ixz3Kt-Azx" width="1024" height="622"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Verify the Integration
&lt;/h3&gt;

&lt;p&gt;After registering your custom integration, click Connect. Claude verifies that your MCP server is properly configured by:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Checking that the MCP server URL is accessible&lt;/li&gt;
&lt;li&gt;Validating that the server correctly implements the MCP protocol&lt;/li&gt;
&lt;li&gt;Retrieving the list of available tools from your server&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If the verification is successful, your integration will be marked as “Active” in the Claude Console.&lt;/p&gt;

&lt;h3&gt;
  
  
  Test the Integration in Claude
&lt;/h3&gt;

&lt;p&gt;Now you can test your custom integration:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Start a new conversation in Claude&lt;/li&gt;
&lt;li&gt;Check that your custom integration is visible and enabled in the tools section&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AqoVYfgF9mdzZWTwc" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AqoVYfgF9mdzZWTwc" width="1024" height="622"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Ask Claude a question that would require information from your crypto market data tool, such as “show the currency information for BTC in tabular format”&lt;/li&gt;
&lt;li&gt;Claude will recognize the need to use your tool, make the appropriate API call through the MCP server, and provide you with the results.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the conversation, we can see the MCP server being called, along with the request and response payloads.&lt;/p&gt;

&lt;p&gt;get_currency_info_api_currency_info_get&lt;/p&gt;

&lt;p&gt;Request&lt;/p&gt;

&lt;p&gt;{ &lt;code&gt;symbols&lt;/code&gt;: &lt;code&gt;BTC&lt;/code&gt; }&lt;/p&gt;

&lt;p&gt;Response&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{ "data": { "BTC": { "id": 1, "name": "Bitcoin", "symbol": "BTC", "slug": "bitcoin", "cmc_rank": 1, "circulating_supply": 19887987, "total_supply": 19887987, "max_supply": 21000000, "last_updated": "2025-07-05T16:37:00.000Z", "pricing": { "price": 107963.95555873076, "volume_24h": 33053849608.85693, "volume_change_24h": -23.731, "percent_change_1h": -0.1760637, "percent_change_24h": 0.35855272, "percent_change_7d": 0.48666905, "percent_change_30d": 3.83166081, "percent_change_60d": 14.07595308, "percent_change_90d": 31.15425869, "market_cap": 2147185744620.615, "market_cap_dominance": 64.6759, "fully_diluted_market_cap": 2267243066733.35, "last_updated": "2025-07-05T16:37:00.000Z" }, "additional_metrics": { "volume_to_mc_ratio": 0.015394033651568042, "market_dominance_percent": 100.0, "circulating_to_max_ratio": 94.7047, "fully_diluted_valuation": 2267243066733.35 } } }, "convert": "USD", "count": 1, "timestamp": "2025-07-05T22:08:34.125383" }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AiT3FyjTvQUigbUB_" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AiT3FyjTvQUigbUB_" width="1024" height="1345"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Remember that while ngrok is ideal for development and testing, for production deployments, it is recommended to host your MCP server on a stable, secure infrastructure with a fixed URL, rather than relying on temporary ngrok tunnels.&lt;/p&gt;

&lt;h3&gt;
  
  
  Security Considerations
&lt;/h3&gt;

&lt;p&gt;While ngrok is incredibly useful for development, it’s essential to understand the security implications. When you create a tunnel, you’re exposing your local application to the internet, which means anyone with the URL can potentially access it.&lt;/p&gt;

&lt;p&gt;For sensitive applications, consider using ngrok’s authentication features, such as basic authentication or custom headers. Always be mindful of what data you’re exposing and avoid using ngrok tunnels for production applications unless you have specific security measures in place.&lt;/p&gt;
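For example, ngrok can require HTTP basic authentication on the tunnel itself (ngrok v3 syntax; choose your own credentials):

```shell
# Expose port 8000, but require basic auth on every request through the tunnel
ngrok http 8000 --basic-auth "user:strong-password"
```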

&lt;p&gt;The ngrok service itself uses strong encryption for all tunnel traffic, ensuring that data passing through the tunnel remains secure. However, the endpoints themselves become publicly accessible, so proper application-level security remains crucial.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;ngrok has become an essential tool in the modern developer’s toolkit, dramatically simplifying the process of testing, demonstrating, and integrating applications that require interaction with external services. Its ease of use, combined with powerful features, makes it invaluable for everything from simple webhook testing to complex IoT device development.&lt;/p&gt;




</description>
      <category>aritificialintellige</category>
      <category>aiagentdevelopment</category>
      <category>claudemcp</category>
      <category>ai</category>
    </item>
    <item>
      <title>Erasing Git History: How to Keep Only Your Latest Commit on GitHub</title>
      <dc:creator>Prashant Iyenga</dc:creator>
      <pubDate>Thu, 26 Jun 2025 15:12:49 +0000</pubDate>
      <link>https://dev.to/pi19404/erasing-git-history-how-to-keep-only-your-latest-commit-on-github-1365</link>
      <guid>https://dev.to/pi19404/erasing-git-history-how-to-keep-only-your-latest-commit-on-github-1365</guid>
      <description>&lt;p&gt;&lt;strong&gt;Have you ever committed sensitive data by mistake? Or maybe you want to wipe out your repository’s tangled history and start with a clean slate while preserving the current codebase?&lt;/strong&gt; This guide walks you through the safe and deliberate process of &lt;strong&gt;removing all Git history and retaining only the latest commit&lt;/strong&gt; on GitHub.&lt;/p&gt;

&lt;h3&gt;
  
  
  📌 Why Would You Want to Do This?
&lt;/h3&gt;

&lt;p&gt;There are valid and common scenarios where this is useful:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔐 You’ve committed secrets (e.g., API keys) in the past.&lt;/li&gt;
&lt;li&gt;🧪 Your repository history is too messy or experimental.&lt;/li&gt;
&lt;li&gt;🚀 You want a fresh start for open-source release.&lt;/li&gt;
&lt;li&gt;🎯 You’re transitioning from a prototype to production.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; This is a &lt;strong&gt;destructive action&lt;/strong&gt;. It will &lt;strong&gt;wipe out all previous commit history&lt;/strong&gt; , including tags, branches, and pull requests. Only proceed if you understand the implications.&lt;/p&gt;

&lt;h3&gt;
  
  
  🚧 What You’ll Achieve
&lt;/h3&gt;

&lt;p&gt;By the end of this process:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The repository will contain &lt;strong&gt;only one commit&lt;/strong&gt;  — the current state of the code.&lt;/li&gt;
&lt;li&gt;All previous commits will be &lt;strong&gt;irreversibly removed&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;The branch history will be &lt;strong&gt;rewritten and force-pushed&lt;/strong&gt; to GitHub.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🛠️ Step-by-Step: Clean Your GitHub Repo History
&lt;/h3&gt;

&lt;h3&gt;
  
  
  ✅ 1. Clone Your Repository (Optional but Recommended)
&lt;/h3&gt;

&lt;p&gt;Before you rewrite history, it’s good to work in a fresh clone to avoid local conflicts.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git clone https://github.com/your-username/your-repo.git
cd your-repo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🌿 2. Create a New Orphan Branch
&lt;/h3&gt;

&lt;p&gt;An &lt;em&gt;orphan branch&lt;/em&gt; is a Git branch with no previous history. This is the clean slate.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git checkout --orphan latest-commit
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This command:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creates a new branch.&lt;/li&gt;
&lt;li&gt;Starts it with no parents (no history).&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🧹 3. Stage and Commit All Files
&lt;/h3&gt;

&lt;p&gt;Now, commit everything currently in your working directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git add -A
git commit -m "Initial commit with cleaned history"
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This creates a brand-new commit that captures your current code state.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔥 4. Delete the Old Branch
&lt;/h3&gt;

&lt;p&gt;Next, delete the existing branch (usually main or master) that contains the unwanted history.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git branch -D main # or 'master', depending on your repo
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  📝 5. Rename the New Branch to main
&lt;/h3&gt;

&lt;p&gt;Give your new orphan branch the same name as the one you deleted.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git branch -m main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  🚀 6. Force Push to GitHub
&lt;/h3&gt;

&lt;p&gt;Finally, &lt;strong&gt;force push&lt;/strong&gt; the cleaned branch to GitHub. This replaces the remote history.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;git push -f origin main
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🛑 &lt;strong&gt;Warning&lt;/strong&gt; : Collaborators will need to re-clone the repo. This push rewrites history for everyone.&lt;/p&gt;
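The whole flow can be rehearsed safely in a throwaway local repository before touching your real one. This is a sketch: there is no remote here, so the force-push step is shown commented out (requires git 2.28+ for `init -b`):

```shell
# Rehearse the history wipe in a temporary local repo
set -e
tmp=$(mktemp -d)
cd "$tmp"
git init -q -b main
git config user.email demo@example.com
git config user.name demo
echo v1 > file.txt && git add -A && git commit -qm "first"
echo v2 > file.txt && git add -A && git commit -qm "second"

git checkout -q --orphan latest-commit                # step 2: orphan branch
git add -A                                            # step 3: stage everything
git commit -qm "Initial commit with cleaned history"
git branch -D main                                    # step 4: delete old branch
git branch -m main                                    # step 5: rename to main
# git push -f origin main                             # step 6 (no remote in this rehearsal)

git log --oneline   # exactly one commit remains
```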

&lt;h3&gt;
  
  
  🛡️ Aftermath: Clean-Up and Best Practices
&lt;/h3&gt;

&lt;p&gt;Once you’ve rewritten history:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;🔐 &lt;strong&gt;Revoke and rotate any secrets&lt;/strong&gt; that were exposed in the past.&lt;/li&gt;
&lt;li&gt;🕵️‍♀️ &lt;strong&gt;Scan your repo&lt;/strong&gt; using tools like &lt;a href="https://github.com/gitleaks/gitleaks" rel="noopener noreferrer"&gt;GitLeaks&lt;/a&gt;, &lt;a href="https://github.com/trufflesecurity/trufflehog" rel="noopener noreferrer"&gt;TruffleHog&lt;/a&gt;, or &lt;a href="https://github.com/Yelp/detect-secrets" rel="noopener noreferrer"&gt;detect-secrets&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;📄 Add a .gitignore file to prevent re-adding sensitive or build files.&lt;/li&gt;
&lt;li&gt;🧪 Use pre-commit hooks to lint and scan files before they’re staged.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  📚 Further Reading
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.github.com/en/authentication/keeping-your-account-and-data-secure/removing-sensitive-data-from-a-repository" rel="noopener noreferrer"&gt;GitHub Docs: Removing Sensitive Data&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/newren/git-filter-repo" rel="noopener noreferrer"&gt;git filter-repo&lt;/a&gt; – Advanced history rewriting&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/neural-engineer/how-to-remove-sensitive-data-openai-api-keys-from-git-repositories-using-bfg-repo-cleaner-6fdad0cb1243" rel="noopener noreferrer"&gt;https://medium.com/neural-engineer/how-to-remove-sensitive-data-openai-api-keys-from-git-repositories-using-bfg-repo-cleaner-6fdad0cb1243&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  ✅ Conclusion
&lt;/h3&gt;

&lt;p&gt;Wiping a GitHub repository’s history is a powerful tool that should be used with caution. Whether you’re mitigating a leak or simplifying your project’s start point, this guide helps you do so responsibly and effectively.&lt;/p&gt;




</description>
      <category>softwaredevelopment</category>
      <category>git</category>
      <category>software</category>
    </item>
    <item>
      <title>Hugging Face Zero GPU Spaces: ShieldGemma Application</title>
      <dc:creator>Prashant Iyenga</dc:creator>
      <pubDate>Mon, 09 Sep 2024 07:00:44 +0000</pubDate>
      <link>https://dev.to/pi19404/hugging-face-zero-gpu-spaces-shieldgemma-application-12g5</link>
      <guid>https://dev.to/pi19404/hugging-face-zero-gpu-spaces-shieldgemma-application-12g5</guid>
      <description>&lt;h3&gt;
  
  
  What is ZeroGPU
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Hugging Face Spaces now offers a new hardware option called ZeroGPU.&lt;/li&gt;
&lt;li&gt;ZeroGPU uses &lt;em&gt;Nvidia A100&lt;/em&gt; GPU devices under the hood, and 40GB of vRAM are available for each workload.&lt;/li&gt;
&lt;li&gt;This is achieved by making Spaces efficiently hold and release GPUs as needed, as opposed to a classical GPU Space that holds exactly one GPU at any given time.&lt;/li&gt;
&lt;li&gt;You can explore and use existing public ZeroGPU Spaces for free. The list of public zero GPU spaces can be found at (&lt;a href="https://huggingface.co/spaces/enzostvs/zero-gpu-spaces" rel="noopener noreferrer"&gt;https://huggingface.co/spaces/enzostvs/zero-gpu-spaces&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Hosting models on ZeroGPU Spaces has the following restrictions.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ZeroGPU is currently in beta.&lt;/strong&gt; It only works with the &lt;strong&gt;Gradio SDK&lt;/strong&gt; and enables us to deploy Gradio applications that run on ZeroGPU for free.&lt;/li&gt;
&lt;li&gt;It is only available for Personal accounts subscribed to HuggingFace Pro. It will appear in the hardware list when you select the Gradio SDK.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdk0i19zf5ndnhxia95s3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdk0i19zf5ndnhxia95s3.png" width="800" height="304"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;There is a limit of 10 ZeroGPU Spaces that Personal accounts&lt;/strong&gt; with PRO subscriptions can host&lt;/li&gt;
&lt;li&gt;Though the documentation mentions that ZeroGPU uses an A100, you may observe significantly lower performance than a standard A100, as GPU allocation may be time-sliced&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Spaces and zero GPU
&lt;/h3&gt;

&lt;p&gt;To make your Space work with ZeroGPU, you need to &lt;strong&gt;decorate&lt;/strong&gt; the Python functions that require a GPU with &lt;code&gt;@spaces.GPU&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;spaces&lt;/span&gt;

&lt;span class="nd"&gt;@spaces.GPU&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;my_inference_function&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_data&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_size&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When a decorated function is invoked, the Space is allocated a GPU, which is released when the function completes. You can find complete instructions for making your code compatible with ZeroGPU Spaces at the link below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/zero-gpu-explorers" rel="noopener noreferrer"&gt;&lt;strong&gt;ZeroGPU Explorers on Hugging Face&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4jkusnkgbutaucy8bk5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fp4jkusnkgbutaucy8bk5.png" width="800" height="407"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If your space is running on Zero GPU, you can see the status on the space's project page, along with the CPU and RAM consumption.&lt;/p&gt;

&lt;p&gt;The software versions compatible with Gradio SDK ZeroGPU Spaces are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gradio: 4+&lt;/li&gt;
&lt;li&gt;PyTorch: All versions from &lt;code&gt;2.0.0&lt;/code&gt; to &lt;code&gt;2.2.0&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Python: &lt;code&gt;3.10.13&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Duration
&lt;/h3&gt;

&lt;p&gt;A &lt;code&gt;duration&lt;/code&gt; parameter on the &lt;code&gt;spaces.GPU&lt;/code&gt; decorator lets us specify the maximum GPU allocation time for a function call.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The default is 60 seconds.&lt;/li&gt;
&lt;li&gt;If you expect your GPU function to take more than &lt;strong&gt;60s&lt;/strong&gt;, you need to specify a longer duration.&lt;/li&gt;
&lt;li&gt;If you know that your function will take far less than the 60s default, specifying a shorter duration gives visitors to your Space higher priority in the queue
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="nd"&gt;@spaces.GPU&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;duration&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;defgenerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
&lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="nf"&gt;pipe&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="n"&gt;images&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This sets the maximum duration of the function call to 20 seconds.&lt;/p&gt;
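Conceptually, the decorator wraps the function so the GPU is acquired just before the call and released right after, bounded by the declared duration. A minimal stand-in sketch of that pattern (illustrative only, not the real `spaces` implementation):

```python
import functools

events = []  # records acquire/release for illustration

def gpu(duration=60):
    """Illustrative stand-in for spaces.GPU: acquire a GPU before the
    call and release it afterwards, with a declared maximum duration."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            events.append(f"acquire (max {duration}s)")
            try:
                return fn(*args, **kwargs)
            finally:
                events.append("release")
        return wrapper
    return decorator

@gpu(duration=20)
def generate(prompt):
    return f"image for {prompt!r}"

print(generate("a cat"))  # the GPU is held only for the duration of this call
print(events)             # ['acquire (max 20s)', 'release']
```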

&lt;h3&gt;
  
  
  Hosting Private ZeroGPU Spaces
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;We will create a space &lt;code&gt;pi19404/ai-worker&lt;/code&gt;, which is a ZeroGPU space with private visibility.&lt;/li&gt;
&lt;li&gt;The Gradio server hosted on this space provides a ShieldGemma2 model inference endpoint.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For more details on ShieldGemma2, refer to the article:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/neural-engineer/llm-content-safety-evaluation-using-shieldgemma-ea05491da271" rel="noopener noreferrer"&gt;&lt;strong&gt;LLM Content Safety Evaluation using ShieldGemma&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We create a space &lt;a href="https://huggingface.co/spaces/pi19404/shieldgemma-demo" rel="noopener noreferrer"&gt;&lt;code&gt;pi19404/shieldgemma-demo&lt;/code&gt;&lt;/a&gt; that programmatically calls the application hosted on space &lt;code&gt;pi19404/ai-worker&lt;/code&gt;. This space's visibility is public.&lt;/li&gt;
&lt;li&gt;We configure the Hugging Face token as a secret named &lt;code&gt;API_TOKEN&lt;/code&gt; in the project settings of space &lt;a href="https://huggingface.co/spaces/pi19404/shieldgemma-demo" rel="noopener noreferrer"&gt;&lt;code&gt;pi19404/shieldgemma-demo&lt;/code&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;We can call the Gradio server API using the Gradio client, as shown below
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;gradio_client&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Client&lt;/span&gt;

&lt;span class="n"&gt;API_TOKEN&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getenv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;API_TOKEN&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Initialize the Gradio Client
# This connects to the private zeroGPU Hugging Face space "pi19404/ai-worker"
&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pi19404/ai-worker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;hf_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;API_TOKEN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Make a prediction using the client
# The predict method calls the specified API endpoint with the given parameters
&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;predict&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="c1"&gt;# Input parameters for the my_inference_function API
&lt;/span&gt;    &lt;span class="n"&gt;input_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello!!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;     &lt;span class="c1"&gt;# The input text to be evaluated
&lt;/span&gt;    &lt;span class="n"&gt;output_data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Hello!!&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;    &lt;span class="c1"&gt;# The output text to be evaluated (if applicable)
&lt;/span&gt;    &lt;span class="n"&gt;mode&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scoring&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# The mode of operation: "scoring" or "generative"
&lt;/span&gt;    &lt;span class="n"&gt;max_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;           &lt;span class="c1"&gt;# Maximum length of the input prompt
&lt;/span&gt;    &lt;span class="n"&gt;max_new_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;      &lt;span class="c1"&gt;# Maximum number of new tokens to generate
&lt;/span&gt;    &lt;span class="n"&gt;model_size&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;          &lt;span class="c1"&gt;# Size of the model to use: "2B", "9B", or "27B"
&lt;/span&gt;    &lt;span class="n"&gt;api_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;/my_inference_function&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;  &lt;span class="c1"&gt;# The specific API endpoint to call
&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# Print the result of the prediction
&lt;/span&gt;&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;result&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Explaining Rate Limits for ZeroGPU
&lt;/h3&gt;

&lt;p&gt;The Hugging Face platform rate-limits ZeroGPU Spaces to ensure that a single user does not hog all available GPUs. The limit is controlled by a special token that the Hugging Face Hub infrastructure adds to all incoming requests to Spaces. This token is a request header called &lt;code&gt;X-IP-Token&lt;/code&gt;, and its value changes depending on the user who requests the ZeroGPU space.&lt;/p&gt;

&lt;p&gt;With the Python client, you will quickly exhaust your rate limit, as all requests to the ZeroGPU space will carry the same token. To avoid this, we need to extract the user's token in Space &lt;a href="https://huggingface.co/spaces/pi19404/shieldgemma-demo" rel="noopener noreferrer"&gt;&lt;code&gt;pi19404/shieldgemma-demo&lt;/code&gt;&lt;/a&gt; before we call Space &lt;code&gt;pi19404/ai-worker&lt;/code&gt; programmatically.&lt;/p&gt;

&lt;p&gt;When a new user visits the page&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We use the &lt;code&gt;load&lt;/code&gt; event to extract the user’s &lt;code&gt;x-ip-token&lt;/code&gt; header when the user visits the page.
&lt;/li&gt;
&lt;/ul&gt;
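A small helper can pull the visitor's token out of the incoming request headers and forward it to the worker Space. This is a sketch: the helper name is ours, and the `Client(..., headers=...)` usage shown in the comment assumes the gradio_client `headers` parameter:

```python
def forward_ip_token(request_headers: dict) -> dict:
    """Build the headers dict to pass to the worker Space so that
    rate limiting applies to the visitor, not to this Space's token."""
    token = request_headers.get("x-ip-token")
    return {"x-ip-token": token} if token else {}

# Inside the load handler, something like (names assumed):
# def set_client_for_session(request: gr.Request):
#     return Client("pi19404/ai-worker", hf_token=API_TOKEN,
#                   headers=forward_ip_token(dict(request.headers)))

print(forward_ip_token({"x-ip-token": "abc123"}))  # {'x-ip-token': 'abc123'}
```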

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Blocks&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Main Gradio interface setup.This block sets up the Gradio interface, including:
    - A State component to store the client for the session.
    - A JSON component to display request headers for debugging.
    - Other UI components (not shown in this snippet).
    - A load event that calls set_client_for_session when the interface is loaded.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;

    &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Markdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;## LLM Safety Evaluation&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;State&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Tab&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ShieldGemma2&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;

        &lt;span class="n"&gt;input_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Textbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Input Text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;output_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Textbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Response Text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_lines&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;show_copy_button&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;elem_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wrap-text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;mode_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dropdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;scoring&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;generative&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Prediction Mode&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;max_length_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Max Length&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;150&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;max_new_tokens_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Number&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Max New Tokens&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;value&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;1024&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;model_size_input&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Dropdown&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;choices&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;2B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;9B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;27B&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Model Size&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;response_text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Textbox&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;label&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Output Text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;lines&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;max_lines&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;20&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;show_copy_button&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;elem_classes&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;wrap-text&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;text_button&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;gr&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;Button&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Submit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;text_button&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;click&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;my_inference_function&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;inputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;input_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;output_text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;mode_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_length_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;max_new_tokens_input&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;model_size_input&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;outputs&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;response_text&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;load&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;set_client_for_session&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;demo&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;launch&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;share&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;We create a new Gradio client, passing the &lt;code&gt;X-IP-Token&lt;/code&gt; header via the &lt;code&gt;headers&lt;/code&gt; parameter.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create an OrderedDict to store clients, limited to 15 entries
&lt;/span&gt;&lt;span class="n"&gt;client_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OrderedDict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;MAX_CACHE_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;

&lt;span class="n"&gt;default_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pi19404/ai-worker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hf_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;API_TOKEN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_client_for_ip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;x_ip_token&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Retrieve or create a client for the given IP address. This function implements a caching mechanism that stores up to MAX_CACHE_SIZE clients.
    If a client for the given IP exists in the cache, it&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s returned and moved to the end
    of the cache (marking it as most recently used). If not, a new client is created,
    added to the cache, and the least recently used client is removed if the cache is full.
    Args:
        ip_address (str): The IP address of the client.
        x_ip_token (str): The X-IP-Token header value for the client.
    Returns:
        Client: A Gradio client instance for the given IP address.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x_ip_token&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;x_ip_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ip_address&lt;/span&gt;
    &lt;span class="c1"&gt;#print("ipaddress is ",x_ip_token)
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x_ip_token&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;new_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_client&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x_ip_token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client_cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Move the accessed item to the end (most recently used)
&lt;/span&gt;            &lt;span class="n"&gt;client_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;move_to_end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_ip_token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x_ip_token&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="c1"&gt;# Create a new client
&lt;/span&gt;        &lt;span class="n"&gt;new_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pi19404/ai-worker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hf_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;API_TOKEN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-IP-Token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x_ip_token&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="c1"&gt;# Add to cache, removing oldest if necessary
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client_cache&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;MAX_CACHE_SIZE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;client_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;popitem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;client_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x_ip_token&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new_client&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;new_client&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;This ensures all subsequent predictions pass this header to the ZeroGPU space.&lt;/li&gt;
&lt;li&gt;The client is saved in a Gradio State variable so that it is independent of other users; it is deleted automatically when the user leaves the page.&lt;/li&gt;
&lt;li&gt;We also save the Gradio client in an in-memory cache so that a new client does not need to be created when a page is loaded again from the same IP address.&lt;/li&gt;
&lt;/ul&gt;
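&lt;p&gt;The eviction behavior described above can be sketched independently of Gradio as a small &lt;code&gt;OrderedDict&lt;/code&gt;-based LRU cache. The helper name &lt;code&gt;get_cached&lt;/code&gt; and its &lt;code&gt;factory&lt;/code&gt; argument are illustrative stand-ins for the client-creation logic, not part of the original code:&lt;/p&gt;

```python
from collections import OrderedDict

# Illustrative LRU cache; in the article the values are Gradio Client
# objects keyed by X-IP-Token, here the factory is a generic callable.
MAX_CACHE_SIZE = 15
client_cache = OrderedDict()

def get_cached(key, factory):
    """Return the cached value for key, creating it with factory() and
    evicting the least recently used entry when the cache is full."""
    if key in client_cache:
        client_cache.move_to_end(key)      # mark as most recently used
        return client_cache[key]
    value = factory()
    if len(client_cache) >= MAX_CACHE_SIZE:
        client_cache.popitem(last=False)   # drop least recently used entry
    client_cache[key] = value
    return value
```

&lt;p&gt;Because hits call &lt;code&gt;move_to_end&lt;/code&gt;, an actively used token is never the one evicted; only the entry untouched for the longest time is dropped.&lt;/p&gt;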

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Create an OrderedDict to store clients, limited to 15 entries
&lt;/span&gt;&lt;span class="n"&gt;client_cache&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;OrderedDict&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="n"&gt;MAX_CACHE_SIZE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;15&lt;/span&gt;
&lt;span class="n"&gt;default_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pi19404/ai-worker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hf_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;API_TOKEN&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;get_client_for_ip&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ip_address&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;x_ip_token&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="sh"&gt;"""&lt;/span&gt;&lt;span class="s"&gt;
    Retrieve or create a client for the given IP address.This function implements a caching mechanism to store up to MAX_CACHE_SIZE clients.
    If a client for the given IP exists in the cache, it&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s returned and moved to the end
    of the cache (marking it as most recently used). If not, a new client is created,
    added to the cache, and the least recently used client is removed if the cache is full.
    Args:
        ip_address (str): The IP address of the client.
        x_ip_token (str): The X-IP-Token header value for the client.
    Returns:
        Client: A Gradio client instance for the given IP address.
    &lt;/span&gt;&lt;span class="sh"&gt;"""&lt;/span&gt;
    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x_ip_token&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;x_ip_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;ip_address&lt;/span&gt;
    &lt;span class="c1"&gt;#print("ipaddress is ",x_ip_token)
&lt;/span&gt;    &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x_ip_token&lt;/span&gt; &lt;span class="ow"&gt;is&lt;/span&gt; &lt;span class="bp"&gt;None&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;new_client&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;default_client&lt;/span&gt;
    &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;

        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;x_ip_token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;client_cache&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="c1"&gt;# Move the accessed item to the end (most recently used)
&lt;/span&gt;            &lt;span class="n"&gt;client_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;move_to_end&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x_ip_token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
            &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;client_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x_ip_token&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="c1"&gt;# Create a new client
&lt;/span&gt;        &lt;span class="n"&gt;new_client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;pi19404/ai-worker&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;hf_token&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;API_TOKEN&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;X-IP-Token&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;x_ip_token&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
        &lt;span class="c1"&gt;# Add to cache, removing oldest if necessary
&lt;/span&gt;        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="nf"&gt;len&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;client_cache&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="n"&gt;MAX_CACHE_SIZE&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="n"&gt;client_cache&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;popitem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;client_cache&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;x_ip_token&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;new_client&lt;/span&gt;

    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;new_client&lt;/span&gt;

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can find the full Gradio client code at:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/spaces/pi19404/shieldgemma-demo" rel="noopener noreferrer"&gt;&lt;strong&gt;ShieldGemma Demo - a Hugging Face Space by pi19404&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Public Gradio Interface and Code
&lt;/h3&gt;

&lt;p&gt;You can find the Gradio interface at:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://huggingface.co/spaces/pi19404/shieldgemma-demo" rel="noopener noreferrer"&gt;&lt;strong&gt;Shieldgemma Demo - a Hugging Face Space by pi19404&lt;/strong&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9192q4qzy7c6v95hsrdi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9192q4qzy7c6v95hsrdi.png" width="800" height="551"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  References
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.gradio.app/main/docs/python-client/using-zero-gpu-spaces" rel="noopener noreferrer"&gt;https://www.gradio.app/main/docs/python-client/using-zero-gpu-spaces&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://huggingface.co/zero-gpu-explorers" rel="noopener noreferrer"&gt;https://huggingface.co/zero-gpu-explorers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ai</category>
      <category>huggingface</category>
      <category>shieldgemma</category>
    </item>
  </channel>
</rss>
