# RD-Agent Tutorial - Chapter 2: Core Functions
## 2.1 Command Line Interface Basics

### CLI Structure and Design Principles
RD-Agent adopts a unified command line interface (CLI): every function is accessed through the single `rdagent` command. The CLI core lives in `rdagent/app/cli.py` and is built on the Typer framework, providing an intuitive and powerful command line experience.
#### Architecture Analysis

```python
# rdagent/app/cli.py core structure
import typer

from rdagent.app.data_science.loop import main as data_science
from rdagent.app.qlib_rd_loop.factor import main as fin_factor
from rdagent.app.qlib_rd_loop.model import main as fin_model
from rdagent.app.qlib_rd_loop.quant import main as fin_quant

app = typer.Typer()

# Register commands
app.command(name="fin_factor")(fin_factor)
app.command(name="fin_model")(fin_model)
app.command(name="fin_quant")(fin_quant)
app.command(name="data_science")(data_science)
```
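The registration style above can be illustrated with a tiny stand-in (a sketch, not Typer's real internals): `app.command(name=...)` returns a decorator that records the function under that name, which is how `rdagent fin_factor` ends up dispatching to `fin_factor()`.

```python
# Minimal sketch of the decorator-registration pattern; MiniApp is a
# hypothetical stand-in for typer.Typer, used only to show the mechanism.

class MiniApp:
    def __init__(self):
        self._commands = {}

    def command(self, name):
        def register(fn):
            self._commands[name] = fn  # map CLI name -> handler function
            return fn
        return register

    def run(self, name, *args):
        return self._commands[name](*args)

app = MiniApp()

def fin_factor():
    return "factor mining started"

# Same registration style as the cli.py excerpt above:
app.command(name="fin_factor")(fin_factor)

print(app.run("fin_factor"))  # prints: factor mining started
```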
#### Command Classification and Organization

RD-Agent commands are organized by functional domain:

| Command Category | Command | Function Description | Primary Use |
|---|---|---|---|
| Financial Quant | `fin_factor` | Automatic factor mining | Discover effective quantitative factors |
| | `fin_model` | Automatic model evolution | Optimize prediction models |
| | `fin_quant` | Factor-model joint optimization | End-to-end strategy development |
| | `fin_factor_report` | Report factor extraction | Extract signals from financial reports |
| Data Science | `data_science` | General data science tasks | ML competitions, modeling projects |
| General Model | `general_model` | Paper model implementation | Research paper reproduction |
| Utility Functions | `ui` | Web interface | Visualization and monitoring |
| | `health_check` | Health check | System status verification |
| | `collect_info` | Information collection | Diagnostics and debugging |
#### Automatic Environment Loading Mechanism

An important feature of the RD-Agent CLI is automatic environment variable loading:

```python
from dotenv import load_dotenv

load_dotenv(".env")  # Automatically load the .env file from the current directory
```

Advantages:

- Automatic configuration loading, no manual environment variable setup needed
- Project-level configuration: different projects can keep different configs
- Configuration files stay on the local machine, which improves security
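What `load_dotenv(".env")` does can be sketched in a few lines of plain Python. python-dotenv itself handles many more edge cases; `load_env_file`, the `demo.env` file, and the `DEMO_*` variable names here are hypothetical illustrations.

```python
# Illustrative sketch of .env loading: parse KEY=VALUE lines and populate
# os.environ without overriding variables that are already set.
import os

def load_env_file(path):
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip('"'))

# Usage with a hypothetical demo file:
with open("demo.env", "w") as fh:
    fh.write('DEMO_CHAT_MODEL=gpt-4o\n# a comment\nDEMO_DATA_PATH="./data"\n')
load_env_file("demo.env")
print(os.environ["DEMO_CHAT_MODEL"])   # prints: gpt-4o
print(os.environ["DEMO_DATA_PATH"])    # prints: ./data
```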
### Basic Command Usage

#### Help System Usage

```bash
# View all available commands
rdagent --help

# View detailed help for specific commands
rdagent fin_factor --help
rdagent data_science --help

# View subcommand parameters
rdagent ui --help
```
Example output:

```text
Usage: rdagent [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  data_science   Run data science automation
  fin_factor     Run factor mining
  fin_model      Run model evolution
  fin_quant      Run factor-model joint optimization
  general_model  Extract and implement models from papers
  health_check   Check system health
  ui             Start web interface
```
#### Global Parameters and Options

While most of RD-Agent's configuration is managed through `.env` files, certain commands accept runtime parameters:

```bash
# Common parameters for the UI command
rdagent ui --port 19899 --log-dir ./logs --debug

# Selective checks for the health check
rdagent health_check --no-check-env --no-check-docker

# Competition specification for the data science command
rdagent data_science --competition tabular-playground-series-dec-2021
```
#### Logging and Output Management

RD-Agent provides comprehensive logging capabilities:

```bash
# Run with the default log level
rdagent fin_factor

# Enable verbose logging (via environment variable)
export RDAGENT_LOG_LEVEL=DEBUG
rdagent fin_factor

# Specify the log output directory
export RDAGENT_LOG_DIR=./custom_logs
rdagent fin_quant
```

Log structure:

```text
logs/
├── rdagent.log        # Main log file
├── experiments/       # Experiment logs
│   ├── factor_exp_001/
│   └── model_exp_001/
└── ui/                # Web UI logs
    └── access.log
```
### Monitoring Interface Usage

#### Web UI Startup and Configuration

RD-Agent provides a powerful web interface for real-time monitoring and result viewing:

```bash
# Basic startup
rdagent ui --port 19899

# Specify the log directory
rdagent ui --port 19899 --log-dir ./logs

# Enable debug mode
rdagent ui --port 19899 --debug

# Data-science-specific interface
rdagent ui --port 19899 --data_science True
```
#### Real-time Monitoring Features

The Web UI provides the following monitoring capabilities:

1. **Experiment Progress Tracking**
   - Real-time experiment status display
   - Performance metric charts
   - Timeline view
   - Auto-refresh
2. **Log Viewer**
   - Structured log display
   - Log search and filtering
   - Multi-level logs (INFO, DEBUG, ERROR)
   - Log export functionality
3. **Result Visualization**
   - Factor performance comparison
   - Model training curves
   - Backtest result charts
   - Detailed performance reports
#### Interface Function Details

Main dashboard:

```text
┌──────────────────────────────────────┐
│ RD-Agent Monitoring Panel            │
├──────────────────────────────────────┤
│ Experiment Status: Running (2/5)     │
│ Runtime: 1h 23m                      │
│ Current Task: Factor Validation      │
│ Best Score: 0.234 (IC)               │
└──────────────────────────────────────┘
```

Experiment list:

| Experiment ID | Type | Status | Start Time | Score |
|---|---|---|---|---|
| EXP_001 | Factor | Complete | 14:23:01 | 0.198 |
| EXP_002 | Factor | Running | 14:45:12 | - |
| EXP_003 | Model | Waiting | - | - |
## 2.2 Data Science Agent Details

### Data Science Scenario Overview

RD-Agent's data science agent is a fully automated machine learning engineering system capable of autonomously completing the entire workflow from data exploration to model deployment. The agent achieved first place on the MLE-bench benchmark, demonstrating powerful automation capabilities.
#### Supported Task Types
| Task Type | Description | Application Scenarios | Technical Features |
|---|---|---|---|
| Tabular Data Modeling | Prediction tasks for structured data | Financial prediction, user behavior analysis | Feature engineering, model ensembling |
| Time Series Forecasting | Predictive modeling for time series data | Stock price prediction, demand forecasting | Time series features, sequence models |
| Image Classification | Computer vision tasks | Medical imaging, product recognition | CNN, transfer learning |
| Natural Language Processing | Text data processing | Sentiment analysis, document classification | Pretrained models, embeddings |
| Regression Analysis | Continuous value prediction | Price prediction, rating estimation | Linear/non-linear models |
#### Workflow and Architecture

```mermaid
graph TD
    A[Data Input] --> B[Hypothesis Generation]
    B --> C[Experiment Design]
    C --> D[Feature Engineering]
    D --> E[Model Development]
    E --> F[Validation Testing]
    F --> G[Result Analysis]
    G --> H{Performance Satisfied?}
    H -->|No| I[Feedback Learning]
    I --> B
    H -->|Yes| J[Model Output]

    style B fill:#e1f5fe
    style D fill:#f3e5f5
    style E fill:#e8f5e8
    style G fill:#fff3e0
```
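The diagram above boils down to a propose-experiment-evaluate loop that repeats until performance is satisfactory. The sketch below is illustrative only; `generate_hypothesis` and `run_experiment` are hypothetical stand-ins for RD-Agent's LLM-driven components.

```python
# Illustrative control flow of the loop in the diagram above.

def generate_hypothesis(history):
    # In RD-Agent an LLM proposes a hypothesis; here we just number attempts.
    return len(history) + 1

def run_experiment(hypothesis):
    # Stand-in for design + feature engineering + model development +
    # validation; pretend each attempt improves the score.
    return 0.1 * hypothesis

def rd_loop(target_score=0.3, max_iters=10):
    history = []
    for _ in range(max_iters):
        hypothesis = generate_hypothesis(history)   # Hypothesis Generation
        score = run_experiment(hypothesis)          # Experiment -> Analysis
        history.append((hypothesis, score))         # Feedback Learning
        if score >= target_score:                   # Performance Satisfied?
            break
    return history

history = rd_loop()
print(len(history))  # prints: 3
```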
Core components:

- **Hypothesis Generator** - generates modeling hypotheses based on data features and the problem type
- **Experiment Manager** - systematically manages the experiment workflow and resources
- **Feature Engineer** - automatically generates and selects effective features
- **Model Developer** - automatically selects and tunes models
- **Evaluation System** - assesses model performance along multiple dimensions
- **Knowledge Manager** - accumulates and reuses experiment experience
#### Integration with Other Tools

Supported data formats, sources, and model frameworks:

```text
Data Sources:
├── CSV/Excel files
├── Kaggle competition data
├── Database connections (SQL)
├── API data sources
├── Image folders
└── Text document collections

Model Frameworks:
├── Scikit-learn
├── XGBoost/LightGBM
├── PyTorch
├── TensorFlow/Keras
├── Transformers (Hugging Face)
└── Custom models
```
### Kaggle Competition Automation

#### Kaggle API Configuration

First, configure the Kaggle API to access competition data:

```bash
# 1. Get a Kaggle API token:
#    log in to Kaggle -> Account -> Create New Token,
#    then download the kaggle.json file

# 2. Install the credential file
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

# 3. Verify the configuration
kaggle competitions list
```
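The three steps above can also be sanity-checked programmatically. `check_kaggle_credentials` is a hypothetical helper, not part of RD-Agent or the Kaggle CLI; it verifies that the credential file exists, is owner-only readable (the `chmod 600` from step 2), and contains the expected JSON keys.

```python
# Hypothetical sanity check for the Kaggle credential file.
import json
import os
import stat

def check_kaggle_credentials(path):
    if not os.path.exists(path):
        return "missing"
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode != 0o600:                      # must match chmod 600 above
        return "bad permissions"
    with open(path) as fh:
        creds = json.load(fh)
    if not {"username", "key"} <= set(creds):
        return "malformed"
    return "ok"

# Usage with a hypothetical throwaway file:
with open("demo_kaggle.json", "w") as fh:
    json.dump({"username": "demo", "key": "secret"}, fh)
os.chmod("demo_kaggle.json", 0o600)
print(check_kaggle_credentials("demo_kaggle.json"))  # prints: ok
```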
#### Environment Variable Configuration

Add data-science-related configuration to your `.env` file:

```bash
# Data science agent configuration
DS_LOCAL_DATA_PATH="./data/competitions"
DS_CODER_ON_WHOLE_PIPELINE=True
DS_IF_USING_MLE_DATA=True
DS_SAMPLE_DATA_BY_LLM=True
DS_SCEN=rdagent.scenarios.data_science.scen.KaggleScen

# LLM configuration (as covered earlier)
CHAT_MODEL=gpt-4o
EMBEDDING_MODEL=text-embedding-3-small
OPENAI_API_KEY=your-api-key
```
#### Automatic Competition Data Download

```bash
# Automatically download and process competition data
rdagent data_science --competition tabular-playground-series-dec-2021

# Specify a custom data path
export DS_LOCAL_DATA_PATH="./custom_data"
rdagent data_science --competition house-prices-advanced-regression-techniques
```
#### Complete Competition Workflow Demonstration

Using "Tabular Playground Series - Dec 2021" as an example:

**Step 1: Environment Setup**

```bash
# Create a project directory
mkdir kaggle_rdagent_demo && cd kaggle_rdagent_demo

# Configure the environment
cat > .env << EOF
CHAT_MODEL=gpt-4o
EMBEDDING_MODEL=text-embedding-3-small
OPENAI_API_KEY=your-openai-api-key
DS_LOCAL_DATA_PATH="$(pwd)/data"
DS_CODER_ON_WHOLE_PIPELINE=True
DS_IF_USING_MLE_DATA=True
DS_SAMPLE_DATA_BY_LLM=True
DS_SCEN=rdagent.scenarios.data_science.scen.KaggleScen
EOF
```

**Step 2: Launch the Automation Workflow**

```bash
# Start the RD-Agent data science agent
rdagent data_science --competition tabular-playground-series-dec-2021
```

**Step 3: Real-time Monitoring**

```bash
# Start the monitoring interface in another terminal
rdagent ui --port 19899 --data_science True

# Then visit http://localhost:19899
```
Automatically executed workflow:

1. **Data Download and Exploration**
   - Automatically download competition data
   - Generate a data exploration report
   - Identify data types and distributions
2. **Hypothesis Generation**
   - Generate modeling hypotheses based on data features
   - Analyze target variable characteristics
   - Identify potential feature engineering opportunities
3. **Feature Engineering**
   - Automatic numerical feature transformations
   - Categorical feature encoding
   - Interaction feature creation
   - Time feature extraction (if applicable)
4. **Model Development**
   - Try multiple algorithms (random forest, XGBoost, neural networks, etc.)
   - Automatic hyperparameter tuning
   - Cross-validation and model selection
5. **Model Ensembling**
   - Stacking/blending strategies
   - Multi-model fusion
   - Weight optimization
6. **Submission Preparation**
   - Generate prediction files
   - Validate the submission format
   - Submit automatically (optional)
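The weight optimization in the ensembling step can be sketched as a simple grid search over a blend weight. All names and data below are hypothetical; this is not RD-Agent's actual ensembling code.

```python
# Sketch of blending: search a weight w so that w*pred_a + (1-w)*pred_b
# minimizes squared validation error.

def blend_weight(pred_a, pred_b, y_true, steps=100):
    best_w, best_err = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        err = sum((w * a + (1 - w) * b - y) ** 2
                  for a, b, y in zip(pred_a, pred_b, y_true))
        if err < best_err:
            best_w, best_err = w, err
    return best_w

pred_a = [0.2, 0.8, 0.4]   # model A validation predictions (hypothetical)
pred_b = [0.4, 0.6, 0.6]   # model B validation predictions (hypothetical)
y_true = [0.3, 0.7, 0.5]   # validation targets

print(blend_weight(pred_a, pred_b, y_true))  # prints: 0.5
```

Here the targets happen to be the exact average of the two models' predictions, so the search lands on an equal 0.5/0.5 blend.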
### Medical Prediction Model Scenario

#### Medical Data Processing

Medical data typically has special characteristics, and RD-Agent provides specialized processing capabilities:

```bash
# Run a medical prediction task
rdagent data_science --competition medical-prediction-task

# Example: acute kidney failure prediction
wget https://github.com/SunsetWolf/rdagent_resource/releases/download/ds_data/arf-12-hours-prediction-task.zip
unzip arf-12-hours-prediction-task.zip -d ./data/
```
Medical data characteristics and how they are handled:

1. **Privacy Protection**
   - Data anonymization and de-identification
   - HIPAA compliance considerations
   - Local processing; no raw data is uploaded
2. **Imbalanced Data Handling**
   - SMOTE oversampling
   - Cost-sensitive learning
   - Threshold tuning
3. **Time Series Medical Data**
   - Disease course time modeling
   - Multi-timepoint features
   - Survival analysis methods
4. **Multimodal Data Fusion**
   - Structured data (lab indicators)
   - Unstructured data (medical record text)
   - Medical image data
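Threshold tuning, one of the imbalance techniques listed above, can be sketched in plain Python: instead of the default 0.5 cutoff, pick the probability threshold that maximizes F1 on validation scores. The data and helper names below are hypothetical.

```python
# Sketch of threshold tuning for an imbalanced binary classifier.

def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, scores, candidates):
    # Binarize the scores at each candidate cutoff and keep the best F1.
    return max(candidates,
               key=lambda th: f1_score(y_true, [int(s >= th) for s in scores]))

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # only 20% positives (hypothetical)
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.35, 0.45]

print(best_threshold(y_true, scores, [0.3, 0.35, 0.5]))  # prints: 0.35
```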
#### Prediction Model Development

```python
# Example of special configuration for medical prediction
medical_config = {
    "task_type": "classification",
    "positive_class_weight": 10,       # Handle class imbalance
    "cross_validation_folds": 10,      # More folds for a stabler estimate
    "feature_selection": "medical_relevance",
    "interpretability": True,          # Medical use cases require interpretability
    "ensemble_methods": ["voting", "stacking"],
    "evaluation_metrics": ["auc", "precision", "recall", "f1"],
}
```
#### Model Evaluation and Optimization

Medical model evaluation focuses more on:

- **Clinical metrics**: sensitivity, specificity, PPV, NPV
- **ROC/PR curves**: performance at different thresholds
- **Cost-benefit analysis**: the cost of misdiagnosis versus missed diagnosis
- **Interpretability**: SHAP values and feature importance
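The clinical metrics listed above follow directly from a confusion matrix. A small sketch with hypothetical counts:

```python
# Compute sensitivity, specificity, PPV, and NPV from confusion-matrix counts.

def clinical_metrics(tp, fp, fn, tn):
    return {
        "sensitivity": tp / (tp + fn),  # recall on the positive (sick) class
        "specificity": tn / (tn + fp),  # recall on the negative (healthy) class
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Hypothetical counts: 100 true positives cases, 200 negatives screened.
m = clinical_metrics(tp=80, fp=20, fn=20, tn=180)
print(m["sensitivity"])  # prints: 0.8
print(m["specificity"])  # prints: 0.9
```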
### General Data Science Tasks

#### Custom Dataset Processing

For non-competition data, RD-Agent supports flexible data integration:

```bash
# Create the data directory structure
mkdir -p custom_project/data
cd custom_project

# Prepare data files:
#   train.csv             - training data
#   test.csv              - test data (optional)
#   sample_submission.csv - submission format (optional)

# Configure the environment
# (DS_IF_USING_MLE_DATA=False marks this as non-MLE data)
cat > .env << EOF
DS_LOCAL_DATA_PATH="$(pwd)/data"
DS_CODER_ON_WHOLE_PIPELINE=True
DS_IF_USING_MLE_DATA=False
DS_SAMPLE_DATA_BY_LLM=False
DS_SCEN=rdagent.scenarios.data_science.scen.DataScienceScen
EOF

# Run the analysis
rdagent data_science --competition custom_project
```
#### Data Format Requirements

Standard tabular data:

```text
data/
├── train.csv              # Required: training data
├── test.csv               # Optional: test data
├── sample_submission.csv  # Optional: submission format
└── description.md         # Optional: problem description
```

Multimodal data:

```text
data/
├── tabular/
│   ├── train.csv
│   └── test.csv
├── images/
│   ├── train/
│   └── test/
├── text/
│   ├── train_texts.json
│   └── test_texts.json
└── config.yaml            # Data configuration file
```
#### Model Selection and Tuning

RD-Agent intelligently selects the most suitable algorithms.

Classification task algorithm selection:

```python
classification_algorithms = {
    "small_data": ["RandomForest", "SVM", "LogisticRegression"],
    "medium_data": ["XGBoost", "LightGBM", "CatBoost"],
    "large_data": ["NeuralNetwork", "TabNet", "AutoML"],
    "text_data": ["BERT", "RoBERTa", "DistilBERT"],
    "image_data": ["ResNet", "EfficientNet", "ViT"],
}
```
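A hypothetical dispatcher built on a table like this might look as follows; the row-count cutoffs are illustrative assumptions, not RD-Agent's actual selection rules.

```python
# Hypothetical size-based algorithm selection for tabular classification.
classification_algorithms = {
    "small_data": ["RandomForest", "SVM", "LogisticRegression"],
    "medium_data": ["XGBoost", "LightGBM", "CatBoost"],
    "large_data": ["NeuralNetwork", "TabNet", "AutoML"],
}

def select_algorithms(n_rows):
    if n_rows < 10_000:        # assumed cutoff for "small"
        return classification_algorithms["small_data"]
    if n_rows < 1_000_000:     # assumed cutoff for "medium"
        return classification_algorithms["medium_data"]
    return classification_algorithms["large_data"]

print(select_algorithms(5_000))    # prints: ['RandomForest', 'SVM', 'LogisticRegression']
print(select_algorithms(250_000))  # prints: ['XGBoost', 'LightGBM', 'CatBoost']
```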
Hyperparameter tuning strategies:

```python
tuning_strategies = {
    "bayesian_optimization": "Efficient parameter space search",
    "random_search": "Fast preliminary tuning",
    "grid_search": "Precise but time-consuming search",
    "evolutionary": "Global optimization for complex spaces",
    "hyperband": "Multi-fidelity optimization",
}
```
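Of these strategies, random search is simple enough to sketch in full. The parameter space and objective below are hypothetical stand-ins for a real validation score.

```python
# Sketch of random search: sample random parameter combinations and keep
# the best-scoring one.
import random

def random_search(param_space, objective, n_trials=50, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(choices)
                  for name, choices in param_space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

space = {"max_depth": [3, 5, 7], "learning_rate": [0.01, 0.1, 0.3]}

# Hypothetical objective: pretend depth 5 with learning rate 0.1 validates best.
def objective(p):
    return -abs(p["max_depth"] - 5) - abs(p["learning_rate"] - 0.1)

best, score = random_search(space, objective)
print(best, score)
```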
#### Automated Feature Engineering

Numerical feature processing:

```python
numerical_transformations = [
    "StandardScaler",       # Standardization
    "MinMaxScaler",         # Normalization
    "RobustScaler",         # Robust scaling
    "PowerTransformer",     # Power transform
    "QuantileTransformer",  # Quantile transform
    "PCA",                  # Principal component analysis
    "PolynomialFeatures",   # Polynomial features
]
```
Categorical feature processing:

```python
categorical_transformations = [
    "OneHotEncoder",     # One-hot encoding
    "LabelEncoder",      # Label encoding
    "TargetEncoder",     # Target encoding
    "BinaryEncoder",     # Binary encoding
    "HashingEncoder",    # Hash encoding
    "FrequencyEncoder",  # Frequency encoding
]
```
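As a concrete example of one of these transforms, frequency encoding replaces each category with how often it occurs in the column. This is a sketch, not RD-Agent's actual encoder; the data is hypothetical.

```python
# Sketch of frequency encoding for a categorical column.
from collections import Counter

def frequency_encode(values):
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]

colors = ["red", "blue", "red", "green", "red", "blue"]
# red appears 3/6 of the time, blue 2/6, green 1/6
print(frequency_encode(colors))
```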
Automatic feature generation:

```python
auto_feature_generation = [
    "feature_interactions",  # Feature interactions
    "feature_aggregations",  # Feature aggregations
    "feature_ratios",        # Feature ratios
    "feature_differences",   # Feature differences
    "time_based_features",   # Time-based features
    "text_embeddings",       # Text embeddings
]
```
## Summary

This chapter provided detailed coverage of RD-Agent's core functions, including:

- **Command Line Interface Basics** - unified CLI design and monitoring capabilities
- **Data Science Agent Details** - a complete automated machine learning workflow
- **Practical Application Scenarios** - Kaggle competitions, medical prediction, general modeling

The next chapter will delve into the advanced features of the quantitative finance agents and the general model agent.

Quick reference commands:

```bash
# Start a data science project
rdagent data_science --competition <project_name>

# Start the monitoring interface
rdagent ui --port 19899 --data_science True

# Health check
rdagent health_check

# View help
rdagent data_science --help
```