DEV Community

Henry Lin

# RD-Agent Tutorial - Chapter 2: Core Functions

2.1 Command Line Interface Basics

CLI Structure and Design Principles

RD-Agent exposes all of its functionality through a single command line entry point, the rdagent command. The CLI is implemented in rdagent/app/cli.py and built on the Typer framework, with each scenario registered as a subcommand.

Architecture Analysis

# rdagent/app/cli.py core structure (abridged)
import typer
from rdagent.app.data_science.loop import main as data_science
from rdagent.app.qlib_rd_loop.factor import main as fin_factor
from rdagent.app.qlib_rd_loop.model import main as fin_model
from rdagent.app.qlib_rd_loop.quant import main as fin_quant

app = typer.Typer()

# Register each entry point as a subcommand of `rdagent`
app.command(name="fin_factor")(fin_factor)
app.command(name="fin_model")(fin_model)
app.command(name="fin_quant")(fin_quant)
app.command(name="data_science")(data_science)
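The registration pattern above maps a command name to a plain Python function. The sketch below imitates that idea with the standard library's argparse instead of Typer, so it runs without extra dependencies. The two functions and the rdagent-sketch program name are hypothetical stand-ins, not RD-Agent code.

```python
import argparse
from typing import Callable, Dict, List, Optional


# Hypothetical stand-ins for the real entry points (fin_factor, data_science).
def fin_factor() -> str:
    return "running factor mining"


def data_science(competition: Optional[str] = None) -> str:
    return f"running data science on {competition}"


def build_parser(commands: Dict[str, Callable[..., str]]) -> argparse.ArgumentParser:
    """Create one subcommand per registered function."""
    parser = argparse.ArgumentParser(prog="rdagent-sketch")
    subparsers = parser.add_subparsers(dest="command", required=True)
    for name, func in commands.items():
        sub = subparsers.add_parser(name)
        sub.set_defaults(func=func)
        # Typer would derive options from the function signature;
        # here the single option is declared by hand.
        if name == "data_science":
            sub.add_argument("--competition", default=None)
    return parser


def run(argv: List[str]) -> str:
    """Parse argv and dispatch to the matching function."""
    parser = build_parser({"fin_factor": fin_factor, "data_science": data_science})
    args = vars(parser.parse_args(argv))
    func = args.pop("func")
    args.pop("command")
    return func(**args)
```

Calling run(["data_science", "--competition", "demo"]) dispatches to the data_science stand-in with competition="demo", which mirrors how a subcommand name selects an entry point.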

Command Classification and Organization

RD-Agent commands are organized by functional domain:

| Command Category | Command | Function Description | Primary Use |
|---|---|---|---|
| Financial Quant | fin_factor | Automatic factor mining | Discover effective quantitative factors |
| Financial Quant | fin_model | Automatic model evolution | Optimize prediction models |
| Financial Quant | fin_quant | Factor-model joint optimization | End-to-end strategy development |
| Financial Quant | fin_factor_report | Report factor extraction | Extract signals from financial reports |
| Data Science | data_science | General data science tasks | ML competitions, modeling projects |
| General Model | general_model | Paper model implementation | Research paper reproduction |
| Utility Functions | ui | Web interface | Visualization and monitoring |
| Utility Functions | health_check | Health check | System status verification |
| Utility Functions | collect_info | Information collection | Diagnostics and debugging |

Automatic Environment Loading Mechanism

An important feature of the RD-Agent CLI is automatic environment variable loading:

from dotenv import load_dotenv
load_dotenv(".env")  # Automatically load .env file from current directory
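To make the loading behavior concrete, here is a minimal standard-library sketch of what reading a .env file involves. It is not python-dotenv's implementation, and parse_env_text and load_env_file are illustrative names. Like load_dotenv with its default settings, it leaves variables that are already set in the environment untouched.

```python
import os
from pathlib import Path
from typing import Dict


def parse_env_text(text: str) -> Dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blanks and # comments."""
    values: Dict[str, str] = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        value = value.strip()
        # Drop one pair of matching surrounding quotes, if present.
        if len(value) >= 2 and value[0] == value[-1] and value[0] in "\"'":
            value = value[1:-1]
        values[key.strip()] = value
    return values


def load_env_file(path: str = ".env", override: bool = False) -> Dict[str, str]:
    """Copy parsed values into os.environ; existing variables win unless override=True."""
    env_path = Path(path)
    if not env_path.is_file():
        return {}
    values = parse_env_text(env_path.read_text(encoding="utf-8"))
    for key, value in values.items():
        if override or key not in os.environ:
            os.environ[key] = value
    return values
```

Because existing variables take precedence, a value exported in the shell overrides the same key in the project's .env file.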

Advantages:

  • 🔄 Automatic configuration loading, no manual environment variable setup needed
  • 📁 Supports project-level configuration, different projects can have different configs
  • 🔒 Configuration files stored locally, improved security

Basic Command Usage

Help System Usage

# View all available commands
rdagent --help

# View detailed help for specific commands
rdagent fin_factor --help
rdagent data_science --help

# View subcommand parameters
rdagent ui --help

Example Output:

Usage: rdagent [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  data_science   Run data science automation
  fin_factor     Run factor mining
  fin_model      Run model evolution
  fin_quant      Run factor-model joint optimization
  general_model  Extract and implement models from papers
  health_check   Check system health
  ui             Start web interface

Global Parameters and Options

While most of RD-Agent's configuration is managed through .env files, certain commands support runtime parameters:

# Common parameters for UI command
rdagent ui --port 19899 --log-dir ./logs --debug

# Selective checks for health check
rdagent health_check --no-check-env --no-check-docker

# Competition specification for data science command
rdagent data_science --competition tabular-playground-series-dec-2021

Logging and Output Management

RD-Agent provides comprehensive logging capabilities:

# Run with default log level
rdagent fin_factor

# Enable verbose logging (via environment variable)
export RDAGENT_LOG_LEVEL=DEBUG
rdagent fin_factor

# Specify log output directory
export RDAGENT_LOG_DIR=./custom_logs
rdagent fin_quant

Log Structure:

logs/
├── rdagent.log                 # Main log file
├── experiments/                # Experiment logs
│   ├── factor_exp_001/
│   └── model_exp_001/
└── ui/                         # Web UI logs
    └── access.log

Monitoring Interface Usage

Web UI Startup and Configuration

RD-Agent provides a powerful web interface for real-time monitoring and result viewing:

# Basic startup
rdagent ui --port 19899

# Specify log directory
rdagent ui --port 19899 --log-dir ./logs

# Enable debug mode
rdagent ui --port 19899 --debug

# Data science specific interface
rdagent ui --port 19899 --data_science True

Real-time Monitoring Features

The Web UI provides the following monitoring capabilities:

1. Experiment Progress Tracking

  • 📊 Real-time experiment status display
  • 📈 Performance metric charts
  • ⏱️ Timeline view
  • 🔄 Auto-refresh

2. Log Viewer

  • 📝 Structured log display
  • 🔍 Log search and filtering
  • 📋 Multi-level logs (INFO, DEBUG, ERROR)
  • 💾 Log export functionality

3. Result Visualization

  • 📊 Factor performance comparison
  • 📈 Model training curves
  • 🎯 Backtest result charts
  • 📋 Detailed performance reports

Interface Function Details

Main Dashboard:

┌──────────────────────────────────────────┐
│ RD-Agent Monitoring Panel                │
├──────────────────────────────────────────┤
│ 🔄 Experiment Status: Running (2/5)      │
│ ⏱️ Runtime: 1h 23m                       │
│ 📊 Current Task: Factor Validation       │
│ 🎯 Best Score: 0.234 (IC)                │
└──────────────────────────────────────────┘

Experiment List:

Experiment ID | Type   | Status   | Start Time | Score
--------------|--------|----------|------------|-------
EXP_001       | Factor | Complete | 14:23:01   | 0.198
EXP_002       | Factor | Running  | 14:45:12   | -
EXP_003       | Model  | Waiting  | -          | -

2.2 Data Science Agent Details

Data Science Scenario Overview

RD-Agent's data science agent is an automated machine learning engineering system that covers the workflow from data exploration through feature engineering and model development to a final model output. The project reports top-ranked results for this agent on the MLE-bench benchmark.

Supported Task Types

| Task Type | Description | Application Scenarios | Technical Features |
|---|---|---|---|
| Tabular Data Modeling | Prediction tasks for structured data | Financial prediction, user behavior analysis | Feature engineering, model ensembling |
| Time Series Forecasting | Predictive modeling for time series data | Stock price prediction, demand forecasting | Time series features, sequence models |
| Image Classification | Computer vision tasks | Medical imaging, product recognition | CNN, transfer learning |
| Natural Language Processing | Text data processing | Sentiment analysis, document classification | Pretrained models, embeddings |
| Regression Analysis | Continuous value prediction | Price prediction, rating estimation | Linear/non-linear models |

Workflow and Architecture

graph TD
    A[Data Input] --> B[Hypothesis Generation]
    B --> C[Experiment Design]
    C --> D[Feature Engineering]
    D --> E[Model Development]
    E --> F[Validation Testing]
    F --> G[Result Analysis]
    G --> H{Performance Satisfied?}
    H -->|No| I[Feedback Learning]
    I --> B
    H -->|Yes| J[Model Output]

    style B fill:#e1f5fe
    style D fill:#f3e5f5
    style E fill:#e8f5e8
    style G fill:#fff3e0

Core Components:

  1. Hypothesis Generator - Generate modeling hypotheses based on data features and problem type
  2. Experiment Manager - Systematically manage experiment workflow and resources
  3. Feature Engineer - Automatically generate and select effective features
  4. Model Developer - Automatically select and tune models
  5. Evaluation System - Multi-dimensional assessment of model performance
  6. Knowledge Manager - Accumulate and reuse experiment experience
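The feedback loop in the workflow diagram can be sketched in a few lines. This is a toy illustration under stated assumptions, not RD-Agent's internal code: run_loop, toy_propose, and toy_evaluate are hypothetical names, and the "experiment" is a stub that returns a made-up score.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Tuple

History = List[Tuple[str, float]]


@dataclass
class LoopResult:
    best_score: float
    best_hypothesis: str
    history: History = field(default_factory=list)


def run_loop(
    propose: Callable[[History], str],
    evaluate: Callable[[str], float],
    target_score: float,
    max_iterations: int = 10,
) -> LoopResult:
    """Propose, evaluate, record feedback, and stop once the target is reached."""
    history: History = []
    best_score = float("-inf")
    best_hypothesis = ""
    for _ in range(max_iterations):
        hypothesis = propose(history)          # hypothesis generation uses past feedback
        score = evaluate(hypothesis)           # experiment + validation, stubbed here
        history.append((hypothesis, score))    # knowledge accumulated for the next round
        if score > best_score:
            best_score, best_hypothesis = score, hypothesis
        if best_score >= target_score:         # "Performance Satisfied?" branch
            break
    return LoopResult(best_score, best_hypothesis, history)


def toy_propose(history: History) -> str:
    return f"add_feature_{len(history)}"


def toy_evaluate(hypothesis: str) -> float:
    # Made-up scoring rule: each later hypothesis scores 10 points higher.
    return 50.0 + 10.0 * int(hypothesis.rsplit("_", 1)[1])
```

With toy_propose and toy_evaluate and a target of 80.0, the loop stops after four iterations, since the fourth hypothesis is the first to reach the target.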

Integration with Other Tools

# Supported data formats and sources
Data Sources:
├── CSV/Excel files
├── Kaggle competition data
├── Database connections (SQL)
├── API data sources
├── Image folders
└── Text document collections

Model Frameworks:
├── Scikit-learn
├── XGBoost/LightGBM
├── PyTorch
├── TensorFlow/Keras
├── Transformers (Hugging Face)
└── Custom models

Kaggle Competition Automation

Kaggle API Configuration

First, configure the Kaggle API to access competition data:

# 1. Get a Kaggle API token
# Log in to Kaggle -> Account -> Create New Token
# Download the kaggle.json file

# 2. Configure API file
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

# 3. Verify configuration
kaggle competitions list
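Before launching a long run, it can help to check the credential file from Python as well. The sketch below assumes the usual kaggle.json layout with username and key fields and the 600 permissions set above; validate_credentials and check_kaggle_file are hypothetical helper names, not part of the Kaggle client.

```python
import json
import os
import stat
from pathlib import Path
from typing import Any, List, Optional


def validate_credentials(data: Any) -> List[str]:
    """Return a list of problems found in the parsed kaggle.json content."""
    if not isinstance(data, dict):
        return ["top-level JSON value is not an object"]
    return [f"missing field: {name}" for name in ("username", "key") if not data.get(name)]


def check_kaggle_file(path: Optional[Path] = None) -> List[str]:
    """Check existence, permissions (POSIX only), and content of the token file."""
    if path is None:
        path = Path.home() / ".kaggle" / "kaggle.json"
    if not path.is_file():
        return [f"{path} does not exist"]
    problems: List[str] = []
    mode = stat.S_IMODE(path.stat().st_mode)
    if os.name == "posix" and mode & 0o077:
        # Any group/other permission bit means the file is readable beyond the owner.
        problems.append(f"permissions are {oct(mode)}, expected 0o600")
    try:
        data = json.loads(path.read_text(encoding="utf-8"))
    except json.JSONDecodeError:
        return problems + ["file is not valid JSON"]
    return problems + validate_credentials(data)
```

An empty list from check_kaggle_file() means the file passed these basic checks; it does not confirm that the token is still valid on Kaggle's side.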

Environment Variable Configuration

Add data science related configuration to your .env file:

# Data science agent configuration
DS_LOCAL_DATA_PATH="./data/competitions"
DS_CODER_ON_WHOLE_PIPELINE=True
DS_IF_USING_MLE_DATA=True
DS_SAMPLE_DATA_BY_LLM=True
DS_SCEN=rdagent.scenarios.data_science.scen.KaggleScen

# LLM configuration (as mentioned earlier)
CHAT_MODEL=gpt-4o
EMBEDDING_MODEL=text-embedding-3-small
OPENAI_API_KEY=your-api-key

Automatic Competition Data Download

# Automatically download and process competition data
rdagent data_science --competition tabular-playground-series-dec-2021

# Specify custom data path
export DS_LOCAL_DATA_PATH="./custom_data"
rdagent data_science --competition house-prices-advanced-regression-techniques

Complete Competition Workflow Demonstration

Using "Tabular Playground Series - Dec 2021" as an example:

Step 1: Environment Setup

# Create project directory
mkdir kaggle_rdagent_demo && cd kaggle_rdagent_demo

# Configure environment
cat > .env << EOF
CHAT_MODEL=gpt-4o
EMBEDDING_MODEL=text-embedding-3-small
OPENAI_API_KEY=your-openai-api-key

DS_LOCAL_DATA_PATH="$(pwd)/data"
DS_CODER_ON_WHOLE_PIPELINE=True
DS_IF_USING_MLE_DATA=True
DS_SAMPLE_DATA_BY_LLM=True
DS_SCEN=rdagent.scenarios.data_science.scen.KaggleScen
EOF

Step 2: Launch Automation Workflow

# Start RD-Agent data science agent
rdagent data_science --competition tabular-playground-series-dec-2021

Step 3: Real-time Monitoring

# Start monitoring interface in another terminal
rdagent ui --port 19899 --data_science True
# Visit http://localhost:19899

Automatically Executed Workflow:

  1. Data Download and Exploration

    • Automatically download competition data
    • Generate data exploration report
    • Identify data types and distributions
  2. Hypothesis Generation

    • Generate modeling hypotheses based on data features
    • Analyze target variable characteristics
    • Identify potential feature engineering opportunities
  3. Feature Engineering

    • Automatic numerical feature transformations
    • Categorical feature encoding
    • Interaction feature creation
    • Time feature extraction (if applicable)
  4. Model Development

    • Try multiple algorithms (random forest, XGBoost, neural networks, etc.)
    • Automatic hyperparameter tuning
    • Cross-validation and model selection
  5. Model Ensembling

    • Stacking/Blending strategies
    • Multi-model fusion
    • Weight optimization
  6. Submission Preparation

    • Generate prediction files
    • Format validation
    • Automatic submission (optional)
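As one concrete instance of the weight optimization mentioned under Model Ensembling, the sketch below blends two models' predictions and picks the blend weight with the lowest mean squared error by grid search. It is a pure-Python illustration with hypothetical function names, and it is not how RD-Agent generates its ensembles. In practice the weight would be chosen on held-out validation predictions.

```python
from typing import List, Sequence, Tuple


def mse(y_true: Sequence[float], y_pred: Sequence[float]) -> float:
    """Mean squared error between two equally long sequences."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


def blend(pred_a: Sequence[float], pred_b: Sequence[float], weight_a: float) -> List[float]:
    """Weighted average of two prediction vectors."""
    return [weight_a * a + (1.0 - weight_a) * b for a, b in zip(pred_a, pred_b)]


def best_blend_weight(
    y_true: Sequence[float],
    pred_a: Sequence[float],
    pred_b: Sequence[float],
    steps: int = 10,
) -> Tuple[float, float]:
    """Grid-search weight_a over 0, 1/steps, ..., 1 and return (weight, error)."""
    candidates = [i / steps for i in range(steps + 1)]
    scored = [(mse(y_true, blend(pred_a, pred_b, w)), w) for w in candidates]
    best_error, best_weight = min(scored)
    return best_weight, best_error
```

If one model overshoots by 1 and the other undershoots by 1 on every row, the search settles on equal weights, because the two errors cancel.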

Medical Prediction Model Scenario

Medical Data Processing

Medical data typically has special characteristics, and RD-Agent provides specialized processing capabilities:

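# Example: acute respiratory failure (ARF) prediction
# 1. Download and unpack the sample dataset
wget https://github.com/SunsetWolf/rdagent_resource/releases/download/ds_data/arf-12-hours-prediction-task.zip
unzip arf-12-hours-prediction-task.zip -d ./data/

# 2. Run the medical prediction task
# (the competition name should match the dataset folder under DS_LOCAL_DATA_PATH)
rdagent data_science --competition arf-12-hours-prediction-task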

Medical Data Characteristics Processing:

  1. Privacy Protection

    • Data anonymization and de-identification
    • HIPAA compliance considerations
    • Local processing, no raw data upload
  2. Imbalanced Data Handling

    • SMOTE oversampling
    • Cost-sensitive learning
    • Threshold tuning
  3. Time Series Medical Data

    • Disease course time modeling
    • Multi-timepoint features
    • Survival analysis methods
  4. Multimodal Data Fusion

    • Structured data (lab indicators)
    • Unstructured data (medical record text)
    • Medical image data
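Threshold tuning, listed above under imbalanced data handling, is easy to show in isolation. The sketch below scans the observed scores as candidate cut-offs and keeps the one with the highest F1. The function names are hypothetical, and a real run would tune the threshold on validation data, not on the test set.

```python
from typing import Sequence, Tuple


def f1_at_threshold(y_true: Sequence[int], scores: Sequence[float], threshold: float) -> float:
    """F1 score when every score >= threshold is predicted positive."""
    tp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 1)
    fp = sum(1 for y, s in zip(y_true, scores) if s >= threshold and y == 0)
    fn = sum(1 for y, s in zip(y_true, scores) if s < threshold and y == 1)
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def best_threshold(y_true: Sequence[int], scores: Sequence[float]) -> Tuple[float, float]:
    """Try each distinct score as a cut-off and return (threshold, f1) for the best one."""
    candidates = sorted(set(scores))
    best = max(candidates, key=lambda t: f1_at_threshold(y_true, scores, t))
    return best, f1_at_threshold(y_true, scores, best)
```

On imbalanced data the best cut-off is often well away from the default 0.5, which is why the threshold is worth treating as a tunable parameter.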

Prediction Model Development

# Example of special configuration for medical prediction
medical_config = {
    "task_type": "classification",
    "positive_class_weight": 10,  # Handle imbalance
    "cross_validation_folds": 10,  # More folds
    "feature_selection": "medical_relevance",
    "interpretability": True,  # Medical requires interpretability
    "ensemble_methods": ["voting", "stacking"],
    "evaluation_metrics": ["auc", "precision", "recall", "f1"],
}

Model Evaluation and Optimization

Medical model evaluation focuses more on:

  • 🎯 Clinical Metrics: Sensitivity, Specificity, PPV, NPV
  • 📊 ROC/PR Curves: Performance at different thresholds
  • ⚖️ Cost-Benefit Analysis: Cost of misdiagnosis and missed diagnosis
  • 🔍 Interpretability: SHAP values and feature importance
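The four clinical metrics named above all come from the confusion matrix. A minimal sketch, assuming binary labels where 1 marks the positive (disease) class; clinical_metrics is an illustrative helper name:

```python
from typing import Dict, Sequence


def clinical_metrics(y_true: Sequence[int], y_pred: Sequence[int]) -> Dict[str, float]:
    """Sensitivity, specificity, PPV, and NPV from binary labels and predictions."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    tn = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 0)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)

    def ratio(numerator: int, denominator: int) -> float:
        # Undefined ratios (empty denominator) are reported as NaN.
        return numerator / denominator if denominator else float("nan")

    return {
        "sensitivity": ratio(tp, tp + fn),  # share of true cases that are detected
        "specificity": ratio(tn, tn + fp),  # share of healthy cases correctly cleared
        "ppv": ratio(tp, tp + fp),          # chance a positive prediction is correct
        "npv": ratio(tn, tn + fn),          # chance a negative prediction is correct
    }
```

Sensitivity and specificity describe the model itself, while PPV and NPV also depend on how common the condition is in the evaluated population.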

General Data Science Tasks

Custom Dataset Processing

For non-competition data, RD-Agent supports flexible data integration:

# Create data directory structure
mkdir -p custom_project/data
cd custom_project

# Prepare data files
# train.csv - Training data
# test.csv - Test data (optional)
# sample_submission.csv - Submission format (optional)

# Configure environment
cat > .env << EOF
DS_LOCAL_DATA_PATH="$(pwd)/data"
DS_CODER_ON_WHOLE_PIPELINE=True
# Non-MLE data
DS_IF_USING_MLE_DATA=False
DS_SAMPLE_DATA_BY_LLM=False
DS_SCEN=rdagent.scenarios.data_science.scen.DataScienceScen
EOF

# Run analysis
rdagent data_science --competition custom_project

Data Format Requirements

Standard Tabular Data:

data/
├── train.csv              # Required: Training data
├── test.csv               # Optional: Test data
├── sample_submission.csv  # Optional: Submission format
└── description.md         # Optional: Problem description
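A quick pre-flight check of the tabular layout above can catch a missing train.csv before a run starts. The sketch uses only pathlib; the required and optional file names come from the listing, while check_layout and check_data_dir are hypothetical helpers and not RD-Agent functions.

```python
from pathlib import Path
from typing import Dict, Iterable, List

REQUIRED_FILES = ("train.csv",)
OPTIONAL_FILES = ("test.csv", "sample_submission.csv", "description.md")


def check_layout(present: Iterable[str]) -> Dict[str, List[str]]:
    """Compare a collection of file names against the expected layout."""
    names = set(present)
    return {
        "missing_required": [name for name in REQUIRED_FILES if name not in names],
        "missing_optional": [name for name in OPTIONAL_FILES if name not in names],
    }


def check_data_dir(data_dir: str) -> Dict[str, List[str]]:
    """List the files in data_dir and report which expected files are absent."""
    root = Path(data_dir)
    present = [p.name for p in root.iterdir() if p.is_file()] if root.is_dir() else []
    return check_layout(present)
```

A non-empty missing_required list is a reason to stop; entries under missing_optional are informational only.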

Multimodal Data:

data/
├── tabular/
│   ├── train.csv
│   └── test.csv
├── images/
│   ├── train/
│   └── test/
├── text/
│   ├── train_texts.json
│   └── test_texts.json
└── config.yaml       # Data configuration file

Model Selection and Tuning

RD-Agent intelligently selects the most suitable algorithms:

Classification Task Algorithm Selection:

classification_algorithms = {
    "small_data": ["RandomForest", "SVM", "LogisticRegression"],
    "medium_data": ["XGBoost", "LightGBM", "CatBoost"],
    "large_data": ["NeuralNetwork", "TabNet", "AutoML"],
    "text_data": ["BERT", "RoBERTa", "DistilBERT"],
    "image_data": ["ResNet", "EfficientNet", "ViT"],
}
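The mapping above only becomes a selection rule once "small", "medium", and "large" are given boundaries. The sketch below shows one way to do that. The row-count thresholds (10,000 and 1,000,000) and the function names are assumptions made for illustration; the source does not state how RD-Agent draws these lines.

```python
from typing import Dict, List

CANDIDATES: Dict[str, List[str]] = {
    "small_data": ["RandomForest", "SVM", "LogisticRegression"],
    "medium_data": ["XGBoost", "LightGBM", "CatBoost"],
    "large_data": ["NeuralNetwork", "TabNet", "AutoML"],
    "text_data": ["BERT", "RoBERTa", "DistilBERT"],
    "image_data": ["ResNet", "EfficientNet", "ViT"],
}


def bucket_for(
    n_rows: int,
    modality: str = "tabular",
    small_max: int = 10_000,       # assumed boundary, not from the source
    medium_max: int = 1_000_000,   # assumed boundary, not from the source
) -> str:
    """Pick a dictionary key from the data modality and the number of rows."""
    if modality in ("text", "image"):
        return f"{modality}_data"
    if n_rows <= small_max:
        return "small_data"
    if n_rows <= medium_max:
        return "medium_data"
    return "large_data"


def candidate_algorithms(n_rows: int, modality: str = "tabular") -> List[str]:
    """Return the candidate list for a dataset of the given size and modality."""
    return CANDIDATES[bucket_for(n_rows, modality)]
```

For example, candidate_algorithms(50_000) returns the gradient-boosting family under these assumed thresholds.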

Hyperparameter Tuning Strategies:

tuning_strategies = {
    "bayesian_optimization": "Efficient parameter space search",
    "random_search": "Fast preliminary tuning",
    "grid_search": "Precise but time-consuming search",
    "evolutionary": "Global optimization for complex spaces",
    "hyperband": "Multi-fidelity optimization",
}
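Of the strategies listed, random search is the simplest to write down. The sketch below samples parameter combinations from a discrete search space and keeps the best one. The search space and toy_objective are made up for the example; a real objective would train a model and return a validation score.

```python
import random
from typing import Any, Callable, Dict, List, Optional, Tuple

Params = Dict[str, Any]


def random_search(
    objective: Callable[[Params], float],
    space: Dict[str, List[Any]],
    n_trials: int,
    seed: int = 0,
) -> Tuple[Optional[Params], float]:
    """Sample n_trials random combinations and return the highest-scoring one."""
    rng = random.Random(seed)  # seeded so repeated runs sample the same trials
    best_params: Optional[Params] = None
    best_score = float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(choices) for name, choices in space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


def toy_objective(params: Params) -> float:
    # Made-up score that peaks at max_depth=6 and n_estimators=200.
    return -abs(params["max_depth"] - 6) - abs(params["n_estimators"] - 200) / 100
```

Random search does not guarantee that the optimum is visited, which is why it is described above as a fast preliminary pass.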

Automated Feature Engineering

Numerical Feature Processing:

numerical_transformations = [
    "StandardScaler",      # Standardization
    "MinMaxScaler",        # Normalization
    "RobustScaler",        # Robust scaling
    "PowerTransformer",    # Power transform
    "QuantileTransformer", # Quantile transform
    "PCA",                 # Principal component analysis
    "PolynomialFeatures",  # Polynomial features
]
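The first two entries in the list are simple enough to compute by hand. The sketch below reimplements standardization and min-max normalization with the standard library only, as an illustration of what those transforms do and not as a replacement for the scikit-learn classes named above.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def standardize(values: Sequence[float]) -> List[float]:
    """Zero mean, unit variance, using the population standard deviation."""
    mean = fmean(values)
    std = pstdev(values)
    if std == 0:
        return [0.0 for _ in values]  # constant column: nothing to scale
    return [(v - mean) / std for v in values]


def min_max_scale(values: Sequence[float]) -> List[float]:
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    low, high = min(values), max(values)
    if high == low:
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]
```

In a real pipeline the mean, standard deviation, minimum, and maximum would be computed on the training split only and then reused for validation and test data to avoid leakage.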

Categorical Feature Processing:

categorical_transformations = [
    "OneHotEncoder",       # One-hot encoding
    "LabelEncoder",        # Label encoding
    "TargetEncoder",       # Target encoding
    "BinaryEncoder",       # Binary encoding
    "HashingEncoder",      # Hash encoding
    "FrequencyEncoder",    # Frequency encoding
]
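Two of these encodings, one-hot and frequency encoding, can be illustrated without any library. The helper names below are hypothetical, and the sketch is meant to show the transformation itself, not the API of any particular encoder class.

```python
from collections import Counter
from typing import List, Sequence, Tuple


def one_hot(values: Sequence[str]) -> Tuple[List[str], List[List[int]]]:
    """Return the sorted category list and one indicator row per input value."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories] for value in values]
    return categories, rows


def frequency_encode(values: Sequence[str]) -> List[float]:
    """Replace each category with its relative frequency in the column."""
    counts = Counter(values)
    total = len(values)
    return [counts[value] / total for value in values]
```

One-hot encoding adds one column per category, so it suits low-cardinality features, while frequency encoding keeps a single numeric column regardless of how many categories exist.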

Automatic Feature Generation:

auto_feature_generation = [
    "feature_interactions",  # Feature interactions
    "feature_aggregations", # Feature aggregations
    "feature_ratios",       # Feature ratios
    "feature_differences",  # Feature differences
    "time_based_features",  # Time-based features
    "text_embeddings",      # Text embeddings
]
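Interactions, differences, and ratios from the list above can all be generated from pairs of numeric columns. The sketch below does this for a single row represented as a dictionary. The naming scheme (a_x_b, a_minus_b, a_div_b) and the zero fallback for near-zero denominators are choices made for this example only.

```python
from itertools import combinations
from typing import Dict


def pairwise_features(row: Dict[str, float], eps: float = 1e-9) -> Dict[str, float]:
    """Add product, difference, and ratio features for every pair of columns."""
    out = dict(row)
    for a, b in combinations(sorted(row), 2):
        out[f"{a}_x_{b}"] = row[a] * row[b]        # interaction
        out[f"{a}_minus_{b}"] = row[a] - row[b]    # difference
        # Ratio, with 0.0 as a fallback when the denominator is (near) zero.
        out[f"{a}_div_{b}"] = row[a] / row[b] if abs(row[b]) > eps else 0.0
    return out
```

The number of generated columns grows quadratically with the number of inputs, so generated features are normally followed by a selection step.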

Summary

This chapter provided detailed coverage of RD-Agent's core functions, including:

  1. Command Line Interface Basics - Unified CLI design and monitoring capabilities
  2. Data Science Agent Details - Complete automated machine learning workflow
  3. Practical Application Scenarios - Kaggle competitions, medical prediction, general modeling

The next chapter will delve into the advanced features of quantitative finance agents and general model agents.


Quick Reference Commands:

# Start data science project
rdagent data_science --competition <project_name>

# Start monitoring interface
rdagent ui --port 19899 --data_science True

# Health check
rdagent health_check

# View help
rdagent data_science --help