# RD-Agent Tutorial - Chapter 2: Core Functions
## 2.1 Command Line Interface Basics

### CLI Structure and Design Principles
RD-Agent adopts a unified command line interface (CLI): every function is accessed through the single `rdagent` command. The CLI core lives in `rdagent/app/cli.py` and is built on the Typer framework, providing an intuitive and powerful command line experience.
#### Architecture Analysis

```python
# rdagent/app/cli.py core structure
import typer

from rdagent.app.data_science.loop import main as data_science
from rdagent.app.qlib_rd_loop.factor import main as fin_factor
from rdagent.app.qlib_rd_loop.model import main as fin_model
from rdagent.app.qlib_rd_loop.quant import main as fin_quant

app = typer.Typer()

# Register commands
app.command(name="fin_factor")(fin_factor)
app.command(name="fin_model")(fin_model)
app.command(name="fin_quant")(fin_quant)
app.command(name="data_science")(data_science)
```
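The registration style above can be illustrated with a tiny stand-in (a sketch, not Typer's real internals): `app.command(name=...)` returns a decorator that records the function under that name, which is how `rdagent fin_factor` ends up dispatching to `fin_factor()`.

```python
# Minimal sketch of the decorator-registration pattern; MiniApp is a
# hypothetical stand-in for typer.Typer, used only to show the mechanism.

class MiniApp:
    def __init__(self):
        self._commands = {}

    def command(self, name):
        def register(fn):
            self._commands[name] = fn  # map CLI name -> handler function
            return fn
        return register

    def run(self, name, *args):
        return self._commands[name](*args)

app = MiniApp()

def fin_factor():
    return "factor mining started"

# Same registration style as the cli.py excerpt above:
app.command(name="fin_factor")(fin_factor)

print(app.run("fin_factor"))  # prints: factor mining started
```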
#### Command Classification and Organization

RD-Agent commands are organized by functional domain:

| Command Category | Command | Function Description | Primary Use |
|---|---|---|---|
| Financial Quant | `fin_factor` | Automatic factor mining | Discover effective quantitative factors |
| | `fin_model` | Automatic model evolution | Optimize prediction models |
| | `fin_quant` | Factor-model joint optimization | End-to-end strategy development |
| | `fin_factor_report` | Report factor extraction | Extract signals from financial reports |
| Data Science | `data_science` | General data science tasks | ML competitions, modeling projects |
| General Model | `general_model` | Paper model implementation | Research paper reproduction |
| Utility Functions | `ui` | Web interface | Visualization and monitoring |
| | `health_check` | Health check | System status verification |
| | `collect_info` | Information collection | Diagnostics and debugging |
#### Automatic Environment Loading Mechanism

An important feature of the RD-Agent CLI is automatic environment variable loading:

```python
from dotenv import load_dotenv

load_dotenv(".env")  # Automatically load the .env file from the current directory
```

Advantages:

- Automatic configuration loading, no manual environment variable setup needed
- Project-level configuration: different projects can keep different configs
- Configuration files stay on the local machine, which improves security
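What `load_dotenv(".env")` does can be sketched in a few lines of plain Python. python-dotenv itself handles many more edge cases; `load_env_file`, the `demo.env` file, and the `DEMO_*` variable names here are hypothetical illustrations.

```python
# Illustrative sketch of .env loading: parse KEY=VALUE lines and populate
# os.environ without overriding variables that are already set.
import os

def load_env_file(path):
    with open(path) as fh:
        for line in fh:
            line = line.strip()
            if not line or line.startswith("#") or "=" not in line:
                continue  # skip blanks, comments, and malformed lines
            key, _, value = line.partition("=")
            os.environ.setdefault(key.strip(), value.strip().strip('"'))

# Usage with a hypothetical demo file:
with open("demo.env", "w") as fh:
    fh.write('DEMO_CHAT_MODEL=gpt-4o\n# a comment\nDEMO_DATA_PATH="./data"\n')
load_env_file("demo.env")
print(os.environ["DEMO_CHAT_MODEL"])   # prints: gpt-4o
print(os.environ["DEMO_DATA_PATH"])    # prints: ./data
```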
### Basic Command Usage

#### Help System Usage

```bash
# View all available commands
rdagent --help

# View detailed help for specific commands
rdagent fin_factor --help
rdagent data_science --help

# View subcommand parameters
rdagent ui --help
```
Example output:

```text
Usage: rdagent [OPTIONS] COMMAND [ARGS]...

Options:
  --help  Show this message and exit.

Commands:
  data_science   Run data science automation
  fin_factor     Run factor mining
  fin_model      Run model evolution
  fin_quant      Run factor-model joint optimization
  general_model  Extract and implement models from papers
  health_check   Check system health
  ui             Start web interface
```
#### Global Parameters and Options

While most of RD-Agent's configuration is managed through `.env` files, certain commands accept runtime parameters:

```bash
# Common parameters for the UI command
rdagent ui --port 19899 --log-dir ./logs --debug

# Selective checks for the health check
rdagent health_check --no-check-env --no-check-docker

# Competition specification for the data science command
rdagent data_science --competition tabular-playground-series-dec-2021
```
#### Logging and Output Management

RD-Agent provides comprehensive logging capabilities:

```bash
# Run with the default log level
rdagent fin_factor

# Enable verbose logging (via environment variable)
export RDAGENT_LOG_LEVEL=DEBUG
rdagent fin_factor

# Specify the log output directory
export RDAGENT_LOG_DIR=./custom_logs
rdagent fin_quant
```

Log structure:

```text
logs/
├── rdagent.log        # Main log file
├── experiments/       # Experiment logs
│   ├── factor_exp_001/
│   └── model_exp_001/
└── ui/                # Web UI logs
    └── access.log
```
### Monitoring Interface Usage

#### Web UI Startup and Configuration

RD-Agent provides a powerful web interface for real-time monitoring and result viewing:

```bash
# Basic startup
rdagent ui --port 19899

# Specify the log directory
rdagent ui --port 19899 --log-dir ./logs

# Enable debug mode
rdagent ui --port 19899 --debug

# Data-science-specific interface
rdagent ui --port 19899 --data_science True
```
#### Real-time Monitoring Features

The Web UI provides the following monitoring capabilities:

1. **Experiment Progress Tracking**
   - Real-time experiment status display
   - Performance metric charts
   - Timeline view
   - Auto-refresh
2. **Log Viewer**
   - Structured log display
   - Log search and filtering
   - Multi-level logs (INFO, DEBUG, ERROR)
   - Log export functionality
3. **Result Visualization**
   - Factor performance comparison
   - Model training curves
   - Backtest result charts
   - Detailed performance reports
#### Interface Function Details

Main dashboard:

```text
┌──────────────────────────────────────┐
│ RD-Agent Monitoring Panel            │
├──────────────────────────────────────┤
│ Experiment Status: Running (2/5)     │
│ Runtime: 1h 23m                      │
│ Current Task: Factor Validation      │
│ Best Score: 0.234 (IC)               │
└──────────────────────────────────────┘
```

Experiment list:

| Experiment ID | Type | Status | Start Time | Score |
|---|---|---|---|---|
| EXP_001 | Factor | Complete | 14:23:01 | 0.198 |
| EXP_002 | Factor | Running | 14:45:12 | - |
| EXP_003 | Model | Waiting | - | - |
## 2.2 Data Science Agent Details

### Data Science Scenario Overview

RD-Agent's data science agent is a fully automated machine learning engineering system capable of autonomously completing the entire workflow from data exploration to model deployment. The agent achieved first place on the MLE-bench benchmark, demonstrating powerful automation capabilities.
#### Supported Task Types
| Task Type | Description | Application Scenarios | Technical Features |
|---|---|---|---|
| Tabular Data Modeling | Prediction tasks for structured data | Financial prediction, user behavior analysis | Feature engineering, model ensembling |
| Time Series Forecasting | Predictive modeling for time series data | Stock price prediction, demand forecasting | Time series features, sequence models |
| Image Classification | Computer vision tasks | Medical imaging, product recognition | CNN, transfer learning |
| Natural Language Processing | Text data processing | Sentiment analysis, document classification | Pretrained models, embeddings |
| Regression Analysis | Continuous value prediction | Price prediction, rating estimation | Linear/non-linear models |
#### Workflow and Architecture

```mermaid
graph TD
    A[Data Input] --> B[Hypothesis Generation]
    B --> C[Experiment Design]
    C --> D[Feature Engineering]
    D --> E[Model Development]
    E --> F[Validation Testing]
    F --> G[Result Analysis]
    G --> H{Performance Satisfied?}
    H -->|No| I[Feedback Learning]
    I --> B
    H -->|Yes| J[Model Output]

    style B fill:#e1f5fe
    style D fill:#f3e5f5
    style E fill:#e8f5e8
    style G fill:#fff3e0
```
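The diagram above boils down to a propose-experiment-evaluate loop that repeats until performance is satisfactory. The sketch below is illustrative only; `generate_hypothesis` and `run_experiment` are hypothetical stand-ins for RD-Agent's LLM-driven components.

```python
# Illustrative control flow of the loop in the diagram above.

def generate_hypothesis(history):
    # In RD-Agent an LLM proposes a hypothesis; here we just number attempts.
    return len(history) + 1

def run_experiment(hypothesis):
    # Stand-in for design + feature engineering + model development +
    # validation; pretend each attempt improves the score.
    return 0.1 * hypothesis

def rd_loop(target_score=0.3, max_iters=10):
    history = []
    for _ in range(max_iters):
        hypothesis = generate_hypothesis(history)   # Hypothesis Generation
        score = run_experiment(hypothesis)          # Experiment -> Analysis
        history.append((hypothesis, score))         # Feedback Learning
        if score >= target_score:                   # Performance Satisfied?
            break
    return history

history = rd_loop()
print(len(history))  # prints: 3
```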
Core components:

- **Hypothesis Generator** - generates modeling hypotheses based on data features and the problem type
- **Experiment Manager** - systematically manages the experiment workflow and resources
- **Feature Engineer** - automatically generates and selects effective features
- **Model Developer** - automatically selects and tunes models
- **Evaluation System** - assesses model performance along multiple dimensions
- **Knowledge Manager** - accumulates and reuses experiment experience
#### Integration with Other Tools

Supported data formats, sources, and model frameworks:

```text
Data Sources:
├── CSV/Excel files
├── Kaggle competition data
├── Database connections (SQL)
├── API data sources
├── Image folders
└── Text document collections

Model Frameworks:
├── Scikit-learn
├── XGBoost/LightGBM
├── PyTorch
├── TensorFlow/Keras
├── Transformers (Hugging Face)
└── Custom models
```
### Kaggle Competition Automation

#### Kaggle API Configuration

First, configure the Kaggle API to access competition data:

```bash
# 1. Get a Kaggle API token:
#    log in to Kaggle -> Account -> Create New Token,
#    then download the kaggle.json file

# 2. Install the credential file
mkdir -p ~/.kaggle
mv ~/Downloads/kaggle.json ~/.kaggle/
chmod 600 ~/.kaggle/kaggle.json

# 3. Verify the configuration
kaggle competitions list
```
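The three steps above can also be sanity-checked programmatically. `check_kaggle_credentials` is a hypothetical helper, not part of RD-Agent or the Kaggle CLI; it verifies that the credential file exists, is owner-only readable (the `chmod 600` from step 2), and contains the expected JSON keys.

```python
# Hypothetical sanity check for the Kaggle credential file.
import json
import os
import stat

def check_kaggle_credentials(path):
    if not os.path.exists(path):
        return "missing"
    mode = stat.S_IMODE(os.stat(path).st_mode)
    if mode != 0o600:                      # must match chmod 600 above
        return "bad permissions"
    with open(path) as fh:
        creds = json.load(fh)
    if not {"username", "key"} <= set(creds):
        return "malformed"
    return "ok"

# Usage with a hypothetical throwaway file:
with open("demo_kaggle.json", "w") as fh:
    json.dump({"username": "demo", "key": "secret"}, fh)
os.chmod("demo_kaggle.json", 0o600)
print(check_kaggle_credentials("demo_kaggle.json"))  # prints: ok
```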
#### Environment Variable Configuration

Add data-science-related configuration to your `.env` file:

```bash
# Data science agent configuration
DS_LOCAL_DATA_PATH="./data/competitions"
DS_CODER_ON_WHOLE_PIPELINE=True
DS_IF_USING_MLE_DATA=True
DS_SAMPLE_DATA_BY_LLM=True
DS_SCEN=rdagent.scenarios.data_science.scen.KaggleScen

# LLM configuration (as covered earlier)
CHAT_MODEL=gpt-4o
EMBEDDING_MODEL=text-embedding-3-small
OPENAI_API_KEY=your-api-key
```
#### Automatic Competition Data Download

```bash
# Automatically download and process competition data
rdagent data_science --competition tabular-playground-series-dec-2021

# Specify a custom data path
export DS_LOCAL_DATA_PATH="./custom_data"
rdagent data_science --competition house-prices-advanced-regression-techniques
```
#### Complete Competition Workflow Demonstration

Using "Tabular Playground Series - Dec 2021" as an example:

**Step 1: Environment Setup**

```bash
# Create a project directory
mkdir kaggle_rdagent_demo && cd kaggle_rdagent_demo

# Configure the environment
cat > .env << EOF
CHAT_MODEL=gpt-4o
EMBEDDING_MODEL=text-embedding-3-small
OPENAI_API_KEY=your-openai-api-key
DS_LOCAL_DATA_PATH="$(pwd)/data"
DS_CODER_ON_WHOLE_PIPELINE=True
DS_IF_USING_MLE_DATA=True
DS_SAMPLE_DATA_BY_LLM=True
DS_SCEN=rdagent.scenarios.data_science.scen.KaggleScen
EOF
```

**Step 2: Launch the Automation Workflow**

```bash
# Start the RD-Agent data science agent
rdagent data_science --competition tabular-playground-series-dec-2021
```

**Step 3: Real-time Monitoring**

```bash
# Start the monitoring interface in another terminal
rdagent ui --port 19899 --data_science True

# Then visit http://localhost:19899
```
Automatically executed workflow:

1. **Data Download and Exploration**
   - Automatically download competition data
   - Generate a data exploration report
   - Identify data types and distributions
2. **Hypothesis Generation**
   - Generate modeling hypotheses based on data features
   - Analyze target variable characteristics
   - Identify potential feature engineering opportunities
3. **Feature Engineering**
   - Automatic numerical feature transformations
   - Categorical feature encoding
   - Interaction feature creation
   - Time feature extraction (if applicable)
4. **Model Development**
   - Try multiple algorithms (random forest, XGBoost, neural networks, etc.)
   - Automatic hyperparameter tuning
   - Cross-validation and model selection
5. **Model Ensembling**
   - Stacking/blending strategies
   - Multi-model fusion
   - Weight optimization
6. **Submission Preparation**
   - Generate prediction files
   - Validate the submission format
   - Submit automatically (optional)
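The weight optimization in the ensembling step can be sketched as a simple grid search over a blend weight. All names and data below are hypothetical; this is not RD-Agent's actual ensembling code.

```python
# Sketch of blending: search a weight w so that w*pred_a + (1-w)*pred_b
# minimizes squared validation error.

def blend_weight(pred_a, pred_b, y_true, steps=100):
    best_w, best_err = 0.0, float("inf")
    for i in range(steps + 1):
        w = i / steps
        err = sum((w * a + (1 - w) * b - y) ** 2
                  for a, b, y in zip(pred_a, pred_b, y_true))
        if err < best_err:
            best_w, best_err = w, err
    return best_w

pred_a = [0.2, 0.8, 0.4]   # model A validation predictions (hypothetical)
pred_b = [0.4, 0.6, 0.6]   # model B validation predictions (hypothetical)
y_true = [0.3, 0.7, 0.5]   # validation targets

print(blend_weight(pred_a, pred_b, y_true))  # prints: 0.5
```

Here the targets happen to be the exact average of the two models' predictions, so the search lands on an equal 0.5/0.5 blend.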
### Medical Prediction Model Scenario

#### Medical Data Processing

Medical data typically has special characteristics, and RD-Agent provides specialized processing capabilities:

```bash
# Run a medical prediction task
rdagent data_science --competition medical-prediction-task

# Example: acute kidney failure prediction
wget https://github.com/SunsetWolf/rdagent_resource/releases/download/ds_data/arf-12-hours-prediction-task.zip
unzip arf-12-hours-prediction-task.zip -d ./data/
```
Medical data characteristics and how they are handled:

1. **Privacy Protection**
   - Data anonymization and de-identification
   - HIPAA compliance considerations
   - Local processing; no raw data is uploaded
2. **Imbalanced Data Handling**
   - SMOTE oversampling
   - Cost-sensitive learning
   - Threshold tuning
3. **Time Series Medical Data**
   - Disease course time modeling
   - Multi-timepoint features
   - Survival analysis methods
4. **Multimodal Data Fusion**
   - Structured data (lab indicators)
   - Unstructured data (medical record text)
   - Medical image data
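Threshold tuning, one of the imbalance techniques listed above, can be sketched in plain Python: instead of the default 0.5 cutoff, pick the probability threshold that maximizes F1 on validation scores. The data and helper names below are hypothetical.

```python
# Sketch of threshold tuning for an imbalanced binary classifier.

def f1_score(y_true, y_pred):
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    if tp == 0:
        return 0.0
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)

def best_threshold(y_true, scores, candidates):
    # Binarize the scores at each candidate cutoff and keep the best F1.
    return max(candidates,
               key=lambda th: f1_score(y_true, [int(s >= th) for s in scores]))

y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]   # only 20% positives (hypothetical)
scores = [0.1, 0.2, 0.1, 0.3, 0.2, 0.1, 0.4, 0.2, 0.35, 0.45]

print(best_threshold(y_true, scores, [0.3, 0.35, 0.5]))  # prints: 0.35
```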
#### Prediction Model Development

```python
# Example of special configuration for medical prediction
medical_config = {
    "task_type": "classification",
    "positive_class_weight": 10,       # Handle class imbalance
    "cross_validation_folds": 10,      # More folds for a stabler estimate
    "feature_selection": "medical_relevance",
    "interpretability": True,          # Medical use cases require interpretability
    "ensemble_methods": ["voting", "stacking"],
    "evaluation_metrics": ["auc", "precision", "recall", "f1"],
}
```
#### Model Evaluation and Optimization

Medical model evaluation focuses more on:

- **Clinical metrics**: sensitivity, specificity, PPV, NPV
- **ROC/PR curves**: performance at different thresholds
- **Cost-benefit analysis**: the cost of misdiagnosis versus missed diagnosis
- **Interpretability**: SHAP values and feature importance
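The clinical metrics listed above follow directly from a confusion matrix. A small sketch with hypothetical counts:

```python
# Compute sensitivity, specificity, PPV, and NPV from confusion-matrix counts.

def clinical_metrics(tp, fp, fn, tn):
    return {
        "sensitivity": tp / (tp + fn),  # recall on the positive (sick) class
        "specificity": tn / (tn + fp),  # recall on the negative (healthy) class
        "ppv": tp / (tp + fp),          # positive predictive value
        "npv": tn / (tn + fn),          # negative predictive value
    }

# Hypothetical counts: 100 true positives cases, 200 negatives screened.
m = clinical_metrics(tp=80, fp=20, fn=20, tn=180)
print(m["sensitivity"])  # prints: 0.8
print(m["specificity"])  # prints: 0.9
```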
### General Data Science Tasks

#### Custom Dataset Processing

For non-competition data, RD-Agent supports flexible data integration:

```bash
# Create the data directory structure
mkdir -p custom_project/data
cd custom_project

# Prepare data files:
#   train.csv             - training data
#   test.csv              - test data (optional)
#   sample_submission.csv - submission format (optional)

# Configure the environment
# (DS_IF_USING_MLE_DATA=False marks this as non-MLE data)
cat > .env << EOF
DS_LOCAL_DATA_PATH="$(pwd)/data"
DS_CODER_ON_WHOLE_PIPELINE=True
DS_IF_USING_MLE_DATA=False
DS_SAMPLE_DATA_BY_LLM=False
DS_SCEN=rdagent.scenarios.data_science.scen.DataScienceScen
EOF

# Run the analysis
rdagent data_science --competition custom_project
```
#### Data Format Requirements

Standard tabular data:

```text
data/
├── train.csv              # Required: training data
├── test.csv               # Optional: test data
├── sample_submission.csv  # Optional: submission format
└── description.md         # Optional: problem description
```

Multimodal data:

```text
data/
├── tabular/
│   ├── train.csv
│   └── test.csv
├── images/
│   ├── train/
│   └── test/
├── text/
│   ├── train_texts.json
│   └── test_texts.json
└── config.yaml            # Data configuration file
```
#### Model Selection and Tuning

RD-Agent intelligently selects the most suitable algorithms.

Classification task algorithm selection:

```python
classification_algorithms = {
    "small_data": ["RandomForest", "SVM", "LogisticRegression"],
    "medium_data": ["XGBoost", "LightGBM", "CatBoost"],
    "large_data": ["NeuralNetwork", "TabNet", "AutoML"],
    "text_data": ["BERT", "RoBERTa", "DistilBERT"],
    "image_data": ["ResNet", "EfficientNet", "ViT"],
}
```
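A hypothetical dispatcher built on a table like this might look as follows; the row-count cutoffs are illustrative assumptions, not RD-Agent's actual selection rules.

```python
# Hypothetical size-based algorithm selection for tabular classification.
classification_algorithms = {
    "small_data": ["RandomForest", "SVM", "LogisticRegression"],
    "medium_data": ["XGBoost", "LightGBM", "CatBoost"],
    "large_data": ["NeuralNetwork", "TabNet", "AutoML"],
}

def select_algorithms(n_rows):
    if n_rows < 10_000:        # assumed cutoff for "small"
        return classification_algorithms["small_data"]
    if n_rows < 1_000_000:     # assumed cutoff for "medium"
        return classification_algorithms["medium_data"]
    return classification_algorithms["large_data"]

print(select_algorithms(5_000))    # prints: ['RandomForest', 'SVM', 'LogisticRegression']
print(select_algorithms(250_000))  # prints: ['XGBoost', 'LightGBM', 'CatBoost']
```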
Hyperparameter tuning strategies:

```python
tuning_strategies = {
    "bayesian_optimization": "Efficient parameter space search",
    "random_search": "Fast preliminary tuning",
    "grid_search": "Precise but time-consuming search",
    "evolutionary": "Global optimization for complex spaces",
    "hyperband": "Multi-fidelity optimization",
}
```
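Of these strategies, random search is simple enough to sketch in full. The parameter space and objective below are hypothetical stand-ins for a real validation score.

```python
# Sketch of random search: sample random parameter combinations and keep
# the best-scoring one.
import random

def random_search(param_space, objective, n_trials=50, seed=0):
    rng = random.Random(seed)  # seeded for reproducibility
    best_params, best_score = None, float("-inf")
    for _ in range(n_trials):
        params = {name: rng.choice(choices)
                  for name, choices in param_space.items()}
        score = objective(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score

space = {"max_depth": [3, 5, 7], "learning_rate": [0.01, 0.1, 0.3]}

# Hypothetical objective: pretend depth 5 with learning rate 0.1 validates best.
def objective(p):
    return -abs(p["max_depth"] - 5) - abs(p["learning_rate"] - 0.1)

best, score = random_search(space, objective)
print(best, score)
```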
#### Automated Feature Engineering

Numerical feature processing:

```python
numerical_transformations = [
    "StandardScaler",       # Standardization
    "MinMaxScaler",         # Normalization
    "RobustScaler",         # Robust scaling
    "PowerTransformer",     # Power transform
    "QuantileTransformer",  # Quantile transform
    "PCA",                  # Principal component analysis
    "PolynomialFeatures",   # Polynomial features
]
```
Categorical feature processing:

```python
categorical_transformations = [
    "OneHotEncoder",     # One-hot encoding
    "LabelEncoder",      # Label encoding
    "TargetEncoder",     # Target encoding
    "BinaryEncoder",     # Binary encoding
    "HashingEncoder",    # Hash encoding
    "FrequencyEncoder",  # Frequency encoding
]
```
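As a concrete example of one of these transforms, frequency encoding replaces each category with how often it occurs in the column. This is a sketch, not RD-Agent's actual encoder; the data is hypothetical.

```python
# Sketch of frequency encoding for a categorical column.
from collections import Counter

def frequency_encode(values):
    counts = Counter(values)
    total = len(values)
    return [counts[v] / total for v in values]

colors = ["red", "blue", "red", "green", "red", "blue"]
# red appears 3/6 of the time, blue 2/6, green 1/6
print(frequency_encode(colors))
```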
Automatic feature generation:

```python
auto_feature_generation = [
    "feature_interactions",  # Feature interactions
    "feature_aggregations",  # Feature aggregations
    "feature_ratios",        # Feature ratios
    "feature_differences",   # Feature differences
    "time_based_features",   # Time-based features
    "text_embeddings",       # Text embeddings
]
```
## Summary

This chapter provided detailed coverage of RD-Agent's core functions, including:

- **Command Line Interface Basics** - unified CLI design and monitoring capabilities
- **Data Science Agent Details** - a complete automated machine learning workflow
- **Practical Application Scenarios** - Kaggle competitions, medical prediction, general modeling

The next chapter will delve into the advanced features of the quantitative finance agents and the general model agent.

Quick reference commands:

```bash
# Start a data science project
rdagent data_science --competition <project_name>

# Start the monitoring interface
rdagent ui --port 19899 --data_science True

# Health check
rdagent health_check

# View help
rdagent data_science --help
```