Abstract
Building an AI Agent memory system is only the first step; evaluating its performance, accuracy, and reliability scientifically is just as important. Cortex Memory includes a complete evaluation framework covering recall, effectiveness, and performance testing, among other scenarios. This article provides an in-depth analysis of the framework's design philosophy and core implementation, and shows how to use it to validate and optimize memory system performance.
1. Problem Background: The Importance of Evaluation
1.1 Why an Evaluation Framework is Needed
The quality of a memory system directly impacts AI Agent performance, but how do we quantify this quality?
1.2 Core Evaluation Metrics
| Metric Category | Specific Metrics | Description |
|---|---|---|
| Recall | Precision@K, Recall@K, MAP, NDCG | Measure retrieval accuracy |
| Effectiveness | Fact extraction accuracy, classification correctness, deduplication accuracy | Measure processing quality |
| Performance | Latency, throughput, memory usage | Measure system efficiency |
| Reliability | Error rate, availability, consistency | Measure stability |
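As a quick intuition for the recall metrics: if 3 of the top 5 results for a query are relevant and the dataset marks 6 memories as relevant in total, then Precision@5 = 3/5 = 0.6 and Recall@5 = 3/6 = 0.5. The sections below implement these metrics in full.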
1.3 Evaluation Challenges
- Lack of standards: there is no unified evaluation standard for agent memory systems
- Data dependency: evaluation requires high-quality test datasets
- Scenario diversity: different application scenarios prioritize different metrics
- Continuous change: the system must be re-evaluated after every optimization
2. Evaluation Framework Architecture
2.1 Overall Architecture
The framework is organized into three layers: dataset management (test data and ground truth), pluggable evaluators (recall, effectiveness, performance), and report generation. The following subsections walk through the core components.
2.2 Core Components
2.2.1 Dataset Manager
pub struct DatasetManager {
datasets: HashMap<String, Dataset>,
}
pub struct Dataset {
pub id: String,
pub name: String,
pub description: String,
pub test_cases: Vec<TestCase>,
pub metadata: DatasetMetadata,
}
pub struct TestCase {
pub id: String,
pub query: String,
pub expected_results: Vec<ExpectedResult>,
pub metadata: TestCaseMetadata,
}
pub struct ExpectedResult {
pub memory_id: String,
pub relevance_score: f32,
pub position: usize,
}
pub enum DatasetType {
Recall, // Recall testing
Effectiveness, // Effectiveness testing
Performance, // Performance testing
Mixed, // Mixed testing
}
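For a concrete picture of the data model, here is a minimal hand-built dataset with a single test case. The values are placeholders, and TestCaseMetadata::default() is assumed to exist (it is used the same way in Section 6.1):
// Illustrative: assemble a minimal single-query dataset by hand
let dataset = Dataset {
    id: "demo-recall-001".to_string(),
    name: "Demo recall dataset".to_string(),
    description: "Single-query smoke test".to_string(),
    test_cases: vec![TestCase {
        id: "query_0".to_string(),
        query: "What programming language does the user prefer?".to_string(),
        expected_results: vec![ExpectedResult {
            memory_id: "mem-123".to_string(),
            relevance_score: 1.0,
            position: 0,
        }],
        metadata: TestCaseMetadata::default(),
    }],
    metadata: DatasetMetadata {
        dataset_type: DatasetType::Recall,
        created_at: Utc::now(),
        version: "1.0".to_string(),
    },
};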
2.2.2 Evaluator Interface
#[async_trait]
pub trait Evaluator: Send + Sync {
async fn evaluate(&self, dataset: &Dataset) -> Result<EvaluationResult>;
fn name(&self) -> &str;
fn metrics(&self) -> Vec<MetricDefinition>;
}
pub struct EvaluationResult {
pub evaluator_name: String,
pub dataset_id: String,
pub metrics: HashMap<String, MetricValue>,
pub details: EvaluationDetails,
pub timestamp: DateTime<Utc>,
}
pub struct MetricValue {
pub name: String,
pub value: f64,
pub unit: String,
pub description: String,
}
pub struct EvaluationDetails {
pub test_cases_passed: usize,
pub test_cases_total: usize,
pub errors: Vec<EvaluationError>,
pub warnings: Vec<EvaluationWarning>,
}
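Because every evaluator implements the same trait, a harness can run an arbitrary set of them over a dataset and collect their results. A minimal sketch (EvaluationRunner is an illustrative helper, not a framework type):
// Illustrative harness: run every registered evaluator against one dataset
pub struct EvaluationRunner {
    evaluators: Vec<Box<dyn Evaluator>>,
}
impl EvaluationRunner {
    pub async fn run(&self, dataset: &Dataset) -> Vec<EvaluationResult> {
        let mut results = Vec::new();
        for evaluator in &self.evaluators {
            match evaluator.evaluate(dataset).await {
                Ok(result) => results.push(result),
                Err(e) => eprintln!("{} failed on {}: {}", evaluator.name(), dataset.id, e),
            }
        }
        results
    }
}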
3. Recall Evaluation
3.1 Evaluation Metrics
3.1.1 Precision@K
pub struct PrecisionAtK {
pub k: usize,
}
impl PrecisionAtK {
pub fn calculate(
&self,
retrieved: &[ScoredMemory],
relevant: &HashSet<String>,
) -> f64 {
if retrieved.is_empty() {
return 0.0;
}
let top_k = retrieved.iter().take(self.k);
let relevant_count = top_k
.filter(|m| relevant.contains(&m.memory.id))
.count();
relevant_count as f64 / self.k as f64
}
}
// Usage example
let precision_at_5 = PrecisionAtK { k: 5 };
let precision = precision_at_5.calculate(&results, &relevant_ids);
3.1.2 Recall@K
pub struct RecallAtK {
pub k: usize,
}
impl RecallAtK {
pub fn calculate(
&self,
retrieved: &[ScoredMemory],
relevant: &HashSet<String>,
) -> f64 {
if relevant.is_empty() {
return 1.0;
}
let top_k = retrieved.iter().take(self.k);
let retrieved_relevant = top_k
.filter(|m| relevant.contains(&m.memory.id))
.count();
retrieved_relevant as f64 / relevant.len() as f64
}
}
3.1.3 Mean Average Precision (MAP)
pub struct MeanAveragePrecision;
impl MeanAveragePrecision {
pub fn calculate(
&self,
all_results: &[Vec<ScoredMemory>],
all_relevant: &[HashSet<String>],
) -> f64 {
let mut aps = Vec::new();
for (results, relevant) in all_results.iter().zip(all_relevant.iter()) {
if relevant.is_empty() {
continue;
}
let ap = self.average_precision(results, relevant);
aps.push(ap);
}
if aps.is_empty() {
return 0.0;
}
aps.iter().sum::<f64>() / aps.len() as f64
}
fn average_precision(
&self,
results: &[ScoredMemory],
relevant: &HashSet<String>,
) -> f64 {
let mut precision_sum = 0.0;
let mut relevant_count = 0;
for (i, result) in results.iter().enumerate() {
if relevant.contains(&result.memory.id) {
relevant_count += 1;
let precision = relevant_count as f64 / (i + 1) as f64;
precision_sum += precision;
}
}
        if relevant_count == 0 {
            return 0.0;
        }
        // Normalize by the total number of relevant memories (standard Average Precision)
        precision_sum / relevant.len() as f64
}
}
3.1.4 Normalized Discounted Cumulative Gain (NDCG)
pub struct NDCG {
pub k: usize,
}
impl NDCG {
pub fn calculate(
&self,
retrieved: &[ScoredMemory],
relevant: &HashMap<String, f64>,
) -> f64 {
let dcg = self.dcg(retrieved, relevant);
let idcg = self.idcg(relevant);
if idcg == 0.0 {
return 0.0;
}
dcg / idcg
}
fn dcg(
&self,
retrieved: &[ScoredMemory],
relevant: &HashMap<String, f64>,
) -> f64 {
        let mut dcg = 0.0;
        for (i, result) in retrieved.iter().take(self.k).enumerate() {
            let relevance = relevant.get(&result.memory.id).unwrap_or(&0.0);
            // Standard DCG discount: log2(rank + 1), with ranks starting at 1
            dcg += relevance / ((i + 2) as f64).log2();
        }
        dcg
}
fn idcg(&self, relevant: &HashMap<String, f64>) -> f64 {
let mut sorted_relevance: Vec<_> = relevant.values().cloned().collect();
sorted_relevance.sort_by(|a, b| b.partial_cmp(a).unwrap());
        let mut idcg = 0.0;
        for (i, relevance) in sorted_relevance.iter().take(self.k).enumerate() {
            idcg += relevance / ((i + 2) as f64).log2();
        }
        idcg
}
}
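Unlike Precision@K and Recall@K, NDCG consumes graded relevance judgments (a map from memory ID to relevance score). A usage sketch with toy values, where retrieved is the ranked search output as in the earlier examples:
// Graded relevance judgments for one query (higher = more relevant)
let mut relevant: HashMap<String, f64> = HashMap::new();
relevant.insert("mem-1".to_string(), 3.0);
relevant.insert("mem-2".to_string(), 1.0);
relevant.insert("mem-3".to_string(), 2.0);

let ndcg_at_10 = NDCG { k: 10 };
let score = ndcg_at_10.calculate(&retrieved, &relevant); // 1.0 means an ideal ranking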
3.2 Recall Evaluator Implementation
pub struct RecallEvaluator {
config: RecallEvaluatorConfig,
memory_manager: Arc<MemoryManager>,
}
#[derive(Debug, Clone)]
pub struct RecallEvaluatorConfig {
pub k_values: Vec<usize>,
pub similarity_thresholds: Vec<f32>,
pub max_results_per_query: usize,
}
#[async_trait]
impl Evaluator for RecallEvaluator {
async fn evaluate(&self, dataset: &Dataset) -> Result<EvaluationResult> {
let mut metrics = HashMap::new();
let mut details = EvaluationDetails {
test_cases_passed: 0,
test_cases_total: dataset.test_cases.len(),
errors: Vec::new(),
warnings: Vec::new(),
};
// Calculate Precision and Recall for each K value
for k in &self.config.k_values {
let mut precision_values = Vec::new();
let mut recall_values = Vec::new();
for test_case in &dataset.test_cases {
match self.evaluate_test_case(test_case, *k).await {
                    Ok((precision, recall)) => {
                        precision_values.push(precision);
                        recall_values.push(recall);
                        // Count each test case once, not once per K value
                        if Some(k) == self.config.k_values.first() {
                            details.test_cases_passed += 1;
                        }
                    }
                    Err(e) => {
                        // Record each failing test case once as well
                        if Some(k) == self.config.k_values.first() {
                            details.errors.push(EvaluationError {
                                test_case_id: test_case.id.clone(),
                                message: e.to_string(),
                            });
                        }
                    }
}
}
            // Calculate averages, guarding against the case where every test case failed
            let avg_precision = if precision_values.is_empty() {
                0.0
            } else {
                precision_values.iter().sum::<f64>() / precision_values.len() as f64
            };
            let avg_recall = if recall_values.is_empty() {
                0.0
            } else {
                recall_values.iter().sum::<f64>() / recall_values.len() as f64
            };
metrics.insert(
format!("precision@{}", k),
MetricValue {
name: format!("Precision@{}", k),
value: avg_precision,
unit: "score".to_string(),
description: format!("Average precision at K={}", k),
},
);
metrics.insert(
format!("recall@{}", k),
MetricValue {
name: format!("Recall@{}", k),
value: avg_recall,
unit: "score".to_string(),
description: format!("Average recall at K={}", k),
},
);
}
        // Calculate MAP across all queries.
        // Collect per-query results and relevant-ID sets in parallel vectors;
        // queries whose retrieval fails are skipped in both, keeping them aligned.
        let mut all_results: Vec<Vec<ScoredMemory>> = Vec::new();
        let mut all_relevant: Vec<HashSet<String>> = Vec::new();
        for tc in &dataset.test_cases {
            if let Ok(results) = self.get_retrieved_results(tc).await {
                all_results.push(results);
                all_relevant.push(
                    tc.expected_results
                        .iter()
                        .map(|er| er.memory_id.clone())
                        .collect(),
                );
            }
        }
let map = MeanAveragePrecision.calculate(&all_results, &all_relevant);
metrics.insert(
"map".to_string(),
MetricValue {
name: "Mean Average Precision".to_string(),
value: map,
unit: "score".to_string(),
description: "Mean Average Precision across all queries".to_string(),
},
);
Ok(EvaluationResult {
evaluator_name: self.name().to_string(),
dataset_id: dataset.id.clone(),
metrics,
details,
timestamp: Utc::now(),
})
}
fn name(&self) -> &str {
"recall_evaluator"
}
fn metrics(&self) -> Vec<MetricDefinition> {
vec![
MetricDefinition {
name: "precision@k".to_string(),
description: "Precision at K".to_string(),
unit: "score".to_string(),
},
MetricDefinition {
name: "recall@k".to_string(),
description: "Recall at K".to_string(),
unit: "score".to_string(),
},
MetricDefinition {
name: "map".to_string(),
description: "Mean Average Precision".to_string(),
unit: "score".to_string(),
},
]
}
}
impl RecallEvaluator {
async fn evaluate_test_case(
&self,
test_case: &TestCase,
k: usize,
) -> Result<(f64, f64)> {
// Execute search
let retrieved = self
.memory_manager
.search(&test_case.query, &Filters::default(), k)
.await?;
// Build relevant memories set
let relevant: HashSet<String> = test_case
.expected_results
.iter()
.map(|er| er.memory_id.clone())
.collect();
// Calculate Precision and Recall
let precision = PrecisionAtK { k }.calculate(&retrieved, &relevant);
let recall = RecallAtK { k }.calculate(&retrieved, &relevant);
Ok((precision, recall))
}
async fn get_retrieved_results(
&self,
test_case: &TestCase,
) -> Result<Vec<ScoredMemory>> {
self.memory_manager
.search(&test_case.query, &Filters::default(), self.config.max_results_per_query)
.await
}
}
4. Effectiveness Evaluation
4.1 Evaluation Dimensions
4.1.1 Fact Extraction Accuracy
pub struct FactExtractionEvaluator {
llm_client: Box<dyn LLMClient>,
}
impl FactExtractionEvaluator {
pub async fn evaluate_extraction_accuracy(
&self,
conversation: &[Message],
extracted_facts: &[ExtractedFact],
ground_truth: &[ExtractedFact],
) -> Result<ExtractionAccuracy> {
// Calculate precision
let precision = self.calculate_precision(extracted_facts, ground_truth).await?;
// Calculate recall
let recall = self.calculate_recall(extracted_facts, ground_truth).await?;
// Calculate F1 score
let f1 = if precision + recall > 0.0 {
2.0 * precision * recall / (precision + recall)
} else {
0.0
};
Ok(ExtractionAccuracy {
precision,
recall,
f1,
extracted_count: extracted_facts.len(),
ground_truth_count: ground_truth.len(),
})
}
async fn calculate_precision(
&self,
extracted: &[ExtractedFact],
ground_truth: &[ExtractedFact],
) -> Result<f64> {
if extracted.is_empty() {
return Ok(0.0);
}
let mut correct_count = 0;
for fact in extracted {
let is_correct = self.is_fact_correct(fact, ground_truth).await?;
if is_correct {
correct_count += 1;
}
}
Ok(correct_count as f64 / extracted.len() as f64)
}
async fn is_fact_correct(
&self,
fact: &ExtractedFact,
ground_truth: &[ExtractedFact],
) -> Result<bool> {
// Use LLM to judge if the fact is correct
let prompt = format!(
"Compare the following fact with the ground truth facts:\n\n\
Fact to evaluate: {}\n\n\
Ground truth facts:\n{}\n\n\
Is the fact correct and present in the ground truth? (yes/no)",
fact.content,
ground_truth
.iter()
.map(|f| format!("- {}", f.content))
.collect::<Vec<_>>()
.join("\n")
);
let response = self.llm_client.complete(&prompt).await?;
Ok(response.to_lowercase().contains("yes"))
}
}
pub struct ExtractionAccuracy {
pub precision: f64,
pub recall: f64,
pub f1: f64,
pub extracted_count: usize,
pub ground_truth_count: usize,
}
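The calculate_recall helper invoked above is not shown; a sketch of how it can mirror calculate_precision, reusing the same LLM judgment with the roles reversed (illustrative, not framework code):
impl FactExtractionEvaluator {
    // Illustrative sketch: recall = fraction of ground-truth facts that are
    // covered by at least one extracted fact, judged by the same LLM helper.
    async fn calculate_recall(
        &self,
        extracted: &[ExtractedFact],
        ground_truth: &[ExtractedFact],
    ) -> Result<f64> {
        if ground_truth.is_empty() {
            return Ok(1.0);
        }
        let mut covered_count = 0;
        for fact in ground_truth {
            // Roles reversed: is this ground-truth fact present among the extractions?
            if self.is_fact_correct(fact, extracted).await? {
                covered_count += 1;
            }
        }
        Ok(covered_count as f64 / ground_truth.len() as f64)
    }
}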
4.1.2 Classification Correctness
pub struct ClassificationEvaluator {
llm_client: Box<dyn LLMClient>,
}
impl ClassificationEvaluator {
pub async fn evaluate_classification_accuracy(
&self,
memories: &[(Memory, MemoryType)],
) -> Result<ClassificationAccuracy> {
        let mut correct_count = 0;
        let total_count = memories.len();
        for (memory, expected_type) in memories {
            let predicted_type = self.predict_type(&memory.content).await?;
            if predicted_type == *expected_type {
                correct_count += 1;
            }
        }
        // Guard against an empty evaluation set to avoid a NaN accuracy
        let accuracy = if total_count == 0 {
            0.0
        } else {
            correct_count as f64 / total_count as f64
        };
// Calculate precision and recall for each category
let per_class_metrics = self.calculate_per_class_metrics(memories).await?;
Ok(ClassificationAccuracy {
overall_accuracy: accuracy,
per_class_metrics,
total_count,
})
}
async fn predict_type(&self, content: &str) -> Result<MemoryType> {
let prompt = format!(
"Classify the following memory content:\n\n\
Content: {}\n\n\
Classify as one of: Conversational, Procedural, Factual, Semantic, Episodic, Personal\n\n\
Classification:",
content
);
let response = self.llm_client.complete(&prompt).await?;
Ok(MemoryType::parse(&response))
}
async fn calculate_per_class_metrics(
&self,
memories: &[(Memory, MemoryType)],
) -> Result<HashMap<MemoryType, ClassMetrics>> {
let mut metrics: HashMap<MemoryType, ClassMetrics> = HashMap::new();
for (memory, expected_type) in memories {
let predicted_type = self.predict_type(&memory.content).await?;
// Update expected type metrics
metrics
.entry(*expected_type)
.or_insert_with(ClassMetrics::new)
.total += 1;
            if predicted_type == *expected_type {
                metrics.get_mut(expected_type).unwrap().true_positives += 1;
            } else {
                // A wrong prediction is a false positive for the predicted class
                // and a false negative for the expected class
                metrics
                    .entry(predicted_type)
                    .or_insert_with(ClassMetrics::new)
                    .false_positives += 1;
                metrics.get_mut(expected_type).unwrap().false_negatives += 1;
            }
}
// Calculate precision and recall
for (_, metric) in metrics.iter_mut() {
metric.precision = if metric.true_positives + metric.false_positives > 0 {
metric.true_positives as f64
/ (metric.true_positives + metric.false_positives) as f64
} else {
0.0
};
metric.recall = if metric.true_positives + metric.false_negatives > 0 {
metric.true_positives as f64
/ (metric.true_positives + metric.false_negatives) as f64
} else {
0.0
};
metric.f1 = if metric.precision + metric.recall > 0.0 {
2.0 * metric.precision * metric.recall / (metric.precision + metric.recall)
} else {
0.0
};
}
Ok(metrics)
}
}
pub struct ClassMetrics {
pub true_positives: usize,
pub false_positives: usize,
pub false_negatives: usize,
pub total: usize,
pub precision: f64,
pub recall: f64,
pub f1: f64,
}
impl ClassMetrics {
fn new() -> Self {
Self {
true_positives: 0,
false_positives: 0,
false_negatives: 0,
total: 0,
precision: 0.0,
recall: 0.0,
f1: 0.0,
}
}
}
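A single headline number is often convenient for dashboards and regression checks; one common choice, sketched below, is the macro-averaged F1 (the unweighted mean of the per-class F1 values computed above):
// Macro-averaged F1 over the per-class metrics (unweighted mean of F1 scores)
pub fn macro_f1(per_class: &HashMap<MemoryType, ClassMetrics>) -> f64 {
    if per_class.is_empty() {
        return 0.0;
    }
    per_class.values().map(|m| m.f1).sum::<f64>() / per_class.len() as f64
}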
5. Performance Evaluation
5.1 Benchmark Testing
use criterion::{black_box, criterion_group, criterion_main, Criterion, BenchmarkId};
pub struct PerformanceBenchmark {
memory_manager: Arc<MemoryManager>,
}
impl PerformanceBenchmark {
    pub fn benchmark_search_latency(c: &mut Criterion) {
        // Reuse a single Tokio runtime instead of creating one per iteration.
        // Assumes an async helper `Self::create_test_manager(size)` that builds
        // a MemoryManager preloaded with `size` memories (not shown here).
        let rt = tokio::runtime::Runtime::new().unwrap();
        let mut group = c.benchmark_group("search_latency");
        for dataset_size in [1000, 5000, 10000, 50000].iter() {
            let manager = rt.block_on(Self::create_test_manager(*dataset_size));
            group.bench_with_input(
                BenchmarkId::from_parameter(dataset_size),
                dataset_size,
                |b, _| {
                    b.iter(|| {
                        black_box(rt.block_on(manager.search(
                            "test query",
                            &Filters::default(),
                            10,
                        )))
                    });
                },
            );
        }
        group.finish();
    }
    pub fn benchmark_insert_latency(c: &mut Criterion) {
        let rt = tokio::runtime::Runtime::new().unwrap();
        let mut group = c.benchmark_group("insert_latency");
        for content_length in [100, 500, 1000, 5000].iter() {
            let manager = rt.block_on(Self::create_test_manager(0));
            let content = "A".repeat(*content_length);
            group.bench_with_input(
                BenchmarkId::from_parameter(content_length),
                content_length,
                |b, _| {
                    b.iter(|| {
                        black_box(rt.block_on(manager.create_memory(
                            content.clone(),
                            MemoryMetadata::default(),
                        )))
                    });
                },
            );
        }
        group.finish();
    }
    pub fn benchmark_throughput(c: &mut Criterion) {
        let rt = tokio::runtime::Runtime::new().unwrap();
        let manager = rt.block_on(Self::create_test_manager(10000));
        let mut group = c.benchmark_group("throughput");
        group.bench_function("concurrent_searches", |b| {
            b.iter(|| {
                rt.block_on(async {
                    // Fire 100 concurrent searches and wait for all of them
                    let handles: Vec<_> = (0..100)
                        .map(|_| {
                            let manager = manager.clone();
                            tokio::spawn(async move {
                                manager.search("test query", &Filters::default(), 10).await
                            })
                        })
                        .collect();
                    futures::future::join_all(handles).await;
                });
            });
        });
        group.finish();
    }
}
criterion_group! {
name = benches;
config = Criterion::default().sample_size(100);
targets = PerformanceBenchmark::benchmark_search_latency,
PerformanceBenchmark::benchmark_insert_latency,
PerformanceBenchmark::benchmark_throughput
}
criterion_main!(benches);
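These benchmarks assume the usual Criterion setup: the benchmark file is registered in Cargo.toml as a [[bench]] target with harness = false, and the suite is executed with cargo bench.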
5.2 Load Testing
use load_testing::LoadTester;
pub struct LoadTestRunner {
memory_manager: Arc<MemoryManager>,
}
impl LoadTestRunner {
pub async fn run_load_test(
&self,
config: LoadTestConfig,
) -> Result<LoadTestResult> {
let mut tester = LoadTester::new(config.concurrent_users);
// Add search tasks
for _ in 0..config.search_requests {
let manager = self.memory_manager.clone();
tester.add_task(async move {
manager.search("test query", &Filters::default(), 10).await
});
}
// Add insert tasks
for _ in 0..config.insert_requests {
let manager = self.memory_manager.clone();
tester.add_task(async move {
manager
.create_memory("test content".to_string(), MemoryMetadata::default())
.await
});
}
// Run test
let results = tester.run(Duration::from_secs(config.duration_secs)).await?;
Ok(LoadTestResult {
total_requests: results.total_requests,
successful_requests: results.successful_requests,
failed_requests: results.failed_requests,
average_latency: results.average_latency,
p50_latency: results.p50_latency,
p95_latency: results.p95_latency,
p99_latency: results.p99_latency,
requests_per_second: results.requests_per_second,
})
}
}
pub struct LoadTestConfig {
pub concurrent_users: usize,
pub search_requests: usize,
pub insert_requests: usize,
pub duration_secs: u64,
}
pub struct LoadTestResult {
pub total_requests: usize,
pub successful_requests: usize,
pub failed_requests: usize,
pub average_latency: Duration,
pub p50_latency: Duration,
pub p95_latency: Duration,
pub p99_latency: Duration,
pub requests_per_second: f64,
}
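A typical invocation, with illustrative numbers and assuming a LoadTestRunner instance named runner:
// Illustrative load-test run: 50 concurrent users, mixed read/write traffic
let config = LoadTestConfig {
    concurrent_users: 50,
    search_requests: 5_000,
    insert_requests: 500,
    duration_secs: 60,
};
let result = runner.run_load_test(config).await?;
println!(
    "RPS: {:.1}, p95: {:?}, errors: {}",
    result.requests_per_second, result.p95_latency, result.failed_requests
);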
6. Dataset Management
6.1 Dataset Generation
pub struct DatasetGenerator {
llm_client: Box<dyn LLMClient>,
}
impl DatasetGenerator {
pub async fn generate_recall_dataset(
&self,
config: RecallDatasetConfig,
) -> Result<Dataset> {
let mut test_cases = Vec::new();
for i in 0..config.num_queries {
// Generate query
let query = self.generate_query(&config.domain).await?;
// Generate relevant memories
let relevant_memories = self.generate_relevant_memories(
&query,
config.num_relevant_per_query,
).await?;
// Build expected results
let expected_results = relevant_memories
.iter()
.enumerate()
.map(|(j, memory)| ExpectedResult {
memory_id: memory.id.clone(),
                    relevance_score: 1.0 - (j as f32 * 0.1),
position: j,
})
.collect();
test_cases.push(TestCase {
id: format!("query_{}", i),
query,
expected_results,
metadata: TestCaseMetadata {
domain: config.domain.clone(),
difficulty: config.difficulty,
..Default::default()
},
});
}
Ok(Dataset {
id: format!("recall_{}", Uuid::new_v4()),
name: format!("Recall Dataset - {}", config.domain),
description: format!("Generated recall dataset for {} domain", config.domain),
test_cases,
metadata: DatasetMetadata {
dataset_type: DatasetType::Recall,
created_at: Utc::now(),
version: "1.0".to_string(),
},
})
}
async fn generate_query(&self, domain: &str) -> Result<String> {
let prompt = format!(
"Generate a natural language query for the {} domain. \
The query should be specific and realistic.",
domain
);
let query = self.llm_client.complete(&prompt).await?;
Ok(query.trim().to_string())
}
async fn generate_relevant_memories(
&self,
query: &str,
count: usize,
) -> Result<Vec<Memory>> {
let prompt = format!(
"Generate {} relevant memories for the following query:\n\n\
Query: {}\n\n\
Each memory should be a short, specific piece of information.\n\
Format each memory on a new line.",
count, query
);
let response = self.llm_client.complete(&prompt).await?;
let memories: Vec<Memory> = response
.lines()
.filter(|line| !line.trim().is_empty())
.take(count)
.map(|content| Memory {
id: Uuid::new_v4().to_string(),
content: content.trim().to_string(),
embedding: vec![0.0; 1536], // placeholder
metadata: MemoryMetadata::default(),
created_at: Utc::now(),
updated_at: Utc::now(),
})
.collect();
Ok(memories)
}
}
pub struct RecallDatasetConfig {
pub domain: String,
pub num_queries: usize,
pub num_relevant_per_query: usize,
pub difficulty: Difficulty,
}
6.2 Dataset Loading
pub struct DatasetLoader;
impl DatasetLoader {
pub fn load_from_file(path: &str) -> Result<Dataset> {
let file = File::open(path)?;
let reader = BufReader::new(file);
let dataset: Dataset = serde_json::from_reader(reader)?;
Ok(dataset)
}
pub fn save_to_file(dataset: &Dataset, path: &str) -> Result<()> {
let file = File::create(path)?;
let writer = BufWriter::new(file);
serde_json::to_writer_pretty(writer, dataset)?;
Ok(())
}
pub fn load_from_directory(dir: &str) -> Result<Vec<Dataset>> {
let mut datasets = Vec::new();
for entry in fs::read_dir(dir)? {
let entry = entry?;
let path = entry.path();
if path.extension().and_then(|s| s.to_str()) == Some("json") {
let dataset = Self::load_from_file(path.to_str().unwrap())?;
datasets.push(dataset);
}
}
Ok(datasets)
}
}
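Putting the two pieces together, a generated dataset can be persisted and reloaded for repeatable evaluation runs. One caveat worth stating: generate_relevant_memories only returns Memory values, so for a recall evaluation to be meaningful those memories (with real embeddings) must also be ingested into the MemoryManager under the same IDs the dataset references. A sketch, assuming a DatasetGenerator named generator and that Difficulty has a Medium variant:
// Generate a recall dataset, persist it, then reload it for a later run
let config = RecallDatasetConfig {
    domain: "personal_assistant".to_string(),
    num_queries: 50,
    num_relevant_per_query: 5,
    difficulty: Difficulty::Medium, // assumed variant
};
let dataset = generator.generate_recall_dataset(config).await?;
DatasetLoader::save_to_file(&dataset, "datasets/recall_personal_assistant.json")?;

let reloaded = DatasetLoader::load_from_file("datasets/recall_personal_assistant.json")?;
assert_eq!(reloaded.test_cases.len(), dataset.test_cases.len());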
7. Report Generation
7.1 Markdown Report
pub struct MarkdownReportGenerator;
impl MarkdownReportGenerator {
pub fn generate_report(
&self,
result: &EvaluationResult,
) -> Result<String> {
let mut report = String::new();
// Title
report.push_str(&format!("# {} Evaluation Report\n\n", result.evaluator_name));
// Metadata
report.push_str("## Metadata\n\n");
report.push_str(&format!("- **Dataset ID**: {}\n", result.dataset_id));
report.push_str(&format!("- **Timestamp**: {}\n", result.timestamp.format("%Y-%m-%d %H:%M:%S UTC")));
report.push_str("\n");
// Metrics
report.push_str("## Metrics\n\n");
report.push_str("| Metric | Value | Unit | Description |\n");
report.push_str("|--------|-------|------|-------------|\n");
        // Iterate metrics in a stable order so reports are deterministic
        let mut sorted_metrics: Vec<_> = result.metrics.iter().collect();
        sorted_metrics.sort_by(|a, b| a.0.cmp(b.0));
        for (_name, metric) in sorted_metrics {
report.push_str(&format!(
"| {} | {:.4} | {} | {} |\n",
metric.name, metric.value, metric.unit, metric.description
));
}
report.push_str("\n");
// Details
report.push_str("## Details\n\n");
report.push_str(&format!("- **Test Cases Passed**: {}/{}\n",
result.details.test_cases_passed,
result.details.test_cases_total
));
if !result.details.errors.is_empty() {
report.push_str("\n### Errors\n\n");
for error in &result.details.errors {
report.push_str(&format!("- **{}**: {}\n", error.test_case_id, error.message));
}
}
if !result.details.warnings.is_empty() {
report.push_str("\n### Warnings\n\n");
for warning in &result.details.warnings {
report.push_str(&format!("- **{}**: {}\n", warning.test_case_id, warning.message));
}
}
Ok(report)
}
}
7.2 JSON Report
pub struct JsonReportGenerator;
impl JsonReportGenerator {
pub fn generate_report(
&self,
result: &EvaluationResult,
) -> Result<String> {
serde_json::to_string_pretty(result)
.map_err(|e| MemoryError::Serialization(e.to_string()))
}
}
8. Usage Example
8.1 Running Evaluation
#[tokio::main]
async fn main() -> Result<()> {
// Initialize components
let config = load_config("config.toml").await?;
let manager = create_memory_manager(&config).await?;
// Create dataset
let dataset = generate_test_dataset(&manager).await?;
// Create evaluator
let evaluator = RecallEvaluator::new(
RecallEvaluatorConfig {
k_values: vec![5, 10, 20],
similarity_thresholds: vec![0.7, 0.8, 0.9],
max_results_per_query: 100,
},
manager.clone(),
);
// Run evaluation
let result = evaluator.evaluate(&dataset).await?;
// Generate report
let report = MarkdownReportGenerator.generate_report(&result)?;
// Save report
fs::write("evaluation_report.md", report)?;
println!("Evaluation completed successfully!");
println!("Report saved to evaluation_report.md");
Ok(())
}
8.2 Continuous Integration
# .github/workflows/evaluation.yml
name: Memory Evaluation
on:
push:
branches: [main]
pull_request:
branches: [main]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v3
      - name: Setup Rust
        uses: actions-rs/toolchain@v1
        with:
          profile: minimal
          toolchain: stable
      - name: Run evaluation
        run: |
          mkdir -p reports
          cargo run --release --bin evaluation -- \
            --dataset ./datasets/test.json \
            --output ./reports/evaluation.md
      - name: Upload report
        uses: actions/upload-artifact@v3
        with:
          name: evaluation-report
          path: ./reports/evaluation.md
9. Summary
Cortex Memory's evaluation framework provides the following capabilities:
- Comprehensive metrics: Covers recall, effectiveness, and performance dimensions
- Flexible dataset management: Supports multiple data sources and generation methods
- Multiple evaluators: Pluggable evaluator architecture
- Rich report formats: Supports Markdown, JSON, and other formats
- CI/CD integration: Easy to integrate into continuous integration workflows
This evaluation framework provides scientific, quantitative metrics for memory system optimization, ensuring continuous improvement of system quality and performance.


