Knowledge and Memory Management v0.0.2: Portable Knowledge Collection

#ai #automation #opensource

The v0.0.2 release of the Knowledge and Memory Management system marks a clear shift toward portability and clean separation of concerns. All personal paths have been replaced with the $AGENT_HOME environment variable, eliminating hardcoded directory assumptions that plagued v0.0.1. This release focuses on two core pillars: knowledge collection from diverse sources and structured memory management for long-term retention.

Why $AGENT_HOME Matters

Previous versions required manual path configuration per deployment, which made setups brittle. By standardizing on $AGENT_HOME, the system now resolves storage roots at runtime, so the same agent configuration can be shared across team members, CI pipelines, and containerized environments without edits. The memory manager, knowledge collectors, and indexing engines all respect this base directory, so everything from raw downloads to vector data stays under one portable root.
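To make the runtime resolution concrete, here is a minimal sketch of how a storage root could be derived from $AGENT_HOME. The function names and the raw/vectors/kv subdirectory layout are assumptions for illustration, not the project's actual layout.

```python
import os
from pathlib import Path
from typing import Dict


def resolve_storage_root(default: str = "./data") -> Path:
    """Read $AGENT_HOME at call time, falling back to a local default."""
    return Path(os.environ.get("AGENT_HOME", default)).expanduser().resolve()


def storage_layout(root: Path) -> Dict[str, Path]:
    """Hypothetical subdirectories under the root; the real tree may differ."""
    return {
        "raw": root / "raw",          # downloaded pages, transcripts, feeds
        "vectors": root / "vectors",  # embedding index files
        "kv": root / "kv",            # metadata and relationship index
    }


def ensure_layout(root: Path) -> Dict[str, Path]:
    """Create every subdirectory if it does not exist yet."""
    layout = storage_layout(root)
    for path in layout.values():
        path.mkdir(parents=True, exist_ok=True)
    return layout
```

Because the variable is read when the function is called rather than at import time, the same code can point at a different root in a container or CI job just by changing the environment.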

Knowledge Collection: Web, Video, Articles

The collector modules are source-specific but export a consistent interface. For web content, the HTML scraper strips scripts, downloads images (subject to size limits), and extracts readable text using a configurable parser. Video collection relies on transcripts from YouTube, Vimeo, and local files, with optional frame extraction for slide-heavy content. Article ingestion handles RSS feeds and direct URLs, and applies automatic summarization to long-form pieces.
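As an illustration of the script-stripping and text-extraction step, the sketch below uses only Python's standard html.parser module. It is not the release's scraper; the class and function names are hypothetical, and image handling is left out.

```python
from html.parser import HTMLParser


class ReadableTextExtractor(HTMLParser):
    """Collect visible text while skipping script, style, and noscript bodies."""

    SKIPPED_TAGS = {"script", "style", "noscript"}

    def __init__(self):
        super().__init__(convert_charrefs=True)
        self._skip_depth = 0  # > 0 while inside a skipped element
        self._chunks = []

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED_TAGS:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED_TAGS and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        if self._skip_depth == 0:
            text = data.strip()
            if text:
                self._chunks.append(text)

    def text(self):
        return " ".join(self._chunks)


def extract_text(html: str) -> str:
    """Return the visible text of an HTML document as one string."""
    parser = ReadableTextExtractor()
    parser.feed(html)
    parser.close()
    return parser.text()
```

A production scraper would likely also drop navigation and footer markup, which this sketch does not attempt.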

Each collector normalizes metadata (title, source URL, timestamp, language) and passes the payload to a staging queue. Deduplication runs at both the source and memory levels: identical URLs or matching content hashes are flagged before storage. The collectors also support custom filters, for example ignoring articles below a word count or videos shorter than 60 seconds.
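The staging step described above could look roughly like this. The Payload and StagingQueue names, the SHA-256 content hash, and the rejection labels are all assumptions made for the sketch; the source only states that URLs and hashed content are checked and that a word-count filter exists.

```python
import hashlib
from dataclasses import dataclass, field
from datetime import datetime, timezone
from typing import Optional


@dataclass
class Payload:
    """Normalized metadata plus extracted text for one collected item."""
    title: str
    source_url: str
    text: str
    language: str = "en"
    timestamp: datetime = field(default_factory=lambda: datetime.now(timezone.utc))

    @property
    def content_hash(self) -> str:
        # Collapse whitespace and case so trivial reformatting hashes the same
        normalized = " ".join(self.text.split()).lower()
        return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


class StagingQueue:
    """Accept payloads that pass the word-count filter and are not duplicates."""

    def __init__(self, min_words: int = 0):
        self.min_words = min_words
        self.items = []
        self._seen_urls = set()
        self._seen_hashes = set()

    def offer(self, payload: Payload) -> Optional[str]:
        """Return None if accepted, otherwise the reason for rejection."""
        if len(payload.text.split()) < self.min_words:
            return "too_short"
        if payload.source_url in self._seen_urls:
            return "duplicate_url"
        if payload.content_hash in self._seen_hashes:
            return "duplicate_content"
        self._seen_urls.add(payload.source_url)
        self._seen_hashes.add(payload.content_hash)
        self.items.append(payload)
        return None
```

Exact hashing only catches identical text after normalization; near-duplicates are left to the semantic check in the memory layer.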

Memory Management: Indexing and Retrieval

Memory in v0.0.2 is built on a hybrid of a vector store and a key-value index. Knowledge entries are chunked, embedded (default model: all-MiniLM-L6-v2), and inserted into an HNSW-based vector database. The key-value index stores metadata and relationships, enabling graph traversal across related items. When a new piece of knowledge arrives, the memory manager checks it for semantic similarity against existing entries. If it matches an existing entry, the new data can update that entry’s timestamps and references instead of being stored a second time.
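The chunk-then-merge flow can be sketched without any external dependencies. In the example below, plain lists of floats stand in for real embeddings, a linear scan stands in for the HNSW index, and the chunk size, overlap, and 0.95 similarity threshold are assumed values rather than the release's defaults.

```python
import math
from typing import Any, Dict, List, Sequence


def chunk_words(text: str, size: int = 200, overlap: int = 20) -> List[str]:
    """Split text into word windows of `size` that overlap by `overlap` words."""
    if overlap >= size:
        raise ValueError("overlap must be smaller than size")
    words = text.split()
    step = size - overlap
    chunks = []
    for start in range(0, len(words), step):
        chunks.append(" ".join(words[start:start + size]))
        if start + size >= len(words):
            break  # the last window already reached the end
    return chunks


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def upsert(entries: List[Dict[str, Any]], vector: Sequence[float],
           source_url: str, now: float, threshold: float = 0.95) -> Dict[str, Any]:
    """Merge into the most similar entry at or above threshold, else append."""
    best, best_score = None, threshold
    for entry in entries:  # linear scan; a real index would use ANN search
        score = cosine(entry["vector"], vector)
        if score >= best_score:
            best, best_score = entry, score
    if best is not None:
        best["updated_at"] = now
        if source_url not in best["references"]:
            best["references"].append(source_url)
        return best
    entry = {"vector": list(vector), "created_at": now,
             "updated_at": now, "references": [source_url]}
    entries.append(entry)
    return entry
```

The threshold controls how aggressive merging is: set too low, distinct items collapse into one entry; set too high, near-identical copies accumulate.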

Retrieval supports both dense vector search and keyword-based filtering. The retrieve() method accepts a query string, an optional source filter (web, video, article), and a recency window. Results are ranked by cosine similarity and weighted by source freshness.
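The post does not spell out how freshness is combined with similarity, so the following is only one plausible reading: multiply cosine similarity by an exponential decay with a configurable half-life. The Entry type, the rank() function, and the 30-day half-life are assumptions for the sketch, not the signature of retrieve().

```python
import math
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class Entry:
    text: str
    source: str              # "web", "video", or "article"
    age_days: float          # time since the entry was collected
    vector: Sequence[float]  # stand-in for a real embedding


def cosine(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)


def rank(query_vector: Sequence[float], entries: List[Entry],
         source: Optional[str] = None, max_age_days: Optional[float] = None,
         half_life_days: float = 30.0) -> List[Tuple[float, Entry]]:
    """Filter by source and recency window, then sort by similarity * freshness."""
    scored = []
    for entry in entries:
        if source is not None and entry.source != source:
            continue
        if max_age_days is not None and entry.age_days > max_age_days:
            continue
        freshness = 0.5 ** (entry.age_days / half_life_days)  # 1.0 when brand new
        scored.append((cosine(query_vector, entry.vector) * freshness, entry))
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored
```

With this weighting, a 60-day-old exact match scores 0.25, so a fresh entry with similarity 0.8 outranks it. Whether that trade-off is desirable depends on how quickly the underlying knowledge goes stale.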

Code Example: Collection Setup

The following demonstrates how to configure and use the collectors with $AGENT_HOME:

```python
import os

from knowledge_collector import WebCollector, VideoCollector, ArticleCollector
from memory_manager import MemoryManager

# Resolve the storage root from $AGENT_HOME, falling back to ./data if unset
agent_home = os.getenv('AGENT_HOME', './data')

# All collectors share one memory manager rooted at agent_home
memory = MemoryManager(base_path=agent_home)

web = WebCollector(memory=memory, dedup=True)
video = VideoCollector(memory=memory, transcript=True)
article = ArticleCollector(memory=memory, min_words=300)  # skip articles under 300 words

web.collect('https://example.com/deep-learning-guide')
video.collect('https://youtube.com/watch?v=dQw4w9WgXcQ')
article.collect('https://blog.example.com/feed')

memory.commit()  # flush embeddings and indexes to disk
```

The collect() methods validate URLs, download content, normalize it, and push to memory. commit() writes all pending vector and key-value updates to the $AGENT_HOME storage tree.
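The validation step is not described in detail. A minimal check, assuming only http and https URLs with a host are accepted, might look like this (is_collectable_url is a hypothetical helper, not part of the collector API):

```python
from urllib.parse import urlparse


def is_collectable_url(url: str) -> bool:
    """Accept only absolute http(s) URLs that include a host."""
    parts = urlparse(url.strip())
    return parts.scheme in {"http", "https"} and bool(parts.netloc)
```

Rejecting malformed input before any download keeps bad URLs out of the staging queue and the dedup sets.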

Performance and Scalability

Memory operations are batched: by default, every 100 pending entries trigger an automatic commit, and you can also call commit() explicitly. The vector database uses mmap for large indexes, so memory overhead stays predictable even with 500k+ entries. The collectors are I/O-bound by design, so they cache downloads under $AGENT_HOME to avoid redundant network requests.
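The batching behavior can be pictured as a small buffer that flushes itself once it reaches the threshold. The BatchedWriter class below is a hypothetical stand-in for whatever the memory manager does internally; only the default of 100 entries comes from the release notes.

```python
from typing import Any, Callable, List


class BatchedWriter:
    """Buffer entries and flush them in batches of `batch_size`."""

    def __init__(self, flush: Callable[[List[Any]], None], batch_size: int = 100):
        if batch_size < 1:
            raise ValueError("batch_size must be at least 1")
        self._flush = flush
        self._batch_size = batch_size
        self._pending: List[Any] = []

    def add(self, entry: Any) -> None:
        self._pending.append(entry)
        if len(self._pending) >= self._batch_size:
            self.commit()  # automatic commit once the batch is full

    def commit(self) -> int:
        """Flush pending entries and return how many were written."""
        if not self._pending:
            return 0
        batch, self._pending = self._pending, []
        self._flush(batch)
        return len(batch)
```

An explicit commit() at the end of a run is still needed, because a final partial batch would otherwise stay in memory.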

Looking Ahead

v0.0.2 is a clean foundation. The next minor version will introduce cross-source merging (e.g., linking a video transcript to its accompanying article) and incremental garbage collection for stale entries. For now, the focus on portable paths and separated collection/memory layers makes this release suitable for production agents that need to learn from the web without environment-specific hacks.

DEV Community
