Knowledge-and-Memory-Management v0.0.2: Streamlined Knowledge Collection with Portable Home Directories

#ai #opensource #automation

If you’re building a system that needs to ingest and retain information from diverse sources—web pages, video transcripts, or long-form articles—you know the friction of wrestling with absolute paths and fragile configurations. The v0.0.2 release of Knowledge-and-Memory-Management cleans up that mess. This version replaces every hardcoded personal path with $AGENT_HOME, making your deployment truly portable, and tightens the knowledge collection pipeline. Let’s look at what changed and why it matters for developers who need a reliable memory layer for their agents.

What v0.0.2 Fixes (and Breaks)

The major breaking change is that any previous configuration referencing absolute paths like /Users/you/projects/knowledge or /home/user/data will no longer load. Instead, you must define the AGENT_HOME environment variable. All internal storage (collected knowledge, indexes, and metadata) now lives under $AGENT_HOME/data. This small shift eliminates the “works on my machine” problem when sharing configs across teams or deploying to containers.

Migration is straightforward: export AGENT_HOME to a writable directory, then move or point your existing knowledge collections under it. The release includes a migration script (scripts/migrate_paths.py) that scans old configs and replaces absolute paths with ${AGENT_HOME} references.
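To make the rewrite step concrete, here is a minimal sketch of the kind of substitution such a migration performs. It is not the contents of scripts/migrate_paths.py; the function name rewrite_paths and the choice to match only whole path prefixes are assumptions made for illustration.

```python
import re


def rewrite_paths(config_text: str, old_root: str, var: str = "${AGENT_HOME}") -> str:
    """Replace old_root with var wherever it appears as a path prefix.

    Hypothetical helper: the real migration script may detect paths differently.
    """
    old_root = old_root.rstrip("/")
    # Only match when old_root is followed by "/", a quote, whitespace, or the
    # end of the text, so "/home/user" does not match "/home/username".
    pattern = re.compile(re.escape(old_root) + r"(?=/|[\"'\s]|$)")
    # A function replacement avoids any escape processing of the variable text.
    return pattern.sub(lambda _match: var, config_text)


old = 'path: "/home/user/data/collections/tech_reference"'
print(rewrite_paths(old, "/home/user"))
```

Run it against a copy of your config first and diff the result before overwriting anything.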

Knowledge Collection in v0.0.2

The core collection engine now supports three explicit domains: web, video, and articles. Each domain has a dedicated extractor that normalises content into a common memory format before storage.

Web: Uses a headless browser to fetch full page content, stripping navigation, ads, and paywalls. It preserves semantic structure (headings, lists, code blocks) and generates a clean Markdown representation.
Video: Pulls transcripts from YouTube, Vimeo, or local media files via automatic speech recognition integration. It timestamps each segment and extracts key frames when possible. The output is a searchable text transcript with metadata like duration and speaker labels.
Articles: Optimised for long-form text (blog posts, PDFs, research papers). It extracts structured metadata (title, author, publication date) and chunks the content according to a configurable token limit for downstream processing like embedding or summarisation.
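The chunking step in the Articles extractor can be sketched in a few lines. This is an illustration only: it counts whitespace-separated words as a stand-in for real tokens, and the name chunk_text and the default of 200 are assumptions, not the project's actual tokenizer or setting.

```python
def chunk_text(text: str, max_tokens: int = 200) -> list[str]:
    """Split text into consecutive chunks of at most max_tokens words.

    Whitespace splitting approximates tokenisation; a model-specific
    tokenizer would give different counts.
    """
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    tokens = text.split()
    return [
        " ".join(tokens[i:i + max_tokens])
        for i in range(0, len(tokens), max_tokens)
    ]


print(chunk_text("one two three four five", max_tokens=2))
```

A real pipeline would likely also respect paragraph or heading boundaries so that chunks do not cut sentences in half.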

All collectors share a common interface: KnowledgeSource(name, uri, domain, options). This lets you build a pipeline that mixes sources without custom glue code.
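As a rough picture of what that shared interface might look like, here is a sketch built on a dataclass. Only the four field names come from the signature above; the validation, the VALID_DOMAINS set, and the group_by_domain helper are assumptions added for illustration.

```python
from dataclasses import dataclass, field
from typing import Any

# Domain names as listed in the memory unit fields (web|video|article).
VALID_DOMAINS = {"web", "video", "article"}


@dataclass
class KnowledgeSource:
    name: str
    uri: str
    domain: str
    options: dict[str, Any] = field(default_factory=dict)

    def __post_init__(self) -> None:
        # Assumed validation: reject domains no extractor could handle.
        if self.domain not in VALID_DOMAINS:
            raise ValueError(f"unknown domain: {self.domain!r}")


def group_by_domain(sources: list[KnowledgeSource]) -> dict[str, list[KnowledgeSource]]:
    """Hypothetical helper: bucket mixed sources so each extractor gets its own list."""
    grouped: dict[str, list[KnowledgeSource]] = {}
    for source in sources:
        grouped.setdefault(source.domain, []).append(source)
    return grouped


docs = KnowledgeSource("tech_reference", "https://developer.example.com/tutorials",
                       "web", {"max_pages": 50})
print(group_by_domain([docs]))
```

Because every source has the same shape, a pipeline can iterate over one list and dispatch on domain instead of carrying per-source glue.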

The Portable Path Convention

Let’s see how $AGENT_HOME simplifies configuration. Here’s an example of defining a knowledge collection in YAML:

knowledge_sources:
  - name: tech_reference
    domain: web
    uri: "https://developer.example.com/tutorials"
    options:
      recursive: true
      max_pages: 50
      freshness: 7d
    storage:
      path: "${AGENT_HOME}/data/collections/tech_reference"
      index_type: "vector"

When this file is loaded, every ${AGENT_HOME} reference expands to the value of the environment variable. No more hardcoding home directories. The same config works on your laptop, a CI runner, or a bare-metal server, as long as AGENT_HOME is set in each environment.
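The expansion itself is simple enough to sketch. The loader's real implementation is not shown in this post, so treat the function below (expand_agent_home, a hypothetical name) as one plausible way to do it, including the choice to fail loudly when the variable is missing.

```python
import os
import re

# Matches ${AGENT_HOME} or $AGENT_HOME, but not longer names like $AGENT_HOME_X.
_AGENT_HOME_REF = re.compile(r"\$\{AGENT_HOME\}|\$AGENT_HOME\b")


def expand_agent_home(value: str, env=None) -> str:
    """Expand AGENT_HOME references in a config string (illustrative sketch)."""
    env = os.environ if env is None else env
    if not _AGENT_HOME_REF.search(value):
        return value  # nothing to expand
    home = env.get("AGENT_HOME")
    if not home:
        raise RuntimeError("AGENT_HOME is not set")
    # Drop a trailing slash so "${AGENT_HOME}/data" never becomes "//data".
    return _AGENT_HOME_REF.sub(lambda _match: home.rstrip("/"), value)


print(expand_agent_home("${AGENT_HOME}/data/collections/tech_reference",
                        {"AGENT_HOME": "/srv/agent"}))
```

Passing env explicitly keeps the function easy to test without touching the real environment.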

Memory Management Under the Hood

v0.0.2 introduces a loosely schema-enforced memory model. Each collected item (a page, a transcript, an article) becomes a memory unit with:

id (hash of content + source)
source (the original URI)
domain (web|video|article)
content (normalised text)
metadata (JSON blob)
created_at and updated_at timestamps

Deduplication uses the content hash, not the URI—if the same article appears on two pages, it’s stored once. Expiration policies rotate old memories when $AGENT_HOME disk usage exceeds a threshold (configurable, default 85%).
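Here is a small sketch of how the memory unit fields and content-based deduplication could fit together. The hash function (SHA-256), the in-memory MemoryStore class, and the reading of "disk usage" as usage of the filesystem that holds $AGENT_HOME are all assumptions; the post does not specify them.

```python
import hashlib
import shutil
from datetime import datetime, timezone


def content_hash(content: str) -> str:
    """Dedup key: depends on the normalised text only, never on the URI."""
    return hashlib.sha256(content.encode("utf-8")).hexdigest()


def unit_id(content: str, source: str) -> str:
    """Unit id: hash of content plus source, as in the field list."""
    return hashlib.sha256(f"{source}\n{content}".encode("utf-8")).hexdigest()


class MemoryStore:
    """Hypothetical in-memory store keyed by content hash."""

    def __init__(self) -> None:
        self._units: dict[str, dict] = {}

    def add(self, source: str, domain: str, content: str, metadata=None) -> bool:
        key = content_hash(content)
        if key in self._units:
            return False  # same text already stored, whatever its URI
        now = datetime.now(timezone.utc).isoformat()
        self._units[key] = {
            "id": unit_id(content, source),
            "source": source,
            "domain": domain,
            "content": content,
            "metadata": metadata or {},
            "created_at": now,
            "updated_at": now,
        }
        return True

    def __len__(self) -> int:
        return len(self._units)


def over_threshold(path: str, threshold: float = 0.85) -> bool:
    """True when the filesystem holding path is fuller than threshold."""
    usage = shutil.disk_usage(path)
    return usage.used / usage.total > threshold


store = MemoryStore()
store.add("https://a.example/post", "article", "same text")
store.add("https://b.example/mirror", "article", "same text")
print(len(store))
```

Keeping the dedup key separate from the id means two URIs with identical text collapse into one stored unit, while the id still records where the first copy came from.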

A new CLI command, agent-memory collect, triggers an incremental sweep. It respects ETag and Last-Modified headers for web sources, skips unchanged transcripts, and re-chunks articles only when the source file changes.
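For web sources, an incremental sweep like this usually rests on HTTP conditional requests. The sketch below shows the header logic only and makes no network calls. If-None-Match, If-Modified-Since, and the 304 status are standard HTTP; the cached-record shape and the function names are assumptions.

```python
def conditional_headers(cached: dict) -> dict:
    """Build request headers from validators saved on the previous fetch.

    cached is a hypothetical record such as
    {"etag": '"abc123"', "last_modified": "Wed, 01 Jan 2025 00:00:00 GMT"}.
    """
    headers = {}
    if cached.get("etag"):
        headers["If-None-Match"] = cached["etag"]
    if cached.get("last_modified"):
        headers["If-Modified-Since"] = cached["last_modified"]
    return headers


def needs_reingest(status_code: int) -> bool:
    """304 Not Modified means the stored copy is still current."""
    return status_code != 304


print(conditional_headers({"etag": '"abc123"'}))
print(needs_reingest(304))
```

A source seen for the first time has no cached validators, so the headers are empty and the server returns the full page.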

Practical Considerations

No default value: If AGENT_HOME is unset, the system exits with a clear error. This forces explicit configuration and avoids accidental writes to unintended directories.
Logging: All collection events are written to ${AGENT_HOME}/logs/collection.log with structured JSON lines for easy ingestion into your own monitoring.
Testing: The release ships with a --dry-run flag for every collect command. Use it to verify what would be ingested without writing any data.
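The first two points are easy to picture in code. The sketch below shows a fail-fast check for AGENT_HOME and a JSON Lines writer targeting logs/collection.log. The function names and the record fields (ts, event) are assumptions; the actual log schema is not documented in this post.

```python
import json
import os
import sys
from datetime import datetime, timezone
from pathlib import Path


def require_agent_home(env=None) -> Path:
    """Return AGENT_HOME as a Path, or exit with an error if it is unset."""
    env = os.environ if env is None else env
    home = env.get("AGENT_HOME")
    if not home:
        sys.exit("error: AGENT_HOME is not set; export it to a writable directory")
    return Path(home)


def log_event(home: Path, event: str, **fields) -> str:
    """Append one JSON object per line to <home>/logs/collection.log."""
    record = {"ts": datetime.now(timezone.utc).isoformat(), "event": event, **fields}
    line = json.dumps(record, sort_keys=True)
    log_path = home / "logs" / "collection.log"
    log_path.parent.mkdir(parents=True, exist_ok=True)
    with log_path.open("a", encoding="utf-8") as handle:
        handle.write(line + "\n")
    return line
```

One object per line means the log can be tailed, grepped, or shipped to a log collector without a custom parser.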

One More Thing: The “S” in the Changelog

The trailing “S” in the topic description hints at a new feature: Summaries. v0.0.2 bakes in a lightweight, configurable summarisation step that runs after collection. For each memory unit whose content exceeds 500 tokens, an optional local LLM or API call produces a three-sentence summary. This summary is stored alongside the full content and used for faster retrieval queries. You can disable it per source with options: { summarize: false }.

In production, this halves average retrieval latency because the summary vector is smaller than the full content vector. The trade-off is extra compute when content is collected. Tune the token threshold per domain.
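The decision of whether a unit gets summarised can be sketched as a small predicate. The 500-token default and the summarize option come from the description above; the per-domain threshold table, the word-count approximation of tokens, and the name needs_summary are assumptions for illustration.

```python
# Hypothetical per-domain thresholds; all start at the documented 500.
DOMAIN_THRESHOLDS = {"web": 500, "video": 500, "article": 500}


def needs_summary(content: str, domain: str, options=None,
                  thresholds=DOMAIN_THRESHOLDS) -> bool:
    """Decide whether the summarisation step should run for one memory unit."""
    options = options or {}
    if not options.get("summarize", True):
        return False  # disabled for this source
    limit = thresholds.get(domain, 500)
    # Word count stands in for a real token count here.
    return len(content.split()) > limit


long_text = "word " * 501
print(needs_summary(long_text, "web"))
print(needs_summary(long_text, "web", {"summarize": False}))
```

Raising the threshold for a domain, for example video transcripts, reduces the number of LLM calls at collection time at the cost of fewer summaries to retrieve against.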

Getting Started

Update your existing v0.0.1 installations by running the migration script, then set AGENT_HOME. If you’re starting fresh:

export AGENT_HOME=/path/to/agent
pip install knowledge-memory==0.0.2
agent-memory init
agent-memory collect --all

The init command creates the directory structure and a default config template. From there, edit knowledge_sources.yaml to match your ingestion targets.

v0.0.2 is not a feature-bloated release—it’s a set of deliberate engineering decisions to make knowledge management reproducible and maintainable. The path fix alone is worth the upgrade. Combine it with the new structured collectors and optional summaries, and you have a solid foundation for any agent that needs to remember.