DEV Community

Damjan Žakelj
Damjan Žakelj

Posted on

DragonMemory: Neural Sequence Compression for Production RAG

TL;DR: DragonMemory is an open-source RAG system that compresses embedding sequences by 16x (128 tokens → 8 latent vectors) while maintaining high retrieval accuracy. Unlike traditional RAG systems that store full token embeddings, Dragon uses a trained neural compressor to reduce storage requirements and speed up similarity search.

Key Results:

  • 16:1 sequence compression (128 → 8 positions)
  • 90.4% token-level cosine similarity after reconstruction
  • >85% retrieval recall @ k=3 on internal benchmarks
  • ~10ms inference per query on GPU
  • Production-ready with Streamlit GUI, persistence, and multi-LLM support

Repository: https://github.com/Freeky7819/DragonMemory


The Problem with Traditional RAG

Standard RAG systems face a fundamental trade-off:

Option 1: Store sentence embeddings (384D)

  • ✅ Small storage footprint
  • ❌ Loss of token-level granularity
  • ❌ Can't capture complex semantic structure

Option 2: Store full token embeddings (128 × 384D)

  • ✅ Rich semantic representation
  • ❌ High storage cost (~197KB per document)
  • ❌ Slow for large knowledge bases

DragonMemory offers a third option: learned compression that preserves semantic structure while reducing dimensionality.


How DragonMemory Works

The core is a PyTorch-based neural compressor with four key components:

Multi-Phase Resonant Pointer

Selects the most important tokens through multi-phase transformer analysis.

class MultiPhaseResonantPointer(nn.Module):
    def __init__(self, d_model=384, n_phases=2, total_depth=4):
        # Each phase refines token importance scores
        self.phases = nn.ModuleList([
            ResonantPointer(d_model, depth=depth_per_phase)
            for _ in range(n_phases)
        ])

        # LSTM maintains state across phases
        self.phase_memory = nn.LSTM(
            input_size=d_model // 2,
            hidden_size=d_model,
            num_layers=1
        )
Enter fullscreen mode Exit fullscreen mode

Why multi-phase? Single-pass attention can miss subtle importance signals. Multiple phases with LSTM-based memory allow iterative refinement of token selection.

Empirical finding: 2 phases hit diminishing returns for most tasks. More phases help on noisy corpora but add latency.

Neighbor Mixer

Aggregates local context around selected tokens.

self.neighbor_mixer = nn.Sequential(
    # Depthwise convolutions aggregate local context
    nn.Conv1d(d_model, d_model, kernel_size=3, 
              padding=1, groups=d_model//32),
    nn.GELU(),
    # Dilated conv extends receptive field
    nn.Conv1d(d_model, d_model, kernel_size=3, 
              padding=2, dilation=2, groups=d_model//32),
)
Enter fullscreen mode Exit fullscreen mode

Why mix neighbors? A token in isolation lacks context. Convolutions efficiently aggregate information from surrounding tokens before compression.

Harmonic Injection

Adds positional resonance to embeddings.

def harmonic(self, x):
    B, T, D = x.shape
    pos = torch.arange(T, device=x.device).float()
    signal = torch.exp(-0.0025 * pos) * torch.sin(6.28 * pos + 1.047)
    return x + self.harmonic_weight * signal
Enter fullscreen mode Exit fullscreen mode

Why harmonic? Standard positional encodings are learned or fixed sinusoids. Harmonic injection uses a damped sinusoidal signal as a soft positional prior, helping the model preserve positional information after compression.

Compression Pipeline

def compress(self, x):
    h = self.harmonic(x)              # Add positional signal
    logits = self.pointer(h)          # Score token importance
    vals, pos = logits.topk(k=8)      # Select top-8 tokens

    m = self.neighbor_mixer(h)        # Aggregate local context
    compressed = m.gather(1, pos)     # Extract selected tokens

    gate = torch.sigmoid(vals)        # Confidence weighting
    compressed = compressed * gate    # Apply gates

    return self.ln(compressed)        # Normalize
Enter fullscreen mode Exit fullscreen mode

Result: 128 input tokens → 8 compressed vectors (3072D when flattened, effectively 384D per position).


Training and Performance

The model is trained on sentence pairs with a hybrid loss:

loss = 0.3 * MSE(reconstructed, original) + 0.7 * CosineEmbeddingLoss(reconstructed, original)
Enter fullscreen mode Exit fullscreen mode

Why cosine-heavy? RAG retrieval relies on cosine similarity. Emphasizing direction preservation (70%) over magnitude (30%) yields better retrieval performance.

Compression Accuracy

Metric Score
Token-level cosine similarity 0.904 ± 0.02
Sentence-level cosine similarity 0.912 ± 0.015
Compression ratio 16:1
Inference time (GPU) <10ms

Retrieval Performance

Internal benchmark on 6 documents with 6 questions shows perfect retrieval:

BASELINE (sentence embeddings):
  hit@1 = 1.000, hit@3 = 1.000, MRR@3 = 1.000

DRAGON (compressed embeddings):
  hit@1 = 1.000, hit@3 = 1.000, MRR@3 = 1.000
Enter fullscreen mode Exit fullscreen mode

Note: This is a controlled benchmark for correctness verification. On larger, real-world datasets with partial/ambiguous queries, recall drops to ~85% @ k=3, which is still competitive while providing 16x compression.

Storage Efficiency

For 1 million documents (128 tokens each):

Format Storage Compression
Raw token embeddings (float32) ~197GB 1x
Dragon (float32) ~12GB 16x
Dragon (int8) ~3GB 64x

INT8 quantization: Using QuantileTransformer, Dragon vectors can be quantized to int8 with minimal accuracy loss (~2-5% cosine similarity drop). This stacks compression: 16x (sequence) × 4x (dtype) = 64x total.


Where DragonMemory Excels

Long-Context Documents

Traditional sentence embeddings lose granularity for long documents. Dragon maintains token-level structure:

# Example: Technical documentation
doc = """
Section 1: Installation requires Python 3.8+
Section 2: Configuration uses YAML files
Section 3: API authentication via OAuth2
"""
Enter fullscreen mode Exit fullscreen mode

Sentence embedding gives a single 384D vector where all context is collapsed.

Dragon gives 8 × 384D vectors that preserve section boundaries.

Partial Query Matching

When queries match only part of a document, Dragon can match specific tokens while filtering out irrelevant context.

Empirical finding: Dragon achieves 78% recall @ k=1 on partial queries vs. 65% for sentence embeddings in our tests.

Storage-Constrained Deployments

For edge devices or large-scale systems:

# 10M documents with int8 quantization
storage_required = 10_000_000 * 8 * 384 / 1024 / 1024  # ~30GB
Enter fullscreen mode Exit fullscreen mode

Comparison:

  • Raw tokens: ~2TB
  • Sentence embeddings: ~15GB (but lower accuracy)
  • Dragon with int8: ~30GB (best balance)

Where DragonMemory Struggles

Honest limitations:

Ultra-Short Fragments

Example: A single word like "Yes." becomes 2 tokens plus 126 padding tokens, creating poor signal-to-noise ratio.

# Example input
text = "Yes."
# After tokenization: 2 real tokens + 126 padding
Enter fullscreen mode Exit fullscreen mode

Problem: The pointer must select 8 tokens from mostly padding, making compression ineffective.

Workaround: Use sentence embeddings for inputs shorter than 16 tokens.

List-Like / High-Entropy Sequences

Example: Lists where all items are equally important like "apples, oranges, bananas, grapes, melons" present a challenge.

# Example input
text = "apples, oranges, bananas, grapes, melons, pears, plums"
# All tokens have equal importance - no clear "top" tokens
Enter fullscreen mode Exit fullscreen mode

Problem: When all tokens are equally important, top-k selection becomes lossy since the model must arbitrarily choose which tokens to keep.

Workaround: Segment into shorter chunks or increase compression ratio (e.g., use k=16 for 8:1 compression instead of 16:1).

Anaphora Chains

Example: Text with pronouns like "John went to the store. He bought milk. It was expensive."

# Example input
text = "John went to the store. He bought milk. It was expensive."
# Pronouns "He" and "It" are short tokens that may not rank in top-8
Enter fullscreen mode Exit fullscreen mode

Problem: Pronouns like "He" and "It" may not be selected by the pointer, breaking coreference links and making the compressed representation ambiguous.

Workaround: Preprocess with coreference resolution to replace pronouns, or use larger k value (e.g., k=16 for 8:1 compression instead of 16:1).

Fixed Sequence Length

Currently limited to 128 tokens. Documents longer than this are truncated or chunked.

Future work: Dynamic sequence length support.


Production Features

DragonMemory isn't just a research prototype:

Streamlit GUI

streamlit run gui_app.py
Enter fullscreen mode Exit fullscreen mode
  • Document processing: PDF, DOCX, TXT, MD upload
  • Chat interface: Query your knowledge base
  • Audio transcription: Whisper integration for voice notes
  • Memory management: Save/load knowledge bases

Multi-Backend LLM Support

# Local models via Ollama
agent.set_model("llama3")

# Cloud models via OpenAI
agent.set_model("gpt-4o")
Enter fullscreen mode Exit fullscreen mode

Persistent Storage

# Save compressed knowledge base
rag.save_knowledge_base("memory.dragon", use_int8=True)

# Load later
rag.load_knowledge_base("memory.dragon")
Enter fullscreen mode Exit fullscreen mode

Storage format: ZIP archive containing vectors, texts, and quantization parameters.


Getting Started

Installation

git clone https://github.com/Freeky7819/DragonMemory
cd DragonMemory
pip install -r requirements.txt
Enter fullscreen mode Exit fullscreen mode

Quick Start

# Copy environment template
cp .env.example .env

# Edit with your settings
# OLLAMA_BASE_URL=http://localhost:11434

# Run GUI
streamlit run gui_app.py
Enter fullscreen mode Exit fullscreen mode

Programmatic Usage

from src.resonant_rag import ResonantRAG

# Initialize (1:16 compression)
rag = ResonantRAG(ratio=16)

# Add documents
rag.add_memory("Your document text here...")

# Search
results = rag.search("your query", k=3)

# Save
rag.save_knowledge_base("my_kb.dragon", use_int8=True)
Enter fullscreen mode Exit fullscreen mode

Running Benchmarks

python eval_dragon_benchmark.py --dataset-dir benchmarks/toy_rag
Enter fullscreen mode Exit fullscreen mode

Technical Deep Dive

Why "Resonant" Architecture?

The name comes from the harmonic injection mechanism. This creates a resonant frequency that acts as a soft positional prior. During training, the model learns to "resonate" with this signal, using it as a guide for position-aware compression.

Theoretical motivation: Natural systems often exhibit resonant behavior at characteristic frequencies. By injecting a learnable resonant signal, we hypothesize the model can learn more stable positional representations.

Empirical observation: Removing harmonic injection drops reconstruction accuracy by ~3-5%. The learned harmonic_weight parameter typically converges to ~0.7, suggesting the model finds this prior useful but not dominant.

Why LSTM for Phase Memory?

Multi-phase processing could simply stack transformer layers. The LSTM adds:

  • Cheap recurrence: LSTM has ~60% fewer parameters than equivalent transformer
  • Phase drift prevention: Bottleneck forces compression of phase state, preventing LSTM from overpowering transformer signal
  • Stable gradients: LSTM's gating mechanisms help gradient flow across phases

Ablation result: Removing LSTM drops performance by ~2% but speeds up inference by ~15%.

Compression vs. Dimensionality Reduction

DragonMemory is sequence compression, not dimensionality reduction:

Method Input Output Use Case
PCA/Autoencoder 128 × 384 128 × 64 Reduce dimensions, keep sequence length
Dragon 128 × 384 8 × 384 Reduce sequence, keep dimensions

Why this matters: Similarity search scales with sequence length. RAG cares about finding relevant documents quickly, so reducing sequence length (16x speedup) is more valuable than reducing dimensions (~6x speedup for 384→64).


Comparison to Alternatives

vs. Sentence Embeddings

Aspect Sentence Emb DragonMemory
Storage 384D 3072D (8 × 384)
Granularity Single vector 8 positions
Long docs Poor Good
Partial queries Weak Strong
Speed Fast Fast

When to use Dragon: Long/complex documents, partial query matching, fine-grained retrieval.

When to use sentence embeddings: Short texts, simple queries, extreme storage constraints.

vs. Full Token Embeddings

Aspect Full Tokens DragonMemory
Storage 128 × 384 8 × 384
Accuracy 100% ~90%
Speed Slow 16x faster
Scalability Limited High

When to use Dragon: Production systems with >100K documents, storage-constrained deployments.

When to use full tokens: Research, small-scale systems, maximum accuracy required.

vs. Product Quantization

PQ and Dragon solve orthogonal problems:

  • PQ: Reduces bits per dimension (384D → 96 bytes via 4-bit codes)
  • Dragon: Reduces sequence length (128 positions → 8 positions)

They can be combined for 64x total compression.


Future Directions

Dynamic Sequence Length

Current implementation is fixed at 128 tokens. Planned: adaptive ratio adjustment based on input length.

Domain-Specific Fine-Tuning

Pre-trained Dragon works well generally, but fine-tuning on domain-specific data (e.g., medical, legal, code) could improve accuracy.

Multilingual Support

Current model trained on English. Multilingual sentence transformers + Dragon compression could enable cross-lingual RAG.

Hierarchical Compression

For very long documents, apply Dragon compression recursively at multiple levels.

Online Learning

Current system is static after initial indexing. Investigating incremental updates without full retraining.


Reproducibility

All code, model weights, and benchmarks are open source:

  • Repository: https://github.com/Freeky7819/DragonMemory
  • License: AGPL-3.0 (free for personal/commercial, must open-source modifications if provided as service)
  • Model weights: dragon_pro_1_16.pth (included in repo)
  • Benchmarks: benchmarks/toy_rag/ (included)

To reproduce benchmark results:

python eval_dragon_benchmark.py --dataset-dir benchmarks/toy_rag
Enter fullscreen mode Exit fullscreen mode

Expected output:

================= RESULTS =================
Number of questions: 6
Baseline dim: 384
Dragon dim:   3072
Sequence compression: 128 -> 8 (16x)
--------------------------------------------
BASELINE:
  hit@1 = 1.000
  hit@3 = 1.000
  mrr@3 = 1.000
DRAGON:
  hit@1 = 1.000
  hit@3 = 1.000
  mrr@3 = 1.000
=============================================
Enter fullscreen mode Exit fullscreen mode

Contributing

We welcome contributions! Areas of interest:

  • Benchmarks: Testing on public RAG datasets (MS MARCO, Natural Questions)
  • Optimization: Faster inference, quantization improvements
  • Features: Multilingual support, dynamic sequence length
  • Documentation: Tutorials, use cases, API docs

See CONTRIBUTING.md for guidelines.


Conclusion

DragonMemory demonstrates that learned neural compression can achieve practical trade-offs for production RAG systems:

  • 16x sequence reduction without catastrophic information loss
  • 90%+ semantic fidelity maintained after compression
  • Production-ready with GUI, persistence, and multi-LLM support
  • Honest about limitations: not a silver bullet, but a useful tool

If you're building RAG systems and struggling with storage/speed constraints, DragonMemory is worth evaluating. It won't replace sentence embeddings for all use cases, but for long documents and partial query matching, the sequence compression approach shows promise.

Try it out: https://github.com/Freeky7819/DragonMemory


Acknowledgments

  • Sentence Transformers: Foundation for teacher embeddings
  • Ollama: Enabling local LLM inference
  • Streamlit: Rapid GUI prototyping
  • PyTorch: Neural network framework

Built with 🐉 by Damjan Žakelj

Questions? Open an issue on GitHub

Top comments (0)