This article is a detailed performance analysis of 10 chunking strategies I built for Retrieval-Augmented Generation (RAG), applied to a legal ontology in Protégé. I tested both traditional text-based chunking and novel OWL-aware chunking strategies powered by the agenticmemory library. Performance varies significantly with ontology structure, metadata quality, and naming conventions: ModuleExtractionChunking achieved the highest OWL-aware score (0.7068) with exceptional consistency, while AnnotationBasedChunking (0.7010) offered fine-grained semantic grouping with 39 focused chunks.
👉 The Protégé plugin for the Lucene-based vector store is here: https://github.com/vishalmysore/lucene-protege-plugin
👉 The OWL ontology used in this article is here: https://github.com/vishalmysore/graphrag/blob/main/graphrag/ontologies/legal-case-management.owl
👉 More ontologies are here: https://github.com/vishalmysore/graphrag/tree/main/graphrag/ontologies
👉 The AgenticMemory package is here: https://github.com/vishalmysore/agenticmemory
Introduction
Chunking strategies are critical for RAG performance. The way I split knowledge into retrievable pieces directly impacts:
- Context relevance: Whether the retrieved chunks contain the information needed
- Answer accuracy: Whether the LLM receives complete vs. fragmented information
- Query performance: Search time and computational cost
Traditional text-based chunking (word, sentence, paragraph boundaries) treats all text equally. However, ontologies have rich semantic structure—class hierarchies, property domains, annotation patterns—that can inform smarter chunking decisions.
This study evaluates whether OWL-aware chunking strategies outperform traditional text-based approaches.
Methodology
Test Environment
- Platform: Protégé 5.6.7 with custom Lucene RAG plugin
- Ontology: Legal domain (195 total axioms)
- 3 cases (Smith v. Jones, State v. Doe, Appeal CV-2023-500)
- 3 courts (District, Appellate, Supreme)
- 4 judges and 3 lawyers
- 3 evidence items
- 2 statutes (Federal, State)
- Vector Store: Apache Lucene 9.8.0 with KnnFloatVectorField
- Embeddings: OpenAI text-embedding-3-small (1024 dimensions)
- LLM: GPT-4
- Test Query: "Which cases are currently active?"
Evaluation Metrics
- Chunk Count: Number of chunks created
- Top Similarity Score: Cosine similarity of best-matching chunk
- Answer Quality: Whether LLM provided correct, complete answer
- Chunk Distribution: How axioms were grouped
Results: Text-Based Chunking Strategies
1. WordChunking
Chunks Created: 58
Top Similarity: 0.7135
Answer Quality: ✅ Correct (both cases identified)
How It Works: Splits text at word boundaries, typically 100 words per chunk.
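For illustration, here is a minimal sketch of word-boundary chunking in plain Java (my own sketch, not the agenticmemory implementation; the 100-word window is taken from the description above):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Split text into chunks of at most maxWords whole words; words are never cut.
static List<String> wordChunks(String text, int maxWords) {
    String[] words = text.trim().split("\\s+");
    List<String> chunks = new ArrayList<>();
    for (int i = 0; i < words.length; i += maxWords) {
        int end = Math.min(i + maxWords, words.length);
        chunks.add(String.join(" ", Arrays.copyOfRange(words, i, end)));
    }
    return chunks;
}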
Performance:
- Created 58 focused chunks
- Each chunk contained 1-2 complete entities
- No fragmentation of entity names
- Example chunk: Full "Smith v. Jones" case with all properties
Best For:
- Structured data where entities < 100 words
- Short, self-contained knowledge items
- Clear word boundaries separate concepts
2. SentenceChunking
Chunks Created: 76
Top Similarity: 0.7258 (highest raw score)
Answer Quality: ❌ Incomplete ("Jones" instead of "Smith v. Jones")
How It Works: Splits text at sentence boundaries (periods, exclamation marks, question marks).
Performance:
- Created 76 smaller chunks (most of any strategy)
- Critical flaw: Fragmented entity names across chunks
- Highest similarity score but worst answer quality
- Example: "Jones" appeared in one chunk, "Smith v." in another
Problem Identified:
Chunk A: "...the case Smith v."
Chunk B: "Jones was filed in District Court..."
When LLM received Chunk B alone, it only saw "Jones" without "Smith v.", leading to incomplete answers.
Lesson Learned: Higher similarity scores don't guarantee better answers if chunks break semantic units.
3. ParagraphChunking
Chunks Created: 58
Top Similarity: 0.7141
Answer Quality: ✅ Correct (both cases identified)
How It Works: Splits text at paragraph boundaries (double newlines).
Performance:
- Identical results to WordChunking (58 chunks)
- This occurred because ontology entities had no paragraph breaks
- Every entity was under 100 words, so paragraph boundaries coincided with the word-based chunk boundaries
Best For:
- Long-form documentation with clear paragraph structure
- Articles, papers, documentation
- Not ideal for structured RDF/OWL data
4. FixedSizeChunking
Chunks Created: Unknown (at least 5)
Top Similarity: 0.7141
Answer Quality: ✅ Correct (Smith v. Jones identified as active)
How It Works: Fixed character/token limits regardless of content boundaries. Unlike other strategies, ignores semantic structure entirely.
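A minimal sketch of the idea, assuming a character-based limit (the plugin's actual unit and limit may differ):

import java.util.ArrayList;
import java.util.List;

// Cut text into fixed-length character windows, ignoring all content boundaries.
static List<String> fixedSizeChunks(String text, int maxChars) {
    List<String> chunks = new ArrayList<>();
    for (int i = 0; i < text.length(); i += maxChars) {
        chunks.add(text.substring(i, Math.min(i + maxChars, text.length())));
    }
    return chunks;
}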
Performance:
- Top similarity of 0.7141 (tied with ParagraphChunking)
- Clean, well-structured chunks with complete entity information
- Each chunk formatted with Class/Individual type, IRI, label, and properties
- Successfully retrieved all relevant case information
Example Chunks:
Chunk 1 (0.7141): Criminal Case definition
Chunk 2 (0.7096): Appeal of CV-2023-500 (complete individual)
Chunk 3 (0.6985): Smith v. Jones (complete individual)
Chunk 4 (0.6955): State v. Doe (complete individual)
Chunk 5 (0.6938): case status property
Key Observation: Despite ignoring semantic boundaries, FixedSizeChunking produced well-formed chunks because:
- Ontology entities are naturally compact (< 100 words each)
- RDF/OWL serialization creates natural boundaries
- Fixed size happened to align with entity boundaries
Best For:
- When entity size is consistent and predictable
- Performance-critical applications needing uniform computational load
- Baseline comparison for other strategies
Limitation: Would fragment entities if fixed size < entity size, or waste space if fixed size >> entity size.
Results: OWL-Aware Chunking Strategies
5. ClassBasedChunking
Chunks Created: 6
Top Similarity: 0.6964
Answer Quality: ✅ Correct
How It Works: Groups axioms by class hierarchies. Creates one chunk per class hierarchy plus one "orphan" chunk for non-hierarchy axioms.
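A simplified sketch of this grouping against the OWL API 4.x interfaces (the real strategy merges whole hierarchies; here each SubClassOf axiom is bucketed under its asserted superclass, and everything else becomes the orphan chunk):

import org.semanticweb.owlapi.model.*;
import java.util.*;

static Map<String, Set<OWLAxiom>> classBasedChunks(OWLOntology ont) {
    Map<String, Set<OWLAxiom>> chunks = new LinkedHashMap<>();
    // Bucket SubClassOf axioms by their named superclass.
    for (OWLSubClassOfAxiom ax : ont.getAxioms(AxiomType.SUBCLASS_OF)) {
        if (!ax.getSuperClass().isAnonymous()) {
            String key = ax.getSuperClass().asOWLClass().getIRI().getShortForm();
            chunks.computeIfAbsent(key, k -> new HashSet<>()).add(ax);
        }
    }
    // Everything that is not a hierarchy axiom lands in one orphan chunk.
    Set<OWLAxiom> orphans = new HashSet<>(ont.getAxioms());
    chunks.values().forEach(orphans::removeAll);
    chunks.put("orphan", orphans); // 183 of 195 axioms in this ontology
    return chunks;
}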
Chunk Distribution:
Chunk 0: Evidence hierarchy (2 axioms)
- DocumentEvidence
- PhysicalEvidence
Chunk 1: Statute hierarchy (2 axioms)
- FederalStatute
- StateStatute
Chunk 2: Court hierarchy (3 axioms)
- AppellateCourt
- DistrictCourt
- SupremeCourt
Chunk 3: Case hierarchy (3 axioms)
- CivilCase
- CriminalCase
- AppellateCase
Chunk 4: Person hierarchy (7 axioms)
- Judge, Lawyer
- DefenseAttorney, Prosecutor
- SupremeCourtJudge
- Plaintiff, Defendant
Chunk 5: Orphan axioms (183 axioms) ← DOMINANT CHUNK
- All individual assertions
- All property declarations
- All annotations
Key Finding: 183 of 195 axioms (93.8%) ended up in the "orphan" chunk because they were individual assertions, not class hierarchy definitions.
Best For:
- Queries about class relationships
- "What types of cases exist?"
- "What are the subclasses of Person?"
Limitation: Most real data (case details, property values) concentrated in massive orphan chunk.
6. AnnotationBasedChunking
Chunks Created: 39
Top Similarity: 0.7010
Answer Quality: ✅ Correct with best context
How It Works: Groups axioms by annotation label prefixes (first 3 characters).
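A sketch of the prefix bucketing with OWL API 4.x (simplified: an axiom referencing two entities can land in both buckets, so a real implementation would need to deduplicate):

import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.search.EntitySearcher;
import java.util.*;

static Map<String, Set<OWLAxiom>> annotationChunks(OWLOntology ont) {
    Map<String, Set<OWLAxiom>> chunks = new LinkedHashMap<>();
    for (OWLEntity e : ont.getSignature()) {
        String key = "no-annotations";
        // Use the first 3 characters of the entity's rdfs:label, if any.
        for (OWLAnnotation ann : EntitySearcher.getAnnotations(e, ont)) {
            if (ann.getProperty().isLabel() && ann.getValue() instanceof OWLLiteral) {
                String label = ((OWLLiteral) ann.getValue()).getLiteral();
                key = label.substring(0, Math.min(3, label.length())).toLowerCase();
                break;
            }
        }
        chunks.computeIfAbsent(key, k -> new HashSet<>())
              .addAll(ont.getReferencingAxioms(e));
    }
    return chunks;
}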
Chunk Distribution:
Top chunks by axiom count:
- no-annotations: 84 axioms (labels, ranges, domains)
- "cas" prefix: 26 axioms (Case, CaseNumber, CaseStatus, Case_SmithVsJones, Case_StateVsDoe)
- "sta" prefix: 24 axioms (Statute, StatuteCode, StateStatute, FederalStatute)
- "app" prefix: 17 axioms (AppellateCase, AppellateCourt, AppealsTo property, Case_AppealCV001)
- "fil" prefix: 12 axioms (FiledIn, FilingDate, all filing-related assertions)
- "cou" prefix: Court entities and properties
- "jud" prefix: Judge-related entities
- "law" prefix: Lawyer-related entities
- "evi" prefix: Evidence-related entities
- 30 other prefixes: Varying axiom counts (1-10 axioms each)
Total: 39 semantic chunks
Key Strengths:
- Semantic grouping: Entities with similar names usually have related meanings
- Balanced chunks: 39 focused chunks vs. 1 giant orphan chunk
- Complete entities: "Case_SmithVsJones" stayed with "CaseStatus", "CaseNumber"
- Effective retrieval: Top chunk (0.7010, no-annotations) contained complete case information with all labels
- Metadata-dependent: Performance relies heavily on consistent naming conventions
Example Query Flow:
Query: "Which cases are currently active?"
↓
Embedding matches "cas" prefix chunk (0.7010 similarity)
↓
Chunk contains: Case_SmithVsJones (Active), Case_StateVsDoe (Trial)
↓
GPT-4 receives complete case information
↓
Answer: ✅ "Smith v. Jones (Active), State v. Doe (Trial)"
7. NamespaceBasedChunking
Chunks Created: 6
Top Similarity: 0.6964
Answer Quality: ✅ Correct
How It Works: Splits axioms by IRI namespace prefixes.
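The core idea in a few lines of OWL API 4.x code (simplified: each axiom is assigned by the first entity in its signature; mixed-namespace axioms would need a policy):

import org.semanticweb.owlapi.model.*;
import java.util.*;

static Map<String, Set<OWLAxiom>> namespaceChunks(OWLOntology ont) {
    Map<String, Set<OWLAxiom>> chunks = new LinkedHashMap<>();
    for (OWLAxiom ax : ont.getAxioms()) {
        String ns = ax.getSignature().stream()
                .findFirst()
                .map(e -> e.getIRI().getNamespace())
                .orElse("no-namespace");
        chunks.computeIfAbsent(ns, k -> new HashSet<>()).add(ax);
    }
    return chunks; // one map entry per namespace; just one for this ontology
}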
Performance:
- Fell back to ClassBasedChunking: the legal ontology uses a single namespace (http://www.semanticweb.org/legal#)
- All chunk IDs showed "class-chunk-" rather than "namespace-chunk-"
- Identical results to ClassBasedChunking
When It Would Excel:
Scenario: Multi-ontology project
Chunk 1: http://www.semanticweb.org/legal# (your domain)
- Case, Court, Judge classes
Chunk 2: http://purl.org/dc/terms/ (Dublin Core)
- creator, date, title
Chunk 3: http://xmlns.com/foaf/0.1/ (FOAF)
- Person, Organization, name
Chunk 4: http://www.w3.org/2006/time# (OWL Time)
- Instant, Interval, before, after
Best For:
- Projects importing multiple external ontologies
- Separating domain concepts from metadata
- Modular ontology architectures
Limitation: Useless for single-namespace ontologies.
8. DepthBasedChunking
Chunks Created: 3
Top Similarity: 0.6967
Answer Quality: ✅ Correct
How It Works: Groups axioms by hierarchy depth level.
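A sketch of how asserted depth can be computed with OWL API 4.x; SubClassOf axioms can then be grouped by the depth of their superclass (assumes an acyclic asserted hierarchy):

import org.semanticweb.owlapi.model.*;

// Root classes get depth 0; e.g. Case -> 0, CivilCase -> 1, DefenseAttorney -> 2.
static int depth(OWLClass cls, OWLOntology ont) {
    int maxParent = -1;
    for (OWLSubClassOfAxiom ax : ont.getSubClassAxiomsForSubClass(cls)) {
        if (!ax.getSuperClass().isAnonymous()) {
            maxParent = Math.max(maxParent, depth(ax.getSuperClass().asOWLClass(), ont));
        }
    }
    return maxParent + 1;
}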
Chunk Distribution:
Chunk 1: Non-class axioms (183 axioms, depth: N/A)
- All individual assertions
- All property declarations
- All annotations
- Similarity: 0.6967 ← Top result
Chunk 2: Depth Level 0 (15 axioms)
- Direct subclass relationships
- Evidence → DocumentEvidence, PhysicalEvidence
- Court → DistrictCourt, AppellateCourt, SupremeCourt
- Case → CivilCase, CriminalCase, AppellateCase
- Person → Judge, Lawyer, Plaintiff, Defendant
- Statute → FederalStatute, StateStatute
- Similarity: 0.6717
Chunk 3: Depth Level 1 (2 axioms)
- Second-level subclass relationships
- Lawyer → DefenseAttorney, Prosecutor
- Judge → SupremeCourtJudge
- Similarity: 0.6226
Key Insight: Legal ontology has only 2 hierarchy depth levels, indicating relatively flat structure.
Hierarchy Analysis:
Level 0: Top-level concepts (Case, Court, Person, Evidence, Statute)
Level 1: Direct children (17 classes)
Level 2: Grandchildren (3 classes: DefenseAttorney, Prosecutor, SupremeCourtJudge)
Best For:
- Understanding ontology complexity
- Queries about abstraction levels
- "What are the top-level classes?"
- Structural analysis
Limitation: Most data still in non-class axioms chunk.
9. ModuleExtractionChunking
Chunks Created: 28
Top Similarity: 0.7068
Answer Quality: ✅ Correct (both cases identified)
How It Works:
- Extracts minimal, self-contained ontology modules using OWL API algorithms
- Each module is complete and independent
- Selects seed entities and pulls all related axioms
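The OWL API ships a syntactic locality module extractor that implements exactly this idea. A sketch seeded with the Case class (the seed is my choice for illustration; per the description above, the strategy selects seeds itself):

import org.semanticweb.owlapi.model.*;
import uk.ac.manchester.cs.owlapi.modularity.ModuleType;
import uk.ac.manchester.cs.owlapi.modularity.SyntacticLocalityModuleExtractor;
import java.util.Collections;
import java.util.Set;

// 'ontology' is the loaded legal ontology.
OWLOntologyManager man = ontology.getOWLOntologyManager();
SyntacticLocalityModuleExtractor extractor =
        new SyntacticLocalityModuleExtractor(man, ontology, ModuleType.STAR);
OWLEntity seed = man.getOWLDataFactory()
        .getOWLClass(IRI.create("http://www.semanticweb.org/legal#Case"));
// The module contains the seed plus every axiom needed to interpret it.
Set<OWLAxiom> module = extractor.extract(Collections.singleton(seed));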
Performance:
- Created 28 modules from 195-axiom ontology
- Remarkably tight similarity clustering: Top 5 scores range 0.7056-0.7068 (0.0012 spread)
- Top chunk: 132 axioms with 4 seed entities
- Most balanced retrieval: All top results highly relevant
Example Module:
module-chunk-1 (0.7068):
- 132 axioms
- 4 seed entities (likely: Case, Court, Judge, Lawyer)
- Complete case information with all dependencies
Key Insight:
- Highest top score among OWL-aware strategies (0.7068 vs 0.7010 AnnotationBased)
- Most consistent retrieval: Tiny 0.0012 variance in top-5 scores
- All top chunks equally useful (any could answer the query)
- Produces self-contained, logically complete modules
Why It Excels:
- Dependency closure: Each module includes all related axioms
- Semantic completeness: No fragmented information
- Multiple relevant modules: Different seed entities = different perspectives
- Logical coherence: Uses OWL reasoning to determine relationships
Best For:
- Large, complex ontologies where relationships span many axioms
- Queries requiring complete context (all properties, relationships)
- Modular ontology architectures
- Distributed knowledge bases
- When consistency matters more than granularity
Trade-off:
- Larger chunks (avg ~7 axioms) vs AnnotationBased (~5 axioms)
- Fewer chunks (28 vs 39) = less storage, faster indexing
- But superior retrieval consistency
10. SizeBasedChunking
Status: Not tested in this study
Configuration: 50 axioms per chunk
Expected Behavior:
- Fixed axiom count per chunk
- Maintains entity coherence (keeps related axioms together)
- If single entity > 50 axioms, creates dedicated chunk
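A naive sketch of the partitioning (the entity-coherence logic described above is omitted here; the real strategy keeps related axioms together):

import org.semanticweb.owlapi.model.OWLAxiom;
import java.util.*;

static List<List<OWLAxiom>> sizeChunks(Collection<OWLAxiom> axioms, int maxAxioms) {
    List<OWLAxiom> all = new ArrayList<>(axioms);
    List<List<OWLAxiom>> chunks = new ArrayList<>();
    for (int i = 0; i < all.size(); i += maxAxioms) {
        chunks.add(all.subList(i, Math.min(i + maxAxioms, all.size())));
    }
    return chunks; // 195 axioms at 50 per chunk -> 4 chunks
}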
Best For:
- Consistent computational load
- Predictable memory usage
- Balanced query performance
Comparative Analysis
Similarity Scores Ranking
| Rank | Strategy | Score | Answer Quality | Retrieval Consistency |
|---|---|---|---|---|
| 1 | SentenceChunking | 0.7258 | ❌ Incomplete | Low |
| 2 | FixedSizeChunking | 0.7141 | ✅ Correct | Medium |
| 2 | ParagraphChunking | 0.7141 | ✅ Correct | Medium |
| 4 | WordChunking | 0.7135 | ✅ Correct | Medium |
| 5 | ModuleExtraction | 0.7068 | ✅ Correct | ⭐ Highest (0.0012 variance) |
| 6 | AnnotationBased | 0.7010 | ✅ Best Context | High |
| 7 | DepthBased | 0.6967 | ✅ Correct | Medium |
| 8 | ClassBased | 0.6964 | ✅ Correct | Medium |
| 8 | NamespaceBased | 0.6964 | ✅ Correct | Medium |
Critical Observations:
- Highest similarity score ≠ best answer quality (SentenceChunking fragmented entities)
- ModuleExtraction: Highest OWL-aware score (0.7068) + most consistent retrieval (0.0012 variance)
- AnnotationBased: Fine-grained grouping (39 chunks) effective when metadata quality is high
- Performance highly dependent on ontology design and metadata conventions
Chunk Count Analysis
| Strategy | Chunks | Average Size | Distribution |
|---|---|---|---|
| SentenceChunking | 76 | 2.6 axioms | Very unbalanced |
| WordChunking | 58 | 3.4 axioms | Balanced |
| ParagraphChunking | 58 | 3.4 axioms | Balanced |
| FixedSizeChunking | Unknown | Unknown | Appears balanced |
| ModuleExtraction | 28 | ~7 axioms | Varies by module |
| AnnotationBased | 39 | ~5 axioms | Well-balanced |
| ClassBased | 6 | 32.5 axioms | Highly unbalanced (1 giant) |
| DepthBased | 3 | 65 axioms | Highly unbalanced |
| NamespaceBased | 6 | 32.5 axioms | Highly unbalanced |
Note: ModuleExtraction creates semantically complete modules (e.g., top chunk had 132 axioms with 4 seed entities), making "average size" less meaningful than for text-based strategies.
Insight: More chunks ≠ better retrieval. Balance matters more than count.
The Orphan Axiom Problem
Definition: Axioms not part of class hierarchy definitions (individual assertions, property declarations, annotations).
Prevalence: 183 of 195 axioms (93.8%) in legal ontology.
Impact on OWL-Aware Strategies:
- ClassBased: 183-axiom orphan chunk dominates
- DepthBased: 183-axiom non-class chunk dominates
- AnnotationBased: Splits orphans into 39 semantic groups (effective when naming conventions exist)
Why This Matters: Real ontologies contain mostly ABox data (individuals), not TBox data (class definitions). Strategies that handle orphan axioms well will perform better in practice.
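One quick way to gauge how big the orphan problem is for your own ontology is the OWL API's built-in ABox/TBox split (OWL API 4.x; note that annotation and property-declaration axioms fall outside both sets):

import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.parameters.Imports;

// 'ontology' is the loaded ontology.
int abox = ontology.getABoxAxioms(Imports.INCLUDED).size(); // individual assertions
int tbox = ontology.getTBoxAxioms(Imports.INCLUDED).size(); // class-level axioms
System.out.printf("ABox: %d, TBox: %d (%.1f%% ABox)%n",
        abox, tbox, 100.0 * abox / (abox + tbox));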
Key Findings
1. Performance Depends on Ontology Characteristics
- ModuleExtractionChunking: Highest OWL-aware score (0.7068) + best consistency (0.0012 variance)
- AnnotationBasedChunking: Fine-grained semantic grouping (39 chunks), effective when naming conventions are consistent
- ParagraphChunking/FixedSizeChunking: Highest correct-answer scores (0.7141); WordChunking (0.7135) close behind with the simplest implementation
- Optimal strategy depends on: ontology size, hierarchy depth, metadata quality, naming patterns
2. High Similarity Scores Can Mislead
- SentenceChunking: 0.7258 score but fragmented entities
- Chunk boundaries matter more than matching algorithms
- Semantic completeness > mathematical similarity
3. OWL-Aware Strategies Excel in Specific Contexts
- ModuleExtraction: Best for completeness and consistency (0.7068, 0.0012 variance)
- AnnotationBased: Effective when naming conventions exist (requires metadata)
- ClassBased: Ideal for hierarchy queries
- DepthBased: Excellent for structural analysis
- NamespaceBased: Essential for multi-ontology projects
4. The "Orphan Axiom" Challenge
- 93.8% of axioms are non-hierarchical
- Traditional OWL-aware strategies struggle with this
- AnnotationBased solution: semantic naming patterns
5. Ontology Structure Influences Strategy Selection
- Flat hierarchy (2 levels): DepthBased produces only 3 chunks
- Single namespace: NamespaceBased reverts to ClassBased
- Entity size (< 100 words): WordChunking = ParagraphChunking
Recommendations
Strategy Selection Depends on Context
If ontology has consistent naming conventions (e.g., "case_", "judge_"):
- AnnotationBasedChunking: Creates semantic groups automatically (39 chunks, 0.7010)
- Requires well-designed metadata with prefix patterns
If ontology has complex relationships requiring complete context:
- ModuleExtractionChunking: Highest accuracy (0.7068) + exceptional consistency (0.0012 variance)
- Best for: Large ontologies, distributed knowledge bases
If seeking simplicity without OWL dependencies:
- WordChunking: High performance (0.7135), no metadata required
- ParagraphChunking: Good for documentation-style ontologies
Avoid in all cases:
- SentenceChunking: Fragments entity names despite high scores
For Different Ontology Types
Deep Hierarchy Ontologies (5+ levels)
Use: DepthBasedChunking
- Reveals abstraction layers
- Good for "What are the most general concepts?" queries
Multi-Ontology Projects
Use: NamespaceBasedChunking
- Clean separation between imported ontologies
- Prevents cross-ontology confusion
Small, Class-Focused Ontologies
Use: ClassBasedChunking
- Efficient for hierarchy queries
- Works when most axioms are class definitions
Large, Complex Ontologies
Consider: ModuleExtractionChunking
- Highest OWL-aware score (0.7068)
- Self-contained modules with dependency closure
- Exceptional consistency (0.0012 variance)
- Best scalability
Performance-Critical Applications
Use: SizeBasedChunking
- Predictable computational cost
- Balanced load distribution
Technical Implementation Notes
Lucene Vector Store Configuration
LuceneVectorStore vectorStore = new LuceneVectorStore(
"./lucene_index", // File-based storage
1024 // Max dimensions (Lucene 9.8.0 limit)
);
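For context, here is a sketch of how a chunk can be indexed and searched with Lucene's stock KNN classes (not the plugin's internal code; chunkText, chunkVector, and queryVector are placeholders):

import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

float[] chunkVector = new float[1024]; // placeholder: embedding of the chunk
float[] queryVector = new float[1024]; // placeholder: embedding of the query
String chunkText = "Case_SmithVsJones ...";

// Index one chunk as a KNN vector document.
try (IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get("./lucene_index")), new IndexWriterConfig())) {
    Document doc = new Document();
    doc.add(new StringField("id", "annotation-chunk-10", Field.Store.YES));
    doc.add(new StoredField("text", chunkText));
    doc.add(new KnnFloatVectorField("embedding", chunkVector,
            VectorSimilarityFunction.COSINE));
    writer.addDocument(doc);
}

// Run a top-5 cosine KNN search.
try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("./lucene_index")))) {
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs top = searcher.search(new KnnFloatVectorQuery("embedding", queryVector, 5), 5);
    for (ScoreDoc sd : top.scoreDocs) {
        System.out.println(sd.score + " -> " + searcher.doc(sd.doc).get("id"));
    }
}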
Chunking Strategy Selection (Java)
// In RagService.java
if (chunkingStrategy.equals("AnnotationBasedChunking")) {
AnnotationBasedChunker chunker = new AnnotationBasedChunker();
List<OWLChunk> chunks = chunker.chunk(ontology);
for (OWLChunk chunk : chunks) {
String chunkId = chunk.getId(); // "annotation-chunk-10"
String text = chunk.toOWLString(); // Manchester syntax
String strategy = chunk.getStrategy(); // "Annotation-Based"
int axiomCount = chunk.getAxiomCount();
// Create embedding and store ('metadata' map, built from strategy and axiomCount, omitted here)
List<Float> embedding = embeddingService.createEmbedding(text);
vectorStore.upsert(chunkId, embedding, text, metadata);
}
}
OpenAI Embedding Generation
// EmbeddingService.java
List<Float> embedding = embeddingService.createEmbedding(
chunkText,
"text-embedding-3-small",
1024 // Dimensions
);
Similarity Score Calculation
- Formula: Cosine similarity = (A · B) / (||A|| × ||B||)
- Implementation: Built into Lucene's VectorSimilarityFunction.COSINE
- Range: 0.0 (unrelated) to 1.0 (identical)
- Typical scores: 0.65-0.75 for relevant chunks
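For reference, the same formula in plain Java (the plugin relies on Lucene's built-in implementation):

// Cosine similarity: (A . B) / (||A|| x ||B||)
static double cosine(float[] a, float[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}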
Limitations of This Study
- Single ontology tested: Results specific to legal domain with consistent naming conventions
- Small scale: 195 axioms; performance at 10,000+ axioms unknown
- Single query type: "Which cases are active?" tests factual retrieval only
- Metadata-dependent: AnnotationBased performance assumes well-structured naming
- No hybrid testing: Didn't test combinations of strategies
- Limited query diversity: Different query types may favor different strategies
Future Research Directions
1. Hybrid Chunking Strategies
Combine multiple approaches:
- AnnotationBased for orphan axioms
- ClassBased for hierarchy axioms
- Could achieve best of both worlds
2. Dynamic Strategy Selection
AI-powered strategy selection:
- Analyze ontology structure
- Choose optimal strategy automatically
- Adapt based on query patterns
3. Custom Chunking Rules
Domain-specific configurations:
- Legal: Group by case type
- Medical: Group by diagnosis
- E-commerce: Group by product category
4. Large-Scale Testing
Evaluate on:
- SNOMED CT (300,000+ concepts)
- Gene Ontology (45,000+ terms)
- DBpedia (6M+ entities)
5. Multi-Modal Chunking
Incorporate:
- Text content
- Visual diagrams
- Audio annotations
- Temporal data
Conclusion
OWL-aware chunking strategies represent a significant advancement in RAG for ontologies. My analysis demonstrates that no single strategy is universally optimal—performance depends critically on:
- Ontology structure: Hierarchy depth, namespace diversity, entity relationships
- Metadata quality: Consistent naming conventions, annotation completeness
- Query patterns: Specific facts vs. structural understanding
- Implementation priorities: Accuracy vs. simplicity vs. performance
Key Insights
Highest Scores Don't Guarantee Best Answers: SentenceChunking achieved 0.7258 but fragmented entities, while lower-scoring strategies with semantic completeness produced correct answers.
Metadata Matters: AnnotationBasedChunking (0.7010, 39 chunks) excels only when ontologies follow consistent naming conventions. Poor metadata quality degrades it to random grouping.
Consistency vs. Peak Score: ModuleExtractionChunking (0.7068) achieved the highest OWL-aware score with remarkable consistency (0.0012 variance), making all top results equally useful.
Practical Guidance
- Analyze your ontology first: Structure, metadata quality, naming patterns
- Test multiple strategies on representative queries from your domain
- Prioritize semantic completeness over raw similarity scores
- Consider hybrid approaches for complex multi-domain ontologies
- Match strategy to use case: Completeness (ModuleExtraction), granularity (AnnotationBased), simplicity (Word)
The agenticmemory library's OWL-aware chunking strategies provide powerful tools for knowledge graph RAG, but their effectiveness depends on matching strategy to ontology characteristics. As ontology-based AI systems proliferate, sophisticated chunking will become increasingly important for retrieval quality.
References
- agenticmemory Library: https://github.com/vishalmysore/agenticmemory
- Apache Lucene KNN: https://lucene.apache.org/core/9_8_0/core/org/apache/lucene/document/KnnFloatVectorField.html
- OpenAI Embeddings API: https://platform.openai.com/docs/guides/embeddings
- OWL API Documentation: http://owlcs.github.io/owlapi/
- Protégé Plugin Development: https://protegewiki.stanford.edu/wiki/PluginDevelopment
Acknowledgments
This research was conducted using:
- Protégé 5.6.7: Stanford Center for Biomedical Informatics Research
- agenticmemory 0.1.1: Vishal Mysore
- Apache Lucene 9.8.0: Apache Software Foundation
- OpenAI GPT-4 & Embeddings: OpenAI
- Neo4j 4.4.13: Neo4j, Inc.
Appendix: Complete Test Data
Legal Ontology Statistics
- Total Axioms: 195
- Classes: 17
- Object Properties: 7
- Data Properties: 6
- Individuals: 21 (3 cases, 3 courts, 4 judges, 3 lawyers, 2 people, 3 evidence, 2 statutes, 1 verdict)
- Annotations: 84 rdfs:label assertions
- Hierarchy Depth: 2 levels maximum
- Namespaces: 1 (http://www.semanticweb.org/legal#)
Chunk Distribution Visualization
ClassBasedChunking (6 chunks):
███████████████████████████████████████████████████████ Orphan (183)
█ Evidence (2)
█ Statute (2)
█ Court (3)
█ Case (3)
██ Person (7)
AnnotationBasedChunking (39 chunks):
███████ no-annotations (84)
███ cas (26)
███ sta (24)
██ app (17)
█ fil (12)
█ [34 other prefixes with varying axiom counts]
DepthBasedChunking (3 chunks):
███████████████████████████████████████████████████████ Non-class (183)
██ Depth-0 (15)
█ Depth-1 (2)