This article is a detailed performance analysis of 10 chunking strategies I built for Retrieval-Augmented Generation (RAG), applied to a legal ontology in Protégé. I tested both traditional text-based chunking and novel OWL-aware chunking strategies powered by the agenticmemory library. Performance varies significantly with ontology structure, metadata quality, and naming conventions: ModuleExtractionChunking achieved the highest OWL-aware score (0.7068) with exceptional consistency, while AnnotationBasedChunking (0.7010) offered fine-grained semantic grouping with 39 focused chunks.
👉 The Protégé plugin for the Lucene-based vector store is here: https://github.com/vishalmysore/lucene-protege-plugin
👉 The OWL ontology used in this article is here: https://github.com/vishalmysore/graphrag/blob/main/graphrag/ontologies/legal-case-management.owl
👉 More ontologies are here: https://github.com/vishalmysore/graphrag/tree/main/graphrag/ontologies
👉 The AgenticMemory package is here: https://github.com/vishalmysore/agenticmemory
Introduction
Chunking strategies are critical for RAG performance. The way I split knowledge into retrievable pieces directly impacts:
- Context relevance: Whether the retrieved chunks contain the information needed
- Answer accuracy: Whether the LLM receives complete vs. fragmented information
- Query performance: Search time and computational cost
Traditional text-based chunking (word, sentence, paragraph boundaries) treats all text equally. However, ontologies have rich semantic structure—class hierarchies, property domains, annotation patterns—that can inform smarter chunking decisions.
This study evaluates whether OWL-aware chunking strategies outperform traditional text-based approaches.
Methodology
Test Environment
- Platform: Protégé 5.6.7 with custom Lucene RAG plugin
- Ontology: Legal domain (195 total axioms)
- 3 cases (Smith v. Jones, State v. Doe, Appeal CV-2023-500)
- 3 courts (District, Appellate, Supreme)
- 4 judges and 3 lawyers
- 3 evidence items
- 2 statutes (Federal, State)
- Vector Store: Apache Lucene 9.8.0 with KnnFloatVectorField
- Embeddings: OpenAI text-embedding-3-small (1024 dimensions)
- LLM: GPT-4
- Test Query: "Which cases are currently active?"
Evaluation Metrics
- Chunk Count: Number of chunks created
- Top Similarity Score: Cosine similarity of best-matching chunk
- Answer Quality: Whether LLM provided correct, complete answer
- Chunk Distribution: How axioms were grouped
Results: Text-Based Chunking Strategies
1. WordChunking
Chunks Created: 58
Top Similarity: 0.7135
Answer Quality: ✅ Correct (both cases identified)
How It Works: Splits text at word boundaries, typically 100 words per chunk.
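For illustration, here is a minimal sketch of word-boundary chunking in plain Java (my own sketch, not the agenticmemory implementation; the 100-word window is taken from the description above):

import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Split text into chunks of at most maxWords whole words; words are never cut.
static List<String> wordChunks(String text, int maxWords) {
    String[] words = text.trim().split("\\s+");
    List<String> chunks = new ArrayList<>();
    for (int i = 0; i < words.length; i += maxWords) {
        int end = Math.min(i + maxWords, words.length);
        chunks.add(String.join(" ", Arrays.copyOfRange(words, i, end)));
    }
    return chunks;
}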
Performance:
- Created 58 focused chunks
- Each chunk contained 1-2 complete entities
- No fragmentation of entity names
- Example chunk: Full "Smith v. Jones" case with all properties
Best For:
- Structured data where entities < 100 words
- Short, self-contained knowledge items
- Clear word boundaries separate concepts
2. SentenceChunking
Chunks Created: 76
Top Similarity: 0.7258 (highest raw score)
Answer Quality: ❌ Incomplete ("Jones" instead of "Smith v. Jones")
How It Works: Splits text at sentence boundaries (periods, exclamation marks, question marks).
Performance:
- Created 76 smaller chunks (most of any strategy)
- Critical flaw: Fragmented entity names across chunks
- Highest similarity score but worst answer quality
- Example: "Jones" appeared in one chunk, "Smith v." in another
Problem Identified:
Chunk A: "...the case Smith v."
Chunk B: "Jones was filed in District Court..."
When LLM received Chunk B alone, it only saw "Jones" without "Smith v.", leading to incomplete answers.
Lesson Learned: Higher similarity scores don't guarantee better answers if chunks break semantic units.
3. ParagraphChunking
Chunks Created: 58
Top Similarity: 0.7141
Answer Quality: ✅ Correct (both cases identified)
How It Works: Splits text at paragraph boundaries (double newlines).
Performance:
- Identical results to WordChunking (58 chunks)
- This occurred because ontology entities had no paragraph breaks
- Every entity was under 100 words, so paragraph boundaries coincided with the word-based chunk boundaries
Best For:
- Long-form documentation with clear paragraph structure
- Articles, papers, documentation
- Not ideal for structured RDF/OWL data
4. FixedSizeChunking
Chunks Created: Unknown (at least 5)
Top Similarity: 0.7141
Answer Quality: ✅ Correct (Smith v. Jones identified as active)
How It Works: Fixed character/token limits regardless of content boundaries. Unlike other strategies, ignores semantic structure entirely.
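A minimal sketch of the idea, assuming a character-based limit (the plugin's actual unit and limit may differ):

import java.util.ArrayList;
import java.util.List;

// Cut text into fixed-length character windows, ignoring all content boundaries.
static List<String> fixedSizeChunks(String text, int maxChars) {
    List<String> chunks = new ArrayList<>();
    for (int i = 0; i < text.length(); i += maxChars) {
        chunks.add(text.substring(i, Math.min(i + maxChars, text.length())));
    }
    return chunks;
}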
Performance:
- Top similarity of 0.7141 (tied with ParagraphChunking)
- Clean, well-structured chunks with complete entity information
- Each chunk formatted with Class/Individual type, IRI, label, and properties
- Successfully retrieved all relevant case information
Example Chunks:
Chunk 1 (0.7141): Criminal Case definition
Chunk 2 (0.7096): Appeal of CV-2023-500 (complete individual)
Chunk 3 (0.6985): Smith v. Jones (complete individual)
Chunk 4 (0.6955): State v. Doe (complete individual)
Chunk 5 (0.6938): case status property
Key Observation: Despite ignoring semantic boundaries, FixedSizeChunking produced well-formed chunks because:
- Ontology entities are naturally compact (< 100 words each)
- RDF/OWL serialization creates natural boundaries
- Fixed size happened to align with entity boundaries
Best For:
- When entity size is consistent and predictable
- Performance-critical applications needing uniform computational load
- Baseline comparison for other strategies
Limitation: Would fragment entities if fixed size < entity size, or waste space if fixed size >> entity size.
Results: OWL-Aware Chunking Strategies
5. ClassBasedChunking
Chunks Created: 6
Top Similarity: 0.6964
Answer Quality: ✅ Correct
How It Works: Groups axioms by class hierarchies. Creates one chunk per class hierarchy plus one "orphan" chunk for non-hierarchy axioms.
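A simplified sketch of this grouping against the OWL API 4.x interfaces (the real strategy merges whole hierarchies; here each SubClassOf axiom is bucketed under its asserted superclass, and everything else becomes the orphan chunk):

import org.semanticweb.owlapi.model.*;
import java.util.*;

static Map<String, Set<OWLAxiom>> classBasedChunks(OWLOntology ont) {
    Map<String, Set<OWLAxiom>> chunks = new LinkedHashMap<>();
    // Bucket SubClassOf axioms by their named superclass.
    for (OWLSubClassOfAxiom ax : ont.getAxioms(AxiomType.SUBCLASS_OF)) {
        if (!ax.getSuperClass().isAnonymous()) {
            String key = ax.getSuperClass().asOWLClass().getIRI().getShortForm();
            chunks.computeIfAbsent(key, k -> new HashSet<>()).add(ax);
        }
    }
    // Everything that is not a hierarchy axiom lands in one orphan chunk.
    Set<OWLAxiom> orphans = new HashSet<>(ont.getAxioms());
    chunks.values().forEach(orphans::removeAll);
    chunks.put("orphan", orphans); // 183 of 195 axioms in this ontology
    return chunks;
}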
Chunk Distribution:
Chunk 0: Evidence hierarchy (2 axioms)
- DocumentEvidence
- PhysicalEvidence
Chunk 1: Statute hierarchy (2 axioms)
- FederalStatute
- StateStatute
Chunk 2: Court hierarchy (3 axioms)
- AppellateCourt
- DistrictCourt
- SupremeCourt
Chunk 3: Case hierarchy (3 axioms)
- CivilCase
- CriminalCase
- AppellateCase
Chunk 4: Person hierarchy (7 axioms)
- Judge, Lawyer
- DefenseAttorney, Prosecutor
- SupremeCourtJudge
- Plaintiff, Defendant
Chunk 5: Orphan axioms (183 axioms) ← DOMINANT CHUNK
- All individual assertions
- All property declarations
- All annotations
Key Finding: 183 of 195 axioms (93.8%) ended up in the "orphan" chunk because they were individual assertions, not class hierarchy definitions.
Best For:
- Queries about class relationships
- "What types of cases exist?"
- "What are the subclasses of Person?"
Limitation: Most real data (case details, property values) concentrated in massive orphan chunk.
6. AnnotationBasedChunking
Chunks Created: 39
Top Similarity: 0.7010
Answer Quality: ✅ Correct with best context
How It Works: Groups axioms by annotation label prefixes (first 3 characters).
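A sketch of the prefix bucketing with OWL API 4.x (simplified: an axiom referencing two entities can land in both buckets, so a real implementation would need to deduplicate):

import org.semanticweb.owlapi.model.*;
import org.semanticweb.owlapi.search.EntitySearcher;
import java.util.*;

static Map<String, Set<OWLAxiom>> annotationChunks(OWLOntology ont) {
    Map<String, Set<OWLAxiom>> chunks = new LinkedHashMap<>();
    for (OWLEntity e : ont.getSignature()) {
        String key = "no-annotations";
        // Use the first 3 characters of the entity's rdfs:label, if any.
        for (OWLAnnotation ann : EntitySearcher.getAnnotations(e, ont)) {
            if (ann.getProperty().isLabel() && ann.getValue() instanceof OWLLiteral) {
                String label = ((OWLLiteral) ann.getValue()).getLiteral();
                key = label.substring(0, Math.min(3, label.length())).toLowerCase();
                break;
            }
        }
        chunks.computeIfAbsent(key, k -> new HashSet<>())
              .addAll(ont.getReferencingAxioms(e));
    }
    return chunks;
}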
Chunk Distribution:
Top chunks by axiom count:
- no-annotations: 84 axioms (labels, ranges, domains)
- "cas" prefix: 26 axioms (Case, CaseNumber, CaseStatus, Case_SmithVsJones, Case_StateVsDoe)
- "sta" prefix: 24 axioms (Statute, StatuteCode, StateStatute, FederalStatute)
- "app" prefix: 17 axioms (AppellateCase, AppellateCourt, AppealsTo property, Case_AppealCV001)
- "fil" prefix: 12 axioms (FiledIn, FilingDate, all filing-related assertions)
- "cou" prefix: Court entities and properties
- "jud" prefix: Judge-related entities
- "law" prefix: Lawyer-related entities
- "evi" prefix: Evidence-related entities
- 30 other prefixes: Varying axiom counts (1-10 axioms each)
Total: 39 semantic chunks
Key Strengths:
- Semantic grouping: Entities with similar names usually have related meanings
- Balanced chunks: 39 focused chunks vs. 1 giant orphan chunk
- Complete entities: "Case_SmithVsJones" stayed with "CaseStatus", "CaseNumber"
- Effective retrieval: Top chunk (0.7010, no-annotations) contained complete case information with all labels
- Metadata-dependent: Performance relies heavily on consistent naming conventions
Example Query Flow:
Query: "Which cases are currently active?"
↓
Embedding matches "cas" prefix chunk (0.7010 similarity)
↓
Chunk contains: Case_SmithVsJones (Active), Case_StateVsDoe (Trial)
↓
GPT-4 receives complete case information
↓
Answer: ✅ "Smith v. Jones (Active), State v. Doe (Trial)"
7. NamespaceBasedChunking
Chunks Created: 6
Top Similarity: 0.6964
Answer Quality: ✅ Correct
How It Works: Splits axioms by IRI namespace prefixes.
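The core idea in a few lines of OWL API 4.x code (simplified: each axiom is assigned by the first entity in its signature; mixed-namespace axioms would need a policy):

import org.semanticweb.owlapi.model.*;
import java.util.*;

static Map<String, Set<OWLAxiom>> namespaceChunks(OWLOntology ont) {
    Map<String, Set<OWLAxiom>> chunks = new LinkedHashMap<>();
    for (OWLAxiom ax : ont.getAxioms()) {
        String ns = ax.getSignature().stream()
                .findFirst()
                .map(e -> e.getIRI().getNamespace())
                .orElse("no-namespace");
        chunks.computeIfAbsent(ns, k -> new HashSet<>()).add(ax);
    }
    return chunks; // one map entry per namespace; just one for this ontology
}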
Performance:
- Fell back to ClassBasedChunking: the legal ontology uses a single namespace (http://www.semanticweb.org/legal#)
- All chunk IDs showed "class-chunk-" rather than "namespace-chunk-"
- Identical results to ClassBasedChunking
When It Would Excel:
Scenario: Multi-ontology project
Chunk 1: http://www.semanticweb.org/legal# (your domain)
- Case, Court, Judge classes
Chunk 2: http://purl.org/dc/terms/ (Dublin Core)
- creator, date, title
Chunk 3: http://xmlns.com/foaf/0.1/ (FOAF)
- Person, Organization, name
Chunk 4: http://www.w3.org/2006/time# (OWL Time)
- Instant, Interval, before, after
Best For:
- Projects importing multiple external ontologies
- Separating domain concepts from metadata
- Modular ontology architectures
Limitation: Useless for single-namespace ontologies.
8. DepthBasedChunking
Chunks Created: 3
Top Similarity: 0.6967
Answer Quality: ✅ Correct
How It Works: Groups axioms by hierarchy depth level.
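A sketch of how asserted depth can be computed with OWL API 4.x; SubClassOf axioms can then be grouped by the depth of their superclass (assumes an acyclic asserted hierarchy):

import org.semanticweb.owlapi.model.*;

// Root classes get depth 0; e.g. Case -> 0, CivilCase -> 1, DefenseAttorney -> 2.
static int depth(OWLClass cls, OWLOntology ont) {
    int maxParent = -1;
    for (OWLSubClassOfAxiom ax : ont.getSubClassAxiomsForSubClass(cls)) {
        if (!ax.getSuperClass().isAnonymous()) {
            maxParent = Math.max(maxParent, depth(ax.getSuperClass().asOWLClass(), ont));
        }
    }
    return maxParent + 1;
}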
Chunk Distribution:
Chunk 1: Non-class axioms (183 axioms, depth: N/A)
- All individual assertions
- All property declarations
- All annotations
- Similarity: 0.6967 ← Top result
Chunk 2: Depth Level 0 (15 axioms)
- Direct subclass relationships
- Evidence → DocumentEvidence, PhysicalEvidence
- Court → DistrictCourt, AppellateCourt, SupremeCourt
- Case → CivilCase, CriminalCase, AppellateCase
- Person → Judge, Lawyer, Plaintiff, Defendant
- Statute → FederalStatute, StateStatute
- Similarity: 0.6717
Chunk 3: Depth Level 1 (2 axioms)
- Second-level subclass relationships
- Lawyer → DefenseAttorney, Prosecutor
- Judge → SupremeCourtJudge
- Similarity: 0.6226
Key Insight: Legal ontology has only 2 hierarchy depth levels, indicating relatively flat structure.
Hierarchy Analysis:
Level 0: Top-level concepts (Case, Court, Person, Evidence, Statute)
Level 1: Direct children (17 classes)
Level 2: Grandchildren (3 classes: DefenseAttorney, Prosecutor, SupremeCourtJudge)
Best For:
- Understanding ontology complexity
- Queries about abstraction levels
- "What are the top-level classes?"
- Structural analysis
Limitation: Most data still in non-class axioms chunk.
9. ModuleExtractionChunking
Chunks Created: 28
Top Similarity: 0.7068
Answer Quality: ✅ Correct (both cases identified)
How It Works:
- Extracts minimal, self-contained ontology modules using OWL API algorithms
- Each module is complete and independent
- Selects seed entities and pulls all related axioms
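The OWL API ships a syntactic locality module extractor that implements exactly this idea. A sketch seeded with the Case class (the seed is my choice for illustration; per the description above, the strategy selects seeds itself):

import org.semanticweb.owlapi.model.*;
import uk.ac.manchester.cs.owlapi.modularity.ModuleType;
import uk.ac.manchester.cs.owlapi.modularity.SyntacticLocalityModuleExtractor;
import java.util.Collections;
import java.util.Set;

// 'ontology' is the loaded legal ontology.
OWLOntologyManager man = ontology.getOWLOntologyManager();
SyntacticLocalityModuleExtractor extractor =
        new SyntacticLocalityModuleExtractor(man, ontology, ModuleType.STAR);
OWLEntity seed = man.getOWLDataFactory()
        .getOWLClass(IRI.create("http://www.semanticweb.org/legal#Case"));
// The module contains the seed plus every axiom needed to interpret it.
Set<OWLAxiom> module = extractor.extract(Collections.singleton(seed));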
Performance:
- Created 28 modules from 195-axiom ontology
- Remarkably tight similarity clustering: Top 5 scores range 0.7056-0.7068 (0.0012 spread)
- Top chunk: 132 axioms with 4 seed entities
- Most balanced retrieval: All top results highly relevant
Example Module:
module-chunk-1 (0.7068):
- 132 axioms
- 4 seed entities (likely: Case, Court, Judge, Lawyer)
- Complete case information with all dependencies
Key Insight:
- Highest top score among OWL-aware strategies (0.7068 vs 0.7010 AnnotationBased)
- Most consistent retrieval: Tiny 0.0012 variance in top-5 scores
- All top chunks equally useful (any could answer the query)
- Produces self-contained, logically complete modules
Why It Excels:
- Dependency closure: Each module includes all related axioms
- Semantic completeness: No fragmented information
- Multiple relevant modules: Different seed entities = different perspectives
- Logical coherence: Uses OWL reasoning to determine relationships
Best For:
- Large, complex ontologies where relationships span many axioms
- Queries requiring complete context (all properties, relationships)
- Modular ontology architectures
- Distributed knowledge bases
- When consistency matters more than granularity
Trade-off:
- Larger chunks (avg ~7 axioms) vs AnnotationBased (~5 axioms)
- Fewer chunks (28 vs 39) = less storage, faster indexing
- But superior retrieval consistency
10. SizeBasedChunking
Status: Not tested in this study
Configuration: 50 axioms per chunk
Expected Behavior:
- Fixed axiom count per chunk
- Maintains entity coherence (keeps related axioms together)
- If single entity > 50 axioms, creates dedicated chunk
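A naive sketch of the partitioning (the entity-coherence logic described above is omitted here; the real strategy keeps related axioms together):

import org.semanticweb.owlapi.model.OWLAxiom;
import java.util.*;

static List<List<OWLAxiom>> sizeChunks(Collection<OWLAxiom> axioms, int maxAxioms) {
    List<OWLAxiom> all = new ArrayList<>(axioms);
    List<List<OWLAxiom>> chunks = new ArrayList<>();
    for (int i = 0; i < all.size(); i += maxAxioms) {
        chunks.add(all.subList(i, Math.min(i + maxAxioms, all.size())));
    }
    return chunks; // 195 axioms at 50 per chunk -> 4 chunks
}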
Best For:
- Consistent computational load
- Predictable memory usage
- Balanced query performance
Comparative Analysis
Similarity Scores Ranking
| Rank | Strategy | Score | Answer Quality | Retrieval Consistency |
|---|---|---|---|---|
| 1 | SentenceChunking | 0.7258 | ❌ Incomplete | Low |
| 2 | FixedSizeChunking | 0.7141 | ✅ Correct | Medium |
| 2 | ParagraphChunking | 0.7141 | ✅ Correct | Medium |
| 4 | WordChunking | 0.7135 | ✅ Correct | Medium |
| 5 | ModuleExtraction | 0.7068 | ✅ Correct | ⭐ Highest (0.0012 variance) |
| 6 | AnnotationBased | 0.7010 | ✅ Best Context | High |
| 7 | DepthBased | 0.6967 | ✅ Correct | Medium |
| 8 | ClassBased | 0.6964 | ✅ Correct | Medium |
| 8 | NamespaceBased | 0.6964 | ✅ Correct | Medium |
Critical Observations:
- Highest similarity score ≠ best answer quality (SentenceChunking fragmented entities)
- ModuleExtraction: Highest OWL-aware score (0.7068) + most consistent retrieval (0.0012 variance)
- AnnotationBased: Fine-grained grouping (39 chunks) effective when metadata quality is high
- Performance highly dependent on ontology design and metadata conventions
Chunk Count Analysis
| Strategy | Chunks | Average Size | Distribution |
|---|---|---|---|
| SentenceChunking | 76 | 2.6 axioms | Very unbalanced |
| WordChunking | 58 | 3.4 axioms | Balanced |
| ParagraphChunking | 58 | 3.4 axioms | Balanced |
| FixedSizeChunking | Unknown | Unknown | Appears balanced |
| ModuleExtraction | 28 | ~7 axioms | Varies by module |
| AnnotationBased | 39 | ~5 axioms | Well-balanced |
| ClassBased | 6 | 32.5 axioms | Highly unbalanced (1 giant) |
| DepthBased | 3 | 65 axioms | Highly unbalanced |
| NamespaceBased | 6 | 32.5 axioms | Highly unbalanced |
Note: ModuleExtraction creates semantically complete modules (e.g., top chunk had 132 axioms with 4 seed entities), making "average size" less meaningful than for text-based strategies.
Insight: More chunks ≠ better retrieval. Balance matters more than count.
The Orphan Axiom Problem
Definition: Axioms not part of class hierarchy definitions (individual assertions, property declarations, annotations).
Prevalence: 183 of 195 axioms (93.8%) in legal ontology.
Impact on OWL-Aware Strategies:
- ClassBased: 183-axiom orphan chunk dominates
- DepthBased: 183-axiom non-class chunk dominates
- AnnotationBased: Splits orphans into 39 semantic groups (effective when naming conventions exist)
Why This Matters: Real ontologies contain mostly ABox data (individuals), not TBox data (class definitions). Strategies that handle orphan axioms well will perform better in practice.
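One quick way to gauge how big the orphan problem is for your own ontology is the OWL API's built-in ABox/TBox split (OWL API 4.x; note that annotation and property-declaration axioms fall outside both sets):

import org.semanticweb.owlapi.model.OWLOntology;
import org.semanticweb.owlapi.model.parameters.Imports;

// 'ontology' is the loaded ontology.
int abox = ontology.getABoxAxioms(Imports.INCLUDED).size(); // individual assertions
int tbox = ontology.getTBoxAxioms(Imports.INCLUDED).size(); // class-level axioms
System.out.printf("ABox: %d, TBox: %d (%.1f%% ABox)%n",
        abox, tbox, 100.0 * abox / (abox + tbox));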
Key Findings
1. Performance Depends on Ontology Characteristics
- ModuleExtractionChunking: Highest OWL-aware score (0.7068) + best consistency (0.0012 variance)
- AnnotationBasedChunking: Fine-grained semantic grouping (39 chunks), effective when naming conventions are consistent
- ParagraphChunking/FixedSizeChunking: Highest correct-answer scores (0.7141); WordChunking (0.7135) close behind with the simplest implementation
- Optimal strategy depends on: ontology size, hierarchy depth, metadata quality, naming patterns
2. High Similarity Scores Can Mislead
- SentenceChunking: 0.7258 score but fragmented entities
- Chunk boundaries matter more than matching algorithms
- Semantic completeness > mathematical similarity
3. OWL-Aware Strategies Excel in Specific Contexts
- ModuleExtraction: Best for completeness and consistency (0.7068, 0.0012 variance)
- AnnotationBased: Effective when naming conventions exist (requires metadata)
- ClassBased: Ideal for hierarchy queries
- DepthBased: Excellent for structural analysis
- NamespaceBased: Essential for multi-ontology projects
4. The "Orphan Axiom" Challenge
- 93.8% of axioms are non-hierarchical
- Traditional OWL-aware strategies struggle with this
- AnnotationBased solution: semantic naming patterns
5. Ontology Structure Influences Strategy Selection
- Flat hierarchy (2 levels): DepthBased produces only 3 chunks
- Single namespace: NamespaceBased reverts to ClassBased
- Entity size (< 100 words): WordChunking = ParagraphChunking
Recommendations
Strategy Selection Depends on Context
If ontology has consistent naming conventions (e.g., "case_", "judge_"):
- AnnotationBasedChunking: Creates semantic groups automatically (39 chunks, 0.7010)
- Requires well-designed metadata with prefix patterns
If ontology has complex relationships requiring complete context:
- ModuleExtractionChunking: Highest accuracy (0.7068) + exceptional consistency (0.0012 variance)
- Best for: Large ontologies, distributed knowledge bases
If seeking simplicity without OWL dependencies:
- WordChunking: High performance (0.7135), no metadata required
- ParagraphChunking: Good for documentation-style ontologies
Avoid in all cases:
- SentenceChunking: Fragments entity names despite high scores
For Different Ontology Types
Deep Hierarchy Ontologies (5+ levels)
Use: DepthBasedChunking
- Reveals abstraction layers
- Good for "What are the most general concepts?" queries
Multi-Ontology Projects
Use: NamespaceBasedChunking
- Clean separation between imported ontologies
- Prevents cross-ontology confusion
Small, Class-Focused Ontologies
Use: ClassBasedChunking
- Efficient for hierarchy queries
- Works when most axioms are class definitions
Large, Complex Ontologies
Consider: ModuleExtractionChunking
- Highest OWL-aware score (0.7068)
- Self-contained modules with dependency closure
- Exceptional consistency (0.0012 variance)
- Best scalability
Performance-Critical Applications
Use: SizeBasedChunking
- Predictable computational cost
- Balanced load distribution
Technical Implementation Notes
Lucene Vector Store Configuration
LuceneVectorStore vectorStore = new LuceneVectorStore(
"./lucene_index", // File-based storage
1024 // Max dimensions (Lucene 9.8.0 limit)
);
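For context, here is a sketch of how a chunk can be indexed and searched with Lucene's stock KNN classes (not the plugin's internal code; chunkText, chunkVector, and queryVector are placeholders):

import org.apache.lucene.document.*;
import org.apache.lucene.index.*;
import org.apache.lucene.search.*;
import org.apache.lucene.store.FSDirectory;
import java.nio.file.Paths;

float[] chunkVector = new float[1024]; // placeholder: embedding of the chunk
float[] queryVector = new float[1024]; // placeholder: embedding of the query
String chunkText = "Case_SmithVsJones ...";

// Index one chunk as a KNN vector document.
try (IndexWriter writer = new IndexWriter(
        FSDirectory.open(Paths.get("./lucene_index")), new IndexWriterConfig())) {
    Document doc = new Document();
    doc.add(new StringField("id", "annotation-chunk-10", Field.Store.YES));
    doc.add(new StoredField("text", chunkText));
    doc.add(new KnnFloatVectorField("embedding", chunkVector,
            VectorSimilarityFunction.COSINE));
    writer.addDocument(doc);
}

// Run a top-5 cosine KNN search.
try (IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get("./lucene_index")))) {
    IndexSearcher searcher = new IndexSearcher(reader);
    TopDocs top = searcher.search(new KnnFloatVectorQuery("embedding", queryVector, 5), 5);
    for (ScoreDoc sd : top.scoreDocs) {
        System.out.println(sd.score + " -> " + searcher.doc(sd.doc).get("id"));
    }
}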
Chunking Strategy Selection (Java)
// In RagService.java
if (chunkingStrategy.equals("AnnotationBasedChunking")) {
AnnotationBasedChunker chunker = new AnnotationBasedChunker();
List<OWLChunk> chunks = chunker.chunk(ontology);
for (OWLChunk chunk : chunks) {
String chunkId = chunk.getId(); // "annotation-chunk-10"
String text = chunk.toOWLString(); // Manchester syntax
String strategy = chunk.getStrategy(); // "Annotation-Based"
int axiomCount = chunk.getAxiomCount();
// Create embedding and store ('metadata' map, built from strategy and axiomCount, omitted here)
List<Float> embedding = embeddingService.createEmbedding(text);
vectorStore.upsert(chunkId, embedding, text, metadata);
}
}
OpenAI Embedding Generation
// EmbeddingService.java
List<Float> embedding = embeddingService.createEmbedding(
chunkText,
"text-embedding-3-small",
1024 // Dimensions
);
Similarity Score Calculation
- Formula: Cosine similarity = (A · B) / (||A|| × ||B||)
- Implementation: Built into Lucene's VectorSimilarityFunction.COSINE
- Range: 0.0 (unrelated) to 1.0 (identical)
- Typical scores: 0.65-0.75 for relevant chunks
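For reference, the same formula in plain Java (the plugin relies on Lucene's built-in implementation):

// Cosine similarity: (A . B) / (||A|| x ||B||)
static double cosine(float[] a, float[] b) {
    double dot = 0, normA = 0, normB = 0;
    for (int i = 0; i < a.length; i++) {
        dot += a[i] * b[i];
        normA += a[i] * a[i];
        normB += b[i] * b[i];
    }
    return dot / (Math.sqrt(normA) * Math.sqrt(normB));
}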
Limitations of This Study
- Single ontology tested: Results specific to legal domain with consistent naming conventions
- Small scale: 195 axioms; performance at 10,000+ axioms unknown
- Single query type: "Which cases are active?" tests factual retrieval only
- Metadata-dependent: AnnotationBased performance assumes well-structured naming
- No hybrid testing: Didn't test combinations of strategies
- Limited query diversity: Different query types may favor different strategies
Future Research Directions
1. Hybrid Chunking Strategies
Combine multiple approaches:
- AnnotationBased for orphan axioms
- ClassBased for hierarchy axioms
- Could achieve best of both worlds
2. Dynamic Strategy Selection
AI-powered strategy selection:
- Analyze ontology structure
- Choose optimal strategy automatically
- Adapt based on query patterns
3. Custom Chunking Rules
Domain-specific configurations:
- Legal: Group by case type
- Medical: Group by diagnosis
- E-commerce: Group by product category
4. Large-Scale Testing
Evaluate on:
- SNOMED CT (300,000+ concepts)
- Gene Ontology (45,000+ terms)
- DBpedia (6M+ entities)
5. Multi-Modal Chunking
Incorporate:
- Text content
- Visual diagrams
- Audio annotations
- Temporal data
Conclusion
OWL-aware chunking strategies represent a significant advancement in RAG for ontologies. My analysis demonstrates that no single strategy is universally optimal—performance depends critically on:
- Ontology structure: Hierarchy depth, namespace diversity, entity relationships
- Metadata quality: Consistent naming conventions, annotation completeness
- Query patterns: Specific facts vs. structural understanding
- Implementation priorities: Accuracy vs. simplicity vs. performance
Key Insights
Highest Scores Don't Guarantee Best Answers: SentenceChunking achieved 0.7258 but fragmented entities, while lower-scoring strategies with semantic completeness produced correct answers.
Metadata Matters: AnnotationBasedChunking (0.7010, 39 chunks) excels only when ontologies follow consistent naming conventions. Poor metadata quality degrades it to random grouping.
Consistency vs. Peak Score: ModuleExtractionChunking (0.7068) achieved the highest OWL-aware score with remarkable consistency (0.0012 variance), making all top results equally useful.
Practical Guidance
- Analyze your ontology first: Structure, metadata quality, naming patterns
- Test multiple strategies on representative queries from your domain
- Prioritize semantic completeness over raw similarity scores
- Consider hybrid approaches for complex multi-domain ontologies
- Match strategy to use case: Completeness (ModuleExtraction), granularity (AnnotationBased), simplicity (Word)
The agenticmemory library's OWL-aware chunking strategies provide powerful tools for knowledge graph RAG, but their effectiveness depends on matching strategy to ontology characteristics. As ontology-based AI systems proliferate, sophisticated chunking will become increasingly important for retrieval quality.
References
- agenticmemory Library: https://github.com/vishalmysore/agenticmemory
- Apache Lucene KNN: https://lucene.apache.org/core/9_8_0/core/org/apache/lucene/document/KnnFloatVectorField.html
- OpenAI Embeddings API: https://platform.openai.com/docs/guides/embeddings
- OWL API Documentation: http://owlcs.github.io/owlapi/
- Protégé Plugin Development: https://protegewiki.stanford.edu/wiki/PluginDevelopment
Acknowledgments
This research was conducted using:
- Protégé 5.6.7: Stanford Center for Biomedical Informatics Research
- agenticmemory 0.1.1: Vishal Mysore
- Apache Lucene 9.8.0: Apache Software Foundation
- OpenAI GPT-4 & Embeddings: OpenAI
- Neo4j 4.4.13: Neo4j, Inc.
Appendix: Complete Test Data
Legal Ontology Statistics
- Total Axioms: 195
- Classes: 17
- Object Properties: 7
- Data Properties: 6
- Individuals: 21 (3 cases, 3 courts, 4 judges, 3 lawyers, 2 people, 3 evidence, 2 statutes, 1 verdict)
- Annotations: 84 rdfs:label assertions
- Hierarchy Depth: 2 levels maximum
- Namespaces: 1 (http://www.semanticweb.org/legal#)
Chunk Distribution Visualization
ClassBasedChunking (6 chunks):
███████████████████████████████████████████████████████ Orphan (183)
█ Evidence (2)
█ Statute (2)
█ Court (3)
█ Case (3)
██ Person (7)
AnnotationBasedChunking (39 chunks):
███████ no-annotations (84)
███ cas (26)
███ sta (24)
██ app (17)
█ fil (12)
█ [34 other prefixes with varying axiom counts]
DepthBasedChunking (3 chunks):
███████████████████████████████████████████████████████ Non-class (183)
██ Depth-0 (15)
█ Depth-1 (2)