Dimensional Directory Applied
A practical implementation of Dimensional Directory concepts in our Text Embedding System
System Overview
Our Text Embedding System demonstrates the Dimensional Directory concept in action. By creating a hierarchical system of tables that map between human-readable addresses and UUIDs, we enable efficient vector operations while maintaining perfect referential integrity and intuitive access to information.
Three-Stage Pipeline
- Chunking: Breaking text into sentences and organizing them with unique identifiers
- Indexing: Creating a structured index to track and manage these sentences
- Embedding: Generating vector embeddings for the indexed sentences
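The three stages can be sketched end to end as follows. This is a minimal illustration rather than the system's actual code: the function names, the naive punctuation-based splitter, and `toy_vector` (a stand-in for a real embedding model) are all assumptions made for the example.

```python
import re
import uuid


def chunk(text: str) -> list[str]:
    """Stage 1: split text into sentences with a naive punctuation rule."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def index(sentences: list[str]) -> dict:
    """Stage 2: wrap the sentences in a record keyed by a fresh UUID."""
    return {"uuid": str(uuid.uuid4()), "sentences": sentences}


def embed(record: dict, dims: int = 4) -> dict:
    """Stage 3: attach one vector per sentence (toy vectors, not semantic)."""

    def toy_vector(sentence: str) -> list[float]:
        # Accumulate scaled character codes into `dims` buckets.
        vec = [0.0] * dims
        for i, ch in enumerate(sentence):
            vec[i % dims] += ord(ch) / 1000.0
        return vec

    return {
        "uuid": record["uuid"],
        "vectors": [toy_vector(s) for s in record["sentences"]],
    }


record = index(chunk("Energy flows. The energy transfer is fast! Is it?"))
embedded = embed(record)
```

The sentence list and the vector list share a UUID and a position, which is what lets one address refer to both.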
DD Principles Applied
- UUID-based storage with human-readable addressing
- Table-based index mapping between addresses and storage
- Hierarchical organization of information units
- Perfect deduplication while maintaining references
- Separation of metadata from raw content
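One way to get deduplication without losing references is to derive the storage UUID from the content itself, so that identical text always lands on the same key. The sketch below assumes that approach; the namespace string, the `store` and `addresses` dictionaries, and `add_sentence` are hypothetical names, and the real system may assign UUIDs differently.

```python
import uuid

# Hypothetical namespace for content-derived IDs (an assumption for this sketch).
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "dimensional-directory/sentences")

store: dict[str, str] = {}      # UUID -> raw content, stored once
addresses: dict[str, str] = {}  # human-readable address -> UUID


def add_sentence(address: str, text: str) -> str:
    """Register an address; identical text reuses the same stored entry."""
    content_id = str(uuid.uuid5(NAMESPACE, text))
    store.setdefault(content_id, text)
    addresses[address] = content_id
    return content_id


first = add_sentence("doc:1-0", "Energy flows.")
second = add_sentence("doc:2-5", "Energy flows.")
```

Both addresses stay valid, but the text is stored only once.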
The Address Book Concept
At the core of our implementation is what we call the "Address Book" - a meta-table system that indexes all our table.json files, creating a global map of addresses that can be used to find any piece of information in the system.
How It Works
The Address Book creates a zero-indexed addressing scheme that maps to UUIDs, which then point to the actual data locations:
Zero-Indexed Address Format:
doc:1-3
Where 1 = document index, 3 = sentence index
Address Resolves To:
UUID: 550e8400-e29b-41d4-a716...
Sentence Index: 3
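Parsing this address format takes only a few lines. The sketch below assumes the `doc:<document>-<sentence>` shape shown above is the only form in use; `Address` and `parse_address` are illustrative names.

```python
import re
from typing import NamedTuple


class Address(NamedTuple):
    document: int
    sentence: int


_ADDRESS_RE = re.compile(r"doc:(\d+)-(\d+)")


def parse_address(address: str) -> Address:
    """Split 'doc:1-3' into its document and sentence indices."""
    match = _ADDRESS_RE.fullmatch(address)
    if match is None:
        raise ValueError(f"not a valid address: {address!r}")
    return Address(int(match.group(1)), int(match.group(2)))


parsed = parse_address("doc:1-3")
```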
Benefits of the Address Book
- Human-Readable Addresses: Easy reference to any sentence or vector without needing to remember UUIDs
- Indirection Layer: Changes to underlying storage don't break references
- Relationship Tracking: Maintains explicit connections between sentences and their embeddings
- Query Efficiency: Direct access paths to specific data points
Implementation Details
Address Book Structure
    {
      "documents": {
        "1": {
          "uuid": "550e8400-e29b-41d4-a716...",
          "meta": { "origin": "document1.txt", "sentences": 42 }
        },
        "2": {
          "uuid": "6ba7b810-9dad-11d1-80b4...",
          "meta": { "origin": "document2.md", "sentences": 105 }
        }
      }
    }
Vector Mapping Table
    {
      "embeddings": {
        "bert-base": {
          "1": "550e8400-e29b-41d4-a716...",
          "2": "6ba7b810-9dad-11d1-80b4..."
        },
        "mpnet": {
          "1": "c9e00578-f95f-42cd-8d18...",
          "2": "8d1de9c2-5d0e-4d8f-9a7b..."
        }
      }
    }
Address Resolution Process
1. Parse Address
   Input: doc:1-3
   Parse to: Document ID 1, Sentence Index 3
2. Lookup Document UUID
   From address book, get: 550e8400-e29b-41d4-a716... for Document ID 1
3. Access Sentence
   File: data/indexed/sentences/550e8400-e29b-41d4-a716....json
   Access index: 3
   Result: "This is the sentence stored at index 3 of document 1."
4. Get Vector (Optional)
   For model: bert-base
   File: data/indexed/embeddings/550e8400-e29b-41d4-a716....json
   Access index: 3
   Result: [0.123, 0.456, ...]
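The resolution steps can be sketched as a single function. Several details here are assumptions: each sentence or embedding file is taken to be a JSON array, the sentence index is used directly as the position in that array, and the demo writes its own files under a temporary directory with a freshly generated UUID, since the UUIDs shown in this document are truncated. `resolve` is an illustrative name.

```python
import json
import tempfile
import uuid
from pathlib import Path


def resolve(address, address_book, root, model=None, vector_map=None):
    """Follow address -> document UUID -> sentence (and optionally vector)."""
    # 1. Parse the address into a document key and a sentence index.
    _, _, rest = address.partition(":")
    doc_key, _, sentence_part = rest.partition("-")
    sentence_index = int(sentence_part)

    # 2. Look up the document UUID in the address book.
    doc_uuid = address_book["documents"][doc_key]["uuid"]

    # 3. Read the sentence file and pick the entry at the index.
    sentences = json.loads((root / "sentences" / f"{doc_uuid}.json").read_text())
    result = {"uuid": doc_uuid, "text": sentences[sentence_index]}

    # 4. Optionally read the vector for the requested model.
    if model is not None:
        vec_uuid = vector_map["embeddings"][model][doc_key]
        vectors = json.loads((root / "embeddings" / f"{vec_uuid}.json").read_text())
        result["vector"] = vectors[sentence_index]
    return result


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "sentences").mkdir()
    (root / "embeddings").mkdir()
    demo_uuid = str(uuid.uuid4())  # placeholder UUID for the demo only
    (root / "sentences" / f"{demo_uuid}.json").write_text(
        json.dumps(["s0", "s1", "s2", "s3"])
    )
    (root / "embeddings" / f"{demo_uuid}.json").write_text(
        json.dumps([[0.0], [0.1], [0.2], [0.3]])
    )
    book = {"documents": {"1": {"uuid": demo_uuid}}}
    vmap = {"embeddings": {"bert-base": {"1": demo_uuid}}}
    resolved = resolve("doc:1-3", book, root, model="bert-base", vector_map=vmap)
```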
Advanced Query Capabilities
Finding Information
Text-Based Queries
    # Find all sentences containing "energy"
    results = query_sentences("energy")

    # Returns
    # [
    #     { "address": "doc:1-3", "text": "Energy flows..." },
    #     { "address": "doc:2-17", "text": "The energy transfer..." }
    # ]
Directly query text content and get back zero-indexed addresses for further operations.
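A text query of this kind could be as simple as a case-insensitive substring scan that builds the address of each hit. The sketch below is an assumption about how such a query might work, not the system's `query_sentences`: `search_sentences` takes the corpus as an explicit argument (a hypothetical in-memory mapping from document index to sentence list) and uses each sentence's list position as its index.

```python
def search_sentences(keyword: str, documents: dict[int, list[str]]) -> list[dict]:
    """Return the address and text of every sentence containing the keyword."""
    needle = keyword.lower()
    hits = []
    for doc_index in sorted(documents):
        for sentence_index, text in enumerate(documents[doc_index]):
            if needle in text.lower():
                hits.append(
                    {"address": f"doc:{doc_index}-{sentence_index}", "text": text}
                )
    return hits


corpus = {
    1: ["Heat rises.", "Energy flows downhill."],
    2: ["The energy transfer is fast."],
}
hits = search_sentences("energy", corpus)
```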
Vector Queries
    # Find similar sentences to "doc:1-3"
    vector = get_vector("doc:1-3", "mpnet")
    similar = vector_similarity_search(
        vector,
        model="mpnet",
        top_k=5
    )

    # Returns addresses and scores
    # [
    #     { "address": "doc:1-3", "score": 1.0 },
    #     { "address": "doc:3-9", "score": 0.86 }
    # ]
Perform semantic similarity searches using the vector representation of any text.
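For small collections, a similarity search can be a brute-force cosine comparison against every stored vector. The sketch below assumes cosine similarity as the score and an in-memory mapping from address to vector; `cosine` and `similarity_search` are illustrative names, and the system's `vector_similarity_search` may use a different metric or index.

```python
import math


def cosine(a: list[float], b: list[float]) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def similarity_search(query, vectors, top_k=5, exclude=()):
    """Score every stored vector against the query and keep the best top_k."""
    scored = [
        {"address": address, "score": cosine(query, vector)}
        for address, vector in vectors.items()
        if address not in exclude
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:top_k]


stored = {
    "doc:1-0": [1.0, 0.0],
    "doc:1-1": [0.0, 1.0],
    "doc:2-0": [1.0, 1.0],
}
results = similarity_search([1.0, 0.0], stored, top_k=2)
```

Because the query vector here equals the one stored at `doc:1-0`, that address comes back first with a score of 1.0, which mirrors the self-match in the example above.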
Hybrid Query Example
    # Complex query demonstration
    # Note: expansion_depth is accepted but not used in this single-pass version.
    def find_related_concepts(keyword, expansion_depth=2):
        # 1. Find direct mentions
        direct_matches = query_sentences(keyword)
        direct_addresses = [match["address"] for match in direct_matches]

        # With no direct mentions there is nothing to average
        if not direct_addresses:
            return {"direct_mentions": [], "related_concepts": []}

        # 2. Get vectors for direct matches
        vectors = [get_vector(addr, "mpnet") for addr in direct_addresses]
        avg_vector = average_vectors(vectors)

        # 3. Find semantically similar content
        similar_addresses = vector_similarity_search(
            avg_vector,
            model="mpnet",
            top_k=20,
            exclude=direct_addresses
        )

        # 4. Get the actual text content using addresses
        related_content = [
            {
                "address": addr["address"],
                "text": get_sentence_by_address(addr["address"]),
                "similarity": addr["score"]
            }
            for addr in similar_addresses
        ]

        return {
            "direct_mentions": direct_matches,
            "related_concepts": related_content
        }
This example shows how the address system enables seamless movement between text-based and vector-based operations, creating powerful hybrid query capabilities.
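The hybrid query relies on an `average_vectors` helper that is not shown here. One plausible version, assuming a plain element-wise mean over equal-length vectors, is sketched below.

```python
def average_vectors(vectors: list[list[float]]) -> list[float]:
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("cannot average an empty list of vectors")
    dims = len(vectors[0])
    if any(len(v) != dims for v in vectors):
        raise ValueError("all vectors must have the same dimensionality")
    return [sum(component) / len(vectors) for component in zip(*vectors)]


centroid = average_vectors([[1.0, 2.0], [3.0, 4.0]])
```

Averaging gives a single query vector that sits between the direct matches, so the follow-up similarity search looks for content near all of them at once.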
Integration with Data Stores
JSON Files
The current implementation uses JSON files for tables and data storage, which provides:
- Simple human-readable format
- Ease of debugging and testing
- Direct file system access without complex setup
SQLite Integration
The next version will add SQLite for metadata and indexing:
- Fast query performance for address lookup
- Structured relationships between entities
- Transactional operations for data integrity
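As a rough idea of what the SQLite layer might look like, the sketch below mirrors the Address Book and Vector Mapping Table as two tables and resolves a document index with one join. The schema, the column names, and the placeholder UUID strings are assumptions for illustration; the planned schema is not specified here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (
        doc_index INTEGER PRIMARY KEY,
        uuid      TEXT NOT NULL UNIQUE,
        origin    TEXT NOT NULL,
        sentences INTEGER NOT NULL
    );
    CREATE TABLE embeddings (
        model     TEXT NOT NULL,
        doc_index INTEGER NOT NULL REFERENCES documents(doc_index),
        uuid      TEXT NOT NULL,
        PRIMARY KEY (model, doc_index)
    );
    """
)

# The connection context manager commits on success and rolls back on error.
with conn:
    conn.execute(
        "INSERT INTO documents VALUES (?, ?, ?, ?)",
        (1, "uuid-doc-1", "document1.txt", 42),  # placeholder UUID
    )
    conn.execute(
        "INSERT INTO embeddings VALUES (?, ?, ?)",
        ("mpnet", 1, "uuid-vec-1"),  # placeholder UUID
    )


def lookup(conn, doc_index, model):
    """Return (document UUID, embedding UUID), or None if either is missing."""
    return conn.execute(
        """
        SELECT d.uuid, e.uuid
        FROM documents AS d
        JOIN embeddings AS e ON e.doc_index = d.doc_index
        WHERE d.doc_index = ? AND e.model = ?
        """,
        (doc_index, model),
    ).fetchone()


found = lookup(conn, 1, "mpnet")
```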
HDF5 for Vectors
Planned integration with HDF5 for embedding storage:
- Efficient storage of high-dimensional vectors
- Fast similarity search operations
- Optimized for numerical data
- Support for vector approximation algorithms
The beauty of the Dimensional Directory approach is that the addressing system remains consistent even as the underlying storage technologies evolve. Applications built on top of the system should not need to change as we transition to more sophisticated storage backends.
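The sketch below illustrates that idea: the address lookup is written against a small backend interface, so swapping an in-memory store for JSON files leaves the calling code unchanged. `SentenceBackend`, `MemoryBackend`, `JsonBackend`, and `get_sentence` are hypothetical names, and the sentence index is assumed to be a direct list position.

```python
import json
import tempfile
from pathlib import Path
from typing import Protocol


class SentenceBackend(Protocol):
    def load(self, doc_uuid: str) -> list[str]: ...


class MemoryBackend:
    def __init__(self, data: dict[str, list[str]]):
        self._data = data

    def load(self, doc_uuid: str) -> list[str]:
        return self._data[doc_uuid]


class JsonBackend:
    def __init__(self, root: Path):
        self._root = root

    def load(self, doc_uuid: str) -> list[str]:
        return json.loads((self._root / f"{doc_uuid}.json").read_text())


def get_sentence(address: str, book: dict[str, str], backend: SentenceBackend) -> str:
    """Resolve an address through whichever backend is plugged in."""
    _, _, rest = address.partition(":")
    doc_key, _, sentence_part = rest.partition("-")
    return backend.load(book[doc_key])[int(sentence_part)]


book = {"1": "uuid-a"}  # placeholder UUID
sentences = ["first", "second"]

from_memory = get_sentence("doc:1-1", book, MemoryBackend({"uuid-a": sentences}))

with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / "uuid-a.json").write_text(json.dumps(sentences))
    from_json = get_sentence("doc:1-1", book, JsonBackend(Path(tmp)))
```

A later SQLite or HDF5 backend would only need to provide the same `load` method.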
Practical Use Cases
Document Analysis
The system excels at analyzing large document collections:
- Concept Mapping: Track how ideas flow across documents
- Semantic Search: Find content by meaning, not just keywords
- Content Clustering: Group related information automatically
- Cross-Reference Analysis: Discover connections between texts
Knowledge Management
Create sophisticated knowledge systems:
- Knowledge Graphs: Build explicit relationships between concepts
- Multi-Model Integration: Compare results across different embedding models
- Contextual Information Retrieval: Get information in the right context
- Source Attribution: Track information back to original sources
Development Roadmap
Phase 1: JSON Implementation (Complete)
Basic system with JSON file storage, demonstrating the Address Book concept and a working pipeline from text to embeddings.
Phase 2: SQLite Integration (In Progress)
Adding SQLite database for metadata and address mapping, improving query performance and supporting more complex relationship tracking.
Phase 3: HDF5 Vector Storage
Implementing HDF5 for optimized vector operations, enhancing similarity search performance and scaling to larger datasets.
Phase 4: Full DD Integration
Complete integration with the full Dimensional Directory infrastructure, including advanced relationship management and multi-type support.