Dimensional Directory Applied

A practical implementation of Dimensional Directory concepts in our Text Embedding System

System Overview

Our Text Embedding System puts the Dimensional Directory concept into practice. A hierarchy of tables maps human-readable addresses to UUIDs, which enables efficient vector operations while preserving referential integrity and intuitive access to information.

Three-Stage Pipeline

Chunking: Breaking text into sentences and organizing them with unique identifiers
Indexing: Creating a structured index to track and manage these sentences
Embedding: Generating vector embeddings for the indexed sentences
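
The three stages can be sketched end to end as follows. This is a minimal illustration, not the system's actual code: the function names chunk, index, and embed are hypothetical, the sentence splitter is a naive punctuation rule, and the embedding step is a toy placeholder standing in for a real model.

```python
import re
import uuid


def chunk(text):
    """Stage 1: split text into sentences with a naive punctuation rule."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [part for part in parts if part]


def index(sentences):
    """Stage 2: give the document a UUID and keep its sentences in order."""
    return {"uuid": str(uuid.uuid4()), "sentences": sentences}


def embed(sentences, dim=4):
    """Stage 3 placeholder: NOT a real model, only deterministic toy vectors."""
    vectors = []
    for sentence in sentences:
        counts = [0.0] * dim
        for ch in sentence.lower():
            counts[ord(ch) % dim] += 1.0
        total = sum(counts) or 1.0
        vectors.append([count / total for count in counts])
    return vectors


doc = index(chunk("Energy flows. It is conserved! Is it?"))
doc["vectors"] = embed(doc["sentences"])
```

In a real pipeline the embed step would call an embedding model; the point here is only the shape of the data passed between stages.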

DD Principles Applied

UUID-based storage with human-readable addressing
Table-based index mapping between addresses and storage
Hierarchical organization of information units
Deduplication of stored content while preserving every reference to it
Separation of metadata from raw content
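
One way to picture deduplication with preserved references is a store that keeps each distinct sentence once and lets several addresses point at it. The sketch below assumes content-derived UUIDs (uuid5) so that identical text always gets the same key; the source does not say how UUIDs are assigned, and SentenceStore and NAMESPACE are hypothetical names.

```python
import uuid

# Hypothetical namespace so identical text always maps to the same UUID.
NAMESPACE = uuid.uuid5(uuid.NAMESPACE_URL, "dd-example/sentences")


class SentenceStore:
    def __init__(self):
        self.content = {}  # UUID -> raw text, stored once
        self.index = {}    # address -> UUID (the mapping table)

    def add(self, address, text):
        """Store text under a content-derived UUID and point the address at it."""
        key = str(uuid.uuid5(NAMESPACE, text))
        self.content.setdefault(key, text)
        self.index[address] = key
        return key

    def get(self, address):
        """Follow the address to its UUID, then to the raw text."""
        return self.content[self.index[address]]


store = SentenceStore()
store.add("doc:1-3", "Energy flows.")
store.add("doc:2-17", "Energy flows.")
```

Both addresses remain valid, yet the text is held only once, and the address table stays separate from the raw content.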

The Address Book Concept

At the core of our implementation is what we call the "Address Book" - a meta-table system that indexes all our table.json files, creating a global map of addresses that can be used to find any piece of information in the system.

How It Works

The Address Book creates a zero-indexed addressing scheme that maps to UUIDs, which then point to the actual data locations:

Zero-Indexed Address Format:

doc:1-3

Where 1 = document index, 3 = sentence index

Address Resolves To:

UUID: 550e8400-e29b-41d4-a716...

Sentence Index: 3
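
Parsing and formatting this address format could look like the sketch below. The Address type and parse_address function are hypothetical names introduced for illustration; only the doc:<document>-<sentence> shape comes from the example above.

```python
import re
from typing import NamedTuple

ADDRESS_RE = re.compile(r"doc:(\d+)-(\d+)")


class Address(NamedTuple):
    document: int
    sentence: int

    def __str__(self):
        return f"doc:{self.document}-{self.sentence}"


def parse_address(text):
    """Parse 'doc:1-3' into Address(document=1, sentence=3)."""
    match = ADDRESS_RE.fullmatch(text)
    if match is None:
        raise ValueError(f"not a valid address: {text!r}")
    return Address(int(match.group(1)), int(match.group(2)))
```

Formatting an Address with str() gives back the original string, so addresses can be passed around either as text or as structured values.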

Benefits of the Address Book

•Human-Readable Addresses: Easy reference to any sentence or vector without needing to remember UUIDs
•Indirection Layer: Changes to underlying storage don't break references
•Relationship Tracking: Maintains explicit connections between sentences and their embeddings
•Query Efficiency: Direct access paths to specific data points

Implementation Details

Address Book Structure

{
  "documents": {
    "1": {
      "uuid": "550e8400-e29b-41d4-a716...",
      "meta": {
        "origin": "document1.txt",
        "sentences": 42
      }
    },
    "2": {
      "uuid": "6ba7b810-9dad-11d1-80b4...",
      "meta": {
        "origin": "document2.md",
        "sentences": 105
      }
    }
  }
}
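
Adding a document to a structure like this might be sketched as follows. The register_document helper is hypothetical, and it assumes that document indices are assigned sequentially starting at 1, as the example keys suggest.

```python
import uuid


def register_document(address_book, origin, sentence_count):
    """Add a document under the next free index and return that index."""
    documents = address_book.setdefault("documents", {})
    next_index = str(max((int(key) for key in documents), default=0) + 1)
    documents[next_index] = {
        "uuid": str(uuid.uuid4()),
        "meta": {"origin": origin, "sentences": sentence_count},
    }
    return next_index


book = {"documents": {}}
first = register_document(book, "document1.txt", 42)
second = register_document(book, "document2.md", 105)
```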

Vector Mapping Table

{
  "embeddings": {
    "bert-base": {
      "1": "550e8400-e29b-41d4-a716...",
      "2": "6ba7b810-9dad-11d1-80b4..."
    },
    "mpnet": {
      "1": "c9e00578-f95f-42cd-8d18...",
      "2": "8d1de9c2-5d0e-4d8f-9a7b..."
    }
  }
}
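
A lookup against a table of this shape could be as small as the sketch below. The helper names are hypothetical, and the UUID strings are obvious placeholders rather than real values.

```python
# Placeholder values; real entries would be full UUIDs.
vector_map = {
    "embeddings": {
        "bert-base": {"1": "uuid-a", "2": "uuid-b"},
        "mpnet": {"1": "uuid-c"},
    }
}


def embedding_uuid(vector_map, model, document_id):
    """Return the UUID of the embedding file for one model and document."""
    try:
        return vector_map["embeddings"][model][str(document_id)]
    except KeyError:
        raise KeyError(
            f"no {model!r} embeddings for document {document_id}"
        ) from None


def models_for(vector_map, document_id):
    """List the models that have embeddings for a document."""
    return sorted(
        model
        for model, docs in vector_map["embeddings"].items()
        if str(document_id) in docs
    )
```

Keeping one sub-table per model means the same document index can resolve to a different embedding file for each model.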

Address Resolution Process

Parse Address

Input: doc:1-3
Parse to: Document ID 1, Sentence Index 3

Look Up Document UUID

From the address book, get 550e8400-e29b-41d4-a716... for Document ID 1

Access Sentence

File: data/indexed/sentences/550e8400-e29b-41d4-a716....json
Access index: 3
Result: "This is the fourth sentence from document 1."

Get Vector (Optional)

For model: bert-base
File: data/indexed/embeddings/550e8400-e29b-41d4-a716....json
Access index: 3
Result: [0.123, 0.456, ...]
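
The first three steps can be sketched in a few lines. This is an illustration under stated assumptions: each sentence file is taken to be a JSON array of strings, the sentence index is used directly as a list position, and resolve_sentence and the demo data are hypothetical.

```python
import json
import tempfile
import uuid
from pathlib import Path


def resolve_sentence(address, address_book, data_root):
    """Resolve an address like 'doc:1-3' to the sentence text."""
    # Step 1: parse the address
    prefix, _, rest = address.partition(":")
    doc_id, _, sentence = rest.partition("-")
    if prefix != "doc" or not (doc_id.isdigit() and sentence.isdigit()):
        raise ValueError(f"not a valid address: {address!r}")
    # Step 2: look up the document UUID in the address book
    doc_uuid = address_book["documents"][doc_id]["uuid"]
    # Step 3: open the per-document file (assumed: a JSON array of strings)
    path = Path(data_root) / "sentences" / f"{doc_uuid}.json"
    sentences = json.loads(path.read_text(encoding="utf-8"))
    return sentences[int(sentence)]


# Demo with throwaway data in a temporary directory
demo_uuid = str(uuid.uuid4())
demo_book = {"documents": {"1": {"uuid": demo_uuid}}}
with tempfile.TemporaryDirectory() as root:
    sentence_dir = Path(root) / "sentences"
    sentence_dir.mkdir()
    (sentence_dir / f"{demo_uuid}.json").write_text(
        json.dumps(["s0", "s1", "s2", "s3"]), encoding="utf-8"
    )
    resolved = resolve_sentence("doc:1-3", demo_book, root)
```

Fetching a vector would follow the same path, with the embeddings directory and the model's UUID substituted for the sentence file.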

Advanced Query Capabilities

Finding Information

Text-Based Queries

# Find all sentences containing "energy"
results = query_sentences("energy")

# Returns:
# [
#   {"address": "doc:1-3", "text": "Energy flows..."},
#   {"address": "doc:2-17", "text": "The energy transfer..."}
# ]

Directly query text content and get back zero-indexed addresses for further operations.
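
A text query of this kind could be implemented roughly as below. The sketch assumes a case-insensitive substring match over an in-memory mapping of document index to sentence list; the extra documents parameter is an addition for self-containment and is not part of the call shown above.

```python
def query_sentences(keyword, documents):
    """Return address/text pairs for every sentence containing the keyword."""
    needle = keyword.lower()
    return [
        {"address": f"doc:{doc_id}-{position}", "text": text}
        for doc_id, sentences in documents.items()
        for position, text in enumerate(sentences)
        if needle in text.lower()
    ]


# Made-up sample data for illustration
documents = {
    "1": ["Intro.", "Energy flows downhill."],
    "2": ["Nothing here.", "The energy transfer is slow."],
}
matches = query_sentences("energy", documents)
```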

Vector Queries

# Find similar sentences to "doc:1-3"
vector = get_vector("doc:1-3", "mpnet")
similar = vector_similarity_search(
    vector,
    model="mpnet",
    top_k=5
)

# Returns addresses and scores:
# [
#   {"address": "doc:1-3", "score": 1.0},
#   {"address": "doc:3-9", "score": 0.86}
# ]

Perform semantic similarity searches using the vector representation of any text.
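
For small collections, the search itself can be a brute-force cosine similarity scan, sketched below. The source does not state which similarity measure or search strategy the system uses, so cosine similarity is an assumption, as is the vectors parameter (a mapping of address to vector) that replaces the model argument here.

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def vector_similarity_search(vector, vectors, top_k=5, exclude=()):
    """Score every stored vector against the query and keep the best top_k."""
    scored = [
        {"address": address, "score": cosine(vector, candidate)}
        for address, candidate in vectors.items()
        if address not in exclude
    ]
    scored.sort(key=lambda hit: hit["score"], reverse=True)
    return scored[:top_k]


# Made-up two-dimensional vectors for illustration
vectors = {
    "doc:1-3": [1.0, 0.0],
    "doc:3-9": [0.8, 0.6],
    "doc:2-4": [0.0, 1.0],
}
hits = vector_similarity_search(vectors["doc:1-3"], vectors, top_k=2)
```

As in the example output above, the query sentence itself comes back first with a score of 1.0 unless it is passed in exclude.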

Hybrid Query Example

# Complex query demonstration
def find_related_concepts(keyword, expansion_depth=2):
    """Combine keyword matches with semantically similar sentences.

    Note: expansion_depth is accepted but not used in this example.
    """
    # 1. Find direct mentions
    direct_matches = query_sentences(keyword)
    direct_addresses = [match["address"] for match in direct_matches]

    # Without direct matches there is nothing to average or expand from
    if not direct_addresses:
        return {"direct_mentions": [], "related_concepts": []}

    # 2. Get vectors for direct matches and average them into one query vector
    vectors = [get_vector(addr, "mpnet") for addr in direct_addresses]
    avg_vector = average_vectors(vectors)

    # 3. Find semantically similar content, skipping the direct matches
    similar_hits = vector_similarity_search(
        avg_vector,
        model="mpnet",
        top_k=20,
        exclude=direct_addresses
    )

    # 4. Get the actual text content using addresses
    related_content = [
        {
            "address": hit["address"],
            "text": get_sentence_by_address(hit["address"]),
            "similarity": hit["score"]
        }
        for hit in similar_hits
    ]

    return {
        "direct_mentions": direct_matches,
        "related_concepts": related_content
    }

This example shows how the address system lets a query move between text-based and vector-based operations. Every step exchanges addresses, so keyword matches can seed a vector search and the search results can be resolved back to text.
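
The average_vectors helper is not shown in the example. A plausible version, assuming a plain element-wise mean over equal-length lists of floats, might be:

```python
def average_vectors(vectors):
    """Element-wise mean of a non-empty list of equal-length vectors."""
    if not vectors:
        raise ValueError("cannot average an empty list of vectors")
    length = len(vectors[0])
    if any(len(vector) != length for vector in vectors):
        raise ValueError("all vectors must have the same dimension")
    return [sum(column) / len(vectors) for column in zip(*vectors)]
```

Averaging is only one way to merge several matches into a single query vector; it tends to blur distinct meanings when the direct matches cover unrelated topics.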

Integration with Data Stores

JSON Files

The current implementation uses JSON files for tables and data storage, which provides:

•Simple human-readable format
•Ease of debugging and testing
•Direct file system access without complex setup

SQLite Integration

The next version will add SQLite for metadata and indexing:

•Fast query performance for address lookup
•Structured relationships between entities
•Transactional operations for data integrity
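
As a rough idea of what the SQLite layer could look like, the sketch below stores the address book's document entries in one table. The schema, column names, and helper functions are assumptions for illustration; the planned design is not described in detail here.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE documents (
        doc_index INTEGER PRIMARY KEY,
        uuid      TEXT NOT NULL UNIQUE,
        origin    TEXT NOT NULL,
        sentences INTEGER NOT NULL
    );
    """
)


def add_document(conn, doc_index, doc_uuid, origin, sentences):
    """Insert one document row inside a transaction."""
    with conn:  # commits on success, rolls back on error
        conn.execute(
            "INSERT INTO documents VALUES (?, ?, ?, ?)",
            (doc_index, doc_uuid, origin, sentences),
        )


def lookup_uuid(conn, doc_index):
    """Return the UUID for a document index, or None if it is unknown."""
    row = conn.execute(
        "SELECT uuid FROM documents WHERE doc_index = ?", (doc_index,)
    ).fetchone()
    return row[0] if row else None


# Placeholder UUID string for illustration
add_document(conn, 1, "placeholder-uuid-1", "document1.txt", 42)
```

The primary key gives indexed address lookups, and the UNIQUE constraint on uuid guards against two indices pointing at the same storage entry by accident.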

HDF5 for Vectors

Planned integration with HDF5 for embedding storage:

•Efficient storage of high-dimensional vectors
•Fast array reads to support similarity search operations
•Optimized for numerical data
•A storage layer for vector approximation algorithms to build on

A key benefit of the Dimensional Directory approach is that the addressing system stays the same as the underlying storage technologies evolve. Applications built on top of the system should not need to change when we move to more sophisticated storage backends.
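
The sketch below illustrates that idea with two interchangeable backends behind one resolver. All names here (SentenceBackend, DictBackend, SqliteBackend) are hypothetical, the in-memory dictionary stands in for the JSON files, and the address book is flattened to a simple index-to-UUID mapping for brevity.

```python
import sqlite3
from typing import Protocol


class SentenceBackend(Protocol):
    def get_sentence(self, doc_uuid: str, index: int) -> str: ...


class DictBackend:
    """Stand-in for the JSON file store: UUID -> list of sentences."""

    def __init__(self, data):
        self.data = data

    def get_sentence(self, doc_uuid, index):
        return self.data[doc_uuid][index]


class SqliteBackend:
    """The same data held in an in-memory SQLite table."""

    def __init__(self, data):
        self.conn = sqlite3.connect(":memory:")
        self.conn.execute(
            "CREATE TABLE sentences ("
            "doc_uuid TEXT, idx INTEGER, text TEXT, "
            "PRIMARY KEY (doc_uuid, idx))"
        )
        rows = [(u, i, t) for u, texts in data.items() for i, t in enumerate(texts)]
        with self.conn:
            self.conn.executemany("INSERT INTO sentences VALUES (?, ?, ?)", rows)

    def get_sentence(self, doc_uuid, index):
        row = self.conn.execute(
            "SELECT text FROM sentences WHERE doc_uuid = ? AND idx = ?",
            (doc_uuid, index),
        ).fetchone()
        if row is None:
            raise KeyError((doc_uuid, index))
        return row[0]


def get_sentence_by_address(address, address_book, backend: SentenceBackend):
    """Resolve 'doc:<index>-<sentence>' through whichever backend is supplied."""
    doc_id, _, sentence = address.partition(":")[2].partition("-")
    return backend.get_sentence(address_book[doc_id], int(sentence))


# Placeholder data for illustration
data = {"uuid-a": ["a0", "a1", "a2"]}
book = {"1": "uuid-a"}
from_dict = get_sentence_by_address("doc:1-2", book, DictBackend(data))
from_sqlite = get_sentence_by_address("doc:1-2", book, SqliteBackend(data))
```

The caller only ever handles the address string, so swapping one backend for the other does not touch application code.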

Practical Use Cases

Document Analysis

The system excels at analyzing large document collections:

•Concept Mapping: Track how ideas flow across documents
•Semantic Search: Find content by meaning, not just keywords
•Content Clustering: Group related information automatically
•Cross-Reference Analysis: Discover connections between texts

Knowledge Management

Create sophisticated knowledge systems:

•Knowledge Graphs: Build explicit relationships between concepts
•Multi-Model Integration: Compare results across different embedding models
•Contextual Information Retrieval: Get information in the right context
•Source Attribution: Track information back to original sources
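
Source attribution, for instance, falls out of the address format: the document part of an address leads to the origin recorded in the address book. A small sketch, with a hypothetical attribute helper and made-up search hits:

```python
address_book = {
    "documents": {
        "1": {"meta": {"origin": "document1.txt", "sentences": 42}},
        "2": {"meta": {"origin": "document2.md", "sentences": 105}},
    }
}


def attribute(hits, address_book):
    """Attach the origin file to each hit via the document part of its address."""
    attributed = []
    for hit in hits:
        doc_id = hit["address"].split(":", 1)[1].split("-", 1)[0]
        origin = address_book["documents"][doc_id]["meta"]["origin"]
        attributed.append({**hit, "origin": origin})
    return attributed


# Made-up hits for illustration
hits = [
    {"address": "doc:1-3", "score": 1.0},
    {"address": "doc:2-17", "score": 0.86},
]
sourced = attribute(hits, address_book)
```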

Development Roadmap

Phase 1: JSON Implementation (Complete)

Basic system with JSON file storage, demonstrating the Address Book concept and a working pipeline from text to embeddings.

Phase 2: SQLite Integration (In Progress)

Adding SQLite database for metadata and address mapping, improving query performance and supporting more complex relationship tracking.

Phase 3: HDF5 Vector Storage

Implementing HDF5 for optimized vector operations, enhancing similarity search performance and scaling to larger datasets.

Phase 4: Full DD Integration

Complete integration with the full Dimensional Directory infrastructure, including advanced relationship management and multi-type support.