Text Embedding System
A structured approach to processing, indexing, and embedding large volumes of text data
System Overview
Our text processing and embedding system follows a three-stage pipeline designed to handle large volumes of text data in a structured and efficient way:
Chunking
Breaking text into sentences and organizing them with unique identifiers that preserve their original context.
Indexing
Creating a structured index to track and manage these sentences with metadata about their origin.
Embedding
Generating vector embeddings for the indexed sentences while maintaining their relational structure.
The system uses a UUID-based approach with corresponding table files to maintain relationships between original text, indexed sentences, and their embeddings – providing traceability throughout the entire process.
Data Structure
Directory Organization
data/
├── indexed/
│   ├── sentences/
│   │   ├── table.json     # Master index of sentence collections
│   │   ├── <uuid1>.json   # Indexed sentences for collection 1
│   │   ├── <uuid2>.json   # Indexed sentences for collection 2
│   │   └── ...
│   └── embeddings/
│       ├── table.json     # Master index of embedding collections
│       ├── <uuid1>.json   # Embeddings corresponding to sentences
│       ├── <uuid2>.json   # Embeddings for another collection
│       └── ...
Master Indexes
Each directory contains a table.json file that tracks all collections, with metadata about their origin and processing.
{
  "uuid1": {
    "meta": {
      "origin": "original_file_name.txt",
      "processed": 42
    }
  },
  "uuid2": {
    "meta": {
      "origin": "another_document.md",
      "processed": 105
    }
  }
}
Collection Files
Individual UUID files contain either the sentences or their corresponding embeddings, indexed by position.
// Sentences file
{
  "1": "First sentence from the document.",
  "2": "Second sentence with more content.",
  "3": "Third sentence that continues the text."
}

// Embeddings file
{
  "1": [0.123, 0.456, ...],
  "2": [-0.789, 0.012, ...],
  "3": [0.345, -0.678, ...]
}
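Because both files share the same positional keys, pairing a sentence with its embedding is a plain dictionary lookup. A minimal sketch, with the UUID as a placeholder for any collection listed in both master indexes:

import json
from pathlib import Path

BASE = Path("data/indexed")
collection = "<uuid1>"  # placeholder for a real collection UUID

# Load the sentence and embedding collections that share this UUID.
sentences = json.loads((BASE / "sentences" / f"{collection}.json").read_text())
embeddings = json.loads((BASE / "embeddings" / f"{collection}.json").read_text())

# The shared positional keys make the join a simple lookup.
pairs = {idx: (sentences[idx], embeddings[idx]) for idx in sentences}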
System Components
Chunking & Indexing
The chunk_sentences.py script processes input text files by:
- Reading a text file and splitting it into sentences
- Generating a UUID for the collection
- Creating a structured index entry in table.json
- Saving the indexed sentences in a UUID-specific JSON file
# Usage:
python chunk_sentences.py -i input_file.txt
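A minimal sketch of these steps, assuming a naive regex-based sentence splitter (the actual script's tokenization and file handling may differ):

# Hypothetical sketch of the chunking and indexing steps described above.
import argparse
import json
import re
import uuid
from pathlib import Path

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input", required=True)
    args = parser.parse_args()

    text = Path(args.input).read_text()
    # Naive split on terminal punctuation; the real script may use a
    # proper sentence tokenizer instead.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    out_dir = Path("data/indexed/sentences")
    out_dir.mkdir(parents=True, exist_ok=True)
    collection_id = str(uuid.uuid4())

    # Save sentences keyed by 1-based position.
    indexed = {str(i): s for i, s in enumerate(sentences, start=1)}
    (out_dir / f"{collection_id}.json").write_text(json.dumps(indexed, indent=2))

    # Register the collection in the master index.
    table_path = out_dir / "table.json"
    table = json.loads(table_path.read_text()) if table_path.exists() else {}
    table[collection_id] = {"meta": {"origin": Path(args.input).name,
                                     "processed": len(sentences)}}
    table_path.write_text(json.dumps(table, indent=2))

if __name__ == "__main__":
    main()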
Embedding Generation
The create_indexed_embeddings.py script:
- Takes a UUID JSON file of indexed sentences
- Loads the appropriate transformer model
- Generates embeddings for each sentence
- Preserves the index structure in the output
- Creates/updates the corresponding table entry
# Usage:
python create_indexed_embeddings.py -m bert-base -i data/indexed/sentences/uuid.json
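A minimal sketch of the embedding step, assuming a sentence-transformers style model interface (the actual model loading, output paths, and table metadata may differ):

# Hypothetical sketch of the embedding step described above.
import argparse
import json
from pathlib import Path
from sentence_transformers import SentenceTransformer

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--model", required=True)
    parser.add_argument("-i", "--input", required=True)
    args = parser.parse_args()

    in_path = Path(args.input)
    sentences = json.loads(in_path.read_text())

    model = SentenceTransformer(args.model)
    # Keep the positional keys so embeddings stay aligned with sentences.
    keys = list(sentences.keys())
    vectors = model.encode([sentences[k] for k in keys])
    embedded = {k: v.tolist() for k, v in zip(keys, vectors)}

    out_dir = Path("data/indexed/embeddings")
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / in_path.name).write_text(json.dumps(embedded))

    # Mirror the collection's entry in the embeddings master index
    # (what "origin" records here is an assumption; the real script may
    # copy the sentence table's metadata instead).
    table_path = out_dir / "table.json"
    table = json.loads(table_path.read_text()) if table_path.exists() else {}
    table[in_path.stem] = {"meta": {"origin": in_path.name,
                                    "processed": len(embedded)}}
    table_path.write_text(json.dumps(table, indent=2))

if __name__ == "__main__":
    main()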
System Benefits
Traceability
Each set of sentences and its corresponding embeddings share a UUID that links them back to the original source document, preserving data provenance.
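For example, resolving a collection's origin file is a single lookup in the master index (paths follow the layout above; the UUID is a placeholder):

import json

# Look up where a given collection came from.
table = json.load(open("data/indexed/sentences/table.json"))
origin = table["<uuid1>"]["meta"]["origin"]
print(origin)  # e.g. "original_file_name.txt"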
Consistency
The index structure is preserved across the pipeline, making it easy to map between sentences and their embeddings.
Modularity
The separation into distinct stages allows for flexibility in processing, such as using different embedding models on the same indexed sentences.
Scalability
The system can handle large volumes of documents by processing them incrementally and maintaining consistent reference structures.
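As a sketch of incremental processing, a hypothetical batch driver could index a corpus one file at a time (the corpus directory and the driver itself are assumptions, not part of the system):

import subprocess
from pathlib import Path

# Index every text file in a corpus directory, one document at a time.
for doc in Path("corpus").glob("*.txt"):
    subprocess.run(["python", "chunk_sentences.py", "-i", str(doc)], check=True)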
Integration with FractalWaves Technology
This embedding system forms a critical component in our technology stack, enabling:
- Efficient processing and vectorization of knowledge bases
- Structured data representation compatible with our C-Space Engine
- Hierarchical organization of information for dimensional directory navigation
- Preservation of contextual relationships in the compression stack