Text Embedding System
A structured approach to processing, indexing, and embedding large volumes of text data
System Overview
Our text processing and embedding system follows a three-stage pipeline designed to handle large volumes of text data in a structured and efficient way:
Chunking
Breaking text into sentences and organizing them with unique identifiers that preserve their original context.
Indexing
Creating a structured index to track and manage these sentences with metadata about their origin.
Embedding
Generating vector embeddings for the indexed sentences while maintaining their relational structure.
The system uses a UUID-based approach with corresponding table files to maintain relationships between original text, indexed sentences, and their embeddings – providing traceability throughout the entire process.
Data Structure
Directory Organization
data/
├── indexed/
│   ├── sentences/
│   │   ├── table.json     # Master index of sentence collections
│   │   ├── <uuid1>.json   # Indexed sentences for collection 1
│   │   ├── <uuid2>.json   # Indexed sentences for collection 2
│   │   └── ...
│   └── embeddings/
│       ├── table.json     # Master index of embedding collections
│       ├── <uuid1>.json   # Embeddings corresponding to sentences
│       ├── <uuid2>.json   # Embeddings for another collection
│       └── ...
Master Indexes
Each directory contains a table.json file that tracks all collections, with metadata about their origin and processing.
{
  "uuid1": {
    "meta": {
      "origin": "original_file_name.txt",
      "processed": 42
    }
  },
  "uuid2": {
    "meta": {
      "origin": "another_document.md",
      "processed": 105
    }
  }
}
Collection Files
Individual UUID files contain either the sentences or their corresponding embeddings, indexed by position.
// Sentences file
{
  "1": "First sentence from the document.",
  "2": "Second sentence with more content.",
  "3": "Third sentence that continues the text."
}

// Embeddings file
{
  "1": [0.123, 0.456, ...],
  "2": [-0.789, 0.012, ...],
  "3": [0.345, -0.678, ...]
}
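Because both files share the same positional keys, pairing a sentence with its embedding is a plain dictionary lookup. A minimal sketch, with the UUID as a placeholder for any collection listed in both master indexes:

import json
from pathlib import Path

BASE = Path("data/indexed")
collection = "<uuid1>"  # placeholder for a real collection UUID

# Load the sentence and embedding collections that share this UUID.
sentences = json.loads((BASE / "sentences" / f"{collection}.json").read_text())
embeddings = json.loads((BASE / "embeddings" / f"{collection}.json").read_text())

# The shared positional keys make the join a simple lookup.
pairs = {idx: (sentences[idx], embeddings[idx]) for idx in sentences}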
System Components
Chunking & Indexing
The chunk_sentences.py script processes input text files by:
- Reading a text file and splitting it into sentences
- Generating a UUID for the collection
- Creating a structured index entry in table.json
- Saving the indexed sentences in a UUID-specific JSON file
# Usage:
python chunk_sentences.py -i input_file.txt
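A minimal sketch of these steps, assuming a naive regex-based sentence splitter (the actual script's tokenization and file handling may differ):

# Hypothetical sketch of the chunking and indexing steps described above.
import argparse
import json
import re
import uuid
from pathlib import Path

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-i", "--input", required=True)
    args = parser.parse_args()

    text = Path(args.input).read_text()
    # Naive split on terminal punctuation; the real script may use a
    # proper sentence tokenizer instead.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]

    out_dir = Path("data/indexed/sentences")
    out_dir.mkdir(parents=True, exist_ok=True)
    collection_id = str(uuid.uuid4())

    # Save sentences keyed by 1-based position.
    indexed = {str(i): s for i, s in enumerate(sentences, start=1)}
    (out_dir / f"{collection_id}.json").write_text(json.dumps(indexed, indent=2))

    # Register the collection in the master index.
    table_path = out_dir / "table.json"
    table = json.loads(table_path.read_text()) if table_path.exists() else {}
    table[collection_id] = {"meta": {"origin": Path(args.input).name,
                                     "processed": len(sentences)}}
    table_path.write_text(json.dumps(table, indent=2))

if __name__ == "__main__":
    main()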
Embedding Generation
The create_indexed_embeddings.py script:
- Takes a UUID JSON file of indexed sentences
- Loads the appropriate transformer model
- Generates embeddings for each sentence
- Preserves the index structure in the output
- Creates/updates the corresponding table entry
# Usage:
python create_indexed_embeddings.py -m bert-base -i data/indexed/sentences/uuid.json
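A minimal sketch of the embedding step, assuming a sentence-transformers style model interface (the actual model loading, output paths, and table metadata may differ):

# Hypothetical sketch of the embedding step described above.
import argparse
import json
from pathlib import Path
from sentence_transformers import SentenceTransformer

def main():
    parser = argparse.ArgumentParser()
    parser.add_argument("-m", "--model", required=True)
    parser.add_argument("-i", "--input", required=True)
    args = parser.parse_args()

    in_path = Path(args.input)
    sentences = json.loads(in_path.read_text())

    model = SentenceTransformer(args.model)
    # Keep the positional keys so embeddings stay aligned with sentences.
    keys = list(sentences.keys())
    vectors = model.encode([sentences[k] for k in keys])
    embedded = {k: v.tolist() for k, v in zip(keys, vectors)}

    out_dir = Path("data/indexed/embeddings")
    out_dir.mkdir(parents=True, exist_ok=True)
    (out_dir / in_path.name).write_text(json.dumps(embedded))

    # Mirror the collection's entry in the embeddings master index
    # (what "origin" records here is an assumption; the real script may
    # copy the sentence table's metadata instead).
    table_path = out_dir / "table.json"
    table = json.loads(table_path.read_text()) if table_path.exists() else {}
    table[in_path.stem] = {"meta": {"origin": in_path.name,
                                    "processed": len(embedded)}}
    table_path.write_text(json.dumps(table, indent=2))

if __name__ == "__main__":
    main()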
System Benefits
Traceability
Each set of sentences and its corresponding embeddings share a UUID that links them back to the original source document, preserving data provenance.
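For example, resolving a collection's origin file is a single lookup in the master index (paths follow the layout above; the UUID is a placeholder):

import json

# Look up where a given collection came from.
table = json.load(open("data/indexed/sentences/table.json"))
origin = table["<uuid1>"]["meta"]["origin"]
print(origin)  # e.g. "original_file_name.txt"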
Consistency
The index structure is preserved across the pipeline, making it easy to map between sentences and their embeddings.
Modularity
The separation into distinct stages allows for flexibility in processing, such as using different embedding models on the same indexed sentences.
Scalability
The system can handle large volumes of documents by processing them incrementally and maintaining consistent reference structures.
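As a sketch of incremental processing, a hypothetical batch driver could index a corpus one file at a time (the corpus directory and the driver itself are assumptions, not part of the system):

import subprocess
from pathlib import Path

# Index every text file in a corpus directory, one document at a time.
for doc in Path("corpus").glob("*.txt"):
    subprocess.run(["python", "chunk_sentences.py", "-i", str(doc)], check=True)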
Integration with FractalWaves Technology
This embedding system forms a critical component in our technology stack, enabling:
- Efficient processing and vectorization of knowledge bases
- Structured data representation compatible with our C-Space Engine
- Hierarchical organization of information for dimensional directory navigation
- Preservation of contextual relationships in the compression stack