TokenSmith 🔧

A comprehensive toolkit for streamlining data editing, search, and inspection for large-scale language model training and interpretability.

Overview

TokenSmith is a Python package that streamlines dataset management for large language model training. It provides a unified interface for editing, inspecting, searching, sampling, and exporting tokenized datasets, making it easier to work with training data at scale.

✨ Key Features

  • 🔍 Search & Index: Fast token sequence search with n-gram indexing
  • 📊 Dataset Inspection: Examine samples, batches, and document metadata
  • 🎯 Smart Sampling: Flexible sampling with policy-based selection
  • ✏️ Dataset Editing: Inject and modify training samples with precision
  • 📤 Export Utilities: Export data in multiple formats (JSONL, CSV)
  • 🖥️ Interactive UI: Streamlit-based web interface for visual exploration
  • ⚡ Memory Efficient: Chunked processing for large datasets

🏗️ Architecture

TokenSmith is built around a central DatasetManager that coordinates five specialized handlers:

DatasetManager
├── SearchHandler    # Token sequence search and indexing
├── InspectHandler   # Dataset examination and visualization  
├── SampleHandler    # Flexible data sampling strategies
├── EditHandler      # Dataset modification and injection
└── ExportHandler    # Multi-format data export
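
This layout maps directly onto attribute access: once a manager is set up, each handler is reachable as manager.search, manager.inspect, and so on. A minimal illustrative sketch of the coordinator pattern follows (this shows the shape of the design, not TokenSmith's actual source):

# Illustrative sketch of the coordinator pattern; not TokenSmith's real code.
class SearchHandler: ...   # token sequence search and indexing
class InspectHandler: ...  # dataset examination and visualization
class SampleHandler: ...   # flexible sampling strategies
class EditHandler: ...     # dataset modification and injection
class ExportHandler: ...   # multi-format export

class DatasetManager:
    """Central coordinator: owns one instance of each handler and
    exposes it as an attribute (manager.search, manager.inspect, ...)."""
    def __init__(self):
        self.search = SearchHandler()
        self.inspect = InspectHandler()
        self.sample = SampleHandler()
        self.edit = EditHandler()
        self.export = ExportHandler()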

🚀 Quick Start

Installation

git clone https://github.com/aflah02/tokensmith.git
cd tokensmith
pip install -e .
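
A quick import confirms the editable install worked (a minimal sanity check, not part of the official setup):

# Should run without error after `pip install -e .`
from tokensmith import DatasetManager
print("TokenSmith imported successfully")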

Basic Usage

from tokensmith import DatasetManager
from transformers import AutoTokenizer

# Initialize tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# Create dataset manager
manager = DatasetManager()

# Setup for editing, inspection, sampling, and export
manager.setup_edit_inspect_sample_export(
    dataset_prefix="path/to/your/dataset",
    batch_info_save_prefix="path/to/batch/info",
    train_iters=1000,
    train_batch_size=32,
    train_seq_len=1024,
    seed=42
)

# Setup search functionality
manager.setup_search(
    bin_file_path="path/to/dataset.bin",
    search_index_save_path="path/to/search/index",
    vocab=2**16  # vocabulary size; 2**16 = 65,536 covers GPT-2's 50,257 tokens
)

# Now you can use all handlers
sample = manager.inspect.inspect_sample_by_id(0)  # examine the first sample
query = tokenizer.encode("machine learning")      # token IDs to search for
results = manager.search.search(query)
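
To skim several samples at once, the same inspection call can be looped over sample IDs (a minimal sketch reusing the manager and tokenizer configured above; only the inspect_sample_by_id call shown earlier is assumed):

# Decode and print the first few training samples for a quick eyeball check.
# Assumes inspect_sample_by_id returns the sample's token IDs; adjust if your
# version returns a richer object.
for sample_id in range(3):
    token_ids = manager.inspect.inspect_sample_by_id(sample_id)
    print(f"--- sample {sample_id} ---")
    print(tokenizer.decode(token_ids))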

📚 API Documentation

This documentation provides a comprehensive API reference for all TokenSmith components.

🤝 Contributing

We welcome contributions! Please see our contributing guidelines for more information.

📄 License

This project is licensed under the MIT License - see the LICENSE file for details.