TokenSmith Tutorials

This directory contains step-by-step tutorials covering dataset management, sampling, and manipulation with the TokenSmith library.

Tutorial Overview

The tutorials are designed to be followed in order, building from basic concepts to advanced workflows:

📚 Tutorials

  1. 01_basic_setup.ipynb

  2. 02_inspect_samples.ipynb

  3. 03_sampling_methods.ipynb

  4. 04_dataset_editing_methods.ipynb

  5. 05_search_functionality.ipynb

Getting Started

Prerequisites

  • Python 3.8+
  • Jupyter notebook or JupyterLab
  • TokenSmith library installed
  • Required dependencies (transformers, numpy, etc.)
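The prerequisites above can be set up roughly as follows. This is a sketch, not the authoritative instructions — see the main README.md for the real install steps; in particular, the package name `tokensmith` is an assumption and may not match how the library is actually distributed.

```shell
# Hypothetical setup commands -- defer to the main README.md.
# The PyPI name "tokensmith" is an assumption.
pip install tokensmith            # the TokenSmith library, if published on PyPI
pip install transformers numpy    # common dependencies mentioned above
pip install jupyterlab            # to run the .ipynb tutorials
```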

Quick Start

  1. Start with 01_basic_setup.ipynb to set up your environment
  2. Follow the tutorials in order
  3. Each tutorial builds on concepts from previous ones
  4. Run all cells in sequence for best results

Data Requirements

Most tutorials use sample data located in the data/ directory. Some tutorials may require you to provide your own tokenized datasets.
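Before starting a tutorial, it can help to confirm the `data/` directory is actually present. A minimal sketch (the exact files it should contain depend on the tutorial, so this only checks for the directory and lists whatever is there):

```python
from pathlib import Path

def check_data_dir(path="data"):
    """Report whether the tutorial data directory exists and list its contents."""
    data_dir = Path(path)
    if not data_dir.is_dir():
        # Missing directory: tutorials that read sample data will fail.
        print(f"Missing: {data_dir.resolve()} -- some tutorials may not run.")
        return False
    print(f"Found: {data_dir.resolve()}")
    for entry in sorted(data_dir.iterdir()):
        print(f"  {entry.name}")
    return True
```

Run `check_data_dir()` from the same directory you launch the notebooks in, since the tutorials resolve `data/` relative to the working directory.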

Troubleshooting

Common Issues

  • Import errors: Make sure TokenSmith is installed and in your Python path
  • Missing tokenizer: Download the required tokenizer models
  • File not found: Check that data files exist in the expected locations
  • Memory issues: Reduce batch sizes for large datasets
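For the first issue — import errors — a quick diagnostic like the one below can tell you whether the module is importable and, if not, what `sys.path` currently contains. The module name `tokensmith` is an assumption; substitute whatever name the library actually imports under.

```python
import importlib.util
import sys

def diagnose(module_name="tokensmith"):
    """Check whether a module is importable; print environment hints if not."""
    print(f"Python: {sys.version.split()[0]}")
    spec = importlib.util.find_spec(module_name)
    if spec is None:
        # Not importable: either not installed, or installed into a
        # different environment than the one running this kernel.
        print(f"'{module_name}' not found -- is it installed and on sys.path?")
        print("sys.path entries:")
        for p in sys.path:
            print(f"  {p}")
        return False
    print(f"'{module_name}' found at {spec.origin}")
    return True
```

Running this inside the notebook (rather than a terminal) matters: Jupyter kernels can point at a different Python environment than your shell, which is a common cause of "it installs fine but won't import".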

Getting Help

  • Check the main README.md for installation instructions
  • Review the API documentation in the source code
  • Look at the examples in each tutorial for common patterns

Contributing

If you find issues with tutorials or have suggestions for improvements, please create an issue or submit a pull request.