# TokenSmith Tutorials

This directory contains step-by-step tutorials for using the TokenSmith library to manage, sample, and manipulate datasets.
## Tutorial Overview
The tutorials are designed to be followed in order, building from basic concepts to advanced workflows:
## 📚 Tutorials

## Getting Started
### Prerequisites
- Python 3.8+
- Jupyter notebook or JupyterLab
- TokenSmith library installed
- Required dependencies (transformers, numpy, etc.)
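Before opening the first notebook, you can sanity-check the prerequisites above with a short snippet. This is just a convenience sketch; the dependency names are taken from the list above, and nothing here is TokenSmith API:

```python
import importlib.util
import sys

# The tutorials assume Python 3.8 or newer (see the Prerequisites list).
assert sys.version_info >= (3, 8), (
    f"Python 3.8+ required, found {sys.version.split()[0]}"
)
print("Python version OK:", sys.version.split()[0])

# Report whether the dependencies mentioned above are importable,
# without failing if one is missing.
for name in ("transformers", "numpy"):
    status = "found" if importlib.util.find_spec(name) else "MISSING"
    print(f"{name}: {status}")
```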
### Quick Start
- Start with `01_basic_setup.ipynb` to set up your environment
- Follow the tutorials in order
- Each tutorial builds on concepts from previous ones
- Run all cells in sequence for best results
### Data Requirements

Most tutorials use sample data located in the `data/` directory. Some tutorials may require you to provide your own tokenized datasets.
## Troubleshooting

### Common Issues
- Import errors: Make sure TokenSmith is installed and in your Python path
- Missing tokenizer: Download the required tokenizer models
- File not found: Check that data files exist in the expected locations
- Memory issues: Reduce batch sizes for large datasets
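The first and third issues above can be checked programmatically before you dig into a stack trace. The sketch below assumes the importable module is named `tokensmith` and that sample data lives in `data/`; both names are inferred from this README, not from a documented API:

```python
import importlib.util
from pathlib import Path

def diagnose(module_name="tokensmith", data_dir="data"):
    """Return a list of detected setup problems.

    An empty list means the basic import and data-location checks passed.
    The default names are assumptions based on this README.
    """
    problems = []
    if importlib.util.find_spec(module_name) is None:
        problems.append(
            f"import error: '{module_name}' is not importable; "
            "check that it is installed and on your Python path"
        )
    if not Path(data_dir).is_dir():
        problems.append(
            f"file not found: '{data_dir}/' is missing; "
            "run the notebooks from the tutorials root"
        )
    return problems

for problem in diagnose():
    print(problem)
```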
### Getting Help
- Check the main README.md for installation instructions
- Review the API documentation in the source code
- Look at the examples in each tutorial for common patterns
## Contributing
If you find issues with tutorials or have suggestions for improvements, please create an issue or submit a pull request.