Basic Setup Tutorial¶
Fix Paths¶
In [7]:
# May not be necessary, but ensures the path is set correctly
import sys
sys.path.insert(0, "/NS/llm-pretraining/work/afkhan/tokensmith")
sys.path.insert(0, "/NS/llm-pretraining/work/afkhan/USC_Colab/gpt-neox")
Loading Tokenizer¶
In [8]:
from transformers import AutoTokenizer
TOKENIZER_NAME_OR_PATH = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME_OR_PATH, add_eos_token=True)
/NS/venvs/work/afkhan/neox_updated_env/lib/python3.11/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
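As a quick sanity check, you can run the freshly loaded tokenizer on a short string and decode it back; the sample text below is only for illustration.

# Encode a sample string and decode it back to verify the tokenizer round-trips
sample_ids = tokenizer("TokenSmith makes dataset inspection straightforward.")["input_ids"]
print(sample_ids)
print(tokenizer.decode(sample_ids))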
Testing Imports¶
In [9]:
from tokensmith.manager import DatasetManager
In [10]:
dataset_manager = DatasetManager()
In [11]:
# Optional: set up the search workflow by building a search index over the
# tokenized .bin file (or reusing an existing index, since reuse=True).
# Uncomment to enable search support.
# dataset_manager.setup_search(
#     bin_file_path="../../artifacts/data_tokenized_text_document.bin",
#     search_index_save_path="../../artifacts/search_index_text_document.idx",
#     vocab=2**32,
#     verbose=True,
#     reuse=True,
# )
In [12]:
# Configure the manager for the edit/inspect/sample/export workflows:
# builds batch information for 100 training iterations with batch size 16
# and sequence length 2048 over the tokenized dataset.
dataset_manager.setup_edit_inspect_sample_export(
    dataset_prefix='../../artifacts/data_tokenized_text_document',
    batch_info_save_prefix='../../artifacts/batch_info',
    train_iters=100,
    train_batch_size=16,
    train_seq_len=2048,
    seed=42,
    splits_string='990,5,5',
    packing_impl='packed',
    allow_chopped=True,
)
warming up index mmap file... reading sizes... reading pointers... reading document index...
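To confirm that the setup step wrote its artifacts, you can list whatever files appeared under the batch_info prefix passed above. The exact filenames are chosen by tokensmith, so this sketch simply enumerates anything matching the prefix.

from pathlib import Path
# List files written under the batch_info prefix used in the setup call
for path in sorted(Path('../../artifacts').glob('batch_info*')):
    print(path.name)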
Congratulations! You have successfully set up your environment.

What was set up:

- The Python path was adjusted to include the necessary project directories.
- The transformers library's AutoTokenizer was loaded with the model EleutherAI/gpt-neox-20b.
- The DatasetManager from tokensmith was imported and initialized.
- The dataset manager was configured for the edit/inspect/sample/export workflows (the search setup is shown commented out above and can be enabled when needed).

You are now ready to proceed with tokenization, dataset inspection, and further experiments!