Basic Setup Tutorial¶
Fix Paths¶
In [7]:
# May not be necessary, but ensures the path is set correctly
import sys
sys.path.insert(0, "/NS/llm-pretraining/work/afkhan/tokensmith")
sys.path.insert(0, "/NS/llm-pretraining/work/afkhan/USC_Colab/gpt-neox")
Loading Tokenizer¶
In [8]:
from transformers import AutoTokenizer
TOKENIZER_NAME_OR_PATH = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME_OR_PATH, add_eos_token=True)
/NS/venvs/work/afkhan/neox_updated_env/lib/python3.11/site-packages/huggingface_hub/file_download.py:795: FutureWarning: `resume_download` is deprecated and will be removed in version 1.0.0. Downloads always resume when possible. If you want to force a new download, use `force_download=True`. warnings.warn(
Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
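As a quick sanity check, you can run the freshly loaded tokenizer on a short string and decode it back; the sample text below is only for illustration.

# Encode a sample string and decode it back to verify the tokenizer round-trips
sample_ids = tokenizer("TokenSmith makes dataset inspection straightforward.")["input_ids"]
print(sample_ids)
print(tokenizer.decode(sample_ids))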
Testing Imports¶
In [9]:
from tokensmith.manager import DatasetManager
In [10]:
dataset_manager = DatasetManager()
In [11]:
# Optional: set up the search workflow by building a search index over the
# tokenized .bin file (or reusing an existing index, since reuse=True).
# Uncomment to enable search support.
# dataset_manager.setup_search(
#     bin_file_path="../../artifacts/data_tokenized_text_document.bin",
#     search_index_save_path="../../artifacts/search_index_text_document.idx",
#     vocab=2**32,
#     verbose=True,
#     reuse=True,
# )
In [12]:
# Configure the manager for the edit/inspect/sample/export workflows:
# builds batch information for 100 training iterations with batch size 16
# and sequence length 2048 over the tokenized dataset.
dataset_manager.setup_edit_inspect_sample_export(
    dataset_prefix='../../artifacts/data_tokenized_text_document',
    batch_info_save_prefix='../../artifacts/batch_info',
    train_iters=100,
    train_batch_size=16,
    train_seq_len=2048,
    seed=42,
    splits_string='990,5,5',
    packing_impl='packed',
    allow_chopped=True,
)
warming up index mmap file... reading sizes... reading pointers... reading document index...
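To confirm that the setup step wrote its artifacts, you can list whatever files appeared under the batch_info prefix passed above. The exact filenames are chosen by tokensmith, so this sketch simply enumerates anything matching the prefix.

from pathlib import Path
# List files written under the batch_info prefix used in the setup call
for path in sorted(Path('../../artifacts').glob('batch_info*')):
    print(path.name)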
Congratulations! You have successfully set up your environment.

What was set up:

- The Python path was adjusted to include the necessary project directories.
- The transformers library's AutoTokenizer was loaded with the model EleutherAI/gpt-neox-20b.
- The DatasetManager from tokensmith was imported and initialized.
- The dataset manager was configured for the edit/inspect/sample/export workflows (the search setup is shown commented out above and can be enabled when needed).

You are now ready to proceed with tokenization, dataset inspection, and further experiments!