DatasetManager
Source code in tokensmith/manager.py
setup_search
setup_search(bin_file_path, search_index_save_path, vocab, verbose=False, reuse=True)
Initializes the SearchHandler by building or loading the index. Call this method explicitly if search functionality is required; it is not run automatically, to avoid unnecessary overhead.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
bin_file_path | str | Path to the binary file containing the dataset. | required |
search_index_save_path | str | Path to save the search index. | required |
vocab | Dict[str, int] | Vocabulary mapping words to their indices. | required |
verbose | bool | If True, enables verbose output during index building. | False |
reuse | bool | If True, reuses the existing index if available. | True |
Raises:
Type | Description |
---|---|
ValueError | |
Returns:
Type | Description |
---|---|
None |
Source code in tokensmith/manager.py
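The call pattern for setup_search can be sketched as follows. This is a minimal illustration, not the library's own example: `enable_search` is a hypothetical helper, `RecordingManager` is a hypothetical stand-in that only mirrors the documented signature so the snippet runs without tokensmith installed, and the paths and toy vocabulary are placeholders. With the real library, an existing DatasetManager instance would be passed instead.

```python
from typing import Any, Dict, List


def enable_search(manager: Any, bin_file_path: str, index_path: str,
                  vocab: Dict[str, int]) -> None:
    """Hypothetical helper: call setup_search with the documented keyword names."""
    manager.setup_search(
        bin_file_path=bin_file_path,
        search_index_save_path=index_path,
        vocab=vocab,
        verbose=False,  # documented default
        reuse=True,     # documented default: reuse an existing index if available
    )


class RecordingManager:
    """Hypothetical stand-in for DatasetManager; it only records the call."""

    def __init__(self) -> None:
        self.calls: List[Dict[str, Any]] = []

    def setup_search(self, bin_file_path, search_index_save_path, vocab,
                     verbose=False, reuse=True) -> None:
        self.calls.append({
            "bin_file_path": bin_file_path,
            "search_index_save_path": search_index_save_path,
            "vocab": vocab,
            "verbose": verbose,
            "reuse": reuse,
        })


# Placeholder paths and vocabulary, chosen only for illustration.
manager = RecordingManager()
enable_search(manager, "data/corpus.bin", "indexes/corpus", {"hello": 0, "world": 1})
print(manager.calls[0]["search_index_save_path"])
```

Passing every argument by keyword keeps the call readable and guards against mixing up the two path arguments.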
setup_edit_inspect_sample_export
setup_edit_inspect_sample_export(dataset_prefix, batch_info_save_prefix, train_iters, train_batch_size, train_seq_len, seed, splits_string='969,30,1', packing_impl='packed', allow_chopped=True, add_extra_token_to_seq=1)
Initializes the EditHandler, InspectHandler, SampleHandler, and ExportHandler. Call this method to set up the handlers for the dataset files located via dataset_prefix.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
dataset_prefix | str | Prefix for the dataset files, used to locate the {dataset_prefix}.bin and {dataset_prefix}.idx files. | required |
batch_info_save_prefix | str | Prefix for the batch information files. Used to locate the doc/sample/shuffle indexes, and as the save path for them if the files are not found. | required |
train_iters | int | Number of training iterations for simulated training. | required |
train_batch_size | int | Size of each training batch for simulated training. | required |
train_seq_len | int | Length of the training sequences. | required |
seed | int | Random seed for simulated training. | required |
splits_string | str | Comma-separated string of train/val/test splits. The default '969,30,1' means 96.9% train, 3% val, and 0.1% test. | '969,30,1' |
packing_impl | str | Implementation for packing sequences. One of 'packed', 'pack_until_overflow', 'unpacked'. | 'packed' |
allow_chopped | bool | Allow chopped samples in the dataset. E.g., if the sequence length is 1024 and a sample has length 1026, it is chopped to 1024. WARNING: this is ignored if packing_impl is 'packed'. | True |
add_extra_token_to_seq | int | Number of extra tokens to add to each sequence. The default of 1 accounts for causal language modeling. | 1 |
Raises:
Type | Description |
---|---|
ValueError | |
Returns:
Type | Description |
---|---|
None |
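One way to prepare a call to setup_edit_inspect_sample_export is to collect the arguments in a dictionary and check packing_impl against the three documented values before handing them over. The sketch below is an assumption-laden illustration: `build_setup_kwargs` is a hypothetical helper, the ValueError it raises is this helper's own check rather than the library's documented behaviour, and the prefixes and numbers are placeholders.

```python
from typing import Any, Dict

# The three values listed in the parameter table for packing_impl.
PACKING_IMPLS = ("packed", "pack_until_overflow", "unpacked")


def build_setup_kwargs(dataset_prefix: str, batch_info_save_prefix: str,
                       train_iters: int, train_batch_size: int,
                       train_seq_len: int, seed: int,
                       packing_impl: str = "packed") -> Dict[str, Any]:
    """Hypothetical helper: assemble keyword arguments using the documented names."""
    if packing_impl not in PACKING_IMPLS:
        raise ValueError(
            f"packing_impl must be one of {PACKING_IMPLS}, got {packing_impl!r}"
        )
    return {
        "dataset_prefix": dataset_prefix,
        "batch_info_save_prefix": batch_info_save_prefix,
        "train_iters": train_iters,
        "train_batch_size": train_batch_size,
        "train_seq_len": train_seq_len,
        "seed": seed,
        "splits_string": "969,30,1",   # documented default
        "packing_impl": packing_impl,
        "allow_chopped": True,          # documented default
        "add_extra_token_to_seq": 1,    # documented default
    }


# Placeholder values, chosen only for illustration.
kwargs = build_setup_kwargs("data/corpus", "batch_info/corpus",
                            train_iters=100, train_batch_size=8,
                            train_seq_len=1024, seed=42)
print(sorted(kwargs))

# With an existing DatasetManager instance named `manager`, the call would be:
# manager.setup_edit_inspect_sample_export(**kwargs)
```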