Edit Handler
Source code in tokensmith/edit/handler.py
inject_and_preview
inject_and_preview(text, tokenizer, injection_loc, injection_type='seq_shuffle', rng=None, add_eos_token=True, dry_run=True, return_details=False)
Injects a dummy sequence into the dataset at a given location and prints before/after samples.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
text | str | The dummy text to tokenize and inject. | required |
tokenizer | Optional[Any] | A HuggingFace-compatible tokenizer that is callable and provides a decode method. | required |
injection_loc | int | Index of the sample in the training set to modify. | required |
injection_type | str | Where to inject. Options: 'seq_shuffle' or 'seq_start'. | 'seq_shuffle' |
rng | Generator | RNG for reproducibility. If None, uses np.random.default_rng() with seed 1234. | None |
add_eos_token | bool | Whether to add an EOS token to the injected text. | True |
dry_run | bool | If True, no actual injection is performed. | True |
return_details | bool | If True, returns structured data instead of just printing. | False |
Raises:
Type | Description |
---|---|
ValueError | If injection_loc is negative, injection_type is invalid, or tokenizer is None. |
Returns:
Name | Type | Description |
---|---|---|
None | Union[None, Dict[str, Any]] | If return_details is False (default behavior with printing). |
Dict[str, Any] | Union[None, Dict[str, Any]] | If return_details is True, returns structured data with original and modified samples. |
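The three failure cases listed under Raises can be sketched as a small pre-flight check. This is a minimal illustration, not the code in tokensmith/edit/handler.py: `check_injection_args` and `VALID_INJECTION_TYPES` are hypothetical names, and the only rules encoded are the ones stated in the tables above.

```python
from typing import Any, Optional

# Hypothetical constant: the two options named in the parameter table.
VALID_INJECTION_TYPES = ("seq_shuffle", "seq_start")


def check_injection_args(
    tokenizer: Optional[Any],
    injection_loc: int,
    injection_type: str = "seq_shuffle",
) -> None:
    """Raise ValueError for the three documented failure cases."""
    if tokenizer is None:
        raise ValueError("tokenizer must not be None")
    if injection_loc < 0:
        raise ValueError(
            f"injection_loc must be non-negative, got {injection_loc}"
        )
    if injection_type not in VALID_INJECTION_TYPES:
        raise ValueError(
            f"injection_type must be one of {VALID_INJECTION_TYPES}, "
            f"got {injection_type!r}"
        )
```

Running a check like this before a call made with dry_run=False lets a bad argument fail early, before anything in the dataset is touched.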
inject_multiple_samples
inject_multiple_samples(injections, tokenizer, rng=None, add_eos_token=True, dry_run=True, return_details=False)
Inject multiple samples into the dataset in batch.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
injections | List[Dict] | List of injection specifications, each containing: text (str): text to inject; injection_loc (int): location to inject; injection_type (str, optional): type of injection, defaults to "seq_shuffle". | required |
tokenizer | Optional[Any] | A HuggingFace-compatible tokenizer. | required |
rng | Generator | RNG for reproducibility. | None |
add_eos_token | bool | Whether to add an EOS token to the injected text. | True |
dry_run | bool | If True, no actual injection is performed. | True |
return_details | bool | If True, returns structured data for all injections. | False |
Raises:
Type | Description |
---|---|
ValueError | If the injections list is invalid or any injection specification is invalid. |
Returns:
Name | Type | Description |
---|---|---|
None | Union[None, List[Dict[str, Any]]] | If return_details is False. |
List[Dict[str, Any]] | Union[None, List[Dict[str, Any]]] | If return_details is True, returns list of injection results. |
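The shape of the injections argument can be illustrated with a small normalizer that checks each specification and fills in the documented default for injection_type. This is a sketch under assumptions: `normalize_injections` is a hypothetical helper, and treating an empty list as invalid is a guess, since the page only says that an "invalid" list raises ValueError.

```python
from typing import Any, Dict, List


def normalize_injections(injections: List[Dict[str, Any]]) -> List[Dict[str, Any]]:
    """Validate injection specs and fill in the documented default type."""
    # Assumption: an empty list counts as invalid.
    if not isinstance(injections, list) or not injections:
        raise ValueError("injections must be a non-empty list of dicts")

    normalized = []
    for i, spec in enumerate(injections):
        if not isinstance(spec, dict):
            raise ValueError(f"injection {i} is not a dict")
        if not isinstance(spec.get("text"), str):
            raise ValueError(f"injection {i} needs a string 'text'")
        loc = spec.get("injection_loc")
        # bool is a subclass of int, so it is excluded explicitly.
        if isinstance(loc, bool) or not isinstance(loc, int) or loc < 0:
            raise ValueError(
                f"injection {i} needs a non-negative int 'injection_loc'"
            )
        normalized.append(
            {
                "text": spec["text"],
                "injection_loc": loc,
                # Documented default when the key is omitted.
                "injection_type": spec.get("injection_type", "seq_shuffle"),
            }
        )
    return normalized
```

A list such as `[{"text": "hello", "injection_loc": 3}]` passes through with injection_type set to "seq_shuffle", while a spec with a missing text or a negative location raises ValueError.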
preview_sample
preview_sample(sample_id, return_doc_details=False, return_detokenized=True, tokenizer=None)
Preview a sample by its ID without modification, similar to inspect functionality.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sample_id | int | The index of the sample to preview. | required |
return_doc_details | bool | If True, includes associated document details. | False |
return_detokenized | bool | If True, returns detokenized text instead of token arrays. | True |
tokenizer | Optional[Any] | The tokenizer to use for detokenization (required if return_detokenized is True). | None |
Raises:
Type | Description |
---|---|
ValueError | If sample_id is not a non-negative integer, or if tokenizer is None when return_detokenized is True. |
Returns:
Type | Description |
---|---|
Union[List[ndarray], str, Tuple[List[ndarray], Dict], Tuple[str, Dict]] | Similar to InspectHandler.inspect_sample_by_id. |
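Because the return annotation covers four shapes, calling code usually needs to separate the content from the optional document details. The helper below is a hedged sketch of one way to do that: `split_preview_result` is a hypothetical name, and it assumes only what the annotation states, namely that the two tuple variants pair the content with a dict.

```python
from typing import Any, Dict, Optional, Tuple


def split_preview_result(result: Any) -> Tuple[Any, Optional[Dict]]:
    """Split a preview result into (content, doc_details).

    doc_details is None when the call returned content only.
    """
    # The two Tuple[...] variants pair the content with a dict.
    if (
        isinstance(result, tuple)
        and len(result) == 2
        and isinstance(result[1], dict)
    ):
        return result[0], result[1]
    # Otherwise the result is the content itself (text or token arrays).
    return result, None
```

The content half is then either a string or a list of token arrays, which the parameter table suggests is controlled by return_detokenized.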
validate_injection_location
validate_injection_location(injection_loc)
Validate if an injection location is valid for the dataset.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
injection_loc | int | The injection location to validate. | required |
Returns:
Name | Type | Description |
---|---|---|
bool | bool | True if the location is valid, False otherwise. |
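The page does not state what makes a location valid. As an illustration only, the sketch below assumes that a valid location is an integer index inside the dataset, and it returns a boolean instead of raising, matching the documented return type. `is_valid_location` and `num_samples` are hypothetical names, not part of the handler.

```python
def is_valid_location(injection_loc: object, num_samples: int) -> bool:
    """Return True if injection_loc is an int index within [0, num_samples)."""
    # Assumption: "valid" means an in-range integer index.
    # bool is a subclass of int, so it is rejected explicitly.
    if isinstance(injection_loc, bool) or not isinstance(injection_loc, int):
        return False
    return 0 <= injection_loc < num_samples
```

A boolean check like this is convenient for filtering a batch of candidate locations before building the injections list, since it never interrupts the loop with an exception.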