Inspect Handler
Source code in tokensmith/inspect/handler.py
inspect_sample_by_id
inspect_sample_by_id(sample_id, return_doc_details=False, return_detokenized=False, tokenizer=None)
Returns a sample by its ID, optionally with document details and/or as detokenized text.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
sample_id | int | The index of the sample to retrieve. | required |
return_doc_details | bool | If True, includes associated document details. | False |
return_detokenized | bool | If True, returns detokenized text instead of token arrays. | False |
tokenizer | Optional[Any] | The tokenizer to use for detokenization (required if return_detokenized is True). | None |
Raises:
Type | Description |
---|---|
ValueError | If sample_id is not a non-negative integer or if tokenizer is None when return_detokenized is True. |
Returns:
Union[List[ndarray], str, Tuple[List[ndarray], Dict], Tuple[str, Dict]], depending on the flags:
Type | Description |
---|---|
List[np.ndarray] | A list of numpy arrays representing the token sequence (if return_detokenized is False and return_doc_details is False). |
str | Detokenized text (if return_detokenized is True and return_doc_details is False). |
Tuple[List[np.ndarray], Dict] | Token sequence and document details (if return_detokenized is False and return_doc_details is True). |
Tuple[str, Dict] | Detokenized text and document details (if return_detokenized is True and return_doc_details is True). |
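The four return shapes follow from the two flags. The sketch below is not the tokensmith implementation: `StubHandler` and `StubTokenizer` are hypothetical stand-ins that mirror only the documented signature, validation, and return shapes. It assumes plain Python lists in place of numpy arrays and a tokenizer exposing a `decode` method, neither of which this page confirms.

```python
from typing import Any, Dict, List, Optional


class StubTokenizer:
    """Hypothetical tokenizer; the real tokenizer interface is not specified here."""

    def decode(self, token_ids: List[int]) -> str:
        # Render each token ID as "<id>" so the output is easy to check.
        return "".join(f"<{t}>" for t in token_ids)


class StubHandler:
    """Hypothetical stand-in mirroring the documented signature only."""

    def __init__(self, samples: List[List[List[int]]], details: List[Dict]):
        self._samples = samples  # lists of ints stand in for np.ndarray
        self._details = details

    def inspect_sample_by_id(
        self,
        sample_id: int,
        return_doc_details: bool = False,
        return_detokenized: bool = False,
        tokenizer: Optional[Any] = None,
    ):
        # Documented validation: both conditions raise ValueError.
        if not isinstance(sample_id, int) or sample_id < 0:
            raise ValueError("sample_id must be a non-negative integer")
        if return_detokenized and tokenizer is None:
            raise ValueError("tokenizer is required when return_detokenized is True")

        result = self._samples[sample_id]
        if return_detokenized:
            # Assumption: the token arrays of a sample are decoded and concatenated.
            result = "".join(tokenizer.decode(chunk) for chunk in result)
        if return_doc_details:
            return result, self._details[sample_id]
        return result


handler = StubHandler(samples=[[[1, 2], [3]]], details=[{"doc_ids": [0, 1]}])

tokens = handler.inspect_sample_by_id(0)
text = handler.inspect_sample_by_id(0, return_detokenized=True, tokenizer=StubTokenizer())
tokens_with_details = handler.inspect_sample_by_id(0, return_doc_details=True)
text_with_details = handler.inspect_sample_by_id(
    0, return_doc_details=True, return_detokenized=True, tokenizer=StubTokenizer()
)
```

Because the result is a tuple only when return_doc_details is True, callers that toggle that flag at runtime need to branch before unpacking.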
inspect_sample_by_batch
inspect_sample_by_batch(batch_id, batch_size, return_doc_details=False, return_detokenized=False, tokenizer=None)
Returns a batch of samples by batch ID, optionally with document details and/or as detokenized text.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
batch_id | int | The index of the batch to retrieve. | required |
batch_size | int | The size of the batch. | required |
return_doc_details | bool | If True, includes associated document details. | False |
return_detokenized | bool | If True, returns detokenized text instead of token arrays. | False |
tokenizer | Optional[Any] | The tokenizer to use for detokenization (required if return_detokenized is True). | None |
Raises:
Type | Description |
---|---|
ValueError | If batch_id is not a non-negative integer or if tokenizer is None when return_detokenized is True. |
Returns:
Union[List[List[ndarray]], List[str], List[Tuple[List[ndarray], Dict]], List[Tuple[str, Dict]]], depending on the flags:
Type | Description |
---|---|
List[List[np.ndarray]] | A list of samples, where each sample is a list of token arrays (if return_detokenized is False and return_doc_details is False). |
List[str] | A list of detokenized text samples (if return_detokenized is True and return_doc_details is False). |
List[Tuple[List[np.ndarray], Dict]] | A list of tuples containing token sequences and document details (if return_detokenized is False and return_doc_details is True). |
List[Tuple[str, Dict]] | A list of tuples containing detokenized text and document details (if return_detokenized is True and return_doc_details is True). |
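This page does not state how a batch maps to sample IDs. The sketch below assumes, without confirmation from the source, that batch `b` of size `s` covers the contiguous sample IDs `b * s` through `(b + 1) * s - 1`, and that a batch is simply a list of per-sample results. `batch_sample_ids`, `inspect_batch`, and `fake_fetch` are hypothetical helpers written for illustration, not tokensmith functions.

```python
from typing import Any, Callable, List


def batch_sample_ids(batch_id: int, batch_size: int) -> range:
    """ASSUMPTION: batch b covers sample IDs [b * batch_size, (b + 1) * batch_size)."""
    if not isinstance(batch_id, int) or batch_id < 0:
        raise ValueError("batch_id must be a non-negative integer")
    start = batch_id * batch_size
    return range(start, start + batch_size)


def inspect_batch(
    fetch_sample: Callable[..., Any], batch_id: int, batch_size: int, **flags: Any
) -> List[Any]:
    """Collect one per-sample result for every ID in the (assumed) batch range."""
    return [fetch_sample(i, **flags) for i in batch_sample_ids(batch_id, batch_size)]


def fake_fetch(sample_id: int, return_doc_details: bool = False):
    """Hypothetical per-sample lookup that returns text, optionally with details."""
    text = f"sample-{sample_id}"
    if return_doc_details:
        return text, {"sample_id": sample_id}
    return text


batch = inspect_batch(fake_fetch, batch_id=2, batch_size=3, return_doc_details=True)

# With return_doc_details=True, each element is a (sample, details) tuple.
texts = [text for text, _details in batch]
```

Under this assumption, the element type of the returned list matches the corresponding single-sample return type, which is consistent with the table above.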