Sample Handler
Source code in tokensmith/sample/handler.py
get_samples_by_indices
get_samples_by_indices(indices, return_doc_details=False, return_detokenized=False, tokenizer=None)
Returns the samples at the given indices, optionally paired with document details and/or detokenized into text.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
indices | List[int] | List of sample indices to retrieve. | required |
return_doc_details | bool | If True, includes associated document details. | False |
return_detokenized | bool | If True, returns detokenized text instead of token arrays. | False |
tokenizer | Optional[Any] | The tokenizer to use for detokenization (required if return_detokenized is True). | None |
Raises:
Type | Description |
---|---|
ValueError | If indices is not a list of non-negative integers, or if tokenizer is None when return_detokenized is True. |
Returns:
Type | Description |
---|---|
List[List[np.ndarray]] | A list of samples, where each sample is a list of token arrays (if return_detokenized is False and return_doc_details is False). |
List[str] | A list of detokenized text samples (if return_detokenized is True and return_doc_details is False). |
List[Tuple[List[np.ndarray], Dict]] | A list of tuples containing token sequences and document details (if return_detokenized is False and return_doc_details is True). |
List[Tuple[str, Dict]] | A list of tuples containing detokenized text and document details (if return_detokenized is True and return_doc_details is True). |
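The way the two flags select one of the four return shapes can be illustrated with a small stand-in. This is only a sketch of the documented contract, not the code in tokensmith/sample/handler.py: `get_samples_sketch`, `ToyTokenizer`, `SAMPLES`, and `DOC_DETAILS` are hypothetical names, plain lists of ints stand in for np.ndarray token arrays, and the contents of the details dict are invented for illustration.

```python
from typing import Any, Dict, List, Optional

# Hypothetical in-memory data: each sample is a list of token arrays
# (plain lists of ints stand in for np.ndarray here).
SAMPLES = [[[1, 2], [3]], [[4, 5, 6]], [[7], [8, 9]]]
# Hypothetical per-sample document details; the real dict layout is not documented here.
DOC_DETAILS: List[Dict[str, Any]] = [
    {"doc_ids": [0, 1]},
    {"doc_ids": [2]},
    {"doc_ids": [3, 4]},
]


class ToyTokenizer:
    """Hypothetical tokenizer: decodes token IDs to a space-joined string."""

    def decode(self, token_ids: List[int]) -> str:
        return " ".join(str(t) for t in token_ids)


def get_samples_sketch(
    indices: List[int],
    return_doc_details: bool = False,
    return_detokenized: bool = False,
    tokenizer: Optional[Any] = None,
) -> list:
    # Validation mirroring the two documented ValueError conditions.
    if not isinstance(indices, list) or not all(
        isinstance(i, int) and not isinstance(i, bool) and i >= 0 for i in indices
    ):
        raise ValueError("indices must be a list of non-negative integers")
    if return_detokenized and tokenizer is None:
        raise ValueError("tokenizer is required when return_detokenized is True")

    results = []
    for i in indices:
        sample = SAMPLES[i]  # out-of-range indices are not handled in this sketch
        if return_detokenized:
            # Flatten the token arrays of one sample before decoding.
            flat = [tok for arr in sample for tok in arr]
            sample = tokenizer.decode(flat)
        # Pair with details only when requested; otherwise return the bare sample.
        results.append((sample, DOC_DETAILS[i]) if return_doc_details else sample)
    return results
```

In this sketch, `get_samples_sketch([1], return_detokenized=True, tokenizer=ToyTokenizer())` yields a list with one string, while setting both flags yields a list of `(text, details)` tuples.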
get_batches_by_ids
get_batches_by_ids(batch_ids, batch_size, return_doc_details=False, return_detokenized=False, tokenizer=None)
Returns the samples belonging to the given batch IDs, grouped by batch, optionally paired with document details and/or detokenized into text.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
batch_ids | List[int] | List of batch IDs to retrieve. | required |
batch_size | int | The size of each batch. | required |
return_doc_details | bool | If True, includes associated document details. | False |
return_detokenized | bool | If True, returns detokenized text instead of token arrays. | False |
tokenizer | Optional[Any] | The tokenizer to use for detokenization (required if return_detokenized is True). | None |
Raises:
Type | Description |
---|---|
ValueError | If batch_ids is not a list of non-negative integers, or if tokenizer is None when return_detokenized is True. |
Returns:
Type | Description |
---|---|
List[List[List[np.ndarray]]] | A list of batches, where each batch is a list of samples (if return_detokenized is False and return_doc_details is False). |
List[List[str]] | A list of batches, where each batch is a list of detokenized text samples (if return_detokenized is True and return_doc_details is False). |
List[List[Tuple[List[np.ndarray], Dict]]] | A list of batches with token sequences and document details (if return_detokenized is False and return_doc_details is True). |
List[List[Tuple[str, Dict]]] | A list of batches with detokenized text and document details (if return_detokenized is True and return_doc_details is True). |
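The reference above does not say how a batch ID maps to sample indices. A minimal sketch, assuming batches are contiguous so that batch `b` covers sample indices `b * batch_size` through `(b + 1) * batch_size - 1`, could look like the following. `batch_ids_to_indices` is a hypothetical helper, and the contiguous layout is an assumption rather than documented behavior.

```python
from typing import List


def batch_ids_to_indices(batch_ids: List[int], batch_size: int) -> List[List[int]]:
    """Expand batch IDs into per-batch sample indices (assumed contiguous layout)."""
    # Mirrors the documented ValueError condition on batch_ids.
    if not isinstance(batch_ids, list) or not all(
        isinstance(b, int) and not isinstance(b, bool) and b >= 0 for b in batch_ids
    ):
        raise ValueError("batch_ids must be a list of non-negative integers")
    # Extra guard for this sketch only; not a documented error condition.
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    # ASSUMPTION: batch b holds samples [b * batch_size, (b + 1) * batch_size).
    return [list(range(b * batch_size, (b + 1) * batch_size)) for b in batch_ids]
```

Under that assumption, each inner list could be passed to get_samples_by_indices to obtain one batch, which matches the nested "list of batches" return shapes in the table above.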
get_samples_by_policy
get_samples_by_policy(policy_fn, *policy_args, return_doc_details=False, return_detokenized=False, tokenizer=None, **policy_kwargs)
Returns samples based on a sampling policy function that generates indices.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
policy_fn | callable | A function that returns a list of sample indices. | required |
*policy_args | | Positional arguments to pass to the policy function. | () |
return_doc_details | bool | If True, includes associated document details. | False |
return_detokenized | bool | If True, returns detokenized text instead of token arrays. | False |
tokenizer | Optional[Any] | The tokenizer to use for detokenization (required if return_detokenized is True). | None |
**policy_kwargs | | Keyword arguments to pass to the policy function. | {} |
Raises:
Type | Description |
---|---|
ValueError | If policy_fn is not callable or does not return a list of integers. |
Returns:
Type | Description |
---|---|
List[List[np.ndarray]] | A list of samples, where each sample is a list of token arrays (if return_detokenized is False and return_doc_details is False). |
List[str] | A list of detokenized text samples (if return_detokenized is True and return_doc_details is False). |
List[Tuple[List[np.ndarray], Dict]] | A list of tuples containing token sequences and document details (if return_detokenized is False and return_doc_details is True). |
List[Tuple[str, Dict]] | A list of tuples containing detokenized text and document details (if return_detokenized is True and return_doc_details is True). |
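How the positional and keyword arguments reach the policy function can be sketched as below. `strided_policy` and `resolve_policy` are hypothetical names introduced for illustration; the sketch only mirrors the documented contract (call `policy_fn` with the forwarded arguments, then check that it returned a list of integers) and is not the library's own code.

```python
from typing import Callable, List


def strided_policy(start: int, stop: int, step: int = 1) -> List[int]:
    """Hypothetical policy: every `step`-th sample index in [start, stop)."""
    return list(range(start, stop, step))


def resolve_policy(policy_fn: Callable[..., List[int]], *policy_args, **policy_kwargs) -> List[int]:
    """Call a policy with forwarded arguments and validate its result."""
    # Mirrors the documented ValueError conditions.
    if not callable(policy_fn):
        raise ValueError("policy_fn must be callable")
    indices = policy_fn(*policy_args, **policy_kwargs)
    if not isinstance(indices, list) or not all(isinstance(i, int) for i in indices):
        raise ValueError("policy_fn must return a list of integers")
    return indices


# Positional args (0, 10) and the keyword arg step=3 are forwarded unchanged.
selected = resolve_policy(strided_policy, 0, 10, step=3)
```

Because return_doc_details, return_detokenized, and tokenizer follow *policy_args in the signature, they can only be passed by keyword, and a policy function cannot receive keyword arguments with those three names through **policy_kwargs.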
get_batches_by_policy
get_batches_by_policy(policy_fn, batch_size, *policy_args, return_doc_details=False, return_detokenized=False, tokenizer=None, **policy_kwargs)
Returns batches of samples based on a sampling policy function that generates batch IDs.
Parameters:
Name | Type | Description | Default |
---|---|---|---|
policy_fn | callable | A function that returns a list of batch IDs. | required |
batch_size | int | The size of each batch. | required |
*policy_args | | Positional arguments to pass to the policy function. | () |
return_doc_details | bool | If True, includes associated document details. | False |
return_detokenized | bool | If True, returns detokenized text instead of token arrays. | False |
tokenizer | Optional[Any] | The tokenizer to use for detokenization (required if return_detokenized is True). | None |
**policy_kwargs | | Keyword arguments to pass to the policy function. | {} |
Raises:
Type | Description |
---|---|
ValueError | If policy_fn is not callable or does not return a list of integers. |
Returns:
Type | Description |
---|---|
List[List[List[np.ndarray]]] | A list of batches, where each batch is a list of samples (if return_detokenized is False and return_doc_details is False). |
List[List[str]] | A list of batches, where each batch is a list of detokenized text samples (if return_detokenized is True and return_doc_details is False). |
List[List[Tuple[List[np.ndarray], Dict]]] | A list of batches with token sequences and document details (if return_detokenized is False and return_doc_details is True). |
List[List[Tuple[str, Dict]]] | A list of batches with detokenized text and document details (if return_detokenized is True and return_doc_details is True). |
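A batch policy is any callable that returns a list of batch IDs. As one possible example, the sketch below draws a reproducible random subset of batch IDs using only the standard library. `random_batch_policy` and its parameters `num_batches`, `k`, and `seed` are hypothetical and not part of the documented API.

```python
import random
from typing import List


def random_batch_policy(num_batches: int, k: int, seed: int = 0) -> List[int]:
    """Hypothetical policy: k distinct batch IDs drawn from range(num_batches)."""
    # A dedicated Random instance keeps the draw reproducible for a given seed
    # without touching the global random state.
    rng = random.Random(seed)
    # random.sample draws without replacement, so IDs are never repeated.
    return sorted(rng.sample(range(num_batches), k))
```

Given the signature above, a call would presumably take the form `handler.get_batches_by_policy(random_batch_policy, 8, 100, 5, seed=42)`, where `handler` is an assumed handler instance, 8 is batch_size, and the remaining arguments are forwarded to the policy.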