Dataset Inspection Tutorial¶
This tutorial demonstrates how to inspect samples from your tokenized dataset using TokenSmith's inspection functionality. We'll build on the setup from the first tutorial to explore individual samples and batches.
Prerequisites:
- Complete the first tutorial (01_basic_setup.ipynb)
- Have a tokenized dataset ready with batch info generated
What you'll learn:
- How to inspect individual samples by ID
- How to inspect batches of samples
- Understanding document details and metadata
- Working with tokenized vs detokenized content
- Exploring document boundaries and offsets
Setup¶
Let's start by importing the necessary libraries and setting up our environment, similar to the first tutorial.
# Fix paths for imports
import sys
sys.path.insert(0, "/NS/llm-pretraining/work/afkhan/tokensmith")
sys.path.insert(0, "/NS/llm-pretraining/work/afkhan/USC_Colab/gpt-neox")
# Import required libraries
import numpy as np
from transformers import AutoTokenizer
from tokensmith.manager import DatasetManager
# Load tokenizer
TOKENIZER_NAME_OR_PATH = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME_OR_PATH, add_eos_token=True)
print(f"Loaded tokenizer: {TOKENIZER_NAME_OR_PATH}")
Loaded tokenizer: EleutherAI/gpt-neox-20b
# Initialize DatasetManager and setup for inspection
dataset_manager = DatasetManager()
# Setup the dataset for inspection (same as tutorial 1)
dataset_manager.setup_edit_inspect_sample_export(
dataset_prefix='../../artifacts/data_tokenized_text_document',
batch_info_save_prefix='../../artifacts/batch_info',
train_iters=100,
train_batch_size=16,
train_seq_len=2048,
seed=42,
splits_string='990,5,5',
packing_impl='packed',
allow_chopped=True,
)
print("Dataset manager setup complete!")
warming up index mmap file... reading sizes... reading pointers... reading document index...
Dataset manager setup complete!
Basic Sample Inspection¶
Let's start by inspecting individual samples. We'll look at sample ID 0 and understand what information we can extract.
# Inspect the first sample (ID: 0) - returns tokenized data
sample_0 = dataset_manager.inspect.inspect_sample_by_id(sample_id=0)
print("Sample 0 (tokenized):")
print(f"Type: {type(sample_0)}")
print(f"Number of segments: {len(sample_0)}")
print(f"First segment shape: {sample_0[0].shape}")
print(f"First 10 tokens: {sample_0[0][:10]}")
Sample 0 (tokenized):
Type: <class 'list'>
Number of segments: 12
First segment shape: (70,)
First 10 tokens: [ 2181 4592 15 32817 434 1652 4929 2210 3515 285]
# Now let's see the same sample but detokenized (human-readable text)
sample_0_text = dataset_manager.inspect.inspect_sample_by_id(
sample_id=0,
return_detokenized=True,
tokenizer=tokenizer
)
print("Sample 0 (detokenized text):")
print(f"Type: {type(sample_0_text)}")
print(f"Length: {len(sample_0_text)} characters")
print("\nFirst 200 characters:")
print(sample_0_text[:200])
print("\n" + "="*50)
print("Last 200 characters:")
print(sample_0_text[-200:])
Sample 0 (detokenized text):
Type: <class 'str'>
Length: 8990 characters
First 200 characters:
thing happened. Lily's little brother came running and accidentally stepped on the diamond. Oh no! The diamond was destroyed! Lily was very sad, but her mommy and daddy told her that it's okay becaus
==================================================
Last 200 characters:
listening to the birds tweet and trying to catch a glimpse of a rabbit. Soon enough, they found the perfect spot to escape and set up a secret camp. They explored the forest, gathered flowers and made
Understanding Document Details¶
TokenSmith can also provide metadata about each sample, including document boundaries and offsets. This is useful for understanding how your data was packed and segmented.
# Get sample with document details
sample_0_with_details = dataset_manager.inspect.inspect_sample_by_id(
sample_id=0,
return_doc_details=True
)
tokens, doc_details = sample_0_with_details
print("Sample 0 - Document Details:")
print(f"Document details type: {type(doc_details)}")
print("Document metadata:")
for key, value in doc_details.items():
print(f" {key}: {value}")
Sample 0 - Document Details:
Document details type: <class 'dict'>
Document metadata:
  doc_index_f: 11212
  doc_index_l: 11223
  offset_f: 67
  offset_l: 154
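A quick way to read these fields: doc_index_f and doc_index_l are the indices of the first and last source documents packed into this sample, while offset_f and offset_l are token offsets into those boundary documents. The short sketch below (an interpretation based on the output above, not an official API guarantee) derives how many documents the sample touches; note that it matches the 12 segments we saw earlier.
# Sketch: interpret the document-detail fields for sample 0.
# Assumes doc_index_f/doc_index_l are inclusive indices of the first/last packed document.
num_docs_spanned = doc_details['doc_index_l'] - doc_details['doc_index_f'] + 1
print(f"Sample 0 is packed from {num_docs_spanned} source documents")
print(f"It starts {doc_details['offset_f']} tokens into the first document "
      f"and ends at offset {doc_details['offset_l']} within the last one")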
# Get both detokenized text AND document details
sample_0_text_with_details = dataset_manager.inspect.inspect_sample_by_id(
sample_id=0,
return_detokenized=True,
return_doc_details=True,
tokenizer=tokenizer
)
text, doc_details = sample_0_text_with_details
print("Sample 0 - Text with Document Details:")
print(f"Text length: {len(text)} characters")
print("\nDocument metadata:")
for key, value in doc_details.items():
print(f" {key}: {value}")
print(f"\nFirst 100 characters:\n{text[:100]}")
Sample 0 - Text with Document Details:
Text length: 8990 characters
Document metadata:
  doc_index_f: 11212
  doc_index_l: 11223
  offset_f: 67
  offset_l: 154
First 100 characters:
thing happened. Lily's little brother came running and accidentally stepped on the diamond. Oh no!
Inspecting Multiple Samples¶
Let's look at several samples to understand the variation in our dataset.
# Inspect multiple individual samples
sample_ids_to_check = [0, 1, 5, 10, 50]
print("Inspecting multiple samples:")
print("="*60)
for sample_id in sample_ids_to_check:
# Get detokenized text with document details
text, doc_details = dataset_manager.inspect.inspect_sample_by_id(
sample_id=sample_id,
return_detokenized=True,
return_doc_details=True,
tokenizer=tokenizer
)
print(f"\nSample ID: {sample_id}")
print(f"Text length: {len(text)} characters")
print(f"Doc index range: {doc_details['doc_index_f']} to {doc_details['doc_index_l']}")
print(f"Offset range: {doc_details['offset_f']} to {doc_details['offset_l']}")
print(f"Preview: {text[:100]}...")
print("-" * 40)
Inspecting multiple samples:
============================================================
Sample ID: 0
Text length: 8990 characters
Doc index range: 11212 to 11223
Offset range: 67 to 154
Preview: thing happened. Lily's little brother came running and accidentally stepped on the diamond. Oh no! ...
----------------------------------------
Sample ID: 1
Text length: 8388 characters
Doc index range: 15126 to 15133
Offset range: 129 to 151
Preview: . She had gone to the office for a minute. Lily had an idea. "Let's steal some crayons," she whisper...
----------------------------------------
Sample ID: 5
Text length: 8404 characters
Doc index range: 5991 to 5998
Offset range: 226 to 134
Preview: Let's look at the pictures. They might tell us something." Lila and Ben look at the pictures on the ...
----------------------------------------
Sample ID: 10
Text length: 8530 characters
Doc index range: 7983 to 7994
Offset range: 16 to 4
Preview: day, they find a big club on the grass. It is brown and heavy. "Look, a club!" Lily says. "Let's pl...
----------------------------------------
Sample ID: 50
Text length: 8495 characters
Doc index range: 14417 to 14425
Offset range: 215 to 176
Preview: Ben's car fell on the ground and broke. The wheel came off and the paint scratched. "Uh oh!" Lily s...
----------------------------------------
Batch Inspection¶
TokenSmith also allows you to inspect entire batches at once, which is useful for understanding how your training batches will look.
# Inspect batch 0 (first batch of samples)
batch_0 = dataset_manager.inspect.inspect_sample_by_batch(
batch_id=0,
batch_size=4, # Let's use a smaller batch size for easier inspection
return_detokenized=True,
tokenizer=tokenizer
)
print(f"Batch 0 inspection:")
print(f"Batch type: {type(batch_0)}")
print(f"Number of samples in batch: {len(batch_0)}")
for i, sample_text in enumerate(batch_0):
print(f"\n--- Sample {i} in batch ---")
print(f"Length: {len(sample_text)} characters")
print(f"Preview: {sample_text[:80]}...")
Batch 0 inspection:
Batch type: <class 'list'>
Number of samples in batch: 4
--- Sample 0 in batch ---
Length: 8990 characters
Preview: thing happened. Lily's little brother came running and accidentally stepped on ...
--- Sample 1 in batch ---
Length: 8388 characters
Preview: . She had gone to the office for a minute. Lily had an idea. "Let's steal some c...
--- Sample 2 in batch ---
Length: 8789 characters
Preview: agreed to marry him. They had a wonderful wedding and were very happy together....
--- Sample 3 in batch ---
Length: 8700 characters
Preview: sleep, Maggie's mommy saw something very rare and wet. It was raining outside a...
# Inspect batch with document details
batch_0_with_details = dataset_manager.inspect.inspect_sample_by_batch(
batch_id=0,
batch_size=4,
return_detokenized=True,
return_doc_details=True,
tokenizer=tokenizer
)
print("Batch 0 with document details:")
print(f"Batch size: {len(batch_0_with_details)}")
for i, (sample_text, doc_details) in enumerate(batch_0_with_details):
print(f"\n--- Sample {i} in batch ---")
print(f"Text length: {len(sample_text)} characters")
print(f"Document range: docs {doc_details['doc_index_f']}-{doc_details['doc_index_l']}")
print(f"Offset range: {doc_details['offset_f']}-{doc_details['offset_l']}")
print(f"Preview: {sample_text[:60]}...")
Batch 0 with document details:
Batch size: 4
--- Sample 0 in batch ---
Text length: 8990 characters
Document range: docs 11212-11223
Offset range: 67-154
Preview: thing happened. Lily's little brother came running and acci...
--- Sample 1 in batch ---
Text length: 8388 characters
Document range: docs 15126-15133
Offset range: 129-151
Preview: . She had gone to the office for a minute. Lily had an idea....
--- Sample 2 in batch ---
Text length: 8789 characters
Document range: docs 9100-9111
Offset range: 61-116
Preview: agreed to marry him. They had a wonderful wedding and were ...
--- Sample 3 in batch ---
Text length: 8700 characters
Document range: docs 5168-5178
Offset range: 110-202
Preview: sleep, Maggie's mommy saw something very rare and wet. It w...
Understanding Tokenization Patterns¶
Let's examine how different types of content get tokenized to better understand our dataset.
# Compare tokenized vs detokenized for analysis
sample_id = 5
# Get tokenized version (raw tokens)
tokens = dataset_manager.inspect.inspect_sample_by_id(sample_id=sample_id)
# Get detokenized version (text)
text = dataset_manager.inspect.inspect_sample_by_id(
sample_id=sample_id,
return_detokenized=True,
tokenizer=tokenizer
)
print(f"Analysis of Sample {sample_id}:")
print(f"Number of token segments: {len(tokens)}")
total_tokens = sum(len(segment) for segment in tokens)
print(f"Total tokens: {total_tokens}")
print(f"Total characters: {len(text)}")
print(f"Average tokens per character: {total_tokens/len(text):.3f}")
print(f"\nToken distribution across segments:")
for i, segment in enumerate(tokens):
print(f" Segment {i}: {len(segment)} tokens")
Analysis of Sample 5:
Number of token segments: 8
Total tokens: 2049
Total characters: 8404
Average tokens per character: 0.244
Token distribution across segments:
  Segment 0: 495 tokens
  Segment 1: 173 tokens
  Segment 2: 179 tokens
  Segment 3: 228 tokens
  Segment 4: 256 tokens
  Segment 5: 171 tokens
  Segment 6: 412 tokens
  Segment 7: 135 tokens
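To get a broader feel for tokenization density, it can help to flip the ratio around and look at characters per token across a handful of samples. A minimal sketch reusing only the inspection calls shown above:
# Sketch: characters-per-token ratio for a few samples, reusing inspect_sample_by_id.
for sid in [0, 1, 5, 10]:
    toks = dataset_manager.inspect.inspect_sample_by_id(sample_id=sid)
    txt = dataset_manager.inspect.inspect_sample_by_id(
        sample_id=sid, return_detokenized=True, tokenizer=tokenizer
    )
    n_tokens = sum(len(seg) for seg in toks)
    print(f"Sample {sid}: {len(txt) / n_tokens:.2f} characters per token")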
# Let's look at the actual token IDs and their decoded values
sample_tokens = tokens[0][:20] # First 20 tokens from first segment
decoded_tokens = [tokenizer.decode([token_id]) for token_id in sample_tokens]
print("Token ID to text mapping (first 20 tokens):")
print("Token ID | Decoded Text")
print("-" * 30)
for token_id, decoded_text in zip(sample_tokens, decoded_tokens):
# Clean up the decoded text for display
display_text = repr(decoded_text)
print(f"{token_id:8d} | {display_text}")
Token ID to text mapping (first 20 tokens):
Token ID | Decoded Text
------------------------------
    1466 | 'Let'
     434 | "'s"
    1007 | ' look'
     387 | ' at'
     253 | ' the'
    7968 | ' pictures'
      15 | '.'
    1583 | ' They'
    1537 | ' might'
    2028 | ' tell'
     441 | ' us'
    1633 | ' something'
     449 | '."'
     418 | ' L'
    8807 | 'ila'
     285 | ' and'
    6029 | ' Ben'
    1007 | ' look'
     387 | ' at'
     253 | ' the'
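The segments within a sample generally correspond to the individual documents packed into it. If your preprocessing separated documents with the tokenizer's end-of-text token (a common GPT-NeoX convention, but an assumption about this particular dataset), you can spot the boundaries by checking each segment's final token:
# Sketch: check whether each segment of sample 5 ends with the EOS token.
# Assumes documents were delimited by EOS during preprocessing; adjust if yours differ.
eos_id = tokenizer.eos_token_id
for i, segment in enumerate(tokens):
    last_tok = int(segment[-1])
    marker = "EOS" if last_tok == eos_id else repr(tokenizer.decode([last_tok]))
    print(f"Segment {i}: last token {last_tok} -> {marker}")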
Advanced Inspection: Cross-Document Boundaries¶
Let's examine samples that might span multiple documents to understand how packing works.
# Find samples that span multiple documents
samples_with_multi_docs = []
for sample_id in range(20): # Check first 20 samples
_, doc_details = dataset_manager.inspect.inspect_sample_by_id(
sample_id=sample_id,
return_doc_details=True
)
# Check if this sample spans multiple documents
if doc_details['doc_index_f'] != doc_details['doc_index_l']:
samples_with_multi_docs.append((sample_id, doc_details))
print(f"Found {len(samples_with_multi_docs)} samples spanning multiple documents:")
for sample_id, doc_details in samples_with_multi_docs[:3]: # Show first 3
text = dataset_manager.inspect.inspect_sample_by_id(
sample_id=sample_id,
return_detokenized=True,
tokenizer=tokenizer
)
print(f"\nSample {sample_id}:")
print(f" Spans documents {doc_details['doc_index_f']} to {doc_details['doc_index_l']}")
print(f" Offset range: {doc_details['offset_f']} to {doc_details['offset_l']}")
print(f" Text length: {len(text)} characters")
print(f" Preview: {text[:100]}...")
Found 20 samples spanning multiple documents:
Sample 0:
  Spans documents 11212 to 11223
  Offset range: 67 to 154
  Text length: 8990 characters
  Preview: thing happened. Lily's little brother came running and accidentally stepped on the diamond. Oh no! ...
Sample 1:
  Spans documents 15126 to 15133
  Offset range: 129 to 151
  Text length: 8388 characters
  Preview: . She had gone to the office for a minute. Lily had an idea. "Let's steal some crayons," she whisper...
Sample 2:
  Spans documents 9100 to 9111
  Offset range: 61 to 116
  Text length: 8789 characters
  Preview: agreed to marry him. They had a wonderful wedding and were very happy together. They lived happily ...
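For a broader picture than the first 20 samples, a small survey of documents-per-sample shows how aggressively the packer splits and concatenates documents. A minimal sketch (the 200-sample range is arbitrary; scale it to your dataset):
from collections import Counter

# Sketch: distribution of documents-per-sample over the first 200 samples.
docs_per_sample = Counter()
for sid in range(200):
    _, details = dataset_manager.inspect.inspect_sample_by_id(
        sample_id=sid, return_doc_details=True
    )
    docs_per_sample[details['doc_index_l'] - details['doc_index_f'] + 1] += 1
for n_docs, count in sorted(docs_per_sample.items()):
    print(f"{count:4d} samples span {n_docs} document(s)")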
Practical Tips for Dataset Inspection¶
Here are some useful patterns for inspecting your dataset during development and debugging.
def quick_sample_summary(dataset_manager, sample_id, tokenizer):
"""Helper function to get a quick summary of any sample"""
# Get both tokenized and text versions with details
tokens = dataset_manager.inspect.inspect_sample_by_id(sample_id=sample_id)
text, doc_details = dataset_manager.inspect.inspect_sample_by_id(
sample_id=sample_id,
return_detokenized=True,
return_doc_details=True,
tokenizer=tokenizer
)
total_tokens = sum(len(segment) for segment in tokens)
summary = {
'sample_id': sample_id,
'total_tokens': total_tokens,
'total_chars': len(text),
'num_segments': len(tokens),
'doc_range': f"{doc_details['doc_index_f']}-{doc_details['doc_index_l']}",
'offset_range': f"{doc_details['offset_f']}-{doc_details['offset_l']}",
'spans_multiple_docs': doc_details['doc_index_f'] != doc_details['doc_index_l'],
'preview': text[:50] + "..." if len(text) > 50 else text
}
return summary
# Test our helper function
for sample_id in [0, 10, 25]:
summary = quick_sample_summary(dataset_manager, sample_id, tokenizer)
print(f"Sample {sample_id} Summary:")
for key, value in summary.items():
if key != 'sample_id':
print(f" {key}: {value}")
print()
Sample 0 Summary:
  total_tokens: 2049
  total_chars: 8990
  num_segments: 12
  doc_range: 11212-11223
  offset_range: 67-154
  spans_multiple_docs: True
  preview: thing happened. Lily's little brother came runnin...
Sample 10 Summary:
  total_tokens: 2049
  total_chars: 8530
  num_segments: 12
  doc_range: 7983-7994
  offset_range: 16-4
  spans_multiple_docs: True
  preview: day, they find a big club on the grass. It is bro...
Sample 25 Summary:
  total_tokens: 2049
  total_chars: 8254
  num_segments: 8
  doc_range: 11194-11201
  offset_range: 62-336
  spans_multiple_docs: True
  preview: that the sun made droplets scatter off of their b...
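For larger sweeps it is often handier to collect these summaries into a table. A minimal sketch using pandas (assuming it is available in your environment):
import pandas as pd

# Sketch: tabulate quick_sample_summary over a range of sample IDs.
summaries = [quick_sample_summary(dataset_manager, sid, tokenizer) for sid in range(10)]
df = pd.DataFrame(summaries)
print(df[['sample_id', 'total_tokens', 'total_chars', 'num_segments', 'spans_multiple_docs']])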
def batch_statistics(dataset_manager, batch_id, batch_size, tokenizer):
"""Get statistics for an entire batch"""
batch_data = dataset_manager.inspect.inspect_sample_by_batch(
batch_id=batch_id,
batch_size=batch_size,
return_detokenized=True,
return_doc_details=True,
tokenizer=tokenizer
)
stats = {
'batch_id': batch_id,
'batch_size': len(batch_data),
'text_lengths': [],
'multi_doc_samples': 0,
'total_chars': 0
}
for text, doc_details in batch_data:
stats['text_lengths'].append(len(text))
stats['total_chars'] += len(text)
if doc_details['doc_index_f'] != doc_details['doc_index_l']:
stats['multi_doc_samples'] += 1
stats['avg_length'] = stats['total_chars'] / stats['batch_size']
stats['min_length'] = min(stats['text_lengths'])
stats['max_length'] = max(stats['text_lengths'])
return stats
# Get statistics for first few batches
for batch_id in range(3):
stats = batch_statistics(dataset_manager, batch_id, 4, tokenizer)
print(f"Batch {batch_id} Statistics:")
print(f" Samples: {stats['batch_size']}")
print(f" Total characters: {stats['total_chars']:,}")
print(f" Average length: {stats['avg_length']:.1f}")
print(f" Length range: {stats['min_length']}-{stats['max_length']}")
print(f" Multi-document samples: {stats['multi_doc_samples']}")
print()
Batch 0 Statistics:
  Samples: 4
  Total characters: 34,867
  Average length: 8716.8
  Length range: 8388-8990
  Multi-document samples: 4
Batch 1 Statistics:
  Samples: 4
  Total characters: 34,649
  Average length: 8662.2
  Length range: 8404-8976
  Multi-document samples: 4
Batch 2 Statistics:
  Samples: 4
  Total characters: 34,621
  Average length: 8655.2
  Length range: 8530-8813
  Multi-document samples: 4
Summary¶
Congratulations! You've successfully learned how to inspect your tokenized dataset using TokenSmith. Here's what we covered:
Key Concepts Learned:¶
- Individual Sample Inspection: How to retrieve and examine single samples by ID
- Batch Inspection: How to inspect multiple samples as batches
- Document Details: Understanding metadata about document boundaries and offsets
- Tokenized vs Detokenized: Working with both token arrays and human-readable text
- Cross-Document Analysis: Identifying samples that span multiple source documents
- Practical Utilities: Creating helper functions for routine inspection tasks
Key Methods Used:¶
- dataset_manager.inspect.inspect_sample_by_id() - Inspect individual samples
- dataset_manager.inspect.inspect_sample_by_batch() - Inspect batches of samples
- Parameters: return_doc_details, return_detokenized, tokenizer
Next Steps:¶
- Tutorial 3: Learn about different sampling methods and policies
- Tutorial 4: Explore search functionality across your dataset
- Tutorial 5: Understand editing and injection capabilities
Pro Tips:¶
- Always use return_doc_details=True when debugging data packing issues
- Create helper functions for routine inspection tasks
- Use batch inspection to understand training data patterns
- Compare tokenized and detokenized versions to verify data integrity (see the sketch below)
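Building on that last tip, one cheap consistency check is that every packed sample should carry the same total token count, which for this setup appears to be train_seq_len + 1 = 2049 (the extra token presumably covering the shifted label target). A minimal sketch, assuming the train_seq_len=2048 configuration used above:
# Sketch: verify each sample carries train_seq_len + 1 tokens, consistent with
# the 2049-token totals observed earlier (assumes train_seq_len=2048 as above).
EXPECTED_TOKENS = 2048 + 1
for sid in range(10):
    segments = dataset_manager.inspect.inspect_sample_by_id(sample_id=sid)
    total = sum(len(seg) for seg in segments)
    status = "OK" if total == EXPECTED_TOKENS else f"unexpected ({total})"
    print(f"Sample {sid}: {total} tokens -> {status}")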