Dataset Inspection Tutorial¶
This tutorial demonstrates how to inspect samples from your tokenized dataset using TokenSmith's inspection functionality. We'll build on the setup from the first tutorial to explore individual samples and batches.
Prerequisites:
- Complete the first tutorial (01_basic_setup.ipynb)
- Have a tokenized dataset ready with batch info generated
What you'll learn:
- How to inspect individual samples by ID
- How to inspect batches of samples
- Understanding document details and metadata
- Working with tokenized vs detokenized content
- Exploring document boundaries and offsets
Setup¶
Let's start by importing the necessary libraries and setting up our environment, similar to the first tutorial.
# Fix paths for imports
import sys
sys.path.insert(0, "/NS/llm-pretraining/work/afkhan/tokensmith")
sys.path.insert(0, "/NS/llm-pretraining/work/afkhan/USC_Colab/gpt-neox")
# Import required libraries
import numpy as np
from transformers import AutoTokenizer
from tokensmith.manager import DatasetManager
# Load tokenizer
TOKENIZER_NAME_OR_PATH = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME_OR_PATH, add_eos_token=True)
print(f"Loaded tokenizer: {TOKENIZER_NAME_OR_PATH}")
Loaded tokenizer: EleutherAI/gpt-neox-20b
# Initialize DatasetManager and setup for inspection
dataset_manager = DatasetManager()
# Setup the dataset for inspection (same as tutorial 1)
dataset_manager.setup_edit_inspect_sample_export(
dataset_prefix='../../artifacts/data_tokenized_text_document',
batch_info_save_prefix='../../artifacts/batch_info',
train_iters=100,
train_batch_size=16,
train_seq_len=2048,
seed=42,
splits_string='990,5,5',
packing_impl='packed',
allow_chopped=True,
)
print("Dataset manager setup complete!")
warming up index mmap file... reading sizes... reading pointers... reading document index...
Dataset manager setup complete!
Basic Sample Inspection¶
Let's start by inspecting individual samples. We'll look at sample ID 0 and understand what information we can extract.
# Inspect the first sample (ID: 0) - returns tokenized data
sample_0 = dataset_manager.inspect.inspect_sample_by_id(sample_id=0)
print("Sample 0 (tokenized):")
print(f"Type: {type(sample_0)}")
print(f"Number of segments: {len(sample_0)}")
print(f"First segment shape: {sample_0[0].shape}")
print(f"First 10 tokens: {sample_0[0][:10]}")
Sample 0 (tokenized):
Type: <class 'list'>
Number of segments: 12
First segment shape: (70,)
First 10 tokens: [ 2181 4592 15 32817 434 1652 4929 2210 3515 285]
# Now let's see the same sample but detokenized (human-readable text)
sample_0_text = dataset_manager.inspect.inspect_sample_by_id(
sample_id=0,
return_detokenized=True,
tokenizer=tokenizer
)
print("Sample 0 (detokenized text):")
print(f"Type: {type(sample_0_text)}")
print(f"Length: {len(sample_0_text)} characters")
print("\nFirst 200 characters:")
print(sample_0_text[:200])
print("\n" + "="*50)
print("Last 200 characters:")
print(sample_0_text[-200:])
Sample 0 (detokenized text):
Type: <class 'str'>
Length: 8990 characters
First 200 characters:
thing happened. Lily's little brother came running and accidentally stepped on the diamond. Oh no! The diamond was destroyed! Lily was very sad, but her mommy and daddy told her that it's okay becaus
==================================================
Last 200 characters:
listening to the birds tweet and trying to catch a glimpse of a rabbit. Soon enough, they found the perfect spot to escape and set up a secret camp. They explored the forest, gathered flowers and made
Understanding Document Details¶
TokenSmith can also provide metadata about each sample, including document boundaries and offsets. This is useful for understanding how your data was packed and segmented.
# Get sample with document details
sample_0_with_details = dataset_manager.inspect.inspect_sample_by_id(
sample_id=0,
return_doc_details=True
)
tokens, doc_details = sample_0_with_details
print("Sample 0 - Document Details:")
print(f"Document details type: {type(doc_details)}")
print("Document metadata:")
for key, value in doc_details.items():
print(f" {key}: {value}")
Sample 0 - Document Details:
Document details type: <class 'dict'>
Document metadata:
  doc_index_f: 11212
  doc_index_l: 11223
  offset_f: 67
  offset_l: 154
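A quick way to read these fields: doc_index_f and doc_index_l are the indices of the first and last source documents packed into this sample, while offset_f and offset_l are token offsets into those boundary documents. The short sketch below (an interpretation based on the output above, not an official API guarantee) derives how many documents the sample touches; note that it matches the 12 segments we saw earlier.
# Sketch: interpret the document-detail fields for sample 0.
# Assumes doc_index_f/doc_index_l are inclusive indices of the first/last packed document.
num_docs_spanned = doc_details['doc_index_l'] - doc_details['doc_index_f'] + 1
print(f"Sample 0 is packed from {num_docs_spanned} source documents")
print(f"It starts {doc_details['offset_f']} tokens into the first document "
      f"and ends at offset {doc_details['offset_l']} within the last one")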
# Get both detokenized text AND document details
sample_0_text_with_details = dataset_manager.inspect.inspect_sample_by_id(
sample_id=0,
return_detokenized=True,
return_doc_details=True,
tokenizer=tokenizer
)
text, doc_details = sample_0_text_with_details
print("Sample 0 - Text with Document Details:")
print(f"Text length: {len(text)} characters")
print("\nDocument metadata:")
for key, value in doc_details.items():
print(f" {key}: {value}")
print(f"\nFirst 100 characters:\n{text[:100]}")
Sample 0 - Text with Document Details:
Text length: 8990 characters
Document metadata:
  doc_index_f: 11212
  doc_index_l: 11223
  offset_f: 67
  offset_l: 154
First 100 characters:
thing happened. Lily's little brother came running and accidentally stepped on the diamond. Oh no!
Inspecting Multiple Samples¶
Let's look at several samples to understand the variation in our dataset.
# Inspect multiple individual samples
sample_ids_to_check = [0, 1, 5, 10, 50]
print("Inspecting multiple samples:")
print("="*60)
for sample_id in sample_ids_to_check:
# Get detokenized text with document details
text, doc_details = dataset_manager.inspect.inspect_sample_by_id(
sample_id=sample_id,
return_detokenized=True,
return_doc_details=True,
tokenizer=tokenizer
)
print(f"\nSample ID: {sample_id}")
print(f"Text length: {len(text)} characters")
print(f"Doc index range: {doc_details['doc_index_f']} to {doc_details['doc_index_l']}")
print(f"Offset range: {doc_details['offset_f']} to {doc_details['offset_l']}")
print(f"Preview: {text[:100]}...")
print("-" * 40)
Inspecting multiple samples:
============================================================
Sample ID: 0
Text length: 8990 characters
Doc index range: 11212 to 11223
Offset range: 67 to 154
Preview: thing happened. Lily's little brother came running and accidentally stepped on the diamond. Oh no! ...
----------------------------------------
Sample ID: 1
Text length: 8388 characters
Doc index range: 15126 to 15133
Offset range: 129 to 151
Preview: . She had gone to the office for a minute. Lily had an idea. "Let's steal some crayons," she whisper...
----------------------------------------
Sample ID: 5
Text length: 8404 characters
Doc index range: 5991 to 5998
Offset range: 226 to 134
Preview: Let's look at the pictures. They might tell us something." Lila and Ben look at the pictures on the ...
----------------------------------------
Sample ID: 10
Text length: 8530 characters
Doc index range: 7983 to 7994
Offset range: 16 to 4
Preview: day, they find a big club on the grass. It is brown and heavy. "Look, a club!" Lily says. "Let's pl...
----------------------------------------
Sample ID: 50
Text length: 8495 characters
Doc index range: 14417 to 14425
Offset range: 215 to 176
Preview: Ben's car fell on the ground and broke. The wheel came off and the paint scratched. "Uh oh!" Lily s...
----------------------------------------
Batch Inspection¶
TokenSmith also allows you to inspect entire batches at once, which is useful for understanding how your training batches will look.
# Inspect batch 0 (first batch of samples)
batch_0 = dataset_manager.inspect.inspect_sample_by_batch(
batch_id=0,
batch_size=4, # Let's use a smaller batch size for easier inspection
return_detokenized=True,
tokenizer=tokenizer
)
print(f"Batch 0 inspection:")
print(f"Batch type: {type(batch_0)}")
print(f"Number of samples in batch: {len(batch_0)}")
for i, sample_text in enumerate(batch_0):
print(f"\n--- Sample {i} in batch ---")
print(f"Length: {len(sample_text)} characters")
print(f"Preview: {sample_text[:80]}...")
Batch 0 inspection:
Batch type: <class 'list'>
Number of samples in batch: 4
--- Sample 0 in batch ---
Length: 8990 characters
Preview: thing happened. Lily's little brother came running and accidentally stepped on ...
--- Sample 1 in batch ---
Length: 8388 characters
Preview: . She had gone to the office for a minute. Lily had an idea. "Let's steal some c...
--- Sample 2 in batch ---
Length: 8789 characters
Preview: agreed to marry him. They had a wonderful wedding and were very happy together....
--- Sample 3 in batch ---
Length: 8700 characters
Preview: sleep, Maggie's mommy saw something very rare and wet. It was raining outside a...
# Inspect batch with document details
batch_0_with_details = dataset_manager.inspect.inspect_sample_by_batch(
batch_id=0,
batch_size=4,
return_detokenized=True,
return_doc_details=True,
tokenizer=tokenizer
)
print("Batch 0 with document details:")
print(f"Batch size: {len(batch_0_with_details)}")
for i, (sample_text, doc_details) in enumerate(batch_0_with_details):
print(f"\n--- Sample {i} in batch ---")
print(f"Text length: {len(sample_text)} characters")
print(f"Document range: docs {doc_details['doc_index_f']}-{doc_details['doc_index_l']}")
print(f"Offset range: {doc_details['offset_f']}-{doc_details['offset_l']}")
print(f"Preview: {sample_text[:60]}...")
Batch 0 with document details:
Batch size: 4
--- Sample 0 in batch ---
Text length: 8990 characters
Document range: docs 11212-11223
Offset range: 67-154
Preview: thing happened. Lily's little brother came running and acci...
--- Sample 1 in batch ---
Text length: 8388 characters
Document range: docs 15126-15133
Offset range: 129-151
Preview: . She had gone to the office for a minute. Lily had an idea....
--- Sample 2 in batch ---
Text length: 8789 characters
Document range: docs 9100-9111
Offset range: 61-116
Preview: agreed to marry him. They had a wonderful wedding and were ...
--- Sample 3 in batch ---
Text length: 8700 characters
Document range: docs 5168-5178
Offset range: 110-202
Preview: sleep, Maggie's mommy saw something very rare and wet. It w...
Understanding Tokenization Patterns¶
Let's examine how different types of content get tokenized to better understand our dataset.
# Compare tokenized vs detokenized for analysis
sample_id = 5
# Get tokenized version (raw tokens)
tokens = dataset_manager.inspect.inspect_sample_by_id(sample_id=sample_id)
# Get detokenized version (text)
text = dataset_manager.inspect.inspect_sample_by_id(
sample_id=sample_id,
return_detokenized=True,
tokenizer=tokenizer
)
print(f"Analysis of Sample {sample_id}:")
print(f"Number of token segments: {len(tokens)}")
total_tokens = sum(len(segment) for segment in tokens)
print(f"Total tokens: {total_tokens}")
print(f"Total characters: {len(text)}")
print(f"Average tokens per character: {total_tokens/len(text):.3f}")
print(f"\nToken distribution across segments:")
for i, segment in enumerate(tokens):
print(f" Segment {i}: {len(segment)} tokens")
Analysis of Sample 5:
Number of token segments: 8
Total tokens: 2049
Total characters: 8404
Average tokens per character: 0.244
Token distribution across segments:
  Segment 0: 495 tokens
  Segment 1: 173 tokens
  Segment 2: 179 tokens
  Segment 3: 228 tokens
  Segment 4: 256 tokens
  Segment 5: 171 tokens
  Segment 6: 412 tokens
  Segment 7: 135 tokens
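To get a broader feel for tokenization density, it can help to flip the ratio around and look at characters per token across a handful of samples. A minimal sketch reusing only the inspection calls shown above:
# Sketch: characters-per-token ratio for a few samples, reusing inspect_sample_by_id.
for sid in [0, 1, 5, 10]:
    toks = dataset_manager.inspect.inspect_sample_by_id(sample_id=sid)
    txt = dataset_manager.inspect.inspect_sample_by_id(
        sample_id=sid, return_detokenized=True, tokenizer=tokenizer
    )
    n_tokens = sum(len(seg) for seg in toks)
    print(f"Sample {sid}: {len(txt) / n_tokens:.2f} characters per token")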
# Let's look at the actual token IDs and their decoded values
sample_tokens = tokens[0][:20] # First 20 tokens from first segment
decoded_tokens = [tokenizer.decode([token_id]) for token_id in sample_tokens]
print("Token ID to text mapping (first 20 tokens):")
print("Token ID | Decoded Text")
print("-" * 30)
for token_id, decoded_text in zip(sample_tokens, decoded_tokens):
# Clean up the decoded text for display
display_text = repr(decoded_text)
print(f"{token_id:8d} | {display_text}")
Token ID to text mapping (first 20 tokens):
Token ID | Decoded Text
------------------------------
    1466 | 'Let'
     434 | "'s"
    1007 | ' look'
     387 | ' at'
     253 | ' the'
    7968 | ' pictures'
      15 | '.'
    1583 | ' They'
    1537 | ' might'
    2028 | ' tell'
     441 | ' us'
    1633 | ' something'
     449 | '."'
     418 | ' L'
    8807 | 'ila'
     285 | ' and'
    6029 | ' Ben'
    1007 | ' look'
     387 | ' at'
     253 | ' the'
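The segments within a sample generally correspond to the individual documents packed into it. If your preprocessing separated documents with the tokenizer's end-of-text token (a common GPT-NeoX convention, but an assumption about this particular dataset), you can spot the boundaries by checking each segment's final token:
# Sketch: check whether each segment of sample 5 ends with the EOS token.
# Assumes documents were delimited by EOS during preprocessing; adjust if yours differ.
eos_id = tokenizer.eos_token_id
for i, segment in enumerate(tokens):
    last_tok = int(segment[-1])
    marker = "EOS" if last_tok == eos_id else repr(tokenizer.decode([last_tok]))
    print(f"Segment {i}: last token {last_tok} -> {marker}")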
Advanced Inspection: Cross-Document Boundaries¶
Let's examine samples that might span multiple documents to understand how packing works.
# Find samples that span multiple documents
samples_with_multi_docs = []
for sample_id in range(20): # Check first 20 samples
_, doc_details = dataset_manager.inspect.inspect_sample_by_id(
sample_id=sample_id,
return_doc_details=True
)
# Check if this sample spans multiple documents
if doc_details['doc_index_f'] != doc_details['doc_index_l']:
samples_with_multi_docs.append((sample_id, doc_details))
print(f"Found {len(samples_with_multi_docs)} samples spanning multiple documents:")
for sample_id, doc_details in samples_with_multi_docs[:3]: # Show first 3
text = dataset_manager.inspect.inspect_sample_by_id(
sample_id=sample_id,
return_detokenized=True,
tokenizer=tokenizer
)
print(f"\nSample {sample_id}:")
print(f" Spans documents {doc_details['doc_index_f']} to {doc_details['doc_index_l']}")
print(f" Offset range: {doc_details['offset_f']} to {doc_details['offset_l']}")
print(f" Text length: {len(text)} characters")
print(f" Preview: {text[:100]}...")
Found 20 samples spanning multiple documents:
Sample 0:
  Spans documents 11212 to 11223
  Offset range: 67 to 154
  Text length: 8990 characters
  Preview: thing happened. Lily's little brother came running and accidentally stepped on the diamond. Oh no! ...
Sample 1:
  Spans documents 15126 to 15133
  Offset range: 129 to 151
  Text length: 8388 characters
  Preview: . She had gone to the office for a minute. Lily had an idea. "Let's steal some crayons," she whisper...
Sample 2:
  Spans documents 9100 to 9111
  Offset range: 61 to 116
  Text length: 8789 characters
  Preview: agreed to marry him. They had a wonderful wedding and were very happy together. They lived happily ...
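For a broader picture than the first 20 samples, a small survey of documents-per-sample shows how aggressively the packer splits and concatenates documents. A minimal sketch (the 200-sample range is arbitrary; scale it to your dataset):
from collections import Counter

# Sketch: distribution of documents-per-sample over the first 200 samples.
docs_per_sample = Counter()
for sid in range(200):
    _, details = dataset_manager.inspect.inspect_sample_by_id(
        sample_id=sid, return_doc_details=True
    )
    docs_per_sample[details['doc_index_l'] - details['doc_index_f'] + 1] += 1
for n_docs, count in sorted(docs_per_sample.items()):
    print(f"{count:4d} samples span {n_docs} document(s)")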
Practical Tips for Dataset Inspection¶
Here are some useful patterns for inspecting your dataset during development and debugging.
def quick_sample_summary(dataset_manager, sample_id, tokenizer):
"""Helper function to get a quick summary of any sample"""
# Get both tokenized and text versions with details
tokens = dataset_manager.inspect.inspect_sample_by_id(sample_id=sample_id)
text, doc_details = dataset_manager.inspect.inspect_sample_by_id(
sample_id=sample_id,
return_detokenized=True,
return_doc_details=True,
tokenizer=tokenizer
)
total_tokens = sum(len(segment) for segment in tokens)
summary = {
'sample_id': sample_id,
'total_tokens': total_tokens,
'total_chars': len(text),
'num_segments': len(tokens),
'doc_range': f"{doc_details['doc_index_f']}-{doc_details['doc_index_l']}",
'offset_range': f"{doc_details['offset_f']}-{doc_details['offset_l']}",
'spans_multiple_docs': doc_details['doc_index_f'] != doc_details['doc_index_l'],
'preview': text[:50] + "..." if len(text) > 50 else text
}
return summary
# Test our helper function
for sample_id in [0, 10, 25]:
summary = quick_sample_summary(dataset_manager, sample_id, tokenizer)
print(f"Sample {sample_id} Summary:")
for key, value in summary.items():
if key != 'sample_id':
print(f" {key}: {value}")
print()
Sample 0 Summary:
  total_tokens: 2049
  total_chars: 8990
  num_segments: 12
  doc_range: 11212-11223
  offset_range: 67-154
  spans_multiple_docs: True
  preview: thing happened. Lily's little brother came runnin...
Sample 10 Summary:
  total_tokens: 2049
  total_chars: 8530
  num_segments: 12
  doc_range: 7983-7994
  offset_range: 16-4
  spans_multiple_docs: True
  preview: day, they find a big club on the grass. It is bro...
Sample 25 Summary:
  total_tokens: 2049
  total_chars: 8254
  num_segments: 8
  doc_range: 11194-11201
  offset_range: 62-336
  spans_multiple_docs: True
  preview: that the sun made droplets scatter off of their b...
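For larger sweeps it is often handier to collect these summaries into a table. A minimal sketch using pandas (assuming it is available in your environment):
import pandas as pd

# Sketch: tabulate quick_sample_summary over a range of sample IDs.
summaries = [quick_sample_summary(dataset_manager, sid, tokenizer) for sid in range(10)]
df = pd.DataFrame(summaries)
print(df[['sample_id', 'total_tokens', 'total_chars', 'num_segments', 'spans_multiple_docs']])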
def batch_statistics(dataset_manager, batch_id, batch_size, tokenizer):
"""Get statistics for an entire batch"""
batch_data = dataset_manager.inspect.inspect_sample_by_batch(
batch_id=batch_id,
batch_size=batch_size,
return_detokenized=True,
return_doc_details=True,
tokenizer=tokenizer
)
stats = {
'batch_id': batch_id,
'batch_size': len(batch_data),
'text_lengths': [],
'multi_doc_samples': 0,
'total_chars': 0
}
for text, doc_details in batch_data:
stats['text_lengths'].append(len(text))
stats['total_chars'] += len(text)
if doc_details['doc_index_f'] != doc_details['doc_index_l']:
stats['multi_doc_samples'] += 1
stats['avg_length'] = stats['total_chars'] / stats['batch_size']
stats['min_length'] = min(stats['text_lengths'])
stats['max_length'] = max(stats['text_lengths'])
return stats
# Get statistics for first few batches
for batch_id in range(3):
stats = batch_statistics(dataset_manager, batch_id, 4, tokenizer)
print(f"Batch {batch_id} Statistics:")
print(f" Samples: {stats['batch_size']}")
print(f" Total characters: {stats['total_chars']:,}")
print(f" Average length: {stats['avg_length']:.1f}")
print(f" Length range: {stats['min_length']}-{stats['max_length']}")
print(f" Multi-document samples: {stats['multi_doc_samples']}")
print()
Batch 0 Statistics:
  Samples: 4
  Total characters: 34,867
  Average length: 8716.8
  Length range: 8388-8990
  Multi-document samples: 4
Batch 1 Statistics:
  Samples: 4
  Total characters: 34,649
  Average length: 8662.2
  Length range: 8404-8976
  Multi-document samples: 4
Batch 2 Statistics:
  Samples: 4
  Total characters: 34,621
  Average length: 8655.2
  Length range: 8530-8813
  Multi-document samples: 4
Summary¶
Congratulations! You've successfully learned how to inspect your tokenized dataset using TokenSmith. Here's what we covered:
Key Concepts Learned:¶
- Individual Sample Inspection: How to retrieve and examine single samples by ID
- Batch Inspection: How to inspect multiple samples as batches
- Document Details: Understanding metadata about document boundaries and offsets
- Tokenized vs Detokenized: Working with both token arrays and human-readable text
- Cross-Document Analysis: Identifying samples that span multiple source documents
- Practical Utilities: Creating helper functions for routine inspection tasks
Key Methods Used:¶
- dataset_manager.inspect.inspect_sample_by_id() - Inspect individual samples
- dataset_manager.inspect.inspect_sample_by_batch() - Inspect batches of samples
- Parameters: return_doc_details, return_detokenized, tokenizer
Next Steps:¶
- Tutorial 3: Learn about different sampling methods and policies
- Tutorial 4: Explore search functionality across your dataset
- Tutorial 5: Understand editing and injection capabilities
Pro Tips:¶
- Always use return_doc_details=True when debugging data packing issues
- Create helper functions for routine inspection tasks
- Use batch inspection to understand training data patterns
- Compare tokenized and detokenized versions to verify data integrity (see the sketch below)
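Building on that last tip, one cheap consistency check is that every packed sample should carry the same total token count, which for this setup appears to be train_seq_len + 1 = 2049 (the extra token presumably covering the shifted label target). A minimal sketch, assuming the train_seq_len=2048 configuration used above:
# Sketch: verify each sample carries train_seq_len + 1 tokens, consistent with
# the 2049-token totals observed earlier (assumes train_seq_len=2048 as above).
EXPECTED_TOKENS = 2048 + 1
for sid in range(10):
    segments = dataset_manager.inspect.inspect_sample_by_id(sample_id=sid)
    total = sum(len(seg) for seg in segments)
    status = "OK" if total == EXPECTED_TOKENS else f"unexpected ({total})"
    print(f"Sample {sid}: {total} tokens -> {status}")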