Dataset Searching Tutorial¶
Welcome to the comprehensive search functionality tutorial! This guide covers TokenSmith's powerful search capabilities for finding and analyzing token sequences in your datasets.
What you'll learn:
- Basic search operations (count, contains, positions)
- Advanced search features (next token prediction)
- Batch search operations for efficiency
- N-gram sampling with smoothing
- Real-world search applications
Prerequisites:
- Completed basic setup tutorial
- Understanding of tokenization
- Familiarity with token sequences
Setup and Configuration¶
First, let's set up our environment with the necessary imports and initialize our dataset manager.
# May not be necessary, but ensures the path is set correctly
import sys
sys.path.insert(0, "/NS/llm-pretraining/work/afkhan/tokensmith")
sys.path.insert(0, "/NS/llm-pretraining/work/afkhan/USC_Colab/gpt-neox")
import numpy as np
import json
from collections import Counter
import matplotlib.pyplot as plt
from typing import List, Dict, Tuple
Initialize Tokenizer¶
For search operations, we need a tokenizer to convert text to tokens and back.
from transformers import AutoTokenizer
TOKENIZER_NAME_OR_PATH = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME_OR_PATH, add_eos_token=True)
print(f"Tokenizer loaded: {TOKENIZER_NAME_OR_PATH}")
print(f"Vocabulary size: {len(tokenizer)}")
print(f"EOS token: {tokenizer.eos_token} (ID: {tokenizer.eos_token_id})")
print(f"BOS token: {tokenizer.bos_token} (ID: {tokenizer.bos_token_id})")
Tokenizer loaded: EleutherAI/gpt-neox-20b Vocabulary size: 50277 EOS token: <|endoftext|> (ID: 0) BOS token: <|endoftext|> (ID: 0)
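Note that BPE tokenizers such as GPT-NeoX's are whitespace-sensitive: a leading space changes the first token ID, which matters for the searches below. A quick check using the tokenizer we just loaded:
# A word with and without a leading space maps to different token IDs,
# which is why several search queries below prepend a space on purpose.
for text in ["wanted to", " wanted to"]:
    ids = tokenizer.encode(text, add_special_tokens=False)
    print(f"{text!r:>15} -> {ids}")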
Setup Dataset Manager¶
Initialize the DatasetManager with search functionality enabled.
from tokensmith.manager import DatasetManager
dataset_manager = DatasetManager()
# Setup search functionality - this builds/loads the search index
dataset_manager.setup_search(
bin_file_path="../../artifacts/data_ingested_text_document.bin",
search_index_save_path="../../artifacts/search_index_text_document.idx",
vocab=2**16, # Use 2**16 for GPT-NeoX tokenizer
verbose=True,
reuse=False, # Build the index from scratch; set to True to reuse an existing index file
)
print("Search functionality initialized successfully!")
print(f"Search handler available: {dataset_manager.search is not None}")
Writing indices to disk... Time elapsed: 35.987326ms Sorting indices... Time elapsed: 252.669649ms Search functionality initialized successfully! Search handler available: True
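On later runs you can skip the index build by reusing the file written above. A minimal sketch of the same call, assuming that index file is still intact:
# Reuse the previously written index instead of rebuilding it.
dataset_manager.setup_search(
    bin_file_path="../../artifacts/data_ingested_text_document.bin",
    search_index_save_path="../../artifacts/search_index_text_document.idx",
    vocab=2**16,
    verbose=True,
    reuse=True,  # load the existing index file rather than rebuilding
)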
Basic Search Operations¶
Let's start with the fundamental search operations: count, contains, and positions.
Counting Token Sequences¶
The count() method returns how many times a specific token sequence appears in the dataset.
# Example 1: Search for common phrases
common_phrases = [
"Once upon a time",
" icy hill",
" small yard",
" pretty candle",
" wanted to" # Prepended space is intentional for tokenization as the first token is then different than what it would be without it
]
print("=== Phrase Frequency Analysis ===")
for phrase in common_phrases:
# Convert text to tokens
tokens = tokenizer.encode(phrase, add_special_tokens=False)
count = dataset_manager.search.count(tokens)
print(f"'{phrase}':")
print(f" Tokens: {tokens}")
print(f" Count: {count}")
print()
# Example 2: Single token counts
print("=== Single Token Analysis ===")
common_words = ["the", "and", "to", "of", "a"]
for word in common_words:
token_id = tokenizer.encode(word, add_special_tokens=False)[0] # Get first token
count = dataset_manager.search.count([token_id])
decoded = tokenizer.decode([token_id])
print(f"Token '{decoded}' (ID: {token_id}): {count} occurrences")
=== Phrase Frequency Analysis === 'Once upon a time': Tokens: [10758, 2220, 247, 673] Count: 13533 ' icy hill': Tokens: [42947, 13599] Count: 9 ' small yard': Tokens: [1355, 15789] Count: 2 ' pretty candle': Tokens: [3965, 28725] Count: 3 ' wanted to': Tokens: [3078, 281] Count: 11524 === Single Token Analysis === Token 'the' (ID: 783): 12 occurrences Token 'and' (ID: 395): 93 occurrences Token 'to' (ID: 936): 21 occurrences Token 'of' (ID: 1171): 82 occurrences Token 'a' (ID: 66): 510 occurrences
Checking Sequence Existence¶
The contains() method quickly checks whether a sequence exists at all, without counting every occurrence.
# Test various sequences for existence
common_phrases = [
"Once upon a time",
" icy hill",
" small yard",
" pretty candle",
" wanted to" # Prepended space is intentional for tokenization as the first token is then different than what it would be without it
]
print("=== Sequence Existence Check ===")
for sequence in common_phrases:
tokens = tokenizer.encode(sequence, add_special_tokens=False)
exists = dataset_manager.search.contains(tokens)
status = "✓ Found" if exists else "✗ Not found"
print(f"{status}: '{sequence}'")
# If found, also get the count
if exists:
count = dataset_manager.search.count(tokens)
print(f" Occurrences: {count}")
print()
=== Sequence Existence Check === ✓ Found: 'Once upon a time' Occurrences: 13533 ✓ Found: ' icy hill' Occurrences: 9 ✓ Found: ' small yard' Occurrences: 2 ✓ Found: ' pretty candle' Occurrences: 3 ✓ Found: ' wanted to' Occurrences: 11524
Finding Sequence Positions¶
The positions() method returns all locations where a sequence appears in the dataset.
# Find positions of specific sequences
search_phrase = "Once upon a time"
tokens = tokenizer.encode(search_phrase, add_special_tokens=False)
print(f"=== Position Analysis for '{search_phrase}' ===")
print(f"Tokens: {tokens}")
positions = dataset_manager.search.positions(tokens)
count = len(positions)
print(f"Total occurrences: {count}")
if count > 0:
print(f"Positions (first 10): {positions[:10]}")
else:
print("Sequence not found in dataset")
=== Position Analysis for 'Once upon a time' === Tokens: [10758, 2220, 247, 673] Total occurrences: 13533 Positions (first 10): [1871078, 3526259, 739125, 1332842, 1838022, 3434072, 484592, 2798653, 313457, 1297946]
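Because positions() returns raw token offsets, you can quickly see where a phrase clusters in the dataset. A minimal sketch using the matplotlib import from the setup cell and the positions variable computed above:
# Plot where the phrase occurs across the token stream (bin count is arbitrary).
if count > 0:
    plt.figure(figsize=(8, 3))
    plt.hist(positions, bins=50)
    plt.xlabel("Token offset in dataset")
    plt.ylabel("Occurrences")
    plt.title(f"Where '{search_phrase}' appears")
    plt.tight_layout()
    plt.show()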
Advanced Search: Next Token Prediction¶
One of the most powerful features is count_next(), which shows which tokens typically follow a given sequence.
Analyzing Token Transitions¶
Let's see what commonly follows specific phrases.
def analyze_next_tokens(phrase: str, top_k: int = 10) -> Dict:
"""Analyze what tokens commonly follow a given phrase."""
tokens = tokenizer.encode(phrase, add_special_tokens=False)
# Get next token counts
next_counts = dataset_manager.search.count_next(tokens)
# Find non-zero counts
results = []
for token_id, count in enumerate(next_counts):
if count > 0:
try:
token_text = tokenizer.decode([token_id])
results.append((token_id, token_text, count))
except:
results.append((token_id, f"[ID:{token_id}]", count))
# Sort by count and return top k
results.sort(key=lambda x: x[2], reverse=True)
return {
'phrase': phrase,
'phrase_tokens': tokens,
'total_phrase_count': dataset_manager.search.count(tokens),
'top_next_tokens': results[:top_k],
'unique_next_tokens': len(results)
}
# Analyze several interesting phrases
analysis_phrases = [
"The cat",
"I am",
"Once upon a",
"In the",
"She said"
]
print("=== Next Token Analysis ===")
for phrase in analysis_phrases:
analysis = analyze_next_tokens(phrase, top_k=5)
print(f"\nPhrase: '{phrase}'")
print(f"Phrase occurs {analysis['total_phrase_count']} times")
print(f"Followed by {analysis['unique_next_tokens']} different tokens")
print("Top continuations:")
for i, (token_id, token_text, count) in enumerate(analysis['top_next_tokens'], 1):
percentage = (count / analysis['total_phrase_count']) * 100
print(f" {i}. '{token_text}' (ID: {token_id}) - {count} times ({percentage:.1f}%)")
=== Next Token Analysis === Phrase: 'The cat' Phrase occurs 5 times Followed by 3 different tokens Top continuations: 1. ' is' (ID: 310) - 3 times (60.0%) 2. ' will' (ID: 588) - 1 times (20.0%) 3. ' kept' (ID: 4934) - 1 times (20.0%) Phrase: 'I am' Phrase occurs 392 times Followed by 120 different tokens Top continuations: 1. ' sorry' (ID: 7016) - 74 times (18.9%) 2. ' a' (ID: 247) - 38 times (9.7%) 3. ' the' (ID: 253) - 26 times (6.6%) 4. ' going' (ID: 1469) - 20 times (5.1%) 5. ' so' (ID: 594) - 16 times (4.1%) Phrase: 'Once upon a' Phrase occurs 13542 times Followed by 9 different tokens Top continuations: 1. ' time' (ID: 673) - 13533 times (99.9%) 2. ' Tuesday' (ID: 7948) - 2 times (0.0%) 3. ' long' (ID: 1048) - 1 times (0.0%) 4. ' day' (ID: 1388) - 1 times (0.0%) 5. ' night' (ID: 2360) - 1 times (0.0%) Phrase: 'In the' Phrase occurs 10 times Followed by 9 different tokens Top continuations: 1. ' park' (ID: 5603) - 2 times (20.0%) 2. ' big' (ID: 1943) - 1 times (10.0%) 3. ' dark' (ID: 3644) - 1 times (10.0%) 4. ' morning' (ID: 4131) - 1 times (10.0%) 5. ' middle' (ID: 4766) - 1 times (10.0%) Phrase: 'She said' Phrase occurs 1 times Followed by 1 different tokens Top continuations: 1. ' it' (ID: 352) - 1 times (100.0%)
Story Continuation Analysis¶
Let's do a deeper analysis of story beginnings to understand narrative patterns.
def story_continuation_analysis():
"""Analyze how stories typically continue after common openings."""
story_openings = [
"The cat",
"I am",
"Once upon a",
"In the",
"She said"
]
print("=== Story Continuation Patterns ===")
for opening in story_openings:
tokens = tokenizer.encode(opening, add_special_tokens=False)
phrase_count = dataset_manager.search.count(tokens)
if phrase_count == 0:
print(f"\n'{opening}': Not found in dataset")
continue
print(f"\n'{opening}' (appears {phrase_count} times):")
# Get next tokens
next_counts = dataset_manager.search.count_next(tokens)
# Build continuations by looking at multiple next tokens
next_tokens = []
for token_id, count in enumerate(next_counts):
if count > 0:
try:
token_text = tokenizer.decode([token_id])
next_tokens.append((token_id, token_text, count))
except:
continue
# Sort and show top continuations
next_tokens.sort(key=lambda x: x[2], reverse=True)
print(" Most common continuations:")
for i, (token_id, token_text, count) in enumerate(next_tokens[:7], 1):
# Create continuation phrase
continuation_tokens = tokens + [token_id]
full_continuation = tokenizer.decode(continuation_tokens)
probability = (count / phrase_count) * 100
print(f" {i}. '{full_continuation}' ({probability:.1f}%)")
story_continuation_analysis()
=== Story Continuation Patterns === 'The cat' (appears 5 times): Most common continuations: 1. 'The cat is' (60.0%) 2. 'The cat will' (20.0%) 3. 'The cat kept' (20.0%) 'I am' (appears 392 times): Most common continuations: 1. 'I am sorry' (18.9%) 2. 'I am a' (9.7%) 3. 'I am the' (6.6%) 4. 'I am going' (5.1%) 5. 'I am so' (4.1%) 6. 'I am glad' (3.1%) 7. 'I am happy' (2.8%) 'Once upon a' (appears 13542 times): Most common continuations: 1. 'Once upon a time' (99.9%) 2. 'Once upon a Tuesday' (0.0%) 3. 'Once upon a long' (0.0%) 4. 'Once upon a day' (0.0%) 5. 'Once upon a night' (0.0%) 6. 'Once upon a morning' (0.0%) 7. 'Once upon a Sunday' (0.0%) 'In the' (appears 10 times): Most common continuations: 1. 'In the park' (20.0%) 2. 'In the big' (10.0%) 3. 'In the dark' (10.0%) 4. 'In the morning' (10.0%) 5. 'In the middle' (10.0%) 6. 'In the summer' (10.0%) 7. 'In the farm' (10.0%) 'She said' (appears 1 times): Most common continuations: 1. 'She said it' (100.0%)
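count_next() only looks one token ahead, but you can chain it: append the most likely continuation to the query and ask again. A small greedy two-step sketch for one of the openings above:
# Greedy two-token lookahead by chaining count_next() calls.
tokens = tokenizer.encode("The cat", add_special_tokens=False)
for step in range(2):
    next_counts = dataset_manager.search.count_next(tokens)
    best_id = int(np.argmax(next_counts))  # most frequent next token
    tokens.append(best_id)
    print(f"Step {step + 1}: {tokenizer.decode(tokens)!r}")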
Batch Search Operations¶
For efficiency when searching multiple sequences, use batch operations.
# Batch next token analysis
def batch_next_token_analysis():
"""Demonstrate batch search operations for efficiency."""
# Prepare multiple queries
phrases = [
"The dog",
"The cat",
"The bird",
"The fish",
"The horse"
]
# Convert all phrases to token sequences
token_queries = []
for phrase in phrases:
tokens = tokenizer.encode(phrase, add_special_tokens=False)
token_queries.append(tokens)
print("=== Batch Next Token Analysis ===")
print(f"Analyzing {len(phrases)} phrases simultaneously...")
# Perform batch search
batch_results = dataset_manager.search.batch_count_next(token_queries)
# Analyze results
for i, (phrase, query_tokens, next_counts) in enumerate(zip(phrases, token_queries, batch_results)):
phrase_count = dataset_manager.search.count(query_tokens)
print(f"\n{i+1}. '{phrase}' (occurs {phrase_count} times):")
# Find top next tokens
next_tokens = []
for token_id, count in enumerate(next_counts):
if count > 0:
try:
token_text = tokenizer.decode([token_id])
next_tokens.append((token_text, count))
except:
continue
# Sort and display top 3
next_tokens.sort(key=lambda x: x[1], reverse=True)
for j, (token_text, count) in enumerate(next_tokens[:3], 1):
probability = (count / phrase_count) * 100 if phrase_count > 0 else 0
print(f" {j}. '{token_text}' - {count} times ({probability:.1f}%)")
batch_next_token_analysis()
=== Batch Next Token Analysis === Analyzing 5 phrases simultaneously... 1. 'The dog' (occurs 22 times): 1. ' is' - 13 times (59.1%) 2. ' was' - 2 times (9.1%) 3. ' might' - 2 times (9.1%) 2. 'The cat' (occurs 5 times): 1. ' is' - 3 times (60.0%) 2. ' will' - 1 times (20.0%) 3. ' kept' - 1 times (20.0%) 3. 'The bird' (occurs 10 times): 1. ' is' - 6 times (60.0%) 2. ' does' - 2 times (20.0%) 3. 'ie' - 1 times (10.0%) 4. 'The fish' (occurs 1 times): 1. ' smiled' - 1 times (100.0%) 5. 'The horse' (occurs 2 times): 1. ' is' - 1 times (50.0%) 2. ' feels' - 1 times (50.0%)
N-gram Sampling with Smoothing¶
TokenSmith includes advanced n-gram sampling with Kneser-Ney smoothing for generating realistic continuations.
def demonstrate_ngram_sampling():
"""Demonstrate n-gram sampling with smoothing."""
# Start with a story beginning
seed_phrase = "Once upon a time there was a"
seed_tokens = tokenizer.encode(seed_phrase, add_special_tokens=False)
print("=== N-gram Sampling with Smoothing ===")
print(f"Seed phrase: '{seed_phrase}'")
print(f"Seed tokens: {seed_tokens}")
# Generate several continuations using different n-gram orders
n_values = [2, 3, 4] # bi-gram, tri-gram, 4-gram
for n in n_values:
print(f"\n--- {n}-gram Sampling ---")
try:
# Sample continuations
samples = dataset_manager.search.sample_smoothed(
query=seed_tokens,
n=n, # n-gram order
k=10, # length of continuation
num_samples=3 # number of samples
)
print(f"Generated {len(samples)} continuations:")
for i, sample_tokens in enumerate(samples, 1):
# Combine seed and sample
full_sequence = seed_tokens + sample_tokens
full_text = tokenizer.decode(full_sequence)
continuation_text = tokenizer.decode(sample_tokens)
print(f" {i}. Continuation: '{continuation_text}'")
print(f" Full text: '{full_text}'")
print()
except Exception as e:
print(f"Error with {n}-gram sampling: {e}")
continue
demonstrate_ngram_sampling()
=== N-gram Sampling with Smoothing === Seed phrase: 'Once upon a time there was a' Seed tokens: [10758, 2220, 247, 673, 627, 369, 247] --- 2-gram Sampling --- Generated 3 continuations: 1. Continuation: 'Once upon a time there was a cork from the microscope from the animals. After' Full text: 'Once upon a time there was aOnce upon a time there was a cork from the microscope from the animals. After' 2. Continuation: 'Once upon a time there was a great thing they had made a large it. I' Full text: 'Once upon a time there was aOnce upon a time there was a great thing they had made a large it. I' 3. Continuation: 'Once upon a time there was a big adventure, stick again. He saw a time' Full text: 'Once upon a time there was aOnce upon a time there was a big adventure, stick again. He saw a time' --- 3-gram Sampling --- Generated 3 continuations: 1. Continuation: 'Once upon a time there was a girl named Lily. She does not know. During' Full text: 'Once upon a time there was aOnce upon a time there was a girl named Lily. She does not know. During' 2. Continuation: 'Once upon a time there was a little girl named Sally, "It is too strong' Full text: 'Once upon a time there was aOnce upon a time there was a little girl named Sally, "It is too strong' 3. Continuation: 'Once upon a time there was a little bird. "No, this is my friend' Full text: 'Once upon a time there was aOnce upon a time there was a little bird. "No, this is my friend' --- 4-gram Sampling --- Generated 3 continuations: 1. Continuation: 'Once upon a time there was a woman who lived in a small house near the woods' Full text: 'Once upon a time there was aOnce upon a time there was a woman who lived in a small house near the woods' 2. Continuation: 'Once upon a time there was a little girl, and he would often make sure they' Full text: 'Once upon a time there was aOnce upon a time there was a little girl, and he would often make sure they' 3. Continuation: 'Once upon a time there was a fish named Fin. Fin loved to swim, fish' Full text: 'Once upon a time there was aOnce upon a time there was a fish named Fin. Fin loved to swim, fish'
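In this run the decoded samples already begin with the seed phrase, which is why the 'Full text' lines show it twice ('...there was aOnce upon a time...'). If your version of sample_smoothed() behaves the same way, decoding the sample on its own is enough; a small adjustment, assuming the same seed as above:
# Decode each sample directly; in the run above the returned sequences
# appear to already include the seed tokens.
seed_tokens = tokenizer.encode("Once upon a time there was a", add_special_tokens=False)
samples = dataset_manager.search.sample_smoothed(query=seed_tokens, n=3, k=10, num_samples=3)
for i, sample_tokens in enumerate(samples, 1):
    print(f"{i}. {tokenizer.decode(sample_tokens)!r}")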
Real-World Search Applications¶
Let's explore practical applications of search functionality for dataset analysis and research.
Content Analysis and Filtering¶
Use search to understand the content distribution in your dataset.
def content_analysis():
"""Analyze dataset content using search functionality."""
# Define content categories to search for
categories = {
'Science': ['science', 'research', 'experiment', 'hypothesis', 'data'],
'Technology': ['computer', 'software', 'algorithm', 'programming', 'digital'],
'Literature': ['novel', 'story', 'character', 'plot', 'narrative'],
'Education': ['learn', 'teach', 'student', 'school', 'education'],
'History': ['history', 'ancient', 'war', 'empire', 'civilization']
}
print("=== Dataset Content Analysis ===")
category_scores = {}
for category, keywords in categories.items():
total_score = 0
keyword_results = []
for keyword in keywords:
tokens = tokenizer.encode(keyword, add_special_tokens=False)
count = dataset_manager.search.count(tokens)
total_score += count
keyword_results.append((keyword, count))
category_scores[category] = {
'total_score': total_score,
'keywords': keyword_results
}
print(f"\n{category} (Total mentions: {total_score}):")
# Sort keywords by frequency
keyword_results.sort(key=lambda x: x[1], reverse=True)
for keyword, count in keyword_results:
print(f" '{keyword}': {count}")
# Find dominant category
dominant_category = max(category_scores.keys(), key=lambda k: category_scores[k]['total_score'])
print(f"\nDominant content category: {dominant_category}")
return category_scores
content_scores = content_analysis()
=== Dataset Content Analysis === Science (Total mentions: 0): 'science': 0 'research': 0 'experiment': 0 'hypothesis': 0 'data': 0 Technology (Total mentions: 0): 'computer': 0 'software': 0 'algorithm': 0 'programming': 0 'digital': 0 Literature (Total mentions: 1): 'story': 1 'novel': 0 'character': 0 'plot': 0 'narrative': 0 Education (Total mentions: 0): 'learn': 0 'teach': 0 'student': 0 'school': 0 'education': 0 History (Total mentions: 3): 'war': 3 'history': 0 'ancient': 0 'empire': 0 'civilization': 0 Dominant content category: History
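The near-zero counts above are partly a tokenization effect: 'science' without a leading space encodes to a different first token than ' science', which is how the word appears mid-sentence (see the note on prepended spaces earlier). A hedged variant that counts both forms of each keyword:
# Count a keyword with and without a leading space, since mid-sentence
# occurrences tokenize with the space attached to the word.
def count_word(word: str) -> int:
    total = 0
    for variant in (word, " " + word):
        ids = tokenizer.encode(variant, add_special_tokens=False)
        total += dataset_manager.search.count(ids)
    return total

for keyword in ["science", "story", "school", "history"]:
    print(f"'{keyword}': {count_word(keyword)} occurrences (both forms)")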
Quality Assessment¶
Use search to identify potential quality issues in your dataset.
def quality_assessment():
"""Use search to assess dataset quality."""
print("=== Dataset Quality Assessment ===")
# Check for repetitive patterns
repetitive_patterns = [
"the the",
"and and",
"is is",
"to to",
"a a a"
]
print("\n1. Repetitive Pattern Detection:")
repetitive_found = False
for pattern in repetitive_patterns:
tokens = tokenizer.encode(pattern, add_special_tokens=False)
count = dataset_manager.search.count(tokens)
if count > 0:
print(f" '{pattern}': {count} occurrences ⚠️")
repetitive_found = True
if not repetitive_found:
print(" ✓ No obvious repetitive patterns found")
# Check for encoding issues
print("\n2. Potential Encoding Issues:")
encoding_issues = [
"\\n", # Escaped newlines
"\\t", # Escaped tabs
"\\r", # Escaped carriage returns
"’", # Common encoding artifact
"“", # Another common artifact
]
encoding_found = False
for issue in encoding_issues:
tokens = tokenizer.encode(issue, add_special_tokens=False)
count = dataset_manager.search.count(tokens)
if count > 0:
print(f" '{issue}': {count} occurrences ⚠️")
encoding_found = True
if not encoding_found:
print(" ✓ No obvious encoding issues found")
# Check for placeholder text
print("\n3. Placeholder Text Detection:")
placeholders = [
"lorem ipsum",
"placeholder text",
"sample text",
"TODO",
"FIXME"
]
placeholder_found = False
for placeholder in placeholders:
tokens = tokenizer.encode(placeholder, add_special_tokens=False)
count = dataset_manager.search.count(tokens)
if count > 0:
print(f" '{placeholder}': {count} occurrences ⚠️")
placeholder_found = True
if not placeholder_found:
print(" ✓ No placeholder text found")
quality_assessment()
=== Dataset Quality Assessment === 1. Repetitive Pattern Detection: ✓ No obvious repetitive patterns found 2. Potential Encoding Issues: ✓ No obvious encoding issues found 3. Placeholder Text Detection: ✓ No placeholder text found
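The same pattern extends to other heuristics. For instance, runs of consecutive end-of-text tokens can hint at empty or truncated documents; an illustrative sketch using only count() and the EOS ID from the tokenizer:
# Count runs of consecutive EOS tokens, which may indicate empty documents
# between samples (an illustrative heuristic, not a definitive check).
eos_id = tokenizer.eos_token_id
for run_length in (2, 3, 4):
    run_count = dataset_manager.search.count([eos_id] * run_length)
    print(f"{run_length} consecutive EOS tokens: {run_count} occurrences")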
Language Pattern Analysis¶
Analyze linguistic patterns and style in your dataset.
def linguistic_analysis():
"""Analyze linguistic patterns in the dataset."""
print("=== Linguistic Pattern Analysis ===")
# Analyze sentence starters
print("\n1. Common Sentence Starters:")
sentence_starters = [
"The", "A", "An", "I", "We", "They", "He", "She", "It",
"In", "On", "At", "With", "For", "During", "After", "Before"
]
starter_counts = []
for starter in sentence_starters:
tokens = tokenizer.encode(starter, add_special_tokens=False)
count = dataset_manager.search.count(tokens)
starter_counts.append((starter, count))
# Sort by frequency
starter_counts.sort(key=lambda x: x[1], reverse=True)
for i, (starter, count) in enumerate(starter_counts[:10], 1):
print(f" {i:2d}. '{starter}': {count}")
# Analyze question patterns
print("\n2. Question Patterns:")
question_words = ["What", "Where", "When", "Why", "How", "Who", "Which"]
total_questions = 0
for word in question_words:
tokens = tokenizer.encode(word, add_special_tokens=False)
count = dataset_manager.search.count(tokens)
total_questions += count
print(f" '{word}': {count}")
print(f" Total question indicators: {total_questions}")
# Analyze temporal indicators
print("\n3. Temporal Indicators:")
temporal_words = [
"yesterday", "today", "tomorrow",
"now", "then", "later", "soon",
"before", "after", "during", "while"
]
for word in temporal_words:
tokens = tokenizer.encode(word, add_special_tokens=False)
count = dataset_manager.search.count(tokens)
if count > 0:
print(f" '{word}': {count}")
linguistic_analysis()
=== Linguistic Pattern Analysis === 1. Common Sentence Starters: 1. 'I': 6164 2. 'It': 2266 3. 'We': 1426 4. 'The': 511 5. 'A': 182 6. 'They': 136 7. 'He': 121 8. 'She': 52 9. 'An': 24 10. 'In': 16 2. Question Patterns: 'What': 1863 'Where': 240 'When': 24 'Why': 561 'How': 151 'Who': 201 'Which': 13 Total question indicators: 3053 3. Temporal Indicators: 'today': 1 'now': 7 'then': 1
Search Performance and Optimization¶
Understanding search performance helps optimize your analysis workflows.
import time
def search_performance_analysis():
"""Analyze search performance for different query types."""
print("=== Search Performance Analysis ===")
# Pre-tokenize a long sequence for consistent testing
test_sequence = "the quick brown fox jumps over the lazy dog in the park"
all_tokens = tokenizer.encode(test_sequence, add_special_tokens=False)
# Test different query lengths using slices of the same sequence
test_queries = [
(all_tokens[:1], "Single token"),
(all_tokens[:2], "Two tokens"),
(all_tokens[:4], "Four tokens"),
(all_tokens[:7], "Seven tokens")
]
print("\n1. Query Length Performance:")
for tokens, description in test_queries:
query_text = tokenizer.decode(tokens)
# Time the count operation
start_time = time.time()
count = dataset_manager.search.count(tokens)
count_time = time.time() - start_time
# Time the positions operation (if count is reasonable)
if count < 1000: # Only get positions for reasonable counts
start_time = time.time()
positions = dataset_manager.search.positions(tokens)
positions_time = time.time() - start_time
else:
positions_time = "N/A (too many results)"
print(f" {description} ('{query_text}'):")
print(f" Count: {count} (Time: {count_time:.4f}s)")
print(f" Positions: {positions_time if isinstance(positions_time, str) else f'{positions_time:.4f}s'}")
# Test batch vs individual operations
print("\n2. Batch vs Individual Operations:")
# Use consistent tokenized queries
base_phrase = "Once upon a time there was a dog and a cat and a bird"
base_tokens = tokenizer.encode(base_phrase, add_special_tokens=False)
batch_queries = [
base_tokens[:2], # "the dog"
base_tokens[4:6], # "the cat"
base_tokens[8:10] # "the bird"
]*100
# Individual operations
start_time = time.time()
individual_results = []
for query in batch_queries:
result = dataset_manager.search.count_next(query)
individual_results.append(result)
individual_time = time.time() - start_time
# Batch operation
start_time = time.time()
batch_results = dataset_manager.search.batch_count_next(batch_queries)
batch_time = time.time() - start_time
assert len(individual_results) == len(batch_results), "Batch results length mismatch"
print(f" Individual operations: {individual_time:.4f}s")
print(f" Batch operation: {batch_time:.4f}s")
print(f" Speedup: {individual_time/batch_time:.2f}x" if batch_time > 0 else " Batch operation too fast to measure accurately")
# Test query frequency impact
print("\n3. Query Frequency Impact:")
# Test common vs rare sequences
common_tokens = all_tokens[:1] # Very common single token
rare_tokens = all_tokens[:6] # Potentially rare 6-token sequence
# Test common query
start_time = time.time()
common_count = dataset_manager.search.count(common_tokens)
common_time = time.time() - start_time
# Test rare query
start_time = time.time()
rare_count = dataset_manager.search.count(rare_tokens)
rare_time = time.time() - start_time
print(f" Common query ('{tokenizer.decode(common_tokens)}'): {common_count} results in {common_time:.4f}s")
print(f" Rare query ('{tokenizer.decode(rare_tokens)}'): {rare_count} results in {rare_time:.4f}s")
search_performance_analysis()
=== Search Performance Analysis === 1. Query Length Performance: Single token ('the'): Count: 12 (Time: 0.0000s) Positions: 0.0000s Two tokens ('the quick'): Count: 0 (Time: 0.0000s) Positions: 0.0000s Four tokens ('the quick brown fox'): Count: 0 (Time: 0.0000s) Positions: 0.0000s Seven tokens ('the quick brown fox jumps over the'): Count: 0 (Time: 0.0000s) Positions: 0.0000s 2. Batch vs Individual Operations: Individual operations: 0.4445s Batch operation: 0.9893s Speedup: 0.45x 3. Query Frequency Impact: Common query ('the'): 12 results in 0.0001s Rare query ('the quick brown fox jumps over'): 0 results in 0.0000s
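Sub-millisecond timings like those above are mostly noise, and in this run the batch call was actually slower than the plain loop. For a fairer comparison, repeat each measurement and keep the best time with time.perf_counter; a small sketch:
# Repeat each measurement and keep the best time to reduce timing noise.
import time

def best_time(fn, repeats: int = 5) -> float:
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

queries = [tokenizer.encode(p, add_special_tokens=False)
           for p in ["The dog", "The cat", "The bird"]] * 100
loop_t = best_time(lambda: [dataset_manager.search.count_next(q) for q in queries])
batch_t = best_time(lambda: dataset_manager.search.batch_count_next(queries))
print(f"Loop:  {loop_t:.4f}s")
print(f"Batch: {batch_t:.4f}s")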
Advanced Use Cases¶
Let's explore some advanced use cases that demonstrate the full power of search functionality.
Building Custom Language Models¶
Use search results to build simple language models or probability distributions.
def build_simple_language_model():
"""Build a simple n-gram language model using search results."""
print("=== Building Simple Language Model ===")
# Tokenize the starting context (a single word here, so it yields one token)
full_context = "Once"
full_tokens = tokenizer.encode(full_context, add_special_tokens=False)
# Use at most the first 2 tokens as our context
context_tokens = full_tokens[:2]
context = tokenizer.decode(context_tokens)
print(f"Context: '{context}'")
print(f"Context tokens: {context_tokens}")
# Get next token distribution
next_counts = dataset_manager.search.count_next(context_tokens)
context_count = dataset_manager.search.count(context_tokens)
if context_count == 0:
print("Context not found in dataset")
return
print(f"Context appears {context_count} times")
# Build probability distribution
probabilities = []
for token_id, count in enumerate(next_counts):
if count > 0:
prob = count / context_count
try:
token_text = tokenizer.decode([token_id])
probabilities.append((token_id, token_text, count, prob))
except:
continue
# Sort by probability
probabilities.sort(key=lambda x: x[3], reverse=True)
print(f"\nTop 10 most likely next tokens:")
cumulative_prob = 0
for i, (token_id, token_text, count, prob) in enumerate(probabilities[:10], 1):
cumulative_prob += prob
print(f" {i:2d}. '{token_text}' (ID: {token_id})")
print(f" Probability: {prob:.4f} ({prob*100:.1f}%)")
print(f" Count: {count}")
print(f"\nTop 10 tokens cover {cumulative_prob:.1%} of all continuations")
# Generate sample text using the model
print(f"\n--- Sample Generations ---")
for generation in range(3):
print(f"\nGeneration {generation + 1}:")
generation_tokens = context_tokens.copy()
# Generate 5 more tokens
for step in range(5):
# Use consistent context length (2 tokens)
current_context = generation_tokens[-2:]
next_counts = dataset_manager.search.count_next(current_context)
context_total = sum(next_counts)
if context_total == 0:
print(f" No continuations found for context: {tokenizer.decode(current_context)}")
break
# Sample next token based on probability
next_probs = [count / context_total for count in next_counts]
# Handle case where all probabilities are zero
if sum(next_probs) == 0:
print(f" No valid continuations for context: {tokenizer.decode(current_context)}")
break
next_token = np.random.choice(len(next_probs), p=next_probs)
generation_tokens.append(next_token)
generated_text = tokenizer.decode(generation_tokens)
continuation_text = tokenizer.decode(generation_tokens[len(context_tokens):])
print(f" Full text: '{generated_text}'")
print(f" Continuation: '{continuation_text}'")
build_simple_language_model()
=== Building Simple Language Model === Context: 'Once' Context tokens: [10758] Context appears 15326 times Top 10 most likely next tokens: 1. ' upon' (ID: 2220) Probability: 0.8838 (88.4%) Count: 13545 2. ' there' (ID: 627) Probability: 0.1088 (10.9%) Count: 1667 3. ',' (ID: 13) Probability: 0.0071 (0.7%) Count: 109 4. ' a' (ID: 247) Probability: 0.0001 (0.0%) Count: 1 5. ' in' (ID: 275) Probability: 0.0001 (0.0%) Count: 1 6. ' it' (ID: 352) Probability: 0.0001 (0.0%) Count: 1 7. ' Mary' (ID: 6393) Probability: 0.0001 (0.0%) Count: 1 8. ' Upon' (ID: 15797) Probability: 0.0001 (0.0%) Count: 1 Top 10 tokens cover 100.0% of all continuations --- Sample Generations --- Generation 1:
Full text: 'Once upon a time, there' Continuation: ' upon a time, there' Generation 2: Full text: 'Once upon a time, there' Continuation: ' upon a time, there' Generation 3: Full text: 'Once upon a time, in' Continuation: ' upon a time, in'
Dataset Comparison¶
Compare different datasets or dataset versions using search statistics.
def create_search_signature():
"""Create a 'signature' of the dataset using search statistics."""
print("=== Dataset Search Signature ===")
# Define signature queries - common patterns that characterize text
signature_queries = [
# Articles
"the", "a", "an",
# Pronouns
"I", "you", "he", "she", "we", "they",
# Common verbs
"is", "was", "are", "were", "have", "has",
# Conjunctions
"and", "or", "but", "if", "when",
# Common phrases
"of the", "in the", "to the", "and the",
# Question words
"what", "where", "when", "why", "how",
# Temporal
"time", "day", "year", "today", "now"
]
signature = {}
total_signature_count = 0
print("Computing dataset signature...")
for query_text in signature_queries:
tokens = tokenizer.encode(query_text, add_special_tokens=False)
count = dataset_manager.search.count(tokens)
signature[query_text] = count
total_signature_count += count
# Normalize to percentages
signature_percentages = {}
for query_text, count in signature.items():
percentage = (count / total_signature_count) * 100 if total_signature_count > 0 else 0
signature_percentages[query_text] = percentage
print(f"\nDataset Signature (Total signature tokens: {total_signature_count}):")
# Sort by frequency
sorted_signature = sorted(signature.items(), key=lambda x: x[1], reverse=True)
for query_text, count in sorted_signature:
percentage = signature_percentages[query_text]
print(f" '{query_text}': {count:6d} ({percentage:5.1f}%)")
return signature
signature = create_search_signature()
=== Dataset Search Signature === Computing dataset signature... Dataset Signature (Total signature tokens: 8188): 'I': 6164 ( 75.3%) 'a': 510 ( 6.2%) 'where': 351 ( 4.3%) 'we': 248 ( 3.0%) 'an': 210 ( 2.6%) 'year': 125 ( 1.5%) 'day': 101 ( 1.2%) 'and': 93 ( 1.1%) 'time': 81 ( 1.0%) 'is': 78 ( 1.0%) 'was': 67 ( 0.8%) 'you': 44 ( 0.5%) 'what': 20 ( 0.2%) 'or': 16 ( 0.2%) 'but': 15 ( 0.2%) 'she': 13 ( 0.2%) 'the': 12 ( 0.1%) 'are': 11 ( 0.1%) 'now': 7 ( 0.1%) 'he': 4 ( 0.0%) 'how': 4 ( 0.0%) 'have': 3 ( 0.0%) 'why': 3 ( 0.0%) 'they': 2 ( 0.0%) 'if': 2 ( 0.0%) 'when': 1 ( 0.0%) 'in the': 1 ( 0.0%) 'today': 1 ( 0.0%) 'were': 0 ( 0.0%) 'has': 0 ( 0.0%) 'of the': 0 ( 0.0%) 'to the': 0 ( 0.0%) 'and the': 0 ( 0.0%)
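To actually compare two datasets, compute the same signature on each and measure how close the two count distributions are. A hedged sketch using cosine similarity; signature_a and signature_b stand in for signatures produced by create_search_signature() on two different datasets:
# Compare two signature dicts (query -> count) with cosine similarity.
def signature_similarity(signature_a: Dict[str, int], signature_b: Dict[str, int]) -> float:
    keys = sorted(set(signature_a) | set(signature_b))
    a = np.array([signature_a.get(k, 0) for k in keys], dtype=float)
    b = np.array([signature_b.get(k, 0) for k in keys], dtype=float)
    if not a.any() or not b.any():
        return 0.0
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sanity check: a dataset compared against itself scores 1.0.
print(f"Self-similarity: {signature_similarity(signature, signature):.3f}")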
Summary and Best Practices¶
Let's wrap up with a summary of search functionality and best practices.
Search Functionality Summary¶
🔍 CORE SEARCH METHODS:
- dataset_manager.search.count(query) → Count occurrences of a token sequence
- dataset_manager.search.contains(query) → Check if a sequence exists (faster than count)
- dataset_manager.search.positions(query) → Get all positions where a sequence appears
- dataset_manager.search.count_next(query) → Count which tokens follow the sequence
- dataset_manager.search.batch_count_next(queries) → Batch version for multiple queries
🧠 ADVANCED FEATURES:
- dataset_manager.search.sample_smoothed(query, n, k, num_samples) → Generate continuations using Kneser-Ney smoothing
- Efficient indexing with vocabulary size optimization
- Memory-mapped index for large datasets
- Support for 2^16 and 2^32 vocabulary sizes
💡 PRACTICAL APPLICATIONS:
- Content analysis and categorization
- Dataset quality assessment
- Language pattern analysis
- Next token prediction and modeling
- Linguistic research and analysis
- Dataset comparison and signatures
⚡ PERFORMANCE TIPS:
- Use batch operations for multiple queries
- Check contains() before requesting positions for sequences that may be rare or absent
- Use appropriate vocabulary size (2^16 vs 2^32)
- Reuse indexes when possible
- Consider query length impact on performance
🛠️ BEST PRACTICES:
- Always validate token sequences before search
- Handle edge cases (empty results, encoding issues)
- Use meaningful variable names for token sequences
- Consider memory usage for large result sets
- Cache frequently used search results
- Combine search with other TokenSmith handlers for powerful workflows
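As a closing example, here is a small helper (a sketch, not part of the TokenSmith API) that folds several of these practices together: it validates the query, checks contains() before requesting positions, and caches results keyed by the token tuple.
# Hypothetical helper combining validation, a cheap existence check, and caching.
from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_lookup(token_tuple: tuple) -> dict:
    tokens = list(token_tuple)
    if not tokens:
        return {"count": 0, "positions": []}
    if not dataset_manager.search.contains(tokens):  # cheap existence check first
        return {"count": 0, "positions": []}
    count = dataset_manager.search.count(tokens)
    # Only materialize positions for sequences rare enough to inspect by hand.
    positions = dataset_manager.search.positions(tokens) if count <= 1000 else []
    return {"count": count, "positions": positions}

print(cached_lookup(tuple(tokenizer.encode(" icy hill", add_special_tokens=False))))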
Next Steps¶
Congratulations! You've learned how to use TokenSmith's search functionality effectively. Here are some suggested next steps:
🎯 Immediate Actions:¶
- Experiment with your own dataset using the search methods
- Combine search with sampling and editing for powerful workflows
- Build custom analysis tools using the search results
- Optimize your search queries for better performance
📚 Additional Resources:¶
- TokenSmith Documentation - Complete API reference
- Basic Setup Tutorial - Getting started with TokenSmith
- Inspection Tutorial - Dataset examination techniques
- Sampling Tutorial - Flexible data sampling strategies
- Editing Tutorial - Dataset modification techniques
🚀 Advanced Projects:¶
- Language Model Analysis: Use search to analyze and compare different language models
- Content Classification: Build automated content classifiers using search patterns
- Dataset Curation: Use search to identify and filter high-quality content
- Linguistic Research: Investigate language patterns and evolution in large corpora
- Quality Control: Build automated quality assessment pipelines
🔬 Research Applications:¶
- Bias Detection: Search for potentially biased patterns in training data
- Memorization Studies: Identify memorized content in language models
- Distribution Analysis: Understand token and phrase distributions
- Cross-lingual Analysis: Compare patterns across different languages
- Temporal Analysis: Track language change over time in timestamped datasets
Happy searching! 🔍✨