Dataset Searching Tutorial¶
Welcome to the comprehensive search functionality tutorial! This guide covers TokenSmith's powerful search capabilities for finding and analyzing token sequences in your datasets.
What you'll learn:
- Basic search operations (count, contains, positions)
- Advanced search features (next token prediction)
- Batch search operations for efficiency
- N-gram sampling with smoothing
- Real-world search applications
Prerequisites:
- Completed basic setup tutorial
- Understanding of tokenization
- Familiarity with token sequences
Setup and Configuration¶
First, let's set up our environment with the necessary imports and initialize our dataset manager.
# May not be necessary, but ensures the path is set correctly
import sys
sys.path.insert(0, "/NS/llm-pretraining/work/afkhan/tokensmith")
sys.path.insert(0, "/NS/llm-pretraining/work/afkhan/USC_Colab/gpt-neox")
import numpy as np
import json
from collections import Counter
import matplotlib.pyplot as plt
from typing import List, Dict, Tuple
Initialize Tokenizer¶
For search operations, we need a tokenizer to convert text to tokens and back.
from transformers import AutoTokenizer
TOKENIZER_NAME_OR_PATH = "EleutherAI/gpt-neox-20b"
tokenizer = AutoTokenizer.from_pretrained(TOKENIZER_NAME_OR_PATH, add_eos_token=True)
print(f"Tokenizer loaded: {TOKENIZER_NAME_OR_PATH}")
print(f"Vocabulary size: {len(tokenizer)}")
print(f"EOS token: {tokenizer.eos_token} (ID: {tokenizer.eos_token_id})")
print(f"BOS token: {tokenizer.bos_token} (ID: {tokenizer.bos_token_id})")
Tokenizer loaded: EleutherAI/gpt-neox-20b Vocabulary size: 50277 EOS token: <|endoftext|> (ID: 0) BOS token: <|endoftext|> (ID: 0)
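Note that BPE tokenizers such as GPT-NeoX's are whitespace-sensitive: a leading space changes the first token ID, which matters for the searches below. A quick check using the tokenizer we just loaded:
# A word with and without a leading space maps to different token IDs,
# which is why several search queries below prepend a space on purpose.
for text in ["wanted to", " wanted to"]:
    ids = tokenizer.encode(text, add_special_tokens=False)
    print(f"{text!r:>15} -> {ids}")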
Setup Dataset Manager¶
Initialize the DatasetManager with search functionality enabled.
from tokensmith.manager import DatasetManager
dataset_manager = DatasetManager()
# Setup search functionality - this builds/loads the search index
dataset_manager.setup_search(
bin_file_path="../../artifacts/data_ingested_text_document.bin",
search_index_save_path="../../artifacts/search_index_text_document.idx",
vocab=2**16, # Use 2**16 for GPT-NeoX tokenizer
verbose=True,
reuse=False, # Build the index from scratch; set to True to reuse an existing index file
)
print("Search functionality initialized successfully!")
print(f"Search handler available: {dataset_manager.search is not None}")
Writing indices to disk... Time elapsed: 35.987326ms Sorting indices... Time elapsed: 252.669649ms Search functionality initialized successfully! Search handler available: True
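On later runs you can skip the index build by reusing the file written above. A minimal sketch of the same call, assuming that index file is still intact:
# Reuse the previously written index instead of rebuilding it.
dataset_manager.setup_search(
    bin_file_path="../../artifacts/data_ingested_text_document.bin",
    search_index_save_path="../../artifacts/search_index_text_document.idx",
    vocab=2**16,
    verbose=True,
    reuse=True,  # load the existing index file rather than rebuilding
)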
Basic Search Operations¶
Let's start with the fundamental search operations: count, contains, and positions.
Counting Token Sequences¶
The count() method returns how many times a specific token sequence appears in the dataset.
# Example 1: Search for common phrases
common_phrases = [
"Once upon a time",
" icy hill",
" small yard",
" pretty candle",
" wanted to" # Prepended space is intentional for tokenization as the first token is then different than what it would be without it
]
print("=== Phrase Frequency Analysis ===")
for phrase in common_phrases:
# Convert text to tokens
tokens = tokenizer.encode(phrase, add_special_tokens=False)
count = dataset_manager.search.count(tokens)
print(f"'{phrase}':")
print(f" Tokens: {tokens}")
print(f" Count: {count}")
print()
# Example 2: Single token counts
print("=== Single Token Analysis ===")
common_words = ["the", "and", "to", "of", "a"]
for word in common_words:
token_id = tokenizer.encode(word, add_special_tokens=False)[0] # Get first token
count = dataset_manager.search.count([token_id])
decoded = tokenizer.decode([token_id])
print(f"Token '{decoded}' (ID: {token_id}): {count} occurrences")
=== Phrase Frequency Analysis === 'Once upon a time': Tokens: [10758, 2220, 247, 673] Count: 13533 ' icy hill': Tokens: [42947, 13599] Count: 9 ' small yard': Tokens: [1355, 15789] Count: 2 ' pretty candle': Tokens: [3965, 28725] Count: 3 ' wanted to': Tokens: [3078, 281] Count: 11524 === Single Token Analysis === Token 'the' (ID: 783): 12 occurrences Token 'and' (ID: 395): 93 occurrences Token 'to' (ID: 936): 21 occurrences Token 'of' (ID: 1171): 82 occurrences Token 'a' (ID: 66): 510 occurrences
Checking Sequence Existence¶
The contains() method quickly checks whether a sequence exists at all, without counting every occurrence.
# Test various sequences for existence
common_phrases = [
"Once upon a time",
" icy hill",
" small yard",
" pretty candle",
" wanted to" # Prepended space is intentional for tokenization as the first token is then different than what it would be without it
]
print("=== Sequence Existence Check ===")
for sequence in common_phrases:
tokens = tokenizer.encode(sequence, add_special_tokens=False)
exists = dataset_manager.search.contains(tokens)
status = "✓ Found" if exists else "✗ Not found"
print(f"{status}: '{sequence}'")
# If found, also get the count
if exists:
count = dataset_manager.search.count(tokens)
print(f" Occurrences: {count}")
print()
=== Sequence Existence Check === ✓ Found: 'Once upon a time' Occurrences: 13533 ✓ Found: ' icy hill' Occurrences: 9 ✓ Found: ' small yard' Occurrences: 2 ✓ Found: ' pretty candle' Occurrences: 3 ✓ Found: ' wanted to' Occurrences: 11524
Finding Sequence Positions¶
The positions() method returns all locations where a sequence appears in the dataset.
# Find positions of specific sequences
search_phrase = "Once upon a time"
tokens = tokenizer.encode(search_phrase, add_special_tokens=False)
print(f"=== Position Analysis for '{search_phrase}' ===")
print(f"Tokens: {tokens}")
positions = dataset_manager.search.positions(tokens)
count = len(positions)
print(f"Total occurrences: {count}")
if count > 0:
print(f"Positions (first 10): {positions[:10]}")
else:
print("Sequence not found in dataset")
=== Position Analysis for 'Once upon a time' === Tokens: [10758, 2220, 247, 673] Total occurrences: 13533 Positions (first 10): [1871078, 3526259, 739125, 1332842, 1838022, 3434072, 484592, 2798653, 313457, 1297946]
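Because positions() returns raw token offsets, you can quickly see where a phrase clusters in the dataset. A minimal sketch using the matplotlib import from the setup cell and the positions variable computed above:
# Plot where the phrase occurs across the token stream (bin count is arbitrary).
if count > 0:
    plt.figure(figsize=(8, 3))
    plt.hist(positions, bins=50)
    plt.xlabel("Token offset in dataset")
    plt.ylabel("Occurrences")
    plt.title(f"Where '{search_phrase}' appears")
    plt.tight_layout()
    plt.show()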
Advanced Search: Next Token Prediction¶
One of the most powerful features is count_next(), which shows which tokens typically follow a given sequence.
Analyzing Token Transitions¶
Let's see what commonly follows specific phrases.
def analyze_next_tokens(phrase: str, top_k: int = 10) -> Dict:
"""Analyze what tokens commonly follow a given phrase."""
tokens = tokenizer.encode(phrase, add_special_tokens=False)
# Get next token counts
next_counts = dataset_manager.search.count_next(tokens)
# Find non-zero counts
results = []
for token_id, count in enumerate(next_counts):
if count > 0:
try:
token_text = tokenizer.decode([token_id])
results.append((token_id, token_text, count))
except:
results.append((token_id, f"[ID:{token_id}]", count))
# Sort by count and return top k
results.sort(key=lambda x: x[2], reverse=True)
return {
'phrase': phrase,
'phrase_tokens': tokens,
'total_phrase_count': dataset_manager.search.count(tokens),
'top_next_tokens': results[:top_k],
'unique_next_tokens': len(results)
}
# Analyze several interesting phrases
analysis_phrases = [
"The cat",
"I am",
"Once upon a",
"In the",
"She said"
]
print("=== Next Token Analysis ===")
for phrase in analysis_phrases:
analysis = analyze_next_tokens(phrase, top_k=5)
print(f"\nPhrase: '{phrase}'")
print(f"Phrase occurs {analysis['total_phrase_count']} times")
print(f"Followed by {analysis['unique_next_tokens']} different tokens")
print("Top continuations:")
for i, (token_id, token_text, count) in enumerate(analysis['top_next_tokens'], 1):
percentage = (count / analysis['total_phrase_count']) * 100
print(f" {i}. '{token_text}' (ID: {token_id}) - {count} times ({percentage:.1f}%)")
=== Next Token Analysis === Phrase: 'The cat' Phrase occurs 5 times Followed by 3 different tokens Top continuations: 1. ' is' (ID: 310) - 3 times (60.0%) 2. ' will' (ID: 588) - 1 times (20.0%) 3. ' kept' (ID: 4934) - 1 times (20.0%) Phrase: 'I am' Phrase occurs 392 times Followed by 120 different tokens Top continuations: 1. ' sorry' (ID: 7016) - 74 times (18.9%) 2. ' a' (ID: 247) - 38 times (9.7%) 3. ' the' (ID: 253) - 26 times (6.6%) 4. ' going' (ID: 1469) - 20 times (5.1%) 5. ' so' (ID: 594) - 16 times (4.1%) Phrase: 'Once upon a' Phrase occurs 13542 times Followed by 9 different tokens Top continuations: 1. ' time' (ID: 673) - 13533 times (99.9%) 2. ' Tuesday' (ID: 7948) - 2 times (0.0%) 3. ' long' (ID: 1048) - 1 times (0.0%) 4. ' day' (ID: 1388) - 1 times (0.0%) 5. ' night' (ID: 2360) - 1 times (0.0%) Phrase: 'In the' Phrase occurs 10 times Followed by 9 different tokens Top continuations: 1. ' park' (ID: 5603) - 2 times (20.0%) 2. ' big' (ID: 1943) - 1 times (10.0%) 3. ' dark' (ID: 3644) - 1 times (10.0%) 4. ' morning' (ID: 4131) - 1 times (10.0%) 5. ' middle' (ID: 4766) - 1 times (10.0%) Phrase: 'She said' Phrase occurs 1 times Followed by 1 different tokens Top continuations: 1. ' it' (ID: 352) - 1 times (100.0%)
Story Continuation Analysis¶
Let's do a deeper analysis of story beginnings to understand narrative patterns.
def story_continuation_analysis():
"""Analyze how stories typically continue after common openings."""
story_openings = [
"The cat",
"I am",
"Once upon a",
"In the",
"She said"
]
print("=== Story Continuation Patterns ===")
for opening in story_openings:
tokens = tokenizer.encode(opening, add_special_tokens=False)
phrase_count = dataset_manager.search.count(tokens)
if phrase_count == 0:
print(f"\n'{opening}': Not found in dataset")
continue
print(f"\n'{opening}' (appears {phrase_count} times):")
# Get next tokens
next_counts = dataset_manager.search.count_next(tokens)
# Build continuations by looking at multiple next tokens
next_tokens = []
for token_id, count in enumerate(next_counts):
if count > 0:
try:
token_text = tokenizer.decode([token_id])
next_tokens.append((token_id, token_text, count))
except:
continue
# Sort and show top continuations
next_tokens.sort(key=lambda x: x[2], reverse=True)
print(" Most common continuations:")
for i, (token_id, token_text, count) in enumerate(next_tokens[:7], 1):
# Create continuation phrase
continuation_tokens = tokens + [token_id]
full_continuation = tokenizer.decode(continuation_tokens)
probability = (count / phrase_count) * 100
print(f" {i}. '{full_continuation}' ({probability:.1f}%)")
story_continuation_analysis()
=== Story Continuation Patterns === 'The cat' (appears 5 times): Most common continuations: 1. 'The cat is' (60.0%) 2. 'The cat will' (20.0%) 3. 'The cat kept' (20.0%) 'I am' (appears 392 times): Most common continuations: 1. 'I am sorry' (18.9%) 2. 'I am a' (9.7%) 3. 'I am the' (6.6%) 4. 'I am going' (5.1%) 5. 'I am so' (4.1%) 6. 'I am glad' (3.1%) 7. 'I am happy' (2.8%) 'Once upon a' (appears 13542 times): Most common continuations: 1. 'Once upon a time' (99.9%) 2. 'Once upon a Tuesday' (0.0%) 3. 'Once upon a long' (0.0%) 4. 'Once upon a day' (0.0%) 5. 'Once upon a night' (0.0%) 6. 'Once upon a morning' (0.0%) 7. 'Once upon a Sunday' (0.0%) 'In the' (appears 10 times): Most common continuations: 1. 'In the park' (20.0%) 2. 'In the big' (10.0%) 3. 'In the dark' (10.0%) 4. 'In the morning' (10.0%) 5. 'In the middle' (10.0%) 6. 'In the summer' (10.0%) 7. 'In the farm' (10.0%) 'She said' (appears 1 times): Most common continuations: 1. 'She said it' (100.0%)
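count_next() only looks one token ahead, but you can chain it: append the most likely continuation to the query and ask again. A small greedy two-step sketch for one of the openings above:
# Greedy two-token lookahead by chaining count_next() calls.
tokens = tokenizer.encode("The cat", add_special_tokens=False)
for step in range(2):
    next_counts = dataset_manager.search.count_next(tokens)
    best_id = int(np.argmax(next_counts))  # most frequent next token
    tokens.append(best_id)
    print(f"Step {step + 1}: {tokenizer.decode(tokens)!r}")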
Batch Search Operations¶
For efficiency when searching multiple sequences, use batch operations.
# Batch next token analysis
def batch_next_token_analysis():
"""Demonstrate batch search operations for efficiency."""
# Prepare multiple queries
phrases = [
"The dog",
"The cat",
"The bird",
"The fish",
"The horse"
]
# Convert all phrases to token sequences
token_queries = []
for phrase in phrases:
tokens = tokenizer.encode(phrase, add_special_tokens=False)
token_queries.append(tokens)
print("=== Batch Next Token Analysis ===")
print(f"Analyzing {len(phrases)} phrases simultaneously...")
# Perform batch search
batch_results = dataset_manager.search.batch_count_next(token_queries)
# Analyze results
for i, (phrase, query_tokens, next_counts) in enumerate(zip(phrases, token_queries, batch_results)):
phrase_count = dataset_manager.search.count(query_tokens)
print(f"\n{i+1}. '{phrase}' (occurs {phrase_count} times):")
# Find top next tokens
next_tokens = []
for token_id, count in enumerate(next_counts):
if count > 0:
try:
token_text = tokenizer.decode([token_id])
next_tokens.append((token_text, count))
except:
continue
# Sort and display top 3
next_tokens.sort(key=lambda x: x[1], reverse=True)
for j, (token_text, count) in enumerate(next_tokens[:3], 1):
probability = (count / phrase_count) * 100 if phrase_count > 0 else 0
print(f" {j}. '{token_text}' - {count} times ({probability:.1f}%)")
batch_next_token_analysis()
=== Batch Next Token Analysis === Analyzing 5 phrases simultaneously... 1. 'The dog' (occurs 22 times): 1. ' is' - 13 times (59.1%) 2. ' was' - 2 times (9.1%) 3. ' might' - 2 times (9.1%) 2. 'The cat' (occurs 5 times): 1. ' is' - 3 times (60.0%) 2. ' will' - 1 times (20.0%) 3. ' kept' - 1 times (20.0%) 3. 'The bird' (occurs 10 times): 1. ' is' - 6 times (60.0%) 2. ' does' - 2 times (20.0%) 3. 'ie' - 1 times (10.0%) 4. 'The fish' (occurs 1 times): 1. ' smiled' - 1 times (100.0%) 5. 'The horse' (occurs 2 times): 1. ' is' - 1 times (50.0%) 2. ' feels' - 1 times (50.0%)
N-gram Sampling with Smoothing¶
TokenSmith includes advanced n-gram sampling with Kneser-Ney smoothing for generating realistic continuations.
def demonstrate_ngram_sampling():
"""Demonstrate n-gram sampling with smoothing."""
# Start with a story beginning
seed_phrase = "Once upon a time there was a"
seed_tokens = tokenizer.encode(seed_phrase, add_special_tokens=False)
print("=== N-gram Sampling with Smoothing ===")
print(f"Seed phrase: '{seed_phrase}'")
print(f"Seed tokens: {seed_tokens}")
# Generate several continuations using different n-gram orders
n_values = [2, 3, 4] # bi-gram, tri-gram, 4-gram
for n in n_values:
print(f"\n--- {n}-gram Sampling ---")
try:
# Sample continuations
samples = dataset_manager.search.sample_smoothed(
query=seed_tokens,
n=n, # n-gram order
k=10, # length of continuation
num_samples=3 # number of samples
)
print(f"Generated {len(samples)} continuations:")
for i, sample_tokens in enumerate(samples, 1):
# Combine seed and sample
full_sequence = seed_tokens + sample_tokens
full_text = tokenizer.decode(full_sequence)
continuation_text = tokenizer.decode(sample_tokens)
print(f" {i}. Continuation: '{continuation_text}'")
print(f" Full text: '{full_text}'")
print()
except Exception as e:
print(f"Error with {n}-gram sampling: {e}")
continue
demonstrate_ngram_sampling()
=== N-gram Sampling with Smoothing === Seed phrase: 'Once upon a time there was a' Seed tokens: [10758, 2220, 247, 673, 627, 369, 247] --- 2-gram Sampling --- Generated 3 continuations: 1. Continuation: 'Once upon a time there was a cork from the microscope from the animals. After' Full text: 'Once upon a time there was aOnce upon a time there was a cork from the microscope from the animals. After' 2. Continuation: 'Once upon a time there was a great thing they had made a large it. I' Full text: 'Once upon a time there was aOnce upon a time there was a great thing they had made a large it. I' 3. Continuation: 'Once upon a time there was a big adventure, stick again. He saw a time' Full text: 'Once upon a time there was aOnce upon a time there was a big adventure, stick again. He saw a time' --- 3-gram Sampling --- Generated 3 continuations: 1. Continuation: 'Once upon a time there was a girl named Lily. She does not know. During' Full text: 'Once upon a time there was aOnce upon a time there was a girl named Lily. She does not know. During' 2. Continuation: 'Once upon a time there was a little girl named Sally, "It is too strong' Full text: 'Once upon a time there was aOnce upon a time there was a little girl named Sally, "It is too strong' 3. Continuation: 'Once upon a time there was a little bird. "No, this is my friend' Full text: 'Once upon a time there was aOnce upon a time there was a little bird. "No, this is my friend' --- 4-gram Sampling --- Generated 3 continuations: 1. Continuation: 'Once upon a time there was a woman who lived in a small house near the woods' Full text: 'Once upon a time there was aOnce upon a time there was a woman who lived in a small house near the woods' 2. Continuation: 'Once upon a time there was a little girl, and he would often make sure they' Full text: 'Once upon a time there was aOnce upon a time there was a little girl, and he would often make sure they' 3. Continuation: 'Once upon a time there was a fish named Fin. Fin loved to swim, fish' Full text: 'Once upon a time there was aOnce upon a time there was a fish named Fin. Fin loved to swim, fish'
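In this run the decoded samples already begin with the seed phrase, which is why the 'Full text' lines show it twice ('...there was aOnce upon a time...'). If your version of sample_smoothed() behaves the same way, decoding the sample on its own is enough; a small adjustment, assuming the same seed as above:
# Decode each sample directly; in the run above the returned sequences
# appear to already include the seed tokens.
seed_tokens = tokenizer.encode("Once upon a time there was a", add_special_tokens=False)
samples = dataset_manager.search.sample_smoothed(query=seed_tokens, n=3, k=10, num_samples=3)
for i, sample_tokens in enumerate(samples, 1):
    print(f"{i}. {tokenizer.decode(sample_tokens)!r}")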
Real-World Search Applications¶
Let's explore practical applications of search functionality for dataset analysis and research.
Content Analysis and Filtering¶
Use search to understand the content distribution in your dataset.
def content_analysis():
"""Analyze dataset content using search functionality."""
# Define content categories to search for
categories = {
'Science': ['science', 'research', 'experiment', 'hypothesis', 'data'],
'Technology': ['computer', 'software', 'algorithm', 'programming', 'digital'],
'Literature': ['novel', 'story', 'character', 'plot', 'narrative'],
'Education': ['learn', 'teach', 'student', 'school', 'education'],
'History': ['history', 'ancient', 'war', 'empire', 'civilization']
}
print("=== Dataset Content Analysis ===")
category_scores = {}
for category, keywords in categories.items():
total_score = 0
keyword_results = []
for keyword in keywords:
tokens = tokenizer.encode(keyword, add_special_tokens=False)
count = dataset_manager.search.count(tokens)
total_score += count
keyword_results.append((keyword, count))
category_scores[category] = {
'total_score': total_score,
'keywords': keyword_results
}
print(f"\n{category} (Total mentions: {total_score}):")
# Sort keywords by frequency
keyword_results.sort(key=lambda x: x[1], reverse=True)
for keyword, count in keyword_results:
print(f" '{keyword}': {count}")
# Find dominant category
dominant_category = max(category_scores.keys(), key=lambda k: category_scores[k]['total_score'])
print(f"\nDominant content category: {dominant_category}")
return category_scores
content_scores = content_analysis()
=== Dataset Content Analysis === Science (Total mentions: 0): 'science': 0 'research': 0 'experiment': 0 'hypothesis': 0 'data': 0 Technology (Total mentions: 0): 'computer': 0 'software': 0 'algorithm': 0 'programming': 0 'digital': 0 Literature (Total mentions: 1): 'story': 1 'novel': 0 'character': 0 'plot': 0 'narrative': 0 Education (Total mentions: 0): 'learn': 0 'teach': 0 'student': 0 'school': 0 'education': 0 History (Total mentions: 3): 'war': 3 'history': 0 'ancient': 0 'empire': 0 'civilization': 0 Dominant content category: History
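The near-zero counts above are partly a tokenization effect: 'science' without a leading space encodes to a different first token than ' science', which is how the word appears mid-sentence (see the note on prepended spaces earlier). A hedged variant that counts both forms of each keyword:
# Count a keyword with and without a leading space, since mid-sentence
# occurrences tokenize with the space attached to the word.
def count_word(word: str) -> int:
    total = 0
    for variant in (word, " " + word):
        ids = tokenizer.encode(variant, add_special_tokens=False)
        total += dataset_manager.search.count(ids)
    return total

for keyword in ["science", "story", "school", "history"]:
    print(f"'{keyword}': {count_word(keyword)} occurrences (both forms)")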
Quality Assessment¶
Use search to identify potential quality issues in your dataset.
def quality_assessment():
"""Use search to assess dataset quality."""
print("=== Dataset Quality Assessment ===")
# Check for repetitive patterns
repetitive_patterns = [
"the the",
"and and",
"is is",
"to to",
"a a a"
]
print("\n1. Repetitive Pattern Detection:")
repetitive_found = False
for pattern in repetitive_patterns:
tokens = tokenizer.encode(pattern, add_special_tokens=False)
count = dataset_manager.search.count(tokens)
if count > 0:
print(f" '{pattern}': {count} occurrences ⚠️")
repetitive_found = True
if not repetitive_found:
print(" ✓ No obvious repetitive patterns found")
# Check for encoding issues
print("\n2. Potential Encoding Issues:")
encoding_issues = [
"\\n", # Escaped newlines
"\\t", # Escaped tabs
"\\r", # Escaped carriage returns
"’", # Common encoding artifact
"“", # Another common artifact
]
encoding_found = False
for issue in encoding_issues:
tokens = tokenizer.encode(issue, add_special_tokens=False)
count = dataset_manager.search.count(tokens)
if count > 0:
print(f" '{issue}': {count} occurrences ⚠️")
encoding_found = True
if not encoding_found:
print(" ✓ No obvious encoding issues found")
# Check for placeholder text
print("\n3. Placeholder Text Detection:")
placeholders = [
"lorem ipsum",
"placeholder text",
"sample text",
"TODO",
"FIXME"
]
placeholder_found = False
for placeholder in placeholders:
tokens = tokenizer.encode(placeholder, add_special_tokens=False)
count = dataset_manager.search.count(tokens)
if count > 0:
print(f" '{placeholder}': {count} occurrences ⚠️")
placeholder_found = True
if not placeholder_found:
print(" ✓ No placeholder text found")
quality_assessment()
=== Dataset Quality Assessment === 1. Repetitive Pattern Detection: ✓ No obvious repetitive patterns found 2. Potential Encoding Issues: ✓ No obvious encoding issues found 3. Placeholder Text Detection: ✓ No placeholder text found
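The same pattern extends to other heuristics. For instance, runs of consecutive end-of-text tokens can hint at empty or truncated documents; an illustrative sketch using only count() and the EOS ID from the tokenizer:
# Count runs of consecutive EOS tokens, which may indicate empty documents
# between samples (an illustrative heuristic, not a definitive check).
eos_id = tokenizer.eos_token_id
for run_length in (2, 3, 4):
    run_count = dataset_manager.search.count([eos_id] * run_length)
    print(f"{run_length} consecutive EOS tokens: {run_count} occurrences")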
Language Pattern Analysis¶
Analyze linguistic patterns and style in your dataset.
def linguistic_analysis():
"""Analyze linguistic patterns in the dataset."""
print("=== Linguistic Pattern Analysis ===")
# Analyze sentence starters
print("\n1. Common Sentence Starters:")
sentence_starters = [
"The", "A", "An", "I", "We", "They", "He", "She", "It",
"In", "On", "At", "With", "For", "During", "After", "Before"
]
starter_counts = []
for starter in sentence_starters:
tokens = tokenizer.encode(starter, add_special_tokens=False)
count = dataset_manager.search.count(tokens)
starter_counts.append((starter, count))
# Sort by frequency
starter_counts.sort(key=lambda x: x[1], reverse=True)
for i, (starter, count) in enumerate(starter_counts[:10], 1):
print(f" {i:2d}. '{starter}': {count}")
# Analyze question patterns
print("\n2. Question Patterns:")
question_words = ["What", "Where", "When", "Why", "How", "Who", "Which"]
total_questions = 0
for word in question_words:
tokens = tokenizer.encode(word, add_special_tokens=False)
count = dataset_manager.search.count(tokens)
total_questions += count
print(f" '{word}': {count}")
print(f" Total question indicators: {total_questions}")
# Analyze temporal indicators
print("\n3. Temporal Indicators:")
temporal_words = [
"yesterday", "today", "tomorrow",
"now", "then", "later", "soon",
"before", "after", "during", "while"
]
for word in temporal_words:
tokens = tokenizer.encode(word, add_special_tokens=False)
count = dataset_manager.search.count(tokens)
if count > 0:
print(f" '{word}': {count}")
linguistic_analysis()
=== Linguistic Pattern Analysis === 1. Common Sentence Starters: 1. 'I': 6164 2. 'It': 2266 3. 'We': 1426 4. 'The': 511 5. 'A': 182 6. 'They': 136 7. 'He': 121 8. 'She': 52 9. 'An': 24 10. 'In': 16 2. Question Patterns: 'What': 1863 'Where': 240 'When': 24 'Why': 561 'How': 151 'Who': 201 'Which': 13 Total question indicators: 3053 3. Temporal Indicators: 'today': 1 'now': 7 'then': 1
Search Performance and Optimization¶
Understanding search performance helps optimize your analysis workflows.
import time
def search_performance_analysis():
"""Analyze search performance for different query types."""
print("=== Search Performance Analysis ===")
# Pre-tokenize a long sequence for consistent testing
test_sequence = "the quick brown fox jumps over the lazy dog in the park"
all_tokens = tokenizer.encode(test_sequence, add_special_tokens=False)
# Test different query lengths using slices of the same sequence
test_queries = [
(all_tokens[:1], "Single token"),
(all_tokens[:2], "Two tokens"),
(all_tokens[:4], "Four tokens"),
(all_tokens[:7], "Seven tokens")
]
print("\n1. Query Length Performance:")
for tokens, description in test_queries:
query_text = tokenizer.decode(tokens)
# Time the count operation
start_time = time.time()
count = dataset_manager.search.count(tokens)
count_time = time.time() - start_time
# Time the positions operation (if count is reasonable)
if count < 1000: # Only get positions for reasonable counts
start_time = time.time()
positions = dataset_manager.search.positions(tokens)
positions_time = time.time() - start_time
else:
positions_time = "N/A (too many results)"
print(f" {description} ('{query_text}'):")
print(f" Count: {count} (Time: {count_time:.4f}s)")
print(f" Positions: {positions_time if isinstance(positions_time, str) else f'{positions_time:.4f}s'}")
# Test batch vs individual operations
print("\n2. Batch vs Individual Operations:")
# Use consistent tokenized queries
base_phrase = "Once upon a time there was a dog and a cat and a bird"
base_tokens = tokenizer.encode(base_phrase, add_special_tokens=False)
batch_queries = [
base_tokens[:2], # "the dog"
base_tokens[4:6], # "the cat"
base_tokens[8:10] # "the bird"
]*100
# Individual operations
start_time = time.time()
individual_results = []
for query in batch_queries:
result = dataset_manager.search.count_next(query)
individual_results.append(result)
individual_time = time.time() - start_time
# Batch operation
start_time = time.time()
batch_results = dataset_manager.search.batch_count_next(batch_queries)
batch_time = time.time() - start_time
assert len(individual_results) == len(batch_results), "Batch results length mismatch"
print(f" Individual operations: {individual_time:.4f}s")
print(f" Batch operation: {batch_time:.4f}s")
print(f" Speedup: {individual_time/batch_time:.2f}x" if batch_time > 0 else " Batch operation too fast to measure accurately")
# Test query frequency impact
print("\n3. Query Frequency Impact:")
# Test common vs rare sequences
common_tokens = all_tokens[:1] # Very common single token
rare_tokens = all_tokens[:6] # Potentially rare 6-token sequence
# Test common query
start_time = time.time()
common_count = dataset_manager.search.count(common_tokens)
common_time = time.time() - start_time
# Test rare query
start_time = time.time()
rare_count = dataset_manager.search.count(rare_tokens)
rare_time = time.time() - start_time
print(f" Common query ('{tokenizer.decode(common_tokens)}'): {common_count} results in {common_time:.4f}s")
print(f" Rare query ('{tokenizer.decode(rare_tokens)}'): {rare_count} results in {rare_time:.4f}s")
search_performance_analysis()
=== Search Performance Analysis === 1. Query Length Performance: Single token ('the'): Count: 12 (Time: 0.0000s) Positions: 0.0000s Two tokens ('the quick'): Count: 0 (Time: 0.0000s) Positions: 0.0000s Four tokens ('the quick brown fox'): Count: 0 (Time: 0.0000s) Positions: 0.0000s Seven tokens ('the quick brown fox jumps over the'): Count: 0 (Time: 0.0000s) Positions: 0.0000s 2. Batch vs Individual Operations: Individual operations: 0.4445s Batch operation: 0.9893s Speedup: 0.45x 3. Query Frequency Impact: Common query ('the'): 12 results in 0.0001s Rare query ('the quick brown fox jumps over'): 0 results in 0.0000s
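Sub-millisecond timings like those above are mostly noise, and in this run the batch call was actually slower than the plain loop. For a fairer comparison, repeat each measurement and keep the best time with time.perf_counter; a small sketch:
# Repeat each measurement and keep the best time to reduce timing noise.
import time

def best_time(fn, repeats: int = 5) -> float:
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn()
        best = min(best, time.perf_counter() - start)
    return best

queries = [tokenizer.encode(p, add_special_tokens=False)
           for p in ["The dog", "The cat", "The bird"]] * 100
loop_t = best_time(lambda: [dataset_manager.search.count_next(q) for q in queries])
batch_t = best_time(lambda: dataset_manager.search.batch_count_next(queries))
print(f"Loop:  {loop_t:.4f}s")
print(f"Batch: {batch_t:.4f}s")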
Advanced Use Cases¶
Let's explore some advanced use cases that demonstrate the full power of search functionality.
Building Custom Language Models¶
Use search results to build simple language models or probability distributions.
def build_simple_language_model():
"""Build a simple n-gram language model using search results."""
print("=== Building Simple Language Model ===")
# Tokenize the starting context (a single word here, so it yields one token)
full_context = "Once"
full_tokens = tokenizer.encode(full_context, add_special_tokens=False)
# Use at most the first 2 tokens as our context
context_tokens = full_tokens[:2]
context = tokenizer.decode(context_tokens)
print(f"Context: '{context}'")
print(f"Context tokens: {context_tokens}")
# Get next token distribution
next_counts = dataset_manager.search.count_next(context_tokens)
context_count = dataset_manager.search.count(context_tokens)
if context_count == 0:
print("Context not found in dataset")
return
print(f"Context appears {context_count} times")
# Build probability distribution
probabilities = []
for token_id, count in enumerate(next_counts):
if count > 0:
prob = count / context_count
try:
token_text = tokenizer.decode([token_id])
probabilities.append((token_id, token_text, count, prob))
except:
continue
# Sort by probability
probabilities.sort(key=lambda x: x[3], reverse=True)
print(f"\nTop 10 most likely next tokens:")
cumulative_prob = 0
for i, (token_id, token_text, count, prob) in enumerate(probabilities[:10], 1):
cumulative_prob += prob
print(f" {i:2d}. '{token_text}' (ID: {token_id})")
print(f" Probability: {prob:.4f} ({prob*100:.1f}%)")
print(f" Count: {count}")
print(f"\nTop 10 tokens cover {cumulative_prob:.1%} of all continuations")
# Generate sample text using the model
print(f"\n--- Sample Generations ---")
for generation in range(3):
print(f"\nGeneration {generation + 1}:")
generation_tokens = context_tokens.copy()
# Generate 5 more tokens
for step in range(5):
# Use consistent context length (2 tokens)
current_context = generation_tokens[-2:]
next_counts = dataset_manager.search.count_next(current_context)
context_total = sum(next_counts)
if context_total == 0:
print(f" No continuations found for context: {tokenizer.decode(current_context)}")
break
# Sample next token based on probability
next_probs = [count / context_total for count in next_counts]
# Handle case where all probabilities are zero
if sum(next_probs) == 0:
print(f" No valid continuations for context: {tokenizer.decode(current_context)}")
break
next_token = np.random.choice(len(next_probs), p=next_probs)
generation_tokens.append(next_token)
generated_text = tokenizer.decode(generation_tokens)
continuation_text = tokenizer.decode(generation_tokens[len(context_tokens):])
print(f" Full text: '{generated_text}'")
print(f" Continuation: '{continuation_text}'")
build_simple_language_model()
=== Building Simple Language Model === Context: 'Once' Context tokens: [10758] Context appears 15326 times Top 10 most likely next tokens: 1. ' upon' (ID: 2220) Probability: 0.8838 (88.4%) Count: 13545 2. ' there' (ID: 627) Probability: 0.1088 (10.9%) Count: 1667 3. ',' (ID: 13) Probability: 0.0071 (0.7%) Count: 109 4. ' a' (ID: 247) Probability: 0.0001 (0.0%) Count: 1 5. ' in' (ID: 275) Probability: 0.0001 (0.0%) Count: 1 6. ' it' (ID: 352) Probability: 0.0001 (0.0%) Count: 1 7. ' Mary' (ID: 6393) Probability: 0.0001 (0.0%) Count: 1 8. ' Upon' (ID: 15797) Probability: 0.0001 (0.0%) Count: 1 Top 10 tokens cover 100.0% of all continuations --- Sample Generations --- Generation 1:
Full text: 'Once upon a time, there' Continuation: ' upon a time, there' Generation 2: Full text: 'Once upon a time, there' Continuation: ' upon a time, there' Generation 3: Full text: 'Once upon a time, in' Continuation: ' upon a time, in'
Dataset Comparison¶
Compare different datasets or dataset versions using search statistics.
def create_search_signature():
"""Create a 'signature' of the dataset using search statistics."""
print("=== Dataset Search Signature ===")
# Define signature queries - common patterns that characterize text
signature_queries = [
# Articles
"the", "a", "an",
# Pronouns
"I", "you", "he", "she", "we", "they",
# Common verbs
"is", "was", "are", "were", "have", "has",
# Conjunctions
"and", "or", "but", "if", "when",
# Common phrases
"of the", "in the", "to the", "and the",
# Question words
"what", "where", "when", "why", "how",
# Temporal
"time", "day", "year", "today", "now"
]
signature = {}
total_signature_count = 0
print("Computing dataset signature...")
for query_text in signature_queries:
tokens = tokenizer.encode(query_text, add_special_tokens=False)
count = dataset_manager.search.count(tokens)
signature[query_text] = count
total_signature_count += count
# Normalize to percentages
signature_percentages = {}
for query_text, count in signature.items():
percentage = (count / total_signature_count) * 100 if total_signature_count > 0 else 0
signature_percentages[query_text] = percentage
print(f"\nDataset Signature (Total signature tokens: {total_signature_count}):")
# Sort by frequency
sorted_signature = sorted(signature.items(), key=lambda x: x[1], reverse=True)
for query_text, count in sorted_signature:
percentage = signature_percentages[query_text]
print(f" '{query_text}': {count:6d} ({percentage:5.1f}%)")
return signature
signature = create_search_signature()
=== Dataset Search Signature === Computing dataset signature... Dataset Signature (Total signature tokens: 8188): 'I': 6164 ( 75.3%) 'a': 510 ( 6.2%) 'where': 351 ( 4.3%) 'we': 248 ( 3.0%) 'an': 210 ( 2.6%) 'year': 125 ( 1.5%) 'day': 101 ( 1.2%) 'and': 93 ( 1.1%) 'time': 81 ( 1.0%) 'is': 78 ( 1.0%) 'was': 67 ( 0.8%) 'you': 44 ( 0.5%) 'what': 20 ( 0.2%) 'or': 16 ( 0.2%) 'but': 15 ( 0.2%) 'she': 13 ( 0.2%) 'the': 12 ( 0.1%) 'are': 11 ( 0.1%) 'now': 7 ( 0.1%) 'he': 4 ( 0.0%) 'how': 4 ( 0.0%) 'have': 3 ( 0.0%) 'why': 3 ( 0.0%) 'they': 2 ( 0.0%) 'if': 2 ( 0.0%) 'when': 1 ( 0.0%) 'in the': 1 ( 0.0%) 'today': 1 ( 0.0%) 'were': 0 ( 0.0%) 'has': 0 ( 0.0%) 'of the': 0 ( 0.0%) 'to the': 0 ( 0.0%) 'and the': 0 ( 0.0%)
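To actually compare two datasets, compute the same signature on each and measure how close the two count distributions are. A hedged sketch using cosine similarity; signature_a and signature_b stand in for signatures produced by create_search_signature() on two different datasets:
# Compare two signature dicts (query -> count) with cosine similarity.
def signature_similarity(signature_a: Dict[str, int], signature_b: Dict[str, int]) -> float:
    keys = sorted(set(signature_a) | set(signature_b))
    a = np.array([signature_a.get(k, 0) for k in keys], dtype=float)
    b = np.array([signature_b.get(k, 0) for k in keys], dtype=float)
    if not a.any() or not b.any():
        return 0.0
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))

# Sanity check: a dataset compared against itself scores 1.0.
print(f"Self-similarity: {signature_similarity(signature, signature):.3f}")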
Summary and Best Practices¶
Let's wrap up with a summary of search functionality and best practices.
Search Functionality Summary¶
🔍 CORE SEARCH METHODS:
- dataset_manager.search.count(query) → Count occurrences of a token sequence
- dataset_manager.search.contains(query) → Check if a sequence exists (faster than count)
- dataset_manager.search.positions(query) → Get all positions where a sequence appears
- dataset_manager.search.count_next(query) → Count which tokens follow the sequence
- dataset_manager.search.batch_count_next(queries) → Batch version for multiple queries
🧠 ADVANCED FEATURES:
- dataset_manager.search.sample_smoothed(query, n, k, num_samples) → Generate continuations using Kneser-Ney smoothing
- Efficient indexing with vocabulary size optimization
- Memory-mapped index for large datasets
- Support for 2^16 and 2^32 vocabulary sizes
💡 PRACTICAL APPLICATIONS:
- Content analysis and categorization
- Dataset quality assessment
- Language pattern analysis
- Next token prediction and modeling
- Linguistic research and analysis
- Dataset comparison and signatures
⚡ PERFORMANCE TIPS:
- Use batch operations for multiple queries
- Check contains() before requesting positions for sequences that may be rare or absent
- Use appropriate vocabulary size (2^16 vs 2^32)
- Reuse indexes when possible
- Consider query length impact on performance
🛠️ BEST PRACTICES:
- Always validate token sequences before search
- Handle edge cases (empty results, encoding issues)
- Use meaningful variable names for token sequences
- Consider memory usage for large result sets
- Cache frequently used search results
- Combine search with other TokenSmith handlers for powerful workflows
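As a closing example, here is a small helper (a sketch, not part of the TokenSmith API) that folds several of these practices together: it validates the query, checks contains() before requesting positions, and caches results keyed by the token tuple.
# Hypothetical helper combining validation, a cheap existence check, and caching.
from functools import lru_cache

@lru_cache(maxsize=1024)
def cached_lookup(token_tuple: tuple) -> dict:
    tokens = list(token_tuple)
    if not tokens:
        return {"count": 0, "positions": []}
    if not dataset_manager.search.contains(tokens):  # cheap existence check first
        return {"count": 0, "positions": []}
    count = dataset_manager.search.count(tokens)
    # Only materialize positions for sequences rare enough to inspect by hand.
    positions = dataset_manager.search.positions(tokens) if count <= 1000 else []
    return {"count": count, "positions": positions}

print(cached_lookup(tuple(tokenizer.encode(" icy hill", add_special_tokens=False))))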
Next Steps¶
Congratulations! You've learned how to use TokenSmith's search functionality effectively. Here are some suggested next steps:
🎯 Immediate Actions:¶
- Experiment with your own dataset using the search methods
- Combine search with sampling and editing for powerful workflows
- Build custom analysis tools using the search results
- Optimize your search queries for better performance
📚 Additional Resources:¶
- TokenSmith Documentation - Complete API reference
- Basic Setup Tutorial - Getting started with TokenSmith
- Inspection Tutorial - Dataset examination techniques
- Sampling Tutorial - Flexible data sampling strategies
- Editing Tutorial - Dataset modification techniques
🚀 Advanced Projects:¶
- Language Model Analysis: Use search to analyze and compare different language models
- Content Classification: Build automated content classifiers using search patterns
- Dataset Curation: Use search to identify and filter high-quality content
- Linguistic Research: Investigate language patterns and evolution in large corpora
- Quality Control: Build automated quality assessment pipelines
🔬 Research Applications:¶
- Bias Detection: Search for potentially biased patterns in training data
- Memorization Studies: Identify memorized content in language models
- Distribution Analysis: Understand token and phrase distributions
- Cross-lingual Analysis: Compare patterns across different languages
- Temporal Analysis: Track language change over time in timestamped datasets
Happy searching! 🔍✨