Local Search System

The SignalWire Agents SDK includes a powerful local search system that provides DataSphere-compatible search functionality without external dependencies. This system uses advanced query preprocessing, local embeddings, and hybrid search techniques to enable agents to search through document collections offline.

Overview​

The local search system provides:

  • Offline Search: No external API calls or internet required
  • Hybrid Search: Combines vector similarity and keyword search
  • Document Processing: Supports multiple file formats (Markdown, PDF, DOCX, etc.)
  • Smart Chunking: Intelligent document segmentation with context preservation
  • Advanced Query Processing: NLP-enhanced query understanding
  • Flexible Deployment: Local embedded mode or remote server mode
  • SQLite Storage: Portable .swsearch index files

Architecture​

β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”    β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Documents │───▢│ Index Builder │───▢│ .swsearch DB β”‚
β”‚ (MD, PDF, etc.) β”‚ β”‚ β”‚ β”‚ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜
β”‚
β–Ό
β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β” β”Œβ”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”
β”‚ Agent │───▢│ Search Skill │───▢│ Search Engine β”‚
β”‚ β”‚ β”‚ β”‚ β”‚ β”‚
β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜ β””β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”€β”˜

Installation Options​

The search system uses optional dependencies to keep the base SDK lightweight. Choose the installation option that fits your needs:

Basic Search (~500MB)​

pip install signalwire-agents[search]

Includes:

  • Core search functionality
  • Sentence transformers for embeddings
  • SQLite FTS5 for keyword search
  • Basic document processing (text, markdown)

Full Document Processing (~600MB)​

pip install signalwire-agents[search-full]

Adds:

  • PDF processing (PyPDF2)
  • DOCX processing (python-docx)
  • HTML processing (BeautifulSoup4)
  • Additional file format support

Advanced NLP Features (~700MB)​

pip install signalwire-agents[search-nlp]

Adds:

  • spaCy for advanced text processing
  • NLTK for linguistic analysis
  • Enhanced query preprocessing
  • Language detection

⚠️ Additional Setup Required:

python -m spacy download en_core_web_sm

Performance Note: Advanced NLP features provide significantly better query understanding, synonym expansion, and search relevance, but are 2-3x slower than basic search. They are recommended only if you have sufficient CPU power and can tolerate longer response times.

NLP Backend Control: You can choose which NLP backend to use:

  • NLTK (default): Fast processing, good for most use cases
  • spaCy: Better quality but slower, requires model download

Configure via the nlp_backend parameter in your search skill.
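For example (this mirrors the skill configuration shown later in this guide):

self.add_skill("native_vector_search", {
    "tool_name": "search_docs",
    "index_file": "docs.swsearch",
    "nlp_backend": "spacy"  # or "nltk" (the default)
})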

All Search Features (~700MB)​

pip install signalwire-agents[search-all]

Includes everything above

⚠️ Additional Setup Required:

python -m spacy download en_core_web_sm

Performance Note: This includes advanced NLP features which improve search quality but increase response times.

Minimal Installation (Base SDK only)​

pip install signalwire-agents

Search functionality will show helpful error messages when dependencies are missing.

Quick Start​

1. Install Dependencies​

pip install signalwire-agents[search-full]

2. Build a Search Index​

# Build a search index from the comprehensive concepts guide
sw-search docs/signalwire_agents_concepts_guide.md --output concepts.swsearch

The same command accepts multiple files and directories, file-type filters, and exclude patterns; see Building Search Indexes below for the full set of examples.

3. Use in Your Agent​

from signalwire_agents import AgentBase

class MyAgent(AgentBase):
    def __init__(self):
        super().__init__()

        # Add search capability using the concepts guide
        self.add_skill("native_vector_search", {
            "tool_name": "search_docs",
            "description": "Search the comprehensive SDK concepts guide for information",
            "index_file": "concepts.swsearch",
            "count": 5
        })

agent = MyAgent()
agent.serve()

Ask your agent: "How do I create a new agent?" and it will search the comprehensive concepts guide to provide detailed answers.

Building Search Indexes​

Search indexes are SQLite databases with the .swsearch extension that contain processed documents, embeddings, and search metadata.
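Because a .swsearch file is an ordinary SQLite database, you can inspect one with standard tooling. A minimal sketch (the table layout is an internal detail and may vary between SDK versions):

import sqlite3

# List the tables inside an index file
conn = sqlite3.connect("concepts.swsearch")
for (name,) in conn.execute("SELECT name FROM sqlite_master WHERE type='table'"):
    print(name)
conn.close()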

Basic Index Building​

# Build index from the comprehensive concepts guide
sw-search docs/signalwire_agents_concepts_guide.md --output concepts.swsearch

# Build from multiple individual files
sw-search README.md docs/agent_guide.md docs/architecture.md --output knowledge.swsearch

# Build from mixed sources (files and directories)
sw-search docs/signalwire_agents_concepts_guide.md examples --file-types md,py --output comprehensive.swsearch

# Build from a directory (traditional approach)
sw-search docs --output docs.swsearch

# Include specific file types
sw-search docs --file-types md,txt,py

# Exclude patterns
sw-search docs --exclude "**/test/**,**/__pycache__/**"

Advanced Index Building​

# Full configuration example with multiple sources
sw-search docs/signalwire_agents_concepts_guide.md ./examples README.md \
    --output ./knowledge.swsearch \
    --chunking-strategy sentence \
    --max-sentences-per-chunk 8 \
    --file-types md,txt,rst,py \
    --exclude "**/test/**,**/__pycache__/**" \
    --model sentence-transformers/all-mpnet-base-v2 \
    --tags documentation,api \
    --verbose

Supported File Types​

Format            Extension  Requirements
----------------  ---------  ------------
Markdown          .md        Built-in
Text              .txt       Built-in
Python            .py        Built-in
reStructuredText  .rst       Built-in
PDF               .pdf       search-full
Word Documents    .docx      search-full
HTML              .html      search-full
JSON              .json      Built-in

Index Structure​

Each .swsearch file contains:

  • Document chunks with embeddings and metadata
  • Full-text search index (SQLite FTS5)
  • Configuration and model information
  • Synonym cache for query expansion
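The keyword half of hybrid search is backed by SQLite's FTS5 extension. Here is a self-contained illustration of that mechanism using a toy schema (not the SDK's actual one):

import sqlite3

# FTS5 virtual table: a tokenized full-text index over the content column
conn = sqlite3.connect(":memory:")
conn.execute("CREATE VIRTUAL TABLE chunks USING fts5(content)")
conn.execute("INSERT INTO chunks VALUES ('How to create a new agent with the SDK')")

# MATCH runs a keyword query; rank orders results by BM25 relevance
rows = conn.execute(
    "SELECT content FROM chunks WHERE chunks MATCH 'agent' ORDER BY rank"
).fetchall()
print(rows)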

Using the Search Skill​

The native_vector_search skill provides search functionality to your agents.

Basic Configuration​

self.add_skill("native_vector_search", {
    "tool_name": "search_knowledge",
    "description": "Search the knowledge base",
    "index_file": "knowledge.swsearch"
})

Advanced Configuration​

NLP Backend Selection​

Choose between NLTK (fast) and spaCy (better quality) for query processing:

# Fast NLTK processing (default)
self.add_skill("native_vector_search", {
    "tool_name": "search_docs",
    "index_file": "docs.swsearch",
    "nlp_backend": "nltk"  # Fast, good for most use cases
})

# Better quality spaCy processing
self.add_skill("native_vector_search", {
    "tool_name": "search_docs",
    "index_file": "docs.swsearch",
    "nlp_backend": "spacy"  # Slower but better quality, requires model download
})

Performance Comparison:

  • NLTK: ~50-100ms query processing, good synonym expansion
  • spaCy: ~150-300ms query processing, better POS tagging and entity recognition

Custom Embedding Models​

# Use a different embedding model
self.add_skill("native_vector_search", {
    "tool_name": "search_docs",
    "index_file": "docs.swsearch",
    "model": "sentence-transformers/all-MiniLM-L6-v2"  # Smaller, faster model
})

Query Enhancement​

The system automatically enhances queries using:

  • Language detection
  • POS tagging (with NLP dependencies)
  • Synonym expansion using WordNet
  • Keyword extraction
  • Vector embeddings
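To illustrate the synonym-expansion step, here is a standalone sketch using NLTK's WordNet interface directly; it demonstrates the general technique rather than the SDK's internal implementation:

import nltk
from nltk.corpus import wordnet

nltk.download("wordnet", quiet=True)  # one-time corpus download

def expand_query(query: str) -> set:
    """Add WordNet lemma names of each term as extra search terms."""
    terms = set(query.lower().split())
    for term in list(terms):
        for synset in wordnet.synsets(term):
            for lemma in synset.lemmas():
                terms.add(lemma.name().replace("_", " ").lower())
    return terms

print(expand_query("create agent"))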

Response Customization​

self.add_skill("native_vector_search", {
    "tool_name": "search_docs",
    "index_file": "docs.swsearch",

    # Customize responses for voice calls
    "response_prefix": "Based on the documentation, here's what I found:",
    "response_postfix": "Would you like me to search for more specific information?",

    # Custom no-results message
    "no_results_message": "I couldn't find information about '{query}'. Try rephrasing your question.",

    # SWAIG function fillers for natural conversation
    "swaig_fields": {
        "fillers": {
            "en-US": [
                "Let me search the documentation",
                "Checking our knowledge base",
                "Looking that up for you"
            ]
        }
    }
})

Tag-Based Filtering​

# Only search documents tagged with specific categories
self.add_skill("native_vector_search", {
    "tool_name": "search_api_docs",
    "index_file": "docs.swsearch",
    "tags": ["api", "reference"],  # Only search API docs
    "description": "Search API reference documentation"
})

Complete Configuration Example​

self.add_skill("native_vector_search", {
    # Tool configuration
    "tool_name": "search_docs",
    "description": "Search SDK documentation for detailed information",

    # Index configuration
    "index_file": "docs.swsearch",
    "build_index": True,      # Auto-build if missing
    "source_dir": "./docs",   # Source for auto-build
    "file_types": ["md", "txt"],

    # Search parameters
    "count": 5,                 # Number of results
    "distance_threshold": 0.1,  # Similarity threshold
    "tags": ["documentation"],  # Filter by tags

    # NLP backend selection
    "nlp_backend": "nltk",  # or "spacy" for better quality

    # Response formatting
    "response_prefix": "Based on the documentation:",
    "response_postfix": "Would you like more details?",
    "no_results_message": "No information found for '{query}'",

    # SWAIG configuration
    "swaig_fields": {
        "fillers": {
            "en-US": ["Let me search for that", "Checking the docs"]
        }
    }
})

Multiple Search Instances​

You can add multiple search instances for different document collections:

# Documentation search with spaCy for better quality
self.add_skill("native_vector_search", {
    "tool_name": "search_docs",
    "index_file": "docs.swsearch",
    "nlp_backend": "spacy",
    "description": "Search SDK documentation"
})

# Code examples search with NLTK for speed
self.add_skill("native_vector_search", {
    "tool_name": "search_examples",
    "index_file": "examples.swsearch",
    "nlp_backend": "nltk",
    "description": "Search code examples"
})

Local vs Remote Modes​

The search skill supports both local and remote operation modes.

Local Mode (Default)​

Pros:

  • Faster (no network latency)
  • Works offline
  • Simple deployment
  • Lower operational complexity

Cons:

  • Higher memory usage per agent
  • Index files must be distributed with each agent
  • Updates require redeploying agents

Configuration:

self.add_skill("native_vector_search", {
    "tool_name": "search_docs",
    "index_file": "docs.swsearch",  # Local file
    "nlp_backend": "nltk"           # Choose NLP backend
})

Remote Mode​

Pros:

  • Lower memory usage per agent
  • Centralized index management
  • Easy updates without redeploying agents
  • Better scalability for multiple agents
  • Shared resources

Cons:

  • Network dependency
  • Additional infrastructure complexity
  • Potential latency

Configuration:

self.add_skill("native_vector_search", {
    "tool_name": "search_docs",
    "remote_url": "http://localhost:8001",  # Search server
    "index_name": "docs",                   # Index name on server
    "nlp_backend": "nltk"                   # NLP backend for query preprocessing
})

Running a Remote Search Server​

  1. Start the search server:

python examples/search_server_standalone.py

  2. The server provides an HTTP API:

  • POST /search - Search the indexes
  • GET /health - Health check and available indexes
  • POST /reload_index - Add or reload an index

  3. Test the API:

curl -X POST "http://localhost:8001/search" \
  -H "Content-Type: application/json" \
  -d '{"query": "how to create an agent", "index_name": "docs", "count": 3}'

Automatic Mode Detection​

The skill automatically detects which mode to use:

  • If remote_url is provided β†’ Remote mode
  • If index_file is provided β†’ Local mode
  • Remote mode takes priority if both are specified
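For example, this configuration runs in remote mode because remote_url takes priority (a sketch of the precedence rule above):

self.add_skill("native_vector_search", {
    "tool_name": "search_docs",
    "remote_url": "http://localhost:8001",  # remote mode wins
    "index_name": "docs",
    "index_file": "docs.swsearch"  # ignored while remote_url is set
})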

CLI Reference​

sw-search Command​

sw-search <source> [source2 ...] [options]

Arguments:

  • source - File or directory containing documents to index; multiple files and directories may be given (as in the examples above)

Options:

  • --output FILE - Output .swsearch file (default: <source>.swsearch)
  • --chunk-size SIZE - Chunk size in characters (default: 500)
  • --chunk-overlap SIZE - Overlap between chunks (default: 50)
  • --chunking-strategy STRATEGY - Chunking strategy, e.g. sentence (used in the examples below)
  • --max-sentences-per-chunk N - Sentences per chunk when using the sentence strategy
  • --file-types TYPES - Comma-separated file extensions (default: md,txt,rst)
  • --exclude PATTERNS - Comma-separated glob patterns to exclude
  • --languages LANGS - Comma-separated language codes (default: en)
  • --model MODEL - Embedding model name (default: sentence-transformers/all-mpnet-base-v2)
  • --tags TAGS - Comma-separated tags to add to all chunks
  • --verbose - Show detailed progress information
  • --validate - Validate the created index after building

Subcommands:

validate - Validate Search Index​

sw-search validate <index_file> [--verbose]

Validates an existing .swsearch index file and shows statistics.

search - Search Within Index​

sw-search search <index_file> <query> [options]

Search within an existing .swsearch index file. This is useful for:

  • Testing search quality and relevance
  • Exploring index contents
  • Debugging search results
  • Scripting and automation
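With the --json flag, results are easy to consume from scripts. A sketch (assumes sw-search is on your PATH; the JSON shape follows the jq example under Examples below):

import json
import subprocess

# Run a search and parse the JSON output
proc = subprocess.run(
    ["sw-search", "search", "concepts.swsearch", "error handling", "--json"],
    capture_output=True, text=True, check=True,
)
results = json.loads(proc.stdout)["results"]
print(results[0]["content"])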

Search Options:

  • --count COUNT - Number of results to return (default: 5)
  • --distance-threshold FLOAT - Minimum similarity score (default: 0.0)
  • --tags TAGS - Comma-separated tags to filter by
  • --nlp-backend {nltk,spacy} - NLP backend to use (default: nltk)
  • --verbose - Show detailed information including index stats
  • --json - Output results as JSON for scripting
  • --no-content - Hide content in results (show only metadata)

Examples:

# Build from the comprehensive concepts guide
sw-search docs/signalwire_agents_concepts_guide.md --output concepts.swsearch

# Build from multiple sources (files and directories)
sw-search docs/signalwire_agents_concepts_guide.md examples README.md \
    --output comprehensive.swsearch \
    --file-types md,py,txt \
    --verbose

# Traditional directory-based approach
sw-search ./documentation \
    --output knowledge.swsearch \
    --chunking-strategy sentence \
    --max-sentences-per-chunk 8 \
    --file-types md,rst,txt \
    --exclude "**/drafts/**" \
    --tags documentation,help \
    --verbose

# Validate an existing index
sw-search validate concepts.swsearch --verbose

# Search within an index
sw-search search concepts.swsearch "how to create an agent"
sw-search search concepts.swsearch "API reference" --count 3 --verbose
sw-search search concepts.swsearch "configuration" --tags documentation --json

# Use different NLP backends
sw-search search concepts.swsearch "deployment options" --nlp-backend nltk # Fast
sw-search search concepts.swsearch "deployment options" --nlp-backend spacy # Better quality

# Advanced search with filtering
sw-search search concepts.swsearch "deployment options" \
--count 10 \
--distance-threshold 0.1 \
--tags "deployment,production" \
--nlp-backend spacy \
--verbose

# JSON output for scripting
sw-search search concepts.swsearch "error handling" --json | jq '.results[0].content'

# Build multiple specialized indexes
sw-search docs/signalwire_agents_concepts_guide.md --output concepts.swsearch
sw-search examples --output examples.swsearch --file-types py,md

Index Validation​

# Validate an existing index
python -c "
from signalwire_agents.search import SearchEngine
engine = SearchEngine('docs.swsearch')
print(f'Index stats: {engine.get_stats()}')
"

API Reference​

SearchEngine Class​

from signalwire_agents.search import SearchEngine

# Load an index
engine = SearchEngine("docs.swsearch")

# Perform search
results = engine.search(
    query_vector=[...],            # Optional: pre-computed query vector
    enhanced_text="search query",  # Enhanced query text
    count=5,                       # Number of results
    distance_threshold=0.0,        # Minimum similarity score
    tags=["documentation"]         # Filter by tags
)

# Get index statistics
stats = engine.get_stats()
print(f"Total chunks: {stats['total_chunks']}")
print(f"Total files: {stats['total_files']}")

IndexBuilder Class​

from signalwire_agents.search import IndexBuilder

# Create index builder
builder = IndexBuilder(
    model_name="sentence-transformers/all-mpnet-base-v2",
    chunk_size=500,
    chunk_overlap=50,
    verbose=True
)

# Build index
builder.build_index(
    source_dir="./docs",
    output_file="docs.swsearch",
    file_types=["md", "txt"],
    exclude_patterns=["**/test/**"],
    tags=["documentation"]
)
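Putting the two classes together, a minimal build-then-search sketch (the exact shape of the returned result objects may vary between SDK versions):

from signalwire_agents.search import IndexBuilder, SearchEngine

# Build a small index from local markdown docs
builder = IndexBuilder(
    model_name="sentence-transformers/all-mpnet-base-v2",
    verbose=True
)
builder.build_index(
    source_dir="./docs",
    output_file="docs.swsearch",
    file_types=["md"]
)

# Query it
engine = SearchEngine("docs.swsearch")
results = engine.search(enhanced_text="how to create an agent", count=3)
for result in results:
    print(result)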