Skip to main content

Chapter 5: Vector Search and Similarity Operations

Overview

Vector search is at the heart of modern AI and machine learning applications, enabling semantic search, recommendation systems, and similarity-based analytics. Vektagraf provides a comprehensive vector search system with multiple algorithms, distance metrics, and optimization strategies. This chapter covers everything from basic vector concepts to advanced indexing algorithms like HNSW and IVFFlat, with practical examples for building production-ready vector search applications.

Learning Objectives

By the end of this chapter, you will be able to:

  • Understand vector embeddings and their role in semantic search
  • Implement vector similarity search using multiple distance metrics
  • Choose and configure appropriate indexing algorithms (HNSW, IVFFlat, Memory)
  • Optimize vector search performance for different use cases
  • Build hybrid search systems combining vector similarity with metadata filtering
  • Generate and manage embeddings for various data types
  • Monitor and tune vector search performance

Prerequisites

  • Completed Chapter 4: Database Operations and Transactions
  • Basic understanding of linear algebra and vector mathematics
  • Familiarity with machine learning concepts (embeddings, similarity)
  • Understanding of indexing and search algorithms

Core Concepts

Vector Embeddings and Semantic Search

Vector embeddings are numerical representations of data that capture semantic meaning in high-dimensional space. Objects with similar meanings are positioned close to each other in this space, enabling semantic search capabilities.

// Example: Text embeddings for semantic search
class Document {
  final String title;
  final String content;
  final List<double> titleEmbedding;    // 384-dimensional vector
  final List<double> contentEmbedding;  // 384-dimensional vector
  final Map<String, dynamic> metadata;
  
  Document({
    required this.title,
    required this.content,
    required this.titleEmbedding,
    required this.contentEmbedding,
    required this.metadata,
  });
}

// Example: Product embeddings for recommendations
class Product {
  final String name;
  final String description;
  final List<String> categories;
  final List<double> featureEmbedding;  // 512-dimensional vector
  final double price;
  
  Product({
    required this.name,
    required this.description,
    required this.categories,
    required this.featureEmbedding,
    required this.price,
  });
}

Distance Metrics and Similarity

Vektagraf supports multiple distance metrics, each optimized for different use cases:

Cosine Distance

  • Best for: Normalized vectors, text embeddings, recommendation systems
  • Range: 0 to 2 (0 = identical, 2 = opposite)
  • Formula: 1 - (A·B)/(|A||B|)

Euclidean Distance (L2)

  • Best for: Spatial data, image embeddings, continuous features
  • Range: 0 to ∞ (0 = identical)
  • Formula: √(Σ(Ai - Bi)²)

Inner Product Distance

  • Best for: Pre-normalized vectors, when magnitude matters
  • Range: -∞ to ∞ (higher = more similar)
  • Formula: -(A·B)

Vector Space Algorithms

Vektagraf provides three vector space implementations:

  1. Memory Vector Space: Linear search, exact results, best for small datasets
  2. HNSW (Hierarchical Navigable Small World): Approximate search, excellent for high-dimensional data
  3. IVFFlat (Inverted File Flat): Clustering-based search, good for large datasets with lower dimensions

Practical Examples

Basic Vector Operations

Setting Up Vector Spaces

import 'package:vektagraf/vektagraf.dart';

Future<void> setupVectorSpaces() async {
  final database = VektagrafDatabaseImpl();
  await database.open('vector_search.db');
  
  try {
    // Create different vector spaces for different use cases
    
    // 1. Memory vector space for small datasets (exact search)
    final documentVectors = database.vectorSpace('documents', 384);
    
    // 2. HNSW vector space for high-dimensional embeddings
    final productVectors = database.vectorSpace('products', 512);
    
    // 3. IVFFlat vector space for large datasets
    final userVectors = database.vectorSpace('users', 256);
    
    print('Vector spaces created successfully');
    
  } finally {
    await database.close();
  }
}

Adding Vectors with Metadata

Future<void> addVectorsExample() async {
  final database = VektagrafDatabaseImpl();
  await database.open('vector_search.db');
  
  try {
    final documentVectors = database.vectorSpace('documents', 384);
    
    // Add document vectors with rich metadata
    final documents = [
      {
        'title': 'Introduction to Machine Learning',
        'content': 'Machine learning is a subset of artificial intelligence...',
        'category': 'Technology',
        'author': 'Dr. Smith',
        'published': '2024-01-15',
        'tags': ['ML', 'AI', 'Technology'],
        'embedding': generateTextEmbedding('Introduction to Machine Learning...'),
      },
      {
        'title': 'Cooking with Python',
        'content': 'Learn to automate your kitchen with Python scripts...',
        'category': 'Programming',
        'author': 'Chef Code',
        'published': '2024-02-01',
        'tags': ['Python', 'Automation', 'Cooking'],
        'embedding': generateTextEmbedding('Cooking with Python...'),
      },
      {
        'title': 'Deep Learning Fundamentals',
        'content': 'Deep learning uses neural networks with multiple layers...',
        'category': 'Technology',
        'author': 'Dr. Neural',
        'published': '2024-01-20',
        'tags': ['Deep Learning', 'Neural Networks', 'AI'],
        'embedding': generateTextEmbedding('Deep Learning Fundamentals...'),
      },
    ];
    
    // Add vectors to the space
    for (final doc in documents) {
      final vectorId = await documentVectors.addVector(
        doc['embedding'] as List<double>,
        metadata: {
          'title': doc['title'],
          'category': doc['category'],
          'author': doc['author'],
          'published': DateTime.parse(doc['published'] as String),
          'tags': doc['tags'],
          'content_preview': (doc['content'] as String).substring(0, 100),
        },
      );
      
      print('Added document: ${doc['title']} with ID: $vectorId');
    }
    
    print('Total vectors: ${await documentVectors.vectorCount}');
    
  } finally {
    await database.close();
  }
}

// Simulated embedding generation (in practice, use a real embedding model)
List<double> generateTextEmbedding(String text) {
  // This is a placeholder - in real applications, you would use:
  // - OpenAI embeddings API
  // - Sentence transformers
  // - Custom trained models
  // - Pre-computed embeddings
  
  final random = Random(text.hashCode); // Deterministic for demo
  return List.generate(384, (_) => random.nextDouble() * 2 - 1);
}

Similarity Search Operations

Basic Similarity Search

Future<void> basicSimilaritySearch() async {
  final database = VektagrafDatabaseImpl();
  await database.open('vector_search.db');
  
  try {
    final documentVectors = database.vectorSpace('documents', 384);
    
    // Generate query embedding for search
    final queryText = 'artificial intelligence and neural networks';
    final queryEmbedding = generateTextEmbedding(queryText);
    
    // Perform similarity search with different metrics
    print('=== Cosine Similarity Search ===');
    final cosineResults = await documentVectors.similaritySearch(
      queryEmbedding,
      10,
      metric: DistanceMetric.cosine,
    );
    
    for (final result in cosineResults) {
      print('Title: ${result.metadata['title']}');
      print('Score: ${result.score.toStringAsFixed(4)}');
      print('Distance: ${result.distance.toStringAsFixed(4)}');
      print('Category: ${result.metadata['category']}');
      print('---');
    }
    
    print('\n=== Euclidean Distance Search ===');
    final euclideanResults = await documentVectors.similaritySearch(
      queryEmbedding,
      5,
      metric: DistanceMetric.euclidean,
    );
    
    for (final result in euclideanResults) {
      print('${result.metadata['title']}: ${result.score.toStringAsFixed(4)}');
    }
    
  } finally {
    await database.close();
  }
}

Hybrid Search with Metadata Filtering

Future<void> hybridSearchExample() async {
  final database = VektagrafDatabaseImpl();
  await database.open('vector_search.db');
  
  try {
    final documentVectors = database.vectorSpace('documents', 384);
    
    final queryEmbedding = generateTextEmbedding('machine learning algorithms');
    
    // Hybrid search: vector similarity + metadata filtering
    final hybridResults = await documentVectors.hybridSearch(
      queryEmbedding,
      (metadata) {
        // Filter by category and publication date
        final category = metadata['category'] as String?;
        final published = metadata['published'] as DateTime?;
        
        return category == 'Technology' && 
               published != null && 
               published.isAfter(DateTime(2024, 1, 1));
      },
      5,
      metric: DistanceMetric.cosine,
    );
    
    print('=== Hybrid Search Results ===');
    for (final result in hybridResults) {
      print('Title: ${result.metadata['title']}');
      print('Author: ${result.metadata['author']}');
      print('Published: ${result.metadata['published']}');
      print('Similarity Score: ${result.score.toStringAsFixed(4)}');
      print('Tags: ${result.metadata['tags']}');
      print('---');
    }
    
    // Advanced filtering with multiple conditions
    final advancedResults = await documentVectors.hybridSearch(
      queryEmbedding,
      (metadata) {
        final tags = metadata['tags'] as List<dynamic>? ?? [];
        final author = metadata['author'] as String? ?? '';
        
        // Find documents with AI/ML tags by specific authors
        return tags.any((tag) => ['AI', 'ML', 'Deep Learning'].contains(tag)) &&
               ['Dr. Smith', 'Dr. Neural'].contains(author);
      },
      10,
    );
    
    print('\n=== Advanced Filtered Results ===');
    for (final result in advancedResults) {
      print('${result.metadata['title']} by ${result.metadata['author']}');
      print('Relevance: ${result.score.toStringAsFixed(4)}');
    }
    
  } finally {
    await database.close();
  }
}

Advanced Vector Space Configuration

HNSW Configuration for High-Performance Search

Future<void> hnswConfigurationExample() async {
  final database = VektagrafDatabaseImpl();
  await database.open('high_performance_vectors.db');
  
  try {
    // Create HNSW vector space with custom parameters
    final hnswSpace = HnswVectorSpace(
      'high_dim_embeddings',
      1536, // OpenAI ada-002 embedding dimension
      maxConnections: 32,        // M parameter - higher = better recall, more memory
      efConstruction: 400,       // Higher = better index quality, slower build
      maxLayers: 16,            // Maximum layers in the hierarchy
      efSearch: 100,            // Higher = better recall, slower search
      defaultMetric: DistanceMetric.cosine,
    );
    
    // Add high-dimensional vectors (e.g., from OpenAI embeddings)
    final embeddings = await generateHighDimEmbeddings([
      'The future of artificial intelligence in healthcare',
      'Quantum computing breakthrough in cryptography',
      'Sustainable energy solutions for urban environments',
      'Advanced robotics in manufacturing automation',
      'Blockchain applications in supply chain management',
    ]);
    
    for (int i = 0; i < embeddings.length; i++) {
      await hnswSpace.addVector(
        embeddings[i],
        metadata: {
          'text': [
            'The future of artificial intelligence in healthcare',
            'Quantum computing breakthrough in cryptography',
            'Sustainable energy solutions for urban environments',
            'Advanced robotics in manufacturing automation',
            'Blockchain applications in supply chain management',
          ][i],
          'domain': ['Healthcare', 'Technology', 'Energy', 'Manufacturing', 'Finance'][i],
          'complexity': ['High', 'Very High', 'Medium', 'High', 'Medium'][i],
        },
      );
    }
    
    // Adjust search parameters for different use cases
    
    // High precision search (slower but more accurate)
    hnswSpace.setEfSearch(200);
    final precisionResults = await hnswSpace.similaritySearch(
      generateTextEmbedding('AI in medical diagnosis'),
      5,
    );
    
    print('=== High Precision Results ===');
    for (final result in precisionResults) {
      print('${result.metadata['text']}: ${result.score.toStringAsFixed(4)}');
    }
    
    // Fast search (faster but potentially less accurate)
    hnswSpace.setEfSearch(50);
    final fastResults = await hnswSpace.similaritySearch(
      generateTextEmbedding('AI in medical diagnosis'),
      5,
    );
    
    print('\n=== Fast Search Results ===');
    for (final result in fastResults) {
      print('${result.metadata['text']}: ${result.score.toStringAsFixed(4)}');
    }
    
    await hnswSpace.close();
    
  } finally {
    await database.close();
  }
}

Future<List<List<double>>> generateHighDimEmbeddings(List<String> texts) async {
  // In practice, you would use a real embedding service:
  // - OpenAI Embeddings API
  // - Hugging Face Transformers
  // - Google Universal Sentence Encoder
  // - Custom trained models
  
  return texts.map((text) => 
    List.generate(1536, (i) => Random(text.hashCode + i).nextDouble() * 2 - 1)
  ).toList();
}

IVFFlat Configuration for Large Datasets

Future<void> ivfFlatConfigurationExample() async {
  final database = VektagrafDatabaseImpl();
  await database.open('large_dataset_vectors.db');
  
  try {
    // Create IVFFlat vector space optimized for large datasets
    final ivfSpace = IvfFlatVectorSpace(
      'product_embeddings',
      512,
      numClusters: 1000,        // More clusters = better precision, more memory
      nProbe: 50,               // Search more clusters = better recall, slower
      defaultMetric: DistanceMetric.cosine,
      minTrainingVectors: 5000, // Minimum vectors needed before training
    );
    
    // Simulate adding a large number of product vectors
    print('Adding product vectors...');
    final productCategories = [
      'Electronics', 'Clothing', 'Books', 'Home & Garden', 'Sports',
      'Automotive', 'Health', 'Beauty', 'Toys', 'Food'
    ];
    
    for (int i = 0; i < 10000; i++) {
      final category = productCategories[i % productCategories.length];
      final productVector = generateProductEmbedding(category, i);
      
      await ivfSpace.addVector(
        productVector,
        metadata: {
          'product_id': 'PROD_${i.toString().padLeft(6, '0')}',
          'category': category,
          'price': (Random(i).nextDouble() * 1000).round() / 100,
          'rating': (Random(i + 1000).nextDouble() * 4 + 1).round() / 1,
          'in_stock': Random(i + 2000).nextBool(),
        },
      );
      
      if ((i + 1) % 1000 == 0) {
        print('Added ${i + 1} products...');
      }
    }
    
    print('Index training status: ${ivfSpace.isTrained}');
    print('Total vectors: ${await ivfSpace.vectorCount}');
    
    // Perform searches with different nProbe settings
    final queryVector = generateProductEmbedding('Electronics', 99999);
    
    // Fast search (low recall, high speed)
    ivfSpace.setNProbe(10);
    final startTime = DateTime.now();
    final fastResults = await ivfSpace.similaritySearch(queryVector, 20);
    final fastDuration = DateTime.now().difference(startTime);
    
    print('\n=== Fast Search (nProbe=10) ===');
    print('Duration: ${fastDuration.inMilliseconds}ms');
    print('Results found: ${fastResults.length}');
    
    // Accurate search (high recall, lower speed)
    ivfSpace.setNProbe(100);
    final accurateStart = DateTime.now();
    final accurateResults = await ivfSpace.similaritySearch(queryVector, 20);
    final accurateDuration = DateTime.now().difference(accurateStart);
    
    print('\n=== Accurate Search (nProbe=100) ===');
    print('Duration: ${accurateDuration.inMilliseconds}ms');
    print('Results found: ${accurateResults.length}');
    
    // Show top results
    for (int i = 0; i < 5 && i < accurateResults.length; i++) {
      final result = accurateResults[i];
      print('${result.metadata['product_id']}: '
            '${result.metadata['category']} - '
            'Score: ${result.score.toStringAsFixed(4)}');
    }
    
    await ivfSpace.close();
    
  } finally {
    await database.close();
  }
}

List<double> generateProductEmbedding(String category, int seed) {
  final random = Random(category.hashCode + seed);
  
  // Generate category-specific embeddings with some clustering
  final baseVector = List.generate(512, (_) => random.nextGaussian());
  
  // Add category-specific bias to create realistic clustering
  final categoryBias = category.hashCode % 100 / 100.0;
  for (int i = 0; i < 50; i++) {
    baseVector[i] += categoryBias;
  }
  
  return baseVector;
}

extension RandomGaussian on Random {
  double nextGaussian() {
    // Box-Muller transform for Gaussian distribution
    final u1 = nextDouble();
    final u2 = nextDouble();
    return sqrt(-2 * log(u1)) * cos(2 * pi * u2);
  }
}

Integration with VektagrafList

Vector-Enhanced Object Queries

Future<void> vectorEnhancedQueries() async {
  final database = VektagrafDatabaseImpl();
  await database.open('enhanced_search.db');
  
  try {
    // Get objects with vector search capabilities
    final articles = await database.objects<Article>();
    
    // Add articles with embeddings
    final sampleArticles = [
      Article(
        title: 'Machine Learning in Healthcare',
        content: 'AI is revolutionizing medical diagnosis...',
        category: 'Technology',
        tags: ['AI', 'Healthcare', 'ML'],
        titleEmbedding: generateTextEmbedding('Machine Learning in Healthcare'),
        contentEmbedding: generateTextEmbedding('AI is revolutionizing medical diagnosis...'),
      ),
      Article(
        title: 'Sustainable Agriculture Practices',
        content: 'Modern farming techniques for environmental sustainability...',
        category: 'Environment',
        tags: ['Agriculture', 'Sustainability', 'Environment'],
        titleEmbedding: generateTextEmbedding('Sustainable Agriculture Practices'),
        contentEmbedding: generateTextEmbedding('Modern farming techniques...'),
      ),
      // Add more articles...
    ];
    
    for (final article in sampleArticles) {
      await articles.save(article);
    }
    
    // Vector similarity search through VektagrafList
    final queryEmbedding = generateTextEmbedding('artificial intelligence medicine');
    
    // Search by title similarity
    final titleSimilar = await articles.whereSimilar(
      'titleEmbedding',
      queryEmbedding,
      limit: 5,
    );
    
    print('=== Title Similarity Results ===');
    for (final article in titleSimilar) {
      print('${article.title} (${article.category})');
    }
    
    // Search by content similarity
    final contentSimilar = await articles.whereSimilar(
      'contentEmbedding',
      queryEmbedding,
      limit: 5,
    );
    
    print('\n=== Content Similarity Results ===');
    for (final article in contentSimilar) {
      print('${article.title}: ${article.content.substring(0, 50)}...');
    }
    
    // Combined vector and traditional filtering
    final techArticles = await articles
        .whereProperty('category', 'Technology')
        .whereSimilar('titleEmbedding', queryEmbedding, limit: 10);
    
    print('\n=== Technology Articles Similar to Query ===');
    for (final article in techArticles) {
      print('${article.title} - Tags: ${article.tags.join(', ')}');
    }
    
    // Find similar articles to a reference article
    final referenceArticle = articles.first;
    final similarToReference = await articles.whereSimilarTo(
      referenceArticle,
      'contentEmbedding',
      limit: 3,
    );
    
    print('\n=== Articles Similar to "${referenceArticle.title}" ===');
    for (final article in similarToReference) {
      if (article != referenceArticle) {
        print('${article.title} (${article.category})');
      }
    }
    
  } finally {
    await database.close();
  }
}

class Article {
  final String title;
  final String content;
  final String category;
  final List<String> tags;
  final List<double> titleEmbedding;
  final List<double> contentEmbedding;
  
  Article({
    required this.title,
    required this.content,
    required this.category,
    required this.tags,
    required this.titleEmbedding,
    required this.contentEmbedding,
  });
}

Best Practices

Choosing the Right Algorithm

Decision Matrix

class VectorSpaceSelector {
  static String selectAlgorithm({
    required int vectorCount,
    required int dimensions,
    required double queryLatencyRequirement, // milliseconds
    required double memoryBudget, // MB
    required double accuracyRequirement, // 0.0 to 1.0
  }) {
    // Memory Vector Space
    if (vectorCount < 10000 && accuracyRequirement > 0.99) {
      return 'memory';
    }
    
    // HNSW Vector Space
    if (dimensions > 300 && 
        queryLatencyRequirement < 100 && 
        accuracyRequirement > 0.90) {
      return 'hnsw';
    }
    
    // IVFFlat Vector Space
    if (vectorCount > 100000 && 
        memoryBudget < 1000 && 
        accuracyRequirement > 0.80) {
      return 'ivfflat';
    }
    
    // Default to HNSW for most cases
    return 'hnsw';
  }
  
  static Map<String, dynamic> getOptimalParameters({
    required String algorithm,
    required int vectorCount,
    required int dimensions,
    required double accuracyRequirement,
  }) {
    switch (algorithm) {
      case 'hnsw':
        return {
          'maxConnections': accuracyRequirement > 0.95 ? 32 : 16,
          'efConstruction': (vectorCount * 0.02).clamp(100, 800).round(),
          'efSearch': accuracyRequirement > 0.95 ? 100 : 50,
          'maxLayers': 16,
        };
        
      case 'ivfflat':
        final numClusters = (sqrt(vectorCount) * 4).clamp(100, 10000).round();
        return {
          'nLists': numClusters,
          'nProbe': (numClusters * 0.1).clamp(10, 100).round(),
          'minTrainingVectors': (numClusters * 10).clamp(1000, 50000),
        };
        
      default:
        return {};
    }
  }
}

Embedding Generation Strategies

Text Embeddings

class TextEmbeddingGenerator {
  // Simulated embedding service - replace with real implementation
  static Future<List<double>> generateEmbedding(String text) async {
    // In production, use services like:
    // - OpenAI Embeddings API
    // - Sentence Transformers
    // - Google Universal Sentence Encoder
    // - Cohere Embed API
    
    return _simulateEmbedding(text);
  }
  
  static Future<List<double>> generateMultimodalEmbedding({
    String? text,
    String? imageUrl,
    Map<String, dynamic>? metadata,
  }) async {
    // Combine different modalities into a single embedding
    final textEmb = text != null ? await generateEmbedding(text) : <double>[];
    final imageEmb = imageUrl != null ? await _generateImageEmbedding(imageUrl) : <double>[];
    final metaEmb = metadata != null ? _generateMetadataEmbedding(metadata) : <double>[];
    
    // Concatenate or combine embeddings
    return [...textEmb, ...imageEmb, ...metaEmb];
  }
  
  static List<double> _simulateEmbedding(String text) {
    // Deterministic embedding based on text content
    final words = text.toLowerCase().split(RegExp(r'\W+'));
    final random = Random(words.join().hashCode);
    
    return List.generate(384, (_) => random.nextGaussian() * 0.5);
  }
  
  static Future<List<double>> _generateImageEmbedding(String imageUrl) async {
    // Simulate image embedding generation
    final random = Random(imageUrl.hashCode);
    return List.generate(512, (_) => random.nextGaussian() * 0.3);
  }
  
  static List<double> _generateMetadataEmbedding(Map<String, dynamic> metadata) {
    // Convert metadata to embedding
    final metaString = metadata.entries
        .map((e) => '${e.key}:${e.value}')
        .join(' ');
    final random = Random(metaString.hashCode);
    return List.generate(128, (_) => random.nextGaussian() * 0.2);
  }
}

Embedding Normalization and Preprocessing

class EmbeddingProcessor {
  /// Normalizes embeddings to unit length for cosine similarity
  static List<double> normalize(List<double> embedding) {
    final magnitude = sqrt(embedding.map((x) => x * x).reduce((a, b) => a + b));
    if (magnitude == 0) return embedding;
    
    return embedding.map((x) => x / magnitude).toList();
  }
  
  /// Reduces embedding dimensionality using PCA-like approach
  static List<double> reduceDimensions(
    List<double> embedding,
    int targetDimensions,
  ) {
    if (embedding.length <= targetDimensions) return embedding;
    
    // Simple dimension reduction (in practice, use proper PCA)
    final step = embedding.length / targetDimensions;
    final reduced = <double>[];
    
    for (int i = 0; i < targetDimensions; i++) {
      final index = (i * step).floor();
      reduced.add(embedding[index]);
    }
    
    return reduced;
  }
  
  /// Quantizes embeddings to reduce memory usage
  static List<int> quantizeEmbedding(
    List<double> embedding, {
    int bits = 8,
  }) {
    final maxValue = (1 << bits) - 1;
    final minVal = embedding.reduce(min);
    final maxVal = embedding.reduce(max);
    final range = maxVal - minVal;
    
    if (range == 0) return List.filled(embedding.length, 0);
    
    return embedding.map((x) => 
      ((x - minVal) / range * maxValue).round().clamp(0, maxValue)
    ).toList();
  }
  
  /// Dequantizes embeddings back to floating point
  static List<double> dequantizeEmbedding(
    List<int> quantized,
    double minVal,
    double maxVal, {
    int bits = 8,
  }) {
    final maxValue = (1 << bits) - 1;
    final range = maxVal - minVal;
    
    return quantized.map((x) => 
      minVal + (x / maxValue) * range
    ).toList();
  }
}

Performance Optimization

Batch Operations

class VectorBatchProcessor {
  final VektagrafVectorSpace vectorSpace;
  
  VectorBatchProcessor(this.vectorSpace);
  
  /// Adds vectors in optimized batches
  Future<List<VektagrafId>> addVectorsBatch(
    List<List<double>> vectors,
    List<Map<String, dynamic>> metadataList, {
    int batchSize = 1000,
  }) async {
    final ids = <VektagrafId>[];
    
    for (int i = 0; i < vectors.length; i += batchSize) {
      final batchVectors = vectors.skip(i).take(batchSize);
      final batchMetadata = metadataList.skip(i).take(batchSize);
      
      final batchIds = await Future.wait(
        batchVectors.toList().asMap().entries.map((entry) async {
          final index = entry.key;
          final vector = entry.value;
          final metadata = batchMetadata.elementAt(index);
          
          return await vectorSpace.addVector(vector, metadata: metadata);
        }),
      );
      
      ids.addAll(batchIds);
      
      print('Processed batch ${(i ~/ batchSize) + 1}/${(vectors.length / batchSize).ceil()}');
    }
    
    return ids;
  }
  
  /// Performs batch similarity searches
  Future<List<List<VectorSearchResult>>> batchSimilaritySearch(
    List<List<double>> queryVectors,
    int limit, {
    DistanceMetric metric = DistanceMetric.cosine,
  }) async {
    return await Future.wait(
      queryVectors.map((query) => 
        vectorSpace.similaritySearch(query, limit, metric: metric)
      ),
    );
  }
}

Caching and Memoization

class VectorSearchCache {
  final Map<String, List<VectorSearchResult>> _cache = {};
  final int _maxCacheSize;
  final Duration _cacheTtl;
  final Map<String, DateTime> _cacheTimestamps = {};
  
  VectorSearchCache({
    int maxCacheSize = 1000,
    Duration cacheTtl = const Duration(minutes: 30),
  }) : _maxCacheSize = maxCacheSize,
       _cacheTtl = cacheTtl;
  
  String _getCacheKey(List<double> query, int limit, DistanceMetric metric) {
    // Create deterministic cache key
    final queryHash = query.map((x) => x.toStringAsFixed(6)).join(',');
    return '${queryHash}_${limit}_${metric.name}';
  }
  
  Future<List<VectorSearchResult>?> getCachedResults(
    List<double> query,
    int limit,
    DistanceMetric metric,
  ) async {
    final key = _getCacheKey(query, limit, metric);
    final timestamp = _cacheTimestamps[key];
    
    if (timestamp != null && 
        DateTime.now().difference(timestamp) < _cacheTtl) {
      return _cache[key];
    }
    
    // Remove expired entry
    _cache.remove(key);
    _cacheTimestamps.remove(key);
    return null;
  }
  
  void cacheResults(
    List<double> query,
    int limit,
    DistanceMetric metric,
    List<VectorSearchResult> results,
  ) {
    final key = _getCacheKey(query, limit, metric);
    
    // Evict oldest entries if cache is full
    if (_cache.length >= _maxCacheSize) {
      final oldestKey = _cacheTimestamps.entries
          .reduce((a, b) => a.value.isBefore(b.value) ? a : b)
          .key;
      _cache.remove(oldestKey);
      _cacheTimestamps.remove(oldestKey);
    }
    
    _cache[key] = results;
    _cacheTimestamps[key] = DateTime.now();
  }
  
  void clearCache() {
    _cache.clear();
    _cacheTimestamps.clear();
  }
}

// Usage with caching
class CachedVectorSpace {
  final VektagrafVectorSpace _vectorSpace;
  final VectorSearchCache _cache;
  
  CachedVectorSpace(this._vectorSpace) 
      : _cache = VectorSearchCache();
  
  Future<List<VectorSearchResult>> similaritySearch(
    List<double> queryVector,
    int limit, {
    DistanceMetric metric = DistanceMetric.cosine,
  }) async {
    // Check cache first
    final cached = await _cache.getCachedResults(queryVector, limit, metric);
    if (cached != null) {
      return cached;
    }
    
    // Perform actual search
    final results = await _vectorSpace.similaritySearch(
      queryVector, 
      limit, 
      metric: metric,
    );
    
    // Cache results
    _cache.cacheResults(queryVector, limit, metric, results);
    
    return results;
  }
}

Advanced Topics

Custom Distance Metrics

class CustomDistanceMetrics {
  /// Weighted cosine distance with feature importance
  static double weightedCosineDistance(
    List<double> a,
    List<double> b,
    List<double> weights,
  ) {
    if (a.length != b.length || a.length != weights.length) {
      throw ArgumentError('All vectors must have the same length');
    }
    
    double dotProduct = 0.0;
    double normA = 0.0;
    double normB = 0.0;
    
    for (int i = 0; i < a.length; i++) {
      final weightedA = a[i] * weights[i];
      final weightedB = b[i] * weights[i];
      
      dotProduct += weightedA * weightedB;
      normA += weightedA * weightedA;
      normB += weightedB * weightedB;
    }
    
    if (normA == 0.0 || normB == 0.0) return 1.0;
    
    final cosineSimilarity = dotProduct / (sqrt(normA) * sqrt(normB));
    return 1.0 - cosineSimilarity.clamp(-1.0, 1.0);
  }
  
  /// Mahalanobis distance with covariance matrix
  static double mahalanobisDistance(
    List<double> a,
    List<double> b,
    List<List<double>> invCovarianceMatrix,
  ) {
    final diff = List.generate(a.length, (i) => a[i] - b[i]);
    
    // Calculate (a-b)^T * Σ^(-1) * (a-b)
    double result = 0.0;
    for (int i = 0; i < diff.length; i++) {
      for (int j = 0; j < diff.length; j++) {
        result += diff[i] * invCovarianceMatrix[i][j] * diff[j];
      }
    }
    
    return sqrt(result);
  }
  
  /// Hamming distance for binary vectors
  static int hammingDistance(List<bool> a, List<bool> b) {
    if (a.length != b.length) {
      throw ArgumentError('Vectors must have the same length');
    }
    
    int distance = 0;
    for (int i = 0; i < a.length; i++) {
      if (a[i] != b[i]) distance++;
    }
    
    return distance;
  }
}
class MultiVectorSearch {
  final Map<String, VektagrafVectorSpace> _vectorSpaces;
  
  MultiVectorSearch(this._vectorSpaces);
  
  /// Searches across multiple vector spaces and combines results
  Future<List<MultiVectorResult>> multiSpaceSearch(
    Map<String, List<double>> queries, // space_name -> query_vector
    int limit, {
    Map<String, double>? spaceWeights, // space_name -> weight
  }) async {
    final weights = spaceWeights ?? 
        Map.fromEntries(_vectorSpaces.keys.map((k) => MapEntry(k, 1.0)));
    
    final allResults = <MultiVectorResult>[];
    
    // Search each space
    for (final entry in queries.entries) {
      final spaceName = entry.key;
      final queryVector = entry.value;
      final space = _vectorSpaces[spaceName];
      
      if (space == null) continue;
      
      final results = await space.similaritySearch(queryVector, limit * 2);
      final weight = weights[spaceName] ?? 1.0;
      
      for (final result in results) {
        allResults.add(MultiVectorResult(
          spaceName: spaceName,
          vectorId: result.id,
          score: result.score * weight,
          distance: result.distance,
          metadata: result.metadata,
        ));
      }
    }
    
    // Combine and rank results
    final groupedResults = <VektagrafId, List<MultiVectorResult>>{};
    for (final result in allResults) {
      groupedResults.putIfAbsent(result.vectorId, () => []).add(result);
    }
    
    final combinedResults = <MultiVectorResult>[];
    for (final entry in groupedResults.entries) {
      final vectorId = entry.key;
      final results = entry.value;
      
      // Combine scores (average or max)
      final avgScore = results.map((r) => r.score).reduce((a, b) => a + b) / results.length;
      final maxScore = results.map((r) => r.score).reduce(max);
      
      combinedResults.add(MultiVectorResult(
        spaceName: 'combined',
        vectorId: vectorId,
        score: maxScore, // Use max score for ranking
        distance: results.first.distance,
        metadata: results.first.metadata,
      ));
    }
    
    // Sort by combined score and return top results
    combinedResults.sort((a, b) => b.score.compareTo(a.score));
    return combinedResults.take(limit).toList();
  }
}

class MultiVectorResult {
  final String spaceName;
  final VektagrafId vectorId;
  final double score;
  final double distance;
  final Map<String, dynamic> metadata;
  
  MultiVectorResult({
    required this.spaceName,
    required this.vectorId,
    required this.score,
    required this.distance,
    required this.metadata,
  });
}

Vector Space Monitoring

class VectorSpaceMonitor {
  final VektagrafVectorSpace vectorSpace;
  final List<SearchMetric> _searchMetrics = [];
  
  VectorSpaceMonitor(this.vectorSpace);
  
  Future<List<VectorSearchResult>> monitoredSearch(
    List<double> queryVector,
    int limit, {
    DistanceMetric metric = DistanceMetric.cosine,
  }) async {
    final stopwatch = Stopwatch()..start();
    
    try {
      final results = await vectorSpace.similaritySearch(
        queryVector, 
        limit, 
        metric: metric,
      );
      
      stopwatch.stop();
      
      _recordMetric(SearchMetric(
        timestamp: DateTime.now(),
        queryDimensions: queryVector.length,
        limit: limit,
        metric: metric,
        duration: stopwatch.elapsed,
        resultCount: results.length,
        success: true,
      ));
      
      return results;
      
    } catch (e) {
      stopwatch.stop();
      
      _recordMetric(SearchMetric(
        timestamp: DateTime.now(),
        queryDimensions: queryVector.length,
        limit: limit,
        metric: metric,
        duration: stopwatch.elapsed,
        resultCount: 0,
        success: false,
        error: e.toString(),
      ));
      
      rethrow;
    }
  }
  
  void _recordMetric(SearchMetric metric) {
    _searchMetrics.add(metric);
    
    // Keep only recent metrics (last 1000)
    if (_searchMetrics.length > 1000) {
      _searchMetrics.removeAt(0);
    }
  }
  
  SearchStatistics getStatistics({Duration? period}) {
    final cutoff = period != null 
        ? DateTime.now().subtract(period)
        : DateTime.fromMillisecondsSinceEpoch(0);
    
    final recentMetrics = _searchMetrics
        .where((m) => m.timestamp.isAfter(cutoff))
        .toList();
    
    if (recentMetrics.isEmpty) {
      return SearchStatistics.empty();
    }
    
    final successfulSearches = recentMetrics.where((m) => m.success);
    final durations = successfulSearches.map((m) => m.duration);
    
    return SearchStatistics(
      totalSearches: recentMetrics.length,
      successfulSearches: successfulSearches.length,
      averageDuration: durations.isEmpty 
          ? Duration.zero
          : Duration(microseconds: 
              durations.map((d) => d.inMicroseconds).reduce((a, b) => a + b) ~/ 
              durations.length),
      maxDuration: durations.isEmpty 
          ? Duration.zero 
          : durations.reduce((a, b) => a > b ? a : b),
      minDuration: durations.isEmpty 
          ? Duration.zero 
          : durations.reduce((a, b) => a < b ? a : b),
      averageResultCount: successfulSearches.isEmpty 
          ? 0.0 
          : successfulSearches.map((m) => m.resultCount).reduce((a, b) => a + b) / 
            successfulSearches.length,
      errorRate: recentMetrics.length > 0 
          ? (recentMetrics.length - successfulSearches.length) / recentMetrics.length
          : 0.0,
    );
  }
}

class SearchMetric {
  final DateTime timestamp;
  final int queryDimensions;
  final int limit;
  final DistanceMetric metric;
  final Duration duration;
  final int resultCount;
  final bool success;
  final String? error;
  
  SearchMetric({
    required this.timestamp,
    required this.queryDimensions,
    required this.limit,
    required this.metric,
    required this.duration,
    required this.resultCount,
    required this.success,
    this.error,
  });
}

class SearchStatistics {
  final int totalSearches;
  final int successfulSearches;
  final Duration averageDuration;
  final Duration maxDuration;
  final Duration minDuration;
  final double averageResultCount;
  final double errorRate;
  
  SearchStatistics({
    required this.totalSearches,
    required this.successfulSearches,
    required this.averageDuration,
    required this.maxDuration,
    required this.minDuration,
    required this.averageResultCount,
    required this.errorRate,
  });
  
  factory SearchStatistics.empty() => SearchStatistics(
    totalSearches: 0,
    successfulSearches: 0,
    averageDuration: Duration.zero,
    maxDuration: Duration.zero,
    minDuration: Duration.zero,
    averageResultCount: 0.0,
    errorRate: 0.0,
  );
}

Reference

Vector Space API

Core Interface

abstract class VektagrafVectorSpace {
  String get name;
  int get dimensions;
  
  Future<VektagrafId> addVector(List<double> vector, {Map<String, dynamic>? metadata});
  Future<List<VectorSearchResult>> similaritySearch(List<double> queryVector, int limit, 
    {DistanceMetric metric = DistanceMetric.cosine});
  Future<List<VectorSearchResult>> hybridSearch(List<double> queryVector, 
    bool Function(Map<String, dynamic>) metadataFilter, int limit, 
    {DistanceMetric metric = DistanceMetric.cosine});
  Future<void> removeVector(VektagrafId id);
  Future<void> updateMetadata(VektagrafId id, Map<String, dynamic> metadata);
  Future<int> get vectorCount;
  Future<void> close();
}

Distance Metrics

enum DistanceMetric {
  cosine,      // 1 - cosine_similarity, range: [0, 2]
  euclidean,   // L2 distance, range: [0, ∞]
  innerProduct // -dot_product, range: [-∞, ∞]
}

Algorithm-Specific Parameters

HNSW Parameters:

  • maxConnections (M): 4-64, default 16
  • efConstruction: 100-800, default 200
  • efSearch: 10-500, default 50
  • maxLayers: 8-32, default 16

IVFFlat Parameters:

  • numClusters (nLists): 100-10000, default sqrt(n_vectors) * 4
  • nProbe: 1-nLists, default nLists/10
  • minTrainingVectors: 1000-50000, default nLists * 10

Performance Guidelines

Memory Usage Estimates

AlgorithmMemory per VectorIndex Overhead
Memory4 * dimensionsMinimal
HNSW4 * dimensions~50% overhead
IVFFlat4 * dimensions~10% overhead

Query Performance

AlgorithmBuild TimeQuery TimeAccuracy
MemoryO(1)O(n)100%
HNSWO(n log n)O(log n)95-99%
IVFFlatO(n)O(√n)80-95%

Summary

This chapter covered Vektagraf's comprehensive vector search capabilities, from basic similarity operations to advanced indexing algorithms. Key takeaways include:

  • Multiple Algorithms: Choose between Memory, HNSW, and IVFFlat based on your requirements
  • Distance Metrics: Cosine, Euclidean, and Inner Product for different use cases
  • Hybrid Search: Combine vector similarity with metadata filtering
  • Performance Optimization: Batch operations, caching, and parameter tuning
  • Integration: Seamless integration with VektagrafList for enhanced object queries

Next Steps

  • Chapter 6: Learn about graph operations and relationship modeling
  • Chapter 7: Explore query optimization and performance tuning techniques
  • Chapter 19: Build complete recommendation systems using vector search