buster/packages/stored-values
dal bfee540c5b
starting migration to ai sdk 5
2025-07-29 20:57:22 -06:00
..
scripts Use tsx and .ts files for validation 2025-07-21 16:07:14 -06:00
src Move unit tests to be alongside source files 2025-07-17 16:13:40 +00:00
.env.example Mastra braintrust (#391) 2025-07-02 14:33:40 -07:00
.gitignore Mastra braintrust (#391) 2025-07-02 14:33:40 -07:00
README.md Mastra braintrust (#391) 2025-07-02 14:33:40 -07:00
biome.json Mastra braintrust (#391) 2025-07-02 14:33:40 -07:00
env.d.ts Mastra braintrust (#391) 2025-07-02 14:33:40 -07:00
package.json starting migration to ai sdk 5 2025-07-29 20:57:22 -06:00
tsconfig.json Update inlcude 2025-07-12 23:46:09 -06:00
turbo.json Mastra braintrust (#391) 2025-07-02 14:33:40 -07:00
vitest.config.ts Mastra braintrust (#391) 2025-07-02 14:33:40 -07:00

README.md

@buster/stored-values

Production-ready vector similarity search package for stored column values using OpenAI embeddings and pgvector.

Overview

This package provides functionality to search for stored column values using semantic similarity. It uses OpenAI's text-embedding-3-small model to generate embeddings and pgvector's halfvec type with HNSW indexes for efficient similarity search.

Features

  • Production Ready: Full Zod validation, parameterized queries, comprehensive error handling
  • 🔍 Semantic Search: Using OpenAI's text-embedding-3-small embeddings (1536 dimensions)
  • 🏢 Multi-tenant: Schema-specific tables for data isolation
  • 🎛️ Advanced Filtering: Filter by database, schema, table, and column
  • Parallel Search: Search across multiple targets concurrently
  • 🎯 Similarity Thresholds: Filter results by minimum similarity score
  • 🔄 Automatic Retries: Built-in retry logic for embedding generation
  • 📊 Health Checks: Verify database connectivity and table existence
  • 🛡️ Type Safe: Full TypeScript support with runtime validation

Installation

bun add @buster/stored-values

Quick Start

import { generateEmbedding, searchValuesByEmbedding } from '@buster/stored-values';

// Generate embedding for search query
const embedding = await generateEmbedding(['user email']);

// Search for similar values with options
const results = await searchValuesByEmbedding(
  'cc3ef3bc-44ec-4a43-8dc4-681cae5c996a', // data source UUID
  embedding,
  { limit: 10, similarityThreshold: 0.8 }
);

API Reference

Core Search Functions

searchValuesByEmbedding(dataSourceId, queryEmbedding, options?)

Basic semantic similarity search.

const results = await searchValuesByEmbedding(
  'cc3ef3bc-44ec-4a43-8dc4-681cae5c996a',
  embedding,
  {
    limit: 20,                    // Max results (1-1000, default: 10)
    similarityThreshold: 0.7      // Min similarity score (0-1, optional)
  }
);

searchValuesByEmbeddingWithFilters(dataSourceId, queryEmbedding, options?, ...filters)

Search with database/table/column filters.

const results = await searchValuesByEmbeddingWithFilters(
  'cc3ef3bc-44ec-4a43-8dc4-681cae5c996a',
  embedding,
  { limit: 10, similarityThreshold: 0.8 },
  'production',  // database name filter
  'public',      // schema name filter  
  'users',       // table name filter
  'email'        // column name filter
);

searchValuesAcrossTargets(dataSourceId, queryEmbedding, targets, limitPerTarget?, options?)

Parallel search across multiple specific targets.

const targets = [
  { database_name: 'prod', schema_name: 'public', table_name: 'users', column_name: 'email' },
  { database_name: 'prod', schema_name: 'public', table_name: 'products', column_name: 'name' }
];

const results = await searchValuesAcrossTargets(
  'cc3ef3bc-44ec-4a43-8dc4-681cae5c996a',
  embedding,
  targets,
  5,  // limit per target
  { similarityThreshold: 0.8 }
);

Utility Functions

generateEmbedding(searchTerms, options?)

Generate embeddings with automatic retries.

const embedding = await generateEmbedding(
  ['user', 'email', 'address'],
  {
    maxRetries: 5,                    // Retry attempts (0-10, default: 3)
    abortSignal: controller.signal    // Cancellation support
  }
);

healthCheck(dataSourceId)

Verify database connectivity and table existence.

const isHealthy = await healthCheck('cc3ef3bc-44ec-4a43-8dc4-681cae5c996a');
// Returns: boolean

extractSearchableColumnsFromYaml(ymlContent)

Parse YAML dataset content to find searchable text columns.

const yamlString = JSON.stringify({
  database: 'production',
  tables: [{
    name: 'users',
    schema: 'public', 
    columns: [
      { name: 'email', type: 'varchar(255)' },
      { name: 'bio', type: 'text' }
    ]
  }]
});

const targets = extractSearchableColumnsFromYaml(yamlString);
// Returns: SearchTarget[]

Types

interface StoredValueResult {
  id: string;
  value: string;
  database_name: string;
  column_name: string;
  table_name: string;
  schema_name: string;
  synced_at: Date | null;
}

interface SearchTarget {
  database_name: string;
  schema_name: string;
  table_name: string;
  column_name: string;
}

interface SearchOptions {
  limit?: number;              // 1-1000, default: 10
  similarityThreshold?: number; // 0-1, optional
}

interface EmbeddingOptions {
  maxRetries?: number;         // 0-10, default: 3
  abortSignal?: AbortSignal;   // Optional cancellation
}

Error Handling

All functions throw StoredValuesError with detailed error messages:

import { StoredValuesError } from '@buster/stored-values';

try {
  const results = await searchValuesByEmbedding(dataSourceId, embedding);
} catch (error) {
  if (error instanceof StoredValuesError) {
    console.error('Search failed:', error.message);
    console.error('Caused by:', error.cause);
  }
}

Validation

All inputs are validated using Zod schemas:

import { UuidSchema, EmbeddingSchema, SearchOptionsSchema } from '@buster/stored-values';

// Validate individual inputs
const validUuid = UuidSchema.parse('cc3ef3bc-44ec-4a43-8dc4-681cae5c996a');
const validEmbedding = EmbeddingSchema.parse(embedding); // Must be 1536 dimensions
const validOptions = SearchOptionsSchema.parse({ limit: 50 });

Database Schema

The package expects schema-specific tables:

CREATE TABLE "ds_{data_source_id}"."searchable_column_values" (
  id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
  value text NOT NULL,
  database_name text NOT NULL,
  column_name text NOT NULL,
  table_name text NOT NULL,
  schema_name text NOT NULL,
  embedding halfvec(1536),
  synced_at timestamp with time zone DEFAULT now()
);

-- HNSW index for efficient similarity search
CREATE INDEX idx_embedding_hnsw_{schema_name} 
ON "ds_{data_source_id}"."searchable_column_values"
USING hnsw (embedding halfvec_cosine_ops);

Environment Variables

DATABASE_URL=postgresql://user:pass@host:port/db    # Required
OPENAI_API_KEY=sk-...                              # Required

Testing

# Unit tests (no external dependencies)
bun test src/__tests__/schemas.unit.test.ts
bun test src/__tests__/utils.unit.test.ts
bun test src/__tests__/search.unit.test.ts

# Integration tests (requires DATABASE_URL and OPENAI_API_KEY)
bun test src/__tests__/search.integration.test.ts

Performance Notes

  • Embedding Dimensions: Strictly enforced at 1536 (text-embedding-3-small)
  • Query Limits: Maximum 1000 results per query, 100 targets per parallel search
  • Parameterized Queries: All SQL uses parameterized queries for security
  • Connection Pooling: Uses postgres.js connection pooling
  • Parallel Execution: Multiple targets searched concurrently

Migration from Rust

This TypeScript package implements the same functionality as api/libs/stored_values in Rust:

Rust Function TypeScript Function
search_values_by_embedding searchValuesByEmbedding
search_values_by_embedding_with_filters searchValuesByEmbeddingWithFilters
search_values_across_targets searchValuesAcrossTargets
extract_searchable_columns_from_yaml extractSearchableColumnsFromYaml

Security Features

  • SQL Injection Protection: All queries use parameterized statements
  • Input Validation: Comprehensive Zod schema validation
  • UUID Validation: Strict UUID format enforcement
  • Dimension Validation: Embedding arrays must be exactly 1536 elements
  • Rate Limiting Support: Built-in retry logic with exponential backoff
  • Type Safety: Full TypeScript coverage with runtime validation