buster/packages/stored-values/README.md

271 lines
7.7 KiB
Markdown

# @buster/stored-values
Production-ready vector similarity search package for stored column values using OpenAI embeddings and pgvector.
## Overview
This package provides functionality to search for stored column values using semantic similarity. It uses OpenAI's text-embedding-3-small model to generate embeddings and pgvector's halfvec type with HNSW indexes for efficient similarity search.
## Features
-**Production Ready**: Full Zod validation, parameterized queries, comprehensive error handling
- 🔍 **Semantic Search**: Using OpenAI's text-embedding-3-small embeddings (1536 dimensions)
- 🏢 **Multi-tenant**: Schema-specific tables for data isolation
- 🎛️ **Advanced Filtering**: Filter by database, schema, table, and column
-**Parallel Search**: Search across multiple targets concurrently
- 🎯 **Similarity Thresholds**: Filter results by minimum similarity score
- 🔄 **Automatic Retries**: Built-in retry logic for embedding generation
- 📊 **Health Checks**: Verify database connectivity and table existence
- 🛡️ **Type Safe**: Full TypeScript support with runtime validation
## Installation
```bash
bun add @buster/stored-values
```
## Quick Start
```typescript
import { generateEmbedding, searchValuesByEmbedding } from '@buster/stored-values';
// Generate embedding for search query
const embedding = await generateEmbedding(['user email']);
// Search for similar values with options
const results = await searchValuesByEmbedding(
'cc3ef3bc-44ec-4a43-8dc4-681cae5c996a', // data source UUID
embedding,
{ limit: 10, similarityThreshold: 0.8 }
);
```
## API Reference
### Core Search Functions
#### `searchValuesByEmbedding(dataSourceId, queryEmbedding, options?)`
Basic semantic similarity search.
```typescript
const results = await searchValuesByEmbedding(
'cc3ef3bc-44ec-4a43-8dc4-681cae5c996a',
embedding,
{
limit: 20, // Max results (1-1000, default: 10)
similarityThreshold: 0.7 // Min similarity score (0-1, optional)
}
);
```
#### `searchValuesByEmbeddingWithFilters(dataSourceId, queryEmbedding, options?, ...filters)`
Search with database/table/column filters.
```typescript
const results = await searchValuesByEmbeddingWithFilters(
'cc3ef3bc-44ec-4a43-8dc4-681cae5c996a',
embedding,
{ limit: 10, similarityThreshold: 0.8 },
'production', // database name filter
'public', // schema name filter
'users', // table name filter
'email' // column name filter
);
```
#### `searchValuesAcrossTargets(dataSourceId, queryEmbedding, targets, limitPerTarget?, options?)`
Parallel search across multiple specific targets.
```typescript
const targets = [
{ database_name: 'prod', schema_name: 'public', table_name: 'users', column_name: 'email' },
{ database_name: 'prod', schema_name: 'public', table_name: 'products', column_name: 'name' }
];
const results = await searchValuesAcrossTargets(
'cc3ef3bc-44ec-4a43-8dc4-681cae5c996a',
embedding,
targets,
5, // limit per target
{ similarityThreshold: 0.8 }
);
```
### Utility Functions
#### `generateEmbedding(searchTerms, options?)`
Generate embeddings with automatic retries.
```typescript
const embedding = await generateEmbedding(
['user', 'email', 'address'],
{
maxRetries: 5, // Retry attempts (0-10, default: 3)
abortSignal: controller.signal // Cancellation support
}
);
```
#### `healthCheck(dataSourceId)`
Verify database connectivity and table existence.
```typescript
const isHealthy = await healthCheck('cc3ef3bc-44ec-4a43-8dc4-681cae5c996a');
// Returns: boolean
```
#### `extractSearchableColumnsFromYaml(ymlContent)`
Parse YAML dataset content to find searchable text columns.
```typescript
const yamlString = JSON.stringify({
database: 'production',
tables: [{
name: 'users',
schema: 'public',
columns: [
{ name: 'email', type: 'varchar(255)' },
{ name: 'bio', type: 'text' }
]
}]
});
const targets = extractSearchableColumnsFromYaml(yamlString);
// Returns: SearchTarget[]
```
## Types
```typescript
interface StoredValueResult {
id: string;
value: string;
database_name: string;
column_name: string;
table_name: string;
schema_name: string;
synced_at: Date | null;
}
interface SearchTarget {
database_name: string;
schema_name: string;
table_name: string;
column_name: string;
}
interface SearchOptions {
limit?: number; // 1-1000, default: 10
similarityThreshold?: number; // 0-1, optional
}
interface EmbeddingOptions {
maxRetries?: number; // 0-10, default: 3
abortSignal?: AbortSignal; // Optional cancellation
}
```
## Error Handling
All functions throw `StoredValuesError` with detailed error messages:
```typescript
import { StoredValuesError } from '@buster/stored-values';
try {
const results = await searchValuesByEmbedding(dataSourceId, embedding);
} catch (error) {
if (error instanceof StoredValuesError) {
console.error('Search failed:', error.message);
console.error('Caused by:', error.cause);
}
}
```
## Validation
All inputs are validated using Zod schemas:
```typescript
import { UuidSchema, EmbeddingSchema, SearchOptionsSchema } from '@buster/stored-values';
// Validate individual inputs
const validUuid = UuidSchema.parse('cc3ef3bc-44ec-4a43-8dc4-681cae5c996a');
const validEmbedding = EmbeddingSchema.parse(embedding); // Must be 1536 dimensions
const validOptions = SearchOptionsSchema.parse({ limit: 50 });
```
## Database Schema
The package expects schema-specific tables:
```sql
CREATE TABLE "ds_{data_source_id}"."searchable_column_values" (
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
value text NOT NULL,
database_name text NOT NULL,
column_name text NOT NULL,
table_name text NOT NULL,
schema_name text NOT NULL,
embedding halfvec(1536),
synced_at timestamp with time zone DEFAULT now()
);
-- HNSW index for efficient similarity search
CREATE INDEX idx_embedding_hnsw_{schema_name}
ON "ds_{data_source_id}"."searchable_column_values"
USING hnsw (embedding halfvec_cosine_ops);
```
## Environment Variables
```bash
DATABASE_URL=postgresql://user:pass@host:port/db # Required
OPENAI_API_KEY=sk-... # Required
```
## Testing
```bash
# Unit tests (no external dependencies)
bun test src/__tests__/schemas.unit.test.ts
bun test src/__tests__/utils.unit.test.ts
bun test src/__tests__/search.unit.test.ts
# Integration tests (requires DATABASE_URL and OPENAI_API_KEY)
bun test src/__tests__/search.integration.test.ts
```
## Performance Notes
- **Embedding Dimensions**: Strictly enforced at 1536 (text-embedding-3-small)
- **Query Limits**: Maximum 1000 results per query, 100 targets per parallel search
- **Parameterized Queries**: All SQL uses parameterized queries for security
- **Connection Pooling**: Uses postgres.js connection pooling
- **Parallel Execution**: Multiple targets searched concurrently
## Migration from Rust
This TypeScript package implements the same functionality as `api/libs/stored_values` in Rust:
| Rust Function | TypeScript Function |
|---------------|-------------------|
| `search_values_by_embedding` | `searchValuesByEmbedding` |
| `search_values_by_embedding_with_filters` | `searchValuesByEmbeddingWithFilters` |
| `search_values_across_targets` | `searchValuesAcrossTargets` |
| `extract_searchable_columns_from_yaml` | `extractSearchableColumnsFromYaml` |
## Security Features
-**SQL Injection Protection**: All queries use parameterized statements
-**Input Validation**: Comprehensive Zod schema validation
-**UUID Validation**: Strict UUID format enforcement
-**Dimension Validation**: Embedding arrays must be exactly 1536 elements
-**Rate Limiting Support**: Built-in retry logic with exponential backoff
- ✅ **Type Safety**: Full TypeScript coverage with runtime validation