mirror of https://github.com/buster-so/buster.git
271 lines
7.7 KiB
Markdown
271 lines
7.7 KiB
Markdown
# @buster/stored-values
|
|
|
|
Production-ready vector similarity search package for stored column values using OpenAI embeddings and pgvector.
|
|
|
|
## Overview
|
|
|
|
This package provides functionality to search for stored column values using semantic similarity. It uses OpenAI's text-embedding-3-small model to generate embeddings and pgvector's halfvec type with HNSW indexes for efficient similarity search.
|
|
|
|
## Features
|
|
|
|
- ✅ **Production Ready**: Full Zod validation, parameterized queries, comprehensive error handling
|
|
- 🔍 **Semantic Search**: Using OpenAI's text-embedding-3-small embeddings (1536 dimensions)
|
|
- 🏢 **Multi-tenant**: Schema-specific tables for data isolation
|
|
- 🎛️ **Advanced Filtering**: Filter by database, schema, table, and column
|
|
- ⚡ **Parallel Search**: Search across multiple targets concurrently
|
|
- 🎯 **Similarity Thresholds**: Filter results by minimum similarity score
|
|
- 🔄 **Automatic Retries**: Built-in retry logic for embedding generation
|
|
- 📊 **Health Checks**: Verify database connectivity and table existence
|
|
- 🛡️ **Type Safe**: Full TypeScript support with runtime validation
|
|
|
|
## Installation
|
|
|
|
```bash
|
|
bun add @buster/stored-values
|
|
```
|
|
|
|
## Quick Start
|
|
|
|
```typescript
|
|
import { generateEmbedding, searchValuesByEmbedding } from '@buster/stored-values';
|
|
|
|
// Generate embedding for search query
|
|
const embedding = await generateEmbedding(['user email']);
|
|
|
|
// Search for similar values with options
|
|
const results = await searchValuesByEmbedding(
|
|
'cc3ef3bc-44ec-4a43-8dc4-681cae5c996a', // data source UUID
|
|
embedding,
|
|
{ limit: 10, similarityThreshold: 0.8 }
|
|
);
|
|
```
|
|
|
|
## API Reference
|
|
|
|
### Core Search Functions
|
|
|
|
#### `searchValuesByEmbedding(dataSourceId, queryEmbedding, options?)`
|
|
|
|
Basic semantic similarity search.
|
|
|
|
```typescript
|
|
const results = await searchValuesByEmbedding(
|
|
'cc3ef3bc-44ec-4a43-8dc4-681cae5c996a',
|
|
embedding,
|
|
{
|
|
limit: 20, // Max results (1-1000, default: 10)
|
|
similarityThreshold: 0.7 // Min similarity score (0-1, optional)
|
|
}
|
|
);
|
|
```
|
|
|
|
#### `searchValuesByEmbeddingWithFilters(dataSourceId, queryEmbedding, options?, ...filters)`
|
|
|
|
Search with database/table/column filters.
|
|
|
|
```typescript
|
|
const results = await searchValuesByEmbeddingWithFilters(
|
|
'cc3ef3bc-44ec-4a43-8dc4-681cae5c996a',
|
|
embedding,
|
|
{ limit: 10, similarityThreshold: 0.8 },
|
|
'production', // database name filter
|
|
'public', // schema name filter
|
|
'users', // table name filter
|
|
'email' // column name filter
|
|
);
|
|
```
|
|
|
|
#### `searchValuesAcrossTargets(dataSourceId, queryEmbedding, targets, limitPerTarget?, options?)`
|
|
|
|
Parallel search across multiple specific targets.
|
|
|
|
```typescript
|
|
const targets = [
|
|
{ database_name: 'prod', schema_name: 'public', table_name: 'users', column_name: 'email' },
|
|
{ database_name: 'prod', schema_name: 'public', table_name: 'products', column_name: 'name' }
|
|
];
|
|
|
|
const results = await searchValuesAcrossTargets(
|
|
'cc3ef3bc-44ec-4a43-8dc4-681cae5c996a',
|
|
embedding,
|
|
targets,
|
|
5, // limit per target
|
|
{ similarityThreshold: 0.8 }
|
|
);
|
|
```
|
|
|
|
### Utility Functions
|
|
|
|
#### `generateEmbedding(searchTerms, options?)`
|
|
|
|
Generate embeddings with automatic retries.
|
|
|
|
```typescript
|
|
const embedding = await generateEmbedding(
|
|
['user', 'email', 'address'],
|
|
{
|
|
maxRetries: 5, // Retry attempts (0-10, default: 3)
|
|
abortSignal: controller.signal // Cancellation support
|
|
}
|
|
);
|
|
```
|
|
|
|
#### `healthCheck(dataSourceId)`
|
|
|
|
Verify database connectivity and table existence.
|
|
|
|
```typescript
|
|
const isHealthy = await healthCheck('cc3ef3bc-44ec-4a43-8dc4-681cae5c996a');
|
|
// Returns: boolean
|
|
```
|
|
|
|
#### `extractSearchableColumnsFromYaml(ymlContent)`
|
|
|
|
Parse YAML dataset content to find searchable text columns.
|
|
|
|
```typescript
|
|
const yamlString = JSON.stringify({
|
|
database: 'production',
|
|
tables: [{
|
|
name: 'users',
|
|
schema: 'public',
|
|
columns: [
|
|
{ name: 'email', type: 'varchar(255)' },
|
|
{ name: 'bio', type: 'text' }
|
|
]
|
|
}]
|
|
});
|
|
|
|
const targets = extractSearchableColumnsFromYaml(yamlString);
|
|
// Returns: SearchTarget[]
|
|
```
|
|
|
|
## Types
|
|
|
|
```typescript
|
|
interface StoredValueResult {
|
|
id: string;
|
|
value: string;
|
|
database_name: string;
|
|
column_name: string;
|
|
table_name: string;
|
|
schema_name: string;
|
|
synced_at: Date | null;
|
|
}
|
|
|
|
interface SearchTarget {
|
|
database_name: string;
|
|
schema_name: string;
|
|
table_name: string;
|
|
column_name: string;
|
|
}
|
|
|
|
interface SearchOptions {
|
|
limit?: number; // 1-1000, default: 10
|
|
similarityThreshold?: number; // 0-1, optional
|
|
}
|
|
|
|
interface EmbeddingOptions {
|
|
maxRetries?: number; // 0-10, default: 3
|
|
abortSignal?: AbortSignal; // Optional cancellation
|
|
}
|
|
```
|
|
|
|
## Error Handling
|
|
|
|
All functions throw `StoredValuesError` with detailed error messages:
|
|
|
|
```typescript
|
|
import { StoredValuesError } from '@buster/stored-values';
|
|
|
|
try {
|
|
const results = await searchValuesByEmbedding(dataSourceId, embedding);
|
|
} catch (error) {
|
|
if (error instanceof StoredValuesError) {
|
|
console.error('Search failed:', error.message);
|
|
console.error('Caused by:', error.cause);
|
|
}
|
|
}
|
|
```
|
|
|
|
## Validation
|
|
|
|
All inputs are validated using Zod schemas:
|
|
|
|
```typescript
|
|
import { UuidSchema, EmbeddingSchema, SearchOptionsSchema } from '@buster/stored-values';
|
|
|
|
// Validate individual inputs
|
|
const validUuid = UuidSchema.parse('cc3ef3bc-44ec-4a43-8dc4-681cae5c996a');
|
|
const validEmbedding = EmbeddingSchema.parse(embedding); // Must be 1536 dimensions
|
|
const validOptions = SearchOptionsSchema.parse({ limit: 50 });
|
|
```
|
|
|
|
## Database Schema
|
|
|
|
The package expects schema-specific tables:
|
|
|
|
```sql
|
|
CREATE TABLE "ds_{data_source_id}"."searchable_column_values" (
|
|
id uuid PRIMARY KEY DEFAULT gen_random_uuid(),
|
|
value text NOT NULL,
|
|
database_name text NOT NULL,
|
|
column_name text NOT NULL,
|
|
table_name text NOT NULL,
|
|
schema_name text NOT NULL,
|
|
embedding halfvec(1536),
|
|
synced_at timestamp with time zone DEFAULT now()
|
|
);
|
|
|
|
-- HNSW index for efficient similarity search
|
|
CREATE INDEX idx_embedding_hnsw_{schema_name}
|
|
ON "ds_{data_source_id}"."searchable_column_values"
|
|
USING hnsw (embedding halfvec_cosine_ops);
|
|
```
|
|
|
|
## Environment Variables
|
|
|
|
```bash
|
|
DATABASE_URL=postgresql://user:pass@host:port/db # Required
|
|
OPENAI_API_KEY=sk-... # Required
|
|
```
|
|
|
|
## Testing
|
|
|
|
```bash
|
|
# Unit tests (no external dependencies)
|
|
bun test src/__tests__/schemas.unit.test.ts
|
|
bun test src/__tests__/utils.unit.test.ts
|
|
bun test src/__tests__/search.unit.test.ts
|
|
|
|
# Integration tests (requires DATABASE_URL and OPENAI_API_KEY)
|
|
bun test src/__tests__/search.integration.test.ts
|
|
```
|
|
|
|
## Performance Notes
|
|
|
|
- **Embedding Dimensions**: Strictly enforced at 1536 (text-embedding-3-small)
|
|
- **Query Limits**: Maximum 1000 results per query, 100 targets per parallel search
|
|
- **Parameterized Queries**: All SQL uses parameterized queries for security
|
|
- **Connection Pooling**: Uses postgres.js connection pooling
|
|
- **Parallel Execution**: Multiple targets searched concurrently
|
|
|
|
## Migration from Rust
|
|
|
|
This TypeScript package implements the same functionality as `api/libs/stored_values` in Rust:
|
|
|
|
| Rust Function | TypeScript Function |
|
|
|---------------|-------------------|
|
|
| `search_values_by_embedding` | `searchValuesByEmbedding` |
|
|
| `search_values_by_embedding_with_filters` | `searchValuesByEmbeddingWithFilters` |
|
|
| `search_values_across_targets` | `searchValuesAcrossTargets` |
|
|
| `extract_searchable_columns_from_yaml` | `extractSearchableColumnsFromYaml` |
|
|
|
|
## Security Features
|
|
|
|
- ✅ **SQL Injection Protection**: All queries use parameterized statements
|
|
- ✅ **Input Validation**: Comprehensive Zod schema validation
|
|
- ✅ **UUID Validation**: Strict UUID format enforcement
|
|
- ✅ **Dimension Validation**: Embedding arrays must be exactly 1536 elements
|
|
- ✅ **Rate Limiting Support**: Built-in retry logic with exponential backoff
|
|
- ✅ **Type Safety**: Full TypeScript coverage with runtime validation |