Refactoring Metric Data Metadata Storage
Problem Statement ✅
The MetricYml structure currently contains a data_metadata field that stores column metadata information as part of the metric definition. This approach has several issues:
- The data metadata is derived from query results rather than being an intrinsic property of the metric definition
- It can become outdated when underlying data changes, as it's not automatically updated
- Including it in the metric definition adds unnecessary complexity to the metric YAML schema
- It requires redundant validation logic in multiple places
- Multiple implementations of metadata calculation exist in different parts of the codebase (get_metric_data_handler.rs, run_sql.rs)
We need to establish a single source of truth for data metadata by making the query_engine directly responsible for metadata calculation, treating DataMetadata as a first-class database object, and ensuring it's always returned with every query result.
By moving data metadata from the metric definition to a database column on the metric_files table and standardizing its calculation in the query engine, we'll separate the metric definition from its execution results, maintain a cleaner YAML schema for metrics, and provide a more reliable caching mechanism for metadata.
Current Limitations
- Data metadata in MetricYml can become stale when source data changes
- Validation requires redundant computation of metadata in multiple places
- METRIC_YML_SCHEMA includes unnecessary complexity for data_metadata validation
- Users may be confused about whether to update data_metadata manually
- Different code paths calculate metadata inconsistently (get_metric_data_handler.rs vs run_sql.rs)
- Duplicate code exists for metadata calculations
Impact
- User Impact: Cleaner metric definition format, more reliable metadata information in UI
- System Impact: More efficient caching of query metadata, reduced redundancy, consistent metadata calculations
- Business Impact: Improved reliability of metric visualizations and metadata analysis
- Developer Impact: Reduced code duplication, single source of truth for metadata
Requirements
Functional Requirements ✅
Core Functionality
- Remove data_metadata field from MetricYml struct
- Details: Field should be completely removed from the structure
- Acceptance Criteria: All metric-related code builds and functions without the field
- Dependencies: None
- Make DataMetadata a first-class database type with FromSql/ToSql implementations
- Details: Create a proper Rust type with serialization/deserialization support
- Acceptance Criteria: DataMetadata can be stored and retrieved from database
- Dependencies: None
- Add data_metadata column to metric_files table
- Details: A new JSONB column added to the metric_files table to store cached metadata
- Acceptance Criteria: Successful migration that adds the column
- Dependencies: None
- Enhance query_engine to always return data metadata with results
- Details: The query_engine should compute and return metadata for every query
- Acceptance Criteria: query_engine returns both results and metadata in a single structure
- Dependencies: None
- Consolidate metadata calculation logic from run_sql.rs and get_metric_data_handler.rs
- Details: Move all metadata calculation to the query_engine
- Acceptance Criteria: All metadata calculations use a single implementation
- Dependencies: None
- Update validate_sql to store computed metadata in the metric file
- Details: When validating SQL, save the computed metadata to the metric file record
- Acceptance Criteria: Validation updates the data_metadata field with current metadata
- Dependencies: data_metadata column added to metric_files
- Update create_metrics.rs and update_metric.rs to handle data_metadata in the database
- Details: Any SQL changes should trigger metadata recalculation and storage
- Acceptance Criteria: Metadata is correctly updated when SQL is modified
- Dependencies: query_engine metadata calculation
- Continue saving data_metadata in version_history for metrics
- Details: Include data_metadata in version history to preserve historical metadata
- Acceptance Criteria: Version history preserves data_metadata for each version
- Dependencies: None
Non-Functional Requirements ✅
- Performance Requirements
- Metadata computation should not significantly impact query performance
- Accessing cached metadata should be faster than recomputing it
- Security Requirements
- No additional security requirements (using existing table and permissions)
- Scalability Requirements
- Metadata should be compact enough to not significantly increase database size
Technical Design ✅
System Architecture
graph TD
A[Client Request] --> B[REST/WS API]
B --> C[Handlers]
C --> D[query_engine]
D -->|QueryResult with data & metadata| C
C -->|Store Metadata| E[metric_files table]
E -->|Cached Metadata| C
C -->|Response| B
B --> A
Core Components ✅
Component 1: Updated MetricYml Structure
// Remove data_metadata from MetricYml
#[derive(Debug, Serialize, Deserialize, Clone, FromSqlRow, AsExpression)]
#[diesel(sql_type = Jsonb)]
pub struct MetricYml {
    pub name: String,
    pub description: Option<String>,
    pub time_frame: String,
    pub sql: String,
    pub chart_config: ChartConfig,
    // data_metadata field removed
    pub dataset_ids: Vec<Uuid>,
}
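For reference, a metric definition after this change carries only definitional fields. A hypothetical example (the field values and chart_config contents are illustrative, not taken from a real metric):

name: Monthly Active Users
description: Count of distinct users active in the last 30 days
time_frame: last_30_days
sql: |
  SELECT COUNT(DISTINCT user_id)
  FROM events
  WHERE event_time > NOW() - INTERVAL '30 days'
chart_config:
  type: line   # chart configuration abbreviated; unchanged by this refactor
dataset_ids:
  - 11111111-1111-1111-1111-111111111111
# note: no data_metadata key - it now lives on the metric_files row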
Component 2: DataMetadata as First-Class Type
// First-class DataMetadata type with database serialization
use std::io::Write;

use diesel::deserialize::FromSql;
use diesel::pg::{Pg, PgValue};
use diesel::prelude::*;
use diesel::serialize::{IsNull, Output, ToSql};
use diesel::sql_types::Jsonb;
use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize, Clone, FromSqlRow, AsExpression)]
#[diesel(sql_type = Jsonb)]
pub struct DataMetadata {
    pub column_count: i64,
    pub row_count: i64,
    pub column_metadata: Vec<ColumnMetaData>,
}

// Column metadata type
#[derive(Debug, Serialize, Deserialize, Clone)]
pub struct ColumnMetaData {
    pub name: String,
    pub min_value: serde_json::Value,
    pub max_value: serde_json::Value,
    pub unique_values: i32,
    pub simple_type: SimpleType,
    #[serde(rename = "type")]
    pub column_type: ColumnType,
}

// Implement FromSql and ToSql for database serialization
impl FromSql<Jsonb, Pg> for DataMetadata {
    fn from_sql(bytes: PgValue) -> diesel::deserialize::Result<Self> {
        let value = <serde_json::Value as FromSql<Jsonb, Pg>>::from_sql(bytes)?;
        Ok(serde_json::from_value(value)?)
    }
}

impl ToSql<Jsonb, Pg> for DataMetadata {
    fn to_sql<'b>(&'b self, out: &mut Output<'b, '_, Pg>) -> diesel::serialize::Result {
        out.write_all(&[1])?; // JSONB version 1 header
        out.write_all(&serde_json::to_vec(self)?)?;
        Ok(IsNull::No)
    }
}
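With these impls in place, the column round-trips as a typed value rather than raw serde_json::Value. A minimal read sketch, assuming diesel_async as used elsewhere in this plan (load_cached_metadata is an illustrative helper, not an existing function):

use diesel::prelude::*;
use diesel_async::{AsyncPgConnection, RunQueryDsl};
use uuid::Uuid;

// Sketch: the Nullable<Jsonb> column deserializes straight into
// Option<DataMetadata> thanks to the FromSql impl above.
async fn load_cached_metadata(
    conn: &mut AsyncPgConnection,
    metric_id: Uuid,
) -> anyhow::Result<Option<DataMetadata>> {
    let metadata = metric_files::table
        .filter(metric_files::id.eq(metric_id))
        .select(metric_files::data_metadata)
        .first::<Option<DataMetadata>>(conn)
        .await?;
    Ok(metadata)
}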
Component 3: Updated metric_files Schema
// Updated MetricFile model
#[derive(Queryable, Insertable, Identifiable, Debug, Clone, Serialize)]
#[diesel(table_name = metric_files)]
pub struct MetricFile {
    pub id: Uuid,
    pub name: String,
    pub file_name: String,
    pub content: MetricYml,
    pub verification: Verification,
    pub evaluation_obj: Option<Value>,
    pub evaluation_summary: Option<String>,
    pub evaluation_score: Option<f64>,
    pub organization_id: Uuid,
    pub created_by: Uuid,
    pub created_at: DateTime<Utc>,
    pub updated_at: DateTime<Utc>,
    pub deleted_at: Option<DateTime<Utc>>,
    pub publicly_accessible: bool,
    pub publicly_enabled_by: Option<Uuid>,
    pub public_expiry_date: Option<DateTime<Utc>>,
    pub version_history: VersionHistory,
    pub data_metadata: Option<DataMetadata>, // Changed to strongly typed DataMetadata
}
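For reference, the regenerated schema.rs entry gains a nullable JSONB column. A sketch with most columns elided (the real file comes from diesel print-schema):

// Sketch of the regenerated schema entry; existing columns abbreviated.
diesel::table! {
    metric_files (id) {
        id -> Uuid,
        name -> Text,
        // ... remaining existing columns ...
        version_history -> Jsonb,
        data_metadata -> Nullable<Jsonb>,
    }
}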
Component 4: Updated query_engine Interface
// Define a QueryResult structure to hold both results and metadata
#[derive(Debug, Clone)]
pub struct QueryResult {
    pub data: Vec<IndexMap<String, DataType>>,
    pub metadata: DataMetadata,
}

// Enhanced query_engine that always computes and returns metadata
pub async fn query_engine(
    data_source_id: &Uuid,
    sql: &str,
    limit: Option<i64>,
) -> Result<QueryResult> {
    // Run the safety filter before touching the data source
    let secure_sql = sql.to_owned();
    if let Some(warning) = query_safety_filter(secure_sql.clone()).await {
        return Err(anyhow!(warning));
    }

    let results = match route_to_query(data_source_id, &secure_sql, limit).await {
        Ok(results) => results,
        Err(e) => {
            tracing::error!("There was an issue while querying the data source: {}", e);
            return Err(anyhow!(e));
        }
    };

    // Always compute metadata from results - consolidated from run_sql.rs implementation
    let metadata = compute_data_metadata(&results);

    // Return both results and metadata in the QueryResult structure
    Ok(QueryResult {
        data: results,
        metadata,
    })
}
// Consolidated metadata calculation function based on run_sql.rs implementation
fn compute_data_metadata(data: &[IndexMap<String, DataType>]) -> DataMetadata {
    if data.is_empty() {
        return DataMetadata {
            column_count: 0,
            row_count: 0,
            column_metadata: vec![],
        };
    }

    let first_row = &data[0];
    let column_count = first_row.len() as i64;
    let row_count = data.len() as i64;
    let column_metadata = compute_column_metadata(data);

    DataMetadata {
        column_count,
        row_count,
        column_metadata,
    }
}
// Helper function for computing column metadata
fn compute_column_metadata(data: &[IndexMap<String, DataType>]) -> Vec<ColumnMetaData> {
    if data.is_empty() {
        return vec![];
    }

    let first_row = &data[0];
    let columns: Vec<_> = first_row.keys().cloned().collect();

    // Use the existing column metadata calculation from run_sql.rs. This maintains
    // compatibility with the current implementation while centralizing the logic
    // in the query_engine.
    columns
        .iter()
        .map(|column_name| {
            let mut value_map = HashSet::new();
            let mut min_value = None;
            let mut max_value = None;
            let mut is_date_type = false;
            let mut min_value_str: Option<String> = None;
            let mut max_value_str: Option<String> = None;

            for row in data {
                if let Some(value) = row.get(column_name) {
                    // Track unique values (up to a reasonable limit)
                    if value_map.len() < 100 {
                        value_map.insert(format!("{:?}", value));
                    }

                    // Calculate min/max for appropriate types. Each numeric variant
                    // needs its own arm: an or-pattern can't bind one name to
                    // payloads of different types (i16, i32, i64, f32, f64).
                    match value {
                        DataType::Int2(Some(v)) => {
                            let n = *v as f64;
                            min_value = Some(min_value.map_or(n, |min: f64| min.min(n)));
                            max_value = Some(max_value.map_or(n, |max: f64| max.max(n)));
                        }
                        DataType::Int4(Some(v)) => {
                            let n = *v as f64;
                            min_value = Some(min_value.map_or(n, |min: f64| min.min(n)));
                            max_value = Some(max_value.map_or(n, |max: f64| max.max(n)));
                        }
                        DataType::Int8(Some(v)) => {
                            let n = *v as f64;
                            min_value = Some(min_value.map_or(n, |min: f64| min.min(n)));
                            max_value = Some(max_value.map_or(n, |max: f64| max.max(n)));
                        }
                        DataType::Float4(Some(v)) => {
                            let n = *v as f64;
                            min_value = Some(min_value.map_or(n, |min: f64| min.min(n)));
                            max_value = Some(max_value.map_or(n, |max: f64| max.max(n)));
                        }
                        DataType::Float8(Some(v)) => {
                            let n = *v;
                            min_value = Some(min_value.map_or(n, |min: f64| min.min(n)));
                            max_value = Some(max_value.map_or(n, |max: f64| max.max(n)));
                        }
                        DataType::Date(Some(date)) => {
                            is_date_type = true;
                            let date_str = date.to_string();
                            update_date_min_max(&date_str, &mut min_value_str, &mut max_value_str);
                        }
                        DataType::Timestamp(Some(ts)) => {
                            is_date_type = true;
                            let ts_str = ts.to_string();
                            update_date_min_max(&ts_str, &mut min_value_str, &mut max_value_str);
                        }
                        DataType::Timestamptz(Some(ts)) => {
                            is_date_type = true;
                            let ts_str = ts.to_string();
                            update_date_min_max(&ts_str, &mut min_value_str, &mut max_value_str);
                        }
                        _ => {}
                    }
                }
            }

            // Determine the column type and simple type from the first row's value
            let sample_value = first_row.get(column_name).unwrap();
            let (simple_type, column_type) = determine_types(sample_value);

            // Format min/max values appropriately based on type. Number::from_f64
            // returns None for NaN/infinity, so fall back to Null (serde_json::Number
            // has no Default impl, so unwrap_or_default would not compile here).
            let (min_value, max_value) = if is_date_type {
                (
                    min_value_str.map_or(serde_json::Value::Null, serde_json::Value::String),
                    max_value_str.map_or(serde_json::Value::Null, serde_json::Value::String),
                )
            } else {
                (
                    min_value
                        .and_then(serde_json::Number::from_f64)
                        .map_or(serde_json::Value::Null, serde_json::Value::Number),
                    max_value
                        .and_then(serde_json::Number::from_f64)
                        .map_or(serde_json::Value::Null, serde_json::Value::Number),
                )
            };

            ColumnMetaData {
                name: column_name.clone(),
                min_value,
                max_value,
                unique_values: value_map.len() as i32,
                simple_type,
                column_type,
            }
        })
        .collect()
}
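compute_column_metadata leans on two helpers that are referenced but not shown above. The sketches below are plausible implementations under stated assumptions: the SimpleType/ColumnType variants are illustrative, not the real enum definitions.

// Assumed enums; the real variants live alongside the query_engine types.
#[derive(Debug, Serialize, Deserialize, Clone)]
pub enum SimpleType { Number, Date, Boolean, String }

#[derive(Debug, Serialize, Deserialize, Clone)]
pub enum ColumnType { Number, Date, Boolean, Text }

// ISO-8601 date/timestamp strings sort lexicographically in chronological
// order, so plain string comparison suffices for min/max tracking.
fn update_date_min_max(value: &str, min: &mut Option<String>, max: &mut Option<String>) {
    if min.as_deref().map_or(true, |m| value < m) {
        *min = Some(value.to_string());
    }
    if max.as_deref().map_or(true, |m| value > m) {
        *max = Some(value.to_string());
    }
}

// Classify a sample value into (SimpleType, ColumnType).
fn determine_types(value: &DataType) -> (SimpleType, ColumnType) {
    match value {
        DataType::Int2(_) | DataType::Int4(_) | DataType::Int8(_)
        | DataType::Float4(_) | DataType::Float8(_) => (SimpleType::Number, ColumnType::Number),
        DataType::Date(_) | DataType::Timestamp(_) | DataType::Timestamptz(_) => {
            (SimpleType::Date, ColumnType::Date)
        }
        DataType::Bool(_) => (SimpleType::Boolean, ColumnType::Boolean),
        _ => (SimpleType::String, ColumnType::Text),
    }
}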
Component 5: Updated validate_sql Function
pub async fn validate_sql(
    sql: &str,
    dataset_id: &Uuid,
) -> Result<(String, Vec<IndexMap<String, DataType>>, Option<DataMetadata>)> {
    // Get the data source for the dataset
    let mut conn = get_pg_pool().get().await?;
    let data_source_id = datasets::table
        .filter(datasets::id.eq(dataset_id))
        .select(datasets::data_source_id)
        .first::<Uuid>(&mut conn)
        .await?;

    // Execute the query and get results with metadata
    let query_result = query_engine(&data_source_id, sql, None).await?;

    let num_records = query_result.data.len();
    let message = if num_records == 0 {
        "No records were found".to_string()
    } else {
        format!("{} records were returned", num_records)
    };

    // Return the records only when there are 13 or fewer; larger result sets
    // are omitted from the response. Metadata is always returned.
    let return_records = if num_records <= 13 {
        query_result.data
    } else {
        Vec::new()
    };

    Ok((message, return_records, Some(query_result.metadata)))
}
// Updated MetricDataResponse structure to use DataMetadata
#[derive(Debug, Serialize)]
pub struct MetricDataResponse {
    pub metric_id: Uuid,
    pub data: Vec<IndexMap<String, DataType>>,
    pub data_metadata: DataMetadata, // Changed from MetricData to DataMetadata
}
Database Changes
-- Add data_metadata column to metric_files table
ALTER TABLE metric_files
ADD COLUMN data_metadata JSONB;
-- Create index for faster queries
CREATE INDEX metric_files_data_metadata_idx ON metric_files USING GIN (data_metadata);
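For illustration, the GIN index keeps containment queries over the cached metadata index-assisted. An example query (not part of the migration):

-- Illustrative: find metrics whose cached result set was empty.
-- The default jsonb_ops GIN operator class supports the @> containment operator.
SELECT id, name
FROM metric_files
WHERE data_metadata @> '{"row_count": 0}';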
File Changes
Modified Files
- libs/database/src/types/metric_yml.rs
- Changes: Remove data_metadata field and related functionality
- Impact: Simplifies metric definition structure
- Dependencies: None
- libs/database/src/types/data_metadata.rs (NEW)
- Changes: Create new file for DataMetadata first-class type with FromSql/ToSql impls
- Impact: Consistent metadata type across the codebase
- Dependencies: None
- libs/database/src/models.rs
- Changes: Add data_metadata field to MetricFile struct with strong typing
- Impact: Enables storage of metadata in the database
- Dependencies: DataMetadata type, Migration to add column
- libs/database/src/schema.rs
- Changes: Update metric_files table definition to include data_metadata
- Impact: Reflects database schema change
- Dependencies: Migration to add column
- libs/agents/src/tools/categories/file_tools/common.rs
- Changes: Update METRIC_YML_SCHEMA to remove data_metadata, update validate_sql
- Impact: Simplifies schema validation for metrics, uses new DataMetadata type
- Dependencies: DataMetadata type
- libs/query_engine/src/data_source_query_routes/query_engine.rs
- Changes: Create QueryResult struct, enhance query_engine to always compute metadata
- Impact: Centralizes metadata computation, ensures it's always available
- Dependencies: DataMetadata type
- libs/handlers/src/metrics/get_metric_data_handler.rs
- Changes: Remove metadata computation, use metadata from QueryResult
- Impact: Removes duplicate code, uses consistent metadata
- Dependencies: Enhanced query_engine
- libs/handlers/src/metrics/update_metric_handler.rs
- Changes: Update data_metadata when SQL changes
- Impact: Ensures metadata is updated when metric definition changes
- Dependencies: Enhanced query_engine
- libs/agents/src/tools/categories/file_tools/create_metrics.rs
- Changes: Add data_metadata to created metric files
- Impact: Ensures metadata is stored during metric creation
- Dependencies: Enhanced query_engine
- server/src/routes/rest/routes/sql/run_sql.rs
- Changes: Remove duplicate metadata computation, use metadata from QueryResult
- Impact: Removes duplicate code, uses consistent metadata
- Dependencies: Enhanced query_engine
- server/src/routes/ws/sql/run_sql.rs
- Changes: Remove duplicate metadata computation, use metadata from QueryResult
- Impact: Removes duplicate code, uses consistent metadata
- Dependencies: Enhanced query_engine
New Files
- libs/database/src/types/data_metadata.rs
- Purpose: Define DataMetadata as a first-class type with serialization
- Key components: DataMetadata struct, FromSql/ToSql impls
- Dependencies: None
- migrations/YYYY-MM-DD-HHMMSS_add_data_metadata_to_metric_files/up.sql
- Purpose: Add data_metadata column to metric_files table
- Key components: SQL to add column and index
- Dependencies: None
- migrations/YYYY-MM-DD-HHMMSS_add_data_metadata_to_metric_files/down.sql
- Purpose: Remove data_metadata column from metric_files table
- Key components: SQL to drop column and index
- Dependencies: None
Implementation Plan
Phase 1: Database Schema Update 🔜 (Not Started)
- Create migration to add data_metadata column
- Generate migration files using diesel CLI:
cd /path/to/project
diesel migration generate add_data_metadata_to_metric_files
- Add the following SQL to the up.sql file:
-- Add data_metadata column to metric_files table
ALTER TABLE metric_files ADD COLUMN data_metadata JSONB;

-- Create index for faster queries
CREATE INDEX metric_files_data_metadata_idx ON metric_files USING GIN (data_metadata);
- Add the following SQL to the down.sql file:
-- Drop index first
DROP INDEX IF EXISTS metric_files_data_metadata_idx;

-- Drop column
ALTER TABLE metric_files DROP COLUMN IF EXISTS data_metadata;
- Run the migration:
diesel migration run
- Update database-related structs
- Update models.rs with new field
- Update version_history.rs to include data_metadata for metric_file versions
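One possible shape for that version entry (a sketch; MetricVersionContent and its fields are assumptions about version_history.rs, not its current definitions):

// Assumed shape: each saved version snapshots the definition together with
// the metadata computed at save time, so history stays self-describing even
// as source data drifts.
#[derive(Debug, Serialize, Deserialize, Clone)]
pub struct MetricVersionContent {
    pub content: MetricYml,
    pub data_metadata: Option<DataMetadata>,
}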
Phase 2: Metadata Computation 🔜 (Not Started)
- Enhance query_engine
- Create QueryResult structure to hold results and metadata
- Update query_engine function to compute and return metadata
- Implement compute_column_metadata helper function
- Add tests for metadata computation
- Update validate_sql function
- Modify to use enhanced query_engine and extract metadata
- Add code to return metadata for metric storage
- Update tests
Phase 3: Metric Definition Update 🔜 (Not Started)
- Update MetricYml structure
- Remove data_metadata field
- Update related methods and tests
- Update METRIC_YML_SCHEMA to remove data_metadata
- Update handlers to use cached metadata
- Modify get_metric_data_handler to use cached metadata (see the sketch after this list)
- Update update_metric_handler to recalculate metadata when SQL changes
- Update tests
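A possible shape for the cached-metadata path referenced above, assuming diesel_async; resolve_metadata and its exact signature are illustrative, not the real handler API:

use anyhow::Result;
use diesel::prelude::*;
use diesel_async::{AsyncPgConnection, RunQueryDsl};
use uuid::Uuid;

// Sketch: prefer the cached column and fall back to the consolidated
// query_engine path on a miss (e.g. metrics created before the backfill runs).
async fn resolve_metadata(
    conn: &mut AsyncPgConnection,
    metric: &MetricFile,
    data_source_id: &Uuid,
    sql: &str,
) -> Result<DataMetadata> {
    if let Some(cached) = &metric.data_metadata {
        return Ok(cached.clone());
    }

    // Cache miss: compute once, persist for subsequent reads, then return it.
    let result = query_engine(data_source_id, sql, None).await?;
    diesel::update(metric_files::table)
        .filter(metric_files::id.eq(metric.id))
        .set(metric_files::data_metadata.eq(Some(result.metadata.clone())))
        .execute(conn)
        .await?;

    Ok(result.metadata)
}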
Phase 4: Agent Tool Updates 🔜 (Not Started)
- Update metric creation and modification tools
- Modify process_metric_file to include metadata
- Update create_metrics.rs to store metadata
- Update tests
- Migration for existing metrics
- Create script to calculate and store metadata for existing metrics
- Test migration on staging
- Run migration in production
- Documentation and cleanup
- Update API documentation
- Add comments explaining metadata handling
- Remove any leftover references to data_metadata in metric YAML files
Testing Strategy
Unit Tests
- Test metadata computation with various data types (see the sketch after this list)
- Test validate_sql with both valid and invalid queries
- Test MetricYml serialization and deserialization without data_metadata
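A minimal sketch of the first test above, exercising compute_data_metadata directly (values and assertions are illustrative; it assumes the DataType variants used in compute_column_metadata):

#[cfg(test)]
mod tests {
    use super::*;
    use indexmap::IndexMap;

    // Two rows, one numeric column: counts and min/max should reflect the data.
    #[test]
    fn computes_counts_and_numeric_min_max() {
        let mut row1 = IndexMap::new();
        row1.insert("amount".to_string(), DataType::Int4(Some(10)));
        let mut row2 = IndexMap::new();
        row2.insert("amount".to_string(), DataType::Int4(Some(3)));

        let metadata = compute_data_metadata(&[row1, row2]);
        assert_eq!(metadata.row_count, 2);
        assert_eq!(metadata.column_count, 1);
        assert_eq!(metadata.column_metadata[0].min_value, serde_json::json!(3.0));
        assert_eq!(metadata.column_metadata[0].max_value, serde_json::json!(10.0));
    }
}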
Integration Tests
- Test end-to-end flow from query to metadata storage
- Test metric creation with metadata computation
- Test metric update with metadata recalculation
Database Migration Tests
- Test migration on a copy of production data
- Verify data_metadata can be computed for existing metrics
- Test rollback procedure
Migration Procedure
Database Migration
- Generate Migration Files
cd /path/to/project
diesel migration generate add_data_metadata_to_metric_files
- Implement Migration Files
up.sql:
-- Add data_metadata column to metric_files table
ALTER TABLE metric_files ADD COLUMN data_metadata JSONB;

-- Create index for faster queries
CREATE INDEX metric_files_data_metadata_idx ON metric_files USING GIN (data_metadata);
down.sql:
-- Drop index first
DROP INDEX IF EXISTS metric_files_data_metadata_idx;

-- Drop column
ALTER TABLE metric_files DROP COLUMN IF EXISTS data_metadata;
- Run Migration
diesel migration run
- Verify Migration
SELECT column_name, data_type
FROM information_schema.columns
WHERE table_name = 'metric_files' AND column_name = 'data_metadata';
Code Updates
- Update Diesel Schema
diesel print-schema > libs/database/src/schema.rs
- Adapt Database Models
- Add data_metadata field to MetricFile struct in models.rs
- Remove data_metadata field from MetricYml in metric_yml.rs
- Update version_history.rs to handle data_metadata in versions
- Update Query Engine
- Create QueryResult structure in query_engine.rs
- Enhance existing query_engine function to calculate and return metadata
- Implement compute_column_metadata helper function
- Add tests for metadata computation
- Update Handlers and Tools
- Update validate_sql to use enhanced query_engine and extract metadata
- Modify get_metric_data_handler to use cached metadata when available
- Update update_metric_handler to recalculate metadata when SQL changes
- Update create_metrics.rs to store metadata for new metrics
Backfill Data for Existing Metrics
A migration script will be needed to populate the data_metadata column for existing metrics. This script will:
- Query for Existing Metrics
let metrics = metric_files::table
    .filter(metric_files::deleted_at.is_null())
    .load::<MetricFile>(&mut conn)
    .await?;
- Process Each Metric
for metric in metrics {
    // Parse YAML content
    let metric_yml: MetricYml = metric.content;

    // Get dataset IDs (fall back to the default when none are set)
    let dataset_id = metric_yml.dataset_ids.first().unwrap_or(&default_dataset_id);

    // Get data source for dataset
    let data_source_id = get_data_source_for_dataset(dataset_id).await?;

    // Execute query with metadata calculation
    let query_result = query_engine(&data_source_id, &metric_yml.sql, None).await?;

    // Update metric record with metadata
    diesel::update(metric_files::table)
        .filter(metric_files::id.eq(metric.id))
        .set(metric_files::data_metadata.eq(serde_json::to_value(query_result.metadata)?))
        .execute(&mut conn)
        .await?;
}
- Backfill Deployment Strategy
- Run script during off-peak hours
- Use transaction batching (for example, process 100 metrics per transaction; see the sketch after this list)
- Add logging and resumability in case of interruption
- Monitor database performance during migration
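A sketch of the batching referenced above, assuming diesel_async transactions; backfill_one is a hypothetical helper wrapping the per-metric steps shown earlier:

use diesel_async::scoped_futures::ScopedFutureExt;
use diesel_async::AsyncConnection;

// Process metrics in fixed-size batches, one transaction per batch, logging
// progress so an interrupted run can be resumed from the last logged batch.
for batch in metrics.chunks(100) {
    conn.transaction::<_, anyhow::Error, _>(|conn| {
        async move {
            for metric in batch {
                backfill_one(conn, metric).await?; // hypothetical per-metric helper
            }
            Ok(())
        }
        .scope_boxed()
    })
    .await?;
    tracing::info!(batch_size = batch.len(), "backfilled metric batch");
}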
Deployment Strategy
- Pre-Deployment
- Run all tests on staging environment
- Verify migrations work correctly
- Ensure backfill script completes successfully
- Check for performance issues
- Deployment Order
- Deploy database migration
- Deploy code changes (compatible with both old and new schema)
- Run backfill script
- Deploy final code changes (removing old schema support)
- Rollback Strategy
- If issues are detected after migration but before backfill: roll back code changes, then run down.sql
- If issues are detected after backfill: restore from backup or reverse data changes programmatically
Security Considerations
- No additional security concerns (using existing database permissions)
Rollback Plan
- Database migration includes down.sql for rollback
- Code can be reverted if issues are found
- Temporary dual support for both approaches during transition
Monitoring
- Add logs for metadata computation and caching
- Monitor query performance to ensure minimal impact
- Track database size to ensure metadata doesn't cause significant growth
Success Criteria
- All metrics have data_metadata stored in the database
- MetricYml structures no longer contain data_metadata field
- Visualization in UI uses cached metadata correctly
- Performance equal to or better than previous implementation