data metadata impl

dal 2025-04-02 16:19:34 -06:00
parent 9a3b659212
commit 90a11bc175
2 changed files with 213 additions and 83 deletions

View File

@ -1,8 +1,10 @@
ok I need you to implement the prds/active/$ARGUMENTS.md prd following it in
order and accomplishing the tasks while referencing the prd, its notes, and recommendations.
make sure to mark off completed tasks as you go.
you should follow best practices as described in documentation/ for database migrations, testing, handlers,
etc. Please analyze them before you modify a file.
please analyze all files before proceeding with any implementations.

View File

@ -8,19 +8,25 @@ The `MetricYml` structure currently contains a `data_metadata` field that stores
2. It can become outdated when underlying data changes, as it's not automatically updated
3. Including it in the metric definition adds unnecessary complexity to the metric YAML schema
4. It requires redundant validation logic in multiple places
5. Multiple implementations of metadata calculation exist in different parts of the codebase (get_metric_data_handler.rs, run_sql.rs)

**We need to establish a single source of truth for data metadata** by making the query_engine directly responsible for metadata calculation, treating DataMetadata as a first-class database object, and ensuring it's always returned with every query result.

By moving data metadata from the metric definition to a database column on the `metric_files` table and standardizing its calculation in the query engine, we'll separate the metric definition from its execution results, maintain a cleaner YAML schema for metrics, and provide a more reliable caching mechanism for metadata.

### Current Limitations

- Data metadata in MetricYml can become stale when source data changes
- Validation requires redundant computation of metadata in multiple places
- `METRIC_YML_SCHEMA` includes unnecessary complexity for data_metadata validation
- Users may be confused about whether to update data_metadata manually
- Different code paths calculate metadata inconsistently (get_metric_data_handler.rs vs run_sql.rs)
- Duplicate code exists for metadata calculations

### Impact

- User Impact: Cleaner metric definition format, more reliable metadata information in UI
- System Impact: More efficient caching of query metadata, reduced redundancy, consistent metadata calculations
- Business Impact: Improved reliability of metric visualizations and metadata analysis
- Developer Impact: Reduced code duplication, single source of truth for metadata

## Requirements
@ -32,14 +38,24 @@ By moving data metadata from the metric definition to a database column on the `
  - Acceptance Criteria: All metric-related code builds and functions without the field
  - Dependencies: None
- Make `DataMetadata` a first-class database type with FromSql/ToSql implementations
  - Details: Create a proper Rust type with serialization/deserialization support
  - Acceptance Criteria: DataMetadata can be stored and retrieved from database
  - Dependencies: None
- Add `data_metadata` column to `metric_files` table
  - Details: A new JSONB column added to the metric_files table to store cached metadata
  - Acceptance Criteria: Successful migration that adds the column
  - Dependencies: None
- Enhance query_engine to always return data metadata with results
  - Details: The query_engine should compute and return metadata for every query
  - Acceptance Criteria: query_engine returns both results and metadata in a single structure
  - Dependencies: None
- Consolidate metadata calculation logic from run_sql.rs and get_metric_data_handler.rs
  - Details: Move all metadata calculation to the query_engine
  - Acceptance Criteria: All metadata calculations use a single implementation
  - Dependencies: None
- Update validate_sql to store computed metadata in the metric file
@ -73,10 +89,10 @@ By moving data metadata from the metric definition to a database column on the `
```mermaid
graph TD
    A[Client Request] --> B[REST/WS API]
    B --> C[Handlers]
    C --> D[query_engine]
    D -->|QueryResult with data & metadata| C
    C -->|Store Metadata| E[metric_files table]
    E -->|Cached Metadata| C
    C -->|Response| B
@ -101,7 +117,47 @@ pub struct MetricYml {
}
```

#### Component 2: DataMetadata as First-Class Type

```rust
// First-class DataMetadata type with database serialization
use std::io::Write;

use diesel::deserialize::{FromSql, FromSqlRow};
use diesel::expression::AsExpression;
use diesel::pg::{Pg, PgValue};
use diesel::serialize::{IsNull, Output, ToSql};
use diesel::sql_types::Jsonb;
use serde::{Deserialize, Serialize};

#[derive(Debug, Serialize, Deserialize, Clone, FromSqlRow, AsExpression)]
#[diesel(sql_type = Jsonb)]
pub struct DataMetadata {
    pub column_count: i64,
    pub row_count: i64,
    pub column_metadata: Vec<ColumnMetaData>,
}

// Column metadata type (SimpleType and ColumnType are the existing enums
// from the codebase)
#[derive(Debug, Serialize, Deserialize, Clone)]
pub struct ColumnMetaData {
    pub name: String,
    pub min_value: serde_json::Value,
    pub max_value: serde_json::Value,
    pub unique_values: i32,
    pub simple_type: SimpleType,
    #[serde(rename = "type")]
    pub column_type: ColumnType,
}

// Implement FromSql and ToSql for database serialization
impl FromSql<Jsonb, Pg> for DataMetadata {
    fn from_sql(bytes: PgValue<'_>) -> diesel::deserialize::Result<Self> {
        let value = <serde_json::Value as FromSql<Jsonb, Pg>>::from_sql(bytes)?;
        Ok(serde_json::from_value(value)?)
    }
}

impl ToSql<Jsonb, Pg> for DataMetadata {
    fn to_sql<'b>(&'b self, out: &mut Output<'b, '_, Pg>) -> diesel::serialize::Result {
        out.write_all(&[1])?; // JSONB binary format: version byte, then JSON text
        out.write_all(&serde_json::to_vec(self)?)?;
        Ok(IsNull::No)
    }
}
```
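
As a quick, illustrative check of the serialization pair (a sketch, not part of the PRD's task list): since `FromSql` and `ToSql` both delegate to `serde_json`, a plain JSON round-trip exercises the same logic.

```rust
#[cfg(test)]
mod tests {
    use super::*;

    // Hypothetical unit test; DataMetadata round-trips through serde_json,
    // which is exactly the path the FromSql/ToSql impls above rely on.
    #[test]
    fn data_metadata_round_trips_through_json() {
        let metadata = DataMetadata {
            column_count: 0,
            row_count: 0,
            column_metadata: vec![],
        };

        let value = serde_json::to_value(&metadata).unwrap();
        let back: DataMetadata = serde_json::from_value(value).unwrap();
        assert_eq!(back.column_count, 0);
        assert_eq!(back.row_count, 0);
    }
}
```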
#### Component 3: Updated metric_files Schema

```rust
// Updated MetricFile model
#[derive(Queryable, Insertable, Identifiable, Debug, Clone, Serialize)]
@ -124,19 +180,11 @@ pub struct MetricFile {
    pub publicly_enabled_by: Option<Uuid>,
    pub public_expiry_date: Option<DateTime<Utc>>,
    pub version_history: VersionHistory,
    pub data_metadata: Option<DataMetadata>, // Changed to strongly typed DataMetadata
}
```
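
The corresponding `schema.rs` fragment would look roughly like the sketch below, assuming the migration adds the column as nullable JSONB; all pre-existing columns are elided.

```rust
// Hypothetical schema.rs fragment; only data_metadata is new.
diesel::table! {
    metric_files (id) {
        id -> Uuid,
        // ... existing columns ...
        version_history -> Jsonb,
        data_metadata -> Nullable<Jsonb>, // new cached-metadata column
    }
}
```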
#### Component 4: Updated query_engine Interface
```rust
// Define a QueryResult structure to hold both results and metadata
#[derive(Debug, Clone)]
@ -145,7 +193,7 @@ pub struct QueryResult {
    pub metadata: DataMetadata,
}

// Enhanced query_engine that always computes and returns metadata
pub async fn query_engine(
    data_source_id: &Uuid,
    sql: &str,
@ -169,70 +217,135 @@ pub async fn query_engine(
        }
    };

    // Always compute metadata from results - consolidated from run_sql.rs implementation
    let metadata = compute_data_metadata(&results);

    // Return both results and metadata in the QueryResult structure
    Ok(QueryResult {
        data: results,
        metadata,
    })
}

// Consolidated metadata calculation function based on run_sql.rs implementation
fn compute_data_metadata(data: &[IndexMap<String, DataType>]) -> DataMetadata {
    if data.is_empty() {
        return DataMetadata {
            column_count: 0,
            row_count: 0,
            column_metadata: vec![],
        };
    }

    let first_row = &data[0];
    let column_count = first_row.len() as i64;
    let row_count = data.len() as i64;
    let column_metadata = compute_column_metadata(data);

    DataMetadata {
        column_count,
        row_count,
        column_metadata,
    }
}

// Helper function for computing column metadata
fn compute_column_metadata(data: &[IndexMap<String, DataType>]) -> Vec<ColumnMetaData> {
    if data.is_empty() {
        return vec![];
    }

    let first_row = &data[0];
    let columns: Vec<_> = first_row.keys().cloned().collect();

    // Use existing column metadata calculation from run_sql.rs
    // This maintains compatibility with the current implementation while
    // centralizing the logic in the query_engine
    columns.iter().map(|column_name| {
        let mut value_map = HashSet::new();
        let mut min_value = None;
        let mut max_value = None;
        let mut is_date_type = false;
        let mut min_value_str: Option<String> = None;
        let mut max_value_str: Option<String> = None;

        for row in data {
            if let Some(value) = row.get(column_name) {
                // Track unique values (up to a reasonable limit)
                if value_map.len() < 100 {
                    value_map.insert(format!("{:?}", value));
                }

                // Calculate min/max for appropriate types
                // (arms kept separate so each binding has a concrete type)
                match value {
                    DataType::Int2(Some(v)) => {
                        let n = *v as f64;
                        min_value = Some(min_value.map_or(n, |min: f64| min.min(n)));
                        max_value = Some(max_value.map_or(n, |max: f64| max.max(n)));
                    }
                    DataType::Int4(Some(v)) => {
                        let n = *v as f64;
                        min_value = Some(min_value.map_or(n, |min: f64| min.min(n)));
                        max_value = Some(max_value.map_or(n, |max: f64| max.max(n)));
                    }
                    DataType::Int8(Some(v)) => {
                        let n = *v as f64;
                        min_value = Some(min_value.map_or(n, |min: f64| min.min(n)));
                        max_value = Some(max_value.map_or(n, |max: f64| max.max(n)));
                    }
                    DataType::Float4(Some(v)) => {
                        let n = *v as f64;
                        min_value = Some(min_value.map_or(n, |min: f64| min.min(n)));
                        max_value = Some(max_value.map_or(n, |max: f64| max.max(n)));
                    }
                    DataType::Float8(Some(v)) => {
                        let n = *v;
                        min_value = Some(min_value.map_or(n, |min: f64| min.min(n)));
                        max_value = Some(max_value.map_or(n, |max: f64| max.max(n)));
                    }
                    DataType::Date(Some(date)) => {
                        is_date_type = true;
                        let date_str = date.to_string();
                        update_date_min_max(&date_str, &mut min_value_str, &mut max_value_str);
                    }
                    DataType::Timestamp(Some(ts)) => {
                        is_date_type = true;
                        let ts_str = ts.to_string();
                        update_date_min_max(&ts_str, &mut min_value_str, &mut max_value_str);
                    }
                    DataType::Timestamptz(Some(ts)) => {
                        is_date_type = true;
                        let ts_str = ts.to_string();
                        update_date_min_max(&ts_str, &mut min_value_str, &mut max_value_str);
                    }
                    _ => {}
                }
            }
        }

        // Determine the column type and simple type from a sample value
        let sample_value = first_row.get(column_name).unwrap();
        let (simple_type, column_type) = determine_types(sample_value);

        // Format min/max values appropriately based on type
        let (min_value, max_value) = if is_date_type {
            (
                min_value_str.map_or(serde_json::Value::Null, serde_json::Value::String),
                max_value_str.map_or(serde_json::Value::Null, serde_json::Value::String),
            )
        } else {
            // Number::from_f64 returns None for NaN/infinity; fall back to Null
            (
                min_value
                    .and_then(serde_json::Number::from_f64)
                    .map_or(serde_json::Value::Null, serde_json::Value::Number),
                max_value
                    .and_then(serde_json::Number::from_f64)
                    .map_or(serde_json::Value::Null, serde_json::Value::Number),
            )
        };

        ColumnMetaData {
            name: column_name.clone(),
            min_value,
            max_value,
            unique_values: value_map.len() as i32,
            simple_type,
            column_type,
        }
    }).collect()
}
```
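
With this interface, a caller such as get_metric_data_handler reduces to destructuring `QueryResult`. A minimal sketch, assuming the two parameters shown above (any further parameters elided), an anyhow-style `Result`, and the query_engine crate's `DataType`/`QueryResult` types:

```rust
use anyhow::Result;
use indexmap::IndexMap;
use uuid::Uuid;

// Illustrative caller; handler-specific wiring is omitted.
async fn run_and_describe(
    data_source_id: &Uuid,
    sql: &str,
) -> Result<(Vec<IndexMap<String, DataType>>, DataMetadata)> {
    // Data and metadata now arrive together, so callers no longer
    // recompute metadata themselves.
    let QueryResult { data, metadata } = query_engine(data_source_id, sql).await?;
    Ok((data, metadata))
}
```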

#### Component 5: Updated validate_sql Function

```rust
pub async fn validate_sql(
    sql: &str,
    dataset_id: &Uuid,
) -> Result<(String, Vec<IndexMap<String, DataType>>, Option<DataMetadata>)> {
    // Get data source for the dataset
    let mut conn = get_pg_pool().get().await?;
    let data_source_id = datasets::table
@ -258,7 +371,7 @@ pub async fn validate_sql(
        Vec::new()
    };

    Ok((message, return_records, Some(query_result.metadata)))
}
```
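
The computed metadata might then be persisted roughly as follows; this is a sketch of the process_metric_file/update_metric_handler side, assuming the diesel-async patterns used elsewhere in this PRD, with an illustrative helper name:

```rust
use anyhow::Result;
use diesel_async::RunQueryDsl;
use uuid::Uuid;

// Hypothetical persistence helper; assumes the data_metadata column,
// the metric_files schema module, and the DataMetadata ToSql impl above.
async fn save_metric_metadata(metric_file_id: Uuid, metadata: DataMetadata) -> Result<()> {
    let mut conn = get_pg_pool().get().await?;

    diesel::update(metric_files::table.find(metric_file_id))
        .set(metric_files::data_metadata.eq(Some(metadata)))
        .execute(&mut conn)
        .await?;

    Ok(())
}
```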
@ -289,10 +402,15 @@ CREATE INDEX metric_files_data_metadata_idx ON metric_files USING GIN (data_meta
  - Impact: Simplifies metric definition structure
  - Dependencies: None
- `libs/database/src/types/data_metadata.rs` (NEW)
  - Changes: Create new file for DataMetadata first-class type with FromSql/ToSql impls
  - Impact: Consistent metadata type across the codebase
  - Dependencies: None
- `libs/database/src/models.rs`
  - Changes: Add data_metadata field to MetricFile struct with strong typing
  - Impact: Enables storage of metadata in the database
  - Dependencies: DataMetadata type, Migration to add column
- `libs/database/src/schema.rs`
  - Changes: Update metric_files table definition to include data_metadata
@ -300,36 +418,46 @@ CREATE INDEX metric_files_data_metadata_idx ON metric_files USING GIN (data_meta
  - Dependencies: Migration to add column
- `libs/agents/src/tools/categories/file_tools/common.rs`
  - Changes: Update METRIC_YML_SCHEMA to remove data_metadata, update validate_sql
  - Impact: Simplifies schema validation for metrics, uses new DataMetadata type
  - Dependencies: DataMetadata type
- `libs/query_engine/src/data_source_query_routes/query_engine.rs`
  - Changes: Create QueryResult struct, enhance query_engine to always compute metadata
  - Impact: Centralizes metadata computation, ensures it's always available
  - Dependencies: DataMetadata type
- `libs/handlers/src/metrics/get_metric_data_handler.rs`
  - Changes: Remove metadata computation, use metadata from QueryResult
  - Impact: Removes duplicate code, uses consistent metadata
  - Dependencies: Enhanced query_engine
- `libs/handlers/src/metrics/update_metric_handler.rs`
  - Changes: Update data_metadata when SQL changes
  - Impact: Ensures metadata is updated when metric definition changes
  - Dependencies: Enhanced query_engine
- `libs/agents/src/tools/categories/file_tools/create_metrics.rs`
  - Changes: Add data_metadata to created metric files
  - Impact: Ensures metadata is stored during metric creation
  - Dependencies: Enhanced query_engine
- `server/src/routes/rest/routes/sql/run_sql.rs`
  - Changes: Remove duplicate metadata computation, use metadata from QueryResult
  - Impact: Removes duplicate code, uses consistent metadata
  - Dependencies: Enhanced query_engine
- `server/src/routes/ws/sql/run_sql.rs`
  - Changes: Remove duplicate metadata computation, use metadata from QueryResult
  - Impact: Removes duplicate code, uses consistent metadata
  - Dependencies: Enhanced query_engine

#### New Files
- `libs/database/src/types/data_metadata.rs`
  - Purpose: Define DataMetadata as a first-class type with serialization
  - Key components: DataMetadata struct, FromSql/ToSql impls
  - Dependencies: None
- `migrations/YYYY-MM-DD-HHMMSS_add_data_metadata_to_metric_files/up.sql`
  - Purpose: Add data_metadata column to metric_files table
  - Key components: SQL to add column and index
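
For reference, a sketch of the up.sql body; the nullable JSONB column and the GIN index match the migration section above, and the timestamped directory name is generated by Diesel:

```sql
-- Sketch of up.sql: cached-metadata column plus the GIN index named in the PRD.
ALTER TABLE metric_files ADD COLUMN data_metadata JSONB;

CREATE INDEX metric_files_data_metadata_idx
    ON metric_files USING GIN (data_metadata);

-- down.sql would reverse both steps:
-- DROP INDEX metric_files_data_metadata_idx;
-- ALTER TABLE metric_files DROP COLUMN data_metadata;
```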