mirror of https://github.com/buster-so/buster.git
commit 8a203c74c2: Merge branch 'evals' of https://github.com/buster-so/buster into evals
@@ -77,17 +77,22 @@ pub fn get_configuration(agent_data: &ModeAgentData) -> ModeConfiguration {
 
 // Keep the prompt constant, but it's no longer pub
 const DATA_CATALOG_SEARCH_PROMPT: &str = r##"**Role & Task**
-You are a Search Agent, an AI assistant designed to analyze the conversation history and the most recent user message to generate high-intent, asset-focused search queries or determine if a search is unnecessary. Your sole purpose is to:
+You are a Search Agent, an AI assistant designed to analyze the conversation history and the most recent user message to generate high-intent, asset-focused search queries or determine if a search is unnecessary. Your primary goal is to understand the user's data needs in terms of **Business Objects, Properties, Events, Metrics, and Filters** and translate these into effective search queries.
+
+Your sole purpose is to:
 
 - Evaluate the user's request in the `"content"` field of messages with `"role": "user"`, along with all relevant conversation history and the agent's current context (e.g., previously identified datasets and their detailed **models including names, documentation, columns, etc.**), to identify data needs.
+
+- **Deconstruct the Request**: Identify the core **Business Objects** (e.g., Customer, Product, Order; consider synonyms like Client, SKU), relevant **Properties** (e.g., Name, Category, Date), key **Events** (e.g., Purchase, Visit, Signup), desired **Metrics** (e.g., Revenue, Count, Average), and specific **Filters** (e.g., Segment = 'X', Date Range, Status = 'Y') mentioned or implied by the user.
+
+- **Critically anticipate the full set of related attributes** (e.g., identifiers, names, categories, time dimensions) likely required for a complete analysis, even if not explicitly mentioned by the user, framing them as Properties or linking Objects.
+
 - Decide whether the request requires searching for specific data assets (e.g., datasets, models, metrics, properties, documentation) or if the **currently available dataset context (the detailed models retrieved from previous searches)** is sufficient to proceed to the next step (like planning or analysis).
 - Communicate **exclusively through tool calls** (`search_data_catalog` or `no_search_needed`).
-- If searching, simulate a data analyst's search by crafting concise, natural language, full-sentence queries focusing on specific data assets and their attributes, driven solely by the need for *new* information not present in the existing context.
+- If searching, simulate a data analyst's search by crafting concise, natural language, full-sentence queries focusing on specific data assets and their attributes, driven solely by the need for *new* information not present in the existing context. **Frame queries around the identified Objects, Properties, Events, Metrics, and Filters.** Adapt query strategy based on request specificity (see Workflow).
 
 **Workflow**
 1. **Analyze the Request & Context**:
 - Review the latest user message and all conversation history.
 - Assess the agent's current context, specifically focusing on data assets and their **detailed models (including names, documentation, columns, etc.)** identified in previous turns.
-- Determine the data requirements for the *current* user request, **including both explicitly mentioned subjects and implicitly needed related attributes** (e.g., if asked about 'sales per customer', anticipate the need for 'customer names' or 'customer IDs' alongside 'sales figures' and 'dates').
+- **Identify Key Semantic Concepts**: Break down the user's request into **Business Objects, Properties, Events, Metrics, and Filters**. Note synonyms. Anticipate related concepts needed for analysis (e.g., joining identifiers).
+
+- Determine the *complete* data requirements for the *current* user request. This includes explicitly mentioned subjects AND **anticipating and listing all implicitly needed related attributes** (e.g., if asked about 'sales per customer', anticipate the need for 'customer names' [Property of Customer Object], 'customer IDs' [Property/Identifier], 'product names' [Property of Product Object], 'sales figures' [Metric], and 'order dates' [Property of Order/Event Object]) to provide a meaningful answer.
 
 2. **Decision Logic**:
 - **If the request is ONLY about visualization/charting aspects**: Use `no_search_needed` tool. These requests typically don't require new data assets:
@@ -98,9 +103,9 @@ You are a Search Agent, an AI assistant designed to analyze the conversation his
 - **If NO dataset context (detailed models) exists from previous searches**: Use `search_data_catalog` by default to gather initial context.
 - **If existing dataset context (detailed models) IS available**: Evaluate if this context provides sufficient information (relevant datasets, columns, documentation) to formulate a plan or perform analysis for the *current* user request.
 - **If sufficient**: Use the `no_search_needed` tool. Provide a reason indicating that the necessary data context (models) is already available from previous steps.
-- **If insufficient (e.g., the request requires data types, columns, or datasets not covered in the existing models)**: Use the `search_data_catalog` tool to acquire the *specific missing* information needed.
+- **If insufficient (e.g., the request requires data types, columns, or datasets not covered in the existing models)**: Use the `search_data_catalog` tool to acquire the *specific missing* information needed. **Adapt query generation based on request type:**
-- For **specific requests** needing new data (e.g., finding a previously unmentioned dataset or specific columns), craft a **single, concise query** as a full sentence targeting the primary asset and its attributes. **Proactively include potentially relevant related attributes** in the query (e.g., for "sales per customer", query for "datasets with customer sales figures, customer names or IDs, and order dates").
+- For **specific requests** needing new data (e.g., finding a previously unmentioned dataset or specific columns), craft a **single, concise query** as a full sentence targeting the primary asset and its attributes. **Proactively include potentially relevant related attributes** in the query (e.g., for "sales per customer", query for "datasets with customer sales figures, customer names or IDs, and order dates"). **Be explicit about the need for connections.**
-- For **broad or vague requests** needing new data (e.g., exploring a new topic), craft **multiple queries**, each targeting a different asset type or topic implied by the request, aiming to discover the necessary foundational datasets/models. **Ensure queries attempt to find connections between related concepts** (e.g., query for "datasets linking products to sales regions" and "datasets detailing marketing campaign performance").
+- For **broad or vague requests** needing new data (e.g., exploring a new topic), craft **multiple queries**, each targeting a different asset type or topic implied by the request, aiming to discover the necessary foundational datasets/models. **Ensure queries attempt to find connections between related concepts** (e.g., query for "datasets linking products to sales regions" and "datasets detailing marketing campaign performance"). **Explicitly ask for identifiers needed to join concepts (e.g., 'customer IDs', 'product IDs').**
 
 3. **Tool Call Execution**:
 - Use **only one tool per request** (`search_data_catalog` or `no_search_needed`).
@@ -111,9 +116,9 @@ You are a Search Agent, an AI assistant designed to analyze the conversation his
 - **Skip search for pure visualization requests**: If the user is ONLY asking about charting, visualization, or dashboard layout aspects (not requesting new data), use `no_search_needed` with a reason indicating the request is about visualization only.
 - **Default to search if no context**: If no detailed dataset models are available from previous turns, always use `search_data_catalog` first.
 - **Leverage existing context**: Before searching (if context exists), exhaustively evaluate if previously identified dataset models are sufficient to address the current user request's data needs for planning or analysis. Use `no_search_needed` only if the existing models suffice.
-- **Search proactively for related attributes**: If existing context is insufficient, use `search_data_catalog` strategically not only to fill the specific gaps but also to proactively find related attributes likely needed for a complete answer (e.g., names, categories, time dimensions). Search for datasets that *connect* these attributes.
+- **Search Strategically based on Specificity & Semantics**: If existing context is insufficient, use `search_data_catalog`. Formulate queries based on the identified **Objects, Properties, Events, Metrics, and Filters**. For *specific* requests, queries MUST explicitly ask for anticipated related attributes and connections. For *vague/exploratory* requests, generate *more* queries covering broader related concepts (combinations of Objects, Properties, Events) to facilitate discovery.
-- **Be asset-focused and concise**: If searching, craft queries as concise, natural language sentences explicitly targeting the needed data assets and attributes.
+- **Be Asset-Focused and Adapt Query Detail using Semantic Concepts**: If searching, craft queries as concise, natural language sentences targeting needed data assets, framed around the identified **Objects, Properties, Events, Metrics, and Filters**. Adapt detail based on request specificity.
-- **Maximize asset specificity for broad discovery**: When a search is needed for broad requests, generate queries targeting distinct assets implied by the context.
+- **Maximize Discovery for Vague Requests using Semantic Combinations**: When a search is needed for vague requests, generate a *larger number* of queries targeting distinct but potentially related **combinations of Objects, Properties, and Events** implied by the request to ensure broad discovery.
 - **Do not assume data availability**: Base decisions strictly on analyzed context/history.
 - **Avoid direct communication**: Use tool calls exclusively.
 - **Restrict `no_search_needed` usage**: Use `no_search_needed` only when the *agent's current understanding of available data assets via detailed models* (informed by conversation history and agent state) is sufficient to proceed with the *next step* for the current request without needing *new* information from the catalog. Otherwise, use `search_data_catalog`.
@@ -121,12 +126,22 @@ You are a Search Agent, an AI assistant designed to analyze the conversation his
 **Examples**
 - **Initial Request (No Context -> Needs Search)**: User asks, "Show me website traffic."
 - Tool: `search_data_catalog` (Default search as no context exists)
-- Query: "I'm looking for datasets related to website visits or traffic with daily granularity, potentially including source or referral information."
+- Query: "I'm looking for datasets related to website visits or traffic, specifically including daily counts, traffic sources, referral information, and ideally user session identifiers."
-- **Specific Request (Existing Context Insufficient -> Needs Search)**:
+- **Specific Request Example (Needs Search)**:
 - Context: Agent has models for `customers` and `orders`.
-- User asks: "Analyze website bounce rates by marketing channel."
+- User asks: "Show me the total order value for customers in the 'Enterprise' segment last month."
-- Tool: `search_data_catalog` (Existing models don't cover website analytics or marketing channels)
+- Tool: `search_data_catalog` (Need to connect orders, customers, and segments specifically for last month)
-- Query: "I need datasets containing website analytics like bounce rate, possibly linked to marketing channel information."
+- Query: "Find datasets containing the Order [Object/Event] with Properties/Metrics like total value and order date [Filter: last month], linked to Customer [Object] Properties like ID and segment [Filter: 'Enterprise']."
+
+- **Vague/Exploratory Request Example (Needs Search - Framed Semantically)**:
+- User asks: "Explore factors influencing customer churn [Event/Metric]."
+- Tool: `search_data_catalog`
+- Queries:
+- "Find datasets defining Customer Churn [Event/Metric] status or risk scores [Property/Metric]."
+- "Search for datasets about the Customer [Object] with Properties like demographics, account details, tenure, and identifiers."
+- "Locate datasets detailing Product Usage [Event/Metric] or Service Interaction [Event] frequency [Metric] per Customer [Object]."
+- "Identify datasets about Customer Support Interactions [Event/Object] (e.g., tickets, calls) including Properties like resolution time or satisfaction scores [Metric]."
+- "Are there datasets about Billing History [Object/Event] with details on payment issues [Property/Event] or pricing changes [Property/Event]?"
+- "Find datasets linking Marketing Engagement [Event/Object] or Campaign Exposure [Property] to Customer Retention [Metric/Status Property]."
 - **Follow-up Request (Existing Context Sufficient -> No Search Needed)**:
 - Context: Agent used `search_data_catalog` in Turn 1, retrieved detailed models for `customers` and `orders` datasets (including columns like `customer_id`, `order_date`, `total_amount`, `ltv`).
 - User asks in Turn 2: "Show me the lifetime value and recent orders for our top customer by revenue."
@@ -152,16 +167,16 @@ You are a Search Agent, an AI assistant designed to analyze the conversation his
 - Follow-up requests building on established context.
 - Visualization-only requests (no search needed).
 
-**Request Interpretation**
+**Request Interpretation & Query Formulation**
 - Evaluate if the request is ONLY about visualization, charting or dashboard layout (no search needed).
-- Derive data needs from the user request *and* the current context (existing detailed dataset models).
+- **Anticipate Full Data Needs using Semantic Concepts**: Deconstruct the user request into **Objects, Properties, Events, Metrics, Filters**. Analyze current context (existing models) to determine the *complete* set of data needed for analysis, anticipating related concepts and necessary connections. **Adapt the breadth and number of search queries based on request specificity.**
 - If no models exist, search.
 - If models exist, evaluate their sufficiency for the current request. If sufficient, use `no_search_needed`.
-- If models exist but are insufficient, formulate precise `search_data_catalog` queries for the *missing* assets/attributes/details, proactively including related context.**
+- If models exist but are insufficient, formulate `search_data_catalog` queries **framed around the identified semantic concepts**, following the specific vs. vague/exploratory strategy (few targeted queries vs. many broader queries).
-- **Queries should reflect a data analyst's natural articulation of intent.**
+- **Queries should reflect a data analyst's natural articulation of intent, framed using the identified Objects, Properties, Events, Metrics, and Filters.**
 
 **Validation**
-- For `search_data_catalog`, ensure queries target genuinely *missing* information needed to proceed, based on context analysis, **and proactively seek relevant related attributes**.
+- For `search_data_catalog`, ensure the number and nature of queries match the request specificity (few/targeted vs. many/broader). **Verify that queries are framed using the identified semantic concepts (Objects, Properties, Events, Metrics, Filters)** and aim to gather the necessary information based on context analysis.
 - For `no_search_needed`, verify that the agent's current context (detailed models from history/state) is indeed sufficient for the next step of the current request.
 
 **Datasets you have access to**
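The decision logic the prompt above describes (visualization-only requests skip search, missing context forces a search, sufficient context skips it, anything else searches with adapted queries) can be sketched in plain Rust. This is an illustrative model only; `CatalogDecision` and `decide_catalog_action` are hypothetical names, not types from the actual crate:

```rust
/// Illustrative only: models the prompt's decision logic, not the real tool-call types.
#[derive(Debug, PartialEq)]
pub enum CatalogDecision {
    SearchDataCatalog(Vec<String>),
    NoSearchNeeded(String),
}

pub fn decide_catalog_action(
    visualization_only: bool,
    has_dataset_context: bool,
    context_sufficient: bool,
    candidate_queries: Vec<String>,
) -> CatalogDecision {
    if visualization_only {
        // Pure charting/layout requests never need new data assets.
        return CatalogDecision::NoSearchNeeded("request is about visualization only".into());
    }
    if !has_dataset_context {
        // No detailed models from previous turns: always search first.
        return CatalogDecision::SearchDataCatalog(candidate_queries);
    }
    if context_sufficient {
        // Existing models already cover the request.
        return CatalogDecision::NoSearchNeeded("necessary data context already available".into());
    }
    // Context exists but is insufficient: search for the missing assets.
    CatalogDecision::SearchDataCatalog(candidate_queries)
}
```

Note the ordering matters: the visualization check comes before the context checks, matching the prompt's "ONLY about visualization" rule taking precedence.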
@@ -60,49 +60,42 @@ struct LLMFilterResponse {
 }
 
 const LLM_FILTER_PROMPT: &str = r#"
-You are a dataset relevance evaluator. Your task is to determine which datasets might contain information relevant to the user's query based on their structure and metadata. Be inclusive in your evaluation - if there's a reasonable chance the dataset could be useful, include it.
+You are a dataset relevance evaluator, acting like a semantic search engine. Your task is to determine which datasets are **semantically relevant** to the user's query based on their structure and metadata, focusing on the core **Business Objects, Properties, Events, Metrics, and Filters** implied by the request.
 
 USER REQUEST: {user_request}
-SEARCH QUERY: {query}
+SEARCH QUERY: {query} (This query is framed around key semantic concepts identified from the user request)
 
-Below is a list of datasets that were identified as potentially relevant by an initial semantic ranking system.
+Below is a list of datasets that were identified as potentially relevant by an initial ranking system.
-For each dataset, review its description in the YAML format and determine if its structure could potentially be suitable for the user's query.
+For each dataset, review its description in the YAML format. Evaluate how well the dataset's described contents (columns, metrics, entities, documentation) **semantically align** with the key **Objects, Properties, Events, Metrics, and Filters** required by the USER REQUEST and SEARCH QUERY.
-Include datasets that have even a reasonable possibility of containing relevant information.
+Include datasets where the YAML description suggests a reasonable semantic match or overlap with the needed concepts. Prioritize datasets that appear to contain the core Objects or Events, even if all specific Properties or Metrics aren't explicitly listed.
 
 DATASETS:
 {datasets_json}
 
-Return a JSON response containing ONLY a list of the UUIDs for the relevant datasets. The response should have the following structure:
+Return a JSON response containing ONLY a list of the UUIDs for the semantically relevant datasets. The response should have the following structure:
 ```json
 {
   "results": [
     "dataset-uuid-here-1",
     "dataset-uuid-here-2"
-    // ... more potentially relevant dataset UUIDs
+    // ... semantically relevant dataset UUIDs
   ]
 }
 ```
 
 IMPORTANT GUIDELINES:
-1. Be inclusive - if there's a reasonable possibility the dataset could be useful, include it
+1. **Focus on Semantic Relevance**: Include datasets whose content, as described in the YAML, is semantically related to the required Objects, Properties, Events, Metrics, or Filters. Direct keyword matches are not required.
-2. Consider both direct and indirect relationships to the query
+2. **Consider the Core Concepts**: Does the dataset seem to be about the primary Business Object(s) or Event(s)? Does it contain relevant Properties or Metrics, even if named differently (synonyms)?
-3. For example, if a user asks about "red bull sales", consider datasets about:
-- Direct relevance: products, sales, inventory
-- Indirect relevance: marketing campaigns, customer demographics, store locations
-4. Evaluate based on whether the dataset's schema, fields, or description MIGHT contain or relate to the relevant information
-5. Include datasets that could provide contextual or supporting information
-6. When in doubt about relevance, lean towards including the dataset
+3. **Allow Reasonable Inference**: If a dataset describes the correct Object (e.g., 'Customers') and the query asks for a common Property (e.g., 'Email Address'), you can reasonably infer potential relevance even if 'Email Address' isn't explicitly listed in the snippet, provided the dataset description is relevant.
+4. **Evaluate based on Semantic Fit**: Does the dataset's purpose and structure, based on its YAML, align well with the user's information need? Consider relationships between entities described in the YAML.
+5. **Contextual Information is Relevant**: Datasets providing important contextual Properties for the core Objects or Events should be considered relevant.
+6. **When in doubt, lean towards inclusion if semantically plausible**: If the dataset seems semantically related to the core concepts, even if imperfectly described in the YAML snippet, it's better to include it for further inspection.
 7. **CRITICAL:** Each string in the "results" array MUST contain ONLY the dataset's UUID string (e.g., "9711ca55-8329-4fd9-8b20-b6a3289f3d38"). Do NOT include the dataset name or any other information.
-8. Use both the USER REQUEST and SEARCH QUERY to understand the user's information needs broadly
+8. **Use both USER REQUEST and SEARCH QUERY**: Understand the underlying need (user request) and the specific concepts being targeted (search query).
-9. Consider these elements in the dataset metadata:
-- Column names and their data types
-- Entity relationships
-- Predefined metrics
-- Table schemas
-- Dimension hierarchies
-- Related or connected data structures
-10. While you shouldn't assume specific data exists, you can be optimistic about the potential usefulness of related data structures
-11. A dataset is relevant if its structure could reasonably support or contribute to answering the query, either directly or indirectly
+9. **Prioritize Semantic Overlap**: Look for datasets that cover the key Objects, Events, or Metrics, even if the exact Filters or secondary Properties aren't perfectly matched in the description.
+10. **Assume potential utility based on semantic clues**: If the YAML indicates the dataset is about the right topic (Object/Event), assume it might contain relevant Properties/Metrics unless the YAML explicitly contradicts this.
+11. A dataset is relevant if its described structure and purpose **semantically align** with the information needed to answer the query.
 "#;
 
 pub struct SearchDataCatalogTool {
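The filter prompt requires the LLM to return a JSON object with a `results` array of bare UUID strings, which `LLMFilterResponse` presumably deserializes (the real code would use serde for this). As a std-only sketch of the expected shape, the hypothetical helper below pulls the quoted strings out of the `results` array; it is illustrative, not the crate's actual parsing code:

```rust
// Minimal std-only sketch of extracting the "results" UUIDs from the LLM's
// JSON reply. The real implementation presumably deserializes into
// `LLMFilterResponse` with serde; `extract_result_ids` is a hypothetical helper.
pub fn extract_result_ids(raw: &str) -> Vec<String> {
    // Locate the opening '[' of the "results" array.
    let start = match raw
        .find("\"results\"")
        .and_then(|i| raw[i..].find('[').map(|j| i + j + 1))
    {
        Some(s) => s,
        None => return Vec::new(),
    };
    // Locate the matching closing ']' (arrays here contain only flat strings).
    let end = match raw[start..].find(']') {
        Some(e) => start + e,
        None => return Vec::new(),
    };
    // Quoted segments alternate with separators when splitting on '"';
    // skip(1).step_by(2) keeps exactly the in-quote segments.
    raw[start..end]
        .split('"')
        .skip(1)
        .step_by(2)
        .map(|s| s.to_string())
        .collect()
}
```

Guideline 7's "UUID only" constraint is what makes this flat-string extraction plausible: no nested objects or escaped quotes are expected inside the array.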
@@ -363,7 +356,7 @@ async fn rerank_datasets(
     query,
     documents,
     model: ReRankModel::EnglishV3,
-    top_n: Some(50),
+    top_n: Some(35),
     ..Default::default()
 };
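This hunk lowers the reranker's `top_n` from 50 to 35, i.e., only the 35 highest-scoring documents are returned for downstream filtering. The actual scoring happens in the external rerank service; the sketch below (with the hypothetical name `take_top_n`) only illustrates the truncation semantics of the parameter:

```rust
/// Illustrative sketch of `top_n` semantics: keep only the n highest-scoring
/// documents, ordered best-first. The real reranking is performed by the
/// external rerank service; this helper exists only to show the truncation.
pub fn take_top_n(mut scored: Vec<(String, f32)>, n: usize) -> Vec<(String, f32)> {
    // Sort descending by relevance score.
    scored.sort_by(|a, b| b.1.partial_cmp(&a.1).unwrap_or(std::cmp::Ordering::Equal));
    scored.truncate(n); // top_n: Some(35) keeps at most 35 results
    scored
}
```

A smaller `top_n` shrinks the candidate list the LLM filter prompt must evaluate, trading recall for a cheaper, more focused second stage.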
@@ -2408,7 +2408,11 @@ async fn initialize_chat(
     user: &AuthenticatedUser,
     user_org_id: Uuid,
 ) -> Result<(Uuid, Uuid, ChatWithMessages)> {
-    let message_id = request.message_id.unwrap_or_else(Uuid::new_v4);
+    // Determine the ID for the new message being created.
+    // If request.message_id is Some, it signifies a branch point, so the NEW message needs a NEW ID.
+    // If request.message_id is None, we might be starting a new chat or adding to the end,
+    // in which case we can use a new ID as well.
+    let new_message_id = Uuid::new_v4();
 
     // Get a default title for chats
     let default_title = {
@@ -2428,12 +2432,67 @@ async fn initialize_chat(
     let prompt_text = request.prompt.clone().unwrap_or_default();
 
     if let Some(existing_chat_id) = request.chat_id {
+        // --- START: Added logic for message_id presence ---
+        if let Some(target_message_id) = request.message_id {
+            // Use target_message_id (from the request) for deletion logic
+            let mut conn = get_pg_pool().get().await?;
+
+            // Fetch the created_at timestamp of the target message
+            let target_message_created_at = messages::table
+                .filter(messages::id.eq(target_message_id))
+                .select(messages::created_at)
+                .first::<chrono::NaiveDateTime>(&mut conn)
+                .await
+                .optional()?; // Use optional in case the message doesn't exist
+
+            if let Some(created_at_ts) = target_message_created_at {
+                // Mark subsequent messages as deleted
+                let update_result = diesel::update(messages::table)
+                    .filter(messages::chat_id.eq(existing_chat_id))
+                    .filter(messages::created_at.ge(created_at_ts))
+                    .set(messages::deleted_at.eq(Some(Utc::now().naive_utc()))) // Use naive_utc() for NaiveDateTime
+                    .execute(&mut conn)
+                    .await;
+
+                match update_result {
+                    Ok(num_updated) => {
+                        tracing::info!(
+                            "Marked {} messages as deleted for chat {} starting from message {}",
+                            num_updated,
+                            existing_chat_id,
+                            target_message_id
+                        );
+                    }
+                    Err(e) => {
+                        tracing::error!(
+                            "Failed to mark messages as deleted for chat {}: {}",
+                            existing_chat_id,
+                            e
+                        );
+                        // Propagate the error or handle appropriately
+                        return Err(anyhow!("Failed to update messages: {}", e));
+                    }
+                }
+            } else {
+                // Handle case where the target_message_id doesn't exist
+                tracing::warn!(
+                    "Target message_id {} not found for chat {}, proceeding without deleting messages.",
+                    target_message_id,
+                    existing_chat_id
+                );
+                // Potentially return an error or proceed based on desired behavior
+            }
+        }
+        // --- END: Added logic for message_id presence ---
+
         // Get existing chat - no need to create new chat in DB
+        // This now fetches the chat *after* potential deletions
         let mut existing_chat = get_chat_handler(&existing_chat_id, &user, true).await?;
 
-        // Create new message
+        // Create new message using the *new* message ID
         let message = ChatMessage::new_with_messages(
-            message_id,
+            new_message_id, // Use the newly generated ID here
             Some(ChatUserMessage {
                 request: Some(prompt_text),
                 sender_id: user.id,
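When a `message_id` is supplied for an existing chat, the hunk above soft-deletes every message whose `created_at` is at or after the target message's timestamp (a branch point), then appends the new message under a freshly generated ID. The in-memory sketch below models just that semantics; the real code does it with Diesel against Postgres, and `Msg` and `branch_at` are hypothetical names for this illustration only:

```rust
/// Illustrative in-memory model of the branch-point soft-delete: every message
/// created at or after the target message's timestamp is marked deleted.
/// Hypothetical types; the real implementation uses Diesel against Postgres.
#[derive(Debug, Clone)]
pub struct Msg {
    pub id: u32,
    pub created_at: u64, // stand-in for chrono::NaiveDateTime
    pub deleted: bool,
}

/// Returns the number of messages soft-deleted, or None if the target id is
/// absent (mirroring the `.optional()` lookup, which warns and skips deletion).
pub fn branch_at(messages: &mut [Msg], target_id: u32) -> Option<usize> {
    // Fetch the created_at of the target message.
    let target_ts = messages.iter().find(|m| m.id == target_id)?.created_at;
    let mut num_updated = 0;
    for m in messages.iter_mut() {
        // created_at.ge(target_ts): the target message itself is also soft-deleted,
        // since the new message replaces it as the branch's tip.
        if m.created_at >= target_ts && !m.deleted {
            m.deleted = true;
            num_updated += 1;
        }
    }
    Some(num_updated)
}
```

Soft deletion (setting `deleted_at`) rather than a hard `DELETE` preserves the superseded branch's rows, which keeps history recoverable while hiding it from normal chat fetches.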
@@ -2450,7 +2509,7 @@ async fn initialize_chat(
         // Add message to existing chat
         existing_chat.add_message(message);
 
-        Ok((existing_chat_id, message_id, existing_chat))
+        Ok((existing_chat_id, new_message_id, existing_chat)) // Return the new_message_id
     } else {
         // Create new chat since we don't have an existing one
         let chat_id = Uuid::new_v4();
@@ -2471,9 +2530,9 @@ async fn initialize_chat(
             most_recent_version_number: None,
         };
 
-        // Create initial message
+        // Create initial message using the *new* message ID
         let message = ChatMessage::new_with_messages(
-            message_id,
+            new_message_id, // Use the newly generated ID here
             Some(ChatUserMessage {
                 request: Some(prompt_text),
                 sender_id: user.id,
@@ -2519,7 +2578,7 @@ async fn initialize_chat(
         .execute(&mut conn)
         .await?;
 
-    Ok((chat_id, message_id, chat_with_messages))
+    Ok((chat_id, new_message_id, chat_with_messages)) // Return the new_message_id
     }
 }