tweaks for some more speed

This commit is contained in:
dal 2025-04-15 15:30:52 -06:00
parent db220b7fd5
commit c91a078d2b
No known key found for this signature in database
GPG Key ID: 16F4B0E1E9F61122
2 changed files with 19 additions and 28 deletions

View File

@@ -1301,7 +1301,7 @@ You are a Search Agent, an AI assistant designed to analyze the conversation his
1. **Analyze the Request & Context**:
- Review the latest user message and all conversation history.
- Assess the agent's current context, specifically focusing on data assets and their **detailed models (including names, documentation, columns, etc.)** identified in previous turns.
-- Determine the data requirements for the *current* user request.
+- Determine the data requirements for the *current* user request, **including both explicitly mentioned subjects and implicitly needed related attributes** (e.g., if asked about 'sales per customer', anticipate the need for 'customer names' or 'customer IDs' alongside 'sales figures' and 'dates').
2. **Decision Logic**:
- **If the request is ONLY about visualization/charting aspects**: Use `no_search_needed` tool. These requests typically don't require new data assets:
@@ -1313,8 +1313,8 @@ You are a Search Agent, an AI assistant designed to analyze the conversation his
- **If existing dataset context (detailed models) IS available**: Evaluate if this context provides sufficient information (relevant datasets, columns, documentation) to formulate a plan or perform analysis for the *current* user request.
- **If sufficient**: Use the `no_search_needed` tool. Provide a reason indicating that the necessary data context (models) is already available from previous steps.
- **If insufficient (e.g., the request requires data types, columns, or datasets not covered in the existing models)**: Use the `search_data_catalog` tool to acquire the *specific missing* information needed.
-- For **specific requests** needing new data (e.g., finding a previously unmentioned dataset or specific columns), craft a **single, concise query** as a full sentence targeting the primary asset and its attributes.
-- For **broad or vague requests** needing new data (e.g., exploring a new topic), craft **multiple queries**, each targeting a different asset type or topic implied by the request, aiming to discover the necessary foundational datasets/models.
+- For **specific requests** needing new data (e.g., finding a previously unmentioned dataset or specific columns), craft a **single, concise query** as a full sentence targeting the primary asset and its attributes. **Proactively include potentially relevant related attributes** in the query (e.g., for "sales per customer", query for "datasets with customer sales figures, customer names or IDs, and order dates").
+- For **broad or vague requests** needing new data (e.g., exploring a new topic), craft **multiple queries**, each targeting a different asset type or topic implied by the request, aiming to discover the necessary foundational datasets/models. **Ensure queries attempt to find connections between related concepts** (e.g., query for "datasets linking products to sales regions" and "datasets detailing marketing campaign performance").
3. **Tool Call Execution**:
- Use **only one tool per request** (`search_data_catalog` or `no_search_needed`).
@@ -1325,7 +1325,7 @@ You are a Search Agent, an AI assistant designed to analyze the conversation his
- **Skip search for pure visualization requests**: If the user is ONLY asking about charting, visualization, or dashboard layout aspects (not requesting new data), use `no_search_needed` with a reason indicating the request is about visualization only.
- **Default to search if no context**: If no detailed dataset models are available from previous turns, always use `search_data_catalog` first.
- **Leverage existing context**: Before searching (if context exists), exhaustively evaluate if previously identified dataset models are sufficient to address the current user request's data needs for planning or analysis. Use `no_search_needed` only if the existing models suffice.
-- **Search only for missing information**: If existing context is insufficient, use `search_data_catalog` strategically only to fill the specific gaps in the agent's context (missing datasets, columns, details), not to re-discover information already known.
+- **Search proactively for related attributes**: If existing context is insufficient, use `search_data_catalog` strategically not only to fill the specific gaps but also to proactively find related attributes likely needed for a complete answer (e.g., names, categories, time dimensions). Search for datasets that *connect* these attributes.
- **Be asset-focused and concise**: If searching, craft queries as concise, natural language sentences explicitly targeting the needed data assets and attributes.
- **Maximize asset specificity for broad discovery**: When a search is needed for broad requests, generate queries targeting distinct assets implied by the context.
- **Do not assume data availability**: Base decisions strictly on analyzed context/history.
@@ -1335,7 +1335,7 @@ You are a Search Agent, an AI assistant designed to analyze the conversation his
**Examples**
- **Initial Request (No Context -> Needs Search)**: User asks, "Show me website traffic."
- Tool: `search_data_catalog` (Default search as no context exists)
-- Query: "I'm looking for datasets related to website visits or traffic with daily granularity."
+- Query: "I'm looking for datasets related to website visits or traffic with daily granularity, potentially including source or referral information."
- **Specific Request (Existing Context Insufficient -> Needs Search)**:
- Context: Agent has models for `customers` and `orders`.
- User asks: "Analyze website bounce rates by marketing channel."
@@ -1345,7 +1345,7 @@ You are a Search Agent, an AI assistant designed to analyze the conversation his
- Context: Agent used `search_data_catalog` in Turn 1, retrieved detailed models for `customers` and `orders` datasets (including columns like `customer_id`, `order_date`, `total_amount`, `ltv`).
- User asks in Turn 2: "Show me the lifetime value and recent orders for our top customer by revenue."
- Tool: `no_search_needed`
-- Reason: "The necessary dataset models (`customers`, `orders`) identified previously contain the required columns (`ltv`, `order_date`, `total_amount`) to fulfill this request."
+- Reason: "The necessary dataset models (`customers`, `orders`) identified previously contain the required columns (`ltv`, `order_date`, `total_amount`, `customer_id`) to fulfill this request."
- **Visualization-Only Request (No Search Needed)**: User asks, "Make all the charts blue and add them to a dashboard."
- Tool: `no_search_needed`
- Reason: "The request is only about chart styling and dashboard placement, not requiring any new data assets."
@@ -1371,11 +1371,11 @@ You are a Search Agent, an AI assistant designed to analyze the conversation his
- Derive data needs from the user request *and* the current context (existing detailed dataset models).
- If no models exist, search.
- If models exist, evaluate their sufficiency for the current request. If sufficient, use `no_search_needed`.
-- If models exist but are insufficient, formulate precise `search_data_catalog` queries for the *missing* assets/attributes/details.
-- Queries should reflect a data analyst's natural articulation of intent.
+- If models exist but are insufficient, formulate precise `search_data_catalog` queries for the *missing* assets/attributes/details, **proactively including related context.**
+- **Queries should reflect a data analyst's natural articulation of intent.**
**Validation**
-- For `search_data_catalog`, ensure queries target genuinely *missing* information needed to proceed, based on context analysis.
+- For `search_data_catalog`, ensure queries target genuinely *missing* information needed to proceed, based on context analysis, **and proactively seek relevant related attributes**.
- For `no_search_needed`, verify that the agent's current context (detailed models from history/state) is indeed sufficient for the next step of the current request.
**Datasets you have access to**

View File

@@ -56,14 +56,7 @@ struct RankedDataset {
#[derive(Debug, Deserialize)]
struct LLMFilterResponse {
-results: Vec<FilteredDataset>,
-}
-#[derive(Debug, Deserialize)]
-struct FilteredDataset {
-id: String,
-#[allow(dead_code)]
-reason: String,
+results: Vec<String>,
}
const LLM_FILTER_PROMPT: &str = r#"
@@ -79,15 +72,13 @@ Include datasets that have even a reasonable possibility of containing relevant
DATASETS:
{datasets_json}
-Return a JSON response with the following structure:
+Return a JSON response containing ONLY a list of the UUIDs for the relevant datasets. The response should have the following structure:
```json
{
"results": [
-{
-"id": "dataset-uuid-here",
-"reason": "Brief explanation of why this dataset's structure might be relevant"
-},
-// ... more potentially relevant datasets
+"dataset-uuid-here-1",
+"dataset-uuid-here-2"
+// ... more potentially relevant dataset UUIDs
]
}
```
@@ -101,7 +92,7 @@ IMPORTANT GUIDELINES:
4. Evaluate based on whether the dataset's schema, fields, or description MIGHT contain or relate to the relevant information
5. Include datasets that could provide contextual or supporting information
6. When in doubt about relevance, lean towards including the dataset
-7. **CRITICAL:** The "id" field in your JSON response MUST contain ONLY the dataset's UUID string (e.g., "9711ca55-8329-4fd9-8b20-b6a3289f3d38"). Do NOT include the dataset name or any other information in the "id" field.
+7. **CRITICAL:** Each string in the "results" array MUST contain ONLY the dataset's UUID string (e.g., "9711ca55-8329-4fd9-8b20-b6a3289f3d38"). Do NOT include the dataset name or any other information.
8. Use both the USER REQUEST and SEARCH QUERY to understand the user's information needs broadly
9. Consider these elements in the dataset metadata:
- Column names and their data types
@@ -509,9 +500,9 @@ async fn filter_datasets_with_llm(
let filtered_datasets: Vec<DatasetResult> = filter_response
.results
.into_iter()
-.filter_map(|result| {
-debug!(llm_result_id = %result.id, "Processing LLM filter result");
-let parsed_uuid_result = Uuid::parse_str(&result.id);
+.filter_map(|dataset_id_str| {
+debug!(llm_result_id_str = %dataset_id_str, "Processing LLM filter result ID string");
+let parsed_uuid_result = Uuid::parse_str(&dataset_id_str);
match &parsed_uuid_result {
Ok(parsed_id) => {
debug!(parsed_id = %parsed_id, "Successfully parsed UUID from LLM result");
@@ -532,7 +523,7 @@ async fn filter_datasets_with_llm(
}
}
Err(e) => {
-error!(llm_result_id = %result.id, error = %e, "Failed to parse UUID from LLM result");
+error!(llm_result_id_str = %dataset_id_str, error = %e, "Failed to parse UUID from LLM result string");
None
}
}
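
The Rust hunks above boil down to one change: the LLM now returns a flat list of UUID strings instead of `{id, reason}` objects, and `filter_datasets_with_llm` validates each string and drops the ones that fail to parse. A minimal standalone sketch of that flow (illustrative names; `is_valid_uuid` is a std-only stand-in for `uuid::Uuid::parse_str` so the snippet compiles without external crates):

```rust
// Std-only stand-in for uuid::Uuid::parse_str: checks the canonical
// 8-4-4-4-12 hyphenated hex layout without pulling in the uuid crate.
fn is_valid_uuid(s: &str) -> bool {
    let lens: Vec<usize> = s.split('-').map(|p| p.len()).collect();
    lens == vec![8, 4, 4, 4, 12]
        && s.chars().all(|c| c == '-' || c.is_ascii_hexdigit())
}

// Mirrors the simplified filter_map: keep each ID string that parses
// as a UUID, log and skip the rest (eprintln! stands in for error!()).
fn filter_valid_ids(results: Vec<String>) -> Vec<String> {
    results
        .into_iter()
        .filter_map(|dataset_id_str| {
            if is_valid_uuid(&dataset_id_str) {
                Some(dataset_id_str)
            } else {
                eprintln!("Failed to parse UUID from LLM result string: {dataset_id_str}");
                None
            }
        })
        .collect()
}

fn main() {
    let ids = vec![
        "9711ca55-8329-4fd9-8b20-b6a3289f3d38".to_string(),
        "not-a-uuid".to_string(),
    ];
    let kept = filter_valid_ids(ids);
    assert_eq!(kept.len(), 1); // the malformed entry is dropped
    println!("{kept:?}");
}
```

Skipping the per-dataset `reason` field and the intermediate `FilteredDataset` struct shrinks both the tokens the LLM must generate and the deserialization surface, which is plausibly where the speedup in this commit comes from.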