feat: implement tool result truncation and enhance data catalog search

- Added a new method to truncate previous tool results for better conversation management, allowing for cleaner interactions by limiting the number of displayed results. - Updated the `search_data_catalog` tool to automatically truncate previous search results, ensuring that only relevant and recent data is presented to users. - Enhanced the data catalog search functionality to load all available datasets with fresh value injection, improving the overall user experience and data accessibility.
2025-06-11 23:11:37 -06:00 · 2025-06-11 23:11:37 -06:00 · 8848578e02
parent 7189cc0059
commit 8848578e02
4 changed files with 167 additions and 797 deletions
--- a/api/libs/agents/src/agent.rs
+++ b/api/libs/agents/src/agent.rs
@ -361,6 +361,46 @@ impl Agent {
            .map(|thread| thread.messages.clone())
    }

+    /// Truncate previous tool results of a specific tool to keep conversation manageable
+    pub async fn truncate_previous_tool_results(&self, tool_name: &str, replacement_content: &str) -> Result<()> {
+        let mut thread_lock = self.current_thread.write().await;
+        if let Some(thread) = thread_lock.as_mut() {
+            let mut modified_count = 0;
+            
+            // Find and truncate previous tool results, but skip the most recent one if it exists
+            let mut tool_message_indices: Vec<usize> = Vec::new();
+            for (index, message) in thread.messages.iter().enumerate() {
+                if let AgentMessage::Tool { name: Some(ref msg_tool_name), .. } = message {
+                    if msg_tool_name == tool_name {
+                        tool_message_indices.push(index);
+                    }
+                }
+            }
+            
+            // Skip the last occurrence (if any) and truncate all previous ones
+            if tool_message_indices.len() > 1 {
+                // Remove the last index (most recent) from the list to truncate
+                tool_message_indices.pop();
+                
+                for &index in &tool_message_indices {
+                    if let Some(AgentMessage::Tool { content, .. }) = thread.messages.get_mut(index) {
+                        *content = replacement_content.to_string();
+                        modified_count += 1;
+                    }
+                }
+            }
+            
+            if modified_count > 0 {
+                debug!(
+                    tool_name = tool_name,
+                    truncated_count = modified_count,
+                    "Truncated previous tool results to keep conversation manageable"
+                );
+            }
+        }
+        Ok(())
+    }
+
    /// Update the current thread with a new message
    async fn update_current_thread(&self, message: AgentMessage) -> Result<()> {
        let mut thread_lock = self.current_thread.write().await;
--- a/api/libs/agents/src/agents/modes/analysis.rs
+++ b/api/libs/agents/src/agents/modes/analysis.rs
@ -190,6 +190,9 @@ const SNOWFLAKE_DIALECT_GUIDANCE: &str = r##"
 "##;

 const BIGQUERY_DIALECT_GUIDANCE: &str = r##"
+- You must escape the `<PROJECT_ID>.<DATASET_ID>.<TABLE_ID>` format for table names. with the backtick character or it wont compile.
+  - Or exclude the project id from the table name.
+  - If any use the `-` you need to escape it with a backtick.
 - **Date/Time Functions (BigQuery)**:
  - **`DATE_TRUNC`**: `DATE_TRUNC(column, DAY)`, `DATE_TRUNC(column, WEEK)`, `DATE_TRUNC(column, MONTH)`, etc. Week starts Sunday by default, use `WEEK(MONDAY)` for Monday start.
  - **`EXTRACT`**: `EXTRACT(DAYOFWEEK FROM column)` (1=Sun, 7=Sat), `EXTRACT(ISOWEEK FROM column)`.
--- a/api/libs/agents/src/agents/modes/data_catalog_search.rs
+++ b/api/libs/agents/src/agents/modes/data_catalog_search.rs
@ -84,9 +84,9 @@ pub fn get_configuration(agent_data: &ModeAgentData, _data_source_syntax: Option
    }
 }

-// Keep the prompt constant, but it's no longer pub
+// Keep the prompt constant, but updated to reflect the new behavior
 const DATA_CATALOG_SEARCH_PROMPT: &str = r##"**Role & Task**
-You are a Search Strategist Agent. Your primary goal is to analyze the conversation history, the most recent user message, and available dataset descriptions to formulate the optimal parameters for the `search_data_catalog` tool or determine that no search is needed (`no_search_needed`).
+You are a Data Loading Agent. Your primary goal is to analyze the conversation history and the most recent user message to determine whether to load all available datasets with fresh value injection (`search_data_catalog`) or skip loading if no new data is needed (`no_search_needed`).

 Your sole output MUST be a call to **ONE** of these tools: `search_data_catalog` or `no_search_needed`.

@ -94,80 +94,75 @@ Your sole output MUST be a call to **ONE** of these tools: `search_data_catalog`
 ```
 {DATASET_DESCRIPTIONS}
 ```
-*(This section contains summaries or relevant snippets of YAML/metadata for datasets the agent is aware of. Use this to reason about potential joins, available attributes, and data relationships.)*
+*(This section contains summaries or relevant snippets of YAML/metadata for datasets. Use this to understand what data is available.)*

 **Core Responsibilities:**
-1.  **Analyze Request & Context**: Evaluate the user's request (`"content"` field of `"role": "user"` messages), conversation history, and `{DATASET_DESCRIPTIONS}`.
-2.  **Deconstruct Request**: Identify core **Business Objects**, **Properties**, **Events**, **Metrics**, and **Filters**.
-3.  **Extract Specific Values (CRITICAL STEP)**: Identify and extract concrete values/entities mentioned in the user request that are likely to appear as actual values in database columns. This is crucial for the `value_search_terms` parameter.
+1.  **Analyze Request & Context**: Evaluate the user's request (`"content"` field of `"role": "user"` messages) and conversation history.
+2.  **Deconstruct Request**: Identify core **Business Objects**, **Properties**, **Events**, **Metrics**, and **Filters** needed for analysis.
+3.  **Extract Specific Values (CRITICAL STEP)**: Identify and extract concrete values/entities mentioned in the user request that are likely to appear as actual values in database columns. This is crucial for fresh value injection.
    *   **Focus on**: Product names ("Red Bull"), Company names ("Acme Corp"), People's names ("John Smith"), Locations ("California", "Europe"), Categories/Segments ("Premium tier"), Status values ("completed"), specific Features ("waterproof"), Industry terms ("B2B", "SaaS").
-    *   **DO NOT Extract**: General concepts ("revenue", "customers"), Time periods ("last month", "Q1"), Generic attributes ("name", "id"), Common words, Numbers without context, generic IDs (UUIDs, database keys like `cust_12345`, `9711ca55...`), or composite strings containing non-semantic identifiers (e.g., for "ticket 1a2b3c", only extract "ticket" if it's a meaningful category itself, otherwise extract nothing). Focus *only* on values with inherent business meaning.
-    *   **Goal**: Populate `value_search_terms` whenever such specific, distinctive values are present in the user request.
-4.  **Reason & Anticipate Needs**: Based on the user's goal, the extracted values, and `{DATASET_DESCRIPTIONS}`, anticipate the **complete set** of data required. Consider implicit needs (e.g., needing `customer_name` when `customer revenue` is asked) and potential **joins** (check descriptions for likely linking keys like `user_id`, `product_id`).
-5.  **Determine Search Strategy**: Decide if the existing context is sufficient (`no_search_needed`) or if a search is required.
-6.  **Generate Tool Call Parameters**: If searching, formulate parameters for `search_data_catalog`, deciding the appropriate combination of `specific_queries`, `exploratory_topics`, and the extracted `value_search_terms`.
+    *   **DO NOT Extract**: General concepts ("revenue", "customers"), Time periods ("last month", "Q1"), Generic attributes ("name", "id"), Common words, Numbers without context, generic IDs (UUIDs, database keys like `cust_12345`, `9711ca55...`), or composite strings containing non-semantic identifiers. Focus *only* on values with inherent business meaning.
+    *   **Goal**: Populate `value_search_terms` to enable fresh value injection into dataset YAMLs.
+4.  **Determine Loading Strategy**: Decide if data loading is needed for fresh analysis or if current context is sufficient.
+5.  **Generate Tool Call Parameters**: If loading data, formulate parameters for `search_data_catalog` which will load ALL datasets with fresh value injection.

 **Workflow & Decision Logic:**

-1.  **Analyze Request & Context**: Review the latest user message, history, and `{DATASET_DESCRIPTIONS}`.
-2.  **Extract Specific Values**: Identify concrete values from the user request for potential use in `value_search_terms`.
-3.  **Check for Visualization-Only Request**: If the request is *purely* about visual aspects (chart types, colors) -> Call `no_search_needed`.
-4.  **Assess Existing Context**: Evaluate if `{DATASET_DESCRIPTIONS}` (reflecting previous finds) is sufficient for the *current* request's analytical needs (including anticipated joins/attributes).
-    *   **If Sufficient**: Call `no_search_needed`.
-    *   **If Insufficient OR No Context**: Proceed to formulate search parameters.
-5.  **Formulate Search Parameters (Apply Reasoning & Extracted Values)**:
-    *   **Identify Request Nature**: Is it specific, exploratory, or mixed?
-    *   **Specific Requests** (e.g., "Top customer by revenue", "Sales for Product X last month"): Generate `specific_queries`. Queries should explicitly ask for identified Objects, Properties, Events, Metrics, Filters, AND anticipated attributes/joining keys. *Aim for 1-3 focused queries.*
-    *   **Exploratory Requests** (e.g., "Tell me about revenue", "Factors influencing churn"): Generate `exploratory_topics`. Topics should represent broader themes for discovery. *Aim for 3-5 distinct topics.*
-    *   **Mixed Requests** (e.g., "Who is my top customer [Nike] and tell me all about them?"): Generate *both* `specific_queries` (for the "top customer" part) *and* `exploratory_topics` (for the "tell me all about them" part).
-    *   **Populate `value_search_terms` (ALWAYS if applicable)**: If Step 2 extracted specific values ("Red Bull", "California", "Premium tier", "John Smith", etc.), include them in the `value_search_terms` list. This parameter helps find datasets containing these *exact* values and should be used alongside `specific_queries` or `exploratory_topics` if relevant values are mentioned.
-6.  **Execute Tool Call**: Call `search_data_catalog` with the generated parameters. Ensure at least one parameter (`specific_queries`, `exploratory_topics`, `value_search_terms`) is non-null.
+1.  **Analyze Request & Context**: Review the latest user message and conversation history.
+2.  **Extract Specific Values**: Identify concrete values from the user request for value injection.
+3.  **Check for Visualization-Only Request**: If the request is *purely* about visual aspects (chart types, colors) and no new data analysis is needed -> Call `no_search_needed`.
+4.  **Assess Need for Fresh Data**: 
+    *   **If New Analysis Required**: Any request that involves data analysis, exploration, or requires fresh value injection -> Call `search_data_catalog`.
+    *   **If No New Data Needed**: Simple clarifications or purely visual changes -> Call `no_search_needed`.
+5.  **Formulate Loading Parameters**:
+    *   **Specific Requests** (e.g., "Top customer by revenue", "Sales for Product X"): Generate `specific_queries` describing what data is needed.
+    *   **Exploratory Requests** (e.g., "Tell me about revenue", "Factors influencing churn"): Generate `exploratory_topics` for broader data loading.
+    *   **Mixed Requests**: Generate *both* `specific_queries` and `exploratory_topics` as appropriate.
+    *   **Value Search Terms**: Always include extracted specific values in `value_search_terms` for fresh injection.
+6.  **Execute Tool Call**: Call the appropriate tool with generated parameters.

 **Tool Parameters (`search_data_catalog`)**
-   `specific_queries`: `Option<Vec<String>>` - For focused requests. Precise, natural language sentences including anticipated attributes/joins.
-   `exploratory_topics`: `Option<Vec<String>>` - For vague/investigative requests. Concise phrases for discovery.
-   `value_search_terms`: `Option<Vec<String>>` - **CRITICAL**: For specific, meaningful values/entities mentioned in the request (Product names, locations, categories, statuses, etc., as defined in Step 3). Use whenever applicable to find datasets containing these exact terms. **Must exclude IDs, UUIDs, and non-semantic values** (see Step 3 exclusions).
+-   `specific_queries`: `Option<Vec<String>>` - For focused requests. Natural language descriptions of needed data.
+-   `exploratory_topics`: `Option<Vec<String>>` - For broad/investigative requests. Topics for data exploration.
+-   `value_search_terms`: `Option<Vec<String>>` - **CRITICAL**: Specific values from the user request for fresh injection into datasets.
+
+**Important Notes:**
+-   **Data Loading Strategy**: `search_data_catalog` now loads ALL available datasets with fresh value injection rather than filtering specific ones.
+-   **Fresh Value Injection**: Always use `value_search_terms` when specific values are mentioned to get the most current data.
+-   **Previous Results**: Any previous search results are automatically truncated to keep conversations manageable.

 **Rules**
-   **Reasoning is Mandatory**: Always anticipate joins/attributes based on `{DATASET_DESCRIPTIONS}`.
-   **Value Extraction is Mandatory**: Always attempt to extract specific values from the user request for `value_search_terms`.
-   **Use `value_search_terms` When Applicable**: If specific values are extracted, *always* include them in the `value_search_terms` parameter, even if also using `specific_queries` or `exploratory_topics`.
+-   **Value Extraction is Mandatory**: Always attempt to extract specific values from the user request.
+-   **Use `value_search_terms` When Applicable**: If specific values are extracted, *always* include them for fresh injection.
 -   **Output = Tool Call**: Only output a single tool call.
-   **Match Parameters to Request Type**: Use the appropriate combination of parameters based on the analysis.
-   **Default to Search if No Context/Insufficient**: If context is lacking, always search.
+-   **Default to Loading for Analysis**: If the request involves any data analysis, load fresh data.

-**Examples (Illustrating Parameter Combinations)**
+**Examples**

-   **Initial Request (Specific)**: User: "Who is my top customer by revenue?"
-    -   *Reasoning*: Need Customer, Revenue Metric. Anticipate Customer Name/ID. No specific values mentioned.
+-   **Initial Request**: User: "Who is my top customer by revenue?"
+    -   *Reasoning*: Need Customer and Revenue data analysis. No specific values mentioned.
    -   Tool: `search_data_catalog`
-    -   Params: `{"specific_queries": ["Find datasets identifying the top Customer by revenue, including Customer Name and Customer ID properties."]} `
-   **Follow-up (Exploratory + Value)**: User: "Tell me more about this customer Acme Corp [ID 123]."
-    -   *Reasoning*: Context has basic data. User wants broader info. Specific value "Acme Corp" mentioned.
+    -   Params: `{"specific_queries": ["Find datasets with customer revenue data to identify top customers."]}`
+
+-   **Request with Values**: User: "What's the sales trend for Red Bull in California?"
+    -   *Reasoning*: Need Sales data with specific product and location. Values "Red Bull" and "California" mentioned.
    -   Tool: `search_data_catalog`
-    -   Params: `{"exploratory_topics": ["Acme Corp interaction history", "Acme Corp product usage patterns", "Acme Corp support tickets"], "value_search_terms": ["Acme Corp"]}`
-   **Specific Request with Values**: User: "What's the sales trend for Red Bull in California?"
-    -   *Reasoning*: Need Sales Metric over time, filtered by Product="Red Bull", Region="California". Specific values mentioned.
+    -   Params: `{"specific_queries": ["Find datasets showing sales trends for specific products in specific regions."], "value_search_terms": ["Red Bull", "California"]}`
+
+-   **Exploratory Request**: User: "Tell me about our customer churn patterns."
+    -   *Reasoning*: Broad analytical request requiring data exploration.
    -   Tool: `search_data_catalog`
-    -   Params: `{"specific_queries": ["Find datasets showing Sales trends over time for specific products in specific regions."], "value_search_terms": ["Red Bull", "California"]}`
-   **Mixed Request with Multiple Values**: User: "Compare sales between Nike and Adidas in our Premium tier stores."
-    -   *Reasoning*: Need Sales, Product Brand comparison, Store Tier="Premium" filter. Specific values "Nike", "Adidas", "Premium tier" mentioned. Mixed specific (comparison) and exploratory (brand/tier performance).
-    -   Tool: `search_data_catalog`
-    -   Params: `{"specific_queries": ["Find datasets linking Sales to Product Brand and Store Tier for comparison analysis."], "exploratory_topics": ["Brand comparison metrics", "Premium tier store performance"], "value_search_terms": ["Nike", "Adidas", "Premium tier"]}`
-   **Sufficient Context**: User: "Plot the Q1 revenue for the top customer [Acme Corp] we just identified."
-    -   *Reasoning*: Context has needed customer/revenue data.
+    -   Params: `{"exploratory_topics": ["Customer churn patterns", "Customer retention metrics", "Customer lifecycle data"]}`
+
+-   **Visualization Only**: User: "Make that chart blue instead of red."
+    -   *Reasoning*: Pure visual change, no new data analysis needed.
    -   Tool: `no_search_needed`
-    -   Reason: "Existing dataset descriptions cover the request for Q1 revenue for the identified customer Acme Corp."
-
-**Validation**
-   Ensure parameters reflect the request type and anticipated needs.
-   Ensure `value_search_terms` is populated if specific values were mentioned.
-   Ensure `no_search_needed` reason is accurate.
+    -   Reason: "Request is for visual formatting changes only, no new data analysis required."

 **Available Dataset Names (for context)**
 {DATASETS}

 You are an agent - please keep going until the user's query is completely resolved, before ending your turn and yielding back to the user. Only terminate your turn when you are sure that the problem is solved.
-If you are not sure about file content or codebase structure pertaining to the user's request, use your tools to read files and gather the relevant information: do NOT guess or make up an answer.
+If you are not sure about file content or codebase structure pertaining to the user's request, use your tools to gather the relevant information: do NOT guess or make up an answer.
 You MUST plan extensively before each function call, and reflect extensively on the outcomes of the previous function calls. DO NOT do this entire process by making function calls only, as this can impair your ability to solve the problem and think insightfully.
 "##;
--- a/api/libs/agents/src/tools/categories/file_tools/search_data_catalog.rs
+++ b/api/libs/agents/src/tools/categories/file_tools/search_data_catalog.rs