Merge branch 'evals' of https://github.com/buster-so/buster into evals

2025-04-18 17:00:35 -06:00 · 2025-04-18 17:00:35 -06:00 · 806d360c1a
parent 475c53f9ad cd5dc11501
commit 806d360c1a
4 changed files with 60 additions and 58 deletions
--- a/api/libs/agents/src/agents/modes/analysis.rs
+++ b/api/libs/agents/src/agents/modes/analysis.rs
@@ -205,7 +205,8 @@ You can create, update, or modify the following assets, which are automatically
  - **Review and Update**: After creation, metrics can be reviewed and updated individually or in bulk as needed.
  - **Use in Dashboards**: Metrics can be saved to dashboards for further use.
  - **Percentage Formatting**: When defining a metric with a percentage column (style: `percent`) where the SQL returns the value as a decimal (e.g., 0.75), remember to set the `multiplier` in `columnLabelFormats` to 100 to display it correctly as 75%. If the value is already represented as a percentage (e.g., 75), the multiplier should be 1 (or omitted as it defaults to 1).
-  - **Date Grouping**: For metrics visualizing date columns on the X-axis (e.g., line or combo charts), remember to set the `xAxisTimeInterval` field within the `xAxisConfig` section of `chartConfig` to control how dates are grouped (e.g., `day`, `week`, `month`). This is crucial for meaningful time-series visualizations.
+  - **Date Axis Handling**: When visualizing date/time data on the X-axis (e.g., line/combo charts), you MUST configure the `xAxisConfig` section in the `chartConfig`. **ONLY set the `xAxisTimeInterval` field** (e.g., `xAxisConfig: { xAxisTimeInterval: 'day' }`) to define how dates should be grouped (`day`, `week`, `month`, `quarter`, `year`). This is essential for correct time-series aggregation. **Do NOT add other `xAxisConfig` properties or any `yAxisConfig` properties unless the user specifically asks for them.**
+    - Use the `dateFormat` property within the relevant `columnLabelFormats` entry to format the date labels according to the `xAxisTimeInterval`. Recommended formats: Year ('YYYY'), Quarter ('[Q]Q YYYY'), Month ('MMM YYYY' or 'MMMM'), Week/Day ('MMM D, YYYY' or 'MMM D').

 - **Dashboards**: Collections of metrics displaying live data, refreshed on each page load. Dashboards offer a dynamic, real-time view without descriptions or commentary.

@@ -236,10 +237,25 @@ To conclude your worklow, you use the `finish_and_respond` tool to send a final
 ---

 ## SQL Best Practices and Constraints** (when creating new metrics)  
+- USE POSTGRESQL SYNTAX
+- **Date/Time Functions**:
+  - **`DATE_TRUNC`**: Prefer `DATE_TRUNC('day', column)`, `DATE_TRUNC('week', column)`, `DATE_TRUNC('month', column)`, etc., for grouping time series data. Note that `'week'` starts on Monday.
+  - **`EXTRACT`**:
+    - `EXTRACT(DOW FROM column)` gives day of week (0=Sunday, 6=Saturday).
+    - `EXTRACT(ISODOW FROM column)` gives ISO day of week (1=Monday, 7=Sunday).
+    - `EXTRACT(WEEK FROM column)` gives the week number (starting Monday). Combine with `EXTRACT(ISOYEAR FROM column)` for strict ISO week definitions.
+    - `EXTRACT(EPOCH FROM column)` returns Unix timestamp (seconds).
+  - **Intervals**: Use `INTERVAL '1 day'`, `INTERVAL '1 month'`, etc., for date arithmetic. Be mindful of variations in month/year lengths.
+  - **Performance**: Ensure date/timestamp columns used in `WHERE` or `JOIN` clauses are indexed. Consider functional indexes on `DATE_TRUNC` or `EXTRACT` expressions if filtering/grouping by them frequently.
+- **Grouping and Aggregation**:
+  - **`GROUP BY` Clause**: Include all non-aggregated `SELECT` columns. Using explicit names is clearer than ordinal positions (`GROUP BY 1, 2`).
+  - **`HAVING` Clause**: Use `HAVING` to filter *after* aggregation (e.g., `HAVING COUNT(*) > 10`). Use `WHERE` to filter *before* aggregation for efficiency.
+  - **Window Functions**: Consider window functions (`OVER (...)`) for calculations relative to the current row (e.g., ranking, running totals) as an alternative/complement to `GROUP BY`.
 - **Constraints**: Only join tables with explicit entity relationships.  
 - **SQL Requirements**:  
  - Use database-qualified schema-qualified table names (`<DATABASE_NAME>.<SCHEMA_NAME>.<TABLE_NAME>`).  
  - Use fully qualified column names with table aliases (e.g., `<table_alias>.<column>`).
+  - **Context Adherence**: Strictly use only columns that are present in the data context provided by search results. Never invent or assume columns.
  - Select specific columns (avoid `SELECT *` or `COUNT(*)`).  
  - Use CTEs instead of subqueries, and use snake_case for naming them.  
  - Use `DISTINCT` (not `DISTINCT ON`) with matching `GROUP BY`/`SORT BY` clauses.  
@@ -252,6 +268,7 @@ To conclude your worklow, you use the `finish_and_respond` tool to send a final
  - Use explicit ordering for custom buckets or categories.
  - Avoid division by zero errors by using NULLIF() or CASE statements (e.g., `SELECT amount / NULLIF(quantity, 0)` or `CASE WHEN quantity = 0 THEN NULL ELSE amount / quantity END`).
  - Consider potential data duplication and apply deduplication techniques (e.g., `DISTINCT`, `GROUP BY`) where necessary.
+  - **Fill Missing Values**: For metrics, especially in time series, fill potentially missing values (NULLs) using `COALESCE(<column>, 0)` to default them to zero, ensuring continuous data unless the user specifically requests otherwise.
 ---

 You are an agent - please keep going until the user's query is completely resolved, before ending your turn and yielding back to the user. Only terminate your turn when you are sure that the problem is solved.
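
Note on the `EXTRACT` guidance added to the analysis prompt above: the `DOW` and `ISODOW` conventions differ only in how Sunday is numbered (0 versus 7). A minimal Rust sketch of that relationship follows, for illustration only. `dow_to_isodow` is a hypothetical helper and is not part of this commit.

```rust
/// Converts a PostgreSQL `EXTRACT(DOW FROM ...)` value (0 = Sunday .. 6 = Saturday)
/// into the `EXTRACT(ISODOW FROM ...)` convention (1 = Monday .. 7 = Sunday).
/// Hypothetical helper for illustration; returns `None` for out-of-range input.
pub fn dow_to_isodow(dow: u8) -> Option<u8> {
    match dow {
        0 => Some(7),       // Sunday moves from the start of the week to the end
        1..=6 => Some(dow), // Monday..Saturday keep the same number
        _ => None,
    }
}
```

Because only Sunday changes, a query that buckets by weekday gives the same groups under either convention; only the label for Sunday and the sort order differ.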
--- a/api/libs/agents/src/agents/modes/review.rs
+++ b/api/libs/agents/src/agents/modes/review.rs
@@ -78,54 +78,43 @@ pub fn get_configuration(_agent_data: &ModeAgentData) -> ModeConfiguration {
 // Keep the prompt constant, but it's no longer pub
 const REVIEW_PROMPT: &str = r##"
 Role & Task
-You are Buster, an expert analytics and data engineer. In this "review" mode, your only responsibility is to evaluate a to-do list from the workflow and check off tasks that have been completed. You do not create or analyze anything—just assess and track progress.
+You are Buster, an expert analytics and data engineer. In this "review" mode, your only responsibility is to evaluate a to-do list (plan) provided in the initial user message and determine which steps have been successfully completed based on the subsequent conversation history. You do not create or analyze anything—just assess and track progress against the original plan.

 Workflow Summary

-Review the to-do list to see the tasks that need to be checked.
-Check off completed tasks:
-For each task that is done, use the review_plan tool with the task's index (todo_item, an integer starting from 1) to mark it as complete.
-If a task isn't done, leave it unchecked.
-
-
-Finish up:
-When all tasks are reviewed (checked or not), use the done tool to send a final response to the user summarizing what's complete and what's not.
-
-
-
+1.  **Review the Plan:** Carefully examine the initial to-do list (plan).
+2.  **Analyze History:** Read through the conversation history that follows the plan.
+3.  **Mark Explicitly Completed Tasks:** For each task in the plan that the history clearly shows as completed *before* the final step, use the `review_plan` tool with the task's index (`todo_item`, an integer starting from 1) to mark it as complete.
+4.  **Identify Unfinished Tasks:** Note any tasks from the plan that were *not* explicitly completed according to the history.
+5.  **Finish Up:** Once you have reviewed all tasks and used `review_plan` for the explicitly completed ones, use the `done` tool. This tool will *automatically* mark all remaining *unfinished* tasks as complete and send the final summary response to the user.

 Tool Calling
-You have two tools to do your job:
+You have two tools:

-review_plan: Marks a task as complete. Needs todo_item (an integer) to specify which task (starts at 1).
-done: Marks all remaining unfinished tasks as complete, sends the final response to the user, and ends the workflow. Typically, you should only use this tool when one unfinished task remains.
+*   `review_plan`: Use this ONLY for tasks that were explicitly completed *before* you call `done`. It requires the `todo_item` (integer, starting from 1) of the completed task.
+*   `done`: Use this tool *once* at the very end, after you have finished reviewing the history and potentially used `review_plan` for earlier completed tasks. It automatically marks any remaining *unfinished* tasks as complete, generates the final summary, and ends the workflow.

 Follow these rules:

-Use tools for everything—no direct replies allowed. Format all responses using Markdown. Avoid using the bullet point character `•` for lists; use standard Markdown syntax like `-` or `*` instead.
-Stick to the exact tool format with all required details.
-Only use these two tools, nothing else.
-Don't mention tool names in your explanations (e.g., say "I marked the task as done" instead of naming the tool).
-Don't ask questions—if something's unclear, assume based on what you've got.
-
+*   Use tools for everything—no direct replies allowed. Format all responses using Markdown. Avoid using the bullet point character `•` for lists; use standard Markdown syntax like `-` or `*` instead.
+*   Stick to the exact tool format with all required details.
+*   Only use these two tools.
+*   Do not mention tool names in your explanations (e.g., say "I marked the task as done" instead of naming the tool).
+*   Do not ask questions. Base your assessment solely on the provided plan and history.

 Guidelines

-Keep it simple: Just check what's done and move on.
-Be accurate: Only mark tasks that are actually complete.
-Summarize clearly: In the final response, list what's finished and what's still pending in plain language.
+*   Focus: Just determine completion status based on history.
+*   Accuracy: Only use `review_plan` for tasks demonstrably finished *before* the final step. The `done` tool handles the rest.
+*   Summarize Clearly: The `done` tool is responsible for the final summary.

+Final Response Guidelines (for the `done` tool)

-Final Response Guidelines
-When using the done tool:
-
-Use simple, friendly language anyone can understand.
-Say what's done and what's not, keeping it short and clear.
-Use "I" (e.g., "I marked three tasks as done").
-Use markdown for lists if it helps.
-Don't use technical terms or mention tools.
-
-
-Keep going until you've reviewed every task on the list. Don't stop until you're sure everything's checked or noted as pending, then use the done tool to wrap it up. If you're unsure about a task, assume it's not done unless you have clear evidence otherwise—don't guess randomly.
+*   Use simple, friendly language.
+*   Summarize the overall outcome, stating which tasks were completed (implicitly including those marked by `done` itself).
+*   Use "I" (e.g., "I confirmed the plan is complete.").
+*   Use markdown for lists if needed.
+*   Do not use technical terms or mention tools.

+Review the entire plan and history. Use `review_plan` *only* for tasks completed along the way. Then, use `done` to finalize everything.
 "##;
--- a/api/libs/agents/src/tools/categories/file_tools/common.rs
+++ b/api/libs/agents/src/tools/categories/file_tools/common.rs
@@ -290,14 +290,18 @@ definitions:
      disableTooltip:
        type: boolean
      # Axis Configurations
+      # RULE: By default, only add `xAxisConfig` and ONLY set its `xAxisTimeInterval` property 
+      #       when visualizing date/time data on the X-axis (e.g., line, bar, combo charts). 
+      #       Do NOT add other `xAxisConfig` properties, `yAxisConfig`, or `y2AxisConfig` 
+      #       unless the user explicitly asks for specific axis modifications.
      xAxisConfig:
-        description: Optional X-axis configuration. Primarily used to set the `xAxisTimeInterval` for date axes (day, week, month, etc.). Other properties control label visibility, title, rotation, and zoom.
+        description: Controls X-axis properties. For date/time axes, MUST contain `xAxisTimeInterval` (day, week, month, quarter, year). Other properties control label visibility, title, rotation, and zoom. Only add when needed (dates) or requested by user.
        $ref: '#/definitions/x_axis_config'
      yAxisConfig:
-        description: Optional Y-axis configuration. Primarily used to set the `yAxisShowAxisLabel` and `yAxisShowAxisTitle` properties. Other properties control label visibility, title, rotation, and zoom.
+        description: Controls Y-axis properties. Only add if the user explicitly requests Y-axis modifications (e.g., hiding labels, changing title). Properties control label visibility, title, rotation, and zoom.
        $ref: '#/definitions/y_axis_config'
      y2AxisConfig:
-        description: Optional secondary Y-axis configuration. Used for combo charts.
+        description: Controls secondary Y-axis (Y2) properties, primarily for combo charts. Only add if the user explicitly requests Y2-axis modifications. Properties control label visibility, title, rotation, and zoom.
        $ref: '#/definitions/y2_axis_config'
      categoryAxisStyleConfig:
        description: Optional style configuration for the category axis (color/grouping).
@@ -313,7 +317,7 @@ definitions:
      xAxisTimeInterval:
        type: string
        enum: [day, week, month, quarter, year, 'null']
-        description: Time interval for X-axis (combo/line charts). Default: null.
+        description: REQUIRED time interval for grouping date/time values on the X-axis (e.g., for line/combo charts). MUST be set if the X-axis represents time. Default: null.
      xAxisShowAxisLabel:
        type: boolean
        description: Show X-axis labels. Default: true.
@@ -436,7 +440,13 @@ definitions:
        description: Currency code for currency formatting (e.g., USD, EUR)
      dateFormat:
        type: string
-        description: Format string for date display (must be compatible with Day.js format strings).
+        description: |
+          Format string for date display (must be compatible with Day.js format strings). 
+          RULE: Choose format based on xAxisTimeInterval:
+            - year: 'YYYY' (e.g., 2025)
+            - quarter: '[Q]Q YYYY' (e.g., Q1 2025)
+            - month: 'MMM YYYY' (e.g., Jan 2025) or 'MMMM' (e.g., January) if context is clear.
+            - week/day: 'MMM D, YYYY' (e.g., Jan 25, 2025) or 'MMM D' (e.g., Jan 25) if context is clear.
      useRelativeTime:
        type: boolean
        description: Whether to display dates as relative time (e.g., 2 days ago)
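
Note on the `dateFormat` rule added above: the description maps each `xAxisTimeInterval` value to a recommended Day.js format string. That mapping can be sketched in Rust as below. `TimeInterval`, `parse_interval`, and `recommended_date_format` are hypothetical names used only for illustration, and the `context_is_clear` flag is an assumed way to model the "if context is clear" wording in the schema.

```rust
/// Hypothetical enum mirroring the non-null `xAxisTimeInterval` schema values.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TimeInterval {
    Day,
    Week,
    Month,
    Quarter,
    Year,
}

/// Parses a schema string; `"null"` and unknown values yield `None`.
pub fn parse_interval(s: &str) -> Option<TimeInterval> {
    match s {
        "day" => Some(TimeInterval::Day),
        "week" => Some(TimeInterval::Week),
        "month" => Some(TimeInterval::Month),
        "quarter" => Some(TimeInterval::Quarter),
        "year" => Some(TimeInterval::Year),
        _ => None,
    }
}

/// Returns the Day.js format string recommended in the schema description.
/// `context_is_clear` selects the shorter variant where the schema offers one.
pub fn recommended_date_format(interval: TimeInterval, context_is_clear: bool) -> &'static str {
    match (interval, context_is_clear) {
        (TimeInterval::Year, _) => "YYYY",
        (TimeInterval::Quarter, _) => "[Q]Q YYYY",
        (TimeInterval::Month, false) => "MMM YYYY",
        (TimeInterval::Month, true) => "MMMM",
        (TimeInterval::Week | TimeInterval::Day, false) => "MMM D, YYYY",
        (TimeInterval::Week | TimeInterval::Day, true) => "MMM D",
    }
}
```

The default (`context_is_clear = false`) always includes the year, which is the safer choice when a chart may span more than one year.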
--- a/api/libs/handlers/src/chats/post_chat_handler.rs
+++ b/api/libs/handlers/src/chats/post_chat_handler.rs
@@ -2336,7 +2336,7 @@ fn transform_assistant_tool_message(
                             let review_msg = BusterReasoningMessage::Text(BusterReasoningText {
                                 id: tool_id.clone(),
                                 reasoning_type: "text".to_string(),
-                                 title: "Reviewing Plan...".to_string(),
+                                 title: "Reviewing my work...".to_string(),
                                 secondary_title: "".to_string(),
                                 message: None,
                                 message_chunk: None,
@@ -2351,7 +2351,7 @@ fn transform_assistant_tool_message(
                        let reviewed_msg = BusterReasoningMessage::Text(BusterReasoningText {
                            id: tool_id.clone(),
                            reasoning_type: "text".to_string(),
-                            title: "Reviewed plan".to_string(),
+                            title: "Reviewed my work".to_string(),
                            secondary_title: format!("{:.2} seconds", elapsed_duration.as_secs_f32()),
                            message: None,
                            message_chunk: None,
@@ -2364,22 +2364,8 @@ fn transform_assistant_tool_message(
                         // Update completion time
                        *last_reasoning_completion_time = Instant::now();
                    }
-                } else if tool_name == "no_search_needed" && progress == MessageProgress::Complete {
-                     // Send final "Skipped searching" message
-                    let elapsed_duration = last_reasoning_completion_time.elapsed();
-                    let skipped_msg = BusterReasoningMessage::Text(BusterReasoningText {
-                        id: tool_id.clone(),
-                        reasoning_type: "text".to_string(),
-                        title: "Skipped searching the data catalog".to_string(),
-                        secondary_title: format!("{:.2} seconds", elapsed_duration.as_secs_f32()), // Show duration it took to decide to skip
-                        message: None,
-                        message_chunk: None,
-                        status: Some("completed".to_string()),
-                    });
-                    all_results.push(ToolTransformResult::Reasoning(skipped_msg));
-                    // Update completion time
-                    *last_reasoning_completion_time = Instant::now();
                }
+                // no_search_needed tool doesn't send any messages anymore
            }
            "message_user_clarifying_question" => {
                 // This tool generates a direct response message, not reasoning.