mirror of https://github.com/buster-so/buster.git
tweaks for some more speed
parent db220b7fd5
commit c91a078d2b
@@ -1301,7 +1301,7 @@ You are a Search Agent, an AI assistant designed to analyze the conversation his
 1. **Analyze the Request & Context**:
 - Review the latest user message and all conversation history.
 - Assess the agent's current context, specifically focusing on data assets and their **detailed models (including names, documentation, columns, etc.)** identified in previous turns.
-- Determine the data requirements for the *current* user request.
+- Determine the data requirements for the *current* user request, **including both explicitly mentioned subjects and implicitly needed related attributes** (e.g., if asked about 'sales per customer', anticipate the need for 'customer names' or 'customer IDs' alongside 'sales figures' and 'dates').
 
 2. **Decision Logic**:
 - **If the request is ONLY about visualization/charting aspects**: Use `no_search_needed` tool. These requests typically don't require new data assets:
@@ -1313,8 +1313,8 @@ You are a Search Agent, an AI assistant designed to analyze the conversation his
 - **If existing dataset context (detailed models) IS available**: Evaluate if this context provides sufficient information (relevant datasets, columns, documentation) to formulate a plan or perform analysis for the *current* user request.
 - **If sufficient**: Use the `no_search_needed` tool. Provide a reason indicating that the necessary data context (models) is already available from previous steps.
 - **If insufficient (e.g., the request requires data types, columns, or datasets not covered in the existing models)**: Use the `search_data_catalog` tool to acquire the *specific missing* information needed.
-- For **specific requests** needing new data (e.g., finding a previously unmentioned dataset or specific columns), craft a **single, concise query** as a full sentence targeting the primary asset and its attributes.
-- For **broad or vague requests** needing new data (e.g., exploring a new topic), craft **multiple queries**, each targeting a different asset type or topic implied by the request, aiming to discover the necessary foundational datasets/models.
+- For **specific requests** needing new data (e.g., finding a previously unmentioned dataset or specific columns), craft a **single, concise query** as a full sentence targeting the primary asset and its attributes. **Proactively include potentially relevant related attributes** in the query (e.g., for "sales per customer", query for "datasets with customer sales figures, customer names or IDs, and order dates").
+- For **broad or vague requests** needing new data (e.g., exploring a new topic), craft **multiple queries**, each targeting a different asset type or topic implied by the request, aiming to discover the necessary foundational datasets/models. **Ensure queries attempt to find connections between related concepts** (e.g., query for "datasets linking products to sales regions" and "datasets detailing marketing campaign performance").
 
 3. **Tool Call Execution**:
 - Use **only one tool per request** (`search_data_catalog` or `no_search_needed`).
@@ -1325,7 +1325,7 @@ You are a Search Agent, an AI assistant designed to analyze the conversation his
 - **Skip search for pure visualization requests**: If the user is ONLY asking about charting, visualization, or dashboard layout aspects (not requesting new data), use `no_search_needed` with a reason indicating the request is about visualization only.
 - **Default to search if no context**: If no detailed dataset models are available from previous turns, always use `search_data_catalog` first.
 - **Leverage existing context**: Before searching (if context exists), exhaustively evaluate if previously identified dataset models are sufficient to address the current user request's data needs for planning or analysis. Use `no_search_needed` only if the existing models suffice.
-- **Search only for missing information**: If existing context is insufficient, use `search_data_catalog` strategically only to fill the specific gaps in the agent's context (missing datasets, columns, details), not to re-discover information already known.
+- **Search proactively for related attributes**: If existing context is insufficient, use `search_data_catalog` strategically not only to fill the specific gaps but also to proactively find related attributes likely needed for a complete answer (e.g., names, categories, time dimensions). Search for datasets that *connect* these attributes.
 - **Be asset-focused and concise**: If searching, craft queries as concise, natural language sentences explicitly targeting the needed data assets and attributes.
 - **Maximize asset specificity for broad discovery**: When a search is needed for broad requests, generate queries targeting distinct assets implied by the context.
 - **Do not assume data availability**: Base decisions strictly on analyzed context/history.
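Reviewer note: the tool-selection rules in the prompt hunks above can be read as a small decision procedure. The sketch below is only an illustration of that reading. `ToolCall`, `TurnState`, and `decide` are hypothetical names that do not appear in the commit, and in the real system the choice is made by the model following the prompt, not by code like this.

```rust
/// Hypothetical model of the two tools named in the prompt.
#[derive(Debug, PartialEq)]
enum ToolCall {
    NoSearchNeeded { reason: String },
    SearchDataCatalog { queries: Vec<String> },
}

/// Hypothetical summary of what the agent has established about the current turn.
struct TurnState {
    /// The request is only about charting, styling, or dashboard layout.
    visualization_only: bool,
    /// Detailed dataset models are available from previous turns.
    has_dataset_models: bool,
    /// Those models cover the data needs of the current request.
    models_sufficient: bool,
}

/// Returns exactly one tool call, mirroring "use only one tool per request".
fn decide(state: &TurnState, queries: Vec<String>) -> ToolCall {
    if state.visualization_only {
        // Skip search for pure visualization requests.
        return ToolCall::NoSearchNeeded {
            reason: "The request is about visualization only.".to_string(),
        };
    }
    if !state.has_dataset_models {
        // Default to search if no context exists.
        return ToolCall::SearchDataCatalog { queries };
    }
    if state.models_sufficient {
        // Leverage existing context.
        ToolCall::NoSearchNeeded {
            reason: "Existing dataset models cover the request.".to_string(),
        }
    } else {
        // Search for the gaps, with related attributes included in the queries.
        ToolCall::SearchDataCatalog { queries }
    }
}
```

Under this reading, the commit changes only what goes into `queries` (related attributes are now requested proactively), not the branch structure.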
@@ -1335,7 +1335,7 @@ You are a Search Agent, an AI assistant designed to analyze the conversation his
 **Examples**
 - **Initial Request (No Context -> Needs Search)**: User asks, "Show me website traffic."
 - Tool: `search_data_catalog` (Default search as no context exists)
-- Query: "I'm looking for datasets related to website visits or traffic with daily granularity."
+- Query: "I'm looking for datasets related to website visits or traffic with daily granularity, potentially including source or referral information."
 - **Specific Request (Existing Context Insufficient -> Needs Search)**:
 - Context: Agent has models for `customers` and `orders`.
 - User asks: "Analyze website bounce rates by marketing channel."
@@ -1345,7 +1345,7 @@ You are a Search Agent, an AI assistant designed to analyze the conversation his
 - Context: Agent used `search_data_catalog` in Turn 1, retrieved detailed models for `customers` and `orders` datasets (including columns like `customer_id`, `order_date`, `total_amount`, `ltv`).
 - User asks in Turn 2: "Show me the lifetime value and recent orders for our top customer by revenue."
 - Tool: `no_search_needed`
-- Reason: "The necessary dataset models (`customers`, `orders`) identified previously contain the required columns (`ltv`, `order_date`, `total_amount`) to fulfill this request."
+- Reason: "The necessary dataset models (`customers`, `orders`) identified previously contain the required columns (`ltv`, `order_date`, `total_amount`, `customer_id`) to fulfill this request."
 - **Visualization-Only Request (No Search Needed)**: User asks, "Make all the charts blue and add them to a dashboard."
 - Tool: `no_search_needed`
 - Reason: "The request is only about chart styling and dashboard placement, not requiring any new data assets."
@@ -1371,11 +1371,11 @@ You are a Search Agent, an AI assistant designed to analyze the conversation his
 - Derive data needs from the user request *and* the current context (existing detailed dataset models).
 - If no models exist, search.
 - If models exist, evaluate their sufficiency for the current request. If sufficient, use `no_search_needed`.
-- If models exist but are insufficient, formulate precise `search_data_catalog` queries for the *missing* assets/attributes/details.
-- Queries should reflect a data analyst's natural articulation of intent.
+- If models exist but are insufficient, formulate precise `search_data_catalog` queries for the *missing* assets/attributes/details, proactively including related context.**
+- **Queries should reflect a data analyst's natural articulation of intent.**
 
 **Validation**
-- For `search_data_catalog`, ensure queries target genuinely *missing* information needed to proceed, based on context analysis.
+- For `search_data_catalog`, ensure queries target genuinely *missing* information needed to proceed, based on context analysis, **and proactively seek relevant related attributes**.
 - For `no_search_needed`, verify that the agent's current context (detailed models from history/state) is indeed sufficient for the next step of the current request.
 
 **Datasets you have access to**

@@ -56,14 +56,7 @@ struct RankedDataset {
 
 #[derive(Debug, Deserialize)]
 struct LLMFilterResponse {
-    results: Vec<FilteredDataset>,
-}
-
-#[derive(Debug, Deserialize)]
-struct FilteredDataset {
-    id: String,
-    #[allow(dead_code)]
-    reason: String,
+    results: Vec<String>,
 }
 
 const LLM_FILTER_PROMPT: &str = r#"
@@ -79,15 +72,13 @@ Include datasets that have even a reasonable possibility of containing relevant
 DATASETS:
 {datasets_json}
 
-Return a JSON response with the following structure:
+Return a JSON response containing ONLY a list of the UUIDs for the relevant datasets. The response should have the following structure:
 ```json
 {
   "results": [
-    {
-      "id": "dataset-uuid-here",
-      "reason": "Brief explanation of why this dataset's structure might be relevant"
-    },
-    // ... more potentially relevant datasets
+    "dataset-uuid-here-1",
+    "dataset-uuid-here-2"
+    // ... more potentially relevant dataset UUIDs
   ]
 }
 ```
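Reviewer note: the commit message mentions speed, and one plausible reading (an assumption, since the commit does not state it) is that the ID-only response gives the model less text to generate per dataset. The std-only sketch below renders both response shapes by hand so their sizes can be compared. `render_with_reasons` and `render_ids_only` are hypothetical helpers, and they skip JSON escaping, so they assume inputs without quotes or backslashes.

```rust
/// Old shape: one `{"id": ..., "reason": ...}` object per dataset.
/// Hypothetical helper; performs no JSON escaping.
fn render_with_reasons(items: &[(&str, &str)]) -> String {
    let body: Vec<String> = items
        .iter()
        .map(|(id, reason)| format!(r#"{{"id":"{}","reason":"{}"}}"#, id, reason))
        .collect();
    format!(r#"{{"results":[{}]}}"#, body.join(","))
}

/// New shape: a bare string per dataset.
/// Hypothetical helper; performs no JSON escaping.
fn render_ids_only(ids: &[&str]) -> String {
    let body: Vec<String> = ids.iter().map(|id| format!("\"{}\"", id)).collect();
    format!(r#"{{"results":[{}]}}"#, body.join(","))
}
```

For the same set of IDs, `render_ids_only` always yields the shorter string, because each entry drops the object wrapper and the free-text reason.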
@@ -101,7 +92,7 @@ IMPORTANT GUIDELINES:
 4. Evaluate based on whether the dataset's schema, fields, or description MIGHT contain or relate to the relevant information
 5. Include datasets that could provide contextual or supporting information
 6. When in doubt about relevance, lean towards including the dataset
-7. **CRITICAL:** The "id" field in your JSON response MUST contain ONLY the dataset's UUID string (e.g., "9711ca55-8329-4fd9-8b20-b6a3289f3d38"). Do NOT include the dataset name or any other information in the "id" field.
+7. **CRITICAL:** Each string in the "results" array MUST contain ONLY the dataset's UUID string (e.g., "9711ca55-8329-4fd9-8b20-b6a3289f3d38"). Do NOT include the dataset name or any other information.
 8. Use both the USER REQUEST and SEARCH QUERY to understand the user's information needs broadly
 9. Consider these elements in the dataset metadata:
 - Column names and their data types
@@ -509,9 +500,9 @@ async fn filter_datasets_with_llm(
 let filtered_datasets: Vec<DatasetResult> = filter_response
     .results
     .into_iter()
-    .filter_map(|result| {
-        debug!(llm_result_id = %result.id, "Processing LLM filter result");
-        let parsed_uuid_result = Uuid::parse_str(&result.id);
+    .filter_map(|dataset_id_str| {
+        debug!(llm_result_id_str = %dataset_id_str, "Processing LLM filter result ID string");
+        let parsed_uuid_result = Uuid::parse_str(&dataset_id_str);
         match &parsed_uuid_result {
             Ok(parsed_id) => {
                 debug!(parsed_id = %parsed_id, "Successfully parsed UUID from LLM result");
@@ -532,7 +523,7 @@ async fn filter_datasets_with_llm(
                 }
             }
             Err(e) => {
-                error!(llm_result_id = %result.id, error = %e, "Failed to parse UUID from LLM result");
+                error!(llm_result_id_str = %dataset_id_str, error = %e, "Failed to parse UUID from LLM result string");
                 None
             }
         }
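Reviewer note: the two hunks above show the filter loop iterating over plain strings and dropping any entry that fails UUID parsing. The sketch below reproduces that pattern with the standard library only, so it runs without the `uuid` crate. `parse_hyphenated_uuid` and `keep_valid_ids` are hypothetical stand-ins. The parser accepts only the hyphenated 8-4-4-4-12 form, which is stricter than `Uuid::parse_str`, and whatever the real function does after a successful parse is not visible in the hunks and is not modeled here.

```rust
/// Hypothetical std-only stand-in for `Uuid::parse_str`.
/// Accepts only the hyphenated 8-4-4-4-12 hex form.
fn parse_hyphenated_uuid(s: &str) -> Result<u128, String> {
    let groups: Vec<&str> = s.split('-').collect();
    let expected = [8usize, 4, 4, 4, 12];
    if groups.len() != expected.len() {
        return Err(format!(
            "expected 5 hyphen-separated groups, found {}",
            groups.len()
        ));
    }
    let mut value: u128 = 0;
    for (group, len) in groups.iter().zip(expected) {
        if group.len() != len || !group.chars().all(|c| c.is_ascii_hexdigit()) {
            return Err(format!("invalid group {:?}", group));
        }
        for c in group.chars() {
            // Safe to unwrap: every character was checked to be a hex digit.
            value = (value << 4) | c.to_digit(16).unwrap() as u128;
        }
    }
    Ok(value)
}

/// Keeps the entries that parse as UUIDs and logs the rest to stderr,
/// mirroring the `filter_map` with `Ok`/`Err` arms in the diff.
fn keep_valid_ids(llm_results: Vec<String>) -> Vec<u128> {
    llm_results
        .into_iter()
        .filter_map(|dataset_id_str| match parse_hyphenated_uuid(&dataset_id_str) {
            Ok(id) => Some(id),
            Err(e) => {
                eprintln!(
                    "Failed to parse UUID from LLM result string {:?}: {}",
                    dataset_id_str, e
                );
                None
            }
        })
        .collect()
}
```

An entry such as `"customers (9711ca55)"` is dropped here, which is the failure mode guideline 7 of the prompt is trying to prevent.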