mirror of https://github.com/buster-so/buster.git
search data catalog change
parent 360019f9eb
commit 8f3fb8732d
@@ -68,14 +68,14 @@ struct FilteredDataset {
 }

 const LLM_FILTER_PROMPT: &str = r#"
-You are a dataset relevance evaluator. Your task is to determine which datasets might contain information relevant to the user's query based on their structure and metadata.
+You are a dataset relevance evaluator. Your task is to determine which datasets might contain information relevant to the user's query based on their structure and metadata. Be inclusive in your evaluation - if there's a reasonable chance the dataset could be useful, include it.

 USER REQUEST: {user_request}
 SEARCH QUERY: {query}

 Below is a list of datasets that were identified as potentially relevant by an initial semantic ranking system.
-For each dataset, review its description in the YAML format and determine if its structure is suitable for the user's query.
-ONLY include datasets that you determine are relevant in your response.
+For each dataset, review its description in the YAML format and determine if its structure could potentially be suitable for the user's query.
+Include datasets that have even a reasonable possibility of containing relevant information.

 DATASETS:
 {datasets_json}
@@ -86,30 +86,33 @@ Return a JSON response with the following structure:
 "results": [
 {
 "id": "dataset-uuid-here",
-"reason": "Brief explanation of why this dataset's structure is relevant"
+"reason": "Brief explanation of why this dataset's structure might be relevant"
 },
-// ... more relevant datasets only
+// ... more potentially relevant datasets
 ]
 }
 ```

 IMPORTANT GUIDELINES:
-1. DO NOT make assumptions about what specific values exist in the datasets
-2. Focus EXCLUSIVELY on identifying datasets with STRUCTURES that could reasonably contain the type of information requested
-3. For example, if a user asks about "red bull sales", consider datasets about products, sales, inventory, etc. as potentially relevant - even if "red bull" is not explicitly mentioned
-4. Evaluate based on whether the dataset's schema, fields, or description indicates it COULD contain the relevant information
-5. Look for structural compatibility rather than exact matches in the content
-6. ONLY include datasets you find relevant in your response - omit any that aren't relevant
+1. Be inclusive - if there's a reasonable possibility the dataset could be useful, include it
+2. Consider both direct and indirect relationships to the query
+3. For example, if a user asks about "red bull sales", consider datasets about:
+- Direct relevance: products, sales, inventory
+- Indirect relevance: marketing campaigns, customer demographics, store locations
+4. Evaluate based on whether the dataset's schema, fields, or description MIGHT contain or relate to the relevant information
+5. Include datasets that could provide contextual or supporting information
+6. When in doubt about relevance, lean towards including the dataset
 7. Ensure the "id" field exactly matches the dataset's UUID
-8. Use both the USER REQUEST and SEARCH QUERY to understand the user's information needs - the USER REQUEST provides broader context while the SEARCH QUERY represents specific search intent
-9. Restrict your evaluation strictly to the defined elements in the dataset metadata:
+8. Use both the USER REQUEST and SEARCH QUERY to understand the user's information needs broadly
+9. Consider these elements in the dataset metadata:
 - Column names and their data types
 - Entity relationships
 - Predefined metrics
 - Table schemas
 - Dimension hierarchies
-10. Do NOT make assumptions about what data might exist beyond what is explicitly defined in the metadata
-11. A dataset is relevant ONLY if its documented structure supports answering the query, not because you assume it might contain certain data
+- Related or connected data structures
+10. While you shouldn't assume specific data exists, you can be optimistic about the potential usefulness of related data structures
+11. A dataset is relevant if its structure could reasonably support or contribute to answering the query, either directly or indirectly
 "#;

 pub fn router() -> Router {
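The prompt constant changed here carries three placeholders: `{user_request}`, `{query}` and `{datasets_json}`. The diff does not show how they are filled, so the following is only a minimal sketch of one plausible approach, using plain substring replacement. `PROMPT_TEMPLATE` is a shortened stand-in for `LLM_FILTER_PROMPT`, and `build_filter_prompt` is a hypothetical helper name, not a function from the repository.

```rust
/// Shortened, hypothetical stand-in for the LLM_FILTER_PROMPT template.
const PROMPT_TEMPLATE: &str = r#"
USER REQUEST: {user_request}
SEARCH QUERY: {query}

DATASETS:
{datasets_json}
"#;

/// Fill the three placeholders by substring replacement.
/// `format!` is not an option here: it needs a string literal as its
/// template, and the JSON braces in the full prompt would need escaping.
fn build_filter_prompt(user_request: &str, query: &str, datasets_json: &str) -> String {
    PROMPT_TEMPLATE
        .replace("{user_request}", user_request)
        .replace("{query}", query)
        .replace("{datasets_json}", datasets_json)
}

fn main() {
    let prompt = build_filter_prompt(
        "How did red bull sales change last quarter?",
        "product sales by quarter",
        r#"[{"id": "dataset-uuid-here"}]"#,
    );
    println!("{prompt}");
}
```

One caveat with chained replacement: if the user's request itself contains the text `{query}` or `{datasets_json}`, a later `replace` call would substitute into it. Whether that matters depends on how the real code builds the prompt.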
@@ -238,7 +241,7 @@ async fn rerank_datasets(
 query,
 documents,
 model: ReRankModel::EnglishV3,
-top_n: Some(25), // Get top 20 results per query
+top_n: Some(30),
 ..Default::default()
 };

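The last hunk raises `top_n` from 25 to 30, so more candidates survive reranking and reach the now more inclusive LLM filter. As a rough local illustration of what such a cutoff does, and not the reranking client's actual behaviour or types, the sketch below sorts scored candidates by descending score and keeps at most `n`. The `Scored` struct and `keep_top_n` function are hypothetical names introduced for this example.

```rust
/// Hypothetical scored candidate; the real rerank response type is not
/// shown in the diff.
#[derive(Debug, Clone, PartialEq)]
struct Scored {
    index: usize,         // position of the document in the request
    relevance_score: f64, // higher means more relevant
}

/// Sort by descending relevance and keep at most `n` entries.
fn keep_top_n(mut results: Vec<Scored>, n: usize) -> Vec<Scored> {
    results.sort_by(|a, b| b.relevance_score.total_cmp(&a.relevance_score));
    results.truncate(n);
    results
}

fn main() {
    let candidates = vec![
        Scored { index: 0, relevance_score: 0.2 },
        Scored { index: 1, relevance_score: 0.9 },
        Scored { index: 2, relevance_score: 0.5 },
    ];
    // With the value from this commit, all three candidates are kept.
    let kept = keep_top_n(candidates, 30);
    println!("{kept:?}");
}
```

A larger cutoff trades a longer prompt for the downstream filter against a lower chance of dropping a useful dataset before that filter sees it.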