diff --git a/api/src/utils/tools/file_tools/search_data_catalog.rs b/api/src/utils/tools/file_tools/search_data_catalog.rs
index f060bcbf6..68494cf1a 100644
--- a/api/src/utils/tools/file_tools/search_data_catalog.rs
+++ b/api/src/utils/tools/file_tools/search_data_catalog.rs
@@ -44,14 +44,19 @@ struct RawLLMResponse {
 }
 
 const CATALOG_SEARCH_PROMPT: &str = r#"
-You are a dataset search assistant. You have access to a collection of datasets with their YML content.
-Your task is to identify all relevant datasets based on the following search request:
+You are a dataset search assistant tasked with finding highly relevant datasets that SPECIFICALLY match the user's requirements.
+Your task is to identify only the most relevant datasets based on the following search request:
 
 {queries_joined_with_newlines}
 
-Consider all queries collectively to determine relevance. These queries describe different aspects of the problem or question that needs to be answered.
+Evaluation Criteria:
+1. Direct Relevance: The dataset must directly address the core aspects of the search query
+2. Schema Alignment: The dataset's structure should contain fields that match the required information
+3. Data Coverage: The dataset should cover the specific domain or business context mentioned
+4. Recency & Quality: Prefer datasets with complete metadata and documentation
+
 The YML content contains important information about the dataset including its schema, description, and other metadata.
-Use this information to determine if the dataset would be relevant to answering the queries.
+Only include datasets that meet AT LEAST 3 of the above criteria with high confidence.
 
 IMPORTANT: You must return your response in this exact JSON format:
 {
@@ -67,11 +72,12 @@ Available datasets:
 {datasets_array_as_json}
 
 Requirements:
-1. Return all relevant datasets (no limit)
+1. Return ONLY datasets that are highly relevant (meeting 3+ criteria)
 2. Order results from most to least relevant
 3. ALWAYS include the "results" key in your response, even if the array is empty
 4. Each result MUST ONLY include the "id" field containing the UUID string
-5. If no datasets are relevant, return {"results": []}
+5. If no datasets meet the relevance criteria, return {"results": []}
+6. Exclude datasets that only tangentially relate to the query
 "#;
 
 pub struct SearchDataCatalogTool {