dolphinium/pc-ai-data-analyst-v2


dolphinium committed 25 days ago

Commit

a4df1fa

1 Parent(s): af019d3

enhanced llm prompts to generate analysis
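
The updated prompt requires every Analysis Plan to carry a `reasoning` object and forbids reusing a filtered field as the `analysis_dimension`. A minimal sketch of how a caller might check a returned plan against those two rules follows. The function name `check_plan` and the regex used to pull field names out of `query_filter` are assumptions for illustration, not part of this commit, and the field extraction is deliberately naive (it would also match a `name:` pattern that happens to appear inside a quoted value).

```python
import json
import re


def check_plan(raw):
    """Return a list of problems found in a raw Analysis Plan string.

    Hypothetical helper: it only checks the two rules this commit adds
    to the prompt (a `reasoning` object, and the anti-redundancy rule).
    """
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        return ["not valid JSON: " + exc.msg]
    if not isinstance(plan, dict):
        return ["plan is not a JSON object"]

    problems = []

    # Rule: the plan must include a `reasoning` object with both explanations.
    reasoning = plan.get("reasoning")
    if not isinstance(reasoning, dict):
        problems.append("missing `reasoning` object")
    else:
        for key in ("dimension_choice", "measure_choice"):
            if not reasoning.get(key):
                problems.append("`reasoning` lacks `" + key + "`")

    # Rule: a field used in `query_filter` must not be the dimension.
    # Naive extraction: any identifier directly followed by a colon.
    query_filter = str(plan.get("query_filter") or "")
    filter_fields = set(re.findall(r"\b([A-Za-z_]\w*):", query_filter))
    dimension = plan.get("analysis_dimension")
    if dimension in filter_fields:
        problems.append("redundant dimension: `" + str(dimension) + "` is already filtered")

    return problems


if __name__ == "__main__":
    redundant = json.dumps({
        "reasoning": {"dimension_choice": "x", "measure_choice": "y"},
        "analysis_dimension": "news_type",
        "query_filter": 'news_type:"product approvals"',
    })
    print(check_plan(redundant))
```

A check like this could run before the plan is turned into Solr requests, so that a redundant or unexplained plan is rejected or retried instead of producing a trivial breakdown.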


llm_prompts.py +47 -16
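
One detail of the new Example 3 is worth checking: its `query_filter` now writes the inner quotes as `\"`. Assuming the prompt stays a regular (non-raw) f-string, as the `return f"""` context line suggests, Python collapses `\"` to a bare `"` when the string is built, so the rendered example would still contain unescaped quotes. The sketch below reproduces that behaviour on a shortened, hypothetical one-line version of the example; `year` stands in for `datetime.datetime.now().year`.

```python
import json

year = 2024  # hypothetical stand-in for datetime.datetime.now().year

# Same escaping as in the diff: in a regular f-string, \" becomes a bare quote.
single = f"""{{"query_filter": "news_type:\"product approvals\" AND date_year:{year}"}}"""

# Doubling the backslash keeps a literal \" in the rendered prompt text.
double = f"""{{"query_filter": "news_type:\\"product approvals\\" AND date_year:{year}"}}"""


def is_valid_json(text):
    """Return True if `text` parses as JSON, False otherwise."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False


if __name__ == "__main__":
    for label, rendered in (("single backslash", single), ("double backslash", double)):
        print(label, "->", rendered, "| valid JSON:", is_valid_json(rendered))
```

If the model is expected to imitate the example verbatim, the doubled backslash (or a raw f-string) is one way to make the rendered example itself parse as JSON.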


@@ -44,23 +44,41 @@ Use only what is logical for the query. Do not construct filters from fields/val
     return f"""
 You are an expert data analyst and Solr query engineer. Your task is to convert a natural language question into a structured JSON "Analysis Plan". This plan will be used to run two separate, efficient queries: one for aggregate data (facets) and one for finding illustrative examples (grouping).
 ---
 ### CONTEXT & RULES
 1.  **Today's Date for Calculations**: {datetime.datetime.now().date().strftime("%Y-%m-%d")}
 2.  **Field Usage**: You MUST use the fields described in the 'Field Definitions'. Pay close attention to the definitions to select the correct field, especially the `_s` fields for searching. Do not use fields ending with `_s` in `group.field` or facet `field` unless necessary for the analysis.
-3.  **Dimension vs. Measure**:
-    *   `analysis_dimension`: The primary categorical field the user wants to group by (e.g., `company_name`, `route_branch`). This is the `group by` field.
-    Understand the main categories in data according to the sample list. If user didn't mention a category group try to find categories. Like if user tries to differentiate cancer vs. infection she is related with therapeutic categories. If oral vs injection drug delivery branches. If she asks just recent news try to conceive which field is most relevant like deal types.
-    DO NOT CHOOSE SAME DIMENSION IF YOU USE IT ON query filter
-    *   `analysis_measure`: The metric to aggregate (e.g., `sum(total_deal_value_in_million)`) or the method of counting (`count`).
-    Try to find what differentiate most relevant entries. If user specifies sth. concentrate on that else find most conspicious / important looking field like deal_value. Make sure it's mostly filled.
-    *   `sort_field_for_examples`: The raw field used to find the "best" example. If `analysis_measure` is `sum(field)`, this should be `field`. If `analysis_measure` is `count`, this should be a relevant field like `date`.
-4.  **Crucial Sorting Rules**:
     *   For `group.sort`: If `analysis_measure` involves a function on a field (e.g., `sum(total_deal_value_in_million)`), you MUST use the full function: `group.sort: 'sum(total_deal_value_in_million) desc'`.
     *   If `analysis_measure` is 'count', you MUST OMIT the `group.sort` parameter entirely.
     *   For sorting, NEVER use 'date_year' directly for `sort` in `terms` facets; use 'index asc' or 'index desc' instead. For other sorts, use 'date'.
-5.  **Output Format**: Your final output must be a single, raw JSON object. Do not add comments or markdown formatting.
 ---
 ### FIELD DEFINITIONS (Your Source of Truth)
@@ -78,6 +96,10 @@ You are an expert data analyst and Solr query engineer. Your task is to convert
 **Correct JSON Output 1:**
 ```json
 {{
   "analysis_dimension": "company_name",
   "analysis_measure": "sum(total_deal_value_in_million)",
   "sort_field_for_examples": "total_deal_value_in_million",
@@ -109,6 +131,10 @@ You are an expert data analyst and Solr query engineer. Your task is to convert
 **Correct JSON Output 2:**
 ```json
 {{
   "analysis_dimension": "news_type",
   "analysis_measure": "count",
   "sort_field_for_examples": "date",
@@ -132,19 +158,23 @@ You are an expert data analyst and Solr query engineer. Your task is to convert
 }}
 ```
-**User Query 3:** "give me recent news on USA drug approvals"
 **Correct JSON Output 3:**
 ```json
 {{
-  "analysis_dimension": "company_name",
   "analysis_measure": "count",
   "sort_field_for_examples": "date",
-  "query_filter": "territory_hq_s:"united states of america" AND news_type:"product approvals" AND date_year:{datetime.datetime.now().year}",
   "quantitative_request": {{
     "json.facet": {{
-      "news_by_company_name": {{
         "type": "terms",
-        "field": "company_name",
         "limit": 10,
         "sort": "count desc"
       }}
@@ -152,7 +182,7 @@ You are an expert data analyst and Solr query engineer. Your task is to convert
   }},
   "qualitative_request": {{
     "group": true,
-    "group.field": "company_name",
     "group.limit": 1,
     "sort": "date desc"
   }}
@@ -161,10 +191,11 @@ You are an expert data analyst and Solr query engineer. Your task is to convert
 ---
 ### YOUR TASK
-Convert the following user query into a single, raw JSON "Analysis Plan" object, strictly following all rules and considering the chat history.
 **Current User Query:** `{natural_language_query}`
 """
 # The other prompt functions remain unchanged.
 def get_synthesis_report_prompt(query, quantitative_data, qualitative_data, plan):
     """

     return f"""
 You are an expert data analyst and Solr query engineer. Your task is to convert a natural language question into a structured JSON "Analysis Plan". This plan will be used to run two separate, efficient queries: one for aggregate data (facets) and one for finding illustrative examples (grouping).
+Your most important job is to think like an analyst and choose an `analysis_dimension` that provides a meaningful, non-obvious breakdown of the data.
 ---
 ### CONTEXT & RULES
 1.  **Today's Date for Calculations**: {datetime.datetime.now().date().strftime("%Y-%m-%d")}
 2.  **Field Usage**: You MUST use the fields described in the 'Field Definitions'. Pay close attention to the definitions to select the correct field, especially the `_s` fields for searching. Do not use fields ending with `_s` in `group.field` or facet `field` unless necessary for the analysis.
+3.  **Crucial Sorting Rules**:
     *   For `group.sort`: If `analysis_measure` involves a function on a field (e.g., `sum(total_deal_value_in_million)`), you MUST use the full function: `group.sort: 'sum(total_deal_value_in_million) desc'`.
     *   If `analysis_measure` is 'count', you MUST OMIT the `group.sort` parameter entirely.
     *   For sorting, NEVER use 'date_year' directly for `sort` in `terms` facets; use 'index asc' or 'index desc' instead. For other sorts, use 'date'.
+4.  **Output Format**: Your final output must be a single, raw JSON object. Do not add comments or markdown formatting. The JSON MUST include a `reasoning` object explaining your choices.
+---
+### HOW TO CHOOSE THE ANALYSIS DIMENSION AND MEASURE (ANALYTICAL STRATEGY)
+This is the most critical part of your task. A bad choice leads to a useless, boring analysis.
+**1. Choosing the `analysis_dimension` (The "Group By" field):**
+*   **THE ANTI-REDUNDANCY RULE (MOST IMPORTANT):** If you use a field in the `query_filter` with a specific value (e.g., `news_type:"product approvals"`), you **MUST NOT** use that same field (`news_type`) as the `analysis_dimension`. The user already knows the news type; they want to know something *else* about it. Choosing a redundant dimension is a critical failure.
+*   **USER INTENT FIRST:** If the user explicitly asks to group by a field (e.g., "by company", "by country"), use that field.
+*   **INFERENCE HEURISTICS (If the user doesn't specify a dimension):** Think "What is the next logical question?" to find the most insightful breakdown.
+    *   If the query is about "drug approvals," a good dimension is `therapeutic_category_s` (what diseases are the approvals for?) or `company_name` (who is getting the approvals?).
+    *   If the query compares concepts like "cancer vs. infection," the dimension is `therapeutic_category_s`.
+    *   If the query compares "oral vs. injection," the dimension is `route_branch`.
+    *   For general "recent news" or "top deals," `news_type` or `company_name` are often good starting points.
+    *   Your goal is to find a dimension that reveals a meaningful pattern in the filtered data.
+**2. Choosing the `analysis_measure` (The metric):**
+*   **EXPLICIT METRIC:** If the user asks for a value (e.g., "by total deal value", "highest revenue"), use the corresponding field and function (e.g., `sum(total_deal_value_in_million)`).
+*   **IMPLICIT COUNT:** If the user asks a "what," "who," "how many," or "most common" question without specifying a value metric, the measure is `count`.
 ---
 ### FIELD DEFINITIONS (Your Source of Truth)
 **Correct JSON Output 1:**
 ```json
 {{
+  "reasoning": {{
+    "dimension_choice": "User explicitly asked for 'top 5 companies', so 'company_name' is the correct dimension.",
+    "measure_choice": "User explicitly asked for 'total deal value', so 'sum(total_deal_value_in_million)' is the correct measure."
+  }},
   "analysis_dimension": "company_name",
   "analysis_measure": "sum(total_deal_value_in_million)",
   "sort_field_for_examples": "total_deal_value_in_million",
 **Correct JSON Output 2:**
 ```json
 {{
+  "reasoning": {{
+    "dimension_choice": "User asked for 'most common news types', so 'news_type' is the correct dimension.",
+    "measure_choice": "User asked for 'most common', which implies counting occurrences. Therefore, the measure is 'count'."
+  }},
   "analysis_dimension": "news_type",
   "analysis_measure": "count",
   "sort_field_for_examples": "date",
 }}
 ```
+**User Query 3 (Insightful Breakdown):** "give me recent news on USA drug approvals"
 **Correct JSON Output 3:**
 ```json
 {{
+  "reasoning": {{
+    "dimension_choice": "The user filtered for 'drug approvals' (news_type) and 'USA' (territory_hq_s). Using 'news_type' as a dimension would be redundant. The next logical question is 'what diseases are these approvals for?'. Therefore, 'therapeutic_category' is the most insightful dimension.",
+    "measure_choice": "The user asked for 'news', implying a count of events. 'count' is the appropriate measure."
+  }},
+  "analysis_dimension": "therapeutic_category",
   "analysis_measure": "count",
   "sort_field_for_examples": "date",
+  "query_filter": "territory_hq_s:\"united states of america\" AND news_type:\"product approvals\" AND date_year:{datetime.datetime.now().year}",
   "quantitative_request": {{
     "json.facet": {{
+      "approvals_by_therapeutic_category": {{
         "type": "terms",
+        "field": "therapeutic_category",
         "limit": 10,
         "sort": "count desc"
       }}
   }},
   "qualitative_request": {{
     "group": true,
+    "group.field": "therapeutic_category",
     "group.limit": 1,
     "sort": "date desc"
   }}
 ---
 ### YOUR TASK
+Convert the following user query into a single, raw JSON "Analysis Plan" object. Strictly follow all rules, especially the analytical strategy for choosing the dimension and measure. Your JSON output MUST include the `reasoning` field.
 **Current User Query:** `{natural_language_query}`
 """
 # The other prompt functions remain unchanged.
 def get_synthesis_report_prompt(query, quantitative_data, qualitative_data, plan):
     """