dolphinium committed
Commit a4df1fa · 1 Parent(s): af019d3

enhanced llm prompts to generate analysis
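
The rules this commit adds to the planning prompt (a required `reasoning` object, no reuse of a filtered field as `analysis_dimension`, no `group.sort` when the measure is `count`) could also be checked in code once the model responds. The following is only a minimal sketch under assumptions: `fields_in_filter` and `check_plan` are hypothetical helper names that do not appear in `llm_prompts.py`, and the regex only handles simple `field:value` filters, not the full Solr query syntax.

```python
import json
import re


def fields_in_filter(query_filter: str) -> set[str]:
    """Return the field names used as `field:` in a simple Solr-style filter."""
    # Blank out quoted phrases first so a colon inside a value is not read as a field.
    without_phrases = re.sub(r'"[^"]*"', '""', query_filter)
    return set(re.findall(r"\b([A-Za-z_][A-Za-z0-9_.]*):", without_phrases))


def check_plan(plan: dict) -> list[str]:
    """Collect violations of the prompt's rules (hypothetical helper)."""
    problems = []
    # Rule: the JSON must include a `reasoning` object.
    if not isinstance(plan.get("reasoning"), dict):
        problems.append("missing 'reasoning' object")
    # Anti-redundancy rule: a field pinned in query_filter cannot be the dimension.
    dimension = plan.get("analysis_dimension")
    if dimension in fields_in_filter(plan.get("query_filter", "")):
        problems.append(f"dimension '{dimension}' is already fixed by query_filter")
    # Sorting rule: omit group.sort entirely when the measure is 'count'.
    qualitative = plan.get("qualitative_request", {})
    if plan.get("analysis_measure") == "count" and "group.sort" in qualitative:
        problems.append("'group.sort' must be omitted when the measure is 'count'")
    return problems


# Illustrative model output (made up for this sketch) that breaks the anti-redundancy rule.
example_raw = r'''{
  "reasoning": {"dimension_choice": "...", "measure_choice": "..."},
  "analysis_dimension": "news_type",
  "analysis_measure": "count",
  "query_filter": "news_type:\"product approvals\" AND date_year:2025",
  "qualitative_request": {"group": true, "group.field": "news_type", "sort": "date desc"}
}'''
example_problems = check_plan(json.loads(example_raw))
print(example_problems)
```

A check like this would let the caller retry or reject a plan instead of relying on the prompt alone, at the cost of duplicating the rules in two places.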

Files changed (1)
  1. llm_prompts.py +47 -16
llm_prompts.py CHANGED
@@ -44,23 +44,41 @@ Use only what is logical for the query. Do not construct filters from fields/val
44
  return f"""
45
  You are an expert data analyst and Solr query engineer. Your task is to convert a natural language question into a structured JSON "Analysis Plan". This plan will be used to run two separate, efficient queries: one for aggregate data (facets) and one for finding illustrative examples (grouping).
46
47
  ---
48
  ### CONTEXT & RULES
49
 
50
  1. **Today's Date for Calculations**: {datetime.datetime.now().date().strftime("%Y-%m-%d")}
51
  2. **Field Usage**: You MUST use the fields described in the 'Field Definitions'. Pay close attention to the definitions to select the correct field, especially the `_s` fields for searching. Do not use fields ending with `_s` in `group.field` or facet `field` unless necessary for the analysis.
52
- 3. **Dimension vs. Measure**:
53
- * `analysis_dimension`: The primary categorical field the user wants to group by (e.g., `company_name`, `route_branch`). This is the `group by` field.
54
- Understand the main categories in data according to the sample list. If user didn't mention a category group try to find categories. Like if user tries to differentiate cancer vs. infection she is related with therapeutic categories. If oral vs injection drug delivery branches. If she asks just recent news try to conceive which field is most relevant like deal types.
55
- DO NOT CHOOSE SAME DIMENSION IF YOU USE IT ON query filter
56
- * `analysis_measure`: The metric to aggregate (e.g., `sum(total_deal_value_in_million)`) or the method of counting (`count`).
57
- Try to find what differentiate most relevant entries. If user specifies sth. concentrate on that else find most conspicious / important looking field like deal_value. Make sure it's mostly filled.
58
- * `sort_field_for_examples`: The raw field used to find the "best" example. If `analysis_measure` is `sum(field)`, this should be `field`. If `analysis_measure` is `count`, this should be a relevant field like `date`.
59
- 4. **Crucial Sorting Rules**:
60
  * For `group.sort`: If `analysis_measure` involves a function on a field (e.g., `sum(total_deal_value_in_million)`), you MUST use the full function: `group.sort: 'sum(total_deal_value_in_million) desc'`.
61
  * If `analysis_measure` is 'count', you MUST OMIT the `group.sort` parameter entirely.
62
  * For sorting, NEVER use 'date_year' directly for `sort` in `terms` facets; use 'index asc' or 'index desc' instead. For other sorts, use 'date'.
63
- 5. **Output Format**: Your final output must be a single, raw JSON object. Do not add comments or markdown formatting.
64
 
65
  ---
66
  ### FIELD DEFINITIONS (Your Source of Truth)
@@ -78,6 +96,10 @@ You are an expert data analyst and Solr query engineer. Your task is to convert
78
  **Correct JSON Output 1:**
79
  ```json
80
  {{
81
  "analysis_dimension": "company_name",
82
  "analysis_measure": "sum(total_deal_value_in_million)",
83
  "sort_field_for_examples": "total_deal_value_in_million",
@@ -109,6 +131,10 @@ You are an expert data analyst and Solr query engineer. Your task is to convert
109
  **Correct JSON Output 2:**
110
  ```json
111
  {{
112
  "analysis_dimension": "news_type",
113
  "analysis_measure": "count",
114
  "sort_field_for_examples": "date",
@@ -132,19 +158,23 @@ You are an expert data analyst and Solr query engineer. Your task is to convert
132
  }}
133
  ```
134
 
135
- **User Query 3:** "give me recent news on USA drug approvals"
136
  **Correct JSON Output 3:**
137
  ```json
138
  {{
139
- "analysis_dimension": "company_name",
140
  "analysis_measure": "count",
141
  "sort_field_for_examples": "date",
142
- "query_filter": "territory_hq_s:"united states of america" AND news_type:"product approvals" AND date_year:{datetime.datetime.now().year}",
143
  "quantitative_request": {{
144
  "json.facet": {{
145
- "news_by_company_name": {{
146
  "type": "terms",
147
- "field": "company_name",
148
  "limit": 10,
149
  "sort": "count desc"
150
  }}
@@ -152,7 +182,7 @@ You are an expert data analyst and Solr query engineer. Your task is to convert
152
  }},
153
  "qualitative_request": {{
154
  "group": true,
155
- "group.field": "company_name",
156
  "group.limit": 1,
157
  "sort": "date desc"
158
  }}
@@ -161,10 +191,11 @@ You are an expert data analyst and Solr query engineer. Your task is to convert
161
  ---
162
  ### YOUR TASK
163
 
164
- Convert the following user query into a single, raw JSON "Analysis Plan" object, strictly following all rules and considering the chat history.
165
 
166
  **Current User Query:** `{natural_language_query}`
167
  """
 
168
  # The other prompt functions remain unchanged.
169
  def get_synthesis_report_prompt(query, quantitative_data, qualitative_data, plan):
170
  """
 
44
  return f"""
45
  You are an expert data analyst and Solr query engineer. Your task is to convert a natural language question into a structured JSON "Analysis Plan". This plan will be used to run two separate, efficient queries: one for aggregate data (facets) and one for finding illustrative examples (grouping).
46
 
47
+ Your most important job is to think like an analyst and choose an `analysis_dimension` that provides a meaningful, non-obvious breakdown of the data.
48
+
49
  ---
50
  ### CONTEXT & RULES
51
 
52
  1. **Today's Date for Calculations**: {datetime.datetime.now().date().strftime("%Y-%m-%d")}
53
  2. **Field Usage**: You MUST use the fields described in the 'Field Definitions'. Pay close attention to the definitions to select the correct field, especially the `_s` fields for searching. Do not use fields ending with `_s` in `group.field` or facet `field` unless necessary for the analysis.
54
+ 3. **Crucial Sorting Rules**:
55
  * For `group.sort`: If `analysis_measure` involves a function on a field (e.g., `sum(total_deal_value_in_million)`), you MUST use the full function: `group.sort: 'sum(total_deal_value_in_million) desc'`.
56
  * If `analysis_measure` is 'count', you MUST OMIT the `group.sort` parameter entirely.
57
  * For sorting, NEVER use 'date_year' directly for `sort` in `terms` facets; use 'index asc' or 'index desc' instead. For other sorts, use 'date'.
58
+ 4. **Output Format**: Your final output must be a single, raw JSON object. Do not add comments or markdown formatting. The JSON MUST include a `reasoning` object explaining your choices.
59
+
60
+ ---
61
+ ### HOW TO CHOOSE THE ANALYSIS DIMENSION AND MEASURE (ANALYTICAL STRATEGY)
62
+
63
+ This is the most critical part of your task. A bad choice leads to a useless, boring analysis.
64
+
65
+ **1. Choosing the `analysis_dimension` (The "Group By" field):**
66
+
67
+ * **THE ANTI-REDUNDANCY RULE (MOST IMPORTANT):** If you use a field in the `query_filter` with a specific value (e.g., `news_type:"product approvals"`), you **MUST NOT** use that same field (`news_type`) as the `analysis_dimension`. The user already knows the news type; they want to know something *else* about it. Choosing a redundant dimension is a critical failure.
68
+
69
+ * **USER INTENT FIRST:** If the user explicitly asks to group by a field (e.g., "by company", "by country"), use that field.
70
+
71
+ * **INFERENCE HEURISTICS (If the user doesn't specify a dimension):** Think "What is the next logical question?" to find the most insightful breakdown.
72
+ * If the query is about "drug approvals," a good dimension is `therapeutic_category` (what diseases are the approvals for?) or `company_name` (who is getting the approvals?).
73
+ * If the query compares concepts like "cancer vs. infection," the dimension is `therapeutic_category`.
74
+ * If the query compares "oral vs. injection," the dimension is `route_branch`.
75
+ * For general "recent news" or "top deals," `news_type` or `company_name` are often good starting points.
76
+ * Your goal is to find a dimension that reveals a meaningful pattern in the filtered data.
77
+
78
+ **2. Choosing the `analysis_measure` (The metric):**
79
+
80
+ * **EXPLICIT METRIC:** If the user asks for a value (e.g., "by total deal value", "highest revenue"), use the corresponding field and function (e.g., `sum(total_deal_value_in_million)`).
81
+ * **IMPLICIT COUNT:** If the user asks a "what," "who," "how many," or "most common" question without specifying a value metric, the measure is `count`.
82
 
83
  ---
84
  ### FIELD DEFINITIONS (Your Source of Truth)
 
96
  **Correct JSON Output 1:**
97
  ```json
98
  {{
99
+ "reasoning": {{
100
+ "dimension_choice": "User explicitly asked for 'top 5 companies', so 'company_name' is the correct dimension.",
101
+ "measure_choice": "User explicitly asked for 'total deal value', so 'sum(total_deal_value_in_million)' is the correct measure."
102
+ }},
103
  "analysis_dimension": "company_name",
104
  "analysis_measure": "sum(total_deal_value_in_million)",
105
  "sort_field_for_examples": "total_deal_value_in_million",
 
131
  **Correct JSON Output 2:**
132
  ```json
133
  {{
134
+ "reasoning": {{
135
+ "dimension_choice": "User asked for 'most common news types', so 'news_type' is the correct dimension.",
136
+ "measure_choice": "User asked for 'most common', which implies counting occurrences. Therefore, the measure is 'count'."
137
+ }},
138
  "analysis_dimension": "news_type",
139
  "analysis_measure": "count",
140
  "sort_field_for_examples": "date",
 
158
  }}
159
  ```
160
 
161
+ **User Query 3 (Insightful Breakdown):** "give me recent news on USA drug approvals"
162
  **Correct JSON Output 3:**
163
  ```json
164
  {{
165
+ "reasoning": {{
166
+ "dimension_choice": "The user filtered for 'drug approvals' (news_type) and 'USA' (territory_hq_s). Using 'news_type' as a dimension would be redundant. The next logical question is 'what diseases are these approvals for?'. Therefore, 'therapeutic_category' is the most insightful dimension.",
167
+ "measure_choice": "The user asked for 'news', implying a count of events. 'count' is the appropriate measure."
168
+ }},
169
+ "analysis_dimension": "therapeutic_category",
170
  "analysis_measure": "count",
171
  "sort_field_for_examples": "date",
172
+ "query_filter": "territory_hq_s:\\"united states of america\\" AND news_type:\\"product approvals\\" AND date_year:{datetime.datetime.now().year}",
173
  "quantitative_request": {{
174
  "json.facet": {{
175
+ "approvals_by_therapeutic_category": {{
176
  "type": "terms",
177
+ "field": "therapeutic_category",
178
  "limit": 10,
179
  "sort": "count desc"
180
  }}
 
182
  }},
183
  "qualitative_request": {{
184
  "group": true,
185
+ "group.field": "therapeutic_category",
186
  "group.limit": 1,
187
  "sort": "date desc"
188
  }}
 
191
  ---
192
  ### YOUR TASK
193
 
194
+ Convert the following user query into a single, raw JSON "Analysis Plan" object. Strictly follow all rules, especially the analytical strategy for choosing the dimension and measure. Your JSON output MUST include the `reasoning` field.
195
 
196
  **Current User Query:** `{natural_language_query}`
197
  """
198
+
199
  # The other prompt functions remain unchanged.
200
  def get_synthesis_report_prompt(query, quantitative_data, qualitative_data, plan):
201
  """