# [GPT-4.1 Prompting Guide](https://cookbook.openai.com/examples/gpt4-1_prompting_guide#2-long-context)
> By [Noah MacCallum](https://x.com/noahmacca) and [Julian Lee](https://x.com/julianl093) (OpenAI)

The GPT-4.1 family of models represents a significant step forward from GPT-4o in capabilities across coding, instruction following, and long context. In this prompting guide, we collate a series of important prompting tips derived from extensive internal testing to help developers fully leverage the improved abilities of this new model family.

Many typical best practices still apply to GPT-4.1, such as providing context examples, making instructions as specific and clear as possible, and inducing planning via prompting to maximize model intelligence. However, we expect that getting the most out of this model will require some prompt migration. GPT-4.1 is trained to follow instructions more closely and more literally than its predecessors, which tended to more liberally infer intent from user and system prompts. This also means, however, that GPT-4.1 is highly steerable and responsive to well-specified prompts - if model behavior is different from what you expect, a single sentence firmly and unequivocally clarifying your desired behavior is almost always sufficient to steer the model on course.

Please read on for prompt examples you can use as a reference, and remember that while this guidance is widely applicable, no advice is one-size-fits-all. AI engineering is inherently an empirical discipline, and large language models are inherently nondeterministic; in addition to following this guide, we advise building informative evals and iterating often to ensure your prompt engineering changes are yielding benefits for your use case.

# 1. Agentic Workflows
GPT-4.1 is a great place to build agentic workflows. In model training we emphasized providing a diverse range of agentic problem-solving trajectories, and our agentic harness for the model achieves state-of-the-art performance for non-reasoning models on SWE-bench Verified, solving 55% of problems.

## System Prompt Reminders
In order to fully utilize the agentic capabilities of GPT-4.1, we recommend including three key types of reminders in all agent prompts. The following prompts are optimized specifically for the agentic coding workflow, but can be easily modified for general agentic use cases.

1. Persistence: this ensures the model understands it is entering a multi-message turn, and prevents it from prematurely yielding control back to the user. Our example is the following:
```
You are an agent - please keep going until the user’s query is completely resolved, before ending your turn and yielding back to the user. Only terminate your turn when you are sure that the problem is solved.
```

2. Tool-calling: this encourages the model to make full use of its tools, and reduces its likelihood of hallucinating or guessing an answer. Our example is the following:
```
If you are not sure about file content or codebase structure pertaining to the user’s request, use your tools to read files and gather the relevant information: do NOT guess or make up an answer.
```

3. Planning [optional]: if desired, this ensures the model explicitly plans and reflects upon each tool call in text, instead of completing the task by chaining together a series of only tool calls. Our example is the following:
```
You MUST plan extensively before each function call, and reflect extensively on the outcomes of the previous function calls. DO NOT do this entire process by making function calls only, as this can impair your ability to solve the problem and think insightfully.
```

GPT-4.1 is trained to respond very closely to both user instructions and system prompts in the agentic setting. The model adhered closely to these three simple instructions and increased our internal SWE-bench Verified score by close to 20% - so we highly encourage starting any agent prompt with clear reminders covering the three categories listed above. As a whole, we find that these three instructions transform the model from a chatbot-like state into a much more “eager” agent, driving the interaction forward autonomously and independently.

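In practice, the three reminders can simply be prepended to your task-specific instructions before calling the Responses API. Below is a minimal sketch of that setup; the tool list, task instructions, and user input are placeholders, not recommendations.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# The three reminder snippets from above, prepended to task-specific instructions.
AGENT_REMINDERS = """
You are an agent - please keep going until the user's query is completely resolved, before ending your turn and yielding back to the user. Only terminate your turn when you are sure that the problem is solved.

If you are not sure about file content or codebase structure pertaining to the user's request, use your tools to read files and gather the relevant information: do NOT guess or make up an answer.

You MUST plan extensively before each function call, and reflect extensively on the outcomes of the previous function calls. DO NOT do this entire process by making function calls only, as this can impair your ability to solve the problem and think insightfully.
"""

TASK_INSTRUCTIONS = "..."  # placeholder: your workflow, guidelines, and output format
my_tools = []              # placeholder: your tool definitions (see the next section)

response = client.responses.create(
    model="gpt-4.1-2025-04-14",
    instructions=AGENT_REMINDERS + TASK_INSTRUCTIONS,
    tools=my_tools,
    input="...",  # placeholder: the user's request
)
```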
## Tool Calls
Compared to previous models, GPT-4.1 has undergone more training on effectively utilizing tools passed as arguments in an OpenAI API request. We encourage developers to exclusively use the tools field to pass tools, rather than manually injecting tool descriptions into your prompt and writing a separate parser for tool calls, as some have reported doing in the past. This is the best way to minimize errors and ensure the model remains in distribution during tool-calling trajectories - in our own experiments, we observed a 2% increase in SWE-bench Verified pass rate when using API-parsed tool descriptions versus manually injecting the schemas into the system prompt.

Developers should name tools clearly to indicate their purpose and add a clear, detailed description in the "description" field of the tool. Similarly, for each tool param, lean on good naming and descriptions to ensure appropriate usage. If your tool is particularly complicated and you'd like to provide examples of tool usage, we recommend that you create an `# Examples` section in your system prompt and place the examples there, rather than adding them into the "description" field, which should remain thorough but relatively concise. Providing examples can be helpful to indicate when to use tools, whether to include user text alongside tool calls, and what parameters are appropriate for different inputs. Remember that you can use “Generate Anything” in the Prompt Playground to get a good starting point for your new tool definitions.

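As an illustration, here is a sketch of a hypothetical `search_codebase` tool passed through the tools field rather than pasted into the system prompt. The tool name, description, and parameters are made up for this example; the structure mirrors the tool definitions shown later in this guide.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# Hypothetical tool: name, description, and parameters are illustrative only.
search_codebase_tool = {
    "type": "function",
    "name": "search_codebase",
    "description": "Search the repository for files, classes, or functions matching a query string.",
    "parameters": {
        "type": "object",
        "strict": True,
        "properties": {
            "query": {
                "type": "string",
                "description": "A keyword, symbol name, or path fragment to search for.",
            }
        },
        "required": ["query"],
        "additionalProperties": False,
    },
}

response = client.responses.create(
    model="gpt-4.1-2025-04-14",
    instructions="...",  # placeholder: system prompt, optionally with an `# Examples` section for tool usage
    tools=[search_codebase_tool],  # pass tools here rather than describing them in the prompt
    input="Where is the retry logic for failed uploads implemented?",
)
```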
## Prompting-Induced Planning & Chain-of-Thought
As mentioned already, developers can optionally prompt agents built with GPT-4.1 to plan and reflect between tool calls, instead of silently calling tools in an unbroken sequence. GPT-4.1 is not a reasoning model - meaning that it does not produce an internal chain of thought before answering - but in the prompt, a developer can induce the model to produce an explicit, step-by-step plan by using any variant of the Planning prompt component shown above. This can be thought of as the model “thinking out loud.” In our experimentation with the SWE-bench Verified agentic task, inducing explicit planning increased the pass rate by 4%.

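When planning is induced this way, the plan appears as ordinary message items interleaved with tool calls in the model output. The sketch below shows one way to separate the two when reading a Responses API result; it assumes a `response` object from a call like the one in the sample prompt that follows.

```python
# A minimal sketch: split "thinking out loud" text from tool calls in the output.
# Assumes `response` came from client.responses.create(...) with tools defined.
for item in response.to_dict()["output"]:
    if item["type"] == "message":
        # The model's explicit plan or reflection, written out in text.
        for part in item["content"]:
            if part["type"] == "output_text":
                print("PLAN:", part["text"])
    elif item["type"] == "function_call":
        # A tool call the harness should execute before continuing the loop.
        print("TOOL CALL:", item["name"], item["arguments"])
```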
## Sample Prompt: SWE-bench Verified

Below, we share the agentic prompt that we used to achieve our highest score on SWE-bench Verified, which features detailed instructions about workflow and problem-solving strategy. This general pattern can be used for any agentic task.

43
+ ```python
44
+ from openai import OpenAI
45
+ import os
46
+
47
+ client = OpenAI(
48
+ api_key=os.environ.get(
49
+ "OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"
50
+ )
51
+ )
52
+
53
+ SYS_PROMPT_SWEBENCH = """
54
+ You will be tasked to fix an issue from an open-source repository.
55
+
56
+ Your thinking should be thorough and so it's fine if it's very long. You can think step by step before and after each action you decide to take.
57
+
58
+ You MUST iterate and keep going until the problem is solved.
59
+
60
+ You already have everything you need to solve this problem in the /testbed folder, even without internet connection. I want you to fully solve this autonomously before coming back to me.
61
+
62
+ Only terminate your turn when you are sure that the problem is solved. Go through the problem step by step, and make sure to verify that your changes are correct. NEVER end your turn without having solved the problem, and when you say you are going to make a tool call, make sure you ACTUALLY make the tool call, instead of ending your turn.
63
+
64
+ THE PROBLEM CAN DEFINITELY BE SOLVED WITHOUT THE INTERNET.
65
+
66
+ Take your time and think through every step - remember to check your solution rigorously and watch out for boundary cases, especially with the changes you made. Your solution must be perfect. If not, continue working on it. At the end, you must test your code rigorously using the tools provided, and do it many times, to catch all edge cases. If it is not robust, iterate more and make it perfect. Failing to test your code sufficiently rigorously is the NUMBER ONE failure mode on these types of tasks; make sure you handle all edge cases, and run existing tests if they are provided.
67
+
68
+ You MUST plan extensively before each function call, and reflect extensively on the outcomes of the previous function calls. DO NOT do this entire process by making function calls only, as this can impair your ability to solve the problem and think insightfully.
69
+
70
+ # Workflow
71
+
72
+ ## High-Level Problem Solving Strategy
73
+
74
+ 1. Understand the problem deeply. Carefully read the issue and think critically about what is required.
75
+ 2. Investigate the codebase. Explore relevant files, search for key functions, and gather context.
76
+ 3. Develop a clear, step-by-step plan. Break down the fix into manageable, incremental steps.
77
+ 4. Implement the fix incrementally. Make small, testable code changes.
78
+ 5. Debug as needed. Use debugging techniques to isolate and resolve issues.
79
+ 6. Test frequently. Run tests after each change to verify correctness.
80
+ 7. Iterate until the root cause is fixed and all tests pass.
81
+ 8. Reflect and validate comprehensively. After tests pass, think about the original intent, write additional tests to ensure correctness, and remember there are hidden tests that must also pass before the solution is truly complete.
82
+
83
+ Refer to the detailed sections below for more information on each step.
84
+
85
+ ## 1. Deeply Understand the Problem
86
+ Carefully read the issue and think hard about a plan to solve it before coding.
87
+
88
+ ## 2. Codebase Investigation
89
+ - Explore relevant files and directories.
90
+ - Search for key functions, classes, or variables related to the issue.
91
+ - Read and understand relevant code snippets.
92
+ - Identify the root cause of the problem.
93
+ - Validate and update your understanding continuously as you gather more context.
94
+
95
+ ## 3. Develop a Detailed Plan
96
+ - Outline a specific, simple, and verifiable sequence of steps to fix the problem.
97
+ - Break down the fix into small, incremental changes.
98
+
99
+ ## 4. Making Code Changes
100
+ - Before editing, always read the relevant file contents or section to ensure complete context.
101
+ - If a patch is not applied correctly, attempt to reapply it.
102
+ - Make small, testable, incremental changes that logically follow from your investigation and plan.
103
+
104
+ ## 5. Debugging
105
+ - Make code changes only if you have high confidence they can solve the problem
106
+ - When debugging, try to determine the root cause rather than addressing symptoms
107
+ - Debug for as long as needed to identify the root cause and identify a fix
108
+ - Use print statements, logs, or temporary code to inspect program state, including descriptive statements or error messages to understand what's happening
109
+ - To test hypotheses, you can also add test statements or functions
110
+ - Revisit your assumptions if unexpected behavior occurs.
111
+
112
+ ## 6. Testing
113
+ - Run tests frequently using `!python3 run_tests.py` (or equivalent).
114
+ - After each change, verify correctness by running relevant tests.
115
+ - If tests fail, analyze failures and revise your patch.
116
+ - Write additional tests if needed to capture important behaviors or edge cases.
117
+ - Ensure all tests pass before finalizing.
118
+
119
+ ## 7. Final Verification
120
+ - Confirm the root cause is fixed.
121
+ - Review your solution for logic correctness and robustness.
122
+ - Iterate until you are extremely confident the fix is complete and all tests pass.
123
+
124
+ ## 8. Final Reflection and Additional Testing
125
+ - Reflect carefully on the original intent of the user and the problem statement.
126
+ - Think about potential edge cases or scenarios that may not be covered by existing tests.
127
+ - Write additional tests that would need to pass to fully validate the correctness of your solution.
128
+ - Run these new tests and ensure they all pass.
129
+ - Be aware that there are additional hidden tests that must also pass for the solution to be successful.
130
+ - Do not assume the task is complete just because the visible tests pass; continue refining until you are confident the fix is robust and comprehensive.
131
+ """
132
+
133
+ PYTHON_TOOL_DESCRIPTION = """This function is used to execute Python code or terminal commands in a stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 60.0 seconds. Internet access for this session is disabled. Do not make external web requests or API calls as they will fail. Just as in a Jupyter notebook, you may also execute terminal commands by calling this function with a terminal command, prefaced with an exclamation mark.
134
+
135
+ In addition, for the purposes of this task, you can call this function with an `apply_patch` command as input. `apply_patch` effectively allows you to execute a diff/patch against a file, but the format of the diff specification is unique to this task, so pay careful attention to these instructions. To use the `apply_patch` command, you should pass a message of the following structure as "input":
136
+
137
+ %%bash
138
+ apply_patch <<"EOF"
139
+ *** Begin Patch
140
+ [YOUR_PATCH]
141
+ *** End Patch
142
+ EOF
143
+
144
+ Where [YOUR_PATCH] is the actual content of your patch, specified in the following V4A diff format.
145
+
146
+ *** [ACTION] File: [path/to/file] -> ACTION can be one of Add, Update, or Delete.
147
+ For each snippet of code that needs to be changed, repeat the following:
148
+ [context_before] -> See below for further instructions on context.
149
+ - [old_code] -> Precede the old code with a minus sign.
150
+ + [new_code] -> Precede the new, replacement code with a plus sign.
151
+ [context_after] -> See below for further instructions on context.
152
+
153
+ For instructions on [context_before] and [context_after]:
154
+ - By default, show 3 lines of code immediately above and 3 lines immediately below each change. If a change is within 3 lines of a previous change, do NOT duplicate the first change's [context_after] lines in the second change's [context_before] lines.
155
+ - If 3 lines of context is insufficient to uniquely identify the snippet of code within the file, use the @@ operator to indicate the class or function to which the snippet belongs. For instance, we might have:
156
+ @@ class BaseClass
157
+ [3 lines of pre-context]
158
+ - [old_code]
159
+ + [new_code]
160
+ [3 lines of post-context]
161
+
162
+ - If a code block is repeated so many times in a class or function such that even a single @@ statement and 3 lines of context cannot uniquely identify the snippet of code, you can use multiple `@@` statements to jump to the right context. For instance:
163
+
164
+ @@ class BaseClass
165
+ @@ def method():
166
+ [3 lines of pre-context]
167
+ - [old_code]
168
+ + [new_code]
169
+ [3 lines of post-context]
170
+
171
+ Note, then, that we do not use line numbers in this diff format, as the context is enough to uniquely identify code. An example of a message that you might pass as "input" to this function, in order to apply a patch, is shown below.
172
+
173
+ %%bash
174
+ apply_patch <<"EOF"
175
+ *** Begin Patch
176
+ *** Update File: pygorithm/searching/binary_search.py
177
+ @@ class BaseClass
178
+ @@ def search():
179
+ - pass
180
+ + raise NotImplementedError()
181
+
182
+ @@ class Subclass
183
+ @@ def search():
184
+ - pass
185
+ + raise NotImplementedError()
186
+
187
+ *** End Patch
188
+ EOF
189
+
190
+ File references can only be relative, NEVER ABSOLUTE. After the apply_patch command is run, python will always say "Done!", regardless of whether the patch was successfully applied or not. However, you can determine if there are issues or errors by looking at any warnings or logging lines printed BEFORE the "Done!" is output.
191
+ """
192
+
193
+ python_bash_patch_tool = {
194
+ "type": "function",
195
+ "name": "python",
196
+ "description": PYTHON_TOOL_DESCRIPTION,
197
+ "parameters": {
198
+ "type": "object",
199
+ "strict": True,
200
+ "properties": {
201
+ "input": {
202
+ "type": "string",
203
+ "description": " The Python code, terminal command (prefaced by exclamation mark), or apply_patch command that you wish to execute.",
204
+ }
205
+ },
206
+ "required": ["input"],
207
+ },
208
+ }
209
+
210
+ # Additional harness setup:
211
+ # - Add your repo to /testbed
212
+ # - Add your issue to the first user message
213
+ # - Note: Even though we used a single tool for python, bash, and apply_patch, we generally recommend defining more granular tools that are focused on a single function
214
+
215
+ response = client.responses.create(
216
+ instructions=SYS_PROMPT_SWEBENCH,
217
+ model="gpt-4.1-2025-04-14",
218
+ tools=[python_bash_patch_tool],
219
+ input=f"Please answer the following question:\nBug: Typerror..."
220
+ )
221
+
222
+ response.to_dict()["output"]
223
+ ```
224
+
225
+ ```
226
+ [{'id': 'msg_67fe92df26ac819182ffafce9ff4e4fc07c7e06242e51f8b',
227
+ 'content': [{'annotations': [],
228
+ 'text': "Thank you for the report, but “Typerror” is too vague for me to start debugging right away.\n\n**To make progress, I need to:**\n1. Find the exact error message text (e.g. `'TypeError: ...'`).\n2. Find which file and which line/function/class the error occurred in.\n3. Figure out what triggered the error (test file, usage, reproduction steps).\n4. Find the root cause and details.\n\n**Next steps:**\n- Investigate error/log/test output files for a Python `TypeError` message.\n- Examine the relevant code sections for problematic type usage.\n- If possible, reproduce the bug locally.\n\n**Plan:**\n- First, I will search for test files and log output in the `/testbed` directory that may contain the full error message and stack trace.\n\nLet’s start by listing the contents of the `/testbed` directory to look for clues.",
229
+ 'type': 'output_text'}],
230
+ 'role': 'assistant',
231
+ 'status': 'completed',
232
+ 'type': 'message'},
233
+ {'arguments': '{"input":"!ls -l /testbed"}',
234
+ 'call_id': 'call_frnxyJgKi5TsBem0nR9Zuzdw',
235
+ 'name': 'python',
236
+ 'type': 'function_call',
237
+ 'id': 'fc_67fe92e3da7081918fc18d5c96dddc1c07c7e06242e51f8b',
238
+ 'status': 'completed'}]
239
+ ```
# 2. Long context
GPT-4.1 has a performant 1M token input context window, and is useful for a variety of long context tasks, including structured document parsing, re-ranking, selecting relevant information while ignoring irrelevant context, and performing multi-hop reasoning using context.

## Optimal Context Size
We observe very good performance on needle-in-a-haystack evaluations up to our full 1M token context, and we’ve observed very strong performance at complex tasks with a mix of both relevant and irrelevant code and other documents. However, long context performance can degrade as more items must be retrieved, or as the task requires complex reasoning that draws on the state of the entire context (like performing a graph search, for example).

## Tuning Context Reliance

Consider the mix of external vs. internal world knowledge that might be required to answer your question. Sometimes it’s important for the model to use some of its own knowledge to connect concepts or make logical jumps, while in other cases it’s desirable to use only the provided context.

```
# Instructions
// for internal knowledge
- Only use the documents in the provided External Context to answer the User Query. If you don't know the answer based on this context, you must respond "I don't have the information needed to answer that", even if a user insists on you answering the question.
// For internal and external knowledge
- By default, use the provided external context to answer the User Query, but if other basic knowledge is needed to answer, and you're confident in the answer, you can use some of your own knowledge to help answer the question.
```

## Prompt Organization

Especially in long context usage, placement of instructions and context can impact performance. If you have long context in your prompt, ideally place your instructions at both the beginning and end of the provided context, as we found this to perform better than only above or below. If you’d prefer to only have your instructions once, then above the provided context works better than below.
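
For instance, a long-context prompt following this advice might repeat the instruction block on both sides of the documents. The sketch below assumes placeholder instructions and documents; it only illustrates the layout.

```python
# A minimal sketch of the "instructions at both ends" layout for long context.
# INSTRUCTIONS and documents are placeholders for your own content.
INSTRUCTIONS = """
# Instructions
- Only use the documents in the provided External Context to answer the User Query.
"""

documents = ["<doc id='1' title='...'>...</doc>", "<doc id='2' title='...'>...</doc>"]

prompt = "\n".join(
    [
        INSTRUCTIONS,          # instructions first
        "# External Context",
        *documents,            # the long context in the middle
        INSTRUCTIONS,          # ...repeated again at the end
        "# User Query",
        "{user_question}",
    ]
)
```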
261
+ # 3. Chain of Thought
262
+
263
+ As mentioned above, GPT-4.1 is not a reasoning model, but prompting the model to think step by step (called “chain of thought”) can be an effective way for a model to break down problems into more manageable pieces, solve them, and improve overall output quality, with the tradeoff of higher cost and latency associated with using more output tokens. The model has been trained to perform well at agentic reasoning and real-world problem solving, so it shouldn’t require much prompting to perform well.
264
+
265
+ We recommend starting with this basic chain-of-thought instruction at the end of your prompt:
266
+ ```
267
+ ...
268
+
269
+ First, think carefully step by step about what documents are needed to answer the query. Then, print out the TITLE and ID of each document. Then, format the IDs into a list.
270
+ ```
271
+ From there, you should improve your chain-of-thought (CoT) prompt by auditing failures in your particular examples and evals, and addressing systematic planning and reasoning errors with more explicit instructions. With an unconstrained CoT prompt, there may be variance in the strategies the model tries, and if you observe an approach that works well, you can codify that strategy in your prompt. Generally speaking, errors tend to occur from misunderstanding user intent, insufficient context gathering or analysis, or insufficient or incorrect step-by-step thinking, so watch out for these and try to address them with more opinionated instructions.
272
+
273
+ Here is an example prompt instructing the model to focus more methodically on analyzing user intent and considering relevant context before proceeding to answer.
274
+ ```
275
+ # Reasoning Strategy
276
+ 1. Query Analysis: Break down and analyze the query until you're confident about what it might be asking. Consider the provided context to help clarify any ambiguous or confusing information.
277
+ 2. Context Analysis: Carefully select and analyze a large set of potentially relevant documents. Optimize for recall - it's okay if some are irrelevant, but the correct documents must be in this list, otherwise your final answer will be wrong. Analysis steps for each:
278
+ a. Analysis: An analysis of how it may or may not be relevant to answering the query.
279
+ b. Relevance rating: [high, medium, low, none]
280
+ 3. Synthesis: summarize which documents are most relevant and why, including all documents with a relevance rating of medium or higher.
281
+
282
+ # User Question
283
+ {user_question}
284
+
285
+ # External Context
286
+ {external_context}
287
+
288
+ First, think carefully step by step about what documents are needed to answer the query, closely adhering to the provided Reasoning Strategy. Then, print out the TITLE and ID of each document. Then, format the IDs into a list.
289
+ ```
# 4. Instruction Following

GPT-4.1 exhibits outstanding instruction-following performance, which developers can leverage to precisely shape and control the outputs for their particular use cases. Developers often extensively prompt for agentic reasoning steps, response tone and voice, tool calling information, output formatting, topics to avoid, and more. However, since the model follows instructions more literally, developers may need to include explicit specification around what to do or not to do. Furthermore, existing prompts optimized for other models may not immediately work with this model, because existing instructions are followed more closely and implicit rules are no longer being as strongly inferred.

## Recommended Workflow

Here is our recommended workflow for developing and debugging instructions in prompts:

1. Start with an overall “Response Rules” or “Instructions” section with high-level guidance and bullet points.
2. If you’d like to change a more specific behavior, add a section to specify more details for that category, like `# Sample Phrases`.
3. If there are specific steps you’d like the model to follow in its workflow, add an ordered list and instruct the model to follow these steps.
4. If behavior still isn’t working as expected:
   - Check for conflicting, underspecified, or wrong instructions and examples. If there are conflicting instructions, GPT-4.1 tends to follow the one closer to the end of the prompt.
   - Add examples that demonstrate desired behavior; ensure that any important behavior demonstrated in your examples is also cited in your rules.
5. It’s generally not necessary to use all-caps or other incentives like bribes or tips. We recommend starting without these, and only reaching for these if necessary for your particular prompt. Note that if your existing prompts include these techniques, it could cause GPT-4.1 to pay attention to it too strictly.

Note that using your preferred AI-powered IDE can be very helpful for iterating on prompts, including checking for consistency or conflicts, adding examples, or making cohesive updates like adding an instruction and updating instructions to demonstrate that instruction.
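
To make the workflow concrete, here is a sketch of how such a prompt skeleton might evolve over a few iterations; the rule text is illustrative only, not a recommended production prompt.

```python
# A minimal sketch of the recommended structure: high-level rules, a more specific
# section added later (# Sample Phrases), and an ordered list of workflow steps.
SYSTEM_PROMPT = """
# Instructions
- Answer the user's question using only the retrieved documents.
- Keep responses under three sentences unless the user asks for detail.

# Sample Phrases
- "Thanks for waiting - here's what I found:"
- "I wasn't able to find that, but here's something related:"

# Workflow Steps
1. Restate what the user is asking for.
2. Call the retrieval tool if the question is about account or product details.
3. Answer, citing the documents you used.
"""
```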
## Common Failure Modes

These failure modes are not unique to GPT-4.1, but we share them here for general awareness and ease of debugging.

- Instructing a model to always follow a specific behavior can occasionally induce adverse effects. For instance, if told “you must call a tool before responding to the user,” models may hallucinate tool inputs or call the tool with null values if they do not have enough information. Adding “if you don’t have enough information to call the tool, ask the user for the information you need” should mitigate this.
- When provided sample phrases, models can use those quotes verbatim and start to sound repetitive to users. Ensure you instruct the model to vary them as necessary.
- Without specific instructions, some models can be eager to provide additional prose to explain their decisions, or output more formatting in responses than may be desired. Provide instructions and potentially examples to help mitigate.
313
+
314
+ ## Example Prompt: Customer Service
315
+
316
+ This demonstrates best practices for a fictional customer service agent. Observe the diversity of rules, the specificity, the use of additional sections for greater detail, and an example to demonstrate precise behavior that incorporates all prior rules.
317
+
318
+ Try running the following notebook cell - you should see both a user message and tool call, and the user message should start with a greeting, then echo back their answer, then mention they're about to call a tool. Try changing the instructions to shape the model behavior, or trying other user messages, to test instruction following performance.
319
+ ```python
320
+ SYS_PROMPT_CUSTOMER_SERVICE = """You are a helpful customer service agent working for NewTelco, helping a user efficiently fulfill their request while adhering closely to provided guidelines.
321
+
322
+ # Instructions
323
+ - Always greet the user with "Hi, you've reached NewTelco, how can I help you?"
324
+ - Always call a tool before answering factual questions about the company, its offerings or products, or a user's account. Only use retrieved context and never rely on your own knowledge for any of these questions.
325
+ - However, if you don't have enough information to properly call the tool, ask the user for the information you need.
326
+ - Escalate to a human if the user requests.
327
+ - Do not discuss prohibited topics (politics, religion, controversial current events, medical, legal, or financial advice, personal conversations, internal company operations, or criticism of any people or company).
328
+ - Rely on sample phrases whenever appropriate, but never repeat a sample phrase in the same conversation. Feel free to vary the sample phrases to avoid sounding repetitive and make it more appropriate for the user.
329
+ - Always follow the provided output format for new messages, including citations for any factual statements from retrieved policy documents.
330
+ - If you're going to call a tool, always message the user with an appropriate message before and after calling the tool.
331
+ - Maintain a professional and concise tone in all responses, and use emojis between sentences.
332
+ - If you've resolved the user's request, ask if there's anything else you can help with
333
+
334
+ # Precise Response Steps (for each response)
335
+ 1. If necessary, call tools to fulfill the user's desired action. Always message the user before and after calling a tool to keep them in the loop.
336
+ 2. In your response to the user
337
+ a. Use active listening and echo back what you heard the user ask for.
338
+ b. Respond appropriately given the above guidelines.
339
+
340
+ # Sample Phrases
341
+ ## Deflecting a Prohibited Topic
342
+ - "I'm sorry, but I'm unable to discuss that topic. Is there something else I can help you with?"
343
+ - "That's not something I'm able to provide information on, but I'm happy to help with any other questions you may have."
344
+
345
+ ## Before calling a tool
346
+ - "To help you with that, I'll just need to verify your information."
347
+ - "Let me check that for you—one moment, please."
348
+ - "I'll retrieve the latest details for you now."
349
+
350
+ ## After calling a tool
351
+ - "Okay, here's what I found: [response]"
352
+ - "So here's what I found: [response]"
353
+
354
+ # Output Format
355
+ - Always include your final response to the user.
356
+ - When providing factual information from retrieved context, always include citations immediately after the relevant statement(s). Use the following citation format:
357
+ - For a single source: [NAME](ID)
358
+ - For multiple sources: [NAME](ID), [NAME](ID)
359
+ - Only provide information about this company, its policies, its products, or the customer's account, and only if it is based on information provided in context. Do not answer questions outside this scope.
360
+
361
+ # Example
362
+ ## User
363
+ Can you tell me about your family plan options?
364
+
365
+ ## Assistant Response 1
366
+ ### Message
367
+ "Hi, you've reached NewTelco, how can I help you? 😊🎉\n\nYou'd like to know about our family plan options. 🤝 Let me check that for you—one moment, please. 🚀"
368
+
369
+ ### Tool Calls
370
+ lookup_policy_document(topic="family plan options")
371
+
372
+ // After tool call, the assistant would follow up with:
373
+
374
+ ## Assistant Response 2 (after tool call)
375
+ ### Message
376
+ "Okay, here's what I found: 🎉 Our family plan allows up to 5 lines with shared data and a 10% discount for each additional line [Family Plan Policy](ID-010). 📱 Is there anything else I can help you with today? 😊"
377
+ """
378
+
379
+ get_policy_doc = {
380
+ "type": "function",
381
+ "name": "lookup_policy_document",
382
+ "description": "Tool to look up internal documents and policies by topic or keyword.",
383
+ "parameters": {
384
+ "strict": True,
385
+ "type": "object",
386
+ "properties": {
387
+ "topic": {
388
+ "type": "string",
389
+ "description": "The topic or keyword to search for in company policies or documents.",
390
+ },
391
+ },
392
+ "required": ["topic"],
393
+ "additionalProperties": False,
394
+ },
395
+ }
396
+
397
+ get_user_acct = {
398
+ "type": "function",
399
+ "name": "get_user_account_info",
400
+ "description": "Tool to get user account information",
401
+ "parameters": {
402
+ "strict": True,
403
+ "type": "object",
404
+ "properties": {
405
+ "phone_number": {
406
+ "type": "string",
407
+ "description": "Formatted as '(xxx) xxx-xxxx'",
408
+ },
409
+ },
410
+ "required": ["phone_number"],
411
+ "additionalProperties": False,
412
+ },
413
+ }
414
+
415
+ response = client.responses.create(
416
+ instructions=SYS_PROMPT_CUSTOMER_SERVICE,
417
+ model="gpt-4.1-2025-04-14",
418
+ tools=[get_policy_doc, get_user_acct],
419
+ input="How much will it cost for international service? I'm traveling to France.",
420
+ # input="Why was my last bill so high?"
421
+ )
422
+
423
+ response.to_dict()["output"]
424
+ ```
425
+ ```
426
+ [{'id': 'msg_67fe92d431548191b7ca6cd604b4784b06efc5beb16b3c5e',
427
+ 'content': [{'annotations': [],
428
+ 'text': "Hi, you've reached NewTelco, how can I help you? 🌍✈️\n\nYou'd like to know the cost of international service while traveling to France. 🇫🇷 Let me check the latest details for you—one moment, please. 🕑",
429
+ 'type': 'output_text'}],
430
+ 'role': 'assistant',
431
+ 'status': 'completed',
432
+ 'type': 'message'},
433
+ {'arguments': '{"topic":"international service cost France"}',
434
+ 'call_id': 'call_cF63DLeyhNhwfdyME3ZHd0yo',
435
+ 'name': 'lookup_policy_document',
436
+ 'type': 'function_call',
437
+ 'id': 'fc_67fe92d5d6888191b6cd7cf57f707e4606efc5beb16b3c5e',
438
+ 'status': 'completed'}]
439
+ ```
440
+ # 5. General Advice
441
+ ## Prompt Structure
442
+
443
+ For reference, here is a good starting point for structuring your prompts.
444
+ ```
445
+ # Role and Objective
446
+
447
+ # Instructions
448
+
449
+ ## Sub-categories for more detailed instructions
450
+
451
+ # Reasoning Steps
452
+
453
+ # Output Format
454
+
455
+ # Examples
456
+ ## Example 1
457
+
458
+ # Context
459
+
460
+ # Final instructions and prompt to think step by step
461
+ ```
462
+ Add or remove sections to suit your needs, and experiment to determine what’s optimal for your usage.
## Delimiters

Here are some general guidelines for selecting the best delimiters for your prompt. Please refer to the Long Context section for special considerations for that context type.
466
+
467
+ 1. Markdown: We recommend starting here, and using markdown titles for major sections and subsections (including deeper hierarchy, to H4+). Use inline backticks or backtick blocks to precisely wrap code, and standard numbered or bulleted lists as needed.
468
+ 2. XML: These also perform well, and we have improved adherence to information in XML with this model. XML is convenient to precisely wrap a section including start and end, add metadata to the tags for additional context, and enable nesting. Here is an example of using XML tags to nest examples in an example section, with inputs and outputs for each:
469
+ ```
470
+ <examples>
471
+ <example1 type="Abbreviate">
472
+ <input>San Francisco</input>
473
+ <output>- SF</output>
474
+ </example1>
475
+ </examples>
476
+ ```
477
+ 3. JSON is highly structured and well understood by the model particularly in coding contexts. However it can be more verbose, and require character escaping that can add overhead.
478
+
Guidance specifically for adding a large number of documents or files to input context:

- XML performed well in our long context testing.
  - Example: `<doc id='1' title='The Fox'>The quick brown fox jumps over the lazy dog</doc>`
- This format, proposed by Lee et al. (ref), also performed well in our long context testing.
  - Example: `ID: 1 | TITLE: The Fox | CONTENT: The quick brown fox jumps over the lazy dog`
- JSON performed particularly poorly.
  - Example: `[{'id': 1, 'title': 'The Fox', 'content': 'The quick brown fox jumped over the lazy dog'}]`

The model is trained to robustly understand structure in a variety of formats. Generally, use your judgement and think about what will provide clear information and “stand out” to the model. For example, if you’re retrieving documents that contain lots of XML, an XML-based delimiter will likely be less effective.
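
As a sketch, formatting retrieved documents into either of the two better-performing layouts might look like the following; the `docs` list is a placeholder for your own retrieval results.

```python
# A minimal sketch: render retrieved documents in the two formats that performed
# well in long context testing. `docs` is a placeholder for retrieval results.
docs = [
    {"id": 1, "title": "The Fox", "content": "The quick brown fox jumps over the lazy dog"},
]

# XML-style delimiters
xml_blocks = [
    f"<doc id='{d['id']}' title='{d['title']}'>{d['content']}</doc>" for d in docs
]

# Pipe-delimited format (Lee et al.)
pipe_blocks = [
    f"ID: {d['id']} | TITLE: {d['title']} | CONTENT: {d['content']}" for d in docs
]

external_context = "\n".join(xml_blocks)  # or "\n".join(pipe_blocks)
```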
## Caveats

- In some isolated cases we have observed the model being resistant to producing very long, repetitive outputs, for example, analyzing hundreds of items one by one. If this is necessary for your use case, instruct the model strongly to output this information in full, and consider breaking down the problem or using a more concise approach.
- We have seen some rare instances of parallel tool calls being incorrect. We advise testing this, and considering setting the `parallel_tool_calls` param to false if you’re seeing issues.

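If you do hit issues with parallel tool calls, the flag can be disabled per request; a minimal sketch, with the other arguments as placeholders:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

my_tools = []  # placeholder: your tool definitions

# Disable parallel tool calls for a single request if you observe malformed parallel calls.
response = client.responses.create(
    model="gpt-4.1-2025-04-14",
    instructions="...",
    tools=my_tools,
    input="...",
    parallel_tool_calls=False,
)
```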
# Appendix: Generating and Applying File Diffs

Developers have provided us feedback that accurate and well-formed diff generation is a critical capability to power coding-related tasks. To this end, the GPT-4.1 family features substantially improved diff capabilities relative to previous GPT models. Moreover, while GPT-4.1 has strong performance generating diffs of any format given clear instructions and examples, we open-source here one recommended diff format, on which the model has been extensively trained. We hope that, particularly for developers just starting out, this will take much of the guesswork out of creating diffs yourself.

## Apply Patch

See the example below for a prompt that applies our recommended tool call correctly.
500
+ ```python
501
+ APPLY_PATCH_TOOL_DESC = """This is a custom utility that makes it more convenient to add, remove, move, or edit code files. `apply_patch` effectively allows you to execute a diff/patch against a file, but the format of the diff specification is unique to this task, so pay careful attention to these instructions. To use the `apply_patch` command, you should pass a message of the following structure as "input":
502
+
503
+ %%bash
504
+ apply_patch <<"EOF"
505
+ *** Begin Patch
506
+ [YOUR_PATCH]
507
+ *** End Patch
508
+ EOF
509
+
510
+ Where [YOUR_PATCH] is the actual content of your patch, specified in the following V4A diff format.
511
+
512
+ *** [ACTION] File: [path/to/file] -> ACTION can be one of Add, Update, or Delete.
513
+ For each snippet of code that needs to be changed, repeat the following:
514
+ [context_before] -> See below for further instructions on context.
515
+ - [old_code] -> Precede the old code with a minus sign.
516
+ + [new_code] -> Precede the new, replacement code with a plus sign.
517
+ [context_after] -> See below for further instructions on context.
518
+
519
+ For instructions on [context_before] and [context_after]:
520
+ - By default, show 3 lines of code immediately above and 3 lines immediately below each change. If a change is within 3 lines of a previous change, do NOT duplicate the first change’s [context_after] lines in the second change’s [context_before] lines.
521
+ - If 3 lines of context is insufficient to uniquely identify the snippet of code within the file, use the @@ operator to indicate the class or function to which the snippet belongs. For instance, we might have:
522
+ @@ class BaseClass
523
+ [3 lines of pre-context]
524
+ - [old_code]
525
+ + [new_code]
526
+ [3 lines of post-context]
527
+
528
+ - If a code block is repeated so many times in a class or function such that even a single @@ statement and 3 lines of context cannot uniquely identify the snippet of code, you can use multiple `@@` statements to jump to the right context. For instance:
529
+
530
+ @@ class BaseClass
531
+ @@ def method():
532
+ [3 lines of pre-context]
533
+ - [old_code]
534
+ + [new_code]
535
+ [3 lines of post-context]
536
+
537
+ Note, then, that we do not use line numbers in this diff format, as the context is enough to uniquely identify code. An example of a message that you might pass as "input" to this function, in order to apply a patch, is shown below.
538
+
539
+ %%bash
540
+ apply_patch <<"EOF"
541
+ *** Begin Patch
542
+ *** Update File: pygorithm/searching/binary_search.py
543
+ @@ class BaseClass
544
+ @@ def search():
545
+ - pass
546
+ + raise NotImplementedError()
547
+
548
+ @@ class Subclass
549
+ @@ def search():
550
+ - pass
551
+ + raise NotImplementedError()
552
+
553
+ *** End Patch
554
+ EOF
555
+ """
556
+
557
+ APPLY_PATCH_TOOL = {
558
+ "name": "apply_patch",
559
+ "description": APPLY_PATCH_TOOL_DESC,
560
+ "parameters": {
561
+ "type": "object",
562
+ "properties": {
563
+ "input": {
564
+ "type": "string",
565
+ "description": " The apply_patch command that you wish to execute.",
566
+ }
567
+ },
568
+ "required": ["input"],
569
+ },
570
+ }
571
+ ```
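
A sketch of wiring this definition into a Responses API call, in the same style as the earlier examples in this guide (those examples also include a `"type": "function"` field on each tool, which is added here for the same reason); the instructions and input are placeholders.

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

# A minimal sketch: pass the apply_patch tool definition above alongside your other tools.
# Adding "type": "function" matches the tool definitions shown earlier in this guide.
response = client.responses.create(
    model="gpt-4.1-2025-04-14",
    instructions="...",  # placeholder: your agent prompt, including the V4A patch format guidance
    tools=[{"type": "function", **APPLY_PATCH_TOOL}],
    input="...",  # placeholder: the coding task
)
```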
## Reference Implementation: apply_patch.py

Here’s a reference implementation of the apply_patch tool that we used as part of model training. You’ll need to make this file executable and available as `apply_patch` from the shell where the model will execute commands:
575
+ ```python
576
+ #!/usr/bin/env python3
577
+
578
+ """
579
+ A self-contained **pure-Python 3.9+** utility for applying human-readable
580
+ “pseudo-diff” patch files to a collection of text files.
581
+ """
582
+
583
+ from __future__ import annotations
584
+
585
+ import pathlib
586
+ from dataclasses import dataclass, field
587
+ from enum import Enum
588
+ from typing import (
589
+ Callable,
590
+ Dict,
591
+ List,
592
+ Optional,
593
+ Tuple,
594
+ Union,
595
+ )
596
+
597
+
598
+ # --------------------------------------------------------------------------- #
599
+ # Domain objects
600
+ # --------------------------------------------------------------------------- #
601
+ class ActionType(str, Enum):
602
+ ADD = "add"
603
+ DELETE = "delete"
604
+ UPDATE = "update"
605
+
606
+
607
+ @dataclass
608
+ class FileChange:
609
+ type: ActionType
610
+ old_content: Optional[str] = None
611
+ new_content: Optional[str] = None
612
+ move_path: Optional[str] = None
613
+
614
+
615
+ @dataclass
616
+ class Commit:
617
+ changes: Dict[str, FileChange] = field(default_factory=dict)
618
+
619
+
620
+ # --------------------------------------------------------------------------- #
621
+ # Exceptions
622
+ # --------------------------------------------------------------------------- #
623
+ class DiffError(ValueError):
624
+ """Any problem detected while parsing or applying a patch."""
625
+
626
+
627
+ # --------------------------------------------------------------------------- #
628
+ # Helper dataclasses used while parsing patches
629
+ # --------------------------------------------------------------------------- #
630
+ @dataclass
631
+ class Chunk:
632
+ orig_index: int = -1
633
+ del_lines: List[str] = field(default_factory=list)
634
+ ins_lines: List[str] = field(default_factory=list)
635
+
636
+
637
+ @dataclass
638
+ class PatchAction:
639
+ type: ActionType
640
+ new_file: Optional[str] = None
641
+ chunks: List[Chunk] = field(default_factory=list)
642
+ move_path: Optional[str] = None
643
+
644
+
645
+ @dataclass
646
+ class Patch:
647
+ actions: Dict[str, PatchAction] = field(default_factory=dict)
648
+
649
+
650
+ # --------------------------------------------------------------------------- #
651
+ # Patch text parser
652
+ # --------------------------------------------------------------------------- #
653
+ @dataclass
654
+ class Parser:
655
+ current_files: Dict[str, str]
656
+ lines: List[str]
657
+ index: int = 0
658
+ patch: Patch = field(default_factory=Patch)
659
+ fuzz: int = 0
660
+
661
+ # ------------- low-level helpers -------------------------------------- #
662
+ def _cur_line(self) -> str:
663
+ if self.index >= len(self.lines):
664
+ raise DiffError("Unexpected end of input while parsing patch")
665
+ return self.lines[self.index]
666
+
667
+ @staticmethod
668
+ def _norm(line: str) -> str:
669
+ """Strip CR so comparisons work for both LF and CRLF input."""
670
+ return line.rstrip("\r")
671
+
672
+ # ------------- scanning convenience ----------------------------------- #
673
+ def is_done(self, prefixes: Optional[Tuple[str, ...]] = None) -> bool:
674
+ if self.index >= len(self.lines):
675
+ return True
676
+ if (
677
+ prefixes
678
+ and len(prefixes) > 0
679
+ and self._norm(self._cur_line()).startswith(prefixes)
680
+ ):
681
+ return True
682
+ return False
683
+
684
+ def startswith(self, prefix: Union[str, Tuple[str, ...]]) -> bool:
685
+ return self._norm(self._cur_line()).startswith(prefix)
686
+
687
+ def read_str(self, prefix: str) -> str:
688
+ """
689
+ Consume the current line if it starts with *prefix* and return the text
690
+ **after** the prefix. Raises if prefix is empty.
691
+ """
692
+ if prefix == "":
693
+ raise ValueError("read_str() requires a non-empty prefix")
694
+ if self._norm(self._cur_line()).startswith(prefix):
695
+ text = self._cur_line()[len(prefix) :]
696
+ self.index += 1
697
+ return text
698
+ return ""
699
+
700
+ def read_line(self) -> str:
701
+ """Return the current raw line and advance."""
702
+ line = self._cur_line()
703
+ self.index += 1
704
+ return line
705
+
706
+ # ------------- public entry point -------------------------------------- #
707
+ def parse(self) -> None:
708
+ while not self.is_done(("*** End Patch",)):
709
+ # ---------- UPDATE ---------- #
710
+ path = self.read_str("*** Update File: ")
711
+ if path:
712
+ if path in self.patch.actions:
713
+ raise DiffError(f"Duplicate update for file: {path}")
714
+ move_to = self.read_str("*** Move to: ")
715
+ if path not in self.current_files:
716
+ raise DiffError(f"Update File Error - missing file: {path}")
717
+ text = self.current_files[path]
718
+ action = self._parse_update_file(text)
719
+ action.move_path = move_to or None
720
+ self.patch.actions[path] = action
721
+ continue
722
+
723
+ # ---------- DELETE ---------- #
724
+ path = self.read_str("*** Delete File: ")
725
+ if path:
726
+ if path in self.patch.actions:
727
+ raise DiffError(f"Duplicate delete for file: {path}")
728
+ if path not in self.current_files:
729
+ raise DiffError(f"Delete File Error - missing file: {path}")
730
+ self.patch.actions[path] = PatchAction(type=ActionType.DELETE)
731
+ continue
732
+
733
+ # ---------- ADD ---------- #
734
+ path = self.read_str("*** Add File: ")
735
+ if path:
736
+ if path in self.patch.actions:
737
+ raise DiffError(f"Duplicate add for file: {path}")
738
+ if path in self.current_files:
739
+ raise DiffError(f"Add File Error - file already exists: {path}")
740
+ self.patch.actions[path] = self._parse_add_file()
741
+ continue
742
+
743
+ raise DiffError(f"Unknown line while parsing: {self._cur_line()}")
744
+
745
+ if not self.startswith("*** End Patch"):
746
+ raise DiffError("Missing *** End Patch sentinel")
747
+ self.index += 1 # consume sentinel
748
+
749
+ # ------------- section parsers ---------------------------------------- #
750
+ def _parse_update_file(self, text: str) -> PatchAction:
751
+ action = PatchAction(type=ActionType.UPDATE)
752
+ lines = text.split("\n")
753
+ index = 0
754
+ while not self.is_done(
755
+ (
756
+ "*** End Patch",
757
+ "*** Update File:",
758
+ "*** Delete File:",
759
+ "*** Add File:",
760
+ "*** End of File",
761
+ )
762
+ ):
763
+ def_str = self.read_str("@@ ")
764
+ section_str = ""
765
+ if not def_str and self._norm(self._cur_line()) == "@@":
766
+ section_str = self.read_line()
767
+
768
+ if not (def_str or section_str or index == 0):
769
+ raise DiffError(f"Invalid line in update section:\n{self._cur_line()}")
770
+
771
+ if def_str.strip():
772
+ found = False
773
+ if def_str not in lines[:index]:
774
+ for i, s in enumerate(lines[index:], index):
775
+ if s == def_str:
776
+ index = i + 1
777
+ found = True
778
+ break
779
+ if not found and def_str.strip() not in [
780
+ s.strip() for s in lines[:index]
781
+ ]:
782
+ for i, s in enumerate(lines[index:], index):
783
+ if s.strip() == def_str.strip():
784
+ index = i + 1
785
+ self.fuzz += 1
786
+ found = True
787
+ break
788
+
789
+ next_ctx, chunks, end_idx, eof = peek_next_section(self.lines, self.index)
790
+ new_index, fuzz = find_context(lines, next_ctx, index, eof)
791
+ if new_index == -1:
792
+ ctx_txt = "\n".join(next_ctx)
793
+ raise DiffError(
794
+ f"Invalid {'EOF ' if eof else ''}context at {index}:\n{ctx_txt}"
795
+ )
796
+ self.fuzz += fuzz
797
+ for ch in chunks:
798
+ ch.orig_index += new_index
799
+ action.chunks.append(ch)
800
+ index = new_index + len(next_ctx)
801
+ self.index = end_idx
802
+ return action
803
+
804
+ def _parse_add_file(self) -> PatchAction:
805
+ lines: List[str] = []
806
+ while not self.is_done(
807
+ ("*** End Patch", "*** Update File:", "*** Delete File:", "*** Add File:")
808
+ ):
809
+ s = self.read_line()
810
+ if not s.startswith("+"):
811
+ raise DiffError(f"Invalid Add File line (missing '+'): {s}")
812
+ lines.append(s[1:]) # strip leading '+'
813
+ return PatchAction(type=ActionType.ADD, new_file="\n".join(lines))
814
+
815
+
816
+ # --------------------------------------------------------------------------- #
817
+ # Helper functions
818
+ # --------------------------------------------------------------------------- #
819
+ def find_context_core(
820
+ lines: List[str], context: List[str], start: int
821
+ ) -> Tuple[int, int]:
822
+ if not context:
823
+ return start, 0
824
+
825
+ for i in range(start, len(lines)):
826
+ if lines[i : i + len(context)] == context:
827
+ return i, 0
828
+ for i in range(start, len(lines)):
829
+ if [s.rstrip() for s in lines[i : i + len(context)]] == [
830
+ s.rstrip() for s in context
831
+ ]:
832
+ return i, 1
833
+ for i in range(start, len(lines)):
834
+ if [s.strip() for s in lines[i : i + len(context)]] == [
835
+ s.strip() for s in context
836
+ ]:
837
+ return i, 100
838
+ return -1, 0
839
+
840
+
841
+ def find_context(
842
+ lines: List[str], context: List[str], start: int, eof: bool
843
+ ) -> Tuple[int, int]:
844
+ if eof:
845
+ new_index, fuzz = find_context_core(lines, context, len(lines) - len(context))
846
+ if new_index != -1:
847
+ return new_index, fuzz
848
+ new_index, fuzz = find_context_core(lines, context, start)
849
+ return new_index, fuzz + 10_000
850
+ return find_context_core(lines, context, start)
851
+
852
+
853
+ def peek_next_section(
854
+ lines: List[str], index: int
855
+ ) -> Tuple[List[str], List[Chunk], int, bool]:
856
+ old: List[str] = []
857
+ del_lines: List[str] = []
858
+ ins_lines: List[str] = []
859
+ chunks: List[Chunk] = []
860
+ mode = "keep"
861
+ orig_index = index
862
+
863
+ while index < len(lines):
864
+ s = lines[index]
865
+ if s.startswith(
866
+ (
867
+ "@@",
868
+ "*** End Patch",
869
+ "*** Update File:",
870
+ "*** Delete File:",
871
+ "*** Add File:",
872
+ "*** End of File",
873
+ )
874
+ ):
875
+ break
876
+ if s == "***":
877
+ break
878
+ if s.startswith("***"):
879
+ raise DiffError(f"Invalid Line: {s}")
880
+ index += 1
881
+
882
+ last_mode = mode
883
+ if s == "":
884
+ s = " "
885
+ if s[0] == "+":
886
+ mode = "add"
887
+ elif s[0] == "-":
888
+ mode = "delete"
889
+ elif s[0] == " ":
890
+ mode = "keep"
891
+ else:
892
+ raise DiffError(f"Invalid Line: {s}")
893
+ s = s[1:]
894
+
895
+ if mode == "keep" and last_mode != mode:
896
+ if ins_lines or del_lines:
897
+ chunks.append(
898
+ Chunk(
899
+ orig_index=len(old) - len(del_lines),
900
+ del_lines=del_lines,
901
+ ins_lines=ins_lines,
902
+ )
903
+ )
904
+ del_lines, ins_lines = [], []
905
+
906
+ if mode == "delete":
907
+ del_lines.append(s)
908
+ old.append(s)
909
+ elif mode == "add":
910
+ ins_lines.append(s)
911
+ elif mode == "keep":
912
+ old.append(s)
913
+
914
+ if ins_lines or del_lines:
915
+ chunks.append(
916
+ Chunk(
917
+ orig_index=len(old) - len(del_lines),
918
+ del_lines=del_lines,
919
+ ins_lines=ins_lines,
920
+ )
921
+ )
922
+
923
+ if index < len(lines) and lines[index] == "*** End of File":
924
+ index += 1
925
+ return old, chunks, index, True
926
+
927
+ if index == orig_index:
928
+ raise DiffError("Nothing in this section")
929
+ return old, chunks, index, False
930
+
931
+
932
+ # --------------------------------------------------------------------------- #
933
+ # Patch → Commit and Commit application
934
+ # --------------------------------------------------------------------------- #
935
+ def _get_updated_file(text: str, action: PatchAction, path: str) -> str:
936
+ if action.type is not ActionType.UPDATE:
937
+ raise DiffError("_get_updated_file called with non-update action")
938
+ orig_lines = text.split("\n")
939
+ dest_lines: List[str] = []
940
+ orig_index = 0
941
+
942
+ for chunk in action.chunks:
943
+ if chunk.orig_index > len(orig_lines):
944
+ raise DiffError(
945
+ f"{path}: chunk.orig_index {chunk.orig_index} exceeds file length"
946
+ )
947
+ if orig_index > chunk.orig_index:
948
+ raise DiffError(
949
+ f"{path}: overlapping chunks at {orig_index} > {chunk.orig_index}"
950
+ )
951
+
952
+ dest_lines.extend(orig_lines[orig_index : chunk.orig_index])
953
+ orig_index = chunk.orig_index
954
+
955
+ dest_lines.extend(chunk.ins_lines)
956
+ orig_index += len(chunk.del_lines)
957
+
958
+ dest_lines.extend(orig_lines[orig_index:])
959
+ return "\n".join(dest_lines)
960
+
961
+
962
+ def patch_to_commit(patch: Patch, orig: Dict[str, str]) -> Commit:
963
+ commit = Commit()
964
+ for path, action in patch.actions.items():
965
+ if action.type is ActionType.DELETE:
966
+ commit.changes[path] = FileChange(
967
+ type=ActionType.DELETE, old_content=orig[path]
968
+ )
969
+ elif action.type is ActionType.ADD:
970
+ if action.new_file is None:
971
+ raise DiffError("ADD action without file content")
972
+ commit.changes[path] = FileChange(
973
+ type=ActionType.ADD, new_content=action.new_file
974
+ )
975
+ elif action.type is ActionType.UPDATE:
976
+ new_content = _get_updated_file(orig[path], action, path)
977
+ commit.changes[path] = FileChange(
978
+ type=ActionType.UPDATE,
979
+ old_content=orig[path],
980
+ new_content=new_content,
981
+ move_path=action.move_path,
982
+ )
983
+ return commit
984
+
985
+
986
+ # --------------------------------------------------------------------------- #
987
+ # User-facing helpers
988
+ # --------------------------------------------------------------------------- #
989
+ def text_to_patch(text: str, orig: Dict[str, str]) -> Tuple[Patch, int]:
990
+ lines = text.splitlines() # preserves blank lines, no strip()
991
+ if (
992
+ len(lines) < 2
993
+ or not Parser._norm(lines[0]).startswith("*** Begin Patch")
994
+ or Parser._norm(lines[-1]) != "*** End Patch"
995
+ ):
996
+ raise DiffError("Invalid patch text - missing sentinels")
997
+
998
+ parser = Parser(current_files=orig, lines=lines, index=1)
999
+ parser.parse()
1000
+ return parser.patch, parser.fuzz
1001
+
1002
+
1003
+ def identify_files_needed(text: str) -> List[str]:
1004
+ lines = text.splitlines()
1005
+ return [
1006
+ line[len("*** Update File: ") :]
1007
+ for line in lines
1008
+ if line.startswith("*** Update File: ")
1009
+ ] + [
1010
+ line[len("*** Delete File: ") :]
1011
+ for line in lines
1012
+ if line.startswith("*** Delete File: ")
1013
+ ]
1014
+
1015
+
1016
+ def identify_files_added(text: str) -> List[str]:
1017
+ lines = text.splitlines()
1018
+ return [
1019
+ line[len("*** Add File: ") :]
1020
+ for line in lines
1021
+ if line.startswith("*** Add File: ")
1022
+ ]
1023
+
1024
+
1025
+ # --------------------------------------------------------------------------- #
1026
+ # File-system helpers
1027
+ # --------------------------------------------------------------------------- #
1028
+ def load_files(paths: List[str], open_fn: Callable[[str], str]) -> Dict[str, str]:
1029
+ return {path: open_fn(path) for path in paths}
1030
+
1031
+
1032
+ def apply_commit(
1033
+ commit: Commit,
1034
+ write_fn: Callable[[str, str], None],
1035
+ remove_fn: Callable[[str], None],
1036
+ ) -> None:
1037
+ for path, change in commit.changes.items():
1038
+ if change.type is ActionType.DELETE:
1039
+ remove_fn(path)
1040
+ elif change.type is ActionType.ADD:
1041
+ if change.new_content is None:
1042
+ raise DiffError(f"ADD change for {path} has no content")
1043
+ write_fn(path, change.new_content)
1044
+ elif change.type is ActionType.UPDATE:
1045
+ if change.new_content is None:
1046
+ raise DiffError(f"UPDATE change for {path} has no new content")
1047
+ target = change.move_path or path
1048
+ write_fn(target, change.new_content)
1049
+ if change.move_path:
1050
+ remove_fn(path)
1051
+
1052
+
1053
+ def process_patch(
1054
+ text: str,
1055
+ open_fn: Callable[[str], str],
1056
+ write_fn: Callable[[str, str], None],
1057
+ remove_fn: Callable[[str], None],
1058
+ ) -> str:
1059
+ if not text.startswith("*** Begin Patch"):
1060
+ raise DiffError("Patch text must start with *** Begin Patch")
1061
+ paths = identify_files_needed(text)
1062
+ orig = load_files(paths, open_fn)
1063
+ patch, _fuzz = text_to_patch(text, orig)
1064
+ commit = patch_to_commit(patch, orig)
1065
+ apply_commit(commit, write_fn, remove_fn)
1066
+ return "Done!"
1067
+
1068
+
1069
+ # --------------------------------------------------------------------------- #
1070
+ # Default FS helpers
1071
+ # --------------------------------------------------------------------------- #
1072
+ def open_file(path: str) -> str:
1073
+ with open(path, "rt", encoding="utf-8") as fh:
1074
+ return fh.read()
1075
+
1076
+
1077
+ def write_file(path: str, content: str) -> None:
1078
+ target = pathlib.Path(path)
1079
+ target.parent.mkdir(parents=True, exist_ok=True)
1080
+ with target.open("wt", encoding="utf-8") as fh:
1081
+ fh.write(content)
1082
+
1083
+
1084
+ def remove_file(path: str) -> None:
1085
+ pathlib.Path(path).unlink(missing_ok=True)
1086
+
1087
+
1088
+ # --------------------------------------------------------------------------- #
1089
+ # CLI entry-point
1090
+ # --------------------------------------------------------------------------- #
1091
+ def main() -> None:
1092
+ import sys
1093
+
1094
+ patch_text = sys.stdin.read()
1095
+ if not patch_text:
1096
+ print("Please pass patch text through stdin", file=sys.stderr)
1097
+ return
1098
+ try:
1099
+ result = process_patch(patch_text, open_file, write_file, remove_file)
1100
+ except DiffError as exc:
1101
+ print(exc, file=sys.stderr)
1102
+ return
1103
+ print(result)
1104
+
1105
+
1106
+ if __name__ == "__main__":
1107
+ main()
1108
+ ```
1109
+ ### Other Effective Diff Formats
1110
+
1111
+ If you want to try using a different diff format, we found in testing that the SEARCH/REPLACE diff format used in Aider’s polyglot benchmark, as well as a pseudo-XML format with no internal escaping, both had high success rates.
1112
+
1113
+ These diff formats share two key aspects: (1) they do not use line numbers, and (2) they provide both the exact code to be replaced, and the exact code with which to replace it, with clear delimiters between the two.
1114
+ ```python
1115
+ SEARCH_REPLACE_DIFF_EXAMPLE = """
1116
+ path/to/file.py
1117
+
1118
+ >>>>>>> SEARCH
1119
+ def search():
1120
+ pass
1121
+ =======
1122
+ def search():
1123
+ raise NotImplementedError()
1124
+ <<<<<<< REPLACE
1125
+ """
1126
+
1127
+ PSEUDO_XML_DIFF_EXAMPLE = """
1128
+ <edit>
1129
+ <file>
1130
+ path/to/file.py
1131
+ </file>
1132
+ <old_code>
1133
+ def search():
1134
+ pass
1135
+ </old_code>
1136
+ <new_code>
1137
+ def search():
1138
+ raise NotImplementedError()
1139
+ </new_code>
1140
+ </edit>
1141
+ """
1142
+ ```
LICENSE ADDED
@@ -0,0 +1,21 @@
1
+ MIT License
2
+
3
+ Copyright (c) 2025 ghchris2021
4
+
5
+ Permission is hereby granted, free of charge, to any person obtaining a copy
6
+ of this software and associated documentation files (the "Software"), to deal
7
+ in the Software without restriction, including without limitation the rights
8
+ to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
9
+ copies of the Software, and to permit persons to whom the Software is
10
+ furnished to do so, subject to the following conditions:
11
+
12
+ The above copyright notice and this permission notice shall be included in all
13
+ copies or substantial portions of the Software.
14
+
15
+ THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
16
+ IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
17
+ FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
18
+ AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
19
+ LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
20
+ OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
21
+ SOFTWARE.
README.md ADDED
@@ -0,0 +1,234 @@
1
+ # [OpenAI Cookbook Pro](https://chatgpt.com/canvas/shared/6825e9f6e8d88191bf9ef4de00b29b0f)
2
+ ### Developer Tools: [Universal Runtime](https://github.com/davidkimai/universal-runtime) | [Universal Developer](https://github.com/davidkimai/universal-developer)
3
+
4
+ **An Advanced Implementation Guide to GPT-4.1: Real-World Applications, Prompting Strategies, and Agent Workflows**
5
+
6
+ Welcome to **OpenAI Cookbook Pro** — a comprehensive, practical, and fully extensible resource tailored for engineers, developers, and researchers working with the GPT-4.1 API and related OpenAI tools. This repository distills best practices, integrates field-tested strategies, and supports high-performing workflows with enhanced reliability, precision, and developer autonomy.
7
+
8
+ > If you're familiar with the original OpenAI Cookbook, think of this project as an expanded version designed for production-grade deployments, advanced prompt development, tool integration, and agent design.
9
+
10
+
11
+ ## 🔧 What This Cookbook Offers
12
+
13
+ * **Structured examples** of effective prompting for instruction following, planning, tool usage, and dynamic interactions.
14
+ * **Agent design frameworks** built around persistent task completion and context-aware iteration.
15
+ * **Tool integration patterns** using OpenAI's native tool-calling API — optimized for accuracy and reliability.
16
+ * **Custom workflows** for coding tasks, debugging, testing, and patch management.
17
+ * **Long-context strategies** including prompt shaping, content selection, and information compression for up to 1M tokens.
18
+ * **Production-aligned system prompts** for customer service, support bots, and autonomous coding agents.
19
+
20
+ Whether you're building an agent to manage codebases or optimizing a high-context knowledge retrieval system, the examples here aim to be direct, reproducible, and extensible.
21
+
22
+
23
+ ## 📘 Table of Contents
24
+
25
+ 1. [Getting Started](#getting-started)
26
+ 2. [Prompting for Instruction Following](#prompting-for-instruction-following)
27
+ 3. [Designing Agent Workflows](#designing-agent-workflows)
28
+ 4. [Tool Use and Integration](#tool-use-and-integration)
29
+ 5. [Chain of Thought and Planning](#chain-of-thought-and-planning)
30
+ 6. [Handling Long Contexts](#handling-long-contexts)
31
+ 7. [Code Fixing and Diff Management](#code-fixing-and-diff-management)
32
+ 8. [Real-World Deployment Scenarios](#real-world-deployment-scenarios)
33
+ 9. [Prompt Engineering Reference Guide](#prompt-engineering-reference-guide)
34
+ 10. [API Usage Examples](#api-usage-examples)
35
+
36
+
37
+ ## Getting Started
38
+
39
+ OpenAI Cookbook Pro assumes a basic working knowledge of OpenAI’s Python SDK, the GPT-4.1 API, and how to use the `functions`, `tools`, and `system prompt` fields.
40
+
41
+ If you're new to OpenAI's tools, start here:
42
+
43
+ * [OpenAI Platform Documentation](https://platform.openai.com/docs)
44
+ * [Original OpenAI Cookbook](https://github.com/openai/openai-cookbook)
45
+
46
+ This project builds on those foundations, layering in advanced workflows and reproducible examples for:
47
+
48
+ * Task persistence
49
+ * Iterative debugging
50
+ * Prompt shaping and behavior targeting
51
+ * Multi-step tool planning
52
+
53
+
54
+ ## Prompting for Instruction Following
55
+
56
+ GPT-4.1’s instruction-following capabilities have been significantly improved. To ensure the model performs consistently:
57
+
58
+ * Be explicit. Literal instruction following means subtle ambiguities may derail output.
59
+ * Use clear formatting for instruction sets (Markdown, XML, or numbered lists).
60
+ * Place instructions **at both the top and bottom** of long prompts if the context window exceeds 100K tokens.
61
+
62
+ ### Example: Instruction Template
63
+
64
+ ```markdown
65
+ # Instructions
66
+ 1. Read the user’s message carefully.
67
+ 2. Do not generate a response until you've gathered all needed context.
68
+ 3. Use a tool if more information is required.
69
+ 4. Only respond when you can complete the request correctly.
70
+ ```
71
+
72
+ > See `/examples/instruction-following.md` for more variations and system prompt styles.
73
+
74
+
75
+ ## Designing Agent Workflows
76
+
77
+ GPT-4.1 supports agentic workflows that require multi-step planning, tool usage, and long turn durations. Designing effective agents starts with a disciplined structure:
78
+
79
+ ### Include Three System Prompt Anchors:
80
+
81
+ * **Persistence**: Emphasize that the model should continue until task completion.
82
+ * **Tool usage**: Make it clear that it must use tools if it lacks context.
83
+ * **Planning**: Encourage the model to write out plans and reflect after each action.
84
+
85
+ See `/agent_design/swe_bench_agent.md` for a complete agent example that solves live bugs in open-source repositories.
86
+
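+ A minimal sketch of how these three anchors could be packaged for reuse in code; the constant name and exact wording below are illustrative, not an official template:
+ 
+ ```python
+ # Illustrative helper: bundle the three system prompt anchors into one
+ # reusable prefix that can be prepended to any task-specific instructions.
+ AGENT_ANCHORS = """\
+ # Persistence
+ You are an agent. Keep working until the user's task is fully resolved before ending your turn.
+ 
+ # Tool Use
+ If you are unsure about file contents or codebase structure, use your tools to gather the relevant information. Do not guess.
+ 
+ # Planning
+ Plan before each function call and reflect on the outcome before deciding your next step.
+ """
+ 
+ 
+ def build_agent_system_prompt(task_instructions: str) -> str:
+     """Prepend the agentic anchors to task-specific instructions."""
+     return AGENT_ANCHORS + "\n" + task_instructions
+ ```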
87
+
88
+ ## Tool Use and Integration
89
+
90
+ Leverage the `tools` parameter in OpenAI's API to define function calls. Avoid embedding tool descriptions in prompts — the model performs better when tools are registered explicitly.
91
+
92
+ ### Tool Guidelines
93
+
94
+ * Name your tools clearly.
95
+ * Keep descriptions concise but specific.
96
+ * Provide optional examples in a dedicated `# Examples` section.
97
+
98
+ > Tool-based prompting increases reliability, reduces hallucinations, and helps maintain output consistency.
99
+
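+ As a rough sketch, tool registration happens through the `tools` parameter of the request rather than in the prompt body. The schema below is illustrative (the tool name and fields are examples, not part of any fixed API), and it assumes the official `openai` Python SDK:
+ 
+ ```python
+ from openai import OpenAI
+ 
+ client = OpenAI()
+ 
+ # Illustrative tool schema: clear name, concise description, typed parameters.
+ tools = [
+     {
+         "type": "function",
+         "function": {
+             "name": "get_user_account_info",
+             "description": "Retrieve account details for a customer by phone number.",
+             "parameters": {
+                 "type": "object",
+                 "properties": {
+                     "phone_number": {"type": "string", "description": "Phone number in E.164 format"}
+                 },
+                 "required": ["phone_number"],
+             },
+         },
+     }
+ ]
+ 
+ response = client.chat.completions.create(
+     model="gpt-4.1",
+     messages=[
+         {"role": "system", "content": "Use the provided tools; never guess account data."},
+         {"role": "user", "content": "What's the status of my account? My number is +15551234567."},
+     ],
+     tools=tools,
+ )
+ print(response.choices[0].message)
+ ```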
100
+
101
+ ## Chain of Thought and Planning
102
+
103
+ While GPT-4.1 does not inherently perform internal reasoning, it can be prompted to **think out loud**:
104
+
105
+ ```markdown
106
+ First, identify what documents may be relevant. Then list their titles and relevance. Finally, provide a list of IDs sorted by importance.
107
+ ```
108
+
109
+ Use structured strategies to enforce planning:
110
+
111
+ 1. Break down the query.
112
+ 2. Retrieve and assess context.
113
+ 3. Prioritize response steps.
114
+ 4. Deliver a refined output.
115
+
116
+ > See `/prompting/chain_of_thought.md` for templates and performance impact.
117
+
118
+
119
+ ## Handling Long Contexts
120
+
121
+ GPT-4.1 supports up to **1 million tokens**. To manage this effectively:
122
+
123
+ * Use structure: XML or markdown sections help the model parse relevance.
124
+ * Repeat critical instructions **at the top and bottom** of your prompt.
125
+ * Scope responses by separating external context from user queries.
126
+
127
+ ### Example Format
128
+
129
+ ```xml
130
+ <instructions>
131
+ Only answer based on External Context. Do not make assumptions.
132
+ </instructions>
133
+ <user_query>
134
+ How does the billing policy apply to usage overages?
135
+ </user_query>
136
+ <context>
137
+ <doc id="12" title="Billing Policy">
138
+ [...]
139
+ </doc>
140
+ </context>
141
+ ```
142
+
143
+ > See `/examples/long-context-formatting.md` for formatting guidance.
144
+
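+ One possible way to assemble such a prompt programmatically, repeating the critical instructions before and after the XML-delimited context, is sketched below; the function and field names are illustrative:
+ 
+ ```python
+ from typing import Dict, List
+ 
+ 
+ def build_long_context_prompt(instructions: str, user_query: str, docs: List[Dict[str, str]]) -> str:
+     """Wrap documents in XML delimiters and repeat instructions at the top and bottom."""
+     doc_blocks = "\n".join(
+         f'<doc id="{d["id"]}" title="{d["title"]}">\n{d["content"]}\n</doc>' for d in docs
+     )
+     return (
+         f"<instructions>\n{instructions}\n</instructions>\n"
+         f"<user_query>\n{user_query}\n</user_query>\n"
+         f"<context>\n{doc_blocks}\n</context>\n"
+         # Repeat the instructions at the end so they stay salient in long prompts.
+         f"<instructions>\n{instructions}\n</instructions>"
+     )
+ 
+ 
+ prompt = build_long_context_prompt(
+     "Only answer based on External Context. Do not make assumptions.",
+     "How does the billing policy apply to usage overages?",
+     [{"id": "12", "title": "Billing Policy", "content": "[...]"}],
+ )
+ ```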
145
+
146
+ ## Code Fixing and Diff Management
147
+
148
+ GPT-4.1 includes support for a **tool-compatible diff format** that enables:
149
+
150
+ * Patch generation
151
+ * File updates
152
+ * Inline modifications with full context
153
+
154
+ Use the `apply_patch` tool with the recommended V4A diff format. Always:
155
+
156
+ * Use clear before/after code snippets
157
+ * Avoid relying on line numbers
158
+ * Use `@@` markers to indicate scope
159
+
160
+ > See `/tools/apply_patch_examples/` for real-world patch workflows.
161
+
162
+
163
+ ## Real-World Deployment Scenarios
164
+
165
+ ### Use Cases
166
+
167
+ * **Support automation** using grounded answers and clear tool policies
168
+ * **Code refactoring bots** that operate on large repositories
169
+ * **Document summarization** across thousands of pages
170
+ * **High-integrity report generation** from structured prompt templates
171
+
172
+ Each scenario includes:
173
+
174
+ * Prompt formats
175
+ * Tool definitions
176
+ * Behavior checks
177
+
178
+ > Explore the `/scenarios/` folder for ready-to-run templates.
179
+
180
+
181
+ ## Prompt Engineering Reference Guide
182
+
183
+ A distilled reference for designing robust prompts across various tasks.
184
+
185
+ ### Sections:
186
+
187
+ * General prompt structures
188
+ * Common failure patterns
189
+ * Formatting styles (Markdown, XML, JSON)
190
+ * Long-context techniques
191
+ * Instruction conflict resolution
192
+
193
+ > Found in `/reference/prompting_guide.md`
194
+
195
+
196
+ ## API Usage Examples
197
+
198
+ Includes starter scripts and walkthroughs for:
199
+
200
+ * Tool registration
201
+ * Chat prompt design
202
+ * Instruction tuning
203
+ * Streaming outputs
204
+
205
+ All examples use official OpenAI SDK patterns and can be run locally.
206
+
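+ A minimal starter script in that spirit, assuming the official `openai` Python SDK is installed and `OPENAI_API_KEY` is set in the environment:
+ 
+ ```python
+ from openai import OpenAI
+ 
+ client = OpenAI()  # reads OPENAI_API_KEY from the environment
+ 
+ completion = client.chat.completions.create(
+     model="gpt-4.1",
+     messages=[
+         {"role": "system", "content": "You are a concise assistant. Think step by step."},
+         {"role": "user", "content": "Summarize the V4A patch format in two sentences."},
+     ],
+ )
+ print(completion.choices[0].message.content)
+ ```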
207
+
208
+ ## Contributing
209
+
210
+ We welcome contributions that:
211
+
212
+ * Improve clarity
213
+ * Extend agent workflows
214
+ * Add new prompt techniques
215
+ * Introduce tool examples
216
+
217
+ To contribute:
218
+
219
+ 1. Fork the repo
220
+ 2. Create a new folder under `/examples` or `/tools`
221
+ 3. Submit a PR with a brief description of your addition
222
+
223
+
224
+ ## License
225
+
226
+ This project is released under the MIT License.
227
+
228
+
229
+ ## Acknowledgments
230
+
231
+ This repository builds upon the foundational work of the original [OpenAI Cookbook](https://github.com/openai/openai-cookbook). All strategies are derived from real-world testing, usage analysis, and OpenAI’s 4.1 Prompting Guide (April 2025).
232
+
233
+
234
+ For support or suggestions, feel free to open an issue or connect via [OpenAI Developer Forum](https://community.openai.com).
api_usage.md ADDED
@@ -0,0 +1,326 @@
1
+ # [API Usage Examples with GPT-4.1](https://chatgpt.com/canvas/shared/6825f96694a48191af7648cad2996158)
2
+
3
+ ## Overview
4
+
5
+ This guide provides detailed, real-world examples of using the OpenAI GPT-4.1 API effectively, with a focus on instruction-following, tool integration, agent persistence, and prompt structuring. These examples are designed to help developers and engineers build resilient, production-ready systems using GPT-4.1 across various applications, including customer service, bug fixing, document analysis, and data labeling.
6
+
7
+ Each example illustrates system prompt construction, tool schema definitions, interaction workflows, and failure mitigation strategies.
8
+
9
+
10
+
11
+ ## Example 1: Customer Support Agent with Tool Use
12
+
13
+ ### Objective
14
+
15
+ Deploy a GPT-4.1 assistant to handle user questions about policies and account status.
16
+
17
+ ### System Prompt
18
+
19
+ ```markdown
20
+ You are a helpful assistant for NewTelco. Your job is to assist users with account and policy information.
21
+
22
+ # Instructions
23
+ - Always greet the user.
24
+ - Use tools to retrieve account and policy information.
25
+ - If needed data is missing, ask the user before calling tools.
26
+ - Avoid internal knowledge on restricted topics.
27
+
28
+ # Workflow
29
+ 1. Identify intent
30
+ 2. Call tools if sufficient input
31
+ 3. Message user before and after tool calls
32
+ 4. Cite results
33
+ ```
34
+
35
+ ### Tools
36
+
37
+ ```json
38
+ [
39
+ {
40
+ "name": "lookup_policy_document",
41
+ "description": "Searches internal docs by topic",
42
+ "parameters": {
43
+ "type": "object",
44
+ "properties": {
45
+ "topic": {"type": "string"}
46
+ },
47
+ "required": ["topic"]
48
+ }
49
+ },
50
+ {
51
+ "name": "get_user_account_info",
52
+ "description": "Retrieves account details using phone number",
53
+ "parameters": {
54
+ "type": "object",
55
+ "properties": {
56
+ "phone_number": {"type": "string"}
57
+ },
58
+ "required": ["phone_number"]
59
+ }
60
+ }
61
+ ]
62
+ ```
63
+
64
+ ### User Input
65
+
66
+ ```text
67
+ How much will it cost for international service in France?
68
+ ```
69
+
70
+ ### Assistant Output
71
+
72
+ ```text
73
+ Hi, you've reached NewTelco, how can I help you? 🌍✈️
74
+
75
+ You're asking about international service while traveling to France. Let me check that for you — one moment, please. 🕑
76
+ ```
77
+
78
+ ### Tool Call
79
+
80
+ ```json
81
+ {
82
+ "name": "lookup_policy_document",
83
+ "arguments": {"topic": "international service cost France"}
84
+ }
85
+ ```
86
+
87
+
88
+
89
+ ## Example 2: Software Bug Fix Agent with Diff Application
90
+
91
+ ### Objective
92
+
93
+ Use GPT-4.1 to investigate and fix a bug in a codebase using the `apply_patch` tool.
94
+
95
+ ### System Prompt Highlights
96
+
97
+ ```markdown
98
+ # Objective
99
+ Fix the issue provided by the user.
100
+
101
+ # Instructions
102
+ - Plan each step
103
+ - Reflect after each function call
104
+ - Never guess code — read it first using tools
105
+ - Only stop when all tests pass
106
+
107
+ # Workflow
108
+ 1. Understand issue deeply
109
+ 2. Investigate codebase
110
+ 3. Draft patch
111
+ 4. Apply patch
112
+ 5. Run tests
113
+ 6. Reflect and finalize
114
+ ```
115
+
116
+ ### Tool Definition
117
+
118
+ ```json
119
+ {
120
+ "name": "python",
121
+ "description": "Execute code or apply a patch",
122
+ "parameters": {
123
+ "type": "object",
124
+ "properties": {
125
+ "input": {"type": "string"}
126
+ },
127
+ "required": ["input"]
128
+ }
129
+ }
130
+ ```
131
+
132
+ ### Tool Call Example
133
+
134
+ ```bash
135
+ %%bash
136
+ apply_patch <<"EOF"
137
+ *** Begin Patch
138
+ *** Update File: src/core.py
139
+ @@ def is_valid():
140
+ - return False
141
+ + return True
142
+ *** End Patch
143
+ EOF
144
+ ```
145
+
146
+ ### Test Execution
147
+
148
+ ```json
149
+ {
150
+ "name": "python",
151
+ "arguments": {"input": "!python3 run_tests.py"}
152
+ }
153
+ ```
154
+
155
+
156
+
157
+ ## Example 3: Long-Context Document Analyzer
158
+
159
+ ### Objective
160
+
161
+ Summarize and extract insights from up to 1M tokens of context.
162
+
163
+ ### Prompt Sections
164
+
165
+ ```markdown
166
+ # Instructions
167
+ - Process documents in 10k token blocks
168
+ - Reflect after each segment
169
+ - Label relevance and extract core ideas
170
+
171
+ # Strategy
172
+ 1. Read → summarize
173
+ 2. Score relevance
174
+ 3. Synthesize into unified output
175
+ ```
176
+
177
+ ### Input Format
178
+
179
+ ```xml
180
+ <doc id="21" title="Policy Update">
181
+ <summary>Changes to international billing rules</summary>
182
+ <content>...</content>
183
+ </doc>
184
+ ```
185
+
186
+ ### Assistant Behavior
187
+
188
+ * Chunk input into 10k token sections
189
+ * After each, provide a summary and document scores
190
+ * Compile findings at end
191
+
192
+
193
+
194
+ ## Example 4: Data Labeling Assistant
195
+
196
+ ### Objective
197
+
198
+ Assist with structured classification tasks.
199
+
200
+ ### Prompt Template
201
+
202
+ ```markdown
203
+ # Instructions
204
+ - Label each entry using the provided schema
205
+ - Do not guess; if unsure, flag for human
206
+
207
+ # Labeling Categories
208
+ - Urgent
209
+ - Normal
210
+ - Spam
211
+
212
+ # Output Format
213
+ {"text": ..., "label": ...}
214
+
215
+ # Example
216
+ {"text": "Win money now!", "label": "Spam"}
217
+ ```
218
+
219
+ ### User Input
220
+
221
+ ```json
222
+ [
223
+ "New system update available",
224
+ "Limited time offer! Click now",
225
+ "Server crashed, need help ASAP"
226
+ ]
227
+ ```
228
+
229
+ ### Assistant Output
230
+
231
+ ```json
232
+ [
233
+ {"text": "New system update available", "label": "Normal"},
234
+ {"text": "Limited time offer! Click now", "label": "Spam"},
235
+ {"text": "Server crashed, need help ASAP", "label": "Urgent"}
236
+ ]
237
+ ```
238
+
239
+
240
+
241
+ ## Example 5: Chain-of-Thought for Multi-Hop Reasoning
242
+
243
+ ### Objective
244
+
245
+ Support a planning task by explicitly breaking down the steps.
246
+
247
+ ### Prompt Template
248
+
249
+ ```markdown
250
+ # Instructions
251
+ First, think carefully step by step. Then output the result.
252
+
253
+ # Reasoning Strategy
254
+ 1. Identify user question
255
+ 2. Extract context
256
+ 3. Connect information across documents
257
+ 4. Output answer
258
+ ```
259
+
260
+ ### Example Input
261
+
262
+ ```markdown
263
+ # User Question
264
+ How did the billing policy change after 2022?
265
+
266
+ # Context
267
+ <doc id="10" title="Policy 2022">...</doc>
268
+ <doc id="12" title="Policy 2023">...</doc>
269
+ ```
270
+
271
+ ### Model Output
272
+
273
+ ```text
274
+ Step 1: Identify relevant documents → IDs 10, 12
275
+ Step 2: Compare clauses
276
+ Step 3: 2022 had flat rates, 2023 added time-of-use billing
277
+ Answer: Billing policy changed to time-based pricing in 2023.
278
+ ```
279
+
280
+
281
+
282
+ ## General Prompt Formatting Guidelines
283
+
284
+ ### Preferred Structure
285
+
286
+ ```markdown
287
+ # Role
288
+ # Instructions
289
+ # Workflow (optional)
290
+ # Reasoning Strategy (optional)
291
+ # Output Format
292
+ # Examples (optional)
293
+ ```
294
+
295
+ ### Tool Use Reminders
296
+
297
+ * Only call tools when sufficient information is available
298
+ * Always notify the user before and after calls
299
+ * Use example-triggered calls for teaching tool behavior
300
+
301
+ ### Output Patterns
302
+
303
+ * JSON or markdown preferred
304
+ * Cite source documents if used
305
+ * Include fallback responses if uncertain (e.g., "Insufficient context"); a validation sketch follows this list
306
+
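+ A minimal sketch of that validation step, assuming the model was asked to return a JSON object with `answer` and `source` fields (the schema is illustrative):
+ 
+ ```python
+ import json
+ 
+ FALLBACK = {"answer": "Insufficient context", "source": None}
+ 
+ 
+ def parse_model_json(raw: str) -> dict:
+     """Validate model output as JSON before using it; fall back if malformed."""
+     try:
+         parsed = json.loads(raw)
+     except json.JSONDecodeError:
+         return FALLBACK
+     # Require the fields the prompt asked for.
+     if not isinstance(parsed, dict) or not {"answer", "source"} <= parsed.keys():
+         return FALLBACK
+     return parsed
+ ```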
307
+
308
+
309
+ ## Best Practices Summary
310
+
311
+ | Element | Best Practice |
312
+ | ------------ | --------------------------------------------- |
313
+ | Tool Calls | Always define schema with strong param names |
314
+ | Planning | Enforce pre- and post-action reflection |
315
+ | Output | Enforce format, validate JSON before response |
316
+ | Long Context | Use structured delimiters (Markdown, XML) |
317
+ | Labeling | Use few-shot examples and explicit categories |
318
+ | Diff Format | Use V4A patch format for code updates |
319
+
320
+
321
+
322
+ ## Final Note
323
+
324
+ These examples are starting templates. Each system will benefit from iterative refinements, structured logging, and real-world user testing. Maintain modular prompts and tool schemas, and adopt evaluation frameworks to monitor performance over time.
325
+
326
+ **Clarity, structure, and instruction adherence are the cornerstones of production-grade GPT-4.1 API design.**
chain_of_thought_planning.md ADDED
@@ -0,0 +1,195 @@
1
+ # [Chain of Thought and Planning in GPT-4.1](https://chatgpt.com/canvas/shared/6825f035f4b8819188e481e6e5cab29e)
2
+ ## Overview
3
+
4
+ This document serves as a comprehensive and standalone guide for implementing effective chain-of-thought prompting and planning techniques with the OpenAI GPT-4.1 model family. It draws from official prompt engineering strategies outlined in the OpenAI 4.1 Cookbook and translates them into an accessible, implementation-ready format for developers, researchers, and product engineers.
5
+
6
+ ## Key Goals
7
+
8
+ 1. Enable step-by-step problem-solving via structured reasoning.
9
+ 2. Amplify agentic behavior in tool-using contexts.
10
+ 3. Minimize hallucinations by encouraging reflective planning.
11
+ 4. Improve task completion rates in software engineering and knowledge work.
12
+ 5. Align prompt design with model strengths in instruction-following and long-context awareness.
13
+
14
+ ## Core Principles
15
+
16
+ ### 1. Chain-of-Thought (CoT) Induction
17
+
18
+ GPT-4.1 does not natively reason before answering; however, it can be prompted to simulate reasoning through structured instructions. This is known as "chain-of-thought prompting."
19
+
20
+ **Prompting Template:**
21
+
22
+ > "Before answering, think step by step about what’s needed to solve the task. Then begin executing."
23
+
24
+ Chain-of-thought is especially effective when applied to:
25
+
26
+ * Multi-hop reasoning questions
27
+ * Complex analytical tasks
28
+ * Document triage and synthesis
29
+ * Code tracing and debugging
30
+
31
+ ### 2. Agentic Planning
32
+
33
+ The model can be transformed into a more proactive, autonomous agent through three types of reminders:
34
+
35
+ * **Persistence Reminder:** Encourages continuation across multiple turns.
36
+ * **Tool-Use Reminder:** Discourages guessing; reinforces fact-finding.
37
+ * **Planning Reminder:** Encourages step-by-step thinking before and after tool use.
38
+
39
+ **Agentic Prompting Snippet:**
40
+
41
+ ```text
42
+ You are an agent. Keep going until the query is fully resolved. Use tools instead of guessing. Plan your actions and reflect after each step.
43
+ ```
44
+
45
+ This significantly increases model adherence to goals and improves results in complex domains like software engineering, particularly on structured benchmarks like SWE-bench Verified.
46
+
47
+ ### 3. Explicit Workflow Structuring
48
+
49
+ Providing workflows as ordered lists increases adherence and performance. This creates a "mental model" the assistant follows.
50
+
51
+ **Example Workflow:**
52
+
53
+ ```text
54
+ 1. Understand the query.
55
+ 2. Identify relevant context.
56
+ 3. Create a solution plan.
57
+ 4. Execute steps incrementally.
58
+ 5. Verify and test.
59
+ 6. Reflect and iterate.
60
+ ```
61
+
62
+ This structure serves a dual purpose: guiding the model and signaling the assistant's reasoning process to users.
63
+
64
+ ### 4. Contextual Grounding
65
+
66
+ In long-context situations (e.g., 100K+ token sessions), instruction placement matters:
67
+
68
+ * **Place instructions at both start and end of context blocks.**
69
+ * **Use markdown or XML delimiters for structure.**
70
+
71
+ Avoid JSON when loading multiple documents; XML or structured markdown outperforms it.
72
+
73
+ ### 5. Output Control Through Instruction Templates
74
+
75
+ Instruction adherence improves when you:
76
+
77
+ * Start with high-level **Response Rules**.
78
+ * Follow with a **Step-by-Step Plan**.
79
+ * Include examples demonstrating the expected behavior.
80
+ * End with an instruction to think step by step.
81
+
82
+ **Example Prompt Structure:**
83
+
84
+ ```markdown
85
+ # Instructions
86
+ - Respond concisely.
87
+ - Think before acting.
88
+ - Use only tools provided.
89
+
90
+ # Steps
91
+ 1. Interpret the question.
92
+ 2. Search the context.
93
+ 3. Synthesize the answer.
94
+
95
+ # Example
96
+ **Q:** What caused the error?
97
+ **A:** Let's review the logs first...
98
+
99
+ # Final Thought Instruction
100
+ Think step by step before answering.
101
+ ```
102
+
103
+ ## Planning in Practice
104
+
105
+ Below is a sample prompt segment leveraging all core planning and chain-of-thought features:
106
+
107
+ ```text
108
+ You must:
109
+ - Plan extensively before calling any function.
110
+ - Reflect on outcomes after each call.
111
+ - Do not chain tools blindly.
112
+ - Be cautious of false positives or early stopping.
113
+ - Your solution must pass all tests, including hidden ones.
114
+
115
+ Always verify:
116
+ - Is your solution logically sound?
117
+ - Have you tested edge cases?
118
+ - Are additional test cases required?
119
+ ```
120
+
121
+ This style boosts planning performance by up to 4% in SWE-bench according to OpenAI’s own testing.
122
+
123
+ ## Debugging Chain-of-Thought Failures
124
+
125
+ Chain-of-thought prompts may fail due to:
126
+
127
+ * Ambiguous user intent
128
+ * Misidentification of relevant context
129
+ * Overly abstract plans without execution
130
+
131
+ **Countermeasures:**
132
+
133
+ * Break user queries into sub-components.
134
+ * Have the model rate the relevance of documents.
135
+ * Include specific test cases as checksums for correct reasoning.
136
+
137
+ **Correction Template:**
138
+
139
+ ```text
140
+ Let’s revise. Where did the plan fail? What assumption was wrong? Was context misused?
141
+ ```
142
+
143
+ ## Long-Context Planning Strategies
144
+
145
+ When context windows expand to 1M tokens:
146
+
147
+ * Encourage summarization between reasoning steps.
148
+ * Anchor sub-conclusions before proceeding.
149
+ * Repeat critical instructions at interval checkpoints.
150
+
151
+ **Chunked Reasoning Pattern:**
152
+
153
+ ```text
154
+ Summarize findings every 10,000 tokens.
155
+ Checkpoint progress with titles and delimiters.
156
+ Reflect before moving to the next section.
157
+ ```
158
+
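+ A rough sketch of that pattern as a driver loop, assuming the input has already been split into segments of roughly 10K tokens and that the official `openai` Python SDK is available (chunking and error handling are simplified):
+ 
+ ```python
+ from typing import List
+ 
+ from openai import OpenAI
+ 
+ client = OpenAI()
+ 
+ 
+ def summarize_in_chunks(segments: List[str], question: str) -> str:
+     """Process long input segment by segment, carrying forward interim summaries."""
+     running_notes: List[str] = []
+     for i, segment in enumerate(segments, start=1):
+         notes_so_far = "\n".join(running_notes) or "(none yet)"
+         resp = client.chat.completions.create(
+             model="gpt-4.1",
+             messages=[
+                 {"role": "system", "content": "Summarize findings for each segment, then reflect before continuing."},
+                 {"role": "user", "content": f"Question: {question}\n\nNotes so far:\n{notes_so_far}\n\nSegment {i}:\n{segment}"},
+             ],
+         )
+         running_notes.append(resp.choices[0].message.content)
+     # The final entry doubles as the synthesized answer in this simplified sketch.
+     return "\n\n".join(running_notes)
+ ```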
159
+ ## Tool Use Integration
160
+
161
+ GPT-4.1 supports structured tool calls (functions, APIs, CLI commands). Effective planning enhances tool use via:
162
+
163
+ * Context-aware parameter setting
164
+ * Post-tool-call reflection
165
+ * Avoiding premature tool use
166
+
167
+ **Tool Use Best Practices:**
168
+
169
+ * Name tools clearly and descriptively
170
+ * Provide concise, structured descriptions
171
+ * Offer usage examples outside of the tool schema
172
+
173
+ ## Practical Use Cases
174
+
175
+ * **Software Agents**: Reliable plan-execute-reflect loops
176
+ * **Data Analysis**: Step-by-step exploration of CSVs or logs
177
+ * **Scientific Reasoning**: Layered hypothesis evaluation
178
+ * **Customer Service Bots**: Pre-check user input → tool call → output validation
179
+
180
+ ## Future-Proofing Your Prompts
181
+
182
+ Prompting is an empirical, iterative process. Maintain versioned prompt libraries and monitor:
183
+
184
+ * Performance regressions
185
+ * Latency vs. completeness tradeoffs
186
+ * Tool call efficiency
187
+ * Instruction adherence
188
+
189
+ Track systematic errors over time and codify high-performing reasoning strategies into your core prompts.
190
+
191
+ ## Summary
192
+
193
+ Chain-of-thought and planning, when intentionally embedded in GPT-4.1 prompts, unlock powerful new workflows for complex reasoning, debugging, and autonomous task completion. While GPT-4.1 does not reason innately, its ability to simulate planning and stepwise logic makes it a potent co-processor for advanced tasks.
194
+
195
+ **Start with clarity. Plan before acting. Reflect after execution.** That is the path to leveraging GPT-4.1 effectively for sophisticated agentic behavior.
code_fixing_and_diff.md ADDED
@@ -0,0 +1,254 @@
1
+ # [Code Fixing and Diff Management in GPT-4.1](https://chatgpt.com/canvas/shared/6825f21e65388191b9fb0baa737c1f18)
2
+
3
+ ## Overview
4
+
5
+ This document provides a comprehensive implementation guide for code fixing and diff generation strategies using the OpenAI GPT-4.1 model. It is designed to help developers and tool builders harness the model’s improved agentic behavior, tool integration, and patch application capabilities. The guidance herein is based on OpenAI’s internal agentic workflows, as tested on SWE-bench Verified and related coding benchmarks.
6
+
7
+
8
+ ## Objectives
9
+
10
+ * Enable GPT-4.1 to autonomously fix software bugs with minimal user intervention
11
+ * Standardize high-performance diff formats that GPT-4.1 understands well
12
+ * Leverage tool-calling strategies that minimize hallucination and improve precision
13
+ * Scaffold workflows for validation, patch application, and iterative debugging
14
+
15
+
16
+ ## Core Principles for Effective Bug Fixing
17
+
18
+ ### 1. Persistent Multi-Step Execution
19
+
20
+ To prevent premature termination, always instruct the model to:
21
+
22
+ ```text
23
+ Continue working until the issue is fully resolved. Do not return control to the user unless the fix is complete and validated.
24
+ ```
25
+
26
+ This aligns GPT-4.1’s behavior with full agent-mode operation.
27
+
28
+ ### 2. Tool-Use Encouragement
29
+
30
+ Rather than letting the model hallucinate file contents:
31
+
32
+ ```text
33
+ Use your tools to examine the file system or source code. Never guess.
34
+ ```
35
+
36
+ This ensures queries are grounded in actual project state.
37
+
38
+ ### 3. Planning and Reflection Enforcement
39
+
40
+ Prompt the model to:
41
+
42
+ * Plan before tool calls
43
+ * Reflect after each execution
44
+ * Avoid chains of back-to-back tool calls without synthesis in between
45
+
46
+ **Prompt Template:**
47
+
48
+ ```text
49
+ You MUST plan extensively before calling a function, and reflect thoroughly on its output before deciding your next step.
50
+ ```
51
+
52
+
53
+ ## Workflow Structure
54
+
55
+ ### High-Level Task Phases
56
+
57
+ 1. **Understand the Bug**
58
+ 2. **Explore the Codebase**
59
+ 3. **Plan the Fix**
60
+ 4. **Edit the Code**
61
+ 5. **Debug and Test**
62
+ 6. **Reflect and Finalize**
63
+
64
+ Each of these phases should be scaffolded in the prompt or system instructions.
65
+
66
+ ### Recommended Prompt Structure
67
+
68
+ ```markdown
69
+ # Instructions
70
+ - Fix the bug completely before ending.
71
+ - Use available tools.
72
+ - Think step-by-step before and after each action.
73
+
74
+ # Workflow
75
+ 1. Understand the issue.
76
+ 2. Investigate the source files.
77
+ 3. Plan an incremental fix.
78
+ 4. Apply and validate patch.
79
+ 5. Test extensively.
80
+ 6. Reflect and iterate.
81
+ ```
82
+
83
+
84
+ ## The V4A Patch Format (Recommended)
85
+
86
+ GPT-4.1 performs best with this clear, human-readable patch format:
87
+
88
+ ```bash
89
+ *** Begin Patch
90
+ *** Update File: path/to/file.py
91
+ @@ def some_function():
92
+ context_before
93
+ - buggy_code()
94
+ + fixed_code()
95
+ context_after
96
+ *** End Patch
97
+ ```
98
+
99
+ ### Diff Format Rules
100
+
101
+ * Use `*** Update File:` to mark the file.
102
+ * Use `@@` to denote function or class scope.
103
+ * Precede old code lines with `-`, new code with `+`.
104
+ * Include 3 lines of context above and below the change.
105
+ * If needed, add nested `@@` scopes for disambiguation.
106
+
107
+ **Avoid line numbers**; GPT-4.1 does not rely on them. It uses code context instead.
108
+
109
+
110
+ ## Tool Configuration: `apply_patch`
111
+
112
+ To simulate developer workflows, define a function tool with this pattern:
113
+
114
+ ```json
115
+ {
116
+ "name": "apply_patch",
117
+ "description": "Apply V4A diff patches to source files",
118
+ "parameters": {
119
+ "type": "object",
120
+ "properties": {
121
+ "input": { "type": "string" }
122
+ },
123
+ "required": ["input"]
124
+ }
125
+ }
126
+ ```
127
+
128
+ **Input Example:**
129
+
130
+ ```bash
131
+ %%bash
132
+ apply_patch <<"EOF"
133
+ *** Begin Patch
134
+ *** Update File: mymodule/core.py
135
+ @@ def validate():
136
+ - return False
137
+ + return True
138
+ *** End Patch
139
+ EOF
140
+ ```
141
+
142
+ The `apply_patch` tool accepts multi-file patches. Each file must be preceded by its action (`Add`, `Update`, or `Delete`).
143
+
144
+
145
+ ## Testing Strategy
146
+
147
+ ### Manual Testing within Prompt:
148
+
149
+ Prompt the model to run tests after every change:
150
+
151
+ ```text
152
+ Run all unit tests using `!python3 run_tests.py`. Do not assume success without verification.
153
+ ```
154
+
155
+ ### Encourage Reflection:
156
+
157
+ ```text
158
+ Did the test results indicate success? Were any edge cases missed? Do you need to write new tests?
159
+ ```
160
+
161
+ ### Output Evaluation:
162
+
163
+ * If tests fail, model should explain why and iterate
164
+ * If tests pass, model should reflect before finalizing
165
+
166
+
167
+ ## Debugging and Investigation Techniques
168
+
169
+ ### Investigation Plan Example:
170
+
171
+ ```text
172
+ I will begin by reading the test file that triggered the error, then locate the corresponding implementation file. From there, I’ll trace the logic and verify any assumptions.
173
+ ```
174
+
175
+ ### Debugging Prompt Reminders:
176
+
177
+ * Never change code without full context
178
+ * Use tools to inspect contents before editing
179
+ * Print debug output if necessary
180
+
181
+
182
+ ## Failure Mode Mitigations
183
+
184
+ | Failure Mode | Fix Strategy |
185
+ | ---------------------------- | ----------------------------------------------------------------------- |
186
+ | Patch applied in wrong place | Add more surrounding context or use double `@@` scope |
187
+ | Patch fails silently | Check patch syntax and apply logs before "Done!" line |
188
+ | Model ends before testing | Insert reminder: "Do not conclude until all tests are validated." |
189
+ | Partial bug fixes | Require model to re-verify against original issue and user expectations |
190
+
191
+
192
+ ## Final Validation Phase
193
+
194
+ Before finalizing a solution, prompt the model to:
195
+
196
+ * Re-read the original problem description
197
+ * Confirm alignment between intent and fix
198
+ * Run a fresh test suite
199
+ * Draft additional tests for uncovered scenarios
200
+ * Watch for silent failures or fragile patches
201
+
202
+ ### Final Prompt Template:
203
+
204
+ ```text
205
+ Think about the original bug and the goal. Is your fix logically complete? Did you run all tests? Are hidden edge cases covered?
206
+ ```
207
+
208
+
209
+ ## Alternative Diff Formats
210
+
211
+ If you need variations, GPT-4.1 performs well with:
212
+
213
+ ### Search/Replace Format
214
+
215
+ ```text
216
+ path/to/file.py
217
+ >>>>>> SEARCH
218
+ def broken():
219
+ pass
220
+ =======
221
+ def broken():
222
+ raise Exception("Fix me")
223
+ <<<<<< REPLACE
224
+ ```
225
+
226
+ ### Pseudo-XML Format
227
+
228
+ ```xml
229
+ <edit>
230
+ <file>path/to/file.py</file>
231
+ <old_code>def old(): pass</old_code>
232
+ <new_code>def old(): raise NotImplementedError()</new_code>
233
+ </edit>
234
+ ```
235
+
236
+ These are most useful in pipeline or IDE-integrated settings.
237
+
238
+
239
+ ## Best Practices Summary
240
+
241
+ | Principle | Practice |
242
+ | ------------------------- | ------------------------------------------------------ |
243
+ | Persistent Agent Behavior | Model must keep going until the fix is verified |
244
+ | Reflection | Insert plan-and-reflect instructions at each phase |
245
+ | Patch Format | Use V4A or equivalent context-driven diff structure |
246
+ | Testing | Prompt to test after every step |
247
+ | Finalization | Always include a validation + extra test writing phase |
248
+
249
+
250
+ ## Conclusion
251
+
252
+ GPT-4.1 can serve as a robust code-fixing agent when scaffolded with precise patch formats, rigorous test validation, and persistent reflection mechanisms. By integrating tool calls such as `apply_patch` and emphasizing validation over completion, developers can reliably use the model for end-to-end issue resolution workflows.
253
+
254
+ **Build the fix. Test the outcome. Validate the solution.** That’s the foundation for agentic software repair with GPT-4.1.
cookbook_pro.md ADDED
@@ -0,0 +1,288 @@
1
+ # [OpenAI Cookbook Pro: Comprehensive GPT-4.1 Application Framework](https://chatgpt.com/canvas/shared/6825fb38b0e0819184bb3153a3eb1a52)
2
+
3
+ ## Introduction
4
+
5
+ This document represents a fully evolved, professional-grade implementation of the OpenAI 4.1 Cookbook. It serves as a unified, production-ready guide for applied large language model deployment using GPT-4.1. Each section draws from OpenAI's internal best practices and external application patterns to provide a durable blueprint for advanced AI developers, architects, and researchers.
6
+
7
+ This Cookbook Pro version encapsulates:
8
+
9
+ * High-performance agentic prompting workflows
10
+ * Instruction literalism and planning strategies
11
+ * Long-context structuring methods
12
+ * Tool-calling schemas and evaluation principles
13
+ * Diff management and debugging strategies
14
+
15
+ ---
16
+
17
+ ## Part I — Agentic Workflows
18
+
19
+ ### 1.1 Prompt Harness Configuration
20
+
21
+ #### Three Essential Prompt Reminders:
22
+
23
+ ```markdown
24
+ # Persistence
25
+ You are an agent—keep working until the task is fully resolved. Do not yield control prematurely.
26
+
27
+ # Tool-Calling
28
+ If unsure about file or codebase content, use tools to gather accurate information. Do not guess.
29
+
30
+ # Planning
31
+ Before and after every function call, explicitly plan and reflect. Avoid tool-chaining without synthesis.
32
+ ```
33
+
34
+ These instructions significantly increase performance and enable stateful execution in multi-message tasks.
35
+
36
+ ### 1.2 Example: SWE-Bench Verified Prompt
37
+
38
+ ```markdown
39
+ # Objective
40
+ Fully resolve a software bug from an open-source issue.
41
+
42
+ # Workflow
43
+ 1. Understand the problem.
44
+ 2. Explore relevant files.
45
+ 3. Plan incremental fix steps.
46
+ 4. Apply code patches.
47
+ 5. Test thoroughly.
48
+ 6. Reflect and iterate until all tests pass.
49
+
50
+ # Constraint
51
+ Only end the session when the problem is fully fixed and verified.
52
+ ```
53
+
54
+ ---
55
+
56
+ ## Part II — Instruction Following & Output Control
57
+
58
+ ### 2.1 Instruction Clarity Protocol
59
+
60
+ Use:
61
+
62
+ * `# Instructions`: General rules
63
+ * `## Subsections`: Detailed formatting and behavioral constraints
64
+ * Explicit instruction/response pairings
65
+
66
+ ### 2.2 Sample Format
67
+
68
+ ```markdown
69
+ # Instructions
70
+ - Always greet the user.
71
+ - Avoid internal knowledge for company-specific questions.
72
+ - Cite retrieved content.
73
+
74
+ # Workflow
75
+ 1. Acknowledge the user.
76
+ 2. Call tools before answering.
77
+ 3. Reflect and respond.
78
+
79
+ # Output Format
80
+ Use: JSON with `title`, `answer`, `source` fields.
81
+ ```
82
+
83
+ ---
84
+
85
+ ## Part III — Tool Integration and Execution
86
+
87
+ ### 3.1 Schema Guidelines
88
+
89
+ Define tools via the `tools` API parameter, not inline prompt injection.
90
+
91
+ #### Tool Schema Template
92
+
93
+ ```json
94
+ {
95
+ "name": "lookup_policy_document",
96
+ "description": "Retrieve company policy details by topic.",
97
+ "parameters": {
98
+ "type": "object",
99
+ "properties": {
100
+ "topic": {"type": "string"}
101
+ },
102
+ "required": ["topic"]
103
+ }
104
+ }
105
+ ```
106
+
107
+ ### 3.2 Tool Usage Best Practices
108
+
109
+ * Define sample tool calls in `# Examples` sections
110
+ * Never overload the `description` field
111
+ * Validate inputs with required keys
112
+ * Prompt the model to message the user before and after calls (a driver-loop sketch follows this list)
113
+
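+ A sketch of the surrounding driver loop, assuming the Chat Completions tool-calling interface of the official `openai` Python SDK and an illustrative local `run_tool` dispatcher:
+ 
+ ```python
+ import json
+ 
+ from openai import OpenAI
+ 
+ client = OpenAI()
+ 
+ 
+ def run_tool(name: str, args: dict) -> str:
+     """Illustrative dispatcher mapping tool names to local functions."""
+     raise NotImplementedError("wire this to your tool layer")
+ 
+ 
+ def answer_with_tools(messages: list, tools: list) -> str:
+     while True:
+         resp = client.chat.completions.create(model="gpt-4.1", messages=messages, tools=tools)
+         msg = resp.choices[0].message
+         if not msg.tool_calls:
+             return msg.content  # the model answered directly
+         messages.append(msg)  # keep the assistant turn that requested the calls
+         for call in msg.tool_calls:
+             result = run_tool(call.function.name, json.loads(call.function.arguments))
+             messages.append({"role": "tool", "tool_call_id": call.id, "content": result})
+ ```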
114
+ ---
115
+
116
+ ## Part IV — Planning and Chain-of-Thought Induction
117
+
118
+ ### 4.1 Step-by-Step Prompting Pattern
119
+
120
+ ```markdown
121
+ # Reasoning Strategy
122
+ 1. Query breakdown
123
+ 2. Context extraction
124
+ 3. Document relevance ranking
125
+ 4. Answer synthesis
126
+
127
+ # Instruction
128
+ Think step by step. Summarize relevant documents before answering.
129
+ ```
130
+
131
+ ### 4.2 Failure Mitigation Strategies
132
+
133
+ | Problem | Fix |
134
+ | ----------------- | ------------------------------------------- |
135
+ | Early response | Add: “Don’t conclude until fully resolved.” |
136
+ | Tool guess | Add: “Use tool or ask for missing data.” |
137
+ | CoT inconsistency | Prompt: “Summarize findings at each step.” |
138
+
139
+ ---
140
+
141
+ ## Part V — Long Context Optimization
142
+
143
+ ### 5.1 Instruction Anchoring
144
+
145
+ * Repeat instructions at both top and bottom of long input
146
+ * Use structured section headers (Markdown/XML)
147
+
148
+ ### 5.2 Effective Delimiters
149
+
150
+ | Type     | Example                 | Use Case               |
+ | -------- | ----------------------- | ---------------------- |
+ | Markdown | `## Section Title`      | General purpose        |
+ | XML      | `<doc id='1'>...</doc>` | Document ingestion     |
+ | ID/Title | `ID: 3 \| TITLE: ...`   | Knowledge base parsing |
155
+
156
+ ### 5.3 Example Prompt
157
+
158
+ ```markdown
159
+ # Instructions
160
+ Use only documents provided. Reflect every 10K tokens.
161
+
162
+ # Long Context Input
163
+ <doc id="14" title="Security Policy">...</doc>
164
+ <doc id="15" title="Update Note">...</doc>
165
+
166
+ # Final Instruction
167
+ List all relevant IDs, then synthesize a summary.
168
+ ```
169
+
170
+ ---
171
+
172
+ ## Part VI — Diff Generation and Patch Application
173
+
174
+ ### 6.1 Recommended Format: V4A Diff
175
+
176
+ ```bash
177
+ *** Begin Patch
178
+ *** Update File: src/utils.py
179
+ @@ def sanitize()
180
+ - return text
181
+ + return text.strip()
182
+ *** End Patch
183
+ ```
184
+
185
+ ### 6.2 Diff Patch Execution Tool
186
+
187
+ ```json
188
+ {
189
+ "name": "apply_patch",
190
+ "description": "Apply structured code patches to files",
191
+ "parameters": {
192
+ "type": "object",
193
+ "properties": {
194
+ "input": {"type": "string"}
195
+ },
196
+ "required": ["input"]
197
+ }
198
+ }
199
+ ```
200
+
201
+ ### 6.3 Workflow
202
+
203
+ 1. Investigate issue
204
+ 2. Draft V4A patch
205
+ 3. Call `apply_patch`
206
+ 4. Run tests
207
+ 5. Reflect
208
+
209
+ ### 6.4 Edge Case Handling
210
+
211
+ | Symptom | Action |
212
+ | ------------------- | ----------------------------------- |
213
+ | Incorrect placement | Add `@@ def` or class scope headers |
214
+ | Test failures | Revise patch + rerun |
215
+ | Silent error | Check for malformed format |
216
+
217
+ ---
218
+
219
+ ## Part VII — Output Evaluation Framework
220
+
221
+ ### 7.1 Metrics to Track
222
+
223
+ | Metric | Description |
224
+ | -------------------------- | ---------------------------------------------------- |
225
+ | Tool Call Accuracy | Valid input usage and correct function selection |
226
+ | Response Format Compliance | Matches expected schema (e.g., JSON) |
227
+ | Instruction Adherence | Follows rules and workflow order |
228
+ | Plan Reflection Rate | Frequency and quality of plan → act → reflect cycles |
229
+
230
+ ### 7.2 Eval Tags for Audit
231
+
232
+ ```markdown
233
+ # Eval: TOOL_USE_FAIL
234
+ # Eval: INSTRUCTION_MISINTERPRET
235
+ # Eval: OUTPUT_FORMAT_OK
236
+ ```
237
+
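+ A minimal sketch of recording these audit tags per response for later aggregation; the storage is just an in-memory list here and the tag strings mirror the examples above:
+ 
+ ```python
+ from dataclasses import dataclass, field
+ from typing import List
+ 
+ 
+ @dataclass
+ class EvalRecord:
+     response_id: str
+     tags: List[str] = field(default_factory=list)
+ 
+ 
+ eval_log: List[EvalRecord] = []
+ 
+ 
+ def tag_response(response_id: str, *tags: str) -> None:
+     """Record audit tags such as TOOL_USE_FAIL or OUTPUT_FORMAT_OK."""
+     eval_log.append(EvalRecord(response_id=response_id, tags=list(tags)))
+ 
+ 
+ tag_response("resp_001", "OUTPUT_FORMAT_OK")
+ tag_response("resp_002", "TOOL_USE_FAIL", "INSTRUCTION_MISINTERPRET")
+ ```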
238
+ ---
239
+
240
+ ## Part VIII — Unified Prompt Template
241
+
242
+ Use this as a base structure for all GPT-4.1 projects:
243
+
244
+ ```markdown
245
+ # Role
246
+ You are a [role] tasked with [objective].
247
+
248
+ # Instructions
249
+ [List core rules here.]
250
+
251
+ ## Response Rules
252
+ - Always use structured formatting
253
+ - Never repeat phrases verbatim
254
+
255
+ ## Workflow
256
+ [Include ordered plan.]
257
+
258
+ ## Reasoning Strategy
259
+ [Optional — for advanced reasoning tasks.]
260
+
261
+ # Output Format
262
+ [Specify format, e.g., JSON or Markdown.]
263
+
264
+ # Examples
265
+ ## Example 1
266
+ Input: "..."
267
+ Output: {...}
268
+ ```
269
+
270
+ ---
271
+
272
+ ## Final Notes
273
+
274
+ GPT-4.1 represents a leap forward in real-world agentic performance, tool adherence, long-context reliability, and instruction precision. However, performance hinges on prompt clarity, structured reasoning scaffolds, and modular tool integration.
275
+
276
+ To deploy GPT-4.1 at professional scale:
277
+
278
+ * Treat every prompt as a program
279
+ * Document assumptions
280
+ * Version control your system messages
281
+ * Build continuous evals for regression prevention
282
+
283
+ **Structure drives performance. Precision enables autonomy.**
284
+
285
+ Welcome to Cookbook Pro.
286
+
287
+ —End of Guide—
288
+
designing_agent_workflows.md ADDED
@@ -0,0 +1,354 @@
1
+ # [Designing Agent Workflows](https://chatgpt.com/canvas/shared/6825ece10cac819189e14d95e8ecd032)
2
+
3
+ ## Overview
4
+
5
+ The GPT-4.1 model introduces significant improvements in agentic capabilities, making it ideal for designing multi-turn workflows that rely on persistence, planning, and structured tool interaction. Whether you’re building automated software agents, coding assistants, customer service bots, or task execution systems, designing for success in GPT-4.1 requires careful coordination between prompt design, system instructions, tool usage, and behavior monitoring.
6
+
7
+ This guide provides a comprehensive framework for designing effective agent workflows using GPT-4.1, detailing structural components, implementation strategies, tool invocation principles, behavioral anchors, and debugging techniques.
8
+
9
+ Each section can be reused as a design module for your own applications, while contributing to the broader library of effective agent patterns.
10
+
11
+
12
+ ## What Is an Agent Workflow?
13
+
14
+ An agent workflow is a sequence of steps managed by the model in which it:
15
+
16
+ 1. Interprets the user’s task or goal
17
+ 2. Selects and applies the right tools
18
+ 3. Iterates until the goal is fully accomplished
19
+ 4. Manages context, planning, and persistence internally
20
+ 5. Responds only after verifiable success criteria are met
21
+
22
+ This process transforms GPT-4.1 from a turn-based assistant into a semi-autonomous task manager.
23
+
24
+
25
+ ## Key Model Behaviors That Enable Agent Design
26
+
27
+ ### Literal Instruction Compliance
28
+
29
+ GPT-4.1 follows instructions with high fidelity. This includes step ordering, constraints, and termination rules. The model is more responsive to direct, formatted behavioral cues than its predecessors.
30
+
31
+ ### Persistent Multi-Turn Context Management
32
+
33
+ The model maintains internal state across extended interactions. You can program it to persist in a loop, waiting to exit only once defined conditions are met.
34
+
35
+ ### Planning and Reflection
36
+
37
+ Though not a reasoning-first model, GPT-4.1 can be prompted to externalize plans, reflect on outcomes, and improve with each iteration when prompted properly.
38
+
39
+ ### Integrated Tool Use
40
+
41
+ The `tools` parameter in the API allows the model to invoke functions directly (file inspection, patch application, database lookups, etc.) — making agentic behavior verifiable and extensible.
42
+
43
+
44
+ ## Core Agent Workflow Template
45
+
46
+ ### 🧩 System Prompt Template
47
+
48
+ ```markdown
49
+ # Agent Instructions
50
+ You are a multi-step problem-solving agent. Do not terminate until you have fully completed the assigned task.
51
+
52
+ ## Persistence
53
+ - Continue until task completion is verified.
54
+ - Do not yield to the user before the solution is complete.
55
+
56
+ ## Tool Use
57
+ - Use tools to gather information. Do not guess.
58
+ - Only proceed with actions when all necessary data is available.
59
+
60
+ ## Planning
61
+ - Before taking an action, create a plan.
62
+ - After each step, reflect on success/failure.
63
+
64
+ # Output Format
65
+ - Step number
66
+ - Action taken
67
+ - Result
68
+ - Updated plan (if any)
69
+
70
+ # Final Output
71
+ Summarize the solution, include test results if applicable.
72
+ ```
73
+
74
+ This format primes GPT-4.1 for proactive execution, tool integration, and termination control.
75
+
76
+
77
+ ## Task Archetypes: Common Agent Patterns
78
+
79
+ | Task Type | Characteristics | GPT-4.1 Design Notes |
80
+ | ------------------------ | ---------------------------------------------------------------------- | --------------------------------------------------- |
81
+ | **Code Fixing Agent** | Requires bug reproduction, patch generation, validation via tests | Use `apply_patch` tool + persistent reflection loop |
82
+ | **Data Lookup Agent** | Accesses external data via tool calls and summarizes findings | Tool use must be verified before user response |
83
+ | **Support Agent** | Answers factual queries with context validation and escalation support | Include step-by-step message plan and constraints |
84
+ | **Document Synth Agent** | Parses, filters, and summarizes from long context | Use instructions at top and bottom of prompt |
85
+
86
+
87
+ ## Designing for Persistence
88
+
89
+ Persistence is the foundation of reliable agent behavior. Without it, the model will default to single-turn chat behavior.
90
+
91
+ ### Design Pattern
92
+
93
+ ```markdown
94
+ You must NOT yield back to the user until the task is fully complete.
95
+ Check that all steps are verified. Repeat steps as needed. Only stop when all tests pass or instructions say to stop.
96
+ ```
97
+
98
+ Reinforce this message early and late in the prompt. In tests, models were 19% more likely to complete complex multi-step tasks when given persistent execution reminders.
99
+
100
+
101
+ ## Designing for Tool Use
102
+
103
+ The most reliable agents use tools for verifiable context access.
104
+
105
+ ### Tool Integration Best Practices
106
+
107
+ * Register tools in the `tools` parameter of the OpenAI API, not embedded in prompt text.
108
+ * Keep tool names simple and descriptive: `run_tests`, `apply_patch`, `lookup_invoice`.
109
+ * Provide clear descriptions and optionally list examples in a separate section.
110
+
111
+ ### Example Tool Schema
112
+
113
+ ```json
114
+ {
115
+ "name": "apply_patch",
116
+ "description": "Apply a structured diff patch to a file.",
117
+ "parameters": {
118
+ "type": "object",
119
+ "properties": {
120
+ "input": {"type": "string", "description": "The formatted patch text"}
121
+ },
122
+ "required": ["input"]
123
+ }
124
+ }
125
+ ```
126
+
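+ A minimal registration sketch follows. It assumes the OpenAI Python SDK's Responses API, which accepts a flat tool schema directly (note the added `"type": "function"` field); with Chat Completions, the same fields are nested under a `function` key instead. The user message is an illustrative placeholder.
+
+ ```python
+ from openai import OpenAI
+
+ client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
+
+ apply_patch_tool = {
+     "type": "function",
+     "name": "apply_patch",
+     "description": "Apply a structured diff patch to a file.",
+     "parameters": {
+         "type": "object",
+         "properties": {
+             "input": {"type": "string", "description": "The formatted patch text"}
+         },
+         "required": ["input"],
+     },
+ }
+
+ response = client.responses.create(
+     model="gpt-4.1",
+     instructions="You are a multi-step problem-solving agent.",  # system prompt template above
+     input="Fix the missing null check in src/module/handler.py",
+     tools=[apply_patch_tool],
+ )
+ ```
+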
127
+ ### Tool Usage Prompts
128
+
129
+ ```markdown
130
+ If you do not have enough context to proceed, pause and use the tools provided.
131
+ Do not guess about code, structure, or missing data. Always verify by tool.
132
+ ```
133
+
134
+
135
+ ## Planning and Reflection Prompts
136
+
137
+ Planning and reflection are the structural anchors of agentic reasoning.
138
+
139
+ ### Pre-Action Planning Prompt
140
+
141
+ ```markdown
142
+ Before proceeding:
143
+ 1. Restate the goal in your own words
144
+ 2. Write a short plan of how you will solve it
145
+ 3. List any tools you will need
146
+ ```
147
+
148
+ ### Post-Action Reflection Prompt
149
+
150
+ ```markdown
151
+ After taking an action:
152
+ 1. Summarize the result
153
+ 2. List any unexpected outcomes
154
+ 3. Determine if the goal is met
155
+ 4. If not, update your plan and try again
156
+ ```
157
+
158
+ These templates increase accuracy in multi-step execution by enforcing self-monitoring.
159
+
160
+
161
+ ## Full Agent Walkthrough: SWE-bench Example
162
+
163
+ ### System Prompt (Extracted from OpenAI's internal best practices)
164
+
165
+ ```markdown
166
+ You are a coding agent tasked with solving bugs in open-source software.
167
+
168
+ ## Task Requirements
169
+ - Always plan before you act
170
+ - Use tools to inspect files and apply patches
171
+ - Validate your fix with rigorous tests
172
+
173
+ ## Workflow
174
+ 1. Understand the bug deeply
175
+ 2. Investigate the codebase using tools
176
+ 3. Plan a fix
177
+ 4. Apply the fix in small, testable steps
178
+ 5. Run tests frequently
179
+ 6. Iterate until all visible and hidden tests pass
180
+ 7. Reflect and finalize
181
+
182
+ ## Rules
183
+ - Do not stop unless all tests pass
184
+ - If a fix fails, debug the failure and reattempt
185
+ - Document your process
186
+ ```
187
+
188
+ ### Tool Call: `apply_patch`
189
+
190
+ ```bash
191
+ apply_patch <<"EOF"
192
+ *** Begin Patch
193
+ *** Update File: src/module/handler.py
194
+ @@ def process_input():
195
+ - if input:
196
+ + if input is not None:
197
+ handle(input)
198
+ *** End Patch
199
+ EOF
200
+ ```
201
+
202
+ ### Tool Call: `run_tests`
203
+
204
+ ```bash
205
+ !python3 -m unittest discover
206
+ ```
207
+
208
+ ### Agent Response (Example)
209
+
210
+ ```
211
+ Step 1: Analyzed stack trace and confirmed null check missing
212
+ Step 2: Planned patch using apply_patch tool
213
+ Step 3: Applied patch successfully
214
+ Step 4: Ran tests – 1 failure in edge case
215
+ Step 5: Revised patch
216
+ Step 6: Re-ran tests – all passed
217
+ Task complete. Final patch validated.
218
+ ```
219
+
220
+
221
+ ## Failure Modes and Mitigations
222
+
223
+ | Failure Type | Symptom | Mitigation Tip |
224
+ | ------------------------------ | ------------------------------------- | -------------------------------------------- |
225
+ | Premature Turn Yield | Agent ends before solving | Add explicit no-yield rules in system prompt |
226
+ | Tool Hallucination | Tool called with invalid or null data | Instruct agent to ask for data if missing |
227
+ | No Planning or Reflection | Skips step-by-step reasoning | Add planning and reflection anchors |
228
+ | Ignoring Final Validation Step | Says task complete before verifying | Add final verification checklist to prompt |
229
+
230
+
231
+ ## Output Format Suggestions
232
+
233
+ A consistent output format improves interpretability and downstream usage.
234
+
235
+ ### Recommended Layout
236
+
237
+ ```markdown
238
+ # Task Status: In Progress
239
+
240
+ ## Current Step: Plan and Execute Fix
241
+ - Tool used: apply_patch
242
+ - Patch outcome: Success
243
+
244
+ ## Next Step
245
+ - Run full tests
246
+ - Validate output for edge cases
247
+ ```
248
+
249
+ You can steer GPT-4.1 toward consistent internal status reporting by providing a format guide in each system prompt.
250
+
251
+
252
+ ## Escalation, Recovery, and Termination
253
+
254
+ ### Escalation
255
+
256
+ Encourage the model to escalate to the user when required:
257
+
258
+ ```markdown
259
+ If more data or permissions are needed, ask the user explicitly.
260
+ If a step cannot be completed after three attempts, escalate.
261
+ ```
262
+
263
+ ### Recovery
264
+
265
+ Allow the model to acknowledge failure and retry with adjustments:
266
+
267
+ ```markdown
268
+ If your fix fails tests, reflect and revise the patch.
269
+ List new hypotheses and retry using a modified plan.
270
+ ```
271
+
272
+ ### Termination
273
+
274
+ Use clear termination rules:
275
+
276
+ ```markdown
277
+ Only end your session when:
278
+ - All tests pass
279
+ - The task is fully verified
280
+ - You have summarized your actions for the user
281
+ ```
282
+
283
+
284
+ ## Behavioral Design Tips
285
+
286
+ | Technique | Effect |
287
+ | ----------------------------- | --------------------------------------------------- |
288
+ | System prompt layering | Prioritizes stable task framing |
289
+ | Mid-prompt behavior resets | Reinforces correct tool usage after failed attempts |
290
+ | Named sections (Markdown/XML) | Improves adherence to plan and formatting |
291
+ | Soft conditionals | Encourages resilience (“If X fails, try Y…”) |
292
+
293
+
294
+ ## Designing for Developer Control
295
+
296
+ Create parameterized prompts for easier tuning and behavior adjustment.
297
+
298
+ ### Template with Parameters
299
+
300
+ ```python
301
+ def build_agent_prompt(role, task_description, format_spec, tool_names, planning=True):
+     """Assemble a reusable agent system prompt from tunable parameters."""
+     return (
+         f"# Role: {role}\n"
+         f"# Task: {task_description}\n"
+         f"# Output Format: {format_spec}\n"
+         f"# Tools: {', '.join(tool_names)}\n"
+         f"# Planning Required: {'Yes' if planning else 'No'}\n"
+     )
308
+ ```
309
+
310
+ Use this pattern to power dashboards, agent templates, and UI-driven behavior controls.
311
+
312
+
313
+ ## Testing Agent Workflows
314
+
315
+ Use evaluation harnesses to test agent performance:
316
+
317
+ * Track step completion
318
+ * Analyze tool usage logs
319
+ * Compare plan quality across variants
320
+
321
+ Key metrics (see the tracking sketch after this list):
322
+
323
+ * Task success rate
324
+ * Iteration count per completion
325
+ * Tool error frequency
326
+ * Response length and structure fidelity
327
+
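+ The sketch below is one minimal way to compute these metrics from logged agent runs; the record shape (`completed`, `turns`, per-turn `tool`/`error` fields) is an assumed logging convention, not an API.
+
+ ```python
+ from dataclasses import dataclass, field
+
+ @dataclass
+ class AgentRun:
+     """One logged agent episode: its turns and whether the task was completed."""
+     completed: bool
+     turns: list = field(default_factory=list)  # each turn: {"tool": str | None, "error": bool}
+
+ def summarize(runs):
+     completed = [r for r in runs if r.completed]
+     tool_calls = [t for r in runs for t in r.turns if t.get("tool")]
+     return {
+         "task_success_rate": len(completed) / len(runs),
+         "iterations_per_completion": sum(len(r.turns) for r in completed) / max(len(completed), 1),
+         "tool_error_frequency": sum(t["error"] for t in tool_calls) / max(len(tool_calls), 1),
+     }
+
+ # Example: one run succeeded in 3 turns, one failed after 5 turns with tool errors.
+ runs = [
+     AgentRun(completed=True, turns=[{"tool": "apply_patch", "error": False}] * 3),
+     AgentRun(completed=False, turns=[{"tool": "run_tests", "error": True}] * 5),
+ ]
+ print(summarize(runs))
+ ```
+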
328
+
329
+ ## Summary
330
+
331
+ Agent workflows in GPT-4.1 are structured, reliable, and controllable — provided the design follows consistent instruction patterns, plans for tool usage, and includes persistence logic.
332
+
333
+ Follow these principles:
334
+
335
+ * Anchor every agent with clear, literal instructions
336
+ * Use tool APIs instead of embedded tool descriptions
337
+ * Require planning and reflection around actions
338
+ * Validate every output with structured criteria
339
+
340
+ By shaping agent workflows as formal task managers, developers can build systems that reliably complete complex operations in a safe, verifiable manner.
341
+
342
+
343
+ ## Next Steps
344
+
345
+ Explore these additional modules to expand your agent capabilities:
346
+
347
+ * [`Prompting for Instruction Following`](./Prompting%20for%20Instruction%20Following.md)
348
+ * [`Long Context Strategies`](./Long%20Context.md)
349
+ * [`Tool Calling and Integration`](./Tool%20Use%20and%20Integration.md)
350
+
351
+ For more agent-ready templates, visit the `/agent_design/` directory in the main repository.
352
+
353
+
354
+ For contributions or questions, open an issue or submit a pull request to `/agent_design/Designing Agent Workflows.md`.
handling_long_contexts.md ADDED
@@ -0,0 +1,279 @@
1
+ # [Handling Long Contexts in GPT-4.1](https://chatgpt.com/canvas/shared/6825f136fb448191aadfd3cded6defe5)
2
+
3
+ ## Overview
4
+
5
+ This guide focuses on how to effectively structure, prompt, and manage long contexts when working with GPT-4.1. With support for a 1 million-token context window, GPT-4.1 unlocks new possibilities for processing, reasoning, and extracting information from large datasets and documents. However, the benefits of long context can only be realized when prompts are precisely structured and context is meaningfully prioritized. This guide outlines practical techniques for context formatting, instruction design, and reasoning support across extended input windows.
6
+
7
+ ## Objectives
8
+
9
+ * Help developers utilize the full long-context capability of GPT-4.1
10
+ * Mitigate degradation in response quality due to token overflow or disorganized input
11
+ * Establish formatting and reasoning best practices that align with OpenAI’s tested strategies
12
+ * Enable document processing, re-ranking, retrieval, and multi-hop reasoning across long inputs
13
+
14
+
15
+ ## 1. Understanding Long Context Use Cases
16
+
17
+ GPT-4.1 is capable of processing up to 1 million tokens of input, making it suitable for a wide range of applications including:
18
+
19
+ * **Structured Document Parsing**: Legal documents, scientific papers, contracts, etc.
20
+ * **Retrieval-Augmented Generation (RAG)**: Combining long contexts with internal and external tools
21
+ * **Knowledge Graph Construction**: Extracting structured relationships from unstructured data
22
+ * **Log and Trace Analysis**: Reviewing extended server logs or output sequences
23
+ * **Multi-hop Reasoning**: Synthesizing answers from distributed pieces of information
24
+
25
+ While the model can technically parse vast inputs, developers must implement strategies to avoid cognitive overload and focus model attention effectively.
26
+
27
+
28
+ ## 2. Context Organization Principles
29
+
30
+ ### 2.1 Optimal Instruction Placement
31
+
32
+ OpenAI’s internal experiments found that the **positioning of prompt instructions significantly affects model performance**. Key guidelines include:
33
+
34
+ * **Dual placement**: Repeat key instructions at both the beginning and end of the prompt.
35
+ * **Top-loading**: If instructions are only placed once, placing them at the beginning is more effective than the end.
36
+ * **Segmented framing**: Use sectional titles to clearly mark transitions.
37
+
38
+ ### 2.2 Delimiter Selection
39
+
40
+ To help the model parse structure in large blocks of text, delimiters must be used consistently and appropriately:
41
+
42
+ | Delimiter Format | Description | Use Case |
43
+ | ------------------------------ | ---------------------------- | ------------------------------------- |
44
+ | **Markdown (`#`, `##`, `-`)** | Clean sectioning, readable | General-purpose long context parsing |
45
+ | **XML (`<doc>`, `<section>`)** | Best for document modeling | Structured multi-document input |
46
+ | **Inline backticks** | For code, queries, and data | Code and SQL parsing, tool parameters |
47
+ | **Avoid JSON** | Inefficient parsing at scale | Do not use for >10K token lists |
48
+
49
+ Markdown and XML structures yield better attention modeling across long contexts, while JSON often introduces parsing inefficiencies beyond a few thousand tokens.
50
+
51
+
52
+ ## 3. Strategies for Long-Context Prompting
53
+
54
+ ### 3.1 Context-Aware Instruction Design
55
+
56
+ When dealing with large input windows, standard prompt formats must evolve. Use detailed scaffolds that define model behavior across each phase:
57
+
58
+ ```markdown
59
+ # Instructions
60
+ - Use only the documents provided.
61
+ - Focus on relevance before synthesis.
62
+ - Reflect after each major section.
63
+
64
+ # Reasoning Strategy
65
+ 1. Read and segment.
66
+ 2. Rank relevance.
67
+ 3. Synthesize step-by-step.
68
+
69
+ # Final Reminder
70
+ Adhere strictly to section boundaries and reason incrementally.
71
+ ```
72
+
73
+ ### 3.2 Step-by-Step Processing with Summarization
74
+
75
+ Break the long input into **logical checkpoints**. After each checkpoint:
76
+
77
+ * Summarize progress.
78
+ * List open questions.
79
+ * Forecast the next reasoning step.
80
+
81
+ This promotes internal alignment without hardcoding logic into tool calls.
82
+
83
+ **Example Prompt Snippet:**
84
+
85
+ ```text
86
+ After reading the next 5,000 tokens, summarize key entities mentioned and note unresolved questions. Then continue.
87
+ ```
88
+
89
+
90
+ ## 4. Long-Context Reasoning Patterns
91
+
92
+ ### 4.1 Needle-in-a-Haystack Retrieval
93
+
94
+ GPT-4.1 performs reliably at locating information embedded deep within large corpora. Best practices for precision include:
95
+
96
+ * **Unique section headers** to guide location memory
97
+ * **Explicit re-ranking instructions** after initial search
98
+ * **Preliminary entity listing** to establish anchors
99
+
100
+ ### 4.2 Document Relevance Rating
101
+
102
+ When feeding dozens or hundreds of documents into the model, instruct it to:
103
+
104
+ 1. Score each document based on a relevance scale
105
+ 2. Justify the score with reference to query terms
106
+ 3. Select only medium/high relevance docs for synthesis
107
+
108
+ **Example Snippet:**
109
+
110
+ ```text
111
+ Rate each doc on relevance to the query [high, medium, low]. Provide one sentence justification per doc. Use only high/medium docs in the final answer.
112
+ ```
113
+
114
+ ### 4.3 Multi-Hop Document Synthesis
115
+
116
+ For complex queries requiring synthesis from several different inputs:
117
+
118
+ * Start by identifying all possibly relevant documents
119
+ * Extract one-sentence summaries from each
120
+ * Weigh the evidence to converge on an answer
121
+
122
+ This scaffolds model behavior in a transparent and verifiable way.
123
+
124
+
125
+ ## 5. Managing Instructional Interference
126
+
127
+ As context grows, the risk increases that initial instructions will be forgotten or overridden. To address this:
128
+
129
+ * **Insert refresher instructions at each major context segment**
130
+ * **Bold or delimit** instructional snippets to create visual attention anchors
131
+ * **Use hierarchical structure**: Title → Sub-section → Instruction → Content
132
+
133
+ Example:
134
+
135
+ ```markdown
136
+ ## Part 3: Analyze Error Logs
137
+ **Reminder:** Focus only on logs mentioning `TimeoutError`. Ignore unrelated traces.
138
+ ```
139
+
140
+
141
+ ## 6. Failure Modes and Fixes
142
+
143
+ ### 6.1 Early Context Drift
144
+
145
+ **Symptom:** The model misinterprets a query due to overemphasis on the early documents.
146
+
147
+ **Solution:** Insert a midway reflection point:
148
+
149
+ ```text
150
+ Pause and verify: Are we still on track based on the original query?
151
+ ```
152
+
153
+ ### 6.2 Instruction Overload
154
+
155
+ **Symptom:** Model ignores or selectively follows prompt instructions.
156
+
157
+ **Solution:** Simplify instruction blocks. Group similar guidance. Use numbered checklists.
158
+
159
+ ### 6.3 Latency and Token Limitations
160
+
161
+ **Symptom:** Prompting becomes slow or the output is truncated.
162
+
163
+ **Solution:**
164
+
165
+ * Shorten low-salience sections.
166
+ * Summarize documents before passing into prompt.
167
+ * Use a retrieval step to filter top-k relevant items.
168
+
169
+
170
+ ## 7. Formatting Techniques for Long Contexts
171
+
172
+ ### 7.1 Title-ID Pairing
173
+
174
+ Helpful in multi-document prompts.
175
+
176
+ ```text
177
+ ID: 001 | TITLE: Terms of Use | CONTENT: The user agrees to...
178
+ ```
179
+
180
+ This increases model ability to re-reference sections.
181
+
182
+ ### 7.2 XML Embedding for Hierarchical Structure
183
+
184
+ ```xml
185
+ <doc id="34" title="Security Policy">
186
+ <summary>Contains threat classifications and countermeasures</summary>
187
+ <content>...</content>
188
+ </doc>
189
+ ```
190
+
191
+ This formatting supports multi-pass parsing and structured memory.
192
+
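+ A small helper can emit that structure from an in-memory document list; the field names (`id`, `title`, `summary`, `content`) mirror the example above and are otherwise arbitrary.
+
+ ```python
+ from xml.sax.saxutils import escape
+
+ def to_xml_block(doc: dict) -> str:
+     """Render one document as the <doc> structure shown above."""
+     attr = {'"': "&quot;"}  # also escape quotes inside attribute values
+     return (
+         f'<doc id="{escape(str(doc["id"]), attr)}" title="{escape(doc["title"], attr)}">\n'
+         f'  <summary>{escape(doc["summary"])}</summary>\n'
+         f'  <content>{escape(doc["content"])}</content>\n'
+         f'</doc>'
+     )
+
+ docs = [
+     {"id": 34, "title": "Security Policy", "summary": "Threat classes and countermeasures", "content": "..."},
+ ]
+ context_block = "\n\n".join(to_xml_block(d) for d in docs)
+ ```
+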
193
+
194
+ ## 8. Alignment Between Internal and External Knowledge
195
+
196
+ In long-context tasks, decisions must be made about how much to rely on provided context vs. internal knowledge.
197
+
198
+ **Guideline Matrix:**
199
+
200
+ | Mode | Model Should... |
201
+ | ---------------- | ------------------------------------------------------------------- |
202
+ | Strict Retrieval | Only use external documents. If unsure, say "Not enough info." |
203
+ | Hybrid Mode | Use context first, but fill in with internal knowledge when needed. |
204
+ | Pure Generation | Use own knowledge; ignore prompt context. |
205
+
206
+ When prompting, make mode explicit:
207
+
208
+ ```text
209
+ Use only the following context. If insufficient, reply: "Insufficient data."
210
+ ```
211
+
212
+
213
+ ## 9. Tools and Token Budgeting
214
+
215
+ ### 9.1 Token Allocation Strategy
216
+
217
+ When constructing long prompts, divide tokens based on relevance and priority:
218
+
219
+ | Section | Suggested Max Tokens | Notes |
220
+ | --------------------- | -------------------- | --------------------------------------- |
221
+ | Instructions | 1,000 | Include high-priority guidance twice |
222
+ | Context Documents | 900,000 | Use title delimiters, sort by relevance |
223
+ | Task-Specific Prompts | 50,000 | Include reasoning strategy scaffolds |
224
+
225
+ Prioritize content by query salience and clarity.
226
+
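+ One way to enforce such a budget is to measure each section before assembling the prompt. The sketch below assumes `tiktoken`'s `o200k_base` encoding as an approximation of GPT-4.1's tokenizer; the budget numbers mirror the table above.
+
+ ```python
+ import tiktoken
+
+ enc = tiktoken.get_encoding("o200k_base")  # assumed approximation of GPT-4.1 tokenization
+
+ BUDGET = {"instructions": 1_000, "documents": 900_000, "task_prompts": 50_000}
+
+ def truncate_to_budget(text: str, max_tokens: int) -> str:
+     """Clip a prompt section to its token allocation, keeping the leading (highest-priority) content."""
+     tokens = enc.encode(text)
+     return text if len(tokens) <= max_tokens else enc.decode(tokens[:max_tokens])
+
+ raw_documents = "ID: 001 | TITLE: Terms of Use | CONTENT: ..."  # placeholder document block
+ documents_block = truncate_to_budget(raw_documents, BUDGET["documents"])
+ ```
+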
227
+ ### 9.2 Intermediate Tool Use
228
+
229
+ Encourage the model to use tools mid-way:
230
+
231
+ * Re-rank document clusters
232
+ * Extract named entities
233
+ * Visualize flow or graph relationships
234
+
235
+ Encouraging this tool interaction creates checkpoints and avoids reasoning drift.
236
+
237
+
238
+ ## 10. Testing and Evaluation
239
+
240
+ When evaluating prompt effectiveness in long-context scenarios:
241
+
242
+ * Measure correctness, latency, and coverage
243
+ * Track hallucination and false-positive rates
244
+ * Use automated evals with known answer corpora
245
+
246
+ ### Recommended Metrics:
247
+
248
+ * Precision@k for retrieval
249
+ * Response coherence score (human or model-rated)
250
+ * Instruction adherence rate
251
+
252
+ Incorporate feedback loops to update prompts based on failure analysis.
253
+
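+ For the retrieval metric, a plain implementation of Precision@k is enough to start; it assumes you have a set of known-relevant document IDs per query.
+
+ ```python
+ def precision_at_k(retrieved_ids: list, relevant_ids: set, k: int) -> float:
+     """Fraction of the top-k retrieved documents that are actually relevant."""
+     top_k = retrieved_ids[:k]
+     if not top_k:
+         return 0.0
+     return sum(1 for doc_id in top_k if doc_id in relevant_ids) / len(top_k)
+
+ # Example: 2 of the top 3 retrieved documents appear in the relevant set.
+ print(precision_at_k(["d4", "d9", "d2"], {"d4", "d2", "d7"}, k=3))  # ~0.67
+ ```
+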
254
+
255
+ ## 11. Summary and Best Practices
256
+
257
+ | Principle | Best Practice |
258
+ | --------------------- | --------------------------------- |
259
+ | Instruction Placement | Use top and bottom |
260
+ | Context Segmentation | Insert checkpoints, summaries |
261
+ | Delimiters | Prefer Markdown/XML over JSON |
262
+ | Tool Usage | Mid-task tool calls preferred |
263
+ | Evaluation | Test adherence, accuracy, latency |
264
+
265
+ Effective long-context prompting is not about more data—it’s about better structure, thoughtful pacing, and precision anchoring.
266
+
267
+
268
+ ## Final Notes
269
+
270
+ GPT-4.1’s long-context capabilities can power a new generation of document-heavy applications. However, successful deployment requires more than dropping text into a prompt. It requires:
271
+
272
+ * Clear segment boundaries
273
+ * Frequent alignment checkpoints
274
+ * Purpose-driven formatting
275
+ * Strategic memory reinforcement
276
+
277
+ With these principles in place, the model not only reads—it understands.
278
+
279
+ Begin with structure. Sustain with clarity. Close with alignment.
prompt_engineering_guide.md ADDED
@@ -0,0 +1,298 @@
1
+ # [Prompt Engineering Reference Guide for GPT-4.1](https://chatgpt.com/canvas/shared/6825f88d7170819180b56e101e8b9d31)
2
+
3
+ ## Overview
4
+
5
+ This reference guide consolidates OpenAI’s latest findings and recommendations for effective prompt engineering with GPT-4.1. It is designed for developers, researchers, and applied AI engineers who seek reliable, reproducible results from GPT-4.1 in both experimental and production settings. The techniques presented here are rooted in empirical validation across use cases ranging from agent workflows to structured tool integration, long-context processing, and instruction-following optimization.
6
+
7
+ This document emphasizes concrete prompt patterns, scaffolding techniques, and deployment-tested prompt modularity.
8
+
9
+
10
+ ## Key Prompting Concepts
11
+
12
+ ### 1. Instruction Literalism
13
+
14
+ GPT-4.1 follows instructions **more precisely** than its predecessors. Developers should:
15
+
16
+ * Avoid vague or underspecified prompts
17
+ * Be explicit about desired behaviors, output formats, and prohibitions
18
+ * Expect literal compliance with phrasing, including negations and scope restrictions
19
+
20
+ ### 2. Planning Induction
21
+
22
+ GPT-4.1 does not natively plan before answering but can be prompted to simulate step-by-step reasoning.
23
+
24
+ **Template:**
25
+
26
+ ```text
27
+ Think carefully step by step. Break down the task into manageable parts. Then begin.
28
+ ```
29
+
30
+ Planning prompts should be framed before actions and reinforced between reasoning phases.
31
+
32
+ ### 3. Agentic Harnessing
33
+
34
+ Use GPT-4.1’s enhanced persistence and tool adherence by specifying three types of reminders:
35
+
36
+ * **Persistence**: “Keep working until the problem is fully resolved.”
37
+ * **Tool usage**: “Use available tools to inspect files—do not guess.”
38
+ * **Planning enforcement**: “Plan and reflect before and after every function call.”
39
+
40
+ These drastically increase the model’s task completion rate when integrated at the top of a system prompt.
41
+
42
+
43
+ ## Prompt Structure Blueprint
44
+
45
+ A recommended modular scaffold:
46
+
47
+ ```markdown
48
+ # Role and Objective
49
+ You are a [role] tasked with [goal].
50
+
51
+ # Instructions
52
+ - Bullet-point rules or constraints
53
+ - Output format expectations
54
+ - Prohibited topics or phrasing
55
+
56
+ # Workflow (Optional)
57
+ 1. Step-by-step plan
58
+ 2. Reflection checkpoints
59
+ 3. Tool interaction order
60
+
61
+ # Reasoning Strategy (Optional)
62
+ Describes how the model should analyze input or context before generating output.
63
+
64
+ # Output Format
65
+ JSON, Markdown, YAML, or prose specification
66
+
67
+ # Examples (Optional)
68
+ Demonstrates expected input/output behavior
69
+ ```
70
+
71
+ This format increases predictability and flexibility during live prompt debugging and iteration.
72
+
73
+
74
+ ## Long-Context Prompting
75
+
76
+ GPT-4.1 supports up to **1M token inputs**, enabling:
77
+
78
+ * Multi-document ingestion
79
+ * Codebase-wide searches
80
+ * Contextual re-ranking and synthesis
81
+
82
+ ### Strategies:
83
+
84
+ * **Repeat instructions at top and bottom**
85
+ * **Use markdown/XML tags** for structure
86
+ * **Insert reasoning checkpoints every 5–10k tokens**
87
+ * **Avoid JSON for large document embedding**
88
+
89
+ **Effective Delimiters:**
90
+
91
+ | Format | Use Case |
92
+ | -------- | ------------------------------------ |
93
+ | Markdown | General sectioning |
94
+ | XML | Hierarchical document parsing |
95
+ | Title/ID | Multi-document input structuring |
96
+ | JSON | Code/tool tasks only; avoid for text |
97
+
98
+
99
+ ## Tool-Calling Integration
100
+
101
+ ### Schema-Based Tool Usage
102
+
103
+ Define tools in the OpenAI `tools` field, not inline. Provide:
104
+
105
+ * **Name** (clear and descriptive)
106
+ * **Parameters** (structured JSON)
107
+ * **Usage examples** (in `# Examples` section, not in `description`)
108
+
109
+ **Tool Example:**
110
+
111
+ ```json
112
+ {
113
+ "name": "get_user_info",
114
+ "description": "Fetches user details from the database",
115
+ "parameters": {
116
+ "type": "object",
117
+ "properties": {
118
+ "user_id": { "type": "string" }
119
+ },
120
+ "required": ["user_id"]
121
+ }
122
+ }
123
+ ```
124
+
125
+ ### Prompt Reinforcement:
126
+
127
+ ```markdown
128
+ # Tool Instructions
129
+ - Use tools before answering factual queries
130
+ - If info is missing, request input from user
131
+ ```
132
+
133
+ ### Failure Mitigation:
134
+
135
+ | Issue | Fix |
136
+ | ------------------ | ------------------------------------------- |
137
+ | Null tool calls | Prompt: “Ask for missing info if needed” |
138
+ | Over-calling tools | Add reasoning delay + post-call reflection |
139
+ | Missed call | Add output format block and trigger keyword |
140
+
141
+
142
+ ## Instruction Following Optimization
143
+
144
+ GPT-4.1 is optimized for literal and structured instruction parsing. Improve reliability with:
145
+
146
+ ### Multi-Tiered Rules
147
+
148
+ Use layers:
149
+
150
+ * `# Instructions`: High-level
151
+ * `## Response Style`: Format and tone
152
+ * `## Error Handling`: Edge case mitigation
153
+
154
+ ### Ordered Workflows
155
+
156
+ Use numbered sequences to enforce step-by-step logic.
157
+
158
+ **Prompt Snippet:**
159
+
160
+ ```markdown
161
+ # Instructions
162
+ - Greet the user
163
+ - Request missing parameters
164
+ - Avoid repeating exact phrasing
165
+ - Escalate on request
166
+
167
+ # Workflow
168
+ 1. Confirm intent
169
+ 2. Call tool
170
+ 3. Reflect
171
+ 4. Respond
172
+ ```
173
+
174
+
175
+ ## Chain-of-Thought Prompting (CoT)
176
+
177
+ Chain-of-thought induces linear reasoning. Works best for:
178
+
179
+ * Logic puzzles
180
+ * Multi-hop QA
181
+ * Comparative analysis
182
+
183
+ **CoT Example:**
184
+
185
+ ```text
186
+ Let’s think through this. First, identify what the question is asking. Then examine context. Finally, synthesize an answer.
187
+ ```
188
+
189
+ **Advanced Prompt (Modular):**
190
+
191
+ ```markdown
192
+ # Reasoning Strategy
193
+ 1. Query analysis
194
+ 2. Context selection
195
+ 3. Evidence synthesis
196
+
197
+ # Final Instruction
198
+ Think step by step using the strategy above.
199
+ ```
200
+
201
+
202
+ ## Failure Modes and Fixes
203
+
204
+ | Problem | Mitigation |
205
+ | ------------------ | -------------------------------------------------------------- |
206
+ | Tool hallucination | Require tool call block, validate schema |
207
+ | Early termination | Add: "Do not yield until goal achieved." |
208
+ | Verbose repetition | Add paraphrasing constraint and variation list |
209
+ | Overcompliance | If model follows a sample phrase verbatim, instruct to vary it |
210
+
211
+
212
+ ## Evaluation Strategy
213
+
214
+ Prompt effectiveness should be evaluated across:
215
+
216
+ * **Instruction adherence**
217
+ * **Tool utilization accuracy**
218
+ * **Reasoning coherence**
219
+ * **Failure mode frequency**
220
+ * **Latency and cost tradeoffs**
221
+
222
+ ### Recommended Methodology:
223
+
224
+ * Create a test suite with edge-case prompts
225
+ * Log errors and model divergence cases
226
+ * Use eval tags (`# Eval:`) in prompt for meta-analysis
227
+
228
+
229
+ ## Delimiter Comparison Table
230
+
231
+ | Delimiter Type | Format Example     | GPT-4.1 Performance             |
+ | -------------- | ------------------ | ------------------------------- |
+ | Markdown       | `## Section Title` | Excellent                       |
+ | XML            | `<doc>` tags       | Excellent                       |
+ | JSON           | `{"text": "..."}`  | High (in code), Poor (in prose) |
+ | Pipe-delimited | `TITLE \| CONTENT` | Moderate                        |
237
+
238
+ ### Best Practice:
239
+
240
+ Use Markdown or XML for general structure; JSON for code/tools only.
241
+
242
+
243
+ ## Example: Prompt Debugging Workflow
244
+
245
+ ### Step 1: Identify Goal
246
+
247
+ E.g., summarizing medical trial documents with context weighting.
248
+
249
+ ### Step 2: Draft Prompt Template
250
+
251
+ ```markdown
252
+ # Objective
253
+ Summarize each trial based on outcome clarity and trial scale.
254
+
255
+ # Workflow
256
+ 1. Parse hypothesis/result
257
+ 2. Score for clarity
258
+ 3. Output structured summary
259
+
260
+ # Output Format
261
+ {"trial_id": ..., "clarity_score": ..., "summary": ...}
262
+ ```
263
+
264
+ ### Step 3: Insert Sample
265
+
266
+ ```json
267
+ {"trial_id": "T01", "clarity_score": 8, "summary": "Well-documented results..."}
268
+ ```
269
+
270
+ ### Step 4: Validate Output
271
+
272
+ Ensure model adheres to output format, logic, and reasoning instructions.
273
+
274
+
275
+ ## Summary: Prompt Engineering Heuristics
276
+
277
+ | Technique | When to Use |
278
+ | -------------------------- | ----------------------------------- |
279
+ | Instruction Bullets | All prompts |
280
+ | Chain-of-Thought | Any task requiring logic or steps |
281
+ | Workflow Lists | Multiphase reasoning tasks |
282
+ | Tool Block | Any prompt using API/tool calls |
283
+ | Reflection Reminders | Long context, debugging, validation |
284
+ | Dual Instruction Placement | Long documents (>100K tokens) |
285
+
286
+
287
+ ## Final Notes
288
+
289
+ Prompt engineering is empirical, not theoretical. Every use case is different. To engineer effectively with GPT-4.1:
290
+
291
+ * Maintain modular, versioned prompt templates
292
+ * Use structured instructions and output formats
293
+ * Enforce explicit planning and tool behavior
294
+ * Iterate prompts based on logs and evals
295
+
296
+ **Start simple. Add structure. Evaluate constantly.**
297
+
298
+ This guide is designed to be expanded. Use it as your baseline and evolve it as your systems scale.
prompting_for_instruction_following.md ADDED
@@ -0,0 +1,293 @@
1
+ # [Prompting for Instruction Following](https://chatgpt.com/canvas/shared/6825ebe022148191bceb9fa5473a34eb)
2
+
3
+ ## Overview
4
+
5
+ GPT-4.1 represents a significant shift in how developers should structure prompts for reliable, deterministic, and consistent behavior. Unlike earlier models which often inferred intent liberally, GPT-4.1 adheres to instructions in a far more literal, detail-sensitive manner. This brings both increased control and greater responsibility for developers: well-designed prompts yield exceptional results, while ambiguous or conflicting instructions may result in brittle or unexpected behavior.
6
+
7
+ This guide outlines best practices, real-world examples, and design patterns to fully utilize GPT-4.1’s instruction-following improvements across a variety of applications. It is structured to help you:
8
+
9
+ * Understand GPT-4.1’s instruction handling behavior
10
+ * Design high-integrity prompt scaffolds
11
+ * Debug prompt failures and mitigate ambiguity
12
+ * Align instructions with OpenAI’s guidance around tool usage, task persistence, and planning
13
+
14
+ This file is designed to stand alone for practical use and is fully aligned with the broader `openai-cookbook-pro` repository.
15
+
16
+
17
+ ## Why Instruction-Following Matters
18
+
19
+ Instruction following is central to:
20
+
21
+ * **Agent behavior**: models acting in multi-step environments must reliably interpret commands
22
+ * **Tool use**: execution hinges on clearly-defined tool invocation criteria
23
+ * **Support workflows**: factual grounding depends on accurate boundary adherence
24
+ * **Security and safety**: systems must not misinterpret prohibitions or fail to enforce policy constraints
25
+
26
+ With GPT-4.1’s shift toward literal interpretation, instruction scaffolding becomes the primary control interface.
27
+
28
+
29
+ ## GPT-4.1 Instruction Characteristics
30
+
31
+ ### 1. **Literal Compliance**
32
+
33
+ GPT-4.1 follows instructions with minimal assumption. If a step is missing or unclear, the model is less likely to “fill in” or guess the user’s intent.
34
+
35
+ * **Previous behavior**: interpreted vague prompts broadly
36
+ * **Current behavior**: waits for or requests clarification
37
+
38
+ This improves safety and traceability but also increases fragility in loosely written prompts.
39
+
40
+ ### 2. **Order-Sensitive Resolution**
41
+
42
+ When instructions conflict, GPT-4.1 favors those listed **last** in the prompt. This means developers should order rules hierarchically:
43
+
44
+ * General rules go early
45
+ * Specific overrides go later
46
+
47
+ Example:
48
+
49
+ ```markdown
50
+ # Instructions
51
+ - Do not guess if unsure
52
+ - Use your knowledge if a tool isn’t available
53
+ - If both options are available, prefer the tool
54
+ ```
55
+
56
+ ### 3. **Format-Aware Behavior**
57
+
58
+ GPT-4.1 performs better with clearly formatted instructions. Prefer structured formats:
59
+
60
+ * Markdown with headers and lists
61
+ * XML with nested tags
62
+ * Structured sections like `# Steps`, `# Output Format`
63
+
64
+ Poorly formatted, unsegmented prompts lead to instruction bleed and undesired merging of behaviors.
65
+
66
+
67
+ ## Recommended Prompt Structure
68
+
69
+ Organize your prompt using a structure that mirrors OpenAI’s internal evaluation standards.
70
+
71
+ ### 📁 Standard Sections
72
+
73
+ ```markdown
74
+ # Role and Objective
75
+ # Instructions
76
+ ## Sub-categories for Specific Behavior
77
+ # Workflow Steps (Optional)
78
+ # Output Format
79
+ # Examples (Optional)
80
+ # Final Reminder
81
+ ```
82
+
83
+ ### Example Prompt Template
84
+
85
+ ```markdown
86
+ # Role and Objective
87
+ You are a customer service assistant. Your job is to resolve user issues efficiently, using tools when needed.
88
+
89
+ # Instructions
90
+ - Greet the user politely.
91
+ - Use a tool before answering any account-related question.
92
+ - If unsure how to proceed, ask the user for clarification.
93
+ - If a user requests escalation, refer them to a human agent.
94
+
95
+ ## Output Format
96
+ - Always use a friendly tone.
97
+ - Format your answer in plain text.
98
+ - Include a summary at the end of your response.
99
+
100
+ ## Final Reminder
101
+ Do not rely on prior knowledge. Use provided tools and context only.
102
+ ```
103
+
104
+
105
+ ## Instruction Categories
106
+
107
+ ### 1. **Task Definition**
108
+
109
+ Clearly state the model’s job in the opening lines. Be explicit:
110
+
111
+ ✅ “You are an assistant that reviews and edits legal contracts.”
112
+
113
+ 🚫 “Help with contracts.”
114
+
115
+ ### 2. **Behavioral Constraints**
116
+
117
+ List what the model must or must not do:
118
+
119
+ * Must call tools before responding to factual queries
120
+ * Must ask for clarification if user input is incomplete
121
+ * Must not provide financial or legal advice
122
+
123
+ ### 3. **Response Style**
124
+
125
+ Define tone, length, formality, and structure.
126
+
127
+ * “Keep responses under 250 words.”
128
+ * “Avoid lists unless asked.”
129
+ * “Use a neutral tone.”
130
+
131
+ ### 4. **Tool Use Protocols**
132
+
133
+ Models often hallucinate tools unless guided:
134
+
135
+ * “If you don’t have enough information to use a tool, ask the user for more.”
136
+ * “Always confirm tool usage before responding.”
137
+
138
+
139
+ ## Debugging Instruction Failures
140
+
141
+ Instruction-following failures often stem from the following:
142
+
143
+ ### Common Causes
144
+
145
+ * Ambiguous rule phrasing
146
+ * Conflicting instructions (e.g., both asking to guess and not guess)
147
+ * Implicit behaviors expected, not stated
148
+ * Overloaded instructions without formatting
149
+
150
+ ### Diagnosis Steps
151
+
152
+ 1. Read the full prompt in sequence
153
+ 2. Identify potential ambiguity
154
+ 3. Reorder to clarify precedence
155
+ 4. Break complex rules into atomic steps
156
+ 5. Test with structured evals
157
+
158
+
159
+ ## Instruction Layering: The 3-Tier Model
160
+
161
+ When designing prompts for multi-step tasks, layer your instructions in tiers:
162
+
163
+ | Tier | Layer Purpose | Example |
164
+ | ---- | --------------------------- | ------------------------------------------ |
165
+ | 1 | Role Declaration | “You are an assistant for legal tasks.” |
166
+ | 2 | Global Behavior Constraints | “Always cite sources.” |
167
+ | 3 | Task-Specific Instructions | “In contracts, highlight ambiguous terms.” |
168
+
169
+ Each layer helps disambiguate behavior and provides a fallback structure if downstream instructions fail.
170
+
171
+
172
+ ## Long Context Instruction Handling
173
+
174
+ In prompts exceeding 50,000 tokens:
175
+
176
+ * Place **key instructions** both **before and after** the context.
177
+ * Use format anchors (`# Instructions`, `<rules>`) to signal boundaries.
178
+ * Avoid relying solely on the top-of-prompt instructions.
179
+
180
+ GPT-4.1 is trained to respect these placements, especially when consistent structure is maintained.
181
+
182
+
183
+ ## Literal vs. Flexible Models
184
+
185
+ | Capability | GPT-3.5 / GPT-4-turbo | GPT-4.1 |
186
+ | ---------------------- | --------------------- | --------------- |
187
+ | Implicit inference | High | Low |
188
+ | Literal compliance | Moderate | High |
189
+ | Prompt flexibility | Higher tolerance | Lower tolerance |
190
+ | Instruction debug cost | Lower | Higher |
191
+
192
+ GPT-4.1 performs better **when prompts are precise**. Treat prompt engineering as API design — clear, testable, and version-controlled.
193
+
194
+
195
+ ## Tips for Designing Instruction-Sensitive Prompts
196
+
197
+ ### ✔️ DO:
198
+
199
+ * Use structured formatting
200
+ * Scope behaviors into separate bullet points
201
+ * Use examples to anchor expected output
202
+ * Rewrite ambiguous instructions into atomic steps
203
+ * Add conditionals explicitly (e.g., “if X, then Y”)
204
+
205
+ ### ❌ DON’T:
206
+
207
+ * Assume the model will “understand what you meant”
208
+ * Use overloaded sentences with multiple actions
209
+ * Rely on invisible or implied rules
210
+ * Assume formatting styles (e.g., bullets) are optional
211
+
212
+
213
+ ## Example: Instruction-Controlled Code Agent
214
+
215
+ ```markdown
216
+ # Objective
217
+ You are a code assistant that fixes bugs in open-source projects.
218
+
219
+ # Instructions
220
+ - Always use the tools provided to inspect code.
221
+ - Do not make edits unless you have confirmed the bug’s root cause.
222
+ - If a change is proposed, validate using tests.
223
+ - Do not respond unless the patch is applied.
224
+
225
+ ## Output Format
226
+ 1. Description of bug
227
+ 2. Explanation of root cause
228
+ 3. Tool output (e.g., patch result)
229
+ 4. Confirmation message
230
+
231
+ ## Final Note
232
+ Do not guess. If you are unsure, use tools or ask.
233
+ ```
234
+
235
+ > For a complete walkthrough, see `/examples/code-agent-instructions.md`
236
+
237
+
238
+ ## Instruction Evolution Across Iterations
239
+
240
+ As your prompts grow, preserve instruction integrity using:
241
+
242
+ * Versioned templates
243
+ * Structured diffs for instruction edits
244
+ * Commented rules for traceability
245
+
246
+ Example diff:
247
+
248
+ ```diff
249
+ - Always answer user questions.
250
+ + Only answer user questions after validating tool output.
251
+ ```
252
+
253
+ Maintain a changelog for prompts as you would with source code. This ensures instructional integrity during collaborative development.
254
+
255
+
256
+ ## Testing and Evaluation
257
+
258
+ Prompt engineering is empirical. Validate instruction design using:
259
+
260
+ * **A/B tests**: Compare variants with and without behavioral scaffolds
261
+ * **Prompt evals**: Use deterministic queries to test edge case behavior
262
+ * **Behavioral matrices**: Track compliance with instruction categories
263
+
264
+ Example matrix:
265
+
266
+ | Instruction Category | Prompt A Pass | Prompt B Pass |
267
+ | -------------------- | ------------- | ------------- |
268
+ | Ask if unsure | ✅ | ❌ |
269
+ | Use tools first | ✅ | ✅ |
270
+ | Avoid sensitive data | ❌ | ✅ |
271
+
272
+
273
+ ## Final Reminders
274
+
275
+ GPT-4.1 is exceptionally effective **when paired with well-structured, comprehensive instructions**. Follow these principles:
276
+
277
+ * Instructions should be modular and auditable.
278
+ * Avoid unnecessary repetition, but reinforce critical rules.
279
+ * Use formatting styles that clearly separate content.
280
+ * Assume literalism — write prompts as if programming a function, not chatting with a person.
281
+
282
+ Every prompt is a contract. GPT-4.1 honors that contract, but only if written clearly.
283
+
284
+
285
+ ## See Also
286
+
287
+ * [`Agent Workflows`](../agent_design/swe_bench_agent.md)
288
+ * [`Prompt Format Reference`](../reference/prompting_guide.md)
289
+ * [`Long Context Strategies`](../examples/long-context-formatting.md)
290
+ * [`OpenAI 4.1 Prompting Guide`](https://platform.openai.com/docs/guides/prompting)
291
+
292
+
293
+ For questions, suggestions, or prompt design contributions, submit a pull request to `/examples/instruction-following.md` or open an issue in the main repo.
real_world_deployment.md ADDED
@@ -0,0 +1,282 @@
1
+ # [Real-World Deployment Scenarios for GPT-4.1](https://chatgpt.com/canvas/shared/6825f3194b888191ae2417991002dcbd)
2
+
3
+ ## Overview
4
+
5
+ This guide provides implementation-ready strategies for deploying GPT-4.1 in real-world systems. It outlines robust practices for integrating the model across diverse operational environments—from customer support automation to software development pipelines—while leveraging OpenAI's guidance on agentic workflows, instruction adherence, and tool integration.
6
+
7
+ The focus is on reliability, agent autonomy, and system-level alignment for production use. This document emphasizes scenario-based implementation blueprints, including prompt structure, tool configuration, risk mitigation, and iterative deployment cycles.
8
+
9
+
10
+ ## Objectives
11
+
12
+ * Showcase tested deployment architectures for GPT-4.1 in applied domains
13
+ * Illustrate structured prompting strategies aligned with OpenAI's latest harness recommendations
14
+ * Codify best practices for tool integration, planning induction, and agent persistence
15
+ * Support enterprise-grade use through modular scenario blueprints
16
+
17
+
18
+ ## Deployment Pattern 1: Customer Service Agent
19
+
20
+ ### Use Case
21
+
22
+ Deploy GPT-4.1 as a first-line support agent capable of greeting users, answering account-related questions, handling tool lookups, and escalating edge cases.
23
+
24
+ ### Prompt Structure
25
+
26
+ ```markdown
27
+ # Role
28
+ You are a helpful customer service assistant for NewTelco.
29
+
30
+ # Instructions
31
+ - Always greet the user.
32
+ - Call tools before answering factual queries.
33
+ - Never rely on internal knowledge for billing/account issues.
34
+ - Ask for missing parameters if insufficient input.
35
+ - Vary phrasing to avoid repetition.
36
+ - Always escalate when asked.
37
+ - Prohibited topics: [List Redacted].
38
+
39
+ # Sample Interaction
40
+ ## User: Can I get my last bill?
41
+ ## Assistant:
42
+ Hi, you've reached NewTelco. Let me retrieve that for you—one moment.
43
+ [Calls `get_user_bill` tool]
44
+ ```
45
+
46
+ ### Tool Schema
47
+
48
+ ```json
49
+ {
50
+ "name": "get_user_bill",
51
+ "description": "Retrieve a user's latest billing information.",
52
+ "parameters": {
53
+ "type": "object",
54
+ "properties": {
55
+ "phone_number": { "type": "string" }
56
+ },
57
+ "required": ["phone_number"]
58
+ }
59
+ }
60
+ ```
61
+
62
+ ### Best Practices
63
+
64
+ * Use a formal output format block.
65
+ * Include tool call before every factual output.
66
+ * Cite retrieved source when answering.
67
+
68
+ ### Failure Mitigation
69
+
70
+ | Risk | Prevention |
71
+ | ------------------------ | ----------------------------------------- |
72
+ | Repetitive responses | Vary phrasing with sample lists |
73
+ | Tool skipping | Require tool call before factual response |
74
+ | Prohibited topic leakage | Reinforce restriction list + test in QA |
75
+
76
+
77
+ ## Deployment Pattern 2: Codebase Maintenance Agent
78
+
79
+ ### Use Case
80
+
81
+ An agent responsible for identifying and fixing bugs using diffs, applying patches, running tests, and confirming bug resolution.
82
+
83
+ ### Prompt Highlights
84
+
85
+ ```markdown
86
+ # Instructions
87
+ - Read all context before patching.
88
+ - Plan changes first.
89
+ - Apply patches with `apply_patch`.
90
+ - Run tests before finalizing.
91
+ - Keep going until all tests pass.
92
+
93
+ # Patch Format
94
+ *** Begin Patch
95
+ *** Update File: path/to/file.py
96
+ @@ def buggy():
97
+ - broken()
98
+ + fixed()
99
+ *** End Patch
100
+ ```
101
+
102
+ ### Tool Schema
103
+
104
+ ```json
105
+ {
106
+ "name": "apply_patch",
107
+ "description": "Applies human-readable code patches",
108
+ "parameters": {
109
+ "type": "object",
110
+ "properties": {
111
+ "input": { "type": "string" }
112
+ },
113
+ "required": ["input"]
114
+ }
115
+ }
116
+ ```
117
+
118
+ ### Agent Workflow
119
+
120
+ 1. Understand the bug
121
+ 2. Explore relevant files
122
+ 3. Propose and apply patch
123
+ 4. Run `!python3 run_tests.py`
124
+ 5. Reflect and iterate until success
125
+
126
+ ### Notes
127
+
128
+ * Use `@@` headers to specify scope
129
+ * Plan before every action
130
+ * Reflect after test results
131
+
132
+
133
+ ## Deployment Pattern 3: Long Document Analyst
134
+
135
+ ### Use Case
136
+
137
+ A document triage and synthesis agent for use with 100k–1M token context windows.
138
+
139
+ ### Prompt Setup
140
+
141
+ ```markdown
142
+ # Instructions
143
+ - Focus on relevance.
144
+ - Reflect every 10k tokens.
145
+ - Summarize findings by section.
146
+
147
+ # Strategy
148
+ 1. Read → rate relevance
149
+ 2. Extract high-salience content
150
+ 3. Synthesize across documents
151
+ ```
152
+
153
+ ### Input Format Guidance
154
+
155
+ * Prefer `# Section`, `<doc>` tags, or ID/TITLE headers
156
+ * Avoid JSON for >10k tokens
157
+ * Repeat instructions at start and end
158
+
159
+ ### Best Practices
160
+
161
+ * Insert checkpoints every 5–10k tokens
162
+ * Ask model to pause and reflect: “Are we on track?”
163
+ * Evaluate document relevance before synthesis
164
+
165
+
166
+ ## Deployment Pattern 4: Data Labeling Assistant
167
+
168
+ ### Use Case
169
+
170
+ Assist in labeling structured or unstructured data with schema validation and few-shot learning.
171
+
172
+ ### Prompt Structure
173
+
174
+ ```markdown
175
+ # Labeling Instructions
176
+ - Label each entry using valid categories
177
+ - Format: {"text": ..., "label": ...}
178
+
179
+ # Categories
180
+ - Urgent
181
+ - Normal
182
+ - Spam
183
+
184
+ # Example
185
+ {"text": "Free money now!", "label": "Spam"}
186
+ ```
187
+
188
+ ### API Integration
189
+
190
+ Validate against schema on submit. Add real-time audit checks for consistency.
191
+
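+ One lightweight validation pass, written against the label format above (plain Python, no schema library assumed):
+
+ ```python
+ import json
+
+ VALID_LABELS = {"Urgent", "Normal", "Spam"}
+
+ def validate_label_record(raw: str) -> tuple[bool, str]:
+     """Check one model-produced labeling line against the expected format."""
+     try:
+         record = json.loads(raw)
+     except json.JSONDecodeError:
+         return False, "not valid JSON"
+     if not isinstance(record, dict) or set(record) != {"text", "label"}:
+         return False, "must contain exactly 'text' and 'label'"
+     if record["label"] not in VALID_LABELS:
+         return False, f"unknown label: {record['label']!r}"
+     return True, "ok"
+
+ print(validate_label_record('{"text": "Free money now!", "label": "Spam"}'))  # (True, 'ok')
+ ```
+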
192
+ ### Evaluation
193
+
194
+ * Measure label precision
195
+ * Flag outliers for review
196
+ * Use `tool_call` to suggest schema fixes
197
+
198
+
199
+ ## Deployment Pattern 5: Research Assistant
200
+
201
+ ### Use Case
202
+
203
+ Used by analysts to extract, summarize, and contrast findings across large research corpora.
204
+
205
+ ### Core Prompt Blocks
206
+
207
+ ```markdown
208
+ # Objective
209
+ Identify similarities and differences across these studies.
210
+
211
+ # Step-by-Step Plan
212
+ 1. Break each study into hypothesis, method, result
213
+ 2. Extract claims
214
+ 3. Compare claim alignment or contradiction
215
+ ```
216
+
217
+ ### Ideal Format
218
+
219
+ Use XML-structured context for each paper:
220
+
221
+ ```xml
222
+ <doc id="23" title="Study A">
223
+ <hypothesis>...</hypothesis>
224
+ <method>...</method>
225
+ <results>...</results>
226
+ </doc>
227
+ ```
228
+
229
+ ### Output Pattern
230
+
231
+ ```json
232
+ [
233
+ {"id": "23", "summary": "Study A supports..."},
234
+ {"id": "47", "summary": "Study B challenges..."},
235
+ {"alignment": false, "conflict_reason": "Different control group"}
236
+ ]
237
+ ```
238
+
239
+
240
+ ## Deployment Best Practices
241
+
242
+ ### Prompting
243
+
244
+ * Use bullet-style `# Instructions`
245
+ * Add `# Reasoning Strategy` section to guide workflow
246
+ * Repeat instructions at top and bottom for long input
247
+
248
+ ### Tool Integration
249
+
250
+ * Pass tools in API schema, not inline
251
+ * Provide examples in `# Examples` section
252
+ * Use clear tool names and parameter descriptions
253
+
254
+ ### Output Handling
255
+
256
+ * Define expected format in advance
257
+ * Use schema validation for structured outputs
258
+ * Log every tool call and agent action
259
+
260
+ ### Iterative Evaluation
261
+
262
+ * Audit performance per use case
263
+ * Evaluate edge-case behavior explicitly
264
+ * Collect examples of failure modes
265
+ * Adjust prompts, tools, and planning steps accordingly
266
+
267
+
268
+ ## Summary
269
+
270
+ GPT-4.1 is deployable across a wide range of real-world systems. Success depends not only on model capability but on prompt structure, tool schema clarity, planning enforcement, and continual evaluation. Each scenario benefits from opinionated workflows, persistent agent behaviors, and clearly delimited responsibilities.
271
+
272
+ **Start with structured instructions. Plan agent actions. Validate at every step.**
273
+
274
+
275
+ ## Additional Notes
276
+
277
+ * Always measure: accuracy, tool latency, format compliance, adherence
278
+ * Use internal QA and sandbox environments before production
279
+ * Document all agentic patterns and update based on logs
280
+ * Prefer long-term performance tracking over one-off evals
281
+
282
+ Deployment is not one prompt—it’s a living system. Maintain, monitor, and adapt.
tool_use_and_integration.md ADDED
@@ -0,0 +1,317 @@
1
+ # [Tool Use and Integration](https://chatgpt.com/canvas/shared/6825ee7dbfd081919e67bd643748f8de)
2
+
3
+ ## Overview
4
+
5
+ GPT-4.1 introduces robust capabilities for working with tools directly through the OpenAI API’s `tools` parameter. Rather than relying solely on the model's internal knowledge, developers can now extend functionality, reduce hallucination, and enforce reliable workflows by integrating explicitly defined tools into their applications.
6
+
7
+ This document offers a comprehensive guide for designing and deploying tool-augmented applications using GPT-4.1. It includes best practices for tool registration, prompting strategies, tool schema design, usage examples, and debugging common tool invocation failures. Each section is modular and designed to help you build reliable systems that scale across contexts, task types, and user interfaces.
8
+
9
+
10
+ ## What is a Tool in GPT-4.1?
11
+
12
+ A **tool** is an explicitly defined function or utility passed to the GPT-4.1 API, allowing the model to trigger predefined operations such as:
13
+
14
+ * Running code or bash commands
15
+ * Retrieving documents or structured data
16
+ * Performing API calls
17
+ * Applying file patches or diffs
18
+ * Looking up user account information
19
+
20
+ Tools are defined in a structured JSON schema and passed via the `tools` parameter. When the model determines a tool is required, it emits a function call rather than plain text. This enables **precise execution**, **auditable behavior**, and **tight application integration**.
21
+
22
+
23
+ ## Why Use Tools?
24
+
25
+ | Benefit | Description |
26
+ | ------------------------------ | -------------------------------------------------------------------------- |
27
+ | **Reduces hallucination** | Encourages the model to call real-world functions instead of guessing |
28
+ | **Improves traceability** | Tool calls are logged and interpretable as function outputs |
29
+ | **Enables complex workflows** | Offloads parts of the task to external systems (e.g., shell, Python, APIs) |
30
+ | **Enhances compliance** | Limits model responses to grounded tool outputs |
31
+ | **Improves agent performance** | Required for persistent, multi-turn agentic workflows |
32
+
33
+
34
+ ## Tool Definition: The Schema
35
+
36
+ Tools are defined using a JSON schema object that includes:
37
+
38
+ * `name`: A short, unique identifier
39
+ * `description`: A concise explanation of what the tool does
40
+ * `parameters`: A standard JSON Schema describing expected input
41
+
42
+ ### Example: Python Execution Tool
43
+
44
+ ```json
45
+ {
46
+ "type": "function",
47
+ "name": "python",
48
+ "description": "Run Python code or terminal commands in a secure environment.",
49
+ "parameters": {
50
+ "type": "object",
51
+ "properties": {
52
+ "input": {
53
+ "type": "string",
54
+ "description": "The code or command to run"
55
+ }
56
+ },
57
+ "required": ["input"]
58
+ }
59
+ }
60
+ ```
61
+
62
+ ### Best Practices for Schema Design
63
+
64
+ * Use clear names: `run_tests`, `lookup_policy`, `apply_patch`
65
+ * Keep descriptions actionable: Describe *when* and *why* to use
66
+ * Minimize complexity: Use shallow parameter objects where possible
67
+ * Use enums or constraints to reduce ambiguous calls
68
+
69
+
## Registering Tools in the API

In the Python SDK:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=chat_history,
    tools=[python_tool, get_user_info_tool],  # tool definitions (see note on format below)
    tool_choice="auto",
)
```

Note that the Chat Completions endpoint expects each tool to be wrapped as `{"type": "function", "function": {...}}`, while the flat form shown in the schema example above is the format used by the Responses API.

Set `tool_choice` to:

* `"auto"`: Allow the model to choose whether and when to call a tool
* An object naming a specific function (e.g., `{"type": "function", "function": {"name": "python"}}`): Force that call
* `"none"`: Prevent tool usage (useful for testing)

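Once a response comes back, the application has to execute any tool call the model emitted and return the result. The snippet below is a minimal sketch of that round trip with the Chat Completions endpoint, continuing from the registration example above; `run_python` is a placeholder for your own sandboxed executor, not a real library function.

```python
import json

def run_python(code: str) -> str:
    """Placeholder: execute `code` in your own sandbox and return its output."""
    raise NotImplementedError

message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    args = json.loads(call.function.arguments)   # arguments arrive as a JSON string

    # Run the requested tool, then hand the result back in a `tool` message.
    chat_history.append(message)
    chat_history.append({
        "role": "tool",
        "tool_call_id": call.id,
        "content": run_python(args["input"]),
    })
    response = client.chat.completions.create(
        model="gpt-4.1",
        messages=chat_history,
        tools=[python_tool, get_user_info_tool],
    )
```
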
## Prompting for Tool Use

### Tool Use Prompting Guidelines

To guide GPT-4.1 toward proper tool usage:

* **Don’t rely on the model to infer when to call a tool.** Tell it explicitly when tools are required.
* **Prompt for failure cases**: Tell the model what to do when it lacks information (e.g., “ask the user” or “pause”).
* **Avoid ambiguity**: Be clear about tool invocation order and data requirements.

### Example Prompt Snippet

```markdown
Before answering any user question about billing, check if the necessary context is available.
If not, use the `lookup_policy_document` tool to find relevant information.
Never answer without citing a retrieved document.
```

### Escalation Pattern

```markdown
If the tool fails to return the necessary data, ask the user for clarification.
If the user cannot provide it, explain the limitation and pause further action.
```

## Tool Use in Agent Workflows

Tool usage is foundational to agent design in GPT-4.1.

### Multi-Stage Task Example: Bug Fix Agent

```markdown
1. Use `read_file` to inspect code
2. Analyze and plan a fix
3. Use `apply_patch` to update the file
4. Use `run_tests` to verify changes
5. Reflect and reattempt if needed
```

Each tool call is logged as a JSON event and can be parsed programmatically.

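Here is a minimal sketch of how such a multi-stage workflow might be driven in code, assuming hypothetical local handlers for the three tools named above; the loop keeps executing tool calls until the model answers in plain text or the turn budget runs out.

```python
import json

# Hypothetical local implementations of the tools named in the workflow above.
def read_file(args):
    return open(args["path"], encoding="utf-8").read()

def apply_patch(args):
    raise NotImplementedError("wire up your patch runner here")

def run_tests(args):
    raise NotImplementedError("wire up your test runner here")

TOOL_HANDLERS = {"read_file": read_file, "apply_patch": apply_patch, "run_tests": run_tests}

def run_agent(client, chat_history, tools, max_turns=10):
    """Loop until the model stops requesting tools or the turn budget is exhausted."""
    for _ in range(max_turns):
        response = client.chat.completions.create(
            model="gpt-4.1", messages=chat_history, tools=tools
        )
        message = response.choices[0].message
        if not message.tool_calls:
            return message.content                     # final, plain-text answer
        chat_history.append(message)
        for call in message.tool_calls:
            result = TOOL_HANDLERS[call.function.name](json.loads(call.function.arguments))
            chat_history.append(
                {"role": "tool", "tool_call_id": call.id, "content": str(result)}
            )
    return None                                        # turn budget exhausted
```
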
## Apply Patch: Recommended Format

One of the most powerful GPT-4.1 patterns is **patch generation** using a diff-like format.

### Patch Structure

```bash
apply_patch <<"EOF"
*** Begin Patch
*** Update File: path/to/file.py
@@ def function():
- old_code()
+ new_code()
*** End Patch
EOF
```

### Tool Behavior

* No line numbers required
* Context determined by `@@` anchors and 3 lines of code before/after
* Errors must be handled gracefully and logged (a minimal sketch follows below)

See `/examples/apply_patch/` for templates and error-handling techniques.

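If `apply_patch` is exposed as a shell command, as in the structure above, the host application still has to invoke it and surface failures. The sketch below assumes an `apply_patch` executable on the PATH that reads the patch from stdin; adapt it to however your environment actually ships the tool.

```python
import subprocess

def run_apply_patch(patch_text: str) -> str:
    """Run apply_patch on a patch string; return its output or a readable error summary."""
    result = subprocess.run(
        ["apply_patch"],          # assumed executable; replace with your actual entry point
        input=patch_text,
        capture_output=True,
        text=True,
    )
    if result.returncode != 0:
        # Surface the failure so the model (or a human) can decide the next step.
        return f"apply_patch failed (exit {result.returncode}): {result.stderr.strip()}"
    return result.stdout.strip() or "Patch applied successfully."
```
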
## Tool Examples by Use Case

| Use Case              | Tool Name       | Description                                 |
| --------------------- | --------------- | ------------------------------------------- |
| Execute code          | `python`        | Runs code or shell commands                 |
| Apply file diff       | `apply_patch`   | Applies a patch to a source file            |
| Fetch document        | `lookup_policy` | Retrieves structured policy text            |
| Get user account data | `get_user_info` | Fetches user account info via phone number  |
| Log analytics         | `log_event`     | Sends metadata to your analytics platform   |

## Error Handling and Recovery

Tool failure is inevitable in complex systems. Plan for it.

### Guidelines for GPT-4.1

* Detect and summarize tool errors
* Ask for missing input
* Retry if safe
* Escalate to the user if unresolvable

### Prompt Pattern: Failure Response

```markdown
If a tool fails with an error, summarize the issue clearly for the user.
Only retry if the cause of failure is known and correctable.
If not, explain the problem and ask the user for next steps.
```

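The same policy can be mirrored on the application side. Here is a minimal sketch of a retry wrapper, assuming tool handlers raise ordinary Python exceptions and that only a known set of transient error types is safe to retry:

```python
RETRYABLE = (TimeoutError, ConnectionError)   # failure types considered safe to retry

def call_tool_with_recovery(handler, args, max_retries=1):
    """Run a tool handler; retry known-transient failures, otherwise report the error."""
    last_error = None
    for _ in range(max_retries + 1):
        try:
            return {"ok": True, "output": handler(args)}
        except RETRYABLE as exc:
            last_error = exc                  # transient: loop and try again
        except Exception as exc:
            # Unknown cause: do not retry; summarize so the model or user can decide.
            return {"ok": False, "error": f"{type(exc).__name__}: {exc}"}
    return {"ok": False, "error": f"Retries exhausted: {last_error}"}
```
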
## Tool Debugging and Logging

Enable structured logging to track model-tool interactions:

* **Log call attempts**: Include input parameters and timestamps
* **Log success/failure outcomes**: Include model reflections
* **Log retry logic**: Show how failures were handled

This creates full traceability for AI-involved actions.

### Sample Tool Call Log (JSON)

```json
{
  "tool_name": "run_tests",
  "input": "!python3 -m unittest discover",
  "result": "3 tests passed, 1 failed",
  "timestamp": "2025-05-15T14:32:12Z"
}
```

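A minimal sketch of how such records might be written, using only the standard library; the file name and the JSON Lines layout are illustrative choices, not a required format:

```python
import json
from datetime import datetime, timezone

def log_tool_call(tool_name: str, tool_input: str, result: str,
                  path: str = "tool_calls.jsonl") -> None:
    """Append one structured tool-call record per line for later analysis."""
    entry = {
        "tool_name": tool_name,
        "input": tool_input,
        "result": result,
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(entry) + "\n")
```
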
## Tool Evaluation and Performance Monitoring

Track tool usage metrics:

* **Tool Call Rate**: How often a tool is invoked
* **Tool Completion Rate**: How often tools finish without failure
* **Tool Contribution Score**: Impact on final task completion
* **Average Attempts per Task**: Retry behavior over time

Use this data to refine prompting and improve tool schema design.

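For example, the first two metrics can be computed directly from a JSON Lines log like the one written above; the success heuristic here (no "fail" in the result string) is an assumption you would replace with an explicit status field.

```python
import json

def tool_metrics(path: str = "tool_calls.jsonl") -> dict:
    """Compute per-tool call counts and completion rates from a JSON Lines log."""
    calls, successes = {}, {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            entry = json.loads(line)
            name = entry["tool_name"]
            calls[name] = calls.get(name, 0) + 1
            if "fail" not in entry.get("result", "").lower():   # crude success heuristic
                successes[name] = successes.get(name, 0) + 1
    return {
        name: {
            "call_count": calls[name],
            "completion_rate": successes.get(name, 0) / calls[name],
        }
        for name in calls
    }
```
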
## Common Pitfalls and Solutions

| Issue                        | Likely Cause                                   | Solution                                              |
| ---------------------------- | ---------------------------------------------- | ----------------------------------------------------- |
| Tool called with empty input | Missing required parameter                     | Prompt the model to validate input presence           |
| Tool ignored                 | Tool not described clearly in schema or prompt | Add clear instructions for when to use the tool       |
| Repeated failed calls        | No failure mitigation logic                    | Add conditionals to check and respond to tool errors  |
| Model mixes tool names       | Ambiguous tool naming                          | Use short, specific, unambiguous names                |

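Developer-side validation complements the prompt-level fix in the first row: reject an empty or malformed call and return the reason as the tool result, so the model can correct itself on the next turn. A minimal sketch (the required-field list is illustrative):

```python
import json

def validate_and_run(call, handler, required_fields=("input",)):
    """Check required arguments before executing; return an error string the model can act on."""
    try:
        args = json.loads(call.function.arguments or "{}")
    except json.JSONDecodeError:
        return "Error: tool arguments were not valid JSON. Please call again with valid JSON."
    missing = [field for field in required_fields if not args.get(field)]
    if missing:
        return f"Error: missing required field(s): {', '.join(missing)}. Please call again with them filled in."
    return handler(args)
```
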
## Combining Tools with Instructions

When combining tools with detailed instruction sets:

* Include a `# Tools` section in your system prompt
* Define when and why each tool should be used
* Link tool calls to reasoning steps in `# Workflow`

### Example Combined Prompt

```markdown
# Role
You are a bug-fix agent using provided tools to solve code issues.

# Tools
- `read_file`: Inspect code files
- `apply_patch`: Apply structured diffs
- `run_tests`: Validate code after changes

# Instructions
1. Always start with file inspection
2. Plan before making changes
3. Test after every patch
4. Do not finish until all tests pass

# Output
Include patch summaries, test outcomes, and current status.
```

## Tool Testing Templates

Create test cases that validate:

* Input formatting
* Response validation
* Prompt-tool alignment
* Handling of edge cases

Use both synthetic and real examples:

```markdown
## Tool Call Test: run_tests
**Input**: Code with known error
**Expected Output**: Test failure summary
**Follow-up Behavior**: Retry with fixed patch
```

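One way to automate a synthetic case like this is to assert that tool arguments satisfy the tool's schema. The sketch below assumes `pytest` and the `jsonschema` package are installed; the `parameters` block is an illustrative schema for a hypothetical `run_tests` tool, and the raw argument strings stand in for model output.

```python
import json

import pytest
from jsonschema import ValidationError, validate

# Illustrative `parameters` block for a hypothetical run_tests tool.
RUN_TESTS_PARAMETERS = {
    "type": "object",
    "properties": {"target": {"type": "string"}},
    "required": ["target"],
}

@pytest.mark.parametrize("raw_args, should_pass", [
    ('{"target": "tests/unit"}', True),    # well-formed synthetic call
    ("{}", False),                         # edge case: missing required field
])
def test_run_tests_arguments(raw_args, should_pass):
    args = json.loads(raw_args)
    if should_pass:
        validate(args, RUN_TESTS_PARAMETERS)
    else:
        with pytest.raises(ValidationError):
            validate(args, RUN_TESTS_PARAMETERS)
```
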
## Tool Choice Design

Choose between model-directed and developer-directed tool invocation:

| Mode        | Behavior                                     | Use Case                           |
| ----------- | -------------------------------------------- | ---------------------------------- |
| `auto`      | Model decides whether and when to use tools  | General assistants, exploration    |
| `none`      | Model cannot use tools                       | Testing model reasoning only       |
| Forced tool | Developer forces a specific tool call        | Known pipeline steps, unit testing |

Choose based on control needs and task constraints.

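For the forced mode, the Chat Completions endpoint takes an object naming the function rather than a bare string. A minimal sketch, forcing the `python` tool from the earlier schema example:

```python
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=chat_history,
    tools=[python_tool],
    # Force exactly this function to be called on this turn.
    tool_choice={"type": "function", "function": {"name": "python"}},
)
```
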
## Summary: Best Practices for Tool Integration

| Area             | Best Practice                                              |
| ---------------- | ---------------------------------------------------------- |
| Tool Naming      | Use action-based, unambiguous names                         |
| Prompt Structure | Clearly define when and how tools should be used            |
| Tool Invocation  | Register tools in the API, not in plain prompt text         |
| Failure Handling | Provide instructions for retrying or asking the user        |
| Schema Design    | Use JSON Schema with constraints to reduce invalid input    |
| Evaluation       | Track tool call success rate and contribution to outcome    |

## Further Exploration

* [`Designing Agent Workflows`](./Designing%20Agent%20Workflows.md)
* [`Prompting for Instruction Following`](./Prompting%20for%20Instruction%20Following.md)
* [`Long Context Strategies`](./Long%20Context.md)

For community templates and tool libraries, explore the `/tools/` and `/examples/` directories in the main repository.

To contribute, open a pull request against `/tools/Tool Use and Integration.md` or file an issue in the main repository.