# [GPT-4.1 Prompting Guide](https://cookbook.openai.com/examples/gpt4-1_prompting_guide#2-long-context)

> By [Noah MacCallum](https://x.com/noahmacca) and [Julian Lee](https://x.com/julianl093) (OpenAI)

The GPT-4.1 family of models represents a significant step forward from GPT-4o in capabilities across coding, instruction following, and long context. In this prompting guide, we collate a series of important prompting tips derived from extensive internal testing to help developers fully leverage the improved abilities of this new model family.

Many typical best practices still apply to GPT-4.1, such as providing context examples, making instructions as specific and clear as possible, and inducing planning via prompting to maximize model intelligence. However, we expect that getting the most out of this model will require some prompt migration. GPT-4.1 is trained to follow instructions more closely and more literally than its predecessors, which tended to more liberally infer intent from user and system prompts. This also means, however, that GPT-4.1 is highly steerable and responsive to well-specified prompts - if model behavior is different from what you expect, a single sentence firmly and unequivocally clarifying your desired behavior is almost always sufficient to steer the model on course.

Please read on for prompt examples you can use as a reference, and remember that while this guidance is widely applicable, no advice is one-size-fits-all. AI engineering is inherently an empirical discipline, and large language models are inherently nondeterministic; in addition to following this guide, we advise building informative evals and iterating often to ensure your prompt engineering changes are yielding benefits for your use case.

# 1. Agentic Workflows

GPT-4.1 is a great place to build agentic workflows. In model training we emphasized providing a diverse range of agentic problem-solving trajectories, and our agentic harness for the model achieves state-of-the-art performance for non-reasoning models on SWE-bench Verified, solving 55% of problems.

## System Prompt Reminders

In order to fully utilize the agentic capabilities of GPT-4.1, we recommend including three key types of reminders in all agent prompts. The following prompts are optimized specifically for the agentic coding workflow, but can be easily modified for general agentic use cases.

1. Persistence: this ensures the model understands it is entering a multi-message turn, and prevents it from prematurely yielding control back to the user. Our example is the following:

```
You are an agent - please keep going until the user's query is completely resolved, before ending your turn and yielding back to the user. Only terminate your turn when you are sure that the problem is solved.
```

2. Tool-calling: this encourages the model to make full use of its tools, and reduces its likelihood of hallucinating or guessing an answer. Our example is the following:

```
If you are not sure about file content or codebase structure pertaining to the user's request, use your tools to read files and gather the relevant information: do NOT guess or make up an answer.
```

3. Planning [optional]: if desired, this ensures the model explicitly plans and reflects upon each tool call in text, instead of completing the task by chaining together a series of only tool calls. Our example is the following:

```
You MUST plan extensively before each function call, and reflect extensively on the outcomes of the previous function calls. DO NOT do this entire process by making function calls only, as this can impair your ability to solve the problem and think insightfully.
```

GPT-4.1 is trained to respond very closely to both user instructions and system prompts in the agentic setting. The model adhered closely to these three simple instructions and increased our internal SWE-bench Verified score by close to 20% - so we highly encourage starting any agent prompt with clear reminders covering the three categories listed above. As a whole, we find that these three instructions transform the model from a chatbot-like state into a much more "eager" agent, driving the interaction forward autonomously and independently.
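For convenience, the three snippets above can be assembled into a single system prompt. Here is a minimal sketch (our own scaffolding, not part of the guide's harness; the variable names are illustrative):

```python
# A minimal sketch: combine the three agentic reminders above into one
# system prompt string. Variable names are illustrative, not from the guide.
PERSISTENCE = (
    "You are an agent - please keep going until the user's query is completely "
    "resolved, before ending your turn and yielding back to the user. Only "
    "terminate your turn when you are sure that the problem is solved."
)
TOOL_CALLING = (
    "If you are not sure about file content or codebase structure pertaining to "
    "the user's request, use your tools to read files and gather the relevant "
    "information: do NOT guess or make up an answer."
)
PLANNING = (
    "You MUST plan extensively before each function call, and reflect "
    "extensively on the outcomes of the previous function calls. DO NOT do this "
    "entire process by making function calls only, as this can impair your "
    "ability to solve the problem and think insightfully."
)

AGENT_SYSTEM_PROMPT = "\n\n".join([PERSISTENCE, TOOL_CALLING, PLANNING])
```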

## Tool Calls

Compared to previous models, GPT-4.1 has undergone more training on effectively utilizing tools passed as arguments in an OpenAI API request. We encourage developers to exclusively use the tools field to pass tools, rather than manually injecting tool descriptions into your prompt and writing a separate parser for tool calls, as some have reported doing in the past. This is the best way to minimize errors and ensure the model remains in distribution during tool-calling trajectories - in our own experiments, we observed a 2% increase in SWE-bench Verified pass rate when using API-parsed tool descriptions versus manually injecting the schemas into the system prompt.

Developers should name tools clearly to indicate their purpose and add a clear, detailed description in the "description" field of the tool. Similarly, for each tool param, lean on good naming and descriptions to ensure appropriate usage. If your tool is particularly complicated and you'd like to provide examples of tool usage, we recommend that you create an `# Examples` section in your system prompt and place the examples there, rather than adding them into the "description" field, which should remain thorough but relatively concise. Providing examples can be helpful to indicate when to use tools, whether to include user text alongside tool calls, and what parameters are appropriate for different inputs. Remember that you can use "Generate Anything" in the Prompt Playground to get a good starting point for your new tool definitions.
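
To illustrate these conventions, here is a sketch of a clearly named, well-described tool passed via the tools field; the `search_flights` tool and its parameters are invented for this example:

```python
from openai import OpenAI

client = OpenAI()

# A hypothetical tool: note the descriptive name, the detailed "description"
# field, and the per-parameter descriptions.
search_flights_tool = {
    "type": "function",
    "name": "search_flights",
    "description": "Search for available flights between two airports on a given date.",
    "parameters": {
        "type": "object",
        "strict": True,
        "properties": {
            "origin": {
                "type": "string",
                "description": "Three-letter IATA code of the departure airport, e.g. 'SFO'.",
            },
            "destination": {
                "type": "string",
                "description": "Three-letter IATA code of the arrival airport, e.g. 'JFK'.",
            },
            "date": {
                "type": "string",
                "description": "Departure date in YYYY-MM-DD format.",
            },
        },
        "required": ["origin", "destination", "date"],
        "additionalProperties": False,
    },
}

response = client.responses.create(
    model="gpt-4.1-2025-04-14",
    tools=[search_flights_tool],  # pass tools here, not pasted into the prompt
    input="Find me a flight from San Francisco to New York on 2025-05-02.",
)
```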

## Prompting-Induced Planning & Chain-of-Thought

As mentioned already, developers can optionally prompt agents built with GPT-4.1 to plan and reflect between tool calls, instead of silently calling tools in an unbroken sequence. GPT-4.1 is not a reasoning model - meaning that it does not produce an internal chain of thought before answering - but in the prompt, a developer can induce the model to produce an explicit, step-by-step plan by using any variant of the Planning prompt component shown above. This can be thought of as the model "thinking out loud." In our experimentation with the SWE-bench Verified agentic task, inducing explicit planning increased the pass rate by 4%.

## Sample Prompt: SWE-bench Verified

Below, we share the agentic prompt that we used to achieve our highest score on SWE-bench Verified, which features detailed instructions about workflow and problem-solving strategy. This general pattern can be used for any agentic task.

```python
from openai import OpenAI
import os

client = OpenAI(
    api_key=os.environ.get(
        "OPENAI_API_KEY", "<your OpenAI API key if not set as env var>"
    )
)

SYS_PROMPT_SWEBENCH = """
You will be tasked to fix an issue from an open-source repository.

Your thinking should be thorough and so it's fine if it's very long. You can think step by step before and after each action you decide to take.

You MUST iterate and keep going until the problem is solved.

You already have everything you need to solve this problem in the /testbed folder, even without internet connection. I want you to fully solve this autonomously before coming back to me.

Only terminate your turn when you are sure that the problem is solved. Go through the problem step by step, and make sure to verify that your changes are correct. NEVER end your turn without having solved the problem, and when you say you are going to make a tool call, make sure you ACTUALLY make the tool call, instead of ending your turn.

THE PROBLEM CAN DEFINITELY BE SOLVED WITHOUT THE INTERNET.

Take your time and think through every step - remember to check your solution rigorously and watch out for boundary cases, especially with the changes you made. Your solution must be perfect. If not, continue working on it. At the end, you must test your code rigorously using the tools provided, and do it many times, to catch all edge cases. If it is not robust, iterate more and make it perfect. Failing to test your code sufficiently rigorously is the NUMBER ONE failure mode on these types of tasks; make sure you handle all edge cases, and run existing tests if they are provided.

You MUST plan extensively before each function call, and reflect extensively on the outcomes of the previous function calls. DO NOT do this entire process by making function calls only, as this can impair your ability to solve the problem and think insightfully.

# Workflow

## High-Level Problem Solving Strategy

1. Understand the problem deeply. Carefully read the issue and think critically about what is required.
2. Investigate the codebase. Explore relevant files, search for key functions, and gather context.
3. Develop a clear, step-by-step plan. Break down the fix into manageable, incremental steps.
4. Implement the fix incrementally. Make small, testable code changes.
5. Debug as needed. Use debugging techniques to isolate and resolve issues.
6. Test frequently. Run tests after each change to verify correctness.
7. Iterate until the root cause is fixed and all tests pass.
8. Reflect and validate comprehensively. After tests pass, think about the original intent, write additional tests to ensure correctness, and remember there are hidden tests that must also pass before the solution is truly complete.

Refer to the detailed sections below for more information on each step.

## 1. Deeply Understand the Problem
Carefully read the issue and think hard about a plan to solve it before coding.

## 2. Codebase Investigation
- Explore relevant files and directories.
- Search for key functions, classes, or variables related to the issue.
- Read and understand relevant code snippets.
- Identify the root cause of the problem.
- Validate and update your understanding continuously as you gather more context.

## 3. Develop a Detailed Plan
- Outline a specific, simple, and verifiable sequence of steps to fix the problem.
- Break down the fix into small, incremental changes.

## 4. Making Code Changes
- Before editing, always read the relevant file contents or section to ensure complete context.
- If a patch is not applied correctly, attempt to reapply it.
- Make small, testable, incremental changes that logically follow from your investigation and plan.

## 5. Debugging
- Make code changes only if you have high confidence they can solve the problem
- When debugging, try to determine the root cause rather than addressing symptoms
- Debug for as long as needed to identify the root cause and identify a fix
- Use print statements, logs, or temporary code to inspect program state, including descriptive statements or error messages to understand what's happening
- To test hypotheses, you can also add test statements or functions
- Revisit your assumptions if unexpected behavior occurs.

## 6. Testing
- Run tests frequently using `!python3 run_tests.py` (or equivalent).
- After each change, verify correctness by running relevant tests.
- If tests fail, analyze failures and revise your patch.
- Write additional tests if needed to capture important behaviors or edge cases.
- Ensure all tests pass before finalizing.

## 7. Final Verification
- Confirm the root cause is fixed.
- Review your solution for logic correctness and robustness.
- Iterate until you are extremely confident the fix is complete and all tests pass.

## 8. Final Reflection and Additional Testing
- Reflect carefully on the original intent of the user and the problem statement.
- Think about potential edge cases or scenarios that may not be covered by existing tests.
- Write additional tests that would need to pass to fully validate the correctness of your solution.
- Run these new tests and ensure they all pass.
- Be aware that there are additional hidden tests that must also pass for the solution to be successful.
- Do not assume the task is complete just because the visible tests pass; continue refining until you are confident the fix is robust and comprehensive.
"""

PYTHON_TOOL_DESCRIPTION = """This function is used to execute Python code or terminal commands in a stateful Jupyter notebook environment. python will respond with the output of the execution or time out after 60.0 seconds. Internet access for this session is disabled. Do not make external web requests or API calls as they will fail. Just as in a Jupyter notebook, you may also execute terminal commands by calling this function with a terminal command, prefaced with an exclamation mark.

In addition, for the purposes of this task, you can call this function with an `apply_patch` command as input. `apply_patch` effectively allows you to execute a diff/patch against a file, but the format of the diff specification is unique to this task, so pay careful attention to these instructions. To use the `apply_patch` command, you should pass a message of the following structure as "input":

%%bash
apply_patch <<"EOF"
*** Begin Patch
[YOUR_PATCH]
*** End Patch
EOF

Where [YOUR_PATCH] is the actual content of your patch, specified in the following V4A diff format.

*** [ACTION] File: [path/to/file] -> ACTION can be one of Add, Update, or Delete.
For each snippet of code that needs to be changed, repeat the following:
[context_before] -> See below for further instructions on context.
- [old_code] -> Precede the old code with a minus sign.
+ [new_code] -> Precede the new, replacement code with a plus sign.
[context_after] -> See below for further instructions on context.

For instructions on [context_before] and [context_after]:
- By default, show 3 lines of code immediately above and 3 lines immediately below each change. If a change is within 3 lines of a previous change, do NOT duplicate the first change's [context_after] lines in the second change's [context_before] lines.
- If 3 lines of context is insufficient to uniquely identify the snippet of code within the file, use the @@ operator to indicate the class or function to which the snippet belongs. For instance, we might have:
@@ class BaseClass
[3 lines of pre-context]
- [old_code]
+ [new_code]
[3 lines of post-context]

- If a code block is repeated so many times in a class or function such that even a single @@ statement and 3 lines of context cannot uniquely identify the snippet of code, you can use multiple `@@` statements to jump to the right context. For instance:

@@ class BaseClass
@@ def method():
[3 lines of pre-context]
- [old_code]
+ [new_code]
[3 lines of post-context]

Note, then, that we do not use line numbers in this diff format, as the context is enough to uniquely identify code. An example of a message that you might pass as "input" to this function, in order to apply a patch, is shown below.

%%bash
apply_patch <<"EOF"
*** Begin Patch
*** Update File: pygorithm/searching/binary_search.py
@@ class BaseClass
@@ def search():
- pass
+ raise NotImplementedError()

@@ class Subclass
@@ def search():
- pass
+ raise NotImplementedError()

*** End Patch
EOF

File references can only be relative, NEVER ABSOLUTE. After the apply_patch command is run, python will always say "Done!", regardless of whether the patch was successfully applied or not. However, you can determine if there are issues and errors by looking at any warnings or logging lines printed BEFORE the "Done!" is output.
"""

python_bash_patch_tool = {
    "type": "function",
    "name": "python",
    "description": PYTHON_TOOL_DESCRIPTION,
    "parameters": {
        "type": "object",
        "strict": True,
        "properties": {
            "input": {
                "type": "string",
                "description": " The Python code, terminal command (prefaced by exclamation mark), or apply_patch command that you wish to execute.",
            }
        },
        "required": ["input"],
    },
}

# Additional harness setup:
# - Add your repo to /testbed
# - Add your issue to the first user message
# - Note: Even though we used a single tool for python, bash, and apply_patch, we generally recommend defining more granular tools that are focused on a single function

response = client.responses.create(
    instructions=SYS_PROMPT_SWEBENCH,
    model="gpt-4.1-2025-04-14",
    tools=[python_bash_patch_tool],
    input=f"Please answer the following question:\nBug: Typerror..."
)

response.to_dict()["output"]
```

```
[{'id': 'msg_67fe92df26ac819182ffafce9ff4e4fc07c7e06242e51f8b',
  'content': [{'annotations': [],
    'text': "Thank you for the report, but “Typerror” is too vague for me to start debugging right away.\n\n**To make progress, I need to:**\n1. Find the exact error message text (e.g. `'TypeError: ...'`).\n2. Find which file and which line/function/class the error occurred in.\n3. Figure out what triggered the error (test file, usage, reproduction steps).\n4. Find the root cause and details.\n\n**Next steps:**\n- Investigate error/log/test output files for a Python `TypeError` message.\n- Examine the relevant code sections for problematic type usage.\n- If possible, reproduce the bug locally.\n\n**Plan:**\n- First, I will search for test files and log output in the `/testbed` directory that may contain the full error message and stack trace.\n\nLet’s start by listing the contents of the `/testbed` directory to look for clues.",
    'type': 'output_text'}],
  'role': 'assistant',
  'status': 'completed',
  'type': 'message'},
 {'arguments': '{"input":"!ls -l /testbed"}',
  'call_id': 'call_frnxyJgKi5TsBem0nR9Zuzdw',
  'name': 'python',
  'type': 'function_call',
  'id': 'fc_67fe92e3da7081918fc18d5c96dddc1c07c7e06242e51f8b',
  'status': 'completed'}]
```

# 2. Long Context

GPT-4.1 has a performant 1M token input context window, and is useful for a variety of long context tasks, including structured document parsing, re-ranking, selecting relevant information while ignoring irrelevant context, and performing multi-hop reasoning using context.

## Optimal Context Size

We observe very good performance on needle-in-a-haystack evaluations up to our full 1M token context, and we've observed very strong performance at complex tasks with a mix of both relevant and irrelevant code and other documents. However, long context performance can degrade as more items are required to be retrieved, or when performing complex reasoning that requires knowledge of the state of the entire context (like performing a graph search, for example).

## Tuning Context Reliance

Consider the mix of external vs. internal world knowledge that might be required to answer your question. Sometimes it's important for the model to use some of its own knowledge to connect concepts or make logical jumps, while in other cases it's desirable to only use provided context.

```
# Instructions
// For external context only
- Only use the documents in the provided External Context to answer the User Query. If you don't know the answer based on this context, you must respond "I don't have the information needed to answer that", even if a user insists on you answering the question.
// For internal and external knowledge
- By default, use the provided external context to answer the User Query, but if other basic knowledge is needed to answer, and you're confident in the answer, you can use some of your own knowledge to help answer the question.
```

## Prompt Organization

Especially in long context usage, placement of instructions and context can impact performance. If you have long context in your prompt, ideally place your instructions at both the beginning and end of the provided context, as we found this to perform better than only above or below. If you'd prefer to only have your instructions once, then above the provided context works better than below.
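
As a minimal sketch of this layout (the helper and variable names are our own):

```python
# A sketch: place the same instructions both before and after a long context,
# which the guidance above found to outperform instructions in one place only.
def build_long_context_prompt(instructions: str, documents: list[str]) -> str:
    context = "\n\n".join(documents)
    return f"{instructions}\n\n# External Context\n\n{context}\n\n{instructions}"
```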

# 3. Chain of Thought

As mentioned above, GPT-4.1 is not a reasoning model, but prompting the model to think step by step (called "chain of thought") can be an effective way for a model to break down problems into more manageable pieces, solve them, and improve overall output quality, with the tradeoff of higher cost and latency associated with using more output tokens. The model has been trained to perform well at agentic reasoning and real-world problem solving, so it shouldn't require much prompting to perform well.

We recommend starting with this basic chain-of-thought instruction at the end of your prompt:

```
...

First, think carefully step by step about what documents are needed to answer the query. Then, print out the TITLE and ID of each document. Then, format the IDs into a list.
```

From there, you should improve your chain-of-thought (CoT) prompt by auditing failures in your particular examples and evals, and addressing systematic planning and reasoning errors with more explicit instructions. In the unconstrained CoT prompt, there may be variance in the strategies it tries, and if you observe an approach that works well, you can codify that strategy in your prompt. Generally speaking, errors tend to occur from misunderstanding user intent, insufficient context gathering or analysis, or insufficient or incorrect step-by-step thinking, so watch out for these and try to address them with more opinionated instructions.

Here is an example prompt instructing the model to focus more methodically on analyzing user intent and considering relevant context before proceeding to answer.

```
# Reasoning Strategy
1. Query Analysis: Break down and analyze the query until you're confident about what it might be asking. Consider the provided context to help clarify any ambiguous or confusing information.
2. Context Analysis: Carefully select and analyze a large set of potentially relevant documents. Optimize for recall - it's okay if some are irrelevant, but the correct documents must be in this list, otherwise your final answer will be wrong. Analysis steps for each:
    a. Analysis: An analysis of how it may or may not be relevant to answering the query.
    b. Relevance rating: [high, medium, low, none]
3. Synthesis: summarize which documents are most relevant and why, including all documents with a relevance rating of medium or higher.

# User Question
{user_question}

# External Context
{external_context}

First, think carefully step by step about what documents are needed to answer the query, closely adhering to the provided Reasoning Strategy. Then, print out the TITLE and ID of each document. Then, format the IDs into a list.
```
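
As a sketch of putting this template to work, assuming you have stored the template above in a `REASONING_PROMPT` string (the question and document below are invented placeholders; `client` is the OpenAI client defined earlier):

```python
# A sketch: fill the reasoning-strategy template's placeholders and send it.
# REASONING_PROMPT holds the template text shown above.
response = client.responses.create(
    model="gpt-4.1-2025-04-14",
    input=REASONING_PROMPT.format(
        user_question="Which plan includes international roaming?",
        external_context="<doc id='1' title='Plans'>...</doc>",
    ),
)
```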

# 4. Instruction Following

GPT-4.1 exhibits outstanding instruction-following performance, which developers can leverage to precisely shape and control the outputs for their particular use cases. Developers often extensively prompt for agentic reasoning steps, response tone and voice, tool calling information, output formatting, topics to avoid, and more. However, since the model follows instructions more literally, developers may need to include explicit specification around what to do or not to do. Furthermore, existing prompts optimized for other models may not immediately work with this model, because existing instructions are followed more closely and implicit rules are no longer being as strongly inferred.

## Recommended Workflow

Here is our recommended workflow for developing and debugging instructions in prompts:

1. Start with an overall "Response Rules" or "Instructions" section with high-level guidance and bullet points.
2. If you'd like to change a more specific behavior, add a section to specify more details for that category, like `# Sample Phrases`.
3. If there are specific steps you'd like the model to follow in its workflow, add an ordered list and instruct the model to follow these steps.
4. If behavior still isn't working as expected:
   - Check for conflicting, underspecified, or wrong instructions and examples. If there are conflicting instructions, GPT-4.1 tends to follow the one closer to the end of the prompt.
   - Add examples that demonstrate desired behavior; ensure that any important behavior demonstrated in your examples is also cited in your rules.
   - It's generally not necessary to use all-caps or other incentives like bribes or tips. We recommend starting without these, and only reaching for them if necessary for your particular prompt. Note that if your existing prompts include these techniques, it could cause GPT-4.1 to pay attention to them too strictly.

Note that using your preferred AI-powered IDE can be very helpful for iterating on prompts, including checking for consistency or conflicts, adding examples, or making cohesive updates like adding an instruction and updating other instructions to demonstrate that instruction.

## Common Failure Modes

These failure modes are not unique to GPT-4.1, but we share them here for general awareness and ease of debugging.

- Instructing a model to always follow a specific behavior can occasionally induce adverse effects. For instance, if told "you must call a tool before responding to the user," models may hallucinate tool inputs or call the tool with null values if they do not have enough information. Adding "if you don't have enough information to call the tool, ask the user for the information you need" should mitigate this.
- When provided sample phrases, models can use those quotes verbatim and start to sound repetitive to users. Ensure you instruct the model to vary them as necessary.
- Without specific instructions, some models can be eager to provide additional prose to explain their decisions, or output more formatting in responses than may be desired. Provide instructions and potentially examples to help mitigate.

## Example Prompt: Customer Service

This demonstrates best practices for a fictional customer service agent. Observe the diversity of rules, the specificity, the use of additional sections for greater detail, and an example to demonstrate precise behavior that incorporates all prior rules.

Try running the following notebook cell - you should see both a user message and tool call, and the user message should start with a greeting, then echo back their answer, then mention they're about to call a tool. Try changing the instructions to shape the model behavior, or trying other user messages, to test instruction following performance.

```python
SYS_PROMPT_CUSTOMER_SERVICE = """You are a helpful customer service agent working for NewTelco, helping a user efficiently fulfill their request while adhering closely to provided guidelines.

# Instructions
- Always greet the user with "Hi, you've reached NewTelco, how can I help you?"
- Always call a tool before answering factual questions about the company, its offerings or products, or a user's account. Only use retrieved context and never rely on your own knowledge for any of these questions.
- However, if you don't have enough information to properly call the tool, ask the user for the information you need.
- Escalate to a human if the user requests.
- Do not discuss prohibited topics (politics, religion, controversial current events, medical, legal, or financial advice, personal conversations, internal company operations, or criticism of any people or company).
- Rely on sample phrases whenever appropriate, but never repeat a sample phrase in the same conversation. Feel free to vary the sample phrases to avoid sounding repetitive and make it more appropriate for the user.
- Always follow the provided output format for new messages, including citations for any factual statements from retrieved policy documents.
- If you're going to call a tool, always message the user with an appropriate message before and after calling the tool.
- Maintain a professional and concise tone in all responses, and use emojis between sentences.
- If you've resolved the user's request, ask if there's anything else you can help with

# Precise Response Steps (for each response)
1. If necessary, call tools to fulfill the user's desired action. Always message the user before and after calling a tool to keep them in the loop.
2. In your response to the user
   a. Use active listening and echo back what you heard the user ask for.
   b. Respond appropriately given the above guidelines.

# Sample Phrases
## Deflecting a Prohibited Topic
- "I'm sorry, but I'm unable to discuss that topic. Is there something else I can help you with?"
- "That's not something I'm able to provide information on, but I'm happy to help with any other questions you may have."

## Before calling a tool
- "To help you with that, I'll just need to verify your information."
- "Let me check that for you—one moment, please."
- "I'll retrieve the latest details for you now."

## After calling a tool
- "Okay, here's what I found: [response]"
- "So here's what I found: [response]"

# Output Format
- Always include your final response to the user.
- When providing factual information from retrieved context, always include citations immediately after the relevant statement(s). Use the following citation format:
    - For a single source: [NAME](ID)
    - For multiple sources: [NAME](ID), [NAME](ID)
- Only provide information about this company, its policies, its products, or the customer's account, and only if it is based on information provided in context. Do not answer questions outside this scope.

# Example
## User
Can you tell me about your family plan options?

## Assistant Response 1
### Message
"Hi, you've reached NewTelco, how can I help you? 😊🎉\n\nYou'd like to know about our family plan options. 🤝 Let me check that for you—one moment, please. 🚀"

### Tool Calls
lookup_policy_document(topic="family plan options")

// After tool call, the assistant would follow up with:

## Assistant Response 2 (after tool call)
### Message
"Okay, here's what I found: 🎉 Our family plan allows up to 5 lines with shared data and a 10% discount for each additional line [Family Plan Policy](ID-010). 📱 Is there anything else I can help you with today? 😊"
"""

get_policy_doc = {
    "type": "function",
    "name": "lookup_policy_document",
    "description": "Tool to look up internal documents and policies by topic or keyword.",
    "parameters": {
        "strict": True,
        "type": "object",
        "properties": {
            "topic": {
                "type": "string",
                "description": "The topic or keyword to search for in company policies or documents.",
            },
        },
        "required": ["topic"],
        "additionalProperties": False,
    },
}

get_user_acct = {
    "type": "function",
    "name": "get_user_account_info",
    "description": "Tool to get user account information",
    "parameters": {
        "strict": True,
        "type": "object",
        "properties": {
            "phone_number": {
                "type": "string",
                "description": "Formatted as '(xxx) xxx-xxxx'",
            },
        },
        "required": ["phone_number"],
        "additionalProperties": False,
    },
}

response = client.responses.create(
    instructions=SYS_PROMPT_CUSTOMER_SERVICE,
    model="gpt-4.1-2025-04-14",
    tools=[get_policy_doc, get_user_acct],
    input="How much will it cost for international service? I'm traveling to France.",
    # input="Why was my last bill so high?"
)

response.to_dict()["output"]
```

```
[{'id': 'msg_67fe92d431548191b7ca6cd604b4784b06efc5beb16b3c5e',
  'content': [{'annotations': [],
    'text': "Hi, you've reached NewTelco, how can I help you? 🌍✈️\n\nYou'd like to know the cost of international service while traveling to France. 🇫🇷 Let me check the latest details for you—one moment, please. 🕑",
    'type': 'output_text'}],
  'role': 'assistant',
  'status': 'completed',
  'type': 'message'},
 {'arguments': '{"topic":"international service cost France"}',
  'call_id': 'call_cF63DLeyhNhwfdyME3ZHd0yo',
  'name': 'lookup_policy_document',
  'type': 'function_call',
  'id': 'fc_67fe92d5d6888191b6cd7cf57f707e4606efc5beb16b3c5e',
  'status': 'completed'}]
```

# 5. General Advice

## Prompt Structure

For reference, here is a good starting point for structuring your prompts.

```
# Role and Objective

# Instructions

## Sub-categories for more detailed instructions

# Reasoning Steps

# Output Format

# Examples
## Example 1

# Context

# Final instructions and prompt to think step by step
```

Add or remove sections to suit your needs, and experiment to determine what's optimal for your usage.

## Delimiters

Here are some general guidelines for selecting the best delimiters for your prompt. Please refer to the Long Context section for special considerations for that context type.

1. Markdown: We recommend starting here, and using markdown titles for major sections and subsections (including deeper hierarchy, to H4+). Use inline backticks or backtick blocks to precisely wrap code, and standard numbered or bulleted lists as needed.
2. XML: These also perform well, and we have improved adherence to information in XML with this model. XML is convenient to precisely wrap a section including start and end, add metadata to the tags for additional context, and enable nesting. Here is an example of using XML tags to nest examples in an example section, with inputs and outputs for each:
```
<examples>
<example1 type="Abbreviate">
<input>San Francisco</input>
<output>- SF</output>
</example1>
</examples>
```
3. JSON is highly structured and well understood by the model, particularly in coding contexts. However, it can be more verbose and requires character escaping that can add overhead.

Guidance specifically for adding a large number of documents or files to input context:

- XML performed well in our long context testing.
  - Example: `<doc id='1' title='The Fox'>The quick brown fox jumps over the lazy dog</doc>`
- This format, proposed by Lee et al. (ref), also performed well in our long context testing.
  - Example: `ID: 1 | TITLE: The Fox | CONTENT: The quick brown fox jumps over the lazy dog`
- JSON performed particularly poorly.
  - Example: `[{'id': 1, 'title': 'The Fox', 'content': 'The quick brown fox jumped over the lazy dog'}]`

The model is trained to robustly understand structure in a variety of formats. Generally, use your judgement and think about what will provide clear information and "stand out" to the model. For example, if you're retrieving documents that contain lots of XML, an XML-based delimiter will likely be less effective.
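
As a small sketch of emitting documents in the pipe-delimited format above (the helper function and document dicts are our own):

```python
# A sketch: render retrieved documents in the "ID | TITLE | CONTENT" style
# that performed well in long context testing.
def format_docs(docs: list[dict]) -> str:
    return "\n".join(
        f"ID: {d['id']} | TITLE: {d['title']} | CONTENT: {d['content']}"
        for d in docs
    )

print(format_docs([{"id": 1, "title": "The Fox", "content": "The quick brown fox jumps over the lazy dog"}]))
# ID: 1 | TITLE: The Fox | CONTENT: The quick brown fox jumps over the lazy dog
```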

## Caveats

- In some isolated cases we have observed the model being resistant to producing very long, repetitive outputs, for example, analyzing hundreds of items one by one. If this is necessary for your use case, instruct the model strongly to output this information in full, and consider breaking down the problem or using a more concise approach.
- We have seen some rare instances of parallel tool calls being incorrect. We advise testing this, and considering setting the `parallel_tool_calls` param to false if you're seeing issues (see the sketch below).
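
A minimal sketch of that setting (tool definitions elided):

```python
# A sketch: turn off parallel tool calls if you observe malformed
# parallel calls on your task.
response = client.responses.create(
    model="gpt-4.1-2025-04-14",
    tools=[...],  # your tool definitions
    parallel_tool_calls=False,
    input="...",
)
```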

# Appendix: Generating and Applying File Diffs

Developers have provided us feedback that accurate and well-formed diff generation is a critical capability to power coding-related tasks. To this end, the GPT-4.1 family features substantially improved diff capabilities relative to previous GPT models. Moreover, while GPT-4.1 has strong performance generating diffs of any format given clear instructions and examples, we open-source here one recommended diff format, on which the model has been extensively trained. We hope that, in particular for developers just starting out, this will take much of the guesswork out of creating diffs yourself.

## Apply Patch

See the example below for a prompt that applies our recommended tool call correctly.

```python
APPLY_PATCH_TOOL_DESC = """This is a custom utility that makes it more convenient to add, remove, move, or edit code files. `apply_patch` effectively allows you to execute a diff/patch against a file, but the format of the diff specification is unique to this task, so pay careful attention to these instructions. To use the `apply_patch` command, you should pass a message of the following structure as "input":

%%bash
apply_patch <<"EOF"
*** Begin Patch
[YOUR_PATCH]
*** End Patch
EOF

Where [YOUR_PATCH] is the actual content of your patch, specified in the following V4A diff format.

*** [ACTION] File: [path/to/file] -> ACTION can be one of Add, Update, or Delete.
For each snippet of code that needs to be changed, repeat the following:
[context_before] -> See below for further instructions on context.
- [old_code] -> Precede the old code with a minus sign.
+ [new_code] -> Precede the new, replacement code with a plus sign.
[context_after] -> See below for further instructions on context.

For instructions on [context_before] and [context_after]:
- By default, show 3 lines of code immediately above and 3 lines immediately below each change. If a change is within 3 lines of a previous change, do NOT duplicate the first change's [context_after] lines in the second change's [context_before] lines.
- If 3 lines of context is insufficient to uniquely identify the snippet of code within the file, use the @@ operator to indicate the class or function to which the snippet belongs. For instance, we might have:
@@ class BaseClass
[3 lines of pre-context]
- [old_code]
+ [new_code]
[3 lines of post-context]

- If a code block is repeated so many times in a class or function such that even a single @@ statement and 3 lines of context cannot uniquely identify the snippet of code, you can use multiple `@@` statements to jump to the right context. For instance:

@@ class BaseClass
@@ def method():
[3 lines of pre-context]
- [old_code]
+ [new_code]
[3 lines of post-context]

Note, then, that we do not use line numbers in this diff format, as the context is enough to uniquely identify code. An example of a message that you might pass as "input" to this function, in order to apply a patch, is shown below.

%%bash
apply_patch <<"EOF"
*** Begin Patch
*** Update File: pygorithm/searching/binary_search.py
@@ class BaseClass
@@ def search():
- pass
+ raise NotImplementedError()

@@ class Subclass
@@ def search():
- pass
+ raise NotImplementedError()

*** End Patch
EOF
"""

APPLY_PATCH_TOOL = {
    "name": "apply_patch",
    "description": APPLY_PATCH_TOOL_DESC,
    "parameters": {
        "type": "object",
        "properties": {
            "input": {
                "type": "string",
                "description": " The apply_patch command that you wish to execute.",
            }
        },
        "required": ["input"],
    },
}
```
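
And a sketch of passing it on a request (the input text is invented; we add the "type" field that function tools in the Responses API expect, since the definition above omits it):

```python
# A sketch: pass the apply_patch tool on a request. The "type" field is
# added here because Responses API function tools require it.
response = client.responses.create(
    model="gpt-4.1-2025-04-14",
    tools=[{"type": "function", **APPLY_PATCH_TOOL}],
    input="Fix the failing binary search in pygorithm/searching/binary_search.py.",
)
```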

## Reference Implementation: apply_patch.py

Here's a reference implementation of the apply_patch tool that we used as part of model training. You'll need to make this an executable and available as `apply_patch` from the shell where the model will execute commands:

```python
#!/usr/bin/env python3

"""
A self-contained **pure-Python 3.9+** utility for applying human-readable
“pseudo-diff” patch files to a collection of text files.
"""

from __future__ import annotations

import pathlib
from dataclasses import dataclass, field
from enum import Enum
from typing import (
    Callable,
    Dict,
    List,
    Optional,
    Tuple,
    Union,
)


# --------------------------------------------------------------------------- #
#  Domain objects
# --------------------------------------------------------------------------- #
class ActionType(str, Enum):
    ADD = "add"
    DELETE = "delete"
    UPDATE = "update"


@dataclass
class FileChange:
    type: ActionType
    old_content: Optional[str] = None
    new_content: Optional[str] = None
    move_path: Optional[str] = None


@dataclass
class Commit:
    changes: Dict[str, FileChange] = field(default_factory=dict)


# --------------------------------------------------------------------------- #
#  Exceptions
# --------------------------------------------------------------------------- #
class DiffError(ValueError):
    """Any problem detected while parsing or applying a patch."""


# --------------------------------------------------------------------------- #
#  Helper dataclasses used while parsing patches
# --------------------------------------------------------------------------- #
@dataclass
class Chunk:
    orig_index: int = -1
    del_lines: List[str] = field(default_factory=list)
    ins_lines: List[str] = field(default_factory=list)


@dataclass
class PatchAction:
    type: ActionType
    new_file: Optional[str] = None
    chunks: List[Chunk] = field(default_factory=list)
    move_path: Optional[str] = None


@dataclass
class Patch:
    actions: Dict[str, PatchAction] = field(default_factory=dict)


# --------------------------------------------------------------------------- #
#  Patch text parser
# --------------------------------------------------------------------------- #
@dataclass
class Parser:
    current_files: Dict[str, str]
    lines: List[str]
    index: int = 0
    patch: Patch = field(default_factory=Patch)
    fuzz: int = 0

    # ------------- low-level helpers -------------------------------------- #
    def _cur_line(self) -> str:
        if self.index >= len(self.lines):
            raise DiffError("Unexpected end of input while parsing patch")
        return self.lines[self.index]

    @staticmethod
    def _norm(line: str) -> str:
        """Strip CR so comparisons work for both LF and CRLF input."""
        return line.rstrip("\r")

    # ------------- scanning convenience ----------------------------------- #
    def is_done(self, prefixes: Optional[Tuple[str, ...]] = None) -> bool:
        if self.index >= len(self.lines):
            return True
        if (
            prefixes
            and len(prefixes) > 0
            and self._norm(self._cur_line()).startswith(prefixes)
        ):
            return True
        return False

    def startswith(self, prefix: Union[str, Tuple[str, ...]]) -> bool:
        return self._norm(self._cur_line()).startswith(prefix)

    def read_str(self, prefix: str) -> str:
        """
        Consume the current line if it starts with *prefix* and return the text
        **after** the prefix.  Raises if prefix is empty.
        """
        if prefix == "":
            raise ValueError("read_str() requires a non-empty prefix")
        if self._norm(self._cur_line()).startswith(prefix):
            text = self._cur_line()[len(prefix) :]
            self.index += 1
            return text
        return ""

    def read_line(self) -> str:
        """Return the current raw line and advance."""
        line = self._cur_line()
        self.index += 1
        return line

    # ------------- public entry point -------------------------------------- #
    def parse(self) -> None:
        while not self.is_done(("*** End Patch",)):
            # ---------- UPDATE ---------- #
            path = self.read_str("*** Update File: ")
            if path:
                if path in self.patch.actions:
                    raise DiffError(f"Duplicate update for file: {path}")
                move_to = self.read_str("*** Move to: ")
                if path not in self.current_files:
                    raise DiffError(f"Update File Error - missing file: {path}")
                text = self.current_files[path]
                action = self._parse_update_file(text)
                action.move_path = move_to or None
                self.patch.actions[path] = action
                continue

            # ---------- DELETE ---------- #
            path = self.read_str("*** Delete File: ")
            if path:
                if path in self.patch.actions:
                    raise DiffError(f"Duplicate delete for file: {path}")
                if path not in self.current_files:
                    raise DiffError(f"Delete File Error - missing file: {path}")
                self.patch.actions[path] = PatchAction(type=ActionType.DELETE)
                continue

            # ---------- ADD ---------- #
            path = self.read_str("*** Add File: ")
            if path:
                if path in self.patch.actions:
                    raise DiffError(f"Duplicate add for file: {path}")
                if path in self.current_files:
                    raise DiffError(f"Add File Error - file already exists: {path}")
                self.patch.actions[path] = self._parse_add_file()
                continue

            raise DiffError(f"Unknown line while parsing: {self._cur_line()}")

        if not self.startswith("*** End Patch"):
            raise DiffError("Missing *** End Patch sentinel")
        self.index += 1  # consume sentinel

    # ------------- section parsers ---------------------------------------- #
    def _parse_update_file(self, text: str) -> PatchAction:
        action = PatchAction(type=ActionType.UPDATE)
        lines = text.split("\n")
        index = 0
        while not self.is_done(
            (
                "*** End Patch",
                "*** Update File:",
                "*** Delete File:",
                "*** Add File:",
                "*** End of File",
            )
        ):
            def_str = self.read_str("@@ ")
            section_str = ""
            if not def_str and self._norm(self._cur_line()) == "@@":
                section_str = self.read_line()

            if not (def_str or section_str or index == 0):
                raise DiffError(f"Invalid line in update section:\n{self._cur_line()}")

            if def_str.strip():
                found = False
                if def_str not in lines[:index]:
                    for i, s in enumerate(lines[index:], index):
                        if s == def_str:
                            index = i + 1
                            found = True
                            break
                if not found and def_str.strip() not in [
                    s.strip() for s in lines[:index]
                ]:
                    for i, s in enumerate(lines[index:], index):
                        if s.strip() == def_str.strip():
                            index = i + 1
                            self.fuzz += 1
                            found = True
                            break

            next_ctx, chunks, end_idx, eof = peek_next_section(self.lines, self.index)
            new_index, fuzz = find_context(lines, next_ctx, index, eof)
            if new_index == -1:
                ctx_txt = "\n".join(next_ctx)
                raise DiffError(
                    f"Invalid {'EOF ' if eof else ''}context at {index}:\n{ctx_txt}"
                )
            self.fuzz += fuzz
            for ch in chunks:
                ch.orig_index += new_index
                action.chunks.append(ch)
            index = new_index + len(next_ctx)
            self.index = end_idx
        return action

    def _parse_add_file(self) -> PatchAction:
        lines: List[str] = []
        while not self.is_done(
            ("*** End Patch", "*** Update File:", "*** Delete File:", "*** Add File:")
        ):
            s = self.read_line()
            if not s.startswith("+"):
                raise DiffError(f"Invalid Add File line (missing '+'): {s}")
            lines.append(s[1:])  # strip leading '+'
        return PatchAction(type=ActionType.ADD, new_file="\n".join(lines))


# --------------------------------------------------------------------------- #
#  Helper functions
# --------------------------------------------------------------------------- #
def find_context_core(
    lines: List[str], context: List[str], start: int
) -> Tuple[int, int]:
    if not context:
        return start, 0

    for i in range(start, len(lines)):
        if lines[i : i + len(context)] == context:
            return i, 0
    for i in range(start, len(lines)):
        if [s.rstrip() for s in lines[i : i + len(context)]] == [
            s.rstrip() for s in context
        ]:
            return i, 1
    for i in range(start, len(lines)):
        if [s.strip() for s in lines[i : i + len(context)]] == [
            s.strip() for s in context
        ]:
            return i, 100
    return -1, 0


def find_context(
    lines: List[str], context: List[str], start: int, eof: bool
) -> Tuple[int, int]:
    if eof:
        new_index, fuzz = find_context_core(lines, context, len(lines) - len(context))
        if new_index != -1:
            return new_index, fuzz
        new_index, fuzz = find_context_core(lines, context, start)
        return new_index, fuzz + 10_000
    return find_context_core(lines, context, start)


def peek_next_section(
    lines: List[str], index: int
) -> Tuple[List[str], List[Chunk], int, bool]:
    old: List[str] = []
    del_lines: List[str] = []
    ins_lines: List[str] = []
    chunks: List[Chunk] = []
    mode = "keep"
    orig_index = index

    while index < len(lines):
        s = lines[index]
        if s.startswith(
            (
                "@@",
                "*** End Patch",
                "*** Update File:",
                "*** Delete File:",
                "*** Add File:",
                "*** End of File",
            )
        ):
            break
        if s == "***":
            break
        if s.startswith("***"):
            raise DiffError(f"Invalid Line: {s}")
        index += 1

        last_mode = mode
        if s == "":
            s = " "
        if s[0] == "+":
            mode = "add"
        elif s[0] == "-":
            mode = "delete"
        elif s[0] == " ":
            mode = "keep"
        else:
            raise DiffError(f"Invalid Line: {s}")
        s = s[1:]

        if mode == "keep" and last_mode != mode:
            if ins_lines or del_lines:
                chunks.append(
|
898 |
+
Chunk(
|
899 |
+
orig_index=len(old) - len(del_lines),
|
900 |
+
del_lines=del_lines,
|
901 |
+
ins_lines=ins_lines,
|
902 |
+
)
|
903 |
+
)
|
904 |
+
del_lines, ins_lines = [], []
|
905 |
+
|
906 |
+
if mode == "delete":
|
907 |
+
del_lines.append(s)
|
908 |
+
old.append(s)
|
909 |
+
elif mode == "add":
|
910 |
+
ins_lines.append(s)
|
911 |
+
elif mode == "keep":
|
912 |
+
old.append(s)
|
913 |
+
|
914 |
+
if ins_lines or del_lines:
|
915 |
+
chunks.append(
|
916 |
+
Chunk(
|
917 |
+
orig_index=len(old) - len(del_lines),
|
918 |
+
del_lines=del_lines,
|
919 |
+
ins_lines=ins_lines,
|
920 |
+
)
|
921 |
+
)
|
922 |
+
|
923 |
+
if index < len(lines) and lines[index] == "*** End of File":
|
924 |
+
index += 1
|
925 |
+
return old, chunks, index, True
|
926 |
+
|
927 |
+
if index == orig_index:
|
928 |
+
raise DiffError("Nothing in this section")
|
929 |
+
return old, chunks, index, False
|
930 |
+
|
931 |
+
|
932 |
+
# --------------------------------------------------------------------------- #
|
933 |
+
# Patch → Commit and Commit application
|
934 |
+
# --------------------------------------------------------------------------- #
|
935 |
+
def _get_updated_file(text: str, action: PatchAction, path: str) -> str:
|
936 |
+
if action.type is not ActionType.UPDATE:
|
937 |
+
raise DiffError("_get_updated_file called with non-update action")
|
938 |
+
orig_lines = text.split("\n")
|
939 |
+
dest_lines: List[str] = []
|
940 |
+
orig_index = 0
|
941 |
+
|
942 |
+
for chunk in action.chunks:
|
943 |
+
if chunk.orig_index > len(orig_lines):
|
944 |
+
raise DiffError(
|
945 |
+
f"{path}: chunk.orig_index {chunk.orig_index} exceeds file length"
|
946 |
+
)
|
947 |
+
if orig_index > chunk.orig_index:
|
948 |
+
raise DiffError(
|
949 |
+
f"{path}: overlapping chunks at {orig_index} > {chunk.orig_index}"
|
950 |
+
)
|
951 |
+
|
952 |
+
dest_lines.extend(orig_lines[orig_index : chunk.orig_index])
|
953 |
+
orig_index = chunk.orig_index
|
954 |
+
|
955 |
+
dest_lines.extend(chunk.ins_lines)
|
956 |
+
orig_index += len(chunk.del_lines)
|
957 |
+
|
958 |
+
dest_lines.extend(orig_lines[orig_index:])
|
959 |
+
return "\n".join(dest_lines)
|
960 |
+
|
961 |
+
|
962 |
+
def patch_to_commit(patch: Patch, orig: Dict[str, str]) -> Commit:
|
963 |
+
commit = Commit()
|
964 |
+
for path, action in patch.actions.items():
|
965 |
+
if action.type is ActionType.DELETE:
|
966 |
+
commit.changes[path] = FileChange(
|
967 |
+
type=ActionType.DELETE, old_content=orig[path]
|
968 |
+
)
|
969 |
+
elif action.type is ActionType.ADD:
|
970 |
+
if action.new_file is None:
|
971 |
+
raise DiffError("ADD action without file content")
|
972 |
+
commit.changes[path] = FileChange(
|
973 |
+
type=ActionType.ADD, new_content=action.new_file
|
974 |
+
)
|
975 |
+
elif action.type is ActionType.UPDATE:
|
976 |
+
new_content = _get_updated_file(orig[path], action, path)
|
977 |
+
commit.changes[path] = FileChange(
|
978 |
+
type=ActionType.UPDATE,
|
979 |
+
old_content=orig[path],
|
980 |
+
new_content=new_content,
|
981 |
+
move_path=action.move_path,
|
982 |
+
)
|
983 |
+
return commit
|
984 |
+
|
985 |
+
|
986 |
+
# --------------------------------------------------------------------------- #
|
987 |
+
# User-facing helpers
|
988 |
+
# --------------------------------------------------------------------------- #
|
989 |
+
def text_to_patch(text: str, orig: Dict[str, str]) -> Tuple[Patch, int]:
|
990 |
+
lines = text.splitlines() # preserves blank lines, no strip()
|
991 |
+
if (
|
992 |
+
len(lines) < 2
|
993 |
+
or not Parser._norm(lines[0]).startswith("*** Begin Patch")
|
994 |
+
or Parser._norm(lines[-1]) != "*** End Patch"
|
995 |
+
):
|
996 |
+
raise DiffError("Invalid patch text - missing sentinels")
|
997 |
+
|
998 |
+
parser = Parser(current_files=orig, lines=lines, index=1)
|
999 |
+
parser.parse()
|
1000 |
+
return parser.patch, parser.fuzz
|
1001 |
+
|
1002 |
+
|
1003 |
+
def identify_files_needed(text: str) -> List[str]:
|
1004 |
+
lines = text.splitlines()
|
1005 |
+
return [
|
1006 |
+
line[len("*** Update File: ") :]
|
1007 |
+
for line in lines
|
1008 |
+
if line.startswith("*** Update File: ")
|
1009 |
+
] + [
|
1010 |
+
line[len("*** Delete File: ") :]
|
1011 |
+
for line in lines
|
1012 |
+
if line.startswith("*** Delete File: ")
|
1013 |
+
]
|
1014 |
+
|
1015 |
+
|
1016 |
+
def identify_files_added(text: str) -> List[str]:
|
1017 |
+
lines = text.splitlines()
|
1018 |
+
return [
|
1019 |
+
line[len("*** Add File: ") :]
|
1020 |
+
for line in lines
|
1021 |
+
if line.startswith("*** Add File: ")
|
1022 |
+
]
|
1023 |
+
|
1024 |
+
|
1025 |
+
# --------------------------------------------------------------------------- #
|
1026 |
+
# File-system helpers
|
1027 |
+
# --------------------------------------------------------------------------- #
|
1028 |
+
def load_files(paths: List[str], open_fn: Callable[[str], str]) -> Dict[str, str]:
|
1029 |
+
return {path: open_fn(path) for path in paths}
|
1030 |
+
|
1031 |
+
|
1032 |
+
def apply_commit(
|
1033 |
+
commit: Commit,
|
1034 |
+
write_fn: Callable[[str, str], None],
|
1035 |
+
remove_fn: Callable[[str], None],
|
1036 |
+
) -> None:
|
1037 |
+
for path, change in commit.changes.items():
|
1038 |
+
if change.type is ActionType.DELETE:
|
1039 |
+
remove_fn(path)
|
1040 |
+
elif change.type is ActionType.ADD:
|
1041 |
+
if change.new_content is None:
|
1042 |
+
raise DiffError(f"ADD change for {path} has no content")
|
1043 |
+
write_fn(path, change.new_content)
|
1044 |
+
elif change.type is ActionType.UPDATE:
|
1045 |
+
if change.new_content is None:
|
1046 |
+
raise DiffError(f"UPDATE change for {path} has no new content")
|
1047 |
+
target = change.move_path or path
|
1048 |
+
write_fn(target, change.new_content)
|
1049 |
+
if change.move_path:
|
1050 |
+
remove_fn(path)
|
1051 |
+
|
1052 |
+
|
1053 |
+
def process_patch(
|
1054 |
+
text: str,
|
1055 |
+
open_fn: Callable[[str], str],
|
1056 |
+
write_fn: Callable[[str, str], None],
|
1057 |
+
remove_fn: Callable[[str], None],
|
1058 |
+
) -> str:
|
1059 |
+
if not text.startswith("*** Begin Patch"):
|
1060 |
+
raise DiffError("Patch text must start with *** Begin Patch")
|
1061 |
+
paths = identify_files_needed(text)
|
1062 |
+
orig = load_files(paths, open_fn)
|
1063 |
+
patch, _fuzz = text_to_patch(text, orig)
|
1064 |
+
commit = patch_to_commit(patch, orig)
|
1065 |
+
apply_commit(commit, write_fn, remove_fn)
|
1066 |
+
return "Done!"
|
1067 |
+
|
1068 |
+
|
1069 |
+
# --------------------------------------------------------------------------- #
|
1070 |
+
# Default FS helpers
|
1071 |
+
# --------------------------------------------------------------------------- #
|
1072 |
+
def open_file(path: str) -> str:
|
1073 |
+
with open(path, "rt", encoding="utf-8") as fh:
|
1074 |
+
return fh.read()
|
1075 |
+
|
1076 |
+
|
1077 |
+
def write_file(path: str, content: str) -> None:
|
1078 |
+
target = pathlib.Path(path)
|
1079 |
+
target.parent.mkdir(parents=True, exist_ok=True)
|
1080 |
+
with target.open("wt", encoding="utf-8") as fh:
|
1081 |
+
fh.write(content)
|
1082 |
+
|
1083 |
+
|
1084 |
+
def remove_file(path: str) -> None:
|
1085 |
+
pathlib.Path(path).unlink(missing_ok=True)
|
1086 |
+
|
1087 |
+
|
1088 |
+
# --------------------------------------------------------------------------- #
|
1089 |
+
# CLI entry-point
|
1090 |
+
# --------------------------------------------------------------------------- #
|
1091 |
+
def main() -> None:
|
1092 |
+
import sys
|
1093 |
+
|
1094 |
+
patch_text = sys.stdin.read()
|
1095 |
+
if not patch_text:
|
1096 |
+
print("Please pass patch text through stdin", file=sys.stderr)
|
1097 |
+
return
|
1098 |
+
try:
|
1099 |
+
result = process_patch(patch_text, open_file, write_file, remove_file)
|
1100 |
+
except DiffError as exc:
|
1101 |
+
print(exc, file=sys.stderr)
|
1102 |
+
return
|
1103 |
+
print(result)
|
1104 |
+
|
1105 |
+
|
1106 |
+
if __name__ == "__main__":
|
1107 |
+
main()
|
1108 |
+
```
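
As a quick sanity check of the reference implementation above, the sketch below applies a small V4A patch to an in-memory dictionary instead of the real filesystem. The `hello.py` file and its contents are invented for illustration; only `process_patch` and its callback signature come from the code above.

```python
# Illustrative only: apply a V4A patch to in-memory "files" using the
# process_patch() helper defined above. File name and contents are made up.
files = {"hello.py": "def greet():\n    return 'hi'\n"}

patch = """*** Begin Patch
*** Update File: hello.py
@@ def greet():
-    return 'hi'
+    return 'hello'
*** End Patch"""

result = process_patch(
    patch,
    open_fn=files.__getitem__,               # read a "file" from the dict
    write_fn=files.__setitem__,              # write updated content back
    remove_fn=lambda p: files.pop(p, None),  # delete a "file"
)
print(result)             # -> Done!
print(files["hello.py"])  # -> def greet():\n    return 'hello'\n
```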

Other Effective Diff Formats

If you want to try using a different diff format, we found in testing that the SEARCH/REPLACE diff format used in Aider’s polyglot benchmark, as well as a pseudo-XML format with no internal escaping, both had high success rates.

These diff formats share two key aspects: (1) they do not use line numbers, and (2) they provide both the exact code to be replaced, and the exact code with which to replace it, with clear delimiters between the two.

```python
SEARCH_REPLACE_DIFF_EXAMPLE = """
path/to/file.py
<<<<<<< SEARCH
def search():
    pass
=======
def search():
    raise NotImplementedError()
>>>>>>> REPLACE
"""

PSEUDO_XML_DIFF_EXAMPLE = """
<edit>
<file>
path/to/file.py
</file>
<old_code>
def search():
    pass
</old_code>
<new_code>
def search():
    raise NotImplementedError()
</new_code>
</edit>
"""
```
LICENSE
ADDED
@@ -0,0 +1,21 @@
MIT License

Copyright (c) 2025 ghchris2021

Permission is hereby granted, free of charge, to any person obtaining a copy
of this software and associated documentation files (the "Software"), to deal
in the Software without restriction, including without limitation the rights
to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
copies of the Software, and to permit persons to whom the Software is
furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all
copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
SOFTWARE.
README.md
ADDED
@@ -0,0 +1,234 @@
# [OpenAI Cookbook Pro](https://chatgpt.com/canvas/shared/6825e9f6e8d88191bf9ef4de00b29b0f)

### Developer Tools: [Universal Runtime](https://github.com/davidkimai/universal-runtime) | [Universal Developer](https://github.com/davidkimai/universal-developer)

**An Advanced Implementation Guide to GPT-4.1: Real-World Applications, Prompting Strategies, and Agent Workflows**

Welcome to **OpenAI Cookbook Pro** — a comprehensive, practical, and fully extensible resource tailored for engineers, developers, and researchers working with the GPT-4.1 API and related OpenAI tools. This repository distills best practices, integrates field-tested strategies, and supports high-performing workflows with enhanced reliability, precision, and developer autonomy.

> If you're familiar with the original OpenAI Cookbook, think of this project as an expanded version designed for production-grade deployments, advanced prompt development, tool integration, and agent design.

## 🔧 What This Cookbook Offers

* **Structured examples** of effective prompting for instruction following, planning, tool usage, and dynamic interactions.
* **Agent design frameworks** built around persistent task completion and context-aware iteration.
* **Tool integration patterns** using OpenAI's native tool-calling API — optimized for accuracy and reliability.
* **Custom workflows** for coding tasks, debugging, testing, and patch management.
* **Long-context strategies** including prompt shaping, content selection, and information compression for up to 1M tokens.
* **Production-aligned system prompts** for customer service, support bots, and autonomous coding agents.

Whether you're building an agent to manage codebases or optimizing a high-context knowledge retrieval system, the examples here aim to be direct, reproducible, and extensible.

## 📘 Table of Contents

1. [Getting Started](#getting-started)
2. [Prompting for Instruction Following](#prompting-for-instruction-following)
3. [Designing Agent Workflows](#designing-agent-workflows)
4. [Tool Use and Integration](#tool-use-and-integration)
5. [Chain of Thought and Planning](#chain-of-thought-and-planning)
6. [Handling Long Contexts](#handling-long-contexts)
7. [Code Fixing and Diff Management](#code-fixing-and-diff-management)
8. [Real-World Deployment Scenarios](#real-world-deployment-scenarios)
9. [Prompt Engineering Reference Guide](#prompt-engineering-reference-guide)
10. [API Usage Examples](#api-usage-examples)

## Getting Started

OpenAI Cookbook Pro assumes a basic working knowledge of OpenAI’s Python SDK, the GPT-4.1 API, and how to use the `functions`, `tools`, and `system prompt` fields.

If you're new to OpenAI's tools, start here:

* [OpenAI Platform Documentation](https://platform.openai.com/docs)
* [Original OpenAI Cookbook](https://github.com/openai/openai-cookbook)

This project builds on those foundations, layering in advanced workflows and reproducible examples for:

* Task persistence
* Iterative debugging
* Prompt shaping and behavior targeting
* Multi-step tool planning

## Prompting for Instruction Following

GPT-4.1’s instruction-following capabilities have been significantly improved. To ensure the model performs consistently:

* Be explicit. Literal instruction following means subtle ambiguities may derail output.
* Use clear formatting for instruction sets (Markdown, XML, or numbered lists).
* Place instructions **at both the top and bottom** of long prompts if the context window exceeds 100K tokens.

### Example: Instruction Template

```markdown
# Instructions
1. Read the user’s message carefully.
2. Do not generate a response until you've gathered all needed context.
3. Use a tool if more information is required.
4. Only respond when you can complete the request correctly.
```

> See `/examples/instruction-following.md` for more variations and system prompt styles.

## Designing Agent Workflows

GPT-4.1 supports agentic workflows that require multi-step planning, tool usage, and long turn durations. Designing effective agents starts with a disciplined structure:

### Include Three System Prompt Anchors:

* **Persistence**: Emphasize that the model should continue until task completion.
* **Tool usage**: Make it clear that it must use tools if it lacks context.
* **Planning**: Encourage the model to write out plans and reflect after each action.

See `/agent_design/swe_bench_agent.md` for a complete agent example that solves live bugs in open-source repositories.

## Tool Use and Integration

Leverage the `tools` parameter in OpenAI's API to define functional calls. Avoid embedding tool descriptions in prompts — the model performs better when tools are registered explicitly.

### Tool Guidelines

* Name your tools clearly.
* Keep descriptions concise but specific.
* Provide optional examples in a dedicated `# Examples` section.

> Tool-based prompting increases reliability, reduces hallucinations, and helps maintain output consistency.
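
As a minimal sketch of this pattern (the tool name, schema, and model string below are illustrative, not part of this repository):

```python
# Sketch: register a tool via the API's `tools` parameter instead of
# describing it in the prompt. The tool name and schema are hypothetical.
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "lookup_order_status",  # hypothetical tool
            "description": "Look up the current status of an order by ID.",
            "parameters": {
                "type": "object",
                "properties": {
                    "order_id": {"type": "string", "description": "Order identifier"},
                },
                "required": ["order_id"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "Where is order 1234?"}],
    tools=tools,
)
print(response.choices[0].message.tool_calls)
```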

## Chain of Thought and Planning

While GPT-4.1 does not inherently perform internal reasoning, it can be prompted to **think out loud**:

```markdown
First, identify what documents may be relevant. Then list their titles and relevance. Finally, provide a list of IDs sorted by importance.
```

Use structured strategies to enforce planning:

1. Break down the query.
2. Retrieve and assess context.
3. Prioritize response steps.
4. Deliver a refined output.

> See `/prompting/chain_of_thought.md` for templates and performance impact.

## Handling Long Contexts

GPT-4.1 supports up to **1 million tokens**. To manage this effectively:

* Use structure: XML or markdown sections help the model parse relevance.
* Repeat critical instructions **at the top and bottom** of your prompt.
* Scope responses by separating external context from user queries.

### Example Format

```xml
<instructions>
Only answer based on External Context. Do not make assumptions.
</instructions>
<user_query>
How does the billing policy apply to usage overages?
</user_query>
<context>
<doc id="12" title="Billing Policy">
[...]
</doc>
</context>
```

> See `/examples/long-context-formatting.md` for formatting guidance.

## Code Fixing and Diff Management

GPT-4.1 includes support for a **tool-compatible diff format** that enables:

* Patch generation
* File updates
* Inline modifications with full context

Use the `apply_patch` tool with the recommended V4A diff format. Always:

* Use clear before/after code snippets
* Avoid relying on line numbers
* Use `@@` markers to indicate scope

> See `/tools/apply_patch_examples/` for real-world patch workflows.

## Real-World Deployment Scenarios

### Use Cases

* **Support automation** using grounded answers and clear tool policies
* **Code refactoring bots** that operate on large repositories
* **Document summarization** across thousands of pages
* **High-integrity report generation** from structured prompt templates

Each scenario includes:

* Prompt formats
* Tool definitions
* Behavior checks

> Explore the `/scenarios/` folder for ready-to-run templates.

## Prompt Engineering Reference Guide

A distilled reference for designing robust prompts across various tasks.

### Sections:

* General prompt structures
* Common failure patterns
* Formatting styles (Markdown, XML, JSON)
* Long-context techniques
* Instruction conflict resolution

> Found in `/reference/prompting_guide.md`

## API Usage Examples

Includes starter scripts and walkthroughs for:

* Tool registration
* Chat prompt design
* Instruction tuning
* Streaming outputs

All examples use official OpenAI SDK patterns and can be run locally.

## Contributing

We welcome contributions that:

* Improve clarity
* Extend agent workflows
* Add new prompt techniques
* Introduce tool examples

To contribute:

1. Fork the repo
2. Create a new folder under `/examples` or `/tools`
3. Submit a PR with a brief description of your addition

## License

This project is released under the MIT License.

## Acknowledgments

This repository builds upon the foundational work of the original [OpenAI Cookbook](https://github.com/openai/openai-cookbook). All strategies are derived from real-world testing, usage analysis, and OpenAI’s 4.1 Prompting Guide (April 2025).

For support or suggestions, feel free to open an issue or connect via [OpenAI Developer Forum](https://community.openai.com).
api_usage.md
ADDED
@@ -0,0 +1,326 @@
# [API Usage Examples with GPT-4.1](https://chatgpt.com/canvas/shared/6825f96694a48191af7648cad2996158)

## Overview

This guide provides detailed, real-world examples of using the OpenAI GPT-4.1 API effectively, with a focus on instruction-following, tool integration, agent persistence, and prompt structuring. These examples are designed to help developers and engineers build resilient, production-ready systems using GPT-4.1 across various applications, including customer service, bug fixing, document analysis, and data labeling.

Each example illustrates system prompt construction, tool schema definitions, interaction workflows, and failure mitigation strategies.

## Example 1: Customer Support Agent with Tool Use

### Objective

Deploy a GPT-4.1 assistant to handle user questions about policies and account status.

### System Prompt

```markdown
You are a helpful assistant for NewTelco. Your job is to assist users with account and policy information.

# Instructions
- Always greet the user.
- Use tools to retrieve account and policy information.
- If needed data is missing, ask the user before calling tools.
- Avoid internal knowledge on restricted topics.

# Workflow
1. Identify intent
2. Call tools if sufficient input
3. Message user before and after tool calls
4. Cite results
```

### Tools

```json
[
  {
    "name": "lookup_policy_document",
    "description": "Searches internal docs by topic",
    "parameters": {
      "type": "object",
      "properties": {
        "topic": {"type": "string"}
      },
      "required": ["topic"]
    }
  },
  {
    "name": "get_user_account_info",
    "description": "Retrieves account details using phone number",
    "parameters": {
      "type": "object",
      "properties": {
        "phone_number": {"type": "string"}
      },
      "required": ["phone_number"]
    }
  }
]
```

### User Input

```text
How much will it cost for international service in France?
```

### Assistant Output

```text
Hi, you've reached NewTelco, how can I help you? 🌍✈️

You're asking about international service while traveling to France. Let me check that for you — one moment, please. 🕑
```

### Tool Call

```json
{
  "name": "lookup_policy_document",
  "arguments": {"topic": "international service cost France"}
}
```

## Example 2: Software Bug Fix Agent with Diff Application

### Objective

Use GPT-4.1 to investigate and fix a bug in a codebase using the `apply_patch` tool.

### System Prompt Highlights

```markdown
# Objective
Fix the issue provided by the user.

# Instructions
- Plan each step
- Reflect after each function call
- Never guess code — read it first using tools
- Only stop when all tests pass

# Workflow
1. Understand issue deeply
2. Investigate codebase
3. Draft patch
4. Apply patch
5. Run tests
6. Reflect and finalize
```

### Tool Definition

```json
{
  "name": "python",
  "description": "Execute code or apply a patch",
  "parameters": {
    "type": "object",
    "properties": {
      "input": {"type": "string"}
    },
    "required": ["input"]
  }
}
```

### Tool Call Example

```bash
%%bash
apply_patch <<"EOF"
*** Begin Patch
*** Update File: src/core.py
@@ def is_valid():
- return False
+ return True
*** End Patch
EOF
```

### Test Execution

```json
{
  "name": "python",
  "arguments": {"input": "!python3 run_tests.py"}
}
```

## Example 3: Long-Context Document Analyzer

### Objective

Summarize and extract insights from up to 1M tokens of context.

### Prompt Sections

```markdown
# Instructions
- Process documents in 10k token blocks
- Reflect after each segment
- Label relevance and extract core ideas

# Strategy
1. Read → summarize
2. Score relevance
3. Synthesize into unified output
```

### Input Format

```xml
<doc id="21" title="Policy Update">
<summary>Changes to international billing rules</summary>
<content>...</content>
</doc>
```

### Assistant Behavior

* Chunk input into 10k token sections
* After each, provide a summary and document scores
* Compile findings at end (see the chunking sketch below)
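
A rough sketch of the chunking step follows; it assumes the `tiktoken` package and the `o200k_base` encoding, neither of which the example above prescribes.

```python
# Rough sketch: split raw text into ~10k-token blocks for iterative
# summarization. tiktoken and the o200k_base encoding are assumptions.
import tiktoken

def chunk_by_tokens(text: str, max_tokens: int = 10_000):
    enc = tiktoken.get_encoding("o200k_base")
    tokens = enc.encode(text)
    for start in range(0, len(tokens), max_tokens):
        # Decode each block back to text for the next summarization pass.
        yield enc.decode(tokens[start : start + max_tokens])
```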

## Example 4: Data Labeling Assistant

### Objective

Assist with structured classification tasks.

### Prompt Template

```markdown
# Instructions
- Label each entry using the provided schema
- Do not guess; if unsure, flag for human

# Labeling Categories
- Urgent
- Normal
- Spam

# Output Format
{"text": ..., "label": ...}

# Example
{"text": "Win money now!", "label": "Spam"}
```

### User Input

```json
[
  "New system update available",
  "Limited time offer! Click now",
  "Server crashed, need help ASAP"
]
```

### Assistant Output

```json
[
  {"text": "New system update available", "label": "Normal"},
  {"text": "Limited time offer! Click now", "label": "Spam"},
  {"text": "Server crashed, need help ASAP", "label": "Urgent"}
]
```

## Example 5: Chain-of-Thought for Multi-Hop Reasoning

### Objective

Support a planning task by explicitly breaking down the steps.

### Prompt Template

```markdown
# Instructions
First, think carefully step by step. Then output the result.

# Reasoning Strategy
1. Identify user question
2. Extract context
3. Connect information across documents
4. Output answer
```

### Example Input

```markdown
# User Question
How did the billing policy change after 2022?

# Context
<doc id="10" title="Policy 2022">...</doc>
<doc id="12" title="Policy 2023">...</doc>
```

### Model Output

```text
Step 1: Identify relevant documents → IDs 10, 12
Step 2: Compare clauses
Step 3: 2022 had flat rates, 2023 added time-of-use billing
Answer: Billing policy changed to time-based pricing in 2023.
```

## General Prompt Formatting Guidelines

### Preferred Structure

```markdown
# Role
# Instructions
# Workflow (optional)
# Reasoning Strategy (optional)
# Output Format
# Examples (optional)
```

### Tool Use Reminders

* Only call tools when sufficient information is available
* Always notify the user before and after calls
* Use example-triggered calls for teaching tool behavior

### Output Patterns

* JSON or markdown preferred
* Cite source documents if used
* Include fallback responses if uncertain (e.g., "Insufficient context"); see the validation sketch below
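
One minimal way to back the last two points, assuming the prompt requested JSON output, is to parse the reply before returning it; the helper below is illustrative.

```python
# Sketch: validate model JSON output before passing it downstream,
# falling back to a structured "insufficient context" response.
import json

def safe_parse(model_output: str) -> dict:
    try:
        return json.loads(model_output)
    except json.JSONDecodeError:
        return {"error": "Insufficient context", "raw": model_output}
```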

## Best Practices Summary

| Element      | Best Practice                                  |
| ------------ | ---------------------------------------------- |
| Tool Calls   | Always define schema with strong param names   |
| Planning     | Enforce pre- and post-action reflection        |
| Output       | Enforce format, validate JSON before response  |
| Long Context | Use structured delimiters (Markdown, XML)      |
| Labeling     | Use few-shot examples and explicit categories  |
| Diff Format  | Use V4A patch format for code updates          |

## Final Note

These examples are starting templates. Each system will benefit from iterative refinements, structured logging, and real-world user testing. Maintain modular prompts and tool schemas, and adopt evaluation frameworks to monitor performance over time.

**Clarity, structure, and instruction adherence are the cornerstones of production-grade GPT-4.1 API design.**
chain_of_thought_planning.md
ADDED
@@ -0,0 +1,195 @@
# [Chain of Thought and Planning in GPT-4.1](https://chatgpt.com/canvas/shared/6825f035f4b8819188e481e6e5cab29e)

## Overview

This document serves as a comprehensive and standalone guide for implementing effective chain-of-thought prompting and planning techniques with the OpenAI GPT-4.1 model family. It draws from official prompt engineering strategies outlined in the OpenAI 4.1 Cookbook and translates them into an accessible, implementation-ready format for developers, researchers, and product engineers.

## Key Goals

1. Enable step-by-step problem-solving via structured reasoning.
2. Amplify agentic behavior in tool-using contexts.
3. Minimize hallucinations by encouraging reflective planning.
4. Improve task completion rates in software engineering and knowledge work.
5. Align prompt design with model strengths in instruction-following and long-context awareness.

## Core Principles

### 1. Chain-of-Thought (CoT) Induction

GPT-4.1 does not natively reason before answering; however, it can be prompted to simulate reasoning through structured instructions. This is known as "chain-of-thought prompting."

**Prompting Template:**

> "Before answering, think step by step about what’s needed to solve the task. Then begin executing."

Chain-of-thought is especially effective when applied to:

* Multi-hop reasoning questions
* Complex analytical tasks
* Document triage and synthesis
* Code tracing and debugging

### 2. Agentic Planning

The model can be transformed into a more proactive, autonomous agent through three types of reminders:

* **Persistence Reminder:** Encourages continuation across multiple turns.
* **Tool-Use Reminder:** Discourages guessing; reinforces fact-finding.
* **Planning Reminder:** Encourages step-by-step thinking before and after tool use.

**Agentic Prompting Snippet:**

```text
You are an agent. Keep going until the query is fully resolved. Use tools instead of guessing. Plan your actions and reflect after each step.
```

This significantly increases model adherence to goals and improves results in complex domains like software engineering, particularly on structured benchmarks like SWE-bench Verified.
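
A concrete sketch of wiring these reminders into a request follows; the SDK client and model string are assumptions for illustration, not part of the original guidance.

```python
# Sketch: compose the agentic reminders into a single system prompt.
# Client setup and model name are assumptions for illustration.
from openai import OpenAI

client = OpenAI()

SYSTEM_PROMPT = (
    "You are an agent. Keep going until the query is fully resolved. "
    "Use tools instead of guessing. Plan your actions and reflect after each step."
)

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Find and fix the failing test in my repo."},
    ],
)
print(response.choices[0].message.content)
```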

### 3. Explicit Workflow Structuring

Providing workflows as ordered lists increases adherence and performance. This creates a "mental model" the assistant follows.

**Example Workflow:**

```text
1. Understand the query.
2. Identify relevant context.
3. Create a solution plan.
4. Execute steps incrementally.
5. Verify and test.
6. Reflect and iterate.
```

This structure serves a dual purpose: it guides the model and signals the assistant's reasoning process to users.

### 4. Contextual Grounding

In long-context situations (e.g., 100K+ token sessions), instruction placement matters:

* **Place instructions at both the start and end of context blocks.**
* **Use markdown or XML delimiters for structure.**

Avoid JSON when loading multiple documents; XML or structured markdown outperforms it.

### 5. Output Control Through Instruction Templates

Instruction adherence improves when you:

* Start with high-level **Response Rules**.
* Follow with a **Step-by-Step Plan**.
* Include examples demonstrating the expected behavior.
* End with an instruction to think step by step.

**Example Prompt Structure:**

```markdown
# Instructions
- Respond concisely.
- Think before acting.
- Use only tools provided.

# Steps
1. Interpret the question.
2. Search the context.
3. Synthesize the answer.

# Example
**Q:** What caused the error?
**A:** Let's review the logs first...

# Final Thought Instruction
Think step by step before answering.
```

## Planning in Practice

Below is a sample prompt segment leveraging all core planning and chain-of-thought features:

```text
You must:
- Plan extensively before calling any function.
- Reflect on outcomes after each call.
- Do not chain tools blindly.
- Be cautious of false positives or early stopping.
- Your solution must pass all tests, including hidden ones.

Always verify:
- Is your solution logically sound?
- Have you tested edge cases?
- Are additional test cases required?
```

This style boosts planning performance by up to 4% on SWE-bench according to OpenAI’s own testing.

## Debugging Chain-of-Thought Failures

Chain-of-thought prompts may fail due to:

* Ambiguous user intent
* Misidentification of relevant context
* Overly abstract plans without execution

**Countermeasures:**

* Break user queries into sub-components.
* Have the model rate the relevance of documents.
* Include specific test cases as checksums for correct reasoning.

**Correction Template:**

```text
Let’s revise. Where did the plan fail? What assumption was wrong? Was context misused?
```

## Long-Context Planning Strategies

When context windows expand to 1M tokens:

* Encourage summarization between reasoning steps.
* Anchor sub-conclusions before proceeding.
* Repeat critical instructions at interval checkpoints.

**Chunked Reasoning Pattern:**

```text
Summarize findings every 10,000 tokens.
Checkpoint progress with titles and delimiters.
Reflect before moving to the next section.
```

## Tool Use Integration

GPT-4.1 supports structured tool calls (functions, APIs, CLI commands). Effective planning enhances tool use via:

* Context-aware parameter setting
* Post-tool-call reflection
* Avoiding premature tool use

**Tool Use Best Practices:**

* Name tools clearly and descriptively
* Provide concise, structured descriptions
* Offer usage examples outside of the tool schema

## Practical Use Cases

* **Software Agents**: Reliable plan-execute-reflect loops
* **Data Analysis**: Step-by-step exploration of CSVs or logs
* **Scientific Reasoning**: Layered hypothesis evaluation
* **Customer Service Bots**: Pre-check user input → tool call → output validation

## Future-Proofing Your Prompts

Prompting is an empirical, iterative process. Maintain versioned prompt libraries and monitor:

* Performance regressions
* Latency vs. completeness tradeoffs
* Tool call efficiency
* Instruction adherence

Track systematic errors over time and codify high-performing reasoning strategies into your core prompts.

## Summary

Chain-of-thought and planning, when intentionally embedded in GPT-4.1 prompts, unlock powerful new workflows for complex reasoning, debugging, and autonomous task completion. While GPT-4.1 does not reason innately, its ability to simulate planning and stepwise logic makes it a potent co-processor for advanced tasks.

**Start with clarity. Plan before acting. Reflect after execution.** That is the path to leveraging GPT-4.1 effectively for sophisticated agentic behavior.
code_fixing_and_diff.md
ADDED
@@ -0,0 +1,254 @@
# [Code Fixing and Diff Management in GPT-4.1](https://chatgpt.com/canvas/shared/6825f21e65388191b9fb0baa737c1f18)

## Overview

This document provides a comprehensive implementation guide for code fixing and diff generation strategies using the OpenAI GPT-4.1 model. It is designed to help developers and tool builders harness the model’s improved agentic behavior, tool integration, and patch application capabilities. The guidance herein is based on OpenAI’s internal agentic workflows, as tested on SWE-bench Verified and related coding benchmarks.

## Objectives

* Enable GPT-4.1 to autonomously fix software bugs with minimal user intervention
* Standardize high-performance diff formats that GPT-4.1 understands well
* Leverage tool-calling strategies that minimize hallucination and improve precision
* Scaffold workflows for validation, patch application, and iterative debugging

## Core Principles for Effective Bug Fixing

### 1. Persistent Multi-Step Execution

To prevent premature termination, always instruct the model to:

```text
Continue working until the issue is fully resolved. Do not return control to the user unless the fix is complete and validated.
```

This aligns GPT-4.1’s behavior with full agent-mode operation.

### 2. Tool-Use Encouragement

Rather than letting the model hallucinate file contents:

```text
Use your tools to examine the file system or source code. Never guess.
```

This ensures queries are grounded in actual project state.

### 3. Planning and Reflection Enforcement

Prompt the model to:

* Plan before tool calls
* Reflect after each execution
* Avoid chains of back-to-back tool calls without synthesis in between

**Prompt Template:**

```text
You MUST plan extensively before calling a function, and reflect thoroughly on its output before deciding your next step.
```

## Workflow Structure

### High-Level Task Phases

1. **Understand the Bug**
2. **Explore the Codebase**
3. **Plan the Fix**
4. **Edit the Code**
5. **Debug and Test**
6. **Reflect and Finalize**

Each of these phases should be scaffolded in the prompt or system instructions.

### Recommended Prompt Structure

```markdown
# Instructions
- Fix the bug completely before ending.
- Use available tools.
- Think step-by-step before and after each action.

# Workflow
1. Understand the issue.
2. Investigate the source files.
3. Plan an incremental fix.
4. Apply and validate patch.
5. Test extensively.
6. Reflect and iterate.
```

## The V4A Patch Format (Recommended)

GPT-4.1 performs best with this clear, human-readable patch format:

```bash
*** Begin Patch
*** Update File: path/to/file.py
@@ def some_function():
context_before
- buggy_code()
+ fixed_code()
context_after
*** End Patch
```

### Diff Format Rules

* Use `*** Update File:` to mark the file.
* Use `@@` to denote function or class scope.
* Precede old code lines with `-`, new code with `+`.
* Include 3 lines of context above and below the change.
* If needed, add nested `@@` scopes for disambiguation.

**Avoid line numbers**; GPT-4.1 does not rely on them. It uses code context instead.

## Tool Configuration: `apply_patch`

To simulate developer workflows, define a function tool with this pattern:

```json
{
  "name": "apply_patch",
  "description": "Apply V4A diff patches to source files",
  "parameters": {
    "type": "object",
    "properties": {
      "input": { "type": "string" }
    },
    "required": ["input"]
  }
}
```

**Input Example:**

```bash
%%bash
apply_patch <<"EOF"
*** Begin Patch
*** Update File: mymodule/core.py
@@ def validate():
- return False
+ return True
*** End Patch
EOF
```

The `apply_patch` tool accepts multi-file patches. Each file must be preceded by its action (`Add`, `Update`, or `Delete`).
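
To route the model's `apply_patch` call into an actual patch application step, a small dispatch handler can sit between the API response and the filesystem; the sketch below assumes the `process_patch()` reference implementation from the main prompting guide and is otherwise illustrative.

```python
# Sketch: dispatch an apply_patch tool call to process_patch() from the
# reference implementation in the prompting guide. Handler is illustrative.
import json

def handle_tool_call(tool_call, open_fn, write_fn, remove_fn):
    if tool_call.function.name == "apply_patch":
        args = json.loads(tool_call.function.arguments)
        # args["input"] carries the V4A patch text between the sentinels.
        return process_patch(args["input"], open_fn, write_fn, remove_fn)
    raise ValueError(f"Unknown tool: {tool_call.function.name}")
```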

## Testing Strategy

### Manual Testing within Prompt:

Prompt the model to run tests after every change:

```text
Run all unit tests using `!python3 run_tests.py`. Do not assume success without verification.
```

### Encourage Reflection:

```text
Did the test results indicate success? Were any edge cases missed? Do you need to write new tests?
```

### Output Evaluation:

* If tests fail, the model should explain why and iterate
* If tests pass, the model should reflect before finalizing

## Debugging and Investigation Techniques

### Investigation Plan Example:

```text
I will begin by reading the test file that triggered the error, then locate the corresponding implementation file. From there, I’ll trace the logic and verify any assumptions.
```

### Debugging Prompt Reminders:

* Never change code without full context
* Use tools to inspect contents before editing
* Print debug output if necessary

## Failure Mode Mitigations

| Failure Mode                 | Fix Strategy                                                             |
| ---------------------------- | ------------------------------------------------------------------------ |
| Patch applied in wrong place | Add more surrounding context or use double `@@` scope                   |
| Patch fails silently         | Check patch syntax and the apply log before the "Done!" line            |
| Model ends before testing    | Insert reminder: "Do not conclude until all tests are validated."       |
| Partial bug fixes            | Require model to re-verify against original issue and user expectations |

## Final Validation Phase

Before finalizing a solution, prompt the model to:

* Re-read the original problem description
* Confirm alignment between intent and fix
* Run a fresh test suite
* Draft additional tests for uncovered scenarios
* Watch for silent failures or fragile patches

### Final Prompt Template:

```text
Think about the original bug and the goal. Is your fix logically complete? Did you run all tests? Are hidden edge cases covered?
```

## Alternative Diff Formats

If you need variations, GPT-4.1 performs well with:

### Search/Replace Format

```text
path/to/file.py
<<<<<<< SEARCH
def broken():
    pass
=======
def broken():
    raise Exception("Fix me")
>>>>>>> REPLACE
```

### Pseudo-XML Format

```xml
<edit>
<file>path/to/file.py</file>
<old_code>def old(): pass</old_code>
<new_code>def old(): raise NotImplementedError()</new_code>
</edit>
```

These are most useful in pipeline or IDE-integrated settings.

## Best Practices Summary

| Principle                 | Practice                                                |
| ------------------------- | ------------------------------------------------------- |
| Persistent Agent Behavior | Model must keep going until the fix is verified         |
| Reflection                | Insert plan-and-reflect instructions at each phase      |
| Patch Format              | Use V4A or equivalent context-driven diff structure     |
| Testing                   | Prompt to test after every step                         |
| Finalization              | Always include a validation + extra test writing phase  |

## Conclusion

GPT-4.1 can serve as a robust code-fixing agent when scaffolded with precise patch formats, rigorous test validation, and persistent reflection mechanisms. By integrating tool calls such as `apply_patch` and emphasizing validation over completion, developers can reliably use the model for end-to-end issue resolution workflows.

**Build the fix. Test the outcome. Validate the solution.** That’s the foundation for agentic software repair with GPT-4.1.
cookbook_pro.md
ADDED
@@ -0,0 +1,288 @@
# [OpenAI Cookbook Pro: Comprehensive GPT-4.1 Application Framework](https://chatgpt.com/canvas/shared/6825fb38b0e0819184bb3153a3eb1a52)

## Introduction

This document represents a fully evolved, professional-grade implementation of the OpenAI 4.1 Cookbook. It serves as a unified, production-ready guide for applied large language model deployment using GPT-4.1. Each section draws from OpenAI's internal best practices and external application patterns to provide a durable blueprint for advanced AI developers, architects, and researchers.

This Cookbook Pro version encapsulates:

* High-performance agentic prompting workflows
* Instruction literalism and planning strategies
* Long-context structuring methods
* Tool-calling schemas and evaluation principles
* Diff management and debugging strategies

---

## Part I — Agentic Workflows

### 1.1 Prompt Harness Configuration

#### Three Essential Prompt Reminders:

```markdown
# Persistence
You are an agent—keep working until the task is fully resolved. Do not yield control prematurely.

# Tool-Calling
If unsure about file or codebase content, use tools to gather accurate information. Do not guess.

# Planning
Before and after every function call, explicitly plan and reflect. Avoid tool-chaining without synthesis.
```

These instructions significantly increase performance and enable stateful execution in multi-message tasks.

### 1.2 Example: SWE-Bench Verified Prompt

```markdown
# Objective
Fully resolve a software bug from an open-source issue.

# Workflow
1. Understand the problem.
2. Explore relevant files.
3. Plan incremental fix steps.
4. Apply code patches.
5. Test thoroughly.
6. Reflect and iterate until all tests pass.

# Constraint
Only end the session when the problem is fully fixed and verified.
```

---

## Part II — Instruction Following & Output Control

### 2.1 Instruction Clarity Protocol

Use:

* `# Instructions`: General rules
* `## Subsections`: Detailed formatting and behavioral constraints
* Explicit instruction/response pairings

### 2.2 Sample Format

```markdown
# Instructions
- Always greet the user.
- Avoid internal knowledge for company-specific questions.
- Cite retrieved content.

# Workflow
1. Acknowledge the user.
2. Call tools before answering.
3. Reflect and respond.

# Output Format
Use: JSON with `title`, `answer`, `source` fields.
```

---

## Part III — Tool Integration and Execution

### 3.1 Schema Guidelines

Define tools via the `tools` API parameter, not inline prompt injection.

#### Tool Schema Template

```json
{
  "name": "lookup_policy_document",
  "description": "Retrieve company policy details by topic.",
  "parameters": {
    "type": "object",
    "properties": {
      "topic": {"type": "string"}
    },
    "required": ["topic"]
  }
}
```

### 3.2 Tool Usage Best Practices

* Define sample tool calls in `# Examples` sections
* Never overload the `description` field
* Validate inputs with required keys
* Prompt model to message user before and after calls
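
The schema above plugs directly into the `tools` parameter of the Chat Completions API. A minimal sketch of the registration call (the model name and user message are placeholders):

```python
from openai import OpenAI

client = OpenAI()

tools = [{
    "type": "function",
    "function": {
        "name": "lookup_policy_document",
        "description": "Retrieve company policy details by topic.",
        "parameters": {
            "type": "object",
            "properties": {"topic": {"type": "string"}},
            "required": ["topic"],
        },
    },
}]

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": "What is the travel reimbursement policy?"}],
    tools=tools,
)

# If the model chose the tool, its arguments arrive as a JSON string.
for call in response.choices[0].message.tool_calls or []:
    print(call.function.name, call.function.arguments)
```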

---

## Part IV — Planning and Chain-of-Thought Induction

### 4.1 Step-by-Step Prompting Pattern

```markdown
# Reasoning Strategy
1. Query breakdown
2. Context extraction
3. Document relevance ranking
4. Answer synthesis

# Instruction
Think step by step. Summarize relevant documents before answering.
```

### 4.2 Failure Mitigation Strategies

| Problem           | Fix                                         |
| ----------------- | ------------------------------------------- |
| Early response    | Add: “Don’t conclude until fully resolved.” |
| Tool guess        | Add: “Use tool or ask for missing data.”    |
| CoT inconsistency | Prompt: “Summarize findings at each step.”  |

---

## Part V — Long Context Optimization

### 5.1 Instruction Anchoring

* Repeat instructions at both top and bottom of long input
* Use structured section headers (Markdown/XML)

### 5.2 Effective Delimiters

| Type     | Example                 | Use Case               |
| -------- | ----------------------- | ---------------------- |
| Markdown | `## Section Title`      | General purpose        |
| XML      | `<doc id='1'>...</doc>` | Document ingestion     |
| ID/Title | `ID: 3 \| TITLE: ...`   | Knowledge base parsing |

### 5.3 Example Prompt

```markdown
# Instructions
Use only documents provided. Reflect every 10K tokens.

# Long Context Input
<doc id="14" title="Security Policy">...</doc>
<doc id="15" title="Update Note">...</doc>

# Final Instruction
List all relevant IDs, then synthesize a summary.
```

---

## Part VI — Diff Generation and Patch Application

### 6.1 Recommended Format: V4A Diff

```bash
*** Begin Patch
*** Update File: src/utils.py
@@ def sanitize()
- return text
+ return text.strip()
*** End Patch
```

### 6.2 Diff Patch Execution Tool

```json
{
  "name": "apply_patch",
  "description": "Apply structured code patches to files",
  "parameters": {
    "type": "object",
    "properties": {
      "input": {"type": "string"}
    },
    "required": ["input"]
  }
}
```

### 6.3 Workflow

1. Investigate issue
2. Draft V4A patch
3. Call `apply_patch`
4. Run tests
5. Reflect

### 6.4 Edge Case Handling

| Symptom             | Action                              |
| ------------------- | ----------------------------------- |
| Incorrect placement | Add `@@ def` or class scope headers |
| Test failures       | Revise patch + rerun                |
| Silent error        | Check for malformed format          |
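
For the last symptom, a lightweight pre-flight check on the patch envelope catches most malformed inputs before `apply_patch` runs. A minimal sketch (the marker strings follow the V4A examples above; the specific checks are illustrative):

```python
def validate_v4a_envelope(patch: str) -> list[str]:
    """Return a list of problems found in a V4A patch envelope."""
    problems = []
    lines = patch.strip().splitlines()
    if not lines or lines[0] != "*** Begin Patch":
        problems.append("missing '*** Begin Patch' header")
    if not lines or lines[-1] != "*** End Patch":
        problems.append("missing '*** End Patch' trailer")
    if not any(line.startswith("*** Update File:") for line in lines):
        problems.append("no '*** Update File:' action line")
    if not any(line.startswith(("+", "-")) for line in lines[1:-1]):
        problems.append("no changed lines between the markers")
    return problems
```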

---

## Part VII — Output Evaluation Framework

### 7.1 Metrics to Track

| Metric                     | Description                                           |
| -------------------------- | ---------------------------------------------------- |
| Tool Call Accuracy         | Valid input usage and correct function selection     |
| Response Format Compliance | Matches expected schema (e.g., JSON)                 |
| Instruction Adherence      | Follows rules and workflow order                     |
| Plan Reflection Rate       | Frequency and quality of plan → act → reflect cycles |
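
Response format compliance is the easiest of these metrics to automate. A minimal sketch of a checker for the JSON contract used in Part II (the required keys come from that example):

```python
import json

REQUIRED_KEYS = {"title", "answer", "source"}  # from the Part II output format

def is_format_compliant(raw_response: str) -> bool:
    """True if the response parses as JSON and carries the expected fields."""
    try:
        payload = json.loads(raw_response)
    except json.JSONDecodeError:
        return False
    return isinstance(payload, dict) and REQUIRED_KEYS.issubset(payload)
```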

### 7.2 Eval Tags for Audit

```markdown
# Eval: TOOL_USE_FAIL
# Eval: INSTRUCTION_MISINTERPRET
# Eval: OUTPUT_FORMAT_OK
```

---

## Part VIII — Unified Prompt Template

Use this as a base structure for all GPT-4.1 projects:

```markdown
# Role
You are a [role] tasked with [objective].

# Instructions
[List core rules here.]

## Response Rules
- Always use structured formatting
- Never repeat phrases verbatim

## Workflow
[Include ordered plan.]

## Reasoning Strategy
[Optional — for advanced reasoning tasks.]

# Output Format
[Specify format, e.g., JSON or Markdown.]

# Examples
## Example 1
Input: "..."
Output: {...}
```

---

## Final Notes

GPT-4.1 represents a leap forward in real-world agentic performance, tool adherence, long-context reliability, and instruction precision. However, performance hinges on prompt clarity, structured reasoning scaffolds, and modular tool integration.

To deploy GPT-4.1 at professional scale:

* Treat every prompt as a program
* Document assumptions
* Version control your system messages
* Build continuous evals for regression prevention

**Structure drives performance. Precision enables autonomy.**

Welcome to Cookbook Pro.

—End of Guide—
designing_agent_workflows.md
ADDED
@@ -0,0 +1,354 @@
# [Designing Agent Workflows](https://chatgpt.com/canvas/shared/6825ece10cac819189e14d95e8ecd032)

## Overview

The GPT-4.1 model introduces significant improvements in agentic capabilities, making it ideal for designing multi-turn workflows that rely on persistence, planning, and structured tool interaction. Whether you’re building automated software agents, coding assistants, customer service bots, or task execution systems, designing for success in GPT-4.1 requires careful coordination between prompt design, system instructions, tool usage, and behavior monitoring.

This guide provides a comprehensive framework for designing effective agent workflows using GPT-4.1, detailing structural components, implementation strategies, tool invocation principles, behavioral anchors, and debugging techniques.

Each section can be reused as a design module for your own applications, while contributing to the broader library of effective agent patterns.

## What Is an Agent Workflow?

An agent workflow is a sequence of steps managed by the model in which it:

1. Interprets the user’s task or goal
2. Selects and applies the right tools
3. Iterates until the goal is fully accomplished
4. Manages context, planning, and persistence internally
5. Responds only after verifiable success criteria are met

This process transforms GPT-4.1 from a turn-based assistant into a semi-autonomous task manager.

## Key Model Behaviors That Enable Agent Design

### Literal Instruction Compliance

GPT-4.1 follows instructions with high fidelity. This includes step ordering, constraints, and termination rules. The model is more responsive to direct, formatted behavioral cues than its predecessors.

### Persistent Multi-Turn Context Management

The model maintains internal state across extended interactions. You can program it to persist in a loop, waiting to exit only once defined conditions are met.

### Planning and Reflection

Though not a reasoning-first model, GPT-4.1 can be prompted to externalize plans, reflect on outcomes, and improve with each iteration when prompted properly.

### Integrated Tool Use

The `tools` parameter in the API allows for direct invocation of functional calls (file inspection, patch application, database lookups, etc.) — making agentic behavior verifiable and extendable.

## Core Agent Workflow Template

### 🧩 System Prompt Template

```markdown
# Agent Instructions
You are a multi-step problem-solving agent. Do not terminate until you have fully completed the assigned task.

## Persistence
- Continue until task completion is verified.
- Do not yield to the user before the solution is complete.

## Tool Use
- Use tools to gather information. Do not guess.
- Only proceed with actions when all necessary data is available.

## Planning
- Before taking an action, create a plan.
- After each step, reflect on success/failure.

# Output Format
- Step number
- Action taken
- Result
- Updated plan (if any)

# Final Output
Summarize the solution, include test results if applicable.
```

This format primes GPT-4.1 for proactive execution, tool integration, and termination control.

## Task Archetypes: Common Agent Patterns

| Task Type                | Characteristics                                                         | GPT-4.1 Design Notes                                |
| ------------------------ | ----------------------------------------------------------------------- | --------------------------------------------------- |
| **Code Fixing Agent**    | Requires bug reproduction, patch generation, validation via tests       | Use `apply_patch` tool + persistent reflection loop |
| **Data Lookup Agent**    | Accesses external data via tool calls and summarizes findings           | Tool use must be verified before user response      |
| **Support Agent**        | Answers factual queries with context validation and escalation support  | Include step-by-step message plan and constraints   |
| **Document Synth Agent** | Parses, filters, and summarizes from long context                       | Use instructions at top and bottom of prompt        |

## Designing for Persistence

Persistence is the foundation of reliable agent behavior. Without it, the model will default to single-turn chat behavior.

### Design Pattern

```markdown
You must NOT yield back to the user until the task is fully complete.
Check that all steps are verified. Repeat steps as needed. Only stop when all tests pass or instructions say to stop.
```

Reinforce this message early and late in the prompt. In tests, models were 19% more likely to complete complex multi-step tasks when given persistent execution reminders.

## Designing for Tool Use

The most reliable agents use tools for verifiable context access.

### Tool Integration Best Practices

* Register tools in the `tools` parameter of the OpenAI API, not embedded in prompt text.
* Keep tool names simple and descriptive: `run_tests`, `apply_patch`, `lookup_invoice`.
* Provide clear descriptions and optionally list examples in a separate section.

### Example Tool Schema

```json
{
  "name": "apply_patch",
  "description": "Apply a structured diff patch to a file.",
  "parameters": {
    "type": "object",
    "properties": {
      "input": {"type": "string", "description": "The formatted patch text"}
    },
    "required": ["input"]
  }
}
```

### Tool Usage Prompts

```markdown
If you do not have enough context to proceed, pause and use the tools provided.
Do not guess about code, structure, or missing data. Always verify by tool.
```

## Planning and Reflection Prompts

Planning and reflection are the structural anchors of agentic reasoning.

### Pre-Action Planning Prompt

```markdown
Before proceeding:
1. Restate the goal in your own words
2. Write a short plan of how you will solve it
3. List any tools you will need
```

### Post-Action Reflection Prompt

```markdown
After taking an action:
1. Summarize the result
2. List any unexpected outcomes
3. Determine if the goal is met
4. If not, update your plan and try again
```

These templates increase accuracy in multi-step execution by enforcing self-monitoring.

## Full Agent Walkthrough: SWE-bench Example

### System Prompt (Extracted from OpenAI's internal best practices)

```markdown
You are a coding agent tasked with solving bugs in open-source software.

## Task Requirements
- Always plan before you act
- Use tools to inspect files and apply patches
- Validate your fix with rigorous tests

## Workflow
1. Understand the bug deeply
2. Investigate the codebase using tools
3. Plan a fix
4. Apply the fix in small, testable steps
5. Run tests frequently
6. Iterate until all visible and hidden tests pass
7. Reflect and finalize

## Rules
- Do not stop unless all tests pass
- If a fix fails, debug the failure and reattempt
- Document your process
```

### Tool Call: `apply_patch`

```bash
apply_patch <<"EOF"
*** Begin Patch
*** Update File: src/module/handler.py
@@ def process_input():
-    if input:
+    if input is not None:
         handle(input)
*** End Patch
EOF
```

### Tool Call: `run_tests`

```bash
!python3 -m unittest discover
```

### Agent Response (Example)

```
Step 1: Analyzed stack trace and confirmed null check missing
Step 2: Planned patch using apply_patch tool
Step 3: Applied patch successfully
Step 4: Ran tests – 1 failure in edge case
Step 5: Revised patch
Step 6: Re-ran tests – all passed
Task complete. Final patch validated.
```

## Failure Modes and Mitigations

| Failure Type                   | Symptom                               | Mitigation Tip                               |
| ------------------------------ | ------------------------------------- | -------------------------------------------- |
| Premature Turn Yield           | Agent ends before solving             | Add explicit no-yield rules in system prompt |
| Tool Hallucination             | Tool called with invalid or null data | Instruct agent to ask for data if missing    |
| No Planning or Reflection      | Skips step-by-step reasoning          | Add planning and reflection anchors          |
| Ignoring Final Validation Step | Says task complete before verifying   | Add final verification checklist to prompt   |

## Output Format Suggestions

A consistent output format improves interpretability and downstream usage.

### Recommended Layout

```markdown
# Task Status: In Progress

## Current Step: Plan and Execute Fix
- Tool used: apply_patch
- Patch outcome: Success

## Next Step
- Run full tests
- Validate output for edge cases
```

You can train GPT-4.1 to adopt consistent internal status reporting with a format guide provided in each system prompt.

## Escalation, Recovery, and Termination

### Escalation

Encourage the model to escalate to the user when required:

```markdown
If more data or permissions are needed, ask the user explicitly.
If a step cannot be completed after three attempts, escalate.
```

### Recovery

Allow the model to acknowledge failure and retry with adjustments:

```markdown
If your fix fails tests, reflect and revise the patch.
List new hypotheses and retry using a modified plan.
```

### Termination

Use clear termination rules:

```markdown
Only end your session when:
- All tests pass
- The task is fully verified
- You have summarized your actions for the user
```

## Behavioral Design Tips

| Technique                     | Effect                                               |
| ----------------------------- | ---------------------------------------------------- |
| System prompt layering        | Prioritizes stable task framing                      |
| Mid-prompt behavior resets    | Reinforces correct tool usage after failed attempts  |
| Named sections (Markdown/XML) | Improves adherence to plan and formatting            |
| Soft conditionals             | Encourages resilience (“If X fails, try Y…”)         |

## Designing for Developer Control

Create parameterized prompts for easier tuning and behavior adjustment.

### Template with Parameters

```python
# Example parameter values; substitute your own at runtime.
role = "coding agent"
task_description = "resolve the reported bug and verify the fix"
format_spec = "Markdown status report"
tool_names = ["apply_patch", "run_tests"]
planning = True

AGENT_TEMPLATE = f"""
# Role: {role}
# Task: {task_description}
# Output Format: {format_spec}
# Tools: {', '.join(tool_names)}
# Planning Required: {'Yes' if planning else 'No'}
"""
```

Use this model to power dashboards, agent templates, and UI-driven behavior controls.

## Testing Agent Workflows

Use evaluation harnesses to test agent performance:

* Track step completion
* Analyze tool usage logs
* Compare plan quality across variants

Key metrics:

* Task success rate
* Iteration count per completion
* Tool error frequency
* Response length and structure fidelity
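
A minimal sketch of such a harness, aggregating the first three metrics from logged runs (the log record shape is an assumption, not a fixed schema):

```python
from statistics import mean

# Each record summarizes one logged agent run; field names are illustrative.
runs = [
    {"succeeded": True, "iterations": 4, "tool_errors": 0},
    {"succeeded": False, "iterations": 7, "tool_errors": 2},
    {"succeeded": True, "iterations": 3, "tool_errors": 1},
]

def summarize(runs: list[dict]) -> dict:
    """Aggregate per-run logs into workflow-level metrics."""
    return {
        "task_success_rate": mean(r["succeeded"] for r in runs),
        "avg_iterations_per_run": mean(r["iterations"] for r in runs),
        "tool_errors_per_run": mean(r["tool_errors"] for r in runs),
    }

print(summarize(runs))
```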

## Summary

Agent workflows in GPT-4.1 are structured, reliable, and controllable — provided the design follows consistent instruction patterns, plans for tool usage, and includes persistence logic.

Follow these principles:

* Anchor every agent with clear, literal instructions
* Use tool APIs instead of embedded tool descriptions
* Require planning and reflection around actions
* Validate every output with structured criteria

By shaping agent workflows as formal task managers, developers can build systems that reliably complete complex operations in a safe, verifiable manner.

## Next Steps

Explore these additional modules to expand your agent capabilities:

* [`Prompting for Instruction Following`](./Prompting%20for%20Instruction%20Following.md)
* [`Long Context Strategies`](./Long%20Context.md)
* [`Tool Calling and Integration`](./Tool%20Use%20and%20Integration.md)

For more agent-ready templates, visit the `/agent_design/` directory in the main repository.

For contributions or questions, open an issue or submit a pull request to `/agent_design/Designing Agent Workflows.md`.
handling_long_contexts.md
ADDED
@@ -0,0 +1,279 @@
# [Handling Long Contexts in GPT-4.1](https://chatgpt.com/canvas/shared/6825f136fb448191aadfd3cded6defe5)

## Overview

This guide focuses on how to effectively structure, prompt, and manage long contexts when working with GPT-4.1. With support for a 1 million-token context window, GPT-4.1 unlocks new possibilities for processing, reasoning, and extracting information from large datasets and documents. However, the benefits of long context can only be realized when prompts are precisely structured and context is meaningfully prioritized. This guide outlines practical techniques for context formatting, instruction design, and reasoning support across extended input windows.

## Objectives

* Help developers utilize the full long-context capability of GPT-4.1
* Mitigate degradation in response quality due to token overflow or disorganized input
* Establish formatting and reasoning best practices that align with OpenAI’s tested strategies
* Enable document processing, re-ranking, retrieval, and multi-hop reasoning across long inputs

## 1. Understanding Long Context Use Cases

GPT-4.1 is capable of processing up to 1 million tokens of input, making it suitable for a wide range of applications including:

* **Structured Document Parsing**: Legal documents, scientific papers, contracts, etc.
* **Retrieval-Augmented Generation (RAG)**: Combining long contexts with internal and external tools
* **Knowledge Graph Construction**: Extracting structured relationships from unstructured data
* **Log and Trace Analysis**: Reviewing extended server logs or output sequences
* **Multi-hop Reasoning**: Synthesizing answers from distributed pieces of information

While the model can technically parse vast inputs, developers must implement strategies to avoid cognitive overload and focus model attention effectively.

## 2. Context Organization Principles

### 2.1 Optimal Instruction Placement

OpenAI’s internal experiments found that the **positioning of prompt instructions significantly affects model performance**. Key guidelines include:

* **Dual placement**: Repeat key instructions at both the beginning and end of the prompt.
* **Top-loading**: If instructions are only placed once, placing them at the beginning is more effective than the end.
* **Segmented framing**: Use sectional titles to clearly mark transitions.

### 2.2 Delimiter Selection

To help the model parse structure in large blocks of text, delimiters must be used consistently and appropriately:

| Delimiter Format               | Description                  | Use Case                              |
| ------------------------------ | ---------------------------- | ------------------------------------- |
| **Markdown (`#`, `##`, `-`)**  | Clean sectioning, readable   | General-purpose long context parsing  |
| **XML (`<doc>`, `<section>`)** | Best for document modeling   | Structured multi-document input       |
| **Inline backticks**           | For code, queries, and data  | Code and SQL parsing, tool parameters |
| **Avoid JSON**                 | Inefficient parsing at scale | Do not use for >10K token lists       |

Markdown and XML structures yield better attention modeling across long contexts, while JSON often introduces parsing inefficiencies beyond a few thousand tokens.

## 3. Strategies for Long-Context Prompting

### 3.1 Context-Aware Instructioning

When dealing with large input windows, standard prompt formats must evolve. Use detailed scaffolds that define model behavior across each phase:

```markdown
# Instructions
- Use only the documents provided.
- Focus on relevance before synthesis.
- Reflect after each major section.

# Reasoning Strategy
1. Read and segment.
2. Rank relevance.
3. Synthesize step-by-step.

# Final Reminder
Adhere strictly to section boundaries and reason incrementally.
```

### 3.2 Step-by-Step Processing with Summarization

Break the long input into **logical checkpoints**. After each checkpoint:

* Summarize progress.
* List open questions.
* Forecast the next reasoning step.

This promotes internal alignment without hardcoding logic into tool calls.

**Example Prompt Snippet:**

```text
After reading the next 5,000 tokens, summarize key entities mentioned and note unresolved questions. Then continue.
```

## 4. Long-Context Reasoning Patterns

### 4.1 Needle-in-a-Haystack Retrieval

GPT-4.1 performs reliably at locating information embedded deep within large corpora. Best practices for precision include:

* **Unique section headers** to guide location memory
* **Explicit re-ranking instructions** after initial search
* **Preliminary entity listing** to establish anchors

### 4.2 Document Relevance Rating

When feeding dozens or hundreds of documents into the model, instruct it to:

1. Score each document based on a relevance scale
2. Justify the score with reference to query terms
3. Select only medium/high relevance docs for synthesis

**Example Snippet:**

```text
Rate each doc on relevance to the query [high, medium, low]. Provide one sentence justification per doc. Use only high/medium docs in the final answer.
```

### 4.3 Multi-Hop Document Synthesis

For complex queries requiring synthesis from several different inputs:

* Start by identifying all possibly relevant documents
* Extract one-sentence summaries from each
* Weigh the evidence to converge on an answer

This scaffolds model behavior in a transparent and verifiable way.

## 5. Managing Instructional Interference

As context grows, risk increases that initial instructions may be forgotten or overridden. To address this:

* **Insert refresher instructions at each major context segment**
* **Bold or delimit** instructional snippets to create visual attention anchors
* **Use hierarchical structure**: Title → Sub-section → Instruction → Content

Example:

```markdown
## Part 3: Analyze Error Logs
**Reminder:** Focus only on logs mentioning `TimeoutError`. Ignore unrelated traces.
```

## 6. Failure Modes and Fixes

### 6.1 Early Context Drift

**Symptom:** The model misinterprets a query due to overemphasis on the early documents.

**Solution:** Insert a midway reflection point:

```text
Pause and verify: Are we still on track based on the original query?
```

### 6.2 Instruction Overload

**Symptom:** Model ignores or selectively follows prompt instructions.

**Solution:** Simplify instruction blocks. Group similar guidance. Use numbered checklists.

### 6.3 Latency and Token Limitations

**Symptom:** Prompting becomes slow or the output is truncated.

**Solution:**

* Shorten low-salience sections.
* Summarize documents before passing into prompt.
* Use a retrieval step to filter top-k relevant items.

## 7. Formatting Techniques for Long Contexts

### 7.1 Title-ID Pairing

Helpful in multi-document prompts.

```text
ID: 001 | TITLE: Terms of Use | CONTENT: The user agrees to...
```

This increases the model's ability to re-reference sections.
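
A one-line helper that renders documents in this shape (the zero-padded ID width is a cosmetic assumption):

```python
def format_doc(doc_id: int, title: str, content: str) -> str:
    """Render a document in the ID/TITLE/CONTENT line format shown above."""
    return f"ID: {doc_id:03d} | TITLE: {title} | CONTENT: {content}"

print(format_doc(1, "Terms of Use", "The user agrees to..."))
# -> ID: 001 | TITLE: Terms of Use | CONTENT: The user agrees to...
```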

### 7.2 XML Embedding for Hierarchical Structure

```xml
<doc id="34" title="Security Policy">
  <summary>Contains threat classifications and countermeasures</summary>
  <content>...</content>
</doc>
```

This formatting supports multi-pass parsing and structured memory.

## 8. Alignment Between Internal and External Knowledge

In long-context tasks, decisions must be made about how much to rely on provided context vs. internal knowledge.

**Guideline Matrix:**

| Mode             | Model Should...                                                     |
| ---------------- | ------------------------------------------------------------------- |
| Strict Retrieval | Only use external documents. If unsure, say "Not enough info."      |
| Hybrid Mode      | Use context first, but fill in with internal knowledge when needed. |
| Pure Generation  | Use own knowledge; ignore prompt context.                           |

When prompting, make mode explicit:

```text
Use only the following context. If insufficient, reply: "Insufficient data."
```

## 9. Tools and Token Budgeting

### 9.1 Token Allocation Strategy

When constructing long prompts, divide tokens based on relevance and priority:

| Section               | Suggested Max Tokens | Notes                                    |
| --------------------- | -------------------- | ---------------------------------------- |
| Instructions          | 1,000                | Include high-priority guidance twice     |
| Context Documents     | 900,000              | Use title delimiters, sort by relevance  |
| Task-Specific Prompts | 50,000               | Include reasoning strategy scaffolds     |

Prioritize content by query salience and clarity.
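
A minimal sketch of enforcing such a budget before prompt assembly, using the `tiktoken` tokenizer (the encoding name and the greedy fill strategy are assumptions; documents are expected pre-sorted by relevance):

```python
import tiktoken

ENC = tiktoken.get_encoding("o200k_base")  # assumed encoding for GPT-4.1-class models

def fill_context(docs: list[str], budget: int = 900_000) -> list[str]:
    """Keep the highest-relevance documents that fit within the token budget."""
    kept, used = [], 0
    for doc in docs:  # assumed sorted from most to least relevant
        cost = len(ENC.encode(doc))
        if used + cost > budget:
            break
        kept.append(doc)
        used += cost
    return kept
```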

### 9.2 Intermediate Tool Use

Encourage the model to use tools mid-way:

* Re-rank document clusters
* Extract named entities
* Visualize flow or graph relationships

Encouraging this tool interaction creates checkpoints and avoids reasoning drift.

## 10. Testing and Evaluation

When evaluating prompt effectiveness in long-context scenarios:

* Measure correctness, latency, and coverage
* Track hallucination and false-positive rates
* Use automated evals with known answer corpora

### Recommended Metrics:

* Precision@k for retrieval
* Response coherence score (human or model-rated)
* Instruction adherence rate

Incorporate feedback loops to update prompts based on failure analysis.

## 11. Summary and Best Practices

| Principle             | Best Practice                     |
| --------------------- | --------------------------------- |
| Instruction Placement | Use top and bottom                |
| Context Segmentation  | Insert checkpoints, summaries     |
| Delimiters            | Prefer Markdown/XML over JSON     |
| Tool Usage            | Mid-task tool calls preferred     |
| Evaluation            | Test adherence, accuracy, latency |

Effective long-context prompting is not about more data—it’s about better structure, thoughtful pacing, and precision anchoring.

## Final Notes

GPT-4.1’s long-context capabilities can power a new generation of document-heavy applications. However, successful deployment requires more than dropping text into a prompt. It requires:

* Clear segment boundaries
* Frequent alignment checkpoints
* Purpose-driven formatting
* Strategic memory reinforcement

With these principles in place, the model not only reads—it understands.

Begin with structure. Sustain with clarity. Close with alignment.
prompt_engineering_guide.md
ADDED
@@ -0,0 +1,298 @@
# [Prompt Engineering Reference Guide for GPT-4.1](https://chatgpt.com/canvas/shared/6825f88d7170819180b56e101e8b9d31)

## Overview

This reference guide consolidates OpenAI’s latest findings and recommendations for effective prompt engineering with GPT-4.1. It is designed for developers, researchers, and applied AI engineers who seek reliable, reproducible results from GPT-4.1 in both experimental and production settings. The techniques presented here are rooted in empirical validation across use cases ranging from agent workflows to structured tool integration, long-context processing, and instruction-following optimization.

This document emphasizes concrete prompt patterns, scaffolding techniques, and deployment-tested prompt modularity.

## Key Prompting Concepts

### 1. Instruction Literalism

GPT-4.1 follows instructions **more precisely** than its predecessors. Developers should:

* Avoid vague or underspecified prompts
* Be explicit about desired behaviors, output formats, and prohibitions
* Expect literal compliance with phrasing, including negations and scope restrictions

### 2. Planning Induction

GPT-4.1 does not natively plan before answering but can be prompted to simulate step-by-step reasoning.

**Template:**

```text
Think carefully step by step. Break down the task into manageable parts. Then begin.
```

Planning prompts should be framed before actions and reinforced between reasoning phases.

### 3. Agentic Harnessing

Use GPT-4.1’s enhanced persistence and tool adherence by specifying three types of reminders:

* **Persistence**: “Keep working until the problem is fully resolved.”
* **Tool usage**: “Use available tools to inspect files—do not guess.”
* **Planning enforcement**: “Plan and reflect before and after every function call.”

These drastically increase the model’s task completion rate when integrated at the top of a system prompt.

## Prompt Structure Blueprint

A recommended modular scaffold:

```markdown
# Role and Objective
You are a [role] tasked with [goal].

# Instructions
- Bullet-point rules or constraints
- Output format expectations
- Prohibited topics or phrasing

# Workflow (Optional)
1. Step-by-step plan
2. Reflection checkpoints
3. Tool interaction order

# Reasoning Strategy (Optional)
Describes how the model should analyze input or context before generating output.

# Output Format
JSON, Markdown, YAML, or prose specification

# Examples (Optional)
Demonstrates expected input/output behavior
```

This format increases predictability and flexibility during live prompt debugging and iteration.

## Long-Context Prompting

GPT-4.1 supports up to **1M token inputs**, enabling:

* Multi-document ingestion
* Codebase-wide searches
* Contextual re-ranking and synthesis

### Strategies:

* **Repeat instructions at top and bottom**
* **Use markdown/XML tags** for structure
* **Insert reasoning checkpoints every 5–10k tokens**
* **Avoid JSON for large document embedding**

**Effective Delimiters:**

| Format   | Use Case                             |
| -------- | ------------------------------------ |
| Markdown | General sectioning                   |
| XML      | Hierarchical document parsing        |
| Title/ID | Multi-document input structuring     |
| JSON     | Code/tool tasks only; avoid for text |

## Tool-Calling Integration

### Schema-Based Tool Usage

Define tools in the OpenAI `tools` field, not inline. Provide:

* **Name** (clear and descriptive)
* **Parameters** (structured JSON)
* **Usage examples** (in `# Examples` section, not in `description`)

**Tool Example:**

```json
{
  "name": "get_user_info",
  "description": "Fetches user details from the database",
  "parameters": {
    "type": "object",
    "properties": {
      "user_id": { "type": "string" }
    },
    "required": ["user_id"]
  }
}
```
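
When the model decides to use the tool, the completion carries a `tool_calls` entry instead of text; the caller runs the function and returns the result as a `tool` message. A minimal sketch of that round trip (the `fetch_user` helper is hypothetical):

```python
import json
from openai import OpenAI

client = OpenAI()

def fetch_user(user_id: str) -> dict:
    # Hypothetical lookup standing in for a real database query.
    return {"user_id": user_id, "name": "Ada"}

tools = [{"type": "function", "function": {
    "name": "get_user_info",
    "description": "Fetches user details from the database",
    "parameters": {
        "type": "object",
        "properties": {"user_id": {"type": "string"}},
        "required": ["user_id"],
    },
}}]

messages = [{"role": "user", "content": "Look up user 42."}]
response = client.chat.completions.create(model="gpt-4.1", messages=messages, tools=tools)
call = response.choices[0].message.tool_calls[0]
args = json.loads(call.function.arguments)

# Hand the tool result back so the model can compose its final answer.
messages.append(response.choices[0].message)
messages.append({"role": "tool", "tool_call_id": call.id,
                 "content": json.dumps(fetch_user(**args))})
final = client.chat.completions.create(model="gpt-4.1", messages=messages, tools=tools)
print(final.choices[0].message.content)
```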

### Prompt Reinforcement:

```markdown
# Tool Instructions
- Use tools before answering factual queries
- If info is missing, request input from user
```

### Failure Mitigation:

| Issue              | Fix                                         |
| ------------------ | ------------------------------------------- |
| Null tool calls    | Prompt: “Ask for missing info if needed”    |
| Over-calling tools | Add reasoning delay + post-call reflection  |
| Missed call        | Add output format block and trigger keyword |

## Instruction Following Optimization

GPT-4.1 is optimized for literal and structured instruction parsing. Improve reliability with:

### Multi-Tiered Rules

Use layers:

* `# Instructions`: High-level
* `## Response Style`: Format and tone
* `## Error Handling`: Edge case mitigation

### Ordered Workflows

Use numbered sequences to enforce step-by-step logic.

**Prompt Snippet:**

```markdown
# Instructions
- Greet the user
- Request missing parameters
- Avoid repeating exact phrasing
- Escalate on request

# Workflow
1. Confirm intent
2. Call tool
3. Reflect
4. Respond
```

## Chain-of-Thought Prompting (CoT)

Chain-of-thought induces linear reasoning. Works best for:

* Logic puzzles
* Multi-hop QA
* Comparative analysis

**CoT Example:**

```text
Let’s think through this. First, identify what the question is asking. Then examine context. Finally, synthesize an answer.
```

**Advanced Prompt (Modular):**

```markdown
# Reasoning Strategy
1. Query analysis
2. Context selection
3. Evidence synthesis

# Final Instruction
Think step by step using the strategy above.
```

## Failure Modes and Fixes

| Problem            | Mitigation                                                     |
| ------------------ | -------------------------------------------------------------- |
| Tool hallucination | Require tool call block, validate schema                       |
| Early termination  | Add: "Do not yield until goal achieved."                       |
| Verbose repetition | Add paraphrasing constraint and variation list                 |
| Overcompliance     | If model follows a sample phrase verbatim, instruct to vary it |

## Evaluation Strategy

Prompt effectiveness should be evaluated across:

* **Instruction adherence**
* **Tool utilization accuracy**
* **Reasoning coherence**
* **Failure mode frequency**
* **Latency and cost tradeoffs**

### Recommended Methodology:

* Create a test suite with edge-case prompts
* Log errors and model divergence cases
* Use eval tags (`# Eval:`) in prompt for meta-analysis

## Delimiter Comparison Table

| Delimiter Type | Format Example     | GPT-4.1 Performance             |
| -------------- | ------------------ | ------------------------------- |
| Markdown       | `## Section Title` | Excellent                       |
| XML            | `<doc>` tags       | Excellent                       |
| JSON           | `{"text": "..."}`  | High (in code), Poor (in prose) |
| Pipe-delimited | `TITLE \| CONTENT` | Moderate                        |

### Best Practice:

Use Markdown or XML for general structure; JSON for code/tools only.

## Example: Prompt Debugging Workflow

### Step 1: Identify Goal

E.g., summarizing medical trial documents with context weighting.

### Step 2: Draft Prompt Template

```markdown
# Objective
Summarize each trial based on outcome clarity and trial scale.

# Workflow
1. Parse hypothesis/result
2. Score for clarity
3. Output structured summary

# Output Format
{"trial_id": ..., "clarity_score": ..., "summary": ...}
```

### Step 3: Insert Sample

```json
{"trial_id": "T01", "clarity_score": 8, "summary": "Well-documented results..."}
```

### Step 4: Validate Output

Ensure the model adheres to the output format, logic, and reasoning instructions.
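
A minimal sketch of that validation for the structured summary above (the key names follow the Step 2 output format; the score-range check is an added assumption):

```python
import json

def validate_summary(raw: str) -> list[str]:
    """Check one model response against the Step 2 output contract."""
    try:
        out = json.loads(raw)
    except json.JSONDecodeError:
        return ["response is not valid JSON"]
    errors = [f"missing key: {k}"
              for k in ("trial_id", "clarity_score", "summary") if k not in out]
    score = out.get("clarity_score")
    if not (isinstance(score, int) and 0 <= score <= 10):
        errors.append("clarity_score should be an integer from 0 to 10")
    return errors

print(validate_summary('{"trial_id": "T01", "clarity_score": 8, "summary": "..."}'))  # []
```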

## Summary: Prompt Engineering Heuristics

| Technique                  | When to Use                         |
| -------------------------- | ----------------------------------- |
| Instruction Bullets        | All prompts                         |
| Chain-of-Thought           | Any task requiring logic or steps   |
| Workflow Lists             | Multiphase reasoning tasks          |
| Tool Block                 | Any prompt using API/tool calls     |
| Reflection Reminders       | Long context, debugging, validation |
| Dual Instruction Placement | Long documents (>100K tokens)       |

## Final Notes

Prompt engineering is empirical, not theoretical. Every use case is different. To engineer effectively with GPT-4.1:

* Maintain modular, versioned prompt templates
* Use structured instructions and output formats
* Enforce explicit planning and tool behavior
* Iterate prompts based on logs and evals

**Start simple. Add structure. Evaluate constantly.**

This guide is designed to be expanded. Use it as your baseline and evolve it as your systems scale.
prompting_for_instruction_following.md
ADDED
@@ -0,0 +1,293 @@
# [Prompting for Instruction Following](https://chatgpt.com/canvas/shared/6825ebe022148191bceb9fa5473a34eb)

## Overview

GPT-4.1 represents a significant shift in how developers should structure prompts for reliable, deterministic, and consistent behavior. Unlike earlier models which often inferred intent liberally, GPT-4.1 adheres to instructions in a far more literal, detail-sensitive manner. This brings both increased control and greater responsibility for developers: well-designed prompts yield exceptional results, while ambiguous or conflicting instructions may result in brittle or unexpected behavior.

This guide outlines best practices, real-world examples, and design patterns to fully utilize GPT-4.1’s instruction-following improvements across a variety of applications. It is structured to help you:

* Understand GPT-4.1’s instruction handling behavior
* Design high-integrity prompt scaffolds
* Debug prompt failures and mitigate ambiguity
* Align instructions with OpenAI’s guidance around tool usage, task persistence, and planning

This file is designed to stand alone for practical use and is fully aligned with the broader `openai-cookbook-pro` repository.

## Why Instruction-Following Matters

Instruction following is central to:

* **Agent behavior**: models acting in multi-step environments must reliably interpret commands
* **Tool use**: execution hinges on clearly-defined tool invocation criteria
* **Support workflows**: factual grounding depends on accurate boundary adherence
* **Security and safety**: systems must not misinterpret prohibitions or fail to enforce policy constraints

With GPT-4.1’s shift toward literal interpretation, instruction scaffolding becomes the primary control interface.

## GPT-4.1 Instruction Characteristics

### 1. **Literal Compliance**

GPT-4.1 follows instructions with minimal assumption. If a step is missing or unclear, the model is less likely to “fill in” or guess the user’s intent.

* **Previous behavior**: interpreted vague prompts broadly
* **Current behavior**: waits for or requests clarification

This improves safety and traceability but also increases fragility in loosely written prompts.

### 2. **Order-Sensitive Resolution**

When instructions conflict, GPT-4.1 favors those listed **last** in the prompt. This means developers should order rules hierarchically:

* General rules go early
* Specific overrides go later

Example:

```markdown
# Instructions
- Do not guess if unsure
- Use your knowledge if a tool isn’t available
- If both options are available, prefer the tool
```

### 3. **Format-Aware Behavior**

GPT-4.1 performs better with clearly formatted instructions. Prefer structured formats:

* Markdown with headers and lists
* XML with nested tags
* Structured sections like `# Steps`, `# Output Format`

Poorly formatted, unsegmented prompts lead to instruction bleed and undesired merging of behaviors.

## Recommended Prompt Structure

Organize your prompt using a structure that mirrors OpenAI’s internal evaluation standards.

### 📁 Standard Sections

```markdown
# Role and Objective
# Instructions
## Sub-categories for Specific Behavior
# Workflow Steps (Optional)
# Output Format
# Examples (Optional)
# Final Reminder
```

### Example Prompt Template

```markdown
# Role and Objective
You are a customer service assistant. Your job is to resolve user issues efficiently, using tools when needed.

# Instructions
- Greet the user politely.
- Use a tool before answering any account-related question.
- If unsure how to proceed, ask the user for clarification.
- If a user requests escalation, refer them to a human agent.

## Output Format
- Always use a friendly tone.
- Format your answer in plain text.
- Include a summary at the end of your response.

## Final Reminder
Do not rely on prior knowledge. Use provided tools and context only.
```

## Instruction Categories

### 1. **Task Definition**

Clearly state the model’s job in the opening lines. Be explicit:

✅ “You are an assistant that reviews and edits legal contracts.”

🚫 “Help with contracts.”

### 2. **Behavioral Constraints**

List what the model must or must not do:

* Must call tools before responding to factual queries
* Must ask for clarification if user input is incomplete
* Must not provide financial or legal advice

### 3. **Response Style**

Define tone, length, formality, and structure.

* “Keep responses under 250 words.”
* “Avoid lists unless asked.”
* “Use a neutral tone.”

### 4. **Tool Use Protocols**

Models often hallucinate tools unless guided:

* “If you don’t have enough information to use a tool, ask the user for more.”
* “Always confirm tool usage before responding.”

## Debugging Instruction Failures

Instruction-following failures often stem from the following:

### Common Causes

* Ambiguous rule phrasing
* Conflicting instructions (e.g., both asking to guess and not guess)
* Implicit behaviors expected, not stated
* Overloaded instructions without formatting

### Diagnosis Steps

1. Read the full prompt in sequence
2. Identify potential ambiguity
3. Reorder to clarify precedence
4. Break complex rules into atomic steps
5. Test with structured evals
|
157 |
+
|
158 |
+
|
159 |
+
## Instruction Layering: The 3-Tier Model
|
160 |
+
|
161 |
+
When designing prompts for multi-step tasks, layer your instructions in tiers:
|
162 |
+
|
163 |
+
| Tier | Layer Purpose | Example |
|
164 |
+
| ---- | --------------------------- | ------------------------------------------ |
|
165 |
+
| 1 | Role Declaration | “You are an assistant for legal tasks.” |
|
166 |
+
| 2 | Global Behavior Constraints | “Always cite sources.” |
|
167 |
+
| 3 | Task-Specific Instructions | “In contracts, highlight ambiguous terms.” |
|
168 |
+
|
169 |
+
Each layer helps disambiguate behavior and provides a fallback structure if downstream instructions fail.
|
170 |
+
|
171 |
+
|
172 |
+
## Long Context Instruction Handling
|
173 |
+
|
174 |
+
In prompts exceeding 50,000 tokens:
|
175 |
+
|
176 |
+
* Place **key instructions** both **before and after** the context.
|
177 |
+
* Use format anchors (`# Instructions`, `<rules>`) to signal boundaries.
|
178 |
+
* Avoid relying solely on the top-of-prompt instructions.
|
179 |
+
|
180 |
+
GPT-4.1 is trained to respect these placements, especially when consistent structure is maintained.
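
As a concrete pattern, the sandwich layout can be assembled programmatically. A minimal sketch, assuming your own instruction block and document list (the names here are placeholders, not a prescribed API):

```python
# A minimal sketch of the "instructions before and after context" layout.
# INSTRUCTIONS, documents, and question are placeholders for your own content.
INSTRUCTIONS = """# Instructions
- Answer only from the documents provided below.
- If the documents do not contain the answer, say so explicitly.
"""

def build_long_context_prompt(documents: list[str], question: str) -> str:
    context = "\n\n".join(
        f'<doc id="{i}">\n{doc}\n</doc>' for i, doc in enumerate(documents)
    )
    # Repeat the instruction block on both sides of the context so the model
    # re-encounters the rules after reading tens of thousands of tokens.
    return (
        f"{INSTRUCTIONS}\n<context>\n{context}\n</context>\n\n"
        f"{INSTRUCTIONS}\n# Question\n{question}"
    )
```
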
## Literal vs. Flexible Models

| Capability             | GPT-3.5 / GPT-4-turbo | GPT-4.1         |
| ---------------------- | --------------------- | --------------- |
| Implicit inference     | High                  | Low             |
| Literal compliance     | Moderate              | High            |
| Prompt flexibility     | Higher tolerance      | Lower tolerance |
| Instruction debug cost | Lower                 | Higher          |

GPT-4.1 performs better **when prompts are precise**. Treat prompt engineering as API design: clear, testable, and version-controlled.

## Tips for Designing Instruction-Sensitive Prompts

### ✔️ DO:

* Use structured formatting
* Scope behaviors into separate bullet points
* Use examples to anchor expected output
* Rewrite ambiguous instructions into atomic steps
* Add conditionals explicitly (e.g., “if X, then Y”)

### ❌ DON’T:

* Assume the model will “understand what you meant”
* Use overloaded sentences with multiple actions
* Rely on invisible or implied rules
* Assume formatting styles (e.g., bullets) are optional

## Example: Instruction-Controlled Code Agent

```markdown
# Objective
You are a code assistant that fixes bugs in open-source projects.

# Instructions
- Always use the tools provided to inspect code.
- Do not make edits unless you have confirmed the bug’s root cause.
- If a change is proposed, validate it using tests.
- Do not respond until the patch is applied.

## Output Format
1. Description of bug
2. Explanation of root cause
3. Tool output (e.g., patch result)
4. Confirmation message

## Final Note
Do not guess. If you are unsure, use tools or ask.
```

> For a complete walkthrough, see `/examples/code-agent-instructions.md`

## Instruction Evolution Across Iterations

As your prompts grow, preserve instruction integrity using:

* Versioned templates
* Structured diffs for instruction edits
* Commented rules for traceability

Example diff:

```diff
- Always answer user questions.
+ Only answer user questions after validating tool output.
```

Maintain a changelog for prompts as you would with source code. This ensures instructional integrity during collaborative development.

## Testing and Evaluation

Prompt engineering is empirical. Validate instruction design using:

* **A/B tests**: Compare variants with and without behavioral scaffolds
* **Prompt evals**: Use deterministic queries to test edge-case behavior
* **Behavioral matrices**: Track compliance with instruction categories

Example matrix (a harness sketch for automating it follows the table):

| Instruction Category | Prompt A Pass | Prompt B Pass |
| -------------------- | ------------- | ------------- |
| Ask if unsure        | ✅             | ❌             |
| Use tools first      | ✅             | ✅             |
| Avoid sensitive data | ❌             | ✅             |
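
A sketch of automating such a matrix, assuming a hypothetical `run_prompt` helper that sends a system prompt plus probe message to the model and returns its text; the keyword predicates are crude stand-ins for real graders:

```python
# Hypothetical helper: run_prompt(system_prompt, user_message) -> str.
from my_harness import run_prompt

CASES = [
    # (instruction category, probe message, pass predicate)
    ("Ask if unsure", "Handle it.", lambda out: "?" in out),
    ("Use tools first", "What's my account balance?", lambda out: "tool" in out.lower()),
]

def score(prompt_name: str, system_prompt: str) -> None:
    for category, probe, passed in CASES:
        out = run_prompt(system_prompt, probe)
        print(f"{prompt_name} | {category}: {'✅' if passed(out) else '❌'}")

score("Prompt A", open("prompt_a.md").read())
score("Prompt B", open("prompt_b.md").read())
```
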
## Final Reminders

GPT-4.1 is exceptionally effective **when paired with well-structured, comprehensive instructions**. Follow these principles:

* Instructions should be modular and auditable.
* Avoid unnecessary repetition, but reinforce critical rules.
* Use formatting styles that clearly separate content.
* Assume literalism: write prompts as if programming a function, not chatting with a person.

Every prompt is a contract. GPT-4.1 honors that contract, but only if it is written clearly.

## See Also

* [`Agent Workflows`](../agent_design/swe_bench_agent.md)
* [`Prompt Format Reference`](../reference/prompting_guide.md)
* [`Long Context Strategies`](../examples/long-context-formatting.md)
* [`OpenAI 4.1 Prompting Guide`](https://platform.openai.com/docs/guides/prompting)

For questions, suggestions, or prompt design contributions, submit a pull request to `/examples/instruction-following.md` or open an issue in the main repo.
real_world_deployment.md
ADDED
@@ -0,0 +1,282 @@
# [Real-World Deployment Scenarios for GPT-4.1](https://chatgpt.com/canvas/shared/6825f3194b888191ae2417991002dcbd)

## Overview

This guide provides implementation-ready strategies for deploying GPT-4.1 in real-world systems. It outlines robust practices for integrating the model across diverse operational environments, from customer support automation to software development pipelines, while leveraging OpenAI's guidance on agentic workflows, instruction adherence, and tool integration.

The focus is on reliability, agent autonomy, and system-level alignment for production use. This document emphasizes scenario-based implementation blueprints, including prompt structure, tool configuration, risk mitigation, and iterative deployment cycles.

## Objectives

* Showcase tested deployment architectures for GPT-4.1 in applied domains
* Illustrate structured prompting strategies aligned with OpenAI's latest harness recommendations
* Codify best practices for tool integration, planning induction, and agent persistence
* Support enterprise-grade use through modular scenario blueprints

## Deployment Pattern 1: Customer Service Agent

### Use Case

Deploy GPT-4.1 as a first-line support agent capable of greeting users, answering account-related questions, handling tool lookups, and escalating edge cases.

### Prompt Structure

```markdown
# Role
You are a helpful customer service assistant for NewTelco.

# Instructions
- Always greet the user.
- Call tools before answering factual queries.
- Never rely on internal knowledge for billing/account issues.
- Ask for missing parameters if input is insufficient.
- Vary phrasing to avoid repetition.
- Always escalate when asked.
- Prohibited topics: [List Redacted].

# Sample Interaction
## User: Can I get my last bill?
## Assistant:
Hi, you've reached NewTelco. Let me retrieve that for you—one moment.
[Calls `get_user_bill` tool]
```

### Tool Schema

```json
{
  "name": "get_user_bill",
  "description": "Retrieve a user's latest billing information.",
  "parameters": {
    "type": "object",
    "properties": {
      "phone_number": { "type": "string" }
    },
    "required": ["phone_number"]
  }
}
```
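
When registered with the Chat Completions API, this schema sits inside a `function` envelope, and the model's tool call comes back as structured data. A minimal round-trip sketch; the billing lookup and the prompt file name are hypothetical stand-ins for your backend:

```python
import json
from openai import OpenAI

client = OpenAI()

def get_user_bill(phone_number: str) -> str:
    """Hypothetical backend lookup; replace with your billing system."""
    return json.dumps({"phone_number": phone_number, "amount_due": "$42.10"})

tools = [{
    "type": "function",
    "function": {
        "name": "get_user_bill",
        "description": "Retrieve a user's latest billing information.",
        "parameters": {
            "type": "object",
            "properties": {"phone_number": {"type": "string"}},
            "required": ["phone_number"],
        },
    },
}]

messages = [
    {"role": "system", "content": open("newtelco_prompt.md").read()},  # prompt above
    {"role": "user", "content": "Can I get my last bill? I'm on 555-0142."},
]

response = client.chat.completions.create(model="gpt-4.1", messages=messages, tools=tools)
message = response.choices[0].message

if message.tool_calls:
    messages.append(message)  # keep the assistant turn that issued the call(s)
    for call in message.tool_calls:
        args = json.loads(call.function.arguments)
        messages.append({"role": "tool", "tool_call_id": call.id,
                         "content": get_user_bill(**args)})

final = client.chat.completions.create(model="gpt-4.1", messages=messages, tools=tools)
print(final.choices[0].message.content)
```
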
### Best Practices

* Use a formal output format block.
* Include a tool call before every factual output.
* Cite the retrieved source when answering.

### Failure Mitigation

| Risk                     | Prevention                                |
| ------------------------ | ----------------------------------------- |
| Repetitive responses     | Vary phrasing with sample lists           |
| Tool skipping            | Require tool call before factual response |
| Prohibited topic leakage | Reinforce restriction list + test in QA   |

## Deployment Pattern 2: Codebase Maintenance Agent

### Use Case

An agent responsible for identifying and fixing bugs using diffs, applying patches, running tests, and confirming bug resolution.

### Prompt Highlights

```markdown
# Instructions
- Read all context before patching.
- Plan changes first.
- Apply patches with `apply_patch`.
- Run tests before finalizing.
- Keep going until all tests pass.

# Patch Format
*** Begin Patch
*** Update File: path/to/file.py
@@ def buggy():
- broken()
+ fixed()
*** End Patch
```

### Tool Schema

```json
{
  "name": "apply_patch",
  "description": "Applies human-readable code patches",
  "parameters": {
    "type": "object",
    "properties": {
      "input": { "type": "string" }
    },
    "required": ["input"]
  }
}
```

### Agent Workflow

1. Understand the bug
2. Explore relevant files
3. Propose and apply a patch
4. Run `!python3 run_tests.py`
5. Reflect and iterate until success

### Notes

* Use `@@` headers to specify scope
* Plan before every action
* Reflect after test results

## Deployment Pattern 3: Long Document Analyst

### Use Case

A document triage and synthesis agent for use with 100k–1M token context windows.

### Prompt Setup

```markdown
# Instructions
- Focus on relevance.
- Reflect every 10k tokens.
- Summarize findings by section.

# Strategy
1. Read → rate relevance
2. Extract high-salience content
3. Synthesize across documents
```

### Input Format Guidance

* Prefer `# Section`, `<doc>` tags, or ID/TITLE headers
* Avoid JSON for inputs beyond 10k tokens
* Repeat instructions at the start and end

### Best Practices

* Insert checkpoints every 5–10k tokens
* Ask the model to pause and reflect: “Are we on track?”
* Evaluate document relevance before synthesis

## Deployment Pattern 4: Data Labeling Assistant

### Use Case

Assist in labeling structured or unstructured data with schema validation and few-shot learning.

### Prompt Structure

```markdown
# Labeling Instructions
- Label each entry using valid categories
- Format: {"text": ..., "label": ...}

# Categories
- Urgent
- Normal
- Spam

# Example
{"text": "Free money now!", "label": "Spam"}
```

### API Integration

Validate each record against the schema on submission, and add real-time audit checks for consistency.
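
A sketch of that submit-time check, assuming the third-party `jsonschema` package and the label format shown above:

```python
import json
from jsonschema import ValidationError, validate

LABEL_SCHEMA = {
    "type": "object",
    "properties": {
        "text": {"type": "string"},
        "label": {"enum": ["Urgent", "Normal", "Spam"]},
    },
    "required": ["text", "label"],
}

def accept_label(raw: str) -> dict:
    record = json.loads(raw)        # each model output is one JSON object
    validate(record, LABEL_SCHEMA)  # raises ValidationError on invalid labels
    return record

try:
    accept_label('{"text": "Free money now!", "label": "Spam"}')
except (ValueError, ValidationError) as err:
    print(f"Rejected; send back for relabeling: {err}")
```
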
### Evaluation

* Measure label precision
* Flag outliers for review
* Use `tool_call` to suggest schema fixes

## Deployment Pattern 5: Research Assistant

### Use Case

Used by analysts to extract, summarize, and contrast findings across large research corpora.

### Core Prompt Blocks

```markdown
# Objective
Identify similarities and differences across these studies.

# Step-by-Step Plan
1. Break each study into hypothesis, method, result
2. Extract claims
3. Compare claim alignment or contradiction
```

### Ideal Format

Use XML-structured context for each paper:

```xml
<doc id="23" title="Study A">
  <hypothesis>...</hypothesis>
  <method>...</method>
  <results>...</results>
</doc>
```

### Output Pattern

```json
[
  {"id": "23", "summary": "Study A supports..."},
  {"id": "47", "summary": "Study B challenges..."},
  {"alignment": false, "conflict_reason": "Different control group"}
]
```

## Deployment Best Practices

### Prompting

* Use bullet-style `# Instructions`
* Add a `# Reasoning Strategy` section to guide workflow
* Repeat instructions at the top and bottom of long inputs

### Tool Integration

* Pass tools via the API schema, not inline in prompt text
* Provide examples in an `# Examples` section
* Use clear tool names and parameter descriptions

### Output Handling

* Define the expected format in advance
* Use schema validation for structured outputs
* Log every tool call and agent action

### Iterative Evaluation

* Audit performance per use case
* Evaluate edge-case behavior explicitly
* Collect examples of failure modes
* Adjust prompts, tools, and planning steps accordingly

## Summary

GPT-4.1 is deployable across a wide range of real-world systems. Success depends not only on model capability but on prompt structure, tool schema clarity, planning enforcement, and continual evaluation. Each scenario benefits from opinionated workflows, persistent agent behaviors, and clearly delimited responsibilities.

**Start with structured instructions. Plan agent actions. Validate at every step.**

## Additional Notes

* Always measure: accuracy, tool latency, format compliance, instruction adherence
* Use internal QA and sandbox environments before production
* Document all agentic patterns and update them based on logs
* Prefer long-term performance tracking over one-off evals

Deployment is not one prompt; it is a living system. Maintain, monitor, and adapt.
tool_use_and_integration.md
ADDED
@@ -0,0 +1,317 @@
# [Tool Use and Integration](https://chatgpt.com/canvas/shared/6825ee7dbfd081919e67bd643748f8de)

## Overview

GPT-4.1 introduces robust capabilities for working with tools directly through the OpenAI API’s `tools` parameter. Rather than relying solely on the model's internal knowledge, developers can now extend functionality, reduce hallucination, and enforce reliable workflows by integrating explicitly defined tools into their applications.

This document offers a comprehensive guide to designing and deploying tool-augmented applications with GPT-4.1. It includes best practices for tool registration, prompting strategies, tool schema design, usage examples, and debugging common tool invocation failures. Each section is modular and designed to help you build reliable systems that scale across contexts, task types, and user interfaces.

## What is a Tool in GPT-4.1?

A **tool** is an explicitly defined function or utility passed to the GPT-4.1 API, allowing the model to trigger predefined operations such as:

* Running code or bash commands
* Retrieving documents or structured data
* Performing API calls
* Applying file patches or diffs
* Looking up user account information

Tools are defined in a structured JSON schema and passed via the `tools` parameter. When the model determines a tool is required, it emits a function call rather than plain text. This enables **precise execution**, **auditable behavior**, and **tight application integration**.

## Why Use Tools?

| Benefit                        | Description                                                                 |
| ------------------------------ | --------------------------------------------------------------------------- |
| **Reduces hallucination**      | Encourages the model to call real-world functions instead of guessing       |
| **Improves traceability**      | Tool calls are logged and interpretable as function outputs                 |
| **Enables complex workflows**  | Offloads parts of the task to external systems (e.g., shell, Python, APIs)  |
| **Enhances compliance**        | Limits model responses to grounded tool outputs                             |
| **Improves agent performance** | Required for persistent, multi-turn agentic workflows                       |

## Tool Definition: The Schema

Tools are defined using a JSON schema object that includes:

* `name`: A short, unique identifier
* `description`: A concise explanation of what the tool does
* `parameters`: A standard JSON Schema describing the expected input

### Example: Python Execution Tool

```json
{
  "type": "function",
  "name": "python",
  "description": "Run Python code or terminal commands in a secure environment.",
  "parameters": {
    "type": "object",
    "properties": {
      "input": {
        "type": "string",
        "description": "The code or command to run"
      }
    },
    "required": ["input"]
  }
}
```

### Best Practices for Schema Design

* Use clear names: `run_tests`, `lookup_policy`, `apply_patch`
* Keep descriptions actionable: describe *when* and *why* to use the tool
* Minimize complexity: use shallow parameter objects where possible
* Use enums or constraints to reduce ambiguous calls

## Registering Tools in the API

In the Python SDK:

```python
from openai import OpenAI

client = OpenAI()

# chat_history, python_tool, and get_user_info_tool are defined elsewhere;
# python_tool is the schema shown above.
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=chat_history,
    tools=[python_tool, get_user_info_tool],
    tool_choice="auto"
)
```

Set `tool_choice` to:

* `"auto"`: Allow the model to choose when to call
* A specific tool name: Force one call
* `"none"`: Prevent tool usage (useful for testing)
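
Note that in the Chat Completions API, forcing a specific tool takes an object rather than a bare name. A sketch, reusing `client`, `chat_history`, and `python_tool` from the snippet above:

```python
# Force exactly one call to the `python` tool.
response = client.chat.completions.create(
    model="gpt-4.1",
    messages=chat_history,
    tools=[python_tool],
    tool_choice={"type": "function", "function": {"name": "python"}},
)

# The forced call arrives as structured data, not prose.
call = response.choices[0].message.tool_calls[0]
print(call.function.name, call.function.arguments)
```
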
## Prompting for Tool Use

### Tool Use Prompting Guidelines

To guide GPT-4.1 toward proper tool usage:

* **Don’t rely on the model to infer when to call a tool.** Tell it explicitly when tools are required.
* **Prompt for failure cases**: Tell the model what to do when it lacks information (e.g., “ask the user” or “pause”).
* **Avoid ambiguity**: Be clear about tool invocation order and data requirements.

### Example Prompt Snippet

```markdown
Before answering any user question about billing, check if the necessary context is available.
If not, use the `lookup_policy_document` tool to find relevant information.
Never answer without citing a retrieved document.
```

### Escalation Pattern

```markdown
If the tool fails to return the necessary data, ask the user for clarification.
If the user cannot provide it, explain the limitation and pause further action.
```

## Tool Use in Agent Workflows

Tool usage is foundational to agent design in GPT-4.1.

### Multi-Stage Task Example: Bug Fix Agent

```markdown
1. Use `read_file` to inspect code
2. Analyze and plan a fix
3. Use `apply_patch` to update the file
4. Use `run_tests` to verify changes
5. Reflect and reattempt if needed
```

Each tool call is logged as a JSON event and can be parsed programmatically.

## Apply Patch: Recommended Format

One of the most powerful GPT-4.1 patterns is **patch generation** using a diff-like format.

### Patch Structure

```bash
apply_patch <<"EOF"
*** Begin Patch
*** Update File: path/to/file.py
@@ def function():
- old_code()
+ new_code()
*** End Patch
EOF
```

### Tool Behavior

* No line numbers are required
* Context is determined by `@@` anchors and the 3 lines of code before/after each change
* Errors must be handled gracefully and logged

See `/examples/apply_patch/` for templates and error-handling techniques.

## Tool Examples by Use Case

| Use Case              | Tool Name       | Description                                |
| --------------------- | --------------- | ------------------------------------------ |
| Execute code          | `python`        | Runs code or shell commands                |
| Apply file diff       | `apply_patch`   | Applies a patch to a source file           |
| Fetch document        | `lookup_policy` | Retrieves structured policy text           |
| Get user account data | `get_user_info` | Fetches user account info via phone number |
| Log analytics         | `log_event`     | Sends metadata to your analytics platform  |

## Error Handling and Recovery

Tool failure is inevitable in complex systems. Plan for it.

### Guidelines for GPT-4.1:

* Detect and summarize tool errors
* Ask for missing input
* Retry if safe
* Escalate to the user if unresolvable

### Prompt Pattern: Failure Response

```markdown
If a tool fails with an error, summarize the issue clearly for the user.
Only retry if the cause of failure is known and correctable.
If not, explain the problem and ask the user for next steps.
```

## Tool Debugging and Logging

Enable structured logging to track model-tool interactions:

* **Log call attempts**: Include input parameters and timestamps
* **Log success/failure outcomes**: Include model reflections
* **Log retry logic**: Show how failures were handled

This creates full traceability for AI-involved actions.

### Sample Tool Call Log (JSON)

```json
{
  "tool_name": "run_tests",
  "input": "!python3 -m unittest discover",
  "result": "3 tests passed, 1 failed",
  "timestamp": "2025-05-15T14:32:12Z"
}
```

## Tool Evaluation and Performance Monitoring

Track tool usage metrics:

* **Tool Call Rate**: How often a tool is invoked
* **Tool Completion Rate**: How often tools finish without failure
* **Tool Contribution Score**: Impact on final task completion
* **Average Attempts per Task**: Retry behavior over time

Use this data to refine prompting and improve tool schema design.
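
A sketch of deriving the first two metrics from logs, assuming one JSON record per line shaped like the sample entry above and assuming failed calls carry an `error` field:

```python
import json
from collections import Counter

def tool_metrics(log_path: str) -> dict:
    calls, failures = Counter(), Counter()
    with open(log_path) as f:
        for line in f:
            event = json.loads(line)
            name = event["tool_name"]
            calls[name] += 1
            if event.get("error"):  # assumed failure convention
                failures[name] += 1
    return {
        name: {
            "call_count": calls[name],
            "completion_rate": 1 - failures[name] / calls[name],
        }
        for name in calls
    }

print(tool_metrics("tool_calls.jsonl"))
```
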
## Common Pitfalls and Solutions

| Issue                        | Likely Cause                                   | Solution                                              |
| ---------------------------- | ---------------------------------------------- | ----------------------------------------------------- |
| Tool called with empty input | Missing required parameter                     | Prompt model to validate input presence               |
| Tool ignored                 | Tool not described clearly in schema or prompt | Add clear instruction for when to use tool            |
| Repeated failed calls        | No failure mitigation logic                    | Add conditionals to check and respond to tool errors  |
| Model mixes tool names       | Ambiguous tool naming                          | Use short, specific, unambiguous names                |

## Combining Tools with Instructions

When combining tools with detailed instruction sets:

* Include a `# Tools` section in your system prompt
* Define when and why each tool should be used
* Link tool calls to reasoning steps in `# Workflow`

### Example Combined Prompt

```markdown
# Role
You are a bug-fix agent using provided tools to solve code issues.

# Tools
- `read_file`: Inspect code files
- `apply_patch`: Apply structured diffs
- `run_tests`: Validate code after changes

# Instructions
1. Always start with file inspection
2. Plan before making changes
3. Test after every patch
4. Do not finish until all tests pass

# Output
Include patch summaries, test outcomes, and current status.
```

## Tool Testing Templates

Create test cases that validate:

* Input formatting
* Response validation
* Prompt-tool alignment
* Handling of edge cases

Use both synthetic and real examples:

```markdown
## Tool Call Test: run_tests
**Input**: Code with known error
**Expected Output**: Test failure summary
**Follow-up Behavior**: Retry with fixed patch
```
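
Templates like this translate directly into automated tests. A pytest-style sketch, assuming a hypothetical `dispatch_tool(name, input)` helper that executes a tool call against a fixture repo and returns its output:

```python
from my_agent_harness import dispatch_tool  # hypothetical test-harness helper

def test_run_tests_reports_failure_summary():
    # A fixture repo with a known bug should yield a failure summary, not a crash.
    output = dispatch_tool("run_tests", "!python3 -m unittest discover")
    assert "failed" in output.lower()

def test_run_tests_rejects_empty_input():
    # Guards against the "tool called with empty input" pitfall above.
    output = dispatch_tool("run_tests", "")
    assert "missing" in output.lower() or "error" in output.lower()
```
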
## Tool Choice Design

Choose between model-directed and developer-directed tool invocation:

| Mode          | Behavior                                    | Use Case                           |
| ------------- | ------------------------------------------- | ---------------------------------- |
| `auto`        | Model decides whether and when to use tools | General assistants, exploration    |
| `none`        | Model cannot use tools                      | Testing model reasoning only       |
| forced name   | Developer forces a specific tool call       | Known pipeline steps, unit testing |

Choose based on control needs and task constraints.

## Summary: Best Practices for Tool Integration

| Area             | Best Practice                                             |
| ---------------- | --------------------------------------------------------- |
| Tool Naming      | Use action-based, unambiguous names                       |
| Prompt Structure | Clearly define when and how tools should be used          |
| Tool Invocation  | Register tools in the API, not in plain prompt text       |
| Failure Handling | Provide instructions for retrying or asking the user      |
| Schema Design    | Use JSON Schema with constraints to reduce invalid input  |
| Evaluation       | Track tool call success rate and contribution to outcome  |

## Further Exploration

* [`Designing Agent Workflows`](./designing_agent_workflows.md)
* [`Prompting for Instruction Following`](./prompting_for_instruction_following.md)
* [`Long Context Strategies`](./handling_long_contexts.md)

For community templates and tool libraries, explore the `/tools/` and `/examples/` directories in the main repository.

For contributions, open a pull request against `/tools/Tool Use and Integration.md` or open an issue in the main repo.
|