# [Prompting for Instruction Following](https://chatgpt.com/canvas/shared/6825ebe022148191bceb9fa5473a34eb)

## Overview

GPT-4.1 represents a significant shift in how developers should structure prompts for reliable, predictable, and consistent behavior. Unlike earlier models, which often inferred intent liberally, GPT-4.1 adheres to instructions in a far more literal, detail-sensitive manner. This brings both increased control and greater responsibility for developers: well-designed prompts yield exceptional results, while ambiguous or conflicting instructions may produce brittle or unexpected behavior.

This guide outlines best practices, real-world examples, and design patterns to fully utilize GPT-4.1’s instruction-following improvements across a variety of applications. It is structured to help you:

* Understand GPT-4.1’s instruction handling behavior
* Design high-integrity prompt scaffolds
* Debug prompt failures and mitigate ambiguity
* Align instructions with OpenAI’s guidance around tool usage, task persistence, and planning

This file is designed to stand alone for practical use and is fully aligned with the broader `openai-cookbook-pro` repository.


## Why Instruction-Following Matters

Instruction following is central to:

* **Agent behavior**: models acting in multi-step environments must reliably interpret commands
* **Tool use**: execution hinges on clearly-defined tool invocation criteria
* **Support workflows**: factual grounding depends on accurate boundary adherence
* **Security and safety**: systems must not misinterpret prohibitions or fail to enforce policy constraints

With GPT-4.1’s shift toward literal interpretation, instruction scaffolding becomes the primary control interface.


## GPT-4.1 Instruction Characteristics

### 1. **Literal Compliance**

GPT-4.1 follows instructions with minimal assumption. If a step is missing or unclear, the model is less likely to “fill in” or guess the user’s intent.

* **Previous behavior**: interpreted vague prompts broadly
* **Current behavior**: waits for or requests clarification

This improves safety and traceability but also increases fragility in loosely written prompts.

### 2. **Order-Sensitive Resolution**

When instructions conflict, GPT-4.1 favors those listed **last** in the prompt. This means developers should order rules hierarchically:

* General rules go early
* Specific overrides go later

Example:

```markdown
# Instructions
- Do not guess if unsure
- Use your knowledge if a tool isn’t available
- If both options are available, prefer the tool
```

### 3. **Format-Aware Behavior**

GPT-4.1 performs better with clearly formatted instructions. Prefer structured formats:

* Markdown with headers and lists
* XML with nested tags
* Structured sections like `# Steps`, `# Output Format`

Poorly formatted, unsegmented prompts lead to instruction bleed and undesired merging of behaviors.


## Recommended Prompt Structure

Organize your prompt using a structure that mirrors OpenAI’s internal evaluation standards.

### 📁 Standard Sections

```markdown
# Role and Objective
# Instructions
## Sub-categories for Specific Behavior
# Workflow Steps (Optional)
# Output Format
# Examples (Optional)
# Final Reminder
```

### Example Prompt Template

```markdown
# Role and Objective
You are a customer service assistant. Your job is to resolve user issues efficiently, using tools when needed.

# Instructions
- Greet the user politely.
- Use a tool before answering any account-related question.
- If unsure how to proceed, ask the user for clarification.
- If a user requests escalation, refer them to a human agent.

## Output Format
- Always use a friendly tone.
- Format your answer in plain text.
- Include a summary at the end of your response.

## Final Reminder
Do not rely on prior knowledge. Use provided tools and context only.
```
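Templates like the one above can also be assembled programmatically, so each section can be versioned and tested on its own. A minimal sketch (the helper and section contents are illustrative, not part of any official API):

```python
# Minimal sketch: build a prompt from named, ordered sections so each
# section can be edited and diffed independently. Section titles mirror
# the standard sections described above; the helper is illustrative.

def build_prompt(sections: dict[str, str]) -> str:
    """Join ordered sections under markdown # headers."""
    return "\n\n".join(f"# {title}\n{body.strip()}" for title, body in sections.items())

prompt = build_prompt({
    "Role and Objective": "You are a customer service assistant.",
    "Instructions": "- Greet the user politely.\n- Ask for clarification if unsure.",
    "Final Reminder": "Use provided tools and context only.",
})
print(prompt.splitlines()[0])  # "# Role and Objective"
```

Because sections are keyed and ordered, a change to one rule stays local to one entry, which keeps prompt diffs reviewable.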


## Instruction Categories

### 1. **Task Definition**

Clearly state the model’s job in the opening lines. Be explicit:

✅ “You are an assistant that reviews and edits legal contracts.”

🚫 “Help with contracts.”

### 2. **Behavioral Constraints**

List what the model must or must not do:

* Must call tools before responding to factual queries
* Must ask for clarification if user input is incomplete
* Must not provide financial or legal advice

### 3. **Response Style**

Define tone, length, formality, and structure.

* “Keep responses under 250 words.”
* “Avoid lists unless asked.”
* “Use a neutral tone.”

### 4. **Tool Use Protocols**

Models may hallucinate tool calls or arguments unless explicitly guided:

* “If you don’t have enough information to use a tool, ask the user for more.”
* “Always confirm tool usage before responding.”
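One way to enforce the first rule outside the prompt is a pre-call guard: check that required arguments are present before invoking a tool, and ask the user otherwise. A sketch, where the tool name and schema are hypothetical:

```python
# Sketch of a pre-call guard: before invoking a tool, verify all
# required arguments are present; if not, return a clarification
# request instead of a (possibly hallucinated) call.
# "lookup_account" and its schema are hypothetical examples.

REQUIRED_ARGS = {"lookup_account": ["account_id"]}

def prepare_tool_call(tool: str, args: dict) -> dict:
    missing = [a for a in REQUIRED_ARGS.get(tool, []) if not args.get(a)]
    if missing:
        return {"action": "ask_user", "missing": missing}
    return {"action": "call_tool", "tool": tool, "args": args}

print(prepare_tool_call("lookup_account", {}))
# {'action': 'ask_user', 'missing': ['account_id']}
```

The same guard doubles as a place to log attempted calls, which helps when auditing tool-use behavior.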


## Debugging Instruction Failures

Instruction-following failures often stem from the following:

### Common Causes

* Ambiguous rule phrasing
* Conflicting instructions (e.g., asking the model both to guess and not to guess)
* Implicit behaviors expected, not stated
* Overloaded instructions without formatting

### Diagnosis Steps

1. Read the full prompt in sequence
2. Identify potential ambiguity
3. Reorder to clarify precedence
4. Break complex rules into atomic steps
5. Test with structured evals
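Step 2 can be partially automated with a toy prompt linter that flags a rule appearing alongside its negation. A sketch, with an illustrative (not exhaustive) conflict list:

```python
# Toy linter for diagnosis step 2: flag prompts that contain both a
# rule and its contradiction. The conflict pairs are illustrative.

CONFLICT_PAIRS = [("guess if unsure", "do not guess")]

def find_conflicts(prompt: str) -> list[tuple[str, str]]:
    text = prompt.lower()
    return [(a, b) for a, b in CONFLICT_PAIRS if a in text and b in text]

rules = "- Guess if unsure.\n- Do not guess without tool output."
print(find_conflicts(rules))  # [('guess if unsure', 'do not guess')]
```

Substring matching is crude, but even this level of checking catches the most common copy-paste contradictions before an eval run does.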


## Instruction Layering: The 3-Tier Model

When designing prompts for multi-step tasks, layer your instructions in tiers:

| Tier | Layer Purpose               | Example                                    |
| ---- | --------------------------- | ------------------------------------------ |
| 1    | Role Declaration            | “You are an assistant for legal tasks.”    |
| 2    | Global Behavior Constraints | “Always cite sources.”                     |
| 3    | Task-Specific Instructions  | “In contracts, highlight ambiguous terms.” |

Each layer helps disambiguate behavior and provides a fallback structure if downstream instructions fail.
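The three tiers can be composed mechanically, most general first, so that the task-specific rules land last, where GPT-4.1 gives them precedence in conflicts. A minimal sketch using the table's example content:

```python
# Sketch: compose the three tiers into one system prompt. General
# rules go first; task-specific rules go last, where GPT-4.1 weighs
# them most heavily when instructions conflict.

def layer_prompt(role: str, constraints: list[str], task_rules: list[str]) -> str:
    parts = [role]
    parts += [f"- {c}" for c in constraints]
    parts += [f"- {r}" for r in task_rules]
    return "\n".join(parts)

print(layer_prompt(
    "You are an assistant for legal tasks.",
    ["Always cite sources."],
    ["In contracts, highlight ambiguous terms."],
))
```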


## Long Context Instruction Handling

In prompts exceeding 50,000 tokens:

* Place **key instructions** both **before and after** the context.
* Use format anchors (`# Instructions`, `<rules>`) to signal boundaries.
* Avoid relying solely on the top-of-prompt instructions.

GPT-4.1 is trained to respect these placements, especially when consistent structure is maintained.
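The before-and-after placement amounts to a "sandwich": instructions, then the long context between anchors, then the instructions repeated. A sketch of the assembly (the anchor tags follow the conventions above; the helper itself is illustrative):

```python
# Sketch of the "sandwich" placement for long-context prompts:
# key instructions appear both before and after the context block,
# with explicit anchors marking the boundary.

def sandwich(instructions: str, context: str) -> str:
    return (
        "# Instructions\n" + instructions
        + "\n\n<context>\n" + context + "\n</context>\n\n"
        + "# Instructions (repeated)\n" + instructions
    )

p = sandwich("Answer only from the context.", "...long documents here...")
print(p.count("Answer only from the context."))  # 2
```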


## Literal vs. Flexible Models

| Capability             | GPT-3.5 / GPT-4-turbo | GPT-4.1         |
| ---------------------- | --------------------- | --------------- |
| Implicit inference     | High                  | Low             |
| Literal compliance     | Moderate              | High            |
| Prompt flexibility     | Higher tolerance      | Lower tolerance |
| Instruction debug cost | Lower                 | Higher          |

GPT-4.1 performs better **when prompts are precise**. Treat prompt engineering as API design — clear, testable, and version-controlled.


## Tips for Designing Instruction-Sensitive Prompts

### ✔️ DO:

* Use structured formatting
* Scope behaviors into separate bullet points
* Use examples to anchor expected output
* Rewrite ambiguous instructions into atomic steps
* Add conditionals explicitly (e.g., “if X, then Y”)

### ❌ DON’T:

* Assume the model will “understand what you meant”
* Use overloaded sentences with multiple actions
* Rely on invisible or implied rules
* Assume formatting styles (e.g., bullets) are optional


## Example: Instruction-Controlled Code Agent

```markdown
# Objective
You are a code assistant that fixes bugs in open-source projects.

# Instructions
- Always use the tools provided to inspect code.
- Do not make edits unless you have confirmed the bug’s root cause.
- If a change is proposed, validate using tests.
- Do not respond unless the patch is applied.

## Output Format
1. Description of bug
2. Explanation of root cause
3. Tool output (e.g., patch result)
4. Confirmation message

## Final Note
Do not guess. If you are unsure, use tools or ask.
```

> For a complete walkthrough, see `/examples/code-agent-instructions.md`


## Instruction Evolution Across Iterations

As your prompts grow, preserve instruction integrity using:

* Versioned templates
* Structured diffs for instruction edits
* Commented rules for traceability

Example diff:

```diff
- Always answer user questions.
+ Only answer user questions after validating tool output.
```

Maintain a changelog for prompts as you would with source code. This ensures instructional integrity during collaborative development.
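Diffs like the one above can be generated with the standard library, so instruction edits are reviewable the same way code changes are. A sketch using `difflib` (the version labels are illustrative):

```python
# Sketch: produce a reviewable unified diff between two prompt
# versions using only the standard library.
import difflib

v1 = ["Always answer user questions."]
v2 = ["Only answer user questions after validating tool output."]

for line in difflib.unified_diff(v1, v2, fromfile="prompt_v1", tofile="prompt_v2", lineterm=""):
    print(line)
```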


## Testing and Evaluation

Prompt engineering is empirical. Validate instruction design using:

* **A/B tests**: Compare variants with and without behavioral scaffolds
* **Prompt evals**: Use deterministic queries to test edge case behavior
* **Behavioral matrices**: Track compliance with instruction categories

Example matrix:

| Instruction Category | Prompt A Pass | Prompt B Pass |
| -------------------- | ------------- | ------------- |
| Ask if unsure        | ✅             | ❌             |
| Use tools first      | ✅             | ✅             |
| Avoid sensitive data | ❌             | ✅             |
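A matrix like this can be produced by a small harness that runs deterministic checks over model outputs. A minimal sketch, where the responses and pass/fail predicates are stand-ins for real eval data:

```python
# Minimal eval-harness sketch: run deterministic checks against a
# model response and report pass/fail per instruction category.
# The predicates below are illustrative stand-ins for real evals.

CHECKS = {
    "Ask if unsure": lambda r: "clarify" in r.lower(),
    "Use tools first": lambda r: r.startswith("[tool]"),
}

def evaluate(response: str) -> dict[str, bool]:
    return {name: check(response) for name, check in CHECKS.items()}

print(evaluate("[tool] lookup ran. Could you clarify your account ID?"))
# {'Ask if unsure': True, 'Use tools first': True}
```

Running the same checks over two prompt variants yields exactly the A/B pass matrix shown above.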


## Final Reminders

GPT-4.1 is exceptionally effective **when paired with well-structured, comprehensive instructions**. Follow these principles:

* Instructions should be modular and auditable.
* Avoid unnecessary repetition, but reinforce critical rules.
* Use formatting styles that clearly separate content.
* Assume literalism — write prompts as if programming a function, not chatting with a person.

Every prompt is a contract. GPT-4.1 honors that contract, but only if written clearly.


## See Also

* [`Agent Workflows`](../agent_design/swe_bench_agent.md)
* [`Prompt Format Reference`](../reference/prompting_guide.md)
* [`Long Context Strategies`](../examples/long-context-formatting.md)
* [`OpenAI 4.1 Prompting Guide`](https://platform.openai.com/docs/guides/prompting)


For questions, suggestions, or prompt design contributions, submit a pull request to `/examples/instruction-following.md` or open an issue in the main repo.