# [Prompting for Instruction Following](https://chatgpt.com/canvas/shared/6825ebe022148191bceb9fa5473a34eb)
## Overview
GPT-4.1 represents a significant shift in how developers should structure prompts for reliable, deterministic, and consistent behavior. Unlike earlier models, which often inferred intent liberally, GPT-4.1 adheres to instructions in a far more literal, detail-sensitive manner. This brings both increased control and greater responsibility for developers: well-designed prompts yield exceptional results, while ambiguous or conflicting instructions can produce brittle or unexpected behavior.
This guide outlines best practices, real-world examples, and design patterns to fully utilize GPT-4.1’s instruction-following improvements across a variety of applications. It is structured to help you:
* Understand GPT-4.1’s instruction handling behavior
* Design high-integrity prompt scaffolds
* Debug prompt failures and mitigate ambiguity
* Align instructions with OpenAI’s guidance around tool usage, task persistence, and planning
This file is designed to stand alone for practical use and is fully aligned with the broader `openai-cookbook-pro` repository.
## Why Instruction-Following Matters
Instruction following is central to:
* **Agent behavior**: models acting in multi-step environments must reliably interpret commands
* **Tool use**: execution hinges on clearly defined tool invocation criteria
* **Support workflows**: factual grounding depends on accurate boundary adherence
* **Security and safety**: systems must not misinterpret prohibitions or fail to enforce policy constraints
With GPT-4.1’s shift toward literal interpretation, instruction scaffolding becomes the primary control interface.
## GPT-4.1 Instruction Characteristics
### 1. **Literal Compliance**
GPT-4.1 follows instructions with minimal assumption. If a step is missing or unclear, the model is less likely to “fill in” or guess the user’s intent.
* **Previous behavior**: interpreted vague prompts broadly
* **Current behavior**: waits for or requests clarification
This improves safety and traceability but also increases fragility in loosely written prompts.
### 2. **Order-Sensitive Resolution**
When instructions conflict, GPT-4.1 favors those listed **last** in the prompt. This means developers should order rules hierarchically:
* General rules go early
* Specific overrides go later
Example:
```markdown
# Instructions
- Do not guess if unsure
- Use your knowledge if a tool isn’t available
- If both options are available, prefer the tool
```
### 3. **Format-Aware Behavior**
GPT-4.1 performs better with clearly formatted instructions. Prefer structured formats:
* Markdown with headers and lists
* XML with nested tags
* Structured sections like `# Steps`, `# Output Format`
Poorly formatted, unsegmented prompts lead to instruction bleed and undesired merging of behaviors.
## Recommended Prompt Structure
Organize your prompt using a structure that mirrors the section layout recommended in OpenAI’s GPT-4.1 prompting guide.
### 📁 Standard Sections
```markdown
# Role and Objective
# Instructions
## Sub-categories for Specific Behavior
# Workflow Steps (Optional)
# Output Format
# Examples (Optional)
# Final Reminder
```
### Example Prompt Template
```markdown
# Role and Objective
You are a customer service assistant. Your job is to resolve user issues efficiently, using tools when needed.
# Instructions
- Greet the user politely.
- Use a tool before answering any account-related question.
- If unsure how to proceed, ask the user for clarification.
- If a user requests escalation, refer them to a human agent.
# Output Format
- Always use a friendly tone.
- Format your answer in plain text.
- Include a summary at the end of your response.
# Final Reminder
Do not rely on prior knowledge. Use provided tools and context only.
```
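To wire a template like this into an application, pass it as the system message of a request. Below is a minimal sketch using the official `openai` Python SDK; the model identifier and user message are illustrative, not prescriptive.

```python
# Minimal sketch: sending the template above as a system message.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

SYSTEM_PROMPT = """\
# Role and Objective
You are a customer service assistant. Your job is to resolve user issues efficiently, using tools when needed.

# Instructions
- Greet the user politely.
- Use a tool before answering any account-related question.
- If unsure how to proceed, ask the user for clarification.
"""

response = client.chat.completions.create(
    model="gpt-4.1",  # assumed model identifier
    messages=[
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": "Why was my card declined?"},
    ],
)
print(response.choices[0].message.content)
```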
## Instruction Categories
### 1. **Task Definition**
Clearly state the model’s job in the opening lines. Be explicit:
✅ “You are an assistant that reviews and edits legal contracts.”
🚫 “Help with contracts.”
### 2. **Behavioral Constraints**
List what the model must or must not do:
* Must call tools before responding to factual queries
* Must ask for clarification if user input is incomplete
* Must not provide financial or legal advice
### 3. **Response Style**
Define tone, length, formality, and structure.
* “Keep responses under 250 words.”
* “Avoid lists unless asked.”
* “Use a neutral tone.”
### 4. **Tool Use Protocols**
Models may hallucinate tool calls or fabricate tool inputs unless explicitly guided (see the sketch after this list):
* “If you don’t have enough information to use a tool, ask the user for more.”
* “Always confirm tool usage before responding.”
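Declaring tools through the API’s `tools` parameter, instead of describing them in prose, gives the model a schema it can call rather than invent. The sketch below uses the `openai` Python SDK’s function-calling interface; the `get_account_status` tool and its fields are hypothetical.

```python
# Sketch: register a tool schema so the model calls it instead of guessing.
from openai import OpenAI

client = OpenAI()

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_account_status",  # hypothetical tool
            "description": "Look up the current status of a customer account.",
            "parameters": {
                "type": "object",
                "properties": {
                    "account_id": {
                        "type": "string",
                        "description": "The customer's account identifier.",
                    }
                },
                "required": ["account_id"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="gpt-4.1",  # assumed model identifier
    messages=[
        {
            "role": "system",
            "content": "Use a tool before answering account questions. "
            "If you lack the information a tool needs, ask the user for it.",
        },
        {"role": "user", "content": "Is my account active?"},
    ],
    tools=tools,
)

# Without an account_id, a well-instructed model should ask for one
# rather than fabricating tool arguments.
message = response.choices[0].message
print(message.tool_calls or message.content)
```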
## Debugging Instruction Failures
Instruction-following failures usually stem from a handful of prompt-design issues.
### Common Causes
* Ambiguous rule phrasing
* Conflicting instructions (e.g., instructing the model both to guess and not to guess)
* Implicit behaviors expected, not stated
* Overloaded instructions without formatting
### Diagnosis Steps
1. Read the full prompt in sequence
2. Identify potential ambiguity
3. Reorder to clarify precedence
4. Break complex rules into atomic steps
5. Test with structured evals (see the sketch below)
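For step 5, a deterministic mini-eval can pair fixed inputs with predicates the reply must satisfy. This is a sketch, not a full eval framework; the cases and checks are illustrative, and `temperature=0` reduces but does not eliminate nondeterminism.

```python
# Sketch: a tiny deterministic eval for instruction compliance.
from openai import OpenAI

client = OpenAI()

# Each case: (fixed user input, predicate the reply must satisfy).
CASES = [
    ("Fix my account.", lambda reply: "?" in reply),                  # should ask a clarifying question
    ("I want to escalate.", lambda reply: "human" in reply.lower()),  # should mention a human agent
]

def run_eval(system_prompt: str) -> float:
    passed = 0
    for user_input, check in CASES:
        reply = client.chat.completions.create(
            model="gpt-4.1",  # assumed model identifier
            temperature=0,
            messages=[
                {"role": "system", "content": system_prompt},
                {"role": "user", "content": user_input},
            ],
        ).choices[0].message.content
        passed += bool(check(reply))
    return passed / len(CASES)

print(f"Pass rate: {run_eval('Ask for clarification when input is incomplete.'):.0%}")
```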
## Instruction Layering: The 3-Tier Model
When designing prompts for multi-step tasks, layer your instructions in tiers:
| Tier | Layer Purpose | Example |
| ---- | --------------------------- | ------------------------------------------ |
| 1 | Role Declaration | “You are an assistant for legal tasks.” |
| 2 | Global Behavior Constraints | “Always cite sources.” |
| 3 | Task-Specific Instructions | “In contracts, highlight ambiguous terms.” |
Each layer helps disambiguate behavior and provides a fallback structure if downstream instructions fail.
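In code, the tiers can live as separate strings and be assembled at request time, which keeps each layer independently versionable and testable. A minimal sketch, with all strings illustrative:

```python
# Sketch: compose a system prompt from the three tiers.
TIER_1_ROLE = "You are an assistant for legal tasks."
TIER_2_GLOBAL = "- Always cite sources.\n- Ask for clarification if a request is ambiguous."
TIER_3_TASK = "- In contracts, highlight ambiguous terms."

def build_system_prompt(role: str, global_rules: str, task_rules: str) -> str:
    return (
        f"# Role and Objective\n{role}\n\n"
        f"# Instructions\n{global_rules}\n\n"
        f"## Task-Specific Instructions\n{task_rules}\n"
    )

print(build_system_prompt(TIER_1_ROLE, TIER_2_GLOBAL, TIER_3_TASK))
```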
## Long Context Instruction Handling
In prompts exceeding 50,000 tokens:
* Place **key instructions** both **before and after** the context.
* Use format anchors (`# Instructions`, `<rules>`) to signal boundaries.
* Avoid relying solely on the top-of-prompt instructions.
GPT-4.1 is trained to respect these placements, especially when consistent structure is maintained.
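The sandwich layout can be built mechanically. The sketch below repeats the instruction block before and after the context and uses `<context>` tags as format anchors; the rules and placeholder document are illustrative.

```python
# Sketch: place key instructions both before and after a long context block.
INSTRUCTIONS = """\
# Instructions
- Answer only from the material inside <context>.
- If the answer is not in the context, say so explicitly.
"""

def build_long_context_prompt(document: str, question: str) -> str:
    return (
        f"{INSTRUCTIONS}\n"
        f"<context>\n{document}\n</context>\n\n"
        f"{INSTRUCTIONS}\n"  # repeated after the context, per the guidance above
        f"# Question\n{question}\n"
    )

document = "(stand-in for tens of thousands of tokens of source text)"
print(build_long_context_prompt(document, "Who owns the IP under this contract?"))
```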
## Literal vs. Flexible Models
| Capability | GPT-3.5 / GPT-4-turbo | GPT-4.1 |
| ---------------------- | --------------------- | --------------- |
| Implicit inference | High | Low |
| Literal compliance | Moderate | High |
| Prompt flexibility | Higher tolerance | Lower tolerance |
| Instruction debug cost | Lower | Higher |
GPT-4.1 performs better **when prompts are precise**. Treat prompt engineering as API design — clear, testable, and version-controlled.
## Tips for Designing Instruction-Sensitive Prompts
### ✔️ DO:
* Use structured formatting
* Scope behaviors into separate bullet points
* Use examples to anchor expected output
* Rewrite ambiguous instructions into atomic steps
* Add conditionals explicitly (e.g., “if X, then Y”)
### ❌ DON’T:
* Assume the model will “understand what you meant”
* Use overloaded sentences with multiple actions
* Rely on invisible or implied rules
* Assume formatting styles (e.g., bullets) are optional
## Example: Instruction-Controlled Code Agent
```markdown
# Objective
You are a code assistant that fixes bugs in open-source projects.
# Instructions
- Always use the tools provided to inspect code.
- Do not make edits unless you have confirmed the bug’s root cause.
- If a change is proposed, validate using tests.
- Do not respond unless the patch is applied.
# Output Format
1. Description of bug
2. Explanation of root cause
3. Tool output (e.g., patch result)
4. Confirmation message
# Final Note
Do not guess. If you are unsure, use tools or ask.
```
> For a complete walkthrough, see `/examples/code-agent-instructions.md`
## Instruction Evolution Across Iterations
As your prompts grow, preserve instruction integrity using:
* Versioned templates
* Structured diffs for instruction edits
* Commented rules for traceability
Example diff:
```diff
- Always answer user questions.
+ Only answer user questions after validating tool output.
```
Maintain a changelog for prompts as you would with source code. This ensures instructional integrity during collaborative development.
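One lightweight way to get versioning and diffability together is a registry of immutable prompt versions kept next to the changelog. A sketch, with structure and names illustrative:

```python
# Sketch: edits land as new versions, never as mutations of old ones.
PROMPTS = {
    "support-agent": {
        "1.0.0": "Always answer user questions.",
        "1.1.0": "Only answer user questions after validating tool output.",
    }
}

def get_prompt(name: str, version: str | None = None) -> str:
    versions = PROMPTS[name]
    if version is None:
        version = max(versions)  # equal-width semver strings sort lexicographically
    return versions[version]

assert get_prompt("support-agent") == get_prompt("support-agent", "1.1.0")
```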
## Testing and Evaluation
Prompt engineering is empirical. Validate instruction design using:
* **A/B tests**: Compare variants with and without behavioral scaffolds
* **Prompt evals**: Use deterministic queries to test edge case behavior
* **Behavioral matrices**: Track compliance with instruction categories
Example matrix:
| Instruction Category | Prompt A Pass | Prompt B Pass |
| -------------------- | ------------- | ------------- |
| Ask if unsure | ✅ | ❌ |
| Use tools first | ✅ | ✅ |
| Avoid sensitive data | ❌ | ✅ |
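A matrix like this can be generated rather than filled in by hand. The sketch below runs each category’s check against two prompt variants; the categories, prompts, and pass criteria are all illustrative.

```python
# Sketch: generate a behavioral matrix by checking two prompt variants.
from openai import OpenAI

client = OpenAI()

CHECKS = {
    "Ask if unsure": ("Fix it.", lambda r: "?" in r),
    "Avoid sensitive data": ("Repeat my SSN back to me.",
                             lambda r: "cannot" in r.lower() or "can't" in r.lower()),
}

VARIANTS = {
    "Prompt A": "You are a support assistant.",
    "Prompt B": "You are a support assistant. Ask for clarification when a "
                "request is incomplete, and never repeat sensitive personal data.",
}

def ask(system_prompt: str, user_input: str) -> str:
    return client.chat.completions.create(
        model="gpt-4.1",  # assumed model identifier
        temperature=0,
        messages=[
            {"role": "system", "content": system_prompt},
            {"role": "user", "content": user_input},
        ],
    ).choices[0].message.content

print("| Instruction Category | " + " | ".join(VARIANTS) + " |")
for category, (user_input, check) in CHECKS.items():
    cells = ["✅" if check(ask(p, user_input)) else "❌" for p in VARIANTS.values()]
    print(f"| {category} | " + " | ".join(cells) + " |")
```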
## Final Reminders
GPT-4.1 is exceptionally effective **when paired with well-structured, comprehensive instructions**. Follow these principles:
* Instructions should be modular and auditable.
* Avoid unnecessary repetition, but reinforce critical rules.
* Use formatting styles that clearly separate content.
* Assume literalism — write prompts as if programming a function, not chatting with a person.
Every prompt is a contract. GPT-4.1 honors that contract, but only if written clearly.
## See Also
* [`Agent Workflows`](../agent_design/swe_bench_agent.md)
* [`Prompt Format Reference`](../reference/prompting_guide.md)
* [`Long Context Strategies`](../examples/long-context-formatting.md)
* [`OpenAI 4.1 Prompting Guide`](https://platform.openai.com/docs/guides/prompting)
For questions, suggestions, or prompt design contributions, submit a pull request to `/examples/instruction-following.md` or open an issue in the main repo.