Prompt Engineering
At first, I didn't think prompt engineering mattered much, but after thinking of it as a way of searching a model's latent space, and given how expensive LLM API usage can get, I want to learn more about it.
References
Notes
Define Success Criteria
Building a successful LLM-based application starts with clearly defining your success criteria. How will you know when your application is good enough to publish?
Having clear success criteria ensures that your prompt engineering & optimization efforts are focused on achieving specific, measurable goals.
Good success criteria are:
- Specific: Clearly define what you want to achieve
- Measurable: Quantitative metrics or well-defined qualitative scales
- Achievable: Base your targets on industry benchmarks, prior experiments, AI research, or expert knowledge
- Relevant: Align your criteria with your application's purpose and user needs
Common success criteria:
- Task Fidelity
- Consistency
- Relevance and Coherence
- Tone and style
- Privacy preservation
- Context Utilization
- Latency
- Price
Develop Test Cases
After defining your success criteria, the next step is designing evaluations to measure LLM performance against those criteria. This is a vital part of the prompt engineering cycle.
Building Evals and Test Cases
Eval Design Principles
- Be Task-Specific: Design evals that mirror your real-world task distribution.
- Automate when possible: Structure questions to allow for automated grading (e.g., multiple-choice, string match, code-graded, LLM-graded)
- Prioritize volume over quality: More questions with slightly lower-signal automated grading is better than fewer questions with high-quality, human-graded evals.
Grading Evals
When deciding which method to use to grade evals, choose the fastest, most reliable, most scalable method:
- Code-based grading: Fastest and most reliable, extremely scalable, but lacks nuance for more complex judgements that can't be reduced to rigid rules.
- Human grading: Most flexible and high quality, but slow and expensive
- LLM-based grading: Fast and flexible, scalable and suitable for complex judgement.
- Tips for LLM-based Grading
- Have detailed, clear rubrics for grading
- Empirical or specific: Have the LLM output a number or a word from a fixed set, e.g. { "incorrect", "correct" }
- Encourage reasoning: Ask the LLM to think before deciding on an evaluation score, then discard the reasoning (see the sketch below)
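A minimal sketch of an LLM-graded eval that combines these tips, assuming the Anthropic Python SDK; the rubric, the grade_answer helper, and the example question are made up for illustration:

import anthropic

client = anthropic.Anthropic()

# Hypothetical rubric; a real eval would use your own task and criteria.
RUBRIC = """You are grading an answer to the question: "What is the capital of France?"
The answer is correct only if it names Paris.
Reason briefly inside <reasoning> tags, then output exactly one word,
correct or incorrect, inside <grade> tags."""

def grade_answer(answer: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        messages=[{"role": "user", "content": f"{RUBRIC}\n\n<answer>{answer}</answer>"}],
    )
    text = response.content[0].text
    # Keep only the word inside <grade> tags; the reasoning is discarded.
    return text.split("<grade>")[1].split("</grade>")[0].strip()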
Prompt Engineering
Overview
Prompt engineering is far faster than other methods of control, such as finetuning, and can often yield leaps in performance in far less time. Some reasons to prefer prompt engineering over finetuning:
- Resource efficiency: Finetuning requires high-end GPUs and large memory, while prompt engineering only requires text input
- Cost effectiveness: Finetuning incurs significant cost for cloud-based services.
- Maintaining model updates: Fine-tuned versions might need retraining when providers update models.
- Time-saving: Fine-tuning can take hours or even days. Prompt engineering is near instantaneous
- Minimal Data Needs: Fine-tuning needs substantial task-specific, labeled data, which can be scarce or expensive. Prompt engineering works with few-shot or even zero-shot learning.
- Flexibility and Rapid Iteration: Better ability to iterate
- Domain Adaptation: Easier to change domains
- Comprehension Improvements: Prompt engineering is more efficient at helping models understand context
- Preserves General Knowledge: Prompt engineering retains models' general knowledge
- Transparency: Prompts are human-readable, showing exactly what information the model receives.
Prompt Generator
Sometimes, the hardest part of using an AI model is figuring out how to prompt it effectively. Because of this, Anthropic has developed a prompt generator tool that is available in the console. There is also a prompt improver and evaluation tool for improving and evaluating your prompts.
Use Prompt Templates and Variables
Your API calls with Claude will typically consist of two types of content:
- Fixed Content: Static instructions or context that remain constant across multiple interactions.
- Variable Content: Dynamic elements that change with each request or conversation, such as:
- User Inputs
- Retrieved context for Retrieval-Augmented Generation (RAG)
- Conversation context such as user account history
- System-generated data such as tool use results fed in from other independent calls to Claude
A prompt template combines these fixed and variable parts, using placeholders for the dynamic content.
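For example, a minimal template sketch in Python; the translation task and variable names are illustrative, not from the source notes:

# Fixed content: instructions that stay constant across calls.
TEMPLATE = """You are a translation assistant.
Translate the text inside <text> tags into {target_language}.
Output only the translation, nothing else.

<text>
{user_text}
</text>"""

# Variable content: swapped in per request without touching the fixed instructions.
prompt = TEMPLATE.format(
    target_language="French",
    user_text="Where is the nearest train station?",
)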
When to use Prompt Templates and Variables
You should always use prompt templates and variables when you expect any part of your prompt to be repeated in another call to Claude. Prompt templates offer several benefits:
- Consistency: Ensure a consistent structure for your prompts across multiple interactions
- Efficiency: Easily swap out variable content without rewriting the entire prompt
- Testability: Quickly test different inputs and edge cases by changing only the variable portion.
- Scalability: Simplify prompt management as your application grows in complexity.
- Version Control: Easily track changes to your prompt structure over time by keeping tabs only on the core part of your prompt, separate from dynamic inputs.
Prompt Improver
The prompt improver helps you quickly iterate and improve your prompts through automated analysis and enhancement. It excels at making prompts more robust for complex tasks that require high accuracy.
How the prompt improver works
- Example identification: Locates and extracts examples from your prompt template
- Initial Draft: Creates a structured template with clear sections and XML tags
- Chain of thought refinement: Adds and refines detailed reasoning instructions.
- Example Enhancement: Updates examples to demonstrate the new reasoning process
Be clear, direct, and detailed
When interacting with Claude, think of it as a brilliant but very new employee (with amnesia) who needs explicit instructions. [...] The more precisely you explain what you want, the better Claude's response will be.
- Give Claude contextual information: Just like you might be able to better perform a task if you knew more context, Claude will perform better if it has more contextual information.
- Be specific about what you want Claude to do: For example, if you want Claude to output only code and nothing else, say so.
- Provide instructions as sequential steps: Use numbered lists or bullet points to make sure Claude carries out the task the exact way you want it to (see the sketch below).
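A small sketch of these tips combined in one prompt; the summarization task and wording are assumptions, not from the source material:

# meeting_notes would be filled in per request, e.g. prompt.format(meeting_notes=...)
prompt = """Summarize the meeting notes in the <notes> tags for an executive audience.

1. Read the notes in the <notes> tags.
2. Extract only decisions and action items; ignore small talk.
3. Output at most five bullet points, and nothing else.

<notes>
{meeting_notes}
</notes>"""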
Use Examples (multishot prompting) to guide Claude's behavior
By providing a few well-crafted examples in your prompt, you can dramatically improve the accuracy, consistency, and quality of Claude's outputs.
- Why use Examples?
- Accuracy: Examples reduce misinterpretation of instructions.
- Consistency: Examples enforce uniform structure and style
- Performance: Well-chosen examples boost Claude's ability to handle complex tasks.
- Crafting effective examples
- Relevant: Your examples mirror your actual use case.
- Diverse: Your examples cover edge cases and potential challenges, and vary enough that Claude doesn't inadvertently pick up on unintended patterns
- Clear: Your examples are wrapped in <example> tags (if multiple, nested within <examples> tags) for structure; see the sketch below
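A sketch of what a few-shot prompt with example tags can look like; the ticket-classification task and labels are made up for illustration:

prompt = """Classify each support ticket as billing, bug, or feature_request.

<examples>
<example>
Ticket: I was charged twice this month.
Category: billing
</example>
<example>
Ticket: The export button does nothing when I click it.
Category: bug
</example>
</examples>

Ticket: Please add a dark mode to the dashboard.
Category:"""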
Let Claude Think (Chain of Thought Prompting) to Increase Performance
When faced with complex tasks like research, analysis, or problem-solving, giving Claude space to think can dramatically improve its performance. This technique, known as chain of thought (CoT) prompting, encourages Claude to break down problems step-by-step, leading to more accurate and nuanced outputs.
- Why let Claude think?
- Accuracy: Stepping through problems reduces errors, especially in math, logic, analysis, or generally complex tasks
- Coherence: Structured thinking leads to more cohesive, well-organized responses
- Debugging: Seeing Claude's thought process helps you pinpoint where prompts may be unclear
- Why not let Claude think?
- Increased output length
- Not all tasks require in-depth thinking. Use CoT judiciously.
How to prompt for thinking
The list below is ordered from least to most complex (less complex methods take up less space in the context window):
- Basic prompt: Include "Think step-by-step" in your prompt
- Guided prompt: Outline specific steps for Claude to follow in its thinking process
- Structured prompt: Use XML tags like <thinking> and <answer> to separate reasoning from the final answer (see the sketch below)
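For example, a structured CoT prompt might look like this; the word problem is illustrative:

prompt = """A train leaves at 14:10 and arrives at 16:45 the same day. How long is the trip?

Work through the problem inside <thinking> tags,
then give only the final duration inside <answer> tags."""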
Use XML tags to Structure your Prompts
When your prompts involve multiple components like context, instructions, and examples, XML tags can be a game-changer. They help Claude parse your prompts more accurately, leading to higher-quality outputs.
Use tags like <instructions>, <example>, and <formatting> to clearly separate different parts of your prompt. This prevents Claude from mixing up instructions with examples or context.
Why Use XML Tags?
- Clarity: Clearly separate different parts of your prompt and ensure your prompt is well structured
- Accuracy: Reduce errors caused by Claude misinterpreting parts of your prompt
- Flexibility: Easily find, add, remove, or modify parts of your prompt without rewriting everything
- Parseability: Having Claude use XML tags in its output makes it easier to extract specific parts of its response by post-processing
Tagging Best Practices
- Be consistent: Use the same tag names throughout your prompts, and refer to those tag names when talking about the content (e.g., "Using the contract in the <contract> tags...")
- Nest tags: Nest tags like <outer><inner></inner></outer> for hierarchical content (see the sketch below)
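Putting these practices together, a prompt might nest tags and refer to them by name; the contract-review scenario is hypothetical:

prompt = """Using the contract in the <contract> tags, answer the question in the <question> tags.

<contract>
  <parties>Acme Corp and Widget LLC</parties>
  <terms>Net-30 payment, 12-month term, automatic renewal unless cancelled 60 days in advance.</terms>
</contract>

<question>When must either party cancel to avoid automatic renewal?</question>"""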
Give Claude a Role with a System Prompt
When using Claude, you can dramatically improve its performance by using the system parameter to give it a role. This technique, known as role prompting, is the most powerful way to use system prompts with Claude.
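A minimal sketch of role prompting via the system parameter of the Messages API; the data-engineer role and the query are illustrative:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    # Role prompting: the role lives in the system parameter, not in the user message.
    system="You are a seasoned data engineer who reviews SQL queries for correctness and performance.",
    messages=[
        {"role": "user", "content": "Review this query: SELECT * FROM orders WHERE created_at > '2024-01-01'"},
    ],
)
print(response.content[0].text)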
Prefill Claude's Response for Greater Output Control
When using Claude, you have the unique ability to guide its responses by prefilling the Assistant message. This powerful technique allows you to direct Claude's actions, skip preambles, enforce specific formats like JSON or XML, and even help Claude maintain character consistency.
How to prefill Claude's response
To prefill, include the desired initial text in the Assistant message.
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
messages=[
{"role": "user", "content": "What is your favorite color?"},
{"role": "assistant", "content": "As an AI assistant, I don't have a favorite color, But if I had to pick, it would be green because"} # Prefill here
]
)
Chain Complex Prompts
Prompt chaining (breaking complex tasks down into smaller, manageable subtasks) helps Claude handle each step more reliably. Chaining prompts can increase accuracy, clarity, and traceability. Use prompt chaining for multi-step tasks like research synthesis, document analysis, or iterative content creation (see the sketch after the steps below).
How to Chain Prompts
- Identify Subtasks: Break your task into distinct, sequential steps
- Structure with XML for Clear Handoffs: Use XML tags to pass outputs between prompts
- Have a single-task goal: Each subtask should have a single, clear objective
- Iterate: Refine subtasks based on Claude's performance
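A sketch of a two-step chain where the first output is handed to the second prompt inside XML tags; the ask helper, the summarize-then-critique split, and report_text are assumptions for illustration:

import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20241022"

def ask(prompt: str) -> str:
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

report_text = "..."  # placeholder for the full report text

# Subtask 1: summarize the report.
summary = ask(
    f"Summarize the report in the <report> tags in five bullet points.\n\n"
    f"<report>{report_text}</report>"
)

# Subtask 2: hand the summary off to a second prompt via XML tags.
critique = ask(
    f"Given the report in <report> tags and its summary in <summary> tags, "
    f"list any inaccuracies or omissions in the summary.\n\n"
    f"<report>{report_text}</report>\n\n<summary>{summary}</summary>"
)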
Long Context Prompting Tips
Claude’s extended context window (200K tokens for Claude 3 models) enables handling complex, data-rich tasks.
Essential Tips for Long Context Prompts
- Put longform data at the top: Place your long documents and inputs (20k+ tokens) near the top of your prompt, above your query, instructions, and examples. This can significantly improve Claude's performance across all models.
- Structure document content and metadata with XML tags: When using multiple documents, wrap each document in <document> tags with <document_content> and <source> (and other metadata) subtags for clarity (see the sketch below)
- Ground responses in quotes: For long document tasks, ask Claude to quote relevant parts of the document first before carrying out its task
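A sketch of that multi-document structure; the file names and placeholder contents are illustrative:

annual_report = "..."         # placeholder: long document text
competitor_analysis = "..."   # placeholder: long document text

prompt = f"""<documents>
  <document index="1">
    <source>annual_report_2023.pdf</source>
    <document_content>{annual_report}</document_content>
  </document>
  <document index="2">
    <source>competitor_analysis_q2.xlsx</source>
    <document_content>{competitor_analysis}</document_content>
  </document>
</documents>

First, quote the passages most relevant to revenue growth inside <quotes> tags.
Then compare our revenue growth to the competitor's, citing your quotes."""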