Prompt Engineering
At first, I didn't think prompt engineering mattered much, but after thinking of it as a way of searching a model's latent space, and given how expensive LLM API usage can get, I want to learn more about it.
References
Notes
Define Success Criteria
Building a successful LLM-based application starts with clearly defining your success criteria. How will you know when your application is good enough to publish?
Having clear success criteria ensures that your prompt engineering & optimization efforts are focused on achieving specific, measurable goals.
Good success criteria are:
- Specific: Clearly define what you want to achieve
- Measurable: Quantitative metrics or well-defined qualitative scales
- Achievable: Base your targets on industry benchmarks, prior experiments, AI research, or expert knowledge
- Relevant: Align your criteria with your application's purpose and user needs
Common success criteria:
- Task Fidelity
- Consistency
- Relevance and Coherence
- Tone and style
- Privacy preservation
- Context Utilization
- Latency
- Price
Develop Test Cases
After defining your success criteria, the next step is designing evaluations to measure LLM performance against those criteria. This is a vital part of the prompt engineering cycle.
Building Evals and Test Cases
Eval Design Principles
- Be Task-Specific: Design evals that mirror your real-world task distribution.
- Automate when possible: Structure questions to allow for automated grading (e.g., multiple-choice, string match, code-graded, LLM-graded)
- Prioritize volume over quality: More questions with slightly lower-signal automated grading is better than fewer questions with high-quality, human-graded evals.
Grading Evals
When deciding which method to use to grade evals, choose the fastest, most reliable, most scalable method:
- Code-based grading: Fastest and most reliable, extremely scalable, but lacks nuance for more complex judgements that can't be reduced to rigid rules.
- Human grading: Most flexible and high quality, but slow and expensive
- LLM-based grading: Fast and flexible, scalable and suitable for complex judgement.
- Tips for LLM-based Grading
- Have detailed, clear rubrics for grading
- Empirical or specific: Have the LLM output a number or a word from a fixed set, e.g. { "incorrect", "correct" }
- Encourage reasoning: Ask the LLM to think before deciding on an evaluation score, then discard the reasoning (see the sketch below)
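A minimal sketch of an LLM-graded eval that combines these tips, assuming the Anthropic Python SDK; the rubric, the grade_answer helper, and the example question are made up for illustration:

import anthropic

client = anthropic.Anthropic()

# Hypothetical rubric; a real eval would use your own task and criteria.
RUBRIC = """You are grading an answer to the question: "What is the capital of France?"
The answer is correct only if it names Paris.
Reason briefly inside <reasoning> tags, then output exactly one word,
correct or incorrect, inside <grade> tags."""

def grade_answer(answer: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        messages=[{"role": "user", "content": f"{RUBRIC}\n\n<answer>{answer}</answer>"}],
    )
    text = response.content[0].text
    # Keep only the word inside <grade> tags; the reasoning is discarded.
    return text.split("<grade>")[1].split("</grade>")[0].strip()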
Prompt Engineering
Overview
Prompt engineering is far faster than other methods of control, such as finetuning, and can often yield leaps in performance in far less time. Some reasons to prefer prompt engineering over finetuning:
- Resource efficiency: Finetuning requires high-end GPUs and large memory, while prompt engineering only requires text input
- Cost effectiveness: Finetuning incurs significant cost for cloud-based services.
- Maintaining model updates: Fine-tuned versions might need retraining when providers update models.
- Time-saving: Fine-tuning can take hours or even days. Prompt engineering is near instantaneous
- Minimal Data Needs: Fine-tuning needs substantial task-specific, labeled data, which can be scarce or expensive. Prompt engineering works with few-shot or even zero-shot learning.
- Flexibility and Rapid Iteration: Better ability to iterate
- Domain Adaptation: Easier to change domains
- Comprehension Improvements: Prompt engineering is more efficient at helping models understand context
- Preserves General Knowledge: Prompt engineering retains models' general knowledge
- Transparency: Prompts are human-readable, showing exactly what information the model receives.
Prompt Generator
Sometimes, the hardest part of using an AI model is figuring out how to prompt it effectively. Because of this, Anthropic has developed a prompt generator tool that is available in the console. There is also a prompt improver and evaluation tool for improving and evaluating your prompts.
Use Prompt Templates and Variables
Your API calls with Claude will typically consist of two types of content:
- Fixed Content: Static instructions or context that remain constant across multiple interactions.
- Variable Content: Dynamic elements that change with each request or conversation, such as:
- User Inputs
- Retrieved context for Retrieval-Augmented Generation (RAG)
- Conversation context such as user account history
- System-generated data such as tool use results fed in from other independent calls to Claude
A prompt template combines these fixed and variable parts, using placeholders for the dynamic content.
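For example, a minimal template sketch in Python; the translation task and variable names are illustrative, not from the source notes:

# Fixed content: instructions that stay constant across calls.
TEMPLATE = """You are a translation assistant.
Translate the text inside <text> tags into {target_language}.
Output only the translation, nothing else.

<text>
{user_text}
</text>"""

# Variable content: swapped in per request without touching the fixed instructions.
prompt = TEMPLATE.format(
    target_language="French",
    user_text="Where is the nearest train station?",
)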
When to use Prompt Templates and Variables
You should always use prompt templates and variables when you expect any part of your prompt to be repeated in another call to Claude. Prompt templates offer several benefits:
- Consistency: Ensure a consistent structure for your prompts across multiple interactions
- Efficiency: Easily swap out variable content without rewriting the entire prompt
- Testability: Quickly test different inputs and edge cases by changing only the variable portion.
- Scalability: Simplify prompt management as your application grows in complexity.
- Version Control: Easily track changes to your prompt structure over time by keeping tabs only on the core part of your prompt, separate from dynamic inputs.
Prompt Improver
The prompt improver helps you quickly iterate and improve your prompts through automated analysis and enhancement. It excels at making prompts more robust for complex tasks that require high accuracy.
How the prompt improver works
- Example identification: Locates and extracts examples from your prompt template
- Initial Draft: Creates a structured template with clear sections and XML tags
- Chain of thought refinement: Adds and refines detailed reasoning instructions.
- Example Enhancement: Updates examples to demonstrate the new reasoning process
Be clear, direct, and detailed
When interacting with Claude, think of it as a brilliant but very new employee (with amnesia) who needs explicit instructions. [...] The more precisely you explain what you want, the better Claude's response will be.
- Give Claude contextual information: Just like you might be able to better perform a task if you knew more context, Claude will perform better if it has more contextual information.
- Be specific about what you want Claude to do: For example, if you want Claude to output only code and nothing else, say so.
- Provide instructions as sequential steps: Use numbered lists or bullet points to make sure Claude carries out the task the exact way you want it to (see the sketch below).
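A small sketch of these tips combined in one prompt; the summarization task and wording are assumptions, not from the source material:

# meeting_notes would be filled in per request, e.g. prompt.format(meeting_notes=...)
prompt = """Summarize the meeting notes in the <notes> tags for an executive audience.

1. Read the notes in the <notes> tags.
2. Extract only decisions and action items; ignore small talk.
3. Output at most five bullet points, and nothing else.

<notes>
{meeting_notes}
</notes>"""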
Use Examples (multishot prompting) to guide Claude's behavior
By providing a few well-crafted examples in your prompt, you can dramatically improve the accuracy, consistency, and quality of Claude's outputs.
- Why use Examples?
- Accuracy: Examples reduce misinterpretation of instructions.
- Consistency: Examples enforce uniform structure and style
- Performance: Well-chosen examples boost Claude's ability to handle complex tasks.
- Crafting effective examples
- Relevant: Your examples mirror your actual use case.
- Diverse: Your examples cover edge cases and potential challenges, and vary enough that Claude doesn't inadvertently pick up on unintended patterns
- Clear: Your examples are wrapped in <example> tags (if multiple, nested within <examples> tags) for structure; see the sketch below
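A sketch of what a few-shot prompt with example tags can look like; the ticket-classification task and labels are made up for illustration:

prompt = """Classify each support ticket as billing, bug, or feature_request.

<examples>
<example>
Ticket: I was charged twice this month.
Category: billing
</example>
<example>
Ticket: The export button does nothing when I click it.
Category: bug
</example>
</examples>

Ticket: Please add a dark mode to the dashboard.
Category:"""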
Let Claude Think (Chain of Thought Prompting) to Increase Performance
When faced with complex tasks like research, analysis, or problem-solving, giving Claude space to think can dramatically improve its performance. This technique, known as chain of thought (CoT) prompting, encourages Claude to break down problems step-by-step, leading to more accurate and nuanced outputs.
- Why let Claude think?
- Accuracy: Stepping through problems reduces errors, especially in math, logic, analysis, or generally complex tasks
- Coherence: Structured thinking leads to more cohesive, well-organized responses
- Debugging: Seeing Claude's thought process helps you pinpoint where prompts may be unclear
- Why not let Claude think?
- Increased output length
- Not all tasks require in-depth thinking. Use CoT judiciously.
How to prompt for thinking
The list below is ordered from least to most complex (less complex methods take up less space in the context window):
- Basic prompt: Include "Think step-by-step" in your prompt
- Guided prompt: Outline specific steps for Claude to follow in its thinking process
- Structured prompt: Use XML tags like <thinking> and <answer> to separate reasoning from the final answer (see the sketch below)
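For example, a structured CoT prompt might look like this; the word problem is illustrative:

prompt = """A train leaves at 14:10 and arrives at 16:45 the same day. How long is the trip?

Work through the problem inside <thinking> tags,
then give only the final duration inside <answer> tags."""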
Use XML tags to Structure your Prompts
When your prompts involve multiple components like context, instructions, and examples, XML tags can be a game-changer. They help Claude parse your prompts more accurately, leading to higher-quality outputs.
Use tags like <instructions>, <example>, and <formatting> to clearly separate different parts of your prompt. This prevents Claude from mixing up instructions with examples or context.
Why Use XML Tags?
- Clarity: Clearly separate different parts of your prompt and ensure your prompt is well structured
- Accuracy: Reduce errors caused by Claude misinterpreting parts of your prompt
- Flexibility: Easily find, add, remove, or modify parts of your prompt without rewriting everything
- Parseability: Having Claude use XML tags in its output makes it easier to extract specific parts of its response by post-processing
Tagging Best Practices
- Be consistent: Use the same tag names throughout your prompts, and refer to those tag names when talking about the content (e.g., "Using the contract in the <contract> tags...")
- Nest tags: Nest tags like <outer><inner></inner></outer> for hierarchical content (see the sketch below)
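Putting these practices together, a prompt might nest tags and refer to them by name; the contract-review scenario is hypothetical:

prompt = """Using the contract in the <contract> tags, answer the question in the <question> tags.

<contract>
  <parties>Acme Corp and Widget LLC</parties>
  <terms>Net-30 payment, 12-month term, automatic renewal unless cancelled 60 days in advance.</terms>
</contract>

<question>When must either party cancel to avoid automatic renewal?</question>"""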
Give Claude a Role with a System Prompt
When using Claude, you can dramatically improve its performance by using the system parameter to give it a role. This technique, known as role prompting, is the most powerful way to use system prompts with Claude.
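A minimal sketch of role prompting via the system parameter of the Messages API; the data-engineer role and the query are illustrative:

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    # Role prompting: the role lives in the system parameter, not in the user message.
    system="You are a seasoned data engineer who reviews SQL queries for correctness and performance.",
    messages=[
        {"role": "user", "content": "Review this query: SELECT * FROM orders WHERE created_at > '2024-01-01'"},
    ],
)
print(response.content[0].text)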
Prefill Claude's Response for Greater Output Control
When using Claude, you have the unique ability to guide its responses by prefilling the Assistant message. This powerful technique allows you to direct Claude's actions, skip preambles, enforce specific formats like JSON or XML, and even help Claude maintain character consistency.
How to prefill Claude's response
To prefill, include the desired initial text in the Assistant message.
import anthropic
client = anthropic.Anthropic()
response = client.messages.create(
model="claude-3-5-sonnet-20241022",
max_tokens=1024,
messages=[
{"role": "user", "content": "What is your favorite color?"},
{"role": "assistant", "content": "As an AI assistant, I don't have a favorite color, But if I had to pick, it would be green because"} # Prefill here
]
)
Chain Complex Prompts
Prompt chaining (breaking complex tasks down into smaller, manageable subtasks) helps Claude handle each step more reliably. Chaining prompts can increase accuracy, clarity, and traceability. Use prompt chaining for multi-step tasks like research synthesis, document analysis, or iterative content creation (see the sketch after the steps below).
How to Chain Prompts
- Identify Subtasks: Break your task into distinct, sequential steps
- Structure with XML for Clear Handoffs: Use XML tags to pass outputs between prompts
- Have a single-task goal: Each subtask should have a single, clear objective
- Iterate: Refine subtasks based on Claude's performance
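A sketch of a two-step chain where the first output is handed to the second prompt inside XML tags; the ask helper, the summarize-then-critique split, and report_text are assumptions for illustration:

import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-5-sonnet-20241022"

def ask(prompt: str) -> str:
    response = client.messages.create(
        model=MODEL,
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

report_text = "..."  # placeholder for the full report text

# Subtask 1: summarize the report.
summary = ask(
    f"Summarize the report in the <report> tags in five bullet points.\n\n"
    f"<report>{report_text}</report>"
)

# Subtask 2: hand the summary off to a second prompt via XML tags.
critique = ask(
    f"Given the report in <report> tags and its summary in <summary> tags, "
    f"list any inaccuracies or omissions in the summary.\n\n"
    f"<report>{report_text}</report>\n\n<summary>{summary}</summary>"
)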
Long Context Prompting Tips
Claude’s extended context window (200K tokens for Claude 3 models) enables handling complex, data-rich tasks.
Essential Tips for Long Context Prompts
- Put longform data at the top: Place your long documents and inputs (20k+ tokens) near the top of your prompt, above your query, instructions, and examples. This can significantly improve Claude's performance across all models.
- Structure document content and metadata with XML tags: When using multiple documents, wrap each document in <document> tags with <document_content> and <source> (and other metadata) subtags for clarity (see the sketch below)
- Ground responses in quotes: For long document tasks, ask Claude to quote relevant parts of the document first before carrying out its task
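A sketch of that multi-document structure; the file names and placeholder contents are illustrative:

annual_report = "..."         # placeholder: long document text
competitor_analysis = "..."   # placeholder: long document text

prompt = f"""<documents>
  <document index="1">
    <source>annual_report_2023.pdf</source>
    <document_content>{annual_report}</document_content>
  </document>
  <document index="2">
    <source>competitor_analysis_q2.xlsx</source>
    <document_content>{competitor_analysis}</document_content>
  </document>
</documents>

First, quote the passages most relevant to revenue growth inside <quotes> tags.
Then compare our revenue growth to the competitor's, citing your quotes."""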