Prompt Engineering

At first, I didn't think prompt engineering mattered much, but after thinking about it as a way of steering where a model searches its latent space, and given how expensive LLM API usage can be, I want to learn more about it.

Date Created:

References



Notes


Define Success Criteria

Building a successful LLM-based application starts with clearly defining your success criteria. How will you know when your application is good enough to publish?
Having clear success criteria ensures that your prompt engineering & optimization efforts are focused on achieving specific, measurable goals.

Good success criteria are:

  • Specific: Clearly define what you want to achieve.
  • Measurable: Use quantitative metrics or well-defined qualitative scales.
  • Achievable: Base your targets on industry benchmarks, prior experiments, AI research, or expert knowledge.
  • Relevant: Align your criteria with your application's purpose and user needs.

Common success criteria:

  • Task fidelity
  • Consistency
  • Relevance and coherence
  • Tone and style
  • Privacy preservation
  • Context utilization
  • Latency
  • Price

Develop Test Cases

After defining your success criteria, the next step is designing evaluations to measure LLM performance against those criteria. This is a vital part of the prompt engineering cycle.

Building Evals and Test Cases

Eval Design Principles
  1. Be Task-Specific: Design evals that mirror your real-world task distribution.
  2. Automate when possible: Structure questions to allow for automated grading (e.g., multiple-choice, string match, code-graded, LLM-graded); a string-match sketch follows this list.
  3. Prioritize volume over quality: A larger set of questions with slightly lower-signal automated grading beats a small set of high-quality, human-graded evals.
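
As a minimal sketch of a code-graded eval (the test cases and the grade_exact_match helper are made up for illustration), an exact string match can grade simple factual questions automatically:

test_cases = [
    {"prompt": "What is the capital of France? Answer with one word.", "expected": "Paris"},
    {"prompt": "Is 17 a prime number? Answer yes or no.", "expected": "yes"},
]

def grade_exact_match(model_output: str, expected: str) -> bool:
    # Code-based grading: a simple, fast, rule-based check.
    return model_output.strip().lower() == expected.strip().lower()

# Example usage (the model output would come from a call to the LLM under test):
print(grade_exact_match("paris", test_cases[0]["expected"]))  # True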

Grading Evals

When deciding which method to use to grade evals, choose the fastest, most reliable, most scalable method:

  1. Code-based grading: Fastest, most reliable, and extremely scalable, but it lacks nuance for complex judgements that don't fit rigid, rule-based checks.
  2. Human grading: Most flexible and highest quality, but slow and expensive.
  3. LLM-based grading: Fast, flexible, scalable, and suitable for complex judgements (see the sketch after this list).
    • Tips for LLM-based Grading
      • Have detailed, clear rubrics for grading
      • Keep outputs empirical and specific: have the LLM output a number or a word from a fixed set, e.g. { "incorrect", "correct" }
      • Encourage reasoning: Ask the LLM to think before deciding on an evaluation score, then discard the reasoning and keep only the grade
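
Below is a minimal sketch of LLM-based grading using the anthropic Python SDK; the rubric wiring, model name, and the grade_with_llm helper are illustrative assumptions, not a fixed recipe.

import anthropic

client = anthropic.Anthropic()

def grade_with_llm(question: str, answer: str, rubric: str) -> str:
    # Ask Claude to reason first, then emit a single word from {correct, incorrect}.
    grading_prompt = (
        "Grade the answer against the rubric.\n"
        f"<question>{question}</question>\n"
        f"<answer>{answer}</answer>\n"
        f"<rubric>{rubric}</rubric>\n"
        "Think it through in <reasoning> tags, then output only the word "
        "correct or incorrect inside <grade> tags."
    )
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=512,
        messages=[{"role": "user", "content": grading_prompt}],
    )
    text = response.content[0].text
    # Discard the reasoning; keep only the final grade.
    return text.split("<grade>")[-1].split("</grade>")[0].strip()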

Prompt Engineering

Overview

Prompt engineering is far faster than other methods of control, such as fine-tuning, and can often yield leaps in performance in far less time. Some reasons to prefer prompt engineering over fine-tuning:

  • Resource efficiency: Fine-tuning requires high-end GPUs and large memory, while prompt engineering only requires text input.
  • Cost effectiveness: Fine-tuning incurs significant costs for cloud-based services.
  • Maintaining model updates: Fine-tuned versions might need retraining when providers update models.
  • Time-saving: Fine-tuning can take hours or even days; prompt engineering is near-instantaneous.
  • Minimal data needs: Fine-tuning needs substantial task-specific, labeled data, which can be scarce or expensive. Prompt engineering works with few-shot or even zero-shot learning.
  • Flexibility and rapid iteration: Prompts can be tweaked and re-tested in minutes, making it much easier to iterate and experiment.
  • Domain adaptation: Models can be adapted to new domains simply by providing domain-specific context in the prompt.
  • Comprehension improvements: Prompt engineering is often more effective at helping models understand and use provided context (e.g., retrieved documents).
  • Preserves general knowledge: Prompt engineering leaves the model's general knowledge intact, whereas fine-tuning can degrade it.
  • Transparency: Prompts are human-readable, showing exactly what information the model receives.

Prompt Generator

Sometimes, the hardest part of using an AI model is figuring out how to prompt it effectively. Because of this, Anthropic has developed a prompt generator tool that is available in the console. There is also a prompt improver and evaluation tool for improving and evaluating your prompts.

Use Prompt Templates and Variables

Your API calls with Claude will typically consist of two types of content:

  • Fixed Content: Static instructions or context that remain constant across multiple interactions.
  • Variable Content: Dynamic elements that change with each request or conversation, such as:
    • User Inputs
    • Retrieved context for Retrieval-Augmented Generation (RAG)
    • Conversation context such as user account history
    • System-generated data such as tool use results fed in from other independent calls to Claude
A prompt template combines these fixed and variable parts, using placeholders for the dynamic content.
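
As a minimal sketch (the template text and variable values are made up), a prompt template in Python can keep the fixed instructions constant and substitute only the dynamic parts:

PROMPT_TEMPLATE = """You are a customer support assistant.
Answer the question using only the document below.

<document>
{document}
</document>

<question>
{question}
</question>"""

retrieved_context = "Refunds are available within 30 days of purchase."  # e.g. a RAG result
user_question = "Can I still return something I bought last week?"       # the user's input

prompt = PROMPT_TEMPLATE.format(document=retrieved_context, question=user_question)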

When to use Prompt Templates and Variables

You should always use prompt templates and variables when you expect any part of your prompt to be repeated in another call to Claude. Prompt templates offer several benefits:

  • Consistency: Ensure a consistent structure for your prompts across multiple interactions
  • Efficiency: Easily swap out variable content without rewriting the entire prompt
  • Testability: Quickly test different inputs and edge cases by changing only the variable portion.
  • Scalability: Simplify prompt management as your application grows in complexity.
  • Version Control: Easily track changes to your prompt structure over time by keeping tabs only on the core part of your prompt, separate from dynamic inputs.

Prompt Improver

The prompt improver helps you quickly iterate and improve your prompts through automated analysis and enhancement. It excels at making prompts more robust for complex tasks that require high accuracy.
How the prompt improver works
  1. Example identification: Locates and extracts examples from your prompt template
  2. Initial draft: Creates a structured template with clear sections and XML tags
  3. Chain-of-thought refinement: Adds and refines detailed reasoning instructions
  4. Example enhancement: Updates examples to demonstrate the new reasoning process

Be clear, direct, and detailed

When interacting with Claude, think of it as a brilliant but very new employee (with amnesia) who needs explicit instructions. [...] The more precisely you explain what you want, the better Claude's response will be.
  • Give Claude contextual information: Just like you might be able to better perform a task if you knew more context, Claude will perform better if it has more contextual information.
  • Be specific about what you want Claude to do: For example, if you want Claude to output only code and nothing else, say so.
  • Provide instructions as sequential steps: Use numbered lists or bullet points to make sure Claude carries out the task exactly the way you want it to (an illustrative prompt follows this list).
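
An illustrative prompt (the wording is made up for this note) that combines context, an explicit output constraint, and numbered steps:

  You are preparing a weekly report for our customer support team.
  1. Read the customer email below.
  2. Summarize it in exactly two sentences.
  3. Output only the summary, with no preamble or closing remarks.
  <email>The checkout page keeps timing out when I try to pay with PayPal.</email>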

Use Examples (multishot prompting) to guide Claude's behavior

By providing a few well-crafted examples in your prompt, you can dramatically improve the accuracy, consistency, and quality of Claude's outputs.

  • Why use Examples?
    • Accuracy: Examples reduce misinterpretation of instructions.
    • Consistency: Examples enforce uniform structure and style
    • Performance: Well-chosen examples boost Claude's ability to handle complex tasks.
  • Crafting effective examples
    • Relevant: Your examples mirror your actual use case.
    • Diverse: Your examples cover edge cases and potential challenges, and vary enough that Claude doesn't inadvertently pick up on unintended patterns.
    • Clear: Your examples are wrapped in <example> tags (if multiple, nested within <examples> tags) for structure, as in the sketch after this list.
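
A minimal multishot sketch (the classification task and example tickets are made up) showing examples wrapped in <example> tags inside an <examples> block:

  Classify each support ticket as billing, bug, or other.

  <examples>
  <example>
  Ticket: I was charged twice for my subscription this month.
  Category: billing
  </example>
  <example>
  Ticket: The export button does nothing when I click it.
  Category: bug
  </example>
  </examples>

  Ticket: How do I invite a teammate to my workspace?
  Category: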

Let Claude Think (Chain of Thought Prompting) to Increase Performance

When faced with complex tasks like research, analysis, or problem-solving, giving Claude space to think can dramatically improve its performance. This technique, known as chain of thought (CoT) prompting, encourages Claude to break down problems step-by-step, leading to more accurate and nuanced outputs.
  • Why let Claude think?
    • Accuracy: Stepping through problems reduces errors, especially in math, logic, analysis, or generally complex tasks
    • Coherence: Structured thinking leads to more cohesive, well-organized responses
    • Debugging: Seeing Claude's thought process helps you pinpoint where prompts may be unclear
  • Why not let Claude think?
    • Increased output length
    • Not all tasks require in-depth thinking. Use CoT judiciously.
How to prompt for thinking

The methods below are ordered from least to most complex (less complex methods take up less space in the context window):

  • Basic Prompt
    • Include "Think step-by-step" in your prompt
  • Guided Prompt
    • Outline specific steps for Claude to follow in its thinking process
  • Structured Prompt
    • Use XML tags like <thinking> and <answer> to separate reasoning from the final answer (an example prompt follows this list)
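
An illustrative structured CoT prompt (the scenario is made up) using <thinking> and <answer> tags:

  You are a financial advisor. Should the client below prioritize paying off debt
  or investing a recent bonus?
  Think through the tradeoffs in <thinking> tags, then give your recommendation
  in <answer> tags.
  <client>Carries $8,000 of credit card debt at 22% APR and has a $5,000 bonus.</client>
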
Use XML tags to Structure your Prompts
When your prompts involve multiple components like context, instructions, and examples, XML tags can be a game-changer. They help Claude parse your prompts more accurately, leading to higher-quality outputs.

Use tags like <instructions>, <example>, and <formatting> to clearly separate different parts of your prompt. This prevents Claude from mixing up instructions with examples or context.

Why Use XML Tags?
  • Clarity: Clearly separate different parts of your prompt and ensure it is well structured.
  • Accuracy: Reduce errors caused by Claude misinterpreting parts of your prompt.
  • Flexibility: Easily find, add, remove, or modify parts of your prompt without rewriting everything.
  • Parseability: Having Claude use XML tags in its output makes it easier to extract specific parts of its response by post-processing
Tagging Best Practices
  1. Be Consistent:
    1. Use the same tag names throughout your prompts and refer to those tag names when talking about the content (e.g., "Using the contract in <contract> tags...").
  2. Nest Tags:
    1. Nest tags (<outer><inner></inner></outer>) for hierarchical content; a short illustrative prompt follows this list.
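
A short illustrative prompt (contents made up) that separates instructions from the referenced document with XML tags:

  <instructions>
  Summarize the key obligations in the contract below in plain language,
  using only the content in the <contract> tags.
  </instructions>
  <contract>
  The Service Provider shall deliver monthly usage reports no later than the
  fifth business day of each month.
  </contract>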

Give Claude a Role with a System Prompt

When using Claude, you can dramatically improve its performance by using the system parameter to give it a role. This technique, known as role prompting, is the most powerful way to use system prompts with Claude.
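
A minimal sketch of role prompting with the system parameter (the role text and question are illustrative, not from the source):

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system="You are a seasoned data scientist at a Fortune 500 company.",  # the role
    messages=[
        {"role": "user", "content": "In one paragraph, what risks should we check for before shipping this churn model?"},
    ],
)

print(response.content[0].text)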

Prefill Claude's Response for Greater Output Control

When using Claude, you have the unique ability to guide its responses by prefilling the Assistant message. This powerful technique allows you to direct Claude's actions, skip preambles, enforce specific formats like JSON or XML, and even help Claude maintain character consistency.

How to prefill Claude's response

To prefill, include the desired initial text in the Assistant message.

import anthropic

client = anthropic.Anthropic()

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "What is your favorite color?"},
        # Prefill: Claude continues its response from this partial Assistant message.
        {"role": "assistant", "content": "As an AI assistant, I don't have a favorite color. But if I had to pick, it would be green because"},
    ],
)

print(response.content[0].text)  # Claude's continuation of the prefilled text
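
Prefilling also works for enforcing output formats. As an illustrative sketch (reusing the client from the snippet above; the extraction task is made up), prefilling the Assistant turn with an opening brace nudges Claude to respond with JSON only:

response = client.messages.create(  # assumes `client` from the snippet above
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    messages=[
        {"role": "user", "content": "Extract the product name, colors, and price as JSON from this description: 'The SmartHome Mini is a compact smart home assistant available in black or white for $49.99.'"},
        {"role": "assistant", "content": "{"},  # Prefill: Claude continues the JSON object
    ],
)

print("{" + response.content[0].text)  # The prefilled brace is not echoed back, so prepend it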

Chain Complex Prompts

Prompt chaining (breaking a complex task into smaller, manageable subtasks, each handled by its own call to Claude) can increase accuracy, clarity, and traceability. Use prompt chaining for multi-step tasks like research synthesis, document analysis, or iterative content creation; a short sketch follows the steps below.

How to Chain Prompts
  1. Identify Subtasks: Break your task into distinct, sequential steps
  2. Structure with XML for Clear Handoffs: Use XML tags to pass outputs between prompts
  3. Have a single-task goal: Each subtask should have a single, clear objective
  4. Iterate: Refine subtasks based on Claude's performance
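
A minimal sketch of a two-step chain (the report text and the ask helper are made up), passing the first output into the second prompt inside XML tags:

import anthropic

client = anthropic.Anthropic()

def ask(prompt: str) -> str:
    response = client.messages.create(
        model="claude-3-5-sonnet-20241022",
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.content[0].text

report = "Quarterly revenue grew 12%, but churn rose from 3% to 5% and support costs doubled."

# Subtask 1: summarize the report.
summary = ask(f"Summarize the key findings of this report in 2-3 bullet points:\n<report>{report}</report>")

# Subtask 2: hand the summary off to the next prompt inside XML tags.
risks = ask(f"Based on this summary, list the top business risks:\n<summary>{summary}</summary>")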

Long Context Prompting Tips

Claude’s extended context window (200K tokens for Claude 3 models) enables handling complex, data-rich tasks.
Essential Tips for Long Context Prompts
  • Put longform data at the top: Place your long documents and inputs (20k+ tokens) near the top of your prompt, above your query, instructions, and examples. This can significantly improve Claude's performance across all models (see the layout sketch after this list).
  • Structure document content and metadata with XML tags: When using multiple documents, wrap each document in <document> tags with <document_content> and <source> (and other metadata) subtags for clarity
  • Ground Responses in quotes: For long document tasks, ask Claude to quote relevant parts of the document first before carrying out its task.
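
An illustrative long-context layout (document contents abbreviated and made up): documents go at the top in XML tags with metadata subtags, and the query at the bottom asks for grounding quotes before the answer:

  <documents>
  <document>
  <source>annual_report_2023.pdf</source>
  <document_content>
  (roughly 20k+ tokens of report text would go here)
  </document_content>
  </document>
  </documents>

  First, extract the quotes most relevant to revenue growth into <quotes> tags.
  Then, using only those quotes, summarize the revenue trend in <answer> tags.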
