Prompt Caching: The Key to Reducing LLM Costs up to 90%
Find out how to set up prompt caching and save on AI costs
Prompt caching is a technique where the LLM provider stores the processed version of prompts (or prompt prefixes) you send frequently so it doesn’t have to reprocess them on every request. With this approach, you can dramatically cut the costs of using large language models while improving speed and efficiency.
Here’s how you can set up prompt caching to lower your LLM costs by up to 90% (depending on the LLM you choose) and improve latency by up to 80%.
TLDR:
- The goal: Optimize LLM costs and improve LLM performance
- The tactic: Set up prompt caching
- The result: Save up to 90% on LLM costs and cut latency by up to 80%
Step 1: Identify repeating prompts
As you’ll see in Step 2, prompt caching depends heavily on the presence of repetitive generative AI prompts or prompt sections.
To identify repeating prompts, you’ll need to track all prompts you make over a specific period, such as a week or month. This lets you establish a baseline of prompt usage.
Next, review these prompts for similar phrasing and keywords, then cluster and rank them by theme and frequency. Prioritize the prompts with the highest frequency as candidates for caching.
Good candidates for caching include:
- Email templates
- Email frameworks
- Sample emails
- General rules and instructions
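If you already log your prompts, a short script is enough to surface the most frequent prefixes. Here’s a minimal sketch in Python; the log format and the `normalize` and `rank_repeating_prefixes` helpers are illustrative assumptions, not part of any particular SDK:

```python
from collections import Counter

def normalize(prompt: str) -> str:
    # Collapse whitespace so trivial formatting differences don't hide repeats
    return " ".join(prompt.split())

def rank_repeating_prefixes(prompts: list[str], prefix_chars: int = 500) -> list[tuple[str, int]]:
    # Count how often the first `prefix_chars` characters of each prompt recur
    counts = Counter(normalize(p)[:prefix_chars] for p in prompts)
    return counts.most_common()  # most frequent prefixes first = best caching candidates

# `logged_prompts` would come from a week or month of real request logs
logged_prompts = [
    "You are a sales assistant. Follow these rules... Write an email to Acme.",
    "You are a sales assistant. Follow these rules... Write an email to Initech.",
]
for prefix, hits in rank_repeating_prefixes(logged_prompts, prefix_chars=40)[:10]:
    print(hits, prefix)
```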
Step 2: Structure prompts
To maximize cache hits, you’ll need to structure your prompts carefully. A cache hit only happens when the beginning of your prompt exactly matches a cached prefix, so inputs that vary by even a single character won’t trigger one.
When structuring prompts, place static content like instructions and examples at the beginning, and put dynamic content and variable data at the end. If you have prompts that are similar enough to produce the same result but aren’t precisely identical, adjust them so they match exactly (this is also a good time for any prompt debugging).
This also applies to any images and tools you use within your prompts.
Here’s how this looks in action:
- Cache lookup – The system checks if the “prefix” is stored in the cache.
- Cache hit – When the system finds a matching prefix, it will use the cached result.
- Cache miss – If the system doesn’t find a matching prefix, it processes your entire prompt and caches the prefix for future requests.
Cache hits are what you need to aim for if you want to reduce your cost and improve latency.
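Putting this structure into practice with the OpenAI Python SDK might look like the sketch below. The model name, instructions, and the `draft_email` helper are placeholders for illustration; the point is that the long static block stays byte-for-byte identical across calls, and only the last user message changes:

```python
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in your environment

# Static content first: the same long instructions, rules, and sample emails
# on every call, so this shared prefix can be served from the cache.
STATIC_INSTRUCTIONS = (
    "You are a sales assistant. Follow the email framework below.\n"
    "Rules: ...\n"
    "Sample email 1: ...\n"
    "Sample email 2: ...\n"
)

def draft_email(lead_name: str, lead_context: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": STATIC_INSTRUCTIONS},  # cacheable prefix
            # Dynamic content last: only this part varies per request
            {"role": "user", "content": f"Write an email to {lead_name}. Context: {lead_context}"},
        ],
    )
    return response.choices[0].message.content
```

Keep in mind that OpenAI only caches prompts of 1,024 tokens or more, so in practice the static block needs to be much longer than this placeholder.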
Step 3: Choose your LLM
Different LLM providers have their own caching rules, requirements, and costs, so you’ll want to review each one separately to see which will work best for you.
Here’s a comparison of OpenAI and Anthropic:
| | OpenAI | Anthropic |
| --- | --- | --- |
| Cost of caching | Free | +25% (on cache writes) |
| Caching savings | 50% (plus up to 80% better latency) | 90% |
| Models that support caching | GPT-4o, GPT-4o mini, o1-mini, o1-preview, and fine-tuned versions | Claude 3.5 Sonnet, Claude 3 Haiku, Claude 3 Opus |
| Minimum prompt length | 1,024 tokens | 1,024 tokens (Sonnet & Opus), 2,048 tokens (Haiku) |
| Cache lifetime | 5–10 minutes | 5 minutes |
| Cache mechanism | Partial (automatic prefix matching) | Exact (explicit cache breakpoints) |
| What can be cached | Messages, images, tool use, structured outputs | Messages, images, system messages, tools, tool use, tool results |
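The mechanism difference shows up directly in code: OpenAI’s caching kicks in automatically once your prompt clears the minimum length, while Anthropic’s is opt-in and you mark where the cached prefix ends with a `cache_control` breakpoint. Here’s a rough sketch with the Anthropic Python SDK (the model ID and instructions are placeholders, and older SDK versions may require a prompt-caching beta flag, so check the current docs):

```python
import anthropic

client = anthropic.Anthropic()  # assumes ANTHROPIC_API_KEY is set

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",  # placeholder: any caching-capable Claude model
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": "You are a sales assistant. Follow the email framework below... "
                    "(static instructions, at least 1,024 tokens)",
            "cache_control": {"type": "ephemeral"},  # everything up to this breakpoint is cached
        }
    ],
    messages=[
        {"role": "user", "content": "Write an email to Acme about our new feature."}
    ],
)

# Anthropic reports cache activity in the usage block
print(response.usage.cache_creation_input_tokens, response.usage.cache_read_input_tokens)
```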
Step 4: Monitor your performance
Prompt caching doesn’t affect the generation of output tokens or the LLM’s final response. As a result, the output will be the same regardless of whether prompt caching was used or not. This is because only the prompt gets cached while the output is re-generated each time.
Still, you’ll want to monitor your prompt caching performance. You should watch your:
- Cache hit rate
- Latency
- Percentage of tokens cached
The more hits, the better your latency and the lower your costs.
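Both providers report caching activity in the response’s usage object, so you can compute the hit rate and percentage of cached tokens yourself. As an example, here’s a small sketch for the OpenAI Python SDK, assuming `response` is what `client.chat.completions.create` returned and that your SDK version exposes `prompt_tokens_details`:

```python
def cache_stats(response) -> dict:
    """Pull cache metrics out of an OpenAI chat completion response."""
    usage = response.usage
    details = getattr(usage, "prompt_tokens_details", None)
    cached = getattr(details, "cached_tokens", 0) if details else 0
    return {
        "prompt_tokens": usage.prompt_tokens,
        "cached_tokens": cached,
        "pct_cached": 100 * cached / usage.prompt_tokens if usage.prompt_tokens else 0.0,
        "cache_hit": cached > 0,
    }

# Aggregate these per request to track your overall hit rate and % of tokens cached
```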
There are a few ways you can increase your odds of a cache hit:
- Cache a higher percentage of tokens
- Use longer prompts
- Make requests during off-peak hours
- Use the same prompt prefixes consistently (prompts that haven’t been used recently are automatically removed from your cache)
Result
At AiSDR, we’ve used prompt caching to decrease our monthly LLM expenses by over 34% while speeding up general performance. This is because cached tokens cost a fraction of the regular price – half with OpenAI and as little as a tenth with Anthropic.
According to OpenAI and Anthropic, if you’re able to cache a huge percentage of your prompts, you can potentially see your cost savings reach up to 90%.
But remember, any small variation between prompts – even a single letter – will prevent the cache hit you need for cost savings. Unfortunately, this means you won’t see much benefit from caching while you’re still testing and fine-tuning your prompts, since the prefix keeps changing.