Prompt Caching: The Key to Reducing LLM Costs up to 90%

Oct 31, 2024
By Oleg Zaremba

Find out how to set up prompt caching and save on AI costs

3m 43s reading time

Prompt caching is a clever technique that saves frequently used prompts (or prompt prefixes) so the model doesn't have to reprocess them on every request. With this approach, you can dramatically cut the costs of using large language models while improving speed and efficiency.

Here’s how you can set up prompt caching to lower your LLM costs by up to 90% (depending on the LLM you choose) and improve latency by up to 80%.

TLDR

  • The goal: Optimize LLM costs and improve LLM performance
  • The tactic: Set up prompt caching
  • The result: Cut LLM costs by up to 90% and improve latency by up to 80%

Step 1: Identify repeating prompts

As you’ll see in Step 2, prompt caching depends heavily on the presence of repetitive generative AI prompts or prompt sections.

To identify repeating prompts, you’ll need to track all prompts you make over a specific period, such as a week or month. This lets you establish a baseline of prompt usage.

Next, you’ll have to review these prompts for similar phrasing and keywords, then cluster and rank them by theme and frequency. Prioritize the highest-frequency prompts as candidates for caching.

Good candidates for caching include:

  • Long system instructions that stay the same across requests
  • Few-shot examples you reuse in every prompt
  • Images you attach to every request
  • Tool and function definitions that rarely change
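
As a rough starting point, here’s a minimal Python sketch for spotting repeats. The prompt_log.jsonl file and its "prompt" field are hypothetical, so adapt it to however you actually log requests:

```python
import json
from collections import Counter

# Compare only the start of each prompt, since caching is prefix-based.
PREFIX_CHARS = 500

def load_prompts(path: str) -> list[str]:
    """Load logged prompts from a JSON-lines file with a 'prompt' field (assumed log format)."""
    with open(path) as f:
        return [json.loads(line)["prompt"] for line in f]

def rank_repeating_prefixes(prompts: list[str], top_n: int = 10) -> list[tuple[str, int]]:
    """Count identical prompt prefixes and return the most frequent ones."""
    counts = Counter(prompt[:PREFIX_CHARS] for prompt in prompts)
    return counts.most_common(top_n)

if __name__ == "__main__":
    prompts = load_prompts("prompt_log.jsonl")  # hypothetical log file
    for prefix, freq in rank_repeating_prefixes(prompts):
        print(f"{freq:>5}x  {prefix[:60]!r}")
```

The prefixes that show up most often are the ones worth restructuring for caching in Step 2.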

Step 2: Structure prompts

You’ll need to structure your prompts carefully if you want to maximize your cache hits. A cache hit happens only when the start of your prompt exactly matches a cached prefix, so inputs that vary by even a single character won’t trigger one.

When structuring prompts, place static content like instructions and examples at the beginning, and keep dynamic content and variable data for the end. If you have prompts that are similar enough to get the same result but aren’t precisely identical, adjust them until they match exactly (this is also a good time for any prompt debugging).

This also applies to any images and tools you use within your prompts.

Here’s how this looks in action:

  • Cache lookup – The system checks if the “prefix” is stored in the cache.
  • Cache hit – When the system finds a matching prefix, it will use the cached result.
  • Cache miss – If the system doesn’t find a matching prefix, it processes your entire prompt and caches the prefix for future requests.

Cache hits are what you need to aim for if you want to reduce your cost and improve latency.
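
To make the static-first structure concrete, here’s a minimal sketch using the OpenAI Python SDK. The instruction text, model choice, and draft_email helper are placeholders for illustration, not AiSDR’s implementation:

```python
from openai import OpenAI

client = OpenAI()

# Static prefix: keep it identical on every request so it can be matched as a cached prefix.
# OpenAI only caches prefixes of 1,024 tokens or more, so this block should be long.
STATIC_INSTRUCTIONS = (
    "You are a sales assistant. Write a short, friendly outreach email.\n"
    "Rules: ...\n"                  # long, unchanging instructions
    "Examples of good emails: ..."  # unchanging few-shot examples
)

def draft_email(lead_name: str, lead_company: str) -> str:
    """Static content goes first, variable lead data goes last."""
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": STATIC_INSTRUCTIONS},  # cacheable prefix
            {"role": "user", "content": f"Lead: {lead_name} at {lead_company}"},  # dynamic suffix
        ],
    )
    return response.choices[0].message.content
```

With OpenAI, caching kicks in automatically once the shared prefix clears the 1,024-token minimum; no extra parameters are needed.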

Step 3: Choose your LLM

Different LLMs have their own rules for caching, requirements, and costs, so you’ll want to review each one separately to see which will work best for you.

Here’s a comparison of OpenAI and Anthropic:

| | OpenAI | Anthropic |
| --- | --- | --- |
| Cost of caching | Free | +25% (on cache writes) |
| Caching savings | Up to 50%, plus up to 80% better latency | Up to 90% |
| Models that support caching | GPT-4o, GPT-4o mini, o1-mini, o1-preview, fine-tuned versions | Claude 3.5 Sonnet, Claude 3 Haiku, Claude 3 Opus |
| Minimum prompt length | 1,024 tokens | 1,024 tokens (Sonnet & Opus), 2,048 tokens (Haiku) |
| Cache lifetime | 5-10 minutes | 5 minutes |
| Cache mechanism | Partial (prefix match) | Exact match |
| What can be cached | Messages, images, tool use, structured outputs | Messages, images, system messages, tools, tool use, tool results |
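
The main practical difference is that OpenAI caches long prefixes automatically, while Anthropic makes you mark the cacheable content explicitly with cache_control. Here’s a minimal sketch with the Anthropic Python SDK; the system prompt, model version, and example message are placeholders, and the cached block must meet the minimum token count shown above:

```python
import anthropic

client = anthropic.Anthropic()

# Static instructions and examples; must meet the minimum token count to be cached.
LONG_SYSTEM_PROMPT = "You are a sales assistant. ..."

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=500,
    system=[
        {
            "type": "text",
            "text": LONG_SYSTEM_PROMPT,
            "cache_control": {"type": "ephemeral"},  # ask Anthropic to cache this block
        }
    ],
    messages=[
        {"role": "user", "content": "Draft an email for Jane Doe at Acme Corp."}  # dynamic part
    ],
)
print(response.content[0].text)
```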

Step 4: Monitor your performance

Prompt caching doesn’t affect the generation of output tokens or the LLM’s final response, so the output will be the same whether or not prompt caching was used. Only the prompt gets cached; the output is regenerated each time.

Still, you’ll want to monitor your prompt caching performance. You should watch your:

  • Cache hit rate
  • Latency
  • Percentage of tokens cached

The more hits, the better your latency and the lower your costs.
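
Both providers report cache usage on each API response, which makes these metrics straightforward to log yourself. Here’s a minimal sketch of pulling them out; the field names reflect the OpenAI and Anthropic Python SDK responses at the time of writing, so treat them as assumptions to verify against the current docs:

```python
def openai_cache_stats(response) -> dict:
    """Summarize cached-token usage from an OpenAI chat completion response."""
    usage = response.usage
    cached = usage.prompt_tokens_details.cached_tokens  # prompt tokens served from cache
    return {
        "prompt_tokens": usage.prompt_tokens,
        "cached_tokens": cached,
        "cached_pct": 100 * cached / max(usage.prompt_tokens, 1),
    }

def anthropic_cache_stats(response) -> dict:
    """Summarize cache reads and writes from an Anthropic messages response."""
    usage = response.usage
    return {
        "input_tokens": usage.input_tokens,
        "cache_read_tokens": usage.cache_read_input_tokens,       # tokens served from cache (hits)
        "cache_write_tokens": usage.cache_creation_input_tokens,  # tokens written to cache (misses)
    }
```

Logging these numbers per request lets you track your cache hit rate and the percentage of tokens cached over time.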

There are a few ways you can increase your odds of a cache hit:

  • Cache a higher percentage of tokens
  • Use longer prompts
  • Make requests during off-peak hours
  • Use the same prompt prefixes consistently (prompts that haven’t been used recently are automatically removed from your cache)

Result

At AiSDR, we’ve used prompt caching to decrease our monthly LLM expenses by over 34% while speeding up general performance. This is because cached tokens are usually half the price of regular tokens. 

According to OpenAI and Anthropic, if you’re able to cache a huge percentage of your prompts, you can potentially see your cost savings reach up to 90%.

But remember, any small variation between prompts – even if it’s a single letter – will prevent the cache hit you need for cost savings. Unfortunately, this means you won’t be able to optimize costs if you’re in the process of testing and fine-tuning prompts.
