
Prompt Caching: The Key to Reducing LLM Costs up to 90%

Prompt caching is a clever technique that involves saving frequently used prompts and their answers for future use. With this approach, you can dramatically cut the costs of using large language models while improving speed and efficiency. 

Here’s how you can set up prompt caching to lower your LLM costs by up to 90% (depending on the LLM you choose) and improve latency by 80%.

TLDR

  • The goal: Optimize LLM costs and improve LLM performance
  • The tactic: Set up prompt caching
  • The result: Save up to 90% of LLM costs and boost performance by 80%

Step 1: Identify repeating prompts

As you’ll see in Step 2, prompt caching depends heavily on the presence of repetitive generative AI prompts or prompt sections.

To identify repeating prompts, you’ll need to track all prompts you make over a specific period, such as a week or month. This lets you establish a baseline of prompt usage.

Next, review these prompts for similar phrasing and keywords, then cluster and rank them by theme and frequency. Prioritize the highest-frequency prompts as candidates for caching.
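One simple way to do this ranking is to log your prompts over the tracking period and count how often each prefix recurs. Here's a minimal sketch (the log format and prefix length are assumptions, not a prescribed setup):

```python
from collections import Counter

def top_prefixes(prompts, prefix_chars=200, min_count=2):
    """Rank recurring prompt prefixes by frequency across a prompt log."""
    counts = Counter(p[:prefix_chars] for p in prompts)
    # Keep only prefixes seen at least min_count times, most frequent first
    return [(prefix, n) for prefix, n in counts.most_common() if n >= min_count]

# Hypothetical prompt log collected over a week
prompt_log = [
    "You are a sales assistant. Summarize this lead: Acme Corp...",
    "You are a sales assistant. Summarize this lead: Globex Inc...",
    "Translate this email to French: ...",
]
for prefix, n in top_prefixes(prompt_log, prefix_chars=40):
    print(n, repr(prefix))
```

In a real pipeline you'd feed this from your request logs; the point is simply to surface shared prefixes worth caching.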

Good candidates for caching include static instructions, few-shot examples, system messages, and tool definitions.

Step 2: Structure prompts

You'll need to structure your prompts carefully to maximize cache hits, which occur only when the beginning of your prompt exactly matches a cached prefix. In other words, a prefix that varies by even a single character won't trigger a cache hit.

When structuring prompts, place static content like instructions and examples at the beginning, and put dynamic content and variable data at the end. If you have prompts that are similar enough to produce the same result but aren't precisely identical, adjust them so they match exactly (this is also a good time for any prompt debugging).

This also applies to any images and tools you use within your prompts.
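In practice, this means assembling every request from a fixed, byte-for-byte identical prefix plus a variable suffix. A sketch of that structure (the instruction text and helper name are illustrative):

```python
# Static prefix: kept identical across requests so it can be cached
STATIC_SYSTEM = (
    "You are a sales assistant.\n"
    "Always answer in three sentences.\n"
    "Example: Lead -> short qualification summary.\n"
)

def build_messages(lead_details: str) -> list[dict]:
    """Static instructions first (cacheable), variable lead data last."""
    return [
        {"role": "system", "content": STATIC_SYSTEM},  # stable, cacheable prefix
        {"role": "user", "content": lead_details},     # dynamic suffix
    ]

msgs = build_messages("Lead: Acme Corp, 50 employees, SaaS")
```

Because the system message never changes, every request shares the same prefix and becomes eligible for a cache hit.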

Here’s how this looks in action:

  • Cache lookup – The system checks if the “prefix” is stored in the cache.
  • Cache hit – When the system finds a matching prefix, it will use the cached result.
  • Cache miss – If the system doesn’t find a matching prefix, the system will process your entire prompt and cache any prefixes for future prompts.

Cache hits are what you need to aim for if you want to reduce your cost and improve latency.
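The lookup/hit/miss flow above happens server-side at the provider, but a toy prefix cache makes the behavior concrete (this is only a model of the mechanism, not how any provider implements it):

```python
class PrefixCache:
    """Toy model of prefix caching: store work done for a prompt prefix."""

    def __init__(self):
        self.store = {}   # prefix -> cached processing result
        self.hits = 0
        self.misses = 0

    def process(self, prompt: str, prefix_len: int = 32):
        prefix = prompt[:prefix_len]
        if prefix in self.store:        # cache hit: reuse stored prefix work
            self.hits += 1
        else:                           # cache miss: process and store prefix
            self.misses += 1
            self.store[prefix] = f"processed:{prefix}"
        return self.store[prefix]

cache = PrefixCache()
cache.process("You are a sales assistant. Lead: Acme")    # miss: first time
cache.process("You are a sales assistant. Lead: Globex")  # hit: same prefix
print(cache.hits, cache.misses)
```

The second request hits the cache because its first 32 characters match the first request exactly, even though the lead data differs.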

Step 3: Choose your LLM

Different LLMs have their own rules for caching, requirements, and costs, so you’ll want to review each one separately to see which will work best for you.

Here’s a comparison of OpenAI and Anthropic:

|                             | OpenAI                                                          | Anthropic                                                        |
| --------------------------- | --------------------------------------------------------------- | ---------------------------------------------------------------- |
| Cost of caching             | Free                                                            | +25%                                                             |
| Caching savings             | -50% (and 80% better latency)                                   | -90%                                                             |
| Models that support caching | GPT-4o, GPT-4o-mini, o1-mini, o1-preview, fine-tuned versions   | Claude 3.5 Sonnet, Claude 3 Haiku, Claude 3 Opus                 |
| Minimum prompt length       | 1024 tokens                                                     | 1024 tokens (Sonnet & Opus), 2048 tokens (Haiku)                 |
| Cache lifetime              | 5-10 minutes                                                    | 5 minutes                                                        |
| Cache mechanism             | Partial                                                         | Exact                                                            |
| What can be cached          | Messages, images, tool use, structured outputs                  | Messages, images, system messages, tools, tool use, tool results |

Step 4: Monitor your performance

Prompt caching doesn't change the model's final response: only the prompt prefix is cached, while output tokens are regenerated on every request. The output is the same whether or not a cache hit occurred.

Still, you’ll want to monitor your prompt caching performance. You should watch your:

  • Cache hit rate
  • Latency
  • Percentage of tokens cached

The more hits, the better your latency and the lower your costs.
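These metrics are straightforward to compute from per-request usage stats. A sketch assuming each request reports its total prompt tokens and how many of them were served from cache (field names vary by provider):

```python
def cache_metrics(requests):
    """requests: list of dicts with 'prompt_tokens' and 'cached_tokens'."""
    total = sum(r["prompt_tokens"] for r in requests)
    cached = sum(r["cached_tokens"] for r in requests)
    # Hit rate: fraction of requests that reused at least some cached tokens
    hit_rate = sum(1 for r in requests if r["cached_tokens"] > 0) / len(requests)
    return {"hit_rate": hit_rate, "pct_tokens_cached": cached / total}

# Hypothetical usage data for two requests
usage = [
    {"prompt_tokens": 2000, "cached_tokens": 1500},
    {"prompt_tokens": 2000, "cached_tokens": 0},
]
print(cache_metrics(usage))
```

Watching these numbers over time tells you whether your prompt restructuring is actually producing hits.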

There are a few ways you can increase your odds of a cache hit:

  • Cache a higher percentage of tokens
  • Use longer prompts
  • Make requests during off-peak hours
  • Use the same prompt prefixes consistently (prompts that haven’t been used recently are automatically removed from your cache)

Result

At AiSDR, we’ve used prompt caching to decrease our monthly LLM expenses by over 34% while speeding up general performance. This is because cached tokens are usually half the price of regular tokens. 

According to OpenAI and Anthropic, if you’re able to cache a huge percentage of your prompts, you can potentially see your cost savings reach up to 90%.
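The savings arithmetic is simple: if cached input tokens are billed at half price (as with OpenAI), blended savings scale with the fraction of tokens served from cache. A toy model of input-token savings, with illustrative numbers only:

```python
def input_cost_savings(pct_cached: float, cached_discount: float = 0.5) -> float:
    """Fraction saved on input-token spend when pct_cached of tokens
    are billed at (1 - cached_discount) of the normal rate."""
    blended = (1 - pct_cached) + pct_cached * (1 - cached_discount)
    return 1 - blended

# e.g. 68% of input tokens cached at half price cuts input spend by ~34%
print(round(input_cost_savings(0.68), 2))
```

A larger per-token discount or a higher cached percentage pushes the savings toward the 90% upper bound quoted by the providers.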

But remember, any small variation between prompts – even if it’s a single letter – will prevent the cache hit you need for cost savings. Unfortunately, this means you won’t be able to optimize costs if you’re in the process of testing and fine-tuning prompts.

Oct 31, 2024
Last reviewed Feb 11, 2025
By:
Oleg Zaremba

