
Prompt Caching: The Key to Reducing LLM Costs up to 90%

Oct 31, 2024
By Oleg Zaremba

Find out how to set up prompt caching and save on AI costs

3m 43s reading time

Prompt caching is a clever technique that saves frequently used prompts (or prompt sections) so the model doesn't have to process them from scratch on every request. With this approach, you can dramatically cut the costs of using large language models while improving speed and efficiency.

Here’s how you can set up prompt caching to lower your LLM costs by up to 90% (depending on the LLM you choose) and improve latency by up to 80%.

TLDR

  • The goal: Optimize LLM costs and improve LLM performance
  • The tactic: Set up prompt caching
  • The result: Save up to 90% of LLM costs and cut latency by up to 80%

Step 1: Identify repeating prompts

As you’ll see in Step 2, prompt caching depends heavily on the presence of repetitive generative AI prompts or prompt sections.

To identify repeating prompts, you’ll need to track all prompts you make over a specific period, such as a week or month. This lets you establish a baseline of prompt usage.

Next, you’ll have to review these prompts for similar phrasing and keywords, then cluster and rank them by theme and frequency. Prioritize the highest-frequency prompts as candidates for caching.

Good candidates for caching include prompts with long static instructions, repeated examples, shared system messages, and reusable images or tool definitions.
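
If you log your prompts, even a rough script can surface the best candidates. Here's a minimal sketch in Python; the JSONL log format and the "prompt" field name are assumptions for illustration, not a prescribed setup:

```python
# Rough sketch: rank logged prompts by how often their opening section repeats.
# Assumes a JSONL log with one {"prompt": "..."} record per request (hypothetical format).
import json
from collections import Counter

def top_prompt_prefixes(log_path: str, prefix_chars: int = 500, top_n: int = 20):
    """Count how often the first `prefix_chars` characters of each logged prompt recur."""
    counts = Counter()
    with open(log_path) as f:
        for line in f:
            prompt = json.loads(line)["prompt"]
            counts[prompt[:prefix_chars]] += 1
    return counts.most_common(top_n)  # the most frequent prefixes are your caching candidates

# Example: top_prompt_prefixes("prompt_log.jsonl")
```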

Step 2: Structure prompts

You’ll need to structure your prompts carefully if you want to maximize your cache hits. A cache hit is only possible when the beginning of your prompt exactly matches a cached prefix. In other words, a prefix that varies by even a single character won’t trigger a cache hit.

When structuring prompts, place static content like instructions and examples at the beginning, and put dynamic content and variable data at the end. If you have prompts that are similar enough to get the same result but aren’t precisely identical, adjust them so they match exactly (this is also a good time for any prompt debugging).

This also applies to any images and tools you use within your prompts.

Here’s how this looks in action:

  • Cache lookup – The system checks if the “prefix” is stored in the cache.
  • Cache hit – When the system finds a matching prefix, it will use the cached result.
  • Cache miss – If the system doesn’t find a matching prefix, it processes your entire prompt and caches the prefix for future requests.

Cache hits are what you need to aim for if you want to reduce your cost and improve latency.
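
For illustration, here's a minimal sketch of a prompt structured for cache hits, assuming the OpenAI Python SDK (the instruction text and helper function are placeholders, and OpenAI is used here purely as an example provider). With OpenAI, caching kicks in automatically once the shared prefix passes the minimum token count:

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Static content (instructions, examples) goes first so every request shares
# the same cacheable prefix; it must exceed the provider's minimum token count.
STATIC_INSTRUCTIONS = """You are a sales assistant. Follow these rules...
[long instructions and examples go here]"""

def draft_email(lead_details: str) -> str:
    response = client.chat.completions.create(
        model="gpt-4o-mini",
        messages=[
            {"role": "system", "content": STATIC_INSTRUCTIONS},  # identical across calls -> cache hit
            {"role": "user", "content": lead_details},           # variable data goes last
        ],
    )
    return response.choices[0].message.content
```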

Step 3: Choose your LLM

Different LLMs have their own rules for caching, requirements, and costs, so you’ll want to review each one separately to see which will work best for you.

Here’s a comparison of OpenAI and Anthropic:

OpenAI
  • Cost of caching: free
  • Caching savings: 50% on cached tokens (plus up to 80% better latency)
  • Models that support caching: GPT-4o, GPT-4o-mini, o1-mini, o1-preview, and their fine-tuned versions
  • Minimum cacheable prompt: 1,024 tokens
  • Cache lifetime: 5-10 minutes
  • Cache mechanism: automatic prefix matching (partial)
  • What can be cached: messages, images, tool use, structured outputs

Anthropic
  • Cost of caching: +25% on cache writes
  • Caching savings: up to 90% on cached tokens
  • Models that support caching: Claude 3.5 Sonnet, Claude 3 Haiku, Claude 3 Opus
  • Minimum cacheable prompt: 1,024 tokens (Sonnet & Opus), 2,048 tokens (Haiku)
  • Cache lifetime: 5 minutes
  • Cache mechanism: exact match (explicit cache breakpoints)
  • What can be cached: messages, images, system messages, tools, tool use, tool results
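
For example, Anthropic requires you to mark the cacheable prefix explicitly. Here's a minimal sketch assuming the Anthropic Python SDK; the model name and instruction text are placeholders, so check the current docs for exact parameters:

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

LONG_STATIC_INSTRUCTIONS = "You are a sales assistant..."  # must clear the 1,024-token minimum

response = client.messages.create(
    model="claude-3-5-sonnet-20241022",
    max_tokens=1024,
    system=[
        {
            "type": "text",
            "text": LONG_STATIC_INSTRUCTIONS,
            "cache_control": {"type": "ephemeral"},  # everything up to this block gets cached
        }
    ],
    messages=[{"role": "user", "content": "Write a follow-up email for this lead."}],
)

# usage reports cache_creation_input_tokens (cache write) and cache_read_input_tokens (cache hit)
print(response.usage)
```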

Step 4: Monitor your performance

Prompt caching doesn’t affect the generation of output tokens or the LLM’s final response. As a result, the output will be the same regardless of whether prompt caching was used or not. This is because only the prompt gets cached while the output is re-generated each time.

Still, you’ll want to monitor your prompt caching performance. You should watch your:

  • Cache hit rate
  • Latency
  • Percentage of tokens cached

The more hits, the better your latency and the lower your costs.

There are a few ways you can increase your odds of a cache hit:

  • Cache a higher percentage of tokens
  • Use longer prompts
  • Make requests during off-peak hours
  • Use the same prompt prefixes consistently (prompts that haven’t been used recently are automatically removed from your cache)
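
Both providers report cache usage in the API response, so you can track your hit rate without extra tooling. Here's a rough sketch assuming the OpenAI Python SDK (field names may vary by SDK version; Anthropic exposes similar cache_read_input_tokens / cache_creation_input_tokens fields):

```python
from openai import OpenAI

client = OpenAI()

STATIC_INSTRUCTIONS = "You are a sales assistant..."  # the shared prefix from the earlier sketch

response = client.chat.completions.create(
    model="gpt-4o-mini",
    messages=[
        {"role": "system", "content": STATIC_INSTRUCTIONS},
        {"role": "user", "content": "Summarize this lead's last reply."},
    ],
)

usage = response.usage
details = usage.prompt_tokens_details
cached = details.cached_tokens if details else 0  # 0 on a cache miss
print(f"Prompt tokens: {usage.prompt_tokens}, cached: {cached} "
      f"({cached / usage.prompt_tokens:.0%} of the prompt)")
```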

Result

At AiSDR, we’ve used prompt caching to decrease our monthly LLM expenses by over 34% while speeding up general performance. This is because cached tokens are usually half the price of regular tokens. 

According to OpenAI and Anthropic, if you’re able to cache a huge percentage of your prompts, you can potentially see your cost savings reach up to 90%.

But remember, any small variation between prompts – even if it’s a single letter – will prevent the cache hit you need for cost savings. Unfortunately, this means you won’t be able to optimize costs if you’re in the process of testing and fine-tuning prompts.
