Cut AI Costs Without Sacrificing Performance with Model Distillation
Find out how to use model distillation to cut AI costs
Model distillation is a powerful technique for making generative AI more efficient. It allows you to use the outputs of a large model, such as GPT-4, to train a smaller model.
The payoff is a smaller model that performs nearly as well as the large one, but with lower costs and lower latency (i.e. the time the model takes to generate a response to a prompt). This comes in handy when you want to deploy a model with limited resources and don’t need the full power of a large model.
Here’s a closer look at the steps you need to take to distill a larger model into a smaller one.
TLDR:
- The goal: Fine-tune a smaller model to have similar performance to a larger model
- The tactic: Use model distillation where outputs of a large model train a smaller model
- The result: Get a smaller model that has similar performance to the larger model, but with less cost and latency
Step 1: Decide the model you’ll use and the task you’ll complete
Your first step is to figure out which LLM will serve as the large model. In addition to GPT-4 and its ‘relatives’ (e.g. GPT-4o), there are several other LLMs you can use: Gemini, Claude, Llama, and more.
Additionally, you should consider what task you want to achieve since this will determine the output content you’ll generate. For example, sales AI companies like AiSDR would generate outputs related to standard sales tasks like classifying emails or scoring leads.
Lastly, it’s a good idea to select the latest, most advanced version of the LLM you choose.
Step 2: Collect outputs from the large model
After you’ve settled on a large model and a corresponding task, it’s time for you to start generating high-quality outputs. You’ll use these to train the smaller model.
For good results, you’ll need upwards of 300 outputs, also known as records.
(Note: If you’ve never heard the term record before in the context of LLMs and generative AI, it can mean a single example of training data or a single stored output. If you’re using an LLM via an API, a record can refer to a single API request and the corresponding output.)
High-quality outputs are essential. If you’re uncertain about the quality of the outputs you’re generating, you can use a second LLM to validate them.
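If it helps to see what this looks like in practice, here’s a minimal sketch of collecting records, assuming you’re using the OpenAI Python SDK and an email-classification task. The prompt, model name, and file path are placeholders you’d swap for your own.

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK; swap in your provider's client

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Illustrative task prompt for a sales email classifier
SYSTEM_PROMPT = "Classify the sales email below as 'interested', 'not interested', or 'needs follow-up'."

def collect_records(emails, out_path="records.jsonl", model="gpt-4o"):
    """Send each input to the large model and store the prompt/output pair as one record."""
    with open(out_path, "w") as f:
        for email in emails:
            response = client.chat.completions.create(
                model=model,
                messages=[
                    {"role": "system", "content": SYSTEM_PROMPT},
                    {"role": "user", "content": email},
                ],
            )
            record = {
                "input": email,
                "output": response.choices[0].message.content,
            }
            f.write(json.dumps(record) + "\n")
```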
Step 3: Set a baseline for evaluating performance
Once you’ve collected and stored outputs (i.e. records) from the large model, you need to create a baseline for testing and evaluating the smaller model’s performance.
The baseline essentially serves as a measuring stick. By comparing the smaller model’s results against the larger model’s, you’ll see the gap in accuracy and how much better the large model is. This gives you insight into how much fine-tuning you’ll need to do, as well as where to focus it.
Don’t be surprised if the large model outperforms the smaller one across the board at this stage. That’s expected.
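One simple way to set that baseline (sticking with the same sketch and SDK as above) is to replay the stored prompts against the un-tuned smaller model and count how often it matches the large model’s answers. Exact-match agreement works for classification-style tasks; open-ended tasks would need a different metric.

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK, as in the previous sketch

client = OpenAI()

# Same instruction used when collecting records from the large model
SYSTEM_PROMPT = "Classify the sales email below as 'interested', 'not interested', or 'needs follow-up'."

def score_model(model_name, records_path="records.jsonl"):
    """Replay each stored prompt against `model_name` and measure agreement with the large model's output."""
    records = [json.loads(line) for line in open(records_path)]
    matches = 0
    for record in records:
        response = client.chat.completions.create(
            model=model_name,
            messages=[
                {"role": "system", "content": SYSTEM_PROMPT},
                {"role": "user", "content": record["input"]},
            ],
        )
        prediction = response.choices[0].message.content.strip().lower()
        if prediction == record["output"].strip().lower():
            matches += 1
    return matches / len(records)

# Baseline: how often does the un-tuned small model match the large model?
print(f"Baseline agreement: {score_model('gpt-4o-mini'):.1%}")
```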
Step 4: Create a training dataset and fine-tune the smaller model
You have your models. You know your baseline. And you even have an idea about how much better the original model is.
Your next step is to start improving or fine-tuning the smaller model.
This is where your set of 300+ outputs comes in. If you have thousands of high-quality samples, you can get even better results, so it’s up to you how many outputs you want to collect.
Start by filtering your outputs and selecting the best examples that align with the task you want to do. The more diverse and relevant the data, the better the small model will perform after training.
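Continuing the same sketch, the filtered records then get converted into the chat-format JSONL that the fine-tuning API expects, and you kick off a fine-tuning job on the smaller model. The file names and the simple filter here are illustrative.

```python
import json
from openai import OpenAI  # assumes the OpenAI Python SDK, as in the previous sketches

client = OpenAI()

def build_training_file(records_path="records.jsonl", train_path="train.jsonl"):
    """Convert filtered records into the chat-format JSONL used for fine-tuning."""
    with open(records_path) as src, open(train_path, "w") as dst:
        for line in src:
            record = json.loads(line)
            if not record["output"].strip():  # crude filter: drop empty or junk outputs
                continue
            example = {
                "messages": [
                    {"role": "user", "content": record["input"]},
                    {"role": "assistant", "content": record["output"]},
                ]
            }
            dst.write(json.dumps(example) + "\n")
    return train_path

# Upload the dataset and start a fine-tuning job on the smaller model
train_path = build_training_file()
uploaded = client.files.create(file=open(train_path, "rb"), purpose="fine-tune")
job = client.fine_tuning.jobs.create(training_file=uploaded.id, model="gpt-4o-mini")
print(f"Fine-tuning job started: {job.id}")
```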
Step 5: Evaluate and optimize the fine-tuned model
After fine-tuning the smaller model for the first time, compare it to the baseline you established and the larger model’s outputs.
You should see improvement, as well as areas where the smaller model may need further fine-tuning.
If the smaller model produces acceptable results, then you’re done. But if you want to further refine the model, here are some actions you can take:
- Adjust or debug prompts
- Adjust the training data by adding, subtracting, or changing outputs
- Adjust the evaluation process
By continuously evaluating and fine-tuning, you should be able to push the smaller model closer to the performance baseline of the larger model for your specific tasks. You can even combine model distillation with prompt caching to unlock even more cost savings.
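Closing the loop might look like this, reusing the score_model() helper from Step 3. The fine-tuned model name below is a placeholder for whatever your own fine-tuning job returns.

```python
# Reuse score_model() from Step 3 to compare before and after fine-tuning.
# The fine-tuned model name is a placeholder; use the one your job returns
# (job.fine_tuned_model once the job finishes).
baseline = score_model("gpt-4o-mini")
tuned = score_model("ft:gpt-4o-mini:your-org::example123")

print(f"Before fine-tuning: {baseline:.1%} agreement with the large model")
print(f"After fine-tuning:  {tuned:.1%} agreement with the large model")
# If the gap is still too wide, adjust prompts or training data and repeat.
```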
The Result
If all goes well and the new model performs as expected, you should see these benefits:
- Reduced cost – You can significantly lower operational costs since smaller models are less resource-intensive, making them cheaper to run.
- Faster performance – Smaller models usually have faster inference times, which translates to reduced latency and quicker response times in applications.
- Efficient deployment – Compact models are easier to deploy on devices with less computational power, such as a smartphone, and still get comparable results.
- Simpler task-specific fine-tuning – It’s much simpler and quicker to fine-tune a smaller model than it is to fine-tune a large model.
- Easier scalability – When models are more efficient, you can scale products and applications more effectively so that they handle more users and processes simultaneously without overloading systems.