Using an AI to Validate Another AI’s Output
Have you ever asked ChatGPT the same question twice?
Did you get the same answer?
The answer’s probably no, assuming the question wasn’t too basic.
This is because outputs from generative AI are “non-deterministic”: the same prompt can produce different answers each time. That means we need some way to assess the quality of any given output, a process known as AI validation or AI evaluation.
There are frameworks available for evaluating AI outputs, but they frequently require a great deal of industry-specific knowledge.
Or you can do what we did at AiSDR: take a shortcut and use one generative AI model to validate the outputs of another.
TLDR:
- The goal: Confirm whether an AI’s output is sufficiently good
- The tactic: Use a different AI to check the output of the original AI (e.g. use Gemini to check ChatGPT’s output)
- The result: Speed up the process of checking AI outputs
Step 1: Generate an Output
Let’s say you want to take a closer look at the outputs of GPT-4o.
Before you can test an output’s quality, you first need to generate one. The output can be text, an image, audio, or any other type of content.
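To make this step concrete, here’s a minimal sketch of generating a text output with GPT-4o, assuming the OpenAI Python SDK (`openai>=1.0`) and an OPENAI_API_KEY environment variable; the prompt is purely illustrative.

```python
# Minimal sketch: generate an output from GPT-4o with the OpenAI Python SDK.
# Assumes `pip install openai` and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="gpt-4o",
    messages=[
        {"role": "user", "content": "Write a short outreach email introducing our product."},
    ],
)

output_text = response.choices[0].message.content
print(output_text)
```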
Step 2: Reformat the Output
Depending on the AI models you use, you might have to “reformat” the output so that the second AI can work with it better. This is especially true if you’re using LLMs from different providers, like Anthropic and OpenAI.
This step is more commonly known as preprocessing.
Common examples of preprocessing include:
- Adding or removing text elements like HTML tags
- Breaking larger texts into smaller parts (e.g. essays into individual paragraphs or paragraphs into individual sentences)
- Identifying entities such as people, organizations, locations, or dates (named-entity recognition)
- Converting data into a format that’s compatible with the second model
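Here’s a hedged sketch of what this preprocessing might look like for a text output: stripping leftover HTML tags, splitting the text into paragraphs, and packaging the result in a structured format for the second model. The helper name and output structure are assumptions for illustration.

```python
# Minimal preprocessing sketch: strip HTML tags, split the text into paragraphs,
# and package the result for the second model. The helper name and the output
# format are illustrative assumptions.
import json
import re

def preprocess(raw_output: str) -> dict:
    # Remove simple HTML tags left over from generation
    plain_text = re.sub(r"<[^>]+>", "", raw_output)
    # Break the larger text into smaller parts (here: paragraphs)
    paragraphs = [p.strip() for p in plain_text.split("\n\n") if p.strip()]
    # Convert into a format the evaluating model can work with
    return {"paragraphs": paragraphs, "paragraph_count": len(paragraphs)}

prepared = preprocess("<p>First paragraph.</p>\n\n<p>Second paragraph.</p>")
print(json.dumps(prepared, indent=2))
```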
Step 3: Evaluate the Output
Once the output’s been preprocessed, you can enter the output into the second AI (e.g. Claude 3.5 Sonnet). To systematize your AI evaluation, you should create a set of criteria by which you can compare AI models.
The set of criteria you create will vary by situation, but common criteria include:
| Criterion | What it checks | How to assess |
| --- | --- | --- |
| Computational efficiency | Resource usage, speed, scalability, and performance | Compare how quickly the model processes data and whether it can handle larger input volumes without slowing down. |
| Currency | Whether content reflects new information | Check if the model includes up-to-date facts, trends, or product details relevant to your use case. |
| Relevancy | Alignment with the topic or request | See if the output directly addresses the user’s question or sales context without going off topic. |
| Authority | Support from reputable sources or evidence | Verify whether the model cites or infers from reliable data or best-known practices. |
| Accuracy | Freedom from errors or misleading information | Review the response for internal consistency and correctness across facts and metrics. |
| Purpose | Whether content matches the intended use | Assess if the output format and tone align with the task (e.g., outreach email vs. technical summary). |
You can then use the results to determine whether the original AI is effective and suitable for your requirements, which could range from model distillation to carrying out simple day-to-day work tasks.
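For illustration, here’s a minimal sketch of asking Claude 3.5 Sonnet to score a GPT-4o output against some of the criteria above, assuming the Anthropic Python SDK and an ANTHROPIC_API_KEY environment variable; the model id, prompt wording, and 1-5 scale are illustrative assumptions, not a prescribed rubric.

```python
# Minimal evaluation sketch: ask a second model (Claude 3.5 Sonnet via the
# Anthropic Python SDK) to score the first model's output against fixed criteria.
# Assumes `pip install anthropic` and an ANTHROPIC_API_KEY environment variable;
# the model id, criteria list, and prompt wording are illustrative assumptions.
import anthropic

CRITERIA = ["currency", "relevancy", "authority", "accuracy", "purpose"]

def evaluate(output_text: str) -> str:
    client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
    prompt = (
        "Score the following text from 1 to 5 on each of these criteria: "
        f"{', '.join(CRITERIA)}. Briefly justify each score and respond in JSON.\n\n"
        f"Text to evaluate:\n{output_text}"
    )
    message = client.messages.create(
        model="claude-3-5-sonnet-latest",  # assumed model id; substitute your own
        max_tokens=1024,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text

print(evaluate("Hi Jane, I noticed your team is hiring SDRs this quarter..."))
```

Asking the evaluating model to respond in JSON makes it easier to log scores over time and compare models side by side.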
The Result
Using a second AI to validate an AI has helped us achieve several good results:
- Resource efficiency – AI’s capacity to process large amounts of data faster and more thoroughly than humans has helped us allocate people and resources to strategic-level tasks. It’s also allowed us to quickly fine-tune our AI’s ability to flag potential issues and non-standard sales emails, as well as recognize common types of messages like auto-replies.
  - How you can use it – Delegate high-volume, low-level tasks that can be completed with minimal training to AI. Tell the AI when it makes a mistake, and provide examples of what you expect. The more the AI works with data, the better its performance. It’s not much different from onboarding a new teammate.
- Scalability – Our AI works with thousands of emails each day. If our team were stuck manually checking and classifying every email, we wouldn’t have much time to focus on product development.
  - How you can use this – If you’re uncertain about how effective AI is at classification or other high-volume, low-level tasks, you can use a different AI to test its ability. This will help you build confidence in your AI or provide insight into where improvements are needed.
- Redundancy – Learning how to interact with different AI models has helped us prepare for situations where a major AI model like ChatGPT goes down. If this happens, we can simply switch the AI engine, since we built AiSDR to be model-agnostic (a minimal sketch of this pattern follows the list).
  - How you can use this – Design your processes and product development to be model-agnostic as well. This limits your exposure to an AI going down, since your processes will continue to operate as expected.
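As an example of that engine-switching idea, here’s a minimal sketch of a model-agnostic wrapper, assuming the OpenAI and Anthropic Python SDKs with API keys in the environment; the function, provider set, and model ids are illustrative and aren’t how AiSDR is actually built.

```python
# Minimal sketch of a model-agnostic wrapper: downstream code depends only on
# generate(), so the underlying engine can be swapped without touching the rest
# of the pipeline. Provider set and model ids are illustrative assumptions.
import anthropic
from openai import OpenAI

def generate(prompt: str, engine: str = "openai") -> str:
    if engine == "openai":
        client = OpenAI()  # reads OPENAI_API_KEY from the environment
        response = client.chat.completions.create(
            model="gpt-4o",
            messages=[{"role": "user", "content": prompt}],
        )
        return response.choices[0].message.content
    if engine == "anthropic":
        client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
        message = client.messages.create(
            model="claude-3-5-sonnet-latest",  # assumed model id
            max_tokens=1024,
            messages=[{"role": "user", "content": prompt}],
        )
        return message.content[0].text
    raise ValueError(f"Unknown engine: {engine}")

# If one provider has an outage, switching engines is a one-argument change.
print(generate("Draft a short follow-up email.", engine="anthropic"))
```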
However, there is one caveat to using generative AI: you’ll have to get used to, and work around, the fact that AI behaves unpredictably.
Tips for Getting the Best Results from AI
Based on our experience using one AI to validate AI outputs from another, here are some practical tips:
- Delegate high-volume, low-level tasks – Use AI for repetitive classification tasks and free up your team for higher-level work. This type of AI evaluation saves time and resources.
- Provide feedback and examples – Tell the AI when it makes mistakes and show it the kind of output you expect, just like training a new teammate. Consistent AI validation improves performance over time.
- Leverage multiple models – Rely on different AIs to cross-check each other. This form of AI verification and validation reduces risk and creates redundancy if one model goes down.
- Design processes to be model-agnostic – Build workflows that aren’t tied to a single AI provider. Combining approaches with an AI assessment tool makes your systems more adaptable and resilient.
These practices not only improve output quality but also build long-term resilience into your AI operations through structured AI validation.
FAQ
What is an AI evaluation?
AI evaluation, also called AI verification and validation, is the process of checking whether an AI’s output meets defined criteria such as accuracy, relevancy, and efficiency. It ensures the results are trustworthy and aligned with the intended purpose.
How to evaluate an AI response?
You can validate AI responses by preprocessing the output if needed and then using another AI model to review it against structured criteria like accuracy, authority, and purpose. This cross-validation approach helps identify weak points or inconsistencies and ensures outputs stay accurate and aligned with the intended goal. Using an AI assessment tool can make this process more systematic.
Why validate the results of AI?
Because AI outputs are non-deterministic, AI validation ensures accuracy, scalability, and reliability. Validating results also builds redundancy, so teams can trust outputs even when switching between different AI models.
Can you trust AI accuracy?
Yes, but only with AI verification and validation. While AI can be accurate and resource-efficient, unpredictability means it’s essential to validate AI outputs with structured checks or an AI assessment tool for confidence in results.