LLM as Judge: A Built-In Second Opinion for AI Hallucinations
Every AI model hallucinates. Sometimes it invents an API that does not exist, sometimes it cites a paper that was never written, sometimes it confidently moves a file into the wrong folder. The hard part is not avoiding hallucinations entirely; that is impossible with current models. The hard part is catching them before they cost you. VoxyAI now ships with LLM as Judge: a built-in second opinion that runs your AI's answers past a different model and flags the ones that look wrong.
What Is LLM as Judge?
LLM as Judge is a simple but powerful pattern. Whenever your primary AI produces a response, a second AI model called the judge inspects the original prompt and the answer, then returns one of three verdicts:
- Approved: the response is reasonable. It passes through silently and you never even see the review window.
- Uncertain: the judge cannot fully verify a specific claim, so it flags the response for your review.
- Rejected: the judge spotted a concrete error or fabrication, like an invented API call or a wrong fact.
When the judge flags something, VoxyAI opens a review window with the verdict, the judge's reasoning, and the primary AI's response in an editable text box. You decide what happens next: use the response anyway (with or without your edits), ask the AI to try again, or cancel and discard.
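In code, the whole pattern is small. Here is a minimal sketch in Swift; the type and function names are placeholder assumptions, not VoxyAI's actual API:

```swift
// The three verdicts the judge can return.
enum Verdict { case approved, uncertain, rejected }

struct JudgeReview {
    let verdict: Verdict
    let reasoning: String   // shown in the review window when flagged
}

// Placeholder interfaces standing in for the real model clients.
protocol PrimaryModel { func complete(_ prompt: String) async throws -> String }
protocol JudgeModel   { func review(prompt: String, response: String) async throws -> JudgeReview }

func respond(primary: PrimaryModel, judge: JudgeModel, prompt: String) async throws -> String {
    let response = try await primary.complete(prompt)
    let review   = try await judge.review(prompt: prompt, response: response)

    switch review.verdict {
    case .approved:
        return response   // passes through silently; no review window
    case .uncertain, .rejected:
        // Show the verdict, the judge's reasoning, and the editable response.
        // The user uses it (possibly edited), retries, or discards.
        return try await presentReviewWindow(response: response, review: review)
    }
}

// Stub for the review UI, which is elided in this sketch.
func presentReviewWindow(response: String, review: JudgeReview) async throws -> String { response }
```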
Why a Second AI Catches Errors the First One Misses
The intuition behind LLM as Judge is that different models have different blind spots. A model that confidently fabricates a function name from a library it half-remembers will often be caught by a model trained on different data, with different biases, and a different idea of what looks reasonable. The judge does not need to be smarter than the primary; it just needs to be different.
This is the same reason code review works in human teams. The author of a function is the worst person to spot its bugs, because they wrote the code by reasoning their way to it, and that same reasoning hides the assumptions they got wrong. A reviewer who comes in fresh sees what the author cannot. The judge plays the same role for AI output.
Where VoxyAI Uses It
You enable verification per feature, not globally. Every verified action makes a second AI call, so you only want it on the paths where catching a wrong answer is worth the extra cost and latency. Internally that amounts to a handful of independent toggles; a minimal sketch follows (the property names are assumptions, not VoxyAI's actual API):
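```swift
// Hypothetical settings shape, one toggle per verified path.
struct VerificationSettings {
    var codeGeneration      = true     // the two defaults here mirror the
    var directoryOrganizing = true     // suggestion at the end of this post
    var chat                = false
    var writingRewrites     = false
    var screenshotRename    = false
    var judgeProvider       = "ollama" // ideally a different family from the primary
}
```

The five verified paths in VoxyAI are: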
Code Generation
This is the highest-value path for verification. The primary AI suggests a function call, an import, or a chunk of code, and the judge checks whether the symbols and APIs actually exist in the language and version you asked for. Catching an invented API before it lands in your editor saves the round-trip of running broken code, reading the error, and asking the AI to fix what it just made up.
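What the judge is asked to check here can be made concrete. A sketch of the kind of instructions the judge might receive for this path (illustrative, not VoxyAI's actual prompt):

```swift
let language = "Swift", version = "5.10"   // filled in from the user's request

let codeJudgePrompt = """
You are reviewing a code suggestion for a \(language) \(version) project.
Check every function, type, and import the response uses: does it exist
in this language and version, with this signature?
If everything checks out, reply APPROVED.
If you cannot verify a specific symbol, reply UNCERTAIN and name it.
If a symbol is invented or misused, reply REJECTED and explain the error.
Do not comment on style; judge only correctness.
"""
```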
Directory Organizing
When VoxyAI proposes a batch of file moves and renames to organize a folder, the judge sanity-checks the plan. Did it propose moving a text file into an images folder? Are filenames consistent and reasonable for the content? A flag here lets you spot a misclassified batch before it touches your disk, which matters more than catching a single bad sentence in a chat reply.
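The judge sees the whole plan at once, something like the structure below (a sketch; VoxyAI's internal format may differ). Reviewing the batch as a whole is what catches the misclassified file:

```swift
// One proposed action in an organizing plan.
struct FileMove {
    let from: String
    let to: String
}

let plan = [
    FileMove(from: "IMG_4417.jpeg",      to: "Images/IMG_4417.jpeg"),
    FileMove(from: "vacation-notes.txt", to: "Images/vacation-notes.txt"),  // <- the judge flags this
    FileMove(from: "invoice-march.pdf",  to: "Documents/invoice-march.pdf"),
]
```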
Chat
In chat, the response streams to you immediately, the way you expect from any AI assistant. The judge runs in the background after the response is shown, and only opens the review window if it spots a clear factual error or fabrication. You get the speed of streaming and the safety net of verification, with no added latency on the happy path.
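Continuing the earlier sketch, the chat flow streams first and judges afterward; `primary`, `judge`, and `ui` are placeholders:

```swift
func handleChatTurn(prompt: String) async throws {
    // 1. Stream the primary response to the user immediately, token by token.
    let response = try await primary.stream(prompt) { token in ui.append(token) }

    // 2. Judge in the background. On approval the verdict is discarded and the
    //    user never sees it, so the happy path carries no extra latency.
    Task {
        let review = try await judge.review(prompt: prompt, response: response)
        if review.verdict != .approved {
            await ui.presentReviewWindow(response: response, review: review)
        }
    }
}
```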
Writing Rewrites
The Writing Coach rewrites your text in a different tone or style. Verification here checks that the rewrite preserved your meaning and did not invent new claims, names, or facts that were not in the original. Style changes are fine; semantic drift is not.
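A hypothetical example of the drift this is meant to catch:

```
Original:  "Revenue grew this quarter, helped by the new pricing tiers."
Rewrite:   "Revenue grew 40% this quarter, helped by the pricing tiers we launched in June."
Verdict:   Rejected. The rewrite invents a figure (40%) and a date (June)
           that appear nowhere in the original.
```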
Screenshot Rename
When VoxyAI generates a descriptive filename for a screenshot, the judge rejects names that invent specifics the AI could not have known just from the image. This keeps the auto-renamer honest: you get filenames that describe what is actually visible, not what the model imagines might be there.
Picking a Judge Model
A common mistake is reaching for the most expensive model as the judge. That is not the goal. The goal is independent verification, and the cheapest path to independence is using a different model family from your primary.
Some good pairings:
- Claude as primary, GPT-4o-mini as judge, or vice versa
- Gemini as primary, Claude Haiku as judge
- Apple Intelligence as primary, a small Ollama model as judge, both fully local
- Ollama (a larger local model) as primary, Apple Intelligence as judge, also fully local
A small fast model is usually enough. The judge only needs to spot clear errors, not produce its own answer, and a small model with a different training history makes a fine reviewer.
Tuning the Judge to Avoid Crying Wolf
A judge that flags everything is worse than no judge at all, because you start ignoring the warnings. VoxyAI tunes the judge's prompt to lean toward approval and only flag responses with a specific, identifiable problem. Many requests have multiple valid answers, and the judge's job is not to second-guess phrasing or style, only to catch concrete errors.
In practice this means the review window stays closed most of the time. When it does open, the judge's reasoning points to something specific you can act on: a fabricated function name, a wrong statistic, an off-topic answer. That is the difference between a useful safety net and an annoying interruption.
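The flavor of that tuning, sketched as a judge prompt (again illustrative, not VoxyAI's actual wording):

```
You are a reviewer, not a co-author. Approve by default.
Many prompts have several valid answers; a different approach, tone,
or structure than you would choose is not a flaw.
Flag a response only if you can name a specific, concrete problem:
a fabricated name or API, a factual error, an answer to a different
question than the one asked. If you flag, state the problem in one
or two sentences the user can act on.
```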
Privacy and Local Verification
Verification means sending the original prompt and the primary response to the judge, so the judge sees everything the primary saw. If you want that to stay on your Mac, configure a local provider as the judge: Ollama or Apple Intelligence. Verification then runs entirely on-device, even when the primary uses a cloud provider, so you can mix a powerful cloud model for generation with a private local model for verification.
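Concretely, a local judge call is just an HTTP request to localhost. A sketch against Ollama's documented /api/chat endpoint; the model name and prompt wording are illustrative:

```swift
import Foundation

struct OllamaJudge {
    let model: String   // any small model you have pulled, e.g. "llama3.2"

    // Sends the primary's prompt and response to a local Ollama instance.
    // Nothing leaves the machine: the endpoint is localhost.
    func review(prompt: String, response: String) async throws -> String {
        var request = URLRequest(url: URL(string: "http://localhost:11434/api/chat")!)
        request.httpMethod = "POST"
        request.setValue("application/json", forHTTPHeaderField: "Content-Type")

        let body: [String: Any] = [
            "model": model,
            "stream": false,
            "messages": [
                ["role": "system",
                 "content": "Review the response for concrete errors. Approve unless you can name one."],
                ["role": "user",
                 "content": "Prompt:\n\(prompt)\n\nResponse to review:\n\(response)"],
            ],
        ]
        request.httpBody = try JSONSerialization.data(withJSONObject: body)

        let (data, _) = try await URLSession.shared.data(for: request)
        let json = try JSONSerialization.jsonObject(with: data) as? [String: Any]
        let message = json?["message"] as? [String: Any]
        return message?["content"] as? String ?? ""   // the judge's verdict and reasoning
    }
}
```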
The Cost-Quality Tradeoff
Every verified action makes two AI calls instead of one, so token volume roughly doubles on the paths you enable; the dollar cost doubles too if judge and primary are comparably priced, and rises less with a small judge. Latency increases on non-streaming paths because the judge runs after the primary returns. This is a real tradeoff, and the right answer depends on what you are doing.
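Some back-of-envelope arithmetic, with made-up token counts and prices (assumptions, not VoxyAI's or any provider's numbers):

```swift
// Hypothetical token counts for one verified action.
let primaryTokens = 500 + 800               // prompt in, response out
let judgeTokens   = (500 + 800 + 200) + 50  // judge reads prompt + response + its
                                            // instructions, emits a short verdict

// Token volume roughly doubles. At matching per-token prices, dollars roughly
// double too; at a tenth of the primary's price, the judge adds about 12%.
let addedCost = Double(judgeTokens) / Double(primaryTokens) * 0.1   // ≈ 0.12
```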
Our suggestion: start by enabling verification only on code generation and directory organizing, the two paths where a wrong answer is most expensive to catch later. Live with that for a week. If you find yourself wishing the judge were watching another path too, expand from there.
Get Started
LLM as Judge is built into VoxyAI and ready to use. Open VoxyAI Settings, select the Judge section, flip the master switch, pick a judge provider, and choose which features to verify. The next time the judge spots a problem, the review window will explain what it saw and let you decide what to do.
Hallucinations are a fact of life with current AI models. They do not have to be a fact of life in your workflow.
Try VoxyAI Free
The AI assistant for your Mac that takes action on your behalf. Works with free local models or bring your own API keys.
Download VoxyAI