
Credit: Arize
The Dunning–Kruger Effect shows up a lot in AI. It is the bias where people overestimate their ability early on. With AI, it is easy to feel confident quickly, but developing real expertise takes much more time and practice than most people expect.
Here’s what usually happens:
Someone writes a prompt, doesn’t like the output, and dismisses the tool as “no good” (false conclusion, the problem was the process, not the tool).
Or, someone writes a prompt, uses the output, but then rewrites or fixes it themselves (this feels productive, but it’s where most people stall).
Surprisingly, it is the second behaviour that creates the biggest trap. On the surface, reusing AI outputs and fixing them yourself feels like progress (in the beginning, it is). But over time, this approach hides weak prompting and creates the illusion of progress. It convinces people they are getting value while the core issue (unclear prompts and no structured way to evaluate outputs) remains hidden.
This is what fuels the Dunning–Kruger Effect in AI. It’s easy to assume the tools are falling short, when in many cases the real blocker is simply building the skills to push them further.
At scale, this mindset slows adoption. Everyone is typing into ChatGPT or Gemini, leaning on metaprompting, and believing they have “got it”. But as people begin using AI more consistently in their daily work and personal workflows, the gap between casual use and disciplined practice is widening fast. And this gap becomes even more important to close as business use cases start to scale.
The Two Key Skills That Unlock The AI Flywheel.
Most people stall with AI because they don’t understand two critical skills, and more importantly, how they work together:
Prompt Engineering: structuring inputs clearly and effectively so the AI produces useful, reliable outputs. Without strong prompts, everything else falls apart.
AI Evals: testing, scoring, and refining outputs against a clear framework. Evals turn prompting from guesswork into a repeatable process and show you exactly where and how to improve.
Together, these two skills create a flywheel. Strong prompts lead to better outputs. Evals then surface gaps and opportunities, which in turn sharpen your prompts. Over time, this loop compounds, making your workflows more reliable, your results more consistent, and your skills much stronger.
Leaders in AI have been clear about the importance of evals:
Greg Brockman (President, OpenAI): “The most overlooked skill in machine learning is creating evals. Worthy metrics which beg for improvement are the root of progress”.
Garry Tan (President, Y Combinator): “Evals are emerging as the real moat for AI startups”.
Evals matter because they act as quality control. They ensure outputs are accurate, consistent, and trustworthy. Without them, you don’t know if your prompts are improving, or if you are simply running in circles. At their core, evals:
Provide systematic assessment of outputs
Create feedback loops that guide improvement
Maintain consistency and reliability across repeated use
This Is Not Just For Developers Or Startups.
If you are reusing prompts (whether for rewriting emails, summarising reports, or running a Custom GPT) you are building small AI systems. Each repeat creates a workflow that needs to be checked for consistency, quality, and reliability.
The scale of evals depends on how critical the output is, but the principle is simple: if you use it more than once, you need some form of evaluation process. Writing the prompt is only the first 20%. The next 80% comes from testing, iterating, and refining outputs until they consistently perform at the level you need.
When and Where to Use Evals.
Evals matter most in three situations:
Model updates: A prompt that worked in GPT-4o might change when GPT-5 arrives (prompt drift).
Missed expectations: Outputs fall short on tone, accuracy, or completeness.
Your evolving standards and processes: What counts as “good” today won’t be the same in three months.
Iteration closes the loop, like a quarterly review that ensures performance is improving, not drifting. Not every use case needs evals. Their importance depends on the stakes:
✅ Critical: Customer support bots, content systems, healthcare or finance assistants, productivity agents.
❌ Low-stakes: Quick brainstorming, playful outputs, casual one-offs.
Rule of Thumb: If you’re going to reuse a prompt or AI system, apply some form of evaluation.
One-off, low-stakes → a gut check is fine.
Recurring use → structured evals (e.g. 1–5 scale for accuracy and helpfulness).
Production systems → full frameworks with rubrics, scoring, and iteration loops.
The more often you plan to rely on an AI workflow, and the more impact its outputs have, the more rigorous your eval process needs to be.
How to Run an AI Eval Process.
Running an eval doesn’t have to be complicated. At its core, it’s a structured way of checking whether outputs meet your standards, learning from the gaps, and refining prompts until they do. It can be as simple or as complex as you need it to be.
Steps for Conducting Evals
Define Success: State clearly what a good output looks like.
Select Key Criteria: Choose aspects to measure (accuracy, tone, completeness, creativity, etc).
Run the Prompt: Generate the output.
Evaluate the Output: Score against your criteria using a rubric.
Refine and Repeat: Share your evaluation (scores + feedback) with the model and ask it to improve the prompt or output (this step can be automated). Re-run the prompt. Continue looping until results are consistent.
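To make the loop concrete, here is a minimal sketch of those five steps in Python. The model call, scoring, and refinement are all stubbed out as placeholders; in practice the scores would come from your rubric (or a second model) rather than being hard-coded.

```python
# A minimal sketch of the eval loop: define success, generate, score,
# and refine until results are consistently good enough.
CRITERIA = ["accuracy", "tone", "completeness"]  # Steps 1-2: success criteria
TARGET_SCORE = 4.0                               # pass mark on a 1-5 scale


def run_prompt(prompt: str) -> str:
    """Step 3: generate an output (stubbed; swap in a real model call)."""
    return f"Model output for: {prompt}"


def evaluate(output: str) -> dict:
    """Step 4: score against each criterion (stubbed scores on a 1-5 scale)."""
    return {criterion: 4 for criterion in CRITERIA}


def refine(prompt: str, scores: dict) -> str:
    """Step 5: feed the scores back and tighten the prompt (stubbed)."""
    weakest = min(scores, key=scores.get)
    return prompt + f"\nPay extra attention to {weakest}."


prompt = "Summarise this report in 5 bullet points for a busy executive."
for round_number in range(1, 4):                 # loop until consistent
    output = run_prompt(prompt)
    scores = evaluate(output)
    average = sum(scores.values()) / len(scores)
    print(f"Round {round_number}: average score {average:.1f}")
    if average >= TARGET_SCORE:
        break
    prompt = refine(prompt, scores)
```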
Rubrics Make It Repeatable
Just as university assignments are graded against criteria and scales, rubrics make AI evaluation consistent. They define:
Criteria: What you’re measuring (clarity, accuracy, tone, completeness, etc.). The criteria should differ based on your use case and what's important.
Scale: How you score it (1–5, 0–10, 0–100).
Descriptors: What each score means in practice.
Weighting (optional but really helpful): How much each criterion counts toward the overall score. This depends on your use case. For example, a medical assistant should weight accuracy and safety highest, while a content generator might place more weight on tone and originality.
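As a rough sketch of how this can look in practice, here is a hypothetical weighted rubric in Python. The criteria, weights, descriptors, and scores are all invented for illustration; swap in whatever matters for your use case.

```python
# Illustrative rubric: criteria, a 1-5 scale with descriptors, and weights
# that sum to 1. Weights are hypothetical; a medical assistant would weight
# accuracy far higher than tone.
rubric = {
    "accuracy":     {"weight": 0.4, "descriptor": "5 = fully correct, 1 = misleading"},
    "clarity":      {"weight": 0.3, "descriptor": "5 = instantly clear, 1 = confusing"},
    "tone":         {"weight": 0.2, "descriptor": "5 = on-brand, 1 = wrong register"},
    "completeness": {"weight": 0.1, "descriptor": "5 = nothing missing, 1 = major gaps"},
}

# Example scores for one output (these would come from your evaluation).
scores = {"accuracy": 4, "clarity": 5, "tone": 3, "completeness": 4}

weighted_total = sum(rubric[c]["weight"] * scores[c] for c in rubric)
print(f"Weighted score: {weighted_total:.2f} / 5")  # prints 4.10 / 5
```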
Objective vs Subjective Criteria
Objective: Fact-based checks (“No emojis”, “Exactly 100 characters”, “Must include New Zealand AI”). These leave no room for interpretation.
Subjective: Judgment-based checks (tone, originality, clarity). These require human evaluation, which makes feedback even more important.
Here's A Simple Example:
Prompt:
“Write a LinkedIn post with these rules: exactly 100 characters, no emojis, must include “New Zealand AI”, end with [link], tone friendly and professional.”
Model Output:
“Proud to share new insights from the New Zealand AI community 😊 Read more here [link] today”
Evaluation:
Objective: Emoji (fail), 92 characters (4/5), “New Zealand AI” (pass), wrong ending (fail).
Subjective: Tone (4/5), Originality (3/5).
Feedback Loop: Feed this eval back into the AI: “Here’s where you failed: emoji included, length off, wrong ending. Rewrite the prompt or adjust the output to fix these issues”. This makes the AI itself part of the iteration process.
The value lies in the feedback, not just the score. By capturing why it failed, you create a roadmap for improvement.
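The objective half of that evaluation is easy to automate. Below is a minimal sketch that checks the example’s rules in code and turns any failures into a feedback message for the next iteration; the rules come from the prompt above, while the emoji check and the wording of the feedback are just one way to do it.

```python
# Automating the objective checks from the LinkedIn example, then turning
# failures into a feedback message for the next round.
output = ("Proud to share new insights from the New Zealand AI community 😊 "
          "Read more here [link] today")

checks = {
    "no emojis": all(ord(ch) < 0x1F300 for ch in output),   # crude emoji check
    "exactly 100 characters": len(output) == 100,
    "includes 'New Zealand AI'": "New Zealand AI" in output,
    "ends with [link]": output.rstrip().endswith("[link]"),
}

failures = [name for name, passed in checks.items() if not passed]
if failures:
    feedback = ("Here's where you failed: " + "; ".join(failures) +
                ". Rewrite the post so every rule passes.")
    print(feedback)  # feed this straight back into the model or the prompt
```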
This is a good example of evaluating a prompt that might sit in your Prompt Library for re-use every now and again. When you're creating a system instruction for a Custom GPT or an advanced AI system, your Evals process will be a lot more robust.
Building a Robust AI Evals Prompt.
The best way to master evals is to build a reusable AI Evals Prompt that you can run your outputs through. Mine is a mix of prompts from AI courses, stitched together with the best parts from each.
The Basics
Clear instructions on the task and format
Defined evaluation criteria (accuracy, clarity, tone, etc.)
A scoring scale (1–5, 0–10, or 0–100) with descriptors
Both a score and an explanation for why it got that score
The Advanced Layer
Combine objective (fact-based) and subjective (judgment-based) checks
Apply weighting so criteria match the use case (accuracy in healthcare vs. tone in content)
Close the feedback loop by feeding evaluations back into the AI for refinement
Bake in advanced techniques like Chain of Thought, Few-Shot, or Chain of Density for deeper analysis
A strong eval prompt turns subjective guesswork into a repeatable process. The basics bring consistency. The advanced layer makes it scalable and adaptable, the key to moving from casual use to professional practice.
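As a starting point, here is a bare-bones sketch of what such an eval prompt might contain, written as a Python template string so it can be reused. The criteria, weights, and wording are placeholders to adapt, and the chain-of-thought instruction is only one of the advanced techniques you could bake in.

```python
# Illustrative skeleton of a reusable eval prompt. Criteria, weights,
# and scale are placeholders to adapt to your own use case.
EVAL_PROMPT = """You are a strict evaluator.

Think step by step before scoring (chain of thought), then score the OUTPUT
against the TASK using this rubric (1-5 for each criterion):

- Accuracy (weight 0.4): factually correct and faithful to the task.
- Clarity (weight 0.3): easy to follow on first read.
- Tone (weight 0.2): matches the requested register.
- Completeness (weight 0.1): nothing important is missing.

For each criterion, give the score and a one-sentence justification.
Then give the weighted total and rewrite the prompt (or output) to fix
the weakest areas.

TASK:
{task}

OUTPUT:
{output}
"""

print(EVAL_PROMPT.format(task="Write a friendly 100-character LinkedIn post.",
                         output="Proud to share new insights..."))
```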
Embedding Refined Prompts into Your Workflow.
When a prompt has been polished and proven reliable, it should move from experimentation into everyday use. The cycle looks like this:
Validate: Test until the prompt produces consistent, dependable results.
Record: Store it in a prompt library with notes on task, context, and evaluation outcomes.
Integrate: Where possible, plug the prompt into automated processes to save time.
Track: Recheck performance periodically to spot model drift or shifting requirements.
Adjust: Update the prompt whenever standards evolve or outputs begin to slip.
Prompt libraries make this process much easier. They act as a central hub for proven prompts, so you’re not reinventing the wheel every time you sit down to work.
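One lightweight way to keep the Record and Track steps honest is to store each proven prompt with a little metadata. Here is a sketch of what a single library entry could look like; the fields and values are only a suggestion.

```python
# A hypothetical prompt-library entry: the prompt plus the context you need
# to trust and maintain it later.
prompt_entry = {
    "name": "weekly_report_summary",
    "task": "Summarise a weekly report into 5 bullets for executives",
    "prompt": "Summarise the report below into exactly 5 bullet points...",
    "model": "gpt-4o",                      # model it was validated on
    "last_validated": "2024-09-01",
    "eval_results": {"accuracy": 4.6, "clarity": 4.8, "tone": 4.2},
    "notes": "Occasionally drops figures; re-check after the next model update.",
}
```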
The Next Step Up: Structured Evals in a Spreadsheet.
Once you’ve got a library of prompts, the next level is creating structured evals that test multiple variations side by side. A simple spreadsheet can get you surprisingly far:
Column A: 20–50 variations of a prompt or input that you feed into your system instruction (this could be in your Custom GPT or an advanced AI system).
Columns B → F: Rubric criteria (accuracy, clarity, tone, format, originality, etc.).
Cells: Each scored and annotated with comments.
Summary Row: Aggregated averages to show which variations consistently perform best.
This transforms evals into a dataset. Instead of guesswork, you start to see patterns: which instructions are robust, which collapse under variation, and how performance holds up over time.
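If you would rather build that grid in code than in a spreadsheet, a few lines of pandas give you the same structure plus the summary row. The variations and scores below are made up for illustration.

```python
# Recreating the eval spreadsheet with pandas: rows are prompt variations,
# columns are rubric criteria, and the summary shows which variation
# performs best on average. Scores are invented for illustration.
import pandas as pd

data = {
    "variation": ["v1: terse instruction", "v2: adds examples", "v3: adds role"],
    "accuracy":  [3, 4, 5],
    "clarity":   [4, 4, 5],
    "tone":      [3, 5, 4],
    "format":    [5, 4, 5],
}
df = pd.DataFrame(data).set_index("variation")

df["average"] = df.mean(axis=1)          # per-variation average
print(df)
print("\nBest variation:", df["average"].idxmax())
```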
Advanced AI Evals.
Beyond the Spreadsheet: Opening Up APIs
The natural progression from manual grids is automation. Instead of copy-pasting prompts and outputs, you can:
Pipe dozens of test inputs into your system instruction via API.
Collect outputs back into a sheet or database.
Run evaluation agents to grade results automatically against your rubric.
Flag drift in real time when scores slip (e.g., a 98 drops to a 95).
This is the bridge from personal practice to enterprise discipline. You start small with a feedback loop, then graduate to a spreadsheet, and finally open it up via API to scale across hundreds of test cases.
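Here is a hedged sketch of what that pipeline can look like using the OpenAI Python SDK (any provider’s API follows the same shape). The grading step is a stub where your rubric or evaluation agent would plug in, and the model name, system instruction, and test inputs are placeholders.

```python
# Sketch of an automated eval run: pipe test inputs through the API,
# collect outputs, and grade them against your rubric. Assumes the
# OpenAI Python SDK and an OPENAI_API_KEY in the environment.
from openai import OpenAI

client = OpenAI()
SYSTEM_INSTRUCTION = "You write 100-character LinkedIn posts about New Zealand AI."
test_inputs = ["Announce our new meetup", "Share this week's newsletter"]  # 20-50 in practice


def grade(output: str) -> float:
    """Stub: score against your rubric here, or call an evaluation agent."""
    return 5.0 if "[link]" in output else 3.0


results = []
for test_input in test_inputs:
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": SYSTEM_INSTRUCTION},
            {"role": "user", "content": test_input},
        ],
    )
    output = response.choices[0].message.content
    results.append({"input": test_input, "output": output, "score": grade(output)})

average = sum(r["score"] for r in results) / len(results)
print(f"Average score: {average:.1f}")  # compare against previous runs to flag drift
```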
Tuning Advanced Settings
System parameters matter as much as wording:
Temperature: Higher (0.7–1.0) for creative variety, lower (0.0–0.2) for reliable precision.
Max Tokens: Limits output length and helps enforce brevity.
Context Length: Controls how much prior conversation is remembered; longer context improves coherence but uses more resources.
An advanced eval workflow doesn’t just test prompts, it experiments with these levers to find the best balance for the use case.
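Those levers are just parameters on the same API call, so an eval run can sweep them too. A minimal sketch (again assuming the OpenAI Python SDK) comparing a low and a high temperature setting:

```python
# Sweeping temperature as part of an eval run (OpenAI SDK assumed).
from openai import OpenAI

client = OpenAI()
for temperature in (0.1, 0.9):
    response = client.chat.completions.create(
        model="gpt-4o",
        messages=[{"role": "user", "content": "Give me a tagline for a NZ AI meetup."}],
        temperature=temperature,   # low = precise and repeatable, high = more varied
        max_tokens=40,             # enforce brevity
    )
    print(f"temperature={temperature}: {response.choices[0].message.content}")
```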
Final Word.
Evaluations are the backbone of professional AI use. They transform prompting from a casual, trial-and-error activity into a disciplined process that can scale. With clear rubrics, iterative cycles, spreadsheets, and API-driven automation, you can move beyond the illusion of progress and unlock AI’s real potential.
Written by Mike ✌

Passionate about all things AI, emerging tech and start-ups, Mike is the Founder of The AI Corner.
