
This article from Hamel Husain is the highest-signal post I’ve read on building evaluations for LLM-based applications. I encourage you to spend 20 minutes reading the entire thing and not just my notes. Hamel breaks down how to evaluate AI systems in a way that cuts through the noise and gives practical, actionable guidance.

Here are the five points I found most interesting:

1. Work with a Domain Expert

If you do nothing else, at least work with someone who is an expert in the domain you’re solving for. They bring the judgment, context, and understanding to ensure your AI system is solving the right problem in the right way. This advice isn’t limited to LLMs—it’s true for almost everything. Without domain expertise, you risk building something disconnected from reality.

2. Define and Cover Input Dimensions

Start by defining the inputs your system is expected to handle and the dimensions of those inputs. For example, these might include features, scenarios, and personas—but it depends on your use case. The goal is to build a taxonomy of inputs that represents all relevant combinations, so your tests cover the full range of potential use cases.
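As a rough illustration, here is a minimal sketch of what such a taxonomy might look like in Python. The dimensions and values are hypothetical placeholders for a customer-support assistant, not anything prescribed in Hamel's post:

```python
from itertools import product

# Hypothetical input dimensions for a customer-support assistant.
# Replace these with the features, scenarios, and personas of your own use case.
dimensions = {
    "feature": ["billing", "order status", "account settings"],
    "scenario": ["happy path", "missing information", "angry customer"],
    "persona": ["new user", "power user"],
}

# Every combination of dimension values is one cell of the taxonomy
# that your test inputs should eventually cover.
combinations = [
    dict(zip(dimensions.keys(), values))
    for values in product(*dimensions.values())
]

print(f"{len(combinations)} combinations to cover")
print(combinations[0])
```

Even a small taxonomy like this one yields 18 combinations, which is usually enough structure to notice which parts of the input space your tests are quietly ignoring.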

3. Use Synthetic Input Creation

Synthetic inputs, generated by an LLM, are a powerful tool for testing. These inputs should align with the dimensions you’ve defined, ensuring coverage across your taxonomy. It’s surprisingly effective (and a bit meta)—you can use one LLM to generate inputs for testing another. This approach helps uncover weaknesses and edge cases, even without extensive real-world data.
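Building on the taxonomy sketch above, synthetic generation might look like the following. The `generate` function is a stand-in for whichever LLM client you use, and the prompt wording is my own illustration, not taken from the original article:

```python
def generate(prompt: str) -> str:
    """Placeholder for a call to your LLM provider's API."""
    raise NotImplementedError

def synthetic_input(combo: dict) -> str:
    """Ask an LLM to write one realistic test input for a taxonomy cell."""
    prompt = (
        "Write a realistic user message for an AI assistant.\n"
        f"Feature being used: {combo['feature']}\n"
        f"Scenario: {combo['scenario']}\n"
        f"User persona: {combo['persona']}\n"
        "Return only the user's message."
    )
    return generate(prompt)

# One synthetic test input per cell of the taxonomy defined earlier.
test_inputs = [synthetic_input(combo) for combo in combinations]
```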

4. Start with Pass/Fail Judgments from the Domain Expert

Avoid the temptation to add complex scoring systems or multi-dimensional ratings. These are just noise. Instead, force the domain expert to make a clear choice: Is it right or is it wrong? This simplicity keeps the evaluation focused and actionable. Once you’ve mastered the basics, you can layer in more complexity if needed.

5. Refine with Critiques from the Domain Expert

Written critiques from the domain expert clarify why something passed or failed. This feedback sharpens the evaluation process and improves your system. Critiques uncover subtle gaps, help you refine prompts, and even help clarify the actual problem you’re solving. They aren’t just about evaluation — they’re about aligning on the details of what your system should actually do.
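To make points 4 and 5 concrete, one lightweight way to capture the expert's reviews is a flat record per example: a binary pass/fail plus a free-text critique, nothing more. This structure and the example content are my own illustration, not a schema from the article:

```python
from dataclasses import dataclass

@dataclass
class Judgment:
    """One domain-expert review of a single model output."""
    input: str      # the test input shown to the system
    output: str     # what the system produced
    passed: bool    # the expert's binary call: right or wrong
    critique: str   # the expert's written explanation of why

# Invented example record for illustration.
example = Judgment(
    input="Where is my order #1234?",
    output="Your order shipped yesterday and should arrive Friday.",
    passed=False,
    critique="The assistant invented a shipping date; it should have "
             "looked up the order or said it could not find it.",
)
```

Reading through the critiques on failed records is where most of the prompt fixes and problem-definition corrections tend to come from.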

The Bottom Line

The most important takeaway for me is to work with someone who deeply understands the problem you’re solving. Domain experts are essential for defining success, guiding improvements, and ensuring your system delivers real-world value. It’s not just about the tech.
