created: Sometime 2024
LLM evaluations are tedious, especially for agents, but they're a necessary evil. Here's what I've learned while building evaluation systems:
The Stakes Are Real

Every change in your AI stack impacts performance, whether you're switching models, tweaking prompts, or updating tools. Even your model provider might push updates without notice. Without good evals, you're basically building blind. That said, I don't recommend evaluating every model that gets released; it's usually a waste of time, since each model behaves differently and a 5 percent bump on a benchmark isn't worth chasing.
Preparing Your Eval Dataset

For us, this meant creating a dataset of questions and manually gathering correct answers, double-checked with human input. This becomes your golden dataset, your ground truth. For you, it might be different: maybe you're evaluating customer service responses, code quality, or domain-specific knowledge. Whatever it is, invest time in building this baseline dataset.
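A golden dataset doesn't need to be fancy. Here's a minimal sketch of what ours looked like conceptually, as one JSON object per line; the field names and file name are my illustration, not a required format:

```python
import json

# Hypothetical golden dataset: each entry pairs a question with a
# human-verified answer. Field names are illustrative.
golden = [
    {"question": "What is the capital of France?", "answer": "Paris"},
    {"question": "Who wrote 1984?", "answer": "George Orwell"},
]

# Write as JSONL so new test cases can be appended over time.
with open("golden_dataset.jsonl", "w") as f:
    for row in golden:
        f.write(json.dumps(row) + "\n")
```

JSONL works well here because enriching the dataset later is just appending lines, and most eval platforms can import it directly.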
Running Evals

Once you have your baseline, every change runs through the same evaluation process. If your current setup gets 70/100 correct answers and that hot new model everyone's talking about only hits 60/100, you've saved yourself from a downgrade. Track these numbers over time to see whether you're actually improving or regressing.
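The core loop is simple: run every golden question through the candidate setup and count matches. A minimal sketch, where `ask_model` is a stand-in for whatever your stack actually calls:

```python
# Minimal eval loop: score a candidate setup against the golden dataset.
# `ask_model` is a placeholder for your real LLM call.
def ask_model(question: str) -> str:
    # Stub that only knows one answer, to make the example runnable.
    return {"What is 2 + 2?": "4"}.get(question, "")

golden = [
    {"question": "What is 2 + 2?", "answer": "4"},
    {"question": "What color is the sky?", "answer": "blue"},
]

def run_eval(ask, dataset):
    correct = sum(
        1 for row in dataset
        if ask(row["question"]).strip().lower() == row["answer"].lower()
    )
    return correct, len(dataset)

correct, total = run_eval(ask_model, golden)
print(f"{correct}/{total} correct")  # the stub above scores 1/2
```

Because `run_eval` takes the model call as an argument, comparing your current setup against a new model is just running it twice and comparing the two scores.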
Metrics That Matter

Pick metrics that align with what you care about. For us, it was accuracy and not letting the AI yap too much. You might care about exact word matching, response consistency, or something specific to your domain. Sometimes you just need to vibe check with your eyes; not everything needs a fancy metric.
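For the two things we cared about, the metrics can be as simple as this sketch: exact-match accuracy, and average word count as a crude proxy for yapping. Both function names are my own:

```python
# Exact-match accuracy: fraction of predictions that equal the gold
# answer (case-insensitive).
def accuracy(predictions, answers):
    hits = sum(
        p.strip().lower() == a.strip().lower()
        for p, a in zip(predictions, answers)
    )
    return hits / len(answers)

# Crude verbosity proxy: mean words per response. A rising number
# over time means the model is yapping more.
def avg_word_count(predictions):
    return sum(len(p.split()) for p in predictions) / len(predictions)

preds = ["Paris", "The answer is George Orwell, of course"]
golds = ["Paris", "George Orwell"]
print(accuracy(preds, golds))   # 0.5 — the second answer yapped past exact match
print(avg_word_count(preds))    # 4.0
```

Note how the two metrics interact: the second response contains the right answer but fails exact match precisely because of the extra words, which is one reason to track verbosity alongside accuracy.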
Keep Building

I started with internal tools for running evals, but maintaining them became its own project. We switched to BrainTrust, though any evaluation platform works fine. The important thing is to keep collecting data and enriching your eval dataset. Use LLMs to help clean and structure data, but keep adding new test cases as you learn more about what your users need.
Building AI apps is only the first step. Making them reliable, accurate, and optimized for your specific goal is where the majority of the work lies, so I recommend getting good at evals.