You can build a fortress in two ways: start stacking bricks one on top of another, or draw a picture of the fortress you're about to build, plan its execution, and keep evaluating your progress against that plan.
Everyone knows the second is the only way we can possibly build a fortress.
Sometimes, I'm the worst follower of my own advice. I'm talking about jumping straight into a notebook to build an LLM app. It's one of the worst things we can do to a project.
Before we start anything, we need a mechanism that tells us whether we're moving in the right direction: something that says the last thing we tried was better than what came before (or not).
In software engineering, this is called test-driven development. In machine learning, it's evaluation.
The first step, and the most useful skill, in developing LLM-powered applications is defining how you'll evaluate your project.
Evaluating LLM applications is nothing like software testing. I don't mean to understate the challenges of software testing, but evaluating LLMs isn't nearly as straightforward.
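To make the contrast concrete, here is a minimal sketch (my own illustration, not from any particular framework): a software test asserts one exact, deterministic answer, while an LLM evaluation has to score free-form text against a reference. Token-overlap F1 is one simple scoring choice among many.

```python
from collections import Counter

# Software testing: deterministic code has exactly one right answer.
def add(a: int, b: int) -> int:
    return a + b

assert add(2, 3) == 5  # pass/fail, no ambiguity

# LLM evaluation: many different strings can be equally correct,
# so we score the response against a reference instead of comparing
# for equality. Token-overlap F1 is one simple metric for this.
def token_f1(prediction: str, reference: str) -> float:
    pred = prediction.lower().split()
    ref = reference.lower().split()
    overlap = sum((Counter(pred) & Counter(ref)).values())
    if overlap == 0:
        return 0.0
    precision = overlap / len(pred)
    recall = overlap / len(ref)
    return 2 * precision * recall / (precision + recall)

reference = "The capital of France is Paris"
response = "Paris is the capital of France"  # a valid answer an LLM might return

print(response == reference)            # exact match fails
print(token_f1(response, reference))    # the metric still scores it highly
```

An exact-match assertion would reject a perfectly good answer here; the metric accepts it. Real evaluations go further (semantic similarity, LLM-as-judge, task-specific rubrics), but the shape of the problem is the same: scoring, not equality.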