This PR adds a new evaluator that judges the correctness of a response. It is loosely based on LlamaIndex's correctness.py evaluator.

Note that I couldn't decide the best way to provide the reference answer. I saw three choices:

  • Add it directly to EvaluationRequest. This seemed like the worst choice, as it would add a property that would probably only be used by this evaluator.
  • Overload the evaluate() method to take the reference answer in addition to the EvaluationRequest. I started this way, but it felt clunky.
  • Extend EvaluationRequest with a CorrectnessEvaluationRequest subclass and have evaluate() check for a CorrectnessEvaluationRequest and use its reference answer. This is the option I chose (see the sketch below).

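For clarity, here's a rough sketch of the shape of that third option. The class name, constructor signature, and exception handling are illustrative assumptions, not necessarily what's in this PR:

```java
import org.springframework.ai.evaluation.EvaluationRequest;

// Sketch only: names and the super() signature are assumptions, not the PR's exact code.
public class CorrectnessEvaluationRequest extends EvaluationRequest {

    private final String referenceAnswer;

    public CorrectnessEvaluationRequest(String userText, String responseContent, String referenceAnswer) {
        super(userText, responseContent); // assumes the String-response constructor from #967
        this.referenceAnswer = referenceAnswer;
    }

    public String getReferenceAnswer() {
        return this.referenceAnswer;
    }

}
```

The evaluator's evaluate() can then narrow the request type before using the reference answer:

```java
@Override
public EvaluationResponse evaluate(EvaluationRequest evaluationRequest) {
    if (!(evaluationRequest instanceof CorrectnessEvaluationRequest correctnessRequest)) {
        throw new IllegalArgumentException("CorrectnessEvaluator requires a CorrectnessEvaluationRequest");
    }
    String referenceAnswer = correctnessRequest.getReferenceAnswer();
    // ... build the judge prompt from the user text, the response, and the reference answer ...
}
```
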
Also note that this change is built on top of the change in #967, so EvaluationRequest takes a String for the response.

Comment From: habuma

This returns both a pass/fail boolean and an explanation. The pass/fail result is determined by a score threshold: if the score falls below the threshold, the test fails. The explanation is provided in the feedback property of the EvaluationResponse.
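Roughly, the mapping is (a sketch; the threshold value, the 1-5 scale, and the EvaluationResponse constructor arguments are assumptions, not the exact code in this PR):

```java
import java.util.Map;

import org.springframework.ai.evaluation.EvaluationResponse;

// Sketch only: threshold, scale, and constructor arguments are assumed.
class CorrectnessScoring {

    private static final float SCORE_THRESHOLD = 4.0f; // e.g., on a 1-5 correctness scale

    EvaluationResponse toEvaluationResponse(float score, String explanation) {
        boolean pass = score >= SCORE_THRESHOLD; // below the threshold, the test fails
        return new EvaluationResponse(pass, score, explanation, Map.of()); // explanation rides in feedback
    }

}
```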

Comment From: ilopezluna

> This returns both a pass/fail boolean and an explanation. The pass/fail result is determined by a score threshold: if the score falls below the threshold, the test fails. The explanation is provided in the feedback property of the EvaluationResponse.

I believe it makes more sense to rely on the judgment of the LLM to determine whether the test passes. However, I've seen many examples where evaluations are based on scores, so I might be wrong.
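For example, the judge prompt could ask for the verdict directly rather than deriving pass/fail from a numeric threshold. A hypothetical prompt template, not the one in this PR:

```java
// Hypothetical prompt that asks the judge model for an explicit verdict.
// The {query}/{reference}/{response} placeholders are illustrative.
private static final String VERDICT_PROMPT_TEMPLATE = """
        You are an expert evaluation system for a question answering chatbot.
        Given the user query, the reference answer, and the generated answer,
        respond with exactly one word: PASS if the generated answer is correct
        and relevant to the query, or FAIL otherwise.

        Query: {query}
        Reference Answer: {reference}
        Generated Answer: {response}
        """;
```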

If you always want the explanation included in the response, I suggest being more explicit in the prompt. Currently, you are asking for:

> Output a single score that represents a holistic evaluation.

I believe this could lead the LLM to omit the explanation. When I was working on this, I realized that I had to be extremely explicit about what I wanted.
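For example, spelling out the expected output format line by line leaves less room for the model to drop the explanation. A hypothetical wording, not a drop-in replacement:

```java
// Hypothetical instruction that requests both the score and the explanation
// explicitly, in a layout that is easy to parse. Not the prompt from this PR.
private static final String OUTPUT_FORMAT_INSTRUCTION = """
        Output exactly two lines and nothing else:
        Line 1: a single score from 1 to 5, where 5 means the generated answer is fully correct.
        Line 2: a one-sentence explanation that justifies the score.
        """;
```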