This PR adds a new evaluator to judge correctness of a response. It is based loosely on LlamaIndex's correctness.py
evaluator.
Note that I couldn't decide the best way to provide the reference answer. I saw three choices:

- Add it directly to `EvaluationRequest`. This seemed like the worst choice, as it would be a property that is probably only used for this evaluator.
- Override the `evaluate()` method to take an `EvaluationRequest` as well as the reference answer. I started this way, but it felt clunky.
- Extend `EvaluationRequest` with `CorrectnessEvaluationRequest` and implement `evaluate()` to check for a `CorrectnessEvaluationRequest` and use its reference answer. This is the option I chose (see the sketch after this list).
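For concreteness, here is a minimal sketch of the shape of that third option, assuming `EvaluationRequest` carries the user text and the response as a `String`; the constructor and accessor names are hypothetical, not the PR's actual code:

```java
// Illustrative only: the superclass constructor and field names are assumptions.
public class CorrectnessEvaluationRequest extends EvaluationRequest {

	private final String referenceAnswer;

	public CorrectnessEvaluationRequest(String userText, String responseContent, String referenceAnswer) {
		super(userText, responseContent);
		this.referenceAnswer = referenceAnswer;
	}

	public String getReferenceAnswer() {
		return this.referenceAnswer;
	}

}
```

The evaluator then keeps the `evaluate(EvaluationRequest)` signature and checks `instanceof CorrectnessEvaluationRequest` at runtime, failing fast (or skipping the reference comparison) when no reference answer is available.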
Also note that this change is built upon the change in #967, such that it takes a `String` for the response in `EvaluationRequest`.
Comment From: habuma
This returns both a pass/fail boolean and an explanation. The pass/fail is determined by a score threshold: if the score falls below the threshold, the test fails. The explanation is provided in the `feedback` property of the `EvaluationResponse`.
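As a usage sketch (the request constructor and the `isPass()`/`getFeedback()` accessors on `EvaluationResponse` are assumed from the description above, not copied from the PR), a test could assert on the boolean and surface the explanation when it fails:

```java
import static org.junit.jupiter.api.Assertions.assertTrue;

// Hypothetical test usage: the evaluation fails when the judged score falls below
// the evaluator's threshold, and the model's explanation appears in the failure message.
EvaluationResponse evaluationResponse = correctnessEvaluator
	.evaluate(new CorrectnessEvaluationRequest(userText, responseContent, referenceAnswer));

assertTrue(evaluationResponse.isPass(), "Correctness check failed: " + evaluationResponse.getFeedback());
```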
Comment From: ilopezluna
> This returns both a pass/fail boolean and an explanation. The pass/fail is determined by a score threshold: if the score falls below the threshold, the test fails. The explanation is provided in the `feedback` property of the `EvaluationResponse`.
I believe it makes more sense to rely on the judgment of the LLM to determine if the test passes or not. However, I've seen many examples where evaluations are based on scores, so I might be wrong.
If you always want the explanation included in the response, I suggest being more explicit in the prompt. Currently, you are asking for:
> Output a single score that represents a holistic evaluation.
I believe this could lead the LLM to omit the explanation. When I was working on this, I realized that I had to be extremely explicit about what I wanted.
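For example, the instruction could ask for both pieces explicitly. This is only a suggested rewording, not the prompt text from the PR, and the constant name is made up:

```java
// A possible, more explicit instruction; the exact wording is only a suggestion.
private static final String EVALUATION_INSTRUCTION = """
		Compare the generated answer against the user query and the reference answer,
		then respond with exactly two lines:
		Score: a single number from 1 to 5, where 1 means incorrect and 5 means fully correct and relevant.
		Explanation: a brief justification for the score, noting any differences from the reference answer.
		""";
```

Asking for a fixed, two-line format also makes the response easier to parse into the score and the `feedback` property.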