fix: exclude failed queries from aggregate score in FaithfulnessEvaluator and ContextRelevanceEvaluator #11385
|
Someone is attempting to deploy a commit to the deepset Team on Vercel. A member of the Team first needs to authorize it. |
|
NIK-TIGER-BILL does not seem to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account. Have you already signed the CLA but the status is still pending? Let us recheck it. |
|
Hi @NIK-TIGER-BILL!

```bash
git config user.email "new.email@example.com"
git commit --amend --author="Your Name <new.email@example.com>" --no-edit
git push --force-with-lease
```
|
Hi @bogdankostic, thank you for the heads-up! I'll amend the commits to use the email address associated with this GitHub account and force-push the updated branch. That should resolve the CLA check. I'll ping you once it's done. |
Force-pushed: aba0041 to a0c8d17
|
@bogdankostic Done — I amended the commit to use the verified email associated with this account and force-pushed the updated branch. The CLA check should now pass. Thanks for the guidance! |
FaithfulnessEvaluator and ContextRelevanceEvaluator previously included NaN scores from failed LLM calls when computing the aggregate mean, causing the overall score to silently become NaN. Now failed queries are excluded and a warning is logged.

Fixes deepset-ai#11383

Signed-off-by: NIK-TIGER-BILL <nik.tiger.bill@github.com>
Force-pushed: a0c8d17 to a368154
|
@bogdankostic Fixed — amended the commit author to use the verified email associated with this GitHub account and force-pushed the branch. The CLA check should now be fully resolved. Thanks again for the guidance! |
|
@NIK-TIGER-BILL Can you please make sure that the CI checks pass, for example the linter? You can find more details in our contributing guidelines. |
|
@bogdankostic Thanks for the follow-up! I checked the linter output on the changed files. The |
Related Issues
Proposed Changes:
When `FaithfulnessEvaluator` or `ContextRelevanceEvaluator` runs with `raise_on_failure=False` and an LLM call fails, the per-query score becomes `NaN`. Previously, these `NaN` values were included in the aggregate `mean`, causing the overall score to silently become `NaN` and giving the user no indication that some queries were skipped.

Changes:
- Exclude `NaN` scores before computing the aggregate mean.
- Log a `WARNING` telling the user how many queries were excluded.

How did you test it?
Updated `test_run_returns_nan_raise_on_failure_false` in both `test_faithfulness_evaluator.py` and `test_context_relevance_evaluator.py` to verify that the aggregate score is computed from valid scores only and that the warning message is emitted.

Notes for the reviewer
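For orientation, the aggregation behaviour described under Proposed Changes can be sketched roughly as below. This is a minimal standalone illustration, not the code from the diff: the helper name `aggregate_score`, the wording of the warning, and the choice to return `NaN` when every query failed are all assumptions made for the example.

```python
import logging
import math

logger = logging.getLogger(__name__)


def aggregate_score(scores: list[float]) -> float:
    """Hypothetical helper: mean of per-query scores, skipping NaN entries."""
    # Failed queries are represented as NaN; keep only the valid scores.
    valid = [s for s in scores if not math.isnan(s)]
    skipped = len(scores) - len(valid)
    if skipped:
        # Illustrative message; the actual warning text in the PR may differ.
        logger.warning(
            "%d out of %d queries failed and were excluded from the aggregate score.",
            skipped,
            len(scores),
        )
    # Assumption: with no valid scores there is nothing to average, so return NaN.
    return sum(valid) / len(valid) if valid else float("nan")
```

With this sketch, `aggregate_score([1.0, float("nan"), 0.5])` returns `0.75` and logs one warning, whereas a plain mean over the same list would be `NaN`.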
Checklist
`fix:`, `feat:`, `build:`, `chore:`, `ci:`, `docs:`, `style:`, `refactor:`, `perf:`, `test:` and added `!` in case the PR includes breaking changes.