Hi,
We are using the built-in evals to measure LLM performance. Because of the inherent stochasticity of the process, we would like to evaluate the same dataset multiple times. This would allow one to gauge more robustly how, e.g., a change of prompt shifts the distribution of scores.
At the moment the workaround would be to manually collect the EvaluationReports from several runs, group the individual ReportCases, and aggregate them into a new report (roughly as in the sketch below). While doable, this feels somewhat cumbersome, and at the same time like something others would benefit from as well!
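For reference, here is a minimal sketch of what the workaround looks like on our side. It assumes the pydantic_evals layout where `Dataset.evaluate_sync()` returns an `EvaluationReport` exposing its `ReportCase`s via `.cases`, and that an `EvaluationReport` can be constructed directly from a name plus a list of cases; the dataset, task and `N_RUNS` below are placeholders:

```python
from pydantic_evals import Case, Dataset
from pydantic_evals.reporting import EvaluationReport

# Placeholder dataset; in practice this is our real eval dataset.
dataset = Dataset(
    cases=[Case(name="example", inputs="What is 2 + 2?", expected_output="4")],
)

async def answer(question: str) -> str:
    # Stand-in for the LLM call under evaluation.
    return "4"

N_RUNS = 5

# Evaluate the same dataset several times to sample the outcome distribution.
reports = [dataset.evaluate_sync(answer) for _ in range(N_RUNS)]

# Group the individual ReportCases from all runs and aggregate them into a
# single combined report (assumes EvaluationReport accepts name + cases).
combined = EvaluationReport(
    name="combined-runs",
    cases=[case for report in reports for case in report.cases],
)
combined.print()
```

A built-in way to run a dataset N times and get back a combined (or per-run) report would remove the need for this glue code.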
References
No response