Feature Request: Repeated Evaluations in Evaluation Report #2053

@ghost

Description

Hi,
We are using the built-in evals to measure LLM performance. Due to the inherent stochasticity of the process, we would like to evaluate the same dataset multiple times. This would allow one to more robustly gauge how, e.g., a prompt change shifts the distribution of scores.

At the moment the workaround is to manually collect the EvaluationReports from different runs, group the individual ReportCases, and aggregate them manually into a new report (a rough sketch of this is shown below). While doable, this feels somewhat cumbersome, and at the same time like something others would benefit from as well!
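
For reference, here is a minimal sketch of that workaround. It assumes pydantic_evals' `Dataset.evaluate_sync()`, that `EvaluationReport` can be constructed from a name and a list of `ReportCase`s exposed via `.cases`, and that each case's `scores` values carry a numeric `.value`; the helper names `evaluate_repeated` and `score_distributions` and the `n_runs` parameter are made up for illustration and may need adapting to the actual report shapes:

```python
# Sketch only: the EvaluationReport constructor arguments and the
# ReportCase.scores shape are assumptions, not confirmed API.
from collections import defaultdict
from statistics import mean, stdev

from pydantic_evals import Dataset
from pydantic_evals.reporting import EvaluationReport


def evaluate_repeated(dataset: Dataset, task, n_runs: int = 5) -> EvaluationReport:
    """Evaluate the same dataset n_runs times and merge all cases into one report."""
    reports = [dataset.evaluate_sync(task) for _ in range(n_runs)]
    merged_cases = [case for report in reports for case in report.cases]
    return EvaluationReport(name=f'{reports[0].name} x{n_runs}', cases=merged_cases)


def score_distributions(report: EvaluationReport) -> dict[str, tuple[float, float]]:
    """Collect (mean, stdev) per score name across all merged cases."""
    values: dict[str, list[float]] = defaultdict(list)
    for case in report.cases:
        for score_name, score in case.scores.items():
            values[score_name].append(score.value)
    return {name: (mean(vs), stdev(vs)) for name, vs in values.items() if len(vs) > 1}
```

Note that with a naive merge like this, the per-run grouping is lost and case names repeat across runs; one would probably want to suffix each case name with its run index, which is exactly the kind of bookkeeping a built-in repeated-evaluation feature could handle.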

References

No response
