@dgageot (Member) commented Jan 7, 2026

  • Run evaluations in isolated Docker containers with Docker-in-Docker
  • Support concurrent evaluation runs with configurable parallelism
  • Generate memorable run names (e.g., swift-falcon-042)
  • Save results to JSON file and debug log in <evals-dir>/results/
  • Check response size, tool calls, handoffs, and relevance criteria
  • Use judge model (configurable) for relevance checking
  • Show progress bar with colored output and per-evaluation results
  • Exclude errored evaluations from summary totals
  • Add comprehensive debug logging for troubleshooting

Assisted-By: cagent

@dgageot requested a review from a team as a code owner on January 7, 2026, 22:16
@dgageot (Member, Author) left a comment

Let's improve the code!

)

// GenerateRunName creates a memorable name for an evaluation run.
func GenerateRunName() string {

Extract this to its own file
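
For illustration only, a standalone file for this helper might look like the sketch below. The file name, the word lists, and the adjective-noun-number format are assumptions inferred from the `swift-falcon-042` example in the description, not the code under review; `package main` is used only so the sketch runs on its own.

```go
// run_name.go (hypothetical file name)
package main

import (
	"fmt"
	"math/rand"
)

// Placeholder word lists; the real lists are not shown in this conversation.
var (
	adjectives = []string{"swift", "calm", "bold", "bright"}
	nouns      = []string{"falcon", "otter", "maple", "comet"}
)

// GenerateRunName creates a memorable name for an evaluation run,
// shaped like "swift-falcon-042": adjective, noun, zero-padded number.
func GenerateRunName() string {
	adj := adjectives[rand.Intn(len(adjectives))]
	noun := nouns[rand.Intn(len(nouns))]
	return fmt.Sprintf("%s-%s-%03d", adj, noun, rand.Intn(1000))
}

func main() {
	fmt.Println(GenerateRunName())
}
```

Keeping the generator and its word lists in one small file makes it easy to unit test the name format without touching the evaluation runner.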

@dgageot merged commit cc75e80 into docker:main on Jan 8, 2026
5 checks passed