[Enhancement] Stop looping when runtime is stable (CF-934) #967
base: main
Conversation
PR Reviewer Guide 🔍

Here are some key observations to aid the review process:
PR Code Suggestions ✨

Explore these optional code suggestions:
```python
        return 0.2
    if avg < 0.1:  # < 100 ms
        return 0.1
    return 0.03  # > 0.1 s
```
@mohammedahmed18 how did you set these numbers?
@aseembits93
These are just experimental numbers; the whole idea is that very small runtimes should have a high tolerance value because the noise is relatively large.
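Putting the experimental numbers together, a minimal sketch of such a tolerance function follows, based on the thresholds visible in the suggestion diff above. The cutoff for the `0.2` branch is not shown in the excerpt, so the 10 ms value here is an assumption for illustration:

```python
def dynamic_tolerance(avg: float) -> float:
    """Return a relative tolerance scaled to the average runtime in seconds.

    Thresholds are the experimental values from the PR diff; the cutoff
    for the 0.2 branch is assumed (not visible in the excerpt).
    """
    if avg < 0.01:  # < 10 ms: noise dominates, so be very tolerant (assumed cutoff)
        return 0.2
    if avg < 0.1:   # < 100 ms
        return 0.1
    return 0.03     # >= 100 ms: runtimes are large enough to demand a tighter spread
```

The design intent matches the author's comment: the smaller the measured runtime, the larger the relative share of timer noise, so the acceptable spread widens.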
The optimization achieves a **437% speedup** by eliminating expensive function calls and using more efficient algorithms for median calculation and min/max operations.

**Key optimizations applied:**

1. **Custom median calculation**: Replaced `statistics.median(recent)` with a custom implementation using `sorted(recent)` and direct indexing. This eliminates the overhead of the statistics module's generic median function.
2. **Reused sorted array**: The sorted array from the median calculation is reused for min/max operations (`recent_sorted[0]`, `recent_sorted[-1]`) instead of calling `min(recent)` and `max(recent)` separately, eliminating redundant iterations.
3. **Replaced `statistics.mean()` calls**: Substituted `mean(recent[:half])` and `mean(recent[half:])` with direct `sum()/length` calculations, removing function call overhead.
4. **Early termination**: Changed the `all()` generator expression to an explicit loop with an early `break`, avoiding unnecessary iterations once the first non-conforming value is found.

**Performance impact analysis:**

The line profiler shows dramatic improvements in the most expensive operations:
- `first = mean(recent[:half])`: 2.07ms → 47.7μs (98% reduction)
- `second = mean(recent[half:])`: 1.54ms → 35.4μs (98% reduction)
- `m = median(recent)`: 220μs → eliminated, replaced with a ~55μs custom implementation

**Hot path significance:** Based on the function reference, `should_stop()` is called in the main test loop (`pytest_runtestloop`) after every iteration to determine whether benchmarking should continue. This makes it a critical hot-path function where the 437% speedup directly translates to faster benchmark completion times.

**Test case performance:** The optimization performs well across all test scenarios, with speedups ranging from 400-500% for typical cases involving stable data, fluctuations, and large datasets. Even edge cases with insufficient data show 10-25% improvements.
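The four optimizations above can be sketched together as follows. This is an illustrative reconstruction, not the plugin's actual code: the function name, signature, and the exact order of checks are assumptions.

```python
def is_stable(recent: list[float], tol: float) -> bool:
    """Illustrative stability check mirroring the optimizations described above."""
    n = len(recent)
    if n < 2:
        return False
    # One sort serves the median, min, and max (no statistics.median / min / max calls).
    recent_sorted = sorted(recent)
    if n % 2:
        m = recent_sorted[n // 2]
    else:
        m = (recent_sorted[n // 2 - 1] + recent_sorted[n // 2]) / 2
    if m == 0:
        return False
    # Reuse the sorted array for the spread check instead of min()/max() passes.
    if (recent_sorted[-1] - recent_sorted[0]) / m > tol:
        return False
    # Direct sum()/length instead of statistics.mean() for the half-window drift check.
    half = n // 2
    first = sum(recent[:half]) / half
    second = sum(recent[half:]) / (n - half)
    if abs(second - first) / m > tol:
        return False
    # Explicit loop with early exit instead of an all(...) generator expression.
    for r in recent:
        if abs(r - m) / m > tol:
            return False
    return True
```

The single `sorted()` call is the key trade: one O(n log n) pass replaces three separate O(n) traversals plus the statistics module's call overhead, which dominates at the small window sizes used here.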
⚡️ Codeflash found optimizations for this PR

📄 438% (4.38x) speedup for `should_stop`
Code Review for PR #967: Stop looping when runtime is stable

Thanks for this enhancement! Here is my detailed review:

✅ Strengths
🔴 Critical Issues - MUST FIX

1. Type Hint Error (best_summed_runtime.py:1)
2-4. Division by Zero Vulnerabilities (pytest_plugin.py)
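For the division-by-zero concern, a hedged sketch of one possible guard follows. The helper name and signature are hypothetical; the actual fix would live in the relevant checks in pytest_plugin.py:

```python
def within_tolerance(runtimes: list[float], m: float, tol: float) -> bool:
    """Relative-spread check that avoids ZeroDivisionError when the median is 0."""
    if m == 0:
        # A zero median makes relative deviation undefined; report unstable
        # and keep looping rather than crash.
        return False
    return all(abs(r - m) / m <= tol for r in runtimes)
```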
⚡️ Speed up function `should_stop` by 438% in PR #967 (`exp/consistent-loop-break`)
This PR is now faster! 🚀 @mohammedahmed18 accepted my optimizations from:
Code Review: Stop looping when runtime is stable (CF-934)

Thank you for this enhancement! The dynamic loop termination based on runtime stability is a smart optimization. Below is my detailed review:

✅ Strengths
🐛 Critical Issues

1. Type annotation bug in
PR Review: Stop Looping When Runtime is Stable (CF-934)

Summary

This PR introduces intelligent loop termination based on runtime stability, replacing fixed iteration counts with dynamic detection. The implementation tracks test durations and stops when measurements stabilize, improving efficiency while maintaining accuracy.

🟢 Strengths
🟡 Issues & Recommendations

Critical: Type Annotation Bug

Location:

```python
def calculate_best_summed_runtime(grouped_runtime_info: dict[any, list[int]]) -> int:
```

Issue: `any` is the builtin function, not a type; the annotation needs `Any` from `typing`.

Fix:

```python
from typing import Any

def calculate_best_summed_runtime(grouped_runtime_info: dict[Any, list[int]]) -> int:
    return sum([min(usable_runtime_data) for _, usable_runtime_data in grouped_runtime_info.items()])
```

High Priority: Edge Case Handling

1. Division by Zero in
@misrasaurabh1 @mohammedahmed18 claude reviews are way too frequent, let's do something about it
```python
N_CANDIDATES_LP = 6

# pytest loop stability
STABILITY_WARMUP_LOOPS = 4
```
how did you determine these magic numbers @mohammedahmed18 ?
Code Review: Loop Stability Detection

I've reviewed PR #967, which implements runtime stability detection to optimize test loop termination. Here's my comprehensive feedback:

Overview

This PR introduces a smart early-exit mechanism for pytest benchmarking loops by detecting when runtime measurements stabilize, potentially saving significant execution time while maintaining accuracy.

Critical Issues

1. Hardcoded Development Path (HIGH PRIORITY)

Location: Hardcoded user-specific path will fail in CI/CD and other environments. Use a configurable path via environment variable or config, fall back to a temp directory or disable logging if the path is unavailable, and add error handling for file write failures.

2. Division by Zero Risk

Location: If median

3. Type Annotation Issue

Location:
4. ValueError Risk

Location:
Code Quality & Best Practices

5. Commented-out Break Statement

Location: The loop continues running even after stability is detected. Is this intentional for data collection? If so, add a comment explaining why.

6. Magic Numbers in Stability Check

Location: The threshold

7. Unnecessary List Comprehension

Location: Drop the list brackets in favor of a generator expression to avoid creating an intermediate list:

```python
return sum(min(usable_runtime_data) for usable_runtime_data in grouped_runtime_info.values())
```

8. Inconsistent JSON Format

Location: Use

Performance Considerations

9. Sorting Overhead

The

10. Redundant Calculation
Security Concerns

11. Path Traversal Risk

The hardcoded path writes to a user directory without validation. If this becomes configurable, ensure proper path sanitization.

Test Coverage

12. Missing Tests

Critical Gap: No unit tests found for:
Minor Issues

13. Typo
14. Import Organization

Consider grouping imports more clearly (stdlib, third-party, local).

Positive Aspects
Summary

Must Fix Before Merge:
Should Fix:
Consider:
Overall, this is a valuable optimization with a solid algorithmic foundation. The main concerns are production-readiness issues (hardcoded paths, error handling, tests) that need to be addressed before merging.
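To make the missing-tests concern concrete, here is a pytest-style sketch of what coverage for the stability check could look like. `should_stop` below is a simplified stand-in with an assumed `(durations, tolerance)` signature, not the plugin's actual method:

```python
def should_stop(durations, tolerance=0.1):
    """Simplified stand-in: stop when all durations sit within tolerance of the median."""
    if len(durations) < 4:  # assumed minimum window before stability can be judged
        return False
    m = sorted(durations)[len(durations) // 2]
    if m == 0:
        return False
    return all(abs(d - m) / m <= tolerance for d in durations)

def test_insufficient_data():
    # Too few samples: never report stability.
    assert should_stop([1.0, 1.0]) is False

def test_stable_runtimes():
    # Tight cluster around 1.0 s: stable.
    assert should_stop([1.0, 1.01, 0.99, 1.0]) is True

def test_unstable_runtimes():
    # Wide spread: keep looping.
    assert should_stop([1.0, 2.0, 0.5, 3.0]) is False
```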
Code Review: PR #967 - Stop looping when runtime is stable

Summary

This PR introduces an intelligent early-stopping mechanism for pytest performance-testing loops based on runtime stability detection. The implementation tracks per-test durations and uses statistical measures (median, spread) to determine when runtime has stabilized, potentially saving significant test execution time.

Code Quality & Best Practices

Strengths

✅ Well-structured implementation with clear separation of concerns

Issues & Recommendations

1. Type Annotation Issue -
PR Review: Stop looping when runtime is stable

Overview

This PR introduces an intelligent early-stopping mechanism for pytest loop benchmarking based on runtime stability detection. The implementation tracks test durations and stops looping when performance metrics stabilize, potentially saving significant benchmarking time.

Strengths

1. Well-Designed Algorithm

The stability detection algorithm in should_stop() is thoughtful:
2. Performance Optimized

The should_stop() function has already been optimized (PR 984) with a custom median calculation and reused sorted arrays.

3. Good Separation of Concerns
Issues and Concerns

1. Type Annotation Bug (codeflash/result/best_summed_runtime.py:1)

Issue: any should be Any (capitalized, from typing)
Fix: from typing import Any

2. Potential Division by Zero (pytest_plugin.py:320-321)

Issue: If m (median) is 0, this will raise ZeroDivisionError in the line: abs(r - m) / m
Recommendation: Add a zero check before division

3. Silent Failure Risk (pytest_plugin.py:348-349)

Issue: If get_runtime_from_stdout() returns None, the data is silently skipped
Recommendation: Add debug logging when runtime extraction fails

4. Magic String Dependency (pytest_plugin.py:275-276)

The markers appear hardcoded with no documentation. Consider making them module-level constants and documenting the expected stdout format.

5. Missing Edge Case Handling (pytest_plugin.py:401-404)

Issue: estimated_total_loops could be 0 even when elapsed > 0 if the calculated value rounds down
Impact: Would cause warmup_loops and window_size to be 0, preventing early stopping

6. Unused Variable (pytest_plugin.py:340)

self.total_loop_runtimes is populated but never read - consider removing it if truly unused

7. Missing Unit Tests

No dedicated tests found for should_stop(), get_runtime_from_stdout(), or calculate_best_summed_runtime()

Code Quality
Minor: Line 14 comment placement is confusing

Security Concerns

None identified - this is internal benchmarking logic

Performance Considerations

Positive: Early stopping will save significant time when stability is reached
Question: Have you tested this with varying total_time values to ensure the percentages work well across different scales?

Recommendations Summary

Critical:
Important:
Nice-to-have:
Approval Status

Conditional Approval - The implementation is solid overall, but the type annotation bug should be fixed before merge, and zero-division protection should be added for safety.

Questions for Author
Great work on this enhancement! The algorithm is thoughtful and the implementation is mostly clean. Looking forward to seeing this feature in action.
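Regarding the estimated_total_loops edge case flagged above (issue 5), one possible clamp is sketched below. The function name and signature are hypothetical; the real code lives inside the plugin:

```python
def estimate_total_loops(elapsed: float, per_loop: float) -> int:
    """Estimate the loop budget, clamped so downstream window sizes never hit 0."""
    if per_loop <= 0:
        return 1
    # max(1, ...) prevents warmup_loops / window_size from collapsing to zero
    # when the elapsed/per_loop ratio rounds down.
    return max(1, int(elapsed / per_loop))
```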
PR Type
Enhancement, Tests
Description
Add consistent-loop break based on durations
Track per-loop test durations via hook
Introduce dynamic tolerance by runtime scale
New config: CONSISTENT_LOOP_COUNT
Diagram Walkthrough
File Walkthrough
config_consts.py
Introduce loop consistency count constant
codeflash/code_utils/config_consts.py
`CONSISTENT_LOOP_COUNT` default to 3.

pytest_plugin.py
Duration-based consistent loop termination logic
codeflash/verification/pytest_plugin.py
`dynamic_tolerance` based on avg runtime.
`CONSISTENT_LOOP_COUNT`.

env_utils.py
Minor formatting cleanup
codeflash/code_utils/env_utils.py