Add exception handling for pdlp in concurrent mode#966
Add exception handling for pdlp in concurrent mode#966Iroy30 wants to merge 17 commits intoNVIDIA:release/26.04from
Conversation
|
Note Reviews pausedIt looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the Use the following commands to manage reviews:
Use the checkboxes below for quick actions:
📝 WalkthroughWalkthroughAdds exception-capture and propagation scaffolding to the concurrent PDLP runner in Changes
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~25 minutes 🚥 Pre-merge checks | ✅ 2 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (2 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
📝 Coding Plan
Comment |
There was a problem hiding this comment.
Actionable comments posted: 1
🧹 Nitpick comments (1)
cpp/src/pdlp/solve.cu (1)
1190-1192: TODO lacks issue tracking; commented-out code should be cleaned up.The TODO indicates a known intermittent issue but provides no way to track resolution. Additionally, commented-out code with
printfstatements should not remain in production code.Recommendations:
- Create a tracked issue and reference it in the comment (e.g.,
// TODO(#123): ...)- Either remove the commented-out rethrow entirely, or conditionally enable it behind a debug flag if needed for investigation
♻️ Suggested cleanup
- // TODO: Active Issue: PDLP throws an Exception interminttently. - // if (pdlp_exception) { printf("Rethrowing PDLP exception from concurrent mode\n"); - // std::rethrow_exception(pdlp_exception); } + // TODO(`#ISSUE_NUMBER`): PDLP throws exceptions intermittently in concurrent mode. + // Once root cause is fixed, rethrow: std::rethrow_exception(pdlp_exception);🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/pdlp/solve.cu` around lines 1190 - 1192, Remove the dead commented-out rethrow and replace the vague TODO with a tracked issue reference and optional debug guard: create a repository issue for the intermittent PDLP exception and update the comment to include the issue ID (e.g., "// TODO(#<issue>): Intermittent PDLP exception observed when running concurrent mode"), and either delete the commented block that references pdlp_exception/printf or move that logic behind a compile-time/debug flag (e.g., PDLP_DEBUG_RETHROW) so the rethrow of pdlp_exception can be enabled only for debugging; locate the commented code around pdlp_exception in solve.cu and apply the change accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1175-1183: The catch block around run_pdlp currently stores the
exception in pdlp_exception but discards it, losing diagnostics; update the
catch in solve.cu to extract and surface the exception message (from
pdlp_exception or std::current_exception()) — e.g., convert to a string via
std::rethrow_exception / try-catch and log it with the existing logging
facilities and/or set a more informative termination status on sol_pdlp (instead
of only pdlp_termination_status_t::NumericalError) before setting
*settings_pdlp.concurrent_halt = 1; ensure run_pdlp, pdlp_exception, sol_pdlp,
and settings_pdlp.concurrent_halt are referenced so callers receive meaningful
error info.
---
Nitpick comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1190-1192: Remove the dead commented-out rethrow and replace the
vague TODO with a tracked issue reference and optional debug guard: create a
repository issue for the intermittent PDLP exception and update the comment to
include the issue ID (e.g., "// TODO(#<issue>): Intermittent PDLP exception
observed when running concurrent mode"), and either delete the commented block
that references pdlp_exception/printf or move that logic behind a
compile-time/debug flag (e.g., PDLP_DEBUG_RETHROW) so the rethrow of
pdlp_exception can be enabled only for debugging; locate the commented code
around pdlp_exception in solve.cu and apply the change accordingly.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 36088b3c-2460-43f0-af07-139729eea9b0
📒 Files selected for processing (1)
cpp/src/pdlp/solve.cu
|
/ok to test 439bf38 |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1175-1190: The code captures exceptions into pdlp_exception in the
run_pdlp try/catch but never rethrows it after thread cleanup, causing swallowed
errors and returning sol_pdlp with pdlp_termination_status_t::NumericalError;
fix by rethrowing pdlp_exception after the concurrent threads are joined (i.e.,
after the thread-join/cleanup section that follows this block) so the caller
observes the original exception, while keeping the existing
settings_pdlp.concurrent_halt update and logging in the catch; alternatively, if
swallowing is intentional remove pdlp_exception and add a clarifying
comment—refer to pdlp_exception, run_pdlp, settings_pdlp.concurrent_halt, and
sol_pdlp when making the change.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 7636b542-cc14-47d7-88fd-6753bae7ce52
📒 Files selected for processing (1)
cpp/src/pdlp/solve.cu
| std::exception_ptr pdlp_exception; | ||
| optimization_problem_solution_t<i_t, f_t> sol_pdlp{pdlp_termination_status_t::NumericalError, | ||
| problem.handle_ptr->get_stream()}; | ||
| try { | ||
| sol_pdlp = run_pdlp(problem, settings_pdlp, timer, is_batch_mode); | ||
| } catch (...) { | ||
| pdlp_exception = std::current_exception(); | ||
| *settings_pdlp.concurrent_halt = 1; | ||
| try { | ||
| std::rethrow_exception(pdlp_exception); | ||
| } catch (const std::exception& e) { | ||
| CUOPT_LOG_ERROR("PDLP exception in concurrent mode: %s", e.what()); | ||
| } catch (...) { | ||
| CUOPT_LOG_ERROR("PDLP unknown exception in concurrent mode"); | ||
| } | ||
| } |
There was a problem hiding this comment.
pdlp_exception is captured but never rethrown after thread cleanup.
The exception is captured in pdlp_exception and logged, but the variable is never used after line 1190. After threads join (lines 1193-1195), the function continues and returns sol_pdlp with NumericalError status without rethrowing.
If the intent (per PR description: "rethrowing exceptions") is to propagate the exception after ensuring threads are cleaned up, add a rethrow after the joins:
🛡️ Proposed fix to rethrow after thread cleanup
barrier_thread.join();
+
+ // Rethrow captured exception after threads are safely joined
+ if (pdlp_exception) {
+ std::rethrow_exception(pdlp_exception);
+ }
// copy the dual simplex solution to the deviceIf the current behavior (graceful degradation returning NumericalError status) is intentional, consider removing the unused pdlp_exception variable and directly logging within the catch block, or add a comment clarifying the exception is intentionally swallowed.
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.
In `@cpp/src/pdlp/solve.cu` around lines 1175 - 1190, The code captures exceptions
into pdlp_exception in the run_pdlp try/catch but never rethrows it after thread
cleanup, causing swallowed errors and returning sol_pdlp with
pdlp_termination_status_t::NumericalError; fix by rethrowing pdlp_exception
after the concurrent threads are joined (i.e., after the thread-join/cleanup
section that follows this block) so the caller observes the original exception,
while keeping the existing settings_pdlp.concurrent_halt update and logging in
the catch; alternatively, if swallowing is intentional remove pdlp_exception and
add a clarifying comment—refer to pdlp_exception, run_pdlp,
settings_pdlp.concurrent_halt, and sol_pdlp when making the change.
chris-maes
left a comment
There was a problem hiding this comment.
LGTM. Thanks @Iroy30 . Great job tracking this bug down and fixing it.
|
/ok to test 2903db8 |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1181-1193: Remove the temporary std::cout debugging statements
around the PDLP exception handling block: delete the prints that contain
"DEBUGGING: CAUGHT PDLP EXCEPTION", "DEBUGGING: CAUGHT PDLP EXCEPTION RETHROW
1", "DEBUGGING: CAUGHT PDLP EXCEPTION RETHROW 2", and "DEBUGGING:AFTER TRY CATCH
BLOCK"; preserve the logic that assigns pdlp_exception, sets
*settings_pdlp.concurrent_halt = 1, rethrows and logs the exception via
CUOPT_LOG_ERROR("PDLP exception in concurrent mode: %s", e.what()), and keep the
catch(...) branch unchanged. Ensure no other stdout usage is introduced and that
logging remains through CUOPT_LOG_ERROR only.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: 96d7626b-23c9-4a77-9d44-ac31295d9ae0
📒 Files selected for processing (4)
.github/workflows/build.yaml.github/workflows/pr.yaml.github/workflows/test.yamlcpp/src/pdlp/solve.cu
💤 Files with no reviewable changes (2)
- .github/workflows/build.yaml
- .github/workflows/test.yaml
|
/ok to test 5c8a52e |
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Inline comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1180-1191: In the top-level catch(...) block that captures
pdlp_exception, signal concurrent halt so the other threads stop: after setting
pdlp_exception = std::current_exception(), set *settings_pdlp.concurrent_halt =
1 (or use the same atomic/store pattern used elsewhere) so barrier_thread and
dual_simplex_thread will see the halt flag and exit; keep the existing
rethrow/logging logic (the pdlp_exception handling and CUOPT_LOG_ERROR) but
ensure the concurrent_halt assignment occurs immediately after capturing the
exception.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: e811ddce-ab5d-4d4d-9e16-803372f3eaa0
📒 Files selected for processing (1)
cpp/src/pdlp/solve.cu
There was a problem hiding this comment.
Caution
Some comments are outside the diff and can’t be posted inline due to platform limitations.
⚠️ Outside diff range comments (1)
cpp/src/pdlp/solve.cu (1)
1175-1198:⚠️ Potential issue | 🟠 Major
pdlp_exceptionis captured but never rethrown after thread cleanup.The exception is captured at line 1182 and the PR description states the intent is to "rethrow the exception," but after threads are joined (lines 1196-1198),
pdlp_exceptionis never checked or rethrown. The function continues and returnssol_pdlpwithNumericalErrorstatus, effectively swallowing the exception.If the intent is to propagate the exception to callers (as stated in the PR objectives), add a rethrow after the joins. If graceful degradation (returning
NumericalError) is intentional, remove the unusedpdlp_exceptionvariable and document that behavior. As per coding guidelines, "Verify error propagation from CUDA to user-facing APIs is complete."🛡️ Proposed fix to rethrow after thread cleanup
barrier_thread.join(); + + // Rethrow captured exception after threads are safely joined + if (pdlp_exception) { + std::rethrow_exception(pdlp_exception); + } // copy the dual simplex solution to the device🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/pdlp/solve.cu` around lines 1175 - 1198, The captured exception pdlp_exception from the run_pdlp call is never propagated after joining threads (dual_simplex_thread, barrier_thread), so either rethrow it after cleanup or remove it if swallowing is intentional; to fix, if you want propagation: after joining dual_simplex_thread and barrier_thread check if pdlp_exception is non-null and call std::rethrow_exception(pdlp_exception) so the original exception propagates to callers (keep settings_pdlp.concurrent_halt logic as-is), otherwise remove pdlp_exception and the related catch/rethrow blocks and document returning sol_pdlp with pdlp_termination_status_t::NumericalError as the intended behavior.
♻️ Duplicate comments (1)
cpp/src/pdlp/solve.cu (1)
1181-1193:⚠️ Potential issue | 🟠 MajorRemove debug
std::coutstatements before merging.Lines 1181, 1187, 1190, and 1193 contain debugging artifacts with "DEBUGGING:" prefixes. These statements pollute stdout and are inconsistent with the codebase's logging conventions (which use
CUOPT_LOG_*macros).Additionally, the
catch(...)block at lines 1189-1191 silently swallows unknown exceptions without logging. Consider adding a log statement for unknown exception types.🐛 Proposed fix to remove debug statements and add logging
} catch (...) { - std::cout << "\n DEBUGGING: CAUGHT PDLP EXCEPTION \n" << std::endl; pdlp_exception = std::current_exception(); *settings_pdlp.concurrent_halt = 1; try { std::rethrow_exception(pdlp_exception); } catch (const std::exception& e) { - std::cout << "\n DEBUGGING: CAUGHT PDLP EXCEPTION RETHROW 1\n" << std::endl; CUOPT_LOG_ERROR("PDLP exception in concurrent mode: %s", e.what()); } catch (...) { - std::cout << "\n DEBUGGING: CAUGHT PDLP EXCEPTION RETHROW 2\n" << std::endl; + CUOPT_LOG_ERROR("PDLP unknown exception in concurrent mode"); } } - std::cout << "\n DEBUGGING:AFTER TRY CATCH BLOCK \n" << std::endl;🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed. In `@cpp/src/pdlp/solve.cu` around lines 1181 - 1193, Remove the stdout debug prints around the PDLP exception handling (the std::cout lines that print "DEBUGGING: ...") and instead rely on the existing logging macros; keep setting *settings_pdlp.concurrent_halt = 1 and storing pdlp_exception = std::current_exception(). In the rethrow block, keep the catch(const std::exception& e) and its CUOPT_LOG_ERROR("PDLP exception in concurrent mode: %s", e.what()) call, and modify the catch(...) handler to log an unknown exception via CUOPT_LOG_ERROR (e.g., "PDLP unknown exception in concurrent mode") instead of silently swallowing it. Ensure no std::cout remains in the function around pdlp_exception, the rethrow, or after the try/catch.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.
Outside diff comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1175-1198: The captured exception pdlp_exception from the run_pdlp
call is never propagated after joining threads (dual_simplex_thread,
barrier_thread), so either rethrow it after cleanup or remove it if swallowing
is intentional; to fix, if you want propagation: after joining
dual_simplex_thread and barrier_thread check if pdlp_exception is non-null and
call std::rethrow_exception(pdlp_exception) so the original exception propagates
to callers (keep settings_pdlp.concurrent_halt logic as-is), otherwise remove
pdlp_exception and the related catch/rethrow blocks and document returning
sol_pdlp with pdlp_termination_status_t::NumericalError as the intended
behavior.
---
Duplicate comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1181-1193: Remove the stdout debug prints around the PDLP
exception handling (the std::cout lines that print "DEBUGGING: ...") and instead
rely on the existing logging macros; keep setting *settings_pdlp.concurrent_halt
= 1 and storing pdlp_exception = std::current_exception(). In the rethrow block,
keep the catch(const std::exception& e) and its CUOPT_LOG_ERROR("PDLP exception
in concurrent mode: %s", e.what()) call, and modify the catch(...) handler to
log an unknown exception via CUOPT_LOG_ERROR (e.g., "PDLP unknown exception in
concurrent mode") instead of silently swallowing it. Ensure no std::cout remains
in the function around pdlp_exception, the rethrow, or after the try/catch.
ℹ️ Review info
⚙️ Run configuration
Configuration used: Path: .coderabbit.yaml
Review profile: CHILL
Plan: Pro
Run ID: ed5addf2-b91e-427d-9808-11730620eac2
📒 Files selected for processing (1)
cpp/src/pdlp/solve.cu
jameslamb
left a comment
There was a problem hiding this comment.
Removing all conda-related CI is a pretty big change and doesn't at all match the title or description. Are those things just temporarily removed to make PR CI lighter while you're testing?
If so, you can revert the build.yaml and test.yaml changes... they are not used in PRs.
Yes we are debugging some CI fails, this is not to be merged as is. Will be reverted. |
|
/ok to test d71bce2 |
Description
Fixes fails when pdlp throws exception in concurrent mode. Still needs evaluation on why pdlp fails.
Issue
Checklist