Skip to content

Add exception handling for pdlp in concurrent mode#966

Open
Iroy30 wants to merge 17 commits intoNVIDIA:release/26.04from
Iroy30:add_exception_handling_pdlp
Open

Add exception handling for pdlp in concurrent mode#966
Iroy30 wants to merge 17 commits intoNVIDIA:release/26.04from
Iroy30:add_exception_handling_pdlp

Conversation

@Iroy30
Copy link
Member

@Iroy30 Iroy30 commented Mar 17, 2026

Description

Fixes fails when pdlp throws exception in concurrent mode. Still needs evaluation on why pdlp fails.

Issue

Checklist

  • I am familiar with the Contributing Guidelines.
  • Testing
    • New or existing tests cover these changes
    • Added tests
    • Created an issue to follow-up
    • NA
  • Documentation
    • The documentation is up to date with these changes
    • Added new documentation
    • NA

@Iroy30 Iroy30 requested a review from a team as a code owner March 17, 2026 20:41
@Iroy30 Iroy30 requested review from chris-maes and rg20 March 17, 2026 20:41
@copy-pr-bot
Copy link

copy-pr-bot bot commented Mar 17, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

@coderabbitai
Copy link

coderabbitai bot commented Mar 17, 2026

Note

Reviews paused

It looks like this branch is under active development. To avoid overwhelming you with review comments due to an influx of new commits, CodeRabbit has automatically paused this review. You can configure this behavior by changing the reviews.auto_review.auto_pause_after_reviewed_commits setting.

Use the following commands to manage reviews:

  • @coderabbitai resume to resume automatic reviews.
  • @coderabbitai review to trigger a single review.

Use the checkboxes below for quick actions:

  • ▶️ Resume reviews
  • 🔍 Trigger review
📝 Walkthrough

Walkthrough

Adds exception-capture and propagation scaffolding to the concurrent PDLP runner in cpp/src/pdlp/solve.cu (try/catch, std::exception_ptr, signaling settings_pdlp.concurrent_halt, join-before-exit semantics, minor barrier-thread setup change, added includes); removes multiple conda-related CI jobs and their dependencies/filters from GitHub Actions workflows (.github/workflows/*.yaml).

Changes

Cohort / File(s) Summary
Concurrent PDLP exception handling
cpp/src/pdlp/solve.cu
Wraps PDLP concurrent execution in try/catch, captures exceptions via std::exception_ptr, sets settings_pdlp.concurrent_halt on error, logs diagnostics, ensures spawned threads are joined and coordinates exception signaling with barrier/dual-simplex threads, adjusts barrier-thread scoped initialization, and adds #include <exception> (retains <thread>).
CI workflows — remove conda jobs
.github/workflows/build.yaml, .github/workflows/pr.yaml, .github/workflows/test.yaml
Removes conda-related jobs and outputs (e.g., cpp-build, python-build, upload-conda, conda-*-tests), strips related needs/with/outputs and conda matrix filters, and updates job dependency graphs so docs/tests no longer depend on the removed conda jobs.

Estimated code review effort

🎯 3 (Moderate) | ⏱️ ~25 minutes

🚥 Pre-merge checks | ✅ 2 | ❌ 1

❌ Failed checks (1 warning)

Check name Status Explanation Resolution
Docstring Coverage ⚠️ Warning Docstring coverage is 0.00% which is insufficient. The required threshold is 80.00%. Write docstrings for the functions missing them to satisfy the coverage threshold.
✅ Passed checks (2 passed)
Check name Status Explanation
Title check ✅ Passed The title clearly and specifically describes the main change: adding exception handling for PDLP in concurrent mode, which aligns with the primary purpose of the changeset.
Description check ✅ Passed The description is related to the changeset, explaining that the fix addresses failures when PDLP throws exceptions in concurrent mode, which matches the code changes.

✏️ Tip: You can configure your own custom pre-merge checks in the settings.

✨ Finishing Touches
🧪 Generate unit tests (beta)
  • Create PR with unit tests
📝 Coding Plan
  • Generate coding plan for human review comments

Comment @coderabbitai help to get the list of available commands and usage tips.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
cpp/src/pdlp/solve.cu (1)

1190-1192: TODO lacks issue tracking; commented-out code should be cleaned up.

The TODO indicates a known intermittent issue but provides no way to track resolution. Additionally, commented-out code with printf statements should not remain in production code.

Recommendations:

  1. Create a tracked issue and reference it in the comment (e.g., // TODO(#123): ...)
  2. Either remove the commented-out rethrow entirely, or conditionally enable it behind a debug flag if needed for investigation
♻️ Suggested cleanup
-  // TODO: Active Issue: PDLP throws an Exception interminttently.
-  // if (pdlp_exception) { printf("Rethrowing PDLP exception from concurrent mode\n");
-  // std::rethrow_exception(pdlp_exception); }
+  // TODO(`#ISSUE_NUMBER`): PDLP throws exceptions intermittently in concurrent mode.
+  // Once root cause is fixed, rethrow: std::rethrow_exception(pdlp_exception);
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/pdlp/solve.cu` around lines 1190 - 1192, Remove the dead
commented-out rethrow and replace the vague TODO with a tracked issue reference
and optional debug guard: create a repository issue for the intermittent PDLP
exception and update the comment to include the issue ID (e.g., "//
TODO(#<issue>): Intermittent PDLP exception observed when running concurrent
mode"), and either delete the commented block that references
pdlp_exception/printf or move that logic behind a compile-time/debug flag (e.g.,
PDLP_DEBUG_RETHROW) so the rethrow of pdlp_exception can be enabled only for
debugging; locate the commented code around pdlp_exception in solve.cu and apply
the change accordingly.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1175-1183: The catch block around run_pdlp currently stores the
exception in pdlp_exception but discards it, losing diagnostics; update the
catch in solve.cu to extract and surface the exception message (from
pdlp_exception or std::current_exception()) — e.g., convert to a string via
std::rethrow_exception / try-catch and log it with the existing logging
facilities and/or set a more informative termination status on sol_pdlp (instead
of only pdlp_termination_status_t::NumericalError) before setting
*settings_pdlp.concurrent_halt = 1; ensure run_pdlp, pdlp_exception, sol_pdlp,
and settings_pdlp.concurrent_halt are referenced so callers receive meaningful
error info.

---

Nitpick comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1190-1192: Remove the dead commented-out rethrow and replace the
vague TODO with a tracked issue reference and optional debug guard: create a
repository issue for the intermittent PDLP exception and update the comment to
include the issue ID (e.g., "// TODO(#<issue>): Intermittent PDLP exception
observed when running concurrent mode"), and either delete the commented block
that references pdlp_exception/printf or move that logic behind a
compile-time/debug flag (e.g., PDLP_DEBUG_RETHROW) so the rethrow of
pdlp_exception can be enabled only for debugging; locate the commented code
around pdlp_exception in solve.cu and apply the change accordingly.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 36088b3c-2460-43f0-af07-139729eea9b0

📥 Commits

Reviewing files that changed from the base of the PR and between 0daf1e1 and 3fcf5fb.

📒 Files selected for processing (1)
  • cpp/src/pdlp/solve.cu

@Iroy30 Iroy30 added bug Something isn't working non-breaking Introduces a non-breaking change pdlp labels Mar 17, 2026
@anandhkb anandhkb added this to the 26.04 milestone Mar 17, 2026
@Iroy30 Iroy30 changed the title Add exception handling pdlp Add exception handling for pdlp in concurrent mode Mar 17, 2026
@Iroy30
Copy link
Member Author

Iroy30 commented Mar 18, 2026

/ok to test 439bf38

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1175-1190: The code captures exceptions into pdlp_exception in the
run_pdlp try/catch but never rethrows it after thread cleanup, causing swallowed
errors and returning sol_pdlp with pdlp_termination_status_t::NumericalError;
fix by rethrowing pdlp_exception after the concurrent threads are joined (i.e.,
after the thread-join/cleanup section that follows this block) so the caller
observes the original exception, while keeping the existing
settings_pdlp.concurrent_halt update and logging in the catch; alternatively, if
swallowing is intentional remove pdlp_exception and add a clarifying
comment—refer to pdlp_exception, run_pdlp, settings_pdlp.concurrent_halt, and
sol_pdlp when making the change.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 7636b542-cc14-47d7-88fd-6753bae7ce52

📥 Commits

Reviewing files that changed from the base of the PR and between 3fcf5fb and 439bf38.

📒 Files selected for processing (1)
  • cpp/src/pdlp/solve.cu

Comment on lines +1175 to +1190
std::exception_ptr pdlp_exception;
optimization_problem_solution_t<i_t, f_t> sol_pdlp{pdlp_termination_status_t::NumericalError,
problem.handle_ptr->get_stream()};
try {
sol_pdlp = run_pdlp(problem, settings_pdlp, timer, is_batch_mode);
} catch (...) {
pdlp_exception = std::current_exception();
*settings_pdlp.concurrent_halt = 1;
try {
std::rethrow_exception(pdlp_exception);
} catch (const std::exception& e) {
CUOPT_LOG_ERROR("PDLP exception in concurrent mode: %s", e.what());
} catch (...) {
CUOPT_LOG_ERROR("PDLP unknown exception in concurrent mode");
}
}
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

pdlp_exception is captured but never rethrown after thread cleanup.

The exception is captured in pdlp_exception and logged, but the variable is never used after line 1190. After threads join (lines 1193-1195), the function continues and returns sol_pdlp with NumericalError status without rethrowing.

If the intent (per PR description: "rethrowing exceptions") is to propagate the exception after ensuring threads are cleaned up, add a rethrow after the joins:

🛡️ Proposed fix to rethrow after thread cleanup
   barrier_thread.join();
+
+  // Rethrow captured exception after threads are safely joined
+  if (pdlp_exception) {
+    std::rethrow_exception(pdlp_exception);
+  }

   // copy the dual simplex solution to the device

If the current behavior (graceful degradation returning NumericalError status) is intentional, consider removing the unused pdlp_exception variable and directly logging within the catch block, or add a comment clarifying the exception is intentionally swallowed.

🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/pdlp/solve.cu` around lines 1175 - 1190, The code captures exceptions
into pdlp_exception in the run_pdlp try/catch but never rethrows it after thread
cleanup, causing swallowed errors and returning sol_pdlp with
pdlp_termination_status_t::NumericalError; fix by rethrowing pdlp_exception
after the concurrent threads are joined (i.e., after the thread-join/cleanup
section that follows this block) so the caller observes the original exception,
while keeping the existing settings_pdlp.concurrent_halt update and logging in
the catch; alternatively, if swallowing is intentional remove pdlp_exception and
add a clarifying comment—refer to pdlp_exception, run_pdlp,
settings_pdlp.concurrent_halt, and sol_pdlp when making the change.

Copy link
Contributor

@chris-maes chris-maes left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM. Thanks @Iroy30 . Great job tracking this bug down and fixing it.

@Iroy30 Iroy30 requested a review from a team as a code owner March 18, 2026 21:57
@Iroy30 Iroy30 requested a review from jameslamb March 18, 2026 21:57
@Iroy30
Copy link
Member Author

Iroy30 commented Mar 18, 2026

/ok to test 2903db8

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1181-1193: Remove the temporary std::cout debugging statements
around the PDLP exception handling block: delete the prints that contain
"DEBUGGING: CAUGHT PDLP EXCEPTION", "DEBUGGING: CAUGHT PDLP EXCEPTION RETHROW
1", "DEBUGGING: CAUGHT PDLP EXCEPTION RETHROW 2", and "DEBUGGING:AFTER TRY CATCH
BLOCK"; preserve the logic that assigns pdlp_exception, sets
*settings_pdlp.concurrent_halt = 1, rethrows and logs the exception via
CUOPT_LOG_ERROR("PDLP exception in concurrent mode: %s", e.what()), and keep the
catch(...) branch unchanged. Ensure no other stdout usage is introduced and that
logging remains through CUOPT_LOG_ERROR only.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: 96d7626b-23c9-4a77-9d44-ac31295d9ae0

📥 Commits

Reviewing files that changed from the base of the PR and between 439bf38 and 2903db8.

📒 Files selected for processing (4)
  • .github/workflows/build.yaml
  • .github/workflows/pr.yaml
  • .github/workflows/test.yaml
  • cpp/src/pdlp/solve.cu
💤 Files with no reviewable changes (2)
  • .github/workflows/build.yaml
  • .github/workflows/test.yaml

@Iroy30
Copy link
Member Author

Iroy30 commented Mar 18, 2026

/ok to test 5c8a52e

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Inline comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1180-1191: In the top-level catch(...) block that captures
pdlp_exception, signal concurrent halt so the other threads stop: after setting
pdlp_exception = std::current_exception(), set *settings_pdlp.concurrent_halt =
1 (or use the same atomic/store pattern used elsewhere) so barrier_thread and
dual_simplex_thread will see the halt flag and exit; keep the existing
rethrow/logging logic (the pdlp_exception handling and CUOPT_LOG_ERROR) but
ensure the concurrent_halt assignment occurs immediately after capturing the
exception.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: e811ddce-ab5d-4d4d-9e16-803372f3eaa0

📥 Commits

Reviewing files that changed from the base of the PR and between 2903db8 and 5c8a52e.

📒 Files selected for processing (1)
  • cpp/src/pdlp/solve.cu

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
cpp/src/pdlp/solve.cu (1)

1175-1198: ⚠️ Potential issue | 🟠 Major

pdlp_exception is captured but never rethrown after thread cleanup.

The exception is captured at line 1182 and the PR description states the intent is to "rethrow the exception," but after threads are joined (lines 1196-1198), pdlp_exception is never checked or rethrown. The function continues and returns sol_pdlp with NumericalError status, effectively swallowing the exception.

If the intent is to propagate the exception to callers (as stated in the PR objectives), add a rethrow after the joins. If graceful degradation (returning NumericalError) is intentional, remove the unused pdlp_exception variable and document that behavior. As per coding guidelines, "Verify error propagation from CUDA to user-facing APIs is complete."

🛡️ Proposed fix to rethrow after thread cleanup
   barrier_thread.join();
+
+  // Rethrow captured exception after threads are safely joined
+  if (pdlp_exception) {
+    std::rethrow_exception(pdlp_exception);
+  }
 
   // copy the dual simplex solution to the device
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/pdlp/solve.cu` around lines 1175 - 1198, The captured exception
pdlp_exception from the run_pdlp call is never propagated after joining threads
(dual_simplex_thread, barrier_thread), so either rethrow it after cleanup or
remove it if swallowing is intentional; to fix, if you want propagation: after
joining dual_simplex_thread and barrier_thread check if pdlp_exception is
non-null and call std::rethrow_exception(pdlp_exception) so the original
exception propagates to callers (keep settings_pdlp.concurrent_halt logic
as-is), otherwise remove pdlp_exception and the related catch/rethrow blocks and
document returning sol_pdlp with pdlp_termination_status_t::NumericalError as
the intended behavior.
♻️ Duplicate comments (1)
cpp/src/pdlp/solve.cu (1)

1181-1193: ⚠️ Potential issue | 🟠 Major

Remove debug std::cout statements before merging.

Lines 1181, 1187, 1190, and 1193 contain debugging artifacts with "DEBUGGING:" prefixes. These statements pollute stdout and are inconsistent with the codebase's logging conventions (which use CUOPT_LOG_* macros).

Additionally, the catch(...) block at lines 1189-1191 silently swallows unknown exceptions without logging. Consider adding a log statement for unknown exception types.

🐛 Proposed fix to remove debug statements and add logging
   } catch (...) {
-    std::cout << "\n DEBUGGING: CAUGHT PDLP EXCEPTION \n" << std::endl;
     pdlp_exception                 = std::current_exception();
     *settings_pdlp.concurrent_halt = 1;
     try {
       std::rethrow_exception(pdlp_exception);
     } catch (const std::exception& e) {
-      std::cout << "\n DEBUGGING: CAUGHT PDLP EXCEPTION RETHROW 1\n" << std::endl;
       CUOPT_LOG_ERROR("PDLP exception in concurrent mode: %s", e.what());
     } catch (...) {
-      std::cout << "\n DEBUGGING: CAUGHT PDLP EXCEPTION RETHROW 2\n" << std::endl;
+      CUOPT_LOG_ERROR("PDLP unknown exception in concurrent mode");
     }
   }
-  std::cout << "\n DEBUGGING:AFTER TRY CATCH BLOCK \n" << std::endl;
🤖 Prompt for AI Agents
Verify each finding against the current code and only fix it if needed.

In `@cpp/src/pdlp/solve.cu` around lines 1181 - 1193, Remove the stdout debug
prints around the PDLP exception handling (the std::cout lines that print
"DEBUGGING: ...") and instead rely on the existing logging macros; keep setting
*settings_pdlp.concurrent_halt = 1 and storing pdlp_exception =
std::current_exception(). In the rethrow block, keep the catch(const
std::exception& e) and its CUOPT_LOG_ERROR("PDLP exception in concurrent mode:
%s", e.what()) call, and modify the catch(...) handler to log an unknown
exception via CUOPT_LOG_ERROR (e.g., "PDLP unknown exception in concurrent
mode") instead of silently swallowing it. Ensure no std::cout remains in the
function around pdlp_exception, the rethrow, or after the try/catch.
🤖 Prompt for all review comments with AI agents
Verify each finding against the current code and only fix it if needed.

Outside diff comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1175-1198: The captured exception pdlp_exception from the run_pdlp
call is never propagated after joining threads (dual_simplex_thread,
barrier_thread), so either rethrow it after cleanup or remove it if swallowing
is intentional; to fix, if you want propagation: after joining
dual_simplex_thread and barrier_thread check if pdlp_exception is non-null and
call std::rethrow_exception(pdlp_exception) so the original exception propagates
to callers (keep settings_pdlp.concurrent_halt logic as-is), otherwise remove
pdlp_exception and the related catch/rethrow blocks and document returning
sol_pdlp with pdlp_termination_status_t::NumericalError as the intended
behavior.

---

Duplicate comments:
In `@cpp/src/pdlp/solve.cu`:
- Around line 1181-1193: Remove the stdout debug prints around the PDLP
exception handling (the std::cout lines that print "DEBUGGING: ...") and instead
rely on the existing logging macros; keep setting *settings_pdlp.concurrent_halt
= 1 and storing pdlp_exception = std::current_exception(). In the rethrow block,
keep the catch(const std::exception& e) and its CUOPT_LOG_ERROR("PDLP exception
in concurrent mode: %s", e.what()) call, and modify the catch(...) handler to
log an unknown exception via CUOPT_LOG_ERROR (e.g., "PDLP unknown exception in
concurrent mode") instead of silently swallowing it. Ensure no std::cout remains
in the function around pdlp_exception, the rethrow, or after the try/catch.

ℹ️ Review info
⚙️ Run configuration

Configuration used: Path: .coderabbit.yaml

Review profile: CHILL

Plan: Pro

Run ID: ed5addf2-b91e-427d-9808-11730620eac2

📥 Commits

Reviewing files that changed from the base of the PR and between 5c8a52e and 705c71f.

📒 Files selected for processing (1)
  • cpp/src/pdlp/solve.cu

Copy link
Member

@jameslamb jameslamb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Removing all conda-related CI is a pretty big change and doesn't at all match the title or description. Are those things just temporarily removed to make PR CI lighter while you're testing?

If so, you can revert the build.yaml and test.yaml changes... they are not used in PRs.

@Iroy30 Iroy30 added the do not merge Do not merge if this flag is set label Mar 19, 2026
@Iroy30
Copy link
Member Author

Iroy30 commented Mar 19, 2026

Removing all conda-related CI is a pretty big change and doesn't at all match the title or description. Are those things just temporarily removed to make PR CI lighter while you're testing?

If so, you can revert the build.yaml and test.yaml changes... they are not used in PRs.

Yes we are debugging some CI fails, this is not to be merged as is. Will be reverted.

@Iroy30
Copy link
Member Author

Iroy30 commented Mar 19, 2026

/ok to test d71bce2

@rgsl888prabhu rgsl888prabhu changed the base branch from main to release/26.04 March 19, 2026 16:24
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

bug Something isn't working do not merge Do not merge if this flag is set non-breaking Introduces a non-breaking change pdlp

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants