Skip to content

Conversation

@jupierce
Copy link
Contributor

No description provided.

The old method of creating a deleting instance profiles
was complex and led to a leak which debilitated
AWS temporarily. Have all instances share the same
instance profile and manage it via cloud formations.
Cloud formation IDs are too difficult to keep in
sync with PCO pruning. Retain resources and
just delete the lambda explicitly on "destroy"
@openshift-ci-robot
Copy link
Contributor

Pipeline controller notification
This repo is configured to use the pipeline controller. Second-stage tests will be triggered either automatically or after lgtm label is added, depending on the repository configuration. The pipeline controller will automatically detect which contexts are required and will utilize /test Prow commands to trigger the second stage.

For optional jobs, comment /test ? to see a list of all defined jobs. To trigger manually all jobs from second stage use /pipeline required command.

This repository is configured in: automatic mode

@coderabbitai
Copy link

coderabbitai bot commented Dec 25, 2025

Walkthrough

Adds a new multi-account/multi-region monitoring script and agent documentation, extends the deploy script with AWS profile/action handling and explicit region/profile usage, hardens NAT-instance provisioning (idempotency, profile-based IAM, multi-type fallbacks, waits, guarded cleanup), and updates CloudFormation templates to add IAM role/profile resources and multiple DeletionPolicy: Retain entries.

Changes

Cohort / File(s) Summary
Deployment script
hack/reduce_nat_gateway_cost/deploy.sh
Change working dir at start; add usage() and arg validation; introduce AWS_PROFILE and ACTION (default deploy); scope AWS CLI calls to profile and explicit regions; consolidate deploy/destroy flows; explicitly delete Lambda and primary/forwarder stacks; add user notes about retained resources and cleanup guidance.
Monitoring tool
hack/reduce_nat_gateway_cost/monitor_resources.py
New comprehensive Python monitor: boto3 session helper, multi-profile/region scanning, VPC existence checks, orphaned EC2/IAM detection, instance-profile counting, NAT gateway/instance stats and aging, route effectiveness checks, Lambda CloudWatch metrics, expected-resource validation, aggregated run_check() and CLI loop with optional alarm.
NAT provisioning logic
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py
Bumped VERSION to v1.4; added NAT_INSTANCE_PROFILE_NAME; removed inline instance-profile creation helpers; make NAT SG creation idempotent and race-tolerant; try multiple instance types with graceful fallbacks; add defensive validation, explicit waits for routes/RTs, improved tagging, and expanded cleanup on failures.
CloudFormation — main stack
hack/reduce_nat_gateway_cost/use-nat-instance.yaml
Added NatInstanceRole (AWS::IAM::Role) and NatInstanceProfile (AWS::IAM::InstanceProfile) with DeletionPolicy: Retain; tightened PassRole to !GetAtt NatInstanceRole.Arn; removed broad IAM actions from function policy; added DeletionPolicy: Retain to function, log group, execution role, and event rule.
CloudFormation — forwarders
hack/reduce_nat_gateway_cost/use-nat-instance-forwarders.yaml
Added DeletionPolicy: Retain to UseNatInstanceForwardRole and UseNatInstanceForwardEventRule.
Documentation
hack/reduce_nat_gateway_cost/AGENTS.md
New agent reference documenting architecture, event flow, deployment/update steps, monitoring, resources per account/region, troubleshooting, tagging, commands, and version history.

Estimated code review effort

🎯 4 (Complex) | ⏱️ ~45 minutes

✨ Finishing touches
  • 📝 Generate docstrings

Comment @coderabbitai help to get the list of available commands and usage tips.

@openshift-ci openshift-ci bot requested review from Prucek and bear-redhat December 25, 2025 13:55
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (2)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (1)

659-740: Enhanced error handling with iterative instance type selection.

The refactored logic properly handles instance type availability by iterating through fallback options and ensures security group cleanup on failures. The approach aligns well with the CloudFormation-managed instance profile strategy.

Consider these improvements based on static analysis:

  1. Line 724: Use logging.exception instead of logging.error to automatically include stack trace.
  2. Lines 728, 738: Replace bare Exception catches with specific exception types or at least log the exception details.
🔎 Proposed refinements
     except Exception as e:
         # If instance creation fails, clean up the security group we created
-        logger.error(f'Failed to create NAT instance: {e}. Cleaning up security group {nat_sg_id}')
+        logger.exception(f'Failed to create NAT instance: {e}. Cleaning up security group {nat_sg_id}')
         try:
             ec2_client.delete_security_group(GroupId=nat_sg_id)
             logger.info(f'Deleted orphaned security group {nat_sg_id}')
-        except Exception as cleanup_error:
+        except botocore.exceptions.ClientError as cleanup_error:
             logger.warning(f'Failed to clean up security group {nat_sg_id}: {cleanup_error}')
         raise
 
     if not nat_instance:
         # All instance types failed - clean up the security group
         logger.error(f'All instance types failed. Cleaning up security group {nat_sg_id}')
         try:
             ec2_client.delete_security_group(GroupId=nat_sg_id)
             logger.info(f'Deleted orphaned security group {nat_sg_id}')
-        except Exception as cleanup_error:
+        except botocore.exceptions.ClientError as cleanup_error:
             logger.warning(f'Failed to clean up security group {nat_sg_id}: {cleanup_error}')
         return None
hack/reduce_nat_gateway_cost/monitor_resources.py (1)

343-466: Well-structured orchestration with good visibility.

The main check function properly coordinates all monitoring checks and provides clear, color-coded output. The aggregation of issues across profiles is well done.

Consider addressing these minor style improvements:

  1. Line 369: The present_resources variable is unpacked but never used
  2. Line 374: The ip_count variable is unpacked but never used
  3. Line 448: The all_healthy variable is assigned but never used
🔎 Proposed refinements
         # Check expected resources exist
-        missing_resources, present_resources = check_expected_resources(profile)
+        missing_resources, _ = check_expected_resources(profile)
         if missing_resources:
             profile_issues.extend([f"  {r}" for r in missing_resources])
         
         # Check instance profile count
-        ip_count, ip_warning = check_instance_profile_count(profile)
+        _, ip_warning = check_instance_profile_count(profile)
         if ip_warning:
             profile_issues.append(ip_warning)

For line 448, either remove the variable or use it:

     # Display infrastructure health summary
     print(f"{CYAN}Infrastructure Health:{NC}")
-    all_healthy = True
     for profile in PROFILES:
         missing, present = check_expected_resources(profile)
         if missing:
             print(f"  {profile}: {RED}UNHEALTHY - {len(missing)} missing resource(s){NC}")
-            all_healthy = False
         else:
             print(f"  {profile}: {GREEN}OK ({len(present)} resources){NC}")
     print()
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 4a066ea and 5bd2a02.

📒 Files selected for processing (5)
  • hack/reduce_nat_gateway_cost/deploy.sh
  • hack/reduce_nat_gateway_cost/monitor_resources.py
  • hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py
  • hack/reduce_nat_gateway_cost/use-nat-instance-forwarders.yaml
  • hack/reduce_nat_gateway_cost/use-nat-instance.yaml
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • hack/reduce_nat_gateway_cost/deploy.sh
  • hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py
  • hack/reduce_nat_gateway_cost/monitor_resources.py
  • hack/reduce_nat_gateway_cost/use-nat-instance-forwarders.yaml
  • hack/reduce_nat_gateway_cost/use-nat-instance.yaml
🧬 Code graph analysis (1)
hack/reduce_nat_gateway_cost/monitor_resources.py (1)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (1)
  • main (1061-1081)
🪛 Ruff (0.14.10)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py

724-724: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


728-728: Do not catch blind exception: Exception

(BLE001)


738-738: Do not catch blind exception: Exception

(BLE001)

hack/reduce_nat_gateway_cost/monitor_resources.py

49-49: Starting a process with a partial executable path

(S607)


55-55: Starting a process with a partial executable path

(S607)


65-65: Starting a process with a partial executable path

(S607)


86-86: Consider moving this statement to an else block

(TRY300)


256-256: Consider moving this statement to an else block

(TRY300)


369-369: Unpacked variable present_resources is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


374-374: Unpacked variable ip_count is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


401-401: Consider [f"Profile {profile}:", *profile_issues] instead of concatenation

Replace with [f"Profile {profile}:", *profile_issues]

(RUF005)


413-413: Consider ["NAT Instance Age Issues:", *age_issues] instead of concatenation

Replace with ["NAT Instance Age Issues:", *age_issues]

(RUF005)


448-448: Local variable all_healthy is assigned to but never used

Remove assignment to unused variable all_healthy

(F841)

🪛 Shellcheck (0.11.0)
hack/reduce_nat_gateway_cost/deploy.sh

[warning] 6-6: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)

🔇 Additional comments (31)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (8)

118-119: LGTM!

The updated comments accurately reflect the simplified IAM permission model where the Lambda only needs iam:PassRole for the pre-existing CloudFormation-managed instance profile, rather than the broader IAM permissions previously required for dynamic profile creation.


151-154: LGTM!

The constant correctly references the CloudFormation-managed instance profile and aligns with the profile name defined in use-nat-instance.yaml.


278-280: LGTM!

Good defensive coding to handle cases where the NAT gateway may not have been fully tagged during creation.


301-301: LGTM!

Proper guard against None from get_tag prevents potential AttributeError on string operations.


333-333: LGTM!

Proper guard against None from get_tag prevents potential AttributeError on string operations.


500-502: LGTM!

Proper error handling with early return prevents downstream issues when NAT instance creation fails. Security group cleanup is handled within create_nat_instance, ensuring no resource leaks.


644-645: LGTM!

Improved error logging with appropriate severity level and context.


963-963: LGTM!

More specific exception handling improves code clarity and prevents accidentally catching unrelated exceptions.

hack/reduce_nat_gateway_cost/use-nat-instance-forwarders.yaml (2)

7-7: LGTM!

DeletionPolicy: Retain on the IAM role prevents disruption to event forwarding when the CloudFormation stack is deleted. This aligns with the PR's strategy to avoid synchronization issues between CloudFormation and PCO pruning.


37-37: LGTM!

DeletionPolicy: Retain on the EventBridge rule ensures continued event forwarding even when the CloudFormation stack is deleted. This is consistent with the PR's retention policy approach.

hack/reduce_nat_gateway_cost/deploy.sh (3)

8-21: LGTM!

Clear usage instructions and proper argument validation improve the script's usability and safety for multi-account deployments.


24-44: LGTM!

Consistent use of --profile ensures operations target the correct AWS account. The deployment flow is well-structured.


48-66: LGTM!

The destroy logic properly handles the DeletionPolicy: Retain resources by explicitly deleting the Lambda function (the critical component) while providing clear user guidance about retained resources. The approach is safe and well-documented.

hack/reduce_nat_gateway_cost/use-nat-instance.yaml (7)

11-49: LGTM!

The new IAM role and instance profile correctly implement the least-privilege principle for NAT instance operations. The permissions are appropriately scoped:

  • ec2:ReplaceRoute allows the NAT instance to update route tables
  • ec2:CreateTags is restricted to route-table resources
  • ec2:ModifyInstanceAttribute enables disabling source/dest checks

The DeletionPolicy: Retain prevents resource cleanup issues during stack deletion, aligning with the PR's strategy to avoid CloudFormation/PCO pruning synchronization problems.


52-52: LGTM!

Retention policy prevents accidental deletion of the Lambda execution role, maintaining infrastructure stability.


83-83: LGTM!

Narrowing the PassRole permission from wildcard to the specific NatInstanceRole ARN improves security by following the principle of least privilege.


86-86: LGTM!

Retention policy ensures Lambda logs are preserved even after stack deletion.


92-92: LGTM!

Retention policy on the Lambda function is critical - as the deploy script notes, this is explicitly deleted on destroy. Retaining it in CloudFormation prevents accidental deletion while allowing controlled removal.


121-121: LGTM!

Retention policy prevents accidental deletion of the Lambda execution role used by EventBridge.


147-147: LGTM!

Retention policy ensures the EventBridge rule continues forwarding events even after stack deletion.

hack/reduce_nat_gateway_cost/monitor_resources.py (11)

1-41: LGTM!

Well-structured monitoring script with clear configuration constants and proper documentation. The multi-account, multi-region approach is appropriate for the use case.


43-75: LGTM!

The alarm function provides useful feedback with graceful fallbacks across different operating systems. Good user experience.


77-91: LGTM!

Utility functions are well-implemented with proper error handling for VPC existence checks.


93-157: LGTM!

Thorough orphaned resource detection with proper VPC validation. The function correctly returns both issues and NAT instance details for age checking.


159-187: LGTM!

IAM resource leak detection properly identifies the "Created-" prefix pattern that indicates dynamically created resources that should have been cleaned up.


189-206: LGTM!

Instance profile count monitoring with appropriate threshold warning helps prevent hitting AWS service limits.


208-226: LGTM!

NAT instance age checking with proper timezone handling ensures instances aren't left running indefinitely.


228-259: LGTM!

CloudWatch metrics integration for Lambda error tracking is well-implemented with appropriate error handling.


261-269: LGTM!

Simple region-wise counting provides useful visibility into resource distribution.


271-341: LGTM!

Comprehensive resource existence checks across Lambda, IAM, and EventBridge resources with proper error handling for different AWS service exceptions.


468-515: LGTM!

Clean CLI interface with appropriate argument parsing and proper signal handling for continuous monitoring mode.

Comment on lines +5 to +6
# Change to the directory where this script is located
cd "$(dirname "$0")"
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟡 Minor

Add error handling for directory change.

The cd command can fail if the directory doesn't exist or is inaccessible, which would cause the script to continue in the wrong directory and potentially deploy to the wrong location.

🔎 Proposed fix
 # Change to the directory where this script is located
-cd "$(dirname "$0")"
+cd "$(dirname "$0")" || exit 1
📝 Committable suggestion

‼️ IMPORTANT
Carefully review the code before committing. Ensure that it accurately replaces the highlighted code, contains no missing lines, and has no issues with indentation. Thoroughly test & benchmark the code to ensure it meets the requirements.

Suggested change
# Change to the directory where this script is located
cd "$(dirname "$0")"
# Change to the directory where this script is located
cd "$(dirname "$0")" || exit 1
🧰 Tools
🪛 Shellcheck (0.11.0)

[warning] 6-6: Use 'cd ... || exit' or 'cd ... || return' in case cd fails.

(SC2164)

🤖 Prompt for AI Agents
In hack/reduce_nat_gateway_cost/deploy.sh around lines 5-6, the script blindly
runs cd "$(dirname "$0")" which can fail; update the script to check the result
of the cd and abort on failure: attempt to change to the script directory, and
if cd returns a non-zero status print a descriptive error to stderr and exit
with a non-zero status to avoid continuing in the wrong directory.

@hector-vido
Copy link
Contributor

/retest

@jupierce
Copy link
Contributor Author

jupierce commented Jan 5, 2026

/test all

@openshift-ci-robot
Copy link
Contributor

Scheduling required tests:
/test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test integration-optional-test

@jupierce
Copy link
Contributor Author

jupierce commented Jan 5, 2026

/test e2e

Copy link
Member

@Prucek Prucek left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

/hold
You can unhold whenever you are ready

@openshift-ci openshift-ci bot added the do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command. label Jan 6, 2026
@openshift-ci openshift-ci bot added the lgtm Indicates that a PR is ready to be merged. label Jan 6, 2026
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 6, 2026

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: jupierce, Prucek

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@openshift-ci openshift-ci bot added approved Indicates a PR has been approved by an approver from all required OWNERS files. and removed lgtm Indicates that a PR is ready to be merged. labels Jan 6, 2026
@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 6, 2026

New changes are detected. LGTM label has been removed.

Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (2)

301-301: Consider refactoring for readability.

The control plane detection logic is correct but complex. Consider extracting it into a helper function for improved clarity:

def is_control_plane_instance(cluster_role: Optional[str], instance_name: Optional[str]) -> bool:
    """Check if instance is a control plane node based on tags."""
    # Modern identification via cluster_role tag
    if cluster_role:
        return cluster_role in ['master', 'control-plane']
    # Legacy identification via instance name
    return instance_name is not None and '-master' in instance_name

# Then use:
if not is_control_plane_instance(cluster_role, instance_name):
    return

696-777: Consider using logging.exception for better error diagnostics.

The retry and cleanup logic is well-structured, but Line 761 should use logging.exception instead of logging.error to include the full traceback, which aids debugging.

🔎 Suggested improvement
     except Exception as e:
         # If instance creation fails, clean up the security group we created
-        logger.error(f'Failed to create NAT instance: {e}. Cleaning up security group {nat_sg_id}')
+        logger.exception(f'Failed to create NAT instance. Cleaning up security group {nat_sg_id}')
         try:
             ec2_client.delete_security_group(GroupId=nat_sg_id)
             logger.info(f'Deleted orphaned security group {nat_sg_id}')

The broad exception handling is appropriate here to ensure security group cleanup occurs even for unexpected errors.

Based on learnings, using logging.exception provides better diagnostics.

hack/reduce_nat_gateway_cost/monitor_resources.py (1)

458-615: Address unused variables and optimize string operations.

The monitoring logic is comprehensive, but several minor issues flagged by static analysis should be addressed:

  1. Lines 484, 489: Unused unpacked variables should be prefixed with underscore
  2. Lines 516, 528: Use spread operator instead of concatenation
  3. Line 547: Remove unnecessary f-string prefix
  4. Line 597: Variable all_healthy is computed but never used
🔎 Proposed fixes
         # Check expected resources exist
-        missing_resources, present_resources = check_expected_resources(profile)
+        missing_resources, _ = check_expected_resources(profile)
         if missing_resources:
             profile_issues.extend([f"  {r}" for r in missing_resources])
         
         # Check instance profile count
-        ip_count, ip_warning = check_instance_profile_count(profile)
+        _, ip_warning = check_instance_profile_count(profile)
         if ip_warning:
             profile_issues.append(ip_warning)
         if profile_issues:
             print(f"{RED}ISSUES FOUND:{NC}")
             for issue in profile_issues:
                 print(issue)
-            all_issues.extend([f"Profile {profile}:"] + profile_issues)
+            all_issues.extend([f"Profile {profile}:", *profile_issues])
             has_problems = True
     if age_issues:
         print(f"{RED}NAT INSTANCE AGE ISSUES:{NC}")
         for issue in age_issues:
             print(issue)
-        all_issues.extend(["NAT Instance Age Issues:"] + age_issues)
+        all_issues.extend(["NAT Instance Age Issues:", *age_issues])
         has_problems = True
     effectiveness = check_nat_instance_effectiveness(all_nat_instances)
     if effectiveness["eligible_count"] == 0:
-        print(f"  No NAT instances old enough to evaluate")
+        print("  No NAT instances old enough to evaluate")
     else:

For the all_healthy variable at line 597: either use it to affect the has_problems flag or remove it if not needed:

-    all_healthy = True
     for profile in PROFILES:
         missing, present = check_expected_resources(profile)
         if missing:
             print(f"  {profile}: {RED}UNHEALTHY - {len(missing)} missing resource(s){NC}")
-            all_healthy = False
+            has_problems = True
         else:
             print(f"  {profile}: {GREEN}OK ({len(present)} resources){NC}")
hack/reduce_nat_gateway_cost/AGENTS.md (1)

50-59: Fix fenced code block language specification.

The file structure block should have a language identifier for proper rendering.

🔎 Suggested fix
-```
+```text
 hack/reduce_nat_gateway_cost/
 ├── replace_nat_with_nat_instance.py  # Main Lambda function code
 ├── use-nat-instance.yaml             # CloudFormation template for us-east-1
 ├── use-nat-instance-forwarders.yaml  # CloudFormation template for other regions
 ├── deploy.sh                          # Deployment script
 ├── monitor_resources.py               # Python monitoring script
 ├── lambda.zip                         # Packaged Lambda (auto-generated)
 └── AGENTS.md                          # This file
</details>

Otherwise, the documentation is comprehensive and well-organized.

</blockquote></details>

</blockquote></details>

<details>
<summary>📜 Review details</summary>

**Configuration used**: Organization UI

**Review profile**: CHILL

**Plan**: Pro

**Cache: Disabled due to data retention organization setting**

**Knowledge base: Disabled due to `Reviews -> Disable Knowledge Base` setting**

<details>
<summary>📥 Commits</summary>

Reviewing files that changed from the base of the PR and between 5bd2a021d651ce3628f9b662fcb3b05fa4409b6a and 81cd85662a7c4be37469ed6da52ba252131a9553.

</details>

<details>
<summary>📒 Files selected for processing (3)</summary>

* `hack/reduce_nat_gateway_cost/AGENTS.md`
* `hack/reduce_nat_gateway_cost/monitor_resources.py`
* `hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py`

</details>

<details>
<summary>🧰 Additional context used</summary>

<details>
<summary>📓 Path-based instructions (1)</summary>

<details>
<summary>**</summary>


**⚙️ CodeRabbit configuration file**

> -Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:
- `hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py`
- `hack/reduce_nat_gateway_cost/monitor_resources.py`
- `hack/reduce_nat_gateway_cost/AGENTS.md`

</details>

</details><details>
<summary>🧬 Code graph analysis (1)</summary>

<details>
<summary>hack/reduce_nat_gateway_cost/monitor_resources.py (1)</summary><blockquote>

<details>
<summary>hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (1)</summary>

* `main` (1116-1136)

</details>

</blockquote></details>

</details><details>
<summary>🪛 markdownlint-cli2 (0.18.1)</summary>

<details>
<summary>hack/reduce_nat_gateway_cost/AGENTS.md</summary>

50-50: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

</details>

</details>
<details>
<summary>🪛 Ruff (0.14.10)</summary>

<details>
<summary>hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py</summary>

761-761: Use `logging.exception` instead of `logging.error`

Replace with `exception`

(TRY400)

---

765-765: Do not catch blind exception: `Exception`

(BLE001)

---

775-775: Do not catch blind exception: `Exception`

(BLE001)

</details>
<details>
<summary>hack/reduce_nat_gateway_cost/monitor_resources.py</summary>

50-50: Starting a process with a partial executable path

(S607)

---

56-56: Starting a process with a partial executable path

(S607)

---

66-66: Starting a process with a partial executable path

(S607)

---

87-87: Consider moving this statement to an `else` block

(TRY300)

---

257-257: Consider moving this statement to an `else` block

(TRY300)

---

484-484: Unpacked variable `present_resources` is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

---

489-489: Unpacked variable `ip_count` is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)

---

516-516: Consider `[f"Profile {profile}:", *profile_issues]` instead of concatenation

Replace with `[f"Profile {profile}:", *profile_issues]`

(RUF005)

---

528-528: Consider `["NAT Instance Age Issues:", *age_issues]` instead of concatenation

Replace with `["NAT Instance Age Issues:", *age_issues]`

(RUF005)

---

547-547: f-string without any placeholders

Remove extraneous `f` prefix

(F541)

---

597-597: Local variable `all_healthy` is assigned to but never used

Remove assignment to unused variable `all_healthy`

(F841)

</details>

</details>

</details>

<details>
<summary>🔇 Additional comments (17)</summary><blockquote>

<details>
<summary>hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (10)</summary><blockquote>

`118-119`: **LGTM!**

The updated permission comments accurately document the Lambda's IAM requirements for SSM parameter access and passing the NAT instance role to EC2.

---

`151-154`: **LGTM!**

The NAT_INSTANCE_PROFILE_NAME constant improves maintainability by centralizing the CloudFormation-created instance profile name, and the VERSION bump to v1.4 appropriately reflects the scope of changes.

---

`278-280`: **LGTM!**

The VPC ID validation prevents cleanup() from being called with missing data, which could cause errors downstream. The defensive check with early return is appropriate.

---

`333-336`: **LGTM!**

Filtering out bootstrap instance terminations prevents premature NAT instance cleanup, as bootstrap nodes are terminated early in the cluster lifecycle while other nodes continue running.

---

`377-379`: **LGTM!**

The validation ensures nat_instance_id is an actual EC2 instance ID (starts with 'i-') before using it to update routes. This prevents errors if the userData script failed and the tag still contains a NAT Gateway ID.

---

`502-505`: **LGTM!**

The error handling prevents downstream failures by checking for None return from create_nat_instance and logging appropriately.

---

`539-597`: **LGTM!**

The race condition handling for security group creation is well-implemented. It checks for existing groups, handles InvalidGroup.Duplicate errors gracefully, and falls back to fetching the existing group when concurrent invocations create duplicates.

---

`681-682`: **LGTM!**

The timeout error handling for route table availability is appropriate and returns None to signal failure, which is properly checked by the caller.

---

`897-926`: **LGTM!**

The race condition handling in the userData script is a critical improvement. It waits up to 5 minutes for the 0.0.0.0/0 route to exist before attempting to replace it, which addresses the timing issue where the NAT instance may start before the cluster installer creates the route.

---

`1018-1018`: **LGTM!**

The specific exception type (ClientError) is more precise than broad exception handling, which is a best practice improvement.

</blockquote></details>
<details>
<summary>hack/reduce_nat_gateway_cost/monitor_resources.py (7)</summary><blockquote>

`44-76`: **LGTM!**

The alarm function has appropriate fallback logic and timeout protection. The static analysis warnings about subprocess paths (S607) are false positives here, as the commands and file paths are hardcoded system utilities, not user input.

---

`83-92`: **LGTM!**

The VPC existence check is correct and clear. The static analysis suggestion (TRY300) about using an else block is a minor style preference; the current structure is fine.

---

`94-158`: **LGTM!**

The orphaned resource detection logic correctly identifies resources whose VPCs no longer exist, with appropriate error handling and useful metadata collection for downstream analysis.

---

`229-260`: **LGTM!**

The Lambda error count retrieval correctly uses CloudWatch metrics with appropriate error handling. The static analysis suggestion (TRY300) is a minor style preference; the current structure is clear.

---

`272-384`: **LGTM!**

The effectiveness check is well-designed with efficient API batching, proper route validation (checking for ENI in 0.0.0.0/0 routes), and comprehensive error handling.

---

`386-456`: **LGTM!**

The resource validation comprehensively checks all expected infrastructure components across regions with appropriate error handling.

---

`617-664`: **LGTM!**

The main function correctly handles both one-shot and continuous monitoring modes with appropriate exit codes and keyboard interrupt handling.

</blockquote></details>

</blockquote></details>

</details>

<!-- This is an auto-generated comment by CodeRabbit for review status -->

@openshift-ci-robot
Copy link
Contributor

Scheduling required tests:
/test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test integration-optional-test

@jupierce jupierce force-pushed the fix_instance_profile_leak branch from 81cd856 to f7fe819 Compare January 6, 2026 17:57
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

Caution

Some comments are outside the diff and can’t be posted inline due to platform limitations.

⚠️ Outside diff range comments (1)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (1)

550-633: Security group race condition handling is improved, but ingress rules may fail on reuse.

When an existing security group is reused (lines 558-561 or 590-593), the subsequent authorize_security_group_ingress calls (lines 600-632) will fail with InvalidPermission.Duplicate if the rules already exist.

🔎 Proposed fix to handle duplicate ingress rules
+    # Skip ingress rules if we reused an existing security group
+    # (rules would already be configured)
+    if existing_sgs.get("SecurityGroups"):
+        return nat_sg_id
+
     # Allow inbound traffic from the private subnet
-    ec2_client.authorize_security_group_ingress(
+    try:
+        ec2_client.authorize_security_group_ingress(

Alternatively, wrap each authorize_security_group_ingress call in a try/except that handles InvalidPermission.Duplicate:

     # Allow inbound traffic from the private subnet
-    ec2_client.authorize_security_group_ingress(
-        GroupId=nat_sg_id,
-        IpPermissions=[
-            {
-                "IpProtocol": "-1",  # Allow all traffic
-                "IpRanges": [{"CidrIp": private_cidr}]
-            }
-        ]
-    )
+    try:
+        ec2_client.authorize_security_group_ingress(
+            GroupId=nat_sg_id,
+            IpPermissions=[
+                {
+                    "IpProtocol": "-1",  # Allow all traffic
+                    "IpRanges": [{"CidrIp": private_cidr}]
+                }
+            ]
+        )
+    except botocore.exceptions.ClientError as e:
+        if 'InvalidPermission.Duplicate' not in str(e):
+            raise
🧹 Nitpick comments (4)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (1)

759-777: Good cleanup on failure, but use logger.exception for better diagnostics.

The cleanup pattern correctly prevents orphaned security groups. Per Ruff TRY400, use logger.exception instead of logger.error to automatically include the stack trace.

🔎 Proposed fix
     except Exception as e:
         # If instance creation fails, clean up the security group we created
-        logger.error(f'Failed to create NAT instance: {e}. Cleaning up security group {nat_sg_id}')
+        logger.exception(f'Failed to create NAT instance. Cleaning up security group {nat_sg_id}')
         try:
             ec2_client.delete_security_group(GroupId=nat_sg_id)
             logger.info(f'Deleted orphaned security group {nat_sg_id}')
-        except Exception as cleanup_error:
+        except botocore.exceptions.ClientError as cleanup_error:
             logger.warning(f'Failed to clean up security group {nat_sg_id}: {cleanup_error}')
         raise
hack/reduce_nat_gateway_cost/monitor_resources.py (3)

567-572: Minor: Unused variables could be prefixed with underscore.

Per Ruff RUF059, present_resources and ip_count are unpacked but never used. Either use them or prefix with underscore to indicate intentional discard.

🔎 Proposed fix
         # Check expected resources exist
-        missing_resources, present_resources = check_expected_resources(profile)
+        missing_resources, _present_resources = check_expected_resources(profile)
         if missing_resources:
             profile_issues.extend([f"  {r}" for r in missing_resources])
         
         # Check instance profile count
-        ip_count, ip_warning = check_instance_profile_count(profile)
+        _ip_count, ip_warning = check_instance_profile_count(profile)

630-630: Minor: f-string without placeholders.

Per Ruff F541, remove the extraneous f prefix.

🔎 Proposed fix
-        print(f"  No NAT instances old enough to evaluate")
+        print("  No NAT instances old enough to evaluate")

688-698: all_healthy variable is assigned but never used.

Consider either removing it or using it (e.g., to influence the return value or final summary).

🔎 Proposed fix - remove unused variable
     # Display infrastructure health summary
     print(f"{CYAN}Infrastructure Health:{NC}")
-    all_healthy = True
     for profile in PROFILES:
         missing, present = check_expected_resources(profile)
         if missing:
             print(f"  {profile}: {RED}UNHEALTHY - {len(missing)} missing resource(s){NC}")
-            all_healthy = False
         else:
             print(f"  {profile}: {GREEN}OK ({len(present)} resources){NC}")
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 81cd856 and f7fe819.

📒 Files selected for processing (4)
  • hack/reduce_nat_gateway_cost/AGENTS.md
  • hack/reduce_nat_gateway_cost/monitor_resources.py
  • hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py
  • hack/reduce_nat_gateway_cost/use-nat-instance.yaml
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py
  • hack/reduce_nat_gateway_cost/use-nat-instance.yaml
  • hack/reduce_nat_gateway_cost/monitor_resources.py
  • hack/reduce_nat_gateway_cost/AGENTS.md
🧬 Code graph analysis (1)
hack/reduce_nat_gateway_cost/monitor_resources.py (1)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (1)
  • main (1116-1136)
🪛 markdownlint-cli2 (0.18.1)
hack/reduce_nat_gateway_cost/AGENTS.md

50-50: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.10)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py

761-761: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


765-765: Do not catch blind exception: Exception

(BLE001)


775-775: Do not catch blind exception: Exception

(BLE001)

hack/reduce_nat_gateway_cost/monitor_resources.py

50-50: Starting a process with a partial executable path

(S607)


56-56: Starting a process with a partial executable path

(S607)


66-66: Starting a process with a partial executable path

(S607)


87-87: Consider moving this statement to an else block

(TRY300)


257-257: Consider moving this statement to an else block

(TRY300)


286-286: Consider moving this statement to an else block

(TRY300)


340-340: Consider moving this statement to an else block

(TRY300)


567-567: Unpacked variable present_resources is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


572-572: Unpacked variable ip_count is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


599-599: Consider [f"Profile {profile}:", *profile_issues] instead of concatenation

Replace with [f"Profile {profile}:", *profile_issues]

(RUF005)


611-611: Consider ["NAT Instance Age Issues:", *age_issues] instead of concatenation

Replace with ["NAT Instance Age Issues:", *age_issues]

(RUF005)


630-630: f-string without any placeholders

Remove extraneous f prefix

(F541)


695-695: Local variable all_healthy is assigned to but never used

Remove assignment to unused variable all_healthy

(F841)

🔇 Additional comments (14)
hack/reduce_nat_gateway_cost/AGENTS.md (1)

1-314: Comprehensive agent reference documentation - LGTM!

This is a well-structured operational guide covering deployment, monitoring, troubleshooting, and common issues. The documentation aligns well with the code changes in this PR.

hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (7)

151-154: Key fix for instance profile leak - using static CloudFormation-managed profile.

This change eliminates the dynamic instance profile creation that was causing leaks. The profile name correctly matches the NatInstanceProfile resource in use-nat-instance.yaml.


278-281: Good defensive guard against missing VPC tag.

This prevents the cleanup function from running with a None vpc_id, which could cause issues or incorrect resource targeting.


301-302: Complex but correct control plane detection logic.

The conditional handles both the explicit cluster-api-provider-aws/role tag and falls back to name-based detection with -master in the name. Consider extracting this to a helper function for readability in a future refactor.


333-336: Good safeguard against premature NAT cleanup during bootstrap.

Correctly prevents NAT instance termination when the bootstrap machine is terminated, which is expected cluster lifecycle behavior.


377-379: Good validation to prevent route misconfiguration.

This check ensures the route table update uses an actual instance ID rather than a NAT Gateway ID, which could happen if the userData script failed.


680-682: Good use of logger for timeout error.

Consistent with the rest of the codebase's logging approach.


901-926: Excellent fix for the route creation race condition.

The retry loop (up to 5 minutes) correctly handles the timing issue where the NAT instance may start before the cluster installer creates the 0.0.0.0/0 route. This addresses the "There is no route defined" error mentioned in AGENTS.md.

hack/reduce_nat_gateway_cost/use-nat-instance.yaml (3)

11-50: Core fix for instance profile leak - static IAM resources.

The new NatInstanceRole and NatInstanceProfile replace the dynamic creation approach. The DeletionPolicy: Retain is important to prevent accidental service disruption. The profile name correctly matches NAT_INSTANCE_PROFILE_NAME in the Lambda code.


81-84: Good security improvement - PassRole restriction.

Restricting iam:PassRole to only the specific NatInstanceRole ARN instead of a wildcard pattern follows the principle of least privilege.


51-168: Consistent DeletionPolicy: Retain pattern across all critical resources.

This prevents accidental service disruption during stack updates or deletions, as documented in AGENTS.md. The destroy command in deploy.sh explicitly handles cleanup when needed.

hack/reduce_nat_gateway_cost/monitor_resources.py (3)

19-34: Configuration aligns well with the CloudFormation template and AGENTS.md.

The expected resource names correctly match the CloudFormation resources, enabling accurate health checks.


44-75: Alarm function has good fallback chain.

The partial executable paths (paplay, afplay) are acceptable for a developer utility script. The fallback to terminal bell ensures functionality across environments.


429-448: Fix route checking to match route creation method.

The effectiveness check looks for NetworkInterfaceId in routes (line 434), but routes are created using InstanceId parameter (replace_nat_with_nat_instance.py line 395). When routes are created with InstanceId, the route object will have the InstanceId field populated, not NetworkInterfaceId. This causes the condition to always fail, reporting all NAT instances as ineffective even when routes have been properly updated.

Change line 434 to check route.get("InstanceId") == instance_id instead, since instance_id is already available from line 400.

@jupierce jupierce force-pushed the fix_instance_profile_leak branch from f7fe819 to 7f40fa2 Compare January 6, 2026 18:03
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI Agents
In @hack/reduce_nat_gateway_cost/monitor_resources.py:
- Around line 429-448: The route-check incorrectly compares
route["NetworkInterfaceId"] to eni_id; change it to check
route.get("InstanceId") against the NAT instance's InstanceId (e.g.,
inst.get("InstanceId")) and optionally fall back to NetworkInterfaceId==eni_id
for ENI-based routes so effective NAT instances created via --instance-id are
detected; update the condition inside the loop that inspects route.get(...)
accordingly and keep the existing found_route/ineffective_instances logic.
🧹 Nitpick comments (2)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (1)

759-777: Good cleanup on instance creation failure.

Cleaning up the security group when instance creation fails prevents orphaned resources.

Consider using logging.exception instead of logging.error on line 761 to include the traceback in logs, which aids debugging.

🔎 Optional: Use logging.exception for better traceability
     except Exception as e:
         # If instance creation fails, clean up the security group we created
-        logger.error(f'Failed to create NAT instance: {e}. Cleaning up security group {nat_sg_id}')
+        logger.exception(f'Failed to create NAT instance. Cleaning up security group {nat_sg_id}')
         try:
             ec2_client.delete_security_group(GroupId=nat_sg_id)
             logger.info(f'Deleted orphaned security group {nat_sg_id}')
         except Exception as cleanup_error:
             logger.warning(f'Failed to clean up security group {nat_sg_id}: {cleanup_error}')
         raise
hack/reduce_nat_gateway_cost/monitor_resources.py (1)

567-572: Minor: Unused variables and unnecessary f-string.

A few unused variables flagged by static analysis:

  • Line 567: present_resources is unused in the first call (but used on line 692)
  • Line 572: ip_count is computed but never used
  • Line 630: f-string without placeholders
  • Line 695: all_healthy is assigned but never read
🔎 Proposed cleanup
         # Check expected resources exist
-        missing_resources, present_resources = check_expected_resources(profile)
+        missing_resources, _present_resources = check_expected_resources(profile)
         if missing_resources:
             profile_issues.extend([f"  {r}" for r in missing_resources])
         
         # Check instance profile count
-        ip_count, ip_warning = check_instance_profile_count(profile)
+        _ip_count, ip_warning = check_instance_profile_count(profile)
         if ip_warning:
             profile_issues.append(ip_warning)
     if effectiveness["eligible_count"] == 0:
-        print(f"  No NAT instances old enough to evaluate")
+        print("  No NAT instances old enough to evaluate")
     print(f"{CYAN}Infrastructure Health:{NC}")
-    all_healthy = True
     for profile in PROFILES:
         missing, present = check_expected_resources(profile)
         if missing:
             print(f"  {profile}: {RED}UNHEALTHY - {len(missing)} missing resource(s){NC}")
-            all_healthy = False
         else:
             print(f"  {profile}: {GREEN}OK ({len(present)} resources){NC}")

Also applies to: 630-630, 695-695

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between f7fe819 and 7f40fa2.

📒 Files selected for processing (4)
  • hack/reduce_nat_gateway_cost/AGENTS.md
  • hack/reduce_nat_gateway_cost/monitor_resources.py
  • hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py
  • hack/reduce_nat_gateway_cost/use-nat-instance.yaml
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • hack/reduce_nat_gateway_cost/use-nat-instance.yaml
  • hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py
  • hack/reduce_nat_gateway_cost/monitor_resources.py
  • hack/reduce_nat_gateway_cost/AGENTS.md
🧬 Code graph analysis (1)
hack/reduce_nat_gateway_cost/monitor_resources.py (1)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (1)
  • main (1116-1136)
🪛 LanguageTool
hack/reduce_nat_gateway_cost/AGENTS.md

[grammar] ~339-~339: Use a hyphen to join words.
Context: ...y:** EventBridge retries once with max 5 minute event age. 5. Regions: Lambd...

(QB_NEW_EN_HYPHEN)

🪛 markdownlint-cli2 (0.18.1)
hack/reduce_nat_gateway_cost/AGENTS.md

50-50: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.10)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py

761-761: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


765-765: Do not catch blind exception: Exception

(BLE001)


775-775: Do not catch blind exception: Exception

(BLE001)

hack/reduce_nat_gateway_cost/monitor_resources.py

50-50: Starting a process with a partial executable path

(S607)


56-56: Starting a process with a partial executable path

(S607)


66-66: Starting a process with a partial executable path

(S607)


87-87: Consider moving this statement to an else block

(TRY300)


257-257: Consider moving this statement to an else block

(TRY300)


286-286: Consider moving this statement to an else block

(TRY300)


340-340: Consider moving this statement to an else block

(TRY300)


567-567: Unpacked variable present_resources is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


572-572: Unpacked variable ip_count is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


599-599: Consider [f"Profile {profile}:", *profile_issues] instead of concatenation

Replace with [f"Profile {profile}:", *profile_issues]

(RUF005)


611-611: Consider ["NAT Instance Age Issues:", *age_issues] instead of concatenation

Replace with ["NAT Instance Age Issues:", *age_issues]

(RUF005)


630-630: f-string without any placeholders

Remove extraneous f prefix

(F541)


695-695: Local variable all_healthy is assigned to but never used

Remove assignment to unused variable all_healthy

(F841)

🔇 Additional comments (15)
hack/reduce_nat_gateway_cost/AGENTS.md (1)

1-356: Comprehensive documentation for the NAT Gateway cost reduction solution.

This is well-structured documentation covering the solution overview, deployment, monitoring, and troubleshooting. The content aligns with the code changes in this PR.

Minor static analysis hints (optional fixes):

  • Line 50: Consider adding a language specifier to the fenced code block (e.g., ```text)
  • Line 339: "5 minute" → "5-minute" (hyphenation)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (8)

151-154: Good fix: Static instance profile replaces dynamic creation.

Using a CloudFormation-managed instance profile (use-nat-instance-profile) instead of dynamically creating profiles per NAT instance directly addresses the instance profile leak issue.


278-281: Good guard against missing VPC tag during cleanup.

This prevents cleanup failures when the NAT gateway's VPC association is lost during deletion, which is a sensible defensive measure.


301-302: Control plane identification logic is correct.

The conditional properly handles both the sigs.k8s.io/cluster-api-provider-aws/role tag and fallback to name-based detection for master nodes.


333-336: Correct guard to prevent premature NAT cleanup on bootstrap termination.

Bootstrap machine termination is a normal part of cluster installation and should not trigger NAT instance cleanup.


377-398: Good validation to ensure NAT instance ID is actually an instance.

This prevents attempting to route traffic to a NAT Gateway ID if the userData script failed to update the route table tag properly.


548-597: Robust race condition handling for security group creation.

The two-phase approach (check before create + handle duplicate error) properly handles concurrent Lambda invocations that might try to create the same security group.


901-926: Race condition fix: Wait for route before replacing.

The retry loop (up to 5 minutes) properly handles the race condition where the NAT instance might start before the cluster installer creates the 0.0.0.0/0 route. This aligns with the v1.3 fix mentioned in AGENTS.md.


1018-1026: Good: Specific exception handling for AWS API errors.

Using botocore.exceptions.ClientError instead of a broad Exception catch is more appropriate here.

hack/reduce_nat_gateway_cost/use-nat-instance.yaml (3)

11-50: Good: Static instance profile and role with minimal permissions.

The NatInstanceRole has appropriately scoped permissions for what NAT instances need:

  • ec2:ReplaceRoute and ec2:DescribeRouteTables for route management
  • ec2:CreateTags restricted to route-table resources
  • ec2:ModifyInstanceAttribute for disabling source/dest check

DeletionPolicy: Retain on these resources prevents accidental deletion during stack updates.


81-84: Security improvement: PassRole now scoped to specific role.

Restricting iam:PassRole to only NatInstanceRole.Arn instead of a wildcard pattern follows the principle of least privilege.


51-168: Consistent DeletionPolicy: Retain across all resources.

This aligns with the documentation in AGENTS.md (line 108) explaining that retained resources prevent accidental service disruption. The destroy action in deploy.sh explicitly handles cleanup.

hack/reduce_nat_gateway_cost/monitor_resources.py (3)

19-34: Good configuration with reasonable thresholds.

The instance profile threshold of 500 provides adequate buffer before reaching the AWS limit of 1000. The NAT age threshold of 8 hours and effectiveness check at 15 minutes are sensible defaults.


44-75: Alarm function with appropriate fallbacks.

The function handles different platforms (Linux/macOS) and falls back to terminal bell. The partial executable paths flagged by static analysis are appropriate for well-known system utilities.


715-761: Well-designed CLI with proper signal handling.

The script supports both single-run (--once) and continuous monitoring modes with appropriate exit codes for scripted use.

@jupierce jupierce force-pushed the fix_instance_profile_leak branch from 7f40fa2 to 6709b81 Compare January 6, 2026 18:50
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI Agents
In @hack/reduce_nat_gateway_cost/monitor_resources.py:
- Around line 441-460: The route check is comparing
route.get("NetworkInterfaceId") to eni_id, but routes created with replace_route
using --instance-id set the InstanceId field; update the logic where
route_tables_response is scanned (the block touching eni_id, inst,
effective_count, and ineffective_instances) to treat a route as matching if
route.get("NetworkInterfaceId") == eni_id OR route.get("InstanceId") ==
inst.get("InstanceId") so you detect both ENI-targeted and instance-targeted
routes and correctly increment effective_count or append to
ineffective_instances.
🧹 Nitpick comments (2)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (1)

759-777: Consider using logging.exception for better debugging.

The cleanup logic is correct. However, when logging the initial failure at line 761, using logging.exception instead of logging.error would automatically include the stack trace, aiding debugging.

🔎 Proposed fix
     except Exception as e:
         # If instance creation fails, clean up the security group we created
-        logger.error(f'Failed to create NAT instance: {e}. Cleaning up security group {nat_sg_id}')
+        logger.exception(f'Failed to create NAT instance. Cleaning up security group {nat_sg_id}')
         try:
             ec2_client.delete_security_group(GroupId=nat_sg_id)
             logger.info(f'Deleted orphaned security group {nat_sg_id}')
         except Exception as cleanup_error:
             logger.warning(f'Failed to clean up security group {nat_sg_id}: {cleanup_error}')
         raise
hack/reduce_nat_gateway_cost/monitor_resources.py (1)

707-717: Unused variable all_healthy.

The variable all_healthy is assigned but never used. Consider removing it or using it for the function's return value or summary output.

🔎 Proposed fix
     # Display infrastructure health summary
     print(f"{CYAN}Infrastructure Health:{NC}")
-    all_healthy = True
     for profile in PROFILES:
         missing, present = check_expected_resources(profile)
         if missing:
             print(f"  {profile}: {RED}UNHEALTHY - {len(missing)} missing resource(s){NC}")
-            all_healthy = False
+            has_problems = True
         else:
             print(f"  {profile}: {GREEN}OK ({len(present)} resources){NC}")
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 7f40fa2 and 6709b81.

📒 Files selected for processing (4)
  • hack/reduce_nat_gateway_cost/AGENTS.md
  • hack/reduce_nat_gateway_cost/monitor_resources.py
  • hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py
  • hack/reduce_nat_gateway_cost/use-nat-instance.yaml
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py
  • hack/reduce_nat_gateway_cost/use-nat-instance.yaml
  • hack/reduce_nat_gateway_cost/monitor_resources.py
  • hack/reduce_nat_gateway_cost/AGENTS.md
🧬 Code graph analysis (1)
hack/reduce_nat_gateway_cost/monitor_resources.py (1)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (1)
  • main (1116-1136)
🪛 LanguageTool
hack/reduce_nat_gateway_cost/AGENTS.md

[grammar] ~339-~339: Use a hyphen to join words.
Context: ...y:** EventBridge retries once with max 5 minute event age. 5. Regions: Lambd...

(QB_NEW_EN_HYPHEN)

🪛 markdownlint-cli2 (0.18.1)
hack/reduce_nat_gateway_cost/AGENTS.md

50-50: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.10)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py

761-761: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


765-765: Do not catch blind exception: Exception

(BLE001)


775-775: Do not catch blind exception: Exception

(BLE001)

hack/reduce_nat_gateway_cost/monitor_resources.py

62-62: Starting a process with a partial executable path

(S607)


68-68: Starting a process with a partial executable path

(S607)


78-78: Starting a process with a partial executable path

(S607)


99-99: Consider moving this statement to an else block

(TRY300)


269-269: Consider moving this statement to an else block

(TRY300)


298-298: Consider moving this statement to an else block

(TRY300)


352-352: Consider moving this statement to an else block

(TRY300)


579-579: Unpacked variable present_resources is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


584-584: Unpacked variable ip_count is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


611-611: Consider [f"Profile {profile}:", *profile_issues] instead of concatenation

Replace with [f"Profile {profile}:", *profile_issues]

(RUF005)


623-623: Consider ["NAT Instance Age Issues:", *age_issues] instead of concatenation

Replace with ["NAT Instance Age Issues:", *age_issues]

(RUF005)


642-642: f-string without any placeholders

Remove extraneous f prefix

(F541)


714-714: Local variable all_healthy is assigned to but never used

Remove assignment to unused variable all_healthy

(F841)

🔇 Additional comments (10)
hack/reduce_nat_gateway_cost/AGENTS.md (1)

1-356: LGTM! Comprehensive documentation for the NAT Gateway cost reduction solution.

The documentation is well-structured and provides valuable context for operators and AI agents. The minor static analysis hints (hyphenation in "5 minute" at line 339 and missing language specifier at line 50) are cosmetic and can be addressed optionally.

hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (5)

151-154: Good fix: Static instance profile eliminates the leak.

Using a CloudFormation-managed instance profile instead of dynamically creating profiles per NAT instance properly addresses the instance profile leak issue.


278-281: Good guard against missing VPC tag.

This prevents cleanup failures when the NAT gateway doesn't have the expected VPC tag.


377-397: Good validation of instance ID format.

Validating that the tag value actually contains an instance ID (starts with 'i-') before attempting route updates prevents failures when the userData script didn't complete successfully.


550-597: Good race condition handling for security groups.

The check-then-create pattern with proper handling of InvalidGroup.Duplicate errors makes this robust against concurrent Lambda invocations.


901-926: Good fix for race condition with cluster installer.

Waiting for the 0.0.0.0/0 route to exist before attempting to replace it properly handles the timing race between the NAT instance startup and the cluster installer creating the route.

hack/reduce_nat_gateway_cost/use-nat-instance.yaml (3)

11-42: Well-scoped IAM permissions for NAT instances.

The permissions are appropriately scoped for the operations the NAT instance needs to perform. The use of Resource: "*" for ec2:DescribeRouteTables and ec2:ModifyInstanceAttribute is necessary since the specific resources aren't known at deployment time.


43-50: Instance profile correctly configured.

The instance profile properly references the NAT instance role and matches the NAT_INSTANCE_PROFILE_NAME constant in the Lambda code.


81-84: Good security improvement: PassRole now restricted to specific role.

Changing from a wildcard to !GetAtt NatInstanceRole.Arn follows the principle of least privilege, ensuring the Lambda can only pass the intended NAT instance role.

hack/reduce_nat_gateway_cost/monitor_resources.py (1)

1-106: Well-structured monitoring script.

Good organization with clear separation of concerns, proper type hints, and useful CLI interface. The use of partial executable paths for system audio utilities (paplay, afplay) is acceptable for this monitoring tool.

Comment on lines 441 to 519
# Check if any route table has a route pointing to this ENI
found_route = False
for rt in route_tables_response.get("RouteTables", []):
for route in rt.get("Routes", []):
# Check if this route points to our NAT instance's ENI
if route.get("NetworkInterfaceId") == eni_id:
# Verify it's for the default route or a significant route
if route.get("DestinationCidrBlock") == "0.0.0.0/0":
found_route = True
break
if found_route:
break

if found_route:
effective_count += 1
else:
ineffective_instances.append({
**inst,
"reason": "No route table updated to use this instance",
})
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Bug: Route check uses wrong attribute.

The route check at line 446 compares against NetworkInterfaceId, but routes created via replace_route with --instance-id (as done in replace_nat_with_nat_instance.py lines 392-396 and 919) have the InstanceId attribute set instead.

This will incorrectly report all effective NAT instances as ineffective.

🔎 Proposed fix
                 # Check if any route table has a route pointing to this ENI
                 found_route = False
                 for rt in route_tables_response.get("RouteTables", []):
                     for route in rt.get("Routes", []):
-                        # Check if this route points to our NAT instance's ENI
-                        if route.get("NetworkInterfaceId") == eni_id:
+                        # Check if this route points to our NAT instance
+                        if route.get("InstanceId") == instance_id:
                             # Verify it's for the default route or a significant route
                             if route.get("DestinationCidrBlock") == "0.0.0.0/0":
                                 found_route = True
                                 break
                     if found_route:
                         break

Committable suggestion skipped: line range outside the PR's diff.

🤖 Prompt for AI Agents
In @hack/reduce_nat_gateway_cost/monitor_resources.py around lines 441 - 460,
The route check is comparing route.get("NetworkInterfaceId") to eni_id, but
routes created with replace_route using --instance-id set the InstanceId field;
update the logic where route_tables_response is scanned (the block touching
eni_id, inst, effective_count, and ineffective_instances) to treat a route as
matching if route.get("NetworkInterfaceId") == eni_id OR route.get("InstanceId")
== inst.get("InstanceId") so you detect both ENI-targeted and instance-targeted
routes and correctly increment effective_count or append to
ineffective_instances.

@jupierce jupierce force-pushed the fix_instance_profile_leak branch from 6709b81 to 61c9ff0 Compare January 6, 2026 19:46
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (2)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (1)

759-767: Consider using logging.exception for better debugging.

At line 761, replace logging.error with logging.exception to include the traceback in logs, making it easier to diagnose NAT instance creation failures.

🔎 Proposed fix
     except Exception as e:
         # If instance creation fails, clean up the security group we created
-        logger.error(f'Failed to create NAT instance: {e}. Cleaning up security group {nat_sg_id}')
+        logger.exception(f'Failed to create NAT instance. Cleaning up security group {nat_sg_id}')
         try:
             ec2_client.delete_security_group(GroupId=nat_sg_id)
             logger.info(f'Deleted orphaned security group {nat_sg_id}')

Based on static analysis hints (Ruff TRY400).

hack/reduce_nat_gateway_cost/monitor_resources.py (1)

783-792: Unused variable: all_healthy is assigned but never used.

The all_healthy variable is set but never referenced. Consider either removing it or using it to provide a summary status in the output.

🔎 Proposed fix

Remove the unused variable:

     # Display infrastructure health summary
     print(f"{CYAN}Infrastructure Health:{NC}")
-    all_healthy = True
     for profile in PROFILES:
         missing, present = check_expected_resources(profile)
         if missing:
             print(f"  {profile}: {RED}UNHEALTHY - {len(missing)} missing resource(s){NC}")
-            all_healthy = False
         else:
             print(f"  {profile}: {GREEN}OK ({len(present)} resources){NC}")
     print()

Based on static analysis hints (Ruff F841).

📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 6709b81 and 61c9ff0.

📒 Files selected for processing (4)
  • hack/reduce_nat_gateway_cost/AGENTS.md
  • hack/reduce_nat_gateway_cost/monitor_resources.py
  • hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py
  • hack/reduce_nat_gateway_cost/use-nat-instance.yaml
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • hack/reduce_nat_gateway_cost/use-nat-instance.yaml
  • hack/reduce_nat_gateway_cost/monitor_resources.py
  • hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py
  • hack/reduce_nat_gateway_cost/AGENTS.md
🪛 LanguageTool
hack/reduce_nat_gateway_cost/AGENTS.md

[grammar] ~339-~339: Use a hyphen to join words.
Context: ...y:** EventBridge retries once with max 5 minute event age. 5. Regions: Lambd...

(QB_NEW_EN_HYPHEN)

🪛 markdownlint-cli2 (0.18.1)
hack/reduce_nat_gateway_cost/AGENTS.md

50-50: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.10)
hack/reduce_nat_gateway_cost/monitor_resources.py

62-62: Starting a process with a partial executable path

(S607)


68-68: Starting a process with a partial executable path

(S607)


78-78: Starting a process with a partial executable path

(S607)


99-99: Consider moving this statement to an else block

(TRY300)


269-269: Consider moving this statement to an else block

(TRY300)


298-298: Consider moving this statement to an else block

(TRY300)


352-352: Consider moving this statement to an else block

(TRY300)


405-405: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


630-630: Unpacked variable present_resources is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


635-635: Unpacked variable ip_count is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


662-662: Consider [f"Profile {profile}:", *profile_issues] instead of concatenation

Replace with [f"Profile {profile}:", *profile_issues]

(RUF005)


674-674: Consider ["NAT Instance Age Issues:", *age_issues] instead of concatenation

Replace with ["NAT Instance Age Issues:", *age_issues]

(RUF005)


717-717: f-string without any placeholders

Remove extraneous f prefix

(F541)


789-789: Local variable all_healthy is assigned to but never used

Remove assignment to unused variable all_healthy

(F841)

hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py

761-761: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


765-765: Do not catch blind exception: Exception

(BLE001)


775-775: Do not catch blind exception: Exception

(BLE001)

@openshift-ci-robot
Copy link
Contributor

Scheduling required tests:
/test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test integration-optional-test

@jupierce jupierce force-pushed the fix_instance_profile_leak branch from 61c9ff0 to e9563cd Compare January 7, 2026 02:09
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 0

🧹 Nitpick comments (4)
hack/reduce_nat_gateway_cost/AGENTS.md (1)

50-59: Specify language for code block.

The file structure code block lacks a language identifier, which improves syntax highlighting and readability.

🔎 Proposed fix
-```
+```text
 hack/reduce_nat_gateway_cost/
 ├── replace_nat_with_nat_instance.py  # Main Lambda function code
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (2)

301-302: Consider simplifying the conditional logic.

The conditional at line 301 is complex and difficult to parse. Breaking it into multiple conditions or extracting a helper function would improve readability.

🔎 Proposed refactor
             instance_name = get_tag(tag_set, 'Name')
             cluster_role = get_tag(tag_set, 'sigs.k8s.io/cluster-api-provider-aws/role')
-            if (cluster_role and cluster_role not in ['master', 'control-plane']) or (not cluster_role and (not instance_name or '-master' not in instance_name)):
+            
+            # Check if this is a control plane instance
+            is_control_plane = False
+            if cluster_role:
+                is_control_plane = cluster_role in ['master', 'control-plane']
+            elif instance_name:
+                is_control_plane = '-master' in instance_name
+            
+            if not is_control_plane:
                 return

759-768: Use logging.exception for better error diagnostics.

Line 761 uses logging.error to log an exception. Using logging.exception would include the full traceback, which is helpful for debugging Lambda failures.

🔎 Proposed fix
     except Exception as e:
         # If instance creation fails, clean up the security group we created
-        logger.error(f'Failed to create NAT instance: {e}. Cleaning up security group {nat_sg_id}')
+        logger.exception(f'Failed to create NAT instance. Cleaning up security group {nat_sg_id}')
         try:
             ec2_client.delete_security_group(GroupId=nat_sg_id)
hack/reduce_nat_gateway_cost/use-nat-instance.yaml (1)

11-50: LGTM! Static IAM resources prevent profile leaks.

The introduction of static NatInstanceRole and NatInstanceProfile resources with DeletionPolicy: Retain directly addresses the instance profile leak issue. This approach is much more robust than dynamic creation.

However, consider tightening the IAM permissions for better security posture:

  • Lines 34, 42: Resource: "*" for ec2:ReplaceRoute, ec2:DescribeRouteTables, and ec2:ModifyInstanceAttribute is overly broad
  • The existing pattern with ec2:CreateTags scoped to arn:aws:ec2:*:*:route-table/* shows precedent for specific scoping in this file
  • Consider aligning these with similar ARN restrictions (e.g., route-table and instance resources)
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between 61c9ff0 and e9563cd.

📒 Files selected for processing (4)
  • hack/reduce_nat_gateway_cost/AGENTS.md
  • hack/reduce_nat_gateway_cost/monitor_resources.py
  • hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py
  • hack/reduce_nat_gateway_cost/use-nat-instance.yaml
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py
  • hack/reduce_nat_gateway_cost/monitor_resources.py
  • hack/reduce_nat_gateway_cost/use-nat-instance.yaml
  • hack/reduce_nat_gateway_cost/AGENTS.md
🧬 Code graph analysis (1)
hack/reduce_nat_gateway_cost/monitor_resources.py (1)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (1)
  • main (1116-1136)
🪛 LanguageTool
hack/reduce_nat_gateway_cost/AGENTS.md

[grammar] ~339-~339: Use a hyphen to join words.
Context: ...y:** EventBridge retries once with max 5 minute event age. 5. Regions: Lambd...

(QB_NEW_EN_HYPHEN)

🪛 markdownlint-cli2 (0.18.1)
hack/reduce_nat_gateway_cost/AGENTS.md

50-50: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.10)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py

761-761: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


765-765: Do not catch blind exception: Exception

(BLE001)


775-775: Do not catch blind exception: Exception

(BLE001)

hack/reduce_nat_gateway_cost/monitor_resources.py

62-62: Starting a process with a partial executable path

(S607)


68-68: Starting a process with a partial executable path

(S607)


78-78: Starting a process with a partial executable path

(S607)


99-99: Consider moving this statement to an else block

(TRY300)


269-269: Consider moving this statement to an else block

(TRY300)


298-298: Consider moving this statement to an else block

(TRY300)


352-352: Consider moving this statement to an else block

(TRY300)


405-405: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


638-638: Unpacked variable present_resources is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


643-643: Unpacked variable ip_count is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


670-670: Consider [f"Profile {profile}:", *profile_issues] instead of concatenation

Replace with [f"Profile {profile}:", *profile_issues]

(RUF005)


682-682: Consider ["NAT Instance Age Issues:", *age_issues] instead of concatenation

Replace with ["NAT Instance Age Issues:", *age_issues]

(RUF005)


725-725: f-string without any placeholders

Remove extraneous f prefix

(F541)


797-797: Local variable all_healthy is assigned to but never used

Remove assignment to unused variable all_healthy

(F841)

🔇 Additional comments (8)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (7)

278-281: LGTM!

This defensive guard prevents cleanup failures when the VPC tag is missing, which aligns with the PR objective of fixing resource leaks.


377-379: LGTM!

This validation prevents treating a NAT Gateway ID as an instance ID if the userData script failed to update the route, which is a good defensive measure.


502-505: LGTM!

The early return pattern with error logging prevents continuing with invalid state when NAT instance creation fails.


539-598: LGTM!

The idempotent security group creation with race condition handling is excellent. This prevents duplicate security groups and handles concurrent Lambda invocations gracefully, which directly addresses resource leak concerns.


680-683: LGTM!

Proper error handling for route table timeout with clear logging. The None return allows the caller to handle the failure appropriately.


759-777: Catching broad Exception is justified here.

While static analysis flags the broad exception catching at lines 759 and 769, this is intentional and necessary. The security group cleanup must occur regardless of the exception type to prevent resource leaks, which is the core objective of this PR.


1018-1018: LGTM!

Using the specific botocore.exceptions.ClientError exception type instead of a generic catch improves code clarity and ensures only AWS API errors are caught here.

hack/reduce_nat_gateway_cost/use-nat-instance.yaml (1)

84-84: Excellent security improvement with specific PassRole ARN.

Changing the PassRole resource from a wildcard pattern to the specific !GetAtt NatInstanceRole.Arn follows the principle of least privilege and prevents unintended role usage. This is a critical fix that helps prevent the instance profile leak.

@jupierce jupierce force-pushed the fix_instance_profile_leak branch from e9563cd to b39cf52 Compare January 7, 2026 02:17
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🤖 Fix all issues with AI agents
In @hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py:
- Around line 759-767: Replace the broad Exception handling in the NAT instance
creation block by logging the full traceback with logger.exception when creation
fails and narrow the cleanup catch to boto3/botocore client errors;
specifically, in the except block that references nat_sg_id, call
logger.exception('Failed to create NAT instance...') to preserve the stack
trace, then attempt ec2_client.delete_security_group(GroupId=nat_sg_id) and
catch botocore.exceptions.ClientError (or the specific AWS client error class
you import) for the cleanup warning instead of Exception, and finally re-raise
the original exception.
🧹 Nitpick comments (1)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py (1)

301-302: Consider simplifying the conditional logic for readability.

The nested conditional logic is complex and could be difficult to maintain. Consider extracting into a helper function or breaking it into multiple conditions for clarity.

🔎 Suggested refactor for improved readability
+def is_control_plane_instance(cluster_role: Optional[str], instance_name: Optional[str]) -> bool:
+    """Check if instance is a control plane (master) node."""
+    if cluster_role:
+        return cluster_role in ['master', 'control-plane']
+    return instance_name and '-master' in instance_name
+
 def lambda_handler(event, context):
     # ... existing code ...
             instance_name = get_tag(tag_set, 'Name')
             cluster_role = get_tag(tag_set, 'sigs.k8s.io/cluster-api-provider-aws/role')
-            if (cluster_role and cluster_role not in ['master', 'control-plane']) or (not cluster_role and (not instance_name or '-master' not in instance_name)):
+            if not is_control_plane_instance(cluster_role, instance_name):
                 return
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between e9563cd and b39cf52.

📒 Files selected for processing (4)
  • hack/reduce_nat_gateway_cost/AGENTS.md
  • hack/reduce_nat_gateway_cost/monitor_resources.py
  • hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py
  • hack/reduce_nat_gateway_cost/use-nat-instance.yaml
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py
  • hack/reduce_nat_gateway_cost/use-nat-instance.yaml
  • hack/reduce_nat_gateway_cost/monitor_resources.py
  • hack/reduce_nat_gateway_cost/AGENTS.md
🪛 LanguageTool
hack/reduce_nat_gateway_cost/AGENTS.md

[grammar] ~339-~339: Use a hyphen to join words.
Context: ...y:** EventBridge retries once with max 5 minute event age. 5. Regions: Lambd...

(QB_NEW_EN_HYPHEN)

🪛 markdownlint-cli2 (0.18.1)
hack/reduce_nat_gateway_cost/AGENTS.md

50-50: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.10)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py

761-761: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


765-765: Do not catch blind exception: Exception

(BLE001)


775-775: Do not catch blind exception: Exception

(BLE001)

hack/reduce_nat_gateway_cost/monitor_resources.py

62-62: Starting a process with a partial executable path

(S607)


68-68: Starting a process with a partial executable path

(S607)


78-78: Starting a process with a partial executable path

(S607)


99-99: Consider moving this statement to an else block

(TRY300)


269-269: Consider moving this statement to an else block

(TRY300)


298-298: Consider moving this statement to an else block

(TRY300)


352-352: Consider moving this statement to an else block

(TRY300)


405-405: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


627-627: Unpacked variable present_resources is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


632-632: Unpacked variable ip_count is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


659-659: Consider [f"Profile {profile}:", *profile_issues] instead of concatenation

Replace with [f"Profile {profile}:", *profile_issues]

(RUF005)


671-671: Consider ["NAT Instance Age Issues:", *age_issues] instead of concatenation

Replace with ["NAT Instance Age Issues:", *age_issues]

(RUF005)


714-714: f-string without any placeholders

Remove extraneous f prefix

(F541)


786-786: Local variable all_healthy is assigned to but never used

Remove assignment to unused variable all_healthy

(F841)

🔇 Additional comments (1)
hack/reduce_nat_gateway_cost/monitor_resources.py (1)

489-500: Past review comment has been correctly addressed.

The route check now correctly uses InstanceId instead of NetworkInterfaceId to detect routes created via replace_route with --instance-id, which aligns with how the routes are created in replace_nat_with_nat_instance.py.

Comment on lines +759 to +767
except Exception as e:
# If instance creation fails, clean up the security group we created
logger.error(f'Failed to create NAT instance: {e}. Cleaning up security group {nat_sg_id}')
try:
ec2_client.delete_security_group(GroupId=nat_sg_id)
logger.info(f'Deleted orphaned security group {nat_sg_id}')
except Exception as cleanup_error:
logger.warning(f'Failed to clean up security group {nat_sg_id}: {cleanup_error}')
raise
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

⚠️ Potential issue | 🟠 Major

Improve error handling specificity and logging.

The code catches broad Exception which masks the actual error type and makes debugging difficult. Additionally, logging.error doesn't preserve the full traceback.

🔎 Recommended fix
     except Exception as e:
         # If instance creation fails, clean up the security group we created
-        logger.error(f'Failed to create NAT instance: {e}. Cleaning up security group {nat_sg_id}')
+        logger.exception(f'Failed to create NAT instance. Cleaning up security group {nat_sg_id}')
         try:
             ec2_client.delete_security_group(GroupId=nat_sg_id)
             logger.info(f'Deleted orphaned security group {nat_sg_id}')
-        except Exception as cleanup_error:
+        except ClientError as cleanup_error:
             logger.warning(f'Failed to clean up security group {nat_sg_id}: {cleanup_error}')
         raise

Using logger.exception preserves the full stack trace, and catching ClientError instead of broad Exception for cleanup is more appropriate.

🧰 Tools
🪛 Ruff (0.14.10)

761-761: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


765-765: Do not catch blind exception: Exception

(BLE001)

🤖 Prompt for AI Agents
In @hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py around lines
759 - 767, Replace the broad Exception handling in the NAT instance creation
block by logging the full traceback with logger.exception when creation fails
and narrow the cleanup catch to boto3/botocore client errors; specifically, in
the except block that references nat_sg_id, call logger.exception('Failed to
create NAT instance...') to preserve the stack trace, then attempt
ec2_client.delete_security_group(GroupId=nat_sg_id) and catch
botocore.exceptions.ClientError (or the specific AWS client error class you
import) for the cleanup warning instead of Exception, and finally re-raise the
original exception.

@openshift-ci-robot
Copy link
Contributor

Scheduling required tests:
/test e2e

Scheduling tests matching the pipeline_run_if_changed or not excluded by pipeline_skip_if_only_changed parameters:
/test integration-optional-test

@jupierce jupierce force-pushed the fix_instance_profile_leak branch from b39cf52 to 05c786e Compare January 8, 2026 14:55
Copy link

@coderabbitai coderabbitai bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Actionable comments posted: 1

🧹 Nitpick comments (1)
hack/reduce_nat_gateway_cost/monitor_resources.py (1)

405-407: Consider logging AWS API errors during NAT gateway enumeration.

The current implementation silently swallows ClientError exceptions when describing NAT gateways across regions. While the monitoring continues to other profiles/regions, this makes it difficult to diagnose authentication issues, permission problems, or API throttling.

💡 Recommended logging enhancement
             except ClientError as e:
                 # Log but continue - don't fail the whole check
-                pass
+                logger.warning(f"Error describing NAT gateways in {profile}/{region}: {e}")
📜 Review details

Configuration used: Organization UI

Review profile: CHILL

Plan: Pro

Cache: Disabled due to data retention organization setting

Knowledge base: Disabled due to Reviews -> Disable Knowledge Base setting

📥 Commits

Reviewing files that changed from the base of the PR and between b39cf52 and 05c786e.

📒 Files selected for processing (4)
  • hack/reduce_nat_gateway_cost/AGENTS.md
  • hack/reduce_nat_gateway_cost/monitor_resources.py
  • hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py
  • hack/reduce_nat_gateway_cost/use-nat-instance.yaml
🧰 Additional context used
📓 Path-based instructions (1)
**

⚙️ CodeRabbit configuration file

-Focus on major issues impacting performance, readability, maintainability and security. Avoid nitpicks and avoid verbosity.

Files:

  • hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py
  • hack/reduce_nat_gateway_cost/monitor_resources.py
  • hack/reduce_nat_gateway_cost/use-nat-instance.yaml
  • hack/reduce_nat_gateway_cost/AGENTS.md
🪛 LanguageTool
hack/reduce_nat_gateway_cost/AGENTS.md

[grammar] ~339-~339: Use a hyphen to join words.
Context: ...y:** EventBridge retries once with max 5 minute event age. 5. Regions: Lambd...

(QB_NEW_EN_HYPHEN)

🪛 markdownlint-cli2 (0.18.1)
hack/reduce_nat_gateway_cost/AGENTS.md

50-50: Fenced code blocks should have a language specified

(MD040, fenced-code-language)

🪛 Ruff (0.14.10)
hack/reduce_nat_gateway_cost/replace_nat_with_nat_instance.py

761-761: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


765-765: Do not catch blind exception: Exception

(BLE001)


775-775: Do not catch blind exception: Exception

(BLE001)

hack/reduce_nat_gateway_cost/monitor_resources.py

62-62: Starting a process with a partial executable path

(S607)


68-68: Starting a process with a partial executable path

(S607)


78-78: Starting a process with a partial executable path

(S607)


99-99: Consider moving this statement to an else block

(TRY300)


269-269: Consider moving this statement to an else block

(TRY300)


298-298: Consider moving this statement to an else block

(TRY300)


352-352: Consider moving this statement to an else block

(TRY300)


405-405: Local variable e is assigned to but never used

Remove assignment to unused variable e

(F841)


627-627: Unpacked variable present_resources is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


632-632: Unpacked variable ip_count is never used

Prefix it with an underscore or any other dummy variable pattern

(RUF059)


659-659: Consider [f"Profile {profile}:", *profile_issues] instead of concatenation

Replace with [f"Profile {profile}:", *profile_issues]

(RUF005)


671-671: Consider ["NAT Instance Age Issues:", *age_issues] instead of concatenation

Replace with ["NAT Instance Age Issues:", *age_issues]

(RUF005)


717-717: f-string without any placeholders

Remove extraneous f prefix

(F541)


789-789: Local variable all_healthy is assigned to but never used

Remove assignment to unused variable all_healthy

(F841)

Comment on lines +759 to +777
except Exception as e:
# If instance creation fails, clean up the security group we created
logger.error(f'Failed to create NAT instance: {e}. Cleaning up security group {nat_sg_id}')
try:
ec2_client.delete_security_group(GroupId=nat_sg_id)
logger.info(f'Deleted orphaned security group {nat_sg_id}')
except Exception as cleanup_error:
logger.warning(f'Failed to clean up security group {nat_sg_id}: {cleanup_error}')
raise

if not nat_instance:
# All instance types failed - clean up the security group
logger.error(f'All instance types failed. Cleaning up security group {nat_sg_id}')
try:
ec2_client.delete_security_group(GroupId=nat_sg_id)
logger.info(f'Deleted orphaned security group {nat_sg_id}')
except Exception as cleanup_error:
logger.warning(f'Failed to clean up security group {nat_sg_id}: {cleanup_error}')
return None
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🛠️ Refactor suggestion | 🟠 Major

Improve exception handling during NAT instance creation.

The current exception handling has several issues:

  1. Line 761: Use logger.exception() instead of logger.error() to automatically include the stack trace, which aids debugging.
  2. Lines 765, 775: Catching bare Exception is too broad. Consider catching specific exceptions (e.g., ClientError, BotoCoreError) or at minimum use logger.exception() to preserve stack traces for unexpected errors.
  3. The cleanup-on-error pattern is good, but errors during cleanup are not being raised, potentially hiding issues.
🔧 Recommended exception handling improvements
     except Exception as e:
         # If instance creation fails, clean up the security group we created
-        logger.error(f'Failed to create NAT instance: {e}. Cleaning up security group {nat_sg_id}')
+        logger.exception(f'Failed to create NAT instance. Cleaning up security group {nat_sg_id}')
         try:
             ec2_client.delete_security_group(GroupId=nat_sg_id)
             logger.info(f'Deleted orphaned security group {nat_sg_id}')
-        except Exception as cleanup_error:
+        except ClientError as cleanup_error:
             logger.warning(f'Failed to clean up security group {nat_sg_id}: {cleanup_error}')
         raise
 
     if not nat_instance:
         # All instance types failed - clean up the security group
         logger.error(f'All instance types failed. Cleaning up security group {nat_sg_id}')
         try:
             ec2_client.delete_security_group(GroupId=nat_sg_id)
             logger.info(f'Deleted orphaned security group {nat_sg_id}')
-        except Exception as cleanup_error:
+        except ClientError as cleanup_error:
             logger.warning(f'Failed to clean up security group {nat_sg_id}: {cleanup_error}')
         return None
🧰 Tools
🪛 Ruff (0.14.10)

761-761: Use logging.exception instead of logging.error

Replace with exception

(TRY400)


765-765: Do not catch blind exception: Exception

(BLE001)


775-775: Do not catch blind exception: Exception

(BLE001)

@jupierce
Copy link
Contributor Author

jupierce commented Jan 8, 2026

/retest

@openshift-ci
Copy link
Contributor

openshift-ci bot commented Jan 8, 2026

@jupierce: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name Commit Details Required Rerun command
ci/prow/breaking-changes 05c786e link false /test breaking-changes
ci/prow/images 05c786e link true /test images

Full PR test history. Your PR dashboard.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

approved Indicates a PR has been approved by an approver from all required OWNERS files. do-not-merge/hold Indicates that a PR should not merge because someone has issued a /hold command.

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants