test(e2e): fix AKS e2e test flakiness #7410
base: main
Fix two sources of test flakiness in AKS e2e tests:
1. Add network-node-identity to podCrashTolerations map to allow one restart for webhook startup timing issues
2. Generate unique names for HostedCluster resources in TestOnCreateAPIUX to prevent "already exists" race condition from sequential test iterations
Signed-off-by: Jesse Jaggars <[email protected]>
Commit-Message-Assisted-by: Claude (via Claude Code)
Skipping CI for Draft Pull Request.
Walkthrough
Two test utility files were modified: unique HostedCluster names are generated by appending loop index and nanosecond timestamp to avoid race conditions during cluster creation tests, and a new pod crash toleration entry for network-node-identity webhook startup timing was added to the crash tolerations map.
/test e2e-aks-4-21
/approve
[APPROVALNOTIFIER] This PR is APPROVED. This pull-request has been approved by: csrwng, jhjaggars
@jhjaggars: all tests passed!
Summary
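This PR fixes two sources of flakiness in AKS e2e tests identified from failures in PR #6975: restarts of the network-node-identity container during cluster initialization, and an "already exists" name collision in TestOnCreateAPIUX.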
Changes
1. Add network-node-identity to podCrashTolerations
File: test/e2e/util/util.go
The network-node-identity webhook container can restart once during cluster initialization due to startup timing issues. This change adds it to the crash tolerations map with a tolerance of 1 restart.
Previous behavior: Test fails immediately on any container restart
New behavior: Test tolerates 1 restart for network-node-identity
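The toleration logic can be illustrated with a minimal sketch. Only the name podCrashTolerations and the network-node-identity entry come from this PR; the map's value type and the exceedsTolerance helper are assumptions made for illustration, and the real code in test/e2e/util/util.go may be structured differently.

```go
package main

import "fmt"

// podCrashTolerations maps a container name to the number of restarts the
// test is willing to accept. The entry mirrors the one described in this PR;
// the value type (int32) is an assumption, not the confirmed declaration.
var podCrashTolerations = map[string]int32{
	"network-node-identity": 1,
}

// exceedsTolerance is a hypothetical helper. It reports whether the observed
// restart count is above what the map allows. Containers that are not listed
// get the zero value, so any restart at all fails the check.
func exceedsTolerance(container string, restarts int32) bool {
	return restarts > podCrashTolerations[container]
}

func main() {
	fmt.Println(exceedsTolerance("network-node-identity", 1)) // false: one restart is tolerated
	fmt.Println(exceedsTolerance("network-node-identity", 2)) // true: a second restart still fails
	fmt.Println(exceedsTolerance("some-other-container", 1))  // true: unlisted, zero tolerance
}
```

Keeping the tolerance at exactly 1 means a container that keeps crashing is still reported as a failure.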
2. Fix TestOnCreateAPIUX race condition
File: test/e2e/create_cluster_test.go
The test used the hardcoded name "base" for all HostedCluster resources across sequential test iterations. When the delete from iteration N did not complete before iteration N+1 started, the create failed with an "already exists" error.
Previous behavior: All test iterations use the hardcoded name "base", causing race conditions
New behavior: Each test iteration generates a unique name using a timestamp and the loop index
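A minimal sketch of this kind of name generation follows. The uniqueName helper and the exact "prefix-index-timestamp" layout are assumptions for illustration; the PR only states that the loop index and a nanosecond timestamp are appended, so the format used in create_cluster_test.go may differ.

```go
package main

import (
	"fmt"
	"time"
)

// uniqueName is a hypothetical helper. It appends the loop index and the
// current Unix time in nanoseconds to a prefix, so that a resource from a
// previous iteration that is still being deleted cannot collide by name.
func uniqueName(prefix string, index int) string {
	return fmt.Sprintf("%s-%d-%d", prefix, index, time.Now().UnixNano())
}

func main() {
	// Each iteration gets its own name instead of the shared "base".
	for i := 0; i < 3; i++ {
		fmt.Println(uniqueName("base", i))
	}
}
```

The index alone already separates iterations within one run; the timestamp additionally separates names across repeated runs against the same namespace.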
Test Plan
make staticcheck passes
make verify passes
Notes
These are pre-existing flakiness issues unrelated to the scale-from-zero functionality in PR #6975. The failures were environmental/test infrastructure issues rather than product bugs.