fix: unify nodeclass status and termination controllers to prevent ra… #7597

Merged
saurav-agarwalla merged 2 commits into aws:main from instance-profile-leak
Jan 14, 2025

Conversation

saurav-agarwalla
Contributor

Fixes #N/A

Description
During a recent investigation, I found that the nodeclass status and termination controllers sometimes race against each other, and depending on which one wins, the instance profile can be leaked. I discussed options with the team, and merging the two controllers is the most straightforward fix at the moment. None of the other solutions is 100% failsafe, with the exception of adding new finalizers for instance profiles, and that approach has its own issues because it is backwards incompatible.
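
To make the race concrete, here is a minimal Go sketch of the idea behind the merge. Every name in it (`NodeClass`, `fakeIAM`, `reconcile`) is a hypothetical stand-in and not the code in this PR. The assumption it illustrates is that with two controllers, a status reconcile could re-create the instance profile after the termination path had already deleted it, whereas a single entry point checks the deletion state first and takes only one path per pass.

```go
package main

import "fmt"

// NodeClass is a hypothetical, stripped-down stand-in for an EC2NodeClass.
type NodeClass struct {
	Name     string
	Deleting bool // stands in for a non-zero deletionTimestamp
}

// fakeIAM is an in-memory stand-in for the instance profile API (assumption).
type fakeIAM struct {
	profiles map[string]bool
}

func (f *fakeIAM) ensureProfile(name string) { f.profiles[name] = true }
func (f *fakeIAM) deleteProfile(name string) { delete(f.profiles, name) }

// reconcile is the single entry point for both the status and termination
// paths. Because the deletion check and the profile creation live in the
// same function, one pass over an object either creates or deletes the
// profile, never both.
func reconcile(iam *fakeIAM, nc NodeClass) string {
	if nc.Deleting {
		iam.deleteProfile(nc.Name)
		return "finalized"
	}
	iam.ensureProfile(nc.Name)
	return "ready"
}

func main() {
	iam := &fakeIAM{profiles: map[string]bool{}}
	nc := NodeClass{Name: "default"}

	fmt.Println(reconcile(iam, nc), len(iam.profiles))

	nc.Deleting = true
	fmt.Println(reconcile(iam, nc), len(iam.profiles))
}
```

In a controller-runtime based controller, a single controller does not process the same object key concurrently, which is what makes the single-entry-point shape safe. Two separate controllers each have their own work queue and get no such guarantee relative to each other.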

The primary downside of this change is that nodeclaim delete events can now trigger a reconciliation even when the nodeclass is not deleted or deleting. Looking through the current list of reconcilers, the impact should be minimal: we make a few additional calls to EC2 for subnets and security groups, plus the additional calls for instance profile reconciliation, and only when nodeclaims are getting deleted.
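
The trade-off can be sketched as follows, again in Go with hypothetical names (`NodeClaimEvent`, `enqueueFor`, `stepsFor`) that are not taken from the PR. The sketch assumes that a nodeclaim delete event enqueues the referenced nodeclass, and that a nodeclass which is not being deleted then runs its usual status steps. The step list is limited to the three areas named in the description and is not meant to be complete.

```go
package main

import "fmt"

// NodeClaimEvent is a hypothetical stand-in for a watch event on a nodeclaim.
type NodeClaimEvent struct {
	Deleted       bool
	NodeClassName string // name of the nodeclass the nodeclaim references
}

// enqueueFor returns the nodeclass names to reconcile for an event.
// Only delete events with a known nodeclass reference enqueue anything.
func enqueueFor(ev NodeClaimEvent) []string {
	if !ev.Deleted || ev.NodeClassName == "" {
		return nil
	}
	return []string{ev.NodeClassName}
}

// stepsFor lists the work a reconcile pass performs (assumed, simplified).
// A deleting nodeclass only runs finalization; otherwise the status steps
// run, which is where the extra EC2 and instance profile calls come from.
func stepsFor(deleting bool) []string {
	if deleting {
		return []string{"finalize"}
	}
	return []string{"subnets", "securitygroups", "instanceprofile"}
}

func main() {
	ev := NodeClaimEvent{Deleted: true, NodeClassName: "default"}
	for _, name := range enqueueFor(ev) {
		fmt.Println(name, stepsFor(false))
	}
}
```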

How was this change tested?
make presubmit and E2E testing via CI as part of this PR.

Does this change impact docs?

  • Yes, PR includes docs updates
  • Yes, issue opened: #
  • No

By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.

…ce conditions from leaking instance profiles


netlify bot commented Jan 14, 2025

Deploy Preview for karpenter-docs-prod ready!

Name Link
🔨 Latest commit c45d481
🔍 Latest deploy log https://app.netlify.com/sites/karpenter-docs-prod/deploys/6786db8ac68b150008d5b17d
😎 Deploy Preview https://deploy-preview-7597--karpenter-docs-prod.netlify.app


coveralls commented Jan 14, 2025

Pull Request Test Coverage Report for Build 12776979373

Details

  • 51 of 64 (79.69%) changed or added relevant lines in 3 files are covered.
  • No unchanged relevant lines lost coverage.
  • Overall coverage increased (+0.05%) to 64.983%

Changes Missing Coverage | Covered Lines | Changed/Added Lines | %
pkg/controllers/controllers.go | 0 | 1 | 0.0%
pkg/controllers/nodeclass/instanceprofile.go | 3 | 5 | 60.0%
pkg/controllers/nodeclass/controller.go | 48 | 58 | 82.76%

Totals
Change from base Build 12754714725: 0.05%
Covered Lines: 5775
Relevant Lines: 8887

💛 - Coveralls

@saurav-agarwalla saurav-agarwalla marked this pull request as ready for review January 14, 2025 18:08
@saurav-agarwalla saurav-agarwalla requested a review from a team as a code owner January 14, 2025 18:08
Contributor Author

@saurav-agarwalla saurav-agarwalla left a comment


/karpenter snapshot

Contributor

Snapshot successfully published to oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter:0-37580bfaf1e45c87f7420907729735e065aa39a6.
To install, you must log in to the ECR repo with an AWS account:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-37580bfaf1e45c87f7420907729735e065aa39a6" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

Contributor Author

@saurav-agarwalla saurav-agarwalla left a comment


/karpenter snapshot

…inations controllers are merged into a single nodeclass controller for 1.2.0+
Contributor

Snapshot successfully published to oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter:0-6935fcba4c6b910e01181b0629d6895a33c19fc3.
To install, you must log in to the ECR repo with an AWS account:

aws ecr get-login-password --region us-east-1 | docker login --username AWS --password-stdin 021119463062.dkr.ecr.us-east-1.amazonaws.com

helm upgrade --install karpenter oci://021119463062.dkr.ecr.us-east-1.amazonaws.com/karpenter/snapshot/karpenter --version "0-6935fcba4c6b910e01181b0629d6895a33c19fc3" --namespace "kube-system" --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.interruptionQueue=${CLUSTER_NAME}" \
  --set controller.resources.requests.cpu=1 \
  --set controller.resources.requests.memory=1Gi \
  --set controller.resources.limits.cpu=1 \
  --set controller.resources.limits.memory=1Gi \
  --wait

Contributor

@engedaam engedaam left a comment


LGTM 🚀

@saurav-agarwalla saurav-agarwalla merged commit b320ff1 into aws:main Jan 14, 2025
17 checks passed
@saurav-agarwalla saurav-agarwalla deleted the instance-profile-leak branch January 14, 2025 22:04
edibble21 pushed a commit to edibble21/karpenter-provider-aws that referenced this pull request Jan 22, 2025