
KEP-4471: kubeadm: make a control-plane's kubelet talk to the local API Server on kubeadm join. #4496

Merged: 1 commit, May 7, 2024

Conversation

chrischdi (Member)

  • One-line PR description: adding new KEP
  • Other comments:

@k8s-ci-robot added the cncf-cla: yes label on Feb 8, 2024
@k8s-ci-robot added the kind/kep, sig/cluster-lifecycle, and size/XL labels on Feb 8, 2024
@chrischdi (Member Author)

cc @neolit123 @fabriziopandini @sbueringer :-) PTAL

@neolit123 (Member) left a comment

👍
I will have a look tomorrow.

@chrischdi (Member Author)

FYI: I'm on PTO for the next few days and will be back on the 15th. I will start addressing feedback then :-)

@sbueringer (Member)

Sounds reasonable overall, but Fabrizio and Lubomir are way more familiar with the kubeadm details.

@neolit123 (Member) left a comment

Thanks @chrischdi.
I did a first pass and the KEP looks good, minus some typos and a few missing minor clarifications.

@neolit123 (Member) left a comment

/cc @pacoxu @SataQiu
PTAL when you have time. It might be a bit too late for us to apply this in 1.30, but if there are no objections we can try.

@SataQiu (Member)

SataQiu commented Feb 25, 2024

Pointing the control-plane's kubelet to the local kube-apiserver may cause some stability issues. When the kube-apiserver fails, the local kubelet will crash, the node will become NotReady, and the entire control-plane node cannot handle new Pods. However, using the LB will not cause the entire control-plane node to become NotReady; only some components will not work well.
The second point is that upgrading the kubelet binary version seems to be under user control; kubeadm only modifies the kubelet configuration and does not immediately update the kubelet binary version. https://kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/#upgrade-kubelet-and-kubectl

When kubeadm join is executed, version skew does not occur; it only occurs during the upgrade phase.

@chrischdi (Member Author)

> Pointing the control-plane's kubelet to the local kube-apiserver may cause some stability issues. When the kube-apiserver fails, the local kubelet will crash, the node will become NotReady, and the entire control-plane node cannot handle new Pods.

I agree, I will add that information.
One thought: when using kubeadm, the kube-apiserver runs as a static pod anyway, so shouldn't the kubelet be able to bring it up again?

An improvement may be to only use the local endpoint during the TLS bootstrap and have kubeadm change the config back to the load balancer afterwards 🤔
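
For illustration, a minimal sketch of what the two variants amount to in the kubelet's kubeconfig (the file path is kubeadm's default; the addresses are placeholders, not values from the KEP):

```bash
# Inspect which API endpoint this node's kubelet currently talks to.
grep 'server:' /etc/kubernetes/kubelet.conf
# Today (control-plane endpoint, typically an LB/VIP):
#   server: https://<control-plane-endpoint>:6443
# With the proposed change on a joining control-plane node (local kube-apiserver):
#   server: https://<local-advertise-address>:6443
```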

> However, using the LB will not cause the entire control-plane node to become NotReady; only some components will not work well. The second point is that upgrading the kubelet binary version seems to be under user control; kubeadm only modifies the kubelet configuration and does not immediately update the kubelet binary version. kubernetes.io/docs/tasks/administer-cluster/kubeadm/kubeadm-upgrade/#upgrade-kubelet-and-kubectl

> When kubeadm join is executed, version skew does not occur; it only occurs during the upgrade phase.

In the case of immutable upgrades (e.g. Cluster API), where you update the whole image and the image contains the binary versions, correctly following the version skew policy means you would have to (example: v1.x -> v1.y):

  • have one image A which has v1.y kubeadm + v1.x kubelet
  • have one image B which has v1.y kubeadm + v1.y kubelet

And upgrade via:

  1. rolling update control-plane nodes to image A
  2. rolling update control-plane nodes to image B

Everyone nowadays skips step 1 with CAPI (because normally it works).

@neolit123 (Member)

> Pointing the control-plane's kubelet to the local kube-apiserver may cause some stability issues. When the kube-apiserver fails, the local kubelet will crash, the node will become NotReady, and the entire control-plane node cannot handle new Pods.

> I agree, I will add that information.
> One thought: when using kubeadm, the kube-apiserver runs as a static pod anyway, so shouldn't the kubelet be able to bring it up again?

This is something that I mentioned/asked on the kubeadm office hours Zoom call we had.
I think the new way is less HA and less resilient to apiserver component failures.

Currently, with the kubelet pointing at the LB, if the local apiserver fails (e.g. due to an internal apiserver bug) the kubelet will still reach a healthy kube-apiserver and the Node will remain healthy, while in parallel it will try to restart the local failing pod. With the new way, if the apiserver fails the node will become NotReady.

> An improvement may be to only use the local endpoint during the TLS bootstrap and have kubeadm change the config back to the load balancer afterwards 🤔

Bootstrapping against the local apiserver makes sense.
But if the kubelet then points to the LB it can still violate the skew policy, which according to the prior discussions with api-machinery can cause problems. We haven't seen such problems yet, mainly because of k8s kubelet / apiserver stability.

I was thinking that maybe we can let the user control this behavior with an option in the config API. I personally don't like this idea, but it can be added in v1beta4 if people vote +1 for it.

Alternatively, the default behavior can be to point to the local endpoint, and users can manually edit the kubeconfig to point to the LB if they want more HA.

@neolit123 requested a review from neolit123 and removed the review request for CecileRobertMichon on April 29, 2024
@neolit123 (Member)

neolit123 commented Apr 29, 2024

@SataQiu @pacoxu any more comments here?

> I was thinking that maybe we can let the user control this behavior with an option in the config API. I personally don't like this idea, but it can be added in v1beta4 if people vote +1 for it.

Still not a fan of making it configurable, as pointing to the LB can violate the kubelet / apiserver skew, but to make it configurable by kubeadm we can do one of these things:

  1. Add a new patch target kubeletkubeconfig. The problem with this is that the user will likely only use this patch target for changing server, and it will be a rare use case.
  2. Add a clusterconfiguration.pointKubeletAtControlPlaneEndpoint field, false by default (sketched after this comment); still a rare use case.
  3. Document somewhere (maybe in the kubeadm HA docs at k/website) that users can edit a node's kubelet.conf kubeconfig file to point to the LB, and kubeadm or kubelet will not modify the contents of this file for e.g. cert rotation, i.e. the server value will persist.

My vote goes to 3.
WDYT?

EDIT: and yes, we should still have a feature gate while the feature graduates.
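
A rough sketch of what option 2 might look like, purely for illustration; the field name comes from the comment above and is hypothetical, not an implemented kubeadm API field:

```bash
# Hypothetical kubeadm config sketch for option 2 (NOT implemented).
cat <<'EOF' > kubeadm-config.yaml
apiVersion: kubeadm.k8s.io/v1beta4
kind: ClusterConfiguration
# hypothetical knob: false (default) = the kubelet talks to the local kube-apiserver;
# true = the kubelet keeps pointing at the controlPlaneEndpoint (LB)
pointKubeletAtControlPlaneEndpoint: false
EOF
```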

@SataQiu (Member)

SataQiu commented Apr 29, 2024

Also vote for 3.
When designing the kubeadm upgrade apply phases, we should support skipping the kubelet upgrade phase.
Then users can upgrade the kubelet version after ensuring that all control-plane apiserver instances have been upgraded.

Just like kubeadm upgrade node --skip-phases kubelet-config

FYI: It seems that kubeadm just configures the kubelet but does not restart the kubelet service. We can state in the documentation that users should restart the kubelet service, in turn, after all the control-plane instances have been upgraded.
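
A minimal sketch of that flow; the skip-phases command is the one quoted above, and the binary upgrade itself is distro-specific and left out:

```bash
# On each control-plane node: upgrade the node but skip the kubelet-config phase.
kubeadm upgrade node --skip-phases kubelet-config

# Later, once ALL control-plane kube-apiservers run the new version,
# upgrade the kubelet binary (distro-specific) and restart it, one node at a time.
systemctl restart kubelet
```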

@neolit123 (Member)

> Also vote for 3.

👍

> When designing the kubeadm upgrade apply phases, we should support skipping the kubelet upgrade phase.
> Then users can upgrade the kubelet version after ensuring that all control-plane apiserver instances have been upgraded.

Upgrade does not touch "/etc/kubernetes/kubelet.conf" today, so if the user modifies the "server" value in there it will persist.

> FYI: It seems that kubeadm just configures the kubelet but does not restart the kubelet service. We can state in the documentation that users should restart the kubelet service, in turn, after all the control-plane instances have been upgraded.

Yes, a kubelet restart will be required.
They have to do it one time after changing the value of "server".
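
A minimal sketch of that one-time manual opt-in to the LB, assuming kubeadm's default file path and port; the endpoint is a placeholder:

```bash
# Point this node's kubelet back at the load-balanced control-plane endpoint,
# then restart the kubelet once so it picks up the new server.
sed -i 's#server: https://.*#server: https://<control-plane-endpoint>:6443#' /etc/kubernetes/kubelet.conf
systemctl restart kubelet
```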

@pacoxu (Member)

pacoxu commented Apr 30, 2024

> 3. Document somewhere (maybe in the kubeadm HA docs at k/website) that users can edit a node's kubelet.conf kubeconfig file to point to the LB, and kubeadm or kubelet will not modify the contents of this file for e.g. cert rotation, i.e. the server value will persist.

IIUC, the kubelet.conf change is a behavior change or a bugfix.

  • Do we need to notify cluster-api or other kubeadm adapters about this?

I vote for 2 and 3.
3 may be the right direction, but 2 is also a good choice for me.

@neolit123 (Member)

> 3. Document somewhere (maybe in the kubeadm HA docs at k/website) that users can edit a node's kubelet.conf kubeconfig file to point to the LB, and kubeadm or kubelet will not modify the contents of this file for e.g. cert rotation, i.e. the server value will persist.

> IIUC, the kubelet.conf change is a behavior change or a bugfix.

We can say it's a bug fix for potential skew problems that introduces a change in behavior.

> • Do we need to notify cluster-api or other kubeadm adapters about this?

> I vote for 2 and 3. 3 may be the right direction, but 2 is also a good choice for me.

We could just document it as our first choice (3), but if users really need more flexibility, we can think about alternatives like 2 or something else.

@pacoxu (Member)

pacoxu commented Apr 30, 2024

> Do we need to notify cluster-api or other kubeadm adapters about this?
> I vote for 2 and 3. 3 may be the right direction, but 2 is also a good choice for me.
> We could just document it as our first choice (3), but if users really need more flexibility, we can think about alternatives like 2 or something else.

3 is OK to go. +1
2 can be a later choice if many users really want it.

@neolit123 (Member) left a comment

Please squash all commits, then lgtm.
/approve

@k8s-ci-robot (Contributor)

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: chrischdi, neolit123

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@k8s-ci-robot added the approved label on May 7, 2024
@neolit123 (Member)

/lgtm

@k8s-ci-robot added the lgtm label on May 7, 2024
@k8s-ci-robot merged commit 060cedf into kubernetes:master on May 7, 2024
4 checks passed
@k8s-ci-robot added this to the v1.31 milestone on May 7, 2024
@chrischdi deleted the pr-kubeadm-4471 branch on May 10, 2024
Labels
  • approved - Indicates a PR has been approved by an approver from all required OWNERS files.
  • cncf-cla: yes - Indicates the PR's author has signed the CNCF CLA.
  • kind/kep - Categorizes KEP tracking issues and PRs modifying the KEP directory.
  • lgtm - "Looks good to me", indicates that a PR is ready to be merged.
  • sig/cluster-lifecycle - Categorizes an issue or PR as relevant to SIG Cluster Lifecycle.
  • size/XL - Denotes a PR that changes 500-999 lines, ignoring generated files.