fix: reflect pod failed creates on the chaperon status #227

Open · marwanad wants to merge 3 commits into master from pod-chaperon-terminal-state
Conversation

marwanad (Contributor)

This fixes a bug where pod chaperons in a target cluster can delay the scheduling loop: if their pods fail to create, no status is ever set on the chaperon.

This change sets a PodScheduled condition with reason PodFailedCreate on the chaperon and checks for it in the proxy filter step. The pod creation is requeued and retried; upon success, the chaperon status inherits the pod's status as before.
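
A rough sketch of the filter-side check described above, assuming the chaperon status mirrors corev1.PodStatus (as the quoted filter code later in this thread suggests); the package, constant, and function names are illustrative, not the PR's actual diff:

    package proxy

    import corev1 "k8s.io/api/core/v1"

    // ReasonPodFailedCreate mirrors the condition reason named in the PR
    // description (the constant itself is an assumption of this sketch).
    const ReasonPodFailedCreate = "PodFailedCreate"

    // failedCreate reports whether a chaperon carries a PodScheduled=False
    // condition with reason PodFailedCreate, so the proxy filter step can react
    // to a failed candidate pod create instead of waiting on an empty status.
    func failedCreate(conditions []corev1.PodCondition) bool {
        for _, c := range conditions {
            if c.Type == corev1.PodScheduled &&
                c.Status == corev1.ConditionFalse &&
                c.Reason == ReasonPodFailedCreate {
                return true
            }
        }
        return false
    }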

@marwanad marwanad force-pushed the pod-chaperon-terminal-state branch from a64ae22 to d377bf1 on September 18, 2024 02:43
adrienjt (Contributor) left a comment:

This could use an e2e test.

pkg/controllers/chaperon/controller.go (outdated review thread, resolved)
    if podChaperon.Status.Phase == "" {
        return true
    }
    for _, condition := range podChaperon.Status.Conditions {
adrienjt (Contributor):

I don't understand. If the candidate pod creation failed, why would podChaperon.Status.Phase not be empty?

marwanad (Contributor, Author):

So prior to this change, the following happens (condensed in the sketch below):

  1. Attempt to create the pod:
     pod, err = c.kubeclientset.CoreV1().Pods(podChaperon.Namespace).Create(ctx, newPod(podChaperon), metav1.CreateOptions{})
  2. Pod creation fails and the method returns early:
     return nil, fmt.Errorf("cannot create pod for pod chaperon %v", err)
  3. The status-setting code on the chaperon (the if needStatusUpdate { ... } block) never runs because of the early exit above, so the chaperon status is never updated.
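
A condensed, self-contained sketch of that pre-change flow; apart from the quoted Create call, error message, and needStatusUpdate guard, the names and wiring here are assumptions for illustration only:

    package chaperon

    import (
        "context"
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    type controller struct {
        kubeclientset kubernetes.Interface
    }

    // createCandidatePod illustrates the early exit: on a failed create, the
    // method returns before any status is written, so the chaperon status stays
    // empty and the proxy filter keeps waiting on it.
    func (c *controller) createCandidatePod(ctx context.Context, namespace string, desired *corev1.Pod, needStatusUpdate bool) (*corev1.Pod, error) {
        pod, err := c.kubeclientset.CoreV1().Pods(namespace).Create(ctx, desired, metav1.CreateOptions{})
        if err != nil {
            // step 2: early exit; the work item is requeued but nothing below runs
            return nil, fmt.Errorf("cannot create pod for pod chaperon %v", err)
        }
        if needStatusUpdate {
            // step 3: copy the candidate pod's status onto the chaperon
            // (never reached when the create fails)
        }
        return pod, nil
    }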

adrienjt (Contributor):

After this change, only the condition is updated, so the phase should still be empty, right? So line 235 (return true) is never executed. Checking the condition here appears to be unnecessary.

marwanad (Contributor, Author) commented on Oct 6, 2024:

Whoops, I misunderstood your earlier question; good catch! I forgot to also set the phase here. I think it would be reasonable to set the phase to Pending in the failed pod create case, since Failed generally implies container termination (sketched below).
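
A minimal sketch of that suggestion, assuming the chaperon status embeds a corev1.PodStatus; the helper name and exact fields are illustrative rather than the code that eventually landed:

    package chaperon

    import (
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // markPodFailedCreate records a failed candidate pod create on the chaperon
    // status. The phase is set to Pending rather than Failed, since Failed
    // normally implies the containers ran and terminated.
    func markPodFailedCreate(status *corev1.PodStatus, createErr error) {
        status.Phase = corev1.PodPending
        status.Conditions = append(status.Conditions, corev1.PodCondition{
            Type:               corev1.PodScheduled,
            Status:             corev1.ConditionFalse,
            Reason:             "PodFailedCreate",
            Message:            createErr.Error(),
            LastTransitionTime: metav1.Now(),
        })
    }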

pkg/controllers/chaperon/controller.go (outdated review thread, resolved)
@marwanad marwanad force-pushed the pod-chaperon-terminal-state branch 5 times, most recently from 7f2c1b7 to 93a6633 on October 6, 2024 17:27
@marwanad marwanad force-pushed the pod-chaperon-terminal-state branch from 93a6633 to 2db32a6 on October 6, 2024 17:45
@marwanad marwanad requested a review from adrienjt on October 7, 2024
adrienjt (Contributor) commented on Oct 21, 2024:

  1. The e2e test exercises the implementation, not the fact that the delay issue is fixed.
  2. This breaks the invariant that, so far, the pod chaperon status has simply been the candidate pod status. Using the phase and a condition to store a candidate pod creation error feels hacky. Could you store the pod creation error as a chaperon annotation instead (roughly as sketched below)? Sorry to mention that only now.
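
A rough sketch of the annotation-based alternative suggested here; the annotation key and helper name are hypothetical, not an API the project has settled on:

    package chaperon

    import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

    // annotationFailedCreate is an assumed key for this sketch; the real key,
    // if this approach is adopted, would be chosen by the maintainers.
    const annotationFailedCreate = "multicluster.admiralty.io/pod-failed-create"

    // recordFailedCreate stores the candidate pod creation error in the
    // chaperon's metadata, leaving the chaperon status free to keep mirroring
    // the candidate pod status exactly as before.
    func recordFailedCreate(obj metav1.Object, createErr error) {
        annotations := obj.GetAnnotations()
        if annotations == nil {
            annotations = map[string]string{}
        }
        annotations[annotationFailedCreate] = createErr.Error()
        obj.SetAnnotations(annotations)
    }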
