fix: reflect pod failed creates on the chaperon status #227

Open · marwanad wants to merge 3 commits into master from pod-chaperon-terminal-state
Conversation

marwanad (Contributor)

This fixes a bug where pod chaperons in a target cluster can delay the scheduling loop: if their pods fail to create, no status is ever set on the chaperon.

This change sets a PodScheduled condition with reason PodFailedCreate on the chaperon and checks for it in the proxy filter step. The pod creation is requeued and retried; upon success, the chaperon status inherits the pod's status as before.
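
A rough sketch of the filter-side check described above, assuming the chaperon status mirrors corev1.PodStatus (as the quoted filter code later in this thread suggests); the package, constant, and function names are illustrative, not the PR's actual diff:

    package proxy

    import corev1 "k8s.io/api/core/v1"

    // ReasonPodFailedCreate mirrors the condition reason named in the PR
    // description (the constant itself is an assumption of this sketch).
    const ReasonPodFailedCreate = "PodFailedCreate"

    // failedCreate reports whether a chaperon carries a PodScheduled=False
    // condition with reason PodFailedCreate, so the proxy filter step can react
    // to a failed candidate pod create instead of waiting on an empty status.
    func failedCreate(conditions []corev1.PodCondition) bool {
        for _, c := range conditions {
            if c.Type == corev1.PodScheduled &&
                c.Status == corev1.ConditionFalse &&
                c.Reason == ReasonPodFailedCreate {
                return true
            }
        }
        return false
    }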

@marwanad marwanad force-pushed the pod-chaperon-terminal-state branch from a64ae22 to d377bf1 on September 18, 2024 02:43
adrienjt (Contributor) left a comment:

This could use an e2e test.

pkg/controllers/chaperon/controller.go (outdated review thread, resolved)
    if podChaperon.Status.Phase == "" {
        return true
    }
    for _, condition := range podChaperon.Status.Conditions {
adrienjt (Contributor):

I don't understand. If the candidate pod creation failed, why would podChaperon.Status.Phase not be empty?

marwanad (Contributor, Author):

So prior to this change, the following happens (condensed in the sketch below):

  1. Attempt to create the pod:
     pod, err = c.kubeclientset.CoreV1().Pods(podChaperon.Namespace).Create(ctx, newPod(podChaperon), metav1.CreateOptions{})
  2. Pod creation fails and the method returns early:
     return nil, fmt.Errorf("cannot create pod for pod chaperon %v", err)
  3. The status-setting code on the chaperon (the if needStatusUpdate { ... } block) never runs because of the early exit above, so the chaperon status is never updated.
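
A condensed, self-contained sketch of that pre-change flow; apart from the quoted Create call, error message, and needStatusUpdate guard, the names and wiring here are assumptions for illustration only:

    package chaperon

    import (
        "context"
        "fmt"

        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
        "k8s.io/client-go/kubernetes"
    )

    type controller struct {
        kubeclientset kubernetes.Interface
    }

    // createCandidatePod illustrates the early exit: on a failed create, the
    // method returns before any status is written, so the chaperon status stays
    // empty and the proxy filter keeps waiting on it.
    func (c *controller) createCandidatePod(ctx context.Context, namespace string, desired *corev1.Pod, needStatusUpdate bool) (*corev1.Pod, error) {
        pod, err := c.kubeclientset.CoreV1().Pods(namespace).Create(ctx, desired, metav1.CreateOptions{})
        if err != nil {
            // step 2: early exit; the work item is requeued but nothing below runs
            return nil, fmt.Errorf("cannot create pod for pod chaperon %v", err)
        }
        if needStatusUpdate {
            // step 3: copy the candidate pod's status onto the chaperon
            // (never reached when the create fails)
        }
        return pod, nil
    }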

adrienjt (Contributor):

After this change, only the condition is updated, so the phase should still be empty, right? So line 235 (return true) is never executed. Checking the condition here appears to be unnecessary.

marwanad (Contributor, Author) commented on Oct 6, 2024:

Whoops, I misunderstood your earlier question; good catch! I forgot to also set the phase here. I think it would be reasonable to set the phase to Pending in the failed pod create case, since Failed generally implies container termination (sketched below).
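
A minimal sketch of that suggestion, assuming the chaperon status embeds a corev1.PodStatus; the helper name and exact fields are illustrative rather than the code that eventually landed:

    package chaperon

    import (
        corev1 "k8s.io/api/core/v1"
        metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    )

    // markPodFailedCreate records a failed candidate pod create on the chaperon
    // status. The phase is set to Pending rather than Failed, since Failed
    // normally implies the containers ran and terminated.
    func markPodFailedCreate(status *corev1.PodStatus, createErr error) {
        status.Phase = corev1.PodPending
        status.Conditions = append(status.Conditions, corev1.PodCondition{
            Type:               corev1.PodScheduled,
            Status:             corev1.ConditionFalse,
            Reason:             "PodFailedCreate",
            Message:            createErr.Error(),
            LastTransitionTime: metav1.Now(),
        })
    }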

pkg/controllers/chaperon/controller.go (outdated review thread, resolved)
@marwanad marwanad force-pushed the pod-chaperon-terminal-state branch 5 times, most recently from 7f2c1b7 to 93a6633 on October 6, 2024 17:27
@marwanad marwanad force-pushed the pod-chaperon-terminal-state branch from 93a6633 to 2db32a6 on October 6, 2024 17:45
@marwanad marwanad requested a review from adrienjt on October 7, 2024
adrienjt (Contributor) commented on Oct 21, 2024:

  1. The e2e test exercises the implementation, not the fact that the delay issue is fixed.
  2. This breaks the invariant that, so far, the pod chaperon status has simply been the candidate pod status. Using the phase and a condition to store a candidate pod creation error feels hacky. Could you store the pod creation error as a chaperon annotation instead (roughly as sketched below)? Sorry to mention that only now.
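
A rough sketch of the annotation-based alternative suggested here; the annotation key and helper name are hypothetical, not an API the project has settled on:

    package chaperon

    import metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"

    // annotationFailedCreate is an assumed key for this sketch; the real key,
    // if this approach is adopted, would be chosen by the maintainers.
    const annotationFailedCreate = "multicluster.admiralty.io/pod-failed-create"

    // recordFailedCreate stores the candidate pod creation error in the
    // chaperon's metadata, leaving the chaperon status free to keep mirroring
    // the candidate pod status exactly as before.
    func recordFailedCreate(obj metav1.Object, createErr error) {
        annotations := obj.GetAnnotations()
        if annotations == nil {
            annotations = map[string]string{}
        }
        annotations[annotationFailedCreate] = createErr.Error()
        obj.SetAnnotations(annotations)
    }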
