
Add Kueue support for Kubeflow Notebook #3352

Closed · 2 of 3 tasks

varshaprasad96 opened this issue Oct 28, 2024 · 59 comments
Labels
kind/feature Categorizes issue or PR as related to a new feature.


@varshaprasad96 (Member) commented Oct 28, 2024

What would you like to be added:
Enable Kueue to manage Kubeflow Notebook CRs (https://www.kubeflow.org/docs/components/notebooks/api-reference/notebook-v1/).

Why is this needed:
Notebooks themselves can be resource heavy (for example, when GPU-enabled), and having Kueue manage them like any other workload on the cluster would be helpful.

Open Questions
I've been working on a PoC to enable this. However, it looks like the underlying implementation of a Notebook resource is a Pod: the spec of the Notebook CR only defines a PodSpec template, without a suspend field.
Does implementing a mechanism similar to the pod integration sound reasonable, wherein we add a scheduling gate with a default webhook and ungate the pod once it is admitted? Or could the Notebook introduce a suspend field, to, say, expose the functionality of pausing/stopping the underlying Jupyter server, which Kueue could then control (not sure if this would be an ask from the NB users)?
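
(For illustration, a minimal sketch of the gating approach - the gate name is the one used by Kueue's pod integration; the pod name and image are hypothetical:)

# Sketch: a Kueue-gated Notebook pod under the pod-integration approach.
# Kueue's webhook injects the scheduling gate; the pod stays unschedulable
# until Kueue admits the workload and removes the gate.
apiVersion: v1
kind: Pod
metadata:
  name: notebook-sample-0              # hypothetical name
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  schedulingGates:
    - name: kueue.x-k8s.io/admission
  containers:
    - name: notebook
      image: jupyter/base-notebook     # hypothetical image
      resources:
        limits:
          nvidia.com/gpu: "1"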

This enhancement requires the following artifacts:

  • Design doc
  • API change
  • Docs update

The artifacts should be linked in subsequent comments.

varshaprasad96 added the kind/feature label on Oct 28, 2024
@varshaprasad96 (Member, Author)

cc: @alculquicondor @tenzen-y

@varshaprasad96 (Member, Author) commented Oct 28, 2024

Looking into the issue - it seems a stop feature does exist in the v1beta1 API (hub version) of Notebooks, where an annotation can be set to scale the replica count down to 0 (https://github.com/kubeflow/kubeflow/blob/ec82fbf58b79cff529d948b96e44ffd06bdfe679/components/notebook-controller/controllers/notebook_controller.go#L363).

@kannon92 (Contributor)

Can you get this working with the pod integration?

I think suspend and first-class support for the workload would be the best long-term option, but for a PoC and a first integration, I wonder whether it would be helpful to see if this can work with the pod integration.

My main reasoning for going with first-class support would be in case the underlying objects change (Pod to Job, etc.).

@varshaprasad96 (Member, Author) commented Oct 29, 2024

I think suspend and first-class support for the workload would be the best long-term option, but for a PoC and a first integration, I wonder whether it would be helpful to see if this can work with the pod integration.

Hey @kannon92, the idea is to avoid enabling the pod integration directly, as that previously had drastic effects on the platform: it managed non-notebook pods in the same namespace and could not retry bringing up the failed ones. That is why the idea is to enable first-class support for the NB API directly and not expose the pod integration on Kueue (at least not intentionally for customers).

@kannon92 (Contributor)

So if you don't want the pod integration, then I think adding a suspend field to the notebook controller would be the option.

Have you brought up Kueue integration with the Kubeflow community?

@tenzen-y (Member)

@varshaprasad96 Thank you for raising this issue. Starting with v0.9.0, we are adding StatefulSet support. Could you verify whether the Kueue StatefulSet integration resolves your issue?

@tenzen-y (Member)

We have already included the StatefulSet integration in the RC version: https://github.com/kubernetes-sigs/kueue/releases/tag/v0.9.0-rc.1

@kannon92 (Contributor) commented Oct 29, 2024

@tenzen-y I am confused by your StatefulSet suggestion. From what I can tell, and from what @varshaprasad96 mentions, it seems that the notebook controller is submitting pods. Is there a StatefulSet integration with Kubeflow notebooks?

Edit: Never mind - I see from https://github.com/kubeflow/kubeflow/blob/ec82fbf58b79cff529d948b96e44ffd06bdfe679/components/notebook-controller/controllers/notebook_controller.go#L138 that StatefulSets are created from the pod spec template.

@tenzen-y (Member)

@tenzen-y I am confused by your StatefulSet suggestion. From what I can tell, and from what @varshaprasad96 mentions, it seems that the notebook controller is submitting pods. Is there a StatefulSet integration with Kubeflow notebooks?

Edit: Never mind - I see from https://github.com/kubeflow/kubeflow/blob/ec82fbf58b79cff529d948b96e44ffd06bdfe679/components/notebook-controller/controllers/notebook_controller.go#L138 that StatefulSets are created from the pod spec template.

Yeah, AFAIK the notebook instance seems to be created via a StatefulSet. So I'm wondering if we can use the Kueue StatefulSet integration.

@mimowo (Contributor) commented Oct 31, 2024

This might be a good way to make it work for free. The only caveat I see is that the StatefulSet integration does not support resizes yet (not sure if this is needed for notebooks, though).

So it might actually all just work if you ensure the StatefulSet instance has the queue-name label.
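
For example, a minimal sketch of such a StatefulSet (names and image are hypothetical; only the queue-name label is Kueue-specific):

# Sketch: a StatefulSet opted into Kueue via the queue-name label.
# The label goes on metadata.labels, not on the pod template.
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: notebook-sample
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  serviceName: notebook-sample
  replicas: 1
  selector:
    matchLabels:
      app: notebook-sample
  template:
    metadata:
      labels:
        app: notebook-sample
    spec:
      containers:
        - name: notebook
          image: jupyter/base-notebook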

@tenzen-y (Member)

This might be a good way to make it work for free. The only caveat I see is that the StatefulSet integration does not support resizes yet (not sure if this is needed for notebooks, though).

So it might actually all just work if you ensure the StatefulSet instance has the queue-name label.

In general, the Notebook Server is a stateful application, and it is hard to make it HA. So I guess that resizing is not needed.

@thesuperzapper

Hey all, I am one of the maintainers of Kubeflow Notebooks.

But I am not quite sure if I understand how Kueue could be used to improve Notebooks.

Can one of the maintainers of Kueue explain how it relates to non-job workloads that are managed by other controllers?


Also, we are actually working on a Kubeflow Notebooks 2.0 right now, and it's quite a different design. While we still use StatefulSets internally, the outer Notebook CRD has been replaced with a cluster WorkspaceKind and namespaced Workspace resource.

The key difference is that the Workspace CRD is now templated based on the selected WorkspaceKind and is not a wrapper around PodSpec. This will probably solve the problems faced by @varshaprasad96; see here for more info:

@mimowo (Contributor) commented Oct 31, 2024

@thesuperzapper this is great news. PTAL here - the doc describes how to schedule StatefulSets with Kueue (requires the main version of Kueue, which can be installed as described here: https://kueue.sigs.k8s.io/docs/installation/#install-the-latest-development-version)

@varshaprasad96 (Member, Author) commented Oct 31, 2024

Thanks @thesuperzapper @mimowo @tenzen-y for your reply. Replying to the questions below:

But I am not quite sure if I understand how Kueue could be used to improve Notebooks.

Kueue is a workload queueing system primarily aimed at managing batch jobs, with resource quotas as a central feature. While it was initially designed for batch workloads, we've seen interest from users who want to extend Kueue's capabilities to non-batch workloads to leverage its quota management, admission policies, and fair-sharing model.

Notebooks, especially GPU-enabled ones, can demand substantial resources, similar to other ML batch workloads. Managing them through Kueue allows users to schedule Notebooks more efficiently within cluster resources. For reference, there has been work on integrating plain pods, StatefulSets, and Deployments into Kueue's ecosystem, which makes it possible to use Kueue's resource management without duplicating functionality of the native Kubernetes scheduler (https://kueue.sigs.k8s.io/docs/tasks/run/plain_pods/).

The way we could expect the NB integration to work would be (similar to Ray clusters): the individual NB pods are managed by the nb-controller, but the responsibility of admitting an NB pod for scheduling on a node would fall to Kueue, based on the designated quota set by the admin.
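
(For concreteness, the admin-side setup would look roughly like the standard Kueue quota objects below - a minimal sketch with hypothetical names and quantities:)

# Sketch: admin-defined quota that would gate Notebook admission.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: default-flavor
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: notebooks-cq
spec:
  namespaceSelector: {}          # admit workloads from any namespace
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor
          resources:
            - name: cpu
              nominalQuota: "40"
            - name: memory
              nominalQuota: 256Gi
            - name: nvidia.com/gpu
              nominalQuota: "8"
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: user-queue
  namespace: notebook
spec:
  clusterQueue: notebooks-cq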

Notebook Server is a Stateful application, and it is hard to make it HA. So, I guess that resizing is not needed.

I tried this by enabling the SS integration and using it to manage NBs (with the v1beta APIs for now), and it works well. But the issue here is that the replica count is immutable, so if a user wants to use the stop/pause feature in NB by adding annotations, that doesn't work. Once paused, the validating webhook does not allow the spec to be changed, which means a stopped NB cannot be restarted.

Also, we are actually working on a Kubeflow Notebooks 2.0 right now, and it's quite a different design.

Thanks for pointing this out, I'll look into the v2 API. The major requirement here seems to be a suspend-equivalent field in the NB API that Kueue can manage to admit/preempt resources. But there is a caveat: Notebooks are long-running workloads containing user-facing data, so if Kueue preempts and deletes notebooks at any time, especially in the absence of backups or persistent storage, this could lead to data loss and unexpected disruptions for users, which is not ideal.

The question becomes how we intend the lifecycle of a Notebook resource to work, considering that it's going to be a non-batch workload (regardless of the underlying implementation). I'm not sure what a reasonable solution would be for letting Kueue "suspend" Notebooks without completely killing the underlying Pod.

@mimowo (Contributor) commented Nov 4, 2024

@varshaprasad96 thanks for the summary!

I tried this by enabling the SS integration and using it to manage NBs (with the v1beta APIs for now), and it works well. But the issue here is that the replica count is immutable, so if a user wants to use the kubeflow/kubeflow#4857 (comment) that doesn't work. Once paused, the validating webhook does not allow the spec to be changed, which means a stopped NB cannot be restarted.

Interesting - maybe a small fix somewhere is possible? Which project's validating webhook is blocking that?

@varshaprasad96 (Member, Author)

Interesting - maybe a small fix somewhere is possible? Which project's validating webhook is blocking that?

Yes, it's in Kueue - within the StatefulSet webhook:

// Kueue's StatefulSet webhook treats spec.replicas as immutable
// and rejects any update that changes it:
allErrs = append(allErrs, apivalidation.ValidateImmutableField(
	newStatefulSet.Spec.Replicas,
	oldStatefulSet.Spec.Replicas,
	statefulsetReplicasPath,
)...)

@mimowo (Contributor) commented Nov 5, 2024

Hm, but then I'm not sure I understand "which means a stopped NB cannot be restarted". We base the StatefulSet integration on the PodGroup support, which means that when Kueue evicts a StatefulSet, the PodGroup gets evicted and the newly created pods are gated (until the workload is unsuspended). So we use gating pods rather than a "suspend" field for the StatefulSet, Deployment, and soon LWS integrations.

So I imagine the workflow does not require modifying the StatefulSet spec, unless I'm missing something (and I didn't test any of that).

@varshaprasad96 (Member, Author) commented Nov 5, 2024

So I imagine the workflow does not require modifying the StatefulSet spec, unless I'm missing something (and I didn't test any of that).

This is actually done by the notebook-controller: when the user wants to pause an NB, they add an annotation to the respective Notebook resource, e.g.:

kubectl annotate notebook/<name> kubeflow-resource-stopped="true" -n notebook
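
(For completeness: a stopped Notebook is resumed by removing the same annotation again, using kubectl's standard trailing-dash removal syntax.)

kubectl annotate notebook/<name> kubeflow-resource-stopped- -n notebook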

The notebook-controller then kicks in and scales down the replica count in the StatefulSet, and the validating webhook from Kueue denies the request:

	/go/pkg/mod/sigs.k8s.io/[email protected]/pkg/internal/controller/controller.go:227
1.730850184723808e+09	ERROR	controller.notebook	Reconciler error	{"reconciler group": "kubeflow.org", "reconciler kind": "Notebook", "name": "notebook-sample-v1", "namespace": "notebook", "error": "admission webhook \"vstatefulset.kb.io\" denied the request: spec.replicas: Invalid value: 0: field is immutable"}

This means that the stop/pause feature via annotations would not be available to users.

The above was tested using the v1 APIs of NB. I'm not sure yet whether it's the same with the v2 APIs.

@mimowo (Contributor) commented Nov 6, 2024

I see, thanks for sharing the info. So, in this scenario we always scale down to 0, then scale back up from 0 to full size?

If that's the case, I think we could support this special case relatively easily in the integration by removing the entire pod group and recreating it.

@akram (Contributor) commented Nov 6, 2024

Hi @mimowo,
I will start working with @varshaprasad96 on this topic. We were indeed discussing the impact of scaling down to 0 yesterday and wanted to test it - if I am not mistaken, to see whether it could work around the issue.

@mimowo (Contributor) commented Nov 6, 2024

Sounds great - feel free to investigate and, if possible, send a PR.

We already planned some scaling support in #3279, but that issue has a somewhat bigger scope: scaling down and up by recreating the PodGroup. Your case might be simpler (as a special case), but maybe not by much. In both cases the PodGroup size is considered immutable.

cc @mwielgus

Edit: If we can support it without API changes then I would be leaning to include it in 0.9.1, but let's see.

@tenzen-y (Member) commented Nov 6, 2024

Edit: If we can support it without API changes then I would be leaning to include it in 0.9.1, but let's see.

In that case, I would like to add it to v0.10, since v0.10 will be released in early December, right?

@mimowo (Contributor) commented Nov 6, 2024

Yes, an early release of 0.10 is another possibility; we can keep both options open for now and see how things go.

@tenzen-y (Member) commented Nov 6, 2024

Yes, an early release of 0.10 is another possibility; we can keep both options open for now and see how things go.

As far as I know, patch versions include only bug fixes, but this request looks like an enhancement (scaling support).

@mimowo (Contributor) commented Nov 6, 2024

We did cherry-pick small features in the past when they didn't impact the API (which is potentially the case here) - see the release notes for 0.7.1 or 0.8.1 - but I agree the early release of 0.10 is the cleanest path.

@tenzen-y (Member) commented Nov 6, 2024

We did cherry-pick small features in the past when they didn't impact the API (which is potentially the case here) - see the release notes for 0.7.1 or 0.8.1 - but I agree the early release of 0.10 is the cleanest path.

It would be better not to include new enhancements in a patch release, to avoid introducing additional bugs into the minor version. If we want to ship an enhancement, I would recommend cutting the new minor release early.

https://github.com/kubernetes/community/blob/master/contributors/devel/sig-release/cherry-picks.md#what-kind-of-prs-are-good-for-cherry-picks

@mimowo (Contributor) commented Nov 7, 2024

@akram @varshaprasad96 I know that also @mbobrovskyi is going to work on the solution to scaling under: #3279.

The idea we discussed with @mbobrovskyi is that for StatefulSet we will have an extra controller: when the controller detects that the size has changed (i.e. sees a replica count different from the one in the Workload object), it will remove the old PodGroup and create a new one. In the special case of scaling down to 0, it will just delete the old PodGroup, and we will create a new PodGroup when the StatefulSet is scaled back above 0.

It would be great if you could help reviewing and testing the approach.

@varshaprasad96 (Member, Author) commented Nov 7, 2024

Thanks @mimowo and @mbobrovskyi. We will keep an eye and verify if it matches our needs.

One question on the NB side:
We insist that the underlying NB pods use persistent volumes to store data. In case a Notebook is preempted, should the notebook-controller be modified to add a finalizer that performs backups?

Alternatively, is it reasonable to assume that, when Kueue is managing NBs based on cluster quotas (with the potential risk of preempting NB workloads), the responsibility falls to the admin and user to:

  1. Ensure sufficient quota is allocated for Notebooks to reduce preemption risk.
  2. Configure NB workloads with the highest priority to further minimize the likelihood of preemption (see the sketch after this list).
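
For (2), a minimal sketch of what that could look like, assuming Kueue's WorkloadPriorityClass API (the name and value are hypothetical); it would be referenced from the workload via the kueue.x-k8s.io/priority-class label:

# Sketch: a high-value priority class for Notebook workloads.
apiVersion: kueue.x-k8s.io/v1beta1
kind: WorkloadPriorityClass
metadata:
  name: notebook-high
value: 10000                      # higher values are preempted last
description: "Interactive Notebook workloads"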

@thesuperzapper could you please share your thoughts on this? Is the v2 design considering backups along with the culling feature?

@mimowo (Contributor) commented Nov 8, 2024

We insist that the underlying NB pods use persistent volumes to store data. In case a Notebook is preempted, should the notebook-controller be modified to add a finalizer that performs backups?

TBH this is out of scope for the Kueue integrations (at least for now - we don't support checkpointing natively in Kueue for any integration, Jobs, etc.). I guess a finalizer is an option, or you may consider a long enough spec.terminationGracePeriodSeconds on the pod template.
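
A minimal sketch of that pod-template fragment (the value is hypothetical - long enough for the Jupyter server to flush state to its persistent volume):

# Fragment of the Notebook pod template:
spec:
  terminationGracePeriodSeconds: 600   # 10 minutes of graceful shutdown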

In the approach I discussed with @mbobrovskyi, Kueue does not delete the Pods - they are fully managed by the StatefulSet. We just remove the Kueue finalizer to let the pod go, which should allow you to use another finalizer or control graceful deletion.

You may want to test this PR: #3487.

FYI @tenzen-y: since the support requires changes to RBAC, I think 0.9.1 is out of the question anyway. We will aim to release the feature in 0.10.

@xiongzubiao

@mimowo Thanks for the explanation. Yes, #3487 sounds like a more general solution. I will give it a try!

@andreyvelich commented Nov 15, 2024

Hi folks, just dropping a few ideas here on how Kueue can be integrated with Jupyter Notebooks.

Jupyter is capable of running remote Kernels via a gateway. One example is using the Enterprise Gateway to provision remote Kernels: https://github.com/jupyter-server/enterprise_gateway (cc @lresende @Zsailer)
In that case, the remote Kernel represents the runtime that is attached to the Notebook/Text Editor to execute the user's commands.
The Kernel can be as simple as an IPython process, a Spark cluster with an IPython process running on the driver, or the Almond Kernel for Scala Spark: https://github.com/almond-sh/almond. Additionally, users can trigger "derivative" jobs (e.g. TrainJob, JobSet, TFJob) from the Python Kernel.

All of these workloads (Python Kernel, Spark Kernel, or "derivative" workload) can be considered interactive and need higher priority than non-interactive workloads. @shravan-achar can share more on how these interactive workloads should work with queues.

In this scenario, Kueue should work directly with the Jupyter Kernels and "derivative" workloads, not with the stateful Jupyter Servers, since Jupyter Servers don't require expensive compute resources (e.g. GPUs, TPUs).

I understand that today Kubeflow Notebooks doesn't support remote Kernels, but it is something we can discuss in the future (cc @vikas-saxena02).

We talked a little bit about our remote Kernel orchestration in this scheduling talk: https://youtu.be/DdW5WUAvNuY?list=PLj6h78yzYM2OOkGhEJgb3Lx6YWoA3xQl4

@xiongzubiao

@mimowo Thanks for the explanation. Yes, #3487 sounds like a more general solution. I will give it a try!

I can confirm that it works when adding a queue label to the Notebook object. The Notebook can be suspended/resumed by changing the special annotation that the notebook-controller watches.
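
Concretely, that amounts to something like the following (a minimal sketch - the names and image are hypothetical; the label key is Kueue's standard queue-name label):

apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: notebook-sample
  namespace: notebook
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  template:
    spec:
      containers:
        - name: notebook-sample
          image: jupyter/base-notebook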

The only problem is that the Workload object corresponds to the pod, so its active field can't be used to suspend/resume admission the way other job-type workloads support.

@mimowo (Contributor) commented Nov 18, 2024

I can confirm that it works when adding a queue label to the Notebook object. The Notebook can be suspended/resumed by changing the special annotation that the notebook-controller watches.

Awesome!

The only problem is that the Workload object corresponds to the pod, so its active field can't be used to suspend/resume admission the way other job-type workloads support.

Actually, for StatefulSet the Workload object corresponds to the PodGroup, so the active field should suspend/resume the entire group. If this does not work, I believe it is a "fixable" bug rather than a limitation of the mechanism. @mbobrovskyi can you check that?

@xiongzubiao

Actually, for StatefulSet the Workload object corresponds to the PodGroup, so the active field should suspend/resume the entire group.

It doesn't work exactly the same as for a plain StatefulSet. I (as a user) add the queue label to the Notebook object's metadata.labels. But when the notebook-controller creates the backing StatefulSet, it adds the queue label to the StatefulSet's spec.template.metadata.labels, not to its metadata.labels. I think that is why no PodGroup is created.

I am not sure whether this should be fixed on the notebook-controller side or the Kueue side, though.

@mbobrovskyi (Contributor)

It doesn't work exactly the same as for a plain StatefulSet. I (as a user) add the queue label to the Notebook object's metadata.labels. But when the notebook-controller creates the backing StatefulSet, it adds the queue label to the StatefulSet's spec.template.metadata.labels, not to its metadata.labels. I think that is why no PodGroup is created.

This is the correct behaviour: it should add the kueue.x-k8s.io/queue-name and kueue.x-k8s.io/pod-group-name labels and the kueue.x-k8s.io/pod-group-total-count annotation on each pod.
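
That is, each pod in the group ends up with metadata along these lines (a sketch; the values are hypothetical):

metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue
    kueue.x-k8s.io/pod-group-name: notebook-sample
  annotations:
    kueue.x-k8s.io/pod-group-total-count: "1"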

@mimowo (Contributor) commented Nov 18, 2024

But when the notebook-controller creates the backing StatefulSet, it adds the queue label to the StatefulSet's spec.template.metadata.labels, not to its metadata.labels. I think that is why no PodGroup is created.

Yeah, the STS webhook we have in Kueue sets up the PodGroup workload based on metadata.labels (see here).

If the label is set at the PodTemplate level, then you get a workload per pod. This is not good for an STS: in case of preemption, for example, you may lose pod0, which might be problematic.

I am not sure whether this should be fixed on the notebook-controller side or the Kueue side, though.

IIUC, it should be fixed inside the notebook-controller, by setting the label on the metadata.labels of the STS - but feel free to share the relevant code pointers with us to help us better understand it.

@xiongzubiao

IIUC, it should be fixed inside the notebook-controller, by setting the label on the metadata.labels of the STS - but feel free to share the relevant code pointers with us to help us better understand it.

I think this is where the notebook-controller adds the queue label to the spec.template.metadata.labels of the STS (by copying it from the Notebook's metadata.labels):
https://github.com/kubeflow/kubeflow/blob/master/components/notebook-controller/controllers/notebook_controller.go#L392

@mimowo (Contributor) commented Nov 18, 2024

So the approaches I see:

  1. hard-code a special rule to copy the Kueue queue-name label onto metadata.labels instead
  2. copy all the labels into both places
  3. have some form of configurable filter (to avoid hard-coding the Kueue label)

It would be great if someone could prototype one of these to double-check, but it seems to be a decision for the Notebook project at this point.

EDIT: maybe there are more alternatives - like a dedicated webhook for the StatefulSets managed by Notebooks.

@xiongzubiao commented Nov 19, 2024

I am testing with adding the queue-name label to the metadata.labels of the STS. Here are my current observations:

  1. The queue-name and pod-group-name labels and the pod-group-total-count annotation are added to the pod. The Workload object has an is-group-workload annotation set to true. I believe these are expected.
  2. However, when I scale down the STS from 1 to 0, the Workload object is deleted along with the pod. Is this expected?
  3. On the other hand, if I change the active field of the Workload object from true to false (when the STS's replica count is 1; see the patch sketch after this list), the Workload's admitted status changes from true to false accordingly. But the pod is stuck in a terminating state, likely due to the kueue.x-k8s.io/managed finalizer. Is this a bug?
  4. After manually deleting the finalizer on the pod from step 3, the pod disappears correctly. But the Workload object is automatically admitted again, which creates a new pod automatically. That means I can't really stop the STS by setting the Workload's active field.
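
(For step 3, the active field was toggled with a patch along these lines - a sketch, with a hypothetical workload name:)

kubectl patch workload <workload-name> -n notebook --type merge -p '{"spec":{"active":false}}'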

@mbobrovskyi (Contributor) commented Nov 20, 2024

  1. The queue-name and pod-group-name labels and the pod-group-total-count annotation are added to the pod. The Workload object has an is-group-workload annotation set to true. I believe these are expected.

Yeah, that's correct, because the StatefulSet creates the pod group.

  2. However, when I scale down the STS from 1 to 0, the Workload object is deleted along with the pod. Is this expected?

Yes, that's correct. Kueue removes the Workload if there are no Pods in the group.

  3. On the other hand, if I change the active field of the Workload object from true to false (when the STS's replica count is 1), the Workload's admitted status changes from true to false accordingly. But the pod is stuck in a terminating state, likely due to the kueue.x-k8s.io/managed finalizer. Is this a bug?

We can't suspend Pods; we can only remove the pod group and recreate it with a gate. If you set active=false on the Workload, Kueue evicts the Pods, but the process gets stuck because of the finalizers. So this is correct behaviour.

  4. After manually deleting the finalizer on the pod from step 3, the pod disappears correctly. But the Workload object is automatically admitted again, which creates a new pod automatically. That means I can't really stop the STS by setting the Workload's active field.

This is how the pod group works. After you remove the finalizers, all Pods should disappear, and the Workload should be finalized as well. However, the StatefulSet recreates the Pods, and the Workload is recreated as well.

Yes, we need to handle this case and remove the finalizers when active=true is set.

@xiongzubiao

Thank you @mbobrovskyi .

Yes, we need to handle this case and remove the finalizers when active=true is set.

This makes a lot of sense to me now. I think the finalizer should also be removed when the workload is re-admitted from the preempted state.

@xiongzubiao

However, it does feel confusing (or at least a bit strange) that the pod gets stuck in a terminating state when the workload is inactive or preempted, rather than the STS actually being scaled down.

I understand that this is a technical limitation of the pod group method, but why not implement a direct integration for STS (meaning a Workload object corresponding to the STS directly)? STS is Kubernetes-native, so it doesn't seem like too much to have a direct integration for it. It would be more straightforward and much easier to understand, IMHO.

@mimowo (Contributor) commented Nov 21, 2024

Yes, scaling the STS down to 0 is another option, but it would require Kueue to modify the spec.replicas field, and that field might be managed by a third-party autoscaler or a human.

Maybe we should have two modes for the STS suspend support (scaling down to 0, or pod group gating), with the method selected by an annotation. Still, it remains unclear to me whether this complication is needed, so I would first like to make sure there are use cases blocked by the pod group approach.

@xiongzubiao

Fair enough. Should I create a new issue for the finalizer bug in the pod group method?

The finalizer causes the pod to get stuck in a terminating state in the following cases:

  • When the workload is re-activated.
  • When the workload is re-admitted after preemption.
  • When the STS is deleted.

@mimowo (Contributor) commented Dec 17, 2024

Sorry for returning here late - it was quite a busy period before releasing 0.10.

Should I create a new issue for the finalizer bug in the pod group method?

Can you please re-test with Kueue 0.10.0? We made some bug fixes to the StatefulSet integration that may already solve the issue.

Thank you for opening #3851; to my knowledge this is the last remaining known issue.

@varshaprasad96, @xiongzubiao do you think we can close this issue, or is more needed here? Support for the Workload "spec.active" field is already ticketed and actively being worked on.

@xiongzubiao

@mimowo No worries! Yes, I agree that #3851 is the last issue, and I am okay with closing this one.

@mimowo (Contributor) commented Dec 17, 2024

/close
Thank you all for the engagement and fruitful discussions!

@k8s-ci-robot (Contributor)

@mimowo: Closing this issue.

In response to this:

/close
Thank you all for the engagement and fruitful discussions!


@mimowo (Contributor) commented Dec 17, 2024

Actually, it would be great to add a docs page under https://kueue.sigs.k8s.io/docs/tasks/run/ (or even https://kueue.sigs.k8s.io/docs/tasks/run/kubeflow/) to demonstrate how to use Kueue to run Notebooks.
@xiongzubiao @varshaprasad96 are you up for it?

@varshaprasad96 (Member, Author)

+1, and I agree with @xiongzubiao - the integration PoC works well for our internal use case. I can help contribute to the docs!

@mimowo (Contributor) commented Dec 17, 2024

/reopen
Awesome, looking forward to a contributor updating the docs

@k8s-ci-robot (Contributor)

@mimowo: Reopened this issue.

In response to this:

/reopen
Awesome, looking forward to a contributor updating the docs


k8s-ci-robot reopened this on Dec 17, 2024
@xiongzubiao
Copy link

Yes, I am happy to contribute too :-)

@mimowo (Contributor) commented Dec 18, 2024

Cool - actually, let me track the remaining work in a follow-up issue so that the scope is clear: #3878
Feel free to assign.
/close

@k8s-ci-robot (Contributor)

@mimowo: Closing this issue.

In response to this:

Cool - actually, let me track the remaining work in a follow-up issue so that the scope is clear: #3878
Feel free to assign.
/close

