Add Kueue support for Kubeflow Notebook #3352
Comments
Looking into the issues - seems like a [...]
Can you get this working with the pod integration? I think a suspend field and first-class support for the workload would be the best long-term option, but for a PoC and first integration I wonder whether seeing if this can work with the pod integration would be helpful. My main reasoning for going with first-class support would be in case the underlying objects change (Pod to Job, etc.).
Hey @kannon92, the idea is to avoid enabling the pod integration directly, as that previously had drastic effects on the platform: it ended up managing non-notebook pods in the same namespace and could not retry bringing up the failed ones. That is why the idea is to enable first-class support for the NB API directly and not expose the pod integration on Kueue (at least not intentionally for customers).
So if you don't want the pod integration, then I think adding a suspend field to the notebook controller would be the option. Have you brought up the Kueue integration with the Kubeflow community?
@varshaprasad96 Thank you for raising this issue. Starting with v0.9.0, we will support StatefulSets. Could you verify whether the Kueue StatefulSet integration could resolve your issue?
We have already included the StatefulSet integration in the RC version: https://github.com/kubernetes-sigs/kueue/releases/tag/v0.9.0-rc.1
@tenzen-y I am confused by your StatefulSet suggestion. From what I can tell, and from what @varshaprasad96 mentions, it seems that the notebook controller is submitting pods. Is there a StatefulSet integration with Kubeflow Notebooks? Edit: Nvm, I see from https://github.com/kubeflow/kubeflow/blob/ec82fbf58b79cff529d948b96e44ffd06bdfe679/components/notebook-controller/controllers/notebook_controller.go#L138 that StatefulSets are created from the pod spec template.
Yeah, AFAIK the notebook instance seems to be created via a StatefulSet. So I'm wondering if we can use the Kueue StatefulSet integration.
This might be a good way to make it work for free. The only caveat I see is that the StatefulSet integration does not support resizes yet (not sure if this is needed for notebooks, though). So it might actually all just work if you ensure the StatefulSet instance has the queue-name label.
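For illustration, a minimal sketch of such a labeled StatefulSet (the name, namespace, and image are placeholders; kueue.x-k8s.io/queue-name is Kueue's standard queue label):

```yaml
apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: my-notebook                          # placeholder name
  namespace: user-namespace                  # placeholder namespace
  labels:
    kueue.x-k8s.io/queue-name: user-queue    # LocalQueue the workload should be admitted through
spec:
  replicas: 1
  serviceName: my-notebook
  selector:
    matchLabels:
      app: my-notebook
  template:
    metadata:
      labels:
        app: my-notebook
    spec:
      containers:
        - name: notebook
          image: jupyter/base-notebook:latest   # placeholder image
          resources:
            requests:
              nvidia.com/gpu: "1"               # example GPU request counted against Kueue quota
            limits:
              nvidia.com/gpu: "1"
```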
In general, the Notebook Server is a stateful application, and it is hard to make it HA. So I guess that resizing is not needed.
Hey all, I am one of the maintainers of Kubeflow Notebooks, but I am not quite sure I understand how Kueue could be used to improve Notebooks. Can one of the maintainers of Kueue explain how it relates to non-job workloads that are managed by other controllers? Also, we are actually working on Kubeflow Notebooks 2.0 right now, and it's quite a different design. While we still use StatefulSets internally, the outer [...] The key difference is that the [...]
@thesuperzapper this is great news. PTAL here - the doc explains how to schedule StatefulSets with Kueue (it requires the main version of Kueue, which can be installed as described here: https://kueue.sigs.k8s.io/docs/installation/#install-the-latest-development-version)
Thanks @thesuperzapper @mimowo @tenzen-y for your replies. Answering the questions below:
Kueue is a workload queuing system primarily aimed at managing batch jobs, with resource quotas as a central feature. While its initial design targets batch workloads, we've seen interest from users who want to extend Kueue's capabilities to non-batch workloads, to leverage its quota management, admission policies, and fair-sharing model. Notebooks, especially GPU-enabled ones, can demand substantial resources, similar to other ML batch workloads. Managing them through Kueue allows users to schedule Notebooks more efficiently within cluster resources. For reference, there has been work on integrating plain pods, StatefulSets, and Deployments into Kueue's ecosystem, which makes it possible to use Kueue's resource management without duplicating the functionality of the native Kubernetes scheduler (https://kueue.sigs.k8s.io/docs/tasks/run/plain_pods/). The way we would expect the NB integration to work is (similar to Ray clusters): the individual NB pods are managed by the nb-controller, but the responsibility of admitting the NB pod for scheduling on a node would fall to Kueue, based on the quota designated by the admin.
I tried this by enabling the StatefulSet integration and using it to manage NBs (with the v1beta APIs for now), and it works well. But the issue here is that the replica count is immutable, so if a user wants to use the stop/pause feature in NB by adding annotations, that doesn't work. Once paused, the validating webhook does not allow the spec to be changed, which means a stopped NB cannot be restarted.
Thanks for pointing this out, I'll look into the v2 API. The major concern here seems to be a [...] The question comes down to how we intend the lifecycle of a Notebook resource to work, considering the fact that it's going to be a non-batch workload (regardless of the underlying implementation). I'm not sure what a reasonable solution for being able to "suspend" Notebooks via Kueue would be without having to completely kill the underlying Pod.
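To make the "quota designated by the admin" part above concrete, here is a sketch of the admin-side Kueue objects a labeled Notebook workload would be admitted against (all names and quota values are placeholders; it assumes a ResourceFlavor named default-flavor already exists):

```yaml
# Admin-side quota objects that Notebook workloads would be admitted against.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: notebook-cq                 # placeholder name
spec:
  namespaceSelector: {}             # allow all namespaces for simplicity
  resourceGroups:
    - coveredResources: ["cpu", "memory", "nvidia.com/gpu"]
      flavors:
        - name: default-flavor      # assumes this ResourceFlavor exists
          resources:
            - name: cpu
              nominalQuota: 32
            - name: memory
              nominalQuota: 128Gi
            - name: nvidia.com/gpu
              nominalQuota: 4
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: user-queue                  # queue referenced by the kueue.x-k8s.io/queue-name label
  namespace: user-namespace         # placeholder namespace
spec:
  clusterQueue: notebook-cq
```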
@varshaprasad96 thanks for the summary!
Interesting, maybe a small fix somewhere is possible? Which project's validating webhook is blocking that?
Yes, it's in Kueue - within the StatefulSet webhook: kueue/pkg/controller/jobs/statefulset/statefulset_webhook.go, lines 127 to 131 (at b3c53d5).
Hm, but then I'm not sure I understand "which means a stopped NB cannot be restarted". We base the StatefulSet integration on the PodGroup support, which means that when Kueue evicts a StatefulSet, the PodGroup gets evicted and the newly created pods are gated (until the workload is unsuspended). So we use pod gating rather than a "suspend" field for the StatefulSet, Deployment, and soon LWS integrations. So I imagine the workflow does not require modifying the StatefulSet spec, unless I'm missing something (and I didn't test any of that).
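For context, a rough sketch of what such a gated pod looks like while it waits for admission - to my knowledge the scheduling gate below is the one Kueue's pod-based integrations use; the pod name, labels, and image are placeholders:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: my-notebook-0                      # placeholder: a pod created by the StatefulSet
  labels:
    kueue.x-k8s.io/queue-name: user-queue  # propagated from the StatefulSet by the Kueue webhook
spec:
  schedulingGates:
    - name: kueue.x-k8s.io/admission       # removed by Kueue once the Workload is admitted
  containers:
    - name: notebook
      image: jupyter/base-notebook:latest  # placeholder image
```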
This is actually done by the [...]
The notebook-controller kicks in and scales down the replica count in the StatefulSet. The validating webhook by Kueue denies the request:
This means that the stop/pause feature (via annotations) would not be available to users. The above was tested using the v1 APIs of NB. I'm not sure yet if it's the same with the v2 APIs.
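For reference, here is a sketch of how that stop flow is typically triggered with the v1 API - to the best of my knowledge the kubeflow-resource-stopped annotation is what the v1 notebook-controller watches; the name, namespace, timestamp, and image below are placeholders:

```yaml
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: my-notebook                             # placeholder name
  namespace: user-namespace                     # placeholder namespace
  labels:
    kueue.x-k8s.io/queue-name: user-queue       # queue label picked up by the Kueue integration
  annotations:
    kubeflow-resource-stopped: "2024-11-01T00:00:00Z"  # presence of this annotation stops the Notebook
spec:
  template:
    spec:
      containers:
        - name: my-notebook
          image: jupyter/base-notebook:latest   # placeholder image
```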
I see, thanks for sharing the info. So, in this scenario we always scale down to 0, then scale up from 0 to full size? If this is the case, I think we could support this special case relatively easily in the integration by removing the entire pod group and recreating it.
Hi @mimowo, [...]
Sounds great, feel free to investigate and, if possible, send a PR. We already planned some scaling support here: #3279, but that issue has a somewhat bigger scope - scaling down and up by recreating the PodGroup. Your case might be simpler (as a special case), but maybe not by much. In both cases the PodGroup size is considered immutable. cc @mwielgus Edit: If we can support it without API changes then I would lean toward including it in 0.9.1, but let's see.
In that case, I would like to add it to v0.10, since v0.10 will be released in early December, right?
Yes, an early release of 0.10 is another possibility; we can keep both options open for now and see how things go.
As far as I know, patch versions include only bug fixes, but this request looks like an enhancement (scale support).
It would be better not to include a new enhancement in a patch release, to avoid introducing additional bugs into the minor version as much as possible. If we want to release an enhancement, I would recommend cutting a new minor release early.
@akram @varshaprasad96 I know that @mbobrovskyi is also going to work on the solution to scaling under #3279. The idea we discussed with @mbobrovskyi is that for StatefulSet we will have an extra controller; when the controller detects that the size has changed (sees the [...] It would be great if you could help review and test the approach.
Thanks @mimowo and @mbobrovskyi. We will keep an eye on it and verify whether it matches our needs. One question on the NB side: [...] Alternatively, is it reasonable to assume that, when Kueue is managing NBs based on cluster quotas (with the potential risk of preempting NB workloads), the responsibility falls to the admin and user to:
@thesuperzapper could you please share your thoughts on this? Is the v2 design considering backups along with the culling feature?
TBH this is out of scope for Kueue integrations (at least for now; we don't support checkpointing natively in Kueue for any integration, Jobs etc.). I guess a finalizer is an option, or you may consider a long enough [...] In the approach I discussed with @mbobrovskyi, Kueue does not delete the Pods - they are fully managed by the StatefulSet; we just remove the Kueue finalizer to let the pod go, but this should allow you to use another finalizer or to control graceful deletion. You may want to test this PR: #3487. FYI @tenzen-y, since the support requires changes to RBAC, I think 0.9.1 is out of the question anyway. We will aim to release the feature in 0.10.
Hi folks, just dropping a few ideas here on how Kueue can be integrated with Jupyter Notebooks. Jupyter is capable of running remote Kernels via a gateway. One example could be to use the Enterprise Gateway to provision remote Kernels: https://github.com/jupyter-server/enterprise_gateway (cc @lresende @Zsailer). All of those workloads (Python Kernel, Spark Kernel, or "derivative" workloads) can be considered interactive and need higher priority over non-interactive workloads. @shravan-achar can share more on how those interactive workloads should work with queues. In this scenario, Kueue should work directly with the Jupyter Kernels and "derivative" workloads, not with the stateful Jupyter Servers, since Jupyter Servers don't require expensive compute resources (e.g. GPUs, TPUs). I understand that today Kubeflow Notebooks doesn't support remote Kernels, but it is something we can discuss in the future (cc @vikas-saxena02). We talked a little bit about our remote Kernel orchestration in this Scheduling Talk: https://youtu.be/DdW5WUAvNuY?list=PLj6h78yzYM2OOkGhEJgb3Lx6YWoA3xQl4
I confirm that it works when adding a queue label to the Notebook object. The Notebook can be suspended/resumed by changing the special annotation that the notebook-controller watches. The only problem is that the Workload object corresponds to the pod. Its [...]
Awesome!
Actually, for StatefulSet the Workload object corresponds to the PodGroup, and so the active field should suspend/resume the entire group. If this does not work, I believe this is a "fixable" bug rather than a limitation of the mechanism. @mbobrovskyi can you check that?
It doesn't work exactly the same as for a plain StatefulSet. I (as a user) add the queue label in the Notebook object's [...] I am not sure if it should be fixed in [...]
This is correct behaviour. It should add the queue-name label at the StatefulSet level.
Yeah, the STS webhook we have in Kueue will set up the PodGroup Workload based on the queue-name label set on the StatefulSet object itself. If the label is set at the PodTemplate level, then you get a Workload per pod. This is not good for STS: for example, in case of preemption, you may lose pod0, which might be problematic for the STS.
IIUC, inside the [...]
I think this is how the notebook-controller adds the queue label to [...]
So the approaches I see:
It would be great if someone could prototype one of them to double-check, but it seems to be a decision for the Notebook project at this point. EDIT: maybe there are more alternatives - like a dedicated webhook for the StatefulSet managed by the Notebook.
I am testing with adding the queue-name label to [...]
Yeah, that’s correct because the StatefulSet creates the pod group.
Yes, that’s correct. Kueue removes the Workload if there are no Pods in the group.
We can't suspend Pods; we can only remove the pod group and recreate it with a gate. If you set active=false on the Workload, Kueue evicts the Pods, but the process gets stuck because of the finalizers. So this is correct behaviour.
This is how the pod group works. After you remove the finalizers, all Pods should disappear, and the Workload should be finalized as well. However, the StatefulSet recreates the Pods, and the Workload is recreated as well. Yes, we need to handle this case and remove the finalizers when active=true is set.
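For reference, the field being discussed is .spec.active on the Kueue Workload object. A minimal merge patch that deactivates a Workload could look like the sketch below, applied with something like kubectl patch workload <workload-name> -n <namespace> --type merge --patch-file deactivate.yaml (the file name and placeholders are illustrative):

```yaml
# deactivate.yaml - merge patch for a Kueue Workload (kueue.x-k8s.io/v1beta1):
# setting spec.active to false makes Kueue evict the admitted pods of the group.
spec:
  active: false
```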
Thank you @mbobrovskyi.
This makes a lot of sense to me now. I think the finalizer should also be removed when the workload is re-admitted after being preempted.
However, it does feel confusing (or at least a bit strange) to leave the pod stuck in the Terminating state when the workload is inactive or preempted, rather than actually scaling down the STS. I understand that it is a technical limitation of the pod group method, but why not implement a direct integration for STS (meaning a Workload object corresponding to the STS directly)? STS is Kubernetes-native, so it doesn't seem like too much to have a direct integration for it. It would be more straightforward and much easier to understand, IMHO.
Yes, scaling down the STS to 0 is another option, but it would require modifying the [...] Maybe we should have two modes for the STS suspend support (scaling down to 0, or pod group gating), with the method selected by an annotation. Still, it remains unclear to me whether this is a needed complication, so I would like to first make sure there are use cases blocked by the pod group approach.
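Purely to illustrate the shape of that idea - the annotation key and values below are hypothetical and do not exist in Kueue today; this is just a metadata fragment, not a complete manifest:

```yaml
# Hypothetical mode-selection annotation on the StatefulSet (not a real Kueue API):
metadata:
  labels:
    kueue.x-k8s.io/queue-name: user-queue
  annotations:
    kueue.x-k8s.io/suspend-mode: scale-to-zero   # hypothetical; alternative hypothetical value: gate-pods
```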
Fair enough. Should I create a new issue for the finalizer bug of the pod group method? The finalizer causes the pod to get stuck in the Terminating state in the following cases:
Sorry for returning here late; it was quite a busy period before releasing 0.10.
Can you please re-test with Kueue 0.10.0, as we made some bug fixes to the StatefulSet integration that may already solve the issue? Thank you for opening #3851; to my knowledge this is the last remaining known issue. @varshaprasad96, @xiongzubiao do you think we can close this issue, or is more needed here? Support for the Workload "spec.active" field is already ticketed and actively being worked on.
/close
@mimowo: Closing this issue. In response to this:
Actually, it would be great to add a docs page under https://kueue.sigs.k8s.io/docs/tasks/run/ (or even https://kueue.sigs.k8s.io/docs/tasks/run/kubeflow/) to demonstrate how to use Kueue to run Notebooks.
+1, agree with @xiongzubiao - the integration PoC works well for our internal use case. I can help contribute to the docs!
/reopen
@mimowo: Reopened this issue. In response to this:
Yes, I am happy to contribute too :-)
Cool - actually, let me track the remaining work in follow-up issues so that the scope is clear: #3878
@mimowo: Closing this issue. In response to this:
What would you like to be added:
Enable Kueue to manage Kubeflow Notebook CRs (https://www.kubeflow.org/docs/components/notebooks/api-reference/notebook-v1/).
Why is this needed:
Notebooks themselves can be resource-heavy (for example, when GPU-enabled), and having Kueue manage them like any other workload on the cluster would be helpful.
Open Questions
I've been working on a PoC to enable this - however, it looks like the underlying implementation of a Notebook resource is a Pod. The spec of the Notebook CR only defines a PodSpecTemplate, without a suspend field.
Does implementing a mechanism similar to the pod integration sound reasonable - wherein we add a scheduling gate via a default webhook, and ungate the pod once it is ready to be admitted? Or could the Notebook introduce a suspend field, to, say, expose the functionality of pausing/stopping the underlying Jupyter server, which could then be controlled by Kueue (not sure if this would be an ask from NB users)? A hypothetical sketch of such a field is included after the artifacts note below.

This enhancement requires the following artifacts:
The artifacts should be linked in subsequent comments.
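For illustration only, a hypothetical sketch of what a Notebook-level suspend field could look like - this field does not exist in the Notebook API today, and its name and semantics here are assumptions:

```yaml
# Hypothetical: a suspend field on the Notebook spec that Kueue (or an admin)
# could toggle, analogous to batch/v1 Job's spec.suspend. Not part of the
# current kubeflow.org/v1 Notebook API.
apiVersion: kubeflow.org/v1
kind: Notebook
metadata:
  name: my-notebook                            # placeholder
  labels:
    kueue.x-k8s.io/queue-name: user-queue
spec:
  suspend: true                                # hypothetical: the notebook-controller would scale the STS to 0
  template:
    spec:
      containers:
        - name: my-notebook
          image: jupyter/base-notebook:latest  # placeholder image
```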