What's New
Welcome to the v1.11.0 release of Volcano!
In this release, we have brought a set of significant enhancements that have been long awaited by community users.
Feature Preview: Network Topology Aware Scheduling
In AI large model training scenarios, model parallelism splits the model across multiple nodes, requiring frequent data exchange between these nodes during training. At this point, network transmission performance between nodes often becomes a bottleneck, significantly impacting training efficiency. Data centers have diverse network types (e.g., IB, RoCE, NVSwitch) and complex network topologies, typically involving multiple layers of switches. The fewer switches between two nodes, the lower the communication latency and the higher the throughput. Therefore, users want to schedule workloads to the best performance domain with the highest throughput and lowest latency, minimizing cross-switch communication to accelerate data exchange and improve training efficiency.
To address this, Volcano has introduced the Network Topology Aware Scheduling strategy, solving the network communication performance issues in large-scale data center AI training tasks through a unified network topology API and intelligent scheduling policies. It provides the following capabilities:
- Unified Network Topology API: Introduced the HyperNode CRD to accurately express the network topology of data centers.
- Network Topology-Aware Scheduling Policy: Volcano Job and PodGroup can set topology constraints for jobs through the `networkTopology` field, supporting the following configurations (a sketch follows this list):
  - `mode`: Supports `hard` and `soft` modes.
    - `hard`: Hard constraint; tasks within the job must be deployed within the same HyperNode.
    - `soft`: Soft constraint; attempts to deploy the job within the same HyperNode.
  - `highestTierAllowed`: Used with `hard` mode, indicating the highest tier of HyperNode the job is allowed to span.
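Below is a minimal sketch of how these pieces fit together: a leaf-tier HyperNode grouping two nodes under the same switch, and a Volcano Job constrained to that performance domain. The field layout follows the design doc linked below; node names, the queue, and the container image are placeholder assumptions.

```yaml
# Sketch: a leaf-tier HyperNode grouping two nodes under the same switch.
apiVersion: topology.volcano.sh/v1alpha1
kind: HyperNode
metadata:
  name: hypernode-s0
spec:
  tier: 1                    # tier 1 = closest to the nodes
  members:
  - type: Node
    selector:
      exactMatch:
        name: node-0         # placeholder node name
  - type: Node
    selector:
      exactMatch:
        name: node-1
---
# Sketch: a job that must not span HyperNodes above tier 1.
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: distributed-training
spec:
  schedulerName: volcano
  minAvailable: 2
  queue: default
  networkTopology:
    mode: hard               # hard constraint
    highestTierAllowed: 1    # all tasks stay within one tier-1 HyperNode
  tasks:
  - name: worker
    replicas: 2
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: trainer
          image: training-image:latest   # placeholder image
```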
Design doc: Topology Aware Scheduling.
User Guide: Topology Aware Scheduling | Volcano.
Related PRs: (#3850, #144, #3874, #3922, #3964, #3971, #3974, #3887, #3897, @ecosysbin, @weapons97, @Xu-Wentao, @penggu, @JesseStutler, @Monokaix)
Supports Elastic Hierarchical Queue
In multi-tenant scenarios, fairness, isolation, and task priority control in resource allocation are core requirements. Different departments or teams often need to share cluster resources while ensuring their tasks can obtain resources on demand, avoiding resource contention or waste. To address this, Volcano has introduced the Elastic Hierarchical Queue feature, significantly enhancing queue resource management capabilities. Through hierarchical queues, users can achieve finer-grained resource quota management, cross-level resource sharing and reclamation, and flexible preemption strategies, building an efficient and fair unified scheduling platform. Users of YARN can seamlessly migrate big data workloads to Kubernetes clusters with Volcano.
Volcano's elastic hierarchical queues have the following key features to meet the complex demands of multi-tenant scenarios:
- Supports Configuring Queue Hierarchies: Users can create multi-level queues as needed, forming a tree structure. Each queue can set independent resource quotas and priorities, ensuring fair resource allocation.
- Cross-Level Resource Sharing and Reclamation: When a sub-queue is idle, its resources can be shared with other sub-queues; when jobs are submitted to the sub-queue, those resources can be reclaimed from the other sub-queues.
- Fine-Grained Resource Quota Management: Each queue can set the following resource parameters (see the sketch after this list):
  - `capability`: The upper limit of the queue's resource capacity.
  - `deserved`: The amount of resources the queue deserves. If the queue's allocated resources exceed the `deserved` value, the excess can be reclaimed.
  - `guarantee`: Resources reserved for the queue, ensuring a minimum resource guarantee.
- Flexible Preemption Strategies: Supports priority-based resource preemption, ensuring high-priority tasks can obtain the required resources promptly.
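A minimal sketch of a two-level hierarchy using these fields, assuming the `parent` field from the hierarchical queue design; queue names and quantities are placeholders:

```yaml
# Sketch: a department-level parent queue under the built-in root queue.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: dept-a
spec:
  parent: root              # assumption: top-level queues attach to "root"
  deserved:
    cpu: "16"
    memory: 32Gi
---
# Sketch: a team-level child queue inside dept-a.
apiVersion: scheduling.volcano.sh/v1beta1
kind: Queue
metadata:
  name: team-a1
spec:
  parent: dept-a
  capability:               # hard upper limit for the queue
    cpu: "16"
    memory: 32Gi
  deserved:                 # allocations above this can be reclaimed
    cpu: "8"
    memory: 16Gi
  guarantee:                # reserved minimum for the queue
    resource:
      cpu: "2"
      memory: 4Gi
```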
For detailed design and usage guidance on elastic hierarchical queues, please refer to:
Design doc: hierarchical-queue-on-capacity-plugin.
User Guide: Hierarchical Queue | Volcano.
Related PRs: (#3591, #3743, @Rui-Gan)
Supports Multi-Cluster AI Job Scheduling
With the rapid growth of enterprise business, a single Kubernetes cluster often cannot meet the demands of large-scale AI training and inference tasks. Users typically need to manage multiple Kubernetes clusters to achieve unified workload distribution, deployment, and management. Many users already run Volcano across multiple clusters and manage them with Karmada. To better support AI jobs in multi-cluster environments, including global queue management, job priority, and fair scheduling, the Volcano community has incubated the Volcano Global sub-project. This project extends Volcano's powerful single-cluster scheduling capabilities to provide a unified scheduling platform for multi-cluster AI jobs, supporting cross-cluster job distribution, resource management, and priority control.
Volcano Global provides the following enhancements on top of Karmada to meet the complex demands of multi-cluster AI job scheduling:
- Supports Cross-Cluster Scheduling of Volcano Jobs: Users can deploy and schedule Volcano Jobs across multiple clusters, fully utilizing the resources of multiple clusters to improve task execution efficiency.
- Queue Priority Scheduling: Supports cross-cluster queue priority management, ensuring tasks in high-priority queues obtain resources first.
- Job Priority Scheduling and Queuing: Supports job-level priority scheduling and queuing mechanisms in multi-cluster environments, ensuring critical tasks are executed promptly.
- Multi-Tenant Fair Scheduling: Provides cross-cluster multi-tenant fair scheduling capabilities, ensuring fair resource allocation among tenants and avoiding resource contention.
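As a sketch of how distribution can look with Karmada's standard APIs (assuming volcano-global is deployed on the Karmada control plane; the job and cluster names are placeholders), a Volcano Job created on the control plane is propagated to member clusters with a PropagationPolicy:

```yaml
apiVersion: policy.karmada.io/v1alpha1
kind: PropagationPolicy
metadata:
  name: distributed-training-propagation
spec:
  resourceSelectors:
  - apiVersion: batch.volcano.sh/v1alpha1
    kind: Job                  # the Volcano Job to distribute
    name: distributed-training
  placement:
    clusterAffinity:
      clusterNames:            # placeholder member clusters
      - member1
      - member2
```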
For detailed introduction and user guide, please refer to: Multi-cluster Scheduling | Volcano.
Related PRs: (see https://github.com/volcano-sh/volcano-global.git, @Vacant2333, @MondayCha, @lowang-bh, @Monokaix)
Supports Online and Offline Workloads Colocation
The core idea of online and offline colocation is to deploy online services (e.g., real-time services) and offline jobs (e.g., batch processing tasks) in the same cluster. When online services are in a trough, offline jobs can utilize idle resources; when online services peak, offline jobs are suppressed through priority control to ensure the resource needs of online services. This dynamic resource allocation mechanism not only improves resource utilization but also guarantees the quality of service for online services.
Volcano's cloud native colocation solution provides end-to-end resource isolation and sharing mechanisms from the application layer to the kernel, including the following core components:
Volcano Scheduler
Responsible for unified scheduling of online and offline jobs, providing abstractions such as queues, groups, job priorities, fair scheduling, and resource reservations to meet the scheduling needs of various business scenarios like microservices, big data, and AI.
Volcano SLO Agent
The SLO Agent deployed on each node monitors the node's resource usage in real-time, dynamically calculates overcommitted resources, and allocates these resources to offline jobs. Meanwhile, the SLO Agent detects CPU/memory pressure on the node and evicts offline jobs when necessary to ensure the priority of online services.
Enhanced OS
To further strengthen resource isolation, Volcano implements fine-grained QoS guarantees at the kernel level. Through cgroup interfaces, different resource limits are set for online and offline services, ensuring online services receive sufficient resources even under high load.
Volcano's cloud native colocation solution has the following key capabilities, helping users achieve a win-win situation in resource utilization and business stability:
- Unified Scheduling: Supports unified scheduling of various workloads, including microservices, batch jobs, and AI tasks.
- QoS-Based Resource Model: Provides quality of service (QoS)-based resource management for online and offline services, ensuring the stability of high-priority services.
- Dynamic Resource Overcommitment: Dynamically calculates overcommitted resources based on real-time CPU/memory utilization of nodes, maximizing resource utilization.
- CPU Burst: Allows containers to temporarily exceed CPU limits, avoiding throttling at critical moments and improving business responsiveness.
- Network Bandwidth Isolation: Supports limiting a node's overall egress network bandwidth, ensuring the network usage needs of online services.
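As a sketch, an offline job can be marked as best-effort and request the overcommitted resources reported by the SLO Agent. The annotation key (`volcano.sh/qos-level`) and extended-resource names (`kubernetes.io/batch-cpu`, `kubernetes.io/batch-memory`) below are assumptions based on the colocation design; check the user guide linked below for the exact names your version uses.

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: offline-batch-task
  annotations:
    volcano.sh/qos-level: "BE"   # assumption: marks the Pod as a best-effort offline workload
spec:
  schedulerName: volcano
  containers:
  - name: worker
    image: busybox               # placeholder image
    command: ["sh", "-c", "sleep 3600"]
    resources:
      requests:
        kubernetes.io/batch-cpu: "4000"          # overcommitted CPU, in millicores
        kubernetes.io/batch-memory: "8589934592" # overcommitted memory, in bytes
      limits:
        kubernetes.io/batch-cpu: "4000"
        kubernetes.io/batch-memory: "8589934592"
```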
For detailed introduction and user guide about online and offline colocation, please refer to: Cloud Native Colocation | Volcano
Related PRs: (#3789, @william-wang)
Supports Load-Aware Descheduling
In Kubernetes clusters, as workloads dynamically change, uneven node resource utilization often occurs, leading to some nodes becoming hotspots, affecting the overall stability and efficiency of the cluster. To address this, Volcano introduces the Load-Aware Descheduling feature, dynamically adjusting Pod distribution based on the actual load of nodes, ensuring balanced resource utilization across the cluster, avoiding resource hotspots, and improving overall performance and reliability. Load-aware descheduling is incubated through the subproject https://github.com/volcano-sh/descheduler.
Main features include:
- Load-Aware Scheduling: Monitors real load metrics such as CPU and memory on nodes and dynamically adjusts Pod distribution, instead of relying solely on coarse Pod resource requests.
- Timed and Dynamic Triggers: Supports triggering descheduling via CronTab or fixed intervals, flexibly adapting to different scenarios.
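As a sketch of what a load-aware policy could look like, assuming the upstream descheduler's `DeschedulerPolicy` format with utilization thresholds driven by real node metrics (plugin names and fields are assumptions; see the project documentation for the exact configuration volcano-sh/descheduler supports):

```yaml
apiVersion: "descheduler/v1alpha2"
kind: DeschedulerPolicy
profiles:
- name: load-aware
  plugins:
    balance:
      enabled:
      - LowNodeUtilization
  pluginConfig:
  - name: LowNodeUtilization
    args:
      thresholds:          # nodes below these percentages count as underutilized
        cpu: 30
        memory: 30
      targetThresholds:    # nodes above these percentages count as hotspots
        cpu: 80
        memory: 85
```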
Applicable scenarios:
- Uneven Node Resource Utilization: When some nodes in the cluster have high resource utilization while others are idle, load-aware descheduling can automatically balance node loads.
- Hotspot Node Management: When nodes become performance bottlenecks or face failure risks due to high load, descheduling can promptly migrate Pods to ensure business stability.
For detailed introduction and user guide about load-aware descheduling, please refer to: Load-aware Descheduling | Volcano
Related PRs: (see https://github.com/volcano-sh/descheduler, @Monokaix)
Supports Fine-Grained Job Failure Recovery Strategies
In AI, big data, and high-performance computing (HPC) scenarios, job stability and failure recovery capabilities are crucial. Traditional job failure recovery strategies often restart the entire Job when a Pod fails, wasting resources and potentially causing training tasks to start over, significantly impacting efficiency. With the adoption of checkpointing and resume from checkpoint techniques in AI scenarios, a single Pod failure no longer requires restarting the entire Job. To address this, Volcano introduces Fine-Grained Job Failure Recovery Strategies, supporting more flexible failure handling mechanisms, helping users efficiently handle task interruptions, and significantly improving training efficiency. The following enhancements are provided:
- Supports Pod-Level Restart Policies: Users can set the `RestartPod` action to restart only the failed Pod instead of the entire Job when a `PodFailed` event occurs.
- Supports Setting Timeouts for Actions: Pod failures may be caused by temporary issues (e.g., network jitter or hardware problems). Volcano allows users to set timeouts for failure recovery actions; if the Pod recovers within the timeout, the action is not executed.
- New PodPending Event Handling: When a Pod remains in the Pending state for a long time due to insufficient resources or topology constraints, users can set a timeout for the `PodPending` event. If the Pod is still not running after the timeout, the entire Job can be terminated to avoid resource waste.
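A minimal sketch combining these policies in a Volcano Job (the job name and image are placeholders):

```yaml
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: fault-tolerant-training
spec:
  schedulerName: volcano
  minAvailable: 2
  policies:
  - event: PodFailed
    action: RestartPod        # restart only the failed Pod, not the whole Job
  - event: PodPending
    action: TerminateJob      # give up if Pods stay Pending too long
    timeout: 10m
  tasks:
  - name: worker
    replicas: 2
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: trainer
          image: training-image:latest   # placeholder image
```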
For detailed explanations and usage of fine-grained job failure recovery strategies, please refer to: how to use job policy.
Related PRs: (volcano-sh/apis#140, #3813, #3973, @bibibox)
Supports Resources Visualization via Volcano Dashboard
The Volcano dashboard is a new GUI for visualizing Volcano resources. After deploying Volcano, users can deploy the Volcano dashboard to display Volcano-related resources in the cluster through a graphical interface, facilitating queries and operations. Project address: https://github.com/volcano-sh/dashboard.
Currently supported views include:
- Cluster Overview: including Job counts, statuses, completion rates, Queue counts, and Queue resource utilization.
- Job lists and details: supporting fuzzy search, filtering by Namespace, Queue, Status, and sorting Jobs.
- Queue lists and details: supporting fuzzy search, filtering by Status, and sorting Queues.
- Pod lists and details: supporting fuzzy search, filtering by Namespace, Status, and sorting Pods.
Related PRs: (see https://github.com/volcano-sh/dashboard, @WY-Dev0, @Monokaix)
Supports Kubernetes v1.31
Volcano versions closely follow Kubernetes community versions, supporting each major Kubernetes version. The latest supported version is v1.31, with complete UT and E2E test cases ensuring functionality and reliability.
If you want to participate in adapting Volcano to new Kubernetes versions, please refer to: adapt-k8s-todo for community contributions.
Related PRs: (#3767, #3837, @vie-serendipity, @dongjiang1989)
Supports Preemption Policy for Volcano Job
PriorityClass represents Pod priority and consists of a priority value and a preemption policy, which the scheduler uses as the basis for scheduling and preemption: higher-priority Pods are scheduled before lower-priority Pods and can preempt them. Volcano fully supports priority scheduling and preemption policies at the Pod level, as well as priority scheduling and preemption based on the PriorityClass value at the Volcano Job level. However, in some scenarios, users want Volcano Jobs to trigger resource reclamation without preemption, waiting for cluster resources to be released naturally to ensure overall business stability. Volcano now supports a Job-level PreemptionPolicy: Volcano Jobs configured with PreemptionPolicy Never will not preempt other Pods.
Volcano Jobs and tasks within Jobs support configuring PriorityClass. For the relationship between the two PriorityClasses and configuration examples, please refer to: how to configure priorityclass for job.
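A minimal sketch of a non-preempting job, assuming the standard Kubernetes PriorityClass API (names and the image are placeholders):

```yaml
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: high-priority-no-preemption
value: 1000000
preemptionPolicy: Never          # schedule ahead of lower-priority jobs, but never preempt running Pods
---
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
metadata:
  name: analytics-job
spec:
  schedulerName: volcano
  minAvailable: 1
  queue: default
  priorityClassName: high-priority-no-preemption
  tasks:
  - name: worker
    replicas: 1
    template:
      spec:
        restartPolicy: Never
        containers:
        - name: main
          image: analytics-image:latest   # placeholder image
```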
Related PRs: (#3739, @JesseStutler)
Performance Optimization in Large-Scale Scenarios
In Volcano, Queue is one of the most basic and important resources. The `status` field of a Queue records PodGroups in states such as `Unknown`, `Pending`, `Running`, `Inqueue`, and `Completed`. However, in large-scale scenarios where PodGroups in a Queue change frequently (e.g., a large number of short-running tasks are submitted in a loop), many PodGroups transition from `Running` to `Completed`. In this case, the Volcano Controller needs to frequently refresh the Queue's `status` field, putting significant pressure on the APIServer. Additionally, the Volcano Scheduler updates the `status.allocated` field of the Queue after job scheduling, which can cause Queue update conflicts in large-scale scenarios, further impacting system performance.
To thoroughly solve the issues of frequent Queue refreshes and update conflicts in large-scale scenarios, Volcano has optimized the Queue management mechanism by migrating PodGroup statistics in Queues to metrics, which are no longer persisted. Users can view PodGroup statistics in Queues through `vcctl`. This optimization significantly reduces pressure on the APIServer while improving overall system performance and stability.
For detailed design and metric names of migrating PodGroup status statistics in Queues to metrics, please refer to: Queue podgroup statistics.
Related PRs: (#3750, #3751, @JesseStutler)
Changes
- [bugfix] Fix leader-elect-resource-namespace flag not taking effect (#3975 @Monokaix)
- fix staticcheck deprecated workqueue.RateLimitingInterface warnings on pkg/controllers (#3791 @xovoxy)
- Fix the restartPod action will cancel the restartJob delay action (#3973 @bibibox)
- Handle taskUnschedulable with multiple goroutines (#3921 @lishangyuzi)
- Add missing license boilerplate (#3963 @SataQiu)
- Support more actions for volcano job failure scenario (#3813 @bibibox)
- fix: if the capability cpu or memory is not specified in the hierarchical queue, it will be set to the corresponding value of the parent queue (#3917 @JesseStutler)
- fix: hierarchical queue webhook validation use listing podgroups instead (#3913 @JesseStutler)
- reclaim: When choosing a preemptor, choose a starving one rather than one with pending tasks. (#3951 @JesseStutler)
- vc-scheduler: rename parents to ancestors in hierarchical queueAttr. (#3958 @bogo-y)
- Support export kind logs (#3957 @Monokaix)
- fix ut permission err for release ci (#3956 @Monokaix)
- vc-scheduler: fix omitting check ancestor-queues' real-capability in capacity plugin. (#3940 @bogo-y)
- vc-scheduler: optimize batchNodeOrderFn in numaaware plugin (#3954 @bogo-y)
- chore: update development docs (#3952 @zedongh)
- vc-scheduler: use max function to simplify code of capacity plugin (#3941 @bogo-y)
- Add PULL_REQUEST_TEMPLATE.md (#3942 @JesseStutler)
- export logs via workflow (#3927 @Monokaix)
- fix NPU issue on vc-scheduler (#3924 @archlitchi)
- chore: use Infof for slice taskinfo instead of InfoS log less details (#3933 @zedongh)
- chore: Fix minor comment issues (#3908 @Yanping-io)
- add network-topology-aware design doc (#3850 @william-wang)
- fix: configmgr need wait for all processed avoid tests data race (#3891 @zedongh)
- Fix flaky ut of jobFlow cli (#3905 @Monokaix)
- fix: mark PodGroup completed when pod fails (#3807 @bood)
- Update owners: add archlitchi to approver in scheduler plugins (#3882 @archlitchi)
- Remove duplicated codes (#3879 @kerthcet)
- Fix: When the hostname label of a node does not match the node name, pods bound to a PV with hostname nodeAffinity could be scheduled to the wrong node or fail to schedule (#3837 @dongjiang1989)
- fix: nodegroup return nil (#3880) (#3886 @lut777)
- vc-scheduler: simplify code for drf-plugin (#3858 @bogo-y)
- Update gen-admission-secret.sh (#3843 @raravena80)
- Supports rollback when allocate callback function fails (#3863 @wangyang0616)
- Delete current incorrect benchmark testing results (#3857 @JesseStutler)
- Fix predicate return unexpected result (#3840 @bibibox)
- Fix flaky ut: create queues first (#3851 @Monokaix)
- Fix flaky ut (#3849 @Monokaix)
- chore: Replace deprecated `ioutil` functions and add depguard rules in `.golangci.yml` (#3808 @dongjiang1989)
- feature: Add podgroups statistics (#3751 @JesseStutler)
- fix untranslated sections. (#3811 @SherlockShemol)
- Add install dashboard guide (#3816 @Monokaix)
- remove useless report file (#3821 @william-wang)
- fix ci and add event logging to help debug ci (#3817 @JesseStutler)
- fix the typo in dashboard's name (#3803 @zhifanggao)
- chore: remove benchmark unuse g1_aff.png (#3801 @conghuhu)
- feat: add benchmark code (#3730 @conghuhu)
- feat: add hierarchical queues for capacity plugin (#3743 @Rui-Gan)
- fix allocating more pods to a GPU when using volcano-vgpu feature (#3774 @archlitchi)
- update the talks in readme (#3796 @william-wang)
- fix panic when get job's elastic resource (#3106 @lowang-bh)
- Support colocation for computing workload and micro-service (#3789 @william-wang)
- Introducing Volcano Guru on Gurubase.io (#3788 @kursataktas)
- Optimize the admission log of vcjob update (#3764 @hwdef)
- [Proposal] Add podgroup statistics doc (#3750 @JesseStutler)
- fix: update the volcano metric document. (#3782 @fengruotj)
- feat: Volcano Supports K8s v1.31 (#3767 @vie-serendipity)
- feature:add preemptionpolicy in preempt and reclaim (#3739 @JesseStutler)
- feat: add volcano jobs phase metric (#3650 @Prepmachine4)
- Fix job status not synced correctly when upgrading from v1.5 (#3640) (#3786 @QingyaFan)
- remove klog v1 (#3785 @hwdef)
- fix: remove the default nodeSelector value from the volcano monitoring installation yaml. (#3781 @fengruotj)
- perf: improve end-to-end queue processing latency (#3772 @fengruotj)
- add scheduler-name field in charts (#3766 @lengrongfu)
- Fix staticcheck warnings on pkg/controller and pkg/scheduler (#3722 @xovoxy)
- Proposal: Hierarchical Queue on Capacity Plugin (#3591 @Rui-Gan)
- fix: frequently create and delete vcjobs with the same name, and the β¦ (#3771 @wangyang0616)
- Fix staticcheck warnings on cmd and example (#3720 @xovoxy)
- Update Kubernetes compatibility (#3757 @Monokaix)
- Remove init variable in Reclaimable and Preemptable parts of session_plugins (#3551 @PigNatovsky)
- Use watch to listen expected event instead of wait.poll (#3763 @Monokaix)
- fixes #3759 (#3762 @PigNatovsky)
- Don't depend on ready condition of nodes instead of taints in session (#3756 @Monokaix)
- feature: add approve workflows action (#3758 @JesseStutler)
- feat: allow configuration of ipFamilyPolicy (#3741 @dongjiang1989)
- docs: fix README broken URL (#3718 @jasondrogba)
- controllers: fix queue controller syncQueue bug (#3742 @sceneryback)
- fix: nil pointer when set node other resources (#3717 @TymonLee)
- fix job controller reports duplicate warnings (#3746 @liuyuanchun11)
- Specify the fields using the key Duration (#3747 @Yanping-io)
- optimize the webhook manager (#3588 @Vacant2333)
- controllers: fix miscalculation of RunningDuration when killing job (#3719 @matbme)
- Remove hardcoded nodeSelector for kube-state-metrics (#3740 @lekaf974)