Gang Scheduling Ray Clusters on Kubernetes with Multi-Cluster-App-Dispatcher (MCAD)

Introduction

Large-scale machine learning workloads on Kubernetes commonly suffer from the lack of a resource reservation system that allows gang scheduling. For example, a training job requiring 100 GPU pods can end up with only 80 of them provisioned while the remaining 20 sit in a pending state. This is common when other jobs compete for resources in the same K8s cluster.

Using KubeRay with the Multi-Cluster-App-Dispatcher (MCAD) Kubernetes controller helps to avoid such situations and unblock your ML workload. Specifically, MCAD allows you to queue each of your Ray workloads until resource availability requirements are met. With MCAD, your Ray cluster’s pods will only be created once there is a guarantee that all of the pods can be scheduled.

Workload requirements

To train large models using Ray, users submit Ray workloads that are long-running and must run to completion. Such workloads benefit from the following capabilities:

  1. Gang scheduling: Start one of the queued Ray clusters when aggregated resources (i.e., resources for all of the Ray cluster's pods) are available in the Kubernetes cluster.

  2. Workload preemption: Preempt lower-priority Ray workloads when higher-priority Ray workloads are requested.

Requirement #2, workload preemption, is on the future roadmap for MCAD development. This blog addresses requirement #1, gang scheduling.

Note that Ray supports internal gang-scheduling via placement groups. Placement groups allow you to queue a Ray workload until the Ray cluster can accommodate the workload’s tasks and actors. On the other hand, MCAD allows you to queue Ray cluster creation until your Kubernetes cluster can accommodate all of the Ray cluster’s pods.
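
For concreteness, here is a minimal sketch of Ray's internal gang scheduling with a placement group; the bundle shapes are illustrative and unrelated to the GLUE workload later in this post:

import ray
from ray.util.placement_group import placement_group

ray.init()

# Reserve four bundles of {1 GPU, 2 CPUs} as a single, all-or-nothing request.
# The reservation is granted only when ALL bundles fit in the running Ray cluster;
# until then, the request simply waits inside Ray.
pg = placement_group([{"GPU": 1, "CPU": 2}] * 4, strategy="PACK")
ray.get(pg.ready())  # blocks until the whole gang reservation is satisfied

Note that this gates scheduling of tasks and actors within an already-running Ray cluster, whereas MCAD gates creation of the Ray cluster's Kubernetes pods in the first place.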

Components for scaling workloads on Kubernetes

In this blog, we discuss how such workloads can be scaled performantly with three components: Ray, KubeRay, and MCAD. Ray is the distributed Python runtime that executes your ML application. KubeRay provides K8s-native tooling to manage your Ray cluster on Kubernetes. Finally, MCAD ensures physical resource availability, so that your Ray cluster never gets stuck in a partially provisioned state. We elaborate on Ray, KubeRay, and MCAD below.

Ray

Ray is a unified way to scale Python and AI applications from a laptop to a cluster; the same code runs seamlessly at either scale. Ray is designed to be a general-purpose library, meaning that it can run a broad array of distributed compute workloads performantly. If your application is written in Python, you can scale it with Ray; no other infrastructure is required.
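
As a minimal illustration (not part of the GLUE workload below), the same Ray code runs unchanged whether Ray is started locally on a laptop or attached to a multi-node cluster:

import ray

ray.init()  # starts Ray locally; on a cluster, ray.init(address="auto") attaches to it

@ray.remote
def square(x: int) -> int:
    return x * x

# The same fan-out runs on however many cores and nodes the Ray cluster has.
print(ray.get([square.remote(i) for i in range(100)]))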

KubeRay

KubeRay is an open-source toolkit to run Ray applications on Kubernetes. KubeRay provides several tools to simplify managing Ray clusters on Kubernetes. The key player is the KubeRay operator, which converts your Ray configuration into a Ray cluster consisting of one or more Ray nodes; each Ray node occupies its own Kubernetes pod. 

Multi-Cluster-App-Dispatcher (MCAD)

The Multi-Cluster-App-Dispatcher (MCAD) is a Kubernetes controller providing mechanisms for applications to manage batch jobs in a single Kubernetes cluster or multi-Kubernetes-cluster environment:

Figure 1. MCAD dispatch cluster

  • MCAD uses AppWrappers to wrap any Kubernetes object the user provides (Service, PodGroup, Job, Deployment, or any custom resource). The AppWrapper is itself represented as a custom resource (CR).

  • Wrapping an object means appending the user's YAML definition under the AppWrapper's .spec.GenericItems.

  • User objects within an AppWrapper are queued until aggregated resources are available in one of the Kubernetes clusters.

A sample AppWrapper configuration can be found here.

Running the workload

We will run the General Language Understanding Evaluation (GLUE) benchmark. The GLUE benchmark is a collection of resources for training, evaluating, and analyzing natural language understanding systems. For our example workload, we will fine-tune the RoBERTa model, with KubeRay and MCAD installed on OpenShift or Kubernetes.

Fine-tuning takes a model that has already been trained for one task and tweaks it to perform another, similar task. In this example, we will train 90 models on different natural language understanding (NLU) tasks. Running the GLUE workload on a single V100 GPU would take tens of GPU hours; we plan to run fine-tuning on 4 GPUs, so with enough resources available in the cluster, the job completes in a few hours.
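
The actual training logic lives in the glue_benchmark.py driver script referenced later in this post; the snippet below is only a rough sketch (with a hypothetical fine_tune_glue task) of how Ray can fan independent fine-tuning runs out across the cluster's GPUs, one GPU per run:

import ray

ray.init(address="auto")  # connect to the Ray cluster created by KubeRay

@ray.remote(num_gpus=1)  # each fine-tuning run reserves one GPU
def fine_tune_glue(task_name: str, seed: int) -> dict:
    # Placeholder for the real fine-tuning code (e.g., a Hugging Face Trainer
    # run over the RoBERTa checkpoint for this GLUE task).
    ...
    return {"task": task_name, "seed": seed}

# Fan out many (task, seed) combinations; Ray runs at most <number of GPUs>
# of them at a time and queues the rest.
tasks = ["cola", "sst2", "mrpc", "rte"]
refs = [fine_tune_glue.remote(t, s) for t in tasks for s in range(3)]
results = ray.get(refs)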

Prerequisites:

  • Ensure KubeRay and MCAD are installed on your Kubernetes cluster using the steps outlined here. Configure a blob store with the GLUE dataset by following the instructions here.

  • This workload requires GPU-enabled Kubernetes nodes. We recommend using at least 2 AWS p3.2xlarge nodes or the equivalent with your preferred cloud provider.

The following commands demonstrate how to run the GLUE workload described above. The commands will use the AppWrapper configuration aw-ray-glue.yaml. Below is a condensed view of the configuration. Observe that the AppWrapper reserves aggregate resources for the wrapped RayCluster. The wrapped RayCluster object is specified under generictemplate.

kind: AppWrapper
metadata: {"name": "raycluster-glue"}
spec:
  resources:
    custompodresources:
    - replicas: 2
      requests: {"cpu": "3", "memory": "16G", "nvidia.com/gpu": "1"}
      limits: {"cpu": "3", "memory": "16G", "nvidia.com/gpu": "1"}
    generictemplate:
      kind: RayCluster
      spec:
        headGroupSpec:
          containers:
          - image: projectcodeflare/codeflare-glue:latest
            resources:
              limits: {"cpu": "2", "memory": "16G", "nvidia.com/gpu": "0"}
              requests: {"cpu": "2", "memory": "16G", "nvidia.com/gpu": "0"}
        workerGroupSpecs:
        - replicas: 1
          containers:
          - image: projectcodeflare/codeflare-glue:latest
            resources:
              limits: {"cpu": "4", "memory": "16G", "nvidia.com/gpu": "2"}
              requests: {"cpu": "4", "memory": "16G", "nvidia.com/gpu": "2"}

To schedule our Ray cluster, we need to place a head pod with 2 CPUs and 16GB of memory, and a worker pod with 4 CPUs, 16GB of memory, and 2 GPUs. To accommodate these requirements, the AppWrapper reserves aggregate resources of 2 * (3 CPUs + 16GB memory + 1 GPU).
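
As a quick, purely illustrative sanity check of that arithmetic, the aggregate reservation indeed covers the combined head and worker pod requests:

# Aggregate reservation from custompodresources: 2 replicas of {3 CPU, 16G, 1 GPU}.
aggregate = {"cpu": 2 * 3, "memory_gb": 2 * 16, "gpu": 2 * 1}   # 6 CPU, 32G, 2 GPU

# Pod requests from the RayCluster template.
head = {"cpu": 2, "memory_gb": 16, "gpu": 0}
worker = {"cpu": 4, "memory_gb": 16, "gpu": 2}
needed = {k: head[k] + worker[k] for k in head}                 # 6 CPU, 32G, 2 GPU

assert all(aggregate[k] >= needed[k] for k in needed)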

To demonstrate MCAD’s gang scheduling functionality, we create two AppWrappers. We create our first AppWrapper as follows:

$ kubectl apply -f aw-ray-glue.yaml   # First Ray GLUE cluster
Conditions:
    Last Transition Micro Time:  2022-10-03T18:14:16.693259Z
    Last Update Micro Time:      2022-10-03T18:14:16.693257Z
    Status:                      True
    Type:                        Init
    Last Transition Micro Time:  2022-10-03T18:14:16.693574Z
    Last Update Micro Time:      2022-10-03T18:14:16.693571Z
    Reason:                      AwaitingHeadOfLine
    Status:                      True
    Type:                        Queueing
    Last Transition Micro Time:  2022-10-03T18:14:16.711161Z
    Last Update Micro Time:      2022-10-03T18:14:16.711159Z
    Reason:                      FrontOfQueue.
    Status:                      True
    Type:                        HeadOfLine
    Last Transition Micro Time:  2022-10-03T18:14:18.363510Z
    Last Update Micro Time:      2022-10-03T18:14:18.363509Z
    Reason:                      AppWrapperRunnable
    Status:                      True
    Type:                        Dispatched
  Controllerfirsttimestamp:      2022-10-03T18:14:16.692755Z
  Filterignore:                  true
  Queuejobstate:                 Dispatched
  Sender:                        before manageQueueJob - afterEtcdDispatching
  State:                         Running
  Systempriority:                9

The output above, taken from the AppWrapper's status (for example, via kubectl describe appwrapper raycluster-glue), shows the conditions the Ray AppWrapper object passes through before it is dispatched:

  • Init: set when the Ray AppWrapper is submitted to MCAD.

  • Queueing: set when the Ray AppWrapper is queued because the aggregated resources it requests are not currently available in the cluster.

  • HeadOfLine: set when the FIFO policy brings the Ray AppWrapper to the head of the queue.

  • Dispatched: set when the aggregated resources become available in the Kubernetes cluster and the Ray AppWrapper is dispatched.

  • Running: set after all pods associated with the Ray AppWrapper are in the Running phase.
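
If you prefer to watch these state transitions programmatically rather than with kubectl, a minimal sketch using the official Kubernetes Python client is shown below. The API group and version (mcad.ibm.com/v1beta1) and the status.state field name are assumptions based on the output above; verify them against the AppWrapper CRD installed in your cluster.

import time

from kubernetes import client, config

config.load_kube_config()  # or config.load_incluster_config() inside a pod
api = client.CustomObjectsApi()

# NOTE: group/version/plural are assumptions; check them with
#   kubectl api-resources | grep -i appwrapper
GROUP, VERSION, PLURAL = "mcad.ibm.com", "v1beta1", "appwrappers"

while True:
    aw = api.get_namespaced_custom_object(GROUP, VERSION, "default", PLURAL, "raycluster-glue")
    state = aw.get("status", {}).get("state", "Pending")  # e.g. Pending, Running
    print("AppWrapper state:", state)
    if state == "Running":
        break
    time.sleep(10)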

Now follow these instructions to run the GLUE benchmark. Training is kicked off using the Ray driver script glue_benchmark.py.

Next, change the AppWrapper name under .metadata.name and the RayCluster name under .spec.GenericItems.generictemplate.metadata.name in aw-ray-glue.yaml, then re-apply the same YAML with the following commands:

$ kubectl apply -f aw-ray-glue.yaml   # Second Ray GLUE cluster
$ kubectl get appwrappers

NAME                AGE
raycluster-glue     30m
raycluster-glue-1   5s

We see that two AppWrappers have been created. However, only the pods corresponding to the first AppWrapper have been provisioned:

$ kubectl get pods

NAME                                    READY   STATUS    RESTARTS   AGE
glue-cluster-head-8nnh9                 1/1     Running   0          31m
glue-cluster-worker-small-group-ssk75   1/1     Running   0          31m

When the first GLUE cluster completes its work and is deleted, MCAD will dispatch the second GLUE cluster; the second cluster’s pods will be created only at dispatch time. You may optionally observe this by deleting the first AppWrapper:

$ kubectl delete appwrapper raycluster-glue

After deleting the first AppWrapper, the second RayCluster's pods should be created; you can confirm this by asking Kubernetes to list the pods:

$ kubectl get pods

Conclusion

In this blog, we configured a Kubernetes cluster that had KubeRay and MCAD installed. We submitted Ray clusters that ran a GLUE workload and used an MCAD AppWrapper to guarantee resource availability for the workload. In doing so, we learned about gang dispatching and queuing of Ray clusters. In summary, MCAD allowed us to do batch computing with Ray on Kubernetes. 

Please feel free to open issues for any questions on the KubeRay+MCAD integration.
