12x faster iteration for companies like Canva
60s node startup and autoscaling
Up to 60% lower costs on many workloads (vs. open source Ray) through spot instance and elastic training support
50% reduction in cloud costs for companies like Canva
What is Ray Train?
Ray Train is an open source machine learning library built on top of Ray, a best-in-class distributed compute platform for AI/ML workloads.
Ray Train integrates with your preferred training frameworks, including PyTorch, Hugging Face, TensorFlow, XGBoost, and more, so you can develop with your preferred tech stack and then scale to the cloud with just one line of code.

Use Cases
Distributed Training
Increase training iteration speed without increasing cost by implementing distributed training on Anyscale. Easily scale from your laptop to any number of GPUs with just one line of code.
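The core idea behind data-parallel distributed training can be shown without any framework. The sketch below is plain Python, not Ray Train's API: each simulated worker computes a gradient on its own shard of the data, the gradients are averaged (the job a collective all-reduce does across real GPUs), and every worker applies the same update. The function names are hypothetical, and workers run sequentially here. In Ray Train the number of workers is set through a scaling configuration (for example `ScalingConfig(num_workers=...)`), which is roughly what "one line of code" refers to.

```python
# Conceptual sketch of synchronous data-parallel training in plain Python.
# This is NOT Ray Train's API; workers are simulated sequentially here.
from typing import List, Tuple

Sample = Tuple[float, float]  # (x, y) pair for 1-D linear regression


def shard(data: List[Sample], num_workers: int) -> List[List[Sample]]:
    """Split the dataset round-robin into one shard per worker."""
    return [data[i::num_workers] for i in range(num_workers)]


def local_gradient(w: float, b: float, batch: List[Sample]) -> Tuple[float, float]:
    """Mean-squared-error gradient of y_hat = w * x + b on one worker's shard."""
    n = len(batch)
    gw = sum(2 * (w * x + b - y) * x for x, y in batch) / n
    gb = sum(2 * (w * x + b - y) for x, y in batch) / n
    return gw, gb


def all_reduce_mean(grads: List[Tuple[float, float]]) -> Tuple[float, float]:
    """Average the per-worker gradients (stand-in for a collective all-reduce)."""
    k = len(grads)
    return sum(g[0] for g in grads) / k, sum(g[1] for g in grads) / k


def train(data: List[Sample], num_workers: int,
          epochs: int = 500, lr: float = 0.1) -> Tuple[float, float]:
    shards = shard(data, num_workers)
    w, b = 0.0, 0.0
    for _ in range(epochs):
        # On a real cluster these gradient computations run in parallel.
        grads = [local_gradient(w, b, s) for s in shards]
        gw, gb = all_reduce_mean(grads)
        # Every worker applies the identical update, so replicas stay in sync.
        w -= lr * gw
        b -= lr * gb
    return w, b


if __name__ == "__main__":
    data = [(i / 4, 2 * (i / 4) + 1) for i in range(8)]  # points on y = 2x + 1
    print(train(data, num_workers=1))
    print(train(data, num_workers=4))
```

With equal-sized shards, the average of the per-shard gradients equals the full-batch gradient, so one worker and four workers produce the same model up to floating-point rounding; the parallel version simply spreads the work.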

Benefits
Set-it-and-Forget-it Training
Ray Train includes built-in checkpointing so that failures don't waste compute. Recover from system failures and resume training from a recent checkpoint instead of starting over.
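To make the checkpoint-and-resume pattern concrete, here is a minimal plain-Python sketch; it is not Ray Train's checkpointing API. It assumes the training state fits in a small JSON file, and the helpers `save_checkpoint`, `load_checkpoint`, and `run` (with a `fail_at` parameter that simulates a crash) are hypothetical. Writing to a temporary file and renaming it over the old one means a crash mid-write does not leave a half-written checkpoint behind.

```python
import json
import os
import tempfile
from typing import Optional, Tuple


def save_checkpoint(path: str, epoch: int, state: dict) -> None:
    """Write to a temp file, then rename it over the previous checkpoint."""
    tmp = path + ".tmp"
    with open(tmp, "w") as f:
        json.dump({"epoch": epoch, "state": state}, f)
    os.replace(tmp, path)  # atomic rename on POSIX filesystems


def load_checkpoint(path: str) -> Tuple[int, dict]:
    """Return (next_epoch, state); start fresh if no checkpoint exists yet."""
    if not os.path.exists(path):
        return 0, {"epochs_trained": 0}
    with open(path) as f:
        ckpt = json.load(f)
    return ckpt["epoch"], ckpt["state"]


def run(path: str, total_epochs: int, fail_at: Optional[int] = None) -> dict:
    """Train to total_epochs, resuming from the latest checkpoint if present.

    fail_at simulates a crash (e.g. a lost node) at the start of that epoch.
    """
    start_epoch, state = load_checkpoint(path)
    for epoch in range(start_epoch, total_epochs):
        if epoch == fail_at:
            raise RuntimeError(f"simulated failure at epoch {epoch}")
        state["epochs_trained"] += 1  # stand-in for one epoch of real training
        save_checkpoint(path, epoch + 1, state)
    return state


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as d:
        ckpt_path = os.path.join(d, "checkpoint.json")
        try:
            run(ckpt_path, total_epochs=10, fail_at=6)
        except RuntimeError as e:
            print(e)  # simulated failure at epoch 6
        # The retry picks up at epoch 6 instead of epoch 0.
        print(run(ckpt_path, total_epochs=10))  # {'epochs_trained': 10}
```

The retry trains only the four remaining epochs, and `epochs_trained` ends at exactly 10, so no completed work is repeated.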
Faster Iteration, Same Cost
Train with parallelized compute to complete training jobs faster, and increase iteration speed by scaling across nodes during development.
Maximize GPU and CPU Utilization
Use CPUs and GPUs in the same pipeline to increase GPU utilization and decrease costs.
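One common way to keep an accelerator busy is to prepare upcoming batches on CPU workers while the current batch is being processed. The sketch below shows that prefetching pattern in plain Python with `concurrent.futures`; it is an illustration, not how Ray or Anyscale implement it, and `preprocess`, `prefetching_pipeline`, and `train_step` are hypothetical names. Threads keep the example short; CPU-heavy preprocessing written in pure Python would need processes (or native code that releases the GIL) to actually run in parallel.

```python
from collections import deque
from concurrent.futures import Future, ThreadPoolExecutor
from typing import Callable, Deque, Iterable, Iterator, List


def preprocess(raw: List[int]) -> List[float]:
    """CPU-side work such as decoding or normalizing (here: scale into [0, 1])."""
    hi = max(raw) or 1
    return [v / hi for v in raw]


def prefetching_pipeline(raw_batches: Iterable[List[int]],
                         cpu_fn: Callable[[List[int]], List[float]],
                         num_cpu_workers: int = 4,
                         prefetch: int = 8) -> Iterator[List[float]]:
    """Yield preprocessed batches in their original order while keeping up to
    `prefetch` batches in flight on a pool of CPU workers."""
    with ThreadPoolExecutor(max_workers=num_cpu_workers) as pool:
        pending: Deque[Future] = deque()
        for batch in raw_batches:
            pending.append(pool.submit(cpu_fn, batch))
            if len(pending) >= prefetch:
                yield pending.popleft().result()
        while pending:  # drain the remaining in-flight batches
            yield pending.popleft().result()


def train_step(batch: List[float]) -> float:
    """Stand-in for the accelerator-side training step."""
    return sum(batch)


if __name__ == "__main__":
    raw = [[i, i + 1, i + 2] for i in range(20)]
    losses = [train_step(b) for b in prefetching_pipeline(raw, preprocess)]
    print(len(losses))  # 20
```

Because futures are consumed in submission order, the training loop sees batches in the same order as the input, even though preprocessing of later batches may finish first.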
Compatible with Any Training Framework
Integrate with training frameworks like PyTorch, Hugging Face, TensorFlow, and more. Develop with your preferred tech stack, then scale to the cloud with just one line of code.
Supercharge Ray Train with Anyscale
Easily Get Started with Distributed Training at Scale

Feature | ![]() | ![]() | |
|---|---|---|---|
Elastic Training & Spot Instance Support | ![]() - | ![]() - | |
Job Retries & Fault Tolerance Support | ![]() | ![]() - | |
Fast Node Launching and Autoscaling | ![]() - | ![]() - | 60 sec |
Fractional Heterogeneous Resource Allocation | ![]() - | ![]() | |
Detailed Training Dashboard | ![]() - | ![]() - | |
Last-Mile Data Preprocessing | ![]() - | ![]() | |
Autoscaling Development Environment | ![]() - | ![]() - | |
Distributed Debugger | ![]() - | ![]() - | |
Data Integrations (Databricks, Snowflake, S3, GCS, etc.) | ![]() | ![]() | |
Framework Support (PyTorch, Hugging Face, TensorFlow, XGBoost, etc.) | ![]() | ![]() | |
Experiment Tracking Integrations (Weights & Biases, MLflow, etc.) | ![]() | ![]() | |
Orchestration Integrations (Prefect, Apache Airflow, etc.) | ![]() | ![]() | |
Alerting | ![]() | ![]() - | |
Resumable Jobs | ![]() | ![]() | |
Priority Scheduling | ![]() - | ![]() | |
Job Queues | ![]() | ![]() - | |
EFA Support | ![]() | ![]() Custom | |
Out-of-the-Box Templates & App Accelerators
Jumpstart your development process with custom-made templates, only available on Anyscale.
End-to-End LLM Workflows
Execute end-to-end LLM workflows to develop and productionize LLMs at scale
Pre-Train Stable Diffusion
Pre-train a Stable Diffusion V2 model with Ray Train and Ray Data
Fine-Tune Stable Diffusion
Fine-tune a personalized Stable Diffusion XL model with Ray Train
Distributed AI Model Training at Scale
Enable simple, fast, and affordable distributed model training with Anyscale. Learn more, or get started today.



