How Nixtla uses Ray to accurately predict more than a million time series in half an hour

By Nixtla Team   

Nixtla is an open-source time-series startup that seeks to democratize the use of state-of-the-art models in the field. Its founders are building the platform and tools they wanted to have while forecasting for the world’s leading companies. Recently, Nixtla launched its forecasting API driven by transfer learning using deep learning and efficient implementation of statistical models, relying on Ray for distributed use.

Today, businesses in every industry collect time-series data and want to be able to predict those patterns. Notable examples include predicting temperature and humidity readings from sensors to help manufacturers prevent failures, or predicting streaming metrics to identify popular music and artists. Time series forecasting also has other applications, like predicting sales of thousands of SKUs across different locations for supply chain optimization or anticipating peaks in the electricity market for cost reduction strategies. The growing need for efficient tools for time series data has been addressed by specialized databases such as Timescale and QuestDB. Now, we push the limits of time series forecast efficiency.

In this post, we will train and forecast millions of series with Nixtla’s StatsForecast library using its Ray integration for distributed computing. 

About StatsForecast and AutoARIMA

StatsForecast is the first library to efficiently address the challenge of predicting millions of time series with AutoARIMA. Before StatsForecast and its Ray integration, practitioners would have had to spend a lot of time parallelizing their tasks manually, setting up their clusters from scratch, and depending on slower ARIMA implementations. Now, Nixtla facilitates these tasks using the full power of Numba and Ray.

AutoARIMA is one of the most widely used models for time series forecasting. It was developed by Hyndman and Khandakar, and since its inception it has been an industry standard for its speed and accuracy. ARIMA models a time series through three parameters: the number of autoregressive terms p, the number of differences d, and the number of moving average terms q. AutoARIMA finds the best ARIMA model; in this sense, the hyperparameter tuning (choosing the right set of p, d, q) occurs inside the model using the Hyndman and Khandakar algorithm, so the user does not have to think about this task. The model leverages autocorrelations to obtain accurate predictions and is a well-proven choice that works great for baselines.
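To make the roles of d and p concrete, here is a minimal sketch in plain Python. It is not how StatsForecast fits ARIMA, and the function names are made up for illustration. It simulates a series whose first difference follows an AR(1) process (that is, p=1, d=1, q=0), differences it once, and recovers the autoregressive coefficient by least squares.

```python
import random


def simulate_arima_110(n, phi, seed=0):
    """Simulate a series whose first difference follows an AR(1) process."""
    rng = random.Random(seed)
    diffs = [0.0]
    for _ in range(n - 1):
        diffs.append(phi * diffs[-1] + rng.gauss(0.0, 1.0))
    # Integrate (cumulative sum) the differences to obtain the d=1 series.
    series, level = [], 0.0
    for x in diffs:
        level += x
        series.append(level)
    return series


def difference(series):
    """Apply one round of differencing (d=1)."""
    return [b - a for a, b in zip(series, series[1:])]


def fit_ar1(x):
    """Least-squares estimate of phi in x_t = phi * x_{t-1} + e_t."""
    num = sum(a * b for a, b in zip(x, x[1:]))
    den = sum(a * a for a in x[:-1])
    return num / den


y = simulate_arima_110(n=5000, phi=0.6, seed=42)
phi_hat = fit_ar1(difference(y))
print(f"estimated AR coefficient: {phi_hat:.2f} (true value 0.6)")
```

AutoARIMA automates the part this sketch hard-codes: it searches over candidate values of p, d, and q instead of assuming them.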

Before Nixtla's implementation, the reference AutoARIMA was the R version originally developed by Hyndman. The main Python alternative was pmdarima, which is based on StatsModels. As this experiment shows, pmdarima takes much longer to run, particularly for time series with seasonality.

StatsForecast compiles AutoARIMA just in time (JIT) using Numba, making it exceptionally fast. With Ray, we can distribute the computation across many different cores and scale horizontally. Here we show how this code allows data scientists and developers to fit a million series in less than an hour while spending less than USD 30 on AWS.

The rest of this post is structured as follows: 

  1. Installing StatsForecast

  2. Setting up a Ray cluster on AWS

  3. Experimental design

  4. Using StatsForecast on a Ray cluster

  5. Conclusion

Installing StatsForecast

The first thing you have to do is install StatsForecast and Ray.
You can do this easily using the following line: 

pip install "statsforecast[ray]"

Setting up a Ray cluster

Setting up a Ray cluster is as simple as following the instructions in the Ray documentation. In this experiment, we use the following YAML file, and we run the experiments on AWS. Before launching the cluster, you have to set up your AWS credentials. The easiest way to do this is with the AWS CLI: just run aws configure and follow the instructions.

cluster_name: default
max_workers: 249
upscaling_speed: 1.0
docker:
    image: rayproject/ray:latest-cpu 
    container_name: "ray_container"
    pull_before_run: True
    run_options: 
        - --ulimit nofile=65536:65536
idle_timeout_minutes: 5
provider:
    type: aws
    region: us-east-1
    cache_stopped_nodes: True
    security_group:
        GroupName: ray_client_security_group
        IpPermissions:
              - FromPort: 10001
                ToPort: 10001
                IpProtocol: TCP
                IpRanges:
                    - CidrIp: 0.0.0.0/0
auth:
    ssh_user: ubuntu
available_node_types:
    ray.head.default:
        node_config:
            InstanceType: m5.2xlarge
            BlockDeviceMappings:
                - DeviceName: /dev/sda1
                  Ebs:
                      VolumeSize: 100
    ray.worker.default:
        min_workers: 249
        max_workers: 1000
        node_config:
            InstanceType: m5.2xlarge
            InstanceMarketOptions:
                MarketType: spot
head_node_type: ray.head.default
rsync_exclude:
    - "**/.git"
    - "**/.git/**"
rsync_filter:
    - ".gitignore"
initialization_commands: []
setup_commands: 
    - pip install statsforecast
head_setup_commands: []
worker_setup_commands: []
head_start_ray_commands:
    - ray stop
    - ray start --head --port=6379 --object-manager-port=8076 --autoscaling-config=~/ray_bootstrap_config.yaml
worker_start_ray_commands:
    - ray stop
    - ray start --address=$RAY_HEAD_IP:6379 --object-manager-port=8076
head_node: {}
worker_nodes: {}

As can be seen, the head node is an m5.2xlarge instance (8 CPUs, 32 GB RAM). The worker nodes use the same instance type, and 249 of them are used for the experiments. Together with the head node, that makes 250 instances, so the deployed cluster has 2,000 CPUs and 8,000 GB of RAM. Each worker installs StatsForecast. To launch the cluster, simply use:

ray up cluster.yaml

Don’t forget to shut down the cluster once the job is done:

ray down cluster.yaml
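The cluster totals quoted above can be checked with a few lines of arithmetic. This is only a sanity check; the constant names are illustrative, and the per-node figures are the m5.2xlarge numbers stated in this post.

```python
# Node counts taken from the cluster configuration in this post.
WORKER_NODES = 249
HEAD_NODES = 1

# Per-node resources of an m5.2xlarge instance, as stated in this post.
CPUS_PER_NODE = 8
RAM_GB_PER_NODE = 32

total_nodes = WORKER_NODES + HEAD_NODES
total_cpus = total_nodes * CPUS_PER_NODE
total_ram_gb = total_nodes * RAM_GB_PER_NODE

print(f"{total_nodes} nodes, {total_cpus} CPUs, {total_ram_gb} GB RAM")
```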

Experimental design

The experiment consists of training sets of time series of different sizes on the cluster. The chosen sizes are 10 thousand, 100 thousand, 500 thousand, 1 million, and 2 million. The time series were randomly generated using the generate_series function of StatsForecast. All have a size of between 50 and 500 observations, and their frequency is daily.
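For readers who want a feel for this kind of synthetic data without installing anything, here is a rough standard-library stand-in. It is not the generate_series implementation: the function name make_series, the start date, and the uniform random values are assumptions made for illustration. It only mirrors the two properties described above, namely lengths between 50 and 500 and a daily frequency.

```python
import random
from datetime import date, timedelta


def make_series(n_series, min_length=50, max_length=500, seed=1):
    """Return (series_id, day, value) rows for randomly sized daily series."""
    rng = random.Random(seed)
    start = date(2000, 1, 1)  # arbitrary start date, chosen for illustration
    rows = []
    for uid in range(n_series):
        length = rng.randint(min_length, max_length)  # inclusive bounds
        for step in range(length):
            rows.append((uid, start + timedelta(days=step), rng.random()))
    return rows


rows = make_series(n_series=3)

# Count the observations per series to confirm the length bounds.
lengths = {}
for uid, _, _ in rows:
    lengths[uid] = lengths.get(uid, 0) + 1
print(lengths)
```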

Using StatsForecast on a Ray cluster

Using StatsForecast is as simple as specifying the Ray cluster address and using the ray_address argument of the class, as shown below.

import argparse
from time import time

from statsforecast.core import StatsForecast
from statsforecast.models import auto_arima
from statsforecast.utils import generate_series

if __name__ == "__main__":
    parser = argparse.ArgumentParser(
        description='Scale StatsForecast using ray'
    )
    parser.add_argument('--ray-address')
    args = parser.parse_args()

    # Repeat the benchmark for each dataset size.
    for length in [10_000, 100_000, 500_000, 1_000_000, 2_000_000]:
        print(f'length: {length}')
        # Generate synthetic series.
        series = generate_series(n_series=length, seed=1)

        # Fit and forecast inside the loop so that every size is timed.
        model = StatsForecast(series,
                              models=[auto_arima], freq='D',
                              n_jobs=-1,
                              ray_address=args.ray_address)
        init = time()
        forecasts = model.forecast(7)  # forecast horizon of 7 steps
        total_time = (time() - init) / 60  # elapsed time in minutes
        print(f'n_series: {length} total time: {total_time}')

After training the model for all the series, forecasts like the ones shown below will be obtained.

[Figure: blog-nixtla-1]

The results of the experiment are shown below. As can be seen, with Ray and StatsForecast you can train AutoARIMA for 1 million time series in half an hour.

[Figure: blog-nixtla-2]

We also performed an experiment without using Ray. As shown below, fitting is much faster with Ray than without it.

[Figure: blog-nixtla-3]
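One way to see why distribution helps here: each series is fit independently of the others, so the work can be split into chunks and handed to separate workers. The sketch below illustrates that pattern with the standard library only. It is a conceptual illustration, not how StatsForecast or Ray schedule work internally; threads stand in for cluster workers, a per-series mean stands in for the model, and all function names are hypothetical.

```python
from concurrent.futures import ThreadPoolExecutor


def split_into_chunks(items, n_chunks):
    """Split items into n_chunks contiguous, nearly equal parts."""
    size, extra = divmod(len(items), n_chunks)
    chunks, start = [], 0
    for i in range(n_chunks):
        end = start + size + (1 if i < extra else 0)
        chunks.append(items[start:end])
        start = end
    return chunks


def fit_chunk(chunk):
    """Placeholder 'model': return the mean of each series in the chunk."""
    return [sum(s) / len(s) for s in chunk]


# Ten toy series; series i holds the values i, i+1, ..., i+4.
series = [[float(i + j) for j in range(5)] for i in range(10)]
chunks = split_into_chunks(series, n_chunks=4)

# Each chunk is processed independently; map() preserves the chunk order.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = [r for part in pool.map(fit_chunk, chunks) for r in part]

print(results)
```

Because no chunk depends on another, adding workers shortens the wall-clock time until overheads such as data transfer start to dominate.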

StatsForecast allows you to train more models: just import them and add them to the models list as follows:

from statsforecast.models import auto_arima, naive, seasonal_naive

model = StatsForecast(series,
                      models=[auto_arima, naive, seasonal_naive],
                      freq='D',
                      n_jobs=-1,
                      ray_address='ray_address')  # replace with your cluster address
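For context on the two extra models, the sketch below shows the standard definitions of the naive and seasonal naive forecasts in plain Python. These are illustrative functions written for this post, not the StatsForecast implementations, and their names and signatures are assumptions.

```python
def naive_forecast(y, h):
    """Repeat the last observed value for all h future steps."""
    return [y[-1]] * h


def seasonal_naive_forecast(y, h, season_length):
    """Repeat the last observed season, cycling through it for h steps."""
    last_season = y[-season_length:]
    return [last_season[i % season_length] for i in range(h)]


history = [10, 12, 14, 11, 13, 15, 12, 14, 16]

print(naive_forecast(history, h=4))
print(seasonal_naive_forecast(history, h=4, season_length=3))
```

Baselines like these are cheap to compute, which makes them a useful reference point when judging whether AutoARIMA is adding accuracy.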

Conclusion

In this blog post, we showed the power of StatsForecast in combination with Ray, forecasting millions of time series with AutoARIMA in less than one hour. 

Please try out StatsForecast and its Ray integration and let us know about your experience. Feedback of any kind is always welcome. Here’s a link to our repo: https://github.com/Nixtla/statsforecast.
