Deploying XGBoost models with Ray Serve

By Simon Mo and Chandler Gibbons   

XGBoost is an optimized distributed gradient boosting library that implements machine learning algorithms under the gradient boosting framework. It is designed to be highly efficient and flexible, using parallel tree boosting to solve a wide range of data science and machine learning problems quickly. In a previous blog post, we explored three ways to speed up XGBoost model training.

XGBoost has quickly become a state-of-the-art machine learning algorithm for tasks on structured data, mainly due to its high speed and exceptional predictive performance. It is faster than many other ensemble classifiers, and its core algorithm is parallelizable, meaning it can run on multi-core CPUs and GPUs.
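To see what that parallelism looks like in code, here is a minimal sketch using XGBoost’s standard constructor parameters (the values are illustrative, not tuned):

from xgboost import XGBClassifier

# n_jobs=-1 uses all available CPU cores for training;
# tree_method="hist" selects the fast histogram-based algorithm.
model = XGBClassifier(n_jobs=-1, tree_method="hist")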

Options for serving machine learning models, including XGBoost models, include cloud-hosted platforms such as Amazon SageMaker, Kubeflow, Google Cloud AI Platform, and Microsoft’s Azure ML SDK. These are powerful serving tools from some of the largest tech companies, but they can be very expensive to use, and each tends to lock you into its own ecosystem.

Manually taking machine learning models from concept to production is typically complex and time-consuming, so there are several frameworks for deploying XGBoost in production.

In this article, we’ll cover how to deploy XGBoost with two frameworks: Flask and Ray Serve. We’ll also highlight the advantages Ray Serve offers over other solutions for serving models in production.

Deploying XGBoost with Flask

Flask is the most common Python microframework used for deploying XGBoost models, in part because it has no dependencies on external libraries. It is easy to set up, efficient, and makes it simple to expose a model through REST endpoints. Flask is also framework-agnostic: a plain HTTP request-handling function can wrap a model from any library. And unlike SageMaker and other cloud-hosted solutions, Flask is free. These are only a handful of the features that make Flask a popular choice for deploying XGBoost into production.

In this section, we’ll train, test, and deploy an XGBoost model with Flask. The model will be trained to predict the onset of diabetes using the pima-indians-diabetes dataset from the UCI Machine Learning Repository. This small dataset contains eight numerical medical features related to diabetes, plus one target variable, Outcome. So, we’ll use XGBoost to model and solve a simple binary prediction problem.
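For reference, here is the column layout this post assumes; the CSV file has no header row, so the order matters (the variable names below are just illustrative):

# The eight input features, in column order, followed by the target:
FEATURE_COLUMNS = [
    "Pregnancies", "Glucose", "BloodPressure", "SkinThickness",
    "Insulin", "BMI", "DiabetesPedigreeFunction", "Age",
]
TARGET_COLUMN = "Outcome"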

Building the XGBoost model

First, we will load some dependencies in addition to the data. Then the training starts:

from numpy import loadtxt
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score

# load data
dataset = loadtxt('pima-indians-diabetes.csv', delimiter=",")
# split data into X and y
X = dataset[:,0:8]
Y = dataset[:,8]
# hold out 33% of the data for testing
X_train, X_test, y_train, y_test = train_test_split(X, Y, test_size=0.33, random_state=7)

The second step is to create the XGBoost model and fit it to the numerical data we have:

model = XGBClassifier()
model.fit(X_train, y_train)

Once the model is trained, we test it using our testing set, and then calculate some metrics for evaluation purposes:

# make predictions on the test set and round to 0/1 labels
y_pred = model.predict(X_test)
predictions = [round(value) for value in y_pred]
# compare predictions against the true labels
accuracy = accuracy_score(y_test, predictions)
print("Accuracy: %.2f%%" % (accuracy * 100.0))

Deploying XGBoost with Flask

This process involves several stages:

Pickling (serialization)

Once the model is trained and tested, we can save it for future inferences using the pickle serialization module.

import pickle

# saving the model
with open('model.pkl','wb') as f:
    pickle.dump(model, f)

# loading the model
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

Creating a Flask app to serve the model

To deploy our XGBoost model, we'll use Flask. To create a Flask web app that can predict the onset of diabetes, we need a prediction route that makes inferences with our XGBoost model.

import pickle
import numpy as np
from flask import Flask, request, jsonify

app = Flask(__name__)
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

@app.route("/predict", methods=["POST"])
def predict():
    data = request.get_json(force=True)
    # The JSON values must arrive in the same order as the training features.
    prediction = model.predict([np.array(list(data.values()))])
    # Cast from numpy to a plain int so jsonify can serialize it.
    output = int(prediction[0])
    return jsonify(output)

if __name__ == "__main__":
    app.run(debug=True)
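With the app saved as app.py, Flask’s built-in development server can be started from the command line (fine for local testing; a production setup would typically put the app behind a WSGI server such as gunicorn):

python app.py

The app then listens at localhost:5000, which is the URL the request script below targets.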

Querying the predict API using requests

In this step, we create the request.py file. This script prints the predicted value by calling the API defined in app.py. Note that the JSON payload contains exactly the eight input features the model was trained on.

import requests

url = "http://localhost:5000/predict"
r = requests.post(
    url,
    json={
        "Pregnancies": 6,
        "Glucose": 148,
        "BloodPressure": 72,
        "SkinThickness": 35,
        "Insulin": 0,
        "BMI": 33.6,
        "DiabetesPedigree": 0.625,
        "Age": 50,
    },
)
print(r.json())

Deploying XGBoost with Ray Serve

Despite Flask’s usefulness for deploying machine learning models, it has some drawbacks. For instance, it is not well suited to large applications and lacks built-in login and authentication capabilities.

Beyond these, Flask’s main drawback for machine learning model serving is the challenge of scaling. With Flask, scaling means running many parallel instances of your app, and you must decide how: virtual machines, physical machines, or perhaps a Kubernetes cluster? Whichever method you choose, you are on the hook for spinning up instances of your app and load balancing across them. Deploying an XGBoost model with Ray Serve is one solution to this problem, since it provides a simple web server that handles the complex routing, scaling, and testing logic necessary for production deployments.

With Ray Serve, it’s easier to scale out your model on a multi-node Ray cluster, and you can take advantage of its ability to dynamically update running deployments. In addition, Ray Serve is framework-agnostic, so it can serve models from different machine learning frameworks such as TensorFlow, PyTorch, and scikit-learn alongside XGBoost. Altogether, it enables efficient, high-performance model serving in production.

Now, let’s use the XGBoost model that we created before and deploy it with Ray Serve.

First, let’s install Ray Serve:

pip install "ray[serve]"

Then, we start a Ray cluster. Ray Serve runs on top of Ray, so the cluster must be up first:

ray start --head

Next, we run the following Python script to import Ray Serve, connect to the running local Ray cluster, and start the Ray Serve processes on it:

import ray
from ray import serve
ray.init(address='auto', namespace="serve") # Connect to the local running Ray cluster.
serve.start(detached=True) # Start the Ray Serve processes within the Ray cluster.

Note that the serve.start call launches a few Ray actors that Ray Serve uses to route HTTP requests to the appropriate models.

Also note that we’re only running Ray locally to test our code. Even this gives us an advantage over Flask: by default, Ray uses all available CPU cores on our machine, while the Flask app we created previously only uses a single core. And this is only a small taste of the benefits Ray can provide, because we can just as easily deploy our model to a Ray cluster with dozens or even hundreds of nodes to serve our XGBoost model at scale without changing the code at all.
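As an optional sanity check, you can ask Ray what resources it has picked up (a small sketch, run from a fresh Python session attached to the same cluster):

import ray

ray.init(address="auto", namespace="serve")
# Shows cluster-wide totals, e.g. the number of available CPU cores.
print(ray.cluster_resources())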

Now that Ray Serve is ready, it is time to deploy the model. Since our XGBoost model is already trained and saved, we just need to load it inside a serving class and then deploy that class with Ray Serve.

Let’s jump right into it:

import pickle
import ray
from ray import serve

@serve.deployment(num_replicas=2, route_prefix="/regressor")
class XGB:
    def __init__(self):
        # Each replica loads its own copy of the trained model.
        with open("model.pkl", "rb") as f:
            self.model = pickle.load(f)

    async def __call__(self, starlette_request):
        payload = await starlette_request.json()
        print("Worker: received starlette request with data", payload)

        # Assemble the features in the same order the model was trained on.
        input_vector = [
            payload["Pregnancies"],
            payload["Glucose"],
            payload["BloodPressure"],
            payload["SkinThickness"],
            payload["Insulin"],
            payload["BMI"],
            payload["DiabetesPedigree"],
            payload["Age"],
        ]
        # Cast from numpy to a plain int so the response is JSON-serializable.
        prediction = int(self.model.predict([input_vector])[0])
        return {"result": prediction}

Now, here’s where the magic happens. The following few lines deploy our XGBoost model to the running Ray Serve instance, using nothing but Ray Serve’s Python API.

# Connect to the running Ray Serve instance.
serve.start(detached=True)
# Deploy the model.
XGB.deploy()

And there we go! Our XGBoost model is now deployed on Ray Serve, simply by calling deploy on the class we defined. In fact, two replicas of the model are running at the same time and handling requests. Scaling out further is as easy as changing the num_replicas parameter, as shown below.
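For instance, with the Serve API used in this post, redeploying with a different replica count takes a single line (a sketch; choose a count that fits your cluster):

# Scale from 2 replicas to 4 without changing the model code.
XGB.options(num_replicas=4).deploy()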

We can now query our deployed model by sending it a request. Note that Ray Serve’s HTTP server listens at localhost:8000 by default.

import requests

sample_request_input = {
    "Pregnancies": 6,
    "Glucose": 148,
    "BloodPressure": 72,
    "SkinThickness": 35,
    "Insulin": 0,
    "BMI": 33.6,
    "DiabetesPedigree": 0.625,
    "Age": 50,
}
response = requests.get("http://localhost:8000/regressor", json=sample_request_input)
print(response.text)
# Response:
# {"result": 1}

As seen above, the final result is 1 when the model predicts the onset of diabetes and 0 when it does not.
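If you’d rather return a probability of onset than a hard 0/1 label, XGBClassifier also provides predict_proba. A hypothetical variant of the last two lines of __call__ could look like this (not part of the deployment above):

# predict_proba returns [P(class 0), P(class 1)] for each sample;
# report the probability of class 1 (diabetes onset).
probability = float(self.model.predict_proba([input_vector])[0][1])
return {"result": probability}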

Conclusion

And that's all there is to it! You now know how to serve an XGBoost model using Flask and scale it easily using Ray Serve. 

To learn more about Ray Serve, the official documentation is a great place to start. You can begin by learning the basics, and then dive into the details on serving up machine learning models. Or, register for our upcoming meetup, where we'll discuss productionizing ML at scale with Ray Serve.
