This is part 4 of our blog series on Generative AI. In the previous blog posts we explained:
1. why Ray is a sound platform for Generative AI
2. how it can push the performance limits
3. how you can use Ray for Stable Diffusion
In this blog, we share a practical approach on how you can use the combination of HuggingFace, DeepSpeed, and Ray to build a system for fine-tuning and serving LLMs, in 40 minutes for less than $7 for a 6 billion parameter model. In particular, we illustrate the following:
Showing how, using these three components, you can simply and quickly put together an open-source LLM fine-tuning and serving system.
Showing how, by taking advantage of Ray’s distributed capabilities, this can be both more cost-effective and faster than using a single large (and often unobtainable) machine.
Discussing why you might want to run your own LLM instead of using one of the new API providers.
Showing you the evolving tech stack we are seeing for cost-effective LLM fine-tuning and serving, combining HuggingFace, DeepSpeed, PyTorch, and Ray.
Showing you 40 lines of Python code that can enable you to serve a 6 billion parameter GPT-J model.
Showing you how, for less than $7, you can fine-tune the model to sound more medieval using the works of Shakespeare. The fine-tuning runs in a distributed fashion on low-cost machines, which is considerably more cost-effective than using a single large, powerful machine.
Showing how you can serve the fine-tuned 6B model.
Showing how the fine-tuned model compares to a prompt engineering approach with large systems.
There are many, many providers of LLM APIs online. Why would you want to run your own? There are a few reasons:
Cost, especially for fine-tuned inference: For example, OpenAI charges 12c per 1000 tokens (about 700 words) for a fine-tuned model on Davinci. It’s important to remember that many user interactions require multiple backend calls (e.g. one to help with the prompt generation, post-generation moderation, etc), so it’s very possible that a single interaction with an end user could cost a few dollars. For many applications, this is cost prohibitive.
Latency: using these LLMs is especially slow. A GPT-3.5 query for example can take up to 30 seconds. Combine a few round trips from your data center to theirs and it is possible for a query to take minutes. Again, this makes many applications impossible. Bringing the processing in-house allows you to optimize the stack for your application, e.g. by using low-resolution models, tightly packing queries to GPUs, and so on. We have heard from users that optimizing their workflow has often resulted in a 5x or more latency improvement.
Data Security & Privacy: For many applications, you have to send these APIs a lot of data in order to get a response (e.g. send a few snippets of internal documents and ask the system to summarize them). Many of the API providers reserve the right to use those instances for retraining. Given the sensitivity of organizational data and also frequent legal constraints like data residency, this is especially limiting. One particularly concerning recent development is the ability to regenerate training data from learned models, which means people can unintentionally disclose secret information.
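To make the cost argument concrete, here is a minimal back-of-the-envelope sketch. Only the price (12 cents per 1000 tokens) and the words-per-token ratio come from the text above; the number of backend calls and the tokens per call are hypothetical placeholders you would replace with your own workload.

```python
# Rough cost of one end-user interaction against a hosted, fine-tuned model.
# Only the price and the words-per-token ratio come from the text above;
# the call count and token count below are hypothetical.
PRICE_PER_1K_TOKENS = 0.12  # USD, fine-tuned Davinci price quoted above
WORDS_PER_1K_TOKENS = 700   # "about 700 words" per 1000 tokens


def interaction_cost(backend_calls: int, tokens_per_call: int) -> float:
    """Return the USD cost of one interaction made of several backend calls."""
    total_tokens = backend_calls * tokens_per_call
    return total_tokens / 1000 * PRICE_PER_1K_TOKENS


# Hypothetical workload: 5 backend calls of 2,000 tokens each.
cost = interaction_cost(backend_calls=5, tokens_per_call=2000)
cost_per_1k_words = PRICE_PER_1K_TOKENS * 1000 / WORDS_PER_1K_TOKENS

print(f"Cost per interaction: ${cost:.2f}")
print(f"Cost per 1,000 words: ${cost_per_1k_words:.3f}")
```

With these assumed numbers a single interaction already costs more than a dollar, which is how multi-call workflows end up in the "few dollars" range.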
The LLM space is evolving very rapidly. What we are seeing emerge is a particular technology stack that combines multiple technologies: HuggingFace, DeepSpeed, PyTorch, and Ray.
What we’ve also seen is a reluctance to go beyond a single machine for training, in part because moving to multiple machines is perceived as complicated. The good news is that this is where Ray.io shines (ba-dum-tish). It simplifies cross-machine coordination and orchestration using not much more than Python and Ray decorators, and it is also a great framework for composing this entire stack.
Recent results on Dolly and Vicuna (both trained on Ray or trained on models built with Ray, like GPT-J) show that small LLMs (relatively speaking; say, the open-source GPT-J-6B with 6 billion parameters) can be incredibly powerful when fine-tuned on the right data. The key parts are the fine-tuning and the right data. So you do not always need to use the latest and greatest model with 150 billion-plus parameters to get useful results. Let’s get started!
The detailed steps on how to serve the GPT-J model with Ray can be found here, so let’s highlight some of the aspects of how we do that.
from ray import serve


@serve.deployment(ray_actor_options={"num_gpus": 1})
class PredictDeployment:
    def __init__(self, model_id: str, revision: str = None):
        # Import inside the actor so the heavy libraries load on the worker.
        from transformers import AutoModelForCausalLM, AutoTokenizer
        import torch

        self.model = AutoModelForCausalLM.from_pretrained(
            model_id,  # e.g. "EleutherAI/gpt-j-6B"
            revision=revision,
            torch_dtype=torch.float16,  # load the weights in half precision
            low_cpu_mem_usage=True,
            device_map="auto",  # automatically makes use of all GPUs available to the Actor
        )
Serving in Ray happens in actors, in this case one called PredictDeployment. This code shows the __init__ method of the actor, which downloads the model from Hugging Face. To launch the model on the current node, we simply do:
deployment = PredictDeployment.bind(model_id=model_id, revision=revision)
serve.run(deployment)
That starts a service on port 8000 of the local machine.
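As a rough check on why the deployment loads the weights with torch_dtype=torch.float16, the sketch below estimates the memory taken by the weights alone. It assumes exactly 6 billion parameters and ignores activations, the generation cache, and framework overhead, so real usage will be higher.

```python
# Weights-only memory estimate for a 6-billion-parameter model.
# This is a lower bound: activations, the generation cache, and framework
# overhead are not included.
NUM_PARAMS = 6_000_000_000  # "6 billion parameter" GPT-J model
BYTES_PER_PARAM = {"float32": 4, "float16": 2}
GIB = 1024 ** 3


def weights_gib(dtype: str) -> float:
    """Return the size of the model weights in GiB for the given dtype."""
    return NUM_PARAMS * BYTES_PER_PARAM[dtype] / GIB


for dtype in BYTES_PER_PARAM:
    print(f"{dtype}: {weights_gib(dtype):.1f} GiB of weights")
```

Under these assumptions the float16 weights come to roughly 11 GiB, which leaves some headroom on a 16 GB GPU, while the float32 weights alone would not fit.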
We can now query that service using a few lines of Python:
import requests

prompt = (
    "Once upon a time, there was a horse. "
)
sample_input = {"text": prompt}
# Ray Serve listens on port 8000 by default.
output = requests.post("http://localhost:8000/", json=[sample_input]).json()
print(output)
And sure enough, this prints out a continuation of the above opening. Each time it runs, there is something slightly different.
"Once upon a time, there was a horse.
But this particular horse was too big to be put into a normal stall. Instead, the animal was moved into an indoor pasture, where it could take a few hours at a time out of the stall. The problem was that this pasture was so roomy that the horse would often get a little bored being stuck inside. The pasture also didn’t have a roof, and so it was a good place for snow to accumulate."
This is certainly an interesting direction and story … but now we want to set it in the medieval era. What can we do?
Now that we’ve shown how to serve a model, how do we fine-tune it to be more medieval? What about if we train it on 2500 lines from Shakespeare?
This is where DeepSpeed comes in. DeepSpeed is a set of optimized algorithms for training and fine-tuning networks. The problem is that DeepSpeed doesn’t have an orchestration layer. This is not so much of a problem on a single machine, but if you want to use multiple machines, this typically involves a bunch of bespoke ssh commands, complex managed keys, and so on.
This is where Ray can help.
This page in the Ray documentation discusses how to fine-tune GPT-J to sound more like something from the 15th century with a bit of flair.
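To see why fine-tuning a 6B model calls for this kind of memory optimization, here is a rough sketch. It assumes mixed-precision training with Adam, for which a commonly cited figure is about 16 bytes of model state per parameter (fp16 weights and gradients plus fp32 optimizer state), and it assumes that state is sharded evenly across GPUs in the style of DeepSpeed's ZeRO. The exact DeepSpeed configuration used in this post is not stated here, so treat both numbers as illustrative assumptions.

```python
# Per-GPU model-state estimate for fine-tuning, under two assumptions:
#   1. mixed-precision Adam needs ~16 bytes per parameter
#      (2 for fp16 weights + 2 for fp16 gradients + 12 for fp32 optimizer state)
#   2. that state is sharded evenly across all GPUs (ZeRO-style partitioning)
# Activations and temporary buffers come on top of this.
NUM_PARAMS = 6_000_000_000
BYTES_PER_PARAM_TRAINING = 16  # assumption, see above
GIB = 1024 ** 3


def model_state_gib_per_gpu(num_gpus: int) -> float:
    """Return the sharded model state held by each GPU, in GiB."""
    return NUM_PARAMS * BYTES_PER_PARAM_TRAINING / num_gpus / GIB


for num_gpus in (1, 8, 16, 32):
    print(f"{num_gpus:>2} GPUs: {model_state_gib_per_gpu(num_gpus):5.1f} GiB per GPU")
```

Under these assumptions a single GPU would need close to 90 GiB for model state alone, while spreading it over 16 GPUs brings the share per GPU to under 6 GiB.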
Let’s go through the key parts. First, we load the data from Hugging Face:
from datasets import load_dataset

print("Loading tiny_shakespeare dataset")
current_dataset = load_dataset("tiny_shakespeare")
Skipping the tokenization code, here’s the heart of the code that we will run for each worker.
import evaluate
from transformers import GPTJForCausalLM, Trainer, default_data_collator
from transformers.utils.logging import enable_progress_bar

# model_name, tokenizer, training_args, and compute_metrics come from the
# setup code skipped here.


def trainer_init_per_worker(train_dataset, eval_dataset=None, **config):
    # Use the actual number of CPUs assigned by Ray (setup omitted here)
    model = GPTJForCausalLM.from_pretrained(model_name, use_cache=False)
    model.resize_token_embeddings(len(tokenizer))
    enable_progress_bar()
    metric = evaluate.load("accuracy")  # used by compute_metrics
    trainer = Trainer(
        model=model,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        compute_metrics=compute_metrics,
        tokenizer=tokenizer,
        data_collator=default_data_collator,
    )
    return trainer
And now we create a Ray AIR HuggingFaceTrainer that orchestrates the distributed run and wraps around multiple copies of the training loop above:
from ray.air.config import ScalingConfig
from ray.data.preprocessors import Chain
from ray.train.huggingface import HuggingFaceTrainer

trainer = HuggingFaceTrainer(
    trainer_init_per_worker=trainer_init_per_worker,
    trainer_init_config={
        "batch_size": 16,  # per device
        "epochs": 1,
    },
    scaling_config=ScalingConfig(
        num_workers=num_workers,
        use_gpu=use_gpu,
        resources_per_worker={"GPU": 1, "CPU": cpus_per_worker},
    ),
    datasets={"train": ray_datasets["train"], "evaluation": ray_datasets["validation"]},
    preprocessor=Chain(splitter, tokenizer),
)
results = trainer.fit()
While there is some complexity here, it is not much more complex than the code to get it to run on a single machine.
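One practical consequence of the scaling configuration above is that the per-device batch size of 16 is multiplied by the number of workers. The sketch below works this out, assuming plain data parallelism with one GPU per worker and no gradient accumulation; the dataset size is a hypothetical placeholder, not a figure from this post.

```python
import math

PER_DEVICE_BATCH_SIZE = 16  # "batch_size" in trainer_init_config above


def global_batch_size(num_workers: int) -> int:
    """Examples consumed per optimizer step across all workers.

    Assumes data parallelism with one GPU per worker and no gradient
    accumulation.
    """
    return PER_DEVICE_BATCH_SIZE * num_workers


def steps_per_epoch(num_examples: int, num_workers: int) -> int:
    """Optimizer steps needed to see every example once."""
    return math.ceil(num_examples / global_batch_size(num_workers))


NUM_EXAMPLES = 10_000  # hypothetical dataset size, for illustration only
for workers in (8, 16, 32):
    print(
        f"{workers:>2} workers: global batch {global_batch_size(workers):>3}, "
        f"{steps_per_epoch(NUM_EXAMPLES, workers)} steps per epoch"
    )
```

Because the global batch grows with the worker count, changing num_workers also changes the number of optimizer steps per epoch, which is worth keeping in mind when comparing runs.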
One of the most important topics related to LLMs is the question of cost. In this particular case, the costs are small, in part because we ran only one epoch of fine-tuning (depending on the problem, 1-10 epochs are typically used) and in part because this dataset is not very large. But running the tests on different configurations shows us that understanding performance is not always easy. The table below shows benchmarking results for different configurations of machines on AWS.
Configuration | Instances | Time (mins) | Total Cost (on-demand) | Total Cost (spot) | Cost Ratio (spot, vs. cheapest) |
---|---|---|---|---|---|
16 x g4dn.4xlarge (1 x T4 16GB GPU) | 16 | 48 | $15.41 | $6.17 | 100% |
32 x g4dn.4xlarge (1 x T4 16GB GPU) | 32 | 30 | $19.26 | $7.71 | 125% |
1 x p3.16xlarge (8 x V100 16GB GPU) | 1 | 44 | $17.95 | $9.27 | 150% |
1 x g5.48xlarge (8 x A10G 24GB GPU) | 1 | 84 | $22.81 | $10.98 | 178% |
Note: we tried to run the same test with A100s, but we were unable to obtain the p4d machines to do so.
Looking at these numbers, we see some surprises:
Perhaps the most obvious machine to use, the g5.48xlarge, which has the highest on-paper performance, is both the most expensive and the slowest, at almost twice the spot price of the cheapest option.
The p3.16xlarge with its use of NVLink between the GPUs is a considerably better option.
Most surprising of all, using multiple machines is both the cheapest and the fastest option.
The exact same code is running on all the machines, and aside from tweaking the number of GPU workers, nothing else was changed. Using multiple machines gave us the cheapest (16 machines) and the fastest (32 machines) option of the ones we benchmarked.
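The comparison above can be reproduced directly from the benchmark table. The sketch below uses only the numbers in the table; the Run class and variable names are ours, for illustration. It recomputes the cost ratio from the spot prices, picks out the cheapest and fastest configurations, and estimates how efficiently the run scaled from 16 to 32 machines.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Run:
    name: str
    instances: int
    minutes: int
    on_demand_usd: float
    spot_usd: float


# Values copied from the benchmark table above.
RUNS = [
    Run("16 x g4dn.4xlarge", 16, 48, 15.41, 6.17),
    Run("32 x g4dn.4xlarge", 32, 30, 19.26, 7.71),
    Run("1 x p3.16xlarge", 1, 44, 17.95, 9.27),
    Run("1 x g5.48xlarge", 1, 84, 22.81, 10.98),
]

cheapest = min(RUNS, key=lambda r: r.spot_usd)
fastest = min(RUNS, key=lambda r: r.minutes)

# Spot cost of each run relative to the cheapest run, in percent.
cost_ratio_pct = {r.name: round(100 * r.spot_usd / cheapest.spot_usd) for r in RUNS}

# Doubling the machines from 16 to 32: how much of the ideal 2x speedup
# did we actually get?
speedup = RUNS[0].minutes / RUNS[1].minutes
scaling_efficiency = speedup / (RUNS[1].instances / RUNS[0].instances)

print("Cheapest:", cheapest.name)
print("Fastest:", fastest.name)
print("Cost ratios (%):", cost_ratio_pct)
print(f"16 -> 32 machines: {speedup:.1f}x faster, {scaling_efficiency:.0%} efficiency")
```

Doubling the cluster cut the run time from 48 to 30 minutes, a 1.6x speedup, which is why the 32-machine run is the fastest but costs about 25% more than the 16-machine run.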
This is the beauty and power of Ray. The code itself was simple enough, and in fact, was able to use a standard library – DeepSpeed – with no modifications. So it was no more complex in this case than a single machine. Simultaneously, it gave more options and flexibility to optimize to be both cheaper and faster than a single machine.
Now that we have a fine-tuned model, let’s try to serve it. The only changes we need to make are to (a) copy the model from the fine-tuning process to S3 and (b) load it from there. In other words, the only change from the serving code we started with is:
        # Inside PredictDeployment.__init__ (requires: from ray.air import Checkpoint)
        checkpoint = Checkpoint.from_uri(
            "s3://demo-pretrained-model/gpt-j-6b-shakespeare"
        )
        # Load the model and tokenizer while the checkpoint directory is available.
        with checkpoint.as_directory() as model_dir:
            self.model = AutoModelForCausalLM.from_pretrained(
                model_dir,
                torch_dtype=torch.float16,
                low_cpu_mem_usage=True,
                device_map="auto",
            )
            self.tokenizer = AutoTokenizer.from_pretrained(model_dir)
And now let’s try querying it again:
Once upon a time there was a horse. This horse was in my youth, a little unruly, but yet the best of all. I have, sir; I know every horse in the field, and the best that I have known is the dead. And now I thank the gods, and take my leave.
As you can see, it definitely has more of a Shakespearean flavor.
We have shown a new tech stack that combines Ray, HuggingFace, DeepSpeed, and PyTorch. Along the way, we have seen that:
The stack makes it simple and quick to deploy an LLM as a service.
Fine-tuning can be done cost-effectively, and is actually most cost-effective when using multiple machines, without added complexity.
Fine-tuning, even for a single epoch, can change the output of a trained model.
Deploying a fine-tuned model is only marginally harder than deploying a standard one.
If you want to use LangChain + Ray to serve LLMs, see our LangChain blog series.
If you are interested in learning more about Ray, see Ray.io and Docs.Ray.io.
To connect with the Ray community join #LLM on the Ray Slack or our Discuss forum.
If you are interested in our Ray hosted service for ML Training and Serving, see Anyscale.com/Platform and click the 'Try it now' button.
Ray Summit 2023: If you are interested in learning much more about how Ray can be used to build performant and scalable LLM applications and to fine-tune, train, and serve LLMs on Ray, join Ray Summit on September 18-20th! We have a set of great keynote speakers, including John Schulman from OpenAI and Aidan Gomez from Cohere, community and tech talks about Ray, as well as practical training focused on LLMs.