
End-to-end LLM Workflows Guide

By Goku Mohandas | June 17, 2024

[ Anyscale template | GitHub | Notebook ] · 25 min. read

In this guide, we'll learn how to execute end-to-end LLM workflows to develop and productionize LLMs at scale.

Data preprocessing: prepare our dataset for fine-tuning with batch data processing.
Fine-tuning: tune our LLM (LoRA / full param) with key optimizations using distributed training.
Evaluation: apply batch inference with our tuned LLMs to generate outputs and perform evaluation.
Serving: serve our LLMs as a production application that can autoscale, swap between LoRA adapters, etc.

Throughout these workloads we'll be using Ray, a framework for distributing ML workloads that is used by OpenAI, Netflix, Uber, etc., and Anyscale, a platform to scale your ML workloads from development to production.

💵 Cost: $0 (using free Anyscale credits)
🕑 Total time: 90 mins (including fine-tuning)
🔄 REPLACE indicates to replace with your unique values
💡 INSIGHT indicates infrastructure insight
Join Slack community to share issues / questions

Set up

We can execute this notebook entirely for free (no credit card needed) by creating an Anyscale account. Once you log in, you'll be directed to the main console where you'll see a collection of notebook templates. Click on the "End-to-end LLM Workflows" to open up the workspace and click on the README.ipynb to get started.

Workspaces are fully managed development environments that let us use our favorite tools (VSCode, notebooks, terminal, etc.) on top of on-demand compute (provisioned when we need it). In fact, by clicking on the compute at the top right (✅ 1 node, 8 CPU), we can see the cluster information:

Head node (Workspace node): manages the cluster, distributes tasks, and hosts development tools.
Worker nodes: machines that execute work orchestrated by the head node and can scale back to 0.

💡 INSIGHT : Because we have Auto-select worker nodes enabled, the required worker nodes (ex. GPU workers) will automagically be provisioned based on our workload's needs! They'll spin up, run the workload and then scale back to zero. This allows us to maintain a lean workspace environment (and only pay for compute when we need it) and completely removes the need to manage any infrastructure.

Note: we can explore all the metrics (ex. hardware util), logs, dashboards, manage dependencies (ex. images, pip packages, etc.) on the menu bar above.

import os
import ray
import warnings
warnings.filterwarnings("ignore")
%load_ext autoreload
%autoreload 2

We'll need a free Hugging Face token to load our base LLMs, tokenizers, etc. And since we are using Llama models, we need to log in and accept the terms and conditions on the model's Hugging Face page.

🔄 REPLACE : Place your unique HF token below. If you accidentally ran this code block before pasting your HF token, then click the Restart button up top to restart the notebook kernel.

# Initialize HF token
os.environ['HF_TOKEN'] = ''  # <-- replace with your token
ray.init(runtime_env={'env_vars': {'HF_TOKEN': os.environ['HF_TOKEN']}})

Data Preprocessing

We'll start by preprocessing our data in preparation for fine-tuning our LLM. We'll use batch processing to apply our preprocessing across our dataset at scale.

Dataset

For our task, we'll be using the ViGGO dataset, where the input (meaning_representation) is a structured collection of the overall intent (ex. inform) and entities (ex. release_year) and the output (target) is an unstructured sentence that incorporates all the structured input information. We'll reverse this setup: the input will be the unstructured sentence and the output will be the structured information.

Input (unstructured sentence):
"Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC rated E 10+ (for Everyone 10 and Older). It's not available on Steam, Linux, or Mac."

Output (intent + entities):
"inform(name[Dirt: Showdown], release_year[2012], esrb[E 10+ (for Everyone 10 and Older)], genres[driving/racing, sport], platforms[PlayStation, Xbox, PC], available_on_steam[no], has_linux_release[no], has_mac_release[no])"
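To make the output format concrete, here is a minimal sketch that splits a meaning representation into its intent and its attribute/value pairs. The helper name parse_meaning_representation and the regular expressions are illustrative assumptions, not part of the original notebook, and they assume attribute values never contain square brackets.

```python
import re

# Hypothetical helper (not part of the original notebook): split a ViGGO-style
# meaning representation into its intent and a dict of attribute -> value.
MR_PATTERN = re.compile(r"^(?P<intent>\w+)\((?P<body>.*)\)$", re.DOTALL)
ATTR_PATTERN = re.compile(r"(\w+)\[(.*?)\]")


def parse_meaning_representation(mr: str) -> tuple[str, dict[str, str]]:
    match = MR_PATTERN.match(mr.strip())
    if match is None:
        raise ValueError(f"Not a valid meaning representation: {mr!r}")
    # Each attribute looks like name[value]; values are assumed to have no brackets.
    attributes = dict(ATTR_PATTERN.findall(match.group("body")))
    return match.group("intent"), attributes


example = (
    "inform(name[Dirt: Showdown], release_year[2012], "
    "esrb[E 10+ (for Everyone 10 and Older)], genres[driving/racing, sport], "
    "platforms[PlayStation, Xbox, PC], available_on_steam[no], "
    "has_linux_release[no], has_mac_release[no])"
)
intent, attributes = parse_meaning_representation(example)
print(intent)
print(attributes["genres"])
```

A parser like this is handy later on, when we want to compare predictions attribute by attribute rather than as whole strings.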

from datasets import load_dataset
ray.data.set_progress_bars(enabled=False)

# Load the VIGGO dataset
dataset = load_dataset("GEM/viggo", trust_remote_code=True)

# Data splits
train_set = dataset['train']
val_set = dataset['validation']
test_set = dataset['test']
print(f"train: {len(train_set)}")
print(f"val: {len(val_set)}")
print(f"test: {len(test_set)}")

train: 5103
val: 714
test: 1083

# Sample
train_set[0]

{'gem_id': 'viggo-train-0',
 'meaning_representation': 'inform(name[Dirt: Showdown], release_year[2012], esrb[E 10+ (for Everyone 10 and Older)], genres[driving/racing, sport], platforms[PlayStation, Xbox, PC], available_on_steam[no], has_linux_release[no], has_mac_release[no])',
 'target': "Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC rated E 10+ (for Everyone 10 and Older). It's not available on Steam, Linux, or Mac.",
 'references': ["Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC rated E 10+ (for Everyone 10 and Older). It's not available on Steam, Linux, or Mac."]}

Data Preprocessing

We'll use Ray to load our dataset and apply preprocessing to batches of our data at scale.

import re

# Load as a Ray Dataset
train_ds = ray.data.from_items(train_set)
train_ds.take(1)

[{'gem_id': 'viggo-train-0',
  'meaning_representation': 'inform(name[Dirt: Showdown], release_year[2012], esrb[E 10+ (for Everyone 10 and Older)], genres[driving/racing, sport], platforms[PlayStation, Xbox, PC], available_on_steam[no], has_linux_release[no], has_mac_release[no])',
  'target': "Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC rated E 10+ (for Everyone 10 and Older). It's not available on Steam, Linux, or Mac.",
  'references': ["Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC rated E 10+ (for Everyone 10 and Older). It's not available on Steam, Linux, or Mac."]}]

The preprocessing we'll do involves formatting our dataset into the schema required for fine-tuning (system, user, assistant) conversations.

system: description of the behavior or personality of the model. As a best practice, this should be the same for all examples in the fine-tuning dataset, and should remain the same system prompt when moved to production.
user: user message, or "prompt," that provides a request for the model to respond to.
assistant: stores previous responses but can also contain examples of intended responses for the LLM to return.

conversations = [
    {"messages": [
        {'role': 'system', 'content': system_content},
        {'role': 'user', 'content': item['target']},
        {'role': 'assistant', 'content': item['meaning_representation']}]},
    {"messages": [...],}
    ...
]

def to_schema(item, system_content):
    messages = [
        {'role': 'system', 'content': system_content},
        {'role': 'user', 'content': item['target']},
        {'role': 'assistant', 'content': item['meaning_representation']}]
    return {'messages': messages}
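Before spending GPU time on fine-tuning, it can be worth checking that every record really follows the system, user, assistant layout. The sketch below is one way to do that; is_valid_conversation and EXPECTED_ROLES are hypothetical names introduced here for illustration, and the sample item is shortened, made-up data.

```python
# Hypothetical sanity check (not part of the original notebook): confirm that a
# record follows the system -> user -> assistant layout before fine-tuning.
EXPECTED_ROLES = ["system", "user", "assistant"]


def to_schema(item, system_content):
    # Same logic as the function above, repeated so this sketch runs on its own.
    return {"messages": [
        {"role": "system", "content": system_content},
        {"role": "user", "content": item["target"]},
        {"role": "assistant", "content": item["meaning_representation"]},
    ]}


def is_valid_conversation(record) -> bool:
    messages = record.get("messages", [])
    roles = [message.get("role") for message in messages]
    # Every message needs non-empty string content.
    has_content = all(
        isinstance(m.get("content"), str) and bool(m["content"].strip())
        for m in messages
    )
    return roles == EXPECTED_ROLES and has_content


sample_item = {
    "target": "Dirt: Showdown from 2012 is a sport racing game.",  # shortened example
    "meaning_representation": "inform(name[Dirt: Showdown], release_year[2012])",
}
record = to_schema(sample_item, system_content="Placeholder system prompt.")
print(is_valid_conversation(record))
```

A check like this is cheap to run on a handful of samples and catches missing fields or swapped roles early.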

Our system_content will guide the LLM on how to behave. Our specific directions involve specifying the list of possible intents and entities to extract.

# System content
system_content = (
    "Given a target sentence construct the underlying meaning representation of the input "
    "sentence as a single function with attributes and attribute values. This function "
    "should describe the target string accurately and the function must be one of the "
    "following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', "
    "'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes "
    "must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', "
    "'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', "
    "'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']")

To apply our function on our dataset at scale, we can pass it to ray.data.Dataset.map. Here, we can specify the function to apply to each sample in our data, what compute to use, etc. The diagram below shows how we can read from various data sources (ex. cloud storage) and apply operations at scale across different hardware (CPU, GPU). For our workload, we'll just use the default compute strategy which will use CPUs to scale out our workload.

Note: If we want to distribute a workload across batches of our data instead of individual samples, we can use ray.data.Dataset.map_batches. We'll see this in action when we perform batch inference in our evaluation template. There are also many other distributed operations we can perform on our dataset.
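To illustrate the difference between per-sample and per-batch functions, here is a rough batch-wise counterpart of to_schema in plain Python. It assumes a batch arrives as a dict of equal-length columns, which resembles, but is not guaranteed to match, what Ray Data hands to a map_batches function; to_schema_batch is a hypothetical name.

```python
# A sketch only: the batch is assumed to be a dict of equal-length columns,
# which mirrors (but is not identical to) the batches Ray Data passes around.
def to_schema_batch(batch, system_content):
    conversations = []
    for target, meaning_representation in zip(
        batch["target"], batch["meaning_representation"]
    ):
        conversations.append([
            {"role": "system", "content": system_content},
            {"role": "user", "content": target},
            {"role": "assistant", "content": meaning_representation},
        ])
    # One output column with one conversation per input row.
    return {"messages": conversations}


batch = {
    "target": ["Sentence A.", "Sentence B."],
    "meaning_representation": ["inform(name[A])", "inform(name[B])"],
}
result = to_schema_batch(batch, system_content="Placeholder system prompt.")
print(len(result["messages"]))
```

For a cheap transformation like ours the per-sample version is simpler; batch functions pay off when there is per-call overhead to amortize, such as running a model on a GPU.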

# Distributed preprocessing
ft_train_ds = train_ds.map(to_schema, fn_kwargs={'system_content': system_content})
ft_train_ds.take(1)

[{'messages': [{'content': "Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']",
    'role': 'system'},
   {'content': "Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC rated E 10+ (for Everyone 10 and Older). It's not available on Steam, Linux, or Mac.",
    'role': 'user'},
   {'content': 'inform(name[Dirt: Showdown], release_year[2012], esrb[E 10+ (for Everyone 10 and Older)], genres[driving/racing, sport], platforms[PlayStation, Xbox, PC], available_on_steam[no], has_linux_release[no], has_mac_release[no])',
    'role': 'assistant'}]}]

# Repeat the steps for other splits
ft_val_ds = ray.data.from_items(val_set).map(to_schema, fn_kwargs={'system_content': system_content})
ft_test_ds = ray.data.from_items(test_set).map(to_schema, fn_kwargs={'system_content': system_content})

Save and load data

We can save our data locally and/or to remote storage to use later (training, evaluation, etc.). All workspaces come with a default cloud storage location and shared storage that we can write to.

os.environ['ANYSCALE_ARTIFACT_STORAGE']

's3://anyscale-test-data-cld-i2w99rzq8b6lbjkke9y94vi5/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage'

# Write to cloud storage
ft_train_ds.write_json(f"{os.environ['ANYSCALE_ARTIFACT_STORAGE']}/viggo/train.jsonl")
ft_val_ds.write_json(f"{os.environ['ANYSCALE_ARTIFACT_STORAGE']}/viggo/val.jsonl")
ft_test_ds.write_json(f"{os.environ['ANYSCALE_ARTIFACT_STORAGE']}/viggo/test.jsonl")

# Load from cloud storage
ft_train_ds = ray.data.read_json(f"{os.environ['ANYSCALE_ARTIFACT_STORAGE']}/viggo/train.jsonl")
ft_train_ds.take(1)

[{'messages': [{'content': "Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']",
    'role': 'system'},
   {'content': "Dirt: Showdown from 2012 is a sport racing game for the PlayStation, Xbox, PC rated E 10+ (for Everyone 10 and Older). It's not available on Steam, Linux, or Mac.",
    'role': 'user'},
   {'content': 'inform(name[Dirt: Showdown], release_year[2012], esrb[E 10+ (for Everyone 10 and Older)], genres[driving/racing, sport], platforms[PlayStation, Xbox, PC], available_on_steam[no], has_linux_release[no], has_mac_release[no])',
    'role': 'assistant'}]}]
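For readers unfamiliar with the JSON Lines layout used for these files, the following local sketch writes two made-up conversations, one JSON object per line, and reads them back with the standard library. It is only an illustration of the format: the records and the temporary file are invented here, and the files Ray writes to cloud storage may be organized differently (for example as several files under the given path).

```python
import json
import tempfile
from pathlib import Path

# Illustration only: a local, single-file stand-in for the JSON Lines data
# written above. The file name and records here are made up.
records = [
    {"messages": [
        {"role": "system", "content": "Placeholder system prompt."},
        {"role": "user", "content": "Sentence A."},
        {"role": "assistant", "content": "inform(name[A])"},
    ]},
    {"messages": [
        {"role": "system", "content": "Placeholder system prompt."},
        {"role": "user", "content": "Sentence B."},
        {"role": "assistant", "content": "inform(name[B])"},
    ]},
]

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "train.jsonl"
    # One JSON object per line.
    with path.open("w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record) + "\n")
    # Read it back line by line, skipping blank lines.
    with path.open(encoding="utf-8") as f:
        loaded = [json.loads(line) for line in f if line.strip()]

print(loaded == records)
```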

Fine-tuning

In this template, we'll fine-tune a large language model (LLM) using our dataset from the previous data preprocessing template.

Note: We normally would not jump straight to fine-tuning a model. We would first experiment with a base model and evaluate it so that we can have a baseline performance to compare it to.

Configurations

We'll fine-tune our LLM by choosing a set of configurations. We have created recipes for different LLMs in the training configs folder which can be used as is or modified for experiments. These configurations provide flexibility over a broad range of parameters such as model, data paths, compute to use for training, number of training epochs, how often to save checkpoints, padding, loss, etc. We also include several DeepSpeed configurations to choose from for further optimizations around data/model parallelism, mixed precision, checkpointing, etc.

We also have recipes for LoRA (where we train a set of small low-rank matrices instead of updating the original attention and feed-forward weights) or full-parameter fine-tuning. We recommend starting with LoRA as it's less resource intensive and quicker to train.
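To get a feel for why LoRA is lighter, the back-of-the-envelope sketch below counts trainable parameters for a single projection matrix. It assumes a square 4096 x 4096 projection (the hidden size of Llama-3-8B) and uses r=8 and lora_alpha=16, the values from the LoRA configuration in this guide; the helper name lora_trainable_params is made up for illustration.

```python
# Back-of-the-envelope arithmetic, not taken from the training code.
# Assumption: a square 4096 x 4096 projection (the hidden size of Llama-3-8B).
def lora_trainable_params(d_out: int, d_in: int, r: int) -> int:
    # LoRA learns B (d_out x r) and A (r x d_in) instead of the full matrix.
    return r * (d_out + d_in)


d_out, d_in, r, lora_alpha = 4096, 4096, 8, 16
full = d_out * d_in
lora = lora_trainable_params(d_out, d_in, r)

print(f"full matrix: {full:,} params")
print(f"LoRA (r={r}): {lora:,} params ({lora / full:.2%} of full)")
print(f"scaling applied to the low-rank update: {lora_alpha / r}")
```

Under these assumptions, the adapter for one such matrix holds well under 1% of the parameters of the matrix it adapts, which is where most of the memory and time savings come from.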

# View the training (LoRA) configuration for llama-3-8B
!cat configs/training/lora/llama-3-8b.yaml

model_id: meta-llama/Meta-Llama-3-8B-Instruct # <-- change this to the model you want to fine-tune
train_path: s3://llm-guide/data/viggo/train.jsonl # <-- change this to the path to your training data
valid_path: s3://llm-guide/data/viggo/val.jsonl # <-- change this to the path to your validation data. This is optional
context_length: 512 # <-- change this to the context length you want to use
num_devices: 16 # <-- change this to total number of GPUs that you want to use
num_epochs: 5 # <-- change this to the number of epochs that you want to train for
train_batch_size_per_device: 16
eval_batch_size_per_device: 16
learning_rate: 1e-4
padding: "longest" # This will pad batches to the longest sequence. Use "max_length" when profiling to profile the worst case.
num_checkpoints_to_keep: 1
dataset_size_scaling_factor: 10000
output_dir: /mnt/local_storage
deepspeed:
    config_path: configs/deepspeed/zero_3_offload_optim+param.json
dataset_size_scaling_factor: 10000 # internal flag. No need to change
flash_attention_2: true
trainer_resources:
    memory: 53687091200 # 50 GB memory
worker_resources:
    accelerator_type:A10G: 0.001
lora_config:
    r: 8
    lora_alpha: 16
    lora_dropout: 0.05
    target_modules:
    - q_proj
    - v_proj
    - k_proj
    - o_proj
    - gate_proj
    - up_proj
    - down_proj
    - embed_tokens
    - lm_head
    task_type: "CAUSAL_LM"
    modules_to_save: []
    bias: "none"
    fan_in_fan_out: false
    init_lora_weights: true
Fine-tuning

This Workspace is still running on a small, lean head node. But based on the compute we want to use (ex. num_devices and accelerator_type) for fine-tuning, the appropriate worker nodes will automatically be initialized and execute the workload. And afterwards, they'll scale back to zero!

💡 INSIGHT : With Ray we're able to execute a large, compute intensive workload like this using smaller, more available resources (ex. using A10s instead of waiting for elusive A100s). And Anyscale's smart instance manager will automatically provision the appropriate and available compute for the workload based on what's needed.

While we could execute python src/ft.py configs/training/lora/llama-3-8b.yaml directly inside a Workspace notebook (see this example), we'll instead kick off the fine-tuning workload as an isolated job. An Anyscale Job is a great way to scale and execute a specific workload. Here, we specify the command that needs to run (ex. python [COMMAND][ARGS]) along with the requirements (ex. Docker image, additional pip packages, etc.).

Note: Executing an Anyscale Job within a Workspace will ensure that files in the current working directory are available for the Job (unless excluded with --exclude). But we can also load files from elsewhere (ex. GitHub repo, S3, etc.) if we want to launch a Job from anywhere.

# View job yaml config
!cat deploy/jobs/ft.yaml

name: llm-fine-tuning-guide
entrypoint: python src/ft.py configs/training/lora/llama-3-8b.yaml
image_uri: localhost:5555/anyscale/llm-forge:0.4.3.2
requirements: []
max_retries: 0
💡 INSIGHT : When defining this Job config, if we don't specify the compute config to use, then Anyscale will autoselect based on the required compute. However, we also have the option to specify one and even make highly cost-effective decisions such as spot to on-demand fallback (or vice-versa).

# Sample compute config
- name: gpu-worker-a10
  instance_type: g5.2xlarge
  min_workers: 0
  max_workers: 16
  use_spot: true
  fallback_to_ondemand: true

# Job submission
!anyscale job submit --config-file deploy/jobs/ft.yaml --exclude assets

Output
(anyscale +0.8s) Submitting job with config JobConfig(name='llm-fine-tuning-guide', image_uri='localhost:5555/anyscale/llm-forge:0.4.3.2', compute_config=None, env_vars=None, py_modules=None).
(anyscale +3.2s) Uploading local dir '.' to cloud storage.
(anyscale +4.8s) Job 'llm-fine-tuning-guide' submitted, ID: 'prodjob_515se1nqf8ski7scytd52vx65e'.
(anyscale +4.8s) View the job in the UI: https://console.anyscale.com/jobs/prodjob_515se1nqf8ski7scytd52vx65e
(anyscale +4.8s) Use `--wait` to wait for the job to run and stream logs.

This workload (we set to five epochs) will take ~45 min. to complete. As the job runs, you can monitor logs, metrics, Ray dashboard, etc. by clicking on the generated Job link above (https://console.anyscale.com/jobs/prodjob_...)

Note: If we didn't want to have all this control and flexibility for our fine-tuning workload, there is also a much easier workflow with Anyscale serverless endpoints.

import openai

# Get your API key from https://console.anyscale.com/credentials
ANYSCALE_API_KEY = "esecret_yourKeyHere"
ANYSCALE_API_BASE = "https://api.endpoints.anyscale.com/v1"

# Anyscale Endpoints are OpenAI compatible
# (train_file_path and validation_file_path are placeholders for your local JSONL files)
client = openai.OpenAI(base_url=ANYSCALE_API_BASE, api_key=ANYSCALE_API_KEY)
training_file_id = client.files.create(file=open(train_file_path, 'rb'), purpose="fine-tune").id
valid_file_id = client.files.create(file=open(validation_file_path, 'rb'), purpose="fine-tune").id

# Create finetuning job. Other parameters like context length will be chosen appropriately based on dataset size
fine_tuning_job_id = client.fine_tuning.jobs.create(
    model="meta-llama/Meta-Llama-3-8B-Instruct",
    hyperparameters={"n_epochs": 4},
    training_file=training_file_id,
    validation_file=valid_file_id).id

Load artifacts

From the very end of the logs, we can also see where our model artifacts are stored. For example:

Successfully copied files to to bucket: anyscale-test-data-cld-i2w99rzq8b6lbjkke9y94vi5 and path: org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/lora_fine_tuning/meta-llama/Meta-Llama-3-8B-Instruct:gokum:atyhk

We'll download these artifacts from cloud storage to local cluster storage so we can use them for other workloads.

from src.utils import download_files_from_bucket

🔄 REPLACE : Update the information below for the specific model and artifacts path for our fine-tuned model (retrieved from the logs from the Anyscale Job we launched above).

# Locations
artifacts_dir = '/mnt/cluster_storage'  # storage accessible by head and worker nodes
model = 'meta-llama/Meta-Llama-3-8B-Instruct'
uuid = 'goku_:ueewk'  # REPLACE with your NAME + MODEL ID (from Job logs)
artifacts_path = (
    f"{os.environ['ANYSCALE_ARTIFACT_STORAGE'].split(os.environ['ANYSCALE_CLOUD_STORAGE_BUCKET'])[-1][1:]}"
    f"/lora_fine_tuning/{model}:{uuid}")
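The string manipulation in artifacts_path is easier to follow with concrete values. The sketch below replays it using the storage URI and bucket name that appear in the outputs of this guide; it assumes ANYSCALE_CLOUD_STORAGE_BUCKET holds the bare bucket name from the job log, and your own values will differ.

```python
# Example values taken from the outputs shown in this guide; yours will differ.
artifact_storage = (
    "s3://anyscale-test-data-cld-i2w99rzq8b6lbjkke9y94vi5/"
    "org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage"
)
bucket = "anyscale-test-data-cld-i2w99rzq8b6lbjkke9y94vi5"  # assumed env var value
model = "meta-llama/Meta-Llama-3-8B-Instruct"
uuid = "gokum:atyhk"

# Everything after the bucket name, minus the leading slash.
prefix = artifact_storage.split(bucket)[-1][1:]
artifacts_path = f"{prefix}/lora_fine_tuning/{model}:{uuid}"
print(artifacts_path)
```

The result is a bucket-relative key (no s3:// scheme and no leading slash), which is the form that appears in the path printed at the end of the job logs.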

# Download artifacts
download_files_from_bucket(
    bucket=os.environ['ANYSCALE_CLOUD_STORAGE_BUCKET'],
    path=artifacts_path,
    local_dir=artifacts_dir)

Downloaded org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/lora_fine_tuning/meta-llama/Meta-Llama-3-8B-Instruct:gokum:atyhk/README.md to /mnt/cluster_storage/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/lora_fine_tuning/meta-llama/Meta-Llama-3-8B-Instruct:gokum:atyhk/README.md
Downloaded org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/lora_fine_tuning/meta-llama/Meta-Llama-3-8B-Instruct:gokum:atyhk/adapter_config.json to /mnt/cluster_storage/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/lora_fine_tuning/meta-llama/Meta-Llama-3-8B-Instruct:gokum:atyhk/adapter_config.json
Downloaded org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/lora_fine_tuning/meta-llama/Meta-Llama-3-8B-Instruct:gokum:atyhk/adapter_model.safetensors to /mnt/cluster_storage/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/lora_fine_tuning/meta-llama/Meta-Llama-3-8B-Instruct:gokum:atyhk/adapter_model.safetensors
Downloaded org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/lora_fine_tuning/meta-llama/Meta-Llama-3-8B-Instruct:gokum:atyhk/config.json to /mnt/cluster_storage/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/lora_fine_tuning/meta-llama/Meta-Llama-3-8B-Instruct:gokum:atyhk/config.json
Downloaded org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/lora_fine_tuning/meta-llama/Meta-Llama-3-8B-Instruct:gokum:atyhk/new_embeddings.safetensors to /mnt/cluster_storage/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/lora_fine_tuning/meta-llama/Meta-Llama-3-8B-Instruct:gokum:atyhk/new_embeddings.safetensors
Downloaded org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/lora_fine_tuning/meta-llama/Meta-Llama-3-8B-Instruct:gokum:atyhk/special_tokens_map.json to /mnt/cluster_storage/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/lora_fine_tuning/meta-llama/Meta-Llama-3-8B-Instruct:gokum:atyhk/special_tokens_map.json
Downloaded org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/lora_fine_tuning/meta-llama/Meta-Llama-3-8B-Instruct:gokum:atyhk/tokenizer.json to /mnt/cluster_storage/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/lora_fine_tuning/meta-llama/Meta-Llama-3-8B-Instruct:gokum:atyhk/tokenizer.json
Downloaded org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/lora_fine_tuning/meta-llama/Meta-Llama-3-8B-Instruct:gokum:atyhk/tokenizer_config.json to /mnt/cluster_storage/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/lora_fine_tuning/meta-llama/Meta-Llama-3-8B-Instruct:gokum:atyhk/tokenizer_config.json

Evaluation

Now we'll evaluate our fine-tuned LLM to see how well it performs on our task. We'll perform offline batch inference where we will use our tuned model to generate the outputs.

Load test data

# Load test set for eval
ft_test_ds = ray.data.read_json(f"{os.environ['ANYSCALE_ARTIFACT_STORAGE']}/viggo/test.jsonl")
test_data = ft_test_ds.take_all()
test_data[0]

{'messages': [{'content': "Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']",
'role': 'system'},
{'content': 'I remember you saying you found Little Big Adventure to be average. Are you not usually that into single-player games on PlayStation?',
'role': 'user'},
{'content': 'verify_attribute(name[Little Big Adventure], rating[average], has_multiplayer[no], platforms[PlayStation])',
'role': 'assistant'}]}

# Separate into inputs/outputs
test_inputs = []
test_outputs = []
for item in test_data:
    test_inputs.append([message for message in item['messages'] if message['role'] != 'assistant'])
    test_outputs.append([message for message in item['messages'] if message['role'] == 'assistant'])

Tokenizer

We'll also load the appropriate tokenizer to apply to our input data.

from transformers import AutoTokenizer

# Model and tokenizer
HF_MODEL = 'meta-llama/Meta-Llama-3-8B-Instruct'
tokenizer = AutoTokenizer.from_pretrained(HF_MODEL)

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.

Chat template

When we fine-tuned our model, special tokens (ex. beginning/end of text, etc.) were automatically added to our inputs. We want to apply the same special tokens to our inputs prior to generating outputs using our tuned model. Luckily, the chat template to apply to our inputs (and add those tokens) is readily available inside our tuned model's tokenizer_config.json file. We can use our tokenizer to apply this template to our inputs.

import json

# Extract chat template used during fine-tuning
with open(os.path.join(artifacts_dir, artifacts_path, 'tokenizer_config.json')) as file:
    tokenizer_config = json.load(file)
chat_template = tokenizer_config['chat_template']
print(chat_template)

{% set loop_messages = messages %}{% for message in loop_messages %}{% set content = '<|start_header_id|>' + message['role'] + '<|end_header_id|>

'+ message['content'] | trim + '<|eot_id|>' %}{% if loop.index0 == 0 %}{% set content = bos_token + content %}{% endif %}{{ content }}{% endfor %}{% if add_generation_prompt %}{{ '<|start_header_id|>assistant<|end_header_id|>

' }}{% endif %}

# Apply chat template
test_input_prompts = [{'inputs': tokenizer.apply_chat_template(
    conversation=inputs,
    chat_template=chat_template,
    add_generation_prompt=True,
    tokenize=False,
    return_tensors='np'), 'outputs': outputs} for inputs, outputs in zip(test_inputs, test_outputs)]
test_input_prompts_ds = ray.data.from_items(test_input_prompts)
print(test_input_prompts_ds.take(1))

[{'inputs': "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nGiven a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nI remember you saying you found Little Big Adventure to be average. Are you not usually that into single-player games on PlayStation?<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n", 'outputs': [{'content': 'verify_attribute(name[Little Big Adventure], rating[average], has_multiplayer[no], platforms[PlayStation])', 'role': 'assistant'}]}]
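The Jinja template can be hard to read at a glance, so here is a plain-Python mirror of what it does, written for illustration only. It assumes bos_token is '<|begin_of_text|>' (as seen at the start of the rendered prompt); render_chat is a hypothetical helper, and in practice tokenizer.apply_chat_template remains the thing to use.

```python
# A plain-Python mirror of the Jinja chat template, for illustration only.
# Assumption: bos_token is '<|begin_of_text|>', as seen in the rendered prompt.
BOS_TOKEN = "<|begin_of_text|>"


def render_chat(messages, add_generation_prompt=True):
    parts = []
    for index, message in enumerate(messages):
        # Header with the role, a blank line, the trimmed content, then <|eot_id|>.
        content = (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content'].strip()}<|eot_id|>"
        )
        if index == 0:
            content = BOS_TOKEN + content  # only the first message gets the BOS token
        parts.append(content)
    if add_generation_prompt:
        # Open an assistant turn so the model continues from here.
        parts.append("<|start_header_id|>assistant<|end_header_id|>\n\n")
    return "".join(parts)


prompt = render_chat([
    {"role": "system", "content": "Placeholder system prompt."},
    {"role": "user", "content": "  Is this game on Steam?  "},
])
print(repr(prompt))
```

Note how add_generation_prompt=True leaves the prompt ending in an open assistant header, which is what cues the model to produce the structured output.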

Batch inference

We will use vLLM's offline LLM class to load the model and use it for inference. We can easily load our LoRA weights and merge them with the base model (just pass in lora_path). And we'll wrap all of this functionality in a class that we can pass to ray.data.Dataset.map_batches to apply batch inference at scale.

from vllm import LLM, SamplingParams
from vllm.anyscale.lora.utils import LoRARequest

class LLMPredictor:
    def __init__(self, hf_model, sampling_params, lora_path=None):
        self.llm = LLM(model=hf_model, enable_lora=bool(lora_path))
        self.sampling_params = sampling_params
        self.lora_path = lora_path

    def __call__(self, batch):
        if not self.lora_path:
            outputs = self.llm.generate(
                prompts=batch['inputs'],
                sampling_params=self.sampling_params)
        else:
            outputs = self.llm.generate(
                prompts=batch['inputs'],
                sampling_params=self.sampling_params,
                lora_request=LoRARequest('lora_adapter', 1, self.lora_path))
        inputs = []
        generated_outputs = []
        for output in outputs:
            inputs.append(output.prompt)
            generated_outputs.append(' '.join([o.text for o in output.outputs]))
        return {
            'prompt': inputs,
            'expected_output': batch['outputs'],
            'generated_text': generated_outputs,
        }
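The reason for wrapping the model in a class rather than a function is that the expensive setup (loading the model) should happen once per worker, while the per-batch work stays cheap. The toy sketch below shows that pattern without vLLM or Ray; EchoPredictor and iter_batches are made-up stand-ins, and the "model" just uppercases its input.

```python
# Toy stand-in for LLMPredictor (hypothetical, no vLLM or Ray required): the
# constructor does the expensive setup once, __call__ handles one batch at a time.
class EchoPredictor:
    load_count = 0  # how many times the "model" was loaded

    def __init__(self, suffix):
        EchoPredictor.load_count += 1  # pretend this is the slow model load
        self.suffix = suffix

    def __call__(self, batch):
        generated = [prompt.upper() + self.suffix for prompt in batch["inputs"]]
        return {"prompt": batch["inputs"], "generated_text": generated}


def iter_batches(rows, batch_size):
    # Yield column-style batches of at most batch_size rows.
    for start in range(0, len(rows), batch_size):
        yield {"inputs": rows[start:start + batch_size]}


prompts = ["a", "b", "c", "d", "e"]
predictor = EchoPredictor(suffix="!")  # constructed once, reused for every batch
generated_text = []
for batch in iter_batches(prompts, batch_size=2):
    generated_text.extend(predictor(batch)["generated_text"])

print(generated_text)
print(EchoPredictor.load_count)
```

With a real LLM the constructor can take minutes and several gigabytes of GPU memory, so reusing one instance across batches is what makes batch inference practical.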

During our data preprocessing template, we used the default compute strategy with map. But this time we'll use map_batches and specify a custom compute strategy (concurrency, num_gpus, batch_size and accelerator_type).

# Fine-tuned model
hf_model = 'meta-llama/Meta-Llama-3-8B-Instruct'
sampling_params = SamplingParams(temperature=0, max_tokens=2048)
ft_pred_ds = test_input_prompts_ds.map_batches(
    LLMPredictor,
    concurrency=4,  # number of LLM instances
    num_gpus=1,  # GPUs per LLM instance
    batch_size=10,  # maximize until OOM, if OOM then decrease batch_size
    fn_constructor_kwargs={
        'hf_model': hf_model,
        'sampling_params': sampling_params,
        'lora_path': os.path.join(artifacts_dir, artifacts_path)},
    accelerator_type='A10G',  # A10G or L4
)

# Batch inference will take ~4 minutes
ft_pred = ft_pred_ds.take_all()
ft_pred[3]

{'expected_output': array([{'content': 'give_opinion(name[Might & Magic: Heroes VI], rating[average], player_perspective[bird view], platforms[PC])', 'role': 'assistant'}], dtype=object),

 'generated_text': 'give_opinion(name[Might & Magic: Heroes VI], rating[average], player_perspective[bird view], platforms[PC])<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'}

Evaluation

There are a lot of different ways to perform evaluation. For our task, we can use traditional deterministic metrics (ex. accuracy, precision, recall, etc.) since we know what the outputs should be (extracted intent and entities).

However, for many generative tasks, the outputs are unstructured and highly subjective. For these scenarios, we can use distance/entropy based metrics like cosine similarity, BLEU, perplexity, etc., but these metrics are often not very representative of the underlying task. A common strategy here is to use a larger LLM to judge the quality of the generated outputs. We can ask the larger LLM to directly assess the quality of the response (ex. rate between 1-5) with a set of rules, or to compare it against a golden / preferred output and rate it relative to that.

# Exact match (strict!)
matches = 0
mismatches = []
for item in ft_pred:
    if item['expected_output'][0]['content'] == item['generated_text'].split('<|eot_id|>')[0]:
        matches += 1
    else:
        mismatches.append(item)
matches / float(len(ft_pred))

0.938134810710988
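Exact match gives no partial credit, so a prediction that gets four attributes right and adds one extra counts the same as a completely wrong one. As a softer complement, the sketch below computes precision and recall over attribute[value] pairs, using one of the mismatched predictions from this run as the example. The function slot_precision_recall is a hypothetical helper, it ignores the intent name, and it assumes values contain no square brackets.

```python
import re

# Hypothetical softer metric (not part of the original notebook): compare the
# attribute[value] pairs of two meaning representations instead of whole strings.
ATTR_PATTERN = re.compile(r"(\w+)\[(.*?)\]")


def slot_precision_recall(expected: str, generated: str) -> tuple[float, float]:
    generated = generated.split("<|eot_id|>")[0]  # drop trailing special tokens
    expected_slots = set(ATTR_PATTERN.findall(expected))
    generated_slots = set(ATTR_PATTERN.findall(generated))
    overlap = len(expected_slots & generated_slots)
    precision = overlap / len(generated_slots) if generated_slots else 0.0
    recall = overlap / len(expected_slots) if expected_slots else 0.0
    return precision, recall


expected = (
    "inform(name[World of Warcraft], release_year[2004], "
    "developer[Blizzard Entertainment], genres[adventure, MMORPG])"
)
generated = (
    "inform(name[World of Warcraft], release_year[2004], "
    "developer[Blizzard Entertainment], genres[adventure, MMORPG], platforms[PC])"
    "<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"
)
precision, recall = slot_precision_recall(expected, generated)
print(precision, recall)
```

Here the model recovered every expected attribute but added platforms[PC], so recall is perfect while precision drops, a distinction that exact match hides.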

Note: you can train for more epochs (num_epochs: 10) to further improve the performance.

Even our mismatches are not too far off and sometimes it might be worth a closer look because the dataset itself might have a few errors that the model may have identified.

# Inspect a few of the mismatches
mismatches[0:2]

[{'prompt': "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nGiven a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nWorld of Warcraft is an MMORPG adventure game that was released in 2004 by Blizzard Entertainment.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
  'expected_output': array([{'content': 'inform(name[World of Warcraft], release_year[2004], developer[Blizzard Entertainment], genres[adventure, MMORPG])', 'role': 'assistant'}],
        dtype=object),
  'generated_text': 'inform(name[World of Warcraft], release_year[2004], developer[Blizzard Entertainment], genres[adventure, MMORPG], platforms[PC])<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n'},
 {'prompt': "<|begin_of_text|><|start_header_id|>system<|end_header_id|>\n\nGiven a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']<|eot_id|><|start_header_id|>user<|end_header_id|>\n\nAh, I wish Mirror's Edge Catalyst was on Steam, they're the only game source I trust to be legit.<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n",
  'expected_output': array([{'content': "give_opinion(name[Mirror's Edge Catalyst], rating[poor], available_on_steam[no])", 'role': 'assistant'}],
        dtype=object),
  'generated_text': "give_opinion(name[Mirror's Edge Catalyst], rating[average], available_on_steam[no])<|eot_id|><|start_header_id|>assistant<|end_header_id|>\n\n"}]
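Exact match gives no credit for near misses like the two above. To quantify how close a mismatch is, one option is to score the predicted attributes with precision and recall. The sketch below is an illustration under stated assumptions: `parse_mr` and `score_prediction` are hypothetical helpers (not part of the guide's code), and the regex assumes outputs of the form `intent(attr[value], ...)` with no nested square brackets.

```python
import re


def parse_mr(text):
    """Parse 'intent(attr[value], ...)' into (intent, set of (attr, value) pairs)."""
    # Drop anything generated after the end-of-turn token.
    text = text.split('<|eot_id|>')[0].strip()
    match = re.match(r'(\w+)\((.*)\)\s*$', text, flags=re.DOTALL)
    if match is None:
        return None, set()
    intent, body = match.groups()
    # Non-greedy match so values containing commas stay in one piece.
    return intent, set(re.findall(r'(\w+)\[(.*?)\]', body))


def score_prediction(expected, generated):
    """Return (intent_matches, attribute precision, attribute recall)."""
    exp_intent, exp_pairs = parse_mr(expected)
    gen_intent, gen_pairs = parse_mr(generated)
    correct = len(exp_pairs & gen_pairs)
    precision = correct / len(gen_pairs) if gen_pairs else 0.0
    recall = correct / len(exp_pairs) if exp_pairs else 0.0
    return exp_intent == gen_intent, precision, recall


expected = ("inform(name[World of Warcraft], release_year[2004], "
            "developer[Blizzard Entertainment], genres[adventure, MMORPG])")
generated = ("inform(name[World of Warcraft], release_year[2004], "
             "developer[Blizzard Entertainment], genres[adventure, MMORPG], "
             "platforms[PC])<|eot_id|>")
print(score_prediction(expected, generated))  # (True, 0.8, 1.0)
```

For the World of Warcraft example, the intent and all four expected attributes are recovered (recall 1.0) and only the extra `platforms[PC]` lowers precision to 0.8, which matches the intuition that this mismatch is a near miss.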

Serving

For model serving, we'll first serve it locally, test it and then launch a production grade service that can autoscale to meet any demand.

We'll start by generating the configuration for our service. We provide a convenient CLI experience to generate this configuration, but you can also create one from scratch. Here we can specify where our model lives, autoscaling behavior, accelerators to use, LoRA adapters, etc.

💡 INSIGHT : Ray Serve and Anyscale support serving multiple LoRA adapters with a common base model in the same request batch which allows you to serve a wide variety of use-cases without increasing hardware spend. In addition, we use Serve multiplexing to reduce the number of swaps for LoRA adapters. There is a slight latency overhead to serving a LoRA model compared to the base model, typically 10-20%.

LoRA weights storage URI: s3://anyscale-test-data-cld-i2w99rzq8b6lbjkke9y94vi5/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/lora_fine_tuning

model: meta-llama/Meta-Llama-3-8B-Instruct:gokum:atyhk
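Since several LoRA adapters can share one base model, it helps to see how a model ID such as the one above might be broken down. The sketch below assumes the ID has the shape `base_model:suffix:id` (inferred from the ID printed above, not from a documented spec), and the second adapter ID in the example is made up for illustration.

```python
from collections import defaultdict


def split_model_id(model_id):
    """Split 'base_model:suffix:id' into (base_model, adapter or None)."""
    base_model, _, adapter = model_id.partition(":")
    return base_model, adapter or None


def group_by_base_model(model_ids):
    """Group adapter names under the base model they were tuned from."""
    groups = defaultdict(list)
    for model_id in model_ids:
        base_model, adapter = split_model_id(model_id)
        if adapter is not None:
            groups[base_model].append(adapter)
    return dict(groups)


model_ids = [
    "meta-llama/Meta-Llama-3-8B-Instruct:gokum:atyhk",
    "meta-llama/Meta-Llama-3-8B-Instruct:gokum:zzzzz",  # hypothetical second adapter
]
print(group_by_base_model(model_ids))
```

Each request then selects an adapter simply by passing the full model ID in the `model` field, while the shared base model weights stay loaded on the same replica.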

Run the Python command below to start the CLI workflow that generates the service YAML configuration:

mkdir -p /home/ray/default/deploy/services
cd /home/ray/default/deploy/services
python /home/ray/default/src/generate_serve_config.py

🔄 REPLACE : Use the serve configuration generated for you.

# Generated service configuration
!cat /home/ray/default/deploy/services/serve_{TIMESTAMP}.yaml

applications:
- args:
    dynamic_lora_loading_path: s3://anyscale-test-data-cld-i2w99rzq8b6lbjkke9y94vi5/org_7c1Kalm9WcX2bNIjW53GUT/cld_kvedZWag2qA8i5BjxUevf5i7/artifact_storage/lora_fine_tuning
    embedding_models: []
    function_calling_models: []
    models: []
    multiplex_lora_adapters: []
    multiplex_models:
    - ./model_config/model_config_20240516095237.yaml
    vllm_base_models: []
  import_path: aviary_private_endpoints.backend.server.run:router_application
  name: llm-endpoint
  route_prefix: /

This also generated a model configuration file that has all the information on autoscaling, inference engine, workers, compute, etc. It will be located under model_config/{MODEL_NAME}-{TIMESTAMP}.yaml. This configuration also includes the prompt_format, which matches the formatting we applied prior to fine-tuning and is applied automatically during inference.
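To make the role of the prompt format concrete, here is a minimal sketch of how chat messages could be rendered into the Llama 3 style prompt strings that appear in the evaluation outputs earlier. The function `render_llama3_prompt` is a hypothetical helper; the real rendering is driven by the prompt_format entry in the generated model config, which may differ in detail.

```python
def render_llama3_prompt(messages):
    """Render chat messages into a single Llama 3 style prompt string."""
    prompt = "<|begin_of_text|>"
    for message in messages:
        # Each turn: role header, blank line, content, end-of-turn token.
        prompt += (
            f"<|start_header_id|>{message['role']}<|end_header_id|>\n\n"
            f"{message['content']}<|eot_id|>"
        )
    # Leave an open assistant header so the model generates the reply.
    prompt += "<|start_header_id|>assistant<|end_header_id|>\n\n"
    return prompt


messages = [
    {"role": "system", "content": "Construct the meaning representation."},
    {"role": "user", "content": "World of Warcraft was released in 2004."},
]
print(render_llama3_prompt(messages))
```

Because the service applies this formatting for us, clients only need to send plain `messages` and never build the special tokens by hand.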

Local deployment

We can now serve our model locally and query it. Run the following in the terminal (change to your serve YAML config):

cd /home/ray/default/deploy/services
serve run serve_{TIMESTAMP}.yaml

from openai import OpenAI

# Query function to call the running service
def query(base_url: str, api_key: str):
    if not base_url.endswith("/"):
        base_url += "/"

    # List all models
    client = OpenAI(base_url=base_url + "v1", api_key=api_key)
    models = client.models.list()
    print(models)

    # Note: not all arguments are currently supported and will be ignored by the backend
    chat_completions = client.chat.completions.create(
        model=model,  # with your unique model ID
        messages=[
            {"role": "system", "content": "Given a target sentence construct the underlying meaning representation of the input sentence as a single function with attributes and attribute values. This function should describe the target string accurately and the function must be one of the following ['inform', 'request', 'give_opinion', 'confirm', 'verify_attribute', 'suggest', 'request_explanation', 'recommend', 'request_attribute']. The attributes must be one of the following: ['name', 'exp_release_date', 'release_year', 'developer', 'esrb', 'rating', 'genres', 'player_perspective', 'has_multiplayer', 'platforms', 'available_on_steam', 'has_linux_release', 'has_mac_release', 'specifier']"},
            {"role": "user", "content": "I remember you saying you found Little Big Adventure to be average. Are you not usually that into single-player games on PlayStation?"},
        ],
        temperature=0,
        stream=True
    )

    response = ""
    for chat in chat_completions:
        if chat.choices[0].delta.content is not None:
            response += chat.choices[0].delta.content

    return response

# Generate response
response = query("http://localhost:8000", "NOT A REAL KEY")
print(response.split('<|eot_id|>')[0])

verify_attribute(name[Little Big Adventure], rating[average], has_multiplayer[no], platforms[PlayStation])

Production service

Now we'll create a production service that can truly scale. We have full control over this Service, from autoscaling behavior and monitoring via the dashboard to canary rollouts and termination (see Anyscale Services).

💡 INSIGHT : With Ray Serve and Anyscale, it's easy to define a configuration that can scale to meet demand but also scale back to zero, creating the most efficient service possible. Check out this guide on how to optimize behavior around autoscaling, latency/throughput, etc.

Stop the local service (Control + C) and run the following:

cd /home/ray/default/deploy/services
anyscale service deploy -f serve_{TIMESTAMP}.yaml

Go to Home > Services (left panel) to view the production service.

🔄 REPLACE : the service_url and service_bearer_token generated for your service (top right corner under the Query button on the Service's page).

# Query the remote serve application we just deployed
service_url = "your_api_url"  # REPLACE ME
service_bearer_token = "your_secret_bearer_token"  # REPLACE ME
query(service_url, service_bearer_token)

verify_attribute(name[Little Big Adventure], rating[average], has_multiplayer[no], platforms[PlayStation])

Note: If we chose to fine-tune our model using the simpler Anyscale serverless endpoints method, then we can serve that model by going to Endpoints API > Services on the left panel of the main console page. Click on the three dots on the right side of your tuned model and follow the instructions to query it.

Dev → Prod

We've now served our model into production via Anyscale Services, but we can just as easily productionize our other workloads with Anyscale Jobs (like we did for fine-tuning above) to execute this entire workflow programmatically outside of Workspaces.

For example, suppose that we want to preprocess batches of new incoming data, fine-tune a model, evaluate it and then compare it to the existing production version. All of this can be productionized by simply launching the workload as a Job, which can be triggered manually, periodically (cron) or event-based (via webhooks, etc.). We also provide integrations with your platform/tools to make all of this connect with your existing production workflows.
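The "compare it to the existing production version" step can be as simple as a gate on the evaluation metric. Here is a minimal sketch, assuming we reuse the strict exact-match metric from the evaluation section; `exact_match_accuracy`, `should_promote` and the `min_gain` threshold are hypothetical names and values chosen for illustration.

```python
def exact_match_accuracy(predictions):
    """predictions: list of (expected, generated) string pairs."""
    if not predictions:
        return 0.0
    matches = sum(
        expected == generated.split('<|eot_id|>')[0]
        for expected, generated in predictions
    )
    return matches / len(predictions)


def should_promote(candidate_accuracy, production_accuracy, min_gain=0.01):
    """Promote only if the candidate beats production by at least min_gain."""
    return candidate_accuracy >= production_accuracy + min_gain


candidate = exact_match_accuracy([
    ("inform(name[A])", "inform(name[A])<|eot_id|>"),
    ("inform(name[B])", "inform(name[C])<|eot_id|>"),
])
print(candidate, should_promote(candidate, production_accuracy=0.4))
```

A scheduled Job could run this check at the end of each fine-tuning run and only roll out the new model when the gate passes.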

💡 INSIGHT : Most industry ML issues arise from a discrepancy between the development (ex. local laptop) and production (ex. large cloud clusters) environments. With Anyscale, your development and production environments can be exactly the same so there is little to no difference introduced. And with features like smart instance manager, the development environment can stay extremely lean while having the power to scale as needed.

Clean up

🛑 IMPORTANT: Please Terminate your service from the Service page to avoid depleting your free trial credits.

# Clean up
!python src/clear_cell_nums.py
!find . | grep -E ".ipynb_checkpoints" | xargs rm -rf
!find . | grep -E "(__pycache__|\.pyc|\.pyo)" | xargs rm -rf
!rm -rf __pycache__ data .HF_TOKEN deploy/services

Next steps

We have a lot more guides that address more nuanced use cases:

Batch text embeddings with Ray data
Continued fine-tuning from checkpoint
Serving multiple LoRA adapters with same base model (+ multiplexing)
Deploy models for embedding generation
Function calling fine-tuning and deployment
Configs to optimize the latency/throughput
Configs to control optimization parameters and tensor-parallelism
Stable diffusion fine-tuning and serving

And if you're interested in using our hosted Anyscale or connecting it to your own cloud, reach out to us at Anyscale. And join us on Twitter, LinkedIn and the Ray Summit for more real-time updates on new features!
