
Fine-tuning a Text-to-SQL Model with Tinker and Ray

By Robert Nishihara and Philipp Moritz   |   October 1, 2025

Thinking Machines just released Tinker, an LLM training API for researchers and hackers. The API offers low-level control while abstracting away model deployment challenges.

We built a simple example showcasing how to use Ray along with Tinker to build and run a text-to-SQL model.

There are two primary parts to this use case: data generation and model fine-tuning. We show how to generate a dataset for supervised fine-tuning using Ray. We then show how to use the dataset to fine-tune an LLM using Tinker.

Data generation

We first need to generate data for supervised fine-tuning. There are two components to this: generation and evaluation. We generate queries by deploying Qwen3-8B with vLLM and Ray Serve as an Anyscale service to scale LLM inference. We then use Ray Core to execute a large number of parallel tasks that generate candidate SQL queries; we then evaluate each of those queries in a SQL environment and compute rewards using skyrl-gym.

Here is the application code for running Qwen3-8B as a service. It uses Ray Serve’s built-in integration with vLLM to deploy the model.

# deploy_qwen.py

from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-qwen-8B",
        model_source="Qwen/Qwen3-8B",
    ),
    accelerator_type="L40S",
    deployment_config=dict(
        autoscaling_config=dict(
            min_replicas=4, max_replicas=8,
        )
    ),
    engine_kwargs=dict(
        max_model_len=8192,
        tensor_parallel_size=1
    )
)

app = build_openai_app({"llm_configs": [llm_config]})

This service can be deployed by running

anyscale service deploy -f service.yaml

The service.yaml file is provided in the appendix.
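Once the service is running, you can sanity-check the endpoint with a single OpenAI-compatible request. This is a minimal sketch; the base URL and bearer token below are placeholders for the values printed when the service is deployed.

# sanity_check.py (sketch -- base_url and token are placeholders)

from urllib.parse import urljoin
from openai import OpenAI

base_url = "https://<your-service-url>/"  # printed by `anyscale service deploy`
token = "<your-service-bearer-token>"     # printed by `anyscale service deploy`

client = OpenAI(api_key=token, base_url=urljoin(base_url, "v1"))
response = client.chat.completions.create(
    model="my-qwen-8B",
    messages=[{"role": "user", "content": "Write a SQL query that counts the rows in a table named users."}],
)
print(response.choices[0].message.content)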

Here is the code for querying the model at scale, evaluating the generated queries, and filtering out the unsuccessful ones. This code can naturally be extended in a multi-turn fashion, feeding the feedback from an unsuccessful query back into the model to generate new candidate queries; a sketch of this appears at the end of this section.

# data_generation.py

from urllib.parse import urljoin

import json
import ray
from datasets import load_dataset
from omegaconf import DictConfig
from openai import OpenAI
from skyrl_gym.envs.sql.env import SQLEnv

dataset = load_dataset("NovaSky-AI/SkyRL-SQL-653-data-newfmt", split="train").to_list()

token = # <FILL IN APPROPRIATE TOKEN>
base_url = # <FILL IN APPROPRIATE BASE URL>

@ray.remote(num_cpus=0.1)
def generate_sql(messages):
    client = OpenAI(api_key=token, base_url=urljoin(base_url, "v1"))
    response = client.chat.completions.create(
        model="my-qwen-8B",
        messages=messages
    )
    return response.choices[0].message.content

# Generate SQL queries in parallel
object_refs = [generate_sql.remote(record["prompt"]) for record in dataset]

# Fetch the results and filter out the unsuccessful ones
object_refs_and_records = dict(zip(object_refs, dataset))
successful = []
remaining = object_refs
while remaining:
    [ready_ref], remaining = ray.wait(remaining, num_returns=1)
    record = object_refs_and_records[ready_ref]
    messages = record["prompt"]

    try:
        assistant_response = ray.get(ready_ref)
    except Exception:
        continue

    conf = DictConfig({"db_path": "/home/ray/data"})
    env = SQLEnv(conf, record)
    env.init(messages)
    try:
        output = env.step(assistant_response)
    except AssertionError:
        continue

    print("Reward: ", output["reward"])

    if output["reward"] > 0:
        successful.append((record, assistant_response))

examples = []
for record, assistant_response in successful:
    examples.append(record["prompt"] + [{"role": "assistant", "content": assistant_response}])

with open("/mnt/shared_storage/successful.json", "w") as f:
    json.dump(examples, f)

This job can be submitted by running

anyscale job submit -f job.yaml --env HF_TOKEN=$HF_TOKEN

The job.yaml file is provided in the appendix. The examples will be stored in a shared filesystem, though you can store them wherever you want.
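As mentioned above, the generation step can be extended to multiple turns, feeding the environment’s feedback on a failed query back into the model. Here is a sketch of what that might look like; it is not part of the pipeline above and makes a few assumptions: the step output exposes an "observations" list of follow-up chat messages (as in skyrl-gym’s text environments), token, base_url, and the imports from data_generation.py are in scope, and the SQL data under /home/ray/data is available on the worker nodes running the tasks.

# Sketch of a multi-turn variant of generate_sql (see assumptions above).
MAX_TURNS = 3

@ray.remote(num_cpus=0.1)
def generate_sql_multi_turn(record):
    client = OpenAI(api_key=token, base_url=urljoin(base_url, "v1"))
    env = SQLEnv(DictConfig({"db_path": "/home/ray/data"}), record)
    messages = list(record["prompt"])
    env.init(messages)

    for _ in range(MAX_TURNS):
        response = client.chat.completions.create(model="my-qwen-8B", messages=messages)
        assistant_response = response.choices[0].message.content
        output = env.step(assistant_response)
        if output["reward"] > 0:
            # Success: return the full conversation, ending with the winning query.
            return messages + [{"role": "assistant", "content": assistant_response}]
        # Otherwise, feed the attempt and the environment feedback back to the model.
        messages.append({"role": "assistant", "content": assistant_response})
        messages.extend(output["observations"])
    return None

Swapping this in for generate_sql (and handling failures the same way when calling ray.get) turns the single-shot generation into a bounded retry loop.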

Model fine-tuning

We use the Tinker API to tokenize the data and fine-tune the model. Tinker offers a high level of control for training and fine-tuning LLMs.

The following can be run in an Anyscale workspace that has tinker installed.

import tinker
from tinker import types
import json
import numpy as np

service_client = tinker.ServiceClient()

training_client = service_client.create_lora_training_client(
    base_model="Qwen/Qwen3-8B", rank=32
)
tokenizer = training_client.get_tokenizer()

def process_example(messages: list, tokenizer) -> types.Datum:
    tokens = tokenizer.apply_chat_template(messages)
    weights = [1] * len(tokens)
    input_tokens = tokens[:-1]
    target_tokens = tokens[1:]
    weights = weights[1:]
    return types.Datum(
        model_input=types.ModelInput.from_ints(tokens=input_tokens),
        loss_fn_inputs=dict(weights=weights, target_tokens=target_tokens)
    )

examples = json.load(open("/mnt/shared_storage/successful.json", "r"))
processed_examples = [process_example(ex, tokenizer) for ex in examples]

# Note: If you are going to train on a larger dataset, you should implement proper minibatch training.
for _ in range(6):
    fwdbwd_future = training_client.forward_backward(processed_examples, "cross_entropy")
    optim_future = training_client.optim_step(types.AdamParams(learning_rate=1e-4))
    # Wait for the results
    fwdbwd_result = fwdbwd_future.result()
    optim_result = optim_future.result()
    # fwdbwd_result contains the logprobs of all the tokens we put in. Now we can compute the weighted
    # average log loss per token.
    logprobs = np.concatenate([output["logprobs"].tolist() for output in fwdbwd_result.loss_fn_outputs])
    weights = np.concatenate([example.loss_fn_inputs["weights"].tolist() for example in processed_examples])
    print(f"Loss per token: {-np.dot(logprobs, weights) / weights.sum():.4f}")

# Save the weights
sampling_client = training_client.save_weights_and_get_sampling_client(name="sql_model")
print(f"model path: {sampling_client.model_path}")
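The loop above takes six full-batch steps over a small dataset. On a larger dataset, the same forward_backward and optim_step calls can be organized into minibatches. A minimal sketch follows; the batch size, epoch count, and learning rate are illustrative, not tuned.

# Minibatch variant (sketch): shuffle each epoch and step once per minibatch.
import random

BATCH_SIZE = 128
NUM_EPOCHS = 3

for epoch in range(NUM_EPOCHS):
    random.shuffle(processed_examples)
    for start in range(0, len(processed_examples), BATCH_SIZE):
        batch = processed_examples[start:start + BATCH_SIZE]
        fwdbwd_future = training_client.forward_backward(batch, "cross_entropy")
        optim_future = training_client.optim_step(types.AdamParams(learning_rate=1e-4))
        fwdbwd_result = fwdbwd_future.result()
        optim_future.result()
        # Weighted average log loss per token for this minibatch.
        logprobs = np.concatenate([o["logprobs"].tolist() for o in fwdbwd_result.loss_fn_outputs])
        weights = np.concatenate([ex.loss_fn_inputs["weights"].tolist() for ex in batch])
        print(f"epoch {epoch}, step {start // BATCH_SIZE}: "
              f"loss per token {-np.dot(logprobs, weights) / weights.sum():.4f}")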

Model evaluation

We now want to check how well the model performs. Let’s first download the model checkpoint (make sure to fill in the model path that was printed by the code above).

import tinker
from urllib.parse import urlparse

MODEL_PATH = # <FILL IN THE MODEL PATH PRINTED ABOVE>

parsed_url = urlparse(MODEL_PATH)

service_client = tinker.ServiceClient()
rest_client = service_client.create_rest_client()
data = rest_client.download_checkpoint_archive(parsed_url.netloc, parsed_url.path.lstrip('/')).result()

with open('output.tar.gz', 'wb') as f:
    f.write(data)

We then extract the LoRA weights with mkdir -p /home/ray/sql_lora && tar xvfz output.tar.gz -C /home/ray/sql_lora and merge them into the base model. We do this because the Tinker LoRA weights are currently not compatible with vLLM and can’t be served directly; this will be fixed going forward.

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base_model = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = PeftModel.from_pretrained(base_model, "/home/ray/sql_lora")
merged_model = model.merge_and_unload()

save_path = "/home/ray/merged_sql_model"
merged_model.save_pretrained(save_path)
tokenizer.save_pretrained(save_path)
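To actually measure how well the fine-tuned model does, one option is to serve the merged checkpoint with the same Ray Serve LLM pattern used earlier and rerun the evaluation loop from data_generation.py against the new endpoint (with model="my-sql-model"). A minimal sketch, assuming the merged checkpoint at /home/ray/merged_sql_model is readable by the serve replicas (e.g. on shared storage):

# deploy_sql_model.py (sketch -- same pattern as deploy_qwen.py, but the model
# source points at the merged checkpoint rather than a Hugging Face repo)

from ray.serve.llm import LLMConfig, build_openai_app

llm_config = LLMConfig(
    model_loading_config=dict(
        model_id="my-sql-model",
        model_source="/home/ray/merged_sql_model",
    ),
    accelerator_type="L40S",
    deployment_config=dict(
        autoscaling_config=dict(min_replicas=1, max_replicas=4),
    ),
    engine_kwargs=dict(
        max_model_len=8192,
        tensor_parallel_size=1,
    ),
)

app = build_openai_app({"llm_configs": [llm_config]})

Comparing the fraction of positive rewards from this endpoint against the base Qwen3-8B service gives a simple measure of how much the fine-tuning helped.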

Appendix - additional setup

To run the above code, a few additional setup steps are required.

We define our base image using the following Dockerfile.

# Dockerfile

FROM anyscale/ray:2.48.0-slim-py312-cu128

RUN sudo apt-get update -y \
    && sudo apt-get install --no-install-recommends -y build-essential libnuma-dev \
    && sudo rm -f /etc/apt/sources.list.d/*

RUN curl -LsSf https://astral.sh/uv/install.sh | sh

RUN git clone https://github.com/novasky-ai/SkyRL.git
WORKDIR /home/ray/SkyRL/skyrl-gym/
RUN uv pip install --system .
RUN uv pip install --system "huggingface_hub[cli]" "datasets" "openai" "transformers" "torch" "vllm==0.10.0" "pydantic"

We define the service config as follows.

# service.yaml

name: deploy-qwen
containerfile: ./Dockerfile

compute_config:
  auto_select_worker_config: true

working_dir: .

applications:
- import_path: deploy_qwen:app

We define the job config as follows.

# job.yaml

name: data-generation
containerfile: ./Dockerfile

compute_config:
  head_node:
    instance_type: c6a.12xlarge
  auto_select_worker_config: true

working_dir: .

entrypoint: |
  uv run --with huggingface_hub huggingface-cli download seeklhy/OmniSQL-datasets data.zip --repo-type dataset --local-dir $HOME && \
  unzip $HOME/data.zip -d $HOME && \
  python data_generation.py

max_retries: 0
