This tutorial shows how to federate the fine-tuning of a language model without ever centralizing private text. We simulate several organizations and show how each one locally adapts a shared base model, exchanging only the lightweight LoRA parameters. Flower's simulation engine for federated LLMs, combined with parameter-efficient fine-tuning, gives us an effective, scalable way for organizations to tailor LLMs on sensitive data while preserving privacy.
!pip install -q -U "flwr[simulation]" transformers datasets peft accelerate bitsandbytes protobuf

CLIENT_TEXTS = {
1: [
"... manager -> compliance for high-risk cases.",
],
2: [
"Fleet ops: preventive maintenance reduces downtime; prioritize vehicles with repeated fault codes.",
"Dispatch note: optimize routes by time windows and driver hours to reduce empty miles.",
"Safety policy: enforce rest breaks and log inspections before long-haul trips.",
"Inventory update: track spare parts usage; reorder thresholds should reflect lead time and seasonality.",
"Customer SLA: late deliveries require proactive notifications and documented root cause."
],
}
for cid in list(CLIENT_TEXTS.keys()):
    base = CLIENT_TEXTS[cid]
    CLIENT_TEXTS[cid] = base + [f"Q: Summarize this for leadership. A: {t}" for t in base]
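The augmentation loop simply appends an instruction-style variant of every raw note, doubling each silo. The same idea on a toy dictionary (the texts here are hypothetical stand-ins):

```python
client_texts = {
    1: ["preventive maintenance reduces downtime"],
    2: ["optimize routes by time windows"],
}

# Append a Q/A-formatted copy of every raw note, as the loop above does
for cid in list(client_texts.keys()):
    base = client_texts[cid]
    client_texts[cid] = base + [f"Q: Summarize this for leadership. A: {t}" for t in base]

for cid, texts in client_texts.items():
    print(cid, len(texts), texts[-1])
```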
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
bnb_config: Optional[BitsAndBytesConfig] = None
if DEVICE == "cuda":
    compute_dtype = torch.bfloat16 if torch.cuda.get_device_capability(0)[0] >= 8 else torch.float16
    bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=compute_dtype)
if "gpt2" MODEL_ID.lower():
TARGET_MODULES = ["c_attn", "c_proj"]
else:
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05
lora_config = LoraConfig(r=LORA_R, lora_alpha=LORA_ALPHA, lora_dropout=LORA_DROPOUT, bias="none", task_type="CAUSAL_LM", target_modules=TARGET_MODULES)
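To see why exchanging only the adapters is cheap, here is a dependency-light sketch (NumPy only, no PEFT; the 768-dimensional projection size is an illustrative assumption) of the low-rank update LoRA adds to one attention weight under the r=16, alpha=32 settings above:

```python
import numpy as np

d_model, r, alpha = 768, 16, 32   # r and alpha match the LoRA config above

W = np.random.randn(d_model, d_model).astype(np.float32)   # frozen base weight
A = np.random.randn(r, d_model).astype(np.float32) * 0.01  # lora_A (trained)
B = np.zeros((d_model, r), dtype=np.float32)               # lora_B starts at zero

delta_W = (alpha / r) * (B @ A)   # the low-rank update added to W
full_params = W.size
lora_params = A.size + B.size

print(f"full matrix params : {full_params}")
print(f"LoRA adapter params: {lora_params} ({100 * lora_params / full_params:.2f}%)")
assert np.allclose(W + delta_W, W)  # B == 0 => no change before any training
```

Only A and B ever leave a client, so each round transmits roughly 4% of what shipping the full projection matrices would cost.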
def model_primary_device(model) -> torch.device:
    return next(model.parameters()).device
def build_model_with_lora():
    if DEVICE == "cuda":
        model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", quantization_config=bnb_config, torch_dtype="auto")
        model = prepare_model_for_kbit_training(model)
    else:
        model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float32)
        model.to("cpu")
    model = get_peft_model(model, lora_config)
    model.train()
    return model
def make_dataset(texts: List[str]) -> Dataset:
    ds = Dataset.from_dict({"text": texts})
    def tok(batch):
        return tokenizer(batch["text"], truncation=True, max_length=MAX_LEN, padding="max_length")
    ds = ds.map(tok, batched=True, remove_columns=["text"])
    return ds
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
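With mlm=False, the collator prepares causal-LM batches: labels are a copy of input_ids, with padding positions set to -100 so they are ignored by the loss. A dependency-free sketch of that behaviour, using a hypothetical word-level tokenizer in place of the real one:

```python
MAX_LEN, PAD_ID, IGNORE_INDEX = 8, 0, -100

def toy_encode(text):
    # Hypothetical word-level tokenizer: ids in 1..100, truncated/padded to MAX_LEN
    ids = [hash(w) % 100 + 1 for w in text.split()][:MAX_LEN]
    return ids + [PAD_ID] * (MAX_LEN - len(ids))

def collate_causal_lm(texts):
    input_ids = [toy_encode(t) for t in texts]
    # Causal-LM labels: identical to input_ids, padding masked with -100
    labels = [[tid if tid != PAD_ID else IGNORE_INDEX for tid in row] for row in input_ids]
    return {"input_ids": input_ids, "labels": labels}

batch = collate_causal_lm([
    "optimize routes by time windows and driver hours to reduce empty miles",  # truncated
    "enforce rest breaks",                                                     # padded
])
print(batch["labels"][1])
```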
def lora_state_keys(model) -> List[str]:
    sd = model.state_dict()
    keys = sorted([k for k in sd.keys() if "lora_" in k])
    if not keys:
        raise RuntimeError("No LoRA keys found. Your model might not have the target_modules specified. " f"Current TARGET_MODULES={TARGET_MODULES}, MODEL_ID={MODEL_ID}")
    return keys
def get_lora_ndarrays(model) -> List[np.ndarray]:
    sd = model.state_dict()
    keys = lora_state_keys(model)
    return [sd[k].detach().float().cpu().numpy() for k in keys]
def set_lora_ndarrays(model, arrays: List[np.ndarray]) -> None:
    keys = lora_state_keys(model)
    if len(keys) != len(arrays):
        raise ValueError(f"Mismatch: got {len(arrays)} arrays but expected {len(keys)}.")
    sd = model.state_dict()
    for k, arr in zip(keys, arrays):
        t = torch.from_numpy(arr).to(sd[k].device).to(sd[k].dtype)
        sd[k].copy_(t)
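This (de)serialization works because both helpers iterate the same sorted key list, so the arrays always line up in the same order on every client and on the server. A NumPy-only sketch of the round trip over a fake state dict:

```python
import numpy as np

# Stand-in for a model state dict: two LoRA factors plus a frozen weight
state = {
    "layer.0.q_proj.lora_A.weight": np.random.randn(8, 32).astype(np.float32),
    "layer.0.q_proj.lora_B.weight": np.zeros((32, 8), dtype=np.float32),
    "layer.0.q_proj.weight": np.random.randn(32, 32).astype(np.float32),  # never exchanged
}

def lora_keys(sd):
    return sorted(k for k in sd if "lora_" in k)   # sorted => stable ordering everywhere

def to_ndarrays(sd):
    return [sd[k].copy() for k in lora_keys(sd)]

def from_ndarrays(sd, arrays):
    for k, arr in zip(lora_keys(sd), arrays):
        sd[k][...] = arr                           # in-place copy, like Tensor.copy_

payload = to_ndarrays(state)                       # what a client uploads
from_ndarrays(state, [a + 1.0 for a in payload])   # what it installs on download
print([a.shape for a in payload])
```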
def cosine_warmup_lr(step: int, total_steps: int, base_lr: float, warmup_steps: int) -> float:
    # Linear warmup followed by cosine decay to zero
    if step < warmup_steps:
        return base_lr * float(step + 1) / float(max(1, warmup_steps))
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + float(np.cos(np.pi * progress)))

def eval_loss(model, ds: Dataset, max_batches: int = 10) -> float:
    model.eval()
    dl = DataLoader(ds, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collator)
    losses = []
    dev = model_primary_device(model)
    for i, batch in enumerate(dl):
        if i >= max_batches:
            break
        batch = {k: v.to(dev) for k, v in batch.items()}
        out = model(**batch)  # collator already supplies causal-LM labels
        losses.append(float(out.loss.detach().cpu()))
    model.train()
    return float(np.mean(losses)) if losses else float("nan")
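As a sanity check, the warmup-then-cosine schedule can be exercised standalone (the exact constants here are an assumption matching the usual linear-warmup/cosine-decay form, not taken verbatim from the tutorial):

```python
import numpy as np

def cosine_warmup_lr(step, total_steps, base_lr, warmup_steps):
    # Linear warmup, then cosine decay to zero
    if step < warmup_steps:
        return base_lr * float(step + 1) / float(max(1, warmup_steps))
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + float(np.cos(np.pi * progress)))

lrs = [cosine_warmup_lr(s, total_steps=100, base_lr=2e-4, warmup_steps=10) for s in range(100)]
print(f"peak={max(lrs):.2e} at step {lrs.index(max(lrs))}, final={lrs[-1]:.2e}")
```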
def train_one_client_round(model, ds: Dataset, epochs: int, lr: float, grad_accum: int, warmup_steps: int) -> Tuple[float, int]:
    dl = DataLoader(ds, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collator)
    total_steps = max(1, (len(dl) * epochs) // max(1, grad_accum))
    step = 0
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=WEIGHT_DECAY)
    optimizer.zero_grad(set_to_none=True)
    running = []
    examples = 0
    dev = model_primary_device(model)
    for _ in range(epochs):
        for bi, batch in enumerate(dl):
            batch = {k: v.to(dev) for k, v in batch.items()}
            out = model(**batch)  # collator already supplies causal-LM labels
            loss = out.loss / grad_accum
            loss.backward()
            running.append(float(loss.detach().cpu()) * grad_accum)
            examples += batch["input_ids"].shape[0]
            if (bi + 1) % grad_accum == 0:
                lr_t = cosine_warmup_lr(step, total_steps, lr, warmup_steps)
                for pg in optimizer.param_groups:
                    pg["lr"] = lr_t
                optimizer.step()
                optimizer.zero_grad(set_to_none=True)
                step += 1
                if step % LOG_EVERY == 0:
                    print(f"  step={step}/{total_steps} loss={np.mean(running[-LOG_EVERY:]):.4f} lr={lr_t:.2e}")
    return (float(np.mean(running)) if running else float("nan")), examples
Here we set up the full execution environment: the configuration, the private per-client text silos, and the tokenizer, all adapting automatically to the available CPU or GPU. We also define the helper utilities that enable parameter-efficient fine-tuning and safe device handling across federated clients.
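One detail worth isolating is the train/eval split each client performs on its silo: the max(1, ...) guard keeps the training set non-empty even for a one-document silo. A standalone sketch (split_silo is a hypothetical helper name):

```python
import random

def split_silo(texts, train_frac=0.8, seed=0):
    # Shuffle, then split; max(1, ...) guarantees a non-empty training set
    # even when a silo holds a single document.
    texts = list(texts)
    random.Random(seed).shuffle(texts)
    cut = max(1, int(train_frac * len(texts)))
    return texts[:cut], texts[cut:]

train, evals = split_silo([f"note-{i}" for i in range(10)])
print(len(train), len(evals))
tiny_train, tiny_eval = split_silo(["only-note"])
print(len(tiny_train), len(tiny_eval))
```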
class FedLoRAClient(fl.client.NumPyClient):
    def __init__(self, cid: int):
        self.cid = cid
        self._model = None
        self._ds_train = None
        self._ds_eval = None

    def _ensure(self):
        if self._model is None:
            print(f"[Client {self.cid}] Loading model + LoRA (MODEL_ID={MODEL_ID})...")
            self._model = build_model_with_lora()
            texts = CLIENT_TEXTS[self.cid].copy()
            random.shuffle(texts)
            split = max(1, int(0.8 * len(texts)))
            self._ds_train = make_dataset(texts[:split])
            self._ds_eval = make_dataset(texts[split:])
    def get_parameters(self, config):
        self._ensure()
        return get_lora_ndarrays(self._model)

    def fit(self, parameters, config):
        self._ensure()
        set_lora_ndarrays(self._model, parameters)
        loss_before = eval_loss(self._model, self._ds_eval, max_batches=10)
        print(f"[Client {self.cid}] eval_loss_before={loss_before:.4f}")
        train_loss, n_examples = train_one_client_round(self._model, self._ds_train, epochs=int(config.get("local_epochs", LOCAL_EPOCHS)), lr=float(config.get("lr", LR)), grad_accum=int(config.get("grad_accum", GRAD_ACCUM)), warmup_steps=int(config.get("warmup_steps", WARMUP_STEPS)))
        loss_after = eval_loss(self._model, self._ds_eval, max_batches=10)
        print(f"[Client {self.cid}] train_loss={train_loss:.4f} eval_loss_after={loss_after:.4f}")
        new_params = get_lora_ndarrays(self._model)
        metrics = {"eval_loss_before": loss_before, "eval_loss_after": loss_after, "train_loss": train_loss}
        return new_params, n_examples, metrics
    def evaluate(self, parameters, config):
        self._ensure()
        set_lora_ndarrays(self._model, parameters)
        loss = eval_loss(self._model, self._ds_eval, max_batches=20)
        return float(loss), len(self._ds_eval), {"eval_loss": float(loss)}
def client_fn(context: Context):
    cid = 0
    try:
        cid = int(context.node_config.get("partition-id"))
    except Exception:
        try:
            cid = int(context.node_id) % NUM_CLIENTS
        except Exception:
            cid = 0
    return FedLoRAClient(cid).to_client()
Here we define the federated client logic that simulates different organizations taking part in training. Each client wraps a LoRA-augmented language model, and local datasets stay isolated on that client. The client handles training, evaluation, and parameter exchange; only the LoRA adapter values are ever exposed to the server.
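Stripped of Flower and Transformers, the contract a NumPyClient's fit() implements is simply: install the global parameters, train locally, and return (new parameters, example count, metrics). A toy mock illustrating that round trip (MockLoRAClient is hypothetical, and the "training" is a fake constant shift):

```python
import numpy as np

class MockLoRAClient:
    """Flower-free stand-in for the fit() contract a NumPyClient implements."""
    def __init__(self, n_examples):
        self.adapters = [np.zeros(4), np.zeros(4)]   # local LoRA arrays only
        self.n = n_examples

    def fit(self, parameters, config):
        self.adapters = [p.copy() for p in parameters]     # install global adapters
        self.adapters = [a + 0.1 for a in self.adapters]   # pretend local training
        return self.adapters, self.n, {"train_loss": 1.0}  # (params, num_examples, metrics)

server_params = [np.ones(4), np.ones(4)]
new_params, n, metrics = MockLoRAClient(n_examples=8).fit(server_params, {})
print(n, metrics, new_params[0])
```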
def fit_config(server_round: int):
return {"local_epochs": LOCAL_EPOCHS, "lr": LR, "grad_accum": GRAD_ACCUM, "warmup_steps": WARMUP_STEPS}
strategy = fl.server.strategy.FedAvg(fraction_fit=1.0, fraction_evaluate=1.0, min_fit_clients=NUM_CLIENTS, min_evaluate_clients=NUM_CLIENTS, min_available_clients=NUM_CLIENTS, on_fit_config_fn=fit_config)
print("nStarting Flower simulation...n")
client_resources = {"num_cpus": 2, "num_gpus": 0.0}
if DEVICE == "cuda":
    client_resources = {"num_cpus": 2, "num_gpus": 0.25}
history = fl.simulation.start_simulation(client_fn=client_fn, num_clients=NUM_CLIENTS, config=fl.server.ServerConfig(num_rounds=ROUNDS), strategy=strategy, client_resources=client_resources, ray_init_args={"include_dashboard": False, "ignore_reinit_error": True})
print("nSimulation done.")
We configure the federated-learning strategy and orchestrate global training: how many clients participate, how parameters are aggregated, and how training rounds are scheduled. We then launch the Flower simulation, which handles communication and aggregation across the virtual clients.
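FedAvg's aggregation rule is an example-count-weighted average of the arrays each client returns. A NumPy sketch of what the strategy computes every round:

```python
import numpy as np

def fedavg(client_updates):
    """Example-weighted average of per-client parameter lists (FedAvg's rule)."""
    total = sum(n for _, n in client_updates)
    n_arrays = len(client_updates[0][0])
    return [
        sum(params[i] * (n / total) for params, n in client_updates)
        for i in range(n_arrays)
    ]

updates = [
    ([np.full(3, 1.0)], 10),   # client 1: 10 examples
    ([np.full(3, 4.0)], 30),   # client 2: 30 examples
]
agg = fedavg(updates)
print(agg[0])  # (1*10 + 4*30) / 40 = 3.25 per element
```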
demo_model = build_model_with_lora()
demo_model.eval()
prompt = "Summarize this internal note for leadership in 2 bullets:nDispatch note: optimize routes by time windows and driver hours to reduce empty miles.nnAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
dev = model_primary_device(demo_model)
inputs = {k: v.to(dev) for k, v in inputs.items()}
with torch.no_grad():
    out = demo_model.generate(**inputs, max_new_tokens=80, do_sample=True, temperature=0.8, top_p=0.95, repetition_penalty=1.05, eos_token_id=tokenizer.eos_token_id, pad_token_id=tokenizer.pad_token_id)
print("n=== Generation output ===n")
print(tokenizer.decode(out[0], skip_special_tokens=True))
Finally, we load a fresh LoRA-augmented instance of the model to demonstrate the result after training. We build a realistic prompt and generate text with the same architecture used during training, checking that the pipeline produces coherent, task-aligned output.
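The generate() call above samples with temperature scaling and nucleus (top-p) filtering. A NumPy sketch of that filtering step on a toy logit vector (the numbers are purely illustrative):

```python
import numpy as np

def top_p_filter(logits, top_p=0.95, temperature=0.8):
    """Temperature + nucleus filtering, the sampling scheme used by generate() above."""
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    order = np.argsort(-probs)                 # tokens by descending probability
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, top_p)) + 1]  # smallest set covering top_p mass
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()           # renormalize over the nucleus

logits = np.array([5.0, 4.0, 1.0, 0.5, 0.1])
p = top_p_filter(logits)
print(np.round(p, 3))
```

Low-probability tail tokens get zeroed out, which is why top_p=0.95 keeps generations coherent while do_sample=True still allows variety.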
print(type(history))
print(history.__dict__.keys())
The federated run produces simulation and training outputs. We examine the returned history object to confirm that rounds, metrics, and aggregation completed successfully; this step verifies the reproducibility and integrity of the workflow.
In conclusion, we ran federated fine-tuning of an LLM end-to-end in a Colab environment. Without sharing any raw text or full model weights, we coordinated client-side LoRA training with server-side LoRA aggregation and evaluation. This workflow shows how federated learning combined with PEFT lets organizations adapt models while preserving privacy and robustness, and it lays the foundation for extensions such as personalization and enterprise deployment.

