This tutorial shows how to federate the fine-tuning of a language model without ever centralizing private text. We simulate several organizations and show how each one locally adapts a shared base model, exchanging only the lightweight LoRA parameters. Flower's simulation engine for federated LLMs, combined with parameter-efficient fine-tuning, gives us an effective, scalable way for organizations to tailor LLMs on sensitive data while preserving privacy.
!pip install -q -U "flwr[simulation]" transformers datasets peft accelerate bitsandbytes protobuf

CLIENT_TEXTS = {
1: [
"... manager -> compliance for high-risk cases.",
],
2: [
"Fleet ops: preventive maintenance reduces downtime; prioritize vehicles with repeated fault codes.",
"Dispatch note: optimize routes by time windows and driver hours to reduce empty miles.",
"Safety policy: enforce rest breaks and log inspections before long-haul trips.",
"Inventory update: track spare parts usage; reorder thresholds should reflect lead time and seasonality.",
"Customer SLA: late deliveries require proactive notifications and documented root cause."
],
}
for cid in list(CLIENT_TEXTS.keys()):
    base = CLIENT_TEXTS[cid]
    CLIENT_TEXTS[cid] = base + [f"Q: Summarize this for leadership. A: {t}" for t in base]
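The augmentation loop simply appends an instruction-style variant of every raw note, doubling each silo. The same idea on a toy dictionary (the texts here are hypothetical stand-ins):

```python
client_texts = {
    1: ["preventive maintenance reduces downtime"],
    2: ["optimize routes by time windows"],
}

# Append a Q/A-formatted copy of every raw note, as the loop above does
for cid in list(client_texts.keys()):
    base = client_texts[cid]
    client_texts[cid] = base + [f"Q: Summarize this for leadership. A: {t}" for t in base]

for cid, texts in client_texts.items():
    print(cid, len(texts), texts[-1])
```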
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, use_fast=True)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token
bnb_config: Optional[BitsAndBytesConfig] = None
if DEVICE == "cuda":
    compute_dtype = torch.bfloat16 if torch.cuda.get_device_capability(0)[0] >= 8 else torch.float16
    bnb_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True, bnb_4bit_compute_dtype=compute_dtype)
if "gpt2" MODEL_ID.lower():
TARGET_MODULES = ["c_attn", "c_proj"]
else:
TARGET_MODULES = ["q_proj", "k_proj", "v_proj", "o_proj"]
LORA_R = 16
LORA_ALPHA = 32
LORA_DROPOUT = 0.05
lora_config = LoraConfig(r=LORA_R, lora_alpha=LORA_ALPHA, lora_dropout=LORA_DROPOUT, bias="none", task_type="CAUSAL_LM", target_modules=TARGET_MODULES)
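To see why exchanging only the adapters is cheap, here is a dependency-light sketch (NumPy only, no PEFT; the 768-dimensional projection size is an illustrative assumption) of the low-rank update LoRA adds to one attention weight under the r=16, alpha=32 settings above:

```python
import numpy as np

d_model, r, alpha = 768, 16, 32   # r and alpha match the LoRA config above

W = np.random.randn(d_model, d_model).astype(np.float32)   # frozen base weight
A = np.random.randn(r, d_model).astype(np.float32) * 0.01  # lora_A (trained)
B = np.zeros((d_model, r), dtype=np.float32)               # lora_B starts at zero

delta_W = (alpha / r) * (B @ A)   # the low-rank update added to W
full_params = W.size
lora_params = A.size + B.size

print(f"full matrix params : {full_params}")
print(f"LoRA adapter params: {lora_params} ({100 * lora_params / full_params:.2f}%)")
assert np.allclose(W + delta_W, W)  # B == 0 => no change before any training
```

Only A and B ever leave a client, so each round transmits roughly 4% of what shipping the full projection matrices would cost.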
def model_primary_device(model) -> torch.device:
    return next(model.parameters()).device
def build_model_with_lora():
    if DEVICE == "cuda":
        model = AutoModelForCausalLM.from_pretrained(MODEL_ID, device_map="auto", quantization_config=bnb_config, torch_dtype="auto")
        model = prepare_model_for_kbit_training(model)
    else:
        model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype=torch.float32)
        model.to("cpu")
    model = get_peft_model(model, lora_config)
    model.train()
    return model
def make_dataset(texts: List[str]) -> Dataset:
    ds = Dataset.from_dict({"text": texts})
    def tok(batch):
        return tokenizer(batch["text"], truncation=True, max_length=MAX_LEN, padding="max_length")
    ds = ds.map(tok, batched=True, remove_columns=["text"])
    return ds
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)
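With mlm=False, the collator prepares causal-LM batches: labels are a copy of input_ids, with padding positions set to -100 so they are ignored by the loss. A dependency-free sketch of that behaviour, using a hypothetical word-level tokenizer in place of the real one:

```python
MAX_LEN, PAD_ID, IGNORE_INDEX = 8, 0, -100

def toy_encode(text):
    # Hypothetical word-level tokenizer: ids in 1..100, truncated/padded to MAX_LEN
    ids = [hash(w) % 100 + 1 for w in text.split()][:MAX_LEN]
    return ids + [PAD_ID] * (MAX_LEN - len(ids))

def collate_causal_lm(texts):
    input_ids = [toy_encode(t) for t in texts]
    # Causal-LM labels: identical to input_ids, padding masked with -100
    labels = [[tid if tid != PAD_ID else IGNORE_INDEX for tid in row] for row in input_ids]
    return {"input_ids": input_ids, "labels": labels}

batch = collate_causal_lm([
    "optimize routes by time windows and driver hours to reduce empty miles",  # truncated
    "enforce rest breaks",                                                     # padded
])
print(batch["labels"][1])
```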
def lora_state_keys(model) -> List[str]:
    sd = model.state_dict()
    keys = sorted([k for k in sd.keys() if "lora_" in k])
    if not keys:
        raise RuntimeError("No LoRA keys found. Your model might not have the target_modules specified. " f"Current TARGET_MODULES={TARGET_MODULES}, MODEL_ID={MODEL_ID}")
    return keys
def get_lora_ndarrays(model) -> List[np.ndarray]:
    sd = model.state_dict()
    keys = lora_state_keys(model)
    return [sd[k].detach().float().cpu().numpy() for k in keys]
def set_lora_ndarrays(model, arrays: List[np.ndarray]) -> None:
    keys = lora_state_keys(model)
    if len(keys) != len(arrays):
        raise ValueError(f"Mismatch: got {len(arrays)} arrays but expected {len(keys)}.")
    sd = model.state_dict()
    for k, arr in zip(keys, arrays):
        t = torch.from_numpy(arr).to(sd[k].device).to(sd[k].dtype)
        sd[k].copy_(t)
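This (de)serialization works because both helpers iterate the same sorted key list, so the arrays always line up in the same order on every client and on the server. A NumPy-only sketch of the round trip over a fake state dict:

```python
import numpy as np

# Stand-in for a model state dict: two LoRA factors plus a frozen weight
state = {
    "layer.0.q_proj.lora_A.weight": np.random.randn(8, 32).astype(np.float32),
    "layer.0.q_proj.lora_B.weight": np.zeros((32, 8), dtype=np.float32),
    "layer.0.q_proj.weight": np.random.randn(32, 32).astype(np.float32),  # never exchanged
}

def lora_keys(sd):
    return sorted(k for k in sd if "lora_" in k)   # sorted => stable ordering everywhere

def to_ndarrays(sd):
    return [sd[k].copy() for k in lora_keys(sd)]

def from_ndarrays(sd, arrays):
    for k, arr in zip(lora_keys(sd), arrays):
        sd[k][...] = arr                           # in-place copy, like Tensor.copy_

payload = to_ndarrays(state)                       # what a client uploads
from_ndarrays(state, [a + 1.0 for a in payload])   # what it installs on download
print([a.shape for a in payload])
```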
def cosine_warmup_lr(step: int, total_steps: int, base_lr: float, warmup_steps: int) -> float:
    # Linear warmup followed by cosine decay to zero
    if step < warmup_steps:
        return base_lr * float(step + 1) / float(max(1, warmup_steps))
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + float(np.cos(np.pi * progress)))

def eval_loss(model, ds: Dataset, max_batches: int = 10) -> float:
    model.eval()
    dl = DataLoader(ds, batch_size=BATCH_SIZE, shuffle=False, collate_fn=collator)
    losses = []
    dev = model_primary_device(model)
    for i, batch in enumerate(dl):
        if i >= max_batches:
            break
        batch = {k: v.to(dev) for k, v in batch.items()}
        out = model(**batch)  # collator already supplies causal-LM labels
        losses.append(float(out.loss.detach().cpu()))
    model.train()
    return float(np.mean(losses)) if losses else float("nan")
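As a sanity check, the warmup-then-cosine schedule can be exercised standalone (the exact constants here are an assumption matching the usual linear-warmup/cosine-decay form, not taken verbatim from the tutorial):

```python
import numpy as np

def cosine_warmup_lr(step, total_steps, base_lr, warmup_steps):
    # Linear warmup, then cosine decay to zero
    if step < warmup_steps:
        return base_lr * float(step + 1) / float(max(1, warmup_steps))
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + float(np.cos(np.pi * progress)))

lrs = [cosine_warmup_lr(s, total_steps=100, base_lr=2e-4, warmup_steps=10) for s in range(100)]
print(f"peak={max(lrs):.2e} at step {lrs.index(max(lrs))}, final={lrs[-1]:.2e}")
```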
def train_one_client_round(model, ds: Dataset, epochs: int, lr: float, grad_accum: int, warmup_steps: int) -> Tuple[float, int]:
    dl = DataLoader(ds, batch_size=BATCH_SIZE, shuffle=True, collate_fn=collator)
    total_steps = max(1, (len(dl) * epochs) // max(1, grad_accum))
    step = 0
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=WEIGHT_DECAY)
    optimizer.zero_grad(set_to_none=True)
    running = []
    examples = 0
    dev = model_primary_device(model)
    for _ in range(epochs):
        for bi, batch in enumerate(dl):
            batch = {k: v.to(dev) for k, v in batch.items()}
            out = model(**batch)  # collator already supplies causal-LM labels
            loss = out.loss / grad_accum
            loss.backward()
            running.append(float(loss.detach().cpu()) * grad_accum)
            examples += batch["input_ids"].shape[0]
            if (bi + 1) % grad_accum == 0:
                lr_t = cosine_warmup_lr(step, total_steps, lr, warmup_steps)
                for pg in optimizer.param_groups:
                    pg["lr"] = lr_t
                optimizer.step()
                optimizer.zero_grad(set_to_none=True)
                step += 1
                if step % LOG_EVERY == 0:
                    print(f"  step={step}/{total_steps} loss={np.mean(running[-LOG_EVERY:]):.4f} lr={lr_t:.2e}")
    return (float(np.mean(running)) if running else float("nan")), examples
Here we set up the full execution environment: the configuration, the private per-client text silos, and the tokenizer, all adapting automatically to the available CPU or GPU. We also define the helper utilities that enable parameter-efficient fine-tuning and safe device handling across federated clients.
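One detail worth isolating is the train/eval split each client performs on its silo: the max(1, ...) guard keeps the training set non-empty even for a one-document silo. A standalone sketch (split_silo is a hypothetical helper name):

```python
import random

def split_silo(texts, train_frac=0.8, seed=0):
    # Shuffle, then split; max(1, ...) guarantees a non-empty training set
    # even when a silo holds a single document.
    texts = list(texts)
    random.Random(seed).shuffle(texts)
    cut = max(1, int(train_frac * len(texts)))
    return texts[:cut], texts[cut:]

train, evals = split_silo([f"note-{i}" for i in range(10)])
print(len(train), len(evals))
tiny_train, tiny_eval = split_silo(["only-note"])
print(len(tiny_train), len(tiny_eval))
```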
class FedLoRAClient(fl.client.NumPyClient):
    def __init__(self, cid: int):
        self.cid = cid
        self._model = None
        self._ds_train = None
        self._ds_eval = None

    def _ensure(self):
        if self._model is None:
            print(f"[Client {self.cid}] Loading model + LoRA (MODEL_ID={MODEL_ID})...")
            self._model = build_model_with_lora()
            texts = CLIENT_TEXTS[self.cid].copy()
            random.shuffle(texts)
            split = max(1, int(0.8 * len(texts)))
            self._ds_train = make_dataset(texts[:split])
            self._ds_eval = make_dataset(texts[split:])
    def get_parameters(self, config):
        self._ensure()
        return get_lora_ndarrays(self._model)

    def fit(self, parameters, config):
        self._ensure()
        set_lora_ndarrays(self._model, parameters)
        loss_before = eval_loss(self._model, self._ds_eval, max_batches=10)
        print(f"[Client {self.cid}] eval_loss_before={loss_before:.4f}")
        train_loss, n_examples = train_one_client_round(self._model, self._ds_train, epochs=int(config.get("local_epochs", LOCAL_EPOCHS)), lr=float(config.get("lr", LR)), grad_accum=int(config.get("grad_accum", GRAD_ACCUM)), warmup_steps=int(config.get("warmup_steps", WARMUP_STEPS)))
        loss_after = eval_loss(self._model, self._ds_eval, max_batches=10)
        print(f"[Client {self.cid}] train_loss={train_loss:.4f} eval_loss_after={loss_after:.4f}")
        new_params = get_lora_ndarrays(self._model)
        metrics = {"eval_loss_before": loss_before, "eval_loss_after": loss_after, "train_loss": train_loss}
        return new_params, n_examples, metrics
    def evaluate(self, parameters, config):
        self._ensure()
        set_lora_ndarrays(self._model, parameters)
        loss = eval_loss(self._model, self._ds_eval, max_batches=20)
        return float(loss), len(self._ds_eval), {"eval_loss": float(loss)}
def client_fn(context: Context):
    cid = 0
    try:
        cid = int(context.node_config.get("partition-id"))
    except Exception:
        try:
            cid = int(context.node_id) % NUM_CLIENTS
        except Exception:
            cid = 0
    return FedLoRAClient(cid).to_client()
Here we define the federated client logic that simulates different organizations taking part in training. Each client wraps a LoRA-augmented language model, and local datasets stay isolated on that client. The client handles training, evaluation, and parameter exchange; only the LoRA adapter values are ever exposed to the server.
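Stripped of Flower and Transformers, the contract a NumPyClient's fit() implements is simply: install the global parameters, train locally, and return (new parameters, example count, metrics). A toy mock illustrating that round trip (MockLoRAClient is hypothetical, and the "training" is a fake constant shift):

```python
import numpy as np

class MockLoRAClient:
    """Flower-free stand-in for the fit() contract a NumPyClient implements."""
    def __init__(self, n_examples):
        self.adapters = [np.zeros(4), np.zeros(4)]   # local LoRA arrays only
        self.n = n_examples

    def fit(self, parameters, config):
        self.adapters = [p.copy() for p in parameters]     # install global adapters
        self.adapters = [a + 0.1 for a in self.adapters]   # pretend local training
        return self.adapters, self.n, {"train_loss": 1.0}  # (params, num_examples, metrics)

server_params = [np.ones(4), np.ones(4)]
new_params, n, metrics = MockLoRAClient(n_examples=8).fit(server_params, {})
print(n, metrics, new_params[0])
```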
def fit_config(server_round: int):
return {"local_epochs": LOCAL_EPOCHS, "lr": LR, "grad_accum": GRAD_ACCUM, "warmup_steps": WARMUP_STEPS}
strategy = fl.server.strategy.FedAvg(fraction_fit=1.0, fraction_evaluate=1.0, min_fit_clients=NUM_CLIENTS, min_evaluate_clients=NUM_CLIENTS, min_available_clients=NUM_CLIENTS, on_fit_config_fn=fit_config)
print("nStarting Flower simulation...n")
client_resources = {"num_cpus": 2, "num_gpus": 0.0}
if DEVICE == "cuda":
    client_resources = {"num_cpus": 2, "num_gpus": 0.25}
history = fl.simulation.start_simulation(client_fn=client_fn, num_clients=NUM_CLIENTS, config=fl.server.ServerConfig(num_rounds=ROUNDS), strategy=strategy, client_resources=client_resources, ray_init_args={"include_dashboard": False, "ignore_reinit_error": True})
print("nSimulation done.")
We configure the federated-learning strategy and orchestrate global training: how many clients participate, how parameters are aggregated, and how training rounds are scheduled. We then launch the Flower simulation, which handles communication and aggregation across the virtual clients.
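FedAvg's aggregation rule is an example-count-weighted average of the arrays each client returns. A NumPy sketch of what the strategy computes every round:

```python
import numpy as np

def fedavg(client_updates):
    """Example-weighted average of per-client parameter lists (FedAvg's rule)."""
    total = sum(n for _, n in client_updates)
    n_arrays = len(client_updates[0][0])
    return [
        sum(params[i] * (n / total) for params, n in client_updates)
        for i in range(n_arrays)
    ]

updates = [
    ([np.full(3, 1.0)], 10),   # client 1: 10 examples
    ([np.full(3, 4.0)], 30),   # client 2: 30 examples
]
agg = fedavg(updates)
print(agg[0])  # (1*10 + 4*30) / 40 = 3.25 per element
```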
demo_model = build_model_with_lora()
demo_model.eval()
prompt = "Summarize this internal note for leadership in 2 bullets:nDispatch note: optimize routes by time windows and driver hours to reduce empty miles.nnAnswer:"
inputs = tokenizer(prompt, return_tensors="pt")
dev = model_primary_device(demo_model)
inputs = {k: v.to(dev) for k, v in inputs.items()}
with torch.no_grad():
    out = demo_model.generate(**inputs, max_new_tokens=80, do_sample=True, temperature=0.8, top_p=0.95, repetition_penalty=1.05, eos_token_id=tokenizer.eos_token_id, pad_token_id=tokenizer.pad_token_id)
print("n=== Generation output ===n")
print(tokenizer.decode(out[0], skip_special_tokens=True))
Finally, we load a fresh LoRA-augmented instance of the model to demonstrate the result after training. We build a realistic prompt and generate text with the same architecture used during training, checking that the pipeline produces coherent, task-aligned output.
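The generate() call above samples with temperature scaling and nucleus (top-p) filtering. A NumPy sketch of that filtering step on a toy logit vector (the numbers are purely illustrative):

```python
import numpy as np

def top_p_filter(logits, top_p=0.95, temperature=0.8):
    """Temperature + nucleus filtering, the sampling scheme used by generate() above."""
    scaled = logits / temperature
    probs = np.exp(scaled - np.max(scaled))
    probs /= probs.sum()
    order = np.argsort(-probs)                 # tokens by descending probability
    cum = np.cumsum(probs[order])
    keep = order[: int(np.searchsorted(cum, top_p)) + 1]  # smallest set covering top_p mass
    filtered = np.zeros_like(probs)
    filtered[keep] = probs[keep]
    return filtered / filtered.sum()           # renormalize over the nucleus

logits = np.array([5.0, 4.0, 1.0, 0.5, 0.1])
p = top_p_filter(logits)
print(np.round(p, 3))
```

Low-probability tail tokens get zeroed out, which is why top_p=0.95 keeps generations coherent while do_sample=True still allows variety.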
print(type(history))
print(history.__dict__.keys())
The federated run produces simulation and training outputs. We examine the returned history object to confirm that rounds, metrics, and aggregation completed successfully; this step verifies the reproducibility and integrity of the workflow.
In conclusion, we ran federated fine-tuning of an LLM end-to-end in a Colab environment. Without sharing any raw text or full model weights, we coordinated client-side LoRA training with server-side LoRA aggregation and evaluation. This workflow shows how federated learning combined with PEFT lets organizations adapt models while preserving privacy and robustness, and it lays the foundation for extensions such as personalization and enterprise deployment.

