AI-trends.today

How to build a stable and efficient QLoRA fine-tuning pipeline using Unsloth with large language models

Tech · By Gavin Wallace · 04/03/2026 · 4 Mins Read

In this tutorial, we demonstrate how to fine-tune a large language model efficiently with Unsloth and QLoRA. Our focus is on building a stable end-to-end fine-tuning pipeline that survives common Colab problems such as GPU detection failures, runtime crashes, and library incompatibilities. We show that, by configuring the model and controlling the training loop with care, it is possible to train a well-tuned instruction model on limited resources.

import os, sys, subprocess, gc, locale


locale.getpreferredencoding = lambda: "UTF-8"


def run(cmd):
    print("\n$ " + cmd, flush=True)
    p = subprocess.Popen(cmd, shell=True, stdout=subprocess.PIPE, stderr=subprocess.STDOUT, text=True)
    for line in p.stdout:
        print(line, end="", flush=True)
    rc = p.wait()
    if rc != 0:
        raise RuntimeError(f"Command failed ({rc}): {cmd}")


print("Installing packages (this may take 2–3 minutes)...", flush=True)


run("pip install -U pip")
run("pip uninstall -y torch torchvision torchaudio")
run(
   "pip install --no-cache-dir "
   "torch==2.4.1 torchvision==0.19.1 torchaudio==2.4.1 "
   "--index-url https://download.pytorch.org/whl/cu121"
)
run(
   "pip install -U "
   "transformers==4.45.2 "
   "accelerate==0.34.2 "
   "datasets==2.21.0 "
   "trl==0.11.4 "
   "sentencepiece safetensors evaluate"
)
run("pip install -U unsloth")


# Check that the freshly installed torch/unsloth stack imports cleanly;
# if not, the Colab runtime must be restarted before continuing.
try:
    import unsloth
    restarted = False
except Exception:
    restarted = True


if restarted:
    print("\nRuntime needs restart. After restart, run this SAME cell again.", flush=True)
    os._exit(0)

By reinstalling PyTorch, we create a controlled, compatible environment: Unsloth and its dependencies are matched to Google Colab's CUDA runtime. We also handle the runtime-restart logic, ensuring the training environment is stable before training begins.
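Because the whole setup hinges on matching wheel versions, it can help to fail fast when the installed torch does not match the pin. The helper below is an illustrative sketch (not part of the original tutorial) that checks an installed version string against a "major.minor.*" pin:

```python
def version_matches(installed: str, pin: str) -> bool:
    """Return True if an installed version string (e.g. '2.4.1+cu121')
    satisfies a loose 'major.minor.*' pin such as '2.4.*'."""
    parts = installed.split("+")[0].split(".")  # drop the local '+cu121' tag
    wanted = pin.split(".")
    return all(w == "*" or w == p for p, w in zip(parts, wanted))
```

One could then guard the rest of the notebook with `assert version_matches(torch.__version__, "2.4.*")` right after the install cell.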

import torch, gc


assert torch.cuda.is_available()
print("Torch:", torch.__version__)
print("GPU:", torch.cuda.get_device_name(0))
print("VRAM(GB):", round(torch.cuda.get_device_properties(0).total_memory / 1e9, 2))


torch.backends.cuda.matmul.allow_tf32 = True
torch.backends.cudnn.allow_tf32 = True


def clean():
    gc.collect()
    torch.cuda.empty_cache()


import unsloth
from unsloth import FastLanguageModel
from datasets import load_dataset
from transformers import TextStreamer
from trl import SFTConfig, SFTTrainer

After verifying GPU availability, we configure PyTorch for efficient computation. Unsloth is imported before the other training libraries so that its performance optimizations are applied correctly, and we define a small utility for reclaiming GPU memory during training.
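The same `gc.collect()` / `empty_cache()` pattern can be wrapped in a context manager so cleanup runs even when a step raises. This is an illustrative helper, not part of the tutorial's code; on a GPU runtime the `empty_cache` argument would be `torch.cuda.empty_cache`:

```python
import gc
from contextlib import contextmanager


@contextmanager
def memory_guard(empty_cache=None):
    """Run the wrapped block, then force garbage collection and,
    if a callback is given (e.g. torch.cuda.empty_cache), release
    cached GPU memory -- even if the block raised."""
    try:
        yield
    finally:
        gc.collect()
        if empty_cache is not None:
            empty_cache()
```

Usage would look like `with memory_guard(torch.cuda.empty_cache): trainer.train()`, guaranteeing the cache is flushed after the run regardless of errors.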

max_seq_length = 768
model_name = "unsloth/Qwen2.5-1.5B-Instruct-bnb-4bit"


model, tokenizer = FastLanguageModel.from_pretrained(
   model_name=model_name,
   max_seq_length=max_seq_length,
   dtype=None,
   load_in_4bit=True,
)


model = FastLanguageModel.get_peft_model(
   model,
   r=8,
   target_modules=["q_proj", "k_proj"],
   lora_alpha=16,
   lora_dropout=0.0,
   bias="none",
   use_gradient_checkpointing="unsloth",
   random_state=42,
   max_seq_length=max_seq_length,
)

Unsloth’s fast-loading utilities let us quickly load a 4-bit-quantized, instruction-tuned model, onto which we attach LoRA adapters for parameter-efficient fine-tuning. The LoRA configuration balances memory footprint against learning capacity.
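The memory/capacity trade-off of the rank `r` is easy to quantify: each adapted weight W of shape (out_dim, in_dim) gains two low-rank factors A (r × in_dim) and B (out_dim × r), i.e. r · (in_dim + out_dim) trainable parameters. A back-of-the-envelope helper (the shapes below are illustrative, not read from the actual model):

```python
def lora_param_count(shapes, r):
    """Trainable parameters added by LoRA adapters of rank r,
    given a list of (out_dim, in_dim) weight shapes."""
    return sum(r * (out_dim + in_dim) for out_dim, in_dim in shapes)


# A single square 1536x1536 projection at r=8 adds 8 * (1536 + 1536) params,
# a tiny fraction of the ~2.4M frozen params in that one matrix.
```

This is why raising `r` from 8 to 16 roughly doubles adapter size while leaving the frozen 4-bit base model untouched.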

ds = load_dataset("trl-lib/Capybara", split="train").shuffle(seed=42).select(range(1200))


def to_text(example):
    example["text"] = tokenizer.apply_chat_template(
        example["messages"],
        tokenize=False,
        add_generation_prompt=False,
    )
    return example


ds = ds.map(to_text, remove_columns=[c for c in ds.column_names if c != "messages"])
ds = ds.remove_columns(["messages"])
split = ds.train_test_split(test_size=0.02, seed=42)
train_ds, eval_ds = split["train"], split["test"]


cfg = SFTConfig(
   output_dir="unsloth_sft_out",
   dataset_text_field="text",
   max_seq_length=max_seq_length,
   packing=False,
   per_device_train_batch_size=1,
   gradient_accumulation_steps=8,
   max_steps=150,
   learning_rate=2e-4,
   warmup_ratio=0.03,
   lr_scheduler_type="cosine",
   logging_steps=10,
   eval_strategy="no",
   save_steps=0,
   fp16=True,
   optim="adamw_8bit",
   report_to="none",
   seed=42,
)


trainer = SFTTrainer(
   model=model,
   tokenizer=tokenizer,
   train_dataset=train_ds,
   eval_dataset=eval_ds,
   args=cfg,
)

To prepare the training dataset, we convert multi-turn conversations into a plain-text format suitable for supervised fine-tuning, and we hold out a small evaluation split to preserve the integrity of training. The training configuration controls the batch size, learning rate, and training duration.
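Two numbers in this config are worth making explicit. The effective batch size is per_device_train_batch_size × gradient_accumulation_steps = 1 × 8 = 8 sequences per optimizer step, and the learning rate follows linear warmup into cosine decay. The sketch below approximates that schedule; the exact Hugging Face implementation may differ in off-by-one details:

```python
import math


def approx_lr(step, max_steps=150, base_lr=2e-4, warmup_ratio=0.03):
    """Linear warmup for the first warmup_ratio fraction of steps,
    then cosine decay from base_lr down to zero."""
    warmup_steps = max(1, int(max_steps * warmup_ratio))
    if step < warmup_steps:
        return base_lr * (step + 1) / warmup_steps  # ramp up linearly
    progress = (step - warmup_steps) / max(1, max_steps - warmup_steps)
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))
```

With max_steps=150 and warmup_ratio=0.03 this gives 4 warmup steps, a peak of 2e-4, and a smooth decay toward zero by the final step.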

clean()
trainer.train()


FastLanguageModel.for_inference(model)


def chat(prompt, max_new_tokens=160):
    messages = [{"role": "user", "content": prompt}]
   text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
   inputs = tokenizer([text], return_tensors="pt").to("cuda")
   streamer = TextStreamer(tokenizer, skip_prompt=True, skip_special_tokens=True)
    with torch.inference_mode():
       model.generate(
           **inputs,
           max_new_tokens=max_new_tokens,
           temperature=0.7,
           top_p=0.9,
           do_sample=True,
           streamer=streamer,
       )


chat("Give a concise checklist for validating a machine learning model before deployment.")


save_dir = "unsloth_lora_adapters"
model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)

We then run the training loop and monitor the fine-tuning process on the GPU. Switching the model into inference mode, we validate it against a test prompt, and we save the trained LoRA adapters so they can be reused or deployed later.

In conclusion, we fine-tuned an instruction-following language model using Unsloth’s optimized training stack and a lightweight QLoRA setup. By constraining the sequence length, dataset size, and number of training steps, we achieved stable GPU training without interruptions. The resulting LoRA adapters can be deployed directly or used to extend this workflow.
