This Ultra-Light Mistral Devstral tutorial is a Colab-friendly guide designed for users with limited disk space. It shows how to run the Devstral Small model, making it practical to work with a large Mistral language model in environments with constrained memory and storage. Using BitsAndBytes quantization, the tutorial walks through building a lightweight, interactive assistant, and explains how to manage caches and generate tokens efficiently. The setup is well suited to prototyping, debugging code, and building small tools and utilities.
!pip install -q kagglehub mistral-common bitsandbytes transformers --no-cache-dir
!pip install -q accelerate torch --no-cache-dir
import shutil
import os
import gc
The first step is installing the essential lightweight packages: kagglehub, mistral-common, bitsandbytes, and transformers. Each install runs with --no-cache-dir so pip stores no package cache, minimizing disk use. The tutorial also installs torch and accelerate for efficient model loading and inference. For further space optimization, temporary and cache directories are cleared with Python's shutil, os, and gc modules.
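Before moving on, it can help to confirm that the installs actually landed. A minimal stdlib-only check (the package names passed in are just examples; swap in whatever you installed) prints each installed version:

```python
from importlib import metadata

def report_versions(packages):
    """Print the installed version of each package, or a notice if missing."""
    for pkg in packages:
        try:
            print(f"{pkg}: {metadata.version(pkg)}")
        except metadata.PackageNotFoundError:
            print(f"{pkg}: not installed")

report_versions(["torch", "transformers", "bitsandbytes"])
```

Because --no-cache-dir was used, a failed download cannot be silently papered over by a stale wheel cache, so this check reflects the live environment.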
def cleanup_cache():
    """Clean up unnecessary files to save disk space"""
    cache_dirs = ['/root/.cache', '/tmp/kagglehub']
    for cache_dir in cache_dirs:
        if os.path.exists(cache_dir):
            shutil.rmtree(cache_dir, ignore_errors=True)
    gc.collect()

cleanup_cache()
print("🧹 Disk space optimized!")
The cleanup_cache() function keeps the on-disk footprint minimal during execution. It removes redundant cache directories such as /root/.cache and /tmp/kagglehub, then triggers garbage collection. Invoking it proactively before and after heavy operations frees disk space, and the confirmation message it prints reinforces the tutorial's focus on resource efficiency.
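To see how much a cleanup actually reclaims, a small helper like the one below (a sketch, not part of the tutorial's own code) can measure a directory's size before you call cleanup_cache():

```python
import os

def dir_size_mb(path):
    """Total size of all regular files under `path`, in megabytes."""
    total = 0
    for root, _, files in os.walk(path):
        for name in files:
            fp = os.path.join(root, name)
            if os.path.isfile(fp):
                total += os.path.getsize(fp)
    return total / 1024**2
```

For example, `dir_size_mb('/root/.cache')` just before cleanup shows how many megabytes the subsequent shutil.rmtree calls will free.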
import warnings
warnings.filterwarnings("ignore")
import torch
import kagglehub
from mistral_common.protocol.instruct.messages import UserMessage
from mistral_common.protocol.instruct.request import ChatCompletionRequest
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
For smoother execution, all warning messages are suppressed with Python's warnings module. The next step imports the libraries needed to interact with the model: torch for tensor computation, kagglehub for streaming the model download, and transformers for loading the quantized LLM. The Mistral-specific classes UserMessage, ChatCompletionRequest, and MistralTokenizer handle tokenization and request formatting tailored to Devstral's architecture.
class LightweightDevstral:
    def __init__(self):
        print("📦 Downloading model (streaming mode)...")
        self.model_path = kagglehub.model_download(
            'mistral-ai/devstral-small-2505/Transformers/devstral-small-2505/1',
            force_download=False
        )

        quantization_config = BitsAndBytesConfig(
            bnb_4bit_compute_dtype=torch.float16,
            bnb_4bit_quant_type="nf4",
            bnb_4bit_use_double_quant=True,
            bnb_4bit_quant_storage=torch.uint8,
            load_in_4bit=True
        )

        print("⚡ Loading ultra-compressed model...")
        self.model = AutoModelForCausalLM.from_pretrained(
            self.model_path,
            torch_dtype=torch.float16,
            device_map="auto",
            quantization_config=quantization_config,
            low_cpu_mem_usage=True,
            trust_remote_code=True
        )

        self.tokenizer = MistralTokenizer.from_file(f'{self.model_path}/tekken.json')
        cleanup_cache()
        print("✅ Lightweight assistant ready! (~2GB disk usage)")

    def generate(self, prompt, max_tokens=400):
        """Memory-efficient generation"""
        tokenized = self.tokenizer.encode_chat_completion(
            ChatCompletionRequest(messages=[UserMessage(content=prompt)])
        )
        input_ids = torch.tensor([tokenized.tokens])
        if torch.cuda.is_available():
            input_ids = input_ids.to(self.model.device)

        with torch.inference_mode():
            output = self.model.generate(
                input_ids=input_ids,
                max_new_tokens=max_tokens,
                temperature=0.6,
                top_p=0.85,
                do_sample=True,
                pad_token_id=self.tokenizer.eos_token_id,
                use_cache=True
            )[0]

        del input_ids
        if torch.cuda.is_available():
            torch.cuda.empty_cache()

        return self.tokenizer.decode(output[len(tokenized.tokens):])

print("🚀 Initializing lightweight AI assistant...")
assistant = LightweightDevstral()
At the core of this tutorial is the LightweightDevstral class, which manages model loading and text generation efficiently. It first streams the devstral-small-2505 model via kagglehub, avoiding redundant downloads. The model is then loaded with 4-bit quantization through BitsAndBytesConfig, sharply reducing memory and disk usage while keeping inference performant. After initializing the tokenizer from the local tekken.json file, caches are cleared once more. The generate method is memory-safe: it runs under torch.inference_mode() and calls torch.cuda.empty_cache() after each generation, so the assistant produces responses quickly even on hardware with limited resources.
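The roughly 4x saving from 4-bit NF4 weights over fp16 is easy to sanity-check with back-of-the-envelope arithmetic. The sketch below uses a hypothetical 7B-parameter model for illustration and deliberately ignores activations, the KV cache, and quantization metadata overhead:

```python
def estimate_weight_memory(num_params, bits_per_param):
    """Rough weight-only memory estimate in gigabytes (GiB).
    Ignores activations, KV cache, and quantization metadata."""
    return num_params * bits_per_param / 8 / 1024**3

# Hypothetical 7B-parameter model, for illustration only:
fp16_gb = estimate_weight_memory(7e9, 16)  # ~13.0 GiB
nf4_gb = estimate_weight_memory(7e9, 4)    # ~3.3 GiB
print(f"fp16: {fp16_gb:.1f} GiB, nf4: {nf4_gb:.1f} GiB")
```

Real-world savings are somewhat smaller than the ideal 4x because NF4 stores per-block scaling factors (double quantization, enabled above, shrinks that overhead further).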
def run_demo(title, prompt, emoji="🎯"):
    """Run a single demo with cleanup"""
    print(f"\n{emoji} {title}")
    print("-" * 50)
    result = assistant.generate(prompt, max_tokens=350)
    print(result)
    gc.collect()
    if torch.cuda.is_available():
        torch.cuda.empty_cache()
run_demo(
    "Quick Prime Finder",
    "Write a fast prime checker function `is_prime(n)` with explanation and test cases.",
    "🔢"
)
run_demo(
    "Debug This Code",
    """Fix this bug and describe the issue:
```python
def avg_positive(numbers):
    total = sum([n for n in numbers if n > 0])
    return total / len([n for n in numbers if n > 0])
```""",
    "🐛"
)
run_demo(
    "Text Tool Creator",
    "Create a simple `TextAnalyzer` class with word count, char count, and palindrome check methods.",
    "🛠️"
)
This compact suite of run_demo() calls demonstrates the model's coding capabilities: each call prompts the Devstral assistant, prints the generated response, and then cleans up memory to avoid accumulation over repeated runs. The three examples ask the model to write an efficient prime-checking function, fix a Python snippet with a logical bug, and build a small TextAnalyzer class. Together they show the model acting as an efficient, lightweight coding assistant that can generate and explain code in real time.
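For reference, here is one fix the model should converge on for the buggy avg_positive snippet in the demo above (a sketch of a correct answer, not output from the model): the original raises ZeroDivisionError when the input contains no positive numbers, and it also filters the list twice.

```python
def avg_positive(numbers):
    """Average of the positive values in `numbers`.
    Returns 0.0 when there are none, instead of raising
    ZeroDivisionError, and filters the list only once."""
    positives = [n for n in numbers if n > 0]
    return sum(positives) / len(positives) if positives else 0.0
```

For example, `avg_positive([2, -2, 4])` returns 3.0, and `avg_positive([-5, -1])` returns 0.0 rather than crashing.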
def quick_coding():
    """Lightweight interactive session"""
    print("\n🎮 QUICK CODING MODE")
    print("=" * 40)
    print("Enter short coding prompts (type 'exit' to quit)")
    session_count = 0
    max_sessions = 5
    # Cap the interactive loop at max_sessions prompts to bound memory use
    while session_count < max_sessions:
        prompt = input(f"\n[{session_count + 1}/{max_sessions}] Prompt: ").strip()
        if not prompt or prompt.lower() == 'exit':
            break
        print(assistant.generate(prompt, max_tokens=300))
        session_count += 1
        gc.collect()
        if torch.cuda.is_available():
            torch.cuda.empty_cache()
Quick Coding Mode lets you submit short prompts directly to Devstral. To bound memory use, the session is limited to five interactions, and memory is aggressively cleaned after each prompt to keep the assistant responsive. This mode is ideal for rapid prototyping, debugging on the go, and exploring new coding concepts.
def check_disk_usage():
    """Monitor disk usage"""
    import subprocess
    try:
        result = subprocess.run(['df', '-h', '/'], capture_output=True, text=True)
        lines = result.stdout.split('\n')
        if len(lines) > 1:
            usage_line = lines[1].split()
            used = usage_line[2]
            available = usage_line[3]
            print(f"💾 Disk: {used} used, {available} available")
    except Exception:
        print("💾 Disk usage check unavailable")
print("\n🎉 Tutorial Complete!")
cleanup_cache()
check_disk_usage()
print("\n💡 Space-Saving Tips:")
print("• Model uses ~2GB vs original ~7GB+")
print("• Automatic cache cleanup after each use")
print("• Limited token generation to save memory")
print("• Use 'del assistant' when done to free ~2GB")
print("• Restart runtime if memory issues persist")
Finally, we provide a disk-space monitor alongside the cleanup routine. The df command, invoked through Python's subprocess module, reports how much space is used and how much remains available. After re-invoking cleanup_cache(), the setup stays lightweight, and the script closes with practical tips for saving space and leaving minimal residue behind.
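If the df command is unavailable (for example, outside a POSIX environment), a portable alternative is the standard library's shutil.disk_usage. This sketch mirrors the output format of check_disk_usage without spawning a subprocess:

```python
import shutil

def check_disk_usage_stdlib():
    """Portable variant of check_disk_usage: reads disk stats via
    shutil.disk_usage instead of shelling out to `df`."""
    gib = 1024 ** 3
    total, used, free = shutil.disk_usage('/')
    print(f"💾 Disk: {used / gib:.1f} GiB used, {free / gib:.1f} GiB available")

check_disk_usage_stdlib()
```

Because shutil.disk_usage returns raw byte counts, this version also avoids parsing df's human-readable column layout.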
Conclusion: We can now run Mistral's Devstral model in space-constrained environments like Google Colab without sacrificing usability or performance. The model loads quickly, generates text efficiently, and frees memory after each use, and the included interactive coding mode lets users test ideas in seconds.
Asif Razzaq, CEO of Marktechpost Media Inc., is a visionary engineer and entrepreneur dedicated to harnessing the potential of Artificial Intelligence for social good. His latest venture, Marktechpost, is a media platform known for in-depth coverage of machine learning, deep learning, and related topics, presented in a way that is technically sound yet accessible to audiences of all backgrounds. The platform's popularity is reflected in its more than 2 million monthly views.


