In this tutorial, we explore how to build neural networks with tinygrad while staying fully hands-on with tensors and autograd. We build every component ourselves, starting with basic tensor operations, moving on to attention mechanisms and transformer blocks, and finishing with a mini-GPT. At each stage, we observe how simple tinygrad keeps things. Check out the FULL CODES here.
import subprocess, sys, os
print("Installing dependencies...")
subprocess.check_call(["apt-get", "install", "-qq", "clang"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "git+https://github.com/tinygrad/tinygrad.git"])
import numpy as np
from tinygrad import Tensor, Device
from tinygrad.nn import optim
import time
print(f"🚀 Using device: {Device.DEFAULT}")
print("=" * 60)
print("n📚 PART 1: Tensor Operations & Autograd")
print("-" * 60)
x = Tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
y = Tensor([[2.0, 0.0], [1.0, 2.0]], requires_grad=True)
z = (x @ y).sum() + (x ** 2).mean()
z.backward()
print(f"x:n{x.numpy()}")
print(f"y:n{y.numpy()}")
print(f"z (scalar): {z.numpy()}")
print(f"∂z/∂x:n{x.grad.numpy()}")
print(f"∂z/∂y:n{y.grad.numpy()}")
With tinygrad installed in Colab, we immediately start experimenting with tensors and automatic differentiation. Using a simple computation graph, we observe how matrix operations affect the gradients, and the printed output makes tinygrad's autograd feel much more intuitive. Check out the FULL CODES here.
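As an optional cross-check (my addition, not part of the original tutorial), we can recompute the gradients of z by hand with NumPy and compare them to tinygrad's autograd output, reusing the x and y tensors defined above.

# Hand-derived gradients of z = sum(x @ y) + mean(x**2), compared against autograd
xn, yn = x.numpy(), y.numpy()
ones = np.ones_like(xn)
expected_dx = ones @ yn.T + 2 * xn / xn.size   # d/dx [sum(x@y)] + d/dx [mean(x**2)]
expected_dy = xn.T @ ones                      # d/dy [sum(x@y)]
assert np.allclose(x.grad.numpy(), expected_dx, atol=1e-5)
assert np.allclose(y.grad.numpy(), expected_dy, atol=1e-5)
print("Autograd matches the hand-derived gradients ✔")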
print("nn🧠 PART 2: Building Custom Layers")
print("-" * 60)
class MultiHeadAttention:
    def __init__(self, dim, num_heads):
        self.num_heads = num_heads
        self.dim = dim
        self.head_dim = dim // num_heads
        self.qkv = Tensor.glorot_uniform(dim, 3 * dim)
        self.out = Tensor.glorot_uniform(dim, dim)

    def __call__(self, x):
        B, T, C = x.shape[0], x.shape[1], x.shape[2]
        qkv = x.reshape(B * T, C).dot(self.qkv).reshape(B, T, 3, self.num_heads, self.head_dim)
        q, k, v = qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2]
        # move the head dimension forward: (B, num_heads, T, head_dim)
        q, k, v = [t.transpose(1, 2) for t in (q, k, v)]
        scale = self.head_dim ** -0.5
        attn = (q @ k.transpose(-2, -1)) * scale
        attn = attn.softmax(axis=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)
        return out.reshape(B * T, C).dot(self.out).reshape(B, T, C)
class TransformerBlock:
    def __init__(self, dim, num_heads):
        self.attn = MultiHeadAttention(dim, num_heads)
        self.ff1 = Tensor.glorot_uniform(dim, 4 * dim)
        self.ff2 = Tensor.glorot_uniform(4 * dim, dim)
        self.ln1_w = Tensor.ones(dim)
        self.ln2_w = Tensor.ones(dim)

    def __call__(self, x):
        x = x + self.attn(self._layernorm(x, self.ln1_w))
        ff = x.reshape(-1, x.shape[-1])
        ff = ff.dot(self.ff1).gelu().dot(self.ff2)
        x = x + ff.reshape(x.shape)
        return self._layernorm(x, self.ln2_w)

    def _layernorm(self, x, w):
        mean = x.mean(axis=-1, keepdim=True)
        var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)
        return w * (x - mean) / (var + 1e-5).sqrt()
We design our own multi-head attention module and transformer block from scratch, manually implementing the projections, attention scores, softmax, feedforward layers, and layer normalization. Running the code makes it easy to see how each component contributes to the layer's behavior. Check out the FULL CODES here.
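Here is a quick shape sanity check I added (assuming the classes above are already defined in the session): push a random batch through MultiHeadAttention and TransformerBlock and confirm that both preserve the (batch, seq_len, dim) shape.

# Shape sanity check for the custom layers
mha = MultiHeadAttention(dim=64, num_heads=4)
blk = TransformerBlock(dim=64, num_heads=4)
x_test = Tensor.randn(2, 8, 64)                  # (batch=2, seq_len=8, dim=64)
print("attention output:", mha(x_test).shape)    # expected (2, 8, 64)
print("block output:    ", blk(x_test).shape)    # expected (2, 8, 64)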
print("n🤖 PART 3: Mini-GPT Architecture")
print("-" * 60)
class MiniGPT:
    def __init__(self, vocab_size=256, dim=128, num_heads=4, num_layers=2, max_len=32):
        self.vocab_size = vocab_size
        self.dim = dim
        self.tok_emb = Tensor.glorot_uniform(vocab_size, dim)
        self.pos_emb = Tensor.glorot_uniform(max_len, dim)
        self.blocks = [TransformerBlock(dim, num_heads) for _ in range(num_layers)]
        self.ln_f = Tensor.ones(dim)
        self.head = Tensor.glorot_uniform(dim, vocab_size)

    def __call__(self, idx):
        B, T = idx.shape[0], idx.shape[1]
        tok_emb = self.tok_emb[idx.flatten()].reshape(B, T, self.dim)
        pos_emb = self.pos_emb[:T].reshape(1, T, self.dim)
        x = tok_emb + pos_emb
        for block in self.blocks:
            x = block(x)
        mean = x.mean(axis=-1, keepdim=True)
        var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)
        x = self.ln_f * (x - mean) / (var + 1e-5).sqrt()
        return x.reshape(B * T, self.dim).dot(self.head).reshape(B, T, self.vocab_size)

    def get_params(self):
        params = [self.tok_emb, self.pos_emb, self.ln_f, self.head]
        for block in self.blocks:
            params.extend([block.attn.qkv, block.attn.out, block.ff1, block.ff2, block.ln1_w, block.ln2_w])
        return params
model = MiniGPT(vocab_size=256, dim=64, num_heads=4, num_layers=2, max_len=16)
params = model.get_params()
total_params = sum(p.numel() for p in params)
print(f"Model initialized with {total_params:,} parameters")
We assemble the MiniGPT architecture from the components we have already built: we embed the tokens, add positional information, stack several transformer blocks, and project the final output back to vocabulary logits. Once the model is initialized, we see how surprisingly compact a transformer with so few moving parts can be. Check out the FULL CODES here.
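As an optional forward-pass check (my addition), we can run a dummy batch of token ids through the freshly initialized model and confirm that the logits come out as (batch, seq_len, vocab_size).

# Dummy forward pass through the untrained MiniGPT
dummy_idx = Tensor(np.random.randint(0, 256, (4, 16)), dtype="int32")   # (B=4, T=16)
dummy_logits = model(dummy_idx)
print("logits shape:", dummy_logits.shape)   # expected (4, 16, 256)
print("one softmax row sums to:", dummy_logits.softmax(axis=-1).sum(axis=-1).numpy()[0, 0])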
print("nn🏋️ PART 4: Training Loop")
print("-" * 60)
def gen_data(batch_size, seq_len):
    x = np.random.randint(0, 256, (batch_size, seq_len))
    y = np.roll(x, 1, axis=1)
    y[:, 0] = x[:, 0]
    return Tensor(x, dtype="int32"), Tensor(y, dtype="int32")
optimizer = optim.Adam(params, lr=0.001)
losses = []
print("Training to predict previous token in sequence...")
with Tensor.train():
    for step in range(20):
        start = time.time()
        x_batch, y_batch = gen_data(batch_size=16, seq_len=16)
        logits = model(x_batch)
        B, T, V = logits.shape[0], logits.shape[1], logits.shape[2]
        loss = logits.reshape(B * T, V).sparse_categorical_crossentropy(y_batch.reshape(B * T))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.numpy())
        elapsed = time.time() - start
        if step % 5 == 0:
            print(f"Step {step:3d} | Loss: {loss.numpy():.4f} | Time: {elapsed*1000:.1f}ms")
print("nn⚡ PART 5: Lazy Evaluation & Kernel Fusion")
print("-" * 60)
N = 512
a = Tensor.randn(N, N)
b = Tensor.randn(N, N)
print("Creating computation: (A @ B.T + A).sum()")
lazy_result = (a @ b.T + a).sum()
print("→ No computation done yet (lazy evaluation)")
print("nCalling .realize() to execute...")
"Start" = time.()
realized = lazy_result.realize()
Elapsed = Time() Startseite
print(f"✓ Computed in {elapsed*1000:.2f}ms")
print(f"Result: {realized.numpy():.4f}")
print("nNote: Operations were fused into optimized kernels!")
We observe the MiniGPT loss decreasing over the training steps. We also explore tinygrad's lazy execution model by building a computation that is only executed, as fused kernels, when we call .realize(). By monitoring the timings, we get a feel for the performance gains that kernel fusion provides. Check out the FULL CODES here.
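As a rough, unscientific illustration (my addition, not part of the original tutorial), we can compare realizing each intermediate result separately against letting tinygrad fuse the whole expression; the exact numbers depend heavily on the backend, and running the script with the DEBUG=2 environment variable also prints the generated kernels if you want to inspect the fusion directly.

# Rough comparison: step-by-step realization vs. one fused lazy graph
N = 512
a2, b2 = Tensor.randn(N, N), Tensor.randn(N, N)

start = time.time()
step1 = (a2 @ b2.T).realize()              # force the matmul kernel on its own
step2 = (step1 + a2).realize()             # then a separate elementwise kernel
unfused = step2.sum().realize()
t_unfused = time.time() - start

start = time.time()
fused = (a2 @ b2.T + a2).sum().realize()   # one lazy graph, fused where possible
t_fused = time.time() - start

print(f"step-by-step: {t_unfused*1000:.2f}ms | fused: {t_fused*1000:.2f}ms")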
print("nn🔧 PART 6: Custom Operations")
print("-" * 60)
def custom_activation(x):
return x * x.sigmoid()
x = Tensor([[-2.0, -1.0, 0.0, 1.0, 2.0]], requires_grad=True)
y = custom_activation(x)
loss = y.sum()
loss.backward()
print(f"Input: {x.numpy()}")
print(f"Swish(x): {y.numpy()}")
print(f"Gradient: {x.grad.numpy()}")
print("nn" + "=" * 60)
print("✅ Tutorial Complete!")
print("=" * 60)
print("""
Key Concepts Covered:
1. Tensor operations and automatic differentiation
2. Custom neural network layers (Attention, Transformer)
3. Building a Mini-GPT language model from scratch
4. Training loop with the Adam optimizer
5. Lazy evaluation and kernel fusion
6. Custom activation functions
""")
We then implement a custom activation function and verify that gradients propagate through it correctly, before printing a summary of the key concepts covered in this tutorial. We finish by reflecting on how this strengthens our ability to understand, modify, and extend deep learning code with tinygrad.
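To convince ourselves that the gradient really is correct, here is a small verification I added: compare autograd's result with the analytic derivative of Swish, d/dx [x * sigmoid(x)] = sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x)), reusing the x tensor from Part 6.

# Analytic Swish gradient vs. tinygrad autograd
x_np = x.numpy()
sig = 1.0 / (1.0 + np.exp(-x_np))
analytic_grad = sig + x_np * sig * (1.0 - sig)   # derivative of x * sigmoid(x)
print("autograd :", x.grad.numpy())
print("analytic :", analytic_grad)
assert np.allclose(x.grad.numpy(), analytic_grad, atol=1e-5)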
Tinygrad lets us play with every internal detail. In a transparent, minimal framework, we built a transformer and trained it on synthetic data, and we experimented with lazy evaluation, kernel fusion, and custom operations. This workflow prepares us for deeper experiments, such as extending the model or integrating real datasets.
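As one possible next experiment, here is a sketch of greedy sampling from the trained MiniGPT (my own addition, assuming the model and its max_len=16 positional table from the tutorial are still in scope): feed a seed sequence, take the argmax of the logits at the last position, append it, and repeat.

# Greedy token-by-token generation sketch
def greedy_generate(model, seed_tokens, steps=8, max_len=16):
    tokens = list(seed_tokens)
    for _ in range(steps):
        ctx = tokens[-max_len:]                        # stay within the positional embedding table
        idx = Tensor(np.array([ctx]), dtype="int32")   # shape (1, T)
        logits = model(idx)                            # (1, T, vocab_size)
        next_tok = int(logits.numpy()[0, -1].argmax()) # greedy pick at the last position
        tokens.append(next_tok)
    return tokens

print("greedy sample:", greedy_generate(model, seed_tokens=[65, 66, 67]))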
Asif Razzaq is the CEO of Marktechpost Media Inc. A visionary engineer and entrepreneur, he is dedicated to harnessing the potential of Artificial Intelligence for social good. His latest venture, Marktechpost, is an Artificial Intelligence media platform known for its in-depth coverage of machine learning and deep learning news that is technically sound yet easily understandable to a broad audience. The platform's popularity is reflected in its more than 2 million monthly views.

