In this tutorial, we explore how to build neural networks with tinygrad while staying fully hands-on with tensors and autograd. We build every component ourselves, starting with basic tensor operations, moving on to attention mechanisms and transformer blocks, and finishing with a mini-GPT. At each stage, we observe how simple tinygrad keeps things. Check out the FULL CODES here.
import subprocess, sys, os
print("Installing dependencies...")
subprocess.check_call(["apt-get", "install", "-qq", "clang"], stdout=subprocess.DEVNULL, stderr=subprocess.DEVNULL)
subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", "git+https://github.com/tinygrad/tinygrad.git"])
import numpy as np
from tinygrad import Tensor, Device
from tinygrad.nn import optim
import time
print(f"🚀 Using device: {Device.DEFAULT}")
print("=" * 60)
print("n📚 PART 1: Tensor Operations & Autograd")
print("-" * 60)
x = Tensor([[1.0, 2.0], [3.0, 4.0]], requires_grad=True)
y = Tensor([[2.0, 0.0], [1.0, 2.0]], requires_grad=True)
z = (x @ y).sum() + (x ** 2).mean()
z.backward()
print(f"x:n{x.numpy()}")
print(f"y:n{y.numpy()}")
print(f"z (scalar): {z.numpy()}")
print(f"∂z/∂x:n{x.grad.numpy()}")
print(f"∂z/∂y:n{y.grad.numpy()}")
With tinygrad installed in Colab, we immediately start experimenting with tensors and automatic differentiation. Using a simple computation graph, we observe how matrix operations affect the gradients, and the printed output makes tinygrad's autograd feel much more intuitive. Check out the FULL CODES here.
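As an optional cross-check (my addition, not part of the original tutorial), we can recompute the gradients of z by hand with NumPy and compare them to tinygrad's autograd output, reusing the x and y tensors defined above.

# Hand-derived gradients of z = sum(x @ y) + mean(x**2), compared against autograd
xn, yn = x.numpy(), y.numpy()
ones = np.ones_like(xn)
expected_dx = ones @ yn.T + 2 * xn / xn.size   # d/dx [sum(x@y)] + d/dx [mean(x**2)]
expected_dy = xn.T @ ones                      # d/dy [sum(x@y)]
assert np.allclose(x.grad.numpy(), expected_dx, atol=1e-5)
assert np.allclose(y.grad.numpy(), expected_dy, atol=1e-5)
print("Autograd matches the hand-derived gradients ✔")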
print("nn🧠 PART 2: Building Custom Layers")
print("-" * 60)
class MultiHeadAttention:
    def __init__(self, dim, num_heads):
        self.num_heads = num_heads
        self.dim = dim
        self.head_dim = dim // num_heads
        self.qkv = Tensor.glorot_uniform(dim, 3 * dim)
        self.out = Tensor.glorot_uniform(dim, dim)

    def __call__(self, x):
        B, T, C = x.shape[0], x.shape[1], x.shape[2]
        qkv = x.reshape(B * T, C).dot(self.qkv).reshape(B, T, 3, self.num_heads, self.head_dim)
        q, k, v = qkv[:, :, 0], qkv[:, :, 1], qkv[:, :, 2]
        # move the head dimension forward: (B, num_heads, T, head_dim)
        q, k, v = [t.transpose(1, 2) for t in (q, k, v)]
        scale = self.head_dim ** -0.5
        attn = (q @ k.transpose(-2, -1)) * scale
        attn = attn.softmax(axis=-1)
        out = (attn @ v).transpose(1, 2).reshape(B, T, C)
        return out.reshape(B * T, C).dot(self.out).reshape(B, T, C)
class TransformerBlock:
    def __init__(self, dim, num_heads):
        self.attn = MultiHeadAttention(dim, num_heads)
        self.ff1 = Tensor.glorot_uniform(dim, 4 * dim)
        self.ff2 = Tensor.glorot_uniform(4 * dim, dim)
        self.ln1_w = Tensor.ones(dim)
        self.ln2_w = Tensor.ones(dim)

    def __call__(self, x):
        x = x + self.attn(self._layernorm(x, self.ln1_w))
        ff = x.reshape(-1, x.shape[-1])
        ff = ff.dot(self.ff1).gelu().dot(self.ff2)
        x = x + ff.reshape(x.shape)
        return self._layernorm(x, self.ln2_w)

    def _layernorm(self, x, w):
        mean = x.mean(axis=-1, keepdim=True)
        var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)
        return w * (x - mean) / (var + 1e-5).sqrt()
We design our own multi-head attention module and transformer block from scratch, manually implementing the projections, attention scores, softmax, feedforward layers, and layer normalization. Running the code makes it easy to see how each component contributes to the layer's behavior. Check out the FULL CODES here.
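Here is a quick shape sanity check I added (assuming the classes above are already defined in the session): push a random batch through MultiHeadAttention and TransformerBlock and confirm that both preserve the (batch, seq_len, dim) shape.

# Shape sanity check for the custom layers
mha = MultiHeadAttention(dim=64, num_heads=4)
blk = TransformerBlock(dim=64, num_heads=4)
x_test = Tensor.randn(2, 8, 64)                  # (batch=2, seq_len=8, dim=64)
print("attention output:", mha(x_test).shape)    # expected (2, 8, 64)
print("block output:    ", blk(x_test).shape)    # expected (2, 8, 64)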
print("n🤖 PART 3: Mini-GPT Architecture")
print("-" * 60)
class MiniGPT:
    def __init__(self, vocab_size=256, dim=128, num_heads=4, num_layers=2, max_len=32):
        self.vocab_size = vocab_size
        self.dim = dim
        self.tok_emb = Tensor.glorot_uniform(vocab_size, dim)
        self.pos_emb = Tensor.glorot_uniform(max_len, dim)
        self.blocks = [TransformerBlock(dim, num_heads) for _ in range(num_layers)]
        self.ln_f = Tensor.ones(dim)
        self.head = Tensor.glorot_uniform(dim, vocab_size)

    def __call__(self, idx):
        B, T = idx.shape[0], idx.shape[1]
        tok_emb = self.tok_emb[idx.flatten()].reshape(B, T, self.dim)
        pos_emb = self.pos_emb[:T].reshape(1, T, self.dim)
        x = tok_emb + pos_emb
        for block in self.blocks:
            x = block(x)
        mean = x.mean(axis=-1, keepdim=True)
        var = ((x - mean) ** 2).mean(axis=-1, keepdim=True)
        x = self.ln_f * (x - mean) / (var + 1e-5).sqrt()
        return x.reshape(B * T, self.dim).dot(self.head).reshape(B, T, self.vocab_size)

    def get_params(self):
        params = [self.tok_emb, self.pos_emb, self.ln_f, self.head]
        for block in self.blocks:
            params.extend([block.attn.qkv, block.attn.out, block.ff1, block.ff2, block.ln1_w, block.ln2_w])
        return params
model = MiniGPT(vocab_size=256, dim=64, num_heads=4, num_layers=2, max_len=16)
params = model.get_params()
total_params = sum(p.numel() for p in params)
print(f"Model initialized with {total_params:,} parameters")
We assemble the MiniGPT architecture from the components we have already built: we embed the tokens, add positional information, stack several transformer blocks, and project the final output back to vocabulary logits. Once the model is initialized, we see how surprisingly compact a transformer with so few moving parts can be. Check out the FULL CODES here.
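As an optional forward-pass check (my addition), we can run a dummy batch of token ids through the freshly initialized model and confirm that the logits come out as (batch, seq_len, vocab_size).

# Dummy forward pass through the untrained MiniGPT
dummy_idx = Tensor(np.random.randint(0, 256, (4, 16)), dtype="int32")   # (B=4, T=16)
dummy_logits = model(dummy_idx)
print("logits shape:", dummy_logits.shape)   # expected (4, 16, 256)
print("one softmax row sums to:", dummy_logits.softmax(axis=-1).sum(axis=-1).numpy()[0, 0])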
print("nn🏋️ PART 4: Training Loop")
print("-" * 60)
def gen_data(batch_size, seq_len):
    x = np.random.randint(0, 256, (batch_size, seq_len))
    y = np.roll(x, 1, axis=1)
    y[:, 0] = x[:, 0]
    return Tensor(x, dtype="int32"), Tensor(y, dtype="int32")
optimizer = optim.Adam(params, lr=0.001)
losses = []
print("Training to predict previous token in sequence...")
with Tensor.train():
    for step in range(20):
        start = time.time()
        x_batch, y_batch = gen_data(batch_size=16, seq_len=16)
        logits = model(x_batch)
        B, T, V = logits.shape[0], logits.shape[1], logits.shape[2]
        loss = logits.reshape(B * T, V).sparse_categorical_crossentropy(y_batch.reshape(B * T))
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        losses.append(loss.numpy())
        elapsed = time.time() - start
        if step % 5 == 0:
            print(f"Step {step:3d} | Loss: {loss.numpy():.4f} | Time: {elapsed*1000:.1f}ms")
print("nn⚡ PART 5: Lazy Evaluation & Kernel Fusion")
print("-" * 60)
N = 512
a = Tensor.randn(N, N)
b = Tensor.randn(N, N)
print("Creating computation: (A @ B.T + A).sum()")
lazy_result = (a @ b.T + a).sum()
print("→ No computation done yet (lazy evaluation)")
print("nCalling .realize() to execute...")
"Start" = time.()
realized = lazy_result.realize()
Elapsed = Time() Startseite
print(f"✓ Computed in {elapsed*1000:.2f}ms")
print(f"Result: {realized.numpy():.4f}")
print("nNote: Operations were fused into optimized kernels!")
We observe the MiniGPT loss decreasing over the training steps. We also explore tinygrad's lazy execution model by building a computation that is only executed, as fused kernels, when we call .realize(). By monitoring the timings, we get a feel for the performance gains that kernel fusion provides. Check out the FULL CODES here.
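As a rough, unscientific illustration (my addition, not part of the original tutorial), we can compare realizing each intermediate result separately against letting tinygrad fuse the whole expression; the exact numbers depend heavily on the backend, and running the script with the DEBUG=2 environment variable also prints the generated kernels if you want to inspect the fusion directly.

# Rough comparison: step-by-step realization vs. one fused lazy graph
N = 512
a2, b2 = Tensor.randn(N, N), Tensor.randn(N, N)

start = time.time()
step1 = (a2 @ b2.T).realize()              # force the matmul kernel on its own
step2 = (step1 + a2).realize()             # then a separate elementwise kernel
unfused = step2.sum().realize()
t_unfused = time.time() - start

start = time.time()
fused = (a2 @ b2.T + a2).sum().realize()   # one lazy graph, fused where possible
t_fused = time.time() - start

print(f"step-by-step: {t_unfused*1000:.2f}ms | fused: {t_fused*1000:.2f}ms")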
print("nn🔧 PART 6: Custom Operations")
print("-" * 60)
def custom_activation(x):
return x * x.sigmoid()
x = Tensor([[-2.0, -1.0, 0.0, 1.0, 2.0]], requires_grad=True)
y = custom_activation(x)
loss = y.sum()
loss.backward()
print(f"Input: {x.numpy()}")
print(f"Swish(x): {y.numpy()}")
print(f"Gradient: {x.grad.numpy()}")
print("nn" + "=" * 60)
print("✅ Tutorial Complete!")
print("=" * 60)
print("""
Key Concepts Covered:
1. Tensor operations and automatic differentiation
2. Custom neural network layers (Attention, Transformer)
3. Building a Mini-GPT language model from scratch
4. Training loop with the Adam optimizer
5. Lazy evaluation and kernel fusion
6. Custom activation functions
""")
We then implement a custom activation function and verify that gradients propagate through it correctly, before printing a summary of the key concepts covered in this tutorial. We finish by reflecting on how this strengthens our ability to understand, modify, and extend deep learning code with tinygrad.
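To convince ourselves that the gradient really is correct, here is a small verification I added: compare autograd's result with the analytic derivative of Swish, d/dx [x * sigmoid(x)] = sigmoid(x) + x * sigmoid(x) * (1 - sigmoid(x)), reusing the x tensor from Part 6.

# Analytic Swish gradient vs. tinygrad autograd
x_np = x.numpy()
sig = 1.0 / (1.0 + np.exp(-x_np))
analytic_grad = sig + x_np * sig * (1.0 - sig)   # derivative of x * sigmoid(x)
print("autograd :", x.grad.numpy())
print("analytic :", analytic_grad)
assert np.allclose(x.grad.numpy(), analytic_grad, atol=1e-5)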
Tinygrad lets us play with every internal detail. In a transparent, minimal framework, we built a transformer and trained it on synthetic data, and we experimented with lazy evaluation, kernel fusion, and custom operations. This workflow prepares us for deeper experiments, such as extending the model or integrating real datasets.
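As one possible next experiment, here is a sketch of greedy sampling from the trained MiniGPT (my own addition, assuming the model and its max_len=16 positional table from the tutorial are still in scope): feed a seed sequence, take the argmax of the logits at the last position, append it, and repeat.

# Greedy token-by-token generation sketch
def greedy_generate(model, seed_tokens, steps=8, max_len=16):
    tokens = list(seed_tokens)
    for _ in range(steps):
        ctx = tokens[-max_len:]                        # stay within the positional embedding table
        idx = Tensor(np.array([ctx]), dtype="int32")   # shape (1, T)
        logits = model(idx)                            # (1, T, vocab_size)
        next_tok = int(logits.numpy()[0, -1].argmax()) # greedy pick at the last position
        tokens.append(next_tok)
    return tokens

print("greedy sample:", greedy_generate(model, seed_tokens=[65, 66, 67]))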
Asif Razzaq is the CEO of Marktechpost Media Inc. A visionary engineer and entrepreneur, he is dedicated to harnessing the potential of Artificial Intelligence for social good. His latest venture, Marktechpost, is an Artificial Intelligence media platform known for its in-depth coverage of machine learning and deep learning news that is technically sound yet easily understandable to a broad audience. The platform's popularity is reflected in its more than 2 million monthly views.

