
AutoResearch Framework: Hyperparameter Discovery, Experiment Tracking, and Building an Autonomous Machine Learning Research Loop on Google Colab

Tech · By Gavin Wallace · 13/03/2026 · 5 Mins Read

In this tutorial we implement a Colab-ready version of the AutoResearch framework originally proposed by Andrej Karpathy. We create an experimentation pipeline that clones AutoResearch, sets up a training environment, and runs a baseline experiment to establish initial performance metrics. We then build an automated loop that programmatically edits hyperparameters in train.py, runs training iterations, and evaluates each model on the validation bits-per-byte (val_bpb) metric. By running the workflow on Google Colab, we replicate the core idea behind autonomous machine-learning research: iteratively modify training configurations, evaluate performance, and preserve the best configurations, all without specialized hardware.
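Before diving in, it helps to understand the metric the loop optimizes. Bits-per-byte normalizes cross-entropy loss by the number of raw bytes rather than tokens, which makes models with different tokenizers directly comparable. A minimal sketch of the conversion, assuming the loss is reported in nats per token (the function name and figures here are illustrative, not taken from the repository):

```python
import math

def bits_per_byte(loss_nats_per_token: float, tokens: int, num_bytes: int) -> float:
    """Convert a mean cross-entropy loss (nats/token) to bits per byte."""
    total_nats = loss_nats_per_token * tokens
    total_bits = total_nats / math.log(2)  # nats -> bits
    return total_bits / num_bytes

# e.g. a loss of 1.0 nat/token on text averaging 4 bytes per token
print(round(bits_per_byte(1.0, tokens=1000, num_bytes=4000), 4))
```

Lower val_bpb means the model compresses the validation text better, which is why the loop below treats smaller values as improvements.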

import os, sys, subprocess, json, re, random, shutil, time
from pathlib import Path


def pip_install(pkg):
    subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pkg])


for pkg in [
    "numpy", "pandas", "pyarrow", "requests",
    "rustbpe", "tiktoken", "openai"
]:
    try:
        __import__(pkg)
    except ImportError:
        pip_install(pkg)


import pandas as pd


if not Path("autoresearch").exists():
    subprocess.run(["git", "clone", "https://github.com/karpathy/autoresearch.git"])


os.chdir("autoresearch")


OPENAI_API_KEY = None
try:
    from google.colab import userdata
    OPENAI_API_KEY = userdata.get("OPENAI_API_KEY")
except Exception:
    OPENAI_API_KEY = os.environ.get("OPENAI_API_KEY")


if OPENAI_API_KEY:
    os.environ["OPENAI_API_KEY"] = OPENAI_API_KEY

We begin by importing the core Python libraries required for the automated research workflow, installing any missing dependencies, and cloning the autoresearch repository from GitHub so that the training framework is available in the environment. If an OpenAI API key is available, we configure it so the system can support LLM-assisted experimentation at a later stage.

prepare_path=Path("prepare.py")
train_path=Path("train.py")
program_path=Path("program.md")


prepare_text=prepare_path.read_text()
train_text=train_path.read_text()


prepare_text = re.sub(r"MAX_SEQ_LEN = \d+", "MAX_SEQ_LEN = 512", prepare_text)
prepare_text = re.sub(r"TIME_BUDGET = \d+", "TIME_BUDGET = 120", prepare_text)
prepare_text = re.sub(r"EVAL_TOKENS = .*", "EVAL_TOKENS = 4 * 65536", prepare_text)


train_text = re.sub(r"DEPTH = \d+", "DEPTH = 4", train_text)
train_text = re.sub(r"DEVICE_BATCH_SIZE = \d+", "DEVICE_BATCH_SIZE = 16", train_text)
train_text = re.sub(r"TOTAL_BATCH_SIZE = .*", "TOTAL_BATCH_SIZE = 2**17", train_text)
train_text = re.sub(r'WINDOW_PATTERN = "SSSL"', 'WINDOW_PATTERN = "L"', train_text)


prepare_path.write_text(prepare_text)
train_path.write_text(train_text)


program_path.write_text("""
Goal:
Run autonomous research loop on Google Colab.


Rules:
Only hyperparameters in train.py may be modified.


Metric:
Lower val_bpb is better.
""")


subprocess.run(["python","prepare.py","--num-shards","4","--download-workers","2"])

We patch the repository with key parameters that make it compatible with Google Colab hardware: the context length, training budget, and evaluation token count are all reduced so the experiments fit on limited GPU resources. After applying these patches, we prepare the dataset shards so that training experiments can start immediately.

subprocess.run("python train.py > baseline.log 2>&1",shell=True)


def parse_run_log(log_path):
    text = Path(log_path).read_text(errors="ignore")
    def find(p):
        m = re.search(p, text, re.MULTILINE)
        return float(m.group(1)) if m else None
    return {
        "val_bpb": find(r"^val_bpb:\s*([0-9.]+)"),
        "training_seconds": find(r"^training_seconds:\s*([0-9.]+)"),
        "peak_vram_mb": find(r"^peak_vram_mb:\s*([0-9.]+)"),
        "num_steps": find(r"^num_steps:\s*([0-9.]+)")
    }


baseline = parse_run_log("baseline.log")


results_path = Path("results.tsv")


rows = [{
    "commit": "baseline",
    "val_bpb": baseline["val_bpb"] or 0,
    "memory_gb": round((baseline["peak_vram_mb"] or 0) / 1024, 1),
    "status": "keep",
    "description": "baseline"
}]


pd.DataFrame(rows).to_csv(results_path, sep="\t", index=False)


print("Baseline:",baseline)

We execute the baseline run to establish a reference performance for the model, and implement a log-parsing helper that retrieves key metrics such as validation bits-per-byte, training time, and peak GPU memory. These baseline results are stored in an experiment table so that all subsequent experiments can be compared against this initial configuration.
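The log parser can be exercised without running training at all, by feeding it a synthetic log string. This sketch assumes the training log emits `key: value` lines, as the regexes above expect (the sample log contents are made up for illustration):

```python
import re

log = """step 100 done
val_bpb: 1.2345
training_seconds: 118.7
peak_vram_mb: 9120
"""

def find_metric(text: str, name: str):
    # MULTILINE makes ^ anchor at each line start, not just the string start
    m = re.search(rf"^{name}:\s*([0-9.]+)", text, re.MULTILINE)
    return float(m.group(1)) if m else None

print(find_metric(log, "val_bpb"))    # 1.2345
print(find_metric(log, "num_steps"))  # None (missing metrics stay None)
```

Returning `None` for missing metrics (rather than raising) lets a crashed run flow through the loop as a failed experiment instead of aborting the whole search.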

TRAIN_FILE=Path("train.py")
BACKUP_FILE=Path("train.base.py")


if not BACKUP_FILE.exists():
   shutil.copy2(TRAIN_FILE,BACKUP_FILE)


HP_KEYS=[
"WINDOW_PATTERN",
"TOTAL_BATCH_SIZE",
"EMBEDDING_LR",
"UNEMBEDDING_LR",
"MATRIX_LR",
"SCALAR_LR",
"WEIGHT_DECAY",
"ADAM_BETAS",
"WARMUP_RATIO",
"WARMDOWN_RATIO",
"FINAL_LR_FRAC",
"DEPTH",
"DEVICE_BATCH_SIZE"
]


def read_text(path):
    return Path(path).read_text()


def write_text(path, text):
    Path(path).write_text(text)


def extract_hparams(text):
    vals = {}
    for k in HP_KEYS:
        m = re.search(rf"^{k}\s*=\s*(.+?)$", text, re.MULTILINE)
        if m:
            vals[k] = m.group(1).strip()
    return vals


def set_hparam(text, key, value):
    return re.sub(rf"^{key}\s*=.*$", f"{key} = {value}", text, flags=re.MULTILINE)


base_text=read_text(BACKUP_FILE)
base_hparams=extract_hparams(base_text)


SEARCH_SPACE={
"WINDOW_PATTERN":['"L"','"SSSL"'],
"TOTAL_BATCH_SIZE":["2**16","2**17","2**18"],
"EMBEDDING_LR":["0.2","0.4","0.6"],
"MATRIX_LR":["0.01","0.02","0.04"],
"SCALAR_LR":["0.3","0.5","0.7"],
"WEIGHT_DECAY":["0.05","0.1","0.2"],
"ADAM_BETAS":["(0.8,0.95)","(0.9,0.95)"],
"WARMUP_RATIO":["0.0","0.05","0.1"],
"WARMDOWN_RATIO":["0.3","0.5","0.7"],
"FINAL_LR_FRAC":["0.0","0.05"],
"DEPTH":["3","4","5","6"],
"DEVICE_BATCH_SIZE":["8","12","16","24"]
}


def sample_candidate():
    keys = random.sample(list(SEARCH_SPACE.keys()), random.choice([2, 3, 4]))
    cand = dict(base_hparams)
    changes = {}
    for k in keys:
        cand[k] = random.choice(SEARCH_SPACE[k])
        changes[k] = cand[k]
    return cand, changes


def apply_hparams(candidate):
    text = read_text(BACKUP_FILE)
    for k, v in candidate.items():
        text = set_hparam(text, k, v)
    write_text(TRAIN_FILE, text)


def run_experiment(tag):
    log = f"{tag}.log"
    subprocess.run(f"python train.py > {log} 2>&1", shell=True)
    metrics = parse_run_log(log)
    metrics["log"] = log
    return metrics

Next we build the core utilities that make automated hyperparameter experiments possible: we extract the current hyperparameters from train.py, define a searchable parameter space, and implement functions that edit the script programmatically. We also add mechanisms for generating candidate configurations, applying them to the training script, and running experiments with recorded outputs.
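The sampling and patching steps compose naturally. This standalone sketch, using a pared-down search space and an in-memory "train.py" string rather than the real files, shows a seeded candidate draw applied via the same regex-substitution technique:

```python
import random, re

SEARCH_SPACE = {"DEPTH": ["3", "4", "5"], "WEIGHT_DECAY": ["0.05", "0.1"]}
base = {"DEPTH": "4", "WEIGHT_DECAY": "0.1"}

random.seed(0)  # seed so the candidate draw is reproducible
keys = random.sample(list(SEARCH_SPACE), 2)
cand = dict(base)
for k in keys:
    cand[k] = random.choice(SEARCH_SPACE[k])

# Apply each sampled value to a train.py-style source string
src = "DEPTH = 4\nWEIGHT_DECAY = 0.1\n"
for k, v in cand.items():
    src = re.sub(rf"^{k}\s*=.*$", f"{k} = {v}", src, flags=re.MULTILINE)
print(src)
```

Mutating a copy of the base configuration (rather than the original dict) is what lets every experiment start from the pristine backed-up script instead of compounding edits.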

N_EXPERIMENTS=3


df = pd.read_csv(results_path, sep="\t")
best = df["val_bpb"].replace(0, 999).min()


for i in range(N_EXPERIMENTS):
    tag = f"exp_{i+1}"

    candidate, changes = sample_candidate()
    apply_hparams(candidate)
    metrics = run_experiment(tag)

    # Keep the configuration only if it improves on the best val_bpb so far
    improved = metrics["val_bpb"] is not None and metrics["val_bpb"] < best
    if improved:
        best = metrics["val_bpb"]
        shutil.copy2(TRAIN_FILE, f"best_{tag}.py")

    df.loc[len(df)] = {
        "commit": tag,
        "val_bpb": metrics["val_bpb"] or 0,
        "memory_gb": round((metrics["peak_vram_mb"] or 0) / 1024, 1),
        "status": "keep" if improved else "discard",
        "description": json.dumps(changes),
    }
    df.to_csv(results_path, sep="\t", index=False)

We then run an automated loop that proposes and evaluates new hyperparameter configurations. Each experiment modifies the training script, runs training, and compares the resulting validation score against the current best configuration. All experiment results are logged, improved configurations are preserved, and the best script is exported along with the full experiment history for analysis.
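The keep-or-discard logic of the loop can be seen in isolation with simulated scores (the three val_bpb values below are invented for illustration; lower is better):

```python
results = []
best = float("inf")

# Simulated val_bpb scores from three trials
for tag, score in [("exp_1", 1.31), ("exp_2", 1.27), ("exp_3", 1.29)]:
    improved = score < best
    if improved:
        best = score  # a new best configuration would be preserved here
    results.append({"commit": tag, "val_bpb": score,
                    "status": "keep" if improved else "discard"})

print(best)                            # 1.27
print([r["status"] for r in results])  # ['keep', 'keep', 'discard']
```

Note that exp_3 is discarded even though it beats exp_1: each candidate competes against the best score seen so far, not against the baseline, so the search only ever moves downhill on val_bpb.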

In conclusion, we created a fully automated research workflow demonstrating that machines can iteratively explore model configurations and improve training performance with minimal manual intervention. We prepared the data, established a baseline, and implemented a loop that proposes new hyperparameter settings, runs experiments, and tracks results over multiple trials. By maintaining experiment logs and automatically preserving improvements, we built an extensible, reproducible research process that closely resembles modern machine-learning experimentation workflows. The approach shows how automation, experiment tracking, and lightweight infrastructure can be combined to enable rapid, scalable research inside a cloud notebook environment.

