In this tutorial, we demonstrate how to harness TPOT to automate and optimize machine learning pipelines in a practical way. By working in Google Colab, we keep the setup accessible, lightweight, and reproducible. We walk through loading data, defining a custom scorer, and tailoring the search space with advanced models such as XGBoost. Finally, we set up a cross-validation strategy. Along the way, we explore the evolutionary algorithms TPOT uses to search for high-performing pipelines, which provide transparency through Pareto fronts and checkpoints.
!pip install tpot==0.12.2 scikit-learn==1.4.2 graphviz==0.20.3
import os, json, math, time, random, numpy as np, pandas as pd
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, StratifiedKFold
from sklearn.preprocessing import StandardScaler
from sklearn.metrics import make_scorer, f1_score, classification_report, confusion_matrix
from sklearn.pipeline import Pipeline
from tpot import TPOTClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier, ExtraTreesClassifier, GradientBoostingClassifier
from xgboost import XGBClassifier

SEED = 7
random.seed(SEED); np.random.seed(SEED); os.environ["PYTHONHASHSEED"] = str(SEED)
We begin by installing the required libraries and importing the modules essential for data handling, model construction, and pipeline optimization. We set a random seed to ensure the notebook's results are reproducible.
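As a quick sanity check (a minimal sketch, separate from the tutorial's own code), fixing `random_state` makes the train/test split deterministic across runs:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

SEED = 7

# With a fixed random_state, repeating the split yields identical results.
X, y = load_breast_cancer(return_X_y=True)
split_a = train_test_split(X, y, test_size=0.3, stratify=y, random_state=SEED)
split_b = train_test_split(X, y, test_size=0.3, stratify=y, random_state=SEED)
assert all(np.array_equal(p, q) for p, q in zip(split_a, split_b))
```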
X, y = load_breast_cancer(return_X_y=True, as_frame=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, stratify=y, random_state=SEED)
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)
def f1_cost_sensitive(y_true, y_pred):
    return f1_score(y_true, y_pred, average="binary", pos_label=1)
cost_f1 = make_scorer(f1_cost_sensitive, greater_is_better=True)
Here we split the data into training and test sets while maintaining class balance, and standardize the features for stability. We then define a custom F1-based scorer that lets us evaluate pipelines with an emphasis on capturing positive cases effectively.
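To see what the scorer rewards, here is a toy check with illustrative labels (not from the tutorial's dataset): one false negative among four positives leaves precision at 1.0 and recall at 0.75.

```python
from sklearn.metrics import f1_score, make_scorer

def f1_cost_sensitive(y_true, y_pred):
    # Binary F1 focused on the positive class (label 1).
    return f1_score(y_true, y_pred, average="binary", pos_label=1)

cost_f1 = make_scorer(f1_cost_sensitive, greater_is_better=True)

# Toy labels: one missed positive out of four.
y_true = [1, 1, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0]
print(f1_cost_sensitive(y_true, y_pred))  # precision=1.0, recall=0.75 -> F1 = 6/7 ≈ 0.857
```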
tpot_config = {
'sklearn.linear_model.LogisticRegression': {
'C': [0.01, 0.1, 1.0, 10.0],
'penalty': ['l2'], 'solver': ['lbfgs'], 'max_iter': [200]
},
'sklearn.naive_bayes.GaussianNB': {},
'sklearn.tree.DecisionTreeClassifier': {
'criterion': ['gini','entropy'], 'max_depth': [3,5,8,None],
'min_samples_split':[2,5,10], 'min_samples_leaf':[1,2,4]
},
'sklearn.ensemble.RandomForestClassifier': {
'n_estimators':[100,300], 'criterion':['gini','entropy'],
'max_depth':[None,8], 'min_samples_split':[2,5], 'min_samples_leaf':[1,2]
},
'sklearn.ensemble.ExtraTreesClassifier': {
'n_estimators':[200], 'criterion':['gini','entropy'],
'max_depth':[None,8], 'min_samples_split':[2,5], 'min_samples_leaf':[1,2]
},
'sklearn.ensemble.GradientBoostingClassifier': {
'n_estimators':[100,200], 'learning_rate':[0.03,0.1],
'max_depth':[2,3], 'subsample':[0.8,1.0]
},
'xgboost.XGBClassifier': {
'n_estimators':[200,400], 'max_depth':[3,5], 'learning_rate':[0.05,0.1],
'subsample':[0.8,1.0], 'colsample_bytree':[0.8,1.0],
'reg_lambda':[1.0,2.0], 'min_child_weight':[1,3],
'n_jobs':[0], 'tree_method':['hist'], 'eval_metric':['logloss'],
'gamma':[0,1]
}
}
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=SEED)
We use carefully selected hyperparameters in a custom TPOT configuration that combines linear models, tree-based learners, ensembles, and XGBoost. Stratified cross-validation ensures every candidate pipeline is evaluated on balanced dataset splits.
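The stratification guarantee is easy to verify directly; a minimal sketch (standalone, mirroring the tutorial's `cv` settings) checks that each validation fold preserves the overall class ratio:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import StratifiedKFold

X, y = load_breast_cancer(return_X_y=True)
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=7)

overall = y.mean()  # fraction of the positive class in the full dataset
for _, val_idx in cv.split(X, y):
    # Each validation fold should closely mirror the overall class ratio.
    assert abs(y[val_idx].mean() - overall) < 0.02
```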
t0 = time.time()
tpot = TPOTClassifier(
    generations=5,
    population_size=40,
    offspring_size=40,
    scoring=cost_f1,
    cv=cv,
    subsample=0.8,
    n_jobs=-1,
    config_dict=tpot_config,
    verbosity=2,
    random_state=SEED,
    max_time_mins=10,
    early_stop=3,
    periodic_checkpoint_folder="tpot_ckpt",
    warm_start=False
)
tpot.fit(X_tr_s, y_tr)
print(f"\n⏱️ First search took {time.time()-t0:.1f}s")
def pareto_table(tpot_obj, k=5):
    rows = []
    for ind, meta in tpot_obj.pareto_front_fitted_pipelines_.items():
        rows.append({
            "pipeline": ind, "cv_score": meta['internal_cv_score'],
            "size": len(str(meta['pipeline'])),
        })
    df = pd.DataFrame(rows).sort_values("cv_score", ascending=False).head(k)
    return df.reset_index(drop=True)
pareto_df = pareto_table(tpot, k=5)
print("\nTop Pareto pipelines (cv):\n", pareto_df)
def eval_pipeline(pipeline, X_te, y_te, name):
    y_hat = pipeline.predict(X_te)
    f1 = f1_score(y_te, y_hat)
    print(f"\n[{name}] F1(test) = {f1:.4f}")
    print(classification_report(y_te, y_hat, digits=3))

print("\nEvaluating top pipelines on test:")
for i, (ind, meta) in enumerate(sorted(
        tpot.pareto_front_fitted_pipelines_.items(),
        key=lambda kv: kv[1]['internal_cv_score'], reverse=True)[:3], 1):
    eval_pipeline(meta['pipeline'], X_te_s, y_te, name=f"Pareto#{i}")
We launch an evolutionary search with TPOT, capping the runtime to keep it practical and tracking progress via checkpoints, which lets us hunt for strong pipelines reproducibly. We then inspect the Pareto front to understand the key trade-offs, condense it into a compact table, and pick the leaders by cross-validation score. Finally, we evaluate the top candidates on the held-out test set to verify real-world performance with F1 and a full classification report.
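The Pareto front balances CV score against pipeline size: a pipeline stays on the front only if no other candidate is at least as good on both axes and strictly better on one. A minimal dominance sketch with toy numbers (not TPOT internals) makes the idea concrete:

```python
def dominates(a, b):
    # a, b are (cv_score, size); higher score and smaller size are better.
    return (a[0] >= b[0] and a[1] <= b[1]) and (a[0] > b[0] or a[1] < b[1])

def pareto_front(points):
    # Keep points not dominated by any other candidate.
    return [p for p in points if not any(dominates(q, p) for q in points)]

candidates = [(0.97, 120), (0.96, 40), (0.95, 200), (0.97, 90)]
print(pareto_front(candidates))  # → [(0.96, 40), (0.97, 90)]
```

Here (0.97, 120) is dominated by (0.97, 90) (same score, smaller pipeline), and (0.95, 200) is dominated by every other point.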
print("\n🔁 Warm-start for extra refinement...")
t1 = time.time()
tpot2 = TPOTClassifier(
generations=3, population_size=40, offspring_size=40,
scoring=cost_f1, cv=cv, subsample=0.8, n_jobs=-1,
config_dict=tpot_config, verbosity=2, random_state=SEED,
warm_start=True, periodic_checkpoint_folder="tpot_ckpt"
)
try:
    tpot2._population = tpot._population
    tpot2._pareto_front = tpot._pareto_front
except Exception:
    pass
tpot2.fit(X_tr_s, y_tr)
print(f"⏱️ Warm-start extra search took {time.time()-t1:.1f}s")
best_model = tpot2.fitted_pipeline_ if hasattr(tpot2, "fitted_pipeline_") else tpot.fitted_pipeline_
eval_pipeline(best_model, X_te_s, y_te, name="BestAfterWarmStart")
export_path = "tpot_best_pipeline.py"
(tpot2 if hasattr(tpot2, "fitted_pipeline_") else tpot).export(export_path)
print(f"\n📦 Exported best pipeline to: {export_path}")
import importlib.util as _util
spec = _util.spec_from_file_location("tpot_best", export_path)
tbest = _util.module_from_spec(spec); spec.loader.exec_module(tbest)
reloaded_clf = tbest.exported_pipeline_
pipe = Pipeline([("scaler", scaler), ("model", reloaded_clf)])
pipe.fit(X_tr, y_tr)
eval_pipeline(pipe, X_te, y_te, name="ReloadedExportedPipeline")
report = {
"dataset": "sklearn breast_cancer",
"train_size": int(X_tr.shape[0]), "test_size": int(X_te.shape[0]),
"cv": "StratifiedKFold(5)",
"scorer": "custom F1 (binary)",
"search": {"gen_1": 5, "gen_2_warm": 3, "pop": 40, "subsample": 0.8},
"exported_pipeline_first_120_chars": str(reloaded_clf)[:120]+"...",
}
print("\n🧾 Model Card:\n", json.dumps(report, indent=2))
We then continue the search with a warm start, reusing what was learned in the first run, and select the candidate that performs best on the test set. We export the chosen pipeline, reload it together with our scaler to simulate deployment, and check its results. Finally, we create a compact model card summarizing the exported pipeline, the search settings, and the dataset.
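The model card can also be persisted alongside the exported pipeline for later auditing; a minimal sketch (the file name and the trimmed-down card fields are assumptions, mirroring the tutorial's `report` dict):

```python
import json

# Hypothetical card echoing the tutorial's report dict (abbreviated).
report = {
    "dataset": "sklearn breast_cancer",
    "cv": "StratifiedKFold(5)",
    "scorer": "custom F1 (binary)",
}

# Write the card next to the exported pipeline, then read it back to verify.
with open("model_card.json", "w") as f:
    json.dump(report, f, indent=2)

with open("model_card.json") as f:
    assert json.load(f) == report
```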
We conclude that TPOT is a powerful tool for automated, reproducible, and explainable model optimization. Exporting the optimal pipeline and validating it on unseen data confirms that it is not merely experimental but also production-ready. By combining reproducibility, interpretability, and flexibility, we build a framework we can confidently apply to more complex datasets and real-world problems.
Asif Razzaq is the CEO of Marktechpost Media Inc. As a visionary entrepreneur and engineer, he is committed to harnessing the potential of Artificial Intelligence for social good. His most recent venture is Marktechpost, an Artificial Intelligence media platform known for its in-depth coverage of machine learning and deep learning news that is technically sound yet easily understandable by a wide audience. The platform draws over 2 million monthly views.

