Many foundational biology models have an important blindspot: they view cells as static snapshots. Give a model a single-cell transcriptome — a readout of which genes are active in a cell at a given moment — and it can tell you a lot about what that cell is doing right now. The model can tell you what the cell’s current state is, but not where it is going.
When studying the aging process, this limitation matters enormously. Age-related diseases such as heart disease, Alzheimer's dementia, and pulmonary fibrosis take decades to develop, the result of slow, gradual changes accumulating in gene networks. To understand and eventually reverse these trajectories, you need a model that thinks in time, not just in snapshots.
MaxToki was designed for just that.
MaxToki: What It Is Under the Hood
MaxToki is a transformer decoder model — the same architectural family behind large language models — but trained on single-cell RNA sequencing data. It is available in two parameter sizes: 217 million and 1 billion parameters. The research team spans the Gladstone Institutes (Cardiovascular Disease, Data Science and Biotechnology, and Neurological Disease); the University of California San Francisco (Department of Pathology, Department of Neurology, Bakar Aging Research Institute, Department of Pediatrics, Cardiovascular Research Institute, and Institute for Human Genetics); NVIDIA; the University of California Berkeley Department of Molecular and Cell Biology; Goethe University Frankfurt's Institute of Cardiovascular Regeneration and Centre for Molecular Medicine; the Cardiopulmonary Institute and Clinic for Cardiology at University Hospital Frankfurt in Germany; and the Center for iPS Cell Research and Application at Kyoto University.
The first key design choice is rank value encoding. Instead of feeding the model raw counts, the transcriptome of each cell is represented as a list of genes ranked by their relative expression in that cell, after scaling expression over the pretraining corpus. This nonparametric approach deprioritizes ubiquitously expressed housekeeping genes and amplifies genes, such as transcription factors, that have a high dynamic range across distinct cell states even when lowly expressed in absolute terms. It is also less susceptible to technical batch effects, because relative rankings within a cell are more stable than absolute counts.
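To make the encoding concrete, here is a minimal sketch of the idea. The helper name, the toy gene list, and the use of corpus-wide median scaling factors are illustrative assumptions, not the paper's actual implementation:

```python
import numpy as np

def rank_value_encode(counts, gene_names, corpus_medians, max_genes=2048):
    """Sketch of rank value encoding (hypothetical helper, not the paper's code).

    counts: raw expression counts for one cell (1D array).
    corpus_medians: assumed per-gene scaling factors derived from the
    pretraining corpus, so ubiquitously high housekeeping genes are demoted.
    """
    counts = np.asarray(counts, dtype=float)
    # Normalize within the cell, then scale by corpus-wide factors so genes
    # with high dynamic range across cell states rank higher.
    norm = counts / counts.sum()
    scaled = norm / corpus_medians
    # Keep expressed genes only, ranked by scaled expression (descending).
    expressed = np.nonzero(counts)[0]
    order = expressed[np.argsort(-scaled[expressed])]
    return [gene_names[i] for i in order[:max_genes]]

# Toy example: a housekeeping gene (ACTB) with a large corpus-wide factor is
# demoted below a transcription factor (GATA4) with a small one, even though
# ACTB's raw count is 25x higher in this cell.
genes = ["ACTB", "GATA4", "TTN"]
medians = np.array([50.0, 0.5, 5.0])
tokens = rank_value_encode([100, 4, 0], genes, medians)
# tokens -> ["GATA4", "ACTB"]
```

This is also why the encoding resists batch effects: a technical artifact that scales all counts in a cell leaves the within-cell ranking unchanged.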
Training proceeded in two stages. Stage 1 used Genecorpus-175M: approximately 175 million single-cell transcriptomes from publicly available data across a broad range of human tissues in health and disease, covering 10,795 datasets and yielding approximately 290 billion tokens. The team excluded malignant and immortalized cells, whose gain-of-function mutations could confound what the model learns about the normal dynamics of gene networks, and capped any single tissue at 25% of the corpus. The model was trained with an autoregressive objective: given the preceding genes in the rank value encoding, predict the next ranked gene — conceptually identical to how language models predict the next token in a sentence.
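The objective can be sketched in a few lines: each rank-encoded cell is split into shifted input/target pairs, exactly as in language-model pretraining (token ids below are arbitrary placeholders):

```python
import numpy as np

def next_gene_batch(token_ids):
    """Split one rank value encoding into (inputs, targets) for the
    autoregressive objective: at each position the model sees the genes
    ranked so far and must predict the next ranked gene."""
    ids = np.asarray(token_ids)
    return ids[:-1], ids[1:]

cell = [17, 5, 42, 8]            # gene-vocabulary ids, highest rank first
inputs, targets = next_gene_batch(cell)
# inputs  -> [17, 5, 42]
# targets -> [ 5, 42,  8]
```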
The key finding of Stage 1 was that the model's performance on the generative objective scaled with the number of parameters. This motivated fully pretraining exactly two variants — the 217M and the 1B — rather than exploring the full spectrum, balancing performance against compute budget constraints.
Stage 2 extended the context length from 4,096 to 16,384 tokens using RoPE (Rotary Positional Embeddings) scaling — a technique that fits more positions into the existing positional framework by reducing the rotation frequency. The longer context lets the model ingest multiple cells sequentially, which is what enables temporal reasoning rather than single-cell inference. Stage 2 training used Genecorpus-Aging-22M: approximately 22 million single-cell transcriptomes across roughly 600 human cell types from about 3,800 donors representing every decade of life from birth to 90-plus years, balanced by gender (49% male, 51% female), yielding approximately 650 billion tokens. Across both stages, MaxToki was trained on nearly 1 trillion total gene tokens.
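The frequency-reduction trick behind RoPE position interpolation can be shown directly. A minimal sketch (not MaxToki's code): scaling positions by old_length/new_length makes position 16,384 produce the same rotation angles the model saw at position 4,096 during Stage 1, so the extended range stays inside the distribution the model was pretrained on:

```python
import numpy as np

def rope_angles(positions, dim, base=10000.0, scale=1.0):
    """Rotation angles for rotary positional embeddings.
    scale < 1 slows the rotation frequency, interpolating more positions
    into the angle range seen during pretraining."""
    # One frequency per pair of embedding dimensions.
    inv_freq = base ** (-np.arange(0, dim, 2) / dim)
    return np.outer(np.asarray(positions) * scale, inv_freq)

dim = 64
# With scale = 4096/16384 = 0.25, position 16384 yields exactly the angles
# that unscaled position 4096 produced during Stage 1 pretraining.
orig = rope_angles([4096], dim)
interpolated = rope_angles([16384], dim, scale=4096 / 16384)
assert np.allclose(orig, interpolated)
```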
Temporal Prompting Is an Effective Strategy
MaxToki's prompting strategy is one of its most novel architectural contributions. A prompt consists of a context trajectory — two or three cell states plus the timelapses between them — followed by a query. The model then performs one of two tasks:
Task 1: Predict the timelapse, in months, from the last context cell to the query cell.
Task 2: Generate the transcriptome of the cell that arises from the context trajectory after a queried timelapse.
For Task 1, standard cross-entropy is a poor fit because it treats each possible timelapse value as a separate, unordered category. The research team instead used continuous numerical tokenization with a mean-squared-error (MSE) loss, teaching the model to treat timelapses as a continuum. This design choice produced dramatically lower prediction errors: the median prediction error for held-out ages dropped to 87 months with MaxToki, compared to 178 months for a linear SGDRegressor baseline and 180 months for the naive baseline of assuming each query cell was the most common age for that cell type and gender.
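The difference between the two loss formulations is easy to see numerically. A minimal sketch (toy values, not the paper's training code): under a discrete-category view, predicting 36 months when the truth is 38 is exactly as wrong as predicting 120, while MSE captures that 36 is a near miss:

```python
import numpy as np

def mse_loss(pred_months, true_months):
    """MSE treats timelapses as a continuum: the penalty grows with the
    distance between the predicted and true number of months."""
    pred = np.asarray(pred_months, dtype=float)
    true = np.asarray(true_months, dtype=float)
    return float(np.mean((pred - true) ** 2))

# Cross-entropy over month "classes" would score both guesses as equally
# wrong (neither hits the exact class). MSE ranks them sensibly:
near = mse_loss([36.0], [38.0])    # 4.0
far = mse_loss([120.0], [38.0])    # 6724.0
assert near < far
```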
Notably, the model never receives explicit information about which gender or cell type it is dealing with; it infers them from the cells in the trajectory context — a form of in-context learning. The model generalizes even to cell types it never saw during training, achieving a Pearson correlation of 0.85 between predicted and real timelapses for completely held-out cell types, and 0.77 for trajectories from held-out donor ages.
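The prompt structure described above can be sketched as a simple interleaving of cell encodings and timelapse tokens. The marker strings (`<cell>`, `<dt:…>`, `<query>`) are illustrative placeholders, not the paper's actual vocabulary:

```python
def build_temporal_prompt(context_cells, timelapses, query_cell):
    """Flatten a context trajectory plus a query cell into one sequence.

    context_cells: list of rank value encodings (lists of gene tokens).
    timelapses: months elapsed between consecutive context cells.
    """
    assert len(timelapses) == len(context_cells) - 1
    seq = []
    for i, cell in enumerate(context_cells):
        seq.append("<cell>")
        seq.extend(cell)
        if i < len(timelapses):
            seq.append(f"<dt:{timelapses[i]}mo>")
    seq.append("<query>")
    seq.extend(query_cell)
    return seq

prompt = build_temporal_prompt(
    [["GATA4", "TTN"], ["TTN", "GATA4"]],  # two context cell encodings
    [120],                                  # 120 months between them
    ["TTN", "NPPA"],                        # query cell
)
```

For Task 1 the model would then emit the timelapse from the last context cell to the query; for Task 2 the query slot instead carries a requested timelapse and the model generates the resulting cell.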
GPU Engineering at Scale
Training on nearly 1 trillion gene tokens required serious infrastructure work. The team implemented the 1-billion-parameter variant with FlashAttention-2 using the NVIDIA BioNeMo stack, built on NeMo, Megatron-LM, and Transformer Engine. To enable FlashAttention-2, they adjusted the feed-forward hidden dimensions to be evenly divisible by the number of attention heads — a hard compatibility requirement. These changes, combined with bf16 mixed-precision training on 80GB H100 GPUs, yielded a 5x gain in training throughput and a 4x larger achievable micro-batch size. For inference, adopting the Megatron-Core DynamicInferenceContext abstraction with key-value caching made autoregressive generation over 400x faster than the naive baseline.
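Why key-value caching yields such large factors is worth a back-of-the-envelope sketch. These are illustrative operation counts, not a benchmark of MaxToki itself: without a cache, every generation step recomputes attention over the whole prefix, so per-step cost grows quadratically; with cached keys and values, each new token only attends once to the stored prefix:

```python
def attention_ops_naive(seq_len):
    """Naive generation: step t recomputes attention over all t tokens,
    roughly t*t pairwise scores per step."""
    return sum(t * t for t in range(1, seq_len + 1))

def attention_ops_cached(seq_len):
    """KV-cached generation: step t computes scores only for the new
    query against t cached keys."""
    return sum(t for t in range(1, seq_len + 1))

# At a 4,096-token context the asymptotic gap alone is ~2,700x; the paper's
# reported 400x also reflects real-world overheads this sketch ignores.
speedup = attention_ops_naive(4096) / attention_ops_cached(4096)
```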
What the Model Learned — Without Being Told
Interpretability analysis of the 217M-parameter variant revealed something remarkable: through self-supervised learning alone, with no gene function labels, approximately half of the attention heads learned to attend far more strongly to transcription factor genes than to other genes. The model discovered on its own that transcription factors — the master regulators of cell state transitions — are the genes that matter.
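One simple way to quantify this kind of head behavior is the fraction of a head's attention mass that lands on transcription factor tokens. The helper and toy weights below are an illustrative analysis sketch, not the paper's pipeline:

```python
import numpy as np

def tf_attention_fraction(attn_weights, is_tf_mask):
    """Fraction of one head's total attention mass placed on transcription
    factor tokens.

    attn_weights: (num_queries, num_keys) softmax attention weights.
    is_tf_mask: boolean flag per key token (True = transcription factor).
    """
    attn = np.asarray(attn_weights, dtype=float)
    mask = np.asarray(is_tf_mask, dtype=bool)
    return float(attn[:, mask].sum() / attn.sum())

# Toy head that routes most of its weight to the TF token at position 1:
attn = np.array([[0.1, 0.8, 0.1],
                 [0.2, 0.7, 0.1]])
frac = tf_attention_fraction(attn, [False, True, False])   # 0.75
```

A head whose fraction far exceeds the overall share of TF tokens in the input is "TF-focused" in the sense described above.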
Ablation studies confirmed that the context cells and the query cell are equally necessary for accurate predictions: masking either component significantly and equivalently degraded performance. Shuffling the genes within the rank value encoding into a "bag of genes" also hurt accuracy, showing that the model exploits the relative ordering of genes within a cell, not merely their presence or absence. Further attention analysis showed that individual heads specialized for different components of the prompt — some attending primarily to context cells, others to timelapse tokens, others to the query — with many heads exhibiting cell type-specific activation patterns across the roughly 60 cell types tested.
One failure mode for generative models is learning to output averages rather than realistic individual samples. The research team trained a doublet detector — a classifier distinguishing individual cells from simulated doublets formed by merging two cells of the same cell type — on ground-truth cells, then applied it to MaxToki-generated cells. About 95% of MaxToki-generated cells were classified as singlets, confirming that the model generates realistic single-cell transcriptomes rather than blended averages.
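The doublet-simulation setup can be sketched in a few lines. The merging rule (summing counts) and the pairwise construction are common conventions assumed here, not details confirmed by the paper:

```python
import numpy as np

def simulate_doublet(cell_a, cell_b):
    """Fabricate a doublet by merging two real cells' counts — the blended
    profile a generative model would produce if it averaged cells."""
    return np.asarray(cell_a) + np.asarray(cell_b)

def make_training_set(cells):
    """Label real cells 0 (singlet) and all pairwise merges 1 (doublet);
    a classifier fit on this set can then score generated cells."""
    X, y = [], []
    for c in cells:
        X.append(np.asarray(c))
        y.append(0)
    for i in range(len(cells)):
        for j in range(i + 1, len(cells)):
            X.append(simulate_doublet(cells[i], cells[j]))
            y.append(1)
    return np.stack(X), np.asarray(y)

cells = [[5, 0, 1], [0, 4, 2], [3, 3, 0]]
X, y = make_training_set(cells)
# 3 real singlets + 3 simulated doublets -> 6 labeled examples
```

Generated cells that the trained classifier scores as singlets look like individual cells rather than averages — the property the 95% figure above verifies.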
Inferring Age Acceleration in Disease — Including Diseases Never Seen During Training
The researchers tested the model's ability to detect accelerated aging in diseases that were not part of its training. The approach uses normal cells as a baseline context, then compares the model's inferred age for diseased cells against that of age-matched control cells.
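The comparison itself reduces to a simple statistic. A sketch with toy numbers (the aggregation by median is an assumption for illustration, not the paper's exact procedure):

```python
import numpy as np

def age_acceleration_years(pred_disease_months, pred_control_months):
    """Gap between the model's inferred ages for disease cells and for
    age-matched control cells, converted from months to years."""
    gap = np.median(pred_disease_months) - np.median(pred_control_months)
    return float(gap) / 12.0

# Toy inferred ages (months): disease cells read ~15 years older than
# age-matched controls, as reported for pulmonary fibrosis fibroblasts.
disease = [780, 800, 790]
control = [600, 610, 605]
accel = age_acceleration_years(disease, control)
```

A positive gap for a disease cohort, with near-zero gaps for matched healthy comparisons, is the signature of disease-associated age acceleration.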
In lung mucosal epithelial cells from donors exposed to heavy smoking, the model inferred approximately 5 years of age acceleration compared to age-matched non-smoking controls — consistent with prior reports linking smoking status to telomere shortening and lung aging signatures. In lung fibroblasts from patients with pulmonary fibrosis — a disease characterized by telomere attrition and cellular senescence — the model inferred approximately 15 years of age acceleration.
The Alzheimer's analysis yielded the most clinically interesting findings. Microglia from Alzheimer's patients, collected through the Mount Sinai NIH Neurobiobank, showed approximately 3 years of accelerated aging compared to age-matched controls. The result replicated in an independent cohort from the Duke and Johns Hopkins Alzheimer's Disease Research Centers, specifically in homeostatic microglia. Critically, this second cohort also included patients with mild cognitive impairment and Alzheimer's-resilient individuals — people who carry the same neuropathological changes as Alzheimer's patients but exhibit no cognitive impairment. In homeostatic microglia from either the mild cognitive impairment or resilient groups, no age acceleration was detectable relative to controls, suggesting these individuals may be shielded from disease-related age acceleration in this microglial subtype. This distinction between full Alzheimer's disease and Alzheimer's resilience — captured without any disease-specific training — is one of the most clinically significant findings in the paper.
Conclusion
MaxToki marks a major step forward in the ability of AI models to reason about biological time, addressing a longstanding gap in computational biology. By going beyond single-cell snapshots, it models the entire trajectory of gene network changes over the course of a human lifetime. The combination of rank value encoding, continuous numerical tokenization, RoPE-based context extension, and in-context learning allowed the model to generalize to unseen cell types, unseen ages, and even disease states it was never trained on — all while learning, without any supervision, to attend to the transcription factors that actually drive cell state transitions.
What makes MaxToki compelling for both engineers and researchers is that its predictions are not confined to in silico analysis. The model nominated novel pro-aging drivers in cardiac cell types that were subsequently validated to cause age-related gene network dysregulation in iPSC-derived cardiomyocytes and measurable cardiac dysfunction in living mice within six weeks — a direct line from in silico screening to in vivo consequence. With the models and training code publicly available, the community can build on the framework, adapt it to specific diseases, or extend it to other tissue types. As single-cell longitudinal datasets grow, a temporal foundation model that can identify intervention points before age-related disease takes hold may become a tool of choice for the broader community.
Check out the Paper, Model, and Repo.

