How can you cut your AI training costs by 80 percent? Oxford’s new optimizer delivers 7.5x faster training by optimizing how a model learns

Tech | By Gavin Wallace | 29/08/2025 | 6 Mins Read

GPU bill: the hidden cost of AI

AI model training typically consumes millions of dollars in GPU compute, a burden that shapes budgets, limits experimentation, and slows progress. Training a modern language model, or a vision transformer on a benchmark like ImageNet-1K, can take thousands of GPU-hours. For startups, research labs, and even larger companies, this status quo is unsustainable.

But what if you could cut your GPU bill by 87%—simply by changing the optimizer?

That is the promise of Fisher-Orthogonal Projection (FOP), a new optimizer from researchers at the University of Oxford. (A 7.5x wall-clock speedup corresponds to roughly an 87% reduction in compute time, since 1 - 1/7.5 ≈ 0.87.) This article explains why gradient “noise” is not really noise, what FOP is, and how it works.

We Train Models Wrong

Modern deep learning relies on gradient descent: the optimizer adjusts the model’s parameters to reduce the loss. At large scale, however, the optimizer sees only mini-batches, small subsets of the training data, and averages their gradients to obtain a single update direction.

The catch: gradients from different examples in a batch always differ. The standard method dismisses those differences as noise and smooths them away for stability. In reality, that “noise” carries an important directional signal about the true shape of the loss landscape, as the sketch below illustrates.
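To make concrete what averaging discards, here is a minimal sketch, assuming a toy linear model with a squared loss (not the paper's setup): it computes per-sample gradients for one mini-batch, then compares the mean gradient that standard optimizers keep with the intra-batch spread they throw away.

```python
# Minimal sketch: per-sample gradients vs. the averaged update (toy model, not the paper's code).
import torch

torch.manual_seed(0)
w = torch.zeros(3, requires_grad=True)      # model parameters
X = torch.randn(8, 3)                       # one mini-batch of 8 samples
y = torch.randn(8)

per_sample_grads = []
for xi, yi in zip(X, y):
    loss_i = 0.5 * (xi @ w - yi) ** 2       # squared-error loss for one sample
    g_i, = torch.autograd.grad(loss_i, w)   # gradient for that single sample
    per_sample_grads.append(g_i)
G = torch.stack(per_sample_grads)           # shape (8, 3)

mean_grad = G.mean(dim=0)                   # the only signal SGD/AdamW use
spread = G - mean_grad                      # intra-batch variation, usually discarded
print("mean gradient:", mean_grad)
print("per-dimension variance:", spread.pow(2).mean(dim=0))
```

The per-dimension spread is exactly the kind of signal that FOP treats as a terrain map rather than noise.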

FOP: the Terrain-Aware Navigator

FOP treats the differences between gradients within a batch not as noise but as a map of the terrain. The average gradient still sets the main direction; FOP then projects out of the gradient differences a geometry-aware, curvature-sensitive component that steers the optimizer away from steep walls and along the canyon floor, even when the main direction points straight ahead.

How it works:

  • Average gradient: points the way, setting the main direction of the update.
  • Gradient differences: act as a terrain sensor, indicating whether the surroundings are steep walls (go carefully) or flat plains (go fast).
  • FOP combines both signals: it adds a “curvature-aware” step that is orthogonal to the main direction, so the correction never fights the main update or overshoots.
  • Result: faster and more stable convergence, even at extreme batch sizes, the regime where SGD, AdamW, and even state-of-the-art KFAC fail.

In deep-learning terms: FOP applies a Fisher-orthogonal correction on top of standard natural gradient descent (NGD). Where NGD averages away the intra-batch variance, FOP preserves it and turns it into a local-curvature signal that was previously lost to averaging.
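A minimal sketch of the projection idea follows. It assumes a plain Euclidean inner product rather than the Fisher metric used in the paper, and a hypothetical scaling factor alpha, so it illustrates the geometry rather than reproducing the authors' implementation.

```python
# Illustrative sketch of a Fisher-orthogonal-style update (Euclidean projection,
# hypothetical `alpha`; not the authors' code).
import torch

def fop_style_update(g1: torch.Tensor, g2: torch.Tensor, alpha: float = 1.0) -> torch.Tensor:
    """Combine the mean gradient of two half-batches with the component of their
    difference that is orthogonal to that mean direction."""
    g_mean = 0.5 * (g1 + g2)                                # main update direction
    d = 0.5 * (g1 - g2)                                     # intra-batch variation
    # Remove the part of d that lies along g_mean, keeping only the
    # curvature-sensitive component orthogonal to the main direction.
    coeff = (d @ g_mean) / (g_mean @ g_mean).clamp_min(1e-12)
    d_orth = d - coeff * g_mean
    return g_mean + alpha * d_orth                          # curvature-aware step

# Example: gradients from two half-batches of a 3-parameter model.
g1 = torch.tensor([1.0, 0.5, -0.2])
g2 = torch.tensor([0.8, -0.1, 0.4])
print(fop_style_update(g1, g2))
```

Because the correction is orthogonal to the mean gradient, it shifts the step across the valley without shrinking or reversing progress along it.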

FOP in Practice: 7.5x Faster on ImageNet-1K

The results are dramatic.

  • ImageNet-1K (ResNet-50): to reach the standard 75.9% accuracy, SGD takes 71 epochs and 2,511 minutes. FOP reaches the same accuracy in just 40 epochs and 335 minutes, a 7.5x wall-clock speedup.
  • CIFAR-10: FOP is 1.7x faster than AdamW and 1.3x faster than KFAC. At the largest batch size (50,000), only FOP reaches 91% accuracy; the others fail to get there.
  • ImageNet-100 (Vision Transformer): FOP is up to 10x faster than AdamW and 2x faster than KFAC at the largest batch size.
  • Long-tailed (imbalanced) datasets: FOP reduces Top-1 error by 2.3–3.3% over strong baselines—a meaningful gain for real-world, messy data.

Memory usage: FOP’s peak GPU memory footprint is higher for small-scale jobs, but once the work is distributed across many devices it matches KFAC, and the time savings far outweigh the cost.

Scalability: FOP keeps converging even when batch sizes reach the tens of thousands, something no other optimizer tested could do. Add more GPUs and the training time drops almost linearly, unlike existing methods, whose parallel efficiency often degrades.

Why this matters for business, research, and practice

  • Business: cutting training costs by up to 87% transforms the economics of AI; this is not an incremental gain. Savings can be reinvested into larger, more ambitious models, or teams can build a moat by experimenting faster and more cheaply.
  • Practitioners: FOP is plug-and-play. The open-source implementation can be added to existing PyTorch workflows with a few lines of code and no extra tuning; if you already use KFAC, you are halfway there (see the sketch after this list).
  • Researchers: FOP redefines what counts as “noise” in gradient descent. Intra-batch variance is not just useful, it is essential, and the added robustness to data imbalance is a welcome bonus for real-world deployment.
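As a rough illustration of the plug-and-play claim, the sketch below shows where an optimizer swap lands in a typical PyTorch training step. The module name fop and the FOP(...) constructor are hypothetical placeholders, not the confirmed package or API; the real names should be taken from the paper's GitHub repository.

```python
# Hypothetical sketch of a drop-in optimizer swap in PyTorch.
# `fop` / `FOP(...)` are placeholders, NOT a confirmed package or API.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))
criterion = nn.CrossEntropyLoss()

# Before: a standard optimizer.
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
# After (placeholder names, adjust to the released API):
# from fop import FOP
# optimizer = FOP(model.parameters(), lr=1e-3)

x = torch.randn(32, 784)                  # dummy mini-batch
y = torch.randint(0, 10, (32,))

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()
optimizer.step()                          # the rest of the loop is unchanged
```

The point of the sketch is that only the optimizer construction changes; the forward, backward, and step calls stay the same.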

What FOP does to the landscape

Very large batches have traditionally been a liability: SGD, AdamW, and KFAC all become unstable or stall. FOP changes that. By preserving and leveraging intra-batch gradient variation, it unlocks stable, fast, and scalable training at unprecedented batch sizes.

FOP is not a tweak; it is a fundamental rethinking of which signals are valuable in optimization. What you average away as “noise” today may be your terrain map tomorrow.

Summary Table: FOP vs. the Status Quo

Metric | SGD/AdamW | KFAC | FOP (this work)
Wall-clock speedup | Baseline | 1.5–2x faster | Up to 7.5x faster
Stability at very large batches | Fails | Stalls, needs damping | Works at extreme scale
Robustness to imbalance | Poor | Modest | Best in class
Plug-and-play | No | No | Yes (pip installable)
Distributed GPU memory | Low | Moderate | Moderate

Summary

Fisher-Orthogonal Projection (FOP) is a leap forward for large-scale AI training, delivering up to 7.5x faster convergence on datasets like ImageNet-1K at extremely large batch sizes, while also improving generalization, reducing error rates by 2.3–3.3% on challenging, imbalanced benchmarks. Unlike conventional optimizers, which discard intra-batch gradient variation as “noise,” FOP uses that variation to read the curvature and shape of the loss landscape. This not only slashes GPU compute costs, potentially by as much as 87%, but also lets researchers and companies train bigger models, iterate faster, and maintain robust performance even on real-world, uneven data. With plug-and-play PyTorch integration and minimal tuning, FOP offers a scalable, practical path forward for machine learning at scale.


For full details, check out the Paper and the project’s GitHub page for tutorials, code, and notebooks.


