Large language models (LLMs) have historically required centralized access to extensive training datasets. Many of these datasets are confidential, copyright-protected, or subject to usage restrictions, a constraint that severely limits participation by data-rich organisations operating in proprietary or regulated environments. FlexOlmo, introduced by researchers at the Allen Institute for AI and collaborators, proposes a modular training and inference framework that enables LLM development under data governance constraints.
Limitations of Current LLM Training

Current LLM pipelines aggregate all training data into a single corpus. This imposes a static inclusion decision and removes any possibility of opting out after training. The approach is incompatible with:
- Regulatory regimes with data sovereignty requirements (e.g., HIPAA, GDPR),
- License-bound datasets (e.g., non-commercial or attribution-restricted),
- Context-sensitive data (e.g., internal source code or clinical records).
FlexOlmo addresses two objectives:
- Decentralized, modular training: allow independently trained modules on locally held datasets.
- Inference-time flexibility: allow deterministic opt-in/opt-out of data contributions without retraining.
Model Architecture: Expert Modularity via Mixture-of-Experts (MoE)
FlexOlmo is based on a Mixture-of-Experts (MoE) architecture in which each expert corresponds to a feedforward network (FFN) module trained independently. A fixed public model, denoted Mpub, anchors the system. Each data owner i trains an expert Mi on their own private dataset Di, while all attention layers and other non-expert parameters remain frozen.
Key architectural components:
- Sparse activation: each input token activates only a subset of expert modules.
- Expert routing: token-to-expert assignment is governed by a router matrix derived from domain-informed embeddings, eliminating the need for joint training.
- Bias regularization: a negative bias term calibrates selection across independently trained experts, preventing over-selection of any single expert.
This design allows selective inclusion of expert modules at inference time while maintaining interoperability.
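The sparse activation, routing, and bias components above can be sketched in a few lines of numpy. The shapes, the top-k value, and the bias values here are illustrative assumptions, not the paper's actual configuration:

```python
import numpy as np

def route_tokens(token_embs, router_embs, expert_bias, top_k=2):
    """Sketch of FlexOlmo-style sparse routing (hypothetical shapes).

    token_embs:  (n_tokens, d)   token hidden states
    router_embs: (n_experts, d)  one routing embedding per expert module
    expert_bias: (n_experts,)    negative bias discouraging over-selection
                                 of independently trained experts
    Returns per-token expert indices and softmax weights over the top-k.
    """
    logits = token_embs @ router_embs.T + expert_bias  # (n_tokens, n_experts)
    top_idx = np.argsort(-logits, axis=1)[:, :top_k]   # sparse: keep top-k only
    top_logits = np.take_along_axis(logits, top_idx, axis=1)
    # softmax over the selected experts only
    w = np.exp(top_logits - top_logits.max(axis=1, keepdims=True))
    w /= w.sum(axis=1, keepdims=True)
    return top_idx, w

rng = np.random.default_rng(0)
tokens = rng.normal(size=(4, 8))
routers = rng.normal(size=(3, 8))
bias = np.array([0.0, -1.0, -1.0])  # public expert unbiased, private experts penalized
idx, weights = route_tokens(tokens, routers, bias)
```

Each token's output would then be a weighted combination of the selected experts' FFN outputs, with everything outside the FFNs shared and frozen.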
Asynchronous and Isolated Expert Training
Each expert Mi is trained through a strictly isolated procedure. Specifically:
- Training occurs in a hybrid MoE instance composed only of Mi and the public expert Mpub.
- The public expert and shared attention layers are frozen.
- Only the FFNs corresponding to Mi and the router embedding ri are updated.

To initialize ri, samples from Di are embedded with a pretrained encoder and averaged. Optional lightweight router tuning on proxy data from public corpora can further improve performance.
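The router initialization step is simple to illustrate: embed private-data samples and average. The encoder choice and the unit-normalization step below are assumptions for the sketch, not details confirmed by the article:

```python
import numpy as np

def init_router_embedding(sample_embeddings):
    """Initialize router row r_i as the mean of embedded samples from D_i.

    sample_embeddings: (n_samples, d) — outputs of a pretrained encoder
    applied to private-data samples (encoder choice is an assumption here).
    """
    r_i = sample_embeddings.mean(axis=0)
    return r_i / np.linalg.norm(r_i)  # unit-normalize (an assumption)

rng = np.random.default_rng(1)
embs = rng.normal(loc=0.5, size=(32, 16))  # stand-in for encoder outputs
r = init_router_embedding(embs)
```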
FLEXMIX Dataset Construction
The FLEXMIX training corpus has two main parts:
- A public mix of general-purpose web data.
- Seven closed sets simulating non-shareable domains: news, Reddit, code, academic text, educational text, creative writing, and mathematics.
Each expert is trained in isolation, with no joint access to data. This setup approximates real-world conditions in which organisations cannot share data due to legal, ethical, or operational restrictions.

Evaluation and Baselines
FlexOlmo's performance was assessed on 31 benchmark tasks across 10 categories, including general language understanding (e.g., MMLU), generative QA (e.g., GEN5), code generation (e.g., Code4), and mathematical reasoning.
Baseline methods included:
- Model soup: averaging the weights of individually fine-tuned models.
- Branch-Train-Merge (BTM): weighted ensembling of output probabilities.
- BTX: converting independently trained dense models into an MoE via parameter transfer.
- Prompt-based routing: instruction-tuned classifiers that route queries to experts.
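The model-soup baseline is easy to make concrete: uniformly average the parameters of separately fine-tuned models. This toy two-parameter example stands in for averaging full checkpoints:

```python
import numpy as np

def model_soup(state_dicts):
    """Model-soup baseline: uniform average of parameters across models
    fine-tuned separately (illustrative; real soups average full checkpoints)."""
    keys = state_dicts[0].keys()
    return {k: np.mean([sd[k] for sd in state_dicts], axis=0) for k in keys}

m1 = {"w": np.array([1.0, 2.0]), "b": np.array([0.0])}
m2 = {"w": np.array([3.0, 4.0]), "b": np.array([2.0])}
soup = model_soup([m1, m2])
# soup["w"] → [2.0, 3.0], soup["b"] → [1.0]
```

Unlike FlexOlmo's routing, a soup collapses all contributions into one parameter set, so a single dataset's influence cannot be removed afterwards.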
Compared to these methods, FlexOlmo achieved:
- A 41% average relative improvement over the public base model.
- A 10% improvement over the strongest merging baseline (BTM).

Gains were especially large on tasks aligned with the closed domains, confirming the utility of the specialized experts.
Architectural Analysis
Controlled experiments show the impact of the architectural choices:
- Removing expert-public coordination during training degrades performance.
- Randomly initialized router embeddings reduce inter-expert separability.
- Disabling the bias term skews expert selection, especially when more than two experts are merged.
Token-level routing patterns reveal expert specialization at specific layers. For example, mathematical inputs activate the math expert in deeper layers, while introductory tokens rely on the public model. This behavior makes the model more expressive than single-expert routing.
Privacy and data governance
A key feature of FlexOlmo is deterministic opt-out. Removing an expert's entry from the router matrix fully eliminates its influence at inference time. Experiments show that removing the NewsG expert reduces performance on NewsG-related benchmarks while leaving other tasks unaffected, confirming the expert's localized influence.
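The opt-out mechanism can be sketched as deleting an expert's row from the router matrix (and its bias entry), after which no token can ever be routed to it; no retraining is needed. The data layout here is an assumption for illustration:

```python
import numpy as np

def opt_out(router_embs, expert_bias, expert_id):
    """Deterministic opt-out sketch: removing an expert's router row and bias
    makes the expert unreachable at inference — its influence is gone."""
    keep = [i for i in range(router_embs.shape[0]) if i != expert_id]
    return router_embs[keep], expert_bias[keep], keep

routers = np.eye(3)                  # toy router matrix, 3 experts
bias = np.array([0.0, -1.0, -1.0])
r2, b2, keep = opt_out(routers, bias, expert_id=1)
# remaining experts: 0 and 2; expert 1 can never be selected again
```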
Privacy Issues
The risk of training-data extraction was evaluated using known attack methods. Results show:
- Public-only model: 0.1% extraction rate.
- Dense model trained directly on the math dataset: 1.6%.
- FlexOlmo with the math expert included: 0.7%.
Although these rates are low, differential privacy (DP) training can be applied to each expert independently for stronger guarantees. The architecture is compatible with DP and encrypted training methods.
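Because each expert is trained in isolation, a data owner could apply a standard DP-SGD recipe locally (clip per-example gradients, average, add noise). This numpy sketch is a generic illustration of that recipe, not FlexOlmo's implementation, and the parameter names are hypothetical:

```python
import numpy as np

def dp_sgd_step(per_example_grads, clip_norm=1.0, noise_mult=1.0, rng=None):
    """Generic DP-SGD update, applicable to per-expert training:
    clip each example's gradient to clip_norm, average, add Gaussian noise."""
    rng = rng or np.random.default_rng(0)
    clipped = []
    for g in per_example_grads:
        norm = np.linalg.norm(g)
        clipped.append(g * min(1.0, clip_norm / max(norm, 1e-12)))
    mean_grad = np.mean(clipped, axis=0)
    noise = rng.normal(scale=noise_mult * clip_norm / len(clipped),
                       size=mean_grad.shape)
    return mean_grad + noise

grads = [np.array([3.0, 4.0]), np.array([0.1, 0.0])]  # norms 5.0 and 0.1
g = dp_sgd_step(grads, clip_norm=1.0, noise_mult=0.0)  # noise off to show clipping
# first grad is scaled to [0.6, 0.8]; mean of clipped grads → [0.35, 0.4]
```

In practice the noise multiplier and clip norm would be set by a privacy accountant to meet a target (ε, δ) budget per expert.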
Scalability
The methodology was also applied to an already strong base model (OLMo-2 7B) pretrained on 4T tokens. Adding two experts (math and code) improved the benchmark average from 49.8 to 52.8 without retraining the core, demonstrating the system's scalability and compatibility with existing training pipelines.
Conclusion

FlexOlmo provides a principled framework for building modular LLMs under data governance constraints. Its design supports distributed training on locally maintained datasets and allows the influence of individual datasets to be included or excluded at inference time. Empirical results show it is competitive with monolithic baselines and ensembles.
This architecture is especially applicable in environments with:
- Data locality requirements,
- Dynamic data-use policies,
- Regulatory compliance constraints.
FlexOlmo offers a path to building performant language models while respecting real-world data boundaries.
Check out the Paper, Model on Hugging Face, and Code. All credit for this research goes to the researchers of this project.
Asif Razzaq, CEO of Marktechpost Media Inc., is a visionary engineer and entrepreneur dedicated to harnessing Artificial Intelligence for social good. His most recent venture, Marktechpost, is an AI media platform known for its technical yet accessible coverage of machine learning and deep learning news, drawing over 2 million monthly views.


