Evaluating AI models trained on brain signals has long been difficult. Different research groups use different preprocessing pipelines, train models on different datasets, and report results on a narrow set of tasks, making it nearly impossible to know which model actually works best, or for what. Meta AI has developed a new framework to address this issue.
Meta AI researchers have released NeuralBench, an open-source framework for benchmarking AI models of brain activity. The first version, NeuralBench-EEG v1.0, is the largest open benchmark of its kind: 36 downstream tasks, 94 datasets, 9,478 subjects, 13,603 hours of electroencephalography (EEG) data, and 14 deep learning architectures evaluated under a single standardized interface.
The Problem NeuralBench Solves
NeuroAI, the field where neuroscience meets deep learning, has grown rapidly in recent years. Self-supervised techniques originally developed for images, language, and speech are now being applied to building brain foundation models: models pretrained on unlabeled brain recordings and then fine-tuned for downstream tasks such as clinical seizure detection or decoding the information a person sees or hears.
The evaluation landscape, however, is fragmented. MOABB, for example, covers up to 148 BCI datasets but evaluates only 5 downstream tasks. Other efforts such as EEG-Bench, EEG-FM-Bench, and AdaBrain-Bench are each constrained in their own ways, and for modalities like magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI) there is no systematic benchmark at all.
The result: claims that foundation models are “generalizable” or “foundational” often rest on a handful of selected tasks, with no common point of reference.
What is NeuralBench?
NeuralBench is built from three core Python packages that form a modular pipeline:
- NeuralFetch handles dataset acquisition, including data from OpenNeuro, DANDI, and NEMAR.
- NeuralSet prepares the data as PyTorch dataloaders, using established neuroscience tools such as MNE-Python and nilearn for preprocessing, and HuggingFace models to extract stimulus embeddings for tasks involving images, text, or speech.
- NeuralTrain runs training and evaluation, built on PyTorch, Pydantic, and PyTorch Lightning, with Exca as the execution and caching library.
Installed via pip install neuralbench, the framework is driven through a command-line interface (CLI). Running a single task takes just three commands: download the data, build the cache, and execute. Every task is configured through a lightweight YAML file that specifies the data source, train/validation/test splits, preprocessing steps, target processing, training hyperparameters, and evaluation metrics.
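The exact YAML schema is not reproduced in the announcement; the sketch below is a hypothetical illustration of the kind of per-task configuration described, modeled with Pydantic (which NeuralTrain builds on). All field names are illustrative, not the actual NeuralBench schema.

```python
# Hypothetical illustration of a per-task configuration; field names are made up.
from pydantic import BaseModel


class SplitConfig(BaseModel):
    strategy: str = "cross_subject"      # or "leave_concept_out", "within_subject"
    test_fraction: float = 0.2


class TrainingConfig(BaseModel):
    optimizer: str = "adamw"
    learning_rate: float = 1e-4
    weight_decay: float = 0.05
    max_epochs: int = 50
    early_stopping_patience: int = 10


class TaskConfig(BaseModel):
    dataset: str                          # e.g. an OpenNeuro accession ID
    preprocessing: list[str] = ["bandpass_filter", "resample"]
    target: str = "classification"
    split: SplitConfig = SplitConfig()
    training: TrainingConfig = TrainingConfig()
    metric: str = "balanced_accuracy"


# A YAML file with matching keys could then be validated into this model, e.g.:
# TaskConfig.model_validate(yaml.safe_load(open("task.yaml")))
```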

What NeuralBench-EEG v1.0 Covers
This first release covers EEG tasks across eight categories: cognitive decoding (image, sentence, speech/typing, video, and word decoding), brain-computer interfaces (BCI), evoked responses, clinical tasks, internal state, sleep, phenotyping, and miscellaneous.
Three classes of models are compared:
- Task-specific architectures (~1.5K–4.2M parameters, trained from scratch): ShallowFBCSPNet, Deep4Net, EEGNet, BDTCN, ATCNet, EEGConformer, SimpleConvTimeAgg, and CTNet.
- EEG foundation models (~3.2M–157.1M parameters, pretrained and fine-tuned): BENDR, LaBraM, BIOT, CBraMod, LUNA, and REVE.
- Handcrafted-feature baselines: sklearn pipelines that build symmetric positive-definite (SPD) matrix representations and feed them into logistic or ridge regression (a sketch in this spirit follows this list).
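The announcement does not spell these pipelines out in detail. The sketch below is one plausible baseline in that spirit, assuming covariance matrices as the SPD representation and the pyriemann package for the tangent-space projection; both are assumptions, not the confirmed NeuralBench implementation.

```python
# A plausible sklearn-style SPD baseline (illustrative; not NeuralBench's actual code).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from pyriemann.estimation import Covariances
from pyriemann.tangentspace import TangentSpace

X = np.random.randn(200, 32, 512)        # trials x channels x time (dummy EEG)
y = np.random.randint(0, 2, size=200)    # dummy binary labels

# Covariance matrices are SPD; projecting them into the tangent space yields a
# flat vector representation that a linear classifier can consume.
clf = make_pipeline(
    Covariances(estimator="oas"),
    TangentSpace(),
    LogisticRegression(max_iter=1000),
)
clf.fit(X, y)
```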
All foundation models are fine-tuned end-to-end using a shared training recipe — AdamW optimizer, learning rate of 10⁻⁴, weight decay of 0.05, cosine-annealing with 10% warmup, up to 50 epochs with early stopping (patience=10). The sole exception is BENDR, for which the learning rate is lowered to 10⁻⁵ and gradient clipping is applied at 0.5 to obtain stable learning curves. This intentional standardization otherwise removes model-specific optimization tricks — such as layer-wise learning rate decay, two-stage probing, or LoRA — so that architecture and pretraining methodology are what actually gets evaluated.
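As a rough sketch (not the actual NeuralTrain code), the shared recipe could be assembled in plain PyTorch roughly as follows; the model and step counts are placeholders, and early stopping would be handled by the training loop (e.g. PyTorch Lightning).

```python
# Minimal sketch of the shared fine-tuning recipe: AdamW, 10% linear warmup,
# then cosine annealing. Illustrative only; not the NeuralTrain implementation.
import math
import torch

def build_optimizer_and_scheduler(model, total_steps, lr=1e-4,
                                  weight_decay=0.05, warmup_frac=0.10):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    warmup_steps = int(warmup_frac * total_steps)

    def lr_lambda(step):
        # Linear warmup, then cosine decay over the remaining steps.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    return optimizer, torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

model = torch.nn.Linear(64, 2)                        # stand-in for an EEG model
optimizer, scheduler = build_optimizer_and_scheduler(model, total_steps=50 * 100)

# BENDR is the stated exception: lr=1e-5 plus gradient clipping, e.g.
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5) before each step.
```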
Data splits are predefined for each task to reflect realistic generalization constraints: leave-concept-out splits for cognitive decoding (all subjects are seen during training, but a set of stimuli is held out for testing), cross-subject splits for most clinical and BCI tasks, and within-subject splits for datasets with very few participants. Each model is then trained 3 times per task with different random seeds.
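To illustrate what a cross-subject split enforces, here is a minimal, hypothetical example using scikit-learn's GroupShuffleSplit on dummy data (NeuralSet's actual splitting code is not shown in the announcement):

```python
# Cross-subject split: every trial from a held-out subject goes to the test set,
# so the model is never evaluated on subjects it saw during training.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

n_trials = 1000
X = np.random.randn(n_trials, 64, 256)               # trials x channels x time (dummy EEG)
y = np.random.randint(0, 2, size=n_trials)            # dummy labels
subjects = np.random.randint(0, 50, size=n_trials)    # subject ID per trial

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=subjects))
assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
```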
Evaluation metrics are standardized per task type: top-5 accuracy for retrieval, balanced accuracy for binary or multiclass classification, macro-F1 for multilabel classification, and Pearson correlation for regression. All results are additionally reported as normalized scores (s̃), where 0 corresponds to dummy-level performance and 1 corresponds to perfect performance, enabling fair cross-task comparisons regardless of metric scale.
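One natural reading of the normalized score is a linear rescaling between a dummy baseline and a perfect score; a minimal sketch (the exact formula used by NeuralBench is not reproduced in the announcement):

```python
# Rescale a raw metric so dummy-level performance maps to 0 and perfect to 1.
def normalized_score(raw: float, dummy: float, perfect: float) -> float:
    return (raw - dummy) / (perfect - dummy)

# Example: balanced accuracy of 0.70 on a binary task where a dummy classifier
# scores 0.50 and a perfect one scores 1.00 -> normalized score of 0.40.
print(normalized_score(0.70, dummy=0.50, perfect=1.00))  # 0.4
```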
A methodological point to note is that some EEG models have been pretrained using datasets which overlap NeuralBench’s downstream evaluation set. Rather than discarding these results, the benchmark flags them with hashed bars in result figures so readers can identify potential pretraining data leakage — no strong trend suggesting leakage inflates performance was observed, but the transparency is preserved.
The benchmark comes in two variants. NeuralBench-EEG-Core v1.0 uses one representative dataset per task to ensure broad coverage, while NeuralBench-EEG-Full v1.0 adds multiple datasets per task to capture within-task variability across recording devices, laboratories, and subject populations. Model rankings on the two variants agree strongly, with a Kendall’s τ of 0.926 (p < 0.001).
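For reference, Kendall's τ quantifies how consistently two rankings order the same items; a toy example with made-up ranks:

```python
# Illustrative only: dummy ranks for a handful of models on Core vs Full.
from scipy.stats import kendalltau

core_ranks = [1, 2, 3, 4, 5, 6]
full_ranks = [1, 3, 2, 4, 5, 6]   # one pair of neighbours swapped
tau, p_value = kendalltau(core_ranks, full_ranks)
print(tau, p_value)
```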

The Two Most Important Findings
Finding 1: Foundation models outperform task-specific ones only marginally. Overall, the top models are LaBraM (rank 0.21), LUNA (40.4M), and REVE (60.2M), but several task-specific models trained from scratch trail closely behind: CTNet (150K parameters, rank 0.32), SimpleConvTimeAgg (4.2M, rank 0.35), and Deep4Net (146K, rank 0.43). CTNet actually overtakes the LUNA foundation model to rank third in the Full variant, despite having roughly 270× fewer parameters. In other words, the gap between foundation models and task-specific models is small enough that simply expanding data coverage can change the global rankings.
Finding 2: Many tasks remain hard. Cognitive decoding tasks, which recover dense representations of images, speech, sentences, video, or words from brain activity, are particularly challenging, with even the best models scoring well below ceiling. Mental imagery, sleep-arousal decoding, cross-subject motor imagery, P300 classification, and psychopathology prediction often yield results close to dummy level. These are the tasks against which future EEG models should be tested.
By contrast, SSVEP classification, pathology and seizure detection, sleep-stage classification, and phenotyping tasks such as age regression or sex classification are approaching saturation.
Beyond EEG: MEG and fMRI
Even in this EEG-focused first release, NeuralBench includes proof-of-concept MEG and fMRI tasks. Notably, the REVE model, pretrained exclusively on EEG data, achieves the best performance of all tested models on the MEG typing-decoding task: an early indication that EEG-pretrained representations can transfer across recording modalities, a hypothesis that will be tested rigorously in the future.
The infrastructure is designed to expand to intracranial EEG, functional near-infrared spectroscopy (fNIRS), and electromyography (EMG).
Getting Started
Installation is a single command: pip install neuralbench. Running the audiovisual stimulus classification EEG task then takes three steps:
neuralbench eeg audiovisual_stimulus --download # Download data
neuralbench eeg audiovisual_stimulus --prepare # Prepare cache
neuralbench eeg audiovisual_stimulus # Run the task
Running all 36 tasks across all 14 EEG models is handled by the -m all_classic all_fm flag, which takes care of orchestration. The storage requirements of the full benchmark are substantial: approximately 11 TB total (~3.2 TB raw data, ~7.8 TB preprocessed cache, ~333 GB logged results), with one GPU of at least 32 GB VRAM recommended per job, although the average peak GPU memory usage measured across experiments is only ~1.3 GB (maximum ~30.3 GB).
Completing the NeuralBench-EEG-Full run requires approximately 1,751 GPU hours across 4,947 runs.
Key Takeaways
- Meta AI’s NeuralBench-EEG v1.0 is an open EEG benchmark — 36 tasks, 94 datasets, 9,478 subjects, and 14 deep learning architectures under one standardized interface.
- Despite up to 270× more parameters, EEG foundation models like REVE only marginally outperform lightweight task-specific models like CTNet (150K params) across the benchmark.
- Cognitive decoding tasks (decoding speech, video, sentences, and words from brain activity) and several clinical prediction tasks remain challenging, with most models scoring near dummy level.
- REVE, pretrained only on EEG data, outperformed all models on MEG typing decoding — an early signal of meaningful cross-modality transfer.
- NeuralBench is released under the MIT license.
Check out the Paper and the GitHub Repo for more details.

