Evaluating AI models trained on brain signals has long been difficult. Different research groups use different preprocessing pipelines, train models on different datasets, and report results on a narrow set of tasks, making it nearly impossible to know which model actually works best, or for what. Meta AI has developed a new framework to address this issue.
Meta AI researchers have released NeuralBench, an open-source framework for benchmarking AI models of brain activity. The first version, NeuralBench-EEG v1.0, is the largest open benchmark of its kind: 36 downstream tasks, 94 datasets, 9,478 subjects, 13,603 hours of electroencephalography (EEG) data, and 14 deep learning architectures evaluated under a single standardized interface.
The Problem NeuralBench Solves
NeuroAI, the field where neuroscience meets deep learning, has grown rapidly in recent years. Self-supervised techniques originally developed for images, language, and speech are now being applied to building brain foundation models: models pretrained on unlabeled brain recordings and then fine-tuned for downstream tasks such as clinical seizure detection or decoding the information a person sees or hears.
The evaluation landscape, however, is fragmented. MOABB, for example, covers up to 148 BCI datasets but evaluates only 5 downstream tasks. Other efforts such as EEG-Bench, EEG-FM-Bench, and AdaBrain-Bench are each constrained in their own ways, and for modalities like magnetoencephalography (MEG) and functional magnetic resonance imaging (fMRI) there is no systematic benchmark at all.
The result: claims that foundation models are “generalizable” or “foundational” often rest on a handful of selected tasks, with no common point of reference.
What is NeuralBench?
NeuralBench is built from three core Python packages that form a modular pipeline:
- NeuralFetch handles dataset acquisition, including data from OpenNeuro, DANDI, and NEMAR.
- NeuralSet prepares the data as PyTorch dataloaders, using established neuroscience tools such as MNE-Python and nilearn for preprocessing, and HuggingFace models to extract stimulus embeddings for tasks involving images, text, or speech.
- NeuralTrain runs training and evaluation, built on PyTorch, Pydantic, and PyTorch Lightning, with Exca as the execution and caching library.
Installed via pip install neuralbench, the framework is driven through a command-line interface (CLI). Running a single task takes just three commands: download the data, build the cache, and execute. Every task is configured through a lightweight YAML file that specifies the data source, train/validation/test splits, preprocessing steps, target processing, training hyperparameters, and evaluation metrics.
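The exact YAML schema is not reproduced in the announcement; the sketch below is a hypothetical illustration of the kind of per-task configuration described, modeled with Pydantic (which NeuralTrain builds on). All field names are illustrative, not the actual NeuralBench schema.

```python
# Hypothetical illustration of a per-task configuration; field names are made up.
from pydantic import BaseModel


class SplitConfig(BaseModel):
    strategy: str = "cross_subject"      # or "leave_concept_out", "within_subject"
    test_fraction: float = 0.2


class TrainingConfig(BaseModel):
    optimizer: str = "adamw"
    learning_rate: float = 1e-4
    weight_decay: float = 0.05
    max_epochs: int = 50
    early_stopping_patience: int = 10


class TaskConfig(BaseModel):
    dataset: str                          # e.g. an OpenNeuro accession ID
    preprocessing: list[str] = ["bandpass_filter", "resample"]
    target: str = "classification"
    split: SplitConfig = SplitConfig()
    training: TrainingConfig = TrainingConfig()
    metric: str = "balanced_accuracy"


# A YAML file with matching keys could then be validated into this model, e.g.:
# TaskConfig.model_validate(yaml.safe_load(open("task.yaml")))
```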

What NeuralBench-EEG v1.0 Covers
This first release covers EEG tasks across eight categories: cognitive decoding (image, sentence, speech/typing, video, and word decoding), brain-computer interfaces (BCI), evoked responses, clinical tasks, internal state, sleep, phenotyping, and miscellaneous.
Three classes of models are compared:
- Task-specific architectures (~1.5K–4.2M parameters, trained from scratch): ShallowFBCSPNet, Deep4Net, EEGNet, BDTCN, ATCNet, EEGConformer, SimpleConvTimeAgg, and CTNet.
- EEG foundation models (~3.2M–157.1M parameters, pretrained and fine-tuned): BENDR, LaBraM, BIOT, CBraMod, LUNA, and REVE.
- Handcrafted-feature baselines: sklearn pipelines that build symmetric positive-definite (SPD) matrix representations and feed them into logistic or ridge regression (a sketch in this spirit follows this list).
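The announcement does not spell these pipelines out in detail. The sketch below is one plausible baseline in that spirit, assuming covariance matrices as the SPD representation and the pyriemann package for the tangent-space projection; both are assumptions, not the confirmed NeuralBench implementation.

```python
# A plausible sklearn-style SPD baseline (illustrative; not NeuralBench's actual code).
import numpy as np
from sklearn.pipeline import make_pipeline
from sklearn.linear_model import LogisticRegression
from pyriemann.estimation import Covariances
from pyriemann.tangentspace import TangentSpace

X = np.random.randn(200, 32, 512)        # trials x channels x time (dummy EEG)
y = np.random.randint(0, 2, size=200)    # dummy binary labels

# Covariance matrices are SPD; projecting them into the tangent space yields a
# flat vector representation that a linear classifier can consume.
clf = make_pipeline(
    Covariances(estimator="oas"),
    TangentSpace(),
    LogisticRegression(max_iter=1000),
)
clf.fit(X, y)
```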
All foundation models are fine-tuned end-to-end using a shared training recipe — AdamW optimizer, learning rate of 10⁻⁴, weight decay of 0.05, cosine-annealing with 10% warmup, up to 50 epochs with early stopping (patience=10). The sole exception is BENDR, for which the learning rate is lowered to 10⁻⁵ and gradient clipping is applied at 0.5 to obtain stable learning curves. This intentional standardization otherwise removes model-specific optimization tricks — such as layer-wise learning rate decay, two-stage probing, or LoRA — so that architecture and pretraining methodology are what actually gets evaluated.
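As a rough sketch (not the actual NeuralTrain code), the shared recipe could be assembled in plain PyTorch roughly as follows; the model and step counts are placeholders, and early stopping would be handled by the training loop (e.g. PyTorch Lightning).

```python
# Minimal sketch of the shared fine-tuning recipe: AdamW, 10% linear warmup,
# then cosine annealing. Illustrative only; not the NeuralTrain implementation.
import math
import torch

def build_optimizer_and_scheduler(model, total_steps, lr=1e-4,
                                  weight_decay=0.05, warmup_frac=0.10):
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr, weight_decay=weight_decay)
    warmup_steps = int(warmup_frac * total_steps)

    def lr_lambda(step):
        # Linear warmup, then cosine decay over the remaining steps.
        if step < warmup_steps:
            return step / max(1, warmup_steps)
        progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
        return 0.5 * (1.0 + math.cos(math.pi * progress))

    return optimizer, torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)

model = torch.nn.Linear(64, 2)                        # stand-in for an EEG model
optimizer, scheduler = build_optimizer_and_scheduler(model, total_steps=50 * 100)

# BENDR is the stated exception: lr=1e-5 plus gradient clipping, e.g.
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=0.5) before each step.
```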
Data splits are predefined for each task to reflect realistic generalization constraints: leave-concept-out splits for cognitive decoding (all subjects are seen during training, but a set of stimuli is held out for testing), cross-subject splits for most clinical and BCI tasks, and within-subject splits for datasets with very few participants. Each model is then trained 3 times per task with different random seeds.
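To illustrate what a cross-subject split enforces, here is a minimal, hypothetical example using scikit-learn's GroupShuffleSplit on dummy data (NeuralSet's actual splitting code is not shown in the announcement):

```python
# Cross-subject split: every trial from a held-out subject goes to the test set,
# so the model is never evaluated on subjects it saw during training.
import numpy as np
from sklearn.model_selection import GroupShuffleSplit

n_trials = 1000
X = np.random.randn(n_trials, 64, 256)               # trials x channels x time (dummy EEG)
y = np.random.randint(0, 2, size=n_trials)            # dummy labels
subjects = np.random.randint(0, 50, size=n_trials)    # subject ID per trial

splitter = GroupShuffleSplit(n_splits=1, test_size=0.2, random_state=0)
train_idx, test_idx = next(splitter.split(X, y, groups=subjects))
assert set(subjects[train_idx]).isdisjoint(subjects[test_idx])
```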
Evaluation metrics are standardized per task type: top-5 accuracy for retrieval, balanced accuracy for binary or multiclass classification, macro-F1 for multilabel classification, and Pearson correlation for regression. All results are additionally reported as normalized scores (s̃), where 0 corresponds to dummy-level performance and 1 corresponds to perfect performance, enabling fair cross-task comparisons regardless of metric scale.
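One natural reading of the normalized score is a linear rescaling between a dummy baseline and a perfect score; a minimal sketch (the exact formula used by NeuralBench is not reproduced in the announcement):

```python
# Rescale a raw metric so dummy-level performance maps to 0 and perfect to 1.
def normalized_score(raw: float, dummy: float, perfect: float) -> float:
    return (raw - dummy) / (perfect - dummy)

# Example: balanced accuracy of 0.70 on a binary task where a dummy classifier
# scores 0.50 and a perfect one scores 1.00 -> normalized score of 0.40.
print(normalized_score(0.70, dummy=0.50, perfect=1.00))  # 0.4
```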
A methodological point to note is that some EEG models have been pretrained using datasets which overlap NeuralBench’s downstream evaluation set. Rather than discarding these results, the benchmark flags them with hashed bars in result figures so readers can identify potential pretraining data leakage — no strong trend suggesting leakage inflates performance was observed, but the transparency is preserved.
The benchmark comes in two variants. NeuralBench-EEG-Core v1.0 uses one representative dataset per task to ensure broad coverage, while NeuralBench-EEG-Full v1.0 adds multiple datasets per task to capture within-task variability across recording devices, laboratories, and subject populations. Model rankings on the two variants agree strongly, with a Kendall’s τ of 0.926 (p < 0.001).
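For reference, Kendall's τ quantifies how consistently two rankings order the same items; a toy example with made-up ranks:

```python
# Illustrative only: dummy ranks for a handful of models on Core vs Full.
from scipy.stats import kendalltau

core_ranks = [1, 2, 3, 4, 5, 6]
full_ranks = [1, 3, 2, 4, 5, 6]   # one pair of neighbours swapped
tau, p_value = kendalltau(core_ranks, full_ranks)
print(tau, p_value)
```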

The Two Most Important Findings
Finding 1: Foundation models outperform task-specific ones only marginally. Overall, the top models are LaBraM (rank 0.21), LUNA (40.4M), and REVE (60.2M), but several task-specific models trained from scratch trail closely behind: CTNet (150K parameters, rank 0.32), SimpleConvTimeAgg (4.2M, rank 0.35), and Deep4Net (146K, rank 0.43). CTNet actually overtakes the LUNA foundation model to rank third in the Full variant, despite having roughly 270× fewer parameters. In other words, the gap between foundation models and task-specific models is small enough that simply expanding data coverage can change the global rankings.
Finding 2: Many tasks remain hard. Cognitive decoding tasks, which recover dense representations of images, speech, sentences, video, or words from brain activity, are particularly challenging, with even the best models scoring well below ceiling. Mental imagery, sleep-arousal decoding, cross-subject motor imagery, P300 classification, and psychopathology prediction often yield results close to dummy level. These are the tasks against which future EEG models should be tested.
By contrast, SSVEP classification, pathology and seizure detection, sleep-stage classification, and phenotyping tasks such as age regression or sex classification are approaching saturation.
Beyond EEG: MEG and fMRI
Even in this EEG-focused first release, NeuralBench includes proof-of-concept MEG and fMRI tasks. Notably, the REVE model, pretrained exclusively on EEG data, achieves the best performance of all tested models on the MEG typing-decoding task: an early indication that EEG-pretrained representations can transfer across recording modalities, a hypothesis that will be tested rigorously in the future.
The infrastructure is designed to expand to intracranial EEG, functional near-infrared spectroscopy (fNIRS), and electromyography (EMG).
Getting Started
Installation is a single command: pip install neuralbench. Running the audiovisual stimulus classification EEG task then takes three steps:
neuralbench eeg audiovisual_stimulus --download # Download data
neuralbench eeg audiovisual_stimulus --prepare # Prepare cache
neuralbench eeg audiovisual_stimulus # Run the task
Running all 36 tasks across all 14 EEG models is handled by the -m all_classic all_fm flag, which takes care of orchestration. The storage requirements of the full benchmark are substantial: approximately 11 TB total (~3.2 TB raw data, ~7.8 TB preprocessed cache, ~333 GB logged results), with one GPU of at least 32 GB VRAM recommended per job, although the average peak GPU memory usage measured across experiments is only ~1.3 GB (maximum ~30.3 GB).
Completing the NeuralBench-EEG-Full run requires approximately 1,751 GPU hours across 4,947 runs.
Key Takeaways
- Meta AI’s NeuralBench-EEG v1.0 is an open EEG benchmark — 36 tasks, 94 datasets, 9,478 subjects, and 14 deep learning architectures under one standardized interface.
- Despite up to 270× more parameters, EEG foundation models like REVE only marginally outperform lightweight task-specific models like CTNet (150K params) across the benchmark.
- Cognitive decoding tasks (decoding speech, video, sentences, and words from brain activity) and several clinical prediction tasks remain challenging, with most models scoring near dummy level.
- REVE, pretrained only on EEG data, outperformed all models on MEG typing decoding — an early signal of meaningful cross-modality transfer.
- NeuralBench is released under the MIT license.
Check out the Paper and the GitHub Repo for more details.

