Nous Research has released Hermes 4, a family of open-weight models at 14B, 70B, and 405B parameters (based on Llama 3.1 checkpoints) that achieves frontier-level performance using only post-training methods. Hermes 4 introduces hybrid reasoning: models can toggle between standard responses and explicit reasoning, wrapped in <think>...</think> tags, when complex problems need deeper consideration.
What makes Hermes 4 significant is that it achieves the highest performance among open-weight models while maintaining complete transparency and a neutral alignment philosophy. It shows that sophisticated reasoning can be achieved using only open-source methodologies.
DataForge: Graph-Based Synthetic Data Generation
DataForge is the core component behind Hermes 4. It is a graph-based synthetic data generation system that changes how training data is created. Unlike traditional curation approaches, DataForge operates over a directed acyclic graph (DAG) in which each node implements a PDDL (Planning Domain Definition Language) action interface.
Each node specifies preconditions, postconditions, and transformations, which allows complex data pipelines to be composed. Using pre-training seed data from DCLM and FineWeb, the system can, for example, transform a Wikipedia article into a rap song and then generate instruction-answer pairs based on that transformation.
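As a rough illustration of the node idea described above, here is a minimal sketch of a pipeline whose nodes each carry a precondition, a transformation, and a postcondition. The names `Node`, `run_pipeline`, `to_rap`, and `to_qa` are hypothetical and are not DataForge's actual API; the string transforms are trivial stand-ins for what would be LLM calls in a real system, and a linear chain is used as the simplest possible DAG.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Sample = Dict[str, str]


@dataclass
class Node:
    """Hypothetical PDDL-style action: precondition -> transform -> postcondition."""
    name: str
    precondition: Callable[[Sample], bool]
    transform: Callable[[Sample], Sample]
    postcondition: Callable[[Sample], bool]

    def apply(self, sample: Sample) -> Sample:
        # Refuse to run if the incoming sample is not in the expected state.
        if not self.precondition(sample):
            raise ValueError(f"{self.name}: precondition failed")
        out = self.transform(sample)
        # Check that the transformation produced the promised state.
        if not self.postcondition(out):
            raise ValueError(f"{self.name}: postcondition failed")
        return out


def run_pipeline(nodes: List[Node], sample: Sample) -> Sample:
    """Apply nodes in order; a linear chain is the simplest DAG."""
    for node in nodes:
        sample = node.apply(sample)
    return sample


# Stand-in transforms (assumption: a real system would call a model here).
to_rap = Node(
    name="article_to_rap",
    precondition=lambda s: s.get("type") == "article",
    transform=lambda s: {"type": "rap", "text": "Yo, " + s["text"]},
    postcondition=lambda s: s["type"] == "rap" and bool(s["text"]),
)

to_qa = Node(
    name="rap_to_instruction_pair",
    precondition=lambda s: s.get("type") == "rap",
    transform=lambda s: {
        "type": "qa",
        "instruction": "Summarize this rap: " + s["text"],
        "answer": s["text"][4:],  # strip the "Yo, " prefix added above
    },
    postcondition=lambda s: s["type"] == "qa" and "answer" in s,
)

if __name__ == "__main__":
    seed = {"type": "article", "text": "Paris is the capital of France."}
    print(run_pipeline([to_rap, to_qa], seed))
```

Because each node checks its own preconditions, an invalid ordering (for example, running `to_qa` directly on an article) fails immediately instead of silently producing a malformed sample.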
This approach generates approximately 5 million samples totaling 19 billion tokens. Reasoning samples are intentionally token-heavy, averaging five times more tokens than their non-reasoning counterparts to accommodate thinking traces up to 16,000 tokens long.
Rejection Sampling at Unprecedented Scale
Hermes 4 uses Atropos, Nous Research's open-source reinforcement learning environment, to implement rejection sampling across approximately 1,000 task-specific verifiers. This extensive verification infrastructure filters for high-quality reasoning trajectories across diverse domains.
Key verification environments include:
- Answer Format Training: rewards correct output formatting across 150+ formats
- Instruction Following: uses RLVR-IFEval tasks with complex constraints
- Schema Adherence: JSON generation validated against Pydantic models
- Tool Use: training for agentic behavior
This rejection sampling process creates a large corpus of verified reasoning trajectories, including multiple distinct solution paths that reach the same verified result. The model therefore learns robust reasoning patterns rather than memorizing specific solution templates.
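The filtering idea can be sketched in a few lines. The example below is only an illustration of verifier-based rejection sampling, not the Atropos API: `json_schema_verifier` and `rejection_sample` are hypothetical names, and a plain `json` key check stands in for the Pydantic-based schema validation mentioned above.

```python
import json
from typing import Callable, Iterable, List


def json_schema_verifier(required_keys: Iterable[str]) -> Callable[[str], bool]:
    """Build a verifier that accepts JSON objects containing all required keys.

    Assumption: a stdlib stand-in for schema validation with Pydantic models.
    """
    keys = list(required_keys)

    def verify(candidate: str) -> bool:
        try:
            obj = json.loads(candidate)
        except json.JSONDecodeError:
            return False
        return isinstance(obj, dict) and all(k in obj for k in keys)

    return verify


def rejection_sample(candidates: List[str], verifier: Callable[[str], bool]) -> List[str]:
    """Keep every distinct candidate that passes the verifier, in input order."""
    seen = set()
    kept = []
    for cand in candidates:
        if cand not in seen and verifier(cand):
            seen.add(cand)
            kept.append(cand)
    return kept


if __name__ == "__main__":
    verifier = json_schema_verifier(["name", "age"])
    samples = [
        '{"name": "Ada", "age": 36}',   # valid
        '{"name": "Ada"}',              # missing key, rejected
        'not json at all',              # unparsable, rejected
        '{"age": 36, "name": "Ada"}',   # valid, different surface form
        '{"name": "Ada", "age": 36}',   # duplicate, dropped
    ]
    print(rejection_sample(samples, verifier))
```

Keeping distinct candidates that all pass the same verifier mirrors the point made above: several different paths to one verified result are retained rather than a single template.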
Length Control: Solving Overlong Generation
One of Hermes 4's most innovative contributions addresses the overlong reasoning problem, in which reasoning models generate excessively long chains of thought without terminating. The research team found that the 14B model reached its maximum context length 60% of the time on LiveCodeBench in reasoning mode.
Their solution is a second supervised fine-tuning stage that teaches the model to stop reasoning at exactly 30,000 tokens:
- Generate reasoning traces from the current policy
- Insert </think> tokens at exactly 30,000 tokens
- Train only on the termination decision, not on the reasoning chain
- Apply gradient updates only to the </think> and <eos> tokens
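The steps above can be sketched with plain token lists. This is a minimal illustration under stated assumptions, not the team's training code: `force_close` and `termination_loss_mask` are hypothetical helper names, tokens are represented as strings rather than vocabulary IDs, and the `<eos>` spelling of the end-of-sequence token is assumed.

```python
from typing import List

THINK_END = "</think>"   # closing reasoning tag
EOS = "<eos>"            # assumed spelling of the end-of-sequence token


def force_close(trace: List[str], budget: int = 30_000) -> List[str]:
    """Cut a reasoning trace at the token budget and append the closing tag."""
    return trace[:budget] + [THINK_END]


def termination_loss_mask(tokens: List[str]) -> List[int]:
    """Return 1 for termination tokens and 0 elsewhere.

    Only positions with a 1 would contribute to the loss, so the reasoning
    chain itself receives no gradient signal.
    """
    return [1 if tok in (THINK_END, EOS) else 0 for tok in tokens]


if __name__ == "__main__":
    # Toy trace with a tiny budget so the result is easy to inspect.
    trace = [f"t{i}" for i in range(10)]
    closed = force_close(trace, budget=4)
    full = closed + ["answer", EOS]
    print(full)
    print(termination_loss_mask(full))
```

Masking every position except the termination tokens is what keeps the learning signal focused on when to stop rather than on what was reasoned.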
The results are notable: overlong generation falls by 78.4% on AIME'24, 65.3% on AIME'25, and 79.8% on LiveCodeBench, at a modest cost in accuracy. By focusing the learning signal entirely on the termination decision, the method avoids the risk of model collapse while effectively teaching "counting behavior."


Benchmark Performance and Neutral Alignment
Hermes 4 demonstrates state-of-the-art performance among open-weight models. In reasoning mode, the 405B model achieves 96.3% on MATH-500, 81.9% on AIME'24, 78.1% on AIME'25, 70.5% on GPQA Diamond, and 61.3% on LiveCodeBench.
Its performance on RefusalBench is particularly notable: it achieves 57.1% in reasoning mode, the highest score among the evaluated models and well ahead of GPT-4o (17.67%) and Claude Sonnet 4 (17%). This reflects the model's willingness to engage with contentious topics while keeping appropriate boundaries in place.

Technical Architecture and Training
Hermes 4 is trained with a modified version of TorchTitan across 192 NVIDIA B200 GPUs. The system handles a highly heterogeneous sample length distribution through efficient packing (achieving >99.9% batch efficiency), flex attention, and loss masking in which only assistant-role tokens contribute to the cross-entropy loss.
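To make the packing and masking ideas concrete, here is a small sketch. It is not the TorchTitan implementation: the article does not say which packing algorithm is used, so a greedy first-fit-decreasing heuristic is swapped in as an assumption, and the function names are hypothetical.

```python
from typing import List


def pack_first_fit_decreasing(lengths: List[int], capacity: int) -> List[List[int]]:
    """Pack sample lengths into fixed-size context windows.

    Assumption: first-fit-decreasing is used here only as a simple stand-in
    for whatever packing strategy the real training stack applies.
    """
    bins: List[List[int]] = []
    free: List[int] = []  # remaining room in each bin
    for n in sorted(lengths, reverse=True):
        if n > capacity:
            raise ValueError("sample longer than the context window")
        for i, room in enumerate(free):
            if n <= room:
                bins[i].append(n)
                free[i] -= n
                break
        else:
            # No existing bin had room, so open a new one.
            bins.append([n])
            free.append(capacity - n)
    return bins


def batch_efficiency(bins: List[List[int]], capacity: int) -> float:
    """Fraction of token slots holding real tokens rather than padding."""
    return sum(sum(b) for b in bins) / (len(bins) * capacity)


def assistant_loss_mask(roles: List[str]) -> List[int]:
    """Per-token mask: 1 where the token belongs to an assistant turn."""
    return [1 if role == "assistant" else 0 for role in roles]


if __name__ == "__main__":
    packed = pack_first_fit_decreasing([6, 4, 5, 5, 3, 7], capacity=10)
    print(packed, batch_efficiency(packed, capacity=10))
    print(assistant_loss_mask(["system", "user", "assistant", "assistant", "user"]))
```

In a real run the capacity would be the 16,384-token context length, and the mask would be multiplied into the per-token cross-entropy loss so that system and user tokens contribute nothing.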
Training follows a cosine learning rate schedule with 300 warmup steps and 9,000 total steps, at a context length of 16,384 tokens and a global batch size of 384 samples.
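The schedule described above can be sketched as follows. The step counts, context length, and batch size come from the text; the linear warmup shape, the `peak_lr` and `min_lr` parameters, and the function name `cosine_lr` are assumptions made for illustration, since the article does not state them.

```python
import math

WARMUP_STEPS = 300
TOTAL_STEPS = 9_000
CONTEXT_LENGTH = 16_384
GLOBAL_BATCH = 384


def cosine_lr(step: int, peak_lr: float, min_lr: float = 0.0) -> float:
    """Learning rate at a given step: linear warmup, then cosine decay.

    Assumption: warmup is linear and decay ends at min_lr.
    """
    if step < WARMUP_STEPS:
        return peak_lr * (step + 1) / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / (TOTAL_STEPS - WARMUP_STEPS)
    progress = min(progress, 1.0)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1.0 + math.cos(math.pi * progress))


# Upper bound on tokens seen per optimizer step if every sequence is fully packed.
TOKENS_PER_STEP = GLOBAL_BATCH * CONTEXT_LENGTH

if __name__ == "__main__":
    for s in (0, 299, 300, 4650, 9000):
        print(s, cosine_lr(s, peak_lr=1.0))
    print("tokens per step (upper bound):", TOKENS_PER_STEP)
```

With the stated settings, one optimizer step covers at most 384 × 16,384 = 6,291,456 tokens.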
Summary
Hermes 4 marks a significant advance in open-source AI, showing that high-level reasoning can be developed without proprietary training data or closed development processes. By combining graph-based synthetic data generation, large-scale rejection sampling, and length control, Nous Research has produced models that match leading proprietary systems in performance while maintaining neutral alignment and steerability.
Asif Razzaq is the CEO of Marktechpost Media Inc. As an entrepreneur, Asif is passionate about harnessing the potential of artificial intelligence to benefit society. His most recent venture is Marktechpost, a platform covering machine learning and deep learning news that aims to be both technically sound and accessible to a broad audience. The platform draws over 2 million views per month.

