In the current landscape of computer vision, the standard operating procedure involves a modular ‘Lego-brick’ approach: a pre-trained vision encoder for feature extraction paired with a separate decoder for task prediction. Although effective, the architectural separation between language and visual perception is a bottleneck for scaling.
The research team at the Technology Innovation Institute (TII) is challenging this paradigm with Falcon Perception, a 600M-parameter unified dense Transformer. It is an early-fusion stack that processes image patches and text tokens in a single parameter space from the very first layer, handling both perception and task modelling efficiently.
The Architecture: A Single Stack for Every Modality
Falcon Perception’s core design is built on the premise that a single Transformer can learn visual representations while also performing task-specific generation.
Hybrid Attention with GGRoPE
Unlike standard language models, which rely purely on causal masking, Falcon Perception uses a hybrid attention strategy: image tokens attend to each other bidirectionally, while text and task tokens attend only to preceding tokens for autoregressive prediction.
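Concretely, this hybrid rule can be expressed as "attention is allowed if the key precedes the query, or if both tokens are image patches". The snippet below is a minimal sketch of that rule for one packed sequence, not the model's actual implementation; the token layout is assumed for illustration.

```python
import torch

def hybrid_attention_mask(is_image: torch.Tensor) -> torch.Tensor:
    """Build a boolean attention mask for one sequence.

    is_image: (seq_len,) bool tensor, True where the token is an image patch.
    Returns:  (seq_len, seq_len) bool tensor, True where attention is allowed.

    Image patches attend bidirectionally to each other; text/task tokens
    attend causally (only to earlier positions).
    """
    seq_len = is_image.shape[0]
    idx = torch.arange(seq_len)
    causal = idx[None, :] <= idx[:, None]                   # key index <= query index
    bidirectional = is_image[:, None] & is_image[None, :]   # both tokens are image patches
    return causal | bidirectional

# Example: 4 image patches followed by 3 text/task tokens.
mask = hybrid_attention_mask(torch.tensor([1, 1, 1, 1, 0, 0, 0], dtype=torch.bool))
print(mask.int())
```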
To preserve 2D spatial relationships once image patches are flattened into a sequence, the team uses 3D Rotary Positional Embeddings: Golden Gate RoPE (GGRoPE) decomposes each head’s dimension into sequential and spatial axis components. GGRoPE lets attention heads attend to relative positions at any angle, making them robust to rotations and changes in aspect ratio.
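GGRoPE's exact construction is not spelled out in this summary, so the sketch below only illustrates the underlying axis decomposition: the head dimension is split into three chunks (sequence index, image row, image column) and standard RoPE is applied per axis. The chunk sizes and frequency base are assumptions, and the isotropic "any-angle" rotation trick is not reproduced.

```python
import torch

def rope_rotate(x: torch.Tensor, pos: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Standard RoPE applied to the last dimension of x for positions pos.

    x:   (..., seq, d) with d even.
    pos: (seq,) positions along one axis (sequence index, row, or column).
    """
    d = x.shape[-1]
    freqs = base ** (-torch.arange(0, d, 2, dtype=torch.float32) / d)   # (d/2,)
    angles = pos[:, None].float() * freqs[None, :]                      # (seq, d/2)
    cos, sin = angles.cos(), angles.sin()
    x1, x2 = x[..., 0::2], x[..., 1::2]
    out = torch.empty_like(x)
    out[..., 0::2] = x1 * cos - x2 * sin
    out[..., 1::2] = x1 * sin + x2 * cos
    return out

def axis_decomposed_rope(q, seq_pos, row_pos, col_pos):
    """Split the head dimension into three chunks and rotate each with its own axis.

    q: (seq, head_dim) for a single head; head_dim divisible by 6 for simplicity.
    """
    d = q.shape[-1] // 3
    return torch.cat([
        rope_rotate(q[..., :d], seq_pos),        # 1D sequence position
        rope_rotate(q[..., d:2 * d], row_pos),   # image row
        rope_rotate(q[..., 2 * d:], col_pos),    # image column
    ], dim=-1)

# Toy usage: 6 patch tokens on a 2x3 grid, head_dim = 24.
q = torch.randn(6, 24)
q_rot = axis_decomposed_rope(q,
                             seq_pos=torch.arange(6),
                             row_pos=torch.tensor([0, 0, 0, 1, 1, 1]),
                             col_pos=torch.tensor([0, 1, 2, 0, 1, 2]))
```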
Minimalist Sequence Logic
A basic Chain-of-Perception sequence follows the format:
[Image] [Text] .
The model resolves spatial ambiguity first, emitting location and size as a conditioning signal before it generates the segmentation mask.
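To make the ordering concrete, here is a hypothetical serializer for one instance. The special tokens (<query>, <box>, <mask>) and the coordinate binning are placeholders invented for illustration; the article does not give the actual vocabulary.

```python
# A minimal, hypothetical Chain-of-Perception serializer: the mask tokens come
# only after location and size, so the mask is conditioned on them.

def serialize_instance(query: str, box_xywh, mask_token_ids, num_bins: int = 1000):
    """Emit the text query, then location/size as binned coordinate tokens,
    then the mask tokens."""
    x, y, w, h = box_xywh  # normalized to [0, 1]
    coord_tokens = [f"<bin_{int(round(v * (num_bins - 1)))}>" for v in (x, y, w, h)]
    return ["<query>", query, "<box>", *coord_tokens, "<mask>", *mask_token_ids]

print(serialize_instance("the red car", (0.12, 0.40, 0.25, 0.18), ["<m_17>", "<m_3>"]))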
Muon, FlexAttention & Raster Ordering: Engineering for Scale
The TII team introduced several optimizations to stabilize training and maximize GPU utilization on these heterogeneous sequences:
- Muon optimization: According to the research team, using the Muon optimizer instead of AdamW for the specialized heads (coordinates, size, and segmentation) resulted in lower training loss and improved benchmark performance.
- FlexAttention and sequence packing: A scatter-and-pack strategy lets the model process images at native resolution without wasting compute on padding. Valid patches are packed into fixed-size blocks, and FlexAttention restricts each patch’s attention to its own image (see the sketch after this list).
- Raster order: When multiple objects appear in the same scene, Falcon Perception predicts them in raster order (top-to-bottom, left-to-right). This ordering was chosen because it converges faster and yields lower coordinate loss.
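The packing detail lends itself to a short illustration. Below is a minimal sketch of the standard FlexAttention "document masking" pattern applied to packed image patches, so each patch attends only within its own image; it is not TII's actual packing code, assumes PyTorch 2.5+ with the FlexAttention API, and would normally be wrapped in torch.compile on GPU.

```python
import torch
from torch.nn.attention.flex_attention import flex_attention, create_block_mask

# Three images packed into one 256-token patch sequence (96 + 64 + 96 patches).
# image_id[i] says which packed image the i-th patch belongs to.
image_id = torch.repeat_interleave(torch.tensor([0, 1, 2]),
                                   torch.tensor([96, 64, 96]))

def same_image(b, h, q_idx, kv_idx):
    # Allow attention only between patches that belong to the same packed image.
    return image_id[q_idx] == image_id[kv_idx]

B, H, S, D = 1, 4, image_id.numel(), 64
block_mask = create_block_mask(same_image, B=None, H=None,
                               Q_LEN=S, KV_LEN=S, device="cpu")

q, k, v = (torch.randn(B, H, S, D) for _ in range(3))
out = flex_attention(q, k, v, block_mask=block_mask)  # (B, H, S, D)
```

Raster ordering, by contrast, is purely a data-serialization choice: target objects are sorted top-to-bottom, then left-to-right, before being written into the training sequence.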
The Training Recipe: Distillation and 685 GT
The model is initialized via multi-teacher distillation, distilling knowledge from DINOv3 (ViT-H) for local features and SigLIP2 (So400m) for language-aligned features (a sketch of this setup follows the stage list below). After initialization, the model goes through a three-stage training pipeline totalling about 685 gigatokens (GT):
- Listing in context (450 GT): The model learns to ‘list’ the scene inventory, building global context.
- Task Alignment (225 GT): The model switches to independent queries, using query masking to ensure it grounds each query solely on the image.
- Long-Context Finetuning (10 GT): The masking limit is raised to 600 to handle extreme density.
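The article does not specify the distillation objective, so the sketch below only illustrates one common multi-teacher setup: the student's patch features are projected by small linear heads and aligned to each frozen teacher with a cosine loss. The loss form, weights, and feature dimensions are assumptions.

```python
import torch
import torch.nn.functional as F

def multi_teacher_distill_loss(student_feats, dino_feats, siglip_feats,
                               proj_dino, proj_siglip, w_dino=1.0, w_siglip=1.0):
    """Align student patch features with two frozen teachers via projection heads.

    student_feats: (N, d_s) patch features from the unified Transformer.
    dino_feats:    (N, d_dino) frozen DINOv3 (ViT-H) patch features.
    siglip_feats:  (N, d_sig)  frozen SigLIP2 (So400m) patch features.
    """
    loss_dino = 1 - F.cosine_similarity(proj_dino(student_feats), dino_feats, dim=-1).mean()
    loss_siglip = 1 - F.cosine_similarity(proj_siglip(student_feats), siglip_feats, dim=-1).mean()
    return w_dino * loss_dino + w_siglip * loss_siglip

# Toy usage with assumed dimensions (student 512, DINOv3 ViT-H 1280, SigLIP2 So400m 1152).
proj_dino = torch.nn.Linear(512, 1280)
proj_siglip = torch.nn.Linear(512, 1152)
loss = multi_teacher_distill_loss(torch.randn(32, 512),
                                  torch.randn(32, 1280), torch.randn(32, 1152),
                                  proj_dino, proj_siglip)
```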
Task-specific serialization is applied throughout these stages. For example, dedicated tokens make the model commit to a binary decision about whether an object is present before it attempts localization.
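As a rough illustration of that gating step, the snippet below compares the logits of two hypothetical presence tokens before any coordinates are decoded. The token ids and the greedy decision rule are placeholders, not Falcon Perception's actual vocabulary or decoding code.

```python
import torch

def presence_gate(next_token_logits: torch.Tensor, present_id: int, absent_id: int) -> bool:
    """Commit to a binary decision before generating coordinates: return True
    only if the 'present' token outscores the 'absent' token."""
    return bool(next_token_logits[present_id] > next_token_logits[absent_id])

# Toy usage: with these fake logits the gate says "absent",
# so no box or mask tokens would be decoded for this query.
logits = torch.zeros(32)
logits[7], logits[8] = 0.2, 1.3      # present_id=7, absent_id=8
print(presence_gate(logits, present_id=7, absent_id=8))   # False
```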
Profiling Capabilities Beyond Saturated Baselines
To gauge progress, the TII research team introduced PBench, a benchmark that splits samples into complexity levels in order to expose model failure modes.
Main Results: Falcon Perception vs. SAM 3 (Macro F1)
| Benchmark Split | SAM 3 | Falcon Perception (600M) |
| --- | --- | --- |
| Simple Objects | 64.3 | 65.1 |
| Attributes | 54.4 | 63.6 |
| OCR-Guided (L2) | 24.6 | 38.0 |
| Spatial Understanding | 31.6 | 53.5 |
| Relationships (L4) | 33.3 | 49.1 |
| Dense Split | 58.4 | 72.6 |
Falcon Perception outperforms SAM 3 most clearly on complex semantic tasks, including a +21.9-point gain on Level 3 spatial understanding.

FalconOCR – The 300M Document Specialist
The TII team also applied this early-fusion recipe to FalconOCR, a compact 300M-parameter model initialized from scratch to prioritize fine-grained glyph recognition. FalconOCR competes with proprietary OCR products and modular pipelines:
- olmOCR: Achieves 80.3% accuracy, comparable to or better than Gemini 3 Pro (80.2%) and GPT 5 (69.8%).
- OmniDocBench: Scores an overall 88.64, ahead of GPT 5 (86.56) and Mistral OCR3 (85.20), but trailing the leading modular pipeline, PaddleOCR 1.5 (94.37).
The Key Takeaways
- Unified Early-Fusion Architecture: Falcon Perception replaces modular encoder-decoder pipelines with a single dense Transformer that processes image patches and text tokens in one shared parameter space. A hybrid attention mask (bidirectional for visual tokens, causal for task tokens) lets it act simultaneously as a vision encoder and an autoregressive decoder.
- Chain-of-Perception Sequences: The model serializes instance segmentation into a structured sequence, forcing it to emit spatial position and size as a conditioning signal before generating the pixel-level mask.
- Specialized Heads & GGRoPE: To manage dense spatial data, the model uses Fourier Feature Encoders (FFE) for high-dimensional coordinate mapping (a generic sketch follows this list) and Golden Gate RoPE (GGRoPE) for isotropic 2D attention. For these heads, the Muon optimizer balances learning rates against the pre-trained backbone.
- Semantic Performance Improvements: On the new PBench benchmark, which separates semantic abilities into Levels 0-4, the 600M model shows significant gains over SAM 3, including a +13.4-point lead on OCR-guided queries and a +21.9-point lead on spatial understanding.
- High-Efficiency OCR Extension: The architecture scales down to FalconOCR, a 300M-parameter model that achieves 80.3% on olmOCR and 88.64 on OmniDocBench, matching or exceeding larger systems such as Gemini 3 Pro and GPT 5, while maintaining strong performance on large-document processing.
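The Fourier Feature Encoders mentioned above are described only at a high level, so the sketch below shows the standard Fourier-feature mapping the name refers to: low-dimensional coordinates are lifted into a higher-dimensional space with sin/cos features at geometrically spaced frequencies. The frequency count and normalization are assumptions, not the model's actual configuration.

```python
import torch

def fourier_features(coords: torch.Tensor, num_freqs: int = 8) -> torch.Tensor:
    """Map low-dimensional coordinates to a high-dimensional embedding.

    coords: (..., d) values assumed normalized to [0, 1] (e.g. x, y, w, h).
    Returns: (..., d * 2 * num_freqs) feature vector.
    """
    freqs = 2.0 ** torch.arange(num_freqs, dtype=torch.float32)   # 1, 2, 4, ...
    angles = 2 * torch.pi * coords[..., None] * freqs             # (..., d, num_freqs)
    feats = torch.cat([angles.sin(), angles.cos()], dim=-1)       # (..., d, 2*num_freqs)
    return feats.flatten(-2)

# Example: encode a normalized (x, y, w, h) box into a 64-dim vector.
box = torch.tensor([0.12, 0.40, 0.25, 0.18])
print(fourier_features(box).shape)   # torch.Size([64])
```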
Check out the Paper, Model Weights, and Repo for technical details.

