TL;DR — Falcon Perception is a 0.6B-parameter early-fusion Transformer for open-vocabulary grounding and segmentation from natural language prompts. The model processes image patches + text in a single sequence using a hybrid attention mask, and produces variable numbers of instances with a small, structured token interface and lightweight output heads. On SA-Co, Falcon Perception reaches 68.0 Macro-F1 (vs. 62.3 for SAM 3), with the main remaining gap being presence calibration (MCC 0.64 vs. 0.82). We also introduce PBench, a diagnostic benchmark that breaks down performance by capability (attributes, OCR-guided disambiguation, spatial constraints, relations) and by dense long-context crowded scenes.
We also release Falcon OCR, a 0.3B-parameter model that scores 80.3 on the olmOCR benchmark and 88.6 on OmniDocBench, while delivering the best throughput of any open-source OCR model.
This post is a brief, practical write-up of what we built, why we built it this way, and what we learned along the way.
The problem: why do perception systems end up as pipelines?
Many open-vocabulary perception systems are built as modular pipelines: an (often frozen) vision backbone extracts features, a separate fusion/decoder stage combines them with language, and extra components handle matching and post-processing. This family of designs works well in many settings, but it comes with trade-offs: it can be hard to scale cleanly, hard to attribute improvements to the right component, and easy to accumulate complexity as a new fix is added for every failure mode.
We asked a simpler question: can a single early-fusion Transformer backbone handle both perception and language modeling, given the right attention pattern, output interface, and training signal?
In our experiments, the answer is largely yes. The rest of this post describes the main design choices and the evidence behind them.
The architecture: early fusion, hybrid attention, and an efficient dense interface
A single autoregressive Transformer processes a unified sequence of image patches, text, and task tokens.
The model predicts object properties in a fixed order: coordinate token → size token → segmentation token.
Bounding box coordinates and sizes are decoded via specialized heads and re-injected as Fourier features.
High-resolution segmentation masks are generated by a dot product between the segmentation token's embedding and upsampled image features.
One Backbone, Two Behaviors
At its core, Falcon Perception is a dense Transformer that processes image patches and text tokens in a shared parameter space from the first layer. Instead of a separate vision backbone followed by a late-fusion decoder, we keep a single backbone and rely on masking and a lightweight output interface to make the dense prediction problem tractable.
Images and text have different structure: pixels are 2D and benefit from bidirectional context, while the prediction interface is naturally sequential. We address this with a hybrid attention mask:
- Image tokens attend to all other image tokens bidirectionally, building a global visual context (like a vision encoder would).
- Text and task tokens attend causally to everything before them — the full visual prefix plus preceding text.
This allows the same backbone to behave like a bidirectional visual encoder on image tokens, while still supporting autoregressive prediction over task tokens.
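As a concrete sketch, here is what such a hybrid mask looks like as a boolean matrix. This is a minimal NumPy illustration with toy sizes; the actual implementation expresses the pattern through FlexAttention mask functions rather than a materialized matrix.

```python
import numpy as np

def hybrid_attention_mask(num_image_tokens: int, num_text_tokens: int) -> np.ndarray:
    """Boolean attention mask (True = query may attend to key).

    Image tokens form the sequence prefix and attend to each other
    bidirectionally; text/task tokens attend causally to the full image
    prefix plus all preceding text tokens.
    """
    n = num_image_tokens + num_text_tokens
    mask = np.zeros((n, n), dtype=bool)
    # Image block: full bidirectional attention among image tokens.
    mask[:num_image_tokens, :num_image_tokens] = True
    # Text/task block: standard causal attention over the whole prefix.
    for i in range(num_image_tokens, n):
        mask[i, : i + 1] = True
    return mask

m = hybrid_attention_mask(4, 3)
assert m[0, 3]                   # image token sees a *later* image token
assert not m[0, 4]               # image tokens never see text
assert m[5, 2] and not m[5, 6]   # text tokens: causal over image + text prefix
```

The image block is the only departure from a plain causal mask, which is what lets one backbone serve both behaviors.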
Chain-of-Perception: coarse-to-fine supervision for dense outputs
Dense perception isn’t a fixed-size prediction problem: an image may contain zero instances or hundreds. Autoregressive generation gives a clean variable-length interface, but fully autoregressive dense generation (e.g., polygons or high-resolution masks token-by-token) quickly becomes expensive.
We use a small structured interface, Chain-of-Perception, which decomposes each instance into three steps:
coordinate token → size token → segmentation token
- Coordinate token: The model first predicts the center of the instance — resolving which object it’s talking about.
- Size token: Then the spatial extent — resolving how big it is.
- Segmentation token: Finally, a single embedding that, when dot-producted with upsampled image features, produces a full-resolution binary mask.
This ordering is deliberate. Committing to geometry first reduces ambiguity (“which instance?”), and makes the mask prediction step closer to pixel refinement conditioned on the resolved object.
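To make the three-step interface concrete, here is a toy NumPy sketch of decoding one instance. Every name, shape, and head here is an illustrative stand-in, not the released model's API; in the real model the decoded geometry is also re-injected as Fourier features before the next token.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for the backbone state and upsampled image features.
d, H, W = 64, 32, 32
hidden = rng.normal(size=d)            # backbone hidden state at one step
up_feats = rng.normal(size=(d, H, W))  # content-aware upsampled features

coord_head = rng.normal(size=(2, d)) * 0.1  # toy linear head -> (cx, cy)
size_head = rng.normal(size=(2, d)) * 0.1   # toy linear head -> (w, h)

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Step 1: coordinate token resolves *which* object (normalized center).
cx, cy = sigmoid(coord_head @ hidden)
# Step 2: size token resolves *how big* (normalized extent).
w, h = sigmoid(size_head @ hidden)
# Step 3: segmentation token's embedding, dot-producted with the
# upsampled features, yields a full-resolution mask in one shot.
logits = np.einsum("d,dhw->hw", hidden, up_feats)
mask = logits > 0

assert mask.shape == (H, W)
```

The point of the sketch is the control flow: geometry is committed first, and the mask step reduces to a single dot product against precomputed features rather than token-by-token mask generation.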
Specialized Heads, Minimal Overhead
The backbone is shared, while decoding uses lightweight heads tailored to the output type:
- Coordinate & Size Heads use Fourier feature encoding: mapping continuous coordinates through a random Gaussian projection into a high-dimensional sinusoidal space. This overcomes the spectral bias of neural networks, yielding more precise localization than discrete binning alone. Decoded coordinates are re-injected into the sequence as conditioning for subsequent tokens.
- Segmentation Head computes a dot product between the segmentation token’s hidden state and content-aware upsampled image features. Since the segmentation token is produced after geometry and has access to early-fused visual context, we can avoid the separate mask-query machinery and Hungarian matching that usually appear in decoder-based instance segmentation training.
PBench: a benchmark designed to isolate what’s missing
Existing referring-expression benchmarks like RefCOCO are saturated — models routinely hit 90%+ — and they conflate what went wrong. Did the model fail because it can’t read text? Can’t understand spatial relationships? Can’t handle a crowd?
We introduce PBench, a diagnostic benchmark that separates samples by the dominant capability required:
| Level | Capability | Example Prompt |
|---|---|---|
| L0 | Simple objects | “car” |
| L1 | Attributes & subtypes | “red car”, “broken fence” |
| L2 | OCR-guided identification | “Diet Coke bottle”, “Nike shoes” |
| L3 | Spatial understanding | “car on the left”, “third window from left” |
| L4 | Relations & interactions | “person holding umbrella”, “tallest building” |
| Dense | Crowdedness stress test | Hundreds of instances per image |
Each sample targets one dominant capability: OCR prompts avoid spatial qualifiers, and spatial prompts avoid in-image text disambiguators. This yields a capability profile rather than a single opaque score, and makes it easier to decide where to invest next (data, training curriculum, or post-training).
Training: distillation, large-scale data, and a three-stage recipe
Multi-Teacher Distillation
Rather than training from random weights (which in our ablations was unstable for segmentation), Falcon Perception initializes via multi-teacher distillation. Two strong vision teachers contribute complementary signals:
- DINOv3 (ViT-H): strong local features critical for segmentation
- SigLIP2: language-aligned features for open-vocabulary understanding
The distilled initialization achieves 74.25% zero-shot accuracy on ImageNet-1k and 85.11% linear-probe mIoU on Pascal VOC, providing a robust visual foundation before perception-specific training.
Data: 54M Images, 195M Positive Expressions, 488M Hard Negatives
We construct the training set through a multi-stage pipeline:
- Hierarchical clustering of web-scraped images via DINOv3 embeddings to ensure uniform concept coverage.
- VLM-driven listing generates dense object descriptions per image, categorized by PBench complexity level (60% basic, 40% advanced).
- Negative mining produces semantic, visual, and fine-grained hard negatives to combat hallucination.
- Ensemble consensus — SAM 3, Qwen3-VL-30B, and Moondream3 must agree (IoU > 0.8) for automatic acceptance.
- Human verification — disagreements go to annotators, recovering hard samples that confuse automated systems.
We maintain a strict 1:1 ratio of positive to negative samples. This makes presence calibration a first-class goal: the model should reliably say “absent,” not just draw masks when confident.
The Three Stages (~700 GT Total)
Stage 1 — In-Context Listing (450 GT): The model learns to autoregressively list scene inventories — predicting text expressions and their locations. Full causal attention between queries enables learning of object co-occurrence (“fork, then knife, then plate”). This builds broad scene understanding.
Stage 2 — Task Alignment (225 GT): The attention mask is modified so queries cannot see each other, simulating independent queries at inference time. Loss on text tokens is masked, focusing the gradient signal entirely on presence classification and localization. This stage transitions from “scene understanding” to “answer this specific query.”
Stage 3 — Long-Context Finetuning (10 GT): A brief phase with the mask limit raised to 600 per expression and a minimal constant learning rate. This adapts the model for extreme crowd density without forgetting earlier capabilities.
Key design choices validated through ablations:
- Muon optimizer for the specialized heads (vs. AdamW) — yields +4.8 points on SA-Co detection
- Raster ordering of instances (vs. random/size) — +10 points over random ordering on SA-Co
- Gram feature regularization — prevents drift from the distillation features, improving segmentation by +1.5 points
- Global loss normalization across ranks — corrects bias from variable-length packed sequences in FSDP
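The last point is easy to get wrong, so here is a toy illustration of the bias and the fix. The two functions simulate, on plain lists, what per-rank averaging versus a global all-reduce of loss sums and token counts would compute in an FSDP setup.

```python
import numpy as np

def local_mean_then_average(losses_per_rank):
    """Naive: each rank averages over its own tokens, then the rank means
    are averaged. Ranks holding short packed sequences get over-weighted."""
    return float(np.mean([np.sum(l) / len(l) for l in losses_per_rank]))

def global_normalization(losses_per_rank):
    """Corrected: sum the loss and the token count across ranks (i.e. the
    effect of all_reduce(SUM) on both), then divide once, so every token
    contributes equally regardless of which rank holds it."""
    total = sum(float(np.sum(l)) for l in losses_per_rank)
    count = sum(len(l) for l in losses_per_rank)
    return total / count

# Rank 0 packs 8 tokens at loss 1.0; rank 1 packs only 2 tokens at loss 3.0.
rank0 = np.full(8, 1.0)
rank1 = np.full(2, 3.0)
print(local_mean_then_average([rank0, rank1]))  # 2.0 (rank 1 over-weighted)
print(global_normalization([rank0, rank1]))     # 1.4 (true per-token mean)
```

With variable-length packing the per-rank token counts can differ by large factors, so the naive estimator is not just noisier: it is systematically biased toward whatever data happens to pack short.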
Results
SA-Co: Best-in-Class Mask Quality
On the SA-Co open-vocabulary segmentation benchmark, Falcon Perception (0.6B parameters) achieves 68.0 Macro-F1, compared to 62.3 for SAM 3, with large gains on attribute-heavy (+8.2), food & drink (+12.2), and sports equipment (+4.0) splits. At the same time, Falcon Perception lags SAM 3 on presence calibration (MCC: 0.64 vs 0.82), which is the clearest remaining improvement axis.
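For readers less familiar with MCC as a presence-calibration metric, here is its definition on the binary present/absent decision, with toy confusion counts (illustrative only, not our actual confusion matrices) chosen to bracket the 0.64 vs 0.82 range above:

```python
import math

def mcc(tp: int, fp: int, fn: int, tn: int) -> float:
    """Matthews correlation coefficient over binary presence decisions.

    Unlike accuracy, MCC stays informative under class imbalance and
    punishes a model that draws masks for absent concepts.
    """
    num = tp * tn - fp * fn
    den = math.sqrt((tp + fp) * (tp + fn) * (tn + fp) * (tn + fn))
    return num / den if den else 0.0

# A model right 80% of the time on both present and absent prompts:
print(mcc(80, 20, 20, 80))  # 0.6
# A model right 90% of the time on both:
print(mcc(90, 10, 10, 90))  # 0.8
```

Because the benchmark enforces a balanced positive/negative mix (the 1:1 ratio above), MCC here roughly tracks how far presence accuracy sits above chance on both sides.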
Here’s an example output — the prompt “Falcon” produces precise instance masks:
Falcon Perception also performs well on referring expressions, correctly segmenting the burger with a black bun in each frame of the video:
PBench: Scaling with Prompt Complexity
This is where the early-fusion design shows the largest differences:
| Capability | SAM 3 | Falcon Perception | Gap |
|---|---|---|---|
| L0: Simple objects | 64.3 | 65.1 | +0.8 |
| L1: Attributes | 54.4 | 63.6 | +9.2 |
| L2: OCR-guided | 24.6 | 38.0 | +13.4 |
| L3: Spatial | 31.6 | 53.5 | +21.9 |
| L4: Relations | 33.3 | 49.1 | +15.8 |
| Dense | 58.4 | 72.6 | +14.2 |
On simple objects, the gap is modest. As prompts become more compositional—requiring OCR-guided disambiguation, spatial constraints, or relational binding—the gap widens.
In our PBench Dense split, Falcon Perception (0.6B) substantially outperforms generalist VLM baselines (e.g., 72.6 vs 8.9 for Qwen3-VL-30B in our evaluation setup), and matches or exceeds the 8B model on spatial and relational tiers.
Qualitative Results: OCR, Spatial, Relational, and Dense
As prompts grow more compositional — requiring OCR-guided disambiguation, spatial constraints, relational binding, or scaling to hundreds of instances — the early-fusion advantage becomes visually clear:
- OCR-Guided Grounding (Level 2): When the distinguishing signal is text written on an object, Falcon Perception reads it correctly while SAM 3 cannot differentiate.
- Spatial Understanding (Level 3): When prompts specify spatial relationships, Falcon Perception forms a coherent 2D scene map.
- Relational Reasoning (Level 4): When the goal is defined through interactions somewhat than appearance, Falcon Perception understands the scene graph.
- Dense Scenes: Scaling to Hundreds of Instances: The autoregressive interface is especially useful when scenes are extremely crowded, where fixed-query decoders can run into practical limits.
Level 2 — OCR-Guided Grounding: Falcon Perception reads text on objects to disambiguate; SAM 3 cannot.
“168 wine bottles”: Falcon Perception identifies the bottles labeled “168”,
while SAM 3 highlights every bottle. “Honolulu direction sign”: Falcon reads the text to find the right sign.
Level 3 — Spatial Understanding: Falcon Perception resolves spatial constraints; SAM 3 returns false positives.
“Lower meat skewer on left grill,” “black car to the right of red car at bottom,”
“Belgian flag on the left” — Falcon Perception resolves the right instance from spatial constraints.
SAM 3 predicts false positives for multiple candidates.
Level 4 — Relational Reasoning: Falcon Perception understands interactions; SAM 3 ignores relational constraints.
“Pastry next to brown round bread,” “person using phone,”
“person holding helmet in hand” — Falcon Perception identifies the interacting instance.
SAM 3 highlights all instances of the object class, ignoring the relational constraint.
Dense Scenes: Falcon Perception scales to hundreds of instances; SAM 3’s decoder runs out of query tokens.
“Snow goose,” “pigeon,” “colourful canned drinks” — Falcon Perception autoregressively
segments hundreds of instances. SAM 3’s fixed-size decoder runs out of query tokens beyond ~200 instances.
Falcon OCR: extending early fusion to document understanding
Modern OCR has moved well beyond extracting text from clean scans. Today’s systems must handle multi-column layouts, mathematical formulas, tables, charts, and multilingual content — all in one pass. Most OCR VLMs tackle this with a familiar recipe: a vision encoder feeding a separate text decoder, plus task-specific glue. These systems work, but they tend to be large (1B–3B+ parameters).
We took a different path: reuse the same early-fusion dense Transformer from Falcon Perception, but train a smaller 0.3B-parameter variant from scratch specifically for OCR. The result is Falcon OCR — a single backbone that processes image patches and text tokens in a shared parameter space with the same hybrid attention mask (bidirectional for image tokens, causal for text tokens), and switches tasks through prompts rather than additional modules.
We trained from scratch (no multi-teacher distillation) because the visual features OCR needs — fine-grained glyph recognition, stroke-level discrimination — differ substantially from the object-level features useful for segmentation. Starting fresh lets the backbone develop text-optimized representations from the ground up.
Training
We train on a curated English-language mixture spanning three core tasks: general document text parsing (digital PDFs, old scans, typewritten documents), mathematical and scientific formula recognition, and table structure recognition. The mixture also includes handwriting, real-world scene text, and artificial samples generated from rendered LaTeX and HTML sources. The training objective is pure next-token prediction on structured text outputs.
Training proceeds in two phases: a long pre-training phase at a constant learning rate where the model learns core OCR capabilities across all element types, followed by a short cosine-decay finetuning phase where the learning rate is annealed to near zero.
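A minimal sketch of this constant-then-cosine schedule. All constants here (base rate, phase split, floor) are illustrative, not the actual training hyperparameters.

```python
import math

def lr_at(step: int, total_steps: int, base_lr: float = 3e-4,
          finetune_frac: float = 0.1, min_lr: float = 1e-6) -> float:
    """Constant LR for the long pre-training phase, then cosine decay to
    near zero over the final `finetune_frac` of training."""
    decay_start = int(total_steps * (1 - finetune_frac))
    if step < decay_start:
        return base_lr
    # Cosine anneal from base_lr down to min_lr over the finetuning phase.
    t = (step - decay_start) / max(1, total_steps - decay_start)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * t))

assert lr_at(0, 1000) == 3e-4        # flat through pre-training
assert lr_at(899, 1000) == 3e-4
assert lr_at(1000, 1000) < 1e-5      # annealed to near zero at the end
```

The constant phase keeps the data mixture easy to extend mid-run (no schedule to restart), and the short decay at the end recovers the final-loss benefit of annealing.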
Benchmark results
We evaluate on olmOCR (binary correctness checks across diverse inputs) and OmniDocBench (continuous metrics over full-page parses). All comparison models are significantly larger and/or use proprietary infrastructure. At 80.3% on olmOCR with only 0.3B parameters, Falcon OCR is within 1.7 points of the top system and leads all models on Multi-Column (87.1%) and Tables (90.3%). On OmniDocBench it scores 88.64 overall, ahead of DeepSeek OCR v2, GPT 5.2, and Mistral OCR 3.
Serving throughput
At 0.3B parameters, Falcon OCR is roughly 3x smaller than 0.9B-class OCR VLMs, which translates directly into higher serving throughput. Measured on a single A100-80GB with vLLM at high concurrency:
| Mode | tok/s | img/s | Description |
|---|---|---|---|
| Layout + OCR | 5,825 | 2.9 | Full pipeline: layout detection → crop → per-region OCR |
The compact footprint and vLLM integration (continuous batching, PagedAttention, optimized CUDA kernels) make it practical for large-scale document digitization where millions of pages need processing.
What we see in the results
More broadly, these results suggest that the early-fusion single-stack Transformer is a viable alternative to the “vision encoder plus text decoder” recipe for OCR. One backbone, shared parameter space, one decoding interface, and better data and training signals rather than increasingly complex pipelines. We hope this encourages more work in this direction.
Qualitative examples
Falcon OCR processes images captured under difficult real-world conditions with varying lighting, diverse text semantics (mathematical formulae, structured tables, handwritten notes), and complex document layouts, producing structured text output.
Handwriting and Real-world Images: Accurate transcription of handwritten text and in-the-wild captures under adverse conditions.
Falcon OCR extracts text from handwritten documents and real-world photographs with variable lighting, orientation, and content complexity.
Table Extraction: Faithful reproduction of tabular structure and cell content across diverse formats.
Falcon OCR accurately reproduces cell entries and structural layout from tables of various formats and complexity.
Mathematical Formulae: Accurate recognition of equations across various levels of symbolic complexity.
Falcon OCR correctly transcribes mathematical expressions ranging from simple equations to multi-line derivations with nested operators.
Complex Document Layouts: Faithful text extraction from multi-column, mixed-content documents.
Falcon OCR preserves reading order and structural fidelity when extracting text from documents with multi-column layouts, figures, and footnotes.
Inference: Fast, Practical, and Open
The release includes an inference stack built on PyTorch’s FlexAttention, which makes it practical to express the custom attention patterns and efficiently serve packed variable-length sequences.
Paged Inference Engine
- Paged KV cache with virtual page tables (no wasted memory from padding)
- Continuous batching: new sequences enter mid-generation, finished ones release pages immediately
- CUDA graph capture for the decode loop
- Background tokenization overlapped with GPU compute
- HR feature cache: LRU cache with pinned-memory buffers for async GPU-CPU transfer of upsampled image features — subsequent queries on the same image skip the expensive upsampling step
In our setup on an H100, typical latencies are on the order of ~100ms prefill, ~200ms upsampling (0ms if cached), and ~50ms decode for a handful of instances. (These numbers depend on resolution, sequence length, and the number of predicted instances.)
Docker and MLX Integration for Falcon-OCR
For the Falcon-OCR model, we also provide a vLLM Docker server for fast deployment and MLX integration for Apple Silicon.
Please check out the GitHub repo for details.
The Bigger Picture: A “Bitter Lesson” for Perception
Falcon Perception is intentionally minimal: one backbone, one objective family, and small heads only where outputs are continuous and dense. The working assumption is that most gains should come from data, compute, and training signals, rather than continually expanding the pipeline with specialized modules.
The architecture doesn’t block any obvious scaling path: add more images and harder prompts for better grounding, mix in text-only data for better language, increase context length for denser scenes. It’s still just one sequence model.
Falcon Perception is developed by the Falcon Vision Team at the Technology Innovation Institute (TII), Abu Dhabi, UAE.
Citation
If you use Falcon-Perception, please cite:
@article{bevli2026falcon,
title = {Falcon Perception},
author = {Bevli, Aviraj and Chaybouti, Sofian and Dahou, Yasser and Hacid, Hakim and Huynh, Ngoc Dung and Le Khac, Phuc H. and Narayan, Sanath and Para, Wamiq Reyaz and Singh, Ankit},
journal = {arXiv preprint arXiv:2603.27365},
year = {2026},
url = {https://arxiv.org/abs/2603.27365}
}



