OCR on real documents is still a hard engineering problem. How can OCR move beyond clean demo images to actual documents? And can a compact multimodal model handle parsing, tables, formulas, and structured extraction without wasting compute?
That is the problem GLM-OCR aims at. Researchers from Zhipu AI and Tsinghua University have introduced GLM-OCR, a 0.9B-parameter compact multimodal model for document understanding. It combines a 0.4B CogViT visual encoder, a lightweight cross-modal connector, and a 0.5B GLM language decoder. Its stated objective is to balance document recognition accuracy with low latency while reducing computational cost compared with larger multimodal systems.
Traditional OCR systems handle plain-text transcription well but struggle when a document mixes layouts and elements such as tables, code, seals, or structured fields. The research team argues that multimodal large language models improve document comprehension, but their standard autoregressive decoding makes them costly for large-scale production and edge deployment. GLM-OCR is designed as a small system built specifically for deployment, rather than a vision-language model with OCR bolted on as an afterthought.
A Compact Architecture for OCR Workloads
The first efficiency lever is Multi-Token Prediction (MTP). Standard autoregressive decoding predicts one token at a time, which is suboptimal for OCR-style tasks where outputs are largely deterministic and locally structured. GLM-OCR instead predicts multiple tokens per step: the model is trained to predict 10 tokens per step and generates 5.2 tokens per decoding step on average at inference time, yielding roughly a 50% throughput improvement. To reduce memory usage, the implementation shares parameters across all draft models.
Two-Stage Layout Reading Replaces the Flat Page Reader
At the system level, GLM-OCR adopts a two-stage pipeline. In the first stage, PP-DocLayout-V3 performs layout analysis and detects structured regions on the page. The second stage performs parallel region-level recognition over those detected regions. Crucially, the model does not read a page left to right as a generic model would: it breaks the page into semantically meaningful regions, improving efficiency and making the system more robust on documents with complex layouts.
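The two stages compose naturally as detect-then-fan-out. A minimal sketch, with stand-in functions for both the layout detector and the recognizer (the real system calls PP-DocLayout-V3 and GLM-OCR, not these placeholders):

```python
# Sketch of the two-stage reading pipeline: layout detection yields typed
# regions, which are recognized in parallel instead of scanning the page
# top-to-bottom. detect_layout/recognize_region are illustrative stand-ins.
from concurrent.futures import ThreadPoolExecutor

def detect_layout(page):
    # Stand-in for PP-DocLayout-V3: returns (region_type, crop) pairs.
    return [("title", page[0]), ("table", page[1]), ("paragraph", page[2])]

def recognize_region(region):
    # Stand-in for GLM-OCR's region-level recognition.
    region_type, crop = region
    return f"[{region_type}] {crop}"

def parse_page(page):
    regions = detect_layout(page)            # stage 1: layout analysis
    with ThreadPoolExecutor() as pool:       # stage 2: parallel recognition
        return list(pool.map(recognize_region, regions))

result = parse_page(["Q3 Report", "rev | 1.2M", "Revenue grew 8%"])
```

Because regions are independent after detection, the recognition stage parallelizes cleanly, which is where the robustness and latency benefits on complex layouts come from.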
KIE and Document Parsing Use Different Output Routes
The architecture separates two related document tasks. For document parsing, the pipeline uses layout detection and region-level processing to generate structured outputs such as Markdown or JSON. For Key Information Extraction (KIE), the research team describes a different route: the entire document image is sent to the model along with an instruction prompt, and the model generates the extracted fields directly as JSON. The distinction matters because GLM-OCR is not one monolithic model; it is a system with different modes of operation depending on the task.
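The two routes can be pictured as a simple dispatch. The sketch below uses a dummy model and invented method names (`parse_regions`, `generate`) purely to show the control flow; real inference would go through a serving stack such as vLLM.

```python
# Sketch of the two output routes: document parsing goes through the
# layout+region pipeline to Markdown/JSON, while KIE sends the whole page
# plus an instruction prompt and expects JSON fields directly. The model
# class and method names here are hypothetical.
import json

class DummyModel:
    """Stand-in for GLM-OCR inference."""
    def parse_regions(self, image):
        return "# Title\n\n| a | b |"
    def generate(self, image, prompt):
        return '{"invoice_no": "2024-001", "total": "42.00"}'

def run(model, image, task, prompt=None):
    if task == "parse":
        # Parsing route: layout detection -> region recognition -> Markdown.
        return {"format": "markdown", "content": model.parse_regions(image)}
    if task == "kie":
        # KIE route: full-page image + instruction; output is parsed as JSON.
        return {"format": "json",
                "content": json.loads(model.generate(image, prompt))}
    raise ValueError(f"unknown task: {task}")

fields = run(DummyModel(), "page.png", "kie", "Extract invoice_no and total")
```

Keeping the routes separate means parsing failures (layout detection) and extraction failures (malformed JSON) can be diagnosed independently.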
A Four-Stage Training Pipeline with Task-Specific Rewards
Training proceeds in four stages. Stage 1 trains the vision encoder on image-text pairs and retrieval/grounding data. Stage 2.1 performs multimodal training on document parsing and grounding. Stage 2.2 adds the MTP objective. Stage 3 applies supervised fine-tuning on OCR tasks such as text recognition, formula transcription, table recovery, and KIE. Stage 4 uses reinforcement learning with GRPO, with task-specific rewards: Normalized Edit Distance for text recognition, CDM score for formula recognition, TEDS score for table recognition, and field-level F1 for KIE, combined with structural penalties including repetition penalties, malformed-structure penalties, and JSON validity constraints.
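Two of those reward components are easy to make concrete. The sketch below shows a textbook Normalized Edit Distance and a toy KIE reward combining field-level F1 with a JSON-validity penalty; the penalty value and weighting are illustrative assumptions, not the paper's.

```python
# Sketch of Stage-4-style rewards: an accuracy term per task plus a
# structural penalty for malformed output. Weights/penalties are invented
# for illustration.
import json

def normalized_edit_distance(pred, ref):
    """Levenshtein distance divided by the longer string's length."""
    m, n = len(pred), len(ref)
    d = list(range(n + 1))
    for i in range(1, m + 1):
        prev, d[0] = d[0], i
        for j in range(1, n + 1):
            prev, d[j] = d[j], min(d[j] + 1,        # deletion
                                   d[j - 1] + 1,    # insertion
                                   prev + (pred[i - 1] != ref[j - 1]))
    return d[n] / max(m, n, 1)

def kie_reward(pred_json, gold):
    """Field-level F1, with a hard penalty for invalid JSON (toy values)."""
    try:
        pred = json.loads(pred_json)
    except json.JSONDecodeError:
        return -1.0  # malformed structure is penalized outright
    tp = sum(1 for k, v in pred.items() if gold.get(k) == v)
    precision = tp / max(len(pred), 1)
    recall = tp / max(len(gold), 1)
    return 2 * precision * recall / max(precision + recall, 1e-9)
```

A text-recognition reward would then be something like `1 - normalized_edit_distance(pred, ref)`, so that exact transcriptions earn the maximum reward.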
Benchmark results show strong performance, with important caveats
GLM-OCR reports strong scores on public benchmarks across several document tasks: 94.6 on OmniDocBench v1.5, 94.0 on OCRBench (Text), 96.5 on UniMERNet, 85.2 on PubTabNet, and 86.0 on TEDS_TEST. For KIE, it reports 93.7 on Nanonets-KIE and 86.1 on Handwritten-KIE. The researchers note that results for Gemini-3-Pro and GPT-5.2-2025-12-11 are shown only as a reference and are excluded from the best-score ranking, an important point when interpreting claims about leading models.
The benchmark story needs careful framing. GLM-OCR achieves the highest reported score among non-reference evaluated models on OmniDocBench v1.5, OCRBench (Text), UniMERNet, and TEDS_TEST. On PubTabNet, however, MinerU 2.5 leads overall, reporting 88.4 versus GLM-OCR's 85.2. For KIE, GLM-OCR is the best open-source competitor, but Gemini-3-Pro scores higher on both Nanonets-KIE and Handwritten-KIE in the reference column. The research team therefore supports a strong competitive claim, not a blanket 'best at everything' claim.
Deployment Details
The research team supports GLM-OCR on vLLM, SGLang, and Ollama, and it can be fine-tuned through LLaMA-Factory. They also report throughput of 0.67 images/s and 1.86 PDF pages/s on their evaluation set, and mention a MaaS API priced at 0.2 RMB per token, with cost estimates for PDFs and scanned images. This suggests GLM-OCR is being presented both as a research model and as a system to be deployed.
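Since vLLM exposes an OpenAI-compatible server, a deployment call can be sketched as a standard chat-completions request with an inline image. The served model name `glm-ocr`, the server URL, and the prompt wording below are assumptions; check the model card for the exact chat template before relying on this.

```python
# Sketch of calling a GLM-OCR instance served behind vLLM's
# OpenAI-compatible endpoint (server started separately, e.g. with
# `vllm serve <model-path>`). Model name and prompt are assumptions.
import base64
import json
import urllib.request

def build_payload(b64_image, instruction, model="glm-ocr"):
    """Assemble an OpenAI-style chat request with an inline base64 image."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{b64_image}"}},
                {"type": "text", "text": instruction},
            ],
        }],
    }

def ocr_page(image_path, server="http://localhost:8000/v1"):
    with open(image_path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    req = urllib.request.Request(
        f"{server}/chat/completions",
        data=json.dumps(build_payload(b64, "Parse this page to Markdown.")).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["choices"][0]["message"]["content"]
```

The same request shape works against SGLang's OpenAI-compatible endpoint, which is part of why multi-backend support matters for deployment.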
What you need to know
- GLM-OCR is a compact 0.9B multimodal OCR model, built from a 0.4B CogViT encoder and a 0.5B GLM decoder.
- Multi-Token Prediction improves decoding efficiency, reaching 5.2 tokens per step on average for roughly 50% higher throughput.
- The model uses a two-stage pipeline: PP-DocLayout-V3 detects layout regions, then GLM-OCR performs parallel region-level recognition.
- Document parsing and KIE take different routes: parsing outputs Markdown/JSON via the region pipeline, while KIE generates JSON directly from the full-document image.
- Benchmark results are strong but not universal: GLM-OCR leads several benchmarks among non-reference models, but MinerU 2.5 scores higher on PubTabNet, and Gemini-3-Pro's reference scores are higher on the KIE benchmarks.
Check out the Paper, Repo, and Model Page.

