Every year our hardware gets slightly more powerful, and our models get slightly smarter per parameter. In 2025, it's more feasible than ever to run truly competitive models on-device. dots.ocr, a 3B parameter OCR model from RedNote, surpasses Gemini 2.5 Pro on OmniDocBench, making OCR a truly no-compromises on-device use case. Running models on-device is genuinely appealing to developers: no smuggling API keys, zero cost, and no network required. However, if we want these models to run on-device, we must be mindful of the limited compute and power budgets.
Enter the Neural Engine, Apple's custom AI accelerator that has shipped with every Apple device since 2017. This accelerator is designed for high performance while sipping battery power. Some of our testing has found the Neural Engine to be 12x more power efficient than the CPU, and 4x more power efficient than the GPU.
While this all sounds very appealing, the Neural Engine is unfortunately only accessible through Core ML, Apple's closed-source ML framework. Moreover, even just converting a model from PyTorch to Core ML can present some challenges, and without a pre-converted model or some knowledge of the sharp edges, it can be hard for developers. Luckily, Apple also offers MLX, a more modern and flexible ML framework that targets the GPU (not the Neural Engine), and can be used alongside Core ML.
In this three-part series, we will provide a reasoning trace of how we converted dots.ocr to run on-device, using a
combination of Core ML and MLX. This process should be applicable to many other models, and we hope it helps
highlight the ideas and tools needed by developers trying to run their own models on-device.
To follow along, clone the repo. You will need uv and hf installed to run
the setup command:
./bootstrap.sh
If you just want to skip ahead and use the converted model, you can download it here.
Conversion
Converting from PyTorch to Core ML is a two-step process:
- Capturing your PyTorch execution graph (via torch.jit.trace or, the more modern approach, torch.export).
- Compiling this captured graph to an .mlpackage using coremltools.
While we do have a few knobs we can tweak for step 2, most of our control is in step 1: the graph we feed to coremltools.
Following the programmer's litany of make it work, make it right, make it fast, we will first focus on getting the
conversion working on GPU, in FLOAT32, and with static shapes. Once we have this working, we can dial down the precision and attempt to
move to the Neural Engine.
Dots.OCR
Dots.OCR consists of two key components: a 1.2B parameter vision encoder trained from scratch, based on the NaViT
architecture, and a Qwen2.5-1.5B backbone. We will be using Core ML to run the vision encoder, and MLX to run the LM backbone.
Step 0: Understand and simplify the model
To convert a model, it is best to understand its structure and function before getting started. Looking at the
original vision modelling file
here,
we can see that the vision encoder is similar to the QwenVL family. Like many vision encoders, the dots vision encoder works on a patch basis, in this case 14x14 patches. The dots vision encoder is capable of processing videos and batches of images. This gives us an opportunity to simplify by only processing a single image at a time. This approach is common in on-device apps, where we convert a model that provides the essential functionality and iterate later if we want to process multiple images.
When kicking off the conversion process, it is best to start with a minimal viable model. This means removing any bells
and whistles that are not strictly necessary for the model to operate. In our case, dots has many different attention implementations available for both the vision encoder and the LM backbone. Core ML has a lot of infrastructure oriented around the scaled_dot_product_attention operator, which was introduced in iOS 18. We can simplify the model by removing all of the other attention implementations and focusing on simple sdpa (not the memory-efficient variant) for now, commit here.
Once we have done this, we see a scary warning message when we load the model:
Sliding Window Attention is enabled but not implemented for `sdpa`; unexpected results may be encountered.
The model does not require Sliding Window Attention to operate, so we can happily move on.
Step 1: A simple harness
Using torch.jit.trace is still the most mature method for converting models to Core ML. We usually encapsulate this in
a simple harness that allows you to modify the compute units used and the precision chosen.
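The core of such a harness can look something like this (a simplified sketch; the argument handling and input names are illustrative — see convert.py in the repo for the real version):

```python
import coremltools as ct
import torch

def convert(model, example_input, precision="FLOAT32", compute_units="CPU_AND_GPU"):
    model.eval()
    # Step 1: capture the execution graph
    traced = torch.jit.trace(model, example_input)
    # Step 2: compile the captured graph to an .mlpackage
    mlmodel = ct.convert(
        traced,
        inputs=[ct.TensorType(name="pixel_values", shape=example_input.shape)],
        compute_precision=ct.precision[precision],
        compute_units=ct.ComputeUnit[compute_units],
        minimum_deployment_target=ct.target.iOS18,
    )
    mlmodel.save(f"DotsOCR_{precision}.mlpackage")
```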
You can view the initial harness here. If we run
the following on the original code implementation:
uv run convert.py --precision FLOAT32 --compute_units CPU_AND_GPU
We should run into the first (of many) issues.
Step 2: Bug hunting
It's rare that a model converts on the first try. Often, you'll have to progressively make changes further and further
down the execution graph until you reach the final node.
Our first issue is the following error:
ERROR - converting 'outer' op (located at: 'vision_tower/rotary_pos_emb/192'):
In op "matmul", when x and y are both non-const, their dtype must match, but got x as int32 and y as fp32
Luckily, this error gives us quite a bit of information. We can look at the VisionRotaryEmbedding layer and see the following
code:
def forward(self, seqlen: int) -> torch.Tensor:
seq = torch.arange(seqlen, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
freqs = torch.outer(seq, self.inv_freq)
return freqs
Although torch.arange has a dtype argument, coremltools ignores this for arange and always outputs int32.
We can simply add a cast after the arange to fix this issue, commit here.
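The fix can look something like the following (a sketch; the actual commit may differ slightly):

```python
def forward(self, seqlen: int) -> torch.Tensor:
    seq = torch.arange(seqlen, device=self.inv_freq.device, dtype=self.inv_freq.dtype)
    # coremltools emits int32 for arange regardless of dtype, so cast explicitly
    seq = seq.to(self.inv_freq.dtype)
    freqs = torch.outer(seq, self.inv_freq)
    return freqs
```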
After fixing this, running the conversion again leads us to our next issue, at repeat_interleave:
ERROR - converting 'repeat_interleave' op (located at: 'vision_tower/204'):
Cannot add const [None]
While this error is less informative, we only have a single call to repeat_interleave in our vision encoder:
cu_seqlens = torch.repeat_interleave(grid_thw[:, 1] * grid_thw[:, 2], grid_thw[:, 0]).cumsum(
dim=0,
dtype=grid_thw.dtype if torch.jit.is_tracing() else torch.int32,
)
cu_seqlens is used for masking variable-length sequences in flash_attention_2. It is derived from the grid_thw
tensor, which represents time, height, and width. Since we are only processing a single image, we can simply remove
this call, commit here.
Onto the next! This time, we get a more cryptic error:
ERROR - converting '_internal_op_tensor_inplace_fill_' op (located at: 'vision_tower/0/attn/301_internal_tensor_assign_1'):
_internal_op_tensor_inplace_fill does not support dynamic index
This is again due to the masking logic for handling variable-length sequences. Since we are only processing a single image (not
a video or a batch of images), we do not need attention masking at all! Therefore, we can just use a mask of all True. To prepare ourselves for the Neural Engine conversion, we also
switch from a boolean mask to a float mask of all zeros, since the Neural Engine does not support bool tensors, commit here.
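In sketch form, the attention call reduces to something like this (names and shapes are illustrative):

```python
import torch
import torch.nn.functional as F

def sdpa_single_image(q: torch.Tensor, k: torch.Tensor, v: torch.Tensor) -> torch.Tensor:
    # q, k, v: (batch, num_heads, seq_len, head_dim)
    seq_len = q.shape[-2]
    # An all-zeros additive float mask is a no-op on the attention scores,
    # and avoids bool tensors, which the Neural Engine does not support.
    attn_mask = torch.zeros(seq_len, seq_len, dtype=q.dtype, device=q.device)
    return F.scaled_dot_product_attention(q, k, v, attn_mask=attn_mask)
```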
With all of this done, the model should now successfully convert to Core ML! However, when we run the model, we get the
following error:
error: 'mps.reshape' op the result shape is not compatible with the input shape
This reshape could be in multiple places! Luckily, we can use a previous warning message to help us track down the issue:
TracerWarning: Iterating over a tensor might cause the trace to be incorrect. Passing a tensor of different shape won't change the number of iterations executed (and might lead to errors or silently give incorrect results).
for t, h, w in grid_thw:
Most ML compilers do not like dynamic control flow. Luckily for us, as we are only processing a single image, we can
simply remove the loop and process the single h, w pair, commit here.
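In sketch form, the change is simply (following the variable names from the warning above):

```python
# Before: dynamic iteration over grid_thw, which the tracer cannot capture.
# for t, h, w in grid_thw:
#     ...

# After: we only ever process one image, so take the single (t, h, w) triple.
t, h, w = grid_thw[0]
```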
And there we have it! If we run the conversion again, we should see that the model successfully converts and matches the
original PyTorch precision:
Max difference: 0.006000518798828125, Mean difference: 1.100682402466191e-05
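Numbers like these come from a straightforward comparison of the two outputs, along these lines (a sketch, reusing model, mlmodel, and pixel_values from the harness above; the output key is illustrative):

```python
import numpy as np
import torch

# PyTorch reference output
with torch.no_grad():
    pt_out = model(pixel_values).numpy()

# Core ML output
ml_out = mlmodel.predict({"pixel_values": pixel_values.numpy()})["output"]

diff = np.abs(pt_out - ml_out)
print(f"Max difference: {diff.max()}, Mean difference: {diff.mean()}")
```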
Step 3: Benchmarking
Now that we have the model working, let's evaluate its size and performance. The good news is that the model works; the bad news is that it is over 5GB! That is completely untenable for on-device deployment!
To benchmark the computation time, we can use the built-in Xcode tooling by calling:
open DotsOCR_FLOAT32.mlpackage
which will launch the Xcode inspector for the model. After clicking + Performance Report and running a report on all compute devices, you should see something like the following:

Over a second for a single forward pass of the vision encoder! We have a lot more work to do.
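Xcode's performance report is the most reliable tool here, but a rough sanity check can also be run from Python (a sketch, reusing the mlmodel and pixel_values from before):

```python
import time

inputs = {"pixel_values": pixel_values.numpy()}
mlmodel.predict(inputs)  # warm-up run

n = 10
start = time.perf_counter()
for _ in range(n):
    mlmodel.predict(inputs)
elapsed = (time.perf_counter() - start) / n
print(f"Average latency: {elapsed * 1000:.1f} ms")
```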
In the second part of this series, we will work on the integration between Core ML and MLX to run the full model on-device. In the third part, we will dive deep into the optimizations required to get this model running on the
Neural Engine, including quantization and dynamic shapes.
