WWDC 24 is the moment Apple officially unveiled Apple Intelligence and
reiterated its commitment to efficient, private, and on-device AI.
During the keynote and the sessions that followed, Apple demonstrated
Apple Intelligence, which powers an enormous array of AI-enhanced features
with practical uses for everyday tasks. These are not
*AI-for-the-sake-of-AI* shiny demos. These are time-saving,
appropriate (and fun!) helpers that are deeply integrated with apps and
the OS, and that also offer developers numerous ways to incorporate these
features inside their own apps.
Apple Intelligence features can only work this well
thanks to the vertically integrated software stack that harnesses
Apple Silicon's capabilities to the fullest. Apple also offers a platform for developers to run models on-device, known as
Core ML. This software stack lets you run ML models across all three
compute units (CPU, GPU & Neural Engine) available on Apple Silicon hardware.
In this blog post, we'll explore some of the best new Core ML
features to replicate the Mistral 7B example Apple showcased in the
WWDC 24 Deploy machine learning and AI models on-device with Core
ML
session, where they use a fork of
swift-transformers
to run a state-of-the-art LLM on a Mac. This is a high-quality model
with more than 7 billion parameters that pushes the capabilities of
consumer hardware today. You can also check out the WWDC 24 Bring your
machine learning and AI models to Apple
silicon
session, where part of the Mistral 7B conversion process is shown.
Let's see what steps to take to run it as efficiently as possible, and
learn about the new tools available in iOS 18 & macOS Sequoia.
This is what we'll be building today:
TL;DR
By the end of this blog post, you will have learned about all the new goodies
accompanying the latest macOS release AND you will have successfully run
a 7B parameter model using less than 4GB of memory on your Mac.
Step 1: Clone the preview branch of the swift-transformers repo: git clone -b preview https://github.com/huggingface/swift-transformers
Step 2: Download the converted Core ML models from this Hugging Face repo
Step 3: Run inference using Swift: swift run transformers "Best recommendations for a place to visit in Paris in August 2024:" --max-length 200 Mistral7B-CoreML/StatefulMistralInstructInt4.mlpackage
Best new Core ML features from WWDC 24
Here are some of the most impactful Core ML features from WWDC 24 that we
will use to run Mistral 7B on a Mac.
Swift Tensor
The first feature we want to highlight is an entirely new Swift type for
working with ML tensors, the multi-dimensional data structures that every
ML framework uses. Python developers working on ML are familiar with
numpy arrays or torch tensors, which provide convenient,
high-level interfaces to manipulate these large multi-dimensional
matrices easily. The new MLTensor type provides a high-level
abstraction that mimics those available in Python frameworks, greatly
simplifying working with tensor data in Swift.
Core ML already had multi-dimensional data types in the form of
MLMultiArray
and
MLShapedArray.
However, they were only meant for data storage and simple operations
like wrapping your data and sending it as input to a Core ML model, or
unwrapping results from a Core ML model. Manipulating tensor
data with these APIs is difficult: only a few primitive operations are
provided, and you may have to write your own by accessing the underlying
storage as an opaque pointer to numeric data. This is time-consuming and
error-prone.
The new Tensor type provides a high-level abstraction that mimics
those available in Python frameworks, greatly simplifying working
with tensor data in Swift. Consider a language model like the one we
want to port to Core ML. Language models take an input sequence of
tokens, and they output an estimate of the probabilities of all the
tokens in the vocabulary, meaning that tokens with a high score are
likely to be plausible continuations of the input. The
application's job is to select the best next token to append to the
sequence based on those probabilities. The Tensor type makes it easy to
handle these operations without custom code.
When we released swift-transformers,
we wrote a lot of code (later extended by the community, thanks! ❤️) to
help with input preparation (converting words to tokens) and output
post-processing. For example, check out our softmax operation
using Accelerate. All of this can be removed when using MLTensor, as
softmax is provided out of the box!
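For reference, this is roughly what that post-processing looks like in Python; the Swift version previously needed hand-written Accelerate code for these few lines. The logits values below are made up for illustration:

```python
import torch

# Hypothetical logits over a tiny 5-token vocabulary, as a model would return them.
logits = torch.tensor([[1.2, -0.3, 3.1, 0.4, 2.2]])

probs = torch.softmax(logits, dim=-1)      # turn scores into probabilities
next_token = torch.argmax(probs, dim=-1)   # greedy selection of the next token
print(next_token.item())                   # index of the most plausible continuation
```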
Stateful Buffers
Before WWDC 24, a Core ML model was essentially a pure stateless
function: you provide inputs, and the model returns outputs. However,
sometimes you need to keep a state that depends on previous
computations. The functional programming way to maintain state is
to add an additional input/output pair: based on your inputs and the current
state, the model computes the output and the new state. There's nothing
wrong with this approach, and in fact, that's the way high-performance
frameworks like JAX work.
However, there are practical limitations: the state data must be
sent to the model as an input and retrieved as an output every time you
call the model. If the state data is large, all this back
and forth increases overhead and slows things down. This is particularly
important for LLMs, because you have to run many iterations to generate a
sequence. The performance bottleneck is usually your computer's memory
bandwidth (i.e., how fast you can move data to your GPU and back).
Stateful models solve this problem by reserving a block of memory for
state data and keeping it on the GPU, so you don't have to send and
receive it every time you use the model.
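To make the difference concrete, here is a toy sketch (not Core ML code) contrasting the two conventions: in the functional style the state is shipped in and out on every call, while in the stateful style it lives in a buffer that is allocated once and mutated in place.

```python
import numpy as np

# Functional style: the state is an explicit input and output of every call.
def step_functional(x, state):
    new_state = state + x              # the whole state travels in and out each step
    return 2.0 * new_state, new_state

# Stateful style: the state lives in a pre-allocated buffer that is updated in place.
class StatefulToy:
    def __init__(self, shape):
        self.state = np.zeros(shape)   # allocated once, reused across calls
    def step(self, x):
        self.state += x                # in-place update; nothing is copied back and forth
        return 2.0 * self.state

x = np.ones(4)
out_functional, _ = step_functional(x, np.zeros(4))
out_stateful = StatefulToy(4).step(x)
assert np.allclose(out_functional, out_stateful)
```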
Stateful buffers were introduced in this WWDC 24 session
using a toy example that is easy to understand but not representative of
practical uses with big models such as LLMs. A key performance trick
for transformers-based models is key-value caching (also known as
kv-caching). As shown in the following illustration, it avoids costly
matrix multiplications in the crucial attention block by caching the
results of operations performed in previous steps. We won't go
into details, but the takeaways are: the kv-cache dramatically increases
performance, and it requires a large block of memory that is the perfect
candidate for a stateful buffer. Here's the coremltools user guide
update about stateful models.
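As a rough illustration of why the cache helps (a toy, single-head sketch, not the actual Mistral code): each generation step only projects the new token's key and value, and reuses everything computed in earlier steps.

```python
import torch

d = 8
W_q, W_k, W_v = (torch.randn(d, d) for _ in range(3))
key_cache = torch.zeros(0, d)    # with Core ML stateful buffers, this becomes a pre-allocated state
value_cache = torch.zeros(0, d)

def attend(token):
    """One generation step that reuses cached keys/values from all previous steps."""
    global key_cache, value_cache
    q = token @ W_q
    key_cache = torch.cat([key_cache, token @ W_k])      # only the new token's k/v are computed
    value_cache = torch.cat([value_cache, token @ W_v])
    scores = torch.softmax(q @ key_cache.T / d ** 0.5, dim=-1)
    return scores @ value_cache

for _ in range(5):
    _ = attend(torch.randn(1, d))   # one projection per step instead of re-projecting the whole prefix
```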
New Quantization Techniques
At WWDC 23, we explored a very cool technique called palettization, and
we showed how it could help bring text-to-image models, such as Stable
Diffusion, to Macs and iPhones.
While these techniques let you reduce model size considerably, pushing
them too far has a drastic impact on quality. Larger models suffer
more from this, as their weight data has an extensive dynamic range.
Trying to create a small lookup table (LUT) that captures all possible
values becomes increasingly difficult. The solution introduced at WWDC
24 is to focus on a smaller portion of the data at a time, and create
multiple lookup tables for different areas of the same tensor.
These methods (block-wise quantization) allow us to compress models to
as little as 4-bit precision. Instead of using 4 bytes (the size of a
float32 number) to represent each model parameter, we can get away
with half a byte (a nibble) for each. That's an 8-fold reduction in
model size (minus some overhead to account for the block-wise
quantization tables), or 4 times smaller compared to float16
precision.
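Here is a small numpy sketch of the idea behind block-wise, symmetric 4-bit quantization with block size 32: each block of 32 weights gets its own scale, so a block with outliers doesn't degrade the precision of the rest of the tensor. The numbers are illustrative, not coremltools internals.

```python
import numpy as np

weights = np.random.randn(4096).astype(np.float32)
block_size = 32
blocks = weights.reshape(-1, block_size)

scales = np.abs(blocks).max(axis=1, keepdims=True) / 7.0               # one scale per block of 32
quantized = np.clip(np.round(blocks / scales), -8, 7).astype(np.int8)  # 4-bit signed range
dequantized = (quantized * scales).reshape(-1)                         # decompressed on the fly at runtime

print("max abs error:", float(np.abs(weights - dequantized).max()))
```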
Multifunction Support
We won't use this feature for this example, but we wanted to mention it
here as it was introduced at WWDC 24, and we will be showcasing it in
some upcoming work. Multifunction support essentially allows you to
package LoRA adapters into generative models, so you can use the same model (with
a small set of additional parameters, called adapters) for different
tasks. LoRA is the community's preferred technique for large model
fine-tuning. In diffusion models, for example, you can use LoRA to
generate images with different styles, such as photorealistic or
cartoonish. We believe LoRA is part of the solution that powers Apple's
Genmoji implementation. For language models, LoRA adapters can be used
to adapt a generic LLM to specific tasks or domains.
To read more about LoRA, you can check this post.
To read more about Multifunction, you can check out the Apple coremltools
user guide here.
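As an illustration only (the exact API is documented in the coremltools user guide linked above; the model paths and function names here are hypothetical), merging two adapter-specific models into a single multifunction package looks roughly like this:

```python
import coremltools as ct

# Hypothetical: two .mlpackage models exported with different LoRA adapters baked in.
desc = ct.utils.MultiFunctionDescriptor()
desc.add_function("AdapterA.mlpackage", src_function_name="main", target_function_name="adapter_a")
desc.add_function("AdapterB.mlpackage", src_function_name="main", target_function_name="adapter_b")
desc.default_function_name = "adapter_a"

# A single package exposing both functions; callers pick one at load time.
ct.utils.save_multifunction(desc, "CombinedAdapters.mlpackage")
```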
Converting Mistral 7B to Core ML
The single most important component for running a large language model
efficiently is the kv-cache. As mentioned above, it is a great
candidate for the new stateful model feature
released at WWDC 24. Models in the transformers library already use
efficient attention implementations that rely heavily on kv-caching.
However, the default implementations are optimized for Nvidia GPUs, and
this hardware has a different set of constraints than Apple Silicon
does. In the case of Core ML, we need to pre-allocate the full cache
buffer up front and make sure that every time we call the model, we update
the buffer in place. This avoids inefficient memory allocations and
tensor concatenations, and is also a requirement for Core ML stateful
buffers.
To achieve this goal, we have to use a different attention
implementation that takes these factors into account. This requires modifying the
transformers modeling code for the Mistral architecture, and it's done
in this fragment of code.
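Here is a minimal sketch (not the actual modeling code, which is in the fragment linked above) of what the in-place update looks like: the cache is allocated once for the maximum context length, and each forward pass writes the new keys and values into the right slice instead of concatenating tensors.

```python
import torch

max_context, n_heads, head_dim = 2048, 8, 64
key_cache = torch.zeros(1, n_heads, max_context, head_dim)     # allocated once, up front
value_cache = torch.zeros(1, n_heads, max_context, head_dim)

def update_cache(new_keys, new_values, position):
    """Write the new keys/values at `position` in place; no allocations, no torch.cat."""
    seq_len = new_keys.shape[2]
    key_cache[:, :, position : position + seq_len] = new_keys
    value_cache[:, :, position : position + seq_len] = new_values
    return position + seq_len   # attention then reads key_cache[:, :, : position + seq_len]

pos = update_cache(torch.randn(1, n_heads, 2, head_dim),
                   torch.randn(1, n_heads, 2, head_dim), position=0)
```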
Note: If you want to follow along and replicate the conversion (or
convert another Mistral-based model, like a different fine-tune), you
can use this script
to run all the conversion steps.
Tracing & Conversion
The first step is to load the model. We'll use the patched
implementation with the in-place cache method.
MODEL_ID = "mistralai/Mistral-7B-Instruct-v0.3"
torch_model = StatefulMistralForCausalLM(MODEL_ID)
torch_model.eval()
Before running the Core ML conversion, we need to trace the model with
example inputs. This process records the tensor operations performed on
those inputs, and the traced graph will be translated to Core ML
operations during conversion. We use sample inputs to trace the model;
we don't need real data.
input_ids = torch.zeros((1, 2), dtype=torch.int32)
causal_mask = torch.zeros((1, 1, 2, 5), dtype=torch.float32)
traced_model = torch.jit.trace(torch_model, [input_ids, causal_mask])
The input to a language model is a sequence of tokens of varying length.
We'll allow the input to grow from a single token to a maximum context
length of 2048. We can use
coremltools range
dimensions to specify these bounds.
query_length = ct.RangeDim(lower_bound=1, upper_bound=2048, default=1)
end_step_dim = ct.RangeDim(lower_bound=1, upper_bound=2048, default=1)
inputs = [
    ct.TensorType(shape=(1, query_length), dtype=np.int32, name="inputIds"),
    ct.TensorType(shape=(1, 1, query_length, end_step_dim), dtype=np.float16, name="causalMask"),
]
outputs = [ct.TensorType(dtype=np.float16, name="logits")]
In addition to the sequence tokens (called inputIds in the example
above), there's another input called causalMask, which specifies the
tokens the model must pay attention to. It is mostly used when
generating multiple sequences at the same time using batching. Check out
how these inputs are used in an example runner
here.
In that scenario, all the input sequences in a batch must have the
same length, so we use padding tokens and the causal mask to tell the
model that the padding tokens are not to be considered as inputs.
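To make the mask shape above concrete, here is one hypothetical way to build a causal mask for three new tokens with two tokens already in the cache. Conventions vary, so check the linked runner for the exact construction it uses:

```python
import numpy as np

cached, new = 2, 3                      # 2 tokens already processed, 3 new ones
end_step = cached + new
mask = np.zeros((1, 1, new, end_step), dtype=np.float16)
neg_inf = np.float16(-65504.0)          # most negative finite float16, acts as -inf

for i in range(new):
    # token i may attend to everything up to and including its own position
    mask[0, 0, i, cached + i + 1 :] = neg_inf

print(mask[0, 0])
```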
State Preparation
The PyTorch modeling code uses keyCache and valueCache as the
names of the cache buffers that hold the kv-cache. Those blocks are
allocated for the maximum context length (2048). We use coremltools'
new
StateType
to specify that those blocks must be converted to stateful Core ML
buffers during conversion.
states = [
    ct.StateType(
        wrapped_type=ct.TensorType(shape=torch_model.kv_cache_shape, dtype=np.float16),
        name="keyCache",
    ),
    ct.StateType(
        wrapped_type=ct.TensorType(shape=torch_model.kv_cache_shape, dtype=np.float16),
        name="valueCache",
    ),
]
Core ML Conversion
To convert the model to Core ML, we need to specify the input and output
types, as well as the states. The converted model will use float16
precision because that's what we specified for the input data. We also
need to indicate the minimum deployment target as iOS18, as that's where
these features are available. (We can also use macOS15, which refers
to the same conversion target.)
mlmodel_fp16 = ct.convert(
    traced_model,
    inputs=inputs,
    states=states,
    outputs=outputs,
    minimum_deployment_target=ct.target.iOS18,
    skip_model_load=True,
)
Model Compression
Using the new block-wise quantization techniques described above, we apply
4-bit linear quantization with block size 32. This greatly reduces
model size and makes the model run faster. Even though computation will
still be performed in float16, weights are transferred in 4-bit mode
and decompressed on the fly, which is more efficient than transferring a
large amount of 16-bit weights.
The quantization parameters are configured as follows:
op_config = ct.optimize.coreml.OpLinearQuantizerConfig(
    mode="linear_symmetric",
    dtype="int4",
    granularity="per_block",
    block_size=32,
)
config = ct.optimize.coreml.OptimizationConfig(global_config=op_config)
Let's use that configuration to quantize the model. The following line
will take a few minutes to run:
mlmodel_int4 = ct.optimize.coreml.linear_quantize_weights(mlmodel_fp16, config=config)
mlmodel_int4.save("StatefulMistral7BInstructInt4.mlpackage")
There's a final step after conversion and quantization are done. We need
to include some additional metadata with the model
identifier we used (mistralai/Mistral-7B-Instruct-v0.3). The Swift
code will use it to download the tokenizer files from the Hub.
Tokenization converts text data to the numerical representations
used by models, and it's different for every model.
mlmodel_int4._spec.description.metadata.userDefined.update({
    "co.huggingface.exporters.name": MODEL_ID
})
The generated model is an mlpackage of about 3.8 GB, compared with the
14 GB that a float16 conversion would produce. You can find it
here on the
Hub.
Running Mistral 7B with Swift
If you followed the steps above or downloaded the model from the Hub,
you can run it locally using the preview branch of
swift-transformers. Apple engineers contributed it to the project,
including the following important features:
- Full Tensor support, which greatly simplifies pre- and
post-processing tasks and allows us to delete many lines of
low-level, confusing, and fragile code.
- Support for the Swift counterpart of the Stateful API.
Since adopting these features is a breaking change and requires iOS 18
or macOS 15, we'll keep them in a preview branch for now.
To run the model from the command line, please first clone the preview
branch from the GitHub repo:
git clone -b preview https://github.com/huggingface/swift-transformers
Then run the CLI to test the model:
swift run transformers "Best recommendations for a place to visit in Paris in August 2024:" --max-length 128 Examples/Mistral7B/StatefulMistral7BInstructInt4.mlpackage
For easier testing, you can also use swift-chat, a simple app we
wrote to show how to integrate the swift-transformers package.
You'll need to use the preview branch as well. An example of
swift-chat running the converted Mistral model was shown at the
beginning of this post.
Running Mistral 7B with Python
For those of you who are more familiar with Python, it's just as easy!
python3 generate.py Examples/Mistral7B/StatefulMistral7BInstructInt4.mlpackage --prompt "Best recommendations for a place to visit in Paris in August 2024:"
coremltools makes running Core ML models from Python straightforward.
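Under the hood, generate.py relies on the standard coremltools prediction API. Here is a minimal sketch, assuming the input/output names we used during conversion and the coremltools stateful-model prediction API; the real script also builds the causal mask incrementally, handles tokenization, and samples in a loop:

```python
import numpy as np
import coremltools as ct

model = ct.models.MLModel("StatefulMistral7BInstructInt4.mlpackage")
kv_cache = model.make_state()                       # backs the keyCache / valueCache buffers

input_ids = np.array([[1, 22557]], dtype=np.int32)  # hypothetical token ids for the prompt
causal_mask = np.zeros((1, 1, 2, 2), dtype=np.float16)

outputs = model.predict(
    {"inputIds": input_ids, "causalMask": causal_mask},
    state=kv_cache,
)
next_token = int(np.argmax(outputs["logits"][0, -1]))
```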
What’s Next?
We're extremely excited about the progress in Core ML and
coremltools this year,
and we're looking forward to seeing many third-party apps leveraging
ML models to solve the real tasks people need. On our side, we're committed
to making this as easy as possible so developers can focus on
creating cool apps. A few things on our drawing board:
- The model updates presented here work great for GPUs on Mac
computers. Core ML can also use the Neural Engine, which is particularly
efficient on iPhones. Getting the most performance out of the Neural
Engine requires some additional adaptations, which we plan to carry
out on a few example models. This work will be based on the
learnings discussed in this 2022 (and still very relevant) article by Apple.
We won't run Mistral 7B on iPhone, but there are several smaller
models, like Apple's OpenELM or DCLM, that make for great
candidates to explore!
- The code presented here is highly experimental. As summer goes on,
we plan to adopt these methods and incorporate them into
exporters, a Python tool designed to convert transformers models
to Core ML. Hopefully, you'll soon be able to convert many
interesting model architectures very easily.
- We'll keep working on the
preview branch of
swift-transformers to incorporate new features or API changes as
they're released. If you're interested, keep an eye on it!
How can you help?
The tools Apple released at WWDC help us with our long-term goal of
making AI easy and accessible to all, and we'd love to see where you can
take them. The example we showed is experimental, but you can use it to
convert any Mistral fine-tune to Core ML – please let us know if you do!
If you want to try other model architectures, please feel free to open
issues or PRs against the preview branch of swift-transformers –
we'll try to help you get going!
There's never been a better time than today to apply your creativity to
solve problems that interest you! Go try things, have fun, and let us
know how we can help.


