Running Privacy-Preserving Inferences on Hugging Face Endpoints



By Benoit Chevallier-Mames

This is a guest blog post by the Zama team. Zama is an open-source cryptography company building state-of-the-art FHE solutions for blockchain and AI.

Eighteen months ago, Zama started Concrete ML, a privacy-preserving ML framework with bindings to traditional ML frameworks such as scikit-learn, ONNX, PyTorch, and TensorFlow. To ensure the privacy of users' data, Zama uses Fully Homomorphic Encryption (FHE), a cryptographic tool that allows computations to be performed directly over encrypted data, without ever knowing the private key.

From the beginning, we wanted to pre-compile some FHE-friendly networks and make them available somewhere on the web, so that users could use them trivially. We are ready today! And not in some random place on the web, but directly on Hugging Face.

More precisely, we use Hugging Face Endpoints and custom inference handlers to store our Concrete ML models and let users deploy them on HF machines in one click. By the end of this blog post, you will understand how to use pre-compiled models and how to prepare your own. This blog can also be considered another tutorial for custom inference handlers.



Deploying a pre-compiled model

Let's start with deploying an FHE-friendly model (prepared by Zama or third parties – see the Preparing your pre-compiled model section below to learn how to prepare your own).

First, look for the model you want to deploy: we have pre-compiled a bunch of models on Zama's HF page (or you can find them with tags). Let's suppose you have chosen concrete-ml-encrypted-decisiontree: as explained in the description, this pre-compiled model allows you to detect spam without looking at the message content in the clear.

Like with any other model available on the Hugging Face platform, select Deploy and then Inference Endpoint (dedicated):

[Screenshot: Inference Endpoint (dedicated)]

Next, choose the Endpoint name and the region, and, most importantly, the CPU (Concrete ML models do not use GPUs for now; we are working on it) as well as the best machine available – in the example below we chose eight vCPUs. Now click on Create Endpoint and wait for the initialization to finish.

[Screenshot: Create Endpoint]

After a few seconds, the Endpoint is deployed, and your privacy-preserving model is ready to operate.

[Screenshot: Endpoint is created]

Note: Don't forget to delete the Endpoint (or at least pause it) when you are not using it, or else it will cost more than anticipated.



Using the Endpoint



Installing the client side

The goal is not only to deploy your Endpoint but also to let your users play with it. For that, they need to clone the repository on their computer. This is done by choosing Clone Repository in the dropdown menu:

[Screenshot: Clone Repository]

They will be given a small command line that they can run in their terminal:

git clone https://huggingface.co/zama-fhe/concrete-ml-encrypted-decisiontree

Once the command is done, they go to the concrete-ml-encrypted-decisiontree directory and open play_with_endpoint.py in their editor. Here, they will find the line with API_URL = … and should replace it with the new URL of the Endpoint created in the previous section.

API_URL = "https://vtx9w974oxrq54ff.us-east-1.aws.endpoints.huggingface.cloud"

Of course, fill it in with your own Endpoint's URL. Also, define an access token and store it in an environment variable:

export HF_TOKEN=[your token hf_XX..XX]

Lastly, your users' machines need to have Concrete ML installed locally: make a virtual environment, source it, and install the necessary dependencies:

python3.10 -m venv .venv
source .venv/bin/activate
pip install -U setuptools pip wheel
pip install -r requirements.txt

Remark that we currently force the use of Python 3.10 (which is also the default Python version used in Hugging Face Endpoints). This is because our development files currently depend on the Python version. We are working on making them independent. This should be available in a further version.



Running inferences

Now, your users can run inference on the Endpoint by launching the script:

python play_with_endpoint.py

It should generate some logs similar to the following:

Sending 0-th piece of the key (remaining size is 71984.14 kbytes)
Storing the key in the database under uid=3307376977
Sending 1-th piece of the key (remaining size is 0.02 kbytes)
Size of the payload: 0.23 kilobytes
for 0-th input, prediction=0 with expected 0 in 3.242 seconds
for 1-th input, prediction=0 with expected 0 in 3.612 seconds
for 2-th input, prediction=0 with expected 0 in 4.765 seconds

(...)

for 688-th input, prediction=0 with expected 1 in 3.176 seconds
for 689-th input, prediction=1 with expected 1 in 4.027 seconds
for 690-th input, prediction=0 with expected 0 in 4.329 seconds
Accuracy on 691 samples is 0.8958031837916064
Total time: 2873.860 seconds
Duration per inference: 4.123 seconds
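
The first lines of these logs correspond to the client-side key setup that play_with_endpoint.py performs before the first inference: the FHE keys are generated locally with Concrete ML's FHEModelClient, and the (potentially large) evaluation keys are uploaded to the Endpoint in pieces. The sketch below illustrates that step; the chunk size, the uid handling, and the payload field names (such as evaluation_keys) are illustrative assumptions and may differ from the actual script.

import os

import requests
from concrete.ml.deployment import FHEModelClient

API_URL = "https://<your-endpoint>.endpoints.huggingface.cloud"  # replace with your Endpoint URL
HEADERS = {"Authorization": f"Bearer {os.environ['HF_TOKEN']}"}

# Load the client-side artifacts (client.zip) shipped in the cloned repository
fhemodel_client = FHEModelClient(path_dir=".", key_dir=".fhe_keys")

# Generate the private and evaluation keys locally; the private key never leaves this machine
fhemodel_client.generate_private_and_evaluation_keys()
evaluation_keys = fhemodel_client.get_serialized_evaluation_keys()

# Upload the evaluation keys to the Endpoint in pieces, since they can be tens of megabytes
uid = "my-client-id"  # illustrative; the actual script negotiates a uid with the Endpoint
chunk_size = 1024 * 1024  # 1 MB per piece (illustrative)
for i, start in enumerate(range(0, len(evaluation_keys), chunk_size)):
    piece = evaluation_keys[start : start + chunk_size]
    payload = {
        "inputs": "fake",
        "method": "save_key" if i == 0 else "append_key",
        "uid": uid,
        "evaluation_keys": piece.hex(),
    }
    requests.post(API_URL, headers=HEADERS, json=payload).raise_for_status()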



Adapting to your application or needs

If you edit play_with_endpoint.py, you will see that we iterate over different samples of the test dataset and run encrypted inferences directly on the Endpoint.

for i in range(nb_samples):

    # Quantize the input and encrypt it
    encrypted_inputs = fhemodel_client.quantize_encrypt_serialize(X_test[i].reshape(1, -1))

    # Prepare the payload
    payload = {
        "inputs": "fake",
        "encrypted_inputs": to_json(encrypted_inputs),
        "method": "inference",
        "uid": uid,
    }

    if is_first:
        print(f"Size of the payload: {sys.getsizeof(payload) / 1024:.2f} kilobytes")
        is_first = False

    # Run the inference on the Endpoint, measuring the duration
    duration -= time.time()
    duration_inference = -time.time()
    encrypted_prediction = query(payload)
    duration += time.time()
    duration_inference += time.time()

    encrypted_prediction = from_json(encrypted_prediction)

    # Decrypt the result and de-quantize it
    prediction_proba = fhemodel_client.deserialize_decrypt_dequantize(encrypted_prediction)[0]
    prediction = np.argmax(prediction_proba)

    if verbose:
        print(
            f"for {i}-th input, {prediction=} with expected {Y_test[i]} in {duration_inference:.3f} seconds"
        )

    # Measure accuracy
    nb_good += Y_test[i] == prediction

Of course, this is just an example of the Endpoint's usage. Developers are encouraged to adapt this example to their own use case or application.
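
For reference, the query helper used in the loop above is essentially a thin wrapper around an HTTP POST to the Endpoint. Here is a minimal sketch, assuming the HF_TOKEN environment variable set earlier and a JSON payload; the actual helper in the repository may differ slightly.

import os

import requests

headers = {
    "Authorization": f"Bearer {os.environ['HF_TOKEN']}",
    "Content-Type": "application/json",
}


def query(payload):
    # POST the JSON payload (method, uid, encrypted inputs) to the Endpoint and return its JSON answer
    response = requests.post(API_URL, headers=headers, json=payload)
    response.raise_for_status()
    return response.json()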



Under the hood

Please note that all of this is possible thanks to the flexibility of custom handlers, and we express our gratitude to the Hugging Face developers for offering such flexibility. The mechanism is defined in handler.py. As explained in the Hugging Face documentation, you can define the __call__ method of EndpointHandler pretty much as you want: in our case, we have defined a method parameter, which can be save_key (to save FHE evaluation keys), append_key (to save FHE evaluation keys piece by piece if the key is too large to be sent in one single call) and finally inference (to run FHE inferences). These methods are used to set the evaluation key once and then run all the inferences, one by one, as seen in play_with_endpoint.py.
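
To make this concrete, here is a simplified sketch of what such a handler could look like. It is not the exact handler.py from the repository: the payload field names (evaluation_keys, encrypted_inputs) and the serialization details are illustrative assumptions; only the overall dispatch on the method parameter reflects the mechanism described above.

from typing import Any, Dict

from concrete.ml.deployment import FHEModelServer


class EndpointHandler:
    def __init__(self, path: str = ""):
        # Load the compiled FHE circuit (server.zip) shipped with the model repository
        self.fhemodel_server = FHEModelServer(path)
        self.fhemodel_server.load()
        # Evaluation keys are kept in RAM, indexed by a client-provided uid
        self.key_database: Dict[str, bytes] = {}

    def __call__(self, data: Dict[str, Any]) -> Any:
        method = data["method"]
        uid = data["uid"]

        if method == "save_key":
            # Store the first piece of the serialized evaluation keys for this client
            self.key_database[uid] = bytes.fromhex(data["evaluation_keys"])
            return {"uid": uid}

        if method == "append_key":
            # The keys were too large for a single call: append the next piece
            self.key_database[uid] += bytes.fromhex(data["evaluation_keys"])
            return {"uid": uid}

        if method == "inference":
            # Run the FHE inference on the encrypted inputs, using the stored evaluation keys
            encrypted_inputs = bytes.fromhex(data["encrypted_inputs"])
            encrypted_prediction = self.fhemodel_server.run(encrypted_inputs, self.key_database[uid])
            return {"encrypted_prediction": encrypted_prediction.hex()}

        raise ValueError(f"Unsupported method: {method}")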



Limits

One can remark, however, that keys are stored in the RAM of the Endpoint, which is not convenient for a production environment: at each restart, the keys are lost and need to be re-sent. Plus, when you have several machines to handle massive traffic, this RAM is not shared between the machines. Finally, the available CPU machines only provide eight vCPUs at most for Endpoints, which could be a limit for high-load applications.



Preparing your pre-compiled model

Now that you know how easy it is to deploy a pre-compiled model, you may want to prepare yours. For this, you can fork one of the repositories we have prepared. All the model categories supported by Concrete ML (linear models, tree-based models, built-in MLPs, PyTorch models) have at least one example, which can be used as a template for new pre-compiled models.

Then, edit creating_models.py, and change the ML task to be the one you want to tackle in your pre-compiled model: for example, if you started with concrete-ml-encrypted-decisiontree, change the dataset and the model kind.

As explained earlier, you must have Concrete ML installed to prepare your pre-compiled model. Remark that you will have to use the same Python version that Hugging Face uses by default (3.10 at the time of writing), or your models may require users to use a container with your Python version during deployment.

Now you can launch python creating_models.py. This will train the model and create the necessary development files (client.zip, server.zip, and versions.json) in the compiled_model directory. As explained in the documentation, these files contain your pre-compiled model. If you have any issues, you can get support on the fhe.org Discord.
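
As an illustration of what such a script does, the sketch below trains a simple FHE-friendly classifier with Concrete ML and saves the deployment files. It is not the exact content of creating_models.py: the dataset and hyper-parameters are placeholders to be replaced by your own ML task.

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

from concrete.ml.deployment import FHEModelDev
from concrete.ml.sklearn import DecisionTreeClassifier

# Placeholder dataset: replace with the data of your own ML task
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train a quantized, FHE-friendly model
model = DecisionTreeClassifier(max_depth=5)
model.fit(X_train, y_train)

# Compile the model to an FHE circuit, using the training data as a representative input set
model.compile(X_train)

# Save the deployment files (client.zip, server.zip) into compiled_model/
FHEModelDev(path_dir="compiled_model", model=model).save()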

The last step is to modify play_with_endpoint.py to also deal with the same ML task as in creating_models.py: set the dataset accordingly.

Now, you can save this directory with the compiled_model directory and files, as well as your modifications in creating_models.py and play_with_endpoint.py, on Hugging Face models. Certainly, you will have to run some tests and make slight adjustments for it to work. Don't forget to add a concrete-ml and FHE tag, so that your pre-compiled model appears easily in searches.



Pre-compiled models available today

For now, we have prepared a few pre-compiled models as examples, hoping the community will extend this soon. Pre-compiled models can be found by searching for the concrete-ml or FHE tags.

Keep in mind that there is a limited set of configuration options in Hugging Face for CPU-backed Endpoints (up to 8 vCPUs with 16 GB of RAM today). Depending on your production requirements and model characteristics, execution times could be faster on more powerful cloud instances. Hopefully, more powerful machines will soon be available on Hugging Face Endpoints to improve these timings.



Additional resources



Conclusion and next steps

In this blog post, we have shown that custom Endpoints are pretty easy yet powerful to use. What we do in Concrete ML is quite different from the regular workflow of ML practitioners, but we are still able to accommodate the custom Endpoints to deal with most of our needs. Kudos to Hugging Face engineers for developing such a generic solution.

We explained how:

  • Developers can create their own pre-compiled models and make them available on Hugging Face models.
  • Companies can deploy developers' pre-compiled models and make them available to their users via HF Endpoints.
  • End users can use these Endpoints to run their ML tasks over encrypted data.

To go further, it would be useful to have more powerful machines available on Hugging Face Endpoints to make inferences faster. Also, we could imagine that Concrete ML becomes more integrated into Hugging Face's interface and gets a Privacy-Preserving Inference Endpoint button, simplifying developers' lives even more. Finally, for integration across several server machines, it could be helpful to have a way to share a state between machines and keep this state non-volatile (FHE evaluation keys would be stored there).


