Cohere on Hugging Face Inference Providers 🔥



banner image

We’re thrilled to share that Cohere is now a supported Inference Provider on the Hugging Face Hub! This also makes Cohere the first model creator to share and serve their models directly on the Hub.

Cohere is committed to building and serving models purpose-built for enterprise use cases. Their comprehensive suite of secure AI solutions, from cutting-edge generative AI to powerful embedding and reranking models, is designed to tackle real-world business challenges. Moreover, Cohere Labs, Cohere’s in-house research lab, supports fundamental research and seeks to change the spaces where research happens.

Starting now, you can run serverless inference on the following models via Cohere and Inference Providers:

Light up your projects with Cohere and Cohere Labs today!



Cohere Models

Cohere and Cohere Labs bring a range of models to Inference Providers that excel at specific business applications. Let’s explore some in detail.



CohereLabs/c4ai-command-a-03-2025 🔗

Optimized for demanding enterprises that require fast, secure, and high-quality AI. Its 256k context length (2x most leading models) can handle much longer enterprise documents. Other key features include Cohere’s advanced retrieval-augmented generation (RAG) with verifiable citations, agentic tool use, enterprise-grade security, and robust multilingual performance (support for 23 languages).



CohereLabs/aya-expanse-32b 🔗

Focuses on state-of-the-art multilingual support, applying the newest research on multilingual pre-training. Supports Arabic, Chinese (simplified & traditional), Czech, Dutch, English, French, German, Greek, Hebrew, Hindi, Indonesian, Italian, Japanese, Korean, Persian, Polish, Portuguese, Romanian, Russian, Spanish, Turkish, Ukrainian, and Vietnamese with 128K context length.



CohereLabs/c4ai-command-r7b-12-2024 🔗

Ideal for low-cost or low-latency use cases, bringing state-of-the-art performance in its class of open-weight models across real-world tasks. This model offers a context length of 128k. It delivers a strong combination of multilingual support, citation-verified retrieval-augmented generation (RAG), reasoning, tool use, and agentic behavior. Also supports 23 languages.



CohereLabs/aya-vision-32b 🔗

32-billion parameter model with advanced capabilities optimized for a wide range of vision-language use cases, including OCR, captioning, visual reasoning, summarization, query answering, code, and more. It expands multimodal capabilities to 23 languages spoken by over half the world’s population.



How it really works

You can use Cohere models directly on the Hub, either in the website UI or via the client SDKs.

You’ll find all of the examples mentioned in this section on the Cohere documentation page.



In the website UI

You can search for Cohere models by filtering by inference provider in the model hub.

Cohere provider UI

From the model card, you can select the inference provider and run inference directly in the UI.

gif screenshot of Cohere inference provider in the UI



From the client SDKs

Let’s walk through using Cohere models from the client SDKs. We’ve also made a Colab notebook with these snippets, in case you want to try them out right away.



From Python, using huggingface_hub

The following example shows how to use Command A with Cohere as your inference provider. You can use a Hugging Face token for automatic routing through Hugging Face, or your own Cohere API key if you have one.

Install huggingface_hub v0.30.0 or later:

pip install -U "huggingface_hub>=0.30.0"

Use the huggingface_hub python library to call Cohere endpoints by defining the provider parameter.

from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="cohere",
    api_key="xxxxxxxxxxxxxxxxxxxxxxxx",
)

messages = [
        {
            "role": "user",
            "content": "How to make extremely spicy Mayonnaise?"
        }
]

completion = client.chat.completions.create(
    model="CohereLabs/c4ai-command-r7b-12-2024",
    messages=messages,
    temperature=0.7,
    max_tokens=512,
)

print(completion.choices[0].message)

Aya Vision, Cohere Labs’ multilingual, multimodal model, is also supported. You can include images encoded in base64 as follows:

import base64

image_path = "img.jpg"
with open(image_path, "rb") as f:
    base64_image = base64.b64encode(f.read()).decode("utf-8")
image_url = f"data:image/jpeg;base64,{base64_image}"

from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="cohere",
    api_key="xxxxxxxxxxxxxxxxxxxxxxxx",
)

messages = [
        {
            "role": "user",
            "content": [
                {
                    "type": "text",
                    "text": "What's in this image?"
                },
                {
                    "type": "image_url",
                    "image_url": {"url": image_url},
                },
            ]
        }
]

completion = client.chat.completions.create(
    model="CohereLabs/aya-vision-32b",
    messages=messages,
    temperature=0.7,
    max_tokens=512,
)

print(completion.choices[0].message)



From JS, using @huggingface/inference

import { HfInference } from "@huggingface/inference";

const client = new HfInference("xxxxxxxxxxxxxxxxxxxxxxxx");

const chatCompletion = await client.chatCompletion({
    model: "CohereLabs/c4ai-command-a-03-2025",
    messages: [
        {
            role: "user",
            content: "How to make extremely spicy Mayonnaise?"
        }
    ],
    provider: "cohere",
    max_tokens: 512
});

console.log(chatCompletion.choices[0].message);



From OpenAI client

Here’s how you can call Command A using Cohere as the inference provider via the OpenAI client library.

from openai import OpenAI

client = OpenAI(
    base_url="https://router.huggingface.co/cohere/compatibility/v1",
    api_key="xxxxxxxxxxxxxxxxxxxxxxxx",
)

messages = [
        {
            "role": "user",
            "content": "How to make extremely spicy Mayonnaise?"
        }
]

completion = client.chat.completions.create(
    model="command-a-03-2025",
    messages=messages,
    temperature=0.7,
)

print(completion.choices[0].message)



Tool Use with Cohere Models

Cohere’s models bring state-of-the-art agentic tool use to Inference Providers, so let’s explore that in detail. Both the Hugging Face Hub client and the OpenAI client are compatible with tools via Inference Providers, so the examples above can be extended.

First, we need to define the tools for the model to use. Below we define get_flight_info, which calls an API for the latest flight information between two locations. This tool definition will be represented in the model’s chat template, which we can also explore in the model card (🎉 open source).

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_flight_info",
            "description": "Get flight information between two cities or airports",
            "parameters": {
                "type": "object",
                "properties": {
                    "loc_origin": {
                        "type": "string",
                        "description": "The departure airport, e.g. MIA",
                    },
                    "loc_destination": {
                        "type": "string",
                        "description": "The destination airport, e.g. NYC",
                    },
                },
                "required": ["loc_origin", "loc_destination"],
            },
        },
    }
]
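For the tool call to resolve to something, your application also needs a matching local implementation. The function below is a hypothetical stub (the real version would call an actual flight-data API); only its name and parameters are dictated by the tool definition above.

```python
def get_flight_info(loc_origin: str, loc_destination: str) -> str:
    """Hypothetical local implementation matching the tool schema above.
    A real version would query a flight-data API; here we stub the data."""
    flights = {
        ("Miami", "Seattle"): "Miami to Seattle, May 1st, 10 AM.",
    }
    return flights.get(
        (loc_origin, loc_destination),
        f"No flights found from {loc_origin} to {loc_destination}.",
    )
```

The string it returns is what we place in the `content` of the `tool` message below.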

Next, we need to pass messages to the inference client so the model can use the tools when relevant. In the example below we define the assistant’s tool call in tool_calls, for the sake of clarity.


messages = [
    {"role": "developer", "content": "Today is April 30th"},
    {
        "role": "user",
        "content": "When is the next flight from Miami to Seattle?",
    },
    {
        "role": "assistant",
        "tool_calls": [
            {
                "function": {
                    "arguments": '{ "loc_destination": "Seattle", "loc_origin": "Miami" }',
                    "name": "get_flight_info",
                },
                "id": "get_flight_info0",
                "type": "function",
            }
        ],
    },
    {
        "role": "tool",
        "name": "get_flight_info",
        "tool_call_id": "get_flight_info0",
        "content": "Miami to Seattle, May 1st, 10 AM.",
    },
]

Finally, the tools and messages are passed to the create method.

from huggingface_hub import InferenceClient

client = InferenceClient(
    provider="cohere",
    api_key="xxxxxxxxxxxxxxxxxxxxxxxx",
)

completion = client.chat.completions.create(
    model="CohereLabs/c4ai-command-r7b-12-2024",
    messages=messages,
    tools=tools,
    temperature=0.7,
    max_tokens=512,
)

print(completion.choices[0].message)
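In a live agent loop, the model may respond with fresh tool calls rather than text. A typical pattern is to execute each call locally and append the results as `tool` messages before calling the API again. A minimal dispatch sketch, assuming the message and tool-call shapes shown above (shown here with plain dicts for illustration; the client actually returns objects with attribute access):

```python
import json

def dispatch_tool_calls(assistant_message, available_tools):
    """Run each tool call in an assistant message against a local
    function registry and return the `tool` messages to append."""
    tool_messages = []
    for call in assistant_message.get("tool_calls", []):
        fn = call["function"]
        args = json.loads(fn["arguments"])  # arguments arrive as a JSON string
        result = available_tools[fn["name"]](**args)  # run the matching local function
        tool_messages.append({
            "role": "tool",
            "name": fn["name"],
            "tool_call_id": call["id"],
            "content": result,
        })
    return tool_messages

# Illustrative registry mapping tool names to local implementations
tools_impl = {
    "get_flight_info": lambda loc_origin, loc_destination:
        f"{loc_origin} to {loc_destination}, May 1st, 10 AM.",
}

assistant_msg = {
    "tool_calls": [{
        "id": "get_flight_info0",
        "type": "function",
        "function": {
            "name": "get_flight_info",
            "arguments": '{"loc_origin": "Miami", "loc_destination": "Seattle"}',
        },
    }]
}

messages_to_append = dispatch_tool_calls(assistant_msg, tools_impl)
```

Appending `messages_to_append` to the conversation and re-invoking `create` lets the model ground its final answer in the tool results.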



Billing

For direct requests, i.e. when you use a Cohere key, you are billed directly on your Cohere account.

For routed requests, i.e. when you authenticate via the Hub, you only pay the standard Cohere API rates. There is no additional markup from us, we just pass through the provider costs directly. (In the future, we may establish revenue-sharing agreements with our provider partners.)

Important Note ‼️ PRO users get $2 worth of Inference credits every month. You can use them across providers. 🔥

Subscribe to the Hugging Face PRO plan to get access to Inference credits, ZeroGPU, Spaces Dev Mode, 20x higher limits, and more.


