Cerebras Launches Latest AI Inference Service… “20x Faster and 100x Cheaper than NVIDIA”


WSE-3 chip (Photo = Cerebras)

Artificial intelligence (AI) semiconductor startup Cerebras has launched what it calls the world’s fastest and most cost-effective AI inference service. As generative AI applications such as ‘ChatGPT’ become popular, demand for AI inference is expected to grow exponentially, and Cerebras has thrown down the gauntlet to Nvidia by touting its strengths in this segment.

Reuters reported on the 27th (local time) that Cerebras has launched ‘Cerebras Inference’, an AI inference service that is up to 20 times faster than Nvidia GPU-based offerings.

AI inference is the process of running an already trained AI model to produce outputs such as chatbot answers or solutions to various tasks.

“Inference is the fastest-growing segment of the AI industry, accounting for about 40% of all AI-related workloads in cloud computing,” said Cerebras. “High-speed inference services will be a turning point for the AI industry.”

Cerebras Inference processes 1,800 tokens per second on the large language model (LLM) Llama 3.1 8B and 450 tokens per second on Llama 3.1 70B. That is roughly 20 times faster than Nvidia GPU-based AI inference services available on hyperscale clouds, including Microsoft Azure.

Along with the performance improvement, it is also price competitive. For example, it can be used for just 10 cents per million tokens, offering up to 100 times better price-performance than existing GPU clouds.
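
To put the quoted figures in perspective, the short sketch below works out latency and cost from the numbers in this article (1,800 tokens per second and 10 cents per million tokens for Llama 3.1 8B). The 500-token answer size is an illustrative workload, not a Cerebras benchmark.

```python
# Back-of-the-envelope math using the figures quoted in the article.
# Assumptions: 1,800 output tokens/s and $0.10 per million tokens (Llama 3.1 8B);
# the 500-token answer below is just an illustrative request size.

TOKENS_PER_SECOND = 1_800        # quoted throughput for Llama 3.1 8B
PRICE_PER_MILLION_TOKENS = 0.10  # quoted price in USD

def latency_seconds(output_tokens: int) -> float:
    """Time to generate a response at the quoted throughput."""
    return output_tokens / TOKENS_PER_SECOND

def cost_usd(total_tokens: int) -> float:
    """Cost of a request at the quoted per-token price."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS

if __name__ == "__main__":
    answer_tokens = 500  # hypothetical chatbot answer
    print(f"500-token answer: ~{latency_seconds(answer_tokens):.2f} s, "
          f"~${cost_usd(answer_tokens):.5f}")
    # -> roughly 0.28 s and $0.00005 per answer at the quoted rates
```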

Cerebras says that with a 20x faster inference speed, AI app developers can build next-generation AI applications without compromising on accuracy or cost.

This cost-effectiveness was made possible by Cerebras’ ‘CS-3’ system and its Wafer Scale Engine 3 (WSE-3) AI processor. The WSE-3 chip, which is roughly the size of a dinner plate, can process 1,000 tokens per second, a throughput Cerebras compares to the arrival of broadband internet. “The WSE-3 chip delivers significantly higher performance than Nvidia GPUs,” Cerebras CEO Andrew Feldman claimed.

Specifically, the CS-3 has 7,000 times the memory bandwidth of Nvidia’s ‘H100’, addressing the memory bandwidth bottleneck of generative AI.

Cerebras Inference is available in three tiers: Free Tier, Developer Tier, and Enterprise Tier.

The free tier provides free API access with generous usage limits to all logged-in users. The developer tier is designed for flexible serverless deployments and offers API endpoints at 10 cents and 60 cents per million tokens for the Llama 3.1 8B and 70B models, respectively. The enterprise tier offers fine-tuned models, custom service-level agreements (SLAs), and dedicated support.
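
For developer-tier users, access is through an API. The sketch below is a minimal illustration rather than official Cerebras documentation: it assumes an OpenAI-compatible Chat Completions endpoint, and the base URL, model identifier, and environment variable name are assumptions made for the example.

```python
# Minimal sketch of calling a hosted Llama 3.1 8B model through an
# OpenAI-compatible Chat Completions API.
# ASSUMPTIONS: the base URL, model name, and CEREBRAS_API_KEY variable
# are illustrative placeholders, not confirmed Cerebras documentation.
import os
from openai import OpenAI  # pip install openai

client = OpenAI(
    base_url="https://api.cerebras.ai/v1",   # assumed endpoint
    api_key=os.environ["CEREBRAS_API_KEY"],  # assumed env var name
)

response = client.chat.completions.create(
    model="llama3.1-8b",  # assumed model identifier
    messages=[
        {"role": "user", "content": "Summarize wafer-scale computing in one sentence."}
    ],
)

print(response.choices[0].message.content)
```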

Cerebras offers several types of inference services through the cloud, but also plans to sell its AI systems to enterprises that prefer to run their own data centers.

Currently, the AI market is dominated by Nvidia, but the emergence of companies like Cerebras and Groq heralds a change in industry dynamics, especially as demand for faster and cheaper AI inference services grows.

Reporter Park Chan cpark@aitimes.com
