How ElevenLabs Voice AI Is Replacing Screens in Warehouse and Manufacturing Operations

Order picking is the process of collecting items from storage locations to fulfil customer orders.

It is one of the most labour-intensive activities in logistics, accounting for as much as 55% of total warehouse operating costs.

Example of a warehouse layout where operators need to pick in multiple locations – (Image by Samir Saci)

For every order, an operator receives a list of items to collect from their storage locations.

They walk to each location, find the product, pick the right quantity, and confirm the operation before moving to the next line.

In most warehouses, operators depend on RF scanners or handheld tablets to receive instructions and confirm each pick.

What happens when operators need both hands for handling?
How do you onboard operators who don’t read the local language?

Voice picking solves this by replacing the screen with audio instructions: the system tells the operator where to go and what to pick, and the operator confirms verbally.

Illustration of an operator using voice picking – (Image by Samir Saci)

When I was designing supply chain solutions for logistics companies, vocalisation was the default choice, especially for price-sensitive projects.

In my experience, with vocalisation, operators’ productivity can reach 250 boxes/hour for retail and FMCG operations.

The concept is not new. Hardware providers and software vendors have offered voice-picking solutions since the early 2000s.

But these systems come with significant constraints:

Proprietary hardware at $2,000 to $5,000 per headset
Vendor-locked software with limited customisation
Long deployment cycles of three to six months per site
Rigid language support that requires retraining for each new language

For a 50-FTE warehouse, the total investment reaches $150K to $300K, excluding training costs.

That is too expensive for my customers.

What if you could achieve similar results using a smartphone, a custom-made web application, and modern AI voice technology?

In this article, I’ll show how I built a minimalist voice-picking module that integrates with Warehouse Management Systems, using ElevenLabs for text-to-speech and speech recognition.

Example of screens of this app designed for use on a smartphone with a vocal interface – (Image by Samir Saci)

This web application has been deployed in the distribution centre of a small supermarket chain with great results (the client is happy!).

The goal is not to design solutions that compete with market leaders, but rather to offer an alternative to logistics and manufacturing operations that lack the capacity to invest in expensive equipment and need customised solutions.

Problem Statement

Before we get into voice picking powered by ElevenLabs, let me introduce the logistics operations this AI-powered web application will support.

Layout of the distribution centre – (Image by Samir Saci)

This is the central distribution centre of a small supermarket chain that delivers to 50 stores in Central Europe.

Layout of the warehouse with 10 aisles and 12 pallet positions displayed on the app – (Image by Samir Saci)

The facility is organised in a grid layout with aisles (A through L) and positions along each aisle:

Each location stores a specific item (called a SKU) with a known quantity in boxes.
Operators need to know where to go and what to expect when they arrive.

What is the goal? Boost the operators’ productivity!

They were not happy with the order allocation and walking paths provided by their old system.

Solutions used to optimise picking operations for this warehouse – (Image by Samir Saci)

They first asked to reduce operators’ walking distance and increase the number of boxes picked per hour using the solutions presented in this article.

The solution was a web application connected to the Warehouse Management System (WMS) database that guides the operator through the warehouse.

Operators can check their picking list but also detailed information per location – (Image by Samir Saci)

This visual layout provides a real-time view of what we have in the system, with a better routing solution.

Our objective is to go from a productivity of 75 boxes/hour to 200 boxes/hour with:

Better order allocation with spatial clustering and pathfinding to minimise the walking distance per box picked
Voice-picking to guide operators in a flawless manner

How the Picking Flow Works

Before jumping into the vocalisation of the tool, let me introduce the order picking process.

Three stores sent orders to the warehouse:

Store 1 ordered 3 boxes of Organic Green Tea 500g that are located in Location A1
Store 2 ordered 2 boxes of Earl Grey Tea 250g that are located in Location A3
Store 3 ordered 5 boxes of Arabica Coffee Beans 1kg that are located in Location B2

A picking batch is a group of store orders consolidated into a single work task.

The operator will prepare the three orders in a single batch – (Image by Samir Saci)

The system generates a batch with multiple order lines, each with instructions:

Where to go (the storage location)
What to pick (the SKU reference)
How many boxes to collect
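
As a minimal sketch of what such a batch could look like in code, the three store orders above can be consolidated into a list of pick lines. The `PickLine` class, its field names, and `build_batch` are hypothetical names chosen for illustration; the actual WMS schema is not described in this article.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PickLine:
    """One instruction of a picking batch (hypothetical structure)."""
    store: str      # store that placed the order
    location: str   # where to go
    sku: str        # what to pick
    boxes: int      # how many boxes to collect


def build_batch(orders):
    """Consolidate a list of store orders (dicts) into a single batch."""
    return [PickLine(**order) for order in orders]


# The three store orders used as an example in this section
orders = [
    {"store": "Store 1", "location": "A1", "sku": "Organic Green Tea 500g", "boxes": 3},
    {"store": "Store 2", "location": "A3", "sku": "Earl Grey Tea 250g", "boxes": 2},
    {"store": "Store 3", "location": "B2", "sku": "Arabica Coffee Beans 1kg", "boxes": 5},
]

batch = build_batch(orders)
total_boxes = sum(line.boxes for line in batch)
```

Each element of `batch` carries the three pieces of information listed above, so the front end only has to walk through the list in order.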

Picking list (left), layout (middle), details of location (right) – (Image by Samir Saci)

The operator just has to process each line sequentially.

Once they confirm a pick, the system advances to the next instruction.

This sequential flow is critical because it determines the walking path through the warehouse, which is computed by the optimisation algorithms.
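
To illustrate how the sequence of lines drives the walking path, here is a small sketch that orders locations with a nearest-neighbour heuristic. This is not the optimisation used in the application (which relies on spatial clustering and pathfinding); the coordinate mapping in `to_xy`, the Manhattan distance, and the start point are simplifying assumptions that ignore racks and aisle constraints.

```python
def to_xy(location):
    """Map a location such as 'B2' to grid coordinates (assumed mapping)."""
    aisle, position = location[0], int(location[1:])
    return (ord(aisle) - ord("A"), position)


def manhattan(a, b):
    """Grid distance between two points, ignoring racks (simplification)."""
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def nearest_neighbour_route(locations, start=(0, 0)):
    """Visit the closest remaining location first, starting from `start`."""
    remaining = list(locations)
    route, current = [], start
    while remaining:
        nxt = min(remaining, key=lambda loc: manhattan(current, to_xy(loc)))
        remaining.remove(nxt)
        route.append(nxt)
        current = to_xy(nxt)
    return route


def route_length(route, start=(0, 0)):
    """Total walking distance when visiting `route` in order."""
    total, current = 0, start
    for loc in route:
        total += manhattan(current, to_xy(loc))
        current = to_xy(loc)
    return total


naive = ["B2", "A3", "A1"]               # arbitrary processing order
ordered = nearest_neighbour_route(naive)  # reordered sequence
```

With these assumptions, the reordered sequence is shorter than the arbitrary one, which is the effect the sequencing of pick lines is meant to achieve.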

Example of the original pathfinding solution (bottom) and the optimised one (top)

As this is a custom application, we could implement this optimisation without relying on an external vendor.

Why build a custom solution? Because it is cheaper and easier to implement.

Initially, the client planned to buy a commercial solution and wanted me to integrate the pathfinding solution.

After investigation, we discovered that it would have been more expensive to integrate the app into the vendor solution than to build something from scratch.

What’s the process without the AI-based voice feature?

Manual Mode: The Screen-Based Baseline

In manual mode, the operator reads each instruction on screen and confirms by tapping a button.

Two actions are available at each step:

Confirm Pick: the operator collected the correct quantity
Report Issue: the location is empty, the quantity does not match, or the product is damaged
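
The two actions can be sketched as a small session object that tracks the current line. The class and method names are hypothetical, and the choice to move to the next line after an issue is reported is an assumption; the article only states that a confirmed pick advances the flow.

```python
from enum import Enum


class LineStatus(Enum):
    PENDING = "pending"
    PICKED = "picked"
    ISSUE = "issue"


class PickSession:
    """Tracks progress through a batch in manual mode (illustrative only)."""

    def __init__(self, lines):
        self.lines = list(lines)
        self.statuses = [LineStatus.PENDING] * len(self.lines)
        self.issues = {}  # line index -> reason given by the operator
        self.index = 0

    @property
    def current(self):
        """Current line, or None once the batch is finished."""
        return self.lines[self.index] if self.index < len(self.lines) else None

    def confirm_pick(self):
        """'Confirm Pick' button: the correct quantity was collected."""
        self._close_line(LineStatus.PICKED)

    def report_issue(self, reason):
        """'Report Issue' button: store the reason, then move on (assumption)."""
        self.issues[self.index] = reason
        self._close_line(LineStatus.ISSUE)

    def _close_line(self, status):
        if self.current is None:
            raise RuntimeError("batch already completed")
        self.statuses[self.index] = status
        self.index += 1


session = PickSession(["A1", "A3", "B2"])
session.confirm_pick()
session.report_issue("empty location")
```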

Our operator has to press the button to confirm the pick or report an issue – (Image by Samir Saci)

I built the manual mode as a reliable fallback in case we have issues with ElevenLabs.

However, it keeps the operator’s eyes and one hand tied to the device at every step.

We need to add vocal commands!

Voice Mode: Hands-Free with ElevenLabs

Now that you know why we want voice mode to replace screen interaction, let me explain how I added two AI-powered components.

Technical architecture of this application – (Image by Samir Saci)

Text-to-Speech: ElevenLabs Reads the Instructions

When the operator starts a picking session in voice mode, each instruction is converted to speech using the ElevenLabs API.

Instead of reading “Location A-03-2, pick 4 boxes of SKU-1042” on a screen, the operator hears a natural voice say:

ElevenLabs provides several benefits over basic browser-based TTS:

Natural intonation that is easy to understand in a noisy warehouse
29+ languages available out of the box, with no retraining
Consistent voice quality across all instructions
Sub-second generation for short sentences like pick instructions
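
For readers who want to try the text-to-speech step, here is a minimal sketch that builds a request for the ElevenLabs text-to-speech REST endpoint using only the standard library. The voice ID and API key are placeholders, the `eleven_multilingual_v2` model is one possible choice rather than the one confirmed for this application, and the wording produced by `instruction_text` is an assumption based on the on-screen text above.

```python
import json
import urllib.request

API_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"


def instruction_text(location, boxes, sku):
    """Sentence to be spoken for one pick line (assumed wording)."""
    return f"Location {location}, pick {boxes} boxes of {sku}"


def build_tts_request(text, voice_id, api_key, model_id="eleven_multilingual_v2"):
    """Prepare the POST request; nothing is sent at this stage."""
    payload = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        API_URL.format(voice_id=voice_id),
        data=payload,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def synthesize(request, timeout=10):
    """Send the request and return the audio bytes from the response."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return response.read()


# Placeholders: replace with a real voice ID and API key before calling synthesize()
request = build_tts_request(
    instruction_text("A-03-2", 4, "SKU-1042"),
    voice_id="YOUR_VOICE_ID",
    api_key="YOUR_API_KEY",
)
```

Keeping the request construction separate from the network call makes it easy to log or test the payload, and to fall back to manual mode when `synthesize` raises an error.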

But what about speech recognition?

Speech-to-Text: The Operator Confirms Verbally

After hearing the instruction, the operator walks to the location, picks the items, and needs to confirm.

Here, I made a deliberate design choice: relying on the speech recognition and reasoning capabilities of ElevenLabs.

Using a single endpoint, we capture the response and match it against expected commands:

or to validate the pick
or to flag a discrepancy
to listen to the instruction again

The agentic part interprets the operator’s response and tries to match it to one of the expected interactions (CONFIRM, ISSUE, or REPEAT).
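
To make the matching step concrete, here is a deliberately simple stand-in that maps a transcript to CONFIRM, ISSUE, or REPEAT with keyword rules. This is not how ElevenLabs interprets speech; the keyword lists are hypothetical, English-only examples, whereas the real endpoint handles multiple languages.

```python
import re

# Hypothetical keyword lists, for illustration only
COMMAND_KEYWORDS = {
    "CONFIRM": {"confirm", "confirmed", "done", "ok", "yes"},
    "ISSUE": {"issue", "problem", "empty", "missing", "damaged"},
    "REPEAT": {"repeat", "again"},
}


def match_command(transcript):
    """Return CONFIRM, ISSUE, REPEAT, or None if nothing matches."""
    words = set(re.findall(r"[a-z]+", transcript.lower()))
    # ISSUE is checked first so that "ok but two boxes are missing"
    # is not mistaken for a confirmation.
    for command in ("ISSUE", "REPEAT", "CONFIRM"):
        if words & COMMAND_KEYWORDS[command]:
            return command
    return None
```

A rule-based matcher like this breaks as soon as operators use other words or other languages, which is the limitation the agentic approach is meant to remove.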

The whole process from left to right: Step 1 -> Step 2 -> Step 3 – (Image by Samir Saci)

For a multilingual warehouse, this is a significant benefit:

A Czech operator and a Filipino operator can each receive instructions in their native language from the same system, without any hardware change.
I don’t have to anticipate every possible language in the design of the solution

Why use ElevenLabs?

For another feature, the inventory cycle count tool presented in this video, I used n8n with AI agent nodes to perform the same task.

n8n workflow for the voice-powered inventory cycle count tools – (Image by Samir Saci)

This worked quite well, but it required a more complex setup:

Two AI nodes: one for the audio transcription using OpenAI models, and one AI agent to format the output of the transcription
The system prompts assumed that the operator was speaking English.

I have replaced that with a single ElevenLabs endpoint with multilingual capabilities.

Putting both components together, a single pick cycle looks like this:

The Complete Voice Picking Cycle – (Image by Samir Saci)

The app calls ElevenLabs to generate the audio instruction
The operator hears:
The operator walks to the location (hands free, eyes free)
The operator picks the items and says,
The speech recognition endpoint processes the confirmation and moves to the next picking location

The entire interaction takes a few seconds of system time.
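
The cycle above can be sketched as a loop in which the speech functions are injected, so the logic can be tested without any audio. `speak` and `listen` are hypothetical callables standing in for the text-to-speech and speech recognition calls; the retry limit and the "MANUAL" fallback marker are assumptions inspired by the manual mode described earlier.

```python
def run_pick_cycle(lines, speak, listen, max_attempts=3):
    """Process pick lines one by one using injected speech functions.

    lines: iterable of (location, sku, boxes)
    speak: callable taking the instruction text
    listen: callable returning "CONFIRM", "ISSUE", "REPEAT", or None
    """
    results = []
    for location, sku, boxes in lines:
        instruction = f"Location {location}, pick {boxes} boxes of {sku}"
        for _ in range(max_attempts):
            speak(instruction)
            command = listen()
            if command in ("CONFIRM", "ISSUE"):
                results.append((location, command))
                break
            # REPEAT or unrecognised answer: the instruction is spoken again
        else:
            # No usable answer after max_attempts: hand over to manual mode
            results.append((location, "MANUAL"))
    return results


# Stubs simulating the operator's answers, for demonstration only
spoken = []
answers = iter(["REPEAT", "CONFIRM", "ISSUE", None, None, None])
lines = [
    ("A1", "Organic Green Tea 500g", 3),
    ("A3", "Earl Grey Tea 250g", 2),
    ("B2", "Arabica Coffee Beans 1kg", 5),
]
results = run_pick_cycle(lines, speak=spoken.append, listen=lambda: next(answers))
```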

What about the costs?

This is where the comparison with traditional systems becomes striking.

Comparative study – (Image by Samir Saci)

For this mid-size warehouse with 50 FTEs, they estimated that the traditional approach costs roughly $60K to $150K in the first year.

The AI-powered approach costs a few API calls.

The trade-off is clear: traditional systems offer proven reliability and offline capability for high-volume operations.

In case of failures, we have the manual mode as a fallback.

This AI-powered approach offers accessibility and speed for organisations that can’t justify a six-figure investment.

What Does That Mean for Operations Managers and Decision Makers?

Voice picking is no longer a technology reserved for the largest 3PLs and retailers with large budgets.

If your warehouse has WiFi and your operators have smartphones, you can prototype a voice-guided picking system in days.

It is easy to test it on a real batch to measure the impact before committing any significant budget to productisation.

Three scenarios where this approach makes particular sense:

Multilingual facilities where operators struggle with screen-based instructions in a language that is not their own
Multi-site operations where deploying proprietary hardware to every small warehouse is not economically viable
High-turnover environments where training time on complex scanning systems directly impacts productivity

What about other processes?

Good news: the same architecture extends beyond picking.

Voice-guided workflows can support any process where an operator needs instructions while keeping their hands free.

You can find a live demo of an inventory cycle counting tool here:

How to start this journey?

As you can easily guess, the front end of these applications has been vibecoded using Lovable and Claude Code.

For the backend, if you have limited coding skills, I’d suggest starting with n8n.

Example of n8n workflows – (Image by Samir Saci)

n8n is a low-code automation platform that lets you connect APIs and AI models using visual workflows.

The initial version of this solution was built with this tool:

I started with a backend connected to a Telegram Bot
Users were playing with the tool using this interface
After validation, we moved that to a web application

This is the easiest way to start, even with limited coding skills.

I share a step-by-step tutorial with free templates to start automating from day 1 in this video:

Let me know what you plan to build using all these nice tools!

About Me

Let’s connect on LinkedIn and Twitter. I’m a Supply Chain Engineer who uses data analytics to improve logistics operations and reduce costs.

If you are looking for tailored consulting solutions to optimise your supply chain and meet sustainability goals, please contact me.
