Order picking is the process of collecting items from storage locations to fulfil customer orders.
It’s one of the most labour-intensive activities in logistics, accounting for as much as 55% of total warehouse operating costs.
For every order, an operator receives a list of items to collect from their storage locations.
They walk to each location, find the product, pick the right quantity, and confirm the operation before moving to the next line.
In most warehouses, operators rely on RF scanners or handheld tablets to receive instructions and confirm each pick.
- What happens when operators need both hands for handling?
- How do you onboard operators who don’t read the local language?
Voice picking solves this by replacing the screen with audio instructions: the system tells the operator where to go and what to pick, and the operator confirms verbally.

When I was designing supply chain solutions at logistics companies, voice picking was the default choice, especially for price-sensitive projects.
In my experience, with voice picking, operators’ productivity can reach 250 boxes/hour for retail and FMCG operations.
The concept is not new. Hardware providers and software vendors have offered voice-picking solutions since the early 2000s.
But these systems come with significant constraints:
- Proprietary hardware at $2,000 to $5,000 per headset
- Vendor-locked software with limited customisation
- Long deployment cycles of three to six months per site
- Rigid language support that requires retraining for every new language
For a 50-FTE warehouse, the overall investment reaches $150K to $300K, excluding training costs.
That is simply too expensive for my customers.
What if you could achieve similar results using a smartphone, a custom-made web application, and modern AI voice technology?
In this article, I’ll show how I built a minimalist voice-picking module that integrates with Warehouse Management Systems, using ElevenLabs for text-to-speech and speech recognition.

This web application has been deployed in the distribution centre of a small supermarket chain with great results (the client is happy!).
The goal is not to design solutions that compete with market leaders, but rather to offer an alternative to logistics and manufacturing operations that lack the capacity to invest in expensive equipment and need customised solutions.
Problem Statement
Before we get into voice picking powered by ElevenLabs, let me introduce the logistics operations this AI-powered web application will support.

This is the central distribution centre of a small supermarket chain that delivers to 50 stores in Central Europe.

The facility is organised in a grid layout with aisles (A through L) and positions along each aisle:
- Each location stores a specific item (called a SKU) with a known quantity in boxes.
- Operators need to know where to go and what to expect when they arrive.
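To make the layout concrete, here is a minimal sketch of how such a location could be modelled. The `Location` class, the `parse_code` helper, and the `A1`-style codes are assumptions for illustration, not the actual WMS schema.

```python
from dataclasses import dataclass
import string

# Aisles "A" through "L", as in the grid layout described above.
AISLES = string.ascii_uppercase[:12]


@dataclass(frozen=True)
class Location:
    """Hypothetical record for one storage location."""
    aisle: str      # one letter, A through L
    position: int   # 1-based position along the aisle
    sku: str        # item stored at this location
    boxes: int      # known quantity on hand, in boxes

    @property
    def code(self) -> str:
        # Short code the operator sees or hears, e.g. "A1".
        return f"{self.aisle}{self.position}"


def parse_code(code: str) -> tuple[str, int]:
    """Split a location code such as 'B2' into ('B', 2)."""
    aisle, position = code[0].upper(), int(code[1:])
    if aisle not in AISLES or position < 1:
        raise ValueError(f"unknown location code: {code!r}")
    return aisle, position
```

With this shape, "where to go" is the `code` and "what to expect" is the `sku` and `boxes` pair.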
What’s the objective? Boost operators’ productivity!
They weren’t happy with the order allocation and walking paths provided by their old system.

They first asked us to reduce operators’ walking distance and increase the number of boxes picked per hour, using the solutions presented in this article.
The solution was a web application connected to the Warehouse Management System (WMS) database that guides the operator through the warehouse.

This visual layout provides a real-time view of what we have in the system, with a better routing solution.
Our objective is to go from a productivity of 75 boxes/hour to 200 boxes/hour with:
- Better order allocation, using spatial clustering and pathfinding to minimise the walking distance per box picked
- Voice picking to guide operators seamlessly
How the Picking Flow Works
Before jumping into the voice features of the tool, let me introduce the order picking process.
Three stores sent orders to the warehouse:
- Store 1 ordered 3 boxes of Organic Green Tea 500g, which are located in Location A1
- Store 2 ordered 2 boxes of Earl Grey Tea 250g, which are located in Location A3
- Store 3 ordered 5 boxes of Arabica Coffee Beans 1kg, which are located in Location B2
A picking batch is a group of store orders consolidated into a single work task.

The system generates a batch with multiple order lines with instructions:
- Where to go (the storage location)
- What to pick (the SKU reference)
- How many boxes to collect
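As a rough illustration, the three store orders above could be consolidated into one batch like this. The `OrderLine` fields and the `build_batch` helper are hypothetical names, not the WMS data model.

```python
from dataclasses import dataclass


@dataclass
class OrderLine:
    """One instruction in a picking batch (illustrative fields)."""
    store: str      # store that placed the order
    location: str   # where to go
    sku: str        # what to pick
    boxes: int      # how many boxes to collect


def build_batch(store_orders: list[tuple[str, str, str, int]]) -> list[OrderLine]:
    """Flatten several store orders into a single list of order lines."""
    return [OrderLine(store, location, sku, boxes)
            for store, location, sku, boxes in store_orders]


batch = build_batch([
    ("Store 1", "A1", "Organic Green Tea 500g", 3),
    ("Store 2", "A3", "Earl Grey Tea 250g", 2),
    ("Store 3", "B2", "Arabica Coffee Beans 1kg", 5),
])
```

Each `OrderLine` carries exactly the three pieces of information listed above, plus the store it belongs to.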

The operator just has to process each line sequentially.
Once they confirm a pick, the system advances to the next instruction.
This sequential flow is critical because it determines the walking path through the warehouse, as computed by the optimisation algorithms.
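The optimisation algorithms themselves are not detailed here. As a stand-in, the sketch below sequences locations with a simple nearest-neighbour heuristic. It assumes `A1`-style codes and Manhattan distance on the grid, which ignores real constraints such as aisle walls and one-way paths. The function names are hypothetical.

```python
def to_xy(code: str) -> tuple[int, int]:
    """Map a code like 'B2' to grid coordinates (aisle index, position)."""
    return ord(code[0].upper()) - ord("A"), int(code[1:])


def walking_distance(a: str, b: str) -> int:
    """Manhattan distance between two locations (a simplification)."""
    (ax, ay), (bx, by) = to_xy(a), to_xy(b)
    return abs(ax - bx) + abs(ay - by)


def order_by_nearest(locations: list[str], start: str = "A1") -> list[str]:
    """Greedy sequencing: always walk to the closest remaining location."""
    remaining = list(locations)
    route, current = [], start
    while remaining:
        nearest = min(remaining, key=lambda loc: walking_distance(current, loc))
        remaining.remove(nearest)
        route.append(nearest)
        current = nearest
    return route
```

A greedy route is rarely optimal, but it shows why the order of the lines matters: the sequence the operator hears is the walking path.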

As this is a custom application, we could implement this optimisation without relying on an external software vendor.
Why build a custom solution? Because it’s cheaper and easier to implement.
Initially, the client planned to buy a commercial solution and wanted me to integrate the pathfinding solution.
After investigation, we discovered that it would have been more expensive to integrate the app into the vendor’s solution than to build something from scratch.
What’s the process without the AI-based voice feature?
Manual Mode: The Screen-Based Baseline
In manual mode, the operator reads each instruction on screen and confirms by tapping a button.
Two actions are available at each step:
- Confirm Pick: the operator collected the right quantity
- Report Issue: the location is empty, the quantity doesn’t match, or the product is damaged
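The two buttons can be thought of as two transitions on a small session object. This is a minimal sketch with hypothetical names. It assumes that reporting an issue also moves on to the next line, which may differ from the deployed app.

```python
from dataclasses import dataclass, field


@dataclass
class PickingSession:
    """Tracks progress through the ordered lines of one batch."""
    lines: list                                   # ordered instructions
    index: int = 0                                # position of the current line
    issues: list = field(default_factory=list)    # (line, reason) pairs

    @property
    def current(self):
        # None once every line has been processed.
        return self.lines[self.index] if self.index < len(self.lines) else None

    def confirm_pick(self):
        """Operator collected the right quantity: advance to the next line."""
        self.index += 1
        return self.current

    def report_issue(self, reason: str):
        """Record the problem against the current line, then advance."""
        self.issues.append((self.current, reason))
        self.index += 1
        return self.current
```

Keeping this state separate from the interface means the same session can be driven by a tap or by a spoken command.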

I built the manual mode as a reliable fallback in case we have issues with ElevenLabs.
But it keeps the operator’s eyes and one hand tied to the device at every step.
We need to add voice commands!
Voice Mode: Hands-Free with ElevenLabs
Now that you know why we want voice mode to replace screen interaction, let me explain how I added two AI-powered components.

Text-to-Speech: ElevenLabs Reads the Instructions
When the operator starts a picking session in voice mode, each instruction is converted to speech using the ElevenLabs API.
Instead of reading “Location A-03-2, pick 4 boxes of SKU-1042” on a screen, the operator hears a natural voice read the instruction aloud.
ElevenLabs provides several benefits over basic browser-based TTS:
- Natural intonation that is easy to understand in a noisy warehouse
- 29+ languages available out of the box, with no retraining
- Consistent voice quality across all instructions
- Sub-second generation for short sentences like pick instructions
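For reference, a text-to-speech call could be prepared as sketched below. It targets the public ElevenLabs REST endpoint as I understand it (`POST /v1/text-to-speech/{voice_id}` with an `xi-api-key` header), so check the current API reference before relying on it. The helper names, voice ID, and key are placeholders, and the `eleven_multilingual_v2` model is an assumption rather than the setting used in the deployed app.

```python
import json
import urllib.request

API_URL = "https://api.elevenlabs.io/v1/text-to-speech/{voice_id}"


def instruction_text(location: str, boxes: int, sku: str) -> str:
    """Sentence that will be read aloud to the operator."""
    return f"Location {location}, pick {boxes} boxes of {sku}"


def build_tts_request(text: str, voice_id: str, api_key: str,
                      model_id: str = "eleven_multilingual_v2") -> urllib.request.Request:
    """Prepare the HTTP request without sending it."""
    body = json.dumps({"text": text, "model_id": model_id}).encode("utf-8")
    return urllib.request.Request(
        API_URL.format(voice_id=voice_id),
        data=body,
        headers={"xi-api-key": api_key, "Content-Type": "application/json"},
        method="POST",
    )


def synthesize(request: urllib.request.Request) -> bytes:
    """Send the request and return audio bytes (needs network and a valid key)."""
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.read()
```

Separating `build_tts_request` from `synthesize` keeps the request testable offline. Only the second function touches the network.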
But what about speech recognition?
Speech-to-Text: The Operator Confirms Verbally
After hearing the instruction, the operator walks to the location, picks the items, and needs to confirm.
Here, I made a deliberate design choice to rely on the speech recognition and reasoning capabilities of ElevenLabs.
Using a single endpoint, we capture the response and match it against expected commands:
- A confirmation to validate the pick
- An issue report to flag a discrepancy
- A repeat request to hear the instruction again
The agentic component interprets the operator’s response and tries to match it to one of the expected intents (CONFIRM, ISSUE, or REPEAT).
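To show the target mapping, here is a deliberately naive, English-only keyword matcher. It is not how the deployed solution works, since that delegates interpretation to ElevenLabs and handles multiple languages. The keyword lists are hypothetical.

```python
import re

# Hypothetical keyword lists; a real deployment would not hard-code these.
INTENT_KEYWORDS = {
    "CONFIRM": {"confirm", "confirmed", "ok", "done", "yes"},
    "ISSUE": {"issue", "problem", "empty", "missing", "damaged"},
    "REPEAT": {"repeat", "again"},
}


def match_intent(transcript: str):
    """Return 'CONFIRM', 'ISSUE', 'REPEAT', or None if nothing matches."""
    words = set(re.findall(r"[a-z]+", transcript.lower()))
    # ISSUE is checked first so that "ok but the box is damaged"
    # is not mistaken for a confirmation.
    for intent in ("ISSUE", "REPEAT", "CONFIRM"):
        if words & INTENT_KEYWORDS[intent]:
            return intent
    return None
```

A matcher like this breaks down as soon as operators speak another language or phrase things freely, which is the gap the model-based approach closes.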

For a multilingual warehouse, this is a significant benefit:
- A Czech operator and a Filipino operator can both receive instructions in their native language from the same system, without any hardware change.
- I don’t have to anticipate every possible language in the design of the solution.
Why use ElevenLabs?
For another feature, the inventory cycle count tool presented in this video, I used n8n with AI agent nodes to perform the same task.

This worked quite well, but it required a more complex setup:
- Two AI nodes: one for the audio transcription using OpenAI models, and one AI agent to format the output of the transcription
- The system prompts assumed that the operator was speaking English.
I have replaced that with a single ElevenLabs endpoint with multilingual capabilities.
Putting both components together, a single pick cycle looks like this:

- The app calls ElevenLabs to generate the audio instruction
- The operator hears the instruction
- The operator walks to the location (hands free, eyes free)
- The operator picks the items and confirms verbally
- The speech recognition endpoint processes the confirmation and the app moves to the next picking location
The whole interaction takes a few seconds of system time.
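The cycle above can be sketched as a loop with the two voice components injected as plain callables. `speak` and `listen` are hypothetical stand-ins for the text-to-speech and speech recognition calls, and the retry limit is an assumption.

```python
def run_pick_cycle(lines, speak, listen, max_repeats=3):
    """Process each line in order and return the lines that need attention."""
    flagged = []
    for line in lines:
        for _ in range(max_repeats):
            speak(line)          # text-to-speech step
            intent = listen()    # speech recognition + intent matching step
            if intent == "CONFIRM":
                break
            if intent == "ISSUE":
                flagged.append(line)
                break
            # REPEAT or an unrecognised answer: play the instruction again.
        else:
            # No usable answer after max_repeats attempts.
            flagged.append(line)
    return flagged


# Dry run with scripted answers instead of real audio.
spoken = []
answers = iter(["CONFIRM", "REPEAT", "CONFIRM", "ISSUE"])
flagged = run_pick_cycle(["A1", "A3", "B2"], spoken.append, lambda: next(answers))
```

Injecting the callables makes it possible to test the flow without audio hardware or API access.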
What about the costs?
This is where the comparison with traditional systems becomes striking.

For this mid-size warehouse with 50 FTEs, they estimated that the traditional approach costs roughly $60K to $150K in the first year.
The AI-powered approach costs a few API calls.
The trade-off is clear: traditional systems offer proven reliability and offline capability for high-volume operations.
In case of failures, we have the manual mode as a fallback.
This AI-powered approach offers accessibility and speed for organisations that can’t justify a six-figure investment.
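The manual fallback can be expressed as a thin wrapper around the voice path. This is a sketch under assumptions: `voice_listen` and `manual_prompt` are hypothetical callables, and any exception or empty result is treated as a reason to fall back.

```python
def get_intent(voice_listen, manual_prompt):
    """Try the voice path first; fall back to the on-screen buttons on failure.

    Returns a (intent, source) pair so the app can log which mode was used.
    """
    try:
        intent = voice_listen()
    except Exception:
        # Network error, API outage, microphone problem, and so on.
        return manual_prompt(), "manual"
    if intent is None:
        # The answer was not understood: ask on screen instead.
        return manual_prompt(), "manual"
    return intent, "voice"
```

Logging the source of each answer would also show how often the fallback is used, which is a useful reliability signal before scaling up.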
What Does That Mean for Operations Managers and Decision Makers?
Voice picking is no longer a technology reserved for the largest 3PLs and retailers with large budgets.
If your warehouse has WiFi and your operators have smartphones, you can prototype a voice-guided picking system in days.
It is easy to test it on a real batch to measure the impact before committing any significant budget to productisation.
Three scenarios where this approach makes particular sense:
- Multilingual facilities where operators struggle with screen-based instructions in a language that is not their own
- Multi-site operations where deploying proprietary hardware to every small warehouse is not economically viable
- High-turnover environments where training time on complex scanning systems directly impacts productivity
What about other processes?
Good news: the same architecture extends beyond picking.
Voice-guided workflows can support any process where an operator needs instructions while keeping their hands free.
You can find a live demo of an inventory cycle counting tool here:
How to start this journey?
As you can easily guess, the front end of these applications was vibe-coded using Lovable and Claude Code.
For the backend, if you have limited coding experience, I’d suggest starting with n8n.

n8n is a low-code automation platform that lets you connect APIs and AI models using visual workflows.
The initial version of this solution was built with this tool:
- I started with a backend connected to a Telegram bot
- Users experimented with the tool through this interface
- After validation, we moved it to a web application
This is the easiest way to start, even with limited coding skills.
I share a step-by-step tutorial with free templates to start automating from day 1 in this video:
Let me know what you plan to build using all these great tools!
About Me
Let’s connect on LinkedIn and Twitter. I’m a Supply Chain Engineer who uses data analytics to improve logistics operations and reduce costs.
If you’re looking for tailored consulting solutions to optimise your supply chain and meet sustainability goals, please contact me.
