Crafting a Custom Voice Assistant with Perplexity


Google Assistant, Alexa, and Siri are the dominant voice assistants in everyday use. These assistants have become ubiquitous in almost every home, carrying out tasks from home automation, note taking, and recipe guidance to answering simple questions. When it comes to answering questions, though, in the age of LLMs, getting a concise and context-based answer from these voice assistants can be tricky, if not impossible. For instance, if you ask Google Assistant how the market is reacting to Jerome Powell's speech in Jackson Hole on Aug 22, it will simply reply that it doesn't know the answer and offer a few links you can peruse. That is, if you have the screen-based Google Assistant.

Often you just want a quick answer on current events, or you want to know whether an apple tree would survive the winter in Ohio, and voice assistants like Google and Siri often fall short of providing a satisfying answer. This got me thinking about building my own voice assistant, one that would give me a simple, single-sentence answer based on its search of the web.

Photo by Aerps.com on Unsplash

Of the various LLM-powered search engines available, I have been an avid user of Perplexity for more than a year now, and I use it for nearly all my searches, except for simple ones where I still turn to Google or Bing. Perplexity, with its live web index that enables it to provide up-to-date, accurate, sourced answers, gives users access to its functionality through a powerful API. Using this functionality and integrating it with a simple Raspberry Pi, I intended to create a voice assistant that would:

  • Respond to a wake word and be ready to answer my query
  • Answer my query in a simple, concise sentence
  • Return to passive listening without selling my data or serving me unnecessary ads

The Hardware for the Assistant

Photo by Axel Richter on Unsplash

To build our voice assistant, a few key hardware components are required. The core of the project is a Raspberry Pi 5, which serves as the central processor for our application. For the assistant's audio input, I chose a simple USB gooseneck microphone. This type of microphone is omnidirectional, making it effective at hearing the wake word from different parts of a room, and its plug-and-play nature simplifies the setup. For the assistant's output, a compact USB-powered speaker provides the audio. A key advantage of this speaker is that it uses a single USB cable for both its power and audio signal, which minimizes cable clutter.

Block diagram showing the functionality of the custom voice assistant (image by author)

This approach of using readily available USB peripherals makes the hardware assembly straightforward, allowing us to focus our efforts on the software.

Getting the environment ready

To query Perplexity with custom prompts and to give the voice assistant a wake word, we need to generate a couple of API keys. To generate a Perplexity API key, sign up for a Perplexity account, go to the Settings menu, select the API tab, and click "Generate API Key" to create and copy your personal key for use in applications. Access to API key generation usually requires a paid plan or a payment method on file, so make sure the account is eligible before proceeding.
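Both keys can then be placed in a .env file in the project directory, so they never appear in the source code. A minimal sketch with placeholder values (the variable names match the ones used later in the script):

# .env (keep this file out of version control)
PICOVOICE_ACCESS_KEY=your-picovoice-access-key-here
PERPLEXITY_API_KEY=your-perplexity-api-key-here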

Platforms that offer wake word customization include Picovoice Porcupine, Sensory TrulyHandsfree, and Snowboy, with Picovoice Porcupine providing a simple online console for generating, testing, and deploying custom wake words across desktop, mobile, and embedded devices. A new user can generate a custom wake word for Porcupine by signing up for a free Picovoice Console account, navigating to the Porcupine page, selecting the desired language, typing in the custom wake word, and clicking "Train" to produce and download the platform-specific model file (.ppn). Make sure to test the wake word for performance before finalizing, as this ensures reliable detection and minimal false positives. The wake word I have trained and will use is "Hey Krishna".
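Once the .ppn file is downloaded to the Raspberry Pi, loading it is a one-liner with the pvporcupine package. A minimal sketch, assuming the access key has already been loaded (as shown in the configuration section below):

import pvporcupine

# Create a Porcupine engine instance from the custom wake word model.
# The access key comes from the Picovoice Console account.
porcupine = pvporcupine.create(
    access_key=PICOVOICE_ACCESS_KEY,
    keyword_paths=["Krishna_raspberry-pi.ppn"],
)

# The engine reports the audio format it expects from the microphone
print(porcupine.sample_rate)   # e.g. 16000 Hz
print(porcupine.frame_length)  # e.g. 512 samples per frame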

Coding the Assistant

The complete Python script for this project is available on my GitHub repository. In this section, let's look at the key components of the code to understand how the assistant functions.
The script is organized into a few core functions that handle the assistant's senses and intelligence, all managed by a central loop.

Configuration and Initialization

The first part of the script is dedicated to setup. It handles loading the necessary API keys and model files, and initializing the clients for the services we'll use.

import os
from dotenv import load_dotenv

# --- 1. Configuration ---
load_dotenv()
PICOVOICE_ACCESS_KEY = os.environ.get("PICOVOICE_ACCESS_KEY")
PERPLEXITY_API_KEY = os.environ.get("PERPLEXITY_API_KEY")
KEYWORD_PATHS = ["Krishna_raspberry-pi.ppn"]  # My wake word path
MODEL_NAME = "sonar"

This section uses the dotenv library to securely load your secret API keys from a .env file, which is a best practice that keeps them out of your source code. It also defines key variables like the path to your custom wake word file and the specific Perplexity model we want to query.
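With the Porcupine engine created as shown earlier, the remaining clients can be initialized. A minimal sketch, assuming the pyaudio, speech_recognition, and openai packages (Perplexity's API is OpenAI-compatible, so the standard OpenAI client works when pointed at Perplexity's base URL):

import pyaudio
import speech_recognition as sr
from openai import OpenAI

# Open a raw microphone stream in the format Porcupine expects
pa = pyaudio.PyAudio()
audio_stream = pa.open(
    rate=porcupine.sample_rate,
    channels=1,
    format=pyaudio.paInt16,
    input=True,
    frames_per_buffer=porcupine.frame_length,
)

# Speech-to-text recognizer and the Perplexity client
recognizer = sr.Recognizer()
perplexity_client = OpenAI(
    api_key=PERPLEXITY_API_KEY,
    base_url="https://api.perplexity.ai",
)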

Wake Word Detection

For the assistant to be truly hands-free, it must listen continuously for a specific wake word without using significant system resources. This is handled by the while True: loop in the main function, which uses the Picovoice Porcupine engine.

# This is the main loop that runs continuously
while True:
    # Read a small chunk of raw audio data from the microphone
    pcm = audio_stream.read(porcupine.frame_length)
    pcm = struct.unpack_from("h" * porcupine.frame_length, pcm)

    # Feed the audio chunk into the Porcupine engine for analysis
    keyword_index = porcupine.process(pcm)

    if keyword_index >= 0:
        # Wake word was detected, proceed to handle the command...
        print("Wake word detected!")

This loop is the heart of the assistant's "passive listening" state. It continuously reads small, raw audio frames from the microphone stream. Each frame is then passed to the porcupine.process() function. This is a highly efficient, offline process that analyzes the audio for the specific acoustic pattern of your custom wake word ("Hey Krishna"). If the pattern is detected, porcupine.process() returns a non-negative number, and the script proceeds to the active phase of listening for a full command.

Speech-to-Text — Converting user queries to text

After the wake word is detected, the assistant must listen for and understand the user's query. This is handled by the Speech-to-Text (STT) component.

# --- This logic is inside the main 'if keyword_index >= 0:' block ---

print("Listening for command...")
frames = []
# Record audio from the stream for a set duration (~10 seconds)
for _ in range(0, int(porcupine.sample_rate / porcupine.frame_length * 10)):
    frames.append(audio_stream.read(porcupine.frame_length))

# Convert the raw audio frames into an object the library can use
# (a sample width of 2 bytes matches the 16-bit PCM stream)
audio_data = sr.AudioData(b"".join(frames), porcupine.sample_rate, 2)

command = None
try:
    # Send the audio data to Google's service for transcription
    command = recognizer.recognize_google(audio_data)
    print(f"You (command): {command}")
except sr.UnknownValueError:
    # The audio was captured but could not be understood
    speak_text("Sorry, I didn't catch that.")
except sr.RequestError:
    # The recognition service could not be reached (e.g., no internet)
    speak_text("Sorry, the speech service is unavailable right now.")

Once the wake word is detected, the code actively records audio from the microphone for about 10 seconds, capturing the user's spoken command. It then packages this raw audio data and sends it to Google's speech recognition service using the speech_recognition library. The service processes the audio and returns the transcribed text, which is then stored in the command variable.
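One limitation of this approach is the fixed 10-second recording window: short questions are padded with silence, and long ones get cut off. A possible refinement, sketched below under the assumption that pausing the Porcupine stream frees the microphone on your device, is to let the recognizer listen until the speaker pauses:

# Pause the raw Porcupine stream so the microphone is free
audio_stream.stop_stream()

# Let speech_recognition record until a natural pause,
# capped at 10 seconds of speech
with sr.Microphone(sample_rate=porcupine.sample_rate) as source:
    audio_data = recognizer.listen(source, timeout=5, phrase_time_limit=10)

# Resume passive listening afterwards
audio_stream.start_stream()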

Getting Answers from Perplexity

Once the user's command has been converted to text, it is sent to the Perplexity API to get an intelligent, up-to-date answer.

# --- This logic runs if a command was successfully transcribed ---

if command:
    # Define the instructions and context for the AI
    messages = [{"role": "system", "content": "You are an AI assistant. You are located in Twinsburg, Ohio. All answers must be relevant to Cleveland, Ohio unless asked for differently by the user. You MUST answer all questions in a single and VERY concise sentence."}]
    messages.append({"role": "user", "content": command})

    # Send the request to the Perplexity API
    response = perplexity_client.chat.completions.create(
        model=MODEL_NAME,
        messages=messages
    )
    assistant_response_text = response.choices[0].message.content.strip()
    speak_text(assistant_response_text)

This code block is the "brain" of the operation. It first constructs a messages list, which includes a critical system prompt. This prompt gives the AI its personality and rules, such as answering in a single sentence and being aware of its location in Ohio. The user's command is then added to this list, and the complete package is sent to the Perplexity API. The script then extracts the text from the AI's response and passes it to the speak_text function to be read aloud.
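One small, optional tweak worth considering: Perplexity's sonar models often append bracketed citation markers such as [1] to their answers, which sound awkward when read aloud. A simple regular expression, sketched here as an assumption about the response format, can strip them before the text reaches the speaker:

import re

# Remove bracketed citation markers like [1] or [2][3] before speaking
assistant_response_text = re.sub(r"\[\d+\]", "", assistant_response_text).strip()
speak_text(assistant_response_text)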

Text-to-Speech — Converting the Perplexity response to voice

The speak_text function is what gives the assistant its voice.

import os
import pygame
from gtts import gTTS

def speak_text(text_to_speak, lang='en'):
    # Convert text to speech; the default language is English

    print(f"Assistant (speaking): {text_to_speak}")
    # Print the text for reference so the user can see what's being spoken

    try:
        pygame.mixer.init()
        # Initialize the Pygame mixer module for audio playback

        tts = gTTS(text=text_to_speak, lang=lang, slow=False)
        # Create a Google Text-to-Speech (gTTS) object with the provided text and language
        # 'slow=False' makes the speech sound more natural (not slow-paced)

        mp3_filename = "response_audio.mp3"
        # Set the filename where the generated speech will be saved

        tts.save(mp3_filename)
        # Save the generated speech as an MP3 file

        pygame.mixer.music.load(mp3_filename)
        # Load the MP3 file into Pygame's music player for playback

        pygame.mixer.music.play()
        # Start playing the speech audio

        while pygame.mixer.music.get_busy():
            pygame.time.Clock().tick(10)
        # Block until playback finishes, checking 10 times per second
        # This prevents the script from moving on before the speech ends

        pygame.mixer.quit()
        # Quit the Pygame mixer once playback is complete to free resources

        os.remove(mp3_filename)
        # Delete the temporary MP3 file after playback to clean up

    except Exception as e:
        print(f"Error in Text-to-Speech: {e}")
        # Catch and display any errors during speech generation or playback

This function takes a text string, prints it for reference, then uses the gTTS (Google Text-to-Speech) library to generate a temporary MP3 audio file. It plays the file through the system's speakers using the pygame library, waits until playback is finished, and then deletes the file. Error handling is included to catch issues during the process.

Testing the assistant

Below is a demonstration of the custom voice assistant in action. To compare its performance with Google Assistant, I asked the same question of Google Assistant as well as the custom assistant.

As you can see, Google provides links to the answer rather than a brief summary of what the user wants. The custom assistant goes further, providing a summary, and is more helpful and informative.

Conclusion

In this article, we walked through the process of building a fully functional, hands-free voice assistant on a Raspberry Pi. By combining the power of a custom wake word with the Perplexity API using Python, we created a simple voice assistant device that helps in getting information quickly.

The key advantage of this LLM-based approach is its ability to deliver direct, synthesized answers to complex and current questions, a task where assistants like Google Assistant often fall short by simply providing a list of search links. Instead of acting as a mere voice interface for a search engine, our assistant functions as a true answer engine, parsing real-time web results to present a single, concise response. The future of voice assistants lies in this deeper, more intelligent integration, and building your own is the best way to explore it.
