How to Develop a Bilingual Voice Assistant


Alexa, Google Assistant, and Siri are the ever-present voice assistants that serve much of the web-connected population today. For the most part, English is the dominant language used with these voice assistants. However, for a voice assistant to be truly helpful, it must be able to understand the user as they naturally speak. In many parts of the world, especially in a diverse country like India, it is common for people to be multilingual and to switch between multiple languages in a single conversation. A truly smart assistant should be able to handle this.

Google Assistant offers the ability to add a second language, but this functionality is limited to certain devices and to a small set of major languages. For instance, Google's Nest Hub doesn't yet support bilingual capabilities for Tamil, a language spoken by over 80 million people. Alexa supports a bilingual setup as long as the combination is one of its supported language pairs; again, this covers only a limited set of major languages. Siri has no bilingual capability and allows just one language at a time.

In this article I'll discuss the approach taken to give a Raspberry Pi-based voice assistant bilingual capability, with English and Tamil as the languages. Using this approach, the voice assistant will be able to automatically detect the language a person is speaking by analyzing the audio directly. Using a "confidence score"-based algorithm, the system will determine whether English or Tamil is being spoken and respond in the corresponding language.

Approach to Bilingual Capability

To make the assistant understand both English and Tamil, there are a few potential solutions. The first approach would be to train a custom machine learning model from scratch, specifically on Tamil language data, and then integrate that model into the Raspberry Pi. While this would offer a high degree of customization, it is an incredibly time-consuming and resource-intensive process. Training a model requires an enormous dataset and significant computational power. Moreover, running a heavy custom model would likely slow down the Raspberry Pi, resulting in a poor user experience.

fastText Approach

A more practical solution is to use an existing, pre-trained model that is already optimized for a specific task. For language identification, a great option is fastText.

fastText is an open-source library from Facebook AI Research designed for efficient text classification and word representation. It comes with pre-trained models that can quickly and accurately identify the language of a given piece of text, covering a large number of languages. Because it is lightweight and highly optimized, it is an excellent choice for running on a resource-constrained device like a Raspberry Pi without causing significant performance issues. The plan, therefore, was to use fastText to classify the user's spoken language.

To use fastText, download the language identification model (lid.176.bin) and store it in your project folder. Specify this as the MODEL_PATH and load the model.

import speech_recognition as sr
import fasttext

# --- Configuration ---
MODEL_PATH = "./lid.176.bin"  # This is the model file you downloaded and unzipped

# --- Main Application Logic ---
print("Loading fastText language identification model...")
try:
    # Load the pre-trained language identification model
    model = fasttext.load_model(MODEL_PATH)
except Exception as e:
    print(f"FATAL ERROR: Could not load the fastText model. Error: {e}")
    exit()

The next step is to pass the transcribed voice commands to the model and get a prediction back. This can be achieved through a dedicated function.

def identify_language(text, model):
    # The model.predict() function returns a tuple of labels and probabilities
    predictions = model.predict(text, k=1)
    language_code = predictions[0][0] # e.g., '__label__en'
    return language_code
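
As a quick sanity check (a minimal sketch; the sample sentences below are illustrative, not real assistant input), the function can be called directly on text before wiring in the microphone:

# Illustrative test strings for the language identifier
print(identify_language("what is the weather today", model))  # expected: '__label__en'
print(identify_language("உங்கள் பெயர் என்ன", model))          # expected: '__label__ta'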

# Set up the recognizer and microphone
recognizer = sr.Recognizer()
microphone = sr.Microphone()

try:
    with microphone as source:
        recognizer.adjust_for_ambient_noise(source, duration=1)
        print("\nPlease speak now...")
        audio = recognizer.listen(source, phrase_time_limit=8)

    print("Transcribing audio...")
    # Get a rough transcription without specifying a language
    transcription = recognizer.recognize_google(audio)
    print(f'Heard: "{transcription}"')

    # Identify the language from the transcribed text
    language = identify_language(transcription, model)

    if language == '__label__en':
        print("\n---> Result: The detected language is English. <---")
    elif language == '__label__ta':
        print("\n---> Result: The detected language is Tamil. <---")
    else:
        print(f"\n---> Result: Detected a different language: {language}")

except sr.UnknownValueError:
    print("Could not understand the audio.")
except sr.RequestError as e:
    print(f"Speech recognition service error; {e}")
except Exception as e:
    print(f"An unexpected error occurred: {e}")

The code block above follows a simple path. It uses the recognize_google() function to transcribe the voice command and then passes this transcription to the fastText model to get a prediction of the language. If the prediction is "__label__en" then English has been detected, and if the prediction is "__label__ta" then Tamil has been detected.

This approach led to poor predictions though. The issue is that recognize_google() defaults to English when no language is specified. So when I speak something in Tamil, it finds the closest (and incorrect) equivalent-sounding words in English and passes them to fastText.

For instance, when I said "En Peyar Enna" ("What is my name?" in Tamil), recognize_google() transcribed it as the closest-sounding English words, and hence fastText predicted the language as English. To overcome this, I could hardcode the recognize_google() function to detect only Tamil. But this would defeat the idea of being truly 'smart' and 'bilingual'. The assistant should be able to detect the language based on what is spoken, not based on what is hardcoded.
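
For reference, the rejected hardcoded approach would look like the following (a minimal sketch): the language parameter forces every transcription through Tamil, regardless of what was actually spoken.

# Rejected approach: force the recognizer to treat all audio as Tamil.
# Tamil speech transcribes correctly, but English input gets mangled.
transcription = recognizer.recognize_google(audio, language='ta-IN')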


The 'Confidence Score' Method

What we need is a more direct and data-driven method. The answer lies within a feature of the speech_recognition library. The recognize_google() function calls the Google Speech Recognition API, which can transcribe audio from a huge variety of languages, including both English and Tamil. A key feature of this API is that for every transcription it provides, it can also return a confidence score, a numerical value between 0 and 1 indicating how certain it is that its transcription is correct.

This feature allows for a much more elegant and dynamic approach to language identification. Let's take a look at the code.

def recognize_with_confidence(recognizer, audio_data):
    """Transcribe the audio as both Tamil and English and return the higher-confidence result."""
    tamil_text = None
    tamil_confidence = 0.0
    english_text = None
    english_confidence = 0.0

    # 1. Try to recognize as Tamil and get confidence
    try:
        print("Attempting to transcribe as Tamil...")
        # show_all=True returns a dictionary with transcription alternatives
        response_tamil = recognizer.recognize_google(audio_data, language='ta-IN', show_all=True)
        # We only look at the top alternative
        if response_tamil and 'alternative' in response_tamil:
            top_alternative = response_tamil['alternative'][0]
            tamil_text = top_alternative['transcript']
            if 'confidence' in top_alternative:
                tamil_confidence = top_alternative['confidence']
            else:
                tamil_confidence = 0.8 # Assign a default high confidence if not provided
    except sr.UnknownValueError:
        print("Couldn't understand audio as Tamil.")
    except sr.RequestError as e:
        print(f"Tamil recognition service error; {e}")

    # 2. Try to recognize as English and get confidence
    try:
        print("Attempting to transcribe as English...")
        response_english = recognizer.recognize_google(audio_data, language='en-US', show_all=True)
        if response_english and 'alternative' in response_english:
            top_alternative = response_english['alternative'][0]
            english_text = top_alternative['transcript']
            if 'confidence' in top_alternative:
                english_confidence = top_alternative['confidence']
            else:
                english_confidence = 0.8 # Assign a default high confidence
    except sr.UnknownValueError:
        print("Couldn't understand audio as English.")
    except sr.RequestError as e:
        print(f"English recognition service error; {e}")

    # 3. Compare confidence scores and return the winner
    print(f"nConfidence Scores -> Tamil: {tamil_confidence:.2f}, English: {english_confidence:.2f}")
    if tamil_confidence > english_confidence:
        return tamil_text, "Tamil"
    elif english_confidence > tamil_confidence:
        return english_text, "English"
    else:
        # If scores are equal (or each zero), return neither
        return None, None

The logic in this code block is simple. We pass the audio to the function and get the full list of alternatives and their scores. First we try the language as Tamil and get the corresponding confidence score. Then we try the same audio as English and get the corresponding confidence score from the API. Once we have both, we compare the confidence scores and select the one with the higher score as the language detected by the system.
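
As a usage sketch (assuming the same recognizer and microphone setup as in the earlier block), the function slots into the listening flow like this:

with microphone as source:
    recognizer.adjust_for_ambient_noise(source, duration=1)
    print("\nPlease speak now...")
    audio = recognizer.listen(source, phrase_time_limit=8)

# Transcribe the same audio as both Tamil and English, keep the winner
text, language = recognize_with_confidence(recognizer, audio)

if language is None:
    print("Could not determine the language.")
else:
    print(f"Detected {language}: {text}")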

Below is the output of the function when I speak in English and when I speak in Tamil.

Screenshot from Visual Studio output (Tamil). Image owned by the author.
Screenshot from Visual Studio output (English). Image owned by the author.

The results above show how the code is able to detect the spoken language dynamically, based on the confidence score.

Putting It All Together: The Bilingual Assistant

The final step is to integrate this approach into the code for the Raspberry Pi-based voice assistant. The full code can be found on my GitHub. Once integrated, the next step is to test the functioning of the voice assistant by speaking in English and Tamil and seeing how it responds to each language. The recordings below demonstrate the working of the bilingual voice assistant when asked a question in English and in Tamil.
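
To illustrate the integration, here is a minimal sketch of how the detected language can drive the reply. It is not the full GitHub code; it assumes gTTS for text-to-speech and a hypothetical generate_reply() helper that produces the answer text.

from gtts import gTTS
import os

def speak(text, language):
    # Map the detected language to the matching gTTS voice
    lang_code = 'ta' if language == "Tamil" else 'en'
    gTTS(text=text, lang=lang_code).save("reply.mp3")
    os.system("mpg123 reply.mp3")  # assumes the mpg123 player is installed

text, language = recognize_with_confidence(recognizer, audio)
if language:
    reply = generate_reply(text, language)  # hypothetical helper: builds the answer text
    speak(reply, language)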

Conclusion

In this article, we have seen how to successfully upgrade a simple voice assistant into a truly bilingual tool. By implementing a "confidence score" algorithm, the system can determine whether a command is spoken in English or Tamil, allowing it to understand and reply in the user's chosen language for that specific query. This creates a more natural and seamless conversational experience.

The key advantage of this method is its reliability and scalability. While this project focused on just two languages, the same confidence score logic could easily be extended to support three, four, or more languages by simply adding an API call for each new language and comparing all the results. The techniques explored here serve as a strong foundation for creating more advanced and intuitive personal AI tools.
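
As a sketch of that extension (hypothetical; the language set and codes are illustrative), the pairwise comparison generalizes to a loop that keeps the highest-scoring transcription:

LANGUAGES = {"Tamil": "ta-IN", "English": "en-US", "Hindi": "hi-IN"}  # illustrative set

def recognize_multilingual(recognizer, audio_data):
    best_text, best_language, best_confidence = None, None, 0.0
    for name, code in LANGUAGES.items():
        try:
            response = recognizer.recognize_google(audio_data, language=code, show_all=True)
            if response and 'alternative' in response:
                top = response['alternative'][0]
                confidence = top.get('confidence', 0.8)  # same default as before
                if confidence > best_confidence:
                    best_text, best_language, best_confidence = top['transcript'], name, confidence
        except (sr.UnknownValueError, sr.RequestError):
            continue  # this language attempt failed; try the next one
    return best_text, best_language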

References:

[1] A. Joulin, E. Grave, P. Bojanowski, T. Mikolov, Bag of Tricks for Efficient Text Classification (2016), arXiv preprint

[2] A. Joulin, E. Grave, P. Bojanowski, M. Douze, H. Jégou, T. Mikolov, FastText.zip: Compressing Text Classification Models (2016), arXiv preprint
