
Speech to Text to Speech with AI Using Python — a How-To Guide


How to Create a Speech-to-Text-to-Speech Program

Image by Mariia Shalabaieva on Unsplash

It’s been exactly a decade since I began attending GeekCon (yes, a geeks’ conference 🙂), a weekend-long hackathon-makeathon in which all projects must be useless and just-for-fun, and this year there was an exciting twist: all projects were required to incorporate some form of AI.

My group’s project was a speech-to-text-to-speech game, and here’s how it works: the user selects a character to talk to, and then verbally says whatever they’d like to the character. This spoken input is transcribed and sent to ChatGPT, which responds as if it were the character. The response is then read aloud using text-to-speech technology.

Now that the game is up and running, bringing laughs and fun, I’ve put together this how-to guide to help you create a similar game on your own. Throughout the article, we’ll also explore the various considerations and decisions we made during the hackathon.

Want to see the full code? Here is the link!

Once the server is running, the user will hear the app “talking”, prompting them to choose the figure they want to talk to and start conversing with their chosen character. Whenever they want to speak out loud, they should press and hold a key on the keyboard while talking. When they finish talking (and release the key), their recording will be transcribed by Whisper (a speech-to-text model by OpenAI), and the transcription will be sent to ChatGPT for a response. The response will be read out loud using a text-to-speech library, and the user will hear it.

Disclaimer

Note: The project was developed on a Windows operating system and uses the pyttsx3 library, which lacks compatibility with M1/M2 chips. Since pyttsx3 is not supported on Mac, users are advised to explore alternative text-to-speech libraries that are compatible with macOS environments.

OpenAI Integration

I used two OpenAI models: Whisper, for speech-to-text transcription, and the ChatGPT API for generating responses based on the user’s input to their chosen figure. While doing so costs money, the pricing is very cheap, and personally, my bill is still under $1 for all my usage. To start, I made an initial deposit of $5; so far, I have not exhausted this deposit, and it won’t expire until a year from now.
I’m not receiving any payment or benefits from OpenAI for writing this.

Once you get your OpenAI API key, set it as an environment variable to use when making the API calls. Make sure not to push your key to the codebase or any public location, and never to share it unsafely.
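A simple fail-fast check at startup helps catch a missing key before any API call is made. This is a minimal sketch, not part of the project's code; the helper name is illustrative:

```python
import os


def load_openai_api_key() -> str:
    """Read the API key from the environment and fail fast if it is missing."""
    key = os.environ.get("OPENAI_API_KEY")
    if not key:
        raise RuntimeError(
            "OPENAI_API_KEY is not set; export it before running the app.")
    return key
```

Failing early with a clear message beats a cryptic authentication error deep inside an API call.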

Speech to Text — Create Transcription

The implementation of the speech-to-text feature was achieved using Whisper, an OpenAI model.

Below is the code snippet for the function responsible for transcription:

async def get_transcript(audio_file_path: str,
                         text_to_draw_while_waiting: str) -> Optional[str]:
    openai.api_key = os.environ.get("OPENAI_API_KEY")
    audio_file = open(audio_file_path, "rb")
    transcript = None

    async def transcribe_audio() -> None:
        nonlocal transcript
        try:
            response = openai.Audio.transcribe(
                model="whisper-1", file=audio_file, language="en")
            transcript = response.get("text")
        except Exception as e:
            print(e)

    # Keep the user informed while the transcription is in progress
    draw_thread = Thread(target=print_text_while_waiting_for_transcription,
                         args=(text_to_draw_while_waiting,))
    draw_thread.start()

    transcription_task = asyncio.create_task(transcribe_audio())
    await transcription_task

    if transcript is None:
        print("Transcription not available within the specified timeout.")

    return transcript

This function is marked as asynchronous (async) since the API call may take a while to return a response, and we await it to make sure that the program doesn’t progress until the response is received.
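The pattern of wrapping a slow call in a task and awaiting it can be boiled down to a tiny standalone example (the function names here are illustrative, and `asyncio.sleep` stands in for the network round-trip):

```python
import asyncio


async def slow_api_call() -> str:
    await asyncio.sleep(0.05)  # stands in for the OpenAI round-trip
    return "transcript text"


async def main() -> str:
    task = asyncio.create_task(slow_api_call())
    # the program could do other work here while the call is in flight
    return await task


result = asyncio.run(main())
print(result)
```

Until `await task` resolves, the event loop is free to run other coroutines, which is exactly why the blocking transcription call doesn't freeze the rest of the flow.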

As you can see, the get_transcript function also invokes the print_text_while_waiting_for_transcription function. Why? Since obtaining the transcription is a time-consuming task, we wanted to keep the user informed that the program is actively processing their request and not stuck or unresponsive. As a result, this text is printed gradually while the user awaits the next step.
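The helper itself isn't shown in this article; a plausible minimal sketch (the project's actual implementation may differ) simply trickles the message out character by character:

```python
import sys
import time


def print_text_while_waiting_for_transcription(text: str) -> None:
    # Hypothetical sketch: print the waiting message one character at a
    # time so the user sees continuous activity while transcription runs.
    for char in text:
        sys.stdout.write(char)
        sys.stdout.flush()
        time.sleep(0.01)
    sys.stdout.write("\n")
```

Anything that visibly changes over time works here; a spinner or progress dots would serve the same purpose.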

String Matching Using FuzzyWuzzy for Text Comparison

After transcribing the speech into text, we either used it as is, or attempted to compare it with an existing string.

The comparison use cases were: choosing a figure from a predefined list of options, deciding whether to continue playing or not, and when opting to continue, deciding whether to choose a new figure or stick with the current one.

In such cases, we wanted to compare the user’s spoken input transcription with the options in our lists, and therefore we decided to use the FuzzyWuzzy library for string matching.

This enabled choosing the closest option from the list, as long as the matching score exceeded a predefined threshold.

Here’s a snippet of our function:

def detect_chosen_option_from_transcript(
        transcript: str, options: List[str]) -> str:
    best_match_score = 0
    best_match = ""

    for option in options:
        score = fuzz.token_set_ratio(transcript.lower(), option.lower())
        if score > best_match_score:
            best_match_score = score
            best_match = option

    if best_match_score >= 70:
        return best_match
    else:
        return ""

If you’d like to learn more about the FuzzyWuzzy library and its functions, you can check out an article I wrote about it here.
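If you’d rather avoid the extra dependency, the standard library’s difflib can play a similar role. This is a stand-in sketch, not the project’s code; SequenceMatcher returns a 0..1 similarity ratio instead of FuzzyWuzzy’s 0..100 score:

```python
from difflib import SequenceMatcher
from typing import List


def detect_chosen_option(transcript: str, options: List[str],
                         threshold: float = 0.7) -> str:
    # Same idea as the FuzzyWuzzy version: keep the best-scoring option,
    # but return it only if it clears the threshold.
    best_score, best_match = 0.0, ""
    for option in options:
        score = SequenceMatcher(
            None, transcript.lower(), option.lower()).ratio()
        if score > best_score:
            best_score, best_match = score, option
    return best_match if best_score >= threshold else ""


print(detect_chosen_option("shrek", ["Shrek", "Oprah Winfrey"]))  # → Shrek
```

Note that `token_set_ratio` is more forgiving of word order and extra words ("I want Shrek please"), so for full spoken sentences FuzzyWuzzy remains the better fit.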

Get ChatGPT Response

Once we have the transcription, we can send it over to ChatGPT to get a response.

For each ChatGPT request, we added a prompt asking for a short and funny response. We also told ChatGPT which figure to pretend to be.

So our function looked as follows:

def get_gpt_response(transcript: str, chosen_figure: str) -> str:
    system_instructions = get_system_instructions(chosen_figure)
    try:
        return make_openai_request(
            system_instructions=system_instructions,
            user_question=transcript).choices[0].message["content"]
    except Exception as e:
        logging.error(f"could not get ChatGPT response. error: {str(e)}")
        raise e

and the system instructions looked as follows:

def get_system_instructions(figure: str) -> str:
    return f"You provide funny and short answers. You're: {figure}"
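The make_openai_request helper isn’t shown in this article. Under the (pre-v1) openai Python SDK it would wrap openai.ChatCompletion.create; the message-building part, which is where the system instructions and the user’s transcript come together, can be sketched like this (the helper name and model choice are assumptions):

```python
from typing import Dict, List


def build_chat_messages(system_instructions: str,
                        user_question: str) -> List[Dict[str, str]]:
    # The message structure the ChatCompletion endpoint expects:
    # a system message setting the persona, then the user's transcript.
    return [
        {"role": "system", "content": system_instructions},
        {"role": "user", "content": user_question},
    ]

# These messages would then be passed to something like:
# openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
```

Keeping the persona in the system message, rather than prepending it to the user text, makes it easier for the model to stay in character across turns.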

Text to Speech

For the text-to-speech part, we opted for a Python library called pyttsx3. This choice was not only straightforward to implement but also offered several additional benefits. It’s free of charge, provides two voice options (female and male), and lets you select the speaking rate in words per minute (speech speed).

When a user starts the game, they pick a character from a predefined list of options. If we couldn’t find a match for what they said within our list, we’d randomly select a character from our “fallback figures” list. In both lists, each character was associated with a gender, so our text-to-speech function also received the voice ID corresponding to the chosen gender.
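The text_to_speech function references a Gender enum that isn’t shown in this article. A plausible minimal definition, along with a hypothetical fallback mapping, might look like this (both are assumptions, not the project’s exact code):

```python
from enum import Enum


class Gender(Enum):
    # Assumed definition; the project's actual enum may differ.
    FEMALE = "female"
    MALE = "male"


# Hypothetical fallback mapping of figures to voice gender:
FALLBACK_FIGURES = {
    "Shrek": Gender.MALE,
    "Oprah Winfrey": Gender.FEMALE,
}
```

Tying the gender to the figure in one place keeps the voice selection logic out of the main flow.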

This is what our text-to-speech function looked like:

def text_to_speech(text: str, gender: str = Gender.FEMALE.value) -> None:
    engine = pyttsx3.init()

    engine.setProperty("rate", WORDS_PER_MINUTE_RATE)
    voices = engine.getProperty("voices")
    voice_id = voices[0].id if gender == "male" else voices[1].id
    engine.setProperty("voice", voice_id)

    engine.say(text)
    engine.runAndWait()

The Main Flow

Now that we’ve more or less got all the pieces of our app in place, it’s time to dive into the gameplay! The main flow is outlined below. You might notice some functions we haven’t delved into (e.g. choose_figure, play_round), but you can explore the full code by checking out the repo. Eventually, most of these higher-level functions tie into the internal functions we’ve covered above.

Here’s a snippet of the main game flow:

import asyncio

from src.handle_transcript import text_to_speech
from src.main_flow_helpers import (choose_figure, start, play_round,
                                   is_another_round)


def farewell() -> None:
    farewell_message = ("It was great having you here, "
                        "hope to see you again soon!")
    print(f"\n{farewell_message}")
    text_to_speech(farewell_message)


async def get_round_settings(figure: str) -> dict:
    new_round_choice = await is_another_round()
    if new_round_choice == "new figure":
        return {"figure": "", "another_round": True}
    elif new_round_choice == "no":
        return {"figure": "", "another_round": False}
    elif new_round_choice == "yes":
        return {"figure": figure, "another_round": True}


async def main():
    start()
    another_round = True
    figure = ""

    while True:
        if not figure:
            figure = await choose_figure()

        while another_round:
            await play_round(chosen_figure=figure)
            user_choices = await get_round_settings(figure)
            figure, another_round = (user_choices.get("figure"),
                                     user_choices.get("another_round"))
            if not figure:
                break

        if another_round is False:
            farewell()
            break


if __name__ == "__main__":
    asyncio.run(main())

We had several ideas in mind that we didn’t get to implement during the hackathon. This was either because we didn’t find an API we were satisfied with during that weekend, or because time constraints prevented us from developing certain features. These are the paths we didn’t take for this project:

Matching the Response Voice with the Chosen Figure’s “Actual” Voice

Imagine if the user chose to talk to Shrek, Trump, or Oprah Winfrey. We wanted our text-to-speech library or API to articulate responses using voices that matched the chosen figure. However, we couldn’t find a library or API during the hackathon that offered this feature at a reasonable cost. We’re still open to suggestions if you have any =)

Let the Users Talk to “Themselves”

Another intriguing idea was to prompt users to provide a vocal sample of themselves speaking. We’d then train a model using this sample and have all the responses generated by ChatGPT read aloud in the user’s own voice. In this scenario, the user could select the tone of the responses (affirmative and supportive, sarcastic, angry, etc.), but the voice would closely resemble that of the user. However, we couldn’t find an API that supported this within the constraints of the hackathon.

Adding a Frontend to Our Application

Our initial plan was to include a frontend component in our application. However, due to a last-minute change in the number of participants in our group, we decided to prioritize the backend development. As a result, the application currently runs on the command line interface (CLI) and doesn’t have a frontend side.

Latency is what bothers me most at the moment.

There are several components in the flow with a relatively high latency that, in my opinion, slightly harm the user experience. For example: the time it takes from finishing the audio input until receiving a transcription, and the time from when the user presses a button until the system actually starts recording the audio. So if the user starts talking right after pressing the key, there will be at least one second of audio that won’t be recorded due to this lag.
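Before optimizing, it helps to measure which step actually dominates the delay. A small timing decorator (not part of the project; the decorated function here is a fake stand-in) could be wrapped around each stage of the pipeline:

```python
import time
from functools import wraps


def timed(fn):
    # Instrumentation helper: print how long each wrapped call takes,
    # so we can see which stage contributes most to the lag.
    @wraps(fn)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = fn(*args, **kwargs)
        elapsed = time.perf_counter() - start
        print(f"{fn.__name__} took {elapsed:.2f}s")
        return result
    return wrapper


@timed
def fake_transcription() -> str:
    time.sleep(0.05)  # stands in for the real transcription round-trip
    return "transcript"
```

Wrapping the transcription, the ChatGPT request, and the recording start-up separately would show whether the biggest win lies in the network calls or in the local audio handling.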

Want to see the whole project? It’s right here!

Also, warm credit goes to Lior Yardeni, my hackathon partner with whom I created this game.

In this article, we learned how to create a speech-to-text-to-speech game using Python, and intertwined it with AI. We used the Whisper model by OpenAI for speech recognition, played around with the FuzzyWuzzy library for text matching, tapped into ChatGPT’s conversational magic via their developer API, and brought it all to life with pyttsx3 for text-to-speech. While OpenAI’s services (Whisper and ChatGPT for developers) do come with a modest cost, it’s budget-friendly.

We hope you’ve found this guide enlightening and that it motivates you to embark on your own projects.

Cheers to coding and fun! 🚀
