Use OpenAI Whisper for Automated Transcriptions


There is a lot of development happening with large language models (LLMs) right now. Much of the focus is on the question answering you can do with either pure text-based models or vision-language models (VLMs), where you can also input images.

However, there is another dimension that has evolved a lot over the past couple of years: audio. Models can now transcribe (speech -> text), perform speech synthesis (text -> speech), and even do speech-to-speech, where you can have an entire conversation with a language model, with audio going both in and out.

The architecture and training pipeline for OpenAI’s Whisper model. Image from the OpenAI Whisper GitHub repository, MIT license.

In this article, I’ll discuss how I’m using the developments in the audio model space to my advantage, becoming an even more efficient programmer.

This is an example video of me using the transcription tool. I first select the prompt field in Cursor and use my hotkey to activate the microphone, which is indicated by the orange icon in the top left. I then speak the sentence I want to transcribe, and it quickly appears in the prompt window without me having to type on the keyboard at all. This is a more efficient way to enter long English prompts into your editor. Video by the author.

Motivation

My main motivation for writing this article is that I’m continually looking for ways to become a more efficient programmer. After using the ChatGPT mobile app for a while, I discovered its transcription option (the microphone icon on the right side of the user input field). I used the transcription and quickly realized how much better it is compared to others I have used before, such as Apple’s built-in iPhone transcription.

OpenAI’s transcription almost always captures all of my words, with very few mistakes. Even when I use less common words, for example acronyms related to computer science, it is still able to pick up what I’m saying.

The transcription icon in the OpenAI application. Image by the author, taken from OpenAI’s ChatGPT.

This transcription was only available in the ChatGPT app, however. I know that OpenAI has an API endpoint for their Whisper model, which is (presumably) the same model they use to transcribe text in the app. I thus wanted to set this model up on my Mac and make it available via a shortcut.
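To give a sense of what this looks like, here is a minimal sketch of a transcription call against that endpoint, using the official openai Python package. The file name audio.wav is just a placeholder, and the API key is assumed to be set in the OPENAI_API_KEY environment variable:

# Minimal sketch: transcribe an audio file with OpenAI's Whisper API.
# Assumes: pip install openai, and OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY automatically

with open("audio.wav", "rb") as audio_file:  # placeholder file name
    transcript = client.audio.transcriptions.create(
        model="whisper-1",
        file=audio_file,
    )

print(transcript.text)  # the transcribed text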

(I know there are apps such as MacWhisper available, but I wanted to develop a completely free solution, apart from the costs of the API calls themselves.)

Prerequisites

  • Alfred (I will be using Alfred on the Mac to trigger some scripts; alternatives to this also exist. In general, you need a way to trigger scripts on your Mac / PC from a hotkey.)

Pros

The main advantage of using this transcription is that you can input words into your computer more quickly. When I type as fast as I can on my computer, I’m not even able to reach 100 words per minute, and to type at that speed, I really have to focus. The average talking speed, however, is at least 110 words per minute, according to this article.

This means you can be a lot more effective if you speak your words through transcription instead of typing them out on the keyboard.

I think this is especially relevant after the rise of large language models such as ChatGPT. You spend more time prompting the language models, for example asking questions to ChatGPT, or prompting Cursor to implement a feature or fix a bug. The usage of the English language is thus much more prevalent now than before, compared to the direct usage of programming languages such as Python.

Note: Of course, you’ll still be writing a lot of code, but from experience, I spend much more time prompting Cursor with extensive English prompts, in which case using this transcription saves me a lot of time.

Cons

There can, however, be some downsides to using the transcription as well. One of the main ones is that a lot of the time, you don’t want to speak out loud when programming. You could be sitting in an airport (as I am while writing this article) or in your office. In these scenarios, you most likely don’t want to disturb those around you by speaking out loud. If you are sitting in a home office, however, this is naturally not an issue.

Another downside is that shorter prompts may not be that much faster. Imagine this: if you just want to write a prompt of a single sentence, it will in many scenarios be faster to simply type the prompt out by hand. This is because of the delay in starting, stopping, and transcribing the audio into text. Sending the API call takes a little bit of time, and the shorter the prompt, the larger the fraction of the total time you spend waiting for the response.
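To make this concrete, here is a rough break-even estimate. All the numbers are illustrative assumptions; your typing speed and the actual API latency will vary:

# Rough break-even estimate: when is speaking faster than typing?
TYPING_WPM = 80     # assumed typing speed
SPEAKING_WPM = 110  # average speaking speed mentioned earlier
OVERHEAD_S = 3.0    # assumed hotkey + API round-trip overhead

for words in (5, 20, 100):
    typing_s = words / TYPING_WPM * 60
    speaking_s = words / SPEAKING_WPM * 60 + OVERHEAD_S
    winner = "speaking" if speaking_s < typing_s else "typing"
    print(f"{words} words: typing {typing_s:.1f}s, speaking {speaking_s:.1f}s -> {winner} wins")

Under these assumptions, typing wins for a five-word prompt, while speaking pulls ahead somewhere around twenty words.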

Implementation

You can find the code I used in this article on my GitHub. Note, however, that you also have to add hotkeys to run the scripts.

First, you have to:

  • Clone the GitHub repository:
git clone https://github.com/EivindKjosbakken/whisper-shortcut.git
  • Create a virtual environment called .venv and install the required packages:
python3 -m venv .venv
source .venv/bin/activate
pip install -r requirements.txt
  • Get an OpenAI API key. You can do that by:
    • Going to the OpenAI API overview and logging in / creating a profile
    • Going to your profile, then to API Keys
    • Creating a new key. Remember to copy the key, as you won’t be able to see it again (I sketch below how the scripts can pick the key up)
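The scripts need access to this key. A typical pattern, and an assumption on my part since the repository may load the key differently, is to export it as an environment variable (for example in your shell profile) and read it in Python:

# Hypothetical: read the API key from the environment and fail early if unset.
import os

api_key = os.environ.get("OPENAI_API_KEY")
if not api_key:
    raise RuntimeError("Set the OPENAI_API_KEY environment variable first.")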

The scripts from the GitHub repository work as follows (a simplified sketch of the recording mechanism follows after the list):

  • start_recording.sh — starts recording your voice. The first time you use this, it will ask you for permission to use the microphone
  • stop_recording.sh — sends a stop signal to the recording script, then sends the recorded audio to OpenAI for transcription. Additionally, it adds the transcribed text to your clipboard and pastes it if you have a text field selected on your PC
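As a simplified sketch of how such a start/stop pair can work: the start script records from the microphone until a stop-flag file appears, and the stop script creates that file. Note that this is my own illustration with assumed package choices (sounddevice and soundfile), not necessarily how the repository implements it:

# Simplified sketch of the recording side: capture microphone audio until a
# stop-flag file appears (hypothetical mechanism, assumed packages).
# Requires: pip install sounddevice soundfile
import os
import queue

import sounddevice as sd
import soundfile as sf

STOP_FLAG = "/tmp/whisper_stop"  # stop_recording.sh would create this file
OUTPUT = "/tmp/recording.wav"

def record_until_stopped(samplerate=16000):
    buf = queue.Queue()

    def callback(indata, frames, time, status):
        buf.put(indata.copy())  # hand each audio chunk to the writer loop

    with sf.SoundFile(OUTPUT, mode="w", samplerate=samplerate, channels=1) as f:
        with sd.InputStream(samplerate=samplerate, channels=1, callback=callback):
            while not os.path.exists(STOP_FLAG):
                f.write(buf.get())  # append chunks until the flag shows up

if __name__ == "__main__":
    record_until_stopped()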

The entire repository is available under an MIT license.
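The clipboard-and-paste step from stop_recording.sh can be done on macOS with pbcopy plus a small bit of AppleScript. Again, this is a hypothetical sketch of the mechanism rather than the repository’s exact code:

# Hypothetical sketch: put text on the macOS clipboard, then paste it into
# the currently selected text field by simulating Cmd+V.
import subprocess

def copy_and_paste(text):
    subprocess.run("pbcopy", input=text.encode("utf-8"), check=True)
    subprocess.run(
        ["osascript", "-e",
         'tell application "System Events" to keystroke "v" using command down'],
        check=True,
    )

Calling copy_and_paste(transcript.text) would then be the natural final step after the transcription call shown earlier.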

Alfred

You can find the Alfred workflow in the GitHub repository here: Transcribe.alfredworkflow.

This is how I set up the Alfred workflow:

My Alfred workflow. I have two hotkeys: one to start the transcription (record voice) and one to stop the transcription (stop recording and send the audio to the OpenAI Whisper API for transcription). The Option + Q command runs the start_recording.sh script, and Option + W runs the stop_recording.sh script. You can, of course, change the hotkeys for these commands. Image by the author.

You can simply download it and add it to your Alfred.

Also, remember to have a terminal window open whenever you want to run this script, as the Python script is activated from the terminal. I had to do it this way because I got permission issues when the script was activated directly from Alfred. The first time you run the script, you should be prompted to give your terminal access to the microphone, which you should approve.

Cost

An important consideration when using APIs such as OpenAI Whisper is the cost of the API usage. I would consider the cost of using OpenAI’s Whisper model moderately high. As always, the cost depends entirely on how much you use the model. I use the model up to 25 times a day, with up to 150 words per transcription, and the cost is less than 1 dollar per day.
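To sanity-check that number: the Whisper API is priced per minute of audio. Assuming the commonly cited rate of 0.006 dollars per minute (an assumption; check OpenAI’s current pricing page), my usage pattern works out roughly as follows:

# Back-of-envelope daily cost estimate for my usage pattern.
PRICE_PER_MINUTE = 0.006  # USD; assumed rate, check current OpenAI pricing
USES_PER_DAY = 25
WORDS_PER_USE = 150
WORDS_PER_MINUTE = 110    # average speaking speed

minutes_per_use = WORDS_PER_USE / WORDS_PER_MINUTE
daily_cost = PRICE_PER_MINUTE * minutes_per_use * USES_PER_DAY
print(f"~${daily_cost:.2f} per day, ~${daily_cost * 30:.2f} per month")

Under these assumptions, that is around 20 cents per day, consistent with the under-a-dollar figure above.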

This does mean, however, that if you use the model a lot, you can see costs of up to 30 dollars per month, which is definitely a substantial cost. Still, I think it’s important to be aware of the time savings you get from the model. If each usage saves you 30 seconds, and you use it 20 times per day, you have just saved ten minutes of your day. Personally, I’m willing to pay one dollar to save ten minutes of my day on a task (typing on my keyboard) that doesn’t really grant me any other benefit. If anything, heavy keyboard use may contribute to a higher risk of injuries such as carpal tunnel syndrome. Using the model is thus definitely worth it for me.

Conclusion

In this article, I started off discussing the immense advances in language models over the past couple of years. These have helped us create powerful chatbots, saving us enormous amounts of time. Alongside the advances in language models, we have also seen advances in voice models. Transcription using OpenAI Whisper is now near perfect (from personal experience), which makes it a powerful tool for inputting words into your computer more effectively. I discussed the pros and cons of using OpenAI Whisper on your PC, and I went step by step through how you can implement it on your own computer.
