AI Speech Recognition in Unity

By Dylan Ebert



Introduction

This tutorial guides you through the process of implementing state-of-the-art speech recognition in your Unity game using the Hugging Face Unity API. This feature can be used for giving commands, chatting with an NPC, improving accessibility, or any other functionality where converting spoken words to text may be useful.

To try speech recognition in Unity for yourself, check out the live demo on itch.io.



Prerequisites

This tutorial assumes basic knowledge of Unity. It also requires that you have installed the Hugging Face Unity API. For instructions on setting up the API, check out our earlier blog post.



Steps



1. Set Up the Scene

In this tutorial, we’ll set up a very simple scene where the player can start and stop a recording, and the result will be converted to text.

Begin by creating a Unity project, then creating a Canvas with three UI elements:

  1. Start Button: This will start the recording.
  2. Stop Button: This will stop the recording.
  3. Text (TextMeshPro): This is where the result of the speech recognition will be displayed.



2. Set Up the Script

Create a script called SpeechRecognitionTest and attach it to an empty GameObject.

In the script, define references to your UI components:

[SerializeField] private Button startButton;
[SerializeField] private Button stopButton;
[SerializeField] private TextMeshProUGUI text;

Assign them in the inspector.
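Since an unassigned reference is a common source of NullReferenceExceptions, you could optionally add an OnValidate() check that warns you in the editor. This is a small sketch of our own, not part of the original tutorial:

private void OnValidate() {
    // Warn in the editor if any UI reference was left unassigned.
    if (startButton == null) Debug.LogWarning("startButton is not assigned.", this);
    if (stopButton == null) Debug.LogWarning("stopButton is not assigned.", this);
    if (text == null) Debug.LogWarning("text is not assigned.", this);
}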

Then, use the Start() method to set up listeners for the start and stop buttons:

private void Start() {
    startButton.onClick.AddListener(StartRecording);
    stopButton.onClick.AddListener(StopRecording);
}

At this point, your script should look something like this:

using TMPro;
using UnityEngine;
using UnityEngine.UI;

public class SpeechRecognitionTest : MonoBehaviour {
    [SerializeField] private Button startButton;
    [SerializeField] private Button stopButton;
    [SerializeField] private TextMeshProUGUI text;

    private void Start() {
        startButton.onClick.AddListener(StartRecording);
        stopButton.onClick.AddListener(StopRecording);
    }

    private void StartRecording() {

    }

    private void StopRecording() {

    }
}



3. Record Microphone Input

Now let’s record microphone input and encode it in WAV format. Start by defining the member variables:

private AudioClip clip;
private byte[] bytes;
private bool recording;

Then, in StartRecording(), use the Microphone.Start() method to start recording:

private void StartRecording() {
    clip = Microphone.Start(null, false, 10, 44100);
    recording = true;
}

This will record up to 10 seconds of audio at 44100 Hz.
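Passing null as the device name records from the default microphone. If you’d rather target a specific device, you can enumerate the options with Microphone.devices. A minimal sketch, assuming at least one microphone is connected:

// List the available microphones and record from the first one.
foreach (var device in Microphone.devices) {
    Debug.Log("Found microphone: " + device);
}
clip = Microphone.Start(Microphone.devices[0], false, 10, 44100);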

If the recording reaches its maximum length of 10 seconds, we’ll want to stop it automatically. To do so, add the following to the Update() method:

private void Update() {
    if (recording && Microphone.GetPosition(null) >= clip.samples) {
        StopRecording();
    }
}

Then, in StopRecording(), truncate the recording and encode it in WAV format:

private void StopRecording() {
    var position = Microphone.GetPosition(null);
    Microphone.End(null);
    var samples = new float[position * clip.channels];
    clip.GetData(samples, 0);
    bytes = EncodeAsWAV(samples, clip.frequency, clip.channels);
    recording = false;
}

Finally, we’ll need to implement the EncodeAsWAV() method to prepare the audio data for the Hugging Face API:

private byte[] EncodeAsWAV(float[] samples, int frequency, int channels) {
    // 44-byte WAV header followed by the samples as 16-bit PCM.
    using (var memoryStream = new MemoryStream(44 + samples.Length * 2)) {
        using (var writer = new BinaryWriter(memoryStream)) {
            writer.Write("RIFF".ToCharArray());
            writer.Write(36 + samples.Length * 2); // file size minus the first 8 bytes
            writer.Write("WAVE".ToCharArray());
            writer.Write("fmt ".ToCharArray());
            writer.Write(16);                       // fmt chunk size
            writer.Write((ushort)1);                // audio format: 1 = PCM
            writer.Write((ushort)channels);
            writer.Write(frequency);                // sample rate
            writer.Write(frequency * channels * 2); // byte rate
            writer.Write((ushort)(channels * 2));   // block align
            writer.Write((ushort)16);               // bits per sample
            writer.Write("data".ToCharArray());
            writer.Write(samples.Length * 2);       // data chunk size

            // Scale each float sample in [-1, 1] to a signed 16-bit integer.
            foreach (var sample in samples) {
                writer.Write((short)(sample * short.MaxValue));
            }
        }
        return memoryStream.ToArray();
    }
}

The complete script should now look something like this:

using System.IO;
using TMPro;
using UnityEngine;
using UnityEngine.UI;

public class SpeechRecognitionTest : MonoBehaviour {
    [SerializeField] private Button startButton;
    [SerializeField] private Button stopButton;
    [SerializeField] private TextMeshProUGUI text;

    private AudioClip clip;
    private byte[] bytes;
    private bool recording;

    private void Start() {
        startButton.onClick.AddListener(StartRecording);
        stopButton.onClick.AddListener(StopRecording);
    }

    private void Update() {
        if (recording && Microphone.GetPosition(null) >= clip.samples) {
            StopRecording();
        }
    }

    private void StartRecording() {
        clip = Microphone.Start(null, false, 10, 44100);
        recording = true;
    }

    private void StopRecording() {
        var position = Microphone.GetPosition(null);
        Microphone.End(null);
        var samples = new float[position * clip.channels];
        clip.GetData(samples, 0);
        bytes = EncodeAsWAV(samples, clip.frequency, clip.channels);
        recording = false;
    }

    private byte[] EncodeAsWAV(float[] samples, int frequency, int channels) {
        using (var memoryStream = new MemoryStream(44 + samples.Length * 2)) {
            using (var writer = new BinaryWriter(memoryStream)) {
                writer.Write("RIFF".ToCharArray());
                writer.Write(36 + samples.Length * 2);
                writer.Write("WAVE".ToCharArray());
                writer.Write("fmt ".ToCharArray());
                writer.Write(16);
                writer.Write((ushort)1);
                writer.Write((ushort)channels);
                writer.Write(frequency);
                writer.Write(frequency * channels * 2);
                writer.Write((ushort)(channels * 2));
                writer.Write((ushort)16);
                writer.Write("data".ToCharArray());
                writer.Write(samples.Length * 2);

                foreach (var sample in samples) {
                    writer.Write((short)(sample * short.MaxValue));
                }
                }
            }
            return memoryStream.ToArray();
        }
    }
}

To check whether this code is working correctly, you can add the following line to the end of the StopRecording() method:

File.WriteAllBytes(Application.dataPath + "/test.wav", bytes);

Now, if you click the Start button, speak into the microphone, and click Stop, a test.wav file should be saved in your Unity Assets folder with your recorded audio.
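Note that Application.dataPath points at the Assets folder only in the editor; in a built player you would typically write to Application.persistentDataPath instead. A hedged variant of the same debug line:

// Keep the debug write valid both in the editor and in a build.
#if UNITY_EDITOR
File.WriteAllBytes(Application.dataPath + "/test.wav", bytes);
#else
File.WriteAllBytes(Application.persistentDataPath + "/test.wav", bytes);
#endif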



4. Speech Recognition

Next, we’ll want to use the Hugging Face Unity API to run speech recognition on our encoded audio. To do so, we’ll create a SendRecording() method:

using HuggingFace.API;

private void SendRecording() {
    HuggingFaceAPI.AutomaticSpeechRecognition(bytes, response => {
        text.color = Color.white;
        text.text = response;
    }, error => {
        text.color = Color.red;
        text.text = error;
    });
}

This will send the encoded audio to the API, displaying the response in white if successful, or the error message in red otherwise.

Don’t forget to call SendRecording() at the end of the StopRecording() method:

private void StopRecording() {
    /* other code */
    SendRecording();
}
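Once the transcription comes back, the response is just a string, so you can use it however your game needs. For instance, tying back to the voice-command use case from the introduction, a hypothetical keyword-matching helper might look like this (HandleCommand is our own illustrative name, not part of the API):

private void HandleCommand(string transcription) {
    // Naive keyword matching on the transcribed text.
    var lower = transcription.ToLowerInvariant();
    if (lower.Contains("jump")) {
        Debug.Log("Jump command recognized!");
    } else if (lower.Contains("attack")) {
        Debug.Log("Attack command recognized!");
    }
}

You could call it from the success callback in SendRecording(), in place of (or in addition to) updating the text.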



5. Final Touches

Finally, let’s improve the UX of this demo a bit with button interactability and status messages.

The Start and Stop buttons should only be interactable when appropriate, i.e. when a recording is ready to be started or stopped, respectively.

Then, set the response text to a simple status message while recording or waiting for the API.

The finished script should look something like this:

using System.IO;
using HuggingFace.API;
using TMPro;
using UnityEngine;
using UnityEngine.UI;

public class SpeechRecognitionTest : MonoBehaviour {
    [SerializeField] private Button startButton;
    [SerializeField] private Button stopButton;
    [SerializeField] private TextMeshProUGUI text;

    private AudioClip clip;
    private byte[] bytes;
    private bool recording;

    private void Start() {
        startButton.onClick.AddListener(StartRecording);
        stopButton.onClick.AddListener(StopRecording);
        stopButton.interactable = false;
    }

    private void Update() {
        if (recording && Microphone.GetPosition(null) >= clip.samples) {
            StopRecording();
        }
    }

    private void StartRecording() {
        text.color = Color.white;
        text.text = "Recording...";
        startButton.interactable = false;
        stopButton.interactable = true;
        clip = Microphone.Start(null, false, 10, 44100);
        recording = true;
    }

    private void StopRecording() {
        var position = Microphone.GetPosition(null);
        Microphone.End(null);
        var samples = new float[position * clip.channels];
        clip.GetData(samples, 0);
        bytes = EncodeAsWAV(samples, clip.frequency, clip.channels);
        recording = false;
        SendRecording();
    }

    private void SendRecording() {
        text.color = Color.yellow;
        text.text = "Sending...";
        stopButton.interactable = false;
        HuggingFaceAPI.AutomaticSpeechRecognition(bytes, response => {
            text.color = Color.white;
            text.text = response;
            startButton.interactable = true;
        }, error => {
            text.color = Color.red;
            text.text = error;
            startButton.interactable = true;
        });
    }

    private byte[] EncodeAsWAV(float[] samples, int frequency, int channels) {
        using (var memoryStream = new MemoryStream(44 + samples.Length * 2)) {
            using (var writer = new BinaryWriter(memoryStream)) {
                writer.Write("RIFF".ToCharArray());
                writer.Write(36 + samples.Length * 2);
                writer.Write("WAVE".ToCharArray());
                writer.Write("fmt ".ToCharArray());
                writer.Write(16);
                writer.Write((ushort)1);
                writer.Write((ushort)channels);
                writer.Write(frequency);
                writer.Write(frequency * channels * 2);
                writer.Write((ushort)(channels * 2));
                writer.Write((ushort)16);
                writer.Write("data".ToCharArray());
                writer.Write(samples.Length * 2);

                foreach (var sample in samples) {
                    writer.Write((short)(sample * short.MaxValue));
                }
                }
            }
            return memoryStream.ToArray();
        }
    }
}

Congratulations, you can now use state-of-the-art speech recognition in Unity!

If you have any questions or would like to get more involved in using Hugging Face for Games, join the Hugging Face Discord!


