Voice Cloning with Consent

By Margaret Mitchell and Lucie-Aimée Kaffee
In this blog post, we introduce the concept of a ‘voice consent gate’ to support voice cloning with consent. We offer an example Space and accompanying code to get the ball rolling on the concept.
[Image: line drawing of a gate, where the sign says "Consent"]

Realistic voice generation technology has gotten uncannily good in the past few years. In some situations, it’s possible to generate an artificial voice that sounds almost exactly like the voice of an actual person. And today, what once felt like science fiction is reality: voice cloning. With just a few seconds of recorded speech, anyone’s voice can be made to say almost anything.

Voice generation, and specifically the subtask of voice cloning, has notable risks and benefits. The risks of “deepfakes”, such as the cloned voice of former President Biden used in robocalls, can mislead people into thinking that individuals have said things that they haven’t said. However, voice cloning can also be a powerful, helpful tool, helping people who’ve lost the ability to speak communicate in their own voice again, or assisting people in learning new languages and dialects.

So how can we enable meaningful use without malicious use? We’re exploring one possible answer: a voice consent gate. That is, a system where a voice can be cloned only when the speaker explicitly says they consent. In other words, the model won’t speak in your voice unless you say it’s okay.

We offer a basic demo of this concept below:



Ethics in Practice: Consent as System Infrastructure

The voice consent gate is a piece of infrastructure we’re exploring that provides a way for ethical principles like consent to be embedded directly into AI system workflows. In our demo, this means the model only starts once the speaker’s consent phrase has been both spoken and recognized, effectively making consent a prerequisite for action. This turns an abstract principle into a concrete system condition, creating a traceable, auditable interaction: an AI model can only run after an unambiguous act of consent.

Such design decisions matter beyond voice cloning. They illustrate how AI systems can be built to respect autonomy by default, and how transparency and consent can be made functional, not just declarative.



The Technical Details

To create a basic voice cloning system with a voice consent gate, you need three parts:

  1. A way of generating novel consent sentences for the person whose voice will be cloned – the “speaker” – to say, uniquely referencing the current consent context.
  2. An automatic speech recognition (ASR) system that recognizes the sentence conveying consent.
  3. A voice-cloning text-to-speech (TTS) system that takes as input text and the speaker’s speech snippets to generate speech.

Our observation: Since some voice-cloning systems can now generate speech similar to a speaker’s voice using only one sentence, a sentence used for consent can also be used for voice cloning.
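As a minimal sketch, the gate’s matching step – checking that what the ASR system heard matches the generated consent sentence – can be approximated with fuzzy string matching. The `consent_gate` helper below is our illustrative assumption, not the demo’s exact logic; a real system would feed it the transcript produced by an ASR model:

```python
import difflib

def normalize(text: str) -> str:
    """Lowercase and strip punctuation so an ASR transcript can be compared to the prompt."""
    kept = "".join(c for c in text.lower() if c.isalnum() or c.isspace())
    return " ".join(kept.split())

def consent_gate(expected_sentence: str, asr_transcript: str, threshold: float = 0.85) -> bool:
    """Open the gate only if the transcript closely matches the generated consent sentence."""
    ratio = difflib.SequenceMatcher(
        None, normalize(expected_sentence), normalize(asr_transcript)
    ).ratio()
    return ratio >= threshold
```

Only once the gate returns `True` would the recording be passed on to the voice-cloning model; the similarity threshold tolerates minor ASR errors while rejecting unrelated speech.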



Approach

The consent bit: To create a voice consent gate in an English voice cloning system, generate a short, natural-sounding English utterance (~20 words) for a person to read aloud that clearly states their informed consent in the current context. We recommend explicitly including a consent phrase and the model name, such as “I give my consent to use the <MODEL> voice cloning model with my voice”. We also recommend using an audio recording that can’t be uploaded, but that instead comes directly from a microphone, to ensure that the sentence isn’t part of an earlier recording that’s been manipulated. Pairing this with a novel (previously unsaid) sentence further helps to directly index the current consent context – supporting explicit, active, context-specific, informed consent. While this design reduces the risk of reusing prior recordings, it’s not foolproof; a person could still generate an identical phrase using another TTS system. Future iterations could explore lightweight audio provenance checks, speaker-embedding similarity, or metadata from real-time capture to help confirm that the consent audio originates from the intended speaker.

The acceptable-for-voice-cloning bit: Previous work on voice cloning has shown that the phrases provided by the speaker should have phonetic variety, covering diverse vowels and consonants; have a “neutral” or polite tone, without background noise and with the speaker in a comfortable position; and have a clear start and end (i.e., don’t trim the clip mid-word).
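As a rough illustration of the phonetic-variety requirement, candidate sentences could be scored with a crude heuristic. The letter-coverage proxy below is an assumption of ours, not what the demo uses; a more serious check would run a grapheme-to-phoneme tool and count distinct phonemes:

```python
def phonetic_variety_score(sentence: str) -> float:
    """Crude proxy for phonetic variety: fraction of the 26 English letters that appear."""
    letters = {c for c in sentence.lower() if "a" <= c <= "z"}
    return len(letters) / 26
```

A pangram like “The quick brown fox jumps over the lazy dog” scores 1.0, while a repetitive sentence scores far lower.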

To enact each of these features within the demo, we prompt a language model to create pairs of sentences: one expressing explicit consent, and another neutral sentence that adds phonetic diversity (covering different vowels, consonants, and tones). Each prompt uses a randomly chosen everyday topic (like the weather, food, or music) to keep the sentences varied and comfortable to say, aiding in creating recordings that are clear, natural, and phonetically rich, while also containing an unambiguous statement of consent. This generation step is automated rather than pre-written so that each user receives a unique sentence pair, preventing reuse of the same text and ensuring that consent recordings are specific to the current session. In other words, the language model generates two fresh sentences per consent instance: one for explicit consent and one for phonetic variety. For instance, the language model might generate: “I give my consent to use my voice for generating audio with the model EchoVoice. The weather is bright and calm this morning.” This approach ensures that every sample used for cloning contains verifiable, explicit consent, while remaining suitable as technical input for high-quality voice synthesis. (Note: It isn’t required that the language model be a “large” language model, which brings its own consent issues.)
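The sentence-pair generation described above might be prompted along these lines; the topic list and prompt wording here are illustrative assumptions, not the demo’s exact prompt:

```python
import random

# Illustrative everyday topics; the demo's own topic list may differ.
EVERYDAY_TOPICS = ["the weather", "food", "music", "a morning walk", "your commute"]

def build_pair_prompt(model_name: str) -> str:
    """Build an LM prompt asking for one consent sentence plus one phonetically varied filler."""
    topic = random.choice(EVERYDAY_TOPICS)
    return (
        "Write two short, natural English sentences for a person to read aloud. "
        "Sentence 1 must state: 'I give my consent to use my voice for "
        f"generating audio with the model {model_name}.' "
        f"Sentence 2 should be a neutral, phonetically varied sentence about {topic}."
    )
```

Because the topic is drawn at random per session, each speaker reads a fresh, previously unsaid pair.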

Some examples:

  • “I give my consent to use my voice for generating synthetic audio with the Chatterbox model today. My daily commute involves navigating through crowded streets on foot most days lately anyway.”
  • “I give my consent to use my voice for generating audio with the model Chatterbox. After a gentle morning walk, I’m feeling relaxed and ready to speak freely now.”
  • “I agree to the use of my recorded voice for audio generation with the model Chatterbox. The coffee shop outside has a pleasant aroma of freshly brewed coffee this morning.”



Unlocking the Voice Consent Gate

Once the speaker’s input matches the generated text, the voice cloning system can start, using the speaker’s consent audio as the input.

There are a few options for doing this, and we’d love to hear further ideas. For now, there’s:

  • What we provide in the demo: Have the voice consent gate open directly to the voice cloning model, where arbitrary text can be written and generated in the speaker’s voice. The model uses the consenting audio directly to learn the speaker’s voice.
  • Alternatively, it’s possible to modify the code we provide in the demo to model the speaker’s voice using a variety of different uploaded voice files that the speaker is consenting to – for example, when providing consent for the use of online recordings. Prompts and consent phrases should be altered accordingly.
  • It’s also possible to save the consent audio to be used by a given system, for example, when the speaker is consenting to have their voice used for arbitrary utterances in the future. This can be done using the huggingface_hub upload capability. Read how to do that here. Again, prompts and consent phrases for the speaker to say should account for this context of use.
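For that last option, storing the consent clip under a traceable path might look like the sketch below. The repo id is a placeholder of ours; `upload_file` is the real `huggingface_hub` entry point, shown commented out because it requires authentication:

```python
from datetime import datetime, timezone

def consent_audio_path(speaker_id: str) -> str:
    """Name each consent clip by speaker and UTC timestamp so recordings stay auditable."""
    stamp = datetime.now(timezone.utc).strftime("%Y%m%dT%H%M%SZ")
    return f"consent/{speaker_id}/{stamp}.wav"

# Requires `pip install huggingface_hub` and an authenticated token:
# from huggingface_hub import upload_file
# upload_file(
#     path_or_fileobj="consent.wav",
#     path_in_repo=consent_audio_path("speaker123"),
#     repo_id="your-org/consent-audio",  # placeholder repo id
#     repo_type="dataset",
# )
```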



Check our demo out here!

You can copy the code to fit your own use.

The code is modular so it can be sliced and diced in different ways to incorporate into your own projects. We’ll be working on making this more robust and secure over time, and we’re curious to hear your ideas on how to improve.

Handled responsibly, this technology doesn’t have to haunt us. It can instead become a respectful collaboration between humans and machines: no ghosts in the machine, just good practice. 🎃


