To answer this question, we need to understand two terms:
- Waveform
- Spectrogram
In the real world, sound is produced by vibrating objects that create acoustic waves (changes in air pressure over time). When sound is captured through a microphone or generated by a digital synthesizer, we can represent this sound wave as a waveform.
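To make this concrete, here is a minimal Python sketch of a waveform. The 440 Hz sine tone is just a stand-in for a real recording; the point is that a waveform is nothing but an array of amplitude values sampled over time:

```python
import numpy as np

sample_rate = 44100                     # samples per second (CD quality)
duration = 1.0                          # one second of audio
t = np.linspace(0.0, duration, int(sample_rate * duration), endpoint=False)

# A waveform is just amplitude over time; a 440 Hz sine tone stands in
# for a sound wave captured by a microphone.
waveform = 0.5 * np.sin(2 * np.pi * 440.0 * t)
```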
The waveform is useful for recording and playing audio, but it is rarely used for music analysis or machine learning with audio data. Instead, a much more informative representation of the signal is used: the spectrogram.
The spectrogram tells us which frequencies are more or less pronounced in the sound over time. For this article, however, the key thing to note is that a spectrogram is an image. And with that, we come full circle.
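For illustration, here is a minimal sketch that turns the waveform from the snippet above into a spectrogram using SciPy's short-time Fourier transform. The result is a 2D array of frequency bins by time frames, which is exactly why a spectrogram can be treated as an image:

```python
from scipy import signal

# Short-time Fourier transform: frequency content per time window,
# computed from the `waveform` and `sample_rate` defined above.
frequencies, times, spec = signal.spectrogram(waveform, fs=sample_rate)

# `spec` is a 2D array (frequency bins x time frames) of intensities,
# exactly the kind of data that can be displayed as an image.
print(spec.shape)
```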
When generating the corgi sound and image above, the AI creates a sound that, when transformed into a spectrogram, looks like a corgi.
This means the output of this AI is both sound and image at the same time.
Even though you now understand what is meant by an image that sounds, you might still wonder how this is even possible. How does the AI know which sound will produce the desired image? After all, the waveform of the corgi sound looks nothing like a corgi.
First, we need to understand one foundational concept: diffusion models. Diffusion models are the technology behind image models like DALL-E 3 or Midjourney. In essence, a diffusion model encodes a user prompt into a mathematical representation (an embedding), which is then used to generate the desired output image step by step from random noise.
Here’s the workflow for creating images with a diffusion model (a code sketch follows the list):
- Encode the prompt into an embedding (a bunch of numbers) using an artificial neural network
- Initialize an image with white noise (Gaussian noise)
- Gradually denoise the image. Based on the prompt embedding, the diffusion model determines a small, optimal denoising step that brings the image closer to the prompt description. Let’s call this the denoising instruction.
- Repeat the denoising step until a noiseless, high-quality image is generated
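As a rough illustration of this loop, here is a minimal Python sketch. `encode_prompt` and `denoising_instruction` are toy stand-ins for a real text encoder and denoising network, not the actual models:

```python
import numpy as np

def encode_prompt(prompt: str) -> np.ndarray:
    # Toy stand-in for a real text encoder: maps the prompt
    # to an embedding vector (step 1).
    rng = np.random.default_rng(abs(hash(prompt)) % 2**32)
    return rng.standard_normal(768)

def denoising_instruction(image: np.ndarray, embedding: np.ndarray,
                          step: int) -> np.ndarray:
    # Toy stand-in for the denoising network: a real model would
    # predict the noise to remove, conditioned on the embedding.
    return 0.01 * image

def generate_image(prompt: str, num_steps: int = 50) -> np.ndarray:
    embedding = encode_prompt(prompt)        # step 1: prompt -> embedding
    image = np.random.randn(64, 64, 3)       # step 2: start from Gaussian noise
    for step in reversed(range(num_steps)):  # steps 3-4: iterative denoising
        image = image - denoising_instruction(image, embedding, step)
    return image
```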
To generate “images that sound”, the researchers used a clever trick: they combined two diffusion models into one. One of the diffusion models is a text-to-image model (Stable Diffusion), and the other is a text-to-spectrogram model (Auffusion). Each of these models receives its own prompt, which is encoded into an embedding and determines its own denoising instruction.
However, two different denoising instructions are a problem, because the model needs to decide how to denoise the image. In the paper, the authors solve this by averaging the denoising instructions from both prompts, effectively guiding the model to optimize for both prompts equally (see the sketch below).
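Here is a minimal sketch of this averaging step, assuming the two denoising instructions are given as arrays; `instr_image` and `instr_audio` are hypothetical stand-ins for the outputs of Stable Diffusion and Auffusion:

```python
import numpy as np

def combined_denoising_step(image: np.ndarray,
                            instr_image: np.ndarray,
                            instr_audio: np.ndarray) -> np.ndarray:
    # Average the denoising instruction of the text-to-image model
    # (Stable Diffusion) and the text-to-spectrogram model (Auffusion),
    # so the result is pulled equally toward both prompts.
    combined = 0.5 * (instr_image + instr_audio)
    return image - combined
```

The equal weighting ensures that neither prompt dominates, which is also the source of the tradeoff discussed next.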
At a high level, you can think of this as ensuring the resulting image reflects both the image prompt and the audio prompt equally well. One downside is that the output will always be a compromise between the two, and not every sound or image coming out of the model will look or sound great. This inherent tradeoff significantly limits the model’s output quality.
Is AI just Mimicking Human Intelligence?
AI is often defined as computer systems mimicking human intelligence (e.g. IBM, TechTarget, Coursera). This definition works well for sales forecasting, image classification, and text generation models. However, it comes with the inherent restriction that a computer system can only be an AI if it performs a task that humans have historically solved.
In the real world, there exists a vast (likely infinite) number of problems solvable through intelligence. While human intelligence has cracked some of these problems, most remain unsolved. Among these unsolved problems, some are known (e.g. curing cancer, quantum computing, the nature of consciousness) and others are unknown. If your goal is to tackle these unsolved problems, mimicking human intelligence does not seem like an optimal strategy.
Following the definition above, a computer system that discovers a cure for cancer without mimicking human intelligence would not be considered AI. This is clearly counterintuitive and counterproductive. I don’t intend to start a debate on “the one and only definition”. Instead, I want to emphasize that AI is much more than an automation tool for human intelligence. It has the potential to solve problems that we didn’t even know existed.
Can Spectrogram Art be Generated with Human Intelligence?
In an article on Mixmag, Becky Buckle explores the “history of artists concealing visuals inside the waveforms of their music”. One impressive example of human spectrogram art is the song “∆Mᵢ⁻¹=−α ∑ Dᵢ[η][ ∑ Fjᵢ[η−1]+Fextᵢ [η⁻¹]]” by the British musician Aphex Twin.
Another example is the track “Look” from the album “Songs about my Cats” by the Canadian musician Venetian Snares.
While both examples show that humans can encode images into waveforms, there is a clear difference to what “Images that Sound” is capable of.
How is “Images that Sound” Different from Human Spectrogram Art?
If you listen to the above examples of human spectrogram art, you will notice that they sound like noise. For an alien face, this might be a fitting musical underscore. However, listening to the cat example, there appears to be no intentional relationship between the sounds and the spectrogram image. Human composers have been able to generate waveforms that look like a certain thing when transformed into a spectrogram. However, to my knowledge, no human has been able to produce examples where the sound and the image match according to predefined criteria.
“Images that Sound” can produce audio that sounds like a cat and looks like a cat. It can also produce audio that sounds like a spaceship and looks like a dolphin. It is capable of producing intentional associations between the sound and the image representation of the audio signal. In this regard, the AI exhibits non-human intelligence.
“Images that Sound” has no Use Case. That’s what Makes it Beautiful
Lately, AI has mostly been portrayed as a productivity tool that can increase economic output through automation. While most would agree that this is highly desirable to some extent, others feel threatened by this vision of the future. After all, if AI keeps taking work away from humans, it might end up replacing the work we love doing. Hence, our lives could become more productive, but less meaningful.
“Images that Sound” contrasts with this perspective and is a prime example of beautiful AI art. This work is not driven by an economic problem but by curiosity and creativity. It is unlikely that there will ever be an economic use case for this technology, although we should never say never…
Of all the people I’ve talked to about AI, artists tend to be the most negative about it. This is backed up by a recent study from the German GEMA, showing that over 60% of musicians “believe that the risks of AI use outweigh its potential opportunities” and that only 11% “believe that the opportunities outweigh the risks”.
More works like this paper could help artists see that AI has the potential to bring more beautiful art into the world, and that this does not have to happen at the expense of human creators.
“Images that Sound” is not the first AI use case with the potential to create beautiful art. In this section, I want to showcase a few other approaches that will hopefully inspire you and make you think differently about AI.
Restoring Art
AI helps restore art by precisely repairing damaged pieces, ensuring historical works last longer. This combination of technology and creativity keeps our artistic heritage alive for future generations. Read more.
Bringing Paintings to Life
AI can animate photos to create realistic videos with natural movements and lip-syncing. This can make historical figures or artworks like the Mona Lisa move and speak (or rap). While this technology is certainly dangerous in the context of deepfakes, applied to historical portraits it can create funny and/or meaningful art. Read more.
Turning Mono Recordings into Stereo
AI has the potential to enhance old recordings by transforming their mono mix into a stereo mix. There are classical algorithmic approaches for this, but AI promises to make artificial stereo mixes sound more and more realistic. Read more here and here.