How the Fourier Transform Converts Sound Into Frequencies


Why This Piece Exists

This isn’t a rigorous mathematical treatment of the Fourier Transform — more like an intuition piece based on what I’ve learned from it and its application in analyzing sound frequencies. The aim here is to build intuition for how the Fourier Transform takes us from time domain features to frequency domain features. We won’t get into heavy math and derivations; instead, we’ll try to unpack the meaning conveyed by the complex equations.

Before we get into the Fourier Transform, you need a basic understanding of how digital sound is stored — specifically sampling and quantization. Let me quickly cover it here so we’re on the same page.

Sound in the real world is a continuous wave — air pressure changing smoothly over time. But computers can’t store continuous things. They need numbers, discrete values. To store sound digitally, we do two things.

First, sampling — we take “snapshots” of the sound wave’s amplitude at regular intervals. How many snapshots per second? That’s the sampling rate. CD-quality audio takes 44,100 snapshots per second (44.1 kHz). For speech in ML pipelines, 16,000 per second (16 kHz) is common and mostly sufficient. I’ve worked with 16 kHz speech data extensively, and it captures just about everything that matters for speech. The key idea is that we’re converting a smooth continuous wave into a series of discrete points in time.
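To make this concrete, here is a minimal sketch of what sampling produces, assuming a hypothetical pure 440 Hz tone at a 16 kHz sampling rate:

```python
import numpy as np

# Sample a 440 Hz sine (a pure tone) at 16 kHz for one second
sr = 16000                          # sampling rate: snapshots per second
t = np.arange(sr) / sr              # 16,000 evenly spaced time instants
wave = np.sin(2 * np.pi * 440 * t)  # one amplitude snapshot per instant

print(wave.shape)   # (16000,): one second of audio as discrete numbers
print(t[1] - t[0])  # 6.25e-05 s between snapshots (1/16000)
```

That array of 16,000 numbers is the entire digital representation of one second of the tone.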

Second, quantization — each snapshot must record how loud the wave is at that moment, and with how much precision. That’s the bit depth. With 16-bit audio, each amplitude value can be one of 65,536 possible levels (2¹⁶). That’s fine enough that the human ear can’t notice any difference from the original. With only 8-bit, you’d have just 256 levels — the audio would sound rough and grainy because the gap between the true amplitude and the closest storable value (this gap is called quantization error) becomes audible.
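A rough sketch of that effect, using a hypothetical `quantize` helper that snaps amplitudes in [-1, 1] to evenly spaced levels (real codecs differ in details, but the error scaling is the point):

```python
import numpy as np

sr = 16000
t = np.arange(sr) / sr
wave = np.sin(2 * np.pi * 440 * t)  # true amplitudes in [-1, 1]

def quantize(x, bits):
    # Snap each amplitude to the nearest of 2**bits evenly spaced levels
    levels = 2 ** bits              # 65,536 for 16-bit, 256 for 8-bit
    step = 2.0 / (levels - 1)
    return np.round(x / step) * step

err16 = np.max(np.abs(wave - quantize(wave, 16)))
err8 = np.max(np.abs(wave - quantize(wave, 8)))
print(err16, err8)  # the 8-bit quantization error is roughly 256x larger
```

The worst-case error is half a step, so every bit you drop roughly doubles the gap between the true wave and what gets stored.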

After sampling and quantization, what we have is a sequence of numbers — amplitude values at evenly spaced time steps — stored in the computer. That’s our time domain signal. That’s g(t). And that’s what the Fourier Transform takes as input.

I’ve spent a good amount of time working hands-on with audio data preprocessing and model training, mostly dealing with speech data. While this piece builds everything from first principles, a lot of what’s written here comes from actually running into these things in real pipelines, not just textbook reading.

Also a promise — no AI slop here. Let’s get into it.

The Setup: What We’re Starting With

The original audio signal — for complex sounds (including harmonic ones) like the human voice or musical instruments — is usually made up of a mix of frequencies: constituent frequencies, or a superposition of frequencies.

The continuous sound we’re talking about is in the time domain. It can be drawn as an amplitude vs. time graph. That’s how the sampled points from the original sound are stored in a computer in digital format.

The Fourier Transform (FT) is the mechanism through which we convert that graph from the time domain (X-axis → Time, Y-axis → Amplitude) into a frequency domain representation (X-axis → Frequency, Y-axis → Amplitude of contribution).


If you’ve ever used librosa.stft() or np.fft.rfft() in your ML pipeline and wondered what’s actually happening under the hood when you go from raw audio to a spectrogram — this is it. The Fourier Transform is the foundation underneath all of it.

Let’s talk more at an intuition level about what we’re aiming for and how the Fourier Transform delivers it. We’ll try to understand this in an organized way.

Our Goal

We want to find the values of those frequencies whose combination makes up the original sound. By “original sound,” I mean the digital signal that we’ve stored through sampling and quantization via an ADC into our digital system. In simpler terms – we want to extract the constituent frequencies from which the complex sound is composed.

It’s analogous to having a bucket in which all colors are mixed, and we want to separate out the constituent colors. The bucket of mixed colors is the original audio signal. The constituent colors are the constituent frequencies.

We want a graph that directly tells us which frequencies contribute what amplitude to the original sound. The x-axis of that graph must have all the frequency values, and the y-axis must have the amplitude of contribution corresponding to each frequency. The frequencies that are actually present in the signal will show up as peaks. Everything else will be near zero.

Our input will be the amplitude-time graph, and the output will be the amplitude-frequency graph from the Fourier Transform.

It’s obvious that since these graphs look so different, there will be mathematics involved. And to be honest, advanced mathematical tools like the Fourier Transform and complex numbers are used to convert from our input (time domain graph) to our output (frequency domain graph). But to build intuition for how the Fourier Transform gets this done, we need to understand what the Fourier Transform does, how it does it, and why it achieves our goal.

The WHAT, the HOW, and the WHY.

The WHAT: What Does FT Actually Do?

In answering the WHAT, we don’t need to see what math is happening inside — we just need to know what input it takes and what output it gives. We’ll treat it like a black box.

Here’s the thing: the input to the FT is the entire original audio signal g(t), the whole time domain waveform. We evaluate the FT at a particular frequency value f, and the output for that frequency f is a single complex number. This complex number is called the Fourier coefficient for frequency f.

The next question is: what is that complex number that the FT outputs? What do we get from it?

From this complex number, we extract two things:

Magnitude = √(Real² + Imaginary²) — this tells us the amplitude of contribution of frequency f in the original signal. A high magnitude means f is strongly present in the original audio. A low magnitude means it’s barely there or not there at all.

Phase = arctan(Imaginary / Real) — this tells us the phase offset of that frequency component. It indicates where in its cycle that frequency starts. We’ll talk about phase properly later; don’t worry about it right now. Just know that this information also comes out of the same complex number.

We do this for every frequency we care about. For each f, we get one complex number, extract the magnitude, and plot it. The collection of all these (frequency, magnitude) pairs gives us the frequency domain graph. That’s the WHAT.
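The WHAT can be sketched directly in numpy, treating the FT as a black box that maps one frequency to one complex number. Here I assume a hypothetical pure 300 Hz test signal:

```python
import numpy as np

sr = 8000
t = np.arange(sr) / sr
g = np.sin(2 * np.pi * 300 * t)  # hypothetical test signal: pure 300 Hz

# One Fourier coefficient: a single complex number for one frequency f
f = 300
coeff = np.sum(g * np.exp(-2j * np.pi * f * t))

magnitude = np.abs(coeff)   # sqrt(Real^2 + Imaginary^2)
phase = np.angle(coeff)     # angle of the complex number
print(magnitude)            # large: 300 Hz is strongly present

# Same black box at a frequency that is NOT in the signal
coeff_off = np.sum(g * np.exp(-2j * np.pi * 500 * t))
print(np.abs(coeff_off))    # near zero: 500 Hz is not in the signal
```

Same input signal, different frequency in, different complex number out — that is the whole interface.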

Let’s see HOW that complex number actually comes about — what’s the mechanism inside the FT that produces it?

The HOW: How Does FT Compute This?

Here’s where things get really beautiful, believe me.

The Winding Machine

The core idea is that we wrap the original signal around a circle in the complex plane. The speed at which we wrap depends on the input frequency f.

Mathematically, for a given frequency f, we compute:

g(t) · e^(−2πift)

at every point in time t, and plot the result on the complex plane (real axis, imaginary axis). Let’s break this down, because it’s essential to understand how to visualize and interpret what’s happening here.

Here’s an important thing to visualize: in the original g(t) graph, as time t increases, we’re simply moving from left to right along the time axis — it’s a straight line, and we never come back. But in the complex plane, we’re moving in a circle around the origin (0,0). As time progresses, we keep returning to the same angular positions — each time one full loop is completed, we start over from the same angle. The speed at which one full circle is completed depends on f: one full rotation happens when 2πtf = 2π, which means t·f = 1, so it takes 1/f seconds to complete one loop. Higher f → faster looping. Lower f → slower looping.

The time domain graph is a one-way journey left to right. The complex plane graph is a circular journey that keeps looping — and the speed of looping is controlled by the input frequency f.

You might think: since we keep coming back to the same angular positions, does the second loop trace the exact same path as the first? In the time domain, each individual constituent frequency is a repeating sine wave, right? The 300 Hz component repeats every 1/300 seconds, the 700 Hz component repeats every 1/700 seconds. Each individually has a clean repeating pattern. When we wind g(t) around the complex plane, shouldn’t the path from 0 to T (one period, T = 1/f) and from T to 2T be exactly the same? Shouldn’t the loops overlap perfectly?

No. And this is a subtle but important thing to understand early.

The individual constituent frequencies inside g(t) do repeat — yes. But g(t) itself is not a single frequency. It’s a superposition of multiple frequencies mixed together. Though the angular position in the complex plane resets every 1/f seconds (the e^(−2πift) part completes one full loop), the distance from the origin — which is g(t) — is different at time t versus time t + 1/f. That’s because g(t) has other frequency components in it that don’t repeat at the same rate as f. The value of g(t) at the same angular position changes from one loop to the next.

Each loop traces a slightly different path in the complex plane. This is why, when we compute the Centre of Mass later, we compute it over the entire path for the full duration — not just one loop. If g(t) happened to be a single pure sine wave at exactly frequency f and nothing else, then yes, every loop would be identical. But for any real-world signal with multiple frequencies, each loop is different, and we need to consider all of them.
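A tiny numerical check of this claim, using a hypothetical 300 Hz + 700 Hz mix and winding at f = 300: the angular position repeats after 1/f seconds, but g(t) does not.

```python
import numpy as np

# Two mixed frequencies: 300 Hz and 700 Hz
g = lambda t: np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 700 * t)

f = 300          # winding frequency: angular position repeats every 1/f s
t0 = 0.0004      # an arbitrary instant within the first loop

# Same angular position, one loop apart, yet a different distance from the
# origin, because the 700 Hz component has not completed a whole number
# of cycles in 1/300 s
print(g(t0), g(t0 + 1 / f))
```

The two printed values differ, so the second loop traces a different path even though the angle has reset.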

Keep this in mind — it’ll make more sense when we get to the COM section below.

At any particular time t:

g(t) is the amplitude of the original signal at that moment — this becomes the distance from the origin in the complex plane. Think of it as the magnitude of a complex number.

e^(−2πift) gives the angle — specifically, an angle of (−2πtf) radians measured clockwise from the positive real axis.

At every time t, we’re placing a point at distance g(t) from the origin, at an angle determined by 2πtf.

As time progresses, the angle keeps rotating (because t increases), and the distance from the origin keeps changing (because g(t) changes with the audio signal). The result is a path — a curve in the complex plane.

We can interpret this as wrapping or winding the original sound signal g(t) around a circle, where the speed of winding depends on the input frequency f. Higher f means the curve wraps around faster. Lower f means slower wrapping. One full circle is completed when t·f = 1, so the time period of one full rotation is 1/f.

To visualize how this winding happens at different frequencies, see this video — it shows the shape of the curve in the complex plane at different frequencies → 3Blue1Brown — But what is the Fourier Transform? (https://www.youtube.com/watch?v=spUNpyF58BY). One of the best resources out there for building this intuition.

The Centre of Mass (COM)

Here’s where the magic happens. Once we have this wound-up curve in the complex plane, we calculate its Centre of Mass (COM).

Think of the wound-up curve as if it has uniform mass density, like a wire. The COM is the single point that represents the average position of the entire curve. We want the coordinates (Real, Imaginary) of this COM. Let’s see how we actually calculate this.

Our original sound g(t), as a digitally stored signal in a computer, won’t be continuous — we have sampled points of the original sound. The corresponding sampled points will be there on the complex plane too after applying g(t)·e^(−2πift). The more sampled points there are in the original audio, the more corresponding points there will be on the complex plane.

A quick note before the formulas: what we’ve discussed so far — the winding, the circular motion, the COM — is identical whether we’re talking about the continuous version (with integrals) or the discrete version (with summations). The core concept of what the Fourier Transform does doesn’t change. Don’t get confused when you see a summation (Σ) in one formula and an integral (∫) in another — they’re doing the same thing conceptually. Summation is for our finite sampled points; the integral is for the theoretical continuous case. For building intuition, you can think of either one — the idea is the same. Just different tools for the same job.

For our discrete digital signal with N sampled points, the COM coordinates are:

COM = (1/N) Σ g(t_n) · e^(-2πit_n·f)

This is the discrete version – and this is exactly what’s happening when you call np.fft.rfft() or np.fft.fft() in Python. It computes this winding + COM calculation for all frequencies at once. That one function call performs this entire process across every frequency bin simultaneously.
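The claim that the FFT bin is this COM is easy to verify. Here is a sketch with a hypothetical 300 Hz + 700 Hz signal, comparing a hand-rolled COM against np.fft.fft (which omits the 1/N normalisation, so we divide it back in):

```python
import numpy as np

sr = 8000
t = np.arange(sr) / sr
g = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 700 * t)

def com(g, t, f):
    """Centre of mass of the wound-up curve at winding frequency f."""
    points = g * np.exp(-2j * np.pi * f * t)  # winding: one complex point per sample
    return points.mean()                      # average position of all points

# The hand-rolled COM matches the FFT bin, up to the 1/N normalisation
fft_bin = np.fft.fft(g)[300] / len(g)
print(np.allclose(com(g, t, 300), fft_bin))   # True
```

One mean over complex points is the whole mechanism; the FFT is just a fast way to do it for every bin at once.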

Now imagine this is not done digitally. In that case, we don’t need sampled points and we can work with a continuous function. That means we have infinitely many continuous points of the original audio and correspondingly many points on the complex plane. Instead of summing, we integrate:

ĝ(f) = ∫ g(t) · e^(-2πift) dt

Integration limits → t₁ and t₂ (the time duration of the original sound), integrand → g(t)·e^(-2πift), and the output is the complex Fourier coefficient for that frequency f. This is the continuous Fourier Transform formula. In practice we always work with the discrete version since we’re dealing with digital audio, but the continuous form is good to know because it shows the same idea without the distraction of indices and array lengths.

One thing worth noting – the limits t₁ and t₂ matter. The final COM you get actually depends on how much of the signal you include. A different time segment can give a different COM for the same frequency. For this article, we’re applying the FT to the full signal, so t₁ and t₂ are simply the start and end of our entire audio. But when you later get into the STFT (Short-Time Fourier Transform), you’ll see that deliberately choosing short time segments and applying the FT to each is exactly the idea – and that’s where window size becomes a design decision.

Once we have the COM coordinates, we calculate its distance from the origin:

Magnitude = √(Real² + Imaginary²)

This magnitude is the amplitude of contribution of frequency f in the original audio signal. That’s what gets plotted as the y-value for this frequency in the frequency domain graph.

The intuition for what this magnitude means: if the COM is far from the origin, that frequency has a strong contribution in the original signal. If the COM sits at or near the origin, that frequency is barely present or not present at all. The distance from the origin directly tells us how much that frequency matters.

And remember what we discussed earlier about the loops not overlapping – this is where it pays off. The COM averages over all those slightly different loops, and that averaging is what makes the non-matching frequencies cancel out (their contributions point in different directions across loops and sum to near zero) while the matching frequencies pile up (their contributions consistently point in the same direction across loops).

Why the COM Works: The Key Insight

This is the part that makes the whole thing click. Read it carefully.

When the winding frequency f matches a constituent frequency of the signal, something special happens. The wound-up curve becomes lopsided — the points pile up on one side of the complex plane. The COM lands far from the origin. High magnitude. We detect that frequency.

When f does NOT match any constituent frequency, the wound-up curve distributes roughly evenly around the origin. Points on one side are cancelled out by points on the opposite side. The COM lands near the origin. Low magnitude. That frequency isn’t really present.

Match → lopsided → COM far from origin → peak in the frequency domain.

No match → balanced → COM near origin → flat in the frequency domain.

That’s it. That’s how the Fourier Transform figures out which frequencies are inside the original signal.

Worked Example: Walking Through the Numbers

Let’s make this concrete with actual numbers. This is where the intuition becomes rock solid — trust me on this one.

Setup: Suppose our original audio signal is:

g(t) = sin(2π·300·t) + sin(2π·700·t)

This is a signal made up of exactly two frequencies: 300 Hz and 700 Hz. In the real world, this would sound like two pure tones playing simultaneously. We know the answer already — the frequency domain graph should show peaks at 300 and 700, and nothing else. Let’s see if the FT gets it right.


We apply the Fourier Transform at three frequencies: f = 300 Hz, f = 700 Hz, and f = 500 Hz.


FT at f = 300 Hz (a constituent frequency)

We wind g(t) around the complex plane at 300 rotations per second.

Think about what happens — the 300 Hz component of g(t) rotates at exactly the same speed as our winding. Because of this, the 300 Hz part of the signal consistently lands on the same side of the complex plane. It doesn’t cancel itself out. The wound-up curve becomes heavily lopsided in one direction.

What about the 700 Hz component? It rotates at a different speed than our 300 Hz winding. Over time, it traces out a roughly symmetric path around the origin and averages out to near zero. It doesn’t contribute to the lopsidedness.

Result: The COM is far from the origin. The magnitude is high. The frequency domain graph gets a tall peak at f = 300 Hz. Correct — 300 Hz is indeed a constituent frequency.

FT at f = 700 Hz (the other constituent frequency)

Same logic, just reversed. The 700 Hz component of g(t) matches the winding speed, so it piles up on one side. The 300 Hz component, being at a different speed, averages out.

Result: The COM is far from the origin. High magnitude. A tall peak at f = 700 Hz. Correct again.

FT at f = 500 Hz (NOT a constituent frequency)


We wind g(t) at 500 rotations per second. Here’s the thing — neither the 300 Hz component nor the 700 Hz component matches this winding speed. Both trace roughly symmetric paths around the origin in the complex plane. Nothing piles up consistently on one side. Everything just cancels out; the curve is essentially centered around the origin.

Result: The COM is very close to the origin. The magnitude is near zero. The frequency domain graph is flat at f = 500 Hz — correctly telling us this frequency is not present in the signal.
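These three results can be checked numerically. A sketch computing the COM magnitude at all three winding frequencies for our hypothetical signal:

```python
import numpy as np

sr = 8000
t = np.arange(sr) / sr
g = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 700 * t)

# COM magnitude at each of the three winding frequencies
for f in (300, 500, 700):
    com = np.mean(g * np.exp(-2j * np.pi * f * t))
    print(f, abs(com))
# 300 and 700 give magnitude ~0.5 (COM far from the origin);
# 500 gives ~0 (COM sitting at the origin)
```

Exactly the lopsided-vs-balanced behaviour described above, in three lines of arithmetic.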

The Frequency Domain Graph

After doing this for all frequencies, our frequency domain graph would show exactly two sharp peaks — one at 300 Hz and one at 700 Hz — with everything else near zero. We have successfully decomposed g(t) into its constituent frequencies. That’s the Fourier Transform doing its job.

The color bucket analogy holds perfectly: we had a mix (300 Hz + 700 Hz mixed together in the time domain), and the Fourier Transform separated out the constituent colors.

Seeing It in Code

For those who want to see this working in Python — here’s the worked example in actual code. It’s literally just a few lines:

import numpy as np

# Create the signal: 300 Hz + 700 Hz
sr = 8000  # sampling rate
t = np.linspace(0, 1, sr, endpoint=False)  # 1 second of audio
g = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 700 * t)

# Apply Fourier Transform - this does the winding + COM for all frequencies at once
fft_result = np.fft.rfft(g)

# Get magnitudes (amplitude of contribution for every frequency)
magnitudes = np.abs(fft_result)

# Get the frequency values corresponding to every bin
freqs = np.fft.rfftfreq(len(g), d=1/sr)

# The peaks in magnitudes will be at 300 Hz and 700 Hz
# Everything else will be near zero

That’s it. np.fft.rfft(g) does the entire winding + COM process we discussed above – for every frequency bin simultaneously. np.abs() extracts the magnitude (distance of the COM from the origin), and np.angle() would give you the phase offset if you needed it. rfft specifically gives you just the useful half of the spectrum (up to the Nyquist frequency) since the other half is a mirror – if you’ve read the aliasing article, you know why.
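As a quick sanity check on the code above, the two largest bins really do land at the constituent frequencies:

```python
import numpy as np

sr = 8000
t = np.linspace(0, 1, sr, endpoint=False)
g = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 700 * t)

magnitudes = np.abs(np.fft.rfft(g))
freqs = np.fft.rfftfreq(len(g), d=1/sr)

# The two largest bins should sit exactly at our constituent frequencies
top_two = np.sort(freqs[np.argsort(magnitudes)[-2:]])
print(top_two)  # [300. 700.]
```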

Phase: The Hidden Variable

Let’s talk about something that confused me for a while — the phase. This concept is easier to grasp if you already have some understanding of phase and phase difference in terms of waves and sinusoidal signals, but I’ll try to explain what I understood.

I know a lot of ML audio pipelines work with magnitude spectrograms only and throw the phase away entirely. That’s fine for many tasks — but understanding what phase is and what you’re discarding gives you a deeper understanding of the signal. And there are tasks where phase matters (speech synthesis, audio reconstruction, vocoder design), so this section is worth reading even if you’re only doing magnitude-based feature extraction right now.

The COM we get from the FT is a complex number. It has a magnitude (distance from the origin) and also an angle associated with it:

Phase = arctan(Imaginary(COM) / Real(COM))

That angle tells us the phase offset of the frequency component f as it exists inside the original signal. In simple terms, it tells you where in its cycle that frequency component starts at t = 0.

A Misconception I Had

I initially thought that for constituent frequencies, this phase would always be 0. If a frequency is part of the original signal, the COM should just lie on the real axis, right? Phase 0, maximum sync, all that. It makes sense intuitively, no?

That’s not true, and here’s why.

If the original signal is g(t) = sin(2π·300·t + π/4), the frequency 300 Hz is absolutely a constituent frequency — it’s literally the only frequency in the signal. But its phase offset is π/4, not 0. The 300 Hz component doesn’t start at zero amplitude at t = 0; it starts shifted by π/4.

The FT will correctly output a high magnitude at f = 300 Hz, and the angle of the complex number will encode that π/4 offset, recovering the exact phase with which the 300 Hz component exists in the signal.

Phase is 0 only if the component happens to start at exactly the right reference point at t = 0. Otherwise, it can be anything. The magnitude tells you how much of that frequency is present. The phase tells you where in its cycle it starts. Both pieces of information come from the same complex number.
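Here is a sketch verifying this with numpy. One caveat worth hedging: np.fft measures angles against a cosine reference, so the π/4 offset of our sine shows up as π/4 − π/2 (the same information, just expressed relative to cosine rather than sine):

```python
import numpy as np

sr = 8000
t = np.arange(sr) / sr
g = np.sin(2 * np.pi * 300 * t + np.pi / 4)  # 300 Hz, shifted by pi/4

coeff = np.fft.rfft(g)[300]
print(np.abs(coeff) / (sr / 2))  # ~1.0: the full amplitude is recovered
print(np.angle(coeff))           # ~ -pi/4

# sin(x + pi/4) = cos(x + pi/4 - pi/2), so against a cosine reference the
# reported angle is pi/4 - pi/2 = -pi/4: the pi/4 offset is recovered,
# just re-expressed in the cosine convention
```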

In code, you’d get these separately:

magnitude = np.abs(fft_result)    # how much of each frequency
phase = np.angle(fft_result)      # where in its cycle each frequency starts

When you compute a magnitude spectrogram (which is what most ML pipelines do), you’re keeping the first and discarding the second. Now at least you know what you’re throwing away.

For Non-Constituent Frequencies

For frequencies that aren’t part of the original signal (like f = 500 Hz in our worked example), the magnitude is near zero. The phase you get in this case is essentially meaningless – it’s the angle of a near-zero vector pointing in some arbitrary direction. Think of it as noise. The direction doesn’t mean anything when the vector has no length.

It’s quite intuitive when you think about it: for a non-constituent frequency, whatever the COM coordinates come out to be, they’re so close to the origin that the angle is just numerical noise, not meaningful information about the signal.
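A quick sketch of that, using the worked example’s 300 Hz + 700 Hz signal:

```python
import numpy as np

sr = 8000
t = np.arange(sr) / sr
g = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 700 * t)

coeffs = np.fft.rfft(g)
print(np.abs(coeffs[500]))    # ~0: 500 Hz is not a constituent
print(np.angle(coeffs[500]))  # some arbitrary angle: noise, not information
```

The angle is a perfectly valid number, but it describes a vector with essentially no length, so it carries no information about the signal.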

Why FT Handles Phase Automatically (This One Really Confused Me)

Okay, so this is a subtle point that took me a while to get. And I want to explain it clearly because it’s the kind of thing that bugs you once you start thinking about it.

Here’s the question: the FT only takes frequency f as input, right? We don’t give it a phase angle. But for a given input frequency, we could get different correlations if we vary the phase alignment between our test wave and the original signal. So how does the FT find the “best” phase – the one that gives the maximum possible magnitude for input frequency f?

The answer: the FT doesn’t search or optimize over phase at all. It doesn’t need to.

Here’s why, and the key is Euler’s formula:

e^(-2πift) = cos(2πft) – i·sin(2πft)

When we compute the FT at frequency f, we’re simultaneously correlating the signal with both cos(2πft) and sin(2πft). The real part of the output captures the cosine correlation. The imaginary part captures the sine correlation.

Now here’s the important thing – any sinusoid at frequency f with any arbitrary phase φ can be decomposed as:

A·cos(2πft + φ) = A·cos(φ)·cos(2πft) – A·sin(φ)·sin(2πft)

Regardless of what phase the component has in the original signal, the FT automatically captures it:

The real part picks up A·cos(φ) — the cosine correlation. The imaginary part picks up A·sin(φ) — the sine correlation. Magnitude = √(real² + imag²) = A — the true amplitude, regardless of φ. Angle = arctan(imag/real) = φ — recovers the exact phase.

It’s like measuring the length of a vector by projecting it onto both the x-axis and y-axis. No matter which direction the vector points, you always recover its full length through √(x² + y²). The complex exponential is testing all phases simultaneously because cosine and sine together cover all possible phase angles — they’re orthogonal to each other.

No optimization. No searching. No iterating over phase values. Just the fact that cosine and sine are orthogonal and together capture any phase. The math does it in one shot.

This is where I finally understood why complex numbers are used here, and not just regular correlation with a single sine wave. Euler’s formula does something very clever — it correlates with two things at once, and the complex number neatly packages both results together.
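The "two correlations at once" claim can be checked directly: the real and imaginary parts of the coefficient are exactly the cosine and (negated) sine correlations. A sketch with a hypothetical 300 Hz signal at an arbitrary phase:

```python
import numpy as np

sr = 8000
t = np.arange(sr) / sr
g = np.sin(2 * np.pi * 300 * t + 0.7)  # arbitrary phase of 0.7 rad

f = 300
coeff = np.sum(g * np.exp(-2j * np.pi * f * t))

# The complex exponential correlates with cos and sin simultaneously
cos_corr = np.sum(g * np.cos(2 * np.pi * f * t))
sin_corr = np.sum(g * np.sin(2 * np.pi * f * t))

print(np.isclose(coeff.real, cos_corr))   # True: real part = cosine correlation
print(np.isclose(coeff.imag, -sin_corr))  # True: imaginary part = -sine correlation
```

This follows directly from e^(−iθ) = cos θ − i·sin θ; the complex number is just both correlations packaged together.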

Putting It All Together

Here is the full picture of how we get from the time domain to the frequency domain:


1. Take the original audio signal g(t) — our time domain data

2. Pick a frequency f

3. Wind g(t) around the complex plane at speed f using g(t)·e^(−2πift)

4. Calculate the COM of the wound-up curve

5. The distance of the COM from the origin → amplitude of contribution of f

6. The angle of the COM → phase offset of f

7. Plot the point (f, magnitude) on the frequency domain graph

8. Repeat for all frequencies
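The eight steps above can be written out literally as a loop, sweeping a hypothetical set of candidate frequencies over the 300 Hz + 700 Hz example:

```python
import numpy as np

# Steps 1-8, written out as an explicit loop over candidate frequencies
sr = 8000
t = np.arange(sr) / sr
g = np.sin(2 * np.pi * 300 * t) + np.sin(2 * np.pi * 700 * t)  # step 1

spectrum = {}
for f in range(0, 1001, 100):                 # steps 2 and 8: pick each f
    wound = g * np.exp(-2j * np.pi * f * t)   # step 3: wind around the circle
    com = wound.mean()                        # step 4: centre of mass
    spectrum[f] = abs(com)                    # steps 5 and 7: magnitude vs f

peaks = [f for f, m in spectrum.items() if m > 0.1]
print(peaks)  # [300, 700]
```

In practice you would call np.fft.rfft instead of looping, but the loop is the transform, one frequency at a time.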

The frequencies that are actually present in the original signal produce lopsided winding → COM far from the origin → peaks in the graph. Frequencies that aren’t present produce balanced winding → COM near the origin → flat regions.

After doing this across all frequencies, we have the complete frequency domain graph. The peaks tell us the constituent frequencies of the original sound. That’s the Fourier Transform — decomposing a complex signal into its building blocks.

The math is a tool to justify the intuition — the real understanding is in the winding, the Centre of Mass, and the way the complex exponential handles phase automatically through Euler’s formula. Once these three things click, you get the Fourier Transform at an intuition level, and the heavy math derivations are just formalizing what you already understand. And once this clicks, you’ll see the FT everywhere in signal processing, and it will all start making sense.

The WHY

Why does the Fourier Transform work? The intuitive answer is what we’ve built through this entire piece – matching frequencies create lopsided windings, non-matching frequencies create balanced ones that cancel out. The winding machine is essentially a correlation detector – it measures how much the original signal correlates with a pure sinusoid at each frequency. High correlation means the COM lands far from the origin, which gives a peak; low correlation means the COM stays near the origin and we get a flat region in the graph.

Proving rigorously why this works would require heavy math involving the orthogonality of sinusoidal functions and the properties of complex exponentials – which isn’t the aim of this piece. But the intuition we’ve built should be enough to understand what’s happening and why the output makes sense. It works!

What Comes Next

This piece covers the continuous/conceptual Fourier Transform — the foundation. In practice, when you work with digital audio in ML pipelines, you’re using the DFT (Discrete Fourier Transform) and its fast implementation, the FFT. And when you compute spectrograms, you’re using the STFT (Short-Time Fourier Transform), which applies the FT to small overlapping windows of the signal — that’s where window size N, hop length, and overlap come in. But that’s a topic for another writeup.

All of that builds directly on top of what we covered here. The winding machine, the COM, the magnitude and phase — it’s the same mechanism, just applied to short chunks of audio instead of the whole thing at once. If this piece clicked for you, the rest will follow naturally. I’ll write about the DFT and STFT in detail later.

Thanks for your patience if you’ve read this far, and thanks to Grammarly for helping with the editing.

Feel free to reach out with any questions:

Email: [email protected]

Twitter: @r4plh

GitHub: github.com/r4plh

LinkedIn: linkedin.com/in/r4plh
