Audio Spectrogram Transformers Beyond the Lab


Want to know what draws me to soundscape analysis?

It's a field that blends science, creativity, and exploration in a way few others do. First, your laboratory is wherever your feet take you: a forest trail, a city park, or a distant mountain path can all turn into spaces for scientific discovery and acoustic investigation. Second, monitoring a chosen geographic area is all about creativity. Innovation is at the heart of environmental audio research, whether it's rigging up a custom device, hiding sensors in tree canopies, or powering off-grid setups with solar energy. Finally, the sheer volume of data is truly remarkable, and as we all know, in spatial analysis all methods are fair game. From hours of animal calls to the subtle hum of urban machinery, the acoustic data collected is vast and complex, and that opens the door to using everything from deep learning to geographical information systems (GIS) to make sense of it all.

After my earlier adventures with soundscape analysis of one of Poland's rivers, I decided to raise the bar and design and implement a solution capable of analysing soundscapes in real time. In this blog post, you'll find an overview of the proposed method, together with some code that powers the whole process, built mainly around an Audio Spectrogram Transformer (AST) for sound classification.

Outdoor/urban version of the sensor prototype (image by author)

Methods

Setup

There are many reasons why, in this particular case, I chose a combination of a Raspberry Pi 4 and an AudioMoth. Believe me, I tested a wide range of devices, from less power-hungry models of the Raspberry Pi family, through various Arduino versions, including the Portenta, all the way to the Jetson Nano. And that was only the beginning. Choosing the right microphone turned out to be even more complicated.

Ultimately, I went with the Pi 4 B (4 GB RAM) because of its solid performance and comparatively low power consumption (~700 mA when running my code). Moreover, pairing it with the AudioMoth in USB microphone mode gave me plenty of flexibility during prototyping. The AudioMoth is a powerful device with a wealth of configuration options, e.g. sampling rates from 8 kHz up to a stunning 384 kHz. I have a strong feeling that, in the long run, it will prove to be an ideal choice for my soundscape studies.

AudioMoth USB Microphone configuration app. Remember to flash the device with the correct firmware before configuring it.

Capturing sound

Capturing audio from a USB microphone using Python turned out to be surprisingly troublesome. After struggling with various libraries for a while, I decided to fall back on the good old Linux arecord. The whole sound capture mechanism is encapsulated in the following command:

arecord -d 1 -D plughw:0,7 -f S16_LE -r 16000 -c 1 -q /tmp/audio.wav

I'm deliberately using a plug-in device (plughw) to enable automatic conversion in case I'd like to change the USB microphone configuration. The AST runs on 16 kHz samples, so both the recording and the AudioMoth sampling rate are set to this value.

Pay attention to the generator in the code below. It's essential that the device continuously captures audio at the intervals I specify. I aimed to store only the most recent audio sample on the device and discard it right after classification. This approach will be especially useful later, during larger-scale studies in urban areas, as it helps protect people's privacy and aligns with GDPR requirements.

import asyncio
import re
import subprocess
from tempfile import TemporaryDirectory
from typing import Any, AsyncGenerator

import librosa
import numpy as np


class AudioDevice:
    def __init__(
        self,
        name: str,
        channels: int,
        sampling_rate: int,
        format: str,
    ):
        self.name = self._match_device(name)
        self.channels = channels
        self.sampling_rate = sampling_rate
        self.format = format

    @staticmethod
    def _match_device(name: str):
        # Parse `arecord -l` output and build the matching plughw:<card>,<device> address.
        lines = subprocess.check_output(['arecord', '-l'], text=True).splitlines()
        devices = [
            f'plughw:{m.group(1)},{m.group(2)}'
            for line in lines
            if name.lower() in line.lower()
            if (m := re.search(r'card (\d+):.*device (\d+):', line))
        ]

        if len(devices) == 0:
            raise ValueError(f'No devices found matching `{name}`')
        if len(devices) > 1:
            raise ValueError(f'Multiple devices found matching `{name}` -> {devices}')
        return devices[0]

    async def continuous_capture(
        self,
        sample_duration: int = 1,
        capture_delay: int = 0,
    ) -> AsyncGenerator[np.ndarray, Any]:
        with TemporaryDirectory() as temp_dir:
            temp_file = f'{temp_dir}/audio.wav'
            command = (
                f'arecord '
                f'-d {sample_duration} '
                f'-D {self.name} '
                f'-f {self.format} '
                f'-r {self.sampling_rate} '
                f'-c {self.channels} '
                f'-q '
                f'{temp_file}'
            )

            while True:
                # Record a short clip, overwriting the temporary file each time;
                # only the most recent sample ever exists on disk.
                subprocess.check_call(command, shell=True)
                data, sr = librosa.load(
                    temp_file,
                    sr=self.sampling_rate,
                )
                await asyncio.sleep(capture_delay)
                yield data
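
For completeness, here is a minimal sketch of how the generator can be consumed. It assumes the AudioDevice class above; the device name passed in is only an illustration, so use whatever substring matches your microphone in the `arecord -l` output.

import asyncio


async def main():
    # 'AudioMoth' is a hypothetical match string, not necessarily the exact
    # name your USB microphone reports to ALSA.
    device = AudioDevice(
        name='AudioMoth',
        channels=1,
        sampling_rate=16000,
        format='S16_LE',
    )
    async for sample in device.continuous_capture(sample_duration=1):
        print(sample.shape)  # roughly one second of mono audio at 16 kHz


asyncio.run(main())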

Classification

Now for the most exciting part.

Using the Audio Spectrogram Transformer (AST) and the wonderful HuggingFace ecosystem, we can efficiently analyse audio and classify detected segments into over 500 categories.
Note that I've prepared the system to support various pre-trained models. By default, I use an AST checkpoint fine-tuned on AudioSet, as it delivers the best results and runs well on the Raspberry Pi 4. Its ONNX export is also worth exploring, especially the quantised version, which requires less memory and returns inference results faster.

You may notice that I'm not limiting the model to a single classification label, and that's intentional. Instead of assuming that only one sound source is present at any given time, I apply a sigmoid function to the model's logits to obtain independent probabilities for each class. This allows the model to express confidence in multiple labels simultaneously, which is crucial for real-world soundscapes, where overlapping sources such as birds, wind, and distant traffic often occur together. Taking the top five results ensures that the system captures the most likely sound events in the sample without forcing a winner-takes-all decision.
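
As a toy illustration of why sigmoid rather than softmax (this snippet is not part of the sensor code), compare how the two functions treat the same logits:

import torch

# Toy logits for three classes that might all be present in the same clip.
logits = torch.tensor([2.0, 1.5, -1.0])

print(torch.softmax(logits, dim=0))  # scores compete and are forced to sum to 1
print(torch.sigmoid(logits))         # independent confidence per class

With that in mind, here is the classifier itself.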

from pathlib import Path
from typing import Optional

import numpy as np
import pandas as pd
import torch
from optimum.onnxruntime import ORTModelForAudioClassification
from transformers import AutoFeatureExtractor, ASTForAudioClassification


class AudioClassifier:
    def __init__(self, pretrained_ast: str, pretrained_ast_file_name: Optional[str] = None):
        if pretrained_ast_file_name and Path(pretrained_ast_file_name).suffix == '.onnx':
            self.model = ORTModelForAudioClassification.from_pretrained(
                pretrained_ast,
                subfolder='onnx',
                file_name=pretrained_ast_file_name,
            )
            self.feature_extractor = AutoFeatureExtractor.from_pretrained(
                pretrained_ast,
                file_name=pretrained_ast_file_name,
            )
        else:
            self.model = ASTForAudioClassification.from_pretrained(pretrained_ast)
            self.feature_extractor = AutoFeatureExtractor.from_pretrained(pretrained_ast)

        self.sampling_rate = self.feature_extractor.sampling_rate

    async def predict(
        self,
        audio: np.ndarray,
        top_k: int = 5,
    ) -> pd.DataFrame:
        with torch.no_grad():
            inputs = self.feature_extractor(
                audio,
                sampling_rate=self.sampling_rate,
                return_tensors='pt',
            )
            logits = self.model(**inputs).logits[0]
            # Sigmoid yields independent per-class probabilities (multi-label setting).
            proba = torch.sigmoid(logits)
            top_k_indices = torch.argsort(proba)[-top_k:].flip(dims=(0,)).tolist()

            return pd.DataFrame(
                {
                    'label': [self.model.config.id2label[i] for i in top_k_indices],
                    'score': proba[top_k_indices],
                }
            )

To run the ONNX version of the model, you need to add Optimum to your dependencies.
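
As a rough sketch, instantiating the classifier for the two paths could look like this. The checkpoint ids and the ONNX file name below are placeholders rather than the exact models I use, and the ONNX path assumes Optimum is installed with ONNX Runtime support.

# One publicly available AST checkpoint, used here only as an example.
classifier = AudioClassifier(
    pretrained_ast='MIT/ast-finetuned-audioset-10-10-0.4593',
)

# ONNX / quantised variant: the weights are expected in the repository's
# `onnx` subfolder, as handled by AudioClassifier above.
onnx_classifier = AudioClassifier(
    pretrained_ast='some-namespace/ast-audioset-onnx',   # hypothetical repo id
    pretrained_ast_file_name='model_quantized.onnx',     # hypothetical file name
)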

Sound pressure level

Together with the audio classification, I capture information about the sound pressure level. This way, the system not only identifies what made the sound but also gains insight into how loud each sound was. As a result, it captures a richer, more realistic representation of the acoustic scene and can eventually be used to derive finer-grained noise pollution information.

import numpy as np
from maad.spl import wav2dBSPL
from maad.util import mean_dB


async def calculate_sound_pressure_level(audio: np.ndarray, gain=10 + 15, sensitivity=-18) -> np.ndarray:
    # Defaults (gain = preamp + amp in dB, sensitivity in dB/V, Vadc in V) are tuned for the AudioMoth.
    x = wav2dBSPL(audio, gain=gain, sensitivity=sensitivity, Vadc=1.25)
    return mean_dB(x, axis=0)

The gain (preamp + amp), sensitivity (dB/V), and Vadc (V) are set primarily for the AudioMoth and were confirmed experimentally. If you are using a different device, you will need to identify these values by referring to its technical specification.
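
Putting the capture, classification, and SPL pieces together, a simplified processing loop might look like the sketch below. It reuses the classes and helper defined earlier; the device name and checkpoint id are placeholders, and the real system additionally buffers the results for database synchronisation (next section).

import asyncio


async def run_sensor():
    # Placeholders: substitute your own device match string and checkpoint id.
    device = AudioDevice(name='AudioMoth', channels=1, sampling_rate=16000, format='S16_LE')
    classifier = AudioClassifier(pretrained_ast='MIT/ast-finetuned-audioset-10-10-0.4593')

    async for sample in device.continuous_capture(sample_duration=1):
        predictions = await classifier.predict(sample, top_k=5)
        spl = await calculate_sound_pressure_level(sample)
        print(predictions.assign(spl=spl))


asyncio.run(run_sensor())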

Storage

Data from each sensor is synchronised with a PostgreSQL database every 30 seconds. The current urban soundscape monitor prototype uses an Ethernet connection; therefore, I'm not restricted in terms of network load. The device for more remote areas will synchronise the data every hour over a GSM connection.

label           score         device   sync_id                                sync_time
Hum             0.43894055   yor      9531b89a-4b38-4a43-946b-43ae2f704961   2025-05-26 14:57:49.104271
Mains hum       0.3894045    yor      9531b89a-4b38-4a43-946b-43ae2f704961   2025-05-26 14:57:49.104271
Static          0.06389702   yor      9531b89a-4b38-4a43-946b-43ae2f704961   2025-05-26 14:57:49.104271
Buzz            0.047603738  yor      9531b89a-4b38-4a43-946b-43ae2f704961   2025-05-26 14:57:49.104271
White noise     0.03204195   yor      9531b89a-4b38-4a43-946b-43ae2f704961   2025-05-26 14:57:49.104271
Bee, wasp, etc. 0.40881288   yor      8477e05c-0b52-41b2-b5e9-727a01b9ec87   2025-05-26 14:58:40.641071
Fly, housefly   0.38868183   yor      8477e05c-0b52-41b2-b5e9-727a01b9ec87   2025-05-26 14:58:40.641071
Insect          0.35616025   yor      8477e05c-0b52-41b2-b5e9-727a01b9ec87   2025-05-26 14:58:40.641071
Speech          0.23579548   yor      8477e05c-0b52-41b2-b5e9-727a01b9ec87   2025-05-26 14:58:40.641071
Buzz            0.105577625  yor      8477e05c-0b52-41b2-b5e9-727a01b9ec87   2025-05-26 14:58:40.641071
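
For reference, a minimal sketch of what the 30-second synchronisation step could look like is shown below. The asyncpg client, the sound_events table name, and the column types are my assumptions based on the rows above, not the project's actual code or schema.

import asyncpg

# Assumed schema matching the columns shown above; illustrative only.
CREATE_TABLE = """
CREATE TABLE IF NOT EXISTS sound_events (
    label     TEXT,
    score     REAL,
    device    TEXT,
    sync_id   UUID,
    sync_time TIMESTAMP
);
"""


async def sync_batch(dsn: str, rows: list[tuple]) -> None:
    # Called every ~30 seconds; `rows` holds (label, score, device, sync_id, sync_time) tuples.
    conn = await asyncpg.connect(dsn)
    try:
        await conn.execute(CREATE_TABLE)
        await conn.executemany(
            'INSERT INTO sound_events (label, score, device, sync_id, sync_time) '
            'VALUES ($1, $2, $3, $4, $5)',
            rows,
        )
    finally:
        await conn.close()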

Results

A separate application, built with Streamlit and Plotly, accesses this data. Currently, it displays information about the device's location, temporal SPL (sound pressure level), identified sound classes, and a range of acoustic indices.

Dashboard
Streamlit analytical dashboard (image by author)
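
For the curious, a stripped-down version of a single dashboard panel could look like the sketch below. The DSN, the sound_events table, and the read_events helper are illustrative assumptions; the real dashboard also covers device location, temporal SPL, and acoustic indices.

import pandas as pd
import plotly.express as px
import streamlit as st


def read_events(dsn: str) -> pd.DataFrame:
    # Hypothetical query; adjust the table and column names to your schema.
    return pd.read_sql('SELECT label, score, device, sync_time FROM sound_events', dsn)


events = read_events('postgresql://user:password@localhost/soundscapes')  # placeholder DSN

st.title('Urban soundscape monitor')
st.plotly_chart(
    px.histogram(events, x='label', y='score', histfunc='avg',
                 title='Mean score per detected class')
)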

And now we're good to go. The plan is to expand the sensor network to around 20 devices scattered across multiple locations in my city. More details about the larger-area sensor deployment will be available soon.

Furthermore, I'm collecting data from a deployed sensor and plan to share the data package, dashboard, and analysis in an upcoming blog post. I'll use an interesting approach that warrants a deeper dive into audio classification. The main idea is to match different sound pressure levels to the detected audio classes. I hope to find a better way of describing noise pollution. So stay tuned for a more detailed breakdown soon.

In the meantime, you can read the preliminary paper on my soundscape studies (headphones are obligatory).


This post was proofread and edited using Grammarly to enhance grammar and clarity.
