Evaluating Audio Reasoning with Big Bench Audio

Micah Hill-Smith and George Cameron

The emergence of native Speech to Speech models offers exciting opportunities to extend voice agent capabilities and simplify speech-enabled workflows. However, it is crucial to gauge whether this simplification comes at the cost of model performance or introduces other trade-offs.

To support this evaluation, Artificial Analysis is releasing Big Bench Audio, a new evaluation dataset for assessing the reasoning capabilities of audio language models. The dataset adapts questions from Big Bench Hard – chosen for its rigorous testing of advanced reasoning – into the audio domain.

This post introduces the Big Bench Audio dataset alongside initial benchmark results for GPT-4o and Gemini 1.5 series models. Our evaluation examines these models across multiple modalities: native Speech to Speech, Speech to Text, Text to Speech, and Text to Text. We present a summary of results below, and on the new Speech to Speech page on the Artificial Analysis website. Our initial results show a significant “speech reasoning gap”: while GPT-4o achieves 92% accuracy on a text-only version of the dataset, its Speech to Speech performance drops to 66%.



The Big Bench Audio Dataset

Big Bench Audio comprises 1,000 audio questions drawn from four categories of Big Bench Hard, each chosen for its suitability for audio evaluation:

  • Formal Fallacies: Evaluating logical deduction from given statements
  • Navigate: Determining whether a set of navigation steps returns to the starting point
  • Object Counting: Counting specific items inside collections
  • Web of Lies: Evaluating Boolean logic expressed in natural language

Each category contributes 250 questions, creating a balanced dataset that avoids tasks heavily dependent on visual elements or on text that could be ambiguous when verbalized.

Each question in the dataset is structured as:

{
  "category": "formal_fallacies",
  "official_answer": "invalid",
  "file_name": "data/question_0.mp3",
  "id": 0
}
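
For reference, here is a minimal sketch of loading the dataset with the Hugging Face datasets library. The repository id and field names are assumptions based on the structure above; see the dataset card for the authoritative details.

from datasets import load_dataset

# Assumed repository id; the dataset card is the authoritative reference.
ds = load_dataset("ArtificialAnalysis/big_bench_audio", split="train")

for row in ds.select(range(3)):
    # Fields assumed to mirror the structure shown above.
    print(row["id"], row["category"], row["official_answer"])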

The audio files were generated using 23 synthetic voices from top-ranked Text to Speech models in the Artificial Analysis Speech Arena. Each audio generation was verified against its source text using Levenshtein distance on a transcription, and edge cases were reviewed manually. To find out more about how the dataset was created, see the dataset card.
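
As an illustration of that check, here is a minimal sketch of the verification step, assuming Whisper is used for the transcription and an illustrative normalized Levenshtein distance threshold:

from openai import OpenAI

client = OpenAI()

def levenshtein(a: str, b: str) -> int:
    # Classic dynamic-programming edit distance.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def needs_manual_review(source_text: str, audio_path: str, threshold: float = 0.1) -> bool:
    # Transcribe the generated clip and compare it with the source question text.
    with open(audio_path, "rb") as f:
        transcript = client.audio.transcriptions.create(model="whisper-1", file=f).text
    a, b = source_text.lower().strip(), transcript.lower().strip()
    distance = levenshtein(a, b) / max(len(a), len(b), 1)
    return distance > threshold  # flag edge cases for manual review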



Evaluating Audio Reasoning

To evaluate the impact of audio on each model’s reasoning performance, we tested four configurations on Big Bench Audio:

  1. Speech to Speech: An input audio file is provided and the model generates an output audio file containing the answer.
  2. Speech to Text: An input audio file is provided and the model generates a text answer (a sketch of this configuration follows the list).
  3. Text to Speech: A text version of the question is provided and the model generates an output audio file containing the answer.
  4. Text to Text: A text version of the question is provided and the model generates a text answer.
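
As an example, here is a minimal sketch of the Speech to Text configuration, assuming OpenAI's Chat Completions audio input format, the gpt-4o-audio-preview model id, and an illustrative file path:

import base64
from openai import OpenAI

client = OpenAI()

# Encode the question audio for the Chat Completions audio input format.
with open("data/question_0.mp3", "rb") as f:
    audio_b64 = base64.b64encode(f.read()).decode()

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",   # assumed audio-capable model id
    modalities=["text"],            # text-only answer: the Speech to Text configuration
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Answer the question in the audio."},
            {"type": "input_audio", "input_audio": {"data": audio_b64, "format": "mp3"}},
        ],
    }],
)
print(response.choices[0].message.content)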

Based on these configurations we conducted eighteen experiments:

| Model | Speech to Speech | Speech to Text | Text to Speech | Text to Text |
|---|---|---|---|---|
| GPT-4o Realtime Preview (Oct ’24) | ✅ |  | ✅ |  |
| GPT-4o Realtime Preview (Dec ’24) | ✅ |  |  |  |
| GPT-4o mini Realtime Preview (Dec ’24) | ✅ |  |  |  |
| GPT-4o ChatCompletions Audio Preview |  | ✅ | ✅ |  |
| Speech to Speech Pipeline (whisper, GPT-4o, tts-1)¹ | ✅ |  |  |  |
| GPT-4o (Aug ’24) |  |  |  | ✅ |
| Gemini 1.5 Flash (May ’24) |  | ✅ |  | ✅ |
| Gemini 1.5 Flash (Sep ’24) |  | ✅ |  | ✅ |
| Gemini 1.5 Pro (May ’24) |  | ✅ |  | ✅ |
| Gemini 1.5 Pro (Sep ’24) |  | ✅ |  | ✅ |
| Gemini 2.0 Flash (Experimental) |  | ✅ |  | ✅ |

(Table 1 – Experiment configuration)

Notes:

  1. An input audio file is transcribed using OpenAI’s Whisper. The transcription is then fed into GPT-4o to generate an answer, which is then converted to audio using OpenAI’s TTS-1 model.
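
A minimal sketch of this pipeline, assuming the OpenAI Python SDK and an illustrative input file and voice:

from openai import OpenAI

client = OpenAI()

# 1. Transcribe the input audio question with Whisper.
with open("data/question_0.mp3", "rb") as f:
    question_text = client.audio.transcriptions.create(model="whisper-1", file=f).text

# 2. Generate a text answer with GPT-4o.
answer_text = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": question_text}],
).choices[0].message.content

# 3. Convert the text answer back to audio with TTS-1.
speech = client.audio.speech.create(model="tts-1", voice="alloy", input=answer_text)
with open("answer.mp3", "wb") as out:
    out.write(speech.content)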



Evaluation Methodology

To ensure consistent and scalable evaluation across all configurations, we developed an automated assessment system using an LLM Evaluator. Here’s how it works:

  1. For audio responses, we first transcribe them to text using OpenAI’s Whisper API
  2. For text responses, we use them directly as the “candidate answer”
  3. The LLM Evaluator receives:
    • The candidate answer
    • The official answer
    • The original question (for context)

The LLM Evaluator is supplied with the candidate answer, the official answer, and the original question as context, and is prompted to label the candidate answer as either correct or incorrect. We use Anthropic’s Claude 3.5 Sonnet (Oct ’24) as the LLM evaluator for the Big Bench Audio scores listed on Artificial Analysis.

LLM Evaluator Prompt:

Assess whether the following CANDIDATE ANSWER is CORRECT or INCORRECT.
For the CANDIDATE ANSWER to be correct, it must be consistent with the OFFICIAL ANSWER.
If the CANDIDATE ANSWER contradicts itself, assess the first proposed answer.
If the CANDIDATE ANSWER provides a final answer and working, assess the final answer only.
If the CANDIDATE ANSWER includes irrelevant information, assess only the relevant information.
If the CANDIDATE ANSWER includes a numeric value it is okay if it is spelled e.g. 7 or seven.
It is okay if the CANDIDATE ANSWER involves a misspelling of a person's name e.g. Leda or Lida, Autry or Audrie.

The question, for reference only: START QUESTION {question}\n\nEND QUESTION

The OFFICIAL ANSWER:{official_answer}

BEGIN CANDIDATE ANSWER TO ASSESS

{candidate_answer}

END CANDIDATE ANSWER TO ASSESS

Reply only with CORRECT or INCORRECT.
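
A minimal sketch of how this prompt might be sent to the LLM Evaluator, assuming the Anthropic Python SDK and the Claude 3.5 Sonnet (Oct ’24) model id:

import anthropic

client = anthropic.Anthropic()

def grade_answer(question: str, official_answer: str, candidate_answer: str) -> bool:
    # Fill in the evaluator prompt shown above.
    prompt = (
        "Assess whether the following CANDIDATE ANSWER is CORRECT or INCORRECT.\n"
        "For the CANDIDATE ANSWER to be correct, it must be consistent with the OFFICIAL ANSWER.\n"
        # ... remaining instructions from the prompt above ...
        f"The question, for reference only: START QUESTION {question}\n\nEND QUESTION\n\n"
        f"The OFFICIAL ANSWER: {official_answer}\n\n"
        f"BEGIN CANDIDATE ANSWER TO ASSESS\n\n{candidate_answer}\n\nEND CANDIDATE ANSWER TO ASSESS\n\n"
        "Reply only with CORRECT or INCORRECT."
    )
    message = client.messages.create(
        model="claude-3-5-sonnet-20241022",  # assumed id for Claude 3.5 Sonnet (Oct ’24)
        max_tokens=10,
        messages=[{"role": "user", "content": prompt}],
    )
    return message.content[0].text.strip().upper() == "CORRECT"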



Examples

Below are some example responses.



Speech to Speech



Speech to Speech Pipeline



Speech to Text

  • Model: Gemini 1.5 Flash (Sep ’24)

  • Input:

  • Output:

    The argument is deductively valid.

    The premise states: Every close friend of Myra is neither a cousin of Sue nor a daughter of Tommy. This is a
    universal statement. The conclusion appropriately reverses this statement using contraposition: No cousin of Sue and no daughter of Tommy is a close friend of Myra.

    This is a valid application of contraposition. If A implies not (B or C), then (B or C) implies not A. The argument is structured appropriately and the conclusion follows logically from the premise.
    
  • Is correct: True



Text to Speech

  • Model: GPT-4o ChatCompletions Audio Preview

  • Input:

    If you follow these instructions, do you return to the starting point? Turn right. Take 10 steps. Turn around. Take 10 steps.
    Options:
    - Yes
    - No
    Answer the question
    
  • Output:

  • Is correct: True
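
For reference, a minimal sketch of how a Text to Speech request like this one might be issued, assuming OpenAI's Chat Completions audio output parameters and an illustrative voice:

import base64
from openai import OpenAI

client = OpenAI()

question = (
    "If you follow these instructions, do you return to the starting point? "
    "Turn right. Take 10 steps. Turn around. Take 10 steps. "
    "Options: - Yes - No. Answer the question"
)

response = client.chat.completions.create(
    model="gpt-4o-audio-preview",        # assumed audio-capable model id
    modalities=["text", "audio"],        # request a spoken answer
    audio={"voice": "alloy", "format": "wav"},
    messages=[{"role": "user", "content": question}],
)

# The spoken answer is returned as base64-encoded audio.
with open("navigate_answer.wav", "wb") as out:
    out.write(base64.b64decode(response.choices[0].message.audio.data))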



Text to Text

  • Model: Gemini 1.5 Pro (Sep ’24)

  • Input:

    I have 4 oranges, a peach, a plum, three apples, three grapes, two
    raspberries, a nectarine, a strawberry, a blackberry, and a banana.
    How many fruits do I have? Answer the question
    
  • Output:

    You have 18 fruits.
    
  • Is correct: True



Results


(Figure 1 – Summary of Results)

All results presented are averages across three independent evaluation runs.



The Audio Reasoning Gap

Our evaluation reveals a substantial performance gap between text reasoning and audio reasoning. GPT-4o (Aug ’24) achieves 92% accuracy on the Text to Text version of the dataset, while its Speech to Speech counterpart (GPT-4o Realtime Preview, Oct ’24) scores 66%. The Text to Speech configuration achieves intermediate performance at 74%, indicating that both speech input and speech output contribute to the performance gap.



Speech to Speech Pipelines Currently Outperform Native Audio for Reasoning

Traditional pipeline approaches (using Whisper for transcription, GPT-4o (Aug ’24) for reasoning, and TTS-1 for voice generation) show minimal performance degradation compared to pure text processing. This suggests that, for applications where reasoning accuracy is critical, pipeline approaches currently offer the best balance of performance and audio capability.

We anticipate that this gap may narrow over time and will continue to test new Speech to Speech models with Big Bench Audio. Look out for an update with Speech to Speech mode for Google’s Gemini 2.0 Flash soon!



How to Contribute or Get in Touch

For further analysis of Speech to Speech models, check out the new Speech to Speech page on the Artificial Analysis website: https://artificialanalysis.ai/speech-to-speech.

For updates, follow us on Twitter and LinkedIn. We welcome all feedback and are available via message on Twitter, as well as through the contact form on our website.


