Spoken question answering for visual queries¶

Anonymous submission to Interspeech 2025

Abstract¶

Question answering (QA) systems are designed to answer natural language questions. Visual QA (VQA) and Spoken QA (SQA) systems extend the textual QA system to accept visual and spoken input respectively.

This work aims to create a system that enables user interaction through both speech and images. That is achieved through the fusion of text, speech, and image modalities to tackle the task of spoken VQA (SVQA). The resulting multi-modal model has textual, visual, and spoken inputs and can answer spoken questions on images.

Training and evaluating SVQA models requires a dataset for all three modalities, but no such dataset currently exists. We address this problem by synthesizing VQA datasets using two zero-shot TTS models. Our initial findings indicate that a model trained only with synthesized speech nearly reaches the performance of the upper-bounding model trained on textual QAs. In addition, we show that the choice of the TTS model has a minor impact on accuracy.

Samples page¶

This page presents samples from our training and evaluation SVQA datasets and samples from a live demo. For each sample we provide the textual prompts, input image and the synthesized audio questions. The textual conversation is composed of the "USER" input turn and the "ASSISTANT" turn which is the expected response from the LLM.

In the live demo we used a real voice to ask the questions and provided the model images that it has never seen before.

The audio samples of the datasets were synthesized from the textual questions using StyleTTS2 and F5-TTS. Speaker prompts for the TTS were randomally chosen from the Multilingual LibriSpeech samples.

Live QA Examples

Example 1¶

No description has been provided for this image
Prompt Audio
USER: <image> <audio> Your browser does not support the audio element.
ASSISTANT: Green

Example 2¶

No description has been provided for this image
Prompt Audio
USER: <image> <audio> Your browser does not support the audio element.
ASSISTANT: The image features a cartoon giraffe wearing headphones. The giraffe is smiling and has a tongue sticking out of its mouth.

Example 3¶

No description has been provided for this image
Prompt Audio
USER: <image> <audio> Your browser does not support the audio element.
ASSISTANT: Open

Training set

LLaVA-1.5 Instruct training set¶

Example 1¶

No description has been provided for this image
Prompt Audio StyleTTS2 Audio F5-TTS
USER: <audio> <image> Your browser does not support the audio element. Your browser does not support the audio element.
ASSISTANT: The cat is positioned on top of the back of the couch in the living room.
USER: <audio> Your browser does not support the audio element. Your browser does not support the audio element.
ASSISTANT: The cat is coming out from some curtains onto the couch and is sitting or standing on top of it.

Example 2¶

No description has been provided for this image
Prompt Audio StyleTTS2 Audio F5-TTS
USER: <image> <audio> Your browser does not support the audio element. Your browser does not support the audio element.
ASSISTANT: The main focus of the image is a collection of stuffed toy bears.
USER: <audio> Your browser does not support the audio element. Your browser does not support the audio element.
ASSISTANT: The stuffed toy bears are placed on a table.

Example 3¶

No description has been provided for this image
Prompt Audio StyleTTS2 Audio F5-TTS
USER: <image> <audio> Your browser does not support the audio element. Your browser does not support the audio element.
ASSISTANT: The image captures a night scene.
USER: <audio> Your browser does not support the audio element. Your browser does not support the audio element.
ASSISTANT: The street is illuminated by many street lights and the bright lights from the surrounding buildings. Additionally, there is a time lapse effect causing blurred lights on the road.

Example 4¶

No description has been provided for this image
Prompt Audio StyleTTS2 Audio F5-TTS
USER: <image> <audio> Your browser does not support the audio element. Your browser does not support the audio element.
ASSISTANT: A red sports utility vehicle (SUV) is driving down the street in the image.
USER: <audio> Your browser does not support the audio element. Your browser does not support the audio element.
ASSISTANT: Yes, the street is filled with a considerable number of people, which indicates that the area is busy.

Evaluation sets

1. SEED-Bench¶

Adaptation from lmms-lab/SEED-Bench

Example 1¶

No description has been provided for this image
Prompt Audio StyleTTS2 Audio F5-TTS
USER: <image> <audio>
A. One.
B. Two.
C. Three.
D. Four.
Answer with the option's letter from the given choices directly.
Your browser does not support the audio element. Your browser does not support the audio element.
ASSISTANT: A

Example 2¶

No description has been provided for this image
Prompt Audio StyleTTS2 Audio F5-TTS
USER: <image> <audio>
A. Black.
B. Gray.
C. White.
D. Brown.
Answer with the option's letter from the given choices directly.
Your browser does not support the audio element. Your browser does not support the audio element.
ASSISTANT: C

2. MME¶

Adaptation from lmms-lab/MME

Example 1¶

No description has been provided for this image
Prompt Audio StyleTTS2 Audio F5-TTS
USER: <image> <audio> Please answer yes or no. Your browser does not support the audio element. Your browser does not support the audio element.
ASSISTANT: Yes

Example 2¶

No description has been provided for this image
Prompt Audio StyleTTS2 Audio F5-TTS
USER: <image> <audio> Please answer yes or no. Your browser does not support the audio element. Your browser does not support the audio element.
ASSISTANT: Yes

3. DocVQA¶

Adaptation from lmms-lab/DocVQA

Example 1¶

No description has been provided for this image
Prompt Audio StyleTTS2 Audio F5-TTS
USER: <image> <audio> Your browser does not support the audio element. Your browser does not support the audio element.
ASSISTANT: 20 years

Example 2¶

No description has been provided for this image
Prompt Audio StyleTTS2 Audio F5-TTS
USER: <image> <audio> Your browser does not support the audio element. Your browser does not support the audio element.
ASSISTANT: $ 3,038,444