Spoken question answering for visual queries¶

Anonymous submission to Interspeech 2025

Abstract¶

Question answering (QA) systems are designed to answer natural language questions. Visual QA (VQA) and Spoken QA (SQA) systems extend the textual QA system to accept visual and spoken input respectively.

This work aims to create a system that enables user interaction through both speech and images. That is achieved through the fusion of text, speech, and image modalities to tackle the task of spoken VQA (SVQA). The resulting multi-modal model has textual, visual, and spoken inputs and can answer spoken questions on images.

Training and evaluating SVQA models requires a dataset for all three modalities, but no such dataset currently exists. We address this problem by synthesizing VQA datasets using two zero-shot TTS models. Our initial findings indicate that a model trained only with synthesized speech nearly reaches the performance of the upper-bounding model trained on textual QAs. In addition, we show that the choice of the TTS model has a minor impact on accuracy.

Samples page¶

This page presents samples from our training and evaluation SVQA datasets and samples from a live demo. For each sample we provide the textual prompts, input image and the synthesized audio questions. The textual conversation is composed of the "USER" input turn and the "ASSISTANT" turn which is the expected response from the LLM.

In the live demo we used a real voice to ask the questions and provided the model images that it has never seen before.

The audio samples of the datasets were synthesized from the textual questions using StyleTTS2 and F5-TTS. Speaker prompts for the TTS were randomally chosen from the Multilingual LibriSpeech samples.

Live QA Examples

Example 1¶

No description has been provided for this image

Prompt	Audio
USER: <image> <audio>
ASSISTANT: Green

Example 2¶

Prompt	Audio
USER: <image> <audio>
ASSISTANT: The image features a cartoon giraffe wearing headphones. The giraffe is smiling and has a tongue sticking out of its mouth.

Example 3¶

Prompt	Audio
USER: <image> <audio>
ASSISTANT: Open

Training set

LLaVA-1.5 Instruct training set¶

Example 1¶

Prompt	Audio StyleTTS2	Audio F5-TTS
USER: <audio> <image>
ASSISTANT: The cat is positioned on top of the back of the couch in the living room.
USER: <audio>
ASSISTANT: The cat is coming out from some curtains onto the couch and is sitting or standing on top of it.

Example 2¶

$No description has been provided for this image$

Prompt	Audio StyleTTS2	Audio F5-TTS
USER: <image> <audio>
ASSISTANT: The main focus of the image is a collection of stuffed toy bears.
USER: <audio>
ASSISTANT: The stuffed toy bears are placed on a table.

Example 3¶

Prompt	Audio StyleTTS2	Audio F5-TTS
USER: <image> <audio>
ASSISTANT: The image captures a night scene.
USER: <audio>
ASSISTANT: The street is illuminated by many street lights and the bright lights from the surrounding buildings. Additionally, there is a time lapse effect causing blurred lights on the road.

Example 4¶

Prompt	Audio StyleTTS2	Audio F5-TTS
USER: <image> <audio>
ASSISTANT: A red sports utility vehicle (SUV) is driving down the street in the image.
USER: <audio>
ASSISTANT: Yes, the street is filled with a considerable number of people, which indicates that the area is busy.

Evaluation sets

1. SEED-Bench¶

Adaptation from lmms-lab/SEED-Bench

Example 1¶

Prompt	Audio StyleTTS2	Audio F5-TTS
USER: <image> <audio> A. One. B. Two. C. Three. D. Four. Answer with the option's letter from the given choices directly.
ASSISTANT: A

Example 2¶

Prompt	Audio StyleTTS2	Audio F5-TTS
USER: <image> <audio> A. Black. B. Gray. C. White. D. Brown. Answer with the option's letter from the given choices directly.
ASSISTANT: C

2. MME¶

Adaptation from lmms-lab/MME

Example 1¶

Prompt	Audio StyleTTS2	Audio F5-TTS
USER: <image> <audio> Please answer yes or no.
ASSISTANT: Yes

Example 2¶

Prompt	Audio StyleTTS2	Audio F5-TTS
USER: <image> <audio> Please answer yes or no.
ASSISTANT: Yes

3. DocVQA¶

Adaptation from lmms-lab/DocVQA

Example 1¶

Prompt	Audio StyleTTS2	Audio F5-TTS
USER: <image> <audio>
ASSISTANT: 20 years

Example 2¶

Prompt	Audio StyleTTS2	Audio F5-TTS
USER: <image> <audio>
ASSISTANT: $ 3,038,444