Spoken question answering for visual queries
Anonymous submission to Interspeech 2025
Abstract
Question answering (QA) systems are designed to answer natural language questions. Visual QA (VQA) and Spoken QA (SQA) systems extend the textual QA system to accept visual and spoken input respectively.
This work aims to create a system that lets users interact through both speech and images. This is achieved by fusing the text, speech, and image modalities to tackle the task of spoken VQA (SVQA). The resulting multi-modal model accepts textual, visual, and spoken inputs and can answer spoken questions about images.
Training and evaluating SVQA models requires a dataset that covers all three modalities, but no such dataset currently exists. We address this problem by synthesizing spoken versions of VQA datasets using two zero-shot TTS models. Our initial findings indicate that a model trained only with synthesized speech nearly reaches the performance of the upper-bound model trained on textual QAs. In addition, we show that the choice of TTS model has only a minor impact on accuracy.
Samples page
This page presents samples from our training and evaluation SVQA datasets, as well as samples from a live demo. For each sample, we provide the textual prompt, the input image, and the synthesized audio questions. Each textual conversation consists of "USER" input turns and "ASSISTANT" turns, where an "ASSISTANT" turn is the expected response from the LLM.
In the live demo, we asked the questions with a real voice and provided the model with images that it had never seen before.
The audio samples of the datasets were synthesized from the textual questions using StyleTTS2 and F5-TTS. Speaker prompts for the TTS models were randomly chosen from the Multilingual LibriSpeech samples.
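
The conversion from a textual conversation to a spoken one can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the pipeline used for the datasets: `SpokenTurn`, `build_spoken_conversation`, and `fake_tts` are hypothetical names, the TTS call is replaced by a stub that only invents a file name, the speaker file names are made up, and the question texts are invented for illustration because the page provides the questions only as audio. It also assumes that a speaker prompt is drawn independently for every question and that only the first turn carries the `<image>` placeholder.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Sequence, Tuple

IMAGE_TOKEN = "<image>"
AUDIO_TOKEN = "<audio>"


@dataclass
class SpokenTurn:
    user_prompt: str     # placeholders only; the question itself is in the audio
    audio_path: str      # file name returned by the TTS callable
    speaker_prompt: str  # reference recording that sets the synthesized voice
    answer: str          # expected ASSISTANT response, kept as text


def build_spoken_conversation(
    qa_pairs: Sequence[Tuple[str, str]],
    speaker_prompts: Sequence[str],
    synthesize: Callable[[str, str], str],
    seed: int = 0,
) -> List[SpokenTurn]:
    """Replace each textual question with an <audio> placeholder.

    Assumption: one speaker prompt is drawn independently per question,
    and only the first turn carries the <image> placeholder.
    """
    rng = random.Random(seed)  # seeded so the speaker assignment is reproducible
    turns = []
    for index, (question, answer) in enumerate(qa_pairs):
        speaker = rng.choice(speaker_prompts)
        audio_path = synthesize(question, speaker)
        prompt = f"{IMAGE_TOKEN} {AUDIO_TOKEN}" if index == 0 else AUDIO_TOKEN
        turns.append(SpokenTurn(prompt, audio_path, speaker, answer))
    return turns


def fake_tts(text: str, speaker_prompt: str) -> str:
    """Stand-in for a zero-shot TTS call; it only invents a file name."""
    return f"audio/{len(text):03d}_{speaker_prompt}"


# Hypothetical question texts: the page provides the questions only as audio.
qa_pairs = [
    ("What is the main focus of the image?",
     "The main focus of the image is a collection of stuffed toy bears."),
    ("Where are the stuffed toy bears placed?",
     "The stuffed toy bears are placed on a table."),
]
speakers = ["speaker_a.wav", "speaker_b.wav", "speaker_c.wav"]  # hypothetical files

turns = build_spoken_conversation(qa_pairs, speakers, fake_tts)
for turn in turns:
    print(f"USER: {turn.user_prompt}")
    print(f"ASSISTANT: {turn.answer}")
```

Passing the TTS function in as a callable keeps the sketch independent of any particular TTS library, so the same loop could in principle be run once per TTS model to produce the two audio columns shown in the tables.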
Live QA Examples
Example 1
| Prompt | Audio | |
|---|---|---|
| USER: <image> <audio> | ||
| ASSISTANT: Green |
Example 2
| Prompt | Audio | |
|---|---|---|
| USER: <image> <audio> | ||
| ASSISTANT: The image features a cartoon giraffe wearing headphones. The giraffe is smiling and has a tongue sticking out of its mouth. |
Example 3
| Prompt | Audio | |
|---|---|---|
| USER: <image> <audio> | ||
| ASSISTANT: Open |
Training set
LLaVA-1.5 Instruct training set
Example 1
| Prompt | Audio StyleTTS2 | Audio F5-TTS |
|---|---|---|
| USER: <audio> <image> | ||
| ASSISTANT: The cat is positioned on top of the back of the couch in the living room. | ||
| USER: <audio> | ||
| ASSISTANT: The cat is coming out from some curtains onto the couch and is sitting or standing on top of it. |
Example 2
| Prompt | Audio StyleTTS2 | Audio F5-TTS |
|---|---|---|
| USER: <image> <audio> | ||
| ASSISTANT: The main focus of the image is a collection of stuffed toy bears. | ||
| USER: <audio> | ||
| ASSISTANT: The stuffed toy bears are placed on a table. |
Example 3
| Prompt | Audio StyleTTS2 | Audio F5-TTS |
|---|---|---|
| USER: <image> <audio> | ||
| ASSISTANT: The image captures a night scene. | ||
| USER: <audio> | ||
| ASSISTANT: The street is illuminated by many street lights and the bright lights from the surrounding buildings. Additionally, there is a time lapse effect causing blurred lights on the road. |
Example 4
| Prompt | Audio StyleTTS2 | Audio F5-TTS |
|---|---|---|
| USER: <image> <audio> | ||
| ASSISTANT: A red sports utility vehicle (SUV) is driving down the street in the image. | ||
| USER: <audio> | ||
| ASSISTANT: Yes, the street is filled with a considerable number of people, which indicates that the area is busy. |
Evaluation sets
1. SEED-Bench
Adapted from lmms-lab/SEED-Bench
Example 1
| Prompt | Audio StyleTTS2 | Audio F5-TTS |
|---|---|---|
| USER: <image> <audio> A. One. B. Two. C. Three. D. Four. Answer with the option's letter from the given choices directly. | ||
| ASSISTANT: A |
Example 2
| Prompt | Audio StyleTTS2 | Audio F5-TTS |
|---|---|---|
| USER: <image> <audio> A. Black. B. Gray. C. White. D. Brown. Answer with the option's letter from the given choices directly. | ||
| ASSISTANT: C |
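
The multiple-choice prompts in the examples above follow a fixed layout: the image and audio placeholders, the lettered options, and a closing instruction. The sketch below rebuilds that layout and scores single-letter answers. The layout is taken from the examples; the scoring rule is an assumption, since the page does not describe how responses are scored, and `build_mc_prompt`, `extract_letter`, and `accuracy` are hypothetical helper names.

```python
import re
from typing import Optional, Sequence

LETTERS = "ABCDEFGHIJKLMNOPQRSTUVWXYZ"
INSTRUCTION = "Answer with the option's letter from the given choices directly."


def build_mc_prompt(options: Sequence[str]) -> str:
    """Lay out the options as 'A. One. B. Two. ...' after the placeholders."""
    if not 0 < len(options) <= len(LETTERS):
        raise ValueError("expected between 1 and 26 options")
    choices = " ".join(f"{LETTERS[i]}. {text}." for i, text in enumerate(options))
    return f"<image> <audio> {choices} {INSTRUCTION}"


def extract_letter(response: str, n_options: int) -> Optional[str]:
    """Return the leading option letter of a response, or None.

    Simplification (assumed, not from the source): only an upper-case letter
    at the start that is not part of a longer word counts, so a sentence
    beginning with the word "A" would still be read as option A.
    """
    match = re.match(r"\s*\(?([A-Z])(?![A-Za-z])", response)
    if match and match.group(1) in LETTERS[:n_options]:
        return match.group(1)
    return None


def accuracy(responses: Sequence[str], gold: Sequence[str], n_options: int) -> float:
    """Fraction of responses whose extracted letter equals the gold letter."""
    if len(responses) != len(gold) or not gold:
        raise ValueError("responses and gold must be non-empty and equally long")
    correct = sum(extract_letter(r, n_options) == g for r, g in zip(responses, gold))
    return correct / len(gold)


print(build_mc_prompt(["One", "Two", "Three", "Four"]))
```

A response that does not start with a valid option letter is simply counted as wrong here; a stricter or more lenient matching rule would change the reported accuracy.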
2. MME
Adapted from lmms-lab/MME
3. DocVQA
Adapted from lmms-lab/DocVQA