
Using sound localization to make group conversations more accessible



Mobile speech-to-text apps like Live Transcribe have become essential tools for hearing and speech accessibility, language translation, note-taking, and meeting transcription. However, existing mobile automatic speech recognition (ASR) applications typically merge the transcribed speech of every participant in a conversation, making it difficult to tell who said what. This limitation creates cognitive overload: users must simultaneously process the transcript, identify speakers, and participate in the conversation. Solutions that separate speakers have been deployed, but they are currently impractical to set up in mobile scenarios.

Speaker embedding techniques, for example, require a model to identify and register each speaker’s distinct voiceprint, while audio-visual speech separation depends on camera input. In “SpeechCompass: Enhancing Mobile Captioning with Diarization and Directional Guidance via Multi-Microphone Localization”, recipient of a Best Paper Award at CHI 2025, we explore an approach that enhances mobile captioning with speaker diarization (separating speakers in an ASR transcript) and real-time localization of incoming sound. By color-coding each speaker’s text and showing directional indicators (arrows) that tell users where speech is coming from, SpeechCompass produces user-friendly transcripts for group conversations.

This multi-microphone strategy reduces computational cost, cuts latency, and better preserves privacy.

Real-time audio localization

We implement SpeechCompass in two ways: as a prototype phone case with four microphones connected to a low-power microcontroller, and as software for phones that already have two microphones. The phone case positions its microphones to enable 360-degree sound localization, while the software implementation, running on devices with two microphones such as Pixel phones, offers only 180-degree localization. In both cases, speech recognition runs on the phone and a mobile application visualizes the transcripts. Indoors, sound bounces off surfaces and causes reverberation, which makes it difficult to precisely localize audio, particularly speech.

We employ a time-difference-of-arrival (TDOA) localization algorithm to address this issue. Because audio signals arrive at each microphone at slightly different times, the algorithm can estimate the TDOA between microphone pairs via cross-correlation and use it to predict the sound’s angle of arrival. Specifically, we use Generalized Cross Correlation with Phase Transform (GCC-PHAT), which speeds up computation and improves robustness to noise. To further increase the localizer’s precision, we then apply statistical estimation techniques such as kernel density estimation. An array of two omnidirectional microphones always suffers from “front–back” confusion (signals arriving from in front of the array and from behind it look identical), limiting it to 180-degree localization. Using three or more microphones resolves this ambiguity and makes 360-degree localization possible.
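
As a rough illustration of this pipeline (a sketch, not the paper’s implementation), the code below estimates a TDOA between two microphone channels with GCC-PHAT, converts it to an angle of arrival, combines two orthogonal microphone pairs into a full 360-degree bearing as a four-microphone array permits, and smooths successive estimates with a kernel density estimate. The sample rate, microphone spacing, and smoothing window are illustrative assumptions.

import numpy as np
from scipy.stats import gaussian_kde

SAMPLE_RATE = 48_000      # Hz (assumed, not from the paper)
MIC_SPACING = 0.08        # meters between paired microphones (assumed)
SPEED_OF_SOUND = 343.0    # m/s in air at room temperature

def gcc_phat_tdoa(sig_a, sig_b, fs=SAMPLE_RATE):
    """Estimate the time difference of arrival between two microphone
    signals with Generalized Cross Correlation with Phase Transform."""
    n = len(sig_a) + len(sig_b)              # zero-pad to avoid circular wraparound
    fft_a = np.fft.rfft(sig_a, n=n)
    fft_b = np.fft.rfft(sig_b, n=n)
    cross = fft_a * np.conj(fft_b)
    cross /= np.abs(cross) + 1e-12           # PHAT weighting: keep phase, drop magnitude
    cc = np.fft.irfft(cross, n=n)
    max_lag = int(fs * MIC_SPACING / SPEED_OF_SOUND)   # physically possible lags only
    cc = np.concatenate((cc[-max_lag:], cc[:max_lag + 1]))
    return (int(np.argmax(np.abs(cc))) - max_lag) / fs  # TDOA in seconds

def angle_of_arrival(tdoa):
    """Map a TDOA from one mic pair to an angle in degrees. Two
    omnidirectional mics are front-back ambiguous: a 180-degree range."""
    sin_theta = np.clip(tdoa * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
    return float(np.degrees(np.arcsin(sin_theta)))

def bearing_360(tdoa_x, tdoa_y):
    """Combine TDOAs from two orthogonal mic pairs (e.g., a square
    four-mic array) into an unambiguous 360-degree bearing."""
    # Each TDOA is proportional to the cosine of the angle between the
    # source direction and that pair's axis; atan2 resolves the quadrant.
    cos_t = np.clip(tdoa_x * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
    sin_t = np.clip(tdoa_y * SPEED_OF_SOUND / MIC_SPACING, -1.0, 1.0)
    return float(np.degrees(np.arctan2(sin_t, cos_t))) % 360.0

def smooth_angle(recent_angles):
    """Stabilize a jittery localizer by taking the mode of a kernel
    density estimate over a short window of recent angle estimates
    (ignores wraparound at 0/360 for simplicity)."""
    grid = np.linspace(0.0, 360.0, 361)
    density = gaussian_kde(recent_angles)(grid)
    return float(grid[np.argmax(density)])

In practice, per-frame bearings would feed smooth_angle over a sliding window of recent estimates; the paper’s actual smoothing, array geometry, and frame sizes may differ.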
