Do Music Source Separation Models Preserve Spatial Information in Binaural Audio?


Richa Namballa

rn2214@nyu.edu

Music Technology
New York University
New York City, USA

Agnieszka Roginska

ar137@nyu.edu

Music Technology
New York University
New York City, USA

Magdalena Fuentes

mf3734@nyu.edu

Music Technology / IDM
New York University
New York City, USA


Accepted at ISMIR 2025



Summary

Binaural audio remains underexplored within the music information retrieval community. Motivated by the rising popularity of virtual and augmented reality experiences as well as potential applications to accessibility, we investigate how well existing music source separation (MSS) models perform on binaural audio. Although these models process two-channel inputs, it is unclear how effectively they retain spatial information. In this work, we evaluate how several popular MSS models preserve spatial information on both standard stereo and novel binaural datasets. Our binaural data is synthesized using stems from MUSDB18-HQ and open-source head-related transfer functions by positioning instrument sources randomly along the horizontal plane. We then assess the spatial quality of the separated stems using signal processing and interaural cue-based metrics. Our results show that stereo MSS models fail to preserve the spatial information critical for maintaining the immersive quality of binaural audio, and that the degradation depends on model architecture as well as the target instrument. Finally, we highlight valuable opportunities for future work at the intersection of MSS and immersive audio.



Audio Examples and Metrics

Diagram illustrating the random placement of instrument sources in Binaural-MUSDB.
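As a rough illustration of the synthesis described in the summary, the sketch below places mono stems at random azimuths and renders each one to two channels by convolving it with a left/right impulse-response pair. It is a minimal sketch under stated assumptions, not the pipeline used to build Binaural-MUSDB: `toy_hrir_pair` is a hypothetical stand-in for measured head-related impulse responses (a pure delay from the Woodworth spherical-head approximation plus a made-up level drop), the sign convention (positive azimuth to the right) is assumed, azimuths are restricted to the frontal half-plane for simplicity, and the stems are random noise rather than MUSDB18-HQ audio.

```python
import numpy as np

HEAD_RADIUS_M = 0.0875   # assumed average head radius
SPEED_OF_SOUND = 343.0   # m/s

def toy_hrir_pair(azimuth_deg, sr, n_taps=64):
    """Hypothetical stand-in for a measured HRIR pair; returns (left, right).

    Assumes |azimuth_deg| <= 90 and that positive azimuth means "to the right".
    """
    theta = np.deg2rad(abs(azimuth_deg))
    # Woodworth spherical-head approximation of the interaural time difference
    itd_s = HEAD_RADIUS_M / SPEED_OF_SOUND * (theta + np.sin(theta))
    delay = int(round(itd_s * sr))
    # Made-up level drop at the far ear (up to 6 dB at 90 degrees), for illustration only
    far_gain = 10.0 ** (-(6.0 * np.sin(theta)) / 20.0)
    near = np.zeros(n_taps)
    near[0] = 1.0
    far = np.zeros(n_taps)
    far[delay] = far_gain
    return (far, near) if azimuth_deg > 0 else (near, far)

def render_binaural(mono, azimuth_deg, sr):
    """Convolve a mono signal with the left and right impulse responses."""
    h_left, h_right = toy_hrir_pair(azimuth_deg, sr)
    return np.stack([np.convolve(mono, h_left), np.convolve(mono, h_right)])

sr = 44100
rng = np.random.default_rng(0)
# Placeholder stems: one second of noise per instrument instead of real audio
stems = {name: rng.standard_normal(sr) for name in ("bass", "drums", "other", "vocals")}
azimuths = {name: rng.uniform(-90.0, 90.0) for name in stems}
binaural_stems = {name: render_binaural(x, azimuths[name], sr) for name, x in stems.items()}
# The binaural mixture is the sum of the individually spatialized stems
mixture = sum(binaural_stems.values())
```

With measured HRIRs, only `toy_hrir_pair` would change: it would look up the impulse-response pair closest to the requested direction instead of constructing one.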

SSR: Signal to Spatial Distortion Ratio     SRR: Signal to Residual Distortion Ratio

ΔITD: Distortion in Interaural Time Difference     ΔILD: Distortion in Interaural Level Difference
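To make the two interaural metrics concrete, the sketch below estimates the ITD of a two-channel signal as the lag that maximizes the left/right cross-correlation and the ILD as the left/right energy ratio in dB, then reports the absolute difference of each cue between a reference stem and a separated output. This is one common way to compute such cues and is only an assumption about how the numbers on this page could be obtained; the function names, the broadband (non-frequency-dependent) treatment, and the 1.5 ms lag search range are all hypothetical choices. The ΔITD values listed in the examples are whole multiples of about 22.68 μs, which is one sample at 44.1 kHz and is consistent with a lag-based estimate.

```python
import numpy as np

def itd_seconds(left, right, sr, max_lag_s=1.5e-3):
    """ITD as the cross-correlation peak lag; positive means the right channel lags."""
    left = np.asarray(left, dtype=float)
    right = np.asarray(right, dtype=float)
    n = len(left)
    max_lag = int(round(max_lag_s * sr))
    lags = np.arange(-max_lag, max_lag + 1)
    # xcorr[k] = sum over n of left[n] * right[n + k]
    xcorr = [
        np.dot(left[max(0, -k): n - max(0, k)], right[max(0, k): n - max(0, -k)])
        for k in lags
    ]
    return float(lags[int(np.argmax(xcorr))]) / sr

def ild_db(left, right, eps=1e-12):
    """ILD as the broadband left/right energy ratio in dB."""
    return 10.0 * np.log10((np.sum(np.square(left)) + eps) / (np.sum(np.square(right)) + eps))

def delta_cues(reference, estimate, sr):
    """Absolute ITD difference (in microseconds) and ILD difference (in dB).

    Both inputs are arrays of shape (2, n_samples): row 0 is left, row 1 is right.
    """
    d_itd = abs(itd_seconds(reference[0], reference[1], sr)
                - itd_seconds(estimate[0], estimate[1], sr))
    d_ild = abs(ild_db(reference[0], reference[1]) - ild_db(estimate[0], estimate[1]))
    return d_itd * 1e6, d_ild

# Toy check: the reference has a 10-sample delay and half amplitude in the right
# channel; the "estimate" has collapsed to two identical channels.
sr = 44100
rng = np.random.default_rng(0)
x = rng.standard_normal(sr // 10)
d = 10
delayed = np.concatenate([np.zeros(d), x[:-d]])
reference = np.stack([x, 0.5 * delayed])
estimate = np.stack([x, x])
d_itd_us, d_ild_db = delta_cues(reference, estimate, sr)
```

In this toy case the estimate has lost both cues, so ΔITD equals the reference delay (10 samples) and ΔILD is close to the 6 dB level difference of the reference.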

Bass: Hollow Ground - "Ill Fate"

Binaural
Stereo

Input - Mixture

Input - Mixture

Reference Stem (θ = 90°)

Reference Stem

Output - Demucs

SSR: 6.47 dB     SRR: 3.31 dB

ΔITD: 816.33 μs     ΔILD: 1.49 dB

Output - Demucs

SSR: 9.56 dB     SRR: 0.36 dB

ΔITD: 0.00 μs     ΔILD: 0.65 dB

Output - Open-Unmix

SSR: 6.79 dB     SRR: -1.10 dB

ΔITD: 997.73 μs     ΔILD: 2.61 dB

Output - Open-Unmix

SSR: 12.01 dB     SRR: -2.62 dB

ΔITD: 0.00 μs     ΔILD: 0.22 dB

Output - Spleeter

SSR: 12.13 dB     SRR: -2.60 dB

ΔITD: 1020.41 μs     ΔILD: 2.35 dB

Output - Spleeter

SSR: 4.09 dB     SRR: -3.53 dB

ΔITD: 22.68 μs     ΔILD: 0.39 dB

Vocals: Ben Carrigan - "We'll Talk About It All Tonight"

Binaural
Stereo

Input - Mixture

Input - Mixture

Reference Stem (θ = 60°)

Reference Stem

Output - Demucs

SSR: 5.52 dB     SRR: 0.00 dB

ΔITD: 566.89 μs     ΔILD: 0.27 dB

Output - Demucs

SSR: 5.77 dB     SRR: 0.08 dB

ΔITD: 0.00 μs     ΔILD: 0.03 dB

Output - Open-Unmix

SSR: 1.45 dB     SRR: -4.47 dB

ΔITD: 294.78 μs     ΔILD: 0.14 dB

Output - Open-Unmix

SSR: 2.05 dB     SRR: -1.92 dB

ΔITD: 0.00 μs     ΔILD: 0.13 dB

Output - Spleeter

SSR: 0.00 dB     SRR: -3.71 dB

ΔITD: 0.00 μs     ΔILD: 0.73 dB

Output - Spleeter

SSR: 3.06 dB     SRR: 0.00 dB

ΔITD: 0.00 μs     ΔILD: 0.23 dB

Drums: The Easton Ellises - "Falcon 69"

Binaural
Stereo

Input - Mixture

Input - Mixture

Reference Stem (θ = 90°)

Reference Stem

Output - Demucs

SSR: 25.53 dB     SRR: 10.06 dB

ΔITD: 0.00 μs     ΔILD: 0.30 dB

Output - Demucs

SSR: 21.36 dB     SRR: 10.16 dB

ΔITD: 0.00 μs     ΔILD: 0.02 dB

Output - Open-Unmix

SSR: 24.60 dB     SRR: 8.51 dB

ΔITD: 0.00 μs     ΔILD: 0.37 dB

Output - Open-Unmix

SSR: 15.91 dB     SRR: 6.40 dB

ΔITD: 0.00 μs     ΔILD: 0.01 dB

Output - Spleeter

SSR: 19.32 dB     SRR: 7.60 dB

ΔITD: 0.00 μs     ΔILD: 0.51 dB

Output - Spleeter

SSR: 15.87 dB     SRR: 7.14 dB

ΔITD: 0.00 μs     ΔILD: 0.00 dB

Other: Enda Reilly - "Cur An Long Ag Seol"

Binaural
Stereo

Input - Mixture

Input - Mixture

Reference Stem (θ = -30°)

Reference Stem

Output - Demucs

SSR: 14.28 dB     SRR: 7.64 dB

ΔITD: 0.00 μs     ΔILD: 0.05 dB

Output - Demucs

SSR: 19.06 dB     SRR: 9.40 dB

ΔITD: 0.00 μs     ΔILD: 0.04 dB

Output - Open-Unmix

SSR: 15.93 dB     SRR: 4.69 dB

ΔITD: 90.70 μs     ΔILD: 0.24 dB

Output - Open-Unmix

SSR: 15.24 dB     SRR: 5.49 dB

ΔITD: 0.00 μs     ΔILD: 0.36 dB

Output - Spleeter

SSR: 15.41 dB     SRR: 3.77 dB

ΔITD: 0.00 μs     ΔILD: 0.45 dB

Output - Spleeter

SSR: 15.46 dB     SRR: 4.83 dB

ΔITD: 0.00 μs     ΔILD: 0.62 dB