Link to bioRxiv paper: http://biorxiv.org/cgi/content/short/2023.07.20.549865v1?rss=1
Authors: Wang, D., Wang, J., Sun, M.
Abstract: Singing voice separation on robots faces the problem of interpreting ambiguous auditory signals. The acoustic signal that the humanoid robot perceives through its onboard microphones is a mixture of singing voice, music, and noise, with distortion, attenuation, and reverberation. In this paper, we used a 3-directional Inception-ResNet structure in a U-shaped encoder-decoder network to improve the utilization of the spatial and spectral information in the spectrograms. The model was trained with multiple objectives: a magnitude consistency loss, a phase consistency loss, and a magnitude correlation consistency loss. We recorded the singing voice and accompaniment derived from the MIR-1K dataset with NAO robots and synthesized 10-channel datasets for training the model. The experimental results show that the proposed model trained with the multi-objective loss reaches an average NSDR of 11.55 dB on the test set, which outperforms the comparison model.
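The abstract only names the three training objectives; as a rough illustration of how such a multi-objective spectrogram loss might be combined, here is a minimal PyTorch sketch. The weighting scheme, the phase-weighting trick, the Pearson-style correlation term, and all function and tensor names are assumptions for illustration, not the authors' implementation.

    import torch

    def multi_objective_loss(est_spec, ref_spec, w_mag=1.0, w_phase=0.1, w_corr=0.1):
        """Hypothetical combination of the three objectives named in the
        abstract: magnitude, phase, and magnitude-correlation consistency.
        est_spec and ref_spec are complex STFT tensors of shape
        (batch, freq, time); the weights are illustrative only."""
        est_mag, ref_mag = est_spec.abs(), ref_spec.abs()
        # Magnitude consistency: L1 distance between magnitude spectrograms.
        l_mag = (est_mag - ref_mag).abs().mean()
        # Phase consistency: penalize the angular difference, scaled by the
        # reference magnitude so low-energy bins contribute less.
        l_phase = (ref_mag * (1 - torch.cos(est_spec.angle() - ref_spec.angle()))).mean()
        # Magnitude correlation consistency: push the correlation between the
        # two magnitude spectrograms toward 1.
        e = est_mag.flatten(1) - est_mag.flatten(1).mean(dim=1, keepdim=True)
        r = ref_mag.flatten(1) - ref_mag.flatten(1).mean(dim=1, keepdim=True)
        corr = (e * r).sum(dim=1) / (e.norm(dim=1) * r.norm(dim=1) + 1e-8)
        l_corr = (1 - corr).mean()
        return w_mag * l_mag + w_phase * l_phase + w_corr * l_corr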
Copyright belongs to the original authors. Visit the link for more info.
Podcast created by Paper Player, LLC