


Image-Music Alignment With Contrastive Learning



Author: Li Wang
Institution: Zhejiang University
Date: May 04, 2025

Abstract

Cross-modal music retrieval remains a challenging task for current search engines. Existing engines match music tracks through coarse-granularity retrieval over metadata, such as pre-defined tags and genres, and therefore struggle with fine-granularity contextual queries. We propose a novel dataset of 66,048 image-music pairs for cross-modal music retrieval and introduce a hybrid-granularity retrieval framework based on contrastive learning. Our method outperforms previous approaches at image-music alignment.






MIPNet Dataset

66K image-music pairs

 

The Proposed HG-CLIM Framework

HG-CLIM
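
This page does not spell out the HG-CLIM training objective, so the snippet below is only a generic illustration of contrastive learning over paired embeddings, not the paper's hybrid-granularity method. It sketches a CLIP-style symmetric InfoNCE loss in NumPy, assuming each image and each music track has already been encoded into a fixed-size vector. The function names, the embedding shapes, and the temperature of 0.07 are all assumptions chosen for illustration.

```python
import numpy as np


def l2_normalize(x, eps=1e-12):
    """Scale each row to unit length so dot products become cosine similarities."""
    norms = np.linalg.norm(x, axis=1, keepdims=True)
    return x / np.maximum(norms, eps)


def log_softmax(logits, axis):
    """Numerically stable log-softmax along the given axis."""
    shifted = logits - logits.max(axis=axis, keepdims=True)
    return shifted - np.log(np.exp(shifted).sum(axis=axis, keepdims=True))


def symmetric_contrastive_loss(image_emb, music_emb, temperature=0.07):
    """Generic symmetric InfoNCE loss (hypothetical helper, not from the paper).

    image_emb, music_emb: arrays of shape (batch, dim); row i of each array
    is assumed to come from the same image-music pair.
    """
    img = l2_normalize(image_emb)
    mus = l2_normalize(music_emb)

    # Pairwise cosine similarities, sharpened by the temperature.
    logits = img @ mus.T / temperature

    # The matching pair for row i sits on the diagonal.
    idx = np.arange(logits.shape[0])
    loss_image_to_music = -log_softmax(logits, axis=1)[idx, idx].mean()
    loss_music_to_image = -log_softmax(logits, axis=0)[idx, idx].mean()
    return 0.5 * (loss_image_to_music + loss_music_to_image)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    images = rng.normal(size=(8, 16))
    # Music embeddings that are noisy copies of the image embeddings,
    # standing in for a well-aligned encoder pair.
    music = images + 0.1 * rng.normal(size=(8, 16))
    print("aligned loss:  ", symmetric_contrastive_loss(images, music))
    print("shuffled loss: ", symmetric_contrastive_loss(images, music[::-1]))
```

Minimizing a loss of this kind pulls each image toward its paired track and pushes it away from the other tracks in the batch, which is why correctly paired embeddings should score a lower loss than mismatched ones. At retrieval time, the same cosine similarities can be used to rank candidate tracks for a query image.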

 

Links and Downloads

Paper