Cross-Modal Contrastive Learning for Audio-Visual Synchronization
This note explores the architecture of dual-encoder models that align audio waveforms with video frames. It details the implementation of a contrastive loss function (InfoNCE) that creates a shared embedding space, in which the model can determine whether a spoken audio clip is temporally aligned with the lip movements in a video snippet.
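As a rough illustration of the InfoNCE objective mentioned above, here is a minimal NumPy sketch. It assumes the two encoders have already produced one embedding per clip, with row i of the audio batch and row i of the video batch coming from the same synchronized snippet, and every other pairing in the batch treated as a negative. The function name `info_nce_loss`, the symmetric (audio-to-video plus video-to-audio) form, and the temperature of 0.07 are assumptions made for this example, not details taken from the note.

```python
import numpy as np


def info_nce_loss(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings (illustrative sketch).

    audio_emb, video_emb: arrays of shape (batch, dim); row i of each is a
    positive pair, all other rows in the batch act as negatives.
    temperature: assumed value, typically tuned or learned in practice.
    """
    # L2-normalize so the dot product below is cosine similarity.
    a = audio_emb / np.linalg.norm(audio_emb, axis=1, keepdims=True)
    v = video_emb / np.linalg.norm(video_emb, axis=1, keepdims=True)

    # logits[i, j] = similarity of audio clip i and video clip j, scaled.
    logits = (a @ v.T) / temperature

    def diagonal_cross_entropy(scores):
        # Row-wise log-softmax, computed stably by subtracting the row max.
        shifted = scores - scores.max(axis=1, keepdims=True)
        log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
        # The correct "class" for row i is column i (the matching clip).
        return -np.mean(np.diag(log_probs))

    # Average the audio-to-video and video-to-audio directions.
    return 0.5 * (diagonal_cross_entropy(logits) + diagonal_cross_entropy(logits.T))


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    audio = rng.normal(size=(8, 16))                 # stand-in for audio encoder output
    video = audio + 0.1 * rng.normal(size=(8, 16))   # stand-in for aligned video output
    print("aligned loss:   ", info_nce_loss(audio, video))
    print("misaligned loss:", info_nce_loss(audio, np.roll(video, 1, axis=0)))
```

Minimizing this loss pulls synchronized audio and video embeddings together and pushes mismatched pairs apart. At inference time, a sync decision could then be made by thresholding the cosine similarity between a clip's audio and video embeddings, with the threshold chosen on held-out data.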