Deepak Bolleddu - Research Blogs & Notes

Multimodal AI Mar 2026

Cross-Modal Contrastive Learning for Audio-Visual Synchronization

Explores dual-encoder architectures that align audio waveforms with video frames. This note details an implementation of the InfoNCE contrastive loss, which trains both encoders into a shared embedding space so the model can determine whether a spoken audio clip is temporally aligned with the lip movements in a video snippet.
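As a rough illustration of the loss named above, here is a minimal dependency-free sketch of a symmetric InfoNCE objective over a batch of paired embeddings. The function names, the symmetric (audio-to-video plus video-to-audio) form, and the temperature of 0.07 are assumptions made for this example, not details taken from the note itself.

```python
import math

def l2_normalize(vec):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec)) or 1.0
    return [x / norm for x in vec]

def info_nce(audio_emb, video_emb, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired embeddings.

    Row i of audio_emb and row i of video_emb are the positive pair;
    every other row in the batch acts as a negative.
    The temperature default is an assumed, commonly used value.
    """
    a = [l2_normalize(v) for v in audio_emb]
    v = [l2_normalize(u) for u in video_emb]
    n = len(a)
    # Cosine-similarity logits, sharpened by the temperature.
    logits = [[sum(x * y for x, y in zip(a[i], v[j])) / temperature
               for j in range(n)] for i in range(n)]

    def cross_entropy_on_diagonal(rows):
        total = 0.0
        for i, row in enumerate(rows):
            m = max(row)  # subtract the max for numerical stability
            log_z = m + math.log(sum(math.exp(s - m) for s in row))
            total += log_z - row[i]
        return total / len(rows)

    audio_to_video = cross_entropy_on_diagonal(logits)
    video_to_audio = cross_entropy_on_diagonal([list(col) for col in zip(*logits)])
    return 0.5 * (audio_to_video + video_to_audio)
```

With toy inputs, matched pairs should give a loss near zero while shuffled pairs give a much larger one, which is the signal a synchronization model would be trained on.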

Notes PDF Source Code

Computer Vision Feb 2026

Real-time 3D Scene Reconstruction using Neural Radiance Fields (NeRF)

A deep dive into optimizing NeRF architectures for real-time rendering. This post covers the mathematics behind volumetric rendering and how techniques such as multiresolution hash-grid encoding (Instant-NGP) drastically reduce the training time required to synthesize novel views from sparse sets of 2D images.
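For readers new to the volumetric rendering step mentioned above, the sketch below shows the standard discrete compositing rule used in NeRF-style renderers: each sample along a ray contributes its color weighted by its opacity and by the transmittance accumulated so far. The function name `render_ray` and its argument layout are hypothetical, chosen only for this example.

```python
import math

def render_ray(densities, colors, deltas):
    """Composite samples along one ray, ordered front to back.

    densities: volume density (sigma) at each sample
    colors:    RGB triple at each sample
    deltas:    distance between consecutive samples
    Returns the pixel color and the per-sample weights.
    """
    transmittance = 1.0  # fraction of light that has not been absorbed yet
    pixel = [0.0, 0.0, 0.0]
    weights = []
    for sigma, rgb, delta in zip(densities, colors, deltas):
        alpha = 1.0 - math.exp(-sigma * delta)  # opacity of this segment
        weight = transmittance * alpha
        weights.append(weight)
        pixel = [p + weight * c for p, c in zip(pixel, rgb)]
        transmittance *= 1.0 - alpha
    return pixel, weights
```

Empty space (zero density) contributes nothing, and a very dense first sample hides everything behind it. Because every sample must be evaluated for every ray, faster encodings have a large effect on training and rendering time.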

Read Paper Implementation

Emotion AI Jan 2026

Micro-expression Recognition using Spatio-Temporal Graph Convolutional Networks

Micro-expressions are involuntary facial movements lasting fractions of a second. This post details a custom pipeline that extracts facial landmarks from sequential video frames and uses Spatio-Temporal Graph Convolutional Networks (ST-GCN) to model the structural dynamics of facial muscles for accurate emotion classification.
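To make the spatio-temporal idea concrete, here is a minimal sketch of one block in plain Python: a graph convolution mixes features between connected landmarks within each frame, then a 1D convolution over time mixes each landmark's features across neighboring frames. The symmetric adjacency normalization, the ReLU, and all function names are assumptions for illustration; the post's actual pipeline may differ.

```python
import math

def normalized_adjacency(num_nodes, edges):
    """Return D^-1/2 (A + I) D^-1/2 for an undirected landmark graph."""
    adj = [[1.0 if i == j else 0.0 for j in range(num_nodes)]
           for i in range(num_nodes)]
    for i, j in edges:
        adj[i][j] = adj[j][i] = 1.0
    degree = [sum(row) for row in adj]
    return [[adj[i][j] / math.sqrt(degree[i] * degree[j])
             for j in range(num_nodes)] for i in range(num_nodes)]

def spatial_graph_conv(frame, a_hat, weight):
    """Graph convolution on one frame: ReLU(A_hat @ X @ W)."""
    n, in_dim, out_dim = len(frame), len(weight), len(weight[0])
    mixed = [[sum(a_hat[i][k] * frame[k][d] for k in range(n))
              for d in range(in_dim)] for i in range(n)]
    return [[max(0.0, sum(mixed[i][d] * weight[d][o] for d in range(in_dim)))
             for o in range(out_dim)] for i in range(n)]

def temporal_conv(sequence, kernel):
    """Sliding weighted sum over time (no padding), per landmark and channel."""
    k = len(kernel)
    num_nodes, channels = len(sequence[0]), len(sequence[0][0])
    out = []
    for t in range(len(sequence) - k + 1):
        window = sequence[t:t + k]
        out.append([[sum(kernel[s] * window[s][i][c] for s in range(k))
                     for c in range(channels)] for i in range(num_nodes)])
    return out

def st_gcn_block(sequence, a_hat, weight, kernel):
    """Spatial graph convolution on every frame, then temporal convolution."""
    spatial = [spatial_graph_conv(frame, a_hat, weight) for frame in sequence]
    return temporal_conv(spatial, kernel)
```

Here `sequence` is indexed as frames, then landmarks, then feature channels (for example x and y coordinates), and `edges` would list which landmarks are treated as anatomically connected.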

Summary PDF View Repository

Speech Processing Nov 2025

Zero-shot Voice Conversion via Disentangled Representation Learning

Analyzes methods for separating speaker identity (timbre) from linguistic content (phonemes) in raw audio. The post breaks down how Vector Quantized Variational Autoencoders (VQ-VAE) compress speech into discrete codes, making it possible to synthesize a target voice without any paired training data.
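The core operation in a VQ-VAE bottleneck is a nearest-neighbor lookup: each encoder output frame is replaced by the closest entry in a learned codebook. The sketch below shows only that lookup, with a hypothetical `quantize` helper and toy two-dimensional vectors; training the codebook and the encoder is out of scope here.

```python
def quantize(frames, codebook):
    """Map each frame vector to its nearest codebook entry.

    Returns the chosen code indices and the quantized vectors.
    Distances are squared Euclidean.
    """
    indices, quantized = [], []
    for frame in frames:
        distances = [sum((x - c) ** 2 for x, c in zip(frame, code))
                     for code in codebook]
        best = min(range(len(codebook)), key=distances.__getitem__)
        indices.append(best)
        quantized.append(list(codebook[best]))
    return indices, quantized

# Toy usage: two frames, a codebook with two entries.
codes, vectors = quantize([[0.1, 0.0], [0.9, 1.1]], [[0.0, 0.0], [1.0, 1.0]])
```

Because the index sequence can only carry as much information as the codebook allows, a small codebook tends to keep coarse content and drop fine detail, which is one intuition for why this bottleneck is used when separating content from speaker identity.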

Research Log Colab & Code

Multimodal AI Emotion AI Aug 2025

Affective Computing: Fusion of Facial Expressions and Speech Prosody

Human emotion is inherently multimodal. This research note investigates early-fusion vs. late-fusion strategies for combining CNN-based facial expression features with RNN-processed speech acoustics (pitch, energy, mel-spectrograms) to build a more robust, context-aware emotion classifier.
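The difference between the two strategies can be sketched in a few lines. In early fusion, the feature vectors are concatenated and a single classifier sees both modalities; in late fusion, each modality gets its own classifier and only the predicted probabilities are combined. The linear heads, the weighted averaging rule, and all names below are assumptions for illustration, not the note's actual architecture.

```python
import math

def softmax(logits):
    """Convert logits to probabilities (max-subtracted for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def linear(x, weight, bias):
    """Dense layer: one weight row and one bias per output class."""
    return [sum(w * xi for w, xi in zip(row, x)) + b
            for row, b in zip(weight, bias)]

def early_fusion(face_feat, speech_feat, weight, bias):
    """Concatenate features first, then classify once."""
    joint = list(face_feat) + list(speech_feat)
    return softmax(linear(joint, weight, bias))

def late_fusion(face_feat, speech_feat, face_head, speech_head, face_weight=0.5):
    """Classify each modality separately, then average the probabilities."""
    p_face = softmax(linear(face_feat, *face_head))
    p_speech = softmax(linear(speech_feat, *speech_head))
    return [face_weight * pf + (1.0 - face_weight) * ps
            for pf, ps in zip(p_face, p_speech)]
```

Early fusion lets the classifier learn interactions between facial and acoustic cues, while late fusion keeps the modalities independent, which makes it easier to tolerate a missing or noisy stream.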

Architecture PDF Source Code
