Sitemap
A list of all the posts and pages found on the site. For you robots out there, an XML version is also available for digesting.
Pages
Posts
Future Blog Post
Blog Post number 4
Blog Post number 3
Blog Post number 2
Blog Post number 1
portfolio
publications
Quo Vadis, Video Understanding with Vision-Language Foundation Models?
Mahmoud Ali, Di Yang, Arkaprava Sinha, Dominick Reilly, Srijan Das, Gianpiero Francesca, Francois Bremond
NeurIPSW, 2024
[Paper]
This study benchmarks Vision-Language Models (VLMs & VLLMs) on five ADL video tasks across 11 datasets, revealing their struggles with fine-grained action understanding. Despite their web-scale success, these models fall short on real-world, densely labeled, and long-video challenges.
SKI Models: Skeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living
Arkaprava Sinha, Dominick Reilly, Francois Bremond, Pu Wang, Srijan Das
AAAI, 2025
SKI models integrate 3D skeletons into vision-language models using SkeletonCLIP, improving generalization to unseen actions in ADL videos. They do not require skeleton data during inference, which improves robustness, and they are effective in zero-shot action recognition and video captioning.
LLAVIDAL: A Large Language Vision Model for Daily Activities of Living
Dominick Reilly, Rajatsubhra Chakraborty, Arkaprava Sinha, Manish Kumar Govind, Pu Wang, Francois Bremond, Le Xue, Srijan Das
CVPR, 2025
DiffSwap++: 3D Latent-Controlled Diffusion for Identity-Preserving Face Swapping
Weston Bondurant, Arkaprava Sinha, Hieu Le, Srijan Das, Stephanie Schuckers
preprint, 2025
[Paper]
3D Guided Diffusion for Face Swapping
MS-Temba: Multi-Scale Temporal Mamba for Efficient Temporal Action Detection
Arkaprava Sinha, Monish Soundar Raj, Pu Wang, Ahmed Helmy, Hieu Le, Srijan Das
CVPR, 2026
MS-Temba adapts Mamba-based state-space modeling to Temporal Action Detection by introducing dilated multi-scale SSMs that capture both fine-grained and long-range temporal dynamics in long, untrimmed videos. Through scale-aware supervision and a dedicated multi-scale fusion module, it delivers precise localization of densely overlapping actions and generalizes effectively to long-form Video Summarization.
talks
Talk 1 on Relevant Topic in Your Field
Tutorial 1 on Relevant Topic in Your Field
Talk 2 on Relevant Topic in Your Field
teaching
Teaching experience 1
Undergraduate course, University 1, Department, 2014
Teaching experience 2
Workshop, University 1, Department, 2015
