LLAVIDAL: A Large Language Vision Model for Daily Activities of Living
Dominick Reilly, Rajatsubhra Chakraborty, Arkaprava Sinha, Manish Kumar Govind, Pu Wang, Francois Bremond, Le Xue, Srijan Das
CVPR, 2025
Arkaprava Sinha, Dominick Reilly, Francois Bremond, Pu Wang, Srijan Das
AAAI, 2025
SKI models integrate 3D skeletons into vision-language models through SkeletonCLIP, improving generalization to unseen actions in ADL videos. Because skeleton data is needed only during training, the models remain robust at inference time without it, and they prove effective for zero-shot action recognition and video captioning.
Arkaprava Sinha, Monish Soundar Raj, Pu Wang, Ahmed Helmy, Srijan Das
Preprint, 2025
Multi-scale Temporal Mamba adapts the Mamba state-space model to action detection in long untrimmed videos, introducing Temporal Mamba (Temba) Blocks with dilated temporal modeling and a Temporal Mamba Fuser for multi-scale feature aggregation. It outperforms state-of-the-art methods on long videos while being significantly more efficient.
Mahmoud Ali, Di Yang, Arkaprava Sinha, Dominick Reilly, Srijan Das, Gianpiero Francesca, Francois Bremond
NeurIPSW, 2024
[Paper]
This study benchmarks vision-language models (VLMs and VLLMs) on five ADL video tasks across 11 datasets, revealing that they struggle with fine-grained action understanding. Despite their success on web-scale data, these models fall short on real-world, densely labeled, long-video challenges.