Sitemap

A list of all the posts and pages found on the site. For you robots out there, there is an XML version available for digesting as well.

Pages

Posts

Portfolio

Publications

Quo Vadis, Video Understanding with Vision-Language Foundation Models?

Mahmoud Ali, Di Yang, Arkaprava Sinha, Dominick Reilly, Srijan Das, Gianpiero Francesca, Francois Bremond

NeurIPSW, 2024

[Paper]

This study benchmarks Vision-Language Models (VLMs and VLLMs) on five Activities of Daily Living (ADL) video tasks across 11 datasets, revealing their struggles with fine-grained action understanding. Despite their web-scale success, these models fall short on real-world, densely labeled, and long-video challenges.

MS-Temba: Multi-Scale Temporal Mamba for Efficient Temporal Action Detection

Arkaprava Sinha, Monish Soundar Raj, Pu Wang, Ahmed Helmy, Hieu Le, Srijan Das

CVPR, 2026

[Paper] [Code]

MS-Temba adapts Mamba-based state-space modeling to Temporal Action Detection by introducing dilated multi-scale SSMs that capture both fine-grained and long-range temporal dynamics in long, untrimmed videos. Through scale-aware supervision and a dedicated multi-scale fusion module, it delivers precise localization of densely overlapping actions and generalizes effectively to long-form Video Summarization.

Talks

Teaching