Home

Hello! I am Arkaprava Sinha, a Graduate Research Assistant pursuing a Ph.D. in Computer Science at the University of North Carolina at Charlotte, where I am advised by Prof. Srijan Das. My research lies at the intersection of Vision-Language Models, Video Understanding, Temporal Modeling, and Multimodal Learning, with a focus on building scalable and reliable algorithms for Long Video Understanding.

Prior to my Ph.D., I worked as a Data Scientist, contributing to projects in Computer Vision, Natural Language Processing, and large-scale Machine Learning systems across industry and research settings.

Research

My research is centered on Temporal Representation Learning for Long Video Understanding, with applications to Temporal Action Detection and Video Summarization. I develop architectures that capture long-range temporal dependencies and address challenges such as action co-occurrence, temporal boundaries, and efficient representation learning in extended video sequences.

In parallel, I work on Vision-Language Models (VLMs) for video understanding, exploring how multimodal inputs, such as RGB video, text, and skeleton-based motion data, can be aligned to improve contextual interpretation of human activities. I am also interested in learning generalized video representations, aiming to design scalable frameworks that enhance both Action Recognition and Multimodal Video Analysis, with potential applications in Robotics and Human-Centered AI systems.

News

Feb 2025 - LLAVIDAL accepted to CVPR 2025.
Dec 2024 - SKI Models accepted to AAAI 2025.
Oct 2024 - Two papers accepted to NeurIPS 2024 workshops. An early version of LLAVIDAL was presented at the NeurIPS 2024 Workshop on Video-Language Models and Multimodal Algorithmic Reasoning.

Selected Publications

  • LLAVIDAL: A Large Language Vision Model for Daily Activities of Living
    The IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), 2025
    Dominick Reilly, Rajatsubhra Chakraborty, Arkaprava Sinha, Manish Kumar Govind, Pu Wang, Francois Bremond, Le Xue, Srijan Das
    Paper | Code

  • SKI Models: Skeleton Induced Vision-Language Embeddings for Understanding Activities of Daily Living
    The 39th Annual AAAI Conference on Artificial Intelligence (AAAI), 2025
    Arkaprava Sinha, Dominick Reilly, Francois Bremond, Pu Wang, Srijan Das
    Paper | Code

  • MS-Temba: Multi-Scale Temporal Mamba for Efficient Temporal Action Detection
    Preprint
    Arkaprava Sinha, Monish Soundar Raj, Pu Wang, Ahmed Helmy, Srijan Das
    Paper | Code

  • Quo Vadis, Video Understanding with Vision-Language Foundation Models?
    NeurIPS Workshop on Video-Language Models, 2024
    Mahmoud Ali, Di Yang, Arkaprava Sinha, Dominick Reilly, Srijan Das, Gianpiero Francesca, Francois Bremond
    Paper