Event start date and time:
Event end date and time:
Venue:
Dipartimento di Matematica, Università degli Studi di Roma Tor Vergata
External room:
Aula D'Antoni
Speaker and affiliation:
Benedetta Liberatori
Abstract:
The development of large multimodal deep learning models capable of learning joint representations of visual and textual information has led to unprecedented results in many image comprehension tasks. Despite their relative simplicity, these vision-and-language models (VLMs) have demonstrated remarkable generalization abilities. Although models like CLIP were initially developed to operate on images, VLMs have recently been successfully extended to the video domain as well. However, understanding videos is more complex than understanding images: videos generally contain more visual information and add a temporal dimension, which makes automated video understanding particularly challenging. Nevertheless, the multimodal approach has enabled significant progress on complex problems such as video classification, anomaly detection, and temporal action localization. Research in this area aims for an increasingly deep and detailed understanding of videos, which, being the form of data closest to our way of perceiving the world in both space and time, play a central role in numerous applications.
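The abstract stays at a high level; as a purely illustrative sketch of how an image-level model such as CLIP can be carried over to zero-shot video classification, the snippet below embeds sampled frames, mean-pools them over time, and matches the pooled video embedding against text prompts. The checkpoint name, prompt template, and mean pooling are assumptions made here for illustration (using the Hugging Face transformers CLIP interface), not the method presented in the talk.

import torch
from transformers import CLIPModel, CLIPProcessor

# Assumed checkpoint, for illustration only; any frame-level CLIP model would do.
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def classify_video(frames, class_names):
    # frames: list of PIL.Image frames sampled from the video
    # class_names: candidate labels as plain strings
    prompts = [f"a video of {c}" for c in class_names]
    inputs = processor(text=prompts, images=frames, return_tensors="pt", padding=True)
    with torch.no_grad():
        frame_emb = model.get_image_features(pixel_values=inputs["pixel_values"])
        text_emb = model.get_text_features(input_ids=inputs["input_ids"],
                                           attention_mask=inputs["attention_mask"])
    # Temporal pooling: average per-frame embeddings into one video embedding.
    video_emb = frame_emb.mean(dim=0, keepdim=True)
    video_emb = video_emb / video_emb.norm(dim=-1, keepdim=True)
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    scores = (video_emb @ text_emb.T).squeeze(0)  # cosine similarity per class
    return class_names[int(scores.argmax())]

Note that mean pooling discards temporal order entirely, which is exactly the extra layer of difficulty the abstract highlights for video understanding.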
Contacts/Organizers:
doctorv.uniroma2@gmail.com