People understand the world by breaking it down into parts. Events are perceived as a series of actions, objects are composed of multiple parts, and this sentence can be decomposed into a sequence of words. Although our knowledge representation is naturally compositional, most approaches to computer vision tasks generate representations that are not.
We also perceive the world through a variety of sensing modalities. Vision is an essential modality, but it can be noisy and requires a direct line of sight to the objects being perceived. Other sensors (e.g. audio, smell) can combat these shortcomings and may reveal information about a scene that would otherwise be imperceptible. Prior workshops on multimodal learning have concentrated primarily on audio, video, and text, but we found that this set of modalities may not be inclusive enough. Both compositionality and multimodal sensing can add structure to the task of activity/scene recognition, yet both appear to be underexplored. To encourage further exploration in these areas, we believe a challenge incorporating each of these aspects is appropriate.
We announce the 2nd annual installment of the "Compositionality and Multimodal Perception" Challenge (CAMP).
In this workshop, we host competitions and paper submissions related to "Compositionality" and "Multimodal Perception".
Home Action Genome is a large-scale multi-view video database of indoor daily activities.
Every activity is captured by synchronized multi-view cameras, including an egocentric view.
The dataset provides multimodal data together with labels for activities, atomic actions, object bounding boxes, and human-object relationships.
Call for Papers
Important Dates for Competition
Release test set with ground truth withheld: August 18th, 2021
Open evaluation server: September 1st, 2021
Leaderboard made public: September 16th, 2021
Close evaluation server: September 30th, 2021
Deadline for submitting the report: October 4th, 2021
Important Dates for Paper Submission
Workshop paper submission deadline: July 29th, 2021
Notification to authors: August 10th, 2021
Camera-ready deadline: August 17th, 2021
Paper Submission Website
CMT submission website: https://cmt3.research.microsoft.com/CVPR2021
This workshop aims to bring together researchers from both academia and industry interested in addressing various aspects of multimodal and compositional understanding in computer vision.
The domains include, but are not limited to, scene understanding, video analysis, 3D vision, and robotics. For each of these domains, we will discuss the following topics:
How should we develop and improve representations of compositionality for learning, such as graph embeddings, message-passing neural networks, probabilistic models, etc.?
What are convincing metrics for measuring the robustness, generalizability, and accuracy of compositional understanding algorithms?
How can cognitive science research inspire computational models to capture compositionality as humans do?
Optimization and scalability challenges
How should we handle the inherent representations of different components and the curse of dimensionality in graph-based data?
How should we effectively collect large-scale databases for training multi-task models?
How should we improve scene graph generation, spatio-temporal graph-based action recognition, structural 3D recognition and reconstruction, meta-learning, reinforcement learning, etc.?
We also welcome any other topic of interest for compositionality in computer vision.
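To give a concrete flavor of one technique named above, the sketch below shows a single round of message passing on a small graph, where each node aggregates transformed features from its neighbors. This is a minimal, illustrative NumPy sketch, not any specific published model; the function name, weight matrices, and the toy graph are our own assumptions for exposition.

```python
import numpy as np

def message_passing_step(node_feats, edges, W_msg, W_upd):
    """One round of message passing (illustrative sketch, not a specific model).

    node_feats: (N, D) array of per-node features
    edges: list of directed (src, dst) index pairs
    W_msg, W_upd: (D, D) weight matrices (fixed here; learned in practice)
    """
    N, D = node_feats.shape
    messages = np.zeros((N, D))
    for src, dst in edges:
        # Each neighbor sends a linearly transformed copy of its features.
        messages[dst] += node_feats[src] @ W_msg
    # Update each node by combining its own state with the aggregated messages.
    return np.tanh(node_feats @ W_upd + messages)

# Toy graph: 3 nodes connected in a chain 0 -> 1 -> 2.
rng = np.random.default_rng(0)
feats = rng.normal(size=(3, 4))
W_msg = 0.1 * rng.normal(size=(4, 4))
W_upd = 0.1 * rng.normal(size=(4, 4))
out = message_passing_step(feats, [(0, 1), (1, 2)], W_msg, W_upd)
print(out.shape)
```

Stacking several such rounds lets information propagate along longer paths in the graph, which is what makes graph-structured representations attractive for compositional scene and activity understanding.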