International Challenge on Compositional and Multimodal Perception

International Conference on Computer Vision 2021 Workshop

About CAMP

People understand the world by breaking it down into parts. Events are perceived as a series of actions, objects are composed of multiple parts, and this sentence can be decomposed into a sequence of words. Although our knowledge representation is naturally compositional, most approaches to computer vision tasks generate representations that are not compositional.

People also perceive the world through a variety of sensing modalities. Vision is an essential modality, but it can be noisy and requires a direct line of sight to perceive objects. Other sensors (e.g., audio, smell) can compensate for these shortcomings and may allow us to detect otherwise imperceptible information about a scene. Prior workshops on multimodal learning have focused primarily on audio, video, and text as sensor modalities, but we found that these modalities may not be inclusive enough. Both compositionality and multimodality can add structure to the task of activity and scene recognition, yet both appear to be underexplored. To encourage further exploration in these areas, we believe a challenge addressing each of these aspects is appropriate.

We announce the 2nd annual installment of the "Compositionality and Multimodal Perception" Challenge (CAMP).
In this workshop, we host competitions and accept paper submissions related to "Compositionality" and "Multimodal Perception".


Home Action Genome is a large-scale multi-view video database of indoor daily activities.
Every activity is captured by synchronized multi-view cameras, including an egocentric view.

Home Action Genome

Multimodal data; labels of activities, atomic actions, object bounding boxes, and human-object relationships

Call for Papers

Important Dates for Competition

Release test set with ground truth withheld: August 18th, 2021

Open evaluation server: September 1st, 2021

The leaderboard will be made public: September 16th, 2021

Close evaluation server: September 30th, 2021

Deadline for submitting the report: October 4th, 2021

Important Dates for Paper Submission

Workshop paper submission deadline: July 29th, 2021

Notification to authors: August 10th, 2021

Camera-ready deadline: August 17th, 2021

Paper Submission Website

CMT submissions Website:

Topics covered

This workshop aims to bring together researchers from both academia and industry interested in addressing various aspects of multimodal and compositional understanding in computer vision. The domains include but are not limited to scene understanding, video analysis, 3D vision and robotics. For each of these domains, we will discuss the following topics:

Algorithmic approaches
How should we develop and improve representations of compositionality for learning, such as graph embedding, message-passing neural networks, probabilistic models, etc.?
Evaluation methods
What are convincing metrics for measuring the robustness, generalizability, and accuracy of compositional understanding algorithms?
Cognitive aspects
How can cognitive science research inspire computational models that capture compositionality as humans do?
Optimization and scalability challenges
How should we handle the inherent representations of different components and the curse of dimensionality of graph-based data?
How should we effectively collect large-scale databases for training multi-tasking models?
Domain-specific applications
How should we improve scene graph generation, spatio-temporal-graph-based action recognition, structural 3D recognition and reconstruction, meta-learning, reinforcement learning, etc.?
Any other topic of interest for compositionality in computer vision.

Invited Speaker

TBD. Coming soon.


Kazuki Kozuka


Jingwei Ji

Stanford University

Ranjay Krishna

Stanford University

Shun Ishizaka


Ehsan Adeli

Stanford University

Juan Carlos Niebles

Stanford University

Fei-Fei Li

Stanford University

Program Committee

Haofeng Chen

Stanford University

Nishant Rai

Stanford University