# Model Card: CLIPort

Following OpenAI's CLIP (Radford et al.), Model Cards for Model Reporting (Mitchell et al.), and Lessons from Archives (Jo & Gebru), we provide additional information on CLIPort.

## Model Details

### Overview

- Developed by Shridhar et al. at the University of Washington and NVIDIA. CLIPort is an end-to-end imitation-learning agent that learns a single language-conditioned policy for various tabletop tasks. The framework combines the broad semantic understanding (what) of CLIP with the spatial precision (where) of TransporterNets to learn generalizable skills from limited training demonstrations; a minimal sketch of this two-stream idea follows this list. See: cliport.github.io.
- Fully convolutional networks trained with end-to-end supervised learning.
- Trained for pick-and-place tabletop manipulation tasks where objects appear on a planar surface.
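The sketch below is purely illustrative of the two-stream design described above; it is not the repository's implementation, and the module names, channel sizes, and language-gating scheme are assumptions made for brevity. It only shows how a language embedding can condition a fully convolutional affordance predictor whose per-pixel argmax serves as a pick or place location.

```python
# Illustrative two-stream affordance net (NOT the official CLIPort code):
# a spatial ("where") stream over the observation and a semantic ("what")
# stream gated by a language embedding, fused into a pixel-wise heatmap.
import torch
import torch.nn as nn
import torch.nn.functional as F


class TwoStreamAffordanceNet(nn.Module):
    """Fully convolutional net producing a per-pixel affordance distribution."""

    def __init__(self, in_channels: int = 6, lang_dim: int = 512, hidden: int = 32):
        super().__init__()
        # Spatial stream: plain convolutions over the stacked RGB-D input.
        self.spatial = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        # Semantic stream: convolutions whose activations are modulated by a
        # sentence embedding (e.g. a CLIP text encoding of the instruction).
        self.semantic = nn.Sequential(
            nn.Conv2d(in_channels, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv2d(hidden, hidden, 3, padding=1), nn.ReLU(),
        )
        self.lang_proj = nn.Linear(lang_dim, hidden)
        # 1x1 convolution fuses both streams into a single-channel heatmap.
        self.head = nn.Conv2d(2 * hidden, 1, 1)

    def forward(self, obs: torch.Tensor, lang: torch.Tensor) -> torch.Tensor:
        spatial = self.spatial(obs)                    # (B, hidden, H, W)
        semantic = self.semantic(obs)                  # (B, hidden, H, W)
        gate = self.lang_proj(lang)[:, :, None, None]  # broadcast over pixels
        semantic = semantic * torch.sigmoid(gate)      # language-conditioned gating
        logits = self.head(torch.cat([spatial, semantic], dim=1))
        # Softmax over all pixels: the argmax is the predicted action location.
        b, _, h, w = logits.shape
        return F.softmax(logits.view(b, -1), dim=-1).view(b, h, w)


if __name__ == "__main__":
    net = TwoStreamAffordanceNet()
    obs = torch.randn(1, 6, 64, 64)   # stacked color + depth-derived channels
    lang = torch.randn(1, 512)        # placeholder text embedding
    print(net(obs, lang).shape)       # torch.Size([1, 64, 64])
```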

### Model Date

October 2021

### Documents

## Model Use

- Primary intended use case: CLIPort is intended for robotic manipulation research. We hope the benchmark and pre-trained models will enable researchers to study the generalization capabilities of end-to-end manipulation frameworks. Specifically, we hope the setup serves as a reproducible framework for evaluating the robustness and scaling capabilities of manipulation agents.
- Primary intended users: Robotics researchers.
- Out-of-scope use cases: Deployment in real-world autonomous systems without human supervision is currently out of scope. Use cases that involve manipulating novel objects without a human in the loop are also not recommended for safety-critical systems. The agent is intended to be trained and evaluated only with English-language instructions.

## Data

- Pre-training Data for CLIP: See OpenAI's Model Card for full details.
- Manipulation Data for CLIPort: The agent was trained on image-caption-action pairs from expert demonstrations (see the illustrative record sketched after this list). In simulation we use oracle agents, and in the real world we use human demonstrations. Since the agent is used in few-shot settings with very limited data, it may exploit intended and unintended biases in the training demonstrations. Currently, these biases are limited to objects that appear on tabletops.
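As a rough illustration of what one such image-caption-action pair might contain, the sketch below defines a synthetic demonstration record; the field names, shapes, and pixel/angle action encoding are assumptions for exposition, not the repository's actual dataset format.

```python
# Hypothetical demonstration record for an image-caption-action pair.
from dataclasses import dataclass

import numpy as np


@dataclass
class Demonstration:
    rgb: np.ndarray        # top-down color image, e.g. (H, W, 3) uint8
    depth: np.ndarray      # aligned depth map, e.g. (H, W) float32, in meters
    instruction: str       # English caption describing the task
    pick_pose: tuple       # SE(2) pick action, e.g. (u, v, theta) in pixels/radians
    place_pose: tuple      # SE(2) place action, e.g. (u, v, theta)


def make_dummy_demo() -> Demonstration:
    """Construct a synthetic example just to show the expected shapes."""
    return Demonstration(
        rgb=np.zeros((320, 160, 3), dtype=np.uint8),
        depth=np.zeros((320, 160), dtype=np.float32),
        instruction="put the blue block in the green bowl",
        pick_pose=(120, 80, 0.0),
        place_pose=(200, 90, np.pi / 2),
    )
```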

## Limitations

- Limited to an SE(2) action space.
- Exploits biases in training demonstrations.
- Needs good hand-eye calibration.
- Struggles with novel objects that are completely outside the training distribution of objects.
- Struggles with grounding complex spatial relationships.
- Does not predict task completion.
- Prone to biases in CLIP's training data.

See Appendix I in the paper for an extended discussion.