Vision-Language Foundation Models as Effective Robot Imitators

1ByteDance Research, 2Tsinghua University, 3Shanghai Jiao Tong University, 4National University of Singapore
Equal Contribution
Work done during the first authors' internship at ByteDance


Recent progress in vision-language foundation models has shown their ability to understand multimodal data and solve complicated vision-language tasks, including robotic manipulation. We seek a straightforward way to use existing vision-language models (VLMs) with simple fine-tuning on robotics data. To this end, we derive a simple and novel vision-language manipulation framework, dubbed RoboFlamingo, built upon the open-source VLM OpenFlamingo. Unlike prior works, RoboFlamingo uses pre-trained VLMs for single-step vision-language comprehension, models sequential history information with an explicit policy head, and is lightly fine-tuned by imitation learning only on language-conditioned manipulation datasets. Such a decomposition gives RoboFlamingo the flexibility for open-loop control and deployment on low-performance platforms. By exceeding the state-of-the-art performance by a large margin on the tested benchmark, we show that RoboFlamingo can be an effective and competitive alternative for adapting VLMs to robot control. Our extensive experimental results also reveal several interesting conclusions regarding the behavior of different pre-trained VLMs on manipulation tasks. RoboFlamingo can be trained or evaluated on a single GPU server, and we believe it has the potential to be a cost-effective and easy-to-use solution for robotic manipulation, empowering everyone to fine-tune their own robotics policy.

Comparison among RoboFlamingo and existing vision-language manipulation solutions.


The proposed RoboFlamingo framework. The output of the feature fusion decoder is trained as the representation of the vision observation and language instruction, which is then translated into low-level control signals. To achieve this, we simply add a policy head that predicts the action, e.g., the 7 DoF end-effector pose and gripper status. We test various strategies for modeling the historical observation sequence and serving as the policy head, e.g., a long short-term memory (LSTM) network with an MLP for the final prediction; a decoder-only transformer with a predictive MLP; or a single MLP that only models single-step information.
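The decomposition described above (a per-step encoder followed by a history-aware policy head) can be illustrated with a minimal data-flow sketch. This is not the RoboFlamingo implementation: `HistoryPolicy`, `toy_encode_step`, and `mean_pool_head` are hypothetical names, the encoder is a toy stand-in for the VLM, mean pooling is a stand-in for the LSTM or transformer head, and the action layout (six pose values plus one gripper bit) is an assumption made for illustration.

```python
from collections import deque
from dataclasses import dataclass
from typing import Callable, Deque, List, Sequence

Feature = List[float]


@dataclass
class Action:
    pose: List[float]    # end-effector pose command (assumed: 6 values)
    gripper_open: bool   # binary gripper status


class HistoryPolicy:
    """Encodes each step independently, then lets the head see a window of steps."""

    def __init__(
        self,
        encode_step: Callable[[Sequence[float], str], Feature],
        head: Callable[[List[Feature]], Action],
        window: int,
    ) -> None:
        self.encode_step = encode_step
        self.head = head
        # Only the policy head carries history; the encoder is stateless.
        self.history: Deque[Feature] = deque(maxlen=window)

    def reset(self) -> None:
        self.history.clear()

    def act(self, image: Sequence[float], instruction: str) -> Action:
        self.history.append(self.encode_step(image, instruction))
        return self.head(list(self.history))


def toy_encode_step(image: Sequence[float], instruction: str) -> Feature:
    # Toy stand-in for the VLM: scales the "image" by the instruction word count.
    scale = len(instruction.split())
    return [scale * value for value in image]


def mean_pool_head(history: List[Feature]) -> Action:
    # Toy stand-in for an LSTM/transformer + MLP head: average over time, then split.
    steps = len(history)
    pooled = [sum(f[i] for f in history) / steps for i in range(len(history[0]))]
    return Action(pose=pooled[:6], gripper_open=pooled[6] > 0.0)


policy = HistoryPolicy(toy_encode_step, mean_pool_head, window=2)
first = policy.act([1.0] * 7, "open the drawer")
second = policy.act([-1.0] * 7, "open the drawer")
```

Swapping `mean_pool_head` for a head that looks only at `history[-1]` gives the single-step MLP variant mentioned above, which is why the history buffer lives in the policy wrapper rather than in the encoder.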

The CALVIN Benchmark

CALVIN (Composing Actions from Language and Vision) is an open-source simulated benchmark for learning long-horizon language-conditioned tasks. CALVIN tasks are more complex in terms of sequence length, action space, and language than existing vision-and-language task datasets, and the benchmark supports flexible specification of sensor suites. See their website for more details.


We investigate three main questions during experiments:

  • (1) Effectiveness. We ask how well RoboFlamingo performs at imitation learning when trained on the given demonstration data.
  • (2) Zero-shot Generalization. We focus on generalization to unseen tasks. In other words, we ask how the model behaves given unseen vision contexts, such as different objects, and even unseen instructions.
  • (3) Ablation Studies. We further explore the essential factors in adapting VLMs to a robot control policy within the RoboFlamingo framework.

(1) Imitation Performance

Table 1: Imitation performance in various settings. All results are reported using the best-performing model checkpoints. Full and Lang denote whether the model is trained with (Full) or without (Lang) unpaired vision data (i.e., vision data without language pairs); Freeze-emb refers to freezing the embedding layer of the fusion decoder; Enriched denotes using GPT-4 enriched instructions. The gray rows denote numerical results evaluated with our re-trained models. We re-implement RT-1 and use the original HULC code provided by Mees et al. All other results are reported by Mees et al.

(2) Zero-shot Generalization

We evaluate two aspects of generalization for RoboFlamingo: vision and language. For vision generalization, we train models on splits A, B, and C and test on split D, which presents a different vision context. Our method significantly outperforms the baselines in this vision generalization scenario (ABC → D), as shown in Table 1. For language generalization, we enrich the language setting by generating 50 synonymous instructions for each task using GPT-4, and then randomly sample instructions during evaluation. Our method outperforms all baselines in this language generalization setting. Note, however, that the success rate of our method on subsequent tasks drops more noticeably than that of HULC. This may be because our approach feeds word tokens directly as input during training, which can produce larger variations across synonymous sentences than HULC, which uses a frozen sentence model to embed instructions. To address this, we freeze the embedding layer of the feature fusion decoder in our method, which improves generalization and reduces the performance drop.
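The enriched-language evaluation described above amounts to drawing one paraphrase at random for each task in a rollout. A minimal sketch follows, assuming a hypothetical pool `ENRICHED` and hypothetical helper names; the three paraphrases are made up for illustration, whereas the setting above uses 50 GPT-4-generated instructions per task.

```python
import random
from typing import Dict, List, Sequence

# Hypothetical enriched pool: task name -> synonymous instructions.
# The entries below are illustrative, not taken from the benchmark.
ENRICHED: Dict[str, List[str]] = {
    "open_drawer": [
        "open the drawer",
        "pull the drawer open",
        "slide the drawer out",
    ],
}


def sample_instruction(task: str, pool: Dict[str, List[str]], rng: random.Random) -> str:
    """Pick one paraphrase of `task` uniformly at random."""
    return rng.choice(pool[task])


def instructions_for_chain(
    tasks: Sequence[str], pool: Dict[str, List[str]], seed: int
) -> List[str]:
    """Sample one instruction per task; a fixed seed keeps evaluation repeatable."""
    rng = random.Random(seed)
    return [sample_instruction(task, pool, rng) for task in tasks]


sampled = instructions_for_chain(["open_drawer"] * 4, ENRICHED, seed=0)
```

Seeding a local `random.Random` rather than the global generator is a design choice that lets different methods be compared on the same sampled instructions.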

(3) Ablation Studies

We conduct ablation studies for RoboFlamingo to answer the following questions:

  • 1. How does RoboFlamingo perform with different heads?
  • 2. Does vision-language pre-training improve downstream robotic tasks?
  • 3. How do critical factors in vision-language pre-training affect robotic tasks?

Table 2: Variants of VLMs tested. Pre-train denotes the original performance of the VLM on the pre-training VL dataset, Best Avg. Len. denotes the best average success length achieved by the VLM within 5 epochs, and Mean Avg. Len. denotes the mean average success length of the VLM over the last 3 epochs on CALVIN.
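For readers unfamiliar with the average success length metric, the sketch below shows one way it and the two per-epoch summaries in the caption could be computed. It assumes the usual CALVIN long-horizon protocol, in which each rollout is a chain of five consecutive instructions and counting stops at the first failure; the function names and the example numbers are hypothetical, not results from the paper.

```python
from typing import List, Sequence, Tuple


def completed_in_a_row(outcomes: Sequence[bool]) -> int:
    """Number of subtasks solved before the first failure in one chain."""
    count = 0
    for succeeded in outcomes:
        if not succeeded:
            break
        count += 1
    return count


def average_length(chains: Sequence[Sequence[bool]]) -> float:
    """Average number of consecutively completed subtasks over all chains."""
    lengths = [completed_in_a_row(chain) for chain in chains]
    return sum(lengths) / len(lengths)


def success_rates(chains: Sequence[Sequence[bool]], chain_len: int = 5) -> List[float]:
    """Fraction of chains with at least k subtasks solved in a row, for k = 1..chain_len."""
    lengths = [completed_in_a_row(chain) for chain in chains]
    return [sum(n >= k for n in lengths) / len(lengths) for k in range(1, chain_len + 1)]


def summarize_epochs(avg_len_per_epoch: Sequence[float], last_n: int = 3) -> Tuple[float, float]:
    """Return (best average length, mean average length over the last `last_n` epochs)."""
    tail = list(avg_len_per_epoch)[-last_n:]
    return max(avg_len_per_epoch), sum(tail) / len(tail)


# Made-up rollouts: True means the subtask at that position succeeded.
chains = [
    [True, True, True, True, True],
    [True, True, False, True, True],   # later successes do not count
    [False, True, True, True, True],
    [True, True, True, False, False],
]
avg_len = average_length(chains)
rates = success_rates(chains)
best, mean_last = summarize_epochs([2.0, 3.0, 4.0, 3.5, 3.0])
```

With these made-up rollouts the chain lengths are 5, 2, 0, and 3, so the average length is 2.5 out of a maximum of 5.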

(a) Various policy heads

(b) Different training paradigms

(c) Open loop control

We also observe that 1) policy heads with history encoding perform the best, and the GPT and LSTM heads behave similarly under the RoboFlamingo framework; 2) tuning the VL model itself on robotic tasks is indispensable because of the limited capacity of the policy head, and vision-language pre-training improves downstream robotic manipulation by a large margin; and 3) a larger model, which usually yields better VL performance, achieves much higher performance, indicating that a larger VLM can be more data-efficient.

Conclusion and Future Work

This paper explores the potential of pre-trained vision-language models in advancing language-conditioned robotic manipulation. Our proposed RoboFlamingo, based on the pre-trained OpenFlamingo model, achieves state-of-the-art performance on a benchmark dataset. Moreover, our experimental findings highlight the benefits of pre-trained models in terms of data efficiency and zero-shot generalization ability. This research contributes to the ongoing efforts to develop intelligent robotic systems that can seamlessly understand and respond to human language instructions, paving the way for more intuitive and efficient human-robot collaboration. Because real-robot data is lacking, this paper does not include deployment on real-world robots. To our delight, recent progress on large-scale real robotics data (Open X-Embodiment) has shown the potential of fine-tuning large VLMs for real robots, and the most interesting future work is to see how RoboFlamingo behaves in real-world tasks when combined with data at that scale.


  title     = {Vision-Language Foundation Models as Effective Robot Imitators},
  author    = {Li, Xinghang and Liu, Minghuan and Zhang, Hanbo and Yu, Cunjun and Xu, Jie and Wu, Hongtao and Cheang, Chilam and Jing, Ya and Zhang, Weinan and Liu, Huaping and Li, Hang and Kong, Tao},
  journal={arXiv preprint arXiv:2311.01378},