Objective
The goal of this project is to train, deploy, and evaluate embodied multimodal (large) models that perform dual-arm manipulation tasks relevant to healthcare assistance. The student will work towards a demonstration task. In the work leading up to the demonstration, the student will train several large models and evaluate and benchmark their performance across different sensor modalities.
Demonstration task: A representative task involves preparing trays or work surfaces by arranging items in specific positions, such as for a scheduled examination, monitoring, or care routine. The items to be placed on the tray can come from fixed locations. A human demonstrates this task in this video around the 1:52 mark. In hospitals, such food trays are assembled according to a patient's order; the robot would take items from boxes at fixed locations and place them on the tray for the patient.
Key Techniques
This project targets the integration and application of multimodal large models in robotic manipulation tasks, with relevance to healthcare-assistance scenarios. Multimodal large models, such as vision-language-action (VLA) models, vision-language models (VLMs), and vision-action (VA) models, enable robots to interpret instructions, perceive the world through multiple sensory modalities, and map those inputs into actionable motor behavior.
Figure AI gives a good demonstration of what these models are capable of in their 'Scaling Helix' article and their more recent full-body autonomy demonstration. Their model is not open source, but many similar models are. RT-2 is a VLA model developed by Google DeepMind that co-fine-tunes a VLM on both web-scale vision-language data and robot trajectory data, enabling the model to output robot actions directly from image and instruction inputs. Another example is GR00T N1 (from NVIDIA), which employs a dual-system architecture: a vision-language reasoning module and a dedicated action module for motor control, enabling manipulation across different embodiments and modalities.
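To make the dual-system idea more concrete, the sketch below shows a minimal PyTorch-style policy in which a vision-language backbone produces a latent plan and a lightweight action head decodes it, together with proprioception, into joint commands. All module names, dimensions, and interfaces are illustrative assumptions; this is not the actual RT-2 or GR00T N1 code.

```python
import torch
import torch.nn as nn

class DualSystemPolicy(nn.Module):
    """Illustrative two-module VLA-style policy (not the RT-2 / GR00T N1 implementation).

    System 2: a vision-language backbone that turns camera images and an
              instruction into a latent "plan" embedding (slow, semantic reasoning).
    System 1: a small action head that maps the plan plus the current joint
              state to continuous motor commands (fast control).
    """

    def __init__(self, vlm_backbone: nn.Module, plan_dim=512, state_dim=14, action_dim=14):
        super().__init__()
        self.vlm_backbone = vlm_backbone  # assumed to return a (batch, plan_dim) embedding
        self.action_head = nn.Sequential(  # hypothetical lightweight motor module
            nn.Linear(plan_dim + state_dim, 256),
            nn.ReLU(),
            nn.Linear(256, action_dim),
        )

    def forward(self, images, instruction_tokens, joint_state):
        plan = self.vlm_backbone(images, instruction_tokens)
        return self.action_head(torch.cat([plan, joint_state], dim=-1))
```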
One of the key factors that determine the performance of these models is the choice of sensory inputs. Force sensors on the gripper tips and in the joints, depth cameras, multiple cameras, and sensor data augmentation all contribute to the quality of the data that the model receives, and they directly impact the model's understanding of the world and ultimately its task performance.
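As an illustration of how such modalities could be packaged for a policy, the snippet below assembles one observation as a dictionary of tensors, loosely following the key naming used in LeRobot datasets (`observation.images.*`, `observation.state`); the exact key names, tensor shapes, and the depth and force entries are assumptions to be adapted to the actual sensor setup.

```python
import torch

# One hypothetical multimodal observation for a dual-arm policy.
# Key names loosely follow LeRobot conventions; depth and force entries are assumptions.
observation = {
    "observation.images.top":   torch.zeros(3, 480, 640),   # overhead RGB camera
    "observation.images.wrist": torch.zeros(3, 480, 640),   # wrist-mounted RGB camera
    "observation.depth.top":    torch.zeros(1, 480, 640),   # depth camera, if available
    "observation.state":        torch.zeros(14),            # joint positions of both arms
    "observation.force":        torch.zeros(2),             # gripper-tip force readings
}

# A policy can then consume whichever subset of modalities it was trained on:
policy_input = {k: v.unsqueeze(0) for k, v in observation.items()}  # add batch dimension
```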
To compare these differences in performance, a benchmark needs to be established on which the different sensor modalities can be tested across different models. The LeRobot software stack is a good starting point, as it contains both ready-to-use implementations of the models and an interface to the LeRobot hardware, which makes real-world testing much more accessible.
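Such a benchmark can be as simple as a fixed set of episodes scored by success rate for each (model, sensor configuration) pair. The loop below is a minimal sketch under that assumption; `run_episode`, the model names, and the configuration names are placeholders for the student's own setup.

```python
import itertools
import random
from collections import defaultdict

def run_episode(model_name: str, sensor_config: str, episode_idx: int) -> bool:
    # Placeholder: replace with a real rollout of the policy on the benchmark scene.
    return random.random() < 0.5

MODELS = ["act_baseline", "vla_finetuned"]                    # hypothetical model names
SENSOR_CONFIGS = ["rgb_only", "rgb_depth", "rgb_depth_force"]  # hypothetical configurations
EPISODES_PER_CELL = 20                                         # repeated trials per cell

results = defaultdict(list)
for model, config in itertools.product(MODELS, SENSOR_CONFIGS):
    for i in range(EPISODES_PER_CELL):
        results[(model, config)].append(run_episode(model, config, i))

for (model, config), outcomes in results.items():
    rate = sum(outcomes) / len(outcomes)
    print(f"{model:15s} {config:16s} success rate: {rate:.2f}")
```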
The student is encouraged to test different sensor types, sensor locations, sensor data preprocessing, data augmentation techniques, (pre-)training methods, model architecture changes, numbers of training samples, and hardware changes that may increase performance.
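As one concrete example from this list, image-level data augmentation can be applied to the camera streams during training. The torchvision-based pipeline below is a minimal sketch; the specific transforms and their magnitudes are assumptions to be tuned, and strong geometric changes are deliberately avoided so that the image-action alignment is preserved.

```python
import torch
from torchvision import transforms

# Hypothetical augmentation pipeline for the RGB camera streams used in training.
augment = transforms.Compose([
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),   # lighting robustness
    transforms.RandomApply([transforms.GaussianBlur(kernel_size=5)], p=0.3),  # mild defocus
    transforms.RandomErasing(p=0.2, scale=(0.02, 0.08)),                      # simulate occlusions
])

frame = torch.rand(3, 480, 640)      # one camera frame with values in [0, 1]
augmented_frame = augment(frame)
```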
Project Phases
Phase 1 – Preparation
Goal: Prepare and plan the project. Conduct preliminary literature research, resulting in a plan for the benchmark, the models, the pre-processing steps, and the sensor types that will be evaluated.
Phase 2 – Baseline
Goal: Train and fine-tune the baseline models on the LeRobot platform, using the SO101 dual-arm setup. Get familiar with training strategies and develop an understanding of the baseline. Create a physical pick-and-place / grasping benchmark scene that is representative of a healthcare environment.
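A behaviour-cloning style fine-tuning loop on teleoperated demonstrations could look like the sketch below. The dataset, policy interface, loss, and hyperparameters are placeholders under the assumption of (observation, action) pairs recorded on the SO101 arms; where the standard LeRobot training scripts fit, they should be preferred over custom loops.

```python
import torch
from torch.utils.data import DataLoader

def finetune(policy, demo_dataset, epochs=10, lr=1e-4, batch_size=64, device="cuda"):
    """Minimal behaviour-cloning loop; `policy` is assumed to map an observation dict
    to an action tensor, and `demo_dataset` to yield (observation dict, action) pairs."""
    policy.to(device).train()
    loader = DataLoader(demo_dataset, batch_size=batch_size, shuffle=True)
    optimizer = torch.optim.AdamW(policy.parameters(), lr=lr)
    for epoch in range(epochs):
        total = 0.0
        for obs, action in loader:
            obs = {k: v.to(device) for k, v in obs.items()}
            action = action.to(device)
            loss = torch.nn.functional.mse_loss(policy(obs), action)  # simple BC objective
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            total += loss.item()
        print(f"epoch {epoch}: mean loss {total / len(loader):.4f}")
```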
Phase 3 – Implementing & testing on dual-arm hardware
Goal: Research the different sensor modalities and techniques to improve performance, and develop hypotheses that can be tested on the aforementioned benchmark. This phase leaves room for experimentation and iteration to maximize task performance. Unseq will support and facilitate sensor integration and other hardware changes to the platform.
Phase 4 – Demonstration on healthcare task
Goal: A final showcase demonstration of the best-performing large model and sensor modality.
Expected Outcomes
Company contact
Unseq B.V.
Jan-Pieter Rijstenbil (CTO)
jp@unseq.com
Bram Grooten
Joaquin Vanschoren