
Project: Fine-Tuning Vision-Language Models for Industrial Fruit Packing

Description

Recent vision-language-action models (VLAs) and vision-language models (VLMs) show strong promise for robot learning, but their practical value for industrial cells remains unclear. This thesis focuses on a more relevant question than public benchmarking alone: can a fine-tuned VLA or VLM outperform an existing industrial benchmark in a real robotic setup? The project builds on a unique in-house ecosystem consisting of Teachbot teleoperation, a proprietary AI stack, FANUC LR Mate robots, and suction grippers. The current application domain is fruit and vegetable picking and placing, with a natural next step toward meal-box filling, where robustness, speed, and gentle handling matter as much as raw task success (Kim, Pertsch, et al., 2025; Kim, Finn, & Liang, 2025; Khazatsky et al., 2024).

Core thesis idea: compare a fine-tuned VLA/VLM against the current industrial benchmark and determine which architecture, training strategy, and integration method creates the biggest real-world gain.


Why this project is interesting

Unlike many academic studies that evaluate models mainly on public robotics benchmarks, this project starts from a working industrial reference. That makes the thesis immediately valuable: you are not showing that foundation models perform well in research settings, but testing whether they can produce measurable gains in throughput, pick reliability, error recovery, generalization, and deployment flexibility in a production-style environment. This is especially relevant in agri-food robotics, where variable object geometry, occlusions, fragile products, and strict handling requirements still limit real-world adoption (Zitkovich et al., 2023; Lin et al., 2026; Ao et al., 2025).


Research gap

The literature shows rapid progress in open VLAs such as OpenVLA and in scalable teleoperation datasets such as DROID, yet there is still little evidence on how these models should be adapted for industrial pick-and-place cells that already have a strong task-specific baseline. In particular, there is a gap between general robot foundation models and industrial food-handling use cases where success depends not only on task completion, but also on cycle time, consistency, product safety, and ease of integration. A second gap is methodological: it is still open whether the best solution is an end-to-end VLA, a VLM used as a perception layer on top of the existing controller, or a hybrid architecture that combines teleoperated action learning with structured decision support (Kim, Pertsch, et al., 2025; Kim, Finn, & Liang, 2025; Khazatsky et al., 2024).


Proposed thesis direction

The thesis should investigate how a VLA or VLM can be fine-tuned on Teachbot demonstrations and compared against the current industrial benchmark. The setup should remain intentionally open. You could explore direct policy fine-tuning, VLM-guided perception feeding the current motion stack, skill decomposition into sub-actions, or hybrid supervision strategies that use teleoperated data to improve specific failure modes. The comparison should be run against the real industrial benchmark and evaluated on practical metrics such as pick success, cycle time, box-fill accuracy, retry behavior, robustness to new produce, and potentially transfer from fruit picking and placing toward meal-box filling.
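
As a rough illustration of how such metrics could be computed from logged trials, here is a minimal Python sketch. The `Trial` record, its field names, and the sample numbers are all hypothetical; the actual logging format of the cell is not specified here, and box-fill accuracy and robustness to new produce would need their own definitions.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class Trial:
    """One pick-and-place attempt. All field names are hypothetical."""
    success: bool        # item ended up in the intended slot
    cycle_time_s: float  # seconds from pick start to place end
    retries: int         # extra grasp attempts during this trial


def summarize(trials: list[Trial]) -> dict[str, float]:
    """Aggregate per-trial logs into a few headline metrics."""
    if not trials:
        raise ValueError("need at least one trial")
    successes = [t for t in trials if t.success]
    return {
        "pick_success_rate": len(successes) / len(trials),
        # Averaged over successful picks only, so failed attempts
        # do not distort the cycle-time figure.
        "mean_cycle_time_s": (
            mean(t.cycle_time_s for t in successes) if successes else float("nan")
        ),
        "mean_retries": mean(t.retries for t in trials),
    }


# Made-up numbers, purely to show the shape of the comparison.
baseline = [Trial(True, 2.0, 0), Trial(True, 2.2, 1),
            Trial(False, 3.0, 2), Trial(True, 2.1, 0)]
candidate = [Trial(True, 2.4, 0), Trial(True, 2.6, 0),
             Trial(True, 2.5, 1), Trial(True, 2.5, 0)]

for name, trials in (("benchmark", baseline), ("fine-tuned", candidate)):
    print(name, summarize(trials))
```

Reporting success rate and cycle time side by side matters here, because a model that picks more reliably but more slowly may still lose on throughput.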


Main research question

To what extent can a fine-tuned VLA or VLM, trained on Teachbot teleoperation data, outperform the current industrial benchmark for robotic fruit and vegetable picking and placing?
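
One possible way to make "outperform" quantitative is a confidence interval on the difference in pick success rate between the two systems. The sketch below uses a percentile bootstrap; the function name, the 0/1 outcome encoding, and the trial counts are illustrative assumptions, not part of the project definition.

```python
import random


def bootstrap_diff_ci(baseline, candidate, n_resamples=5000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for (candidate rate - baseline rate).

    Both inputs are lists of 0/1 outcomes, one entry per trial.
    """
    rng = random.Random(seed)  # fixed seed keeps the result reproducible
    diffs = []
    for _ in range(n_resamples):
        # Resample each system's trials with replacement.
        b = rng.choices(baseline, k=len(baseline))
        c = rng.choices(candidate, k=len(candidate))
        diffs.append(sum(c) / len(c) - sum(b) / len(b))
    diffs.sort()
    lower = diffs[int((alpha / 2) * n_resamples)]
    upper = diffs[int((1 - alpha / 2) * n_resamples) - 1]
    return lower, upper


# Hypothetical outcomes: 80/100 successful picks vs. 92/100.
baseline_outcomes = [1] * 80 + [0] * 20
candidate_outcomes = [1] * 92 + [0] * 8

low, high = bootstrap_diff_ci(baseline_outcomes, candidate_outcomes)
print(f"95% CI for the success-rate gain: [{low:.3f}, {high:.3f}]")
```

If the interval excludes zero, the gain is unlikely to be a sampling artifact; the same procedure could be applied to cycle time or retry counts.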


Expected outcome

The result of the thesis should not only be a model comparison, but also a design recommendation: which model family is most promising, how it should be integrated into the existing stack, and under which conditions it offers real industrial value. That outcome would give direct guidance for future deployment at TeleOperation Services (TOS), especially in extending from produce picking and placing toward flexible meal-box assembly.


References

  • Khazatsky, A., et al. (2024). DROID: A Large-Scale In-The-Wild Robot Manipulation Dataset.
  • Kim, M. J., Pertsch, K., Karamcheti, S., et al. (2025). OpenVLA: An Open-Source Vision-Language-Action Model. Proceedings of the 8th Conference on Robot Learning, PMLR, 270, 2679–2713.
  • Kim, M. J., Finn, C., & Liang, P. (2025). Fine-Tuning Vision-Language-Action Models: Optimizing Speed and Success. arXiv:2502.19645.
  • Zitkovich, B., et al. (2023). RT-2: Vision-Language-Action Models Transfer Web Knowledge to Robotic Control. Proceedings of the 7th Conference on Robot Learning, PMLR, 229, 2165–2183.
  • Lin, T., Sun, F., Li, X., Guo, X., Ying, J., Wu, H., & Li, H. (2026). A Review of Key Technologies and Recent Advances in Intelligent Fruit-Picking Robots. Horticulturae, 12(2), 158.
  • Ao, J., Ji, W., Yu, X., Ruan, C., & Xu, B. (2025). End-Effectors for Fruit and Vegetable Harvesting Robots: A Review of Key Technologies, Challenges, and Future Prospects. Agronomy, 15(11), 2650.


Prerequisites

  • Proficiency in programming languages such as Python and/or C/C++.
  • Experience with machine learning frameworks (e.g., PyTorch, TensorFlow).
  • Analytical mindset with a passion for robotics and AI innovation.


Do you want to apply or do you want more information? 

Company Supervisor: Teun Jansen
Email: t.jansen@tosrobotics.com
Website: tosrobotics.com

Details

Supervisor: Bram Grooten
Secondary supervisor: Joaquin Vanschoren
External location: TeleOperation Services