Abstract: Computer vision in robotics requires algorithms that run in real time, so processing must be as fast as possible and performed locally on edge devices.
The benchmark is designed to evaluate whether Multimodal Large Language Models (MLLMs) can process multi-UAV collaborative visual data for question answering, covering perception, reasoning, and ...