
Research Article
MLLM-Track: An End-to-End Framework for Single Object Tracking
@INPROCEEDINGS{10.4108/eai.18-12-2025.2365287,
  author={Hao Sun and Guosen Li and Mingzhe Zhang and Cun Ji and Xiangwei Zheng and Xinchun Cui},
  title={MLLM-Track: An End-to-End Framework for Single Object Tracking},
  proceedings={Proceedings of the 13th International Conference on Identification, Information and Knowledge in the Internet of Things, IIKI 2025, 18-21 December 2025, Chengdu, China},
  publisher={EAI},
  proceedings_a={IIKI},
  year={2026},
  month={6},
  keywords={RSOT, vision-language model, token selection, Qwen-VL, LLM},
  doi={10.4108/eai.18-12-2025.2365287}
}
Hao Sun
Guosen Li
Mingzhe Zhang
Cun Ji
Xiangwei Zheng
Xinchun Cui
Year: 2026
IIKI
EAI
DOI: 10.4108/eai.18-12-2025.2365287
Abstract
Referring Single-Object Tracking (RSOT) requires locating objects via natural language. However, ambiguity in references and redundancy in visual features hinder performance. We propose MLLM-Track, an end-to-end framework. Its outer loop uses Reflective Prompt Optimization (RPO) to generate discriminative prompts via a Vision-Language Model (VLM), guided by unified alignment and localization scores. The inner loop features TGoT-K, which filters visual tokens via text-guided attention, and CM-GTR, a gated transformer using binary gating and sparse sampling to aggregate temporal cues efficiently. On the Elysium benchmark, MLLM-Track achieves 92.0% AUC, significantly outperforming baselines while maintaining fixed token budgets.
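The abstract describes TGoT-K only at a high level: visual tokens are filtered via text-guided attention under a fixed token budget. The following is a minimal illustrative sketch of that general idea (score each visual token against a text query, keep the top k), not the authors' implementation. The function name `select_top_k_tokens`, the scaled dot-product scoring, and the toy vectors are all assumptions made for illustration; the paper may score and select tokens differently.

```python
import math


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_top_k_tokens(visual_tokens, text_query, k):
    """Hypothetical helper: keep the k visual tokens that receive the
    highest attention weight from a single text query vector."""
    scale = math.sqrt(len(text_query))
    # Assumed scoring rule: scaled dot product between token and query.
    scores = [dot(tok, text_query) / scale for tok in visual_tokens]
    weights = softmax(scores)
    # Rank token indices by descending weight and keep the first k.
    top = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)[:k]
    # Restore the original token order so positional structure is preserved.
    keep = sorted(top)
    return keep, [visual_tokens[i] for i in keep]


# Toy data (made up): four 2-D visual tokens and one 2-D text query.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.0]
kept_indices, kept_tokens = select_top_k_tokens(tokens, query, k=2)
print(kept_indices)
```

Because k is fixed, the number of tokens passed downstream stays constant regardless of how many visual tokens the frame produces, which is one way to read the abstract's "fixed token budgets".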


