MLLM-Track: An End-to-End Framework for Single Object Tracking

Hao Sun; Guosen Li; Mingzhe Zhang; Cun Ji; Xiangwei Zheng; Xinchun Cui

Proceedings of the 13th International Conference on Identification, Information and Knowledge in the Internet of Things, IIKI 2025, 18-21 December 2025, Chengdu, China

Research Article

@INPROCEEDINGS{10.4108/eai.18-12-2025.2365287,
    author={Hao Sun and Guosen Li and Mingzhe Zhang and Cun Ji and Xiangwei Zheng and Xinchun Cui},
    title={MLLM-Track: An End-to-End Framework for Single Object Tracking},
    proceedings={Proceedings of the 13th International Conference on Identification, Information and Knowledge in the Internet of Things, IIKI 2025, 18-21 December 2025, Chengdu, China},
    publisher={EAI},
    proceedings_a={IIKI},
    year={2026},
    month={6},
    keywords={RSOT, vision-language model, token selection, Qwen-VL, LLM},
    doi={10.4108/eai.18-12-2025.2365287}
}

Hao Sun¹, Guosen Li¹, Mingzhe Zhang¹, Cun Ji¹, Xiangwei Zheng¹,*, Xinchun Cui²

1: School of Computer Science and Artificial Intelligence, Shandong Normal University, Jinan, China
2: School of Foundational Education, University of Health and Rehabilitation Sciences, Qingdao, China

*Contact email: xwzhengcn@163.com

Abstract

Referring Single-Object Tracking (RSOT) requires locating objects via natural language. However, ambiguity in references and redundancy in visual features hinder performance. We propose MLLM-Track, an end-to-end framework. Its outer loop uses Reflective Prompt Optimization (RPO) to generate discriminative prompts via a Vision-Language Model (VLM), guided by unified alignment and localization scores. The inner loop features TGoT-K, which filters visual tokens via text-guided attention, and CM-GTR, a gated transformer using binary gating and sparse sampling to aggregate temporal cues efficiently. On the Elysium benchmark, MLLM-Track achieves 92.0% AUC, significantly outperforming baselines while maintaining fixed token budgets.
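
The abstract says the inner loop filters visual tokens via text-guided attention while keeping a fixed token budget. The following is a minimal sketch of that general idea only, not the authors' TGoT-K module, whose details are not given here. It assumes a single pooled text-query vector, scaled dot-product scores, and a plain top-k cut; the function name `select_top_k_tokens` and the toy vectors are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def select_top_k_tokens(visual_tokens, text_query, k):
    """Keep the k visual tokens that attend most strongly to the text query.

    Hypothetical helper: scoring is scaled dot-product attention against one
    pooled text vector (an assumption, not taken from the paper). Returns the
    kept indices in their original order, plus the kept tokens.
    """
    if k <= 0 or not visual_tokens:
        return [], []
    scale = math.sqrt(len(text_query))
    scores = [
        sum(v * t for v, t in zip(token, text_query)) / scale
        for token in visual_tokens
    ]
    weights = softmax(scores)
    # Rank tokens by attention weight, highest first, and cut at the budget k.
    ranked = sorted(range(len(visual_tokens)), key=lambda i: weights[i], reverse=True)
    kept = sorted(ranked[:k])  # restore original token order
    return kept, [visual_tokens[i] for i in kept]


# Toy data: four 2-D "visual tokens" and a text query pointing along the first axis.
visual_tokens = [
    [1.0, 0.0],
    [0.0, 1.0],
    [0.9, 0.1],
    [-1.0, 0.0],
]
text_query = [1.0, 0.0]

kept, selected = select_top_k_tokens(visual_tokens, text_query, k=2)
print(kept)  # [0, 2]
```

With a budget of `k=2`, the two tokens best aligned with the query survive and the rest are dropped, so downstream cost depends on `k` rather than on the number of input tokens.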

Keywords: RSOT, vision-language model, token selection, Qwen-VL, LLM

Published: 2026-06-17
Publisher: EAI

DOI: http://dx.doi.org/10.4108/eai.18-12-2025.2365287
