Proceedings of the 13th International Conference on Identification, Information and Knowledge in the Internet of Things, IIKI 2025, 18-21 December 2025, Chengdu, China

Research Article

MLLM-Track: An End-to-End Framework for Single Object Tracking

@INPROCEEDINGS{10.4108/eai.18-12-2025.2365287,
    author={Hao Sun and Guosen Li and Mingzhe Zhang and Cun Ji and Xiangwei Zheng and Xinchun Cui},
    title={MLLM-Track: An End-to-End Framework for Single Object Tracking},
    proceedings={Proceedings of the 13th International Conference on Identification, Information and Knowledge in the Internet of Things, IIKI 2025, 18-21 December 2025, Chengdu, China},
    publisher={EAI},
    proceedings_a={IIKI},
    year={2026},
    month={6},
    keywords={RSOT, vision-language model, token selection, Qwen-VL, LLM},
    doi={10.4108/eai.18-12-2025.2365287}
}
Hao Sun1, Guosen Li1, Mingzhe Zhang1, Cun Ji1, Xiangwei Zheng1,*, Xinchun Cui2
  • 1: School of Computer Science and Artificial Intelligence, Shandong Normal University, Jinan, China
  • 2: School of Foundational Education, University of Health and Rehabilitation Sciences, Qingdao, China
*Contact email: xwzhengcn@163.com

Abstract

Referring Single-Object Tracking (RSOT) requires locating a target object from a natural-language reference. However, ambiguous references and redundant visual features hinder performance. We propose MLLM-Track, an end-to-end framework built around two nested loops. The outer loop uses Reflective Prompt Optimization (RPO), in which a Vision-Language Model (VLM) generates discriminative prompts guided by unified alignment and localization scores. The inner loop combines TGoT-K, which filters visual tokens via text-guided attention, with CM-GTR, a gated transformer that uses binary gating and sparse sampling to aggregate temporal cues efficiently. On the Elysium benchmark, MLLM-Track achieves 92.0% AUC, significantly outperforming baselines while maintaining fixed token budgets.
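The idea of filtering visual tokens with text-guided attention under a fixed budget can be illustrated with a minimal sketch. This is not the authors' TGoT-K implementation: the abstract does not specify the scoring function, so the sketch assumes a single pooled text query vector, scaled dot-product scores, and a softmax over tokens. The function name `text_guided_top_k` and all parameters are hypothetical.

```python
import math


def text_guided_top_k(visual_tokens, text_query, k):
    """Keep the k visual tokens that a text query attends to most strongly.

    Assumption (not from the paper): relevance is a scaled dot product between
    one pooled text vector and each visual token, normalized with a softmax.
    """
    if not 0 < k <= len(visual_tokens):
        raise ValueError("k must be between 1 and the number of visual tokens")

    dim = len(text_query)
    # Scaled dot-product score of each visual token against the text query.
    scores = [
        sum(q * v for q, v in zip(text_query, token)) / math.sqrt(dim)
        for token in visual_tokens
    ]

    # Numerically stable softmax turns the scores into attention weights.
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    weights = [e / total for e in exps]

    # Take the k highest-weighted tokens, then restore their original order
    # so positional structure is preserved in the reduced sequence.
    ranked = sorted(range(len(weights)), key=lambda i: weights[i], reverse=True)
    kept_idx = sorted(ranked[:k])
    kept_tokens = [visual_tokens[i] for i in kept_idx]
    return kept_idx, kept_tokens, weights


# Toy example: four 2-D visual tokens, a text query aligned with the first axis.
tokens = [[1.0, 0.0], [0.0, 1.0], [0.9, 0.1], [-1.0, 0.0]]
query = [1.0, 0.0]
kept_idx, kept_tokens, weights = text_guided_top_k(tokens, query, k=2)
print(kept_idx)  # → [0, 2]
```

Because `k` is fixed in advance, the number of tokens passed downstream stays constant regardless of how many visual tokens a frame produces, which is one simple way to read the "fixed token budgets" claim.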

Keywords
RSOT, vision-language model, token selection, Qwen-VL, LLM
Published
2026-06-17
Publisher
EAI
http://dx.doi.org/10.4108/eai.18-12-2025.2365287
Copyright © 2025–2026 EAI