
Research Article
CUTE: A Collaborative Fusion Representation-Based Fine-Tuning and Retrieval Framework for Code Search
@INPROCEEDINGS{10.1007/978-3-031-54521-4_19,
  author={Qihong Song and Jianxun Liu and Haize Hu},
  title={CUTE: A Collaborative Fusion Representation-Based Fine-Tuning and Retrieval Framework for Code Search},
  proceedings={Collaborative Computing: Networking, Applications and Worksharing. 19th EAI International Conference, CollaborateCom 2023, Corfu Island, Greece, October 4-6, 2023, Proceedings, Part I},
  proceedings_a={COLLABORATECOM},
  year={2024},
  month={2},
  keywords={Code search; Collaborative fusion representation; Fine-tuning; Hard negative sample; Data augmentation},
  doi={10.1007/978-3-031-54521-4_19}
}
Qihong Song
Jianxun Liu
Haize Hu
Year: 2024
COLLABORATECOM
Springer
DOI: 10.1007/978-3-031-54521-4_19
Abstract
Code search aims to retrieve semantically relevant code snippets from a large-scale codebase given a natural-language query. Fine-tuning pre-trained models for code search has recently emerged as a new trend. However, most studies fine-tune models using metric learning alone, overlooking the benefits of the collaborative relationship between code and query. In this paper, we introduce an effective fine-tuning and retrieval framework called CUTE. In the fine-tuning component, we propose a Collaborative Fusion Representation (CFR) consisting of three stages: pre-representation, collaborative representation, and residual fusion. CFR enhances the representations of code and query by modeling token-level collaborative features between them. Furthermore, we apply augmentation techniques to generate vector-level hard negative samples for training, further improving the pre-trained model's ability to discriminate and represent features during fine-tuning. In the retrieval component, we introduce a two-stage retrieval architecture comprising pre-retrieval and refined ranking, which significantly reduces time and computational resource consumption. We evaluate CUTE with three advanced pre-trained models on CodeSearchNet, a benchmark covering six programming languages. Extensive experiments demonstrate the fine-tuning effectiveness and retrieval efficiency of CUTE.
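
To make the mechanisms named in the abstract concrete, here is a minimal, self-contained PyTorch sketch of how a collaborative-fusion re-ranker, vector-level hard-negative mixing, and two-stage retrieval could fit together. Everything in it is an assumption drawn only from the abstract: the names (CollaborativeFusionReranker, mix_hard_negative, two_stage_search) are hypothetical, and cross-attention with a residual add and linear interpolation stand in for the paper's actual fusion and augmentation recipes.

import torch
import torch.nn.functional as F


class CollaborativeFusionReranker(torch.nn.Module):
    """Illustrative re-ranker: cross-attention captures token-level
    collaborative features between query and code ('collaborative
    representation'), which are added back to the independent encodings
    via a residual connection ('residual fusion')."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.cross_attn = torch.nn.MultiheadAttention(dim, heads, batch_first=True)

    def forward(self, q_tokens, c_tokens):
        # Query tokens attend over code tokens (collaborative representation).
        attended, _ = self.cross_attn(q_tokens, c_tokens, c_tokens)
        # Residual fusion of collaborative features with the pre-representation.
        fused_q = q_tokens + attended
        # Mean-pool each sequence to one vector; score by cosine similarity.
        q_vec = F.normalize(fused_q.mean(dim=1), dim=-1)
        c_vec = F.normalize(c_tokens.mean(dim=1), dim=-1)
        return (q_vec * c_vec).sum(dim=-1)


def mix_hard_negative(anchor_vec, negative_vec, alpha=0.3):
    # Vector-level augmentation (an assumed mixing recipe, not the paper's):
    # interpolating a negative toward the anchor yields a harder negative
    # for contrastive fine-tuning.
    return alpha * anchor_vec + (1.0 - alpha) * negative_vec


def two_stage_search(query_vec, query_tokens, code_vecs, code_token_bank,
                     reranker, k=100):
    # Stage 1 (pre-retrieval): cheap dense similarity over the whole corpus.
    sims = F.normalize(code_vecs, dim=-1) @ F.normalize(query_vec, dim=-1)
    candidates = sims.topk(k).indices
    # Stage 2 (refined ranking): costly fusion scoring on the top-k only.
    cand_tokens = code_token_bank[candidates]            # (k, seq_len, dim)
    scores = reranker(query_tokens.expand(k, -1, -1), cand_tokens)
    return candidates[scores.argsort(descending=True)]


# Toy usage with random tensors standing in for encoder outputs.
dim, corpus, seqlen = 64, 1000, 32
reranker = CollaborativeFusionReranker(dim, heads=4)
ranking = two_stage_search(torch.randn(dim), torch.randn(1, seqlen, dim),
                           torch.randn(corpus, dim),
                           torch.randn(corpus, seqlen, dim), reranker, k=10)

Under this reading, the efficiency claim follows from the architecture: corpus embeddings for the pre-retrieval stage can be computed once offline, so the expensive token-level fusion runs only on the k pre-retrieved candidates rather than the whole database.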