Proceedings of the 2nd International Conference on Information Economy, Data Modeling and Cloud Computing, ICIDC 2023, June 2–4, 2023, Nanchang, China

Research Article

Research on Dense Enhanced Document Retrieval Based on G-mixup

  • @INPROCEEDINGS{10.4108/eai.2-6-2023.2334610,
        author={Jiawei  Tang and Junping  Liu},
        title={Research on Dense Enhanced Document Retrieval Based on G-mixup},
        proceedings={Proceedings of the 2nd International Conference on Information Economy, Data Modeling and Cloud Computing, ICIDC 2023, June 2--4, 2023, Nanchang, China},
        publisher={EAI},
        proceedings_a={ICIDC},
        year={2023},
        month={8},
        keywords={mixup dense document retrieval graph convolutional neural},
        doi={10.4108/eai.2-6-2023.2334610}
    }
    
Jiawei Tang1,*, Junping Liu1
  • 1: Wuhan Textile University
*Contact email: 2359451809@qq.com

Abstract

Dense document retrieval models based on Mixup treat words as independent units, which severs the connections between words and ignores the semantic information of the text. They also suffer from insufficient labeled training data. To address these problems, this paper proposes GDAR (Graph Data Augment Retrieval), a dense enhanced document retrieval model based on G-mixup. The model first uses a graph convolutional neural network to convert queries and documents into graph data. It then uses document graphs of the same type to construct a graph generator (Graphon). Finally, it mixes the Graphons in Euclidean space to obtain new Graphons, and performs linear interpolation and perturbation operations on them to obtain new training data with soft labels, addressing the lack of labeled data in dense document retrieval models. Experiments on the Natural Questions and TriviaQA datasets show that the method improves the T-1 accuracy metric by 4.12% and 4.88%, respectively, compared with the best baseline method.
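
The Graphon mixing step described in the abstract can be illustrated with a minimal sketch. This is not the authors' GDAR implementation: it assumes a crude step-function Graphon estimate (degree-sorted block averaging), one-hot labels per graph type, and a fixed mixing weight `lam`, and it omits both the graph convolutional conversion of text into graphs and the perturbation operation. The function names `estimate_graphon`, `mix`, and `sample_graph` are hypothetical.

```python
import random


def estimate_graphon(adjs, k):
    """Estimate a k x k step-function graphon from same-type graphs.

    Assumption: nodes are aligned by sorting on degree, and every graph
    has at least k nodes. Each cell is the mean edge density of a block.
    """
    est = [[0.0] * k for _ in range(k)]
    for a in adjs:
        n = len(a)
        order = sorted(range(n), key=lambda v: -sum(a[v]))  # high degree first
        block = [pos * k // n for pos in range(n)]  # sorted position -> block id
        total = [[0.0] * k for _ in range(k)]
        count = [[0] * k for _ in range(k)]
        for p in range(n):
            for q in range(n):
                total[block[p]][block[q]] += a[order[p]][order[q]]
                count[block[p]][block[q]] += 1
        for i in range(k):
            for j in range(k):
                if count[i][j]:
                    est[i][j] += total[i][j] / count[i][j] / len(adjs)
    return est


def mix(w_a, y_a, w_b, y_b, lam):
    """Linearly interpolate two graphons and their labels (soft label out)."""
    w = [[lam * p + (1 - lam) * q for p, q in zip(row_a, row_b)]
         for row_a, row_b in zip(w_a, w_b)]
    y = [lam * p + (1 - lam) * q for p, q in zip(y_a, y_b)]
    return w, y


def sample_graph(w, n, rng):
    """Sample an undirected n-node graph with edge probabilities taken from w."""
    k = len(w)
    u = [rng.randrange(k) for _ in range(n)]  # latent block of each node
    adj = [[0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            if rng.random() < w[u[i]][u[j]]:
                adj[i][j] = adj[j][i] = 1
    return adj


# Toy data (hypothetical): one complete graph and one empty graph on 6 nodes.
rng = random.Random(0)
n = 6
dense = [[int(i != j) for j in range(n)] for i in range(n)]
empty = [[0] * n for _ in range(n)]

w_a = estimate_graphon([dense], 3)
w_b = estimate_graphon([empty], 3)
w_mix, y_mix = mix(w_a, [1.0, 0.0], w_b, [0.0, 1.0], lam=0.7)
new_graph = sample_graph(w_mix, 8, rng)  # synthetic graph paired with y_mix
```

Each sampled graph inherits the interpolated label `y_mix`, which is how mixing in Graphon space yields additional training examples with soft labels. In practice the mixing weight would typically be drawn at random rather than fixed.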