Stable Random Sampling (SRS): A New Method to Refine Causal Masking in Decoder-Only Transformer

Shuhao Zhang; Jiayi Yu; Jiarui Li

Proceedings of the 2nd International Conference on Machine Learning and Automation, CONF-MLA 2024, November 21, 2024, Adana, Turkey

Research Article

Cite (BibTeX):

@INPROCEEDINGS{10.4108/eai.21-11-2024.2354592,
    author={Shuhao Zhang and Jiayi Yu and Jiarui Li},
    title={Stable Random Sampling (SRS): A New Method to Refine Causal Masking in Decoder-Only Transformer},
    proceedings={Proceedings of the 2nd International Conference on Machine Learning and Automation, CONF-MLA 2024, November 21, 2024, Adana, Turkey},
    publisher={EAI},
    proceedings_a={CONF-MLA},
    year={2025},
    month={3},
    keywords={decoder-only transformer, causal masking, random sampling, positional information},
    doi={10.4108/eai.21-11-2024.2354592}
}

Shuhao Zhang¹,*, Jiayi Yu², Jiarui Li³

1: University of Science and Technology Beijing
2: UM-SJTU Joint Institute, Shanghai Jiao Tong University
3: Xidian University

*Contact email: U202142800@xs.ustb.edu.cn

Abstract

In current language modelling, the decoder-only Transformer architecture with causal masking has become a cornerstone, demonstrating exceptional performance across various tasks. However, we have identified two significant limitations. First, causal masking is a substantial obstacle to further optimizing overall model efficiency, particularly when handling long contexts. Second, traditional optimizations of causal masking struggle with uneven attention distribution and cannot encode absolute positional information, which limits their effectiveness in position-sensitive tasks. In this work, we propose the Stable Random Sampling (SRS) algorithm, a novel method that addresses both limitations by refining the causal masking process. SRS introduces a pseudo-attention mask to balance attention distributions and improve performance, and it incorporates random sampling and Locality-Sensitive Hashing (LSH) into the causal masking step for efficient processing, reducing the time complexity of this step to O(n). The effectiveness of SRS is validated both theoretically and empirically. Our pre-training ablation experiments demonstrate that the SRS module enhances the performance of causal masking, while each of its functional parts improves efficiency and effectiveness across tasks of different sizes, showing on average a 30% reduction in training time and a 50% decrease in loss compared to traditional methods.
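For orientation, the baseline that the abstract refers to can be sketched as follows. This is a minimal illustration of standard causal masking in self-attention, not the authors' SRS algorithm, whose details are not given on this page. The function names `causal_attention` and `count_score_pairs` are hypothetical, and the sketch assumes a single head with no batching. It shows why the masked step scales quadratically: position i scores against positions 0..i, so a sequence of length n needs n(n+1)/2 score computations.

```python
import math


def causal_attention(q, k, v):
    """Baseline causal self-attention over lists of vectors (NOT SRS).

    q, k, v are lists of n vectors; names and shapes are illustrative
    assumptions, not taken from the paper.
    """
    n, d = len(q), len(q[0])
    out = []
    for i in range(n):
        # Causal mask: position i may only attend to positions 0..i.
        scores = [
            sum(a * b for a, b in zip(q[i], k[j])) / math.sqrt(d)
            for j in range(i + 1)
        ]
        # Numerically stable softmax over the visible positions.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible value vectors.
        out.append([
            sum(w * v[j][c] for j, w in enumerate(weights))
            for c in range(len(v[0]))
        ])
    return out


def count_score_pairs(n):
    """Number of query-key scores computed under a causal mask."""
    return n * (n + 1) // 2
```

Because of the mask, changing a later token never alters the output at an earlier position. The quadratic growth of `count_score_pairs` is the cost that, according to the abstract, SRS reduces to O(n) for this step.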

Keywords: decoder-only transformer, causal masking, random sampling, positional information

Published: 2025-03-11
Publisher: EAI

DOI: http://dx.doi.org/10.4108/eai.21-11-2024.2354592
