
Research Article
Stable Random Sampling (SRS): A New Method to Refine Causal Masking in Decoder-Only Transformer
@INPROCEEDINGS{10.4108/eai.21-11-2024.2354592,
  author={Shuhao Zhang and Jiayi Yu and Jiarui Li},
  title={Stable Random Sampling (SRS): A New Method to Refine Causal Masking in Decoder-Only Transformer},
  proceedings={Proceedings of the 2nd International Conference on Machine Learning and Automation, CONF-MLA 2024, November 21, 2024, Adana, Turkey},
  publisher={EAI},
  proceedings_a={CONF-MLA},
  year={2025},
  month={3},
  keywords={decoder-only transformer, causal masking, random sampling, positional information},
  doi={10.4108/eai.21-11-2024.2354592}
}
- Shuhao Zhang
- Jiayi Yu
- Jiarui Li
Year: 2025
Venue: CONF-MLA
Publisher: EAI
DOI: 10.4108/eai.21-11-2024.2354592
Abstract
In current language modelling, the decoder-only Transformer architecture with causal masking has become a cornerstone, demonstrating exceptional performance across a wide range of tasks. However, we identify two significant limitations. First, causal masking is a substantial obstacle to further optimizing overall model efficiency, particularly when handling long contexts. Second, traditional optimizations of causal masking struggle with uneven attention distributions and cannot encode absolute positional information, limiting their effectiveness on position-sensitive tasks. In this work, we propose the Stable Random Sampling (SRS) algorithm, a novel method that addresses both limitations by refining the causal masking process. SRS introduces a pseudo-attention mask to balance attention distributions for better performance, and incorporates random sampling and Locality-Sensitive Hashing (LSH) into the causal masking stage for efficient processing, reducing the time complexity of this stage to O(n). The effectiveness of SRS is validated both theoretically and empirically. Our pre-training ablation experiments demonstrate that the SRS module consistently enhances the performance of causal masking, while each of its functional components improves efficiency and effectiveness on tasks of different sizes, showing on average a 30% reduction in training time and a 50% decrease in loss compared to traditional methods.
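To make the abstract's idea of combining LSH with causal masking concrete, the following is a minimal sketch, not the paper's actual SRS implementation: it only illustrates how random-projection hashing can restrict each query to earlier positions in the same hash bucket, so the masking step touches far fewer query-key pairs than a full O(n^2) causal mask. The function names (lsh_bucket_ids, srs_like_attention) and the parameter n_planes are illustrative assumptions, and the pseudo-attention mask described in the paper is not modelled here.

```python
# Illustrative sketch only (assumed names, not the authors' SRS code):
# LSH-restricted causal attention in plain NumPy.
import numpy as np

def lsh_bucket_ids(x, n_planes=4, seed=0):
    """Hash each row of x (n, d) to a bucket id via signed random projections."""
    rng = np.random.default_rng(seed)
    planes = rng.standard_normal((x.shape[-1], n_planes))
    bits = (x @ planes) > 0                    # (n, n_planes) sign bits
    return bits @ (1 << np.arange(n_planes))   # pack sign bits into an integer id

def srs_like_attention(q, k, v, n_planes=4):
    """Causal attention where each query only sees earlier keys in its LSH bucket."""
    n, d = q.shape
    buckets = lsh_bucket_ids(k, n_planes)
    out = np.zeros_like(v)
    for i in range(n):
        # Causal constraint plus bucket match; position i always attends to itself.
        idx = np.where((np.arange(n) <= i) & (buckets == buckets[i]))[0]
        scores = q[i] @ k[idx].T / np.sqrt(d)
        w = np.exp(scores - scores.max())
        out[i] = (w / w.sum()) @ v[idx]
    return out

# Toy usage: 8 tokens with 16-dimensional heads.
rng = np.random.default_rng(1)
q = rng.standard_normal((8, 16))
k = rng.standard_normal((8, 16))
v = rng.standard_normal((8, 16))
print(srs_like_attention(q, k, v).shape)  # (8, 16)
```

Because each query only scores keys in its own bucket, the average work per position stays roughly constant when buckets are balanced, which is the kind of reduction the abstract attributes to the random sampling and LSH components; how SRS balances the resulting attention distribution is detailed in the paper itself.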