Research Article
A Study Towards Building Content Aware Models in NLP using Genetic Algorithms
@ARTICLE{10.4108/airo.4078,
  author={Umesh Tank and Saranya Arirangan and Anwesh Reddy Paduri and Narayana Darapaneni},
  title={A Study Towards Building Content Aware Models in NLP using Genetic Algorithms},
  journal={EAI Endorsed Transactions on AI and Robotics},
  volume={2},
  number={1},
  publisher={EAI},
  journal_a={AIRO},
  year={2023},
  month={11},
  keywords={Content awareness, Large language models, data poisoning, genetic algorithms},
  doi={10.4108/airo.4078}
}
Umesh Tank
Saranya Arirangan
Anwesh Reddy Paduri
Narayana Darapaneni
Year: 2023
Journal: AIRO
Publisher: EAI
DOI: 10.4108/airo.4078
Abstract
INTRODUCTION: With the advancement of large language models (LLMs), there have been increasing concerns about how these models are used, as they can generate human-like text and perform a range of tasks such as code generation, question answering, essay writing, and even drafting text for research papers.
OBJECTIVES: The generated text depends on the original data on which the models were trained, which may be protected or may be personal/private data. A detailed description of these concerns and of various potential solutions is given in 'Generative language models and automated influence operations: Emerging threats and potential mitigations'.
METHODS: Addressing these concerns is paramount to the usability of LLMs. Researchers have explored several directions, and one interesting line of work is building content-aware models. The idea is that the model is aware of the type of content it is learning from and of what type of content should be used to generate a response to a specific query.
RESULTS: In our work we explored this direction by applying poisoning techniques to contaminate data and then applying genetic algorithms to extract, from the poisoned corpus, the non-poisoned content that can generate a good response when paraphrased.
CONCLUSION: While we demonstrated the idea using poisoning techniques and tried to make the model aware of copyrighted content, the same approach can be extended to detect other types of content, or to any other use case where content awareness is required.
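To make the RESULTS step concrete, below is a minimal sketch, in Python, of how a genetic algorithm could evolve a subset of a contaminated corpus that excludes poisoned chunks. This is an illustration of the general technique, not the paper's implementation: the CORPUS, the poison_score stand-in detector, and the fitness weighting are all hypothetical assumptions introduced here for the example.

import random

random.seed(42)

# Hypothetical corpus: each chunk is (text, is_poisoned). The labels are used
# only to simulate a detector; a real system would have to estimate them.
CORPUS = [(f"clean sentence {i}", False) for i in range(20)] + \
         [(f"poisoned sentence {i}", True) for i in range(5)]
random.shuffle(CORPUS)

def poison_score(chunk):
    """Stand-in for a learned poison detector: 1.0 if poisoned, else 0.0."""
    return 1.0 if chunk[1] else 0.0

def fitness(mask):
    """Reward keeping many chunks; penalize keeping poisoned ones."""
    kept = [c for c, m in zip(CORPUS, mask) if m]
    if not kept:
        return -1.0
    penalty = sum(poison_score(c) for c in kept)
    return len(kept) - 5.0 * penalty  # the 5.0 weight is arbitrary

def crossover(a, b):
    """Single-point crossover of two binary masks."""
    cut = random.randrange(1, len(a))
    return a[:cut] + b[cut:]

def mutate(mask, rate=0.05):
    """Flip each bit with a small probability."""
    return [1 - bit if random.random() < rate else bit for bit in mask]

def evolve(pop_size=40, generations=60):
    """Evolve binary masks over the corpus and return the fittest one."""
    pop = [[random.randint(0, 1) for _ in CORPUS] for _ in range(pop_size)]
    for _ in range(generations):
        pop.sort(key=fitness, reverse=True)
        elite = pop[: pop_size // 4]  # keep the best quarter unchanged
        children = [mutate(crossover(random.choice(elite), random.choice(elite)))
                    for _ in range(pop_size - len(elite))]
        pop = elite + children
    return max(pop, key=fitness)

best = evolve()
kept = [c for c, m in zip(CORPUS, best) if m]
print(f"kept {len(kept)} chunks, poisoned among them: "
      f"{sum(1 for c in kept if c[1])}")

The binary-mask encoding is a common choice for subset-selection problems; the elite fraction, mutation rate, and penalty weight above are illustrative defaults rather than tuned values.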
Copyright © 2023 U. Tank et al., licensed to EAI. This is an open access article distributed under the terms of the CC BY-NC-SA 4.0, which permits copying, redistributing, remixing, transformation, and building upon the material in any medium so long as the original work is properly cited.