airo 23(1):

Research Article

A Study Towards Building Content Aware Models in NLP using Genetic Algorithms

  • @ARTICLE{10.4108/airo.4078,
        author={Umesh Tank and Saranya Arirangan and Anwesh Reddy Paduri and Narayana Darapaneni},
        title={A Study Towards Building Content Aware Models in NLP using Genetic Algorithms},
        journal={EAI Endorsed Transactions on AI and Robotics},
        keywords={Content awareness, Large language models, data poisoning, genetic algorithms},
  • Umesh Tank
    Saranya Arirangan
    Anwesh Reddy Paduri
    Narayana Darapaneni
    Year: 2023
    A Study Towards Building Content Aware Models in NLP using Genetic Algorithms
    DOI: 10.4108/airo.4078
Umesh Tank1, Saranya Arirangan2, Anwesh Reddy Paduri2,*, Narayana Darapaneni3
  • 1: PES University
  • 2: Great Learning
  • 3: Northwestern University
*Contact email:


INTRODUCTION: With the advancement of large language models (LLMs), there have been increasing concerns around the usage of these models, since they can generate human-like text and perform a number of tasks such as generating code, answering questions, writing essays, and even generating text for research papers.

OBJECTIVES: The generated text depends on the original data on which the models are trained, which might be protected or may be personal or private data. A detailed description of such concerns and various potential solutions is given in ‘Generative language models and automated influence operations: Emerging threats and potential mitigations’.

METHODS: Addressing these concerns is paramount for the usability of LLMs. Researchers have explored several directions, and one of the interesting lines of work is building content-aware models. The idea is that the model is aware of the type of content it is learning from and of the type of content that should be used to generate a response to a specific query.

RESULTS: In our work, we explored this direction by applying poisoning techniques to contaminate data and then applying genetic algorithms to extract, from the poisoned content, the non-poisoned content that can generate a good response when paraphrased.

CONCLUSION: While we demonstrated the idea using poisoning techniques and tried to make the model aware of copyrighted content, the same approach can be extended to detect other types of content or to any other use case where content awareness is required.
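
To make the idea in the abstract concrete, the following is a minimal toy sketch of how a genetic algorithm might select non-poisoned content from a poisoned pool. It is not the authors' implementation: the abstract does not specify the encoding, the fitness function, or the poisoning scheme. The visible marker `[[P]]`, the keyword-overlap fitness, the penalty values, and the function names `fitness` and `evolve` are all assumptions made for illustration.

```python
import random

# Hypothetical poison marker; real poisoning would not be this easy to spot.
POISON_MARKER = "[[P]]"

# Toy corpus (assumed for illustration): two entries are "poisoned".
SENTENCES = [
    "Genetic algorithms evolve candidate solutions",
    "[[P]] genetic algorithms text copied from a protected source",
    "The weather was pleasant yesterday",
    "Selection keeps the fittest text fragments",
    "[[P]] protected lyrics reproduced verbatim",
]
QUERY = "genetic algorithms text"


def fitness(mask, sentences, query_words):
    """Reward query-word overlap; penalise poisoned and superfluous sentences."""
    score = 0.0
    for keep, sentence in zip(mask, sentences):
        if not keep:
            continue
        words = set(sentence.lower().split())
        score += len(words & query_words)   # relevance to the query
        if POISON_MARKER in sentence:
            score -= 5.0                    # assumed penalty for poisoned content
        score -= 0.1                        # small cost per selected sentence
    return score


def evolve(sentences, query, pop_size=40, generations=80,
           mutation_rate=0.1, seed=0):
    """Evolve a 0/1 mask over sentences; 1 means the sentence is kept."""
    rng = random.Random(seed)
    query_words = set(query.lower().split())
    n = len(sentences)

    def score(mask):
        return fitness(mask, sentences, query_words)

    population = [[rng.randint(0, 1) for _ in range(n)]
                  for _ in range(pop_size)]
    for _ in range(generations):
        population.sort(key=score, reverse=True)
        next_gen = [m[:] for m in population[:2]]        # elitism: keep top two
        while len(next_gen) < pop_size:
            # Tournament selection of two parents.
            p1 = max(rng.sample(population, 3), key=score)
            p2 = max(rng.sample(population, 3), key=score)
            cut = rng.randint(1, n - 1)                  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation.
            child = [1 - b if rng.random() < mutation_rate else b
                     for b in child]
            next_gen.append(child)
        population = next_gen
    return max(population, key=score)


best_mask = evolve(SENTENCES, QUERY)
kept = [s for keep, s in zip(best_mask, SENTENCES) if keep]
print(kept)
```

In this sketch the selected sentences would then be handed to a paraphrasing step, which is omitted here. In a realistic setting the fitness function would have to detect poisoned content without an explicit marker, for example through a learned classifier, which is where most of the difficulty lies.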