Design and Development of Bioinformatics Feature Based DNA Sequence Data Compression Algorithm

INTRODUCTION: Genetic data plays a key role in healthcare, but it is typically very large. Many studies show that the absence of DNA information at the right time is one of the major causes of error in healthcare. The more genomic information that analysts acquire, the better the prospects for individual and public health. Preserving genetic information and retrieving it in the right form within the available time is a major challenge in healthcare. Prenatal DNA tests already screen for developmental abnormalities. Before long, patients will have their blood sequenced to detect any nonhuman DNA that may signal an infectious disease. In the future, someone dealing with cancer will likely be able to track the progression of the disease by having the DNA and RNA of single cells from various tissues sequenced every day. DNA sequencing of whole populations will give a more complete and accurate prediction of population health.
OBJECTIVES: Genetic data is growing exponentially, so the ever-growing genetic databases are hard to manage. The human genome in its raw format occupies almost thirty terabytes of storage space. Computational resources are limited: not only storage, but also transmission capacity and run-time memory. Data compression is a challenge when genetic data is growing exponentially, and it is critical to preserve the integrity of the data while compressing it. The main objective of this paper is therefore to develop a lossless DNA compression algorithm that not only gives better compression but also helps in retrieving information for efficient use in healthcare.
METHODS: This paper proposes a lossless genetic data compression method. The proposed algorithm works in horizontal mode and uses a reference-based substitution technique for compression. The main idea of this paper lies in the kind of similarity that is searched for. All prevailing genetic compression methods, whether they follow horizontal mode or vertical mode, search for similarities within a single chromosome; that is, they exploit only intra-chromosomal similarities. This study shows that the compression ratio achieved by analyzing each chromosome individually is always lower than the ratio achieved when inter-chromosomal similarities (those between different chromosomes of the same genome) are also analyzed and compressed.
RESULTS: This study shows that simply using exactly matching repeats across all the chromosomes of the same genome not only improves the compression ratio but also allows a detailed study of the similarities and differences between two genomes of the same species.
CONCLUSION: This study proposes a new algorithm for compressing DNA. Along with intra-chromosomal similarities, inter-chromosomal similarities are considered. The results clearly show that inter-chromosomal matches are longer and more numerous than intra-chromosomal matches, which helps to achieve a better compression ratio.


Need of Compression in Health Care Industry
As the healthcare industry shifts from classical methods to more prediction-based methods [16][17][18][19], storing biological information efficiently is becoming a critical issue. DNA sequences are among the most important kinds of biological information. Through detailed study of how DNA sequences mutate over time, many diseases can be predicted and prevented. Biological sequences are like blueprints of all living organisms; DNA in particular is a functional description of cells, the basic building blocks of every organism. Research shows that DNA information is a very important data set in the healthcare industry, where getting the right data at the right time is of utmost importance. But as genomic databases grow exponentially, storing and retrieving the data is becoming a challenge. Hence there is a need for a genetic data compression algorithm that yields the best compression ratio and also retrieves the data in minimal time.

DNA
DNA sequences are made up of four chemical bases, namely adenine (A), thymine (T), guanine (G) and cytosine (C). Human DNA alone consists of about three billion such bases. Large databases such as GenBank, which stores genetic data shared by researchers all over the world, double in size every ninety months. As already stated, computational resources are limited.

Genetic Data Set Used
This research focuses on the genome of Saccharomyces cerevisiae (baker's yeast), also known as budding yeast. Saccharomyces cerevisiae is one of the major model organisms for understanding cellular and molecular processes in eukaryotes. This single-celled organism is also important in industry, where it is used to make bread, beer, wine, enzymes, and pharmaceuticals. The Saccharomyces cerevisiae genome is organized in 16 chromosomes, and this research studies and compresses these 16 chromosomes. The compression algorithm is not restricted to the Saccharomyces cerevisiae genome, however; it can be used to compress any chromosomal sequence of a given genome.

What is Data Compression
Storage is limited, and so are run-time memory and transmission capacity. With these limited resources, handling such exponentially growing data is a challenge. In this situation the first thing that comes to mind is compression. What is compression? Is it simply reducing the size of the data? No, compression is much more than that. Compression is "modeling + coding". Modeling is where different techniques are applied to find redundancy in the data, and coding is where these redundancies are replaced by some kind of reference. Compression is therefore a must for handling such large volumes of data.

Issue till now
The main question is whether standard compression algorithms can handle such distinctive genetic data. Standard compression algorithms not only fail to compress DNA sequences but end up yielding a negative compression ratio.
Standard compression algorithms are unable to exploit the special characteristics of DNA sequences, which is why the compressed file turns out larger than the original file. DNA sequences do have special characteristics that can be exploited to obtain positive compression.
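As a rough illustration of this point, the sketch below measures how many bits per base a general-purpose compressor needs. It uses Python's zlib module (DEFLATE, the method behind gzip) on a synthetic sequence of uniformly random bases, which is an assumption standing in for real DNA; the helper name bits_per_base is hypothetical. A plain two-bit code needs exactly 2 bits per base, so any figure above 2 means the general-purpose tool does worse than the trivial encoding.

```python
import random
import zlib


def bits_per_base(sequence: str) -> float:
    """Bits used per base after zlib (DEFLATE) compression of the ASCII text."""
    compressed = zlib.compress(sequence.encode("ascii"), 9)
    return 8 * len(compressed) / len(sequence)


# Synthetic stand-in for a DNA sequence (assumption: uniformly random bases,
# not real genomic data).
rng = random.Random(0)
synthetic = "".join(rng.choice("ACGT") for _ in range(100_000))

print(f"zlib: {bits_per_base(synthetic):.2f} bits per base (two-bit code: 2.00)")
```

Real chromosomes contain repeats that a random sequence lacks, so the exact figure will differ on real data; the comparison against the two-bit baseline is the part that carries over.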

Need of Compression
Compression and analysis of DNA sequences are very important, as they can lead to better and more customized medical treatment, disease diagnosis and drug-based solutions. Compression helps not only with efficient storage and retrieval but also with querying, transfer, comparison and analysis.
The main aim of this paper is not only DNA compression but also to find the commonality between two different chromosomes. Table 1, Figure 1 and Table 2, Figure 2 demonstrate the exponential growth of the GenBank database in terms of sequences and base pairs. These statistics show the exponential growth of genetic databases, which is why compression is needed.

Literature Survey
Compressing a DNA sequence is a challenging task for general-purpose compression algorithms, since these algorithms are primarily intended for English text, whereas the regularities observable in DNA sequences are subtle.
There are four bases in DNA sequences {G, C, T, A}, so each base can be represented by two bits. DNA sequences cannot be compressed effectively by conventional text compression tools such as compress, gzip and bzip2. Grumbach and Tahi introduced the first genetic compression algorithm, BioCompress, and its successor BioCompress-2 [4]. Both use LZ compression methods. BioCompress-2 added a search for exact repeats in 2010. The repeats are encoded by repeat length and position of occurrence, and the remaining non-repeat regions are encoded using order-2 arithmetic coding (Arithmetic-2).
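The two-bits-per-base baseline mentioned above can be sketched as follows. The particular bit assignment (A=00, C=01, G=10, T=11) and the function names pack and unpack are assumptions chosen for illustration; any fixed one-to-one assignment works equally well.

```python
# Assumed bit assignment; the choice is arbitrary as long as it is fixed.
BASE_TO_BITS = {"A": 0b00, "C": 0b01, "G": 0b10, "T": 0b11}
BITS_TO_BASE = "ACGT"


def pack(sequence: str) -> bytes:
    """Pack four bases into each byte, two bits per base."""
    out = bytearray()
    for i in range(0, len(sequence), 4):
        chunk = sequence[i:i + 4]
        byte = 0
        for base in chunk:
            byte = (byte << 2) | BASE_TO_BITS[base]
        byte <<= 2 * (4 - len(chunk))  # left-align a short final chunk
        out.append(byte)
    return bytes(out)


def unpack(packed: bytes, length: int) -> str:
    """Recover the first `length` bases from the packed bytes."""
    bases = []
    for byte in packed:
        for shift in (6, 4, 2, 0):
            bases.append(BITS_TO_BASE[(byte >> shift) & 0b11])
    return "".join(bases[:length])


packed = pack("GATTACA")
print(len(packed), unpack(packed, 7))
```

The original length has to be stored alongside the packed bytes, because the final byte may contain padding bits.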
Next came Cfact, a two-pass algorithm that uses a suffix tree to find the longest exact matches [3]. In the first pass the longest exact matches are found, and in the second pass the matches are encoded.
DNABIT was presented in 2011. It is also a two-phase algorithm. In the first phase binary bits are assigned to each nucleotide, and in the second phase a bit-code scheme substitutes codes of 3, 5, 7 or 9 bits depending on the length of the matches [5].
CTW+LZ, introduced in 2012, combines context tree weighting (CTW) with LZ77 [2]. The algorithm uses context tree weighting to encode long matches, whereas LZ77 is used to encode short matches.
In 2013, a two-pass lossless genome compression algorithm was introduced that synthesizes complementary contextual models to improve compression performance [13]. In this algorithm, DNA-COMPACT, the proposed framework can handle genome compression with and without reference sequences, and it showed performance advantages over the best existing algorithms.
In 2014, SEQCOMPRESS was introduced, a compression algorithm that copes with the space complexity of DNA sequences. The algorithm is based on lossless data compression and uses both a statistical model and arithmetic coding to encode DNA sequences [14]. In 2018, fatigue detection of workers using supervised learning was proposed [15][16]. An ideal seed-based compression algorithm for DNA data was introduced in 2016; it is a lossless algorithm that compresses a DNA sequence using a substitution technique similar to LZ compression [17].

Observation
The test results demonstrate that when the algorithm also searches for inter-chromosomal matches, both the number of matches and the length of the matches increase. In the compression process, the algorithm looks for exact and inexact repeats. The Saccharomyces cerevisiae (budding yeast) genome, which has sixteen chromosomes, is used in this paper to obtain experimental results. The analysis begins by finding exact matches in chromosome I and chromosome VIII separately, and then analyzes the two together to find the inter-chromosomal similarities as well. Chromosomes I and VIII are chosen because chromosome I shows its greatest similarity with chromosome VIII.

Methodology
The process starts with a search for exact matches (100% similarity) with a base length of 100.
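One simple way to carry out such a search is to index every fixed-length seed of one sequence in a hash table and extend each seed hit to the right. The sketch below does this for two sequences; it is an illustrative approach, not the implementation used in this paper, and the function name exact_matches and its min_len parameter are hypothetical. Storing every seed as a string is memory-hungry, so a practical tool would more likely use a suffix array or rolling hashes.

```python
from collections import defaultdict


def exact_matches(a: str, b: str, min_len: int = 100):
    """Return (pos_a, pos_b, length) for maximal exact matches of at least
    min_len bases between sequences a and b."""
    # Index every min_len-base seed of sequence a by its start position.
    seeds = defaultdict(list)
    for i in range(len(a) - min_len + 1):
        seeds[a[i:i + min_len]].append(i)

    matches = []
    for j in range(len(b) - min_len + 1):
        for i in seeds.get(b[j:j + min_len], ()):
            # Skip hits that merely continue a match starting one base earlier.
            if i > 0 and j > 0 and a[i - 1] == b[j - 1]:
                continue
            # Extend the seed to the right while the bases keep agreeing.
            length = min_len
            while (i + length < len(a) and j + length < len(b)
                   and a[i + length] == b[j + length]):
                length += 1
            matches.append((i, j, length))
    return matches


# Toy example with a short seed length so the result is easy to check by eye.
print(exact_matches("TTTTACGTACGGACC", "GGACGTACGGAAT", min_len=5))
```

Calling the function with two different chromosomes finds matches between chromosomes; to find repeats within one chromosome, the same sequence can be passed twice and the trivial hits where both positions are equal discarded.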
The first chromosome is made up of 230119 bases. The analysis finds 58 matches, with lengths between 101 and 337 bases, which can be replaced by 29 eight-bit ASCII codes. The matches cover 7860 bases in total, so 62880 bits are replaced by 232 bits. The remaining 222259 bases are converted into three-base codons, giving 74087 codons, and every codon is replaced by a six-bit binary code. Coding these 74087 codons (initial size 1778072 bits) requires only 444518 bits. Without compression each base occupies 8 bpb (bits per base), while after compression each base occupies 1.98 bpb.
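The chromosome I figures can be checked with a few lines of arithmetic. The variable names are illustrative; the input values are the ones reported in the text, with eight bits per uncompressed base and six bits per three-base codon (two bits per base) taken as stated. The sketch does not attempt to reproduce the reported 1.98 bpb, since the text does not detail what overhead that figure includes.

```python
BITS_PER_ASCII = 8            # one uncompressed base or one ASCII code

chromosome_bases = 230_119    # length of chromosome I as reported
matched_bases = 7_860         # bases covered by exact matches
ascii_codes = 29              # ASCII codes that replace the matches

matched_bits_before = matched_bases * BITS_PER_ASCII   # 62,880 bits
matched_bits_after = ascii_codes * BITS_PER_ASCII      # 232 bits

remaining_bases = chromosome_bases - matched_bases     # 222,259 bases
codons = -(-remaining_bases // 3)                      # ceiling division: 74,087
remaining_bits_before = remaining_bases * BITS_PER_ASCII  # 1,778,072 bits
remaining_bits_after = remaining_bases * 2             # 444,518 bits

print(matched_bits_before, matched_bits_after)
print(remaining_bases, codons, remaining_bits_before, remaining_bits_after)
```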
In chromosome I+VIII, 206 matches are found. The matches, with lengths between 101 and 3232 bases, cover 60440 bases in total and are replaced by 16 eight-bit ASCII codes. The 732112 non-matching bases are converted into codons, and every codon is replaced by a six-bit binary code. Chromosome I+VIII is compressed to 1.84 bpb.
In chromosome VIII, 32 matches are found. The matches, with lengths between 105 and 1988 bases, cover 7912 bases and are replaced by 16 eight-bit ASCII codes. The 554511 non-matching bases are converted into codons, and every codon is replaced by a six-bit binary code. Chromosome VIII is compressed to 1.97 bpb.
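The codon step used for the non-matching bases can be sketched as follows. Since there are 4 × 4 × 4 = 64 possible three-base codons, six bits are enough to number them all. The paper does not give its codon table, so the alphabetical numbering below (AAA = 0 up to TTT = 63) and the function names are assumptions; the bit string is kept as text for readability rather than packed into bytes.

```python
from itertools import product

# Assumed table: codons numbered in alphabetical order, AAA = 0 ... TTT = 63.
CODON_TO_CODE = {"".join(c): i for i, c in enumerate(product("ACGT", repeat=3))}
CODE_TO_CODON = {code: codon for codon, code in CODON_TO_CODE.items()}


def encode_codons(sequence: str) -> str:
    """Return a string of '0'/'1' characters, six per three-base codon."""
    if len(sequence) % 3 != 0:
        raise ValueError("sequence length must be a multiple of three")
    return "".join(
        format(CODON_TO_CODE[sequence[i:i + 3]], "06b")
        for i in range(0, len(sequence), 3)
    )


def decode_codons(bits: str) -> str:
    """Invert encode_codons."""
    return "".join(
        CODE_TO_CODON[int(bits[i:i + 6], 2)] for i in range(0, len(bits), 6)
    )


encoded = encode_codons("ACGTTT")
print(encoded, decode_codons(encoded))
```

A sequence whose length is not a multiple of three needs a padding rule for the last codon; the sketch simply rejects such input.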

Development Model/ Algorithm Developed for Compression
The basic idea behind this research paper is to create a compression algorithm that can successfully compress the sequence of DNA and offer a comparatively better compression ratio than other existing algorithms and provide the much-needed partial decompression function as well.
The design of the compression algorithm has been divided into four different phases.

Development Model/ Algorithm Developed for Decompression
The design of the Decompression algorithm has been divided into two different phases.
• Phase 1 - Decompression Phase I: the complete sequence is decompressed in one go.
• Phase 2 - Decompression Phase II: only a part of the compressed sequence is decompressed.
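The two phases can be illustrated with a minimal sketch. It assumes, purely for illustration, that the compressed sequence is a list of tokens, each either a literal run of bases ("lit") or a reference ("rep") to an entry in a repeat dictionary; this token layout and the function names are hypothetical and not the file format used in this paper.

```python
def decompress(tokens, dictionary):
    """Phase 1: expand every token to recover the complete sequence."""
    return "".join(
        dictionary[value] if kind == "rep" else value for kind, value in tokens
    )


def decompress_range(tokens, dictionary, start, end):
    """Phase 2: recover only bases start..end-1, stopping once they are covered."""
    out = []
    pos = 0  # position in the original sequence where the current token begins
    for kind, value in tokens:
        text = dictionary[value] if kind == "rep" else value
        next_pos = pos + len(text)
        if next_pos > start and pos < end:
            # Keep only the part of this token that overlaps the requested range.
            out.append(text[max(start - pos, 0):end - pos])
        pos = next_pos
        if pos >= end:
            break
    return "".join(out)


repeat_dictionary = {"!": "ACGTACGT"}
tokens = [("lit", "TT"), ("rep", "!"), ("lit", "GG"), ("rep", "!")]
print(decompress(tokens, repeat_dictionary))
print(decompress_range(tokens, repeat_dictionary, 8, 14))
```

Because each dictionary entry has a known length, the partial decoder can track positions in the original sequence without keeping the expanded text of tokens that fall outside the requested range.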

The Mathematical Results
The following table shows a comparison of intra- versus inter-chromosomal repeats, which clearly shows the improvement in the number of repeats. The current study not only focuses on the development of a novel DNA compression algorithm but also aims at improving the compression ratio of existing algorithms. As already concluded, the more repeats there are, the better the compression. The genetic compression algorithms developed so far and discussed in the literature survey do not use inter-chromosomal similarities; all of them use only intra-chromosomal repeats. The novelty of this algorithm lies in its use of inter-chromosomal repeats along with intra-chromosomal repeats of the same genome. Table 6 shows the large difference in number between intra- and inter-chromosomal repeats. This study not only develops a better compression algorithm but can also change the approach of existing repeat-based DNA compression algorithms and thereby improve their compression ratio.

Implication of Data and Findings
The proposed technique not only improves the compression ratio of this algorithm but can also improve the compression ratio of other existing algorithms. In the proposed technique the mature dictionary is created only for inter- and intra-chromosomal exact repeats. The compression ratio can improve further if other features of DNA sequences, such as approximate repeats, tandem repeats and palindromes, are also exploited.
Hence it can be concluded that the proposed technique uses inter- and intra-chromosomal exact repeats to create a mature dictionary, which in turn is used to compress and decompress the DNA sequence. This method gives more than 75% compression and adds the features of random access and partial decompression.

Conclusion
This paper proposes a new compression algorithm for the DNA sequence compression problem. It also presents a new idea: exploiting inter-chromosomal similarities and substrings. The numerical results demonstrate that inter-chromosomal matches are longer and more numerous than intra-chromosomal matches. In the current paper only exact matches are considered; the work can be extended to approximate matches for better compression. This work can be used to compress DNA sequences effectively and also to find the degree of similarity between two different chromosomes of the same genome.
For inter-chromosomal sequences, this compression algorithm achieves a compression ratio of more than 75% by assigning ASCII values to the repeats and codons found in the original database and substituting them to generate the compressed sequence. During compression, the input file stores the data sequence and is assigned ASCII codes from the dictionary table. The file can be accessed from both the client side and the server side. Partial decompression can be done by accessing an arbitrary string from the database dictionary. In the input file, each ASCII code must be preceded and followed by a space so that the repeats can be distinguished during decompression.
The compression algorithm proposed in this research work cannot compress an arbitrary biological sequence in isolation. It requires the complete genome data, with all the chromosomes, to prepare the mature dictionary. Complete compression of the sequences can take place only once the mature dictionary is ready for the replacement process.