An Experimental Study with Tensor Flow for Characteristic mining of Mathematical Formulae from a Document

This article proposes a deep learning technique for the extraction and classification of mathematical keywords from textual documents. Extracting math keywords from textual data is a significant problem because textual documents contain a combination of mathematical symbols and natural-language literals such as letters and words. Separating the textual words embedded in mathematical formulae is a complex task. The proposed technique addresses this problem using stemming, tokenization and clustering of mathematical keywords, based on a training set of mathematical keyword and formula pairs. The performance of the proposed technique is appraised using metrics such as retrieval time, Sensitivity, Accuracy, FPR, FNR, and FDR.

Received on 17 January 2019, accepted on 20 May 2019, published on 10 June 2019.
Copyright © 2019 K. N. Brahmaji Rao et al., licensed to EAI. This is an open access article distributed under the terms of the Creative Commons Attribution licence (http://creativecommons.org/licenses/by/3.0/), which permits unlimited use, distribution and reproduction in any medium so long as the original work is properly cited.
doi: 10.4108/eai.10-6-2019.159097
Corresponding author. Email: brahmaji77@gmail.com


Introduction
TensorFlow is an open source software library for high-performance numerical computation. Its flexible design allows computation to be deployed easily across different platforms. It was developed by the Google Brain team within Google's AI organization. It is used for numerical computation in many scientific fields and gives strong support for machine learning and deep learning.
Text can be classified as well as clustered using TensorFlow. The chief advantage of TensorFlow is that it provides a base that can be used to build deep learning models directly. Text classification with TensorFlow is divided into several segments.
The first segment deals with text pre-processing and the formation of the bag of words. The second segment trains the text classifier, and the final segment tests the classifier [1,2,3,4,5,6,7,8].

Stemming
Stemming is the procedure applied to a single word to obtain its root. The words used in a sentence are often derived forms. To normalize the text, such words are truncated so that only root words remain [4].
For example, after stemming, the words "writing", "written", and "writer" all end up with the root word "write".
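The idea of stemming can be sketched with a toy suffix-stripping function. This is only an illustration under stated assumptions: the suffix list and the name `toy_stem` are hypothetical and are not the stemmer used in the study. Note that a pure suffix stripper returns a truncated stem ("writ") rather than a dictionary word; obtaining "write" would additionally require a lemmatizer or a lookup table.

```python
# Hypothetical suffix list, checked in order; not taken from the paper.
SUFFIXES = ("ing", "ten", "er", "e")

def toy_stem(word):
    """Strip the first matching suffix, keeping at least three characters."""
    word = word.lower()
    for suffix in SUFFIXES:
        if word.endswith(suffix) and len(word) - len(suffix) >= 3:
            return word[: -len(suffix)]
    return word

# All four derived forms collapse to the same stem.
stems = {toy_stem(w) for w in ["writing", "written", "writer", "write"]}
print(stems)
```

Because every form maps to one stem, later steps can treat the four words as a single feature.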

Tokenization
The words in a sentence are called tokens. Tokenization is the process of finding the unique words in a given piece of text.
For example, tokenization splits the sentence "C Programming Language" into the token list ["C", "Programming", "Language"] [4].
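A minimal tokenizer along these lines might look as follows. The regular expression and the function names are assumptions made for illustration; the study does not specify how tokens were delimited.

```python
import re

def tokenize(sentence):
    """Split a sentence into alphanumeric tokens (assumed delimiter rule)."""
    return re.findall(r"[A-Za-z0-9]+", sentence)

def unique_tokens(sentence):
    """Return the tokens of a sentence without repeats, in first-seen order."""
    seen = []
    for token in tokenize(sentence):
        if token not in seen:
            seen.append(token)
    return seen

print(tokenize("C Programming Language"))
```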

Bag of Words
Bag of Words is the process of generating an exclusive list of the words that occur in the text. It acts as a tool for characteristic (feature) generation.
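As a rough sketch, the exclusive word list can be built from tokenized sentences and then used to count word occurrences. The sample corpus and the helper names `build_vocabulary` and `count_vector` are hypothetical.

```python
from collections import Counter

def build_vocabulary(token_lists):
    """Return the sorted list of unique words across all token lists."""
    return sorted({token for tokens in token_lists for token in tokens})

def count_vector(tokens, vocabulary):
    """Count how often each vocabulary word occurs in one token list."""
    counts = Counter(tokens)
    return [counts[word] for word in vocabulary]

# Hypothetical tokenized sentences.
corpus = [["limit", "x", "to", "0"], ["sum", "x", "to", "n"]]
vocabulary = build_vocabulary(corpus)
print(vocabulary)
print(count_vector(["x", "to", "x"], vocabulary))
```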

Training the Data
After the data has been prepared, the model has to be trained. In the proposed approach, we start from a CSV file of sample data. The first column of the file contains the complete formula notation and the second column contains the related text. A large data sample has to be prepared in this way. The CSV file is then converted into a JSON file using the required Python libraries [4].
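The conversion from CSV to JSON can be sketched with the Python standard library. The field names `formula` and `text` and the sample rows are assumptions; the paper does not give the actual JSON layout.

```python
import csv
import io
import json

def csv_to_json(csv_text):
    """Convert two-column CSV text (formula, related text) to a JSON string."""
    reader = csv.reader(io.StringIO(csv_text))
    records = [
        {"formula": row[0].strip(), "text": row[1].strip()}
        for row in reader
        if len(row) >= 2  # skip blank or malformed rows
    ]
    return json.dumps(records, indent=2)

# Hypothetical sample rows.
sample_csv = (
    "lim x->0 f(x),limit of f as x tends to zero\n"
    "sum i=1..n i,sigma of i from one to n\n"
)
print(csv_to_json(sample_csv))
```

In practice the CSV text would be read from a file and the JSON string written to another file; strings are used here only to keep the sketch self-contained.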

Loading and Pre-processing of Data
In this step, we load the JSON data that was created for training. Let us presume that the JSON data is stored in a file named "testdata.json". After loading the data, we perform the required pre-processing operations to clean it, such as elimination of bag of words, tokenizing and stemming. The exclusive stemmed words from all the training sentences are placed in one list. A second list holds the different categories. The output of this step is the "docs" list, which contains the words of each sentence together with the category the sentence belongs to. An example document is (["limit", "x to 0", "y to 0"], "sigma") [4].
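The three lists described above could be produced roughly as follows. The record fields `category` and `sentence`, the tokenizing rule, and the use of lowercasing as a stand-in for a real stemmer are all assumptions made for illustration.

```python
import json
import re

def preprocess(json_text, stem=str.lower):
    """Build the word list, category list and docs list from JSON records."""
    records = json.loads(json_text)
    words, categories, docs = set(), set(), []
    for record in records:
        raw_tokens = re.findall(r"[A-Za-z0-9]+", record["sentence"])
        tokens = [stem(token) for token in raw_tokens]
        words.update(tokens)
        categories.add(record["category"])
        docs.append((tokens, record["category"]))
    return sorted(words), sorted(categories), docs

# Hypothetical training records; real data would be read from "testdata.json".
sample_json = json.dumps([
    {"category": "limit", "sentence": "Limit x to 0"},
    {"category": "sigma", "sentence": "Sum of x"},
])
words, categories, docs = preprocess(sample_json)
print(docs)
```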

Convert the processed data to TensorFlow requirements and initiate TensorFlow text categorization
After the above two steps the documents are still in text form, so a bag of words is applied to translate each sentence into a numeric array, because TensorFlow, being a math library, accepts data only in numeric form. A deep neural network is then built and used to train the proposed model. TensorFlow text categorization is now performed on the documents in the right form [4].
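The translation of text documents into numeric arrays can be sketched as below: each sentence becomes a binary bag-of-words row and each category becomes a one-hot row. The function name `encode` and the sample data are hypothetical, and the neural network itself is not shown; the resulting lists would be converted to arrays and passed to whichever TensorFlow model is chosen.

```python
def encode(docs, vocabulary, categories):
    """Turn (tokens, category) pairs into numeric feature and label rows."""
    features, labels = [], []
    for tokens, category in docs:
        present = set(tokens)
        # 1 where the vocabulary word occurs in the sentence, else 0.
        features.append([1 if word in present else 0 for word in vocabulary])
        # One-hot encoding of the sentence's category.
        labels.append([1 if name == category else 0 for name in categories])
    return features, labels

# Hypothetical pre-processed data.
vocabulary = ["0", "limit", "of", "sum", "to", "x"]
categories = ["limit", "sigma"]
docs = [(["limit", "x", "to", "0"], "limit"), (["sum", "of", "x"], "sigma")]

features, labels = encode(docs, vocabulary, categories)
print(features)
print(labels)
```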

Assessment of the Tensorflow Categorization Model
After training is complete, the text file is loaded into the program and every line in it is parsed with the trained neural network model to check how accurately the model retrieves math formulae.
During training, the model was able to classify all the sentences correctly. The accuracy and efficiency of retrieval depend on the size of the training document. For example, when the model is first trained with 25 lines of text data and tested with a text file containing 10 lines, the accuracy is around 98%. We also measured the time taken to run the complete program; for the above 10-line document it takes around 18-20 milliseconds. Both the time and the accuracy increase with the size of the text sample [4].
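The time measurement can be sketched as follows. The function `keyword_classifier` is a hypothetical stand-in for the trained model, so the timings it produces say nothing about the figures reported above; only the measuring pattern is illustrated.

```python
import time

def keyword_classifier(line):
    """Hypothetical stand-in for the trained model."""
    return "limit" if "limit" in line.lower() else "other"

def timed_classification(classify, lines):
    """Classify every line and report the elapsed time in milliseconds."""
    start = time.perf_counter()
    predictions = [classify(line) for line in lines]
    elapsed_ms = (time.perf_counter() - start) * 1000.0
    return predictions, elapsed_ms

sample_lines = ["limit x to 0", "hello world", "the Limit exists"]
predictions, elapsed_ms = timed_classification(keyword_classifier, sample_lines)
print(predictions, f"{elapsed_ms:.3f} ms")
```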

Experimental Analysis and Results
To calculate the efficiency on a text document, the JSON file must first be prepared for training. The JSON file is prepared from a CSV file that contains more than 120 different formulae. The JSON data is then loaded into the program to train the model. After training, sample text files are loaded to check the efficiency of the program, based on how well the trained model identifies formulae in each text file. The efficiency for test documents of various sizes is compared against the training document, and the results are tabulated in tables 1-3 [9, 10, 11, 12, 13, 14, 15, 16, 17].

Efficiency
Efficiency is measured as the number of formulae retrieved relative to the number of formulae in the training document. The efficiency of the proposed math formulae retrieval system depends on the size of the training data: it increases as the amount of math data in the training document increases.

$$\text{Efficiency} = \frac{\text{number of lines identified correctly}}{\text{total number of lines}} \times 100 \tag{1}$$
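Equation (1) can be computed directly from predicted and expected labels. The function name and the sample labels below are hypothetical; the sample reproduces the case of 19 correctly identified lines out of 20.

```python
def efficiency(predicted, expected):
    """Percentage of lines identified correctly, as in equation (1)."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")
    if not expected:
        raise ValueError("at least one line is required")
    correct = sum(1 for p, e in zip(predicted, expected) if p == e)
    return 100.0 * correct / len(expected)

# Hypothetical labels: 19 of 20 lines are identified correctly.
expected = ["formula"] * 20
predicted = ["formula"] * 19 + ["text"]
print(efficiency(predicted, expected))
```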

Time Analysis
We

Sensitivity Measure
Sensitivity measures the ratio of actual math keywords in the input text file that are exactly matched with the training document. The overall Sensitivity Measure with the TensorFlow model is presented in tables 7-9; the tables show that with a large amount of training data, more matched math formulae are retrieved from the text document, which results in high sensitivity.
Sensitivity can be expressed as:

$$\text{Sensitivity} = \frac{n(T_p)}{n(T_p) + n(F_n)}$$

where $n(T_p)$ is the number of true positives and $n(F_n)$ is the number of false negatives.

False Negative Rate
The false negative rate (FNR) is the proportion of math formulae that respond negative on the test, that is, formulae in the text document that are wrongly left unretrieved; the values are shown in tables 10-12. The data in the tables illustrate that the FNR value decreases as the number of formulae in the text document increases.

False Discovery Rate
FDR is a much simpler measure. It is the ratio of the number of unwanted math formulae retrieved from a text document to the total number of retrievals after comparison with the training document; the values are shown in tables 16-18. The FDR for the different ranges of samples is 0%, which means that no unwanted formulae are retrieved with the proposed approach.

$$\text{FDR} = \frac{n(F_p)}{n(F_p) + n(T_p)}$$

Accuracy
The accuracy of a test is its ability to correctly distinguish required math formulae from those that are not needed. Accuracy is calculated from the number of true positives and true negatives among all assessed cases, as shown in tables 19-21. The accuracy of math formulae retrieval increases as the number of samples increases.

Positive Predictive Value
Positive predictive value (PPV), also known as precision, is the proportion of relevant occurrences among the retrieved occurrences. With the proposed approach the PPV is 100% for different numbers of samples with different dominating types of formulae, as shown in tables 22-24.
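The measures in this section all derive from the four confusion-matrix counts, so they can be sketched in one helper. The formulas used are the standard definitions of these measures; the function name, the counts in the example, and the choice to return 0.0 when a denominator is zero are assumptions made for illustration, not results from the study.

```python
def retrieval_metrics(tp, fp, tn, fn):
    """Compute ratio metrics from true/false positive/negative counts."""
    def ratio(numerator, denominator):
        # Assumed convention: an undefined ratio is reported as 0.0.
        return numerator / denominator if denominator else 0.0

    return {
        "sensitivity": ratio(tp, tp + fn),
        "fnr": ratio(fn, fn + tp),
        "fpr": ratio(fp, fp + tn),
        "fdr": ratio(fp, fp + tp),
        "accuracy": ratio(tp + tn, tp + fp + tn + fn),
        "ppv": ratio(tp, tp + fp),
    }

# Hypothetical counts for a 20-line test document.
metrics = retrieval_metrics(tp=9, fp=0, tn=10, fn=1)
for name, value in metrics.items():
    print(f"{name}: {value:.2%}")
```

With no false positives, FPR and FDR are 0 and PPV is 1, which mirrors the pattern the section reports for the proposed approach.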

Conclusion
In this article an approach for retrieving mathematical formulae was proposed. The efficiency of the proposed procedure is presented in terms of time analysis and

Figure 1. Procedure of TensorFlow-based retrieval of mathematical formulae.

Figure 2. Training model and testing with text file.

n(Tp) = Number of True Positives, n(Fn) = Number of False Negatives

False Positive Rate
The false positive rate (FPR) is the proportion of items that respond positive on the test without being required, that is, formulae retrieved from the test document that do not match the training document; the values are shown in tables 13-15. The FPR value depends mainly on false positives. The number of unwanted formulae retrieved with TensorFlow is almost zero, because the retrieval of math formulae depends mainly on the training data.

$$\text{FPR} = \frac{n(F_p)}{n(F_p) + n(T_n)} \tag{4}$$

Figure 3. Overall Sensitivity Measure with TensorFlow from a document.

Figure 4. Efficiency Measure with TensorFlow from the document.
accuracy. The proposed TensorFlow-based math classification retrieves all the math formulae that match data in the training document. The efficiency of the proposed method is evaluated in terms of metrics such as Sensitivity, Efficiency, Accuracy, PPV, FDR, FPR and FNR. Because more matched formulae and no unwanted formulae are retrieved, the TensorFlow-based math classification performs strongly in terms of efficiency. The efficiency increases with the number of math formulae in the training document. The proposed method with TensorFlow produces its best results in terms of Sensitivity, Specificity, PPV, False Positive Rate and False Negative Rate.

Table 1. Overall Efficiency Measure with TensorFlow from a document of samples.

Table 2. Overall Efficiency Measure with TensorFlow from a document of samples.

Table 3. Overall Efficiency Measure with TensorFlow from a document of samples.

Tables 1-3 report the efficiency of math formulae retrieval with TensorFlow, and from them it is concluded that the efficiency increases as the number of samples increases. For example, if the program identifies 19 lines out of a text file of 20 lines, the efficiency of the proposed model is 95%. The efficiency always depends on the size of the training sample.

EAI Endorsed Transactions on Scalable Information Systems 03 2019 - 06 2019 | Volume 6 | Issue 21 | e6

Table 7. Overall Sensitivity Measure with TensorFlow from a document of 20 samples.

Table 8. Overall Sensitivity Measure with TensorFlow from a document of 40 samples.

Table 12. Overall FNR Measure with TensorFlow from a document of 60 samples.

Table 13. Overall FPR Measure with TensorFlow from a document of 20 samples.

Table 14. Overall FPR Measure with TensorFlow from a document of 40 samples.

Table 16. Overall FDR Measure with TensorFlow from a document of 20 samples.

Table 19. Overall Accuracy Measure with TensorFlow from a document of 20 samples.

Table 22. Overall PPV Measure with TensorFlow from a document of 60 samples.

Table 23. Overall PPV Measure with TensorFlow from a document of 60 samples.

Table 24. Overall PPV Measure with TensorFlow from a document of 60 samples.

Overall Accuracy Measure with TensorFlow from a document.

K. N. Brahmaji Rao, G. Srinivas and P. V. G. D. Prasad Reddy