Ambient Air Quality Estimation using Supervised Learning Techniques

The exponential increase of population in urban areas has led to deforestation and industrialization, both of which greatly affect air quality. Polluted air harms human health, and as a result the prediction of air quality has become an active research area. An important indicator for the assessment of air quality is the Air Quality Index (AQI). The objective of this paper is to build AQI prediction models using supervised learning, which is broadly divided here into classification, regression and ensemble techniques. Several techniques from each of these categories have been studied. The experimental work shows that Decision Trees among the classification techniques, Support Vector Regression among the regression techniques and the Stacking Ensemble among the ensemble techniques perform better than the other techniques in their respective categories.


Introduction
Air pollution is induced by the growing industrialization in developing countries. Air pollutants present a potential hazard to human health [1]. Therefore, there is a need to monitor the air quality on a regular basis. Air quality is quantified by a tool called the Air Quality Index (AQI), which transforms the concentrations of various air pollutants into a single value that represents the air quality [2].
A plethora of research has focused on different machine learning techniques for the accurate prediction of AQI. Veljanovska and Dimoski [3] compared the performance of artificial neural networks, support vector machines, k-nearest neighbor and decision trees for predicting the air quality index for the Republic of Macedonia. They concluded that the algorithm with the highest classification accuracy was the neural network. Liu et al. [4] predicted the air quality index of China using the Support Vector Regression (SVR) technique. The input to the algorithm was the meteorological and air quality data from multiple cities such as Beijing, Tianjin and Shijiazhuang. They showed that the prediction results improved when multi-city pollutant data and meteorological parameters were used. Rubal and Kumar [5] proposed an ensemble model based on Random Forest and Differential Evolution for predicting the air quality of the cities of Patna and Delhi. The proposed ensemble model led to performance gains when compared to the existing classifiers. Sharma et al. [6] developed a model for predicting the air quality index of Delhi based on seasonal trends using neural networks. The results showed that the error of the proposed model was much lower than that of the Auto-Regressive Integrated Moving Average (ARIMA) model. Saxena and Shekhawat [7] presented a cumulative index based on the concentrations of sulphur dioxide, nitrogen dioxide, PM10 and PM2.5. Further, they proposed a classifier based on a support vector machine using the Grey Wolf Optimizer, which takes the cumulative index as input and classifies the air quality as good or harmful. The classifier was tested on the air quality datasets of Delhi, Kolkata and Bhopal and was found to have high classification accuracy. Lei and Wan [8] forecasted the air pollution index for the city of Macau based on ensembles of the Adaptive Neuro-Fuzzy Inference System (ANFIS). It was verified that the proposed model produced promising results. Yu et al. [9] developed an approach named RAQ for the air quality prediction of Shenyang based on Random Forest, using urban sensing data that includes meteorological data, real-time traffic updates and road information collected from Baidu Map and Google Map. RAQ reported higher precision than Naïve Bayes, logistic regression, neural networks and decision trees. The accurate prediction of AQI is required so that measures to control pollution can be undertaken in advance. In this paper, the prediction of the air quality index has been carried out using various supervised learning techniques for Faridabad, Haryana.
The next section describes the mathematical model for cleaning the air quality dataset, together with the details of the study area and the parameters used. The supervised learning techniques used are summarized in Section 3. The AQI prediction methodology is presented in Section 4. The performance evaluation results for the classification and regression techniques are discussed in Section 5. Section 6 concludes the paper.

Mathematical Model for Cleansing Air Quality Dataset
According to the WHO global air pollution database, Faridabad, a city located in Haryana, is the second most polluted city in the world [10]. Therefore, the air quality dataset of Faridabad, obtained from the Central Pollution Control Board (CPCB) website [11], was selected for this work. This dataset has been preprocessed to obtain the value of AQI from the collected parameters; the mathematical model used for this is explained in Section 2.2.

Figure 1. Map of the Study Area
The parameters of the dataset include the AQI and the concentrations of carbon monoxide (CO), sulphur dioxide (SO2), nitrogen dioxide (NO2), PM2.5 and ozone.

Mathematical Model for Computing AQI
The scheme used to convert the concentrations of various air pollutants into a single value is called the air quality index. It transforms the parameter values of the pollutants into one value by numerical manipulation. The computation of AQI consists of the following two steps:

a) Calculation of Sub-Index
The sub-index $S_i$ for the concentration of pollutant $A_i$ is calculated using a sub-index function. The sub-index signifies the relationship between the pollutant concentration and its corresponding health effect. The linear relationship between the sub-index and the pollutant concentration is represented as follows:

$$S = mA + b$$

where $m$ is the slope of the line and $b$ is the intercept at $A = 0$. The general equation for the calculation of the sub-index $S_i$ given the pollutant concentration $A_p$ is as follows:

$$S_i = \frac{S_{hi} - S_{lo}}{C_{hi} - C_{lo}} \, (A_p - C_{lo}) + S_{lo}$$

where
$C_{hi}$ is the breakpoint concentration greater than or equal to the given pollutant concentration,
$C_{lo}$ is the breakpoint concentration less than or equal to the given pollutant concentration,
$S_{hi}$ is the value of the AQI corresponding to $C_{hi}$, and
$S_{lo}$ is the value of the AQI corresponding to $C_{lo}$.

b) Formation of AQI
The sub-indices of the $n$ pollutants can be combined in a simple additive form or a weighted additive form (with weights $w_i$) to calculate the overall air quality index $S$ as follows:

$$S = \sum_{i=1}^{n} S_i \quad \text{or} \quad S = \sum_{i=1}^{n} w_i S_i$$

For the calculation of AQI in India, a mathematical function consisting of the maximum operator is used:

$$S = \max(S_1, S_2, \ldots, S_n)$$

The maximum operator is used as it is not ambiguous. Further, the weighted sum cannot be computed because the effect that a combination of pollutants has on human health is not known [10].
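The two-step computation described in this section can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the breakpoint table `EXAMPLE_BREAKPOINTS` and the function names are hypothetical placeholders, not the official breakpoints used for the Faridabad dataset.

```python
# Hypothetical breakpoint table for a single pollutant.
# Each row is (C_lo, C_hi, S_lo, S_hi); the values are placeholders
# chosen for illustration, not official breakpoints.
EXAMPLE_BREAKPOINTS = [
    (0.0, 30.0, 0.0, 50.0),
    (30.0, 60.0, 50.0, 100.0),
    (60.0, 90.0, 100.0, 200.0),
]


def sub_index(concentration, breakpoints):
    """Linearly interpolate the sub-index for one pollutant concentration."""
    for c_lo, c_hi, s_lo, s_hi in breakpoints:
        if c_lo <= concentration <= c_hi:
            slope = (s_hi - s_lo) / (c_hi - c_lo)
            return slope * (concentration - c_lo) + s_lo
    raise ValueError("concentration lies outside the breakpoint table")


def overall_aqi(sub_indices):
    """Aggregate the sub-indices with the maximum operator."""
    return max(sub_indices)


# Two hypothetical pollutants that share the example table.
s_a = sub_index(45.0, EXAMPLE_BREAKPOINTS)
s_b = sub_index(75.0, EXAMPLE_BREAKPOINTS)
aqi_value = overall_aqi([s_a, s_b])
print(s_a, s_b, aqi_value)
```

In practice each pollutant would have its own breakpoint table, and the overall index is driven entirely by the worst pollutant, which is the behaviour the maximum operator is chosen for.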

Supervised Learning Techniques
Supervised learning is learning that involves an expert who is well versed with the environment. In this type of learning, the response is found based on labelled training data. The desired response is given by the expert and the predicted response by the learning system [13]. Supervised learning techniques comprise classification, regression and ensemble techniques; the target variable is categorical in classification and continuous in regression. Clustering, in contrast, is part of unsupervised learning [35,36]. Ensemble techniques combine various models to increase the prediction accuracy. The existing classification, regression and ensemble techniques are summarized in Figure 2.

Classification Techniques
Classification techniques play a major role in producing a general hypothesis that can forecast future data instances [30]. These techniques assign class labels to the test dataset given the input predictors where the response is unknown [14,15]. [32]. Support Vector Machine (SVM) performs classification by using a hyperplane to separate the various classes. The region on both sides of the hyperplane, up to the nearest training instances, is called the margin [16].

Regression Techniques
Regression techniques are used to estimate the relationship between various predictor variables and the response variable. They are used when the response variable to be predicted has continuous values.
In simple linear regression, a single predictor is used to predict the response, and in multiple linear regression, the response is a linear combination of more than one predictor variable. Support Vector Regression (SVR) depends only on a subset of the training data, as training instances whose error falls within a margin or threshold value are ignored by it. SVR uses a hyperplane to maximize the threshold and thus minimize the error [17]. In Quantile Regression, introduced by Koenker and Bassett [18], the conditional quantiles of the output are modelled as functions of the covariates of the input variables. These quantile functions are estimated by minimizing a weighted sum of errors [19]. Stepwise Regression makes use of forward, backward or both selection methods to evaluate the importance of predictors. In order to perform feature selection, it develops a sequence of linear models. In the forward selection method, predictors are added one at a time, and at each step the removal of an already added predictor is also considered when it is used in combination with the newly added one [20].
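The weighted sum of errors minimized in quantile regression can be illustrated with a short Python sketch. It assumes the standard asymmetric (pinball) weighting, in which under-predictions are weighted by the quantile level `tau` and over-predictions by `1 - tau`; the function names and the AQI readings are hypothetical and only serve the illustration.

```python
def pinball_loss(observed, predicted, tau):
    """Average quantile (pinball) loss for a quantile level tau in (0, 1)."""
    total = 0.0
    for y, y_hat in zip(observed, predicted):
        error = y - y_hat
        # Under-predictions cost tau per unit, over-predictions (1 - tau).
        total += tau * error if error >= 0 else (tau - 1.0) * error
    return total / len(observed)


def best_constant(observed, tau, candidates):
    """Pick the constant prediction with the smallest pinball loss."""
    return min(
        candidates,
        key=lambda c: pinball_loss(observed, [c] * len(observed), tau),
    )


readings = [80, 120, 150, 200, 310]  # hypothetical AQI readings
median_fit = best_constant(readings, 0.5, readings)
upper_fit = best_constant(readings, 0.9, readings)
print(median_fit, upper_fit)
```

With `tau = 0.5` the loss is symmetric and the best constant is the median of the readings; raising `tau` pushes the fit towards the upper end of the data, which is how different quantiles of the response are targeted.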

Ensemble Techniques
These methods combine various models to create a single aggregate model. Random Forest is constructed by merging the predictions made by a number of trees, where each tree is trained independently and the result is obtained by taking their average [21]. Bagging is an ensemble technique that performs classification by voting for the class chosen by the majority of its base methods. AdaBoost is an algorithm based on boosting. In each round, it constructs a new training dataset based on the weights [22]. Stacking combines the predictions from different sub models using a base model [23]. The sub models can be constructed using any classification or regression technique [24]. Majority Voting performs prediction by taking into consideration the maximum votes from various base models.
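Majority voting is simple enough to sketch directly. The snippet below is a minimal illustration, assuming each base model has already produced one class label per instance; the prediction lists are made up for the example and do not come from the experiments in this paper.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Return, for each instance, the label chosen by most base models.

    Ties are resolved in favour of the label that appears first among
    the models as listed (an arbitrary choice made for this sketch).
    """
    voted = []
    for labels in zip(*predictions_per_model):
        voted.append(Counter(labels).most_common(1)[0][0])
    return voted


# Hypothetical predictions of three base models for four instances.
tree_pred = ["Good", "Poor", "Severe", "Poor"]
bayes_pred = ["Good", "Moderately Polluted", "Severe", "Poor"]
svm_pred = ["Satisfactory", "Poor", "Poor", "Poor"]

ensemble_pred = majority_vote([tree_pred, bayes_pred, svm_pred])
print(ensemble_pred)
```

Stacking differs from this scheme in that the base-model predictions are not counted but passed as inputs to a further model that learns how to combine them.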

AQI Prediction Methodology
The air quality dataset of Faridabad, including the parameters affecting AQI, has been collected from the Central Pollution Control Board (CPCB) website. The AQI is calculated from these parameters by computing the sub-index of each parameter and then applying the maximum operator using the formula given in Section 2. Noisy and missing data have been ignored during preprocessing. Prediction has been carried out using various classification, regression and ensemble techniques. Further, the performance of these techniques has been evaluated based on various metrics. The AQI prediction methodology is depicted in Figure 3. From the parameters of the collected dataset, the sub-indices and the air quality index have been calculated using the mathematical formula. Further, AQI has been converted to a categorical variable based on the National Ambient Air Quality Standards specified in Table 1.
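The conversion of the numeric AQI into a categorical target can be sketched as follows. The six category names are the ones used in this paper; the numeric band limits in `AQI_BANDS` are an assumption based on the commonly published Indian national AQI bands and should be checked against Table 1 before use.

```python
# Assumed band limits (upper bound inclusive) paired with the six
# category names used in this paper. The limits are an assumption
# made for this sketch, to be verified against Table 1.
AQI_BANDS = [
    (50, "Good"),
    (100, "Satisfactory"),
    (200, "Moderately Polluted"),
    (300, "Poor"),
    (400, "Very Poor"),
    (500, "Severe"),
]


def aqi_category(aqi):
    """Map a numeric AQI value to its category label."""
    if aqi < 0:
        raise ValueError("AQI cannot be negative")
    for upper, label in AQI_BANDS:
        if aqi <= upper:
            return label
    # Values beyond the last band are treated as Severe in this sketch.
    return "Severe"


labels = [aqi_category(v) for v in (42, 75, 250, 430)]
print(labels)
```

The resulting labels serve as the response for the classification and ensemble techniques, while the numeric AQI remains the response for the regression techniques.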

Metrics used for AQI Prediction
The metrics used for performance evaluation of the classification techniques are Precision, Recall, Accuracy, Error rate, F1 score and ROC curve [25,28]. These metrics are discussed as follows:

• Precision
For a given class, precision is the fraction of all instances predicted as that class that are correctly predicted [34]. It is defined by:

$$\text{Precision} = \frac{T}{T + F}$$

where $T$ and $F$ are the number of true positive and false positive instances of the given class.

Given a 3 class confusion matrix, Precision for class P is defined as:
• Recall
Recall, also called Sensitivity, is the fraction of all instances that should have a given class label that are correctly predicted. Recall is defined as:

$$\text{Recall} = \frac{T}{T + F'}$$

where $T$ and $F'$ are the number of true positive and false negative predictions for a particular class.

For the given 3 class confusion Matrix:
• Accuracy
It is defined as the total number of correct predictions divided by the size of the dataset [33]. Accuracy is defined as:

$$\text{Accuracy} = \frac{\text{total correct predictions}}{N}$$

where $N$ is the size of the dataset.

• Error Rate
Error rate is the ratio of the total number of incorrect predictions to the size of the dataset. It is defined as:

$$\text{Error rate} = \frac{\text{total incorrect predictions}}{N} = 1 - \text{Accuracy}$$

• F1 Score
F1 score is a metric that is useful when the dataset is imbalanced [26,31]. It is the harmonic mean of recall and precision. F1 score is given as:

$$F1 = \frac{2 \times \text{Precision} \times \text{Recall}}{\text{Precision} + \text{Recall}}$$

• Receiver Operating Characteristic (ROC) Curve
It is a plot of the true positive rate against the false positive rate. Every point on the curve represents a pair of true positive rate (recall) and false positive rate for a threshold value [27].
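The classification metrics above can all be derived from a confusion matrix, as the following Python sketch shows. The three-class matrix is hypothetical, rows are taken to be actual classes and columns predicted classes (an assumed layout), and the F1 score is computed from the class-averaged precision and recall.

```python
def classification_metrics(matrix):
    """Compute metrics from a square confusion matrix.

    matrix[i][j] is the number of instances of actual class i that were
    predicted as class j (row/column layout assumed for this sketch).
    """
    n_classes = len(matrix)
    total = sum(sum(row) for row in matrix)
    correct = sum(matrix[k][k] for k in range(n_classes))

    precision, recall = [], []
    for k in range(n_classes):
        true_pos = matrix[k][k]
        predicted_as_k = sum(matrix[i][k] for i in range(n_classes))  # T + F
        actually_k = sum(matrix[k])                                   # T + F'
        precision.append(true_pos / predicted_as_k if predicted_as_k else 0.0)
        recall.append(true_pos / actually_k if actually_k else 0.0)

    accuracy = correct / total
    avg_p = sum(precision) / n_classes
    avg_r = sum(recall) / n_classes
    f1 = 2 * avg_p * avg_r / (avg_p + avg_r) if (avg_p + avg_r) else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "accuracy": accuracy,
        "error_rate": 1.0 - accuracy,
        "f1": f1,
    }


# Hypothetical confusion matrix for a three-class problem.
example_matrix = [
    [8, 1, 1],
    [2, 6, 2],
    [0, 2, 8],
]
scores = classification_metrics(example_matrix)
print(scores)
```

Reading the matrix column-wise gives precision and row-wise gives recall, which is why the two can differ noticeably for the same class when the dataset is imbalanced.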
The various parameters used for performance evaluation of the regression techniques are correlation coefficient, coefficient of determination, min max accuracy and mean absolute percentage error [27].

• Correlation Coefficient
The correlation coefficient measures the strength of the relationship between the input and output variables; the Pearson coefficient used here captures the linear component of that relationship. Its value ranges between +1 and -1. A negative correlation means that the response varies inversely with the predictor. A lower correlation value indicates that more predictor variables need to be added in order to explain the variation of the response. The Pearson correlation coefficient between the predictor variable $Q_i$ and the response variable $P_i$ is calculated as:

$$r = \frac{\sum_{i} (Q_i - Q'')(P_i - P'')}{\sqrt{\sum_{i} (Q_i - Q'')^2 \, \sum_{i} (P_i - P'')^2}}$$

where $Q''$ and $P''$ are the mean values of predictor $Q$ and response $P$ respectively.
• Coefficient of Determination
It can be calculated by squaring the value of the Pearson correlation coefficient. This parameter represents goodness of fit and lies between 0 and 1. It is the fraction of the variation in the response that is explained by the variation in the predictor through the linear relationship between them. The coefficient of determination is computed as:

$$R^2 = 1 - \frac{\sum_{i} (P_i - P'_i)^2}{\sum_{i} (P_i - P'')^2}$$

where $P$, $P'$ and $P''$ are the observed, predicted and average values of the response variable respectively.
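• Min Max Accuracy
Min Max accuracy is the average, over all instances, of the ratio between the minimum and the maximum of the observed and predicted values of the response variable. It is calculated as:

$$\text{Min Max Accuracy} = \frac{1}{n} \sum_{i=1}^{n} \frac{\min(P_i, P'_i)}{\max(P_i, P'_i)}$$

where $n$ is the size of the dataset and $P$ and $P'$ are the observed and predicted values of the response variable respectively.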

• Mean Absolute Percentage Error (MAPE)
MAPE specifies the difference between the actual response and the predicted response as a percentage. It is calculated as the average of the absolute error of the response in percentage. It is given by:

$$\text{MAPE} = \frac{100}{n} \sum_{i=1}^{n} \left| \frac{P_i - P'_i}{P_i} \right|$$

where $n$ is the size of the dataset and $P$ and $P'$ are the observed and predicted values of the response variable respectively.
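The four regression metrics can be computed with the standard library alone, as in the sketch below. The formulas follow the usual definitions (the coefficient of determination is taken in its common residual form, which is an assumption here), and the observed and predicted AQI values are hypothetical.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient between two equal-length sequences."""
    mean_x = sum(x) / len(x)
    mean_y = sum(y) / len(y)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def r_squared(observed, predicted):
    """Coefficient of determination in the form 1 - SS_res / SS_tot."""
    mean_obs = sum(observed) / len(observed)
    ss_res = sum((p - q) ** 2 for p, q in zip(observed, predicted))
    ss_tot = sum((p - mean_obs) ** 2 for p in observed)
    return 1.0 - ss_res / ss_tot


def min_max_accuracy(observed, predicted):
    """Mean of min(observed, predicted) / max(observed, predicted)."""
    ratios = [min(p, q) / max(p, q) for p, q in zip(observed, predicted)]
    return sum(ratios) / len(ratios)


def mape(observed, predicted):
    """Mean absolute percentage error (observed values must be non-zero)."""
    errors = [abs(p - q) / abs(p) for p, q in zip(observed, predicted)]
    return 100.0 * sum(errors) / len(errors)


observed_aqi = [100.0, 200.0, 300.0]   # hypothetical observed values
predicted_aqi = [110.0, 190.0, 300.0]  # hypothetical predictions

r = pearson(observed_aqi, predicted_aqi)
r2 = r_squared(observed_aqi, predicted_aqi)
mma = min_max_accuracy(observed_aqi, predicted_aqi)
err = mape(observed_aqi, predicted_aqi)
print(r, r2, mma, err)
```

A model that is close to the observations gives a min max accuracy near 1 and a MAPE near 0, so the two metrics move in opposite directions as the fit improves.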

Results Analysis of Air Quality Dataset
To predict the AQI, three classification techniques, namely Decision Tree, Naïve Bayes and SVM, and three ensemble techniques, namely Random Forest, Voting and Stacking, have been used. The Stacking ensemble has been built from the sub models Decision Trees, SVM, Naïve Bayes and Random Forest, with Logistic Regression as the base model. All the sub models used in stacking have also been used to perform majority voting. The performance of these classification and ensemble techniques has been evaluated based on a number of metrics. The Precision and Recall values for each class label and each classification and ensemble technique are given in Table 3 and Table 4 respectively. To calculate the F1 score for each technique, the average precision and average recall over all classes have been taken into consideration. These results are given in Table 5. Further, prediction of AQI has been carried out using various regression techniques, namely Linear Regression, Quantile Regression, Stepwise Regression and Support Vector Regression. The performance of the regression techniques has been evaluated on the metrics given in Table 6. The plots of the predicted against the observed values of AQI are shown in Figure 6. Preventing the burning of garbage in residential areas and using natural gas rather than coal in power plants are some of the measures that could be used to improve the air quality [29].

Faridabad, located at 28.4211°N 77.3078°E in the south-eastern part of Haryana, has Gurugram and Palwal as its neighbouring districts. It has a subtropical climate with hot summers and cold winters. The area gets sufficient rainfall in the summer season and some rain in the winter season. The study area is depicted in Figure 1 [12].

Symbolic learning methods perform prediction based on a set of rules. Decision trees and rule-based classifiers are types of symbolic learning methods. Decision Trees perform classification by dividing the instances based on the values of the various parameters. Perceptron-based methods consist of classification algorithms that make predictions using functions of a set of weights applied to the input vector. Statistical methods perform classification by taking into account the probability that an instance belongs to each class. Naive Bayes is represented by a directed acyclic graph with one parent node and many children nodes, where the children are assumed to be independent given the parent. Instance-based methods postpone the induction until classification and rely on distance-based measures, as in k-Nearest Neighbors (kNN).
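As an illustration of an instance-based method, the following is a minimal k-Nearest Neighbors sketch. It assumes Euclidean distance and a simple majority among the `k` closest training instances; the two-feature training data and labels are hypothetical.

```python
from collections import Counter
from math import dist


def knn_predict(train_points, train_labels, query, k=3):
    """Classify `query` by majority label among its k nearest neighbours."""
    # No model is built in advance: all work happens at prediction time.
    nearest = sorted(
        range(len(train_points)),
        key=lambda i: dist(train_points[i], query),
    )[:k]
    votes = Counter(train_labels[i] for i in nearest)
    return votes.most_common(1)[0][0]


# Hypothetical training data: (PM2.5, NO2) concentrations and AQI category.
points = [(20, 15), (25, 18), (30, 20), (150, 80), (160, 90), (170, 85)]
labels = ["Good", "Good", "Good", "Poor", "Poor", "Poor"]

clean_day = knn_predict(points, labels, (28, 19))
polluted_day = knn_predict(points, labels, (155, 82))
print(clean_day, polluted_day)
```

Because the distance is computed on raw values, features with larger numeric ranges dominate the neighbourhood, so the inputs are usually rescaled before such a method is applied.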

Figure 4. Subset of Air Quality Dataset

Figure 5. ROC Curve for Random Forest

In the above ROC curve, all six classes are shown for the Random Forest technique. In the figure, the colours red, blue, yellow, green, black and magenta show the six classes Good, Satisfactory, Moderately Polluted, Poor, Very Poor and Severe respectively.

Figure 6. Predicted vs Observed Values of AQI (a) Linear Regression (b) Quantile Regression (c) Stepwise Regression (d) Support Vector Regression

Table 1. National Air Quality Index

Table 2. Confusion Matrix for a Three Class Problem

True Positive: Correctly predicted instances for each class
True Negative: Correctly rejected instances for each class
False Positive: Incorrectly predicted instances for each class
False Negative: Incorrectly rejected instances for each class

Table 3. Precision Values for Various Classification and Ensemble Techniques

Table 4. Recall Values for Various Classification and Ensemble Techniques

From the above tables, it has been observed that the ensemble techniques have higher values of precision and recall than the classification techniques, and that Decision Trees have the highest of these values among the classification techniques. It is further observed that the lowest values of precision and recall occur for Support Vector Machines. Next, the accuracy, error rate and F1 score have been calculated for each classification and ensemble technique.

Table 5. Metrics Used for Performance Evaluation of Classification and Ensemble Techniques

From the above table, it has been observed that Decision Trees have the highest accuracy and F1 score and the lowest error rate among the classification techniques. It is also observed that the lowest accuracy and F1 score occur for Support Vector Machine. The ensemble techniques achieve higher accuracy and F1 score and lower error rate than the classification techniques. The Stacking ensemble has the highest accuracy and F1 score. The ROC curve for Random Forest is shown in Figure 5.

Table 6. Metrics Used for Performance Evaluation of Regression Techniques

From the above table, it has been found that Support Vector Regression has the highest value of min max accuracy and the lowest value of MAPE among all the regression techniques. It is further found that Linear Regression has the lowest value of accuracy and the highest value of MAPE.
EAI Endorsed Transactions on Scalable Information Systems, Online First