Relationship Among the Diameter of the Area of Influence & Refill Usage of Sri Lanka Using Anonymized Call Detail Records

Abstract—Economic activity and human mobility are two of the three pillars of socioeconomic indicators, and understanding how they correlate with each other is important to society, as it may be useful in attempting to classify people into socioeconomic levels using call detail records. Refill usage is one of the key attributes that can be taken to model socioeconomic levels, as the general sense is that people who spend more on refills are considered people with high purchasing power. This is the first time this type of research on socioeconomic status (SES) classification by CDR has been carried out in Sri Lanka, and the use of refill features for modelling has not been seen in any literature to date. This paper describes what the Diameter of the Area of Influence (DAI) is and the relationship between the DAI of an individual, which is one of many user mobility features that can be extracted from Call Detail Records (CDRs), and the amount the user refills, which is the main economic activity that can be derived from CDRs. This paper also describes a methodology to find the DAI using CDRs and how to cluster individual users using this distance.

The period of study is five months, from January 2012 to May 2012. The data includes both prepaid and post-paid users. Data pre-processing was needed to get location data by considering caller direction and removing duplicate records.

A. Finding the diameter of the area of influence
The diameter of the area of influence is the diameter of the geographical area of influence of an individual's locations during their daily activities. In this study it is computed as shown in Figure 1, rather than only considering the maximum distance (km) between the set of BTSs used to make or receive calls, which was used in [3]. All location data considered was from the period under study.

Methodology
Below is the equation to derive the haversine distance, using the Earth's radius as 6371 km:
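For reference, the haversine distance between two points in its standard textbook form is given here. The symbols are assumptions introduced for this sketch: $\varphi_1, \varphi_2$ are the latitudes and $\lambda_1, \lambda_2$ the longitudes of the two points in radians, $r = 6371$ km is the Earth's radius and $d$ is the resulting distance in km.

```latex
d = 2r \arcsin\left(
      \sqrt{\sin^2\left(\frac{\varphi_2 - \varphi_1}{2}\right)
            + \cos\varphi_1 \cos\varphi_2
              \sin^2\left(\frac{\lambda_2 - \lambda_1}{2}\right)}
    \right)
```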

B. Clustering Location Data
In order to cluster users, it is first important to filter out users who use their phone less regularly, by removing users with an average activity of less than 2 per day. For example, a user who on average makes or receives fewer than 2 calls per day is filtered out. This is done to get more accurate results, as occasional users may show distorted movement because most of their location data may not be captured. This threshold was taken from similar research [4]. Then a random sample of 100,000 users is taken from this set of frequent users, for randomisation and for ease of computation. In order to cluster the data, the haversine distance should first be normalised: the root mean square of the distances is calculated and every distance is divided by it. The normalised distance is then centred to the root mean square.
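The filtering, sampling and normalisation steps might be sketched as follows. This is a minimal illustration, not the authors' implementation: the function names, the toy user IDs and counts, and the 152-day length of the January to May 2012 period are assumptions made here. The text does not spell out the centring step, so the sketch assumes it means subtracting the mean of the normalised values.

```python
import math
import random


def frequent_users(activity_counts, days, min_per_day=2.0):
    """Keep users whose average daily activity is at least `min_per_day`."""
    return [u for u, n in activity_counts.items() if n / days >= min_per_day]


def rms_normalise(values):
    """Divide every value by the root mean square of all values."""
    rms = math.sqrt(sum(v * v for v in values) / len(values))
    return [v / rms for v in values]


def centre(values):
    """Assumption: centring is done by subtracting the mean."""
    mean = sum(values) / len(values)
    return [v - mean for v in values]


# Hypothetical activity counts over a 152-day period (user id -> calls).
activity = {"u1": 900, "u2": 120, "u3": 400, "u4": 310}
kept = frequent_users(activity, days=152)

random.seed(0)  # fixed seed so the sample is reproducible
sample = random.sample(kept, k=min(2, len(kept)))  # the study samples 100,000

dai_km = [3.0, 4.0]  # hypothetical DAI values for the sampled users
normalised = rms_normalise(dai_km)
scaled = centre(normalised)
```

Shifting every value by the same constant does not change which points k-means groups together, so the exact centring convention should not affect the resulting clusters.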
Using the k-means algorithm and drawing an elbow graph for k = 1 to 15, it is possible to find the optimal number of clusters in the normalised and centred DAI by finding the elbow point of the group sum of squares. That optimal k value is then used as the number of clusters for k-means, which labels each user with the cluster they belong to.
K-means was used as the clustering algorithm because other algorithms such as DBSCAN require initial parameters which were difficult to calculate, and most other research on similar data preferred k-means [1][2].
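The elbow procedure can be illustrated with a small one-dimensional k-means written in plain Python. This is a sketch under stated assumptions: the source does not name the software used, so `kmeans_1d`, `elbow_curve`, the quantile-based starting centres and the toy values are all hypothetical choices made here.

```python
def kmeans_1d(values, k, iters=100):
    """Lloyd's k-means on 1-D data; returns (labels, centres, within-group SS)."""
    data = sorted(values)
    # Deterministic start (an assumption): centres at evenly spaced quantiles.
    centres = [data[(2 * i + 1) * len(data) // (2 * k)] for i in range(k)]
    labels = []
    for _ in range(iters):
        # Assign each value to its nearest centre.
        labels = [min(range(k), key=lambda j: (v - centres[j]) ** 2) for v in values]
        new_centres = []
        for j in range(k):
            members = [v for v, lab in zip(values, labels) if lab == j]
            # Keep the old centre if a cluster ends up empty.
            new_centres.append(sum(members) / len(members) if members else centres[j])
        if new_centres == centres:
            break
        centres = new_centres
    wss = sum((v - centres[lab]) ** 2 for v, lab in zip(values, labels))
    return labels, centres, wss


def elbow_curve(values, k_max=15):
    """Within-group sum of squares for k = 1..k_max."""
    return {k: kmeans_1d(values, k)[2] for k in range(1, k_max + 1)}


# Hypothetical normalised DAI values forming three loose groups.
toy = [0.1, 0.2, 0.3, 1.0, 1.1, 1.2, 3.0, 3.1, 3.2]
curve = elbow_curve(toy, k_max=5)
labels3, centres3, _ = kmeans_1d(toy, 3)
```

Plotting `curve` against k and looking for the point where the drop flattens gives the elbow; on this toy data the error falls sharply up to k = 3 and changes little afterwards.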
C. Finding Refill Data for each distance cluster
First, calculate the total refill amount for each user in Rupees by summing all reloads, recharge cards and other methods of refill. Then, by selecting the 100,000 users sampled for the haversine distance calculation and joining them with the refill data, it is possible to find the refill usage of each cluster, which can be seen in TABLE II.
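The summation and join described here might look like the following sketch. The record layout `(user_id, method, amount)`, the function names and the sample values are hypothetical, and users without any refill record are simply dropped from the averages (an inner join), which is an assumption.

```python
from collections import defaultdict


def total_refill(refill_records):
    """Sum all refill amounts (Rs.) per user, across every refill method."""
    totals = defaultdict(float)
    for user_id, _method, amount in refill_records:
        totals[user_id] += amount
    return dict(totals)


def cluster_averages(dai_km, cluster_of, totals):
    """Average DAI and refill per cluster, for users present in both data sets."""
    sums = defaultdict(lambda: [0.0, 0.0, 0])
    for user_id, cluster in cluster_of.items():
        if user_id in totals:  # inner join on user id
            s = sums[cluster]
            s[0] += dai_km[user_id]
            s[1] += totals[user_id]
            s[2] += 1
    return {c: (d / n, r / n) for c, (d, r, n) in sums.items()}


# Hypothetical refill records, DAI values and cluster labels.
records = [("u1", "reload", 100.0), ("u1", "card", 50.0),
           ("u2", "reload", 200.0), ("u3", "card", 400.0)]
dai = {"u1": 2.0, "u2": 4.0, "u3": 30.0, "u4": 5.0}
clusters = {"u1": 1, "u2": 1, "u3": 2, "u4": 1}

totals = total_refill(records)
averages = cluster_averages(dai, clusters, totals)
```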

D. Clustering Refill Data
It is possible to cluster refill data separately using the same method described in the Clustering Location Data section, using the same subset of 100,000 users.
Following the methods described in Methodology, and by random sampling, the values and figures below were calculated. It should also be noted that the random samples of users taken did not show any significant variation in results, and they can be assumed to be a true representation of the larger set of users.

Results
The following steps were taken to find the above attribute:
• Find the positions of calls taken by combining voice records and cell ID records for each individual
• Find the maximum and minimum longitude and latitude positions of each individual during the period of study

Clustering using k=4, it is possible to identify from TABLE IV that there is a small number of very high refill users. These may be assumed to be communications, business salesmen, etc., which do not represent the behaviour of society at large. They can then be removed from the analysis as noise. This improves the correlation to 0.3384, which is inside the "good" feature range of 0.3-0.5 described in [3].

At the individual level it is possible to interpret the results with reasonable confidence (with a correlation of 0.34) as showing that people who have a higher diameter of the area of influence also spend more money on mobile communications, and, when averaging over the general population, with significant confidence (with a correlation of 0.98), for a developing country like Sri Lanka. This conclusion may be taken forward to use the DAI as a feature for classifying socioeconomic levels.

B. Correlation between DAI and refills
The correlation between an individual's diameter of the area of influence and the total refill amount spent during the same period was found to be 0.3026. This value was obtained after removing low-frequency users and post-paid users.

Fig. 1: Diameter of the area of influence calculation: points are the user activity locations, which lie inside the dashed circle of the area of influence

Fig. 2: Elbow curve for identifying the best k value for the diameter of the area of influence

The k-means clustering technique is applied to cluster the dataset, and the optimum number of clusters is estimated using the elbow curve, which shows the group squared sum error (GSSE) versus the number of clusters. The error measure (GSSE) drops monotonically as the number of clusters k increases, but from k=3 the drop flattens significantly, as seen in Figure 2.

• Calculate the diameter of influence using the haversine formula for each individual
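The steps for finding the DAI (locate each user's activity, take the extreme latitudes and longitudes, apply the haversine formula) can be sketched as follows. This is an illustration only: it assumes the diameter is the haversine distance between the (minimum latitude, minimum longitude) and (maximum latitude, maximum longitude) corners of a user's activity positions, which is one reading of the steps, and the function names and sample coordinates are hypothetical.

```python
import math

EARTH_RADIUS_KM = 6371.0  # radius stated in the text


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two (lat, lon) points in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    # Clamp to 1.0 to guard against rounding slightly above the asin domain.
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(min(1.0, a)))


def diameter_of_influence(positions):
    """DAI for one user from a list of (lat, lon) activity positions.

    Assumption: the diameter is the distance between the corners
    (min lat, min lon) and (max lat, max lon) of the bounding box.
    """
    lats = [lat for lat, _ in positions]
    lons = [lon for _, lon in positions]
    return haversine_km(min(lats), min(lons), max(lats), max(lons))


# Hypothetical positions for one user (roughly Colombo and Kandy).
user_positions = [(6.9271, 79.8612), (7.2906, 80.6337), (6.9271, 79.8612)]
dai_km = diameter_of_influence(user_positions)
```

For a user whose activity all falls on a single tower, the bounding box collapses to a point and the sketch returns a diameter of zero.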
Final cluster table for diameter of area of influence

The above table displays, for a sample set of users, the diameter of their area of influence, the normalised distance and finally the cluster label for each user.

EAI Endorsed Transactions on Scalable Information Systems 12 2016 - 01 2017 | Volume 4 | Issue 12 | e5

TABLE II: Cluster averages for distance and refill

The resulting set comprised over 80,000 users. If only the cluster averages are considered for the correlation between the two attributes, a correlation of 0.9870 can be found from TABLE II.
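The two kinds of correlation reported in this paper, one over individual users and one over cluster averages, can be illustrated with a short sketch. It assumes the Pearson coefficient is meant, since the text does not name the coefficient, and the helper names and toy numbers are hypothetical and unrelated to the study's data.

```python
import math


def pearson(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = math.sqrt(sum((x - mx) ** 2 for x in xs))
    sy = math.sqrt(sum((y - my) ** 2 for y in ys))
    return cov / (sx * sy)


def group_means(values, labels):
    """Mean of `values` within each label, ordered by label."""
    sums, counts = {}, {}
    for v, lab in zip(values, labels):
        sums[lab] = sums.get(lab, 0.0) + v
        counts[lab] = counts.get(lab, 0) + 1
    return [sums[lab] / counts[lab] for lab in sorted(sums)]


# Hypothetical per-user DAI (km), total refill (Rs.) and DAI cluster labels.
dai = [1, 2, 3, 10, 11, 12, 30, 31, 32]
refill = [300, 100, 200, 250, 450, 350, 500, 700, 600]
labels = [0, 0, 0, 1, 1, 1, 2, 2, 2]

r_individual = pearson(dai, refill)
r_cluster = pearson(group_means(dai, labels), group_means(refill, labels))
```

Averaging within clusters removes the user-to-user scatter inside each cluster, so the cluster-level coefficient is typically much higher than the individual-level one, as it is for these toy values.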

TABLE III: Cluster distribution for the diameter of the area of influence

From TABLE III, the diameter of the area of influence can be used to classify users as Low, Medium and High using clusters 2, 1 and 3 respectively.

C. Outcome of refill clustering
Fig. 3: Elbow curve for identifying the best k value for total refill amount

Here the elbow point is not clear, as some might say it is 3 whereas others might say it is 4.

TABLE IV: Cluster averages for refill clustering