Features Analysis of Online Shopping System Using WCM

Data mining techniques used for web information extraction are powerful tools and have also been suggested for the protection of highly sensitive data. Web sources hold a huge amount of data, which can be retrieved easily with web mining techniques because these techniques are applied according to the needs of the users. E-commerce and online shopping have seen huge growth in the business sector. Online shopping has been widely used in western countries over the last two decades, and it is now growing in the East as well, as more and more business is conducted through websites. This business can be boosted by analysing the features of successful online stores. This study analyses the features of successful online shopping websites and of websites that are less popular and accessed infrequently. Their features are extracted and compared to identify the reasons for the popularity of frequently accessed online shopping websites, and recommendations are then made to increase the traffic of unpopular online shopping websites and help them dominate online business in Pakistan. The study concludes that unpopular websites lack some features found on popular websites. One such feature is brands: people are conscious of brands and labels, so they visit and shop from websites that offer well-known, high-quality brands. Unpopular websites also have fewer categories and should broaden their range of products, especially those related to sports, fitness, bathroom accessories, technology and cosmetics. Another interesting finding is that popular websites mostly attract customers with offers such as buy one get one free, free home delivery and free gifts, and such offers consistently attract new customers.
Unpopular websites can improve their business by including the features discussed above. The results on this dataset also show that Naïve Bayes performs better than J48 in terms of efficiency and accuracy.


Introduction
The WWW is a massive collection of data. The web is enormous, diverse, flexible and dynamic. With the continuous expansion of the web in terms of traffic and of the size and complexity of websites, it is becoming increasingly difficult to find suitable and relevant information. Web content mostly consists of unstructured and heterogeneous information that is hidden within the web. Web mining is a promising field that aims to find and extract the relevant and valuable information hidden in web-related data. Data mining is the process of finding and extracting meaningful and valuable information from large amounts of data. Today every type of data is available on websites, such as text, video, images, hyperlinks, audio and metadata. On the basis of this diversity, web mining is divided into three categories: web usage mining, web content mining (WCM) and web structure mining (Figure 1.1). Web usage mining deals with the web logs and search histories generated by users while they interact with the web.
Web usage mining involves the automatic discovery of user access patterns from one or more web servers. This analysis is used for purposes such as site amendments, personalization, system enhancements, business intelligence and usage classification [1]. WCM aims to mine the content of the web, such as text, images, audio, video and search results. It is further classified into search result mining and web page content mining [3]. Web structure mining deals with the hyperlinks within the web itself. Most web graphs are structured with web pages as nodes and hyperlinks as edges between two related web pages [2]. Web mining can be further divided into the following sub-tasks:
1) Resource finding: the required web documents are retrieved.
2) Information selection and pre-processing: specific information is automatically selected and pre-processed from the extracted web documents.
3) Analysis: the patterns mined from the web are interpreted and/or validated [4].
Web mining is a technique for discovering previously unknown patterns in the behaviour of web users. These patterns provide a great deal of information, which is later transformed into knowledge that can be used to grow a business [5]. Learning about user patterns gives an overview of user behaviour on the web, which can lead to a web experience personalized to each user's usage. Web mining is a subcategory of data mining that aims to find unknown knowledge in the WWW [2]. Many other techniques have been applied to retrieve and extract information from the huge amount of data that resides on the web; a comparison of those techniques is given in [5]. Useful information is selected after the text has been indexed [6]. Information extraction depends on the selected relevant data and facts, whereas the selection of relevant documents is done by information retrieval. Web mining has become part of the Information Extraction System (IES) and the Information Retrieval System (IRS). The IES is a pre-processing stage applied before the data is mined; it also indexes text so that data can be retrieved. Machine learning (ML) is not part of web mining or directly connected with it, but it helps to improve text classification beyond what a retrieval system achieves [7].

1.1 Web Content Mining
Traditionally, the web has been searched by its content, and WCM tends to work like a search engine. It is the process of extracting relevant knowledge that is available on the web in the form of data. This kind of mining is concerned with extracting relevant text by removing noise such as navigational elements, advertisements, copyright notes and contact details. Growing applications of WCM include the automatic extraction of semantic relations and structure from the web.
Two approaches are used in WCM: the agent-based approach and the database approach. The agent-based approach is further divided into three kinds of agents, namely personalized web agents, categorizing/filtering agents and intelligent agents [3]. The mining of multimedia, semi-structured, structured and unstructured data makes WCM a complicated task. The subcategories of web content mining are shown in Figure 1.2.

Literature Review
Information extraction for online shopping systems helps to find hidden information, such as product features and specifications, among a vast number of products. In earlier work, the techniques used for information extraction from web documents were based on the HTML document. A tree of the web page is built from the HTML document, and information is retrieved through tree search methods. The leaf nodes must be text nodes, which contain the product text to be mined. Extraction is performed by parsing with a Hidden Markov Model, which then classifies the required information. This model was used to learn the attributes automatically. Gengxin Miao (2009) focuses on lists of objects that appear repeatedly, based on the tag paths in the DOM tree of the respective web documents. The visually repeating signals are identified by comparing the occurrence patterns of the tag paths, and clustering is performed based on similarity measures between tag paths. This method achieved higher accuracy than previous methods [10].
Wei Liu (2010) presents an approach that extracts products and their specifications from online shopping websites based on visual features. All the visual features of the text document, such as content features and format features, are considered and clustered based on similarity measures. This implementation also uses the DOM tree for data record extraction. The data items, which hold the product information, can then be retrieved from the extracted records [11]. Ali Ghobadi (2011) presents an improved web information extraction method based on ontology. To extract attributes that carry semantic meaning, an ontology-based method of label identification for attributes is used. This process makes use of assumptions about the information, captures the semantics of the HTML documents and extracts the information automatically [12]. Xiaoqing Zheng (2012) introduced structural semantic entropy, which is used to locate the data of interest in a web page by measuring the density of occurrence of the relevant information. This method was introduced because of the difficulty of writing and maintaining wrappers and of identifying blocks in vision-based extractors. An entropy measure is calculated to identify the density of the specified and labeled product information [13].

Methodology
The methodology used for this study is shown in Fig. 4.1. The major steps are data collection, preprocessing (i.e., attribute selection, feature weighting and tokenization), feature selection, classification and evaluation. These steps are described below.

Data Collection
Our dataset consists of text extracted from popular and unpopular online shopping websites. The text files contain data from the home pages and from nine further categories. To create the dataset, a web crawler was used to collect text from the popular and unpopular online shopping websites.

Popularity comparison between online shopping websites
The popularity of the different online shopping websites was compared online on the basis of the number of unique daily visitors, using web traffic analytics.

Data Pre-Processing
Before the feature classification method is applied, some pre-processing steps are necessary. These steps consist of tokenization, feature weighting and removal of stop words. A number of tokenization techniques are available, such as phrase-level, word-level and sentence-level tokenization.
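The pre-processing steps named above can be illustrated with a minimal Python sketch. It assumes word-level tokenization, raw term counts as the feature weights, and a deliberately tiny stop-word list; the names `STOP_WORDS`, `tokenize` and `preprocess` are hypothetical and do not correspond to the WEKA filters used in this study.

```python
import re
from collections import Counter

# Hypothetical, deliberately tiny stop-word list; a real list would be much longer.
STOP_WORDS = {"a", "an", "and", "the", "of", "on", "for", "to", "in", "is"}


def tokenize(text):
    """Word-level tokenization: lowercase the text and keep alphabetic runs."""
    return re.findall(r"[a-z]+", text.lower())


def preprocess(text):
    """Tokenize, drop stop words, and return raw term counts as feature weights."""
    tokens = [t for t in tokenize(text) if t not in STOP_WORDS]
    return Counter(tokens)


if __name__ == "__main__":
    sample = "Free home delivery on the best brands and free gifts for sports fans"
    print(preprocess(sample).most_common(3))
```

Raw counts are only one possible weighting; other schemes such as TF-IDF could be substituted in `preprocess` without changing the rest of the flow.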

Feature selection
Feature extraction or selection is the first and most important pre-processing step in classification and pattern recognition in data mining. It is considered an efficient pre-processing technique for eliminating noise, and it also reduces dimensionality. Feature selection is a major step in feature classification. In feature selection we try to remove uninformative words from the text in order to increase classification accuracy and reduce computational cost.
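As an illustration of how words can be ranked for feature selection, the following Python sketch computes the information gain (IG) of a binary "word occurs in document" feature with respect to the class label. IG is the feature selection method this paper names in connection with WEKA; the sketch is a simplified stand-alone version under that assumption, not WEKA's implementation, and the toy documents and labels are invented for the example.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total) for n in Counter(labels).values())


def information_gain(docs, labels, term):
    """IG of the binary feature 'term occurs in document' with respect to the class."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    remainder = sum(
        len(part) / len(labels) * entropy(part)
        for part in (with_term, without_term)
        if part
    )
    return entropy(labels) - remainder


if __name__ == "__main__":
    # Toy documents as token sets; words and labels are made up for illustration.
    docs = [{"free", "gucci"}, {"free", "cricket"}, {"sale"}, {"sale", "shoes"}]
    labels = ["POP", "POP", "UNPOP", "UNPOP"]
    for term in ("free", "gucci"):
        print(term, round(information_gain(docs, labels, term), 3))
```

Words with an IG close to zero carry little information about the class and are candidates for removal.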

Classification Methods
The classification algorithm learns from the training set and builds a model. The built model is then used to classify new items. We have posed the feature analysis of online shopping websites as a feature classification task in which the text-based dataset is given to the classifiers as input, the dataset is tokenized, and the output is the count of each word or token. The whole idea is presented in Fig. 3.2. We selected two algorithms: J48, a tree-based algorithm, and Naive Bayes (NB). Many empirical studies have been carried out to compare the efficiency of these algorithms.
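To make the train-then-classify flow concrete, the following Python sketch implements a small multinomial Naive Bayes classifier over token counts. It is a simplified illustration under stated assumptions (Laplace smoothing with alpha = 1, unseen words ignored at prediction time), not the WEKA implementation used in this study; J48 is not sketched here. The function names and toy data are hypothetical.

```python
import math
from collections import Counter, defaultdict


def train_nb(docs, labels):
    """Fit a multinomial Naive Bayes model on tokenized documents."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    for tokens, label in zip(docs, labels):
        word_counts[label].update(tokens)
    vocab = {w for counts in word_counts.values() for w in counts}
    return class_counts, word_counts, vocab


def predict_nb(model, tokens):
    """Return the class with the highest log posterior (Laplace smoothing, alpha=1)."""
    class_counts, word_counts, vocab = model
    total_docs = sum(class_counts.values())
    best_label, best_score = None, -math.inf
    for label, n_docs in class_counts.items():
        total_words = sum(word_counts[label].values())
        score = math.log(n_docs / total_docs)
        for w in tokens:
            if w in vocab:  # words never seen in training are ignored
                score += math.log(
                    (word_counts[label][w] + 1) / (total_words + len(vocab))
                )
        if score > best_score:
            best_label, best_score = label, score
    return best_label


if __name__ == "__main__":
    # Toy training data, invented for illustration only.
    docs = [["free", "delivery", "gucci"], ["free", "gifts", "cricket"],
            ["sale", "shoes"], ["cheap", "sale"]]
    labels = ["POP", "POP", "UNPOP", "UNPOP"]
    model = train_nb(docs, labels)
    print(predict_nb(model, ["free", "gucci"]))
```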

Confusion matrix:
A confusion matrix contains information about the predicted and actual classifications and depicts the accuracy of the solution to a classification problem. Given $n$ classes, a confusion matrix is an $n \times n$ matrix in which $C_{i,j}$ indicates the number of tuples from D that were assigned to class $C_j$ but whose correct class is $C_i$. The best solution therefore has only zero values outside the diagonal. The performance of such systems is normally assessed using the data given in the matrix.

Experimentation Tool: WEKA
We chose the WEKA software (an acronym of "Waikato Environment for Knowledge Analysis") for our experiments because it includes all the functionality essential for this work. For example, it includes the information gain (IG) feature selection method, stop-word removal, attribute selection and classification methods.
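The confusion matrix definition given in this section can be sketched in a few lines of Python. The sketch assumes that rows index the actual class and columns the predicted class; the function names and the toy label lists are hypothetical and independent of WEKA.

```python
def confusion_matrix(actual, predicted, classes):
    """Return matrix[i][j] = number of items of true class i predicted as class j."""
    index = {c: k for k, c in enumerate(classes)}
    matrix = [[0] * len(classes) for _ in classes]
    for a, p in zip(actual, predicted):
        matrix[index[a]][index[p]] += 1
    return matrix


def accuracy(matrix):
    """Fraction of all instances that lie on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


if __name__ == "__main__":
    # Toy labels, invented for illustration only.
    actual = ["POP", "POP", "UNPOP", "UNPOP", "POP"]
    predicted = ["POP", "UNPOP", "UNPOP", "POP", "POP"]
    m = confusion_matrix(actual, predicted, ["POP", "UNPOP"])
    print(m, accuracy(m))
```

A perfect classifier would produce a matrix whose off-diagonal entries are all zero, giving an accuracy of 1.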

Experimentation and Results
The results shown in this section are expressed purely in terms of the classifier success rate on the given dataset. The experimentation flow can be seen in Fig. 6.1. In this study we used two algorithms for our classification tasks. Standard 10-fold cross-validation is used to characterize the classifiers. It is a technique for assessing how well statistical results generalize to an independent dataset.
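The idea behind 10-fold cross-validation can be sketched as follows. This Python example is a simplified, unstratified version that assumes at least `k` items; the helper names, the fixed shuffle seed, and the `train_fn`/`predict_fn` callables are hypothetical placeholders for any classifier and are not part of WEKA.

```python
import random


def k_fold_indices(n_items, k=10, seed=0):
    """Shuffle item indices and split them into k nearly equal folds."""
    indices = list(range(n_items))
    random.Random(seed).shuffle(indices)
    return [indices[i::k] for i in range(k)]


def cross_validate(items, labels, train_fn, predict_fn, k=10):
    """Average accuracy over k folds; each fold is held out once for testing."""
    folds = k_fold_indices(len(items), k)
    scores = []
    for held_out in folds:
        test_set = set(held_out)
        train_idx = [i for i in range(len(items)) if i not in test_set]
        model = train_fn([items[i] for i in train_idx],
                         [labels[i] for i in train_idx])
        hits = sum(predict_fn(model, items[i]) == labels[i] for i in held_out)
        scores.append(hits / len(held_out))
    return sum(scores) / k
```

Each instance is used for testing exactly once and for training in the remaining folds, so the averaged score estimates performance on data the model has not seen.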

Conclusion
In this research, a features analysis of online shopping systems using WCM has been presented. The results were obtained with two popular classification algorithms, whose performance was compared on the given dataset. 10-fold cross-validation was used to validate the results. The results show that our approach to feature analysis is very effective. This study will help researchers to understand trends in e-commerce and encourage them to work on content mining to gain deeper knowledge of ongoing business and obtain better results in future.
According to the results of this study, unpopular websites lack brands: people are conscious of brands and labels, so they visit and shop from websites that offer well-known, high-quality brands. Unpopular websites also have fewer categories and should broaden their range of products, especially those related to sports, fitness, bathroom accessories, technology and cosmetics. Another interesting finding is that popular websites mostly attract customers with offers such as buy one get one free, free home delivery and free gifts, and such offers consistently attract new customers. Unpopular websites can improve their business by including the features discussed above. Future work includes creating autonomous agents that analyse the discovered rules in order to provide meaningful strategies or recommendations to users. The future scope of WCM includes predicting user needs in order to improve usability, scalability and user retention, and designing an efficient framework for web personalization through efficient use of web log files.

Future Directions
We selected shopping websites operating in Pakistan. However, this study can be extended to the content of other kinds of websites, for example entertainment and hosting sites. In doing so, we can build a recommendation system for setting the parameters needed to gain popularity in online business with respect to website content. It will help web developers and business owners to build online businesses in a way that makes them more popular among users. Furthermore, other researchers can easily obtain the dataset of popular and unpopular online shopping websites of Pakistan, for which data can be gathered efficiently. Once trained, the classification model can be reused for prediction.

EAI Endorsed Transactions on Scalable Information Systems 12 2017 - 04 2018 | Volume 4 | Issue 16 | e5

The entries in the confusion matrix (Table 5.1) have the following meaning in the context of our study. For the above confusion matrix, class a = 'POP' has 85 true positives and 8 false positives, whereas class b = 'UNPOP' has 35 true positives and 12 false positives. That is, the diagonal elements of the matrix, 85 + 35 = 120, correspond to the correctly classified instances, and the other elements, 12 + 8 = 20, represent the incorrectly classified instances.

True positive rate = diagonal element / sum of relevant row
False positive rate = non-diagonal element / sum of relevant row
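As a worked illustration of these definitions, the following Python sketch applies them to the counts reported above. The matrix layout (rows as the actual class, columns as the predicted class) is an assumption chosen to be consistent with the stated true and false positive counts; the function names are hypothetical.

```python
# Assumed layout (rows = actual class, columns = predicted class),
# chosen to be consistent with the counts reported in the text.
matrix = {
    "POP": {"POP": 85, "UNPOP": 12},
    "UNPOP": {"POP": 8, "UNPOP": 35},
}


def tp_rate(matrix, cls):
    """Diagonal element divided by the sum of that class's row."""
    return matrix[cls][cls] / sum(matrix[cls].values())


def fp_rate(matrix, cls):
    """Instances of other classes predicted as cls, over those classes' row sums."""
    others = [c for c in matrix if c != cls]
    wrong = sum(matrix[c][cls] for c in others)
    return wrong / sum(sum(matrix[c].values()) for c in others)


def accuracy(matrix):
    """Correctly classified instances divided by all instances."""
    correct = sum(matrix[c][c] for c in matrix)
    return correct / sum(sum(row.values()) for row in matrix.values())


if __name__ == "__main__":
    for cls in matrix:
        print(cls, round(tp_rate(matrix, cls), 3), round(fp_rate(matrix, cls), 3))
    print("accuracy", round(accuracy(matrix), 3))
```

Under this assumed layout, 120 of the 140 instances are classified correctly, which corresponds to an accuracy of about 85.7%.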

Table 6 .

Table 6 .
Table 6.7 displays 20 different words and shows how these words are distributed across popular and unpopular websites. In the table, a value of one ('1') means that the word is present in the respective category and a value of zero ('0') means that it is missing. We selected only twenty words for demonstration out of more than 10,000 words. Words such as Calvin, Alkaram and Gucci are brand names that are present on popular websites but missing from unpopular websites. Similarly, popular websites offer more products than unpopular websites, such as sports products (Tennis, Squash and Cricket), which are not present on unpopular websites. The word 'Free' appears in the popular category but not in the unpopular category; this could refer to free delivery or a free gift for customers on purchase, which attracts more users to the site.
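The 1/0 presence encoding described above can be reproduced with a short Python sketch. The vocabularies below are illustrative only: they reuse a few of the words named in the text together with invented filler words, and they are not the actual contents of Table 6.7. The function names are hypothetical.

```python
def presence_table(words, popular_vocab, unpopular_vocab):
    """Map each word to a (popular, unpopular) pair of 1/0 presence flags."""
    return {w: (int(w in popular_vocab), int(w in unpopular_vocab)) for w in words}


def distinguishing_words(table):
    """Words present on popular sites but missing from unpopular ones."""
    return sorted(w for w, (pop, unpop) in table.items() if pop == 1 and unpop == 0)


if __name__ == "__main__":
    # Illustrative vocabularies; "shoes" and "sale" are invented filler words.
    popular_vocab = {"gucci", "calvin", "alkaram", "tennis", "free", "shoes"}
    unpopular_vocab = {"shoes", "sale"}
    table = presence_table(["gucci", "free", "tennis", "shoes"],
                           popular_vocab, unpopular_vocab)
    print(distinguishing_words(table))
```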

Table 6.8 gives the standard deviation of the words that help popular websites get more hits than unpopular websites. In Fig. 6.5, a line chart shows the change in these values for unpopular and popular websites in graphical form.