Knowledge Extraction Using Web Usage Mining

Web log files are among the richest sources of knowledge today, because they record users' interactions with the web. These files contain information about website visitors and serve as the input for analysis. After preprocessing, the files are converted to the required format so that Web Usage Mining (WUM) techniques can be applied to them. WUM reveals the usage patterns of users. In this study we discover different behavior patterns from the web proxy server log file of an educational organization using web usage mining techniques. The results are based on users' interest in educational websites.


Introduction
The purpose of data mining is to discover previously unknown, useful, and understandable patterns in large volumes of data. The general steps of data mining are shown in Figure 1. Web mining applies this idea to extract knowledge from web log files. The complex infrastructure and scale of the web lead to numerous data quality issues, such as identifying page views, identifying users, and filtering robot activity [1]. The World Wide Web (WWW) holds a huge amount of data and is growing exponentially with time and usage, which makes it hard for end users to browse effectively. Maintaining a website is as important as building it. To improve and update a website, we need to know our users' interests so that we can update it according to their surfing behavior [2]. Maintenance may include design improvements informed by user patterns, which help us identify users' interests on a website. By browsing a website, users complete different tasks, such as viewing or buying products, registering for online courses, and attending classes online. Analyzing the interaction log can provide useful information that helps a website engineer enhance the website's structure so that it is easier and faster to use in the future.

Web Mining
Data mining has many applications, and web mining is one of the most common: it extracts knowledge from web log data. The quality of the extracted knowledge depends on the criteria of the selected algorithm. Web mining is further divided into the following three categories:

Web Usage Mining
WUM is the application of data mining technology to the data in web server log files [4]. It is also defined as applying data mining techniques to logs of interactions between users and a website [5]. WUM, also known as web log mining, discovers useful knowledge about users' behavioral patterns from web log repositories. The data sources of WUM are textual log files gathered at web servers. Log records hold useful information such as the IP address, URL, and time [6]. WUM mainly consists of two major techniques: statistical analysis, and pattern discovery through association rules, clustering, and sequential patterns, which is a more advanced form of web mining [7]. Both techniques require large amounts of data gathered from different sources such as proxy servers, web clients, and web servers [8]. Other sources, such as web application data, can also be used [9]. The first, statistical approach gives common, consolidated usage statistics, while the second helps identify user patterns. WUM follows the same four stages as data mining, introduced above.
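
As a minimal sketch of how the IP address, URL, and time could be pulled out of one log record, the following assumes a whitespace-separated layout (Unix timestamp, elapsed time, client IP, status, bytes, method, URL, and so on). The layout, the `LogRecord` type, the `parse_line` function, and the sample line are all illustrative assumptions; the study's actual log format is not given in the text.

```python
from datetime import datetime, timezone
from typing import NamedTuple


class LogRecord(NamedTuple):
    time: datetime
    ip: str
    url: str


def parse_line(line: str) -> LogRecord:
    """Parse one log line in the ASSUMED whitespace-separated layout:
    timestamp elapsed client_ip status bytes method url ...
    """
    fields = line.split()
    if len(fields) < 7:
        raise ValueError(f"unexpected log line: {line!r}")
    # The first field is assumed to be seconds since the Unix epoch.
    timestamp = datetime.fromtimestamp(float(fields[0]), tz=timezone.utc)
    return LogRecord(time=timestamp, ip=fields[2], url=fields[6])


# Hypothetical sample line, not taken from the study's data.
sample = ("1000000000.000 120 10.0.0.5 TCP_MISS/200 512 GET "
          "http://example.edu/index.html - DIRECT/192.0.2.1 text/html")
record = parse_line(sample)
print(record.ip, record.url, record.time.isoformat())
```

A real proxy log would need the field positions adjusted to its configured format before this sketch is usable.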

Web Content Mining
This web mining technique refers to the discovery of knowledge from collections of multimedia document objects, such as audio, image, text, and video, that are embedded in or linked to a web page [7]. WCM has two approaches: agent-based and database-based. The agent-based approach comprises personalized web agents, information filtering and categorization, and intelligent search agents. The database approach consists of web query systems and multilevel databases [10]. A number of existing techniques extract knowledge through web mining.

Web Structure Mining
This technique extracts knowledge from the links on the web and from their organization. It works on the hyperlink structure of the web; with its help, a graph structure is built that usually provides authoritativeness or ranking information and improves page search results by filtering [11].

Parvatikar and Joshi [3] concentrated on web usage mining as the study of user navigation patterns and their use of web resources. They describe the different stages involved in the mining process and present a comparative study of the pattern discovery algorithms Apriori and FP-growth. Data preprocessing is one of the important tasks before applying mining algorithms, because it converts the raw log file into user sessions. Deepa and Raajan [10] implemented preprocessing techniques that convert the log file into user sessions suitable for mining and reduce the size of the session file by filtering out the least requested pages. They briefly introduced log file preprocessing, implemented it on a CTI log file, produced a summary of the user session file, and used a filtering technique to remove the least requested resources.

Research that focuses only on creating a personalized website misses the effect of web page content. Adding this content to the knowledge of users' patterns gives a broader view for personalizing the web. The author of [15] explores the relation between users' web usage patterns and their search queries. A site-keyword graph is formed from these two attributes, and recommendations for new users are generated from it. Improving the personalization of web usage mining is also the aim of [16].
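
The preprocessing step described above (turning raw log records into user sessions, then filtering the least requested pages) can be sketched as follows. This is an illustration under stated assumptions, not the method of [3] or [10]: users are identified by IP address alone, a session ends after 30 minutes of inactivity, and pages requested fewer than `min_requests` times are dropped. The function names, the timeout, and the threshold are all hypothetical.

```python
from collections import Counter, defaultdict


def build_sessions(records, timeout=1800):
    """Group (ip, timestamp_seconds, url) records into per-user sessions.

    ASSUMPTION: a gap longer than `timeout` seconds starts a new session.
    """
    by_ip = defaultdict(list)
    for ip, ts, url in records:
        by_ip[ip].append((ts, url))

    sessions = []
    for ip, hits in by_ip.items():
        hits.sort()  # order each user's requests chronologically
        current = [hits[0][1]]
        for (prev_ts, _), (ts, url) in zip(hits, hits[1:]):
            if ts - prev_ts > timeout:
                sessions.append((ip, current))
                current = []
            current.append(url)
        sessions.append((ip, current))
    return sessions


def drop_rare_pages(sessions, min_requests=2):
    """Remove pages requested fewer than `min_requests` times overall."""
    counts = Counter(url for _, urls in sessions for url in urls)
    filtered = []
    for ip, urls in sessions:
        kept = [u for u in urls if counts[u] >= min_requests]
        if kept:  # discard sessions left empty by the filter
            filtered.append((ip, kept))
    return filtered


# Hypothetical records: (ip, seconds, url).
records = [("a", 0, "/x"), ("a", 100, "/y"), ("a", 5000, "/x"), ("b", 10, "/x")]
sessions = build_sessions(records)
filtered = drop_rare_pages(sessions)
print(sessions)
print(filtered)
```

Identifying users by IP address is a simplification; behind a proxy or shared machine, several people may appear as one user.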

Proposed Work
The aim of this system is to find the usage patterns of users at the University of Lahore, Sargodha campus. Clients request web pages, and these requests are stored in log files on the proxy server. By exploring these log files with web usage mining algorithms and techniques, we obtain users' interests on the World Wide Web. We divide our users into two groups: students and faculty. The results of this study will help the organization provide better facilities to its users.

Methodology
Web usage mining requires large amounts of data gathered from different sources such as proxy servers, web clients, and web servers [8]. Our study focuses on the web usage of an educational organization, so we obtained our log file from the organization's proxy server. Figure 4 shows the approach implemented to obtain the results.

Data Collection and Attribute Selection
Web usage mining needs data on the web pages browsed, taken from the institute's web log server. The web proxy server is the best source because all web requests are logged there in a log file. The log offers as many as 65 different variables to work with. Five variables were selected to achieve the desired result of this study. These attributes are shown in Figure 4.

Naïve Bayes Classifier
This algorithm is based on Bayes' theorem. It is a family of algorithms rather than a single algorithm, and all of them share a common assumption: each feature is treated as independent of every other feature, given the class. In this study the algorithm first identifies whether a user belongs to faculty/staff or to students. It then classifies the accessed URLs into categories and reports the percentage of each category.
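
To make the independence assumption concrete, the sketch below implements a small multinomial Naïve Bayes classifier with add-one (Laplace) smoothing: the score of a class is the log prior plus the sum of per-token log likelihoods, each token contributing independently. The class `TinyNaiveBayes`, the URL tokens, and the category names are hypothetical and chosen only for illustration; the text does not state which implementation or features the study used.

```python
import math
from collections import Counter, defaultdict


class TinyNaiveBayes:
    """Multinomial Naive Bayes with add-one smoothing (illustrative only)."""

    def fit(self, samples, labels):
        self.class_counts = Counter(labels)
        self.token_counts = defaultdict(Counter)
        self.vocab = set()
        for tokens, label in zip(samples, labels):
            self.token_counts[label].update(tokens)
            self.vocab.update(tokens)
        return self

    def predict(self, tokens):
        total = sum(self.class_counts.values())
        best_label, best_score = None, -math.inf
        for label, n in self.class_counts.items():
            # log P(class) + sum of log P(token | class), tokens independent
            score = math.log(n / total)
            denom = sum(self.token_counts[label].values()) + len(self.vocab)
            for tok in tokens:
                score += math.log((self.token_counts[label][tok] + 1) / denom)
            if score > best_score:
                best_label, best_score = label, score
        return best_label


# Hypothetical training data: tokens taken from URLs, with made-up categories.
samples = [["edu", "library"], ["edu", "course"],
           ["video", "stream"], ["social", "video"]]
labels = ["educational", "educational", "entertainment", "entertainment"]
model = TinyNaiveBayes().fit(samples, labels)
print(model.predict(["edu", "course"]))
print(model.predict(["video"]))
```

The same scheme applies to the user-identity step if the features are replaced by attributes that distinguish faculty/staff from students.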

JRip
JRip is one of the most popular and basic rule-learning algorithms. It examines the classes in order of increasing size and generates an initial set of rules for each class using incremental reduced-error pruning. It treats all the examples of a particular class in the training data as one class and finds a set of rules that covers all members of that class. It then proceeds to the next class and repeats the process until all classes have been covered.
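
The sketch below does not implement JRip's rule induction or pruning. It only illustrates two pieces of the description above: ordering the classes from smallest to largest, and applying an ordered rule list in which the first matching rule decides the class and a default class catches everything else. The attribute names (`domain_suffix`, `hour_band`), the rules, and the helper names are hypothetical.

```python
from collections import Counter


def classes_smallest_first(labels):
    """Return class names ordered by increasing number of examples."""
    counts = Counter(labels)
    return sorted(counts, key=lambda c: (counts[c], c))


def classify(record, rules, default):
    """Apply an ordered rule list: the first rule whose conditions all hold wins."""
    for conditions, label in rules:
        if all(record.get(key) == value for key, value in conditions.items()):
            return label
    return default


# Hypothetical rule list of the shape a rule learner might produce.
rules = [
    ({"domain_suffix": "edu", "hour_band": "office"}, "faculty"),
    ({"domain_suffix": "edu"}, "student"),
]

order = classes_smallest_first(["student", "student", "faculty"])
first = classify({"domain_suffix": "edu", "hour_band": "office"}, rules, "student")
second = classify({"domain_suffix": "com", "hour_band": "night"}, rules, "student")
print(order, first, second)
```

Because the rules are ordered, the more specific rule has to come before the more general one, otherwise it would never fire.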

Results
As the chart in Figure 5 shows, 65.98% of the users who connect to the web through the university-provided internet facility are students. Because of this, the institute should decide on facilities that let students find their course-related materials easily and quickly.

Conclusion
Web usage mining is a web mining technique that is useful for finding users' trends on the web, which lets us explore and meet end-user needs. The technique is equally beneficial for identifying organizational needs. In this study we chose an educational organization and examined its stakeholders. We found that, for this specific organization, faculty are more attracted to education-related websites, whereas students account for more of the internet usage, as much as 66%. Our results show that faculty and students account for 69% and 31% of accesses to educational websites, respectively. Students used the facilities provided by the organization for other content more than for educational material.

Future Work
This study is an initial step toward providing better facilities to students, and deeper study is required in the following areas. Preprocessing techniques for web logs are not readily available, so more compact and precise preprocessing techniques are needed to make the data more meaningful. The data can be used to identify usage in different departments of the university campus so that each department can improve its facilities according to its students' needs. The data is also useful for discovering hyperlink structure.
This section lists different proposed techniques and objectives that can be achieved with web mining. Wu et al. proposed a technique for WUM of click patterns in a grid computing environment [13]. Aghabozorgi et al. presented the idea of incremental fuzzy clustering for WUM [2]. Inbarani et al. proposed rough-set-based feature selection for WUM [5]. Ladekar, Pawar, et al. [14] detailed a widely used web mining algorithm that refines the output of association rule mining.

Table 2. User Ratio of Data

Table 3. Web Accessed Pages

Table 2 shows a category-wise comparison of our two types of users. Faculty accessed educational websites more than students, by a margin of 31%, while in all other categories student usage is greater than that of faculty/staff users. This shows the trend in the interests of the faculty and students of this institute.