Web User Behavior Analysis Using Improved Naïve Bayes Prediction Algorithm

International Journal of Computer Trends and Technology (IJCTT) – volume 5 number 2 –Nov 2013

K. Ashok Reddy, Assistant Professor (C.S.E), Gudlavalleru Engineering College, Gudlavalleru

B. Harindra Varma, M.Tech (C.S.E), Gudlavalleru Engineering College, Gudlavalleru

S. Narayana, Associate Professor (C.S.E), Gudlavalleru Engineering College, Gudlavalleru

Abstract: With the continued growth and proliferation of Web services and Web-based information systems, the volumes of user data have reached astronomical proportions. Analyzing such data using Web usage mining can help determine the visiting interests or needs of Web users. Because the Web log is incremental in nature, it becomes a crucial issue to predict exactly how users browse websites. Web miners therefore need predictive mining techniques that filter out unwanted categories and reduce the operational scope. Markov models and their variations have been used to analyze the Web navigation behavior of users. A user's link transitions on a particular website can be modeled using first-order, second-order, or higher-order Markov models and used to make predictions about future navigation and to personalize Web pages for an individual user. An all-higher-order Markov model promises higher prediction accuracy and improved coverage compared with any single-order Markov model, but it suffers from high state-space complexity. Hence a hybrid Markov model is required to improve operational performance and prediction accuracy significantly. The Markov model is treated as a probability model with which users' browsing behavior can be predicted at the category level, while the Bayesian theorem can be applied to infer users' browsing behavior at the web-page level. In this work, Markov models and the Bayesian theorem are combined into a two-level prediction model: the Markov model effectively filters the possible categories of websites, and the Bayesian theorem then predicts the exact website. The experiments show that the proposed model achieves a notable hit ratio for prediction.

Keywords: Web usage, Hidden Markov, Bayes, Data Mining.

I. INTRODUCTION
The Web is a huge, explosive, diverse, dynamic, and largely unstructured data repository. It supplies an incredible amount of information and also raises the complexity of dealing with that information from the different points of view of users, Web service providers, and business analysts. Users want



effective search tools to find relevant information easily and precisely. Web service providers want to predict users' behavior and personalize information in order to reduce the traffic load and design websites suited to different groups of users. Business analysts want tools for learning the users'/consumers' needs. All of them expect tools or techniques that help them satisfy their demands and solve the problems encountered on the Web. Web mining has therefore become a popular and active area and is taken as the research topic of this investigation.

Web usage mining is the application of data mining techniques to discover interesting usage patterns from Web data, in order to understand and better serve the needs of Web-based applications. Our task is related to Web usage mining: Web accesses are considered, navigation patterns are extracted, and prediction is performed. In this kind of mining, the database takes the form of Web log files, and the results are generated from that database.

Markov models have been used for studying and understanding stochastic processes, and were shown to be well suited for modeling and predicting a user's browsing behavior on a website. In general, the input for these problems is the sequence of Web pages accessed by a user, and the goal is to build Markov models [2] that can be used to model and predict the Web page that the user will most likely access next [3]. In many applications, first-order Markov models are not very accurate in predicting the user's browsing behavior, since they do not look far enough into the past to correctly discriminate the different observed patterns. As a result, higher-order models are often used. Unfortunately, these higher-order models have a number of limitations associated with high state-space complexity, reduced coverage, and sometimes even worse prediction accuracy. One method proposed to overcome these problems is clustering and cloning, which duplicates the states corresponding to pages that require a longer history to understand the choice of link that users made. Initially, when no Web log is available because the website is newly launched, the prediction or navigation decision is made using page rank; the page-rank strategy is also used to resolve ambiguity in the model. Our model uses two basic strategies for


preparing the model: page rank and a variable-length Markov model. Ambiguity in the Markov model is resolved on the basis of page rank, and page rank is also used in the initial stage, when no Web log file is available yet.

The state space of the Markov model depends on the number of previous actions used in predicting the next action. The simplest Markov model predicts the next action by looking only at the last action performed by the user. In this model, also known as the first-order Markov model, each action that a user can perform corresponds to a state in the model. A somewhat more complicated model computes the prediction by looking at the last two actions performed by the user. This is called the second-order Markov model, and its states correspond to all possible pairs of actions that can be performed in sequence. This approach generalizes to the Nth-order Markov model, which computes the prediction by looking at the last N actions performed by the user, leading to a state space that contains all possible sequences of N actions. In most applications, the first-order Markov model has low accuracy in making correct predictions, which is why extensions to higher-order models are necessary. The all-higher-order Markov model promises higher prediction accuracy and improved coverage compared with any single-order Markov model, at the expense of a dramatic increase in state-space complexity.
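To make the state-space discussion concrete, the following is a minimal sketch (not the paper's implementation) of building and querying a fixed-order Markov model from click sessions. The toy sessions, page identifiers, and function names are illustrative assumptions.

from collections import defaultdict

def build_markov_model(sessions, order=1):
    """Count transitions from the last `order` pages to the next page."""
    counts = defaultdict(lambda: defaultdict(int))
    for session in sessions:
        for i in range(len(session) - order):
            state = tuple(session[i:i + order])       # last `order` actions
            next_page = session[i + order]
            counts[state][next_page] += 1
    return counts

def predict_next(model, recent_pages, order=1):
    """Return the most frequent next page for the given history, or None."""
    state = tuple(recent_pages[-order:])
    followers = model.get(state)
    if not followers:
        return None                                   # state not seen in training
    return max(followers, key=followers.get)

# toy usage: first- and second-order models over hypothetical sessions
sessions = [["home", "products", "cart"], ["home", "products", "reviews"],
            ["home", "about", "contact"]]
m1 = build_markov_model(sessions, order=1)
m2 = build_markov_model(sessions, order=2)
print(predict_next(m1, ["products"], order=1))        # e.g. "cart" or "reviews"
print(predict_next(m2, ["home", "products"], order=2))

The second-order model illustrates the complexity trade-off: its states are page pairs, so the state space grows much faster than in the first-order case.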

II. LITERATURE SURVEY

Myra Spiliopoulou [1] suggests applying Web usage mining to website evaluation in order to determine the modifications needed, primarily in the site's page content and the link structure between pages. Eirinaki et al. [2] propose a method that incorporates link analysis, such as the page-rank measure, into a Markov model in order to provide Web path recommendations. Schechter et al. [3] utilized a tree-based data structure that represents the collection of paths inferred from the log data to predict the next page access. Chen and Zhang [4] utilized a Prediction by Partial Match forest that restricts the roots to popular nodes; assuming that most user sessions start at popular pages, the branches having a non-popular page as their root are pruned. R. Walpole, R. Myers, and S. Myers [5] showed that the Bayesian theorem can be used to predict a user's most probable next request.

The Hybrid Successive Markov Predictive (HSMP) model has been used for investigating and understanding stochastic processes and has proved well suited for modeling and predicting users' browsing behavior in the Web log scenario. In most applications, the first-order Markov model has low accuracy in making correct predictions, which is why extensions to higher-order models are necessary. The all-higher-order Markov model promises higher prediction accuracy and improved coverage compared with any single-order Markov model, at the expense of a dramatic increase in state-space complexity. Hence, the authors propose techniques for intelligently combining Markov models of different orders so that the resulting model has low state-space complexity, improved prediction accuracy, and the coverage of the all-higher-order Markov model.

To address the problems in existing work:
1) We propose a new two-tier prediction framework to improve prediction time. Such a framework can accommodate various prediction models.
2) We present an analysis study of the Markov model and the all-Kth model (a minimal back-off sketch is given after this list).
3) We propose a new modified Markov model that handles the excess memory requirements of large data sets by reducing the number of paths during the training and testing phases.
4) We conduct extensive experiments on three benchmark data sets to study different aspects of Web page prediction (WPP) using the Markov model, the modified Markov model, ARM, and the all-Kth Markov model. Our analysis and results show that the higher-order Markov model produces better prediction accuracy.
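As a rough illustration of the all-Kth back-off idea mentioned above, the sketch below trains Markov models of several orders and falls back from the highest order that covers the current history to lower ones. It is a simplified sketch under stated assumptions, not the cited authors' technique; the function names and toy sessions are invented.

from collections import defaultdict

def train_all_kth(sessions, max_order=3):
    """Train Markov models of every order from 1 to max_order."""
    models = {k: defaultdict(lambda: defaultdict(int)) for k in range(1, max_order + 1)}
    for session in sessions:
        for k in range(1, max_order + 1):
            for i in range(len(session) - k):
                models[k][tuple(session[i:i + k])][session[i + k]] += 1
    return models

def predict_all_kth(models, history):
    """Use the highest-order model that covers the history; back off otherwise."""
    for k in sorted(models, reverse=True):
        if len(history) < k:
            continue
        followers = models[k].get(tuple(history[-k:]))
        if followers:                         # covered: return the most frequent follower
            return max(followers, key=followers.get)
    return None                               # no model of any order covers this history

sessions = [["home", "cat1", "page3"], ["home", "cat1", "page7"], ["home", "cat2", "page9"]]
models = train_all_kth(sessions, max_order=2)
print(predict_all_kth(models, ["home", "cat1"]))   # second-order model answers
print(predict_all_kth(models, ["cat2"]))           # falls back to the first-order model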

III. PROPOSED SYSTEM

In this section, we propose another improved variation of the Markov model that reduces the number of paths in the model so that it fits in memory and predicts faster [1]. Web prediction is performed on the following data:

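One simple way to realize the path reduction described above is frequency-based pruning of rarely observed paths. The sketch below is only an illustration under that assumption; the threshold, function names, and toy data are not from the paper.

from collections import defaultdict

def count_paths(sessions, order=2):
    """Count every length-`order` path and its next-page followers."""
    paths = defaultdict(lambda: defaultdict(int))
    for s in sessions:
        for i in range(len(s) - order):
            paths[tuple(s[i:i + order])][s[i + order]] += 1
    return paths

def prune_paths(paths, min_support=2):
    """Drop rarely seen paths so the remaining model needs far less memory."""
    pruned = {}
    for state, followers in paths.items():
        if sum(followers.values()) >= min_support:
            pruned[state] = dict(followers)
    return pruned

sessions = [["home", "cat1", "p3"], ["home", "cat1", "p3"], ["home", "cat2", "p9"]]
full = count_paths(sessions, order=2)
small = prune_paths(full, min_support=2)   # keeps ("home","cat1"), drops ("home","cat2")
print(len(full), len(small))               # 2 1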


HMM-BASED BAYES APPROACH: BAYESIAN CLASSIFICATION
Bayesian classifiers are statistical classifiers. They can predict class membership probabilities, such as the probability that a given tuple belongs to a particular class. Bayesian classification is based on Bayes' theorem. Naïve Bayesian classifiers assume that the effect of an attribute value on a given class is independent of the values of the other attributes. This assumption is called class-conditional independence. It is made to simplify the computations involved and, in this sense, is considered "naïve." Bayesian belief networks are graphical models which, unlike naïve Bayesian classifiers, allow the representation of dependencies among subsets of attributes. Bayesian belief networks can also be used for classification.

Bayes' Theorem
Bayes' theorem is named after Thomas Bayes, a nonconformist English clergyman who did early work in probability and decision theory during the 18th century. Let X be a data tuple. In Bayesian terms, X is considered "evidence." As usual, it is described by measurements made on a set of n attributes. Let H be some hypothesis, such as that the data tuple X belongs to a specified class C. For classification problems, we want to determine P(H|X), the probability that the hypothesis H holds given the "evidence" or observed data tuple X. In other words, we are looking for the probability that tuple X belongs to


class C, given that we know the attribute description of X. P(H|X) is the posterior probability, or a posteriori probability, of H conditioned on X. For example, suppose our world of data tuples is confined to customers described by the attributes age and income, and that X is a 35-year-old customer with an income of $40,000. Suppose that H is the hypothesis that our customer will buy a computer. Then P(H|X) reflects the probability that customer X will buy a computer given that we know the customer's age and income. In contrast, P(H) is the prior probability, or a priori probability, of H. For our example, this is the probability that any given customer will buy a computer, regardless of age, income, or any other information, for that matter. The posterior probability, P(H|X), is based on more information (e.g., customer information) than the prior probability, P(H), which is independent of X. Similarly, P(X|H) is the posterior probability of X conditioned on H; that is, it is the probability that a customer X is 35 years old and earns $40,000, given that we know the customer will buy a computer. P(X) is the prior probability of X. Using our example, it is the probability that a person from our set of customers is 35 years old and earns $40,000. P(H), P(X|H), and P(X) may be estimated from the given data, as we shall see below. Bayes' theorem is useful in that it provides a way of calculating the posterior probability, P(H|X), from P(H), P(X|H), and P(X). Bayes' theorem is

P(H|X) = P(X|H) P(H) / P(X)
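A small numeric illustration of the buys-computer example may help; every count below is invented purely for illustration and does not come from the paper.

# Hypothetical counts for the buys-computer example (illustrative only).
total = 1000          # customers observed
buyers = 400          # customers who bought a computer       -> P(H) = 0.4
x_given_h = 80        # buyers aged 35 earning $40,000        -> P(X|H) = 80/400 = 0.2
x_total = 150         # all customers aged 35 earning $40,000 -> P(X) = 0.15

p_h = buyers / total
p_x_given_h = x_given_h / buyers
p_x = x_total / total

# Bayes' theorem: P(H|X) = P(X|H) * P(H) / P(X)
p_h_given_x = p_x_given_h * p_h / p_x
print(round(p_h_given_x, 3))   # 0.533: posterior probability that this customer buys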

Naïve Bayesian Classification
The naïve Bayesian classifier, or simple Bayesian classifier, works as follows:
1. Let D be a training set of tuples and their associated class labels. As usual, each tuple is represented by an n-dimensional attribute vector, X = (x1, x2, ..., xn), depicting n measurements made on the tuple from n attributes, respectively, A1, A2, ..., An.

2. Suppose that there are m classes, C1, C2, ..., Cm. Given a tuple, X, the classifier will predict that X belongs to the class having the highest posterior probability, conditioned on X. That is, the naïve Bayesian classifier predicts that tuple X


belongs to the class Ci if and only if P(Ci|X) > P(Cj|X) for 1 ≤ j ≤ m, j ≠ i. Thus we maximize P(Ci|X). The class Ci for which P(Ci|X) is maximized is called the maximum posteriori hypothesis. By Bayes' theorem,

P(Ci|X) = P(X|Ci) P(Ci) / P(X)

3. As P(X) is constant for all classes, only P(X|Ci)P(Ci) needs to be maximized. If the class prior probabilities are not known, then it is commonly assumed that the classes are equally likely, that is, P(C1) = P(C2) = ... = P(Cm). Given data sets with many attributes, it would be extremely computationally expensive to compute P(X|Ci). In order to reduce the computation involved in evaluating P(X|Ci), the naïve assumption of class-conditional independence is made. This presumes that the values of the attributes are conditionally independent of one another, given the class label of the tuple (i.e., that there are no dependence relationships among the attributes). Thus,

P(X|Ci) = P(x1|Ci) × P(x2|Ci) × ... × P(xn|Ci)

We can easily estimate the probabilities P(x1|Ci), P(x2|Ci), ..., P(xn|Ci) from the training tuples. Recall that here xk refers to the value of attribute Ak for tuple X. For each attribute, we look at whether the attribute is categorical or continuous-valued. Finally, the prediction model is applied to classify the rules [1].
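The following is a minimal sketch of the naïve Bayesian classifier described in steps 1 to 3, restricted to categorical attributes and without smoothing; the toy tuples, attribute layout, and function names are assumptions for illustration only.

from collections import Counter, defaultdict

def train_naive_bayes(rows, labels):
    """Estimate P(Ci) and P(xk|Ci) counts from categorical training tuples."""
    class_counts = Counter(labels)
    attr_counts = defaultdict(Counter)            # (class, attribute index) -> value counts
    for row, label in zip(rows, labels):
        for k, value in enumerate(row):
            attr_counts[(label, k)][value] += 1
    return class_counts, attr_counts

def predict(class_counts, attr_counts, row):
    """Pick the class maximizing P(Ci) * prod_k P(xk|Ci) (class-conditional independence)."""
    total = sum(class_counts.values())
    best_class, best_score = None, -1.0
    for ci, ci_count in class_counts.items():
        score = ci_count / total                  # prior P(Ci)
        for k, value in enumerate(row):
            score *= attr_counts[(ci, k)][value] / ci_count   # P(xk|Ci); zero if unseen
        if score > best_score:
            best_class, best_score = ci, score
    return best_class

# toy usage with invented web-survey-style tuples: (language, race, age band)
rows = [("English", "White", "30s"), ("English", "White", "40s"), ("English", "Asian", "20s")]
labels = ["Professional", "Management", "Computer"]
cc, ac = train_naive_bayes(rows, labels)
print(predict(cc, ac, ("English", "White", "30s")))   # "Professional"

A production version would add Laplace smoothing so that a single unseen attribute value does not force a class probability to zero.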

IV. RESULTS

All experiments were performed on a machine with an Intel(R) Core(TM)2 CPU at 2.13 GHz and 2 GB of RAM, running Microsoft Windows XP Professional (SP2).


Existing results: Country: Texas -> 23.0 Florida -> 54.0 Illinois -> 24.0 Ontario -> 28.0 Washington -> 35.0 Oklahoma -> 53.0 California -> 29.0 Oregon -> 26.0 Alberta -> 41.0 Kentucky -> 49.0 North_Carolina -> 18.0 Georgia -> 26.0 Pennsylvania -> 24.0 Indiana -> 55.0 Virginia -> 25.0 Australia -> 27.0 Michigan -> 28.0 Ohio -> 28.0 Connecticut -> 17.0 Rhode_Island -> 41.0 New_York -> 26.0 United_Kingdom -> 22.0 Massachusetts -> 41.0 Saskatchewan -> 34.0 Idaho -> 60.0 Wisconsin -> 17.0 New_Jersey -> 45.0 Italy -> 37.0 South_Dakota -> 23.0 Louisiana -> 28.0 Vermont -> 44.0 Missouri -> 25.0 Mississippi -> 36.0 Netherlands -> 28.0 Kansas -> 28.0 Alaska -> 69.0 Minnesota -> 28.0 Colorado -> 26.0 Maryland -> 32.0 Utah -> 28.0 Nevada -> 27.0 Washington_D.C. -> 35.0 Wyoming -> 27.0 Arizona -> 41.0 New_Hampshire -> 53.0 South_Carolina -> 53.0 Delaware -> 49.0 Tennessee -> 25.0 Sweden -> 28.0 Afghanistan -> 36.0 Iowa -> 35.0 British_Columbia -> 53.0 Arkansas -> 25.0 Montana -> 41.0 France -> 26.0 Alabama -> 39.0 Kuwait -> 50.0


Finland -> 49.0 Switzerland -> 30.0 New_Zealand -> 19.0 Belgium -> 30.0 China -> 25.0 Spain -> 25.0 Manitoba -> 16.0 Maine -> 49.0 Hong_Kong -> 51.0 Nebraska -> 44.0 Germany -> 43.0 West_Virginia -> 55.0 Brazil -> 28.0 New_Brunswick -> 27.0 Quebec -> 34.0 Other -> 33.0 Colombia -> 33.0 Hawaii -> 28.0 Japan -> 30.0 South_Africa -> 35.0 Portugal -> 30.0 New_Mexico -> 28.0 Austria -> 49.0 India -> 34.0 Namibia -> 35.0 Argentina -> 66.0 Israel -> 31.0 Ireland -> 32.0 (123/672 instances correct)

Primary_Language = English && Race = White && Actual_Time = Other && Community_Membership_Religious 0 && Community_Membership_Hobbies Computer
Primary_Language = English && Race = White && Age = 29.0 ==> Professional
Primary_Language = English && Race = White && Age = Not_Say ==> Computer
Primary_Language = English && Race = White && Not_Purchasing_Not_option > 0 && Not_Purchasing_Cant_find Computer
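The listings above and below are conjunctive rules of the form "attribute tests ==> class". As an illustration of how such a rule could be checked against a single user record, the sketch below parses only simple "attribute = value" tests; the parsing logic and the record are assumptions, not the system's actual code.

def rule_matches(rule_conditions, record):
    """Return True if every 'attribute = value' condition holds for the record."""
    for cond in rule_conditions.split("&&"):
        attr, value = [part.strip() for part in cond.split("=", 1)]
        if str(record.get(attr)) != value:
            return False
    return True

# one rule from the listing above, split into its condition and class parts
conditions = "Primary_Language = English && Race = White && Age = 29.0"
predicted_class = "Professional"

record = {"Primary_Language": "English", "Race": "White", "Age": 29.0}
if rule_matches(conditions, record):
    print(predicted_class)        # the rule fires -> "Professional"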

Accuracy for single-country prediction:
Correctly Classified Instances      16     4.381 %
Incorrectly Classified Instances    656    95.619 %

Primary_Language = English && Race = White && Age = 42.0 ==> Professional
Primary_Language = English && Race = White && Age = 37.0 && Sexual_Preference = Heterosexual ==> Professional

Proposed Approach Results:
Primary_Language = English && Actual_Time = Other && Race = White && Age = 35.0 ==> Professional

Primary_Language = English && Race = White && Age = 37.0 ==> Management

Primary_Language = English && Actual_Time = Other && Community_Membership_Religious > 0 && Who_Pays_for_Access_Self > 0 && Not_Purchasing_Security Professional

Primary_Language = English && Race = White && Age = 27.0 && Not_Purchasing_Privacy Management

Primary_Language = English && Community_Membership_Religious > 0 && Community_Membership_Family > 0 ==> Other

Primary_Language = English && Race = White && Age = 27.0 ==> Professional

Primary_Language = English && Actual_Time = Other && Community_Membership_Religious Computer

Primary_Language = English && Race = White && Age = 47.0 ==> Other


Primary_Language = English && Race = White && Age = 38.0 && Gender = Male ==> Professional


Primary_Language = English && Race = White && Age = 30.0 && How_You_Heard_About_Survey_Others Management
Primary_Language = English && Race = White && Age = 45.0 && Gender = Male ==> Professional
Primary_Language = English && Race = White && Age = 26.0 ==> Professional
Primary_Language = English && Race = White && Age = 40.0 ==> Computer
Primary_Language = English && Race = White && Age = 54.0 ==> Education
Primary_Language = English && Race = White && Age = 30.0 ==> Professional
Primary_Language = English && Age = 24.0 && Community_Membership_Hobbies Professional
Primary_Language = English && Race = White && How_You_Heard_About_Survey_Others Professional
Primary_Language = English && Falsification_of_Information = Never ==> Other
Not_Purchasing_Other

Accuracy for the proposed approach:
Correctly Classified Instances      617    93.8155 %
Incorrectly Classified Instances     55     6.1845 %

Performance Analysis:
The graph below shows the time comparison between the existing and proposed approaches.

Time (ms): Existing HMM = 23, Proposed Bayes = 11

The second graph shows the accuracy comparison between the existing HMM prediction and the proposed Bayes prediction.
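The bar chart for the time comparison did not survive extraction; the sketch below simply re-plots the two reported values (23 ms for the existing HMM and 11 ms for the proposed Bayes approach) with matplotlib, assuming that library is available.

import matplotlib.pyplot as plt

# Reported prediction times from the performance analysis above.
approaches = ["Existing HMM", "Proposed Bayes"]
times_ms = [23, 11]

plt.bar(approaches, times_ms)
plt.ylabel("Time (ms)")
plt.title("Prediction time: existing vs. proposed approach")
plt.show()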