K Means Handout
June 3, 2016 | Author: Edgar Coló
Cluster Analysis

What is Cluster Analysis?

Cluster analysis is a statistical technique used to group cases (individuals or objects) into homogeneous subgroups based on their responses to a set of variables. PASW (SPSS) 17.0 offers three clustering procedures: two‐step, k‐means, and hierarchical. K‐means clustering lets you specify the number of clusters in advance, and the procedure can be used with moderate to large datasets. The k‐means algorithm assigns each case to the cluster whose mean is closest to the case. This is an iterative process that stops once the cluster means change little between successive steps.
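The assign-then-update loop described above can be sketched in Python with NumPy. This is an illustrative implementation of the general k-means idea, not PASW's exact algorithm; the function name and the simple "spread the initial centers across the data" initialization are assumptions for the sketch.

```python
import numpy as np

def kmeans(X, k, max_iter=20, tol=1e-4):
    """Minimal k-means sketch: assign each case to the nearest
    cluster mean, recompute the means, repeat until the means
    barely move. Not PASW's exact procedure."""
    # Spread initial centers across the cases (simplified initialization)
    idx = np.round(np.linspace(0, len(X) - 1, k)).astype(int)
    centers = X[idx].astype(float)
    for _ in range(max_iter):
        # Euclidean distance from every case to every center
        d = np.linalg.norm(X[:, None, :] - centers[None, :, :], axis=2)
        labels = d.argmin(axis=1)            # nearest-center assignment
        new_centers = np.array([X[labels == j].mean(axis=0)
                                for j in range(k)])
        moved = np.linalg.norm(new_centers - centers)
        centers = new_centers
        if moved < tol:                      # means no longer change much
            break
    return centers, labels
```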
K‐Means Clustering

As an example of k‐means clustering, a sample PASW 17.0 dataset was used: telco_extra.sav, telecommunications provider data with 14 continuous variables. The continuous variables have already been standardized, with a mean of 0 and standard deviation of 1, to account for the different units in which the variables were measured. This analysis clusters customers by their service usage patterns. In PASW 17.0, go to Analyze ‐> Classify ‐> K‐Means Cluster.
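The standardization the dataset already applies (mean 0, standard deviation 1 for each variable) is an ordinary z-score transformation, which can be sketched as:

```python
import numpy as np

def standardize(X):
    """z-score each column so every variable has mean 0 and
    standard deviation 1, removing the effect of measurement units."""
    return (X - X.mean(axis=0)) / X.std(axis=0)
```

Without this step, variables measured on larger scales would dominate the Euclidean distances that k-means uses.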
Next, the K‐Means Cluster Analysis dialog appears. Select the variables Standardized log‐long distance through Standardized log‐wireless and Standardized multiple lines through Standardized electronic billing, and move them into the Variables box.

Label Cases by. Optional; place a variable here to label cases.
Number of Clusters. You must specify the number of clusters you want. For this example, type 3 in the box.
Method. The default, "Iterate and classify," iteratively recomputes the cluster means each time a case is added to or removed from a cluster; cases are then classified based on the updated cluster centers. With "Classify only," cases are classified based on the initial cluster centers, which are not iteratively updated. For this example, "Iterate and classify" is chosen.
Cluster Centers. You can read initial cluster centers from a file ("Read initial") or save the final cluster centers to a file ("Write final"). This example uses neither option.
Click the Iterate button; the K‐Means Cluster Analysis: Iterate box appears. Change Maximum Iterations to 20 and click Continue.

Maximum Iterations. Sets the maximum number of iterations.
Convergence Criterion. By default, iteration terminates once the largest change in any cluster mean is less than 2% of the minimum distance between the initial cluster centers.
Use running means. If this box is checked, cluster centers are updated after each case is classified, instead of after all of the cases are classified.
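The default convergence criterion can be made concrete with a short sketch: the stopping threshold is 2% of the smallest pairwise distance between the initial centers. The function name is illustrative.

```python
import numpy as np
from itertools import combinations

def convergence_threshold(initial_centers, criterion=0.02):
    """Iteration stops once the largest change in any cluster mean
    falls below `criterion` (default 2%) times the minimum distance
    between the initial cluster centers."""
    dmin = min(np.linalg.norm(a - b)
               for a, b in combinations(initial_centers, 2))
    return criterion * dmin
```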
Click Options in the K‐Means Cluster Analysis dialog box. Check Initial cluster centers, ANOVA table, Cluster information for each case, and Exclude cases pairwise. Click Continue, then OK.

Initial cluster centers. Prints the initial variable means for each cluster in the output.
ANOVA table. An ANOVA F‐test is conducted for each variable to indicate how well that variable discriminates between clusters.
Cluster information for each case. Prints each case's final cluster assignment and the Euclidean distance between the case and its cluster center in the output.
Missing Values. The default is listwise deletion. In this example there are many missing values because most customers did not subscribe to all services, so excluding cases pairwise maximizes the information you can obtain from the data.
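One way to think about pairwise exclusion is that a case's distance to a center is computed only over the variables the case actually has, then rescaled to the full variable count. The sketch below illustrates that idea; it is an assumption about the general approach, not PASW's documented formula.

```python
import numpy as np

def nan_euclidean(x, center):
    """Euclidean distance ignoring the case's missing variables,
    rescaled to the full number of variables (pairwise-exclusion
    style). Illustrative sketch, not PASW's exact computation."""
    mask = ~np.isnan(x)
    if not mask.any():
        return np.nan  # no observed variables at all
    sq = ((x[mask] - center[mask]) ** 2).sum()
    # Rescale so cases with more missing values are comparable
    return np.sqrt(sq * len(x) / mask.sum())
```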
K‐Means Clustering Interpretation

The Initial Cluster Centers table shows the first step of k‐means clustering: choosing the k initial centers.
The Iteration History table shows how many iterations were needed before the cluster centers stopped changing substantially.
The Cluster Membership table lists the cluster each case belongs to and the Euclidean distance from each case to its cluster center. Below is a printout of the first and last 10 cases. Visually inspect the distances to check for outliers, which may not adequately reflect the population.
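The distances reported in the Cluster Membership table are just each case's Euclidean distance to its own assigned center, which can be computed as:

```python
import numpy as np

def case_distances(X, centers, labels):
    """Euclidean distance from each case to the center of the
    cluster it was assigned to (as in the Cluster Membership table)."""
    return np.linalg.norm(X - centers[labels], axis=1)
```

Unusually large values flag the candidate outliers the handout suggests inspecting.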
The Final Cluster Centers table below lets you describe the clusters in terms of the variables. For example, customers in Cluster 1 tend to purchase many services, as evidenced by values above the mean on all variables. Customers in Cluster 2 tend to purchase the "calling" services, shown by positive values on the four calling services (caller ID, call waiting, call forwarding, and 3‐way calling). Customers in Cluster 3 tend to spend very little and purchase few services; they have negative values on most of the variables.
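Because the variables are standardized, a positive center value means the cluster sits above the overall mean on that variable. A small helper (the function name is illustrative) makes this reading of the table mechanical:

```python
import numpy as np

def above_mean_services(center, names):
    """For one cluster's standardized center, return the variables
    with values above the overall mean (positive z-scores) --
    the services this cluster tends to purchase."""
    return [n for n, v in zip(names, center) if v > 0]
```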
The Distances between Final Cluster Centers table shows the Euclidean distances between the final cluster centers. Greater distances between clusters indicate greater dissimilarity. Clusters 1 and 3 are the most dissimilar, while Cluster 2 is about equally similar to Clusters 1 and 3.
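The between-center distances in that table are plain pairwise Euclidean distances, which can be reproduced directly from the final centers:

```python
import numpy as np

def center_distances(centers):
    """Matrix of Euclidean distances between all pairs of final
    cluster centers (larger = more dissimilar clusters)."""
    diff = centers[:, None, :] - centers[None, :, :]
    return np.linalg.norm(diff, axis=2)
```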
The ANOVA table indicates which variables contribute most to your cluster solution. Variables with large error (within‐cluster) mean squares provide the least help in differentiating between clusters. For example, long distance and calling card had the two largest error mean squares (and the lowest F statistics); therefore, these two variables were not as helpful as the others in forming and differentiating the clusters.
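The per-variable F statistic in that table is the usual one-way ANOVA ratio, with clusters as the groups: between-cluster mean square divided by within-cluster (error) mean square. A sketch for a single variable:

```python
import numpy as np

def anova_f(x, labels, k):
    """One-way ANOVA F for one variable across k clusters:
    between-cluster mean square / within-cluster mean square.
    Large F means the variable separates the clusters well."""
    n = len(x)
    grand = x.mean()
    groups = [x[labels == j] for j in range(k)]
    ss_between = sum(len(g) * (g.mean() - grand) ** 2 for g in groups)
    ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
    ms_between = ss_between / (k - 1)   # k - 1 degrees of freedom
    ms_within = ss_within / (n - k)     # n - k degrees of freedom
    return ms_between / ms_within
```

Note that because the clusters were chosen to maximize separation, these F-tests are descriptive only, not valid significance tests.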
The Number of Cases in each Cluster table shows how the cases split across the clusters. A large number of cases were assigned to the third cluster, which is the least profitable group.
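Given the final cluster assignments, this table is just a count of cases per cluster label:

```python
import numpy as np

def cluster_sizes(labels, k):
    """Number of cases assigned to each of the k clusters
    (as in the Number of Cases in each Cluster table)."""
    return np.bincount(labels, minlength=k)
```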