Review on Density-based Clustering - DBSCAN, DenClue & GRID
Clustering
Density-based clustering
Abraham Otero Quintana, Ph.D. Madrid, July 5th 2010
1
Course outline:
3. Density-based clustering
3.1. DBSCAN (Density Based Spatial Clustering of Applications with Noise)
3.2. Grid Clustering
3.3. DENCLUE (DENsity CLUstEring)
3.4. More algorithms
Unsupervised Pattern Recognition (Clustering)
For an overview of these techniques, please read Tan2006 and Berkhin2002 from /Docs. Some of the slides shown here are taken from the publicly available repository of the same book. Source: http://wwwusers.cs.umn.edu/~kumar/dmbook/index.php
3. Density based clustering
• A cluster is a dense region of points, separated from other regions of high density by low-density regions.
• Used when the clusters are irregular or intertwined, and when noise and outliers are present.
(Figure: a data set with 6 density-based clusters)
Density based clustering tries to identify those dense (highly populated) regions of the multidimensional space and separate them from other dense regions. For a review, please read Tan2006 and Ester1996 from /Docs.
3.1 DBSCAN: Definitions
• A point is a core point if it has more than a specified number of points (MinPts) within a radius Eps (these points are in the interior of a cluster).
• A border point has fewer than MinPts points within Eps, but is in the neighborhood of a core point.
• A noise point is any point that is neither a core point nor a border point.
DBSCAN is based on the following definitions.
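The definitions above can be sketched in code. The following is a minimal illustration (function and variable names are our own, not from the course material), labelling each point of a small data set as core, border or noise given Eps and MinPts, and assuming the usual "at least MinPts neighbors (including itself)" convention for core points:

```python
import numpy as np

def classify_points(X, eps, min_pts):
    """Return a list with 'core', 'border' or 'noise' for each row of X."""
    n = len(X)
    # Pairwise Euclidean distances between all points
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    # A point's Eps-neighborhood includes the point itself here
    neighbors = dists <= eps
    is_core = neighbors.sum(axis=1) >= min_pts
    labels = []
    for i in range(n):
        if is_core[i]:
            labels.append('core')
        elif neighbors[i][is_core].any():   # within Eps of some core point
            labels.append('border')
        else:
            labels.append('noise')
    return labels
```

For instance, with a tight group of five points, one point hanging off its edge, and one far-away point, the three labels appear exactly as in the definitions.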
3.1 DBSCAN: Algorithm
• Classify points as noise, border, or core
• Eliminate noise points
• Perform clustering on the remaining points
Demo: http://webdocs.cs.ualberta.ca/~yaling/Cluster/Applet/Code/Cluster.html
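The three steps can also be sketched end-to-end. The following is an illustrative implementation, not the course's reference code: it connects core points lying within Eps of each other, attaches border points to the cluster of a nearby core point, and leaves noise labelled -1:

```python
import numpy as np

def dbscan(X, eps, min_pts):
    """Return a cluster label per point; -1 marks noise."""
    n = len(X)
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    neighbors = dists <= eps
    is_core = neighbors.sum(axis=1) >= min_pts

    labels = np.full(n, -1)          # -1 = noise until proven otherwise
    cluster = 0
    for i in range(n):
        if not is_core[i] or labels[i] != -1:
            continue
        # Grow a cluster from this unlabeled core point (breadth-first)
        labels[i] = cluster
        frontier = [i]
        while frontier:
            j = frontier.pop()
            for k in np.flatnonzero(neighbors[j]):
                if labels[k] == -1:
                    labels[k] = cluster      # border or core point joins
                    if is_core[k]:
                        frontier.append(k)   # only core points keep expanding
        cluster += 1
    return labels
```

Note that border points receive a label but are never expanded from, which is why DBSCAN clusters are held together only by chains of core points.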
3.1 DBSCAN: Example
(Figure: original points and the resulting clustering; point types core, border and noise; Eps = 10, MinPts = 4)
3.1 DBSCAN: Example
Features:
• Resistant to noise
• Can handle clusters of different shapes and sizes
But it has trouble with:
• Varying densities
• High-dimensional data
(Figure: original points and clustering results with MinPts = 4 and Eps = 9.75 vs. Eps = 9.92)
As we have seen, DBSCAN is quite insensitive to outliers and can handle nonglobular shapes. However, DBSCAN is not a panacea: it is rather sensitive to varying densities and usually does not work well with high-dimensional data, since in such spaces samples are much sparser.
3.1 DBSCAN: Example
Pixels are represented as 6-dimensional vectors (location + color) and segmented using DBSCAN. The full study can be seen at Ye2003 in the course CD.
3.1 DBSCAN: Parameter determination
• For MinPts a small number is usually employed.
  – For two-dimensional experimental data it has been shown that 4 is the most reasonable value.
• Eps is trickier, as we have seen. A possible solution follows.
3.1 DBSCAN: Parameter determination
• The idea is that for points in a cluster, their kth nearest neighbors are at roughly the same distance.
• Noise points have their kth nearest neighbor at a farther distance.
• So, plot the sorted distance of every point to its kth nearest neighbor.
(Figure: sorted k-dist curve; the knee of the curve marks a reasonable Eps, and k = 4 is a reasonable MinPts for 2D data)
This algorithm is rather simple, but it strongly depends on the parameters MinPts and Eps. MinPts is usually a low number (for 2D data it has been experimentally shown that 4 is a reasonable value). Then Eps can easily be determined by sorting, for every point, the distance to its 4th closest point: noise points tend to be far from all the rest.
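This heuristic can be sketched as follows (an illustrative helper, with k = MinPts = 4 as suggested for 2D data). Plotting the returned values and looking for the knee of the curve suggests a reasonable Eps:

```python
import numpy as np

def sorted_kdist(X, k=4):
    """Sorted distance of every point to its k-th nearest neighbor."""
    dists = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    dists.sort(axis=1)            # column 0 is each point's distance to itself
    return np.sort(dists[:, k])   # k-th nearest neighbor distance, ascending
```

On data with a dense group plus an outlier, the curve stays low for cluster points and jumps sharply for the noise point, which is exactly the knee one reads Eps from.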
3.2 Grid clustering
Basic algorithm:
1. Define a set of grid cells
2. Compute the density of cells
3. Eliminate cells with a density smaller than a threshold
4. Form clusters from contiguous cells
Those wanting to know more about grid clustering, please read Hinneburg1999 from /Docs. The basic algorithm for grid clustering is rather simple: form clusters from contiguous dense cells. However, this definition leaves a number of ambiguities:
- How to define cells: regular/irregular grids, cell size (too large is not accurate, too small may yield empty cells)
- How to define the threshold: it depends on the cell size and the dimensionality of the data
- What kind of adjacency is considered: for instance in 2D, 4 or 8 neighbours
Grid clustering is the basic idea behind many other clustering algorithms: WaveCluster, Bang, Clique, and Mafia.
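The four basic steps can be sketched as follows. This is a minimal illustration that resolves the ambiguities above in one particular way (regular grid, fixed count threshold, 4-adjacency in 2D); all names and parameter choices are ours:

```python
import numpy as np
from collections import deque

def grid_cluster(X, cell_size, min_count):
    """Map each dense grid cell to a cluster id."""
    # 1. Define the grid: assign each point to a cell by integer division
    cells = {}
    for p in X:
        key = tuple(np.floor(p / cell_size).astype(int))
        cells.setdefault(key, []).append(p)
    # 2-3. Compute cell densities and keep only cells above the threshold
    dense = {k for k, pts in cells.items() if len(pts) >= min_count}
    # 4. Form clusters from contiguous dense cells (4-adjacency)
    labels, cluster = {}, 0
    for start in dense:
        if start in labels:
            continue
        labels[start] = cluster
        queue = deque([start])
        while queue:
            cx, cy = queue.popleft()
            for nb in [(cx + 1, cy), (cx - 1, cy), (cx, cy + 1), (cx, cy - 1)]:
                if nb in dense and nb not in labels:
                    labels[nb] = cluster
                    queue.append(nb)
        cluster += 1
    return labels   # dense cell -> cluster id
```

Two adjacent dense cells end up in the same cluster, while a distant dense cell starts a new one, which is the whole idea of step 4.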
3.3 DENCLUE: Definitions
$$ f_{\text{kernel}}^{D}(\mathbf{x}) \;=\; \sum_{i=1}^{n} f_{\text{kernel}}(\mathbf{x}, \mathbf{x}_i) \;=\; \sum_{i=1}^{n} e^{-\frac{\operatorname{dist}^2(\mathbf{x}, \mathbf{x}_i)}{2\sigma^2}} $$

where $f_{\text{kernel}}(\mathbf{x}, \mathbf{x}_i)$ is the influence function.
This algorithm estimates the local density of the input data in a way very similar to the kernel probability density function estimators. The kernel, here called influence function, is “copied” to each data position yielding the density function. Local maxima of the density function are called density attractors. Those interested in the original paper of DENCLUE may read Hinneburg1998 from /Docs. Those interested in knowing more about probability density function (PDF) estimators, please, read Raykar2002 from /Docs.
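The density function above can be sketched directly (an illustrative helper; sigma is the kernel width): the Gaussian influence of every data point x_i is summed at a query location x.

```python
import numpy as np

def density(x, data, sigma):
    """Kernel density estimate at x: sum of Gaussian influences of all data points."""
    d2 = np.sum((data - x) ** 2, axis=1)           # squared distances dist^2(x, x_i)
    return np.sum(np.exp(-d2 / (2 * sigma ** 2)))  # sum of influence functions
```

By construction the estimate is higher inside a dense group than far from it, and a single data point contributes exactly 1 at its own location.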
3.3 DENCLUE: Clustering
Center-defined cluster: the set $\{x \mid f_{\text{kernel}}^{D}(x) \ge \xi\}$ around a single density attractor.
Multicenter-defined cluster: a set of center-defined clusters linked by a path of significance $\xi$.
Generalizes hierarchical clustering!
Clusters are formed by a level of significance ξ. To know more about the connection between DENCLUE clustering and Level Set methods, please read Yip from /Docs.
3.3 DENCLUE: Algorithm
1. Grid the data set (use r = σ, the std. dev.)
2. Find (highly) populated cells (use a threshold ξc) (shown in blue)
3. Identify populated cells (+ nonempty cells)
4. Find density attractor points, C*, using hill climbing:
   1. Randomly pick a point, pi.
   2. Compute the local density (use r = 4σ)
   3. Pick another point, pi+1, close to pi, and compute the local density at pi+1
   4. If LocDen(pi) < LocDen(pi+1), climb
   5. Put all points within distance σ/2 of the path pi, pi+1, …, C* into a "density attractor cluster" called C*
5. Connect the density attractor clusters, using a threshold, ξ, on the local densities of the attractors.
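Step 4's hill climbing can be sketched as a simple gradient-free ascent on the kernel density (the step rule and stopping test here are illustrative simplifications of ours, not the exact grid-based procedure of the slides):

```python
import numpy as np

def density(x, data, sigma):
    """Kernel density estimate at x (sum of Gaussian influences)."""
    d2 = np.sum((data - x) ** 2, axis=1)
    return np.sum(np.exp(-d2 / (2 * sigma ** 2)))

def climb(p, data, sigma, step=0.05, max_iter=200):
    """Move p uphill on the density surface until no nearby move improves it."""
    p = np.array(p, dtype=float)
    for _ in range(max_iter):
        # Candidate moves: a small step along each axis direction
        directions = np.vstack([np.eye(len(p)), -np.eye(len(p))])
        candidates = [p + step * d for d in directions]
        best = max(candidates, key=lambda c: density(c, data, sigma))
        if density(best, data, sigma) <= density(p, data, sigma):
            return p          # local maximum reached: a density attractor
        p = best
    return p
```

Started anywhere near a dense group, the climb ends at (a lattice point next to) the group's density attractor.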
3.3 DENCLUE: Examples
In the slide we show a couple of examples of how DENCLUE clusters data, according to the algorithm presented in Hinneburg1998.
3.3 DENCLUE: Examples
In the slide we show a couple of examples of how DENCLUE clusters data, according to the algorithm presented in Yip.
3.3 DENCLUE: Features
• Dependence on the kernel width
• It generalizes DBSCAN, K-means and Hierarchical Clustering
• Very efficient implementation
DENCLUE has a few positive features; however, it is not free from drawbacks, such as its dependence on a user-defined parameter (the kernel width).
3.4 DBC: More algorithms
• Generalized DBSCAN: any divergence function can be used, and points within a neighbourhood are weighted according to their similarity to the core point.
• Fuzzy DBSCAN: fuzzy distance between fuzzy input vectors.
• DBCLASD: assumes uniform density; no parameters required.
• Recursive DBC: adaptive change of DBSCAN parameters.
• WaveCluster: uses wavelets to determine multiresolution clusters.
• OPTICS: equivalent to DBC with a wide range of parameters.
• Knn DBC: assigns cluster labels taking into account the k nearest neighbours.
• KerdenSOM: self-organizing structure on the density estimation.
• STING (STatistical INformation Grid): quadtree space division, very efficient.
• Information Theoretic Clustering: measures the distance between cluster distributions using information theory.
To know more about:
- Generalized DBSCAN, please read Sander1998 from /Docs.
- Fuzzy DBSCAN, please read Kriegel2005 from /Docs.
- DBCLASD, please read Xu1998 from /Docs.
- Recursive DBC, please read Su2001 from /Docs.
- WaveCluster, please read Sheikholeslami1997 from /Docs.
- OPTICS, please read Ankerst1999 from /Docs.
- Knn DBC, please read Tran2003 from /Docs.
- KerdenSOM, please read Pascual2001 from /Docs.
- STING, please read Wang1997 from /Docs.