Describe/summarize or find structure in what we have observed:
- Data summarization and visualization can be seen as simple forms of descriptive analytics
- However, most frequently descriptive modeling is associated with clustering
- The notion of similarity is strongly related to the notion of distance between observations
- Similarity can be measured as the inverse of distance
- Proximity refers to either a similarity or a dissimilarity
- Similarity: a numerical measure of how alike 2 data objects are
    - Higher when objects are more alike
    - Often falls in the range [0,1]
- Dissimilarity: a numerical measure of how different 2 data objects are
    - Lower when objects are more alike
    - Minimum dissimilarity is often 0
    - Upper limit varies
A dissimilarity measure can be expressed by a distance metric. Distance metrics d have some well-known properties:
![[Pasted image 20220208212925.png]]
![[Pasted image 20220208212943.png]]
![[Pasted image 20220208213059.png]]
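For reference (these are the standard properties, presumably what the embedded figures show), together with the Minkowski family of distances, which includes the Manhattan ($p=1$) and Euclidean ($p=2$) cases:

$$d(x,y)\ge 0,\qquad d(x,y)=0\iff x=y,\qquad d(x,y)=d(y,x),\qquad d(x,z)\le d(x,y)+d(y,z)$$

$$d_p(x,y)=\Big(\sum_{i=1}^{m}\lvert x_i-y_i\rvert^{p}\Big)^{1/p}$$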
More examples:
- Canberra distance
- Jaccard Coefficients
- Cosine similarity
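A small sketch of how these could be computed with SciPy (the vectors are made up; note that SciPy's `cosine` and `jaccard` functions return dissimilarities, so the similarity is 1 minus the returned value):

```python
import numpy as np
from scipy.spatial.distance import canberra, cosine, jaccard

x = np.array([1.0, 0.0, 3.0, 4.0])
y = np.array([2.0, 1.0, 0.0, 4.0])

print("Canberra distance:", canberra(x, y))      # sum of |x_i - y_i| / (|x_i| + |y_i|)
print("Cosine similarity:", 1 - cosine(x, y))    # scipy returns the cosine *distance*

# Jaccard works on binary vectors: |intersection| / |union| of the 1s
a = np.array([1, 1, 0, 1], dtype=bool)
b = np.array([1, 0, 0, 1], dtype=bool)
print("Jaccard coefficient:", 1 - jaccard(a, b)) # scipy returns the Jaccard *dissimilarity*
```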
Problems:
- different scales of variables
- different importance of variables
- different types of data
![[Pasted image 20220208214025.png]]
![[Pasted image 20220208214049.png]]
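A common remedy for the different-scales and different-importance problems (presumably what the embedded figures illustrate) is to standardize each variable and, optionally, weight it. A minimal sketch on made-up data:

```python
import numpy as np

X = np.array([[170.0, 70_000.0],   # e.g. height in cm, income in currency units
              [180.0, 30_000.0],
              [165.0, 55_000.0]])

# z-score standardization: each column gets mean 0 and standard deviation 1,
# so no variable dominates the distance just because of its scale
X_std = (X - X.mean(axis=0)) / X.std(axis=0)

# optional per-variable weights to express different importance
w = np.array([1.0, 2.0])
d_weighted = np.sqrt((w * (X_std[0] - X_std[1]) ** 2).sum())
print(X_std)
print("weighted Euclidean distance between rows 0 and 1:", d_weighted)
```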
- Obtain the "natural" grouping of a set of data
- Key issue: notion of similarity
- Observations in the same group are supposed to share some properties
- Most methods use the information on the distances among observations in a data set to decide on the natural grouping of the cases
- Provide some abstraction of the found groups, to gain novel insights into the data
- Biology - group genes that have similar functionality
- Business and Marketing - group stocks with similar price fluctuations
- Web Mining - find communities in social networks
- Partitional: divide the observations in k partitions according to some criterion
- Hierarchical: generate a hierarchy of groups, from 1 to n groups, where n is the number of observations in the data set
- Agglomerative: generate a hierarchy from bottom to top (from n to 1 group)
- Divisive: create a hierarchy in a top down way (from 1 to n groups)
Partition the given set of data into k groups by minimizing or maximizing a pre-specified criterion.
- Key Issues:
- The user needs to select the number of groups
- The number of possible divisions of n cases into k groups can grow fast
- Properties
- Cluster compactness: how similar are cases within the same cluster
- Cluster separation: how far a cluster is from the other clusters
- Goal: minimize intra-cluster distance and maximize inter-cluster distances
- A clustering solution assigns all the objects to a cluster
- hard clustering: an object belongs to a single cluster
- fuzzy clustering: each object has an associated probability (degree of membership) of belonging to each cluster
![[Pasted image 20220208215719.png]]
The center of a group can also be the median of its data objects
Partition-based method that obtains k groups from a data set
Algorithm
- Initialize the centers of the k groups to a set of randomly chosen observations
- Repeat:
    - Allocate each observation to the group whose center is nearest
    - Re-calculate the center of each group
- Until the groups are stable
- Uses the squared Euclidean distance as criterion
- Maximizes inter-cluster dissimilarity
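A minimal NumPy sketch of this procedure (toy data; k and the stopping check are illustrative, and a library routine such as scikit-learn's `KMeans` would normally be used instead):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))      # toy data
k = 3

# initialize the centers of the k groups to randomly chosen observations
centers = X[rng.choice(len(X), size=k, replace=False)]

while True:
    # allocate each observation to the group whose center is nearest (squared Euclidean distance)
    d2 = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
    labels = d2.argmin(axis=1)

    # re-calculate the center of each group (keep the old center if a group became empty)
    new_centers = np.array([X[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
                            for j in range(k)])

    # the groups are stable when the centers stop moving
    if np.allclose(new_centers, centers):
        break
    centers = new_centers
```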
Advantages
- Fast algorithm that scales well
- Stochastic approach that frequently works well (though it tends to find local minima)
Disadvantages
- Does not ensure an optimal clustering
- We may obtain different solutions with different starting points
- The initial guess for the number of clusters, k, may be far from the optimal value
- Is the found group structure random?
- What is the "correct" number of groups?
- How to evaluate the result of a clustering algorithm when we have no information on how many groups exist?
- How to compare alternative solutions?
- Supervised: compare the obtained clustering (grouping) with the external information that we have available
- Unsupervised: try to measure the quality of the clustering without any information on the "ideal" structure of the data
- Cohesion coefficients: determine how compact/cohesive the members of a group are
- Separation coefficients: determine how different the members of different groups are ![[Pasted image 20220208220605.png]]
- Silhouette coefficient: a popular coefficient that incorporates both the notions of cohesion and separation
- The coefficient takes values between -1 and 1
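The usual definition (standard formula, not taken from the embedded figure): for observation $i$, let $a(i)$ be the average distance to the other members of its own cluster (cohesion) and $b(i)$ the lowest average distance to the members of any other cluster (separation); then

$$s(i)=\frac{b(i)-a(i)}{\max\{a(i),\,b(i)\}}\in[-1,1]$$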
- An inappropriate choice of k can result in a clustering with poor performance
- Ideally you should have some a priori knowledge of the real structure of the data
- If no a priori value is known, start with sqrt(n/2) as a rule of thumb, where n is the number of observations
For several possible numbers of clusters, k:
- Calculate the average silhouette coefficient value and choose the k that yields the highest value
For several possible numbers of clusters, k:
- Calculate the within-cluster SSE, also called distortion, and choose the k such that adding another cluster does not yield a much smaller SSE
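A small sketch of both heuristics on made-up data, using scikit-learn's `KMeans` (whose `inertia_` attribute is the within-cluster SSE) and `silhouette_score`:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.5, size=(50, 2)) for c in ((0, 0), (5, 5), (0, 5))])

for k in range(2, 8):
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X)
    sse = km.inertia_                        # elbow method: look for the "knee" in the SSE curve
    sil = silhouette_score(X, km.labels_)    # average silhouette: pick the k with the highest value
    print(f"k={k}  SSE={sse:.1f}  avg silhouette={sil:.3f}")
```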
- Searches for the k representative objects (the medoids) among the cases in the given data set
- As with k-means each observation is allocated to the nearest medoid
- Is more robust to the presence of outliers because it uses actual data objects (medoids) as cluster centers, instead of averages that may be affected by outliers
- Moreover, it uses a more robust measure of the clustering quality: the L1-norm, which is based on absolute error instead of the squared error used in k-means
- The PAM algorithm has several advantages in terms of robustness when compared to k-means
- However, these advantages come at the price of additional computational complexity that may be too much for very large data sets
- CLARA tries to solve these efficiency problems
- it does that by using sampling
Algorithm
- Repeat n times the following:
    - Draw a random sample of size m
    - Apply PAM to this random sample to obtain k medoids
    - Allocate each observation of the full data set to the nearest of these medoids
    - Calculate the sum of dissimilarities of the resulting clustering (as in PAM)
- Return as result the clustering, among the n repetitions, that got the lowest sum of dissimilarities
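A rough sketch of this sampling scheme, assuming the scikit-learn-extra package for the PAM step (its `KMedoids` estimator); the sample size, number of repetitions and data are illustrative only:

```python
import numpy as np
from scipy.spatial.distance import cdist
from sklearn_extra.cluster import KMedoids   # assumption: scikit-learn-extra is installed

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 2))               # toy data
k, m, n_reps = 3, 200, 5                     # clusters, sample size, repetitions

best_cost, best_medoids = np.inf, None
for _ in range(n_reps):
    # draw a random sample of size m and run k-medoids (PAM-style) on it
    sample = X[rng.choice(len(X), size=m, replace=False)]
    medoids = KMedoids(n_clusters=k, random_state=0).fit(sample).cluster_centers_

    # allocate the full data set to the nearest medoid and score the clustering
    d = cdist(X, medoids)
    cost = d.min(axis=1).sum()               # sum of dissimilarities, as in PAM
    if cost < best_cost:
        best_cost, best_medoids = cost, medoids

labels = cdist(X, best_medoids).argmin(axis=1)
```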
Density-based clustering (e.g. DBSCAN) is useful when:
- clusters are of different sizes, densities, and have non-globular shapes
- the data contains outliers/noise
- The density of a single observation is estimated by the number of observations that are within a certain radius
- Based on this idea observations are classified as:
- core points: if the number of observations within their radius is above a certain threshold
- border points: if the number of observations within their radius does not reach the threshold, but they are within the radius of a core point
- noise points: they do not have enough observations within their radius, nor are they sufficiently close to any core point
![[Pasted image 20220208222806.png]]
Algorithm
- Classify each observation as one of the three possible types
- Eliminate the noise points from the formation of the groups
- All core points that are within a certain distance of each other are allocated to the same group
- Each border point is allocated to the group of the nearest core point
Note: this method does not require the user to specify the number of groups. But, you need to specify the radius (ε) and the minimum number of points (MinPts)
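The scheme described above (radius ε, MinPts, core/border/noise points) is DBSCAN; a minimal sketch using scikit-learn's `DBSCAN` (the data and parameter values are made up and would normally be tuned):

```python
import numpy as np
from sklearn.cluster import DBSCAN

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.3, size=(100, 2)) for c in ((0, 0), (3, 3))])

db = DBSCAN(eps=0.5, min_samples=5).fit(X)   # eps = radius, min_samples = MinPts
labels = db.labels_                          # cluster id per observation; -1 marks noise points
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
print("clusters found:", n_clusters, " noise points:", int((labels == -1).sum()))
```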
Advantages
- Can handle clusters with different shapes and sizes
- Resistant to noise
Disadvantages
- Struggles with clusters of varying densities
- Struggles with high-dimensional data
- Obtain a hierarchy of groups, where each level represents a possible solution with x groups. It is up to the user to select the solution they want
- A dendrogram can be used for visualization
Bottom-up
- Start with as many groups as there are cases
- On each upper level a pair of groups is merged into a single group
- The chosen pair is formed by the two groups that are most similar
Top-down (much less used)
- Start with a single group
- On each level select a group to be split in 2
- The selected group is the one with the smallest uniformity (the least homogeneous one)
- Single link
- Complete link
- Average link ![[Pasted image 20220208223554.png]]
- Compute the proximity matrix
- Let each data point be a cluster
- Repeat
- Merge the 2 closest clusters
- Update the proximity matrix
- Until only a single cluster remains
Single link
- can handle non-elliptical shapes
- uses a local merge criterion
- distant parts of the cluster and the clusters' overall structure are not taken into account
Complete link
- biased towards globular clusters
- uses a non-local merge criterion
- chooses the pair of clusters whose merge has the smallest diameter
- the similarity of 2 clusters is the similarity of their most dissimilar members
- sensitive to noise/outliers
Average link
- a compromise between single and complete link
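A short sketch comparing the three linkage criteria with SciPy's agglomerative clustering routines on made-up data (the dendrogram plot needs matplotlib):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy.cluster.hierarchy import linkage, dendrogram, fcluster

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(loc=c, scale=0.4, size=(30, 2)) for c in ((0, 0), (4, 0), (2, 3))])

for method in ("single", "complete", "average"):
    Z = linkage(X, method=method)                    # agglomerative merge tree
    labels = fcluster(Z, t=3, criterion="maxclust")  # cut the hierarchy into 3 groups
    print(method, np.bincount(labels)[1:])           # cluster sizes per linkage criterion

dendrogram(linkage(X, method="average"))             # visualize the hierarchy
plt.show()
```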
- Compute the proximity matrix
- Start with a single cluster that contains all data points
- Repeat
- choose the cluster with the largest diameter
- select the data point with largest average dissimilarity to the other members in that cluster
- re-allocate the data points to either the cluster of this selected point or the "old" cluster (represented by its center), depending on which one is nearest
- Until each data point constitutes a cluster
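A rough sketch of one split of this divisive procedure, following the steps above literally (the helper name and data are made up; this is a simplification, not the full DIANA algorithm):

```python
import numpy as np
from scipy.spatial.distance import cdist

def divisive_step(X, clusters):
    """Split the cluster with the largest diameter, as described above.
    `clusters` is a list of index arrays into X."""
    diameter = lambda idx: cdist(X[idx], X[idx]).max() if len(idx) > 1 else 0.0

    # choose the cluster with the largest diameter
    t = max(range(len(clusters)), key=lambda i: diameter(clusters[i]))
    idx = clusters[t]

    # data point with the largest average dissimilarity to the other members
    splinter = idx[cdist(X[idx], X[idx]).mean(axis=1).argmax()]

    # re-allocate each point to the splinter point or to the old cluster's center,
    # depending on which one is nearest
    old_center = X[idx].mean(axis=0)
    go = (np.linalg.norm(X[idx] - X[splinter], axis=1)
          < np.linalg.norm(X[idx] - old_center, axis=1))
    return [c for i, c in enumerate(clusters) if i != t] + [idx[go], idx[~go]]

# repeat until each data point constitutes a cluster
X = np.random.default_rng(0).normal(size=(20, 2))
clusters = [np.arange(len(X))]
while len(clusters) < len(X):
    clusters = divisive_step(X, clusters)
```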
Compare clustering methods w.r.t
- complexity and scalability
- similarity measures that can be employed
- robustness to noise
- ability to find clusters in sub-spaces
- different runs lead to different results
- it is incremental
- ability to handle different types of data
- dependency on the order of data points
- whether it finds the number of clusters or needs it as input
- number of parameters necessary
- required domain knowledge
- shape of clusters it is able to find
- interpretability