module-8
November 8, 2023
Module 8: Cluster Analysis
The following tutorial contains Python examples for solving clustering problems. You
should refer to Chapters 7 and 8 of the “Introduction to Data Mining” book to understand
some of the concepts introduced in this tutorial. The notebook can be downloaded from
http://www.cse.msu.edu/~ptan/dmbook/tutorials/tutorial8/tutorial8.ipynb.
Cluster analysis seeks to partition the input data into groups of closely related instances so that
instances that belong to the same cluster are more similar to each other than to instances that
belong to other clusters. In this tutorial, we will provide examples of using different clustering
techniques provided by the scikit-learn library package.
Read the step-by-step instructions below carefully. To execute the code, click on the corresponding
cell and press the SHIFT-ENTER keys simultaneously.
[ ]: from google.colab import drive
drive.mount('/content/drive')
Mounted at /content/drive
[14]: import numpy as np
import pandas as pd
import math
from sklearn import cluster
import matplotlib.pyplot as plt
%matplotlib inline
from scipy.cluster import hierarchy
from sklearn.cluster import DBSCAN, k_means, KMeans
8.1 K-means Clustering
The k-means clustering algorithm represents each cluster by its corresponding cluster centroid. The
algorithm partitions the input data into k disjoint clusters by iteratively applying the following
two steps:
1. Form k clusters by assigning each instance to its nearest centroid.
2. Recompute the centroid of each cluster.
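The two steps can be written out directly in NumPy. The following is a minimal sketch, not the scikit-learn implementation: the function name `lloyd_kmeans`, the random initialisation from the data rows, and the toy points are all assumptions made for illustration.

```python
import numpy as np

def lloyd_kmeans(X, k, n_iter=50, seed=0):
    """Plain k-means sketch; returns (labels, centroids)."""
    rng = np.random.default_rng(seed)
    # Assumption: initialise the centroids with k distinct rows chosen at random.
    centroids = X[rng.choice(len(X), size=k, replace=False)].astype(float)
    for _ in range(n_iter):
        # Step 1: assign every instance to its nearest centroid.
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Step 2: recompute each centroid as the mean of its members
        # (an empty cluster keeps its previous centroid).
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break
        centroids = new_centroids
    return labels, centroids

# Hypothetical toy data: two well-separated groups of 2-D points.
toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                [8.0, 8.0], [8.2, 7.9], [7.8, 8.1]])
toy_labels, toy_centroids = lloyd_kmeans(toy, k=2)
print(toy_labels)
print(toy_centroids)
```

The loop stops early once the centroids no longer move, which mirrors the usual convergence test; scikit-learn's `KMeans` additionally uses smarter initialisation and multiple restarts.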
In this section, we perform k-means clustering on a toy movie ratings dataset. We first
create the dataset as follows.
[ ]: ratings = [['john',5,5,2,1],['mary',4,5,3,2],['bob',4,4,4,3],
           ['lisa',2,2,4,5],['lee',1,2,3,4],['harry',2,1,5,5]]
titles = ['user','Jaws','Star Wars','Exorcist','Omen']
movies = pd.DataFrame(ratings,columns=titles)
movies
[ ]:
    user  Jaws  Star Wars  Exorcist  Omen
0   john     5          5         2     1
1   mary     4          5         3     2
2    bob     4          4         4     3
3   lisa     2          2         4     5
4    lee     1          2         3     4
5  harry     2          1         5     5
In this example dataset, the first 3 users liked action movies (Jaws and Star Wars) while the last 3
users enjoyed horror movies (Exorcist and Omen). Our goal is to apply k-means clustering on the
users to identify groups of users with similar movie preferences.
The example below shows how to apply k-means clustering (with k=2) on the movie ratings data.
We must remove the “user” column first before applying the clustering algorithm. The cluster
assignment for each user is displayed as a dataframe object.
[ ]: data = movies.drop('user',axis=1)
k_means = cluster.KMeans(n_clusters=2, max_iter=50, random_state=1)
k_means.fit(data)
labels = k_means.labels_
pd.DataFrame(labels, index=movies.user, columns=['Cluster ID'])
/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870:
FutureWarning: The default value of `n_init` will change from 10 to ‘auto’ in
1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
[ ]:
       Cluster ID
user
john            1
mary            1
bob             1
lisa            0
lee             0
harry           0
The k-means clustering algorithm assigns the first three users to one cluster and the last three users
to the second cluster. The results are consistent with our expectation. We can also display the
centroid for each of the two clusters.
[ ]: centroids = k_means.cluster_centers_
pd.DataFrame(centroids,columns=data.columns)
[ ]:
       Jaws  Star Wars  Exorcist      Omen
0  1.666667   1.666667       4.0  4.666667
1  4.333333   4.666667       3.0  2.000000
Observe that cluster 0 has higher ratings for the horror movies whereas cluster 1 has higher ratings
for action movies. The cluster centroids can be applied to other users to determine their cluster
assignments.
[ ]: testData = np.array([[4,5,1,2],[3,2,4,4],[2,3,4,1],[3,2,3,3],[5,4,1,4]])
labels = k_means.predict(testData)
labels = labels.reshape(-1,1)
usernames = np.array(['paul','kim','liz','tom','bill']).reshape(-1,1)
cols = movies.columns.tolist()
cols.append('Cluster ID')
newusers = pd.DataFrame(np.concatenate((usernames, testData, labels), axis=1),
                        columns=cols)
newusers
/usr/local/lib/python3.10/dist-packages/sklearn/base.py:439: UserWarning: X does
not have valid feature names, but KMeans was fitted with feature names
warnings.warn(
[ ]:
   user Jaws Star Wars Exorcist Omen Cluster ID
0  paul    4         5        1    2          1
1   kim    3         2        4    4          0
2   liz    2         3        4    1          1
3   tom    3         2        3    3          0
4  bill    5         4        1    4          1
To determine the number of clusters in the data, we can apply k-means with varying number of
clusters from 1 to 6 and compute their corresponding sum-of-squared errors (SSE) as shown in the
example below. The “elbow” in the plot of SSE versus number of clusters can be used to estimate
the number of clusters.
[ ]: numClusters = [1,2,3,4,5,6]
SSE = []
for k in numClusters:
    # Fit k-means with k clusters and record its sum-of-squared errors
    k_means = cluster.KMeans(n_clusters=k)
    k_means.fit(data)
    SSE.append(k_means.inertia_)
plt.plot(numClusters, SSE)
plt.xlabel('Number of Clusters')
plt.ylabel('SSE')
8.2 Hierarchical Clustering
This section demonstrates examples of applying hierarchical clustering to the vertebrate dataset
used in Module 6 (Classification). Specifically, we illustrate the results of using 3 hierarchical
clustering algorithms provided by the Python scipy library: (1) single link (MIN), (2) complete
link (MAX), and (3) group average. Other hierarchical clustering algorithms provided by the
library include centroid-based and Ward’s method.
[ ]: data = pd.read_csv('/content/drive/MyDrive/datamining/vertebrate.csv',header='infer')
data
[ ]:
             Name  Warm-blooded  Gives Birth  Aquatic Creature  Aerial Creature  Has Legs  Hibernates       Class
0           human             1            1                 0                0         1           0     mammals
1          python             0            0                 0                0         0           1    reptiles
2          salmon             0            0                 1                0         0           0      fishes
3           whale             1            1                 1                0         0           0     mammals
4            frog             0            0                 1                0         1           1  amphibians
5          komodo             0            0                 0                0         1           0    reptiles
6             bat             1            1                 0                1         1           1     mammals
7          pigeon             1            0                 0                1         1           0       birds
8             cat             1            1                 0                0         1           0     mammals
9   leopard shark             0            1                 1                0         0           0      fishes
10         turtle             0            0                 1                0         1           0    reptiles
11        penguin             1            0                 1                0         1           0       birds
12      porcupine             1            1                 0                0         1           1     mammals
13            eel             0            0                 1                0         0           0      fishes
14     salamander             0            0                 1                0         1           1  amphibians
8.2.1 Single Link (MIN)
[ ]: names = data['Name']
Y = data['Class']
X = data.drop(['Name','Class'],axis=1)
Z = hierarchy.linkage(X, 'single')
dn = hierarchy.dendrogram(Z,labels=names.tolist(),orientation='right')
8.2.2 Complete Link (MAX)
[ ]: Z = hierarchy.linkage(X, 'complete')
dn = hierarchy.dendrogram(Z,labels=names.tolist(),orientation='right')
8.2.3 Group Average
[ ]: Z = hierarchy.linkage(X, 'average')
dn = hierarchy.dendrogram(Z,labels=names.tolist(),orientation='right')
8.3 Density-Based Clustering
Density-based clustering identifies the individual clusters as high-density regions that are separated
by regions of low density. DBSCAN is one of the most popular density-based clustering algorithms. In
DBSCAN, data points are classified into 3 types (core points, border points, and noise points) based
on the density of their local neighborhood. The local neighborhood density is defined according to 2
parameters: the radius of the neighborhood (eps) and the minimum number of points in the neighborhood
(min_samples).
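To make the three point types concrete, the sketch below fits scikit-learn's `DBSCAN` on a handful of hypothetical points and derives a mask for each type. The toy coordinates and the `eps`/`min_samples` values are assumptions chosen so that each type occurs; note that scikit-learn counts a point itself when checking `min_samples`.

```python
import numpy as np
from sklearn.cluster import DBSCAN

# Hypothetical toy data: a dense blob, one point on its fringe, one isolated point.
pts = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [0.05, 0.05],
                [0.4, 0.05],   # close to the blob, but with too few neighbors
                [5.0, 5.0]])   # far from everything
db_toy = DBSCAN(eps=0.4, min_samples=5).fit(pts)

# Core points: at least min_samples points (itself included) within eps.
is_core = np.zeros(len(pts), dtype=bool)
is_core[db_toy.core_sample_indices_] = True
# Noise points: labelled -1 by DBSCAN.
is_noise = db_toy.labels_ == -1
# Border points: assigned to a cluster without being core points.
is_border = ~is_core & ~is_noise

print("core:  ", np.flatnonzero(is_core))
print("border:", np.flatnonzero(is_border))
print("noise: ", np.flatnonzero(is_noise))
```

With these values the five blob points are core points, the fringe point is a border point, and the isolated point is noise.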
For this approach, we will use a noisy, 2-dimensional dataset originally created by Karypis et al.
[1] for evaluating their proposed CHAMELEON algorithm. The example code shown below will
load and plot the distribution of the data.
[ ]: data = pd.read_csv('/content/drive/MyDrive/datamining/chameleon.data',
                   delimiter=' ', names=['x','y'])
data.plot.scatter(x='x',y='y')
We apply the DBSCAN clustering algorithm to the data, setting the neighborhood radius (eps)
to 15.5 and the minimum number of points (min_samples) to 5. The clusters are assigned IDs
from 0 to 8, while the noise points are assigned a cluster ID of -1.
[ ]: print(data.shape)
db = DBSCAN(eps=15.5, min_samples=5).fit(data)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = pd.DataFrame(db.labels_,columns=['Cluster ID'])
result = pd.concat((data,labels), axis=1)
result.plot.scatter(x='x',y='y',c='Cluster ID', colormap='jet')
(1971, 2)
8.4 Spectral Clustering
One of the main limitations of the k-means clustering algorithm is its tendency to seek globular-shaped
clusters. Thus, it does not work well when applied to datasets with arbitrary-shaped clusters
or when the cluster centroids overlap with one another. Spectral clustering can overcome this
limitation by exploiting properties of the similarity graph. To illustrate
this, consider the following two-dimensional datasets.
[ ]: import pandas as pd
data1 = pd.read_csv('/content/drive/MyDrive/datamining/2d_data.txt',
                    delimiter=' ', names=['x','y'])
data2 = pd.read_csv('/content/drive/MyDrive/datamining/elliptical.txt',
                    delimiter=' ', names=['x','y'])
fig, (ax1,ax2) = plt.subplots(nrows=1, ncols=2, figsize=(12,5))
data1.plot.scatter(x='x',y='y',ax=ax1)
data2.plot.scatter(x='x',y='y',ax=ax2)
Below, we demonstrate the results of applying k-means to the datasets (with k=2).
[ ]: from sklearn import cluster
k_means = cluster.KMeans(n_clusters=2, max_iter=50, random_state=1)
k_means.fit(data1)
labels1 = pd.DataFrame(k_means.labels_,columns=['Cluster ID'])
result1 = pd.concat((data1,labels1), axis=1)
k_means2 = cluster.KMeans(n_clusters=2, max_iter=50, random_state=1)
k_means2.fit(data2)
labels2 = pd.DataFrame(k_means2.labels_,columns=['Cluster ID'])
result2 = pd.concat((data2,labels2), axis=1)
fig, (ax1,ax2) = plt.subplots(nrows=1, ncols=2, figsize=(12,5))
result1.plot.scatter(x='x',y='y',c='Cluster ID',colormap='jet',ax=ax1)
ax1.set_title('K-means Clustering')
result2.plot.scatter(x='x',y='y',c='Cluster ID',colormap='jet',ax=ax2)
ax2.set_title('K-means Clustering')
/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870:
FutureWarning: The default value of `n_init` will change from 10 to ‘auto’ in
1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
/usr/local/lib/python3.10/dist-packages/sklearn/cluster/_kmeans.py:870:
FutureWarning: The default value of `n_init` will change from 10 to ‘auto’ in
1.4. Set the value of `n_init` explicitly to suppress the warning
warnings.warn(
[ ]: Text(0.5, 1.0, 'K-means Clustering')
The plots above show the poor performance of k-means clustering. Next, we apply spectral clustering to the datasets. Spectral clustering converts the data into a similarity graph and applies the
normalized cut graph partitioning algorithm to generate the clusters. In the example below, we
use the Gaussian radial basis function as our affinity (similarity) measure. Users need to tune the
kernel parameter (gamma) value in order to obtain the appropriate clusters for the given dataset.
[ ]: from sklearn import cluster
import pandas as pd
spectral = cluster.SpectralClustering(n_clusters=2,random_state=1,affinity='rbf',gamma=5000)
spectral.fit(data1)
labels1 = pd.DataFrame(spectral.labels_,columns=['Cluster ID'])
result1 = pd.concat((data1,labels1), axis=1)
spectral2 = cluster.SpectralClustering(n_clusters=2,random_state=1,affinity='rbf',gamma=100)
spectral2.fit(data2)
labels2 = pd.DataFrame(spectral2.labels_,columns=['Cluster ID'])
result2 = pd.concat((data2,labels2), axis=1)
fig, (ax1,ax2) = plt.subplots(nrows=1, ncols=2, figsize=(12,5))
result1.plot.scatter(x='x',y='y',c='Cluster ID',colormap='jet',ax=ax1)
ax1.set_title('Spectral Clustering')
result2.plot.scatter(x='x',y='y',c='Cluster ID',colormap='jet',ax=ax2)
ax2.set_title('Spectral Clustering')
[ ]: Text(0.5, 1.0, 'Spectral Clustering')
8.5 Summary
This tutorial illustrates examples of using different Python implementations of clustering
algorithms. Algorithms such as k-means, spectral clustering, and DBSCAN are designed to create
disjoint partitions of the data, whereas the single-link, complete-link, and group average algorithms
are designed to generate a hierarchy of cluster partitions.
References: [1] George Karypis, Eui-Hong Han, and Vipin Kumar. CHAMELEON: A Hierarchical
Clustering Algorithm Using Dynamic Modeling. IEEE Computer 32(8): 68-75, 1999.
In-class Clustering Practice
1. Given college-and-university.csv, conduct K-means clustering analysis based on Median
SAT, Acceptance Rate, Expenditures/Student, Top 10% HS, and Graduation %. Do not use
Type and School in your analysis. Find the best K value.
2. Given college-and-university.csv, conduct Hierarchical clustering analysis based on Median
SAT, Acceptance Rate, Expenditures/Student, Top 10% HS, and Graduation %. Do not use
Type and School in your analysis.
[3]: np.random.seed(42)
# Function for creating datapoints in the form of a circle
def PointsInCircum(r, n=100):
    return [(math.cos(2*math.pi/n*x)*r + np.random.normal(-30,30),
             math.sin(2*math.pi/n*x)*r + np.random.normal(-30,30))
            for x in range(1,n+1)]
# Creating data points in the form of a circle
df1 = pd.DataFrame(PointsInCircum(500,1000))
df2 = pd.DataFrame(PointsInCircum(300,700))
df3 = pd.DataFrame(PointsInCircum(100,300))
# Adding noise to the dataset
df4 = pd.DataFrame([(np.random.randint(-600,600), np.random.randint(-600,600))
                    for i in range(300)])
df = pd.concat([df1, df2, df3, df4], axis=0)
plt.figure(figsize=(6,6))
plt.scatter(df[0],df[1],s=15,color='grey')
plt.title('Dataset',fontsize=20)
plt.xlabel('Feature 1',fontsize=14)
plt.ylabel('Feature 2',fontsize=14)
plt.show()
3. Using the data in df above, conduct K-means and density-based clustering. Provide
visualization for the clustering results. Use eps=30 and min_samples=6 for DBSCAN.
4. Using the data in df above, conduct spectral clustering and provide visualization for the
clustering result.
Spectral Clustering
https://towardsdatascience.com/spectral-clustering-aba2640c0d5b
The data is represented as a graph.
[17]: # Adjacency Matrix
A = np.array([
    [0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]])
# Degree Matrix
D = np.diag(A.sum(axis=1))
print(D)
# graph laplacian
L = D-A
print(L)
# eigenvalues and eigenvectors
vals, vecs = np.linalg.eig(L)
# sort these based on the eigenvalues
vecs = vecs[:,np.argsort(vals)]
vals = vals[np.argsort(vals)]
[[4 0 0 0 0 0 0 0 0 0]
 [0 2 0 0 0 0 0 0 0 0]
 [0 0 2 0 0 0 0 0 0 0]
 [0 0 0 2 0 0 0 0 0 0]
 [0 0 0 0 2 0 0 0 0 0]
 [0 0 0 0 0 4 0 0 0 0]
 [0 0 0 0 0 0 2 0 0 0]
 [0 0 0 0 0 0 0 2 0 0]
 [0 0 0 0 0 0 0 0 2 0]
 [0 0 0 0 0 0 0 0 0 2]]
[[ 4 -1 -1  0  0  0  0  0 -1 -1]
 [-1  2 -1  0  0  0  0  0  0  0]
 [-1 -1  2  0  0  0  0  0  0  0]
 [ 0  0  0  2 -1 -1  0  0  0  0]
 [ 0  0  0 -1  2 -1  0  0  0  0]
 [ 0  0  0 -1 -1  4 -1 -1  0  0]
 [ 0  0  0  0  0 -1  2 -1  0  0]
 [ 0  0  0  0  0 -1 -1  2  0  0]
 [-1  0  0  0  0  0  0  0  2 -1]
 [-1  0  0  0  0  0  0  0 -1  2]]
[18]: plt.scatter(range(len(vals)), vals)
plt.xlabel('order')
plt.ylabel('values')
[18]: Text(0, 0.5, 'values')
[19]: # k-means on the eigenvectors in columns 1 to 3 (the first column is skipped)
kmeans = KMeans(n_clusters=4, n_init="auto")
kmeans.fit(vecs[:,1:4])
colors = kmeans.labels_
print("Clusters:", colors)
# Clusters: [2 1 1 0 0 0 3 3 2 2]
Clusters: [0 2 2 0 0 0 3 3 1 1]
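A useful sanity check before clustering the eigenvectors is to count the zero eigenvalues of the Laplacian: their multiplicity equals the number of connected components of the graph. The sketch below repeats the adjacency matrix from this section under new, hypothetical names (`A_demo`, `L_demo`) and uses `np.linalg.eigh`, which is intended for symmetric matrices and returns real eigenvalues in ascending order; the tolerance `atol=1e-8` is an assumption.

```python
import numpy as np

# Same adjacency matrix as in the cell above, repeated so the block is self-contained.
A_demo = np.array([
    [0, 1, 1, 0, 0, 0, 0, 0, 1, 1],
    [1, 0, 1, 0, 0, 0, 0, 0, 0, 0],
    [1, 1, 0, 0, 0, 0, 0, 0, 0, 0],
    [0, 0, 0, 0, 1, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 0, 1, 0, 0, 0, 0],
    [0, 0, 0, 1, 1, 0, 1, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 0, 1, 0, 0],
    [0, 0, 0, 0, 0, 1, 1, 0, 0, 0],
    [1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
    [1, 0, 0, 0, 0, 0, 0, 0, 1, 0]])

# Unnormalized graph Laplacian L = D - A.
L_demo = np.diag(A_demo.sum(axis=1)) - A_demo

# eigh exploits symmetry: real eigenvalues, sorted in ascending order.
eigvals, eigvecs = np.linalg.eigh(L_demo)

# Number of (numerically) zero eigenvalues = number of connected components.
n_components = int(np.sum(np.isclose(eigvals, 0.0, atol=1e-8)))
print(n_components)  # 2
```

For this matrix the count is 2, because nodes {0, 1, 2, 8, 9} and nodes {3, 4, 5, 6, 7} share no edge. When zero is a repeated eigenvalue, the eigenvectors returned for it are only determined up to a rotation within that eigenspace, which is worth keeping in mind when selecting columns by position.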
Clustering
October 25, 2023
0.1 Loading the data
[4]: import pandas as pd
import numpy as np
from scipy.stats import zscore
# Load the CSV file into a pandas DataFrame
data = pd.read_csv('Data.csv')
# Display the first few rows of the DataFrame
print(data.head(10))
                School Type  Median  SAT  Acceptance Rate  Expenditures/Student  Top 10% HS Graduation %
Amherst            Lib Arts    1315   22            26636                    85                       93
Barnard            Lib Arts    1220   53            17653                    69                       80
Bates              Lib Arts    1240   36            17554                    58                       88
Berkeley         University    1176   37            23665                    95                       68
Bowdoin            Lib Arts    1300   24            25703                    78                       90
Brown            University    1281   24            24201                    80                       90
Bryn Mawr          Lib Arts    1255   56            18847                    70                       84
Cal Tech         University    1400   31           102262                    98                       75
Carleton           Lib Arts    1300   40            15904                    75                       80
Carnegie Mellon  University    1225   64            33607                    52                       77
0.2 Data cleaning
Handle the missing values
[5]: # Drop rows with any missing values
data_cleaned = data.dropna()
# Fill missing values in a specific column (e.g., 'Acceptance Rate') with the
# mean of that column (assignment avoids the chained inplace fillna)
data['Acceptance Rate'] = data['Acceptance Rate'].fillna(data['Acceptance Rate'].mean())
0.2.1 Removing duplicates
[6]: # Remove duplicate rows based on all columns
data_cleaned = data.drop_duplicates()
# Remove duplicates based on specific columns
data_cleaned = data.drop_duplicates(subset=['School Type', 'Median', 'SAT',
    'Acceptance Rate', 'Expenditures/Student', 'Top 10% HS Graduation %'])
0.2.2 Correcting data formats
[7]: # Convert 'Acceptance Rate' to numeric (assuming it contains numerical values)
data['Acceptance Rate'] = pd.to_numeric(data['Acceptance Rate'], errors='coerce')
# Convert 'Expenditures/Student' to numeric (assuming it contains numerical values)
data['Expenditures/Student'] = pd.to_numeric(data['Expenditures/Student'],
                                             errors='coerce')
# Convert 'Top 10% HS Graduation %' to numeric (assuming it contains numerical values)
data['Top 10% HS Graduation %'] = pd.to_numeric(data['Top 10% HS Graduation %'],
                                                errors='coerce')
0.2.3 Handling outliers
[8]: # Identify and remove outliers using Z-score for 'Acceptance Rate'
# (.copy() so that the column assignments below act on an independent frame)
data = data[(np.abs(zscore(data['Acceptance Rate'])) < 3)].copy()
# Replace outliers with the median for 'Expenditures/Student'
data['Expenditures/Student'] = np.where(
    np.abs(zscore(data['Expenditures/Student'])) < 3,
    data['Expenditures/Student'],
    data['Expenditures/Student'].median())
# Replace outliers with the median for 'Top 10% HS Graduation %'
data['Top 10% HS Graduation %'] = np.where(
    np.abs(zscore(data['Top 10% HS Graduation %'])) < 3,
    data['Top 10% HS Graduation %'],
    data['Top 10% HS Graduation %'].median())
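The cells that follow refer to a `data_encoded` frame whose construction is not shown in this notebook. Its column names in later output (`School Type_Lib Arts`, `School Type_University`) are consistent with one-hot encoding the `School Type` column via `pd.get_dummies`, so the sketch below shows that step on a small stand-in frame. This is an assumption about how `data_encoded` was built, and `sample` and `sample_encoded` are hypothetical names.

```python
import pandas as pd

# Hypothetical stand-in for a few rows of the cleaned `data` frame.
sample = pd.DataFrame(
    {"School Type": ["Lib Arts", "Lib Arts", "University"],
     "Median": [1315, 1220, 1176],
     "SAT": [22, 53, 37]},
    index=["Amherst", "Barnard", "Berkeley"],
)

# Assumption: one-hot encode the categorical column; numeric columns are kept as-is.
sample_encoded = pd.get_dummies(sample, columns=["School Type"])
print(sample_encoded.columns.tolist())
# ['Median', 'SAT', 'School Type_Lib Arts', 'School Type_University']
```

Applied to the full frame, the equivalent call would be `pd.get_dummies(data, columns=['School Type'])`.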
0.3 Clustering analysis
K-means clustering K-means is an iterative algorithm that divides a dataset into K
non-overlapping clusters. K centroids are first placed in the feature space, typically at
random. Each data point is then assigned to its closest centroid, which forms the clusters, and
each centroid is recalculated as the mean of the data points assigned to it. The assignment and
update steps are repeated until the centroids stop changing appreciably or a predetermined number
of iterations has been reached. K-means minimizes the within-cluster sum of squares, which
reduces the variance inside each cluster. The final output consists of the K cluster
centroids and the data points assigned to them. The elbow method
or the silhouette score can be used to select an appropriate number of clusters (K), which helps
the algorithm identify meaningful patterns in the data.
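As an illustration of the silhouette approach mentioned above, the sketch below scores several values of K on synthetic data and keeps the one with the highest average silhouette. The blob centers, sample size, and the range of K are assumptions for demonstration only; they are not taken from the college dataset.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated Gaussian blobs.
X_demo, _ = make_blobs(n_samples=150,
                       centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least 2 clusters
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(X_demo)
    scores[k] = silhouette_score(X_demo, km.labels_)

# Pick the K with the highest average silhouette.
best_k = max(scores, key=scores.get)
print(scores)
print("best K:", best_k)
```

Unlike the elbow plot, which is read by eye, the silhouette score gives a single number per K, so the choice can be automated; on real data the maximum is often less pronounced than on clean blobs.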
[18]: from sklearn.cluster import KMeans
# Choose the number of clusters (K) - you can use techniques like the elbow
# method to find the optimal K
k = 3
# Initialize the KMeans model
kmeans = KMeans(n_clusters=k, n_init=10)
# Fit the model to your data_encoded
kmeans.fit(data_encoded)
# Get cluster labels for each data point
cluster_labels = kmeans.labels_
# Add the cluster labels back to your DataFrame
data_encoded['Cluster'] = cluster_labels
# Display the first few rows of the DataFrame with cluster labels
print(data_encoded.head())
          Median  SAT  Acceptance Rate  Expenditures/Student  Top 10% HS Graduation %  School Type_Lib Arts  School Type_University  Cluster
Amherst     1315   22            26636                  85.0                     93.0                  True                   False        0
Barnard     1220   53            17653                  69.0                     80.0                  True                   False        0
Bates       1240   36            17554                  58.0                     88.0                  True                   False        0
Berkeley    1176   37            23665                  95.0                     68.0                 False                    True        0
Bowdoin     1300   24            25703                  78.0                     90.0                  True                   False        0
0.3.1 Aspects of the k-means
Elbow Method for Optimal K: The elbow method plots the distortion (inertia) against the
number of clusters. It helps locate the K beyond which adding more clusters no longer improves
the model appreciably. In the graph, we look for the point
where the rate of decrease changes sharply, creating an “elbow.” The idea is to strike
a balance between overfitting (too many clusters) and capturing the variation in the data. A more
distinct elbow indicates a clearer choice for the number of clusters.
[46]: import warnings
import matplotlib.pyplot as plt
warnings.simplefilter(action='ignore', category=FutureWarning)
distortions = []
K = range(1,10)
for k in K:
    # Fit k-means with k clusters and record its inertia
    kmeanModel = KMeans(n_clusters=k)
    kmeanModel.fit(data_encoded)
    distortions.append(kmeanModel.inertia_)
plt.figure(figsize=(16,8))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
Cluster statistics The features of each cluster can be seen by computing statistics like mean and
median for each attribute within clusters. For instance, median values reveal the central tendency,
which is less influenced by outliers, while mean values reveal the average behavior of data points in a
cluster. Understanding the distinguishing characteristics of the clusters facilitates their comparison
and interpretation through statistical analysis.
[48]: cluster_stats = data_encoded.groupby('Cluster').agg(['mean', 'median'])
print(cluster_stats)
Attribute                 Statistic     Cluster 0     Cluster 1     Cluster 2
Median                    mean        1335.000000   1246.235294   1253.571429
Median                    median           1350.0        1248.5        1280.0
SAT                       mean          25.571429     39.470588     45.000000
SAT                       median             19.0          37.0          45.0
Acceptance Rate           mean       51076.428571  22194.470588  36935.285714
Acceptance Rate           median          48123.0       22077.0       38597.0
Expenditures/Student      mean          86.142857     71.264706     73.142857
Expenditures/Student      median             90.0          72.0          74.0
Top 10% HS Graduation %   mean          89.428571     83.632353     79.857143
Top 10% HS Graduation %   median             90.0          85.0          77.0
School Type_Lib Arts      mean           0.000000      0.735294      0.000000
School Type_Lib Arts      median              0.0           1.0           0.0
School Type_University    mean           1.000000      0.264706      1.000000
School Type_University    median              1.0           0.0           1.0
Visualizing clusters Visualization is essential for understanding the cluster distribution in the
dataset on an intuitive level. We can visualize clusters in 2D (or 3D) space by applying PCA to
reduce the number of dimensions. Each dot on the figure represents a data point, and the color of
the dot indicates the cluster to which it belongs. Visualization is an effective tool for understanding
cluster analysis because it frequently exposes patterns that may be difficult to see in raw data.
[49]: from sklearn.decomposition import PCA
pca = PCA(n_components=2) # Change to 3 for 3D visualization
reduced_features = pca.fit_transform(data_encoded)
plt.scatter(reduced_features[:,0], reduced_features[:,1], c=cluster_labels,
            cmap='viridis')
plt.title('PCA - K-means Clustering')
plt.show()
Cluster Profiling Cluster profiling analyzes the cluster centroids, which reflect the average of the
attributes for each cluster. These centroids can be examined to determine
the key characteristics of each cluster. It is easier to name and interpret clusters meaningfully
when you are aware of their key characteristics. This makes it possible for stakeholders to use the
cluster profiles to inform their decisions.
[53]: k = 8
kmeans = KMeans(n_clusters=k, n_init=10)
kmeans.fit(data_encoded)
cluster_centers = kmeans.cluster_centers_
# Now cluster_centers will have the shape (k, 8), where k is the number of clusters
print(cluster_centers)
[[1.23568750e+03 3.99375000e+01 1.88721250e+04 6.57500000e+01
8.26250000e+01 8.75000000e-01 1.25000000e-01 1.00000000e+00]
[1.31700000e+03 2.80000000e+01 4.65950000e+04 8.15000000e+01
8.97500000e+01 0.00000000e+00 1.00000000e+00 0.00000000e+00]
[1.26544444e+03 3.58888889e+01 2.68982222e+04 7.98888889e+01
8.66111111e+01 6.66666667e-01 3.33333333e-01 1.00000000e+00]
[1.35350000e+03 2.45000000e+01 5.46170000e+04 9.25000000e+01
8.95000000e+01 0.00000000e+00 1.00000000e+00 0.00000000e+00]
[1.25250000e+03 5.25000000e+01 3.22445000e+04 6.95000000e+01
8.15000000e+01 0.00000000e+00 1.00000000e+00 2.00000000e+00]
[1.37000000e+03 1.80000000e+01 6.19210000e+04 9.20000000e+01
8.80000000e+01 0.00000000e+00 1.00000000e+00 0.00000000e+00]
[1.25400000e+03 4.20000000e+01 3.88116000e+04 7.46000000e+01
7.92000000e+01 0.00000000e+00 1.00000000e+00 2.00000000e+00]
[1.24577778e+03 4.22222222e+01 2.33971111e+04 7.24444444e+01
8.24444444e+01 5.55555556e-01 4.44444444e-01 1.00000000e+00]]
0.4 Hierarchical clustering
Hierarchical clustering is a cluster analysis technique that builds a hierarchy of clusters,
represented as a tree-like structure known as a dendrogram. The agglomerative variant treats each
data point as a separate cluster and iteratively merges the closest clusters; the divisive variant
starts from a single cluster and iteratively splits it. The final clusters are determined
by where the dendrogram is cut. Unlike KMeans, hierarchical clustering does not need a
predetermined number of clusters, so it is very helpful when the underlying data is hierarchical
by nature or when the number of clusters is unknown in advance. The dendrogram is a visual
representation of the links between clusters, which enables analysts to decide on
a suitable number of clusters based on the structure of the data. In contrast to KMeans, it can
be computationally demanding for large datasets, and the interpretation of clusters may be less
objective.
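Cutting the dendrogram can be done programmatically with SciPy's `fcluster`, either by asking for a maximum number of clusters or by giving a distance threshold. The sketch below uses hypothetical toy points and Ward linkage; the threshold `t=5.0` is an assumption that happens to fall between the small within-group merge heights and the large final merge for these points.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: two tight groups of three points each.
pts_h = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0],
                  [10.0, 10.0], [10.0, 11.0], [11.0, 10.0]])

# Build the merge tree (the same structure a dendrogram plot draws).
Z_demo = linkage(pts_h, method='ward')

# Cut 1: request at most 2 flat clusters.
flat_by_count = fcluster(Z_demo, t=2, criterion='maxclust')
# Cut 2: cut the tree at height 5.0 on the distance axis.
flat_by_height = fcluster(Z_demo, t=5.0, criterion='distance')

print(flat_by_count)
print(flat_by_height)
```

The last column group of `Z_demo` (merge distances in `Z_demo[:, 2]`) can be inspected to choose a threshold: a large jump between consecutive merge heights suggests a natural place to cut.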
[54]: from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
# Perform hierarchical clustering
hierarchical_clusters = linkage(data_encoded, method='ward')
# Plot the dendrogram
plt.figure(figsize=(12, 8))
dendrogram(hierarchical_clusters, labels=data_encoded.index, leaf_rotation=90)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Universities')
plt.ylabel('Distance')
plt.show()
Single-Linkage clustering Single-linkage clustering, also known as the nearest-neighbor or minimum
method, defines the distance between two clusters as the distance between their closest points. It
tends to produce elongated clusters and is sensitive to noise and outliers. Because two clusters
can be merged on the basis of just one or a few close data points, single-linkage can exhibit the
chaining phenomenon. It works well for well-separated clusters, including non-globular ones, but
noise points lying between clusters can cause them to be merged.
[57]: from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
# Perform single-linkage hierarchical clustering and create dendrogram
single_linkage_dendrogram = dendrogram(linkage(data_encoded, method='single'))
# Display the dendrogram
plt.title('Single-Linkage Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()
0.4.1 Complete linkage
Complete-linkage clustering, also known as the furthest-neighbor method, defines the distance between
two clusters as the distance between their farthest points. Compared to single-linkage, it is less
susceptible to noise and outliers and has a tendency to form tight, roughly spherical clusters, which
makes it helpful for locating dense, well-defined clusters in noisy data. It can,
however, break up large clusters and has trouble handling elongated clusters.
[59]: # Perform complete-linkage hierarchical clustering and create dendrogram
complete_linkage_dendrogram = dendrogram(linkage(data_encoded, method='complete'))
# Display the dendrogram
plt.title('Complete-Linkage Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()
Group average Group average clustering defines the distance between two clusters as the average
distance over all pairs of points, one from each cluster. It strikes a balance between the
complete-linkage and single-linkage techniques: it is less susceptible to chaining effects than
single-linkage and less inclined than complete-linkage to break up large clusters. It is frequently
chosen when the data includes a combination of compact and elongated clusters.
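The three inter-cluster distances differ only in how they summarize the pairwise distances between the two clusters. A minimal sketch with two hypothetical clusters on a line (the names `cluster_a` and `cluster_b` are illustrative):

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of 2-D points lying on the x-axis.
cluster_a = np.array([[0.0, 0.0], [1.0, 0.0]])
cluster_b = np.array([[3.0, 0.0], [5.0, 0.0]])

# All pairwise Euclidean distances between members of the two clusters:
# [[3, 5],
#  [2, 4]]
pairwise = cdist(cluster_a, cluster_b)

single_link = pairwise.min()     # closest pair  (MIN)
complete_link = pairwise.max()   # farthest pair (MAX)
group_average = pairwise.mean()  # mean over all pairs

print(single_link, complete_link, group_average)  # 2.0 5.0 3.5
```

The group average always lies between the single-link and complete-link values, which is one way to see why it behaves as a compromise between the two.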
[60]: # Perform group average hierarchical clustering and create dendrogram
average_linkage_dendrogram = dendrogram(linkage(data_encoded, method='average'))
# Display the dendrogram
plt.title('Group Average Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()
0.5 Density Based clustering
Density-based clustering uses the density of data points in the feature space to
determine the locations of clusters. DBSCAN (Density-Based Spatial Clustering of Applications
with Noise) is one of the most widely used density-based clustering techniques. DBSCAN groups
together data points that are densely packed and classifies data points in low-density areas as
outliers. It can find clusters of arbitrary shape, and the number of clusters
need not be specified in advance.
[61]: from sklearn.cluster import DBSCAN
# Initialize the DBSCAN model with appropriate parameters
# `eps` controls the maximum distance between two samples for one to be
# considered as in the neighborhood of the other
# `min_samples` is the number of samples (or total weight) in a neighborhood
# for a point to be considered as a core point
dbscan = DBSCAN(eps=0.5, min_samples=5)
# Fit the DBSCAN model to your preprocessed data
cluster_labels = dbscan.fit_predict(data_encoded)
# Add the cluster labels back to your DataFrame
data_encoded['Cluster'] = cluster_labels
# Check the clusters
print(data_encoded['Cluster'].value_counts())
Cluster
-1    48
Name: count, dtype: int64
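The output above assigns all 48 rows to -1 (noise). With `eps=0.5` and features measured in the hundreds or thousands, no point has enough neighbors within the radius. One common remedy is to standardize the features before running DBSCAN so that `eps` is expressed in units of standard deviations. The sketch below illustrates the effect on hypothetical data with two differently scaled columns; the data, and the choice to reuse `eps=0.5`, are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
# Hypothetical data on very different scales (e.g. a test score and a dollar amount).
group1 = np.column_stack([rng.normal(1200, 10, 30), rng.normal(20000, 300, 30)])
group2 = np.column_stack([rng.normal(1400, 10, 30), rng.normal(60000, 300, 30)])
X_raw = np.vstack([group1, group2])

# On the raw features, eps=0.5 is tiny relative to the spread: everything is noise.
raw_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_raw)

# After standardization (zero mean, unit variance per column) the same eps is meaningful.
X_scaled = StandardScaler().fit_transform(X_raw)
scaled_labels = DBSCAN(eps=0.5, min_samples=5).fit_predict(X_scaled)

print("raw:   ", sorted(set(raw_labels.tolist())))
print("scaled:", sorted(set(scaled_labels.tolist())))
```

On real data `eps` still needs tuning after scaling; a k-distance plot (sorted distance to the `min_samples`-th neighbor) is a frequently used guide.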
[65]: import pandas as pd
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
# Assuming 'data' is your DataFrame containing the features you want to cluster
# For example, if you want to cluster based on 'SAT' and 'Expenditures/Student':
features = ['SAT', 'Expenditures/Student']
X = data[features]
# Instantiate and fit DBSCAN model
# You might need to adjust 'eps' and 'min_samples' based on your data
dbscan = DBSCAN(eps=0.3, min_samples=5)
clusters = dbscan.fit_predict(X)
# Add the cluster labels to the original DataFrame
data['Cluster'] = clusters
# Plotting the clusters
plt.figure(figsize=(8, 6))
for cluster_id in data['Cluster'].unique():
    # Rows belonging to the current cluster
    subset = data.loc[data['Cluster'] == cluster_id]
    if cluster_id == -1:
        # -1 represents noise points in DBSCAN
        plt.scatter(subset['SAT'], subset['Expenditures/Student'],
                    label='Noise', color='gray', alpha=0.5)
    else:
        plt.scatter(subset['SAT'], subset['Expenditures/Student'],
                    label=f'Cluster {cluster_id}')
plt.xlabel('SAT Scores')
plt.ylabel('Expenditures per Student')
plt.title('DBSCAN Clustering')
plt.legend()
plt.show()
0.5.1 Ordering Points To Identify the Clustering Structure, or OPTICS:
Ordering Points To Identify the Clustering Structure, or OPTICS, is a flexible density-based
clustering technique that finds clusters with different densities and shapes in large datasets.
OPTICS does not require the number of clusters to be known in advance,
which makes it especially helpful in situations where the underlying structure of the data is complex
and poorly defined. OPTICS orders the data points according to their reachability distance, which
reveals the underlying clustering structure in the form of a reachability
plot. By using OPTICS, we can avoid assuming anything about the sizes or shapes of the clusters
and instead obtain insights about the natural groups that exist in our data.
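The reachability ordering described above is exposed by scikit-learn through the fitted attributes `ordering_` and `reachability_`. The sketch below computes the values that a reachability plot would display, using hypothetical two-blob data; the blob centers and `min_samples=5` are assumptions for illustration.

```python
import numpy as np
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Hypothetical data: two compact, well-separated blobs.
X_opt, _ = make_blobs(n_samples=120, centers=[[0, 0], [8, 8]],
                      cluster_std=0.6, random_state=0)

optics_demo = OPTICS(min_samples=5).fit(X_opt)

# Reachability distances arranged in the order OPTICS visited the points.
# The first visited point has no predecessor, so its reachability is inf.
reachability = optics_demo.reachability_[optics_demo.ordering_]

finite = reachability[np.isfinite(reachability)]
print("points:", len(reachability))
print("largest finite reachability:", finite.max())
```

Plotting `reachability` as a bar or line chart gives the reachability plot: valleys correspond to dense clusters, and tall spikes mark the jumps between them.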
[70]: from sklearn.cluster import OPTICS
# Assuming you have defined your data matrix X
clusterer = OPTICS(min_samples=5, xi=0.05, min_cluster_size=0.05)
clusters = clusterer.fit_predict(X)
# Print unique cluster labels
print("Unique Cluster Labels:", set(clusters))
Unique Cluster Labels: {0, 1, 2, -1}
0.5.2 Visualize the clusters
[20]: from sklearn.cluster import OPTICS
# Define the feature matrix X (all columns except the earlier cluster labels)
X = data_encoded.drop('Cluster', axis=1)
# Reset the index of the DataFrame
X.reset_index(drop=True, inplace=True)
# Initialize the OPTICS clusterer
clusterer = OPTICS(min_samples=5, xi=0.05, min_cluster_size=0.05)
# Perform clustering
clusters = clusterer.fit_predict(X)
# Plot the first two columns of X, colored by cluster label (-1 marks noise)
plt.figure(figsize=(8, 6))
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=clusters, cmap='viridis', s=50, edgecolors='k')
plt.xlabel(X.columns[0])  # 'Median' in the data shown earlier
plt.ylabel(X.columns[1])  # 'SAT'
plt.title('OPTICS Clustering Result')
plt.colorbar(label='Cluster Label')
plt.show()
0.5.3
Mean Shift:
Mean Shift is a non-parametric clustering approach that finds clusters without assuming a
particular cluster shape. Unlike k-means, it does not require the number of clusters in advance;
instead, the result is controlled by a bandwidth parameter that sets the size of the neighborhood
used to estimate density. Each point is iteratively shifted toward the mean of the points within
its neighborhood, which moves it toward a mode (peak) of the underlying data distribution. Points
that converge to the same mode form a cluster. Because the bandwidth is measured in the units of
the features, it has to match the scale of the data: too small a value yields many tiny clusters,
and too large a value merges distinct groups.
[75]: from sklearn.cluster import MeanShift
clusterer = MeanShift(bandwidth=0.5)
clusters = clusterer.fit_predict(X)
print(clusters)
[42 10 27 23 0 40 6 20 2 26 15 21 34 33 25 37 41 1 22 45 29 13 38 32
3 14 9 12 30 46 39 5 44 0 4 16 7 19 24 17 31 18 8 35 11 28 36 43]
0.5.4
Visualizing the clusters
[79]: import matplotlib.pyplot as plt
from sklearn.cluster import MeanShift
# Assuming you have defined your data matrix X as a pandas DataFrame
clusterer = MeanShift(bandwidth=0.5)
clusters = clusterer.fit_predict(X)
# Plot the first two columns of X, colored by cluster label
plt.figure(figsize=(8, 6))
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=clusters, cmap='viridis', s=50, edgecolors='k')
plt.xlabel(X.columns[0])  # 'Median' in the data shown earlier
plt.ylabel(X.columns[1])  # 'SAT'
plt.title('MeanShift Clustering Result')
plt.colorbar(label='Cluster Label')
plt.show()
0.6
Summary
We applied several clustering approaches to identify underlying patterns in the dataset. We began
with preprocessing: handling missing values, removing duplicates, correcting data types, and
treating outliers. We then used k-means, a partitioning technique, to organize related data points
into disjoint clusters, and used the elbow method to guide the choice of the number of clusters.
Cluster statistics, a PCA projection, and the cluster centroids made the results easier to
interpret. Next, we explored hierarchical clustering and built dendrograms with several linkage
methods, including Ward, single, complete, and group average, which show the hierarchical
relationships between the data points. Finally, we applied density-based clustering with DBSCAN
and looked at OPTICS and Mean Shift, which can recognize clusters of different densities and
shapes without a preset number of clusters. Together, these techniques give complementary views
of the structure present in the dataset.
Clustering
October 25, 2023
0.1
Loading the data
[4]: import pandas as pd
import numpy as np
from scipy.stats import zscore
# Load the CSV file into a pandas DataFrame
data = pd.read_csv('Data.csv')
# Display the first few rows of the DataFrame
print(data.head(10))
                 School Type  Median  SAT  Acceptance Rate  \
Amherst             Lib Arts    1315   22            26636
Barnard             Lib Arts    1220   53            17653
Bates               Lib Arts    1240   36            17554
Berkeley          University    1176   37            23665
Bowdoin             Lib Arts    1300   24            25703
Brown             University    1281   24            24201
Bryn Mawr           Lib Arts    1255   56            18847
Cal Tech          University    1400   31           102262
Carleton            Lib Arts    1300   40            15904
Carnegie Mellon   University    1225   64            33607

                 Expenditures/Student  Top 10% HS Graduation %
Amherst                            85                       93
Barnard                            69                       80
Bates                              58                       88
Berkeley                           95                       68
Bowdoin                            78                       90
Brown                              80                       90
Bryn Mawr                          70                       84
Cal Tech                           98                       75
Carleton                           75                       80
Carnegie Mellon                    52                       77
0.2
Data cleaning
Handle the missing values
[5]: # Option 1: drop rows with any missing values (kept in a separate DataFrame)
data_cleaned = data.dropna()
# Option 2: fill missing values in a specific column (e.g., 'Acceptance Rate')
# with the mean of that column
data['Acceptance Rate'] = data['Acceptance Rate'].fillna(data['Acceptance Rate'].mean())
0.2.1
Removing duplicates
[6]: # Remove duplicate rows based on all columns
data_cleaned = data.drop_duplicates()
# Remove duplicates based on specific columns
# (later cells continue from `data`, so assign the result back to `data`
# if the de-duplicated rows should be used downstream)
data_cleaned = data.drop_duplicates(subset=['School Type', 'Median', 'SAT',
                                            'Acceptance Rate', 'Expenditures/Student',
                                            'Top 10% HS Graduation %'])
0.2.2
Correcting data formats
[7]: # Convert 'Acceptance Rate' to numeric (assuming it contains numerical values)
data['Acceptance Rate'] = pd.to_numeric(data['Acceptance Rate'], errors='coerce')
# Convert 'Expenditures/Student' to numeric (assuming it contains numerical values)
data['Expenditures/Student'] = pd.to_numeric(data['Expenditures/Student'], errors='coerce')
# Convert 'Top 10% HS Graduation %' to numeric (assuming it contains numerical values)
data['Top 10% HS Graduation %'] = pd.to_numeric(data['Top 10% HS Graduation %'],
                                                errors='coerce')
0.2.3
Handling outliers
[8]: # Identify and remove outliers using Z-score for 'Acceptance Rate'
# (.copy() avoids pandas' SettingWithCopyWarning in the assignments below)
data = data[np.abs(zscore(data['Acceptance Rate'])) < 3].copy()
# Replace outliers with the median for 'Expenditures/Student'
data['Expenditures/Student'] = np.where(
    np.abs(zscore(data['Expenditures/Student'])) < 3,
    data['Expenditures/Student'],
    data['Expenditures/Student'].median())
# Replace outliers with the median for 'Top 10% HS Graduation %'
data['Top 10% HS Graduation %'] = np.where(
    np.abs(zscore(data['Top 10% HS Graduation %'])) < 3,
    data['Top 10% HS Graduation %'],
    data['Top 10% HS Graduation %'].median())
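The later cells operate on a DataFrame named `data_encoded`, whose construction does not appear in this notebook export. Its column names (`School Type_Lib Arts`, `School Type_University`) suggest that `School Type` was one-hot encoded. The sketch below shows that step under this assumption, on a small stand-in frame rather than `Data.csv`. The `StandardScaler` step is an optional extra and not something the notebook is known to do; the names `demo`, `demo_encoded`, and `demo_scaled` are hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

# Tiny stand-in frame (three rows taken from the head() output above)
demo = pd.DataFrame(
    {"School Type": ["Lib Arts", "Lib Arts", "University"],
     "Median": [1315, 1220, 1176],
     "SAT": [22, 53, 37],
     "Acceptance Rate": [26636, 17653, 23665]},
    index=["Amherst", "Barnard", "Berkeley"],
)

# Assumption: data_encoded was produced by one-hot encoding 'School Type'
demo_encoded = pd.get_dummies(demo, columns=["School Type"])

# Optional: put the numeric columns on a common scale so that no single
# large-valued column dominates Euclidean distances
numeric_cols = ["Median", "SAT", "Acceptance Rate"]
scaled = pd.DataFrame(
    StandardScaler().fit_transform(demo_encoded[numeric_cols]),
    columns=numeric_cols,
    index=demo_encoded.index,
)
demo_scaled = pd.concat([scaled, demo_encoded.drop(columns=numeric_cols)], axis=1)

print(demo_encoded.columns.tolist())
print(demo_scaled.shape)
```

Distance-based methods such as k-means and DBSCAN weight all columns by their raw magnitude, so whether to scale is a modeling decision that changes the resulting clusters.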
0.3
Clustering analysis
K-means clustering
K-means is an iterative algorithm that divides a dataset into K disjoint (non-overlapping)
clusters. K centroids are first placed in the feature space, typically at random. Each data point
is then assigned to its closest centroid, which forms the clusters, and each centroid is
recomputed as the mean of the points assigned to it. These assignment and update steps repeat
until the centroids stop changing appreciably or a maximum number of iterations is reached. In
doing so, K-means seeks centroids that minimize the within-cluster sum of squares, that is, the
variance inside each cluster. The final output consists of the K centroids and the cluster label
of each data point. The elbow method or the silhouette score can be used to choose a suitable
number of clusters (K).
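The assignment and update steps described above can be sketched directly in NumPy. This is a minimal illustration on hypothetical toy data, not the scikit-learn implementation; `kmeans_sketch` and `X_toy` are names introduced here for the example.

```python
import numpy as np

def kmeans_sketch(X, k, n_iter=100, seed=0):
    """Plain k-means: alternate the assignment and centroid-update steps."""
    rng = np.random.default_rng(seed)
    # Initialize the centroids with k distinct data points chosen at random
    centroids = X[rng.choice(len(X), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: index of the nearest centroid for every point
        dists = np.linalg.norm(X[:, None, :] - centroids[None, :, :], axis=2)
        labels = dists.argmin(axis=1)
        # Update step: mean of the points in each cluster
        # (an empty cluster keeps its previous centroid)
        new_centroids = np.array([
            X[labels == j].mean(axis=0) if np.any(labels == j) else centroids[j]
            for j in range(k)
        ])
        if np.allclose(new_centroids, centroids):
            break  # centroids stopped moving
        centroids = new_centroids
    # Within-cluster sum of squares for the final assignment
    wcss = ((X - centroids[labels]) ** 2).sum()
    return labels, centroids, wcss

# Two well-separated toy groups (hypothetical data)
X_toy = np.array([[1.0, 1.0], [1.2, 0.8], [0.9, 1.1],
                  [8.0, 8.0], [8.2, 7.9], [7.9, 8.1]])
labels, centroids, wcss = kmeans_sketch(X_toy, k=2)
print(labels)
print(centroids)
print(wcss)
```

scikit-learn's `KMeans` adds refinements such as k-means++ initialization and multiple restarts (`n_init`), which is why it is used in the cells of this notebook.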
[18]: from sklearn.cluster import KMeans
# Choose the number of clusters (K) - you can use techniques like the elbow
# method to find the optimal K
k = 3
# Initialize the KMeans model
kmeans = KMeans(n_clusters=k, n_init=10)
# Fit the model to your data_encoded
kmeans.fit(data_encoded)
# Get cluster labels for each data point
cluster_labels = kmeans.labels_
# Add the cluster labels back to your DataFrame
data_encoded['Cluster'] = cluster_labels
# Display the first few rows of the DataFrame with cluster labels
print(data_encoded.head())
          Median  SAT  Acceptance Rate  Expenditures/Student  \
Amherst     1315   22            26636                  85.0
Barnard     1220   53            17653                  69.0
Bates       1240   36            17554                  58.0
Berkeley    1176   37            23665                  95.0
Bowdoin     1300   24            25703                  78.0

          Top 10% HS Graduation %  School Type_Lib Arts  \
Amherst                      93.0                  True
Barnard                      80.0                  True
Bates                        88.0                  True
Berkeley                     68.0                 False
Bowdoin                      90.0                  True

          School Type_University  Cluster
Amherst                    False        0
Barnard                    False        0
Bates                      False        0
Berkeley                    True        0
Bowdoin                    False        0
0.3.1
Aspects of k-means
Elbow method for optimal K
The elbow method plots the distortion (inertia, the within-cluster sum of squares) against the
number of clusters. Inertia always decreases as K grows, so the goal is to find the value of K
beyond which adding more clusters yields only marginal improvement. On the graph, this is the
point where the rate of decrease slows sharply, forming an “elbow.” Choosing K there balances a
compact description of the data against splitting it into more clusters than its structure
supports. The sharper the elbow, the clearer the choice of K.
[46]: import warnings
import matplotlib.pyplot as plt
warnings.simplefilter(action='ignore', category=FutureWarning)
# Fit k-means for K = 1..9 and record the inertia (within-cluster sum of squares)
distortions = []
K = range(1, 10)
for k in K:
    kmeanModel = KMeans(n_clusters=k, n_init=10)
    kmeanModel.fit(data_encoded)
    distortions.append(kmeanModel.inertia_)
plt.figure(figsize=(16, 8))
plt.plot(K, distortions, 'bx-')
plt.xlabel('k')
plt.ylabel('Distortion')
plt.title('The Elbow Method showing the optimal k')
plt.show()
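The silhouette score, mentioned earlier as an alternative to the elbow method, can be computed for a range of K values in much the same way. The sketch below uses synthetic data from `make_blobs` as a stand-in for `data_encoded`; the variable names and the range of K are assumptions made for illustration.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Synthetic stand-in data with three well-separated groups
X_demo, _ = make_blobs(n_samples=150, centers=[[0, 0], [10, 10], [-10, 10]],
                       cluster_std=1.0, random_state=0)

scores = {}
for k in range(2, 7):  # the silhouette score needs at least two clusters
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X_demo)
    scores[k] = silhouette_score(X_demo, labels)

best_k = max(scores, key=scores.get)
print(scores)
print("K with the highest silhouette score:", best_k)
```

Unlike inertia, the silhouette score does not improve automatically as K grows, so its maximum can be read off directly instead of searching for an elbow.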
Cluster statistics
Computing statistics such as the mean and median of each attribute within each cluster shows what
characterizes the clusters. The mean describes the average behavior of the points in a cluster,
while the median describes the central tendency and is less influenced by outliers. Comparing
these statistics across clusters makes the clusters easier to compare and interpret.
[48]: cluster_stats = data_encoded.groupby('Cluster').agg(['mean', 'median'])
print(cluster_stats)
              Median               SAT        Acceptance Rate           \
                mean  median       mean median            mean   median
Cluster
0        1335.000000  1350.0  25.571429   19.0    51076.428571  48123.0
1        1246.235294  1248.5  39.470588   37.0    22194.470588  22077.0
2        1253.571429  1280.0  45.000000   45.0    36935.285714  38597.0

        Expenditures/Student        Top 10% HS Graduation %         \
                        mean median                    mean median
Cluster
0                  86.142857   90.0               89.428571   90.0
1                  71.264706   72.0               83.632353   85.0
2                  73.142857   74.0               79.857143   77.0

        School Type_Lib Arts        School Type_University
                        mean median                   mean median
Cluster
0                   0.000000    0.0               1.000000    1.0
1                   0.735294    1.0               0.264706    0.0
2                   0.000000    0.0               1.000000    1.0
Visualizing clusters
Visualization helps build an intuitive understanding of how the clusters are distributed. Applying
PCA to reduce the data to two (or three) dimensions lets us plot the clusters in 2D (or 3D). Each
dot in the figure represents a data point, and its color indicates the cluster to which it
belongs. Such plots often expose patterns that are difficult to see in the raw data.
[49]: from sklearn.decomposition import PCA
pca = PCA(n_components=2) # Change to 3 for 3D visualization
reduced_features = pca.fit_transform(data_encoded)
plt.scatter(reduced_features[:, 0], reduced_features[:, 1], c=cluster_labels, cmap='viridis')
plt.title('PCA - K-means Clustering')
plt.show()
Cluster profiling
Cluster profiling examines the cluster centroids, which hold the average value of each attribute
within a cluster, to determine the key characteristics of each cluster. Knowing these
characteristics makes it easier to name and interpret the clusters meaningfully, and it lets
stakeholders use the cluster profiles to inform their decisions.
[53]: k = 8
kmeans = KMeans(n_clusters=k, n_init=10)
kmeans.fit(data_encoded)
cluster_centers = kmeans.cluster_centers_
# cluster_centers has shape (k, 8): one row per cluster and one column per column of
# data_encoded, which at this point includes the 'Cluster' labels added earlier
print(cluster_centers)
[[1.23568750e+03 3.99375000e+01 1.88721250e+04 6.57500000e+01
8.26250000e+01 8.75000000e-01 1.25000000e-01 1.00000000e+00]
[1.31700000e+03 2.80000000e+01 4.65950000e+04 8.15000000e+01
8.97500000e+01 0.00000000e+00 1.00000000e+00 0.00000000e+00]
[1.26544444e+03 3.58888889e+01 2.68982222e+04 7.98888889e+01
8.66111111e+01 6.66666667e-01 3.33333333e-01 1.00000000e+00]
[1.35350000e+03 2.45000000e+01 5.46170000e+04 9.25000000e+01
8.95000000e+01 0.00000000e+00 1.00000000e+00 0.00000000e+00]
[1.25250000e+03 5.25000000e+01 3.22445000e+04 6.95000000e+01
8.15000000e+01 0.00000000e+00 1.00000000e+00 2.00000000e+00]
[1.37000000e+03 1.80000000e+01 6.19210000e+04 9.20000000e+01
8.80000000e+01 0.00000000e+00 1.00000000e+00 0.00000000e+00]
[1.25400000e+03 4.20000000e+01 3.88116000e+04 7.46000000e+01
7.92000000e+01 0.00000000e+00 1.00000000e+00 2.00000000e+00]
[1.24577778e+03 4.22222222e+01 2.33971111e+04 7.24444444e+01
8.24444444e+01 5.55555556e-01 4.44444444e-01 1.00000000e+00]]
0.4
Hierarchical clustering
Hierarchical clustering builds a hierarchy of clusters that can be drawn as a tree-like structure
called a dendrogram. The agglomerative variant used here starts with each data point as its own
cluster and repeatedly merges the two closest clusters; the divisive variant starts with a single
cluster and repeatedly splits it. The final clusters are determined by where the dendrogram is
cut. Unlike k-means, hierarchical clustering does not need a predetermined number of clusters, so
it is very helpful when the data is hierarchical by nature or when the number of clusters is
unknown in advance. The dendrogram shows the links between clusters and lets analysts choose the
number of clusters based on the structure of the data. Compared with k-means, however, it can be
computationally demanding for large datasets, and the choice of where to cut can be subjective.
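Cutting the dendrogram to obtain flat cluster labels can be done with SciPy's `fcluster`. The sketch below uses hypothetical one-dimensional toy data and shows two common ways to cut: by a target number of clusters and by a height on the distance axis. The threshold values are chosen for this toy data only.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

# Hypothetical toy data: two tight groups on a line
X_toy = np.array([[0.0], [0.2], [0.4], [5.0], [5.2], [5.4]])

Z = linkage(X_toy, method='average')

# Cut so that at most two flat clusters remain
labels_k = fcluster(Z, t=2, criterion='maxclust')

# Alternatively, cut at a fixed height on the dendrogram's distance axis
labels_d = fcluster(Z, t=1.0, criterion='distance')

print(labels_k)
print(labels_d)
```

Both cuts separate the two groups here because the groups merge only at a height far above the within-group merge distances.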
[54]: from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
# Perform hierarchical clustering with Ward linkage
hierarchical_clusters = linkage(data_encoded, method='ward')
# Plot the dendrogram, labeling each leaf with its row index
plt.figure(figsize=(12, 8))
dendrogram(hierarchical_clusters, labels=data_encoded.index.to_list(), leaf_rotation=90)
plt.title('Hierarchical Clustering Dendrogram')
plt.xlabel('Universities')
plt.ylabel('Distance')
plt.show()
Single-linkage clustering
Single-linkage clustering, also known as the nearest-neighbor or minimum method, defines the
distance between two clusters as the distance between their closest points. It can follow
elongated or irregularly shaped clusters, but it is sensitive to noise and outliers. Because two
clusters can be merged on the strength of a single pair of nearby points, single linkage is prone
to the chaining phenomenon, in which clusters are strung together into long chains. It works best
when the clusters are well separated.
[57]: from scipy.cluster.hierarchy import linkage, dendrogram
import matplotlib.pyplot as plt
# Perform single-linkage hierarchical clustering and create dendrogram
single_linkage_dendrogram = dendrogram(linkage(data_encoded, method='single'))
# Display the dendrogram
plt.title('Single-Linkage Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()
0.4.1
Complete linkage
Complete-linkage clustering, also known as the furthest-neighbor method, defines the distance
between two clusters as the distance between their farthest points. Compared with single linkage,
it is less susceptible to noise and outliers and tends to form tight, roughly spherical clusters.
It can, however, break up large clusters and has trouble with elongated ones.
[59]: # Perform complete-linkage hierarchical clustering and create dendrogram
complete_linkage_dendrogram = dendrogram(linkage(data_encoded, method='complete'))
# Display the dendrogram
plt.title('Complete-Linkage Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()
Group average
Group average clustering defines the distance between two clusters as the average of the distances
between every pair of points drawn one from each cluster. It strikes a balance between single
linkage and complete linkage: it is less susceptible to chaining and noise than single linkage and
less inclined than complete linkage to break up large clusters. It is often chosen when the data
contains a mix of compact and elongated clusters.
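The three linkage criteria differ only in how they summarize the pairwise distances between two clusters. The sketch below computes all three for two hypothetical clusters `A` and `B`, using SciPy's `cdist` for the pairwise Euclidean distances.

```python
import numpy as np
from scipy.spatial.distance import cdist

# Two hypothetical clusters of 2-D points
A = np.array([[0.0, 0.0], [1.0, 0.0]])
B = np.array([[4.0, 0.0], [6.0, 0.0]])

# Euclidean distance between every point in A and every point in B
pairwise = cdist(A, B)

single = pairwise.min()     # single linkage: closest pair
complete = pairwise.max()   # complete linkage: farthest pair
average = pairwise.mean()   # group average: mean over all pairs

print(single, complete, average)  # 3.0 6.0 4.5
```

The group average always lies between the single-linkage and complete-linkage distances, which is one way to see why it behaves as a compromise between the two.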
[60]: # Perform group average hierarchical clustering and create dendrogram
average_linkage_dendrogram = dendrogram(linkage(data_encoded, method='average'))
# Display the dendrogram
plt.title('Group Average Hierarchical Clustering Dendrogram')
plt.xlabel('Sample Index')
plt.ylabel('Distance')
plt.show()
0.5
Density-based clustering
Density-based clustering locates clusters as regions of the feature space where data points are
densely packed, separated by regions of low density. DBSCAN (Density-Based Spatial Clustering of
Applications with Noise) is one of the most widely used techniques of this kind. It groups densely
packed points into clusters and labels points in low-density areas as noise (outliers). DBSCAN can
find clusters of arbitrary shape and does not need the number of clusters in advance; instead, it
requires a neighborhood radius (eps) and a minimum number of points (min_samples).
[61]: from sklearn.cluster import DBSCAN
# Initialize the DBSCAN model with appropriate parameters
# `eps` is the maximum distance between two samples for one to be considered
# as in the neighborhood of the other
# `min_samples` is the number of samples (or total weight) in a neighborhood
# for a point to be considered a core point
dbscan = DBSCAN(eps=0.5, min_samples=5)
# Fit the DBSCAN model to your preprocessed data
cluster_labels = dbscan.fit_predict(data_encoded)
# Add the cluster labels back to your DataFrame
data_encoded['Cluster'] = cluster_labels
# Check the clusters
print(data_encoded['Cluster'].value_counts())
Cluster
-1    48
Name: count, dtype: int64
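In the output above, all 48 points receive the label -1, meaning DBSCAN treated every point as noise. This typically indicates that `eps` is too small for the scale of the features. A common heuristic is to standardize the features and inspect the sorted distance from each point to its `min_samples`-th nearest neighbor. The sketch below illustrates this on synthetic data; the 95th-percentile rule is a crude stand-in for reading the "knee" off a plot, and all names and values are assumptions made for illustration.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs
from sklearn.neighbors import NearestNeighbors
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data: two groups, stretched to very different feature scales
X_unit, _ = make_blobs(n_samples=120, centers=[[-3, -3], [3, 3]],
                       cluster_std=0.6, random_state=0)
X_raw = X_unit * np.array([30.0, 5000.0]) + np.array([1250.0, 30000.0])

# Standardize so that both features contribute comparably to distances
X_scaled = StandardScaler().fit_transform(X_raw)

min_samples = 5
# Distance from each point to its min_samples-th nearest neighbor
# (the query point itself is returned as the first neighbor)
nn = NearestNeighbors(n_neighbors=min_samples).fit(X_scaled)
distances, _ = nn.kneighbors(X_scaled)
k_dist = np.sort(distances[:, -1])

# Illustrative choice only: normally one plots k_dist and picks the knee
eps = np.percentile(k_dist, 95)

labels = DBSCAN(eps=eps, min_samples=min_samples).fit_predict(X_scaled)
n_clusters = len(set(labels.tolist()) - {-1})
n_noise = int((labels == -1).sum())
print("eps:", round(float(eps), 3))
print("clusters found:", n_clusters, "noise points:", n_noise)
```

With `eps` tied to the observed neighbor distances, most points become core points instead of noise, regardless of the original units of the features.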
[65]: import pandas as pd
from sklearn.cluster import DBSCAN
import matplotlib.pyplot as plt
# Assuming 'data' is your DataFrame containing the features you want to cluster
# For example, if you want to cluster based on 'SAT' and 'Expenditures/Student':
features = ['SAT', 'Expenditures/Student']
X = data[features]
# Instantiate and fit DBSCAN model
# You might need to adjust 'eps' and 'min_samples' based on your data
dbscan = DBSCAN(eps=0.3, min_samples=5)
clusters = dbscan.fit_predict(X)
# Add the cluster labels to the original DataFrame
data['Cluster'] = clusters
# Plot each cluster; DBSCAN labels noise points as -1
plt.figure(figsize=(8, 6))
for cluster_id in data['Cluster'].unique():
    members = data['Cluster'] == cluster_id
    if cluster_id == -1:
        plt.scatter(data.loc[members, 'SAT'],
                    data.loc[members, 'Expenditures/Student'],
                    label='Noise', color='gray', alpha=0.5)
    else:
        plt.scatter(data.loc[members, 'SAT'],
                    data.loc[members, 'Expenditures/Student'],
                    label=f'Cluster {cluster_id}')
plt.xlabel('SAT Scores')
plt.ylabel('Expenditures per Student')
plt.title('DBSCAN Clustering')
plt.legend()
plt.show()
0.5.1
Ordering Points To Identify the Clustering Structure (OPTICS)
OPTICS (Ordering Points To Identify the Clustering Structure) is a density-based clustering
technique that can find clusters of differing densities and arbitrary shapes. Like DBSCAN, it does
not need the number of clusters in advance, but it does not rely on a single global density
threshold, which helps when the structure of the data is complex or poorly defined. OPTICS
processes the points in an order that keeps neighboring points close together and records a
reachability distance for each one. Plotting these distances in that order gives a reachability
plot, in which valleys correspond to dense regions (clusters) and peaks mark the sparser gaps
between them. This lets us explore the natural groups in the data without assuming particular
cluster sizes or shapes.
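A reachability plot can be drawn from the `reachability_` and `ordering_` attributes of a fitted scikit-learn `OPTICS` model. The sketch below uses synthetic data with two groups of different density as a stand-in for the notebook's feature matrix; the data and variable names are assumptions made for illustration.

```python
import numpy as np
import matplotlib.pyplot as plt
from sklearn.cluster import OPTICS
from sklearn.datasets import make_blobs

# Synthetic stand-in data: two groups with different densities
X_demo, _ = make_blobs(n_samples=[60, 60], centers=[[0, 0], [8, 8]],
                       cluster_std=[0.4, 1.2], random_state=0)

optics = OPTICS(min_samples=5, xi=0.05, min_cluster_size=0.05).fit(X_demo)

# Reachability distances arranged in the order in which OPTICS visited the points
reachability = optics.reachability_[optics.ordering_]
ordered_labels = optics.labels_[optics.ordering_]

# The first visited point has infinite reachability, so plot finite values only
finite = np.isfinite(reachability)
positions = np.arange(len(reachability))

plt.figure(figsize=(8, 4))
plt.scatter(positions[finite], reachability[finite], c=ordered_labels[finite],
            cmap='viridis', s=10)
plt.xlabel('Position in OPTICS ordering')
plt.ylabel('Reachability distance')
plt.title('Reachability plot')
plt.show()
```

Valleys in the plot correspond to dense groups, and the deeper valley belongs to the tighter of the two groups.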
[70]: from sklearn.cluster import OPTICS
# Assuming you have defined your data matrix X
clusterer = OPTICS(min_samples=5, xi=0.05, min_cluster_size=0.05)
clusters = clusterer.fit_predict(X)
# Print unique cluster labels
print("Unique Cluster Labels:", set(clusters))
Unique Cluster Labels: {0, 1, 2, -1}
0.5.2
Visualize the clusters
[20]: from sklearn.cluster import OPTICS
# Define the feature matrix X (all columns except the earlier cluster labels)
X = data_encoded.drop('Cluster', axis=1)
# Reset the index of the DataFrame
X.reset_index(drop=True, inplace=True)
# Initialize the OPTICS clusterer
clusterer = OPTICS(min_samples=5, xi=0.05, min_cluster_size=0.05)
# Perform clustering
clusters = clusterer.fit_predict(X)
# Plot the first two columns of X, colored by cluster label (-1 marks noise)
plt.figure(figsize=(8, 6))
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=clusters, cmap='viridis', s=50, edgecolors='k')
plt.xlabel(X.columns[0])  # 'Median' in the data shown earlier
plt.ylabel(X.columns[1])  # 'SAT'
plt.title('OPTICS Clustering Result')
plt.colorbar(label='Cluster Label')
plt.show()
0.5.3
Mean Shift:
Mean Shift is a non-parametric clustering approach that finds clusters without assuming a
particular cluster shape. Unlike k-means, it does not require the number of clusters in advance;
instead, the result is controlled by a bandwidth parameter that sets the size of the neighborhood
used to estimate density. Each point is iteratively shifted toward the mean of the points within
its neighborhood, which moves it toward a mode (peak) of the underlying data distribution. Points
that converge to the same mode form a cluster. Because the bandwidth is measured in the units of
the features, it has to match the scale of the data: too small a value yields many tiny clusters,
and too large a value merges distinct groups.
[75]: from sklearn.cluster import MeanShift
clusterer = MeanShift(bandwidth=0.5)
clusters = clusterer.fit_predict(X)
print(clusters)
[42 10 27 23 0 40 6 20 2 26 15 21 34 33 25 37 41 1 22 45 29 13 38 32
3 14 9 12 30 46 39 5 44 0 4 16 7 19 24 17 31 18 8 35 11 28 36 43]
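The labels printed above run from 0 to 46 for 48 points, so almost every point ends up in its own cluster. This suggests that `bandwidth=0.5` is small relative to the scale of the features. scikit-learn provides `estimate_bandwidth` to derive a bandwidth from the data. The sketch below shows its use on standardized synthetic data; the data and the `quantile=0.2` setting are assumptions made for illustration.

```python
from sklearn.cluster import MeanShift, estimate_bandwidth
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in data with three groups
X_demo, _ = make_blobs(n_samples=150, centers=[[0, 0], [6, 6], [-6, 6]],
                       cluster_std=0.8, random_state=0)
X_scaled = StandardScaler().fit_transform(X_demo)

# Estimate a bandwidth from the pairwise distances in the data
bandwidth = estimate_bandwidth(X_scaled, quantile=0.2, random_state=0)

ms = MeanShift(bandwidth=bandwidth).fit(X_scaled)
n_clusters = len(ms.cluster_centers_)
print("bandwidth:", round(float(bandwidth), 3))
print("clusters found:", n_clusters)
```

A smaller `quantile` gives a smaller bandwidth and therefore more clusters, so it is worth trying a few values and comparing the results.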
0.5.4
Visualizing the clusters
[79]: import matplotlib.pyplot as plt
from sklearn.cluster import MeanShift
# Assuming you have defined your data matrix X as a pandas DataFrame
clusterer = MeanShift(bandwidth=0.5)
clusters = clusterer.fit_predict(X)
# Plot the first two columns of X, colored by cluster label
plt.figure(figsize=(8, 6))
plt.scatter(X.iloc[:, 0], X.iloc[:, 1], c=clusters, cmap='viridis', s=50, edgecolors='k')
plt.xlabel(X.columns[0])  # 'Median' in the data shown earlier
plt.ylabel(X.columns[1])  # 'SAT'
plt.title('MeanShift Clustering Result')
plt.colorbar(label='Cluster Label')
plt.show()
0.6
Summary
We applied several clustering approaches to identify underlying patterns in the dataset. We began
with preprocessing: handling missing values, removing duplicates, correcting data types, and
treating outliers. We then used k-means, a partitioning technique, to organize related data points
into disjoint clusters, and used the elbow method to guide the choice of the number of clusters.
Cluster statistics, a PCA projection, and the cluster centroids made the results easier to
interpret. Next, we explored hierarchical clustering and built dendrograms with several linkage
methods, including Ward, single, complete, and group average, which show the hierarchical
relationships between the data points. Finally, we applied density-based clustering with DBSCAN
and looked at OPTICS and Mean Shift, which can recognize clusters of different densities and
shapes without a preset number of clusters. Together, these techniques give complementary views
of the structure present in the dataset.