
A Practical Guide To Doing K-Means Clustering

11 min read · Oct 23, 2023

You are Chief Marketing Officer of Walmart. Millions of customers with different purchasing behaviours buy your products daily.

In an ideal world, you’d have an unlimited budget to target every specific customer segment and increase their spending. Unfortunately, you have limited budget and time, so you need to focus on the customer segments that are most receptive to your marketing campaigns. How should you go about segmenting these customers into groups?

One way is to build a K-means clustering model. In data science, K-means clustering is an unsupervised machine learning algorithm that finds coherent groups in large datasets. The classic example of K-means clustering put to work in a real-world business use case is customer segmentation.

In this article, I use Python to apply a K-means clustering model to a real-world dataset of grocery store customers. Let’s dive in.

Getting started: Data cleaning

Here’s our dataset. I load it as df0.
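If you want to follow along, a minimal loading sketch is below. The file name and tab separator are assumptions, so point it at wherever your copy of the data lives. These imports (pandas, numpy, matplotlib, seaborn) are used throughout the rest of the article.

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Load the grocery store customer data (file name and separator are assumptions)
df0 = pd.read_csv('marketing_campaign.csv', sep='\t')
print(df0.shape)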

A quick look shows 2240 entries across 29 variables.

For quick reference, here’s what each datapoint means:


Before exploring any data, we check for duplicates and missing values.

df0.isna().sum()
-------------------------------------------------------------------------------
ID 0
Year_Birth 0
Education 0
Marital_Status 0
Income 24
Kidhome 0
Teenhome 0
Dt_Customer 0
Recency 0
MntWines 0
MntFruits 0
MntMeatProducts 0
MntFishProducts 0
MntSweetProducts 0
MntGoldProds 0
NumDealsPurchases 0
NumWebPurchases 0
NumCatalogPurchases 0
NumStorePurchases 0
NumWebVisitsMonth 0
AcceptedCmp3 0
AcceptedCmp4 0
AcceptedCmp5 0
AcceptedCmp1 0
AcceptedCmp2 0
Complain 0
Z_CostContact 0
Z_Revenue 0
Response 0
dtype: int64
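For completeness, the duplicate check is a one-liner in pandas:

# Count fully duplicated rows
print(df0.duplicated().sum())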

There are no duplicates but 24 missing values, a relatively small number, so we’ll drop them.

df0 = df0.dropna()
print("Total number of data-points after removing missing values:", len(df0))
-------------------------------------------------------------------------------
Total number of data-points after removing missing values: 2216

Exploratory data analysis

Now we do some exploratory data analysis. There are two categorical variables in this dataset: Marital_Status and Education. A quick look finds that the data was collected in a messy manner, with nonsensical data input values like YOLO and Absurd.

print("Total categories in the feature Marital_Status:\n", df0["Marital_Status"].value_counts(), "\n")
print("Total categories in the feature Education:\n", df0["Education"].value_counts())
Total categories in the feature Marital_Status:
Married 857
Together 573
Single 471
Divorced 232
Widow 76
Alone 3
Absurd 2
YOLO 2
Name: Marital_Status, dtype: int64

Total categories in the feature Education:
Graduation 1116
PhD 481
Master 365
2n Cycle 200
Basic 54
Name: Education, dtype: int64

So let’s reorganize them into more concise groupings. For Marital_Status, I regroup the values into Partner and Single. For Education, I regroup into High School, Undergraduate and Postgraduate.

education_mapping = {
    'Basic': 'High School',
    '2n Cycle': 'Undergraduate',
    'Graduation': 'Undergraduate',
    'Master': 'Postgraduate',
    'PhD': 'Postgraduate'
}

marital_status_mapping = {
    'Married': 'Partner',
    'Together': 'Partner',
    'Single': 'Single',
    'Divorced': 'Single',
    'Widow': 'Single',
    'Alone': 'Single',
    'Absurd': 'Single',
    'YOLO': 'Single'
}

df0['Education'] = df0['Education'].map(education_mapping)
df0['Marital_Status'] = df0['Marital_Status'].map(marital_status_mapping)

education_count_updated = df0['Education'].value_counts()
marital_status_count_updated = df0['Marital_Status'].value_counts()

education_count_updated, marital_status_count_updated
(Undergraduate    1316
Postgraduate 846
High School 54
Name: Education, dtype: int64,
Partner 1430
Single 786
Name: Marital_Status, dtype: int64)

Feature engineering

Now that we’ve handled missing values and tidied up our categorical values, we can move on to feature engineering for better K-means clustering later on.

  1. First, I create a new feature Age with the Year_Birth variable.
  2. Second, I create a new feature Total_Spending by summing up all spending on different product categories (MntWines, MntFruits, MntMeatProducts, MntFishProducts, MntSweetProducts, MntGoldProds).
  3. Third, we have a variable Dt_Customer that captures when each customer started shopping with the grocery store. Using that, I create Loyalty_Years as a proxy for the number of years since the customer’s first purchase.
  4. Fourth, using the data from all the boolean variables of marketing campaigns accepted, I create Total_Campaigns_Accepted to see how receptive to marketing promotions the customer is.
  5. Fifth, I create a feature Children by summing up data in Kidhome and Teenhome.
  6. Sixth, I create a binary feature Parenthood to indicate whether the customer is a parent (1) or not (0).
  7. Seventh, I create Avg_Spending_Per_Visit by dividing the Total_Spending (created above) by total number of purchases.
  8. Finally, we want to see what types of product categories customers are shopping in most. I create Spending_Ratios to indicate the ratio of spending on each product category to total spending.
from datetime import datetime
current_year = datetime.now().year

# 1. Age
df0['Age'] = current_year - df0['Year_Birth']

# 2. Total Spending
spending_columns = ['MntWines', 'MntFruits', 'MntMeatProducts', 'MntFishProducts', 'MntSweetProducts', 'MntGoldProds']
df0['Total_Spending'] = df0[spending_columns].sum(axis=1)

# 3. Customer Loyalty in years
df0['Dt_Customer'] = pd.to_datetime(df0['Dt_Customer'])
df0['Loyalty_Years'] = (datetime.now() - df0['Dt_Customer']).dt.days / 365

# 4. Total Campaigns Accepted
campaign_columns = ['AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5', 'Response']
df0['Total_Campaigns_Accepted'] = df0[campaign_columns].sum(axis=1)

# 5. Children
df0['Children'] = df0['Kidhome'] + df0['Teenhome']

# 6. Parenthood
df0['Parenthood'] = (df0['Children'] > 0).astype(int)

# 7. Average Spending per Visit
df0['Avg_Spending_Per_Visit'] = df0['Total_Spending'] / (df0['NumDealsPurchases'] + df0['NumWebPurchases'] +
                                                         df0['NumCatalogPurchases'] + df0['NumStorePurchases'] + 1)  # +1 to avoid division by zero

# 8. Spending Ratios for Different Product Categories
for column in spending_columns:
    df0[column + '_Ratio'] = df0[column] / df0['Total_Spending']
    df0[column + '_Ratio'] = df0[column + '_Ratio'].fillna(0)  # Handle division by zero for customers with zero total spending

df0.head()

Our dataset has been expanded with thirteen new variables: seven engineered features plus six spending-ratio columns.

K-means clustering is susceptible to outliers, so now is a good time to check our continuous variables for them. Seven require closer examination: Age, Income, Total_Spending, Loyalty_Years, Avg_Spending_Per_Visit, Total_Campaigns_Accepted and Children. Let’s make some boxplots.

features_to_plot = ['Age',
                    'Income',
                    'Total_Spending',
                    'Loyalty_Years',
                    'Avg_Spending_Per_Visit',
                    'Total_Campaigns_Accepted',
                    'Children',
                    ]

plt.figure(figsize=(15, 10))

for i, feature in enumerate(features_to_plot, 1):
    plt.subplot(3, 3, i)
    sns.boxplot(x=df0[feature])
    plt.title(f'Boxplot of {feature}')

plt.tight_layout()
plt.show()

Outliers abound in many of our continuous variables, but only the ones in Age seem patently illogical, because they sit above the 100-year mark. It’s unlikely we have 120-year-old customers; these are probably data entry errors.

Another outlier that sticks out is in the Income variable, with a datapoint above $600,000, possibly a very wealthy person shopping at this supermarket.

Let’s check how many of these outliers there are.

print(df0[df0['Age'] > 100]['Age'].unique())
print(df0[df0['Income'] > 200000]['Income'].unique())
[123 130 124]
[666666.]

There are three of these outliers in Age and one in Income. I’ll remove those in Age only.

df0 = df0[df0['Age'] <= 100]

Now we can plot a correlation chart to see the relationships in our continuous variables.

continuous_vars = [
    'Age', 'Income', 'Recency', 'MntWines', 'MntFruits', 'MntMeatProducts',
    'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases',
    'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
    'Total_Spending', 'Loyalty_Years', 'Avg_Spending_Per_Visit'
]

correlation_matrix = df0[continuous_vars].corr()

plt.figure(figsize=(15, 10))
sns.heatmap(correlation_matrix, annot=True, fmt=".2f", cmap="coolwarm", linewidths=.5)
plt.title('Correlation Matrix')
plt.xticks(rotation=45)
plt.show()

A quick observation: There is a strong positive correlation between Income and Total_Spending (0.67), indicating that higher-income customers tend to spend more. Total_Spending is also positively correlated with the number of web, catalog, and store purchases, indicating that customers who spend more tend to buy across multiple channels.

These relationships are a good sign that meaningful, natural clusters exist among the store’s customers.
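If you’d rather rank the strongest pairs than eyeball the heatmap, here’s a quick sketch:

# Rank feature pairs by absolute correlation (upper triangle only, to avoid duplicate pairs)
upper = np.triu(np.ones(correlation_matrix.shape, dtype=bool), k=1)
corr_pairs = correlation_matrix.where(upper).stack().sort_values(key=abs, ascending=False)
print(corr_pairs.head(10))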

Data preprocessing

Before moving on to label encoding and scaling, several variables can be dropped as they have been transformed into new features and are now redundant. I drop them here.

columns_to_drop = [
    'Year_Birth', 'Kidhome', 'Teenhome', 'Dt_Customer', 'Z_CostContact',
    'Z_Revenue', 'ID', 'AcceptedCmp1', 'AcceptedCmp2', 'AcceptedCmp3',
    'AcceptedCmp4', 'AcceptedCmp5', 'Response'
]

df0 = df0.drop(columns=columns_to_drop)

In our dataset, there are two categorical variables that need to be label encoded, since K-means clustering deals only with numerical values: Marital_Status and Education. I label encode them here into a new dataframe df_enc.

from sklearn.preprocessing import LabelEncoder

# Creating a copy of the DataFrame to ensure the original DataFrame df0 remains unchanged
df_enc = df0.copy()

# Initializing label encoders
le_education = LabelEncoder()
le_marital_status = LabelEncoder()

# Encoding 'Education' and 'Marital_Status'
df_enc['Education'] = le_education.fit_transform(df0['Education'])
df_enc['Marital_Status'] = le_marital_status.fit_transform(df0['Marital_Status'])

df_enc[['Education', 'Marital_Status']].head()
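One caveat: LabelEncoder imposes an arbitrary 0/1/2 ordering on Education, and K-means will treat those gaps as real distances. If you want to avoid that, one-hot encoding is a common alternative. This is only a sketch; the rest of this walkthrough sticks with the label-encoded df_enc.

# Alternative: one-hot encode instead of label encoding (not used below)
df_onehot = pd.get_dummies(df0, columns=['Education', 'Marital_Status'], drop_first=True)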

After label encoding, we are left with a dataset of purely numerical values. K-means is sensitive to the scale of features, so we need to scale our data. I do so using scikit-learn’s StandardScaler.

from sklearn.preprocessing import StandardScaler

continuous_vars_to_scale = [
    'Age', 'Income', 'Recency', 'MntWines', 'MntFruits', 'MntMeatProducts',
    'MntFishProducts', 'MntSweetProducts', 'MntGoldProds', 'NumDealsPurchases',
    'NumWebPurchases', 'NumCatalogPurchases', 'NumStorePurchases', 'NumWebVisitsMonth',
    'Total_Spending', 'Loyalty_Years', 'Avg_Spending_Per_Visit', 'MntWines_Ratio',
    'MntFruits_Ratio', 'MntMeatProducts_Ratio', 'MntFishProducts_Ratio',
    'MntSweetProducts_Ratio', 'MntGoldProds_Ratio'
]

scaler = StandardScaler()
df_enc[continuous_vars_to_scale] = scaler.fit_transform(df_enc[continuous_vars_to_scale])

df_enc[continuous_vars_to_scale].head()

Here’s our resulting dataset:

We are almost at the point of constructing the K-means model. Before doing so, however, consider that our dataset is rather wide, with quite a number of features. Much of the data is also correlated and possibly redundant.

Should we perform PCA (principal component analysis) to reduce the dimensionality of our dataset? This can help to reduce the noise and improve the clustering effectiveness by discarding components with lower variance.

To decide, let’s plot the cumulative explained variance ratio and see how many components we’d need.

from sklearn.decomposition import PCA

pca = PCA()
pca_data = pca.fit_transform(df_enc)

# Plotting the explained variance ratio for each principal component
plt.figure(figsize=(10, 6))
plt.plot(range(1, len(pca.explained_variance_ratio_) + 1), np.cumsum(pca.explained_variance_ratio_), marker='o', linestyle='--')
plt.title('Explained Variance by Different Principal Components')
plt.xlabel('Number of Components')
plt.ylabel('Cumulative Explained Variance')
plt.grid(True)
plt.show()

# Displaying the explained variance ratios
explained_variance_ratios = pca.explained_variance_ratio_
cumulative_explained_variance = np.cumsum(explained_variance_ratios)

explained_variance_ratios, cumulative_explained_variance

The idea here is to look for an “elbow point” in the plot, beyond which additional components add little extra explanatory value. Based on our plot, about 80% of the variance is explained by the first 10 components, and ~100% of the variance is explained by the first 26 components.
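If you’d rather let scikit-learn pick the component count for a target variance instead of reading it off the chart, n_components also accepts a float:

# Keep the smallest number of components that explains at least 80% of the variance
pca_80 = PCA(n_components=0.80)
pca_80.fit(df_enc)
print(pca_80.n_components_)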

At the risk of sacrificing some explained variance, let’s apply PCA and retain 10 principal components.

# Applying PCA with 10 principal components and renaming the resulting DataFrame to df_enc_pca
pca_10 = PCA(n_components=10)
df_enc_pca = pca_10.fit_transform(df_enc)

# Converting the PCA output to a DataFrame
df_enc_pca = pd.DataFrame(df_enc_pca, columns=[f'PC{i+1}' for i in range(10)])

df_enc_pca.head()

Here’s what the transformed data looks like after PCA:

A quick 3D visualization using just 3 principal components:

from mpl_toolkits.mplot3d import Axes3D

# Creating a 3D plot of the first three principal components
fig = plt.figure(figsize=(10, 7))
ax = fig.add_subplot(111, projection='3d')

ax.scatter(df_enc_pca['PC1'], df_enc_pca['PC2'], df_enc_pca['PC3'], c='green', marker='o', alpha=0.5)

ax.set_xlabel('Principal Component 1')
ax.set_ylabel('Principal Component 2')
ax.set_zlabel('Principal Component 3')
ax.set_title('3D Projection of the Reduced Dimension Data')

plt.show()

Fitting the K-means clustering model

Now we can begin applying the K-means clustering algorithm to our data. First, we need to determine the optimal number of clusters, which we can do using the Elbow Method or the Silhouette Score.

As a data analyst, think of it from the marketing team’s point of view: they want a small number of customer segments so they can decide where to focus their targeting efforts.

Elbow method

Let’s start with the Elbow Method, which helps you decide what number of clusters gives the most meaningful model of your data. The following line chart compares the inertias of ten different K-means models.

from sklearn.cluster import KMeans

# Calculating WCSS for different numbers of clusters
wcss = []
for i in range(1, 11):
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(df_enc_pca)
    wcss.append(kmeans.inertia_)

# Plotting the Elbow Method graph
plt.figure(figsize=(10, 6))
plt.plot(range(1, 11), wcss, marker='o', linestyle='--')
plt.title('Elbow Method')
plt.xlabel('Number of Clusters')
plt.ylabel('WCSS')
plt.grid(True)
plt.show()

The basic idea of the Elbow Method is that the “best” number of clusters lies at the sharpest bend of the line chart. Up to that point, each added cluster reduces inertia substantially, meaning your clusters are getting denser and the points within each cluster more similar; beyond it, the gains taper off.

Although adding clusters keeps reducing inertia (which is good!), it does so at a diminishing rate. As you can see in the line chart above, ten clusters has the least inertia. But is that the best? Recall that the entire purpose of this model is to produce actionable insights for your marketing team, and ten clusters might be too many for them to craft any meaningful marketing strategy.

In an ideal world, we’d like to see a line chart that looks like an L-shape. However, there is no sharply defined “elbow” in our graph, indicating that there isn’t a clear optimal number of clusters. An educated guess puts a slight bend around 3–4 clusters, after which the reduction in WCSS (within-cluster sum of squares) becomes more gradual.
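If you’d rather not eyeball the bend, the third-party kneed package can estimate the elbow programmatically. A quick sketch, assuming kneed is installed:

# Optional: estimate the elbow programmatically with the kneed package
from kneed import KneeLocator

kl = KneeLocator(list(range(1, 11)), wcss, curve='convex', direction='decreasing')
print(kl.elbow)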

Silhouette score analysis

Now let’s run a silhouette score analysis as a second check, this time across K-means models with 2 to 10 clusters.

from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score

silhouette_scores = []
for i in range(2, 11):  # Start from 2 clusters as the silhouette score requires at least 2 clusters
    kmeans = KMeans(n_clusters=i, init='k-means++', random_state=42)
    kmeans.fit(df_enc_pca)
    cluster_labels = kmeans.labels_
    silhouette_avg = silhouette_score(df_enc_pca, cluster_labels)
    silhouette_scores.append(silhouette_avg)

plot = sns.lineplot(x=range(2, 11), y=silhouette_scores)
plot.set_xlabel("# of clusters");
plot.set_ylabel("Silhouette Score");

A silhouette score is the mean silhouette coefficient across all points in a model. Scores range between -1 and +1. A value close to +1 means points sit close to the other points in their own cluster and well separated from points in other clusters; a value close to -1 indicates that points are likely assigned to the wrong cluster.
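Under the hood, each point’s silhouette coefficient is (b - a) / max(a, b), where a is its mean distance to points in its own cluster and b is its mean distance to the nearest other cluster. scikit-learn exposes the per-point values through silhouette_samples, which is handy for spotting poorly assigned points. A sketch for one candidate k:

from sklearn.metrics import silhouette_samples

# Per-point silhouette coefficients for a candidate model, e.g. k = 4
km = KMeans(n_clusters=4, init='k-means++', random_state=42)
labels = km.fit_predict(df_enc_pca)
sample_scores = silhouette_samples(df_enc_pca, labels)
print(sample_scores.mean())       # the same average that silhouette_score reports
print((sample_scores < 0).sum())  # points that sit closer to another cluster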

In short, the “best” silhouette score is the point at which it is closest to +1. A quick visualization guide:

Our scores tell us that the optimal number of clusters is 2, which is probably too few to be useful for marketing.

Taking the elbow method’s hint of 3–4 clusters together with the marketing team’s need for a manageable number of segments, we’ll go ahead and fit the K-means model with 4 clusters.

from sklearn.cluster import KMeans

# Fitting the final K-Means model with 4 clusters
kmeans_final = KMeans(n_clusters=4, init='k-means++', random_state=42)
kmeans_final.fit(df_enc_pca)

# Assigning the cluster labels to each data point
cluster_labels = kmeans_final.labels_

# Adding the cluster labels to the original encoded DataFrame
df_enc['Cluster'] = cluster_labels
df_enc[['Cluster'] + continuous_vars_to_scale].head()

Now we can look at the data distributions of our four clusters. Let’s first generate a 3D visualization.

fig = plt.figure(figsize=(10,7))
ax = fig.add_subplot(111, projection='3d')
scatter = ax.scatter(df_enc_pca['PC1'], df_enc_pca['PC2'], df_enc_pca['PC3'],
c=df_enc['Cluster'], cmap='viridis', s=5)

ax.set_title('3D Scatter plot of Clusters')
ax.set_xlabel('PC1')
ax.set_ylabel('PC2')
ax.set_zlabel('PC3')
legend = ax.legend(*scatter.legend_elements(), loc="upper right")
ax.add_artist(legend)

plt.show()

Now for a count and scatter plot. For the latter, I look at Total_Spending against Income.

plt.figure(figsize=(8,5))
sns.countplot(x='Cluster', data=df_enc, palette='viridis')
plt.title('Count of Data Points in Each Cluster')
plt.show()

plt.figure(figsize=(10, 7))
sns.scatterplot(data=df_enc, x='Total_Spending', y='Income', hue='Cluster', palette='viridis', alpha=0.7, edgecolor=None)
plt.title('Scatterplot of Total Spending vs Income Colored by Clusters')
plt.show()

Conclusion

At this point in the exercise, you’d want to play around and investigate the relationships between the different clusters. There are dozens of ways you could slice the data but I won’t do all of them here.
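A per-cluster profile of the unscaled features is a natural starting point and can back up observations like the ones below. A sketch; it assumes df0 and df_enc still share the same rows in the same order, which they do after the steps above:

# Attach the cluster labels to the unscaled dataframe for interpretable averages
df0['Cluster'] = df_enc['Cluster'].values

profile_cols = ['Income', 'Total_Spending', 'Avg_Spending_Per_Visit',
                'Total_Campaigns_Accepted', 'Children', 'Age']
print(df0.groupby('Cluster')[profile_cols].mean().round(1))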

For instance, I can discern that cluster 3 has generally above average income and is less responsive to marketing campaigns. The intuitive theory you can draw here is that “richer customers care less about shaving off a few cents on their grocery spending”.

Clusters 0 and 2, however, have high response rates to marketing’s efforts and are of average economic status. As a data analyst, this is something you would want to include in your report for stakeholders in product and marketing.

This article’s purpose is to give a basic idea of how K-means clustering models are applied to an actual business use case.

Written by Donovan Choy

Classical liberal. I love the Wu-Tang Clan, Spaghetti Westerns and anything Aly & Fila.
