Imagine you're browsing your favourite online retailer, scrolling through pages of products. Suddenly, an email pops up on your phone: it's from the same retailer, offering a special discount on the exact type of items you have been eyeing. Coincidence? Not quite. Welcome to the world of customer segmentation!
This personalized approach to marketing is no accident. It's the result of sophisticated data analysis techniques that allow businesses to understand and cater to their customers' individual needs and preferences.
What is Customer Segmentation?
Customer segmentation is the practice of dividing a company's customer base into groups of individuals who share similar characteristics. These characteristics can range from basic demographic information like age and gender to more complex behavioral patterns such as purchasing habits, product preferences, and online browsing behavior.
The goal of customer segmentation is to create a more personalized and effective marketing strategy. By understanding the distinct needs and behaviors of different customer groups, businesses can tailor their products, services, and marketing messages to resonate with each segment.
KMeans Clustering
KMeans is an unsupervised machine learning algorithm that groups similar data points together based on their characteristics.
In the context of customer segmentation, KMeans can help identify natural groupings within your customer data.
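To make the idea concrete before touching the real dataset, here is a minimal sketch on a handful of made-up 2-D points. The points and the variable names (`points`, `toy_kmeans`, `toy_labels`) are hypothetical and are not part of the analysis that follows.

```python
import numpy as np
from sklearn.cluster import KMeans

# Hypothetical toy data: two well-separated groups of 2-D points
points = np.array([
    [1.0, 1.0], [1.2, 0.8], [0.9, 1.1],   # group near (1, 1)
    [8.0, 8.0], [8.2, 7.9], [7.8, 8.1],   # group near (8, 8)
])

# n_init is set explicitly so the result does not depend on the
# default of the installed scikit-learn version
toy_kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
toy_labels = toy_kmeans.fit_predict(points)

print(toy_labels)                   # one cluster index per point
print(toy_kmeans.cluster_centers_)  # one centroid per cluster
```

The first three points end up in one cluster and the last three in the other. Which group receives index 0 is arbitrary, so cluster numbers should not be read as a ranking.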
Data Collection and Preprocessing
We will be working with a dataset shared publicly on GitHub. The dataset contains statistics about each customer of a retail company.
To begin working with this dataset, we'll import the libraries needed for reading our dataset into a pandas DataFrame.
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import altair as alt
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import warnings
warnings.filterwarnings('ignore')
After importing these libraries, we'll load the dataset from CSV format into a pandas DataFrame for exploration.
df = pd.read_csv('/content/ml_project1_data.csv')

# Display the first 5 rows of the DataFrame
df.head()
The DataFrame contains 29 features; the description of these features can be accessed here.
Before we draw conclusions from our data, we're going to clean it. Data cleaning focuses on detecting and correcting errors in the dataset.
# 1. Check for duplicated rows in the DataFrame
df.duplicated().sum()

# 2. Check for missing values in the DataFrame
def percent_missing(df):
    '''Calculate and return the percentage of missing data points for each column in a DataFrame'''
    percent_nan = 100 * df.isnull().sum() / len(df)
    # Equivalent: percent_nan = df.isnull().mean().mul(100)
    percent_nan = percent_nan[percent_nan > 0]   # keep only columns with missing values
    percent_nan = percent_nan.sort_values(ascending=False)
    return percent_nan

percent_missing(df)
There are no duplicated rows in our DataFrame. However, the 'Income' column has roughly 1% of its values missing. We will fill in these missing values using the median 'Income'.
df1 = df.copy()
# Fill only the 'Income' column, using its own median
df1['Income'] = df1['Income'].fillna(df1['Income'].median())
percent_missing(df1)
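The median is a common choice for income-like columns because a single extreme value shifts it far less than it shifts the mean. A small sketch with invented numbers (the `toy` frame below is hypothetical, not taken from the dataset) illustrates the difference:

```python
import numpy as np
import pandas as pd

# Hypothetical incomes: one missing value and one extreme outlier
toy = pd.DataFrame({'Income': [20000.0, 40000.0, np.nan, 60000.0, 1000000.0]})

median_income = toy['Income'].median()  # NaN is skipped by default
mean_income = toy['Income'].mean()      # pulled upward by the outlier

toy['Income_filled'] = toy['Income'].fillna(median_income)

print(median_income)  # 50000.0
print(mean_income)    # 280000.0
print(toy)
```

Filling with the mean here would assign the missing customer an income higher than three of the four observed values, which is why the median is often the safer default.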
Feature Engineering
Before applying the KMeans algorithm, it's crucial to select and prepare the right features that can effectively capture customer behavior. One of the most popular and effective approaches in retail customer segmentation is the RFM model:
Recency (R): How recently did the customer make a purchase?
Frequency (F): How often does the customer make purchases?
Monetary Value (M): How much does the customer spend?
def calculate_rfm_and_features(df):
    # Frequency: total number of purchases across all channels
    frequency = (df['NumDealsPurchases'] + df['NumWebPurchases'] +
                 df['NumCatalogPurchases'] + df['NumStorePurchases'])
    # Monetary: total amount spent across all product categories
    monetary = (df['MntWines'] + df['MntFruits'] + df['MntMeatProducts'] +
                df['MntFishProducts'] + df['MntSweetProducts'] + df['MntGoldProds'])
    # General_Response: number of campaigns the customer accepted
    response_columns = ['AcceptedCmp3', 'AcceptedCmp4', 'AcceptedCmp5',
                        'AcceptedCmp1', 'AcceptedCmp2', 'Response']
    general_response = df[response_columns].sum(axis=1)
    # Age (assuming the data was collected in 2015)
    age = 2015 - df['Year_Birth']
    # Days_Enrolled (assuming the data was collected before January 2015)
    reference_date = pd.Timestamp(2015, 1, 1)
    days_enrolled = (reference_date - pd.to_datetime(df['Dt_Customer'])).dt.days
    # Create a new DataFrame with only the specified columns
    result_df = pd.DataFrame({
        'ID': df['ID'],
        'Education': df['Education'],
        'Marital_Status': df['Marital_Status'],
        'Income': df['Income'],
        'Kidhome': df['Kidhome'],
        'Teenhome': df['Teenhome'],
        'Recency': df['Recency'],
        'NumWebVisitsMonth': df['NumWebVisitsMonth'],
        'Complain': df['Complain'],
        'Frequency': frequency,
        'Monetary': monetary,
        'General_Response': general_response,
        'Age': age,
        'Days_Enrolled': days_enrolled
    })
    return result_df

df2 = calculate_rfm_and_features(df1)
We engineered several key features:
- Frequency: Aggregated total purchases across various channels (deals, web, catalog, store).
- Monetary: Summed spending across product categories (wines, fruits, meat, fish, sweets, gold).
- General_Response: Calculated overall campaign responsiveness by summing responses to individual campaigns.
- Age: Derived from birth year, assuming 2015 as the year in which the data was collected.
- Days_Enrolled: Computed customer tenure based on enrollment date.
We retained important demographic variables (Education, Marital_Status, Income, Kidhome, Teenhome) and behavioral metrics (Recency, NumWebVisitsMonth, Complain). This comprehensive feature set enables more detailed customer segmentation, potentially revealing valuable insights for targeted marketing strategies.
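The tenure feature relies on subtracting a date column from a fixed reference date. As a small worked example, the sketch below uses three invented enrollment dates (they are hypothetical and not taken from the dataset) and the same 1 January 2015 reference date assumed in this article:

```python
import pandas as pd

# Hypothetical enrollment dates
enrolled = pd.Series(['2014-12-31', '2014-01-01', '2012-01-01'])

# Reference date assumed in this article
reference_date = pd.Timestamp('2015-01-01')

# Subtracting datetimes yields timedeltas; .dt.days extracts whole days
days_enrolled = (reference_date - pd.to_datetime(enrolled)).dt.days

print(days_enrolled.tolist())  # [1, 365, 1096]
```

The last value is 1096 rather than 1095 because 2012 was a leap year, which is a reminder that date arithmetic is best left to pandas rather than computed by hand.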
numeric_features = df2.select_dtypes(include=['int64', 'float64'])

# Compute the correlation matrix
corr_matrix = numeric_features.corr()

# Create a heatmap
plt.figure(figsize=(12, 8))
sns.heatmap(corr_matrix, annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title('Correlation Heatmap of Numeric Features')
plt.show()
The correlation heatmap above reveals key relationships among numeric features. Strong positive correlations exist between Frequency, Monetary, and Income, while negative correlations are observed between Income, Kidhome and NumWebVisitsMonth.
Before proceeding any further, we'll visualize the distribution of these newly created RFM features.
# .copy() avoids pandas chained-assignment warnings when a column is added later
rfm_data = df2[['ID', 'Recency', 'Frequency', 'Monetary']].copy()
rfm_features = ['Recency', 'Frequency', 'Monetary']

fig, axes = plt.subplots(1, 3, figsize=(15, 5))
fig.suptitle('Boxplots of RFM Variables')
for i, variable in enumerate(rfm_features):
    sns.boxplot(x=rfm_data[variable], ax=axes[i])
    axes[i].set_title(variable.capitalize())
    axes[i].set_xlabel('')
plt.tight_layout()
plt.show()
The boxplots of the RFM variables show that most customers have moderate recency, frequency, and monetary values. A few outliers exist in the Frequency and Monetary features, indicating that a small number of customers buy or spend significantly more than the rest.
Standardizing Data
Standardization ensures that all features contribute equally to clustering by scaling them to have a mean of 0 and a standard deviation of 1. KMeans relies on distances between points, so without scaling, features with larger ranges (such as Monetary) would dominate the distance calculation and distort the clusters.
# Drop the identifier column; it must not influence clustering
rfm_data1 = rfm_data.drop(columns=['ID'])
col_names = ['Recency', 'Frequency', 'Monetary']
features = rfm_data1[col_names]

standard_scaler = StandardScaler()
scaled_features = standard_scaler.fit_transform(features)

# Create a new DataFrame with scaled features and original column names
rfm_scaled = pd.DataFrame(scaled_features, columns=col_names)

# Display the first 5 rows of the scaled DataFrame
print(rfm_scaled.head())
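For readers who want to see what the scaler does numerically, the sketch below compares `StandardScaler` with the z-score formula computed by hand on a tiny invented array (the `values` array is hypothetical). It assumes the population standard deviation, which is what NumPy's `std` computes by default.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical features on very different scales
values = np.array([
    [1.0, 1000.0],
    [2.0, 2000.0],
    [3.0, 3000.0],
])

scaled = StandardScaler().fit_transform(values)

# Manual z-score: subtract the column mean, divide by the column std (ddof=0)
manual = (values - values.mean(axis=0)) / values.std(axis=0)

print(np.allclose(scaled, manual))  # True
print(scaled)
```

After scaling, both columns hold identical values even though the raw second column was a thousand times larger, so neither dominates a distance calculation.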
Determining Optimal Cluster Count
from sklearn.metrics import silhouette_score

min_clusters, max_clusters = 2, 10
cluster_range = range(min_clusters, max_clusters + 1)

# Compute inertia and silhouette score for each candidate K
results = []
for k in cluster_range:
    kmeans = KMeans(n_clusters=k, random_state=42)
    kmeans.fit(rfm_scaled)
    results.append({
        'n_clusters': k,
        'inertia': kmeans.inertia_,
        'silhouette': silhouette_score(rfm_scaled, kmeans.labels_)
    })
df_results = pd.DataFrame(results)

def plot_metric(data, x, y, title, ylabel):
    plt.figure(figsize=(10, 6))
    sns.lineplot(data=data, x=x, y=y, marker='o')
    plt.title(title)
    plt.xlabel('Number of clusters (K)')
    plt.ylabel(ylabel)
    plt.show()

# Plot elbow curve and silhouette scores
plot_metric(df_results, 'n_clusters', 'inertia', 'Elbow Curve', 'Inertia')
plot_metric(df_results, 'n_clusters', 'silhouette', 'Silhouette Analysis', 'Silhouette Score')

# Find optimal K
# Elbow heuristic: largest ratio between successive drops in inertia
inertia_drops = np.diff(df_results['inertia'])
optimal_k_elbow = int(df_results.iloc[np.argmax(inertia_drops[1:] / inertia_drops[:-1]) + 1]['n_clusters'])
optimal_k_silhouette = df_results.loc[df_results['silhouette'].idxmax(), 'n_clusters']
print(f"Optimal number of clusters (Elbow method): {optimal_k_elbow}")
print(f"Optimal number of clusters (Silhouette method): {optimal_k_silhouette}")
In our clustering analysis, the silhouette method suggested 2 clusters, while the elbow method indicated 7. We will opt for 2 clusters, because the silhouette score measures both how well separated the clusters are and how cohesive each one is, which is crucial for clear, interpretable segments in our customer base.
clusters_number = 2
kmeans = KMeans(random_state=42, n_clusters=clusters_number,
                init='k-means++', n_init=50, max_iter=1000)
kmeans.fit(rfm_scaled)
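To build intuition for the silhouette criterion used above, the sketch below runs the same kind of search on synthetic data where the true number of groups is known. The blob centers, sample size, and variable names are all hypothetical choices made for illustration; they say nothing about the retail dataset.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import silhouette_score

# Hypothetical data: three well-separated blobs in 2-D
X, _ = make_blobs(
    n_samples=300,
    centers=[[0, 0], [10, 10], [-10, 10]],
    cluster_std=1.0,
    random_state=0,
)

# Score each candidate K; a higher silhouette means tighter, better-separated clusters
scores = {}
for k in range(2, 6):
    labels = KMeans(n_clusters=k, n_init=10, random_state=0).fit_predict(X)
    scores[k] = silhouette_score(X, labels)

best_k = max(scores, key=scores.get)
print(scores)
print(best_k)
```

With groups this clearly separated, the silhouette score peaks at the true count of three. Real customer data is rarely this clean, which is why the elbow and silhouette methods can disagree, as they did here.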
Segmentation
We will use the predict() method from sklearn to get the assigned clusters for all data points stored in rfm_scaled:
rfm_data['Cluster'] = kmeans.predict(rfm_scaled)
rfm_data.head()
We will visualize our data to identify the common characteristics of customers within the same clusters.
rfm_clustered = rfm_data.copy()
# rfm_clustered['Customer_ID'] = df1['ID']

# Average values for each cluster
avg_df = rfm_clustered.groupby(['Cluster'], as_index=False).mean()

# Bar plots for each RFM metric
metrics = ['Recency', 'Frequency', 'Monetary']
fig, axes = plt.subplots(1, 3, figsize=(20, 6))
fig.suptitle('Average RFM Values by Cluster', fontsize=16)
for i, metric in enumerate(metrics):
    sns.barplot(x='Cluster', y=metric, data=avg_df, ax=axes[i])
    axes[i].set_title(f'Average {metric} by Cluster')
    axes[i].set_xlabel('Cluster')
    axes[i].set_ylabel(f'Average {metric}')
plt.tight_layout()
plt.show()
Based on the RFM features shown in the visualization above, we can interpret the two clusters as follows:
Cluster 0 (Low-Value Customers):
1. Similar recency to Cluster 1
2. Lower frequency of purchases
3. Much lower monetary value
Cluster 1 (High-Value Customers):
1. Similar recency to Cluster 0
2. Higher frequency of purchases
3. Significantly higher monetary value
The key differentiators are frequency and monetary value, with Cluster 1 representing more frequent customers who spend more money. Recency appears similar for both clusters, suggesting recent activity is not a major distinguishing factor between these customer segments.
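One practical caveat: KMeans assigns cluster numbers arbitrarily, so "Cluster 1" is not guaranteed to be the high-value group if the model is refit with a different seed or on new data. A possible safeguard, sketched below on an invented frame (the `toy` data and the `Segment` column name are hypothetical), is to derive segment names from the cluster averages rather than hard-coding the indices.

```python
import pandas as pd

# Hypothetical clustered customers; here cluster 0 happens to be the big spenders
toy = pd.DataFrame({
    'Cluster':  [1, 0, 1, 0, 1],
    'Monetary': [50, 1200, 80, 900, 30],
})

# Identify the high-value cluster from the data instead of assuming its index
cluster_means = toy.groupby('Cluster')['Monetary'].mean()
high_value_cluster = cluster_means.idxmax()

toy['Segment'] = toy['Cluster'].map(
    lambda c: 'High-Value' if c == high_value_cluster else 'Low-Value'
)

print(high_value_cluster)  # 0
print(toy)
```

Labeling segments this way keeps downstream reports and campaigns correct even when the numeric cluster labels swap between runs.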
Recommendations
For Cluster 0 (Lower-value customers):
- Implement loyalty programs to increase purchase frequency.
- Offer targeted promotions to encourage higher-value purchases.
- Use email marketing with personalized product recommendations.
For Cluster 1 (Higher-value customers):
1. Develop a VIP program with exclusive benefits and early access to new products.
2. Provide personalized customer service and dedicated account managers.
3. Create cross-selling and upselling opportunities to increase their already high monetary value.
4. Seek referrals from these valuable customers.
5. Analyze their purchasing patterns to identify trends for product development.
These recommendations aim to increase the value of Cluster 0 customers while retaining and maximizing the value of Cluster 1 customers.
In conclusion, customer segmentation using KMeans clustering has provided us with valuable insights into our retail customers' profiles. These distinct customer groups allow us to tailor our marketing strategies and customer service approaches more effectively.
You can find the code for this analysis here.
Thanks for reading!