Customer Segmentation and Time Series Forecasting Based on Sales Data #1/3 | by Naveen Malla | Sep, 2024

The project that got me an ML internship

First things first. This blog is split into a 3-part series where I’m going to focus on three different aspects:

Exploratory Data Analysis (EDA)
Customer segmentation
SKU forecasting (with bonus code)

Why Should You Care About This Series❓

This work got me a machine learning internship. I presented my notebooks to the whole team, including both technical and non-technical stakeholders, and they loved it. I put a lot of hours into this project to make it as informative and easy to understand as possible. I was really glad that the team found it highly engaging, and I wanted to share it here. So, if you’re interested in working on some machine learning projects to gain knowledge, or are looking for ideas on how to present related projects in your upcoming interviews, this series is for you. Let’s dive in.

🗂️ The Dataset

This is a typical sales dataset for healthcare products. It contains sales data for 7 months of the year, starting from January. It also contains noise, so the results may seem evenly distributed.

But this isn’t going to be about the data itself; rather, it’s about what you can do with it.

The features in the dataset are kinda self-explanatory, so I’m not gonna waste your time giving descriptions for each. You’ll see what they mean as we proceed.

order_number
order_date
customer_number
type
month
item_number (SKU)
quantity
category
revenue
customer_source
order_source

📊 Let’s Do Some EDA!!!

I’ll skip some parts of the code, like data loading and processing, to keep things engaging. I’ll attach a link to the complete notebook at the end of the article though.
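For readers who want something to run right away, here is a rough sketch of what a loading step could look like. It is not the notebook’s actual code: the inline CSV text, the 2024 dates, and the `sample_df` name are all made up for illustration. The one thing it is meant to show is that `order_date` has to be parsed as a datetime, otherwise the `.dt` accessor used later will fail.

```python
import io

import pandas as pd

# Hypothetical three-row extract; the real file and its values are not shown in the article.
csv_text = """order_number,order_date,customer_number,item_number,quantity,category,revenue
1001,2024-01-15,C001,SKU001,2,DIABETES,40.0
1002,2024-02-03,C002,SKU002,5,HYPERTENSIVES,35.5
1003,2024-02-20,C001,SKU002,1,HYPERTENSIVES,7.1
"""

# parse_dates converts order_date to datetime64 so that .dt.month works later on
sample_df = pd.read_csv(io.StringIO(csv_text), parse_dates=["order_date"])

print(sample_df.dtypes["order_date"])
print(sample_df["order_date"].dt.month.tolist())
print(sample_df.nunique())
```

With a real file you would pass its path to `pd.read_csv` instead of the `io.StringIO` wrapper.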

print(df.nunique())

Unique value counts for each feature (all images by author)

Observation:

There are a total of:

1000 unique customers
1000 unique items
2 categories

Monthly Revenue

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from matplotlib.ticker import FuncFormatter

# Extract the month number and map it to its abbreviation
df['month'] = df['order_date'].dt.month
month_abbr = {1: 'Jan', 2: 'Feb', 3: 'Mar', 4: 'Apr', 5: 'May', 6: 'Jun', 7: 'Jul', 8: 'Aug', 9: 'Sep', 10: 'Oct', 11: 'Nov', 12: 'Dec'}
df['month_abbr'] = df['month'].map(month_abbr)
# Aggregate monthly revenue
monthly_revenue = df.groupby('month_abbr')['revenue'].sum().reset_index()
# Define a categorical type for month_abbr to ensure proper (calendar) sorting
monthly_revenue['month_abbr'] = pd.Categorical(monthly_revenue['month_abbr'],
    categories=['Jan', 'Feb', 'Mar', 'Apr', 'May', 'Jun', 'Jul', 'Aug', 'Sep', 'Oct', 'Nov', 'Dec'],
    ordered=True)
# Sort by the 'month_abbr' column
monthly_revenue = monthly_revenue.sort_values(by='month_abbr')
# Plot monthly revenue
plt.figure(figsize=(8, 4))
sns.barplot(x='month_abbr', y='revenue', data=monthly_revenue, palette='viridis')
plt.xlabel('Month')
plt.ylabel('Revenue')
plt.title('Monthly Revenue')
# Format y-axis tick labels in shortened form (e.g. 150k)
ax = plt.gca()
ax.yaxis.set_major_formatter(FuncFormatter(lambda x, pos: f'{int(x/1000)}k'))
plt.xticks(rotation=45)
plt.show()

Observation:

The revenue seems to be increasing over the months.
Larger jumps in revenue can be seen starting from the month of May compared to earlier months.
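If you want to put a number on those jumps instead of eyeballing the bars, month-over-month growth is one way to do it. This is a small sketch with made-up totals (the `toy_monthly` frame is hypothetical, not the dataset’s real figures), using pandas’ `pct_change`.

```python
import pandas as pd

# Made-up monthly totals, used only to show the calculation
toy_monthly = pd.DataFrame({
    "month_abbr": ["Jan", "Feb", "Mar", "Apr", "May"],
    "revenue": [100.0, 110.0, 121.0, 133.1, 199.65],
})

# pct_change compares each row with the previous one,
# so the rows must already be in calendar order
toy_monthly["mom_growth"] = toy_monthly["revenue"].pct_change()

print(toy_monthly)
```

The first month has no previous month to compare against, so its growth value is NaN. On the real data, the same line could be applied to `monthly_revenue` after it has been sorted.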

🔧 Feature Engineering to create more meaningful features

Feature Engineering is the process of creating new features from the existing features in the dataset.

This helps us analyse the data better, as the new features are more meaningful.
Example: top 10 products, average revenue per category, etc.

# Top 10 performing products by revenue
top_10_products = df.groupby('item_number')['revenue'].sum().nlargest(10).reset_index()
# Average revenue by category
avg_revenue_category = df.groupby('category')['revenue'].mean().reset_index()
# Average revenue by type
avg_revenue_type = df.groupby('type')['revenue'].mean().reset_index()
# Average revenue by customer source
avg_revenue_customer_source = df.groupby('customer_source')['revenue'].mean().reset_index()
# Average revenue by order source
avg_revenue_order_source = df.groupby('order_source')['revenue'].mean().reset_index()
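The four average-revenue lines all follow the same groupby pattern, so they could also be built in one pass. Here is a sketch of that idea on a tiny made-up frame; `toy_df`, `group_columns`, and the `A`/`B` type labels are placeholders I’m assuming for illustration, since the real labels aren’t listed here.

```python
import pandas as pd

toy_df = pd.DataFrame({
    "category": ["DIABETES", "DIABETES", "HYPERTENSIVES", "HYPERTENSIVES"],
    "type": ["A", "B", "A", "B"],  # placeholder labels, not the dataset's real values
    "revenue": [10.0, 30.0, 20.0, 40.0],
})

# Columns to average revenue over; extend with customer_source, order_source, etc.
group_columns = ["category", "type"]

# One small DataFrame of mean revenue per grouping column
avg_revenue_by = {
    col: toy_df.groupby(col)["revenue"].mean().reset_index()
    for col in group_columns
}

print(avg_revenue_by["category"])
print(avg_revenue_by["type"])
```

Separate variables, as in the code above, are easier to read in a notebook; the dictionary version is handy when the list of grouping columns grows.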

📈 Visualizing the engineered features

Top 10 Products by Revenue

# Bar chart for Top 10 performing products
plt.figure(figsize=(8, 4))
sns.barplot(x='item_number', y='revenue', data=top_10_products, palette='viridis')
plt.title('Top 10 Performing Products by Revenue')
plt.xlabel('Item Number')
plt.ylabel('Total Revenue')
plt.xticks(rotation=45)
plt.show()

Observation:

The top performing products seem to be generating a similar amount of revenue.

Now, let’s see what percentage of total revenue is contributed by the top 10 products.

top_10_revenue = top_10_products['revenue'].sum()
total_revenue = df['revenue'].sum()
print(f'Top 10 products contribute {top_10_revenue/total_revenue:.2%} to the total revenue')

1.45% of total revenue isn’t a significant amount, indicating that these products don’t dominate overall sales.
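For context, with 1000 unique items a perfectly even split would give any 10 items exactly 1.00% of revenue, so 1.45% is only slightly above uniform. To run the same check for other cut-offs (top 50, top 100), the calculation could be wrapped in a small helper. The function name `top_n_revenue_share` and the toy data below are my own assumptions, not something from the notebook.

```python
import pandas as pd


def top_n_revenue_share(frame, n, item_col="item_number", revenue_col="revenue"):
    """Return the fraction of total revenue earned by the n highest-revenue items."""
    per_item = frame.groupby(item_col)[revenue_col].sum()
    return per_item.nlargest(n).sum() / per_item.sum()


# Made-up sales rows: item A appears twice, so its revenue is summed first
toy_sales = pd.DataFrame({
    "item_number": ["A", "A", "B", "C", "D"],
    "revenue": [30.0, 20.0, 25.0, 15.0, 10.0],
})

share = top_n_revenue_share(toy_sales, n=2)
print(f"Top 2 items contribute {share:.2%} of total revenue")
```

In the toy data, items A and B together earn 75 of the 100 total, so the helper returns 0.75.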

Average Revenue per Categorical Variable

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
# Pie chart for Average Revenue by Customer Source
axes[0].pie(avg_revenue_customer_source['revenue'], labels=avg_revenue_customer_source['customer_source'], autopct='%1.1f%%', colors=sns.color_palette('viridis', 2))
axes[0].set_title('Average Revenue by Customer Source')
# Donut chart for Average Revenue by Order Source
axes[1].pie(avg_revenue_order_source['revenue'], labels=avg_revenue_order_source['order_source'], autopct='%1.1f%%', colors=sns.color_palette('viridis', 5))
axes[1].set_title('Average Revenue by Order Source')
# A white circle drawn over the centre turns the second pie into a donut
centre_circle = plt.Circle((0, 0), 0.70, fc='white')
axes[1].add_artist(centre_circle)

fig, axes = plt.subplots(1, 2, figsize=(8, 4))
# Pie chart for Average Revenue by Category
axes[0].pie(avg_revenue_category['revenue'], labels=avg_revenue_category['category'], autopct='%1.1f%%', colors=sns.color_palette('viridis', 2))
axes[0].set_title('Average Revenue by Category')
# Pie chart for Average Revenue by Type
axes[1].pie(avg_revenue_type['revenue'], labels=avg_revenue_type['type'], autopct='%1.1f%%', colors=sns.color_palette('viridis', 2))
axes[1].set_title('Average Revenue by Type')
plt.tight_layout()
plt.show()

Observation:

The distribution of revenue seems to be almost equally spread among Categories, Types, Customer and Order sources.

Count of orders by category

# Sum the quantity column per category (units sold)
category_counts = df.groupby('category')['quantity'].sum()
# Plot the count of orders by category
plt.figure(figsize=(8, 4))
sns.barplot(x=category_counts.index, y=category_counts, palette='viridis')
plt.title('Total Orders by Category')
plt.xlabel('Category')
plt.ylabel('Total Orders')
plt.xticks(rotation=45)
plt.show()

Observation:

Despite DIABETES and HYPERTENSIVES products generating the same revenue, the number of HYPERTENSIVES products sold is higher than the number of DIABETES products.
This insight suggests that the DIABETES products being sold are more expensive than the HYPERTENSIVES products.
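One way to check the price claim directly is to divide total revenue by total quantity for each category. The sketch below does that on a made-up frame; `toy_orders` and its numbers are hypothetical, and only the column names match the dataset.

```python
import pandas as pd

# Made-up orders: both categories earn 250 in revenue, but sell different quantities
toy_orders = pd.DataFrame({
    "category": ["DIABETES", "DIABETES", "HYPERTENSIVES", "HYPERTENSIVES"],
    "quantity": [2, 3, 5, 5],
    "revenue": [100.0, 150.0, 120.0, 130.0],
})

# Total revenue and total units per category, then revenue per unit sold
totals = toy_orders.groupby("category")[["revenue", "quantity"]].sum()
totals["revenue_per_unit"] = totals["revenue"] / totals["quantity"]

print(totals)
```

In the toy data both categories earn the same revenue, yet DIABETES comes out at 50 per unit against 25 for HYPERTENSIVES, which is the pattern the observation describes.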

total_products_sold = df.groupby('category')['quantity'].sum()
print(total_products_sold)

HYPERTENSIVES products indeed seem to be more popular among customers, with a lead of around 50000 products sold in the last 7 months.

Revenue Distribution

# Calculate total revenue for each customer
total_revenue = df.groupby('customer_number')['revenue'].sum().reset_index()
total_revenue.columns = ['customer_number', 'total_revenue']
# Distribution plot (histogram with a KDE curve)
plt.figure(figsize=(8, 4))
sns.histplot(total_revenue['total_revenue'], kde=True, color='skyblue')
plt.title('Distribution of Total Revenue by Customer')
plt.xlabel('Total Revenue')
plt.ylabel('Frequency')
plt.show()

Observations:

🔔 Normal Distribution

A Normal distribution is a way to show how data is spread out, where most values are close to the middle (the average). It looks symmetric and bell-shaped.
The above graph is a near-normal distribution.
In our case, the average spending is around the $700 to $800 mark, which you can see around the middle of the curve.
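A quick numeric sanity check to go with the visual one: for a roughly normal distribution, the mean and median sit close together and the skewness is near zero. The sketch below runs that check on simulated data. The `toy_totals` series is generated with an assumed mean of 750 and standard deviation of 150, which are illustrative values and not measured from the dataset.

```python
import numpy as np
import pandas as pd

# Simulated per-customer totals; loc and scale are assumptions for illustration only
rng = np.random.default_rng(0)
toy_totals = pd.Series(rng.normal(loc=750, scale=150, size=1000), name="total_revenue")

summary = {
    "mean": toy_totals.mean(),
    "median": toy_totals.median(),
    "skew": toy_totals.skew(),  # close to 0 for a symmetric distribution
}

for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

On the real data, the same three calls could be made on `total_revenue['total_revenue']`.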

💰 Revenue Spread

Revenue Spread indicates how varied customer spending is. We look at the lowest (leftmost) and the highest (rightmost) figures to understand this metric.
There’s a significant spread, ranging from $200 to $1400.
This indicates a diverse range of customer spending.

💎 High-Value Customers/Outliers

The higher-end tail represents high-value customers.
These could be B2B customers or customers who buy in bulk.
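The article doesn’t say how to draw the line for “high-value”, so here is one common option as a sketch: the 1.5 × IQR rule, which flags anything above the third quartile plus 1.5 times the interquartile range. This is a technique I’m swapping in for illustration, and the `toy_customer_totals` frame is made up.

```python
import pandas as pd

# Made-up per-customer totals with one obvious big spender
toy_customer_totals = pd.DataFrame({
    "customer_number": ["C1", "C2", "C3", "C4", "C5", "C6", "C7", "C8"],
    "total_revenue": [600.0, 650.0, 700.0, 750.0, 800.0, 850.0, 900.0, 2000.0],
})

# Upper fence of the 1.5 * IQR rule
q1 = toy_customer_totals["total_revenue"].quantile(0.25)
q3 = toy_customer_totals["total_revenue"].quantile(0.75)
upper_fence = q3 + 1.5 * (q3 - q1)

# Customers whose total spending lies above the fence
high_value = toy_customer_totals[toy_customer_totals["total_revenue"] > upper_fence]

print(f"Upper fence: {upper_fence}")
print(high_value)
```

A fixed percentile cut-off (say, the top 5% of customers) would be another reasonable choice; which one fits depends on how the business defines a high-value customer.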
