This project aims to gain business insight from the Online Retail datasets through Exploratory Data Analysis (EDA) and to derive recommendations based on customer segmentation through RFM Analysis and K-Means Clustering.

suciaulyaputri/Customer-Segmentation

"CUSTOMER SEGMENTATION"

Customer segmentation is the practice of dividing a company's customers into groups that reflect similarity among customers in each group. The goal of segmenting customers is to decide how to relate to customers in each segment in order to maximize the value of each customer to the business.

Use Case Summary

Objective Statement:

  1. Get business insight into how many products are sold every month.
  2. Get business insight into how much money customers spend every month.
  3. Get business insight into how many customers make transactions each month.
  4. Get business insight into the frequency of transactions by month, day, and hour.
  5. Get business insight into the most popular products.
  6. Get business insight into which countries have the most customers.
  7. Reduce risk in deciding where, when, how, and to whom a product, service, or brand will be marketed.
  8. Increase marketing efficiency by directing effort specifically toward the designated segment in a manner consistent with that segment’s characteristics.

Challenges:

  1. The data is too large to maintain in an Excel spreadsheet.
  2. Coordination across several departments is required.
  3. Demographic data has many missing values.

Business Benefit:

  1. Helps the Business Development Team create product differentiation based on the characteristics of each customer.
  2. Shows how to treat customers who meet specific criteria.

Expected Outcome:

  1. Know how many products are sold every month.
  2. Know how much money customers spend every month.
  3. Know how many customers make transactions each month.
  4. Know the frequency of transactions by month, day, and hour.
  5. Know the most popular products.
  6. Know which countries have the most customers.
  7. Customer segmentation analysis.
  8. Recommendations based on customer segmentation.

Data Understanding

  • The data is a real online retail transaction data set of two years.

  • The data consists of 2 datasets where:

    • Dataset 1:
      • Online Retail Dataset between 01/12/2009 until 09/12/2010.
      • Dataset 1 consists of 525461 rows and 8 columns.
    • Dataset 2:
      • Online Retail Dataset between 01/12/2010 until 09/12/2011.
      • Dataset 2 consists of 541910 rows and 8 columns.
  • This Online Retail II data set contains all the transactions occurring for a UK-based and registered, non-store online retailer between 01/12/2009 and 09/12/2011. The company mainly sells unique all-occasion giftware. Many customers of the company are wholesalers.

  • Source Data

  • Data Dictionary:

    • Invoice: Invoice number. Nominal. A 6-digit integral number uniquely assigned to each transaction. If this code starts with the letter 'c', it indicates a cancellation.
    • StockCode: Product (item) code. Nominal. A 5-digit integral number uniquely assigned to each distinct product.
    • Description: Product (item) name. Nominal.
    • Quantity: The quantities of each product (item) per transaction. Numeric.
    • Invoice Date: Invoice date and time. Numeric. The day and time when a transaction was generated.
    • Price: Unit price. Numeric. Product price per unit in sterling (£).
    • Customer ID: Customer number. Nominal. A 5-digit integral number uniquely assigned to each customer.
    • Country: Country name. Nominal. The name of the country where a customer resides.
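The data dictionary above suggests a simple preparation step before any analysis. The following is only a minimal sketch, not the project's actual code: the column names follow the dictionary, the `Revenue` column and the `clean_transactions` helper are hypothetical, and dropping cancellations and rows without a customer number is an assumption about how the data might be cleaned.

```python
import pandas as pd

# Tiny hypothetical sample that uses the column names from the data dictionary.
sample = pd.DataFrame({
    "Invoice": ["489434", "C489449", "489435"],
    "StockCode": ["85048", "22087", "22350"],
    "Quantity": [12, -12, 6],
    "Price": [6.95, 2.95, 2.55],
    "Customer ID": [13085.0, 16321.0, None],
})

def clean_transactions(df: pd.DataFrame) -> pd.DataFrame:
    """Drop cancellations and rows without a customer, then add revenue."""
    # An invoice code starting with the letter "c" (either case) is a cancellation.
    is_cancelled = df["Invoice"].astype(str).str.upper().str.startswith("C")
    has_customer = df["Customer ID"].notna()
    out = df[~is_cancelled & has_customer].copy()
    # Hypothetical helper column: line revenue in sterling.
    out["Revenue"] = out["Quantity"] * out["Price"]
    return out

cleaned = clean_transactions(sample)
```

Only the first sample row survives here: the second is a cancellation and the third has no customer number.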

Business Understanding

Retail is the process of selling consumer goods or services to customers through multiple channels of distribution to earn a profit.

This case has some business question using the data:

  1. How many products are sold every month?
  2. How much money do customers spend every month?
  3. How many customers make transactions each month?
  4. What is the frequency of transactions by month, day, and hour?
  5. Which products are the most popular?
  6. Which countries have the most customers?
  7. What does the customer segmentation analysis show?
  8. What recommendations follow from the customer segmentation?

Exploratory Data Analysis

Number Of Products Sold Each Month

In 2009 - 2010

image

November has the highest quantity of products sold, at around 13.97% of all products sold during the year.

In 2010 - 2011

image

November has the highest quantity of products sold, at around 15.42% of all products sold during the year.

The business team can build on this peak, for example by promoting new products to customers in November.
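The monthly percentages above can be reproduced with a group-by on the invoice month. This is a minimal sketch under assumptions: the timestamp column is assumed to be named `InvoiceDate` and already parsed as a datetime, and `monthly_share` is a hypothetical helper, not a function from the project.

```python
import pandas as pd

def monthly_share(df: pd.DataFrame, value_col: str) -> pd.Series:
    """Return each month's percentage share of the total of `value_col`."""
    month = df["InvoiceDate"].dt.to_period("M")
    totals = df.groupby(month)[value_col].sum()
    return (totals / totals.sum() * 100).round(2)

# Hypothetical toy data: one October line and two November lines.
sample = pd.DataFrame({
    "InvoiceDate": pd.to_datetime(["2010-10-05", "2010-11-03", "2010-11-20"]),
    "Quantity": [10, 20, 20],
})
share = monthly_share(sample, "Quantity")
```

The same helper applies to the revenue view by passing a revenue column instead of `Quantity`.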

The Amount of Money That Customers Spend on Each Month

In 2009 - 2010

image

November has the highest revenue, at around 14.11% of total revenue for the year.

In 2010 - 2011

image

November has the highest revenue, at around 15.6% of total revenue for the year.

The business team can apply the sales strategies that succeeded in November to other months.

Number of Customers Who Make Transactions Every Month

In 2009 - 2010

image

The number of customers from December 2009 to November 2010 fluctuated. In general, the number of customers increased in most months; it decreased only in January, April, July, and August. The business team can offer special discounts in January, April, July, and August to increase the number of customers and sales in those months.

In 2010 - 2011

image

The number of customers from December 2010 to November 2011 fluctuated. In general, the number of customers increased in most months; it decreased only in January, February, and April. The business team can offer special discounts in January, February, and April to increase the number of customers and sales in those months.

Transaction Frequency Every Month, Day, and Hour

In 2009 - 2010

image

  • November has the highest number of customers, at around 15.3% of the total customers for the year. The business team can increase sales by promoting new products to customers in November.
  • Most consumers make transactions on Thursday, which accounts for around 19.8% of the total daily transactions. The business team can increase sales by promoting new products to customers on Thursday.
  • Most consumers order products at 12 PM (noon), which accounts for 17.8% of the total daily transactions. The business team can increase sales by promoting new products to customers at 12 PM.

In 2010 - 2011

image

  • November has the highest number of customers, at around 17.3% of the total customers for the year. The business team can increase sales by promoting new products to customers in November.
  • Most consumers make transactions on Thursday, which accounts for around 19.5% of the total daily transactions. The business team can increase sales by promoting new products to customers on Thursday.
  • Most consumers order products at 12 PM (noon), which accounts for 18.2% of the total daily transactions. The business team can increase sales by promoting new products to customers at 12 PM.
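The month, weekday, and hour breakdowns above can be derived from the invoice timestamp. The sketch below is illustrative only: it assumes the columns are named `Invoice` and `InvoiceDate`, counts each invoice once (the project may instead count rows or customers), and `transaction_share` is a hypothetical helper.

```python
import pandas as pd

def transaction_share(df: pd.DataFrame, part: str) -> pd.Series:
    """Percentage of invoices per month name, weekday name, or hour of day."""
    # Count each invoice once, even if it spans several product lines.
    invoices = df.drop_duplicates("Invoice")
    stamp = invoices["InvoiceDate"]
    keys = {
        "month": stamp.dt.month_name(),
        "day": stamp.dt.day_name(),
        "hour": stamp.dt.hour,
    }
    return (keys[part].value_counts(normalize=True) * 100).round(1)

# Hypothetical toy data: invoice A has two lines, B and C have one each.
sample = pd.DataFrame({
    "Invoice": ["A", "A", "B", "C"],
    "InvoiceDate": pd.to_datetime([
        "2010-11-04 12:10", "2010-11-04 12:10",
        "2010-11-11 12:30", "2010-11-05 09:00",
    ]),
})
by_day = transaction_share(sample, "day")
by_hour = transaction_share(sample, "hour")
```

Note that `dt.hour` is on a 24-hour clock, so a value of 12 means noon.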

The Most Popular Product

In 2009 - 2010

image

White Hanging Heart T-Light Holder was the product most in demand among consumers in 2010, with purchases reaching 2369 units. The business team can offer special discounts on this product to attract more customers.

In 2010 - 2011

image

White Hanging Heart T-Light Holder was the product most in demand among consumers in 2011, with purchases reaching 1625 units. The business team can offer special discounts on this product to attract more customers.

The Most Customers By Country

In 2009 - 2010

image

The United Kingdom was the country with the highest number of customers in 2010. The number of customers in the United Kingdom reached 302776 (91.71% of the total) in 2010. The business team can focus on promotions in the United Kingdom to increase sales.

In 2010 - 2011

image

The United Kingdom was the country with the highest number of customers in 2011. The number of customers in the United Kingdom reached 286683 (90% of the total) in 2011. The business team can focus on promotions in the United Kingdom to increase sales.

Customer Segmentation

1. Recency, Frequency, Monetary Value (RFM) Analysis

Recency, Frequency, Monetary Value (RFM) analysis is a method of customer analysis and segmentation based on customer purchasing habits. The variables used to perform RFM analysis are:

  • Recency : How recently the customer made a transaction.
  • Frequency : How often the customer makes transactions.
  • Monetary : How much money the customer has spent.

In this case, the dataset contains transaction data from 01/12/2009 to 01/12/2011, so the RFM values are defined as follows:

  • Recency : The number of days between the customer's last transaction and the day of analysis. In this case, the day of analysis is the date of the last transaction in the data.
  • Frequency : The number of transactions made by the customer from 01/12/2009 to 01/12/2011.
  • Monetary : The total order amount spent by the customer from 01/12/2009 to 01/12/2011.
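The three definitions above map onto a single group-by per customer. This is a minimal sketch rather than the project's code: the column names `Customer ID`, `Invoice`, and `InvoiceDate` follow the data dictionary (without the space in the date column), while `Revenue` (quantity times price) and the `rfm_values` helper are assumptions.

```python
import pandas as pd

def rfm_values(df: pd.DataFrame) -> pd.DataFrame:
    """Compute Recency (days), Frequency (invoices), Monetary (total spend)."""
    # Day of analysis = date of the last transaction in the data.
    snapshot = df["InvoiceDate"].max()
    return df.groupby("Customer ID").agg(
        Recency=("InvoiceDate", lambda s: (snapshot - s.max()).days),
        Frequency=("Invoice", "nunique"),
        Monetary=("Revenue", "sum"),
    )

# Hypothetical toy data: customer 1 has two invoices, customer 2 has one.
sample = pd.DataFrame({
    "Customer ID": [1, 1, 2],
    "Invoice": ["A", "B", "C"],
    "InvoiceDate": pd.to_datetime(["2011-11-01", "2011-12-01", "2011-11-21"]),
    "Revenue": [10.0, 20.0, 5.0],
})
rfm = rfm_values(sample)
```

With the snapshot taken as the last transaction date, the most recent customer always has a recency of 0 days.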

Here are the steps in RFM analysis:

1. Calculate RFM Value

RFM Value in 2009-2010

image

RFM Value in 2010-2011

image

2. Calculate RFM Score

The individual RFM Scores can be calculated using the quartile method. The steps are:

  1. Split the metrics into segments using quantiles.
  2. Assign a score from 1 to 4 to Recency, Frequency and Monetary.
  3. Four is the best/highest value, and one is the lowest/worst value.
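The quartile scoring steps above can be sketched with `pandas.qcut`. This is an illustration under assumptions, not the project's exact procedure: recency is scored in reverse (a smaller number of days gets the higher score), values are ranked first so that ties do not produce duplicate quartile edges, and `rfm_scores` is a hypothetical helper.

```python
import pandas as pd

def rfm_scores(rfm: pd.DataFrame) -> pd.DataFrame:
    """Assign quartile scores from 1 (worst) to 4 (best) to R, F and M."""
    def quartile(series: pd.Series, labels: list) -> pd.Series:
        # Ranking with method="first" breaks ties so the bin edges are unique.
        ranked = series.rank(method="first")
        return pd.qcut(ranked, 4, labels=labels).astype(int)

    scores = pd.DataFrame(index=rfm.index)
    scores["R"] = quartile(rfm["Recency"], [4, 3, 2, 1])    # recent is better
    scores["F"] = quartile(rfm["Frequency"], [1, 2, 3, 4])  # frequent is better
    scores["M"] = quartile(rfm["Monetary"], [1, 2, 3, 4])   # higher spend is better
    return scores

# Hypothetical RFM values for eight customers, ordered from best to worst.
rfm = pd.DataFrame({
    "Recency": [1, 5, 10, 20, 40, 80, 160, 320],
    "Frequency": [8, 7, 6, 5, 4, 3, 2, 1],
    "Monetary": [800, 700, 600, 500, 400, 300, 200, 100],
})
scores = rfm_scores(rfm)
```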

RFM Score in 2009-2010

image

RFM Score in 2010- 2011

image

3. Calculate the total RFM score

A total RFM score is calculated simply by combining individual RFM score numbers.
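"Combining" the individual scores can mean either adding them or concatenating them into a three-digit code; the text does not say which variant the project uses. The sketch below shows both on hypothetical scores, with `RFM_Total` and `RFM_Code` as illustrative column names.

```python
import pandas as pd

# Hypothetical individual scores for two customers.
scores = pd.DataFrame(
    {"R": [4, 1], "F": [4, 2], "M": [3, 1]},
    index=["cust_a", "cust_b"],
)

# Variant 1: add the three scores (range 3 to 12).
scores["RFM_Total"] = scores[["R", "F", "M"]].sum(axis=1)

# Variant 2: concatenate the three scores into a code such as "443".
scores["RFM_Code"] = (
    scores["R"].astype(str) + scores["F"].astype(str) + scores["M"].astype(str)
)
```

The summed variant gives a single ordinal value that is easy to threshold into labels, while the concatenated code keeps the three dimensions distinguishable.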

In 2009 - 2010

image

In 2010-2011

image

4. Labelling

image

Labelling in 2009-2010

image

Labelling in 2010 - 2011

image

Customer Segmentation

In 2009-2010

image

In 2010-2011

image

2. K-Means Clustering

K-Means clustering algorithm is an unsupervised machine learning algorithm that uses multiple iterations to segment the unlabeled data points into K different clusters in a way such that each data point belongs to only a single group that has similar properties. K-means gives the best result under the following conditions:

  1. Data’s distribution is not skewed.

image

The data is highly skewed, so a log transformation is applied to reduce the skewness of each variable. A small constant is added first because the log transformation requires all values to be positive.

  2. Data is standardised (i.e. mean of 0 and standard deviation of 1).

image
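The two preprocessing conditions above can be sketched as a log transformation followed by standardisation. This is a minimal sketch under assumptions: the added constant is taken to be 1 (via `numpy.log1p`), all RFM values are assumed to be non-negative, and `preprocess_rfm` is a hypothetical helper.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

def preprocess_rfm(rfm: pd.DataFrame) -> pd.DataFrame:
    """Reduce skewness with log(1 + x), then scale to mean 0 and std 1."""
    cols = ["Recency", "Frequency", "Monetary"]
    # log1p computes log(1 + x), so zero values are handled without error.
    logged = np.log1p(rfm[cols])
    scaled = StandardScaler().fit_transform(logged)
    return pd.DataFrame(scaled, index=rfm.index, columns=cols)

# Hypothetical, strongly skewed RFM values for four customers.
rfm = pd.DataFrame({
    "Recency": [1, 10, 100, 300],
    "Frequency": [20, 5, 2, 1],
    "Monetary": [5000.0, 400.0, 90.0, 15.0],
})
scaled = preprocess_rfm(rfm)
```

`StandardScaler` uses the population standard deviation, so the check for unit variance should use `ddof=0`.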

Finding the Optimal Number of Clusters Using Elbow Method

In 2009-2010

image

The number of clusters at which the decrease in inertia levels off can be chosen as a suitable value for the data. Looking at the elbow curve above, any number of clusters between 3 and 5 is a reasonable choice.

image image

From the flattened graphs and the snake plots, it is evident that 4 clusters segment the customers well. A higher number of clusters is also possible; the choice depends on how the company wants to segment its customers.
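The elbow method described above amounts to fitting K-Means for a range of cluster counts and recording the inertia of each fit. The sketch below uses synthetic blobs in place of the preprocessed RFM table; `inertia_by_k`, the seed, and the range of k are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

def inertia_by_k(X, k_values, seed=42):
    """Fit K-Means for each k and return the inertia (within-cluster SSE)."""
    return {
        k: KMeans(n_clusters=k, n_init=10, random_state=seed).fit(X).inertia_
        for k in k_values
    }

# Synthetic stand-in for the scaled RFM data: four well-separated groups.
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (-5, 5), (5, -5), (5, 5)],
    cluster_std=0.5,
    random_state=0,
)
inertias = inertia_by_k(X, range(1, 9))
```

Plotting `inertias` against k gives the elbow curve; the elbow is where adding another cluster stops reducing inertia by much.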

In 2010-2011

image

The number of clusters at which the decrease in inertia levels off can be chosen as a suitable value for the data. Looking at the elbow curve above, any number of clusters between 3 and 5 is a reasonable choice.

image image

From the flattened graphs and the snake plots, it is evident that 4 clusters segment the customers well. A higher number of clusters is also possible; the choice depends on how the company wants to segment its customers.

Evaluating Model

1. Davies Bouldin Score

The Davies-Bouldin score is a metric for evaluating clustering algorithms. The lower the Davies-Bouldin score, the better the clustering.

2. Silhouette Score

The silhouette score is a metric for evaluating clustering algorithms. The higher the silhouette score, the better the clustering.
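Both metrics are available in scikit-learn and can be compared across candidate cluster counts. The sketch below runs on synthetic blobs rather than the project's data; `score_clusterings`, the seed, and the range of k are illustrative assumptions.

```python
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import davies_bouldin_score, silhouette_score

def score_clusterings(X, k_values, seed=42):
    """Return the Davies-Bouldin and silhouette scores for each k."""
    results = {}
    for k in k_values:
        labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(X)
        results[k] = {
            "davies_bouldin": davies_bouldin_score(X, labels),  # lower is better
            "silhouette": silhouette_score(X, labels),          # higher is better
        }
    return results

# Synthetic stand-in for the scaled RFM data: four well-separated groups.
X, _ = make_blobs(
    n_samples=300,
    centers=[(-5, -5), (-5, 5), (5, -5), (5, 5)],
    cluster_std=0.5,
    random_state=0,
)
results = score_clusterings(X, range(2, 7))
best_by_db = min(results, key=lambda k: results[k]["davies_bouldin"])
best_by_silhouette = max(results, key=lambda k: results[k]["silhouette"])
```

Both metrics need at least two clusters, so the comparison starts at k = 2.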

In 2009-2010

1. Davies Bouldin Score

image

K-Means with 4 clusters has the lowest Davies-Bouldin score among the cluster counts tested. Therefore, the optimal number of clusters is 4.

2. Silhouette Score

image

K-Means with 4 clusters has the highest silhouette score among the cluster counts tested. Therefore, the optimal number of clusters is 4.

In 2010-2011

1. Davies Bouldin Score

image

K-Means with 4 clusters has the lowest Davies-Bouldin score among the cluster counts tested. Therefore, the optimal number of clusters is 4.

2. Silhouette Score

image

K-Means with 4 clusters has the highest silhouette score among the cluster counts tested. Therefore, the optimal number of clusters is 4.

Interpretation of The Clusters Formed Using K-means

In 2009-2010

image

image

In 2010-2011

image

image

Recommendation

Based on the 4 clusters, we could formulate marketing strategies relevant to each cluster:

In 2009-2010

image

In 2010-2011

image
