Customer Segmentation

As a business owner or marketer, understanding your customers is essential for creating targeted marketing campaigns and improving customer satisfaction. One way to achieve this is through demographic segmentation, where customers are grouped based on shared characteristics such as age, gender, income, education and location. By understanding these demographic characteristics, businesses can better target their marketing efforts, tailor their products and services, and improve their customer experience. However, analysing large datasets can be time-consuming, and manual grouping can be prone to errors. This is where the k-means algorithm comes into play.

K-means clustering is a popular unsupervised machine learning algorithm that partitions data points into k clusters. Each point is assigned to the nearest cluster centre (centroid), with the goal of minimising the distance between the points within each cluster and that cluster's centroid.
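To make the idea concrete, here is a minimal sketch of the standard k-means loop (often called Lloyd's algorithm) in plain Python. It is an illustration only, not the code used in this project: the function names, the `init` parameter and the toy points are all my own assumptions.

```python
import random


def squared_distance(a, b):
    """Squared Euclidean distance between two points of equal length."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def kmeans(points, k, n_iter=100, seed=0, init=None):
    """Cluster `points` (a list of tuples) into k groups.

    `init` is an optional list of k starting centroids; if omitted,
    k distinct points are drawn at random.
    """
    rng = random.Random(seed)
    centroids = list(init) if init is not None else rng.sample(points, k)
    labels = None
    for _ in range(n_iter):
        # Assignment step: attach each point to its nearest centroid.
        new_labels = [
            min(range(k), key=lambda j: squared_distance(p, centroids[j]))
            for p in points
        ]
        if new_labels == labels:
            break  # assignments stopped changing, so the loop has converged
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = tuple(sum(c) / len(members) for c in zip(*members))
    return labels, centroids


# Toy data (hypothetical): two well-separated groups in two dimensions.
points = [(1.0, 1.0), (1.2, 0.8), (0.9, 1.1), (8.0, 8.0), (8.2, 7.9), (7.9, 8.1)]
labels, centroids = kmeans(points, k=2)
print(labels)
print(centroids)
```

The result depends on the starting centroids, which is why library implementations usually run the algorithm several times from different starting points and keep the best run.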

The dataset used for this project was downloaded from the Kaggle website. The goal of the project was to cluster the customer dataset in order to identify groups of similar customers.

The code developed for the project, along with the data pre-processing and data analysis performed, can be viewed here.

The steps followed for this analysis were:

1. Import the dataset

2. Pre-process the dataset

3. Explore the dataset

4. Perform clustering

5. Identify the calculated clusters
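The five steps above could be sketched roughly as follows with pandas and scikit-learn. This is not the project code: the data is synthetic, the column names are placeholders, and the choices to drop missing rows, standardise the features and cluster on the first two principal components are assumptions I make for illustration.

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

# 1. Import the dataset. A synthetic stand-in is built here; the real
#    file would typically be loaded with pd.read_csv.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "Income": rng.normal(50_000, 15_000, 200),
    "Recency": rng.integers(0, 100, 200),
    "Kidhome": rng.integers(0, 3, 200),
})
df.loc[0, "Income"] = np.nan  # simulate one missing value

# 2. Pre-process: drop incomplete rows and put features on a common scale.
df = df.dropna()
X = StandardScaler().fit_transform(df)

# 3. Explore: summary statistics for each feature.
print(df.describe())

# 4. Perform clustering on the first two principal components (assumed choice).
components = PCA(n_components=2).fit_transform(X)
model = KMeans(n_clusters=4, n_init=10, random_state=0)
df["Cluster"] = model.fit_predict(components)

# 5. Identify the clusters by comparing the average feature values per group.
profile = df.groupby("Cluster").mean()
print(profile)
```

Standardising before clustering matters because k-means relies on distances, so a feature measured in tens of thousands (such as income) would otherwise dominate features measured in single digits.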

In this section, I would like to discuss the outcomes of the analysis by describing two figures. The first figure shows the relationships between the original features and the principal components extracted from the data.

Overall, I observe that some features have strong correlations with certain principal components, while other features have weaker or no correlations. For example, feature 'Recency' is highly correlated with principal component 3, while feature 'Kidhome' has no significant correlation with any of the principal components. Interestingly, I also observe a negative correlation between feature 'Education_Binary' and principal component 4, as well as feature 'Relationship_Status' and principal component 5. Principal components 1 and 2 are correlated with a mixture of features. These findings suggest that certain features may be more important for explaining the variance in the data captured by the principal components.
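For readers who want to reproduce this kind of feature-versus-component table, the sketch below computes the correlation between each standardised feature and each principal component using NumPy only. The data is synthetic and the feature names are placeholders, so the numbers it prints say nothing about the project's dataset, and the figure itself may have been produced differently.

```python
import numpy as np

# Synthetic stand-in data: "Spending" is built to follow "Income",
# while "Recency" is independent noise. All names are placeholders.
rng = np.random.default_rng(0)
n = 300
income = rng.normal(size=n)
names = ["Income", "Spending", "Recency"]
X = np.column_stack([
    income,
    0.8 * income + 0.2 * rng.normal(size=n),
    rng.normal(size=n),
])

# Standardise each feature to zero mean and unit variance.
Z = (X - X.mean(axis=0)) / X.std(axis=0)

# PCA via eigendecomposition of the correlation matrix,
# with components sorted by decreasing explained variance.
eigvals, eigvecs = np.linalg.eigh(np.corrcoef(Z, rowvar=False))
order = np.argsort(eigvals)[::-1]
eigvals, eigvecs = eigvals[order], eigvecs[:, order]
scores = Z @ eigvecs  # coordinates of each customer on each component

# For standardised data, the correlation between feature i and component j
# equals eigvecs[i, j] * sqrt(eigvals[j]).
loadings = eigvecs * np.sqrt(eigvals)

for name, row in zip(names, loadings):
    print(name, np.round(row, 2))
```

The sign of each component is arbitrary, so a negative correlation only becomes meaningful when it is compared with the signs of the other features on the same component.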

The second figure shows a scatter plot with four distinct clusters of customers based on the first and second principal components. Group 1 (purple dots) represents low-income parents, while Group 2 (red dots) includes average-income parents. Group 3 (orange dots) represents high-income senior customers, and Group 4 (green dots) includes senior parent customers.

The clustering suggests that there are clear demographic differences among customers that should be taken into account when developing marketing strategies.

If you would like to read more about clustering methods, specifically hierarchical and k-means clustering, read my Medium article.

Again, the code can be found in my GitHub repo.