Machine Learning Engineer Nanodegree

Unsupervised Learning

Project 3: Creating Customer Segments

Welcome to the third project of the Machine Learning Engineer Nanodegree! In this notebook, some template code has already been provided for you, and it will be your job to implement the additional functionality necessary to successfully complete this project. Sections that begin with 'Implementation' in the header indicate that the following block of code will require additional functionality which you must provide. Instructions will be provided for each section and the specifics of the implementation are marked in the code block with a 'TODO' statement. Please be sure to read the instructions carefully!

In addition to implementing code, there will be questions that you must answer which relate to the project and your implementation. Each section where you will answer a question is preceded by a 'Question X' header. Carefully read each question and provide thorough answers in the following text boxes that begin with 'Answer:'. Your project submission will be evaluated based on your answers to each of the questions and the implementation you provide.

Note: Code and Markdown cells can be executed using the Shift + Enter keyboard shortcut. In addition, Markdown cells can typically be edited by double-clicking the cell to enter edit mode.

Getting Started

In this project, you will analyze a dataset containing data on various customers' annual spending amounts (reported in monetary units) across diverse product categories, with the aim of uncovering the data's internal structure. One goal of this project is to best describe the variation in the different types of customers that a wholesale distributor interacts with. Doing so would equip the distributor with insight into how to best structure their delivery service to meet the needs of each customer.

The dataset for this project can be found on the UCI Machine Learning Repository. For the purposes of this project, the features 'Channel' and 'Region' will be excluded in the analysis — with focus instead on the six product categories recorded for customers.

Run the code block below to load the wholesale customers dataset, along with a few of the necessary Python libraries required for this project. You will know the dataset loaded successfully if the size of the dataset is reported.

In [8]:
# Import libraries necessary for this project
import numpy as np
import pandas as pd
import renders as rs
from IPython.display import display # Allows the use of display() for DataFrames
import random
import collections

from sklearn.cross_validation import train_test_split
from sklearn.tree import DecisionTreeRegressor
from sklearn import metrics
from sklearn.decomposition import PCA

from matplotlib import pyplot as plt
# Show matplotlib plots inline (nicely formatted in the notebook)
%matplotlib inline

# Load the wholesale customers dataset
try:
    data = pd.read_csv("customers.csv")
    data.drop(['Region', 'Channel'], axis = 1, inplace = True)
    print "Wholesale customers dataset has {} samples with {} features each.".format(*data.shape)
except:
    print "Dataset could not be loaded. Is the dataset missing?"
Wholesale customers dataset has 440 samples with 6 features each.

Data Exploration

In this section, you will begin exploring the data through visualizations and code to understand how each feature is related to the others. You will observe a statistical description of the dataset, consider the relevance of each feature, and select a few sample data points from the dataset which you will track through the course of this project.

Run the code block below to observe a statistical description of the dataset. Note that the dataset is composed of six important product categories: 'Fresh', 'Milk', 'Grocery', 'Frozen', 'Detergents_Paper', and 'Delicatessen'. Consider what each category represents in terms of products you could purchase.

In [9]:
# Display a description of the dataset
display(data.describe())
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
count 440.000000 440.000000 440.000000 440.000000 440.000000 440.000000
mean 12000.297727 5796.265909 7951.277273 3071.931818 2881.493182 1524.870455
std 12647.328865 7380.377175 9503.162829 4854.673333 4767.854448 2820.105937
min 3.000000 55.000000 3.000000 25.000000 3.000000 3.000000
25% 3127.750000 1533.000000 2153.000000 742.250000 256.750000 408.250000
50% 8504.000000 3627.000000 4755.500000 1526.000000 816.500000 965.500000
75% 16933.750000 7190.250000 10655.750000 3554.250000 3922.000000 1820.250000
max 112151.000000 73498.000000 92780.000000 60869.000000 40827.000000 47943.000000

Implementation: Selecting Samples

To get a better understanding of the customers and how their data will transform through the analysis, it would be best to select a few sample data points and explore them in more detail. In the code block below, add three indices of your choice to the indices list which will represent the customers to track. It is suggested to try different sets of samples until you obtain customers that vary significantly from one another.

In [10]:
# TODO: Select three indices of your choice you wish to sample from the dataset
#indices = random.sample(range(len(data)),3)
indices = [420,272,92]
print indices
# Create a DataFrame of the chosen samples
samples = pd.DataFrame(data.loc[indices], columns = data.keys()).reset_index(drop = True)

print "Chosen samples of wholesale customers dataset:"
display(samples)

print 'Comparison against Mean'
display(samples - np.round(data.mean()))

print 'Comparison against Median'
display(samples - np.round(data.median()))

bins = 5 # number of regions
print 'Regions ({} Bins)'.format(bins)
samples_R = samples.copy(deep=True)
samples_R *= 0 #reset to 0

for i in range(1,bins):
    samples_R[samples > data.quantile(float(i)/bins)] += 1
display(samples_R)
[420, 272, 92]
Chosen samples of wholesale customers dataset:
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 4456 5266 13227 25 6818 1393
1 514 8323 6869 529 93 1040
2 9198 27472 32034 3232 18906 5130
Comparison against Mean
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 -7544 -530 5276 -3047 3937 -132
1 -11486 2527 -1082 -2543 -2788 -485
2 -2802 21676 24083 160 16025 3605
Comparison against Median
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 -4048 1639 8471 -1501 6002 427
1 -7990 4696 2113 -997 -723 74
2 694 23845 27278 1706 18090 4164
Regions (5 Bins)
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 1 3 4 0 4 3
1 0 4 3 0 0 2
2 2 4 4 3 4 4

Question 1

Consider the total purchase cost of each product category and the statistical description of the dataset above for your sample customers.
What kind of establishment (customer) could each of the three samples you've chosen represent?
Hint: Examples of establishments include places like markets, cafes, and retailers, among many others. Avoid using names for establishments, such as saying "McDonalds" when describing a sample customer as a restaurant.

Answer:

Each feature value in the sample set was analyzed -- i.e. put in statistical context -- by comparing it against the mean and the median; finally, I also categorized each value into one of 5 discrete regions according to the quintile (20% interval) it fell into.

With this comparison, it becomes apparent that Customer 420 represents a group that seldom buys Frozen products but buys substantial amounts of Grocery and Detergents_Paper items. This is made clear by the fact that their Frozen spending lies within the lower 20% of the dataset, while their Grocery and Detergents_Paper spending lies within the upper 20%. Based on this, the customer may be the owner of a grocery retailer -- perhaps one that needs large quantities of paper to package its products.

Likewise, Customer 272 represents a group that buys only Milk in a substantial amount, along with regular purchases of Grocery and Delicatessen items. Considering the large Milk purchases, this customer may be a domestic cheese-maker.

Finally, Customer 92 represents a group that buys large quantities across most categories: considering the scale and the variety of the purchases, this customer may be a general-purpose goods vendor who simply stocks all kinds of goods in ample quantities.

Implementation: Feature Relevance

One interesting thought to consider is if one (or more) of the six product categories is actually relevant for understanding customer purchasing. That is to say, is it possible to determine whether customers purchasing some amount of one category of products will necessarily purchase some proportional amount of another category of products? We can make this determination quite easily by training a supervised regression learner on a subset of the data with one feature removed, and then score how well that model can predict the removed feature.

In the code block below, you will need to implement the following:

  • Assign new_data a copy of the data by removing a feature of your choice using the DataFrame.drop function.
  • Use sklearn.cross_validation.train_test_split to split the dataset into training and testing sets.
    • Use the removed feature as your target label. Set a test_size of 0.25 and set a random_state.
  • Import a decision tree regressor, set a random_state, and fit the learner to the training data.
  • Report the prediction score of the testing set using the regressor's score function.
In [11]:
# TODO: Make a copy of the DataFrame, using the 'drop' function to drop the given feature
for drop_feature in data.keys():
    new_data = data.drop([drop_feature], axis = 1, inplace = False)
    #display(new_data.describe())
    # TODO: Split the data into training and testing sets using the given feature as the target

    # TODO: Create a decision tree regressor and fit it to the training set
    repeat = 100
    #train_score = 0
    test_score = 0
    for i in range(repeat):
        X_train, X_test, y_train, y_test = train_test_split(new_data,data[drop_feature])
        regressor = DecisionTreeRegressor(random_state = np.random.randint(65536))
        regressor.fit(X_train,y_train)
        # TODO: Report the score of the prediction using the testing set
        #train_score += regressor.score(X_train,y_train)
        test_score += regressor.score(X_test,y_test)
    #train_score /= repeat
    test_score /= repeat
    print 'FEATURE : {}'.format(drop_feature)
    #print 'TRAIN SCORE : {}'.format(train_score) #should be ~1, since an unconstrained tree fits its own training data almost perfectly
    print 'TEST SCORE : {}'.format(test_score)
FEATURE : Fresh
TEST SCORE : -0.771297438506
FEATURE : Milk
TEST SCORE : 0.0434633350197
FEATURE : Grocery
TEST SCORE : 0.654384472847
FEATURE : Frozen
TEST SCORE : -1.20374516063
FEATURE : Detergents_Paper
TEST SCORE : 0.652307340751
FEATURE : Delicatessen
TEST SCORE : -2.76374103425

Question 2

Which feature did you attempt to predict? What was the reported prediction score? Is this feature necessary for identifying customers' spending habits?
Hint: The coefficient of determination, R^2, is scored between 0 and 1, with 1 being a perfect fit. A negative R^2 implies the model fails to fit the data.

Answer:

I attempted all of the features, repeating each evaluation 100 times to smooth out the noise caused by the instability in the construction of the decision tree. From the negative R^2 scores, it became clear that the features Fresh, Frozen, and Delicatessen could not be predicted from the other features: i.e. they are relatively independent. However, Grocery and Detergents_Paper could largely be reconstructed from the other features. This is consistent with intuition, as groceries encompass a general category of items whose purchase is often accompanied by the purchase of other products such as detergents.
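For reference, a minimal single-feature version of the check above, with a fixed random_state so the score is reproducible, might look like the sketch below; the choice of 'Grocery' and the seed of 42 are illustrative assumptions, and the single score will differ somewhat from the averaged values reported above.

# Sketch: predict one dropped feature with a fixed seed (reuses data, train_test_split,
# and DecisionTreeRegressor imported in the first cell).
drop_feature = 'Grocery'  # illustrative choice
new_data = data.drop([drop_feature], axis = 1)
X_train, X_test, y_train, y_test = train_test_split(
    new_data, data[drop_feature], test_size = 0.25, random_state = 42)
regressor = DecisionTreeRegressor(random_state = 42)
regressor.fit(X_train, y_train)
print 'R^2 for predicting {} : {:.3f}'.format(drop_feature, regressor.score(X_test, y_test))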

Visualize Feature Distributions

To get a better understanding of the dataset, we can construct a scatter matrix of each of the six product features present in the data. If you found that the feature you attempted to predict above is relevant for identifying a specific customer, then the scatter matrix below may not show any correlation between that feature and the others. Conversely, if you believe that feature is not relevant for identifying a specific customer, the scatter matrix might show a correlation between that feature and another feature in the data. Run the code block below to produce a scatter matrix.

In [12]:
# Produce a scatter matrix for each pair of features in the data
pd.scatter_matrix(data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');

Question 3

Are there any pairs of features which exhibit some degree of correlation? Does this confirm or deny your suspicions about the relevance of the feature you attempted to predict? How is the data for those features distributed?
Hint: Is the data normally distributed? Where do most of the data points lie?

Answer:

There is a clear linear trend between Grocery, Detergents_Paper, and Milk, which indicates some mutual relevance among the three features; this confirms my earlier suspicion that customers generally purchase these products together -- i.e. there is redundancy in the feature set.

Otherwise, the data for each pair of features do not appear to be normally distributed, as most of the data points are clustered in the lower region. Nor do the remaining pairs show a clear sign of relation: i.e. there is no functional correlation that maps one feature to the other.

Data Preprocessing

In this section, you will preprocess the data to create a better representation of customers by performing a scaling on the data and detecting (and optionally removing) outliers. Preprocessing data is often a critical step in assuring that results you obtain from your analysis are significant and meaningful.

Implementation: Feature Scaling

If data is not normally distributed, especially if the mean and median vary significantly (indicating a large skew), it is most often appropriate to apply a non-linear scaling — particularly for financial data. One way to achieve this scaling is to use a Box-Cox transformation, which calculates the best power transformation of the data that reduces skewness. A simpler approach which can work in most cases would be applying the natural logarithm.
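As an optional aside, a Box-Cox transformation could be sketched with SciPy as shown below; SciPy is an extra dependency not imported in this notebook, and scipy.stats.boxcox requires strictly positive values (which holds here, since the minimum spending in every category is at least 3):

# Sketch only: Box-Cox alternative to the natural-log scaling (assumes scipy is installed).
from scipy import stats

boxcox_data = data.copy()
for feature in data.keys():
    # stats.boxcox returns the transformed column and the fitted power parameter lambda
    boxcox_data[feature], fitted_lambda = stats.boxcox(data[feature])
    print 'Fitted lambda for {} : {:.3f}'.format(feature, fitted_lambda)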

In the code block below, you will need to implement the following:

  • Assign a copy of the data to log_data after applying a logarithm scaling. Use the np.log function for this.
  • Assign a copy of the sample data to log_samples after applying a logarithm scaling. Again, use np.log.
In [13]:
# TODO: Scale the data using the natural logarithm
log_data = np.log(data)

# TODO: Scale the sample data using the natural logarithm
log_samples = np.log(samples)

# Produce a scatter matrix for each pair of newly-transformed features
pd.scatter_matrix(log_data, alpha = 0.3, figsize = (14,8), diagonal = 'kde');

Observation

After applying a natural logarithm scaling to the data, the distribution of each feature should appear much more normal. For any pairs of features you may have identified earlier as being correlated, observe here whether that correlation is still present (and whether it is now stronger or weaker than before).

Run the code below to see how the sample data has changed after having the natural logarithm applied to it.

In [14]:
# Display the log-transformed sample data
display(log_samples)
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 8.402007 8.569026 9.490015 3.218876 8.827321 7.239215
1 6.242223 9.026778 8.834774 6.270988 4.532599 6.946976
2 9.126741 10.220923 10.374553 8.080856 9.847235 8.542861

Implementation: Outlier Detection

Detecting outliers in the data is extremely important in the data preprocessing step of any analysis. The presence of outliers can often skew results which take into consideration these data points. There are many "rules of thumb" for what constitutes an outlier in a dataset. Here, we will use Tukey's Method for identifying outliers: an outlier step is calculated as 1.5 times the interquartile range (IQR). A data point with a feature that is beyond an outlier step outside of the IQR for that feature is considered abnormal.
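As a concrete example using the quartiles printed below for the log-scaled 'Fresh' feature: Q1 ≈ 8.048 and Q3 ≈ 9.737, so IQR ≈ 1.689 and the outlier step is 1.5 * 1.689 ≈ 2.534; any log-scaled 'Fresh' value below 8.048 - 2.534 ≈ 5.515 or above 9.737 + 2.534 ≈ 12.271 is therefore flagged as an outlier for that feature.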

In the code block below, you will need to implement the following:

  • Assign the value of the 25th percentile for the given feature to Q1. Use np.percentile for this.
  • Assign the value of the 75th percentile for the given feature to Q3. Again, use np.percentile.
  • Assign the calculation of an outlier step for the given feature to step.
  • Optionally remove data points from the dataset by adding indices to the outliers list.

NOTE: If you choose to remove any outliers, ensure that the sample data does not contain any of these points!
Once you have performed this implementation, the dataset will be stored in the variable good_data.

In [15]:
# For each feature find the data points with extreme high or low values

# OPTIONAL: Select the indices for data points you wish to remove
outliers = []

for feature in log_data.keys():
    feature_log_data = log_data[feature]
    # TODO: Calculate Q1 (25th percentile of the data) for the given feature
    Q1 = np.percentile(feature_log_data,25)
    
    # TODO: Calculate Q3 (75th percentile of the data) for the given feature
    Q3 = np.percentile(feature_log_data,75)
    
    print 'Q1', Q1
    print 'Q3', Q3
    # TODO: Use the interquartile range to calculate an outlier step (1.5 times the interquartile range)
    IQR = Q3-Q1
    step = 1.5 * IQR
    
    # Display the outliers
    print "Data points considered outliers for the feature '{}':".format(feature)
    outliers_frame = log_data[~((log_data[feature] >= Q1 - step) & (log_data[feature] <= Q3 + step))]
    display(outliers_frame)
    outliers += list(outliers_frame.index)
    
print 'outliers', outliers
# Remove the outliers, if any were specified

duplicates = [item for item, count in collections.Counter(outliers).items() if count > 1]
print 'duplicates', duplicates

good_data = log_data.drop(log_data.index[duplicates]).reset_index(drop = True)
display(good_data)
Q1 8.04805870221
Q3 9.73706394795
Data points considered outliers for the feature 'Fresh':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
65 4.442651 9.950323 10.732651 3.583519 10.095388 7.260523
66 2.197225 7.335634 8.911530 5.164786 8.151333 3.295837
81 5.389072 9.163249 9.575192 5.645447 8.964184 5.049856
95 1.098612 7.979339 8.740657 6.086775 5.407172 6.563856
96 3.135494 7.869402 9.001839 4.976734 8.262043 5.379897
128 4.941642 9.087834 8.248791 4.955827 6.967909 1.098612
171 5.298317 10.160530 9.894245 6.478510 9.079434 8.740337
193 5.192957 8.156223 9.917982 6.865891 8.633731 6.501290
218 2.890372 8.923191 9.629380 7.158514 8.475746 8.759669
304 5.081404 8.917311 10.117510 6.424869 9.374413 7.787382
305 5.493061 9.468001 9.088399 6.683361 8.271037 5.351858
338 1.098612 5.808142 8.856661 9.655090 2.708050 6.309918
353 4.762174 8.742574 9.961898 5.429346 9.069007 7.013016
355 5.247024 6.588926 7.606885 5.501258 5.214936 4.844187
357 3.610918 7.150701 10.011086 4.919981 8.816853 4.700480
412 4.574711 8.190077 9.425452 4.584967 7.996317 4.127134
Q1 7.33498124004
Q3 8.88048008859
Data points considered outliers for the feature 'Milk':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
86 10.039983 11.205013 10.377047 6.894670 9.906981 6.805723
98 6.220590 4.718499 6.656727 6.796824 4.025352 4.882802
154 6.432940 4.007333 4.919981 4.317488 1.945910 2.079442
356 10.029503 4.897840 5.384495 8.057377 2.197225 6.306275
Q1 7.67461620137
Q3 9.27385367724
Data points considered outliers for the feature 'Grocery':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
75 9.923192 7.036148 1.098612 8.390949 1.098612 6.882437
154 6.432940 4.007333 4.919981 4.317488 1.945910 2.079442
Q1 6.60967774917
Q3 8.17589608318
Data points considered outliers for the feature 'Frozen':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
38 8.431853 9.663261 9.723703 3.496508 8.847360 6.070738
57 8.597297 9.203618 9.257892 3.637586 8.932213 7.156177
65 4.442651 9.950323 10.732651 3.583519 10.095388 7.260523
145 10.000569 9.034080 10.457143 3.737670 9.440738 8.396155
175 7.759187 8.967632 9.382106 3.951244 8.341887 7.436617
264 6.978214 9.177714 9.645041 4.110874 8.696176 7.142827
325 10.395650 9.728181 9.519735 11.016479 7.148346 8.632128
420 8.402007 8.569026 9.490015 3.218876 8.827321 7.239215
429 9.060331 7.467371 8.183118 3.850148 4.430817 7.824446
439 7.932721 7.437206 7.828038 4.174387 6.167516 3.951244
Q1 5.54810142479
Q3 8.27434059875
Data points considered outliers for the feature 'Detergents_Paper':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
75 9.923192 7.036148 1.098612 8.390949 1.098612 6.882437
161 9.428190 6.291569 5.645447 6.995766 1.098612 7.711101
Q1 6.01187465693
Q3 7.50672842655
Data points considered outliers for the feature 'Delicatessen':
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
66 2.197225 7.335634 8.911530 5.164786 8.151333 3.295837
109 7.248504 9.724899 10.274568 6.511745 6.728629 1.098612
128 4.941642 9.087834 8.248791 4.955827 6.967909 1.098612
137 8.034955 8.997147 9.021840 6.493754 6.580639 3.583519
142 10.519646 8.875147 9.018332 8.004700 2.995732 1.098612
154 6.432940 4.007333 4.919981 4.317488 1.945910 2.079442
183 10.514529 10.690808 9.911952 10.505999 5.476464 10.777768
184 5.789960 6.822197 8.457443 4.304065 5.811141 2.397895
187 7.798933 8.987447 9.192075 8.743372 8.148735 1.098612
203 6.368187 6.529419 7.703459 6.150603 6.860664 2.890372
233 6.871091 8.513988 8.106515 6.842683 6.013715 1.945910
285 10.602965 6.461468 8.188689 6.948897 6.077642 2.890372
289 10.663966 5.655992 6.154858 7.235619 3.465736 3.091042
343 7.431892 8.848509 10.177932 7.283448 9.646593 3.610918
outliers [65, 66, 81, 95, 96, 128, 171, 193, 218, 304, 305, 338, 353, 355, 357, 412, 86, 98, 154, 356, 75, 154, 38, 57, 65, 145, 175, 264, 325, 420, 429, 439, 75, 161, 66, 109, 128, 137, 142, 154, 183, 184, 187, 203, 233, 285, 289, 343]
duplicates [128, 154, 65, 66, 75]
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 9.446913 9.175335 8.930759 5.365976 7.891331 7.198931
1 8.861775 9.191158 9.166179 7.474205 8.099554 7.482119
2 8.756682 9.083416 8.946896 7.785305 8.165079 8.967504
3 9.492884 7.086738 8.347827 8.764678 6.228511 7.488853
4 10.026369 8.596004 8.881558 8.272571 7.482682 8.553525
5 9.149847 9.019059 8.542081 6.501290 7.492760 7.280008
6 9.403107 8.070594 8.850088 6.173786 8.051978 6.300786
7 8.933137 8.508354 9.151227 7.419980 8.108021 7.850104
8 8.693329 8.201934 8.731013 6.052089 7.447751 6.620073
9 8.700514 9.314070 9.845911 7.055313 8.912608 7.648740
10 8.121480 8.594710 9.470703 8.389360 8.695674 7.463937
11 9.483873 7.024649 8.416931 7.258412 6.308098 6.208590
12 10.364514 9.418898 9.372204 5.659482 8.263848 7.983099
13 9.962558 8.733594 9.614605 8.037543 8.810907 6.400257
14 10.112654 9.155356 9.400217 5.683580 8.528726 7.681560
15 9.235326 7.015712 8.248267 5.983936 6.871091 6.021023
16 6.927558 9.084324 9.402695 4.897840 8.413609 6.984716
17 8.678632 8.725345 7.983781 6.732211 5.913503 8.406932
18 9.830971 8.752581 9.220192 7.698483 7.925519 8.064951
19 8.959312 7.822044 9.155250 6.505784 7.831220 6.216606
20 9.772581 8.416046 8.434246 6.971669 7.722678 7.661056
21 8.624612 6.769642 7.605890 8.126518 5.926926 6.343880
22 10.350606 7.558517 8.404920 9.149316 7.775276 8.374246
23 10.180096 10.502956 9.999661 8.547528 8.374938 9.712509
24 10.027783 9.187686 9.531844 7.977625 8.407825 8.661813
25 9.690604 8.349957 8.935245 5.303305 8.294799 4.043051
26 9.200088 6.867974 7.958926 8.055475 5.488938 6.725034
27 9.566335 6.688355 8.021256 6.184149 4.605170 6.249975
28 8.321908 9.927399 10.164197 7.054450 9.059982 8.557567
29 10.671000 7.649693 7.866722 7.090077 7.009409 6.712956
... ... ... ... ... ... ...
405 8.799812 7.647786 8.425736 7.236339 7.528332 7.545390
406 7.661998 8.098339 8.095904 7.336286 5.459586 8.381373
407 4.574711 8.190077 9.425452 4.584967 7.996317 4.127134
408 8.513787 8.488588 8.799812 9.790655 6.815640 7.797702
409 8.694335 7.595890 8.136518 8.644530 7.034388 5.669881
410 8.967249 8.707152 9.053920 7.433075 8.171882 7.535830
411 8.386857 9.300181 9.297252 6.742881 8.814033 6.900731
412 8.530109 8.612322 9.310638 5.897154 8.156223 6.968850
413 6.492240 9.047115 9.832099 4.890349 8.815815 6.654153
414 9.089415 8.238273 7.706613 6.450470 7.365180 7.327123
415 8.402007 8.569026 9.490015 3.218876 8.827321 7.239215
416 9.744668 8.486115 9.110851 6.938284 8.135933 7.486613
417 10.181119 7.227662 8.336151 6.721426 6.854355 7.104965
418 9.773664 8.212297 8.446127 6.965080 7.497207 6.504288
419 9.739791 7.966933 9.411811 6.773080 8.074960 5.517453
420 9.327501 7.786552 7.860571 9.638740 4.682131 7.542213
421 9.482960 9.142811 9.569133 8.052296 8.532870 7.546446
422 10.342130 9.722385 8.599510 9.621257 6.084499 7.058758
423 8.021913 8.694502 8.499029 7.695303 6.745236 5.758902
424 9.060331 7.467371 8.183118 3.850148 4.430817 7.824446
425 8.038189 8.349957 9.710085 6.354370 5.484797 7.640123
426 9.051696 8.613594 8.548692 9.509407 7.227662 7.311886
427 9.957834 7.057898 8.466742 5.594711 7.191429 5.978886
428 7.591862 8.076515 7.308543 7.340187 5.874931 7.278629
429 9.725019 8.274357 8.986447 6.533789 7.771067 6.731018
430 10.299003 9.396903 9.682030 9.483036 5.204007 7.698029
431 10.577146 7.266129 6.638568 8.414052 4.532599 7.760467
432 9.584040 9.647821 10.317020 6.079933 9.605149 7.532088
433 9.238928 7.591357 7.710653 6.945051 5.123964 7.661527
434 7.932721 7.437206 7.828038 4.174387 6.167516 3.951244

435 rows × 6 columns

Question 4

Are there any data points considered outliers for more than one feature? Should these data points be removed from the dataset? If any data points were added to the outliers list to be removed, explain why.

Answer:

Data points 65, 66, 75, 128, and 154 were considered outliers for multiple features -- these definitely qualify as outliers, and should be removed from the dataset.

The other 37 points in the accumulated list, however, were considered ordinary for all features except one, so I decided to keep them; the dataset is relatively small, so it is only natural that even ordinary data points may be flagged as outliers due to the limited number of observations. I therefore figured they shouldn't be discarded outright.

Removing outliers is instrumental for clustering algorithms, since their cost is often measured in terms of Euclidean distance. Should there be an outlier, the centroid is forced to deviate from the majority of points just to account for it, which may produce inaccurate results.
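As a minimal numerical illustration of that sensitivity (toy numbers, not taken from this dataset), a single extreme value is enough to drag a cluster mean far away from where most of the points lie:

# Toy illustration: one outlier shifts a mean, and a K-Means centroid is just a cluster mean.
cluster = np.array([1.0, 2.0, 1.5, 2.5, 1.8])
print 'mean without outlier : {:.2f}'.format(cluster.mean())                   # 1.76
print 'mean with outlier    : {:.2f}'.format(np.append(cluster, 50.0).mean())  # 9.80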

Feature Transformation

In this section you will use principal component analysis (PCA) to draw conclusions about the underlying structure of the wholesale customer data. Since using PCA on a dataset calculates the dimensions which best maximize variance, we will find which compound combinations of features best describe customers.

Implementation: PCA

Now that the data has been scaled to a more normal distribution and has had any necessary outliers removed, we can now apply PCA to the good_data to discover which dimensions about the data best maximize the variance of features involved. In addition to finding these dimensions, PCA will also report the explained variance ratio of each dimension — how much variance within the data is explained by that dimension alone. Note that a component (dimension) from PCA can be considered a new "feature" of the space, however it is a composition of the original features present in the data.

In the code block below, you will need to implement the following:

  • Import sklearn.decomposition.PCA and assign the results of fitting PCA in six dimensions with good_data to pca.
  • Apply a PCA transformation of the sample log-data log_samples using pca.transform, and assign the results to pca_samples.
In [16]:
# TODO: Apply PCA to the good data with the same number of dimensions as features
pca = PCA(n_components=6).fit(good_data)

# TODO: Apply a PCA transformation to the sample log-data
pca_samples = pca.transform(log_samples)

# Generate PCA results plot
pca_results = rs.pca_results(good_data, pca)

Question 5

How much variance in the data is explained in total by the first and second principal component? What about the first four principal components? Using the visualization provided above, discuss what the first four dimensions best represent in terms of customer spending.
Hint: A positive increase in a specific dimension corresponds with an increase of the positive-weighted features and a decrease of the negative-weighted features. The rate of increase or decrease is based on the individual feature weights.

Answer:

The first two principal components account for about 70.7% of the variance in the data set; the first four account for 93.1%.

These principal components are a mixture of the lower-level features, and can be thought of as a higher-level feature, or a more abstract category of goods, in a way.

The first principal component is composed mostly of Detergents_Paper, with complementary contributions from Milk and Grocery. Given that this is the first principal component, and the types of products it comprises, it would be fair to label it Common Consumables.

The second principal component is composed mostly of Fresh goods, with complementary contributions from Frozen and Delicatessen goods; it seems to mostly capture General Food Products.

The third principal component is also composed mostly of Fresh goods, though with opposing contributions from Frozen and Delicatessen goods: i.e. a combination of products that are fresh, but neither exotic nor frozen. In that vein, it may be labelled Plain Spoilables.

The fourth principal component is composed of a prominent Frozen feature with an opposing Delicatessen contribution. This is probably the result of the Frozen feature being underrepresented, at least without accompanying Delicatessen goods, in the previous principal components. This qualifies the component as Plain Frozen Goods.
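To read the exact weights behind the interpretation above rather than eyeballing the bar chart, the six-dimensional PCA fit from the cell above can be inspected numerically; this sketch only re-displays values that rs.pca_results already plots:

# Numerical view of the PCA component weights and explained variance (uses pca from the cell above).
weights = pd.DataFrame(np.round(pca.components_, 4), columns = good_data.keys())
weights.index = ['Dimension {}'.format(i + 1) for i in range(len(pca.components_))]
display(weights)
print 'Explained variance ratio :', np.round(pca.explained_variance_ratio_, 4)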

Observation

Run the code below to see how the log-transformed sample data has changed after having a PCA transformation applied to it in six dimensions. Observe the numerical value for the first four dimensions of the sample points. Consider if this is consistent with your initial interpretation of the sample points.

In [17]:
# Display sample log-data after having a PCA transformation applied
display(pd.DataFrame(np.round(pca_samples, 4), columns = pca_results.index.values))
Dimension 1 Dimension 2 Dimension 3 Dimension 4 Dimension 5 Dimension 6
0 3.0266 -1.8034 -1.1358 -2.9577 0.4264 -0.1528
1 -0.5299 -2.0536 2.0455 -0.8916 -2.0329 -0.1451
2 4.0512 2.1537 0.5577 0.2903 -0.0713 0.0530

Implementation: Dimensionality Reduction

When using principal component analysis, one of the main goals is to reduce the dimensionality of the data — in effect, reducing the complexity of the problem. Dimensionality reduction comes at a cost: fewer dimensions used implies less of the total variance in the data is being explained. Because of this, the cumulative explained variance ratio is extremely important for knowing how many dimensions are necessary for the problem. Additionally, if a significant amount of variance is explained by only two or three dimensions, the reduced data can be visualized afterwards.
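Before fixing the number of dimensions, the cumulative explained variance can be checked with a short sketch like the following (pca here is still the six-component fit from above, since it is only refit to two components in the cell below):

# Cumulative explained variance of the 6-dimensional PCA fit.
cumulative = np.cumsum(pca.explained_variance_ratio_)
for i, total in enumerate(cumulative):
    print '{} dimension(s) : {:.1%} of the variance explained'.format(i + 1, total)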

In the code block below, you will need to implement the following:

  • Assign the results of fitting PCA in two dimensions with good_data to pca.
  • Apply a PCA transformation of good_data using pca.transform, and assign the results to reduced_data.
  • Apply a PCA transformation of the sample log-data log_samples using pca.transform, and assign the results to pca_samples.
In [18]:
# TODO: Fit PCA to the good data using only two dimensions
pca = PCA(n_components=2).fit(good_data)

# TODO: Apply a PCA transformation the good data
reduced_data = pca.transform(good_data)

# TODO: Apply a PCA transformation to the sample log-data
pca_samples = pca.transform(log_samples)

# Create a DataFrame for the reduced data
reduced_data = pd.DataFrame(reduced_data, columns = ['Dimension 1', 'Dimension 2'])

Observation

Run the code below to see how the log-transformed sample data has changed after having a PCA transformation applied to it using only two dimensions. Observe how the values for the first two dimensions remain unchanged when compared to a PCA transformation in six dimensions.

In [41]:
# Display sample log-data after applying PCA transformation in two dimensions
display(pd.DataFrame(np.round(pca_samples, 4), columns = ['Dimension 1', 'Dimension 2']))
Dimension 1 Dimension 2
0 3.0266 -1.8034
1 -0.5299 -2.0536
2 4.0512 2.1537

Clustering

In this section, you will choose to use either a K-Means clustering algorithm or a Gaussian Mixture Model clustering algorithm to identify the various customer segments hidden in the data. You will then recover specific data points from the clusters to understand their significance by transforming them back into their original dimension and scale.

Question 6

What are the advantages to using a K-Means clustering algorithm? What are the advantages to using a Gaussian Mixture Model clustering algorithm? Given your observations about the wholesale customer data so far, which of the two algorithms will you use and why?

Answer:

K-Means clustering is in fact a specialized form of a Gaussian Mixture Model, in that it corresponds to the case in which each data point must be identified with exactly one cluster (i.e. a one-hot encoding of the membership probabilities). Naturally, Gaussian Mixture Models have more parameters to work with -- but they can also provide deeper insight into the data. For instance, by examining the probability distribution of a data point belonging to each cluster, we can determine how strongly the point associates itself with its assigned cluster, and how likely it is that the point is classified correctly.

Gaussian Mixture Models are trained with the EM (Expectation-Maximization) algorithm; the extra parametrization generally renders them slower than K-Means. When I measured the times (implemented below), K-Means and the GMM took roughly 11.8 and 12.0 seconds, respectively. This is not a very significant difference, and since the dataset is small, speed is not critical in this scenario.

Considering the benefits and costs of both algorithms, I would use the Gaussian Mixture Model. Although its silhouette score was slightly lower than that of K-Means, the difference was not very pronounced with two clusters. Moreover, the GMM exposes deeper structure that may be present within the dataset: it is useful to examine how strongly the chosen sample points identify with the found clusters by inspecting the mixture of membership probabilities.

Implementation: Creating Clusters

Depending on the problem, the number of clusters that you expect to be in the data may already be known. When the number of clusters is not known a priori, there is no guarantee that a given number of clusters best segments the data, since it is unclear what structure exists in the data — if any. However, we can quantify the "goodness" of a clustering by calculating each data point's silhouette coefficient. The silhouette coefficient for a data point measures how similar it is to its assigned cluster from -1 (dissimilar) to 1 (similar). Calculating the mean silhouette coefficient provides for a simple scoring method of a given clustering.
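For reference, the per-point silhouette coefficient used by sklearn.metrics.silhouette_score is the standard one:

s(i) = (b(i) - a(i)) / max(a(i), b(i))

where a(i) is the mean distance from point i to the other points of its own cluster and b(i) is the mean distance from point i to the points of the nearest other cluster; the reported score is the mean of s(i) over all points.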

In the code block below, you will need to implement the following:

  • Fit a clustering algorithm to the reduced_data and assign it to clusterer.
  • Predict the cluster for each data point in reduced_data using clusterer.predict and assign them to preds.
  • Find the cluster centers using the algorithm's respective attribute and assign them to centers.
  • Predict the cluster for each sample data point in pca_samples and assign them sample_preds.
  • Import sklearn.metrics.silhouette_score and calculate the silhouette score of reduced_data against preds.
    • Assign the silhouette score to score and print the result.
In [40]:
# TODO: Apply your clustering algorithm of choice to the reduced data 
from sklearn.cluster import KMeans
from sklearn.mixture import GMM
import time

scores_K = []
scores_G = []
n_range = range(2,10)
repeat = 10 #for smoother scores

#TEST GMM
times_G = 0
start = time.time()
for n in n_range:
    scores_G_n = []
    for _ in range(repeat):
        clusterer_G = GMM(n_components=n)
        preds_G = clusterer_G.fit_predict(reduced_data)
        centroids_G = clusterer_G.means_
        scores_G_n += [metrics.silhouette_score(reduced_data,preds_G)]
    scores_G += [np.average(scores_G_n)]
mid = time.time()
times_G = mid - start

#TEST KMeans
times_K = 0
for n in n_range:
    scores_K_n = []
    for _ in range(repeat):
        clusterer_K = KMeans(n_clusters=n)
        preds_K = clusterer_K.fit_predict(reduced_data)
        centroids_K = clusterer_K.cluster_centers_ 
        sample_preds_K = clusterer_K.predict(pca_samples)
        scores_K_n += [metrics.silhouette_score(reduced_data,preds_K)]
    scores_K += [np.average(scores_K_n)]
end = time.time()
times_K = end - mid


#Print Results

print 'K-Means Time : {}'.format(times_K)
print 'GMM Time : {}'.format(times_G)

clusterer = GMM(n_components=2)
preds = clusterer.fit_predict(reduced_data)
sample_preds = clusterer.predict(pca_samples)

print 'sample predictions :'
print sample_preds
print 'sample probabilities :'
print clusterer.predict_proba(pca_samples)

centers = clusterer.means_

print pd.DataFrame(data={'Kmeans':scores_K,'GMM':scores_G},index=n_range)
plt.plot(n_range,scores_K)
plt.plot(n_range,scores_G)
plt.legend(['KMeans','GMM'])
plt.show()
K-Means Time : 11.785022974
GMM Time : 12.0481350422
sample predictions :
[1 0 1]
sample probabilities :
[[ 0.01889556  0.98110444]
 [ 0.60975906  0.39024094]
 [ 0.00566077  0.99433923]]
        GMM    Kmeans
2  0.411819  0.426281
3  0.373702  0.395986
4  0.333826  0.331563
5  0.291087  0.349850
6  0.277118  0.363303
7  0.314618  0.363313
8  0.304916  0.357923
9  0.293377  0.358553

Question 7

Report the silhouette score for several cluster numbers you tried. Of these, which number of clusters has the best silhouette score?

Answer:

# Clusters GMM Kmeans
2 0.411819 0.426281
3 0.376278 0.396495
4 0.330628 0.331936
5 0.290384 0.350180
6 0.273230 0.364566
7 0.316544 0.362240
8 0.308589 0.360471
9 0.294413 0.356102

In general, the silhouette score of the K-Means algorithm was better than that of the GMM.

I also smoothed the evaluation over 10 repetitions to reduce possible noise.

Among all the options, K-Means with 2 clusters performed best (0.426).

Cluster Visualization

Once you've chosen the optimal number of clusters for your clustering algorithm using the scoring metric above, you can now visualize the results by executing the code block below. Note that, for experimentation purposes, you are welcome to adjust the number of clusters for your clustering algorithm to see various visualizations. The final visualization provided should, however, correspond with the optimal number of clusters.

In [25]:
# Display the results of the clustering from implementation
rs.cluster_results(reduced_data, preds, centers, pca_samples)

Implementation: Data Recovery

Each cluster present in the visualization above has a central point. These centers (or means) are not specifically data points from the data, but rather the averages of all the data points predicted in the respective clusters. For the problem of creating customer segments, a cluster's center point corresponds to the average customer of that segment. Since the data is currently reduced in dimension and scaled by a logarithm, we can recover the representative customer spending from these data points by applying the inverse transformations.

In the code block below, you will need to implement the following:

  • Apply the inverse transform to centers using pca.inverse_transform and assign the new centers to log_centers.
  • Apply the inverse function of np.log to log_centers using np.exp and assign the true centers to true_centers.
In [26]:
# TODO: Inverse transform the centers
log_centers = pca.inverse_transform(centers)
log_centers_df = pd.DataFrame(np.round(log_centers,4), columns = data.keys())
#display(log_centers_df)

# TODO: Exponentiate the centers
true_centers = np.exp(log_centers)

# Display the true centers
segments = ['Segment {}'.format(i) for i in range(0,len(centers))]
true_centers = pd.DataFrame(np.round(true_centers), columns = data.keys())
true_centers.index = segments
display(true_centers)
#display(data.describe())

# Display Regions
print 'Regions (log-based)'
centers_R = log_centers_df.copy(deep=True)
centers_R *= 0 #reset to 0

division = 5 # number of regions
for i in range(1,division):
    centers_R[log_centers_df > log_data.quantile(float(i)/division)] += 1
display(centers_R)
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
Segment 0 8867 1897 2477 2088 294 681
Segment 1 4005 7900 12104 952 4561 1036
Regions (log-based)
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 2 1 1 2 1 1
1 1 3 3 1 3 2

Question 8

Consider the total purchase cost of each product category for the representative data points above, and reference the statistical description of the dataset at the beginning of this project. What set of establishments could each of the customer segments represent?
Hint: A customer who is assigned to 'Cluster X' should best identify with the establishments represented by the feature set of 'Segment X'.

Answer: Customers belonging to Cluster 0 tend to purchase relatively small amounts (roughly the 20-40% quantile region) of all products, with a slight emphasis on Fresh and Frozen goods. This best matches an establishment such as a cafe.

Customers belonging to Cluster 1 tend to purchase large amounts (roughly the 60-80% quantile region) of Milk, Grocery, and Detergents_Paper; in fact, Cluster 1 generally tends to purchase more overall. Based on the large quantity of purchases across a broad range of categories, this cluster best matches a retailer.

Question 9

For each sample point, which customer segment from Question 8 best represents it? Are the predictions for each sample point consistent with this?

Run the code block below to find which cluster each sample point is predicted to be.

In [28]:
# Display the predictions
for i, pred in enumerate(sample_preds):
    print "Sample point", i, "predicted to be in Cluster", pred
# Better Visualization

print ''
print 'Regions (samples, log-based)'
sample_R = pd.DataFrame(log_samples,index=log_samples.index)
sample_R *= 0
division = 5 # number of regions
for i in range(1,division):
    sample_R[log_samples > log_data.quantile(float(i)/division)] += 1
    
display(sample_R)
print 'Predictions', sample_preds
display(centers_R.loc[sample_preds])
Sample point 0 predicted to be in Cluster 1
Sample point 1 predicted to be in Cluster 0
Sample point 2 predicted to be in Cluster 1

Regions (samples, log-based)
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
0 1 3 4 0 4 3
1 0 4 3 0 0 2
2 2 4 4 3 4 4
Predictions [1 0 1]
Fresh Milk Grocery Frozen Detergents_Paper Delicatessen
1 1 3 3 1 3 2
0 2 1 1 2 1 1
1 1 3 3 1 3 2

Answer:

For the following answers, the data was split into 5 regions to allow for a better visualization, essentially by discretizing the values. The regions were defined by the quantiles of the log-scaled data, at intervals of 20%.

Based on the predictions, and on the quantile region in which each feature value of the sample set falls, the predicted clusters are quite consistent with the obtained segments:

The first sample point falls within the same region as its cluster's centroid for the Fresh and Milk features, and otherwise follows the cluster's spending pattern quite closely: large quantities of Grocery and Detergents_Paper, a somewhat more modest purchase of Delicatessen, and a very limited purchase of Frozen goods.

As for the second sample point, it is not immediately apparent that it should be classified with cluster 0: there seems to be a significant deviation in its Milk and Grocery purchases. Indeed, from the cluster diagram, the point lies in a rather contested region (closer to the boundary between the two clusters). However, recall that the most significant factor in the first principal component was Detergents_Paper: i.e. it is the most influential factor (> 0.7 in terms of weight) in determining the cluster assignment. Indeed, the second sample point identifies strongly with cluster 0 in terms of Detergents_Paper -- at least, more so than with cluster 1.

The apparent confusion over which cluster this sample belongs to is reasonable: from the probability assignments yielded by the GMM, unlike the other two points, which were assigned to cluster 1 with confidences of 98.1% and 99.4% respectively, the second sample point was assigned to cluster 0 with a confidence of a mere 61.0%.

The third sample point does not exactly match the quantities of the centroid of cluster 1. However, it is important to notice that the overall trend the point follows closely matches its cluster: i.e. purchasing large quantities of Milk, Grocery, and Detergents_Paper with more moderate purchases of Fresh and Frozen goods. There is a slight deviation in Delicatessen, but this can be overlooked given that it is a less pronounced factor: it contributes a mere 0.1, in terms of its weight, to the first principal component.

Conclusion

Question 10

Companies often run A/B tests when making small changes to their products or services. If the wholesale distributor wanted to change its delivery service from 5 days a week to 3 days a week, how would you use the structure of the data to help them decide on a group of customers to test?
Hint: Would such a change in the delivery service affect all customers equally? How could the distributor identify who it affects the most?

Answer:

The identified clusters provide insight into the distinct customer groups, each with different interests; with this information, a subsample can be drawn from each group. Within each of these smaller groups, the company can run the experiment -- for instance, switching part of the group to the 3-days-a-week delivery schedule while keeping the rest on the 5-days-a-week schedule -- and gauge the response. Since these samples identify strongly with their group, and assuming the rest of the group would react similarly, the result of the experiment can then be generalized to that group as a whole, for which the company may formulate a more finely tuned delivery strategy.
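A hedged sketch of how such per-segment test groups could be drawn, reusing the cluster predictions preds as segment labels; the group size of 30 and the fixed seed are illustrative assumptions:

# Sketch: draw a small A/B test group from each discovered segment.
# Reuses `reduced_data` and `preds` from the clustering cells above.
np.random.seed(0)  # illustrative, for repeatable sampling
segmented = reduced_data.copy()
segmented['Segment'] = preds

for segment_id in np.unique(preds):
    members = segmented[segmented['Segment'] == segment_id]
    chosen = np.random.choice(members.index.values, size = min(30, len(members)), replace = False)
    print 'Segment {} : {} customers, {} drawn for the A/B test'.format(
        segment_id, len(members), len(chosen))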

Question 11

Assume the wholesale distributor wanted to predict a new feature for each customer based on the purchasing information available. How could the wholesale distributor use the structure of the data to assist a supervised learning analysis?
Hint: What other input feature could the supervised learner use besides the six product features to help make a prediction?

Answer:

Short of conducting another round of data collection to gather information on an entirely new category, the supervised learner could augment its feature set with the identified cluster to which each sample belongs. This new, higher-level feature directly indicates what type of customer the sample represents, which would assist the supervised learner considerably.
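A minimal sketch of that idea, attaching the GMM cluster assignment as an extra column; this reuses clusterer and the two-dimensional pca fit from above, and simply assumes the distributor would later join in whatever new target variable it wants to predict:

# Sketch: use the discovered segment as an additional input feature for a supervised learner.
# Reuses `data`, `log_data`, `pca` (2 components) and `clusterer` (GMM) from earlier cells.
engineered = data.copy()
engineered['Segment'] = clusterer.predict(pca.transform(log_data))
display(engineered.head())
# `engineered`, together with the newly collected target feature, could then be fed to any
# supervised estimator such as a decision tree or a linear model.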

Visualizing Underlying Distributions

At the beginning of this project, it was discussed that the 'Channel' and 'Region' features would be excluded from the dataset so that the customer product categories were emphasized in the analysis. By reintroducing the 'Channel' feature to the dataset, an interesting structure emerges when considering the same PCA dimensionality reduction applied earlier on to the original dataset.

Run the code block below to see how each data point is labeled either 'HoReCa' (Hotel/Restaurant/Cafe) or 'Retail' in the reduced space. In addition, you will find the sample points are circled in the plot, which will identify their labeling.

In [29]:
# Display the clustering results based on 'Channel' data
# Here, "duplicates" are samples that were considered outliers for more than one feature

rs.channel_results(reduced_data, duplicates, pca_samples)

Question 12

How well does the clustering algorithm and number of clusters you've chosen compare to this underlying distribution of Hotel/Restaurant/Cafe customers to Retailer customers? Are there customer segments that would be classified as purely 'Retailers' or 'Hotels/Restaurants/Cafes' by this distribution? Would you consider these classifications as consistent with your previous definition of the customer segments?

Answer:

The HoReCa cluster seems to correspond directly to cluster 0, whereas the Retailer cluster seems to identify with cluster 1 -- in fact, the two clusterings look almost identical. There are indeed a few minor incongruities (especially within the region of overlap in the middle) -- but from the distributor's perspective, it may even be more beneficial to adhere to the newly found clusters. The reasoning is simple: the few Retailer customers who were identified as belonging to the HoReCa cluster are effectively behaving as HoReCa customers regardless of their true establishment. Based on their purchasing patterns, they should still be classified in the same cluster as the HoReCa customers -- and vice versa.

Overall, this result reinforces the confidence that the algorithm successfully classified customers into discrete clusters. Moreover, it is consistent with the customer segments assigned to each of the discovered clusters.
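As a sketch of how that visual agreement could be quantified, the 'Channel' column can be reloaded and compared against the cluster predictions with, for example, the adjusted Rand index; the row alignment below assumes the same outlier (duplicate) removal that produced good_data:

# Sketch: quantify agreement between the discovered clusters and the 'Channel' labels.
# Reuses `duplicates`, `preds` and the `metrics` module imported at the top.
full_data = pd.read_csv("customers.csv")
channel = full_data['Channel'].drop(full_data.index[duplicates]).reset_index(drop = True)
print 'Adjusted Rand index : {:.3f}'.format(metrics.adjusted_rand_score(channel, preds))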

Note: Once you have completed all of the code implementations and successfully answered each question above, you may finalize your work by exporting the iPython Notebook as an HTML document. You can do this by using the menu above and navigating to
File -> Download as -> HTML (.html). Include the finished document along with this notebook as your submission.
