Problem 1

HW4_1_Solution.jpg

Problem 2

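Using credit card user behavior data provided by a bank, perform a cluster analysis to support customer segmentation. The data fields are described below: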

  • CUST_ID: Identification of the credit card holder (categorical)
  • BALANCE_FREQ: How frequently the balance is updated, a score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
  • PURCHASE_FREQ: How frequently purchases are made, a score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
  • ONEOFF_FREQ: How frequently purchases are made in one go (1 = frequently purchased, 0 = not frequently purchased)
  • INSTALLMENTS_FREQ: How frequently purchases are made in installments (1 = frequently done, 0 = not frequently done)
  • CREDIT_LIMIT: Credit limit of the user's card
  • PRC_FULL_PAYMENT: Percentage of the full bill paid by the user
  • TENURE: Tenure of the user's credit card service
In [7]:
import matplotlib.pyplot as plt 
import pandas as pd 
import numpy as np
import seaborn as sns
import warnings
warnings.simplefilter("ignore")

Load the data

In [8]:
df = pd.read_csv('./credit_card.csv')
df.head()
Out[8]:
CUST_ID BALANCE_FREQ PURCHASE_FREQ ONEOFF_FREQ INSTALLMENTS_FREQ CREDIT_LIMIT PRC_FULL_PAYMENT TENURE
0 C10001 0.818182 0.166667 0.000000 0.083333 1000.0 0.000000 12
1 C10002 0.909091 0.000000 0.000000 0.000000 7000.0 0.222222 12
2 C10003 1.000000 1.000000 1.000000 0.000000 7500.0 0.000000 12
3 C10004 0.636364 0.083333 0.083333 0.000000 7500.0 0.000000 12
4 C10005 1.000000 0.083333 0.083333 0.000000 1200.0 0.000000 12

Data preprocessing

  • Handle missing values
In [9]:
missing_values_count = df.isnull().sum()
print(missing_values_count)
CUST_ID              0
BALANCE_FREQ         0
PURCHASE_FREQ        0
ONEOFF_FREQ          0
INSTALLMENTS_FREQ    0
CREDIT_LIMIT         1
PRC_FULL_PAYMENT     0
TENURE               0
dtype: int64
In [10]:
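# Fill missing values with column means; numeric_only skips the string CUST_ID column
df = df.fillna(df.mean(numeric_only=True))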
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8950 entries, 0 to 8949
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype  
---  ------             --------------  -----  
 0   CUST_ID            8950 non-null   object 
 1   BALANCE_FREQ       8950 non-null   float64
 2   PURCHASE_FREQ      8950 non-null   float64
 3   ONEOFF_FREQ        8950 non-null   float64
 4   INSTALLMENTS_FREQ  8950 non-null   float64
 5   CREDIT_LIMIT       8950 non-null   float64
 6   PRC_FULL_PAYMENT   8950 non-null   float64
 7   TENURE             8950 non-null   int64  
dtypes: float64(6), int64(1), object(1)
memory usage: 559.5+ KB
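Mean imputation is only defined for numeric columns, and recent pandas versions raise an error when `DataFrame.mean()` meets a string column such as CUST_ID. A minimal sketch on a made-up three-row frame (the values are illustrative and not taken from credit_card.csv) that restricts the fill to numeric columns:

```python
import numpy as np
import pandas as pd

# Toy frame with hypothetical values; only the column names mirror the dataset.
toy = pd.DataFrame({
    "CUST_ID": ["C1", "C2", "C3"],
    "CREDIT_LIMIT": [1000.0, np.nan, 3000.0],
    "TENURE": [12, 12, 6],
})

# Select numeric columns so the ID column is never averaged.
numeric_cols = toy.select_dtypes(include="number").columns
filled = toy.copy()
filled[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].mean())

print(filled)
```

With a single missing CREDIT_LIMIT value among 8950 rows, the choice of fill statistic is unlikely to affect the clustering noticeably.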
  • Drop columns that are irrelevant to the cluster analysis
In [11]:
data = df.drop(["CUST_ID"], axis = 1)
print(data.shape)
(8950, 7)
  • Inspect the distribution of each feature
In [12]:
graph_by_variables = data.columns
plt.figure(figsize = (15, 18))
for i, col in enumerate(graph_by_variables):
    plt.subplot(4, 2, i+1)
    # histplot with kde = True replaces the deprecated sns.distplot
    sns.histplot(data[col], kde = True, stat = 'density')
    plt.title(col)
plt.tight_layout()
  • Inspect the correlations between features
In [13]:
f, ax = plt.subplots(figsize = (15, 15))
sns.heatmap(data.corr(), annot = True, linewidths = 0.5, fmt = '.1f', ax = ax)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x12eb1ebd0>
  • Standardize the data
In [14]:
from sklearn.preprocessing import StandardScaler
standardscaler = StandardScaler()
data_s = standardscaler.fit_transform(data)
data_s[:2]
Out[14]:
array([[-0.24943448, -0.80649035, -0.67866081, -0.70731317, -0.96043334,
        -0.52555097,  0.36067954],
       [ 0.13432467, -1.22175806, -0.67866081, -0.91699519,  0.68863903,
         0.2342269 ,  0.36067954]])
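StandardScaler replaces each column with its z-score, z = (x - mean) / std, using the population standard deviation (ddof = 0). A small sketch on made-up numbers (not rows from the dataset) that checks this against a plain NumPy computation:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical 4x2 matrix standing in for two feature columns.
X = np.array([[1000.0, 12.0],
              [7000.0, 12.0],
              [7500.0, 6.0],
              [1200.0, 10.0]])

scaled = StandardScaler().fit_transform(X)

# Manual z-score; np.std defaults to ddof=0, matching StandardScaler.
manual = (X - X.mean(axis=0)) / X.std(axis=0)

print(np.allclose(scaled, manual))
```

Standardizing matters here because CREDIT_LIMIT is in the thousands while the frequency columns lie between 0 and 1, so unscaled Euclidean distances would be dominated by the credit limit.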

Cluster analysis - K-means

In [15]:
from sklearn.cluster import KMeans
n_clusters = 4
kmeans = KMeans(n_clusters = n_clusters, random_state = 628)
labels = kmeans.fit_predict(data_s)

# plot cluster sizes
plt.hist(labels, bins = range(n_clusters + 1))
plt.title('Customers per Cluster')
plt.xlabel('Cluster')
plt.ylabel('Customers')
plt.show()
  • Choose the number of clusters with the elbow method
In [16]:
sse = []
cluster_list = range(1, 16)
for i in cluster_list :
    kmeans = KMeans(n_clusters = i, random_state = 628)
    kmeans.fit(data_s)
    sse.append(kmeans.inertia_)
plt.plot(cluster_list, sse)
plt.title('Elbow Method')
plt.xlabel('Clusters')
plt.ylabel('SSE')
plt.show()

No clear elbow can be observed in the SSE curve.
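The SSE plotted above is what scikit-learn exposes as `inertia_`: the sum of squared distances from each sample to the center of its assigned cluster. A sketch on synthetic blobs (generated with `make_blobs`, not the credit card data) that recomputes it by hand:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs

# Synthetic stand-in data; sample count, dimensions and seed are arbitrary.
X, _ = make_blobs(n_samples=300, centers=4, n_features=7, random_state=0)

km = KMeans(n_clusters=4, n_init=10, random_state=0).fit(X)

# Squared distance from every sample to the center of its own cluster.
assigned_centers = km.cluster_centers_[km.labels_]
sse_manual = np.sum((X - assigned_centers) ** 2)

print(sse_manual, km.inertia_)
```

Since SSE typically keeps shrinking as more clusters are added, the elbow method looks for a bend in the curve and not for a minimum, which is why a smooth curve like the one here is inconclusive.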

  • Choose the number of clusters with the silhouette coefficient
In [17]:
from sklearn.metrics import silhouette_score
s = [] 
cluster_list = range(2, 10)
for i in cluster_list:
    kmeans = KMeans(n_clusters = i, random_state = 628)
    s.append(silhouette_score(data_s, kmeans.fit_predict(data_s))) 
    
# Plotting a bar graph to compare the results 
plt.bar(cluster_list, s) 
plt.xlabel('Number of clusters', fontsize = 10) 
plt.ylabel('Silhouette Score', fontsize = 10) 
plt.show()

K-means reaches its highest silhouette score, about 0.33, when the customers are grouped into 5 clusters.
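For a single sample, the silhouette value is s = (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster; `silhouette_score` averages s over all samples. A sketch on synthetic data (an assumption for illustration, not the notebook's data) that computes the score by hand and compares it with scikit-learn:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.metrics import pairwise_distances, silhouette_score

X, _ = make_blobs(n_samples=120, centers=3, n_features=7, random_state=1)
labels = KMeans(n_clusters=3, n_init=10, random_state=1).fit_predict(X)

D = pairwise_distances(X)          # Euclidean distances between all samples
clusters = np.unique(labels)
s_values = []
for i in range(len(X)):
    own = labels == labels[i]
    n_own = own.sum()
    if n_own == 1:
        s_values.append(0.0)       # convention for singleton clusters
        continue
    # a: mean distance to the other members of the same cluster (self excluded)
    a = D[i, own].sum() / (n_own - 1)
    # b: smallest mean distance to the members of any other cluster
    b = min(D[i, labels == c].mean() for c in clusters if c != labels[i])
    s_values.append((b - a) / max(a, b))

manual_score = float(np.mean(s_values))
print(manual_score, silhouette_score(X, labels))
```

Values lie between -1 and 1, so a score near 0.33 suggests clusters that are distinguishable but far from cleanly separated.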

In [18]:
kmeans = KMeans(n_clusters = 5, random_state = 628) 
labels = kmeans.fit_predict(data_s)
data["cluster"] = labels
# plot cluster sizes
plt.hist(labels, bins = range(6))
plt.title('Customers per Cluster')
plt.xlabel('Cluster')
plt.ylabel('Customers')
plt.show()

Visualizing the cluster centers makes it possible to summarize the characteristics of each customer group.

In [19]:
plt.subplots(figsize = (15, 18))
centers = kmeans.cluster_centers_  # centers are in standardized (z-score) units
idx = np.arange(7)
colors = ['blue', 'green', 'red', 'orange', 'grey']
# One group of five bars per feature, one bar per cluster
for k, color in enumerate(colors):
    plt.bar(idx + k/6, centers[k], color = color, width = 1/6, label = f'Cluster {k + 1}')
plt.xticks(idx + 2/6, graph_by_variables)  # center each label under its group
plt.legend()
plt.show()
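  • Cluster 1 (blue bars): the balance is updated fairly frequently, yet the purchase frequency is the lowest, the credit limit is mid-range, and the full-payment ratio is the lowest.
  • Cluster 2 (green bars): both the balance update frequency and the purchase frequency are high, with a marked preference for installment purchases.
  • Cluster 3 (red bars): both the balance update frequency and the purchase frequency are high, but compared with cluster 2 these users prefer one-off purchases.
  • Cluster 4 (orange bars): the lowest credit limit and the shortest tenure, with the other frequencies also on the low side.
  • Cluster 5 (grey bars): the lowest balance update frequency, low purchase frequencies, a mid-range credit limit, and a relatively high full-payment ratio.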
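Because K-means was fitted on standardized data, the cluster centers are z-scores, so "high" and "low" in the summary above are relative to the overall mean of each feature. A sketch, on synthetic data with placeholder column names `f0`..`f2`, of mapping the centers back to original units with `StandardScaler.inverse_transform`; at convergence this should agree with the per-cluster means of the raw columns:

```python
import numpy as np
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler

# Well-separated synthetic blobs; column names f0..f2 are placeholders.
X, _ = make_blobs(n_samples=200, centers=3, n_features=3,
                  cluster_std=0.5, random_state=2)
frame = pd.DataFrame(X, columns=["f0", "f1", "f2"])

scaler = StandardScaler()
X_s = scaler.fit_transform(frame)
# tol=0 makes K-means iterate until the assignments stop changing.
km = KMeans(n_clusters=3, n_init=10, tol=0, random_state=2).fit(X_s)

# Undo the scaling: centers expressed in the original units of each column.
centers_orig = pd.DataFrame(scaler.inverse_transform(km.cluster_centers_),
                            columns=frame.columns)

# Equivalent view: average the raw columns within each cluster.
cluster_means = frame.groupby(km.labels_).mean()

print(centers_orig.round(2))
print(cluster_means.round(2))
```

Applied to the notebook's data, the same idea would express each cluster's typical CREDIT_LIMIT and TENURE in their original units, which is often easier to communicate than z-scores.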

Cluster analysis - Hierarchical clustering

  • By sklearn.cluster
In [14]:
from sklearn.cluster import AgglomerativeClustering
s = [] 
cluster_list = range(2, 10)
for i in cluster_list:
    hc = AgglomerativeClustering(n_clusters = i)
    s.append(silhouette_score(data_s, hc.fit_predict(data_s)))
    
# Plotting a bar graph to compare the results 
plt.bar(cluster_list, s) 
plt.xlabel('Number of Clusters', fontsize = 10) 
plt.ylabel('Silhouette Score', fontsize = 10) 
plt.show() 

Agglomerative clustering reaches its highest silhouette score, about 0.28, when the customers are grouped into 4 clusters.

  • By scipy.cluster
In [16]:
import scipy.cluster.hierarchy as sch
plt.figure(figsize = (10, 6))
linked = sch.linkage(data_s, method = 'ward')
dendrogram = sch.dendrogram(linked)
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean Distances')
plt.show()
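A dendrogram by itself does not assign customers to groups. One way to obtain flat labels from the same kind of linkage matrix is `scipy.cluster.hierarchy.fcluster` with `criterion="maxclust"`. The sketch below uses synthetic blobs in place of `data_s`, and the cluster count of 4 is an illustrative choice:

```python
import numpy as np
import scipy.cluster.hierarchy as sch
from sklearn.datasets import make_blobs

# Well-separated synthetic data standing in for the standardized features.
X, _ = make_blobs(n_samples=150, centers=4, n_features=7,
                  cluster_std=0.5, random_state=3)

linked = sch.linkage(X, method="ward")

# Cut the tree so that at most 4 flat clusters remain; labels start at 1.
flat_labels = sch.fcluster(linked, t=4, criterion="maxclust")

print(np.bincount(flat_labels)[1:])  # cluster sizes
```

Ward linkage is also the default of `AgglomerativeClustering`, so both routes are expected to produce matching partitions up to label numbering.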

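Judged by the silhouette coefficient as an objective metric, K-means outperforms agglomerative clustering on this data, with its best score at K = 5. Alternatively, the results can be assessed subjectively against the business context of the clustering task, choosing whichever method is judged most suitable.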
