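Question 1¶
Question 2¶
Using credit card user behavior data provided by a bank, perform a cluster analysis to support customer segmentation. The data fields are described below: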
- CUST_ID: Identification of the credit card holder (categorical)
- BALANCE_FREQ: How frequently the balance is updated, score between 0 and 1 (1 = frequently updated, 0 = not frequently updated)
- PURCHASE_FREQ: How frequently purchases are made, score between 0 and 1 (1 = frequently purchased, 0 = not frequently purchased)
- ONEOFF_FREQ: How frequently purchases are made in one go (1 = frequently purchased, 0 = not frequently purchased)
- INSTALLMENTS_FREQ: How frequently purchases are made in installments (1 = frequently done, 0 = not frequently done)
- CREDIT_LIMIT: Credit card limit for the user
- PRC_FULL_PAYMENT: Percent of full payment paid by the user
- TENURE: Tenure of the credit card service for the user
In [7]:
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import seaborn as sns
import warnings
warnings.simplefilter("ignore")
Load the data¶
In [8]:
df = pd.read_csv('./credit_card.csv')
df.head()
Out[8]:
| | CUST_ID | BALANCE_FREQ | PURCHASE_FREQ | ONEOFF_FREQ | INSTALLMENTS_FREQ | CREDIT_LIMIT | PRC_FULL_PAYMENT | TENURE |
|---|---|---|---|---|---|---|---|---|
| 0 | C10001 | 0.818182 | 0.166667 | 0.000000 | 0.083333 | 1000.0 | 0.000000 | 12 |
| 1 | C10002 | 0.909091 | 0.000000 | 0.000000 | 0.000000 | 7000.0 | 0.222222 | 12 |
| 2 | C10003 | 1.000000 | 1.000000 | 1.000000 | 0.000000 | 7500.0 | 0.000000 | 12 |
| 3 | C10004 | 0.636364 | 0.083333 | 0.083333 | 0.000000 | 7500.0 | 0.000000 | 12 |
| 4 | C10005 | 1.000000 | 0.083333 | 0.083333 | 0.000000 | 1200.0 | 0.000000 | 12 |
Data preprocessing¶
- Handle missing values
In [9]:
missing_values_count = df.isnull().sum()
print(missing_values_count)
CUST_ID              0
BALANCE_FREQ         0
PURCHASE_FREQ        0
ONEOFF_FREQ          0
INSTALLMENTS_FREQ    0
CREDIT_LIMIT         1
PRC_FULL_PAYMENT     0
TENURE               0
dtype: int64
In [10]:
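# Fill the single missing CREDIT_LIMIT with the column mean; numeric_only skips the text column CUST_ID
df = df.fillna(df.mean(numeric_only = True))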
df.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 8950 entries, 0 to 8949
Data columns (total 8 columns):
 #   Column             Non-Null Count  Dtype
---  ------             --------------  -----
 0   CUST_ID            8950 non-null   object
 1   BALANCE_FREQ       8950 non-null   float64
 2   PURCHASE_FREQ      8950 non-null   float64
 3   ONEOFF_FREQ        8950 non-null   float64
 4   INSTALLMENTS_FREQ  8950 non-null   float64
 5   CREDIT_LIMIT       8950 non-null   float64
 6   PRC_FULL_PAYMENT   8950 non-null   float64
 7   TENURE             8950 non-null   int64
dtypes: float64(6), int64(1), object(1)
memory usage: 559.5+ KB
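The mean imputation step can be checked on a small frame. The sketch below uses a made-up three-row table (the values are illustrative, not rows from credit_card.csv) and restricts the mean to numeric columns so that the text column is left alone.

```python
import numpy as np
import pandas as pd

# Toy frame (hypothetical values) with one missing CREDIT_LIMIT entry
toy = pd.DataFrame({
    "CUST_ID": ["C1", "C2", "C3"],
    "CREDIT_LIMIT": [1000.0, np.nan, 3000.0],
    "TENURE": [12, 12, 6],
})

# Column means over numeric columns only; NaN entries are skipped by default
col_means = toy.mean(numeric_only=True)

# fillna with a Series fills each column with the entry of the same name
filled = toy.fillna(col_means)

n_missing_after = int(filled.isnull().sum().sum())
imputed_value = float(filled.loc[1, "CREDIT_LIMIT"])
```

With only one missing value out of 8950 rows, the choice between mean and median imputation is unlikely to affect the clusters noticeably.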
- Drop columns that are irrelevant to the cluster analysis
In [11]:
data = df.drop(["CUST_ID"], axis = 1)
print(data.shape)
(8950, 7)
- Inspect the distribution of each attribute
In [12]:
graph_by_variables = data.columns
plt.figure(figsize = (15, 18))
for i in range(0, 7):
    plt.subplot(4, 2, i+1)
    # histplot with a KDE overlay replaces the deprecated sns.distplot
    sns.histplot(data[graph_by_variables[i]], kde = True)
    plt.title(graph_by_variables[i])
plt.tight_layout()
- Inspect the correlations between attributes
In [13]:
f, ax = plt.subplots(figsize = (15, 15))
sns.heatmap(data.corr(), annot = True, linewidths = 0.5, fmt = '.1f', ax = ax)
Out[13]:
<matplotlib.axes._subplots.AxesSubplot at 0x12eb1ebd0>
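`DataFrame.corr()` computes the Pearson correlation coefficient by default, so each cell of the heatmap is a value between -1 and 1. As a minimal sketch with made-up frequency values (not rows from the dataset), the coefficient for one pair of columns can be computed by hand; `pearson` is a helper name introduced here for illustration.

```python
import numpy as np

def pearson(x, y):
    """Pearson correlation: covariance divided by the product of the spreads."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    xc = x - x.mean()
    yc = y - y.mean()
    return float((xc * yc).sum() / np.sqrt((xc ** 2).sum() * (yc ** 2).sum()))

# Hypothetical values for two frequency columns
purchase = [0.1, 0.4, 0.5, 0.9]
oneoff = [0.0, 0.2, 0.3, 0.6]

r = pearson(purchase, oneoff)

# An exact linear relationship gives a coefficient of 1 (up to rounding)
perfect = pearson(purchase, [2 * v + 1 for v in purchase])
```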
- Standardize the data
In [14]:
from sklearn.preprocessing import StandardScaler
standardscaler = StandardScaler()
data_s = standardscaler.fit_transform(data)
data_s[:2]
Out[14]:
array([[-0.24943448, -0.80649035, -0.67866081, -0.70731317, -0.96043334,
-0.52555097, 0.36067954],
[ 0.13432467, -1.22175806, -0.67866081, -0.91699519, 0.68863903,
0.2342269 , 0.36067954]])
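Standardization matters here because CREDIT_LIMIT is in the thousands while the frequency columns lie between 0 and 1, so unscaled Euclidean distances would be dominated by the credit limit. The sketch below reproduces the z-score transform with NumPy on made-up rows; `standardize` is a hypothetical helper, and it uses the population standard deviation (ddof=0), which is what `StandardScaler` uses.

```python
import numpy as np

def standardize(x):
    """Z-score each column: subtract the column mean, divide by the column std."""
    x = np.asarray(x, dtype=float)
    mean = x.mean(axis=0)
    std = x.std(axis=0)      # population std (ddof=0)
    std[std == 0] = 1.0      # constant columns are only centered, not scaled
    return (x - mean) / std

# Hypothetical rows: (BALANCE_FREQ, CREDIT_LIMIT)
raw = np.array([[0.8, 1000.0],
                [0.9, 7000.0],
                [1.0, 7500.0]])
z = standardize(raw)
```

After the transform every column has mean 0 and standard deviation 1, so each attribute contributes on a comparable scale to the distance computations.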
Cluster analysis - K-means¶
In [15]:
from sklearn.cluster import KMeans
n_clusters = 4
kmeans = KMeans(n_clusters = n_clusters, random_state = 628)
labels = kmeans.fit_predict(data_s)
# plot cluster sizes
plt.hist(labels, bins = range(n_clusters + 1))
plt.title ('Customers per Cluster')
plt.xlabel('Cluster')
plt.ylabel('Customers')
plt.show()
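For readers who want to see what `KMeans` does internally, here is a minimal sketch of Lloyd's algorithm in NumPy. It is not scikit-learn's implementation: it uses plain random initialisation instead of the default k-means++, and `lloyd_kmeans` is a name introduced here for illustration. The data are two synthetic blobs.

```python
import numpy as np

def lloyd_kmeans(x, k, n_iter=100, seed=0):
    """Plain Lloyd iterations; returns (labels, centers)."""
    rng = np.random.default_rng(seed)
    x = np.asarray(x, dtype=float)
    # Initialise the centers with k distinct data points chosen at random
    centers = x[rng.choice(len(x), size=k, replace=False)]
    for _ in range(n_iter):
        # Assignment step: nearest center by squared Euclidean distance
        d2 = ((x[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        # Update step: each center moves to the mean of its assigned points
        new_centers = np.array([
            x[labels == j].mean(axis=0) if np.any(labels == j) else centers[j]
            for j in range(k)
        ])
        if np.allclose(new_centers, centers):
            break  # centers stopped moving
        centers = new_centers
    return labels, centers

# Two well-separated synthetic blobs of 20 points each
rng = np.random.default_rng(1)
pts = np.vstack([rng.normal(0, 0.1, (20, 2)), rng.normal(5, 0.1, (20, 2))])
labels, centers = lloyd_kmeans(pts, k=2)
```

Because the result depends on the starting centers, fixing `random_state` as the notebook does keeps the cluster numbering reproducible between runs.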
- Choose the number of clusters with the elbow method
In [16]:
sse = []
cluster_list = range(1, 16)
for i in cluster_list:
    kmeans = KMeans(n_clusters = i, random_state = 628)
    kmeans.fit(data_s)
    sse.append(kmeans.inertia_)
plt.plot(cluster_list, sse)
plt.title('Elbow Method')
plt.xlabel('Clusters')
plt.ylabel('SSE')
plt.show()
No clear elbow is visible in the curve.
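The quantity on the y-axis, `inertia_`, is the sum of squared distances from each sample to its closest cluster center. A minimal sketch on four synthetic 1-D points shows what a clear elbow would look like; `sse` is a hypothetical helper that measures distances to each cluster's centroid.

```python
import numpy as np

def sse(x, labels):
    """Sum of squared distances from each point to its cluster's centroid."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    total = 0.0
    for j in np.unique(labels):
        members = x[labels == j]
        total += ((members - members.mean(axis=0)) ** 2).sum()
    return float(total)

# Four synthetic 1-D points forming two tight pairs
pts = np.array([[0.0], [1.0], [10.0], [11.0]])
sse_k1 = sse(pts, [0, 0, 0, 0])   # one cluster
sse_k2 = sse(pts, [0, 0, 1, 1])   # the two natural pairs
sse_k4 = sse(pts, [0, 1, 2, 3])   # every point on its own
```

Here the SSE falls from 101 to 1 when going from one cluster to two and barely changes afterwards, which is the kind of sharp bend the elbow method looks for. The credit card data show a gradual decline instead, so another criterion is needed.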
- Choose the number of clusters with the silhouette coefficient
In [17]:
from sklearn.metrics import silhouette_score
s = []
cluster_list = range(2, 10)
for i in cluster_list:
    kmeans = KMeans(n_clusters = i, random_state = 628)
    s.append(silhouette_score(data_s, kmeans.fit_predict(data_s)))
# Plotting a bar graph to compare the results
plt.bar(cluster_list, s)
plt.xlabel('Number of clusters', fontsize = 10)
plt.ylabel('Silhouette Score', fontsize = 10)
plt.show()
The silhouette coefficient is highest, at about 0.33, when K-means groups the customers into 5 clusters.
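For each sample the silhouette coefficient is (b - a) / max(a, b), where a is the mean distance to the other members of its own cluster and b is the smallest mean distance to the members of any other cluster; `silhouette_score` averages this over all samples. The sketch below is a hand-rolled version for Euclidean distances on synthetic points. `silhouette_mean` is a hypothetical helper and assumes at least two clusters.

```python
import numpy as np

def silhouette_mean(x, labels):
    """Mean silhouette coefficient using Euclidean distances."""
    x = np.asarray(x, dtype=float)
    labels = np.asarray(labels)
    dist = np.sqrt(((x[:, None, :] - x[None, :, :]) ** 2).sum(axis=2))
    clusters = np.unique(labels)
    scores = np.zeros(len(x))
    for i in range(len(x)):
        same = labels == labels[i]
        n_same = same.sum()
        if n_same == 1:
            continue  # convention: a singleton cluster scores 0
        # a: mean distance to the other members of the point's own cluster
        a = dist[i, same].sum() / (n_same - 1)
        # b: smallest mean distance to the members of any other cluster
        b = min(dist[i, labels == c].mean() for c in clusters if c != labels[i])
        scores[i] = (b - a) / max(a, b)
    return float(scores.mean())

pts = np.array([[0.0], [1.0], [10.0], [11.0]])
good = silhouette_mean(pts, [0, 0, 1, 1])   # natural grouping
bad = silhouette_mean(pts, [0, 1, 0, 1])    # deliberately mixed grouping
```

The natural grouping scores close to 0.9 and the mixed one is negative. Against that scale, a best score of about 0.33 suggests the customer clusters overlap considerably.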
In [18]:
kmeans = KMeans(n_clusters = 5, random_state = 628)
labels = kmeans.fit_predict(data_s)
data["cluster"] = labels
# plot cluster sizes
plt.hist(labels, bins = range(6))
plt.title ('Customers per Cluster')
plt.xlabel('Cluster')
plt.ylabel('Customers')
plt.show()
Visualizing the cluster centers makes it possible to summarize the characteristics of each customer group.
In [19]:
plt.subplots(figsize = (15, 18))
centers = kmeans.cluster_centers_
idx = np.arange(7)
plt.bar(idx, centers[0], color = 'blue', width = 1/6, tick_label = graph_by_variables)
plt.bar(idx + 1/6, centers[1], color = 'green', width = 1/6)
plt.bar(idx + 2/6, centers[2], color = 'red', width = 1/6)
plt.bar(idx + 3/6, centers[3], color = 'orange', width = 1/6)
plt.bar(idx + 4/6, centers[4], color = 'grey', width = 1/6)
plt.xticks()
plt.show()
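- Cluster 1 (blue bars): relatively high balance update frequency but the lowest purchase frequency, a mid-range credit limit, and the lowest full-payment percentage.
- Cluster 2 (green bars): high balance update frequency and high purchase frequency, with a marked preference for installment purchases.
- Cluster 3 (red bars): high balance update frequency and high purchase frequency, but compared with cluster 2 a stronger preference for one-off purchases.
- Cluster 4 (orange bars): the lowest credit limit and the shortest tenure, with the other frequencies also on the low side.
- Cluster 5 (grey bars): the lowest balance update frequency, low purchase frequencies, a mid-range credit limit, and a relatively high full-payment percentage.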
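Because K-means was fitted on standardized data, the bar heights are z-scores: "high" and "low" in the list above are relative to the overall mean of each attribute. To report the centers in original units, the scaling can be undone. The sketch below shows the arithmetic on made-up numbers; in the notebook the equivalent call would be `standardscaler.inverse_transform(kmeans.cluster_centers_)`.

```python
import numpy as np

# Hypothetical raw data: columns are (PURCHASE_FREQ, CREDIT_LIMIT)
raw = np.array([[0.1, 1000.0],
                [0.2, 2000.0],
                [0.9, 8000.0],
                [0.8, 9000.0]])
mean = raw.mean(axis=0)
std = raw.std(axis=0)
scaled = (raw - mean) / std

# Pretend a clustering put the first two rows in cluster 0 and the rest in cluster 1
labels = np.array([0, 0, 1, 1])
centers_scaled = np.array([scaled[labels == j].mean(axis=0) for j in (0, 1)])

# Undo the z-scoring to read the centers in the original units
centers_original = centers_scaled * std + mean
```

The recovered centers equal the per-cluster means of the raw columns, which are easier to communicate to a business audience than z-scores.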
Cluster analysis - Hierarchical clustering¶
- By sklearn.cluster
In [14]:
from sklearn.cluster import AgglomerativeClustering
s = []
cluster_list = range(2, 10)
for i in cluster_list:
    hc = AgglomerativeClustering(n_clusters = i)
    s.append(silhouette_score(data_s, hc.fit_predict(data_s)))
# Plotting a bar graph to compare the results
plt.bar(cluster_list, s)
plt.xlabel('Number of Clusters', fontsize = 10)
plt.ylabel('Silhouette Score', fontsize = 10)
plt.show()
The silhouette coefficient is highest, at about 0.28, when agglomerative clustering groups the customers into 4 clusters.
- By scipy.cluster
In [16]:
import scipy.cluster.hierarchy as sch
plt.figure(figsize = (10, 6))
linked = sch.linkage(data_s, method = 'ward')
dendrogram = sch.dendrogram(linked)
plt.title('Dendrogram')
plt.xlabel('Customers')
plt.ylabel('Euclidean Distances')
plt.show()
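Both hierarchical runs use Ward linkage: it is passed explicitly to `sch.linkage`, and it is the default `linkage` of `AgglomerativeClustering`. Ward's criterion merges, at each step, the pair of clusters whose union increases the total within-cluster sum of squares the least. The sketch below checks on synthetic points that this increase equals |A||B| / (|A| + |B|) times the squared distance between the two centroids; `sse` and `ward_cost` are hypothetical helpers.

```python
import numpy as np

def sse(points):
    """Within-cluster sum of squared distances to the centroid."""
    points = np.asarray(points, dtype=float)
    return float(((points - points.mean(axis=0)) ** 2).sum())

def ward_cost(a, b):
    """Increase in total SSE caused by merging clusters a and b."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    gap = a.mean(axis=0) - b.mean(axis=0)
    return float(len(a) * len(b) / (len(a) + len(b)) * (gap ** 2).sum())

# Two synthetic clusters
a = np.array([[0.0, 0.0], [2.0, 0.0]])
b = np.array([[10.0, 0.0], [12.0, 0.0], [11.0, 3.0]])

# SSE of the merged cluster minus the SSE of the two parts
direct = sse(np.vstack([a, b])) - sse(a) - sse(b)
# The same quantity from sizes and centroids only
shortcut = ward_cost(a, b)
```

Since Ward linkage minimizes the same within-cluster variance that K-means targets, comparing the two methods with one silhouette score is a like-for-like comparison.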
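Judged by the silhouette coefficient as an objective criterion, K-means outperforms agglomerative clustering here, with its best score at K=5. The method can also be chosen subjectively, based on what makes sense for the business context of the clustering task.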