Exploratory Data Analysis and Preprocessing of the Titanic Dataset¶
1. Dataset Overview¶
The Titanic dataset comes from the introductory Kaggle competition Titanic: Machine Learning from Disaster. It records the survival outcomes of passengers and crew of different ages, sexes, and social classes when the Titanic struck an iceberg on her maiden voyage and sank in the North Atlantic. The data consists of two files: a training set (train.csv) and a test set (test.csv). After suitable exploratory analysis and preprocessing, the data can be used to predict passenger survival. The fields and their meanings are:
- PassengerId: Unique passenger ID
- Survived: Survival status: 0 = No, 1 = Yes
- Pclass: Ticket class: 1 = 1st (Upper), 2 = 2nd (Middle), 3 = 3rd (Lower)
- Name: e.g., "Braund, Mr. Owen Harris"
- Sex: "female" or "male"
- Age: Age in years
- SibSp: # of siblings / spouses aboard the Titanic
- Parch: # of parents / children aboard the Titanic
- Ticket: Ticket number
- Fare: Passenger fare
- Cabin: Cabin number
- Embarked: Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton
In [1]:
import numpy as np
import pandas as pd
train = pd.read_csv('./titanic/train.csv')
test = pd.read_csv('./titanic/test.csv')
print('Training set:', train.shape, 'Test set:', test.shape)
Training set: (891, 12) Test set: (418, 11)
2. Data Exploration and Preprocessing¶
2.1 Merging the Datasets¶
Merge the training and test sets so that data cleaning can be applied to both at once.
In [2]:
data = pd.concat([train, test], ignore_index=True)
print("Merged dataset:", data.shape)
Merged dataset: (1309, 12)
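One caveat when concatenating is recovering the original train/test split later. Because only training rows carry a Survived label, the split can be read back off the merged frame; a minimal sketch with toy frames (column names illustrative):

```python
import pandas as pd

# Toy stand-ins: train has labels, test does not
train_toy = pd.DataFrame({'PassengerId': [1, 2], 'Survived': [0, 1]})
test_toy = pd.DataFrame({'PassengerId': [3, 4]})  # no Survived column

merged = pd.concat([train_toy, test_toy], ignore_index=True)

# Test rows get NaN in Survived, so the split is recoverable
train_part = merged[merged['Survived'].notna()]
test_part = merged[merged['Survived'].isna()]
print(len(train_part), len(test_part))  # 2 2
```

This is why the `Survived 418` missing count seen later is expected rather than a data-quality problem.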
2.2 Inspecting the Data¶
Check that the data was imported correctly.
In [3]:
# First 5 rows
data.head()
Out[3]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | male | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | female | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | female | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | female | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | male | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
In [4]:
# Last 5 rows
data.tail()
Out[4]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1304 | 1305 | NaN | 3 | Spector, Mr. Woolf | male | NaN | 0 | 0 | A.5. 3236 | 8.0500 | NaN | S |
| 1305 | 1306 | NaN | 1 | Oliva y Ocana, Dona. Fermina | female | 39.0 | 0 | 0 | PC 17758 | 108.9000 | C105 | C |
| 1306 | 1307 | NaN | 3 | Saether, Mr. Simon Sivertsen | male | 38.5 | 0 | 0 | SOTON/O.Q. 3101262 | 7.2500 | NaN | S |
| 1307 | 1308 | NaN | 3 | Ware, Mr. Frederick | male | NaN | 0 | 0 | 359309 | 8.0500 | NaN | S |
| 1308 | 1309 | NaN | 3 | Peter, Master. Michael J | male | NaN | 1 | 1 | 2668 | 22.3583 | NaN | C |
In [5]:
# Dataset dimensions
data.shape
Out[5]:
(1309, 12)
In [6]:
# Overall view of the dataset
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
In [7]:
# Summary statistics
data.describe()
Out[7]:
| | PassengerId | Survived | Pclass | Age | SibSp | Parch | Fare |
|---|---|---|---|---|---|---|---|
| count | 1309.000000 | 891.000000 | 1309.000000 | 1046.000000 | 1309.000000 | 1309.000000 | 1308.000000 |
| mean | 655.000000 | 0.383838 | 2.294882 | 29.881138 | 0.498854 | 0.385027 | 33.295479 |
| std | 378.020061 | 0.486592 | 0.837836 | 14.413493 | 1.041658 | 0.865560 | 51.758668 |
| min | 1.000000 | 0.000000 | 1.000000 | 0.170000 | 0.000000 | 0.000000 | 0.000000 |
| 25% | 328.000000 | 0.000000 | 2.000000 | 21.000000 | 0.000000 | 0.000000 | 7.895800 |
| 50% | 655.000000 | 0.000000 | 3.000000 | 28.000000 | 0.000000 | 0.000000 | 14.454200 |
| 75% | 982.000000 | 1.000000 | 3.000000 | 39.000000 | 1.000000 | 0.000000 | 31.275000 |
| max | 1309.000000 | 1.000000 | 3.000000 | 80.000000 | 8.000000 | 9.000000 | 512.329200 |
In [8]:
# Missing values per column
data.isnull().sum()
Out[8]:
PassengerId       0
Survived        418
Pclass            0
Name              0
Sex               0
Age             263
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1014
Embarked          2
dtype: int64
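The raw counts above are easier to reason about as missing *rates*; a minimal sketch (the 70% drop threshold is an illustrative choice, not from the original notebook):

```python
import pandas as pd
import numpy as np

# Toy frame mimicking the missingness pattern: Cabin mostly missing, Age partly
df = pd.DataFrame({
    'Age':   [22.0, np.nan, 26.0, np.nan],
    'Cabin': [np.nan, 'C85', np.nan, np.nan],
    'Fare':  [7.25, 71.28, 7.92, 53.10],
})

# isnull().mean() gives the fraction of missing rows per column
miss_rate = df.isnull().mean().sort_values(ascending=False)
print(miss_rate)

# Flag columns above a chosen threshold as candidates for removal
to_drop = miss_rate[miss_rate > 0.7].index.tolist()
print(to_drop)  # ['Cabin']
```

On the real data, Cabin is missing in 1014/1309 ≈ 77% of rows, which is why dropping it is considered below.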
Findings: 1) the missing Survived values all come from the test set; 2) Age and Cabin have substantial missingness: Age is a candidate for imputation, while Cabin is missing so often that dropping it is worth considering.
2.3 Outlier Inspection¶
In [9]:
# Boxplot of Age by Survived to visualize its distribution and spot outliers
import seaborn as sns
sns.boxplot(x='Survived', y='Age', data=data)
Out[9]:
<Axes: xlabel='Survived', ylabel='Age'>
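The points the boxplot flags as outliers follow Tukey's 1.5×IQR whisker rule; a self-contained sketch of that rule on a toy age sample:

```python
import numpy as np

ages = np.array([22, 26, 28, 30, 35, 38, 40, 80])  # toy sample with one extreme age

# Tukey's rule: points beyond Q1 - 1.5*IQR or Q3 + 1.5*IQR are outliers
q1, q3 = np.percentile(ages, [25, 75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

outliers = ages[(ages < lower) | (ages > upper)]
print(outliers)  # [80]
```

Note that high ages flagged this way are statistically extreme but not necessarily data errors, so they are kept here.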
Findings: 1) the average passenger age is around 30; 2) there are outlying points (elderly passengers); 3) on average, survivors were somewhat younger.
2.4 Missing-Value Imputation¶
In [10]:
# Mean imputation
# Age
print('Mean age:', data['Age'].mean())
data['Age'] = data['Age'].fillna(data['Age'].mean())
# Fare
print('Mean fare:', data['Fare'].mean())
data['Fare'] = data['Fare'].fillna(data['Fare'].mean())
Mean age: 29.881137667304014
Mean fare: 33.29547928134557
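Filling with a single global mean ignores structure in the data; a commonly used refinement (a sketch, not part of the original notebook) imputes Age with the median of each ticket-class group:

```python
import pandas as pd
import numpy as np

# Toy frame: first-class passengers skew older than third-class
df = pd.DataFrame({
    'Pclass': [1, 1, 1, 3, 3, 3],
    'Age':    [38.0, 40.0, np.nan, 22.0, 24.0, np.nan],
})

# transform('median') broadcasts each group's median back to its rows
df['Age'] = df['Age'].fillna(df.groupby('Pclass')['Age'].transform('median'))
print(df['Age'].tolist())  # [38.0, 40.0, 39.0, 22.0, 24.0, 23.0]
```

The same pattern works with the Title feature engineered later, which separates children (Master) from adults.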
In [11]:
# Mode imputation
# Embarked
print(data['Embarked'].value_counts())
data['Embarked'] = data['Embarked'].fillna('S')
Embarked
S    914
C    270
Q    123
Name: count, dtype: int64
- KNN-based imputation of missing values
The sklearn.impute.KNNImputer class imputes missing values by k-nearest-neighbor inference. The basic idea:
1) find the k nearest neighbors using the features that are not missing; 2) infer the missing value from those neighbors, e.g., by a (weighted) average.
Note: this method only handles numeric attributes.
In [12]:
from sklearn.impute import KNNImputer
# Select the numeric columns and convert to a NumPy array
n_train = train[['Age', 'SibSp', 'Parch', 'Fare']].to_numpy()
imputer = KNNImputer(n_neighbors=2, weights="uniform")
print('Missing values before:', np.isnan(n_train).sum())
n_train_impute = imputer.fit_transform(n_train)
print('Missing values after:', np.isnan(n_train_impute).sum())
print('Column means after imputation:', np.mean(n_train_impute, axis=0))
Missing values before: 177
Missing values after: 0
Column means after imputation: [30.38679012  0.52300786  0.38159371 32.20420797]
In [13]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1309 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1309 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1309 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
- For other missing-value handling methods, see the link.
2.5 Encoding Categorical Attributes¶
- Encode Sex as 0/1 with an explicit mapping
In [14]:
# Sex
data['Sex'].head()
Out[14]:
0      male
1    female
2    female
3    female
4      male
Name: Sex, dtype: object
In [15]:
# Sex
Sex_map = {
'female': 0,
'male': 1
}
data['Sex'] = data['Sex'].map(Sex_map)
data.head()
Out[15]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S |
| 1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C |
| 2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S |
| 3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S |
| 4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S |
- Encode Embarked (port of embarkation) as one-hot indicators
In [16]:
data = data.join(pd.get_dummies(data['Embarked'], prefix = 'Embarked'))
data.head()
Out[16]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Embarked_C | Embarked_Q | Embarked_S |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | False | False | True |
| 1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | True | False | False |
| 2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | False | False | True |
| 3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | False | False | True |
| 4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | False | False | True |
- Encode Pclass (ticket class) as one-hot indicators
In [17]:
data = data.join(pd.get_dummies(data['Pclass'], prefix = 'Pclass'))
data.head()
Out[17]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Embarked_C | Embarked_Q | Embarked_S | Pclass_1 | Pclass_2 | Pclass_3 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | False | False | True | False | False | True |
| 1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | True | False | False | True | False | False |
| 2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | False | False | True | False | False | True |
| 3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | False | False | True | True | False | False |
| 4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | False | False | True | False | False | True |
- Extract the passenger's title from Name
In [18]:
def get_title(name):
    # e.g. "Braund, Mr. Owen Harris" -> "Mr": take the token between ',' and the first '.'
    title = name.split(',')[1].split('.')[0]
    return title.strip()
data['Title'] = data['Name'].map(get_title)
data.head()
Out[18]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | Cabin | Embarked | Embarked_C | Embarked_Q | Embarked_S | Pclass_1 | Pclass_2 | Pclass_3 | Title |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | NaN | S | False | False | True | False | False | True | Mr |
| 1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | C85 | C | True | False | False | True | False | False | Mrs |
| 2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | NaN | S | False | False | True | False | False | True | Miss |
| 3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | C123 | S | False | False | True | True | False | False | Mrs |
| 4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | NaN | S | False | False | True | False | False | True | Mr |
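The same title extraction can also be written as a single vectorized regex with pandas' str.extract, equivalent to the split-based helper above; a minimal sketch:

```python
import pandas as pd

names = pd.Series([
    'Braund, Mr. Owen Harris',
    'Heikkinen, Miss. Laina',
    'Oliva y Ocana, Dona. Fermina',
])

# Capture the token between the comma and the first period
titles = names.str.extract(r',\s*([^.]+)\.', expand=False)
print(titles.tolist())  # ['Mr', 'Miss', 'Dona']
```

The vectorized form avoids a Python-level function call per row, which matters on larger frames.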
In [19]:
data['Title'].value_counts()
Out[19]:
Title
Mr              757
Miss            260
Mrs             197
Master           61
Rev               8
Dr                8
Col               4
Mlle              2
Major             2
Ms                2
Lady              1
Sir               1
Mme               1
Don               1
Capt              1
the Countess      1
Jonkheer          1
Dona              1
Name: count, dtype: int64
Some of the titles are rare, so we consolidate them into six classes: Officer, Royalty, Mrs, Miss, Mr, and Master.
In [20]:
Title_map = {
'Mr': 'Mr',
'Miss': 'Miss',
'Mrs': 'Mrs',
'Master': 'Master',
'Rev': 'Officer',
'Dr': 'Officer',
'Col': 'Officer',
'Ms': 'Mrs',
'Mlle': 'Miss',
'Major': 'Officer',
'Dona': 'Royalty',
'Sir': 'Royalty',
'Capt': 'Officer',
'the Countess': 'Royalty',
'Don': 'Royalty',
'Lady': 'Royalty',
'Mme': 'Mrs',
'Jonkheer': 'Royalty'
}
data['Title'] = data['Title'].map(Title_map)
data['Title'].value_counts()
data = data.join(pd.get_dummies(data['Title'], prefix = 'Title'))
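One pitfall with Series.map: any key absent from the dictionary silently becomes NaN, so it is worth checking coverage after consolidating. A minimal sketch (the small map here is illustrative):

```python
import pandas as pd

# Illustrative subset of the consolidation map
title_map = {'Mr': 'Mr', 'Mlle': 'Miss', 'Capt': 'Officer'}
titles = pd.Series(['Mr', 'Mlle', 'Capt', 'Mr'])

mapped = titles.map(title_map)
unmapped = mapped.isna().sum()  # nonzero would mean a title fell through the map
print(unmapped)  # 0
```

On new data (e.g. a refreshed test set), an unseen title would surface here as a nonzero count instead of propagating NaN into the dummies.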
- Compute family size from SibSp and Parch
In [21]:
data['Family'] = data['SibSp'] + data['Parch'] + 1
data['FamilySingle'] = data['Family'].map(lambda a:1 if a == 1 else 0)
data['FamilySmall'] = data['Family'].map(lambda a:1 if 2 <= a <= 4 else 0)
data['FamilyLarge'] = data['Family'].map(lambda a:1 if 5 <= a else 0)
data.head()
Out[21]:
| | PassengerId | Survived | Pclass | Name | Sex | Age | SibSp | Parch | Ticket | Fare | ... | Title_Master | Title_Miss | Title_Mr | Title_Mrs | Title_Officer | Title_Royalty | Family | FamilySingle | FamilySmall | FamilyLarge |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 0.0 | 3 | Braund, Mr. Owen Harris | 1 | 22.0 | 1 | 0 | A/5 21171 | 7.2500 | ... | False | False | True | False | False | False | 2 | 0 | 1 | 0 |
| 1 | 2 | 1.0 | 1 | Cumings, Mrs. John Bradley (Florence Briggs Th... | 0 | 38.0 | 1 | 0 | PC 17599 | 71.2833 | ... | False | False | False | True | False | False | 2 | 0 | 1 | 0 |
| 2 | 3 | 1.0 | 3 | Heikkinen, Miss. Laina | 0 | 26.0 | 0 | 0 | STON/O2. 3101282 | 7.9250 | ... | False | True | False | False | False | False | 1 | 1 | 0 | 0 |
| 3 | 4 | 1.0 | 1 | Futrelle, Mrs. Jacques Heath (Lily May Peel) | 0 | 35.0 | 1 | 0 | 113803 | 53.1000 | ... | False | False | False | True | False | False | 2 | 0 | 1 | 0 |
| 4 | 5 | 0.0 | 3 | Allen, Mr. William Henry | 1 | 35.0 | 0 | 0 | 373450 | 8.0500 | ... | False | False | True | False | False | False | 1 | 1 | 0 | 0 |
5 rows × 29 columns
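The three indicator columns above encode a binning of Family into {1}, [2, 4], and [5, ∞); the same bins can be expressed more compactly with pd.cut (a sketch; the labels are illustrative names):

```python
import pandas as pd

family = pd.Series([1, 2, 4, 5, 11])

# Right-closed bins: (0,1] -> Single, (1,4] -> Small, (4,inf) -> Large
size_cat = pd.cut(family, bins=[0, 1, 4, float('inf')],
                  labels=['Single', 'Small', 'Large'])
print(size_cat.tolist())  # ['Single', 'Small', 'Small', 'Large', 'Large']
```

pd.cut yields one categorical column instead of three indicators, which get_dummies can then expand if a one-hot form is needed.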
2.6 Data Standardization¶
Data standardization can be done with the sklearn.preprocessing package: StandardScaler performs z-score (normal) standardization and MinMaxScaler performs min-max scaling.
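The two transforms are z = (x − μ)/σ and x′ = (x − min)/(max − min); a minimal NumPy check that the manual formulas match sklearn's scalers (assumes scikit-learn is installed):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler, MinMaxScaler

x = np.array([[1.0], [2.0], [3.0], [4.0]])

# z-score: subtract the mean, divide by the (population) std, as sklearn does
z_manual = (x - x.mean()) / x.std()
z_sklearn = StandardScaler().fit_transform(x)
assert np.allclose(z_manual, z_sklearn)

# min-max: rescale the range to [0, 1]
m_manual = (x - x.min()) / (x.max() - x.min())
m_sklearn = MinMaxScaler().fit_transform(x)
assert np.allclose(m_manual, m_sklearn)
print("both match")
```

Note StandardScaler divides by the population standard deviation (ddof=0), which is why `x.std()` is used rather than `x.std(ddof=1)`.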
In [22]:
# Select the variables whose range is not [0, 1]
print(data.describe())
data_r = data[['Age','SibSp','Parch','Fare','Family']]
PassengerId Survived Pclass Sex Age \
count 1309.000000 891.000000 1309.000000 1309.000000 1309.000000
mean 655.000000 0.383838 2.294882 0.644003 29.881138
std 378.020061 0.486592 0.837836 0.478997 12.883193
min 1.000000 0.000000 1.000000 0.000000 0.170000
25% 328.000000 0.000000 2.000000 0.000000 22.000000
50% 655.000000 0.000000 3.000000 1.000000 29.881138
75% 982.000000 1.000000 3.000000 1.000000 35.000000
max 1309.000000 1.000000 3.000000 1.000000 80.000000
SibSp Parch Fare Family FamilySingle \
count 1309.000000 1309.000000 1309.000000 1309.000000 1309.000000
mean 0.498854 0.385027 33.295479 1.883881 0.603514
std 1.041658 0.865560 51.738879 1.583639 0.489354
min 0.000000 0.000000 0.000000 1.000000 0.000000
25% 0.000000 0.000000 7.895800 1.000000 0.000000
50% 0.000000 0.000000 14.454200 1.000000 1.000000
75% 1.000000 0.000000 31.275000 2.000000 1.000000
max 8.000000 9.000000 512.329200 11.000000 1.000000
FamilySmall FamilyLarge
count 1309.000000 1309.000000
mean 0.333843 0.062643
std 0.471765 0.242413
min 0.000000 0.000000
25% 0.000000 0.000000
50% 0.000000 0.000000
75% 1.000000 0.000000
max 1.000000 1.000000
In [23]:
# Z-score standardization
from sklearn import preprocessing
z_scaler = preprocessing.StandardScaler().fit(data_r)
print('mean:', z_scaler.mean_)
print('std:', z_scaler.scale_)
print(z_scaler.transform(data_r))
data_r_std = z_scaler.transform(data_r)
# Mean and variance after standardization
print('Mean after standardization:', data_r_std.mean(axis=0))
print('Variance after standardization:', data_r_std.var(axis=0))
mean: [29.88113767  0.49885409  0.38502674 33.29547928  1.88388083]
std: [12.8782713   1.04126043  0.86522959 51.71911251  1.58303407]
[[-0.61197171  0.48128777 -0.4449995  -0.50359486  0.07335229]
 [ 0.63043107  0.48128777 -0.4449995   0.73450256  0.07335229]
 [-0.30137101 -0.47908676 -0.4449995  -0.49054359 -0.55834605]
 ...
 [ 0.66925616 -0.47908676 -0.4449995  -0.50359486 -0.55834605]
 [ 0.         -0.47908676 -0.4449995  -0.48812669 -0.55834605]
 [ 0.          0.48128777  0.71076309 -0.21147268  0.70505064]]
Mean after standardization: [ 1.03473804e-16 -1.62844019e-17  1.73021770e-17  2.44266028e-17  1.62844019e-17]
Variance after standardization: [1. 1. 1. 1. 1.]
In [24]:
# Min-max standardization
m_scaler = preprocessing.MinMaxScaler().fit(data_r)
print(m_scaler.transform(data_r))
[[0.27345609 0.125      0.         0.01415106 0.1       ]
 [0.473882   0.125      0.         0.13913574 0.1       ]
 [0.32356257 0.         0.         0.01546857 0.        ]
 ...
 [0.48014531 0.         0.         0.01415106 0.        ]
 [0.3721801  0.         0.         0.01571255 0.        ]
 [0.3721801  0.125      0.11111111 0.0436405  0.2       ]]
2.7 PCA Dimensionality Reduction (Worked Example)¶
In [25]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
In [26]:
# 1) Generate synthetic "e-commerce customer behavior" data (N=300)
rng = np.random.default_rng(42)
N = 300
segments = rng.choice(["High-Value", "Bargain-Hunter", "Low-Engage"], size=N, p=[0.35, 0.4, 0.25])
def gen_by_seg(seg):
if seg == "High-Value":
order_cnt = rng.normal(18, 4)
aov = rng.normal(420, 60) # avg order value
ret_rate = np.clip(rng.normal(0.05, 0.02), 0, 1)
browse_min = rng.normal(12, 4)
disc_ratio = np.clip(rng.normal(0.08, 0.05), 0, 1)
tickets = rng.poisson(0.6)
elif seg == "Bargain-Hunter":
order_cnt = rng.normal(12, 3)
aov = rng.normal(180, 40)
ret_rate = np.clip(rng.normal(0.10, 0.04), 0, 1)
browse_min = rng.normal(26, 6)
disc_ratio = np.clip(rng.normal(0.32, 0.08), 0, 1)
tickets = rng.poisson(1.2)
else: # Low-Engage
order_cnt = rng.normal(5, 2)
aov = rng.normal(120, 30)
ret_rate = np.clip(rng.normal(0.04, 0.02), 0, 1)
browse_min = rng.normal(8, 3)
disc_ratio = np.clip(rng.normal(0.12, 0.06), 0, 1)
tickets = rng.poisson(0.3)
return order_cnt, aov, ret_rate, browse_min, disc_ratio, tickets
arr = np.array([gen_by_seg(s) for s in segments])  # new name, so the Titanic `data` frame is not overwritten
cols = ["order_count","avg_order_value","return_rate","browse_minutes","discount_ratio","support_tickets"]
df = pd.DataFrame(arr, columns=cols)
df["segment_true"] = segments  # usually unknown in real applications
df.head(10)
Out[26]:
| | order_count | avg_order_value | return_rate | browse_minutes | discount_ratio | support_tickets | segment_true |
|---|---|---|---|---|---|---|---|
| 0 | 4.415658 | 116.895352 | 0.034960 | 8.457688 | 0.208290 | 0.0 | Low-Engage |
| 1 | 11.289449 | 187.060497 | 0.111840 | 23.768513 | 0.179462 | 1.0 | Bargain-Hunter |
| 2 | 1.932277 | 145.914840 | 0.033429 | 7.816027 | 0.056826 | 0.0 | Low-Engage |
| 3 | 15.900134 | 203.306210 | 0.169292 | 33.064471 | 0.355127 | 3.0 | Bargain-Hunter |
| 4 | 18.266183 | 378.154571 | 0.069792 | 7.286786 | 0.119118 | 1.0 | High-Value |
| 5 | 7.342494 | 142.526070 | 0.076413 | 10.192324 | 0.025678 | 0.0 | Low-Engage |
| 6 | 2.655986 | 104.451605 | 0.070225 | 9.912601 | 0.078064 | 1.0 | Low-Engage |
| 7 | 2.566880 | 99.865792 | 0.046240 | 11.465936 | 0.156526 | 0.0 | Low-Engage |
| 8 | 19.217467 | 424.322014 | 0.058278 | 18.464839 | 0.000000 | 1.0 | High-Value |
| 9 | 7.255217 | 239.037962 | 0.114734 | 31.079504 | 0.274325 | 1.0 | Bargain-Hunter |
In [27]:
# 2) Standardize (PCA is sensitive to feature scales)
X = df[cols].values
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
In [28]:
# 3) First fit all components to inspect the explained variance ratios and their cumulative sum
pca_full = PCA(n_components=len(cols), random_state=42)
pca_full.fit(X_std)
explained = pca_full.explained_variance_ratio_
cum_explained = np.cumsum(explained)
print("Explained variance ratio by component:", np.round(explained, 3))
print("Cumulative explained variance:", np.round(cum_explained, 3))
Explained variance ratio by component: [0.408 0.281 0.153 0.08 0.042 0.036] Cumulative explained variance: [0.408 0.689 0.841 0.921 0.964 1. ]
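Given the cumulative ratios, the number of components needed to reach a variance target can be read off with argmax; a sketch using the printed values above as a toy array (the 90% target is an illustrative choice):

```python
import numpy as np

cum_explained = np.array([0.408, 0.689, 0.841, 0.921, 0.964, 1.0])

# argmax returns the first index where the condition holds; +1 converts to a count
k = int(np.argmax(cum_explained >= 0.90)) + 1
print(k)  # 4
```

Equivalently, `PCA(n_components=0.90)` tells scikit-learn to pick this k automatically during fitting.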
In [29]:
# 4) Scree plot
plt.figure()
plt.plot(range(1, len(cols)+1), explained, marker='o')
plt.xlabel("Principal Component")
plt.ylabel("Explained Variance Ratio")
plt.title("Scree Plot")
plt.grid(True)
plt.show()
In [30]:
# 5) Keep the first two principal components for the 2-D projection
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_std)
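Interpreting PC1 and PC2 comes from the loadings in the fitted `components_` matrix, whose rows are components and columns the original features; a self-contained sketch on toy data:

```python
import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 3))
X[:, 2] = 2 * X[:, 0] + rng.normal(scale=0.1, size=100)  # feature 2 tracks feature 0

X_std = StandardScaler().fit_transform(X)
pca = PCA(n_components=2).fit(X_std)

# Each row is a unit-length direction in feature space
print(np.round(pca.components_, 2))
print(np.round(pca.explained_variance_ratio_, 2))
```

Features with large-magnitude loadings on a component dominate it; here the correlated pair loads heavily on the first component.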
In [31]:
# 6) 2-D scatter of the principal components
# Color map: one color per customer segment
color_map = {
    "High-Value":     "#1f77b4",  # high-value customers -> blue
    "Bargain-Hunter": "#ff7f0e",  # price-sensitive customers -> orange
    "Low-Engage":     "#2ca02c"   # low-engagement customers -> green
}
# New figure, so this plot does not overlap the previous one
plt.figure()
# Plot each segment with its own color
for seg, color in color_map.items():
    # Boolean mask selecting the rows of this segment
    mask = (df["segment_true"] == seg)
    # Scatter this segment in the PC1-PC2 plane
    plt.scatter(
        X_pca[mask, 0],  # x axis: PC1 score
        X_pca[mask, 1],  # y axis: PC2 score
        s=25,            # marker size
        alpha=0.8,       # transparency, helps with overplotting
        label=seg,       # legend label
        c=color          # marker color
    )
plt.xlabel("PC1")
plt.ylabel("PC2")
plt.title("Customers projected onto PC1–PC2 space")
plt.legend(title="True Segment")
plt.grid(True)
plt.show()
In [32]:
# 7) Join the PC scores back onto the table for downstream analysis
out = df.copy()
out["PC1"] = X_pca[:, 0]
out["PC2"] = X_pca[:, 1]
print("\n Head of output (with PCs):")
print(out.head(10))
 Head of output (with PCs):
   order_count  avg_order_value  return_rate  browse_minutes  discount_ratio  \
0     4.415658       116.895352     0.034960        8.457688        0.208290
1    11.289449       187.060497     0.111840       23.768513        0.179462
2     1.932277       145.914840     0.033429        7.816027        0.056826
3    15.900134       203.306210     0.169292       33.064471        0.355127
4    18.266183       378.154571     0.069792        7.286786        0.119118
5     7.342494       142.526070     0.076413       10.192324        0.025678
6     2.655986       104.451605     0.070225        9.912601        0.078064
7     2.566880        99.865792     0.046240       11.465936        0.156526
8    19.217467       424.322014     0.058278       18.464839        0.000000
9     7.255217       239.037962     0.114734       31.079504        0.274325

   support_tickets    segment_true       PC1       PC2
0              0.0      Low-Engage -0.188869 -2.109725
1              1.0  Bargain-Hunter  1.324149  0.305075
2              0.0      Low-Engage -0.935139 -2.344901
3              3.0  Bargain-Hunter  3.602360  2.189832
4              1.0      High-Value -1.177862  0.936413
5              0.0      Low-Engage -0.568071 -1.353950
6              1.0      Low-Engage  0.037546 -1.827962
7              0.0      Low-Engage -0.003813 -2.206281
8              1.0      High-Value -1.365332  1.500631
9              1.0  Bargain-Hunter  2.204198  0.379637
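How much information the 2-component projection discards can be quantified by inverse-transforming back to feature space and measuring reconstruction error; a hedged sketch on toy data:

```python
import numpy as np
from sklearn.decomposition import PCA

rng = np.random.default_rng(1)
X = rng.normal(size=(50, 4))

pca = PCA(n_components=2).fit(X)
X_proj = pca.transform(X)
X_back = pca.inverse_transform(X_proj)  # lies in the 2-D principal subspace

mse = float(np.mean((X - X_back) ** 2))       # per-entry reconstruction error
kept = float(pca.explained_variance_ratio_.sum())  # variance retained by 2 PCs
print(round(kept, 3), round(mse, 4))
```

For the customer data above, the first two components retain about 69% of the variance, so the 2-D scatter is a lossy but informative summary.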
In [ ]: