针对Titanic数据集的数据探索分析与预处理¶

1. 数据集介绍¶

Titanic数据集来自Kaggle竞赛平台的入门项目Titanic: Machine Learning from Disaster,记录了泰坦尼克号处女航撞上冰山、沉没于北大西洋时,不同年龄、性别和社会地位的乘客及船员的生存情况。数据共包含两个文件:训练数据(train.csv)和测试数据(test.csv)。经过适当的探索分析和预处理后,可用于泰坦尼克号乘客生存预测。数据字段及具体含义如下:

  • PassengerId: ID
  • Survived: Survival status: 0 = No, 1 = Yes
  • Pclass: Ticket class: 1 = 1st (Upper), 2 = 2nd (Middle), 3 = 3rd (Lower)
  • Name: e.g., "Braund, Mr. Owen Harris"
  • Sex: "female" or "male"
  • Age: Age in years
  • SibSp: # of siblings / spouses aboard the Titanic
  • Parch: # of parents / children aboard the Titanic
  • Ticket: Ticket number
  • Fare: Passenger fare
  • Cabin: Cabin number
  • Embarked: Port of Embarkation: C = Cherbourg, Q = Queenstown, S = Southampton

2. 数据探索分析与预处理¶

2.1 导入数据¶

In [1]:
import numpy as np
import pandas as pd
train = pd.read_csv('./titanic/train.csv')
test = pd.read_csv('./titanic/test.csv')
print('训练数据集: ', train.shape, '测试数据集: ', test.shape)
训练数据集:  (891, 12) 测试数据集:  (418, 11)

合并数据,方便统一进行数据清洗

In [2]:
data = pd.concat([train, test], ignore_index=True)
print("合并后数据集: ", data.shape)
合并后数据集:  (1309, 12)
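合并只是为了统一清洗;清洗完成后可按 Survived 是否缺失随时拆回训练集与测试集。下面用构造的小表给出一个最小示意(非原数据):

```python
import numpy as np
import pandas as pd

# 示意:模拟合并后的 data,测试集部分的 Survived 为 NaN
data_demo = pd.DataFrame({
    'PassengerId': [1, 2, 3, 4],
    'Survived': [0.0, 1.0, np.nan, np.nan],
})

# 训练集:Survived 非缺失;测试集:Survived 缺失(并删去该列)
train_clean = data_demo[data_demo['Survived'].notnull()].copy()
test_clean = data_demo[data_demo['Survived'].isnull()].drop(columns=['Survived'])
print(train_clean.shape, test_clean.shape)  # (2, 2) (2, 1)
```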

2.2 查看数据¶

查看数据导入情况

In [3]:
# 查看前5行
data.head()
Out[3]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0.0 3 Braund, Mr. Owen Harris male 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... female 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1.0 3 Heikkinen, Miss. Laina female 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) female 35.0 1 0 113803 53.1000 C123 S
4 5 0.0 3 Allen, Mr. William Henry male 35.0 0 0 373450 8.0500 NaN S
In [4]:
# 查看后5行
data.tail()
Out[4]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
1304 1305 NaN 3 Spector, Mr. Woolf male NaN 0 0 A.5. 3236 8.0500 NaN S
1305 1306 NaN 1 Oliva y Ocana, Dona. Fermina female 39.0 0 0 PC 17758 108.9000 C105 C
1306 1307 NaN 3 Saether, Mr. Simon Sivertsen male 38.5 0 0 SOTON/O.Q. 3101262 7.2500 NaN S
1307 1308 NaN 3 Ware, Mr. Frederick male NaN 0 0 359309 8.0500 NaN S
1308 1309 NaN 3 Peter, Master. Michael J male NaN 1 1 2668 22.3583 NaN C
In [5]:
#查看数据维度
data.shape
Out[5]:
(1309, 12)
In [6]:
# 浏览数据集整体情况
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1046 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1308 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1307 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
In [7]:
# 查看数据集统计信息
data.describe()
Out[7]:
PassengerId Survived Pclass Age SibSp Parch Fare
count 1309.000000 891.000000 1309.000000 1046.000000 1309.000000 1309.000000 1308.000000
mean 655.000000 0.383838 2.294882 29.881138 0.498854 0.385027 33.295479
std 378.020061 0.486592 0.837836 14.413493 1.041658 0.865560 51.758668
min 1.000000 0.000000 1.000000 0.170000 0.000000 0.000000 0.000000
25% 328.000000 0.000000 2.000000 21.000000 0.000000 0.000000 7.895800
50% 655.000000 0.000000 3.000000 28.000000 0.000000 0.000000 14.454200
75% 982.000000 1.000000 3.000000 39.000000 1.000000 0.000000 31.275000
max 1309.000000 1.000000 3.000000 80.000000 8.000000 9.000000 512.329200

2.3 检查数据质量¶

针对数据可能存在的“不完整、不正确、不一致”问题,重点检查以下几个维度:

  • 不完整:查看缺失值;
  • 不正确:查看异常点和噪音;
  • 不一致:主要检查文本字段质量。
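其中"不一致"问题常见于文本字段,如多余空格、大小写不统一、同义写法等。下面是一个最小检查思路的示意(示例数据为构造):

```python
import pandas as pd

# 示意:同一取值以不同写法出现(空格、大小写不一致)
s = pd.Series([' male', 'male', 'MALE', 'female'])

# 去除首尾空格并统一小写后再统计取值
cleaned = s.str.strip().str.lower()
print(cleaned.value_counts())
# 对比清洗前后的取值个数,可快速定位不一致字段
print(s.nunique(), cleaned.nunique())  # 4 2
```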
In [8]:
#查看数据缺失情况
data.isnull().sum()
Out[8]:
PassengerId       0
Survived        418
Pclass            0
Name              0
Sex               0
Age             263
SibSp             0
Parch             0
Ticket            0
Fare              1
Cabin          1014
Embarked          2
dtype: int64

发现:1)Survived字段的缺失来自于测试集;2)Age和Cabin字段存在较多的数据缺失,Age字段可以尝试一定的缺失值补全,Cabin字段由于缺失值过多可以考虑删去。
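"缺失值过多可删去"可以做成按缺失比例筛选的通用策略。下面是一个示意(阈值 0.5 为假设值,示例数据为构造):

```python
import numpy as np
import pandas as pd

# 示意:按列缺失比例决定处理策略
df = pd.DataFrame({'Age': [22, np.nan, 30, np.nan],
                   'Cabin': [np.nan, np.nan, np.nan, 'C85']})
miss_ratio = df.isnull().mean()   # 每列缺失值占比
print(miss_ratio)

# 缺失比例超过阈值的列直接删除,其余列留待填充
drop_cols = miss_ratio[miss_ratio > 0.5].index.tolist()
df = df.drop(columns=drop_cols)
print(df.columns.tolist())  # ['Age']
```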

In [9]:
# 使用箱线图刻画Age变量的分布,查看异常点
import seaborn as sns
sns.boxplot(x='Survived', y='Age', data=data)
Out[9]:
<Axes: xlabel='Survived', ylabel='Age'>
(图:按 Survived 分组的 Age 箱线图)

发现:1)数据集中乘客平均年龄在30岁左右;2)乘客年龄存在异常点(高龄乘客);3)平均而言,幸存者相对更年轻。
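箱线图上的异常点也可以用 IQR(四分位距)规则程序化地识别出来,便于后续统一处理。一个最小示意(1.5 为常用系数,示例数据为构造):

```python
import pandas as pd

# 示意:IQR 规则,落在 [Q1-1.5*IQR, Q3+1.5*IQR] 之外的点视为异常点
age = pd.Series([22, 26, 28, 30, 35, 38, 80])
q1, q3 = age.quantile(0.25), age.quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = age[(age < lower) | (age > upper)]
print(outliers.tolist())  # [80]
```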

2.4 数据清洗¶

针对Titanic数据集,主要进行缺失值填充策略的探索。

  1. 使用特定值(如均值、众数)填充缺失值
In [10]:
# 均值填充
# Age
print('年龄均值:', data['Age'].mean())
data['Age'] = data['Age'].fillna(data['Age'].mean())
# Fare
print('旅客票价均值:', data['Fare'].mean())
data['Fare'] = data['Fare'].fillna(data['Fare'].mean())
年龄均值: 29.881137667304014
旅客票价均值: 33.29547928134557
In [11]:
# 众数填充
# Embarked
print(data['Embarked'].value_counts())
data['Embarked'] = data['Embarked'].fillna('S')
Embarked
S    914
C    270
Q    123
Name: count, dtype: int64
  2. 基于近邻推断填充缺失值

sklearn.impute.KNNImputer类提供基于K近邻推断填补缺失值的方法。基本思路是:
1)基于其他未缺失的属性计算K近邻;2)利用K近邻对缺失字段的值进行推断,如加权平均。
注意:该方法只能处理数值属性。

In [12]:
from sklearn.impute import KNNImputer
#select the numeric columns and transform it to numpy
n_train= train[['Age','SibSp','Parch','Fare']].to_numpy() 
imputer = KNNImputer(n_neighbors=2, weights="uniform")
print('处理前缺失值个数:', np.isnan(n_train).sum())
n_train_impute = imputer.fit_transform(n_train)
print('处理后缺失值个数:', np.isnan(n_train_impute).sum())
print('处理后各均值:', np.mean(n_train_impute,axis=0))
处理前缺失值个数: 177
处理后缺失值个数: 0
处理后各均值: [30.38679012  0.52300786  0.38159371 32.20420797]
In [13]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1309 entries, 0 to 1308
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  1309 non-null   int64  
 1   Survived     891 non-null    float64
 2   Pclass       1309 non-null   int64  
 3   Name         1309 non-null   object 
 4   Sex          1309 non-null   object 
 5   Age          1309 non-null   float64
 6   SibSp        1309 non-null   int64  
 7   Parch        1309 non-null   int64  
 8   Ticket       1309 non-null   object 
 9   Fare         1309 non-null   float64
 10  Cabin        295 non-null    object 
 11  Embarked     1309 non-null   object 
dtypes: float64(3), int64(4), object(5)
memory usage: 122.8+ KB
  3. 其他缺失值处理方法,可以查看链接
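作为补充,这里给出另外两种常见填充方法的最小示意(示例数据为构造):SimpleImputer 按指定策略(均值、中位数、众数、常数)统一填充;对有序数据还可用插值。

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# 示意:SimpleImputer 支持 mean / median / most_frequent / constant 等策略
X = pd.DataFrame({'Age': [22.0, np.nan, 30.0, 40.0]})
imp = SimpleImputer(strategy='median')
print(imp.fit_transform(X).ravel())  # [22. 30. 30. 40.]

# 对时间序列或有序数据,可用相邻值线性插值填充
print(X['Age'].interpolate().tolist())  # [22.0, 26.0, 30.0, 40.0]
```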

2.5 离散属性编码¶

  1. 对Sex根据指定的map进行0-1编码
In [14]:
# Sex
data['Sex'].head()
Out[14]:
0      male
1    female
2    female
3    female
4      male
Name: Sex, dtype: object
In [15]:
# Sex
Sex_map = {
    'female': 0,
    'male': 1
}
data['Sex'] = data['Sex'].map(Sex_map)
data.head()
Out[15]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked
0 1 0.0 3 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 NaN S
1 2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 C85 C
2 3 1.0 3 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 NaN S
3 4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 C123 S
4 5 0.0 3 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 NaN S
  2. 对变量Embarked(登船港口)进行编码
In [16]:
data = data.join(pd.get_dummies(data['Embarked'], prefix = 'Embarked'))
data.head()
Out[16]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Embarked_C Embarked_Q Embarked_S
0 1 0.0 3 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 NaN S False False True
1 2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 C85 C True False False
2 3 1.0 3 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 NaN S False False True
3 4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 C123 S False False True
4 5 0.0 3 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 NaN S False False True
  3. 对变量Pclass(船票等级)进行编码
In [17]:
data = data.join(pd.get_dummies(data['Pclass'], prefix = 'Pclass'))
data.head()
Out[17]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Embarked_C Embarked_Q Embarked_S Pclass_1 Pclass_2 Pclass_3
0 1 0.0 3 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 NaN S False False True False False True
1 2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 C85 C True False False True False False
2 3 1.0 3 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 NaN S False False True False False True
3 4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 C123 S False False True True False False
4 5 0.0 3 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 NaN S False False True False False True

2.6 特征工程¶

基于数据集已有字段和挖掘任务,有针对性地创建新的特征。

  1. 提取变量Name中的头衔并编码
In [18]:
def get_title(name):
    # 姓名格式形如 "Braund, Mr. Owen Harris":
    # 逗号之后、句点之前的部分即为头衔(去除首尾空格)
    return name.split(',')[1].split('.')[0].strip()

data['Title'] = data['Name'].map(get_title)
data.head()
Out[18]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare Cabin Embarked Embarked_C Embarked_Q Embarked_S Pclass_1 Pclass_2 Pclass_3 Title
0 1 0.0 3 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 NaN S False False True False False True Mr
1 2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 C85 C True False False True False False Mrs
2 3 1.0 3 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 NaN S False False True False False True Miss
3 4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 C123 S False False True True False False Mrs
4 5 0.0 3 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 NaN S False False True False False True Mr
In [19]:
data['Title'].value_counts()
Out[19]:
Title
Mr              757
Miss            260
Mrs             197
Master           61
Rev               8
Dr                8
Col               4
Mlle              2
Major             2
Ms                2
Lady              1
Sir               1
Mme               1
Don               1
Capt              1
the Countess      1
Jonkheer          1
Dona              1
Name: count, dtype: int64

变量Title中部分头衔并不常见,因此进行归并处理,共得到6类:Officer、Royalty、Mrs、Miss、Mr、Master。

In [20]:
Title_map = {
    'Mr': 'Mr',
    'Miss': 'Miss',
    'Mrs': 'Mrs',
    'Master': 'Master',
    'Rev': 'Officer',
    'Dr': 'Officer',
    'Col': 'Officer',
    'Ms': 'Mrs',
    'Mlle': 'Miss',
    'Major': 'Officer',
    'Dona': 'Royalty',
    'Sir': 'Royalty',
    'Capt': 'Officer',
    'the Countess': 'Royalty',
    'Don': 'Royalty',
    'Lady': 'Royalty',
    'Mme': 'Mrs',
    'Jonkheer': 'Royalty'
}
data['Title'] = data['Title'].map(Title_map)
data['Title'].value_counts()
data = data.join(pd.get_dummies(data['Title'], prefix = 'Title'))
  2. 基于变量SibSp和Parch计算家庭规模数据
In [21]:
data['Family'] = data['SibSp'] + data['Parch'] + 1
data['FamilySingle'] = data['Family'].map(lambda a:1 if a == 1 else 0)
data['FamilySmall'] = data['Family'].map(lambda a:1 if 2 <= a <= 4 else 0)
data['FamilyLarge'] = data['Family'].map(lambda a:1 if 5 <= a else 0)
data.head()
Out[21]:
PassengerId Survived Pclass Name Sex Age SibSp Parch Ticket Fare ... Title_Master Title_Miss Title_Mr Title_Mrs Title_Officer Title_Royalty Family FamilySingle FamilySmall FamilyLarge
0 1 0.0 3 Braund, Mr. Owen Harris 1 22.0 1 0 A/5 21171 7.2500 ... False False True False False False 2 0 1 0
1 2 1.0 1 Cumings, Mrs. John Bradley (Florence Briggs Th... 0 38.0 1 0 PC 17599 71.2833 ... False False False True False False 2 0 1 0
2 3 1.0 3 Heikkinen, Miss. Laina 0 26.0 0 0 STON/O2. 3101282 7.9250 ... False True False False False False 1 1 0 0
3 4 1.0 1 Futrelle, Mrs. Jacques Heath (Lily May Peel) 0 35.0 1 0 113803 53.1000 ... False False False True False False 2 0 1 0
4 5 0.0 3 Allen, Mr. William Henry 1 35.0 0 0 373450 8.0500 ... False False True False False False 1 1 0 0

5 rows × 29 columns

2.7 数据标准化¶

可以利用sklearn.preprocessing包实现数据标准化操作。其中,StandardScaler可用来做正态标准化,MinMaxScaler可用来做最小最大标准化。
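需要注意:严格来说,标准化参数(均值、方差)应只在训练集上拟合,再用同一组参数变换测试集,以避免信息泄漏;本文为演示方便在全量数据上操作。一个最小示意(示例数据为构造):

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# 示意:仅用训练集估计均值/方差,再应用到测试集
X_train = np.array([[1.0], [2.0], [3.0], [4.0]])
X_test = np.array([[2.5], [10.0]])

scaler = StandardScaler().fit(X_train)   # 只在训练集上 fit
X_train_std = scaler.transform(X_train)
X_test_std = scaler.transform(X_test)    # 测试集用同一组参数变换
print(scaler.mean_, scaler.scale_)
```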

In [22]:
#选取range不在[0,1]的变量
print(data.describe())
data_r = data[['Age','SibSp','Parch','Fare','Family']]
       PassengerId    Survived       Pclass          Sex          Age  \
count  1309.000000  891.000000  1309.000000  1309.000000  1309.000000   
mean    655.000000    0.383838     2.294882     0.644003    29.881138   
std     378.020061    0.486592     0.837836     0.478997    12.883193   
min       1.000000    0.000000     1.000000     0.000000     0.170000   
25%     328.000000    0.000000     2.000000     0.000000    22.000000   
50%     655.000000    0.000000     3.000000     1.000000    29.881138   
75%     982.000000    1.000000     3.000000     1.000000    35.000000   
max    1309.000000    1.000000     3.000000     1.000000    80.000000   

             SibSp        Parch         Fare       Family  FamilySingle  \
count  1309.000000  1309.000000  1309.000000  1309.000000   1309.000000   
mean      0.498854     0.385027    33.295479     1.883881      0.603514   
std       1.041658     0.865560    51.738879     1.583639      0.489354   
min       0.000000     0.000000     0.000000     1.000000      0.000000   
25%       0.000000     0.000000     7.895800     1.000000      0.000000   
50%       0.000000     0.000000    14.454200     1.000000      1.000000   
75%       1.000000     0.000000    31.275000     2.000000      1.000000   
max       8.000000     9.000000   512.329200    11.000000      1.000000   

       FamilySmall  FamilyLarge  
count  1309.000000  1309.000000  
mean      0.333843     0.062643  
std       0.471765     0.242413  
min       0.000000     0.000000  
25%       0.000000     0.000000  
50%       0.000000     0.000000  
75%       1.000000     0.000000  
max       1.000000     1.000000  
In [23]:
#正态标准化
from sklearn import preprocessing
z_scaler = preprocessing.StandardScaler().fit(data_r)
print('mean:', z_scaler.mean_)
print('std:', z_scaler.scale_)
print(z_scaler.transform(data_r))
data_r_std = z_scaler.transform(data_r)
# 计算标准化后的均值和方差
print('标准化后的均值:', data_r_std.mean(axis=0))
print('标准化后的方差:', data_r_std.var(axis=0))
mean: [29.88113767  0.49885409  0.38502674 33.29547928  1.88388083]
std: [12.8782713   1.04126043  0.86522959 51.71911251  1.58303407]
[[-0.61197171  0.48128777 -0.4449995  -0.50359486  0.07335229]
 [ 0.63043107  0.48128777 -0.4449995   0.73450256  0.07335229]
 [-0.30137101 -0.47908676 -0.4449995  -0.49054359 -0.55834605]
 ...
 [ 0.66925616 -0.47908676 -0.4449995  -0.50359486 -0.55834605]
 [ 0.         -0.47908676 -0.4449995  -0.48812669 -0.55834605]
 [ 0.          0.48128777  0.71076309 -0.21147268  0.70505064]]
标准化后的均值: [ 1.03473804e-16 -1.62844019e-17  1.73021770e-17  2.44266028e-17
  1.62844019e-17]
标准化后的方差: [1. 1. 1. 1. 1.]
In [24]:
#最小最大标准化
m_scaler = preprocessing.MinMaxScaler().fit(data_r)
print(m_scaler.transform(data_r))
[[0.27345609 0.125      0.         0.01415106 0.1       ]
 [0.473882   0.125      0.         0.13913574 0.1       ]
 [0.32356257 0.         0.         0.01546857 0.        ]
 ...
 [0.48014531 0.         0.         0.01415106 0.        ]
 [0.3721801  0.         0.         0.01571255 0.        ]
 [0.3721801  0.125      0.11111111 0.0436405  0.2       ]]

2.8 PCA降维分析(示例演示)¶

In [25]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
In [26]:
# 1) 生成“电商客户行为”合成数据(N=300)
rng = np.random.default_rng(42)
N = 300
segments = rng.choice(["High-Value", "Bargain-Hunter", "Low-Engage"], size=N, p=[0.35, 0.4, 0.25])

def gen_by_seg(seg):
    if seg == "High-Value":
        order_cnt   = rng.normal(18, 4)
        aov         = rng.normal(420, 60)              # avg order value
        ret_rate    = np.clip(rng.normal(0.05, 0.02), 0, 1)
        browse_min  = rng.normal(12, 4)
        disc_ratio  = np.clip(rng.normal(0.08, 0.05), 0, 1)
        tickets     = rng.poisson(0.6)
    elif seg == "Bargain-Hunter":
        order_cnt   = rng.normal(12, 3)
        aov         = rng.normal(180, 40)
        ret_rate    = np.clip(rng.normal(0.10, 0.04), 0, 1)
        browse_min  = rng.normal(26, 6)
        disc_ratio  = np.clip(rng.normal(0.32, 0.08), 0, 1)
        tickets     = rng.poisson(1.2)
    else:  # Low-Engage
        order_cnt   = rng.normal(5, 2)
        aov         = rng.normal(120, 30)
        ret_rate    = np.clip(rng.normal(0.04, 0.02), 0, 1)
        browse_min  = rng.normal(8, 3)
        disc_ratio  = np.clip(rng.normal(0.12, 0.06), 0, 1)
        tickets     = rng.poisson(0.3)
    return order_cnt, aov, ret_rate, browse_min, disc_ratio, tickets

# 注意:此处使用新变量名,避免覆盖前文 Titanic 的 data
features = np.array([gen_by_seg(s) for s in segments])
cols = ["order_count","avg_order_value","return_rate","browse_minutes","discount_ratio","support_tickets"]
df = pd.DataFrame(features, columns=cols)
df["segment_true"] = segments   # 真实业务中通常未知
df.head(10)
Out[26]:
order_count avg_order_value return_rate browse_minutes discount_ratio support_tickets segment_true
0 4.415658 116.895352 0.034960 8.457688 0.208290 0.0 Low-Engage
1 11.289449 187.060497 0.111840 23.768513 0.179462 1.0 Bargain-Hunter
2 1.932277 145.914840 0.033429 7.816027 0.056826 0.0 Low-Engage
3 15.900134 203.306210 0.169292 33.064471 0.355127 3.0 Bargain-Hunter
4 18.266183 378.154571 0.069792 7.286786 0.119118 1.0 High-Value
5 7.342494 142.526070 0.076413 10.192324 0.025678 0.0 Low-Engage
6 2.655986 104.451605 0.070225 9.912601 0.078064 1.0 Low-Engage
7 2.566880 99.865792 0.046240 11.465936 0.156526 0.0 Low-Engage
8 19.217467 424.322014 0.058278 18.464839 0.000000 1.0 High-Value
9 7.255217 239.037962 0.114734 31.079504 0.274325 1.0 Bargain-Hunter
In [27]:
# 2) 标准化(PCA对量纲敏感)
X = df[cols].values
scaler = StandardScaler()
X_std = scaler.fit_transform(X)
In [28]:
# 3) 先拟合全成分,查看解释方差占比与累计解释方差
pca_full = PCA(n_components=len(cols), random_state=42)
pca_full.fit(X_std)
explained = pca_full.explained_variance_ratio_
cum_explained = np.cumsum(explained)
print("Explained variance ratio by component:", np.round(explained, 3))
print("Cumulative explained variance:", np.round(cum_explained, 3))
Explained variance ratio by component: [0.408 0.281 0.153 0.08  0.042 0.036]
Cumulative explained variance: [0.408 0.689 0.841 0.921 0.964 1.   ]
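除了人工读累计解释方差表,sklearn 也支持把 n_components 设为 (0,1) 之间的小数,自动选取使累计解释方差首次达到该阈值的最少主成分数。一个示意(数据为随机构造,阈值 0.9 为假设值):

```python
import numpy as np
from sklearn.decomposition import PCA

# 示意:构造 6 维数据,其中第 4 列与第 1 列高度相关
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 6))
X[:, 3] = X[:, 0] * 2 + rng.normal(scale=0.1, size=200)

# n_components=0.9:自动选取累计解释方差 >= 90% 的最少主成分数
pca = PCA(n_components=0.9)
X_r = pca.fit_transform(X)
print(X_r.shape[1], pca.explained_variance_ratio_.sum())
```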
In [29]:
# 4) 碎石图(Scree Plot)
plt.figure()
plt.plot(range(1, len(cols)+1), explained, marker='o')
plt.xlabel("Principal Component")
plt.ylabel("Explained Variance Ratio")
plt.title("Scree Plot")
plt.grid(True)
plt.show()
(图:碎石图 Scree Plot,各主成分的解释方差占比)
In [30]:
# 5) 取前两个主成分做降维与可视化
pca = PCA(n_components=2, random_state=42)
X_pca = pca.fit_transform(X_std)
In [31]:
# 6) 2D 主成分散点图
# 定义一个颜色映射字典,不同客户类型对应不同颜色
color_map = {
    "High-Value":   "#1f77b4",     # 高价值客户 → 蓝色
    "Bargain-Hunter":"#ff7f0e",    # 价格敏感型客户 → 橙色
    "Low-Engage":   "#2ca02c"      # 低活跃客户 → 绿色
}

# 创建新的绘图窗口,避免和之前的图混在一起
plt.figure()
# 遍历每个客户类别及其对应颜色
for seg, color in color_map.items():
    # mask 是布尔索引,筛选出属于该类别的样本行
    mask = (df["segment_true"] == seg)
    
    # 绘制该类别在PC1-PC2平面的散点
    plt.scatter(
        X_pca[mask, 0],   # 横轴:PC1坐标
        X_pca[mask, 1],   # 纵轴:PC2坐标
        s=25,             # 点的大小
        alpha=0.8,        # 透明度(避免点过多时看不清)
        label=seg,        # 图例中的标签(类别名)
        c=color           # 点的颜色
    )
# 设置横纵坐标轴标题
plt.xlabel("PC1")
plt.ylabel("PC2")
# 设置图表标题
plt.title("Customers projected onto PC1–PC2 space")
# 显示图例,并为图例框加上标题
plt.legend(title="True Segment")
# 打开网格线,方便对比和读数
plt.grid(True)
# 显示绘制好的图形
plt.show()
(图:PC1–PC2 平面上按真实客户类型着色的散点图)
In [32]:
# 7) 将PC结果拼接回表,便于后续分析
out = df.copy()
out["PC1"] = X_pca[:, 0]
out["PC2"] = X_pca[:, 1]
print("\n Head of output (with PCs):")
print(out.head(10))
 Head of output (with PCs):
   order_count  avg_order_value  return_rate  browse_minutes  discount_ratio  \
0     4.415658       116.895352     0.034960        8.457688        0.208290   
1    11.289449       187.060497     0.111840       23.768513        0.179462   
2     1.932277       145.914840     0.033429        7.816027        0.056826   
3    15.900134       203.306210     0.169292       33.064471        0.355127   
4    18.266183       378.154571     0.069792        7.286786        0.119118   
5     7.342494       142.526070     0.076413       10.192324        0.025678   
6     2.655986       104.451605     0.070225        9.912601        0.078064   
7     2.566880        99.865792     0.046240       11.465936        0.156526   
8    19.217467       424.322014     0.058278       18.464839        0.000000   
9     7.255217       239.037962     0.114734       31.079504        0.274325   

   support_tickets    segment_true       PC1       PC2  
0              0.0      Low-Engage -0.188869 -2.109725  
1              1.0  Bargain-Hunter  1.324149  0.305075  
2              0.0      Low-Engage -0.935139 -2.344901  
3              3.0  Bargain-Hunter  3.602360  2.189832  
4              1.0      High-Value -1.177862  0.936413  
5              0.0      Low-Engage -0.568071 -1.353950  
6              1.0      Low-Engage  0.037546 -1.827962  
7              0.0      Low-Engage -0.003813 -2.206281  
8              1.0      High-Value -1.365332  1.500631  
9              1.0  Bargain-Hunter  2.204198  0.379637  
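拿到主成分坐标后,通常还需借助载荷(components_)解释各主成分的业务含义:某变量在某主成分上的载荷绝对值越大,贡献越大。一个最小示意(数据为随机构造,仅演示用法):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA

# 示意:把 components_ 整理成带变量名的载荷表
rng = np.random.default_rng(1)
cols = ["order_count", "avg_order_value", "return_rate"]
X = rng.normal(size=(100, 3))

pca = PCA(n_components=2).fit(StandardScaler().fit_transform(X))
loadings = pd.DataFrame(pca.components_, columns=cols, index=["PC1", "PC2"])
print(loadings.round(3))  # 每行为一个主成分在各原始变量上的载荷
```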
In [ ]: