Spam Email Detection¶
1. Dataset Description¶
The email dataset (email.csv) was compiled from all emails received by one mailbox account during the first three months of 2012. Each email is labeled as spam or not, and the content has already been preprocessed into extracted features. The fields and their meanings are as follows:
- spam: Indicator for whether the email was spam.
- to_multiple: Indicator for whether the email was addressed to more than one recipient.
- from: Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).
- cc: Number of people cc'ed.
- sent_email: Indicator for whether the sender had been sent an email in the last 30 days.
- time: Time at which email was sent.
- image: The number of images attached.
- attach: The number of attached files.
- dollar: The number of times a dollar sign or the word "dollar" appeared in the email.
- winner: Indicates whether "winner" appeared in the email.
- inherit: The number of times “inherit” (or an extension, such as “inheritance”) appeared in the email.
- viagra: The number of times “viagra” appeared in the email.
- password: The number of times “password” appeared in the email.
- num_char: The number of characters in the email, in thousands.
- line_breaks: The number of line breaks in the email (does not count text wrapping).
- format: Indicates whether the email was written using HTML (e.g. may have included bolding or active links).
- re_subj: Whether the subject started with "Re:", "RE:", "re:", or "rE:"
- exclaim_subj: Whether there was an exclamation point in the subject.
- urgent_subj: Whether the word “urgent” was in the email subject.
- exclaim_mess: The number of exclamation points in the email message.
- number: Factor variable saying whether there was no number, a small number (under 1 million), or a big number.
2. Exploratory Data Analysis and Preprocessing¶
In [1]:
## load required libraries
import matplotlib.pyplot as plt
import pandas as pd
import numpy as np
import warnings
warnings.simplefilter("ignore")
2.1. Loading the Data¶
In [2]:
data = pd.read_csv("./email.csv")
data.head()
Out[2]:
| | spam | to_multiple | from | cc | sent_email | image | attach | dollar | winner | inherit | password | num_char | line_breaks | format | re_subj | exclaim_subj | urgent_subj | exclaim_mess | number |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | no | yes | no | no | no | no | no | no | no | no | 11.370 | 202 | HTML | no | no | no | 0 | big |
| 1 | 0 | no | yes | no | no | no | no | no | no | no | no | 10.504 | 202 | HTML | no | no | no | 1 | small |
| 2 | 0 | no | yes | no | no | no | no | yes | no | yes | no | 7.773 | 192 | HTML | no | no | no | 6 | small |
| 3 | 0 | no | yes | no | no | no | no | no | no | no | no | 13.256 | 255 | HTML | no | no | no | 48 | small |
| 4 | 0 | no | yes | no | no | no | no | no | no | no | yes | 1.231 | 29 | Plain | no | no | no | 1 | none |
In [3]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3921 entries, 0 to 3920
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   spam          3921 non-null   int64  
 1   to_multiple   3921 non-null   object 
 2   from          3921 non-null   object 
 3   cc            3921 non-null   object 
 4   sent_email    3921 non-null   object 
 5   image         3921 non-null   object 
 6   attach        3921 non-null   object 
 7   dollar        3921 non-null   object 
 8   winner        3921 non-null   object 
 9   inherit       3921 non-null   object 
 10  password      3921 non-null   object 
 11  num_char      3921 non-null   float64
 12  line_breaks   3921 non-null   int64  
 13  format        3921 non-null   object 
 14  re_subj       3921 non-null   object 
 15  exclaim_subj  3921 non-null   object 
 16  urgent_subj   3921 non-null   object 
 17  exclaim_mess  3921 non-null   int64  
 18  number        3921 non-null   object 
dtypes: float64(1), int64(3), object(15)
memory usage: 582.2+ KB
First, map the binary yes/no variables to 1/0.
In [4]:
yes_no_map = {
    'yes': 1,
    'no': 0
}
# Apply the same yes/no -> 1/0 mapping to every binary column
binary_cols = ['to_multiple', 'from', 'cc', 'sent_email', 'image', 'attach',
               'dollar', 'winner', 'inherit', 'password', 're_subj',
               'exclaim_subj', 'urgent_subj']
for col in binary_cols:
    data[col] = data[col].map(yes_no_map)
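One pitfall of `Series.map` worth guarding against: any value not present in the dictionary is silently turned into NaN. A minimal sketch (the toy column and the stray capitalized "No" are invented for illustration):

```python
import pandas as pd

# Toy stand-in for one of the yes/no columns (the stray
# capitalized "No" is invented to show the failure mode)
col = pd.Series(["yes", "no", "No"])

yes_no_map = {"yes": 1, "no": 0}
mapped = col.map(yes_no_map)

# Series.map silently turns unmapped values into NaN,
# so a quick count catches typos in the raw data
n_unmapped = int(mapped.isna().sum())
print(n_unmapped)  # 1
```

A `data[binary_cols].isna().sum()` check after mapping would catch the same problem on the real frame.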
In [5]:
# format
data['format'].value_counts()
Out[5]:
format
HTML     2726
Plain    1195
Name: count, dtype: int64
In [6]:
format_map = {
'HTML': 1,
'Plain': 0
}
data['format'] = data['format'].map(format_map)
In [7]:
# number
data['number'].value_counts()
Out[7]:
number
small    2827
none      549
big       545
Name: count, dtype: int64
Encode the multi-category variable `number` with dummy variables.
In [8]:
data = data.join(pd.get_dummies(data['number'], prefix = 'number'))
data = data.drop(['number'], axis = 1)
data.head()
Out[8]:
| | spam | to_multiple | from | cc | sent_email | image | attach | dollar | winner | inherit | ... | num_char | line_breaks | format | re_subj | exclaim_subj | urgent_subj | exclaim_mess | number_big | number_none | number_small |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 11.370 | 202 | 1 | 0 | 0 | 0 | 0 | True | False | False |
| 1 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 10.504 | 202 | 1 | 0 | 0 | 0 | 1 | False | False | True |
| 2 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 1 | 0 | 1 | ... | 7.773 | 192 | 1 | 0 | 0 | 0 | 6 | False | False | True |
| 3 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 13.256 | 255 | 1 | 0 | 0 | 0 | 48 | False | False | True |
| 4 | 0 | 0 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 1.231 | 29 | 0 | 0 | 0 | 0 | 1 | False | True | False |
5 rows × 21 columns
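Keeping all k dummy columns makes them perfectly collinear (they always sum to 1), which can hurt linear models such as the logistic regression fitted below; `pd.get_dummies` can drop one level instead. A sketch on a toy stand-in for the `number` column:

```python
import pandas as pd

# Toy stand-in for the `number` column
s = pd.Series(["small", "none", "big", "small"], name="number")

# drop_first=True keeps k-1 of the k dummy columns; the dropped
# level is implied when the remaining dummies are all 0, which
# removes the exact collinearity among the full set of dummies
dummies = pd.get_dummies(s, prefix="number", drop_first=True)
print(list(dummies.columns))
```

Levels are sorted alphabetically, so here the `number_big` column is the one dropped.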
In [9]:
print(data['spam'].unique())
data_label_count = data.groupby('spam').count()['cc'].sort_values(ascending = False)
print(data_label_count)
data_label_count.plot.bar()
plt.show()
[0 1]
spam
0    3554
1     367
Name: cc, dtype: int64
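The labels are heavily imbalanced (3554 ham vs. 367 spam), so plain accuracy is a misleading yardstick: a classifier that always answers "not spam" is already right about 91% of the time. A quick back-of-the-envelope check:

```python
# Class counts from the cell above
n_ham, n_spam = 3554, 367

# Accuracy of a degenerate "always predict ham" classifier --
# the bar any real model must clear, and one reason the evaluation
# below reports precision/recall/F1/AUC rather than accuracy
baseline_accuracy = n_ham / (n_ham + n_spam)
print(round(baseline_accuracy, 3))  # 0.906
```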
Predictors: Distributions & Correlation Analysis¶
In [10]:
import seaborn as sns
graph_by_variables = data.columns
plt.figure(figsize = (15, 18))
for i in range(0, 21):
    plt.subplot(7, 3, i + 1)
    # sns.distplot was deprecated and later removed from seaborn;
    # histplot(..., kde=True) is the modern replacement
    sns.histplot(data[graph_by_variables[i]], kde = True)
    plt.title(graph_by_variables[i])
plt.tight_layout()
In [11]:
f, ax = plt.subplots(figsize = (15, 15))
sns.heatmap(data.corr(), annot = True, linewidths = 0.5, fmt = '.1f', ax = ax)
Out[11]:
<Axes: >
2.4. Splitting the Dataset¶
In [12]:
from sklearn.model_selection import train_test_split
# Splitting into train and test sets
X = data.drop(['spam'], axis = 1)
y = data['spam']
# test_size = 0.2 means an 80/20 train/test split, i.e. a 4:1 ratio
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 628)
print(len(X_train), len(X_test))
3136 785
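With roughly 9% positives, a random split can leave the test set with a noticeably different spam rate than the training set. `train_test_split` accepts a `stratify` argument that preserves the class ratio; a sketch on synthetic data (the toy arrays stand in for X and y):

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the email data: ~9% positive labels
rng = np.random.default_rng(628)
X_toy = rng.random((1000, 3))
y_toy = (rng.random(1000) < 0.09).astype(int)

# stratify=y keeps the spam proportion (nearly) identical in both splits
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=628, stratify=y_toy)
print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```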
In [13]:
from sklearn.linear_model import LogisticRegression
# Fitting a logistic regression model with default parameters
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Out[13]:
LogisticRegression()
Model Evaluation¶
In [14]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
In [15]:
# Prediction & Evaluation
y_hat_test = logreg.predict(X_test)
# Logistic Regression score
print("Logistic regression score for test set:")
print("Precision: {:.3f}".format(precision_score(y_test, y_hat_test)), "Recall: {:.3f}".format(recall_score(y_test, y_hat_test)))
print("F1 score: {:.3f}".format(f1_score(y_test, y_hat_test)))
# note: roc_auc_score here receives hard 0/1 predictions; passing
# logreg.predict_proba(X_test)[:, 1] would give the conventional ROC AUC
print("AUC score: {:.3f}".format(roc_auc_score(y_test, y_hat_test)))
Logistic regression score for test set:
Precision: 0.667 Recall: 0.116
F1 score: 0.198
AUC score: 0.555
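Recall of 0.116 means most spam slips through. One common lever (a sketch of the general technique, not the notebook's method) is lowering the decision threshold on `predict_proba` instead of relying on `predict`'s fixed 0.5 cut-off. Synthetic data below stands in for the email features:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy imbalanced problem standing in for the email features
rng = np.random.default_rng(0)
X_toy = rng.normal(size=(500, 2))
y_toy = (X_toy[:, 0] + 0.5 * rng.normal(size=500) > 1.2).astype(int)

clf = LogisticRegression().fit(X_toy, y_toy)
proba = clf.predict_proba(X_toy)[:, 1]

# predict() is equivalent to thresholding proba at 0.5;
# a lower threshold flags more emails, trading precision for recall
y_default = (proba >= 0.5).astype(int)
y_lowered = (proba >= 0.2).astype(int)
print(y_default.sum(), y_lowered.sum())
```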
In [16]:
from sklearn.naive_bayes import BernoulliNB
# Fitting a Naive Bayes model with default parameters
clf = BernoulliNB()
clf.fit(X_train, y_train)
Out[16]:
BernoulliNB()
Model Evaluation¶
In [17]:
# Prediction & Evaluation
y_hat_test = clf.predict(X_test)
# Naive Bayes score
print("Naive Bayes score for test set:")
print("Precision: {:.3f}".format(precision_score(y_test, y_hat_test)), "Recall: {:.3f}".format(recall_score(y_test, y_hat_test)))
print("F1 score: {:.3f}".format(f1_score(y_test, y_hat_test)))
print("AUC score: {:.3f}".format(roc_auc_score(y_test, y_hat_test)))
Naive Bayes score for test set:
Precision: 0.500 Recall: 0.377
F1 score: 0.430
AUC score: 0.670
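One thing to keep in mind about `BernoulliNB`: it models every feature as binary, and with its default `binarize=0.0` any value greater than zero is treated as 1. Continuous columns such as num_char and line_breaks are therefore collapsed to mere presence/absence. A toy illustration (arrays invented for the example):

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

# Two toy features: a count-like column and a continuous one.
# With binarize=0.0 (the default), BernoulliNB only sees
# "is the value > 0?", discarding the magnitudes entirely.
X_toy = np.array([[0.0, 11.37],
                  [0.0, 10.50],
                  [3.0, 1.23],
                  [5.0, 0.0]])
y_toy = np.array([0, 0, 1, 1])

clf = BernoulliNB()  # same defaults as the cell above
clf.fit(X_toy, y_toy)
pred = clf.predict([[0.0, 8.0]])  # binarized to [0, 1]
print(pred)
```

If the magnitudes matter, scikit-learn's GaussianNB (for continuous features) or a mixed approach may fit the data better.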
In [18]:
from sklearn.tree import DecisionTreeClassifier
# Fitting a decision tree model with default parameters
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
Out[18]:
DecisionTreeClassifier()
Model Evaluation¶
In [19]:
# Prediction & Evaluation
y_hat_test = dt.predict(X_test)
# Decision Tree score
print("Decision tree score for test set:")
print("Precision: {:.3f}".format(precision_score(y_test, y_hat_test)), "Recall: {:.3f}".format(recall_score(y_test, y_hat_test)))
print("F1 score: {:.3f}".format(f1_score(y_test, y_hat_test)))
print("AUC score: {:.3f}".format(roc_auc_score(y_test, y_hat_test)))
Decision tree score for test set:
Precision: 0.595 Recall: 0.725
F1 score: 0.654
AUC score: 0.839
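A fitted tree also exposes `feature_importances_`, a quick way to see which email features drive the splits. A self-contained sketch on toy data (in the notebook one would wrap `dt.feature_importances_` in a Series with `index=X.columns`):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Toy data: only column 0 carries signal, so it should
# receive essentially all of the importance mass
rng = np.random.default_rng(42)
X_toy = rng.random((300, 3))
y_toy = (X_toy[:, 0] > 0.5).astype(int)

dt_toy = DecisionTreeClassifier(random_state=0).fit(X_toy, y_toy)
imp = dt_toy.feature_importances_

# importances always sum to 1 across features
print(imp.argmax(), round(imp.sum(), 3))
```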
Decision Tree Visualization¶
In [20]:
# use graphviz
# from sklearn import tree
# import graphviz
# dot_data = tree.export_graphviz(dt, feature_names = X.columns, filled = True, class_names = True,out_file=None)
# graph = graphviz.Source(dot_data)
# graph
# use matplotlib
import matplotlib.pyplot as plt
from sklearn import tree
plt.figure(figsize=(12,8))
tree.plot_tree(dt, feature_names=X.columns, class_names=[str(c) for c in dt.classes_], filled=True)
plt.show()
Limiting Tree Depth to Prevent Overfitting¶
In [21]:
from sklearn.tree import DecisionTreeClassifier
# Fitting a decision tree with its depth capped at 4
dt = DecisionTreeClassifier(max_depth = 4)
dt.fit(X_train, y_train)
Out[21]:
DecisionTreeClassifier(max_depth=4)
In [22]:
# Prediction & Evaluation
y_hat_test = dt.predict(X_test)
# Decision Tree score
print("Decision tree score for test set:")
print("Precision: {:.3f}".format(precision_score(y_test, y_hat_test)), "Recall: {:.3f}".format(recall_score(y_test, y_hat_test)))
print("F1 score: {:.3f}".format(f1_score(y_test, y_hat_test)))
print("AUC score: {:.3f}".format(roc_auc_score(y_test, y_hat_test)))
Decision tree score for test set:
Precision: 0.649 Recall: 0.348
F1 score: 0.453
AUC score: 0.665
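Here max_depth = 4 was fixed by hand; a more systematic alternative (a sketch, not what the notebook does) is to pick the depth by cross-validated F1. Toy data stands in for X_train/y_train:

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the training data
rng = np.random.default_rng(628)
X_toy = rng.random((400, 5))
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 1.0).astype(int)

# Score a few candidate depths with 5-fold cross-validated F1
scores = {}
for depth in [2, 4, 6, 8, None]:  # None = grow without a depth cap
    model = DecisionTreeClassifier(max_depth=depth, random_state=0)
    scores[depth] = cross_val_score(model, X_toy, y_toy,
                                    cv=5, scoring="f1").mean()

best_depth = max(scores, key=scores.get)
print(best_depth, round(scores[best_depth], 3))
```

Scikit-learn's GridSearchCV wraps this loop (plus refitting on the best setting) in one object.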
In [23]:
# use graphviz
# from sklearn import tree
# import graphviz
# dot_data = tree.export_graphviz(dt, feature_names = X.columns, filled = True, class_names = True,out_file=None)
# graph = graphviz.Source(dot_data)
# graph
# use matplotlib
import matplotlib.pyplot as plt
plt.figure(figsize=(12,8))
tree.plot_tree(dt, feature_names=X.columns, class_names=[str(c) for c in dt.classes_], filled=True)
plt.show()