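Spam Email Detection

1. Dataset overview

The email dataset (email.csv) was compiled from all emails received by a single email account during the first three months of 2012. Each email is labeled as spam or not spam, and the message content has already been preprocessed and converted into features. The fields and their meanings are: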

  • spam: Indicator for whether the email was spam.
  • to_multiple: Indicator for whether the email was addressed to more than one recipient.
  • from: Whether the message was listed as from anyone (this is usually set by default for regular outgoing email).
  • cc: Number of people cc'ed.
  • sent_email: Indicator for whether the sender had been sent an email in the last 30 days.
  • time: Time at which email was sent.
  • image: The number of images attached.
  • attach: The number of attached files.
  • dollar: The number of times a dollar sign or the word "dollar" appeared in the email.
  • winner: Indicates whether "winner" appeared in the email.
  • inherit: The number of times “inherit” (or an extension, such as “inheritance”) appeared in the email.
  • viagra: The number of times “viagra” appeared in the email.
  • password: The number of times “password” appeared in the email.
  • num_char: The number of characters in the email, in thousands.
  • line_breaks: The number of line breaks in the email (does not count text wrapping).
  • format: Indicates whether the email was written using HTML (e.g. may have included bolding or active links).
  • re_subj: Whether the subject started with "Re:", "RE:", "re:", or "rE:"
  • exclaim_subj: Whether there was an exclamation point in the subject.
  • urgent_subj: Whether the word “urgent” was in the email subject.
  • exclaim_mess: The number of exclamation points in the email message.
  • number: Factor variable saying whether there was no number, a small number (under 1 million), or a big number.

2. Exploratory data analysis and preprocessing

In [1]:
## load required libraries 
import matplotlib.pyplot as plt 
import pandas as pd 
import numpy as np
import warnings
warnings.simplefilter("ignore")

2.1. Loading the data

In [2]:
data = pd.read_csv("./email.csv")
data.head()
Out[2]:
   spam  to_multiple  from  cc  sent_email  image  attach  dollar  winner  inherit  password  num_char  line_breaks  format  re_subj  exclaim_subj  urgent_subj  exclaim_mess  number
0     0           no   yes  no          no     no      no      no      no       no        no    11.370          202    HTML       no            no           no             0     big
1     0           no   yes  no          no     no      no      no      no       no        no    10.504          202    HTML       no            no           no             1   small
2     0           no   yes  no          no     no      no     yes      no      yes        no     7.773          192    HTML       no            no           no             6   small
3     0           no   yes  no          no     no      no      no      no       no        no    13.256          255    HTML       no            no           no            48   small
4     0           no   yes  no          no     no      no      no      no       no       yes     1.231           29   Plain       no            no           no             1    none
In [3]:
data.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 3921 entries, 0 to 3920
Data columns (total 19 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   spam          3921 non-null   int64  
 1   to_multiple   3921 non-null   object 
 2   from          3921 non-null   object 
 3   cc            3921 non-null   object 
 4   sent_email    3921 non-null   object 
 5   image         3921 non-null   object 
 6   attach        3921 non-null   object 
 7   dollar        3921 non-null   object 
 8   winner        3921 non-null   object 
 9   inherit       3921 non-null   object 
 10  password      3921 non-null   object 
 11  num_char      3921 non-null   float64
 12  line_breaks   3921 non-null   int64  
 13  format        3921 non-null   object 
 14  re_subj       3921 non-null   object 
 15  exclaim_subj  3921 non-null   object 
 16  urgent_subj   3921 non-null   object 
 17  exclaim_mess  3921 non-null   int64  
 18  number        3921 non-null   object 
dtypes: float64(1), int64(3), object(15)
memory usage: 582.1+ KB
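The `info()` output lists 19 columns, while the field list in section 1 documents 21 fields. The comparison below is a minimal sketch of how to spot the difference; both lists are copied by hand from this document, not read from email.csv. Several fields documented as counts (for example `cc`, `image`, `attach`, `dollar`) also appear here as yes/no strings, which suggests they were binarized before the file was saved.

```python
# Field names as documented in section 1 (copied by hand).
documented = [
    "spam", "to_multiple", "from", "cc", "sent_email", "time", "image",
    "attach", "dollar", "winner", "inherit", "viagra", "password",
    "num_char", "line_breaks", "format", "re_subj", "exclaim_subj",
    "urgent_subj", "exclaim_mess", "number",
]

# Column names as reported by data.info() (copied by hand).
loaded = [
    "spam", "to_multiple", "from", "cc", "sent_email", "image", "attach",
    "dollar", "winner", "inherit", "password", "num_char", "line_breaks",
    "format", "re_subj", "exclaim_subj", "urgent_subj", "exclaim_mess",
    "number",
]

# Documented fields that are absent from the loaded frame.
missing = [name for name in documented if name not in loaded]
print(missing)
```

With the real frame, `loaded` could be replaced by `list(data.columns)`.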

2.2. Data preprocessing

Categorical variables => numeric variables

Binary categorical variables, such as yes/no, are mapped to 1/0.

In [4]:
# Map every yes/no column to 1/0 in one pass
yes_no_map = {
    'yes': 1,
    'no': 0
}
yes_no_cols = ['to_multiple', 'from', 'cc', 'sent_email', 'image', 'attach', 'dollar',
               'winner', 'inherit', 'password', 're_subj', 'exclaim_subj', 'urgent_subj']
for col in yes_no_cols:
    data[col] = data[col].map(yes_no_map)
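`Series.map` with a dictionary returns NaN for any value that is not a key, so a stray spelling such as "Yes" would silently become a missing value. The sketch below shows one way to check for that after mapping; the `toy` frame and its values are made up for illustration and do not come from email.csv.

```python
import pandas as pd

yes_no_map = {"yes": 1, "no": 0}

# Toy stand-in for two yes/no columns; "Yes" is a deliberate mismatch.
toy = pd.DataFrame({
    "winner": ["yes", "no", "Yes"],
    "cc": ["no", "no", "yes"],
})

# Apply the same mapping to every column, then count values that failed to map.
mapped = toy.apply(lambda col: col.map(yes_no_map))
unmapped_counts = mapped.isna().sum()
print(unmapped_counts)
```

A non-zero count for any column would mean the mapping dictionary does not cover all values in that column.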
In [5]:
# format
data['format'].value_counts()
Out[5]:
HTML     2726
Plain    1195
Name: format, dtype: int64
In [6]:
format_map = {
    'HTML': 1,
    'Plain': 0
}
data['format'] = data['format'].map(format_map)
In [7]:
# number
data['number'].value_counts()
Out[7]:
small    2827
none      549
big       545
Name: number, dtype: int64

Represent multi-class variables with dummy variables

In [8]:
data = data.join(pd.get_dummies(data['number'], prefix = 'number'))
data = data.drop(['number'], axis = 1)
data.head()
Out[8]:
spam to_multiple from cc sent_email image attach dollar winner inherit ... num_char line_breaks format re_subj exclaim_subj urgent_subj exclaim_mess number_big number_none number_small
0 0 0 1 0 0 0 0 0 0 0 ... 11.370 202 1 0 0 0 0 1 0 0
1 0 0 1 0 0 0 0 0 0 0 ... 10.504 202 1 0 0 0 1 0 0 1
2 0 0 1 0 0 0 0 1 0 1 ... 7.773 192 1 0 0 0 6 0 0 1
3 0 0 1 0 0 0 0 0 0 0 ... 13.256 255 1 0 0 0 48 0 0 1
4 0 0 1 0 0 0 0 0 0 0 ... 1.231 29 0 0 0 0 1 0 1 0

5 rows × 21 columns
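The three `number_*` columns always sum to 1, so any one of them is fully determined by the other two. For linear models this redundancy can make coefficients harder to interpret. One common option, shown here as a sketch on a made-up `toy` column and not as what the notebook does, is to pass `drop_first=True` to `pd.get_dummies`.

```python
import pandas as pd

# Toy stand-in for the 'number' column.
toy = pd.DataFrame({"number": ["big", "small", "none", "small"]})

# Full encoding: one indicator per category.
full = pd.get_dummies(toy["number"], prefix="number")

# Reduced encoding: the first category (alphabetically) is dropped and becomes the baseline.
reduced = pd.get_dummies(toy["number"], prefix="number", drop_first=True)

# Every row of the full encoding has exactly one indicator set.
row_sums = full.astype(int).sum(axis=1)

print(list(full.columns))
print(list(reduced.columns))
```

Tree-based models are not affected by the redundant column, so keeping all three indicators is harmless for the decision tree in section 3.3.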

2.3. Exploratory data analysis

Dependent variable: distribution

In [9]:
print(data['spam'].unique())
data_label_count = data.groupby('spam').count()['cc'].sort_values(ascending = False)
print(data_label_count)
data_label_count.plot.bar()
plt.show()
[0 1]
spam
0    3554
1     367
Name: cc, dtype: int64
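The counts above show a strongly imbalanced target. A quick calculation, using only the two counts printed by the cell, gives the spam share and the accuracy that a classifier would reach by always predicting "not spam". This is why the later sections report precision, recall, and F1 instead of accuracy.

```python
# Class counts copied from the output above.
counts = {0: 3554, 1: 367}

total = sum(counts.values())
spam_rate = counts[1] / total
baseline_accuracy = counts[0] / total  # always predicting class 0

print(f"total emails: {total}")
print(f"spam share: {spam_rate:.3f}")
print(f"majority-class accuracy: {baseline_accuracy:.3f}")
```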

Independent variables: distributions & correlation analysis

In [10]:
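import seaborn as sns
graph_by_variables = data.columns
plt.figure(figsize = (15, 18))
# One subplot per column (21 columns after dummy encoding, hence the 7 x 3 grid)
for i, col in enumerate(graph_by_variables):
    plt.subplot(7, 3, i+1)
    # Note: distplot is deprecated since seaborn 0.11; histplot(..., kde=True) is its replacement
    sns.distplot(data[col])
    plt.title(col)
plt.tight_layout()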
In [11]:
f, ax = plt.subplots(figsize = (15, 15))
sns.heatmap(data.corr(), annot = True, linewidths = 0.5, fmt = '.1f', ax = ax)
Out[11]:
<matplotlib.axes._subplots.AxesSubplot at 0x293a0b176d8>
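A 21 x 21 heatmap can be hard to read when the main question is which features relate to the target. A possible complement, sketched here on synthetic data because the correlation values from email.csv are not shown in this document, is to pull out the `spam` column of the correlation matrix and sort it by absolute value. The column names `signal` and `noise` are hypothetical.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
n = 500
spam = rng.integers(0, 2, size=n)

toy = pd.DataFrame({
    "spam": spam,
    "signal": spam + rng.normal(0, 0.5, size=n),  # constructed to track the label
    "noise": rng.normal(0, 1, size=n),            # unrelated to the label
})

# Correlation of each feature with the target, strongest first.
target_corr = toy.corr()["spam"].drop("spam").abs().sort_values(ascending=False)
print(target_corr)
```

With the real frame the same expression would be `data.corr()["spam"].drop("spam")`. Pearson correlation only captures linear association, so a low value does not rule out a useful feature.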

2.4. Train/test split

In [12]:
from sklearn.model_selection import train_test_split
# Splitting into train and test sets
X = data.drop(['spam'], axis = 1)
y = data['spam']
# test_size = 0.2 holds out 20% of the data as the test set and keeps 80% for training (a 4:1 split)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size = 0.2, random_state = 628)
print(len(X_train), len(X_test))
3136 785
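The split above is purely random, so the spam share can differ between the two parts. The root node of the decision tree in section 3.3 reports 298 spam emails among the 3136 training rows, which leaves 69 spam emails among the 785 test rows. With a minority class this small, a stratified split is a common alternative. The sketch below uses placeholder features and only the class counts from this document; `stratify=y` is the relevant parameter.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Labels with the same class counts as the dataset; features are placeholders.
y = np.array([0] * 3554 + [1] * 367)
X = np.arange(len(y)).reshape(-1, 1)

# stratify=y keeps the spam share (almost) identical in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X, y, test_size=0.2, random_state=628, stratify=y
)

print(len(X_tr), len(X_te))
print(f"spam share, train: {y_tr.mean():.4f}  test: {y_te.mean():.4f}")
```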

3. Spam detection

3.1. Logistic regression

Model building

In [40]:
from sklearn.linear_model import LogisticRegression
# Fitting a logistic regression model with default parameters
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
Out[40]:
LogisticRegression()
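The model above is fitted with default settings on unscaled features, and warnings are suppressed, so a convergence warning would not be visible. Two adjustments that are often tried in this situation are standardizing the features and reweighting the classes. The sketch below illustrates both on synthetic imbalanced data from `make_classification`; it is not a rerun of the notebook, and the resulting numbers say nothing about email.csv.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data with roughly 10% positives, standing in for the spam class.
X_syn, y_syn = make_classification(
    n_samples=2000, n_features=10, weights=[0.9, 0.1], random_state=0
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_syn, y_syn, test_size=0.2, random_state=0, stratify=y_syn
)

# Same model family, with and without class reweighting.
plain = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
balanced = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, class_weight="balanced"),
)
plain.fit(X_tr, y_tr)
balanced.fit(X_tr, y_tr)

recall_plain = recall_score(y_te, plain.predict(X_te))
recall_balanced = recall_score(y_te, balanced.predict(X_te))
print(f"recall, default weights:  {recall_plain:.3f}")
print(f"recall, balanced weights: {recall_balanced:.3f}")
```

Class reweighting typically trades precision for recall, so both metrics should be compared before adopting it.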

Evaluation

In [15]:
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import roc_auc_score
In [42]:
# Prediction & Evaluation
y_hat_test = logreg.predict(X_test)
# Logistic Regression score
print("Logistic regression score for test set:")
print("Precision: {:.3f}".format(precision_score(y_test, y_hat_test)), "Recall: {:.3f}".format(recall_score(y_test, y_hat_test)))
print("F1 score: {:.3f}".format(f1_score(y_test, y_hat_test)))
# Note: AUC is computed here from hard 0/1 predictions, so it equals balanced accuracy;
# passing logreg.predict_proba(X_test)[:, 1] would give the usual ranking-based AUC
print("AUC score: {:.3f}".format(roc_auc_score(y_test, y_hat_test)))
Logistic regression score for test set:
Precision: 0.667 Recall: 0.116
F1 score: 0.198
AUC score: 0.555
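The AUC line passes predicted class labels to `roc_auc_score`. With hard 0/1 predictions the ROC curve has a single interior point, and the resulting value equals balanced accuracy, the mean of the true positive rate and the true negative rate. The toy example below, with made-up labels and scores, shows the difference from an AUC computed on continuous scores.

```python
import numpy as np
from sklearn.metrics import balanced_accuracy_score, roc_auc_score

# Made-up ground truth and model scores for ten examples.
y_true = np.array([0, 0, 0, 0, 1, 1, 1, 0, 1, 0])
scores = np.array([0.10, 0.20, 0.30, 0.45, 0.40, 0.60, 0.90, 0.05, 0.35, 0.55])

# Hard labels at the usual 0.5 threshold, as predict() would return.
y_pred = (scores >= 0.5).astype(int)

auc_from_labels = roc_auc_score(y_true, y_pred)   # what the notebook computes
auc_from_scores = roc_auc_score(y_true, scores)   # ranking-based AUC
bal_acc = balanced_accuracy_score(y_true, y_pred)

print(f"AUC from hard labels: {auc_from_labels:.3f}")
print(f"balanced accuracy:    {bal_acc:.3f}")
print(f"AUC from scores:      {auc_from_scores:.3f}")
```

For the fitted models in this notebook, the continuous scores would come from `predict_proba(X_test)[:, 1]`.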

3.2. Naive Bayes

Model building

In [43]:
from sklearn.naive_bayes import BernoulliNB
# Fitting a Naive Bayes model with default parameters
clf = BernoulliNB()
clf.fit(X_train, y_train)
Out[43]:
BernoulliNB()
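`BernoulliNB` models binary features. With its default `binarize=0.0`, every input value greater than 0 is turned into 1 before fitting and predicting. Count-like or continuous columns such as `num_char`, `line_breaks`, and `exclaim_mess` are therefore reduced to "zero or not zero", and a column that is positive for nearly every email presumably carries little information afterwards. The sketch below demonstrates the thresholding on synthetic data; the feature values are invented.

```python
import numpy as np
from sklearn.naive_bayes import BernoulliNB

rng = np.random.default_rng(0)
X_toy = np.column_stack([
    rng.integers(0, 2, size=200),     # already binary, like winner
    rng.exponential(5.0, size=200),   # continuous and positive, like num_char
])
y_toy = rng.integers(0, 2, size=200)

# Model fitted on the raw values (binarized internally at 0.0).
nb_raw = BernoulliNB().fit(X_toy, y_toy)

# Model fitted on explicitly thresholded values.
nb_bin = BernoulliNB().fit((X_toy > 0).astype(float), y_toy)

same_params = np.allclose(nb_raw.feature_log_prob_, nb_bin.feature_log_prob_)
same_preds = np.array_equal(nb_raw.predict(X_toy), nb_bin.predict(X_toy))
print(same_params, same_preds)
```

Both models end up identical, which confirms that the magnitude of the continuous column is ignored. Choosing an explicit threshold per column, or a different Naive Bayes variant for the numeric columns, are possible alternatives.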

Evaluation

In [44]:
# Prediction & Evaluation
y_hat_test = clf.predict(X_test)
# Naive Bayes score
print("Naive Bayes score for test set:")
print("Precision: {:.3f}".format(precision_score(y_test, y_hat_test)), "Recall: {:.3f}".format(recall_score(y_test, y_hat_test)))
print("F1 score: {:.3f}".format(f1_score(y_test, y_hat_test)))
# Note: AUC is computed here from hard 0/1 predictions, so it equals balanced accuracy;
# passing clf.predict_proba(X_test)[:, 1] would give the usual ranking-based AUC
print("AUC score: {:.3f}".format(roc_auc_score(y_test, y_hat_test)))
Naive Bayes score for test set:
Precision: 0.500 Recall: 0.377
F1 score: 0.430
AUC score: 0.670
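The F1 score is the harmonic mean of precision and recall, F1 = 2PR / (P + R), so it stays low whenever either of the two is low. The short check below recomputes F1 from the rounded precision and recall values reported for logistic regression and Naive Bayes. The helper `f1_from` is introduced here for illustration only.

```python
def f1_from(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# (precision, recall, reported F1) as printed in sections 3.1 and 3.2.
reported = {
    "logistic regression": (0.667, 0.116, 0.198),
    "naive Bayes": (0.500, 0.377, 0.430),
}

recomputed = {}
for name, (p, r, f1) in reported.items():
    recomputed[name] = round(f1_from(p, r), 3)
    print(f"{name}: recomputed F1 = {recomputed[name]:.3f}, reported F1 = {f1:.3f}")
```

Logistic regression has the higher precision of the two, but its recall of 0.116 pulls its F1 below that of Naive Bayes.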

3.3. Decision tree

Model building

In [13]:
from sklearn.tree import DecisionTreeClassifier
# Fitting a decision tree model with default parameters
dt = DecisionTreeClassifier()
dt.fit(X_train, y_train)
Out[13]:
DecisionTreeClassifier()
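With default parameters the tree grows until its leaves are pure, which tends to overfit. Since no `random_state` is set, results can also vary from run to run. The sketch below shows, on synthetic data, how `max_depth`, `min_samples_leaf`, and `random_state` limit and fix the tree, and how `export_text` prints a compact view of a small tree. The parameter values are arbitrary examples, not tuned settings for email.csv.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic imbalanced data; feature names f0..f7 are placeholders.
X_syn, y_syn = make_classification(
    n_samples=1000, n_features=8, weights=[0.9, 0.1], random_state=0
)
feature_names = [f"f{i}" for i in range(X_syn.shape[1])]

# Unrestricted tree versus a depth-limited tree.
full = DecisionTreeClassifier(random_state=0).fit(X_syn, y_syn)
shallow = DecisionTreeClassifier(
    max_depth=3, min_samples_leaf=10, random_state=0
).fit(X_syn, y_syn)

print("nodes, unrestricted:", full.tree_.node_count)
print("nodes, depth-limited:", shallow.tree_.node_count)

# Text rendering of the small tree; readable without Graphviz.
print(export_text(shallow, feature_names=feature_names))
```

Suitable limits would normally be chosen by cross-validation on the training set.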

Evaluation

In [16]:
# Prediction & Evaluation
y_hat_test = dt.predict(X_test)
# Decision Tree score
print("Decision tree score for test set:")
print("Precision: {:.3f}".format(precision_score(y_test, y_hat_test)), "Recall: {:.3f}".format(recall_score(y_test, y_hat_test)))
print("F1 score: {:.3f}".format(f1_score(y_test, y_hat_test)))
# Note: AUC is computed here from hard 0/1 predictions, so it equals balanced accuracy;
# passing dt.predict_proba(X_test)[:, 1] would give the usual ranking-based AUC
print("AUC score: {:.3f}".format(roc_auc_score(y_test, y_hat_test)))
Decision tree score for test set:
Precision: 0.603 Recall: 0.681
F1 score: 0.639
AUC score: 0.819
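The four scores can be traced back to confusion-matrix counts. The dataset has 3554 non-spam and 367 spam emails, and the root node of the fitted tree reports 2838 and 298 of them in the training set, so the test set holds 716 non-spam and 69 spam emails. The counts of true and false positives below are inferred from the reported scores and are not printed anywhere in the notebook; treat them as a reconstruction.

```python
# Test-set class sizes derived from counts shown in this document.
test_ham = 3554 - 2838   # non-spam emails in the test set
test_spam = 367 - 298    # spam emails in the test set

# Inferred counts (assumption): chosen to reproduce the reported scores.
tp, fp = 47, 31
fn = test_spam - tp
tn = test_ham - fp

precision = tp / (tp + fp)
recall = tp / (tp + fn)
f1 = 2 * tp / (2 * tp + fp + fn)
# AUC on hard labels equals the mean of recall and specificity.
auc_hard = (recall + tn / (tn + fp)) / 2

print(f"TN={tn} FP={fp} FN={fn} TP={tp}")
print(f"Precision: {precision:.3f} Recall: {recall:.3f}")
print(f"F1 score: {f1:.3f}")
print(f"AUC score: {auc_hard:.3f}")
```

In practice `sklearn.metrics.confusion_matrix(y_test, y_hat_test)` would give these counts directly.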

Decision tree visualization

In [19]:
from sklearn import tree
import graphviz
dot_data = tree.export_graphviz(dt, feature_names = X.columns, filled = True, class_names = True,out_file=None)  
graph = graphviz.Source(dot_data)
graph
Out[19]:
Tree 0 line_breaks <= 46.5 gini = 0.172 samples = 3136 value = [2838, 298] class = y[0] 1 sent_email <= 0.5 gini = 0.334 samples = 952 value = [750, 202] class = y[0] 0->1 True 302 number_none <= 0.5 gini = 0.084 samples = 2184 value = [2088, 96] class = y[0] 0->302 False 2 num_char <= 0.538 gini = 0.419 samples = 677 value = [475, 202] class = y[0] 1->2 301 gini = 0.0 samples = 275 value = [275, 0] class = y[0] 1->301 3 to_multiple <= 0.5 gini = 0.498 samples = 163 value = [77, 86] class = y[1] 2->3 94 line_breaks <= 15.5 gini = 0.349 samples = 514 value = [398, 116] class = y[0] 2->94 4 re_subj <= 0.5 gini = 0.486 samples = 137 value = [57, 80] class = y[1] 3->4 85 attach <= 0.5 gini = 0.355 samples = 26 value = [20, 6] class = y[0] 3->85 5 format <= 0.5 gini = 0.475 samples = 131 value = [51, 80] class = y[1] 4->5 84 gini = 0.0 samples = 6 value = [6, 0] class = y[0] 4->84 6 attach <= 0.5 gini = 0.494 samples = 110 value = [49, 61] class = y[1] 5->6 77 exclaim_mess <= 1.5 gini = 0.172 samples = 21 value = [2, 19] class = y[1] 5->77 7 number_big <= 0.5 gini = 0.498 samples = 83 value = [44, 39] class = y[0] 6->7 66 num_char <= 0.043 gini = 0.302 samples = 27 value = [5, 22] class = y[1] 6->66 8 line_breaks <= 11.5 gini = 0.481 samples = 72 value = [43, 29] class = y[0] 7->8 61 line_breaks <= 8.5 gini = 0.165 samples = 11 value = [1, 10] class = y[1] 7->61 9 num_char <= 0.399 gini = 0.471 samples = 29 value = [11, 18] class = y[1] 8->9 34 exclaim_subj <= 0.5 gini = 0.381 samples = 43 value = [32, 11] class = y[0] 8->34 10 num_char <= 0.148 gini = 0.435 samples = 25 value = [8, 17] class = y[1] 9->10 31 num_char <= 0.45 gini = 0.375 samples = 4 value = [3, 1] class = y[0] 9->31 11 line_breaks <= 5.5 gini = 0.498 samples = 15 value = [7, 8] class = y[1] 10->11 26 num_char <= 0.296 gini = 0.18 samples = 10 value = [1, 9] class = y[1] 10->26 12 exclaim_mess <= 0.5 gini = 0.444 samples = 12 value = [4, 8] class = y[1] 11->12 25 gini = 0.0 samples = 3 value = [3, 0] 
class = y[0] 11->25 13 num_char <= 0.054 gini = 0.48 samples = 10 value = [4, 6] class = y[1] 12->13 24 gini = 0.0 samples = 2 value = [0, 2] class = y[1] 12->24 14 line_breaks <= 2.5 gini = 0.32 samples = 5 value = [1, 4] class = y[1] 13->14 19 num_char <= 0.059 gini = 0.48 samples = 5 value = [3, 2] class = y[0] 13->19 15 num_char <= 0.002 gini = 0.5 samples = 2 value = [1, 1] class = y[0] 14->15 18 gini = 0.0 samples = 3 value = [0, 3] class = y[1] 14->18 16 gini = 0.0 samples = 1 value = [0, 1] class = y[1] 15->16 17 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 15->17 20 gini = 0.0 samples = 2 value = [2, 0] class = y[0] 19->20 21 num_char <= 0.071 gini = 0.444 samples = 3 value = [1, 2] class = y[1] 19->21 22 gini = 0.0 samples = 2 value = [0, 2] class = y[1] 21->22 23 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 21->23 27 gini = 0.0 samples = 7 value = [0, 7] class = y[1] 26->27 28 num_char <= 0.313 gini = 0.444 samples = 3 value = [1, 2] class = y[1] 26->28 29 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 28->29 30 gini = 0.0 samples = 2 value = [0, 2] class = y[1] 28->30 32 gini = 0.0 samples = 3 value = [3, 0] class = y[0] 31->32 33 gini = 0.0 samples = 1 value = [0, 1] class = y[1] 31->33 35 number_none <= 0.5 gini = 0.363 samples = 42 value = [32, 10] class = y[0] 34->35 60 gini = 0.0 samples = 1 value = [0, 1] class = y[1] 34->60 36 cc <= 0.5 gini = 0.48 samples = 15 value = [9, 6] class = y[0] 35->36 49 winner <= 0.5 gini = 0.252 samples = 27 value = [23, 4] class = y[0] 35->49 37 num_char <= 0.481 gini = 0.5 samples = 12 value = [6, 6] class = y[0] 36->37 48 gini = 0.0 samples = 3 value = [3, 0] class = y[0] 36->48 38 password <= 0.5 gini = 0.444 samples = 9 value = [3, 6] class = y[1] 37->38 47 gini = 0.0 samples = 3 value = [3, 0] class = y[0] 37->47 39 line_breaks <= 13.5 gini = 0.375 samples = 8 value = [2, 6] class = y[1] 38->39 46 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 38->46 40 num_char <= 0.345 gini = 0.444 samples 
= 3 value = [2, 1] class = y[0] 39->40 45 gini = 0.0 samples = 5 value = [0, 5] class = y[1] 39->45 41 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 40->41 42 num_char <= 0.413 gini = 0.5 samples = 2 value = [1, 1] class = y[0] 40->42 43 gini = 0.0 samples = 1 value = [0, 1] class = y[1] 42->43 44 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 42->44 50 num_char <= 0.416 gini = 0.211 samples = 25 value = [22, 3] class = y[0] 49->50 57 line_breaks <= 18.0 gini = 0.5 samples = 2 value = [1, 1] class = y[0] 49->57 51 exclaim_mess <= 1.5 gini = 0.105 samples = 18 value = [17, 1] class = y[0] 50->51 54 num_char <= 0.449 gini = 0.408 samples = 7 value = [5, 2] class = y[0] 50->54 52 gini = 0.0 samples = 13 value = [13, 0] class = y[0] 51->52 53 gini = 0.32 samples = 5 value = [4, 1] class = y[0] 51->53 55 gini = 0.0 samples = 2 value = [0, 2] class = y[1] 54->55 56 gini = 0.0 samples = 5 value = [5, 0] class = y[0] 54->56 58 gini = 0.0 samples = 1 value = [0, 1] class = y[1] 57->58 59 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 57->59 62 num_char <= 0.258 gini = 0.444 samples = 3 value = [1, 2] class = y[1] 61->62 65 gini = 0.0 samples = 8 value = [0, 8] class = y[1] 61->65 63 gini = 0.0 samples = 2 value = [0, 2] class = y[1] 62->63 64 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 62->64 67 gini = 0.0 samples = 2 value = [2, 0] class = y[0] 66->67 68 num_char <= 0.447 gini = 0.211 samples = 25 value = [3, 22] class = y[1] 66->68 69 cc <= 0.5 gini = 0.397 samples = 11 value = [3, 8] class = y[1] 68->69 76 gini = 0.0 samples = 14 value = [0, 14] class = y[1] 68->76 70 gini = 0.0 samples = 6 value = [0, 6] class = y[1] 69->70 71 exclaim_mess <= 0.5 gini = 0.48 samples = 5 value = [3, 2] class = y[0] 69->71 72 line_breaks <= 20.0 gini = 0.444 samples = 3 value = [2, 1] class = y[0] 71->72 75 gini = 0.5 samples = 2 value = [1, 1] class = y[0] 71->75 73 gini = 0.5 samples = 2 value = [1, 1] class = y[0] 72->73 74 gini = 0.0 samples = 1 value = [1, 0] 
class = y[0] 72->74 78 line_breaks <= 7.5 gini = 0.095 samples = 20 value = [1, 19] class = y[1] 77->78 83 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 77->83 79 gini = 0.0 samples = 12 value = [0, 12] class = y[1] 78->79 80 line_breaks <= 8.5 gini = 0.219 samples = 8 value = [1, 7] class = y[1] 78->80 81 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 80->81 82 gini = 0.0 samples = 7 value = [0, 7] class = y[1] 80->82 86 gini = 0.0 samples = 15 value = [15, 0] class = y[0] 85->86 87 num_char <= 0.321 gini = 0.496 samples = 11 value = [5, 6] class = y[1] 85->87 88 gini = 0.0 samples = 4 value = [4, 0] class = y[0] 87->88 89 line_breaks <= 20.0 gini = 0.245 samples = 7 value = [1, 6] class = y[1] 87->89 90 gini = 0.0 samples = 3 value = [0, 3] class = y[1] 89->90 91 exclaim_mess <= 1.0 gini = 0.375 samples = 4 value = [1, 3] class = y[1] 89->91 92 gini = 0.5 samples = 2 value = [1, 1] class = y[0] 91->92 93 gini = 0.0 samples = 2 value = [0, 2] class = y[1] 91->93 95 num_char <= 1.234 gini = 0.499 samples = 67 value = [35, 32] class = y[0] 94->95 136 dollar <= 0.5 gini = 0.305 samples = 447 value = [363, 84] class = y[0] 94->136 96 exclaim_mess <= 0.5 gini = 0.464 samples = 52 value = [33, 19] class = y[0] 95->96 129 re_subj <= 0.5 gini = 0.231 samples = 15 value = [2, 13] class = y[1] 95->129 97 line_breaks <= 9.5 gini = 0.32 samples = 30 value = [24, 6] class = y[0] 96->97 114 num_char <= 0.893 gini = 0.483 samples = 22 value = [9, 13] class = y[1] 96->114 98 number_none <= 0.5 gini = 0.48 samples = 5 value = [2, 3] class = y[1] 97->98 103 num_char <= 0.71 gini = 0.211 samples = 25 value = [22, 3] class = y[0] 97->103 99 gini = 0.0 samples = 2 value = [0, 2] class = y[1] 98->99 100 num_char <= 0.962 gini = 0.444 samples = 3 value = [2, 1] class = y[0] 98->100 101 gini = 0.0 samples = 2 value = [2, 0] class = y[0] 100->101 102 gini = 0.0 samples = 1 value = [0, 1] class = y[1] 100->102 104 to_multiple <= 0.5 gini = 0.408 samples = 7 value = [5, 2] class = 
y[0] 103->104 109 num_char <= 1.062 gini = 0.105 samples = 18 value = [17, 1] class = y[0] 103->109 105 exclaim_subj <= 0.5 gini = 0.444 samples = 3 value = [1, 2] class = y[1] 104->105 108 gini = 0.0 samples = 4 value = [4, 0] class = y[0] 104->108 106 gini = 0.0 samples = 2 value = [0, 2] class = y[1] 105->106 107 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 105->107 110 gini = 0.0 samples = 14 value = [14, 0] class = y[0] 109->110 111 line_breaks <= 13.5 gini = 0.375 samples = 4 value = [3, 1] class = y[0] 109->111 112 gini = 0.0 samples = 3 value = [3, 0] class = y[0] 111->112 113 gini = 0.0 samples = 1 value = [0, 1] class = y[1] 111->113 115 num_char <= 0.78 gini = 0.18 samples = 10 value = [1, 9] class = y[1] 114->115 120 re_subj <= 0.5 gini = 0.444 samples = 12 value = [8, 4] class = y[0] 114->120 116 gini = 0.0 samples = 7 value = [0, 7] class = y[1] 115->116 117 num_char <= 0.826 gini = 0.444 samples = 3 value = [1, 2] class = y[1] 115->117 118 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 117->118 119 gini = 0.0 samples = 2 value = [0, 2] class = y[1] 117->119 121 exclaim_mess <= 2.5 gini = 0.32 samples = 10 value = [8, 2] class = y[0] 120->121 128 gini = 0.0 samples = 2 value = [0, 2] class = y[1] 120->128 122 num_char <= 0.989 gini = 0.198 samples = 9 value = [8, 1] class = y[0] 121->122 127 gini = 0.0 samples = 1 value = [0, 1] class = y[1] 121->127 123 gini = 0.0 samples = 4 value = [4, 0] class = y[0] 122->123 124 num_char <= 1.005 gini = 0.32 samples = 5 value = [4, 1] class = y[0] 122->124 125 gini = 0.0 samples = 1 value = [0, 1] class = y[1] 124->125 126 gini = 0.0 samples = 4 value = [4, 0] class = y[0] 124->126 130 num_char <= 1.712 gini = 0.133 samples = 14 value = [1, 13] class = y[1] 129->130 135 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 129->135 131 gini = 0.0 samples = 9 value = [0, 9] class = y[1] 130->131 132 num_char <= 1.805 gini = 0.32 samples = 5 value = [1, 4] class = y[1] 130->132 133 gini = 0.0 samples = 1 
value = [1, 0] class = y[0] 132->133 134 gini = 0.0 samples = 4 value = [0, 4] class = y[1] 132->134 137 format <= 0.5 gini = 0.255 samples = 380 value = [323, 57] class = y[0] 136->137 256 num_char <= 1.939 gini = 0.481 samples = 67 value = [40, 27] class = y[0] 136->256 138 num_char <= 0.722 gini = 0.181 samples = 269 value = [242, 27] class = y[0] 137->138 205 num_char <= 0.895 gini = 0.394 samples = 111 value = [81, 30] class = y[0] 137->205 139 line_breaks <= 16.5 gini = 0.393 samples = 52 value = [38, 14] class = y[0] 138->139 166 winner <= 0.5 gini = 0.113 samples = 217 value = [204, 13] class = y[0] 138->166 140 gini = 0.0 samples = 14 value = [14, 0] class = y[0] 139->140 141 number_small <= 0.5 gini = 0.465 samples = 38 value = [24, 14] class = y[0] 139->141 142 line_breaks <= 26.0 gini = 0.346 samples = 18 value = [14, 4] class = y[0] 141->142 151 to_multiple <= 0.5 gini = 0.5 samples = 20 value = [10, 10] class = y[0] 141->151 143 attach <= 0.5 gini = 0.133 samples = 14 value = [13, 1] class = y[0] 142->143 146 cc <= 0.5 gini = 0.375 samples = 4 value = [1, 3] class = y[1] 142->146 144 gini = 0.0 samples = 12 value = [12, 0] class = y[0] 143->144 145 gini = 0.5 samples = 2 value = [1, 1] class = y[0] 143->145 147 to_multiple <= 0.5 gini = 0.5 samples = 2 value = [1, 1] class = y[0] 146->147 150 gini = 0.0 samples = 2 value = [0, 2] class = y[1] 146->150 148 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 147->148 149 gini = 0.0 samples = 1 value = [0, 1] class = y[1] 147->149 152 num_char <= 0.684 gini = 0.469 samples = 16 value = [6, 10] class = y[1] 151->152 165 gini = 0.0 samples = 4 value = [4, 0] class = y[0] 151->165 153 num_char <= 0.659 gini = 0.346 samples = 9 value = [2, 7] class = y[1] 152->153 160 line_breaks <= 19.5 gini = 0.49 samples = 7 value = [4, 3] class = y[0] 152->160 154 line_breaks <= 25.5 gini = 0.444 samples = 6 value = [2, 4] class = y[1] 153->154 159 gini = 0.0 samples = 3 value = [0, 3] class = y[1] 153->159 155 num_char 
<= 0.65 gini = 0.32 samples = 5 value = [1, 4] class = y[1] 154->155 158 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 154->158 156 gini = 0.0 samples = 4 value = [0, 4] class = y[1] 155->156 157 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 155->157 161 gini = 0.0 samples = 3 value = [3, 0] class = y[0] 160->161 162 num_char <= 0.692 gini = 0.375 samples = 4 value = [1, 3] class = y[1] 160->162 163 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 162->163 164 gini = 0.0 samples = 3 value = [0, 3] class = y[1] 162->164 167 number_small <= 0.5 gini = 0.098 samples = 214 value = [203, 11] class = y[0] 166->167 202 number_small <= 0.5 gini = 0.444 samples = 3 value = [1, 2] class = y[1] 166->202 168 line_breaks <= 16.5 gini = 0.204 samples = 78 value = [69, 9] class = y[0] 167->168 191 num_char <= 1.53 gini = 0.029 samples = 136 value = [134, 2] class = y[0] 167->191 169 to_multiple <= 0.5 gini = 0.375 samples = 4 value = [1, 3] class = y[1] 168->169 172 num_char <= 0.813 gini = 0.149 samples = 74 value = [68, 6] class = y[0] 168->172 170 gini = 0.0 samples = 3 value = [0, 3] class = y[1] 169->170 171 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 169->171 173 gini = 0.0 samples = 24 value = [24, 0] class = y[0] 172->173 174 num_char <= 0.822 gini = 0.211 samples = 50 value = [44, 6] class = y[0] 172->174 175 gini = 0.0 samples = 1 value = [0, 1] class = y[1] 174->175 176 exclaim_mess <= 0.5 gini = 0.183 samples = 49 value = [44, 5] class = y[0] 174->176 177 exclaim_subj <= 0.5 gini = 0.298 samples = 22 value = [18, 4] class = y[0] 176->177 188 number_big <= 0.5 gini = 0.071 samples = 27 value = [26, 1] class = y[0] 176->188 178 line_breaks <= 24.0 gini = 0.245 samples = 21 value = [18, 3] class = y[0] 177->178 187 gini = 0.0 samples = 1 value = [0, 1] class = y[1] 177->187 179 gini = 0.0 samples = 8 value = [8, 0] class = y[0] 178->179 180 line_breaks <= 27.0 gini = 0.355 samples = 13 value = [10, 3] class = y[0] 178->180 181 gini = 0.0 samples = 2 
value = [0, 2] class = y[1] 180->181 182 num_char <= 1.196 gini = 0.165 samples = 11 value = [10, 1] class = y[0] 180->182 183 number_big <= 0.5 gini = 0.444 samples = 3 value = [2, 1] class = y[0] 182->183 186 gini = 0.0 samples = 8 value = [8, 0] class = y[0] 182->186 184 gini = 0.0 samples = 2 value = [2, 0] class = y[0] 183->184 185 gini = 0.0 samples = 1 value = [0, 1] class = y[1] 183->185 189 gini = 0.0 samples = 26 value = [26, 0] class = y[0] 188->189 190 gini = 0.0 samples = 1 value = [0, 1] class = y[1] 188->190 192 gini = 0.0 samples = 107 value = [107, 0] class = y[0] 191->192 193 num_char <= 1.538 gini = 0.128 samples = 29 value = [27, 2] class = y[0] 191->193 194 gini = 0.0 samples = 1 value = [0, 1] class = y[1] 193->194 195 to_multiple <= 0.5 gini = 0.069 samples = 28 value = [27, 1] class = y[0] 193->195 196 num_char <= 2.142 gini = 0.165 samples = 11 value = [10, 1] class = y[0] 195->196 201 gini = 0.0 samples = 17 value = [17, 0] class = y[0] 195->201 197 gini = 0.0 samples = 9 value = [9, 0] class = y[0] 196->197 198 line_breaks <= 42.0 gini = 0.5 samples = 2 value = [1, 1] class = y[0] 196->198 199 gini = 0.0 samples = 1 value = [0, 1] class = y[1] 198->199 200 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 198->200 203 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 202->203 204 gini = 0.0 samples = 2 value = [0, 2] class = y[1] 202->204 206 number_small <= 0.5 gini = 0.484 samples = 17 value = [7, 10] class = y[1] 205->206 221 exclaim_mess <= 0.5 gini = 0.335 samples = 94 value = [74, 20] class = y[0] 205->221 207 line_breaks <= 22.0 gini = 0.408 samples = 7 value = [5, 2] class = y[0] 206->207 214 num_char <= 0.71 gini = 0.32 samples = 10 value = [2, 8] class = y[1] 206->214 208 urgent_subj <= 0.5 gini = 0.444 samples = 3 value = [1, 2] class = y[1] 207->208 213 gini = 0.0 samples = 4 value = [4, 0] class = y[0] 207->213 209 exclaim_mess <= 1.0 gini = 0.5 samples = 2 value = [1, 1] class = y[0] 208->209 212 gini = 0.0 samples = 1 
value = [0, 1] class = y[1] 208->212 210 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 209->210 211 gini = 0.0 samples = 1 value = [0, 1] class = y[1] 209->211 215 line_breaks <= 21.0 gini = 0.444 samples = 3 value = [2, 1] class = y[0] 214->215 220 gini = 0.0 samples = 7 value = [0, 7] class = y[1] 214->220 216 num_char <= 0.688 gini = 0.5 samples = 2 value = [1, 1] class = y[0] 215->216 219 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 215->219 217 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 216->217 218 gini = 0.0 samples = 1 value = [0, 1] class = y[1] 216->218 222 num_char <= 2.458 gini = 0.18 samples = 50 value = [45, 5] class = y[0] 221->222 235 num_char <= 1.164 gini = 0.449 samples = 44 value = [29, 15] class = y[0] 221->235 223 line_breaks <= 31.5 gini = 0.083 samples = 46 value = [44, 2] class = y[0] 222->223 232 number_none <= 0.5 gini = 0.375 samples = 4 value = [1, 3] class = y[1] 222->232 224 gini = 0.0 samples = 35 value = [35, 0] class = y[0] 223->224 225 num_char <= 1.903 gini = 0.298 samples = 11 value = [9, 2] class = y[0] 223->225 226 num_char <= 1.629 gini = 0.444 samples = 6 value = [4, 2] class = y[0] 225->226 231 gini = 0.0 samples = 5 value = [5, 0] class = y[0] 225->231 227 gini = 0.0 samples = 3 value = [3, 0] class = y[0] 226->227 228 to_multiple <= 0.5 gini = 0.444 samples = 3 value = [1, 2] class = y[1] 226->228 229 gini = 0.0 samples = 2 value = [0, 2] class = y[1] 228->229 230 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 228->230 233 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 232->233 234 gini = 0.0 samples = 3 value = [0, 3] class = y[1] 232->234 236 gini = 0.0 samples = 6 value = [0, 6] class = y[1] 235->236 237 line_breaks <= 42.5 gini = 0.361 samples = 38 value = [29, 9] class = y[0] 235->237 238 line_breaks <= 30.0 gini = 0.238 samples = 29 value = [25, 4] class = y[0] 237->238 251 num_char <= 1.775 gini = 0.494 samples = 9 value = [4, 5] class = y[1] 237->251 239 from <= 0.5 gini = 0.444 samples 
= 12 value = [8, 4] class = y[0] 238->239 250 gini = 0.0 samples = 17 value = [17, 0] class = y[0] 238->250 240 gini = 0.0 samples = 1 value = [0, 1] class = y[1] 239->240 241 num_char <= 2.356 gini = 0.397 samples = 11 value = [8, 3] class = y[0] 239->241 242 num_char <= 1.303 gini = 0.5 samples = 6 value = [3, 3] class = y[0] 241->242 249 gini = 0.0 samples = 5 value = [5, 0] class = y[0] 241->249 243 gini = 0.0 samples = 2 value = [2, 0] class = y[0] 242->243 244 num_char <= 1.503 gini = 0.375 samples = 4 value = [1, 3] class = y[1] 242->244 245 num_char <= 1.36 gini = 0.5 samples = 2 value = [1, 1] class = y[0] 244->245 248 gini = 0.0 samples = 2 value = [0, 2] class = y[1] 244->248 246 gini = 0.0 samples = 1 value = [0, 1] class = y[1] 245->246 247 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 245->247 252 gini = 0.0 samples = 3 value = [3, 0] class = y[0] 251->252 253 exclaim_mess <= 1.5 gini = 0.278 samples = 6 value = [1, 5] class = y[1] 251->253 254 gini = 0.0 samples = 5 value = [0, 5] class = y[1] 253->254 255 gini = 0.0 samples = 1 value = [1, 0] class = y[0] 253->255 257 num_char <= 0.884 gini = 0.499 samples = 44 value = [21, 23] class = y[1] 256->257 294 exclaim_subj <= 0.5 gini = 0.287 samples = 23 value = [19, 4] class = y[0] 256->294 258 number_none <= 0.5 gini = 0.346 samples = 9 value = [7, 2] class = y[0] 257->258 263 line_breaks <= 24.0 gini = 0.48 samples = 35 value = [14, 21] class = y[1] 257->263 259 exclaim_subj <= 0.5 gini = 0.219 samples = 8 value = [7, 1] class = y[0] 258->259 262 gini = 0.0 samples = 1 value = [0, 1] class = y[1] 258->262 260 gini = 0.0 samples = 7 value = [7, 0] class = y[0] 259->260 261 gini = 0.0 samples = 1 value = [0, 1] class = y[1] 259->261 264 gini = 0.0 samples = 5 value = [0, 5] class = y[1] 263->264 265 format <= 0.5 gini = 0.498 samples = 30 value = [14, 16] class = y[1] 263->265 266 line_breaks <= 35.5 gini = 0.499 samples = 27 value = [14, 13] class = y[0] 265->266 293 gini = 0.0 samples = 3 value = 
[Figure: graphviz rendering of the decision tree fitted in the previous cell, with node ids running up to 542.]

Preventing the tree from overgrowing

In [24]:
from sklearn.tree import DecisionTreeClassifier
# Fit a decision tree whose depth is capped at 4 to keep it from overgrowing
dt = DecisionTreeClassifier(max_depth=4)
dt.fit(X_train, y_train)
Out[24]:
DecisionTreeClassifier(max_depth=4)
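The effect of capping the depth can be checked directly by comparing tree sizes. The sketch below is only an illustration: it uses synthetic, imbalanced data from `make_classification` rather than email.csv, and the names `X_demo`, `y_demo`, `full_tree` and `shallow_tree` are hypothetical and not part of the notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the email data (assumption: roughly 10% positives, some label noise)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], flip_y=0.05, random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

# One tree with no growth limit, one capped at depth 4
full_tree = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)
shallow_tree = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

for name, model in [("unrestricted", full_tree), ("max_depth=4", shallow_tree)]:
    print(
        "{:>12}: depth={:2d} nodes={:4d} train acc={:.3f} test acc={:.3f}".format(
            name,
            model.get_depth(),
            model.tree_.node_count,
            model.score(X_tr, y_tr),
            model.score(X_te, y_te),
        )
    )
```

A binary tree of depth 4 has at most 31 nodes, so the capped tree stays small enough to read, while the unrestricted tree keeps splitting until its leaves are pure.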
In [25]:
# Prediction and evaluation on the held-out test set
y_hat_test = dt.predict(X_test)
# Decision tree scores (precision_score, recall_score, f1_score, roc_auc_score are from sklearn.metrics)
print("Decision tree score for test set:")
print("Precision: {:.3f}".format(precision_score(y_test, y_hat_test)), "Recall: {:.3f}".format(recall_score(y_test, y_hat_test)))
print("F1 score: {:.3f}".format(f1_score(y_test, y_hat_test)))
# Note: roc_auc_score receives hard 0/1 predictions here, not predicted probabilities
print("AUC score: {:.3f}".format(roc_auc_score(y_test, y_hat_test)))
Decision tree score for test set:
Precision: 0.649 Recall: 0.348
F1 score: 0.453
AUC score: 0.665
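The AUC above is computed from hard class predictions. In that case the ROC curve has a single interior point and the area reduces to the mean of recall and specificity (balanced accuracy). If that reading is right, the reported recall of 0.348 and AUC of 0.665 imply a specificity of roughly 0.98. The sketch below illustrates the difference between the two ways of calling `roc_auc_score`. It runs on synthetic data, not email.csv, and all variable names are hypothetical.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import balanced_accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic, imbalanced stand-in for the email data (assumption)
X_demo, y_demo = make_classification(
    n_samples=2000, n_features=20, n_informative=5,
    weights=[0.9, 0.1], random_state=0,
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, stratify=y_demo, random_state=0
)

clf = DecisionTreeClassifier(max_depth=4, random_state=0).fit(X_tr, y_tr)

hard_labels = clf.predict(X_te)              # 0/1 predictions
spam_proba = clf.predict_proba(X_te)[:, 1]   # estimated probability of class 1

auc_from_labels = roc_auc_score(y_te, hard_labels)
auc_from_proba = roc_auc_score(y_te, spam_proba)
bal_acc = balanced_accuracy_score(y_te, hard_labels)

print("AUC from hard labels:   {:.3f}".format(auc_from_labels))
print("Balanced accuracy:      {:.3f}".format(bal_acc))
print("AUC from probabilities: {:.3f}".format(auc_from_proba))
```

Passing probabilities lets the metric evaluate the ranking across all thresholds, which is usually what is meant by AUC. The number obtained that way is not comparable to the 0.665 reported above.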
In [26]:
from sklearn import tree
import graphviz
dot_data = tree.export_graphviz(dt, feature_names=X.columns, filled=True, class_names=True, out_file=None)
graph = graphviz.Source(dot_data)
graph
Out[26]:
Tree (depth 4; under each split, the first child is the True branch and the second the False branch)
[0] line_breaks <= 46.5 | gini = 0.172, samples = 3136, value = [2838, 298], class = y[0]
  [1] sent_email <= 0.5 | gini = 0.334, samples = 952, value = [750, 202], class = y[0]
    [2] num_char <= 0.538 | gini = 0.419, samples = 677, value = [475, 202], class = y[0]
      [3] to_multiple <= 0.5 | gini = 0.498, samples = 163, value = [77, 86], class = y[1]
        [4] leaf | gini = 0.486, samples = 137, value = [57, 80], class = y[1]
        [5] leaf | gini = 0.355, samples = 26, value = [20, 6], class = y[0]
      [6] line_breaks <= 15.5 | gini = 0.349, samples = 514, value = [398, 116], class = y[0]
        [7] leaf | gini = 0.499, samples = 67, value = [35, 32], class = y[0]
        [8] leaf | gini = 0.305, samples = 447, value = [363, 84], class = y[0]
    [9] leaf | gini = 0.0, samples = 275, value = [275, 0], class = y[0]
  [10] number_none <= 0.5 | gini = 0.084, samples = 2184, value = [2088, 96], class = y[0]
    [11] urgent_subj <= 0.5 | gini = 0.061, samples = 2113, value = [2046, 67], class = y[0]
      [12] winner <= 0.5 | gini = 0.06, samples = 2111, value = [2046, 65], class = y[0]
        [13] leaf | gini = 0.053, samples = 2073, value = [2017, 56], class = y[0]
        [14] leaf | gini = 0.361, samples = 38, value = [29, 9], class = y[0]
      [15] leaf | gini = 0.0, samples = 2, value = [0, 2], class = y[1]
    [16] num_char <= 8.988 | gini = 0.483, samples = 71, value = [42, 29], class = y[0]
      [17] exclaim_mess <= 4.5 | gini = 0.298, samples = 44, value = [36, 8], class = y[0]
        [18] leaf | gini = 0.245, samples = 42, value = [36, 6], class = y[0]
        [19] leaf | gini = 0.0, samples = 2, value = [0, 2], class = y[1]
      [20] num_char <= 14.905 | gini = 0.346, samples = 27, value = [6, 21], class = y[1]
        [21] leaf | gini = 0.087, samples = 22, value = [1, 21], class = y[1]
        [22] leaf | gini = 0.0, samples = 5, value = [5, 0], class = y[0]
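When a graphviz rendering is not available, a small tree can also be printed as indented text with `sklearn.tree.export_text`. The following is a minimal sketch on synthetic data. The feature names `feat_0`, `feat_1`, ... and the variables `small_tree` and `report` are made up for illustration.

```python
from sklearn.datasets import make_classification
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data with hypothetical feature names (not the email features)
X_demo, y_demo = make_classification(
    n_samples=500, n_features=6, n_informative=3, random_state=0
)
feature_names = ["feat_{}".format(i) for i in range(X_demo.shape[1])]

small_tree = DecisionTreeClassifier(max_depth=3, random_state=0).fit(X_demo, y_demo)

# export_text returns the split rules as an indented string, one line per node
report = export_text(small_tree, feature_names=feature_names)
print(report)
```

In this notebook, `export_text(dt, feature_names=list(X.columns))` should give the analogous listing, assuming `dt` and `X` are defined as in the cells above.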
In [ ]: