{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 机器学习与社会科学应用"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第三章 经典分类算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第三节 贝叶斯分类算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" >郭峰    \n",
    "    教授、博士生导师  \n",
    "上海财经大学公共经济与管理学院  \n",
    "上海财经大学数实融合与智能治理实验室  \n",
    "邮箱：guofengsfi@163.com</font> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" >本节目录：  \n",
    "3.1.朴素贝叶斯的三种类型  \n",
    "3.2.朴素贝叶斯算法参数  \n",
    "3.3.演示性案例：网站恶意留言过滤    \n",
    "3.4.实战案例：泰坦尼克号幸存者</font>  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.1.朴素贝叶斯分类算法的三种类型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" >在scikit-learn中，一共有3个朴素贝叶斯的分类算法类。分别是GaussianNB，MultinomialNB和BernoulliNB。其中GaussianNB就是先验为高斯分布的朴素贝叶斯,MultinomialNB就是先验为多项式分布的朴素贝叶斯,而BernoulliNB就是先验为伯努利分布的朴素贝叶斯。</font>  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  高斯朴素贝叶斯"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 高斯朴素贝叶斯  \n",
    "from sklearn import datasets\n",
    "from sklearn.naive_bayes import GaussianNB\n",
    "from sklearn.model_selection import train_test_split  \n",
    "\n",
    "iris = datasets.load_iris()\n",
    "\n",
    "clf = GaussianNB()\n",
    "x = iris.data # 特征  \n",
    "y = iris.target # 标签 \n",
    "X_train,X_test,y_train,y_test = train_test_split(x,y, test_size=0.3)\n",
    "\n",
    "clf = clf.fit(x,y)\n",
    "score = clf.score(X_test,y_test)\n",
    "print(\"score:\", score) # 输出在测试集上的分数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  多项分布朴素贝叶斯"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    "from sklearn.naive_bayes import MultinomialNB\n",
    "from sklearn.model_selection import train_test_split \n",
    "\n",
    "iris = datasets.load_iris()\n",
    "\n",
    "clf = MultinomialNB()\n",
    "x = iris.data # 特征  \n",
    "y = iris.target # 标签 \n",
    "X_train,X_test,y_train,y_test = train_test_split(x,y, test_size=0.3)\n",
    "clf = clf.fit(x,y)\n",
    "score = clf.score(X_test,y_test)\n",
    "print(\"score:\", score) # 输出在测试集上的分数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  伯努利朴素贝叶斯"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.naive_bayes import BernoulliNB\n",
    "\n",
    "iris = datasets.load_iris()\n",
    "\n",
    "clf = BernoulliNB()\n",
    "x = iris.data # 特征  \n",
    "y = iris.target # 标签 \n",
    "X_train,X_test,y_train,y_test = train_test_split(x,y, test_size=0.3)\n",
    "clf = clf.fit(X_train,y_train)\n",
    "clf = clf.fit(x,y)\n",
    "score = clf.score(X_test,y_test)\n",
    "print(\"score:\", score) # 输出在测试集上的分数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.2 朴素贝叶斯算法参数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 参数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.naive_bayes import GaussianNB\n",
    "clf = GaussianNB()\n",
    "# alpha:先验平滑因子，默认等于1，当等于1时表示拉普拉斯平滑。\n",
    "# fit_prior:是否去学习类的先验概率，默认是True\n",
    "# class_prior:各个类别的先验概率，如果没有指定，则模型会根据数据自动学习， 每个类别的先验概率相同，等于类标记总个数N分之一。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.naive_bayes import BernoulliNB\n",
    "clf = BernoulliNB()\n",
    "# alpha:平滑因子，与多项式中的alpha一致。\n",
    "# binarize:样本特征二值化的阈值，默认是0。如果不输入，则模型会认为所有特征都已经是二值化形式了；如果输入具体的值，则模型会把大于该值的部分归为一类，小于的归为另一类。\n",
    "# fit_prior:是否去学习类的先验概率，默认是True\n",
    "# class_prior:各个类别的先验概率，如果没有指定，则模型会根据数据自动学习， 每个类别的先验概率相同，等于类标记总个数N分之一。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 多项式\n",
    "clf = MultinomialNB(alpha=1.0,fit_prior=True)\n",
    "# 参数说明如下：\n",
    "# alpha：浮点型可选参数，默认为1.0，其实就是添加拉普拉斯平滑，即为上述公式中的λ ，如果这个参数设置为0，就是不添加平滑；\n",
    "# fit_prior：布尔型可选参数，默认为True。布尔参数fit_prior表示是否要考虑先验概率，如果是false,则所有的样本类别输出都有相同的类别先验概率。\n",
    "# 否则可以自己用第三个参数class_prior输入先验概率，或者不输入第三个参数class_prior让MultinomialNB自己从训练集样本来计算先验概率，\n",
    "# 此时的先验概率为P(Y=Ck)=mk/m。其中m为训练集样本总数量，mk为输出为第k类别的训练集样本数。\n",
    "# class_prior：可选参数，默认为None。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 算法\"方法\""
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- fit(X,Y):在数据集(X,Y)上拟合模型。  \n",
    "- get_params():获取模型参数。  \n",
    "- predict(X):对数据集X进行预测。  \n",
    "- predict_log_proba(X):对数据集X预测，得到每个类别的概率对数值。  \n",
    "- predict_proba(X):对数据集X预测，得到每个类别的概率。  \n",
    "- score(X,Y):得到模型在数据集(X,Y)的得分情况。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 方法格式\n",
    "clf = BernoulliNB()\n",
    "x = iris.data\n",
    "y = iris.target\n",
    "X_train,X_test,y_train,y_test = train_test_split(x,y, test_size=0.3)\n",
    "clf = clf.fit(X_train,y_train)\n",
    "y_pred=clf.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 方法格式-修改，为了和普遍标准保持统一，X要大写，y小写\n",
    "clf = BernoulliNB()\n",
    "X = iris.data\n",
    "y = iris.target\n",
    "X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.3)\n",
    "clf = clf.fit(X_train,y_train)\n",
    "y_pred=clf.predict(X_test)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.3. 演示性案例：网站恶意留言过滤"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" >有一堆已经清理好的留言单词及它的所属类，现在根据已有的数据求一条新的留言所属分类。\n",
    "样本数据：六条已经划分好了的留言数据集，以及各自对应的分类。具体看下面的代码及注释：</font>  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 这是一个没有使用sklearn，从底层代码开始实现贝叶斯算法过程\n",
    "from numpy import *\n",
    " \n",
    "# 加载数据\n",
    "def loadDataSet():\n",
    "    postingList = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],\n",
    "                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],\n",
    "                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],\n",
    "                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],\n",
    "                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],\n",
    "                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]\n",
    "    classVec = [0, 1, 0, 1, 0, 1]\n",
    "    return postingList, classVec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 合并所有单词，利用set来去重，得到所有单词的唯一列表\n",
    "def createVocabList(dataSet):\n",
    "    vocabSet = set([])\n",
    "    for document in dataSet:\n",
    "        vocabSet = vocabSet | set(document)\n",
    "    return list(vocabSet)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 优化词集模型,将单词列表变为数字向量列表\n",
    "def bagOfWords2VecMN(vocabList, inputSet):\n",
    "    returnVec = [0] * len(vocabList)    #获得所有单词等长的0列表\n",
    "    for word in inputSet:\n",
    "        if word in vocabList:\n",
    "            returnVec[vocabList.index(word)] += 1   #对应单词位置加1\n",
    "    return returnVec"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 返回的是0、1各自两个分类中每个单词数量除以该分类单词总量再取对数ln 以及0、1两类的比例\n",
    "def trainNB0(trainMatrix, trainCategory):\n",
    "    numTrainDocs = len(trainMatrix)  # 样本数\n",
    "    numWords = len(trainMatrix[0])  # 特征数\n",
    "    pAbusive = sum(trainCategory) / float(numTrainDocs)  # 1类所占比例\n",
    "    p0Num = ones(numWords)\n",
    "    p1Num = ones(numWords)  #初始化所有单词为1\n",
    "    p0Denom = 2.0\n",
    "    p1Denom = 2.0  #初始化总单词为2        后面解释为什么这四个不初始化为0\n",
    "    for i in range(numTrainDocs):\n",
    "        if trainCategory[i] == 1:       # 求1类\n",
    "            p1Num += trainMatrix[i]\n",
    "            p1Denom += sum(trainMatrix[i])\n",
    "        else:\n",
    "            p0Num += trainMatrix[i]     # 求0类\n",
    "            p0Denom += sum(trainMatrix[i])\n",
    "    p1Vect = log(p1Num / p1Denom)  # numpy数组 / float = 1中每个单词/1中总单词\n",
    "    p0Vect = log(p0Num / p0Denom)  # 这里为什么还用ln来处理，后面说明\n",
    "    return p0Vect, p1Vect, pAbusive"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# P(X|C)判断各类别的概率大小（这里是0、1）\n",
    "def classifyNB(vec2Classify, p0Vec, p1Vec, pClass1):\n",
    "    p1 = sum(vec2Classify * p1Vec) + log(pClass1)  # 相乘后得到哪些单词存在，再求和，再+log(P(C))\n",
    "    p0 = sum(vec2Classify * p0Vec) + log(1.0 - pClass1) # 由于使用的是ln，这里其实都是对数相加\n",
    "    if p1 > p0:\n",
    "        return 1\n",
    "    else:\n",
    "        return 0"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "#封装调用的函数\n",
    "def testingNB():\n",
    "    listOPosts, listClasses = loadDataSet()\n",
    "    print( listOPosts)\n",
    "    print(listClasses)\n",
    "    myVocabList = createVocabList(listOPosts)\n",
    "    print(len(myVocabList))\n",
    "    print(myVocabList )\n",
    "    trainMat = []\n",
    "    print(bagOfWords2VecMN(myVocabList, listOPosts[0]))\n",
    "    for postinDoc in listOPosts:\n",
    "        trainMat.append(bagOfWords2VecMN(myVocabList, postinDoc))\n",
    "    print(trainMat)\n",
    "    p0V, p1V, pAb = trainNB0(array(trainMat), array(listClasses))\n",
    "    # 上面求出了0、1两个类中各单词所占该类的比例，以及0、1的比例\n",
    " \n",
    "    # 下面是预测两条样本数据的类别\n",
    "    testEntry = ['love', 'my', 'dalmation']\n",
    "    thisDoc = array(bagOfWords2VecMN(myVocabList, testEntry)) #先将测试数据转为numpy的词袋模型 [0 2 0 5 1 0 0 3 ...]\n",
    "    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb)) #传值判断\n",
    " \n",
    "    testEntry = ['stupid', 'garbage']\n",
    "    thisDoc = array(bagOfWords2VecMN(myVocabList, testEntry))\n",
    "    print(testEntry, 'classified as: ', classifyNB(thisDoc, p0V, p1V, pAb))\n",
    "\n",
    "if __name__==\"__main__\":\n",
    "    testingNB()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 3.4. 实战案例：泰坦尼克号幸存者"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" >背景：泰坦尼克号幸存者，广泛应用于机器学习各章算法的演示  \n",
    "包含泰坦尼克号乘客的个人信息以及是否从那场海难中生还  \n",
    "方法：高斯朴素贝叶斯。  \n",
    "特征：船舱等级、性别、年龄、兄弟姐妹数目、父母/子女数量、票价和登船口岸  \n",
    "响应：是否获救</font>  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#步骤1：加载并清理数据\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "import time\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.naive_bayes import GaussianNB, BernoulliNB, MultinomialNB\n",
    "\n",
    "path = \"D:/python/机器学习与社会科学应用/演示数据/03经典分类算法/titanic/data/\"\n",
    "\n",
    "# 导入数据集\n",
    "data_train = pd.read_csv(path+\"train.csv\",encoding='utf8')\n",
    "#print(data_train.head())\n",
    "\n",
    "# 将分类变量转换为数字\n",
    "data_train[\"Sex_cleaned\"] = np.where(data_train[\"Sex\"]==\"male\",0,1)\n",
    "data_train[\"Embarked_cleaned\"] = np.where(data_train[\"Embarked\"]==\"S\",0,\n",
    "                                 np.where(data_train[\"Embarked\"]==\"C\",1,\n",
    "                                 np.where(data_train[\"Embarked\"]==\"Q\",2,\n",
    "                                          3)))\n",
    "\n",
    "# 清除数据集中的非数字值（NaN）\n",
    "data_train = data_train[[\"Survived\",\"Pclass\",\"Sex_cleaned\",\"Age\",\"SibSp\",\"Parch\",\"Fare\",\"Embarked_cleaned\"]]\n",
    "data_train.sample(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "data_train = data_train.dropna(axis=0, how='any')\n",
    "X = data_train[[\"Pclass\",\"Sex_cleaned\",\"Age\",\"SibSp\",\"Parch\",\"Fare\",\"Embarked_cleaned\"]]\n",
    "y = data_train[[\"Survived\"]]\n",
    "y.sample(10)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 将数据集拆分成训练集和测试集\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=666)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 步骤2：利用训练集数据测试模型\n",
    "# 实例化分类器\n",
    "gnb = GaussianNB()\n",
    "\n",
    "# 训练分类器\n",
    "gnb.fit(X_train,y_train)\n",
    "y_pred = gnb.predict(X_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 打印结果\n",
    "print(\"Number of mislabeled points out of a total {} points : {}, performance {:05.2f}%\"\n",
    "      .format(\n",
    "          X_test.shape[0],\n",
    "          (y_test[\"Survived\"] != y_pred).sum(),\n",
    "          100*(1-(y_test[\"Survived\"] != y_pred).sum()/X_test.shape[0])\n",
    "))\n",
    "#注：分类器成功率78.15%，比网文稍微低一些"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 步骤3：泛化到完全新的数据集\n",
    "# 导入test数据集并整理\n",
    "f1 = open(path+\"test.csv\",encoding='utf8')\n",
    "data_test = pd.read_csv(f1,header=0,sep=',')\n",
    "\n",
    "# 将分类变量转换为数字\n",
    "data_test[\"Sex_cleaned\"] = np.where(data_test[\"Sex\"]==\"male\",0,1)\n",
    "data_test[\"Embarked_cleaned\"] = np.where(data_test[\"Embarked\"]==\"S\",0,\n",
    "                                np.where(data_test[\"Embarked\"]==\"C\",1,\n",
    "                                np.where(data_test[\"Embarked\"]==\"Q\",2,\n",
    "                                         3)))\n",
    "\n",
    "# 清除数据集中的非数字值（NaN）\n",
    "data_test=data_test[[\"Pclass\",\"Sex_cleaned\",\"Age\",\"SibSp\",\"Parch\",\"Fare\",\"Embarked_cleaned\"]].dropna(axis=0, how='any')"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "used_features = [\"Pclass\",\"Sex_cleaned\",\"Age\",\"SibSp\",\"Parch\",\"Fare\",\"Embarked_cleaned\"]\n",
    "\n",
    "y_pred_test = gnb.predict(data_test[used_features])\n",
    "print('泛化到测试集后的预测结果')\n",
    "print(y_pred_test)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 本节结束"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
