{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 机器学习与社会科学应用"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第三章 经典分类算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第一节 Logit算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" >郭峰    \n",
    "    教授、博士生导师  \n",
    "上海财经大学公共经济与管理学院  \n",
    "上海财经大学数实融合与智能治理实验室  \n",
    "邮箱：guofengsfi@163.com</font> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" >本节目录  \n",
    "1.1. 二分类  \n",
    "1.2. 多分类  </font>  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.1. 二分类"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "数据集介绍：  \n",
    "- 胎儿健康分类：https://www.kaggle.com/andrewmvd/fetal-health-classification\n",
    "- 该数据包含胎儿心电图，胎动，子宫收缩等特征值，而我们所需要做的就是通过这些特征值来对胎儿的健康状况(fetal_health)进行分类。\n",
    "- 数据集包含从心电图检查中提取的2126条特征记录，然后由三名产科专家将其分为3类，并用数字来代表：1-普通的，2-疑似病例，3-确定病例。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "path = \"D:/python/机器学习与社会科学应用/演示数据/03经典分类算法/\"\n",
    "health = pd.read_csv(path+\"fetal_health.csv\", encoding='utf8')\n",
    "print(health.shape)\n",
    "print(\"无病样本量:\", len(health[health['fetal_health']==1]))\n",
    "print(\"疑似样本量:\", len(health[health['fetal_health']==2]))\n",
    "print(\"确诊样本量:\", len(health[health['fetal_health']==3]))\n",
    "health.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "logit模型超参数：  \n",
    "- penalty：正则化项的选择。正则化主要有两种：L1和L2，默认选择L2正则化。（相较于ridge和lasso这两个默认正则化的线性算法来说，自由选择正则化是这个算法的一大优点。）  \n",
    "- C ：正则化强度（正则化系数λ）的倒数; 必须是大于0的浮点数。 与支持向量机一样，较小的值指定更强的正则化，通常默认为1。  \n",
    "- solver ：‘newton-cg’,‘lbfgs’,‘liblinear’,‘sag’,'saga'。(默认参数：liblinear )  \n",
    "    - liblinear：使用了开源的liblinear库实现，内部使用了坐标轴下降法来迭代优化损失函数。  \n",
    "    - lbfgs：拟牛顿法的一种，利用损失函数二阶导数矩阵即海森矩阵来迭代优化损失函数。  \n",
    "    - newton-cg：也是牛顿法家族的一种，利用损失函数二阶导数矩阵即海森矩阵来迭代优化损失函数。  \n",
    "    - sag：即随机平均梯度下降，是梯度下降法的变种，和普通梯度下降法的区别是每次迭代仅仅用一部分的样本来计算梯度，适合于样本数据多的时候。  \n",
    "    - saga：线性收敛的随机优化算法。  \n",
    "- max_iter：求解器收敛的最大迭代次数。（通常来说越大越好）"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#导入第三方库\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import winreg\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.model_selection import train_test_split # 划分数据集与测试集\n",
    "from sklearn.linear_model import LogisticRegression # 导入算法模块\n",
    "from sklearn.metrics import accuracy_score # 导入评分模块\n",
    "\n",
    "from sklearn import metrics\n",
    "from sklearn.metrics import precision_recall_curve\n",
    "from sklearn.metrics import average_precision_score\n",
    "from sklearn.metrics import roc_curve, auc\n",
    "from sklearn.model_selection import StratifiedKFold\n",
    "from sklearn.preprocessing import label_binarize\n",
    "from sklearn.metrics import confusion_matrix,precision_score,accuracy_score,recall_score,f1_score,roc_auc_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 导入数据\n",
    "path = \"D:/python/机器学习与社会科学应用/演示数据/03经典分类算法/\"\n",
    "health = pd.read_csv(path+\"fetal_health.csv\", encoding='utf8')\n",
    "\n",
    "def replace(x):\n",
    "    if x == 1:\n",
    "        return 0\n",
    "    else:\n",
    "        return 1\n",
    "    \n",
    "health['fetal_health'] = health['fetal_health'].apply(lambda x:replace(x)) # 1、2替换为0、1\n",
    "print('确诊+疑似样本量：', len(health[health['fetal_health']==1]))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "tags": []
   },
   "outputs": [],
   "source": [
    "# 定义特征变量和响应变量\n",
    "X = health.drop([\"fetal_health\"], axis=1)\n",
    "y = health[\"fetal_health\"]\n",
    "\n",
    "#训练集和测试集划分\n",
    "X_train,X_test,y_train,y_test = train_test_split(X,y,random_state=1)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 训练模型\n",
    "logistic = LogisticRegression(penalty='l1',C=1,solver='liblinear')  #超参函数详见上文\n",
    "logistic.fit(X_train,y_train)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#预测\n",
    "y_pred = logistic.predict(X_test)\n",
    "print(y_pred[0:10])\n",
    "y_pred_pro = logistic.predict_proba(X_test)[:,1] # 计算auc需要概率结果\n",
    "print(y_pred_pro[0:10])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 模型评估\n",
    "print(\"Logistic训练集评分：\"+str(accuracy_score(y_train,logistic.predict(X_train))))\n",
    "print(\"Logistic测试集评分：\"+str(accuracy_score(y_test,logistic.predict(X_test))))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 其他评价指标\n",
    "print(\"accuracy_score:\", accuracy_score(y_test, y_pred))\n",
    "print(\"precision_score:\", metrics.precision_score(y_test, y_pred))\n",
    "print(\"recall_score:\", metrics.recall_score(y_test, y_pred))\n",
    "print(\"f1_score:\", metrics.f1_score(y_test, y_pred))\n",
    "print(\"auc:\",auc)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#画图\n",
    "auc = roc_auc_score(y_pred,y_pred_pro,average='micro')\n",
    "fpr, tpr, thresholds = roc_curve(y_test.ravel(),y_pred_pro.ravel())\n",
    "plt.figure()\n",
    "plt.figure(figsize = (10,10))\n",
    "plt.plot(fpr, tpr, linewidth = 2,color='darkorange',linestyle = ':',label = 'AUC_Sigmoid=%.3f' % auc)\n",
    "plt.plot([0, 1], [0, 1], color = 'navy', linestyle = '--')\n",
    "plt.xlim([0.0, 1.0])\n",
    "plt.ylim([0.0, 1.05])\n",
    "plt.xlabel('False Positive Rate')\n",
    "plt.ylabel('True Positive Rate')\n",
    "plt.title('Receiver operating characteristic example')\n",
    "plt.legend(loc = \"lower right\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 1.2. 多分类"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#导入第三方库\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import winreg\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn.model_selection import train_test_split # 划分数据集与测试集\n",
    "from sklearn.linear_model import LogisticRegression # 导入算法模块\n",
    "from sklearn.metrics import accuracy_score # 导入评分模块\n",
    "\n",
    "from sklearn import metrics\n",
    "from sklearn.metrics import precision_recall_curve\n",
    "from sklearn.metrics import average_precision_score\n",
    "from sklearn.metrics import roc_curve, auc\n",
    "from sklearn.model_selection import StratifiedKFold\n",
    "from sklearn.preprocessing import label_binarize\n",
    "from sklearn.metrics import confusion_matrix,precision_score,accuracy_score,recall_score,f1_score,roc_auc_score"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 导入数据\n",
    "path = \"D:/python/机器学习与社会科学应用/演示数据/03经典分类算法/\"\n",
    "health = pd.read_csv(path+\"fetal_health.csv\",encoding='utf8')\n",
    "\n",
    "# 指定特征变量与响应变量\n",
    "X = health.drop([\"fetal_health\"],axis = 1)\n",
    "y=health[\"fetal_health\"]\n",
    "\n",
    "# 划分训练集与测试集\n",
    "X_train,X_test,y_train,y_test = train_test_split(X, y, random_state=1)\n",
    "print(X_train.shape)\n",
    "print(X_test.shape)\n",
    "print(y_train.shape)\n",
    "print(y_test.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#引入LogisticRegression算法，并将算法中的参数依次设立好\n",
    "logistic = LogisticRegression(penalty='l2',C = 1,solver = 'lbfgs',max_iter = 1000)\n",
    "logistic.fit(X_train,y_train)\n",
    "\n",
    "y_pred = logistic.predict(X_test)\n",
    "y_pred_pro = logistic.predict_proba(X_test) # 计算auc需要概率结果"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "y_test_one_hot = label_binarize(y_test, classes=[1, 2, 3])\n",
    "y_pred_one_hot = label_binarize(y_pred, classes=[1, 2, 3])\n",
    "auc = roc_auc_score(y_test_one_hot,y_pred_pro,average='micro')\n",
    "fpr, tpr, thresholds = roc_curve(y_test_one_hot.ravel(),y_pred_pro.ravel())"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 模型评估\n",
    "print(\"Logistic训练集模型评分：\"+str(accuracy_score(y_train,logistic.predict(X_train))))\n",
    "print(\"Logistic测试集模型评分：\"+str(accuracy_score(y_test,logistic.predict(X_test))))\n",
    "print(\"accuracy_score:\", accuracy_score(y_test, y_pred))\n",
    "print(\"precision_score_macro:\", metrics.precision_score(y_test, y_pred,average='macro'))\n",
    "print(\"recall_score_macro:\", metrics.recall_score(y_test, y_pred,average='macro'))\n",
    "print(\"f1_score_macro:\", metrics.f1_score(y_test, y_pred,average='macro'))\n",
    "print(\"precision_score_micro:\", metrics.precision_score(y_test, y_pred,average='micro'))\n",
    "print(\"recall_score_micro:\", metrics.recall_score(y_test, y_pred,average='micro'))\n",
    "print(\"f1_score_micro:\", metrics.f1_score(y_test, y_pred,average='micro'))\n",
    "print(\"auc:\",auc)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 画图\n",
    "plt.figure()\n",
    "plt.figure(figsize = (10,10))\n",
    "plt.plot(fpr, tpr, linewidth = 2,color = 'darkorange',linestyle = ':',label = 'AUC_Sigmoid=%.3f' % auc)\n",
    "plt.plot([0, 1], [0, 1], color = 'navy', linestyle='--')\n",
    "plt.xlim([0.0, 1.0])\n",
    "plt.ylim([0.0, 1.05])\n",
    "plt.xlabel('False Positive Rate')\n",
    "plt.ylabel('True Positive Rate')\n",
    "plt.title('Receiver operating characteristic example')\n",
    "plt.legend(loc = \"lower right\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 本节结束"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
