{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 机器学习与社会科学应用"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第三章 经典分类算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第五节 支持向量机算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" >郭峰    \n",
    "    教授、博士生导师  \n",
    "上海财经大学公共经济与管理学院  \n",
    "上海财经大学数实融合与智能治理实验室  \n",
    "邮箱：guofengsfi@163.com</font> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" >本讲目录  \n",
    "5.1. SVM基本实现  \n",
    "5.2. Sklearn-SVM接口说明  \n",
    "5.3. SVM网格搜索调参  \n",
    "</font> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.1. SVM的基本实现"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 本程序运用Sklearn自带的数据集加载方式加载手写数字数据集，并利用sklearn库中的SVM算法以及libsvm中的SVM算法演示Python中的SVM算法实现，以及各个参数调试\n",
    "import matplotlib.pyplot as plt  \n",
    "from sklearn import datasets, svm, metrics  \n",
    "\n",
    "# 利用sklearn.datasets加载数据集演示  \n",
    "digits = datasets.load_digits()  \n",
    "images_and_labels = list(zip(digits.images, digits.target))  \n",
    "print(len(digits.target))  \n",
    "print(digits.target[0:20])  \n",
    "print(digits.images.shape)  \n",
    "print(digits.images[0])  \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline  \n",
    "#显示数据0-9,前10个数据正好是0-9，方便演示  \n",
    "for index, (image, label) in enumerate(images_and_labels[:10]):  \n",
    "    plt.subplot(3, 4, index + 1)  \n",
    "    # 方便显示，关闭坐标轴  \n",
    "    plt.axis('off')  \n",
    "    # 颜色映射，方便显示  \n",
    "    plt.imshow(image, cmap=plt.cm.Greys)  \n",
    "    plt.title('Labels: %i' % label)  \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 原数据是8x8的矩阵，代表着像素点，为了训练需要，需要把数据转换成向量形式  \n",
    "# 每一个像素点代表一个特征  \n",
    "n_samples = len(digits.images)  \n",
    "# 把数据全部变成一维，直接运用reshape的方法，-1代表着第二维自动确定  \n",
    "data = digits.images.reshape((n_samples, -1))  \n",
    "print(data.shape)  \n",
    "print(data[0])  \n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#训练集测试集分割  \n",
    "from sklearn.model_selection import train_test_split  \n",
    "X_train,X_test,y_train,y_test = train_test_split(data,digits.target,test_size=0.25,stratify=digits.target)  \n",
    " \n",
    "# Sklearn模型训练  \n",
    "svm_classifier = svm.SVC(gamma=0.001)  \n",
    "svm_classifier.fit(X_train, y_train)  \n",
    " \n",
    "# 评价模型：运用classification_report报告  \n",
    "from sklearn.metrics import classification_report  \n",
    "print(classification_report(svm_classifier.predict(X_test),y_test))  \n",
    "# 运用sklearn-svm自带评分  \n",
    "print('训练集得分:',svm_classifier.score(X_train,y_train))  \n",
    "print('测试集得分:',svm_classifier.score(X_test,y_test))  \n"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 5.2. Sklearn-SVM接口说明"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 分类问题"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- sklearn.svm.NuSVC()\n",
    "- sklearn.svm.LinearSVC()\n",
    "- sklearn.svm.SVC()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sklearn.svm.SVC(C=1.0, kernel='rbf', degree=3, gamma='auto', coef0=0.0, shrinking=True, \n",
    "                probability=False, tol=0.001, cache_size=200, class_weight=None, \n",
    "                verbose=False, max_iter=-1, decision_function_shape='ovr', \n",
    "                random_state=None)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 参数说明"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- **C （float参数 默认值为1.0）**\n",
    "\n",
    "表示错误项的惩罚系数C越大，即对分错样本的惩罚程度越大，因此在训练样本中准确率越高，但是泛化能力降低；相反，减小C的话，容许训练样本中有一些误分类错误样本，泛化能力强。对于训练样本带有噪声的情况，一般采用后者，把训练样本集中错误分类的样本作为噪声。\n",
    "\n",
    "- **kernel （str参数 默认为‘rbf’）**\n",
    "\n",
    "该参数用于选择模型所使用的核函数，算法中常用的核函数有：\n",
    "    - linear：线性核函数\n",
    "    - poly：多项式核函数\n",
    "    - rbf：径像核函数/高斯核\n",
    "    - sigmod：sigmod核函数\n",
    "    - precomputed：核矩阵，该矩阵表示自己事先计算好的，输入后算法内部将使用你提供的矩阵进行计算\n",
    "    \n",
    "- **degree （int型参数 默认为3）**\n",
    "\n",
    "该参数只对'kernel=poly'(多项式核函数)有用，是指多项式核函数的阶数n，如果给的核函数参数是其他核函数，则会自动忽略该参数。\n",
    "\n",
    "- **gamma （float参数 默认为auto）**\n",
    "\n",
    "该参数为核函数系数，只对‘rbf’,‘poly’,‘sigmod’有效。如果gamma设置为auto，代表其值为样本特征数的倒数，即1/n_features，也有其他值可设定。\n",
    "\n",
    "- **coef0:（float参数 默认为0.0）**\n",
    "\n",
    "该参数表示核函数中的独立项，只有对‘poly’和‘sigmod’核函数有用，是指其中的参数c。\n",
    "\n",
    "- **probability（ bool参数 默认为False）**\n",
    "\n",
    "该参数表示是否启用概率估计。 这必须在调用fit()之前启用，并且会使fit()方法速度变慢。\n",
    "\n",
    "- **shrinkintol: float参数 默认为1e^-3g（bool参数 默认为True）**\n",
    "\n",
    "该参数表示是否选用启发式收缩方式。\n",
    "\n",
    "- **tol（ float参数 默认为1e^-3）**\n",
    "\n",
    "svm停止训练的误差精度，也即阈值。\n",
    "\n",
    "- **cache_size（float参数 默认为200）**\n",
    "\n",
    "该参数表示指定训练所需要的内存，以MB为单位，默认为200MB。\n",
    "\n",
    "- **class_weight（字典类型或者‘balance’字符串。默认为None）**\n",
    "\n",
    "该参数表示给每个类别分别设置不同的惩罚参数C，如果没有给，则会给所有类别都给C=1，即前面参数指出的参数C。如果给定参数‘balance’，则使用y的值自动调整与输入数据中的类频率成反比的权重。\n",
    "\n",
    "- **verbose （ bool参数 默认为False）**\n",
    "\n",
    "该参数表示是否启用详细输出。此设置利用libsvm中的每个进程运行时设置，如果启用，可能无法在多线程上下文中正常工作。一般情况都设为False，不用管它。\n",
    "\n",
    "- **max_iter （int参数 默认为-1）**\n",
    "\n",
    "该参数表示最大迭代次数，如果设置为-1则表示不受限制。\n",
    "\n",
    "- **random_state（int，RandomState instance ，None 默认为None）**\n",
    "\n",
    "该参数表示在混洗数据时所使用的伪随机数发生器的种子，如果选int，则为随机数生成器种子；如果选RandomState instance，则为随机数生成器；如果选None,则随机数生成器使用的是np.random。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 方法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- svc.decision_function(X)\n",
    "\n",
    "计算样本X到分离超平面的距离\n",
    "\n",
    "- svc.fit(X, y[, sample_weight])\n",
    "\n",
    "拟合SVM模型\n",
    "\n",
    "- svc.get_params([deep])\n",
    "\n",
    "获取此估算器的参数并以字典行书储存,默认deep=True，以分类iris数据集为例，得到的参数如下\n",
    "\n",
    "{'C': 1.0, 'cache_size': 200, 'class_weight': None, 'coef0': 0.0,\n",
    "'decision_function_shape': 'ovr', 'degree': 3, 'gamma': 'auto', 'kernel': 'rbf', \n",
    "'max_iter': -1, 'probability': False, 'random_state': None, 'shrinking': True, \n",
    "'tol': 0.001, 'verbose': False}\n",
    "\n",
    "- svc.predict(X)\n",
    "\n",
    "根据测试数据集进行预测\n",
    "\n",
    "- svc.score(X, y[, sample_weight])\n",
    "\n",
    "返回给定测试数据和标签的平均精确度\n",
    "\n",
    "- svc.predict_log_proba(X_test)，svc.predict_proba(X_test)\n",
    "\n",
    "当sklearn.svm.SVC(probability=True)时，才会有这两个值，分别得到样本的对数概率以及普通概率。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 6.3. SVM调参"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" > 本程序主要介绍SVM调参的方法，首先介绍各个参数的意义，之后介绍网格搜索的调参方法。 </font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" > SVM参数意义:\n",
    "SVM在模型建立中主要需要选择的参数有C（软间隔大小）以及核函数的选择，\n",
    "同时根据各个核函数选择后，针对核函数也会产生新的参数需要调节，如gamma值。 </font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" > 核的选择\n",
    "1. Linear核：主要用于线性可分的情形。参数少，速度快，对于一般数据，分类效果已经很理想了。\n",
    "2. RBF核：主要用于线性不可分的情形。参数多，分类结果非常依赖于参数。 </font>"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "**高维用线性，不行换特征；低维试线性，不行换高斯**"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 网格搜索选择超参数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import numpy as np\n",
    "import os\n",
    "from sklearn import svm, metrics\n",
    "from sklearn.model_selection import GridSearchCV\n",
    "from sklearn.metrics import classification_report\n",
    "\n",
    "# 路径与标签\n",
    "path1 = 'D:/python/机器学习与社会科学应用/演示数据/03经典分类算法/digits/trainingDigits/'\n",
    "path2 = 'D:/python/机器学习与社会科学应用/演示数据/03经典分类算法/digits/testDigits/'\n",
    "train_files = os.listdir(path1)\n",
    "print('训练集样本量：',len(train_files))\n",
    "\n",
    "test_files = os.listdir(path2)\n",
    "print('测试集样本量：',len(test_files))\n",
    "\n",
    "\n",
    "# 训练集\n",
    "y_train = []\n",
    "X_train = []\n",
    "for i in range(len(train_files)):\n",
    "    \n",
    "    # 定义y\n",
    "    filename = train_files[i] # 文件名\n",
    "    filestr = filename.split('.')[0] # 不带后缀的文件名\n",
    "    filelabel = int(filestr.split('_')[0]) # 文件的标签,0,1,2,....,9\n",
    "    # 将标签添加至handlabel中\n",
    "    y_train.append(filelabel)\n",
    "    \n",
    "    #定义x\n",
    "    f = open(path1+train_files[i])\n",
    "    data = f.read()\n",
    "    data = data.split('\\n')[0:-1]\n",
    "    data = [list(line) for line in data]\n",
    "    data = [int(j) for i in data for j in i]  \n",
    "    X_train.append(data)\n",
    "X_train[50]\n",
    "\n",
    "\n",
    "# 测试集：一步到位\n",
    "y_test = []\n",
    "X_test = []\n",
    "for i in range(len(test_files)):\n",
    "    \n",
    "    #定义y\n",
    "    filename = test_files[i] # 文件名\n",
    "    filestr = filename.split('.')[0] # 不带后缀的文件名\n",
    "    filelabel=int(filestr.split('_')[0]) # 文件的标签,0,1,2,....,9\n",
    "    # 将标签添加至handlabel中\n",
    "    y_test.append(filelabel)\n",
    "    \n",
    "    # 定义x\n",
    "    f=open(path2+test_files[i])\n",
    "    data=f.read()\n",
    "    data=data.split('\\n')[0:-1]\n",
    "   \n",
    "    data=[list(line) for line in data]\n",
    "    data=[int(j) for i in data for j in i]\n",
    "    \n",
    "    X_test.append(data)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 搜索参数范围\n",
    "# 搜索参数为每一个set中各个参数的组合\n",
    "parameters = [{'kernel': ['rbf'], 'gamma': [1e-3, 1e-4], 'C': [1, 10, 100, 1000]},\n",
    "                    {'kernel': ['linear'], 'C': [1, 10, 100, 1000]}]\n",
    "\n",
    "# 设定评分，分别搜索在两种评分下个最优参数\n",
    "scores = ['precision', 'recall']"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "for score in scores:    \n",
    "    print(\"开始选择 %s\" % score,\"评分下最优参数\")    \n",
    "    print()     \n",
    "    # 设定搜索器，GridSearchCV将模型以及参数及传入，并设定搜索评分方式，以及交叉验证折数\n",
    "    clf = GridSearchCV(svm.SVC(), parameters, cv=5,scoring='%s_macro' % score)    \n",
    "    # 利用定义好的clf去训练\n",
    "    clf.fit(X_train, y_train)    \n",
    "    print(\"最优参数为:\")    \n",
    "    print()    \n",
    "    print(clf.best_params_)    \n",
    "    print()    \n",
    "    print(\"各搜索参数评分分别为:\")    \n",
    "    print()\n",
    "    means = clf.cv_results_['mean_test_score']\n",
    "    stds = clf.cv_results_['std_test_score']    \n",
    "    for mean, std, params in zip(means, stds, clf.cv_results_['params']):        \n",
    "        print(\"%0.3f (+/-%0.03f) for %r\"\n",
    "              % (mean, std * 2, params))              \n",
    "        print()    \n",
    "\n",
    "    y_true, y_pred = y_test, clf.predict(X_test)    \n",
    "    print(classification_report(y_true, y_pred))    \n",
    "    print()    \n",
    "    print()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### GridSearchCV参数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "sklearn.model_selection.GridSearchCV(estimator, param_grid, scoring=None, fit_params=None, n_jobs=1, \n",
    "                                           iid=True, refit=True, cv=None, verbose=0, pre_dispatch=‘2*n_jobs’, \n",
    "                                           error_score=’raise’, return_train_score=’warn’)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- estimator\n",
    "\n",
    "选择使用的分类器，并且传入除需要确定最佳的参数之外的其他参数。每一个分类器都需要一个scoring参数，或者score方法：如estimator=RandomForestClassifier(min_samples_split=100,min_samples_leaf=20,max_depth=8,max_features='sqrt',random_state=10),\n",
    "\n",
    "- param_grid\n",
    "\n",
    "需要最优化的参数的取值，值为字典或者列表，例如：param_grid =param_test1，param_test1 = {'n_estimators':range(10,71,10)}。\n",
    "\n",
    "- scoring=None\n",
    "\n",
    "模型评价标准，默认None,这时需要使用score函数；或者如scoring='roc_auc'，根据所选模型不同，评价准则不同。字符串（函数名），或是可调用对象，需要其函数签名形如：scorer(estimator, X, y)；如果是None，则使用estimator的误差估计函数。具体值的选取看本篇第三节内容。\n",
    "\n",
    "- fit_params=None\n",
    "\n",
    "- n_jobs=1\n",
    "\n",
    "n_jobs: 并行数，int：个数,-1：跟CPU核数一致, 1:默认值\n",
    "\n",
    "- iid=True\n",
    "\n",
    "iid:默认True,为True时，默认为各个样本fold概率分布一致，误差估计为所有样本之和，而非各个fold的平均。\n",
    "\n",
    "- refit=True\n",
    "\n",
    "默认为True,程序将会以交叉验证训练集得到的最佳参数，重新对所有可用的训练集与开发集进行，作为最终用于性能评估的最佳模型参数。即在搜索参数结束后，用最佳参数结果再次fit一遍全部数据集。\n",
    "\n",
    "- cv=None\n",
    "\n",
    "交叉验证参数，默认None，使用三折交叉验证。指定fold数量，默认为3，也可以是yield训练/测试数据的生成器。\n",
    "\n",
    "- verbose=0, scoring=None\n",
    "\n",
    "verbose：日志冗长度，int：冗长度，0：不输出训练过程，1：偶尔输出，>1：对每个子模型都输出。\n",
    "\n",
    "- pre_dispatch=‘2*n_jobs’\n",
    "\n",
    "指定总共分发的并行任务数。当n_jobs大于1时，数据将在每个运行点进行复制，这可能导致OOM，而设置pre_dispatch参数，则可以预先划分总共的job数量，使数据最多被复制pre_dispatch次\n",
    "\n",
    "- error_score=’raise’\n",
    "\n",
    "- return_train_score=’warn’"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 本节结束"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
