{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 机器学习与社会科学应用"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第三章 经典分类算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第二节 K近邻算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" >郭峰    \n",
    "    教授、博士生导师  \n",
    "上海财经大学公共经济与管理学院  \n",
    "上海财经大学数实融合与智能治理实验室  \n",
    "邮箱：guofengsfi@163.com</font> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" >本节目录  \n",
    "2.1.自编代码实现KNN算法  \n",
    "2.2.调用sklearn函数  \n",
    "2.3.调参：寻找最好的k    \n",
    "2.4.实战案例：手写数字识别  </font>  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.1 自编代码实现KNN算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "K近邻算法（KNN）分类的本质是计算训练集中每个样本与所要分类的样本之间欧式距离，然后根据距离进行排序，取距离该样本最近的K个训练集样本的多数作为所有预测的分类。  \n",
    "以鸢尾花分类为示例，介绍KNN算法。鸢尾花可以被分为setosa、versicolor、virginica三个品种，现在我们要建立一个模型，输入特定数据判定它是属于哪一类。这是一个sklearn自带的数据集，很多教科书都用其作为机器学习演示案例"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  加载数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 目标：鸢尾花分类为示例，介绍KNN算法\n",
    "# 鸢尾花可以被分为setosa、versicolor、virginica三个品种，现在我们要建立一个模型，输入特定数据判定它是属于哪一类。  \n",
    "\n",
    "%matplotlib inline \n",
    "# matplotlib notebook  # 这两个命令都可以让图片直接显示在本notebook上\n",
    "import numpy as np\n",
    "import matplotlib.pyplot as plt\n",
    "from sklearn import datasets\n",
    "plt.rcParams['font.sans-serif']=['SimHei'] # 用来正常显示中文标签"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  指定特征变量与响应变量"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# 准备数据集，这是一个sklearn自带的数据集，各个教科书都用其作为演示案例\n",
    "iris = datasets.load_iris()\n",
    "X = iris.data   # x展示了样本的四个特征，根据这四个特征，预测花的品种\n",
    "print('X:\\n',X)\n",
    "Y = iris.target # Y对应样本的分类标签\n",
    "print('Y:\\n',Y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# 为简化处理，处理二分类问题，所以只针对Y=0,1的行，然后从这些行中取X的前两列\n",
    "x = X[Y<2, :2] # 同时用两个条件进行筛选\n",
    "print(x.shape)\n",
    "print('x:\\n',x)\n",
    "y = Y[Y<2]\n",
    "print('y:\\n',y)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 图形展示代码\n",
    "plt.scatter(x[y==0, 0],x[y==0, 1],color='red')\n",
    "plt.scatter(x[y==1, 0],x[y==1, 1],color='green')\n",
    "plt.scatter(5.6,3.2,color='blue')\n",
    "x_1=np.array([5.6,3.2])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "###  KNN算法实现\n",
    "算法步骤：  \n",
    "- 如上图所示，我们要对图中蓝色的点进行预测，从而判断他属于哪一类\n",
    "- 我们使用欧氏距离公式，计算两个向量点之间的距离.\n",
    "- 计算完所有点之间的距离后，可以对数据按照从小到大的次序排序\n",
    "- 统计距离最近前k个数据点的类别数，返回票数最多的那类即为蓝色点的类别。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 采用欧式距离计算要预测的点（x_1)到其他所有点的距离\n",
    "distances = [np.sqrt(np.sum((x_t-x_1)**2)) for x_t in x]\n",
    "\n",
    "# 对上述距离数组进行排序，返回的是排序后的索引\n",
    "d = np.sort(distances)\n",
    "nearest = np.argsort(distances)\n",
    "print(nearest)\n",
    "\n",
    "# 选择的是6近邻\n",
    "k = 6                 \n",
    "topk_y = [y[i] for i in nearest[:k]]\n",
    "print(topk_y)\n",
    "\n",
    "# 对topk_y进行统计返回字典\n",
    "from collections import Counter\n",
    "votes = Counter(topk_y)\n",
    "# 返回票数最多的1类元素 \n",
    "print(votes)\n",
    "predict_y = votes.most_common(1)[0][0]\n",
    "print(votes.most_common(1)) # votes.most_common(1)返回的结果是票数的结果以及具体票数\n",
    "print(predict_y)  #预测结果\n",
    "# 从结果可以看出，k=6时，距离蓝色的点最近的6个点钟，有4个属于绿色，2个属于红色，最终蓝色点的标签被预测为绿色。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.2 调用sklearn函数"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在上面的例子中，我们通过自变代码实现了KNN算法，实践中，我们可以直接调用sklean封装好的KNN算法进行计算。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import datasets\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "\n",
    "\n",
    "iris = datasets.load_iris()\n",
    "X = iris.data\n",
    "y = iris.target\n",
    " \n",
    "# sklearn自带的train_test_split进行训练集、测试集的切分，训练模型\n",
    "X_train,X_test,y_train,y_test = train_test_split(X, y, test_size=0.25, random_state=666)\n",
    "knn_classifier = KNeighborsClassifier(n_neighbors=5)\n",
    "# 因为knn对算法进行了封装，既包括构建模型的算法，也包括预测的算法，我们只需要调用fit方法来训练数据即可。\n",
    "knn_classifier.fit(X_train, y_train)    \n",
    "\n",
    "# 模型评估--预测准确度\n",
    "y_predict = knn_classifier.predict(X_test)\n",
    "scores = knn_classifier.score(X_test,y_test)\n",
    "# 注：scikit-learn中所有的机器学习模型都在各自的类中实现，统称为Estimator类  \n",
    "# K近邻算法是在neighbours模块中的KNeighboursClassifier类中实现，我们设置邻居参数为1  \n",
    "\n",
    "print('acc:{}'.format(sum(y_predict==y_test)/len(y_test)),scores)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 直接使用sklean自带的程序，预测某一个点的分类结果  \n",
    "X_new = np.array([[5,2.9,1,0.2]])\n",
    "prediction = knn_classifier.predict(X_new)\n",
    "print(prediction)\n",
    "print(\"Predicted target name:{}\".format(iris['target_names'][prediction]))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.3. 调参：寻找最好的k"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "- 得到正确率之后，想要进一步的提升在测试集上的正确率，我们就需要对模型进行调参\n",
    "- 超参数：在算法运行前需要设定的参数（通过领域知识、经验数值、实验搜索来寻找好的超参数）\n",
    "- 模型参数：算法过程中学习的参数\n",
    "- 在KNN中没有模型参数，KNN算法中的k是典型的超参数，我们将采用实验搜索来寻找好的超参数\n",
    "- 逻辑是在k=1到10之间一个个测试，看那个k效果最好"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# KNN的超参数调优\n",
    "from sklearn import datasets\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "\n",
    "digits = datasets.load_digits()\n",
    "x = digits.data\n",
    "y = digits.target\n",
    "print(x.shape)\n",
    "print(y.shape)\n",
    "x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state=666)\n",
    "knn_clf = KNeighborsClassifier(n_neighbors=5)\n",
    "knn_clf.fit(x_train,y_train)\n",
    "y_pred = knn_clf.predict(x_test)\n",
    "print(\"测试集准确率:\", ((y_test==y_pred).sum())/len(y_test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在上面的示例中，我们使用K=5训练模型，那么K=5是否为最优的模型？下面我们使用一个循环，寻找最优的K。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 寻找最优K\n",
    "from sklearn import datasets\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "digits = datasets.load_digits()\n",
    "x = digits.data\n",
    "y = digits.target\n",
    "x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state=666)\n",
    "\n",
    "# 假设最优的K在[1,9]之间，当然也可以设定更宽的区间\n",
    "best_k = -1\n",
    "best_score = 0\n",
    "for i in range(1,10):\n",
    "    knn_clf = KNeighborsClassifier(n_neighbors=i)\n",
    "    knn_clf.fit(x_train,y_train)\n",
    "    scores = knn_clf.score(x_test,y_test)\n",
    "    print(i,scores)\n",
    "    if scores>best_score:\n",
    "        best_score=scores\n",
    "        best_k=i\n",
    "print('最好的k为:%d,最好的得分为:%.4f'%(best_k,best_score))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "knn算法参数：  \n",
    "- n_neighbors：默认为5，就是k-NN的k的值，选取最近的k个点。   \n",
    "- weights：默认是uniform，参数可以是uniform、distance，也可以是用户自己定义的函数。uniform是均等的权重，就说所有的邻近点的权重都是相等的。distance是不均等的权重，距离近的点比距离远的点的影响大。用户自定义的函数，接收距离的数组，返回一组维数相同的权重。   \n",
    "- algorithm：快速k近邻搜索算法，默认参数为auto，可以理解为算法自己决定合适的搜索算法。除此之外，用户也可以自己指定搜索算法ball_tree、kd_tree、brute方法进行搜索，brute是蛮力搜索，也就是线性扫描，当训练集很大时，计算非常耗时。- kd_tree，构造kd树存储数据以便对其进行快速检索的树形数据结构，kd树也就是数据结构中的二叉树。以中值切分构造的树，每个结点是一个超矩形，在维数小于20时效率高。ball tree是为了克服kd树高纬失效而发明的，其构造过程是以质心C和半径r分割样本空间，每个节点是一个超球体。   \n",
    "- leaf_size：默认是30，这个是构造的kd树和ball树的大小。这个值的设置会影响树构建的速度和搜索速度，同样也影响着存储树所需的内存大小。需要根据问题的性质选择最优的大小。   \n",
    "- metric：用于距离度量，默认度量是minkowski，也就是p=2的欧氏距离(欧几里德度量)。   \n",
    "- p：距离度量公式。除欧氏距离外，还有其他的度量方法，例如曼哈顿距离。这个参数默认为2，也就是默认使用欧式距离公式进行距离度量。也可以设置为1，使用曼哈顿距离公式进行距离度量。   \n",
    "- metric_params：距离公式的其他关键参数，这个可以不管，使用默认的None即可。   \n",
    "- n_jobs：并行处理设置。默认为1，临近点搜索并行工作数。如果为-1，那么CPU的所有cores都用于并行工作。 "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 寻找最优超参数weights\n",
    "from sklearn import datasets\n",
    "from sklearn.model_selection import train_test_split\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "\n",
    "digits = datasets.load_digits()\n",
    "\n",
    "x = digits.data\n",
    "y = digits.target\n",
    "\n",
    "x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.25,random_state=666)\n",
    "\n",
    "# 寻找最好的k,weights\n",
    "best_k = -1\n",
    "best_score = 0\n",
    "best_method = ''\n",
    "for method in ['uniform','distance']:\n",
    "    for i in range(1,11):\n",
    "        knn_clf = KNeighborsClassifier(n_neighbors=i,weights=method)\n",
    "        knn_clf.fit(x_train,y_train)\n",
    "        scores = knn_clf.score(x_test,y_test)\n",
    "        if scores>best_score:\n",
    "            best_score = scores\n",
    "            best_k = i\n",
    "            best_method = method\n",
    "print('最好的k为:%d,最好的得分为:%.4f,最好的方法%s'%(best_k,best_score,best_method))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.4.实战案例：手写数字识别"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 导入第三方库\n",
    "import numpy as np\n",
    "import os\n",
    "from sklearn import svm, metrics"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 路径与标签\n",
    "path1=\"D:/python/机器学习与社会科学应用/演示数据/03经典分类算法/digits/trainingDigits/\"\n",
    "path2=\"D:/python/机器学习与社会科学应用/演示数据/03经典分类算法/digits/testDigits/\"\n",
    "train_files=os.listdir(path1)\n",
    "print('训练集样本量：',len(train_files))\n",
    "\n",
    "test_files=os.listdir(path2)\n",
    "print('测试集样本量：',len(test_files))"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "#训练集\n",
    "y_train = []\n",
    "X_train_temp = []\n",
    "for i in range(len(train_files)):    \n",
    "    #定义y\n",
    "    filename = train_files[i]#文件名\n",
    "    filestr = filename.split('.')[0]#不带后缀的文件名\n",
    "    filelabel = int(filestr.split('_')[0])#文件的标签,0,1,2,....,9\n",
    "    #将标签添加至handlabel中\n",
    "    y_train.append(filelabel)\n",
    "    \n",
    "    #定义x\n",
    "    f = open(path1+train_files[i])\n",
    "    data = f.read()\n",
    "    data = data.split('\\n')[0:-1]\n",
    "    X_train_temp.append(data)\n",
    "X_train_temp[50]"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "#分两步处理数据，上一步方便直观展示\n",
    "X_train = []\n",
    "for i in range(len(train_files)):\n",
    "    data = [list(line) for line in X_train_temp[i]]\n",
    "    data = [int(j) for i in data for j in i]\n",
    "    X_train.append(data)\n",
    "print(X_train[50])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#测试集：一步到位\n",
    "y_test = []\n",
    "X_test = []\n",
    "for i in range(len(test_files)):    \n",
    "    #定义y\n",
    "    filename = test_files[i]#文件名\n",
    "    filestr = filename.split('.')[0]#不带后缀的文件名\n",
    "    filelabel = int(filestr.split('_')[0])#文件的标签,0,1,2,....,9\n",
    "    #将标签添加至handlabel中\n",
    "    y_test.append(filelabel)\n",
    "    \n",
    "    #定义x\n",
    "    f = open(path2+test_files[i])\n",
    "    data = f.read()\n",
    "    data = data.split('\\n')[0:-1]\n",
    "   \n",
    "    data = [list(line) for line in data]\n",
    "    data = [int(j) for i in data for j in i]\n",
    "    \n",
    "    X_test.append(data)\n",
    "print(X_test[50])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#直接调用KNN算法\n",
    "from sklearn.neighbors import KNeighborsClassifier\n",
    "knn_classifier=KNeighborsClassifier(n_neighbors=3)\n",
    "knn_classifier.fit(X_train,y_train)    \n",
    "\n",
    "#训练集和测试集预测\n",
    "y_train_pred=knn_classifier.predict(X_train)\n",
    "y_test_pred=knn_classifier.predict(X_test)\n",
    "\n",
    "# 模型评估\n",
    "print('训练集准确率：',sum(y_train_pred==y_train)/len(y_train))\n",
    "print('测试集准确率：',sum(y_test_pred==y_test)/len(y_test))\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 本节结束"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
