{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 机器学习与社会科学应用"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第二章 经典回归算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" >郭峰    \n",
    "    教授、博士生导师  \n",
    "上海财经大学公共经济与管理学院  \n",
    "上海财经大学数实融合与智能治理实验室    \n",
    "    邮箱：guofengsfi@163.com</font> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" >本讲目录  \n",
    "第一节  OLS回归算法  \n",
    "第二节  岭回归算法  \n",
    "第三节  Lasso回归算法   \n",
    "第四节  算法调参</font>  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#机器学习常用库\n",
    "#pip install sklearn\n",
    "#Python模块经常更新，不同版本可能不兼容，导致报错，可以通过安装对应版本软件，规避这个问题\n",
    "#也可以网络上搜索相关问题，寻找解决办法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第一节  OLS回归算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" >循环发电场数据，共有9568个样本数据，每个数据有5列，分别是:AT（温度）, V（压力）, AP（湿度）, RH（压强）, PE（输出电力)。\n",
    "其中，PE是样本输出，而AT/V/AP/RH这4个是样本特征， 机器学习的目的就是得到一个线性回归模型，即:\n",
    "$$\n",
    "PE = \\theta_0 + \\theta_1\\times AT + \\theta_2\\times V + \\theta_3\\times AP + \\theta_4\\times RH\n",
    "$$\n",
    "而需要学习的就是$\\theta_0, \\theta_1, \\theta_2, \\theta_3, \\theta_4$ 这5个参数。 </font>  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "一个完整的机器学习流程一般包括：\n",
    "- 导入模块与加载数据\n",
    "- 指定特征变量与响应变量\n",
    "- 划分训练集与测试集\n",
    "- 使用训练集训练模型\n",
    "- 评估模型预测性能\n",
    "- 调整参数选择最优模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.1 导入模块与加载数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#python示例代码\n",
    "\n",
    "#导入四个可能用到的包：sklearn， scipy ，numpy和pandas \n",
    "import sklearn\n",
    "import scipy\n",
    "import pandas as pd \n",
    "import numpy as np"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#设置路径名称\n",
    "path=\"D:/python/机器学习与社会科学应用/演示数据/02经典回归算法/CCPP/\"\n",
    "\n",
    "#导入存储在上述路径中的数据ccpp.csv，并将这份数据命名为data\n",
    "data = pd.read_csv(path+'ccpp.csv', encoding='utf8', header=0)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#查看数据结构：行数和列数\n",
    "print(data.shape)\n",
    "\n",
    "#预览数据（默认查看前5行数据）\n",
    "data.head(10)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.2 指定特征变量与响应变量"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 特征变量为X，我们用AT， V，AP和RH这4个列作为样本特征\n",
    "X = data[['AT', 'V', 'AP', 'RH']]\n",
    "\n",
    "#预览特征变量\n",
    "X.head()"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 定义响应变量为y， 我们用PE作为响应变量。\n",
    "y = data[['PE']]\n",
    "\n",
    "#预览相应变量\n",
    "y.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.3 划分训练集与测试集"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了强化所构建模型的外部有效性和预测性能，机器学习算法一般会将全部样本划分成训练集和测试集两大类，首先在训练集中训练模型，得到参数估计值，然后将训练后的模型应用到测试集中，以观察模型的预测能力是否符合要求。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#从sklearn中进一步调用train_test_split，用来划分训练集和测试集\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "#划分训练集和测试集，并将训练集和测试集的样本规模比例定义为0.7:0.3（默认为3：1）\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=1)\n",
    "\n",
    "#分别打印训练集和测试集中的特征变量和响应变量的数据结构\n",
    "print(X_train.shape,y_train.shape)\n",
    "print(X_test.shape,y_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.4 使用训练集训练模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "定义一个算法，使用专业术语来讲，就是实例化一个对象，然后使用训练集开始训练模型，即求得模型的参数\n",
    "机器学习算法核心三步：  \n",
    "1、实例化对象：linreg = LinearRegression()  \n",
    "2、利用训练集训练模型：linreg.fit(X_train, y_train)  \n",
    "3、用于预测：y_predict = linreg.predict(X_test)  "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#从sklearn中调用LinearRegression，并用这一线性模型来拟合我们的问题\n",
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn import metrics\n",
    "\n",
    "#所有超参都用默认的\n",
    "# 实例化对象\n",
    "linreg = LinearRegression()\n",
    "\n",
    "# 调用实例方法fit()\n",
    "linreg.fit(X_train, y_train)   #对应着stata 中的reg y x\n",
    "\n",
    "# 打印计算的结果：模型估计的参数和截距项\n",
    "print(\"参数：\",linreg.coef_)\n",
    "print(\"截距项：\",linreg.intercept_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.5 评估模型预测性能"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在模型训练完毕后，需要对模型进行评估，即该模型的预测效果如何。对于线性回归来说，我们一般用均方差（Mean Squared Error, MSE）或者均方根差(Root Mean Squared Error, RMSE)在测试集上的表现来评价模型的好坏。RMSE和MSE的值越小，表明预测值越接近真实值，模型的预测性能也就越好。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#模型拟合测试集，并打印模型预测结果的前五行\n",
    "y_pred = linreg.predict(X_test)\n",
    "y_pred1 = linreg.predict(X_train)\n",
    "print(y_pred[0:5])\n",
    "print(y_pred1[0:5])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 用scikit-learn计算MSE（Mean Squared Error，预测值与真值之差平方和的平均）\n",
    "# 为什么要用MSE而不用SSE（经济学常用），是因为可以规避样本数量带来的总量上不可比较问题\n",
    "#打印训练集和测试集所计算的均方误差\n",
    "print(\"MSE for train sample:\",metrics.mean_squared_error(y_train, y_pred1))\n",
    "print(\"MSE for test sample:\",metrics.mean_squared_error(y_test, y_pred))\n",
    "\n",
    "# 用scikit-learn计算RMSE（MSE的平方根）\n",
    "print(\"RMSE for test sample:\",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 1.6 调整参数选择最优模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "一般而言，我们需要多次调整超参训练模型，每次调整都会得到一个MSE或者RMSE。需要选择模型时，就用MSE最小时对应的模型。这里我们通过调整特征变量的个数为例，来选择最优模型。上述模型分析中，我们一共包括了4个特征变量，我们这里选择将RH剔除后，重新进行训练，以观察模型的预测性能是否得到提升。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn.linear_model import LinearRegression\n",
    "from sklearn import metrics\n",
    "\n",
    "# 这里我们用AT， V，AP这3个列作为样本特征。不要RH， 输出仍然是PE。代码如下\n",
    "X = data[['AT', 'V', 'AP']]\n",
    "y = data[['PE']]\n",
    "\n",
    "# random_state用于复现结果，计算机实现过程是伪随机数\n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1) \n",
    "linreg = LinearRegression()\n",
    "linreg.fit(X_train, y_train)\n",
    "\n",
    "#模型拟合测试集\n",
    "y_pred = linreg.predict(X_test)\n",
    "\n",
    "# 用scikit-learn计算MSE\n",
    "print(\"MSE:\",metrics.mean_squared_error(y_test, y_pred))\n",
    "\n",
    "# 用scikit-learn计算RMSE\n",
    "print( \"RMSE:\",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))\n",
    "\n",
    "#去掉RH后，模型拟合的没有加上RH的好，MSE变大了"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第二节  岭回归算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.1 导入模块与加载数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "首先，调入可能使用到的模块pandas、numpy和sklearn和相关数据，并查看相关数据的特征。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#设置路径名称，导入数据ccpp.csv，并将其命名为data\n",
    "import pandas as pd\n",
    "path=\"D:/python/机器学习与社会科学应用/演示数据/02经典回归算法/CCPP/\"\n",
    "data = pd.read_csv(path+'ccpp.csv', encoding='utf8', header=0)\n",
    "\n",
    "#打印数据结构，并预览数据前5行\n",
    "print(data.shape)\n",
    "data.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.2 指定特征变量与响应变量"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "与之前操作OLS回归算法时的操作相类似，在这份演示数据中，我们同样将AT、V、AP和RH这四个变量定义为特征变量，将PE这一变量定义为响应变量。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 定义特征变量为X，我们用AT， V，AP和RH这4个列作为样本特征    \n",
    "X = data[['AT', 'V', 'AP', 'RH']]    \n",
    "       \n",
    "# 定义响应变量为y，以PE作为响应变量    \n",
    "y = data[['PE']]  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.3 划分训练集与测试集"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在这则示例中，我们采用默认（3:1）方式来设定训练集和测试集的个数，并将随机状态（random_state）设定为0，以方面后续结果复现。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#划分训练集和测试集，并将训练集和测试集的样本规模比例定义为3：1    \n",
    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)    \n",
    "       \n",
    "#分别打印训练集和测试集中的特征变量和响应变量的数据结构    \n",
    "print(X_train.shape,y_train.shape)    \n",
    "print(X_test.shape,y_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.4 使用训练集训练模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "我们在使用岭回归机器学习算法时，需要设定调节参数的取值，才能进行系数估计。这里，我们暂将调节参数的取值设定为2，然后再用训练集数据进行模型训练。最后，输出岭回归的测试性能得分、系数估计值和常数项。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#从sklearn中导入ridge\n",
    "from sklearn.linear_model import Ridge\n",
    "\n",
    "#从sklearn中进一步调用train_test_split，用来划分训练集和测试集\n",
    "from sklearn.model_selection import train_test_split\n",
    "\n",
    "#指定一个正则化参数，此处将正则化参数设定为2；然后用岭回归算法对数据进行拟合\n",
    "#正则化强度; 必须是正浮点数。 正则化改善了问题的条件并减少了估计的方差。 较大的值指定较强的正则化。\n",
    "ridge = Ridge(alpha=2)\n",
    "ridge.fit(X_train, y_train)\n",
    "\n",
    "#打印使用岭回归算法的得分，参数估计值和常数项\n",
    "print(ridge.score(X_train, y_train))\n",
    "print(ridge.coef_)\n",
    "print(ridge.intercept_)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.5 评估模型预测性能 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "与OLS回归算法相类似，我们同样通过均方差（MSE）和均方根差（RMSE）来判定所训练模型的预测性能。若MSE或RMSE的数值越小，表明所训练模型的预测能力越强。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import metrics    \n",
    "        \n",
    "y_pred = ridge.predict(X_test)    \n",
    "       \n",
    "# 用scikit-learn计算MSE    \n",
    "print(\"MSE:\",metrics.mean_squared_error(y_test, y_pred))    \n",
    "       \n",
    "# 用scikit-learn计算RMSE    \n",
    "print(\"RMSE:\",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 2.6调整参数选择最优模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "为了能够得到预测性能最佳的模型，一般而言，我们需要多次调整超参来训练模型，每次调整都会得到一个MSE或者RMSE。需要选择模型时，就用MSE或RMSE最小时所对应的模型。在使用岭回归算法的调参过程中，我们选择通过调整调节参数λ的方式进行调参。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在设定完训练集、测试集的样本规模和随机状态后，在示例数据集中，我们设置了一个调节参数λ可能取值的列表（列表中的具体取值范围可根据个人研究情形进行调整），然后观察计算每一个调节参数λ对应下的MSE或RMSE，选择MSE或RMSE最小时所对应的调节参数λ即为最优参数，并以这一最优参数为基础来训练模型，并根据训练出的模型来进行预测即可。关于调参，在本章的第四节我们还会更详细地讨论。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import linear_model    \n",
    "       \n",
    "#划分测试集和训练集，并定义随机状态    \n",
    "X_train, X_test, y_train, y_test =train_test_split(X, y, test_size=0.25, random_state=0)    \n",
    "   \n",
    "#建立一个备选参数的列表    \n",
    "alphas = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]    \n",
    "scores = []    \n",
    "        \n",
    "#循环计算每一个参数对应的预测结果，并将其一一打印    \n",
    "for i, alpha in enumerate(alphas):\n",
    "    ridgeRegression = linear_model.Ridge(alpha=alpha)    \n",
    "    ridgeRegression.fit(X_train, y_train)    \n",
    "    scores.append(ridgeRegression.score(X_test, y_test))    \n",
    "print(scores)   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "##  第三节  Lasso回归算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.1 导入模块与加载数据"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "首先，调入可能使用到的模块pandas、numpy和sklearn和相关数据。"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.2 指定特征变量与响应变量"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#导入numpy、pandas和sklearn等包    \n",
    "import numpy as np    \n",
    "import pandas as pd    \n",
    "from sklearn import datasets, linear_model    \n",
    "from sklearn.model_selection import train_test_split     \n",
    "import matplotlib.pyplot as plt    \n",
    "   \n",
    "#定义路径名称，并导入数据ccpp.csv，并将其命名为data    \n",
    "path=\"D:/python/机器学习与社会科学应用/演示数据/02经典回归算法/CCPP/\"    \n",
    "data= pd.read_csv(path+\"ccpp.csv\", encoding='utf8', header=0)    \n",
    "       \n",
    "#打印data的数据结构，并展示其前5行数据    \n",
    "print(data.shape)    \n",
    "data.head()   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "与之前操作OLS回归算法时的操作相类似，在这份演示数据中，我们同样将AT、V、AP和RH这四个变量定义为特征变量，将PE这一变量定义为响应变量。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 指定数据中的特征变量和响应变量    \n",
    "X = data[['AT', 'V', 'AP', 'RH']]    \n",
    "y = data[['PE']]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.3 划分训练集与测试集"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在这则示例中，我们采用默认（3:1）方式来设定训练集和测试集的个数，并将随机状态（random_state）继续设定为1。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=1)    \n",
    "print(X_train.shape,y_train.shape)    \n",
    "print(X_test.shape,y_test.shape)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.4  使用训练集训练模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "如同岭回归一样，我们在使用Lasso回归算法时，也需要设定调节参数的取值，才能进行系数估计。这里，我们仍将调节参数的取值设定为2，然后再用训练集数据进行模型训练。最后，输出Lasso回归的测试性能得分、系数估计值和常数项。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#指定一个正则化参数，此处我们将正则化参数设定为2    \n",
    "from sklearn.linear_model import Lasso    \n",
    "lasso =Lasso(alpha=2)    \n",
    "       \n",
    "#用Lasso模型进行拟合，并打印Lasso模型的得分、参数估计值和常数项    \n",
    "lasso.fit(X_train, y_train)    \n",
    "  \n",
    "print(lasso.score(X_train, y_train))    \n",
    "print(lasso.coef_)    \n",
    "print(lasso.intercept_) "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.5 评估模型预测性能 "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "与OLS回归和岭回归算法相类似，我们同样通过均方差（MSE）和均方根差（RMSE）来判定所训练模型的预测性能。MSE或RMSE的数值越小，表明所训练模型的预测能力越强。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from sklearn import metrics    \n",
    "       \n",
    "y_pred =lasso.predict(X_test)    \n",
    "# 用scikit-learn计算MSE    \n",
    "print(\"MSE:\",metrics.mean_squared_error(y_test, y_pred))    \n",
    "      \n",
    "# 用scikit-learn计算RMSE    \n",
    "print(\"RMSE:\",np.sqrt(metrics.mean_squared_error(y_test, y_pred)))  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 3.6  调整参数选择最优模型"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "在使用Lasso回归算法的调参过程中，我们选择通过调整调节参数的方式进行调参。在设定完数据的训练集、测试集的样本规模和随机状态后，在示例数据集中，我们设置了一个调节参数可能取值的列表（列表中的具体取值范围可根据个人研究情形进行调整），然后观察计算每一个调节参数对应下的MSE或RMSE，选择MSE或RMSE最小时所对应的调节参数 即为最优参数，并以这一最优参数为基础来训练模型，并根据训练出的模型来进行预测即可。"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#划分测试集和训练集，并定义随机状态      \n",
    "X_train, X_test, y_train, y_test =train_test_split(X, y, test_size=0.25, random_state=0)    \n",
    "       \n",
    "#建立一个备选参数的列表      \n",
    "alphas = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]    \n",
    "scores = []    \n",
    "        \n",
    "#循环计算每一个参数对应的预测结果，并将其一一打印      \n",
    "for alpha in alphas:\n",
    "    lassoRegression = linear_model.Lasso(alpha=alpha)    \n",
    "    lassoRegression.fit(X_train, y_train)    \n",
    "    scores.append(lassoRegression.score(X_test, y_test))    \n",
    "print(scores)  "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 第四节 算法调参"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1 循环寻找最优超参"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(1)导入可能用到的模块"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline \n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "from sklearn import datasets\n",
    "from sklearn.linear_model import Lasso\n",
    "from sklearn.model_selection import train_test_split \n",
    "from sklearn.linear_model import LinearRegression"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "（2）加载数据diabetes，并定义特征变量和响应变量，打印特征变量的数据结构和响应变量的前10个值，定义训练集和测试集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 加载数据\n",
    "diabetes = datasets.load_diabetes()\n",
    "\n",
    "# 指定特征变量与响应变量\n",
    "X = diabetes.data\n",
    "y = diabetes.target\n",
    "print(X.shape)\n",
    "print(y[0:10])\n",
    "dir(diabetes)\n",
    "\n",
    "# 划分训练集和测试集\n",
    "X_train, X_test, y_train, y_test=train_test_split(X,y, test_size=0.25, random_state=0)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "（3）先使用默认超参，观察模型的预测性能"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#默认超参\n",
    "lassoRegression = Lasso()\n",
    "lassoRegression.fit(X_train, y_train)\n",
    "print(\"权重向量:%s, b的值为:%.2f\" % (lassoRegression.coef_, lassoRegression.intercept_))\n",
    "print(\"损失函数的值:%.2f\" % np.mean((lassoRegression.predict(X_test) - y_test) ** 2))\n",
    "print(\"预测性能得分: %.2f\" % lassoRegression.score(X_test, y_test))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "(4)设置一系列备选的参数集，然后通过循环的方式计算每个参数所对应的预测性能"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#通过循环，寻找最有参数\n",
    "X_train, X_test, y_train, y_test =train_test_split(diabetes.data, diabetes.target, test_size=0.25, random_state=0)\n",
    "alphas = [0.01, 0.02, 0.05, 0.1, 0.2, 0.5, 1, 2, 5, 10, 20, 50, 100, 200, 500, 1000]\n",
    "scores = []\n",
    "for alpha in alphas:\n",
    "    lassoRegression = Lasso(alpha=alpha)\n",
    "    lassoRegression.fit(X_train, y_train)\n",
    "    scores.append(lassoRegression.score(X_test, y_test))\n",
    "print(scores)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "（5）将每个参数对应的预测解雇，以图形的方式展现出来"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "figure = plt.figure()\n",
    "ax = figure.add_subplot(1, 1, 1)\n",
    "ax.plot(alphas, scores)\n",
    "ax.set_xlabel(r\"$\\alpha$\")\n",
    "ax.set_ylabel(r\"score\")\n",
    "ax.set_xscale(\"log\")\n",
    "ax.set_title(\"Lasso\")\n",
    "plt.show()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 5.2 交叉验证选择超参"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "（1）导入可能用到的模块"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import matplotlib.pyplot as plt\n",
    "%matplotlib inline \n",
    "import numpy as np\n",
    "from sklearn import datasets\n",
    "from sklearn.linear_model import Lasso\n",
    "from sklearn.model_selection import KFold\n",
    "from sklearn.linear_model import LassoCV"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "（2）加载数据diabetes，采用10折交叉验证，并将随机状态设为20；并设置一组备选的参数集（初始为0.01、终值为100、共包含100个数的等比数列），打印这个参数集的前10个数和倒数10个数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "diabetes = datasets.load_diabetes()\n",
    "kfold = KFold(n_splits=10, shuffle=True, random_state=20)\n",
    "alphas=np.logspace(-2, 2, 100)\n",
    "print(alphas[0:10])\n",
    "print(alphas[-10:])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "（3）采用10折交叉验证的方法，在备选参数集中选择最优参数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#交叉验证寻找最优参数\n",
    "model = LassoCV(alphas=alphas, cv=kfold)\n",
    "model.fit(diabetes.data, diabetes.target)\n",
    "model.alpha_   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 4.3 重要特征选择"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "（1）导入可能用到的包"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "%matplotlib inline \n",
    "import matplotlib.pyplot as plt\n",
    "import numpy as np\n",
    "from sklearn import datasets\n",
    "from sklearn.linear_model import Lasso\n",
    "from sklearn.model_selection import KFold\n",
    "from sklearn.linear_model import LassoCV\n",
    "import pandas as pd\n",
    "from sklearn.preprocessing import StandardScaler"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "（2）加载diabetes数据集"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "diabetes = datasets.load_diabetes()\n",
    "scaler = StandardScaler()\n",
    "X = scaler.fit_transform(diabetes[\"data\"])\n",
    "Y = diabetes[\"target\"]\n",
    "names = diabetes[\"feature_names\"]"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "（3）设置10折交叉验证，并设置备选参数集（初始为0.01、终值为100、共包含100个数的等比数列），打印这个参数集的前10个数和倒数10个数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "kfold = KFold(n_splits=10, shuffle=True, random_state=30)\n",
    "alphas = np.logspace(-2, 2, 100)\n",
    "print(alphas[0:10])\n",
    "print(alphas[-10:])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "（4）通过交叉验证，在上述备选参数集中选择最优参数"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#交叉验证寻找最优参数\n",
    "model = LassoCV(alphas=alphas, cv=kfold)\n",
    "model.fit(X, Y)\n",
    "model.alpha_   "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "（5）查看模型最终选择了几个特征向量，剔除了几个特征向量"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#输出看模型最终选择了几个特征向量，剔除了几个特征向量\n",
    "import pandas as pd\n",
    "coef = pd.Series(model.coef_, index = names)\n",
    "print(\"Lasso picked \" + str(sum(coef != 0)) + \" variables and eliminated the other \" +  str(sum(coef == 0)) + \" variables\")"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "（6）画图展示下各变量的重要程度"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#画出特征变量的重要程度\n",
    "import matplotlib\n",
    "imp_coef = pd.concat([coef.sort_values().head(3),\n",
    "                     coef.sort_values().tail(3)])\n",
    "\n",
    "matplotlib.rcParams['figure.figsize'] = (8.0, 10.0)\n",
    "coef.plot(kind = \"barh\")\n",
    "plt.title(\"Coefficients in the Lasso Model\")\n",
    "plt.show() "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#本章结束"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}