{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 机器学习与社会科学应用"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第四章 自然语言处理入门"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第二节 分词"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" >郭峰    \n",
    "    教授、博士生导师  \n",
    "上海财经大学公共经济与管理学院  \n",
    "上海财经大学数实融合与智能治理实验室    \n",
    "    邮箱：guofengsfi@163.com</font> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" >本节目录  \n",
    "2.1.jieba分词  \n",
    "2.2.停用词和自定义词典  \n",
    "2.3.分词综合练习  \n",
    "2.4.文本的向量化表达</font> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.1. jieba中文分词"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" >英文用空格风格了“词”，中文没有这样的分割，需要进行分割,中文分词有一套数学逻辑，并不是唯一的，更技术性介绍，可以百度学习资料</font> "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 需要先安装jieba模块：pip install jieba\n",
    "# jieba版本-'0.42.1'\n",
    "import jieba\n",
    "jieba.__version__"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"微软雅黑\" size=3>jieba分词方法比较</font>   "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# jieba支持多种分词方式，具体对比如下\n",
    "import jieba \n",
    "seg1 = jieba.cut(\"我来到南京大学上课\") # 默认精确模式，这个用来统计词频率\n",
    "seg2 = jieba.cut_for_search(\"我来到南京大学上课\") #搜索引擎模式\n",
    "seg3 = jieba.cut(\"我来到南京大学上课\", cut_all=True) # 全模式\n",
    "# print(seg1)\n",
    "# seg1=list(seg1)\n",
    "# print(list(seg1))\n",
    "print(' '.join(seg1))\n",
    "print(' '.join(seg2))\n",
    "print(' '.join(seg3))\n",
    "wordList = list(jieba.cut(\"南京市长江大桥视察南京大桥\") ) \n",
    "print(wordList)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"微软雅黑\" size=3>jieba分词实例</font> "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 政府工作报告\n",
    "path = 'D:/python/机器学习与社会科学应用/演示数据/04自然语言处理入门/政府工作报告/'\n",
    "report = open(path+\"政府工作报告2018.txt\",\"r\").read()\n",
    "report = report[200:400]\n",
    "print(report)\n",
    "seg1 = jieba.cut(report)\n",
    "print('************************我这是一个分割线*****************')\n",
    "print(' '.join(seg1))"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"微软雅黑\" size=3>分析一份报告出现的中文词汇频率，包含数字、标点等</font> "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true
   },
   "outputs": [],
   "source": [
    "import jieba\n",
    "from collections import Counter\n",
    "path='D:/python/机器学习与社会科学应用/演示数据/04自然语言处理入门/'\n",
    "with open(path+\"000002_2016.txt\", 'r', encoding= 'gbk' ,errors=\"ignore\") as file:\n",
    "    data1=file.read()\n",
    "print(\"总字数:\",len(data1))\n",
    "cut = jieba.cut(data1)\n",
    "data = dict(Counter(cut))\n",
    "#print(data)\n",
    "y = sorted(data.items(), key=lambda d:d[1], reverse = True)\n",
    "print(\"总词数：\",len(y))\n",
    "print(y[0:50])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 去掉标点符号和单字词\n",
    "yn = [c for c in y if len(c[0])>1]\n",
    "print(yn[0:10])\n",
    "\n",
    "# 去掉低频词\n",
    "yn = [c for c in yn if c[1]>100]\n",
    "print(yn[0:10])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 分析两份报告的中文词汇频率的交集，即包含的共同高频词汇\n",
    "\n",
    "import jieba\n",
    "from collections import Counter\n",
    "path = 'D:/python/机器学习与社会科学应用/演示数据/04自然语言处理入门/'\n",
    "\n",
    "with open(path+\"000001_2016.txt\", 'r', encoding= 'gbk' ,errors=\"ignore\") as file:\n",
    "    data=file.read()\n",
    "cut = jieba.cut(data)\n",
    "data1 = dict(Counter(cut))\n",
    "#print(data1)\n",
    "data11 = data1.copy()\n",
    "for i in data1:\n",
    "    if len(i)<2 or data1[i]<100:\n",
    "        data11.pop(i)\n",
    "with open(path+\"000002_2016.txt\", 'r', encoding= 'gbk' , errors=\"ignore\") as file:\n",
    "    data=file.read()\n",
    "cut = jieba.cut(data)\n",
    "data2 = dict(Counter(cut))\n",
    "data22 = data2.copy()\n",
    "for i in data2:\n",
    "    if len(i)<2 or data2[i]<100:\n",
    "        data22.pop(i)\n",
    "k1 = data11.keys()\n",
    "k2 = data22.keys()\n",
    "k3 = k1 & k2\n",
    "print(k3)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#剔除上面生成的ke3当中的数字\n",
    "def is_chinese(uchar):   \n",
    "    if uchar >= u'\\u4E00' and uchar <= u'\\u9FA5':\n",
    "        return True\n",
    "    else:\n",
    "        return False\n",
    "\n",
    "k4 = list(k3)\n",
    "k4 = {k for k in k4 if is_chinese(k)}\n",
    "print(k4)"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"微软雅黑\" size=3>标注词性</font> "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 将“省*年份”报告中关于环保的表述片段合成一个文档，\n",
    "# 然后统计词频，并标注词性\n",
    "import os\n",
    "from collections import Counter\n",
    "import jieba.posseg as pseg\n",
    "import pandas as pd\n",
    "\n",
    "path = \"D:/python/机器学习与社会科学应用/演示数据/04自然语言处理入门/政府工作报告/环保表述/\" \n",
    "pathn = \"D:/python/机器学习与社会科学应用/演示数据/04自然语言处理入门/政府工作报告/\" \n",
    "\n",
    "files = os.listdir(path)  # 得到文件夹下的所有文件名称  \n",
    "\n",
    "f_total = open(pathn+\"huanbao_total.txt\", \"w+\", encoding='gbk', errors='replace')  # 统计结果读入到一个txt文件当中\n",
    "for file in files: \n",
    "    if int(file[-8:-4])<2008:  # 2008年后和2008年年前的文件编码格式不一样\n",
    "        f = open(path+file,encoding='gbk', errors='replace') \n",
    "    else:\n",
    "        f = open(path+file,encoding='utf-8', errors='replace') \n",
    "    try:\n",
    "        text = f.read()\n",
    "        text = text.replace('\\n', '')\n",
    "        text = text.replace(' ', '')\n",
    "        f_total.write(file[0:-4]+'\\n')\n",
    "        f_total.write(text+\"\\n\")\n",
    "    except:\n",
    "        print(file)\n",
    "    f.close()\n",
    "f_total.close()\n",
    "\n",
    "word_list = []\n",
    "flag_list = []\n",
    "\n",
    "df1 = pd.DataFrame(columns=['word', 'type']) \n",
    "\n",
    "f_total = open(pathn+\"huanbao_total.txt\", \"r\", encoding='gbk',errors='replace') \n",
    "text = f_total.read()\n",
    "words = pseg.cut(text) # 默认精确模式，这个用来统计词频率\n",
    "for w in words:\n",
    "    word = w.word\n",
    "    flag = w.flag\n",
    "    word_list.append(word)\n",
    "    flag_list.append(flag)  # 标注词性\n",
    "    \n",
    "df1['word'] = word_list \n",
    "df1['type'] = flag_list\n",
    "df3=df1.groupby(['word','type']).size()\n",
    "df3.to_csv(pathn+\"huanbao_cut.csv\")\n",
    "df3.tail(50)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 练习：不使用pandas，而是将上述程序中的word_list和flag_list设计成两个字典，并分别输出为txt文件"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.2. 停用词和自定义词"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"微软雅黑\" size=3>停用词</font> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" >在任何自然语言中停用词是最常用的词。为了分析文本数据和构建NLP模型，这些停用词可能对构成文档的意义没有太多价值。通常，英语文本中使用的最常用词是\"the\"，\"is\"，\"in\"，\"for\"，\"where\"，\"when\"，\"to\"，\"at\"等。中文中的”的地得“等也没有太多价值，因此可以删除</font> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"微软雅黑\" size=3>对比：标准分词</font> "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import jieba\n",
    "path2 = \"D:/python/机器学习与社会科学应用/演示数据/04自然语言处理入门/\" \n",
    "with open(path2+\"000002_2016.txt\", 'r', encoding= 'gbk', errors=\"ignore\") as file:\n",
    "    data = file.read()\n",
    "cut = jieba.cut(data)\n",
    "cut_new = (' '.join(cut))\n",
    "print(cut_new[500:700])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"微软雅黑\" size=3>对比：去停用词</font> "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 建议停用词使用cntext库\n",
    "from cntext.dictionary import STOPWORDS_zh \n",
    "\n",
    "stopwords = list(STOPWORDS_zh)\n",
    "print(stopwords[0: 20])"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 把停用词做成字典【也有将其做成一个列表】\n",
    "import jieba\n",
    "path= \"D:/python/机器学习与社会科学应用/演示数据/04自然语言处理入门/tfidf相似度计算/\" \n",
    "stopwords = {}\n",
    "fstop = open(path+'stopword.txt', 'r')\n",
    "for eachWord in fstop:\n",
    "    stopwords[eachWord.strip()] = eachWord.strip()\n",
    "fstop.close()\n",
    "print(list(stopwords.values())[20:40])\n",
    "print(list(stopwords.keys())[20:40])\n",
    "\n",
    "\n",
    "with open (path2+\"000002_2016.txt\", 'r', encoding= 'gbk' ,errors=\"ignore\") as file:\n",
    "    data=file.read()\n",
    "cut = jieba.cut(data)\n",
    "\n",
    "wordList = list(cut)                      # 用jieba分词，对每行内容进行分词  \n",
    "cut2= ''  \n",
    "for word in wordList:\n",
    "    if word not in stopwords:  \n",
    "        cut2 += word  \n",
    "        cut2 += ' ' \n",
    "print(cut2[400:600])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"微软雅黑\" size=3>对比：自定义词典</font> "
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# 重新生成关键词词典\n",
    "# 自定义词典格式：词 词频 词性（可省略）\n",
    "path= \"D:/python/机器学习与社会科学应用/演示数据/04自然语言处理入门/cssci/\" \n",
    "\n",
    "keywords = open(path+\"keyword.txt\", encoding='utf8').read()\n",
    "keywords = keywords.strip().split('\\n')\n",
    "keywords = dict(Counter(keywords))\n",
    "keywords.pop('')\n",
    "with open(path+'keywords.txt','w',encoding='utf8') as f:\n",
    "    for key, value in keywords.items():\n",
    "        ele = key + \" \" + str(value) + '\\n'\n",
    "        f.write(ele)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "path= \"D:/python/机器学习与社会科学应用/演示数据/04自然语言处理入门/cssci/\" \n",
    "\n",
    "# 加载一下自定义词典即可\n",
    "jieba.load_userdict(path+\"keywords.txt\") # 加载自定义词典  \n",
    "\n",
    "\n",
    "with open (path2+\"000002_2016.txt\", 'r', encoding= 'gbk' ,errors=\"ignore\") as file:\n",
    "    data=file.read()\n",
    "cut = jieba.cut(data)\n",
    "cut_new=(' '.join(cut))\n",
    "print(cut_new[500:700])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.3.分词综合练习"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import pandas as pd\n",
    "import numpy as np\n",
    "import datetime\n",
    "starttime = datetime.datetime.now()\n",
    "\n",
    "path = \"D:/python/机器学习与社会科学应用/演示数据/04自然语言处理入门/cssci/\" \n",
    "f = open(path+\"cssci_clean.csv\",encoding='utf-8')\n",
    "papers = pd.read_csv(f,header=0,sep=',')\n",
    "f.close()\n",
    "\n",
    "# 计算主题模型时，需要将标题、关键词和摘要合并\n",
    "papers['keyword'] = papers['keyword'].fillna(\";\")\n",
    "papers['content'] = papers['title']+\";\"+papers['keyword']+papers['abstract']\n",
    "papers = papers[papers['content'].str.len()>100]   #将标题+关键词+摘要少于100字的样本删除\n",
    "print(\"标题+关键词+摘要少于100字的样本删除后数量:\"+str(len(papers))) #查看行*列数\n",
    "papers.to_csv(path+'cssci_clean_cut.csv',encoding='utf8')\n",
    "print(papers.shape)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 根据关键词为分词准备自定义词典\n",
    "import jieba\n",
    "import pandas as pd\n",
    "import numpy as np\n",
    "import datetime\n",
    "starttime = datetime.datetime.now()\n",
    "print(starttime)\n",
    "\n",
    "path = \"D:/python/机器学习与社会科学应用/演示数据/04自然语言处理入门/cssci/\" \n",
    "f = open(path+\"cssci_clean_cut.csv\",encoding='utf-8')\n",
    "papers = pd.read_csv(f,header=0,sep=',')\n",
    "f.close()\n",
    "\n",
    "# 去掉一些关键词较为特殊的样本\n",
    "# 关键词不能为空，且长度不超过30字符，早期系统自动识别的关键词数量较多\n",
    "papers = papers[papers['keyword'].str.len()>1]\n",
    "papers = papers[papers['keyword'].str.len()<30]\n",
    "# papers = papers[papers['kwnum']<6]\n",
    "\n",
    "keyword = papers['keyword'].sum()  # 聚合关键词，成为一个字符串\n",
    "print(\"关键词聚合后：\",keyword[0:100])\n",
    "\n",
    "keyword = keyword.split(\";\")  # 关键词分割成一个列表\n",
    "print(\"关键词分割后：\",keyword[0:10])\n",
    "print(\"关键词总个数：\",len(keyword))\n",
    "\n",
    "keyword = list(set(keyword)) # 去重复，set为集合的意思\n",
    "print(\"去重后关键词个数：\",len(keyword))\n",
    "\n",
    "keyword = [kw for kw in keyword if len(kw) > 1 and len(kw)<7]\n",
    "print(\"去掉过短过长的关键词后的个数：\",len(keyword))\n",
    "\n",
    "# 保存关键词，一个词一行\n",
    "fn=open(path+'keyword.txt','w',encoding='utf-8')\n",
    "for kw in keyword:\n",
    "    fn.write(str(kw)+\"\\n\")\n",
    "fn.close()\n",
    "\n",
    "endtime = datetime.datetime.now()\n",
    "print((endtime - starttime).seconds)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "import jieba   \n",
    "import jieba.posseg as pseg\n",
    "import pandas as pd\n",
    "import re\n",
    "import numpy as np\n",
    "import datetime\n",
    "starttime = datetime.datetime.now()\n",
    "\n",
    "path = \"D:/python/机器学习与社会科学应用/演示数据/04自然语言处理入门/cssci/\" \n",
    "f = open(path+\"cssci_clean_cut.csv\",encoding='utf-8')\n",
    "papers = pd.read_csv(f,header=0,sep=',')\n",
    "#papers=papers[0:100]\n",
    "\n",
    "papers['index'] = range(papers.shape[0])  # 之前的index序号不连贯了,重新整理\n",
    "papers.set_index('index',inplace=True)\n",
    "\n",
    "jieba.load_userdict(path+\"keywords.txt\") # 加载自定义词典  \n",
    "\n",
    "#把停用词做成字典\n",
    "stopwords = {}\n",
    "fstop = open(path+'stopword.txt', 'r')\n",
    "for eachWord in fstop:\n",
    "    stopwords[eachWord.strip()] = eachWord.strip()\n",
    "fstop.close()\n",
    "\n",
    "content = papers['content'].astype(str)                                 # 以读的方式打开文件\n",
    "\n",
    "\n",
    "#切词的函数\n",
    "def word_cut(x):\n",
    "    line = x.strip()\n",
    "    line = re.sub(\"[0-9\\s+\\.\\!\\/_,$%^*()?;；:-【】+\\\"\\']+|[+——！，;:。？、~@#￥%……&*（）]+\", \"\",line)\n",
    "    wordList = list(jieba.cut(line))     # 用结巴分词，对每行内容进行分词  \n",
    "    outStr = ''\n",
    "    for word in wordList:\n",
    "        if word not in stopwords:  \n",
    "            outStr += word  \n",
    "            outStr += ' '\n",
    "    return outStr\n",
    "\n",
    "papers['cut_out'] = papers['content'].apply(word_cut)\n",
    "papers[\"cutlength\"] = papers['cut_out'].str.len()\n",
    "papers = papers[papers['cutlength'] >10]  # 分词之后，部分出现空值等异常现象\n",
    "cut_out = papers['cut_out']\n",
    "papers.to_csv(path+'cssci_abstract_cut.csv',encoding='utf8',index=False)\n",
    "cut_out.to_csv(path+'cut_out.csv')\n",
    "print(papers.shape)\n",
    "endtime = datetime.datetime.now()\n",
    "print((endtime - starttime).seconds)\n",
    "papers.head()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "## 2.4.文本的向量化表达"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from numpy import *\n",
    "list1 = [['my', 'dog', 'has', 'flea', 'problems', 'help', 'please'],\n",
    "                   ['maybe', 'not', 'take', 'him', 'to', 'dog', 'park', 'stupid'],\n",
    "                   ['my', 'dalmation', 'is', 'so', 'cute', 'I', 'love', 'him'],\n",
    "                   ['stop', 'posting', 'stupid', 'worthless', 'garbage'],\n",
    "                   ['mr', 'licks', 'ate', 'my', 'steak', 'how', 'to', 'stop', 'him'],\n",
    "                   ['quit', 'buying', 'worthless', 'dog', 'food', 'stupid']]\n",
    "\n",
    "set_v = set([])\n",
    "for document in list1:\n",
    "    set_v = set_v | set(document)\n",
    "set_v = list(set_v)\n",
    "print(set_v)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 另外一种方法\n",
    "list_2 = [i for k in list1 for i in k]\n",
    "v2 = list(set(list_2))\n",
    "print(v2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 优化词集模型,将单词列表变为数字向量列表\n",
    "def bagOfWords2VecMN(vocabList, inputSet):\n",
    "    returnVec = [0] * len(vocabList)    # 获得所有单词等长的0列表\n",
    "    for word in inputSet:\n",
    "        if word in vocabList:\n",
    "            returnVec[vocabList.index(word)] += 1   # 对应单词位置加1\n",
    "    return returnVec\n",
    "test1 = bagOfWords2VecMN(set_v,list1[0])\n",
    "print(test1)\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "v2 = ['my', 'dog', 'has']\n",
    "test2 = bagOfWords2VecMN(set_v,v2)\n",
    "print(test2)"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "# 本节结束"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
