{
 "cells": [
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 机器学习与社会科学应用"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第七章 深度学习"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "# 第四节 Word2Vec词嵌入算法"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" >郭峰    \n",
    "    教授、博士生导师  \n",
    "上海财经大学公共经济与管理学院  \n",
    "上海财经大学数实融合与智能治理实验室   \n",
    "    邮箱：guofengsfi@163.com</font> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "<font face=\"宋体\" >本节目录：  \n",
    "4.1.加载数据    \n",
    "4.2.分词  \n",
    "4.3.训练模型  \n",
    "4.4.Word2Vec超参</font> "
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.1 加载数据"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "from gensim.models import Word2Vec\n",
    "import multiprocessing\n",
    "import jieba\n",
    "import pandas as pd\n",
    "import os \n",
    "\n",
    "os.chdir(\"D:/python/机器学习与社会科学应用/演示数据/07深度学习/word2vec/\")\n",
    "\n",
    "cssci = pd.read_csv('cssci_clean_short.csv')\n",
    "sentences_list = cssci['abstract'].tolist()"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.2 分词"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#加载停用词表\n",
    "stopwords = [ line.strip() for line in open('stopword.txt','r').readlines()]\n",
    "#把关键词做成字典\n",
    "jieba.load_userdict(\"keyword.txt\") #加载自定义词典\n",
    "sentences_cut = []\n",
    "#结巴分词\n",
    "for ele in sentences_list:\n",
    "    cuts = jieba.cut(ele,cut_all=False)\n",
    "    new_cuts = []\n",
    "    for cut in cuts:\n",
    "        if (cut not in  stopwords) & (len(cut)>1):       # 剔除停用词和字符数小于1的词    \n",
    "            new_cuts.append(cut)\n",
    "    sentences_cut.append(new_cuts)\n",
    "print(sentences_cut[0: 10])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.3 训练模型"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#训练模型\n",
    "model = Word2Vec(sentences_cut,vector_size=10, min_count=1, window=5,sg=0, workers=multiprocessing.cpu_count())\n",
    "print(model)\n",
    "vectors = model.wv.vectors    \n",
    "# print(\"获取模型的全部向量\", vectors)\n",
    "#获取模型中全部的词\n",
    "words = model.wv.index_to_key   \n",
    "print(words[0: 20])"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.4 相似度计算"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {
    "scrolled": true,
    "tags": []
   },
   "outputs": [],
   "source": [
    "# 获取单个词的向量，以“研发”为例\n",
    "vec = model.wv['研发']\n",
    "print(vec)\n",
    "#计算2个词之间的余弦相似度\n",
    "print(model.wv.similarity(\"研发\", \"创新\"))\n",
    "# 找出前N个最相似的词\n",
    "model.wv.most_similar(positive=[\"创新\"],topn=10)  # positive表示与创新同方向的词"
   ]
  },
  {
   "cell_type": "markdown",
   "metadata": {},
   "source": [
    "### 4.5 Word2Vec超参"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "默认参数：\n",
    "sentences=None\n",
    "size=100\n",
    "alpha=0.025\n",
    "window=5\n",
    "min_count=5\n",
    "max_vocab_size=None\n",
    "sample=1e-3\n",
    "seed=1\n",
    "workers=3\n",
    "min_alpha=0.0001\n",
    "sg=0\n",
    "hs=0\n",
    "negative=5\n",
    "cbow_mean=1\n",
    "hashfxn=hash\n",
    "iter=5\n",
    "null_word=0\n",
    "trim_rule=None\n",
    "sorted_vocab=1\n",
    "batch_words=MAX_WORDS_IN_BATCH"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#参数含义：\n",
    "在gensim中，word2vec 相关的参数都在包gensim.models.word2vec中。完整函数如下：\n",
    "gensim.models.word2vec.Word2Vec(sentences=None,size=100,alpha=0.025,window=5, min_count=5, max_vocab_size=None, sample=0.001,seed=1, workers=3,min_alpha=0.0001, sg=0, hs=0, negative=5, cbow_mean=1, hashfxn=<built-in function hash>,iter=5,null_word=0, trim_rule=None, sorted_vocab=1, batch_words=10000)\n",
    "1) sentences: 我们要分析的语料，可以是一个列表，或者从文件中遍历读出。对于大语料集，建议使用BrownCorpus,Text8Corpus或lineSentence构建。\n",
    "2) size: 词向量的维度，默认值是100。这个维度的取值一般与我们的语料的大小相关，视语料库的大小而定。\n",
    "3) alpha： 是初始的学习速率，在训练过程中会线性地递减到min_alpha。\n",
    "4) window：即词向量上下文最大距离，skip-gram和cbow算法是基于滑动窗口来做预测。默认值为5。在实际使用中，可以根据实际的需求来动态调整这个window的大小。对于一般的语料这个值推荐在[5,10]之间。\n",
    "5) min_count:可以对字典做截断. 词频少于min_count次数的单词会被丢弃掉, 默认值为5。\n",
    "6) max_vocab_size: 设置词向量构建期间的RAM限制，设置成None则没有限制。\n",
    "7) sample: 高频词汇的随机降采样的配置阈值，默认为1e-3，范围是(0,1e-5)。\n",
    "8) seed：用于随机数发生器。与初始化词向量有关。\n",
    "9) workers：用于控制训练的并行数。\n",
    "10) min_alpha: 由于算法支持在迭代的过程中逐渐减小步长，min_alpha给出了最小的迭代步长值。随机梯度下降中每    轮的迭代步长可以由iter，alpha， min_alpha一起得出。对于大语料，需要对alpha, min_alpha,iter一起调参，来选                        择合适的三个值。\n",
    "11) sg: 即我们的word2vec两个模型的选择了。如果是0， 则是CBOW模型，是1则是Skip-Gram模型，默认是0即CBOW模型。\n",
    "12)hs: 即我们的word2vec两个解法的选择了，如果是0， 则是Negative Sampling，是1的话并且负采样个数negative大于0， 则是Hierarchical Softmax。默认是0即Negative Sampling。\n",
    "13) negative:如果大于零，则会采用negativesampling，用于设置多少个noise words（一般是5-20）。\n",
    "14) cbow_mean: 仅用于CBOW在做投影的时候，为0，则采用上下文的词向量之和，为1则为上下文的词向量的平均值。默认值也是1,不推荐修改默认值。\n",
    "15) hashfxn： hash函数来初始化权重，默认使用python的hash函数。\n",
    "16) iter: 随机梯度下降法中迭代的最大次数，默认是5。对于大语料，可以增大这个值。\n",
    "17) trim_rule： 用于设置词汇表的整理规则，指定那些单词要留下，哪些要被删除。可以设置为None（min_count会被使用）。\n",
    "18) sorted_vocab： 如果为1（默认），则在分配word index 的时候会先对单词基于频率降序排序。\n",
    "19) batch_words：每一批的传递给线程的单词的数量，默认为10000。\n"
   ]
  },
  {
   "cell_type": "code",
   "execution_count": null,
   "metadata": {},
   "outputs": [],
   "source": [
    "#本节结束"
   ]
  }
 ],
 "metadata": {
  "kernelspec": {
   "display_name": "Python 3 (ipykernel)",
   "language": "python",
   "name": "python3"
  },
  "language_info": {
   "codemirror_mode": {
    "name": "ipython",
    "version": 3
   },
   "file_extension": ".py",
   "mimetype": "text/x-python",
   "name": "python",
   "nbconvert_exporter": "python",
   "pygments_lexer": "ipython3",
   "version": "3.9.12"
  }
 },
 "nbformat": 4,
 "nbformat_minor": 4
}
