kanhatakeyama committed on
Commit
40c4ba5
1 Parent(s): 5aa4afe

Upload train_topic_model.ipynb

  1. train_topic_model.ipynb +361 -0
train_topic_model.ipynb ADDED
@@ -0,0 +1,361 @@
1
+ {
2
+ "cells": [
3
+ {
4
+ "cell_type": "code",
5
+ "execution_count": 1,
6
+ "metadata": {},
7
+ "outputs": [],
8
+ "source": [
+ "#!pip install bertopic\n",
+ "\n",
+ "# Script that creates a BERTopic model"
12
+ ]
13
+ },
14
+ {
15
+ "cell_type": "code",
16
+ "execution_count": 2,
17
+ "metadata": {},
18
+ "outputs": [
19
+ {
20
+ "name": "stderr",
21
+ "output_type": "stream",
22
+ "text": [
23
+ "/home/user/miniconda3/envs/ft/lib/python3.11/site-packages/tqdm/auto.py:21: TqdmWarning: IProgress not found. Please update jupyter and ipywidgets. See https://ipywidgets.readthedocs.io/en/stable/user_install.html\n",
24
+ " from .autonotebook import tqdm as notebook_tqdm\n"
25
+ ]
26
+ }
27
+ ],
28
+ "source": [
29
+ "from bertopic import BERTopic"
30
+ ]
31
+ },
32
+ {
33
+ "cell_type": "code",
34
+ "execution_count": 3,
35
+ "metadata": {},
36
+ "outputs": [],
37
+ "source": [
+ "from datasets import load_dataset\n",
+ "streaming = True\n",
+ "dataset_list = [\n",
+ "    load_dataset('mc4', 'ja', split='train', streaming=streaming),\n",
+ "    load_dataset('oscar', 'unshuffled_deduplicated_ja', split='train', streaming=streaming),\n",
+ "    load_dataset('cc100', lang='ja', split='train', streaming=streaming),\n",
+ "    load_dataset(\"augmxnt/shisa-pretrain-en-ja-v1\", split=\"train\", streaming=streaming),\n",
+ "    load_dataset(\"hpprc/wikipedia-20240101\", split=\"train\", streaming=streaming),\n",
+ "]"
47
+ ]
48
+ },
49
+ {
50
+ "cell_type": "code",
51
+ "execution_count": 4,
52
+ "metadata": {},
53
+ "outputs": [
54
+ {
55
+ "name": "stderr",
56
+ "output_type": "stream",
57
+ "text": [
58
+ "10000it [00:20, 482.63it/s]\n",
59
+ "10000it [00:19, 524.44it/s]\n",
60
+ "10000it [00:12, 778.96it/s]\n",
61
+ "10000it [00:25, 386.40it/s]\n",
62
+ "10000it [00:58, 171.79it/s]\n"
63
+ ]
64
+ }
65
+ ],
66
+ "source": [
+ "from tqdm import tqdm\n",
+ "docs = []\n",
+ "# Prepare training data: sample roughly 10,000 documents from each streamed dataset\n",
+ "for dataset in dataset_list:\n",
+ "    cnt = 0\n",
+ "    for record in tqdm(dataset):\n",
+ "        text = record[\"text\"]\n",
+ "        docs.append(text)\n",
+ "        cnt += 1\n",
+ "        # cnt > 10000 lets 10,001 documents through per dataset; use >= for exactly 10,000\n",
+ "        if cnt > 10000:\n",
+ "            break\n"
79
+ ]
80
+ },
81
+ {
82
+ "cell_type": "code",
83
+ "execution_count": 5,
84
+ "metadata": {},
85
+ "outputs": [
86
+ {
87
+ "name": "stderr",
88
+ "output_type": "stream",
89
+ "text": [
90
+ "2024-03-12 08:37:19,823 - BERTopic - Embedding - Transforming documents to embeddings.\n",
91
+ "Batches: 100%|██████████| 1563/1563 [00:50<00:00, 30.79it/s] \n",
92
+ "2024-03-12 08:38:20,622 - BERTopic - Embedding - Completed ✓\n",
93
+ "2024-03-12 08:38:20,622 - BERTopic - Dimensionality - Fitting the dimensionality reduction algorithm\n",
94
+ "2024-03-12 08:38:59,566 - BERTopic - Dimensionality - Completed ✓\n",
95
+ "2024-03-12 08:38:59,567 - BERTopic - Cluster - Start clustering the reduced embeddings\n",
96
+ "huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...\n",
97
+ "To disable this warning, you can either:\n",
98
+ "\t- Avoid using `tokenizers` before the fork if possible\n",
99
+ "\t- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)\n",
116
+ "2024-03-12 08:46:25,241 - BERTopic - Cluster - Completed ✓\n",
117
+ "2024-03-12 08:46:25,242 - BERTopic - Representation - Extracting topics from clusters using representation models.\n",
118
+ "2024-03-12 08:47:25,876 - BERTopic - Representation - Completed ✓\n",
119
+ "2024-03-12 08:47:25,952 - BERTopic - Topic reduction - Reducing number of topics\n",
120
+ "2024-03-12 08:48:28,300 - BERTopic - Topic reduction - Reduced number of topics from 435 to 342\n"
121
+ ]
122
+ }
123
+ ],
124
+ "source": [
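+ "# Fit a BERTopic model on the sampled documents\n",
+ "model_path = \"data/topic_model.bin\"\n",
+ "topic_model = BERTopic(language=\"japanese\", calculate_probabilities=True, verbose=True, nr_topics=\"20\")\n",
+ "topics, probs = topic_model.fit_transform(docs)\n",
+ "# NOTE: nr_topics is the string \"20\", not the int 20; this run's log reports a reduction from 435 to 342 topics, so the count was not capped at 20\n",
+ "\n",
+ "# topic_model = BERTopic.load(model_path)  # reload a saved model instead of retraining"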
132
+ ]
133
+ },
134
+ {
135
+ "cell_type": "code",
136
+ "execution_count": 6,
137
+ "metadata": {},
138
+ "outputs": [
139
+ {
140
+ "name": "stderr",
141
+ "output_type": "stream",
142
+ "text": [
143
+ "2024-03-12 08:48:42,599 - BERTopic - WARNING: When you use `pickle` to save/load a BERTopic model,please make sure that the environments in which you saveand load the model are **exactly** the same. The version of BERTopic,its dependencies, and python need to remain the same.\n",
144
+ "/home/user/miniconda3/envs/ft/lib/python3.11/site-packages/scipy/sparse/_index.py:143: SparseEfficiencyWarning: Changing the sparsity structure of a csr_matrix is expensive. lil_matrix is more efficient.\n",
145
+ " self._set_arrayXarray(i, j, x)\n"
146
+ ]
147
+ }
148
+ ],
149
+ "source": [
150
+ "topic_model.save(model_path)"
151
+ ]
152
+ },
153
+ {
154
+ "cell_type": "code",
155
+ "execution_count": 7,
156
+ "metadata": {},
157
+ "outputs": [
158
+ {
159
+ "data": {
160
+ "text/html": [
161
+ "<div>\n",
162
+ "<style scoped>\n",
163
+ " .dataframe tbody tr th:only-of-type {\n",
164
+ " vertical-align: middle;\n",
165
+ " }\n",
166
+ "\n",
167
+ " .dataframe tbody tr th {\n",
168
+ " vertical-align: top;\n",
169
+ " }\n",
170
+ "\n",
171
+ " .dataframe thead th {\n",
172
+ " text-align: right;\n",
173
+ " }\n",
174
+ "</style>\n",
175
+ "<table border=\"1\" class=\"dataframe\">\n",
176
+ " <thead>\n",
177
+ " <tr style=\"text-align: right;\">\n",
178
+ " <th></th>\n",
179
+ " <th>Topic</th>\n",
180
+ " <th>Count</th>\n",
181
+ " <th>Name</th>\n",
182
+ " <th>Representation</th>\n",
183
+ " <th>Representative_Docs</th>\n",
184
+ " </tr>\n",
185
+ " </thead>\n",
186
+ " <tbody>\n",
187
+ " <tr>\n",
188
+ " <th>0</th>\n",
189
+ " <td>-1</td>\n",
190
+ " <td>22559</td>\n",
191
+ " <td>-1_the_and_to_of</td>\n",
192
+ " <td>[the, and, to, of, 送料無料, in, 12, 11, 10, また]</td>\n",
193
+ " <td>[Створення сайту - Сторінка 419 - Форум\\nЧетве...</td>\n",
194
+ " </tr>\n",
195
+ " <tr>\n",
196
+ " <th>1</th>\n",
197
+ " <td>0</td>\n",
198
+ " <td>1585</td>\n",
199
+ " <td>0_送料無料_サマータイヤ_代引不可_中古</td>\n",
200
+ " <td>[送料無料, サマータイヤ, 代引不可, 中古, ブラック, diy, レディース, 工具,...</td>\n",
201
+ " <td>[上品なスタイル 【5/1(土)クーポン&amp;ワンダフルデー 4本1台分!!】 215/45R1...</td>\n",
202
+ " </tr>\n",
203
+ " <tr>\n",
204
+ " <th>2</th>\n",
205
+ " <td>1</td>\n",
206
+ " <td>1209</td>\n",
207
+ " <td>1_としあき_無念_name_投稿日</td>\n",
208
+ " <td>[としあき, 無念, name, 投稿日, id, 16, 名前, 柳宗理, no, 11]</td>\n",
209
+ " <td>[ハニーセレクト日曜昼の部テンプレセット髪型全然使ってなかったけど - ふたろぐばこ−二次元...</td>\n",
210
+ " </tr>\n",
211
+ " <tr>\n",
212
+ " <th>3</th>\n",
213
+ " <td>2</td>\n",
214
+ " <td>801</td>\n",
215
+ " <td>2_ワンピース_5cm_レディース_着丈</td>\n",
216
+ " <td>[ワンピース, 5cm, レディース, 着丈, 肩幅, 素材, 格安通販, シューズ, 袖丈...</td>\n",
217
+ " <td>[非売品 入学式 セレモニー 秋冬 秋 他と被らない 冬 小さいサイズ スカート セット 卒...</td>\n",
218
+ " </tr>\n",
219
+ " <tr>\n",
220
+ " <th>4</th>\n",
221
+ " <td>3</td>\n",
222
+ " <td>799</td>\n",
223
+ " <td>3_ベンジャミン_フランクリン_passion_thee</td>\n",
224
+ " <td>[ベンジャミン, フランクリン, passion, thee, nベンジャミン, 全業種, ...</td>\n",
225
+ " <td>[it's ok with me 意味\\t9\\n英語で「It's okay.(イッツオーケー...</td>\n",
226
+ " </tr>\n",
227
+ " <tr>\n",
228
+ " <th>...</th>\n",
229
+ " <td>...</td>\n",
230
+ " <td>...</td>\n",
231
+ " <td>...</td>\n",
232
+ " <td>...</td>\n",
233
+ " <td>...</td>\n",
234
+ " </tr>\n",
235
+ " <tr>\n",
236
+ " <th>337</th>\n",
237
+ " <td>336</td>\n",
238
+ " <td>11</td>\n",
239
+ " <td>336_abuse_you_counselling_emotional</td>\n",
240
+ " <td>[abuse, you, counselling, emotional, addiction...</td>\n",
241
+ " <td>[スピリチュアルカウンセリングは、魂の向上を目的とした、至高神からのヒーリングで魂を整えて頂...</td>\n",
242
+ " </tr>\n",
243
+ " <tr>\n",
244
+ " <th>338</th>\n",
245
+ " <td>337</td>\n",
246
+ " <td>10</td>\n",
247
+ " <td>337_京都の道_snorkeling_その1_中の池</td>\n",
248
+ " <td>[京都の道, snorkeling, その1, 中の池, k7, silfra, 今だけ特別...</td>\n",
249
+ " <td>[オアフ島(ホノルル) 福岡発 ◎今だけ無料で海の見える部屋へアップグレード!◎シェラトン・...</td>\n",
250
+ " </tr>\n",
251
+ " <tr>\n",
252
+ " <th>339</th>\n",
253
+ " <td>338</td>\n",
254
+ " <td>10</td>\n",
255
+ " <td>338_実印_いつ使う_件のレビュー例えば_いつ使うは</td>\n",
256
+ " <td>[実印, いつ使う, 件のレビュー例えば, いつ使うは, しっかりした会社, 印鑑, 実印の...</td>\n",
257
+ " <td>[冊子の「契約内容のお知らせ」ページをめくると、登録情報の変更シートがあります。\\n, 今回...</td>\n",
258
+ " </tr>\n",
259
+ " <tr>\n",
260
+ " <th>340</th>\n",
261
+ " <td>339</td>\n",
262
+ " <td>10</td>\n",
263
+ " <td>339_galaxy_s7_samsung_edge</td>\n",
264
+ " <td>[galaxy, s7, samsung, edge, i9195i, s8, 3i9200...</td>\n",
265
+ " <td>[ S8 PlusとS9 Plus - bajatyoutube.com\\n2019/0...</td>\n",
266
+ " </tr>\n",
267
+ " <tr>\n",
268
+ " <th>341</th>\n",
269
+ " <td>340</td>\n",
270
+ " <td>10</td>\n",
271
+ " <td>340_゚д゚_対価_労働_産業別組合</td>\n",
272
+ " <td>[゚д゚, 対価, 労働, 産業別組合, 工会, 約款, union, 契約書, 規約, 労...</td>\n",
273
+ " <td>[ただし、中小企業の事業主等、労働者以外でも業務の実態や災害の発生状況からみて、労働者に準じ...</td>\n",
274
+ " </tr>\n",
275
+ " </tbody>\n",
276
+ "</table>\n",
277
+ "<p>342 rows × 5 columns</p>\n",
278
+ "</div>"
279
+ ],
280
+ "text/plain": [
281
+ " Topic Count Name \\\n",
282
+ "0 -1 22559 -1_the_and_to_of \n",
283
+ "1 0 1585 0_送料無料_サマータイヤ_代引不可_中古 \n",
284
+ "2 1 1209 1_としあき_無念_name_投稿日 \n",
285
+ "3 2 801 2_ワンピース_5cm_レディース_着丈 \n",
286
+ "4 3 799 3_ベンジャミン_フランクリン_passion_thee \n",
287
+ ".. ... ... ... \n",
288
+ "337 336 11 336_abuse_you_counselling_emotional \n",
289
+ "338 337 10 337_京都の道_snorkeling_その1_中の池 \n",
290
+ "339 338 10 338_実印_いつ使う_件のレビュー例えば_いつ使うは \n",
291
+ "340 339 10 339_galaxy_s7_samsung_edge \n",
292
+ "341 340 10 340_゚д゚_対価_労働_産業別組合 \n",
293
+ "\n",
294
+ " Representation \\\n",
295
+ "0 [the, and, to, of, 送料無料, in, 12, 11, 10, また] \n",
296
+ "1 [送料無料, サマータイヤ, 代引不可, 中古, ブラック, diy, レディース, 工具,... \n",
297
+ "2 [としあき, 無念, name, 投稿日, id, 16, 名前, 柳宗理, no, 11] \n",
298
+ "3 [ワンピース, 5cm, レディース, 着丈, 肩幅, 素材, 格安通販, シューズ, 袖丈... \n",
299
+ "4 [ベンジャミン, フランクリン, passion, thee, nベンジャミン, 全業種, ... \n",
300
+ ".. ... \n",
301
+ "337 [abuse, you, counselling, emotional, addiction... \n",
302
+ "338 [京都の道, snorkeling, その1, 中の池, k7, silfra, 今だけ特別... \n",
303
+ "339 [実印, いつ使う, 件のレビュー例えば, いつ使うは, しっかりした会社, 印鑑, 実印の... \n",
304
+ "340 [galaxy, s7, samsung, edge, i9195i, s8, 3i9200... \n",
305
+ "341 [゚д゚, 対価, 労働, 産業別組合, 工会, 約款, union, 契約書, 規約, 労... \n",
306
+ "\n",
307
+ " Representative_Docs \n",
308
+ "0 [Створення сайту - Сторінка 419 - Форум\\nЧетве... \n",
309
+ "1 [上品なスタイル 【5/1(土)クーポン&ワンダフルデー 4本1台分!!】 215/45R1... \n",
310
+ "2 [ハニーセレクト日曜昼の部テンプレセット髪型全然使ってなかったけど - ふたろぐばこ−二次元... \n",
311
+ "3 [非売品 入学式 セレモニー 秋冬 秋 他と被らない 冬 小さいサイズ スカート セット 卒... \n",
312
+ "4 [it's ok with me 意味\\t9\\n英語で「It's okay.(イッツオーケー... \n",
313
+ ".. ... \n",
314
+ "337 [スピリチュアルカウンセリングは、魂の向上を目的とした、至高神からのヒーリングで魂を整えて頂... \n",
315
+ "338 [オアフ島(ホノルル) 福岡発 ◎今だけ無料で海の見える部屋へアップグレード!◎シェラトン・... \n",
316
+ "339 [冊子の「契約内容のお知らせ」ページをめくると、登録情報の変更シートがあります。\\n, 今回... \n",
317
+ "340 [ S8 PlusとS9 Plus - bajatyoutube.com\\n2019/0... \n",
318
+ "341 [ただし、中小企業の事業主等、労働者以外でも業務の実態や災害の発生状況からみて、労働者に準じ... \n",
319
+ "\n",
320
+ "[342 rows x 5 columns]"
321
+ ]
322
+ },
323
+ "execution_count": 7,
324
+ "metadata": {},
325
+ "output_type": "execute_result"
326
+ }
327
+ ],
328
+ "source": [
329
+ "topic_model.get_topic_info()"
330
+ ]
331
+ },
332
+ {
333
+ "cell_type": "code",
334
+ "execution_count": null,
335
+ "metadata": {},
336
+ "outputs": [],
337
+ "source": []
338
+ }
339
+ ],
340
+ "metadata": {
341
+ "kernelspec": {
342
+ "display_name": "ft",
343
+ "language": "python",
344
+ "name": "python3"
345
+ },
346
+ "language_info": {
347
+ "codemirror_mode": {
348
+ "name": "ipython",
349
+ "version": 3
350
+ },
351
+ "file_extension": ".py",
352
+ "mimetype": "text/x-python",
353
+ "name": "python",
354
+ "nbconvert_exporter": "python",
355
+ "pygments_lexer": "ipython3",
356
+ "version": "3.11.5"
357
+ }
358
+ },
359
+ "nbformat": 4,
360
+ "nbformat_minor": 2
361
+ }
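
Two details of the notebook above are easy to get wrong: the per-dataset sampling limit (the `cnt > 10000` check admits 10,001 documents) and the type of `nr_topics` (the string `"20"` did not reduce the model to 20 topics in the logged run, which ended with 342). The following is a minimal sketch of one way to make both explicit, not the author's method. The helper names `take_texts` and `coerce_nr_topics` and the toy generator `toy_stream` are hypothetical and do not appear in the notebook; the sketch assumes each dataset is an iterable of records with a `"text"` field, and that `nr_topics` is meant to be either an integer or `"auto"`, as BERTopic documents it.

```python
from itertools import islice
from typing import Iterable, Iterator, Mapping, Optional, Union


def take_texts(dataset: Iterable[Mapping[str, str]], limit: int) -> Iterator[str]:
    """Yield the "text" field of at most `limit` records (hypothetical helper)."""
    # islice stops after exactly `limit` records, so no manual counter is needed.
    for record in islice(dataset, limit):
        yield record["text"]


def coerce_nr_topics(value: Union[int, str, None]) -> Optional[Union[int, str]]:
    """Convert digit strings such as "20" to int; leave "auto", ints and None as-is."""
    if isinstance(value, str) and value.strip().isdigit():
        return int(value)
    return value


def toy_stream(source: int, n: int) -> Iterator[Mapping[str, str]]:
    """Toy stand-in for a streamed dataset (assumption: records are dicts with "text")."""
    for i in range(n):
        yield {"text": f"doc {i} from source {source}"}


# Collect at most 10 documents from each of two toy datasets.
docs = []
for source in range(2):
    docs.extend(take_texts(toy_stream(source, 25), limit=10))

# Normalise the topic-count setting before handing it to a model constructor.
nr_topics = coerce_nr_topics("20")
print(len(docs), type(nr_topics).__name__)
```

In the notebook, the same idea would mean passing each streamed dataset through `take_texts(dataset, 10000)` and passing the coerced value as `nr_topics`; whether 20 topics was actually the intended target is an assumption based on the original argument.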