# -*- coding: utf-8 -*-
import jieba
import gensim
import pickle
import re
# ----- read converted traditional_zh_wiki data ----- #
with open('traditional_zh_wiki_utf-8', 'r', encoding='utf8') as f:
    wiki = f.readlines()
# ----- data cleaning and re-segmentation with user-defined dict ----- #
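# Load the stopword list and the user dictionary once, not once per line.
with open('stopword.txt', 'r', encoding='utf8') as f:
    stopwords = set(f.read().split())  # assumes whitespace-separated entries, e.g. one word per line
jieba.load_userdict('target.test.txt')  # segment with the test word list

wiki_posts = []
for idx, content in enumerate(wiki, start=1):
    clean_content = re.sub(r'\s+', '', content)  # drop spaces and the trailing newline
    words = jieba.cut(clean_content, cut_all=False)
    wiki_posts.append([w for w in words if w not in stopwords])  # exact-match stopword filter
    print('Processing ', idx, '/', len(wiki))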
print('='*40)
# print(wiki_posts)
print('total number of sentences',len(wiki_posts)) # total:306,129 wiki posts
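# The stopword filter above behaves differently depending on whether the
# stopword list is held as one raw string or as a set of whole words.
# A minimal sketch of that difference; the stopword text and the tokens
# below are hypothetical illustrations, not the contents of stopword.txt.

```python
stopword_text = '的\n因此\n'      # hypothetical raw contents of a stopword file
tokens = ['因', '因此', '資料']    # hypothetical segmented tokens

# `in` on a string is a substring test, so '因' is dropped because it
# occurs inside the stopword '因此'.
kept_substring = [t for t in tokens if t not in stopword_text]

# `in` on a set is an exact-match test, so only whole stopwords are dropped.
stopword_set = set(stopword_text.split())
kept_exact = [t for t in tokens if t not in stopword_set]

print(kept_substring)
print(kept_exact)
```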
# ----- train word embedding ----- #
print('='*40)
print('Start training Word2Vec model...')
# Note: `size` and `iter` are the gensim 3.x parameter names;
# gensim 4.0 renamed them to `vector_size` and `epochs`.
model = gensim.models.word2vec.Word2Vec(
    sentences=wiki_posts, size=100, alpha=0.025, window=10, min_count=5,
    max_vocab_size=None, sample=0.001, seed=1, workers=3, min_alpha=0.0001,
    sg=0, hs=0, negative=5, cbow_mean=1, iter=5, null_word=0, trim_rule=None,
    sorted_vocab=1, batch_words=10000)  # train wiki sentences with w2v model
# ----- write training result as python pickle file ----- #
print('='*40)
print('Writing training result as pickle...')
filename = 'wiki_word_embedding_model.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)
print('W2V Training Finished!')
print('='*40)
# ----- vocabulary test ----- #
# ----- load model from pickle ----- #
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
res = loaded_model.wv.most_similar(positive=['淑女'], topn=1000)
for num, i in enumerate(res, start=1):
    print(num, i)
'''
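[Data source] Wikipedia
[Number of articles (amount of data used for w2v training)] 306,129 posts
[Target words used for testing, and the similarity values at which odd words appear (in ranges of 10 words)]
Category: l
word: 淑女
odd-word range: 女飾(0.59587) - 笨丈夫(0.59243)
word: 紳士
odd-word range: 御廚(0.40847) - 加拿大勳(0.40768)
Category: m
word: 人妻
odd-word range: 沈玉琳(0.46814) - 仕事(0.46727)
word: 帥哥
odd-word range: 愛搞(0.62271) - 女性格(0.62149)
Category: s
word: 肥宅
odd-word range: 非常帥(0.46381) - cpv(0.46193)
word: 正妹
odd-word range: 包小柏(0.61574) - 翁立友(0.61501)
How the threshold is determined: average the maximum and the minimum of the odd-word range values across the three categories
Maximum: 0.62271
Minimum: 0.40768
Final threshold: (0.62271 + 0.40768) / 2 = 0.515195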
'''
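# The threshold described in the notes above is the midpoint between the
# largest and smallest odd-word similarity values. A minimal sketch of that
# arithmetic, using the values recorded in the notes; the names
# `odd_word_ranges` and `midpoint_threshold` are introduced here for
# illustration only.

```python
# (upper, lower) similarity values of the odd-word range for each test word.
odd_word_ranges = {
    '淑女': (0.59587, 0.59243),
    '紳士': (0.40847, 0.40768),
    '人妻': (0.46814, 0.46727),
    '帥哥': (0.62271, 0.62149),
    '肥宅': (0.46381, 0.46193),
    '正妹': (0.61574, 0.61501),
}


def midpoint_threshold(ranges):
    """Average the overall maximum and minimum of all range endpoints."""
    values = [v for pair in ranges.values() for v in pair]
    return (max(values) + min(values)) / 2


threshold = midpoint_threshold(odd_word_ranges)
print(round(threshold, 6))
```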
res = loaded_model.wv.most_similar(positive=['辣妹'], topn=10000)
w2v_res = []
for relevant_word in res:
    # each entry is a (word, similarity) pair; keep words above the threshold
    if relevant_word[1] > 0.515195:
        w2v_res.append(relevant_word[0])
print(len(w2v_res))  # number of relevant words above threshold value
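# The filtering step above can also be written as a small reusable function.
# A sketch only: `words_above_threshold` is a hypothetical helper, and the
# sample scores are made up for illustration, not output of the trained model.

```python
def words_above_threshold(similar, threshold=0.515195):
    """Return the words whose similarity is strictly above `threshold`.

    `similar` is a list of (word, score) pairs, the shape returned by
    gensim's `most_similar`.
    """
    return [word for word, score in similar if score > threshold]


# Made-up (word, score) pairs standing in for a most_similar result.
sample = [('word_a', 0.71), ('word_b', 0.52), ('word_c', 0.515195), ('word_d', 0.31)]
selected = words_above_threshold(sample)
print(len(selected))
```

# A score exactly equal to the threshold is excluded, matching the strict
# `>` comparison used in the loop above.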
'''
[Threshold and the number of words above it for each target word]
Threshold = 0.515195
Category: l (11 words)
淑女 690
紳士 6
嬌娃 6597
姑娘 37
夥計 3499
女士 15
君子 1034
大哥 267
大姊 1539
少年 46
小子 106
Category: m (11 words)
美女 325
帥哥 1784
優質 55
人氣 16
人妻 65
色狼 2685
女友 1831
男友 1989
老公 2880
老婆 841
太太 1449
Category: s (11 words)
潮男 3542
網美 159
型男 2166
熟女 210
大媽 2720
女神 53
正妹 4063
美眉 1218
肥宅 2
宅男 821
辣妹 1993
'''