skip_grams #25
Comments
I hadn't read the code carefully. vocab_to_int takes set(words) and uses each word's index in that set as its one-hot identifier.
I think there is a problem with this part of the logic:
words_count = Counter(words)
words = [w for w in words if words_count[w] > 50]
In [19]:
vocab = set(words)
vocab_to_int = {w: c for c, w in enumerate(vocab)}
int_to_vocab = {c: w for c, w in enumerate(vocab)}
In [20]:
print("total words: {}".format(len(words)))
print("unique words: {}".format(len(set(words))))
total words: 8623686
unique words: 6791
In [21]:
int_words = [vocab_to_int[w] for w in words]
As far as I can tell, vocab_to_int just maps each word to the position where it first appears.
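One way to check that claim is to build the same mappings on a tiny corpus. The sketch below is only an illustration: the toy `words` list and the helper names `ids_are_dense`, `round_trip_ok` and `first_positions` are assumptions, not part of the notebook. It suggests that the value stored in vocab_to_int is an index into the set of unique words (0 to len(vocab) - 1, in arbitrary set order), not a position in the original word list.

```python
# Toy corpus, assumed for illustration only (the notebook uses a large text file).
words = ["the", "cat", "sat", "on", "the", "mat", "the", "cat"]

# Same construction as in the notebook cell above.
vocab = set(words)
vocab_to_int = {w: c for c, w in enumerate(vocab)}
int_to_vocab = {c: w for c, w in enumerate(vocab)}

# Every id in 0..len(vocab)-1 is used exactly once.
ids_are_dense = sorted(vocab_to_int.values()) == list(range(len(vocab)))

# The two dictionaries are inverses of each other.
round_trip_ok = all(int_to_vocab[vocab_to_int[w]] == w for w in vocab)

# Position of each word's first occurrence in the corpus, for comparison.
first_positions = {w: words.index(w) for w in vocab}

print("ids are dense:", ids_are_dense)
print("round trip ok:", round_trip_ok)
print("vocab_to_int:", vocab_to_int)
print("first positions:", first_positions)
```

Here "mat" first appears at position 5, but the largest id is 4, so the ids cannot be first-occurrence positions. The exact id each word gets may differ between runs because set ordering is not guaranteed.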
t = 1e-5  # the t value
threshold = 0.9  # drop-probability threshold
And then this index is actually used to compute the word frequencies?? Can anyone tell me what is going on here?
int_word_counts = Counter(int_words)
total_count = len(int_words)
word_freqs = {w: c/total_count for w, c in int_word_counts.items()}
prob_drop = {w: 1 - np.sqrt(t / word_freqs[w]) for w in int_word_counts}
Subsample the words:
train_words = [w for w in int_words if prob_drop[w] < threshold]
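To see why counting the integer ids should be fine, here is a minimal sketch on an assumed toy corpus. Since vocab_to_int gives every distinct word its own id, `Counter(int_words)` should hold the same counts as `Counter(words)`, just keyed by id. The toy data and the name `counts_match` are hypothetical, and `math.sqrt` stands in for `np.sqrt` only so the snippet runs without NumPy.

```python
from collections import Counter
import math

# Toy corpus, assumed for illustration: "the" x6, "cat" x3, "mat" x1.
words = ["the"] * 6 + ["cat"] * 3 + ["mat"]

vocab = set(words)
vocab_to_int = {w: c for c, w in enumerate(vocab)}
int_words = [vocab_to_int[w] for w in words]

# Count by word and by id; the counts should agree for every word.
word_counts = Counter(words)
int_word_counts = Counter(int_words)
counts_match = all(
    int_word_counts[vocab_to_int[w]] == n for w, n in word_counts.items()
)

# Same formulas as in the notebook, keyed by id.
t = 1e-5
total_count = len(int_words)
word_freqs = {w: c / total_count for w, c in int_word_counts.items()}
prob_drop = {w: 1 - math.sqrt(t / word_freqs[w]) for w in int_word_counts}

print("counts match:", counts_match)
for w in vocab:
    print(w, word_counts[w], round(prob_drop[vocab_to_int[w]], 4))
```

In this sketch the more frequent word gets the higher drop probability, which is what the subsampling formula intends. With such a small corpus every frequency is far above t, so all drop probabilities come out close to 1; the 0.9 threshold only becomes selective on a corpus of realistic size.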