question about subsampler #7

majortomz · 2018-07-11T03:26:19Z

Line 58 in 6966b1c

 subsampler = dict([(word, 1 - sqrt(subsample / count)) for word, count in six.iteritems(vocab) if count > subsample]) #subsampling technique 

I am confused about the subsampler in corpus2pairs. I think 1 - sqrt(subsample / count) should be replaced with 1 - sqrt(subsample / (count / total_word_count_in_vocab)), i.e. the threshold should be divided by the word's relative frequency rather than by its raw count.

P.S. I may have misunderstood your implementation. Also, in the original word2vec.c the discard probability is actually 1 - (sqrt(subsample / (count / total_word_count_in_vocab)) + subsample / (count / total_word_count_in_vocab)).
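To make the difference between the two formulas concrete, here is a minimal sketch. The function names and the toy frequencies are illustrative assumptions and come from neither the repository nor word2vec.c. Here freq stands for count / total_word_count_in_vocab and t for the subsample threshold; clamping negative values to 0 is my own choice, meaning "never discard".

```python
from math import sqrt


def discard_prob_simple(freq, t):
    """Discard probability 1 - sqrt(t / f), clamped at 0 (hypothetical helper)."""
    return max(0.0, 1.0 - sqrt(t / freq))


def discard_prob_word2vec_c(freq, t):
    """Discard probability 1 - (sqrt(t / f) + t / f), clamped at 0 (hypothetical helper)."""
    keep = sqrt(t / freq) + t / freq  # probability of keeping the word
    return max(0.0, 1.0 - keep)


if __name__ == "__main__":
    t = 1e-3  # toy threshold, chosen only for illustration
    for freq in (0.05, 0.01, 0.002):
        print(freq, discard_prob_simple(freq, t), discard_prob_word2vec_c(freq, t))
```

Before clamping, the two variants differ by exactly t / f, so the second one discards frequent words slightly less aggressively.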


zhezhaoa · 2018-07-11T04:21:34Z

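It's an interesting question. Notice that we have subsample *= train_uni_num before subsampler = dict([(word, 1 - sqrt(subsample / count)) for word, count in six.iteritems(vocab) if count > subsample]). By the time the subsampler is built, subsample has already been multiplied by train_uni_num, so subsample / count equals the original threshold divided by (count / train_uni_num). Maybe the variable name is inappropriate.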
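A small sketch of why the two forms agree, under the assumption that train_uni_num is the total token count over the vocabulary. The function names and the toy vocabulary are hypothetical, and plain dict.items() replaces six.iteritems for Python 3.

```python
from math import sqrt


def subsampler_by_count(vocab, subsample, total):
    """Scale the threshold by the total first, then compare against raw counts."""
    threshold = subsample * total  # mirrors `subsample *= train_uni_num`
    return {w: 1 - sqrt(threshold / c) for w, c in vocab.items() if c > threshold}


def subsampler_by_frequency(vocab, subsample, total):
    """Compare the unscaled threshold against relative frequencies."""
    return {
        w: 1 - sqrt(subsample / (c / total))
        for w, c in vocab.items()
        if c / total > subsample
    }


if __name__ == "__main__":
    vocab = {"the": 5000, "cat": 40, "sat": 10}  # toy counts, illustration only
    total = sum(vocab.values())
    print(subsampler_by_count(vocab, 1e-2, total))
    print(subsampler_by_frequency(vocab, 1e-2, total))
```

Both functions select the same words and give the same discard probabilities up to floating-point rounding, because (subsample * total) / count is algebraically equal to subsample / (count / total).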
