
Word vector file fails to load (词向量文件加载错误) #139

Open
lxysl opened this issue Apr 16, 2021 · 2 comments
lxysl commented Apr 16, 2021

When loading the Sogou News "Word + Character + Ngram 300d" file, named sgns.sogounews.bigram-char, with the following code, an error occurs:

import numpy as np

embeddings_index = {}
with open(WORD2VEC_PATH, encoding='utf-8') as f:
    for l in f.readlines():
        values = l.split()
        word = values[0]
        embeddings_index[word] = np.asarray(values[1:], dtype='float32')

The error is:

ValueError: could not convert string to float: '姚'

On inspection, I found that one line of the word-vector file reads:

扬 姚 -0.890708 1.429886 ......

So is the word supposed to be "扬 姚"? Or do "扬" and "姚" each map to the same vector?


Attached: my parsing code, which assumes "扬" and "姚" share the same vector:

import numpy as np

embeddings_index = {}
with open(WORD2VEC_PATH, encoding='utf-8') as f:
    for l in f.readlines():
        values = l.split()
        word = values[0]
        try:
            embeddings_index[word] = np.asarray(values[1:], dtype='float32')
        except ValueError:
            # Second token is also a word: assign the vector to both.
            word2 = values[1]
            embeddings_index[word] = np.asarray(values[2:], dtype='float32')
            embeddings_index[word2] = np.asarray(values[2:], dtype='float32')
shenshen-hungry (Collaborator) commented:
The character between "扬" and "姚" is a full-width space. Splitting on the ASCII space only,

values = l.split(' ')

handles it correctly.
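The difference can be demonstrated in a short sketch: with no argument, Python's str.split() treats the full-width (ideographic) space U+3000 as whitespace and splits the token in two, while split(' ') splits only on the ASCII space and keeps "扬　姚" as a single token. (The sample line below is constructed for illustration; the numbers are placeholders, not values from the actual file.)

```python
# A line whose first token contains a full-width space (U+3000).
line = "扬\u3000姚 -0.890708 1.429886"

# Default split(): U+3000 counts as whitespace, so the token is broken
# apart and "姚" ends up where a float is expected.
print(line.split())      # ['扬', '姚', '-0.890708', '1.429886']

# split(' '): only the ASCII space delimits, so the full token survives.
print(line.split(' '))   # ['扬\u3000姚', '-0.890708', '1.429886']
```

This is why the original loop raised ValueError on '姚': after the default split, values[1] was a character rather than a number.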

Yufanggg commented:
Are you running this on Windows or Linux?
When I load the file with the same code, I get this error:

embeddings_index[word] = np.asarray(values[1:], dtype='float32')
TypeError: list indices must be integers or slices, not str
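That TypeError is unrelated to the file's contents: it means embeddings_index itself is being indexed with a string when it does not support string keys. One plausible cause (an assumption, since the surrounding code isn't shown) is that embeddings_index was initialized as a list rather than a dict, which this minimal sketch reproduces:

```python
import numpy as np

# Hypothetical reproduction: embeddings_index initialized as a list
# instead of a dict. Indexing a list with a string key raises the
# same TypeError reported above.
embeddings_index = []  # should be {} for word -> vector lookup
values = "的 0.1 0.2".split()
word = values[0]
try:
    embeddings_index[word] = np.asarray(values[1:], dtype='float32')
except TypeError as e:
    print(e)  # list indices must be integers or slices, not str
```

Initializing with embeddings_index = {} before the loop avoids the error; the operating system should not matter here.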
