Wrong pinyin tone #1

Open
begeekmyfriend opened this issue Jul 6, 2019 · 19 comments


@begeekmyfriend

>>> g2p('一心一意')
[('一心一意', 'i', 'yi1 xin1 yi1 yi4', "/concentrating one's thoughts and efforts/single-minded/bent on/intently/", '一心一意')]

Should be 'yi4 xin1 yi2 yi4'.
See mozillazg/phrase-pinyin-data#20

begeekmyfriend changed the title from "Wrong pinyin" to "Wrong pinyin tone" on Jul 6, 2019
@Kyubyong
Owner

Kyubyong commented Jul 6, 2019

Thanks for this, but I'm not sure this is WRONG. Regular tone changes in Chinese are not written, according to https://resources.allsetlearning.com/chinese/pronunciation/Tone_change_rules#Why_Tone_Changes_Are_Not_Written. Instead, I think it's better to distinguish the two: original vs. rule-applied.

@Jackiexiao

cedict.txt only has yi1 for 一 and only bu4 for 不. If you want to distinguish the tone-changed variants, you need data.

@Kyubyong
Owner

Kyubyong commented Jul 7, 2019

I think some simple rules can help. I'm working on them. I'll be back in hours.

@Kyubyong
Owner

Kyubyong commented Jul 7, 2019

@begeekmyfriend I've added a second pronunciation with the tone change rules applied. Please upgrade the library, check it, and let me know if it is okay. Thanks for pointing this out.

>>> g2p("一心一意")
[('一心一意',
'i',
'yi1 xin1 yi1 yi4',  # this is the original pronunciation
'yi4 xin1 yi2 yi4',  # this is the descriptive pronunciation
"/concentrating one's thoughts and efforts/single-minded/bent on/intently/",
'一心一意')]
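
To see which syllables the tone change rules touched, the two pronunciation fields can be compared position by position. This is a minimal sketch that works on the tuple shown above rather than calling the library; the helper name `changed_syllables` is hypothetical, and the field positions (index 2 for the original, index 3 for the descriptive pronunciation) are read off that output.

```python
def changed_syllables(original, descriptive):
    """Return (index, original, descriptive) for every syllable that differs."""
    orig = original.split()
    desc = descriptive.split()
    if len(orig) != len(desc):
        raise ValueError("both pronunciations must have the same number of syllables")
    return [(i, o, d) for i, (o, d) in enumerate(zip(orig, desc)) if o != d]


# Entry copied from the output above (not produced by calling the library here).
entry = ('一心一意', 'i', 'yi1 xin1 yi1 yi4', 'yi4 xin1 yi2 yi4',
         "/concentrating one's thoughts and efforts/single-minded/bent on/intently/",
         '一心一意')

# Assumed field positions: 2 = original pronunciation, 3 = descriptive pronunciation.
print(changed_syllables(entry[2], entry[3]))
# [(0, 'yi1', 'yi4'), (2, 'yi1', 'yi2')]
```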

@Jackiexiao

Also, the 3-3 to 2-3 change (two consecutive third tones) actually needs to be predicted.

For example:
有一次 -> you3 yi2 ci4, but 第一次 -> di4 yi1 ci4; the pronunciation of 一 depends on semantics.

@begeekmyfriend
Author

begeekmyfriend commented Jul 8, 2019

有一次 should follow the segmentation 有 + 一次 (not 有一, because that is not a word in Chinese), and 第一次 should follow the segmentation 第一 + 次.

@Jackiexiao

More examples:

十一{yi1}二岁来到戏校
同年十一{yi1}月
一{yi1}九八二年英文版
欧洲统一{yi1}步伐
吉林省一{yi1}号工程
一{yi1}是选拔优秀干部

@Kyubyong
Owner

Kyubyong commented Jul 8, 2019

@Jackiexiao Can you clarify what you mean? It's confusing. The current results for the strings above are like:

有一次
original: you3 yi1 ci4
descriptive (tone changed): you3 yi2 ci4

第一次。
original: di4 yi1 ci4 。
descriptive (tone changed): di4 yi2 ci4 。

十一二岁来到戏校
original: shi2 yi1 er4 sui4 lai2 dao4 xi4 xiao4
descriptive (tone changed): shi2 yi2 er4 sui4 lai2 dao4 xi4 xiao4

同年十一月
original: tong2 nian2 shi2 yi1 yue4
descriptive (tone changed): tong2 nian2 shi2 yi2 yue4

一九八二年英文版
original: yi1 jiu3 ba1 er4 nian2 ying1 wen2 ban3
descriptive (tone changed): yi4 jiu3 ba1 er4 nian2 ying1 wen2 ban3

欧洲统一步伐
original: ou1 zhou1 tong3 yi1 bu4 fa2
descriptive (tone changed): ou1 zhou1 tong3 yi2 bu4 fa2

吉林省一号工程
original: ji2 lin2 sheng3 yi1 hao4 gong1 cheng2
descriptive (tone changed): ji2 lin2 sheng3 yi2 hao4 gong1 cheng2

一是选拔优秀干部
original: yi1 shi4 xuan3 ba2 you1 xiu4 gan4 bu4
descriptive (tone changed): yi2 shi4 xuan3 ba2 you1 xiu4 gan4 bu4

Which parts are incorrect?

@begeekmyfriend
Author

Well, 一, 不, and the double 3rd tone are really confusing when you first learn Chinese. Let me show you the rough rule for 一, based on the sentences above.

有一次: 一次 means "one time", which is a regular word in Chinese. Therefore the tone of 一 depends on the following character 次. And 有一 is not a segmented word. So we read it as yi2 ci4.

第一次 means "the first time", where 第一 is a segmented word in Chinese. So we ignore the 次 behind it and read it as di4 yi1 ci4.

十一二岁: here 十一 can be segmented as a word. So we ignore the following 二 and read it as shi2 yi1 er4.

同年十一月: here 十一 can be segmented as a word, so we read it as shi2 yi1 yue4.

一九八二年英文版: here 一 can be regarded as a single number character, parallel with the digits that follow it. So we read it as yi1 jiu3 ba1 er4.

欧洲统一步伐: 统一 is a separate word from 欧洲 and 步伐, and no character follows 一 inside that word, so we read it as tong3 yi1.

吉林省一号工程: 一号 is separate from 吉林省 and 工程, and 一号 is not a regular word like "one day" or "one time". It only means "number one", so we read it as yi1 hao4.

一是选拔优秀干部: here 一 is a single number word and 一是 is not a segmented word. So we read it as yi1.
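
The segmentation-based reading described above can be sketched as a function that changes the tone of 一 only inside one segmented word. This is an illustration under stated assumptions, not the library's implementation: the input is assumed to be already segmented into `(word, pinyin)` pairs, the function names and the `DIGITS` set are hypothetical, and a word starting with 第 is treated as an ordinal.

```python
DIGITS = set("零〇一二三四五六七八九")  # assumed digit set, for illustration only


def sandhi_within_word(word, pinyin):
    """Apply the 一 tone change inside one segmented word only."""
    chars = list(word)
    syllables = pinyin.split()
    if len(chars) != len(syllables):
        raise ValueError("expected one syllable per character")
    out = list(syllables)
    for i, ch in enumerate(chars):
        if ch != "一" or syllables[i] != "yi1":
            continue
        if i == len(chars) - 1:
            continue  # word-final 一 keeps first tone
        if chars[i + 1] in DIGITS or chars[0] == "第":
            continue  # followed by a digit, or part of an ordinal
        next_tone = syllables[i + 1][-1]
        if next_tone == "4":
            out[i] = "yi2"
        elif next_tone in {"1", "2", "3"}:
            out[i] = "yi4"
    return " ".join(out)


def sandhi_sentence(segmented):
    """Join the per-word results for a list of (word, pinyin) pairs."""
    return " ".join(sandhi_within_word(w, p) for w, p in segmented)


print(sandhi_sentence([("有", "you3"), ("一次", "yi1 ci4")]))  # you3 yi2 ci4
print(sandhi_sentence([("第一", "di4 yi1"), ("次", "ci4")]))    # di4 yi1 ci4
```

A sketch like this still gets 一号 wrong (it would produce yi2 hao4), because telling "number one" from "one of something" needs meaning, not just segmentation.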

@Kyubyong
Owner

Kyubyong commented Jul 8, 2019

According to https://en.wikipedia.org/wiki/Standard_Chinese_phonology#Tone_sandhi

For 一 yī:

    1. 一 is pronounced with second tone when followed by a fourth tone syllable.

        Example: 一定 (yī+dìng, "must") becomes yídìng [i˧˥tiŋ˥˩]

    2. Before a first, second or third tone syllable, 一 is pronounced with fourth tone.

        Examples: 一天 (yī+tiān, "one day") becomes yìtiān [i˥˩tʰjɛn˥], 一年 (yī+nián, "one year") becomes yìnián [i˥˩njɛn˧˥], 一起 (yī+qǐ, "together") becomes yìqǐ [i˥˩t͡ɕʰi˨˩˦].

    3. When final, or when it comes at the end of a multi-syllable word (regardless of the first tone of the next word), 一 is pronounced with first tone. It also has first tone when used as an ordinal number (or part of one), and when it is immediately followed by any digit (including another 一; hence both syllables of the word 一一 yīyī and its compounds have first tone).
    4. When 一 is used between two reduplicated words, it may become neutral in tone (e.g. 看一看 kànyikàn ("to take a look at")).
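
Rules 1 to 3 above can be written down as a small decision function. This is only a sketch of the quoted rules: the function name `yi_tone`, its parameters, and the `DIGITS` set are hypothetical, rule 4 (neutral tone in reduplication) is left out, and whether 一 is word-final or ordinal has to be supplied by the caller.

```python
DIGITS = set("零〇一二三四五六七八九")  # assumed digit set, for illustration only


def yi_tone(next_char, next_tone, word_final=False, ordinal=False):
    """Return the tone number of 一 given what follows it.

    next_char / next_tone describe the following syllable (None if there is none).
    """
    # Rule 3: final, end of a word, ordinal, or followed by a digit -> first tone.
    if word_final or ordinal or next_char is None or next_char in DIGITS:
        return 1
    # Rule 1: before a fourth tone -> second tone.
    if next_tone == 4:
        return 2
    # Rule 2: before a first, second or third tone -> fourth tone.
    if next_tone in (1, 2, 3):
        return 4
    # Anything else (e.g. a neutral tone) is not covered by the quoted rules.
    return 1


print(yi_tone("定", 4))  # 2, as in 一定
print(yi_tone("天", 1))  # 4, as in 一天
print(yi_tone("九", 3))  # 1, followed by a digit
```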

So are the rules 1 and 2 applied word-internally only? In other words, when 一 is followed by a fourth-tone character which belongs to a separate word, 一 is read as first tone, not second tone?

@begeekmyfriend
Author

That is right for what you have learned.

@Jackiexiao

Here is another interesting example:

一{yi1}线城市
一{yi2}线希望

@begeekmyfriend
Author

begeekmyfriend commented Jul 10, 2019

一线希望 can be regarded as a regular word in this case, while 一线城市 should be segmented as 一, 线 and 城市. That is why Chinese always drives you mad.

@Kyubyong
Owner

I'm looking at the literature about the tone change rules. Unfortunately, most of them are not clear about the boundaries. But some say the tone change rules MAY work across word boundaries. If my understanding is correct, things are more complicated. If we just think all the tone change rules including third tone, 一, and 不 occur word-internally, things are simple, but I'm not sure if that's true.

@begeekmyfriend
Author

I do not think anyone can get Chinese pinyin conversion totally correct. There are no rules, only conventions. An enormous pinyin dictionary is indispensable for this issue. That is all we can do about it.
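
A dictionary-first approach could look like the sketch below: try the longest phrase entry first and fall back to per-character defaults. The tables, their names, and the `to_pinyin` function are hypothetical and hold only a handful of entries taken from examples in this thread; a real dictionary would be far larger.

```python
# Tiny illustrative tables; a real phrase dictionary would have many thousands of entries.
PHRASE_EXCEPTIONS = {
    "一心一意": "yi4 xin1 yi2 yi4",
    "一线希望": "yi2 xian4 xi1 wang4",
    "第一": "di4 yi1",
    "一次": "yi2 ci4",
}
CHAR_DEFAULTS = {
    "一": "yi1", "有": "you3", "第": "di4", "次": "ci4",
    "线": "xian4", "城": "cheng2", "市": "shi4",
}


def to_pinyin(text, phrases=PHRASE_EXCEPTIONS, chars=CHAR_DEFAULTS):
    """Greedy longest-match lookup, left to right."""
    max_len = max(map(len, phrases), default=1)
    out = []
    i = 0
    while i < len(text):
        # Try phrase entries from the longest candidate down to two characters.
        for n in range(min(max_len, len(text) - i), 1, -1):
            piece = text[i:i + n]
            if piece in phrases:
                out.append(phrases[piece])
                i += n
                break
        else:
            # No phrase matched: use the single-character default (or the raw character).
            out.append(chars.get(text[i], text[i]))
            i += 1
    return " ".join(out)


print(to_pinyin("有一次"))    # you3 yi2 ci4
print(to_pinyin("第一次"))    # di4 yi1 ci4
print(to_pinyin("一线城市"))  # yi1 xian4 cheng2 shi4
```

Greedy matching is the simplest choice here; it silently depends on the phrase table containing the right exceptions, which is exactly why the size and quality of the dictionary matter.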

@Kyubyong
Owner

Okay. I've updated it to 0.9.9.3. I tried to refine the rules. Feel free to check it.

@Weil2017

Hi Kyubyong,
The tone change for "一" also depends on context.
Some more complicated examples:
一(yi1)层 means the first floor; 一(yi4)层 means one floor or one layer.
一(yi1)级 means the first level (class); 一(yi4)级 means one level (more or less).

Have you considered using a machine learning model such as a CRF to predict the tone change of 一?
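
For a sequence model of that kind, each 一 would first have to be described by context features. The sketch below shows one possible feature extractor over a segmented sentence; the function name, the feature names, and the example sentence are all made up for illustration, and no particular CRF toolkit is assumed.

```python
def yi_features(words, w_idx, c_idx):
    """Context features for the 一 at character c_idx of words[w_idx]."""
    word = words[w_idx]
    if word[c_idx] != "一":
        raise ValueError("position does not point at 一")

    prev_word = words[w_idx - 1] if w_idx > 0 else "<BOS>"
    next_word = words[w_idx + 1] if w_idx + 1 < len(words) else "<EOS>"

    # Neighbouring characters, looking across word boundaries when needed.
    if c_idx > 0:
        prev_char = word[c_idx - 1]
    elif w_idx > 0:
        prev_char = prev_word[-1]
    else:
        prev_char = "<BOS>"

    if c_idx + 1 < len(word):
        next_char = word[c_idx + 1]
    elif w_idx + 1 < len(words):
        next_char = next_word[0]
    else:
        next_char = "<EOS>"

    return {
        "word": word,
        "prev_word": prev_word,
        "next_word": next_word,
        "prev_char": prev_char,
        "next_char": next_char,
        "is_word_initial": c_idx == 0,
        "is_word_final": c_idx == len(word) - 1,
    }


# Made-up segmented sentence containing 一层.
print(yi_features(["住", "在", "一层"], 2, 0))
```

Features like these only capture local context; cases such as 一层 ("first floor" vs. "one layer") would likely need labeled training data with wider context to separate.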

Thanks.

@begeekmyfriend
Author

I have found a well designed Chinese pinyin dictionary from espeak with 21567 single characters plus 36098 compound exceptions (includes 332 added 'yi' and 10720 added 'bu' exceptions, and 9713 extra 2-syllable words for 3rd-tone sandhi blocking). Would you like to replace the original one with it @Kyubyong ?

@JohnHerry

It is hard to get the correct tone every time for some characters.
As for "一"
一心一意 yi4 xin1 yi2 yi4 (yi1 xin1 yi2 yi4 is also fine in spoken Chinese)
赵一心 zhao4 yi1 xin1
一起 yi4 qi3
一起案件 yi1 qi3 an4 jian4
三百零一 san1 bai3 ling2 yi1
看一看 kan4 yi5 kan4
一看究竟 yi2 kan4 jiu1 jing4
独一无二 du2 yi1 wu2 er4
一无所有 yi4 wu2 suo3 you3

As for "不"
来不来 lai2 bu5 lai2
不来算了 bu4 lai2 suan4 le5
不得不说 bu4 de2 bu1 shuo1
你不说谁知道 ni3 bu1 shuo1 shui2 zhi1 dao5
不要 bu2 yao4
不三不四 bu4 san1 bu2 si4

As for consecutive third tones:
蒙古 meng2 gu3
奄奄一息 yan6 yan3 yi4 xi1
取水组 qu6 shui6 zu3
李组长 Li3 zu6 zhang3
懒懒散散 lan6 lan3 san6 san3 OR lan6 lan6 san6 san3
懵懵懂懂 meng6 meng6 dong6 dong3

As for “子”
燕子 yan4 zi5
孩子 hai2 zi5
虫子 chong2 zi5
孔子 kong3 zi3
韩非子 han2 fei1 zi3
五味子 wu3 wei4 zi3
妹子 mei4 zi5
小野妹子 xiao3 ye3 mei4 zi3

As for "个"
个性 ge4 xing4
个体 ge4 ti3
三个和尚 san1 ge5 he2 shang4
打个的 da3 ge5 di1
买个袜子 mai3 ge5 wa4 zi5

As for “头”
头发 tou2 fa5
头头是道 tou2 tou2 shi4 dao4
尽头 jin4 tou2
个头 ge4 tou2
甜头 tian2 tou5
木头 mu4 tou5
锄头 chu2 tou5
彩头 cai3 tou2

Even the same character in the same word can be pronounced differently depending on the speaker's emotion.
大家都要好好的(hao2 hao3 de5)。
你好好的(hao3 hao1 de5)学着点,别人怎么做的!
这就是个好好(hao3 hao3)先生。
