
Only 204 unique tokens (vocabulary size) in enwik8 (transformer-XL example) #163

chenwydj opened this issue Jul 7, 2023 · 3 comments

chenwydj commented Jul 7, 2023

Describe the bug
When running the transformer-XL example on enwik8, the log shows only 204 unique tokens (vocabulary size) in the enwik8 training set.

To Reproduce
Steps to reproduce the behavior:
bash ./scripts/run_enwik8_base.sh train

Expected behavior
I am not sure how many unique tokens (what vocabulary size) there should be for enwik8, but I suppose it should be much larger.

Logs
Run training...
Experiment dir : LM-TFM-enwik8/20230706-192048
Producing dataset enwik8...
building vocab with min_freq=0, max_size=None
final vocab size 204 from 204 unique tokens

> /home/username/fastmoe/examples/transformer-xl/train.py(194)<module>()
-> ntokens = len(corpus.vocab)
(Pdb) len(corpus.vocab)
204
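
For reference, the same number can be reproduced outside pdb. A minimal sketch, assuming the example keeps the upstream Transformer-XL get_lm_corpus helper in data_utils.py and that the data directory matches the one passed by run_enwik8_base.sh (both are assumptions here):

# minimal repro sketch; run from examples/transformer-xl
from data_utils import get_lm_corpus

corpus = get_lm_corpus('../data/enwik8', 'enwik8')  # data path is an assumption
print(len(corpus.vocab))  # reported as 204 in the log above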

Platform

  • Device: NVIDIA Quadro RTX 8000
  • OS: Ubuntu 18.04
  • CUDA version: 11.4
  • PyTorch version: 1.10.0


laekov commented Jul 7, 2023

@xptree any ideas on this?

chenwydj commented Jul 7, 2023

I just printed corpus.vocab.sym2idx, which looks wrong. The keys should be words:
OrderedDict([('32', 0), ('101', 1), ('116', 2), ('97', 3), ('105', 4), ('111', 5), ('110', 6), ('114', 7), ('115', 8), ('108', 9), ('104', 10), ('100', 11), ('99', 12), ('117', 13), ('93', 14), ('91', 15), ('109', 16), ('112', 17), ('103', 18), ('102', 19), ('121', 20), ('98', 21), ('39', 22), ('119', 23), ('46', 24), ('44', 25), ('118', 26), ('59', 27), ('38', 28), ('124', 29), ('47', 30), ('49', 31), ('107', 32), ('61', 33), ('48', 34), ('67', 35), ('65', 36), ('58', 37), ('45', 38), ('84', 39), ('83', 40), ('60', 41), ('62', 42), ('50', 43), ('113', 44), ('73', 45), ('57', 46), ('42', 47), ('120', 48), ('41', 49), ('40', 50), ('66', 51), ('77', 52), ('80', 53), ('69', 54), ('68', 55), ('53', 56), ('51', 57), ('72', 58), ('70', 59), ('56', 60), ('52', 61), ('71', 62), ('82', 63), ('54', 64), ('76', 65), ('55', 66), ('78', 67), ('87', 68), ('122', 69), ('125', 70), ('123', 71), ('79', 72), ('106', 73), ('85', 74), ('74', 75), ('75', 76), ('208', 77), ('95', 78), ('195', 79), ('35', 80), ('86', 81), ('215', 82), ('90', 83), ('34', 84), ('89', 85), ('209', 86), ('128', 87), ('224', 88), ('184', 89), ('131', 90), ('92', 91), ('227', 92), ('37', 93), ('33', 94), ('176', 95), ('169', 96), ('206', 97), ('226', 98), ('130', 99), ('63', 100), ('88', 101), ('81', 102), ('161', 103), ('153', 104), ('43', 105), ('129', 106), ('188', 107), ('179', 108), ('216', 109), ('164', 110), ('181', 111), ('189', 112), ('148', 113), ('190', 114), ('173', 115), ('187', 116), ('186', 117), ('229', 118), ('225', 119), ('167', 120), ('217', 121), ('177', 122), ('178', 123), ('168', 124), ('149', 125), ('185', 126), ('197', 127), ('144', 128), ('147', 129), ('196', 130), ('207', 131), ('194', 132), ('180', 133), ('156', 134), ('132', 135), ('170', 136), ('166', 137), ('136', 138), ('182', 139), ('191', 140), ('9', 141), ('230', 142), ('141', 143), ('160', 144), ('175', 145), ('36', 146), ('152', 147), ('140', 148), ('165', 149), ('145', 150), ('94', 151), ('133', 152), ('163', 153), ('183', 154), ('171', 155), ('157', 156), ('137', 157), ('174', 158), ('134', 159), ('135', 160), ('236', 161), ('151', 162), ('231', 163), ('155', 164), ('201', 165), ('158', 166), ('138', 167), ('143', 168), ('150', 169), ('162', 170), ('159', 171), ('139', 172), ('172', 173), ('154', 174), ('126', 175), ('232', 176), ('235', 177), ('146', 178), ('233', 179), ('228', 180), ('202', 181), ('203', 182), ('142', 183), ('214', 184), ('237', 185), ('204', 186), ('219', 187), ('234', 188), ('213', 189), ('96', 190), ('218', 191), ('199', 192), ('64', 193), ('210', 194), ('239', 195), ('198', 196), ('211', 197), ('205', 198), ('212', 199), ('240', 200), ('222', 201), ('220', 202), ('200', 203)])
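
Decoding those keys as decimal byte values maps them back to characters, which is consistent with the comment below that the vocab was built from train.txt rather than train.txt.raw. A minimal sketch over the first few entries copied from the dump above:

from collections import OrderedDict

# first entries copied from the dump above; keys are decimal byte values stored as strings
sym2idx = OrderedDict([('32', 0), ('101', 1), ('116', 2), ('97', 3)])

for sym, idx in sym2idx.items():
    # 32 -> ' ', 101 -> 'e', 116 -> 't', 97 -> 'a'
    print(idx, repr(chr(int(sym))))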

chenwydj commented Jul 7, 2023

@laekov @xptree The problem is that, for enwik8, vocabulary.py should use train.txt.raw instead of train.txt.
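
A quick sanity check is to compare the two files directly. A minimal sketch, where the data paths are assumptions and should be adjusted to wherever the preprocessing script placed the files:

# hedged sketch: paths are assumptions, adjust as needed
def file_stats(path):
    tokens, chars = set(), set()
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            tokens.update(line.split())  # unique whitespace-separated symbols
            chars.update(line)           # unique characters
    return len(tokens), len(chars)

for fname in ('../data/enwik8/train.txt', '../data/enwik8/train.txt.raw'):
    print(fname, file_stats(fname))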
