
Only 204 unique tokens (vocabulary size) in enwik8 (transformer-XL example) #163

chenwydj opened this issue Jul 7, 2023 · 3 comments

chenwydj commented Jul 7, 2023

Describe the bug
When running the transformer-XL example on enwik8, the log shows only 204 unique tokens (vocabulary size) in the enwik8 training set.

To Reproduce
Steps to reproduce the behavior:
bash ./scripts/run_enwik8_base.sh train

Expected behavior
I am not sure how many unique tokens (what vocabulary size) there should be for enwik8, but I suppose it should be much larger.

Logs
Run training...
Experiment dir : LM-TFM-enwik8/20230706-192048
Producing dataset enwik8...
building vocab with min_freq=0, max_size=None
final vocab size 204 from 204 unique tokens

> /home/username/fastmoe/examples/transformer-xl/train.py(194)<module>()
-> ntokens = len(corpus.vocab)
(Pdb) len(corpus.vocab)
204
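
For reference, the same number can be reproduced outside pdb. A minimal sketch, assuming the example keeps the upstream Transformer-XL get_lm_corpus helper in data_utils.py and that the data directory matches the one passed by run_enwik8_base.sh (both are assumptions here):

# minimal repro sketch; run from examples/transformer-xl
from data_utils import get_lm_corpus

corpus = get_lm_corpus('../data/enwik8', 'enwik8')  # data path is an assumption
print(len(corpus.vocab))  # reported as 204 in the log above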

Platform

  • Device: NVIDIA Quadro RTX 8000
  • OS: Ubuntu 18.04
  • CUDA version: 11.4
  • PyTorch version: 1.10.0


laekov commented Jul 7, 2023

@xptree any ideas on this?

chenwydj commented Jul 7, 2023

I just printed corpus.vocab.sym2idx, which looks wrong. The keys should be words:
OrderedDict([('32', 0), ('101', 1), ('116', 2), ('97', 3), ('105', 4), ('111', 5), ('110', 6), ('114', 7), ('115', 8), ('108', 9), ('104', 10), ('100', 11), ('99', 12), ('117', 13), ('93', 14), ('91', 15), ('109', 16), ('112', 17), ('103', 18), ('102', 19), ('121', 20), ('98', 21), ('39', 22), ('119', 23), ('46', 24), ('44', 25), ('118', 26), ('59', 27), ('38', 28), ('124', 29), ('47', 30), ('49', 31), ('107', 32), ('61', 33), ('48', 34), ('67', 35), ('65', 36), ('58', 37), ('45', 38), ('84', 39), ('83', 40), ('60', 41), ('62', 42), ('50', 43), ('113', 44), ('73', 45), ('57', 46), ('42', 47), ('120', 48), ('41', 49), ('40', 50), ('66', 51), ('77', 52), ('80', 53), ('69', 54), ('68', 55), ('53', 56), ('51', 57), ('72', 58), ('70', 59), ('56', 60), ('52', 61), ('71', 62), ('82', 63), ('54', 64), ('76', 65), ('55', 66), ('78', 67), ('87', 68), ('122', 69), ('125', 70), ('123', 71), ('79', 72), ('106', 73), ('85', 74), ('74', 75), ('75', 76), ('208', 77), ('95', 78), ('195', 79), ('35', 80), ('86', 81), ('215', 82), ('90', 83), ('34', 84), ('89', 85), ('209', 86), ('128', 87), ('224', 88), ('184', 89), ('131', 90), ('92', 91), ('227', 92), ('37', 93), ('33', 94), ('176', 95), ('169', 96), ('206', 97), ('226', 98), ('130', 99), ('63', 100), ('88', 101), ('81', 102), ('161', 103), ('153', 104), ('43', 105), ('129', 106), ('188', 107), ('179', 108), ('216', 109), ('164', 110), ('181', 111), ('189', 112), ('148', 113), ('190', 114), ('173', 115), ('187', 116), ('186', 117), ('229', 118), ('225', 119), ('167', 120), ('217', 121), ('177', 122), ('178', 123), ('168', 124), ('149', 125), ('185', 126), ('197', 127), ('144', 128), ('147', 129), ('196', 130), ('207', 131), ('194', 132), ('180', 133), ('156', 134), ('132', 135), ('170', 136), ('166', 137), ('136', 138), ('182', 139), ('191', 140), ('9', 141), ('230', 142), ('141', 143), ('160', 144), ('175', 145), ('36', 146), ('152', 147), ('140', 148), ('165', 149), ('145', 150), ('94', 151), ('133', 152), ('163', 153), ('183', 154), ('171', 155), ('157', 156), ('137', 157), ('174', 158), ('134', 159), ('135', 160), ('236', 161), ('151', 162), ('231', 163), ('155', 164), ('201', 165), ('158', 166), ('138', 167), ('143', 168), ('150', 169), ('162', 170), ('159', 171), ('139', 172), ('172', 173), ('154', 174), ('126', 175), ('232', 176), ('235', 177), ('146', 178), ('233', 179), ('228', 180), ('202', 181), ('203', 182), ('142', 183), ('214', 184), ('237', 185), ('204', 186), ('219', 187), ('234', 188), ('213', 189), ('96', 190), ('218', 191), ('199', 192), ('64', 193), ('210', 194), ('239', 195), ('198', 196), ('211', 197), ('205', 198), ('212', 199), ('240', 200), ('222', 201), ('220', 202), ('200', 203)])
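
Decoding those keys as decimal byte values maps them back to characters, which is consistent with the comment below that the vocab was built from train.txt rather than train.txt.raw. A minimal sketch over the first few entries copied from the dump above:

from collections import OrderedDict

# first entries copied from the dump above; keys are decimal byte values stored as strings
sym2idx = OrderedDict([('32', 0), ('101', 1), ('116', 2), ('97', 3)])

for sym, idx in sym2idx.items():
    # 32 -> ' ', 101 -> 'e', 116 -> 't', 97 -> 'a'
    print(idx, repr(chr(int(sym))))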

chenwydj commented Jul 7, 2023

@laekov @xptree The problem is that, for enwik8, vocabulary.py should use train.txt.raw instead of train.txt.
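
A quick sanity check is to compare the two files directly. A minimal sketch, where the data paths are assumptions and should be adjusted to wherever the preprocessing script placed the files:

# hedged sketch: paths are assumptions, adjust as needed
def file_stats(path):
    tokens, chars = set(), set()
    with open(path, 'r', encoding='utf-8') as f:
        for line in f:
            tokens.update(line.split())  # unique whitespace-separated symbols
            chars.update(line)           # unique characters
    return len(tokens), len(chars)

for fname in ('../data/enwik8/train.txt', '../data/enwik8/train.txt.raw'):
    print(fname, file_stats(fname))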
